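I am currently researching SQL injection and want to build a dictionary by running morphological analysis in Python,
but I cannot work out the cause of an encoding-related error and have been unable to fix it.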
```python
# Imports were not shown in the original post; these are inferred from the names used below.
import csv

import MeCab
from sklearn.model_selection import train_test_split

mecab = MeCab.Tagger('mecabrc')


def tokenize(text):
    node = mecab.parseToNode(text)
    while node:
        if node.feature.split(',')[0] == '名詞':
            yield node.surface.lower()
        node = node.next


def get_words(contents):
    ret = []
    # The error occurs inside this for loop
    for content in contents:
        ret.append(get_words_main(content))
    return ret


def get_words_main(content):
    return [token for token in tokenize(content)]


if __name__ == '__main__':
    column = []
    num = []
    with open('word.csv', encoding="utf8", errors='ignore') as f:
        reader = csv.reader(f)

        for row in reader:
            column.append(row[0])
            num.append(row[1])

    data_train_s, data_test_s, label_train_s, label_test_s = train_test_split(column, num, test_size=0.3)
    # This is where it fails
    words = get_words(data_train_s)
```
This is the input CSV file, which is saved as UTF-8.
```csv
onLoading {Function},0
onSuccess {Function},0
onAfterRender {Function},0
print(len([s for s in l if s.endswith('e')])),0
select* from database where id = 1;,0
Graph minors IV Widths of trees and well quasi ordering,0
"1' UNION ALL SELECT CONCAT(0x716b6b6a71,(CASE WHEN (EXISTS(SELECT random FROM performance_schema.events_waits_summary_by_instance)) THEN 1 ELSE 0 END),0x716a717a71),NULL-- hYEx",1
"1' UNION ALL SELECT CONCAT(0x716b6b6a71,(CASE WHEN (EXISTS(SELECT aTEC FROM zsTX)) THEN 1 ELSE 0 END),0x716a717a71),NULL-- utMa",1
"1' AND (SELECT 2551 FROM(SELECT COUNT(*),CONCAT(0x716b6b6a71,(SELECT REPEAT(0x38,128)),0x716a717a71,FLOOR(RAND(0)*2))x FROM INFORMATION_SCHEMA.PLUGINS GROUP BY x)a) AND 'uDRn'='uDRn",1
"1' UNION ALL SELECT CONCAT(0x716b6b6a71,(CASE WHEN (EXISTS(SELECT creditcard_id FROM performance_schema.events_waits_summary_by_instance)) THEN 1 ELSE 0 END),0x716a717a71),NULL-- mwJp",1
```
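As a quick sanity check on the input, the following sketch parses a few rows from the CSV with `csv.reader` and reports whether each payload is pure ASCII. The `SAMPLE` string and the helper name `load_rows` are illustrative and are not part of the original script.

```python
import csv
import io

# A few rows copied from the CSV in the question (illustrative subset).
SAMPLE = '''onLoading {Function},0
select* from database where id = 1;,0
"1' UNION ALL SELECT CONCAT(0x716b6b6a71,(CASE WHEN (EXISTS(SELECT aTEC FROM zsTX)) THEN 1 ELSE 0 END),0x716a717a71),NULL-- utMa",1
'''


def load_rows(text):
    """Parse CSV text into (payload, label) pairs, skipping empty lines."""
    reader = csv.reader(io.StringIO(text))
    return [(row[0], row[1]) for row in reader if row]


rows = load_rows(SAMPLE)
for payload, label in rows:
    # str.isascii() requires Python 3.7 or later.
    print(label, payload.isascii(), payload[:40])
```

If every payload is ASCII, as it is for these rows, the encoding of the CSV file itself is unlikely to be the direct cause of the error.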
When I run the file, an error like the following occurs roughly once every few runs.
Example 1) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 0: invalid continuation byte
Example 2) UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 0: invalid start byte
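To make the two messages concrete, the sketch below reproduces them with hand-made byte strings. The byte values are chosen only to trigger the same messages; they are not the actual bytes involved in the failing run, and `decode_reason` is a hypothetical helper.

```python
def decode_reason(raw):
    """Return the UnicodeDecodeError reason for raw bytes, or None if they are valid UTF-8."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


# 0xc9 announces a two-byte sequence, but 'A' is not a continuation byte.
print(decode_reason(b'\xc9A'))
# 0x90 is a continuation byte, so it cannot start a character.
print(decode_reason(b'\x90'))
# Plain ASCII is always valid UTF-8.
print(decode_reason(b"select* from database where id = 1;"))
```

Both messages mean that the bytes being decoded are not valid UTF-8. Since the script opens the CSV with `errors='ignore'`, undecodable bytes in the file would be dropped rather than raised, so the file read itself should not produce this exception.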
What I would like answered:
1 Where the UnicodeDecodeError originates
2 How to deal with this encoding problem
Answers to follow-up questions
1 Contents of data_train_s when the error occurs
["print(len([s for s in l if s.endswith('e')]))", 'onSuccess {Function}', "1' UNION ALL SELECT CONCAT(0x716b6b6a71,(CASE WHEN (EXISTS(SELECT creditcard_id FROM performance_schema.events_waits_summary_by_instance)) THEN 1 ELSE 0 END),0x716a717a71),NULL-- mwJp", 'onAfterRender {Function}', "1' UNION ALL SELECT CONCAT(0x716b6b6a71,(CASE WHEN (EXISTS(SELECT aTEC FROM zsTX)) THEN 1 ELSE 0 END),0x716a717a71),NULL-- utMa", 'select* from database where id = 1;']
2 Contents of the traceback
Traceback (most recent call last):
  File "svm.py", line 59, in <module>
    words = get_words(data_train_s)
  File "svm.py", line 30, in get_words
    ret.append(get_words_main(content))
  File "svm.py", line 35, in get_words_main
    return [token for token in tokenize(content)]
  File "svm.py", line 35, in <listcomp>
    return [token for token in tokenize(content)]
  File "svm.py", line 22, in tokenize
    yield node.surface.lower()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 0: invalid start byte
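
The traceback shows the exception being raised when `node.surface` is read, not when the CSV is opened. A commonly reported cause with some versions of the MeCab Python binding is that the buffer behind `parseToNode` is no longer valid by the time `surface` is accessed; whether that applies here is an assumption. The sketch below avoids `node.surface` altogether by splitting the text returned by `Tagger.parse`, assuming MeCab's default output format (one `surface<TAB>feature,...` line per token, terminated by `EOS`). The name `tokenize_via_parse` is hypothetical, and the tagger is passed in as an argument so that the block itself does not import MeCab.

```python
def tokenize_via_parse(tagger, text):
    """Yield lower-cased noun surfaces using tagger.parse() instead of parseToNode().

    Assumes the default MeCab output format: "surface<TAB>feature,..." per line,
    followed by a final "EOS" line.
    """
    for line in tagger.parse(text).splitlines():
        # Skip the end-of-sentence marker and any line without a tab separator.
        if line == 'EOS' or '\t' not in line:
            continue
        surface, feature = line.split('\t', 1)
        # The first feature field is the part of speech; '名詞' means noun.
        if feature.split(',')[0] == '名詞':
            yield surface.lower()


# Possible usage with the script in the question (requires MeCab to be installed):
#   import MeCab
#   mecab = MeCab.Tagger('mecabrc')
#   words = [list(tokenize_via_parse(mecab, content)) for content in data_train_s]
```

Because `parse` returns a complete Python string in one call, no node objects are kept around after the call returns.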