ValueError: Tokenization produced tokens that do not belong in string!

Background / what I want to achieve

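I am trying to build a system in Python that determines, from the text of a survey question, whether the question is single-answer or multi-answer. I built the model with BERT, and while implementing local explanations with LIME I ran into the error message below. I would like to at least understand what the error is telling me, so I would appreciate advice from anyone who understands it.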

Problem / error message

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-48-12e0afbf92ae> in <module>
      1 #Limeの適応
      2 sample_text = 'あなたの年齢をお答えください。'
----> 3 exp_result = explainer.explain_instance(
      4     sample_text, predict_proba, num_features=5, labels=[target_categories.index('SA')])
      5 #Limeの結果確認

~\Anaconda3\lib\site-packages\lime\lime_text.py in explain_instance(self, text_instance, classifier_fn, labels, top_labels, num_features, num_samples, distance_metric, model_regressor)
    407             text_instance, bow=self.bow, mask_string=self.mask_string)
    408                           if self.char_level else
--> 409                           IndexedString(text_instance, bow=self.bow,
    410                                         split_expression=self.split_expression,
    411                                         mask_string=self.mask_string))

~\Anaconda3\lib\site-packages\lime\lime_text.py in __init__(self, raw_string, split_expression, bow, mask_string)
    102         if callable(split_expression):
    103             tokens = split_expression(self.raw)
--> 104             self.as_list = self._segment_with_tokens(self.raw, tokens)
    105             tokens = set(tokens)
    106 

~\Anaconda3\lib\site-packages\lime\lime_text.py in _segment_with_tokens(text, tokens)
    192                 text_ptr += 1
    193                 if text_ptr >= len(text):
--> 194                     raise ValueError("Tokenization produced tokens that do not belong in string!")
    195             text_ptr += len(token)
    196             if inter_token_string:

ValueError: Tokenization produced tokens that do not belong in string!

Relevant source code

python
# Function that computes class probabilities for LIME
def predict_proba(texts, model=model, tokenizer=tokenizer, max_length=max_length, batch_size=batch_size):
    tokenized = tokenizer.batch_encode_plus(
        texts, padding=True, truncation=True, max_length=max_length, return_tensors='pt')
    ids, masks = [tokenized[key] for key in ['input_ids', 'attention_mask']]
    n_batch = ids.shape[0] // batch_size + 1
    list_prob = []
    for i_batch in range(n_batch):
        idx_from = i_batch * batch_size
        idx_to = (i_batch + 1) * batch_size

        ids_batch = ids[idx_from:idx_to].to(device)
        mask_batch = masks[idx_from:idx_to].to(device)

        logits = model(ids_batch, attention_mask=mask_batch)['logits']
        prob = F.softmax(logits, dim=1).cpu().detach().numpy()
        list_prob.append(prob)

    return np.vstack(list_prob)

# Set up LimeTextExplainer
word_tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_name, do_subword_tokenise=False)
explainer = LimeTextExplainer(
    class_names=target_categories,
    split_expression=word_tokenizer.tokenize,
    mask_string=tokenizer.pad_token,
    random_state=0)

# Apply LIME
sample_text = 'あなたの年齢をお答えください。'
exp_result = explainer.explain_instance(
    sample_text, predict_proba, num_features=5, labels=[target_categories.index('SA')])

# Check the LIME result
exp_result.show_in_notebook(text=True)

lime_text.py
    def _segment_with_tokens(text, tokens):
        """Segment a string around the tokens created by a passed-in tokenizer"""
        list_form = []
        text_ptr = 0
        for token in tokens:
            inter_token_string = []
            while not text[text_ptr:].startswith(token):
                inter_token_string.append(text[text_ptr])
                text_ptr += 1
                if text_ptr >= len(text):
                    raise ValueError("Tokenization produced tokens that do not belong in string!")
            text_ptr += len(token)
            if inter_token_string:
                list_form.append(''.join(inter_token_string))
            list_form.append(token)
        if text_ptr < len(text):
            list_form.append(text[text_ptr:])
        return list_form

    def __get_idxs(self, words):
        """Returns indexes to appropriate words."""
        if self.bow:
            return list(itertools.chain.from_iterable(
                [self.positions[z] for z in words]))
        else:
            return self.positions[words]
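
To see what this function checks, it can be exercised on its own. The sketch below is a standalone copy of the segmentation logic from the excerpt (renamed `segment_with_tokens`), run on two hand-written token lists. Both lists are hypothetical illustrations, not the actual output of any tokenizer: the first contains only substrings of the text, the second contains a `##`-prefixed piece of the kind a subword tokenizer might emit.

```python
def segment_with_tokens(text, tokens):
    """Standalone copy of the segmentation logic from the lime_text.py excerpt."""
    list_form = []
    text_ptr = 0
    for token in tokens:
        inter_token_string = []
        # Advance through the text until the current token is found.
        while not text[text_ptr:].startswith(token):
            inter_token_string.append(text[text_ptr])
            text_ptr += 1
            if text_ptr >= len(text):
                # The token never appears in the remaining text.
                raise ValueError("Tokenization produced tokens that do not belong in string!")
        text_ptr += len(token)
        if inter_token_string:
            list_form.append(''.join(inter_token_string))
        list_form.append(token)
    if text_ptr < len(text):
        list_form.append(text[text_ptr:])
    return list_form


text = 'あなたの年齢をお答えください。'

# Hypothetical word-level tokens: each one is a substring of the text, in order.
word_tokens = ['あなた', 'の', '年齢', 'を', 'お', '答え', 'ください', '。']
print(segment_with_tokens(text, word_tokens))

# Hypothetical subword tokens: '##齢' does not occur anywhere in the text.
subword_tokens = ['あなた', 'の', '年', '##齢']
try:
    segment_with_tokens(text, subword_tokens)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)
```

In other words, the error is not about the number of tokens. It is raised as soon as one token cannot be located in the remaining part of the original string.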

What I tried

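I am a Python beginner and have no idea what the error means, so I am at a loss as to what to try.
I also looked at lime_text.py, which raises the error, but could not make much sense of it.
It seems to be an error raised when there are more tokens than the question text can account for(?), so I tried to debug it by inserting print(text) into lime_text.py, but for some reason nothing was printed.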
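
One way to investigate without editing lime_text.py is to check the tokens against the text directly in the notebook. (Edits to an installed module generally take effect only after the kernel is restarted or the module is reloaded, which may be why the print had no visible effect.) The helper below, `find_unmatched_tokens`, is a hypothetical name introduced for illustration, and the token list is hand-written. In the real notebook the tokens would come from calling `word_tokenizer.tokenize(sample_text)`.

```python
def find_unmatched_tokens(text, tokens):
    """Return the tokens that cannot be located in text, scanning left to right."""
    unmatched = []
    ptr = 0
    for token in tokens:
        pos = text.find(token, ptr)
        if pos == -1:
            # This token is what would make LIME raise the ValueError.
            unmatched.append(token)
        else:
            ptr = pos + len(token)
    return unmatched


sample_text = 'あなたの年齢をお答えください。'

# Hand-written example tokens (an assumption, not real tokenizer output).
example_tokens = ['あなた', 'の', '年', '##齢', 'を']

unmatched = find_unmatched_tokens(sample_text, example_tokens)
print(unmatched)  # ['##齢']
```

Any token reported here is one that the function passed as `split_expression` produced but that does not appear verbatim in the input string.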

Additional information

I am following the book 「XAI（説明可能なAI）」 (XAI: Explainable AI).

jbpb0

2021/10/23 10:11 (edited)

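The first half of the code in the question is on pages 142-144 of the book, but I cannot find the second half. Is that code somewhere in the book, or did you write it yourself? [Addendum] I figured it out from the error in the question: it is from line 184 onward of https://github.com/marcotcr/lime/blob/master/lime/lime_text.py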

jbpb0

2021/10/23 13:11

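I ran it exactly as in the book and got no error. I downloaded and extracted the archive from https://www2.ric.co.jp/cgi-bin/download/book_1292.cgi , uploaded "ch08-01_LIME(text)_Integrated-Gradients_Attention.ipynb" from it to Google Colab, and ran it cell by cell from the top. It ran to the end without errors. One note: after the first run of the initial "!pip install..." cell, a message told me to restart the kernel, so I did that before continuing. I also tried it with sample_text = 'あなたの年齢をお答えください。' and still got no error.

When you run it, have you changed the code from the book anywhere other than the part that creates the training data? If you have not, there may be a problem somewhere in the training data you created, and that may be causing the error.

Also, do the versions of the Python modules you are using match the ones listed in the book? If any differ, I suggest matching all versions to the book and running it again, just in case. If that still does not help, I would suspect the training data.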
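
For the version check, a minimal sketch using the standard library is shown below. The package names in the list are an assumption based on the libraries mentioned in this thread. The versions required by the book are not reproduced here and need to be looked up in the book itself.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed package list; adjust it to the modules the book actually lists.
packages_to_check = ["lime", "transformers", "torch", "numpy"]

for name, version in installed_versions(packages_to_check).items():
    print(f"{name}: {version}")
```

Comparing this output with the versions printed in the book shows at a glance which modules differ.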

aburauri

2021/10/25 01:07

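Thank you for your reply. As you pointed out, my code differed from the book in one place: I had misspelled "tokenizer" as "tokeniser", and after fixing that the error no longer occurs. That was a great help. I am embarrassed to say that I still struggle to get used to Python errors. Sorry for the anticlimactic ending.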
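
The thread does not say which line held the misspelling. If it was the `do_subword_tokenise` keyword in the posted code (an assumption), one plausible mechanism is that a constructor accepting `**kwargs` silently ignores the unknown name, so subword splitting stays at its default. The toy class below is a hypothetical stand-in used only to show that Python behavior. It is not the transformers API.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer whose constructor accepts **kwargs."""

    def __init__(self, do_subword_tokenize=True, **kwargs):
        self.do_subword_tokenize = do_subword_tokenize
        # Unrecognized keyword arguments are stored instead of raising an error.
        self.unused_kwargs = kwargs


# Misspelled keyword: accepted without complaint, but the real option keeps its default.
misspelled = ToyTokenizer(do_subword_tokenise=False)
print(misspelled.do_subword_tokenize)  # True
print(misspelled.unused_kwargs)        # {'do_subword_tokenise': False}

# Correctly spelled keyword: the option is actually switched off.
correct = ToyTokenizer(do_subword_tokenize=False)
print(correct.do_subword_tokenize)     # False
```

Because no exception is raised at construction time, a typo like this tends to surface only later, as an unrelated-looking error further down the pipeline.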