How to use NLTK's BLEU score

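Overview

I want to compute the similarity between two sentences by scoring them with BLEU.
The code runs without errors, but the result is not what I expected.

Actual output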

5.238101011110965e-78

Relevant source code

python
from nltk import word_tokenize
from nltk import bleu_score

# Reference sentence
references = 'I have a pen and apple'
ref = [word_tokenize(references)]
# Hypothesis sentence
hypothesis = 'I have a pineapple'
hyp = word_tokenize(hypothesis)
# Compute the score
bleuscore = bleu_score.sentence_bleu(ref, hyp)
# Print the result
print(bleuscore)

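Problem

The value is extremely small, so I suspect I am doing something wrong,
but I could not work out the cause.
To begin with, am I right to read this output as 5.238(...) × 10^-78?
Any guidance would be appreciated.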

Supplement: how the NLTK environment was set up

After installing nltk with pip, I ran the following:

python
import nltk
nltk.download('punkt')


1 answer

Best answer

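According to the official NLTK documentation, bleu_score.sentence_bleu by default scores n-grams of up to 4 words, with equal weights on 1- to 4-grams. If the hypothesis shares no 4-gram with the reference, that is, no run of 4 consecutive matching words, the score evaluates to 0. That is what is happening in your case: your reading of the output as 5.238... × 10^-78 is correct, and a value that small is effectively 0. A warning to this effect should also have been printed.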

https://www.nltk.org/api/nltk.translate.html#nltk.translate.bleu_score.sentence_bleu
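To see why the 4-gram count is the problem, the n-gram overlaps can be counted by hand. The following is a minimal sketch, not NLTK's own code: the helper names `ngram_counts` and `modified_precision` are made up for illustration, and plain `str.split()` stands in for `word_tokenize` (an assumption that happens to give the same tokens for these two sentences).

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, hypothesis, n):
    """Return (matched, total) n-gram counts, with matches clipped by the reference."""
    hyp_counts = ngram_counts(hypothesis, n)
    ref_counts = ngram_counts(reference, n)
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched, sum(hyp_counts.values())


# Whitespace splitting is an assumption standing in for word_tokenize.
reference = 'I have a pen and apple'.split()
hypothesis = 'I have a pineapple'.split()

precisions = {n: modified_precision(reference, hypothesis, n) for n in range(1, 5)}
for n, (matched, total) in precisions.items():
    print(f'{n}-gram: {matched}/{total}')
# 1-gram: 3/4
# 2-gram: 2/3
# 3-gram: 1/2
# 4-gram: 0/1
```

The only 4-gram in the hypothesis is "I have a pineapple", which does not occur in the reference, so the 4-gram precision is 0 and it drags the whole score down to 0.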

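In fact, in fix 1 below, where the two sentences are changed so that they share a run of 4 consecutive words, a normal value is printed.

According to the official page above, there are two ways to change this default behavior.

Change the weights: fix 2 below uses weights that only go up to 3 consecutive words (1- to 3-grams).
Use SmoothingFunction (which smooths the n-gram precisions so that an order with no matches does not force the score to 0): this is fix 3 below.

It is also a good idea to round the result to about three decimal places with round, so that a score of 0 is actually printed as 0.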
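The arithmetic behind the two cases can be sketched with the standard library alone. This follows the textbook BLEU formula (brevity penalty times a weighted geometric mean of the n-gram precisions), not NLTK's internals, so a zero precision here gives exactly 0 rather than a tiny float. The precisions are counted by hand for the sentences in the question, and the `bleu` helper is a hypothetical name used only for this example.

```python
import math

# Hand-counted modified precisions for hypothesis 'I have a pineapple'
# against reference 'I have a pen and apple': 3/4, 2/3, 1/2, 0/1.
precisions = [3 / 4, 2 / 3, 1 / 2, 0 / 1]
hyp_len, ref_len = 4, 6

# Brevity penalty: applied because the hypothesis is shorter than the reference.
brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)


def bleu(precisions, weights):
    """Textbook BLEU: 0 if any precision is 0, else BP * weighted geometric mean."""
    if any(p == 0 for p in precisions):
        return 0.0
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty * math.exp(log_mean)


score_4gram = bleu(precisions, [0.25] * 4)       # default: 1- to 4-grams
score_3gram = bleu(precisions[:3], [1 / 3] * 3)  # 1- to 3-grams only

print(round(score_4gram, 3))  # 0.0
print(round(score_3gram, 3))  # 0.382
```

Dropping the 4-gram term removes the only zero precision, which is why changing the weights gives a usable score.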

Python
from nltk import word_tokenize
from nltk import bleu_score

# Original code
references = 'I have a pen and apple'
ref = [word_tokenize(references)]
hypothesis = 'I have a pineapple'
hyp = word_tokenize(hypothesis)
bleuscore = bleu_score.sentence_bleu(ref, hyp)
print(bleuscore)
# /bleu_score.py:516: UserWarning:
# The hypothesis contains 0 counts of 4-gram overlaps.
# Therefore the BLEU score evaluates to 0, independently of
# how many N-gram overlaps of lower order it contains.
# Consider using lower n-gram order or use SmoothingFunction()
#   warnings.warn(_msg)
# 5.238101011110965e-78

# Fix 1: when a run of 4 consecutive words matches, a normal score is returned
references = 'I have a pen and apple'
ref = [word_tokenize(references)]
hypothesis = 'I have a pen and pineapple'
hyp = word_tokenize(hypothesis)
bleuscore = bleu_score.sentence_bleu(ref, hyp)
print(bleuscore)
# 0.7598356856515925

# Fix 2: use a lower n-gram order (equal weights on 1- to 3-grams)
references = 'I have a pen and apple'
ref = [word_tokenize(references)]
hypothesis = 'I have a pineapple'
hyp = word_tokenize(hypothesis)
bleuscore = bleu_score.sentence_bleu(ref, hyp, weights=(1/3, 1/3, 1/3))
print(bleuscore)
# 0.3820903727892856

# Fix 3: use SmoothingFunction()
references = 'I have a pen and apple'
ref = [word_tokenize(references)]
hypothesis = 'I have a pineapple'
hyp = word_tokenize(hypothesis)
smoothing = bleu_score.SmoothingFunction().method1
bleuscore = bleu_score.sentence_bleu(ref, hyp, smoothing_function=smoothing)
print(bleuscore)
# 0.24117803988461298
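For reference, the value printed by the SmoothingFunction variant can be reproduced by hand. This sketch rests on an assumption about what method1 does: it is taken to replace a zero match count with a small epsilon, with 0.1 assumed as the default. The match counts are hand-counted for the sentences in the question; none of this is NLTK's own code.

```python
import math

# Hand-counted values for 'I have a pineapple' vs 'I have a pen and apple'.
matches = [3, 2, 1, 0]  # matched 1- to 4-grams
totals = [4, 3, 2, 1]   # n-grams present in the hypothesis
hyp_len, ref_len = 4, 6
epsilon = 0.1           # assumed default epsilon of SmoothingFunction

# Assumed method1 behavior: a zero match count is replaced by epsilon.
smoothed = [(m if m != 0 else epsilon) / t for m, t in zip(matches, totals)]

brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
geo_mean = math.exp(sum(0.25 * math.log(p) for p in smoothed))
score = brevity_penalty * geo_mean

print(round(score, 3))  # 0.241
```

Under these assumptions the 4-gram precision becomes 0.1 instead of 0, so the geometric mean stays positive and the score agrees with the 0.2411... printed above.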