Elasticsearch (sudachi): query returns no hits

Question

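I am evaluating Elasticsearch (full-text search) for internal use at my company.
During the evaluation I ran into behavior I cannot explain, and I would appreciate advice from anyone with Elasticsearch experience.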

# Symptom
I index a text and then run match_phrase with exactly the same text, but get no match.
Since the query uses the identical text, I expect a match.


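# Environment
The issue reproduces in the following environment.

- Elasticsearch 6.8.1 & Kibana (both hosted on Windows 10)
- sudachi (morphological analyzer)
- sudachi full dictionary (the issue does not reproduce with the small dictionary, nor with kuromoji)
- Kibana Discover & Kibana Dev Tools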


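# Steps to reproduce (assuming Kibana Dev Tools)

## 1. Install elasticsearch-sudachi
See the environment setup steps below for the installation procedure.
Install the full dictionary.

## 2. Configure the mapping
Define a plain sudachi analyzer and a field.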

``` json
PUT mytestindex
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index.query.default_field": [
      "sudachi_field"
    ],
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "sudachi_tokenizer",
          "type": "custom"
        },
        // used when testing with kuromoji
        "kuromoji_analyzer": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer"
        }
      },
      "tokenizer": {
        // With the full dictionary, none of mode: normal, search, extended produces a match. With the small dictionary, the results are satisfactory.
        "sudachi_tokenizer": {
          "type": "sudachi_tokenizer",
          "mode": "search",
          "resources_path": "sudachi_tokenizer" // by default, <elasticsearch_config>/sudachi_tokenizer/system_core.dic is referenced
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "sudachi_field": {
          "type": "text",
          "analyzer": "default",
          "search_analyzer": "default",
          "store": true,
          "fielddata": true,
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
```


## 2. Index the documents
Register the test cases.

```
POST _doc/_bulk
{ "index" : { "_index" : "mytestindex", "_type": "_doc", "_id" : "1" } }
{ "sudachi_field": "十一代目 市川海老蔵"}
{ "index" : { "_index" : "mytestindex", "_type": "_doc", "_id" : "2" } }
{ "sudachi_field": "十一代目市川海老蔵"}
{ "index" : { "_index" : "mytestindex", "_type": "_doc", "_id" : "3" } }
{ "sudachi_field": "11代目 市川海老蔵"}
{ "index" : { "_index" : "mytestindex", "_type": "_doc", "_id" : "4" } }
{ "sudachi_field": "11代目市川海老蔵"}
```

## 3. Run the query
Issue the query.

``` json
GET mytestindex/_search
{
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "should": [
                    //=================================================
                    // cases that do not hit
                    //=================================================
                    {
                      "match_phrase": {
                        "sudachi_field": "十一代目市川海老蔵"
                      }
                    },
                    {
                      "match_phrase": {
                        "sudachi_field": "十一代目 市川海老蔵"
                      }
                    },
                    {
                      "match_phrase": {
                        "sudachi_field": "11代目市川海老蔵"
                      }
                    },
                    {
                      "match_phrase": {
                        "sudachi_field": "11代目 市川海老蔵"
                      }
                    },
                    {
                      "match_phrase": {
                        "sudachi_field": "11代目市川"
                      }
                    },
                    {
                      "match_phrase": {
                        "sudachi_field": "十一代目 市川"
                      }
                    },
                    {
                      "match_phrase": {
                        "sudachi_field": "十一代目市川"
                      }
                    },
                    {
                      "match_phrase": {
                        "sudachi_field": "11代目 市川"
                      }
                    }
                    //=================================================
                    // cases that hit
                    //=================================================
                    //,{
                    //  "match_phrase": {
                    //    "sudachi_field": "十一代目"
                    //  }
                    //}
                    //,{
                    //  "match_phrase": {
                    //    "sudachi_field": "11代目"
                    //  }
                    //}
                    //,{
                    //  "match_phrase": {
                    //    "sudachi_field": "市川"
                    //  }
                    //}
                    //,{
                    //  "match_phrase": {
                    //    "sudachi_field": "海老蔵"
                    //  }
                    //}
                    //,{
                    //  "match_phrase": {
                    //    "sudachi_field": "市川海老蔵"
                    //  }
                    //}
                    //,{
                    //  "match_phrase": {
                    //    "sudachi_field": "市川 海老蔵"
                    //  }
                    //}
                  ],
                  "minimum_should_match": 1
                }
              }
            ]
          }
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}
```
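The nested `bool` wrapper is not needed to reproduce the symptom. Because the `should` clauses are combined with OR, zero hits for the query above means each `match_phrase` clause returns zero hits on its own. A minimal sketch, assuming the same index and documents as above:

``` json
GET mytestindex/_search
{
  "query": {
    "match_phrase": {
      "sudachi_field": "十一代目市川海老蔵"
    }
  }
}
```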

## 4. Check the result
``` json
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}
```

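## 5. Analysis
- When searching for [十一代目 市川] and similar, the indexed text is analyzed as [十一代目 市川海老蔵], so it is understandable that [十一代目 市川] does not match.
- When searching for [十一代目 市川海老蔵] and similar, I expect the query to produce the same analysis result as the indexed text, so I cannot understand why it does not match.
- With the sudachi small dictionary, the name is not tokenized as a full name but as separate family name and given name, so it matched.
- With kuromoji, the name is likewise not tokenized as a full name but as separate family name and given name, so it matched.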

Output of `GET mytestindex/_analyze`:
```
{
  "tokens" : [
    {
      "token" : "十一",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "代目",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "代",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "目",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "市川海老蔵",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 3,
      "positionLength" : 2
    },
    {
      "token" : "市川",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "海老蔵",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    }
  ]
}
```
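
The request body was not included in the post. The token list above is consistent with a request like the following sketch: the text is inferred from the offsets (0 to 9 cover the nine characters of 十一代目市川海老蔵), and `default` is the sudachi analyzer defined in the mapping. The second request assumes the kuromoji plugin is installed and uses the `kuromoji_analyzer` from the same index settings, so the two token lists can be compared side by side.

``` json
GET mytestindex/_analyze
{
  "analyzer": "default",
  "text": "十一代目市川海老蔵"
}

GET mytestindex/_analyze
{
  "analyzer": "kuromoji_analyzer",
  "text": "十一代目市川海老蔵"
}
```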

# Environment setup

## Installing Elasticsearch 6.8.1
- [https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.8.1.zip](https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.8.1.zip)

## Installing elasticsearch-sudachi
- [Reference: sudachi setup guide](https://colabmix.co.jp/tech-blog/sakura-vps-centos7-elasticsearch-6-2-sudachi-install/)
- [elasticsearch-sudachi git repository](https://github.com/WorksApplications/elasticsearch-sudachi)
- git clone command (for Elasticsearch 6.8.1): `git clone -b v6.8.1-1.3.1-SNAPSHOT https://github.com/WorksApplications/elasticsearch-sudachi.git`

## Placing the sudachi dictionary
Download the full dictionary and place it in the folder below.

- [https://github.com/WorksApplications/Sudachi](https://github.com/WorksApplications/Sudachi)
- Location (on Windows): C:\ProgramData\Elastic\Elasticsearch\config\sudachi_tokenizer\system_core.dic (by default, sudachi_tokenizer/system_core.dic under the Elasticsearch config directory is referenced.)

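# My take
I find the quality of sudachi's analysis to be good.
With the full dictionary, more proper nouns are recognized from the dictionary, which is an advantage for analytics. On the other hand,
both word-level tokens and proper-noun tokens are generated, which is a disadvantage for phrase matching.

Our requirement is that phrases match without omissions, so the small dictionary looks likely to satisfy it.

sudachi's selling point is its rich dictionary, but the dictionary also brings drawbacks for search, so the behavior seems easier to understand with the small dictionary.
With the small dictionary, I do not currently know what advantage sudachi has over kuromoji.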

Accepted Answer

Hello.

Does it work if you set mode to normal and index the data again?
I tried it locally on 7.6.0. There, after setting mode to normal and indexing the data,
the search returned hits.
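
Analyzer settings cannot be changed on an existing open index, so one way to try this is to create a second index with `"mode": "normal"` and copy the documents into it, which re-analyzes them. The following is a sketch in the 6.8 syntax used in the question. The index name `mytestindex_normal` is hypothetical, and the tokenizer options are copied from the question's mapping. Note that the result above was obtained on 7.6.0, while the question reports no match in normal mode on 6.8.1 with the full dictionary, so this is a check rather than a guaranteed fix.

``` json
PUT mytestindex_normal
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "sudachi_tokenizer"
        }
      },
      "tokenizer": {
        "sudachi_tokenizer": {
          "type": "sudachi_tokenizer",
          "mode": "normal",
          "resources_path": "sudachi_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "sudachi_field": { "type": "text", "analyzer": "default" }
      }
    }
  }
}

POST _reindex
{
  "source": { "index": "mytestindex" },
  "dest": { "index": "mytestindex_normal" }
}

GET mytestindex_normal/_search
{
  "query": {
    "match_phrase": { "sudachi_field": "十一代目市川海老蔵" }
  }
}
```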

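As for why it does not work in search mode: this is still a guess, but I think the problem arises with a phrase such as 「代目市川」 that combines words after a single word has produced multiple patterns (「代目」, 「代」, 「目」).
There is a validate query API that lets you inspect the generated query.
This is still only speculation, but I hope it helps.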

```

GET mytestindex/_validate/query?explain
{
  "query": {
    "match_phrase": {
      "sudachi_field": "十一代目市川海老蔵"
    }
  }
}

```
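
As a complementary check that is not part of the original answer, the term vectors API shows the positions recorded for a document at index time, which can be compared with the query printed by `_validate/query?explain`. A sketch in the 6.8 URL form, assuming document `2` (十一代目市川海老蔵) from the bulk request in the question:

``` json
GET mytestindex/_doc/2/_termvectors
{
  "fields": ["sudachi_field"],
  "positions": true,
  "offsets": true,
  "term_statistics": false
}
```

If the positions listed here differ from those the phrase query expects, that would point to the index-time and query-time token streams disagreeing.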