GiNZA / spaCy (Python) NLP: retrieving the Misc data from Japanese morphological analysis

I would like to know how to retrieve the Misc field shown in the GiNZA/spaCy console output,
and what the data in it means.

Section 3 of the reference site below shows the result of a console run:
https://www.ogis-ri.co.jp/otc/hiroba/technical/similar-document-search/part4.html

The only item I cannot retrieve from spaCy is this one:
BunsetuBILabel=I|BunsetuPositionType=SEM_HEAD|SpaceAfter=No|NP_I

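On the page describing the CoNLL-U format (below), this item seems to be called MISC, Feature, and so on:
https://universaldependencies.org/format.html#syntactic-annotation

However, trying to read it in spaCy as token.misc or token.feature raises an error.
(I could not even find out what name it is exposed under.)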

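As for what each item means, I assume that
BunsetuBILabel and NP_I
are related to IOB tags, and that
BunsetuPositionType
has something to do with predicates or noun phrases,
but I have not found a source that confirms this.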

I went through the official pages and many explanatory articles,
but none of them actually retrieve the Misc data.

Any advice or pointers to reference sites would be appreciated.


1 answer

Best answer

It is probably quicker to look at the source of the ginza command.

https://github.com/megagonlabs/ginza/blob/v3.1.1/ginza/command_line.py#L200
https://github.com/megagonlabs/ginza/blob/v3.1.1/ginza/command_line.py#L215

https://github.com/megagonlabs/ginza/blob/v3.1.1/ginza/command_line.py#L12
My guess is that the _ attribute is a place where taggers and similar components can store information for their own use.

Searching for something like "spacy _ attribute" turned up
https://spacy.io/usage/linguistic-features
Searching that page for _ leads to this section:
https://spacy.io/usage/linguistic-features#retokenization-extensions

It says:

If you’ve registered custom extension attributes, you can overwrite them during tokenization by providing a dictionary of attribute names mapped to new values as the "_" key in the attrs. For merging, you need to provide one dictionary of attributes for the resulting merged token. For splitting, you need to provide a list of dictionaries with custom attributes, one per split subtoken.
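As a rough illustration of that mechanism (this is not GiNZA's actual code), a custom extension attribute can be registered and then read back through token._. Here demo_bi_label is a made-up attribute name, and the blank English pipeline is used only so that no model needs to be downloaded.

```python
import spacy
from spacy.tokens import Token

# "demo_bi_label" is a hypothetical name for illustration only;
# a library such as GiNZA registers its own attribute names.
# force=True avoids an error if the extension is already registered.
Token.set_extension("demo_bi_label", default=None, force=True)

# Blank pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")
doc = nlp("spaCy is a library")

# Custom attributes are written and read through the "_" namespace.
doc[0]._.demo_bi_label = "B"

labels = [(token.text, token._.demo_bi_label) for token in doc]
print(labels)
print(Token.has_extension("demo_bi_label"))
```

Tokens that were never assigned a value return the default given at registration time, and accessing a name that was never registered raises an error, which would be consistent with token.misc failing in the question.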

The conllu_token_line function is what assembles the output of the ginza command, so I think it is a good place to start.

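```python
import spacy

# Assumes GiNZA and the ja_ginza model are installed.
nlp = spacy.load('ja_ginza')

doc = nlp('spaCyはオープンソースの自然言語処理ライブラリです。')
# Custom extension attributes are accessed through token._
attr = doc[0]._
print(attr.bunsetu_bi_label, attr.bunsetu_position_type)
```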
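If you only need to pick the values out of a line that the ginza command has already printed, the Misc string itself can be split by hand. The sketch below assumes the field is a |-separated list in which most items are Key=Value pairs and a few (such as NP_I) are bare labels; parse_misc is a helper name I made up, not something provided by spaCy or GiNZA.

```python
def parse_misc(misc: str) -> dict:
    """Split a CoNLL-U style MISC string into a dict.

    Items without "=" (bare labels) are stored with a value of None.
    """
    if misc == "_":  # "_" marks an empty field in CoNLL-U
        return {}
    result = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else None
    return result


misc = "BunsetuBILabel=I|BunsetuPositionType=SEM_HEAD|SpaceAfter=No|NP_I"
parsed = parse_misc(misc)
print(parsed)
```

This only reshapes the text output; it does not tell you what each label means, so the ginza source linked above is still the place to check that.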