I want to get pairs of comments and document text from a Word docx file.

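I want to extract the document text and the comments from a Word .docx file. A .docx file is a zip archive containing multiple XML files, but I do not understand how it is structured or which attribute means what.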
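To make the archive structure concrete, here is a minimal sketch that lists the parts stored inside a .docx using only the standard `zipfile` module. The helper name `list_docx_parts` and the tiny stand-in archive are assumptions made for illustration; a real Word file contains more parts than the three shown.

```python
import io
import zipfile


def list_docx_parts(docx_bytes):
    """Return the sorted names of the parts stored inside a .docx archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        return sorted(z.namelist())


# Build a tiny stand-in archive in memory (hypothetical, not a valid Word file)
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('[Content_Types].xml', '<Types/>')
    z.writestr('word/document.xml', '<document/>')
    z.writestr('word/comments.xml', '<comments/>')

parts = list_docx_parts(buf.getvalue())
print(parts)
```

With a real file, the same helper could be called as `list_docx_parts(open(path, 'rb').read())` to see which XML parts are present before parsing any of them.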

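I am using Python for the XML parsing.
From my research, I found that after unzipping the docx, the comments are stored in comments.xml and the body text in document.xml. However, I cannot figure out how a comment in comments.xml corresponds to a location in the body text in document.xml. When I iterated over the elements with the getiterator() function of the etree XML parser, the text came out jumbled and I lost track of what was what.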
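For reference, a minimal sketch of how the pairing might be done. It assumes the document marks each commented span with `w:commentRangeStart` and `w:commentRangeEnd` elements whose `w:id` matches the `w:id` of a `w:comment` in comments.xml. It uses the standard library's `xml.etree.ElementTree` instead of lxml so it runs without extra packages. The function names and the sample XML strings are hypothetical, and the sketch only collects `w:t` text, ignoring tabs, line breaks, and paragraph boundaries.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace in Clark notation
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def comment_texts(comments_xml):
    """Map each comment id to the text of that comment."""
    root = ET.fromstring(comments_xml)
    result = {}
    for c in root.iter(W + 'comment'):
        result[c.get(W + 'id')] = ''.join(t.text or '' for t in c.iter(W + 't'))
    return result


def commented_ranges(document_xml):
    """Map each comment id to the body text between its range markers."""
    root = ET.fromstring(document_xml)
    open_ids = set()   # ids of comment ranges that are currently open
    ranges = {}
    for e in root.iter():  # iter() walks the tree in document order
        if e.tag == W + 'commentRangeStart':
            cid = e.get(W + 'id')
            open_ids.add(cid)
            ranges[cid] = ''
        elif e.tag == W + 'commentRangeEnd':
            open_ids.discard(e.get(W + 'id'))
        elif e.tag == W + 't':
            for cid in open_ids:
                ranges[cid] += e.text or ''
    return ranges


# Hypothetical sample data standing in for the two parts of a real file
NS = 'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'
SAMPLE_COMMENTS = (
    f'<w:comments {NS}>'
    '<w:comment w:id="0" w:author="A">'
    '<w:p><w:r><w:t>Check this</w:t></w:r></w:p>'
    '</w:comment>'
    '</w:comments>'
)
SAMPLE_DOCUMENT = (
    f'<w:document {NS}><w:body><w:p>'
    '<w:r><w:t>Hello </w:t></w:r>'
    '<w:commentRangeStart w:id="0"/>'
    '<w:r><w:t>world</w:t></w:r>'
    '<w:commentRangeEnd w:id="0"/>'
    '<w:r><w:commentReference w:id="0"/></w:r>'
    '</w:p></w:body></w:document>'
)

comments = comment_texts(SAMPLE_COMMENTS)
ranges = commented_ranges(SAMPLE_DOCUMENT)
# Pair (commented body text, comment text) by the shared id
pairs = {cid: (ranges.get(cid, ''), text) for cid, text in comments.items()}
print(pairs)
```

Tracking a set of open ids rather than a single current id lets overlapping comment ranges each collect their own text.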

If you know how to get the comments together with the body text they refer to, please let me know.

Below is an example of the code I have tried so far.

python
from lxml import etree
import zipfile

# lxml exposes namespaced attributes in Clark notation: '{namespace}name'
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
W14 = '{http://schemas.microsoft.com/office/word/2010/wordml}'

file_path = 'sample.docx'  # placeholder path; replace with the actual file
docxZip = zipfile.ZipFile(file_path)
commentsXML = docxZip.read('word/comments.xml')  # read comments.xml from the zip archive
et = etree.XML(commentsXML)

# Note: a single dict is reused, so each matching element overwrites earlier values
comment = {}
for e in et.iter():  # iter() is the current name for the deprecated getiterator()
    if e.attrib.get(W + 'id'):
        comment['comment_id'] = e.attrib.get(W + 'id')
    if e.attrib.get(W + 'author'):
        comment['author'] = e.attrib.get(W + 'author')
    if e.attrib.get(W + 'date'):
        comment['date'] = e.attrib.get(W + 'date')
    if e.attrib.get(W14 + 'paraId'):
        comment['para_id'] = e.attrib.get(W14 + 'paraId')
    if e.attrib.get(W14 + 'textId'):
        comment['text_id'] = e.attrib.get(W14 + 'textId')
    if e.attrib.get(W + 'rsidR'):
        comment['rs_id'] = e.attrib.get(W + 'rsidR')

documentsXML = docxZip.read('word/document.xml')  # read document.xml from the zip archive
et = etree.XML(documentsXML)

# Same attribute scan over the body; values are likewise overwritten per element
document = {}
for e in et.iter():
    if e.attrib.get(W + 'id'):
        document['document_id'] = e.attrib.get(W + 'id')
    if e.attrib.get(W + 'author'):
        document['author'] = e.attrib.get(W + 'author')
    if e.attrib.get(W + 'date'):
        document['date'] = e.attrib.get(W + 'date')
    if e.attrib.get(W14 + 'paraId'):
        document['para_id'] = e.attrib.get(W14 + 'paraId')
    if e.attrib.get(W14 + 'textId'):
        document['text_id'] = e.attrib.get(W14 + 'textId')
    if e.attrib.get(W + 'rsidR'):
        document['rs_id'] = e.attrib.get(W + 'rsidR')