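Writing the contents of an md file to Word using Python, pandoc, and a Lua filter

What I want to achieve

I want to write the text "サンプル文字" (sample text) from an md file into Word, but I cannot get it to come out underlined.

The program runs, the Word file is generated, and the text does appear in the Word file.

The problem / what I don't understand

In the RawInline function of the Lua filter I added a branch that emits underlined text through pandoc when an underline tag is found, but in the generated Word file the text is not underlined.

Relevant source code

Python (main.py)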
# ■■ Pandoc must be installed (https://github.com/jgm/pandoc/releases/tag/3.6.4)
import os, re
import subprocess
from docx import Document
import time

# Markdown clean-up
def md_edit(md_file_path, temp_file_path):
    with open(md_file_path, 'r', encoding='utf-8') as infile, open(temp_file_path, 'w', encoding='utf-8') as outfile:
        prev_line_flg = False
        for line in infile:
            # Remove spaces inside anchor names
            if '<a name="' in line and '</a>' in line:
                start_index = line.find('<a name="') + len('<a name="')
                end_index = line.find('">', start_index)
                name_content = line[start_index:end_index].replace(' ', '').replace('　', '')
                line = line[:start_index] + name_content + line[end_index:]

            # Remove spaces inside link targets, except for URLs
            if '](' in line and ')' in line and 'http' not in line:
                start_index = line.find('](') + len('](')
                end_index = line.find(')', start_index)
                link_content = line[start_index:end_index].replace(' ', '').replace('　', '')
                line = line[:start_index] + link_content + line[end_index:]

            # If the line starts with "<" followed by anything other than an ASCII letter or "/",
            # replace that "<" with "＜" and the next ">" on the same line with "＞"
            if re.match(r'^<[^a-zA-Z/]', line):
                line = line.replace('<', '＜', 1)
                line = line.replace('>', '＞', 1)

            # For lines starting with "#", insert a blank line unless the previous line was blank
            if line.startswith('#') and prev_line_flg:
                outfile.write('\n') # add a newline

            # For lines that contain only ">", append a half-width space
            if line.strip() == '>':
                line = '> \n'

            # Strip a trailing "<br>" or "<br/>" from the line
            if line.rstrip().endswith('<br>') and line.strip() != '<br>':
                line = line.rstrip()[:-4] + '\n'
            elif line.rstrip().endswith('<br/>') and line.strip() != '<br/>':
                line = line.rstrip()[:-5] + '\n'

            outfile.write(line) # write the line

            # Remember whether this line was only a newline
            if line == '\n':
                prev_line_flg = False
            else:
                prev_line_flg = True

# Word post-processing
def word_edit(docx_file_path):
    document = Document(docx_file_path)
    for paragraph in document.paragraphs:
        if paragraph.text.startswith('> '):
            paragraph.style = 'Quote'
            paragraph.text = paragraph.text[2:]

    document.save(docx_file_path)

def main():
    try:
        # ◆ List of Markdown files
        in_dir = "in"
        md_file_lists = list(filter(lambda f: f.endswith(".md"), os.listdir(in_dir)))

        # ◆ Custom template file located directly in the working directory
        # Open the Styles pane in Word and create a template with the styles you want in the output
        template_file = "template.docx"

        # ◆ Output directory
        out_dir = "out"

        # ◆ Create the output directory if it does not exist
        if not os.path.exists(out_dir):
            os.makedirs(out_dir)

        # ◆ Run the Pandoc command
        if md_file_lists:
            for i, md_file in enumerate(md_file_lists):
                input_file_path = os.path.join(in_dir, md_file) # input file path
                temp_file_path = os.path.join(in_dir, 'temp_' + md_file) # temporary file path
                output_file_path = os.path.join(out_dir, os.path.splitext(os.path.basename(md_file))[0] + ".docx") # output file path

                md_edit(input_file_path, temp_file_path) # clean up the Markdown file

                if template_file != "":
                    # ◇ With a template
                    cmd = [
                        "pandoc", temp_file_path,
                        "--reference-doc", template_file,
                        "--lua-filter=fix.lua",
                        "--wrap=preserve",
                        "-o", output_file_path
                    ]
                else:
                    # ◇ Without a template
                    cmd = [
                        "pandoc", temp_file_path,
                        "--lua-filter=fix.lua",
                        "--wrap=preserve",
                        "-o", output_file_path
                    ]

                result = subprocess.run(cmd, capture_output=True, text=True) # run Pandoc
                if result.returncode != 0:
                    raise Exception(f"Conversion failed: {result.stderr} >{md_file}")

                word_edit(output_file_path) # post-process the Word file

                os.remove(temp_file_path) # delete the temporary file

                if i == 0: print("[Converted files]")
                print(f"{i+1}: {md_file}")

            print("-- Conversion complete --")

        else:
            print("-- No MD files --")

    except Exception as err:
        print(err)


if __name__ == '__main__':
    main()

Lua filter (fix.lua)
local b_cnt = 0

-- Block handling
function RawBlock(el)
    -- Page-break handling
    if el.text == '<div style="page-break-before:always"></div>'  then
        return pandoc.RawBlock('openxml', '<w:p><w:r><w:br w:type="page"/></w:r></w:p>')
    else
        return el

    end
end

-- Inline handling
function RawInline(el)
    -- 2025/05/07 underline handling
    if el.text:match('<u>(.-)</u>') then
        local underlined_text = el.text:match('<u>(.-)</u>')
        return pandoc.RawInline('openxml', '<w:r><w:rPr><w:u w:val="single"/></w:rPr><w:t>' .. underlined_text .. '</w:t></w:r>')

    -- Line-break handling
    elseif el.text == '<br>' or el.text == '<br/>' then
        return pandoc.RawInline('openxml', '<w:br/>')

    -- Bookmark handling
    elseif el.text:match('<a name="(.-)">') then
        local b_name = el.text:match('<a name="(.-)">')
        b_cnt = b_cnt + 1
        return pandoc.RawInline('openxml', '<w:bookmarkStart w:id="' .. b_cnt .. '" w:name="' .. b_name .. '"/><w:bookmarkEnd w:id="' .. b_cnt .. '"/>')

    else
        return el

    end
end

-- Link handling
function Link(el)
    if el.target:match("^#") then
        b_cnt = b_cnt + 1
        local b_name = el.target:sub(2):gsub("%%20", " ")
        local content = pandoc.utils.stringify(el.content)
        return {
            pandoc.RawInline('openxml', '<w:bookmarkStart w:id="' .. b_cnt .. '" w:name="' .. b_name .. '"/>'),
            pandoc.RawInline('openxml', '<w:hyperlink w:anchor="' .. b_name .. '"><w:r><w:t>' .. content .. '</w:t></w:r></w:hyperlink>'),
            pandoc.RawInline('openxml', '<w:bookmarkEnd w:id="' .. b_cnt .. '"/>')
        }
    else
        return el
    end
end

sample.md
<br>アイウエオ</br>
<u>サンプル文字</u>

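What I tried / looked up

Searched teratail, Google, etc.
Modified the source code myself
Asked an acquaintance
Other

Details and results of the above

I searched the web for "pandoc lua filter" and also tried the pandoc.Underline method, but could not get it to run.
https://pandoc.org/lua-filters.html#type-inlines

Asking ChatGPT only produced the same code I had already written, and the result did not change.

Supplementary information

pandoc 3.6.4
Python 3.8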


1 answer

Best answer

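If you call print(el) inside function RawInline(el) you will see that el is passed '<u>' and '</u>' as separate elements.
So the pattern '<u>(.-)</u>' never matches.
I think the way to go is to insert the opening and closing tags when el is '<u>' and '</u>' respectively.
(I'm not familiar with docx, so I can't say whether this is correct, but with the code below sample.md came out underlined.)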

lua
-- Inline handling
function RawInline(el)
    -- Underline handling
    if el.text == '<u>'  then
        return pandoc.RawInline('openxml', '<w:r><w:rPr><w:u w:val="single"/><w:t>')
    elseif el.text == '</u>'  then
        return pandoc.RawInline('openxml', '</w:t></w:rPr></w:r>')

    -- Line-break handling
    -- (unchanged from here on, omitted)
end
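
To illustrate why the original pattern never fires, here is a minimal Python sketch. It is only a simulation and not pandoc's actual parser: `split_inline_html` is a hypothetical helper that mimics the behaviour described above (each tag arriving as its own element, with the text in between kept separate), and the regex is the Lua pattern '<u>(.-)</u>' rewritten in Python syntax.

```python
import re

# Hypothetical stand-in for what the filter receives: every inline HTML tag
# is its own token and the text between tags is a separate token.
TAG = re.compile(r'(<[^>]+>)')

def split_inline_html(line):
    """Split a line into tag tokens and text tokens (simulation only)."""
    return [tok for tok in TAG.split(line) if tok]

tokens = split_inline_html('<u>サンプル文字</u>')
print(tokens)  # ['<u>', 'サンプル文字', '</u>']

# The original filter pattern, translated from Lua's '<u>(.-)</u>'.
whole_span = re.compile(r'<u>(.*?)</u>')

# The pattern is tested against one token at a time, so it never matches.
print([bool(whole_span.search(tok)) for tok in tokens])  # [False, False, False]
```

Because no single token contains both the opening and the closing tag, a filter that looks at one element at a time has to react to '<u>' and '</u>' individually, as in the Lua code above.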

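Alternatively, since you already edit the original md on the Python side before passing it to pandoc, you could replace <u>～～～</u> with <span class="underline">～～～</span> inside Python's md_edit().
(The code below cannot handle the case where the opening and closing tags are on different lines.)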

python
def md_edit(md_file_path, temp_file_path):
    with open(md_file_path, 'r', encoding='utf-8') as infile, open(temp_file_path, 'w', encoding='utf-8') as outfile:
        prev_line_flg = False
        for line in infile:
            # (omitted)
            # Underline <u>...</u>
            line = re.sub('<u>(.*?)</u>', '<span class="underline">\\1</span>', line)

            # (omitted)
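
For the case noted as unsupported (tags split across lines), one possible approach is to run the substitution over the whole file content instead of line by line. The following is a sketch under assumptions: `convert_underline` is a hypothetical helper that is not part of the original script, and it assumes the file is small enough to read into memory at once. A tag pair that spans a blank line crosses a paragraph boundary, so the resulting span may not be converted as intended in that case.

```python
import re

# re.DOTALL lets '.' match newlines, so a <u>...</u> pair that is split
# over several lines is still found. The non-greedy '.*?' keeps each
# match limited to the nearest closing tag.
UNDERLINE = re.compile(r'<u>(.*?)</u>', re.DOTALL)

def convert_underline(text):
    """Replace every <u>...</u> pair in the text with an underline span."""
    return UNDERLINE.sub(r'<span class="underline">\1</span>', text)

# Tags on different lines of the same paragraph.
sample = '<u>サンプル\n文字</u>\n'
print(convert_underline(sample))
```

To use it, the text would be read with `infile.read()`, passed through `convert_underline`, and then fed to the existing per-line processing, for example via `splitlines(keepends=True)`.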