Writing out .txt files and setting their file names (Python)

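I'm a Python beginner.
I can't work out how to solve the assignment below and I'm at a loss...
Even if it isn't a complete solution, I'd be grateful for any hints that help me solve it.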

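[Overview]
① Split each chapter out into its own .txt file, named with a prescribed file name plus a sequential number
② In addition to step ①, reflect each chapter's title in the file name
Note: the source code must do this without hard-coding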

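[What I want to achieve, in detail]
① Split each chapter out into its own sequentially numbered .txt file
- A chapter runs from the line that begins with "CHAPTER...●● (chapter number)" to the last line of that chapter
Example: chapter 1 ends with "...sky."
Note: the final chapter runs up to "~THE END"

② In addition to step ①, reflect each chapter's title in the file name
Look up each chapter's title in the table of contents at the beginning of the file and reflect it in the file name.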

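[Remarks]
Each chapter begins with "CHAPTER...●●", and every chapter except the last ends with five blank lines, so the code I am currently editing is a work in progress that assumes splitting on those line breaks.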
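
For illustration only (this is not the work-in-progress code mentioned above): a minimal sketch of the blank-line idea, assuming the text uses "\n" line endings and that the blank lines contain no stray whitespace. The names `split_on_blank_runs`, `min_blank_lines`, and `sample` are hypothetical.

```python
import re


def split_on_blank_runs(text, min_blank_lines=5):
    """Split text wherever at least `min_blank_lines` empty lines occur in a row."""
    # N empty lines between two lines of text show up as N + 1 consecutive "\n".
    pattern = r'\n{%d,}' % (min_blank_lines + 1)
    # Drop pieces that are empty and trim leftover newlines at the edges.
    return [part.strip('\n') for part in re.split(pattern, text) if part.strip()]


# Tiny stand-in for the real file: two chapters separated by five empty lines.
sample = (
    "CHAPTER I\n\nFIRST TITLE\n\nSome text under a grey sky."
    + "\n" * 6
    + "CHAPTER II\n\nSECOND TITLE\n\nMore text.\n\nTHE END"
)

parts = split_on_blank_runs(sample)
for part in parts:
    print(part.splitlines()[0])  # first line of each piece
```

Ordinary paragraph breaks (one or two empty lines) are left alone because the pattern only matches longer runs. If the real file separates other sections with long blank runs as well, the pieces would still need to be checked against the "CHAPTER" marker.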

tiitoi

2018/10/16 05:08

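Is this some kind of standardized ebook format? It carries no structural information, so it is hard to handle programmatically. Each chapter boundary has a chapter heading such as CHAPTER III with the title below it, so about the only option is to split using that as the marker.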

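Deactivated user

2018/10/16 21:05 (edited)

Thank you for looking at this so quickly. As you point out, the source file is an ebook. I'm not sure this answers your question about structural information, but as far as I can tell from the original file, the only things that look like structure or a pattern are "CHAPTER" and the five blank lines at the end of each chapter, so I am looking for a way to split on those markers. I'm sorry for not explaining this well, but I'd appreciate any advice that works with the markers as they are.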


2 answers

Best answer

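The source of that text is Project Gutenberg, isn't it?

An epub version that keeps the structural information is also distributed, so I think it would be better to download the epub and parse it with ebooklib, a library for handling the epub format.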

import ebooklib
from ebooklib import epub  # the epub submodule must be imported explicitly
from bs4 import BeautifulSoup

book = epub.read_epub('test.epub')

for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        content = item.get_content().decode("utf-8")  # bytes -> str
        soup = BeautifulSoup(content, "html.parser")  # parse the XHTML with BeautifulSoup

        for h2_tag in soup.select('h2'):  # the h2 tags hold the chapter headings
            # extract the title
            title = h2_tag.text.replace('\n', ' ')
            print('Title: ', title)
            # extract the content
            next_tag = h2_tag
            while True:
                next_tag = next_tag.next_sibling
                # loop until the next h2 tag appears
                if next_tag is None or next_tag.name == 'h2':
                    break

                if next_tag.name == 'p':
                    print(next_tag.text)  # body text

Title:  CHAPTER I  JONATHAN HARKER’S JOURNAL
(Kept in shorthand.)
3 May. Bistritz.—Left Munich at 8:35 P. M., on 1st May, arriving at Vienna early next morning; should have arrived at 6:46, but train was an hour late. Buda-Pesth seems a wonderful place, from the glimpse which I got of it from the train and the little I could walk through the streets. I feared to go very far from the station, as we had arrived late and would start as near the correct time as possible. The impression I had was that we were leaving the West and entering the East; the most western of splendid bridges over the Danube, which is here of noble width and depth, took us among the traditions of Turkish rule.
(rest omitted)

To parse out exactly the information you want, please study the epub format and scraping.

Addendum

Sample code that splits the text into chapters

Read the file.

python
import itertools
import os
import re

# read file.
# -----------------------------------
with open('dracula.txt') as f:
    # strip empty lines.
    lines = [line for line in f.read().splitlines() if line]
print('number of lines: {}'.format(len(lines)))  # number of lines: 13414

Parse it.

python
# parse the book into chapters.
# -----------------------------------
chapters = []
itr = iter(lines)

# skip lines until the title 'DRACULA' is found.
for i in itertools.count(1):
    line = next(itr)
    if line == 'DRACULA':
        break
print('{} lines skipped'.format(i))
print(line)

# the first chapter number line is followed by its title.
next(itr)  # skip chapter no

no = 1  # chapter no
title = next(itr)  # chapter title
sentences = []  # lines of the current chapter

for i in itertools.count(i):
    line = next(itr)

    if 'THE END' in line:
        # end mark found.
        chapters.append({'no': no, 'title': title, 'sentences': sentences})
        break  # story ends.

    # check if line is CHAPTER <Roman numerals>.
    if re.match(r'CHAPTER [MDCLXVI]+', line):
        chapters.append({'no': no, 'title': title, 'sentences': sentences})
        no += 1
        title = next(itr)  # the chapter title follows the chapter number line.
        sentences = []
        continue

    sentences.append(line)

print('{} lines parsed'.format(i))  # 13036 lines parsed
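
The parser above counts chapters with `no += 1` and only uses the Roman numeral to detect a heading. As an optional cross-check, the numeral itself can be converted to an integer. This is a minimal sketch, assuming headings look like `CHAPTER XIV` (optionally followed by a period); `roman_to_int`, `chapter_number`, and `CHAPTER_RE` are hypothetical names, and malformed numerals are not validated.

```python
import re

ROMAN_VALUES = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}
# Assumed heading shape: "CHAPTER", one space, Roman numerals, optional period.
CHAPTER_RE = re.compile(r'CHAPTER ([MDCLXVI]+)\.?\s*$')


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XIV' to an integer (no validation)."""
    values = [ROMAN_VALUES[ch] for ch in numeral]
    total = 0
    for idx, value in enumerate(values):
        # A smaller value directly before a larger one is subtracted (IV, IX, ...).
        if idx + 1 < len(values) and value < values[idx + 1]:
            total -= value
        else:
            total += value
    return total


def chapter_number(line):
    """Return the chapter number for a heading line, or None for any other line."""
    match = CHAPTER_RE.match(line)
    return roman_to_int(match.group(1)) if match else None


print(chapter_number('CHAPTER XXVII'))
print(chapter_number('3 May. Bistritz.'))
```

Comparing `chapter_number(line)` with the running counter inside the loop would reveal a skipped or falsely detected heading.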

Write to files.

python
# write every chapter to files.
# -------------------------------------
output_dirpath = 'chapters'
os.makedirs(output_dirpath, exist_ok=True)

for chapter in chapters:
    filename = 'Dracula-Chapter-{no}_{title}.txt'.format(
        no=chapter['no'], title=chapter['title'].replace(' ', '_'))
    filepath = os.path.join(output_dirpath, filename)

    with open(filepath, 'w') as f:
        f.write("\n".join(chapter['sentences']))

Output

chapters
├── Dracula-Chapter-10__Letter,_Dr._Seward_to_Hon._Arthur_Holmwood._.txt
├── Dracula-Chapter-11__Lucy_Westenra's_Diary._.txt
├── Dracula-Chapter-12_DR._SEWARD'S_DIARY.txt
├── Dracula-Chapter-13_DR._SEWARD'S_DIARY--_continued_..txt
├── Dracula-Chapter-14_MINA_HARKER'S_JOURNAL.txt
├── Dracula-Chapter-15_DR._SEWARD'S_DIARY--_continued_..txt
├── Dracula-Chapter-16_DR._SEWARD'S_DIARY--_continued_.txt
├── Dracula-Chapter-17_DR._SEWARD'S_DIARY--_continued_.txt
├── Dracula-Chapter-18_DR._SEWARD'S_DIARY.txt
├── Dracula-Chapter-19_JONATHAN_HARKER'S_JOURNAL.txt
├── Dracula-Chapter-1_JONATHAN_HARKER'S_JOURNAL.txt
├── Dracula-Chapter-20_JONATHAN_HARKER'S_JOURNAL.txt
├── Dracula-Chapter-21_DR._SEWARD'S_DIARY.txt
├── Dracula-Chapter-22_JONATHAN_HARKER'S_JOURNAL.txt
├── Dracula-Chapter-23_DR._SEWARD'S_DIARY.txt
├── Dracula-Chapter-24_DR._SEWARD'S_PHONOGRAPH_DIARY,_SPOKEN_BY_VAN_HELSING.txt
├── Dracula-Chapter-25_DR._SEWARD'S_DIARY.txt
├── Dracula-Chapter-26_DR._SEWARD'S_DIARY.txt
├── Dracula-Chapter-27_MINA_HARKER'S_JOURNAL.txt
├── Dracula-Chapter-2_JONATHAN_HARKER'S_JOURNAL--_continued_.txt
├── Dracula-Chapter-3_JONATHAN_HARKER'S_JOURNAL--_continued_.txt
├── Dracula-Chapter-4_JONATHAN_HARKER'S_JOURNAL--_continued_.txt
├── Dracula-Chapter-5__Letter_from_Miss_Mina_Murray_to_Miss_Lucy_Westenra._.txt
├── Dracula-Chapter-6_MINA_MURRAY'S_JOURNAL.txt
├── Dracula-Chapter-7_CUTTING_FROM_"THE_DAILYGRAPH,"_8_AUGUST.txt
├── Dracula-Chapter-8_MINA_MURRAY'S_JOURNAL.txt
└── Dracula-Chapter-9__Letter,_Mina_Harker_to_Lucy_Westenra._.txt
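
In the listing above, Chapter-10 sorts before Chapter-1 and some names contain quotes and commas, which not every file system accepts. A possible variation, sketched here under the assumption that a zero-padded number and a restricted character set are acceptable for the assignment, is shown below; `chapter_filename`, `prefix`, and `width` are hypothetical names, not part of the code above.

```python
import re


def chapter_filename(no, title, prefix='Dracula-Chapter', width=2):
    """Build a file name with a zero-padded chapter number and a sanitized title."""
    # Collapse every run of characters other than letters, digits,
    # apostrophes and hyphens into a single underscore.
    safe_title = re.sub(r"[^A-Za-z0-9'-]+", '_', title).strip('_')
    return '{prefix}-{no:0{width}d}_{title}.txt'.format(
        prefix=prefix, no=no, width=width, title=safe_title)


names = [
    chapter_filename(2, "JONATHAN HARKER'S JOURNAL--_continued_"),
    chapter_filename(10, ' Letter, Dr. Seward to Hon. Arthur Holmwood. '),
    chapter_filename(1, "JONATHAN HARKER'S JOURNAL"),
]
for name in sorted(names):
    print(name)
```

With the padded number, a plain alphabetical sort lists the files in chapter order. Passing the book name through `prefix` also keeps it out of the function body.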

Posted 2018/10/17 10:23

Edited 2018/10/18 15:59

tiitoi

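Total score: 21956

Deactivated user

2018/10/18 15:00 (edited)

Thank you for the thorough answer. It is indeed Project Gutenberg. I'm sorry that this comes down to my own circumstances, but this is an assignment at a university abroad, and one of its conditions is that it be solved using the .txt file. If you could spare the time to explain the approach you suggested in your first comment, splitting the .txt file using the chapter heading and the title below it as markers, I would be very grateful...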