Answer edit history

1

Posted 2018/10/18 15:59 by tiitoi (score 21956)

test CHANGED
@@ -77,3 +77,233 @@
To extract exactly the information you want, study the EPUB format and scraping.

## Addendum

Sample code that splits the text into chapters.

### Read the file

```python
import itertools
import os
import re

# Read the file.
# -----------------------------------
with open('dracula.txt') as f:
    # Drop empty lines.
    lines = [l for l in f.read().splitlines() if l]
print('number of lines: {}'.format(len(lines)))  # number of lines: 13414
```
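
The list comprehension above drops only lines that are exactly empty, and `open` uses the platform's default encoding. The following is a minimal sketch of a slightly stricter reader. `read_nonempty_lines` is a hypothetical helper name, and `encoding='utf-8'` is an assumption about the input file, which the original answer does not state.

```python
def read_nonempty_lines(path, encoding='utf-8'):
    """Return the lines of a text file without blank or whitespace-only lines.

    The default encoding is an assumption; pass the real one if it differs.
    """
    with open(path, encoding=encoding) as f:
        # rstrip() removes the newline and any trailing spaces, so that
        # comparisons such as line == 'DRACULA' are not thrown off by them.
        return [line.rstrip() for line in f if line.strip()]
```

With this helper, the reading step would become `lines = read_nonempty_lines('dracula.txt')`. The line count may then differ from the figure in the comment above if the file contains whitespace-only lines.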

### Parse the text

```python
# Parse the book into chapters.
# -----------------------------------
itr = iter(lines)

# Skip lines until the book title 'DRACULA' is found.
for i in itertools.count(1):
    line = next(itr)
    if line == 'DRACULA':
        break
print('{} lines skipped'.format(i))
print(line)
chapters = []

# The book title is followed by the first chapter number, then its title.
next(itr)  # skip the chapter number line

no = 1  # chapter number
title = next(itr)  # chapter title
sentences = []  # lines of the current chapter

for i in itertools.count(i):
    line = next(itr)

    if 'THE END' in line:
        # End mark found: store the last chapter and stop.
        chapters.append({'no': no, 'title': title, 'sentences': sentences})
        break  # story ends.

    # Check whether the line starts with CHAPTER <Roman numerals>.
    if re.match(r'CHAPTER [MDCLXVI]+', line):
        chapters.append({'no': no, 'title': title, 'sentences': sentences})
        no += 1
        title = next(itr)  # the chapter number line is followed by the chapter title.
        sentences = []
        continue

    sentences.append(line)

print('{} lines parsed'.format(i))  # 13036 lines parsed
```
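
The same splitting logic can be wrapped in a function so that it can be tried on a small list before running it on the whole book. This is only a sketch under the same layout assumptions as the loop above: a `DRACULA` title line, each `CHAPTER <Roman numerals>` line immediately followed by the chapter title, and a `THE END` marker. `split_chapters` and its parameters are hypothetical names, not part of the original answer.

```python
import re

CHAPTER_RE = re.compile(r'CHAPTER [MDCLXVI]+')


def split_chapters(lines, book_title='DRACULA', end_mark='THE END'):
    """Split a list of non-empty lines into chapter dicts (hypothetical helper)."""
    itr = iter(lines)
    # Skip the front matter up to and including the book title line.
    for line in itr:
        if line == book_title:
            break

    chapters = []
    current = None  # chapter being collected, or None before the first one
    for line in itr:
        if end_mark in line:
            break  # story ends; anything after the marker is ignored
        if CHAPTER_RE.match(line):
            if current is not None:
                chapters.append(current)
            # The line right after the chapter heading is the chapter title.
            current = {'no': len(chapters) + 1,
                       'title': next(itr, ''),
                       'sentences': []}
        elif current is not None:
            current['sentences'].append(line)
    if current is not None:
        chapters.append(current)
    return chapters


sample = ['preface', 'DRACULA',
          'CHAPTER I', "JONATHAN HARKER'S JOURNAL", 'line a', 'line b',
          'CHAPTER II', 'SECOND TITLE', 'line c',
          'THE END', 'license text']
for chapter in split_chapters(sample):
    print(chapter['no'], chapter['title'], len(chapter['sentences']))
```

Unlike the loop above, this version does not raise `StopIteration` when the end marker is missing; it simply returns whatever chapters it has collected.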

### Write to files

```python
# Write every chapter to its own file.
# -------------------------------------
output_dirpath = 'chapters'
os.makedirs(output_dirpath, exist_ok=True)

for chapter in chapters:
    # Build the file name from the chapter number and title.
    filename = 'Dracula-Chapter-{no}_{title}.txt'.format(
        no=chapter['no'], title=chapter['title'].replace(' ', '_'))
    filepath = os.path.join(output_dirpath, filename)

    with open(filepath, 'w') as f:
        f.write("\n".join(chapter['sentences']))
```
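
The chapter titles go into the file names almost unchanged, so characters such as `"` can end up in them, and Windows does not allow those in file names. The following is a minimal sketch of a sanitizer. `safe_filename` is a hypothetical helper, and the set of removed characters is an assumption based on the characters Windows reserves.

```python
import re


def safe_filename(title):
    """Turn a chapter title into a conservative file-name fragment."""
    # Remove characters that Windows rejects in file names: \ / : * ? " < > |
    name = re.sub(r'[\\/:*?"<>|]', '', title)
    # Collapse each run of whitespace into a single underscore.
    name = re.sub(r'\s+', '_', name.strip())
    # Drop trailing dots, which Windows also handles poorly.
    return name.rstrip('.')


print(safe_filename('CUTTING FROM "THE DAILYGRAPH," 8 AUGUST'))
```

To use it, `chapter['title'].replace(' ', '_')` in the loop above would become `safe_filename(chapter['title'])`. The resulting names would then differ from the listing shown under the output heading.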

### Output

```
chapters
├── Dracula-Chapter-10__Letter,_Dr._Seward_to_Hon._Arthur_Holmwood._.txt
├── Dracula-Chapter-11__Lucy_Westenra's_Diary._.txt
├── Dracula-Chapter-12_DR._SEWARD'S_DIARY.txt
├── Dracula-Chapter-13_DR._SEWARD'S_DIARY--_continued_..txt
├── Dracula-Chapter-14_MINA_HARKER'S_JOURNAL.txt
├── Dracula-Chapter-15_DR._SEWARD'S_DIARY--_continued_..txt
├── Dracula-Chapter-16_DR._SEWARD'S_DIARY--_continued_.txt
├── Dracula-Chapter-17_DR._SEWARD'S_DIARY--_continued_.txt
├── Dracula-Chapter-18_DR._SEWARD'S_DIARY.txt
├── Dracula-Chapter-19_JONATHAN_HARKER'S_JOURNAL.txt
├── Dracula-Chapter-1_JONATHAN_HARKER'S_JOURNAL.txt
├── Dracula-Chapter-20_JONATHAN_HARKER'S_JOURNAL.txt
├── Dracula-Chapter-21_DR._SEWARD'S_DIARY.txt
├── Dracula-Chapter-22_JONATHAN_HARKER'S_JOURNAL.txt
├── Dracula-Chapter-23_DR._SEWARD'S_DIARY.txt
├── Dracula-Chapter-24_DR._SEWARD'S_PHONOGRAPH_DIARY,_SPOKEN_BY_VAN_HELSING.txt
├── Dracula-Chapter-25_DR._SEWARD'S_DIARY.txt
├── Dracula-Chapter-26_DR._SEWARD'S_DIARY.txt
├── Dracula-Chapter-27_MINA_HARKER'S_JOURNAL.txt
├── Dracula-Chapter-2_JONATHAN_HARKER'S_JOURNAL--_continued_.txt
├── Dracula-Chapter-3_JONATHAN_HARKER'S_JOURNAL--_continued_.txt
├── Dracula-Chapter-4_JONATHAN_HARKER'S_JOURNAL--_continued_.txt
├── Dracula-Chapter-5__Letter_from_Miss_Mina_Murray_to_Miss_Lucy_Westenra._.txt
├── Dracula-Chapter-6_MINA_MURRAY'S_JOURNAL.txt
├── Dracula-Chapter-7_CUTTING_FROM_"THE_DAILYGRAPH,"_8_AUGUST.txt
├── Dracula-Chapter-8_MINA_MURRAY'S_JOURNAL.txt
└── Dracula-Chapter-9__Letter,_Mina_Harker_to_Lucy_Westenra._.txt
```
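
In the listing, chapters 10 to 19 appear before chapter 1 because the names are sorted as strings. If the files should sort in reading order, one option is to zero-pad the chapter number. The following is a small sketch; `chapter_filename` and its `width` parameter are hypothetical names introduced for illustration.

```python
def chapter_filename(no, title, width=2):
    """Build a file name whose chapter number is zero-padded to `width` digits."""
    return 'Dracula-Chapter-{no:0{width}d}_{title}.txt'.format(
        no=no, width=width, title=title.replace(' ', '_'))


# With padding, plain string sorting matches the chapter order.
names = [chapter_filename(n, 'SAMPLE TITLE') for n in (10, 2, 1)]
print(sorted(names))
```

A width of 2 is enough here because the book has 27 chapters.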