Memory handling when using BeautifulSoup, MeCab, and concurrent.futures

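Background / what I want to achieve

From previously fetched HTML files (about 400,000), I want to extract the body of each one, tokenize it into space-separated words (wakati-gaki), and save the result.

The problem / error messages

While the program runs, memory usage gradually builds up until the instance goes down.
I want to prevent this memory build-up.
However, I cannot tell what is causing it.

Relevant source code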

Python3
import bs4
import MeCab
import concurrent.futures
import glob
import sys
import os
import gzip

# Tagger in wakati mode (space-separated output)
m = MeCab.Tagger('-Owakati')


def _htmlbow(inputs):
  each, i, url = inputs
  print(inputs)
  # Output path: <each>/<input file name>
  out_path = os.path.join(each, os.path.basename(url))

  # Skip files that have already been processed
  if os.path.exists(out_path):
    print("already scraped %s" % url)
    return

  # Read the previously fetched, gzip-compressed HTML
  try:
    with gzip.open(url, "rt") as f:
      html = f.read()
    # Extract the body text
    soup = bs4.BeautifulSoup(html, 'html5lib')
    for s in soup(['style', 'script', '[document]', 'head', 'title']):
      s.extract()
    text = soup.getText()
    # Tokenize and save
    wakati = m.parse(text).strip()
    with gzip.open(out_path, 'wt') as f:
      f.write(wakati)
    print('finished %d ' % i)

  # On failure, record the file in the failure list
  except Exception:
    with open("failed_scraping.txt", "a") as f:
      f.write('{url}\n'.format(url=os.path.basename(url)))
    print('failed     %d ' % i)


def htmlbow():
  # Load the list of HTML files and assign an index to each
  urls = []
  for i, ents in enumerate(glob.glob("./gzhtml/*")):
    urls.append(["bow", i, ents])
  print('load finished')
  # Process in parallel
  with concurrent.futures.ProcessPoolExecutor(max_workers = 992) as executor:
    executor.map(_htmlbow, urls)


# htmlbow
if '--htmlbow' in sys.argv:
  htmlbow()


2 answers

Best answer

with concurrent.futures.ProcessPoolExecutor(max_workers = 992) as executor:

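Does the problem still occur if you leave max_workers unset?

Also, if an exception message is being raised, please add it to the question. It is useful information for people answering.

Addendum
Please change it as follows.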

Python
with concurrent.futures.ProcessPoolExecutor(max_workers = 992) as executor:

↓

Python
with concurrent.futures.ProcessPoolExecutor() as executor:

"It worked fine without specifying max_workers."

"When it is not specified, how is the number of processes determined?"

Quoted from the __init__ of ProcessPoolExecutor in CPython, on GitHub:

Python
if max_workers is None:
    self._max_workers = os.cpu_count() or 1

■ Reference
os.cpu_count()
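
As a rough illustration of the excerpt above, the fallback can be reproduced by hand. The helper name `default_worker_count` is made up for this sketch, and other CPython versions may apply additional caps or use a different CPU-count function, so treat it as an approximation of the quoted behavior rather than the exact implementation in every release.

```python
import os


def default_worker_count():
    """Mirror the quoted fallback: use the CPU count, or 1 if it is unknown."""
    # os.cpu_count() can return None when the count cannot be determined.
    return os.cpu_count() or 1


if __name__ == "__main__":
    print("approximate worker count when max_workers is omitted:",
          default_worker_count())
```

On a typical machine this is a small number, far below 992.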

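■ Aside
1. I think it would be best to take a profile first.
That said, the part that reads each input file, with gzip.open(url,"rt") (here url is a local path, not a download), might be better off using ThreadPoolExecutor to overlap the I/O.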
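
A minimal sketch of the I/O-overlap idea, assuming the inputs are local gzip files as in the question. The names `read_gz` and `read_batch`, the worker count of 8, and the UTF-8 encoding are assumptions made for illustration. Whether this helps at all depends on what the profile shows.

```python
import concurrent.futures
import gzip


def read_gz(path):
    """Read one gzip-compressed text file and return its contents."""
    # Assumption: the files are UTF-8; the question relies on the default encoding.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read()


def read_batch(paths, max_workers=8):
    """Read a batch of files with a thread pool; results keep the order of `paths`."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(read_gz, paths))
```

Every result of a batch is held in memory at once, so with about 400,000 files it would make sense to call `read_batch` on modest slices of the file list rather than on the whole list.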

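2. About max_workers = 992
Did this value come from sample code on some website?

Because the maximum number of worker processes is 992:
a. up to 992 processes are created
b. each process consumes roughly 20 MB of memory (a rough assumption, not a measurement)

Under those assumptions, this program can use up to 992 * 20 MB = 19,840 MB (roughly 19.8 GB) of memory.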
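
The estimate above is a simple multiplication, written out here as a sketch. The 20 MB per process figure is the rough assumption from this answer, not a measurement, and `estimate_total_mb` is a name made up for illustration.

```python
def estimate_total_mb(max_workers, mb_per_process):
    """Upper-bound estimate: every worker process alive at the same time."""
    return max_workers * mb_per_process


if __name__ == "__main__":
    total = estimate_total_mb(992, 20)
    print(total, "MB, i.e. about", total / 1000, "GB")
```

With 8 workers, for comparison, the same assumption gives 160 MB.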

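Most importantly, has the memory problem from the question been resolved?
I am curious about that.

What follows is a reply to the question in the comments.
Please write down as much of what you have looked into as you can: what you researched and which description you did not understand. Rather than handing the whole question over, please ask it in that form.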

Difference between ProcessPoolExecutor and ThreadPoolExecutor

GIL stands for Global Interpreter Lock.
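
One way to see the difference in practice is to compare process IDs. Thread-pool workers run inside the calling process and share its GIL, while process-pool workers are separate processes, each with its own interpreter and its own GIL. A small sketch, with helper names that are made up for illustration:

```python
import concurrent.futures
import os


def worker_pid(_):
    """Return the ID of the process that runs this task."""
    return os.getpid()


def pids_seen(executor_cls, tasks=4, max_workers=2):
    """Run a few tasks on the given executor class and collect the PIDs used."""
    with executor_cls(max_workers=max_workers) as executor:
        return set(executor.map(worker_pid, range(tasks)))
```

`pids_seen(concurrent.futures.ThreadPoolExecutor)` contains only the caller's own PID. `pids_seen(concurrent.futures.ProcessPoolExecutor)`, called under an `if __name__ == "__main__":` guard, contains the PIDs of the worker processes instead, which is also why each worker costs its own share of memory.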

Posted 2018/02/20 12:15

Edited 2018/02/21 12:51

umyu

Total score 5846

dialectic4th

2018/02/20 13:09

I tried the following, but the situation did not change.
```
with concurrent.futures.ProcessPoolExecutor(max_workers = 992) as executor:
  for url in urls:
    executor.submit(_htmlbow, url)
```

dialectic4th

2018/02/20 13:47

I had posted my comment in the wrong place... It worked fine without specifying max_workers. When it is not specified, how is the number of processes determined?

dialectic4th

2018/02/21 02:39

I do not understand the difference between ProcessPoolExecutor and ThreadPoolExecutor. What is the difference?
