When running an external API in parallel via subprocess, should I use subprocess.Popen, or let concurrent.futures or multiprocessing handle it?

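For example, to run tesseract in parallel with code like the following, I can't tell whether I should hand the function itself to concurrent.futures or use subprocess.Popen instead.
From what I've found, concurrent.futures is said to be the better choice, but I couldn't find a convincing reason, and running the function itself (detection_async) in parallel feels odd to me. I have no particular reason for that feeling; I think I'm just not used to it.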

If anyone knows, I'd appreciate concrete advice on how this should be written.

python
import subprocess


def detection_async(request):
    # get_path is not shown in this excerpt.
    image_path = get_path(request)
    text_path = '/path/to/text'

    # I want the call below to run in parallel.
    result = subprocess.run(
        ['tesseract', '--psm', '1', '--oem', '1', image_path, text_path],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    return "Tasks were successfully finished"


1 answer

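I think passing detection_async to a ThreadPoolExecutor is the natural approach.
It makes it easy to keep the code that calls the external API loosely coupled from the code that runs things in parallel.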

A concrete example looks like this.
Notice that detection_async_parallel does not depend on the return value of detection_async. It only collects whatever detection_async returns.

python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def detection_async(request):
    # get_path is the asker's helper and is not shown in this excerpt.
    image_path = get_path(request)
    # Note: every call uses the same output base here, so parallel runs
    # would overwrite each other; real code needs one path per request.
    text_path = '/path/to/text'

    # This is the call to run in parallel.
    result = subprocess.run(
        ['tesseract', '--psm', '1', '--oem', '1', image_path, text_path],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    return result


def detection_async_parallel(requests):
    results = []
    # Each worker thread blocks in subprocess.run while tesseract runs
    # as a separate process.
    with ThreadPoolExecutor() as executor:
        # executor.map yields results in the same order as requests.
        for result in executor.map(detection_async, requests):
            results.append(result)
    return results


if __name__ == "__main__":
    requests = [
        'image_path1',
        'image_path2',
        'image_path3',
    ]
    for result in detection_async_parallel(requests):
        print(result)
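
On the question of why threads are enough here: the heavy work happens in the child process, and each worker thread merely waits inside subprocess.run, so the threads do not compete for the interpreter. The following is a minimal sketch of the same pattern that can be run without tesseract installed. The names run_command and run_commands_parallel, the max_workers value, and the stand-in commands (the current Python interpreter printing a number) are assumptions made for illustration, not part of the original answer.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(args):
    """Run one external command and capture its output (blocks until it exits)."""
    return subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)


def run_commands_parallel(commands, max_workers=4):
    """Run several commands concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(run_command, commands))


# Stand-in commands (hypothetical): each one prints its index.
commands = [[sys.executable, "-c", f"print({i})"] for i in range(3)]
results = run_commands_parallel(commands)
outputs = [r.stdout.decode().strip() for r in results]
print(outputs)
```

Because run_commands_parallel only forwards whatever run_command returns, the command being run can be swapped (for instance for the tesseract call above) without touching the parallel part. max_workers also caps how many external processes run at once.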

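Implemented with subprocess.Popen alone, it would look like this.
Unlike the ThreadPoolExecutor version, detection_async_parallel now depends on the return value of detection_async: it has to know that it receives a Popen object and call communicate() on it.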

python
import subprocess


def detection_async(request):
    # get_path is the asker's helper and is not shown in this excerpt.
    image_path = get_path(request)
    text_path = '/path/to/text'

    # Popen starts the process and returns immediately, without waiting.
    process = subprocess.Popen(
        ['tesseract', '--psm', '1', '--oem', '1', image_path, text_path],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    return process


def detection_async_parallel(requests):
    # Start every process first so that they all run at the same time.
    processes = []
    for request in requests:
        processes.append(detection_async(request))

    # Then wait for each one; communicate() returns (stdout, stderr).
    results = []
    for p in processes:
        results.append(p.communicate())

    return results
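
One practical difference worth noting: the Popen-only version starts one process per request all at once, whereas ThreadPoolExecutor limits concurrency through its worker count. Below is a rough sketch of one way to cap the Popen approach by working in fixed-size batches. The function name run_in_batches, the batch_size parameter, and the stand-in commands are hypothetical and chosen only so the example runs without tesseract.

```python
import subprocess
import sys


def run_in_batches(commands, batch_size=2):
    """Start at most batch_size processes at a time and collect their output."""
    results = []
    for start in range(0, len(commands), batch_size):
        batch = commands[start:start + batch_size]
        # Popen returns immediately, so the whole batch runs concurrently.
        processes = [
            subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            for args in batch
        ]
        for p in processes:
            stdout, stderr = p.communicate()  # waits for this process to exit
            results.append((p.returncode, stdout, stderr))
    return results


# Stand-in commands (hypothetical): each one prints its index.
commands = [[sys.executable, "-c", f"print({i})"] for i in range(5)]
results = run_in_batches(commands, batch_size=2)
for returncode, stdout, _ in results:
    print(returncode, stdout.decode().strip())
```

A batch only finishes when its slowest process does, so this uses the machine less evenly than a thread pool, which hands a worker the next command as soon as it becomes free. That bookkeeping is part of what ThreadPoolExecutor saves you from writing.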

Posted 2020/12/07 06:46

kzm4269

Total score: 184

sequelanonymous

2020/12/07 07:08

Thank you for the very clear explanation. I understand the following point.

> Notice that detection_async_parallel does not depend on the return value of detection_async.

Sorry, but two things are still unclear to me.

First: why is if __name__ == "__main__": there? The code in my question is only a small part of the whole. In practice it will run as part of an API, and I expect the module that starts the server with app.run to contain something like this:

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)

Is my understanding correct that if __name__ == "__main__": should also be written in the file above, that is, in a module other than the one that runs the server?

Second: regarding the code below, how do I get detection_async_parallel to be executed? (This may be an odd question.)

    requests = [
        'image_path1',
        'image_path2',
        'image_path3',
    ]
    for result in detection_async_parallel(requests):
        print(result)

When actually implementing this, how should I write requests, the argument of detection_async_parallel(requests)? In other words, how should it receive the requests? In practice, I expect one request to arrive each time a client hits the endpoint and to be passed to detection_async(request). Is it right that the routing function that defines the endpoint should pass the request to detection_async_parallel first?

kzm4269

2020/12/07 08:47 (edited)

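> First

The if __name__ == "__main__": part only illustrates how to use detection_async_parallel for the purposes of this answer. You don't need it in your actual code.

> Second

If you are asking how to process requests in parallel in a Web API backend, that goes beyond the topic of this question, "running an external API in parallel via subprocess", so I'll refrain from answering it here. Sorry. I recommend posting it as a separate question and stating clearly which web framework you are using.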
