[Python] Why does it get slower the more cores I use?

Background / What I want to achieve

I have asked about this twice before, but I still get results I did not expect. I have come up with my own hypothesis and would like to know whether it is correct.

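As the title says, when I run the source code below, execution gets slower as I increase the number of cores.
The program passes one horizontal row of the image to a function called changeToGray, which returns the processed row. Whenever a worker process is free, the next row is submitted to it.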

I expected that handing rows to free processes one after another would make the processing fast, but the more cores I allocate, the slower it runs.

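My own hypothesis is this: because changeToGray finishes so quickly, the for loop in mulchProcess spends most of its time waiting, so the effective throughput is no better than with 1 core, and needlessly distributing the work across 8 cores only adds delay.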

Honestly, this does not fully make sense to me, so I would appreciate an explanation from anyone who understands what is going on.

Source code

【speedtest.py】

Python3
"""
Convert an image to monochrome in parallel.
Pass the image path as the first command-line argument
and the number of cores as the second.
Example: python speedtest.py image.png 4
Install the dependency beforehand:
  pip install opencv-python
(concurrent.futures is part of the standard library on Python 3,
so no extra package is needed for it.)
"""
import concurrent.futures
import sys
import time

import numpy as np

import common

img = common.getRGBImage(sys.argv[1])
useCPU = int(sys.argv[2])


def main():
    mulchProcess(useCPU=useCPU)


def changeToGray(number: int, width: np.ndarray):
    """
    The work that is parallelized.
    @param  number (int)       : number of this task (the row index)
    @param  width (np.ndarray) : one horizontal row, [[R, G, B], ..., [R, G, B]]
    @return number (int)       : number of this task (the row index)
    @return width (np.ndarray) : the input row converted to grayscale
    """
    # Weighted sum of R, G, B per pixel, then copied into all three channels.
    return number, np.tile((width * [0.3, 0.59, 0.11]).sum(axis=1), (3, 1)).T


def mulchProcess(useCPU: int):
    """
    Create worker processes and run the conversion on multiple cores.
    @param  useCPU (int)  : number of CPU cores to use
    """
    start = time.time()
    with concurrent.futures.ProcessPoolExecutor(max_workers=useCPU) as executer:
        # One task per image row.
        fs = [executer.submit(changeToGray, i, width) for i, width in enumerate(img)]
        for future in concurrent.futures.as_completed(fs):
            # Fetch the result once and write the gray row back in place.
            number, gray = future.result()
            img[number] = gray
    finish = time.time() - start
    print(str(finish))


if __name__ == '__main__':
    main()

【common.py】

Python3
"""
Shared helper module.
"""
import cv2

def getRGBImage(filePath: str):
    """
    Load an image and return it as an RGB array.
    @param  filePath (str)        : path to the image file
    @return RGBImage (np.ndarray) : the image converted to RGB
    """
    # Load the image from the given path
    img = cv2.imread(filePath)

    # OpenCV loads channels as [B, G, R] by default, so convert to [R, G, B]
    return cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

What I tried

I ran the following shell script to measure the execution time 10 times each for 1 to 8 cores.

zsh
for i in `seq 1 8`; do
  echo $i
  # Run the measurement 10 times for this core count
  for j in `seq 1 10`; do
    python speedtest.py path/to/image $i
  done
  echo ""
done

Supplementary information (FW/tool versions, etc.)

[Environment]
MacBook Pro (15-inch, 2016)
Processor : 2.6 GHz Intel Core i7 (4 cores, 8 threads)
Memory : 16 GB 2133 MHz LPDDR3
Python 3.6.4

[Average execution time for 1 to 8 cores]
・ 1 core : 6.0182 s
・ 2 cores : 6.0921 s
・ 3 cores : 6.3480 s
・ 4 cores : 6.6975 s
・ 5 cores : 6.8116 s
・ 6 cores : 6.9263 s
・ 7 cores : 7.0920 s
・ 8 cores : 7.0808 s

Execution results

1
6.1593170166015625
6.039182186126709
5.922972917556763
6.074885845184326
5.917132139205933
6.013262987136841
6.059257745742798
5.964179039001465
5.972550868988037
6.0595598220825195

2
6.058332920074463
6.004819869995117
6.023962736129761
6.338515043258667
6.072969913482666
6.004595994949341
6.057007789611816
6.209378957748413
6.10593581199646
6.0453040599823

3
6.305116891860962
6.351101875305176
6.3122687339782715
6.350980043411255
6.4230170249938965
6.33448600769043
6.272423267364502
6.284146070480347
6.463899850845337
6.382644176483154

4
6.593387126922607
7.192111015319824
6.550843954086304
6.615111827850342
7.154192686080933
6.57856297492981
6.570522785186768
6.65663480758667
6.5169501304626465
6.5469067096710205

5
6.768412828445435
6.908749103546143
6.742299795150757
6.808297872543335
6.675406217575073
6.8053460121154785
6.822022199630737
6.761777877807617
6.966326713562012
6.85724401473999

6
6.913501977920532
7.00965690612793
7.009225130081177
6.954202890396118
6.9130120277404785
6.907424211502075
6.902018070220947
6.956757068634033
6.811483860015869
6.885385990142822

7
6.951028108596802
6.981894254684448
7.024693965911865
6.979804039001465
6.93758225440979
7.292918920516968
7.0332419872283936
6.951812982559204
7.195998191833496
7.571115970611572

8
7.039088726043701
7.113111972808838
7.077091693878174
7.106177806854248
7.0976550579071045
7.09247899055481
7.061521768569946
7.14942479133606
6.9959800243377686
7.075492858886719
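
As a cross-check, the averages listed under the supplementary information can be reproduced from the raw timings. The sketch below does this for the 1-core and 8-core runs only, and also reports how much slower the 8-core average is. The variable names are illustrative and not part of the original script.

```python
from statistics import mean

# Raw timings copied from the 1-core and 8-core result blocks (seconds).
one_core = [
    6.1593170166015625, 6.039182186126709, 5.922972917556763,
    6.074885845184326, 5.917132139205933, 6.013262987136841,
    6.059257745742798, 5.964179039001465, 5.972550868988037,
    6.0595598220825195,
]
eight_cores = [
    7.039088726043701, 7.113111972808838, 7.077091693878174,
    7.106177806854248, 7.0976550579071045, 7.09247899055481,
    7.061521768569946, 7.14942479133606, 6.9959800243377686,
    7.075492858886719,
]

mean_1 = mean(one_core)
mean_8 = mean(eight_cores)

# Relative slowdown of the 8-core average against the 1-core average.
slowdown_percent = (mean_8 / mean_1 - 1) * 100

print(f"1 core : {mean_1:.4f} s")
print(f"8 cores: {mean_8:.4f} s")
print(f"8 cores is {slowdown_percent:.1f}% slower than 1 core")
```

The difference is small in absolute terms (about one second), but it grows fairly steadily with the worker count, which is what one would expect if a per-worker or per-task cost were being added rather than work being shared.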


1 answer

Best answer

I think you are largely right.
changeToGray is currently very fast, so managing the additional worker processes probably costs more than the work itself.
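
If the per-task cost really is the issue, one option worth trying is to hand each worker a block of rows instead of a single row, so that the submit/result round trip is paid once per block. The following is only a sketch of that idea and has not been measured against the original program. It uses plain lists of (R, G, B) tuples to stay dependency-free, and `split_ranges`, `gray_rows` and `gray_parallel` are hypothetical helper names. With NumPy, the slice `img[start:stop]` and a vectorized expression would take the place of the per-pixel loop.

```python
from concurrent.futures import ProcessPoolExecutor


def split_ranges(n_rows, n_chunks):
    """Split range(n_rows) into at most n_chunks contiguous (start, stop) pairs."""
    if n_rows <= 0:
        return []
    n_chunks = max(1, min(n_chunks, n_rows))
    base, extra = divmod(n_rows, n_chunks)
    ranges, start = [], 0
    for k in range(n_chunks):
        # The first `extra` chunks get one additional row.
        stop = start + base + (1 if k < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def gray_pixel(pixel):
    """Same weights as changeToGray in the question."""
    r, g, b = pixel
    y = 0.3 * r + 0.59 * g + 0.11 * b
    return (y, y, y)


def gray_rows(start, rows):
    """Convert a whole block of rows in one task; return where the block starts."""
    return start, [[gray_pixel(p) for p in row] for row in rows]


def gray_parallel(img, workers, chunks_per_worker=1):
    """Submit one task per block of rows rather than one task per row."""
    ranges = split_ranges(len(img), workers * chunks_per_worker)
    result = [None] * len(img)
    with ProcessPoolExecutor(max_workers=workers) as executor:
        futures = [
            executor.submit(gray_rows, start, img[start:stop])
            for start, stop in ranges
        ]
        for future in futures:
            start, rows = future.result()
            result[start:start + len(rows)] = rows
    return result
```

Calling `gray_parallel(img, workers=4)` from under an `if __name__ == '__main__':` guard would then create 4 tasks in total instead of one per image row. Whether this actually beats the single-process version depends on how the conversion time compares with the cost of copying each block to and from the workers, so it should be profiled rather than assumed.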

I used cProfile to find where the program spends its time. The results will probably look broadly similar in any environment.

python
if __name__ == '__main__':
    import cProfile
    import pstats
    profiler = cProfile.Profile()
    profiler.runcall(main)
    stats = pstats.Stats(profiler)

    # Strip directory names such as /usr/local/lib/...
    # so that only e.g. threading.py is shown
    stats.strip_dirs()

    # Sort by time spent inside the function itself ('tottime');
    # 'cumtime' would also include time spent in the functions it calls
    stats.sort_stats('tottime')

    # Show the top 10 entries
    stats.print_stats(10)
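
The difference between 'tottime' and 'cumtime' mentioned in the comments can be seen on a toy example. In this sketch, `inner`, `outer` and `profile_report` are made-up names for illustration: `outer` does almost nothing itself, so its tottime should be tiny while its cumtime includes everything spent in `inner`.

```python
import cProfile
import io
import pstats


def inner():
    # Burn a little CPU so the profiler has something to attribute.
    return sum(i * i for i in range(200_000))


def outer():
    # Almost no work of its own; nearly all of its time is spent in inner().
    return inner()


def profile_report(func, sort_key):
    """Profile func() and return the pstats report as a string."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.strip_dirs().sort_stats(sort_key).print_stats(5)
    return buffer.getvalue()


print(profile_report(outer, 'tottime'))
print(profile_report(outer, 'cumtime'))
```

Sorting by 'cumtime' is useful for finding which high-level call is slow overall, while 'tottime' points at the function where the time is actually spent.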

　
Running python main.py a.jpg 1 gives the following.
Of the 0.139 seconds, about half (0.071 s) goes to {method 'acquire' of '_thread.lock' objects},
and roughly a quarter (0.037 s) is spent in the mulchProcess function itself.

         33640 function calls (33601 primitive calls) in 0.139 seconds

   Ordered by: internal time
   List reduced from 289 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1912    0.071    0.000    0.071    0.000 {method 'acquire' of '_thread.lock' objects}
        1    0.037    0.037    0.139    0.139 main.py:49(mulchProcess)
      423    0.003    0.000    0.079    0.000 _base.py:196(as_completed)
      422    0.002    0.000    0.012    0.000 process.py:596(submit)
      838    0.002    0.000    0.003    0.000 _base.py:174(_yield_finished_futures)
      844    0.001    0.000    0.003    0.000 _base.py:405(result)
      372    0.001    0.000    0.071    0.000 threading.py:264(wait)
      416    0.001    0.000    0.073    0.000 threading.py:534(wait)
      423    0.001    0.000    0.001    0.000 {built-in method posix.write}
     2519    0.001    0.000    0.002    0.000 threading.py:240(__enter__)

　
python main.py a.jpg 8
From around here, {built-in method posix.fork} and similar calls start to appear near the top.

         33592 function calls (33553 primitive calls) in 0.165 seconds

   Ordered by: internal time
   List reduced from 291 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1860    0.097    0.000    0.097    0.000 {method 'acquire' of '_thread.lock' objects}
        1    0.026    0.026    0.164    0.164 main.py:49(mulchProcess)
        8    0.008    0.001    0.008    0.001 {built-in method posix.fork}
      423    0.002    0.000    0.103    0.000 _base.py:196(as_completed)
      819    0.002    0.000    0.003    0.000 _base.py:174(_yield_finished_futures)
      359    0.002    0.000    0.094    0.000 threading.py:264(wait)
      844    0.002    0.000    0.003    0.000 _base.py:405(result)
      422    0.001    0.000    0.021    0.000 process.py:596(submit)
      397    0.001    0.000    0.096    0.000 threading.py:534(wait)
      423    0.001    0.000    0.001    0.000 {built-in method posix.write}

　
python main.py a.jpg 16
Perhaps because this is already more workers than needed, the share taken by {built-in method posix.fork} clearly increases (0.017 s of 0.208 s).

         32700 function calls (32661 primitive calls) in 0.208 seconds

   Ordered by: internal time
   List reduced from 291 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1540    0.111    0.000    0.111    0.000 {method 'acquire' of '_thread.lock' objects}
        1    0.042    0.042    0.207    0.207 main.py:49(mulchProcess)
       16    0.017    0.001    0.017    0.001 {built-in method posix.fork}
      423    0.003    0.000    0.110    0.000 _base.py:196(as_completed)
      422    0.002    0.000    0.035    0.000 process.py:596(submit)
      777    0.002    0.000    0.003    0.000 _base.py:174(_yield_finished_futures)
      844    0.002    0.000    0.003    0.000 _base.py:405(result)
      279    0.001    0.000    0.102    0.000 threading.py:264(wait)
      423    0.001    0.000    0.001    0.000 {built-in method posix.write}
      355    0.001    0.000    0.104    0.000 threading.py:534(wait)

　
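As an extreme example, running python main.py a.jpg 500 gives the following.
At this point any program would slow down, not just this one, but process-management calls (posix.waitpid, posix.fork) now dominate the profile.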

         301193 function calls (301154 primitive calls) in 3.206 seconds

   Ordered by: internal time
   List reduced from 291 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   124750    1.031    0.000    1.031    0.000 {built-in method posix.waitpid}
     1396    0.772    0.001    0.772    0.001 {method 'acquire' of '_thread.lock' objects}
      500    0.657    0.001    0.657    0.001 {built-in method posix.fork}
        1    0.246    0.246    3.206    3.206 main.py:49(mulchProcess)
      500    0.116    0.000    1.228    0.002 process.py:53(_cleanup)
   124750    0.082    0.000    1.113    0.000 popen_fork.py:25(poll)
      500    0.040    0.000    0.761    0.002 popen_fork.py:67(_launch)
      503    0.032    0.000    0.032    0.000 {built-in method posix.pipe}
        1    0.030    0.030    2.146    2.146 process.py:585(_adjust_process_count)
      500    0.028    0.000    0.046    0.000 process.py:72(__init__)
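
One reading of the 500-worker profile, which I have not checked against the multiprocessing source for this Python version, is that each newly started worker triggers a poll of every worker started before it. That would give 0 + 1 + ... + 499 polls, which matches the 124750 calls to posix.waitpid shown above. The sketch below works through that arithmetic and shows one way to keep the worker count within the machine's core count. `expected_polls` and `cap_workers` are hypothetical helpers, not part of any library.

```python
import os


def expected_polls(n_workers):
    """Polls made if each new worker start checks every worker started before it."""
    # 0 + 1 + ... + (n_workers - 1)
    return n_workers * (n_workers - 1) // 2


def cap_workers(requested, available=None):
    """Clamp the requested worker count to the range 1..available."""
    if available is None:
        available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(requested, available))


print(expected_polls(500))  # 124750, the waitpid call count in the profile
print(expected_polls(8), expected_polls(16))  # 28 120
print(cap_workers(500, available=8))  # 8
```

Under this assumption the start-up cost grows quadratically with the number of workers, which would explain why it is negligible at 8 or 16 workers but dominant at 500.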

Posted 2018/07/31 15:26

Edited 2018/07/31 16:05

toritoritorina


reishisu

2018/07/31 16:06

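I see... Thanks to your careful explanation and detailed measurements, I now understand my results much better! I did not know there was such a thing as cProfile. I have only just started with Python, so the amount of information surprised me at first, but the output is very easy to read. From now on I will try using cProfile myself when something comes up. Thank you very much for your answer!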
