Answer edit history

Fix code

2020/08/10 06:03

Posted

y_shinoda

Score 3272

answer CHANGED

@@ -304,15 +304,15 @@
     'https://www.google.co.jp/mail/help/intl/ja/about.html?vm=r', # GMail
 ]
-list_response = ParallelHtmlScraper.execute(f'{host_google}', path_and_content, AnalyzerForTest())
+list_response = ParallelHtmlScraper.execute(host_google, path_and_content, AnalyzerForTest())
 print(list_response)
 ```
 Execution result:
 ```python
-$ python test.py
+$ pipenv run python test.py
-['Google', 'Google 画像検索', 'Google ショッピング', 'コレクション', 'Google マップ', '\n        Google ドライブ\n    ', '\n        Gmail - Google のメール\n    ']
+['Google 画像検索', ' Google マップ ', '\n      Gmail - Google のメール\n    ', 'Google ショッピング', 'Google', '\n      Google ドライブ\n    ', 'コレクション']
 ```
 When I was implementing this package,

Fix code

2020/08/10 06:02

Posted

y_shinoda

Score 3272

answer CHANGED

@@ -304,7 +304,7 @@
     'https://www.google.co.jp/mail/help/intl/ja/about.html?vm=r', # GMail
 ]
-list_response = ParallelHtmlScraper.execute(f'{host_google}', path_and_content.keys(), AnalyzerForTest())
+list_response = ParallelHtmlScraper.execute(f'{host_google}', path_and_content, AnalyzerForTest())
 print(list_response)
 ```

Fix link to Stack Overflow

2020/08/08 03:09

Posted

y_shinoda

Score 3272

answer CHANGED

@@ -48,7 +48,7 @@
     asyncio.run(asynchronous_process(urls))
 ```
-Reference: [Answer: How could I use requests in asyncio? ](https://stackoverflow.com/a/47572164/12721873)
+Reference: [Answer: How could I use requests in asyncio? ](https://stackoverflow.com/a/22414756/12721873)
 ## Verification

Fix answer that mistook intention of question

2020/07/01 21:46

Posted

y_shinoda

Score 3272

answer CHANGED

@@ -30,11 +30,32 @@
     time.sleep(1)
 ```
+## Changing from threads to asynchronous processing
+Change the following part:
+```python
+    with ThreadPoolExecutor(max_workers=2) as pool:
+        threads = [res for res in pool.map(check_url, urls)]
+```
+↓
+```python
+    async def asynchronous_process(urls):
+        loop = asyncio.get_event_loop()
+        futures = [loop.run_in_executor(None, check_url, url) for url in urls]
+        await asyncio.gather(*futures)
+    asyncio.run(asynchronous_process(urls))
+```
+Reference: [Answer: How could I use requests in asyncio? ](https://stackoverflow.com/a/47572164/12721873)
 ## Verification
 I verified it as follows:
 ```python
+import asyncio
 import requests
 import json
 import time
@@ -42,7 +63,7 @@
 import os
 import sys
 from bs4 import BeautifulSoup
-from concurrent.futures import ThreadPoolExecutor
+# from concurrent.futures import ThreadPoolExecutor
 if os.path.exists("list.txt"):
     pass
@@ -79,19 +100,18 @@
     start = time.time()
     def check_url(target_url, headers=headers, proxies=proxies, retry=3):
         for i in range(retry):
             try:
                 start = time.time()
                 # The request below is slow, so I want to process it asynchronously
                 req = requests.get(
                     target_url, headers=headers, proxies=proxies, allow_redirects=False
                 )
                 logs.append(str(req.status_code) + "\t" + target_url)
                 target_urlll = re.sub(r"(.*)(?=/)|/|(?=\?)(.*)", "", target_url)
+                print(target_url)
                 print(str(req.status_code) + "\t" + target_urlll)
                 # if req.status_code == 404:
@@ -171,7 +191,7 @@
                 #                 break
-                # return
+                return
             except requests.exceptions.ConnectTimeout:
                 logs.append("TIMEOUT" + "\t" + target_url)
                 time.sleep(10)
@@ -187,10 +207,15 @@
     with open("list.txt") as f:
         urls = f.read().splitlines()
-    threads = []
+    # threads = []
-    with ThreadPoolExecutor(max_workers=2) as pool:
+    # with ThreadPoolExecutor(max_workers=()) as pool:
-        threads = [res for res in pool.map(check_url, urls)]
+    #     threads = [res for res in pool.map(check_url, urls)]
+    async def asynchronous_process(urls):
+        loop = asyncio.get_event_loop()
+        futures = [loop.run_in_executor(None, check_url, url) for url in urls]
+        await asyncio.gather(*futures)
+    asyncio.run(asynchronous_process(urls))
     with open("log.txt", "w") as f:
         f.write("\n".join(logs))
@@ -219,15 +244,15 @@
 Execution result:
 ```console
-$ pipenv run python test2.py
+$ pipenv run python test.py
 https://www.yahoo.co.jp/
 200
+https://www.facebook.com/
+302
 https://www.google.com/
 200
 https://www.bing.com/
 200
-https://www.facebook.com/
-302
 https://www.instagram.com/
 200
 https://twitter.com/
@@ -235,24 +260,26 @@
 https://www.amazon.co.jp/
 200
-elapsed_time:1.1018426418304443[sec]
+elapsed_time:0.526411771774292[sec]
 1回目
 ```
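+Because response times differ,
-You can see that the results are reordered depending on response speed
+you can see that the output order differs from the order in `list.txt`
 ## Approach using aiohttp
-I previously created a library that does almost the same thing
+I previously created a package that does almost the same thing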
 [yukihiko-shinoda/parallel-html-scraper](https://github.com/yukihiko-shinoda/parallel-html-scraper)
 [parallelhtmlscraper · PyPI](https://pypi.org/project/parallelhtmlscraper/)
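 (Apologies that I have not written a README.md)
-Using it as a library saves you from reinventing the wheel, and
+Using it as a package saves you from reinventing the wheel, and
+if you read through the package's code,
-if you read through the code, I think you will understand how it should be implemented
+I think you will understand how it should be implemented
 Library usage example: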
@@ -271,9 +298,9 @@
     '/webhp?tab=rw',                                              # Google 検索
     '/imghp?hl=ja&tab=wi&ogbl',                                   # Google 画像検索
     '/shopping?hl=ja&source=og&tab=wf',                           # Google ショッピング
-    '/save',                                                      # Google Collection
+    '/save',                                                      # コレクション
-    'https://www.google.co.jp/maps',                              # Google Maps
+    'https://www.google.co.jp/maps',                              # Google マップ
-    'https://www.google.co.jp/drive/apps.html',                   # Google Drive
+    'https://www.google.co.jp/drive/apps.html',                   # Google ドライブ
     'https://www.google.co.jp/mail/help/intl/ja/about.html?vm=r', # GMail
 ]
@@ -286,4 +313,9 @@
 ```python
 $ python test.py
 ['Google', 'Google 画像検索', 'Google ショッピング', 'コレクション', 'Google マップ', '\n        Google ドライブ\n    ', '\n        Gmail - Google のメール\n    ']
-```
+```
+When I was implementing this package,
+I remember reading almost nothing but the official documentation
+[Coroutines and Tasks - Python 3.8.4rc1 documentation](https://docs.python.org/ja/3.8/library/asyncio-task.html)
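
The switch from `pool.map(check_url, urls)` to `loop.run_in_executor(...)` plus `asyncio.gather(...)` in this revision can be tried in isolation with the minimal sketch below. It is only an illustration under stated assumptions: this `check_url` is a hypothetical stand-in that sleeps instead of calling `requests.get()`, the `example.com` URLs are placeholders, and `asyncio.get_running_loop()` is used where the answer uses `asyncio.get_event_loop()`.

```python
import asyncio
import time


def check_url(target_url):
    """Hypothetical stand-in for the blocking requests.get() call."""
    # Assumption: the sleep simulates network latency; no real request is sent.
    time.sleep(0.2)
    return target_url


async def asynchronous_process(urls):
    loop = asyncio.get_running_loop()
    # Passing None selects the event loop's default ThreadPoolExecutor.
    futures = [loop.run_in_executor(None, check_url, url) for url in urls]
    # gather() returns results in the order of `futures`,
    # even if the calls finish in a different order.
    return await asyncio.gather(*futures)


# Placeholder URLs, not taken from list.txt.
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
start = time.time()
results = asyncio.run(asynchronous_process(urls))
elapsed_time = time.time() - start
print(results)
print(f"elapsed_time:{elapsed_time}[sec]")
```

Because the blocking calls run in worker threads, anything printed inside `check_url` appears in completion order, while the list returned by `gather()` keeps the input order.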

Update description

2020/07/01 21:40

Posted

y_shinoda

Score 3272

answer CHANGED

@@ -2,7 +2,7 @@
 ## Parts that need fixing
-1.
+1
 Set the `max_workers` argument of `ThreadPoolExecutor()` to an int value:
 ```python
@@ -16,7 +16,7 @@
     with ThreadPoolExecutor(max_workers=2) as pool:
 ```
-2.
+2
 At the very least, repeated accesses to the same resource should be spaced at least 1 second apart:
 ```python
@@ -252,8 +252,10 @@
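 (Apologies that I have not written a README.md)
 Using it as a library saves you from reinventing the wheel, and
-if you read the code, I think you will understand how it should be implemented
+if you read through the code, I think you will understand how it should be implemented
+Library usage example: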
 ```python
 from bs4 import BeautifulSoup