Edit history

Question edit history

Partially revised the code

2019/04/11 07:47

Posted

nasu0922

Score 17

test CHANGED Viewed

File without changes

test CHANGED Viewed

@@ -40,12 +40,8 @@
     # Configure logging (INFO level, written to logger.log)
-    #logging.basicConfig(level=logging.DEBUG)
     logging.basicConfig(format=formatter, filename='logger.log', level=logging.INFO)
-    #logging.basicConfig(level=logging.DEBUG, format=formatter)
     base_url = 'https://p-town.dmm.com'
@@ -182,16 +178,6 @@
-                        #selector2 = 'body > div.o-layout > div > div.o-container > main > div:nth-child(2) > div tr'
-                        #temp = soup2.select(selector2)
-                        #for elem4 in soup2.select(selector2):
-                        #    print(elem4.text)
                     # Load the next page; exit the loop if there is none
                     elif elem3.attrs.get('class')[0] == 'item':

Revised the wording of "Background / what I want to achieve"

2019/04/11 07:46

Posted

nasu0922

Score 17

test CHANGED Viewed

File without changes

test CHANGED Viewed

@@ -2,7 +2,9 @@
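 I want to scrape every pachinko hall on dmm.ぱちタウン.
-I posted yesterday and revised what I could, but the problem is still unresolved.
+I posted yesterday, but the question read as if I were handing off the whole task, so I revised it as far as I could.
+However, an error now occurs and I have not been able to resolve it.
 I am a beginner and would appreciate your guidance.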

Reviewed the code myself (an error then appeared, raising a separate problem)

2019/04/11 07:45

Posted

nasu0922

Score 17

test CHANGED Viewed

	@@ -1 +1 @@
1	- I want to web-scrape data ~~across multiple URLs~~ within a site.
1	+ I want to web-scrape specific data from a site.

test CHANGED Viewed

@@ -1,156 +1,224 @@
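 ### Background / what I want to achieve
-I want to scrape the basic information and machine information for every pachinko hall on dmm.ぱちタウン.
+I want to scrape every pachinko hall on dmm.ぱちタウン.
+I posted yesterday and revised what I could, but the problem is still unresolved.
+I am a beginner and would appreciate your guidance.
 ### Problem / error message
-I was able to scrape a specific shop, but I do not know how to scrape the data for every shop on the site.
+* The error "NameError: name 'hall_info' is not defined" occurs.
-Example: I want to fetch the data in order, from shops in Hokkaido → Aomori → ... → Okinawa
+* I would like to confirm whether the scraping works correctly once the error is resolved.
-#Basic shop information
+#Code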
 ```python3
-import re
 import requests
+import logging
 from bs4 import BeautifulSoup
-url = 'https://p-town.dmm.com/shops/tokyo/12670'
-r = requests.get(url)
-soup = BeautifulSoup(r.text, 'lxml')
-data = {}
-for tr in soup.select('table[class="default-table"] tr'):
-    name = tr.th.text
-    if name == '住所':
-        # Strip leading and trailing whitespace with strip=True
-        value = tr.p.get_text(strip=True)
-    elif name == '新台':
-        # Remove spaces: str.replace(old, new)
-        items = [a.text.replace(' ', '') for a in tr.find_all('p')]
-        value = ''.join(items)
-    else:
-        value = tr.get_text(strip=True)
-    # Remove unwanted characters: re.sub(pattern, replacement, string)
-    value = re.sub('[\u3000\n]', '', value)
-    data[name] = value
-from pprint import pprint
-# Pretty-print the data
-pprint(data)
+if __name__ == "__main__":
+    # Define the log format
+    formatter = '%(asctime)s : %(levelname)s : %(message)s'
+    # Configure logging (INFO level, written to logger.log)
+    #logging.basicConfig(level=logging.DEBUG)
+    logging.basicConfig(format=formatter, filename='logger.log', level=logging.INFO)
+    #logging.basicConfig(level=logging.DEBUG, format=formatter)
+    base_url = 'https://p-town.dmm.com'
+    target_url = '/'
+    r = requests.get(base_url + target_url)         # Fetch the page from the web with requests
+    soup = BeautifulSoup(r.text, 'lxml') # Parse the HTML so elements can be extracted
+    selector = 'body > div.o-layout > div > div.o-container > main > section.default-box.-shop > div > div li'
+    # Loop over prefectures
+    for elem1 in soup.select(selector):
+        string_ = elem1.text
+        target_url = elem1.next_element.attrs.get('href')
+        area_name = target_url.rsplit('/', 1)[1]
+        #print(area_name)
+        logging.info('%s %s', 'test:', string_ + ':' + base_url + target_url)
+        r = requests.get(base_url + target_url)
+        soup= BeautifulSoup(r.text, 'lxml')
+        selector = 'body > div.o-layout > div > div > main > section:nth-child(3) li'
+        num = 0
+        # Loop over municipalities
+        for elem2 in soup.select(selector):
+            target_url = elem2.next_element.attrs.get('href')
+            city_id = target_url.rsplit('/', 1)[1]
+            print(elem2.text + ':' + base_url + target_url)
+            logging.info('%s %s', 'test:', elem2.text + ':' + base_url + target_url)
+            r = requests.get(base_url + target_url)
+            soup = BeautifulSoup(r.text, 'lxml')
+            selector = 'body > div.o-layout > div > div.o-container > main > section li'
+            nextpage = True
+            while nextpage:
+                # Check whether a next page exists
+                for elem3 in soup.select(selector):
+                    if elem3.attrs.get('class')[0] == 'item':
+                        if elem3.text == '>':
+                            if elem3.next.attrs.get('href') is not None:
+                                nextpage = True
+                                break
+                        else:
+                            nextpage = False
+                # Loop over the listed halls
+                for elem3 in soup.select(selector):
+                    if elem3.attrs.get('class')[0] == 'unit':
+                        # Collect hall information
+                        num += 1
+                        target_url = elem3.next_element.attrs.get('href')
+                        hall_id = target_url.rsplit('/', 1)[1]
+                        r2 = requests.get(base_url + target_url)
+                        soup2 = BeautifulSoup(r2.text, 'lxml')
+                        # Get the hall name
+                        selector2 = 'body > div.o-layout > div > div.o-container > main > div:nth-child(1) > div > h1'
+                        hall_name = soup2.select(selector2)[0].text
+                        # Get the hall's basic information
+                        for tr in soup.select('table[class="default-table"] tr'):
+                            name = tr.th.text
+                            if name == '住所':
+                                # Strip leading and trailing whitespace with strip=True
+                                value = tr.p.get_text(strip=True)
+                            elif name == '新台':
+                                # Remove spaces: str.replace(old, new)
+                                items = [a.text.replace(' ', '') for a in tr.find_all('p')]
+                                value = ''.join(items)
+                            else:
+                                value = tr.get_text(strip=True)
+                                # Remove unwanted characters: re.sub(pattern, replacement, string)
+                                value = re.sub('[\u3000\n]', '', value)
+                                hall_info = value
+                        print(str(num) + '[' + hall_id + ']:' + hall_name + ':' + hall_info )
+                        logging.info('%s %s', str(num) + '[' + hall_id + ']:' + hall_name + ':' + hall_info + ':', base_url + target_url)
+                        #selector2 = 'body > div.o-layout > div > div.o-container > main > div:nth-child(2) > div tr'
+                        #temp = soup2.select(selector2)
+                        #for elem4 in soup2.select(selector2):
+                        #    print(elem4.text)
+                    # Load the next page; exit the loop if there is none
+                    elif elem3.attrs.get('class')[0] == 'item':
+                        if elem3.text == '>':
+                            #print(elem3.next.attrs.get('href'))
+                            if elem3.next.attrs.get('href') is not None:
+                                target_url = elem3.next.attrs.get('href')
+                                r = requests.get(target_url)
+                                soup = BeautifulSoup(r.text, 'lxml')
+                            else:
+                                nextpage = False
+                            break
 ```
-#Machine information
-```python3
-import re
-import requests
-from bs4 import BeautifulSoup
-url = 'https://p-town.dmm.com/shops/tokyo/12670'
-r = requests.get(url)
-soup = BeautifulSoup(r.text, 'lxml')
-#from urllib.parse import urljoin
-#base_url = 'https://p-town.dmm.com'
-data = {}
-for ul in soup.select('ul[class="list-machinesettings"]'):
-    machine_type = 'パチ'
-    if 'パチ' in ul.h4.text:
-        machine_type = 'パチ'
-    elif 'スロ' in ul.h4.text:
-        machine_type = 'スロ'
-    machines = []
-    for li in ul.select('li[class="item"]'):
-        name = li.select_one('div[class="text"]').get_text(strip=True)
-        num = li.select_one('div[class="number"]').get_text(strip=True)
-        #link = urljoin(base_url, li.a['href']) if li.a else None
-        #machines.append([name, num, link])
-        machines.append([name, num])
-    data[machine_type] = machines
-from pprint import pprint
-pprint(data)
-```
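 ### What I tried
-For a single shop it is enough to specify the URL, but for shops nationwide,
+Because it is a NameError, I checked the name at the relevant location, but found nothing wrong.
-https://p-town.dmm.com/shops/hokkaido
-shows that I need to look at the URL path after "shops" (region and shop number); I understand that much,
-but I do not know how to structure the code.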
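
A possible reading of the `NameError` reported in this revision, offered as a sketch rather than a confirmed diagnosis: in the posted loop `hall_info` is assigned only inside the `else` branch, and the loop iterates over `soup` (the listing page) rather than `soup2` (the hall page), so the assignment may never run before the `print`. The revision also drops `import re` while still calling `re.sub`. The functions `collect_unsafe` and `collect_safe` and the sample row are hypothetical stand-ins for the scraped table rows; no network access is involved.

```python
import re


def collect_unsafe(rows):
    """Mirror the shape of the posted loop: hall_info is bound only in the else branch."""
    for name, value in rows:
        if name == '住所':
            value = value.strip()
        else:
            hall_info = re.sub('[\u3000\n]', '', value)
    # Fails when the loop body never reached the else branch.
    return hall_info


def collect_safe(rows):
    """Bind the result before the loop and store every row under its header."""
    hall_info = {}
    for name, value in rows:
        hall_info[name] = re.sub('[\u3000\n]', '', value.strip())
    return hall_info


if __name__ == "__main__":
    try:
        collect_unsafe([])  # an empty selection, as when the selector matches nothing
    except NameError as exc:  # UnboundLocalError is a subclass of NameError
        print('unsafe version failed:', type(exc).__name__)
    # Hypothetical sample row, not real site data.
    print(collect_safe([('住所', ' A\u3000B\n')]))
```

Binding `hall_info` before the loop means the later `print` always has a value to work with, even when a page has no matching table rows.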

The previous edit broke the code, so I edited it again.

2019/04/11 07:43

Posted

nasu0922

Score 17

test CHANGED Viewed

File without changes

test CHANGED Viewed

@@ -10,7 +10,7 @@
 Example: I want to fetch the data in order, from shops in Hokkaido → Aomori → ... → Okinawa
-```
 #Basic shop information

Removed one code block and edited "Problem" and "What I tried".

2019/04/10 06:31

Posted

nasu0922

Score 17

test CHANGED Viewed

	@@ -1 +1 @@
1	- I want to web-scrape data ~~for all prefectures~~ within a specific site.
1	+ I want to web-scrape data across multiple URLs within a site.

test CHANGED Viewed

@@ -1,200 +1,18 @@
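 ### Background / what I want to achieve
-Previously, I received guidance here on how to scrape the basic information and machine information
-for a specific pachinko hall from dmm.ぱちタウン. (Basic shop information / machine information)
-In separate code I can already extract shop IDs and shop names from the site, but ultimately
-I want to scrape the basic information and machine information for every pachinko hall on the site.
+I want to scrape the basic information and machine information for every pachinko hall on dmm.ぱちタウン.
 ### Problem / error message
-(1) Shop ID and shop name
+I was able to scrape a specific shop, but I do not know how to scrape the data for every shop on the site.
-(2) Basic shop information
-(3) Machine information
-→ I want to combine all of the items above for every pachinko hall on the site.
-→ I would like advice on whether to stitch the existing code together or whether cleaner code can be written.
-### (1) Shop ID & shop name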
-```Python3
-import requests
-import logging
-from bs4 import BeautifulSoup
-if __name__ == "__main__":
-    # Define the log format
-    formatter = '%(asctime)s : %(levelname)s : %(message)s'
-    # Configure logging (INFO level, written to logger.log)
-    #logging.basicConfig(level=logging.DEBUG)
-    logging.basicConfig(format=formatter, filename='logger.log', level=logging.INFO)
-    #logging.basicConfig(level=logging.DEBUG, format=formatter)
-    base_url = 'https://p-town.dmm.com'
-    target_url = '/'
-    r = requests.get(base_url + target_url)         # Fetch the page from the web with requests
-    soup = BeautifulSoup(r.text, 'lxml') # Parse the HTML so elements can be extracted
-    selector = 'body > div.o-layout > div > div.o-container > main > section.default-box.-shop > div > div li'
-    # Loop over prefectures
-    for elem1 in soup.select(selector):
-        string_ = elem1.text
-        target_url = elem1.next_element.attrs.get('href')
-        area_name = target_url.rsplit('/', 1)[1]
-        #print(area_name)
-        logging.info('%s %s', 'test:', string_ + ':' + base_url + target_url)
-        r = requests.get(base_url + target_url)
-        soup= BeautifulSoup(r.text, 'lxml')
-        selector = 'body > div.o-layout > div > div > main > section:nth-child(3) li'
-        num = 0
-        # Loop over municipalities
-        for elem2 in soup.select(selector):
-            target_url = elem2.next_element.attrs.get('href')
-            city_id = target_url.rsplit('/', 1)[1]
-            print(elem2.text + ':' + base_url + target_url)
-            logging.info('%s %s', 'test:', elem2.text + ':' + base_url + target_url)
-            r = requests.get(base_url + target_url)
-            soup = BeautifulSoup(r.text, 'lxml')
-            selector = 'body > div.o-layout > div > div.o-container > main > section li'
-            nextpage = True
-            while nextpage:
-                # Check whether a next page exists
-                for elem3 in soup.select(selector):
-                    if elem3.attrs.get('class')[0] == 'item':
-                        if elem3.text == '>':
-                            if elem3.next.attrs.get('href') is not None:
-                                nextpage = True
-                                break
-                        else:
-                            nextpage = False
-                # Loop over the listed halls
-                for elem3 in soup.select(selector):
-                    if elem3.attrs.get('class')[0] == 'unit':
-                        # Collect hall information
-                        num += 1
-                        target_url = elem3.next_element.attrs.get('href')
-                        hall_id = target_url.rsplit('/', 1)[1]
-                        r2 = requests.get(base_url + target_url)
-                        soup2 = BeautifulSoup(r2.text, 'lxml')
-                        # Get the hall name
-                        selector2 = 'body > div.o-layout > div > div.o-container > main > div:nth-child(1) > div > h1'
-                        hall_name = soup2.select(selector2)[0].text
-                        print(str(num) + '[' + hall_id + ']:' + hall_name)
-                        logging.info('%s %s', str(num) + '[' + hall_id + ']:' + hall_name + ':', base_url + target_url)
-                        #selector2 = 'body > div.o-layout > div > div.o-container > main > div:nth-child(2) > div tr'
-                        #temp = soup2.select(selector2)
+Example: I want to fetch the data in order, from shops in Hokkaido → Aomori → ... → Okinawa
-                        #for elem4 in soup2.select(selector2):
-                        #    print(elem4.text)
-                    # Load the next page; exit the loop if there is none
-                    elif elem3.attrs.get('class')[0] == 'item':
-                        if elem3.text == '>':
-                            #print(elem3.next.attrs.get('href'))
-                            if elem3.next.attrs.get('href') is not None:
-                                target_url = elem3.next.attrs.get('href')
-                                r = requests.get(target_url)
-                                soup = BeautifulSoup(r.text, 'lxml')
-                            else:
-                                nextpage = False
-                            break
 ```
-#(2) Basic shop information
+#Basic shop information
 ```python3
@@ -256,7 +74,7 @@
 ```
-#(3) Machine information
+#Machine information
 ```python3
@@ -322,15 +140,17 @@
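 ### What I tried
-I gathered information from books and the web, but there are parts of Python's overall flow that I do not understand,
+For a single shop it is enough to specify the URL, but for shops nationwide,
-and I am struggling to make progress.
-I apologize if this question is off the mark.
+https://p-town.dmm.com/shops/hokkaido
+shows that I need to look at the URL path after "shops" (region and shop number); I understand that much,
-Thank you in advance.
+but I do not know how to structure the code.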
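
The URL structure described here can be handled with the standard library. The sketch below assumes that shop pages follow the `/shops/<area>/<shop_id>` shape seen in `https://p-town.dmm.com/shops/tokyo/12670`, and that links on the site may be either relative or absolute; neither assumption is confirmed by the post. `absolute_url` and `split_shop_url` are hypothetical helper names.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = 'https://p-town.dmm.com'


def absolute_url(href):
    """Resolve a link against the site root; absolute links pass through unchanged."""
    return urljoin(BASE_URL, href)


def split_shop_url(url):
    """Return (area, shop_id) for a path shaped like /shops/<area>/<shop_id>, else None."""
    parts = urlparse(url).path.strip('/').split('/')
    if len(parts) == 3 and parts[0] == 'shops':
        return parts[1], parts[2]
    return None


if __name__ == "__main__":
    print(absolute_url('/shops/hokkaido'))
    print(split_shop_url('https://p-town.dmm.com/shops/tokyo/12670'))
```

Resolving every `href` through one helper avoids having to decide at each call site whether to prepend the base URL.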