[Python 3 scraping] I want to collect every bottom-level URL, including all paginated pages, and extract specific information
Withdrawn user

Question

### Background / what I want to achieve

I'm a Python beginner. I got stuck while trying to implement a scraper in Python.

https://www.judo-ch.jp/sekkotsuinsrch/

I want to fetch every bottom-level page of the site above (e.g. https://www.judo-ch.jp/sekkotsuinsrch/13/13111/030318/) and extract the following two items from each:

- Company name
- URL

Eventually I will write them to a CSV file. (This question is about the step before the CSV output.)

### Question

[Function 1] Get the pages that list the bottom-level pages (e.g. https://www.judo-ch.jp/sekkotsuinsrch/01/list/2/)

[Function 2] From each list page (the URL above), get the links to the bottom-level pages

Each of these worked fine when written as separate code. However, when I wrote code that joins [Function 1] and [Function 2], nothing was printed.

###### [Function 1] code → works without problems

```jupyter
from bs4 import BeautifulSoup
import sys
import time
import requests
import re

time.sleep(2.0)
num = 2
i = 1
r = str(i).zfill(2)
while i < 47:
    url = 'https://www.judo-ch.jp/sekkotsuinsrch/' + str(r) + '/list/' + str(num) + '/'
    res = requests.get(url)
    if res.status_code == 200:
        print(url)
        num += 1
    else:
        i += 1
```

###### [Function 2] code → works without problems

```jupyter
url = 'https://www.judo-ch.jp/sekkotsuinsrch/13/list/2/'
res = requests.get(url)
soup = BeautifulSoup(res.text,'lxml')
links = soup.findAll('a', class_="fa_name")
for link in links:
    print(link.get('href'))
```

###### [Function 1] + [Function 2] → nothing is printed

```jupyter
from bs4 import BeautifulSoup
import sys
import time
import requests
import re

time.sleep(2.0)
num = 2
i = 1
r = str(i).zfill(2)
while i < 47:
    url = 'https://www.judo-ch.jp/sekkotsuinsrch/' + str(r) + '/list/' + str(num) + '/'
    res = requests.get(url)
    if res.status_code == 200:
        soup = BeautifulSoup(res.text,'lxml')
        links = soup.findAll('a', class_="fa_name")
        for link in links:
            print(link.get('href'))
        num += 1
    else:
        i += 1
```

Once the above is solved, I plan to add [Function 3] as the last step.

###### [Function 3] Get the "company name" and "URL" from a bottom-level page → works without problems

```jupyter
def get_soup(url):
    """Get the soup for a URL"""
    html = requests.get(url)
    return BeautifulSoup(html.content, "html.parser")

def scraping_gh():
    """Get the Software Design information"""
    soup = get_soup("https://www.judo-ch.jp/sekkotsuinsrch/13/13201/030637/")
    # Name of the clinic
    res_p = soup.find("span", class_="name")
    res = res_p.find(text=re.compile(""))
    print(res.string)
    # Homepage URL
    res_p = soup.find("p", class_="lnk_url")
    res = res_p.find(text=re.compile(""))
    print(res.string)

scraping_gh()
```

### What I tried

I tried changing `while i < 47:` to `for i in range(47):` and various other things within my understanding, but none of them worked. I would appreciate it if you could tell me what is wrong in the step where [Function 1] and [Function 2] are combined. Thank you in advance.

Accepted Answer

I solved it with the code below.

```python
import requests
from bs4 import BeautifulSoup

BASE = 'https://www.judo-ch.jp/sekkotsuinsrch/{}/list/{}/'


def print_detail(url):
    """Fetch one bottom-level page and print the fields that are present."""
    detail = BeautifulSoup(requests.get(url).content, "html.parser")
    # p.lnk_url is the homepage URL; the name is looked up in both
    # span.name and dd.name, and whichever exists is printed.
    for tag_name, class_name in (("p", "lnk_url"), ("span", "name"), ("dd", "name")):
        res_p = detail.find(tag_name, class_name)
        if res_p is not None:
            print(res_p.text)


for i in range(1, 48):  # two-digit prefecture codes 01 to 47
    zero_i = str(i).zfill(2)
    for num in range(1, 300):  # list pages; stop at the first missing one
        res = requests.get(BASE.format(zero_i, num))
        if res.status_code != 200:
            break
        soup = BeautifulSoup(res.content, "html.parser")
        # Links to bottom-level pages appear inside h3.shisetsu_name_s,
        # inside h3.shisetsu_name, or directly as a.fa_name.
        links = [h3.find("a") for h3 in soup.find_all("h3", "shisetsu_name_s")]
        links += [h3.find("a") for h3 in soup.find_all("h3", "shisetsu_name")]
        links += soup.find_all("a", "fa_name")
        for link in links:
            if link is not None and link.get("href"):
                print_detail(link.get("href"))
```
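One difference from the combined snippet in the question is worth spelling out. The code above rebuilds the zero-padded code (`zero_i`) and restarts the page counter for every value of `i`, whereas the question's snippet computes `r` once before the loop and never resets `num`, so after the first failed request every remaining iteration asks for the same URL. The code above also looks for three kinds of links rather than only `a.fa_name`, which suggests (though I have not confirmed it) that not every list page uses that class. The following is a minimal offline sketch of the URL-building part only; `list_page_url` and `first_pages` are hypothetical helper names, not part of the original code, and no request is sent.

```python
BASE = 'https://www.judo-ch.jp/sekkotsuinsrch/{}/list/{}/'


def list_page_url(region, page):
    """Build a list-page URL; the region code is zero-padded on every call."""
    return BASE.format(str(region).zfill(2), page)


def first_pages(regions, pages_per_region):
    """Return the first few list-page URLs per region (hypothetical helper).

    The page number restarts at 1 for each region, which the question's
    combined snippet did not do.
    """
    urls = []
    for region in regions:
        for page in range(1, pages_per_region + 1):
            urls.append(list_page_url(region, page))
    return urls


# Offline check: two pages each for regions 1 and 2.
for url in first_pages([1, 2], 2):
    print(url)
```

A real crawl would still need the `status_code` check from the code above to decide where each region's pages end.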

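The question mentions that the results will eventually go into a CSV file. As a rough sketch of that last step, assuming the name and URL have already been collected as pairs instead of being printed, the standard `csv` module is enough. `write_clinics` and the sample rows are made up for illustration; they are not from the original post.

```python
import csv
import io


def write_clinics(rows, fileobj):
    """Write (name, url) pairs as CSV with a header row."""
    writer = csv.writer(fileobj)
    writer.writerow(["name", "url"])
    for name, url in rows:
        # Scraped text often carries surrounding whitespace, so strip it.
        writer.writerow([name.strip(), url.strip()])


# Offline demo with invented rows; nothing is fetched.
sample = [
    ("Example Clinic A", " https://example.com/a \n"),
    ("Example Clinic B", ""),  # a page without a homepage link
]
buffer = io.StringIO()
write_clinics(sample, buffer)
print(buffer.getvalue())
```

To write a real file, the same function can be given a file opened with `open("clinics.csv", "w", newline="", encoding="utf-8")`; the `newline=""` argument is what the `csv` documentation recommends so that line endings are not doubled on Windows.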