Web scraping with beautifulsoup4: paging through a listing to collect data, but the loop ends with pages left unvisited

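Background / what I want to achieve

I want to scrape slot machine information from DMMぱちタウン.
→ Retrieve information for every slot machine (maker ID, maker name, machine ID, machine name, through to the machine overview)
<Example of the retrieved output>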
1.3491:https://p-town.dmm.com/machines/3491
[['3491', 'タマどき！', '4', 'JPS', '97.2%〜105.6%', '2019/10上旬予定', 'コンサートホールグループでのみ打つことができるオリジナルパチスロが、この『タマどき！』。擬似ボーナスを搭載したAT機となっており、リール左右の花が
光ればボーナス確定だ。ボーナス後の32G間はボーナスの引き戻しに期待でき、最大ループ率は約90%と大量出玉への期待が膨らむ仕様となっている。花の光り方で滞在モードを示唆しているので、ここにも注目しておこう。', '']] ... (and so on)

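### What works
As in the example above, retrieving the slot machine information works.

### What does not work
The loop runs, but it stops at the 400th machine (the last machine on page 20).

What I want to achieve

I want to traverse all pages and retrieve information for every slot machine.
No error is raised, but after 400 machines have been retrieved the loop exits and the program ends. I have not been able to find a fix and am stuck.

Relevant source code (unrelated parts omitted)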

python3
import requests
import random
import time
import re
from bs4 import BeautifulSoup

if __name__ == "__main__":

    # Fetch the first listing page with requests
    base_url = 'https://p-town.dmm.com/machines'
    target_url = '/slot'
    r = requests.get(base_url + target_url)
    soup = BeautifulSoup(r.text, 'lxml')

    # Column definitions (kept in Japanese because they are matched against the table headers):
    # machine ID, machine name, maker ID, maker name, payout rate, release date, overview, retrieval timestamp
    col_list = ['機種ID', '機種名', 'メーカーID', 'メーカー名', '機械割', '導入開始日', '機種概要', '取得日時']
    num = 0
    csv_list = []
    # Loop over machine IDs
    for link_ in soup.find_all('a', class_='link', href=re.compile(r'/machines/\d+')):
        machine_url = link_.attrs.get('href')
        machine_id = machine_url.rsplit('/', 1)[1]
        selector = 'body > div.o-layout > div > div > main > section li'
        nextpage = True
        while nextpage:
            # Check whether there is a next page
            for pages_ in soup.select(selector):
                if pages_.attrs.get('class')[0] == 'item':
                    if pages_.text == '>':
                        if pages_.get('href') is not None:
                            nextpage = True
                            break
                    else:
                        nextpage = False
            # Loop over machine entries
            for pages_ in soup.select(selector):
                machine_list = ['','','','','','','','']
                if pages_.attrs.get('class')[0] == 'unit':
                    num += 1
                    target_url = pages_.next_element.attrs.get('href')
                    machine_id = target_url.rsplit('/', 1)[1]
                    machine_list[0] = machine_id
                    time.sleep(random.randint(1, 3))   # sleep for 1 to 3 seconds
                    r2 = requests.get(base_url + '/' + machine_id)
                    soup2 = BeautifulSoup(r2.text, 'lxml')
                    print(str(num)+ '.'+ machine_id + ':' + base_url + '/' + machine_id)
                    # Get the machine name
                    for title in soup2.select('h1[class="title"]'):
                        machine_name = title.get_text(strip=True)
                        machine_list[1] = machine_name
                        for tr in soup2.select('table[class="default-table"] tr'):
                            # Pop the first 'th' of the 'tr' to get the field name
                            th_ = tr.find_all("th").pop(0).get_text(strip=True).upper()
                            # Pop the first 'td' of the 'tr' to get the value
                            td_ = tr.find_all("td").pop(0).get_text(strip=True)
                            if th_ == 'メーカー名':
                                try:
                                    makers = tr.find_all('a', class_='textlink').pop(0).attrs['href']
                                    makers = makers.rsplit('/', 1)[1]
                                except IndexError:
                                    makers = -1
                                machine_list[2] = makers
                            machine_list[col_list.index(th_)] = td_
                        csv_list.append(machine_list)
                        #print(csv_list)
                # Load the next page; end the loop if there is none
                elif pages_.attrs.get('class')[0] == 'item':
                    if pages_.text == '>':
                        if pages_.next_element.attrs.get('href') is not None:
                            target_url = pages_.next_element.attrs.get('href')
                            r = requests.get(target_url)
                            soup = BeautifulSoup(r.text, 'lxml')
                        else:
                            nextpage = False
                        break

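What I tried

Suspecting that something was wrong with page 20 of the slot machine listing, I started fetching from page 20. It looped without any problem,
so the page itself does not seem to be the cause.
■ Reference: URL of page 20
https://p-town.dmm.com/machines/slot?page=20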
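
One possible reading of the code above would explain both the stop at 400 and the fact that page 20 loads fine on its own. If the next-page check always ends with `nextpage = False`, the `while` body runs only once per iteration of the outer `for link_` loop, so each machine link found on the first page advances the listing by exactly one page. The following is a minimal simulation of that control flow only, with no network access. The function name `simulate_original`, the total of 30 pages, and the figure of 20 machine links on the first page are assumptions made for illustration.

```python
def simulate_original(total_pages, start_page=1, links_on_first_page=20):
    """Return the listing pages that get scraped, in order (control flow only)."""
    page = start_page
    scraped_pages = []
    for _ in range(links_on_first_page):   # outer "machine ID" loop
        nextpage = True
        while nextpage:
            nextpage = False               # assumed effect of the next-page check
            scraped_pages.append(page)     # every machine on this page is scraped
            if page < total_pages:
                page += 1                  # the '>' link loads the next page
    return scraped_pages


pages = simulate_original(total_pages=30)  # 30 is a hypothetical page count
print(len(pages) * 20, pages[-1])          # 400 20
```

Under these assumptions the run ends after page 20 no matter how many pages exist, and a run started at page 20 likewise moves forward one page per outer iteration.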

Since no error is raised and I cannot identify the problem, I am asking here.
I would appreciate any guidance toward solving it.

Thank you in advance.

yamato_user

2019/06/28 11:16

Please describe the following: what you want to do, what is working, and what is not working.

nasu0922

2019/06/29 04:05

I have updated the question. Thank you in advance.


1 answer

Self-resolved

After deleting the "check whether there is a next page" code, the loop worked correctly.
Sorry for the trouble.
Thank you.
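
A likely reason the check got in the way, sketched under the assumption that the pagination markup puts the `href` on an `<a>` inside each `<li class="item">` (the working next-page branch reads it from the child element, which suggests this): `pages_.get('href')` on the `<li>` itself is then always `None`, so the check never sets `nextpage = True`, while every page-number item sets it to `False`. The HTML below is a made-up stand-in, not the site's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup: the href sits on the <a>, not on the <li>.
html = """
<ul>
  <li class="item"><a href="/machines/slot?page=1">1</a></li>
  <li class="item"><a href="/machines/slot?page=2">2</a></li>
  <li class="item"><a href="/machines/slot?page=2">&gt;</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Same logic as the removed check, run against the stand-in markup.
nextpage = True
for li in soup.select("li"):
    if li.attrs.get("class")[0] == "item":
        if li.text == ">":
            if li.get("href") is not None:   # never true: the <li> has no href
                nextpage = True
                break
        else:
            nextpage = False                 # runs for every page-number item

next_li = soup.select("li")[-1]
print(nextpage)                        # False
print(next_li.get("href"))             # None
print(next_li.find("a").get("href"))   # /machines/slot?page=2
```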

python3
import requests
import random
import time
import re
from bs4 import BeautifulSoup

if __name__ == "__main__":

    # Fetch the first listing page with requests
    base_url = 'https://p-town.dmm.com/machines'
    target_url = '/slot'
    r = requests.get(base_url + target_url)
    soup = BeautifulSoup(r.text, 'lxml')

    # Column definitions (kept in Japanese because they are matched against the table headers):
    # machine ID, machine name, maker ID, maker name, payout rate, release date, overview, retrieval timestamp
    col_list = ['機種ID', '機種名', 'メーカーID', 'メーカー名', '機械割', '導入開始日', '機種概要', '取得日時']
    num = 0
    csv_list = []
    # Loop over machine IDs
    for link_ in soup.find_all('a', class_='link', href=re.compile(r'/machines/\d+')):
        machine_url = link_.attrs.get('href')
        machine_id = machine_url.rsplit('/', 1)[1]
        selector = 'body > div.o-layout > div > div > main > section li'
        nextpage = True
        while nextpage:
            # The "check whether there is a next page" block below was disabled:
            # for pages_ in soup.select(selector):
            #     if pages_.attrs.get('class')[0] == 'item':
            #         if pages_.text == '>':
            #             if pages_.get('href') is not None:
            #                 nextpage = True
            #                 break
            #         else:
            #             nextpage = False
            # Loop over machine entries
            for pages_ in soup.select(selector):
                machine_list = ['','','','','','','','']
                if pages_.attrs.get('class')[0] == 'unit':
                    num += 1
                    target_url = pages_.next_element.attrs.get('href')
                    machine_id = target_url.rsplit('/', 1)[1]
                    machine_list[0] = machine_id
                    time.sleep(random.randint(1, 3))   # sleep for 1 to 3 seconds
                    r2 = requests.get(base_url + '/' + machine_id)
                    soup2 = BeautifulSoup(r2.text, 'lxml')
                    print(str(num)+ '.'+ machine_id + ':' + base_url + '/' + machine_id)
                    # Get the machine name
                    for title in soup2.select('h1[class="title"]'):
                        machine_name = title.get_text(strip=True)
                        machine_list[1] = machine_name
                        for tr in soup2.select('table[class="default-table"] tr'):
                            # Pop the first 'th' of the 'tr' to get the field name
                            th_ = tr.find_all("th").pop(0).get_text(strip=True).upper()
                            # Pop the first 'td' of the 'tr' to get the value
                            td_ = tr.find_all("td").pop(0).get_text(strip=True)
                            if th_ == 'メーカー名':
                                try:
                                    makers = tr.find_all('a', class_='textlink').pop(0).attrs['href']
                                    makers = makers.rsplit('/', 1)[1]
                                except IndexError:
                                    makers = -1
                                machine_list[2] = makers
                            machine_list[col_list.index(th_)] = td_
                        csv_list.append(machine_list)
                        #print(csv_list)
                # Load the next page; end the loop if there is none
                elif pages_.attrs.get('class')[0] == 'item':
                    if pages_.text == '>':
                        if pages_.next_element.attrs.get('href') is not None:
                            target_url = pages_.next_element.attrs.get('href')
                            r = requests.get(target_url)
                            soup = BeautifulSoup(r.text, 'lxml')
                        else:
                            nextpage = False
                        break
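
For comparison, here is a minimal sketch of the same traversal written as a single loop that follows the `>` link until it no longer has an `href`, without the outer loop over machine links. With the outer loop still in place, the `while` loop appears to be re-entered once per machine link after the last page has been reached, which may re-scrape that page; this is untested against the site. The function `collect_machine_ids`, its `fetch` parameter, the short `li.unit` / `li.item` selectors and the fake pages are all assumptions for illustration, and the real page may need the longer selector used above.

```python
from bs4 import BeautifulSoup


def collect_machine_ids(fetch, first_url):
    """Walk the listing pages once, following the '>' link until it is missing."""
    ids = []
    url = first_url
    while url is not None:
        soup = BeautifulSoup(fetch(url), "html.parser")
        # Machine entries: <li class="unit"><a href=".../machines/<id>">
        for a in soup.select("li.unit > a[href]"):
            ids.append(a["href"].rsplit("/", 1)[1])
        # Next-page link: the <a> inside the pagination item whose text is '>'
        url = None
        for a in soup.select("li.item > a[href]"):
            if a.get_text(strip=True) == ">":
                url = a["href"]
                break
    return ids


# Two fake pages standing in for the real site (markup is assumed).
FAKE_PAGES = {
    "page1": '<ul><li class="unit"><a href="/machines/3491">A</a></li>'
             '<li class="unit"><a href="/machines/3492">B</a></li></ul>'
             '<ul><li class="item"><a href="page2">&gt;</a></li></ul>',
    "page2": '<ul><li class="unit"><a href="/machines/3493">C</a></li></ul>'
             '<ul><li class="item"><span>&gt;</span></li></ul>',
}

print(collect_machine_ids(FAKE_PAGES.__getitem__, "page1"))
# ['3491', '3492', '3493']
```

Against the real site, `fetch` could be something like `lambda url: requests.get(url).text`, with the same `time.sleep(...)` between requests as in the code above.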