Web scraping: moving to the next page within a search site

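I want to scrape information from a search site (*terms of use checked), but the code below, which I put together while looking things up, cannot fetch the information on the following pages. (10/17 21:48) I revised the code again. The errors are gone, but it does not move to the next page, and no content is retrieved either.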

(10/10 23:38)
I had been asking a list to extract text, so I rewrote the page-transition part, but now the program never finishes.
Thank you in advance for your help.

python (edited 10/17)
import time
from selenium import webdriver
driver = webdriver.Chrome()

driver.get('https://www.mrso.jp/searches/?redirect&view=plan')

def search(driver):
    i = 1               # loop counter / page number
    i_max = 5           # maximum number of pages to analyze
    courses_list = []
    facili_list = []
    price_list = []
    link_list = []
    next_list = []

    # Loop until the current page exceeds the specified maximum
    while i <= i_max:
        class_group = driver.find_elements_by_class_name('page-search__wrap.facility')
        # for loops that extract the titles and links and append them to the lists
        for elem in class_group:
            courses_list.append(elem.find_element_by_class_name('-name').text)
        for elem in class_group:
            facili_list.append(elem.find_element_by_class_name('-facility-name').text)
        for elem in class_group:
            price_list.append(elem.find_element_by_class_name('-price').text)
        for elem in class_group:
            link_list.append(elem.find_element_by_class_name('-link').get_attribute('href'))

        # There is only one "next" link, but search with elements on purpose. An empty list means the last page.
        for elem in class_group:
            next_list = elem.find_elements_by_class_name('-item -next')
        if next_list == []:
            i = i_max + 1
        else:
            next_list.click()
            i = i + 1               # update i
            time.sleep(3)           # wait 3 seconds
    return courses_list, facili_list, price_list, link_list    # return the title and link lists

courses_list, facili_list, price_list, link_list = search(driver)

otn

2020/10/10 08:36

Which line raises the error?

Dantesu

2020/10/10 08:40

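It is the lines below. However, they come right after a long function definition, so I suspect the error is really about something inside the function that ran.
courses_list,facili_list,price_list,link_list=search(driver)
search.quit()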

otn

2020/10/10 08:54

The error message should include line numbers, shouldn't it?

Dantesu

2020/10/10 09:23 (edited)

The output is below. It may be lines 11 and 32.

InvalidSelectorException: Message: invalid selector: Compound class names not permitted
(Session info: chrome=86.0.4240.75)
(Driver info: chromedriver=2.38.552522 (437e6fbedfa8762dec75e2c5b3ddb86763dc9dcb),platform=Windows NT 10.0.18363 x86_64)

It seems to point at lines 11 and 32.
32 courses_list,facili_list,price_list,link_list=search(driver)
11 class_group =driver.find_elements_by_class_name('page-search__wrap facility')

<Full error output>
---------------------------------------------------------------------------
InvalidSelectorException Traceback (most recent call last)
<ipython-input-5-500fd912a0d6> in <module>
     30 time.sleep(3) # wait 3 seconds
     31 return courses_list,facili_list, price_list,link_list # return the title and link lists
---> 32 courses_list,facili_list,price_list,link_list=search(driver)
     33 search.quit()

<ipython-input-5-500fd912a0d6> in search(driver)
      9 # Loop until the current page exceeds the specified maximum
     10 while i <= i_max:
---> 11 class_group =driver.find_elements_by_class_name('page-search__wrap facility')
     12 # for loops that extract the titles and links and append them to the lists
     13 for elem in class_group:

~\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py in find_elements_by_class_name(self, name)
    578 elements = driver.find_elements_by_class_name('foo')
    579 """
--> 580 return self.find_elements(by=By.CLASS_NAME, value=name)
    581
    582 def find_element_by_css_selector(self, css_selector):

~\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py in find_elements(self, by, value)
   1005 return self.execute(Command.FIND_ELEMENTS, {
   1006 'using': by,
-> 1007 'value': value})['value'] or []
   1008
   1009 @property

~\Anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py in execute(self, driver_command, params)
    319 response = self.command_executor.execute(driver_command, params)
    320 if response:
--> 321 self.error_handler.check_response(response)
    322 response['value'] = self._unwrap_value(
    323 response.get('value', None))

~\Anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py in check_response(self, response)
    240 alert_text = value['alert'].get('text')
    241 raise exception_class(message, screen, stacktrace, alert_text)
--> 242 raise exception_class(message, screen, stacktrace)
    243
    244 def _value_or_default(self, obj, key, default):

InvalidSelectorException: Message: invalid selector: Compound class names not permitted
(Session info: chrome=86.0.4240.75)
(Driver info: chromedriver=2.38.552522 (437e6fbedfa8762dec75e2c5b3ddb86763dc9dcb),platform=Windows NT 10.0.18363 x86_64)


2 answers

Best answer

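I wouldn't call it an outright mistake, but the way class_group is defined is a little off.
However you look at it, there is only one such element, so there is no need to use elements.
That also seems to be why only the first item on the page is being retrieved
(because the result has become a list).

In this case, specifying find_element_by_class_name('page-search__list') and
narrowing the target a little further makes extraction easier, and it retrieves all the information on the page.

As for finding the next-page link, there is no particular need to extract it from class_group. Getting it from the driver variable seems better.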

python
def search(driver):
    i = 1        # loop counter / page number
    i_max = 5    # maximum number of pages to analyze
    courses_list = []
    facili_list = []
    price_list = []
    link_list = []

    # Loop until the current page exceeds the specified maximum
    while i <= i_max:
        class_group = driver.find_element_by_class_name('page-search__list')
        # Each append adds one list per page, so the results are lists of lists
        courses_list.append([elem.text for elem in class_group.find_elements_by_class_name('-name')])
        facili_list.append([elem.text for elem in class_group.find_elements_by_class_name('-facility-name')])
        price_list.append([elem.text for elem in class_group.find_elements_by_class_name('-price')])
        link_list.append([elem.get_attribute('href') for elem in class_group.find_elements_by_class_name('-link')])

        next_list = driver.find_element_by_class_name('-next').click()
        i += 1
        time.sleep(3)    # wait 3 seconds

    return courses_list, facili_list, price_list, link_list    # return the title and link lists

courses_list, facili_list, price_list, link_list = search(driver)

Remarks

Regarding the extracted links: each item appears to have three elements with class="-link",
so it would be better to either change how they are extracted or handle the duplicates.
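
One way to handle the duplicated links mentioned in the remark is to drop repeats after extraction while keeping the original order. This is only a minimal sketch: the helper name `dedupe_keep_order` and the example.com URLs are made up for illustration, and it assumes the three `-link` elements of one item all carry the same href, which has not been checked against the real page.

```python
def dedupe_keep_order(urls):
    """Return urls with repeats removed, keeping the first occurrence of each."""
    # dict keys keep insertion order (Python 3.7+), so page order is preserved.
    return list(dict.fromkeys(urls))


# Hypothetical hrefs: each plan's link shows up three times in a row.
raw_links = [
    "https://example.com/plan/1",
    "https://example.com/plan/1",
    "https://example.com/plan/1",
    "https://example.com/plan/2",
    "https://example.com/plan/2",
    "https://example.com/plan/2",
]
unique_links = dedupe_keep_order(raw_links)
print(unique_links)
```

If the three links of an item turn out to point to different URLs, deduplication will not help, and selecting a more specific element would be the better route.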

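Addendum

Looking at the code more closely, I realized I had forgotten to handle the case where there is no next page during the page transition,
so please change the next_list part inside the function as follows.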

python
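        # Needs at the top of the file:
        # from selenium.common.exceptions import NoSuchElementException
        try:
            next_list = driver.find_element_by_class_name('-next').click()
            i += 1
            time.sleep(3)    # wait 3 seconds
        except NoSuchElementException:
            break            # no "next" link, so this was the last page

    return courses_list, facili_list, price_list, link_list    # return the title and link lists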
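
The stop-on-last-page logic can be tried without a browser. The sketch below is not Selenium code: `FakeSite` and `NoNextPage` are hypothetical stand-ins for the driver and for Selenium's `NoSuchElementException`, used only to show how the loop ends either at the page limit or when the "next" link is missing.

```python
class NoNextPage(Exception):
    """Hypothetical stand-in for selenium's NoSuchElementException."""


class FakeSite:
    """Hypothetical stand-in for the browser: a fixed list of result pages."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def current_items(self):
        return self.pages[self.index]

    def click_next(self):
        # Mimics find_element(...).click() failing on the last page.
        if self.index + 1 >= len(self.pages):
            raise NoNextPage("no next link on the last page")
        self.index += 1


def scrape_pages(site, max_pages):
    """Collect items page by page, stopping at max_pages or the last page."""
    collected = []
    for _ in range(max_pages):
        collected.append(site.current_items())
        try:
            site.click_next()
        except NoNextPage:
            break  # last page reached; only the expected exception is caught
    return collected


site = FakeSite([["plan A", "plan B"], ["plan C"], ["plan D"]])
print(scrape_pages(site, max_pages=5))
```

Catching one specific exception instead of using a bare `except:` keeps unrelated failures, such as a typo in a selector, from being silently treated as "last page".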

Posted 2020/10/18 11:42

Edited 2020/10/21 02:48

nto

Total score 1438

Dantesu

2020/10/21 01:18

Thank you so much!!! The code you wrote is clean and I learned a lot from it. Thank you very much for patiently answering even my basic questions again and again.

nto

2020/10/21 02:50

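Not at all. Also, when I looked at the code just now, I noticed I had forgotten to handle the case where there is no next page during the page transition, so I have just added an addendum to the answer. In this case, checking with a try statement or similar should be fine. Please take a look.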

Dantesu

2020/10/21 06:27 (edited)

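Thank you. By the way, multiple values were extracted into a single column separated by commas (the separator commas and the commas inside the prices are mixed together), so I will look into splitting them as a pandas exercise. Worst case, I can do it in Excel.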

nto

2020/10/21 06:57

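By "multiple values extracted", which one specifically contained multiple values? If you mean courses_list, facili_list, price_list, and link_list: for each page, the links or texts are collected into a list, and that list is then append()ed to the lists above, so what comes back is a two-dimensional list with one inner list per page. The code looks tidy because list comprehensions build a list per page, and that list is then added to the main list.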

nto

2020/10/21 06:58

Either way, the data can be reshaped with pandas, so if this is also for practice, please do try the reshaping.
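
As a rough illustration of the two-dimensional shape described in this thread, the per-page lists can be flattened and paired up row by row using only the standard library. The sample values are invented, and `flatten` is a hypothetical helper name. link_list is left out here because, per the remark in the answer, it may hold three entries per item and would not line up with the other lists.

```python
from itertools import chain


def flatten(pages):
    """Turn a list of per-page lists into one flat list."""
    return list(chain.from_iterable(pages))


# Hypothetical values shaped like the function's return: one inner list per page.
courses_list = [["Course A", "Course B"], ["Course C"]]
facili_list = [["Clinic 1", "Clinic 2"], ["Clinic 3"]]
price_list = [["12,100 yen", "8,800 yen"], ["33,000 yen"]]

# One (course, facility, price) tuple per plan. Each price stays a single
# string, so the comma inside it is never mistaken for a separator.
rows = list(zip(flatten(courses_list), flatten(facili_list), flatten(price_list)))
for row in rows:
    print(row)
```

Rows in this form could then be handed to pandas, for example as the data for a DataFrame with three named columns, if pandas is the tool of choice for the rest of the cleanup.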


At

11 class_group =driver.find_elements_by_class_name('page-search__wrap facility')

the error is

invalid selector: Compound class names not permitted

so perhaps it should be
class_group =driver.find_elements_by_class_name('page-search__wrap.facility')
?

When specifying multiple class names, it seems you join them with a dot (.).
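
Joining class names with dots is CSS selector syntax, so a less ambiguous option is to build a CSS selector and pass it to a CSS-selector lookup instead of a class-name lookup. The helper below is a hypothetical sketch that only does the string conversion; it does not talk to a browser.

```python
def classes_to_css(class_attr):
    """Build a CSS selector matching elements that carry every listed class."""
    names = class_attr.split()
    if not names:
        raise ValueError("at least one class name is required")
    # ".a.b" matches elements that have both class "a" and class "b".
    return "".join("." + name for name in names)


selector = classes_to_css("page-search__wrap facility")
print(selector)  # .page-search__wrap.facility
```

With the Selenium version used in this thread, the result could be passed to `driver.find_elements_by_css_selector(selector)`.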

Posted 2020/10/10 09:28

otn

Total score 85901

Dantesu

2020/10/10 09:48

Thank you. I applied the change, but now it complains that find elements cannot be called on a list. I tried things like [0] without success, and I suppose it needs a for loop, but I cannot work out how.

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-cf50d9096225> in <module>
     30 time.sleep(3) # wait 3 seconds
     31 return courses_list,facili_list, price_list,link_list # return the title and link lists
---> 32 courses_list,facili_list,price_list,link_list=search(driver)
     33 search.quit()

<ipython-input-4-cf50d9096225> in search(driver)
     21
     22 # There is only one "next" link, but search with elements on purpose. An empty list means the last page.
---> 23 if class_group.find_elements_by_class_name('-item -next') == []:
     24 i = i_max + 1
     25 else:

AttributeError: 'list' object has no attribute 'find_elements_by_class_name'

otn

2020/10/10 11:27

That's right. When find_elements_... gives you a list, you process each element inside that list.
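
The difference between the list and its elements can be seen in a small sketch. `FakeElement` is a hypothetical stand-in for a Selenium element and models only a `.text` attribute, so the snippet runs without a browser.

```python
class FakeElement:
    """Hypothetical stand-in for a web element; only .text is modelled."""

    def __init__(self, text):
        self.text = text


# A find_elements_* call returns a list, even for zero or one match.
elements = [FakeElement("Plan A"), FakeElement("Plan B")]

# The list itself has none of the element attributes.
print(hasattr(elements, "text"))  # False

# Work on each element inside the list instead.
texts = [element.text for element in elements]
print(texts)  # ['Plan A', 'Plan B']
```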

Dantesu

2020/10/10 14:36

I rewrote the page-transition part (lines 22-23) with for loops over the class_group list, but, perhaps because there are too many for loops, the processing never finishes. This is hard. I will also update the original question.

# There is only one "next" link, but search with elements on purpose. An empty list means the last page.
for elem in class_group:
    next=elem.find_elements_by_class_name('-item -next')
if next==[]:
    i = i_max + 1
else:
    # The next page URL is the href attribute of -item -next
    for elem in class_group:
        next_page = elem.find_elements_by_class_name('-item -next').get_attribute('href')
    driver.get(next_page) # move to the next page
    i = i + 1 # update i
    time.sleep(3) # wait 3 seconds
return courses_list,facili_list, price_list,link_list # return the title and link lists
courses_list,facili_list,price_list,link_list=search(driver)
search.quit()

otn

2020/10/11 00:06

> next_page = elem.find_elements_by_class_name('-item -next').get_attribute('href')
Here you are calling get_attribute('href') on a list.

Dantesu

2020/10/13 03:35 (edited)

I am very sorry, after all your kind and repeated advice, but I tried the following and still could not solve it.
・elements → element
・Catching it via the a tag. On another site I remember identifying the element by ID, but there is no ID here, so I used [8].

<Reference site>
https://www.seleniumqref.com/api/python/element_get/Python_find_element_by_tag_name.html

<Excerpt: code and error>
else:
    # The next page URL is the href attribute of -item -next
    next_page = class_group.find_element_by_tag_name('a')[8].get_attribute('href')
    driver.get(next_page) # move to the next page
    i = i + 1 # update i
    time.sleep(3) # wait 3 seconds

<Error>
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-8-593da006c575> in <module>
     34 return courses_list,facili_list, price_list,link_list # return the title and link lists
     35
---> 36 courses_list,facili_list,price_list,link_list=search(driver)
     37 search.quit()

<ipython-input-8-593da006c575> in search(driver)
     28 else:
     29 # The next page URL is the href attribute of -item -next
---> 30 next_page = class_group.find_element_by_tag_name('a')[8].get_attribute('href')
     31 driver.get(next_page) # move to the next page
     32 i = i + 1 # update i

AttributeError: 'list' object has no attribute 'find_element_by_tag_name'

otn

2020/10/13 13:39

If you get confused about what a variable holds, one option is to give variables that hold lists plural names.

Dantesu

2020/10/17 12:53

I tried naming them like ○○_list. The errors went away, but the problem is not solved, so I have revised the code in the question again.
