Extracting only specific list indices with BeautifulSoup

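Background / what I want to achieve

I am using BeautifulSoup to extract elements from a URL, and I am stuck on how to get only the elements at specific index positions.

Program overview

I want to scrape product information from the official 7-Eleven Japan website.

How the information is retrieved

Fetch the URL → extract by class → extract the a tags → extract the href values → convert the URLs from relative to absolute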
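
The final step of this pipeline, turning a relative href into an absolute URL, can be sketched with `urllib.parse.urljoin` from the standard library. This is an alternative to string replacement, not the approach used in the question's code; the base URL is the top page named in this post, and the two href values are hypothetical examples of what the page might contain.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.sej.co.jp/products/"

# Hypothetical relative hrefs, standing in for values read from <a> tags.
relative_hrefs = ["/products/a/onigiri/", "a/chukaman/"]

# urljoin resolves each href against the base URL, like a browser would.
absolute_links = [urljoin(BASE_URL, href) for href in relative_hrefs]
print(absolute_links)
```

Because `urljoin` handles both root-relative and document-relative hrefs, it avoids depending on every link starting with the same prefix.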

Processing flow

1. Extract information from the top page

https://www.sej.co.jp/products/

2. Store the links for each product category, then restrict them to the Kanto region

'https://www.sej.co.jp/products/a/onigiri/kanto/',
~ omitted ~
'https://www.sej.co.jp/products/a/chukaman/kanto/']
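
Step 2 can be sketched as follows, assuming that a `kanto/` page exists under every category URL (the site does not guarantee this). The two category URLs are derived from the links shown above; the helper name `to_kanto` is hypothetical.

```python
def to_kanto(category_url):
    """Append the 'kanto/' region segment to a category URL (hypothetical helper)."""
    # Normalise the trailing slash so the result has exactly one separator.
    return category_url.rstrip("/") + "/kanto/"


category_links = [
    "https://www.sej.co.jp/products/a/onigiri/",
    "https://www.sej.co.jp/products/a/chukaman/",
]
kanto_links = [to_kanto(url) for url in category_links]
print(kanto_links)
```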

3. Extract information from the product links that were restricted to the Kanto region

Relevant source code

Python
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from time import sleep

# Function that fetches a URL
def GetUrl(GetTheURL):
    # URL to fetch
    url = GetTheURL

    global r

    # Send an HTTP request for url and store the response in the global r
    r = requests.get(url)

    # Use the encoding detected from the response body
    r.encoding = r.apparent_encoding

    # Wait between requests
    sleep(2)


# Function that extracts elements by class from the fetched page
def GetSoupClass(GetClass, value):

    soup = BeautifulSoup(r.text, 'html.parser')

    global contents
    content = []

    # value selects the lookup: 0 = find, 1 = find_all, anything else = one indexed element
    if value == 0:
        contents = soup.find(class_= GetClass)
    elif value == 1:
        contents = soup.find_all(class_= GetClass)
    else:
        # This is where I want only the elements at index [3, 5, 7, 9]
        contents = soup.find_all(class_= GetClass)[3]
        content.append(contents)
        contents = content

# Function that extracts the a tag from each element obtained from soup
def Find_a(FindContents):

    global ProductList
    ProductList = []

    for i in range(len(FindContents)):
        content = FindContents[i].find('a')
        ProductList.append(content)


# Function that extracts the href from each a tag obtained from soup
def GetHref(hrefValue):

    global ProductLink
    ProductLink = []

    for i in range(len(hrefValue)):
        link_ = hrefValue[i].get('href')
        ProductLink.append(link_)


# Function that converts URLs from relative to absolute form
def ReplaceURL(Before, After):

    global ProductLink

    ProductLink = [item.replace(Before, After) for item in ProductLink]

GetUrl("https://www.sej.co.jp/products/")
GetSoupClass("sideCategoryNav", 0)

# Get every a tag that holds product information from the HTML
get_a = contents.find_all('a')
GetHref(get_a)
ReplaceURL("/products", "https://www.sej.co.jp/products")

# Restrict the product categories to the Kanto region
ProductLinkKanto = []

for i in range(len(ProductLink)):
    text = ProductLink[i]
    # Append kanto/ to the end of the URL
    Newtext = re.sub('$',"kanto/",text)
    ProductLinkKanto.append(Newtext)

# Drop "this week's new products" and "next week's new products" because their contents overlap
ProductLink = ProductLinkKanto[2:18]

# For each link, store the per-category sub-listings found on that page
# If a page has no per-category sub-listings, store the link itself

ProductList = []

for i in range(len(ProductLink)):

    GetUrl(ProductLink[i])

    if i < 2:
        GetSoupClass("list_btn brn pbNested pbNestedWrapper", 1)
    elif i < 3:
        GetSoupClass("list_btn pbNested pbNestedWrapper", 1)
    elif i == 4 or i == 6 or i >= 10 and i <= 13:
        GetSoupClass("pbBlock pbBlockBase", 2)
    else:
        contents = ProductLink[i]

    ProductList.append(contents)

What I tried

The relevant part of GetSoupClass is:

GetSoupClass
    if value == 0:
        contents = soup.find(class_= GetClass)
    elif value == 1:
        contents = soup.find_all(class_= GetClass)
    else:
        # Extract only the elements at index 3, 5, 7 and 9
        contents = soup.find_all(class_= GetClass)[3]
        content.append(contents)
        contents = content

I wanted only the odd-numbered indices from this part, so I changed it to:

    for i in range(10):
        if value == 2 and i % 2 == 1:
            contents = soup.find_all(class_= GetClass)[i]
            content.append(contents)
        elif value == 1:
            contents = soup.find_all(class_= GetClass)
        else:
            contents = soup.find(class_= GetClass)
        contents = content

and ran it.

Problem / error message

IndexError Traceback (most recent call last)
<ipython-input-36-eb08a21b7137> in <module>
13 GetSoupClass("list_btn pbNested pbNestedWrapper", 1)
14 elif i == 4 or i == 6 or i >= 10 and i <= 13:
---> 15 GetSoupClass("pbBlock pbBlockBase", 2)
16 else:
17 contents = ProductLink[i]

<ipython-input-35-c1265b5f2c6f> in GetSoupClass(GetClass, value)
13 for i in range(10):
14 if value == 2 and i % 2 == 1:
---> 15 contents = soup.find_all(class_= GetClass)[i]
16 content.append(contents)
17 elif value == 1:

IndexError: list index out of range


This is the error I get. I would be grateful for any advice on how to fix it.
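
The traceback points at `soup.find_all(class_= GetClass)[i]` inside `for i in range(10)`. The loop asks for indices up to 9, so an `IndexError` is raised on any page where fewer matching elements exist than the index requested. A minimal sketch of a bounds-checked selection follows. It uses a plain list as a stand-in, on the understanding that `find_all` returns a `ResultSet`, which is a list subclass and indexes the same way; `pick_indices` and the sample data are hypothetical.

```python
def pick_indices(items, wanted):
    """Return the items at the wanted positions, skipping positions that do not exist."""
    return [items[i] for i in wanted if 0 <= i < len(items)]


# Stand-in for soup.find_all(class_=...) on a page with only six matches.
blocks = ["block0", "block1", "block2", "block3", "block4", "block5"]

selected = pick_indices(blocks, [3, 5, 7, 9])
print(selected)  # ['block3', 'block5']

# For a regular stride, slicing selects the same positions and never raises.
print(blocks[3:10:2])  # ['block3', 'block5']
```

Calling `find_all` once and selecting from its result also avoids re-parsing the same query on every loop iteration.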


### Additional information (framework/tool versions, etc.)

Python 3.8.8

### Closing

I am a programming beginner, so the code in this question is rough, but I would appreciate your help.

1 answer

Self-resolved

I was able to resolve this myself, so I am closing the question.

Posted 2021/11/22 06:47

BigCulture

Total score: 2
