Scraping only the contents inside a tag

I'm a university student and a Python beginner.
Given HTML code like the following

・・・


2 answers

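python
import json

import requests
from bs4 import BeautifulSoup

# url and headers are assumed to be defined elsewhere.
r = requests.get(url, headers=headers, timeout=3)
soup = BeautifulSoup(r.content, "html5lib")

# Get the JSON-LD block and parse it into a dict
ld_json = json.loads(soup.find("script", type="application/ld+json").text)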

You can get it with json.loads.

Fetching the JSON-LD when scraping Tabelog
https://imabari.hateblo.jp/entry/2019/11/09/215422
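
As a small follow-up sketch, this is roughly what you can do with the result of json.loads once you have it. The JSON text below is a shortened copy of the data from the question, and the variable names other than ld_json are my own for illustration.

```python
import json

# Shortened copy of the JSON-LD text from the question (illustrative only).
ld_json_text = '''
{"@context": "http://schema.org", "@type": "Restaurant",
 "name": "うなぎ 桜家",
 "address": {"@type": "PostalAddress", "addressRegion": "静岡県", "addressLocality": "三島市"},
 "aggregateRating": {"@type": "AggregateRating", "ratingCount": "800", "ratingValue": "3.79"}}
'''

# json.loads turns the JSON text into ordinary Python dicts, lists, and strings.
ld_json = json.loads(ld_json_text)

name = ld_json["name"]
region = ld_json["address"]["addressRegion"]
# ratingValue is stored as a string in this data, so convert it before doing arithmetic.
rating = float(ld_json["aggregateRating"]["ratingValue"])
# .get() returns None instead of raising KeyError when a field is missing.
telephone = ld_json.get("telephone")

print(name, region, rating, telephone)
```

Nested objects such as address become nested dicts, so they are reached with chained keys.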

Posted 2020/06/25 12:46

Edited 2020/06/25 12:48

barobaro

Total score 1286

Best answer

This loads the data into the variable data as a dictionary.

python
#
# Can be copy-pasted and run
# in interactive mode (>>>).
#
from urllib import request
from bs4 import BeautifulSoup
from pprint import pprint
url2 = "https://infallible-heisenberg-6a2b5e.netlify.app/"
response2 = request.urlopen(url2)
soup2 = BeautifulSoup(response2, features="html.parser")
response2.close()
shops2 = soup2.find_all('script', type='application/ld+json')
# print(shops2)
# eval runs the tag's text as a Python expression, which yields a dict here.
data = eval(shops2[0].string.strip())
pprint(data)

This loads the data into the variable data as a dictionary.

python
#
# Can be copy-pasted and run
# in interactive mode (>>>).
#
import re
import pprint
text = """
<script type="application/ld+json">
    {"@context":"http://schema.org","@type":"Restaurant","@id":"https://tabelog.com/shizuoka/A2205/A220501/22000127/","name":"うなぎ 桜家","image":"https://tblg.k-img.com/restaurant/images/Rvw/118641/200x200_square_118641612.jpg","address":{"@type":"PostalAddress","streetAddress":"広小路町13-2","addressLocality":"三島市","addressRegion":"静岡県","postalCode":"4110856","addressCountry":"JP"},"geo":{"@type":"GeoCoordinates","latitude":35.115425,"longitude":138.9151125},"telephone":"055-975-4520","priceRange":"￥4,000～￥4,999","servesCuisine":"うなぎ、割烹・小料理、丼もの（その他）","aggregateRating":{"@type":"AggregateRating","ratingCount":"800","ratingValue":"3.79"}}
</script>
"""
# The "+" must be escaped as "\+"; unescaped it is a regex quantifier and the pattern never matches.
pattern = r'<script type="application/ld\+json">((.|\n)*?)</script>'
# findall returns one tuple per match (one item per group); [0][0] is the outer group.
text_json = re.findall(pattern, text)[0][0].strip()
data = eval(text_json)
pprint.pprint(data)

>>> pprint.pprint(data)
{'@context': 'http://schema.org',
 '@id': 'https://tabelog.com/shizuoka/A2205/A220501/22000127/',
 '@type': 'Restaurant',
 'address': {'@type': 'PostalAddress',
             'addressCountry': 'JP',
             'addressLocality': '三島市',
             'addressRegion': '静岡県',
             'postalCode': '4110856',
             'streetAddress': '広小路町13-2'},
 'aggregateRating': {'@type': 'AggregateRating',
                     'ratingCount': '800',
                     'ratingValue': '3.79'},
 'geo': {'@type': 'GeoCoordinates',
         'latitude': 35.115425,
         'longitude': 138.9151125},
 'image': 'https://tblg.k-img.com/restaurant/images/Rvw/118641/200x200_square_118641612.jpg',
 'name': 'うなぎ 桜家',
 'priceRange': '￥4,000～￥4,999',
 'servesCuisine': 'うなぎ、割烹・小料理、丼もの（その他）',
 'telephone': '055-975-4520'}
>>>
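
One caveat worth sketching: eval only works on this data because the JSON happens to also be valid Python. JSON literals such as true, false, and null are not Python names, so eval fails on them, while json.loads handles them. The sample string below is hypothetical and not taken from Tabelog.

```python
import json

# Hypothetical JSON text containing literals that exist in JSON but not in Python.
text_json = '{"name": "sample", "open": true, "note": null}'

try:
    eval(text_json)
    eval_error = None
except NameError as exc:
    # eval treats "true" as a Python variable name, which is not defined.
    eval_error = exc

# json.loads maps true -> True and null -> None.
data = json.loads(text_json)

print(repr(eval_error))
print(data)
```

eval also executes whatever code the string contains, so for text downloaded from a website json.loads is the safer choice.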

Posted 2020/06/22 08:36

Edited 2020/06/22 15:01

nico25

Total score 830

Leon_Na

2020/06/22 09:05

Thank you for your reply. I copied, pasted, and ran your code and it worked, but would it be possible to extract only the dictionary-style list from something scraped like the following? (Sorry, my initial explanation was insufficient.)

url2 = result[0]
response2 = request.urlopen(url2)
soup2 = BeautifulSoup(response2)
response2.close()
shops2 = soup2.find_all('script',type= 'application/ld+json')
print(shops2)

>>> [<script type="application/ld+json"> {"@context":"http://schema.org","@type":"Restaurant","@id":"https://tabelog.com/shizuoka/A2205/A220501/22000127/","name":"うなぎ桜家","image":"https://tblg.k-img.com/restaurant/images/Rvw/118641/200x200_square_118641612.jpg","address":{"@type":"PostalAddress","streetAddress":"広小路町13-2","addressLocality":"三島市","addressRegion":"静岡県","postalCode":"4110856","addressCountry":"JP"},"geo":{"@type":"GeoCoordinates","latitude":35.115425,"longitude":138.9151125},"telephone":"055-975-4520","priceRange":"￥4,000～￥4,999","servesCuisine":"うなぎ、割烹・小料理、丼もの（その他）","aggregateRating":{"@type":"AggregateRating","ratingCount":"800","ratingValue":"3.79"}} </script>]

nico25

2020/06/22 09:22

I see. Am I right in understanding that you want to extract the "dictionary-style list" from the object assigned to the variable shops2?

Leon_Na

2020/06/22 13:01

Sorry for the late reply. Yes, that's the idea!

nico25

2020/06/22 13:50

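No problem, haha. Could you write a data sample showing the result you actually want? Feel free to abbreviate the tedious parts with dots (...). I can't quite tell what "dictionary-style list" specifically refers to.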

Leon_Na

2020/06/22 14:17

Thank you so much for your kindness (o*。_。)o Below are the code I wrote and its result. What I want to do is, when running the program, extract only the {something: something, something: something, ...} part, which is the content inside the <script ...></script> tag.

[Code]
shops2 = soup2.find_all('script',type= 'application/ld+json')
for shop2 in shops2:
    print(shop2)

>>> <script type="application/ld+json"> {"@context":"http://schema.org","@type":"Restaurant","@id":"https://tabelog.com/shizuoka/A2205/A220501/22000127/","name":"うなぎ桜家","image":"https://tblg.k-img.com/restaurant/images/Rvw/118641/200x200_square_118641612.jpg","address":{"@type":"PostalAddress","streetAddress":"広小路町13-2","addressLocality":"三島市","addressRegion":"静岡県","postalCode":"4110856","addressCountry":"JP"},"geo":{"@type":"GeoCoordinates","latitude":35.115425,"longitude":138.9151125},"telephone":"055-975-4520","priceRange":"￥4,000～￥4,999","servesCuisine":"うなぎ、割烹・小料理、丼もの（その他）","aggregateRating":{"@type":"AggregateRating","ratingCount":"800","ratingValue":"3.79"}} </script>

Leon_Na

2020/06/25 07:42

Sorry. I managed to solve it myself. It worked once I first stored the tag as text and then converted that text into a dictionary. Thank you for your help.

import re
import pprint
shops2 = soup2.find_all('script',type= 'application/ld+json')
text = ""
for x in list(map(str,shops2)):
    text += x
pattern = r'<script type="application/ld\+json">((.|\n)*?)</script>'
text_json = re.findall(pattern, text)[0][0].strip()
data = eval(text_json)
pprint.pprint(data)

>>>
{'@context': 'http://schema.org',
 '@id': 'https://tabelog.com/okayama/A3301/A330101/33001167/',
 '@type': 'Restaurant',
 'address': {'@type': 'PostalAddress',
             'addressCountry': 'JP',
             'addressLocality': '岡山市北区',
             'addressRegion': '岡山県',
             'postalCode': '7000824',
             'streetAddress': '内山下1-8-18'},
 'aggregateRating': {'@type': 'AggregateRating',
                     'ratingCount': '57',
                     'ratingValue': '3.67'},
 'geo': {'@type': 'GeoCoordinates',
         'latitude': 34.657428333333,
         'longitude': 133.93406166667},
 'image': 'https://tblg.k-img.com/restaurant/images/Rvw/96229/200x200_square_96229572.jpg',
 'name': 'うじょう亭',
 'priceRange': '￥15,000～￥19,999',
 'servesCuisine': 'うなぎ',
 'telephone': '086-234-1139'}
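
For comparison, here is a rough sketch of a shorter route that skips the string round trip: each tag found by find_all already exposes its inner text through .string, and json.loads can parse that directly. The html string is a shortened stand-in for the real page, and data_list is a name I made up for illustration.

```python
import json
from bs4 import BeautifulSoup

# Shortened stand-in for the downloaded page (illustrative only).
html = """
<html><head>
<script type="application/ld+json">
    {"@type": "Restaurant", "name": "うじょう亭", "servesCuisine": "うなぎ"}
</script>
</head><body></body></html>
"""

soup2 = BeautifulSoup(html, "html.parser")
shops2 = soup2.find_all("script", type="application/ld+json")

# .string is the text between <script ...> and </script>;
# json.loads ignores the surrounding whitespace and returns a dict.
data_list = [json.loads(shop2.string) for shop2 in shops2]

print(data_list)
```

This gives one dict per script tag, so a page with several JSON-LD blocks yields a list with several dicts.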

nico25

2020/06/25 09:39

It seems I misunderstood the intent of your question; sorry for the trouble. I'm glad you were able to solve it.

Leon_Na

2020/06/26 03:35

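Hello, and sorry to bother you again. Regarding your first answer, would you mind briefly explaining what the following three lines mean? I looked into it in various ways but couldn't really understand them...

pattern = r'<script type="application/ld\+json">((.|\n)*?)</script>'
text_json = re.findall(pattern, text)[0][0].strip()
data = eval(text_json)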

nico25

2020/06/26 03:55

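## Lines 1 and 2

Lines 1 and 2 use a regular expression. The following video also covers findall, which is used here, and I think it is very easy to follow.

Pythonで正規表現を扱おう | 中学生でもわかるPython入門シリーズ (Working with regular expressions in Python)
https://www.youtube.com/watch?v=ImIF5IREDlg

Regular expressions are used for simple scraping that, as in this case, can be done without BeautifulSoup, and also for things like validating input formats.

Pythonの正規表現ではじめに覚えるべき3大パターン (The three regex patterns to learn first in Python)
https://hashikake.com/RegEx

## Line 3

eval is simple: it takes a string and evaluates it as a Python expression.

```
>>> eval("print('Hello, world!')")
Hello, world!
>>>
>>> eval("print(1 + 1)")
2
>>>
```

The content of text_json is a string. Here it is being executed as Python code. It may be easier to see if you display it with print(text_json).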
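
To make lines 1 and 2 a bit more concrete, here is a minimal sketch on a tiny made-up input. It shows why the "+" in the pattern needs a backslash, why findall returns tuples when the pattern has two groups, and a variant with a single group and re.DOTALL. The names unescaped, escaped, and simpler are mine, for illustration only.

```python
import re

# Tiny made-up input with the same shape as the real page.
text = """
<script type="application/ld+json">
    {"name": "A"}
</script>
"""

# Unescaped: "d+" means "one or more d", so the literal "+" in the HTML never matches.
unescaped = r'<script type="application/ld+json">((.|\n)*?)</script>'
# Escaped: "\+" matches a literal plus sign.
escaped = r'<script type="application/ld\+json">((.|\n)*?)</script>'

no_match = re.findall(unescaped, text)
matches = re.findall(escaped, text)

# The pattern has two groups, so each match is a tuple (outer group, inner group).
# [0] picks the first match, the second [0] picks the outer group, strip() trims whitespace.
text_json = matches[0][0].strip()

# Same idea with one group: re.DOTALL lets "." match newlines too,
# and "*?" stops at the first </script> instead of the last.
simpler = re.findall(r'<script type="application/ld\+json">(.*?)</script>', text, re.DOTALL)

print(no_match)
print(text_json)
print(simpler[0].strip())
```

With a single group, findall returns plain strings rather than tuples, so only one index is needed.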

Leon_Na

2020/06/26 11:43

Thank you so much for your kindness!

nico25

2020/06/26 13:41

You're welcome :)
