Writing multiple elements stored in a variable to multiple CSV rows with pandas

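I have a question about CSV output with pandas.

What I want to do

When I write a variable that holds multiple elements extracted by scraping to CSV with pandas, I want each element to appear on its own row.

Processing flow

Get the contents of the <script></script> elements with BeautifulSoup
Split the retrieved text and keep the pieces that start with http and end with jpg
Store those pieces in img_url
Write img_url to CSV with pandas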
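
The splitting and filtering steps in this flow can be sketched on their own, without a browser. This is only a minimal sketch: the function name extract_jpg_urls and the sample_script string are hypothetical stand-ins for the text of a real <script> element.

```python
def extract_jpg_urls(script_text):
    """Return the quoted strings in script_text that look like JPEG URLs."""
    # Splitting on the double quote turns each quoted string into its own element
    parts = script_text.split('"')
    # Keep only the pieces that start with "http" and end with "jpg"
    return [p for p in parts if p.startswith("http") and p.endswith("jpg")]


# Hypothetical stand-in for the text of one <script> element
sample_script = (
    '<script>var images = ["https://hogehoge.com/hogehoge.jpg", '
    '"https://hogehoge.com/hogehoge1.jpg", "https://hogehoge.com/app.js"];</script>'
)

img_urls = extract_jpg_urls(sample_script)
print(img_urls)
```

Returning a list, rather than handling one URL at a time, keeps every match available for the later CSV step.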

Current code

Python
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path=r'C:\Users\chromedriver_win32\chromedriver.exe')
driver.get("https://hogehoge.com")

html = driver.page_source
bs = BeautifulSoup(html, "html.parser")

for script_ele in bs.find_all("script"):
    text_list = str(script_ele).split("\"")  # split on the double quote character
    for img_url in text_list:
        if img_url.startswith("http") and img_url.endswith("jpg"):
            print(img_url)
            Datef = pd.DataFrame([
                ["A", "B", img_url]],
                columns=['Title', 'Body (HTML)', 'Image Src'])
Datef.to_csv('C:/Users/csv_out.csv', encoding='cp932')

Elements stored in img_url

https://hogehoge.com/hogehoge.jpg
https://hogehoge.com/hogehoge1.jpg
https://hogehoge.com/hogehoge2.jpg
https://hogehoge.com/hogehoge3.jpg
https://hogehoge.com/hogehoge4.jpg
https://hogehoge.com/hogehoge5.jpg

Current output

Title	Body (HTML)	Image Src
A	B	https://hogehoge.com/hogehoge5.jpg

Note: for some reason, only the last element is output.

Desired output format

Title	Body (HTML)	Image Src
A	B	https://hogehoge.com/hogehoge.jpg
		https://hogehoge.com/hogehoge1.jpg
		https://hogehoge.com/hogehoge2.jpg
		https://hogehoge.com/hogehoge3.jpg
		https://hogehoge.com/hogehoge4.jpg
		https://hogehoge.com/hogehoge5.jpg

Finally

Is this kind of output possible at all? If it is, how can I do it? I would appreciate any help.

Thank you in advance.


2 answers

Best answer

Build img_url as described in bamboo-nova's answer.
After that, code like the following creates the DataFrame you are looking for.

Python
import pandas as pd

img_url = ['https://hogehoge.com/hogehoge.jpg',
           'https://hogehoge.com/hogehoge1.jpg',
           'https://hogehoge.com/hogehoge2.jpg']

# Title and Body are set only on the first row; the remaining rows stay blank
blanks = ['' for _ in range(1, len(img_url))]
df = pd.DataFrame({'Title': ['A'] + blanks, 'Body (HTML)': ['B'] + blanks, 'Image Src': img_url})
print(df)
"""
  Title Body (HTML)                           Image Src
0     A           B   https://hogehoge.com/hogehoge.jpg
1                    https://hogehoge.com/hogehoge1.jpg
2                    https://hogehoge.com/hogehoge2.jpg
"""
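
The code above stops at print, so here is a hedged sketch of the CSV step as well. It assumes you do not want the 0, 1, 2 index column in the file, which is why it passes index=False. The file name in the final comment is only an assumption.

```python
import pandas as pd

img_url = [
    "https://hogehoge.com/hogehoge.jpg",
    "https://hogehoge.com/hogehoge1.jpg",
    "https://hogehoge.com/hogehoge2.jpg",
]

# Title and Body are filled only on the first row; the other rows stay blank
blanks = [""] * (len(img_url) - 1)
df = pd.DataFrame({
    "Title": ["A"] + blanks,
    "Body (HTML)": ["B"] + blanks,
    "Image Src": img_url,
})

# With no path given, to_csv returns the CSV text; index=False omits the index column
csv_text = df.to_csv(index=False)
print(csv_text)

# To write a file instead (the path is an assumption):
# df.to_csv("csv_out.csv", index=False, encoding="cp932")
```

Blank strings are written as empty fields, so rows after the first begin with two commas followed by the URL.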

Posted 2020/01/25 05:47

can110


pasomtr

2020/01/25 07:13 (edited)

bamboo-nova, can110, thank you for your answers. Following your advice, I changed the code as follows:

```Python
for script_ele in bs.find_all("script"):
    text_list = str(script_ele).split("\"")  # split on the double quote character
    img_url = []
    for img_url in text_list:
        if img_url.startswith("http") and img_url.endswith("jpg"):
            print(img_url)
            img_url.append(img_url)
blanks = ['' for _ in range(1, len(img_url))]
Datef = pd.DataFrame({'Title': ['A'] + blanks, 'Body (HTML)': ['B'] + blanks, 'Image Src': img_url})
```

With this change I get AttributeError: 'str' object has no attribute 'append'. How can I resolve this? I apologize for my limited understanding, and I would appreciate your guidance.

can110

2020/01/25 07:20

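The part that reads img_url = [] followed by for img_url in text_list: is clearly wrong. The loop variable overwrites img_url with each element of text_list, so img_url is no longer a list by the time append is called.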
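
A minimal sketch of the same name collision, reduced to a few lines. The contents of text_list here are made up for illustration.

```python
text_list = ["https://hogehoge.com/hogehoge.jpg", "not a url"]

img_url = []                 # img_url is a list at this point
for img_url in text_list:    # the loop rebinds the same name to each string
    pass

# After the loop the name refers to the last string, not to the list
print(type(img_url).__name__)

try:
    img_url.append("x")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

Giving the list and the loop variable different names removes the collision.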

pasomtr

2020/01/25 07:37

Thank you, can110! I had overlooked that.

for script_ele in bs.find_all("script"):
    text_list = str(script_ele).split("\"")  # split on the double quote character
    image_url = []
    for img_url in text_list:
        if img_url.startswith("http") and img_url.endswith("jpg"):
            print(img_url)
            image_url.append(img_url)
blanks = ['' for _ in range(1, len(image_url))]
Datef = pd.DataFrame({'Title': ['A'] + blanks, 'Body (HTML)': ['B'] + blanks, 'Image Src': image_url})

This solved the problem! Thank you very much for walking me through it so carefully.


The cause is that the part

Datef = pd.DataFrame([
    ["A", "B", img_url]],
    columns=['Title', 'Body (HTML)', 'Image Src'])

recreates the DataFrame on every iteration, so only the last image URL is kept. As just one example, if you define image_url = [] before the for loop and then use append, like this:

for img_url in text_list:
    if img_url.startswith("http") and img_url.endswith("jpg"):
        print(img_url)
        image_url.append(img_url)

I think you will be able to collect all the URLs you need.
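
To make the difference concrete, here is a small sketch that contrasts the two patterns using plain lists. The urls list and the name last_only are assumptions made for illustration.

```python
urls = [
    "https://hogehoge.com/hogehoge.jpg",
    "https://hogehoge.com/hogehoge1.jpg",
    "https://hogehoge.com/hogehoge2.jpg",
]

# Pattern from the question: the variable is rebound on every pass,
# so only the value from the final iteration survives the loop.
for url in urls:
    last_only = [url]

# Pattern suggested here: create the list once, before the loop,
# and append inside it so every URL is kept.
image_url = []
for url in urls:
    image_url.append(url)

print(last_only)
print(image_url)
```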

Posted 2020/01/25 05:34