How to write Python results to Excel

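I am doing web scraping with Python and want to write the search keywords and the search results to Excel.

As an example of the Excel output, I have the following layout in mind.

However, when I run my current program,

I get a result like this, and the output is not laid out cleanly.

My code is below. If there is a way to fix this, I would appreciate your advice.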

python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from urllib import request
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
import openpyxl as op
import datetime
import time


def change_window(browser):
    all_handles = set(browser.window_handles)
    switch_to = all_handles - set([browser.current_window_handle])
    assert len(switch_to) == 1
    browser.switch_to.window(*switch_to)


def main():
    for i in range(1,9):
        wb = op.load_workbook('一般名称.xlsx')
        ws = wb.active
        word = ws['A'+str(i)].value

        driver = webdriver.Chrome(r'C:/chromedriver.exe')
        driver.get("https://www.pmda.go.jp/PmdaSearch/kikiSearch/")
        # look up the search box by id
        elem_search_word = driver.find_element_by_id("txtName")
        elem_search_word.send_keys(word)
        # look up the search button by name
        elem_search_btn = driver.find_element_by_name('btnA')
        elem_search_btn.click()
        change_window(driver)

        #print(driver.page_source)
        cur_url = driver.current_url
        html = driver.page_source
        soup = BeautifulSoup(html,'html.parser')
        #print(cur_url)

        has_pdf_link = False
        print(word)

        wb = op.load_workbook('URL_DATA.xlsx')
        ws = wb.active
        ws['C'+str(i)].value = word

        for a_tag in soup.find_all('a'):
            link_pdf = (urljoin(cur_url, a_tag.get('href')))
            # keep only links that end in .pdf or contain /ResultDataSetPDF/
            #print(word)

            if (not link_pdf.lower().endswith('.pdf')) and ('/ResultDataSetPDF/' not in link_pdf):
                continue
            if ('searchhelp' not in link_pdf):
                has_pdf_link = True
                print(link_pdf)
                ws['B'+str(i)].value = link_pdf

        if not has_pdf_link:
            print('False')
            ws['B'+str(i)].value = has_pdf_link

            time.sleep(2)
            time_data = datetime.datetime.today()

            ws['A'+str(i)].value = time_data

        wb.save('URL_DATA.xlsx')



if __name__ == "__main__":
    main()


1 answer

Best answer

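1. I have not debugged this, but from reading the code I think the cause is for i in range(1,9):.
The same i is reused both as the index for reading rows from 一般名称.xlsx
and as the row number (index) for writing to URL_DATA.xlsx.
Use separate variables for the two.
If you want to output row by row, storing the rows as tuples in a list would also work well.

2. The date is not written to every row because the following lines are indented one level too deep.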

Python
time.sleep(2)
time_data = datetime.datetime.today()

ws['A'+str(i)].value = time_data
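Both points can be tried out without Excel. The following is only a minimal sketch: collect_rows and the results dictionary are hypothetical stand-ins for the scraping output, and the example.com URLs are made up. It shows the timestamp being taken for every keyword (not only inside an if) and the output row number being counted separately from the input index.

```python
import datetime


def collect_rows(results):
    """Flatten {keyword: [links]} into (timestamp, link, keyword) tuples.

    `results` is a hypothetical stand-in for what the scraper returns.
    """
    rows = []
    for word, links in results.items():
        # Taken outside any `if`, so every row of this keyword gets a timestamp.
        stamp = datetime.datetime.today()
        for link in links:
            rows.append((stamp, link, word))
    return rows


# Made-up sample data (not real search results).
results = {
    "keyword-1": ["https://example.com/a.pdf", "https://example.com/b.pdf"],
    "keyword-2": ["https://example.com/c.pdf"],
}

# The output row number comes from the position in the list,
# not from the index that was used to read the input keywords.
for out_row, (stamp, link, word) in enumerate(collect_rows(results), start=1):
    print(out_row, stamp.isoformat(timespec="seconds"), link, word)
```

With this shape, a keyword that yields several links simply occupies several output rows, and nothing is overwritten.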

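I am writing this here because it would not fit in the field for additions and correction requests.

a. This is common among beginners: like you, many people keep adding more and more processing to a single function.
When some problem then shows up in the source code,
it tends to become impossible to isolate which part of the processing is responsible.
This issue is about output, so the scraping is almost certainly unrelated.
But when everything lives in the same function, "almost certainly" is not enough: the scraping might be involved after all, so it has to be investigated too.
The remedy is to split the code into functions of a reasonable size.

The part that scrapes and fetches the HTML can be written as follows.
Because the scraping is then self-contained inside the function, you no longer need to think about it while working on the rest.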

Python
def get_content(word):
    """
    Run the scraping.
    :param word: search keyword
    :return: the scraped HTML and the URL
    Note: place chromedriver.exe directly under the C drive.
    """
    driver = webdriver.Chrome(r'C:/chromedriver.exe')
    driver.get("https://www.pmda.go.jp/PmdaSearch/kikiSearch/")
    # look up the search box by id
    elem_search_word = driver.find_element_by_id("txtName")
    elem_search_word.send_keys(word)
    # look up the search button by name
    elem_search_btn = driver.find_element_by_name('btnA')
    elem_search_btn.click()
    change_window(driver)

    # print(driver.page_source)
    html = driver.page_source
    cur_url = driver.current_url
    return html, cur_url

Python
html, cur_url = get_content(word)

The part that reads the search keywords from 一般名称.xlsx could look like this (untested).

Python
def get_search_keyword():
    """
    Open the Excel file and yield the search keywords.
    """
    # for testing
    #yield "血液照射装置"
    #yield "放射性医薬品合成設備"
    from contextlib import closing
    with closing(op.load_workbook('一般名称.xlsx')) as wb:
        for i in range(1, 9):
            ws = wb.active
            yield ws['A' + str(i)].value

Python
for i, word in enumerate(get_search_keyword(), start=1):

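b. Next, answerers do not have 一般名称.xlsx or URL_DATA.xlsx in their environment, so the problem is hard to run and reproduce.
Please add some suitable sample data to the question,
or, instead of xlsx, which ties the question to a Windows environment, use a general-purpose data format such as CSV.

This question can also be rephrased as "I want to store values in a list in the layout shown in the question's image."
That version does not depend on the environment, so it is more likely to get answers.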
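As a rough illustration of the CSV suggestion, here is a minimal sketch that uses only the standard library. The function name rows_to_csv, the column headers, and the sample rows are all assumptions made up for this example; they are not taken from the question.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize (timestamp, link, keyword) rows into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header names are illustrative only.
    writer.writerow(["timestamp", "link", "keyword"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample rows; the second row leaves the keyword column empty.
sample_rows = [
    ("2018-07-23 08:06:00", "https://example.com/a.pdf", "keyword-1"),
    ("2018-07-23 08:06:00", "https://example.com/b.pdf", ""),
]

print(rows_to_csv(sample_rows))
```

Anyone with Python can run this, and the resulting text can still be opened in Excel if needed.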

◇ Bug
The False in the second column comes from this code.
has_pdf_link is a bool whose value is False at that point, and that value is written to the cell.

Python
if not has_pdf_link:
    print('False')
    ws['B'+str(i)].value = has_pdf_link

As an experiment, I rewrote it so that it uses neither selenium nor openpyxl and the output goes into a list.

Python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import datetime
import time
#import openpyxl as op


def get_content(word: str):
    # Stub: return a fixed HTML fragment and URL instead of scraping.
    return """
    <table class="SearchResultTable" id="ResultList">
    <tbody><tr>
        <th scope="col" style="width:13em" nowrap="">一般的名称</th>
        <th scope="col" style="width:15em" nowrap="">販売名</th>
        <th scope="col" style="width:15em" nowrap="">製造販売業者等</th>
        <th scope="col" style="width:13em" nowrap="">添付文書</th>
        <th scope="col" style="width:13em" nowrap="">改訂指示<br />反映履歴</th>
        <th scope="col" style="width:13em" nowrap="">審査報告書／<br />再審査報告書等</th>
        <th scope="col" style="width:13em" nowrap="">緊急安全性情報</th>
    </tr>
    <tr class="TrColor01">
        <td><div><a target="_blank" href="/PmdaSearch/kikiDetail/GeneralList/20500BZZ00241000_A_01">血液照射装置</a></div></td>
        <td><div>日立Ｘ線照射装置 ＭＢＲ−１５２０Ａ−ＴＷ</div></td>
        <td><div>製造販売／株式会社 日立メディコ</div></td>
        <td><div><a href="javascript:void(0)" onclick="detailDisp(&quot;PmdaSearch&quot; ,&quot;650053_20500BZZ00241000_A_01_01&quot;);">HTML</a><br /><a target="_blank" href="/PmdaSearch/kikiDetail/ResultDataSetPDF/650053_20500BZZ00241000_A_01_01">PDF (2007年12月19日)</a></div></td>
        <td></td>
        <td></td>
        <td></td>
    </tr>
    </tbody></table>
    """, "https://www.pmda.go.jp/PmdaSearch/kikiSearch/"


def get_search_keyword():
    # for testing
    yield "血液照射装置"
    yield "放射性医薬品合成設備"


def parse(soup, cur_url: str):
    """
    Parse the scraping result.
    """
    for a_tag in soup.find_all('a'):
        link_pdf = (urljoin(cur_url, a_tag.get('href')))
        # keep only links that end in .pdf or contain /ResultDataSetPDF/
        if (not link_pdf.lower().endswith('.pdf')) and ('/ResultDataSetPDF/' not in link_pdf):
            continue
        if 'searchhelp' not in link_pdf:
            yield True, link_pdf


def main():
    for i, word in enumerate(get_search_keyword(), start=1):
        html, cur_url = get_content(word)
        soup = BeautifulSoup(html, 'html.parser')
        output = []
        time_data = datetime.datetime.today()
        for has_pdf_link, link_pdf in parse(soup, cur_url):
            output.append([time_data, link_pdf, word])
            print(link_pdf)

        print(output)


if __name__ == "__main__":
    main()

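I did not fully understand the intended behavior of the False handling below, so that part is not properly implemented.
Judging from the image in the question, is this the kind of output you want?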

Python
if not has_pdf_link:
    print('False')
    ws['B'+str(i)].value = has_pdf_link

    time.sleep(2)
    time_data = datetime.datetime.today()

    ws['A'+str(i)].value = time_data

Python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from urllib import request
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin
import openpyxl as op
import datetime
import time


def get_search_keyword():
    """
    Open the Excel file and yield the search keywords.
    """
    # for testing
    #yield "血液照射装置"
    #yield "放射性医薬品合成設備"
    from contextlib import closing
    with closing(op.load_workbook('一般名称.xlsx')) as wb:
        for i in range(1, 9):
            ws = wb.active
            yield ws['A' + str(i)].value


def get_content(word: str) -> tuple:
    """
    Run the scraping.
    :param word: search keyword
    :return: the scraped HTML and the URL
    Note: place chromedriver.exe directly under the C drive.
    """
    def change_window(browser):
        """
        Switch to the other browser window.
        """
        all_handles = set(browser.window_handles)
        switch_to = all_handles - set([browser.current_window_handle])
        assert len(switch_to) == 1
        browser.switch_to.window(*switch_to)

    driver = webdriver.Chrome(r'C:/chromedriver.exe')
    driver.get("https://www.pmda.go.jp/PmdaSearch/kikiSearch/")
    # look up the search box by id
    elem_search_word = driver.find_element_by_id("txtName")
    elem_search_word.send_keys(word)
    # look up the search button by name
    elem_search_btn = driver.find_element_by_name('btnA')
    elem_search_btn.click()
    change_window(driver)

    # print(driver.page_source)
    html = driver.page_source
    cur_url = driver.current_url
    driver.quit()

    return html, cur_url


def parse(soup, cur_url: str):
    """
    Parse the scraping result.
    """
    for a_tag in soup.find_all('a'):
        link_pdf = (urljoin(cur_url, a_tag.get('href')))
        #print(link_pdf)
        # keep only links that end in .pdf or contain /ResultDataSetPDF/
        if (not link_pdf.lower().endswith('.pdf')) and ('/ResultDataSetPDF/' not in link_pdf):
            continue
        if 'searchhelp' not in link_pdf:
            yield True, link_pdf


def output_excel(output: list, row_index: int):
    """
    Write the rows to Excel.
    :param output: row data
    :param row_index: first row to write to
    """
    #wb = op.load_workbook('URL_DATA.xlsx')
    #ws = wb.active
    print("#" * 50)
    for i, (time_data, link_pdf, word_col) in enumerate(output, start=row_index):
        print(i, time_data, link_pdf, word_col)
        # put the code that sets the Excel cells here

    #wb.save('URL_DATA.xlsx')


def main():
    START_ROW = 0
    row_index = 1
    for word in get_search_keyword():
        html, cur_url = get_content(word)
        soup = BeautifulSoup(html, 'html.parser')
        output = []
        time_data = datetime.datetime.today()
        for i, (has_pdf_link, link_pdf) in enumerate(parse(soup, cur_url), start=START_ROW):
            # show the keyword only on the first row of each group
            word_col = word if i == START_ROW else ""
            output.append([time_data, link_pdf, word_col])

        output_excel(output, row_index)
        row_index += len(output)


if __name__ == "__main__":
    main()
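The Excel-writing step that output_excel leaves as a placeholder could be filled in roughly as follows. This is only a sketch under assumptions: the column layout (A = timestamp, B = link, C = keyword) is taken from the question's code, rows_to_cells is a hypothetical helper introduced here, and the path parameter is an addition that is not in the code above. openpyxl is imported inside the function so that the cell mapping can be checked without it.

```python
def rows_to_cells(output, row_index):
    """Map rows to cell coordinates: A = timestamp, B = link, C = keyword."""
    cells = {}
    for row, (time_data, link_pdf, word_col) in enumerate(output, start=row_index):
        cells[f"A{row}"] = time_data
        cells[f"B{row}"] = link_pdf
        cells[f"C{row}"] = word_col
    return cells


def output_excel(output, row_index, path="URL_DATA.xlsx"):
    """Write the rows into an existing workbook (requires openpyxl)."""
    import openpyxl  # imported lazily; only needed for the actual write

    wb = openpyxl.load_workbook(path)
    ws = wb.active
    for coordinate, value in rows_to_cells(output, row_index).items():
        ws[coordinate] = value
    wb.save(path)


# Inspect the mapping without touching any file (made-up sample row).
demo = rows_to_cells(
    [["2018-07-23", "https://example.com/a.pdf", "keyword-1"]], row_index=5
)
print(demo)
```

Keeping the coordinate mapping separate from the file write makes it possible to verify the layout with plain Python before running it against URL_DATA.xlsx.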

Posted 2018/07/23 08:06

Edited 2018/07/23 13:29

umyu

Total score 5846

dkymmmmmt

2018/07/23 09:20

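Thank you for your answer. I will revise my code based on it. I had been avoiding splitting code into functions because I did not know where to split, but I will study it again. Understood about reproducibility as well; from next time I will attach sample data.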

umyu

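2018/07/23 09:41 (edited)

@dkymmmmmt

> I did not know where to split code into functions

Split at the points where the flow of processing changes: input → output → (here) → input based on that output.

> Also, about reproducibility

It is less that the problem is hard to reproduce and more that the required environment is very narrow: selenium and chromedriver installed, Windows OS, able to open Excel files, and a Python environment. I understand the requirement of writing scraping results to Excel, but with an environment this specific it is hard to attract answers. It is therefore better to recast the question as a general topic that can be reproduced with Python alone, and then fold the solution back into your own program. For example, BeautifulSoup accepts an HTML string, so if you embed the HTML in the code as a string, answerers no longer need selenium and chromedriver. http://python.zombie-hunting-club.com/entry/2017/11/08/192731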

dkymmmmmt

2018/07/23 10:03

@umyu Thank you for the reply. The environment being too narrow is certainly true... In this case, should the function that stores the search results in Excel be main?

umyu

2018/07/23 10:06

@dkymmmmmt You can write it in main, or you could create a function like output_excel.

dkymmmmmt

2018/07/23 10:43

@umyu Sorry for the string of questions. When running the functions from main, should I run the code shown below each function in your answer?

umyu

2018/07/23 10:45

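I brought this up myself, but I am getting confused. There are two topics here. Are we talking about splitting into functions, or about writing code that depends less on the environment? Which one is it?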

dkymmmmmt

2018/07/23 10:53

I apologize. I mean splitting into functions.

umyu

2018/07/23 11:00

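Thanks for the reply. As for splitting into functions: in each pair, the lower snippet is the code that calls the function. The code listed at the very end of my answer is split into functions and calls them, so please use that as a reference.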

dkymmmmmt

2018/07/23 12:56

Thank you as well... I am truly grateful for your careful, repeated replies. I will use this as a reference to get used to def.

umyu

2018/07/23 13:09

@dkymmmmmt Glad it is solved.

dkymmmmmt

2018/07/23 16:10

Sorry, there is one more thing I would like to confirm. If I want to output False when no link is found, can I do that by editing the if statement in the parse function?

umyu

2018/07/23 17:07 (edited)

What exactly "when no link is found" means needs to be defined, but if it means there is not a single link, you could check the length of the list just before calling output_excel.

    if len(output) == 0:
        output.append([time_data, "FALSE", word])
    output_excel(output, row_index)
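The same check can also be wrapped in a small helper so it is easy to try in isolation. This is just a sketch: with_placeholder and its placeholder parameter are hypothetical names introduced here, and the sample values are made up.

```python
def with_placeholder(output, time_data, word, placeholder="FALSE"):
    """Return `output` unchanged, or one placeholder row if it is empty."""
    if output:
        return output
    return [[time_data, placeholder, word]]


# No links were found for this keyword, so a single placeholder row is produced.
print(with_placeholder([], "2018-07-23", "keyword-1"))
```

In main, the result of with_placeholder(output, time_data, word) would then be passed to output_excel in place of output.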

dkymmmmmt

2018/07/24 01:46

Thank you. I confirmed that it behaves as I intended. Sorry for bothering you so late at night...
