I want a single way to extract information from differently structured web pages

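I am currently learning Python.
My plan is to search for "東京都会社概要" (Tokyo company profile) with the Google search API, collect the URL of each web page in the results,
and then scrape those URLs to extract each company's profile.
Unsurprisingly, every web page structures its HTML differently,
so I cannot reliably extract the information I am after.
If you have any ideas, I would appreciate hearing them.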

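I run the three scripts below in order, and I have managed to get as far as the second one, which extracts the URLs.
For now, the third script simply pulls whatever is inside the table tags.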

Python
# Query the Google Custom Search API and write each result page as one JSON line
import json
import urllib.parse
import urllib.request

# urlencode() turns the space into '+'; a literal '+' would be sent as %2B
QUERY = '会社概要 東京都'
key = 'KEY'  # API key
cx = 'CX'    # search engine ID
NUM = 3      # number of result pages to fetch
cseurl = 'https://www.googleapis.com/customsearch/v1?'
params = {
    'key': key,
    'q': QUERY,
    'cx': cx,
    'alt': 'json',
    'lr': 'lang_ja',
}
start = 1

with open('result/GoogleResult.json', 'w') as f:
    for i in range(NUM):
        params['start'] = start
        req_url = cseurl + urllib.parse.urlencode(params)
        with urllib.request.urlopen(req_url) as search_response:
            search_results = search_response.read().decode('utf8')
        dump = json.loads(search_results)
        f.write(json.dumps(dump) + '\n')
        # Stop early when the API reports no further result page
        next_page = dump['queries'].get('nextPage')
        if not next_page:
            break
        start = int(next_page[0]['startIndex'])

Python
# Extract the URLs from the saved Google search results (one JSON object per line)
import json

link_urls = []
with open('result/GoogleResult.json', 'r') as read_file:
    for line in read_file:
        if not line.strip():
            continue
        result = json.loads(line)
        # Parse the JSON instead of regex-matching the raw text,
        # so that URLs containing commas are not truncated
        for item in result.get('items', []):
            link_urls.append(item['link'])

with open('result/UrlList.txt', 'w') as write_file:
    for link_url in link_urls:
        write_file.write(link_url + '\n')

Python
# Fetch the table contents from each URL
import csv
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Strip the trailing newline from each URL and skip blank lines
with open('result/UrlList.txt', 'r') as urlfile:
    urls = [line.strip() for line in urlfile if line.strip()]

with open('result/url_file.csv', 'wt', newline='', encoding='utf-8') as csvFile:
    writer = csv.writer(csvFile)
    for url in urls:
        html = urlopen(url)
        bsObj = BeautifulSoup(html, 'html.parser')
        for table in bsObj.find_all('table'):
            for row in table.find_all('tr'):
                csvRow = [cell.get_text().strip() for cell in row.find_all(['td', 'th'])]
                # Keep only two-cell rows, which are usually label/value pairs
                if len(csvRow) == 2:
                    writer.writerow(csvRow)
        # Separator between pages; a list keeps it in a single CSV cell
        writer.writerow(['--------'])


1 answer

Best answer

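First, you need to decide exactly what form the company profile data you want to collect should take.
For example, if you want the address,
you would look for an element on the page that contains a word such as 住所 (address) or 本社 (head office), and collect the string containing a prefecture name that appears inside or right after that element.
For capital, it would be the amount that appears inside or right after an element containing the string 資本金 (capital).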
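As a rough illustration of that label-then-value idea, here is a minimal sketch using only the standard library. Everything in it is an assumption made for the example: the names `TextCollector`, `find_after_label` and `extract_profile`, the partial prefecture list, the two regular expressions, the `lookahead` window, and the sample HTML. It collects the page's text fragments in document order, finds a fragment containing a label word, and then looks in that fragment and the next few for a value that matches the expected shape.

```python
import re
from html.parser import HTMLParser

# Hypothetical, partial list of prefectures; extend to all 47 for real use.
PREFECTURES = ('東京都', '大阪府', '京都府', '北海道', '神奈川県')

# Assumed value shapes: a prefecture name followed by the rest of the fragment,
# and a run of digits/commas/Japanese unit characters ending in 円.
ADDRESS_RE = re.compile('(?:' + '|'.join(PREFECTURES) + ').*')
CAPITAL_RE = re.compile(r'[0-9０-９,億万千百]+円')


class TextCollector(HTMLParser):
    """Collect non-empty text fragments in document order."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fragments.append(text)


def find_after_label(fragments, labels, value_pattern, lookahead=2):
    """Return the first value match in, or shortly after, a fragment containing a label."""
    for i, fragment in enumerate(fragments):
        if any(label in fragment for label in labels):
            # Check the labelled fragment itself, then the next few fragments
            for candidate in fragments[i:i + 1 + lookahead]:
                match = value_pattern.search(candidate)
                if match:
                    return match.group(0)
    return None


def extract_profile(html):
    """Extract address and capital from one page; missing fields are None."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    fragments = collector.fragments
    return {
        'address': find_after_label(fragments, ('住所', '本社'), ADDRESS_RE),
        'capital': find_after_label(fragments, ('資本金',), CAPITAL_RE),
    }


# Two made-up pages that mark up the same information differently
sample_table = ('<table><tr><th>本社</th><td>東京都千代田区1-2-3 サンプルビル</td></tr>'
                '<tr><th>資本金</th><td>1,000万円</td></tr></table>')
sample_dl = ('<dl><dt>住所</dt><dd>〒100-0001 東京都港区4-5-6</dd>'
             '<dt>資本金</dt><dd>5億円</dd></dl>')

for page in (sample_table, sample_dl):
    print(extract_profile(page))
```

Because the search works on text order rather than on specific tags, the same function handles both the table layout and the definition-list layout. The collector also picks up text inside script and style elements, and a label word that appears in navigation text could match first, so the output still needs checking.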

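However, this approach will sometimes extract the wrong data (for example, an address cut off before the building name, or a capital figure that is far too large or too small). If you publish the results carelessly, someone could claim that incorrect information about them was published and caused them harm, and take legal action. I recommend keeping the data for research purposes, viewable only by you.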

Posted 2016/06/14 10:48