Getting the text inside the body tag with BeautifulSoup4 also returns unwanted content

How I get the text

Python
from urllib.request import urlopen
from bs4 import BeautifulSoup

body = BeautifulSoup(urlopen('http://example.com/'), 'lxml').find('body').text

Unwanted content

JavaScript and HTML comments (is this normal?)
JavaScript code
XML-like fragments (for example, <![CDATA[<greeting>Hello,world!</greeting>]]>)

Possible solutions...

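Is the only option to delete the <script> tags with BeautifulSoup, or to strip everything out wholesale with regular expressions?
The same unwanted content is also included when I use Selenium + PhantomJS.
I simply want the page content itself. Is there any other solution?
Thank you in advance.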

1 answer

Best answer

Both comments and scripts can be removed with features built into BeautifulSoup.

Python
from bs4 import BeautifulSoup, Comment

html = """
<html>
 <head>
   <title>Title</title>
   <script type="text/javascript"><![CDATA[
   alert('hoge');
   ]]>
   </script>
 </head>
 <body>
   <!-- comment -->
   <h1>Heading</h1>
   <div>Body text</div>
 </body>
</html>
"""

soup = BeautifulSoup(html, "lxml")

# Remove HTML comments (string= is the current name of the older text= argument)
for comment in soup.find_all(string=lambda x: isinstance(x, Comment)):
    comment.extract()

# Remove inline <script> tags (those without a src attribute)
for script in soup.find_all('script', src=False):
    script.decompose()

# Print only the remaining non-blank text nodes
for text in soup.find_all(string=True):
    if text.strip():
        print(text)
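
Since the question only wants the text inside <body>, the same steps can be wrapped into a small helper. This is only a rough sketch: the function name visible_body_text, the choice to also drop <style> tags, and the default "html.parser" backend (picked so the example runs without lxml installed) are my own assumptions, not part of the code above.

```python
from bs4 import BeautifulSoup, Comment


def visible_body_text(html, parser="html.parser"):
    """Return the text of <body> without scripts, styles, or comments.

    Hypothetical helper for illustration; pass parser="lxml" to match
    the parser used elsewhere in this thread.
    """
    soup = BeautifulSoup(html, parser)
    # Fall back to the whole document if there is no <body> tag.
    body = soup.body or soup

    # Drop <script> and <style> elements together with their contents.
    for tag in body.find_all(["script", "style"]):
        tag.decompose()

    # Drop HTML comments.
    for comment in body.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Join the remaining text nodes, one per line, skipping blank ones.
    return body.get_text(separator="\n", strip=True)


if __name__ == "__main__":
    sample = (
        "<html><head><title>Title</title></head><body>"
        "<!-- comment --><h1>Heading</h1>"
        "<script>alert('hoge');</script><div>Body text</div>"
        "</body></html>"
    )
    print(visible_body_text(sample))
```

To use it on a live page as in the question, the fetched response can be read and passed in, e.g. visible_body_text(urlopen('http://example.com/').read(), 'lxml').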