# Getting the text inside the body tag with BeautifulSoup4 also returns unwanted content

## How the text is fetched

```Python
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Parse the page with lxml and take all text under <body>
body = BeautifulSoup(urlopen('http://example.com/'), 'lxml').find('body').text
```

## Unwanted content

* JavaScript and HTML comments (is this normal?)
* JavaScript code
* XML-like fragments (e.g. `<![CDATA[<greeting>Hello,world!</greeting>]]>`)

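## Is there a solution?

Is the only option to remove `<script>` tags with **BeautifulSoup**, or to strip everything out wholesale with regular expressions?
The same unwanted content is included when I use **Selenium + PhantomJS**.
I simply want the content itself. Is there any other solution?
Thank you in advance.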

## Best answer

Both the comments and the script code can be removed with features built into BeautifulSoup.

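```Python
from bs4 import BeautifulSoup, Comment

html = """
<html>
 <head>
   <title>Title</title>
   <script type="text/javascript"><![CDATA[
   alert('hoge');
   ]]>
   </script>
 </head>
 <body>
   <!-- comment -->
   <h1>Heading</h1>
   <div>Body text</div>
 </body>
</html>
"""

soup = BeautifulSoup(html, "lxml")

# Remove HTML comments
# (string= is the current name of the older text= argument, Beautiful Soup 4.4.0+)
for comment in soup(string=lambda x: isinstance(x, Comment)):
    comment.extract()

# Remove inline <script> tags (src=False matches only tags without a src attribute)
for script in soup.find_all('script', src=False):
    script.decompose()

# Print only the non-blank text nodes
for text in soup.find_all(string=True):
    if text.strip():
        print(text)
```

- [https://www.crummy.com/software/BeautifulSoup/bs4/doc/#extract](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#extract)
- [https://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose)
- [https://stackoverflow.com/a/23516633/7724457](https://stackoverflow.com/a/23516633/7724457)
- [https://stackoverflow.com/a/23299678/7724457](https://stackoverflow.com/a/23299678/7724457)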
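Since the question asks for the text of `<body>` specifically, the same steps could be wrapped in a small helper. The following is only a sketch under a few assumptions: the function name `body_text` is hypothetical, it uses the built-in `"html.parser"` by default so it runs without lxml, and it also drops `<style>` blocks and scripts that have a `src` attribute, which the answer above does not do.

```python
from bs4 import BeautifulSoup, Comment


def body_text(html, parser="html.parser"):
    """Return the text inside <body>, one text fragment per line.

    Hypothetical helper: comments, <script> and <style> are removed first.
    """
    soup = BeautifulSoup(html, parser)
    # Fall back to the whole document if there is no <body> element.
    root = soup.body or soup

    # Remove HTML comments (Comment is a subclass of NavigableString).
    for node in root.find_all(string=True):
        if isinstance(node, Comment):
            node.extract()

    # Remove scripts and stylesheets together with their contents.
    # Assumption: unlike the answer above, scripts with src= are dropped too.
    for tag in root.find_all(["script", "style"]):
        tag.decompose()

    # stripped_strings yields each remaining text node with surrounding
    # whitespace trimmed, skipping nodes that are empty after trimming.
    return "\n".join(root.stripped_strings)


sample = """
<html>
 <head><title>Title</title></head>
 <body>
   <!-- comment -->
   <h1>Heading</h1>
   <script>alert('hoge');</script>
   <div>Body text</div>
 </body>
</html>
"""

print(body_text(sample))
```

Passing `parser="lxml"` should give the same kind of result for well-formed pages. The title is left out here because only the subtree under `<body>` is walked.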