Answer edit history

Code fix

2017/08/09 10:09

Posted

Score 38352

test CHANGED Viewed

@@ -8,24 +8,62 @@
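 `BeautifulSoup` treats these as plain text, so they are not stripped.
-It is a bit of a hack, but I was able to extract just the body text with a regular expression, as shown below.
+It is a bit of a hack, but I was able to extract the body text by treating the `description` content as `HTML` and parsing it.
+Verification code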
 ```Python
+import requests
+from bs4 import BeautifulSoup
+import re
+url = 'http://feeds.reuters.com/reuters/JPBusinessNews'
+#url = 'http://feeds.reuters.com/reuters/healthNews'
+html = requests.get(url)
+root = BeautifulSoup(html.content, 'html.parser')
 for link in root.findAll("item"):
     print(link.find("title").text)
+    print('-----')
-    desc = link.find("description").text
+    desc = link.find('description').text
     #print(desc)
-    # Extract "body..." from "body...<div (or other display markup)...</a>"
+    # Method 1: treat the description as HTML and parse it
+    desc = BeautifulSoup(desc,'html.parser')
+    desc = desc.text.rstrip()
+    # Method 2: extract "body..." from "body...<div (or other display markup)...</a>"
+    # The regular expression fails to match for some items
-    desc = re.match(r'(.*)<div',desc).group(1)
+    #desc = re.match(r'(.*)<div',desc).group(1)
     print(desc)
+    print('-----')
 ```
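
Method 1 above can be tried without network access using the minimal sketch below. The sample RSS fragment and the helper name `extract_description` are assumptions made for illustration. They are not taken from the actual Reuters feed, whose markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical RSS item: the description holds the body text followed by
# entity-escaped display markup, as described in the answer.
SAMPLE_RSS = """<rss><channel><item>
<title>Sample headline</title>
<description>Body text of the article.&lt;div class="feedflare"&gt;&lt;a href="http://example.com/"&gt;&lt;img src="http://example.com/x.gif"&gt;&lt;/a&gt;&lt;/div&gt;</description>
</item></channel></rss>"""


def extract_description(item):
    """Return the description text with embedded display markup removed."""
    # First pass: the escaped markup comes back as plain text.
    raw = item.find("description").text
    # Second pass: parse that text as HTML so the tags are dropped.
    return BeautifulSoup(raw, "html.parser").get_text().rstrip()


root = BeautifulSoup(SAMPLE_RSS, "html.parser")
for item in root.find_all("item"):
    print(item.find("title").text)
    print(extract_description(item))
```

Parsing twice avoids depending on the exact shape of the trailing markup, which is where the regular expression approach can fail.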

Code fix

2017/08/09 10:09

Posted

Score 38352

test CHANGED Viewed

@@ -1,13 +1,31 @@
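 With `BeautifulSoup`, you can extract just the text using `.text` or [.get_text()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text).
+#### Revised after checking the source data
+Checking the source `RSS(XML)` data showed that, besides the body text, the `description` element also contains what appears to be display markup such as `<div ... </a>`, included as **text**.
+`BeautifulSoup` treats these as plain text, so they are not stripped.
+It is a bit of a hack, but I was able to extract just the body text with a regular expression, as shown below.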
 ```Python
-for link in htmlSource.findAll("item"):
+for link in root.findAll("item"):
     print(link.find("title").text)
-    print(link.find("description").text)
+    desc = link.find("description").text
+    #print(desc)
+    # Extract "body..." from "body...<div (or other display markup)...</a>"
+    desc = re.match(r'(.*)<div',desc).group(1)
+    print(desc)
 ```
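
Note that `re.match(...)` returns `None` when a description contains no `<div`, so calling `.group(1)` directly raises `AttributeError`. A guarded variant might look like the sketch below. The helper name `strip_trailing_markup` is hypothetical, and unlike the greedy pattern above it cuts at the first `<div` and lets `.` match newlines, which is an assumption about what the feed needs.

```python
import re

# Capture everything before the first "<div"; DOTALL lets "." span newlines.
_BODY_BEFORE_DIV = re.compile(r'(.*?)<div', re.DOTALL)


def strip_trailing_markup(desc):
    """Return the text before the first '<div', or desc itself if there is none."""
    m = _BODY_BEFORE_DIV.match(desc)
    if m is None:
        # No display markup found: keep the description as it is.
        return desc.rstrip()
    return m.group(1).rstrip()


print(strip_trailing_markup('Body text.<div class="x"><a href="#">y</a></div>'))
print(strip_trailing_markup('No markup here'))
```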

Link added

2017/08/09 09:42

Posted

Score 38352

test CHANGED Viewed

@@ -1,4 +1,6 @@
-With `BeautifulSoup`, you can extract just the text using `.text` or `.get_text()`.
+With `BeautifulSoup`, you can extract just the text using `.text` or [.get_text()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text).
 ```Python