teratail header banner

Answer edit history

2

Posted 2019/04/05 06:53 by tiitoi (score 21960)

answer CHANGED
@@ -42,4 +42,31 @@
  soup = BeautifulSoup(html)
  vals = [t.text for t in soup.find_all('p')]
  print(vals) # ['ZZZ', 'AAA', 'BBB', 'CCC', 'YYY', '', 'TTTSSS', 'RRR', '4', 'XXX']
+ ```
+
+ ## Addendum
+
+ ```python
+ html = '''<p><span>ZZZ</span></p>,
+ <p>AAA</p>,
+ <p>BBB</p>,
+ <p>CCC</p>,
+ <p class="tags">YYY</p>,
+ <p class="list"><a href="/WWW/"><img alt="VVV" src="/UUU"/></a></p>,
+ <p class="tags">TTT<br class="sp"/>SSS</p>,
+ <p class="hoge"><a class="tagb" href="/socialmedia/">RRR</a></p>,
+ <p class="fuga"><a class="typesquare_tags" href="/chronicle/04/">4</a></p>,
+ <p class="capion typesquare_tags">XXX</p>'''
+
+ from bs4 import BeautifulSoup
+
+ soup = BeautifulSoup(html)
+
+ vals = []
+ for t in soup.find_all('p'):
+     # Look only at the direct children of the p tag and take the first text node
+     p_text = t.find(text=True, recursive=False)
+     if p_text:
+         vals.append(p_text)
+ print(vals) # ['AAA', 'BBB', 'CCC', 'YYY', 'TTT', 'XXX']
  ```
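
The revision-2 code above keeps only the first direct text node of each `p`, so the `SSS` that follows the `<br>` is dropped. If every direct text node is wanted, a minimal sketch could look like the following. It assumes a recent Beautiful Soup 4 (where the `string=` keyword is available), passes `'html.parser'` explicitly, and uses a shortened copy of the sample HTML; the helper name `direct_texts` is hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample HTML from the answer.
html = '''<p><span>ZZZ</span></p>,
<p>AAA</p>,
<p class="tags">TTT<br class="sp"/>SSS</p>,
<p class="hoge"><a class="tagb" href="/socialmedia/">RRR</a></p>'''

soup = BeautifulSoup(html, 'html.parser')


def direct_texts(tag):
    """Return the non-empty text nodes that are direct children of tag (hypothetical helper)."""
    nodes = tag.find_all(string=True, recursive=False)
    return [s.strip() for s in nodes if s.strip()]


vals = []
for p in soup.find_all('p'):
    # Text nested inside child tags (span, a) is skipped on purpose.
    vals.extend(direct_texts(p))
print(vals)  # ['AAA', 'TTT', 'SSS']
```

Text wrapped in `span` or `a` is still excluded, as in the answer's version; only the handling of multiple direct text nodes differs.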

1

Posted 2019/04/05 06:53 by tiitoi (score 21960)

answer CHANGED
@@ -19,4 +19,27 @@
  soup = BeautifulSoup(html)
  vals = [t.text for t in soup.find_all('p', attrs=lambda attrs: not attrs)]
  print(vals) # ['ZZZ', 'AAA', 'BBB', 'CCC']
+ ```
+
+ ## Addendum
+
+ If you simply mean extracting the text of every p tag, use the following.
+
+ ```python
+ html = '''<p><span>ZZZ</span></p>,
+ <p>AAA</p>,
+ <p>BBB</p>,
+ <p>CCC</p>,
+ <p class="tags">YYY</p>,
+ <p class="list"><a href="/WWW/"><img alt="VVV" src="/UUU"/></a></p>,
+ <p class="tags">TTT<br class="sp"/>SSS</p>,
+ <p class="hoge"><a class="tagb" href="/socialmedia/">RRR</a></p>,
+ <p class="fuga"><a class="typesquare_tags" href="/chronicle/04/">4</a></p>,
+ <p class="capion typesquare_tags">XXX</p>'''
+
+ from bs4 import BeautifulSoup
+
+ soup = BeautifulSoup(html)
+ vals = [t.text for t in soup.find_all('p')]
+ print(vals) # ['ZZZ', 'AAA', 'BBB', 'CCC', 'YYY', '', 'TTTSSS', 'RRR', '4', 'XXX']
  ```
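
In the revision-1 output above, the image-only `p` yields an empty string and the text around `<br>` is fused into `TTTSSS`. Should that be unwanted, one possible variation is `get_text()` with a separator and `strip=True`, followed by dropping empty results. This is only a sketch on a shortened copy of the sample HTML, with `'html.parser'` chosen as an assumption; it is not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample HTML from the answer.
html = '''<p>AAA</p>,
<p class="list"><a href="/WWW/"><img alt="VVV" src="/UUU"/></a></p>,
<p class="tags">TTT<br class="sp"/>SSS</p>'''

soup = BeautifulSoup(html, 'html.parser')

# Join the text pieces of each p with a space, stripping surrounding whitespace.
texts = [p.get_text(' ', strip=True) for p in soup.find_all('p')]

# Drop p tags that contain no text at all (for example, image-only links).
vals = [t for t in texts if t]
print(vals)  # ['AAA', 'TTT SSS']
```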