Scraping: how to avoid picking up the text inside child elements

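Is there a way in BeautifulSoup to get only the text that sits directly inside an element, without the text of its child elements?
Given the HTML below, if I select the section with class content via body = soup.select_one('.content') and then call print(body.text), the output also includes the "テキスト" strings inside the div elements.
If there is a way to get only the "hello" string, without the strings inside the child tags, I would be grateful if you could tell me.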

<section class="content">
   <div>テキスト</div>

   "hellohellohellohellohello"

   <div>テキスト</div>

</section>

Note: soup is a BeautifulSoup object.

quickquip

2020/02/02 00:53

Isn't body = soup.select_one('content') a typo for body = soup.select_one('.content')?


3 answers

I had not heard of NavigableString before.

Since text nodes have no tag, I have been finding them by checking "tag.name is None".

python
for tags in soup.find_all("section"):
    for tag in tags.contents:
        if tag.name is None:
            t = tag.strip()
            if t:
                print(t)
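
For comparison, a similar result can probably be obtained without the explicit name check by asking find_all for direct string children only. This is a minimal sketch rather than the answerer's code: the names section and direct_texts are made up for illustration, the HTML is the sample from the question, and the string argument assumes a reasonably recent BeautifulSoup 4 release.

```python
from bs4 import BeautifulSoup

html = """
<section class="content">
   <div>テキスト</div>

   "hellohellohellohellohello"

   <div>テキスト</div>

</section>
"""

soup = BeautifulSoup(html, "html.parser")
section = soup.select_one(".content")

# string=True matches text nodes; recursive=False limits the search to
# direct children, so the text inside the <div> elements is skipped.
direct_texts = [
    s.strip()
    for s in section.find_all(string=True, recursive=False)
    if s.strip()  # drop whitespace-only nodes between the tags
]
print(direct_texts)
```

Because the sample text itself contains double quotes, the single item in the list keeps them.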

Posted 2020/02/02 04:26

barobaro

Total score 1286

ruuuu

2020/02/03 08:56

Thank you for your answer. I will use it as a reference!


By using xml.etree.ElementTree (the ElementTree XML API), you can extract the text as follows.

Python
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET

# Extract and return only the text directly under the root element
def get_root_text(html_str):
    root = ET.fromstring(html_str)
    ret = [root.text] + [child.tail for child in root]
    ret = [e for e in ret if e]
    return ''.join(ret)


s = """
<section class="content">
   <div>テキスト</div>

   "hellohellohellohellohello"

   <div>テキスト</div>

</section>
"""
soup = BeautifulSoup(s, 'lxml')
body = soup.select_one('.content')

text = get_root_text(str(body))
print('-----')
print(text)
print('-----')

"""
-----



   "hellohellohellohellohello"



-----
"""
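
The get_root_text function above depends on how ElementTree stores text: text before the first child is kept in the parent's text attribute, and text that follows a child is kept in that child's tail attribute. The following standard-library-only sketch illustrates this with a made-up one-line fragment (the fragment and the name direct are assumptions for illustration).

```python
import xml.etree.ElementTree as ET

# A tiny well-formed fragment. ElementTree raises ParseError on markup
# that is not well-formed XML (for example an unclosed <div>).
root = ET.fromstring("<section>A<div>inner</div>B<div>inner</div>C</section>")

print(repr(root.text))                 # text before the first child: 'A'
print([child.tail for child in root])  # text after each child: ['B', 'C']
print([child.text for child in root])  # text inside each child: ['inner', 'inner']

# Joining root.text with every child's tail gives only the direct text.
direct = (root.text or "") + "".join(child.tail or "" for child in root)
print(direct)  # ABC
```

Since the parser only accepts well-formed XML, this approach is probably best suited to markup that has already been normalized, as is done above by passing it through BeautifulSoup first.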

Posted 2020/02/01 22:22

can110

Total score 38258

ruuuu

2020/02/02 01:20

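Thank you for your answer. I tested with the code you provided and got exactly the result I wanted. I am sorry to say this after such a careful answer, but this time I will pick the answer from "jun68ykt" as the best answer, because it provides code that works with BeautifulSoup alone. I learned that this can also be done with xml.etree.ElementTree, which was instructive.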


Best answer

Hello.

If the intent of your question is

to use only BeautifulSoup to get the text, excluding the text inside child elements and below,

then I think the following will do.

python3
from bs4 import BeautifulSoup, NavigableString

html = '''
<html><body>
  <section class="content">
    <div>テキスト</div>

    hellohellohellohellohello

    <div>テキスト</div>
  </section>
</body></html>
'''

soup = BeautifulSoup(html, "html.parser")

text = None

# Take the first direct child that is a plain text node and is not blank
for e in soup.section.contents:
    if type(e) is NavigableString and str(e).strip():
        text = str(e).strip()
        break

print(text)    #=> hellohellohellohellohello

Repl.it for checking the behavior: https://repl.it/@jun68ykt/Q238966
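
One detail of the check above that may be worth illustrating: comparing with type(e) is NavigableString is stricter than isinstance, because BeautifulSoup represents HTML comments with Comment, a subclass of NavigableString. The sketch below uses a made-up fragment containing a comment; the names exact and loose are hypothetical.

```python
from bs4 import BeautifulSoup, NavigableString

html = '<section class="content"><!-- note --><div>テキスト</div>hello</section>'
soup = BeautifulSoup(html, "html.parser")

# Exact type match: plain text nodes only.
exact = [str(e) for e in soup.section.contents if type(e) is NavigableString]

# isinstance also accepts subclasses such as Comment.
loose = [str(e) for e in soup.section.contents if isinstance(e, NavigableString)]

print(exact)  # ['hello']
print(loose)  # [' note ', 'hello']
```

If comments can appear in the scraped pages, the exact type check is probably the safer choice for picking out visible text only.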

Addendum

If the HTML instead looks like the following,

html
<html><body>
  <section class="content">
    hello1
    <div>テキスト</div>
    hello2
    <div>テキスト</div>
    hello3
  </section>
</body></html>

then the following list comprehension

python3
text_children = [
    str(e).strip() for e in soup.section.contents
        if type(e) is NavigableString
]

yields ['hello1', 'hello2', 'hello3'].

Repl.it for checking the behavior: https://repl.it/@jun68ykt/Q238966_2
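
Note that the comprehension gives a clean list here because every text node in this HTML contains a word. When a section starts or ends with a tag, the whitespace between tags is also a text node and shows up as an empty string after strip(). The sketch below shows one way to filter those out; the HTML and the names unfiltered and filtered are assumptions for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
<section class="content">
  <div>テキスト</div>
  hello
  <div>テキスト</div>
</section>
'''
soup = BeautifulSoup(html, "html.parser")

# Without a blank check, whitespace-only nodes become empty strings.
unfiltered = [
    str(e).strip() for e in soup.section.contents
    if type(e) is NavigableString
]

# Adding the same blank check used in the loop version drops them.
filtered = [
    str(e).strip() for e in soup.section.contents
    if type(e) is NavigableString and str(e).strip()
]

print(unfiltered)  # ['', 'hello', '']
print(filtered)    # ['hello']
```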

Addendum 2

Suppose the HTML contains four <section> elements as shown below, and only the third <section> has non-whitespace text among its direct child nodes.

html
<html><body>
  <section class="content">
    <div>テキスト</div>
    <div>テキスト</div>
  </section>
  <section class="content">
    <div>テキスト</div>
    <div>テキスト</div>
  </section>
  <section class="content">
    <div>テキスト</div>
    hello
    <div>テキスト</div>
  </section>
  <section class="content">
    <div>テキスト</div>
    <div>テキスト</div>
  </section>
</body></html>

To check each section from the top and, once a section is found whose direct child nodes include a non-whitespace string, print that string and leave the loop, the program could be written as follows.

python3
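from bs4 import BeautifulSoup, NavigableString


def get_first_text(elm):
    # Return the first non-blank direct text child, or None if there is none
    for child in elm.contents:
        if type(child) is NavigableString and str(child).strip():
            return str(child).strip()
    return None


# html is the four-section HTML string shown above
soup = BeautifulSoup(html, "html.parser")

for sec in soup.find_all('section'):
    text = get_first_text(sec)
    if text:
        print(text)
        break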
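
The same "first match across sections" search can also be sketched with a generator and next(), which stops at the first hit and returns a default when nothing is found. This is only an alternative formulation under assumptions: the function name iter_direct_texts, the variable first_text, and the shortened three-section HTML are all made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

html = """
<html><body>
  <section class="content"><div>テキスト</div></section>
  <section class="content"><div>テキスト</div> hello <div>テキスト</div></section>
  <section class="content"><div>テキスト</div> later </section>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")


def iter_direct_texts(soup):
    """Yield each non-blank direct text child of every section.content."""
    for sec in soup.find_all("section", class_="content"):
        for child in sec.contents:
            if type(child) is NavigableString and str(child).strip():
                yield str(child).strip()


# next() pulls only the first item; None is returned if there is no text.
first_text = next(iter_direct_texts(soup), None)
print(first_text)  # hello
```

Because the generator is lazy, the sections after the first match are never inspected, which mirrors the break in the loop version.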