Answer edit history

Text revision

2020/02/02 00:54

Posted

jun68ykt

Score 9058

answer CHANGED

@@ -58,4 +58,49 @@
 ```
 yields `['hello1', 'hello2', 'hello3']`.
-- **Repl.it for verification:** [https://repl.it/@jun68ykt/Q238966_2](https://repl.it/@jun68ykt/Q2389662)
+- **Repl.it for verification:** [https://repl.it/@jun68ykt/Q238966_2](https://repl.it/@jun68ykt/Q2389662)
+### Addendum 2
+Suppose the HTML contains four `<section>` elements, as shown below, and only the third `<section>` has non-whitespace text among its direct child nodes.
+```html
+<html><body>
+  <section class="content">
+    <div>テキスト</div>
+    <div>テキスト</div>
+  </section>
+  <section class="content">
+    <div>テキスト</div>
+    <div>テキスト</div>
+  </section>
+  <section class="content">
+    <div>テキスト</div>
+    hello
+    <div>テキスト</div>
+  </section>
+  <section class="content">
+    <div>テキスト</div>
+    <div>テキスト</div>
+  </section>
+</body></html>
+```
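+For the HTML above, a program that examines each `section` from top to bottom and, on finding the first `section` whose direct child nodes include a non-whitespace string, prints that string and exits the loop can be written as follows.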
+```python3
+def get_first_text(elm):
+    for child in elm.contents:
+        if type(child) is NavigableString and str(child).strip():
+            return str(child).strip()
+soup = BeautifulSoup(html, "html.parser")
+for sec in soup.find_all('section'):
+    text = get_first_text(sec)
+    if text:
+        print(text)
+        break
+```
+- **Repl.it for verification:** [https://repl.it/@jun68ykt/Q238966_3](https://repl.it/@jun68ykt/Q2389663)

Text revision

2020/02/02 00:54

Posted

jun68ykt

Score 9058

answer CHANGED

@@ -33,4 +33,29 @@
 print(text)    #=> hellohellohellohellohello
 ```
-- **Repl.it for verification:** [https://repl.it/@jun68ykt/Q238966](https://repl.it/@jun68ykt/Q238966)
+- **Repl.it for verification:** [https://repl.it/@jun68ykt/Q238966](https://repl.it/@jun68ykt/Q238966)
+### Addendum
+If the HTML instead looks like this,
+```html
+<html><body>
+  <section class="content">
+    hello1
+    <div>テキスト</div>
+    hello2
+    <div>テキスト</div>
+    hello3
+  </section>
+</body></html>
+```
+then the following
+```python3
+text_children = [
+    str(e).strip() for e in soup.section.contents
+        if type(e) is NavigableString
+]
+```
+yields `['hello1', 'hello2', 'hello3']`.
+- **Repl.it for verification:** [https://repl.it/@jun68ykt/Q238966_2](https://repl.it/@jun68ykt/Q2389662)
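
The snippet in this revision relies on `soup` and the imports from the earlier part of the answer. As a rough, self-contained sketch of the same idea, the version below collects every non-blank string that is a direct child of the `section`. The helper name `direct_texts`, the HTML comment in the sample markup, and the extra blank-string filter are illustrative additions, not part of the original answer.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup modelled on the addendum; the HTML comment is an
# illustrative addition to show that comments are not picked up.
html = """
<html><body>
  <section class="content">
    hello1
    <!-- a comment -->
    <div>text</div>
    hello2
    <div>text</div>
    hello3
  </section>
</body></html>
"""


def direct_texts(elm):
    """Return the stripped, non-blank strings that are direct children of elm.

    `type(e) is NavigableString` is an exact type check, so subclasses such
    as Comment are skipped, and text nested inside child tags is never
    visited because only `elm.contents` is scanned.
    """
    return [
        str(e).strip()
        for e in elm.contents
        if type(e) is NavigableString and str(e).strip()
    ]


soup = BeautifulSoup(html, "html.parser")
print(direct_texts(soup.section))  # ['hello1', 'hello2', 'hello3']
```

Using `isinstance(e, NavigableString)` instead would also match comments, which is presumably why the answer compares the exact type.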

Text revision

2020/02/01 23:06

Posted

jun68ykt

Score 9058

answer CHANGED

@@ -1,5 +1,11 @@
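-Hello. I think the following will give you `hellohellohellohellohello`.
+Hello.
+If the intent of your question is
+- to get the text, excluding any text inside child elements, using only BeautifulSoup
+then I think something like the following should work.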
 ```python3
 from bs4 import BeautifulSoup, NavigableString

Text revision

2020/02/01 22:30

Posted

jun68ykt

Score 9058

answer CHANGED

@@ -1,9 +1,8 @@
 Hello. I think the following will give you `hellohellohellohellohello`.
 ```python3
-from bs4 import BeautifulSoup
+from bs4 import BeautifulSoup, NavigableString
 html = '''
 <html><body>
   <section class="content">
@@ -21,10 +20,11 @@
 text = None
 for e in soup.section.contents:
-    if e.__class__.__name__ == 'NavigableString' and str(e).strip():
+    if type(e) is NavigableString and str(e).strip():
         text = str(e).strip()
         break
-print(text)    # => hellohellohellohellohello
+print(text)    #=> hellohellohellohellohello
 ```
 - **Repl.it for verification:** [https://repl.it/@jun68ykt/Q238966](https://repl.it/@jun68ykt/Q238966)
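
Because the diff above elides part of the script, here is a minimal runnable sketch of the same approach: take the first non-blank string that is a direct child of the `section`, ignoring text inside child elements. The sample HTML and the helper name `first_direct_text` are assumptions made for illustration; the original markup is not fully shown in this history.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample markup; the original HTML is elided in the diff.
html = """
<html><body>
  <section class="content">
    <div>nested text</div>
    hellohellohellohellohello
    <div>more nested text</div>
  </section>
</body></html>
"""


def first_direct_text(elm):
    """Return the first non-blank direct-child string of elm, or None."""
    return next(
        (
            str(child).strip()
            for child in elm.contents
            if type(child) is NavigableString and str(child).strip()
        ),
        None,
    )


soup = BeautifulSoup(html, "html.parser")
section = soup.section

print(first_direct_text(section))            # hellohellohellohellohello
print("nested text" in section.get_text())   # True: get_text() includes nested text
```

The `next(..., None)` form plays the same role as the explicit loop with `break`: it stops at the first match and falls back to `None` when the element has no direct text.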