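I would like to know how to write a CSS selector that ignores inline markup such as a tags (word highlights and the like) and extracts all of the text under a p tag joined into a single string.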

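Background / what I want to achieve

I would like to know how to specify a CSS selector when scraping news articles.
I am trying to scrape only the body text of news articles with Python's scrapy, but wherever the text contains a highlight (a link, for instance), that part is treated as a separate text node and ends up as a separate element in the extracted array.
For example, given the following text: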

Rob Leathern, director of product management at Facebook, said the company will update its disclosure policy in Britain next month. It will require political advertisers to verify their identities and then attach accurate information about their identities to the ads.

The changes are part of new political advertising policies that Facebook announced this week for users in Britain. No only will political ads need to be more clearly labeled, but the company is establishing a searchable archive of political ads that have been published on the site.

"Rob Leathern ... ads" becomes one array element,
"The changes are part of" becomes one array element,
"new political advertising" becomes one array element,
"that ... site." becomes one array element,
so the text is stored in the array like this:

Python
["Rob Leathern ... ads", "The changes are part of", "new political advertising", "that ... site."]

Ideally, I would like it to be stored without being split, like this:

Python
["Rob Leathern ... ads", "The changes are part of ... new political advertising ... that ... site."]
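It is fine for the text to be split into separate elements at paragraph breaks or line breaks, but it is a problem when a few words in the middle of a sentence are split off as if they were their own paragraph or line.
So I would like to know how to write a CSS selector that ignores inline markup such as a tags (word highlights and the like) and extracts all of the text under a p tag joined into a single string.

Thank you in advance.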

The problem / error message

<div class="css-18sbwfn StoryBodyCompanionColumn">
    <div class="css-4w7y5l">
        <p class="css-1xl4flh e2kc3sl0">
            Rob Leathern, director of product management at Facebook, said the company will update its disclosure policy in Britain next month. It will require political advertisers to verify their identities and then attach accurate information about their identities to the ads.
        </p>
    </div>
    <aside class="css-14jsv4e">
        <span>
        </span>
    </aside>
</div>
<div class="css-18sbwfn StoryBodyCompanionColumn">
    <div class="css-4w7y5l">
        <p class="css-1xl4flh e2kc3sl0">
            The changes are part of 
            <a class="css-1g7m0tk" href="https://newsroom.fb.com/news/2018/10/increasing-transparency-uk/" title="" rel="noopener noreferrer" target="_blank">
                new political advertising policies
            </a>
            that Facebook announced this week for users in Britain. No only will political ads need to be more clearly labeled, but the company is establishing a searchable archive of political ads that have been published on the site.
        </p>
    </div>
    <aside class="css-14jsv4e">
        <span>
        </span>
    </aside>
</div>

Relevant source code

With the CSS selector in the code below, the text gets split.

Python
body = response.css('article#story section div *::text').extract()
print(body)


2 answers

python
from bs4 import BeautifulSoup

html = '''
<div class="css-18sbwfn StoryBodyCompanionColumn">
    <div class="css-4w7y5l">
        <p class="css-1xl4flh e2kc3sl0">
            Rob Leathern, director of product management at Facebook, said the company will update its disclosure policy in Britain next month. It will require political advertisers to verify their identities and then attach accurate information about their identities to the ads.
        </p>
    </div>
    <aside class="css-14jsv4e">
        <span>
        </span>
    </aside>
</div>
<div class="css-18sbwfn StoryBodyCompanionColumn">
    <div class="css-4w7y5l">
        <p class="css-1xl4flh e2kc3sl0">
            The changes are part of
            <a class="css-1g7m0tk" href="https://newsroom.fb.com/news/2018/10/increasing-transparency-uk/" title="" rel="noopener noreferrer" target="_blank">
                new political advertising policies
            </a>
            that Facebook announced this week for users in Britain. No only will political ads need to be more clearly labeled, but the company is establishing a searchable archive of political ads that have been published on the site.
        </p>
    </div>
    <aside class="css-14jsv4e">
        <span>
        </span>
    </aside>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

for i in soup.select('p'):
    print(i.get_text(' ', strip=True))
    print('-' * 35)

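i.get_text(' ', strip=True)
strip=True removes the leading and trailing whitespace of each text fragment,
and the boundaries between tags are turned into ' ' (the separator given as the first argument).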
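
To make the effect of the two arguments concrete, here is a small sketch on a shortened paragraph (the HTML snippet and the variable names are made up for illustration; only get_text itself comes from the answer above). It compares the default call, strip=True on its own, and the combination with a ' ' separator.

```python
from bs4 import BeautifulSoup

# Shortened, made-up paragraph with the same shape as the one in the question.
html = '''<p>
    The changes are part of
    <a href="#">
        new policies
    </a>
    that Facebook announced.
</p>'''

p = BeautifulSoup(html, 'html.parser').select_one('p')

raw = p.get_text()                   # keeps every newline and indent as-is
glued = p.get_text(strip=True)       # fragments stripped, joined with ''
joined = p.get_text(' ', strip=True) # fragments stripped, joined with ' '

print(repr(raw))
print(glued)
print(joined)
```

Without the separator the stripped fragments are glued together ("part ofnew policiesthat"), which is why the ' ' is needed. One thing to watch for: the separator is inserted at every tag boundary, so a link immediately followed by punctuation would come out with a space before the punctuation mark.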

If you want the result as a list:

python
result = [i.get_text(' ', strip=True) for i in soup.select('p')]

Posted 2018/10/22 08:43

Edited 2018/10/22 08:48

barobaro

Total score: 1286

Best answer

How about something like this?

An example using BeautifulSoup

python
import re
from bs4 import BeautifulSoup

with open('test.html') as f:
    # Name the parser explicitly so BeautifulSoup does not warn about guessing one.
    soup = BeautifulSoup(f.read(), 'html.parser')

for p_tag in soup.select('p'):
    text = p_tag.get_text()
    # Strip leading/trailing whitespace and remove newlines.
    text = text.strip().replace('\n', '')
    # Collapse runs of two or more spaces into one.
    text = re.sub(' +', ' ', text)
    print(text)
    print('-----------------------------------')

Rob Leathern, director of product management at Facebook, said the company will update its disclosure policy in Britain next month. It will require political advertisers to verify their identities and then attach accurate information about their identities to the ads.
-----------------------------------
The changes are part of new political advertising policies that Facebook announced this week for users in Britain. No only will political ads need to be more clearly labeled, but the company is establishing a searchable archive of political ads that have been published on the site.
-----------------------------------
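
As a possible simplification, the strip / replace / re.sub steps could be folded into a single whitespace normalization using only the standard library. This is a sketch, not part of the code above: normalize_ws is a hypothetical helper name, and the raw string imitates what get_text() returns for the indented paragraph.

```python
def normalize_ws(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no argument splits on any whitespace run and drops
    # leading/trailing whitespace, so joining with ' ' normalizes the text.
    return ' '.join(text.split())


# Imitates p_tag.get_text() for an indented paragraph that contains a link.
raw = (
    '\n    The changes are part of\n    \n'
    '        new political advertising policies\n    \n'
    '    that Facebook announced this week.\n'
)

cleaned = normalize_ws(raw)
print(cleaned)
```

Unlike removing '\n' outright, this keeps words apart even when a newline is the only thing separating them: normalize_ws('part of\nnew') gives 'part of new'.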

Posted 2018/10/20 04:47