What I want to do

I managed to extract the href links, but how can I narrow them down to only the links whose href starts with /projects/python/?

Scraping target: freelancer.com

Code

# -*- coding: utf-8 -*-
import csv

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.freelancer.com/archives/python/2018-40/")

# Save the raw HTML for inspection.
with open('r.txt', 'w', encoding='utf-8') as f1:
    f1.write(r.text)

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(r.text, "html.parser")
with open('allhref.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, lineterminator='\n')
    # Write the href of every <a> tag, one per row.
    for link in soup.find_all('a'):
        writer.writerow([link.get('href')])

Contents of get('href')
Not the full list, only the main entries

/info/how-it-works
/jobs/
/post-project
/projects/python/online-web-tool/
/projects/php/Project-for-Kseniia-17867039/

Contents of r.text
Not the full text, only the main parts

<a title="Project for Kseniia I. -- 18/09/30 05:09:30 Job" href="/projects/php/Project-for-Kseniia-17867039/" class="job">Project for Kseniia I. -- 18/09/30 05:09:30</a>
<a title="online web tool Job" href="/projects/python/online-web-tool/" class="job">online web tool</a>
<a title="Instagram credentials getting Job" href="/projects/php/Instagram-credentials-getting-fix/" class="job">Instagram credentials getting</a>

Thank you in advance m(__)m

It worked! (≧∇≦)b

jun68ykt, thank you!
Thanks to you, I got it working (^^)

# -*- coding: utf-8 -*-
import csv
import re

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.freelancer.com/archives/python/2018-40/")

# Save the raw HTML for inspection.
with open('r.txt', 'w', encoding='utf-8') as f1:
    f1.write(r.text)

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(r.text, "html.parser")
with open('pythonHref.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, lineterminator='\n')
    # Keep only <a> tags whose href starts with /projects/python/.
    for link in soup.find_all('a', {'href': re.compile(r'^/projects/python/')}):
        writer.writerow([link.get('href')])
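
For comparison, the same prefix filter can probably be applied to the already extracted href values with plain `str.startswith`, without a regular expression. The sketch below is only an illustration: `filter_by_prefix` is a hypothetical helper name, and the sample list reuses the hrefs shown in the question plus one `None`, which is what `link.get('href')` returns for an `<a>` tag that has no href attribute.

```python
def filter_by_prefix(hrefs, prefix="/projects/python/"):
    """Keep only string hrefs that begin with prefix.

    None entries (anchors without an href attribute) are skipped.
    """
    return [h for h in hrefs if isinstance(h, str) and h.startswith(prefix)]


if __name__ == "__main__":
    # Sample values taken from the question, plus None for a tag with no href.
    sample = [
        "/info/how-it-works",
        "/jobs/",
        "/post-project",
        "/projects/python/online-web-tool/",
        "/projects/php/Project-for-Kseniia-17867039/",
        None,
    ]
    print(filter_by_prefix(sample))  # prints ['/projects/python/online-web-tool/']
```

Filtering inside `find_all` keeps the loop shorter, while filtering afterwards may be easier to read when the condition is a simple fixed prefix.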


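1 answer

Best answer

Hello.

How about making the following two additions and changes to the source code in your question?

(1) Add the following at the top so that regular expressions can be used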

python
import re

(2) Pass an href condition to soup.find_all in addition to 'a', as follows

python
for link in soup.find_all('a', {'href': re.compile(r'^/projects/python/')}):
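
The two changes above can be checked offline against the sample anchors quoted in the question. This is a minimal sketch, not the exact script from the thread: `python_project_links` and `SAMPLE_HTML` are hypothetical names introduced for illustration, and `html.parser` is assumed as the parser.

```python
import re

from bs4 import BeautifulSoup

# Sample anchors copied from the r.text excerpt in the question.
SAMPLE_HTML = """
<a title="Project for Kseniia I. -- 18/09/30 05:09:30 Job" href="/projects/php/Project-for-Kseniia-17867039/" class="job">Project for Kseniia I. -- 18/09/30 05:09:30</a>
<a title="online web tool Job" href="/projects/python/online-web-tool/" class="job">online web tool</a>
<a title="Instagram credentials getting Job" href="/projects/php/Instagram-credentials-getting-fix/" class="job">Instagram credentials getting</a>
"""


def python_project_links(html):
    """Return the hrefs of <a> tags whose href starts with /projects/python/."""
    soup = BeautifulSoup(html, "html.parser")
    # ^ anchors the pattern at the start of the href value.
    pattern = re.compile(r"^/projects/python/")
    return [a["href"] for a in soup.find_all("a", {"href": pattern})]


if __name__ == "__main__":
    print(python_project_links(SAMPLE_HTML))  # prints ['/projects/python/online-web-tool/']
```

Because the pattern is anchored with `^`, the two /projects/php/ links are skipped even though they also contain "/projects/".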

　
I hope this helps.

Posted 2018/10/14 07:31

Edited 2018/10/14 07:52

jun68ykt

Total score: 9058

Yukiya025

2018/10/14 07:53

[jun68ykt](https://teratail.com/users/jun68ykt), hello ＼(^o^)／ Thank you! It worked! This will speed up my hands-on Python practice (writing the source code requested in past projects) (・∀・) I have added the finished code to the question body!