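I want to extract PDFs from the web

I am writing code to extract PDF files from the web, but it is not working.
For practice, I am trying to start from "http://www.netkyouzai.jp/index.html",
follow the link for each subject, and extract the teaching-material PDF files subject by subject, but it does not work.
I would appreciate it if you could show me how to do this.

Current code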

import scrapy
from scrapy import item
from tika import parser
import re

class NetkyozaiSpider(scrapy.Spider):
    name = 'netkyozai'
    allowed_domains = ['netkyozai.jp']
    base_url = 'http://netkyozai.jp/'
    start_urls = ['http://www.netkyouzai.jp/index.html']

    

    def parse(self, response):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_subject)


    def parse_subject(self, response):
        for subject_anchor in response.xpath("//html/body//a/@href"):
            subject_url = subject_anchor.xpath("//html/body//a/@href")
            yield scrapy.Request( subject_url, callback=self.parse_content)
    

    def parse_content(self, response):
        for pdf_anchor in response.xpath("//a/@href"):
         item['title'] = pdf_anchor.xpath("//a/@href")
         item['content'] =pdf_anchor.xpath("//a/@href")
         item['site_base_url'] = self.base_url
         item['site_content_url'] = response.url
         item['site_name'] = 'ネット教材'
         item['content_type'] = 'pdf'
         yield item

Execution result

2021-06-01 11:47:59 [scrapy.core.engine] INFO: Spider opened
2021-06-01 11:47:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-06-01 11:47:59 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-06-01 11:47:59 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.netkyouzai.jp/robots.txt> (referer: None)
2021-06-01 11:47:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.netkyouzai.jp/index.html> (referer: None)
2021-06-01 11:47:59 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.netkyouzai.jp': <GET http://www.netkyouzai.jp/index.html>
2021-06-01 11:47:59 [scrapy.core.engine] INFO: Closing spider (finished)
2021-06-01 11:47:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 454,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 9012,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 0.235292,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 6, 1, 2, 47, 59, 774075),
 'log_count/DEBUG': 3,
 'log_count/INFO': 10,
 'memusage/max': 54829056,
 'memusage/startup': 54829056,
 'offsite/domains': 1,
 'offsite/filtered': 1,
 'request_depth_max': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021, 6, 1, 2, 47, 59, 538783)}
2021-06-01 11:47:59 [scrapy.core.engine] INFO: Spider closed (finished)

2 answers

Good morning.

I have read your question.

The following article looks like a good reference:
・Using Scrapy to find and download PDF files from a website

Posted 2021/06/01 23:41

Deactivated user

itir

2021/06/03 09:12

Thank you!

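Stepping back for a moment: if your goal is simply to crawl the HTML, download the PDF files locally, and read them yourself, a single wget command will do it. You do not even need to write any code.
https://www.atmarkit.co.jp/ait/articles/1606/20/news024.html
Note: the --accept option lets you specify file extensions.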
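
If you would still rather start the download from a Python script, the wget call could be wrapped roughly as follows. This is only a sketch: the function names, the depth of 2, and the `pdfs` output directory are my own assumptions, and it requires wget to be installed and on the PATH.

```python
import subprocess


def build_wget_command(start_url, depth=2, out_dir="pdfs"):
    """Return the argument list for a recursive wget run that keeps only PDF files."""
    return [
        "wget",
        "--recursive",                    # follow links starting from start_url
        f"--level={depth}",               # how many links deep to follow (assumed value)
        "--no-parent",                    # never climb above the starting directory
        "--accept=pdf",                   # keep only files whose names end in .pdf
        "--no-directories",               # save files flat instead of mirroring the site tree
        f"--directory-prefix={out_dir}",  # where the files are saved (assumed directory)
        start_url,
    ]


def download_pdfs(start_url, depth=2, out_dir="pdfs"):
    """Run wget; raises CalledProcessError if wget exits with a non-zero status."""
    subprocess.run(build_wget_command(start_url, depth, out_dir), check=True)
```

Calling `download_pdfs("http://www.netkyouzai.jp/index.html")` would then run the same command you could type in a shell. Keeping the argument list in its own function makes it easy to print and check the command before running it.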

Posted 2021/06/01 03:59

Edited 2021/06/01 06:07