
Question edit history

1

Added the code.

2020/12/09 12:01

Posted by

Kanadekana_nana

Score 8

title CHANGED
File without changes
body CHANGED
@@ -1,7 +1,90 @@
  ### Background / What I want to achieve

  After collecting data with Scrapy's CrawlSpider, I can export to CSV with `scrapy crawl spidername -o file.csv` as long as there is a single Item, but when there are two or more Item classes I want to generate a separate CSV per Item.
+ ```python
+ import logging
+ import scrapy
+ from scrapy.spiders import CrawlSpider, Rule
+ from scrapy.linkextractors import LinkExtractor
+ import re
 
+ from .. import items
+
+
+ class JtnewsSpider(CrawlSpider):
+     name = 'jtnews'
+     allowed_domains = ['www.jtnews.jp']
+     start_urls = ['http://www.jtnews.jp/cgi-bin/revlist.cgi?PAGE_NO=1']
+
+     rules = (
+         Rule(
+             LinkExtractor(allow=r"revlist\.cgi\?&?PAGE_NO=\d+$"),  # reviewer roster
+             callback="parse_user",
+             follow=False
+         ),
+         Rule(
+             LinkExtractor(allow=r"revper\.cgi\?&?REVPER_NO=\d+$"),  # individual reviewer page
+             follow=True
+         ),
+         Rule(
+             LinkExtractor(allow=r"revper\.cgi\?&?PAGE_NO=\d+&REVPER_NO=\d+&TYPE=2$"),  # reviews sorted by modification date
+             callback="parse_review"
+         )
+     )
+
+     user_pattern = re.compile(r"REVPER_NO=(?P<user_id>\d+)")
+     movie_pattern = re.compile(r"TITLE_NO=(?P<movie_id>\d+)")
+
+     def parse_user(self, response):
+         try:
+             user_table = response.css("table.hover-table")
+             for link in user_table.css("a"):
+                 user = items.UserItem()
+                 user_url = link.css("a::attr(href)").get()
+                 user["user_id"] = int(self.user_pattern.findall(user_url)[0])
+                 user["name"] = link.css("a::text").get()
+                 yield user
+         except Exception:
+             self.log(f"parse failed: {response.url}", level=logging.ERROR)
+             yield scrapy.Request(
+                 response.url, callback=self.parse_user, dont_filter=True
+             )
+
+     def parse_review(self, response):
+         try:
+             user_table = response.css("table.normal-table")[2]
+             for link in user_table.css("tr")[1:]:
+                 review = items.ReviewItem()
+                 review_url = link.css("a::attr(href)")[0].get()
+                 review["movie_id"] = int(self.movie_pattern.findall(review_url)[0])
+                 review["title"] = link.css("a::text")[0].get()
+                 review["point"] = link.css("td::text").get()
+                 yield review
+         except Exception:
+             self.log(f"parse failed: {response.url}", level=logging.ERROR)
+             yield scrapy.Request(
+                 response.url, callback=self.parse_review, dont_filter=True
+             )
+ ```
+
+ ```python
+ import scrapy
+
+
+ class ReviewItem(scrapy.Item):
+     point = scrapy.Field(serializer=str)
+     movie_id = scrapy.Field(serializer=str)
+     title = scrapy.Field(serializer=str)
+
+
+ class UserItem(scrapy.Item):
+     user_id = scrapy.Field(serializer=str)
+     name = scrapy.Field(serializer=str)
+ ```
+
  ### What I tried

  https://qiita.com/bakeratta/items/6fe9030ad838a2a71aa5
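The routing idea the question asks about (one CSV per Item class) can be sketched independently of Scrapy with the standard `csv` module: keep one writer per item type and open each file lazily on first use. This is a minimal illustration, not the asker's code — the class name `PerTypeCsvWriter` and the output file naming are assumptions; the question's `UserItem` / `ReviewItem` would map to `UserItem.csv` / `ReviewItem.csv`.

```python
import csv
import os


class PerTypeCsvWriter:
    """Route each record (a dict tagged with a type name) to its own CSV file.

    Hypothetical helper for illustration, not part of Scrapy.
    """

    def __init__(self, out_dir):
        self.out_dir = out_dir
        self.files = {}    # type name -> open file handle
        self.writers = {}  # type name -> csv.DictWriter

    def write(self, type_name, record):
        # Lazily open one file per type and emit its header row once.
        if type_name not in self.writers:
            f = open(os.path.join(self.out_dir, f"{type_name}.csv"),
                     "w", newline="", encoding="utf-8")
            writer = csv.DictWriter(f, fieldnames=list(record))
            writer.writeheader()
            self.files[type_name] = f
            self.writers[type_name] = writer
        self.writers[type_name].writerow(record)

    def close(self):
        for f in self.files.values():
            f.close()
```

In an actual Scrapy project the same pattern is usually written as an item pipeline that keys a `scrapy.exporters.CsvItemExporter` on `type(item).__name__` and calls `export_item` in `process_item`; recent Scrapy versions (2.6+) can also express this declaratively via the `item_classes` option of the `FEEDS` setting, with one feed entry per Item class.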