
Question edit history

1

Added the code.

2020/12/09 12:01

Posted by

Kanadekana_nana

Score 8

title CHANGED
File without changes
body CHANGED
@@ -1,7 +1,90 @@
  ### Background / What I want to achieve

  After collecting data with Scrapy's CrawlSpider, I can export to CSV with `scrapy crawl spidername -o file.csv` as long as there is a single Item, but when there are two or more Item classes I want to generate a separate CSV per Item.
+ ```python
+ import logging
+ import scrapy
+ from scrapy.spiders import CrawlSpider, Rule
+ from scrapy.linkextractors import LinkExtractor
+ import re
 
+ from .. import items
+
+
+ class JtnewsSpider(CrawlSpider):
+     name = 'jtnews'
+     allowed_domains = ['www.jtnews.jp']
+     start_urls = ['http://www.jtnews.jp/cgi-bin/revlist.cgi?PAGE_NO=1']
+
+     rules = (
+         Rule(
+             LinkExtractor(allow=r"revlist\.cgi\?&?PAGE_NO=\d+$"),  # reviewer roster
+             callback="parse_user",
+             follow=False
+         ),
+         Rule(
+             LinkExtractor(allow=r"revper\.cgi\?&?REVPER_NO=\d+$"),  # individual reviewer page
+             follow=True
+         ),
+         Rule(
+             LinkExtractor(allow=r"revper\.cgi\?&?PAGE_NO=\d+&REVPER_NO=\d+&TYPE=2$"),  # reviews sorted by modification date
+             callback="parse_review"
+         )
+     )
+
+     user_pattern = re.compile(r"REVPER_NO=(?P<user_id>\d+)")
+     movie_pattern = re.compile(r"TITLE_NO=(?P<movie_id>\d+)")
+
+     def parse_user(self, response):
+         try:
+             user_table = response.css("table.hover-table")
+             for link in user_table.css("a"):
+                 user = items.UserItem()
+                 user_url = link.css("a::attr(href)").get()
+                 user["user_id"] = int(self.user_pattern.findall(user_url)[0])
+                 user["name"] = link.css("a::text").get()
+                 yield user
+         except Exception:
+             self.log(f"parse failed: {response.url}", level=logging.ERROR)
+             yield scrapy.Request(
+                 response.url, callback=self.parse_user, dont_filter=True
+             )
+
+     def parse_review(self, response):
+         try:
+             user_table = response.css("table.normal-table")[2]
+             for link in user_table.css("tr")[1:]:
+                 review = items.ReviewItem()
+                 review_url = link.css("a::attr(href)")[0].get()
+                 review["movie_id"] = int(self.movie_pattern.findall(review_url)[0])
+                 review["title"] = link.css("a::text")[0].get()
+                 review["point"] = link.css("td::text").get()
+                 yield review
+         except Exception:
+             self.log(f"parse failed: {response.url}", level=logging.ERROR)
+             yield scrapy.Request(
+                 response.url, callback=self.parse_review, dont_filter=True
+             )
+ ```
+
+ ```python
+ import scrapy
+
+
+ class ReviewItem(scrapy.Item):
+     point = scrapy.Field(serializer=str)
+     movie_id = scrapy.Field(serializer=str)
+     title = scrapy.Field(serializer=str)
+
+
+ class UserItem(scrapy.Item):
+     user_id = scrapy.Field(serializer=str)
+     name = scrapy.Field(serializer=str)
+ ```
+
  ### What I tried

  https://qiita.com/bakeratta/items/6fe9030ad838a2a71aa5
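The routing idea the question asks about (one CSV per Item class) can be sketched independently of Scrapy with the standard `csv` module: keep one writer per item type and open each file lazily on first use. This is a minimal illustration, not the asker's code — the class name `PerTypeCsvWriter` and the output file naming are assumptions; the question's `UserItem` / `ReviewItem` would map to `UserItem.csv` / `ReviewItem.csv`.

```python
import csv
import os


class PerTypeCsvWriter:
    """Route each record (a dict tagged with a type name) to its own CSV file.

    Hypothetical helper for illustration, not part of Scrapy.
    """

    def __init__(self, out_dir):
        self.out_dir = out_dir
        self.files = {}    # type name -> open file handle
        self.writers = {}  # type name -> csv.DictWriter

    def write(self, type_name, record):
        # Lazily open one file per type and emit its header row once.
        if type_name not in self.writers:
            f = open(os.path.join(self.out_dir, f"{type_name}.csv"),
                     "w", newline="", encoding="utf-8")
            writer = csv.DictWriter(f, fieldnames=list(record))
            writer.writeheader()
            self.files[type_name] = f
            self.writers[type_name] = writer
        self.writers[type_name].writerow(record)

    def close(self):
        for f in self.files.values():
            f.close()
```

In an actual Scrapy project the same pattern is usually written as an item pipeline that keys a `scrapy.exporters.CsvItemExporter` on `type(item).__name__` and calls `export_item` in `process_item`; recent Scrapy versions (2.6+) can also express this declaratively via the `item_classes` option of the `FEEDS` setting, with one feed entry per Item class.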