Question edit history

Source code correction

2019/07/08 05:40

Posted

abokadoishii

Score 12


@@ -102,7 +102,71 @@
 I tried rewriting the sample code's `for url in response.css('a.entry-link::attr("href")')` as `for url in response.css('.entrylist-contents-title a::attr("herf")')` and as `for url in response.css('a.entrylist-contents-title::attr("herf")')`, but neither worked.
+```python
+# -*- coding: utf-8 -*-
+import scrapy
+from myproject.items import Page
+from myproject.utils import get_content
+from bs4 import BeautifulSoup
+class BroadSpider(scrapy.Spider):
+	name = 'broad'
+	allowed_domains = ['b.hatena.ne.jp/entrylist']
+	start_urls = ['http://b.hatena.ne.jp/entrylist/']
+	def parse(self, response):
+		print('\n\nresponse:{}\n\n'.format(response))
+		for url in response.css('.entrylist-contents-title a::attr("href")').extract():
+			print('\n\nURL:{}'.format(url))
+			yield scrapy.Request(url,callback=self.parse_page)
+		url_more=response.css('a::attr("href")').re_first(r'.*?of=\d{2}$')
+		print('\n\nurl_more:{}\n\n'.format(url_more))
+		if url_more:
+			yield scrapy.Request(responce.urljoin(url_more))
+	def parse_page(self, response):
+		print('\n\npase_page\n\n')
+		title, content = get_content(reaponse.text)
+		yield Page(url=responce.url, title=title , content=content)
+```
+After changing the code to the above, the first for loop worked. `url_more` is still empty.
 ### Supplementary information (framework/tool versions, etc.)
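
Two things in the code above may be worth checking separately from the selector. First, `responce` and `reaponse` are misspellings of `response`, so those lines would raise `NameError` as soon as they are reached. Second, `allowed_domains` contains a path (`/entrylist`); as far as I understand, Scrapy matches this list against the host name of each request, so an entry with a path would never match. The sketch below only illustrates that host-name comparison with the standard library. `host_allowed` and `next_url` are hypothetical names for illustration, and this is a simplified stand-in, not Scrapy's actual implementation.

```python
from urllib.parse import urlparse


def host_allowed(url, allowed_domains):
    """Simplified stand-in for an offsite check: only the host name is compared."""
    host = urlparse(url).hostname or ''
    return any(host == domain or host.endswith('.' + domain)
               for domain in allowed_domains)


# Hypothetical next-page URL, for illustration only.
next_url = 'http://b.hatena.ne.jp/entrylist?of=20'

# Value used in the question: includes a path, so it can never equal a host name.
with_path = host_allowed(next_url, ['b.hatena.ne.jp/entrylist'])
# Host name only.
domain_only = host_allowed(next_url, ['b.hatena.ne.jp'])
print(with_path, domain_only)
```

If this is indeed the cause, listing only the host name in `allowed_domains` (or leaving it out while testing) would be the thing to try; note that links to articles on other sites would still be treated as offsite.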

Tag edit

2019/07/08 05:40

Posted

abokadoishii

Score 12


Added details

2019/07/07 15:57

Posted

abokadoishii

Score 12


@@ -2,9 +2,13 @@
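-I am stuck implementing Spider1, which uses readability, from p. 223 of the book "Pythonクローリング＆スクレイピング ―データ収集・解析のための実践開発ガイド―". The function that follows links to individual pages (`parse`) and the function that parses each individual web page (`parse_page`) do not work.
+I am stuck implementing the Spider that uses readability, from p. 223 of the book "Pythonクローリング＆スクレイピング ―データ収集・解析のための実践開発ガイド―". The function that follows links to individual pages (`parse`) and the function that parses each individual web page (`parse_page`) do not work.
 I want to collect the URLs on the entry page and loop over them with a for statement, but I don't know how to write it.
+In the first for loop of the `parse` method, I want to iterate over the URLs found inside the class `entrylist-contents-title` in the h3 tags.
+In the second part, I want to follow the next 20 entries only when the `of=` value has two digits.
 Any guidance would be appreciated.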
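
The second requirement (follow the next page only when `of=` has two digits) can be checked on its own with the standard `re` module before wiring it into the spider. This is a minimal sketch under assumptions: the href values are made up for illustration and the real page may use a different URL shape; `first_next_page` and `sample_hrefs` are hypothetical names, not part of the book's code or of Scrapy.

```python
import re

# Pattern taken from the question: an href that ends in "of=" plus exactly two digits.
NEXT_PAGE = re.compile(r'.*?of=\d{2}$')


def first_next_page(hrefs):
    """Return the first href ending in a two-digit of= value, or None if there is none."""
    for href in hrefs:
        if NEXT_PAGE.search(href):
            return href
    return None


# Hypothetical href values, for illustration only.
sample_hrefs = [
    '/entrylist/all',      # no of= parameter
    '/entrylist?of=20',    # two digits: should be followed
    '/entrylist?of=100',   # three digits: should be skipped
]

next_href = first_next_page(sample_hrefs)
print(next_href)
```

If a check like this matches on hand-written strings but `url_more` is still `None` in the spider, the href values on the fetched page probably do not end in `of=` plus two digits, so printing the extracted href values would be a reasonable next step.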
@@ -14,9 +18,9 @@
-```
+```
-[scrapy.middleware] INFO: Enabled spider middlewares:
+INFO: Enabled spider middlewares:
 ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
@@ -28,17 +32,23 @@
  'scrapy.spidermiddlewares.depth.DepthMiddleware']
 ```
 ### python
+# -*- coding: utf-8 -*-
 import scrapy
 from myproject.items import Page
 from myproject.utils import get_content
+from bs4 import BeautifulSoup
@@ -48,31 +58,39 @@
 	allowed_domains = ['b.hatena.ne.jp/entrylist']
-	start_urls = ['http://b.hatena.ne.jp/entrylist/']
+	start_urls = ['http://b.hatena.ne.jp/entrylist/']
 	def parse(self, response):
-		for url in response.css('a.entry-link::attr("href")').extract():
+		for url in response.css('.entrylist-contents-title a::attr("herf")').extract():
 			yield scrapy.Request(url,callback=self.parse_page)
 		url_more=response.css('a::attr("href")').re_first(r'.*?of=\d{2}$')
-		print("url_more:{}".format(url_more))
 		if url_more:
 			yield scrapy.Request(responce.urljoin(url_more))
 	def parse_page(self, response):
 		title, content = get_content(reaponse.text)
-		yield Page(url=responce.url, title=title , content=content)```
+		yield Page(url=responce.url, title=title , content=content)
+		```
@@ -80,9 +98,9 @@
-I also ran the book's sample code, but neither function worked.
+I also ran the book's sample code, but it failed in the same way. The support page did not mention anything about this either, so I could not work out the cause.
-In the `parse` function, the for loop did not run. The web page being crawled has no `entry-link` but does have `entrylist`, so I rewrote the selector, but it still did not work.
+I tried rewriting the sample code's `for url in response.css('a.entry-link::attr("href")')` as `for url in response.css('.entrylist-contents-title a::attr("herf")')` and as `for url in response.css('a.entrylist-contents-title::attr("herf")')`, but neither worked.

Text correction

2019/07/07 15:15

Posted

abokadoishii

Score 12


@@ -4,7 +4,9 @@
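 I am stuck implementing Spider1, which uses readability, from p. 223 of the book "Pythonクローリング＆スクレイピング ―データ収集・解析のための実践開発ガイド―". The function that follows links to individual pages (`parse`) and the function that parses each individual web page (`parse_page`) do not work.
-While implementing the ■■ feature, the following error message occurred.
+I want to collect the URLs on the entry page and loop over them with a for statement, but I don't know how to write it.
+Any guidance would be appreciated.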
@@ -80,7 +82,7 @@
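 I also ran the book's sample code, but neither function worked.
-In the `parse` function, the for loop did not run. The web page being crawled has no `entry-link` but does have `entrylist` so I rewrote the selector, but it still did not work.
+In the `parse` function, the for loop did not run. The web page being crawled has no `entry-link` but does have `entrylist`, so I rewrote the selector, but it still did not work.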

Formatting improvement

2019/07/02 04:27

Posted

abokadoishii

Score 12


@@ -30,13 +30,7 @@
-### Relevant source code
-```python
+### python
-# -*- coding: utf-8 -*-
 import scrapy
@@ -86,7 +80,7 @@
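 I also ran the book's sample code, but neither function worked.
-In the `parse` function, the for loop did not seem to be iterating. The web page being crawled has no `entry-link` but does have `entrylist` so I rewrote the selector, but it still did not work.
+In the `parse` function, the for loop did not run. The web page being crawled has no `entry-link` but does have `entrylist` so I rewrote the selector, but it still did not work.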