scrapy shell: response.xpath().extract() fails to extract the element

Question

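As an example, on the [Yahoo! Japan top page](https://www.yahoo.co.jp/), when I try to extract the element li#mhi2nd with XPath in the Scrapy shell, as shown in the image below, I get back `[]`, which looks like an empty value. A different XPath (/html/head/title), on the other hand, extracts correctly.

**① What causes this difference?**
**② How can I correctly extract the text "ヤフオク!" from li#mhi2nd?**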


![HTML element](c1d4c6f55074bcaacd952849693e1afd.png)



```scrapy
scrapy shell https://www.yahoo.co.jp/
```


### XPath for which extraction **fails**
```scrapy
>>> response.xpath('//*[@id="mhi2nd"]/a/text()').extract()
```

### XPath for which extraction **succeeds**
```scrapy
>>> response.xpath('/html/head/title/text()').extract()
```

Accepted Answer

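The site checks the browser type and version, so Scrapy is simply being served different content. You can confirm this by printing the body that was actually received:
```
# response.text is the decoded body; body_as_unicode() is deprecated in recent Scrapy releases
print(response.text)
```
I believe what you get is an old-fashioned table-layout page.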
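
Rather than reading the whole dump by eye, one option is to check whether the id is present in the returned markup at all. The following is only a sketch using the standard library's `html.parser`; the helper name `has_element_id` and the two HTML snippets are hypothetical stand-ins, not the real Yahoo! markup.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects every id attribute seen while parsing an HTML string."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def has_element_id(html, element_id):
    """Return True if any start tag in `html` has id == element_id."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return element_id in collector.ids


# Hypothetical stand-ins for the two variants of the page.
modern_html = '<ul><li id="mhi2nd"><a href="#">ヤフオク!</a></li></ul>'
legacy_html = '<table><tr><td><a href="#">ヤフオク!</a></td></tr></table>'

print(has_element_id(modern_html, "mhi2nd"))
print(has_element_id(legacy_html, "mhi2nd"))
```

In the Scrapy shell the same helper could be called as `has_element_id(response.text, "mhi2nd")`; if it returns `False`, the XPath has nothing to match regardless of how it is written.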

Setting Scrapy's User-Agent to something like [a Chrome one](https://developers.whatismybrowser.com/useragents/explore/software_name/chrome/) should solve it.

References
[https://doc.scrapy.org/en/latest/topics/settings.html#user-agent](https://doc.scrapy.org/en/latest/topics/settings.html#user-agent)
[https://doc.scrapy.org/en/latest/topics/settings.html#command-line-options](https://doc.scrapy.org/en/latest/topics/settings.html#command-line-options)

```
scrapy shell -s USER_AGENT='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36' 'https://www.yahoo.co.jp/'
```
Starting the shell along these lines should work.

Result
```
In [1]: response.xpath('//*[@id="mhi2nd"]/a/text()').extract()
Out[1]: ['ヤフオク!']
```
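
If the same User-Agent should apply to every crawl rather than to a single shell session, the `USER_AGENT` setting described in the linked documentation can also be placed in the project's settings module. A minimal sketch, assuming a standard Scrapy project with a `settings.py`; the string is the same one used in the command above and is only an example value.

```python
# settings.py of a Scrapy project (assumed project layout)

# Same Chrome User-Agent string as in the scrapy shell command above.
# Any current browser User-Agent could be substituted here.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/60.0.3112.113 Safari/537.36"
)
```

A value passed on the command line with `-s USER_AGENT=...` should still take precedence over the project setting, so the shell command remains useful for quick experiments.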

Answer

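The XPath `//*[@id="mhi2nd"]/a/text()` means "the text of an `a` child of any element that has the attribute id="mhi2nd"". `/html/head/title/text()` means "the text of the `title` element that is a child of `head`, which is a child of `html`, starting from the root".

Incidentally, in Chrome you can press F12 to open the developer tools and then Ctrl+F to bring up a search box that accepts XPath, which lets you check which elements an expression selects. The first XPath did select ヤフオク! there.

![Image description](5d7100b98d02b5368310b8ef7b36e2f7.png)

For more on XPath, [this article](https://webbibouroku.com/Blog/Article/xpath) is a good reference.

## Addendum

I mentioned the above to show that the XPath itself is correct. I then saved the HTML that Scrapy fetched with the following code:

```
# Save the HTML that Scrapy actually received so it can be inspected
with open('test.html', 'w', encoding='utf-8') as f:
    f.write(response.body.decode("utf-8"))
```

The saved file shows that the request was judged not to meet the requirements for viewing Yahoo! JAPAN and was turned away. The HTML a browser receives was therefore never retrieved, so the element the XPath points to did not exist in the response.

```
To use the features of the Yahoo! JAPAN top page correctly, the following environment is required.
Windows: Internet Explorer 11.0 or later / Chrome latest version / Firefox latest version / Microsoft Edge　Macintosh: Safari 9.0 or later

* If you are using Internet Explorer 11.0 or later, please refer to "About Internet Explorer compatibility view" and try disabling compatibility view.

```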
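
To make the point concrete, the sketch below runs the same path against two small documents and shows that a correct expression still returns an empty list when the target element is absent. It uses the standard library's `xml.etree.ElementTree` instead of Scrapy's selectors, which is an assumption made only to keep the example self-contained. ElementTree supports a limited XPath subset, so the path is written relative (`.//`) and the text is read from `.text` rather than `text()`. Both documents are hypothetical, not the real Yahoo! markup.

```python
import xml.etree.ElementTree as ET

# ElementTree-flavoured equivalent of //*[@id="mhi2nd"]/a
XPATH = ".//*[@id='mhi2nd']/a"

# Hypothetical page as a modern browser might receive it.
modern = ET.fromstring(
    "<html><head><title>Yahoo! JAPAN</title></head>"
    "<body><ul><li id='mhi2nd'><a href='#'>ヤフオク!</a></li></ul></body></html>"
)

# Hypothetical fallback page: same link, but no element with id='mhi2nd'.
legacy = ET.fromstring(
    "<html><head><title>Yahoo! JAPAN</title></head>"
    "<body><table><tr><td><a href='#'>ヤフオク!</a></td></tr></table></body></html>"
)


def texts(root, path):
    """Return the text of every element matched by `path` under `root`."""
    return [el.text for el in root.findall(path)]


print(texts(modern, XPATH))           # the id exists, so the link text is found
print(texts(legacy, XPATH))           # the id is missing, so the list is empty
print(texts(legacy, "./head/title"))  # title exists in both variants
```

This mirrors what happened in the question: `/html/head/title` matched because both variants of the page have a title, while the id-based path only matches the variant served to recognized browsers.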
