I want to scrape the contents of a td with Python

Background / what I want to achieve

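I'm currently trying to scrape a certain site using Python's lxml.
I want to get the information inside td tags, but there is no class I can reasonably target with cssselect, so I'm using XPath's following-sibling to get the td that comes right after the td matched by a given word.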

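Among the td tags, I want to extract the "600 mm" that follows the td containing only "奥行" (depth). However, a td that appears earlier also contains the word "奥行", so the expression matches that one first and I end up with the td after it, "カラー" (color).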

Is there a good way to do this?

Relevant source code

HTML source

          <tr>            
            <th rowspan="9" mt:id="ATTR_ITEM_COL">商品仕様</th>                                  
            <td class="heading">
              寸法
            </td>
            <td class="yahooHighlightSearch spec">
              幅1600×奥行600×高さ700ｍｍ、引出し内形センタートレー・幅：758ｍｍ、奥行：382ｍｍ、高さ：42ｍｍ、サイドトレー・幅：322ｍｍ、奥行：382ｍｍ、高さ：42ｍｍ               
            </td>                                                
            <td class="heading">
              カラー
            </td>
            <td class="yahooHighlightSearch spec">
              メープル/ホワイト               
            </td>                                    
          </tr>
          <tr>
            <td class="heading">
              奥行
            </td>
            <td class="yahooHighlightSearch spec">
              600
              mm
            </td>          
            <td class="heading">
              高さ
            </td>
            <td class="yahooHighlightSearch spec">
              700
              mm
            </td>                     
          </tr>

Python code

    desk = {
        'url' : response.url,
        'title' : root.cssselect('.productTitle')[0].text_content().strip(),
        'price_num' : root.cssselect('p.priceNum > span.num')[0].text_content(),
        'price_tax' : root.cssselect('span.tax')[0].text_content(),
        'size' : root.xpath('//td[contains(., "寸法")]/following-sibling::td[1]')[0].text_content().strip(),
        'depth' : root.xpath('//td[contains(., "奥行")]/following-sibling::td[1]')[0].text_content().strip(),
        'height' : root.xpath('//td[contains(., "高さ")]/following-sibling::td[1]')[0].text_content().strip(),
        
    }

Result

{'url': '某サイト', 'title': 'プラス フラットライン 平机 引出し付き メープル/ホワイト 幅1600×奥行600×高さ700mm 1台', 'price_num': '￥18,400', 'price_tax': '￥19,872', 'depth': 'カラー', 'height': 'カラー', }

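I'd like depth and height in this result to be 'depth': '600mm', 'height': '700mm'.
Right now I'm doing a partial match with XPath such as td[contains(., "奥行")], so I'd like to know whether there is a way to match the td's content exactly. If there is a better approach, a different method is fine too.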

Thank you in advance.


2 answers

How about comparing with = instead of using contains?

'//td[normalize-space(text())="奥行"]/following-sibling::td[1]'

I haven't tested this.
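As a way to check the expression above, here is a minimal sketch run against a trimmed copy of the HTML from the question. The `HTML` string and the helper name `spec_value` are made up for this illustration, and passing the label as an XPath variable (`$label`) is just one convenient way to write it with lxml.

```python
import lxml.html

# Trimmed copy of the table rows from the question (wrapped in <table> for parsing).
HTML = """
<table>
  <tr>
    <th rowspan="9">商品仕様</th>
    <td class="heading">
      寸法
    </td>
    <td class="yahooHighlightSearch spec">
      幅1600×奥行600×高さ700ｍｍ
    </td>
    <td class="heading">
      カラー
    </td>
    <td class="yahooHighlightSearch spec">
      メープル/ホワイト
    </td>
  </tr>
  <tr>
    <td class="heading">
      奥行
    </td>
    <td class="yahooHighlightSearch spec">
      600
      mm
    </td>
    <td class="heading">
      高さ
    </td>
    <td class="yahooHighlightSearch spec">
      700
      mm
    </td>
  </tr>
</table>
"""

root = lxml.html.fromstring(HTML)


def spec_value(root, label):
    """Hypothetical helper: text of the td right after the td whose text equals label."""
    cells = root.xpath(
        '//td[normalize-space(text())=$label]/following-sibling::td[1]',
        label=label,
    )
    if not cells:
        return None
    # Collapse the line breaks and indentation inside the cell into single spaces.
    return ' '.join(cells[0].text_content().split())


# Partial match from the question: hits the "寸法" value cell first, so it yields "カラー".
partial = root.xpath('//td[contains(., "奥行")]/following-sibling::td[1]')[0]
print(partial.text_content().strip())

# Exact match: only the heading cell whose whole text is the label qualifies.
print(spec_value(root, "奥行"))
print(spec_value(root, "高さ"))
```

With this sample the exact match returns "600 mm" and "700 mm"; if you need '600mm' without the space, join the pieces with an empty string instead.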

Posted 2018/07/19 02:04

otn

Total score: 86349

With lxml alone, pinning down elements in the HTML is painful (the cost of fixing it is high whenever the HTML changes), so how about using BeautifulSoup?
Reference: https://qiita.com/matsu0228/items/edf7dbba9b0b0246ef8f

spec_head = soup.find_all("td", attrs={"class": "heading"})
spec_body = soup.find_all("td", attrs={"class": "yahooHighlightSearch"})

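If these two return the same number of elements,

you can find the index where the spec_head text is "奥行" and read spec_body at that same index, which should give you what you want.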
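The index-pairing idea could look roughly like the sketch below. It assumes that heading cells and value cells alternate in document order, as they do in the HTML from the question; the `HTML` string and the `specs` dict are names invented for this example, and Python's built-in "html.parser" is used so that only bs4 is required.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table rows from the question.
HTML = """
<table>
  <tr>
    <th rowspan="9">商品仕様</th>
    <td class="heading">寸法</td>
    <td class="yahooHighlightSearch spec">幅1600×奥行600×高さ700ｍｍ</td>
    <td class="heading">カラー</td>
    <td class="yahooHighlightSearch spec">メープル/ホワイト</td>
  </tr>
  <tr>
    <td class="heading">
      奥行
    </td>
    <td class="yahooHighlightSearch spec">
      600
      mm
    </td>
    <td class="heading">
      高さ
    </td>
    <td class="yahooHighlightSearch spec">
      700
      mm
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

spec_head = soup.find_all("td", attrs={"class": "heading"})
spec_body = soup.find_all("td", attrs={"class": "yahooHighlightSearch"})

specs = {}
# Pair by index only when both lists line up; otherwise the mapping would be unreliable.
if len(spec_head) == len(spec_body):
    for head, body in zip(spec_head, spec_body):
        label = head.get_text(strip=True)
        # Collapse the line breaks and indentation inside the value cell.
        specs[label] = " ".join(body.get_text().split())

print(specs.get("奥行"))
print(specs.get("高さ"))
```

Building a dict once also avoids running a separate query for every field, though it breaks silently if the page ever has a heading without a matching value cell, which is why the length check is kept.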

Posted 2018/07/18 21:50