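##■Background
With the program below, I want to save the HTML source of a scraped web page as a single record field in a CSV file, and then have a separate web service expand/reconstruct and display that source while reproducing its HTML tags and layout as faithfully as possible.
To achieve this, I paid attention to the following points in my scraping approach:
・Keep the HTML tags
・Change the CSV line terminator to "\r\n" so that it can be distinguished from the "\n" line breaks inside the HTML source
However, when I run the code below, the HTML does not fit cleanly into a single field of the CSV record.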
```python
# -*- coding: utf-8 -*-
import requests
import csv
from bs4 import BeautifulSoup as bs4

source = './test.csv'
url = 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
row = []

r = requests.get(url)
soup = bs4(r.content,'lxml')

row.append(soup.find(class_='content'))

with open(source, 'w', encoding="UTF-8") as f:
    writer = csv.writer(f, lineterminator='\r\n', quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(row)
```
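One possible cause, offered as a sketch rather than a confirmed diagnosis: the Python `csv` documentation says the file should be opened with `newline=''`. Without it, text mode on Windows translates every `\n` on write, so the `\r\n` record terminator can end up as `\r\r\n` and the `\n` inside the HTML as `\r\n`, which removes the distinction the script is trying to create. The helper name `write_html_row` and the sample fragment below are hypothetical and only illustrate the idea.

```python
import csv


def write_html_row(path, fragments):
    """Write HTML fragments as one CSV record (hypothetical helper name)."""
    # newline='' disables newline translation, so the writer's '\r\n'
    # terminator and the '\n' inside the HTML are stored exactly as given.
    with open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f, lineterminator='\r\n',
                            quoting=csv.QUOTE_NONNUMERIC)
        # str() turns a BeautifulSoup Tag (or any object) into its markup.
        writer.writerow([str(fragment) for fragment in fragments])


if __name__ == '__main__':
    # Assumed sample input; the real script would pass
    # [soup.find(class_='content')] instead.
    write_html_row('./test.csv',
                   ['<div class="content">\n<p>In stock</p>\n</div>'])
```

With this change the only `\r\n` in the file should be the one that ends the record, while the line breaks inside the quoted field stay as plain `\n`.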
##■Question
If there is a standard technique for achieving the goal described above, I would be very grateful for any hints.
##■Environment
Windows 10
Python 3.8.3
##■CSV output (test.csv)
```csv
"<div class=""content"">
<div id=""promotions"">
</div>
<div id=""content_inner"">
<article class=""product_page""><!-- Start of product page -->
<div class=""row"">
<div class=""col-sm-6"">
<div class=""carousel"" id=""product_gallery"">
<div class=""thumbnail"">
<div class=""carousel-inner"">
<div class=""item active"">
<img alt=""A Light in the Attic"" src=""../../media/cache/fe/72/fe72f0532301ec28892ae79a629a293c.jpg""/>
</div>
</div>
</div>
</div>
</div>
<div class=""col-sm-6 product_main"">
<h1>A Light in the Attic</h1>
<p class=""price_color"">£51.77</p>
<p class=""instock availability"">
<i class=""icon-ok""></i>

In stock (22 available)

</p>
<p class=""star-rating Three"">
<i class=""icon-star""></i>
<i class=""icon-star""></i>
<i class=""icon-star""></i>
<i class=""icon-star""></i>
<i class=""icon-star""></i>
<!-- <small><a href=""/catalogue/a-light-in-the-attic_1000/reviews/"">


0 customer reviews

</a></small>
-->


<!--
<a id=""write_review"" href=""/catalogue/a-light-in-the-attic_1000/reviews/add/#addreview"" class=""btn btn-success btn-sm"">
Write a review
</a>

--></p>
<hr/>
<div class=""alert alert-warning"" role=""alert""><strong>Warning!</strong> This is a demo website for web scraping purposes. Prices and ratings here were randomly assigned and have no real meaning.</div>
</div><!-- /col-sm-6 -->
</div><!-- /row -->
<div class=""sub-header"" id=""product_description"">
<h2>Product Description</h2>
</div>
<p>It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more</p>
<div class=""sub-header"">
<h2>Product Information</h2>
</div>
<table class=""table table-striped"">
<tr>
<th>UPC</th><td>a897fe39b1053632</td>
</tr>
<tr>
<th>Product Type</th><td>Books</td>
</tr>
<tr>
<th>Price (excl. tax)</th><td>£51.77</td>
</tr>
<tr>
<th>Price (incl. tax)</th><td>£51.77</td>
</tr>
<tr>
<th>Tax</th><td>£0.00</td>
</tr>
<tr>
<th>Availability</th>
<td>In stock (22 available)</td>
</tr>
<tr>
<th>Number of reviews</th>
<td>0</td>
</tr>
</table>
<section>
<div class=""sub-header"" id=""reviews"">
</div>
</section>
</article><!-- End of product page -->
</div>
</div>"

```
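Although the output spans many lines in a text editor, a field wrapped in double quotes with `""` for each inner quote is the standard way CSV stores a value containing line breaks, so a CSV parser should read it back as one field. The following is a minimal sketch of that check, using a shortened, assumed sample in the same shape as the output above; `parse_records` is a hypothetical helper name.

```python
import csv
import io

# Shortened stand-in for test.csv: one quoted field with embedded '\n'
# line breaks, terminated by a single '\r\n'.
sample = '"<div class=""content"">\n<div id=""promotions"">\n</div>\n</div>"\r\n'


def parse_records(text):
    """Parse CSV text into a list of records (hypothetical helper name)."""
    # newline='' keeps the embedded line breaks exactly as they were stored.
    return list(csv.reader(io.StringIO(text, newline='')))


records = parse_records(sample)
print(len(records), len(records[0]))  # number of records, fields in the first
html = records[0][0]                  # the HTML with its original quotes and '\n'
```

If the receiving web service parses the file with a real CSV reader instead of splitting on line breaks, it would get the HTML back as a single string in the same way.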