Question edit history

Change

2018/05/12 09:37

Posted

trafalbad

Score 303

test CHANGED Viewed

	@@ -1 +1 @@
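1	- Code error when scraping images from a site whose img tags have no .jpg element
1	+ How to use a regular expression to get only the strings starting with "http" inside the src attribute of an HTML img tag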

test CHANGED Viewed

@@ -1,10 +1,36 @@
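 I want to scrape images from a site, but when I inspect [the target site](http://www.asos.com/search/dress?page=1&q=dress) in Google Chrome, the src attribute of its img tags has no file extension such as .jpg. It appears to contain links to the images instead.
-I tried to scrape it using BeautifulSoup together with a regular expression, but it does not work.
+So I would like to use a regular expression to get the strings starting with "http" inside the src attribute of the img tags. Is there a good way to do this with Python regular expressions?
-Apart from the regular expression being wrong, I am also not sure whether the images can be scraped correctly at all. Is it possible to scrape a site like this, where there is no .jpg extension? Please tell me what I should correct.
+Inside the src attribute of the img tag, a string starting with http is followed by a space and then 400px, and after another space a further string starting with http follows; this pattern repeats.
+Example: http~ 400px http~ 200px http~ 300px http~
+For text like this, how should I write a regular expression that gets only the http strings inside the src attribute of the img tag?
+The regular expression below does not seem to work, so I would appreciate your advice.
+# Regular expression in question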
+```python
+for link in soup.find_all('img'):
+    images.append(urljoin(URL, link.get('src'=re.compile('^http.*$'))))
+```
@@ -18,7 +44,7 @@
 import re
-from bs4 import BeautifulSoup
@@ -34,34 +60,4 @@
     images.append(urljoin(URL, link.get('src'=re.compile('^http.*$'))))
-for target in images: # take each target from images
-    re = requests.get(target)
-    with open('/Users/Downloads/img/' + target.split('/')[-1], 'wb') as f: # save into the img folder
-        f.write(re.content) # write .content as image data
-print("ok") # confirmation
 ```
-# Error
-```
-File "scrapying.py", line 11
-    images.append(urljoin(URL, link.get('src'=re.compile('^http.*$'))))
-                                       ^
-SyntaxError: keyword can't be an expression
-```
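
One way to approach the revised question is sketched below. It assumes the attribute value really has the shape given in the example (`http~ 400px http~ 200px ...`), with the URLs and size tokens separated by whitespace. The sample URLs and the helper name `extract_urls` are hypothetical and are not taken from the target site. A pattern such as `^http.*$` is anchored to the start and end of the string and `.*` is greedy, so at best it matches the whole value once. Matching runs of non-whitespace characters with `re.findall` returns each URL separately.

```python
import re

# Hypothetical attribute value in the shape described in the question:
# URL, whitespace, size token, whitespace, URL, ...
attr_value = (
    "http://example.com/a?wid=400 400px "
    "http://example.com/b?wid=200 200px "
    "http://example.com/c?wid=300 300px "
    "http://example.com/d"
)

# \S+ matches a run of non-whitespace characters, so each match
# stops at the whitespace in front of the following size token.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(value):
    """Return every http(s) URL found in a whitespace-separated string."""
    return URL_PATTERN.findall(value)


urls = extract_urls(attr_value)
print(urls)
```

If the real attribute separates entries with commas or uses protocol-relative URLs, the pattern would need adjusting; the sketch only covers the whitespace-separated case described in the question.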

Question changed

2018/05/12 09:36

Posted

trafalbad

Score 303

test CHANGED Viewed

	@@ -1 +1 @@
1	- How to upload a large number of images when requesting labeling of a large number of images on Amazon Mechanical Turk
1	+ Code error when scraping images from a site whose img tags have no .jpg element

test CHANGED Viewed

@@ -1,173 +1,67 @@
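+I want to scrape images from a site, but when I inspect [the target site](http://www.asos.com/search/dress?page=1&q=dress) in Google Chrome, the src attribute of its img tags has no file extension such as .jpg. It appears to contain links to the images instead.
-I am thinking of having images labeled on Amazon Mechanical Turk.
+I tried to scrape it using BeautifulSoup together with a regular expression, but it does not work.
-The editing screen lets me edit the source code (HTML), and as in [this example](https://blog.makky.io/articles/2017/06/25/mturk-jp/), I am thinking of placing two checkboxes, train and test, on each image according to some conditions and having about 1000 images labeled.
-I will describe the conditions later, but first of all, what kind of HTML would let me set up checkboxes and labeling for as many as 1000 images?
-Instead of editing the HTML, images are uploaded by URL, but the images are on my local machine to begin with, and creating a shared Dropbox folder and pasting its URL does not work either.
-How do people who request labeling of large numbers of images on Amazon Mechanical Turk upload so many images with checkboxes attached? I can hardly believe they edit the HTML by hand...
+Apart from the regular expression being wrong, I am also not sure whether the images can be scraped correctly at all. Is it possible to scrape a site like this, where there is no .jpg extension? Please tell me what I should correct.
-If you know anything about this, please let me know.
 # Source code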
-```html
+```python
-<!-- HIT template: ImageTagging-v3.0 --><!-- Bootstrap v3.0.3 --><!-- Please note that Bootstrap CSS/JS and JQuery are 3rd party libraries that may update their url/code at any time. Amazon Mechanical Turk (MTurk) is including these libraries as a default option for you, but is not responsible for any changes to the external libraries -->
+import requests
-<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.0.3/css/bootstrap.min.css" integrity="sha384-IS73LIqjtYesmURkDE9MXKbXqYA8rvKEp/ghicjem7Vc3mGRdQRptJSz60tvrB6+" rel="stylesheet" /><!-- The following snippet enables the 'responsive' behavior on smaller screens -->
+from requests.compat import urljoin
-<meta content="width=device-width,initial-scale=1" name="viewport" /><!-- Instructions -->
+import re
-<section class="container" id="TaggingOfAnImage">
-<div class="row">
-<div class="col-xs-12 col-md-12"><!-- Instructions -->
-<div class="panel panel-primary"><!-- WARNING: the ids "collapseTrigger" and "instructionBody" are being used to enable expand/collapse feature --><a class="panel-heading" href="javascript:void(0);" id="collapseTrigger"><strong>Image Tagging Instructions</strong> <span class="collapse-text">(Click to expand)</span> </a>
-<div class="panel-body" id="instructionBody">Select train image and test image bleow pilincipls:</div>
+from bs4 import BeautifulSoup
-<div class="panel-body">train image:&nbsp;</div>
+URL = 'http://www.asos.com/search/dress?page=1&q=dress' # target URL
-</div>
+images = [] # list of image URLs
-</div>
-</div>
-<!-- End Instructions --><!-- Image Tagging Layout -->
-<div class="row" id="workContent">
+soup = BeautifulSoup(requests.get(URL).content,'lxml') # parse the page at URL with BeautifulSoup
-<div class="col-xs-12 col-sm-8 image"><img alt="image_url" class="img-responsive center-block" src="https://www.dropbox.com/s/uxrwpfrnbm0ufxs/0a1f0a6016bdea001d4ba02a42c015a7ae2ab892.jpg?dl=0" /></div>
+for link in soup.find_all('img'):
+    images.append(urljoin(URL, link.get('src'=re.compile('^http.*$'))))
-<div class="col-xs-12 col-sm-4 fields">
+for target in images: # take each target from images
+    re = requests.get(target)
-<h3 class="form-group"><label for="tag1">Tag 1:train</label></h3>
+    with open('/Users/Downloads/img/' + target.split('/')[-1], 'wb') as f: # save into the img folder
+        f.write(re.content) # write .content as image data
-<div class="form-group"><input class="form-control" id="tag1" maxlength="20" name="tag1" required="" size="30" type="text" value="0" /></div>
+print("ok") # confirmation
+```
-<div class="form-group"><label for="tag2">Tag 2:test</label><input class="form-control" id="tag2" maxlength="20" name="tag2" required="" size="30" type="text" value="1" /></div>
-<div class="form-group"><label for="tag3">Tag 3:None</label><input class="form-control" id="tag3" maxlength="20" name="tag3" required="" size="30" type="text" value="2" /></div>
-<div class="form-group"><label class="group-label">Tag 4:</label>
-<div class="radio"><label><input autocomplete="off" id="option1" name="Tag4" required="" type="radio" value="yes" /> Option 1 </label></div>
-<div class="radio"><label><input autocomplete="off" id="option2" name="Tag4" required="" type="radio" value="no" /> Option 2 </label></div>
-</div>
-</div>
-</div>
-</section>
-<!-- End Image Tagging Layout --><!-- Open internal style sheet -->
-<style type="text/css">#collapseTrigger{
-  color:#fff;
-  display: block;
-  text-decoration: none;
-}
-#submitButton{
-  white-space: normal;
-}
-.image{
-  margin-bottom: 15px;
-}
-.radio:first-of-type{
-  margin-top: -5px;
-}
-</style>
-<!-- Close internal style sheet --><!-- Please note that Bootstrap CSS/JS and JQuery are 3rd party libraries that may update their url/code at any time. Amazon Mechanical Turk (MTurk) is including these libraries as a default option for you, but is not responsible for any changes to the external libraries --><script src="https://code.jquery.com/jquery-3.1.0.min.js" integrity="sha256-cCueBR6CsyA4/9szpPfrX3s49M9vUU5BgtiJj06wt/s=" crossorigin="anonymous"></script><script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.0.3/js/bootstrap.min.js" integrity="sha384-s1ITto93iSMDxlp/79qhWHi+LsIi9Gx6yL+cOKDuymvihkfol83TYbLbOw+W/wv4" crossorigin="anonymous"></script><script>
-  $(document).ready(function() {
-    // Instructions expand/collapse
-    var content = $('#instructionBody');
-    var trigger = $('#collapseTrigger');
-    content.hide();
-    $('.collapse-text').text('(Click to expand)');
-    trigger.click(function(){
-      content.toggle();
-      var isVisible = content.is(':visible');
-      if(isVisible){
-        $('.collapse-text').text('(Click to collapse)');
-      }else{
-        $('.collapse-text').text('(Click to expand)');
-      }
-    });
-    // end expand/collapse
-  });
-</script>
+# Error
 ```
+File "scrapying.py", line 11
+    images.append(urljoin(URL, link.get('src'=re.compile('^http.*$'))))
+                                       ^
+SyntaxError: keyword can't be an expression
+```
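
A note on the error above, offered as a sketch rather than a definitive fix. Inside a call, `'src'=...` is parsed as a keyword argument whose name is a string literal, which Python rejects before the script runs. Separately, `link.get` takes an attribute name and returns its value; it does not accept a pattern. In BeautifulSoup the attribute filter belongs in `find_all` instead, for example `soup.find_all('img', src=re.compile('^http'))`. To stay self-contained, the example below shows the same filtering idea with the standard library's `html.parser` in place of BeautifulSoup; the class name `ImgSrcCollector` and the sample markup are hypothetical.

```python
import re
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src values of <img> tags that match a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Skip tags that have no src, or whose src does not match.
        if src and self.pattern.match(src):
            self.sources.append(src)


# Hypothetical markup standing in for the downloaded page.
SAMPLE_HTML = """
<img src="http://example.com/image-one?wid=400">
<img src="/static/logo.png">
<img alt="no source">
<img src="https://example.com/image-two">
"""

collector = ImgSrcCollector(re.compile(r"^http"))
collector.feed(SAMPLE_HTML)
print(collector.sources)
```

The download loop in the question also assigns `re = requests.get(target)`, which rebinds the name of the `re` module; a different name such as `response` would avoid that.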