Morphological analysis with MeCab on web-scraped content

Question

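###Background / what I want to achieve
 I am building a Chrome extension. I want to send the URL of each page the user visits to the server, extract the main text of that page, and run morphological analysis on it with MeCab.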

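###Problem / error message
 I can extract the main text from the web, and I can run morphological analysis on a sample sentence I prepared myself and send its nouns to MySQL. However, when I run morphological analysis on the extracted text, nothing is sent to MySQL.
Also, when I `print` the result, the following error appears.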
```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 2: invalid start byte
```
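So I checked the type of the scraped `text`, but it is `str`, which I would not expect to cause a problem. (The sample sentence that was analyzed successfully is also `str`.)
###Relevant source code
I ran the Python script as a CGI program.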
```python
#!/usr/local/bin/python3.6

print("Content-Type: text/html; charset=UTF-8")
print()

import cgi
import cgitb; cgitb.enable()
import json
#import bs4
#import urllib.request, urllib.parse
from readability.readability import Document
import urllib
import html2text
import MeCab
import mysql.connector
import re

form = cgi.FieldStorage()
for key in form.keys():
    url = form.getvalue(key)
    html = urllib.request.urlopen(url)
    s = html.read()
    article = Document(s).summary()
    text = html2text.html2text(article)
    text = text.strip("")
    text = text.replace("http","")
    text = text.replace("/","")
    text = text.replace("[","")
    text = text.replace("]","")
    text = text.replace("%","")
    text = text.replace(" ","")
    text = text.replace("{","")
    text = text.replace("}","")
    text = text.replace("%","")
    text = text.replace(":","")
    text = text.replace("#","")
    text = text.replace("*","")
    text = text.replace("\n","")
    text = text.replace("|","")
    text = text.replace("-","")
    text = text.replace(".","")
    text = text.replace("?","")
    text = text.replace("(","")
    text = text.replace(")","")
    text = text.replace("<","")
    text = text.replace(">","")
    text = text.replace("¥¥","")
    text = text.replace("¥n","")
    text = text.replace("1","")
    text = text.replace("2","")
    text = text.replace("3","")
    text = text.replace("4","")
    text = text.replace("5","")
    text = text.replace("6","")
    text = text.replace("7","")
    text = text.replace("8","")
    text = text.replace("9","")
    text = text.replace("0","")
    text = text.replace("＠","")
    text = text.replace("\◆","")
    text = text.replace("◆","")

    tagger = MeCab.Tagger('-d ./mecab-ipadic-neologd')
    tagger.parse('')
    input = text
    result = tagger.parseToNode(input)
    node = tagger.parseToNode(input)
    target_parts_of_speech = ('名詞', )
    words = []
    print(node.feature)
    while node:
        if node.feature.split(',')[0] in target_parts_of_speech:
            print(type(node.surface))
            words.append(node.surface)
        node = node.next
    word = ','.join(words)
    config = {
      'user': 'root',
      'password': 'root',
      'unix_socket': '/Applications/MAMP/tmp/mysql/mysql.sock',
      'database': 'hoge',
      'raise_on_warnings': True,
    }
    link = mysql.connector.connect(**config)
    cursor = link.cursor()

    cursor.execute('''insert into horizon (url,text) values (%s,%s)''', [url,word])
    link.commit()

    cursor.execute("select * from horizon;")
    for row in cursor.fetchall():
        print(row[0],row[1],row[2])

    cursor.close()
    link.close()
```

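###What I tried
The scraped text contained a lot of unneeded symbols and whitespace, and it was not recognized as a string even when I pasted it into `text = ""`. I therefore tried simplifying the text with `replace`, but MeCab still does not respond.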
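
The chain of `replace` calls described above could also be written as a single pass with `str.translate`. This is only a sketch: `clean_text` and the character list are illustrative names and values covering roughly the symbols and digits the question strips, and the sketch does not address the decoding error itself.

```python
# Roughly the symbols and digits the question removes one replace() at a time.
_SYMBOLS = "/[]%{}:#*|-.?()<>＠◆ \n"
_DIGITS = "0123456789"
# A translation table whose third argument lists characters to delete.
_DELETE_TABLE = str.maketrans("", "", _SYMBOLS + _DIGITS)


def clean_text(text: str) -> str:
    """Drop the substring 'http' and every character listed above."""
    return text.replace("http", "").translate(_DELETE_TABLE)
```

With this, the cleanup in the loop would shrink to `text = clean_text(text)`.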

###Additional information (language / framework / tool versions)
My development environment is as follows.
- macOS Sierra
- Python 3
- MAMP version 4.2

Accepted Answer

Judging from `0xbb in position 2` in the error you posted, the HTML data in the response probably carries a `UTF-8 BOM`. Reference: [Byte order mark](https://ja.wikipedia.org/wiki/%E3%83%90%E3%82%A4%E3%83%88%E3%82%AA%E3%83%BC%E3%83%80%E3%83%BC%E3%83%9E%E3%83%BC%E3%82%AF)

So I fetched an `.html` file saved as UTF-8 with BOM from a local server. The raw `HTML (binary data)` of the HTTP response does contain the `BOM`, but after `Document(s).summary()` converts it to `str`, the `BOM` appears to have been removed properly. I therefore could not reproduce the error. If you can share source data (a URL) that reproduces it, I may be able to investigate further.

Test environment: Win10, Python3.5.x, readability-lxml-0.6.2 (+cssselect-1.0.1), html2text-2016.9.19; the HTTP server is the IIS bundled with the OS.

utf8bom.html (saved as UTF-8 with BOM)
```html
タイトル本文
```

Test script
```python
# -*- coding: utf-8 -*-
import sys
print(sys.getdefaultencoding())
print(sys.stdin.encoding)
print(sys.stdout.encoding)

import urllib.request
url = 'http://localhost/utf8bom.html'
html = urllib.request.urlopen(url)
s = html.read()
print(type(s))
print('s----- ',repr(s))

from readability.readability import Document
article = Document(s).summary()
print(type(article))
print('article----- ',repr(article))

import html2text
text = html2text.html2text(article)
print(type(text))
print('text----- ',repr(text))

import MeCab
tagger = MeCab.Tagger('-Ochasen')
tagger.parse('')
node = tagger.parseToNode(text)
while node:
    print( node.surface,node.feature)
    node = node.next
```

Output
```
utf-8
cp932
cp932
s----- b'\xef\xbb\xbf \xe3\x82\xbf\xe3\x82\xa4\xe3\x83\x88\xe3\x83\xab \xe6\x9c\xac\xe6\x96\x87 '
article----- ' 本文 '
text----- '本文 '
pathname[C:\Program Files\Anaconda3\lib\site-packages\_MeCab.cp35-win_amd64.pyd]
desc[('.cp35-win_amd64.pyd', 'rb', 3)]
BOS/EOS,*,*,*,*,*,*,*,*
本文名詞,一般,*,*,*,*,本文,ホンブン,ホンブン
BOS/EOS,*,*,*,*,*,*,*,*
```

#### Addendum: differences between execution environments
Regarding the cause of the `print` error you posted in the comments on this answer: since the encoding differs from the one in the question body, my guess is that the script runs as CGI and that its standard output encoding, `US-ASCII`, cannot represent Japanese. Note that the standard output encoding is quite likely to differ between a terminal (shell) and CGI. The reason and the fix (reopening standard output with TextIOWrapper) are described here: [[python3] Specifying the default character encoding (when running as CGI)](http://chidipy.jpn.com/topics/?p=309). In other words, a script can work in a terminal yet raise an encoding error under CGI.
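
As a rough illustration of the TextIOWrapper approach mentioned above, standard output can be rewrapped so that `print` emits UTF-8 regardless of the CGI environment's default. The helper name `force_utf8` is made up for this sketch, and it assumes the stream exposes a binary `.buffer`, as `sys.stdout` normally does.

```python
import io


def force_utf8(stream):
    """Return a text stream that writes UTF-8 to the same underlying buffer.

    `force_utf8` is a hypothetical helper; `stream` is assumed to have a
    binary `.buffer` attribute, like sys.stdout.
    """
    # line_buffering=True flushes on every newline, which suits CGI output.
    return io.TextIOWrapper(stream.buffer, encoding="utf-8", line_buffering=True)


# Intended use near the top of the CGI script, before the first print():
#   import sys
#   sys.stdout = force_utf8(sys.stdout)
```

Whether this resolves the error in the question depends on where the exception is actually raised, so it is worth checking the traceback from `cgitb` first.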

Answer

If the error is a UnicodeDecodeError, could the method described in the following article solve it?

[How to handle UnicodeDecodeError when running a Python script](http://d.hatena.ne.jp/shu223/20111201/1328334689)

Answer

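> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 2: invalid start byte

The cause is the byte `0xbb`, which cannot be decoded as UTF-8. `0xbb` is part of the byte order mark (BOM, `EF BB BF` in UTF-8), which shows up when you read a file saved with a BOM, for example. Besides the BOM, the NUL character (\0) is also quite dangerous. Control characters like these (`[\x00-\x1f\x7f]`) are generally risky, so how about removing them with a regular expression from the `re` module before analyzing the text with MeCab?

```python
import re

# Remove control characters from the HTML string (this also removes newlines)
text = re.sub(r'[\x00-\x1f\x7f]+', '', text)
# To strip only the BOM, a round trip through utf_8_sig also works.
# In Python 3, str has no decode(), so encode first and then decode.
text = text.encode('utf-8').decode('utf_8_sig')
```

[http://docs.python.jp/3.6/library/re.html#re.sub](http://docs.python.jp/3.6/library/re.html#re.sub)
[http://docs.python.jp/3.6/library/stdtypes.html#str.encode](http://docs.python.jp/3.6/library/stdtypes.html#str.encode)

However, HTML code of the form `...` renders incorrectly once its line breaks are dropped, so take care if you want to scrape the HTML itself. Also, if the target websites are arbitrary, you will likely run into character-encoding issues. Most recent websites use UTF-8, but some sites are encoded in Shift-JIS or EUC-JP. Foreign-language sites add a far larger set of encodings, so you need to fix the specification to some extent (for example, UTF-8 only).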
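
One way to handle the encoding point above is to decode the response body explicitly, using the charset declared by the server when there is one. The following is a minimal sketch, not a complete detection scheme: `decode_body` and `declared` are illustrative names, and pages that declare their charset only in a `<meta>` tag are not covered.

```python
from typing import Optional


def decode_body(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode an HTTP body to str and drop a leading BOM.

    `declared` is assumed to be the charset from the Content-Type header,
    or None when the server did not send one.
    """
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            text = raw.decode(encoding)
            break
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes that do not fit: try the next one.
            continue
    else:
        # Last resort: keep going, replacing undecodable bytes with U+FFFD.
        text = raw.decode("utf-8", errors="replace")
    # A UTF-8 BOM decodes to U+FEFF; remove it if it leads the text.
    return text.lstrip("\ufeff")
```

With `urllib.request`, the declared charset is available as `response.headers.get_content_charset()`, so a call might look like `decode_body(response.read(), response.headers.get_content_charset())`.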

Answer

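I don't know Python at all, but I found a site that looks useful for everything except the database part, so I'm sharing it.

[Building a search engine in Python (web crawler, MeCab, MongoDB, Flask)](http://nwpct1.hatenablog.com/entry/python-search-engine)

You may already know about it, though.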

Answer

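Does it work when you pass in just the text you want to analyze?
If so, rather than blindly converting the whole HTML to text, I think it would be better to pick out the part you want using XPath or a CSS selector.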
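
The XPath idea can be sketched with the standard library alone. This example assumes well-formed markup and uses the limited XPath subset of `xml.etree.ElementTree`; the sample page, the `article` id, and the `extract_text` helper are all hypothetical. Real pages are rarely well-formed, so a lenient HTML parser such as lxml would be the more realistic choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for illustration.
SAMPLE = """<html><body>
<div id="nav">メニュー</div>
<div id="article"><h1>タイトル</h1><p>本文です。</p></div>
</body></html>"""


def extract_text(markup: str, xpath: str) -> str:
    """Return the concatenated text of the first element matching `xpath`."""
    root = ET.fromstring(markup)
    node = root.find(xpath)
    if node is None:
        return ""
    # itertext() walks the element and its descendants in document order.
    return "".join(node.itertext()).strip()
```

For the sample above, `extract_text(SAMPLE, ".//div[@id='article']")` returns only the article text and skips the navigation block, so far less cleanup is needed before handing the text to MeCab.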
