How to fix garbled text when writing a file with Python's write

Question

###Background / What I want to achieve
Good evening, and thank you as always for your help.
I am using Python 3 to write a simple program that extracts text from a PDF and writes it out to a text file.

###Problem / Error message

The output file 'abcdefg.txt' comes out garbled no matter what I try. Its first 50 lines look like this:

``` 
2ˆ
  

  

  
!·
  
/²
 
˘B
29

2
˙v
28
ˆ¥#'
 

 
&É
          
% 
 
5 
     
8€
 
&É
           
% 
 
5 
     
8€
 

 
˜v2(#Ø
 
#'5 8x5 
 
ˇC5 
 

 
˙¦ˆq-¶
```

###Relevant source code
```python3
import PyPDF2

# Read the first page of the PDF and extract its text (legacy PyPDF2 < 3.0 API).
with open('11.pdf', 'rb') as pdf_file_obj:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)
    page_obj = pdf_reader.getPage(0)
    content = page_obj.extractText()

# Write the extracted text to a UTF-8 text file.
with open('abcdefg.txt', 'w', encoding='utf-8') as new_obj:
    new_obj.write(content)
```

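###What I tried

After some searching I added the encoding argument to open(), but the text is still garbled.
I would be grateful for any advice.

###Additional information (language / framework / tool versions, etc.)
Python 3
PyPDF2
Windows 8.1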

Accepted Answer

According to [Extracting text from a PDF file using Python](https://stackoverflow.com/questions/34837707/extracting-text-from-a-pdf-file-using-python),
this is, in a sense, the expected behavior.

That question uses [sample.pdf](https://www.dropbox.com/s/4qad66r2361hvmu/sample.pdf?dl=1) as its example.
A typical PDF viewer displays it as
`This is a sample PDF document I’m using to follow along with the tutorial`
whereas PyPDF2 returns the following:
```
!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
```
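
Comparing the two strings character by character suggests a simple pattern: each distinct character appears to have been given a code in order of first appearance, starting at `!` (`T` becomes `!`, `h` becomes `"`, `i` becomes `#`, and so on). This is only an inference from this one sample, not something stated in the linked question or in the PyPDF2 documentation. The sketch below uses a hypothetical helper, `subset_encode`, that imitates that numbering on the readable sentence; it does not read or parse any PDF.

```python
def subset_encode(text, first_code=0x21):
    """Replace each character with a code assigned in order of first appearance.

    Hypothetical helper for illustration: it imitates a font whose private
    character codes start at '!' (0x21). It does not parse a PDF.
    """
    table = {}
    encoded = []
    for ch in text:
        if ch not in table:
            # First time we see this character: give it the next free code.
            table[ch] = chr(first_code + len(table))
        encoded.append(table[ch])
    return "".join(encoded), table


readable = "This is a sample PDF document I\u2019m using to follow along with the tutorial"
garbled, table = subset_encode(readable)
print(garbled)

# Getting the readable text back requires the code-to-character table.
reverse = {code: ch for ch, code in table.items()}
restored = "".join(reverse[c] for c in garbled)
print(restored == readable)
```

If the inference holds, the text is recoverable only with the mapping table, which would explain why a tool that returns the raw codes produces output that looks like noise.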

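When I ran the following in my environment, the page information appeared to be read correctly, and extractText returned the same result as in the question linked above.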
```Python
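import PyPDF2

# Legacy PyPDF2 (< 3.0) API, matching the question's code.
pdf_file_obj = open('sample.pdf', 'rb')  # the sample PDF linked above, saved locally
pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)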
page_obj = pdf_reader.getPage(0)
print(page_obj)
"""
{'/Type': '/Page', '/Parent': IndirectObject(3, 0), '/Contents': IndirectObject(4, 0), '/Resources': IndirectObject(6, 0), '/MediaBox': [0, 0, 612, 792]}
"""
content = page_obj.extractText()
print(content)
"""
!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%
"""
```

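As far as I can tell, the string `!"#$%#$...` embedded in the PDF has some kind of encoding applied to it, and that encoded string is being returned as-is.
The answer to the question above recommends extracting the text with a different module, `textract`, so how about trying `textract`?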
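
A minimal sketch of that approach, assuming `textract` is installed (it is a third-party package and may need additional external tools for PDF support) and that its output is UTF-8 bytes, which I believe is the default. The helper names `extract_bytes` and `write_text` are made up for this example; the file names are the ones from the question.

```python
import os


def extract_bytes(pdf_path):
    """Return the document's text as bytes using textract."""
    # Imported lazily so the rest of the script runs without textract installed.
    import textract
    return textract.process(pdf_path)


def write_text(raw, out_path, source_encoding="utf-8"):
    """Decode extracted bytes and save them as a UTF-8 text file."""
    # Assumption: the bytes are UTF-8 unless the caller says otherwise.
    text = raw.decode(source_encoding)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)
    return text


if __name__ == "__main__":
    # Only attempt the extraction when the question's input file is present.
    if os.path.exists("11.pdf"):
        write_text(extract_bytes("11.pdf"), "abcdefg.txt")
```

Whether this produces readable text for a particular PDF still depends on how the file was generated, so it is worth checking the result on the first page before processing a whole document.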

#### Supplement
The official documentation for [extractText()](https://pythonhosted.org/PyPDF2/PageObject.html#PyPDF2.pdf.PageObject.extractText) also says the following:
> Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

In short, `extractText()` only pulls out the strings from the PDF's text drawing commands as they are, and it still seems to be a fairly limited feature.
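
Because the string is already garbled by the time `extractText()` returns it, the `encoding` argument of `open()` is unlikely to be the cause. The following sketch illustrates this with the first line of the sample output above standing in for `content`: writing it as UTF-8 and reading it back gives exactly the same string, so the file step preserves whatever it is given. The temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

# First line of the garbled sample output, standing in for what extractText() returned.
content = '!"#$%#$%&%$&\'()*%+,-%./01\'*23%4'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "abcdefg.txt")
    # Same write call as in the question.
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    with open(path, "r", encoding="utf-8") as f:
        read_back = f.read()

# The file contains exactly the string that was written.
print(read_back == content)
```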

Answer

Try setting the encoding to shift_jis instead of utf-8.
