Background / What I want to achieve
I want to convert timely-disclosure PDFs to text in Colaboratory.
I ran the code from the article below as-is, and it failed with an error.
https://qiita.com/Fortinbras/items/b892841109487f0f666b
Problem / Error message
Collecting pdfminer.six
  Downloading https://files.pythonhosted.org/packages/60/0a/5806bd37362bceebb88cff526177c308276b3e0696611564ed01d67b8c6b/pdfminer.six-20200124-py3-none-any.whl (5.6MB)
     |████████████████████████████████| 5.6MB 2.9MB/s
Collecting pycryptodome
  Downloading https://files.pythonhosted.org/packages/54/e4/72132c31a4cedc58848615502c06cedcce1e1ff703b4c506a7171f005a75/pycryptodome-3.9.6-cp36-cp36m-manylinux1_x86_64.whl (13.7MB)
     |████████████████████████████████| 13.7MB 30.5MB/s
Requirement already satisfied: sortedcontainers in /usr/local/lib/python3.6/dist-packages (from pdfminer.six) (2.1.0)
Requirement already satisfied: chardet; python_version > "3.0" in /usr/local/lib/python3.6/dist-packages (from pdfminer.six) (3.0.4)
Installing collected packages: pycryptodome, pdfminer.six
Successfully installed pdfminer.six-20200124 pycryptodome-3.9.6
/usr/local/bin/pdf2txt.py: line 2: $'A command line tool for extracting text and images from PDF and\noutput it to plain text, html, xml or tags.': command not found
/usr/local/bin/pdf2txt.py: line 3: import: command not found
/usr/local/bin/pdf2txt.py: line 4: import: command not found
/usr/local/bin/pdf2txt.py: line 5: import: command not found
/usr/local/bin/pdf2txt.py: line 7: import: command not found
/usr/local/bin/pdf2txt.py: line 8: import: command not found
/usr/local/bin/pdf2txt.py: line 12: syntax error near unexpected token `OUTPUT_TYPES'
/usr/local/bin/pdf2txt.py: line 12: `OUTPUT_TYPES = ((".htm", "html"),'
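The `import: command not found` messages suggest that the shell, not Python, is interpreting `/usr/local/bin/pdf2txt.py`. One thing that might be worth trying is passing the script to the Python interpreter explicitly. The following is only a minimal sketch under that assumption; `run_with_python` is a hypothetical helper name, not part of pdfminer.six, and it has not been confirmed to resolve this particular error.

```python
import shutil
import subprocess
import sys


def run_with_python(script, *args):
    """Run a Python script through the current interpreter and return its stdout."""
    # Look the script up on PATH (e.g. "pdf2txt.py"); otherwise treat it as a path.
    script_path = shutil.which(script) or script
    result = subprocess.run(
        [sys.executable, script_path] + list(args),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
        check=True,               # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


# Hypothetical usage in the notebook (assumes test.pdf was already downloaded):
# txt = run_with_python("pdf2txt.py", "test.pdf")
```

Because the interpreter is named explicitly, the script does not depend on its first line telling the shell which program should run it.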
Relevant source code
!pip install pdfminer.six
import os
import urllib.request

pdfpath = "test.pdf"
tkjkj = 'https://www.release.tdnet.info/inbs/'
pdf_url = '140120200212462806.pdf'
url = str(tkjkj) + str(pdf_url)
os.system("wget -O " + str(pdfpath) + " " + str(url))

py_url = 'https://github.com/pdfminer/pdfminer.six/blob/master/tools/pdf2txt.py'
py_fn = 'pdf2txt.py'
os.system("wget -O" + str(py_fn) + " " + str(py_url))

lines = !pdf2txt.py { pdfpath }
txt = '\n' .join(lines)
print(txt)
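Separately, `py_url` points at a GitHub `blob` page, which serves an HTML view rather than the script file itself. This is probably not the cause of the error shown, since the messages name `/usr/local/bin/pdf2txt.py` (the copy installed by pip), but it may matter if the downloaded file is ever run. A small sketch of rewriting such a URL to its `raw.githubusercontent.com` form follows; `github_blob_to_raw` is a hypothetical helper, and it assumes the branch name contains no slash.

```python
from urllib.parse import urlsplit


def github_blob_to_raw(url):
    """Convert a github.com '/blob/' page URL to its raw.githubusercontent.com form."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: owner/repo/blob/ref/path/to/file
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError("not a GitHub blob URL: " + url)
    owner, repo, _, ref = segments[:4]
    file_path = "/".join(segments[4:])
    return "https://raw.githubusercontent.com/{}/{}/{}/{}".format(
        owner, repo, ref, file_path
    )


py_url = 'https://github.com/pdfminer/pdfminer.six/blob/master/tools/pdf2txt.py'
raw_url = github_blob_to_raw(py_url)
```

Whether that path still exists on the `master` branch is not verified here.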