Exporting a PDF table to CSV as-is

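I am trying to export a table from a PDF to CSV, but the cells get split in odd places and the table is not extracted correctly.
What should I change so that the output keeps the layout of the original table?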

# -*- coding: utf-8 -*-

import sys


from tabula import read_pdf

import codecs
import os

def find_all_files(directory):
  for root, dirs, files in os.walk(directory):
    yield root
    for file in files:
      yield os.path.join(root, file)

tmp_path = os.getcwd().replace('/', os.sep)

for file in find_all_files(tmp_path):
  name, ext = os.path.splitext(file)
  if(ext.find('.pdf')>-1):
    print(file)
    df = read_pdf(file, guess=False, encoding='cp932', pandas_options={'header':None}, pages='all')
    df.to_csv(file+".csv")


2 answers

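Doing this directly is difficult, but you can use Spire.Office for .NET. Once you have installed the DLL and added it to your project references, you can convert between formats. First, convert the PDF to Excel: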

using Spire.Pdf;

namespace ConvertPDFToExcel
{
    class Program
    {
        static void Main(string[] args)
        {
            //Create a PdfDocument instance
            PdfDocument pdf = new PdfDocument();
            //Load the PDF file
            pdf.LoadFromFile("Shopping list.pdf");
            //Save it as an Excel file
            pdf.SaveToFile("PDFToExcel.xlsx", FileFormat.XLSX);
        }
    }
}

Then convert the resulting Excel document to CSV:

using Spire.Xls;
using System.Text;

namespace ConvertAWorksheetToCsv
{
    class Program
    {
        static void Main(string[] args)
        {
            //Create an instance of the Workbook class
            Workbook workbook = new Workbook();
            //Load the Excel file produced in the previous step
            workbook.LoadFromFile("PDFToExcel.xlsx");

            //Get the first worksheet
            Worksheet sheet = workbook.Worksheets[0];

            //Save the worksheet as CSV
            sheet.SaveToFile("ExcelToCSV.csv", ",", Encoding.UTF8);
        }
    }
}

That completes the conversion.
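If you would rather stay in Python, the same PDF-to-CSV flow can also be sketched with tabula-py itself. This is only a rough sketch under a few assumptions: that your tabula-py version returns a list of DataFrames from read_pdf (recent versions do by default), and that the table has ruled cell borders, so lattice=True is a reasonable mode to try. One possible cause of the odd splitting is guess=False without an area, which makes tabula analyze the whole page instead of the detected table region; the sketch leaves guess at its default. The helper names find_pdfs, csv_path_for and export_tables are hypothetical, not part of any library.

```python
from pathlib import Path


def find_pdfs(directory):
    """Yield every .pdf file under `directory`, recursively."""
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf":
            yield path


def csv_path_for(pdf_path, table_index):
    """Build an output name such as report_table1.csv next to the PDF."""
    return pdf_path.with_name(f"{pdf_path.stem}_table{table_index}.csv")


def export_tables(pdf_path):
    """Write each table tabula-py detects in `pdf_path` to its own CSV file."""
    # Imported here so the helpers above work without tabula-py (and Java).
    from tabula import read_pdf

    # Assumption: read_pdf returns a list of DataFrames, one per table.
    # lattice=True targets tables drawn with ruled borders; for tables that
    # are aligned only by whitespace, stream=True is the alternative to try.
    tables = read_pdf(
        str(pdf_path),
        pages="all",
        lattice=True,
        pandas_options={"header": None},
    )
    written = []
    for i, table in enumerate(tables, start=1):
        out = csv_path_for(pdf_path, i)
        # utf-8-sig helps Excel display Japanese text in the CSV correctly.
        table.to_csv(out, index=False, header=False, encoding="utf-8-sig")
        written.append(out)
    return written
```

Calling export_tables(pdf) for each path yielded by find_pdfs(os.getcwd()) would mirror the loop in the question. Whether lattice or stream gives the cleaner result depends on how the table is drawn in the PDF, so it is worth trying both on one page first.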

Posted 2022/03/11 07:00