AWS Lambda (Python): Cannot convert CSV data to JSON

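Background / what I want to achieve

When a file such as an image of a printed table is uploaded to AWS S3,
I want it to be converted into a JSON file that uses the table headers as field names.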

json
{
  "kokyaku_id" : "a001",
  "address" : "Osaka Pref-Osaka"
  (rest omitted)
}

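Following the official documentation, I use Amazon Textract to extract the text from the image file and convert it to CSV first. I then try to convert that CSV to JSON with the csv module, as described on a reference site.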

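Problem / error message

- Every field name is "k", the first character of "kokyaku_id".
- Values are split one character at a time instead of at each comma.
  (Moreover, the first value is the second character of "kokyaku_id".)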

['{"k": "o"}', '{"k": "k"}', '{"k": "y"}', '{"k": "a"}', '{"k": "k"}', '{"k": "u"}', '{"k": "i"}', '{"k": "d"}', '{"k": " "}', '{"k": "", "null": [""]}',
 '{"k": "a"}', '{"k": "d"}', '{"k": "d"}', '{"k": "r"}', '{"k": "e"}', '{"k": "s"}', '{"k": "s"}', '{"k": " "}', '{"k": "", "null": [""]}', (rest omitted)

Relevant source code

python
import json
import boto3
from datetime import datetime
import csv as csvlib

resourceS3 = boto3.resource('s3')

def lambda_handler(event, context):

    # Document
    s3BucketName = "xxxxxxxxxx"
    documentName = "xxxxxx.png"

    # Amazon Textract client
    textract = boto3.client('textract')

    response = textract.analyze_document(
        Document={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': documentName
            }
        },
        FeatureTypes=['TABLES'])

    # Get the text blocks
    blocks=response['Blocks']
    #print(blocks)

    blocks_map = {}
    table_blocks = []
    for block in blocks:
        blocks_map[block['Id']] = block
        if block['BlockType'] == "TABLE":
            table_blocks.append(block)

    if len(table_blocks) <= 0:
        return "<b> NO Table FOUND </b>"

    wk_csv = ''

    for index, table in enumerate(table_blocks):
        wk_csv += generate_table_csv(table, blocks_map, index +1)
    #    wk_csv += '\n\n'

The CSV to JSON conversion starts here.

    result = []
    
    for line in csvlib.DictReader(wk_csv):
        line_json = json.dumps(line)
        result.append(line_json)
    print(result)

The rest of the code formats the Textract output as CSV.

def get_rows_columns_map(table_result, blocks_map):
    rows = {}
    for relationship in table_result['Relationships']:
        if relationship['Type'] == 'CHILD':
            for child_id in relationship['Ids']:
                cell = blocks_map[child_id]
                if cell['BlockType'] == 'CELL':
                    row_index = cell['RowIndex']
                    col_index = cell['ColumnIndex']
                    if row_index not in rows:
                        # create new row
                        rows[row_index] = {}
                        
                    # get the text value
                    rows[row_index][col_index] = get_text(cell, blocks_map)
    return rows


def get_text(result, blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                for child_id in relationship['Ids']:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
                    if word['BlockType'] == 'SELECTION_ELEMENT':
                        if word['SelectionStatus'] =='SELECTED':
                            text +=  'X '    
    return text

def generate_table_csv(table_result, blocks_map, table_index):
    rows = get_rows_columns_map(table_result, blocks_map)

    #table_id = 'Table_' + str(table_index)
    
    # get cells.
    #csv = 'Table: {0}\n\n'.format(table_id)
    csv = ""
    for row_index, cols in rows.items():
        
        for col_index, text in cols.items():
            csv += '{}'.format(text) + ","
        csv += '\n'
        
    #csv += '\n\n\n'
    return csv

if __name__ == "__main__":
    # Local test entry point; Lambda itself calls lambda_handler directly.
    lambda_handler({}, None)
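
As a side note, the `rows` mapping returned by `get_rows_columns_map` already holds every cell by row and column index, so the intermediate CSV may not be strictly necessary. The following is a minimal sketch of building the records directly from that mapping. It assumes the first table row is the header row; the helper name `rows_to_records` and the sample data are hypothetical and not part of the original code.

```python
import json


def rows_to_records(rows):
    """Turn {row_index: {col_index: text}} into a list of dicts.

    Assumes the row with the smallest index is the header row.
    """
    ordered = [rows[r] for r in sorted(rows)]
    if not ordered:
        return []
    header_row = ordered[0]
    columns = sorted(header_row)
    records = []
    for cols in ordered[1:]:
        record = {}
        for c in columns:
            # get_text() appends a space after each word, so strip it.
            name = header_row[c].strip()
            if name:
                record[name] = cols.get(c, "").strip()
        records.append(record)
    return records


# Hand-written sample shaped like the output of get_rows_columns_map.
sample_rows = {
    1: {1: "kokyaku_id ", 2: "address "},
    2: {1: "a001 ", 2: "Osaka Pref-Osaka "},
}
records = rows_to_records(sample_rows)
print(json.dumps(records))
```

Whether this is preferable depends on whether the CSV is needed elsewhere; if it is, keeping the CSV step and fixing only the reader call is the smaller change.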

What I tried

I suspected that the CSV itself was not being generated correctly, so I wrote it out, but as shown below it looked fine.

csv
kokyaku_id ,address ,birth_ymd ,kana_name ,kanji_name ,last_update_ymd ,
a001 ,Osaka Pref-Osaka ,19710120 ,hoge ,HOG E ,20200201 ,

2 answers

Best answer

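In `for line in csvlib.DictReader(wk_csv):`, DictReader iterates over its argument and treats each item as one line of CSV.
Iterating over a plain string yields one character at a time, so every character is parsed as a separate line, which produces the odd result.
Wrap the string in StringIO, as shown below, so that it is passed as a stream object that yields whole lines.
Reference: Python csv.DictReader: parse string?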

Python
import csv as csvlib
from io import StringIO

csv = """c1,c2\n1,a\n2,b"""

# NG: iterating over a str yields one character at a time
for line in csvlib.DictReader(csv):
    print(line)
"""
OrderedDict([('c', '1')])
OrderedDict([('c', ''), (None, [''])])
OrderedDict([('c', 'c')])
OrderedDict([('c', '2')])
OrderedDict([('c', '1')])
OrderedDict([('c', ''), (None, [''])])
OrderedDict([('c', 'a')])
OrderedDict([('c', '2')])
OrderedDict([('c', ''), (None, [''])])
OrderedDict([('c', 'b')])
"""

# OK: StringIO yields one line at a time
for line in csvlib.DictReader(StringIO(csv)):
    print(line)
"""
OrderedDict([('c1', '1'), ('c2', 'a')])
OrderedDict([('c1', '2'), ('c2', 'b')])
"""
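
Applied to the CSV shown in the question, wrapping the string in StringIO would likely still leave two cosmetic issues: every cell ends with a space, and the trailing comma on each row produces an extra empty column. The following is a minimal sketch that handles both, not the asker's actual code. The helper name `csv_text_to_json_lines` is hypothetical, and the sample text is copied from the CSV output in the question.

```python
import csv
import json
from io import StringIO

# Sample CSV copied from the question: every cell ends with a space
# and every row ends with a trailing comma.
wk_csv = (
    "kokyaku_id ,address ,birth_ymd ,kana_name ,kanji_name ,last_update_ymd ,\n"
    "a001 ,Osaka Pref-Osaka ,19710120 ,hoge ,HOG E ,20200201 ,\n"
)


def csv_text_to_json_lines(csv_text):
    """Convert CSV text to a list of JSON strings, one per data row."""
    result = []
    # StringIO makes DictReader read the text line by line.
    for row in csv.DictReader(StringIO(csv_text)):
        cleaned = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            # Skip the empty column produced by the trailing comma
            # (and the None key used for surplus cells).
            if key is not None and key.strip()
        }
        result.append(json.dumps(cleaned))
    return result


result = csv_text_to_json_lines(wk_csv)
print(result)
```

In the Lambda handler, the equivalent change would be to pass `StringIO(wk_csv)` to `csvlib.DictReader` instead of `wk_csv`; the stripping and column filtering are optional clean-up on top of that.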