Unable to get the font size of text in images such as PDFs

Question

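I have tried recognizing text in an image and got that working, but
as the title says, I could not find any way to get the font size of the text in the image, however much I searched.

I assumed there would be something like getFontSize and looked through the functions in the OCR library, but found nothing.

The OCR library I used is pyocr.

For reference, here is the source I used to try text recognition.
It does not yet recognize the actual characters correctly, but it runs.

Can this be handled by modifying or adding to this source?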


```python

from PIL import Image
import sys
import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]

txt = tool.image_to_string(
    Image.open('D:/sample.png'),
    lang="jpn+eng",
    builder=pyocr.builders.TextBuilder(tesseract_layout=6)
)

print(txt)

```


If anyone knows how to do this, I would be grateful for your advice.
Thank you in advance.

Answer

Tesseract does have an API for retrieving the font size, but unfortunately pyocr does not appear to support it, so you cannot get it as-is.

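As a workaround, using the `tesseract` CLI command as the OCR tool, I applied the following monkey patch, and in my environment it additionally returned the font size. Note that the option for requesting the font size from the `tesseract` CLI command requires Tesseract 3.04.00 or later.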

```python
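import re
import PIL.Image
import pyocr
# Import the submodules explicitly rather than relying on `import pyocr`
# to expose them as a side effect.
import pyocr.builders
import pyocr.tesseract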


class _WordHTMLParserWithFontAttr(pyocr.builders._WordHTMLParser):
    '''
    Tesseract hOCR parser with support for x_fsize additional attribute.
    '''
    def __init__(self):
        super(_WordHTMLParserWithFontAttr, self).__init__()

        self._last_box = None
        self._re_fsize = re.compile(r'(^|;)\s*x_fsize\s+(?P<fsize>\d+)')
        self.__current_fsize = None

    def handle_starttag(self, tag, attrs):
        super(_WordHTMLParserWithFontAttr, self).handle_starttag(tag, attrs)

        # Memorize `x_fsize` attribute if found
        if tag != 'span':
            return

        title = None
        tag_type = None

        for attr in attrs:
            if attr[0] == 'class':
                tag_type = attr[1]
            if attr[0] == 'title':
                title = attr[1]

        if title is not None and tag_type in ('ocr_word', 'ocrx_word'):
            self.__current_fsize = None
            m = self._re_fsize.search(title)

            if m:
                self.__current_fsize = int(m.group('fsize'))

    def handle_endtag(self, tag):
        super(_WordHTMLParserWithFontAttr, self).handle_endtag(tag)

        if len(self.boxes) > 0 and self.boxes[-1] != self._last_box:
            self._last_box = self.boxes[-1]
            self._last_box.fsize = self.__current_fsize


class WordBoxBuilderWithFontAttr(pyocr.builders.WordBoxBuilder):
    '''
    Builder with support for `hocr_font_info` on Tesseract 3.04 and 3.05.
    '''
    def __init__(self, tesseract_layout=1):
        super(WordBoxBuilderWithFontAttr, self).__init__(
            tesseract_layout=tesseract_layout
        )

        # Requires tesseract >= 3.04.00
        self.tesseract_configs += ['-c', 'hocr_font_info=1']

    def read_file(self, file_descriptor):
        '''
        Same as WordBoxBuilder.read_file, except for using parser class.
        '''
        parser = _WordHTMLParserWithFontAttr()
        html_str = file_descriptor.read()
        parser.feed(html_str)

        if len(parser.boxes) > 0:
            last_box = parser.boxes[-1]

            if last_box.content == pyocr.builders.to_unicode(''):
                # some parser leave an empty box at the end
                parser.boxes.pop(-1)

            return parser.boxes

        return []


image = PIL.Image.open('sample.png')
builder = WordBoxBuilderWithFontAttr()
boxes = pyocr.tesseract.image_to_string(image, lang='jpn+eng', builder=builder)

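for box in boxes:
    # box.fsize is None when the hOCR output carried no x_fsize for the
    # word, so format it with %s (a %d would raise TypeError on None).
    print(
        '"%s": at %d, %d (%d x %d) / fsize=%s' % (
            box.content,
            box.position[0][0],
            box.position[0][1],
            box.position[1][0] - box.position[0][0],
            box.position[1][1] - box.position[0][1],
            box.fsize,
        )
    )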
```
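
To make it clearer what the parser class is picking up, here is a small standalone sketch of just the regular expression step. It applies the same pattern to the `title` attribute of an hOCR word `span`. The sample title strings are made up for illustration (they are not captured from a real Tesseract run), and `extract_fsize` is a hypothetical helper name.

```python
import re

# Same pattern as in the parser class: find "x_fsize <digits>" either at
# the start of the title or after a ';' separator.
RE_FSIZE = re.compile(r'(^|;)\s*x_fsize\s+(?P<fsize>\d+)')


def extract_fsize(title):
    """Return the x_fsize value from an hOCR title string, or None."""
    m = RE_FSIZE.search(title)
    return int(m.group('fsize')) if m else None


# Illustrative title strings (assumed shape, not real OCR output).
sample_with = 'bbox 36 92 96 116; x_wconf 93; x_fsize 12'
sample_without = 'bbox 36 92 96 116; x_wconf 93'

print(extract_fsize(sample_with))
print(extract_fsize(sample_without))
```

When the attribute is missing, the helper returns `None`, which is the same situation in which the parser class leaves `fsize` unset for a box.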

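Unfortunately, the same code does not work with the latest Tesseract, 4.00.00alpha (still an alpha release): it returns abnormal font size values. This is reportedly because [the LSTM-based OCR engine adopted in 4.00 does not support font recognition](https://github.com/tesseract-ocr/tesseract/issues/684), and it is unclear whether support will be added in the future.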

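With that kind of future compatibility in mind, rather than forcing things with an odd patch, it may be better to simply use the height of each recognized box as a stand-in for the font size. As in the code above, you can compute it from `position` when using the standard `pyocr.builders.WordBoxBuilder`. Please judge for yourself which approach suits your case.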
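
A minimal sketch of that box-height approach, under some assumptions: the helper names (`box_size`, `pixels_to_points`, `word_heights`, `ocr_word_heights`) are hypothetical, and the conversion to points only makes sense if you know the DPI the image was rendered or scanned at. The height of a word box also depends on which glyphs the word contains, so treat the result as a rough proxy rather than a true font size.

```python
def box_size(position):
    """Return (width, height) in pixels.

    `position` has the shape ((left, top), (right, bottom)), as used by
    the boxes in the code above.
    """
    (left, top), (right, bottom) = position
    return right - left, bottom - top


def pixels_to_points(height_px, dpi):
    """Convert a pixel height to points (1 pt = 1/72 inch) at a given DPI."""
    return height_px * 72.0 / dpi


def word_heights(boxes):
    """Pair each box's text with its bounding-box height in pixels."""
    return [(box.content, box_size(box.position)[1]) for box in boxes]


def ocr_word_heights(path, lang='jpn+eng'):
    """Run OCR with the standard WordBoxBuilder and return word heights."""
    # Imported here so the helpers above can be used without pyocr.
    import PIL.Image
    import pyocr.builders
    import pyocr.tesseract

    boxes = pyocr.tesseract.image_to_string(
        PIL.Image.open(path),
        lang=lang,
        builder=pyocr.builders.WordBoxBuilder(),
    )
    return word_heights(boxes)
```

Calling `ocr_word_heights('sample.png')` would give a list of `(text, height_in_pixels)` pairs; if the DPI is known, `pixels_to_points(height, dpi)` turns a height into an approximate point size.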