Treating dotted lines as solid lines in camelot

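Introduction

With camelot, tuning parameters alone is not enough to correctly process tables that contain dotted lines.

The following PDFs are examples of this.
① Vertical dotted lines
https://github.com/atlanhq/camelot/files/3565115/Test.pdf

② Horizontal dotted lines
https://github.com/mima3/yakusyopdf/blob/master/20200502/%E5%85%B5%E5%BA%AB%E7%9C%8C.pdf

This problem has been raised in the issue below, but it does not appear to have been resolved.

Detect dotted line #370
https://github.com/atlanhq/camelot/issues/370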

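Solution

How Camelot recognizes tables

Before describing the solution, here is how Camelot recognizes tables.
Camelot has two methods for parsing tables: Stream and Lattice.
Lattice is used by default.

Lattice uses Ghostscript to convert the PDF page into an image, then uses OpenCV to find the lines and their intersections.
It then assigns the page content to table cells based on those coordinates.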
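
To see why dotted lines are a problem for this kind of detection, here is a minimal pure-Python sketch of the idea, with no OpenCV involved. It scans each column of a binarized page for long runs of foreground pixels. Camelot does the real work with OpenCV on the rendered image; the function name `find_vertical_runs` and the `min_len` parameter are assumptions made up for this illustration and are not part of Camelot.

```python
def find_vertical_runs(image, min_len):
    """Return (column, start_row, end_row) for every vertical run of 1s
    that is at least min_len pixels long.

    image is a list of rows, each row a list of 0/1 values.
    """
    runs = []
    height = len(image)
    width = len(image[0]) if image else 0
    for x in range(width):
        start = None
        # Iterate one step past the last row so a run touching the
        # bottom edge is closed as well.
        for y in range(height + 1):
            on = y < height and image[y][x] == 1
            if on and start is None:
                start = y
            elif not on and start is not None:
                if y - start >= min_len:
                    runs.append((x, start, y - 1))
                start = None
    return runs


# A 12-row toy page: solid line in column 1, dotted line in column 4.
solid_and_dotted = [[0] * 6 for _ in range(12)]
for y in range(12):
    solid_and_dotted[y][1] = 1
    if y % 3 != 2:  # two pixels on, one pixel off
        solid_and_dotted[y][4] = 1

print(find_vertical_runs(solid_and_dotted, min_len=6))  # [(1, 0, 11)]
```

Only the solid line is reported. The dotted line consists of runs that are each shorter than `min_len`, so it is never treated as a table border.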

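Approach

Since Lattice converts the page to an image first, it is enough to replace the dotted lines with solid lines in that image.

To convert vertical dotted lines to solid lines, apply dilate followed by erode in the vertical direction.
To convert horizontal dotted lines to solid lines, apply dilate followed by erode in the horizontal direction.
Dilate followed by erode with the same kernel is a morphological closing, which fills gaps narrower than the kernel.
The image modified this way is then passed to camelot's line-detection step.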
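
The effect of a directional dilate followed by erode can be checked on a toy image. The sketch below is a pure-Python stand-in for `cv2.dilate` followed by `cv2.erode` with a one-pixel-wide line kernel. The names `close_gaps` and `half` are illustrative assumptions; `half=2` gives a window of length 5, matching the 5x5 kernels used in this article, and pixels outside the image are simply ignored.

```python
def _slide(image, half, vertical, op):
    """Apply op (max = dilate, min = erode) over a one-pixel-wide window
    of length 2 * half + 1, oriented vertically or horizontally."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if vertical:
                window = [image[j][x]
                          for j in range(max(0, y - half), min(h, y + half + 1))]
            else:
                window = [image[y][i]
                          for i in range(max(0, x - half), min(w, x + half + 1))]
            out[y][x] = op(window)
    return out


def close_gaps(image, half=2, vertical=True):
    """Dilate, then erode, along one direction (a morphological closing)."""
    dilated = _slide(image, half, vertical, max)
    return _slide(dilated, half, vertical, min)


# Vertical dotted line in column 4: two pixels on, one pixel off.
dotted = [[0] * 9 for _ in range(12)]
for y in range(12):
    if y % 3 != 2:
        dotted[y][4] = 1

closed_v = close_gaps(dotted, vertical=True)
closed_h = close_gaps(dotted, vertical=False)

print([row[4] for row in closed_v])  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(closed_h == dotted)            # True
```

Closing along the direction of the line bridges the gaps, while closing across it leaves the dots unchanged. This is why the vertical and horizontal cases need differently oriented kernels.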

Concrete implementation example

First, prepare LatticeEx, a class that uses Lattice as its base class.
https://github.com/mima3/yakusyopdf/blob/master/camelot_ex.py
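
The real LatticeEx is in the file linked above and is not reproduced here. The sketch below only illustrates the general pattern that the usage code implies: a subclass accepts an optional `image_proc` callable and applies it to the thresholded image before line detection. `ToyLattice`, `ToyLatticeEx`, `detect_columns` and `fill_single_gaps` are all hypothetical names invented for this illustration, and none of them exist in Camelot.

```python
class ToyLattice:
    """Hypothetical stand-in for a lattice parser; not Camelot's class."""

    def __init__(self, min_len=6):
        self.min_len = min_len

    def _preprocess(self, threshold):
        # The base parser uses the thresholded image unchanged.
        return threshold

    def detect_columns(self, threshold):
        """Return indices of columns holding a run of 1s >= min_len."""
        image = self._preprocess(threshold)
        found = []
        for x in range(len(image[0])):
            run = best = 0
            for row in image:
                run = run + 1 if row[x] else 0
                best = max(best, run)
            if best >= self.min_len:
                found.append(x)
        return found


class ToyLatticeEx(ToyLattice):
    """Same parser, plus an optional hook applied before line detection."""

    def __init__(self, image_proc=None, **kwargs):
        super().__init__(**kwargs)
        self.image_proc = image_proc

    def _preprocess(self, threshold):
        if self.image_proc is None:
            return threshold
        return self.image_proc(threshold)


def fill_single_gaps(image):
    """Toy hook: fill any pixel whose upper and lower neighbours are set."""
    out = [row[:] for row in image]
    for y in range(1, len(image) - 1):
        for x in range(len(image[0])):
            if image[y - 1][x] and image[y + 1][x]:
                out[y][x] = 1
    return out


# Dotted vertical line in column 1: two pixels on, one pixel off.
dotted = [[0, 1 if y % 3 != 2 else 0, 0] for y in range(12)]

print(ToyLattice().detect_columns(dotted))                             # []
print(ToyLatticeEx(image_proc=fill_single_gaps).detect_columns(dotted))  # [1]
```

Keeping the hook as a plain callable means the image processing can be swapped per PDF without touching the parser class.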

LatticeEx is then used as follows:

import os

import cv2
import numpy

import camelot
import camelot_ex
from camelot.utils import TemporaryDirectory

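def image_proc_tate_tensen(threshold):
    # "tate tensen" = vertical dotted line.
    # Vertical line kernel: only the centre column of a 5x5 element is set,
    # matching the centre row used for horizontal lines.
    el = numpy.zeros((5, 5), numpy.uint8)
    el[:, 2] = 1
    # Dilate then erode to bridge the gaps between the dots.
    threshold = cv2.dilate(threshold, el, iterations=1)
    threshold = cv2.erode(threshold, el, iterations=1)
    return threshold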

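def image_proc_yoko_tensen(threshold):
    # "yoko tensen" = horizontal dotted line.
    # Horizontal line kernel: only the centre row of a 5x5 element is set.
    el = numpy.zeros((5, 5), numpy.uint8)
    el[2, :] = 1
    # Dilate then erode to bridge the gaps between the dots.
    threshold = cv2.dilate(threshold, el, iterations=1)
    threshold = cv2.erode(threshold, el, iterations=1)
    return threshold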

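def parse(filepath, pages, password=None, suppress_stdout=False, layout_kwargs=None, **kwargs):
    # Avoid sharing one mutable default dict between calls.
    if layout_kwargs is None:
        layout_kwargs = {}
    handler = camelot.handlers.PDFHandler(filepath, password=password)
    # Resolve a page spec such as 'all' or '1' into a list of page numbers.
    handler_pages = handler._get_pages(filepath, pages)

    tables = []
    with TemporaryDirectory() as tempdir:
        # Split each requested page into its own single-page PDF.
        for p in handler_pages:
            handler._save_page(filepath, p, tempdir)
        tmp_pages = [
            os.path.join(tempdir, "page-{0}.pdf".format(p)) for p in handler_pages
        ]
        # Remaining keyword arguments (line_scale, image_proc, ...) go to LatticeEx.
        parser = camelot_ex.LatticeEx(**kwargs)
        for p in tmp_pages:
            t = parser.extract_tables(
                p,
                suppress_stdout=suppress_stdout,
                layout_kwargs=layout_kwargs
            )
            tables.extend(t)

    return tables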

# PDF containing vertical dotted lines
ret = parse(
    'test.pdf',
    'all',
    line_scale=60,
    image_proc=image_proc_tate_tensen
)
# Print each row of the first table as comma-separated values.
for ix, row in ret[0].df.iterrows():
    print(ix, ','.join(row))

# PDF containing horizontal dotted lines
ret = parse(
    '20200502/兵庫県.pdf',
    '1',
    layout_kwargs={
        'char_margin': 0.25
    },
    line_scale=60,
    copy_text=['v'],
    image_proc=image_proc_yoko_tensen
)
# Print each row of the first table as comma-separated values.
for ix, row in ret[0].df.iterrows():
    print(ix, ','.join(row))

References

Removing vertical bars, including dotted lines, from an image using OpenCV
