libs support modules

Shared helpers that sit alongside the feature packages above: rasterizing PDFs, parsing layout JSON, geometry types, success metrics, and debug PDF overlays.

formhtr.libs.pdf_to_image.convert_pdf_to_image(pdf_path, page=0, dpi=300)[source]

Convert a single page of a PDF to an image.

Parameters:
  • pdf_path (str) – path to given PDF file

  • page (int) – page number to be extracted. Defaults to 0.

  • dpi (int) – quality of picture in DPI. Defaults to 300.

Returns:

converted image

Return type:

Image
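To make the dpi parameter concrete, the arithmetic below estimates the pixel dimensions a page would have after rasterizing. This is only a sketch of the usual millimetres-to-pixels relationship, not formhtr's implementation; the helper name page_pixels and the A4 page dimensions are assumptions chosen for illustration, and a real rasterizer may round differently.

```python
def page_pixels(width_mm: float, height_mm: float, dpi: int = 300) -> tuple[int, int]:
    """Estimate the raster size (width, height) in pixels of a page at a given DPI."""
    mm_per_inch = 25.4
    width_px = round(width_mm / mm_per_inch * dpi)
    height_px = round(height_mm / mm_per_inch * dpi)
    return (width_px, height_px)


# A4 is 210 mm x 297 mm (illustrative page size, not taken from formhtr).
a4_at_300 = page_pixels(210, 297)           # uses the default dpi=300
a4_at_150 = page_pixels(210, 297, dpi=150)  # half the DPI, half the pixels per side
print(a4_at_300, a4_at_150)
```

Halving the DPI roughly quarters the pixel count, which is worth keeping in mind when trading OCR accuracy against memory use.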

formhtr.libs.pdf_to_image.resize_image(image, size)[source]

Resize image to given size.

Parameters:
  • image (Image) – image object

  • size ((int, int)) – pair of dimensions to scale the image to.

Returns:

scaled image

Return type:

Image

formhtr.libs.pdf_to_image.get_image_size(logsheet_image)[source]

Find the size of the image.

Parameters:

logsheet_image (np.array) – image of interest

Returns:

size in bytes

Return type:

int

class formhtr.libs.logsheet_config.LogsheetConfig(regions, residuals, height=None, width=None)[source]

Bases: object

Class to store and represent the whole config.

add_roi(start_x, start_y, end_x, end_y, varname=None, content_type=None)[source]

Append a new ROI rectangle to regions.

Parameters:
  • start_x – Left edge (inclusive).

  • start_y – Top edge (inclusive).

  • end_x – Right edge.

  • end_y – Bottom edge.

  • varname – Optional variable label.

  • content_type – Optional type string (e.g. Handwritten).

Returns:

None.

delete_last_region()[source]

Remove the most recently added ROI, if any.

Returns:

None.

update(index, attribute, value)[source]

Update an attribute (for example the content type) of a particular region.

Parameters:
  • index (int) – region identifier

  • attribute (str) – attribute to be set

  • value (str) – desired value

Returns:

None.

announce_status(index, clean_len=20)[source]

Print the current region status to the command line.

Parameters:
  • index (int) – region identifier

  • clean_len (int, optional) – length of text to clear. Defaults to 20.

Returns:

None.

export_to_json(output_file, remove_unannotated=False)[source]

Write the logsheet config to a JSON file.

Parameters:
  • output_file (str) – location of output file.

  • remove_unannotated (bool, optional) – Remove ROIs without any content type specified. Defaults to False.

Returns:

None.

import_from_json(input_file)[source]

Import logsheet config from a JSON file.

Parameters:

input_file (str) – path to JSON file

Returns:

None; updates the instance's regions, residuals, height, and width in place.
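The methods above describe an add / update / export / import cycle. The sketch below walks through that cycle with a hypothetical stand-in class, SketchConfig, so that it runs without formhtr installed. The dict-based ROI representation and the JSON layout are assumptions made for illustration; the real LogsheetConfig may store and serialize its regions differently.

```python
import json
import os
import tempfile


class SketchConfig:
    """Hypothetical stand-in for LogsheetConfig; not the formhtr class."""

    def __init__(self, regions, residuals, height=None, width=None):
        self.regions = regions
        self.residuals = residuals
        self.height = height
        self.width = width

    def add_roi(self, start_x, start_y, end_x, end_y, varname=None, content_type=None):
        # Each ROI is kept as a plain dict here (assumed representation).
        self.regions.append({
            "coords": [start_x, start_y, end_x, end_y],
            "varname": varname,
            "content_type": content_type,
        })

    def delete_last_region(self):
        if self.regions:  # no-op when there is nothing to remove
            self.regions.pop()

    def update(self, index, attribute, value):
        self.regions[index][attribute] = value

    def export_to_json(self, output_file, remove_unannotated=False):
        regions = self.regions
        if remove_unannotated:
            # Drop ROIs that never received a content type.
            regions = [r for r in regions if r["content_type"] is not None]
        payload = {"height": self.height, "width": self.width,
                   "regions": regions, "residuals": self.residuals}
        with open(output_file, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)

    def import_from_json(self, input_file):
        with open(input_file, encoding="utf-8") as handle:
            payload = json.load(handle)
        self.regions = payload["regions"]
        self.residuals = payload["residuals"]
        self.height = payload["height"]
        self.width = payload["width"]


config = SketchConfig(regions=[], residuals=[], height=3508, width=2480)
config.add_roi(100, 200, 400, 260, varname="sample_id")
config.add_roi(100, 300, 400, 360)               # stays unannotated
config.update(0, "content_type", "Handwritten")
config.add_roi(0, 0, 10, 10)
config.delete_last_region()                      # undoes the stray ROI

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    config.export_to_json(path, remove_unannotated=True)
    restored = SketchConfig(regions=[], residuals=[])
    restored.import_from_json(path)
```

In this sketch, exporting with remove_unannotated=True filters only the written file and leaves the in-memory regions untouched; whether the real class behaves the same way is not stated in the reference above.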

class formhtr.libs.region.Region(start_x, start_y, end_x, end_y)[source]

Bases: object

Class to represent a single ROI.

class formhtr.libs.region.Residual(start_x, start_y, end_x, end_y, expected_content)[source]

Bases: Region

class formhtr.libs.region.ROI(start_x, start_y, end_x, end_y, varname=None, content_type=None)[source]

Bases: Region

class formhtr.libs.region.Rectangle(start_x, start_y, end_x, end_y, content)[source]

Bases: Region
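The four geometry classes share one base and differ only in the extra attribute each carries. The following sketch mirrors the constructor signatures listed above using hypothetical Sketch* classes. The return shapes of get_coords() and get_start() are assumptions: the reference only says that regions expose these methods, not what they return.

```python
class SketchRegion:
    """Stand-in base class holding the two corners of a rectangle."""

    def __init__(self, start_x, start_y, end_x, end_y):
        self.start_x, self.start_y = start_x, start_y
        self.end_x, self.end_y = end_x, end_y

    def get_coords(self):
        # Assumed return shape: (start_x, start_y, end_x, end_y).
        return (self.start_x, self.start_y, self.end_x, self.end_y)

    def get_start(self):
        # Assumed return shape: the top-left corner.
        return (self.start_x, self.start_y)


class SketchResidual(SketchRegion):
    """Region paired with the content expected at that location."""

    def __init__(self, start_x, start_y, end_x, end_y, expected_content):
        super().__init__(start_x, start_y, end_x, end_y)
        self.expected_content = expected_content


class SketchROI(SketchRegion):
    """Region of interest with an optional variable name and content type."""

    def __init__(self, start_x, start_y, end_x, end_y, varname=None, content_type=None):
        super().__init__(start_x, start_y, end_x, end_y)
        self.varname = varname
        self.content_type = content_type


class SketchRectangle(SketchRegion):
    """Region carrying recognized content."""

    def __init__(self, start_x, start_y, end_x, end_y, content):
        super().__init__(start_x, start_y, end_x, end_y)
        self.content = content


roi = SketchROI(100, 200, 400, 260, varname="sample_id", content_type="Handwritten")
box = SketchRectangle(102, 198, 395, 258, content="A-17")

# Code that only needs geometry can treat every subclass uniformly.
coords = [shape.get_coords() for shape in (roi, box)]
```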

formhtr.libs.statistics.compute_success_ratio(contents, artefacts)[source]

Compute the ratio between the number of identified regions and extra content.

Parameters:
  • contents (list) – list of identified regions

  • artefacts (dict) – artefacts per service

Returns:

Dict with keys identified (int), artefacts (int), ratio (float).
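The reference gives the shape of the returned dict but not the formula. The sketch below shows one plausible reading, in which each value of artefacts is a list of extra detections for a service and the ratio is identified / (identified + artefacts). Both the formula and the helper name success_ratio_sketch are assumptions; the real function may instead divide identified by artefacts or count artefacts differently.

```python
def success_ratio_sketch(contents, artefacts):
    """Return counts and a ratio in [0, 1]; the formula is an assumed reading."""
    identified = len(contents)
    # Assumption: each service maps to a list of extra detections.
    extra = sum(len(items) for items in artefacts.values())
    total = identified + extra
    ratio = identified / total if total else 0.0  # avoid dividing by zero
    return {"identified": identified, "artefacts": extra, "ratio": ratio}


stats = success_ratio_sketch(
    contents=["A-17", "2021-05-03", "12.5"],
    artefacts={"google": ["stray mark"], "amazon": [], "azure": []},
)
empty = success_ratio_sketch(contents=[], artefacts={})
print(stats)
```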

formhtr.libs.visualise_regions.load_font()[source]

Return a font for overlay labels, preferring TrueType Arial.

Returns:

A PIL ImageFont instance (Arial if available, else default bitmap font).

Note

May download Arial.ttf into the current working directory once.

formhtr.libs.visualise_regions.create_debug_dir()[source]

Create debug/ in the current working directory if missing.

Returns:

None.
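Creating a directory only if it is missing is commonly done with pathlib, as in the sketch below. The helper ensure_debug_dir and its base parameter are hypothetical and exist only so the example can run inside a temporary directory; the documented function takes no arguments and always uses the current working directory.

```python
import tempfile
from pathlib import Path


def ensure_debug_dir(base=None):
    """Create <base>/debug if missing and return its path (illustrative helper)."""
    base_dir = Path(base) if base is not None else Path.cwd()
    debug_dir = base_dir / "debug"
    debug_dir.mkdir(exist_ok=True)  # exist_ok makes repeated calls harmless
    return debug_dir


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_debug_dir(tmp)
    second = ensure_debug_dir(tmp)  # second call finds the directory and does nothing
    created = first.is_dir()
    same_path = first == second
```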

formhtr.libs.visualise_regions.annotate_pdfs(identified_content, logsheet_image, front)[source]

Write one debug PDF per OCR provider under debug/.

Parameters:
  • identified_content – Dict with keys google, amazon, and azure, each mapping to an iterable of regions that provide get_coords() and content (may be None).

  • logsheet_image – Raster image (numpy) to draw on.

  • front – If False, suffix output filenames with _back.

Returns:

None.

formhtr.libs.visualise_regions.visualise_regions(regions, image, output_pdf)[source]

Draw bounding boxes and labels, save as PDF in debug/.

Parameters:
  • regions – Iterable of objects with get_coords(), get_start(), content.

  • image – Source numpy image (RGB/BGR as supported by PIL).

  • output_pdf – Filename only; written as debug/{output_pdf}.

Returns:

None.