Core API

Logsheet pipeline

The main OCR path rasterizes and aligns the scan, calls cloud vision services, and parses ROIs. Start with extract_logsheet() for in-process use, or process_logsheet_to_xlsx() to write a workbook.

class formhtr.logsheet.ServiceCredentials(google_credentials_path: str | None, amazon_credentials: dict[str, Any] | None, azure_credentials: dict[str, Any] | None)

Bases: object

Holds the OCR provider credentials passed to call_services.

google_credentials_path

Path to Google service-account JSON, or None.

Type:

str | None

amazon_credentials

Loaded Amazon credentials dict, or None.

Type:

dict[str, Any] | None

azure_credentials

Loaded Azure credentials dict, or None.

Type:

dict[str, Any] | None

formhtr.logsheet.load_credentials(*, google_credentials_path: str | None = None, amazon_credentials_path: str | None = None, azure_credentials_path: str | None = None) → ServiceCredentials

Load credential files into a ServiceCredentials instance.

Parameters:
  • google_credentials_path – Path to the Google service-account JSON (kept as a path; the file is not loaded here).

  • amazon_credentials_path – Path to Amazon JSON (ACCESS_KEY, SECRET_KEY, REGION).

  • azure_credentials_path – Path to Azure JSON (SUBSCRIPTION_KEY, ENDPOINT).

Returns:

Frozen dataclass with paths/dicts for enabled providers.
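As a rough illustration of the credential files described above, the sketch below writes Amazon and Azure JSON files with the documented key names. It assumes each file is a flat JSON object; the values, file names, and the helper load_example_credentials() are placeholders introduced here, not part of formhtr.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values only; real keys come from the provider consoles.
# Assumption: each file is a flat JSON object with the documented key names.
amazon = {
    "ACCESS_KEY": "EXAMPLE-ACCESS-KEY",
    "SECRET_KEY": "example-secret",
    "REGION": "eu-central-1",
}
azure = {
    "SUBSCRIPTION_KEY": "example-key",
    "ENDPOINT": "https://example.invalid/",
}

cred_dir = Path(tempfile.mkdtemp())
amazon_path = cred_dir / "amazon.json"
azure_path = cred_dir / "azure.json"
amazon_path.write_text(json.dumps(amazon, indent=2))
azure_path.write_text(json.dumps(azure, indent=2))


def load_example_credentials():
    """Hypothetical helper: call the documented loader (requires formhtr)."""
    from formhtr.logsheet import load_credentials

    # Google is omitted here, so google_credentials_path stays None.
    return load_credentials(
        amazon_credentials_path=str(amazon_path),
        azure_credentials_path=str(azure_path),
    )
```

Providers whose path is left as None are simply not configured, which matches the "any subset may be set" wording used for the credentials parameter of extract_logsheet().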

formhtr.logsheet.preprocess_input(*, scanned_logsheet_pdf: str, template_pdf: str, config: LogsheetConfig, page: int, skip_alignment: bool, filter_grayscale: bool, max_size_mb: float = 4, dpi: int = 300, alignment_config_path: str | None = None)

Rasterize PDFs, align scan to template, and enforce a maximum JPEG size.

Parameters:
  • scanned_logsheet_pdf – Path to the scanned logsheet PDF.

  • template_pdf – Path to the blank template PDF.

  • config – Loaded layout (width/height used for resizing).

  • page – Zero-based page index in the scan PDF.

  • skip_alignment – If True, skip homography alignment.

  • filter_grayscale – Passed to automatic alignment (edge-based corners).

  • max_size_mb – If the in-memory JPEG exceeds this, reduce dpi and retry.

  • dpi – Initial rasterization DPI.

  • alignment_config_path – Optional JSON with template_points and target_points.

Returns:

Aligned logsheet as a numpy array, or None if alignment yields no image.
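The max_size_mb behaviour ("reduce dpi and retry") can be pictured as a simple loop. The sketch below is not formhtr's implementation: the step size, the lower DPI bound, the 1024 * 1024 bytes-per-MB convention, and the names fit_under_limit and fake_encoder are all assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple


def fit_under_limit(
    encode_at_dpi: Callable[[int], bytes],
    max_size_mb: float = 4,
    dpi: int = 300,
    step: int = 50,     # assumed decrement; the real step is not documented
    min_dpi: int = 50,  # assumed floor so the loop always terminates
) -> Tuple[Optional[bytes], int]:
    """Re-encode at decreasing DPI until the payload fits the size limit."""
    limit_bytes = max_size_mb * 1024 * 1024  # assumed MB convention
    while dpi >= min_dpi:
        data = encode_at_dpi(dpi)
        if len(data) <= limit_bytes:
            return data, dpi
        dpi -= step
    # No resolution in the allowed range produced a small enough payload.
    return None, dpi


def fake_encoder(dpi: int) -> bytes:
    """Stand-in for rasterize + JPEG encode: size grows with dpi squared."""
    return bytes(dpi * dpi * 60)
```

With the stand-in encoder, the 300 DPI payload exceeds the 4 MB limit and the loop settles on the next lower resolution that fits.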

formhtr.logsheet.extract_logsheet(*, scanned_logsheet_pdf: str, template_pdf: str, config_json: str, credentials: ServiceCredentials, debug: bool = False, front: bool = True, checkbox_edges: float = 0.2, skip_alignment: bool = False, filter_grayscale: bool = False, alignment_config_path: str | None = None)

Preprocess one page, run the OCR services, optionally write debug PDFs, and parse the ROIs.

Parameters:
  • scanned_logsheet_pdf – Path to the scanned PDF.

  • template_pdf – Path to the template PDF.

  • config_json – Path to ROI/residual JSON config.

  • credentials – Provider credentials (any subset may be set).

  • debug – If True, write annotated debug PDFs under debug/.

  • front – If True, process page 0 (front); otherwise page 1 (back).

  • checkbox_edges – Inner margin ratio for checkbox tick detection.

  • skip_alignment – Skip alignment in preprocessing.

  • filter_grayscale – Passed to automatic alignment.

  • alignment_config_path – Optional manual alignment JSON.

Returns:

(results, artefacts) from process_content, or (None, None) if preprocess fails.

formhtr.logsheet.process_logsheet_to_xlsx(*, scanned_logsheet_pdf: str, template_pdf: str, config_json: str, output_xlsx: str, credentials: ServiceCredentials, debug: bool = False, backside: bool = False, backside_template_pdf: str | None = None, backside_config_json: str | None = None, ugly_checkboxes: bool = False, already_aligned: bool = False, filter_grayscale: bool = False, store_csv: bool = False, alignment_config_path: str | None = None, backside_alignment_config_path: str | None = None) → float | None

Run end-to-end extraction to an XLSX workbook or a CSV file, optionally covering both sides of a scan.

Parameters:
  • scanned_logsheet_pdf – Path to the scanned PDF.

  • template_pdf – Front template PDF path.

  • config_json – Front ROI config JSON path.

  • output_xlsx – Output .xlsx or .csv path (see store_csv).

  • credentials – OCR credentials for enabled providers.

  • debug – Enable debug PDF output during extraction.

  • backside – Whether to append back-side ROIs.

  • backside_template_pdf – Back template PDF (required if backside is True).

  • backside_config_json – Back config JSON (required if backside is True).

  • ugly_checkboxes – Use a larger edge ignore ratio for checkboxes.

  • already_aligned – Skip alignment in preprocess_input.

  • filter_grayscale – Passed to automatic alignment.

  • store_csv – If True, write CSV via store_results_csv instead of XLSX.

  • alignment_config_path – Optional front alignment JSON.

  • backside_alignment_config_path – Optional back alignment JSON.

Returns:

Dict from compute_success_ratio (identified, artefacts, ratio), or None if the front side could not be processed.
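Several of these parameters depend on each other: backside needs both back-side files, and store_csv should agree with the output path. The sketch below collects the keyword arguments in one place and checks those pairings up front. build_xlsx_kwargs is a hypothetical helper, and deriving store_csv from the file extension is a convenience chosen for this example, not documented behaviour.

```python
from typing import Any, Dict, Optional


def build_xlsx_kwargs(
    scanned_logsheet_pdf: str,
    template_pdf: str,
    config_json: str,
    output_path: str,
    credentials: Any,
    backside_template_pdf: Optional[str] = None,
    backside_config_json: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a process_logsheet_to_xlsx-style call."""
    backside = backside_template_pdf is not None or backside_config_json is not None
    if backside and (backside_template_pdf is None or backside_config_json is None):
        raise ValueError("backside needs both a template PDF and a config JSON")

    kwargs: Dict[str, Any] = {
        "scanned_logsheet_pdf": scanned_logsheet_pdf,
        "template_pdf": template_pdf,
        "config_json": config_json,
        "output_xlsx": output_path,
        "credentials": credentials,
        # Assumption of this sketch: pick CSV output when the path ends in .csv.
        "store_csv": output_path.lower().endswith(".csv"),
        "backside": backside,
    }
    if backside:
        kwargs["backside_template_pdf"] = backside_template_pdf
        kwargs["backside_config_json"] = backside_config_json
    return kwargs
```

The resulting dict can be unpacked into the call, for example process_logsheet_to_xlsx(**kwargs); a None return then indicates that the front side could not be processed.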

ROI tools

Interactive definition and labelling of regions on a template PDF. These wrap the widgets and helpers under libs.extract_ROI and libs.annotate_ROI.

formhtr.roi_tools.select_rois(*, template_pdf: str, output_config_json: str, autodetect: bool = False, autodetect_filter: float = 3, existing_config_json: str | None = None, detect_residuals: bool = False, google_credentials_path: str | None = None, display_residuals: bool = False, headless: bool = False) → None

Define ROIs on a template PDF and save the layout as JSON.

Parameters:
  • template_pdf – Path to the template PDF (first page).

  • output_config_json – Path to write the ROI config JSON.

  • autodetect – Run rectangle detection to seed ROIs.

  • autodetect_filter – Scale passed to detect_rectangles.

  • existing_config_json – Optional config to load and continue editing.

  • detect_residuals – Use Google Vision to find printed text to ignore (needs credentials).

  • google_credentials_path – Google JSON path when detect_residuals is True.

  • display_residuals – Draw residual regions in the UI.

  • headless – Skip GUI and only export (with autodetect/residuals as configured).

Returns:

None; writes output_config_json.
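For batch use, a headless run with autodetection is the likely combination. The sketch below builds the keyword arguments for such a call and enforces the stated dependency between detect_residuals and google_credentials_path. headless_select_kwargs is a hypothetical helper, and turning autodetect on for every headless run is a choice made for this example.

```python
from typing import Any, Dict, Optional


def headless_select_kwargs(
    template_pdf: str,
    output_config_json: str,
    detect_residuals: bool = False,
    google_credentials_path: Optional[str] = None,
    autodetect_filter: float = 3,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a GUI-free select_rois-style call."""
    if detect_residuals and not google_credentials_path:
        raise ValueError("detect_residuals requires google_credentials_path")
    return {
        "template_pdf": template_pdf,
        "output_config_json": output_config_json,
        # Assumption: without the GUI, autodetection is the only ROI source.
        "autodetect": True,
        "autodetect_filter": autodetect_filter,
        "detect_residuals": detect_residuals,
        "google_credentials_path": google_credentials_path,
        "headless": True,
    }
```

The dict can then be unpacked as select_rois(**kwargs), which writes output_config_json and returns None.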

formhtr.roi_tools.annotate_rois(*, template_pdf: str, config_json: str, output_config_json: str, remove_unannotated: bool = False, display_residuals: bool = False) → None

Label ROI content types and variable names, then save updated config.

Parameters:
  • template_pdf – Path to the template PDF.

  • config_json – Existing ROI config to load.

  • output_config_json – Path to write the updated config.

  • remove_unannotated – If True, drop ROIs without a type on export.

  • display_residuals – Draw residual regions in the UI.

Returns:

None; writes output_config_json.
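The effect of remove_unannotated can be pictured as a filter over the ROI list. The sketch below uses a made-up schema: the "type" and "name" keys and the list-of-dicts layout are assumptions for illustration, and the actual config format written by formhtr may differ.

```python
from typing import Any, Dict, List


def drop_unannotated(rois: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only ROIs that carry a non-empty content type (hypothetical schema)."""
    return [roi for roi in rois if roi.get("type")]


# Hypothetical ROI entries; key names are not taken from formhtr.
example_rois = [
    {"name": "sample_id", "type": "number"},
    {"name": "notes", "type": None},
    {"name": "unlabelled"},
    {"name": "is_frozen", "type": "checkbox"},
]
kept = drop_unannotated(example_rois)
```

Here only the two entries with a content type survive, which mirrors the "drop ROIs without a type on export" description.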