Pix2StructCzechInvoice (V2 – Synthetic + Random Layout + Real Layout Injection)

This model is a fine-tuned version of google/pix2struct-docvqa-base for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set (these values match the epoch 3 row of the training results, which has the lowest validation loss):

  • Loss: 0.2521
  • F1: 0.7311
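The card does not state how F1 is computed. One common choice for extraction tasks is a micro-averaged, field-level F1 with exact string matching; the sketch below illustrates that definition only and is not necessarily the metric behind the number above. The function name `field_f1` and the dictionary keys are hypothetical.

```python
def field_f1(predictions, references):
    """Micro-averaged F1 over (field, value) pairs using exact string match.

    Assumption: each prediction and reference is a dict of field -> string.
    This is one plausible definition, not the one confirmed by the card.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        # A predicted field counts as correct only if the value matches exactly.
        for key, value in pred.items():
            if ref.get(key) == value:
                tp += 1
            else:
                fp += 1
        # A reference field that is missing or wrong in the prediction is a miss.
        for key, value in ref.items():
            if pred.get(key) != value:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one correct field, one wrong value, one missing field.
example_pred = [{"invoice_number": "2024001", "total": "1210.00"}]
example_ref = [{"invoice_number": "2024001", "total": "1200.00", "supplier": "ACME s.r.o."}]
example_f1 = field_f1(example_pred, example_ref)
```

Under this definition a wrong value is penalized twice, once as a false positive and once as a false negative, which is why exact-match F1 tends to be strict for generative models.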

Model description

Pix2StructCzechInvoice (V2) is the V2 stage of a generative document understanding pipeline, sitting between the synthetic-only V0–V1 stages and the V3 stage that is fine-tuned on real data.

The model:

  • processes full document images
  • generates structured outputs as text sequences

It is trained to extract key invoice fields:

  • supplier
  • customer
  • invoice number
  • bank details
  • totals
  • dates

This version introduces real layout injection, which is intended to make the training images more visually realistic and to improve generalization to real invoices.


Training data

The dataset consists of three components:

  1. Synthetic template-based invoices
  2. Synthetic invoices with randomized layouts
  3. Hybrid invoices with real layouts and synthetic content

Real layout injection

In the hybrid dataset:

  • real invoice layouts are used as templates
  • original content is replaced with synthetic data
  • new content is rendered into realistic visual structures

This preserves:

  • real-world layout complexity
  • visual patterns and formatting
  • document structure variability

while maintaining:

  • full control over annotations
  • consistent output format
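The injection step described above can be sketched as pairing each region of a real layout with a synthetic value, so that the rendered text and the annotation come from the same source. This is a minimal illustration under assumed data structures: `Region`, `inject`, and the field names are hypothetical, and the actual rendering onto the page image is left out.

```python
from dataclasses import dataclass


@dataclass
class Region:
    field: str   # which invoice field originally occupied this box (hypothetical naming)
    box: tuple   # (left, top, right, bottom) in pixels on the real page


def inject(layout, synthetic_values):
    """Pair real-layout regions with synthetic text.

    Returns a list of render jobs (where to draw which text) and the
    annotation dict, which is exact because the content is synthetic.
    """
    render_jobs = []
    labels = {}
    for region in layout:
        text = synthetic_values.get(region.field)
        if text is None:
            continue  # no synthetic value for this region: leave it untouched
        render_jobs.append({"box": region.box, "text": text})
        labels[region.field] = text
    return render_jobs, labels


# Hypothetical layout taken from one real invoice, with invented coordinates.
layout = [
    Region("supplier", (40, 60, 400, 120)),
    Region("invoice_number", (500, 60, 760, 90)),
    Region("logo", (40, 10, 200, 50)),
]
values = {"supplier": "ACME s.r.o.", "invoice_number": "2024001"}
jobs, labels = inject(layout, values)
```

Because the labels are produced alongside the render jobs, no manual annotation of the hybrid images is needed.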

Role in the pipeline

This model corresponds to:

V2 – Synthetic + layout augmentation + real layout injection

It is used to:

  • reduce the domain gap between synthetic and real documents
  • evaluate the effect of realistic layouts on generative models
  • compare with:
    • V0–V1 (synthetic-only training)
    • V3 (real data fine-tuning)

Intended uses

  • End-to-end invoice extraction from images
  • Document VQA-style tasks
  • Research in generative document understanding
  • Evaluation of hybrid training strategies
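For end-to-end extraction, a minimal inference sketch with the Transformers Pix2Struct classes might look as follows. The repository name is the one shown on the model page; the prompt passed as `question` is an assumption, since the card does not document the prompt used during fine-tuning, and `extract_invoice_text` is a hypothetical helper name.

```python
MODEL_ID = "TomasFAV/Pix2StructCzechInvoiceV012"  # repository name as listed on the model page


def extract_invoice_text(image_path, question, model_id=MODEL_ID, max_new_tokens=512):
    """Run the model on one invoice image and return the raw generated text.

    Assumption: the checkpoint keeps the DocVQA-style interface of the base
    model, where a text prompt is rendered together with the image.
    """
    # Heavy imports live inside the function so importing this module stays cheap.
    from PIL import Image
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    processor = Pix2StructProcessor.from_pretrained(model_id)
    model = Pix2StructForConditionalGeneration.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=question, return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(generated[0], skip_special_tokens=True)
```

A call such as `extract_invoice_text("invoice.png", "Extract the invoice fields.")` would download the weights on first use; both the file name and the prompt here are placeholders.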

Limitations

  • Generated outputs may contain formatting errors
  • Sensitive to decoding strategy and tokenization
  • Still lacks full exposure to real linguistic variability
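  • Training remains less stable than for classification-based models: in the training results, validation loss never returns to its epoch 3 minimum while training loss continues to fall overall, and F1 fluctuates between 0.6644 and 0.7311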
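Since generated outputs may contain formatting errors, downstream code may want a tolerant parser. The sketch below assumes the model emits a JSON object with string values, which the card does not confirm; the key names in `EXPECTED_FIELDS` and the function `parse_generated` are hypothetical.

```python
import json
import re

# Hypothetical key names for the fields listed in the model description.
EXPECTED_FIELDS = ("supplier", "customer", "invoice_number", "bank_details", "total", "date")


def parse_generated(text):
    """Return (fields, ok); ok is False when the strict JSON parse failed."""
    try:
        data = json.loads(text)
        if isinstance(data, dict):
            return {k: data.get(k) for k in EXPECTED_FIELDS}, True
    except json.JSONDecodeError:
        pass
    # Fallback: recover complete "key": "value" pairs from truncated or malformed output.
    pairs = dict(re.findall(r'"(\w+)"\s*:\s*"([^"]*)"', text))
    return {k: pairs.get(k) for k in EXPECTED_FIELDS}, False


well_formed = '{"supplier": "ACME s.r.o.", "invoice_number": "2024001"}'
truncated = '{"supplier": "ACME s.r.o.", "invoice_number": "2024001", "total": "12'

fields_ok, ok_flag = parse_generated(well_formed)
fields_cut, cut_flag = parse_generated(truncated)
```

Returning the `ok` flag alongside the fields lets callers count how often the fallback was needed, which is a rough indicator of formatting quality for a given decoding setup.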

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 8
  • eval_batch_size: 1
  • seed: 42
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH_FUSED) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine_with_restarts
  • lr_scheduler_warmup_steps: 0.1
  • num_epochs: 10
  • mixed_precision_training: Native AMP
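To make the learning-rate settings concrete, the following sketch reproduces the shape of a cosine schedule with hard restarts and linear warmup in plain Python. It rests on several assumptions: the warmup value 0.1 is read as a fraction of the 1150 total steps (115 steps), and the number of cycles is 1, which is the usual default when no scheduler arguments are given. With a single cycle the curve is a plain cosine decay with no actual restart.

```python
import math


def lr_at(step, base_lr=1e-4, total_steps=1150, warmup_steps=115, num_cycles=1):
    """Learning rate at a given optimizer step (illustrative, not the trainer's code)."""
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Cosine decay within each cycle; the modulo restarts the curve per cycle.
    cycle_position = (num_cycles * progress) % 1.0
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_position)))


# Sample the schedule at the end of each epoch (115 steps per epoch in the results).
per_epoch_lr = [lr_at(115 * epoch) for epoch in range(1, 11)]
```

Under these assumptions the peak rate of 1e-4 is reached at the end of the first epoch and decays to zero by step 1150.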

Training results

Training Loss   Epoch   Step   Validation Loss   F1
0.3432          1.0     115    0.2771            0.6644
0.1942          2.0     230    0.2611            0.6745
0.1934          3.0     345    0.2521            0.7311
0.1325          4.0     460    0.2665            0.7133
0.1131          5.0     575    0.2686            0.6762
0.1125          6.0     690    0.2601            0.7277
0.1011          7.0     805    0.2962            0.7118
0.1229          8.0     920    0.2893            0.7095
0.0861          9.0     1035   0.3019            0.6931
0.0860          10.0    1150   0.3167            0.7186
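The reported evaluation metrics can be traced back to a specific row of the results. The short sketch below copies the values from the table and picks the best epoch by validation loss and by F1; the variable names are illustrative only.

```python
# (epoch, step, training_loss, validation_loss, f1), copied from the results table.
results = [
    (1, 115, 0.3432, 0.2771, 0.6644),
    (2, 230, 0.1942, 0.2611, 0.6745),
    (3, 345, 0.1934, 0.2521, 0.7311),
    (4, 460, 0.1325, 0.2665, 0.7133),
    (5, 575, 0.1131, 0.2686, 0.6762),
    (6, 690, 0.1125, 0.2601, 0.7277),
    (7, 805, 0.1011, 0.2962, 0.7118),
    (8, 920, 0.1229, 0.2893, 0.7095),
    (9, 1035, 0.0861, 0.3019, 0.6931),
    (10, 1150, 0.0860, 0.3167, 0.7186),
]

best_by_loss = min(results, key=lambda row: row[3])  # lowest validation loss
best_by_f1 = max(results, key=lambda row: row[4])    # highest F1

# Gap between validation and training loss per epoch, a rough overfitting signal.
gaps = [(epoch, round(val - train, 4)) for epoch, _, train, val, _ in results]
```

Both criteria select epoch 3 (step 345), and the gap between validation and training loss widens in later epochs, which is consistent with the model starting to overfit after that point.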

Framework versions

  • Transformers 5.0.0
  • PyTorch 2.10.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.2