Pix2StructCzechInvoice (V3 – Full Pipeline with Real Data Fine-Tuning)

This model is a fine-tuned version of google/pix2struct-docvqa-base for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set, which match the epoch 6 checkpoint (step 138) in the training results:

  • Loss: 0.1542
  • F1: 0.8404
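
The card does not say how F1 is computed. As a rough illustration only, the sketch below shows one plausible definition: a micro-averaged F1 over exact-match (field, value) pairs. The function name `field_f1` and the dictionary-based input format are assumptions made for this example, not the author's evaluation code.

```python
def field_f1(predictions, references):
    """Micro-averaged F1 over exact-match (field, value) pairs.

    `predictions` and `references` are parallel lists of dicts, one dict
    per invoice. This is a hypothetical metric definition; the model card
    does not specify the one actually used.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred_items = set(pred.items())
        ref_items = set(ref.items())
        tp += len(pred_items & ref_items)   # correct field and value
        fp += len(pred_items - ref_items)   # predicted but wrong or extra
        fn += len(ref_items - pred_items)   # expected but missing or wrong
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    preds = [{"supplier": "ACME", "total": "100"}]
    refs = [{"supplier": "ACME", "total": "120", "date": "2024-01-01"}]
    print(round(field_f1(preds, refs), 4))
```

Under this definition a field with the right key but a wrong value counts as both a false positive and a false negative, which penalizes hallucinated values more heavily than omissions.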

Model description

Pix2StructCzechInvoice (V3) is the final generative model in the experimental pipeline.

Unlike token classification approaches, this model:

  • processes full document images
  • generates structured outputs as text sequences

It extracts key invoice fields such as:

  • supplier
  • customer
  • invoice number
  • bank details
  • totals
  • dates

By combining synthetic, hybrid, and real data, this version improves both extraction performance and output stability over the earlier pipeline stages (V0 to V2).
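
A minimal inference sketch, assuming the checkpoint is published as `TomasFAV/Pix2StructCzechInvoiceV0123` with its processor files and that it accepts a text prompt in the same way as the DocVQA base model. The prompt string, the function name `extract_invoice_fields`, and the decoding settings are illustrative assumptions; the card does not document the prompt used during fine-tuning.

```python
DEFAULT_MODEL_ID = "TomasFAV/Pix2StructCzechInvoiceV0123"
# Hypothetical prompt; the prompt format used in training is not documented.
DEFAULT_PROMPT = "Extract the invoice fields."


def extract_invoice_fields(image_path, prompt=DEFAULT_PROMPT,
                           model_id=DEFAULT_MODEL_ID,
                           max_new_tokens=256, num_beams=1):
    """Run the model on one invoice image and return the raw generated text."""
    # Heavy dependencies are imported lazily so that this module can be
    # imported without loading PyTorch or downloading weights.
    from PIL import Image
    from transformers import (Pix2StructForConditionalGeneration,
                              Pix2StructProcessor)

    processor = Pix2StructProcessor.from_pretrained(model_id)
    model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    # For VQA-style Pix2Struct checkpoints the text is rendered onto the image.
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=max_new_tokens,
                               num_beams=num_beams)
    return processor.decode(generated[0], skip_special_tokens=True)


if __name__ == "__main__":
    # "invoice.png" is a placeholder path.
    print(extract_invoice_fields("invoice.png"))
```

The function returns the generated string as-is; turning it into structured fields is a separate post-processing step.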


Training data

The dataset used in this stage combines:

  1. Synthetic template-based invoices (V0)
  2. Synthetic invoices with randomized layouts (V1)
  3. Hybrid invoices with real layouts and synthetic content (V2)
  4. Real annotated invoices

Real data fine-tuning

The final stage introduces:

  • real invoice images
  • realistic visual noise and distortions
  • natural language variability
  • real formatting inconsistencies

This allows the model to:

  • better align generated outputs with real-world distributions
  • improve robustness of sequence generation
  • reduce hallucinations and formatting errors

Role in the pipeline

This model corresponds to:

V3 – Full pipeline (synthetic + hybrid + real data fine-tuning)

It represents:

  • the final generative model
  • the best-performing Pix2Struct variant
  • an end-to-end extraction approach

Intended uses

  • End-to-end invoice information extraction from images
  • Document VQA and generative document understanding
  • OCR-free document processing pipelines
  • Research comparing generative and structured extraction approaches

Limitations

  • Output format may still be inconsistent
  • Sensitive to decoding strategy and prompt structure
  • Less interpretable than token classification models
  • Requires post-processing for structured outputs
  • Computationally more expensive than token classification models
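
Because the output is free text, some post-processing is needed before the fields can be used downstream. The sketch below is one tolerant parser, assuming the model emits `key: value` pairs separated by semicolons or newlines. That output format, the field names in `EXPECTED_FIELDS`, and the function name are assumptions for illustration; adapt them to the format the model actually produces.

```python
import re

# Hypothetical field keys; the card lists field categories, not exact keys.
EXPECTED_FIELDS = ("supplier", "customer", "invoice_number",
                   "bank_details", "total", "date")


def parse_generated_fields(text, expected=EXPECTED_FIELDS):
    """Parse 'key: value' pairs and report which expected fields are missing.

    Returns a (fields, missing) tuple. Chunks without a colon are skipped,
    and the first occurrence of a repeated key wins.
    """
    fields = {}
    for chunk in re.split(r"[;\n]+", text):
        if ":" not in chunk:
            continue  # malformed chunk, ignore rather than fail
        key, value = chunk.split(":", 1)
        key = re.sub(r"\s+", "_", key.strip().lower())
        value = value.strip()
        if key and value and key not in fields:
            fields[key] = value
    missing = [name for name in expected if name not in fields]
    return fields, missing


if __name__ == "__main__":
    raw = "Supplier: ACME s.r.o.; Invoice Number: 2024-001\nTotal: 1 210,00 CZK"
    print(parse_generated_fields(raw))
```

Returning the list of missing fields makes inconsistent outputs visible to the caller, who can then retry with a different decoding setting or flag the invoice for manual review.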

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 8
  • eval_batch_size: 1
  • seed: 42
  • optimizer: AdamW, fused PyTorch implementation (OptimizerNames.ADAMW_TORCH_FUSED), with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine_with_restarts
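  • lr_scheduler_warmup_steps: 0.1 (a fractional value, which presumably denotes a warmup ratio of 10% of the training steps rather than a step count)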
  • num_epochs: 10
  • mixed_precision_training: Native AMP
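
For reproduction, the listed values can be mapped onto `transformers.TrainingArguments` roughly as follows. This is a sketch under assumptions, not the author's training script: `fp16=True` is a guess at what "Native AMP" refers to (it could equally be `bf16`), the fractional `warmup_steps` assumes the Transformers 5.x release listed under framework versions, and the output directory name is hypothetical.

```python
# Values copied from the hyperparameter list; keys follow TrainingArguments.
REPORTED_HYPERPARAMETERS = {
    "learning_rate": 1e-4,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 1,
    "seed": 42,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "cosine_with_restarts",
    "warmup_steps": 0.1,   # reported as-is; assumed to act as a ratio
    "num_train_epochs": 10,
    "fp16": True,          # assumption: "Native AMP" with float16
}


def build_training_arguments(output_dir="pix2struct-czech-invoice-v3",
                             **overrides):
    """Build TrainingArguments from the reported values plus any overrides."""
    # Imported lazily so the dictionary above is usable without transformers.
    from transformers import TrainingArguments

    kwargs = {**REPORTED_HYPERPARAMETERS, **overrides}
    return TrainingArguments(output_dir=output_dir, **kwargs)
```

On a machine without a CUDA device, pass `fp16=False` and `optim="adamw_torch"` as overrides, since the fused optimizer and float16 AMP generally expect a GPU.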

Training results

Training Loss   Epoch   Step   Validation Loss   F1
0.3277          1.0     23     0.1958            0.7239
0.2366          2.0     46     0.1446            0.8037
0.1780          3.0     69     0.1247            0.8060
0.1153          4.0     92     0.1178            0.8316
0.0895          5.0     115    0.1279            0.8312
0.0774          6.0     138    0.1542            0.8404
0.0766          7.0     161    0.1530            0.7972
0.0697          8.0     184    0.1385            0.8372
0.0804          9.0     207    0.1433            0.7963
0.0664          10.0    230    0.1614            0.7991
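
A few quantities can be read off the training results with simple arithmetic. The sketch below copies the rows from the table and derives the best epochs and an approximate training set size. The size estimate assumes a single device, no gradient accumulation, and a final partial batch that is kept; none of these is stated in the card.

```python
# (training_loss, epoch, step, validation_loss, f1), copied from the table.
ROWS = [
    (0.3277, 1, 23, 0.1958, 0.7239),
    (0.2366, 2, 46, 0.1446, 0.8037),
    (0.1780, 3, 69, 0.1247, 0.8060),
    (0.1153, 4, 92, 0.1178, 0.8316),
    (0.0895, 5, 115, 0.1279, 0.8312),
    (0.0774, 6, 138, 0.1542, 0.8404),
    (0.0766, 7, 161, 0.1530, 0.7972),
    (0.0697, 8, 184, 0.1385, 0.8372),
    (0.0804, 9, 207, 0.1433, 0.7963),
    (0.0664, 10, 230, 0.1614, 0.7991),
]
TRAIN_BATCH_SIZE = 8  # from the hyperparameter list


def summarize(rows=ROWS, batch_size=TRAIN_BATCH_SIZE):
    """Derive best epochs and a training set size range from the log."""
    best_f1 = max(rows, key=lambda r: r[4])
    lowest_val_loss = min(rows, key=lambda r: r[3])
    steps_per_epoch = rows[0][2] // rows[0][1]
    # ceil(n / batch_size) == steps_per_epoch holds for n in this closed range.
    size_range = ((steps_per_epoch - 1) * batch_size + 1,
                  steps_per_epoch * batch_size)
    return {
        "best_f1_epoch": best_f1[1],
        "lowest_val_loss_epoch": lowest_val_loss[1],
        "steps_per_epoch": steps_per_epoch,
        "train_size_range": size_range,
    }


if __name__ == "__main__":
    print(summarize())
```

Under these assumptions the run used 23 steps per epoch, which corresponds to between 177 and 184 training examples. The lowest validation loss occurs at epoch 4 while the highest F1 occurs at epoch 6, which may indicate that the loss and the F1 metric are only loosely coupled on a dataset of this size.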

Framework versions

  • Transformers 5.0.0
  • PyTorch 2.10.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.2