Pix2StructCzechInvoice (V3 – Full Pipeline with Real Data Fine-Tuning)

This model is a fine-tuned version of google/pix2struct-docvqa-base for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set (these values match epoch 6, step 138, in the training results table):

Loss: 0.1542
F1: 0.8404
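
The card does not state how F1 is computed. One common choice for extraction tasks is a micro-averaged F1 over exact (field, value) matches. The sketch below assumes that definition; the function name, field names, and sample values are hypothetical and are not taken from the evaluation code behind the reported score.

```python
def field_f1(predictions, references):
    """Micro-averaged F1 over exact (field, value) matches.

    Assumption: each document is represented as a dict mapping a field
    name to its extracted string value. This is an illustrative metric,
    not necessarily the one used to produce the score reported above.
    """
    tp = fp = fn = 0
    for pred, gold in zip(predictions, references):
        pred_items = set(pred.items())
        gold_items = set(gold.items())
        tp += len(pred_items & gold_items)   # field present with the same value
        fp += len(pred_items - gold_items)   # predicted but wrong or spurious
        fn += len(gold_items - pred_items)   # expected but missing or wrong
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one correct field, one wrong value, one missing field.
predicted = [{"invoice_number": "2024001", "total": "1210.00"}]
expected = [{"invoice_number": "2024001", "total": "1200.00", "supplier": "ACME s.r.o."}]
score = field_f1(predicted, expected)
```

Under this definition a wrong value counts both as a false positive and as a false negative, so formatting errors in generated values are penalized twice.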

Model description

Pix2StructCzechInvoice (V3) is the final generative model in the experimental pipeline.

Unlike token classification approaches, this model:

processes full document images
generates structured outputs as text sequences

It extracts key invoice fields such as:

supplier
customer
invoice number
bank details
totals
dates

By combining synthetic, hybrid, and real data, this version significantly improves both performance and stability.
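
The image-in, text-out flow described in this section can be sketched with the Pix2Struct classes from Transformers. This is a minimal sketch under several assumptions: that the repository named in the model tree (TomasFAV/Pix2StructCzechInvoiceV0123) ships both model weights and processor files, and that the model accepts a text prompt like the DocVQA base model does. The prompt string, function name, and `max_new_tokens` value are hypothetical; the prompt used during fine-tuning is not documented in this card.

```python
MODEL_ID = "TomasFAV/Pix2StructCzechInvoiceV0123"

# Hypothetical prompt: the card does not document the prompt used in training.
DEFAULT_PROMPT = "Extract the invoice fields."


def extract_invoice_text(image_path, prompt=DEFAULT_PROMPT, max_new_tokens=256):
    """Run the model on one invoice image and return the raw generated text."""
    # Heavy dependencies are imported lazily so the module can be imported
    # without loading torch or transformers.
    import torch
    from PIL import Image
    from transformers import (
        Pix2StructForConditionalGeneration,
        Pix2StructProcessor,
    )

    processor = Pix2StructProcessor.from_pretrained(MODEL_ID)
    model = Pix2StructForConditionalGeneration.from_pretrained(MODEL_ID)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    # For VQA-style Pix2Struct checkpoints the processor renders the text
    # prompt onto the image before extracting patches.
    inputs = processor(images=image, text=prompt, return_tensors="pt")

    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(generated[0], skip_special_tokens=True)
```

A call such as `extract_invoice_text("invoice.png")` (the file name is a placeholder) returns a single string. If the repository turns out not to include processor files, loading the processor from google/pix2struct-docvqa-base is a reasonable fallback to try.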

Training data

The dataset used in this stage combines:

Synthetic template-based invoices (V0)
Synthetic invoices with randomized layouts (V1)
Hybrid invoices with real layouts and synthetic content (V2)
Real annotated invoices

Real data fine-tuning

The final stage introduces:

real invoice images
realistic visual noise and distortions
natural language variability
real formatting inconsistencies

This allows the model to:

better align generated outputs with real-world distributions
improve robustness of sequence generation
reduce hallucinations and formatting errors

Role in the pipeline

This model corresponds to:

V3 – Full pipeline (synthetic + hybrid + real data fine-tuning)

It represents:

the final generative model
the best-performing Pix2Struct variant
an end-to-end extraction approach

Intended uses

End-to-end invoice information extraction from images
Document VQA and generative document understanding
OCR-free document processing pipelines
Research in generative vs structured extraction approaches

Limitations

Output format may still be inconsistent
Sensitive to decoding strategy and prompt structure
Less interpretable than token classification models
Requires post-processing for structured outputs
Computationally more expensive than token classification models
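
Because the generated text needs post-processing and its format may vary, a tolerant parser is usually the first step after decoding. The sketch below assumes a hypothetical output format of `key: value` pairs separated by semicolons; the actual format produced by this model is not documented in the card, so the separator, the key names in `EXPECTED_FIELDS`, and both function names are illustrative only.

```python
# Hypothetical field keys, loosely based on the field list in this card.
EXPECTED_FIELDS = (
    "supplier",
    "customer",
    "invoice_number",
    "bank_details",
    "total",
    "date",
)


def parse_generated(text, sep=";"):
    """Parse 'key: value' chunks into a dict, skipping malformed chunks."""
    fields = {}
    for chunk in text.split(sep):
        # Split on the first colon only, so values such as "12:30" survive.
        key, colon, value = chunk.partition(":")
        key = key.strip().lower().replace(" ", "_")
        value = value.strip()
        if colon and key and value:
            # Keep the first occurrence if the model repeats a field.
            fields.setdefault(key, value)
    return fields


def missing_fields(fields, expected=EXPECTED_FIELDS):
    """List expected fields that the parsed output does not contain."""
    return [name for name in expected if name not in fields]


raw = "supplier: ACME s.r.o.; Invoice Number: 2024001; total: 1210.00; garbage"
parsed = parse_generated(raw)
absent = missing_fields(parsed)
```

Reporting missing fields explicitly, rather than silently returning a partial dict, makes formatting failures visible downstream. A value that itself contains the separator would be split incorrectly, so a different separator or a stricter grammar may be needed in practice.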

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 8
eval_batch_size: 1
seed: 42
optimizer: AdamW (fused PyTorch implementation, OptimizerNames.ADAMW_TORCH_FUSED) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
lr_scheduler_type: cosine_with_restarts
lr_scheduler_warmup_steps: 0.1 (a fractional value, most likely a warmup ratio, i.e. 10% of the total training steps)
num_epochs: 10
mixed_precision_training: Native AMP
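
To make the learning-rate settings above concrete, the following sketch reproduces the shape of a cosine schedule with hard restarts and linear warmup in plain Python. It rests on assumptions that the card does not confirm: that the warmup value 0.1 is a fraction of the 230 total steps (giving 23 warmup steps), and that the number of cycles is 1. The function name and the use of `round` for the warmup length are illustrative, not a copy of the library code.

```python
import math


def lr_at_step(step, base_lr=1e-4, total_steps=230, warmup=0.1, num_cycles=1):
    """Learning rate at a given optimizer step.

    Assumptions: a warmup value below 1 is treated as a fraction of
    total_steps, and num_cycles=1 (no extra restarts).
    """
    warmup_steps = round(total_steps * warmup) if warmup < 1 else int(warmup)

    # Linear warmup from 0 to base_lr.
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)

    # Cosine decay, restarting num_cycles times over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    cosine = 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0)))
    return base_lr * max(0.0, cosine)


# Sample the schedule at the end of each epoch (23 steps per epoch).
epoch_end_lrs = [lr_at_step(23 * epoch) for epoch in range(1, 11)]
```

With these assumptions the learning rate peaks at the end of the first epoch and decays to zero by step 230. With a single cycle the curve is the same as a plain cosine schedule; restarts only matter when `num_cycles` is greater than 1.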

Training results

Training Loss   Epoch   Step   Validation Loss   F1
0.3277          1.0     23     0.1958            0.7239
0.2366          2.0     46     0.1446            0.8037
0.1780          3.0     69     0.1247            0.8060
0.1153          4.0     92     0.1178            0.8316
0.0895          5.0     115    0.1279            0.8312
0.0774          6.0     138    0.1542            0.8404
0.0766          7.0     161    0.1530            0.7972
0.0697          8.0     184    0.1385            0.8372
0.0804          9.0     207    0.1433            0.7963
0.0664          10.0    230    0.1614            0.7991
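
The table can be read programmatically to see which epoch the headline metrics come from and how large the training set roughly is. The sketch below copies the table values verbatim. The size estimate assumes a single device, no gradient accumulation, and that the last partial batch is kept, none of which the card states; the names `EpochResult` and `train_set_size_bounds` are illustrative.

```python
from collections import namedtuple

EpochResult = namedtuple("EpochResult", "epoch step train_loss val_loss f1")

# Values copied from the training results table.
RESULTS = [
    EpochResult(1, 23, 0.3277, 0.1958, 0.7239),
    EpochResult(2, 46, 0.2366, 0.1446, 0.8037),
    EpochResult(3, 69, 0.1780, 0.1247, 0.8060),
    EpochResult(4, 92, 0.1153, 0.1178, 0.8316),
    EpochResult(5, 115, 0.0895, 0.1279, 0.8312),
    EpochResult(6, 138, 0.0774, 0.1542, 0.8404),
    EpochResult(7, 161, 0.0766, 0.1530, 0.7972),
    EpochResult(8, 184, 0.0697, 0.1385, 0.8372),
    EpochResult(9, 207, 0.0804, 0.1433, 0.7963),
    EpochResult(10, 230, 0.0664, 0.1614, 0.7991),
]

best_by_f1 = max(RESULTS, key=lambda r: r.f1)
best_by_val_loss = min(RESULTS, key=lambda r: r.val_loss)


def train_set_size_bounds(steps_per_epoch, batch_size):
    """Range of dataset sizes N for which ceil(N / batch_size) == steps_per_epoch.

    Assumes one device, no gradient accumulation, and that the final
    partial batch is not dropped.
    """
    upper = steps_per_epoch * batch_size
    lower = upper - batch_size + 1
    return lower, upper


size_bounds = train_set_size_bounds(steps_per_epoch=23, batch_size=8)

print(f"Best F1: epoch {best_by_f1.epoch} (F1={best_by_f1.f1})")
print(f"Lowest validation loss: epoch {best_by_val_loss.epoch} "
      f"(loss={best_by_val_loss.val_loss})")
print(f"Estimated training set size: {size_bounds[0]} to {size_bounds[1]} examples")
```

The highest F1 (0.8404) occurs at epoch 6, while the lowest validation loss (0.1178) occurs at epoch 4. Validation loss rises after epoch 4 while training loss keeps falling, which is worth keeping in mind when choosing a checkpoint.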

Framework versions

Transformers 5.0.0
PyTorch 2.10.0+cu128
Datasets 4.0.0
Tokenizers 0.22.2

Model size: 0.3B params (Safetensors, tensor type F32)

Model tree for TomasFAV/Pix2StructCzechInvoiceV0123

Base model: google/pix2struct-docvqa-base