Pix2StructCzechInvoice (V1 – Synthetic + Random Layout)

This model is a fine-tuned version of TomasFAV/Pix2StructCzechInvoiceV0 for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set (these match the epoch 9 row of the training results, not the final epoch):

Loss: 0.4679
F1: 0.6432

Model description

Pix2StructCzechInvoice (V1) extends the baseline generative model by introducing layout variability into the training data.

Unlike token classification models, this model:

processes full document images
generates structured outputs as text sequences

It is trained to extract key invoice fields:

supplier
customer
invoice number
bank details
totals
dates
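
A minimal inference sketch, assuming the checkpoint loads through the standard Pix2Struct classes in Transformers. The repository id is the one this page lists for the model; the prompt string, the `max_new_tokens` value, and the function name are assumptions, since this card does not document the prompt or the output format used during fine-tuning.

```python
MODEL_ID = "TomasFAV/Pix2StructCzechInvoiceV01"  # repository id listed on this page


def extract_invoice_text(image_path, prompt="Extract the invoice fields.", max_new_tokens=512):
    """Generate the raw structured-text output for one invoice image.

    `prompt` is a hypothetical placeholder: DocVQA-style Pix2Struct
    processors usually expect a text input, but the card does not state
    which prompt (if any) was used during fine-tuning.
    """
    # Imported lazily so this module can be imported without the heavy dependencies.
    import torch
    from PIL import Image
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    processor = Pix2StructProcessor.from_pretrained(MODEL_ID)
    model = Pix2StructForConditionalGeneration.from_pretrained(MODEL_ID)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=prompt, return_tensors="pt")

    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # The result is a plain string; turning it into fields depends on the
    # annotation format, which is not documented here.
    return processor.decode(generated_ids[0], skip_special_tokens=True)
```

Calling `extract_invoice_text("invoice.png")` would return one decoded string per image; any downstream parsing has to follow whatever annotation format the training data used.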

Training data

The dataset consists of:

synthetically generated invoice images
augmented variants with randomized layouts
corresponding structured text outputs

Key properties:

variable layout structure
visual diversity (spacing, positioning, formatting)
consistent annotation format
fully synthetic data

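Randomizing layouts exposes the model to the same fields in varied visual arrangements. This matters for generative multimodal models, which read the page image directly and therefore cannot rely on fixed field positions.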

Role in the pipeline

This model corresponds to:

V1 – Synthetic templates + randomized layouts

It is used to:

evaluate the effect of layout variability on generative models
compare against:
- V0 (fixed templates)
- later hybrid and real-data stages (V2, V3)
analyze robustness of end-to-end extraction

Intended uses

End-to-end invoice extraction from images
Document VQA-style tasks
Research in generative document understanding
Comparison with structured prediction models

Limitations

Still trained only on synthetic data
Sensitive to output formatting inconsistencies
Training instability: F1 fluctuates across epochs (e.g. 0.6394 at epoch 5, 0.3384 at epoch 7, 0.6432 at epoch 9)
Evaluation depends on string matching quality
Less interpretable than token classification models
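
The card does not specify how F1 is computed. As one illustration of why string matching matters, the sketch below scores predicted fields against gold fields using exact match after light normalization. The function names, the dict-based field format, the normalization rules, and the example values are all assumptions, not the evaluation code behind the reported numbers.

```python
def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(value).lower().split())


def field_f1(predicted, gold):
    """Micro F1 over (field, value) pairs, exact match after normalization."""
    pred_pairs = {(k, normalize(v)) for k, v in predicted.items() if normalize(v)}
    gold_pairs = {(k, normalize(v)) for k, v in gold.items() if normalize(v)}
    true_pos = len(pred_pairs & gold_pairs)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_pairs)
    recall = true_pos / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


# Made-up example values, for illustration only.
gold = {"invoice number": "2024-0017", "totals": "12 100,00 Kč", "supplier": "ACME s.r.o."}
pred = {"invoice number": "2024-0017", "totals": "12100,00 Kč", "supplier": "acme  s.r.o."}
print(field_f1(pred, gold))
```

In this example the supplier matches after normalization, but the total differs only by a thousands separator and still counts as a miss, so two of three fields are correct.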

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 8
eval_batch_size: 1
seed: 42
optimizer: AdamW, fused PyTorch implementation (OptimizerNames.ADAMW_TORCH_FUSED), with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
lr_scheduler_type: cosine_with_restarts
lr_scheduler_warmup_steps: 0.1
num_epochs: 10
mixed_precision_training: Native AMP
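
To make the scheduler setting concrete, the sketch below reproduces the shape of a cosine schedule with hard restarts and linear warmup in plain Python. It assumes that the warmup value of 0.1 is interpreted as a fraction of the 750 total steps (75 steps) and that a single cosine cycle is used; neither is stated in this card. The function name is illustrative.

```python
import math

BASE_LR = 1e-4       # learning_rate listed above
TOTAL_STEPS = 750    # 10 epochs x 75 steps per epoch, per the training results
WARMUP_STEPS = round(0.1 * TOTAL_STEPS)  # assumption: 0.1 is a ratio of total steps


def lr_at(step, num_cycles=1):
    """Learning rate at an optimizer step: linear warmup, then cosine with hard restarts."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    if progress >= 1.0:
        return 0.0
    # The modulo restarts the cosine at the start of each cycle.
    cosine = 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0)))
    return BASE_LR * max(0.0, cosine)


for step in (0, 75, 412, 750):
    print(step, lr_at(step))
```

Under these assumptions the rate climbs linearly to its peak at the end of the first epoch and then decays toward zero over the remaining nine.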

Training results

Training Loss	Epoch	Step	Validation Loss	F1
0.1978	1.0	75	0.3757	0.5804
0.1031	2.0	150	0.3578	0.6399
0.0725	3.0	225	0.3504	0.6318
0.0512	4.0	300	0.3929	0.6396
0.0500	5.0	375	0.4072	0.6394
0.0462	6.0	450	0.4655	0.4377
0.0502	7.0	525	0.6320	0.3384
0.0528	8.0	600	0.4835	0.5018
0.0393	9.0	675	0.4679	0.6432
0.0392	10.0	750	0.5330	0.4931
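
Because F1 fluctuates across epochs, the choice of checkpoint matters. The sketch below copies the table into Python and picks the epoch with the highest F1 and the epoch with the lowest validation loss. The variable names are illustrative; the numbers are the ones reported above.

```python
# (epoch, training_loss, validation_loss, f1), copied from the table above
results = [
    (1, 0.1978, 0.3757, 0.5804),
    (2, 0.1031, 0.3578, 0.6399),
    (3, 0.0725, 0.3504, 0.6318),
    (4, 0.0512, 0.3929, 0.6396),
    (5, 0.0500, 0.4072, 0.6394),
    (6, 0.0462, 0.4655, 0.4377),
    (7, 0.0502, 0.6320, 0.3384),
    (8, 0.0528, 0.4835, 0.5018),
    (9, 0.0393, 0.4679, 0.6432),
    (10, 0.0392, 0.5330, 0.4931),
]

best_f1 = max(results, key=lambda row: row[3])
best_val = min(results, key=lambda row: row[2])

print(f"highest F1: epoch {best_f1[0]} (F1={best_f1[3]}, val loss={best_f1[2]})")
print(f"lowest val loss: epoch {best_val[0]} (val loss={best_val[2]}, F1={best_val[3]})")
```

On these numbers the two criteria disagree: epoch 9 has the highest F1 (0.6432), while epoch 3 has the lowest validation loss (0.3504).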

Framework versions

Transformers 5.0.0
PyTorch 2.10.0+cu128
Datasets 4.0.0
Tokenizers 0.22.2

Model size: 0.3B params
Tensor type: F32 (Safetensors)

Model tree for TomasFAV/Pix2StructCzechInvoiceV01

Base model: google/pix2struct-docvqa-base
Fine-tuned from: TomasFAV/Pix2StructCzechInvoiceV0
This model: TomasFAV/Pix2StructCzechInvoiceV01