Pix2StructCzechInvoice (V2 – Synthetic + Random Layout + Real Layout Injection)

This model is a fine-tuned version of google/pix2struct-docvqa-base for structured information extraction from Czech invoices.

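It achieves the following results on the evaluation set (the epoch 3 checkpoint at step 345, which has both the lowest validation loss and the highest F1 in the training results):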

Loss: 0.2521
F1: 0.7311

Model description

Pix2StructCzechInvoice (V2) is the V2 stage of a generative document understanding pipeline, between the synthetic-only V0–V1 models and the V3 model fine-tuned on real data.

The model:

processes full document images
generates structured outputs as text sequences

It is trained to extract key invoice fields:

supplier
customer
invoice number
bank details
totals
dates

This version introduces real layout injection, which is intended to make the training images more visually realistic and to improve generalization to real invoices.

Training data

The dataset consists of three components:

Synthetic template-based invoices
Synthetic invoices with randomized layouts
Hybrid invoices with real layouts and synthetic content

Real layout injection

In the hybrid dataset:

real invoice layouts are used as templates
original content is replaced with synthetic data
new content is rendered into realistic visual structures

This preserves:

real-world layout complexity
visual patterns and formatting
document structure variability

while maintaining:

full control over annotations
consistent output format
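
The hybrid procedure described above can be sketched in a few lines. This is a minimal illustration, not the generator used for this model: `LayoutSlot`, `inject_content`, the field names, the pixel coordinates, and the JSON target format are all hypothetical. It only shows the core idea that slot positions come from a real invoice while the values, and therefore the labels, are synthetic and known exactly.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutSlot:
    """A region taken from a real invoice layout (hypothetical structure)."""
    field: str   # which invoice field is printed in this region
    box: tuple   # (left, top, right, bottom) in pixels


def inject_content(layout, synthetic_values):
    """Pair each real-layout slot with a synthetic value.

    Returns the draw instructions for a renderer and the target text,
    which is known exactly because the content is synthetic.
    """
    draw_ops = []
    annotation = {}
    for slot in layout:
        value = synthetic_values[slot.field]  # KeyError if a field is missing
        draw_ops.append({"box": slot.box, "text": value})
        annotation[slot.field] = value
    # Serialising to JSON is an assumption; the card does not specify the
    # target format used for training.
    target = json.dumps(annotation, ensure_ascii=False, sort_keys=True)
    return draw_ops, target


# Hypothetical layout extracted from one real invoice.
layout = [
    LayoutSlot("supplier", (40, 60, 400, 120)),
    LayoutSlot("invoice_number", (420, 60, 760, 90)),
    LayoutSlot("total", (520, 980, 760, 1020)),
]
# Synthetic content generated independently of the layout.
values = {
    "supplier": "Novák s.r.o.",
    "invoice_number": "2024-0017",
    "total": "12 100,00 Kč",
}

draw_ops, target = inject_content(layout, values)
print(target)
```

A renderer would then draw each `text` into its `box` on a cleaned copy of the real page, and `target` would serve as the training label.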

Role in the pipeline

This model corresponds to:

V2 – Synthetic + layout augmentation + real layout injection

It is used to:

reduce the domain gap between synthetic and real documents
evaluate the effect of realistic layouts on generative models
compare with:
- V0–V1 (synthetic-only training)
- V3 (real data fine-tuning)

Intended uses

End-to-end invoice extraction from images
Document VQA-style tasks
Research in generative document understanding
Evaluation of hybrid training strategies
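
For end-to-end extraction, a minimal inference sketch with the Transformers Pix2Struct classes might look like the following. Several things are assumptions: that the repository `TomasFAV/Pix2StructCzechInvoiceV012` ships processor files alongside the weights, that the prompt string (hypothetical here) matches whatever was used during fine-tuning, and that the generation settings are reasonable defaults rather than the ones used for evaluation.

```python
MODEL_ID = "TomasFAV/Pix2StructCzechInvoiceV012"  # repository id from the model page
PROMPT = "Extract the invoice fields."  # hypothetical; use the fine-tuning prompt


def extract_invoice(image_path, prompt=PROMPT, max_new_tokens=512, num_beams=1):
    """Run the model on one invoice image and return the generated text."""
    # Heavy imports stay inside the function so importing this module is cheap.
    import torch
    from PIL import Image
    from transformers import (
        Pix2StructForConditionalGeneration,
        Pix2StructProcessor,
    )

    processor = Pix2StructProcessor.from_pretrained(MODEL_ID)
    model = Pix2StructForConditionalGeneration.from_pretrained(MODEL_ID)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    # DocVQA-style Pix2Struct checkpoints render the text prompt as a header
    # on top of the image, so the processor is given both.
    inputs = processor(images=image, text=prompt, return_tensors="pt")

    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, num_beams=num_beams
        )
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

Calling `extract_invoice("invoice.png")` (a placeholder path) would return the raw generated string. Since the output is sensitive to decoding, it may be worth comparing greedy decoding (`num_beams=1`) with a small beam.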

Limitations

Generated outputs may contain formatting errors
Outputs are sensitive to the decoding strategy and tokenization
All text content is synthetic, so the model still lacks full exposure to real linguistic variability
Training remains less stable than for classification-based models
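
Because generated outputs can be malformed, downstream code may want a tolerant parser. The sketch below assumes, purely for illustration, that the model emits a flat JSON object of string values. The card does not document the output format, and the key names in `EXPECTED_FIELDS` are hypothetical. It tries strict parsing first and otherwise salvages any complete key/value pairs.

```python
import json
import re

# Hypothetical key names; the real output schema is not documented in the card.
EXPECTED_FIELDS = (
    "supplier", "customer", "invoice_number", "bank_details", "total", "date",
)

# Matches complete "key": "value" pairs inside a possibly broken string.
_PAIR = re.compile(r'"(\w+)"\s*:\s*"([^"]*)"')


def parse_generated(text):
    """Return (fields, is_clean). Fields that could not be read map to None."""
    try:
        data = json.loads(text)
        is_clean = isinstance(data, dict)
        if not is_clean:
            data = {}
    except json.JSONDecodeError:
        # Salvage whatever pairs are intact, e.g. when generation stopped
        # before the closing brace.
        data = dict(_PAIR.findall(text))
        is_clean = False
    fields = {key: data.get(key) for key in EXPECTED_FIELDS}
    return fields, is_clean


fields, ok = parse_generated('{"supplier": "Novák s.r.o.", "invoice_number": "2024-0017", "tot')
print(ok, fields["invoice_number"], fields["total"])
```

Returning the `is_clean` flag separately lets an evaluation script count formatting failures instead of silently hiding them.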

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 8
eval_batch_size: 1
seed: 42
optimizer: AdamW (OptimizerNames.ADAMW_TORCH_FUSED) with betas=(0.9, 0.999) and epsilon=1e-08, no additional optimizer arguments
lr_scheduler_type: cosine_with_restarts
lr_scheduler_warmup_steps: 0.1 (a fractional value, so presumably a ratio of the total training steps rather than a step count)
num_epochs: 10
mixed_precision_training: Native AMP
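
To make the schedule concrete, the sketch below computes the learning rate per step in plain Python. It follows the shape of the Transformers helper `get_cosine_with_hard_restarts_schedule_with_warmup` under two assumptions that the card does not confirm: the warmup value 0.1 is read as a ratio of the 1150 total steps (10 epochs of 115 steps), and the number of cycles is left at 1, in which case no restart actually happens within the run.

```python
import math

BASE_LR = 1e-4
TOTAL_STEPS = 1150                               # 10 epochs x 115 steps per epoch
WARMUP_RATIO = 0.1                               # assumption: 0.1 is a ratio
WARMUP_STEPS = round(WARMUP_RATIO * TOTAL_STEPS)  # 115 steps under that assumption
NUM_CYCLES = 1                                   # assumption: not stated in the card


def lr_at(step):
    """Learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to BASE_LR.
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    if progress >= 1.0:
        return 0.0
    # Cosine decay that jumps back to BASE_LR at the start of each cycle.
    phase = (NUM_CYCLES * progress) % 1.0
    return BASE_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * phase)))


for step in (0, 115, 345, 1150):
    print(step, lr_at(step))
```

Under these assumptions the peak learning rate is reached at the end of epoch 1 and has already started to decay by epoch 3, where the best validation loss occurs.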

Training results

Training Loss	Epoch	Step	Validation Loss	F1
0.3432	1.0	115	0.2771	0.6644
0.1942	2.0	230	0.2611	0.6745
0.1934	3.0	345	0.2521	0.7311
0.1325	4.0	460	0.2665	0.7133
0.1131	5.0	575	0.2686	0.6762
0.1125	6.0	690	0.2601	0.7277
0.1011	7.0	805	0.2962	0.7118
0.1229	8.0	920	0.2893	0.7095
0.0861	9.0	1035	0.3019	0.6931
0.0860	10.0	1150	0.3167	0.7186
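
The card does not state how F1 is computed. One common choice for field extraction, shown here only as an assumption, is a micro-averaged F1 over (field, value) pairs with exact string matching. The function name `field_f1` and the dictionary-based inputs are hypothetical.

```python
def field_f1(predictions, references):
    """Micro-averaged F1 over (field, value) pairs with exact string match.

    Both arguments are lists of dicts, one dict per document. Empty or
    missing values are ignored on both sides.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred_items = {(k, v.strip()) for k, v in pred.items() if v}
        ref_items = {(k, v.strip()) for k, v in ref.items() if v}
        tp += len(pred_items & ref_items)   # correct field values
        fp += len(pred_items - ref_items)   # wrong or spurious predictions
        fn += len(ref_items - pred_items)   # reference values that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


refs = [{"supplier": "Novák s.r.o.", "total": "12 100,00 Kč"}]
preds = [{"supplier": "Novák s.r.o.", "total": "12 000,00 Kč"}]
print(field_f1(preds, refs))
```

With exact matching, a single wrong character in a total or account number counts as both a false positive and a false negative, so scores under this definition tend to be strict.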

Framework versions

Transformers 5.0.0
PyTorch 2.10.0+cu128
Datasets 4.0.0
Tokenizers 0.22.2

Model size: 0.3B params (Safetensors)
Tensor type: F32

Model tree for TomasFAV/Pix2StructCzechInvoiceV012

Base model: google/pix2struct-docvqa-base
Finetuned: this model