Pix2StructCzechInvoice (V0 – Synthetic Templates Only)

This model is a fine-tuned version of google/pix2struct-docvqa-base for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:

Loss: 0.5022 (final epoch, step 3000)
F1: 0.5907 (best value, reached at epoch 8; the final-epoch F1 is 0.4665)

Model description

Pix2StructCzechInvoice (V0) is a generative multimodal model designed for document understanding.

Unlike token classification models (e.g., BERT, LiLT, LayoutLMv3), this model:

processes the entire document image
generates structured outputs as text sequences
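
A minimal inference sketch for the image-in, text-out flow described above. It assumes the checkpoint loads with the standard Pix2Struct classes from Transformers and that, like its DocVQA base model, it expects a text prompt rendered onto the image. The prompt used during fine-tuning is not documented in this card, so the `prompt` argument and the `extract_invoice_text` helper are illustrative assumptions.

```python
def extract_invoice_text(
    image_path: str,
    prompt: str,
    model_id: str = "TomasFAV/Pix2StructCzechInvoiceV0",
    max_new_tokens: int = 256,
) -> str:
    """Run one invoice image through the model and return the generated text."""
    # Heavy dependencies are imported lazily so the helper can be defined
    # without loading PyTorch or Transformers.
    from PIL import Image
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    processor = Pix2StructProcessor.from_pretrained(model_id)
    model = Pix2StructForConditionalGeneration.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")

    # Assumption: the prompt format is hypothetical. DocVQA-style Pix2Struct
    # checkpoints draw this text as a header above the page before patching.
    inputs = processor(images=image, text=prompt, return_tensors="pt")

    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(generated_ids[0], skip_special_tokens=True)
```

A call such as `extract_invoice_text("invoice.png", "Extract the invoice fields")` returns the raw generated sequence; turning that sequence into structured fields is a separate parsing step.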

The model is trained to extract key invoice fields such as:

supplier
customer
invoice number
bank details
totals
dates

Training data

The dataset consists of:

synthetically generated invoice images
fixed template layouts
corresponding target text sequences representing structured fields
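
To make "target text sequences representing structured fields" concrete, here is a small sketch of one possible serialization and its inverse. The tag-based format and the field names in `FIELD_ORDER` are hypothetical; the card does not specify the actual target format used for training.

```python
import re

# Hypothetical field names, chosen to mirror the fields listed in this card.
FIELD_ORDER = ["supplier", "customer", "invoice_number", "bank_details", "total", "date"]

# Matches <name>value</name>; the backreference forces matching open/close tags.
_TAG = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def serialize_fields(fields: dict[str, str]) -> str:
    """Turn a field dict into one target string with a fixed field order."""
    parts = [f"<{name}>{fields[name]}</{name}>" for name in FIELD_ORDER if name in fields]
    return "".join(parts)


def parse_fields(text: str) -> dict[str, str]:
    """Recover fields from generated text; malformed fragments are skipped."""
    return {name: value.strip() for name, value in _TAG.findall(text)}
```

A fixed field order keeps targets deterministic, which matters for a generative model: two serializations of the same invoice should not differ only in ordering. The parser is deliberately tolerant, since generated output may be truncated or contain unclosed tags.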

Key properties:

clean and consistent visual structure
no OCR noise (end-to-end image input)
controlled output formatting
no real-world documents

This represents the baseline dataset for generative multimodal models.

Role in the pipeline

This model corresponds to:

V0 – Synthetic template-based dataset only

It is used to:

establish a baseline for generative document models
compare with token classification approaches (BERT, LiLT) and multimodal encoders (LayoutLMv3)
evaluate feasibility of end-to-end extraction

Intended uses

End-to-end invoice information extraction from images
Document VQA-style tasks
Research in generative document understanding
Comparison with structured prediction approaches

Limitations

Trained only on synthetic data
Sensitive to output formatting inconsistencies
Less stable than token classification models
Requires careful evaluation: exact string matching and field-level structured metrics can give different scores for the same output
Performance depends on generation quality
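
The difference between string matching and structured metrics can be illustrated with a short sketch. This is not necessarily how the F1 reported in this card was computed; the field-level definition below (a predicted field counts as correct only if its value matches the reference exactly) is an assumption for illustration.

```python
def exact_match(pred_text: str, gold_text: str) -> float:
    """1.0 if the whole sequences agree after whitespace normalization, else 0.0."""
    return 1.0 if " ".join(pred_text.split()) == " ".join(gold_text.split()) else 0.0


def field_f1(pred: dict[str, str], gold: dict[str, str]) -> float:
    """Field-level F1 for one document: a field is correct if key and value match."""
    true_pos = sum(1 for key, value in pred.items() if gold.get(key) == value)
    false_pos = len(pred) - true_pos
    false_neg = sum(1 for key, value in gold.items() if pred.get(key) != value)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

With three reference fields and a prediction that gets two of them right and one value wrong, the whole-sequence exact match is 0 while the field-level F1 is 2/3, so the two metrics can tell very different stories about the same output.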

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 4
eval_batch_size: 1
seed: 42
optimizer: AdamW, fused PyTorch implementation (OptimizerNames.ADAMW_TORCH_FUSED), with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
lr_scheduler_type: cosine_with_restarts
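lr_scheduler_warmup_steps: 0.1 (a fractional value, so it most likely denotes a warmup ratio of 10% of the training steps rather than a step count)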
num_epochs: 10
mixed_precision_training: Native AMP
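
As a rough illustration of what these scheduler settings imply, the sketch below computes the learning rate at a given step in plain Python. It follows the cosine-with-hard-restarts formula as implemented in Transformers, to the best of my understanding. Three values are assumptions rather than facts from this card: `total_steps=3000` is read off the results table, `warmup_steps=300` interprets the reported 0.1 as a 10% ratio, and `num_cycles=1` is the library default (with a single cycle the schedule reduces to a plain cosine decay).

```python
import math


def lr_at_step(
    step: int,
    base_lr: float = 1e-4,
    total_steps: int = 3000,   # assumption: 10 epochs x 300 steps
    warmup_steps: int = 300,   # assumption: 0.1 read as a ratio of total steps
    num_cycles: int = 1,       # assumption: library default, not stated in the card
) -> float:
    """Learning rate under linear warmup followed by cosine decay with restarts."""
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Each cycle restarts the cosine curve at its peak.
    cycle_position = (num_cycles * progress) % 1.0
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_position)))
```

Under these assumptions the rate peaks at the end of warmup (step 300), is half the peak midway through the decay (step 1650), and reaches zero at step 3000.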

Training results

Training Loss	Epoch	Step	Validation Loss	F1
3.1072	1.0	300	2.9769	0.0
2.6572	2.0	600	2.8684	0.0
2.4810	3.0	900	2.6349	0.0
1.7941	4.0	1200	1.6395	0.0
0.8458	5.0	1500	1.0680	0.2173
0.6198	6.0	1800	0.7713	0.4835
0.1999	7.0	2100	0.4331	0.5700
0.0946	8.0	2400	0.3844	0.5907
0.1020	9.0	2700	0.4066	0.4294
0.0842	10.0	3000	0.5022	0.4665
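
Because validation F1 peaks before the final epoch, checkpoint selection matters. The sketch below simply re-reads the table above and picks the best row by each criterion; the `EvalRow` type and helper names are illustrative, and the numbers are copied from the table.

```python
from typing import NamedTuple


class EvalRow(NamedTuple):
    epoch: int
    step: int
    train_loss: float
    val_loss: float
    f1: float


# Values copied from the training results table.
ROWS = [
    EvalRow(1, 300, 3.1072, 2.9769, 0.0),
    EvalRow(2, 600, 2.6572, 2.8684, 0.0),
    EvalRow(3, 900, 2.4810, 2.6349, 0.0),
    EvalRow(4, 1200, 1.7941, 1.6395, 0.0),
    EvalRow(5, 1500, 0.8458, 1.0680, 0.2173),
    EvalRow(6, 1800, 0.6198, 0.7713, 0.4835),
    EvalRow(7, 2100, 0.1999, 0.4331, 0.5700),
    EvalRow(8, 2400, 0.0946, 0.3844, 0.5907),
    EvalRow(9, 2700, 0.1020, 0.4066, 0.4294),
    EvalRow(10, 3000, 0.0842, 0.5022, 0.4665),
]


def best_by_f1(rows: list[EvalRow]) -> EvalRow:
    """Row with the highest validation F1."""
    return max(rows, key=lambda row: row.f1)


def best_by_val_loss(rows: list[EvalRow]) -> EvalRow:
    """Row with the lowest validation loss."""
    return min(rows, key=lambda row: row.val_loss)
```

Both criteria select epoch 8 (step 2400), while training loss keeps falling through epoch 10, which is consistent with overfitting to the synthetic templates in the last two epochs.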

Framework versions

Transformers 5.0.0
PyTorch 2.10.0+cu128
Datasets 4.0.0
Tokenizers 0.22.2

Safetensors

Model size: 0.3B params
Tensor type: F32

Model tree for TomasFAV/Pix2StructCzechInvoiceV0

Base model: google/pix2struct-docvqa-base
Finetuned (3): this model
Finetunes: 1 model