BERTInvoiceCzechR (V1 – Synthetic + Random Layout)

This model is a fine-tuned version of google-bert/bert-base-multilingual-cased for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:

Loss: 0.2295
Precision: 0.6594
Recall: 0.7309
F1: 0.6933
Accuracy: 0.9534
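
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported value can be recomputed from the two figures above. This is a minimal sketch that assumes the card reports a standard F1 derived from these (rounded) precision and recall values; the function and variable names are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation results on this card.
reported_precision = 0.6594
reported_recall = 0.7309
reported_f1 = 0.6933

recomputed_f1 = f1_score(reported_precision, reported_recall)
print(f"recomputed F1 = {recomputed_f1:.4f}, reported F1 = {reported_f1:.4f}")
```

Accuracy is much higher than F1 here. A likely explanation is that accuracy is counted over all tokens, including the many tokens that belong to no field, while precision, recall, and F1 cover only the extracted fields; the card does not state how the metrics were computed.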

Model description

BERTInvoiceCzechR (V1) extends the baseline model (V0) by introducing layout variability into the training data.

The model performs token-level classification to extract structured invoice fields such as:

supplier
customer
invoice number
bank details
totals
dates

Compared to V0, this version is trained on synthetically generated invoices with randomized layouts, which is intended to improve robustness to positional and structural variation.
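
The token-level classification described above can be sketched as follows. The repository id is taken from the model tree on this card; the label names (`B-SUPPLIER`, `I-SUPPLIER`, and so on) and the BIO tagging scheme are assumptions, since the card does not list the label set. `group_bio_tags` and `extract_fields` are hypothetical helper names, and `extract_fields` downloads the checkpoint when called.

```python
from typing import Dict, List, Tuple

# Repository id as shown in the model tree on this card.
MODEL_ID = "TomasFAV/BERTInvoiceCzechV01"


def group_bio_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge per-token BIO tags into (field, text) spans.

    Assumes tags of the form "B-FIELD", "I-FIELD", or "O" (hypothetical labels).
    """
    spans: List[Tuple[str, List[str]]] = []
    current_field = None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current_field = None
            continue
        prefix, _, field = tag.partition("-")
        if prefix == "B" or field != current_field:
            spans.append((field, [token]))   # start a new span
        else:
            spans[-1][1].append(token)       # continue the open span
        current_field = field
    return [(field, " ".join(words)) for field, words in spans]


def extract_fields(text: str) -> List[Dict]:
    """Run the published checkpoint on invoice text (needs network access)."""
    # Imported lazily so the helper above works without `transformers` installed.
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model=MODEL_ID,
        aggregation_strategy="simple",  # merge word pieces into entity groups
    )
    return tagger(text)


if __name__ == "__main__":
    demo_tokens = ["Dodavatel", ":", "ACME", "s.r.o.", "Faktura", "č.", "2024001"]
    demo_tags = ["O", "O", "B-SUPPLIER", "I-SUPPLIER", "O", "O", "B-INVOICE_NUMBER"]
    print(group_bio_tags(demo_tokens, demo_tags))
```

The grouping helper is kept separate from the model call so the span-merging logic can be checked without downloading weights.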

Training data

The dataset consists of:

synthetically generated invoices based on templates
additional variants with randomized layout structures

Key properties:

variable positioning of fields
layout perturbations (shifts, spacing, ordering)
preserved semantic correctness of labels
still fully synthetic (no real invoices)

This dataset introduces layout diversity, which is critical for generalization in document understanding tasks.
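
One way such layout randomization could work is sketched below. The card does not describe the actual generator, so this is a hypothetical illustration: it reorders labeled field blocks and inserts random spacing while leaving each token's label untouched. The function name, label names, and sample invoice values are all invented for the example.

```python
import random
from typing import List, Tuple

Token = Tuple[str, str]   # (text, label)
Block = List[Token]       # one logical invoice block, e.g. the supplier line


def randomize_layout(blocks: List[Block], rng: random.Random) -> List[Token]:
    """Shuffle block order and insert unlabeled spacer tokens between blocks.

    Tokens inside a block keep their order and labels, so the annotation
    stays correct while the position of each field on the page changes.
    """
    shuffled = blocks[:]          # copy so the template itself is not mutated
    rng.shuffle(shuffled)
    page: List[Token] = []
    for i, block in enumerate(shuffled):
        if i > 0:
            # Random vertical gap, modeled as 1 to 3 newline tokens labeled "O".
            page.extend([("\n", "O")] * rng.randint(1, 3))
        page.extend(block)
    return page


# Hypothetical template with made-up values and label names.
template = [
    [("Dodavatel:", "O"), ("ACME", "B-SUPPLIER"), ("s.r.o.", "I-SUPPLIER")],
    [("Faktura", "O"), ("č.", "O"), ("2024001", "B-INVOICE_NUMBER")],
    [("Celkem:", "O"), ("1210,00", "B-TOTAL"), ("Kč", "O")],
]

variant = randomize_layout(template, random.Random(42))
for text, label in variant:
    print(repr(text), label)
```

Passing an explicit `random.Random` instance keeps each generated variant reproducible from its seed.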

Role in the pipeline

This model corresponds to:

V1 – Synthetic templates + randomized layouts

It is used to:

evaluate the impact of layout variability
compare against:
- V0 (fixed templates)
- later stages with real data (V2, V3)
measure improvements in generalization

Intended uses

Research in layout-aware NLP without explicit layout models
Benchmarking robustness to structural variation
Intermediate baseline for synthetic data pipelines
Czech invoice information extraction

Limitations

Still trained only on synthetic data
No exposure to real-world noise (OCR errors, distortions)
Layout variation is artificial and may not fully reflect real documents
Does not leverage explicit spatial features (pure BERT)

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 16
eval_batch_size: 2
seed: 42
optimizer: AdamW (OptimizerNames.ADAMW_TORCH_FUSED) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 0.1
num_epochs: 10
mixed_precision_training: Native AMP
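
To make the schedule concrete, the sketch below computes the learning rate at a few steps under a linear schedule with warmup. It rests on two assumptions that the card does not confirm: that the listed warmup value of 0.1 is a fraction of total training steps, and that there are 65 optimizer steps per epoch, as the step counts in the training results suggest. It is plain Python for illustration, not the trainer's own scheduler.

```python
BASE_LR = 1e-5
STEPS_PER_EPOCH = 65      # step 65 is logged at epoch 1.0 in the training results
NUM_EPOCHS = 10
WARMUP_FRACTION = 0.1     # assumption: 0.1 means 10% of all training steps

TOTAL_STEPS = STEPS_PER_EPOCH * NUM_EPOCHS
WARMUP_STEPS = round(WARMUP_FRACTION * TOTAL_STEPS)


def linear_lr(step: int) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:3d}: lr = {linear_lr(step):.2e}")
```

Under these assumptions the warmup covers roughly the first epoch, after which the rate decays linearly to zero at the final step.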

Training results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
No log	1.0	65	0.2059	0.6571	0.6781	0.6674	0.9533
No log	2.0	130	0.2292	0.6598	0.7313	0.6937	0.9534
No log	3.0	195	0.2172	0.6789	0.6913	0.6850	0.9565
No log	4.0	260	0.2435	0.6385	0.7565	0.6925	0.9498
No log	5.0	325	0.2525	0.6347	0.7550	0.6896	0.9489
No log	6.0	390	0.2723	0.5994	0.7270	0.6571	0.9444
No log	7.0	455	0.2907	0.5963	0.7429	0.6616	0.9432
0.0306	8.0	520	0.2810	0.6146	0.7270	0.6661	0.9463
0.0306	9.0	585	0.2853	0.6059	0.7208	0.6584	0.9455
0.0306	10.0	650	0.2859	0.6054	0.7239	0.6594	0.9452
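
The table can be scanned programmatically to see which epoch is best under each metric. The sketch below copies the values from the table above; the `Row` type and variable names are illustrative.

```python
from collections import namedtuple

Row = namedtuple("Row", "epoch val_loss precision recall f1 accuracy")

# Values copied from the training results table on this card.
ROWS = [
    Row(1, 0.2059, 0.6571, 0.6781, 0.6674, 0.9533),
    Row(2, 0.2292, 0.6598, 0.7313, 0.6937, 0.9534),
    Row(3, 0.2172, 0.6789, 0.6913, 0.6850, 0.9565),
    Row(4, 0.2435, 0.6385, 0.7565, 0.6925, 0.9498),
    Row(5, 0.2525, 0.6347, 0.7550, 0.6896, 0.9489),
    Row(6, 0.2723, 0.5994, 0.7270, 0.6571, 0.9444),
    Row(7, 0.2907, 0.5963, 0.7429, 0.6616, 0.9432),
    Row(8, 0.2810, 0.6146, 0.7270, 0.6661, 0.9463),
    Row(9, 0.2853, 0.6059, 0.7208, 0.6584, 0.9455),
    Row(10, 0.2859, 0.6054, 0.7239, 0.6594, 0.9452),
]

best_by_f1 = max(ROWS, key=lambda r: r.f1)
best_by_loss = min(ROWS, key=lambda r: r.val_loss)

print(f"best F1: epoch {best_by_f1.epoch} ({best_by_f1.f1:.4f})")
print(f"lowest validation loss: epoch {best_by_loss.epoch} ({best_by_loss.val_loss:.4f})")
```

F1 peaks at epoch 2 and validation loss is lowest at epoch 1, then rises in later epochs, which suggests overfitting to the synthetic training data. The headline metrics at the top of the card are close to, but not identical with, the epoch 2 row; the card does not state which checkpoint was published.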

Framework versions

Transformers 5.0.0
PyTorch 2.10.0+cu128
Datasets 4.0.0
Tokenizers 0.22.2

Model details

Repository: TomasFAV/BERTInvoiceCzechV01
Base model: google-bert/bert-base-multilingual-cased
Format: Safetensors
Model size: 0.2B params
Tensor type: F32