BERTInvoiceCzechR (V1 – Synthetic + Random Layout)
This model is a fine-tuned version of google-bert/bert-base-multilingual-cased for the task of structured information extraction from Czech invoices.
It achieves the following results on the evaluation set:
- Loss: 0.2295
- Precision: 0.6594
- Recall: 0.7309
- F1: 0.6933
- Accuracy: 0.9534
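As a quick consistency check, the reported F1 can be recomputed from the reported precision and recall as their harmonic mean. The helper name `f1_score` below is illustrative and is not taken from the training code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported above for the evaluation set.
reported_precision = 0.6594
reported_recall = 0.7309

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 4))  # 0.6933
```

The much higher accuracy (0.9534) is not contradictory. If accuracy is computed per token and most tokens belong to no field, it will be far higher than a field-level F1. The card does not state which metric implementation was used, so this reading is an assumption.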
Model description
BERTInvoiceCzechR (V1) extends the baseline model (V0) by introducing layout variability into the training data.
The model performs token-level classification to extract structured invoice fields such as:
- supplier
- customer
- invoice number
- bank details
- totals
- dates
Compared to V0, this version is trained on synthetically generated invoices with randomized layouts, improving robustness to positional and structural variations.
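Token-level predictions still have to be merged into field values. The following is a minimal sketch of that step, assuming a BIO tagging scheme and label names such as `B-SUPPLIER` and `B-INVOICE_NUMBER`. Both are assumptions for illustration, since the card does not list the model's actual label set, and `group_fields` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def group_fields(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge per-token BIO labels into (field, text) pairs."""
    fields: List[Tuple[str, str]] = []
    current_field: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Emit the span collected so far, if any.
        if current_field is not None:
            fields.append((current_field, " ".join(current_tokens)))

    for token, label in zip(tokens, labels):
        prefix, _, name = label.partition("-")
        if prefix == "B" or (prefix == "I" and name != current_field):
            # A new field starts (a stray "I-" tag is treated as a start).
            flush()
            current_field, current_tokens = name, [token]
        elif prefix == "I":
            # Continuation of the current field.
            current_tokens.append(token)
        else:
            # "O": the token belongs to no field.
            flush()
            current_field, current_tokens = None, []
    flush()
    return fields


# Hypothetical example; the label names are assumed, not the model's real ones.
tokens = ["Dodavatel", ":", "ACME", "s.r.o.", "Faktura", "č.", "2024001"]
labels = ["O", "O", "B-SUPPLIER", "I-SUPPLIER", "O", "O", "B-INVOICE_NUMBER"]
extracted = group_fields(tokens, labels)
print(extracted)
```

In practice the model predicts labels for subword pieces, so the pieces would first need to be aligned back to words before a grouping step like this one.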
Training data
The dataset consists of:
- synthetically generated invoices based on templates
- additional variants with randomized layout structures
Key properties:
- variable positioning of fields
- layout perturbations (shifts, spacing, ordering)
- preserved semantic correctness of labels
- still fully synthetic (no real invoices)
This dataset introduces layout diversity, which is critical for generalization in document understanding tasks.
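The ordering perturbation can be pictured with a small sketch. It is not the generator used for this dataset. It assumes an invoice is represented as a list of blocks, each a list of `(token, label)` pairs, and only shuffles block order, leaving shifts and spacing unmodeled. The names `randomize_layout` and `Block`, and the label names, are hypothetical.

```python
import random
from typing import List, Tuple

# One block is a contiguous group of labeled tokens, e.g. the supplier section.
Block = List[Tuple[str, str]]


def randomize_layout(blocks: List[Block], seed: int) -> List[Tuple[str, str]]:
    """Return the invoice as one token sequence with the blocks in random order.

    Tokens keep their labels and their order inside each block, so the
    annotation stays semantically correct while the reading order changes.
    """
    rng = random.Random(seed)  # seeded for reproducible variants
    shuffled = list(blocks)    # copy, so the template itself is not modified
    rng.shuffle(shuffled)
    sequence: List[Tuple[str, str]] = []
    for block in shuffled:
        sequence.extend(block)
    return sequence


# Hypothetical template with three blocks (label names are assumptions).
template: List[Block] = [
    [("Dodavatel", "O"), ("ACME", "B-SUPPLIER"), ("s.r.o.", "I-SUPPLIER")],
    [("Faktura", "O"), ("2024001", "B-INVOICE_NUMBER")],
    [("Celkem", "O"), ("1210", "B-TOTAL"), ("Kč", "I-TOTAL")],
]

variant = randomize_layout(template, seed=42)
print(variant)
```

Because only the order of whole blocks changes, every token appears exactly once with its original label in each variant.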
Role in the pipeline
This model corresponds to:
V1 – Synthetic templates + randomized layouts
It is used to:
- evaluate the impact of layout variability
- compare against:
  - V0 (fixed templates)
  - later stages with real data (V2, V3)
- measure improvements in generalization
Intended uses
- Research on layout robustness in text-only NLP models, without an explicit layout model
- Benchmarking robustness to structural variation
- Intermediate baseline for synthetic data pipelines
- Czech invoice information extraction
Limitations
- Still trained only on synthetic data
- No exposure to real-world noise (OCR errors, distortions)
- Layout variation is artificial and may not fully reflect real documents
- Does not leverage explicit spatial features (pure BERT)
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 2
- seed: 42
- optimizer: AdamW (OptimizerNames.ADAMW_TORCH_FUSED) with betas=(0.9, 0.999) and epsilon=1e-08, no additional optimizer arguments
- lr_scheduler_type: linear
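- lr_scheduler_warmup_steps: 0.1 (a fractional value, so most likely a warmup ratio of 10% of the training steps rather than a step count)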
- num_epochs: 10
- mixed_precision_training: Native AMP
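To make the schedule concrete, here is a small sketch of a linear schedule with warmup using the values above. It assumes the warmup value of 0.1 is a fraction of the 650 total training steps (10 epochs of 65 steps, per the results table). The function `linear_lr` is illustrative and is not the trainer's own implementation.

```python
def linear_lr(
    step: int,
    total_steps: int = 650,        # 10 epochs x 65 steps per epoch
    warmup_fraction: float = 0.1,  # assumption: 0.1 is a ratio, not a step count
    peak_lr: float = 1e-5,
) -> float:
    """Learning rate at a given optimizer step for linear warmup, then linear decay."""
    warmup_steps = round(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Linear decay from the peak down to 0 at the last step.
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


for step in (0, 65, 325, 650):
    print(step, linear_lr(step))
```

Under this assumption the learning rate peaks at the end of the first epoch (step 65) and decays to zero by step 650.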
Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| No log | 1.0 | 65 | 0.2059 | 0.6571 | 0.6781 | 0.6674 | 0.9533 |
| No log | 2.0 | 130 | 0.2292 | 0.6598 | 0.7313 | 0.6937 | 0.9534 |
| No log | 3.0 | 195 | 0.2172 | 0.6789 | 0.6913 | 0.6850 | 0.9565 |
| No log | 4.0 | 260 | 0.2435 | 0.6385 | 0.7565 | 0.6925 | 0.9498 |
| No log | 5.0 | 325 | 0.2525 | 0.6347 | 0.7550 | 0.6896 | 0.9489 |
| No log | 6.0 | 390 | 0.2723 | 0.5994 | 0.7270 | 0.6571 | 0.9444 |
| No log | 7.0 | 455 | 0.2907 | 0.5963 | 0.7429 | 0.6616 | 0.9432 |
| 0.0306 | 8.0 | 520 | 0.2810 | 0.6146 | 0.7270 | 0.6661 | 0.9463 |
| 0.0306 | 9.0 | 585 | 0.2853 | 0.6059 | 0.7208 | 0.6584 | 0.9455 |
| 0.0306 | 10.0 | 650 | 0.2859 | 0.6054 | 0.7239 | 0.6594 | 0.9452 |
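The table shows validation loss at its lowest after epoch 1 and rising afterwards, while F1 peaks at epoch 2. The sketch below selects those epochs from the table values. The headline metrics at the top of the card are close to, but not identical to, the epoch 2 row, and the card does not say which checkpoint was published.

```python
# (epoch, validation_loss, f1) copied from the training results table.
results = [
    (1, 0.2059, 0.6674),
    (2, 0.2292, 0.6937),
    (3, 0.2172, 0.6850),
    (4, 0.2435, 0.6925),
    (5, 0.2525, 0.6896),
    (6, 0.2723, 0.6571),
    (7, 0.2907, 0.6616),
    (8, 0.2810, 0.6661),
    (9, 0.2853, 0.6584),
    (10, 0.2859, 0.6594),
]

# Epoch with the highest F1 and epoch with the lowest validation loss.
best_f1_epoch = max(results, key=lambda row: row[2])[0]
lowest_loss_epoch = min(results, key=lambda row: row[1])[0]

print(best_f1_epoch, lowest_loss_epoch)  # 2 1
```

Rising validation loss alongside a low training loss (0.0306) is consistent with overfitting to the synthetic data in later epochs, which would argue for selecting an early checkpoint.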
Framework versions
- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2