BERTInvoiceCzechR (V1 – Synthetic + Random Layout)

This model is a fine-tuned version of google-bert/bert-base-multilingual-cased for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:

  • Loss: 0.2295
  • Precision: 0.6594
  • Recall: 0.7309
  • F1: 0.6933
  • Accuracy: 0.9534
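
The reported F1 is the harmonic mean of precision and recall, so it can be re-derived from the two values above. A minimal check, where the helper name `f1_score` is illustrative and not taken from the training code:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the evaluation set
reported_f1 = f1_score(0.6594, 0.7309)
print(round(reported_f1, 4))  # 0.6933
```

Accuracy (0.9534) is much higher than F1. Such a gap is common when most tokens carry no field label, although the card does not state the label distribution.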

Model description

BERTInvoiceCzechR (V1) extends the baseline model (V0) by introducing layout variability into the training data.

The model performs token-level classification to extract structured invoice fields such as:

  • supplier
  • customer
  • invoice number
  • bank details
  • totals
  • dates

Compared to V0, this version is trained on synthetically generated invoices with randomized layouts, improving robustness to positional and structural variations.
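
Token-level predictions still have to be merged into field values. The sketch below shows one common way to do that, assuming a BIO tagging scheme. The card does not publish the label set, so the scheme, the label names (`INVOICE_NUMBER`, `SUPPLIER`), and the helper `group_bio` are all hypothetical.

```python
from typing import List, Optional, Tuple


def group_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive B-/I- tagged tokens into (field, text) pairs."""
    fields: List[Tuple[str, str]] = []
    current_label: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the open span, if any, and reset the state
        nonlocal current_label, current_tokens
        if current_label is not None:
            fields.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span
            flush()
    flush()
    return fields


# Hypothetical tokens and tags for a short invoice fragment
example_tokens = ["Faktura", "č.", "2024", "-", "001", "Dodavatel", ":", "ACME", "s.r.o."]
example_tags = ["O", "O", "B-INVOICE_NUMBER", "I-INVOICE_NUMBER", "I-INVOICE_NUMBER",
                "O", "O", "B-SUPPLIER", "I-SUPPLIER"]
example_fields = group_bio(example_tokens, example_tags)
print(example_fields)
```

In this sketch an `I-` tag that does not continue an open span is treated as outside any field; a real post-processing step might prefer to start a new span instead.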


Training data

The dataset consists of:

  • synthetically generated invoices based on templates
  • additional variants with randomized layout structures

Key properties:

  • variable positioning of fields
  • layout perturbations (shifts, spacing, ordering)
  • preserved semantic correctness of labels
  • still fully synthetic (no real invoices)

This dataset introduces layout diversity, which is critical for generalization in document understanding tasks.
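
The properties above (variable ordering and spacing with labels kept intact) can be illustrated with a small sketch. The card does not describe the actual generator, so the block structure, the label names, and the function `randomize_layout` are assumptions made for illustration only.

```python
import random
from typing import List, Tuple

Token = Tuple[str, str]   # (text, label)
Block = List[Token]       # one logical invoice region, e.g. the supplier block


def randomize_layout(blocks: List[Block], seed: int) -> List[Token]:
    """Return a flat token sequence with the blocks in random order.

    Tokens inside a block stay together and keep their labels, so the
    annotation remains correct while field positions change.
    """
    rng = random.Random(seed)
    shuffled = blocks[:]          # copy so the input list is not mutated
    rng.shuffle(shuffled)

    sequence: List[Token] = []
    for block in shuffled:
        # Random spacing: zero to two unlabeled newline tokens before each block
        sequence.extend([("\n", "O")] * rng.randint(0, 2))
        sequence.extend(block)
    return sequence


# Hypothetical template with three labeled regions
template_blocks = [
    [("Dodavatel", "O"), ("ACME", "B-SUPPLIER")],
    [("Faktura", "O"), ("2024001", "B-INVOICE_NUMBER")],
    [("Celkem", "O"), ("1210", "B-TOTAL")],
]
variant = randomize_layout(template_blocks, seed=42)
```

Calling the function with different seeds yields different orderings of the same labeled tokens, which is the kind of variation a text-only model has to absorb without positional features.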


Role in the pipeline

This model corresponds to:

V1 – Synthetic templates + randomized layouts

It is used to:

  • evaluate the impact of layout variability
  • compare against:
    • V0 (fixed templates)
    • later stages with real data (V2, V3)
  • measure improvements in generalization

Intended uses

  • Research on layout robustness in text-only NLP models, without an explicit layout model
  • Benchmarking robustness to structural variation
  • Intermediate baseline for synthetic data pipelines
  • Czech invoice information extraction

Limitations

  • Still trained only on synthetic data
  • No exposure to real-world noise (OCR errors, distortions)
  • Layout variation is artificial and may not fully reflect real documents
  • Does not use explicit spatial features such as token coordinates (text-only BERT)

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 16
  • eval_batch_size: 2
  • seed: 42
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH_FUSED) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 0.1
  • num_epochs: 10
  • mixed_precision_training: Native AMP
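
To make the schedule concrete, the sketch below computes the learning rate per step for a linear schedule with warmup. It assumes that the warmup value of 0.1 is a fraction of the 650 total steps shown in the training results (that is, 65 warmup steps); the card does not confirm this reading. The function `linear_lr` is illustrative and is not the Trainer's own code.

```python
def linear_lr(step: int, base_lr: float = 1e-5,
              total_steps: int = 650, warmup_fraction: float = 0.1) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    # Assumption: the warmup setting is a fraction of total steps (65 steps here)
    warmup_steps = round(total_steps * warmup_fraction)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the end of selected epochs (65 steps per epoch)
for step in (0, 65, 325, 650):
    print(step, linear_lr(step))
```

Under this assumption the peak rate of 1e-05 is reached at the end of the first epoch and decays to zero by step 650.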

Training results

Training Loss   Epoch   Step   Validation Loss   Precision   Recall   F1       Accuracy
No log          1.0     65     0.2059            0.6571      0.6781   0.6674   0.9533
No log          2.0     130    0.2292            0.6598      0.7313   0.6937   0.9534
No log          3.0     195    0.2172            0.6789      0.6913   0.6850   0.9565
No log          4.0     260    0.2435            0.6385      0.7565   0.6925   0.9498
No log          5.0     325    0.2525            0.6347      0.7550   0.6896   0.9489
No log          6.0     390    0.2723            0.5994      0.7270   0.6571   0.9444
No log          7.0     455    0.2907            0.5963      0.7429   0.6616   0.9432
0.0306          8.0     520    0.2810            0.6146      0.7270   0.6661   0.9463
0.0306          9.0     585    0.2853            0.6059      0.7208   0.6584   0.9455
0.0306          10.0    650    0.2859            0.6054      0.7239   0.6594   0.9452
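
The per-epoch numbers can be scanned programmatically to see where validation performance peaks. The sketch below copies the rows from the training results and picks the best epoch by F1 and by validation loss. It is a reading aid only; the card does not say how the reported checkpoint was selected.

```python
# (epoch, validation_loss, precision, recall, f1, accuracy) from the training results
results = [
    (1, 0.2059, 0.6571, 0.6781, 0.6674, 0.9533),
    (2, 0.2292, 0.6598, 0.7313, 0.6937, 0.9534),
    (3, 0.2172, 0.6789, 0.6913, 0.6850, 0.9565),
    (4, 0.2435, 0.6385, 0.7565, 0.6925, 0.9498),
    (5, 0.2525, 0.6347, 0.7550, 0.6896, 0.9489),
    (6, 0.2723, 0.5994, 0.7270, 0.6571, 0.9444),
    (7, 0.2907, 0.5963, 0.7429, 0.6616, 0.9432),
    (8, 0.2810, 0.6146, 0.7270, 0.6661, 0.9463),
    (9, 0.2853, 0.6059, 0.7208, 0.6584, 0.9455),
    (10, 0.2859, 0.6054, 0.7239, 0.6594, 0.9452),
]

best_by_f1 = max(results, key=lambda row: row[4])
best_by_loss = min(results, key=lambda row: row[1])
print("best F1:", best_by_f1[0], "lowest validation loss:", best_by_loss[0])
```

F1 peaks at epoch 2 and validation loss is lowest at epoch 1, while validation loss climbs from 0.2059 to 0.2859 by epoch 10. Together with the low logged training loss (0.0306), this pattern is consistent with overfitting in the later epochs. The headline metrics at the top of the card are close to, but not identical with, the epoch 2 row.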

Framework versions

  • Transformers 5.0.0
  • PyTorch 2.10.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.2
