dpo

This model is a PEFT adapter (Fatnaoui/dpo) fine-tuned from aubmindlab/aragpt2-base with Direct Preference Optimization (DPO) on an unknown dataset. It achieves the following results on the evaluation set:

  • Loss: 0.4430
  • Rewards/chosen: 5.6454
  • Rewards/rejected: 3.4725
  • Rewards/accuracies: 0.8448
  • Rewards/margins: 2.1729
  • Logps/rejected: -779.9817
  • Logps/chosen: -1153.5770
  • Logits/rejected: -3.0501
  • Logits/chosen: -3.3050
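
For readers unfamiliar with DPO's metrics: rewards/chosen and rewards/rejected are the policy's implicit rewards (the β-scaled log-probability ratios of the policy against the reference model on preferred and dispreferred responses), rewards/margins is their difference, and rewards/accuracies is the fraction of pairs where the chosen reward exceeds the rejected one. A minimal sketch of how these quantities relate (the per-pair loss shown is the standard DPO objective; the numbers plugged in below are the evaluation metrics from this card):

```python
import math

def dpo_loss(reward_chosen, reward_rejected):
    """Per-pair DPO loss: -log sigmoid(reward margin).

    reward_chosen / reward_rejected are the implicit rewards
    beta * log(pi_theta(y|x) / pi_ref(y|x)) for the preferred
    and dispreferred completions of one preference pair.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The reported metrics are internally consistent:
# rewards/margins = rewards/chosen - rewards/rejected
margin = 5.6454 - 3.4725
print(round(margin, 4))  # 2.1729, matching Rewards/margins above
```

A larger margin drives the per-pair loss toward zero, which is why the falling validation loss in the table below coincides with a growing rewards/margins.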

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 2
  • training_steps: 200
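
The card does not state which trainer produced this adapter; the following is a hypothetical reconstruction using TRL's `DPOTrainer` with the hyperparameters listed above. The dataset contents and the exact LoRA settings are assumptions (the real preference data is not documented), and keyword names may differ slightly across TRL versions:

```python
# Hypothetical training setup; only the hyperparameter values are taken
# from this card. Dataset contents and LoRA settings are placeholders.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("aubmindlab/aragpt2-base")
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/aragpt2-base")

# Placeholder preference pairs; the real dataset is unknown.
train_ds = Dataset.from_dict({
    "prompt": ["..."],
    "chosen": ["..."],
    "rejected": ["..."],
})

config = DPOConfig(
    output_dir="dpo",
    learning_rate=1e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size: 2 * 4 = 8
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=2,
    max_steps=200,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_ds,
    tokenizer=tokenizer,
    peft_config=LoraConfig(task_type="CAUSAL_LM"),  # trains a PEFT adapter
)
trainer.train()
```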

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6408 | 0.0769 | 10 | 0.5783 | 0.8112 | 0.4760 | 0.8448 | 0.3351 | -809.9464 | -1201.9193 | -3.1723 | -3.4517 |
| 0.4779 | 0.1538 | 20 | 0.5212 | 2.0793 | 1.2294 | 0.8276 | 0.8499 | -802.4123 | -1189.2380 | -3.1494 | -3.4178 |
| 0.4764 | 0.2308 | 30 | 0.5064 | 2.9292 | 1.7859 | 0.8362 | 1.1433 | -796.8478 | -1180.7388 | -3.1253 | -3.3872 |
| 0.4429 | 0.3077 | 40 | 0.4839 | 3.3147 | 2.0341 | 0.8276 | 1.2806 | -794.3660 | -1176.8838 | -3.1091 | -3.3693 |
| 0.4766 | 0.3846 | 50 | 0.5141 | 3.4676 | 2.1403 | 0.8190 | 1.3274 | -793.3040 | -1175.3546 | -3.0906 | -3.3490 |
| 0.4798 | 0.4615 | 60 | 0.5002 | 3.5407 | 2.1864 | 0.8276 | 1.3543 | -792.8427 | -1174.6235 | -3.0892 | -3.3485 |
| 0.4054 | 0.5385 | 70 | 0.4733 | 3.5696 | 2.2000 | 0.8362 | 1.3696 | -792.7064 | -1174.3348 | -3.0960 | -3.3586 |
| 0.377 | 0.6154 | 80 | 0.4556 | 3.9933 | 2.4739 | 0.8448 | 1.5194 | -789.9678 | -1170.0979 | -3.0895 | -3.3516 |
| 0.4159 | 0.6923 | 90 | 0.4460 | 4.4103 | 2.7327 | 0.8362 | 1.6777 | -787.3801 | -1165.9279 | -3.0808 | -3.3423 |
| 0.3655 | 0.7692 | 100 | 0.4507 | 4.7961 | 2.9496 | 0.8448 | 1.8465 | -785.2107 | -1162.0699 | -3.0704 | -3.3290 |
| 0.335 | 0.8462 | 110 | 0.4592 | 5.0963 | 3.1378 | 0.8534 | 1.9585 | -783.3284 | -1159.0679 | -3.0658 | -3.3242 |
| 0.3374 | 0.9231 | 120 | 0.4784 | 5.4616 | 3.3750 | 0.8534 | 2.0866 | -780.9568 | -1155.4149 | -3.0582 | -3.3136 |
| 0.2969 | 1.0 | 130 | 0.4803 | 5.5532 | 3.4306 | 0.8534 | 2.1226 | -780.4006 | -1154.4990 | -3.0565 | -3.3120 |
| 0.2832 | 1.0769 | 140 | 0.4859 | 5.6912 | 3.5236 | 0.8448 | 2.1675 | -779.4703 | -1153.1194 | -3.0532 | -3.3079 |
| 0.3746 | 1.1538 | 150 | 0.4890 | 5.8066 | 3.5976 | 0.8448 | 2.2090 | -778.7309 | -1151.9652 | -3.0512 | -3.3061 |
| 0.386 | 1.2308 | 160 | 0.4675 | 5.7611 | 3.5620 | 0.8448 | 2.1990 | -779.0862 | -1152.4202 | -3.0508 | -3.3057 |
| 0.2852 | 1.3077 | 170 | 0.4615 | 5.7631 | 3.5564 | 0.8448 | 2.2067 | -779.1427 | -1152.4001 | -3.0499 | -3.3041 |
| 0.3886 | 1.3846 | 180 | 0.4501 | 5.6984 | 3.5097 | 0.8448 | 2.1887 | -779.6100 | -1153.0469 | -3.0502 | -3.3049 |
| 0.368 | 1.4615 | 190 | 0.4441 | 5.6548 | 3.4791 | 0.8448 | 2.1757 | -779.9158 | -1153.4833 | -3.0502 | -3.3049 |
| 0.318 | 1.5385 | 200 | 0.4430 | 5.6454 | 3.4725 | 0.8448 | 2.1729 | -779.9817 | -1153.5770 | -3.0501 | -3.3050 |

Framework versions

  • PEFT 0.14.0
  • Transformers 4.45.2
  • Pytorch 2.3.1+cu121
  • Datasets 3.2.0
  • Tokenizers 0.20.3