dpo

This model is a PEFT adapter (Fatnaoui/dpo) fine-tuned from aubmindlab/aragpt2-base with Direct Preference Optimization (DPO) on an unknown dataset. It achieves the following results on the evaluation set:

  • Loss: 0.4430
  • Rewards/chosen: 5.6454
  • Rewards/rejected: 3.4725
  • Rewards/accuracies: 0.8448
  • Rewards/margins: 2.1729
  • Logps/rejected: -779.9817
  • Logps/chosen: -1153.5770
  • Logits/rejected: -3.0501
  • Logits/chosen: -3.3050
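
For readers unfamiliar with DPO's metrics: rewards/chosen and rewards/rejected are the policy's implicit rewards (the β-scaled log-probability ratios of the policy against the reference model on preferred and dispreferred responses), rewards/margins is their difference, and rewards/accuracies is the fraction of pairs where the chosen reward exceeds the rejected one. A minimal sketch of how these quantities relate (the per-pair loss shown is the standard DPO objective; the numbers plugged in below are the evaluation metrics from this card):

```python
import math

def dpo_loss(reward_chosen, reward_rejected):
    """Per-pair DPO loss: -log sigmoid(reward margin).

    reward_chosen / reward_rejected are the implicit rewards
    beta * log(pi_theta(y|x) / pi_ref(y|x)) for the preferred
    and dispreferred completions of one preference pair.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The reported metrics are internally consistent:
# rewards/margins = rewards/chosen - rewards/rejected
margin = 5.6454 - 3.4725
print(round(margin, 4))  # 2.1729, matching Rewards/margins above
```

A larger margin drives the per-pair loss toward zero, which is why the falling validation loss in the table below coincides with a growing rewards/margins.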

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 2
  • training_steps: 200
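
The card does not state which trainer produced this adapter; the following is a hypothetical reconstruction using TRL's `DPOTrainer` with the hyperparameters listed above. The dataset contents and the exact LoRA settings are assumptions (the real preference data is not documented), and keyword names may differ slightly across TRL versions:

```python
# Hypothetical training setup; only the hyperparameter values are taken
# from this card. Dataset contents and LoRA settings are placeholders.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("aubmindlab/aragpt2-base")
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/aragpt2-base")

# Placeholder preference pairs; the real dataset is unknown.
train_ds = Dataset.from_dict({
    "prompt": ["..."],
    "chosen": ["..."],
    "rejected": ["..."],
})

config = DPOConfig(
    output_dir="dpo",
    learning_rate=1e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size: 2 * 4 = 8
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=2,
    max_steps=200,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_ds,
    tokenizer=tokenizer,
    peft_config=LoraConfig(task_type="CAUSAL_LM"),  # trains a PEFT adapter
)
trainer.train()
```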

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6408 | 0.0769 | 10 | 0.5783 | 0.8112 | 0.4760 | 0.8448 | 0.3351 | -809.9464 | -1201.9193 | -3.1723 | -3.4517 |
| 0.4779 | 0.1538 | 20 | 0.5212 | 2.0793 | 1.2294 | 0.8276 | 0.8499 | -802.4123 | -1189.2380 | -3.1494 | -3.4178 |
| 0.4764 | 0.2308 | 30 | 0.5064 | 2.9292 | 1.7859 | 0.8362 | 1.1433 | -796.8478 | -1180.7388 | -3.1253 | -3.3872 |
| 0.4429 | 0.3077 | 40 | 0.4839 | 3.3147 | 2.0341 | 0.8276 | 1.2806 | -794.3660 | -1176.8838 | -3.1091 | -3.3693 |
| 0.4766 | 0.3846 | 50 | 0.5141 | 3.4676 | 2.1403 | 0.8190 | 1.3274 | -793.3040 | -1175.3546 | -3.0906 | -3.3490 |
| 0.4798 | 0.4615 | 60 | 0.5002 | 3.5407 | 2.1864 | 0.8276 | 1.3543 | -792.8427 | -1174.6235 | -3.0892 | -3.3485 |
| 0.4054 | 0.5385 | 70 | 0.4733 | 3.5696 | 2.2000 | 0.8362 | 1.3696 | -792.7064 | -1174.3348 | -3.0960 | -3.3586 |
| 0.377 | 0.6154 | 80 | 0.4556 | 3.9933 | 2.4739 | 0.8448 | 1.5194 | -789.9678 | -1170.0979 | -3.0895 | -3.3516 |
| 0.4159 | 0.6923 | 90 | 0.4460 | 4.4103 | 2.7327 | 0.8362 | 1.6777 | -787.3801 | -1165.9279 | -3.0808 | -3.3423 |
| 0.3655 | 0.7692 | 100 | 0.4507 | 4.7961 | 2.9496 | 0.8448 | 1.8465 | -785.2107 | -1162.0699 | -3.0704 | -3.3290 |
| 0.335 | 0.8462 | 110 | 0.4592 | 5.0963 | 3.1378 | 0.8534 | 1.9585 | -783.3284 | -1159.0679 | -3.0658 | -3.3242 |
| 0.3374 | 0.9231 | 120 | 0.4784 | 5.4616 | 3.3750 | 0.8534 | 2.0866 | -780.9568 | -1155.4149 | -3.0582 | -3.3136 |
| 0.2969 | 1.0 | 130 | 0.4803 | 5.5532 | 3.4306 | 0.8534 | 2.1226 | -780.4006 | -1154.4990 | -3.0565 | -3.3120 |
| 0.2832 | 1.0769 | 140 | 0.4859 | 5.6912 | 3.5236 | 0.8448 | 2.1675 | -779.4703 | -1153.1194 | -3.0532 | -3.3079 |
| 0.3746 | 1.1538 | 150 | 0.4890 | 5.8066 | 3.5976 | 0.8448 | 2.2090 | -778.7309 | -1151.9652 | -3.0512 | -3.3061 |
| 0.386 | 1.2308 | 160 | 0.4675 | 5.7611 | 3.5620 | 0.8448 | 2.1990 | -779.0862 | -1152.4202 | -3.0508 | -3.3057 |
| 0.2852 | 1.3077 | 170 | 0.4615 | 5.7631 | 3.5564 | 0.8448 | 2.2067 | -779.1427 | -1152.4001 | -3.0499 | -3.3041 |
| 0.3886 | 1.3846 | 180 | 0.4501 | 5.6984 | 3.5097 | 0.8448 | 2.1887 | -779.6100 | -1153.0469 | -3.0502 | -3.3049 |
| 0.368 | 1.4615 | 190 | 0.4441 | 5.6548 | 3.4791 | 0.8448 | 2.1757 | -779.9158 | -1153.4833 | -3.0502 | -3.3049 |
| 0.318 | 1.5385 | 200 | 0.4430 | 5.6454 | 3.4725 | 0.8448 | 2.1729 | -779.9817 | -1153.5770 | -3.0501 | -3.3050 |

Framework versions

  • PEFT 0.14.0
  • Transformers 4.45.2
  • Pytorch 2.3.1+cu121
  • Datasets 3.2.0
  • Tokenizers 0.20.3