trl internal testing

company

AI & ML interests

Internal testing artifact management for the TRL library

qgallouedec

updated a model 3 days ago

trl-internal-testing/tiny-Qwen3ForCausalLM-Instruct-2507

Text Generation • 2.45M • Updated 3 days ago • 1.01k

qgallouedec

published a model 3 days ago

trl-internal-testing/tiny-Qwen3ForCausalLM-Instruct-2507

Text Generation • 2.45M • Updated 3 days ago • 1.01k

qgallouedec

posted an update 4 days ago

Post

1667

TRL v1.2 introduces the SSDTrainer 🚀

Simple Self-Distillation (SSD) from Apple's paper "Embarrassingly Simple Self-Distillation Improves Code Generation" is now available as an experimental trainer in TRL.

The recipe is as minimal as the name suggests: sample completions from the model itself at a training-time temperature, then fine-tune on those raw, unverified samples with plain cross-entropy. No reward model. No verifier. No teacher model. No reinforcement learning. Just prompts and the model.

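from trl.experimental.ssd import SSDConfig, SSDTrainer

# `dataset` is assumed to be a prompt dataset loaded beforehand (not shown in the post)
trainer = SSDTrainer(
    model="Qwen/Qwen3-4B-Instruct",
    # Sampling settings used when the model generates its own training completions
    args=SSDConfig(temperature=0.6, top_k=20, top_p=0.95),
    train_dataset=dataset,
)
trainer.train()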

v1.2 also ships expanded tool-calling support (LLaMA 3.1 / 3.2, DeepSeek-V3), another round of KTO ↔ DPO alignment getting us closer to promoting KTO to stable, a big GRPO simplification for overlong tool results, deprecation of use_transformers_paged, and key fixes for VLM response parsing.

Full release notes: https://github.com/huggingface/trl/releases/tag/v1.2.0

sergiopaniego

posted an update 5 days ago

Post

1017

Earlier this month, Apple introduced Simple Self-Distillation: a fine-tuning method that improves models on coding tasks just by sampling from the model and training on its own outputs with plain cross-entropy.

And… it's already supported in TRL, built by Kashif Rasul. You can really feel the pace of development in the team 🐎

Paper by Ruixiang ZHANG, He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang at Apple 🍎

How it works: the model generates completions at a training-time temperature (T_train) with top_k/top_p truncation, then fine-tunes on them with plain cross-entropy. No labels or verifier needed.
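
To make the sampling step concrete, here is a minimal standard-library sketch of temperature scaling followed by top-k and top-p truncation on a single vector of logits. It is an illustration only, not TRL's or the paper's implementation: the function name `truncated_sample` and the order of operations (top-k, renormalize, then top-p) are assumptions chosen for clarity.

```python
import math
import random


def truncated_sample(logits, temperature=0.6, top_k=20, top_p=0.95, rng=None):
    """Sample one token id after temperature scaling and top-k/top-p truncation.

    Hypothetical helper for illustration; not part of TRL.
    """
    rng = rng or random.Random()

    # Temperature scaling: divide logits by T, then apply a stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable token ids and renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept_mass = sum(probs[i] for i in order)
    renorm = {i: probs[i] / kept_mass for i in order}

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += renorm[i]
        if mass >= top_p:
            break

    # random.choices renormalizes the remaining weights internally.
    return rng.choices(kept, weights=[renorm[i] for i in kept], k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.5, 0.2, -1.0, -3.0]
    samples = [truncated_sample(toy_logits, rng=random.Random(s)) for s in range(5)]
    print(samples)
```

In the recipe described above, completions sampled this way (token by token, from the model itself) become the fine-tuning targets for an ordinary cross-entropy loss.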

You can try it right away with this ready-to-run example (Qwen3-4B on rStar-Coder):
https://github.com/huggingface/trl/blob/main/trl/experimental/ssd/ssd.py
or benchmark a checkpoint with the eval script:
https://github.com/huggingface/trl/blob/main/trl/experimental/ssd/ssd_eval.py

One neat insight from the paper: T_train and T_eval compose into an effective T_eff = T_train × T_eval, so a broad band of configs works well. Even very noisy samples still help.
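
As a quick arithmetic check of the relation quoted above, the sketch below computes T_eff for a few temperature pairs. The helper name `effective_temperature` and the example values are hypothetical and chosen only to show that different (T_train, T_eval) pairs can land on the same product; whether such pairs actually perform alike is an empirical question for the paper, not something this snippet shows.

```python
def effective_temperature(t_train: float, t_eval: float) -> float:
    """Return T_eff = T_train * T_eval, the relation quoted in the post."""
    return t_train * t_eval


if __name__ == "__main__":
    # Hypothetical (T_train, T_eval) pairs, not taken from the paper.
    for t_train, t_eval in [(0.6, 1.0), (1.0, 0.6), (1.2, 0.5)]:
        t_eff = effective_temperature(t_train, t_eval)
        print(f"T_train={t_train}, T_eval={t_eval}, T_eff={t_eff:.2f}")
```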

Want to dig deeper?

Paper: Embarrassingly Simple Self-Distillation Improves Code Generation (2604.01193)
Trainer docs: https://huggingface.co/docs/trl/main/en/ssd_trainer

qgallouedec

updated a model 6 days ago

trl-internal-testing/tiny-Phi3ForCausalLM-3.5

Text Generation • 514k • Updated 6 days ago • 2.01k

qgallouedec

published a model 6 days ago

trl-internal-testing/tiny-Phi3ForCausalLM-3.5

Text Generation • 514k • Updated 6 days ago • 2.01k

qgallouedec

updated a model 6 days ago

trl-internal-testing/tiny-Phi3ForCausalLM-3

Text Generation • 514k • Updated 6 days ago • 2.01k

qgallouedec

published a model 6 days ago

trl-internal-testing/tiny-Phi3ForCausalLM-3

Text Generation • 514k • Updated 6 days ago • 2.01k

qgallouedec

updated a model 6 days ago

trl-internal-testing/tiny-Phi3ForCausalLM

Text Generation • 514k • Updated 6 days ago • 148k

qgallouedec

in trl-internal-testing/tiny-Gemma4ForConditionalGeneration 10 days ago

Update chat_template.jinja

#1 opened 11 days ago by

qgallouedec

Update chat_template.jinja

#2 opened 11 days ago by

qgallouedec

updated a model 10 days ago

trl-internal-testing/tiny-Gemma4ForConditionalGeneration

Image-Text-to-Text • 13.9M • Updated 10 days ago • 57.7k

sergiopaniego

posted an update 11 days ago

Post

331

Great experience yesterday at PyTorch Conf Europe in Paris 🇫🇷

We (w/ @kashif ) talked about training LLMs through interaction, using trajectories across games, browsers, or simulators

Room was packed, a clear sign of interest in where RL post-training is heading.

sharing the slides! 🤓
https://drive.google.com/file/d/16k7YRnf5EJEo0XjXGlRJ_hVeLoFWKyNP/view?usp=sharing

qgallouedec

published a model 17 days ago

trl-internal-testing/tiny-Gemma4ForConditionalGeneration

Image-Text-to-Text • 13.9M • Updated 10 days ago • 57.7k

sergiopaniego

posted an update 18 days ago

Post

2767

Gemma 4 💎 is here and it’s strong!

to celebrate, we’re rolling out in TRL:

> support for multimodal tool responses for environments (OpenEnv)
> an example to train it in CARLA for autonomous driving with image-based tool calls

go check it out 🏎️🏎️

blog: https://huggingface.co/blog/gemma4
script: https://github.com/huggingface/trl/blob/main/examples/scripts/openenv/carla_vlm_gemma.py

sergiopaniego

posted an update 20 days ago

Post

2001

TRL is officially an adult 🥳

excited to announce TRL v1.0❗️

head to the blog to see how we got here and what’s next for this post-training library, designed to keep pace with the field

https://huggingface.co/blog/trl-v1

2 replies

qgallouedec

posted an update 20 days ago

Post

2310

TRL v1.0 is out!

Hugging Face's TRL library is downloaded 3 million times a month. Over 130k models trained with it are public on the Hub, and major projects like @unsloth and @axolotl-ai-co build directly on top of it. v1.0 is the moment we acknowledged that responsibility explicitly, with a real stability contract.

The field hasn't settled. Building stable software in a domain that keeps invalidating its own assumptions is the actual problem we're solving. The answer is a design that can absorb the next shift without breaking what people rely on.

What's in v1.0:
Deep Hugging Face integration, low infrastructure burden
What's next: asynchronous GRPO, better scaling support, and making training legible enough that agents can inspect and steer it.