Exciting week: 2 new research projects and 2 new tools!
▶ Mythic-RDT - OpenMythos blueprint with a retrofit-recurrence fine-tune
https://github.com/mann1x/Mythic-RDT
▶ cross-tokenizer-distill (CTD) - knowledge distillation across different tokenizer vocabularies
https://github.com/mann1x/cross-tokenizer-distill
For Mythic-RDT, I have chosen the pretty outdated DS-Coder-V2 16B.
It's small enough to fit in 48GB of VRAM, but once I leaned on KL for the depth-recurrence fine-tune (I couldn't get above parity with T=1 when running at T=4, not great for 4x the inference time), I started investigating the KL recipe and questioning the teacher: the same DS-Coder-V2, just at BF16.
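For context, the objective I mean is roughly the following (a minimal PyTorch sketch of KL distillation with depth recurrence; the embed/block/lm_head split and the recurrence loop are illustrative assumptions, not the actual Mythic-RDT code):

```python
import torch
import torch.nn.functional as F

def kl_distill_loss(student_logits, teacher_logits, tau=1.0):
    """KL(teacher || student) over next-token distributions.
    tau is the softmax temperature; the tau**2 factor keeps the
    gradient scale comparable across temperatures."""
    s_logp = F.log_softmax(student_logits / tau, dim=-1)
    t_prob = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * tau ** 2

def train_step(student, teacher, input_ids, T=4):
    # BF16 teacher: a single plain forward pass, no recurrence.
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    # Student: reuse the same block T times (depth recurrence);
    # this is why T=4 costs roughly 4x at inference time.
    # embed/block/lm_head are a hypothetical module split.
    hidden = student.embed(input_ids)
    for _ in range(T):
        hidden = student.block(hidden)
    student_logits = student.lm_head(hidden)
    return kl_distill_loss(student_logits, teacher_logits)
```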
For a better teacher there would have been exactly one option, DS-Coder-V2-236B: not only so big that I'd need 4x H100 to run it, but also surpassed even by Qwen3-Coder-32B on HE/MBPP.
Hence the CTD tool: validated, but still in development while I look for a good recipe for the Qwen->DS distill.
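The core obstacle CTD has to deal with: with two different vocabularies you can't take a token-wise KL directly. A common first step (a generic sketch of the idea, not necessarily the exact recipe in the repo; the model ids are only illustrative) is to align the two tokenizations by character offsets and distill only at positions where both tokenizers close a span on the same character, i.e. where both models predict from the identical prefix:

```python
from transformers import AutoTokenizer

def aligned_positions(text, tok_a, tok_b):
    """Pairs (i, j) such that token i of tok_a and token j of tok_b
    end at the same character offset in `text`. Only there do the two
    next-token distributions condition on the same prefix, so a
    cross-vocabulary KL is meaningful at all."""
    enc_a = tok_a(text, return_offsets_mapping=True, add_special_tokens=False)
    enc_b = tok_b(text, return_offsets_mapping=True, add_special_tokens=False)
    ends_b = {end: j for j, (_, end) in enumerate(enc_b["offset_mapping"])}
    return [(i, ends_b[end])
            for i, (_, end) in enumerate(enc_a["offset_mapping"])
            if end in ends_b]

# Illustrative teacher/student pair (needs fast tokenizers for offsets):
teacher_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
student_tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-Coder-V2-Lite-Base")
pairs = aligned_positions("def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)",
                          teacher_tok, student_tok)
```

Even at aligned positions the two distributions still live in different vocab spaces, so the loss itself needs a further trick on top (e.g. matching sorted top-k probabilities, ULD-style).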
▶ Qwen3.5-4B-MicroCoder - code-leaning and reasoning merge of Qwen3.5-4B
ManniX-ITA/Qwen3.5-4B-MicroCoder
▶ Omnimergekit - merge toolkit, merge & quantization scripts, experiment logs
https://github.com/mann1x/omnimergekit
My merge toolkit and scripts live in this repo, so they don't get scattered across the HF model repos.
MicroCoder was an interesting experiment: there were only a couple of coding fine-tunes of the base (with broken reasoning) available to merge with the excellent instruct-reasoning JackRong-v2.
The result is not truly exciting, but it manages to lift LiveCodeBench above JR-v2, improve MBPP, and not completely break reasoning.
This is achieved with omnimergekit using differential signals: each source's delta vs the base model, derived from the gap between good and wrong answers across the sources (HE/MBPP/AIME).
The very long eval sessions showed that the method does not just bias the scores on those evals, but also improves others, even above the baseline.
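As a rough picture of what I mean by differential signals (a schematic sketch of the general idea, not omnimergekit's actual code; the softmax weighting is an assumption for illustration):

```python
import torch

def differential_merge(base_sd, source_sds, signals):
    """Merge task vectors (source minus base) with per-source weights.

    `signals[name]` is a scalar per source, e.g. its good-minus-wrong
    answer gap on HE/MBPP/AIME; softmax turns the gaps into weights,
    so sources that separate good from wrong answers more strongly
    contribute more of their delta."""
    names = list(source_sds)
    w = torch.softmax(torch.tensor([signals[n] for n in names]), dim=0)
    merged = {}
    for key, base_t in base_sd.items():
        delta = sum(wi * (source_sds[n][key] - base_t)
                    for wi, n in zip(w, names))
        merged[key] = base_t + delta
    return merged
```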