TokSuite
community
AI & ML interests
Tokenization, Robustness, LLMs
Organization Card
TokSuite is a collection of models and benchmarks designed to isolate and study the impact of tokenization on language model behavior across English, Chinese, Turkish, Italian, and Farsi, as well as on STEM and mathematical text. It includes fourteen models that share the same architecture, training data, training budget, and initialization and differ only in their tokenizers, alongside a set of benchmarks that evaluate performance under real-world perturbations that affect tokenization.
Our code is available at https://github.com/r-three/Tokenizers.
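To make "perturbations that affect tokenization" concrete, here is a toy sketch of how a single dropped character can change a token sequence, and how the change depends on the vocabulary. Everything in it is an assumption for illustration: `greedy_tokenize`, `VOCAB_A`, and `VOCAB_B` are hypothetical, and greedy longest-match is a simple stand-in, not the algorithm used by any tokenizer in the suite.

```python
# Toy illustration only: the vocabularies below are hypothetical and are
# not taken from any TokSuite tokenizer.

def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation.

    Characters that start no vocabulary entry are emitted as
    single-character tokens.
    """
    tokens = []
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Two hypothetical vocabularies that split the same word differently.
VOCAB_A = {"token", "ization", "iz", "ation", " "}
VOCAB_B = {"tok", "en", "iza", "tion", " "}

clean = "tokenization"
perturbed = "tokenizaton"  # one character dropped, as in a typo

for name, vocab in [("A", VOCAB_A), ("B", VOCAB_B)]:
    print(name, greedy_tokenize(clean, vocab), greedy_tokenize(perturbed, vocab))
```

With these toy vocabularies the same typo fragments the word to a different degree under each one, which is the kind of tokenizer-dependent effect the benchmarks are described as measuring.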
models 22
toksuite/meta-llama-Llama-3.2-7B
Text Generation • 8B • Updated • 184
toksuite/meta-llama-Llama-3.2-300M
Text Generation • 0.6B • Updated • 158
toksuite/meta-llama-Llama-3.2-1B-seed_777_model_seed_222
Text Generation • 2B • Updated • 171
toksuite/meta-llama-Llama-3.2-1B-seed_777_model_seed_888
Text Generation • 2B • Updated • 149
toksuite/google-gemma-2-2b
Text Generation • 2B • Updated • 454
toksuite/meta-llama-Llama-3.2-1B
Text Generation • 2B • Updated • 11
toksuite/CohereLabs-aya-expanse-8b
Text Generation • 2B • Updated • 105
toksuite/tiktoken-gpt-4o
Text Generation • 2B • Updated • 16
toksuite/common-pile-comma-v0.1
Text Generation • 2B • Updated • 15
toksuite/microsoft-Phi-3-mini-4k-instruct
Text Generation • 1B • Updated • 23
datasets 10
toksuite/toksuite_pretraining_data
Viewer • Updated • 107M • 2.1k
toksuite/toksuite_chinese
Viewer • Updated • 485 • 467
toksuite/toksuite_turkish
Viewer • Updated • 621 • 252
toksuite/toksuite_farsi
Viewer • Updated • 747 • 259
toksuite/toksuite_math
Viewer • Updated • 189 • 203
toksuite/toksuite_english
Viewer • Updated • 1.14k • 429
toksuite/toksuite_italian
Viewer • Updated • 1.09k • 220
toksuite/toksuite_stem
Viewer • Updated • 613 • 271
toksuite/toksuite_general
Viewer • Updated • 68 • 47
toksuite/Qwen-Qwen3-8B-toksuite-detokenized
Viewer • Updated • 28M • 16