TinyMemoryLM (Haiku)

⚠️ IMPORTANT NOTICE

  1. The model is really dumb. This is a sub-1M parameter research model designed for experimentation, not production use.
  2. Do not expect it to answer any questions. It is prone to repetition, hallucination, and format collapse.

Overview

TinyMemoryLM is an ultra-lightweight language model built for architectural experimentation at the extreme low end of the parameter scale. Despite its small footprint, it incorporates several training innovations aimed at stabilizing tiny-model convergence, including hybrid tokenization, loss-boosting strategies, and context-aware relevance modeling.

This release includes both Pretrained Weights (base language modeling) and Instruction Weights (fine-tuned for chat/completion).

Files Provided

| File | Description |
|------|-------------|
| tokenizer.json | Hybrid word/character tokenizer vocabulary (2,133 tokens). |
| pretrain.pt | Base pretrained checkpoint (language modeling). |
| model.pt | Instruction-tuned checkpoint (SFT/chat). |
| samples.jsonl | Sample generations with NLL/PPL metrics at checkpoints. |
| loss_curve.png | Training loss progression across all phases. |
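The hybrid word/character vocabulary in tokenizer.json can be illustrated with a greedy longest-match encoder: prefer multi-character word tokens, fall back to single characters. This is a minimal sketch with a hypothetical vocabulary; the actual tokenizer.json schema and matching rules are not documented here.

```python
# Hypothetical word-token vocabulary for illustration only; the real
# tokenizer has ~2,133 entries mixing words and characters.
WORD_VOCAB = {"the", "tiny", "model"}

def encode(text: str) -> list[str]:
    """Greedy longest-match: prefer word tokens, fall back to characters."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        # Try the longest word-token match starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in WORD_VOCAB:
                match = text[i:j]
                break
        if match:
            tokens.append(match)
            i += len(match)
        else:
            tokens.append(text[i])  # character-level fallback
            i += 1
    return tokens

print(encode("the tiny cat"))  # → ['the', ' ', 'tiny', ' ', 'c', 'a', 't']
```

Any string remains encodable because unseen words decompose into character tokens, which is what the word-token loss boost (below) compensates for.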

Model Specifications

| Parameter | Value |
|-----------|-------|
| Architecture | Transformer decoder (GQA) |
| Parameters | ~700K |
| Context Length | 2,048 tokens |
| Sliding Window | 512 tokens |
| Dimensions | d_model=128, unique_layers=8, logical_layers=16, heads=4, kv_heads=2, ffn=224 |
| Vocabulary | ~2,133 tokens (hybrid char + word) |
| Normalization | RMSNorm |
| Positional Encoding | Rotary embeddings (RoPE, 25% fraction) |
| Activation | SwiGLU |
| Multi-Token Prediction | Horizons at 2, 3, and 4 |
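The grouped-query attention layout in the table (4 query heads sharing 2 KV heads) can be sketched in a few lines of numpy. Weights here are random placeholders and RoPE is omitted; this shows only the head-grouping mechanics, not the model's actual implementation.

```python
import numpy as np

# Dims from the spec table: d_model=128, heads=4, kv_heads=2.
d_model, n_heads, n_kv_heads, seq = 128, 4, 2, 16
head_dim = d_model // n_heads            # 32
group = n_heads // n_kv_heads            # 2 query heads per KV head

rng = np.random.default_rng(0)
x = rng.standard_normal((seq, d_model))
Wq = rng.standard_normal((d_model, n_heads * head_dim)) * 0.02
# K/V projections are half the size of Q: this is the GQA saving.
Wk = rng.standard_normal((d_model, n_kv_heads * head_dim)) * 0.02
Wv = rng.standard_normal((d_model, n_kv_heads * head_dim)) * 0.02

q = (x @ Wq).reshape(seq, n_heads, head_dim)
kk = (x @ Wk).reshape(seq, n_kv_heads, head_dim)
vv = (x @ Wv).reshape(seq, n_kv_heads, head_dim)

# Each group of query heads attends with its shared KV head.
kk = np.repeat(kk, group, axis=1)        # (seq, 4, 32)
vv = np.repeat(vv, group, axis=1)

scores = np.einsum("qhd,khd->hqk", q, kk) / np.sqrt(head_dim)
causal = np.triu(np.ones((seq, seq), dtype=bool), k=1)
scores = np.where(causal, -1e9, scores)
probs = np.exp(scores - scores.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
out = np.einsum("hqk,khd->qhd", probs, vv).reshape(seq, d_model)
print(out.shape)  # (16, 128)
```

In a real decoder only the 2 KV heads are cached, halving KV-cache memory versus standard multi-head attention at these dimensions.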

Architecture Highlights

TinyMemoryLM implements several research-focused modifications to standard transformer architectures:

  • Weight-Tied Logical Layers: 8 unique transformer blocks are repeated to create 16 logical layers (every 3rd layer uses global attention vs. sliding window), drastically reducing parameter count.
  • Grouped-Query Attention (GQA): 4 attention heads share 2 KV heads, reducing KV cache and compute.
  • Sliding Window Attention: Local attention within 512-token windows, with periodic global layers for long-range context.
  • Multi-Token Prediction (MTP): Auxiliary prediction heads at horizons 2, 3, and 4 with dedicated adapters and norms, weighted at 0.3 during training.
  • Hybrid Tokenizer: Combines character-level fallback with frequent word tokens to balance compression and vocabulary size.
  • Word Token Loss Boosting: Upweights loss signals for multi-character tokens (3x) to prevent the model from ignoring them in favor of character-level spelling.
  • Response-Start Weighting: Prioritizes the first 20 tokens of assistant responses (3x weight) to improve prompt conditioning.
  • Embedding Scale: Learned scaling factor applied to token embeddings for improved training dynamics.
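The weight-tying and attention schedule above can be sketched as follows. The exact placement of global layers is an assumption (the card only says "every 3rd layer"), so the modulo below is illustrative.

```python
# 16 logical layers reuse 8 unique blocks; (assumed) every 3rd logical
# layer uses global attention instead of the 512-token sliding window.
UNIQUE_LAYERS, LOGICAL_LAYERS, WINDOW = 8, 16, 512

def layer_schedule():
    schedule = []
    for i in range(LOGICAL_LAYERS):
        block_id = i % UNIQUE_LAYERS  # weights shared across the two passes
        attn = "global" if (i + 1) % 3 == 0 else f"sliding({WINDOW})"
        schedule.append((i, block_id, attn))
    return schedule

for entry in layer_schedule():
    print(entry)
```

Because block_id wraps around after layer 7, the parameter count is that of 8 blocks while the forward pass has the depth of 16.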

Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Batch Size | 48 |
| Pretrain LR | 8e-4 (min 1e-5) |
| SFT LR | 2e-4 (min 1e-5) |
| Warmup | 300 steps |
| Weight Decay | 0.02 |
| Max Grad Norm | 1.0 |
| MTP Weight | 0.3 |
| Word Token Loss Boost | 3.0x |
| Response-Start Boost | 3.0x (first 20 tokens) |
| Checkpointing | Every 1,000 steps |
| Sampling | Every 5,000 steps |
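The three loss-shaping knobs in the table (word-token boost, response-start boost, MTP weight) compose into a single training objective. A minimal sketch, assuming boosts multiply where they overlap and MTP heads are averaged then scaled by 0.3 (the card does not specify either detail):

```python
import numpy as np

WORD_BOOST, RESP_BOOST, RESP_WINDOW, MTP_WEIGHT = 3.0, 3.0, 20, 0.3

def token_weights(is_word_token, resp_start_idx, n):
    """Per-token loss weights; overlapping boosts multiply (assumption)."""
    w = np.ones(n)
    w[is_word_token] *= WORD_BOOST                       # multi-char tokens
    w[resp_start_idx : resp_start_idx + RESP_WINDOW] *= RESP_BOOST
    return w

def total_loss(nll_main, nll_mtp_heads, weights):
    """Weighted main NLL plus 0.3x the summed auxiliary MTP-head losses."""
    main = (nll_main * weights).sum() / weights.sum()
    aux = sum(h.mean() for h in nll_mtp_heads)           # horizons 2, 3, 4
    return main + MTP_WEIGHT * aux

# Toy example: 5 tokens, response starts at index 1.
w = token_weights(np.array([True, False, True, False, False]), 1, 5)
print(w)  # [3. 3. 9. 3. 3.]
```

The intended effect: word tokens and the opening of each assistant response dominate the gradient, countering the model's tendency to fall back on cheap character-level predictions.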

Training Loss Curve

Training loss progression across pretrain and SFT phases:


Limitations & Expectations

Please manage your expectations when using TinyMemoryLM:

  • Repetition: Tiny models are prone to collapsing into repetitive token loops.
  • Knowledge: The model has limited world knowledge due to parameter constraints.
  • Usage: This model is intended for research, educational purposes, and architectural benchmarking. It is not suitable for assistant tasks or reliable information retrieval.

Generated for research purposes. Use responsibly.

