---
license: mit
language:
- en
datasets:
- speechbrain/LoquaciousSet
base_model:
- zai-org/GLM-ASR-Nano-2512
- Qwen/Qwen3-0.6B
pipeline_tag: automatic-speech-recognition
tags:
- asr
- speech-recognition
- audio
- qwen
- glm-asr
library_name: transformers
---

# Tiny Audio

A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with [Tiny Audio](https://github.com/alexkroman/tiny-audio), a minimal, hackable ASR framework.

## Quick Start

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
result = pipe("audio.wav")
print(result["text"])
```

## Usage Examples

### Basic Transcription

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# From file
result = pipe("audio.wav")
print(result["text"])

# From URL
result = pipe("https://example.com/audio.mp3")

# From numpy array (must be 16kHz)
import numpy as np
audio = np.random.randn(16000).astype(np.float32)  # 1 second
result = pipe(audio)
```

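If your array is at a different sample rate, the standard Hugging Face ASR pipeline also accepts a dict with an explicit `sampling_rate` and resamples internally. Assuming this model's custom pipeline preserves that behavior:

```python
import numpy as np

# 1 second of noise at 44.1kHz; the pipeline resamples to 16kHz internally
audio_44k = np.random.randn(44100).astype(np.float32)
result = pipe({"raw": audio_44k, "sampling_rate": 44100})
print(result["text"])
```
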
### Batch Processing

```python
# Process multiple files
files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = pipe(files, batch_size=4)
for r in results:
    print(r["text"])
```

### Word-Level Timestamps

```python
result = pipe("audio.wav", return_timestamps="word")
# Returns:
# {
#   "text": "hello world",
#   "chunks": [
#     {"text": "hello", "timestamp": (0.0, 0.5)},
#     {"text": "world", "timestamp": (0.6, 1.0)}
#   ]
# }
```

### Streaming Inference

```python
import librosa

from tiny_audio import ASRModel, ASRProcessor

model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load and process audio
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Stream tokens
for token in model.generate_streaming(inputs["input_features"]):
    print(token, end="", flush=True)
```

### Using PyTorch Directly

```python
import librosa
import torch

from tiny_audio import ASRModel, ASRProcessor

# Load model and processor
model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load audio (16kHz)
audio, sr = librosa.load("audio.wav", sr=16000)

# Process
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate
with torch.no_grad():
    output = model.generate(
        input_features=inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256,
    )

# Decode
text = processor.batch_decode(output, skip_special_tokens=True)[0]
print(text)
```

### GPU Inference

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    device="cuda",  # or device=0
)
```

### Half Precision

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device="cuda",
)
```

## Architecture

```
Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
```

Only the projector is trained (~12M params). The encoder and decoder remain frozen, leveraging their pretrained knowledge.

| Component | Model | Parameters | Status |
|-----------|-------|------------|--------|
| Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
| Projector | 2-layer MLP | ~12M | Trained |
| Language Model | Qwen3-0.6B | ~600M | Frozen |

### How It Works

1. **Audio Encoder**: GLM-ASR converts 16kHz audio into frame-level embeddings (768-dim)
2. **Projector**: A 2-layer MLP with frame stacking bridges the audio and text embedding spaces
3. **Language Model**: Qwen3 generates text autoregressively, conditioned on the projected audio

The projector reduces sequence length 5x via frame stacking: `output_len = (input_len - 5) // 5 + 1`

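For intuition, here is a minimal sketch of such a projector, not the repo's actual code: the 768-dim input matches the encoder table above, while the 1024-dim output (Qwen3-0.6B's hidden size) and the hidden width are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FrameStackProjector(nn.Module):
    """Sketch: stack 5 consecutive encoder frames, then map them with a 2-layer MLP."""

    def __init__(self, enc_dim=768, llm_dim=1024, stack=5, hidden=2048):
        super().__init__()
        self.stack = stack
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim * stack, hidden),
            nn.GELU(),
            nn.Linear(hidden, llm_dim),
        )

    def forward(self, x):
        # x: (batch, frames, enc_dim) of encoder outputs
        b, t, d = x.shape
        t = (t // self.stack) * self.stack  # drop frames that don't fill a full stack
        x = x[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.mlp(x)  # (batch, (frames - 5) // 5 + 1, llm_dim)

proj = FrameStackProjector()
print(proj(torch.randn(1, 100, 768)).shape)  # torch.Size([1, 20, 1024])
```
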
## Model Specifications

| Specification | Value |
|---------------|-------|
| Input | Audio (16kHz mono) |
| Output | Text transcription |
| Max Audio Length | ~30 seconds (limited by encoder) |
| Vocabulary | Qwen3 tokenizer |
| Languages | English only |
| Generation | Greedy decoding (num_beams=1, do_sample=False) |

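Greedy decoding is the default. The stock Hugging Face ASR pipeline forwards `generate_kwargs` to `model.generate()`, so overriding the decoding strategy should look like this, assuming the custom pipeline passes them through:

```python
# Assumes generate_kwargs are forwarded to model.generate()
result = pipe("audio.wav", generate_kwargs={"num_beams": 4, "max_new_tokens": 256})
print(result["text"])
```
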
## Training Details

| | |
|---|---|
| **Dataset** | LoquaciousSet (25,000 hours) |
| **Hardware** | Single NVIDIA A40 |
| **Time** | ~24 hours |
| **Cost** | ~$12 |
| **Optimizer** | AdamW |
| **Learning Rate** | 1e-4 |
| **Batch Size** | 4 |
| **Steps** | 50,000 |

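In PyTorch terms the recipe reduces to optimizing only the projector. A sketch, assuming the projector is exposed as a `projector` attribute (the real attribute name may differ):

```python
import torch

# Freeze everything, then unfreeze only the projector (attribute name assumed)
for param in model.parameters():
    param.requires_grad = False
for param in model.projector.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(model.projector.parameters(), lr=1e-4)
```
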
## Limitations

- **English only**: Not trained on other languages
- **Sample rate**: Expects 16kHz audio; other rates are resampled automatically (see the sketch below for resampling manually)
- **Audio length**: Best for clips under 30 seconds
- **Accuracy**: May degrade on:
  - Heavily accented speech
  - Noisy or low-quality audio
  - Domain-specific terminology
  - Overlapping speakers
- **No punctuation**: Output is lowercase without punctuation by default

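If you prefer to resample yourself before calling the model, torchaudio (already in the requirements) handles it:

```python
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("audio.wav")     # (channels, samples) float32
if sr != 16000:
    waveform = F.resample(waveform, sr, 16000)  # resample to 16kHz
audio = waveform.mean(dim=0).numpy()            # downmix to mono
result = pipe(audio)
```
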
## Requirements

```
transformers>=4.40.0
torch>=2.0.0
torchaudio>=2.0.0
```

Optional for streaming:

```
librosa
soundfile
```

## Files

| File | Description |
|------|-------------|
| `config.json` | Model configuration |
| `model.safetensors` | Projector weights (~48MB) |
| `preprocessor_config.json` | Audio preprocessing config |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `special_tokens_map.json` | Special tokens |

Note: Only the projector weights are stored. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective Hugging Face repos.

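You can verify this by listing the tensor names stored in the checkpoint (assumes `model.safetensors` has been downloaded locally):

```python
from safetensors import safe_open

# Expect only projector parameter names here
with safe_open("model.safetensors", framework="pt") as f:
    for name in f.keys():
        print(name, f.get_tensor(name).shape)
```
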
## Citation

If you use this model, please cite:

```bibtex
@misc{tinyaudio2024,
  author = {Alex Kroman},
  title = {Tiny Audio: Minimal ASR Training},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/alexkroman/tiny-audio}
}
```

## Links

- [GitHub Repository](https://github.com/alexkroman/tiny-audio) - Train your own model
- [Free 3.5-hour Course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) - Learn ASR from scratch
- [Live Demo](https://huggingface.co/spaces/mazesmazes/tiny-audio) - Try it in your browser

## Acknowledgments

- [GLM-ASR](https://huggingface.co/zai-org/GLM-ASR-Nano-2512) for the audio encoder
- [Qwen3](https://huggingface.co/Qwen/Qwen3-0.6B) for the language model
- [LoquaciousSet](https://huggingface.co/datasets/speechbrain/LoquaciousSet) for training data

## License

MIT