# Cohere Transcribe Q8 Cache-External CoreML

CoreML conversion of Cohere Transcribe 03-2026 with an INT8-quantized encoder and an FP16 cache-external decoder. This is the hybrid pairing used by FluidAudio for on-device inference.

For the pure FP16 variant, see [FluidInference/cohere-transcribe-cache-external-coreml](https://huggingface.co/FluidInference/cohere-transcribe-cache-external-coreml).
## Why this hybrid?

The encoder dominates compute (~65–70% of per-sample processing time at FP16). INT8 quantization of the encoder weights cuts encoder size ~4× (7.0 GB → 1.8 GB) and improves ANE utilisation with no measurable WER regression. The decoder is small (<300 MB) and sensitive to precision, so it stays FP16.
| Component | Precision | Size | Notes |
|---|---|---|---|
| Encoder | INT8 (per-channel weight) | 1.8 GB | Conformer, 24 blocks |
| Decoder | FP16 | 291 MB | Cache-external (Parakeet pattern), 8 layers |
| Tokenizer | – | 484 KB | SentencePiece (16,384 tokens) |
| Vocab | – | 332 KB | vocab.json id→piece map |
## Model Description

Same architecture as the FP16 cache-external variant:
- Cache-external decoder: KV cache managed in Swift/Python (not CoreML state)
- macOS 14+ / iOS 17+ compatible
- O(n) decode complexity with manual cache management
- Correct EOS token (token 3, not 151643)
The only change is the encoder weight precision; decoder I/O, masks, caches, and tokenizer are identical.
## Architecture

### Encoder (INT8)

- Input: Mel spectrogram `[1, 128, 3500]` (35 s at 10 ms/frame)
- Output: Hidden states `[1, 438, 1024]` (FP16)
- Quantization: per-channel INT8 weight quantization (activations FP16)
- Size: ~1.8 GB
### Decoder (Cache-External FP16)

- Pattern: Parakeet TDT with external KV cache
- Inputs:
  - `input_id`: `[1, 1]` – current token
  - `position_id`: `[1, 1]` – current position
  - `encoder_hidden_states`: `[1, 438, 1024]`
  - `cross_attention_mask`: `[1, 1, 1, 438]`
  - `attention_mask`: `[1, 1, 1, seq_len]` – grows each step
  - `k_cache_0..7`: `[1, 8, 108, 128]`
  - `v_cache_0..7`: `[1, 8, 108, 128]`
- Outputs (17 total):
  - `logits`: `[1, 16384]`
  - `k_cache_0_out..7_out`, `v_cache_0_out..7_out`: updated caches
- Size: ~291 MB
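As a quick reference, the input dictionary for one decode step can be sketched in Python. Shapes follow the I/O list above; the helper function itself is illustrative and not part of the repo's `example.py`.

```python
def decoder_input_shapes(seq_len: int) -> dict:
    """Shape map for one decode step; tensor names follow the decoder I/O list."""
    shapes = {
        "input_id": (1, 1),
        "position_id": (1, 1),
        "encoder_hidden_states": (1, 438, 1024),
        "cross_attention_mask": (1, 1, 1, 438),
        "attention_mask": (1, 1, 1, seq_len),  # grows each step
    }
    for i in range(8):  # 8 decoder layers, one K and one V cache per layer
        shapes[f"k_cache_{i}"] = (1, 8, 108, 128)
        shapes[f"v_cache_{i}"] = (1, 8, 108, 128)
    return shapes
```

Filling each shape with a zero-initialized array gives the exact feed dictionary the decoder expects at step 0.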
## Performance

Tested with FluidAudio's `fluidaudiocli cohere-mixed-benchmark` on FLEURS (full splits, INT8 encoder + FP16 cache-external decoder, ANE):
| Language | Samples | WER | CER | RTFx |
|---|---|---|---|---|
| en_us | 647 | 5.63% | 3.19% | 1.93× |
| fr_fr | 676 | 6.22% | 3.11% | 1.65× |

(More languages pending.)
LibriSpeech test-clean (FP16 reference, for parity cross-check):
| Metric | Value |
|---|---|
| WER (FP16) | 11.95% |
| Perfect transcriptions | 2 / 10 |
| Main errors | Punctuation differences |
INT8-encoder WER on LibriSpeech is within FP16 noise (validated by comparing encoder hidden states against the FP16 reference on the same audio).
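The parity check amounts to an element-wise comparison of flattened encoder hidden states. A minimal sketch, where the tolerance value is an assumption standing in for "within FP16 noise", not the repo's exact threshold:

```python
def max_abs_diff(a, b):
    """Element-wise max absolute difference between two flat float sequences."""
    return max(abs(x - y) for x, y in zip(a, b))

# FP16 carries ~3 decimal digits of precision, so a loose tolerance such as
# 1e-2 on unit-scale activations is a plausible stand-in for "FP16 noise".
def within_fp16_noise(int8_hidden, fp16_hidden, tol=1e-2):
    return max_abs_diff(int8_hidden, fp16_hidden) <= tol
```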
## Critical Fix: EOS Token

⚠️ **Important:** The EOS token is 3 (`<|endoftext|>`), not 151643!

```python
# WRONG (vocabulary only has 16384 tokens)
EOS_TOKEN = 151643  # Out of range!

# CORRECT
EOS_TOKEN = 3  # From model.generation_config.eos_token_id
```
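A cheap guard against this class of bug is to range-check the EOS id against the vocabulary size before decoding. `check_eos` below is a hypothetical helper, not part of the repo:

```python
VOCAB_SIZE = 16384  # matches vocab.json and the logits dimension [1, 16384]

def check_eos(eos_id: int, vocab_size: int = VOCAB_SIZE) -> int:
    """Raise early if the EOS id cannot appear in the logits at all."""
    if not 0 <= eos_id < vocab_size:
        raise ValueError(f"EOS id {eos_id} is outside the {vocab_size}-token vocab")
    return eos_id

check_eos(3)         # ok
# check_eos(151643)  # raises ValueError: the old id can never be emitted
```

With an out-of-range EOS the argmax over 16,384 logits can never match, so the loop would only stop at the step limit; failing fast is much easier to debug.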
## Usage

### Swift (FluidAudio)

```swift
import CoreML
import Foundation

let encoderURL = modelDir.appendingPathComponent("cohere_encoder.mlmodelc")
let decoderURL = modelDir.appendingPathComponent("cohere_decoder_cache_external.mlmodelc")

let loaded = try CohereFixedPipeline.loadModels(
    encoderURL: encoderURL,
    decoderURL: decoderURL,
    vocabDir: modelDir
)
let pipeline = try CohereFixedPipeline(models: loaded)
let result = try pipeline.transcribe(audio: samples, models: loaded)
print(result.text)
```
Full Swift implementation:

- `CohereFixedPipeline.swift` – pipeline orchestration + vocab loading
- `CohereDecoderState.swift` – KV cache management
- `CohereModelInference.swift` – decoder execution

Source: https://github.com/FluidInference/FluidAudio
### Python

Identical to the FP16 variant; swap `cohere_encoder.mlpackage` for the INT8 version:

```python
import coremltools as ct

encoder = ct.models.MLModel("cohere_encoder.mlpackage")                 # INT8
decoder = ct.models.MLModel("cohere_decoder_cache_external.mlpackage")  # FP16
```
See the FP16 variant's README for the complete Python loop (unchanged here).
## Cache Management (Parakeet Pattern)

The cache-external pattern manages the KV cache outside the CoreML model:

1. Initialize 16 cache arrays (8 layers × K/V) filled with zeros
2. Each decode step:
   - Pass the current token + 16 caches into the model
   - Model returns logits + 16 updated caches
   - Use the updated caches for the next step
3. The attention mask grows: `[1,1,1,1]` → `[1,1,1,2]` → … → `[1,1,1,108]`
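The steps above can be sketched as a plain-Python loop. `run_decoder` stands in for the CoreML decoder call (in practice it wraps `decoder.predict`), and nested lists stand in for `MLMultiArray` buffers:

```python
NUM_LAYERS, MAX_LEN, NUM_HEADS, HEAD_DIM = 8, 108, 8, 128

def zero_cache():
    # One [1, 8, 108, 128] cache tensor, zero-filled
    return [[[[0.0] * HEAD_DIM for _ in range(MAX_LEN)] for _ in range(NUM_HEADS)]]

def decode(run_decoder, start_token, eos_token=3, max_steps=MAX_LEN):
    # Step 1: 16 zero caches (8 layers x K/V)
    k = [zero_cache() for _ in range(NUM_LAYERS)]
    v = [zero_cache() for _ in range(NUM_LAYERS)]
    tokens, token = [], start_token
    for pos in range(max_steps):
        mask_len = pos + 1  # attention mask [1, 1, 1, mask_len] grows each step
        # Step 2: current token + caches in, next token + updated caches out
        token, k, v = run_decoder(token, pos, mask_len, k, v)
        if token == eos_token:
            break
        tokens.append(token)
    return tokens
```

A dummy `run_decoder` that emits a fixed sequence is enough to exercise the cache plumbing before wiring in the real model.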
## Supported Languages

14 languages: English, French, German, Spanish, Italian, Portuguese, Dutch, Polish, Greek, Arabic, Japanese, Chinese, Korean, Vietnamese.
## Files

```
cohere-transcribe-q8-cache-external-coreml/
├── cohere_encoder.mlmodelc                  # 1.8 GB – INT8 encoder (compiled)
├── cohere_encoder.mlpackage                 # 1.8 GB – INT8 encoder (source)
├── cohere_decoder_cache_external.mlmodelc   # 291 MB – FP16 decoder (compiled)
├── cohere_decoder_cache_external.mlpackage  # 291 MB – FP16 decoder (source)
├── tokenizer.model                          # SentencePiece tokenizer
├── vocab.json                               # id→piece map (16,384 entries)
├── example.py                               # Python usage example
├── requirements.txt                         # Python deps for example.py
├── wer_results_cache_external.json          # Reference WER data
└── README.md                                # This file
```
## Compilation

The `.mlmodelc` variants are already compiled for fast runtime loading. If you only need the source packages, download just the `.mlpackage` directories:

```shell
huggingface-cli download FluidInference/cohere-transcribe-q8-cache-external-coreml \
  --include "*.mlpackage/**" "tokenizer.model" "vocab.json"
```

To recompile `.mlpackage` → `.mlmodelc`:

```shell
xcrun coremlcompiler compile cohere_encoder.mlpackage output/
xcrun coremlcompiler compile cohere_decoder_cache_external.mlpackage output/
```
## Quantization Notes

- Scheme: per-channel INT8 weight quantization, FP16 activations
- Toolchain: `coremltools.optimize.coreml.linear_quantize_weights` (applied via `tools/quantize_to_int8.py` in the conversion pipeline)
- Validation: hidden-state parity check against the FP16 reference on 10 LibriSpeech test-clean samples; max abs diff within FP16 noise
- ANE residency: profiled with `coreml-cli`; no CPU fallback
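For intuition, per-channel symmetric INT8 quantization picks one scale per output channel so that each channel's largest weight maps to ±127. The pure-Python sketch below mirrors that scheme; coremltools does the real work during conversion:

```python
def quantize_per_channel(weight_rows):
    """weight_rows: one list of floats per output channel.
    Returns (INT8-range rows, per-channel FP scales)."""
    q_rows, scales = [], []
    for row in weight_rows:
        scale = max(abs(w) for w in row) / 127.0 or 1.0  # avoid a zero scale
        q_rows.append([round(w / scale) for w in row])   # values land in [-127, 127]
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    # Reconstruct approximate FP weights: one multiply per element
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Round-trip error per weight is bounded by half a scale step, which is why channels with well-behaved weight ranges survive INT8 with negligible accuracy loss.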
## Comparison

| Variant | Encoder | Decoder | Encoder size | Notes |
|---|---|---|---|---|
| cohere-transcribe-cache-external-coreml | FP16 | FP16 | 7.0 GB | Reference |
| cohere-transcribe-q8-cache-external-coreml (this) | INT8 | FP16 | 1.8 GB | Production hybrid |
## Citation

```bibtex
@misc{cohere-transcribe-q8-cache-external-coreml,
  title={Cohere Transcribe Q8 Cache-External CoreML},
  author={FluidInference},
  year={2026},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/FluidInference/cohere-transcribe-q8-cache-external-coreml}},
  note={CoreML conversion with INT8 encoder + FP16 cache-external decoder (Parakeet pattern).}
}
```
## License
CC-BY-NC-4.0 (matches original Cohere Transcribe model).
## Links
- Original model: https://huggingface.co/CohereLabs/cohere-transcribe-03-2026
- FP16 sibling: https://huggingface.co/FluidInference/cohere-transcribe-cache-external-coreml
- Source code: https://github.com/FluidInference/FluidAudio
- Conversion scripts: https://github.com/FluidInference/mobius