# Turn-Taking Model: End-of-Turn Detection for BabelCast

Research, benchmarks, and fine-tuning of end-of-turn detection models for simultaneous translation in Portuguese.
## Repository Structure

```
docs/turn-taking-study/
├── README.md                    # This document
├── melhorias_turn_detection.md  # Improvement plan + results of the 3 rounds
├── RESEARCH_LOG.md              # Research log
├── data/                        # Datasets (NURC-SP, CORAA, TTS), ~10GB
├── hf_cache/                    # HuggingFace cache
└── previous-experiments/
    ├── 01-benchmarks/           # Benchmark of 5 models on Portuguese
    │   ├── benchmark_*.py       # Benchmark scripts (Silence, Silero, VAP, Pipecat, LiveKit)
    │   ├── setup_*.py           # Dataset setup scripts
    │   └── report/              # Generated report (markdown + LaTeX + charts)
    ├── 02-finetune-scratch/     # Fine-tuning from scratch (3 rounds)
    │   ├── finetune_smart_turn_v3.py  # Main script (Whisper Tiny + Focal Loss)
    │   ├── modal_finetune.py    # Modal deployment
    │   ├── results/             # Round 1: Whisper Base + BCE (F1=0.796)
    │   ├── results-tiny/        # Round 2: Whisper Tiny + BCE (F1=0.788)
    │   ├── results-focal/       # Round 3: Whisper Tiny + Focal Loss (F1=0.798)
    │   └── checkpoints/         # v1/v2 checkpoints
    └── 03-finetune-pipecat-pt/  # NEW: fine-tune from the pretrained Pipecat model
        └── README.md            # Full experiment documentation
```
## Experiment Summary

### 01: Benchmarks (5 models on Portuguese)

Comparison of existing models on real Portuguese audio (NURC-SP, 77 min).

### 02: Fine-tuning from scratch (3 rounds)

We trained a Whisper Tiny encoder + classifier from scratch on 15K Portuguese samples (CORAA + MUPE). Best result: F1=0.798, precision 83% @ threshold=0.65. Details in melhorias_turn_detection.md.
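Round 3's focal loss can be sketched as follows. This is the standard binary focal loss formulation (Lin et al., 2017); the `alpha`/`gamma` values here are common defaults, not values confirmed from finetune_smart_turn_v3.py:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples so training
    focuses on hard, misclassified ones."""
    # Per-sample BCE, kept unreduced so we can reweight it
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                      # model's probability for the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

logits = torch.zeros(2)                        # maximally uncertain predictions
targets = torch.tensor([1.0, 0.0])
print(float(focal_loss(logits, targets)))      # well below plain BCE (~0.693)
```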
### 03: Fine-tuning from the pretrained Pipecat model (next)

Fine-tune Pipecat's pretrained model (270K samples, 23 languages) specifically for Portuguese + French speakers speaking Portuguese. Uses LLMs (Claude) to create high-quality labels and TTS to generate audio. Details in 03-finetune-pipecat-pt/README.md.
## Benchmark Results (Experiment 01)

Comparative evaluation of turn-taking prediction models for real-time conversational AI, with a focus on Portuguese language performance.

### Models Evaluated
| Model | Type | Size | GPU | ASR | Portuguese Support |
|---|---|---|---|---|---|
| Silence Threshold (300/500/700ms) | Rule-based | 0 | No | No | Language-independent |
| Silero VAD | Audio DNN | 2MB | No | No | Language-independent |
| VAP | Audio Transformer (CPC) | 20MB | Optional | No | Trained on English only |
| Pipecat Smart Turn v3.1 | Audio Transformer (Whisper) | 8MB | No | No | Included in 23 languages |
| LiveKit EOT | Text Transformer (Qwen2.5) | 281MB | No | Yes | English only |
### Key Results: Portuguese

#### Real Portuguese Speech (NURC-SP corpus, 77 min, 15 dialogues)

End-of-utterance detection accuracy (is the speaker done talking?):
| Model | Detects speaker stopped | False alarm rate | Overall accuracy |
|---|---|---|---|
| Pipecat Smart Turn v3.1 (original) | 84.9% | 54.9% | 68.6% |
| Pipecat Smart Turn v3.1 (fine-tuned PT) | 98.4% | 73.8% | 68.5% |
| Silero VAD | ~95%+ | ~5% | ~95% |
Conclusion: Silero VAD remains the most robust approach for detecting when a speaker stops talking in Portuguese. Smart Turn's Whisper-based approach adds linguistic intelligence but suffers from high false alarm rates on Portuguese, even after fine-tuning.
#### Turn-taking benchmark (Edge TTS, 10 dialogues, 6.4 min)
| Rank | Model | Macro-F1 | Balanced Acc | Latency p50 | False Int. |
|---|---|---|---|---|---|
| 1 | Pipecat Smart Turn v3.1 | 0.639 | 0.639 | 18.3ms | 22.8% |
| 2 | Silence 700ms | 0.566 | 0.573 | 0.1ms | 18.1% |
| 3 | Silero VAD | 0.401 | 0.500 | 9.0ms | 100.0% |
| 4 | VAP | 0.000 | 0.000 | – | – (needs stereo) |
## Pipecat Smart Turn: Model Documentation

### Overview
Smart Turn is an open-source end-of-turn detection model created by Daily (daily.co), the company behind the Pipecat voice AI framework. It predicts whether a speaker has finished their turn ("complete") or is still talking ("incomplete") using only audio input.
No academic paper exists. The model is documented through blog posts and GitHub only.
### Architecture

```
Input: 16kHz mono PCM audio (up to 8 seconds)
        │
        ▼
Whisper Feature Extractor → log-mel spectrogram (80 bins × 800 frames)
        │
        ▼
Whisper Tiny Encoder (pretrained, openai/whisper-tiny)
        │  output: (batch, 400, 384), i.e. 400 frames, 384-dim hidden state
        ▼
Attention Pooling: Linear(384→256) → Tanh → Linear(256→1)
        │  learns which audio frames matter most for the decision
        ▼  weighted sum → (batch, 384)
Classifier MLP:
  Linear(384→256) → LayerNorm → GELU → Dropout(0.1)
  → Linear(256→64) → GELU → Linear(64→1)
        │
        ▼
Sigmoid → probability in [0, 1]
  > 0.5 = "Complete"   (speaker finished)
  ≤ 0.5 = "Incomplete" (speaker still talking)
```

Total parameters: ~8M. Model size: 8MB (int8 ONNX) / 32MB (fp32 ONNX).
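The pooling-plus-classifier head in the diagram can be sketched in PyTorch. This is an illustrative reimplementation following the layer sizes above, not the official Pipecat code:

```python
import torch
import torch.nn as nn

class SmartTurnHead(nn.Module):
    """Attention pooling over Whisper encoder frames + MLP classifier,
    with the dimensions shown in the diagram above."""
    def __init__(self, hidden=384):
        super().__init__()
        # Score each frame, softmax over time, then take the weighted sum
        self.attn = nn.Sequential(nn.Linear(hidden, 256), nn.Tanh(), nn.Linear(256, 1))
        self.classifier = nn.Sequential(
            nn.Linear(hidden, 256), nn.LayerNorm(256), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(256, 64), nn.GELU(), nn.Linear(64, 1),
        )

    def forward(self, frames):                              # frames: (batch, 400, 384)
        weights = torch.softmax(self.attn(frames), dim=1)   # (batch, 400, 1)
        pooled = (weights * frames).sum(dim=1)              # (batch, 384)
        return torch.sigmoid(self.classifier(pooled))       # P("complete")

head = SmartTurnHead().eval()
fake_encoder_out = torch.randn(2, 400, 384)   # stands in for Whisper Tiny output
probs = head(fake_encoder_out)
print(probs.shape)                            # torch.Size([2, 1])
```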
### Why Whisper Tiny?
The team evolved through several architectures:
| Version | Backbone | Size | Problem |
|---|---|---|---|
| v1 | wav2vec2-BERT | 2.3GB | Overfitted, too large |
| v2 | wav2vec2 + linear | 360MB | Still large |
| v3+ | Whisper Tiny encoder | 8MB | Good balance |
Whisper Tiny was chosen because:
- Pretrained on 680,000 hours of multilingual speech (99 languages)
- Encoder produces rich acoustic representations without needing the decoder
- Only 39M params in full Whisper Tiny; encoder alone is much smaller
- The attention pooling + MLP classifier adds minimal overhead
### Training Data
Dataset: pipecat-ai/smart-turn-data-v3.2-train on HuggingFace
Size: 270,946 samples (41.4 GB)
Languages: 23 (Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Marathi, Norwegian, Polish, Portuguese, Russian, Spanish, Turkish, Ukrainian, Vietnamese)
Data format per sample:

| Field | Type | Description |
|---|---|---|
| audio | Audio | 16kHz mono PCM, up to 16s |
| endpoint_bool | bool | True = complete, False = incomplete |
| language | string | ISO 639-3 code (e.g., "por") |
| midfiller | bool | Filler word mid-utterance ("um", "éh") |
| endfiller | bool | Filler word at end |
| synthetic | bool | TTS-generated vs. human |
| dataset | string | Source (12 different sources) |
Data generation pipeline:
- Text sources: 1.2M+ multilingual sentences from HuggingFace datasets
- Cleaning: Gemini 2.5 Flash filtered grammatically incorrect sentences (removed 50-80%)
- TTS: Google Chirp3 for synthetic audio generation
- Filler words: Language-specific lists (generated by Claude/GPT), inserted by Gemini Flash
- Human audio: Contributed by Liva AI, Midcentury, MundoAI
- Noise augmentation (v3.2): Background noise from CC-0 Freesound.org samples
- Target split: 50/50 complete vs incomplete
### Training Process

```python
# Hyperparameters (from train.py)
learning_rate = 5e-5
epochs = 4
train_batch_size = 384
eval_batch_size = 128
warmup_ratio = 0.2
weight_decay = 0.01
lr_scheduler = "cosine"
loss = BCEWithLogitsLoss(pos_weight=dynamic_per_batch)
```
- Hardware: Modal L4 GPU (or a local GPU)
- Training time: ~53-79 minutes, depending on GPU
- Framework: HuggingFace Transformers Trainer API
- Logging: Weights & Biases
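Our reading of `pos_weight=dynamic_per_batch` can be sketched as recomputing the negative/positive ratio in each batch. This is an assumption about what train.py does, not the verbatim code:

```python
import torch

def dynamic_bce_loss(logits, targets):
    """BCE with a pos_weight recomputed from each batch's class balance:
    positives are up-weighted by the batch's negative/positive ratio."""
    pos = targets.sum().clamp(min=1.0)
    neg = float(targets.numel()) - targets.sum()
    pos_weight = (neg / pos).clamp(min=1.0)
    return torch.nn.functional.binary_cross_entropy_with_logits(
        logits, targets, pos_weight=pos_weight
    )

logits = torch.zeros(4)
targets = torch.tensor([1.0, 0.0, 0.0, 0.0])     # imbalanced batch: 1 pos, 3 neg
print(float(dynamic_bce_loss(logits, targets)))  # higher than unweighted BCE (~0.693)
```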
### Published Accuracy by Language

| Language | Accuracy | FPR | FNR |
|---|---|---|---|
| Turkish | 97.10% | 1.66% | 1.24% |
| Korean | 96.85% | 1.12% | 2.02% |
| English | 95.60% | – | – |
| Spanish | 91.00% | – | – |
| Bengali | 84.10% | 10.80% | 5.10% |
| Vietnamese | 81.27% | 14.84% | 3.88% |
| Portuguese | Not reported | – | – |
### Inference Latency
| Device | Latency |
|---|---|
| AWS c7a.2xlarge (CPU) | 12.6 ms |
| NVIDIA L40S (GPU) | 3.3 ms |
| Apple M-series (MPS) | ~18 ms |
### Our Evaluation on Portuguese

We tested Smart Turn v3.1 on real Brazilian Portuguese speech from the NURC-SP Corpus Minimo (a subset of the 239h NURC-SP corpus of spontaneous São Paulo dialogues; CC BY-NC-ND 4.0):
| Metric | Result |
|---|---|
| Boundary detection (speaker actually stopped → model says "Complete") | 84.9% |
| Mid-turn detection (speaker still talking → model says "Incomplete") | 45.1% |
| Overall binary accuracy | 68.6% |
| Shift detection (speaker change) | 87.7% |
| Probability at boundaries (mean) | 0.809 |
| Probability at mid-turn (mean) | 0.522 |
| Separation (boundary - midturn) | 0.287 |
Key finding: Smart Turn detects end-of-utterance well (84.9%) but has a high false positive rate (54.9%) during ongoing speech. The model tends to predict "Complete" too aggressively on Portuguese.
### Fine-tuning Attempt
We fine-tuned the model on Portuguese using 6,031 samples extracted from NURC-SP (15 dialogues, 77 minutes) + Edge TTS dialogues:
| Metric | Original | Fine-tuned |
|---|---|---|
| Boundary detection | 84.9% | 98.4% |
| Mid-turn detection | 45.1% | 26.2% (worse) |
| Overall accuracy | 68.6% | 68.5% (same) |
| False alarm rate | 54.9% | 73.8% (worse) |
Result: Fine-tuning improved boundary detection but worsened false alarm rate. The model overfitted to predicting "Complete" for everything. The overall accuracy did not improve.
## Strategy: Improving Smart Turn for Portuguese

### Why It Doesn't Work Well on Portuguese
1. Underrepresented in training data: Portuguese is 1 of 23 languages across 270K samples, likely <5% of the training data. English dominates.
2. Mostly synthetic Portuguese data: the training pipeline uses TTS (Google Chirp3) for most non-English languages. Synthetic speech lacks natural hesitations, overlaps, and prosodic variation.
3. Portuguese prosody differs from English:
   - Portuguese has more overlap between speakers (~15% vs ~5% in English)
   - Shorter inter-turn gaps (median ~200ms vs ~300ms in English)
   - Different intonation patterns at sentence endings
   - Heavier use of filler words ("né", "tipo", "éh", "então")
4. NURC-SP audio quality: 1970s-1990s recordings with noise the model wasn't trained on (v3.2 added noise augmentation, but with modern noise profiles).
### Improvement Strategy

#### Phase 1: Better Training Data (Estimated effort: 1-2 weeks)
Goal: Create 20,000+ high-quality Portuguese training samples with proper class balance.
Data sources:
- NURC-SP Corpus Minimo (19h, already downloaded): extract more samples with sliding windows at various positions
- CORAA NURC-SP Audio Corpus (239h, HuggingFace): a massive source of real dialogues
- C-ORAL-BRASIL (21h, via Zenodo): spontaneous informal speech
- Edge TTS generation: create diverse Portuguese dialogues with multiple speakers/styles
- Real conversation recording: record actual Portuguese conversations with timestamp annotations
Key improvements over our first attempt:
- Use cross-validation: never test on conversations used for training
- Generate more diverse "incomplete" samples: multiple positions within each turn, not just the midpoint
- Include Portuguese-specific fillers ("né?", "tipo assim", "éh", "então") as end-of-utterance markers
- Add noise augmentation (background noise, room reverb, microphone artifacts)
- Balance the dataset: exactly 50/50 complete vs. incomplete, without augmentation tricks
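The sliding-window sampling can be sketched as follows; `extract_windows` and its defaults are hypothetical names for illustration, assuming a turn that spans from 0 s to `turn_end_s`:

```python
import numpy as np

def extract_windows(audio, sr, turn_end_s, max_len_s=8.0,
                    positions=(0.3, 0.5, 0.7)):
    """From one speaker turn ending at `turn_end_s`, cut one "complete"
    clip ending exactly at the boundary and several "incomplete" clips
    ending at fractional positions inside the turn."""
    max_len = int(max_len_s * sr)

    def clip_ending_at(end_s):
        end = int(end_s * sr)
        start = max(0, end - max_len)          # at most max_len_s of context
        return audio[start:end]

    samples = [(clip_ending_at(turn_end_s), 1)]          # label 1 = complete
    for p in positions:                                  # label 0 = incomplete
        samples.append((clip_ending_at(turn_end_s * p), 0))
    return samples

sr = 16000
audio = np.zeros(10 * sr, dtype=np.float32)   # 10 s of dummy audio
out = extract_windows(audio, sr, turn_end_s=9.0)
print(len(out))  # 4 clips: 1 complete + 3 incomplete
```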
#### Phase 2: Architecture Tweaks (Estimated effort: 1 week)

- Lower threshold for Portuguese: instead of 0.5, use 0.65-0.75 as the "Complete" threshold. This reduces false alarms at the cost of slightly slower detection.
- Language-specific classification head: add a language embedding to the classifier so the model can learn different decision boundaries per language.
- Longer context window: increase from 8s to 12-16s. Portuguese turns tend to be longer (2.5s mean vs 1.8s in English), so more context helps.
- Prosody features: add the pitch (F0) contour as an additional input feature. Portuguese has distinctive falling intonation at statement endings vs. rising at questions.
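A minimal sketch of the language-embedding idea, assuming the pooled 384-dim audio vector from the existing model; the class name and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class LangConditionedHead(nn.Module):
    """Classifier head that concatenates a learned per-language embedding
    to the pooled audio vector, so decision boundaries can differ by language."""
    def __init__(self, hidden=384, n_langs=23, lang_dim=16):
        super().__init__()
        self.lang_emb = nn.Embedding(n_langs, lang_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden + lang_dim, 256), nn.GELU(), nn.Linear(256, 1)
        )

    def forward(self, pooled, lang_id):
        z = torch.cat([pooled, self.lang_emb(lang_id)], dim=-1)
        return self.mlp(z)          # logit; apply sigmoid for a probability

lang_head = LangConditionedHead()
pooled = torch.randn(2, 384)                 # pooled audio features
lang_id = torch.tensor([17, 5])              # two different language IDs
print(lang_head(pooled, lang_id).shape)      # torch.Size([2, 1])
```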
#### Phase 3: Proper Evaluation (Estimated effort: 1 week)
- Hold-out test set: Reserve 3-5 NURC-SP conversations never seen during training
- Cross-corpus evaluation: Test on CORAA data not used in training
- Real-world test: Record and test on modern Portuguese conversations (Zoom/Teams calls)
- Compare with Silero VAD: Side-by-side evaluation on the same test set with identical metrics
- Threshold sweep: Find the optimal probability threshold for Portuguese specifically
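The threshold sweep can be sketched as below, over hypothetical (probability, label) pairs from a held-out benchmark run:

```python
import numpy as np

def sweep_thresholds(probs, labels, thresholds=np.arange(0.40, 0.90, 0.05)):
    """For each candidate threshold, report boundary recall ("detects
    speaker stopped"), false alarm rate, and overall accuracy."""
    rows = []
    for t in thresholds:
        pred = probs >= t
        pos = labels == 1
        recall = (pred & pos).sum() / max(pos.sum(), 1)
        far = (pred & ~pos).sum() / max((~pos).sum(), 1)
        acc = float((pred == pos).mean())
        rows.append((round(float(t), 2), float(recall), float(far), acc))
    return rows

probs = np.array([0.9, 0.8, 0.3, 0.6])    # toy model outputs
labels = np.array([1, 1, 0, 0])           # 1 = turn actually complete
for t, recall, far, acc in sweep_thresholds(probs, labels):
    print(f"t={t:.2f}  recall={recall:.2f}  far={far:.2f}  acc={acc:.2f}")
```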
#### Phase 4: Integration with BabelCast (Estimated effort: 2-3 days)
If the improved model achieves >85% accuracy with <15% false alarm rate on Portuguese:
- Replace Silero VAD's end-of-speech detection with Smart Turn PT
- Keep Silero VAD for initial voice activity detection (speech vs silence)
- Use Smart Turn only for the endpoint decision (when to trigger translation)
- Hybrid approach:
Silero VAD (speech detected) → Smart Turn PT (speech complete?) → Translate
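The hybrid decision above can be sketched as a small gating function; the names, defaults, and 0.65 threshold are illustrative assumptions:

```python
def should_endpoint(vad_is_speech, silence_ms, smart_turn_prob,
                    min_silence_ms=200, threshold=0.65):
    """Silero VAD gates the check (some silence must have elapsed);
    Smart Turn PT then decides whether the utterance is semantically
    complete before translation is triggered."""
    if vad_is_speech or silence_ms < min_silence_ms:
        return False      # still speaking, or the pause is too short
    return smart_turn_prob >= threshold

# Speaker paused 300 ms and the model is confident the turn ended -> translate
print(should_endpoint(False, 300, 0.82))   # True
# Mid-sentence hesitation: pause present but low completion probability
print(should_endpoint(False, 300, 0.41))   # False
```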
### Required Resources
| Resource | Purpose | Cost |
|---|---|---|
| NURC-SP + CORAA data | Training samples | Free (CC BY-NC-ND 4.0) |
| GPU for training (L4/A6000) | Fine-tuning, ~1 hour | ~$1-2 on Vast.ai |
| Edge TTS | Synthetic data generation | Free |
| Weights & Biases | Training tracking | Free tier |
### Expected Outcome
With 20,000+ properly prepared Portuguese samples and cross-validated evaluation, we estimate:
- Boundary detection: 90%+ (up from 84.9%)
- False alarm rate: <20% (down from 54.9%)
- Overall accuracy: >85% (up from 68.6%)
This would make Smart Turn PT a viable complement to Silero VAD for Portuguese end-of-utterance detection.
## Quick Start

### Local (CPU)

```bash
pip install -r requirements.txt

# Generate the synthetic Portuguese dataset
python setup_portuguese_dataset.py --dataset synthetic

# Run benchmarks
python run_portuguese_benchmark.py

# Generate the report
python generate_report.py
```
### With Real Portuguese Speech (NURC-SP)

```bash
# Prepare NURC-SP dialogues (downloads from HuggingFace)
python setup_nurc_dataset.py

# Run the Pipecat Smart Turn benchmark
python -c "
from benchmark_pipecat import PipecatSmartTurnModel
from benchmark_base import evaluate_model
# ... (see run_portuguese_benchmark.py)
"
```
### Fine-tune Smart Turn for Portuguese

```bash
# 1. Prepare training data from NURC-SP
python prepare_training_data.py

# 2. Fine-tune (runs on MPS/CUDA/CPU)
python finetune_smart_turn.py

# 3. Test the fine-tuned model
# ONNX model saved to checkpoints/smart_turn_pt/smart_turn_pt.onnx
```
### Vast.ai (GPU)

```bash
export VAST_API_KEY="your_key"
python deploy_vast.py --all
```
## Project Structure

```
turn-taking-study/
├── README.md                        # This file
├── Dockerfile                       # GPU-ready container
├── requirements.txt                 # Python dependencies
│
├── # Benchmark Framework
├── benchmark_base.py                # Base classes & evaluation metrics
├── benchmark_silence.py             # Silence threshold baseline
├── benchmark_silero_vad.py          # Silero VAD model
├── benchmark_vap.py                 # Voice Activity Projection model
├── benchmark_livekit_eot.py         # LiveKit End-of-Turn model
├── benchmark_pipecat.py             # Pipecat Smart Turn v3.1
├── run_benchmarks.py                # General benchmark orchestrator
├── run_portuguese_benchmark.py      # Portuguese-specific benchmark
│
├── # Dataset Preparation
├── setup_dataset.py                 # General dataset download
├── setup_portuguese_dataset.py      # Portuguese synthetic dataset
├── setup_nurc_dataset.py            # NURC-SP real speech dataset
├── generate_tts_dataset.py          # Edge TTS Portuguese dialogues
│
├── # Fine-tuning
├── prepare_training_data.py         # Extract training samples from NURC-SP
├── finetune_smart_turn.py           # Fine-tune Smart Turn on Portuguese
│
├── # Deployment & Reporting
├── deploy_vast.py                   # Vast.ai deployment automation
├── generate_report.py               # Report & figure generation
│
├── data/                            # Audio files & annotations (gitignored)
│   ├── annotations/                 # JSON ground truth files
│   ├── nurc_sp/                     # NURC-SP real speech
│   ├── portuguese/                  # Synthetic Portuguese audio
│   ├── portuguese_tts/              # Edge TTS Portuguese audio
│   └── smart_turn_pt_training/      # Fine-tuning training samples
│
├── checkpoints/                     # Trained models (gitignored)
│   └── smart_turn_pt/
│       ├── best_model.pt            # PyTorch checkpoint
│       └── smart_turn_pt.onnx       # ONNX model (30.6 MB)
│
├── results/                         # Benchmark result JSONs
└── report/                          # Generated reports
    ├── benchmark_report.md
    ├── benchmark_report.tex         # IEEE format for thesis
    └── figures/                     # PNG charts
```
## Datasets Used
| Dataset | Type | Size | Language | Source |
|---|---|---|---|---|
| Portuguese Synthetic | Generated audio | 1.4h, 100 convs | pt-BR | Local generation |
| Portuguese TTS | Edge TTS speech | 6.4min, 10 convs | pt-BR | Microsoft Edge TTS |
| NURC-SP Corpus Minimo | Real dialogues (1970s-90s) | 19h, 21 recordings | pt-BR | HuggingFace |
| CORAA NURC-SP | Real dialogues | 239h | pt-BR | HuggingFace |
## References
- Ekstedt, E. & Torre, G. (2024). Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection. arXiv:2401.04868.
- Ekstedt, E. & Torre, G. (2022). Voice Activity Projection: Self-supervised Learning of Turn-taking Events. INTERSPEECH 2022.
- Ekstedt, E., Holmer, E., & Torre, G. (2024). Multilingual Turn-taking Prediction Using Voice Activity Projection. LREC-COLING 2024.
- Daily. (2025). Smart Turn: Real-time End-of-Turn Detection. GitHub. https://github.com/pipecat-ai/smart-turn
- Daily. (2025). Announcing Smart Turn v3, with CPU inference in just 12ms. https://www.daily.co/blog/announcing-smart-turn-v3-with-cpu-inference-in-just-12ms/
- Daily. (2025). Improved accuracy in Smart Turn v3.1. https://www.daily.co/blog/improved-accuracy-in-smart-turn-v3-1/
- Daily. (2026). Smart Turn v3.2: Handling noisy environments and short responses. https://www.daily.co/blog/smart-turn-v3-2-handling-noisy-environments-and-short-responses/
- LiveKit. (2025). Improved End-of-Turn Model Cuts Voice AI Interruptions 39%. https://blog.livekit.io/improved-end-of-turn-model-cuts-voice-ai-interruptions-39/
- Silero Team. (2021). Silero VAD: pre-trained enterprise-grade Voice Activity Detector. https://github.com/snakers4/silero-vad
- Skantze, G. (2021). Turn-taking in Conversational Systems and Human-Robot Interaction: A Review. Computer Speech & Language, 67, 101178.
- Sacks, H., Schegloff, E.A., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language, 50(4), 696-735.
- Raux, A. & Eskenazi, M. (2009). A Finite-State Turn-Taking Model for Spoken Dialog Systems. NAACL-HLT.
- Krisp. (2024). Audio-only 6M weights Turn-Taking model for Voice AI Agents. https://krisp.ai/blog/turn-taking-for-voice-ai/
- Castilho, A.T. (2019). NURC-SP Audio Corpus. 239h of transcribed Brazilian Portuguese dialogues.
- Godfrey, J.J., et al. (1992). SWITCHBOARD: Telephone speech corpus for research and development. ICASSP-92.
## License

MIT