Turn-Taking Model: End-of-Turn Detection for BabelCast

Research, benchmarks, and fine-tuning of end-of-turn detection models for simultaneous translation in Portuguese.

Repository Structure

docs/turn-taking-study/
  README.md                        # This document
  melhorias_turn_detection.md      # Improvement plan + results of the 3 rounds
  RESEARCH_LOG.md                  # Research log
  data/                            # Datasets (NURC-SP, CORAA, TTS), ~10GB
  hf_cache/                        # HuggingFace cache

  previous-experiments/
    01-benchmarks/                 # Benchmark of 5 models on Portuguese
      benchmark_*.py               # Benchmark scripts (Silence, Silero, VAP, Pipecat, LiveKit)
      setup_*.py                   # Dataset setup scripts
      report/                      # Generated report (markdown + LaTeX + charts)

    02-finetune-scratch/           # Fine-tuning from scratch (3 rounds)
      finetune_smart_turn_v3.py    # Main script (Whisper Tiny + Focal Loss)
      modal_finetune.py            # Deployment on Modal
      results/                     # Round 1: Whisper Base + BCE (F1=0.796)
      results-tiny/                # Round 2: Whisper Tiny + BCE (F1=0.788)
      results-focal/               # Round 3: Whisper Tiny + Focal Loss (F1=0.798)
      checkpoints/                 # v1/v2 checkpoints

  03-finetune-pipecat-pt/          # NEW: fine-tuning from the pretrained Pipecat model
    README.md                      # Full documentation of the experiment

Experiment Summary

01: Benchmarks (5 models on Portuguese)

Comparison of existing models on real Portuguese audio (NURC-SP, 77 min).

02: Fine-tuning from scratch (3 rounds)

We trained a Whisper Tiny encoder + classifier from scratch on 15K Portuguese samples (CORAA + MUPE). Best result: F1=0.798, with 83% precision at threshold=0.65. Details in melhorias_turn_detection.md.
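The best round replaced BCE with focal loss (see results-focal/ in the tree above). As a minimal sketch of what that change does, the snippet below compares the two losses on a single example. The `gamma` and `alpha` values are the common defaults from the focal loss literature and are an assumption here; the values used in finetune_smart_turn_v3.py are not stated in this document.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bce(logit: float, label: int) -> float:
    """Binary cross-entropy for one example, computed from a raw logit."""
    p = sigmoid(logit)
    p_t = p if label == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(logit: float, label: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p = sigmoid(logit)
    p_t = p if label == 1 else 1.0 - p
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Illustrative logits only: one confidently correct example, one misclassified.
examples = {"easy": (4.0, 1), "hard": (-1.0, 1)}
for name, (logit, label) in examples.items():
    print(name, round(bce(logit, label), 4), round(focal_loss(logit, label), 4))
```

The easy example is down-weighted far more than the hard one, which is why focal loss is often tried when a model is dominated by easy cases.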

03: Fine-tuning from Pipecat (next)

Fine-tuning of the pretrained Pipecat model (270K samples, 23 languages) specifically for Portuguese, including French speakers speaking Portuguese. It uses LLMs (Claude) to create quality labels and TTS to generate audio. Details in 03-finetune-pipecat-pt/README.md.


Benchmark Results (Experiment 01)

Comparative evaluation of turn-taking prediction models for real-time conversational AI, with focus on Portuguese language performance.

Models Evaluated

| Model | Type | Size | Needs GPU | Needs ASR | Portuguese Support |
|---|---|---|---|---|---|
| Silence Threshold (300/500/700ms) | Rule-based | 0 | No | No | Language-independent |
| Silero VAD | Audio DNN | 2MB | No | No | Language-independent |
| VAP | Audio Transformer (CPC) | 20MB | Optional | No | Trained on English only |
| Pipecat Smart Turn v3.1 | Audio Transformer (Whisper) | 8MB | No | No | Included in 23 languages |
| LiveKit EOT | Text Transformer (Qwen2.5) | 281MB | No | Yes | English only |

Key Results: Portuguese

Real Portuguese Speech (NURC-SP corpus, 77 min, 15 dialogues)

End-of-utterance detection accuracy (is the speaker done talking?):

| Model | Detects speaker stopped | False alarm rate | Overall accuracy |
|---|---|---|---|
| Pipecat Smart Turn v3.1 (original) | 84.9% | 54.9% | 68.6% |
| Pipecat Smart Turn v3.1 (fine-tuned PT) | 98.4% | 73.8% | 68.5% |
| Silero VAD | ~95%+ | ~5% | ~95% |

Conclusion: Silero VAD remains the most robust approach for detecting when a speaker stops talking in Portuguese. Smart Turn's Whisper-based approach adds linguistic intelligence but suffers from high false alarm rates on Portuguese, even after fine-tuning.

Turn-taking benchmark (Edge TTS, 10 dialogues, 6.4 min)

| Rank | Model | Macro-F1 | Balanced Acc | Latency p50 | False Int. |
|---|---|---|---|---|---|
| 1 | Pipecat Smart Turn v3.1 | 0.639 | 0.639 | 18.3ms | 22.8% |
| 2 | Silence 700ms | 0.566 | 0.573 | 0.1ms | 18.1% |
| 3 | Silero VAD | 0.401 | 0.500 | 9.0ms | 100.0% |
| 4 | VAP | 0.000 | 0.000 | - | - (needs stereo) |
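To help read the Macro-F1 and Balanced Acc columns, here is a small sketch of both metrics on toy labels. It assumes a two-class setup (1 = turn end, 0 = speaker continues), which may differ from how benchmark_base.py defines its classes. A predictor that always outputs one class lands at exactly 0.5 balanced accuracy, the same pattern as the Silero VAD row, although this document does not state that this is what happened there.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(classes)

# Toy labels, not taken from the benchmark.
y_true = [1, 1, 0, 0, 0, 0]
always_end = [1] * len(y_true)
print(macro_f1(y_true, always_end), balanced_accuracy(y_true, always_end))
```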

Pipecat Smart Turn: Model Documentation

Overview

Smart Turn is an open-source end-of-turn detection model created by Daily (daily.co), the company behind the Pipecat voice AI framework. It predicts whether a speaker has finished their turn ("complete") or is still talking ("incomplete") using only audio input.

No academic paper exists. The model is documented through blog posts and GitHub only.

Architecture

Input: 16kHz mono PCM audio (up to 8 seconds)
    │
    ▼
Whisper Feature Extractor → Log-mel spectrogram (80 bins × 800 frames)
    │
    ▼
Whisper Tiny Encoder (pretrained, openai/whisper-tiny)
    │  Output: (batch, 400, 384), i.e. 400 frames, 384-dim hidden state
    ▼
Attention Pooling: Linear(384→256) → Tanh → Linear(256→1)
    │  Learns which audio frames are most important for the decision
    ▼  Weighted sum → (batch, 384)
Classifier MLP:
    Linear(384→256) → LayerNorm → GELU → Dropout(0.1)
    → Linear(256→64) → GELU → Linear(64→1)
    │
    ▼
Sigmoid → probability [0, 1]
    > 0.5 = "Complete" (speaker finished)
    ≤ 0.5 = "Incomplete" (speaker still talking)

Total parameters: ~8M
Model size: 8MB (int8 ONNX) / 32MB (fp32 ONNX)
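The shapes in the diagram can be checked with a few lines of arithmetic, and the pooling step can be sketched without any framework. The snippet below is an illustration only: it uses Whisper's 10 ms hop (160 samples) and the encoder's stride-2 convolution to recover the 800 and 400 frame counts, and it assumes the per-frame scores are normalized with a softmax before the weighted sum, which the diagram does not state explicitly. In the real model the scores come from the Linear → Tanh → Linear stack; here they are passed in directly.

```python
import math

SAMPLE_RATE = 16_000  # Hz, from the diagram
HOP_LENGTH = 160      # Whisper feature extractor hop (10 ms)
CONV_STRIDE = 2       # the encoder's second conv layer halves the time axis

def frame_counts(seconds: float) -> tuple[int, int]:
    """Return (mel frames, encoder frames) for a clip of the given length."""
    mel = int(seconds * SAMPLE_RATE) // HOP_LENGTH
    return mel, mel // CONV_STRIDE

def attention_pool(frames, scores):
    """Softmax the per-frame scores, then return the weighted sum of frames."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights

print(frame_counts(8.0))  # (800, 400)

# Toy 2-dim frames; the high score on the last frame dominates the pooled vector.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 5.0])
print([round(w, 3) for w in weights], [round(v, 3) for v in pooled])
```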

Why Whisper Tiny?

The team evolved through several architectures:

| Version | Backbone | Size | Assessment |
|---|---|---|---|
| v1 | wav2vec2-BERT | 2.3GB | Overfitted, too large |
| v2 | wav2vec2 + linear | 360MB | Still large |
| v3+ | Whisper Tiny encoder | 8MB | Good balance |

Whisper Tiny was chosen because:

  • Pretrained on 680,000 hours of multilingual speech (99 languages)
  • Encoder produces rich acoustic representations without needing the decoder
  • Only 39M params in full Whisper Tiny; encoder alone is much smaller
  • The attention pooling + MLP classifier adds minimal overhead
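The "minimal overhead" claim can be checked by counting the head's parameters from the layer sizes in the architecture diagram. This is a back-of-the-envelope sketch that assumes every Linear layer has a bias and that LayerNorm carries a learned scale and shift; the actual implementation may differ.

```python
def linear_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out  # weights + bias

def layernorm_params(dim: int) -> int:
    return 2 * dim  # scale + shift

# Attention pooling: Linear(384->256) -> Tanh -> Linear(256->1)
attention_pooling = linear_params(384, 256) + linear_params(256, 1)

# Classifier MLP: Linear(384->256) -> LayerNorm -> GELU -> Dropout
#                 -> Linear(256->64) -> GELU -> Linear(64->1)
classifier = (
    linear_params(384, 256)
    + layernorm_params(256)
    + linear_params(256, 64)
    + linear_params(64, 1)
)

head_total = attention_pooling + classifier
print(attention_pooling, classifier, head_total)
```

Under these assumptions the head comes to roughly 0.21M parameters, a small fraction of the ~8M total quoted above.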

Training Data

Dataset: pipecat-ai/smart-turn-data-v3.2-train on HuggingFace
Size: 270,946 samples (41.4 GB)
Languages: 23 (Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Marathi, Norwegian, Polish, Portuguese, Russian, Spanish, Turkish, Ukrainian, Vietnamese)

Data format per sample:

| Field | Type | Description |
|---|---|---|
| audio | Audio | 16kHz mono PCM, up to 16s |
| endpoint_bool | bool | True = complete, False = incomplete |
| language | string | ISO 639-3 code (e.g., "por") |
| midfiller | bool | Filler word mid-utterance ("um", "éh") |
| endfiller | bool | Filler word at end |
| synthetic | bool | TTS-generated vs human |
| dataset | string | Source (12 different sources) |

Data generation pipeline:

  1. Text sources: 1.2M+ multilingual sentences from HuggingFace datasets
  2. Cleaning: Gemini 2.5 Flash filtered grammatically incorrect sentences (removed 50-80%)
  3. TTS: Google Chirp3 for synthetic audio generation
  4. Filler words: Language-specific lists (generated by Claude/GPT), inserted by Gemini Flash
  5. Human audio: Contributed by Liva AI, Midcentury, MundoAI
  6. Noise augmentation (v3.2): Background noise from CC-0 Freesound.org samples
  7. Target split: 50/50 complete vs incomplete

Training Process

# Hyperparameters (from train.py)
learning_rate = 5e-5
epochs = 4
train_batch_size = 384
eval_batch_size = 128
warmup_ratio = 0.2
weight_decay = 0.01
lr_scheduler = "cosine"
loss = BCEWithLogitsLoss(pos_weight=dynamic_per_batch)
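Two of these settings are easier to follow with the formulas written out. The sketch below shows a linear-warmup plus cosine-decay schedule of the kind selected by `lr_scheduler = "cosine"` with `warmup_ratio = 0.2`, and a binary cross-entropy whose positive weight is recomputed for each batch. Computing `pos_weight` as negatives divided by positives is an assumption; this document does not say how train.py derives it.

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 5e-5, warmup_ratio: float = 0.2) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def weighted_bce_with_logits(logits, labels) -> float:
    """Mean BCE with positives scaled by this batch's n_neg / n_pos (assumed rule)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_weight = n_neg / n_pos if n_pos else 1.0
    total = 0.0
    for x, y in zip(logits, labels):
        log_p = -math.log1p(math.exp(-x))      # log(sigmoid(x))
        log_not_p = -math.log1p(math.exp(x))   # log(1 - sigmoid(x))
        total += -(pos_weight * y * log_p + (1 - y) * log_not_p)
    return total / len(labels)

for step in (0, 100, 200, 600, 1000):
    print(step, lr_at(step, total_steps=1000))
print(weighted_bce_with_logits([0.0, 0.0, 0.0, 0.0], [1, 0, 0, 0]))
```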

Hardware: Modal L4 GPU (or local GPU)
Training time: ~53-79 minutes depending on GPU
Framework: HuggingFace Transformers Trainer API
Logging: Weights & Biases

Published Accuracy by Language

| Language | Accuracy | FPR | FNR |
|---|---|---|---|
| Turkish | 97.10% | 1.66% | 1.24% |
| Korean | 96.85% | 1.12% | 2.02% |
| English | 95.60% | - | - |
| Spanish | 91.00% | - | - |
| Bengali | 84.10% | 10.80% | 5.10% |
| Vietnamese | 81.27% | 14.84% | 3.88% |
| Portuguese | Not reported | - | - |

Inference Latency

| Device | Latency |
|---|---|
| AWS c7a.2xlarge (CPU) | 12.6 ms |
| NVIDIA L40S (GPU) | 3.3 ms |
| Apple M-series (MPS) | ~18 ms |

Our Evaluation on Portuguese

We tested Smart Turn v3.1 on real Brazilian Portuguese speech from the NURC-SP Corpus Minimo (a 19h subset of the 239h NURC-SP corpus of spontaneous São Paulo dialogues, CC BY-NC-ND 4.0):

| Metric | Result |
|---|---|
| Boundary detection (speaker actually stopped → model says "Complete") | 84.9% |
| Mid-turn detection (speaker still talking → model says "Incomplete") | 45.1% |
| Overall binary accuracy | 68.6% |
| Shift detection (speaker change) | 87.7% |
| Probability at boundaries (mean) | 0.809 |
| Probability at mid-turn (mean) | 0.522 |
| Separation (boundary - midturn) | 0.287 |

Key finding: Smart Turn detects end-of-utterance well (84.9%) but has a high false positive rate (54.9%) during ongoing speech. The model tends to predict "Complete" too aggressively on Portuguese.
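The metrics above are related: the false alarm rate is the complement of mid-turn detection (100% - 45.1% = 54.9%), and separation is the difference between the two mean probabilities (0.809 - 0.522 = 0.287). The following sketch shows how such a summary could be computed from per-sample probabilities. The function name and the sample values are made up for illustration and are not the repository's evaluation code.

```python
def summarize(boundary_probs, midturn_probs, threshold=0.5):
    """Rates for one model; a probability above `threshold` means 'Complete'."""
    boundary_hits = sum(p > threshold for p in boundary_probs)
    false_alarms = sum(p > threshold for p in midturn_probs)
    n_b, n_m = len(boundary_probs), len(midturn_probs)
    mean_b = sum(boundary_probs) / n_b
    mean_m = sum(midturn_probs) / n_m
    return {
        "boundary_detection": boundary_hits / n_b,
        "false_alarm_rate": false_alarms / n_m,
        "midturn_detection": 1 - false_alarms / n_m,
        "overall_accuracy": (boundary_hits + (n_m - false_alarms)) / (n_b + n_m),
        "separation": mean_b - mean_m,
    }

# Made-up probabilities, for illustration only.
stats = summarize([0.9, 0.8, 0.7, 0.4], [0.6, 0.55, 0.3, 0.2])
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```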

Fine-tuning Attempt

We fine-tuned the model on Portuguese using 6,031 samples extracted from NURC-SP (15 dialogues, 77 minutes) + Edge TTS dialogues:

| Metric | Original | Fine-tuned |
|---|---|---|
| Boundary detection | 84.9% | 98.4% |
| Mid-turn detection | 45.1% | 26.2% (worse) |
| Overall accuracy | 68.6% | 68.5% (same) |
| False alarm rate | 54.9% | 73.8% (worse) |

Result: Fine-tuning improved boundary detection but worsened false alarm rate. The model overfitted to predicting "Complete" for everything. The overall accuracy did not improve.
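Overall accuracy is a weighted average of the two per-class rates, which explains why it barely moved while both components changed sharply. The back-of-the-envelope sketch below infers the share of boundary samples from the original model's numbers and applies it to the fine-tuned model. It assumes both columns were scored on the same test mix; the actual class proportions are not reported in this document.

```python
def implied_boundary_share(overall, boundary_acc, midturn_acc):
    """Solve overall = w * boundary_acc + (1 - w) * midturn_acc for w."""
    return (overall - midturn_acc) / (boundary_acc - midturn_acc)

def overall_accuracy(w, boundary_acc, midturn_acc):
    """Weighted average of the two per-class accuracies."""
    return w * boundary_acc + (1 - w) * midturn_acc

# Original model: 84.9% boundary, 45.1% mid-turn, 68.6% overall.
w = implied_boundary_share(68.6, 84.9, 45.1)

# Fine-tuned model: 98.4% boundary, 26.2% mid-turn.
predicted = overall_accuracy(w, 98.4, 26.2)
print(f"implied boundary share: {w:.2f}")
print(f"predicted fine-tuned overall accuracy: {predicted:.1f}%")
```

With a boundary share of roughly 59%, the gain on boundaries and the loss on mid-turn samples nearly cancel, which is consistent with the reported 68.5%.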


Strategy: Improving Smart Turn for Portuguese

Why It Doesn't Work Well on Portuguese

  1. Underrepresented in training data: Portuguese is 1 of 23 languages in 270K samples, so it is likely <5% of the training data. English dominates.

  2. Mostly synthetic Portuguese data: The training pipeline uses TTS (Google Chirp3) for most non-English languages. Synthetic speech lacks natural hesitations, overlaps, and prosodic variation.

  3. Portuguese prosody differs from English:

    • Portuguese has more overlap between speakers (~15% vs ~5% in English)
    • Shorter inter-turn gaps (median ~200ms vs ~300ms in English)
    • Different intonation patterns at sentence endings
    • More use of filler words ("né", "tipo", "éh", "então")
  4. NURC-SP audio quality: 1970s-1990s recordings with noise, which the model wasn't trained on (v3.2 added noise augmentation, but for modern noise profiles).

Improvement Strategy

Phase 1: Better Training Data (Estimated effort: 1-2 weeks)

Goal: Create 20,000+ high-quality Portuguese training samples with proper class balance.

Data sources:

  1. NURC-SP Corpus Minimo (19h, already downloaded): extract more samples with sliding windows at various positions
  2. CORAA NURC-SP Audio Corpus (239h, HuggingFace): massive source of real dialogues
  3. C-ORAL-BRASIL (21h, via Zenodo): spontaneous informal speech
  4. Edge TTS generation: create diverse Portuguese dialogues with multiple speakers/styles
  5. Real conversation recording: record actual Portuguese conversations with timestamp annotations
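As a rough illustration of extracting samples at various positions, the sketch below turns one annotated speaker turn into several labeled windows. The function name, the cut fractions, and the choice to keep each window inside the current turn are all assumptions made for this example; prepare_training_data.py may work differently.

```python
WINDOW_S = 8.0  # Smart Turn's maximum input length in seconds

def make_samples(turn_start, turn_end, cut_fractions=(0.25, 0.5, 0.75)):
    """Return (window_start, window_end, label) tuples for one speaker turn.

    label 1 = complete (the window ends at the turn boundary),
    label 0 = incomplete (the window ends inside the turn).
    """
    samples = []
    duration = turn_end - turn_start
    for frac in cut_fractions:
        cut = turn_start + frac * duration
        samples.append((max(turn_start, cut - WINDOW_S), cut, 0))
    samples.append((max(turn_start, turn_end - WINDOW_S), turn_end, 1))
    return samples

# A 4-second turn from t=10s to t=14s (times in seconds).
for sample in make_samples(10.0, 14.0):
    print(sample)
```

Note that this yields three incomplete windows per complete one, so a balancing step is still needed afterwards.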

Key improvements over our first attempt:

  • Use cross-validation: never test on conversations used for training
  • Generate more diverse "incomplete" samples: multiple positions within each turn, not just the midpoint
  • Include Portuguese-specific fillers ("né?", "tipo assim", "éh", "então") as end-of-utterance markers
  • Add noise augmentation (background noise, room reverb, microphone artifacts)
  • Balance dataset: exactly 50/50 complete vs incomplete, without augmentation tricks
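Two of these points, keeping conversations out of both splits and enforcing a 50/50 balance, can be sketched in a few lines. This example uses a single grouped hold-out split rather than full cross-validation (rotating the held-out conversations would extend it to k folds), and the `conversation` and `label` keys are hypothetical field names.

```python
import random

def split_by_conversation(samples, test_fraction=0.2, seed=0):
    """Hold out whole conversations so no dialogue appears in both splits."""
    conv_ids = sorted({s["conversation"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(conv_ids)
    n_test = max(1, round(len(conv_ids) * test_fraction))
    test_ids = set(conv_ids[:n_test])
    train = [s for s in samples if s["conversation"] not in test_ids]
    test = [s for s in samples if s["conversation"] in test_ids]
    return train, test

def balance(samples, seed=0):
    """Downsample the majority class to an exact 50/50 split."""
    rng = random.Random(seed)
    pos = [s for s in samples if s["label"] == 1]
    neg = [s for s in samples if s["label"] == 0]
    n = min(len(pos), len(neg))
    return rng.sample(pos, n) + rng.sample(neg, n)

# Toy dataset: 5 conversations, each with 2 complete and 3 incomplete samples.
data = [{"conversation": c, "label": l} for c in range(5) for l in (1, 1, 0, 0, 0)]
train, test = split_by_conversation(data)
train = balance(train)
print(len(train), len(test))
```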

Phase 2: Architecture Tweaks (Estimated effort: 1 week)

  1. Higher threshold for Portuguese: Instead of 0.5, use 0.65-0.75 as the "Complete" threshold. This reduces false alarms at the cost of some missed or delayed endpoint detections.

  2. Language-specific classification head: Add a language embedding to the classifier so the model can learn different decision boundaries per language.

  3. Longer context window: Increase from 8s to 12-16s. Portuguese turns tend to be longer (2.5s mean vs 1.8s in English), so more context helps.

  4. Prosody features: Add pitch (F0) contour as an additional input feature. Portuguese has distinctive falling intonation at statement endings vs rising at questions.
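For the prosody idea, a pitch contour is a sequence of per-frame F0 estimates. The sketch below estimates F0 for a single frame with plain autocorrelation, one of the simplest pitch estimators, and checks it on a synthetic tone. It is an illustration only: real speech would need voicing detection and a more robust estimator, and nothing here reflects how the repository plans to compute F0.

```python
import math

def estimate_f0(frame, sample_rate=16_000, fmin=60.0, fmax=400.0):
    """Estimate the pitch of one frame by picking the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic 200 Hz tone, 40 ms long, standing in for a voiced frame.
tone = [math.sin(2 * math.pi * 200 * n / 16_000) for n in range(640)]
print(round(estimate_f0(tone)))
```

Running this over consecutive frames near the end of an utterance would give the falling or rising contour mentioned above.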

Phase 3: Proper Evaluation (Estimated effort: 1 week)

  1. Hold-out test set: Reserve 3-5 NURC-SP conversations never seen during training
  2. Cross-corpus evaluation: Test on CORAA data not used in training
  3. Real-world test: Record and test on modern Portuguese conversations (Zoom/Teams calls)
  4. Compare with Silero VAD: Side-by-side evaluation on the same test set with identical metrics
  5. Threshold sweep: Find the optimal probability threshold for Portuguese specifically
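A threshold sweep can be as simple as scanning a grid and keeping the lowest threshold that stays within a false alarm budget, since lower thresholds detect more boundaries. The sketch below uses the 15% budget from the integration criterion in Phase 4; the function names, the 0.05 grid, and the toy scores are assumptions for illustration.

```python
def rates_at(probs, labels, threshold):
    """Return (boundary detection rate, false alarm rate) at one threshold."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    detection = sum(p > threshold for p in pos) / len(pos)
    false_alarm = sum(p > threshold for p in neg) / len(neg)
    return detection, false_alarm

def pick_threshold(probs, labels, max_false_alarm=0.15):
    """Lowest threshold on a 0.05 grid whose false alarm rate fits the budget."""
    for i in range(1, 20):
        threshold = i / 20
        detection, false_alarm = rates_at(probs, labels, threshold)
        if false_alarm <= max_false_alarm:
            return threshold, detection, false_alarm
    return None

# Toy scores: label 1 = true boundary, label 0 = mid-turn.
probs = [0.9, 0.85, 0.8, 0.6, 0.68, 0.55, 0.3, 0.2, 0.1, 0.4]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(pick_threshold(probs, labels))
```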

Phase 4: Integration with BabelCast (Estimated effort: 2-3 days)

If the improved model achieves >85% accuracy with <15% false alarm rate on Portuguese:

  1. Replace Silero VAD's end-of-speech detection with Smart Turn PT
  2. Keep Silero VAD for initial voice activity detection (speech vs silence)
  3. Use Smart Turn only for the endpoint decision (when to trigger translation)
  4. Hybrid approach: Silero VAD (speech detected) → Smart Turn PT (speech complete?) → Translate
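The hybrid flow can be sketched as a small loop in which the VAD gates buffering and the turn model decides when to flush. All three callables below are hypothetical stand-ins: `is_speech` for Silero VAD, `turn_complete_prob` for Smart Turn PT, and `translate` for the BabelCast translation step. None of them reflect the real APIs of those components.

```python
from typing import Callable, Iterable, List

def run_pipeline(
    chunks: Iterable[List[float]],
    is_speech: Callable[[List[float]], bool],
    turn_complete_prob: Callable[[List[float]], float],
    translate: Callable[[List[float]], None],
    threshold: float = 0.5,
) -> int:
    """Buffer speech chunks; on each pause, ask the turn model whether to flush."""
    buffer: List[float] = []
    flushed = 0
    for chunk in chunks:
        if is_speech(chunk):
            buffer.extend(chunk)
            continue
        # Silence after buffered speech: a candidate endpoint.
        if buffer and turn_complete_prob(buffer) > threshold:
            translate(buffer)
            buffer = []
            flushed += 1
    return flushed

# Stub components for a dry run.
sent: List[List[float]] = []
count = run_pipeline(
    chunks=[[1.0, 1.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0]],
    is_speech=lambda chunk: any(v != 0.0 for v in chunk),
    turn_complete_prob=lambda audio: 0.9 if len(audio) >= 4 else 0.2,
    translate=sent.append,
)
print(count, sent)
```

In this dry run the first pause is judged incomplete, so the speech keeps accumulating until the second pause triggers a single translation call.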

Required Resources

| Resource | Purpose | Cost |
|---|---|---|
| NURC-SP + CORAA data | Training samples | Free (CC BY-NC-ND 4.0) |
| GPU for training (L4/A6000) | Fine-tuning, ~1 hour | ~$1-2 on Vast.ai |
| Edge TTS | Synthetic data generation | Free |
| Weights & Biases | Training tracking | Free tier |

Expected Outcome

With 20,000+ properly prepared Portuguese samples and cross-validated evaluation, we estimate:

  • Boundary detection: 90%+ (up from 84.9%)
  • False alarm rate: <20% (down from 54.9%)
  • Overall accuracy: >85% (up from 68.6%)

This would make Smart Turn PT a viable complement to Silero VAD for Portuguese end-of-utterance detection.


Quick Start

Local (CPU)

pip install -r requirements.txt

# Generate Portuguese dataset
python setup_portuguese_dataset.py --dataset synthetic

# Run benchmarks
python run_portuguese_benchmark.py

# Generate report
python generate_report.py

With Real Portuguese Speech (NURC-SP)

# Prepare NURC-SP dialogues (downloads from HuggingFace)
python setup_nurc_dataset.py

# Run Pipecat Smart Turn benchmark
python -c "
from benchmark_pipecat import PipecatSmartTurnModel
from benchmark_base import evaluate_model
# ... (see run_portuguese_benchmark.py)
"

Fine-tune Smart Turn for Portuguese

# 1. Prepare training data from NURC-SP
python prepare_training_data.py

# 2. Fine-tune (runs on MPS/CUDA/CPU)
python finetune_smart_turn.py

# 3. Test the fine-tuned model
# ONNX model saved to checkpoints/smart_turn_pt/smart_turn_pt.onnx

Vast.ai (GPU)

export VAST_API_KEY="your_key"
python deploy_vast.py --all

Project Structure

turn-taking-study/
├── README.md                       # This file
├── Dockerfile                      # GPU-ready container
├── requirements.txt                # Python dependencies
│
├── # Benchmark Framework
├── benchmark_base.py               # Base classes & evaluation metrics
├── benchmark_silence.py            # Silence threshold baseline
├── benchmark_silero_vad.py         # Silero VAD model
├── benchmark_vap.py                # Voice Activity Projection model
├── benchmark_livekit_eot.py        # LiveKit End-of-Turn model
├── benchmark_pipecat.py            # Pipecat Smart Turn v3.1
├── run_benchmarks.py               # General benchmark orchestrator
├── run_portuguese_benchmark.py     # Portuguese-specific benchmark
│
├── # Dataset Preparation
├── setup_dataset.py                # General dataset download
├── setup_portuguese_dataset.py     # Portuguese synthetic dataset
├── setup_nurc_dataset.py           # NURC-SP real speech dataset
├── generate_tts_dataset.py         # Edge TTS Portuguese dialogues
│
├── # Fine-tuning
├── prepare_training_data.py        # Extract training samples from NURC-SP
├── finetune_smart_turn.py          # Fine-tune Smart Turn on Portuguese
│
├── # Deployment & Reporting
├── deploy_vast.py                  # Vast.ai deployment automation
├── generate_report.py              # Report & figure generation
│
├── data/                           # Audio files & annotations (gitignored)
│   ├── annotations/                # JSON ground truth files
│   ├── nurc_sp/                    # NURC-SP real speech
│   ├── portuguese/                 # Synthetic Portuguese audio
│   ├── portuguese_tts/             # Edge TTS Portuguese audio
│   └── smart_turn_pt_training/     # Fine-tuning training samples
│
├── checkpoints/                    # Trained models (gitignored)
│   └── smart_turn_pt/
│       ├── best_model.pt           # PyTorch checkpoint
│       └── smart_turn_pt.onnx      # ONNX model (30.6 MB)
│
├── results/                        # Benchmark result JSONs
└── report/                         # Generated reports
    ├── benchmark_report.md
    ├── benchmark_report.tex        # IEEE format for thesis
    └── figures/                    # PNG charts

Datasets Used

| Dataset | Type | Size | Language | Source |
|---|---|---|---|---|
| Portuguese Synthetic | Generated audio | 1.4h, 100 convs | pt-BR | Local generation |
| Portuguese TTS | Edge TTS speech | 6.4min, 10 convs | pt-BR | Microsoft Edge TTS |
| NURC-SP Corpus Minimo | Real dialogues (1970s-90s) | 19h, 21 recordings | pt-BR | HuggingFace |
| CORAA NURC-SP | Real dialogues | 239h | pt-BR | HuggingFace |

References

  1. Inoue, K., Jiang, B., Ekstedt, E., Kawahara, T., & Skantze, G. (2024). Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection. arXiv:2401.04868.
  2. Ekstedt, E. & Skantze, G. (2022). Voice Activity Projection: Self-supervised Learning of Turn-taking Events. INTERSPEECH 2022.
  3. Inoue, K., Jiang, B., Ekstedt, E., Kawahara, T., & Skantze, G. (2024). Multilingual Turn-taking Prediction Using Voice Activity Projection. LREC-COLING 2024.
  4. Daily. (2025). Smart Turn: Real-time End-of-Turn Detection. GitHub. https://github.com/pipecat-ai/smart-turn
  5. Daily. (2025). Announcing Smart Turn v3, with CPU inference in just 12ms. https://www.daily.co/blog/announcing-smart-turn-v3-with-cpu-inference-in-just-12ms/
  6. Daily. (2025). Improved accuracy in Smart Turn v3.1. https://www.daily.co/blog/improved-accuracy-in-smart-turn-v3-1/
  7. Daily. (2026). Smart Turn v3.2: Handling noisy environments and short responses. https://www.daily.co/blog/smart-turn-v3-2-handling-noisy-environments-and-short-responses/
  8. LiveKit. (2025). Improved End-of-Turn Model Cuts Voice AI Interruptions 39%. https://blog.livekit.io/improved-end-of-turn-model-cuts-voice-ai-interruptions-39/
  9. Silero Team. (2021). Silero VAD: pre-trained enterprise-grade Voice Activity Detector. https://github.com/snakers4/silero-vad
  10. Skantze, G. (2021). Turn-taking in Conversational Systems and Human-Robot Interaction: A Review. Computer Speech & Language, 67, 101178.
  11. Sacks, H., Schegloff, E.A., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language, 50(4), 696-735.
  12. Raux, A. & Eskenazi, M. (2009). A Finite-State Turn-Taking Model for Spoken Dialog Systems. NAACL-HLT.
  13. Krisp. (2024). Audio-only 6M weights Turn-Taking model for Voice AI Agents. https://krisp.ai/blog/turn-taking-for-voice-ai/
  14. Castilho, A.T. (2019). NURC-SP Audio Corpus. 239h of transcribed Brazilian Portuguese dialogues.
  15. Godfrey, J.J., et al. (1992). SWITCHBOARD: Telephone speech corpus for research and development. ICASSP-92.

License

MIT
