arxiv:2601.01554

MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization

Published on Jan 4 · Submitted by Zhaoye Fei on Jan 7
#3 Paper of the day
Authors: Zhe Xu, et al.

Abstract

A unified multimodal large language model for end-to-end speaker-attributed, time-stamped transcription with extended context window and strong generalization across benchmarks.

AI-generated summary

Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to determine precisely who speaks and when, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and an inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that performs SATS in a single end-to-end pass. Trained on extensive real in-the-wild data and equipped with a 128k context window for inputs of up to 90 minutes, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.

Community


MOSS Transcribe Diarize πŸŽ™οΈ

We introduce MOSS Transcribe Diarize β€” a unified multimodal model for Speaker-Attributed, Time-Stamped Transcription (SATS).

πŸ” End-to-end SATS in a single pass (transcription + speaker attribution + timestamps)
🧠 128k context window for up to ~90-minute audio without chunking (strong long-range speaker memory)
🌍 Trained on extensive in-the-wild conversations + controllable simulated mixtures (robust to overlap/noise/domain shift)
πŸ“Š Strong results on AISHELL-4 / Podcast / Movies benchmarks (best cpCER / Ξ”cp among evaluated systems)
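For readers unfamiliar with the cpCER metric cited above (concatenated minimum-permutation character error rate), here is a minimal sketch of the idea: concatenate each speaker's text, try every mapping of hypothesis speakers onto reference speakers, and keep the lowest total error rate. This is an illustration only, not the authors' evaluation code, and padding mismatched speaker counts with empty streams is a simplifying assumption:

```python
from itertools import permutations

def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (two-row dynamic program)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cpcer(ref_by_spk: dict, hyp_by_spk: dict) -> float:
    """cpCER: for every permutation of hypothesis speaker streams,
    sum the per-speaker edit distances and divide the best total by
    the number of reference characters."""
    refs = list(ref_by_spk.values())
    hyps = list(hyp_by_spk.values())
    # Simplifying assumption: pad with empty streams so both sides
    # have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))
    hyps += [""] * (n - len(hyps))
    total_ref_chars = sum(len(r) for r in refs) or 1
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / total_ref_chars
```

Because the metric minimizes over speaker permutations, a hypothesis that swaps speaker labels but transcribes the words correctly still scores 0, which is why cpCER is paired with Δcp (the penalty attributable to speaker confusion) in the benchmark tables.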

Paper: [2601.01554] MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization
Homepage: https://mosi.cn/models/moss-transcribe-diarize
Online Demo: https://moss-transcribe-diarize-demo.mosi.cn


Some very interesting work. Very much hoping for at least an open-weight release 🀞.

P.S. In case you were not already aware, your website is currently down.

Very cool and useful work! Is the model going to be open source?


Thank you for your interest, and we plan to open-source it in the coming months.


