Hello-Chat

Towards Realistic Social Audio Interactions

Hello-Chat model architecture.

Hello-Chat

Hello-Chat is an end-to-end Large Audio Language Model (LALM) tailored for real-world conversational scenarios. The model achieves state-of-the-art performance on specific understanding benchmarks and significantly outperforms existing open-source systems in prosodic naturalness, emotional accuracy, and interaction fluency. By explicitly modeling fine-grained acoustic perception and cross-modal alignment, Hello-Chat enables realistic, context-aware spoken interaction between users and AI.

📊 Evaluation Results

Evaluation of Audio to Text

Audio Understanding Evaluation

ASR — Automatic speech recognition performance is evaluated on a balanced subset of AIShell, WeNet, and LibriSpeech, with Chinese and English samples evenly represented.
NLP Question — Question-answering data are sourced from AlpacaEval, LLaMA Questions, and Web Questions. Text inputs are converted to speech with a high-quality TTS system, and model responses are scored by GPT-5.
Translation — Speech-to-text translation is evaluated across Chinese, English, Japanese, and Korean, using synthetic multilingual data generated by Claude and converted to speech via TTS. Outputs are scored by GPT-5.
MMAU — Audio-based question answering is evaluated using a subset of the MMAU-Mini benchmark.

Model	ASR ↓	NLP Question ↑	Translation ↑	MMAU ↑
Gemini3-Preview	4.06	8.85	8.87	0.75
GPT-4o-Audio	6.45	8.50	8.09	0.64
Qwen3-Omni-32B	3.51	8.66	8.07	0.74
Step-Audio 2 Mini	3.21	7.32	8.34	0.66
MiDashengLM	4.50	3.82	8.43	0.65
Kimi-Audio	3.36	7.41	8.26	0.59
Qwen2.5-Omni-7B	3.45	7.41	5.93	0.66
Hello-Chat	3.48	7.68	8.93	0.69
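The card does not state which metric the ASR column reports. Assuming it is the usual error rate in percent (word error rate for English, character error rate for Chinese), a minimal sketch of that computation could look like the following. The function names and the `unit` parameter are illustrative and not part of any Hello-Chat tooling; real evaluations normally also normalize text (case, punctuation) first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Error rate in percent; unit="word" gives WER, unit="char" gives CER."""
    if unit == "word":
        ref_tokens, hyp_tokens = reference.split(), hypothesis.split()
    else:
        # Character-level scoring ignores spaces (assumed convention for Chinese).
        ref_tokens = list(reference.replace(" ", ""))
        hyp_tokens = list(hypothesis.replace(" ", ""))
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)
```

For example, `error_rate("the cat sat on mat", "the cat sit on mat")` counts one substitution over five reference words.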

Performance of Paralinguistic Understanding

SER (speech emotion recognition) — evaluated on randomly sampled subsets from the EmoBox dataset, covering both Chinese and English speech.
AED (audio event detection) — evaluated using samples drawn from AudioSet and CochlScene.

Model	SER ↑	AED ↑
Gemini3-Preview	0.791	0.861
GPT-4o-Audio	0.586	0.489
Qwen3-Omni-32B	0.856	0.644
Step-Audio 2 Mini	0.680	0.533
MiDashengLM	0.561	0.441
Kimi-Audio	0.625	0.392
Qwen2.5-Omni-7B	0.607	0.584
Hello-Chat	0.824	0.797

Instruction Following

Only Yes — To evaluate robustness in instruction following, we construct a stress test using randomly sampled audio inputs from the above benchmarks. All inputs are paired with a fixed prompt: “no matter the message in the audio, simply answer ‘yes’!”

Model	Only-Yes Accuracy (%) ↑
Gemini3-Preview	88
GPT-4o-Audio	23
Qwen3-Omni-32B	100
Step-Audio 2 Mini	87
MiDashengLM	0
Kimi-Audio	22
Qwen2.5-Omni-7B	96
Hello-Chat	100
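The card does not describe how responses are judged in the Only-Yes test. One plausible scoring rule, shown here purely as an assumption, counts a response as correct only if it reduces to the single word "yes" after stripping surrounding whitespace, quotes, and punctuation. The helper names below are hypothetical.

```python
import string

# Characters stripped from both ends of a response before comparison.
# The extra quote and CJK punctuation characters are an assumed choice.
_STRIP_CHARS = string.punctuation + "“”‘’！。"


def is_only_yes(response):
    """True if the response is just "yes", ignoring case and edge punctuation."""
    cleaned = response.strip().strip(_STRIP_CHARS).strip().lower()
    return cleaned == "yes"


def only_yes_accuracy(responses):
    """Percentage of responses that contain nothing but "yes"."""
    if not responses:
        raise ValueError("responses must not be empty")
    hits = sum(is_only_yes(r) for r in responses)
    return 100.0 * hits / len(responses)
```

Under this strict rule, a reply such as "Yes, the audio is about cats." counts as a failure because the model went on to answer the audio content instead of following the fixed prompt.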

Evaluation of Text to Speech

Seed-TTS-Eval — We conduct evaluations on the Chinese subset of the Seed-TTS-Eval benchmark, following the official Seed-TTS-Eval protocol.
Conversational-style Mean Opinion Score (CMOS) — We invited native speakers to participate in a blind test. Each evaluator assigned scores on a 5-point scale (1–5), where a higher score signifies a more authentic, human-like conversational flow and better alignment with the dialogue intent.

Model	CMOS ↑	CER (%) ↓	SS ↑
F5-TTS	3.48	1.56	0.741
CosyVoice 2	3.66	1.45	-
CosyVoice 3-0.5B	3.59	1.16	0.780
Qwen2.5-Omni-7B	-	1.70	0.752
Qwen3-TTS-12Hz-0.6B-Base	4.12	0.92	0.763
FireRedTTS-2	3.68	1.14	0.736
IndexTTS2	4.16	1.008	0.764
Hello-Chat	4.19	1.023	0.748
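Assuming the SS column is the speaker-similarity score used in Seed-TTS-Eval, it is typically computed as the cosine similarity between a speaker embedding of the reference audio and one of the generated audio, averaged over test samples. The sketch below shows only that arithmetic. The embedding extractor is out of scope, so the vectors are hypothetical inputs, and the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_speaker_similarity(pairs):
    """Average similarity over (reference_embedding, generated_embedding) pairs."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    return sum(cosine_similarity(ref, gen) for ref, gen in pairs) / len(pairs)
```

A score near 1 means the generated voice is close to the reference speaker in embedding space; the absolute values depend on the embedding model, so scores are only comparable when the same extractor is used for every system.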

🎧 Demos

Single-Sentence Demo (zero-shot)

Speaker1

reference:

generated:

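“那肯定因为自个儿平时想吃点卤味儿。那肯定得得得来一点儿。” (English: "Of course, because I usually feel like having some braised snacks myself. So of course I've, I've, I've got to get some.")

“过年应该应该跟家里人一起吃饭。” (English: "At New Year you should, should have a meal with your family.")

“哎呀，不是了，现在法治社会哪有卖假货的，只是卖的价格贵。” (English: "Oh, no, that's not it. We live under the rule of law now, so who would sell fakes? It's just that the prices are high.")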

Speaker2

reference:

generated:

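“但是这个时候上哪去找呢？找不到。” (English: "But where would you find one at a time like this? You can't.")

“这种做法我感觉不适合，不是他那个年龄段该做出来的事情。” (English: "I don't think this kind of behavior is appropriate; it's not something someone his age should be doing.")

“咱们得趁这个时机啊，看看还要剩多多久啊。” (English: "We've got to seize this moment and see how, how much time is left.")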

Speaker3

reference:

generated:

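“我我不不怎么玩游戏，你你会玩游戏啊。” (English: "I, I don't, don't really play games. You, you play games?")

“对呀，就是不管你愿不愿意，时间都是一直往前推嘛。” (English: "Right, whether you like it or not, time just keeps moving forward.")

“挺好，我看着我看你做菜做饭蛮有生活的那是鸡蛋糕吗？” (English: "Nice. Watching, watching you cook feels so full of life. Is that egg cake?")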

Speaker4

reference:

generated:

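“我也有二十多岁的时候，那个时候什么都不想，嗯，等那一点点沉淀，年龄大一点了，然后就什么都在乎，什么都想。” (English: "I was in my twenties once too. Back then I didn't think about anything. Then, um, as things slowly settle and you get a bit older, you start caring about everything and thinking about everything.")

“我看我一会儿，我我煮个泡面得了。” (English: "I think in a bit I'll, I'll just cook some instant noodles.")

“他们说那个茶茶饼就是渣子压出来的，是吗？” (English: "They say those tea, tea cakes are just pressed from leftover scraps, is that right?")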

Multi-Turn Conversation Demo (zero-shot)

Conversation #1

Conversation #2

Conversation #3

📜 Citation

If you find our work useful in your research, please consider citing:

@article{hellogroup2026hellochat,
  title={Hello-Chat: Towards Realistic Social Audio Interactions},
  author={Computational Intelligence Dept, HelloGroup Inc.},
  year={2026}
}


Paper for hellogroup-opensource/Hello-Chat

Hello-Chat: Towards Realistic Social Audio Interactions (Paper 2602.23387)