Hello-Chat

Towards Realistic Social Audio Interactions


Hello-Chat model architecture.

Hello-Chat

Hello-Chat is an end-to-end Large Audio Language Model (LALM) tailored for real-world conversational scenarios. The model achieves state-of-the-art performance on specific understanding benchmarks and significantly outperforms existing open-source systems in prosodic naturalness, emotional accuracy, and interaction fluency. By explicitly modeling fine-grained acoustic perception and cross-modal alignment, Hello-Chat enables realistic, context-aware spoken interaction between users and AI.

📊 Evaluation Results

Evaluation of Audio to Text

Audio Understanding Evaluation

ASR: Automatic speech recognition is evaluated on a balanced subset of AIShell, WeNet, and LibriSpeech, with Chinese and English samples evenly represented.
NLP Question: Question-answering data are sourced from AlpacaEval, LLaMA Questions, and Web Questions. Text inputs are converted into speech using a high-quality TTS system, and model responses are scored by GPT-5.
Translation: Speech-to-text translation is evaluated across Chinese, English, Japanese, and Korean. The test data are synthetic multilingual text generated by Claude and converted to speech via TTS, and outputs are scored by GPT-5.
MMAU: Audio-based question answering is evaluated on a subset of the MMAU-Mini benchmark.

| Model | ASR ↓ | NLP Question ↑ | Translation ↑ | MMAU ↑ |
| --- | --- | --- | --- | --- |
| Gemini3-Preview | 4.06 | 8.85 | 8.87 | 0.75 |
| GPT-4o-Audio | 6.45 | 8.50 | 8.09 | 0.64 |
| Qwen3-Omni-32B | 3.51 | 8.66 | 8.07 | 0.74 |
| Step-Audio 2 Mini | 3.21 | 7.32 | 8.34 | 0.66 |
| MiDashengLM | 4.50 | 3.82 | 8.43 | 0.65 |
| Kimi-Audio | 3.36 | 7.41 | 8.26 | 0.59 |
| Qwen2.5-Omni-7B | 3.45 | 7.41 | 5.93 | 0.66 |
| Hello-Chat | 3.48 | 7.68 | 8.93 | 0.69 |
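
The ASR evaluation uses a balanced subset with Chinese and English evenly represented. The exact sampling procedure is not described here, so the following is only a minimal sketch of one way to draw such a subset. The sample records, the `lang` field, and the function name `balanced_subset` are hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict


def balanced_subset(samples, per_language, seed=0):
    """Draw the same number of samples from each language group."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_lang = defaultdict(list)
    for sample in samples:
        by_lang[sample["lang"]].append(sample)

    subset = []
    for lang in sorted(by_lang):  # sorted for a deterministic draw order
        pool = by_lang[lang]
        if len(pool) < per_language:
            raise ValueError(f"only {len(pool)} samples available for {lang!r}")
        subset.extend(rng.sample(pool, per_language))
    rng.shuffle(subset)
    return subset


# Hypothetical pool: the ids and language tags are made up for illustration.
pool = (
    [{"id": f"zh_{i}", "lang": "zh"} for i in range(50)]
    + [{"id": f"en_{i}", "lang": "en"} for i in range(80)]
)
subset = balanced_subset(pool, per_language=20)
counts = {lang: sum(s["lang"] == lang for s in subset) for lang in ("zh", "en")}
print(counts)  # {'zh': 20, 'en': 20}
```

Sampling per language rather than from the merged pool keeps the larger corpus from dominating the reported error rate.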

Performance of Paralinguistic Understanding

SER (speech emotion recognition): evaluated on randomly sampled subsets of the EmoBox dataset, covering both Chinese and English speech.
AED (audio event detection): evaluated on samples drawn from AudioSet and CochlScene.

| Model | SER ↑ | AED ↑ |
| --- | --- | --- |
| Gemini3-Preview | 0.791 | 0.861 |
| GPT-4o-Audio | 0.586 | 0.489 |
| Qwen3-Omni-32B | 0.856 | 0.644 |
| Step-Audio 2 Mini | 0.680 | 0.533 |
| MiDashengLM | 0.561 | 0.441 |
| Kimi-Audio | 0.625 | 0.392 |
| Qwen2.5-Omni-7B | 0.607 | 0.584 |
| Hello-Chat | 0.824 | 0.797 |

Instruction Following

Only Yes: To evaluate robustness in instruction following, we construct a stress test from audio inputs randomly sampled from the above benchmarks. Every input is paired with the same fixed prompt: “no matter the message in the audio, simply answer ‘yes’!”

| Model | Only-Yes Accuracy (%) ↑ |
| --- | --- |
| Gemini3-Preview | 88 |
| GPT-4o-Audio | 23 |
| Qwen3-Omni-32B | 100 |
| Step-Audio 2 Mini | 87 |
| MiDashengLM | 0 |
| Kimi-Audio | 22 |
| Qwen2.5-Omni-7B | 96 |
| Hello-Chat | 100 |
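
How a response is judged to be a correct "yes" is not specified for the Only-Yes test. The sketch below assumes a strict rule: a response counts only if it is the single word "yes" once case, surrounding whitespace, and surrounding punctuation are ignored. The function names and the example responses are hypothetical.

```python
import string

# Characters ignored at both ends of a response (an assumption of this sketch).
_STRIP_CHARS = string.whitespace + string.punctuation + "“”‘’"


def is_only_yes(response: str) -> bool:
    """Return True if the response is just "yes" after trimming and lowercasing."""
    return response.strip(_STRIP_CHARS).lower() == "yes"


def only_yes_accuracy(responses) -> float:
    """Percentage of responses that follow the fixed 'answer yes' instruction."""
    if not responses:
        raise ValueError("no responses to score")
    hits = sum(is_only_yes(r) for r in responses)
    return 100.0 * hits / len(responses)


# Made-up model outputs for illustration only.
responses = [
    "Yes",
    "yes!",
    "“Yes.”",
    "Yes, the speaker is asking about the weather.",  # adds content: counted as a miss
    "The audio says hello.",
]
print(only_yes_accuracy(responses))  # 60.0
```

A looser rule, such as accepting any response that starts with "yes", would give higher scores, so the matching rule should be reported alongside the number.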

Evaluation of Text to Speech

Seed-TTS-Eval: We evaluate on the Chinese subset of the Seed-TTS-Eval benchmark, following the official Seed-TTS-Eval protocol.
Conversational-style Mean Opinion Score (CMOS): We invited native speakers to take part in a blind test. Each evaluator assigned scores on a 5-point scale (1–5), where a higher score indicates a more authentic, human-like conversational flow and better alignment with the dialogue intent.

| Model | CMOS ↑ | CER (%) ↓ | SS ↑ |
| --- | --- | --- | --- |
| F5-TTS | 3.48 | 1.56 | 0.741 |
| CosyVoice 2 | 3.66 | 1.45 | |
| CosyVoice 3-0.5B | 3.59 | 1.16 | 0.780 |
| Qwen2.5-Omni-7B | - | 1.70 | 0.752 |
| Qwen3-TTS-12Hz-0.6B-Base | 4.12 | 0.92 | 0.763 |
| FireRedTTS-2 | 3.68 | 1.14 | 0.736 |
| IndexTTS2 | 4.16 | 1.008 | 0.764 |
| Hello-Chat | 4.19 | 1.023 | 0.748 |
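
For readers unfamiliar with the two objective columns, here is a minimal sketch of how such metrics are commonly computed. It assumes CER is the character-level edit distance between the target text and an ASR transcript of the synthesized audio, divided by the target length. It also assumes SS denotes speaker similarity, measured as the cosine similarity between speaker embeddings of the reference and generated audio. The ASR model and speaker encoder are outside this sketch, and the toy strings and vectors are made up.

```python
import math


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_percent(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy inputs: one substituted character out of six, and two 2-D "embeddings".
print(cer_percent("今天天气很好", "今天天汽很好"))
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

In practice, punctuation is usually stripped from both strings before computing CER, and scores are averaged over the whole test set.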

🎧 Demos

Single-Sentence Demo (zero-shot)

Speaker1

reference:

generated:

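“那肯定因为自个儿平时想吃点卤味儿。那肯定得得得来一点儿。” ("Of course, because I usually fancy some braised snacks myself. So I've definitely got, got, got to have a little.")

“过年应该应该跟家里人一起吃饭。” ("At New Year you should, should eat with your family.")

“哎呀,不是了,现在法治社会哪有卖假货的,只是卖的价格贵。” ("Oh, no, it's not that. We live under the rule of law now, so who sells fakes? The prices are just high.")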


Speaker2

reference:

generated:

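“但是这个时候上哪去找呢?找不到。” ("But where would you go looking at a time like this? You can't find one.")

“这种做法我感觉不适合,不是他那个年龄段该做出来的事情。” ("I feel this way of doing things isn't appropriate; it's not something someone his age should be doing.")

“咱们得趁这个时机啊,看看还要剩多多久啊。” ("We've got to seize this moment and see how, how much time is left.")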


Speaker3

reference:

generated:

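“我我不不怎么玩游戏,你你会玩游戏啊。” ("I, I don't, don't really play games. You, you play games, huh.")

“对呀,就是不管你愿不愿意,时间都是一直往前推嘛。” ("Right, whether you like it or not, time just keeps moving forward.")

“挺好,我看着我看你做菜做饭蛮有生活的那是鸡蛋糕吗?” ("Pretty nice. Watching, I mean, seeing you cook feels very down-to-earth. Is that egg cake?")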


Speaker4

reference:

generated:

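“我也有二十多岁的时候,那个时候什么都不想,嗯,等那一点点沉淀,年龄大一点了,然后就什么都在乎,什么都想。” ("I was in my twenties once too. Back then I didn't think about anything. Um, then after things settled a little and I got a bit older, I started caring about everything and thinking about everything.")

“我看我一会儿,我我煮个泡面得了。” ("I think in a bit I'll, I'll just cook some instant noodles.")

“他们说那个茶茶饼就是渣子压出来的,是吗?” ("They say those tea, tea cakes are just pressed from the dregs, right?")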


Multi-Turn Conversation Demo (zero-shot)

Conversation #1


Conversation #2


Conversation #3

📜 Citation

If you find our work useful in your research, please consider citing:

@article{hellogroup2026hellochat,
  title={Hello-Chat: Towards Realistic Social Audio Interactions},
  author={Computational Intelligence Dept, HelloGroup Inc.},
  year={2026}
}