Wonseok Choi PRO

wlabchoi

no-brand

AI & ML interests

None yet

Recent Activity

published a model 5 days ago

wlabchoi/qwen3-8b-telemath-expert

updated a model 5 days ago

wlabchoi/qwen3-4b-wireless-math-expert

updated a model 5 days ago

wlabchoi/qwen3-4b-telemath-expert

Organizations

None yet

liked a model 6 days ago

allenai/SERA-8B-GA

8B • Updated 6 days ago • 44 • 13

liked 2 datasets 6 days ago

zwhe99/FireAct

Viewer • Updated Oct 24, 2023 • 4.64k • 38 • 2

zai-org/AgentInstruct

Viewer • Updated Oct 23, 2023 • 1.87k • 806 • 231

upvoted an article 27 days ago

Article

GSMA Open-Telco LLM Benchmarks 2.0: The first dedicated LLM Evaluation for Telecoms

Oct 20, 2025 • 30

commented on GSMA Open-Telco LLM Benchmarks 2.0: The first dedicated LLM Evaluation for Telecoms 27 days ago

Hello, thank you for sharing the GSMA Open-Telco LLM Benchmarks 2.0. It is a very valuable initiative for the telecom community.

I had a question specifically regarding the TeleMath evaluation.
In our experiments running models locally, we observed noticeably lower performance than the scores reported in the benchmark.

For reference, our setup was as follows:

Hugging Face: models were loaded via transformers, and inference was performed using model.generate.
Ollama: inference was done using the chat API, with structured output enforced so that the model returns a numeric value.

Given this discrepancy, I was wondering how TeleMath was evaluated in the benchmark:

Was any specific prompt template, system instruction, or chain-of-thought style used?
Were answers evaluated using exact match, tolerance ranges, or any form of normalization?
Were there any post-processing or answer-extraction steps applied before scoring?
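
To make these three questions concrete, here is a minimal sketch of one possible local scoring step. It is not the benchmark's protocol, which is exactly what is being asked about. The helper names `extract_number` and `is_correct`, the "take the last number in the output" extraction rule, and the tolerance values are all hypothetical choices for illustration.

```python
import math
import re

# Hypothetical extraction rule: treat the last numeric token in the
# model output as the answer. A real protocol may use a different rule.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def extract_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = _NUMBER.findall(text)
    return float(matches[-1]) if matches else None


def is_correct(prediction, reference, rel_tol=0.0, abs_tol=0.0):
    """Score one answer.

    With both tolerances at 0 this is an exact numeric match;
    with non-zero tolerances it becomes a tolerance-range check.
    """
    if prediction is None:
        return False
    return math.isclose(prediction, reference, rel_tol=rel_tol, abs_tol=abs_tol)


# Example: the same output scored under two assumed protocols.
output = "The SNR is therefore approximately 12.04 dB."
prediction = extract_number(output)
exact = is_correct(prediction, 12.0)                   # exact match
tolerant = is_correct(prediction, 12.0, rel_tol=0.01)  # 1% relative tolerance
```

The same model output is counted wrong under exact match and right under a 1% tolerance, so a difference in the scoring rule alone could account for part of a gap between local and reported scores.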

Understanding the exact evaluation protocol would be extremely helpful for reproducing the results and aligning local evaluations with the benchmark.

Thank you very much in advance for your time and for making this work public! 😀

liked a dataset 27 days ago

open-web-math/open-web-math

Viewer • Updated Oct 17, 2023 • 6.32M • 9.23k • 328

liked a dataset about 1 month ago

XINLI1997/WirelessMATHBench-XL

Viewer • Updated Oct 9, 2025 • 4.03k • 111 • 3

liked a dataset about 2 months ago

tasksource/nlgraph

Viewer • Updated May 23, 2023 • 6.02k • 6 • 5

updated a model about 2 months ago

wlabchoi/qwen3-4b-telemath-full-sft

Text Generation • 4B • Updated Dec 14, 2025 • 1

published a model about 2 months ago

wlabchoi/qwen3-4b-wireless-math-expert

Updated 5 days ago

updated a model about 2 months ago

wlabchoi/qwen3-4b-telemath-dora

Updated Dec 14, 2025

published 2 models about 2 months ago

wlabchoi/qwen3-4b-telemath-full-sft

Text Generation • 4B • Updated Dec 14, 2025 • 1

wlabchoi/qwen3-4b-telemath-dora

Updated Dec 14, 2025

updated a model about 2 months ago

wlabchoi/qwen3-4b-telemath-lora64-mlp

Updated Dec 14, 2025

published a model about 2 months ago

wlabchoi/qwen3-4b-telemath-lora64-mlp

Updated Dec 14, 2025

updated a model about 2 months ago

wlabchoi/qwen3-4b-telemath-lora16-epoch15

Updated Dec 14, 2025

published a model about 2 months ago

wlabchoi/qwen3-4b-telemath-lora16-epoch15

Updated Dec 14, 2025