Hello, thank you for sharing the GSMA Open-Telco LLM Benchmarks 2.0. It is a very valuable initiative for the telecom community.
I have a question specifically about the TeleMath evaluation.
In our experiments, when running models locally, we observed noticeably lower performance than the scores reported in the benchmark.
For reference, our setup was as follows:
Given this discrepancy, I was wondering how TeleMath was evaluated in the benchmark:
Understanding the exact evaluation protocol would be extremely helpful for reproducing the results and aligning local evaluations with the benchmark.
Thank you very much in advance for your time and for making this work public! 😀