Abstract
ModelLens is a unified framework for model recommendation in the wild: it learns from public leaderboard data to rank unseen models on unseen datasets without costly per-candidate evaluations.
The open-source model ecosystem now contains hundreds of thousands of pretrained models, yet picking the best model for a new dataset is increasingly infeasible: new models and unbenchmarked datasets emerge continuously, leaving practitioners with no prior records on either side. Existing approaches handle only fragments of this in-the-wild setting: AutoML and transferability estimation select models from small predefined pools or require expensive per-model forward passes on the target dataset, while model routing presupposes a given candidate pool. We introduce ModelLens, a unified framework for model recommendation in the wild. Our key insight is that public leaderboard interactions, though scattered and noisy, collectively trace out an implicit atlas of model capabilities across heterogeneous evaluation settings, a signal rich enough to learn from directly. By learning a performance-aware latent space over model–dataset–metric tuples, ModelLens ranks unseen models on unseen datasets without running any candidate on the target dataset. On a new benchmark of 1.62M evaluation records spanning 47K models and 9.6K datasets, ModelLens surpasses baselines that either rely on metadata alone or require running each candidate on the target dataset. Its recommended Top-K pools also improve multiple representative routing methods by up to 81% across diverse QA benchmarks. Case studies on recently released benchmarks confirm generalization to both text and vision-language tasks.
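To make the core idea concrete, here is a minimal sketch of a performance-aware latent space over model–dataset–metric tuples. Everything below is an illustrative assumption rather than the authors' architecture: model cards and dataset descriptions are assumed to be pre-encoded into fixed text features (e.g., by a sentence encoder), metrics form a small categorical vocabulary, and the latent dimensions are arbitrary. Because the inputs are metadata features rather than per-model forward passes, unseen models and unseen datasets can be scored without running anything on the target data.

```python
# Minimal sketch (assumptions, not the paper's architecture): project
# text features of model cards and dataset descriptions into a shared
# latent space, condition on a metric embedding, and predict a score.
import torch
import torch.nn as nn

class LatentScorer(nn.Module):
    def __init__(self, text_dim: int = 384, latent_dim: int = 128, n_metrics: int = 32):
        super().__init__()
        self.model_proj = nn.Linear(text_dim, latent_dim)      # model-card features -> latent
        self.data_proj = nn.Linear(text_dim, latent_dim)       # dataset-description features -> latent
        self.metric_emb = nn.Embedding(n_metrics, latent_dim)  # metric id -> latent
        self.head = nn.Sequential(                             # joint scorer over the tuple
            nn.Linear(3 * latent_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, model_feat, data_feat, metric_id):
        m = self.model_proj(model_feat)                # (B, latent_dim)
        d = self.data_proj(data_feat)                  # (B, latent_dim)
        g = self.metric_emb(metric_id)                 # (B, latent_dim)
        return self.head(torch.cat([m, d, g], -1)).squeeze(-1)  # (B,) predicted scores

# Toy usage: rank 4 hypothetical candidate models on one target dataset;
# torch.randn stands in for real text embeddings.
scorer = LatentScorer()
model_feat = torch.randn(4, 384)                       # 4 candidate models
data_feat = torch.randn(1, 384).expand(4, -1)          # shared target dataset
metric_id = torch.zeros(4, dtype=torch.long)           # e.g. metric 0 = "accuracy"
ranking = scorer(model_feat, data_feat, metric_id).argsort(descending=True)
```

Training on leaderboard records could then be a plain regression against reported scores or a pairwise ranking loss; the sketch deliberately leaves that choice open.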
Community
With thousands of open-source models being released on HuggingFace, model selection is becoming a serious bottleneck.
Our work studies a new problem:
Can we recommend the best pretrained models for a new task without directly evaluating or finetuning every candidate?
We build a large-scale model–dataset interaction benchmark across thousands of models and tasks, and study task-level model recommendation as a ranking problem.
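For illustration, here is a minimal sketch of what such an interaction record might look like and how task-level recommendation can be scored as a ranking problem with NDCG@k. The field names and the evaluation protocol are assumptions for exposition, not the benchmark's actual schema.

```python
# Hypothetical record schema and a standard ranking metric; only the
# NDCG formula is standard, everything else is an illustrative guess.
from dataclasses import dataclass
import math

@dataclass
class EvalRecord:
    model_id: str    # e.g. a Hugging Face repo id
    dataset_id: str  # e.g. a benchmark/dataset name
    metric: str      # e.g. "accuracy", "rougeL"
    score: float     # the reported leaderboard number

def ndcg_at_k(recommended: list[str], relevance: dict[str, float], k: int = 10) -> float:
    """Rank quality of a recommended model list against held-out true scores."""
    gains = [relevance.get(m, 0.0) for m in recommended[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```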
Curious to hear the community's thoughts:
- How do you currently choose models for new tasks?
- Do leaderboard scores actually transfer across datasets?
- Could model recommendation become a foundation layer for routing, cascading, or multi-LLM systems?
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Discovering Novel LLM Experts via Task-Capability Coevolution (2026)
- Scalable Prompt Routing via Fine-Grained Latent Task Discovery (2026)
- LLM Router: Rethinking Routing with Prefill Activations (2026)
- HORIZON: A Benchmark for In-the-wild User Behaviour Modeling (2026)
- No Single Best Model for Diversity: Learning a Router for Sample Diversity (2026)
- Cost-Efficient Estimation of General Abilities Across Benchmarks (2026)
- An Embarrassingly Simple Graph Heuristic Reveals Shortcut-Solvable Benchmarks for Sequential Recommendation (2026)