SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality Paper • 2306.14610 • Published Jun 26, 2023
facebook/dinov3-convnext-small-pretrain-lvd1689m Model • Image Feature Extraction • 49.5M params • Updated Aug 19, 2025
PixMo Collection • A set of vision-language datasets built by Ai2 and used to train the Molmo family of models. Read more at https://molmo.allenai.org/blog • 9 items • Updated Mar 2
LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation Paper • 2412.15188 • Published Dec 19, 2024
nanoVLM: The simplest repository to train your VLM in pure PyTorch Article • Published May 21, 2025
Gradient-Weight Alignment as a Train-Time Proxy for Generalization in Classification Tasks Paper • 2510.25480 • Published Oct 29, 2025
Equivariant Differentially Private Deep Learning: Why DP-SGD Needs Sparser Models Paper • 2301.13104 • Published Jan 30, 2023
MM-DINOv2: Adapting Foundation Models for Multi-Modal Medical Image Analysis Paper • 2509.06617 • Published Sep 8, 2025