Purpose: Models that understand text + image + audio together.
- llava-hf/llava-1.5-7b-hf
  Image-Text-to-Text • 7B • 2.53M downloads • 355 likes
- Salesforce/blip-image-captioning-base
  Image-to-Text • 2.18M downloads • 849 likes
- google/pix2struct-base
  Image-to-Text • 0.3B • 2.34k downloads • 79 likes
- microsoft/kosmos-2-patch14-224
  Image-to-Text • 160k downloads • 184 likes
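
As a rough illustration of how one of the listed models might be used, here is a minimal sketch. It assumes the Hugging Face `transformers` library is installed and that the installed version still provides the `image-to-text` pipeline task. The names `MODELS`, `models_for_task`, and `caption_image` are hypothetical helpers introduced only for this example, and the task labels are copied from the list above rather than fetched from the Hub.

```python
# Task labels copied from the list above (lowercased); not fetched from the Hub.
MODELS = {
    "llava-hf/llava-1.5-7b-hf": "image-text-to-text",
    "Salesforce/blip-image-captioning-base": "image-to-text",
    "google/pix2struct-base": "image-to-text",
    "microsoft/kosmos-2-patch14-224": "image-to-text",
}


def models_for_task(task: str) -> list[str]:
    """Return the listed model IDs whose task label matches `task`."""
    return [model_id for model_id, label in MODELS.items() if label == task]


def caption_image(image, model_id: str = "Salesforce/blip-image-captioning-base") -> str:
    """Caption one image (file path, URL, or PIL image) with an image-to-text pipeline."""
    # Imported lazily so the helpers above work without transformers installed.
    # Assumption: the installed transformers version supports this pipeline task.
    from transformers import pipeline

    captioner = pipeline("image-to-text", model=model_id)
    # The pipeline returns a list of dicts, one per generated caption.
    return captioner(image)[0]["generated_text"]


if __name__ == "__main__":
    import sys

    print(models_for_task("image-to-text"))
    # Only run the model when an image path or URL is passed on the command line.
    if len(sys.argv) > 1:
        print(caption_image(sys.argv[1]))
```

The first call to `caption_image` downloads the model weights, so it needs network access and some disk space. The smaller captioning model is used as the default here only to keep the sketch light; the 7B LLaVA model expects a text prompt alongside the image and would likely need a different pipeline task.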