Timur's Engineering Blog

Fusing Text and Images

CLIP is one of the first large-scale models to fuse text and image representations effectively, demonstrating strong zero-shot generalization across vision tasks. Multimodal learning—combining different types of data—may be a crucial step toward more general forms of artificial intelligence, possibly even AGI.

Image from “Learning Transferable Visual Models From Natural Language Supervision” (Radford et al., 2021)

Image taken from:https://www.appypiedesign.ai/api/image-to-text/llava-1.5-api

Vision-Language Model: LLaVA

Read about the architecture of LLaVA and a case study of fine tuning LLaVA for a downstream task.

Subscribe for to hear more about my work

Engineering notes, implementations, new libraries and more….