Large Language and Vision Assistant (LLaVA), developed by researchers at Microsoft, the University of Wisconsin-Madison, and Columbia University, has been made publicly available.
LLaVA is based on a CLIP image encoder and a LLaMA language decoder, and was fine-tuned on a synthetic instruction-following dataset, achieving state-of-the-art accuracy on the ScienceQA benchmark.
The instruction-following dataset, which consists of simulated conversations between a human user and an AI assistant about the content of images, was created by the researchers using GPT-4. The LLaVA model, which is made up of two foundation models—CLIP for vision and LLaMA for language—plus a network layer connecting the two, was fine-tuned on this dataset. To evaluate response quality, the researchers also asked GPT-4 to rate LLaVA's outputs on a scale of 1 to 10. LLaVA set a new state-of-the-art record by achieving an accuracy of 92.53% on the ScienceQA benchmark. According to the researchers:
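The GPT-4-based rating scheme can be sketched as follows. This is an illustrative assumption, not the authors' actual evaluation code: the prompt wording and both helper functions (`build_judge_prompt`, `parse_score`) are hypothetical.

```python
import re

# Hypothetical sketch of using a text-only judge model to rate a
# candidate answer on a 1-10 scale, as described in the article.

def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Format a prompt asking the judge model for a 1-10 rating."""
    return (
        "You are a helpful assistant that evaluates answer quality.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Rate the candidate answer on a scale of 1 to 10.\n"
        "Reply in the form 'Rating: <number>' followed by a short justification."
    )

def parse_score(judge_reply: str) -> int:
    """Extract the 1-10 rating from the judge model's reply."""
    match = re.search(r"Rating:\s*(\d+)", judge_reply)
    if not match:
        raise ValueError("no rating found in judge reply")
    score = int(match.group(1))
    if not 1 <= score <= 10:
        raise ValueError(f"rating {score} is outside the 1-10 scale")
    return score
```

In practice the formatted prompt would be sent to the GPT-4 API and the reply fed to `parse_score`; averaging the parsed scores over a question set yields the kind of relative quality measure the researchers report.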
“This paper demonstrates the effectiveness of visual instruction tuning using language-only GPT-4. We have presented an automatic pipeline to create language-image instruction-following data, based on which we train LLaVA, a multimodal model to follow human intent to complete visual tasks. It achieves [an] excellent visual chat experience when fine-tuned on multimodal chat data.”
As demonstrated by ChatGPT, fine-tuning large language models (LLMs) on instruction-following datasets improves their performance, which has led researchers to apply the technique to smaller LLMs. Adding the ability to process image data has been the next stage in the development of AI assistants, as shown by the introduction of GPT-4 and Visual ChatGPT.
The LLaVA team set out to train a model end-to-end via visual instruction tuning. The researchers began with images from the COCO dataset. Because these images are annotated with captions and object bounding boxes, the team could feed that information into the text-only GPT-4, along with prompts asking it to generate instruction-following data of three types: simulated conversations between a person and an assistant, questions about specific details of the image content, and questions requiring reasoning about the image content. The generated dataset contains 158K samples in total.
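The data-generation step described above hinges on serializing the image annotations as text so a text-only model can "see" the image. The sketch below is an assumption about how that serialization might look; the helper names and the exact prompt wording are hypothetical and may differ from the authors' pipeline.

```python
# Illustrative sketch: render COCO-style captions and bounding boxes as a
# plain-text context, then request one of the three instruction-data types.

def annotations_to_context(captions, boxes):
    """Serialize captions and (label, x1, y1, x2, y2) boxes as plain text."""
    lines = ["Captions:"]
    lines += [f"- {c}" for c in captions]
    lines.append("Objects (label, normalized box coordinates):")
    lines += [
        f"- {label}: [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]"
        for label, x1, y1, x2, y2 in boxes
    ]
    return "\n".join(lines)

def build_generation_prompt(context, kind="conversation"):
    """Ask the text-only model for one of the three instruction-data types."""
    tasks = {
        "conversation": "Write a multi-turn conversation between a user and "
                        "an assistant about this image.",
        "detail": "Ask and answer questions about specific details of the image.",
        "reasoning": "Ask and answer a question that requires reasoning about "
                     "the image content.",
    }
    return f"{context}\n\nTask: {tasks[kind]}"
```

Sending `build_generation_prompt(context, kind)` to GPT-4 for each image and task type, then pairing the replies with the original images, would yield image-instruction pairs of the kind that make up the 158K-sample dataset.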
In the LLaVA architecture, a trainable projection matrix maps the visual features produced by the CLIP foundation model into the same word embedding space as the language tokens. The LLaMA decoder then generates output conditioned on the combined image and text tokens. Training proceeds in two stages: a pre-training stage first trains only the projection matrix, after which a fine-tuning stage updates both the projection layer and the LLaMA decoder weights, while the CLIP weights remain frozen throughout.
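A minimal NumPy sketch of this design, with illustrative dimensions standing in for the real networks (the actual sizes and parameterization are assumptions, not taken from the paper):

```python
import numpy as np

# Sketch of the LLaVA-style projection: frozen CLIP features are mapped
# into the word embedding space and concatenated with text tokens.
d_clip, d_embed = 1024, 4096   # CLIP feature size, LLaMA embedding size (illustrative)
n_patches, n_text = 256, 32    # image patch tokens, text tokens (illustrative)

rng = np.random.default_rng(0)
clip_features = rng.normal(size=(n_patches, d_clip))   # frozen CLIP encoder output
text_embeddings = rng.normal(size=(n_text, d_embed))   # LLaMA token embeddings

# The trainable projection matrix W maps visual features into the word
# embedding space so they can be consumed as ordinary tokens.
W = rng.normal(size=(d_clip, d_embed)) * 0.01
image_tokens = clip_features @ W                       # (n_patches, d_embed)

# The decoder sees image tokens followed by text tokens as one sequence.
decoder_input = np.concatenate([image_tokens, text_embeddings], axis=0)

# Stage 1 (pre-training): only W receives gradient updates.
# Stage 2 (fine-tuning): W and the LLaMA decoder are updated; CLIP stays frozen.
trainable_stage1 = {"W"}
trainable_stage2 = {"W", "llama_decoder"}
```

Keeping CLIP frozen in both stages means the projection matrix carries the burden of aligning vision features with the language model's embedding space, which is why it gets a dedicated pre-training stage before the decoder is unfrozen.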
Both an interactive demo and the LLaVA source code are available on the project website. Hugging Face hosts the LLaVA training data and model weights. The model, which is distributed as delta weights to be applied on top of LLaMA, "should not be used outside of research purposes."