Two open-source contenders are emerging to challenge OpenAI’s GPT-4V, a cutting-edge multimodal AI model. LLaVA-1.5, developed by a collaboration of academic and industry researchers, shows impressive “zero-shot” object detection and meme understanding but struggles with complex images and text recognition. Adept’s Fuyu-8B, which isn’t positioned as a direct competitor to LLaVA-1.5, focuses on understanding unstructured data such as software interfaces and diagrams. These alternatives offer capable options but also raise concerns about responsible use and safety mechanisms, adding complexity to the evolving field of multimodal AI.
OpenAI’s GPT-4V has garnered significant attention as the latest breakthrough in AI, boasting a “multimodal” capability to understand both text and images. While this innovation offers immense potential, it also raises concerns that open-source projects are now addressing. Here’s an overview of these emerging alternatives.
Multimodal models possess unique capabilities that set them apart from strictly text- or image-based models. For instance, GPT-4V can walk a user through tasks that are easier to demonstrate visually, such as repairing a bicycle. These models not only identify the contents of an image but can reason about its context, going beyond the obvious, for example by suggesting recipes based on ingredients seen in a refrigerator.
However, the rise of multimodal models also presents new challenges. OpenAI initially delayed the release of GPT-4V due to concerns about its potential misuse, such as identifying individuals in images without their consent.
Even now, GPT-4V, accessible only to OpenAI’s ChatGPT Plus subscribers, exhibits concerning flaws, including difficulty recognizing hate symbols and displaying biases against certain genders, demographics, and body types, as acknowledged by OpenAI.
Nevertheless, despite these risks, companies and groups of independent developers are forging ahead with open-source alternatives that may not match GPT-4V’s capabilities but can perform many similar tasks.
One recent entrant is LLaVA-1.5, developed by a team comprising researchers from the University of Wisconsin-Madison, Microsoft Research, and Columbia University. Like GPT-4V, LLaVA-1.5 can answer questions about images when given a prompt, and, unlike GPT-4V, it can run even on consumer-level hardware.
Another notable option comes from Adept, a startup that has open-sourced a GPT-4V-like multimodal model with a unique twist. Adept’s model understands “knowledge worker” data, such as charts, graphs, and screens, enabling it to manipulate and reason over this type of information.
LLaVA-1.5 is an improved version of the original LLaVA, combining a “visual encoder” with Vicuna, an open-source chatbot based on Meta’s Llama model. To enhance its capabilities, the LLaVA-1.5 team scaled up image resolution and incorporated data from ShareGPT.
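The architecture described above is essentially three pieces glued together: a visual encoder that turns an image into patch features, a small projector that maps those features into the language model’s embedding space, and Vicuna itself, which then processes image and text tokens as one sequence. Here is a minimal numpy sketch of that wiring; the dimensions and random weights are purely illustrative stand-ins, not the released checkpoints.

```python
import numpy as np

# Toy stand-ins: a CLIP-style vision encoder emits 1024-dim patch features,
# and the 13B Vicuna backbone uses a 5120-dim embedding space.
rng = np.random.default_rng(0)
num_patches, vision_dim, llm_dim = 576, 1024, 5120

# 1. The "visual encoder" turns an image into a grid of patch features.
patch_features = rng.standard_normal((num_patches, vision_dim))

# 2. A small two-layer MLP projector bridges the vision and language models
#    (ReLU here for brevity; the real projector uses GELU).
w1 = rng.standard_normal((vision_dim, llm_dim)) * 0.01
w2 = rng.standard_normal((llm_dim, llm_dim)) * 0.01
image_tokens = np.maximum(patch_features @ w1, 0.0) @ w2

# 3. Projected image tokens are prepended to the text token embeddings,
#    and the language model processes the combined sequence as usual.
text_tokens = rng.standard_normal((12, llm_dim))  # e.g. "What is in this image?"
llm_input = np.concatenate([image_tokens, text_tokens], axis=0)

print(llm_input.shape)  # 576 image tokens + 12 text tokens, each 5120-dim
```

The key design choice is that the projector is the only new trained bridge between two largely off-the-shelf components, which is part of why LLaVA-1.5 is so cheap to train relative to a model built from scratch.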
While the larger LLaVA-1.5 model, with 13 billion parameters, requires substantial computing resources to train, its training cost is still a small fraction of the tens of millions of dollars OpenAI spent on training GPT-4. In tests conducted by software engineers at Roboflow, LLaVA-1.5 demonstrated impressive “zero-shot” object detection and meme understanding abilities. However, it struggled with complex images and text recognition, an area where GPT-4V excels.
Adept’s offering, Fuyu-8B, takes a different approach. It’s not intended to compete directly with LLaVA-1.5 but serves as a way for Adept to showcase its in-house developments and gather feedback from the developer community. Fuyu-8B’s strength lies in its ability to understand unstructured data, making it adept at tasks like extracting details from software user interfaces and answering questions about charts and diagrams.
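Part of what makes Fuyu-8B different is its simplified architecture: instead of a separate image encoder, raw pixel patches are linearly projected straight into the decoder-only transformer, which lets it handle arbitrary resolutions, useful for screenshots and diagrams. The numpy sketch below illustrates that idea with made-up sizes; the specific dimensions and weights are illustrative assumptions, not the model’s.

```python
import numpy as np

# Toy illustration of an encoder-free multimodal design: raw pixel patches
# go through a single linear projection directly into the decoder's
# embedding space. Patch and embedding sizes here are invented.
rng = np.random.default_rng(0)
patch, d_model = 30, 4096

# Any resolution works as long as it divides evenly into patches,
# which suits variably sized screenshots and charts.
image = rng.random((300, 450, 3))

# 1. Cut the image into raw pixel patches, row by row.
rows, cols = image.shape[0] // patch, image.shape[1] // patch
patches = (image.reshape(rows, patch, cols, patch, 3)
                .transpose(0, 2, 1, 3, 4)
                .reshape(rows * cols, patch * patch * 3))

# 2. One linear layer maps each raw patch into the transformer's embedding
#    space; the decoder then treats patches like ordinary text tokens.
w = rng.standard_normal((patch * patch * 3, d_model)) * 0.01
image_tokens = patches @ w

print(image_tokens.shape)  # 10 rows x 15 columns of patches -> 150 tokens
```

Skipping the dedicated image encoder removes a whole training stage and avoids resizing images to a fixed resolution, a sensible trade-off for interface screenshots where small text would otherwise be destroyed by downscaling.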
While Fuyu-8B has potential, its small size and lack of built-in safety mechanisms raise concerns about potential misuse. Adept acknowledges the need for case-specific safety measures, emphasizing that developers should ensure responsible use.
Overall, open-source projects are pushing the boundaries of multimodal AI, offering alternatives to OpenAI’s GPT-4V. While these models show promise, they also raise important questions about ethics, safety, and responsible development in the field of AI.