Cheetor - A New Multi-Modal LLM Strategy Empowered by Controllable Knowledge Re-Injection

cross-posted from: https://lemmy.world/post/3439370

Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

A wild new GitHub Repo has appeared!

https://github.com/DCDmllm/Cheetah/tree/main

Today we cover Cheetah - an exciting new take on interleaving image and text context & instruction.

For higher-quality images, please visit the main project's repo to see the code and approach in all their glory.

I4 Benchmark

To facilitate research in interleaved vision-language instruction following, we build I4 (semantically Interconnected, Interleaved Image-Text Instruction-Following), an extensive large-scale benchmark of 31 tasks with diverse instructions in a unified instruction-response format, covering 20 diverse scenarios.

I4 has three important properties:

Interleaved vision-language context: all the instructions contain sequences of inter-related images and texts, such as storyboards with scripts, textbooks with diagrams.

Diverse forms of complex instructions: the instructions range from predicting dialogue for comics, to discovering differences between surveillance images, and to conversational embodied tasks.

Vast range of instruction-following scenarios: the benchmark covers multiple application scenarios, including cartoons, industrial images, and driving recordings.
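To make the idea of an interleaved vision-language context more concrete, here is a minimal Python sketch of one way such an instruction could be represented and flattened into a prompt plus an ordered image list. This is an illustration only: the `ImageRef` class, the `flatten` function, and the `<image>` placeholder are hypothetical names chosen for this example and are not taken from the I4 benchmark or the Cheetah codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class ImageRef:
    """Hypothetical wrapper marking a segment as an image file path."""
    path: str


# An interleaved instruction is modeled as an ordered mix of text and images.
Segment = Union[str, ImageRef]

# Hypothetical placeholder token; the real model's token may differ.
IMAGE_SLOT = "<image>"


def flatten(segments: List[Segment]) -> Tuple[str, List[str]]:
    """Return (prompt, image_paths), keeping images in their original order."""
    text_parts: List[str] = []
    image_paths: List[str] = []
    for seg in segments:
        if isinstance(seg, ImageRef):
            # Mark where the image sits in the text and remember its path.
            text_parts.append(IMAGE_SLOT)
            image_paths.append(seg.path)
        else:
            text_parts.append(seg)
    return " ".join(text_parts), image_paths
```

For example, a two-panel storyboard followed by a question would become one prompt string with two image slots and a list of two paths, so the position of each image relative to the text is preserved.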

Cheetor: a multi-modal large language model empowered by controllable knowledge re-injection

Cheetor is a Transformer-based multi-modal large language model empowered by controllable knowledge re-injection, which can effectively handle a wide variety of interleaved vision-language instructions.

Cases

Cheetor demonstrates strong abilities to perform reasoning over complicated interleaved vision-language instructions. The panel letters below refer to the case figure in the project repo. For instance, in (a), Cheetor keenly identifies the connections between the images and thereby infers the cause of the unusual phenomenon. In (b, c), Cheetor reasonably infers the relations among the images and understands the metaphorical implications they convey. In (e, f), Cheetor comprehends absurd objects through multi-modal conversations with humans.

Getting Started

1. Installation

Clone our repository and create a conda environment:
git clone https://github.com/DCDmllm/Cheetah.git
cd Cheetah/Cheetah
conda create -n cheetah python=3.8
conda activate cheetah
pip install -r requirement.txt

2. Prepare Vicuna and LLaMA2 weights

The current version of Cheetor supports Vicuna-7B and LLaMA2-7B as the language model. Please first follow the respective instructions (linked in the project repo) to prepare the Vicuna-v0 7B weights and the LLaMA-2-Chat 7B weights.

Then set llama_model in Cheetah/cheetah/configs/models/cheetah_vicuna.yaml to the folder that contains the Vicuna weights, and set llama_model in Cheetah/cheetah/configs/models/cheetah_llama2.yaml to the folder that contains the LLaMA2 weights.
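If you would rather script this edit than open the files by hand, a small helper along these lines could do it. This is a sketch under assumptions: it assumes the key appears on its own line in the form `llama_model: <value>`, it overwrites anything after the colon on that line (including a trailing comment), and the function name `set_yaml_value` and the example paths are made up for illustration. Check the actual config layout before relying on it.

```python
import re
from pathlib import Path


def set_yaml_value(config_path: str, key: str, value: str) -> int:
    """Rewrite every `key: ...` line in a YAML file; return how many lines changed."""
    path = Path(config_path)
    text = path.read_text(encoding="utf-8")
    # Match an optionally indented "key:" line; [ \t]* avoids spilling onto the next line.
    pattern = re.compile(rf"^([ \t]*{re.escape(key)}[ \t]*:[ \t]*).*$", re.MULTILINE)
    # A function replacement keeps backslashes in `value` (e.g. Windows paths) literal.
    new_text, count = pattern.subn(lambda m: f'{m.group(1)}"{value}"', text)
    if count:
        path.write_text(new_text, encoding="utf-8")
    return count


# Example call (placeholder weights folder, path relative to your working directory):
# set_yaml_value("cheetah/configs/models/cheetah_vicuna.yaml", "llama_model", "/path/to/vicuna-7b")
```

A return value of 0 means the key was not found and the file was left untouched, which is a useful signal that the config does not look the way this sketch assumes.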

3. Prepare the pretrained checkpoint for Cheetor

Download the pretrained checkpoint of Cheetah that matches the language model you prepared (the download links are in the project repo):

Checkpoint aligned with Vicuna 7B: Download
Checkpoint aligned with LLaMA2 7B: Download

For the checkpoint aligned with Vicuna 7B, set the path to the pretrained checkpoint at Line 10 of the evaluation config file Cheetah/eval_configs/cheetah_eval_vicuna.yaml.

For the checkpoint aligned with LLaMA2 7B, set the path to the pretrained checkpoint at Line 10 of the evaluation config file Cheetah/eval_configs/cheetah_eval_llama2.yaml.

In addition, Cheetor reuses the pretrained Q-Former from BLIP-2 that matches FlanT5-XXL.

4. How to use Cheetor

Examples of using our Cheetah model are provided in the files Cheetah/test_cheetah_vicuna.py and Cheetah/test_cheetah_llama2.py. You can test your own samples by following the format shown in these two files. You can run the test code as follows (taking the Vicuna version of Cheetah as an example):
python test_cheetah_vicuna.py --cfg-path eval_configs/cheetah_eval_vicuna.yaml --gpu-id 0
In the near future, we will also demonstrate how to launch the Gradio demo of Cheetor locally.
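If you plan to switch between the two variants often, the test command above can be assembled programmatically. The sketch below only uses the script and config file names mentioned in this post; it assumes the LLaMA2 script accepts the same `--cfg-path` and `--gpu-id` flags as the Vicuna one, which the post does not confirm. The name `build_test_command` is made up for this example.

```python
from typing import List

# Script and eval-config names as given in this post, relative to Cheetah/Cheetah.
VARIANTS = {
    "vicuna": ("test_cheetah_vicuna.py", "eval_configs/cheetah_eval_vicuna.yaml"),
    "llama2": ("test_cheetah_llama2.py", "eval_configs/cheetah_eval_llama2.yaml"),
}


def build_test_command(variant: str, gpu_id: int = 0) -> List[str]:
    """Return the argument list for running the test script of one variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(VARIANTS)}")
    script, cfg = VARIANTS[variant]
    # Assumption: both scripts take the same flags as the documented Vicuna command.
    return ["python", script, "--cfg-path", cfg, "--gpu-id", str(gpu_id)]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from inside the Cheetah/Cheetah directory, or joined with spaces and pasted into a shell.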

ChatGPT-4 Breakdown:

Imagine a brilliant detective who has a unique skill: they can understand stories told not just through spoken or written words, but also by examining pictures, diagrams, or comics. This detective doesn’t just listen or read; they also observe and link the visual clues with the narrative. When given a comic strip without dialogues or a textbook diagram with some text, they can deduce what’s happening, understanding both the pictures and words as one unified story.

In the world of artificial intelligence, “Cheetor” is that detective. It’s a sophisticated program designed to interpret and respond to a mix of images and texts, enabling it to perform tasks that require both vision and language understanding.

Projects to Try with Cheetor:

Comic Story Creator: Input: A series of related images or sketches. Cheetor’s Task: Predict and generate suitable dialogues or narratives to turn those images into a comic story.

Education Assistant: Input: A page from a textbook containing both diagrams and some accompanying text. Cheetor’s Task: Answer questions based on the content, ensuring it considers both the visual and written information.

Security Analyst: Input: Surveillance footage or images with accompanying notes or captions. Cheetor’s Task: Identify discrepancies or anomalies, integrating visual cues with textual information.

Drive Safety Monitor: Input: Video snippets from a car’s dashcam paired with audio transcriptions or notes. Cheetor’s Task: Predict potential hazards or traffic violations by understanding both the visual and textual data.

Art Interpreter: Input: Art pieces or abstract images with associated artist’s notes. Cheetor’s Task: Explain or interpret the art, merging the visual elements with the artist’s intentions or story behind the work.

This is a really interesting strategy and implementation! A model that combines natural language understanding with high-quality image recognition and computer vision can lead to all sorts of wild new applications. I am excited to see where this goes in the open-source community and how it develops over the rest of this year.
