DeepSeek-VL2 is a series of Mixture-of-Experts vision-language models from deepseek-ai, designed for advanced multimodal understanding across a wide range of tasks. The series extends the earlier DeepSeek-VL with improved capability and efficiency in processing combined visual and textual inputs.
Model Variants: DeepSeek-VL2 comes in three sizes: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with roughly 1.0B, 2.8B, and 4.5B activated parameters, respectively.
Advanced Multimodal Tasks: The models are adept at a range of complex tasks, including visual question answering, optical character recognition (OCR), document, table, and chart understanding, and visual grounding.
Performance: DeepSeek-VL2 models achieve competitive or state-of-the-art results, often with fewer activated parameters than existing open-source dense and MoE-based models.
Release Dates:
Download: All models are available for download on the Hugging Face Hub and support a sequence length of 4096 tokens.
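As an illustration, the weights can be fetched ahead of time with the standard huggingface_hub client; this is a minimal sketch, and the repository IDs follow the deepseek-ai/deepseek-vl2* naming used in the example later in this document:

from huggingface_hub import snapshot_download

# Download the smallest variant into the local Hugging Face cache; swap the repo id
# for "deepseek-ai/deepseek-vl2-small" or "deepseek-ai/deepseek-vl2" as needed.
local_dir = snapshot_download("deepseek-ai/deepseek-vl2-tiny")
print(local_dir)  # path to the downloaded model files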
License: Usage of DeepSeek-VL2 models is governed by the MIT license, ensuring open access for both academic and commercial use, subject to terms specified in the license documentation.
DeepSeek-VL2 requires Python 3.8 or higher. After cloning the repository and changing into its root directory, dependencies can be installed with the following command:
pip install -e .
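For reference, the complete setup, assuming the code is hosted in the deepseek-ai/DeepSeek-VL2 GitHub repository, looks like this:

git clone https://github.com/deepseek-ai/DeepSeek-VL2
cd DeepSeek-VL2
pip install -e .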
DeepSeek-VL2 models can be used for tasks such as single- or multi-image understanding. Below is a simple example of using DeepSeek-VL2 for an image-grounded conversation; a multi-image sketch follows the single-image walkthrough.
Loading the Model:
import torch
from transformers import AutoModelForCausalLM
from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images  # helper used below to read the image files

model_path = "deepseek-ai/deepseek-vl2-tiny"  # the smallest of the three variants
vl_chat_processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer  # needed later to decode the generated token ids

# Load the model in bfloat16 on the GPU and switch it to inference mode.
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda().eval()
Running the Model:
# Ask the model to locate a referenced object in the image (visual grounding).
conversation = [
    {"role": "<|User|>", "content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.", "images": ["./images/visual_grounding_1.jpeg"]},
    {"role": "<|Assistant|>", "content": ""},  # empty turn for the model to fill in
]
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(conversations=conversation, images=pil_images, force_batchify=True, system_prompt="").to(vl_gpt.device)

# Embed the combined image/text inputs, then generate a response with the language model.
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language.generate(inputs_embeds=inputs_embeds, attention_mask=prepare_inputs.attention_mask,
                                   pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id,
                                   max_new_tokens=512, do_sample=False, use_cache=True)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=False)
print(answer)
This example demonstrates the model's capability to engage in a conversation about an image, identifying and localizing objects within the visual input.
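For multiple images, the same conversation structure carries one <image> placeholder per image in a single user turn; the sketch below only changes the conversation definition and reuses the preprocessing and generation calls from the example above (the file paths and prompt wording are illustrative):

# Hypothetical two-image prompt; each <image> tag pairs with one path in "images".
conversation = [
    {"role": "<|User|>", "content": "This is image_1: <image>\nThis is image_2: <image>\nDescribe the difference between the two images.", "images": ["./images/example_1.jpeg", "./images/example_2.jpeg"]},
    {"role": "<|Assistant|>", "content": ""},
]
# From here, call load_pil_images, vl_chat_processor, prepare_inputs_embeds, and generate exactly as above.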
DeepSeek-VL2 is a significant step forward for vision-language models, offering robust performance across a variety of complex multimodal tasks. Its availability on the Hugging Face Hub makes it easy to access and integrate into existing projects, supporting a wide range of research and commercial applications.