
What's Ovis2?
Ovis2 is a family of next-generation multimodal large language models open-sourced by Alibaba's international business team on February 21, 2025. As the successor to Ovis1.6, Ovis2 brings significant improvements in data construction and training methodology, aiming to strengthen performance on multimodal tasks.
Ovis2 Key Features:
- High performance in small models: By optimizing the training strategy, Ovis2's smaller models achieve higher capability density and lead their parameter tiers in performance.
- Enhanced reasoning skills: Instruction fine-tuning and preference learning substantially improve the model's chain-of-thought (CoT) reasoning.
- Video and multi-image processing: Video and multi-image processing capabilities have been introduced to enhance the model's understanding of dynamic and complex visual information.
- Multi-language OCR support: Expanded multi-language optical character recognition (OCR) capabilities to improve text extraction performance in complex scenarios.
Ovis2 model versions:
The Ovis2 family comprises six versions, 1B, 2B, 4B, 8B, 16B, and 34B, each state-of-the-art (SOTA) at its parameter scale. Among them, Ovis2-34B ranks second among all open-source models on the multimodal general-capability leaderboard of OpenCompass, an authoritative evaluation platform, surpassing many open-source flagship models with 70B parameters at less than half their size.
Ovis2 architecture design:
Ovis2 adopts an innovative architectural design aimed at structurally aligning visual and textual embeddings. Its core components are a visual tokenizer, a visual embedding table, and a large language model (LLM). The visual tokenizer splits the input image into patches, extracts features, and maps them to "visual words", producing probabilistic visual tokens; the visual embedding table stores the embedding vector for each visual word; and the LLM concatenates the visual and textual embedding vectors to generate the textual output, completing the multimodal task.
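The mapping from probabilistic visual tokens to embeddings can be pictured as a probability-weighted lookup over the visual embedding table. Below is a minimal, illustrative PyTorch sketch of that idea; the module names, dimensions, and the stand-in patch features are hypothetical and do not reproduce Ovis2's actual implementation.

```python
import torch
import torch.nn as nn

class VisualTokenizerSketch(nn.Module):
    """Illustrative sketch: patch features -> probabilistic 'visual words'
    -> expected embedding from a visual embedding table (hypothetical sizes)."""

    def __init__(self, feat_dim=1024, visual_vocab=8192, embed_dim=4096):
        super().__init__()
        self.head = nn.Linear(feat_dim, visual_vocab)                  # features -> logits over visual words
        self.visual_embedding = nn.Embedding(visual_vocab, embed_dim)  # visual embedding table

    def forward(self, patch_features):                                 # [num_patches, feat_dim]
        probs = torch.softmax(self.head(patch_features), dim=-1)       # probabilistic visual tokens
        # Expected embedding: probability-weighted sum over the embedding table rows
        return probs @ self.visual_embedding.weight                    # [num_patches, embed_dim]

# Concatenate visual and text embeddings before feeding the LLM (shapes are illustrative)
tokenizer = VisualTokenizerSketch()
patch_features = torch.randn(256, 1024)   # ViT patch features for one image (hypothetical)
text_embeds = torch.randn(32, 4096)       # embeddings of the text prompt tokens
llm_input = torch.cat([tokenizer(patch_features), text_embeds], dim=0)
print(llm_input.shape)                    # torch.Size([288, 4096])
```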
Ovis2 training strategy:
Ovis2 uses a four-stage training method:
- Visual module training: Freeze most of the LLM and Vision Transformer (ViT) parameters and train the vision module to learn the mapping from visual features to embeddings (a minimal freezing sketch follows this list).
- Feature Extraction Enhancement: Further improve the feature extraction capability of the vision module, enhance high-resolution image understanding, multi-language support and OCR capability.
- Visual-text alignment: Align visual embeddings with the LLM's dialogue format using image-description data presented in dialogue form.
- Multimodal instruction training and preference learning: Strengthen the model's ability to follow user instructions and the quality of its outputs in multimodal scenarios.
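As an illustration of the first stage, the sketch below freezes the LLM and ViT parameters and leaves only the vision module trainable; the attribute names are hypothetical placeholders and do not correspond to Ovis2's actual code.

```python
import torch.nn as nn

def configure_stage1(model: nn.Module):
    """Hypothetical stage-1 setup: train only the vision module (here the visual head
    and the visual embedding table), keeping the LLM and ViT backbone frozen."""
    for name, param in model.named_parameters():
        # 'visual_head' / 'visual_embedding' are placeholder names for the vision module
        param.requires_grad = name.startswith(("visual_head", "visual_embedding"))
    # Return only the trainable parameters to hand to the stage-1 optimizer
    return [p for p in model.parameters() if p.requires_grad]
```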
Ovis2 keyframe selection algorithm:
To improve video comprehension, Ovis2 introduces an innovative keyframe selection algorithm. The algorithm picks the most useful video frames based on frame-text relevance, inter-frame diversity, and temporal order. Through high-dimensional conditional similarity computation, a determinantal point process (DPP), and a Markov decision process (MDP), it efficiently selects key frames within a limited visual context to boost video understanding performance.
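A common way to trade off relevance against diversity is a DPP-style greedy selection over a kernel that combines frame-text relevance with pairwise frame similarity. The sketch below is a generic greedy DPP MAP approximation under that assumption, not Ovis2's actual algorithm; the relevance scores and frame features are hypothetical inputs.

```python
import numpy as np

def greedy_keyframes(relevance, frame_feats, k):
    """Greedy DPP-style selection: pick k frames maximizing the log-determinant of a
    quality-diversity kernel L = diag(q) @ S @ diag(q). Inputs are hypothetical:
    relevance[i] is a frame-text relevance score, frame_feats[i] a frame embedding."""
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    S = feats @ feats.T                      # pairwise frame similarity
    L = np.outer(relevance, relevance) * S   # quality-diversity kernel
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(relevance)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gain = logdet if sign > 0 else -np.inf
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return sorted(selected)                  # keep the chosen frames in temporal order

# Example: 16 candidate frames with random features, pick 4 key frames
rng = np.random.default_rng(0)
print(greedy_keyframes(rng.random(16), rng.standard_normal((16, 64)), k=4))
```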
Ovis2 open source information:
The code of Ovis2 has been open-sourced on GitHub, and the models are available on the Hugging Face and ModelScope platforms, with online demos for users to try. The related research paper has also been published on arXiv for developers and researchers.
Ovis2 Related Links:
- GitHub code repository: https://github.com/AIDC-AI/Ovis
- Hugging Face model: https://huggingface.co/AIDC-AI/Ovis2-34B
- ModelScope model: https://modelscope.cn/collections/Ovis2-1e2840cb4f7d45
- Online demo: https://huggingface.co/spaces/AIDC-AI/Ovis2-16B
- arXiv paper: https://arxiv.org/abs/2405.20797
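For reference, this is a minimal sketch of loading the published Hugging Face checkpoint linked above with the transformers library; Ovis2 ships custom modelling code, so `trust_remote_code=True` is required, and the exact preprocessing and chat interface come from that custom code (documented in the GitHub repository) rather than being shown here.

```python
import torch
from transformers import AutoModelForCausalLM

# Minimal loading sketch; image preprocessing and generation rely on the model's
# custom code, described in the AIDC-AI/Ovis GitHub repository.
model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis2-34B",           # smaller checkpoints are also published
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,        # Ovis2 relies on custom modelling code
).cuda()                           # move to GPU if one is available
```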
With these resources, developers and researchers can gain insight into Ovis2's architecture, training methods, and application scenarios, further advancing the development of and innovation in multimodal large models.
Relevant Navigation

A high-performance large language model from Microsoft, instruction-tuned and cross-platform, with strong language comprehension and reasoning capabilities, especially suited to multimodal application scenarios.

Zidong Taichu
A cross-modal general artificial intelligence platform developed by the Institute of Automation of the Chinese Academy of Sciences, featuring the world's first image-text-audio three-modal pre-trained model with cross-modal comprehension and generation capabilities; it supports full-scenario AI applications and marks a major step toward general artificial intelligence.

SmartResume
SmartResume, open-sourced by Alibaba, is a high-precision resume parsing system based on OCR and lightweight large models; it converts resumes in 12 formats, such as PDF and images, into structured data within seconds, with an accuracy of 93.1%.

Moonshot
A large-scale general AI model launched by Moonshot AI with hundreds of billions of parameters, capable of processing inputs of up to 200,000 Chinese characters; it is widely used in natural language processing, intelligent recommendation, medical diagnosis, and other fields, demonstrating excellent generalization and accuracy.

Doubao
A self-developed large model launched by ByteDance, validated through practice in more than 50 internal business scenarios at ByteDance and continuously refined with daily usage of hundreds of billions of tokens; it provides multimodal capabilities and high-quality model performance to help enterprises build rich business experiences.

Waver 1.0
Waver 1.0 is an open-source, full-featured video generation model that makes it easy and efficient to create HD video from text or images with outstanding quality.

Zen Browser
An open-source desktop browser based on the Firefox engine, featuring vertical tabs, workspaces, and split-screen views, emphasizing privacy protection and a modern browsing experience focused on efficiency and concentration.

EmaFusion
EmaFusion, introduced by Ema, is a system that dynamically combines multiple expert models to accomplish enterprise-grade AI tasks at low cost and high accuracy.
