Alibaba open-sources Qwen-Image, its first image generation foundation model, with high-fidelity Chinese output, topping the global open-source charts


A domestic state-of-the-art (SOTA) open-source image generation model has arrived!

On August 5, Alibaba open-sourced Qwen-Image, the first image generation foundation model in the Tongyi Qianwen (Qwen) series. Qwen-Image's headline capability is complex text rendering: it can accurately generate text in different languages and styles across a wide range of scenes, even producing brush calligraphy or directly generating PPT pages that combine text and images.


In the image below, Qwen-Image not only accurately reproduces the "Hayao Miyazaki style" requirement in the prompt, but also accurately renders the words "cloud storage" and "cloud computing" as the composition's depth of field changes, with the text blending naturally into the image.


Qwen-Image also handles English content accurately. Given an English prompt, it generated a bookstore window scene in which all of the specified text was reproduced correctly, with a different stylized font for each book and a cover matching each title.


Beyond text handling, Qwen-Image supports a wide range of art styles in general image generation, from photorealistic scenes to impressionist paintings, and from anime styles to minimalist designs.


Qwen-Image is a 20B-parameter model built on the MMDiT (Multimodal Diffusion Transformer) architecture: "MM" refers to its ability to handle multimodal content such as images and text, while "DiT" indicates that it is a Diffusion Transformer.

The Qwen team evaluated Qwen-Image on a number of public benchmarks against leading open-source and closed-source image generation models worldwide, achieving a total of 12 SOTA results. On the general image generation benchmarks GenEval, DPG, and OneIG-Bench, as well as the image editing benchmarks GEdit, ImgEdit, and GSO, Qwen-Image outperformed open-source models such as Flux.1 and BAGEL, ByteDance's SeedDream 3.0, and OpenAI's GPT Image 1 (High).

Results on the text rendering benchmarks LongText-Bench, ChineseWord, and TextCraft show that Qwen-Image performs particularly well at text rendering, with Chinese text rendering in particular far ahead of existing state-of-the-art models, including SeedDream 3.0 and GPT Image 1 (High).


Qwen-Image has now been open-sourced on communities such as ModelScope and Hugging Face, and ordinary users can try the model directly by selecting the image generation feature in QwenChat (chat.qwen.ai).
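
For developers, the open weights can also be run locally. The snippet below is a minimal sketch, assuming the checkpoint is published under the Qwen/Qwen-Image repo id on Hugging Face and that a recent diffusers release ships a compatible pipeline; the model card remains the authoritative reference for exact usage.

```python
# Minimal sketch: generating an image locally with the open-source weights.
# Assumptions: the checkpoint id "Qwen/Qwen-Image" and diffusers support; see
# the official model card for the up-to-date recommended parameters.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt = ("A bookstore window display with neatly arranged books, "
          "each cover showing a legible English title")
image = pipe(prompt=prompt, num_inference_steps=50).images[0]
image.save("qwen_image_demo.png")
```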

Qwen-Image's technical report was released at the same time, detailing the model's technical implementation step by step.


 


 

01. An architecture built from three core components, with multiple models collaborating on image generation

The Qwen team observed that although existing image generation models on the market have made real breakthroughs in resolution and detail rendering, they still perform poorly on tasks such as multi-line text rendering, generation of non-alphabetic languages (e.g. Chinese), localized text insertion, and fusing text with visual elements.

We also compared the widely followed Flux model with Qwen-Image. Given identical prompts, Flux (left) simply fails to render the Chinese characters for "released in summer 2025", and its visual impact is slightly inferior to Qwen-Image's.


▲ Movie poster generation test. Prompt: sci-fi movie poster titled 『GALAXY INVASION』, metallic font with neon light effects and broken edges, space explosions in the background, and small lettering reading 『Summer 2025 release』.

In image editing, meanwhile, alignment between the edited result and the original image remains a challenge. The model must maintain visual consistency, e.g., changing only the hair color without affecting facial features, as well as semantic coherence, e.g., modifying a character's pose while keeping identity and scene consistent.


▲ Prompt asking to add spaceship elements and change the release date to summer 2026 (Flux on the left, Qwen-Image on the right)

In the Chinese calligraphy scenario, Qwen-Image was left to choose the text and script style on its own; the final result is shown below.


The architecture of Qwen-Image consists of three core components that work together to enable text-to-image generation.


The Qwen2.5-VL multimodal large language model (MLLM) serves as the condition encoder, responsible for extracting features from the text input.

The system prompt asks Qwen2.5-VL to describe in detail the color, number, text, shape, size, texture, and spatial relationships of objects and backgrounds, informing image generation and guiding the model toward refined latent representations.

The variational autoencoder (VAE) from the Wan-2.1 video generation model serves as the image tokenizer for Qwen-Image.

It compresses the input image into a compact latent representation and decodes it back at inference time. Notably, the Qwen team froze the Wan-2.1 encoder and fine-tuned only the image decoder, yet still significantly improved the model's rendering of fine detail.

The Multimodal Diffusion Transformer (MMDiT) serves as the backbone diffusion model, modeling the complex joint distribution between noise and image latent representations under textual guidance.

In each MMDiT block, the Qwen team introduces a Multimodal Scalable RoPE (MSRoPE) method, which helps the model retain strong high-resolution image generation capability and accurately render text while keeping image and text tokens distinguishable.
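
Putting the three components together, the generation flow can be sketched roughly as follows. This is an illustrative outline based on the report's description, not the actual API; mllm, mmdit, vae_decoder and their method names are placeholders.

```python
# Schematic sketch (not the real interface) of how the three components
# cooperate at inference time. All object and method names are placeholders.
import torch

def generate(prompt: str, mllm, mmdit, vae_decoder, num_steps: int = 50):
    # 1) Qwen2.5-VL as condition encoder: a detailed system prompt asks it to
    #    describe colors, counts, text, shapes, sizes, textures and spatial
    #    relations; its features condition the generator.
    text_features = mllm.encode_text(prompt)

    # 2) MMDiT as diffusion backbone: starting from Gaussian noise in the VAE
    #    latent space, it iteratively denoises under text guidance, with
    #    MSRoPE keeping image and text tokens distinguishable.
    latent = torch.randn(1, 16, 64, 64)  # latent shape is illustrative only
    for t in reversed(range(num_steps)):
        latent = mmdit.denoise_step(latent, t, text_features)

    # 3) The Wan-2.1 VAE decoder (the encoder stays frozen during training)
    #    maps the final latent back to pixels.
    return vae_decoder.decode(latent)
```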


 

02. Building a billions-scale dataset, with the model learning text-to-image capabilities "progressively"

With the architecture in place, Qwen-Image tackles common problems of image generation models through data engineering, progressive learning strategies, an enhanced multi-task training paradigm, and scalable infrastructure optimizations.

To achieve alignment with complex prompts, the Qwen team built a data processing pipeline covering large-scale data collection, annotation, filtering, synthetic augmentation, and category balancing.

During the data collection and categorization phase, the team systematically collected and annotated billions of image-text pairs, organizing the dataset into four core domains:

(1) Nature (55%): covers a wide range of categories such as objects, landscapes, cities, plants, animals, interiors, food, etc., and is the basis for the model to generate diverse and realistic images.

(2) Design (27%): encompasses posters, user interfaces, PPTs, paintings, sculptures, digital art, etc., rich in text, complex layouts, and artistic styles that are critical to improving the model's understanding of artistic directives, text layout, and design semantics.

(3) People (13%): includes portraits, sports, and human activities, used to enhance the model's ability to generate realistic and diverse portraits.

(4) Synthetic data (5%): specifically refers to data generated by controlled text rendering techniques rather than images generated by other AI models. This avoids the risk of artifacts, text distortion and bias associated with AI-generated images and ensures data reliability.


After assembling the raw dataset, the Qwen team applied seven stages of progressive data filtering: initial pre-training data curation, image quality enhancement, image-text alignment optimization, text rendering enhancement, high-resolution refinement, category balancing and portrait augmentation, and balanced multi-scale training.

In the pre-training phase, Qwen-Image employs a curriculum learning strategy, starting from basic text rendering tasks and gradually transitioning to the generation of paragraph-level and layout-sensitive descriptions. This approach significantly improves the model's ability to understand and generate text in diverse languages, especially ideographic languages such as Chinese.

This progressive approach to learning is reflected in a number of ways.

In terms of resolution, Qwen-Image starts at 256p and progressively moves up to 640p and 1328p, letting the model learn visual features from overall structure down to fine detail.

Data quality has also been progressively improved, initially using large-scale data to quickly build up foundational capabilities, and later introducing stricter data filtering and refining using high-quality, high-resolution data.

Text rendering capability is built up in stages: training starts with general, text-free image generation, then gradually adds text-containing images (especially Chinese) to specifically strengthen text rendering.

The data distribution is also adjusted over the course of training, varying the proportions of different domains (nature, design, people) and resolutions to prevent overfitting.
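
As a rough illustration, such a staged curriculum could be expressed as a simple configuration. The stage names, step ordering, and data ratios below are hypothetical; only the 256p → 640p → 1328p resolution ladder and the idea of gradually adding text-heavy data come from the description above.

```python
# Illustrative curriculum config; values are hypothetical, not from the report.
CURRICULUM = [
    {"stage": "low_res_pretrain", "resolution": 256,  "text_image_ratio": 0.0},
    {"stage": "mid_res",          "resolution": 640,  "text_image_ratio": 0.2},
    {"stage": "text_rendering",   "resolution": 640,  "text_image_ratio": 0.5},
    {"stage": "high_res_refine",  "resolution": 1328, "text_image_ratio": 0.5},
]

for stage in CURRICULUM:
    print(f"{stage['stage']}: train at {stage['resolution']}p, "
          f"{stage['text_image_ratio']:.0%} text-containing images")
```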

For image editing, Qwen-Image supports style transfer, object addition and removal, detail enhancement, text editing, character pose adjustment, and many other operations.

 

03. A multi-task framework unifies generation and editing; top 3 in the arena after 200,000+ head-to-head comparisons

Training image generation models places heavy demands on infrastructure. To cope with the model's huge parameter count and data volume, the Qwen team designed an efficient distributed training framework.

The framework adopts a Producer-Consumer architecture.

The producer is responsible for all data preprocessing, including data filtering, MLLM feature extraction and VAE encoding. The processed data is stored in the cache by resolution.

Consumers are deployed on a GPU cluster and focus on model training. With a dedicated HTTP transport layer, consumers are able to pull pre-processed batches from producers asynchronously and with zero copies.

This framework decouples throughput-intensive preprocessing and compute-intensive training, greatly improves GPU utilization and overall throughput, and supports online updates to the data pipeline.
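
The producer-consumer pattern itself is a classic one. Below is a minimal sketch using only Python's standard library, with stub functions standing in for preprocessing and training, and a local thread-safe queue standing in for the HTTP transport layer and the resolution-keyed cache of the real system.

```python
# Minimal producer-consumer sketch; preprocess() and train_step() are
# hypothetical placeholders, and a local queue replaces the HTTP-based cache.
import queue
import threading
import time

batches: "queue.Queue[dict]" = queue.Queue(maxsize=8)  # stand-in for the cache

def preprocess(sample: int) -> dict:
    # Placeholder for data filtering, MLLM feature extraction and VAE encoding.
    time.sleep(0.01)
    return {"id": sample, "latent": [0.0] * 16}

def train_step(batch: dict) -> None:
    # Placeholder for the GPU-side optimization step.
    time.sleep(0.01)

def producer(num_samples: int) -> None:
    """CPU-side workers: preprocess data and push ready batches to the cache."""
    for i in range(num_samples):
        batches.put(preprocess(i))

def consumer(num_steps: int) -> None:
    """GPU-side trainer: asynchronously pulls preprocessed batches and trains."""
    for _ in range(num_steps):
        train_step(batches.get())

threading.Thread(target=producer, args=(100,), daemon=True).start()
consumer(num_steps=100)
```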

The Qwen team also adopted a hybrid parallelism strategy, combining data parallelism with tensor parallelism and splitting the multi-head attention module head-wise to reduce communication overhead.

After pre-training, the team used supervised fine-tuning (SFT) and reinforcement learning (RL) to further improve Qwen-Image's generation quality and align it with human preferences.

The supervised fine-tuning phase builds a hierarchically organized, high-quality dataset in which every sample is carefully annotated by hand, emphasizing image clarity, detail richness, lighting quality, and photorealism, and is used to steer the model toward higher-quality visual output.

Reinforcement learning is then introduced to further optimize for preferences, primarily via direct preference optimization (DPO): multiple candidate images are generated for the same prompt, the best and worst samples are labeled by hand, and a DPO loss built on the flow matching framework updates the parameters by comparing the model's velocity-prediction error on the "good" versus the "bad" samples.
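
A hedged sketch of what such a flow-matching DPO objective could look like is given below: the model is rewarded when its velocity prediction on the preferred sample improves relative to a frozen reference model, and penalised when the same happens on the rejected sample. The exact loss used by the Qwen team may differ; beta and the error definition here are illustrative.

```python
# Illustrative flow-matching DPO loss; not the official implementation.
import torch
import torch.nn.functional as F

def flow_dpo_loss(v_model_good, v_model_bad, v_ref_good, v_ref_bad,
                  v_target_good, v_target_bad, beta: float = 1.0):
    # Per-sample flow-matching error: squared distance to the target velocity.
    def err(v_pred, v_target):
        return ((v_pred - v_target) ** 2).flatten(1).mean(dim=1)

    # How much better (or worse) the trained model is than the reference,
    # on the preferred and rejected samples respectively.
    good_margin = err(v_model_good, v_target_good) - err(v_ref_good, v_target_good)
    bad_margin = err(v_model_bad, v_target_bad) - err(v_ref_bad, v_target_bad)

    # Lower error on "good" and higher error on "bad" drives the loss down.
    return -F.logsigmoid(-beta * (good_margin - bad_margin)).mean()
```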

On this basis, Qwen-Image further applies Group Relative Policy Optimization (GRPO) for fine-grained tuning: within each group of generated results, an advantage function is computed from reward-model scores and the policy is adjusted accordingly; to improve exploration, GRPO samples with stochastic differential equations (SDEs) instead of the usual ODEs.

Qwen-Image adopts a unified multi-task framework that supports multiple generation modes, including text-to-image (T2I) and text-image-to-image (TI2I, i.e., image editing). For the image editing task, user-supplied reference images are processed by Qwen2.5-VL, with frames encoded by its Vision Transformer (ViT) and concatenated with the text tokens to form the input sequence.

In this process, Qwen2.5-VL describes the key features of the input image (color, shape, size, texture, objects, background) and interprets how the user's textual instructions should change it; Qwen-Image then generates a new image that meets the user's requirements while staying consistent with the original input where appropriate.

To let the model tell multiple images apart, the Qwen team extended MSRoPE, which originally used only height and width to locate image patches within a single image, with an additional "frame" dimension, further strengthening the model's ability to maintain visual fidelity and structural consistency with user-supplied images.
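
The idea can be sketched as follows: image tokens that would normally carry only (row, column) coordinates also get a frame index, so the reference image and the edited output occupy distinct positions. This is a schematic illustration only, not the actual MSRoPE implementation.

```python
# Schematic sketch of adding a "frame" dimension to image position ids.
import torch

def build_position_ids(num_frames: int, height: int, width: int) -> torch.Tensor:
    """Return a (num_frames*height*width, 3) tensor of (frame, row, col) ids."""
    ids = []
    for f in range(num_frames):  # e.g. frame 0: reference image, frame 1: target
        rows = torch.arange(height).repeat_interleave(width)
        cols = torch.arange(width).repeat(height)
        frames = torch.full_like(rows, f)
        ids.append(torch.stack([frames, rows, cols], dim=-1))
    return torch.cat(ids, dim=0)

pos = build_position_ids(num_frames=2, height=4, width=4)
print(pos.shape)  # torch.Size([32, 3])
```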

The Qwen team has verified Qwen-Image's capabilities in both generation and editing through extensive quantitative and qualitative experiments.

In AI Arena, with 5,000 prompts and 200,000+ anonymous head-to-head comparisons, Qwen-Image placed in the top three as the only open-source model there, more than 30 points ahead of GPT Image 1, FLUX.1 Pro, and others.


In its headline Chinese text generation scenario, Qwen-Image achieves a single-character rendering accuracy of 58.3%.


On image editing tasks, Qwen-Image takes first place on the GEdit, ImgEdit, and other leaderboards, and its depth estimation and zero-shot novel view synthesis match or exceed closed-source models.

The technical report also shows a comparison of the generated results of this model with other models. It can be seen that in the case of the bookstore window, Qwen-Image did a good job of matching the book cover with the text.


On complex English text rendering, the two models on the left garble the text to varying degrees, while GPT Image 1 (High) and Qwen-Image on the right show no such problems.


On the image editing task, the other three models failed to accurately render the refrigerator magnet texture requested in the prompt, while Qwen-Image's result matched the prompt better in both color and shape.


 

04. Conclusion: Alibaba keeps open-sourcing image models, and usability keeps improving

In June this year, Alibaba open-sourced the Wan 2.1 generation model with up to 14B parameters; Qwen-Image now raises the parameter count to 20B, with usability further improved.

With targeted improvements to text generation, image editing, and other capabilities, Qwen-Image can now produce posters, generate PPTs, and edit images precisely, capabilities that matter greatly for bringing image generation technology into real production scenarios.

Note: Original source from Wise Stuff
