Tongyi Wanxiang 2.1: A hands-on review of Alibaba's powerful open-source video generation model


On the evening of the 25th, Alibaba announced the full open-sourcing of its video generation model Wan2.1, a move that has sparked widespread interest among AI developers worldwide. The Tongyi Wanxiang 2.1 models are released under the Apache 2.0 license, with complete inference code and weights for both the 14B and 1.3B parameter versions, and support both text-to-video and image-to-video tasks. Developers around the world can download them from GitHub, Hugging Face, and the ModelScope community, which means AI video creation is entering a new era of open-source collaboration.
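For developers who want to try it locally, a minimal sketch of fetching the weights from Hugging Face might look like the following. The repo id shown is an assumption based on the 1.3B text-to-video checkpoint mentioned above; check the official organization page for the exact names before running.

```python
# Minimal sketch: downloading the open-source Wan2.1 weights from Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Wan-AI/Wan2.1-T2V-1.3B",   # assumed repo id; verify on the official page
    local_dir="./Wan2.1-T2V-1.3B",
)
print(f"Weights downloaded to {local_dir}")
```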

Tongyi Wanxiang is an important member of Alibaba Cloud's Tongyi family of AI creative generation models. Since going online in July 2023, it has been widely used for image creation. The open-source version 2.1 goes a step further by adding video generation, and is especially good at understanding Chinese cultural elements and oriental aesthetics. For example, at the 2024 Spring Festival Gala, the Tongyi Wanxiang model was used to generate stage backgrounds and choreography effects, demonstrating its power in complex scenes.

It is worth noting that Tongyi Wanxiang 2.1 can accurately understand Chinese prompts when generating videos and output footage with distinctly Chinese characteristics and a strong New Year flavor. For example, when the user enters the prompt "with red New Year rice paper as the background, a drop of ink appears and slowly spreads into a halo", the model can generate video content with a rich oriental flavor, providing powerful support for creative design.

On the VBench benchmark, Wan 2.1 outperforms domestic and international models such as Sora, Luma, and Pika.


How well does it really work? Without further ado, let's get to the review!

01 Hands-on testing

Tongyi Wanxiang currently offers the 2.1 Turbo Edition and Professional Edition for trial; both are 14B models. The Turbo Edition generates a video in roughly 4 minutes, while the Professional Edition is slower, taking about an hour, but its results are more stable.

The text-to-video 2.1 Professional Edition understands the prompt more accurately than the Turbo Edition, and its output is noticeably sharper. However, the videos generated by both versions show obvious distortion, and the model's grasp of some details of the physical world is lacking.
Prompt: In homage to Inception, a top-down wide-angle shot of a hotel corridor rotating at a constant 15 degrees per second; two suited agents roll and fight across walls and ceiling, their ties fluttering upward at 45 degrees from the centrifugal force; shards of the overhead lighting scatter against the direction of gravity.

Professional Edition

Turbo Edition

Prompt: a girl in a red skirt skips down the Montmartre steps; a vintage collector's box (wind-up toys / old photos / glass marbles) pops out of each step; a flock of pigeons forms a heart-shaped trajectory under a warm-tone filter; accordion scales are precisely synchronized with the rhythm of her footsteps; fisheye follow shot.

Professional Edition

Turbo Edition

Wan 2.1 is currently the world's first open-source video model that can directly generate Chinese text within video. Although it can accurately render the specified text, this only holds for relatively short strings; beyond a certain length the characters become garbled.

Prompt: a wolf-hair brush sweeps across rice paper; as the ink soaks in, the word "fate" emerges stroke by stroke, its edges glowing with a golden shimmer.

The image-to-video results are more stable, with good character consistency and no obvious deformation, but the understanding of the prompt is incomplete and details are missing. For example, there are no pearls in the pearl milk tea in the sample video, and the little stone man never turns into the round stone girl.

Prompt: oil painting style; a plainly dressed young girl takes out a cup of pearl milk tea, gently parts her lips and slowly sips it, her movements elegant and calm. The background is in deep dark tones, with the only light falling on the girl's face, creating a mysterious and serene atmosphere. Close-up, side profile.
Prompt: the stone man's arms swing naturally with his steps, and the background light gradually changes from bright to dim, creating a sense of time passing. The camera remains stationary, focusing on the stone man's movement. The small stone man in the opening frame gradually grows as the video progresses, finally transforming into a round and cute stone girl in the final frame.

Overall, Wanxiang 2.1's semantic understanding and physical realism still need improvement, but its overall aesthetics hold up, and open-sourcing may accelerate optimization and updates. We look forward to even better results in follow-up releases.

02 Low cost, high effectiveness, high controllability

In terms of algorithm design, Wanxiang still builds on the mainstream DiT architecture combined with Flow Matching along a linear noise trajectory. It may look complicated, but the underlying idea is much the same as in other models.

That is, an image is first turned into a pile of noise (think TV static) until it becomes pure noise; the model then learns to "denoise", working out where each bit of noise belongs, and produces a high-quality image over multiple iterations.
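As a rough illustration, the training objective for Flow Matching with a linear noise trajectory can be sketched in a few lines of PyTorch. This is the generic formulation, not Wan 2.1's actual training code, and the model's call signature here is an assumption.

```python
# Sketch of Flow Matching with a linear noise trajectory (generic formulation).
import torch

def flow_matching_loss(model, x0):
    """x0: a batch of clean video latents, shape (B, ...)."""
    noise = torch.randn_like(x0)                           # pure-noise endpoint of the path
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)),
                   device=x0.device)                       # one random time in [0, 1] per sample
    xt = (1 - t) * x0 + t * noise                          # linear interpolation between data and noise
    target_velocity = noise - x0                           # constant velocity along the straight line
    pred_velocity = model(xt, t)                           # the DiT predicts the velocity field (assumed signature)
    return torch.mean((pred_velocity - target_velocity) ** 2)
```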

The problem, however, is that traditional diffusion models are extremely computationally intensive when generating video and require heavy iterative computation and optimization. Generation takes a long time yet the resulting clips are still short, and memory and compute consumption are huge.
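Some back-of-envelope arithmetic (with assumed frame rate, duration, and patch sizes, not Wan 2.1's real configuration) shows why the cost balloons:

```python
# Why naive video diffusion is expensive: token count grows with T x H x W,
# and self-attention cost grows with the square of the token count.
frames, height, width = 5 * 16, 1080, 1920       # a 5-second clip at 16 fps (assumed)
patch_t, patch_h, patch_w = 1, 16, 16            # assumed patch sizes

tokens_per_frame = (height // patch_h) * (width // patch_w)   # 67 * 120 = 8,040
total_tokens = (frames // patch_t) * tokens_per_frame         # 80 * 8,040 = 643,200 tokens
attention_pairs = total_tokens ** 2                           # ~4e11 pairwise interactions
print(total_tokens, attention_pairs)
```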

To address this, Wanxiang proposes a novel 3D spatio-temporal variational autoencoder (VAE) called Wan-VAE, which improves spatio-temporal compression and reduces memory usage by combining multiple strategies.

This technique is somewhat reminiscent of the "dual-vector foil" in The Three-Body Problem, which flattens a person from three dimensions into two. Spatio-temporal compression means compressing the video's space and time dimensions, for example by breaking the video down into lower dimensions: instead of generating a three-dimensional volume directly, first generating two-dimensional slices and then restoring them to three dimensions, or adopting hierarchical generation to improve efficiency.
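A toy sketch of what spatio-temporal compression looks like in code: strided 3D convolutions shrink the time and space axes of the video before any expensive generation happens. The 4x temporal / 8x spatial factors here are illustrative assumptions, not Wan-VAE's actual architecture.

```python
# Toy 3D VAE encoder: compress time by 4x and space by 8x (assumed factors).
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, stride=(2, 2, 2), padding=1),    # T/2, H/2, W/2
    nn.SiLU(),
    nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),  # T/4, H/4, W/4
    nn.SiLU(),
    nn.Conv3d(128, 16, kernel_size=3, stride=(1, 2, 2), padding=1),  # T/4, H/8, W/8
)

video = torch.randn(1, 3, 16, 256, 256)    # (batch, channels, frames, H, W)
latent = encoder(video)
print(video.shape, "->", latent.shape)     # (1, 3, 16, 256, 256) -> (1, 16, 4, 32, 32)
```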

As a simple analogy, Wan-VAE can compress the novel Romance of the Three Kingdoms into an outline and keep only what is needed to restore the content from that outline. This greatly reduces memory usage and, by the same method, makes it possible to "remember" much longer novels.

Solving the memory-footprint problem incidentally solves the problem of producing long videos. Traditional video models can only handle a fixed length and will stall or crash beyond it. But if only the "outline" is stored and the relationships between earlier and later content are remembered, then by temporarily caching the key information of previous frames when generating each new frame, recomputation from the first frame can be avoided. In theory, following this approach, 1080P videos of unlimited length can be encoded and decoded without losing historical information.
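A minimal sketch of that chunk-and-cache idea, with a hypothetical encode_chunk function and unspecified cache contents, might look like this:

```python
# Process a long video chunk by chunk, carrying forward only a small cache
# instead of recomputing from the first frame each time.
import torch

def encode_long_video(frames, encode_chunk, chunk_size=16):
    """frames: (C, T, H, W) tensor; encode_chunk(chunk, cache) -> (latent, new_cache)."""
    latents, cache = [], None
    for t in range(0, frames.shape[1], chunk_size):
        chunk = frames[:, t:t + chunk_size]          # one small window at a time
        latent, cache = encode_chunk(chunk, cache)   # the cache holds key features of recent
        latents.append(latent)                       # frames, never the full history
    return torch.cat(latents, dim=1)
```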


That is also why Wanxiang can run on consumer graphics cards. Traditional HD video (e.g. 1080P) involves too much data for an ordinary graphics card's memory. Wanxiang instead reduces the resolution before processing, for example scaling 1080P down to 720P to cut the data volume, and then restores the output to 1080P with a super-resolution model after generation is complete.
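The pipeline can be sketched as "generate small, then upscale". In the sketch below, plain bicubic interpolation stands in for the learned super-resolution model, and generate_720p is a hypothetical generation function, not part of any released API.

```python
# Generate at 720p to fit in consumer VRAM, then upscale the frames to 1080p.
import torch
import torch.nn.functional as F

def generate_then_upscale(generate_720p, prompt, target_hw=(1080, 1920)):
    frames = generate_720p(prompt)                  # (T, 3, 720, 1280): far less memory
    return F.interpolate(frames, size=target_hw,    # a learned super-resolution model would
                         mode="bicubic",            # go here; bicubic is just a stand-in
                         align_corners=False)
```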

According to Wanxiang's own measurements, by applying spatial downsampling earlier in the pipeline, inference memory usage is reduced by a further 29% without any loss of performance: generation stays fast and image quality does not suffer.


These engineering innovations address the challenges that previously prevented video generation models from being applied at scale. At the same time, Wanxiang has also further optimized the generation quality itself.

For example, for fine-grained motion control, Runway's native video model relied on motion brushes to draw trajectories for controlling the relative motion of single or multiple objects, whereas Wanxiang lets the user control the motion of objects in the video through text, keypoints, or simple sketches (e.g. specifying that "a butterfly hovers and flies into the frame from the lower-left corner").

Wanxiang 2.1 converts the user's input motion trajectory into a mathematical model, which serves as an additional condition to guide the model during video generation. But that alone is not enough: the motion of objects also has to obey real-world physical laws, so on top of the mathematical model, the results computed by a physics engine are introduced to enhance the realism of the motion.
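One hypothetical way to turn a user-drawn trajectory into such a conditioning signal is to rasterize the keypoints into a per-frame spatial map that accompanies the text prompt. The article does not document Wan 2.1's actual conditioning mechanism, so this is purely illustrative; all names and parameters below are assumptions.

```python
# Rasterize a user-drawn trajectory into a per-frame conditioning map (illustrative only).
import numpy as np

def trajectory_to_condition(keypoints, num_frames, height, width):
    """keypoints: list of (x, y) control points in [0, 1] along the desired path."""
    ts = np.linspace(0, 1, num_frames)
    ks = np.linspace(0, 1, len(keypoints))
    xs = np.interp(ts, ks, [p[0] for p in keypoints])   # interpolate the path over time
    ys = np.interp(ts, ks, [p[1] for p in keypoints])
    cond = np.zeros((num_frames, height, width), dtype=np.float32)
    for f, (x, y) in enumerate(zip(xs, ys)):
        cond[f, int(y * (height - 1)), int(x * (width - 1))] = 1.0  # mark the object's target position
    return cond

# e.g. a path that starts at the lower-left corner and moves into the frame,
# loosely matching the article's butterfly example
path_map = trajectory_to_condition([(0.05, 0.95), (0.4, 0.6), (0.7, 0.4)], 16, 64, 64)
```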

Overall, Wanxiang's core advantage is using engineering capability to solve the challenges of real production scenarios, while leaving room for subsequent iteration through modular design. For ordinary users, it genuinely lowers the threshold of video creation.

The fully open-source strategy also thoroughly disrupts the pay-to-use business model of video models. With the arrival of Wan 2.1, the video generation race in 2025 is set for quite a show!
