LMArena's latest rankings: DeepSeek-R1's web programming ability catches up with Claude Opus 4
In the open-source model space, DeepSeek delivers another surprise.
On the 28th of last month, DeepSeek shipped a minor update: its R1 reasoning model was upgraded to the latest version (0528), and the model and its weights were released publicly.
R1-0528 further improves benchmark performance, enhances front-end capabilities, reduces hallucinations, and adds support for JSON output and function calling.
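As a rough sketch of what the newly supported JSON output and function calling look like in practice (assuming an OpenAI-compatible chat-completions API; the model name, endpoint conventions, and the `get_weather` tool are illustrative placeholders, not details from the article):

```python
import json

def build_request(user_message: str) -> dict:
    """Build an illustrative chat-completion payload that asks for JSON
    output and declares one callable tool (function calling)."""
    return {
        "model": "deepseek-reasoner",  # assumed model identifier
        "messages": [{"role": "user", "content": user_message}],
        # Ask the model to emit a JSON object rather than free-form text.
        "response_format": {"type": "json_object"},
        # Declare a function the model may choose to call.
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical example tool
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    }

payload = build_request("What's the weather in Hangzhou? Reply in JSON.")
print(json.dumps(payload, indent=2))
```

In this style of API, the model either returns a JSON object in the message content or a `tool_calls` entry naming the function and its arguments, which the client then executes and feeds back.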

Today, LMArena, the well-known public benchmarking platform for large models (recently embroiled in controversy over allegedly favoring models from OpenAI, Google, and Meta), released its latest performance rankings, in which DeepSeek-R1 (0528) is particularly notable.

On the text benchmark (Text), DeepSeek-R1 (0528) ranked 6th overall and 1st among open models.

Broken down by category:
- Ranked #4 in Hard Prompts
- Ranked #2 in Coding
- Ranked #5 in Math
- Ranked #6 in Creative Writing
- Ranked #9 in Instruction Following
- Ranked #8 in Longer Query
- Ranked #7 in Multi-Turn

In addition, on the WebDev Arena platform, DeepSeek-R1 (0528) tied for first place with closed-source models such as Gemini-2.5-Pro-Preview-06-05 and Claude Opus 4 (20250514), edging out Claude Opus 4 on raw score.

WebDev Arena is a real-time AI programming competition platform developed by the LMArena team. It pits large language models against each other in web development challenges, measuring human preference for each model's ability to build attractive and functional web applications.
DeepSeek-R1 (0528)'s strong showing there has drawn more people to try it.

Commentators have also noted that, given Claude has long been the benchmark in AI programming, DeepSeek-R1 (0528) now matching Claude Opus in performance is a milestone and a pivotal moment for open-source AI.
DeepSeek-R1 (0528) delivers leading performance under the fully open MIT license and rivals the best closed-source models. While this breakthrough is most evident in web development, its impact may extend to programming more broadly.
That said, benchmark scores do not fully capture real-world performance. While DeepSeek-R1 (0528) may match Claude in technical capability, whether it can offer a comparable user experience in day-to-day workflows remains to be verified in more real-world settings.

(Text: Heart of the Machine)
© Copyright notice
The copyright of this article belongs to the author; please do not reprint without permission.