LMArena's latest rankings: DeepSeek-R1's web programming ability catches up with Claude Opus 4
In the open-source model space, DeepSeek delivers another surprise.
On the 28th of last month, DeepSeek shipped a minor update: its R1 reasoning model was upgraded to the latest version (0528), and the model and its weights were released publicly.
R1-0528 further improves benchmark performance, enhances front-end capabilities, reduces hallucinations, and adds support for JSON output and function calling.
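As a rough sketch of what the newly supported JSON output and function calling look like in practice (assuming an OpenAI-compatible chat-completions API; the model name, endpoint conventions, and the `get_weather` tool are illustrative placeholders, not details from the article):

```python
import json

def build_request(user_message: str) -> dict:
    """Build an illustrative chat-completion payload that asks for JSON
    output and declares one callable tool (function calling)."""
    return {
        "model": "deepseek-reasoner",  # assumed model identifier
        "messages": [{"role": "user", "content": user_message}],
        # Ask the model to emit a JSON object rather than free-form text.
        "response_format": {"type": "json_object"},
        # Declare a function the model may choose to call.
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical example tool
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    }

payload = build_request("What's the weather in Hangzhou? Reply in JSON.")
print(json.dumps(payload, indent=2))
```

In this style of API, the model either returns a JSON object in the message content or a `tool_calls` entry naming the function and its arguments, which the client then executes and feeds back.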

Today, LMArena, the well-known public benchmarking platform for large models (recently embroiled in controversy over allegedly favoring models from OpenAI, Google, and Meta), released its latest performance rankings, in which DeepSeek-R1 (0528) is particularly notable.

On the text benchmark (Text), DeepSeek-R1 (0528) ranked 6th overall and 1st among open models.

Broken down by category:
- Ranked #4 in Hard Prompts
- Ranked #2 in Coding
- Ranked #5 in Math
- Ranked #6 in Creative Writing
- Ranked #9 in Instruction Following
- Ranked #8 in Longer Query
- Ranked #7 in Multi-Turn

In addition, on the WebDev Arena platform, DeepSeek-R1 (0528) tied for first place with closed-source models such as Gemini-2.5-Pro-Preview-06-05 and Claude Opus 4 (20250514), edging out Claude Opus 4 on raw score.

WebDev Arena is a real-time AI programming competition platform developed by the LMArena team. It pits large language models against each other in web development challenges, measuring human preference for each model's ability to build attractive and functional web applications.
DeepSeek-R1 (0528)'s strong showing there has drawn more people to try it.

Commentators have also noted that, given Claude has long been the benchmark in AI programming, DeepSeek-R1 (0528) now matching Claude Opus in performance is a milestone and a pivotal moment for open-source AI.
DeepSeek-R1 (0528) delivers leading performance under the fully open MIT license and rivals the best closed-source models. While this breakthrough is most evident in web development, its impact may extend to programming more broadly.
That said, benchmark scores do not fully capture real-world performance. While DeepSeek-R1 (0528) may match Claude in technical capability, whether it can offer a comparable user experience in day-to-day workflows remains to be verified in more real-world settings.

(Text: Heart of the Machine)
© Copyright notice
The copyright of this article belongs to the author; please do not reprint without permission.