OpenAI releases a new series of models for GPT-4.1! Comprehensively outperforms GPT-4o Smarter and cheaper!

1,763 0

in the wee hours of today1Point.OpenAIConducted a technical live release of the latest model - theGPT-4.1.

apart fromGPT-4.1Beyond that, there areGPT 4.1-Minicap (a poem)GPT 4.1-NanoTwo models that achieve substantial improvements in multimodal processing, code capability, instruction compliance, and cost. In particular, support for100ten thousandtokencontext, which helps tremendously in areas such as financial analysis, fiction writing, and education.

due toGPT-4.1The release of theOpenAIannounced that it will phase out the recently releasedGPT-4.5, whose ability to do so is evident.

Currently, if you want to experienceGPT-4.1And it's impossible to passAPIAuthentication peeps, Microsoft has been in theAzure OpenAIThe model is online and ready to use.

OpenAI发布GPT-4.1全新系列模型！全面超越GPT-4o 更聪明、更便宜

GPT-4.1brief introduction

GPT-4.1One of the biggest highlights is the support100ten thousandtokensContext, which is alsoOpenAIFirst release of a long window model.

Compared to the predecessor model, theGPT-4.1,GPT-4.1 Minicap (a poem)GPT-4.1 NanoCapable of handling up to100ten thousandtokensThe context of theGPT-4o(used form a nominal expression)8Times.

OpenAIexistLong Context EvalsThe test was performed on long texts on theGPT-4.1All three models in the series are able to find the target text at any depth in the corpus, whether it is at the beginning, middle or end, and even at up to100ten thousandtokenscontext, the model is still able to accurately localize the target text.

OpenAIstill in existenceMulti-Round CoreferenceTests were conducted to test the model's ability to comprehend and reason in long contexts by creating synthetic dialogs. In these dialogs, the user and the assistant alternate, where the user may ask the model to generate a poem about a certain topic, then another poem about a different topic, and then a short story about a third topic. The model needs to find specific content in these complex conversations, such as "a second short story about a certain topic".

The test results showed thatGPT-4.1After handling up to128K tokensdata is significantly better than theGPT-4oand for as long as100ten thousandtokenscontext can still maintain high performance.

In the coding aptitude test, theSWEBenchThe evaluation places the model in the Python code base environment to explore the code base, write code and test cases. The results showed that theGPT-4.1 The accuracy of the 55% but (not) GPT-4oonly 33%.

In terms of multilingual coding capabilities, the Ader polyglot benchmark covers a wide range of programming languages and different formatting requirements. gpt-4.1 doubles the differential performance of gpt-4o, making it more efficient in handling multilingual programming tasks, code optimization, and versioning.

In the command-following aptitude test, theOpenAI Build an internal assessment system to simulate API Developer usage scenarios that test the model's ability to follow complex instructions. Each sample contains complex instructions that fall into different categories and are divided into difficulty levels. In the difficulty subset evaluation, theGPT-4.1 exceed by far GPT-4o.

In the Video MME Benchmark of the Multimodal Processing Test, GPT 4.1 comprehended and answered multiple-choice questions on a 30 - 60-minute un-subtitled video, achieving a score of 72%, which is the best current level and a major breakthrough in video content comprehension.

Price.GPT -4.1The series is more competitively priced with improved performance.GPT -4.1 compare GPT-4o Price reduction 26%but (not)GPT -4.1 Nano As the smallest, fastest and cheapest model, per million token The cost is only12Cents.

practical applicationGPT-4.1case (law)

Thomson Reuters is the world's leading provider of financial and legal information, and its professional-gradeAIassistantsCoCounselIt is widely used in legal work.

CoCounsels primary mission is to help legal professionals with complex legal documents and workflows. In the testGPT-4.1At the time, Reuters found that the model excelled at multi-document review, especially when dealing with complex legal workflows involving multiple long documents.

together withGPT-4oCompare.GPT-4.1Multi-document review accuracy in internal ministerial context benchmarking increased by17%. This enhancement is critical for legal professionals as it is directly related to theCoCounselAbility to handle complex legal workflow.

Legal documents often contain multiple long documents that may have complex interrelationships with each other, such as conflicting clauses or additional context.GPT-4.1It has demonstrated great reliability in these areas, accurately identifying the subtle relationships between documents that are critical to legal analysis and decision-making.

And when dealing with multiple legal documents.GPT-4.1The ability to efficiently maintain contextual information across documents and accurately identify conflicting clauses or supplementary information between documents. This

Carlyleis a leading global private equity firm whose business involves extensive financial data analysis and document processing.CarlyleutilizationGPT-4.1to accurately extract granular financial data from multiple long documents that arePDFDocumentation,ExcelTables and other complex formats.

CarlyleThe internal assessment showed thatGPT-4.1Outperforms other available models in retrieving data from large documents50%.

GPT-4.1It performs well in handling very large documents, especially in the retrieval of dense data. The model successfully overcomes key limitations of other models, including retrieval problems, errors of missing information in intermediate positions, and multi-hop reasoning across documents.

These capabilities enableGPT-4.1The ability to more efficiently extract key information from complex financial documents provides the opportunity forCarlyleof analysts provide more accurate and comprehensive data support.

Windsurfis a company that specializes in providing efficient development tools with internal coding benchmarking to evaluateAIThe performance of the model in real development provides an important reference. In the analysis of theGPT-4.1When performing the test, theWindsurfThe model was found to perform better than its predecessor in the coding taskGPT-4oThere has been a significant improvement:GPT-4.1existWindsurfscored better in the internal coding benchmark test thanGPT-4ocome out ahead60%.

WindsurfThe user feedback of theGPT-4.1There is a lot more variation in tool invocation thanGPT-4oIt's more efficient. It's more effective.30%.GPT-4.1The likelihood of repeating unnecessary edits or overly granular steps in the coding process is higher thanGPT-4oReduced by about50%.