In a sense, 2024 was not only a year of technological breakthroughs, but also an important turning point in the industry's maturation.
It was a year in which GPT-4 level models were no longer rare, and many organizations developed models with performance beyond GPT-4; a year in which operational efficiency improved significantly and costs dropped dramatically; and a year in which multimodal LLMs, especially those that support image, audio, and video processing, became more and more common.
Advances in the technology have also led to a boom in application scenarios. Prompt-based app generation became an industry standard, and voice dialogue and real-time camera interaction turned sci-fi scenarios into reality. When OpenAI launched the o1 series of reasoning models at the end of the year, pioneering a new paradigm of improving performance by optimizing the inference phase, the whole industry took another big step forward.
On December 31st, local time, independent AI researcher and Django co-creator Simon Willison published a retrospective of the key events in the large language model field in 2024, covering nearly 20 themes, important moments, and industry insights.
The following are highlights:
- GPT-4 barriers broken across the board: By the end of 2024, 70 models from 18 organizations had scored higher on the Chatbot Arena leaderboard than the original GPT-4 released in March 2023.
- Training costs for top-tier large models dropped dramatically: DeepSeek v3 achieves performance comparable to models like Claude 3.5 Sonnet for only about $5.57 million in training costs.
- LLM prices fell sharply: The cost of running LLMs has dropped dramatically thanks to increased competition and efficiency. For example, Google's Gemini 1.5 Flash 8B is 27 times cheaper than 2023's GPT-3.5 Turbo. Lower costs will further drive the adoption of LLMs.
- Multimodal vision models became common, and audio and video models began to appear: In 2024, almost every major model vendor released multimodal models capable of handling image, audio, or video inputs. This lets LLMs handle richer types of information and expands their range of applications.
- Voice and live camera modes turned science fiction into reality: Both ChatGPT and Google Gemini now support voice and live camera modes, letting users interact with the model via speech and video. This provides a more natural and convenient way to interact.
- Some GPT-4 level models can run on a laptop: Thanks to improved model efficiency, GPT-4 level models such as Qwen 2.5-Coder-32B and Meta's Llama 3.3 70B can now run on a laptop with 64GB of RAM. This signals that LLM hardware requirements are falling, opening the door to a wider range of applications.
- Prompt-based application generation has become the norm: LLMs can now generate complete interactive applications from prompts, including HTML, CSS, and JavaScript code. Tools such as Anthropic's Claude Artifacts, GitHub Spark, and Mistral Chat's Canvas provide this capability, greatly simplifying application development and giving non-programmers a way to build applications.
- Universal access to the best models lasted only a few months: OpenAI launched the ChatGPT Pro paid subscription, restricting free access to its best models. This reflects the evolution of LLM business models; more paid tiers may emerge in the future.
- "Agents" still haven't really arrived: The term "agent" lacks a clear definition, and its usefulness is in question because LLMs are gullible and readily believe false information. Solving the trustworthiness problem is key to making agents real.
- Evaluation is critical: Well-written automated evaluations for LLM systems are essential for building useful applications. An effective evaluation suite helps developers understand and improve LLM-based products.
- Synthetic training data works well: More and more AI labs are using synthetic data to train LLMs, which helps improve model performance and efficiency. Synthetic data can overcome the limitations of real data and offers more flexible options for LLM training.
- LLMs' environmental impact is mixed: On the one hand, improved model efficiency reduces the energy consumed by a single inference. On the other hand, the race among large tech companies to build LLM infrastructure has led to a wave of data-center construction, increasing pressure on power grids and the environment.
- LLMs are becoming harder to use: As LLM functionality keeps expanding, so does the difficulty of using it well. Users need a deeper understanding of how LLMs work and where their limits lie in order to get the best out of them.
The following is a translation:
GPT-4: From "unreachable" to "universally surpassed"
Over the past year, the field of large language models (LLMs) has undergone a sea change. Looking back to the end of 2023, OpenAI's GPT-4 was still an insurmountable peak, and other AI labs were all pondering the same question: what unique technical secret did OpenAI hold?
Today, one year later, the situation has radically changed: according to the Chatbot Arena leaderboard, the original version of GPT-4 (GPT-4-0314) has fallen to around 70th place. Currently, 70 models from 18 organizations have surpassed this former benchmark in terms of performance.

Google's Gemini 1.5 Pro was the first to break through in February 2024, not only reaching the GPT-4 level, but also delivering two major innovations: it increased the input context length to 1 million tokens (later updated to 2 million) and enabled video input processing for the first time, opening up new possibilities for the entire industry.
Anthropic followed with the Claude 3 series in March, and Claude 3 Opus quickly became the new industry benchmark. Claude 3.5 Sonnet, released in June, pushed performance higher still and kept the same version number even after a major upgrade in October (informally referred to as Claude 3.6 in the industry).
The most significant technological advance of 2024 was the across-the-board increase in models' ability to process long text. Only a year ago, most models were limited to 4,096 or 8,192 tokens, with the notable exception of Claude 2.1, which supported 200,000 tokens; now almost all major providers support 100,000+ tokens. This greatly expands what LLMs can be used for: users can feed in an entire book for analysis, and, more importantly for specialized areas such as programming, supplying large amounts of example code lets the model produce more accurate solutions.
The camp beyond GPT-4 is now quite large. If you browse the Chatbot Arena leaderboards today, GPT-4-0314 has fallen to about 70th place. The 18 organizations with higher-scoring models are Google, OpenAI, Alibaba, Anthropic, Meta, Reka AI, 01 AI, Amazon, Cohere, DeepSeek, Nvidia, Mistral, NexusFlow, Zhipu AI, xAI, AI21 Labs, Princeton and Tencent.
This change is a profound reflection of the rapid evolution of the AI field. Whereas in 2023, surpassing GPT-4 was still a major breakthrough worthy of the history books, by 2024 it seems to have become the basic threshold against which top AI models are measured.
Some GPT-4 level models can run locally on a PC
2024 brought another important breakthrough for large language models: GPT-4 level models can now run on ordinary personal computers. This overturns the assumption that high-performance AI models must rely on expensive data centers.
Take an M2 MacBook Pro with 64GB of RAM: the same machine that could barely run a GPT-3 class model in 2023 can now run multiple GPT-4 class models, including the openly licensed Qwen 2.5-Coder-32B and Meta's Llama 3.3 70B.
This breakthrough is surprising because previously running a GPT-4 class model was thought to require a data center-class server with one or more $40,000+ GPUs.
Even more remarkable is Meta's Llama 3.2 series. Its 1B and 3B versions don't reach GPT-4 level, but their performance far exceeds what their size would suggest. Users can even run Llama 3.2 3B on an iPhone via the MLC Chat iOS app: the model takes only about 2GB of storage and generates content at roughly 20 tokens per second.
The fact that they can run is a testament to the incredible training and inference performance gains that many of the models have made over the past year.
Model prices plummet due to competition and efficiency gains
There has been a sharp decline in the price of large models over the past 12 months.
In December 2023, OpenAI charged $30 per million input tokens (mTok) for GPT-4. Today, $30/mTok gets you OpenAI's most expensive model, o1. GPT-4o costs $2.50/mTok (12x cheaper than GPT-4), and GPT-4o mini costs $0.15/mTok, nearly 7x cheaper than GPT-3.5 and far more capable.
Other model vendors charge even less. Anthropic's Claude 3 Haiku (launched in March, but still its cheapest model) costs $0.25/mTok. Google's Gemini 1.5 Flash costs $0.075/mTok, and their Gemini 1.5 Flash 8B costs $0.0375/mTok, 27 times cheaper than last year's GPT-3.5 Turbo.
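To put these numbers in perspective, here is a small back-of-the-envelope sketch using only the input-token prices quoted above (output tokens are billed separately and at different rates, so this is not a full cost model):

```python
# Rough cost comparison using the input-token prices quoted above (USD per million tokens).
# Output tokens are priced separately and are not included in this sketch.
PRICES_PER_MTOK = {
    "gpt-4 (Dec 2023)": 30.00,
    "o1": 30.00,
    "gpt-4o": 2.50,
    "gpt-4o-mini": 0.15,
    "claude-3-haiku": 0.25,
    "gemini-1.5-flash": 0.075,
    "gemini-1.5-flash-8b": 0.0375,
}

def input_cost(model: str, tokens: int) -> float:
    """Cost in USD to send `tokens` input tokens to `model`."""
    return PRICES_PER_MTOK[model] * tokens / 1_000_000

# Example: feeding a 100,000-token codebase into each model once.
for model, _ in sorted(PRICES_PER_MTOK.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:24s} ${input_cost(model, 100_000):8.4f}")
```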
These price declines were driven by two factors: increased competition and efficiency gains.
Multimodal LLM Emergence
A year ago, the most notable example was GPT-4 Vision, released at OpenAI's DevDay in November 2023. Google's multimodal model Gemini 1.0 was released on December 7, 2023.
2024 saw the release of multimodal models from almost every major model vendor. We saw Anthropic's Claude 3 series in March, Gemini 1.5 Pro (image, audio, and video) in April, and then September brought Qwen2-VL, Mistral's Pixtral 12B, and Meta's Llama 3.2 11B and 90B vision models. We got audio inputs and outputs from OpenAI in October, then saw Hugging Face's SmolVLM in November, and image and video models from Amazon Nova in December.
Multimodality is a huge step forward for LLMs, and being able to run prompts against images (as well as audio and video) is a fascinating new way to apply these models.
Voice and live video unleash the imagination
The audio and live video modes that are beginning to emerge deserve special mention.
The ability to talk to ChatGPT first arrived in September 2023, although at that point it was really just a pipeline between a speech-to-text model and a new text-to-speech model.
GPT-4o, released on May 13, demonstrated a new voice mode that takes audio input and produces remarkably realistic-sounding speech output, without needing separate TTS or STT models.
When ChatGPT's Advanced Voice mode finally rolled out (slowly, from August through September), the results were impressive. OpenAI isn't the only team with a multimodal audio model: Google's Gemini also accepts audio input, and the Google Gemini app can now speak in a similar way to ChatGPT. Amazon has also previewed a voice model for Amazon Nova, expected in the first quarter of 2025.
Google's NotebookLM, released in September, takes audio output to a new level by allowing two "podcast hosts" to have creepily realistic conversations about anything you type into their tool.
In December, live video became the new focus. ChatGPT can now share your camera with the model and discuss what you see in real time, and Google Gemini has shown a preview with the same capability.
Prompt-driven application generation is already a commodity
This was already possible with GPT-4 in 2023, but the value it provides didn't become apparent until 2024.
Large models are great at writing code, and if you prompt them right, they can build a complete interactive application using HTML, CSS, and JavaScript.
Anthropic pushed this idea hard when it released Claude Artifacts, a groundbreaking new feature: Claude can write an interactive application on demand and then let you use it directly inside the Claude interface.
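Artifacts is a feature of the Claude interface, but the underlying idea, asking a model for a complete single-file app, can be approximated with a plain API call. The following is a minimal sketch, not Anthropic's Artifacts implementation; it assumes the anthropic Python package is installed and an ANTHROPIC_API_KEY environment variable is set, and the model identifier is only an example that may be out of date:

```python
# Hedged sketch: ask Claude for a self-contained HTML/CSS/JS app and save it to disk.
# This is not the Artifacts feature itself, just the underlying idea of
# prompt-driven app generation. The model name is an example and may change.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

prompt = (
    "Write a complete single-file HTML page (inline CSS and JavaScript, "
    "no external dependencies) that implements a simple pomodoro timer."
)

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # example identifier, check current docs
    max_tokens=4000,
    messages=[{"role": "user", "content": prompt}],
)

with open("pomodoro.html", "w") as f:
    f.write(message.content[0].text)  # open this file in a browser to try the app
```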
Since then, many other teams have built similar systems. GitHub released its version, GitHub Spark, in October, and Mistral Chat added a similar feature called Canvas in November.
This prompt-driven custom interface is so powerful and easy to build that it is expected to appear as a feature in a wide range of products by 2025.
Free access to the best models lasted only a few short months
For a few months this year, three of the best models, GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, were available for free in most parts of the world.
OpenAI made GPT-4o free to all users in May, and Claude 3.5 Sonnet has been free since its release in June. This is a significant change, as free users have mostly been limited to GPT-3.5 level models for the past year.
With the launch of ChatGPT Pro by OpenAI, that era seems to be over, and possibly forever. This $200 per month subscription service is the only way to access its most powerful model, o1 Pro.
Since the trick behind the o1 series (and other future models) is to spend more computing time to get better results, I think the days of free access to the best available models are unlikely to return.
"Agent" doesn't really exist yet.
The term "Agent" is very frustrating because it lacks a single, clear and widely understood meaning. If you tell me that you are building an "Agent", then you are not communicating anything to me.
The two main camps of "agent" that I see are: one that treats agents as systems that act on your behalf, like a travel agent; and another that treats agents as large language models (LLMs) that can access tools and run them in a loop while solving a problem. The word "autonomy" is often thrown in as well, again without a clear definition.
Regardless of the meaning of the term, Agent still has that ever-present "coming soon" feeling. Terminology aside, I remain skeptical about the utility of Agents.
Evaluation really matters
In 2024, one thing became clear: writing good automated evaluations for LLM-driven systems is the skill most needed to build useful applications on top of these models.
If you have a robust evaluation suite, you can adopt new models faster, iterate better, and build product features that are more reliable and useful than your competitors.
Everyone knows that assessments are important, but there is still a lack of good guidance on how best to implement them.
Apple Intelligence sucks, Apple's MLX library is great!
As a Mac user, last year I felt that not having a Linux/Windows machine with an NVIDIA GPU put me at a huge disadvantage for trying out new models. Things were much better in 2024.
In practice, many of the models are released as model weights and libraries that are more biased toward supporting NVIDIA's CUDA than other platforms.
The llama.cpp ecosystem helped a lot in this regard, but the real breakthrough was Apple's MLX library, "an array framework for Apple Silicon". It's pretty awesome.
Apple's mlx-lm Python library supports running a wide range of MLX-compatible models on my Mac with excellent performance, and the mlx-community organization on Hugging Face provides more than 1,000 models already converted to the required format.
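As a rough illustration of what that looks like in practice, here is a minimal mlx-lm sketch; the model repository name is only an example of an mlx-community conversion, and the generate() options may vary between mlx-lm versions:

```python
# Hedged sketch: run a quantized model from the mlx-community Hugging Face org
# on Apple Silicon using Apple's mlx-lm library (`pip install mlx-lm`).
# The repository name below is illustrative; pick any MLX-converted model.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.3-70B-Instruct-4bit")

prompt = "Explain, in one paragraph, why long context windows matter for coding assistants."
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```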
While MLX is a game changer, Apple's own "Apple Intelligence" features have been mostly disappointing, and Apple's LLM features are a poor imitation of cutting edge LLM features.
The rise of "reasoning" models
The most interesting development in the last quarter of 2024 was the emergence of a new class of reasoning models, exemplified by OpenAI's o1 models, originally released on September 12 as o1-preview and o1-mini.
The biggest innovation of reasoning models is that they open up a new way to scale: instead of improving performance purely by adding compute at training time, models can tackle harder problems by spending more compute at inference time.
o3, the successor to o1, was announced on December 20 and achieved impressive results on the ARC-AGI benchmark, though not cheaply: the total compute cost was estimated at more than $1 million. o3 is expected to become officially available in January 2025.
OpenAI is not the only player in this category. Google released its first entrant, gemini-2.0-flash-thinking-exp, on December 19. Alibaba's Qwen team released their QwQ model on November 28, and DeepSeek made its DeepSeek-R1-Lite-Preview model available for trial through its chat interface on November 20. Anthropic and Meta have not released anything in this category yet, but they will surely follow.
Was the best LLM trained in China for less than $6 million?
The big news at the end of 2024 was the release of DeepSeek v3, a huge 685B-parameter model, with some benchmarks placing its performance alongside Claude 3.5 Sonnet.
The "vibes benchmark" (the Chatbot Arena) currently ranks it 7th, just behind the Gemini 2.0 and OpenAI 4o/o1 models. This is the highest-ranked openly licensed model to date.
What's really impressive about DeepSeek v3 is its training cost. The model was trained on 2,788,000 H800 GPU-hours at an estimated cost of $5,576,000. Llama 3.1 405B was trained on 30,840,000 GPU-hours, 11 times as many as DeepSeek v3 used, yet its benchmark results are slightly worse.
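Those figures imply a rate of about $2 per H800 GPU-hour, which is where the headline cost comes from. A quick check of the arithmetic, using only the numbers quoted above:

```python
# Back-of-the-envelope check of the training-cost figures quoted above.
deepseek_gpu_hours = 2_788_000      # H800 GPU-hours for DeepSeek v3
deepseek_cost_usd = 5_576_000       # estimated training cost

implied_rate = deepseek_cost_usd / deepseek_gpu_hours
print(f"Implied cost per GPU-hour: ${implied_rate:.2f}")   # -> $2.00

llama_405b_gpu_hours = 30_840_000   # Llama 3.1 405B GPU-hours
print(f"Compute ratio: {llama_405b_gpu_hours / deepseek_gpu_hours:.1f}x")  # -> ~11.1x
```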
Improved environmental impact
A promising result of improved model efficiency (for both hosted models and models run locally) is that the energy usage and environmental impact of running an individual prompt has dropped significantly over the past couple of years.
But there is still tremendous competitive pressure to build the infrastructure to train and run the models. Companies such as Google, Meta, Microsoft and Amazon have invested billions of dollars in new data centers, which have very significant impacts on the power grid and the environment, and there is even talk of building new nuclear power plants.
Is all this infrastructure necessary? DeepSeek v3's roughly $6 million training cost and the continued fall in the price of large models may suggest that it is not.
Synthetic training data works well
It's now popular to suggest that as the internet becomes flooded with AI-generated garbage, the models themselves will degrade, feeding on their own output and eventually leading to their inevitable demise.
But that's clearly not going to happen. Instead, we're seeing AI labs increasingly using synthetic content for training - deliberately creating artificial data to help steer their models in the right direction. Synthetic data is becoming more commonplace as an important part of pre-training.
Another common technique is to use larger models to help create training data for their smaller, cheaper alternatives, an approach used by a growing number of labs: DeepSeek v3, for example, used "reasoning" data created by DeepSeek-R1.
Careful crafting of the training data that goes into an LLM appears to be central to how these models are built. The days of grabbing a full scrape of the web and indiscriminately dumping it into a training run are gone.
Large models are becoming increasingly difficult to use
One of the points I've been emphasizing is that LLMs are power-user tools. They seem simple enough (how hard can it be to type a message to a chatbot?), but in reality, getting the most out of them while avoiding their many pitfalls requires deep understanding and experience.
If anything, the problem has gotten worse in 2024.
We've built computer systems that can converse in human language, that can answer your questions and usually get them right... but it depends on the type of question, the way it's asked, and whether the question happens to be well represented in the undisclosed, secret training data.
The default LLM chat interface is like dropping brand-new users with no computer experience into a Linux terminal and expecting them to figure it all out on their own. At the same time, end users' mental models of these tools are increasingly inaccurate and full of misconceptions.
Many better-informed people have given up on LLMs entirely because they can't see how anyone could benefit from a tool with so many flaws. The key skill for getting the most value out of LLMs is learning to work with technology that is simultaneously unreliable and extremely powerful, and that is a decidedly non-obvious skill to acquire.
Extremely uneven distribution of knowledge
Most people have heard of ChatGPT by now, but how many have heard of Claude? There is a huge knowledge gap between the people who actively follow this technology and the 99% who don't.
The speed of change doesn't help either. In the past month alone, we've seen live video interfaces arrive, where you can point your phone's camera at an object and talk about it with your voice... Most people who consider themselves geeks haven't even tried this feature yet.
Considering the continued (and potential) impact of this technology on society, I think the size of the gap is unhealthy. I wish more effort could be put into improving this.
LLMs need better criticism
Many people absolutely hate large-model technology. In some public forums, merely suggesting that "LLMs are useful" is enough to spark a heated debate.
There are many reasons to dislike this technology - environmental impact, the (lack of) ethicality of the training data, lack of reliability, negative applications, and the potential negative impact on people's jobs.
LLMs absolutely deserve criticism. We need to talk through these problems, find ways to mitigate them, and help people learn how to use these tools responsibly so that their positive applications outweigh the negative impacts.
Link to original article: https://simonwillison.net/2024/Dec/31/llms-in-2024/