AI Agent Technology Explained: Principles and Applications

1. AI Agent (LLM Agent) Introduction

1.1 Terminology

  • Agent: "agency" usually refers to the manifestation of intentional action. In philosophy, an agent can be a human being, an animal, or even a concept or entity with autonomy.

  • AI Agent: An AI Agent (Artificial Intelligence Agent) is an intelligent entity capable of sensing its environment, making decisions and performing actions.

  • RPA: RPA (Robotic Process Automation) is a software automation technology that automates business processes by mimicking a human's manual actions on a computer, such as opening a website, clicking the mouse, or typing. RPA systems can automate large volumes of repetitive, rule-based workflow tasks; in banking, for example: paper document entry, document and ticket validation, data extraction from emails and documents, cross-system data migration, and automated IT application operations. The key benefits of RPA include reduced labor costs, increased productivity, low error rates, monitorable operations, and short development cycles. It plays an important role in fields such as finance, office automation, and IT process automation.

  • Copilot: literally an airplane's "co-pilot." Here, Copilot refers to assistants built on an underlying Large Language Model (LLM): the user gives a brief instruction in a few words, and the Copilot creates text and other content similar to what a human would write.

  • LangChain: LangChain is a powerful framework designed to help developers build end-to-end applications on top of language models. It provides a set of tools, components, and interfaces that simplify creating applications powered by Large Language Models (LLMs) and chat models. LangChain is a language model integration framework whose use cases largely overlap with those of language models themselves, including document analysis and summarization, chatbots, and code analysis.

  • LLM: A Large Language Model (LLM) is an artificial intelligence model that uses deep learning techniques and massive datasets to understand, summarize, generate, and predict new content.

  • Sensory Memory: Sensory memory is the first stage of information processing, the brief storage of information received through the senses, usually lasting only a few hundred milliseconds to a few seconds. For example, after you look at a photo of a beautiful landscape, if you close your eyes and briefly "see" the photo's colors and shapes in your mind's eye, that is sensory memory at work.

  • Short-term memory: Short-term memory is like a mental workbench that temporarily stores and processes small amounts of information. For example, when you try to remember a phone number, you might recite it over and over until you dial it; that is short-term memory at work. All in-context learning uses the model's short-term memory to learn.

  • Long-term memory: Long-term memory is like a big warehouse that stores our experience, knowledge, and skills, and this storage can last a very long time, even a lifetime. For example, once you learn to ride a bicycle, you still remember how even after years without riding; that is long-term memory. For an Agent, long-term memory is generally realized through external vector storage with fast retrieval.

  • Memory Stream: "Memory" stores the Agent's past sequences of observations, thoughts, and actions. Just as the human brain relies on memory systems to draw on prior experience when formulating strategies and making decisions, the Agent needs specific memory mechanisms to handle sequences of tasks proficiently.

  • MRKL (Modular Reasoning, Knowledge and Language): MRKL can be understood as a neuro-symbolic architecture for autonomous agents that treats reasoning, knowledge understanding, and linguistic ability as distinct modules. Like building blocks, each module represents one of the AI's abilities, and together they allow the AI to engage in complex thinking and communication.

  • TALM (Tool Augmented Language Models): Tool-augmented language models are language models enhanced with tools or techniques, usually through fine-tuning. For example, an AI chatbot can answer questions or provide information more accurately by tapping into a search engine or other databases.

  • Subgoal and decomposition: In problem solving, Agents often decompose a large goal into several smaller goals (subgoals) to achieve efficient processing of complex tasks. For example, to prepare a dinner, you may need to go shopping first (Subgoal 1), then prepare the ingredients (Subgoal 2), and finally cook (Subgoal 3).

  • Reflection and refinement: Agents can self-critique and self-reflect on historical actions, learn from mistakes, and make improvements for future steps to improve the quality of the end result. Just like after writing an essay, you review and revise grammatical errors or unclear expressions to make the essay better.

  • Chain-of-thought (CoT): has become a standard prompting technique for improving model performance on complex tasks. The model is asked to "think one step at a time," breaking a difficult task down into smaller, simpler steps. Chains of thought turn a large task into multiple manageable tasks and help people understand the model's reasoning process. A chain of thought is the logical reasoning process used in problem solving. For example, to work out why the sky is blue, you might think: "Light is made up of different colors... blue light has a short wavelength and is easily scattered by the atmosphere... so the sky looks blue."

  • Tree of Thoughts (ToT): extends the chain of thought by exploring multiple reasoning possibilities at each step of a task. It creates a tree structure by first breaking the problem into multiple thinking steps and generating multiple candidate thoughts at each step. The search process can be BFS (breadth-first search) or DFS (depth-first search). A tree of thoughts resembles a large tree, with each branch representing a direction or idea to pursue; it can help us organize and visualize complex thinking processes.

  • Self Reflection: Self-reflection is the process of thinking deeply and analyzing one's own actions, thoughts, or feelings. It's like looking back at what you did at the end of the day and assessing what you did well and what you need to improve.

  • ReAct: combines the separate action and language spaces of a task so that the large model's reasoning and actions are integrated. The pattern helps the large model interact with the environment (e.g., using the Wikipedia search API) while leaving a reasoning trail in natural language. Its main components are Thought, Action, and Observation.

  • Reflexion: a framework that equips AI Agents with dynamic memory and self-reflection to improve reasoning. It follows the ReAct setup and provides simple binary rewards. After each action, the AI Agent computes a heuristic function and, based on the result of self-reflection, decides whether to reset the environment and start a new trial. The heuristic determines whether the current path is inefficient (taking too long without success) or contains hallucination (encountering a sequence of identical actions that lead to the same observation), and triggers a reset if either condition is detected.

  • Self-ask: may refer to an AI system autonomously asking itself questions to guide its thought process when working on a problem, much as a human facing a problem asks, "What should I do next?" to drive problem solving forward.

  • Chain of Hindsight: explicitly presents the model with a series of its past outputs, encouraging it to improve so that the next predicted action is better than the previous trial. Algorithm Distillation applies the same idea to cross-episode trajectories in reinforcement learning tasks.

1.2 The meaning of the term "Agent": what is an Agent?

1.2.1 Agent Origin

Many people may wonder why Agent, which doesn't seem so far removed from LLMs, is so popular these days, rather than the technology being called "LLM application" or something else. This brings us to the origin of the term: "agent" is a very old word, dating back to Aristotle and Hume, among others. In a philosophical sense, an "agent" is an entity that has the capacity to act, and "agency" denotes the exercise or manifestation of that capacity. In a narrower sense, "agency" usually refers to the manifestation of intentional action; accordingly, "agent" denotes an entity that possesses desires, beliefs, intentions, and the ability to act. Note that agents include not only individual human beings but also other entities in the physical and virtual worlds. Importantly, the concept of "agency" relates to the autonomy of individuals, empowering them to exercise their will, make choices, and take action rather than react passively to external stimuli.

It may come as a surprise that relatively little attention was paid to Agent-related concepts by researchers in the mainstream AI community until the mid-to-late 1980s. Since then, however, interest in the topic has grown considerably in the computer science and AI communities. As Wooldridge et al. put it, we can define AI in this way, "It is a subfield of computer science that seeks to design and build computer-based Agents that exhibit various aspects of intelligent behavior." Thus, we can think of Agent as a core concept of AI.

When the concept of Agent was introduced into artificial intelligence, its meaning changed somewhat. In philosophy, an Agent can be a human, an animal, or even a concept or entity with autonomy. In AI, however, an Agent is a computational entity. Because concepts such as consciousness and desire seem metaphysical when applied to computational entities, and because we can only observe a machine's behavior, many AI researchers, including Alan Turing, have suggested setting aside for now the question of whether an Agent "really" thinks or has a "mind." Instead, researchers describe Agents through other attributes, such as autonomy, reactivity, proactivity, and social ability. Some researchers argue that intelligence is "in the eye of the beholder" rather than an innate, isolated attribute. In essence, an AI Agent is not the same as a philosophical agent; it is a materialization of the philosophical concept of agent in the field of artificial intelligence.

There is still no single unified name for the concept: "AI Agent," "intelligent agent," and "intelligent entity" are all in use. The rest of this article explains what an AI Agent is, along with its technical principles and application scenarios.

1.2.2 What is an AI Agent?


An AI Agent (Artificial Intelligence Agent) is an intelligent entity capable of sensing the environment, making decisions, and performing actions. Unlike traditional AI, an AI Agent can think independently and invoke tools to accomplish a given goal step by step. For example, if you tell an AI Agent to order takeaway, it can directly call an app to select the food and then call a payment program to place the order, without a human specifying each step. The concept of Agent was proposed by Minsky in his 1986 book The Society of Mind. Minsky believed that certain individuals in a society can, after negotiation, find a solution to a problem, and these individuals are Agents; he also believed Agents should be socially interactive and intelligent.

The concept of Agent was introduced into artificial intelligence and computing and rapidly became a research hotspot. However, due to limitations of data and computing power, the conditions needed to realize truly intelligent AI Agents were long absent.

What distinguishes an AI Agent from a large language model is that the Agent can think and act independently; what distinguishes it from RPA is that it can handle unknown environmental information. After the birth of ChatGPT, AI could hold genuine multi-turn conversations with humans and give concrete answers and suggestions to specific questions. Subsequently, "Copilots" launched in various fields, such as Microsoft 365 Copilot, Microsoft Security Copilot, GitHub Copilot, and Adobe Firefly, making AI the "intelligent co-pilot" for office work, coding, design, and other scenarios. The differences between an AI Agent and a large model are:

  • Interaction between large models and humans is based on prompts, and the clarity of the user's prompt affects the quality of the model's answers; for example, both ChatGPT and the various Copilots need a clearly specified task to give a useful answer.

  • An AI Agent only needs to be given a goal; it can then think and act independently toward that goal. It breaks the given task down into detailed planning steps and, relying on feedback from the outside world and its own thinking, creates prompts for itself to achieve the goal. If Copilot is the "co-pilot," then the Agent can be considered a junior "main driver."

Compared with traditional RPA, which can only handle work under given conditions following a process preset in the program and cannot operate in environments with large amounts of unknown, unpredictable information, an AI Agent can interact with the environment, perceive information, and think and act accordingly.


The AI Agents we see tend to use Q&A bots as the interaction portal, triggering a fully automated workflow through natural language, with no human intervention in between. Since humans are only responsible for sending commands, they are not involved in giving feedback on AI results.

1.2.3 Why you need an AI Agent

Some disadvantages of LLM:

  • It hallucinates

  • The results are not always true

  • Limited or no knowledge of current events

  • Difficult to cope with complex calculations

  • Inability to act

  • No long-term memory capacity

For example, if you ask ChatGPT to buy a cup of coffee, its reply is usually something like "I can't buy coffee; I'm just a text-based AI assistant." But if you give the same instruction to a ChatGPT-based AI Agent tool, it will first break down how to buy a cup of coffee, formulating steps such as opening a delivery app and making a payment, and then follow those steps: call the app to choose the order, then call the payment program to place the order and pay, without a human specifying each step. This is where the AI Agent comes in, using external tools to overcome these limitations. What is a tool here? A tool is a plugin, an integrated API, a code library, and so on, that the Agent uses to accomplish specific tasks, for example:

  • Google search: accessing up-to-date information

  • Python REPL: executing code

  • Wolfram: performing complex calculations

  • External APIs: fetching specific information

LangChain provides a common framework for invoking these tools through the instructions of a large language model. As we all know, executing a complex task requires weighing many factors, so the complex task is split into smaller subtasks for execution. The AI Agent was born to handle such complex tasks; in terms of how they process them, AI Agents fall into two main categories: action-oriented, and planning-and-execution. In short, an AI Agent is a computational entity that, combined with a large model, can automatically think, plan, verify, and execute in order to accomplish a given task goal. If the large model is compared to the brain, the AI Agent can be understood as the cerebellum plus arms and legs.
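The tool-use idea can be sketched as follows. This is not LangChain's actual API, just a minimal hand-rolled dispatcher; the tool names and stand-in implementations are illustrative:

```python
# Minimal sketch of agent tool dispatch (illustrative, not real LangChain code):
# the agent keeps a registry of named tools and executes the chosen one.
from typing import Callable, Dict

class ToolAgent:
    def __init__(self) -> None:
        self.tools: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self.tools[name] = fn

    def run(self, tool_name: str, query: str) -> str:
        # A real agent would let the LLM pick the tool and its input;
        # here the caller names the tool explicitly for clarity.
        if tool_name not in self.tools:
            raise KeyError(f"unknown tool: {tool_name}")
        return self.tools[tool_name](query)

agent = ToolAgent()
agent.register("calculator", lambda q: str(eval(q)))      # stand-in for Wolfram
agent.register("search", lambda q: f"results for: {q}")   # stand-in for Google search

print(agent.run("calculator", "23 - 20 + 6"))  # -> 9
```

In a real agent loop, the LLM itself would decide which registered tool to call and with what input; the registry pattern stays the same.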

1.2.4 AI Agent vs. Human vs. Other AI Synergy


The AI Agent is more independent than the Copilot pattern widely used so far. In terms of AI-human interaction, there has been a shift from embedded tool-like AI (e.g., Siri) to assistant-like AI. Rather than mechanically executing human commands, today's AI Copilots can participate in human workflows, collaborating with humans by advising on matters such as writing code, planning events, and optimizing processes. An AI Agent goes further: given only a goal, it can think and act independently toward it, breaking the task into detailed planning steps and relying on external feedback and its own thinking to create prompts for itself. If Copilot is the "co-pilot," then the Agent can be considered a junior "main driver."

1.3 AI Agent Cases

1.3.1 AI virtual towns


As Valentine's Day approaches, Isabella, a coffee-shop owner living in a small town called "Smallville," decides to throw a Valentine's Day party. She invites her best friend Maria to set it up with her; Maria, learning of the party, secretly invites her crush Klaus to go with her... On the same timeline in Smallville, Tom, a married man who cares deeply about politics, has a strong interest in the town's upcoming mayoral election and declines Isabella's invitation to the Valentine's Day party. The plot above doesn't take place in the real world, but neither is it fiction written by humans: it comes from a virtual town of 25 AI characters. Everything that happens in the town results from spontaneous interactions between the AIs, and the town has been running smoothly for two days.

1.3.2 Using AutoGPT for market research

Suppose you run a shoe company: give AutoGPT the task of doing market research on waterproof shoes, then ask it to list the top 5 companies and report on their competitors' strengths and weaknesses.

  • First, AutoGPT goes straight to Google and searches for the top 5 companies by overall waterproof-shoe reviews. Once it finds the relevant links, AutoGPT asks itself follow-up questions such as "what are the pros and cons of each shoe," "pros and cons of each of the top 5 waterproof shoes," "top 5 waterproof shoes for men," and so on.

  • After that, AutoGPT continued to analyze various other websites and, combining this with Google searches, kept updating its queries until it was satisfied with the results. Along the way, AutoGPT judged which reviews might be fake, so it verified the reviewers.

In the process, AutoGPT even spawned its own sub-agents to analyze websites and find solutions to problems, all on its own. The result was a very detailed report on the top 5 waterproof-shoe companies, with each company's strengths and weaknesses, plus a concise conclusion. The whole run took 8 minutes and cost 10 cents, with no optimization at all.

AutoGPT official public demo: https://www.bilibili.com/video/BV1HP411B7cG/?vd_source=6a57ee58a99488dc38bda2374baa1c10

2. Framework for the AI Agent

The sections above describe what an AI Agent is, along with some case studies; the following sections analyze the technology behind it. A large-model-based AI Agent system can be broken down into four components: the large model, planning, memory, and tool use. In June, Lilian Weng, OpenAI's director of applied research, wrote a blog post arguing that AI Agents could mark the beginning of a new era. She proposed the architecture Agent = LLM + planning skills + memory + tool use, in which the LLM plays the role of the Agent's "brain," providing reasoning, planning, and other capabilities.


2.1. Big Model + Planning: Agent's "Brain", Task Decomposition through Chain of Thought Capabilities

An LLM has logical reasoning capability, and the Agent can bring it out. When the model is large enough, the LLM itself is capable of reasoning: on simple reasoning problems LLMs already perform very well, but on complex reasoning problems they still make mistakes. Often, the reason users cannot get the desired answer from an LLM is that the prompt is not well suited to eliciting its reasoning ability; adding extra prompting to assist reasoning can greatly improve results. In the tests of the paper "Large language models are zero-shot reasoners," after appending "Let's think step by step" to the questions, reasoning accuracy on the GSM8K math test set rose from 10.4% to 40.7%. An Agent, as an intelligent entity, can create suitable prompts for itself according to the given goal, better stimulating the large model's reasoning ability.
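The zero-shot trick from that paper amounts to a one-line prompt transformation; the function name and exact template below are illustrative assumptions:

```python
# Zero-shot CoT sketch: append the trigger phrase from the paper to a plain
# question before sending it to the model.
def zero_shot_cot(question: str) -> str:
    return f"Q: {question}\nA: Let's think step by step."

prompt = zero_shot_cot(
    "The cafeteria has 23 apples. They use 20 and buy 6 more. "
    "How many apples are there now?"
)
print(prompt)
```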

Typically, a complex task involves many steps. The AI Agent needs to break these steps down first and plan ahead. Task decomposition can be accomplished in three ways:
1) entering simple prompts into the large model, such as "Steps for XYZ" or "What are the subgoals for achieving XYZ?";
2) using task-specific instructions, such as asking the large model to "write a story outline" when writing a novel;
3) providing information manually. The prevailing techniques today include chains of thought and trees of thought:

Chain of Thought (CoT)

Chain of Thought (CoT) has become a standard prompting technique used to improve model performance on complex tasks. The model is asked to "think one step at a time," breaking difficult tasks down into smaller, simpler steps. Chains of thought transform large tasks into multiple manageable tasks and help people understand the model's thought process.


Take a math problem as an example. With standard prompting, the model input is:

Q: Roger has 5 tennis balls, and he bought two more boxes of tennis balls, each with 3 tennis balls. How many tennis balls does he now have?
A: The answer is 11

Q: The cafeteria has 23 apples. If they use 20 and then buy 6 more, how many apples do they have now?
Model output:
A: The answer is 27

You can see that the model fails to give the correct answer. But if we give the model some hints on how to solve the problem, just as on a math exam we must write out the solution process before the final answer or we score no points, the result changes. That is exactly what CoT does. An example follows. CoT prompting, model input:

Q: Roger has 5 tennis balls, and he bought two more boxes of tennis balls, each with 3 tennis balls. How many tennis balls does he now have?
A: Roger starts with 5 tennis balls, 2 boxes of 3 tennis balls, which is a total of 2*3=6 tennis balls, 5+6=11.
The answer is 11.

Q: The cafeteria has 23 apples. If they use 20 and then buy 6 more, how many apples do they have now?
Model output:
A: The cafeteria originally had 23 apples, and they used up 20, so there are 23-20 = 3 left. They bought 6 more, so now there are 6+3=9.
The answer is 9.

As you can see, for similar arithmetic problems, chain-of-thought prompting makes the model automatically produce the reasoning steps before giving the answer. Chain-of-thought prompting takes a multi-step reasoning problem, breaks it into many intermediate steps, allocates more computation by generating more tokens, and then stitches those intermediate answers together to solve the problem.
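Few-shot CoT prompting of this kind amounts to prepending a worked exemplar so the model imitates the step-by-step format; the template wording below is illustrative:

```python
# Sketch of few-shot CoT prompting: a worked exemplar with intermediate
# steps is prepended to the new question.
COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more boxes of tennis balls, "
    "each with 3 balls. How many tennis balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 boxes of 3 balls is 2*3=6 balls. "
    "5+6=11. The answer is 11.\n"
)

def cot_prompt(question: str) -> str:
    # The model is expected to continue after the trailing "A:" with its
    # own reasoning steps, mimicking the exemplar's format.
    return f"{COT_EXEMPLAR}\nQ: {question}\nA:"

print(cot_prompt("The cafeteria has 23 apples. They use 20 and buy 6 more. "
                 "How many apples do they have now?"))
```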

Tree of Thoughts (ToT)

Tree of Thoughts extends the chain of thought by exploring multiple reasoning possibilities at each step of a task. It creates a tree structure by first breaking the problem down into multiple thought steps and generating multiple candidate thoughts at each step. The search process can be either BFS (breadth-first search) or DFS (depth-first search). ToT involves four components: thought decomposition, thought generation, state evaluation, and the search algorithm.


An example of the ToT Prompt is as follows.

Suppose three different experts answer this question. All experts write down the first step in their thinking about the question and share it. Then all the experts write down the next step in their thinking and share it. And so on, until all the experts have written down all the steps in their thinking. As soon as the group realizes that an expert's step is wrong, let that expert leave. Ask...
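The BFS variant of this search can be sketched as a beam search over candidate thoughts. Here `generate()` and `score()` are toy stubs standing in for the LLM's thought generator and state evaluator:

```python
# Minimal BFS sketch of Tree of Thoughts: expand candidate thoughts at each
# step, score them, and keep only the best `beam` states.
from typing import List

def generate(state: str) -> List[str]:
    # Stub thought generator: a real system would ask the LLM for
    # continuations of the partial reasoning in `state`.
    return [state + "A", state + "B", state + "C"]

def score(state: str) -> int:
    # Stub state evaluator: a real system would ask the LLM to rate the
    # partial reasoning. This toy version just prefers 'A's.
    return state.count("A")

def tot_bfs(root: str, depth: int, beam: int) -> str:
    frontier = [root]
    for _ in range(depth):
        candidates = [s for st in frontier for s in generate(st)]
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]  # prune to the beam width
    return frontier[0]

print(tot_bfs("", depth=3, beam=2))  # -> AAA
```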

On the other hand, trial-and-error and error correction are inevitable and crucial steps in real-world task decision-making. Self-reflection helps AI Agents to refine past action decisions, correct previous mistakes, and thus continuously improve. Current technologies include ReAct, Reflexion, Chain of Hindsight, and others.

ReAct

ReAct: combines separate behavioral and linguistic spaces in a task so that the reasoning and actions of the Big Model are integrated. The pattern helps the Big Model interact with the environment (e.g., using the Wikipedia search API) and leave a trail of reasoning in natural language.

ReAct paper: "ReAct: Synergizing Reasoning and Acting in Language Models." https://react-lm.github.io/


QUESTION: Aside from the Apple Remote, what other devices can control the relevant software? ("Relevant software" here means the earliest software the Apple Remote could control.)
Explanation:
1. At first, the Apple Remote could only control the Front Row software.
2. Front Row can be controlled by two devices: the Apple Remote and the keyboard's function keys.
3. So the correct answer is the keyboard's function keys.

Traditional method 1: ask the LLM directly. Result: iPod.
The answer is wrong, and clearly made up. Since an LLM generates content by predicting each token from everything before it, my hypothesis is that this happens because the LLM doesn't actually know the answer: "Apple Remote" and "iPod" are simply close enough semantically for it to conclude that iPod is the answer.

Traditional method 2: chain of thought. Result:
Thought: Let's think about this step by step. The Apple Remote can control the Apple TV. The Apple TV can be controlled by the iPhone, iPad, and iPod Touch.
Answer: iPhone, iPad, and iPod Touch.
The answer is wrong. Under the chain-of-thought paradigm the model is actually making things up, because it doesn't know the answer.

Traditional method 3: the action paradigm. Result:
Action 1: Search [Apple Remote].
Result 1: The Apple Remote is a remote control ... was the first to control "Front Row" ...

Action 2: Search [Front Row].
Result 2: No result.

Action 3: Search [Front Row software].
Result 3: Front Row is obsolete software ...

Action 4: Finish [Yes].
The answer is incorrect: the action paradigm, with its implicit reasoning, does not yield the correct answer either. This is not hard to understand: implicit reasoning demands a high level of reasoning ability from the LLM because the thinking process is hidden. I believe that once an LLM's reasoning ability reaches a certain level, it should be able to arrive at the correct answer this way too.

The ReAct method, i.e., reasoning + action. Result:
Thought 1: I need to search for the Apple Remote and find the earliest software it could control.
Action 1: Search [Apple Remote].
Result 1: The Apple Remote is a remote control ... was the first to control "Front Row" ...

Thought 2: The Apple Remote originally controlled Front Row. Next I need to search Front Row and find out what other devices can control it.
Action 2: Search [Front Row].
Result 2: No result; try "Front Row Seat to Earth" or "Front Row software".

Thought 3: Front Row was not found, so I can search [Front Row software].
Action 3: Search [Front Row software].
Result 3: Front Row is obsolete software ... can be controlled by the Apple Remote and the keyboard's function keys.

Thought 4: Now I know the answer.
Action 4: Finish [keyboard function keys].
The answer is correct. Through explicit reasoning coupled with actions, the LLM Agent found the answer on its own. The whole process feels like a not-so-bright child writing down ideas, synthesizing the ideas and observations, and then acting on them. But it clearly works: the Agent eventually finds the answer.
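The Thought/Action/Observation loop above can be sketched as follows. The scripted steps and tiny lookup table are stand-ins for a real LLM and the Wikipedia search API:

```python
# Sketch of the ReAct loop: alternate Thought -> Action -> Observation until
# a Finish action terminates the episode.
KB = {
    "Apple Remote": "The Apple Remote is a remote control ... "
                    "originally designed to control Front Row.",
    "Front Row software": "Front Row is obsolete software controllable by "
                          "the Apple Remote and keyboard function keys.",
}

# Each entry plays the role of the model's next (thought, action, argument).
SCRIPT = [
    ("Find the earliest software the remote controlled.", "Search", "Apple Remote"),
    ("Front Row is the software; what else controls it?", "Search", "Front Row software"),
    ("Now I know the answer.", "Finish", "keyboard function keys"),
]

def react(question: str) -> str:
    for thought, action, arg in SCRIPT:  # a real agent would query the LLM here
        if action == "Search":
            observation = KB.get(arg, "No result")  # fed back into the next thought
        elif action == "Finish":
            return arg
    return "no answer"

print(react("Aside from the Apple Remote, what controls Front Row?"))
```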

Reflexion

Reflexion: a framework that equips AI Agents with dynamic memory and self-reflection to improve reasoning. It follows the ReAct setup and provides simple binary rewards. After each action, the AI Agent computes a heuristic function and, based on the result of self-reflection, decides whether to reset the environment and start a new trial. The heuristic determines whether the current path is inefficient (taking too long without success) or contains hallucination (encountering a sequence of identical actions that lead to the same observation), and triggers a reset if either condition is detected.
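That reset heuristic can be sketched as follows; the thresholds and trajectory encoding are illustrative assumptions, not the paper's exact formulation:

```python
# Simplified Reflexion-style reset heuristic: flag a trajectory as inefficient
# if it runs too long, or as hallucinating if the same (action, observation)
# pair repeats several times in a row.
from typing import List, Tuple

def should_reset(trajectory: List[Tuple[str, str]],
                 max_steps: int = 30, repeat_limit: int = 3) -> bool:
    if len(trajectory) > max_steps:
        return True  # inefficient: too many steps without success
    run = 1
    for prev, cur in zip(trajectory, trajectory[1:]):
        run = run + 1 if cur == prev else 1
        if run >= repeat_limit:
            return True  # hallucination: identical action/observation loop
    return False

print(should_reset([("search x", "no result")] * 3))  # -> True
```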


2.2. Memory: getting more memory out of a limited context length

The memory module is responsible for storing information, including past interactions, learned knowledge, and even temporary task information. For an intelligent agent, an effective memory mechanism guarantees that it can draw on previous experience and knowledge when facing new or complex situations. For example, a memory-enabled chatbot can remember user preferences or the content of earlier conversations to provide a more personalized and coherent communication experience.

Inputs to the AI system become the system's memory, and they can be mapped onto human memory patterns. Memory can be defined as the process of acquiring, storing, retaining, and later retrieving information. The human brain has several kinds of memory, such as sensory memory, short-term memory, and long-term memory. For an AI Agent system, the content generated during user interaction can be regarded as the Agent's memory, corresponding to these human memory patterns: sensory memory is the raw input used to learn embedded representations, including text, images, or other modalities; short-term memory is the context, limited by the finite context-window length; and long-term memory is the external vector database the Agent queries while working, accessed through fast retrieval. Currently, Agents mainly rely on external long-term memory to complete complex tasks, such as reading PDFs or searching real-time news on the Internet. Tasks and results are stored in the memory module, and when that information is recalled it is returned to the dialog with the user, creating a tighter contextual environment.
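The short-term/long-term split can be sketched as follows; the class and eviction policy are illustrative assumptions, with a plain list standing in for the vector database:

```python
# Toy memory module: short-term memory is a bounded context window;
# items evicted from it are archived into long-term memory instead of
# being lost (a plain list stands in for an external vector store).
from collections import deque
from typing import List

class AgentMemory:
    def __init__(self, window: int = 3) -> None:
        self.short_term = deque(maxlen=window)  # limited context length
        self.long_term: List[str] = []          # stand-in for a vector DB

    def observe(self, item: str) -> None:
        if len(self.short_term) == self.short_term.maxlen:
            # the oldest context item is archived before eviction
            self.long_term.append(self.short_term[0])
        self.short_term.append(item)

mem = AgentMemory(window=2)
for msg in ["turn1", "turn2", "turn3", "turn4"]:
    mem.observe(msg)
print(list(mem.short_term), mem.long_term)  # -> ['turn3', 'turn4'] ['turn1', 'turn2']
```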


To address the constraints of a limited context window, external memories are often used. A common practice is to save embedded representations of information in vector databases that support fast Maximum Inner Product Search (MIPS). Vector databases solve the problem of storing, retrieving, and matching the massive knowledge of large models by storing data as vectors. Vectors are a common form in which AI understands the world; large models require huge amounts of data for training to obtain rich semantic and contextual information, leading to an exponential increase in data volume. A vector database uses AI Embedding methods to abstract unstructured data such as images, audio, and video into multi-dimensional vectors, which can then be structured and managed, enabling fast and efficient storage and retrieval and giving the Agent a "long-term memory". At the same time, mapping multimodal data from high-dimensional space to vectors in a lower-dimensional space can significantly reduce storage and computation costs; the cost of storage in a vector database is said to be two to four orders of magnitude lower than storage in a neural network.
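The store-and-retrieve pattern above can be sketched in a few lines. This is a toy illustration, not a real vector database: the embeddings are hand-made stand-ins for the output of an embedding model, and search is a brute-force inner-product scan rather than an approximate MIPS index.

```python
import numpy as np

class TinyVectorStore:
    """Minimal sketch of a vector 'long-term memory': texts are stored as
    embeddings and retrieved by maximum inner product over the stored vectors."""
    def __init__(self):
        self.vectors = []   # embedding vectors
        self.payloads = []  # the original texts

    def add(self, vector, text):
        self.vectors.append(np.asarray(vector, dtype=float))
        self.payloads.append(text)

    def search(self, query, k=1):
        query = np.asarray(query, dtype=float)
        scores = [float(v @ query) for v in self.vectors]  # inner products
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return [(self.payloads[i], scores[i]) for i in top]

store = TinyVectorStore()
store.add([1, 0, 0, 0], "user prefers concise answers")   # stand-in embeddings
store.add([0, 1, 0, 0], "meeting notes from Monday")
results = store.search([0.9, 0.1, 0, 0], k=1)
```

Production systems replace the linear scan with approximate nearest-neighbor indexes so retrieval stays fast at millions of entries.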

Embedding techniques and vector similarity computation are at the core of vector databases. Embedding is a method of transforming unstructured data such as images, audio, and video into a language that computers can recognize. A familiar analogy is a map as an Embedding of real geography: the information in real geography and terrain goes far beyond three dimensions, but a map expresses as much of it as possible through color, contour lines, and so on. Once unstructured data such as text has been transformed into vectors by Embedding, the similarity between two vectors can be computed mathematically to compare the underlying texts. The powerful retrieval capability of a vector database is built on this similarity computation: given a query, it finds approximate matches by similarity. This is a form of fuzzy retrieval with no single exact answer, which in turn lets it support a much wider range of application scenarios efficiently.

2.3 Tools: knowing how to use them makes you more human

One of the major differences between an AI Agent and a bare large model is the ability to use external tools to extend the model's capabilities. Tool use is one of the most remarkable and distinctive traits of human beings; in the same way, a large model can be equipped with external tools that allow it to do what it otherwise could not. One of ChatGPT's major drawbacks is that its training data only extends to the end of 2021, so it cannot directly answer questions about newer knowledge. Although OpenAI later added a plug-in feature that lets ChatGPT call a browser plug-in to access up-to-date information, the user must specify whether a plug-in is needed for the question, so the answer is not fully natural. In June 2023, OpenAI added function calling to GPT-4 and GPT-3.5, allowing developers to describe functions to the two models and have them intelligently output JSON objects containing the arguments needed to call those functions. This is a new, more reliable way to connect GPT's capabilities to external tools and APIs, enabling developers to obtain structured data from the models. One way to implement tool calling is to fine-tune the model on a large dataset of tool-calling examples.
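The describe-then-dispatch flow can be sketched as follows. The schema follows the general shape of OpenAI-style function calling (a function description plus a JSON object naming the function and its arguments), but the `get_weather` tool, the registry, and the model reply here are all hypothetical; no real API is called.

```python
import json

# Developer-side description of an available tool, in JSON Schema style.
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def dispatch(model_reply_json, registry):
    """Parse the model's structured function call and run the named tool."""
    call = json.loads(model_reply_json)
    fn = registry[call["name"]]
    return fn(**call["arguments"])

registry = {"get_weather": lambda city: f"Sunny in {city}"}

# Pretend the model returned this structured call instead of free text:
reply = json.dumps({"name": "get_weather", "arguments": {"city": "Paris"}})
result = dispatch(reply, registry)
```

The point of the structured JSON reply is exactly what the paragraph describes: the developer gets machine-parseable arguments rather than having to scrape a natural-language answer.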


To summarize: the AI Agent rests on four main capabilities: sensing, analyzing, deciding, and executing, which work together to form its basic operating principle. First is perception: obtaining information about the external environment through sensors, so that the Agent understands its surroundings. Second is analysis: processing the perceived information to extract useful features and patterns. Then comes decision-making: the Agent makes decisions based on the analysis results and formulates corresponding action plans. Finally there is execution: turning decisions into concrete actions that complete the task. Working in tandem, these four capabilities enable the AI Agent to run efficiently and execute tasks in complex environments.
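The four-capability cycle can be sketched as a toy loop. Everything here is an illustrative stand-in (the environment dict, the keyword-based "analysis", the two-action "decision"), not any specific framework:

```python
# Minimal sense -> analyze -> decide -> execute cycle, per the summary above.
def run_agent(environment, steps=3):
    log = []
    for _ in range(steps):
        observation = environment["sense"]()                        # sensing
        features = observation.lower().split()                      # analyzing
        action = "respond" if "question" in features else "wait"    # deciding
        log.append(environment["execute"](action))                  # executing
    return log

# Hypothetical environment: a fixed observation and an echoing effector.
env = {
    "sense": lambda: "User asked a question",
    "execute": lambda action: f"executed: {action}",
}
trace = run_agent(env)
```

In a real agent each stage is far richer (an LLM for analysis and decision, tools for execution), but the control flow keeps this same shape.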

3. Progress in the application of AI Agent

3.1. AutoGPT: Driving the AI Agent Research Boom

AutoGPT brought the AI Agent concept into the mainstream. In March 2023, developer Significant Gravitas released the open-source project AutoGPT on GitHub. Powered by GPT-4, it allows the AI to act autonomously without a user prompt for each action: give AutoGPT a goal and it can break down the task, perform the operations, and complete the goal on its own. AutoGPT still has drawbacks such as high cost, slow response, and infinite-loop bugs. AutoGPT uses the GPT-3.5 and GPT-4 APIs, and a GPT-4 token costs roughly 15 times as much as a GPT-3.5 token. Suppose each task requires 20 steps (an ideal case), each step consumes 4K tokens of GPT-4 usage, the average cost of prompts and replies is $0.05 per thousand tokens (since replies use far more tokens than prompts in practice), and the exchange rate is $1 = 7 CNY; the cost would then be 20 × 4 × 0.05 × 7 = 28 RMB. And that is only the ideal case: real tasks often need to be split into dozens or even hundreds of steps, making the processing cost of a single task hard to accept. Moreover, GPT-4 responds much more slowly than GPT-3.5, so tasks with many steps become very slow. Worse, when AutoGPT encounters a step that GPT-4 cannot solve, it can fall into an infinite loop, repeating meaningless prompts and outputs and wasting a large amount of resources.

3.2 Game Domain Application: Westworld Town


For the first time, Stanford's Westworld-style town creates a virtual environment in which multiple agents live. In April 2023, Stanford researchers published a paper titled Generative Agents: Interactive Simulacra of Human Behavior, which demonstrated a virtual town populated by generative agents. It is an interactive sandbox environment in which 25 generative AI agents mimic human behavior: they take walks in the park, drink coffee in a cafe, and share the day's news with colleagues. If an agent wants to hold a Valentine's Day party, the agents will spend the next two days automatically spreading party invitations, meeting new people, asking each other to the party, and coordinating to show up together at the right time. Such agents have human-like qualities, independent decision-making, and long-term memory, bringing them closer to "native AI Agents". In this cooperative model, agents are not just tools that serve humans; they can also build social relationships with other agents in the digital world.


The Memory Stream, which contains a large number of observation and retrieval processes, is the core of the agent architecture in the Westworld-style town. Agents in the town are built on three key fundamentals: memory, reflection, and planning, slightly modified versions of the core components discussed earlier. All three rest on a single core, the Memory Stream, which stores a complete record of an agent's experiences as a list of observations. Each observation contains a description of the event, its creation time, and a timestamp of the most recent access, and can describe either the agent's own behavior or behavior it perceives in others. To retrieve the most important memories to pass to the language model, the researchers identified three factors to consider during retrieval: recency, importance, and relevance. Each memory is scored on these three factors, the scores are summed, and the highest-weighted memories are passed to the large model as part of the prompt that determines the agent's next action. Reflection and planning are updated and created based on the observations in the memory stream.
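The three-factor scoring can be sketched as a weighted sum. The decay rate, weights, and normalization below are illustrative choices, not the paper's exact values, and `query_relevance` stands in for an embedding-similarity score:

```python
import time

def score_memory(memory, query_relevance, now, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three retrieval factors: recency (exponential
    decay in hours since last access), importance (a 1-10 score, here
    normalized), and relevance (similarity to the current query)."""
    hours_since = (now - memory["last_access"]) / 3600
    recency = 0.99 ** hours_since
    return alpha * recency + beta * memory["importance"] / 10 + gamma * query_relevance

now = time.time()
memories = [
    {"text": "had coffee",       "importance": 2, "last_access": now - 3600},
    {"text": "planning a party", "importance": 9, "last_access": now - 7200},
]
relevances = [0.1, 0.9]  # stand-in similarity scores for the current query
best = max(zip(memories, relevances),
           key=lambda mr: score_memory(mr[0], mr[1], now))
```

The top-scoring memories (here, the party plan, despite being slightly older) are the ones inserted into the prompt.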

3.3. HyperWrite: Launching the First Personal AI Assistant Agent

HyperWrite launches the first personal AI assistant agent. On August 3, 2023, AI startup HyperWrite officially launched Personal Assistant, an AI agent application that it hopes will become a "digital assistant" for humans. "We're going to start seeing real personal AI assistants for the first time," said Aidan Gomez, co-founder of generative AI startup Cohere and an investor in HyperWrite. As a personal assistant agent, it can help users organize their email inboxes and draft replies, book flights, order takeout, and find suitable resumes on LinkedIn, seamlessly integrating AI capabilities into daily life and workflows. The tool is currently in a trial stage and mainly applies to web-browser scenarios.

Personal Assistant can autonomously perform specific tasks in the browser. It is currently available as a browser extension, and users can try it after installing the plugin and registering for an account. The initial page resembles a search engine like New Bing, with only a chat box for natural language interaction. Once the user types in what they want to accomplish, the plugin creates a new browser page with a sidebar showing every step of the process and the thought process behind it. Take the goal "Give me some new ideas about AI Agents in the U.S." as an example: the personal assistant first conducts a relevant search, then opens a relevant article page to read and summarize, and finally returns the summarized results to the chat box, taking about 2 minutes overall. At present the ability of the personal AI assistant is still limited, but its potential is promising.

Currently, HyperWrite Personal Assistant is only at version 0.01; its functions are still relatively limited, it makes some errors, and its response process is slow. However, we believe the AI Agent has now taken its first step into the personal consumer market, and with further improvement in large-model capability and the continued build-out of computing infrastructure, the development potential of the personal AI assistant is worth looking forward to.

3.4. Inflection AI: High Emotional Intelligence Personal AI - Pi

Inflection AI launches Pi, a personal AI focused on emotional companionship. Inflection AI is an AI startup founded in 2022 and currently valued at over $4 billion, second only to OpenAI among AI startups. In May 2023, the company launched its personal AI product, Pi. Unlike ChatGPT, Pi has never been positioned as a professional tool or a replacement for human work. It cannot write code or produce original content for us; in contrast to popular general-purpose chatbots, Pi simply engages in friendly conversation, offers succinct advice, or even just listens. Its main characteristics are empathy, humility, curiosity, humor, and creativity, and its strong emotional intelligence allows it to offer knowledge and companionship tailored to the user's unique interests and needs. From the start of Pi's development, Inflection decided that Pi would be a Personal Intelligence, not just a tool to assist people in their work.

At the heart of Pi is the company's Inflection-1 large model, with performance comparable to GPT-3.5. Inflection-1 is Inflection AI's own large model; according to the company's evaluation tests, it performs slightly better than GPT-3.5 and commonly used large models such as LLaMA on a number of benchmarks, such as multitask language understanding and common-sense questions, but lags behind GPT-3.5 in coding ability. This, however, is the company's point of competitive differentiation: as an agent focused on emotional companionship, Pi does not need strong coding or work-assistance capabilities.

Unlike a work-assisting agent, Pi is able to fulfill the need for emotional companionship. As an AI agent with high emotional intelligence, Pi communicates with users in everyday, life-like language rather than the tone of a cold, work-oriented AI. Pi's replies are personal and appropriate, and it cares about your current situation like a therapist or a best friend. When responding to potentially negative questions, Pi also avoids playful expressions or lighthearted tones that might offend users. It even uses emoji in its replies to make the user feel more like they are conversing with a real human being. Pi also remembers conversations and gets to know the user better over time. Pi makes up for traditional AI's neglect of human emotional needs, and we believe there is a large market for personal AI agents like Pi that can provide emotional value.

3.5. AgentBench: LLM's Agent Capability Evaluation Criteria


AgentBench evaluates LLMs' ability as agents and ranks common LLMs by agent capability. A joint team led by Tsinghua University proposed the world's first evaluation standard for large models' agent abilities. Despite the current boom in AI agent research, the industry has lacked a systematic, standardized benchmark for evaluating the intelligence of LLMs acting as agents. In August 2023, a team of researchers from Tsinghua University, The Ohio State University, and the University of California, Berkeley proposed AgentBench, the first systematic benchmark for evaluating LLMs' reasoning and decision-making performance as agents across real-world challenges in eight distinct environments: operating systems, databases, knowledge graphs, card games, household chores, lateral thinking puzzles, online shopping, and web browsing. Based on these eight environments, the research team designed real-world challenges covering both code and everyday scenarios, such as extracting numbers from tables using SQL, winning a card game, and booking a plane ticket from a web page.

The performance of GPT-4 is far ahead, and open-source models are significantly weaker than closed-source models. The researchers selected 25 mainstream large-model APIs for agent capability evaluation, covering both closed-source models (e.g., OpenAI's GPT-4 and GPT-3.5) and open-source models (e.g., LLaMA 2 and Baichuan). According to the test results, GPT-4 leads in essentially all environments and truly represents the current frontier of large-model capability. Among closed-source models, Anthropic's Claude and OpenAI's GPT-3.5 do not differ much in level, while common open-source models such as Vicuna and Dolly score significantly weaker, owing to being at least an order of magnitude smaller than the closed-source models. We believe that although LLMs can reach a basically human-like level in NLP tasks such as natural language communication, they still lag in important agent capabilities such as action effectiveness, context-length memory, multi-round conversation consistency, and code generation and execution, so the development of LLM-based AI agents still has much potential.

3.6 Application Scenarios of AI Agent in Security Operations

At present, in the network security vertical, Microsoft Security Copilot remains the main product form, still positioned as a "co-pilot"; no true AI Agent has yet appeared in the security business. But I believe there will soon be AI Agents for virus analysis, Red Team AI Agents, Blue Team AI Agents, and so on.

4. Judging the future development trend of AI Agent

Combining the current applications of AI Agents developed based on LLM in academia and industry, we categorize the current AI Agents into two main groups:

  • Autonomous agents that seek to automate complex processes. Given a goal, an autonomous agent can create tasks on its own, complete them, create new tasks, reprioritize the task list, complete the new top-priority task, and repeat the process until the goal is achieved. Accuracy requirements are high, which increases the need for external tools to mitigate the negative effects of large-model uncertainty.

  • Agent simulations that strive to be more anthropomorphic and believable. These divide into agents that emphasize emotional intelligence and agents that emphasize interaction; the latter tend to operate in multi-agent environments, where scenarios and capabilities may emerge beyond the designer's planning, and where the uncertainty of large-model generation becomes an advantage: its diversity promises to be an important part of AIGC.
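The create-execute-reprioritize loop of the first category can be sketched as follows. Here `execute` and `expand` are placeholders for LLM calls, and alphabetical sorting stands in for LLM-based reprioritization; none of this is any particular framework's implementation.

```python
from collections import deque

def autonomous_loop(goal, execute, expand, max_steps=10):
    """Pop the top task, execute it, let the result spawn new tasks,
    reprioritize, and repeat until the queue is empty or steps run out."""
    tasks = deque([goal])
    done = []
    for _ in range(max_steps):
        if not tasks:
            break
        task = tasks.popleft()
        result = execute(task)
        done.append((task, result))
        for new_task in expand(task, result):
            tasks.append(new_task)
        tasks = deque(sorted(tasks))  # stand-in for LLM reprioritization
    return done

# Hypothetical goal and toy task logic:
done = autonomous_loop(
    "write report",
    execute=lambda t: f"done: {t}",
    expand=lambda t, r: ["outline", "draft"] if t == "write report" else [],
)
```

The `max_steps` cap is exactly the kind of guard that systems like AutoGPT need against the infinite-loop failure mode discussed earlier.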


Research in the report "Top Ten Trends in AIGC Application Layer" shows that all surveyed enterprises consider the AI Agent a definite direction for AIGC development: 50% of enterprises have already piloted AI Agents in some part of their work, and another 34% are formulating AI Agent application plans. The report also makes two predictions about AI Agent development trends: AI Agents will make "human-machine collaboration" the new normal, with individuals and enterprises entering the era of AI assistants; and AI Agents will help future enterprises build a new model of intelligent operation centered on "human-machine collaboration", changing how future productivity is organized and fighting organizational entropy. In the future, with the help of AIGC, enterprise tasks will become increasingly atomized and fragmented; complex processes will be broken down without limit and then flexibly arranged and recombined, with AI continuously uncovering the effectiveness and potential of each link. On the supply side, the efficient collaboration model of "human + AI digital employees" will provide an ideal solution for large enterprises to fight organizational entropy.

4.1 Autonomous intelligences: automation, the new productivity revolution

Autonomous agents seek to automate complex processes. Dai Yusen, Managing Partner of ZhenFund, compares the degree of collaboration between AI and humans to the stages of autonomous driving: the AI Agent corresponds roughly to the L4 stage, where the Agent completes the task and the human provides external assistance and supervision. Autonomous agents are expected to bring changes to the software industry's interaction and business models:

  • Interaction change: compared with past apps/software, the shift is from people adapting to apps to apps adapting to people. The Agent's decision-making, planning, and execution require a deeper understanding of user requirements and stronger polish in engineering details. For example, current Agent operation often runs into problems such as endless task expansion and misunderstanding of the required output format; solving these relies not only on stronger base models but also on Agent architecture design and learning from vertical-domain data.

  • Business model change: charging by service content shifts to charging by token, which demands more practical Agent functionality. The capability of the base model is important, but it only sets the lower bound. In real enterprise application scenarios, the architectural design, engineering capability, and quality of vertical-domain data of autonomous agents are also crucial, so vertical and middleware players have opportunities as well. Accuracy and efficiency are key metrics for autonomous agents (decision-oriented AI is better at what it does, which also means lower fault tolerance), and enterprises also demand low-threshold customized Agents, so players focusing on specific domains or providing Agent frameworks still have a chance.


Typical Representatives:

  • AutoGPT

  • Code Development GPT Engineer

  • Research GPT Researcher

  • Creation class ShortGPT+ Multi-intelligentsia: AGENT teams perform complex development tasks, e.g. MetaGPT, AutoGEN

4.2. Agent simulation: anthropomorphism, a new kind of emotional consumer good

Companion agents emphasize human traits such as emotional intelligence, have a "personality", and can remember their history of interactions with the user.

LLMs' important breakthroughs in natural language understanding have made companion agents technically possible.

GPT-4 scores significantly higher in emotional intelligence than previous large models. With continued iteration of large models on emotional intelligence and the development of multimodal technology, we can expect companion agents that are more vividly anthropomorphic, more believable, and capable of providing greater emotional value.

Research institutes believe the domestic emotional-consumption market still has considerable room to grow (given changing social attitudes toward marriage and rising loneliness under the tense pace of modern work and life), and companion agents may benefit from this emotional-consumption trend to become an important AI-native application of the LLM era. Starting from the first principles of users' companionship needs, we expect most of the commercial value of companion agents to be concentrated in IP; accordingly, we are more optimistic about players with rich IP reserves or those who let users customize their own agents:

  • Companionship modeled on humans: stranger social networking and live-streaming shows are representative online-companionship applications. The former's core problem is that once a certain emotional connection is established, users migrate to their most-used social platforms; in the latter, user value gradually concentrates on head streamers rather than the platform.

  • Companionship modeled on objects: consumer products such as designer toys have certain companionship attributes, and audiences spend most of their money on their favorite IP. Typical representatives: companionship agents with high emotional intelligence and personality traits, such as Pi; platform entertainment, such as Character.AI and Glow; game-world AI players, such as Voyager; the Smallville town, simulating a Westworld-like society; NetEase's Justice mobile game, where AI NPCs improve the player experience; and Kunlun Wanwei's Club Koala, making the virtual world more believable.

5. Summary

In this article, we have analyzed the whole picture of AI Agents, a cutting-edge technology: from clarifying basic concepts, to deconstructing the technical principles in detail, to exploring rich and diverse application scenarios, to looking ahead at exciting development trends, each step filled with visions of the technology's potential. Admittedly, many of the projects mentioned here are still at the stage of academic research and practical exploration, and their experimental results, presented in papers, are like bright stars pointing toward the possibilities of large-model technology. These sparks may yet start a prairie fire of AI applications, giving rise to new fields and opportunities never seen before and injecting intelligent vitality into every corner of society.

Looking ahead, AI Agent technology is expected to achieve leapfrog development in the next five years, and its popularity will far exceed expectations, deeply penetrating and reshaping the operation mode of thousands of industries. We expect that this technology will not only realize a qualitative leap in efficiency and bring disruptive changes to traditional industries, but also serve as the core force of innovation drive, opening up a new era of intelligent interconnection and efficient collaboration. Let's witness together how AI Agent technology will work together with various industries to draw a vivid and colorful picture of the intelligent future.

This article was shared from the Huawei Cloud Community article "The AI Agent Framework (LLM Agent): How LLM-powered Agents are Leading Industry Transformation, Application Exploration and Future Prospects", by Ting.
