Google Launches Gemini 2.0, Full Shift to Agents, Multimodal Inputs and Outputs


In response to OpenAI's recent string of new releases, Google on Wednesday launched the next generation of its flagship AI models, Gemini 2.0 Flash, which natively generates images and audio in addition to text. 2.0 Flash can also use third-party applications and services, giving it access to features such as Google Search, code execution, and more.

Starting Wednesday, an experimental version of 2.0 Flash will be available via the Gemini API and Google's AI development platforms (AI Studio and Vertex AI). However, the audio and image generation features will only be available to "early access partners," with a full rollout planned for January of next year.
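
For developers who want to try it, here is a minimal sketch of calling the experimental model through the Gemini API with the google-generativeai Python SDK; the "gemini-2.0-flash-exp" model name follows Google's naming pattern for experimental releases but should be treated as an assumption, and an AI Studio API key is required.

```python
# Minimal sketch: text generation with the experimental 2.0 Flash model.
# Assumes the google-generativeai SDK and an API key from AI Studio;
# the model name "gemini-2.0-flash-exp" is an assumption, not confirmed here.
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-2.0-flash-exp")
response = model.generate_content("Explain what native multimodal output means.")
print(response.text)
```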

In the coming months, Google says it will release different versions of 2.0 Flash for Android Studio, Chrome DevTools, Firebase, Gemini Code Assist and other products.

Flash upgrades

The first generation of Flash (1.5 Flash) could only generate text and was not designed for particularly demanding workloads. According to Google, the new version 2.0 Flash model is more versatile, in part because of its ability to invoke tools (such as search) and interact with external APIs.
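
As an illustration of that tool-calling ability, here is a rough sketch using the google-generativeai SDK's function-calling support; the get_weather helper and the model name are hypothetical stand-ins, not part of Google's announcement.

```python
# Rough sketch of tool use: the model can decide to call a local function
# before answering. get_weather is a hypothetical stub, not a Google API.
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

def get_weather(city: str) -> str:
    """Stub for an external API the model may choose to call."""
    return f"It is sunny and 21 C in {city}."

model = genai.GenerativeModel("gemini-2.0-flash-exp", tools=[get_weather])
chat = model.start_chat(enable_automatic_function_calling=True)
reply = chat.send_message("Do I need an umbrella in Zurich today?")
print(reply.text)  # the SDK runs get_weather automatically if the model requests it
```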

Tulsee Doshi, Google's product lead for the Gemini models, said:

"We know developers love Flash for its balance of speed and performance. 2.0 Flash keeps that speed advantage while becoming even more powerful."

Google claims that, according to internal tests, 2.0 Flash runs twice as fast as the Gemini 1.5 Pro model in some benchmarks and delivers "significant" improvements in areas such as coding and image analysis. In fact, the company says 2.0 Flash displaces 1.5 Pro as the flagship Gemini model, with better math performance and improved "factuality."

2.0 Flash can generate and modify images in addition to text. The model can also take in photos, videos, and audio recordings and answer questions about them.


Audio generation is another key feature of 2.0 Flash, which Doshi describes as "steerable" and "customizable." For example, the model can narrate text using eight voices optimized for different accents and languages.

However, Google did not provide image or audio samples generated by 2.0 Flash, so it is impossible to judge the quality of its output compared to other models.

Google says it is using its SynthID technology to watermark all audio and images generated by 2.0 Flash. On SynthID-enabled software and platforms (i.e., some Google products), the model's output will be labeled as synthesized content.

The move is intended to ease concerns about abuse. Indeed, deepfakes are a growing threat. According to identity verification service Sumsub, the number of deepfakes detected globally quadrupled from 2023 to 2024.

Multimodal API

The production version of 2.0 Flash will arrive in January. In the meantime, Google has launched the Multimodal Live API to help developers build apps with real-time audio and video streaming capabilities.


With the Multimodal Live API, Google says developers can create real-time multimodal apps with audio and video input from a camera or screen. The API supports tool integration to accomplish tasks and can handle "natural conversation patterns" such as interruptions, similar to OpenAI's Realtime API.

The Multimodal Live API became fully available on Wednesday morning.
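
As a rough sketch of what building on the Multimodal Live API can look like, the snippet below uses the newer google-genai Python SDK's asynchronous live interface; the exact method names and the text-only configuration are assumptions and may differ from the shipped API.

```python
# Rough sketch of a live, streaming session. The method names (live.connect,
# session.send, session.receive) follow the google-genai SDK's async live
# interface as best understood here; treat them as assumptions.
import asyncio
import os

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
config = {"response_modalities": ["TEXT"]}  # audio/video streams are also supported

async def main() -> None:
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        await session.send(
            input="Describe what you would do if I shared my screen.",
            end_of_turn=True,
        )
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```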

An AI agent that operates web pages

Google on Wednesday also unveiled its first AI agent capable of taking actions on web pages: a research prototype from its DeepMind unit called Project Mariner. The agent, powered by Gemini, can take over a user's Chrome browser, move the cursor, click buttons, and fill out forms to use and navigate websites the way a person would.

Google said the AI agent will first roll out to a small group of pre-selected beta testers starting Wednesday.

Google is continuing to experiment with new ways for Gemini to read, summarize, and even use websites, according to reports. One Google executive told reporters that this marks a "completely new UX paradigm shift": instead of interacting with websites directly, users will be able to do so through a generative AI system.


The shift could affect millions of businesses - from publishers like TechCrunch to retailers like Walmart - that have long relied on Google to direct real users to their sites, according to the analysis.

In a demo with tech media outlet TechCrunch, Google Labs director Jaclyn Konzelmann showed how Project Mariner works.

Once the extension is installed in Chrome, a chat window appears on the right side of the browser. The user can instruct the agent to complete tasks such as "create a shopping cart at the supermarket based on this list."


The agent then navigates to a supermarket's website, searches for the items, and adds them to a virtual shopping cart. One obvious problem is speed: there is a delay of about five seconds between each cursor movement. At times the agent also broke off the task and returned to the chat window to ask for clarification about certain items (how many carrots were needed, for example).

Google's agent cannot complete checkout because it will not fill in credit card numbers or billing information. Project Mariner also won't accept cookies or agree to terms of service on a user's behalf. Google says it deliberately withheld these abilities to keep users in control.

In the background, Google's agent takes a screenshot of the user's browser window (the user has to agree to this in the terms of service) and sends it to Gemini in the cloud for processing. Gemini then sends instructions for navigating the page back to the user's computer.
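
A hypothetical sketch of that loop is below; the Action type and helper functions stand in for the extension's internals and are not a Google API.

```python
# Hypothetical sketch of the screenshot -> cloud model -> browser action loop
# described above. Every name here is an illustrative stand-in, not a Google API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    kind: str                       # "click", "type", "ask_user", or "done"
    question: Optional[str] = None  # set when the agent needs clarification

def capture_active_tab() -> bytes:
    raise NotImplementedError("screenshot of the front-most Chrome tab (with user consent)")

def request_next_action(task: str, screenshot: bytes) -> Action:
    raise NotImplementedError("send the task and screenshot to Gemini in the cloud")

def apply_browser_action(action: Action) -> None:
    raise NotImplementedError("move the cursor, click a button, or fill a field")

def agent_loop(task: str, max_steps: int = 50) -> None:
    for _ in range(max_steps):
        action = request_next_action(task, capture_active_tab())
        if action.kind == "done":
            break
        if action.kind == "ask_user":          # e.g. "how many carrots do you need?"
            task += " " + input(action.question)
            continue
        apply_browser_action(action)           # one visible step at a time
```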

Project Mariner can also be used to search for flights and hotels, shop for home furnishings, find recipes, and other tasks that currently require users to click on a web page to accomplish.

However, Project Mariner only works in the front-most active Chrome tab, which means users can't do anything else with their computer while the agent is working; they have to watch Gemini slowly click through each action. Koray Kavukcuoglu, Google DeepMind's chief technology officer, said this was a very deliberate decision so that users always know what Google's AI agent is doing.

Konzelmann said:

"[Project Mariner] marks a fundamental new user experience paradigm shift that we're seeing right now. We need to explore the right way for this to change the way users interact with the web, as well as the way publishers create experiences for users as well as agents."

AI agents that do research, write code, and learn games

In addition to Project Mariner, Google on Wednesday unveiled several new AI agents dedicated to specific tasks.

One of the agents, Deep Research, is meant to help users investigate complex topics by building multi-step research plans. It appears to be a competitor to OpenAI's o1, which is also capable of multi-step reasoning. However, a Google spokesperson noted that the agent is not intended for solving math and logic problems, writing code, or performing data analysis. Deep Research is available now in Gemini Advanced and will come to the Gemini apps in 2025.


Given a difficult or broad question, Deep Research creates a multi-step action plan for answering it. Once the user approves the plan, Deep Research spends a few minutes searching the web and then generates a detailed research report.
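
In outline, that flow might look like the hypothetical sketch below; the helper functions are illustrative stand-ins for Deep Research's internals, not a published API.

```python
# Hypothetical outline of the plan -> approve -> search -> report flow.
# None of these helpers correspond to a real Deep Research API.
from typing import Callable, List

def propose_plan(question: str) -> List[str]:
    raise NotImplementedError("ask the model for a multi-step research plan")

def search_web(step: str) -> List[str]:
    raise NotImplementedError("run web searches for one step of the plan")

def write_report(question: str, findings: List[str]) -> str:
    raise NotImplementedError("synthesize findings into a detailed report")

def deep_research(question: str, user_approves: Callable[[List[str]], bool]) -> str:
    plan = propose_plan(question)
    if not user_approves(plan):            # the user reviews the plan first
        return "Plan rejected."
    findings: List[str] = []
    for step in plan:
        findings.extend(search_web(step))  # the slow, minutes-long browsing phase
    return write_report(question, findings)
```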

Another new AI agent, Jules, is designed to help developers with coding tasks. It integrates directly into GitHub workflows, allowing Jules to look at existing work and make changes directly in GitHub. Jules is rolling out to a small group of beta testers now and will launch more broadly in 2025.

Finally, Google DeepMind says it is building an AI agent to help players find their way around video games, drawing on its long history of creating game-playing AI. Google is working with game developers such as Supercell to test Gemini's ability to interpret game worlds like that of Clash of Clans.

AI-generated search summaries

Google on Wednesday also unveiled "AI Overviews," an AI-generated summary feature based on the Gemini 2.0 model, which provides summarized content for certain Google search queries and will soon be able to handle "more complex topics," as well as "multimodal" and "multistep" search content. as well as "multimodal" and "multistep" searches. Google says this includes advanced math problems and programming issues.


The new AI Overviews feature will begin limited testing this week and will be rolled out widely early next year.

Since its launch this spring, however, AI Overviews has drawn plenty of criticism; some of the dubious statements and advice it has offered (such as recommending putting glue on pizza) sparked a firestorm online. According to a recent report from SEO platform SE Ranking, AI Overviews cites websites that are "not entirely reliable or evidence-based," including outdated research and paid product listings.

The main problem, according to the analysis, is that AI Overviews sometimes have trouble recognizing whether a source is factual, fictional, satirical or serious content. Over the past few months, Google has changed the way AI Overviews works, limiting answers related to current events and health topics. But Google doesn't claim that the feature has been perfect.

Nonetheless, Google says AI Overviews has boosted search engagement, especially among 18- to 24-year-olds - a key target demographic for the company.

Trillium, the latest AI accelerator chip, used exclusively for Gemini 2.0

Google on Wednesday unveiled Trillium, its sixth-generation AI accelerator chip, claiming that the chip's performance improvements could fundamentally change the economics of AI development.

The custom processor, which Google says delivers four times the training performance of its predecessor while sharply reducing power consumption, was used to train the newly released Gemini 2.0 model.

Google has connected more than 100,000 Trillium chips in a single network structure to form one of the world's most powerful AI supercomputers, Google CEO Sundar Pichai explained in an announcement post.

Trillium delivers significant improvements in multiple dimensions. Compared to its predecessor, this chip delivers a 4.7x increase in peak compute performance per chip, while doubling both high-bandwidth memory capacity and inter-chip interconnect bandwidth. More importantly, it is 67% more energy efficient, a key metric for data centers when dealing with the huge energy demands of AI training.

Trillium's business impact goes beyond performance metrics. Google claims that the chip offers a 2.5x improvement in training performance per dollar compared to its predecessor, which could reshape the economic model of AI development.
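
To put those ratios in perspective, here is a quick back-of-the-envelope calculation; only the announced ratios come from Google, and reading "67% more energy efficient" as performance per watt is an assumption.

```python
# Back-of-the-envelope arithmetic on Google's stated ratios. Only the ratios
# (2.5x training performance per dollar, 67% better energy efficiency) come
# from the announcement; the per-watt interpretation is an assumption.
perf_per_dollar_gain = 2.5
relative_cost = 1 / perf_per_dollar_gain
print(f"Same training workload at roughly {relative_cost:.0%} of the previous cost")
# -> roughly 40% of the previous cost

efficiency_gain = 1.67                      # "67% more energy efficient"
relative_energy = 1 / efficiency_gain
print(f"Roughly {relative_energy:.0%} of the energy per unit of compute")
# -> roughly 60% of the energy per unit of compute
```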

Analysis suggests that the release of Trillium has intensified competition in the AI hardware space, which NVIDIA has long dominated with its GPU-based solutions. While NVIDIA's chips remain the industry standard for many AI applications, Google's custom chip approach may have advantages for specific workloads, especially training very large models.

Other analysts say that Google's huge investment in custom chip development reflects its strategic bet on the importance of AI infrastructure. Google's decision to make Trillium available to cloud customers signals its desire to be more competitive in the cloud AI market, competing fiercely with Microsoft Azure and Amazon AWS. For the tech industry as a whole, the release of Trillium suggests that the battle for AI hardware supremacy is entering a new phase.
