Google Releases Its First Gemini Embedding Model: No. 1 on the MTEB Leaderboard, Surpassing OpenAI
In the early hours of this morning, Google released its first Gemini embedding model, which set a new record on the MTEB leaderboard and took the No. 1 spot. It is also inexpensive, priced at $0.15 per million tokens, and the API is already open.
According to Google's results on the Massive Text Embedding Benchmark (MTEB), the Gemini embedding model achieves an average score of 68.37, substantially exceeding OpenAI's text embedding model at 58.93 points.
It scores very well across bitext mining, classification, clustering, instruction retrieval, multi-label classification, pair classification, reranking, retrieval, semantic textual similarity, and other tests, making it the strongest embedding model currently available.

Free trial address: https://aistudio.google.com/prompts/new_chat
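For developers who want to call the open API directly, a minimal sketch using the google-generativeai Python SDK might look like the following; the model identifier, task type, and output dimension are assumptions for illustration, so check Google's documentation for the current values.

```python
# Minimal sketch: requesting a text embedding via the google-generativeai SDK.
# The model name and output_dimensionality below are assumptions for illustration;
# consult Google's documentation for the identifiers that are actually available.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

result = genai.embed_content(
    model="models/gemini-embedding-001",  # assumed model id
    content="Embeddings turn text into vectors for search and clustering.",
    task_type="retrieval_document",       # assumed task hint
    output_dimensionality=768,            # 768 / 1536 / 3072 per the article
)
vector = result["embedding"]
print(len(vector))
```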
In response to Google's new model, netizens commented that most people underestimate the power of embedding techniques, even though they are a central pillar of smarter AI workflows.
Search, clustering, personalized recommendations, and even matching blog content to user intent - all of these applications are improved by embedding technology.
At $0.15 per million tokens, indie creators and freelancers will finally be able to use this technology as well. This is a major move.

This is great! Many of my students have asked me what the best embedding model is, so it's nice to see that Gemini has its own embedding model. The Gemini models will likely work well with these embeddings, and the cost isn't too bad, either.

Multi-lingual capabilities are critical for global adoption, as there is a large population whose native language is not English. I've always thought Google had the edge in state-of-the-art natural language processing. It's great to see Gemini at the top of the MTEB as well, and that it's cost-effective.

A Brief Introduction to the Gemini Embedding Model
The Gemini embedding model's architecture is built on a bidirectional Transformer encoder derived from Gemini. This design retains the bidirectional attention mechanism while allowing the model to make full use of Gemini's pre-trained language understanding.
The Gemini embedding model is built on the bottom 32 Transformer layers of Gemini, which are frozen to ensure that the model inherits Gemini's powerful language understanding. On top of these frozen layers, the model adds a pooling layer that aggregates the embeddings of every token in the input sequence into a single embedding vector representing the entire input.

To achieve this, the model employs a simple mean pooling strategy in which all token embeddings of the input sequence are averaged along the sequence axis. This pooling method is not only simple and efficient, but also adapts well in practice.
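As a concrete illustration of this strategy (a sketch of standard mean pooling rather than Google's actual code), the snippet below averages token embeddings along the sequence axis while ignoring padding positions.

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings along the sequence axis, ignoring padded tokens.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # sum over the sequence axis
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per example
    return summed / counts                           # (batch, hidden)
```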
After the pooling layer, a randomly initialized linear projection layer maps the embedding vector to the target dimension. This design allows the model to flexibly output embeddings of different sizes, e.g., 768, 1536, or 3072 dimensions. To support this multi-dimensional output, the Gemini embedding model introduces MRL (Matryoshka Representation Learning).
MRL lets the model optimize embeddings at multiple sub-dimensions simultaneously during training, for example first optimizing the first 768 dimensions, then the first 1536, and finally the full 3072 dimensions. This multi-dimensional training strategy not only improves the model's flexibility but also enhances its adaptability across different tasks.
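At inference time, MRL-style embeddings are typically consumed by truncating the full vector to the desired sub-dimension and re-normalizing; the sketch below assumes the sub-dimensions are nested prefixes, as is standard for Matryoshka Representation Learning.

```python
import torch
import torch.nn.functional as F

def truncate_embedding(full_embedding: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` coordinates of an MRL-trained embedding and re-normalize.

    full_embedding: (batch, 3072) full-dimensional output
    dim:            target sub-dimension, e.g. 768 or 1536
    """
    truncated = full_embedding[:, :dim]
    return F.normalize(truncated, p=2, dim=-1)  # unit-length vectors for cosine similarity
```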
During training, the Gemini embedding model employs a noise-contrastive estimation (NCE) loss function, a technique widely used for training embedding models. The core idea of the NCE loss is to optimize the embedding space by contrasting positive and negative samples, so that semantically similar texts end up close to each other in the embedding space while semantically different texts end up far apart.

Each training sample consists of a query, a positive sample, and an optional hard negative sample. The model optimizes the embedding space by computing the similarity between the query vector and the positive sample vector and contrasting it with the similarities to the negative sample vectors.
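A minimal sketch of this objective, in an InfoNCE-style formulation with in-batch negatives plus an optional hard negative (the exact loss used by Google may differ in details such as temperature or similarity function):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query: torch.Tensor,
                     positive: torch.Tensor,
                     hard_negative: torch.Tensor | None = None,
                     temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss: pull each query toward its positive, push it away from
    in-batch negatives and (optionally) an explicit hard negative.

    query, positive, hard_negative: (batch, dim), assumed L2-normalized.
    """
    candidates = positive
    if hard_negative is not None:
        candidates = torch.cat([positive, hard_negative], dim=0)  # (2*batch, dim)

    logits = query @ candidates.T / temperature                   # (batch, n_candidates)
    labels = torch.arange(query.size(0), device=query.device)     # positive for row i is column i
    return F.cross_entropy(logits, labels)
```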
To further enhance performance, the Gemini embedding model is trained with a multi-dimensional NCE loss. Through MRL, the model can optimize embeddings at multiple sub-dimensions simultaneously, such as 768, 1536, and 3072 dimensions. This multi-dimensional training strategy not only improves the model's flexibility but also enhances its adaptability across different tasks.
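One way to read this multi-dimensional NCE objective, sketched under the assumption that the MRL sub-dimensions are nested prefixes of the full vector, is to apply the same contrastive loss to truncated, re-normalized prefixes and sum the results (reusing contrastive_loss from the previous sketch):

```python
import torch.nn.functional as F

def multi_dim_contrastive_loss(query, positive, hard_negative=None,
                               sub_dims=(768, 1536, 3072)):
    """Sum the contrastive loss over nested MRL sub-dimensions (illustrative sketch).

    Inputs are full 3072-d embeddings; contrastive_loss() is the function from the
    previous snippet.
    """
    total = 0.0
    for d in sub_dims:
        q = F.normalize(query[:, :d], dim=-1)
        p = F.normalize(positive[:, :d], dim=-1)
        n = F.normalize(hard_negative[:, :d], dim=-1) if hard_negative is not None else None
        total = total + contrastive_loss(q, p, n)
    return total
```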
In addition, the Gemini embedding model introduces a masking mechanism in the loss function to handle classification tasks that have only a small number of targets. This masking mechanism avoids repeated counting when computing the loss, which improves training efficiency.
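The description above leaves the exact masking rule unspecified; one plausible reading, sketched here as my own illustration rather than Google's implementation, is that when many in-batch examples share the same target label, the duplicate targets are masked out of the logits so they are not wrongly scored as negatives.

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(query, targets, target_ids, temperature=0.05):
    """Contrastive loss with false-negative masking (illustrative sketch only).

    query, targets: (batch, dim) L2-normalized embeddings.
    target_ids:     (batch,) integer label of each target; in classification tasks
                    many rows share the same label, so identical targets in other
                    rows should not be scored as negatives for this query.
    """
    logits = query @ targets.T / temperature                         # (batch, batch)
    same_label = target_ids.unsqueeze(0) == target_ids.unsqueeze(1)  # (batch, batch) bool
    off_diagonal = ~torch.eye(len(target_ids), dtype=torch.bool, device=query.device)
    # Mask entries that share the query's label but are not its own positive.
    logits = logits.masked_fill(same_label & off_diagonal, float("-inf"))
    labels = torch.arange(len(target_ids), device=query.device)
    return F.cross_entropy(logits, labels)
```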
Training data
The research team designed separate synthetic data generation strategies for the retrieval task and the classification task. For the retrieval task, the team extended previous work by using Gemini to generate synthetic queries and a Gemini-based automated rater to filter out low-quality examples.
Gemini first generates a query related to a given passage, and another Gemini model then scores the generated query to ensure its quality and relevance. In this way, the team was able to generate a large amount of high-quality data for the retrieval task, improving the model's retrieval performance.
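A hedged sketch of this generate-then-rate pipeline follows; call_gemini is a hypothetical helper standing in for whatever Gemini API wrapper is used, and the prompts and score threshold are illustrative only.

```python
def call_gemini(prompt: str) -> str:
    """Hypothetical helper wrapping a Gemini text-generation call (placeholder)."""
    raise NotImplementedError

def synthesize_retrieval_pairs(passages, min_score=4):
    """Generate a synthetic query per passage, then keep only pairs the LLM rater scores highly."""
    pairs = []
    for passage in passages:
        query = call_gemini(
            f"Write a search query that this passage would answer:\n\n{passage}"
        )
        score_text = call_gemini(
            "On a scale of 1-5, how relevant is this query to the passage? "
            f"Answer with a single number.\n\nQuery: {query}\n\nPassage: {passage}"
        )
        try:
            score = int(score_text.strip()[0])
        except (ValueError, IndexError):
            continue  # discard unparsable ratings
        if score >= min_score:
            pairs.append({"query": query, "positive": passage})
    return pairs
```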
For the classification task, the team used a more sophisticated multi-stage prompting strategy. They first generate synthetic data such as user profiles, product information, or movie reviews, and then generate specific classification task data based on it.
For example, when generating sentiment classification data, the team first generates a series of user reviews with clear emotional tendencies and then filters the samples to match specific sentiment labels. This multi-stage generation strategy not only increases the diversity of the data, but also allows its distribution to be adjusted as needed to better suit different classification tasks.
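A similarly hedged sketch of this multi-stage classification pipeline is below; again, call_gemini is a hypothetical wrapper, and the prompts and label set are illustrative assumptions.

```python
def call_gemini(prompt: str) -> str:
    """Hypothetical helper wrapping a Gemini text-generation call (placeholder)."""
    raise NotImplementedError

def synthesize_sentiment_examples(num_reviews: int,
                                  labels=("positive", "negative", "neutral")):
    """Two-stage generation: free-form reviews first, then label assignment and filtering."""
    examples = []
    for _ in range(num_reviews):
        # Stage 1: generate a raw review with some emotional tendency.
        review = call_gemini("Write a short movie review with a clear emotional tone.")
        # Stage 2: ask the model which label the review matches; keep only clean matches.
        label = call_gemini(
            f"Which sentiment best describes this review: {', '.join(labels)}? "
            f"Answer with one word.\n\n{review}"
        ).strip().lower()
        if label in labels:
            examples.append({"text": review, "label": label})
    return examples
```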
To improve the quality of the training data, the research team used Gemini to filter the training data, identifying and removing low-quality samples by assessing data quality with few-shot prompts. This mainly leverages Gemini's language understanding to evaluate the samples in the dataset one by one and determine whether they meet the expected quality criteria.
For example, in a retrieval task, Gemini evaluates how relevant the query is to the positive sample and how irrelevant it is to the negative sample.
If the quality of a particular sample does not meet the requirements, for example the query is not relevant enough to its positive sample, or the negative sample turns out not to be a suitable negative for the query, then the sample is labeled as low quality and removed from the training data.
Training methods
The training process of the Gemini embedding model is divided into two main phases: pre-fine-tuning and fine-tuning. In the pre-fine-tuning phase, the model is trained on a large number of potentially noisy pairs drawn from a large-scale web corpus, with titles and passages forming the input and positive-sample pairs.
The main objective of the pre-fine-tuning phase is to adapt Gemini's parameters from the autoregressive generation task to the encoding task. This phase therefore uses a larger batch size (e.g., 8192) to provide more stable gradients and reduce the effect of noise. The pre-fine-tuning phase also runs for a large number of training steps, often up to 1 million.
In the fine-tuning phase, the model is further trained on multiple task-specific datasets containing (query, target, hard negative) triples. These datasets cover a wide range of task types, such as retrieval, classification, clustering, reranking, and semantic textual similarity.
The fine-tuning stage uses smaller batch sizes (e.g., 256) and restricts each batch to contain data from only a single task. This strategy lets the model focus better on task-specific optimization, improving its performance across tasks.
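To make the two phases concrete, here is a schematic sketch of the batching setup; the batch sizes come from the article, while the dataset structures and sampling logic are placeholders of my own.

```python
import random

# Phase 1: pre-fine-tuning on noisy (title, passage) pairs with a large batch size.
PREFINETUNE_BATCH_SIZE = 8192   # large batches stabilize gradients against noisy data
FINETUNE_BATCH_SIZE = 256       # smaller, task-homogeneous batches in phase 2

def finetune_batches(task_datasets, batch_size=FINETUNE_BATCH_SIZE):
    """Yield fine-tuning batches that each contain examples from a single task only.

    task_datasets: dict mapping task name -> list of (query, target, hard_negative) triples.
    """
    while True:
        task = random.choice(list(task_datasets))           # pick one task per batch
        data = task_datasets[task]
        batch = random.sample(data, k=min(batch_size, len(data)))
        yield task, batch
```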
To further improve the model's generalization, the Gemini embedding model also employs the Model Soup technique. Model Soup is a simple parameter-averaging technique that can significantly improve model performance by averaging the parameters of checkpoints obtained from training runs with different hyperparameters.
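Model Soup itself amounts to a uniform parameter average over such checkpoints; a minimal sketch in PyTorch terms, assuming all checkpoints share the same architecture and parameter names:

```python
import torch

def model_soup(state_dicts):
    """Uniformly average the parameters of several checkpoints ("model soup").

    state_dicts: list of PyTorch state_dicts from runs with different hyperparameters,
                 all sharing the same architecture and parameter names.
    """
    averaged = {}
    for name in state_dicts[0]:
        stacked = torch.stack([sd[name].float() for sd in state_dicts], dim=0)
        averaged[name] = stacked.mean(dim=0)
    return averaged
```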
(Text: AIGC Open Community)