Google Announces Open Source Watermark Recognition Tool SynthID, Available for Free to Developers and Businesses

Google announced on October 23 that SynthID is now open to anyone who wants to try it. The system, which is used to verify the provenance of AI-generated content, embeds imperceptible watermarks in generated images, videos, and text, enabling users to check whether a piece of content was produced by a human or a machine. Google said, "We are open sourcing the SynthID Text watermarking tool. The tool is free to use for developers and businesses to help them identify content generated by AI."

SynthID was introduced in 2023 as a way to watermark AI-generated images, audio, and video. It was initially integrated into Imagen, and then the company announced its integration into the Gemini chatbot at the May 2024 I/O conference.

The SynthID system works by imperceptibly watermarking the tokens generated during the text generation process. DeepMind said in May this year that the system does this by introducing additional information at the point of generation and adjusting the likelihood of generating particular tokens.

By comparing the model's word choice and its adjusted probability scores with the expected pattern of scores for watermarked and unwatermarked text, SynthID can detect whether the sentence was written by an AI.
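The detection idea described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration, not Google's implementation: the hash-based `g_value` function, the `wm_key` value, the four-token context window, and the toy generator that stands in for a watermarked model. The point is only that a detector holding the key can score text without access to the language model.

```python
import hashlib
import random

def g_value(wm_key: int, context: tuple, token: int) -> int:
    """Hypothetical keyed pseudorandom bit for a (context, token) pair."""
    payload = f"{wm_key}:{context}:{token}".encode()
    return hashlib.sha256(payload).digest()[0] & 1

def mean_g_score(tokens: list, wm_key: int, context_len: int = 4) -> float:
    """Average g-value over every position that has a full context window."""
    scores = [
        g_value(wm_key, tuple(tokens[i - context_len:i]), tokens[i])
        for i in range(context_len, len(tokens))
    ]
    return sum(scores) / len(scores) if scores else 0.5

def toy_watermarked_tokens(wm_key: int, length: int, rng: random.Random,
                           vocab_size: int = 1000, context_len: int = 4) -> list:
    """Toy stand-in for a watermarked model: of two random candidate
    tokens, keep the one whose g-value is higher."""
    tokens = [rng.randrange(vocab_size) for _ in range(context_len)]
    while len(tokens) < length:
        context = tuple(tokens[-context_len:])
        a, b = rng.randrange(vocab_size), rng.randrange(vocab_size)
        tokens.append(max(a, b, key=lambda t: g_value(wm_key, context, t)))
    return tokens

rng = random.Random(0)
marked = toy_watermarked_tokens(wm_key=42, length=1000, rng=rng)
plain = [rng.randrange(1000) for _ in range(1000)]

print("watermarked score:  ", mean_g_score(marked, wm_key=42))
print("unwatermarked score:", mean_g_score(plain, wm_key=42))
print("wrong-key score:    ", mean_g_score(marked, wm_key=7))
```

In this toy setup the watermarked sequence should score well above 0.5, while unwatermarked text, or watermarked text scored with the wrong key, should stay close to 0.5. A real detector would compare the score against a threshold chosen for an acceptable false-positive rate.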

The process does not affect the accuracy, quality, or speed of the response, and it is not easily bypassed, according to a study published in the Oct. 23 issue of Nature. Unlike standard metadata, which can be easily removed, SynthID's watermarks remain even after content is cropped, edited, or otherwise modified.

In an interview with MIT Technology Review, Soheil Feizi, an associate professor at the University of Maryland, said, "Implementing reliable and imperceptible watermarking techniques for AI-generated text is fundamentally challenging, especially when the output of a large language model is close to being deterministic, such as factual questions or code-generation tasks." Prof. Feizi also noted that the tool's open-source nature allows the community to test these detectors in different environments and evaluate their robustness, contributing to a better understanding of the limitations of these techniques.

However, the SynthID system is not foolproof. Although it is resistant to tampering, the SynthID watermark can be removed if the text is run through a translation application or significantly rewritten. The system also does not work well with short snippets of text, and it cannot determine whether a response that simply states a fact was generated by an AI. Take the question "What is the capital of France?": there is only one correct answer, and both humans and AI will tell you it's Paris.

Users who want to try SynthID for themselves can download it from Hugging Face; it is also part of Google's updated Responsible AI Toolkit.

Paper address: https://www.nature.com/articles/s41586-024-08025-4

Open source address: https://github.com/synthid-text

SynthID-Text is a production-ready text watermarking solution that maintains text quality and achieves high detection accuracy while minimizing latency overhead. Moreover, SynthID-Text does not affect LLM training and only modifies the sampling procedure; watermark detection is computationally efficient and does not require the use of the underlying LLM.

SynthID-Text builds on the previous generation of watermark components and introduces a new sampling algorithm called Tournament sampling. SynthID-Text can be configured to be either non-distorted (preserving the quality of the text) or distorted (improving watermark detectability at the expense of text quality). In both settings, SynthID-Text provides higher detection rates than earlier watermarking methods.
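As a rough illustration of the tournament idea, the sketch below draws several candidate tokens from the model's distribution and lets them compete in pairs, with each round decided by a keyed pseudorandom score and ties broken at random. It is a simplified sketch under stated assumptions: the SHA-256 based `g_value`, the integer `wm_key`, the context format, and the default of three layers are all illustrative choices, not the published implementation.

```python
import hashlib
import random

def g_value(wm_key: int, layer: int, context: tuple, token: str) -> int:
    """Hypothetical keyed pseudorandom bit, one independent bit per layer."""
    payload = f"{wm_key}:{layer}:{context}:{token}".encode()
    return hashlib.sha256(payload).digest()[0] & 1

def tournament_sample(probs: dict, context: tuple, wm_key: int,
                      layers: int = 3, rng: random.Random = None) -> str:
    """Pick the next token by a knockout tournament over 2**layers
    candidates sampled from the model's own distribution `probs`."""
    rng = rng or random.Random()
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    # Every candidate is an ordinary sample from the model's distribution.
    candidates = rng.choices(tokens, weights=weights, k=2 ** layers)
    for layer in range(layers):
        winners = []
        for a, b in zip(candidates[0::2], candidates[1::2]):
            ga = g_value(wm_key, layer, context, a)
            gb = g_value(wm_key, layer, context, b)
            if ga > gb:
                winners.append(a)
            elif gb > ga:
                winners.append(b)
            else:
                winners.append(rng.choice((a, b)))  # tie: pick at random
        candidates = winners
    return candidates[0]

# Hypothetical next-token distribution and context.
probs = {"mango": 0.5, "lychee": 0.3, "papaya": 0.15, "durian": 0.05}
context = ("My", "favorite", "tropical", "fruit", "is")
print(tournament_sample(probs, context, wm_key=42, rng=random.Random(0)))
```

Because every candidate comes from the model's own distribution, the winner is always a token the model could plausibly have produced; the tournament only tilts the choice toward tokens with high keyed scores, which is what a detector later looks for.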

As a simple example, for the phrase "My favorite tropical fruit is ___", an LLM might use the tokens "mango", "lychee", "papaya", or "durian" to complete the sentence, and each token would be given a probability score. When a range of different tokens is available, SynthID can adjust the probability score of each predicted token where doing so will not compromise the quality, accuracy and creativity of the output.
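To make the fruit example concrete, the sketch below applies one possible adjustment rule to made-up numbers. The probabilities and the 0/1 watermark bits are hypothetical, and the rule shown, p'(x) = p(x) * (1 + g(x) - E[g]), is what a single round between two sampled candidates (the higher bit wins, ties broken at random) works out to; it is an illustration, not the production algorithm.

```python
from itertools import product

def adjust(probs: dict, g_bits: dict) -> dict:
    """Tilt probabilities toward tokens whose watermark bit is 1."""
    expected_g = sum(p * g_bits[t] for t, p in probs.items())
    return {t: p * (1 + g_bits[t] - expected_g) for t, p in probs.items()}

def average_over_all_bits(probs: dict) -> dict:
    """Average the adjusted distribution over every possible bit assignment."""
    tokens = list(probs)
    totals = {t: 0.0 for t in tokens}
    for bits in product((0, 1), repeat=len(tokens)):
        adjusted = adjust(probs, dict(zip(tokens, bits)))
        for t in tokens:
            totals[t] += adjusted[t]
    count = 2 ** len(tokens)
    return {t: total / count for t, total in totals.items()}

# Hypothetical model probabilities and hypothetical keyed bits.
probs = {"mango": 0.50, "lychee": 0.30, "papaya": 0.15, "durian": 0.05}
g_bits = {"mango": 0, "lychee": 1, "papaya": 1, "durian": 0}

adjusted = adjust(probs, g_bits)
for token in probs:
    print(f"{token:7s} {probs[token]:.4f} -> {adjusted[token]:.4f}")

print(average_over_all_bits(probs))
```

With these numbers "mango" drops from 0.50 to 0.275 and "lychee" rises from 0.30 to 0.465, and the adjusted scores still sum to 1. Averaged over all possible bit assignments, each token's adjusted probability equals its original probability, which is the sense in which such a scheme can leave output quality unchanged on average.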

Google's large-scale user feedback evaluation of nearly 20 million responses from live Gemini interactions showed that non-distorted SynthID-Text maintains text quality. As a result, SynthID-Text has been used to watermark output from Gemini and Gemini Advanced. This demonstrates that generative text watermarking can be successfully implemented and scaled to real-world production systems serving millions of users.

In addition, Google provides an algorithm that combines generative watermarking with speculative sampling, allowing SynthID-Text to be integrated into large-scale production systems with negligible additional computational overhead.

However, SynthID-Text has limits. It currently works on text as short as three sentences and on text that has been cropped, lightly paraphrased, or modified, but it has difficulty with shorter texts, content that has been heavily rewritten or translated, and answers to factual questions.

Google said, "SynthID is not a panacea for recognizing AI-generated content, but SynthID will be an important part of developing more reliable AI recognition tools."