On March 10, Alphabet's DeepMind unveiled Gemini Embedding 2, the company's first native multimodal embedding model. It maps text, images, video, audio, and documents into a unified embedding space, marking a new stage of full-modal fusion in AI embedding technology.
Gemini Embedding 2 supports semantic understanding in over 100 languages and has surpassed existing leading models in benchmark tests for text, image, and video tasks. It also introduces speech processing capabilities previously lacking in embedding models. The model is now available in public preview via the Gemini API and Vertex AI, allowing developers immediate access.
For enterprise users, the model's release significantly lowers the technical barriers to building multimodal Retrieval-Augmented Generation (RAG) systems, semantic search, and data classification. It is expected to simplify complex data pipelines that previously required separate cross-modal processing.
**Full Modal Unification: Expanding from Text to Five Media Types**
Built on the Gemini architecture, Gemini Embedding 2 extends embedding capabilities from pure text to five input forms:
* Text supports up to 8192 input tokens.
* Images can be processed up to 6 per request, in PNG and JPEG formats.
* Video supports MP4 and MOV files up to 120 seconds long.
* Audio can be ingested directly to generate embedding vectors, with no intermediate text transcription step.
* Documents support direct embedding of PDF files up to 6 pages long.
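The per-modality limits above are easy to capture in a pre-flight check before sending a request. The sketch below is purely illustrative: the `MODALITY_LIMITS` table and `validate_request` helper are hypothetical names, and only the limit values themselves come from the announcement.

```python
# Hypothetical pre-flight validator for the per-modality limits described above.
# The limit values come from the announcement; all names here are illustrative.
MODALITY_LIMITS = {
    "text": {"max_tokens": 8192},
    "image": {"max_items": 6, "formats": {"png", "jpeg"}},
    "video": {"max_seconds": 120, "formats": {"mp4", "mov"}},
    "document": {"max_pages": 6, "formats": {"pdf"}},
}

def validate_request(parts):
    """Check a list of request parts against the published limits.

    Each part is a dict like {"modality": "image", "format": "png"} plus
    modality-specific fields. Returns a list of violation messages
    (empty list means the request is within limits).
    """
    errors = []
    image_count = sum(1 for p in parts if p["modality"] == "image")
    if image_count > MODALITY_LIMITS["image"]["max_items"]:
        errors.append(f"too many images: {image_count} > 6")
    for p in parts:
        limits = MODALITY_LIMITS.get(p["modality"], {})
        fmt = p.get("format")
        if "formats" in limits and fmt not in limits["formats"]:
            errors.append(f"unsupported {p['modality']} format: {fmt}")
        if p["modality"] == "text" and p.get("tokens", 0) > limits["max_tokens"]:
            errors.append("text exceeds 8192 tokens")
        if p["modality"] == "video" and p.get("seconds", 0) > limits["max_seconds"]:
            errors.append("video exceeds 120 seconds")
        if p["modality"] == "document" and p.get("pages", 0) > limits["max_pages"]:
            errors.append("PDF exceeds 6 pages")
    return errors
```

Note that audio carries no published length limit in the announcement, so the sketch imposes none.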
Unlike traditional methods that process individual modalities separately, this model supports interleaved inputs. A single request can include a combination of modalities, such as images and text, enabling the model to capture complex and nuanced semantic relationships between different media types.
Gemini Embedding 2 continues using the Matryoshka Representation Learning (MRL) technique adopted in Alphabet's previous embedding models. This "nested" approach trains embeddings so that their leading dimensions remain meaningful on their own, allowing the output dimension to be flexibly truncated from the default 3072. This helps developers balance model performance against storage costs.
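The mechanics of MRL truncation can be shown without any model calls: keep a prefix of the vector and renormalize it to unit length. The vector below is a random stand-in, not real model output, and the helper names are illustrative.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components of an MRL-style embedding and
    renormalize to unit length, as done when trading accuracy for storage."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a, b):
    # For unit vectors, the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# A 3072-dim embedding could be stored at any of the supported tiers:
full = [math.sin(i) for i in range(3072)]  # placeholder vector
for dim in (3072, 1536, 768):
    t = truncate_embedding(full, dim)
    print(dim, len(t))
```

Because MRL packs the most important information into the leading dimensions, the truncated prefix remains a usable embedding rather than an arbitrary slice.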
**Benchmark Leadership, Speech Capability as a New Highlight**
Alphabet stated that Gemini Embedding 2 outperforms current mainstream competing models in benchmark tests for text, image, and video tasks, positioning it as a new performance benchmark in the multimodal embedding field.
The company recommends developers choose from three dimension tiers—3072, 1536, or 768—based on their application scenarios to achieve optimal embedding results. This design is particularly important for enterprises needing large-scale deployment of embedding vectors, allowing effective control of infrastructure costs without significantly sacrificing accuracy.
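The storage trade-off across the three tiers is simple arithmetic. Assuming float32 storage (4 bytes per component) and an example corpus of 10 million items, neither of which is specified in the announcement:

```python
# Back-of-envelope storage cost per dimension tier (illustrative assumptions:
# float32 vectors, a 10M-item corpus).
BYTES_PER_FLOAT32 = 4
N_VECTORS = 10_000_000

for dim in (3072, 1536, 768):
    gb = dim * BYTES_PER_FLOAT32 * N_VECTORS / 1e9
    print(f"{dim:>4} dims: {gb:,.2f} GB")
# 3072 dims: 122.88 GB
# 1536 dims:  61.44 GB
#  768 dims:  30.72 GB
```

Dropping from 3072 to 768 dimensions cuts vector storage (and index memory) by 4x, which is why the tier choice matters at deployment scale.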
In terms of capability coverage, the model introduces native speech embedding, a feature generally absent in comparable models, enabling direct processing of audio data without relying on speech-to-text intermediaries.
Alphabet noted that embedding technology is already widely used across several of its products, covering context engineering in RAG scenarios, large-scale data management, and traditional search and analysis scenarios.
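In a RAG pipeline, the retrieval step reduces to nearest-neighbor search over stored embeddings. A minimal sketch with tiny placeholder vectors (a real deployment would embed documents and queries with the model and use 768- to 3072-dim vectors, typically behind a vector database):

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=2):
    """Rank stored document embeddings by cosine similarity to the query."""
    scored = [(cosine_sim(query_vec, v), doc_id) for doc_id, v in doc_vecs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True)[:k]]

# Placeholder 4-dim embeddings; with a multimodal model, a PDF, a video,
# and a text memo would all live in the same vector space.
docs = {
    "invoice.pdf": [1, 0, 0, 0],
    "meeting.mp4": [0, 1, 0, 0],
    "memo.txt": [0.9, 0.1, 0, 0],
}
print(top_k([1, 0, 0, 0], docs))  # → ['invoice.pdf', 'memo.txt']
```

The unified embedding space is what makes this work across modalities: a text query can retrieve video or audio content because they share one vector space.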
Some early access partners have already begun building multimodal applications on Gemini Embedding 2, which Alphabet says demonstrate the model's practical potential in high-value scenarios.