Getting Started with Multimodal RAG



As enterprises begin experimenting with multimodal retrieval augmented generation (RAG), providers of multimodal embeddings — a way to transform data into a form a RAG system can search — advise them to start small when embedding images and videos.


Multimodal RAG — RAG that can surface results from a variety of file types, including text, images, and videos — relies on embedding models that transform data into numerical representations that AI models can read. Embeddings that can process all kinds of files let enterprises find information in financial graphs, product catalogs, or any informational video they have, giving them a more holistic view of their company.
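The retrieval step this enables can be sketched in a few lines. The vectors below are hypothetical stand-ins for what a real multimodal embedding model would produce; the point is that once text, images, and videos share one vector space, a single similarity ranking covers all of them:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: in practice these come from a multimodal
# model whose text and image encoders share one vector space.
documents = {
    "q3_revenue_chart.png": [0.9, 0.1, 0.3],
    "product_catalog.txt":  [0.2, 0.8, 0.5],
    "training_video.mp4":   [0.4, 0.3, 0.9],
}

query_vector = [0.85, 0.15, 0.25]  # stand-in embedding of "quarterly revenue"

# Rank every file -- regardless of modality -- by similarity to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query_vector, documents[name]),
    reverse=True,
)
print(ranked[0])  # the financial chart scores highest with these vectors
```

With real embeddings, the ranking would reflect semantic relevance rather than these toy numbers, but the mechanics are the same.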

Cohere, which updated its embeddings model, Embed 3, to process images and videos last month, said enterprises need to prepare their data differently and verify that the embeddings perform suitably in order to make the best use of multimodal RAG.

“Before committing extensive resources to multimodal embeddings, it’s a good idea to test it on a more limited scale. This enables you to assess the model’s performance and suitability for specific use cases and should provide insights into any adjustments needed before full deployment,” a blog post from Cohere staff solutions architect Yann Stoneman said.

The company said many of the practices discussed in the post also apply to other multimodal embedding models.

Stoneman said that in some industries, models may also need “additional training to pick up fine-grain details and variations in images.” He used medical applications as an example, where radiology scans or photos of microscopic cells require a specialized embedding system that understands the nuances in those kinds of images.

Data preparation is key

Before feeding images to a multimodal RAG system, they must be pre-processed so the embedding model can read them accurately.

Images may need to be resized to a consistent size. Organizations also need to decide whether to enhance low-resolution photos so important details don't get lost, and whether to downscale very high-resolution pictures so they don't strain processing time.
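That trade-off can be captured in a small sizing helper. This is a minimal sketch, not Cohere's pipeline; the `min_side` and `max_side` thresholds are illustrative assumptions, and a real pipeline would pass the result to an image library's resize call:

```python
def fit_dimensions(width, height, min_side=256, max_side=1024):
    """Compute a target size that preserves aspect ratio while keeping
    the longest side within [min_side, max_side].

    Downscales oversized images (to limit processing time) and
    upscales undersized ones (though upscaling cannot restore
    detail that was never captured)."""
    longest = max(width, height)
    if longest > max_side:
        scale = max_side / longest
    elif longest < min_side:
        scale = min_side / longest
    else:
        return width, height  # already within bounds
    return round(width * scale), round(height * scale)

print(fit_dimensions(4000, 3000))  # (1024, 768) -- downscaled
print(fit_dimensions(120, 90))     # (256, 192)  -- upscaled
print(fit_dimensions(800, 600))    # (800, 600)  -- unchanged
```

Normalizing every image through one helper like this keeps the embedding model's inputs consistent, which is the property the preprocessing step is after.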

“The system should be able to process image pointers (e.g. URLs or file paths) alongside text data, which may not be possible with text-based embeddings. To create a smooth user experience, organizations may need to implement custom code to integrate image retrieval with existing text retrieval,” the blog said.
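One way to handle such pointers is to store them on the index records themselves, so retrieval returns enough information to either render text inline or fetch the original asset. The record shape below is a hypothetical sketch of that integration, not any vendor's schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IndexedItem:
    """One entry in a mixed-modality index."""
    content: str                # text chunk, or a caption/description of the asset
    modality: str               # "text", "image", or "video"
    pointer: Optional[str]      # URL or file path to the asset; None for inline text
    embedding: list             # vector from a (hypothetical) multimodal model

items = [
    IndexedItem("Q3 revenue grew 12%", "text", None, [0.9, 0.1]),
    IndexedItem("Bar chart of Q3 revenue", "image",
                "https://example.com/q3.png", [0.8, 0.2]),
]

# After retrieval, the application layer checks the pointer to decide
# whether to show the text directly or fetch and display the asset.
for item in items:
    target = item.pointer if item.pointer else "(inline text)"
    print(f"{item.modality}: {target}")
```

Keeping text and image entries in the same index with a modality tag is what allows one query to return both, which is the integration the blog post describes.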

Multimodal embeddings become more useful

Many RAG systems mainly deal with text data because embedding text-based information is easier than embedding images or videos. However, since most enterprises hold all kinds of data, RAG that can search both pictures and text has become more popular. Previously, organizations often had to implement separate RAG systems and databases, preventing mixed-modality searches.

Multimodal search is nothing new; OpenAI and Google offer it on their respective chatbots. OpenAI launched its latest generation of embeddings models in January. Other companies also provide ways for businesses to harness their different data for multimodal RAG. For example, Uniphore released a way to help enterprises prepare multimodal datasets for RAG.


FAQs

What is multimodal retrieval augmented generation (RAG)?

Multimodal retrieval augmented generation (RAG) is a system that can surface a variety of file types from text, images, or videos, relying on embedding models that transform data into numerical representations that AI models can read.

Why is data preparation key for multimodal RAG?

Data preparation is crucial for multimodal RAG as it ensures that images are pre-processed to be read well by the embedding model. This may involve resizing images, improving resolution, and integrating image retrieval with text data for a smooth user experience.

How can enterprises benefit from multimodal embeddings?

Enterprises can benefit from multimodal embeddings by gaining a more holistic view of their data, being able to search both pictures and texts, and preparing multimodal datasets for effective RAG systems, improving overall data retrieval and analysis.


Credit: venturebeat.com
