Architecture

Multimodal

AI models that can process and generate multiple types of data — such as text, images, audio, and video — rather than being limited to text alone.

Multimodal AI models can understand and work with multiple types of information simultaneously. While traditional language models only process text, multimodal models can accept images, audio, video, or documents as input alongside text, and some can generate non-text outputs as well. This capability dramatically expands what AI can do, from analyzing charts and photographs to transcribing meetings and understanding video content.
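In practice, non-text inputs are usually passed to these models as encoded bytes. A common wire format is a base64 data URL; the sketch below shows that encoding step with stand-in bytes (a real call would read an actual image file, and exact field requirements vary by provider):

```python
import base64

def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL, a common format for
    sending images to multimodal model APIs."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{b64}"

# Stand-in for real image bytes (these are the PNG magic bytes plus padding).
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
url = to_data_url(fake_png)
print(url[:22])  # prints "data:image/png;base64,"
```

The same pattern applies to audio and document inputs: encode the bytes, attach the media type, and include the result in the request payload.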

The current generation of frontier models is increasingly multimodal. GPT-4o can process text, images, and audio, and can generate text and audio output. Claude 3 models accept images alongside text. Gemini 1.5 models can process text, images, audio, and video in a unified context window. These models use a shared transformer architecture with specialized encoders for different modalities — a vision encoder processes images into token-like representations that the language model can attend to alongside text tokens.
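The encoder idea above can be illustrated with a toy NumPy sketch: split an image into patches, project each patch into the language model's embedding space, and concatenate the resulting image tokens with text token embeddings into one sequence the transformer attends over. All shapes and weights here are illustrative stand-ins, not any real model's dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # toy embedding width shared by both modalities

# Toy "vision encoder": split a 32x32 grayscale image into sixteen 8x8
# patches, flatten each patch, and project it into the embedding space.
image = rng.normal(size=(32, 32))
patches = image.reshape(4, 8, 4, 8).transpose(0, 2, 1, 3).reshape(16, 64)
W_proj = rng.normal(size=(64, d_model)) * 0.1
image_tokens = patches @ W_proj          # (16, d_model) token-like embeddings

# Text tokens embedded the usual way (random stand-ins here).
text_tokens = rng.normal(size=(5, d_model))

# The transformer sees one combined sequence and attends across modalities.
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(sequence.shape)  # prints "(21, 64)"
```

Real vision encoders add positional information and many transformer layers before projection, but the key point survives in the sketch: after encoding, image content is just more tokens in the same sequence as the text.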

Multimodal capabilities enable powerful applications. Document understanding systems can process PDFs with complex layouts, tables, and figures. Visual question answering lets users upload photos and ask questions about them. Code assistants can analyze screenshots of UIs or error messages. Accessibility tools can describe images for visually impaired users. Meeting assistants can process audio recordings alongside shared documents to generate comprehensive summaries.
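For the visual question answering case, a request typically interleaves text and image parts in a single user message. The sketch below builds such a payload in the content-parts shape used by several multimodal chat APIs; the exact field names vary by vendor, so treat this as an illustration rather than any specific provider's schema:

```python
def build_vqa_message(question: str, image_url: str) -> dict:
    """Build a chat message mixing a text question with an image reference,
    in a content-parts shape similar to common multimodal chat APIs."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_vqa_message(
    "What trend does this chart show?",
    "https://example.com/chart.png",  # hypothetical URL for illustration
)
print(msg["role"])  # prints "user"
```

The same structure extends naturally: add more image parts for multi-image questions, or an audio part for models that accept it.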

The quality of multimodal understanding varies across models and modalities. Most models handle text-and-image combinations well but may struggle with fine-grained visual details, complex charts, or handwritten text. Video understanding is still early — most models process video as a sequence of sampled frames rather than understanding motion and temporal relationships. Audio processing capabilities are similarly evolving. When choosing a multimodal model, test it specifically on your data types and use cases rather than relying solely on benchmark scores, which may not reflect real-world performance on your specific content.
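Testing on your own data does not require heavy tooling. A minimal harness like the sketch below scores any model (wrapped as a callable) against a handful of your own input/expected pairs; the stub model and the substring-match scoring are simplifying assumptions, and a real harness would call the model API and use a scoring rule suited to your task:

```python
def evaluate(model_fn, test_cases):
    """Score a candidate model on (input, expected) pairs using a simple
    case-insensitive substring match. model_fn is any callable wrapping
    a model API; a stub stands in here for illustration."""
    hits = sum(
        1 for inp, expected in test_cases
        if expected.lower() in model_fn(inp).lower()
    )
    return hits / len(test_cases)

def stub_model(inp):
    # Hypothetical stand-in; swap in a real multimodal API call.
    return "The chart shows revenue rising." if "chart" in inp else "unsure"

cases = [
    ("describe chart.png", "revenue"),
    ("read receipt.jpg", "total"),
]
print(evaluate(stub_model, cases))  # prints "0.5"
```

Running the same small suite against each candidate model gives a like-for-like comparison on the documents, charts, or audio you actually care about, which benchmark scores cannot provide.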
