Multi-modal Embeddings
Learn about multi-modal embeddings in the Zapdos platform
Zapdos uses multi-modal embeddings to represent each media unit of video with multiple complementary signals. Instead of relying on a single vector, we generate embeddings from the following signals (sketched in code after the list):
- Frame-level visual features
- Captions and summaries
- Text extracted via OCR
- Objects and entities detected in the scene
- Temporal context from surrounding frames
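To make this concrete, here is a minimal sketch of a media unit carrying one embedding per signal. The `MediaUnitEmbeddings` container and the `embed_*` helpers are hypothetical stand-ins, not Zapdos APIs; in practice each would be backed by a real visual or text encoder.

```python
from dataclasses import dataclass
import numpy as np

DIM = 512  # assumed shared embedding dimensionality


def embed_text(text: str) -> np.ndarray:
    """Hypothetical text encoder (captions, OCR text, entity labels).
    A real system would call a sentence- or CLIP-style text model."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)


def embed_frames(frames: list) -> np.ndarray:
    """Hypothetical visual encoder: mean-pool per-frame vectors
    into one vector for the media unit."""
    vecs = [embed_text(f"frame:{f}") for f in frames]  # stand-in for a vision model
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)


@dataclass
class MediaUnitEmbeddings:
    visual: np.ndarray    # frame-level visual features
    caption: np.ndarray   # captions and summaries
    ocr: np.ndarray       # text extracted via OCR
    entities: np.ndarray  # objects and entities detected in the scene
    temporal: np.ndarray  # temporal context from surrounding frames


unit = MediaUnitEmbeddings(
    visual=embed_frames(["f_0012.jpg", "f_0013.jpg"]),
    caption=embed_text("A chef plates a dessert in a busy kitchen"),
    ocr=embed_text("SPECIALS: creme brulee $9"),
    entities=embed_text("chef, dessert, kitchen, plate"),
    temporal=embed_frames(["f_0010.jpg", "f_0015.jpg"]),
)
```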
Together, these embeddings provide a richer and more resilient representation of video content, enabling fast, accurate semantic search, clustering, and recommendation.
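As an illustration of that resilience, the sketch below scores a query against each signal's embedding and fuses the scores with a weighted sum (late fusion). It reuses `embed_text` and the `unit` from the sketch above; the fusion weights are assumptions for illustration, not Zapdos parameters.

```python
# Assumed fusion weights; a real system would tune these per use case.
WEIGHTS = {
    "visual": 0.35, "caption": 0.30, "ocr": 0.15,
    "entities": 0.10, "temporal": 0.10,
}


def fused_score(query: str, unit) -> float:
    """Late fusion: cosine similarity per signal, then a weighted sum.
    All vectors are unit-normalized, so a dot product is cosine similarity."""
    q = embed_text(query)
    return sum(w * float(q @ getattr(unit, name))
               for name, w in WEIGHTS.items())


# Rank a small library of media units by fused score, best match first.
library = {"unit_001": unit}
results = sorted(library.items(),
                 key=lambda kv: fused_score("dessert being plated", kv[1]),
                 reverse=True)
```

Because no single signal dominates, a weak or missing signal (say, a unit with no OCR text) only dents the fused score instead of sinking the match.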
Access through Cloud Database (Coming Soon)
Multi-modal embeddings will soon also be accessible through our Cloud Database.
Pairing with Graph RAG
Structured outputs (objects, OCR text, captions) are also linked in graph form. When combined with embeddings, Zapdos can answer complex queries by using both semantic similarity and explicit relationships between entities, scenes, and events.
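A hedged sketch of that pairing: embeddings shortlist candidate scenes by similarity, then an explicit entity graph keeps only scenes where the required relationship actually holds. The graph schema, node names, and `networkx` usage here are illustrative assumptions, not the Zapdos data model.

```python
import numpy as np
import networkx as nx

DIM = 512
rng = np.random.default_rng(0)


def unit_vec() -> np.ndarray:
    """Random placeholder embedding; a real system would reuse the
    multi-modal embeddings described above."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)


scene_vecs = {"scene_12": unit_vec(), "scene_13": unit_vec()}

# Illustrative scene/entity graph; the real schema is Zapdos-internal.
G = nx.MultiDiGraph()
G.add_edge("scene_12", "chef_1", relation="contains")
G.add_edge("scene_12", "dessert_1", relation="contains")
G.add_edge("chef_1", "dessert_1", relation="plates")
G.add_edge("scene_13", "chef_1", relation="contains")


def semantic_candidates(query_vec: np.ndarray, k: int = 2) -> list:
    """Step 1: shortlist scenes by embedding similarity (dot product)."""
    ranked = sorted(scene_vecs,
                    key=lambda s: float(query_vec @ scene_vecs[s]),
                    reverse=True)
    return ranked[:k]


def has_relation(subject: str, relation: str, obj: str) -> bool:
    """True if the graph holds an explicit `relation` edge subject -> obj."""
    data = G.get_edge_data(subject, obj) or {}
    return any(attrs.get("relation") == relation for attrs in data.values())


# Step 2: keep only candidate scenes that contain both entities
# linked by the required relationship.
candidates = semantic_candidates(unit_vec())
answers = [s for s in candidates
           if {"chef_1", "dessert_1"} <= set(G.successors(s))
           and has_relation("chef_1", "plates", "dessert_1")]
# -> ["scene_12"]: similarity found the candidates; the graph explains why.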
This pairing makes search and retrieval not only powerful, but also explainable and composable across your entire video library.