Multi-modal Embeddings

Learn about multi-modal embeddings in the Zapdos platform

Zapdos uses multi-modal embeddings to represent each media unit of a video with multiple complementary signals. Instead of relying on a single vector, we generate embeddings from:

  • Frame-level visual features
  • Captions and summaries
  • Text extracted via OCR
  • Objects and entities detected in the scene
  • Temporal context from surrounding frames

Together, these embeddings provide a richer and more resilient representation of video content, enabling fast and accurate semantic search, clustering, and recommendation, as sketched below.
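To make the idea concrete, here is a minimal sketch of how one media unit's complementary signals could be stored and scored against a query. Everything here is hypothetical: `MediaUnitEmbeddings`, `fused_score`, and the modality weights are illustrative names and values, not the Zapdos API.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class MediaUnitEmbeddings:
    """Hypothetical container for one media unit's complementary signals."""
    visual: np.ndarray     # frame-level visual features
    caption: np.ndarray    # caption / summary text embedding
    ocr: np.ndarray        # embedding of OCR-extracted text
    entities: np.ndarray   # embedding of detected objects and entities
    temporal: np.ndarray   # context from surrounding frames

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def fused_score(query: np.ndarray, unit: MediaUnitEmbeddings,
                weights=(0.4, 0.25, 0.15, 0.1, 0.1)) -> float:
    """Weighted sum of per-modality similarities (weights are illustrative)."""
    signals = (unit.visual, unit.caption, unit.ocr,
               unit.entities, unit.temporal)
    return sum(w * cosine(query, s) for w, s in zip(weights, signals))
```

This sketch assumes all modalities are embedded in a single shared vector space (for example, via a joint text-image model); in practice each signal could live in its own space, with a per-modality query encoder.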

Access through Cloud Database (Coming Soon)

Multi-modal embeddings will also be accessible through our Cloud Database soon.

Pairing with Graph RAG

Structured outputs (objects, OCR text, captions) are also linked in graph form. When combined with embeddings, Zapdos can answer complex queries by using both semantic similarity and explicit relationships between entities, scenes, and events.
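A minimal sketch of this pairing, assuming a graph whose nodes carry embeddings and whose edges encode explicit relationships (entity, scene, and event links): seed retrieval with the nodes most similar to the query, then expand along edges to pull in related context. The function and attribute names below are illustrative, not the Zapdos API.

```python
import numpy as np
import networkx as nx

def graph_rag_retrieve(graph: nx.Graph, query_vec: np.ndarray,
                       k: int = 3, hops: int = 1) -> set:
    """Seed with the k nodes most similar to the query, then expand
    `hops` steps along explicit edges to gather related context."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    # Semantic similarity: rank nodes by embedding distance to the query.
    scored = sorted(
        graph.nodes,
        key=lambda n: cosine(query_vec, graph.nodes[n]["embedding"]),
        reverse=True,
    )
    result = set(scored[:k])

    # Explicit relationships: widen the result set along graph edges.
    frontier = set(result)
    for _ in range(hops):
        frontier = {nb for n in frontier for nb in graph.neighbors(n)} - result
        result |= frontier
    return result
```

Because each expanded node is reached via a named edge, a result can be traced back to the similarity seed and the relationships that pulled it in, which is what makes this style of retrieval explainable.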

This pairing makes search and retrieval not only powerful, but also explainable and composable across your entire video library.