March 20, 2026
Researching multi-modal indexing methods
We partnered with a research company to explore and prototype methods for consolidating images, text, and vector embeddings into a single, efficiently searchable system.
Goals
The project set out to research and evaluate methods for indexing and searching across multiple data modalities: images, text, and vector embeddings. The work focused on identifying cost-efficient approaches to multi-modal search, benchmarking vector databases against graph databases, and building a working prototype that could consolidate different data types into a single searchable system. A secondary goal was developing a custom image signature and perception-based authentication layer for securing data inputs.
Challenges
Most of the complexity came from the experimental nature of the work. Each data modality required its own indexing and retrieval strategy: image segment classification, synonym-aware text search, and high-dimensional vector similarity search all behave differently and have different infrastructure requirements. Integrating Milvus (a purpose-built vector database) with a Spring Boot application had no established patterns at the time, so the team had to build a custom integration layer. Balancing search accuracy, query speed, and infrastructure cost across two database systems (Milvus and ArangoDB) added another layer of engineering tradeoffs that needed systematic benchmarking rather than assumptions.
Software Consulting and Development
We built a multi-modal search prototype on Spring Boot that consolidates image, text, and vector data into a unified system with multiple retrieval paths. The architecture combines FastAPI services for image processing and model inference, Milvus for vector storage and similarity search, and ArangoDB as a graph database for representing relationships between data points. Each component was selected and benchmarked for its specific strengths in the overall pipeline.
Image Indexing and Segment-Based Search
We built an image processing pipeline that supports importing, segmenting, and searching images based on visual similarity at the segment level. Classification models allow queries to match not just whole images but specific regions within them. The pipeline includes a custom image signature system and perception-based authentication to verify data integrity on input.
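The segment-signature idea can be sketched in plain Python. The average-hash scheme and helper names below are illustrative assumptions for exposition, not the project's actual signature algorithm, which the write-up does not detail:

```python
from itertools import product

def average_hash(pixels):
    """Toy signature for a grayscale pixel grid (rows of 0-255 ints):
    each bit records whether a pixel is brighter than the grid mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(a, b):
    """Number of differing bits between two signatures."""
    return sum(x != y for x, y in zip(a, b))

def segments(pixels, size):
    """Split a pixel grid into size x size tiles so each image
    region gets its own searchable signature."""
    h, w = len(pixels), len(pixels[0])
    for top, left in product(range(0, h, size), range(0, w, size)):
        yield (top, left), [row[left:left + size]
                            for row in pixels[top:top + size]]

def best_match(index, query, max_distance):
    """Match a query segment against indexed signatures by smallest
    Hamming distance; reject matches beyond max_distance."""
    q = average_hash(query)
    dist, key = min((hamming(sig, q), key) for key, sig in index.items())
    return key if dist <= max_distance else None
```

In the same spirit, the input-verification layer would compare the signature of an incoming image against the signature computed at the source, flagging inputs whose distance exceeds a threshold.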
Text Indexing with Synonym Resolution
The system indexes several text formats including transcripts, forum content, and dictionary entries. Text records are linked to their dictionary definitions, which enables synonym-based and semantically related queries. This means a search for one term can surface results connected through meaning rather than exact keyword matches, improving recall without sacrificing precision.
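The dictionary-link mechanism can be illustrated with a minimal in-memory sketch. A breadth-first walk over synonym links stands in for the graph traversal; the function names and the hop limit are assumptions, not the system's actual query logic:

```python
from collections import deque

def expand_terms(term, synonym_graph, max_hops=2):
    """Collect every term reachable from the query term within
    max_hops synonym links of a dictionary-derived graph."""
    seen = {term}
    frontier = deque([(term, 0)])
    while frontier:
        current, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for neighbor in synonym_graph.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return seen

def search(documents, term, synonym_graph):
    """Return ids of documents containing the term or any of its
    expanded synonyms (naive whitespace tokenization for brevity)."""
    terms = expand_terms(term, synonym_graph)
    return [doc_id for doc_id, text in documents.items()
            if terms & set(text.lower().split())]
```

For example, with `"car"` linked to `"automobile"` and `"automobile"` to `"vehicle"`, a query for `"car"` also matches transcripts that only mention an automobile.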
Milvus and ArangoDB Integration
Vector embeddings are stored and queried through Milvus, while ArangoDB handles the graph layer connecting data points across modalities. We developed a custom Spring Boot integration for Milvus since no mature connector existed at the time. This setup lets the system run vector similarity searches and graph traversals in the same request flow, combining the strengths of both database types without forcing a single-database compromise.
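The combined request flow can be sketched in-memory: brute-force cosine similarity stands in for Milvus's approximate-nearest-neighbor query, and a plain adjacency map stands in for ArangoDB's graph traversal. All names and data shapes here are illustrative assumptions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def vector_search(embeddings, qvec, top_k=2):
    """Stand-in for a Milvus similarity query: rank stored
    embeddings by cosine similarity to the query vector."""
    ranked = sorted(embeddings,
                    key=lambda k: cosine(embeddings[k], qvec),
                    reverse=True)
    return ranked[:top_k]

def related(graph, node, depth=1):
    """Stand-in for an ArangoDB traversal: collect records
    reachable within `depth` edges of a data point."""
    found, frontier = set(), {node}
    for _ in range(depth):
        frontier = {n for cur in frontier
                    for n in graph.get(cur, ())} - found - {node}
        found |= frontier
    return found

def query(embeddings, graph, qvec, top_k=2):
    """One request flow: vector hits first, then graph expansion
    to pull in related records across modalities."""
    hits = vector_search(embeddings, qvec, top_k)
    return {h: sorted(related(graph, h)) for h in hits}
```

In the real system the first step would be a Milvus search call and the second an AQL traversal, but the shape of the flow, similarity hits feeding graph expansion within one request, is the same.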
Results
The research produced a working multi-modal search prototype capable of importing and querying across image segments, text documents, and vector embeddings. Benchmarking confirmed that the ArangoDB + Milvus combination provided the best tradeoff between feature coverage, query performance, availability, and infrastructure cost for this use case. The image search pipeline successfully handles segment-level classification and retrieval, and the text search layer resolves synonyms through dictionary graph relationships. We also delivered a minimal web interface for testing and demonstrating the image search capabilities. The findings gave the client a clear technical foundation for deciding how to move forward with their production indexing architecture.