In today’s AI-driven world, the importance of efficient data management cannot be overstated. Vector databases play a crucial role in providing the infrastructure for ML applications. In this article, we will explore the significance of vector databases in AI and examine the current top vector database options available in the market.
Understanding Vector Databases
Vector Databases
Vector databases provide long-term memory and enhance the search and querying capabilities of Large Language Models (LLMs) and other AI applications. In essence, a vector database is designed to efficiently store and manage large sets of vectors, also known as vector embeddings. However, the true value of a vector database extends beyond vector storage — it lies in the range and efficiency of the computations it enables on the stored vectors. These computations include tasks such as similarity searches, which are crucial for various AI applications.
To put it into perspective, traditional SQL databases store data in rows and columns, graph databases specialise in storing graphs, and object databases focus on storing objects. Because AI applications rely largely on vector embeddings, vector databases suit them particularly well, and they are increasingly recognized as a critical component of the AI tech stack.
Vector Embeddings
Vector embeddings are numeric representations of real-world objects such as text documents, images, or audio recordings. They simplify complex and unstructured data by transforming it into lower-dimensional vectors. This is done to enable Machine Learning algorithms to more effectively process and analyse the data.
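To make the idea concrete, here is a deliberately simplistic sketch that turns a text into a fixed-length numeric vector by counting words. It is only a stand-in: real embeddings come from trained models, and the `VOCABULARY` list and `embed` function are hypothetical names made up for this illustration.

```python
from collections import Counter

# Hypothetical fixed vocabulary; a real embedding model learns its dimensions.
VOCABULARY = ["vector", "database", "search", "image", "audio"]


def embed(text: str) -> list[float]:
    """Map a text to a fixed-length vector of word counts (toy embedding)."""
    counts = Counter(text.lower().split())
    # One dimension per vocabulary word; words outside the vocabulary are ignored.
    return [float(counts[word]) for word in VOCABULARY]


doc_vector = embed("Vector search in a vector database")
print(doc_vector)  # [2.0, 1.0, 1.0, 0.0, 0.0]
```

Whatever the input length, the output always has five dimensions, which is what allows two documents to be compared numerically.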
Nearest Neighbour Search
At the core of vector databases lies the concept of enhancing the searchability and usability of unstructured data. Approximate nearest neighbour (ANN) search plays a crucial role, as in Machine Learning applications we usually want to find the most similar vectors rather than exact matches. The distance between vector embeddings serves as a measure of similarity, and the “nearest neighbours” refer to the vectors that closely match the query. Vector databases employ various search optimization algorithms to efficiently find these nearest neighbours.
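As a point of reference, the following minimal sketch performs an exact nearest neighbour search by scanning every stored vector and ranking by cosine similarity. This brute-force scan is the baseline that ANN algorithms approximate in order to scale; the sample vectors and the function names are assumptions chosen for illustration, not taken from any particular database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def nearest_neighbours(query, vectors, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda vector_id: cosine_similarity(query, vectors[vector_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up three-dimensional embeddings keyed by id.
stored = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.0],
    "car": [0.0, 0.2, 0.9],
}
print(nearest_neighbours([1.0, 0.2, 0.0], stored, k=2))  # ['cat', 'dog']
```

The scan touches every vector on every query, so its cost grows linearly with the collection size. That cost is what motivates approximate methods.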
The Need for Specialised Vector Databases
There is an ongoing debate about the need for specialised vector databases as a separate category within the database landscape. Alternatively, vector extensions in traditional databases are seen as a potential way to support the AI market. Noteworthy examples of databases that have already embraced vector extensions include Redis and Elasticsearch.
Other Notable Details
Index types
Indexes serve as a means to enhance the speed of searching within a database. In the realm of vector databases, various established index types exist, and they directly impact the database’s performance, influencing factors such as the query completion time.
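To illustrate how an index trades accuracy for speed, here is a toy sketch in the spirit of an inverted-file (IVF) index: vectors are grouped by their closest centroid, and a query inspects only the closest groups. The class name, the `nprobe` parameter name, and the hard-coded centroids are assumptions for this example; real implementations typically learn the centroids with a clustering step such as k-means.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyIVFIndex:
    """Inverted-file sketch: vectors are bucketed by their nearest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = [[] for _ in centroids]

    def _closest_centroids(self, vector, count):
        # Indices of the `count` centroids nearest to the given vector.
        order = sorted(
            range(len(self.centroids)),
            key=lambda i: squared_distance(vector, self.centroids[i]),
        )
        return order[:count]

    def add(self, vector_id, vector):
        bucket = self._closest_centroids(vector, 1)[0]
        self.buckets[bucket].append((vector_id, vector))

    def search(self, query, k=1, nprobe=1):
        # Only the nprobe closest buckets are scanned, so results are approximate.
        candidates = []
        for bucket in self._closest_centroids(query, nprobe):
            candidates.extend(self.buckets[bucket])
        candidates.sort(key=lambda item: squared_distance(query, item[1]))
        return [vector_id for vector_id, _ in candidates[:k]]


# Hard-coded centroids for illustration only.
index = ToyIVFIndex(centroids=[[0.0, 0.0], [10.0, 10.0]])
index.add("a", [1.0, 1.0])
index.add("b", [2.0, 0.5])
index.add("c", [9.0, 9.5])

print(index.search([5.5, 5.5], k=2, nprobe=1))  # ['c']
print(index.search([5.5, 5.5], k=2, nprobe=2))  # ['c', 'b']
```

With `nprobe=1` the query scans a single bucket and misses the second-nearest vector. Raising `nprobe` recovers it at the cost of scanning more data, which is the kind of tuning knob that affects query completion time.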
Performance
When reviewing the benchmarks provided below, you will notice that the results tend to vary. Conducting accurate and unbiased benchmarks is a challenging task. It’s important to consider that certain solutions may be better suited for specific use cases. Ideally, you should perform benchmarks based on your specific requirements. However, as an initial evaluation, it is helpful to understand the fundamental factors that influence performance and examine a selection of benchmarks and explanations. Moreover, there is an ANN algorithms benchmarking tool, which can help you compare the performance of different databases under the same setup.
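When running your own comparison, one commonly reported quality measure is recall: the share of the true nearest neighbours that an approximate search actually returns. The sketch below assumes you already have both result lists for a query; the function name and the sample ids are hypothetical.

```python
def recall_at_k(approximate_ids, exact_ids, k):
    """Fraction of the true top-k neighbours that the approximate search returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    found = truth.intersection(approximate_ids[:k])
    return len(found) / len(truth)


# Made-up results for one query: ground truth versus an approximate index.
exact = ["a", "b", "c", "d"]
approximate = ["a", "c", "e", "b"]

# Two of the three true top-3 neighbours appear in the approximate top 3.
print(recall_at_k(approximate, exact, k=3))
```

Averaging this value over many queries and reporting it next to query latency or throughput gives a fairer picture than speed alone, since a faster index may simply be returning worse results.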
Exploring the Top Vector Database Options
To understand the current market situation, we are comparing the top vector database options that are gaining traction. In general, we see many new and young companies with significant funding.
Vector Database | Description | GitHub stars | Embeds / Uses | Approximate Nearest Neighbor (ANN) | In-memory Support | Sharding | Index Types | Consistency Model | License | Funding | HQ in | Company |
Milvus | Cloud-native Commercial Open Source vector database | 18k | Not clear: SQLite, RocksDB or MySQL | ✅ | ❌ | ✅ | ANNOY; HNSW; IVF_PQ; IVF_SQ8; IVF_FLAT; FLAT; IVF_SQ8_H; RNSG | 4 consistency levels: strong, bounded staleness, session, and eventual | Apache-2.0 | 113M USD, series B | 🇺🇸 | Zilliz |
Qdrant | Commercial Open Source vector similarity search engine and vector database | 6.6k | RocksDB | ✅ | ✅ | ✅ | HNSW | Eventual Consistency, tunable consistency | Apache-2.0 | 9.8M € | 🇪🇺 | Qdrant Solutions GmbH |
Weaviate | Commercial Open Source cloud-native vector database that stores both objects and vectors. | 5.6k | ❔ | ✅ (It can support multiple ANN algorithms as long as they support full CRUD) | ❌ | ✅ | a custom HNSW algorithm that supports CRUD | Eventual Consistency | BSD | 67.7M USD, series B | 🇪🇺 | SeMI Technologies |
Vespa | Commercial Open Source vector database by Yahoo! It is a search engine which supports vector search, lexical search, and search in structured data | 4.4k | ❔ | ✅ | maintains disk and memory structures for documents | ✅ | HNSW; BM25 | Eventual Consistency | Apache-2.0 | Yahoo! | 🇺🇸 | Yahoo! |
Chroma | Commercial Open Source vector database | 4.4k | HNSW lib, DuckDB; based on ClickHouse | ✅ | ❌ | ❔ | HNSW | ❔ | Apache-2.0 | 20.3M USD, seed | 🇺🇸 | Chroma Inc. |
Marqo AI | A tensor-based cloud-native commercial Open Source search and analytics engine | 2.8k | Tensor-based | ❔ | ❔ | ✅ | HNSW | ❔ | Apache-2.0 | undisclosed preseed in May 2022 | 🇦🇺 | S2Search Australia Pty Ltd |
Vald | Cloud-native Open Source distributed approximate nearest neighbor (ANN) dense vector search engine | 1.2k | uses the vector search engine NGT | ✅ (NGT) | ❔ | ✅ | Distributed Index, asynchronous indexing | N/A | Apache-2.0 | Yahoo Japan Corporation | 🇯🇵 | Yusuke Kato and Kiichiro Yukawa (Yahoo Japan Corporation) |
Pinecone | Fully managed vector database that specializes in enabling semantic search capabilities | N/A | built on top of Faiss | ✅ (proprietary), plus KNN (with Faiss) | ❌ | ✅ | proprietary | Eventual Consistency | Proprietary | 138M USD, series B | 🇺🇸 | Pinecone Systems Inc |
More About the Vector Database Market
Here are some more questions answered for anyone interested.
What is an “Open SaaS” business model?
Software as a Service (SaaS) refers to software that is managed or hosted for the client, essentially being rented. In the context of Open SaaS, the term “open” refers to the use of open source software provided as a service. However, it’s important to note that not all code is open source, especially the parts related to managed services, hosting, and additional features. It’s worth mentioning that the open source software used in this way may or may not be developed by the company offering the SaaS. This situation has caused tensions within the open source community, as the original creators sometimes struggle to earn a living or maintain the software, while other companies profit from it. The issue has led to the emergence of new licences that keep the source code open but restrict others from offering it as a service without contributing the entire source code back to the community. This situation has become particularly notable due to the involvement of major cloud providers in leveraging this option.
How Can AI Companies Raise Such Substantial Funding?
AI’s popularity and the importance of data management for future success have made the database market intriguing. Despite having many established players, the market continues to grow consistently (e.g., 17% in 2020). Historical analysis shows that new types of databases create new markets, typically with one predominant player (“winner-takes-all”). Venture capitalists seek such opportunities, with potential market leaders worth over 100 million dollars in Annual Recurring Revenue. Examples include MongoDB, Cockroach, Neo4j, and Influx. Vector databases might be the next contender. However, profitability in the database market typically takes over 10 years, so there likely will be no quick wins.
Conclusion
Vector databases serve as a vital addition to numerous AI applications, offering an efficient framework for storing vector embeddings. The vector database market is currently experiencing active growth, fuelled by the ongoing AI hype, resulting in new solutions emerging regularly. In this article, we’ve attempted to compare the primary players available as of mid-2023. To stay informed and up-to-date, we recommend keeping an eye on the ranking of vector DBMS by db-engines, which tracks the popularity of leading options.