Top vector database choices in 2023 - Open Source by greenrobot

In today’s AI-driven world, the importance of efficient data management cannot be overstated. Vector databases play a crucial role in providing the infrastructure for ML applications. In this article, we will explore the significance of vector databases in AI and examine the current top vector database options available in the market.

Understanding Vector Databases

Vector Databases

Vector databases provide long-term memory and enhance the search and querying capabilities of Large Language Models (LLMs) and other AI applications. In essence, a vector database is designed to efficiently store and manage large sets of vectors, also known as vector embeddings. However, the true value of a vector database extends beyond vector storage — it lies in the range and efficiency of the computations it enables on the stored vectors. These computations include tasks such as similarity searches, which are crucial for various AI applications.

To put it into perspective, traditional databases like SQL databases store data in rows and columns, while graph databases specialise in storing graphs, and object databases focus on storing objects. Because AI applications largely rely on vector embeddings, vector databases are exceptionally well suited to them and are increasingly recognized as a critical component of the AI tech stack.

Vector Embeddings

Vector embeddings are numeric representations of real-world objects such as text documents, images, or audio recordings. They simplify complex and unstructured data by transforming it into lower-dimensional vectors. This is done to enable Machine Learning algorithms to more effectively process and analyse the data.

Nearest Neighbour Search

At the core of vector databases lies the concept of enhancing the searchability and usability of unstructured data. Approximate nearest neighbour (ANN) search plays a crucial role, as in Machine Learning applications we usually want to find the most similar vectors rather than exact matches. The distance between vector embeddings serves as a measure of similarity, and the “nearest neighbours” refer to the vectors that closely match the query. Vector databases employ various search optimization algorithms to efficiently find these nearest neighbours.
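To make the idea of "nearest neighbours" concrete, here is a minimal sketch of an exact (brute-force) similarity search using cosine similarity. The tiny three-dimensional embeddings and the names `cosine_similarity` and `nearest_neighbours` are made up for illustration. Real embeddings come from a model and usually have hundreds of dimensions, and an ANN algorithm avoids scoring every stored vector the way this sketch does.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(query, vectors, k=2):
    """Exact search: score every stored vector, then keep the k most similar keys."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Hypothetical embeddings, chosen by hand for this example.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
}
query = [0.9, 0.1, 0.05]
result = nearest_neighbours(query, embeddings, k=2)
print(result)
```

The cost of this approach grows linearly with the number of stored vectors, which is why vector databases rely on approximate methods and indexes instead.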

The Need for Specialised Vector Databases

There is an ongoing debate about the need for specialised vector databases as a separate category within the database landscape. Alternatively, vector extensions in traditional databases are seen as a potential way to support the AI market. Noteworthy examples of databases that have already embraced vector extensions include Redis and Elasticsearch.

Other Notable Details

Index types

Indexes serve as a means to enhance the speed of searching within a database. In the realm of vector databases, various established index types exist, and they directly impact the database’s performance, influencing factors such as the query completion time.
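As an illustration of how an index can speed up a search, the following sketch mimics the idea behind inverted-file (IVF) style indexes: vectors are grouped into buckets around centroids, and a query only scans the buckets closest to it. This is a simplified sketch under stated assumptions, not how any particular database implements it. The centroids are fixed by hand here, whereas a real index would typically learn them (for example with k-means), and all function names and the `n_probe` parameter are hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_centroids(vec, centroids, n):
    """Indices of the n centroids nearest to vec."""
    order = sorted(range(len(centroids)), key=lambda i: euclidean(vec, centroids[i]))
    return order[:n]


def build_index(vectors, centroids):
    """Assign every vector to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for key, vec in vectors.items():
        bucket_id = closest_centroids(vec, centroids, 1)[0]
        buckets[bucket_id].append((key, vec))
    return buckets


def search(query, buckets, centroids, k=1, n_probe=1):
    """Approximate search: scan only the n_probe buckets closest to the query."""
    candidates = []
    for i in closest_centroids(query, centroids, n_probe):
        candidates.extend(buckets[i])
    candidates.sort(key=lambda item: euclidean(query, item[1]))
    return [key for key, _ in candidates[:k]]


# Hypothetical data for illustration only.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {"a": [1.0, 1.0], "b": [0.5, 2.0], "c": [9.0, 9.5], "d": [11.0, 10.0]}
buckets = build_index(vectors, centroids)
hits = search([9.2, 9.4], buckets, centroids, k=1)
print(hits)
```

Scanning fewer buckets makes a query faster but can miss a true neighbour that landed in a bucket that was skipped, which is the basic speed versus accuracy trade-off an index type has to balance.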

Performance

When reviewing the benchmarks provided below, you will notice that the results tend to vary. Conducting accurate and unbiased benchmarks is a challenging task. It’s important to consider that certain solutions may be better suited for specific use cases. Ideally, you should perform benchmarks based on your specific requirements. However, as an initial evaluation, it is helpful to understand the fundamental factors that influence performance and examine a selection of benchmarks and explanations. Moreover, there is an ANN algorithms benchmarking tool, which can help you compare the performance of different databases under the same setup.
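If you run your own benchmarks, two numbers are commonly reported together: how many of the true nearest neighbours an approximate search returns (recall) and how long a query takes. The sketch below shows one simple way to compute both. The helper names `recall_at_k` and `time_queries` and the sample result lists are assumptions made for illustration, not part of any benchmarking tool.

```python
import time


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbours that the approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def time_queries(search_fn, queries):
    """Average wall-clock seconds per query for any search function."""
    start = time.perf_counter()
    for query in queries:
        search_fn(query)
    return (time.perf_counter() - start) / len(queries)


# Hypothetical results for one query: ground truth versus an approximate search.
exact = ["c", "d", "a"]
approx = ["c", "a", "b"]
recall = recall_at_k(approx, exact)
print(f"recall@3 = {recall:.2f}")
```

Comparing databases on query time alone can be misleading, because a faster configuration may simply be returning fewer of the correct neighbours.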

Exploring the Top Vector Database Options

To understand the current market situation, we are comparing the top vector database options that are gaining traction. In general, we see many new and young companies with significant funding.


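Milvus
Description: Cloud-native Commercial Open Source vector database
GitHub stars: 18k
Embeds / Uses: Not clear: SQLite, RocksDB or MySQL
Approximate Nearest Neighbor (ANN): ✅
In-memory Support: ❌
Sharding: ✅
Index Types: ANNOY; HNSW; IVF_PQ; IVF_SQ8; IVF_FLAT; FLAT; IVF_SQ8_H; RNSG
Consistency Model: 4 consistency levels: strong, bounded staleness, session, and eventual
License: Apache-2.0
Funding: 113M USD, series B
HQ in: 🇺🇸
Company: Zilliz

Qdrant
Description: Commercial Open Source vector similarity search engine and vector database
GitHub stars: 6.6k
Embeds / Uses: RocksDB
Approximate Nearest Neighbor (ANN) / In-memory Support / Sharding: ✅ (only one value is given for these three columns)
Index Types: HNSW
Consistency Model: Eventual Consistency, tunable consistency
License: Apache-2.0
Funding: 9.8M €
HQ in: 🇪🇺
Company: Qdrant Solutions GmbH

Weaviate
Description: Commercial Open Source cloud-native vector database that stores both objects and vectors
GitHub stars: 5.6k
Embeds / Uses: ❔
Approximate Nearest Neighbor (ANN): ✅ (It can support multiple ANN algorithms as long as they support full CRUD)
In-memory Support: ❌
Sharding: ✅
Index Types: a custom HNSW algorithm that supports CRUD
Consistency Model: Eventual Consistency
License: BSD
Funding: 67.7M USD, series B
HQ in: 🇪🇺
Company: SeMI Technologies

Vespa
Description: Commercial Open Source vector database by Yahoo! It is a search engine which supports vector search, lexical search, and search in structured data
GitHub stars: 4.4k
Embeds / Uses: ❔
Approximate Nearest Neighbor (ANN): ✅
In-memory Support: maintains disk and memory structures for documents
Sharding: ✅
Index Types: HNSW; BM25
Consistency Model: Eventual Consistency
License: Apache-2.0
Funding: Yahoo!
HQ in: 🇺🇸
Company: Yahoo!

Chroma
Description: Commercial Open Source vector database
GitHub stars: 4.4k
Embeds / Uses: HNSW lib, DuckDB; based on ClickHouse
Approximate Nearest Neighbor (ANN): ✅
In-memory Support: ❌
Sharding: ❔
Index Types: HNSW
Consistency Model: ❔
License: Apache-2.0
Funding: 20.3M USD, seed
HQ in: 🇺🇸
Company: Chroma Inc.

Marqo AI
Description: A tensor-based cloud-native commercial Open Source search and analytics engine
GitHub stars: 2.8k
Embeds / Uses: Tensor-based
Approximate Nearest Neighbor (ANN) / In-memory Support / Sharding: ❔, ✅ (only two values are given for these three columns)
Index Types: HNSW
Consistency Model: ❔
License: Apache-2.0
Funding: undisclosed preseed in May 2022
HQ in: 🇦🇺
Company: S2Search Australia Pty Ltd

Vald
Description: Cloud-native Open Source distributed approximate nearest neighbor (ANN) dense vector search engine
GitHub stars: 1.2k
Embeds / Uses: uses the vector search engine NGT
Approximate Nearest Neighbor (ANN): ✅ (NGT)
In-memory Support: ❔
Sharding: ✅
Index Types: Distributed Index, asynchronous indexing
Consistency Model: N/A
License: Apache-2.0
Funding: Yahoo Japan Corporation
HQ in: 🇯🇵
Company: Yusuke Kato and Kiichiro Yukawa (Yahoo Japan Corporation)

Pinecone
Description: Fully managed vector database that specializes in enabling semantic search capabilities
Embeds / Uses: built on top of Faiss
Approximate Nearest Neighbor (ANN): ✅ (proprietary), plus KNN (with Faiss)
In-memory Support: ❌
Sharding: ✅
Index Types: proprietary
Consistency Model: Eventual Consistency
License: Proprietary
Funding: 138M, series B
HQ in: 🇺🇸
Company: Pinecone Systems Inc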

More About the Vector Database Market

Here are some more questions answered for anyone interested.

What is an “Open SaaS” business model?

Software as a Service (SaaS) refers to software that is managed or hosted for the client, essentially being rented. In the context of Open SaaS, the term “open” refers to the use of open source software provided as a service. However, it’s important to note that not all code is open source, especially the parts related to managed services, hosting, and additional features. It’s worth mentioning that the open source software used in this way may or may not be developed by the company offering the SaaS. This situation has caused tensions within the open source community, as the original creators sometimes struggle to earn a living or maintain the software, while other companies profit from it. The issue has led to the emergence of new licences that keep the source code open but restrict others from offering it as a service without contributing the entire source code back to the community. This situation has become particularly notable due to the involvement of major cloud providers in leveraging this option.

How Can AI Companies Raise Such Substantial Funding?

AI’s popularity and the importance of data management for future success have made the database market intriguing. Despite having many established players, the market continues to grow consistently (e.g., 17% in 2020). Historical analyses show that new types of databases create new markets, typically with one predominant player (“winner-takes-all”). Venture capitalists seek such opportunities, with potential market leaders worth over 100 million dollars in Annual Recurring Revenue. Examples include MongoDB, Cockroach, Neo4j, and Influx. Vector databases might be the next contender. However, profitability in the database market typically takes over 10 years, so there will likely be no quick wins.

Conclusion

Vector databases serve as a vital addition to numerous AI applications, offering an efficient framework for storing vector embeddings. The vector database market is currently experiencing active growth, fuelled by the ongoing AI hype, resulting in new solutions emerging regularly. In this article, we’ve attempted to compare the primary players available as of mid-2023. To stay informed and up-to-date, we recommend keeping an eye on the ranking of vector DBMS by db-engines, which tracks the popularity of leading options.
