Top vector database choices in 2024 - Open Source by greenrobot

In today’s AI-driven world, the importance of efficient data management cannot be overstated. Vector databases play a crucial role in providing the infrastructure for ML applications. In this article, we will explore the significance of vector databases in AI and examine the current top vector database options available in the market.

Understanding Vector Databases

Vector Databases

Vector databases provide long-term memory and enhance the search and querying capabilities of Large Language Models (LLMs) and other AI applications. In essence, a vector database is designed to efficiently store and manage large sets of vectors, also known as vector embeddings. However, the true value of a vector database extends beyond vector storage — it lies in the range and efficiency of the computations it enables on the stored vectors. These computations include tasks such as similarity searches, which are crucial for various AI applications.

To put it into perspective, traditional databases like SQL databases store data in rows and columns, while graph databases specialise in storing graphs, and object databases focus on storing objects. Given that AI applications largely rely on vector embeddings, vector databases are exceptionally well-suited for AI applications. Because of this, they are increasingly recognized as a critical component of the AI tech stack.

Vector Embeddings

Vector embeddings are numeric representations of real-world objects such as text documents, images, or audio recordings. They simplify complex and unstructured data by transforming it into lower-dimensional vectors. This is done to enable Machine Learning algorithms to more effectively process and analyse the data.

Nearest Neighbour Search

At the core of vector databases lies the concept of enhancing the searchability and usability of unstructured data. Approximate nearest neighbour (ANN) search plays a crucial role, as in Machine Learning applications we usually want to find the most similar vectors rather than exact matches. The distance between vector embeddings serves as a measure of similarity, and the “nearest neighbours” refer to the vectors that closely match the query. Vector databases employ various search optimization algorithms to efficiently find these nearest neighbours.

The Need for Specialised Vector Databases

There is an ongoing debate about the need for specialised vector databases as a separate category within the database landscape. Alternatively, vector extensions in traditional databases are seen as a potential way to support the AI market. Noteworthy examples of databases that have already embraced vector extensions include Redis and Elasticsearch.

Other Notable Details

Index types

Indexes serve as a means to enhance the speed of searching within a database. In the realm of vector databases, various established index types exist, and they directly impact the database’s performance, influencing factors such as the query completion time.

Performance

When reviewing the benchmarks provided below, you will notice that the results tend to vary. Conducting accurate and unbiased benchmarks is a challenging task. It’s important to consider that certain solutions may be better suited for specific use cases. Ideally, you should perform benchmarks based on your specific requirements. However, as an initial evaluation, it is helpful to understand the fundamental factors that influence performance and examine a selection of benchmarks and explanations. Moreover, there is an ANN algorithms benchmarking tool, which can help you compare the performance of different databases under the same setup.

Exploring the Top Vector Database Options

To understand the current market situation, we are comparing the top vector database options that are gaining traction. In general, we see many new and young companies with significant funding.

Please note: if the table doesn’t fit into your device’s screen, you can view it in full here.

Name

Open Source

License

GitHub stars

Developed in (language)

Summary

Embeds / Uses

In-memory Unterstützung

Sharding

Index Types

Consistency Model

Benchmarks (Performance?)

Approximate Nearest Neighbor (ANN) Vector Databases

Who’s behind it

HQ in

Chroma

Apache-2.0

4.4k ⭐

Python & Typescript

Chroma is a Commercial Open Source vector database

HNSW lib, DuckDB; based on ClickHouse

Dynamic segment placement

Chroma Inc.

🇺🇸

Marqo AI

Apache-2.0

2.8k ⭐

Python

A tensor-based cloud-native commercial Open Source search and analytics engine.

Tensor-based

HNSW

–

S2Search Australia Pty Ltd

🇦🇺

Milvus

Apache-2.0

18k ⭐

GoLang & Python

Milvus is a cloud-native Commercial Open Source vector database

Initial blog post from them said SQLite, but meanwhile they said RocksDB – exchanged?
they also have a ChatGPT-Cache that is build on SQLite
and say "Milvus uses SQLite or MySQL to manage metadata"

Dynamic segment placement

ANNOY; HNSW; IVF_PQ; IVF_SQ(; IVF_FLAT; FLAT; IVF_SQ8_H; RNSG

Strong, bounded staleness, session, and eventually. The default consistency level in Milvus is bounded staleness.

not comparative

Zilliz

🇺🇸

ObjectBox

Apache-2.0

C++, supports native language APIs in Java, Flutter / Dart, Swift, Python, GoLang, and C++

ObjectBox is an on-device vector database for Edge AI on Mobile, IoT, Embedded and other commodity devices

HNSW built and optimized from scratch for efficiency / speed on devices with limited resources

HNSW

Transactionally safe, ACID

ObjectBox

🇪🇺

Pinecone

Proprietary

Pinecone is a fully managed vector database that specializes in enabling semantic search capabilities

built on top of Faiss

proprietary

Eventual Consistency

more programming language comparison for vector databases

Y (proprietary), plus KNN (with Faiss)

Pinecone Systems Inc

🇺🇸

Qdrant

Apache-2.0

6.6k ⭐

Rust

Qdrant is a Commercial Open Source vector similarity search engine and vector database

RocksDB

Y, static sharding

HNSW (SQ & PQ)

Eventual Consistency, tunable consistency

compares to weaviate, milvus, elastic (note: redis took too long to complete)

Qdrant Solutions GmbH

🇪🇺

Vald

Apache-2.0

1.2k ⭐

GoLang

Vald is a cloud-native Open Source distributed approximate nearest neighbor (ANN) dense vector search engine

uses the vector search engine NGT

❔

N/A

not comparitive, but Vald performance only

Y (NGT)

Yusuke Kato (Yahoo Japan Corporation), Kiichiro Yukawa (Yahoo Japan Corporation)

🇯🇵

Vespa

Apache-2.0

4.4k ⭐

Java & C++

Vespa is a Commercial Open Source vector database by Yahoo! It is a search engine which supports vector search, lexical search, and search in structured data

❔

maintains disk and memory structures for documents

Custom HNSW (Multi-vector hybrid HNSW-IF)

Eventual Consistency

not comparative

Yahoo!

🇺🇸

Weaviate

BSD

5.6k ⭐

Assembly, C++, GoLang

Weaviate is a commercial Open Source cloud-native vector database that stores both objects and vectors.

❔

Y, static sharding

a custom HNSW PQ algorithm that supports CRUD

Eventual Consistency

not comparative, just evaluating their own performance

Y (multiple ANN algorithms as long as they support full CRUD)

SeMI Technologies

🇪🇺

More About the Vector Database Market

Here are some more questions answered for anyone interested.

What is an “Open SaaS” business model?

Software as a Service (SaaS) refers to software that is managed or hosted for the client, essentially being rented. In the context of Open SaaS, the term “open” refers to the use of open source software provided as a service. However, it’s important to note that not all code is open source, especially the parts related to managed services, hosting, and additional features. It’s worth mentioning that the open source software used in this way may or may not be developed by the company offering the SaaS. This situation has caused tensions within the open source community, as the original creators sometimes struggle to earn a living or maintain the software, while other companies profit from it. The issue has led to the emergence of new licences that keep the source code open but restrict others from offering it as a service without contributing the entire source code back to the community. This situation has become particularly notable due to the involvement of major cloud providers in leveraging this option.

How Can AI Companies Raise Such Substantial Funding?

AI’s popularity and the importance of data management for future success have made the database market intriguing. Despite having many established players, the market continues to grow consistently (e.g., 17% in 2020). Historical analysis show that new types of databases create new markets, typically with one pre-dominant player (“winner-takes-all”). Venture capitalists seek such opportunities, with potential market leaders worth over 100 million dollars in Annual Recurring Revenue. Examples include MongoDB, Cockroach, Neo4j, and Influx. Vector databases might be the next contender. However, profitability in the database market typically takes over 10 years, so there likely will be no quick wins.

Conclusion

Vector databases serve as a vital addition to numerous AI applications, offering an efficient framework for storing vector embeddings. The vector database market is currently experiencing active growth, fuelled by the ongoing AI hype, resulting in new solutions emerging regularly. In this article, we’ve attempted to compare the primary players available as of mid-2023. To stay informed and up-to-date, we recommend keeping an eye on the ranking of vector DBMS by db-engines, which tracks the popularity of leading options.

Spread the love