Vector indexes, MariaDB & pgvector, large server, small dataset: part 1
This post has results for vector index support in MariaDB and Postgres. This work was done by Small Datum LLC and sponsored by the MariaDB Corporation.
My previous posts (here and here) used a server with 8 cores and 32G of RAM. While that was OK for one of the smaller datasets from ann-benchmarks, it wasn't enough for larger datasets because of the amount of memory used by the benchmark client. I made some changes to the benchmark client to reduce the transient spikes in memory usage, but I wasn't able to fully solve the problem, so I moved to a larger server with 48 cores and 128G of RAM.
This post has results for the fashion-mnist-784-euclidean dataset using from 1 to 48 concurrent sessions. This is part 1. Parts 2, 3 and 4 will explain the results, and then I will move on to a larger dataset.
I compare MariaDB with pgvector because I respect the work that the Postgres community has done to support vector search workloads. And I am happy to report that MariaDB has also done a great job on this. While I don't know the full story of the development effort, this feature came from the MariaDB Foundation and the community and it is wonderful to see that collaboration.
tl;dr
- At low and high concurrency levels MariaDB gets much more QPS for a given recall target. Here low means <= 4 concurrent sessions and high means >= 24.
- At middle concurrency levels (8 through 20 concurrent sessions) MariaDB still does better but the gap isn't as large. I will try to explain this in future posts.
Benchmark
An earlier post has much more detail on the benchmark setup. However, here I switched to a larger server (Hetzner ax162-s) with 48 cores, 128G of RAM, Ubuntu 22.04 and HW RAID 10 using 2 NVMe devices.
I use ann-benchmarks via my fork of a fork of a fork at this commit. Note that parallel index create was disabled for Postgres by my configuration and isn't (yet) supported by MariaDB.
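For reference, one common way to disable parallel index builds in Postgres is to set max_parallel_maintenance_workers to 0; my configuration may do it differently, but a minimal sketch looks like this:

psql -c "ALTER SYSTEM SET max_parallel_maintenance_workers = 0;"
psql -c "SELECT pg_reload_conf();"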
I ran tests for fashion-mnist-784-euclidean at 1, 2, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44 and 48 concurrent sessions. The command lines were the following, using pgvector, pgvector_halfvec and mariadb as the value of $alg. When --batch is used the concurrency level (between 2 and 48 concurrent sessions) is set by an environment variable (POSTGRES_BATCH_CONCURRENCY or MARIADB_BATCH_CONCURRENCY); see the example after the command lines.
python3 run.py --algorithm $alg --dataset fashion-mnist-784-euclidean --timeout -1 --local --force --respect_config_order --runs 3
python3 run.py --algorithm $alg --dataset fashion-mnist-784-euclidean --timeout -1 --local --force --respect_config_order --runs 3 --batch
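For example, a batch run for MariaDB at 24 concurrent sessions looks roughly like this, with the environment variable set to the concurrency level being tested:

MARIADB_BATCH_CONCURRENCY=24 python3 run.py --algorithm mariadb --dataset fashion-mnist-784-euclidean --timeout -1 --local --force --respect_config_order --runs 3 --batch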
I filed MDEV-35897 for MariaDB because it allocates and then deallocates too much memory when ef_search is large, and large for me was >= 300. The overhead from this hurts query response times. Fortunately the fix should be easy. For now I changed config.yml for MariaDB to not use ef_search values larger than 200 (see query_args here).
Files:
- The config.yml files are here for MariaDB and for pgvector.
- Files related to these tests are archived here.
Results: QPS vs recall graphs
The recall vs QPS graphs are created by running plot.py from ann-benchmarks. The line colors are red for MariaDB, dark blue for pgvector with halfvec (float16) and light blue for pgvector.
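Something like the following produces the graphs for the non-batch and batch runs; the exact flags for plot.py may differ in my fork:

python3 plot.py --dataset fashion-mnist-784-euclidean
python3 plot.py --dataset fashion-mnist-784-euclidean --batch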
[Graphs: QPS vs recall at 1 session (no concurrency) and at 2, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44 and 48 concurrent sessions]