As LLM technology is adopted and deployed, databases are expected to offer stronger vector analysis and AI support, and vector databases and vector retrieval have come to the fore, drawing sustained attention from the industry. In short, vector retrieval and vector databases give an LLM an external memory unit: they supply content relevant to the question and to the conversation history, helping the LLM return more accurate answers.
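The "external memory" idea can be sketched in a few lines. The example below is only an illustration under stated assumptions: the three-dimensional vectors are hand-written stand-ins for real embeddings, and `cosine`, `top_k`, and `memory` are hypothetical names, not part of ByteHouse or any LLM library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, memory, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy "memory": (text, embedding) pairs. Real embeddings would come from a model.
memory = [
    ("User asked about index types", [0.9, 0.1, 0.0]),
    ("User asked about pricing", [0.0, 0.2, 0.9]),
    ("Answer described HNSW", [0.8, 0.3, 0.1]),
]

query_vec = [1.0, 0.2, 0.0]  # assumed embedding of the new question
context = top_k(query_vec, memory, k=2)

# The retrieved history is prepended to the question before it is sent to the LLM.
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: which index should I use?"
```

In a real deployment the linear scan in `top_k` is what a vector index replaces.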
Vector retrieval is not tied only to LLMs; it has long been associated with OLAP engines. As software for data analysis, an OLAP engine can process large volumes of data quickly and efficiently and provides multi-dimensional analysis, while vector retrieval further improves the engine's ability to analyze and retrieve unstructured data.
Recently, ByteHouse, the cloud-native data warehouse from Volcano Engine, launched a high-performance vector retrieval feature. It supports a variety of vector retrieval algorithms and an efficient execution path, handles large-scale vector retrieval scenarios, and achieves millisecond-level query latency.
The ByteHouse team has been working on vector retrieval technology for a long time. According to ByteHouse technical experts, "vector databases are currently developing along two main lines. One is to build a dedicated vector database that designs the storage of vector data and indexes and the resource management strategy around a vector-centric idea; its query patterns are simple and the data types it supports are limited. The other is to extend an existing database with vector retrieval capabilities, adding vector index maintenance and query execution logic to the existing data management mechanism and query execution path. The two lines are now borrowing from each other, and both are moving toward complete database functionality plus high-performance vector retrieval."
ByteHouse originated from ClickHouse, but ClickHouse has problems such as repeated reading of vector indexes and redundancy, which make it poorly suited to vector retrieval scenarios that require low latency and high concurrency.
Based on this analysis, ByteHouse reworked vector retrieval comprehensively. First, following the vector-centric idea, ByteHouse built an efficient execution path for vector retrieval and combined it with mechanisms such as index caching and storage-layer filtering to push performance further. In addition, to serve different application scenarios, ByteHouse supports several commonly used vector indexing algorithms, including HNSW, FLAT, IVFFLAT, and IVFPQ. Finally, the newly introduced vector index follows the semantics of the existing secondary indexes, and the new execution path also works with the existing distance functions. This lowers the barrier to adoption and the learning cost: users can apply high-performance vector retrieval using the ClickHouse semantics they already know.
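To make the difference between two of the index types named above concrete, here is a minimal sketch of a FLAT (exact, scan everything) search next to an IVFFLAT-style search (scan only the lists nearest the query). It is a toy illustration, not ByteHouse's implementation: the centroids are assumed to be given rather than trained with k-means, and `flat_search`, `build_ivf`, `ivf_search`, and `nprobe` are illustrative names.

```python
def l2(a, b):
    """Squared Euclidean distance (enough for ranking)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_search(query, vectors, k):
    """FLAT: exact search that compares the query against every vector."""
    order = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
    return order[:k]

def build_ivf(vectors, centroids):
    """IVF build step: put each vector id into the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """IVF search step: scan only the nprobe lists closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    candidates.sort(key=lambda i: l2(query, vectors[i]))
    return candidates[:k]

# Toy data: three obvious clusters; centroids are assumed, not trained.
vectors = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]]
centroids = [[0.0, 0.1], [1.0, 1.0], [5.0, 5.0]]
lists = build_ivf(vectors, centroids)

query = [0.95, 1.0]
exact = flat_search(query, vectors, k=2)
approx = ivf_search(query, vectors, centroids, lists, k=2, nprobe=1)
```

Here the IVF search touches two of the six vectors and still returns the exact answer; on real data a small `nprobe` trades some recall for speed, which is why several index types are offered for different scenarios.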
ByteHouse vector retrieval components.
In building its high-performance vector retrieval capability, ByteHouse first had to overcome the following three difficulties:
The first is read amplification in the columnar storage structure. To cut unnecessary data reads, ByteHouse optimized both the query execution layer and the data read layer, and the mature designs of its two engines, HaMergeTree and HaUniqueMergeTree, provide a stable foundation for vector retrieval. The second is cold reads: newly written data and service restarts cause the first queries to hit unloaded indexes, which makes performance unstable. To address this, ByteHouse introduced a preload mechanism that actively loads an index into the cache once it has been built and automatically reclaims outdated indexes to avoid wasting resources. The third is that index building consumes a lot of resources. To limit the impact of build operations on normal queries, ByteHouse introduced a resource control strategy that lets users adjust the resources used for index building dynamically according to the application scenario, which greatly reduces the overhead of the original execution path.
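The preload and reclamation idea can be sketched as a small cache. This is a hedged illustration only: the article does not describe ByteHouse's cache design, so the LRU policy, the `IndexCache` class, and the part names below are all assumptions chosen for the example.

```python
from collections import OrderedDict

class IndexCache:
    """Toy index cache: preload after build, LRU eviction, explicit drop."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # part name -> in-memory index
        self.cold_loads = 0            # queries that had to read from disk

    def preload(self, part, index):
        """Called right after an index is built, before any query needs it."""
        self._put(part, index)

    def get(self, part, load_from_disk):
        """Return the cached index, loading it on a miss (a cold read)."""
        if part in self._entries:
            self._entries.move_to_end(part)  # mark as recently used
            return self._entries[part]
        self.cold_loads += 1
        index = load_from_disk(part)
        self._put(part, index)
        return index

    def drop(self, part):
        """Reclaim the index of a part that is outdated, e.g. after a merge."""
        self._entries.pop(part, None)

    def __contains__(self, part):
        return part in self._entries

    def _put(self, part, index):
        self._entries[part] = index
        self._entries.move_to_end(part)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

def load_from_disk(part):
    # Stand-in for an expensive read of the index file.
    return f"index({part})"

cache = IndexCache(capacity=2)
cache.preload("part_1", "index(part_1)")      # built and loaded proactively
first = cache.get("part_1", load_from_disk)   # served from cache, no cold read
cache.drop("part_1")                          # part_1 was merged away
cache.preload("part_1_2", "index(part_1_2)")  # index of the merged part
```

Without the `preload` call, the first `get` would count as a cold load, which is the latency spike the mechanism is meant to avoid.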
Evaluation based on the open-source tool VectorDBBench, with Milvus 2.3.0 as the comparison.
Test environment: 1 node, 80 cores, 376 GB memory.
As for the final performance results, the ByteHouse team ran tests with VectorDBBench, the industry's latest benchmarking tool. On the Cohere 1M standard test dataset, ByteHouse at 98% recall reaches performance comparable to a dedicated vector database. At 95% recall or higher, QPS exceeds 2,600 and P99 latency is about 15 ms, which places it among the leaders in the industry.
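For readers unfamiliar with the three metrics quoted above, the sketch below shows one common way to compute them. The numbers in it are made up for illustration and are not ByteHouse or VectorDBBench results; the function names are hypothetical, and the percentile uses the simple nearest-rank definition, which may differ from what a given benchmark tool uses.

```python
import math

def recall_at_k(retrieved, ground_truth):
    """Fraction of true nearest neighbours that the index actually returned."""
    hits = sum(len(set(r) & set(g)) for r, g in zip(retrieved, ground_truth))
    total = sum(len(g) for g in ground_truth)
    return hits / total

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def qps(num_queries, wall_seconds):
    """Queries completed per second of wall-clock time."""
    return num_queries / wall_seconds

# Made-up measurements for two queries with k = 4.
retrieved = [[1, 2, 3, 4], [5, 6, 7, 9]]
ground_truth = [[1, 2, 3, 4], [5, 6, 7, 8]]
recall = recall_at_k(retrieved, ground_truth)

# Made-up latencies: 1 ms, 2 ms, ..., 100 ms.
latencies_ms = list(range(1, 101))
p99 = percentile(latencies_ms, 99)

throughput = qps(num_queries=1000, wall_seconds=0.5)
```

Recall and throughput pull against each other: settings that scan more candidates raise recall and lower QPS, which is why benchmark results are reported at a fixed recall level.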
Performance optimization has always been one of ByteHouse's core goals in meeting data processing and analysis needs. Beyond its continued development of vector retrieval, ByteHouse has also been heavily optimized in areas such as query analysis and data import, with significant performance gains, and it continues to help enterprises make faster data-driven decisions while reducing costs and increasing efficiency.