Champion and runner-up! Open-source search engine Puck shines at NeurIPS 2023

Mondo Sports Updated on 2024-01-30

Recently, the leaderboard of BIG-ANN at NeurIPS'23, an international vector retrieval competition that has attracted global attention, was officially announced. The search content technology team won first place in the streaming search track and second place in the filtered search track with Puck, its self-developed ANN retrieval algorithm.

NeurIPS (Conference on Neural Information Processing Systems) is a prestigious academic conference on machine learning, artificial intelligence, and neuroscience. Together with ICML and ICLR, it is regarded as one of the most selective, highest-level, and most influential conferences in machine learning. It gives researchers and practitioners an opportunity to showcase and compare the latest techniques and algorithms.

As vector retrieval becomes more widespread, solving the many variants of nearest neighbor search that arise in real-world scenarios becomes more important. Against this background, the NeurIPS'23 BIG-ANN competition was set up to promote innovation in large-scale ANN research and its practical application in production environments, and to advance index data structures and indexing algorithms. Thanks to the visibility and authority of NeurIPS, BIG-ANN attracted many well-known companies and top universities to compete on the same stage.

This year's BIG-ANN consists of 4 tracks: filtered search, out-of-distribution, sparse, and streaming search. Participants can submit entries to one or more tracks. The standard evaluation hardware is an Azure Standard D8lds v5 instance (8 vCPUs and 16GB DRAM). Index build time on this machine is limited to 12 hours, and the time limit for the streaming search track is more stringent.

Streaming search track requirements: the 30M dataset is clustered by k-means into 64 groups. A round of index updates consists of inserting a group of samples, searching, deleting a group of samples, and searching again. The algorithm must complete 5 rounds of index updates within 1 hour, for a total of 1280 operations. Because each insertion draws on a different group, the algorithm must deliver good retrieval results without knowing the full data distribution. Insertions and deletions alternate constantly, so the set of samples in the index changes dynamically, and the algorithm must keep the index structure compact (low memory consumption) on a dynamic dataset.
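The update schedule described above can be sketched as a short Python generator. This is an illustration only, not the competition's actual runbook format. The name `build_runbook` and the tuple layout are hypothetical. It also assumes one reading of the numbers in the text: each of the 5 rounds visits all 64 groups, with four operations per group, which gives 5 x 64 x 4 = 1280 operations. The text does not say which samples each deletion removes, so the sketch records only the kind and position of each operation.

```python
from typing import Iterator, Tuple

NUM_GROUPS = 64  # k-means groups, as stated in the text
NUM_ROUNDS = 5   # rounds of index updates, as stated in the text

# (operation kind, round index, group index); a hypothetical layout
Operation = Tuple[str, int, int]


def build_runbook(num_rounds: int = NUM_ROUNDS,
                  num_groups: int = NUM_GROUPS) -> Iterator[Operation]:
    """Yield operations under the assumed reading: every round visits every group."""
    for r in range(num_rounds):
        for g in range(num_groups):
            yield ("insert", r, g)  # add a batch of samples drawn from group g
            yield ("search", r, g)  # run the queries against the updated index
            yield ("delete", r, g)  # remove a batch (which one is unspecified)
            yield ("search", r, g)  # run the queries again after the deletion


runbook = list(build_runbook())
print(len(runbook))  # 1280
```

Under this reading, half of the operations (640) are searches, and each of them contributes a recall measurement.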

Compared with the other tracks, which focus only on retrieval performance, the streaming search track expects the algorithm to be more robust to a changing data distribution. The track aggregates the recall@10 measured at each search step into the algorithm's score. Puck ranked first with a recall of 0.9849 and won the track.
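For readers unfamiliar with the metric, here is a minimal sketch of recall@10 for a single query. The function names are hypothetical, and the use of a plain mean to combine per-step values is an assumption: the text only says that a statistic of the per-search recall is used.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[int],
                ground_truth: Sequence[int],
                k: int = 10) -> float:
    """Fraction of the true top-k neighbor ids that appear in the returned top-k."""
    truth = set(ground_truth[:k])
    if not truth:
        raise ValueError("ground_truth must not be empty")
    hits = len(truth.intersection(retrieved[:k]))
    return hits / len(truth)


def mean_recall(per_step_recalls: Sequence[float]) -> float:
    """Combine per-step recall values with a plain mean (an assumed aggregate)."""
    return sum(per_step_recalls) / len(per_step_recalls)


# The index returned ids 1..10, but the true 10th neighbor is id 11.
returned = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
truth = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]
print(recall_at_k(returned, truth))  # 0.9
```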

Taking into account both the requirements of the track and the cost of business applications, Puck reworked its existing workflow for training, index building, and retrieval. Puck's default index structure has four layers, each with a different function. For the competition, the index parameters were adjusted to reduce the influence of the space-partitioning layers on recall, so that recall is controlled mainly by the third, filtering layer. The results of the third layer depend on the distance between the query and each sample, which is less affected by changes in the data distribution, so recall remains stable.

Filtered search track task: a random 10M slice of the YFCC 100M dataset, converted to CLIP embeddings. Each image is associated with a "bag" of tags: words extracted from the description, the camera model, the year the picture was taken, and the country. The tags come from a vocabulary of 200386 possible tags.

This track focuses on retrieval performance. The algorithm must support filtered retrieval by tag, so that the results contain only samples that match the query tags. The leaderboard ranks algorithms by the QPS they achieve, provided that recall@10 is at least 90%. With a QPS of 19153.42, Puck ranked second.
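The two rules in this paragraph can be illustrated with a brute-force sketch. It is not Puck's algorithm and not the competition's evaluation code: it only shows what "results contain only samples that match the query tags" and "highest QPS at recall@10 of at least 90%" mean. The function names, the use of squared Euclidean distance, and the toy data are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Set, Tuple

Item = Tuple[Sequence[float], Set[str]]  # (vector, bag of tags); hypothetical layout


def filtered_search(items: Sequence[Item],
                    query: Sequence[float],
                    query_tags: Set[str],
                    k: int = 10) -> List[int]:
    """Exact k-NN over only those items that carry every query tag."""
    candidates = []
    for idx, (vec, tags) in enumerate(items):
        if query_tags <= tags:  # subset test: all query tags must be present
            dist = sum((a - b) ** 2 for a, b in zip(vec, query))
            candidates.append((dist, idx))
    candidates.sort()
    return [idx for _, idx in candidates[:k]]


def best_qualifying_qps(runs: Sequence[Tuple[float, float]],
                        min_recall: float = 0.9) -> Optional[float]:
    """Highest QPS among (recall, qps) runs that meet the recall threshold."""
    qualifying = [qps for recall, qps in runs if recall >= min_recall]
    return max(qualifying) if qualifying else None


items = [
    ([0.0, 0.0], {"canon", "2009"}),
    ([1.0, 0.0], {"canon"}),
    ([0.1, 0.0], {"nikon", "2009"}),
]
print(filtered_search(items, [0.0, 0.0], {"canon"}, k=2))          # [0, 1]
print(filtered_search(items, [0.0, 0.0], {"canon", "2009"}, k=2))  # [0]
print(best_qualifying_qps([(0.95, 1000.0), (0.91, 1500.0), (0.85, 3000.0)]))  # 1500.0
```

In the last call, the fastest run (3000 QPS) is excluded because its recall falls below the threshold. This is why tuning for the leaderboard means trading speed against recall right at the 90% boundary.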

The track requires retrieval with 1 or 2 tags, whereas practical business applications usually need retrieval with multiple tags (>=3). Puck therefore abandoned the scheme of building a separate index for each tag (that scheme performs better, but its scalability is limited) and adopted an index structure that fuses vectors and tags. This structure supports retrieval with any number of tags, and its retrieval performance is less affected by the number of tags.

The NeurIPS'23 competitions have built a good platform for exchange and cooperation in the field of artificial intelligence. This award affirms the technical strength of the Puck team and recognizes its continuous exploration and innovative spirit. The team expects Puck to contribute more to the development of vector retrieval technology in the future.

Puck's performance and accuracy enable it to process large amounts of data efficiently while keeping search results highly accurate. In addition, Puck is highly flexible and supports very large datasets, so it can be adapted to different data types and query needs. [2]

Puck encourages collaboration and knowledge sharing among developers, with the aim of building an active and broad ecosystem that supports the project's fast, sustainable development and drives technological innovation.

When you benefit from participating in open source, you are also shaping the open-source field and helping it grow in broader directions.

References:

[1] Title & Champion:

[2] See GitHub for details
