1302: 实验2-2-5 附加题 Top-k最近库向量检索 / Top-k Nearest Library Vector Search
Memory Limit:128 MB
Time Limit:1.000 S
Judge Style:Text Compare
Creator:
Submit:26
Solved:15
Description
【中文题面】
真实的向量检索系统通常不只返回“最相似的一个”结果,而是返回前k个最相似结果,供后续模型、业务规则或人工判断继续筛选。这个过程常被称为Top-k最近邻检索。
在人脸识别中,Top-k结果可以作为候选身份列表,系统再结合阈值、活体检测、人工复核等步骤做最终判断;在推荐算法中,系统常先从海量商品、文章、短视频或音乐中召回Top-k个相似项目,再用排序模型进一步重排;在以文搜图、以图搜图、医学病例检索和大语言模型RAG知识库检索中,Top-k向量检索也经常作为第一步召回机制。
本题给定一个查询向量和若干个库向量,输出距离最近的前k个库向量。若距离相同,编号较小者优先。距离仍使用欧氏距离。
本题是题目4的提高版。可以使用vector保存向量和距离。若尚未熟练使用sort,也可以通过重复查找当前未选中的最小距离来完成;若已经掌握sort,可以把“距离平方+编号”保存后排序。调试记录中建议比较两种思路的代码复杂度、运行时间和内存使用。
本题仍是小规模教学模型。真实向量检索库会进一步考虑索引构建、近似搜索、召回率、延迟、内存占用和并发查询等工程问题。
扩展阅读:Faiss(高效相似向量搜索与聚类库)、Milvus(开源向量数据库)、hnswlib(基于HNSW的近似最近邻检索库)。
[English Statement]
Real vector retrieval systems usually do not return only the single most similar result. They often return the top-k most similar results, so that later models, business rules, or human reviewers can make further decisions. This process is commonly called top-k nearest neighbor retrieval.
In face recognition, top-k results can be used as a candidate identity list before applying thresholds, liveness checks, or manual review. In recommender systems, top-k retrieval is often used to recall candidate products, articles, short videos, or songs from a large collection before a ranking model reorders them. Top-k vector retrieval is also widely used in text-to-image search, image-to-image search, medical case retrieval, and RAG knowledge-base retrieval for large language models.
Given a query vector and several library vectors, output the top-k nearest library vectors. If distances are equal, the smaller index has higher priority. Euclidean distance is used.
This is an extension of Problem 4. You may use vector to store vectors and distances. If you are not familiar with sort, you can repeatedly find the current unused minimum distance. If you have learned sort, you can store each pair of squared distance and index, then sort them. In the debug record, compare the code complexity, runtime, and memory use of the two approaches.
This problem is still a small-scale teaching model. Real vector retrieval libraries also need to consider index construction, approximate search, recall, latency, memory usage, and concurrent queries.
Further reading: Faiss, Milvus, and hnswlib.
真实的向量检索系统通常不只返回“最相似的一个”结果,而是返回前k个最相似结果,供后续模型、业务规则或人工判断继续筛选。这个过程常被称为Top-k最近邻检索。
在人脸识别中,Top-k结果可以作为候选身份列表,系统再结合阈值、活体检测、人工复核等步骤做最终判断;在推荐算法中,系统常先从海量商品、文章、短视频或音乐中召回Top-k个相似项目,再用排序模型进一步重排;在以文搜图、以图搜图、医学病例检索和大语言模型RAG知识库检索中,Top-k向量检索也经常作为第一步召回机制。
本题给定一个查询向量和若干个库向量,输出距离最近的前k个库向量。若距离相同,编号较小者优先。距离仍使用欧氏距离。
本题是题目4的提高版。可以使用vector保存向量和距离。若尚未熟练使用sort,也可以通过重复查找当前未选中的最小距离来完成;若已经掌握sort,可以把“距离平方+编号”保存后排序。调试记录中建议比较两种思路的代码复杂度、运行时间和内存使用。
本题仍是小规模教学模型。真实向量检索库会进一步考虑索引构建、近似搜索、召回率、延迟、内存占用和并发查询等工程问题。
扩展阅读:Faiss(高效相似向量搜索与聚类库)、Milvus(开源向量数据库)、hnswlib(基于HNSW的近似最近邻检索库)。
[English Statement]
Real vector retrieval systems usually do not return only the single most similar result. They often return the top-k most similar results, so that later models, business rules, or human reviewers can make further decisions. This process is commonly called top-k nearest neighbor retrieval.
In face recognition, top-k results can be used as a candidate identity list before applying thresholds, liveness checks, or manual review. In recommender systems, top-k retrieval is often used to recall candidate products, articles, short videos, or songs from a large collection before a ranking model reorders them. Top-k vector retrieval is also widely used in text-to-image search, image-to-image search, medical case retrieval, and RAG knowledge-base retrieval for large language models.
Given a query vector and several library vectors, output the top-k nearest library vectors. If distances are equal, the smaller index has higher priority. Euclidean distance is used.
This is an extension of Problem 4. You may use vector to store vectors and distances. If you are not familiar with sort, you can repeatedly find the current unused minimum distance. If you have learned sort, you can store each pair of squared distance and index, then sort them. In the debug record, compare the code complexity, runtime, and memory use of the two approaches.
This problem is still a small-scale teaching model. Real vector retrieval libraries also need to consider index construction, approximate search, recall, latency, memory usage, and concurrent queries.
Further reading: Faiss, Milvus, and hnswlib.
Input
【输入】
第一行输入d、n和k。第二行输入查询向量。接下来n行,每行输入一个库向量。保证1<=k<=n。
[Input]
The first line contains d, n, and k. The second line contains the query vector. The next n lines each contain one library vector. It is guaranteed that 1<=k<=n.
第一行输入d、n和k。第二行输入查询向量。接下来n行,每行输入一个库向量。保证1<=k<=n。
[Input]
The first line contains d, n, and k. The second line contains the query vector. The next n lines each contain one library vector. It is guaranteed that 1<=k<=n.
Output
【输出】
输出k行,每行包含一个库向量编号和对应距离,距离保留2位小数。
[Output]
Print k lines. Each line contains one library vector index and its distance rounded to 2 decimal places.
输出k行,每行包含一个库向量编号和对应距离,距离保留2位小数。
[Output]
Print k lines. Each line contains one library vector index and its distance rounded to 2 decimal places.
Sample Input Copy
2 4 2
0 0
1 0
0 2
-1 0
3 4
Sample Output Copy
0 1.00
2 1.00
HINT
可比较重复扫描法和排序法在代码复杂度、运行时间和内存使用上的差异。
Compare repeated scanning and sorting in code complexity, runtime, and memory use.
Compare repeated scanning and sorting in code complexity, runtime, and memory use.