7 Computing scores in a complete search system

小短手 (日新)

Read: Introduction to Information Retrieval

Chapter: 7 Computing scores in a complete search system
2011-12-05 13:01:42

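I no longer remember why I started taking notes on this book in English, so from here on I will write them in Chinese. For lack of time I will skip some chapters, and the remaining chapters will mostly be skimmed rather than read closely...

7.1 Efficient scoring and ranking

Chapter 6 covered how to score the documents in a collection against a query, using tf-idf weighting and the cosine score. In a real large-scale system this kind of exhaustive scoring is inefficient, so 7.1 introduces approximate scoring methods that return an inexact top K set of documents.

The first step is to simplify the query score. Only the relative order of scores matters for ranking, so the absolute score can be replaced by a relative one. The query vector is simplified accordingly: a term that is in the dictionary but not in the query gets 0, and a query term gets 1 (the idf weight on the query side is dropped), with each term treated independently.

Next comes the general scheme for inexact top K document retrieval: choose a candidate set A of documents, where K < |A| ≪ N, and then pick the top K documents from A. The following subsections describe several ways to choose A.

7.1.2 Index elimination

Two heuristics reduce the amount of postings that have to be traversed:
1) Set an idf threshold for the query terms; only documents containing a query term whose idf exceeds the threshold are admitted to A.
2) Only consider documents that contain many (in the strictest case, all) of the query terms.

7.1.3 Champion lists

For every term t in the dictionary we precompute a champion list: the r documents with the highest weights for t, where r is fixed in advance and the weight is usually tf. At query time the champion lists of the query terms are combined (their union is taken) to form A.

7.1.4 Static quality scores and ordering

Each document is given a query-independent static score g(d). The net score is the sum of g(d) and the query-dependent score. This method also borrows the idea of champion lists. A compromise is described as well: for each term we maintain two disjoint postings lists, split by a weight such as tf. The high list plays the role of a champion list and is used for top K retrieval; the low list is used as a supplement when the high list does not yield enough documents, or when the complete result set is wanted.

7.1.5 Impact ordering

We take the cosine score algorithm described in Figure 6.14 as the starting point.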

The basic cosine score algorithm
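Since the figure itself is not reproduced in these notes, here is a minimal Python sketch of term-at-a-time cosine scoring in the spirit of Figure 6.14. The names `cosine_score`, `postings`, `doc_length` and the toy data are assumptions made for illustration, and the choice of log-scaled tf with idf as the query-side weight is one common weighting, not necessarily the exact one in the figure.

```python
import heapq
import math
from collections import defaultdict


def cosine_score(query_terms, postings, doc_length, n_docs, k):
    """Return the k best (doc_id, score) pairs for the query.

    postings:   term -> list of (doc_id, tf) pairs (hypothetical layout)
    doc_length: doc_id -> precomputed vector length used for normalization
    """
    scores = defaultdict(float)
    for term in query_terms:
        plist = postings.get(term, [])
        if not plist:
            continue  # term does not occur in the collection
        idf = math.log10(n_docs / len(plist))  # query-side weight (assumption)
        for doc_id, tf in plist:
            scores[doc_id] += (1 + math.log10(tf)) * idf
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]  # length normalization
    # A heap avoids sorting every scored document just to get the top k.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Toy data, invented for this example.
postings = {"rates": [(1, 3), (2, 1)], "interest": [(1, 1)]}
doc_length = {1: 1.0, 2: 1.0}
top = cosine_score(["rates", "interest"], postings, doc_length, n_docs=4, k=2)
ranked_ids = [doc_id for doc_id, _ in top]
```

Every posting of every query term is visited here, which is exactly the cost that the inexact top K methods of 7.1 try to avoid.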

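Ordinary postings lists are sorted by docID; here they are sorted by tf in descending order instead. Likewise, when scoring we process the query terms in descending order of idf. The outer loop therefore looks at the most discriminating terms first, and the inner loop looks at the documents with the highest tf first. Everything considered early is likely to belong in A, so stopping early gives us another way to select A.

7.1.6 Cluster pruning

From the N documents, pick √N at random as leaders. Every document that is not a leader is attached as a follower to its nearest leader (nearest docID? No: the book measures closeness by cosine similarity), so each leader ends up with roughly √N followers. The query is then matched against the leaders; the best-matching leader together with its followers forms A, and top K retrieval with the cosine score is run over A.

7.2 Components of an information retrieval system

This section places the individual components and retrieval methods in the context of a complete retrieval system.

7.2.1 Tiered indexes

Tiered indexes generalize champion lists: the whole index is split into tiers by some criterion such as tf (for example Tier 1: tf > 20, Tier 2: 20 > tf > 10, Tier 3: 10 > tf).

7.2.2 Query-term proximity

The distance between query terms inside a document is also an important signal for search, and the vector space model ignores it.

7.2.3 Designing parsing and scoring functions

The query parser is used to analyze the query. Take the query rising interest rates. We first run it as a single phrase query. If too few results come back, we run it as two separate phrases of two words each, rising interest and interest rates. If that still returns too few results, the three terms are treated independently (a document may then contain only one of them). In each of these cases the score can take into account vector space scoring, static quality, proximity weighting and other factors. For the positional factor, when the postings lists are intersected we score the cases in which the terms occur within k terms of each other.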
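The back-off strategy of the query parser in 7.2.3 can be sketched roughly as follows. This is only an illustration under simplifying assumptions: documents are plain token lists held in memory, `backoff_search`, `phrase_match` and `min_results` are hypothetical names, and no actual scoring is done (documents found at an earlier, stricter stage are simply listed first).

```python
def phrase_match(doc_tokens, phrase):
    """True if the token list `phrase` occurs contiguously in `doc_tokens`."""
    n = len(phrase)
    return any(doc_tokens[i:i + n] == phrase
               for i in range(len(doc_tokens) - n + 1))


def backoff_search(query, docs, min_results):
    """Run progressively looser versions of the query until enough docs match.

    docs: doc_id -> list of tokens (hypothetical in-memory collection)
    """
    terms = query.split()
    stages = [[terms]]                      # stage 1: the whole query as a phrase
    if len(terms) > 2:                      # stage 2: adjacent two-word phrases
        stages.append([terms[i:i + 2] for i in range(len(terms) - 1)])
    stages.append([[t] for t in terms])     # stage 3: individual terms

    results = []
    for phrases in stages:
        for doc_id, tokens in docs.items():
            if doc_id not in results and any(phrase_match(tokens, p) for p in phrases):
                results.append(doc_id)
        if len(results) >= min_results:
            break                           # enough results, no need to loosen further
    return results


# Toy collection, invented for this example.
docs = {
    1: "rising interest rates hurt borrowers".split(),
    2: "interest rates fell".split(),
    3: "rates of return".split(),
    4: "weather report".split(),
}
strict = backoff_search("rising interest rates", docs, min_results=1)
looser = backoff_search("rising interest rates", docs, min_results=2)
loosest = backoff_search("rising interest rates", docs, min_results=3)
```

In a real system each stage would go through the positional index and the combined scoring function rather than a linear scan over token lists.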

A complete search system

7.3 was skipped...
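Looking back at 7.1.6, the leader and follower idea is easier to follow in code. The sketch below assumes documents are dense vectors stored as Python lists and that closeness is cosine similarity; `build_clusters`, `cluster_pruned_top_k` and the toy vectors are hypothetical names and data for illustration only.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_clusters(doc_vectors, rng):
    """Preprocessing: pick about sqrt(N) random leaders, attach followers."""
    doc_ids = list(doc_vectors)
    n_leaders = max(1, round(math.sqrt(len(doc_ids))))
    leaders = rng.sample(doc_ids, n_leaders)
    clusters = {leader: [leader] for leader in leaders}
    for doc_id in doc_ids:
        if doc_id in clusters:
            continue  # leaders already belong to their own cluster
        nearest = max(leaders,
                      key=lambda l: cosine(doc_vectors[doc_id], doc_vectors[l]))
        clusters[nearest].append(doc_id)
    return clusters


def cluster_pruned_top_k(query_vec, doc_vectors, clusters, k):
    """Query time: score only the best leader and its followers (the set A)."""
    best_leader = max(clusters, key=lambda l: cosine(query_vec, doc_vectors[l]))
    candidates = clusters[best_leader]
    ranked = sorted(candidates,
                    key=lambda d: cosine(query_vec, doc_vectors[d]),
                    reverse=True)
    return ranked[:k]


# Nine toy documents in a 2-dimensional term space, invented for this example.
doc_vectors = {i: [float(i), float(9 - i)] for i in range(1, 10)}
clusters = build_clusters(doc_vectors, random.Random(0))
top = cluster_pruned_top_k([1.0, 0.2], doc_vectors, clusters, k=2)
```

Because only one cluster is scored, the result can miss documents that a full cosine ranking would have returned; that is the price of the speed-up.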
