Page 71, 3.2.4 Posterior predictive distribution

灵魂机器 (A tough life needs no explanation)

Section: 3.2.4 Posterior predictive distribution
Page: 71 2013-04-27 17:11:02

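This section matters because it is where the term posterior predictive distribution first appears in the book. The term comes up again repeatedly in later sections, for example 3.3.4 and 3.4.4.

First, recall the definitions of MAP and MLE from page 69: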

($\widehat {h}^{MAP}=\mathop{\arg\max}_h p(h|D)=\mathop{\arg\max}_h \dfrac{p(D|h)p(h)}{p(D)}=\mathop{\arg\max}_h p(D|h)p(h) $)
($=\mathop{\arg\max}_h [\log p(D|h)+\log p(h)]$)               (3.6)

($\widehat {h}^{MLE}=\mathop{\arg\max}_h p(D|h)=\mathop{\arg\max}_h [\log p(D|h)]$)      (3.7)

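MAP and MLE each pick a single h out of the set of candidate hypotheses: MAP picks the h that maximizes the posterior, and MLE picks the h that maximizes the likelihood. Here h plays the role of the parameter ($\theta$).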
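To make (3.6) and (3.7) concrete, here is a minimal sketch on a toy hypothesis space. Everything in it is an assumption made for illustration: the four hypotheses, the hand-picked prior `PRIOR`, the function names, and the likelihood model (each data point drawn uniformly from the members of h, so that p(D|h) = (1/|h|)^N when D is consistent with h).

```python
import math

# Hypothetical toy hypothesis space: each h is a set of integers in 1..100.
HYPOTHESES = {
    "even": {n for n in range(1, 101) if n % 2 == 0},
    "powers_of_two": {2 ** k for k in range(1, 7)},  # 2, 4, ..., 64
    "multiples_of_four": {n for n in range(1, 101) if n % 4 == 0},
    "all": set(range(1, 101)),
}
# Assumed prior p(h), chosen by hand so that it disagrees with the likelihood.
PRIOR = {"even": 0.48, "powers_of_two": 0.02, "multiples_of_four": 0.02, "all": 0.48}


def log_likelihood(data, h):
    """log p(D|h), assuming each point is drawn uniformly from the members of h."""
    members = HYPOTHESES[h]
    if any(x not in members for x in data):
        return -math.inf  # h cannot have generated this data
    return -len(data) * math.log(len(members))


def mle(data):
    """Equation (3.7): argmax over h of log p(D|h)."""
    return max(HYPOTHESES, key=lambda h: log_likelihood(data, h))


def map_estimate(data):
    """Equation (3.6): argmax over h of log p(D|h) + log p(h)."""
    return max(HYPOTHESES, key=lambda h: log_likelihood(data, h) + math.log(PRIOR[h]))


print(mle([16]), map_estimate([16]), map_estimate([16, 8, 2, 64]))
```

With the single observation 16, the assumed prior outweighs the likelihood, so MAP and MLE disagree. With four observations the likelihood term dominates and MAP moves to the same hypothesis as MLE.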

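How does this relate to the posterior predictive distribution?

MAP and MLE take the observed data D, estimate a single h, and then plug that estimate into the predictive distribution. That is, they approximate ($p(y|\tilde{x},D)$) by ($p(y|\tilde{x},\widehat{h})$), where ($\tilde{x}$) is an unseen sample whose label y we want to predict.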

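The posterior predictive distribution is also known as Bayes model averaging (BMA; Hoeting et al. 1999).
The posterior predictive distribution is not a point estimate. (MLE and MAP are both point estimates: they throw away every other h too early and keep only one. Premature optimization is the root of all evil!) Instead, it sums the predictions of all candidate hypotheses h, weighting each prediction by that hypothesis's posterior probability p(h|D). This is equation 3.8 on page 71: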
($p\left(\tilde{x}\in{C}|D\right)=\sum\limits_{h} p\left(y=1|\tilde{x},h\right)p\left(h|D\right)$)      (3.8)

In the book, ($p\left(y=1|\tilde{x},h\right)$) is called the prediction of each individual hypothesis, and ($p\left(h|D\right)$) is called the weight associated with each hypothesis.
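Equation (3.8) can be sketched numerically and compared with the plug-in prediction that uses only the MAP hypothesis. This is an illustrative sketch under stated assumptions, not the book's code: the two hypotheses, the uniform prior, the likelihood (1/|h|)^N, and all function names are made up for the example, and ($p(y=1|\tilde{x},h)$) is taken to be 1 if ($\tilde{x}$) belongs to h and 0 otherwise.

```python
# Hypothetical toy hypothesis space over the integers 1..100.
HYPOTHESES = {
    "even": {n for n in range(1, 101) if n % 2 == 0},
    "powers_of_two": {2 ** k for k in range(1, 7)},  # 2, 4, ..., 64
}
PRIOR = {"even": 0.5, "powers_of_two": 0.5}  # assumed uniform prior p(h)


def posterior(data):
    """p(h|D) via Bayes rule, assuming likelihood (1/|h|)^N for consistent data."""
    unnorm = {}
    for name, members in HYPOTHESES.items():
        consistent = all(x in members for x in data)
        lik = (1.0 / len(members)) ** len(data) if consistent else 0.0
        unnorm[name] = lik * PRIOR[name]
    z = sum(unnorm.values())  # the evidence p(D)
    if z == 0.0:
        raise ValueError("data is inconsistent with every hypothesis")
    return {name: w / z for name, w in unnorm.items()}


def predictive_bma(x_new, data):
    """Equation (3.8): sum over h of p(y=1|x_new,h) * p(h|D)."""
    post = posterior(data)
    return sum(post[h] * (1.0 if x_new in HYPOTHESES[h] else 0.0) for h in HYPOTHESES)


def predictive_plugin(x_new, data):
    """Plug-in alternative: keep only the MAP hypothesis and ask whether it contains x_new."""
    post = posterior(data)
    h_map = max(post, key=post.get)
    return 1.0 if x_new in HYPOTHESES[h_map] else 0.0


print(predictive_bma(10, [16]), predictive_plugin(10, [16]))
```

After seeing only the number 16, the plug-in prediction rules out 10 entirely, because the MAP hypothesis is "powers_of_two". The averaged prediction keeps the small posterior weight of "even" and therefore still assigns 10 a nonzero probability.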

Equation (3.8) can be rewritten in the following form, which is easier to understand:
($p(\tilde{\vec{x}}|\mathcal{D})=\sum\limits_{h}p(\tilde{\vec{x}}|h)p(h|\mathcal{D})$)
This is how Wikipedia <http://en.wikipedia.org/wiki/Posterior_predictive_distribution> writes it.

One question: judging from this formula, it looks as if ($p(\tilde{\vec{x}}|\mathcal{D})$) could exceed 1. What then?
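
On that question, a short bound suggests the sum cannot exceed 1 when ($p(\tilde{\vec{x}}|h)$) is a probability, that is, when ($\tilde{\vec{x}}$) is discrete. The weights ($p(h|\mathcal{D})$) are non-negative and sum to 1, so the result is a weighted average of numbers in [0,1]:
($p(\tilde{\vec{x}}|\mathcal{D})=\sum\limits_{h}p(\tilde{\vec{x}}|h)p(h|\mathcal{D})\le\max_{h}p(\tilde{\vec{x}}|h)\sum\limits_{h}p(h|\mathcal{D})=\max_{h}p(\tilde{\vec{x}}|h)\le 1$)
If ($\tilde{\vec{x}}$) is continuous, ($p(\tilde{\vec{x}}|\mathcal{D})$) is a density, which may exceed 1 at a point while still integrating to 1.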

A definition of BMA can be found in this post, http://metaoptimize.com/qa/questions/7885/what-is-the-relationship-between-mle-map-em-point-estimation , which also gives an excellent explanation of MLE and MAP:
To understand these terms, you first need to understand the concept of likelihood. Assume you have a probability distribution - or rather family of such distributions - p(x;w) which assigns a probability to each data point x, given a specific setting of its parameters w. That is, different values of the parameters, w, will change the probability assigned to each data point, x.

Now, since different parameters correspond to different distributions, we can tune the parameters in such a way that the data that we observe, D, is assigned a high probability and possible data that we don't observe is assigned a low probability. To this end we define the likelihood function L(D;w) = product_{x in D} p(x;w). That is the likelihood is just the joint probability of the observed data as a function of the parameters.

Maximum likelihood estimation (MLE) simply means that we seek the parameters w that maximize the likelihood function. That is to say, we seek the parameters that assign as much mass as possible to the observed data.

One pitfall with the maximum likelihood estimate should be obvious from the above definition. If we assign all the probability only to the things we actually observed, we won't have any probability mass left for all the things that we didn't observe. One way to conquer this is to add a prior distribution over the parameters w, such as a Gaussian in the case of logistic regression or a Dirichlet prior in the case of the multinomial distribution, which allows us to control how probability mass should be assigned to observed and unobserved data. So how can we make use of such a prior distribution? By means of Bayes rule, we can derive the a posteriori distribution over the parameters, conditioned on the data: p(w;D) = p(D;w)*p(w) / Z, where Z is a normalization factor which assures that p(w;D) sums to 1.

Maximum a posteriori (MAP) estimation means that we seek the parameters w that maximize the posterior distribution p(w;D). Since Z is a constant factor, it does not affect the choice of w that maximize the posterior and we can therefore simply find the w that maximize p(D;w)*p(w).

Both MLE and MAP are point estimation methods; they only return a single value of the parameters w. This means that any information on the uncertainty of the parameters is lost, which is unfortunate since knowledge about this uncertainty can be used to compute things like confidence in our predictions. In order to keep this uncertainty we may adopt a fully Bayesian approach, in which, instead of finding the MLE or MAP, we find the full posterior distribution p(w;D). In this case we need to compute the normalization factor Z, which might be difficult or even intractable depending on the structure of p(x;w). This is one of the reasons why point estimates are popular.

Finally, Expectation Maximization (EM) is a technique to cope with parts of the model that we cannot observe, but we assume should be there. These unobserved parts can be either data that is simply missing, or things that are never observed but whose relation to the observed parts we can model (or in other words, parts that "explain" the observed data). In topic models, for example, the hidden parts are the topics that are assumed to explain the observed words. Assume that for each observed point x (e.g. a vector of words), we have a hidden variable z (e.g. a topic). We then have a distribution p(x,z;w) which assigns a probability to x and z jointly. EM is a method for maximizing the joint likelihood of both the observed and hidden parts. This is done by iteratively using the current value of the parameters to estimate a distribution over the hidden variables given the observed part (expectation) and then finding the parameters that maximize the joint likelihood of the observed variables and the setting of the hidden variables from the previous step (maximization).

If you want to get a more solid understanding of these topics, I suggest you read Christopher Bishop's book Pattern Recognition and Machine Learning.
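
The pitfall described in the quoted answer (MLE leaving no probability mass for unobserved outcomes) and the difference between MLE, MAP, and a fully Bayesian prediction can be sketched for a coin. The setup is an assumption chosen for illustration, not something the quoted post specifies: a Bernoulli likelihood with n1 heads and n0 tails, a Beta(a, b) prior on the heads probability, and hypothetical function names. The closed forms used are the standard ones for the Beta-Bernoulli model.

```python
def mle(n1, n0):
    """Maximum likelihood estimate of the heads probability: n1 / (n1 + n0)."""
    return n1 / (n1 + n0)


def map_estimate(n1, n0, a, b):
    """Mode of the Beta(n1 + a, n0 + b) posterior.

    Only valid when n1 + a > 1 and n0 + b > 1, so that the mode is interior.
    """
    return (n1 + a - 1) / (n1 + n0 + a + b - 2)


def posterior_predictive(n1, n0, a, b):
    """p(next flip is heads | D): the mean of the Beta(n1 + a, n0 + b) posterior.

    This averages over every value of the heads probability instead of
    committing to a single point estimate.
    """
    return (n1 + a) / (n1 + n0 + a + b)


# Three heads, no tails, with an assumed Beta(2, 2) prior.
print(mle(3, 0), map_estimate(3, 0, 2, 2), posterior_predictive(3, 0, 2, 2))
```

With three heads and no tails, the MLE puts all mass on heads and leaves none for tails. The MAP estimate and the posterior predictive both reserve some probability for tails, and the posterior predictive is the more conservative of the two.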

Thanks to @张巍 (http://weibo.com/zh3f) for pointing out how the posterior predictive distribution differs from MAP and MLE.
