Excerpts from 《Machine Learning》

  • To understand these terms, you first need to understand the concept of likelihood. Assume you have a probability distribution - or rather a family of such distributions - p(x;w), which assigns a probability to each data point x, given a specific setting of its parameters w. That is, different values of the parameters, w, will change the probability assigned to each data point, x. Now, since different parameters correspond to different distributions, we can tune the parameters in such a way that the data that we observe, D, is assigned a high probability and possible data that we don't observe is assigned a low probability. To this end we define the likelihood function L(D;w) = ∏_{x∈D} p(x;w). That is, the likelihood is just the joint probability of the observed data as a function of ...
    灵魂机器 1 like 2013-04-27 17:11:02
    —— Quoted from page 71
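A minimal sketch of the likelihood function defined in the excerpt above, L(D;w) = ∏_{x∈D} p(x;w), using an assumed Bernoulli coin model where the parameter w is the heads probability; the data and parameter values are made up for illustration and are not from the book.

```python
import numpy as np

# Toy observed data D: eight coin flips (1 = heads, 0 = tails).
D = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def bernoulli_likelihood(data, w):
    """Joint probability of the data as a function of the parameter w."""
    return np.prod(np.where(data == 1, w, 1.0 - w))

for w in (0.3, 0.5, 0.75, 0.9):
    print(f"w = {w:.2f}  L(D; w) = {bernoulli_likelihood(D, w):.5f}")
# The parameter that assigns the observed data the highest probability
# (here w near the empirical frequency 6/8) is the maximum-likelihood estimate.
```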
  • In particular, we define machine learning as a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty
    ACMing 1 like 2013-08-31 21:16:18
    —— Quoted from page 1
  • a property known as the long tail, which means that a few things (e.g., words) are very common, but most things are quite rare. Machine learning is usually divided into two main types. In the predictive or supervised learning approach, most methods assume that yi is a categorical or nominal variable from some finite set. When yi is categorical, the problem is known as classification or pattern recognition; when yi is real-valued, the problem is known as regression. The second main type of machine learning is the descriptive or unsupervised learning approach. This is sometimes called knowledge discovery. This is a much less well-defined problem, since we are not told what kinds of patterns to look for, and there is no obvious error metric to use. There is a third type of machine learning...
    ACMing 2013-08-31 21:21:42
    —— Quoted from page 2
  • In our notation, we make explicit that the probability is conditional on the test input x, as well as the training set D, by putting these terms on the right hand side of the conditioning bar |. When choosing between different models, we will make this assumption explicit by writing p(y|x,D,M), where M denotes the model.
    ACMing 2013-08-31 21:57:43
    —— Quoted from page 3
  • Regression is just like classification except the response variable is continuous.
    ACMing 2013-08-31 22:24:55
    —— Quoted from page 8
  • Instead, we will formalize our task as one of density estimation, that is, we want to build models of the form p(xi|θ). There are two differences from the supervised case: First, we have written p(xi|θ) instead of p(yi|xi, θ); that is, supervised learning is conditional density estimation, whereas unsupervised learning is unconditional density estimation. Second, xi is a vector of features, so we need to create multivariate probability models. By contrast, in supervised learning, yi is usually just a single variable that we are trying to predict.
    ACMing 2013-08-31 22:28:38
    —— Quoted from page 9
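A small sketch of the unconditional density estimation p(x|θ) described above: fitting a multivariate Gaussian to feature vectors by maximum likelihood. The synthetic data and the choice of a Gaussian model are illustrative assumptions, not the book's example.

```python
import numpy as np

# Synthetic 2-D feature vectors x_i.
rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, -2.0], scale=[0.5, 2.0], size=(500, 2))

mu_hat = X.mean(axis=0)                            # MLE of the mean
Sigma_hat = np.cov(X, rowvar=False, bias=True)     # MLE of the covariance (divide by N)

print("estimated mean:", mu_hat)
print("estimated covariance:\n", Sigma_hat)
# In supervised learning we would instead model p(y | x, theta), i.e. condition on x.
```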
  • Picking a model of the “right” complexity is called model selection, and will be discussed in detail below. zi is an example of a hidden or latent variable, since it is never observed in the training set.
    ACMing 2013-08-31 22:40:19
    —— Quoted from page 10
  • There are many ways to define such models, but the most important distinction is this: does the model have a fixed number of parameters, or does the number of parameters grow with the amount of training data? The former is called a parametric model, and the latter is called a nonparametric model. Parametric models have the advantage of often being faster to use, but the disadvantage of making stronger assumptions about the nature of the data distributions. Nonparametric models are more flexible, but often computationally intractable for large datasets.
    ACMing 2013-08-31 23:35:31
    —— Quoted from page 16
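A contrast sketch of the parametric/nonparametric distinction in the excerpt above: a 1-D Gaussian keeps two numbers regardless of the dataset size, while a kernel density estimate must store every training point. The kernel bandwidth and data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.normal(size=1000)

# Parametric: a 1-D Gaussian has exactly two parameters, however large N is.
mu, sigma = x_train.mean(), x_train.std()

# Nonparametric: a kernel density estimate's "parameters" are the stored data.
def kde(x, data, h=0.2):
    """Average of Gaussian kernels centred on the stored training points."""
    return np.mean(np.exp(-0.5 * ((x - data) / h) ** 2) / (h * np.sqrt(2 * np.pi)))

print("parametric storage:   2 numbers")
print("nonparametric storage:", x_train.size, "numbers")
print("KDE density at 0:", kde(0.0, x_train))
```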
  • The main way to combat the curse of dimensionality is to make some assumptions about the nature of the data distribution (either p(y|x) for a supervised problem or p(x) for an unsupervised problem).
    ACMing 2013-09-01 00:01:02
    —— Quoted from page 19
  • Linear regression can be made to model non-linear relationships by replacing x with some non-linear function of the inputs, φ(x). This is known as basis function expansion. In fact, many popular machine learning methods — such as support vector machines, neural networks, classification and regression trees, etc. — can be seen as just different ways of estimating basis functions from data, as we discuss in Chapters 14 and 16.
    ACMing 2013-09-01 00:05:49
    —— Quoted from page 20
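A minimal sketch of basis function expansion as described above, assuming a polynomial φ(x) and ordinary least squares; the target function, noise level, and degree are illustrative choices, not the book's.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.size)   # nonlinear target

degree = 5
Phi = np.vander(x, degree + 1, increasing=True)          # phi(x) = [1, x, x^2, ...]
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # linear in w, nonlinear in x

y_hat = Phi @ w
print("training RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))
```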
  • We can generalize linear regression to the (binary) classification setting by making two changes. First we replace the Gaussian distribution for y with a Bernoulli distribution, which is more appropriate for the case when the response is binary, y ∈ {0, 1}: p(y|x, w) = Ber(y|μ(x)), where μ(x) = E[y|x] = p(y = 1|x). Second, we compute a linear combination of the inputs, as before, but then we pass this through a function that ensures 0 ≤ μ(x) ≤ 1 by defining μ(x) = sigm(wᵀx). Putting these two steps together we get p(y|x, w) = Ber(y|sigm(wᵀx)). This is called logistic regression due to its similarity to linear regression
    ACMing 2013-09-01 11:28:37
    —— Quoted from page 21
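A minimal logistic regression sketch following the two steps in the excerpt: p(y=1|x, w) = sigm(wᵀx), fitted here by plain gradient descent on the average negative log-likelihood. The synthetic data, learning rate, and iteration count are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])   # bias term + one feature
true_w = np.array([-1.0, 2.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(2)
for _ in range(2000):
    mu = sigm(X @ w)                 # mu(x) = p(y = 1 | x, w)
    grad = X.T @ (mu - y) / len(y)   # gradient of the average negative log-likelihood
    w -= 0.5 * grad

print("estimated w:", w)
```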
  • A simple but popular solution to this is to use cross validation (CV). The idea is simple: we split the training data into K folds; then, for each fold k ∈ {1,...,K }, we train on all the folds but the k’th, and test on the k’th, in a round-robin fashion,
    ACMing 2013-09-01 11:43:47
    —— Quoted from page 24
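A sketch of the K-fold cross-validation loop described above. To keep the example self-contained, the "model" is just a mean predictor; the data, K, and the error metric are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, size=100)

K = 5
folds = np.array_split(rng.permutation(len(y)), K)

errors = []
for k in range(K):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    prediction = y[train_idx].mean()                          # "train" on all folds but the k'th
    errors.append(np.mean((y[test_idx] - prediction) ** 2))   # test on the k'th
print("CV estimate of test MSE:", np.mean(errors))
```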
  • There are actually at least two different interpretations of probability. One is called the frequentist interpretation. In this view, probabilities represent long run frequencies of events. The other interpretation is called the Bayesian interpretation of probability. In this view, probability is used to quantify our uncertainty about something; hence it is fundamentally related to information rather than repeated trials (Jaynes 2003). In the Bayesian view, the above statement means we believe the coin is equally likely to land heads or tails on the next toss. One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty about events that do not have long term frequencies.
    ACMing 2013-09-01 13:26:11
    —— Quoted from page 27
  • This is called a generative classifier, since it specifies how to generate the data using the class-conditional density p(x|y = c) and the class prior p(y = c). An alternative approach is to directly fit the class posterior, p(y = c|x); this is known as a discriminative classifier.
    ACMing 2013-09-02 00:26:41
    —— Quoted from page 30
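A sketch of the generative route described above: assume 1-D Gaussian class-conditional densities p(x|y=c) and a prior p(y=c), then apply Bayes' rule to get the posterior p(y=c|x). All numbers are made up for illustration.

```python
import numpy as np

priors = np.array([0.7, 0.3])    # p(y = c)
means = np.array([0.0, 2.0])     # class-conditional Gaussian means
stds = np.array([1.0, 1.0])

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def posterior(x):
    joint = priors * gauss_pdf(x, means, stds)   # p(y = c) * p(x | y = c)
    return joint / joint.sum()                   # normalize over classes

print("p(y | x = 1.5) =", posterior(1.5))
# A discriminative classifier (e.g. logistic regression) would instead fit
# p(y = c | x) directly, without modelling how x is generated.
```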
  • We therefore say X and Y are conditionally independent (CI) given Z iff the conditional joint can be written as a product of conditional marginals: X ⊥ Y | Z ⇐⇒ p(X, Y|Z) = p(X|Z)p(Y|Z). Theorem 2.2.1. X ⊥ Y | Z iff there exist functions g and h such that p(x, y|z) = g(x, z)h(y, z) for all x, y, z such that p(z) > 0.
    ACMing 2013-09-02 00:29:52
    —— Quoted from page 31
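A numeric illustration of the CI definition above: build a small discrete joint in which X and Y are conditionally independent given Z by construction, then verify p(x, y|z) = p(x|z)p(y|z) for each z. The probability tables are arbitrary toy values.

```python
import numpy as np

p_z = np.array([0.4, 0.6])                 # p(z)
p_x_given_z = np.array([[0.2, 0.8],        # rows: z, cols: x
                        [0.7, 0.3]])
p_y_given_z = np.array([[0.5, 0.5],        # rows: z, cols: y
                        [0.1, 0.9]])

# Joint p(z, x, y) built so that the CI property holds by construction.
p_zxy = p_z[:, None, None] * p_x_given_z[:, :, None] * p_y_given_z[:, None, :]

for z in range(2):
    p_xy_given_z = p_zxy[z] / p_z[z]                       # conditional joint
    outer = np.outer(p_x_given_z[z], p_y_given_z[z])       # product of conditional marginals
    assert np.allclose(p_xy_given_z, outer)

print("X ⊥ Y | Z holds for this joint.")
```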
  • We will often talk about the precision of a Gaussian, by which we mean the inverse variance: λ = 1/σ². The Gaussian distribution is the most widely used distribution in statistics. There are several reasons for this. First, it has two parameters which are easy to interpret, and which capture some of the most basic properties of a distribution, namely its mean and variance. Second, the central limit theorem (Section 2.6.3) tells us that sums of independent random variables have an approximately Gaussian distribution, making it a good choice for modeling residual errors or “noise”. Third, the Gaussian distribution makes the least number of assumptions (has maximum entropy), subject to the constraint of having a specified mean and variance. Finally, it has a simple mathematical form, ...
    ACMing 2013-09-03 19:49:51
    —— Quoted from page 38
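A quick simulation of the central limit theorem argument in the excerpt: sums of independent (here uniform) random variables look approximately Gaussian. The number of summands and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sums = rng.uniform(size=(100_000, 30)).sum(axis=1)   # each row: a sum of 30 uniforms

# Compare empirical moments with the Gaussian predicted by the CLT:
# mean = 30 * 1/2, variance = 30 * 1/12, skewness close to 0.
print("empirical mean:", sums.mean(), " predicted:", 30 * 0.5)
print("empirical var: ", sums.var(),  " predicted:", 30 / 12)
print("empirical skew:", ((sums - sums.mean()) ** 3).mean() / sums.std() ** 3)
```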
  • the Dirac measure, defined by
    ACMing 2013-09-03 20:01:35
    —— Quoted from page 37
  • One problem with the Gaussian distribution is that it is sensitive to outliers, since the log-probability only decays quadratically with distance from the center. A more robust distribution is the Student t distribution. Its pdf is as follows:
    ACMing 2013-09-03 20:05:22
    —— Quoted from page 39
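A small sketch of the robustness point above: the Gaussian log-density falls off quadratically with distance from the centre, while the Student t decays far more slowly, so outliers are penalized much less. The degrees of freedom and the evaluation points are illustrative; scipy's standard distributions are used for the densities.

```python
import numpy as np
from scipy import stats

xs = np.array([1.0, 3.0, 5.0, 10.0])
for x in xs:
    log_gauss = stats.norm.logpdf(x, loc=0.0, scale=1.0)
    log_t = stats.t.logpdf(x, df=2, loc=0.0, scale=1.0)
    print(f"x = {x:4.1f}  log N(x) = {log_gauss:8.2f}  log T(x) = {log_t:8.2f}")
# A single point far from the centre barely changes a model fitted with
# heavy-tailed t noise, whereas it dominates a Gaussian fit.
```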
  • Another distribution with heavy tails is the Laplace distribution, also known as the double sided exponential distribution. Lap(x|μ, b) = exp(−|x − μ|/b) / (2b)
    ACMing 2013-09-03 20:26:38
    —— Quoted from page 41
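A quick check of the Laplace density quoted above against samples drawn with numpy's built-in Laplace generator; the location, scale, and interval are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, b = 0.0, 1.5
samples = rng.laplace(loc=mu, scale=b, size=200_000)

def lap_cdf(x, mu, b):
    """CDF of Lap(x | mu, b) = exp(-|x - mu|/b) / (2b)."""
    z = (x - mu) / b
    return np.where(z < 0, 0.5 * np.exp(z), 1 - 0.5 * np.exp(-z))

# Fraction of samples in [1, 2] vs. the probability mass predicted by the density.
print("empirical:", np.mean((samples >= 1) & (samples <= 2)))
print("predicted:", lap_cdf(2, mu, b) - lap_cdf(1, mu, b))
```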
  • The chi-squared distribution is the distribution of the sum of squared (standard) Gaussian random variables.
    ACMing 2013-09-03 20:38:40
    —— Quoted from page 42
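A simulation of the chi-squared definition above: summing k squared standard Gaussian variables and checking the resulting mean k and variance 2k. The choice k = 4 and the sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
samples = (rng.standard_normal(size=(100_000, k)) ** 2).sum(axis=1)

print("empirical mean:", samples.mean(), " expected:", k)
print("empirical var: ", samples.var(),  " expected:", 2 * k)
```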