# Notes

Notes of Introduction to Statistical Learning

=====================================

## Statistical Learning

- basic concepts

- two main reasons to estimate f: prediction and inference

- trade-off: complex models may be good for accurate prediction, but it may also be hard to interpret

- reducible vs. irreducible error

- how to estimate f: parametric vs. non-parametric approach

- pros of non-parametric

- flexibility, could possibly fit a wider range of f

- cons

- does not reduce the problem to a set of parameters, may be complicated and need a large dataset to get a nice result

- accessing model accuracy

- training/test error and complexity/flexibility of the model

- note: error could be **MSE** for regression and **error rate** for classification

- bias-variance trade-off

- complex models typically have a higher variance and lower bias, see notes for deep learning for derivation

## Linear Regression

- simple linear regression

- estimate coefficients

- note that the estimation is **unbiased**, which means that average of a large number of estimations for $\hat{\mu}$ will be very close to true $\mu$.

- confidence interval, hypothesis testing for linear relationship using **t-statistic**, p-value

- multiple linear regression

- important questions

- whether there exists a relationship between response and predictors

- use F-statistic for all predictors, instead of use p-value for every single predictor, the reason is when the number of predictors is large, it is very likely to see at least one small p-values (which rejects hypothesis of corresponding coefficient is 0, indicating there is some linear relationship), instead F-statistic takes all predictors into consideration

- deciding on important variables (variable selection)

- methods

- try all possible models with different combinations of variables

- forward/backward/mixed selection

- model fit

- prediction

- extensions of linear model

- linear model is based on two assumptions: additive and linear

- remove additive assumption: introduce interaction term

- nonlinear relationships: polynomial regression, kernel trick, etc.

- potential problems

- nonlinearity issue

- can be identified by **residual plot**

- correlation of error terms

- typically happens for time series data, need to check autocorrelation

- non-constant variance of error terms

- outliers

- high leverage points

- collinearity

## Classification

- why not linear regression?

- hard to convert multiple (more than 2) class labels to quantitative numerical values

- even for binary responses, it does not make sense to use linear regression since the output may be outside [0,1] interval

- logistic regression

- determine coefficients

- use maximum likelihood method

- confounding

- the results obtained from using one predictor may be quite different from those obtained using multiple predictors, **especially when there is correlation among the predictors** (see the student-balance example in the text)

- linear discriminant analysis

- basic idea: classify data points using Bayesian approach, that is to find $k$ such that $$p_k(x)=\frac{\pi_kf_k(x)}{\sum_l \pi_lf_l(x)}$$ is maximum, where $\pi_k$ is prior, $f_k(x)$ is likelihood function. We do further assumption that $f_k$ are Gaussian that share the same $\Sigma$ but have different $\mu_k$. $\mu_k$, $\Sigma$ and $\pi_k$ can be estimated from training set, and then the decision boundary is determined based on Bayesian method by

$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log \pi_k$$

since $\Sigma_k$ are equal, we have

$$\delta_k(x)=x^T \Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log \pi_k + const$$

This gives **linear decision boundary**. If $\Sigma_k$ are different, **boundary may be quadratic**

- quadratic discriminant analysis

- it assumes that classes have different covariance matrices.

- LDA vs. QDA

- this is a typical bias-variance trade-off issue. Since QDA contains more parameters, it is better when there are a large number of data points or it is clear that different classes have different covariance matrices. In contrast, LDA works well when number of data is small.

- comparison of classification methods

- when true boundary is linear, it is better to use LDA or logistic regression, if is nonlinear, it may be better to use QDA or KNN, but the level of smoothness of KNN should be carefully chosen.

## Resampling methods

- cross validation

- leave-one-out cross-validation

- k-fold cross-validation

- bootstrap

## Linear model selection and regression

- subset selection

- best subset selection

- basic idea

- fix number of predictors k, find the best model $M_k$ for each k

- may use $R^2$, MSE, etc

- for all $M_k$, select the best one

- should use AIC, BIC, adjusted $R^2$, etc, since as number of predictors increases, MSE always decreases and $R^2$ always increases, AIC/BIC/adjusted $R^2$ includes penalization for the number of predictors

- definition of AIC/BIC/adjusted $R^2$

- AIC = $\frac{1}{n\hat{\sigma}^2}(RSS+2d\hat{\sigma}^2)$

- BIC = $\frac{1}{n}(RSS+\log(n)d\hat{\sigma}^2)$

- adjusted $R^2 = 1- \frac{RSS/(n-d-1)}{TSS/(n-1)}$

- note that in addition to using these metrics above related to training set directly, we may also use **cross validation** (estimate test error directly) to select the optimal model, which requires fewer assumptions about the model and is more general

- problem

- need to list all possibilities, it is super computationally expensive

- stepwise selection

- forward stepwise selection: also generate best model $M_k$ for each k, using greedy approach to add predictors one at a time, then select the best among $M_k$

- backward stepwise selection

- hybrid approaches

- shrinkage methods

- ridge regression

- basic idea: apply a L-2 norm regularization term

- since regularization term is affected by scaling of the training data, it is best to apply ridge regression after standardization of predictors

- pros and cons

- pros: very convenient for model selection, does not require to fit many different models

- cons: the model still depends on all predictors, instead of depending on a subset of predictors. Lasso method can solve this issue.

- lasso

- basic idea: apply a L-1 norm regularization term

- this method forces some of coefficients to **be exactly zero**, this performs variable selection (a diagram explaining this is shown in Fig. 6.7 in the text)

- another formulation for ridge and lasso, which reveals a close connection between lasso, ridge and best subset selection, see text

- comparing ridge and lasso (see text for details)

- ridge works better for models related to many predictors with coefficients of roughly equal size, while lasso works better for models where only a relatively small number of variables are important

- effects of two methods

- roughly speaking, ridge shrinks every dimension of data **by the same proportion**, while lasso shrinks every dimension towards 0 **by a similar amount**.

- Bayesian interpretation of ridge and lasso

- Bayesian formulation says there is a prior distribution $p(\beta)$, and posterior distribution of $\beta$ given $X,Y$ is $p(\beta|X,Y)\propto f(Y|X,\beta)p(\beta)$, training process is to find the maximum $p(\beta|X,Y)$, when prior is Gaussian with zero mean, we get ridge, when prior is double-exponential, we get lasso.

- dimension reduction methods

- previous methods use **original predictors**, dimension reduction methods use **transformed predictors**.

- methods

- PCA

- note that standardization is typically needed

- partial least squares (PLS)

- motivation: in PCA, response Y is not used, only X is used in training in an **unsupervised** way. The principal directions extracted by PCA well explains predictors X, but may not be well correlated with Y.

- basic idea: to find first PLS direction, set the weight of each predictor $X_j$ to be the coefficient of simple linear regression of $Y$ onto $X_j$. To find subsequent PLS directions, take **orthogonalized data** and do processing iteratively.

- considerations in high dimensions

## Moving beyond linearity

- polynomial regression

- step functions

- basis functions

- regression splines

- piecewise polynomials

- constraints and splines

- the spline basis representation

- choosing the number and locations of the knots

- see "notes of numerical analysis course" for more information on this part

- comparison to polynomial regression

- regression splines are typically better and more stable than polynomial regression, due to its lower degree

- smoothing splines

- definition: $$\sum_{i=1}^n (y_i-g(x_i))^2+\lambda \int g''(t)^2dt$$

it penalizes roughness of the function

- property: it can be proved that $g(x)$ that minimizes error function is a piecewise cubic polynomial with knots at the unique values of $\\{x_i\\}$

- local regression

- basic idea

- for each $x_0$, find a neighbor data set $K_0$, and train a model with weighted error (weight associated to each training data is determined by distance between this point to $x_0$), then use this local model to predict $y$ for $x_0$

- generalized additive models

- basic idea

- apply a nonlinear transformation to each feature and then do linear fitting on transformed features

## tree-based models

- basics of decision trees

- regression decision trees

- basic idea

- given predictors $X_1, ..., X_p$, find the optimal predictor to split $j$ and optimal cutpoint $s$, such that for two defined regions $R_1(j,s)=\\{X|X_j<s\\}$ and $R_2(j,s)=\\{X|X_j\ge s\\}$, the error

$$\sum_{k=1,2}\sum_{i:x_i\in R_k}(y_i-\hat{y}_{R_k})^2 $$

is minimized. Then pick one of these regions and repeat this process until stopping criterion is reached (e.g. each leaf node contains less than a specific number of data). This greedy approach is called **recursive binary splitting**

- pruning

- a complex tree may lead to overfitting, pruning the tree may solve this issue

- cost complexity pruning: try to minimize error function

$$\sum_{m=1}^{|T|} \sum_{i:x_i \in R_m} (y_i-\hat{y}_{R_m})^2+\alpha |T|$$

- classification decision trees

- basic idea

- most part is the same, except that error function is Gini index or cross entropy

- pros and cons

- pros

- interpretability

- easy to handle both quantitative and qualitative predictors

- cons

- may not have good predictive accuracy (can be improved by bagging, boosting, etc)

- bagging, random forests, boosting

- bagging

- motivation: reduce prediction variance

- methods: train several un-pruned trees (which has high variance but low bias) with different training sets, and make prediction based on average outputs/majority vote

- how are different training sets generated: bootstrap

- error estimation

- in addition to cross-validation, we may use **out-of-bag** observations to estimate error. The basic idea is that when using bootstrap approach, each bagged tree contains about $(1-1/e)$ of all data, so the remaining $1/e$ of data (called "out-of-bag" data) can be used to estimate test error.

- variable importance measures

- in bagging model, we do not have the good interpretability of an individual decision tree, variance importance measures work as a metric to determine the relative importance of each feature

- random forests

- basic idea: similar to bagging, except that instead of using all predictors for splitting, we select a subset of predictors and request that the splitter can only be selected from this subset. This method **reduces correlation of the trees** and makes the prediction more reliable, especially when one of the predictors is particularly important compared with others.

- boosting

- basic idea: instead of building many trees in parallel, boosting builds trees sequentially: each tree fits the residue that has not been explained by previous trees. By controlling shrinkage parameter $\lambda$, we can control how fast the new tree learns the residue.

## support vector machines

- see notes for course "advanced data science"

## unsupervised learning

- challenges of unsupervised learning

- PCA

- clustering

- K-means

- hierarchical clustering

- pros: does not require pre-specify number of clusters, one single dendrogram can be used to obtain any number of clusters.