I recently read Antonio Gulli's A Collection of Data Science Interview Questions Solved in Python and Spark, which explains PySpark very clearly.
Logistic regression (the odds ratio p/(1−p) is the ratio of the probability of winning to that of losing; the odds range over (0, ∞), so we take the log to stretch the value over the whole real line).
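A tiny sketch of that logit/sigmoid round trip (the function names are mine, not from the book):

```python
import numpy as np

def log_odds(p):
    """Map a probability in (0, 1) to the whole real line (the logit)."""
    return np.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: map any real z back into (0, 1)."""
    return 1 / (1 + np.exp(-z))

p = 0.8
z = log_odds(p)  # odds = 0.8 / 0.2 = 4, log-odds = log(4)
```

For p = 0.5 the odds are 1 and the log-odds are 0, which is why logistic regression's decision boundary sits at a raw score of zero.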
Support vector machines aim to identify a hyperplane in a high-dimensional feature space. Intuitively, the best hyperplane is the one with the largest distance to the nearest training point of any class; this maximum-margin hyperplane provides the best separation of the training data.
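A minimal illustration with scikit-learn's SVC (the toy clusters are my own, not from the book):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two well-separated 2-D clusters, one per class.
X = np.vstack([rng.randn(20, 2) - 3, rng.randn(20, 2) + 3])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
# The support vectors are the training points closest to the hyperplane;
# only they determine the decision boundary.
support = clf.support_vectors_
```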
K-means is a form of flat clustering, where the goal is to partition the space into a set of groups without creating relations (e.g. a hierarchy) among them.
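A quick sketch with scikit-learn's KMeans, assuming two well-separated blobs (toy data, mine):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 10])  # two blobs

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Each point gets exactly one flat label; there is no hierarchy among clusters.
labels = km.labels_
```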
Naive Bayes: posterior = prior × likelihood / evidence, used for discrete datasets.
(So for that Apple Siri question about finding sports-related web pages, Naive Bayes should be the tool for that kind of text classification.)
There are also Bernoulli and Multinomial Naive Bayes.
And for continuous data, Gaussian Naive Bayes can be used.
If the data is continuous, it might be convenient to estimate the probability by using a Gaussian distribution. So if feature xᵢ is continuous, we compute the mean μ_c and the variance σ_c² of xᵢ within each class c, and then model the likelihood as a Gaussian: p(xᵢ | c) = 1/√(2πσ_c²) · exp(−(xᵢ − μ_c)² / (2σ_c²)).
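A minimal sketch of this with scikit-learn's GaussianNB (toy data mine): it fits exactly the per-class means and variances described above.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
# One continuous feature whose mean differs by class (0 vs 5).
X = np.vstack([rng.normal(0, 1, (50, 1)), rng.normal(5, 1, (50, 1))])
y = np.array([0] * 50 + [1] * 50)

gnb = GaussianNB().fit(X, y)
# theta_ holds the per-class means used in the Gaussian likelihood.
mean_by_class = gnb.theta_.ravel()
```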
Another approach for dealing with continuous features is to discretize them into buckets and use Multinomial Naïve Bayes on the resulting discrete model. However, some care is needed in choosing the right number of buckets.
TF×IDF is a weighting technique frequently used for text classification. The key intuition is to boost a term that is frequent in a document d of a collection (Term Frequency, TF) but not so frequent across the remaining documents of the collection (Inverse Document Frequency, IDF).
Feature hashing(continuous and discrete values)
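A sketch with scikit-learn's FeatureHasher (the sizes and feature names are arbitrary): mixed string and numeric features are hashed into a fixed-width vector, so no vocabulary has to be stored.

```python
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=16, input_type="dict")
# String values become indicator features; numeric values keep their magnitude.
X = hasher.transform([{"city": "Rome", "temp": 21.5},
                      {"city": "Oslo", "temp": 3.0}])
```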
Feature binning(continuous values)
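Binning can be sketched with NumPy's digitize (the bucket edges here are arbitrary):

```python
import numpy as np

values = np.array([0.2, 1.7, 3.4, 9.9])
edges = np.array([1.0, 3.0, 5.0])  # 4 buckets: (-inf,1), [1,3), [3,5), [5,inf)
buckets = np.digitize(values, edges)  # each value mapped to its bucket index
```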
Use Chi-Square selection (Chi-Square is a statistical test used to check whether two categorical features are correlated).
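A minimal sketch with scikit-learn's SelectKBest and the chi2 score (synthetic counts, mine): the label-correlated count feature is kept, the independent one dropped.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.RandomState(0)
y = rng.randint(0, 2, 200)
informative = y + rng.randint(0, 2, 200)  # counts correlated with the label
noise = rng.randint(0, 5, 200)            # counts independent of the label
X = np.column_stack([informative, noise])

selector = SelectKBest(chi2, k=1).fit(X, y)  # chi2 needs non-negative features
kept = selector.get_support()
```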
Use mutual information to do feature selection
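A sketch with scikit-learn's mutual_info_classif (synthetic data, mine): the informative feature scores higher than pure noise.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.RandomState(0)
y = rng.randint(0, 2, 300)
X = np.column_stack([y + 0.1 * rng.randn(300),  # nearly determines the label
                     rng.randn(300)])           # independent noise

mi = mutual_info_classif(X, y, random_state=0)
```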
If you want to investigate whether your data follow some distribution (normal, uniform), one useful way is to use a qq-plot, which can compare two probability distributions by plotting their quantiles against each other. If the distributions are similar, then the graph will show a straight line.
```python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

data = np.random.normal(loc=0, scale=1, size=1000)
stats.probplot(data, dist="norm", plot=plt)  # straight line => data look normal
plt.show()
```
ReLU -- max(0,x)
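In NumPy, ReLU is just an element-wise maximum:

```python
import numpy as np

def relu(x):
    # Negative entries are clipped to zero; positives pass through unchanged.
    return np.maximum(0, x)

out = relu(np.array([-2.0, 0.0, 3.0]))
```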
Loss functions: squared loss (linear regression) / logistic loss (logistic regression) / hinge loss (SVM, image recognition)
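The three losses side by side for a label y ∈ {−1, +1} and a raw score f (a sketch following the standard margin convention, not code from the book):

```python
import numpy as np

def squared_loss(y, f):      # linear regression
    return (y - f) ** 2

def logistic_loss(y, f):     # logistic regression, y in {-1, +1}
    return np.log(1 + np.exp(-y * f))

def hinge_loss(y, f):        # SVM: zero once the margin y*f reaches 1
    return max(0.0, 1 - y * f)
```

A correctly classified point with margin y·f ≥ 1 costs nothing under the hinge loss, while the logistic loss keeps decreasing but never reaches zero.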
Stochastic Gradient Descent
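A bare-bones SGD loop for one-dimensional least squares (the learning rate and data are illustrative): each step uses the gradient of the loss on a single example rather than the whole dataset.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 1)
y = 3.0 * X[:, 0] + 0.1 * rng.randn(200)  # true slope is 3

w, lr = 0.0, 0.05
for epoch in range(20):
    for i in rng.permutation(len(X)):              # visit examples in random order
        grad = 2 * (w * X[i, 0] - y[i]) * X[i, 0]  # d/dw of squared loss on one sample
        w -= lr * grad
```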
L1 and L2 regularization
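L1 (Lasso) vs L2 (Ridge) in one sketch (toy data, mine): the L1 penalty drives irrelevant coefficients exactly to zero, while L2 only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 2.0 * X[:, 0] + 0.1 * rng.randn(100)  # only the first feature matters

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
```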