Notes

These are detailed notes from ISLR, organized by chapter, recording the study points in the book. Initiated on Dec 10, 2021.

CHAPTER 1, Intro

1.0 history of method:

  • linear regression: (quantitative response) 19th century.
  • linear discriminant analysis: (qualitative response) 1936.
  • logistic regression: 1940s.
  • generalized linear model: 1970s, treats linear regression and logistic regression as special cases of the GLM.
  • classification/regression trees: 1980s.
  • generalized additive models: 1986.

1.1 notations:

  • n denotes number of observations.
  • p denotes number of variables.
  • \(x_{ij}\) denotes the ith observation of jth variable, i is for OBSERVATIONS, j is for FEATURES.
  • X denotes n * p matrix.
  • an observation of p dimensions: \(x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T\), which is a vector; vectors are by default represented as columns.
  • a feature across n observations: \(\mathbf{x}_j = (x_{1j}, x_{2j}, \ldots, x_{nj})^T\), so that \(X = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p)\).
  • a vector of length n (such as a feature vector \(\mathbf{x}_j\)) is written in lower-case bold, e.g. \(\mathbf{a}\); a vector of any other length (such as an observation vector \(x_i\) of length p) is written in normal lower case, e.g. \(a\).
  • the product of matrix A and B is denoted AB.

1.2 chapter contents index:

linear

  • Ch2: terminology, KNN
  • Ch3: linear regression
  • Ch4: classification: logistic regression and linear discriminant analysis
  • Ch5: cross-validation and the bootstrap
  • Ch6: improvements to linear models: stepwise selection, ridge regression, principal components regression, partial least squares, and the lasso

non-linear

  • ch7: additive
  • ch8+9: trees, bagging, boosting, random forest, support vector machine
  • ch10: principal component analysis, K-means clustering and hierarchical clustering

CHAPTER 2, BASIC

2.1 general terms

  • reasons to estimate f: prediction and inference (interpretation)

prediction –> accuracy

  • reducible errors: model assumption and selection, reduce it by using a more appropriate model.
  • irreducible errors: unmeasured variables + un-measurable variations
  • the goal is to get a desirable estimate of f that can minimize the reducible error

inference –> interpretability

  • to understand the exact form of “f()”, namely how Y changes as a function of X1, X2, …, Xp; sub-questions:
    • which predictors are included in “f()” and their importance;
    • impact direction (+/-) of X on Y, –> NOTE: the sign of a specific predictor X1’s impact on Y can depend on other predictors (X2,X3…).

2.2 Parametric vs non-parametric

2.2.1 parametric method :

  • it involves a 2 step process,
    • step 1: make an assumption about the functional form/shape of “f”, e.g. a linear model with p features has (p+1) coefficients (one extra intercept term).
    • step 2: train the model using the training data
      • one general approach is OLS (Chapter 3); other approaches appear in Chapter 6 (a minimal sketch follows this list)
  • Flexible models require estimating a greater number of parameters, which can lead to overfitting, a phenomenon of following the errors/noise too closely
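
A minimal sketch of the two-step parametric approach in R, assuming a hypothetical data frame df with response y and predictors x1, x2 (names are illustrative, not from the book):

# step 1: assume a functional form, e.g. f(X) = b0 + b1*x1 + b2*x2  (p = 2 features, p + 1 = 3 coefficients)
# step 2: estimate the coefficients from the training data, e.g. by ordinary least squares
# fit = lm(y ~ x1 + x2, data = df)
# coef(fit)   # b0, b1, b2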

2.2.2 non-parametric method :

  • since no model specifications/parameters are made prior to training, a very large number of observations are required to get an accurate estimate of “f”.

  • 2-D plot of flexibility (accuracy) vs. interpretability:

    • [least flexible / most interpretable –> most flexible / least interpretable]
      • Subset Selection & Lasso > Least Squares > Generalized Additive Models & Trees > Bagging, Boosting > Support Vector Machines
      • the lasso uses a more restrictive procedure to estimate coefficients, setting a number of coefficients exactly to 0, so it is less flexible, more interpretable, and more restrictive.
      • GAMs allow for non-linear relationships, so they are more flexible and less interpretable.
      • because of over-fitting, a less flexible method can sometimes give more accurate predictions than a more flexible method.
    • thin plate splines are more flexible as they allow a much wider range of possible shapes to estimate “f”.

2.3 supervised and unsupervised learning

  • unsupervised: clustering

2.4 Accuracy

  • MSE: mean squared error (formula after this list)
    • the ‘testing MSE’, not the ‘training MSE’, is what matters
    • degrees of freedom: because the deviations from the mean sum to 0, a vector of k values has k-1 degrees of freedom, meaning only (k-1) of them can vary freely.
    • ‘training MSE’ decreases monotonically as model flexibility increases
    • ‘testing MSE’ is U-shaped as model flexibility increases
    • ‘training MSE’ is typically smaller than ‘testing MSE’ (45/434)
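
For reference, the MSE computed over a set of observations is

\(MSE = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{f}(x_i)\big)^2\)

computed on the training points it is the training MSE; computed on held-out points it is the test MSE.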

2.5 variance - Bias tradeoff

  • Variance: how much \(\hat{f}\) (the estimated parameters) would change if it were estimated on a different training data set.

  • Bias: the error introduced by approximating the real-life relationship with a simpler model; it usually relates to

    • the form of “f”, i.e. linear or non-linear
    • the number of features included in the model as predictors
  • expected testing MSE = variance + bias² + irreducible error (written out after this list):

    • bias decreases as flexibility increases
    • variance increases as flexibility increases
  • low variance high bias example: fitting a horizontal line to the data

  • low bias high variance example: drawing a curve that passes through every training observation.(50/434)
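
Written out, the decomposition referred to above: the expected test MSE at a point \(x_0\) is

\(E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon)\)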

2.6 Bayes Classifier

  • assign an observation with predictor value x0 to the class j for which Pr(Y = j | X = x0) is largest. If there are only two classes, the Bayes classifier corresponds to predicting class ONE if Pr(Y = 1 | X = x0) > 0.5.
  • Bayes Decision Boundary: the set of points where the probability is exactly 50% in a TWO-CLASS problem.
  • Bayes error rate: the lowest possible test error rate, analogous to the irreducible error (formula below).
  • For real data, we don’t know the conditional distribution of Y given X, so computing the Bayes classifier is impossible. But many approaches attempt to estimate the conditional distribution of Y given X, then classify a given observation to the class with the highest estimated probability. (52/434)
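
Written out: the Bayes classifier assigns \(x_0\) to the class \(j\) that maximizes \(\Pr(Y = j \mid X = x_0)\), and the Bayes error rate is

\(1 - E\big[\max_{j} \Pr(Y = j \mid X)\big]\)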

2.7 KNN

  • KNN prediction process:

    • step 1: for an unknown observation Xi, find the “K” nearest neighbors of Xi;
    • step 2: among the K neighbors, find the class P with the largest count (in the training / labeled data), and assign Xi to class P.
  • KNN can often produce classifiers that are surprisingly close to the optimal Bayes classifier.

  • the choice of K:

    • with K = 1, the training error is 0 but the test error can be high –> low bias, high variance;
    • as K grows very large, the classifier becomes less flexible and the decision boundary approaches linear –> high bias, low variance (a small class::knn sketch follows this list);
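
A minimal KNN sketch with the class package; train.X, test.X (predictor matrices) and train.Y, test.Y (class labels) are illustrative placeholders, not from the book:

# library(class)
# set.seed(1)                                      # ties among neighbors are broken at random
# knn.pred = knn(train.X, test.X, train.Y, k = 3)  # majority vote among the k nearest neighbors
# table(knn.pred, test.Y)                          # confusion matrix against the true labels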

2.8 R commands:

  • pdf(), jpeg() produce output files in the specified format
  • image() produces a color-coded plot (a heatmap) whose colors depend on the z value

CHAPTER 3, LINEAR REGRESSION

3.1 Accuracy of Coefficients Estimates

  • “population regression line” vs. “least squares line”
  • unbiasedness: if we could average the estimates obtained from a huge number of data sets, this average would exactly equal the true population value.
  • confidence intervals for the coefficients (see below)
  • t-statistic = (beta_hat - 0) / SE(beta_hat)
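
Written out, the approximate 95% confidence interval for a coefficient, as given in ISLR:

\(\hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1)\)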

3.2 the quality of the linear regression fit:

  • Two quality metrics: RSE and R-squared (formulas at the end of this section)
  • residual standard error (RSE)
  • R-squared
    • what constitutes a good R-squared:
      • In physics, the data often come from a model that is very close to linear, so an R-squared close to 1 is expected;
      • In the social sciences, many factors are not captured by the model, so a relatively low R-squared is acceptable (well below 0.1 can be realistic);
  • Shark attacks and ice cream sales are correlated, but banning ice cream would not reduce shark attacks (correlation is not causation).
  • F-statistic: F = [(TSS - RSS)/p] / [RSS/(n-p-1)]
    • F close to 1: no relationship between the response and the predictors;
    • F substantially greater than 1: there is a relationship, i.e. at least one of the predictors has a non-zero coefficient.
  • when p > n, namely “number of features” is larger than “number of observations”, a method called “FORWARD SELECTION” is used (or some other high dimension methods). (89/434)

  • criteria for variable selection (which predictors are retained in the model):

    • Mallow’s Cp
    • Akaike information criterion (AIC)
    • Bayesian information criterion (BIC)
    • adjusted R-squared
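
For reference, the two fit-quality metrics listed at the start of this section, as defined in ISLR for a model with p predictors:

\(RSE = \sqrt{\frac{RSS}{n-p-1}}\)

\(R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}\), where \(TSS = \sum_{i}(y_i - \bar{y})^2\).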

3.3 three classical approaches to variable selection

  • forward selection: 0 - 1 - 2 - 3 - 4 –> until some stopping condition, at each step adding the feature that gives the lowest RSS;
  • backward selection: p - (p-1) - (p-2) –> , at each step removing the feature with the largest p-value;
  • mixed selection (92/434)

3.4 Dummy variables (98/434)

3.5 Linear extension

  • assumptions on Linear model relationship: additive and linear
    • additive means the effect of each Xi on Y does not depend on the other predictors (no interaction effects such as X1*X2)
    • linear means a one-unit change in X has the same impact on Y regardless of the magnitude of X (relaxed by polynomial terms)
  • Hierarchical principle: if an interaction term (X1*X2) were included in model, the respective MAIN effects(X1 or X2) should always be included as well regardless of their coefficients’ significance.

3.6 potential problems: (104/434)

  • non-linearity of the X-Y relationship:
    • a residual plot (residuals vs. fitted values y-hat) can reveal non-linearity;
    • transformed forms of X, such as sqrt(X), log(X), or X^3, can be used when there is non-linearity;
  • Correlation of error terms:
    • the standard assumption is that the errors are uncorrelated, i.e. e(i) gives no information about e(i+1)
    • when the errors are correlated, the estimated standard errors understate the true standard errors;
    • e.g., accidentally doubling the data (each observation appearing twice) would make the standard errors appear smaller by a factor of sqrt(2), giving overly narrow confidence intervals (106/434)
  • non-constant variance of error terms (heteroscedasticity, funnel-shaped residual plot)
    • non-constant variance often arises when larger values of Y come with error terms of greater magnitude
    • potential solutions: transform Y with a concave function such as log(Y) or sqrt(Y), which shrinks the larger responses;
  • Outliers (Y)
    • observations with studentized residuals larger than 3 in absolute value are possible outliers
    • remove outliers with care, because an outlier might instead indicate that a predictor is missing from the model;
  • High-leverage points (X)
    • an observation whose X value is far from the bulk of the observations has high leverage
    • with multiple predictors, an observation can be high-leverage even if each individual coordinate is within the usual range
    • the leverage statistic is computed to identify high-leverage points
  • Collinearity (X-X)
    • reduces the power of the hypothesis tests on the coefficients;
    • a correlation close to +1 or -1 in the correlation matrix indicates collinearity;
    • multicollinearity: a combination of 3 or more predictors is highly correlated, even if no single pair is;
    • VIF (variance inflation factor) detects multicollinearity (always >= 1; = 1 means a complete absence of collinearity; formula below) (114/434)
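
The VIF mentioned above is computed as

\(VIF(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}}\)

where \(R^2_{X_j \mid X_{-j}}\) is the \(R^2\) from regressing \(X_j\) onto all of the other predictors; when that \(R^2\) is close to 1 (collinearity), the VIF is large.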

3.7 other issues

  • prediction intervals and confidence intervals:

    • prediction intervals are wider than confidence intervals because they also incorporate the irreducible error (the uncertainty of a single observation) on top of the uncertainty in the estimated regression line;

Comparison of linear regression and KNN regression:

  • parametric and non-parametric

  • KNN regression and linear regression comparison:

    • when p = 1 or 2 (one or two predictors), KNN works fine with a sample of 100 obs.; but with p = 20 and 100 obs., most observations have no close neighbors, and the test MSE explodes;
    • this phenomenon is called “the curse of dimensionality”: poor prediction power in high dimensions; (120/434)
  • a parametric method tends to outperform a non-parametric method when there is, on average, only a small number of observations per predictor;

3.8 linear regression lab

# library(MASS)
# library(ISLR)
#
# lm.fit = lm(medv ~ lstat, data = Boston)
# names(lm.fit)    # other components stored in lm.fit
# coef(lm.fit)     # the coefficient estimates
# confint(lm.fit)  # confidence intervals for the coefficients

# predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
# interval = "confidence" gives confidence intervals; interval = "prediction" gives prediction intervals

# par(mfrow = c(2, 2))
# plot(lm.fit)                               # the 4 diagnostic plots for the fitted model
# plot(predict(lm.fit), residuals(lm.fit))   # residuals vs. fitted values
# summary(lm(medv ~ lstat * age, data = Boston))           # fit with an interaction term
# lm.fit2 = lm(medv ~ lstat + I(lstat^2), data = Boston)   # incorporate a polynomial term (127/434)

# lm.fit = lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats)  # fit with some interaction terms
# attach(Carseats); contrasts(ShelveLoc)   # contrasts() shows the dummy-variable coding

CHAPTER 4, Classification

Three classifiers are discussed: logistic regression, linear discriminant analysis, and K-nearest neighbors (generalized additive models, trees, random forests, boosting, and SVMs come in later chapters). [2021/01/21]

4.1 Linear discriminant analysis. (150/434).

  • MORE popular than logistic regression when there are more than two response classes;
  • LDA estimates the prior probabilities, the class means “u”, and the shared variance of the density functions, plugs them into the Bayes formula, and assigns an observation to the class with the largest discriminant score “delta”.
    • ROC Curve
  • LDA assumes each class is drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes.
  • confusion matrix: domain knowledge should be applied when determining the costs associated with FP/FN. (160/434)

4.2 QDA:

  • LDA tends to be a better bet than QDA if there are relatively few training observations, so reducing variance is crucial;
  • QDA is recommended if the training set is very large, so variance is not a big concern and bias matters more.

4.3 comparisons of classification methods: KNN, logistics, LDA and QDA

  • Both logistic regression and LDA produce linear decision boundaries and are parametric methods (166/434), lab (167).

4.6 Lab, stock data 2001-2005

library(ISLR)
# names(Smarket)
# cor(Smarket[,-9])
# fit logistic regression with glm():
# glm.fits = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial)
# summary(glm.fits)
# predict(glm.fits, type = "response")
# contrasts(Direction)   # shows the dummy-variable coding
#### LDA/QDA fit using "MASS"
# lda.fit = lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)   # (173/434)
# qda.fit = qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

#### KNN(), 176/434

### Caravan insurance data: 

# standardize data: use scale()
## standardized.X = scale(Caravan[,-86])
## var(Caravan[,1])
#### 165
## var(Caravan[,2])
#### 0.165
## var(standardized.X[,1])
#### 1
## var(standardized.X[,2])
#### 1 (178/434)
  • KNN limitation: Any variables that are on a large scale will have a much larger effect on the distance between the observations and hence on KNN classifier than variables on a small scale. (178/434)

  • non-parametric methods perform poorly when the number of predictors is large;

CHAPTER 5 RESAMPLING METHODS

5.1 LOOCV:

  • the LOOCV estimate of the test MSE is the average of the n held-out MSEs, so there is no randomness in the result; we always get the same estimate from LOOCV (formulas after this list). (190/434)
  • BUT LOOCV is expensive to implement, as the model has to be fit n times.
  • LOOCV is general and can be applied to any kind of predictive modeling.
  • K-fold CV: LOOCV is the special case where K equals n (the sample size).
  • K-fold vs. LOOCV: a bias-variance trade-off; K-fold with K < n often gives a more accurate estimate of the test error rate than LOOCV because of this trade-off:
    • K-fold has higher bias than LOOCV
    • K-fold has lower variance than LOOCV (193/434, 20220211)
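
Written out, the two estimates referred to above:

\(CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} MSE_i\), where \(MSE_i = (y_i - \hat{y}_i)^2\) comes from the fit that leaves out observation i (LOOCV);

\(CV_{(k)} = \frac{1}{k}\sum_{j=1}^{k} MSE_j\), where \(MSE_j\) is computed on held-out fold j (k-fold CV).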

5.2 Cross Validation in classifying problems

  • MSE for continuous responses; the misclassification rate for categorical responses;

5.3 Bootstrap

  • illustration with asset allocation:
    • the original sample of n = 3 contains three distinct observations;
    • a bootstrap data set is created by re-sampling with replacement from the original data set, drawing n obs.;
    • each bootstrap data set is used to obtain an estimate of alpha, the asset-allocation coefficient (true value 0.6). (202/434)

5.4 Lab: cross-validation and bootstrap

# The LOOCV estimate can be computed automatically for any generalized linear model using glm() and cv.glm().

# glm() without a family argument (e.g. family = binomial) gives the same fit as lm() by default.

# cv.glm() is part of "boot" library. 
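
# A minimal LOOCV / k-fold sketch with cv.glm(), following the Auto data example from the ISLR lab:
# library(ISLR); library(boot)
# glm.fit = glm(mpg ~ horsepower, data = Auto)   # no family argument, so same fit as lm()
# cv.err = cv.glm(Auto, glm.fit)                 # LOOCV by default
# cv.err$delta                                   # cross-validation estimate(s) of the test MSE
# cv.10 = cv.glm(Auto, glm.fit, K = 10)          # 10-fold CV instead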

######

# bootstrap of estimating alpha: asset allocation
# 
# alpha.fn = function(data,index){
#   X = data$X[index]
#   Y = data$Y[index]
#   return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y)))
# }
# 
# library(ISLR)
# library(boot)
# set.seed(1)
# alpha.fn(Portfolio, sample(100, 100, replace = TRUE))   # Portfolio is the data; alpha.fn returns the statistic
# boot(Portfolio, alpha.fn, R = 1000)

CHAPTER 6 LINEAR MODEL SELECTION AND REGULARIZATION

6.1 intro:

  • why we need fitting procedures other than plain least squares:

    • prediction accuracy: if n>>p, ordinary Least square is fine as it has low variance;
      • if n is close to p, poor prediction on future observations not used in model training;
      • if n<p, the OLS can’t be used at all as the variance is infinite.
      • by shrinking the estimated coefficient, the variance is substantially reduced at a negligible cost of increasing bias.
    • model interpretability: fewer predictors means irrelevant predictors are removed, which is easier to interpret
  • Three methods used in improving least square fitting :

    • Subset Selection: identify a subset of the total p predictors (best subset and stepwise selection; forward selection; backward selection (requires n > p));

    • Shrinkage: regularization, shrink some coefficients toward zero or exactly to 0 (ridge regression, lasso, 225/434)

    • Dimension Reduction: reduce the p predictors to M where M < p; the M new variables are different linear combinations (projections) of the p variables.

  • to estimate the test error, two approaches can be applied:

    • in the past, performing cross-validation was computationally prohibitive when either n or p is large, so AIC/BIC/Cp/adjusted-R2 were popular.
    1. make adjustments to the training error to account for the bias due to over-fitting, as an estimate of the test error:
    • Cp = (1/n) * (RSS + 2*d*sigmaHat^2), smallest wins

    • AIC = (1/(n*sigmaHat^2)) * (RSS + 2*d*sigmaHat^2), smallest wins

    • BIC = (1/(n*sigmaHat^2)) * (RSS + log(n)*d*sigmaHat^2), smallest wins

    • adjusted R2 = 1 - [RSS/(n-d-1)] / [TSS/(n-1)], largest wins; R2 = 1 - RSS/TSS; where “d” is the number of predictors selected in the model

    2. directly estimate the test error using a CV approach.

6.2 Ridge Regression (226/434) -> shrinks coefficients toward 0

  • Ridge minimizes RSS + shrinkage penalty, where

    • shrinkage penalty = lambda * sum(b1^2 + b2^2 + … + bp^2);
    • each value of lambda yields a different set of coefficient estimates;
  • l2 norm: ||beta||_2 = sqrt(beta_1^2 + beta_2^2 + … + beta_p^2), measuring the distance of the vector beta from 0.

  • It’s best to apply ridge regression after standardizing the predictors, using x_ij_tilde = x_ij / sqrt((1/n) * sum_i (x_ij - x_bar_j)^2).

  • Ridge regression has an advantage over least squares because of the bias-variance trade-off.

    • As lambda increases, the coefficients are shrunk more and the fit becomes less flexible: bias increases and variance decreases
  • Ridge regression also has a computational advantage: best subset selection searches over 2^p models, while ridge fits a single model for each lambda

  • one disadvantage of ridge regression is that it cannot eliminate unimportant predictors: all p predictors are retained in the final model (coefficients are shrunk toward zero but never set exactly to zero). The lasso can.

6.3 Lasso (229/434) –> feature selection

  • l1 penalty: shrinks the coefficients toward 0 and forces some coefficient estimates to be exactly 0 when lambda is sufficiently large –> performs feature selection, which ridge does not (both objectives are written out after this list).

  • The lasso is easier to interpret than ridge;

  • both ridge and lasso can be fit with roughly the same amount of computation as a single least squares fit.
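
Written out, the two optimization problems (as in ISLR):

ridge: minimize \(RSS + \lambda \sum_{j=1}^{p} \beta_j^2\) (l2 penalty)

lasso: minimize \(RSS + \lambda \sum_{j=1}^{p} |\beta_j|\) (l1 penalty)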

6.4 dimension reduction:

  • reduce (p+1) coefficients to a simpler M+1 coefficients where M<p;

  • PCA: a popular approach for deriving a low-dimensional set of features from a large set of variables. [PCR, partial least square]

    • first principal component: the direction along which the observations vary the most; the projected observations have the largest possible variance
    • PCA is NOT a feature selection method!!!
    • PCR (principal component regression) is closely related to ridge regression; one can think of ridge regression as a continuous version of PCR!
    • when performing PCR, it is recommended to standardize each predictor.
  • Partial Least Squares (PLS): a dimension reduction method, a supervised version of PCR: it finds “M” new features using both the predictors and the response Y, while PCR chooses its directions in an unsupervised way, without using Y.

    • The predictors and response are standardized before performing PLS.

6.5 Considerations in high Dimensions (248/434)

  • “dimension” refers to the size of p; the larger p is, the higher the dimension.
  • high-dimensional data usually means p > n, where p is the number of features and n is the number of observations.
  • as more and more (mostly irrelevant) features are added to the model, the testing MSE increases –> the “curse of dimensionality”
  • CAUTION: in the high-dimensional setting it is easy to obtain a useless model with zero residuals, so one should never use the “sum of squared errors”, “R-squared statistics”, or other traditional measures of fit on the TRAINING DATASET as evidence of a good model fit.

6.6 lab: subset selection models

  • best subset selection
    • use the function “regsubsets()” from the “leaps” library
  • forward and backward stepwise selection
  • ridge regression and lasso
    • package used: “glmnet” (glmnet can only take numeric matrices as input) (lab, 261/434)
    • by default, the function “cv.glmnet()” performs 10-fold cross-validation;
    • alpha = 0 for RIDGE, alpha = 1 for LASSO.
  • Principal Component Regression
    • the function “pcr()” from the “pls” library performs principal component regression
    • the fitted model is more difficult to interpret because it does not perform variable selection or produce coefficient estimates on the original predictors.
  • Partial Least Squares
    • plsr() from the “pls” library
# library(leaps)
# library(ISLR)
# regfit.full = regsubsets(Salary ~ ., data = Hitters)
# summary(regfit.full)
# 10^seq(10, -2, length = 100)   # setting up the lambda grid (for glmnet)
# plot(regfit.full, scale = "adjr2")
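
A minimal ridge/lasso sketch with glmnet on the Hitters data, following the flow of the ISLR lab (exact settings may differ from the book):

# library(ISLR); library(glmnet)
# Hitters = na.omit(Hitters)
# x = model.matrix(Salary ~ ., Hitters)[, -1]          # numeric matrix; dummy-codes the factors
# y = Hitters$Salary
# grid = 10^seq(10, -2, length = 100)                  # lambda grid
# ridge.mod = glmnet(x, y, alpha = 0, lambda = grid)   # alpha = 0 -> ridge
# cv.out = cv.glmnet(x, y, alpha = 1)                  # alpha = 1 -> lasso; 10-fold CV by default
# bestlam = cv.out$lambda.min
# predict(cv.out, s = bestlam, newx = x[1:5, ])        # predictions at the chosen lambda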

CHAPTER 7 MOVING BEYOND LINEARITY (275/434)

7.1 INTRO

  • a list of methods:
    • polynomial regression
    • step function, cut variables into K regions;
    • regression splines: a combination of polynomials and step functions; divide X into K regions and fit polynomials that are joined smoothly at the region boundaries (known as knots)

7.2 polynomial regression (276/434)

  • it is unusual to use a degree (power) greater than 3 or 4 in polynomial regression, as strange shapes can appear near the boundary of the X variable;

7.3 Regression splines

  • splines often give superior results to polynomial regression, because splines produce flexible fits by increasing the number of knots while keeping the degree fixed; polynomials gain flexibility only by raising the degree, which can lead to unexpectedly weird shapes (a small splines sketch follows).
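
A minimal regression-spline sketch on the Wage data (knot locations follow the ISLR lab; df = 4 for the natural spline is illustrative):

# library(ISLR); library(splines)
# fit = lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)   # cubic spline with 3 knots
# fit2 = lm(wage ~ ns(age, df = 4), data = Wage)                 # natural spline; knots chosen from df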

7.4 smoothing splines:

  • in fitting a smoothing spline, we don’t choose the NUMBER or LOCATION of the knots; there is a knot at each training observation, and instead we choose the value of lambda
  • LOOCV can be computed very efficiently for smoothing splines, at essentially the cost of a single fit (see the sketch below).
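
A minimal smoothing-spline sketch on the Wage data (following the ISLR lab):

# library(ISLR)
# fit = smooth.spline(Wage$age, Wage$wage, df = 16)      # choose lambda via effective degrees of freedom
# fit2 = smooth.spline(Wage$age, Wage$wage, cv = TRUE)   # choose lambda by (leave-one-out) cross-validation
# fit2$df                                                # effective degrees of freedom chosen by CV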

7.5 GAM, pros and cons

7.6 lab:

  • polynomial regression and step functions
# fit = glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)   # I() creates a binary response variable

CHAPTER 8 TREE BASED METHODS

8.1 INTRO

  • bagging, boosting, and random forests are discussed: each of these approaches involves producing multiple trees, which are combined to yield a single consensus prediction.

  • regression trees:

    • 2 steps: DIVIDE and Average, first divide the predictor space then assign the mean of response to points falling in the space.

      • recursive binary splitting,

      • weakest link pruning:

  • classification trees:

  • three node-impurity measures are used when growing/pruning a classification tree (written out after this list):

    • classification error rate: the fraction of the training observations in a region that do not belong to the most common class; preferred when prediction accuracy (e.g. pruning) is the goal.

    • gini index: a small value means a node contains predominantly observations from a single class (node purity).

    • cross-entropy: numerically similar to the gini index; also small when the node is pure.

  • trees versus linear models:

    • trees are easier to interpret and visualize;
    • trees tend to over-fit the data: they rely too heavily on the training sample, so they are not robust and do not reach the highest levels of prediction accuracy.
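
The three impurity measures written out, with \(\hat{p}_{mk}\) the proportion of training observations in region m that are from class k:

classification error: \(E = 1 - \max_k \hat{p}_{mk}\)

gini index: \(G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})\)

cross-entropy: \(D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}\)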

8.2 Bagging, Random Forest and Boosting

  • all rely on trees as building blocks to construct more powerful prediction models.

8.2.1 Bagging

  • bagging is bootstrap aggregation: “averaging a set of observations reduces variance”

    • average of the predicted values –> continuous response;
    • majority vote –> categorical response
  • out-of-bag (OOB) error estimation (327/434)

  • Bagging improves prediction accuracy at the expense of interpretability

  • Bagging: compute B predictions y-hat using B separate bootstrap training sets, then average them to obtain a single low-variance learning model.

8.2.2 Random Forest

  • random forests provide an improvement over bagged trees by way of a small tweak that de-correlates the trees:
    • at each split, a random sample of m predictors out of the total p predictors is chosen as split candidates (typically m is about sqrt(p)).
  • the major difference between bagging and random forests is the size of m: if m = p, a random forest is just bagging.

8.2.3 Boosting

  • boosting fits small trees sequentially to the residuals of the current model and adds them into the fitted function, using the original predictors.
  • it is a slow learning process; in general, statistical learning approaches that learn slowly tend to perform well. (331/434)
  • boosting has three tuning parameters: the number of trees B, the shrinkage parameter lambda that controls the learning rate, and the number of splits d in each tree.
  • Boosting can use smaller trees because the growth of each tree takes into account the trees that have already been grown, so smaller trees are sufficient.
  • Using stumps leads to an additive model; a STUMP is a tree with only a single split.

8.3 Lab: Decision Trees: (326/434), tree(), cv.tree()

  • Random Forest: 337/434
  • Bagging is the special case of random forest with m = p
  • boosting: 339/434 (a sketch of the main calls follows this list)
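
A minimal sketch of the main calls in this lab, fit on the Boston data (settings are illustrative, not the book's exact ones):

# library(MASS); library(tree); library(randomForest); library(gbm)
# tree.boston = tree(medv ~ ., data = Boston)                    # a single regression tree
# cv.boston = cv.tree(tree.boston)                               # cost-complexity pruning chosen by CV
# bag.boston = randomForest(medv ~ ., data = Boston, mtry = 13)  # bagging: mtry = p (13 predictors)
# rf.boston = randomForest(medv ~ ., data = Boston, mtry = 6)    # random forest: mtry about p/3 here
# boost.boston = gbm(medv ~ ., data = Boston, distribution = "gaussian",
#                    n.trees = 5000, interaction.depth = 4)      # boosting with small trees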

CHAPTER 9 SUPPORT VECTOR MACHINES

9.1 the non-separable case:

  • support vector classifier: the generalization of the maximal margin classifier to the non-separable case is called the “support vector classifier”.

  • if the maximal margin hyperplane is extremely sensitive to a change in a single observation, that suggests it overfits the training data.

  • C small (narrow margin): low bias, high variance; C large (wide margin): high bias, low variance.

  • Important properties of support vector machine:

    • only observations that lie on the margin or violate the margin affect the hyperplane; observations strictly on the correct side of the margin can be moved around without changing the classifier at all.
    • observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors. Only they affect the support vector classifier.
    • LDA, in contrast, is sensitive to observations far from the boundary (it uses class means and a covariance computed from all observations); in this respect the support vector classifier behaves more like logistic regression. (359/434)

9.2 svm with more than two classes (364/434)

9.3 logistic regression and SVM (365/434)

  • logistic regression has a loss function similar to that of the SVM; for observations far from the decision boundary the logistic loss is not exactly 0, but it is very small.

  • when the classes are well separated, SVMs tend to perform better than logistic regression;

  • nuisance parameter: a parameter which is not of immediate interest but which must be accounted for in analysis of parameters that are of interest.

  • SVM regression

9.4 SVM lab, (367/434)

  • SVM
  • ROC curves (373/434)
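
A minimal sketch of the e1071 calls used in this lab, assuming a data frame dat with a factor response y (illustrative names; not the book's exact settings):

# library(e1071)
# svmfit = svm(y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)   # support vector classifier
# tune.out = tune(svm, y ~ ., data = dat, kernel = "radial",
#                 ranges = list(cost = c(0.1, 1, 10), gamma = c(0.5, 1, 2)))     # CV over cost and gamma
# bestmod = tune.out$best.model
# table(predict(bestmod, dat), dat$y)                                            # confusion matrix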

CHAPTER 10 UNSUPERVISED LEARNING (381/434)

10.1 INTRO: PCA

  • each of the dimensions (components) found by PCA is a linear combination of all p features.
  • the first principal component loading vector has a very special property: it defines the line in p-dimensional space that is closest to the n observations.
  • before doing PCA, variables should be scaled (standardized); but if all variables are measured in the same units, scaling may not be needed. (388/434)
  • proportion of variance explained (PVE)
  • scree plot
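
A minimal PCA sketch following the ISLR lab on the USArrests data:

# pr.out = prcomp(USArrests, scale = TRUE)   # scale = TRUE standardizes each variable first
# pr.out$rotation                            # the loading vectors
# pr.var = pr.out$sdev^2
# pve = pr.var / sum(pr.var)                 # proportion of variance explained by each component
# plot(pve, type = "b")                      # scree plot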

10.2 Clustering

  • difference between PCA and clustering:

    • PCA looks to find low-dimensional representation of the observations that explain a good fraction of the variance;
    • clustering looks to find homogeneous subgroups among the observations;
  • methods:

    • K-means clustering:
      • one disadvantage is that the number of clusters K must be pre-specified
    • hierarchical clustering (a small kmeans/hclust sketch appears at the end of this chapter)
      • it results in a tree-based representation of the observations called a dendrogram (bottom-up / agglomerative clustering).
      • in a dendrogram, similarity is read from the vertical axis (the height at which branches fuse), NOT from horizontal closeness.
      • hierarchical clustering can sometimes generate worse results than K-means for a given number of clusters.
      • linkage: describes the dissimilarity between two groups of observations.
        • complete: maximal intercluster dissimilarity
        • single: minimal intercluster dissimilarity
        • average: mean intercluster dissimilarity
        • centroid: dissimilarity between the centroids (mean vectors) of the two clusters
      • Euclidean distance is usually used to measure dissimilarity; correlation-based distance can also be used
  • Practical issues in clustering:

    • clustering can be non-robust: for example, clustering a subset of the data can produce groups quite different from those found on the full data set.
  • LAB-PCA (409/434)
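
A minimal clustering sketch with kmeans() and hclust(), assuming a numeric matrix x (an illustrative placeholder, not from the book):

# set.seed(2)
# km.out = kmeans(x, centers = 3, nstart = 20)        # K-means with K = 3; nstart = multiple random starts
# km.out$cluster                                      # cluster assignments
# hc.complete = hclust(dist(x), method = "complete")  # hierarchical clustering, complete linkage
# plot(hc.complete)                                   # the dendrogram
# cutree(hc.complete, 3)                              # cut the tree to obtain 3 clusters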