Notes
These are detailed notes from ISLR, organized by chapter, recording the study points in the book. Initiated on Dec 10, 2021.
CHAPTER 1, Intro
1.0 history of method:
- linear regression : (Y continuous/quantitative) 19th century.
- linear discriminant analysis : (Y categorical/qualitative) 1936.
- logistic regression : 1940s.
- generalized linear model (GLM) : 1970s, treats linear regression and logistic regression as special cases.
- classification/regression trees: 1980s.
- generalized additive model: 1986.
1.1 notations:
- n denotes number of observations.
- p denotes number of variables.
- \(x_{ij}\) denotes the ith observation of jth variable, i is for OBSERVATIONS, j is for FEATURES.
- X denotes n * p matrix.
- an observation of p dimensions, \(x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T\), which is a vector; vectors are by default represented as columns.
- a feature across n observations, \(\mathbf{x}_j = (x_{1j}, x_{2j}, \dots, x_{nj})^T\), so that \(X = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_p)\).
- a vector of length n (such as a feature vector \(\mathbf{x}_j\)) is written in lower case bold, e.g. \(\mathbf{a}\); a vector not of length n (such as an observation \(x_i\), of length p) is written in normal lower case, e.g. \(a\).
- the product of matrix A and B is denoted AB.
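A tiny R sketch tying this notation to matrix indexing (the matrix below is made up for illustration):
X <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)  # n = 5 observations, p = 3 features
X[2, 3]   # x_ij: the 2nd observation of the 3rd feature
X[2, ]    # x_i: a single observation, a vector of length p
X[, 3]    # x_j: a single feature across all n observations
dim(X)    # n and p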
1.2 chapter contents index:
linear
- Ch2: terminology, KNN
- ch3: linear reg
- ch4: classification, logistic regression and linear discriminant analysis
- ch5: cross-validation and bootstrap
- ch6: linear model improvements, stepwise selection, ridge regression, principal components regression, partial least squares and the lasso
non-linear
- ch7: additive
- ch8+9: trees, bagging, boosting, random forest, support vector machine
- ch10: principal component analysis, K-means clustering and hierarchical clustering
CHAPTER 2, BASIC
2.1 general terms
- reasons to estimate f: prediction and inference(interpretation)
prediction –> accuracy
- reducible errors: come from model assumptions and model selection; reduce them by choosing a more appropriate model.
- irreducible errors: unmeasured variables + un-measurable variations
- the goal is to get a desirable estimate of f that can minimize the reducible error
inference –> interpretability
- to understand the exact form of “f()”, namely how Y changes as a function of X1,X2,…Xp; sub-questions:
- which predictors are included in “f()” and their importance;
- impact direction (+/-) of X on Y, –> NOTE: the sign of a specific predictor X1’s impact on Y can depend on other predictors (X2,X3…).
2.2 Parametric vs non-parametric
2.2.1 parametric method :
- it involves a 2 step process,
- step 1: make an assumption on the functional form/shape of “f”, e.g. a linear model with p features has (p+1) coefficients (one extra intercept term).
- step 2: train the model using the training data
- one general fitting approach is OLS (Chapter 3); other approaches exist (Chapter 6)
- Flexible models require estimating a greater number of parameters, which can lead to overfitting, a phenomenon of following the errors/noise too closely
2.2.2 non-parametric method :
since no assumption about the form of “f” is made prior to training, a very large number of observations is required to get an accurate estimate of “f”.
2-D plot of flexibility (accuracy) vs. interpretability:
- ordered from least flexible / most interpretable to most flexible / least interpretable:
- Subset Selection, Lasso -> Least Squares -> Generalized Additive Models, Trees -> Bagging, Boosting -> Support Vector Machines
- the lasso uses a more restrictive procedure to estimate coefficients, setting a number of them exactly to 0, so it is less flexible and more interpretable.
- GAMs allow for non-linear relationships, so they are more flexible and less interpretable.
- because of over-fitting, a less flexible method can achieve higher prediction accuracy than a more flexible method.
- thin plate splines are more flexible as they can take a much wider range of possible shapes to estimate “f”.
2.3 supervised and unsupervised learning
2.4 Accuracy
- MSE: mean squared error
- ‘testing MSE’ not ‘training MSE’ is critical
- degrees of freedom: because the sum of deviations from the mean is 0, a vector of k dimensions has k-1 degrees of freedom, meaning only (k-1) dimensions can vary freely.
- monotone decreasing ‘training MSE’ as flexibility (e.g., number of features) increases
- U-shaped ‘testing MSE’ as flexibility increases
- ‘Training MSE’ is almost always smaller than ‘testing MSE’ (45/434)
2.5 variance - Bias tradeoff
Variance: how much the estimated “f” (its parameters) would change if it were estimated on a different training data set.
Bias: the error introduced by approximating the real-life relationship with the chosen form of “f”; it usually relates to
- the form of “f”, i.e. linear or non-linear
- number of features in the model as predictors
expected testing MSE = variance + (bias)^2 + irreducible error (written out after this list):
- bias decreases as flexibility increases (e.g., adding features lowers the bias)
- variance increases as flexibility increases
low variance high bias example: fitting a horizontal line to the data
low bias high variance example: drawing a curve that passes through every training observation.(50/434)
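Written out, the decomposition referred to above, where \(\mathrm{Var}(\varepsilon)\) is the irreducible error:
\[
E\big(y_0 - \hat f(x_0)\big)^2 = \mathrm{Var}\big(\hat f(x_0)\big) + \big[\mathrm{Bias}\big(\hat f(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon)
\]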
2.6 Bayes Classifier
- the Bayes classifier assigns each observation to the class j for which \(P(Y = j \mid X = x_0)\) is largest. If there are only two classes, it predicts class ONE when \(P(Y = 1 \mid X = x_0) > 0.5\).
- Bayes Decision Boundary: the set of points where that probability is exactly 50% in a TWO-CATEGORY problem.
- Bayes error rate : the lowest possible test error rate, which is analogous to irreducible error.
- For real data, we don’t know the conditional distribution of Y given X, so computing the Bayes classifier is impossible. But many approaches attempt to estimate the conditional distribution of Y given X and then classify a given observation to the class with the highest estimated probability. (52/434)
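A minimal sketch of one such approach, KNN from the "class" package (the data below are made up); knn() classifies each test point by majority vote among its k nearest training points, i.e. it estimates \(P(Y = j \mid X = x_0)\) empirically:
library(class)
set.seed(1)
train.X <- matrix(rnorm(40), ncol = 2)            # 20 training observations, 2 predictors
test.X  <- matrix(rnorm(10), ncol = 2)            # 5 test observations
train.Y <- factor(rep(c("A", "B"), each = 10))    # two classes
knn.pred <- knn(train.X, test.X, train.Y, k = 3)  # majority vote among the 3 nearest neighbours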
2.8 R commands:
- pdf(), jpeg() to produce outputs that have the specified format
- image() produce a 3-D color-coded plot whose colors depend on z value
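A short illustration of those commands (the surface z below is made up):
x <- y <- seq(-pi, pi, length.out = 50)
z <- outer(x, y, function(a, b) cos(b) / (1 + a^2))
pdf("figure.pdf")              # route the plot to a PDF file; jpeg() works the same way
image(x, y, z)                 # color-coded map of the z values
contour(x, y, z, add = TRUE)
dev.off()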
CHAPTER 3, LINEAR REGRESSION
3.1 Accuracy of Coefficients Estimates
- “population regression line” and “least square line”
- if we could average a huge number of estimates \(\hat\mu\) obtained from a huge number of sets of observations, this average would exactly match the population mean \(\mu\) (the estimate is unbiased).
- confidence interval for a coefficient: approximately \(\hat\beta_1 \pm 2 \cdot SE(\hat\beta_1)\)
- t-statistic \(t = (\hat\beta_1 - 0)/SE(\hat\beta_1)\), testing \(H_0: \beta_1 = 0\)
3.2 the Quality of linear regression fit:
- Two quality metrics: RSE and R-square
- residual standard error (RSE)
- R-square
- what constitutes a good R-squared:
- In physics, we know the data come from (nearly) linear models, so an R-squared close to 1 is expected;
- In social sciences, many factors are not captured by the model, so a relatively low R-squared is acceptable (even as low as 0.1);
- Shark attacks and ice cream sales are correlated, but banning ice cream would not reduce shark attacks (correlation is not causation).
- F statistic:
\(F = \dfrac{(TSS - RSS)/p}{RSS/(n-p-1)}\), testing \(H_0: \beta_1 = \dots = \beta_p = 0\):
* F close to 1: no evidence of a relationship between the response and the predictors;
* F much greater than 1: evidence of a relationship, i.e. at least one of the predictors has a non-zero coefficient.
when p > n, namely the “number of features” is larger than the “number of observations”, a method called FORWARD SELECTION can be used (or some other high-dimensional approach). (89/434)
selection criteria on variable selections (which predictor is retained in the model):
- Mallow’s Cp
- Akaike information criterion (AIC)
- Bayesian information criterion (BIC)
- adjusted R-squared
3.3 three classical approaches for variable selection
- forward selection: start from 0 predictors and add 1, 2, 3, … until some stopping condition, at each step adding the variable that gives the LOWEST RSS;
- backward selection: start from p predictors and remove down to (p-1), (p-2), …, at each step dropping the variable with the largest p-value;
- mixed- selection(92/434)
3.4 Dummy variables (98/434)
3.5 Linear extension
- assumptions on Linear model relationship: additive and linear
- additive means the effect of one predictor on Y does not depend on the values of the others (no interaction effect such as X1*X2)
- linear means the impact of a 1-unit change in X is the same regardless of the magnitude of X (relaxed by polynomial terms)
- Hierarchical principle: if an interaction term (X1*X2) is included in the model, the respective MAIN effects (X1 and X2) should always be included as well, regardless of whether their coefficients are significant.
3.6 potential problems: (104/434)
- non-linearity of X-Y relationships:
- a residual plot (fitted values y-hat vs. residuals) can show whether there is non-linearity;
- some form of X, such as sqrt(X), log(X), X^3 can be used when there is non-linearity;
- Correlation of error terms:
- the standard assumption is that the error terms are uncorrelated, i.e. e(i) gives no information about e(i+1)
- when the errors are correlated, the estimated standard errors underestimate the true standard errors;
- e.g. accidentally doubling the sample (each observation appearing twice) would shrink the computed standard errors by a factor of sqrt(2), producing confidence intervals that are too narrow (106/434)
- non-constant variance of error terms (heteroscedasticity, funnel-shaped error terms)
- non-constant error variance often arises when larger values of Y come with error terms of greater magnitude
- a potential solution is to transform Y with a concave function such as log(Y) or sqrt(Y), which shrinks the larger responses;
- Outliers (Y)
- studentized residuals greater than 3 in absolute value indicate a higher probability of an outlier
- remove outliers carefully, because an outlier might instead mean that we are missing a predictor in the model;
- High-leverage points (X)
- an observation whose predictor value is far away from most of the observations has high leverage
- with multiple predictors, an observation can be within the usual range of each individual predictor and still be unusual in the full predictor space, creating a high-leverage point
- the leverage statistic is calculated to identify high-leverage points
- Collinearity (X-X)
- reduces the power of the hypothesis tests, since collinearity inflates the standard errors of the coefficients;
- a correlation close to +1 or -1 in the correlation matrix indicates collinearity;
- multicollinearity: a combination of 3 or more predictors is highly correlated, even if no single pair is;
- VIF (variance inflation factor) detects multicollinearity (always >= 1; = 1 means the complete absence of collinearity; values above 5 or 10 suggest a problematic amount) (114/434); see the sketch below.
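A hedged sketch of computing VIF; vif() here comes from the "car" package (not part of the book's labs), and the Credit model mirrors the book's collinearity example:
library(ISLR)   # Credit data
library(car)    # vif()
fit <- lm(Balance ~ Age + Rating + Limit, data = Credit)
vif(fit)        # values well above 5-10 suggest problematic collinearity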
Comparison of linear regression and KNN regression:
parametric and non-parametric
KNN regression and linear regression comparison:
- when p = 1 or 2, meaning one or two predictors, KNN works fine with a sample of 100 obs.; but with p = 20 and the same 100 obs., most observations have no close neighbours, leading to an explosion of the MSE;
- this phenomenon is also called “the curse of dimensionality”: poor prediction power in high dimensions; (120/434)
a parametric method tends to perform better than a non-parametric method when, on average, each predictor has only a small number of available observations;
3.8 linear regression lab
library(MASS)
library(ISLR)

lm.fit <- lm(medv ~ lstat, data = Boston)
names(lm.fit)        # other components stored in lm.fit
coef(lm.fit)         # coefficient estimates
confint(lm.fit)      # confidence intervals for the coefficients
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
                     # confidence interval for the fit; interval = "prediction" gives the prediction interval
par(mfrow = c(2, 2))
plot(lm.fit)         # the four diagnostic plots of the fitted model
plot(predict(lm.fit), residuals(lm.fit))                  # residuals vs. fitted values
summary(lm(medv ~ lstat * age, data = Boston))            # fit with an interaction term
lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = Boston)   # incorporate the polynomial term (127/434)
lm.fit <- lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats)   # fit with some interaction terms
attach(Carseats); contrasts(ShelveLoc)                    # contrasts() examines the dummy-variable coding
CHAPTER 4, Classification
Three classifiers are discussed: logistic regression, linear discriminant analysis, K-nearest neighbors. (Generalized additive models, trees, random forests, boosting and SVMs come in later chapters.) [2021/01/21]
multi-class categorical variables
maximum likelihood: least squares is a special case of maximum likelihood; the logistic model gives the probability p(X) via the logistic function (written out below)
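The logistic function and the resulting log-odds, written out:
\[
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}},
\qquad
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X
\]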
4.1 Linear discriminant analysis. (150/434).
- MORE popular when there are more than two response classes;
- LDA estimates the prior probabilities \(\pi_k\), the class means \(\mu_k\) and the shared variance (covariance matrix in the multivariate case) for the density functions, and plugs them into the Bayes formula.
- LDA assumes each class is drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes (see the discriminant function below).
- confusion matrix: domain knowledge should be applied when determining the costs associated with false positives / false negatives. (160/434)
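The resulting (multivariate) LDA discriminant, written out; an observation is assigned to the class with the largest \(\delta_k(x)\):
\[
\delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2}\, \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k
\]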
4.2 QDA:
- LDA tends to be a better bet than QDA if there are relatively few training observations, so that reducing variance is crucial;
- QDA is recommended if the training set is very large, so variance is not a big concern, so bias matters more.
4.3 comparisons of classification methods: KNN, logistic regression, LDA and QDA
- Both logistic regression and LDA produce linear decision boundaries and are parametric (166/434), lab (167).
4.6 Lab, stock data 2001-2005
library(ISLR)
library(MASS)
names(Smarket)
cor(Smarket[, -9])
# fit logistic regression with glm():
glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = Smarket, family = binomial)
summary(glm.fits)
predict(glm.fits, type = "response")
contrasts(Smarket$Direction)   # shows the dummy-variable coding
# LDA/QDA fits using "MASS"; train is the training-period indicator defined in the lab (173/434)
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
# KNN: knn() from the "class" package (176/434)
# Caravan insurance data: standardize with scale()
standardized.X <- scale(Caravan[, -86])
var(Caravan[, 1])            # 165
var(Caravan[, 2])            # 0.165
var(standardized.X[, 1])     # 1
var(standardized.X[, 2])     # 1  (178/434)
KNN limitation: Any variables that are on a large scale will have a much larger effect on the distance between the observations and hence on KNN classifier than variables on a small scale. (178/434)
non-parametric methods perform poorly when the number of predictors is large;
CHAPTER 5 RESAMPLING METHODS
- two most common methods: cross-validation and the bootstrap.
5.1 LOOCV:
- the LOOCV test-error estimate is the average of the n MSEs obtained by leaving one observation out at a time, so there is no randomness in the result: we always get the same estimate using LOOCV. (190/434)
- BUT LOOCV is expensive to implement, as the model has to be fit n times.
- LOOCV is general: it can be applied to any predictive model.
- K-fold CV: LOOCV is the special case where K equals n (the sample size).
- K-fold vs LOOCV: a bias-variance tradeoff; K-fold with K < n often gives a more accurate estimate of the test error rate than LOOCV.
- K-fold has higher bias compared with LOOCV
- K-fold has lower variance compared with LOOCV (193/434,20220211)
5.2 Cross Validation in classifying problems
- MSE in continuous; mis-classification rate in categorical;
5.3 Bootstrap
- illustration of asset allocation:
- the original sample of n = 3 contains three different obs.;
-> a bootstrap data set is created by re-sampling with replacement from the original data set, drawing n obs.; -> each bootstrap data set is used to obtain an estimate of alpha, the asset allocation coefficient (true value 0.6). (202/434)
5.4 Lab: cross-validation and bootstrap
# the LOOCV estimate can be computed automatically for any generalized linear model using glm() and cv.glm()
# glm() without passing family = binomial gives, by default, the same fit as lm()
# cv.glm() is part of the "boot" library
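A short sketch of cv.glm(), mirroring the book's Auto example (assumes the Auto data from ISLR):
library(ISLR); library(boot)
glm.fit <- glm(mpg ~ horsepower, data = Auto)  # no family argument, so equivalent to lm()
cv.err  <- cv.glm(Auto, glm.fit)               # LOOCV (default K = n)
cv.err$delta                                   # CV estimate of the test MSE
cv.err.10 <- cv.glm(Auto, glm.fit, K = 10)     # 10-fold CV: cheaper, a bit more bias, less variance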
######
# bootstrap estimate of alpha (asset allocation):
alpha.fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  return((var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y)))
}

library(ISLR)
library(boot)
set.seed(1)
alpha.fn(Portfolio, sample(100, 100, replace = TRUE))  # Portfolio is the data; alpha.fn returns the statistic
boot(Portfolio, alpha.fn, R = 1000)                    # 1000 bootstrap replicates
CHAPTER 6 LINEAR MODEL SELECTION AND REGULARIZATION
6.1 intro:
the necessity of using multiple fitting procedures other than only least square:
- prediction accuracy: if n >> p, ordinary least squares is fine, as it has low variance;
- if n is close to p, there is poor prediction on future observations not used in model training;
- if n < p, OLS can't be used at all, as the variance is infinite.
- by shrinking the estimated coefficients, the variance is substantially reduced at a negligible cost in increased bias.
- model interpretability: fewer predictors (removing irrelevant ones) makes the model easier to interpret
Three methods used to improve on the least squares fit:
Subset Selection: identify a subset of the total p predictors (best subset selection, forward stepwise selection, backward stepwise selection (requires n > p));
Shrinkage: regularization; shrink some coefficients toward zero or set them exactly to 0 (ridge regression, lasso; 225/434)
Dimension Reduction: reduce the p predictors to M, where M < p; the M new variables are linear combinations or projections of the p original variables.
to evaluate the test error, two approaches can be applied:
- make adjustments to the training error to account for the bias due to over-fitting, and use the adjusted value as an estimate of the test error; in the past, cross-validation was computationally prohibitive when n or p was large, so AIC/BIC/Cp/adjusted-R2 were popular:
\(C_p = \frac{1}{n}(RSS + 2 d \hat\sigma^2)\), smallest wins
\(AIC = \frac{1}{n \hat\sigma^2}(RSS + 2 d \hat\sigma^2)\), smallest wins
\(BIC = \frac{1}{n \hat\sigma^2}(RSS + \log(n)\, d\, \hat\sigma^2)\), smallest wins
adjusted \(R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}\), largest wins; \(R^2 = 1 - RSS/TSS\); here d is the number of predictors selected in the model
- directly estimate the test error using a CV approach.
6.2 Ridge Regression (226/434) -> shrink coefficients toward 0
ridge criterion = OLS RSS + shrinkage penalty, where
- shrinkage penalty = \(\lambda \sum_{j=1}^{p} \beta_j^2\);
- each value of lambda yields a different set of coefficient estimates;
l2 norm : \(\|\beta\|_2 = \sqrt{\beta_1^2 + \beta_2^2 + \dots + \beta_p^2}\), measuring the distance of the vector beta from 0.
It's best to apply ridge regression after standardizing the predictors, using \(\tilde x_{ij} = x_{ij} \big/ \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar x_j)^2}\).
Ridge regression has an advantage over least squares because of a better bias-variance trade-off.
- As lambda increases, the coefficients are shrunk more heavily (though none are dropped), meaning an increase in bias and a decrease in variance
Ridge regression also has a computational advantage: it avoids searching over all 2^p possible models.
One disadvantage of ridge regression is that it can't eliminate unimportant predictors: all p predictors remain in the final model. The lasso can.
6.3 Lasso (229/434) –> feature selection
l1 penalty: shrinks the coefficients toward 0 and forces some coefficient estimates to be exactly 0 when lambda is sufficiently large -> performs feature selection, which ridge does not.
The lasso is therefore easier to interpret than ridge;
both Ridge and Lasso can be fit with about the same amount of work as a single least square fit.
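The two penalized criteria side by side, ridge (l2 penalty) and lasso (l1 penalty):
\[
\text{ridge: } \min_{\beta}\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2,
\qquad
\text{lasso: } \min_{\beta}\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|
\]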
6.4 dimension reduction:
reduce the (p+1) coefficients to a simpler set of (M+1) coefficients, where M < p;
PCA: a popular approach for deriving a low-dimensional set of features from a large set of variables. [PCR, partial least squares]
- first principal component: the direction along which the observations vary the most; the projected observations have the largest possible variance
- PCA is NOT a feature selection method!!!
- PCR (principal component regression) is closely related to ridge regression; one can think of ridge regression as a continuous version of PCR!
- when performing PCR, it's recommended to standardize each predictor.
Partial Least Squares (PLS): a dimension reduction method, a supervised version of PCR; it finds M new features that are related to both the predictors and the response, while PCR chooses its directions in an unsupervised way, without using Y.
- The predictors and response are standardized before performing PLS.
6.5 Considerations in high Dimensions (248/434)
- dimension refers to the magnitude (scale) of p; the larger p is, the higher the dimension.
- high-dimensional data usually means p > n, where p is the number of features and n is the number of observations.
- as the number of features included in the model grows beyond what the data can support, the testing MSE increases -> "curse of dimensionality"
- CAUTION: in the high-dimensional setting it is easy to obtain a model that has zero residuals, so one should never use the "sum of squared errors", "R-squared statistics", or other traditional measures of model fit on the TRAINING data set as evidence of a good model fit.
6.6 lab: subset selection models
- best subset selection
- use the function "regsubsets()"
- forward and backward stepwise selection
- Ridge regression and lasso
- package used “glmnet” (# glmnet can only take in numeric variables) (lab, 261/434)
- by default, the function “cv.glmnet()” performs 10-fold cross validation;
- alpha = 0 to fit RIDGE; alpha = 1 to fit LASSO.
- Principal Component Regression
- function “pcr()” from “pls” library to perform principal component regression
- the model is difficult to interpret because it does not perform variable selection or produce any coefficient estimates.
- Partial Least Square
- plsr() from “pls” library
library(leaps)
library(ISLR)
regfit.full <- regsubsets(Salary ~ ., data = Hitters)
summary(regfit.full)
grid <- 10^seq(10, -2, length = 100)   # grid of lambda values, set up for glmnet
plot(regfit.full, scale = "adjr2")
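A hedged expansion of the ridge/lasso snippets above into a runnable sketch (follows the book's Hitters lab; the object names are mine):
library(ISLR); library(glmnet)
Hitters <- na.omit(Hitters)                     # glmnet cannot handle missing values
x <- model.matrix(Salary ~ ., Hitters)[, -1]    # numeric matrix; factors expanded to dummies
y <- Hitters$Salary
grid <- 10^seq(10, -2, length = 100)            # lambda grid
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)  # alpha = 0 -> ridge
cv.out <- cv.glmnet(x, y, alpha = 1)                 # alpha = 1 -> lasso; 10-fold CV by default
bestlam <- cv.out$lambda.min
predict(cv.out, s = bestlam, newx = x[1:3, ])        # predictions at the CV-chosen lambda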
CHAPTER 7 MOVING BEYOND LINEARITY (275/434)
7.1 INTRO
- a list of methods:
- polynomial regression
- step function, cut variables into K regions;
- regression splines, a combination of polynomials and step functions: divide X into K regions and fit polynomials that are joined smoothly at the region boundaries (known as knots)
7.2 polynomial regression (276/434)
- it is unusual to use a degree (power) greater than 3 or 4 in polynomial regression, as strange shapes can appear near the boundary of the X variable;
7.3 Regression splines
- splines often give superior results to polynomial regression, because splines gain flexibility by increasing the number of knots while keeping the degree fixed; polynomials gain flexibility by raising the degree, which can lead to unexpectedly wild shapes.
7.4 smoothing splines:
- in fitting a smoothing spline, we don't choose the NUMBER or LOCATION of the knots; there is a knot at each training observation, and instead we choose the value of lambda
- the LOOCV error can be computed very efficiently for smoothing splines.
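A minimal smoothing-spline sketch on the book's Wage data (from ISLR):
library(ISLR)
fit  <- smooth.spline(Wage$age, Wage$wage, df = 16)   # specify lambda via effective degrees of freedom
fit2 <- smooth.spline(Wage$age, Wage$wage, cv = TRUE) # choose lambda by cross-validation (a warning about tied x values is expected)
fit2$df                                               # effective df selected by CV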
7.5 GAM, pros and cons
7.6 lab:
- polynomial regression and step functions
fit <- glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)   # I() builds the binary response on the fly
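A companion sketch for step functions with cut() (also from the Wage lab):
table(cut(Wage$age, 4))                 # cut age into 4 regions (cutpoints chosen automatically)
fit.step <- lm(wage ~ cut(age, 4), data = Wage)
coef(summary(fit.step))                 # one dummy coefficient per region (first region is the baseline)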
CHAPTER 8 TREE BASED METHODS
8.1 INTRO
bagging, boosting and random forests are discussed: each of these approaches involves producing multiple trees, which are combined to yield a single consensus prediction.
regression trees:
classification trees:
classification error rate: the fraction of the training observations in a region that do not belong to the most common class.
classification error: when prediction accuracy is the goal, the classification error rate is the preferred criterion.
gini index: a small value means a node contains predominantly observations from a single class (node purity).
cross-entropy: numerically similar to the Gini index, and likewise small when the node is pure (formulas below).
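Written out, with \(\hat p_{mk}\) the proportion of training observations in region m from class k:
\[
G = \sum_{k=1}^{K} \hat p_{mk}\,(1 - \hat p_{mk}),
\qquad
D = -\sum_{k=1}^{K} \hat p_{mk}\,\log \hat p_{mk}
\]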
trees versus linear models:
- trees are easier to interpret and visualize;
- trees tend to over-fit the data; they rely too heavily on the training sample, so they are not robust and do not reach a high level of prediction accuracy.
8.2 Bagging, Random Forest and Boosting
- all rely on trees as building blocks to construct more powerful prediction models.
8.2.1 Bagging
bagging is bootstrap aggregation, “averaging a set of observations reduces variances”
- average of the predicted values –> continuous response;
- majority vote -> categorical response
out-of-bag estimation (327/434)
Bagging improves the prediction accuracy at the expense of interpretability
Bagging: compute B estimates of y-hat using B separate (bootstrapped) training sets, and then average them to obtain a single low-variance model.
8.2.2 Random Forest
- random forests provide an improvement over bagged trees by way of a small tweak that de-correlates the trees:
- a random sample of m predictors out of the total p predictors is chosen as split candidates at each split.
- the major difference between bagging and random forests is the size m of the predictor subset; if m = p, the random forest is simply bagging (see the sketch below).
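A hedged sketch with the randomForest package (Boston data from MASS; the train/test split is made up, as in the lab):
library(MASS); library(randomForest)
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 13, importance = TRUE)  # mtry = p = 13 -> bagging
rf.boston  <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 6, importance = TRUE)   # mtry < p -> random forest (de-correlated trees)
importance(rf.boston)                                     # variable importance measures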
8.2.3 Boosting
- boosting fits small trees to the residuals of the current model and adds them to the fitted function, using the original predictors.
- it is a slow learning process; in general, statistical learning approaches that learn slowly tend to perform well. (331/434)
- boosting has three tuning parameters: the number of trees B, the shrinkage parameter lambda which controls the learning rate, and the number of splits d in each tree.
- Boosting can use smaller trees because the growth of each tree takes into account the trees that have already been grown, so smaller trees are sufficient.
- Using stumps leads to an additive model; a STUMP is a tree with only a single split.
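A hedged boosting sketch with the gbm package (same Boston data and train index as the random-forest sketch above; tuning values follow the book's lab):
library(gbm)
boost.boston <- gbm(medv ~ ., data = Boston[train, ], distribution = "gaussian",
                    n.trees = 5000,           # B: number of trees
                    interaction.depth = 4,    # d: splits per tree
                    shrinkage = 0.01)         # lambda: learning rate
yhat.boost <- predict(boost.boston, newdata = Boston[-train, ], n.trees = 5000)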
8.3 Lab: Decision Trees: (326/434), trees, CV.tree
- Random Forest: 337/434
- Bagging is a special case of random forest, m=p
- boosting: 339/434
CHAPTER 9 SUPPORT VECTOR MACHINE
- maximal margin classifier: the maximal margin hyperplane is the separating hyperplane farthest from the training observations.
- WHEN P IS LARGE, overfitting is a concern.
- the observations that lie on the margin are called support vectors.
- if a separating hyper-plane exists, maximal margin classifier is a natural way to perform classification.
9.1 non separable case:
support vector classifier: the generalization of the maximal margin classifier to the non-separable case is called the "support vector classifier".
if the maximal margin hyperplane is extremely sensitive to a change in a single observation, that suggests the model overfits the training data.
C is small (few violations allowed, narrow margin): low bias, high variance; C is large (many violations allowed, wide margin): high bias and low variance. (See the e1071 sketch below.)
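A hedged two-class sketch with e1071's svm() (made-up data, mirroring the book's lab). Note that e1071's cost argument is a violation penalty, so a SMALL cost produces the wide margins that correspond to a LARGE budget C in the text:
library(e1071)
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- factor(c(rep(-1, 10), rep(1, 10)))
x[y == 1, ] <- x[y == 1, ] + 1            # shift one class so the classes overlap slightly
dat <- data.frame(x = x, y = y)
svmfit <- svm(y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
svmfit$index                              # indices of the support vectors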
Important properties of support vector machine:
- only observations that lie on the margin or violate it affect the hyperplane; observations strictly on the correct side of the margin do not affect the decision rule at all, however they move.
- observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors; only they affect the support vector classifier.
- LDA is heavily influenced by observations far from the decision boundary (its rule depends on the class means of all observations), while the support vector classifier is not; in this respect, logistic regression is more closely related to the support vector classifier. (359/434)
9.2 svm with more than two classes (364/434)
9.3 logistic regression vs. SVM (365/434)
logistic regression has a loss function similar to the SVM's; for observations far from the decision boundary the loss is not exactly 0, but it is very close to 0.
when the classes are well separated, SVMs tend to perform better than logistic regression;
nuisance parameter: a parameter which is not of immediate interest but which must be accounted for in analysis of parameters that are of interest.
SVM regression
CHAPTER 10 UNSUPERVISED LEARNING (381/434)
10.1 INTRO: PCA
- each of the dimensions (principal components) found by PCA is a linear combination of all p features.
- the first principal component loading vector has a very special property: it defines the line in p-dimensional space that is closest to the n observations.
- before doing PCA, variables should be scaled (standardized); but if all variables are measured in the same units, scaling may be skipped. (388/434)
- proportion of variance explained (PVE)
- scree plot
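A short PCA sketch on base R's USArrests data (as in the book's lab), showing the PVE and a scree plot:
pr.out <- prcomp(USArrests, scale = TRUE)  # scale each variable before PCA
pr.out$rotation                            # loading vectors
pr.var <- pr.out$sdev^2                    # variance explained by each component
pve <- pr.var / sum(pr.var)                # proportion of variance explained
plot(pve, type = "b",
     xlab = "Principal Component", ylab = "PVE")   # scree plot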