Notes
These are detailed notes from ISLR, organized by chapter, recording the study points in the book. Initiated on Dec 10, 2021.
CHAPTER 1, Intro
1.0 history of method:
- linear regression : (Y continuous/quantitative) 19th century.
- linear discriminant analysis : (Y categorical/qualitative) 1936.
- logistic regression : 1940s.
- generalized linear model (GLM) : 1970s, treats linear regression and logistic regression as special cases.
- classification/regression trees: 1980s.
- generalized additive model: 1986.
1.1 notations:
- n denotes number of observations.
- p denotes number of variables.
- \(x_{ij}\) denotes the ith observation of jth variable, i is for OBSERVATIONS, j is for FEATURES.
- X denotes n * p matrix.
- an observation of p dimensions, \(x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T\), which is a vector; vectors are by default represented as columns.
- a feature across n observations, \(\mathbf{x}_j = (x_{1j}, x_{2j}, \dots, x_{nj})^T\), so that \(X = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_p)\).
- a vector of length n (such as a feature vector \(\mathbf{x}_j\)) is written in lower case bold, e.g. \(\mathbf{a}\); a vector not of length n (such as an observation \(x_i\), of length p) is written in normal lower case, e.g. \(a\).
- the product of matrix A and B is denoted AB.
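A tiny R sketch tying this notation to matrix indexing (the matrix below is made up for illustration):
X <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)  # n = 5 observations, p = 3 features
X[2, 3]   # x_ij: the 2nd observation of the 3rd feature
X[2, ]    # x_i: a single observation, a vector of length p
X[, 3]    # x_j: a single feature across all n observations
dim(X)    # n and p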
1.2 chapter contents index:
linear
- Ch2: terminology, KNN
- ch3: linear reg
- ch4: classification, logistic regression and linear discriminant analysis
- ch5: cross-validation and bootstrap
- ch6: linear model improvements, stepwise selection, ridge regression, principal components regression, partial least squares and the lasso
non-linear
- ch7: additive
- ch8+9: trees, bagging, boosting, random forest, support vector machine
- ch10: principal component analysis, K-means clustering and hierarchical clustering
CHAPTER 2, BASIC
2.1 general terms
- reasons to estimate f: prediction and inference(interpretation)
prediction –> accuracy
- reducible errors: come from model assumptions and model selection; reduce them by choosing a more appropriate model.
- irreducible errors: unmeasured variables + un-measurable variations
- the goal is to get a desirable estimate of f that can minimize the reducible error
inference –> interpretability
- to understand the exact form of “f()”, namely how Y changes as a function of X1,X2,…Xp; sub-questions:
- which predictors are included in “f()” and their importance;
- impact direction (+/-) of X on Y, –> NOTE: the sign of a specific predictor X1’s impact on Y can depend on other predictors (X2,X3…).
2.2 Parametric vs non-parametric
2.2.1 parametric method :
- it involves a 2 step process,
- step 1: make an assumption on the functional form/shape of “f”, e.g. a linear model with p features has (p+1) coefficients (one extra intercept term).
- step 2: train the model using the training data
- one general fitting approach is OLS (Chapter 3); other approaches exist (Chapter 6)
- Flexible models require estimating a greater number of parameters, which can lead to overfitting, a phenomenon of following the errors/noise too closely
2.2.2 non-parametric method :
since no assumption about the form of “f” is made prior to training, a very large number of observations is required to get an accurate estimate of “f”.
2-D plot of flexibility (accuracy) vs. interpretability:
- ordered from least flexible / most interpretable to most flexible / least interpretable:
- Subset Selection, Lasso -> Least Squares -> Generalized Additive Models, Trees -> Bagging, Boosting -> Support Vector Machines
- the lasso uses a more restrictive procedure to estimate coefficients, setting a number of them exactly to 0, so it is less flexible and more interpretable.
- GAMs allow for non-linear relationships, so they are more flexible and less interpretable.
- because of over-fitting, a less flexible method can achieve higher prediction accuracy than a more flexible method.
- thin plate splines are more flexible as they can take a much wider range of possible shapes to estimate “f”.
2.3 supervised and unsupervised learning
2.4 Accuracy
- MSE: mean squared error
- ‘testing MSE’ not ‘training MSE’ is critical
- degrees of freedom: because the sum of deviations from the mean is 0, a vector of k dimensions has k-1 degrees of freedom, meaning only (k-1) dimensions can vary freely.
- monotone decreasing ‘training MSE’ as flexibility (e.g., number of features) increases
- U-shaped ‘testing MSE’ as flexibility increases
- ‘Training MSE’ is almost always smaller than ‘testing MSE’ (45/434)
2.5 variance - Bias tradeoff
Variance: how much the estimated “f” (its parameters) would change if it were estimated on a different training data set.
Bias: the error introduced by approximating the real-life relationship with the chosen form of “f”; it usually relates to
- the form of “f”, i.e. linear or non-linear
- number of features in the model as predictors
expected testing MSE = variance + (bias)^2 + irreducible error (written out after this list):
- bias decreases as flexibility increases (e.g., adding features lowers the bias)
- variance increases as flexibility increases
low variance high bias example: fitting a horizontal line to the data
low bias high variance example: drawing a curve that passes through every training observation.(50/434)
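Written out, the decomposition referred to above, where \(\mathrm{Var}(\varepsilon)\) is the irreducible error:
\[
E\big(y_0 - \hat f(x_0)\big)^2 = \mathrm{Var}\big(\hat f(x_0)\big) + \big[\mathrm{Bias}\big(\hat f(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon)
\]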
2.6 Bayes Classifier
- the Bayes classifier assigns each observation to the class j for which \(P(Y = j \mid X = x_0)\) is largest. If there are only two classes, it predicts class ONE when \(P(Y = 1 \mid X = x_0) > 0.5\).
- Bayes Decision Boundary: the set of points where that probability is exactly 50% in a TWO-CATEGORY problem.
- Bayes error rate : the lowest possible test error rate, which is analogous to irreducible error.
- For real data, we don’t know the conditional distribution of Y given X, so computing the Bayes classifier is impossible. But many approaches attempt to estimate the conditional distribution of Y given X and then classify a given observation to the class with the highest estimated probability. (52/434)
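A minimal sketch of one such approach, KNN from the "class" package (the data below are made up); knn() classifies each test point by majority vote among its k nearest training points, i.e. it estimates \(P(Y = j \mid X = x_0)\) empirically:
library(class)
set.seed(1)
train.X <- matrix(rnorm(40), ncol = 2)            # 20 training observations, 2 predictors
test.X  <- matrix(rnorm(10), ncol = 2)            # 5 test observations
train.Y <- factor(rep(c("A", "B"), each = 10))    # two classes
knn.pred <- knn(train.X, test.X, train.Y, k = 3)  # majority vote among the 3 nearest neighbours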
2.8 R commands:
- pdf(), jpeg() to produce outputs that have the specified format
- image() produce a 3-D color-coded plot whose colors depend on z value
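A short illustration of those commands (the surface z below is made up):
x <- y <- seq(-pi, pi, length.out = 50)
z <- outer(x, y, function(a, b) cos(b) / (1 + a^2))
pdf("figure.pdf")              # route the plot to a PDF file; jpeg() works the same way
image(x, y, z)                 # color-coded map of the z values
contour(x, y, z, add = TRUE)
dev.off()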
CHAPTER 3, LINEAR REGRESSION
3.1 Accuracy of Coefficients Estimates
- “population regression line” and “least square line”
- if we could average a huge number of estimates \(\hat\mu\) obtained from a huge number of sets of observations, this average would exactly match the population mean \(\mu\) (the estimate is unbiased).
- confidence interval for a coefficient: approximately \(\hat\beta_1 \pm 2 \cdot SE(\hat\beta_1)\)
- t-statistic \(t = (\hat\beta_1 - 0)/SE(\hat\beta_1)\), testing \(H_0: \beta_1 = 0\)
3.2 the Quality of linear regression fit:
- Two quality metrics: RSE and R-square
- residual standard error (RSE)
- R-square
- what constitutes a good R-squared:
- In physics, we know the data come from (nearly) linear models, so an R-squared close to 1 is expected;
- In social sciences, many factors are not captured by the model, so a relatively low R-squared is acceptable (even as low as 0.1);
- Shark attacks and ice cream sales are correlated, but banning ice cream would not reduce shark attacks (correlation is not causation).
- F statistic:
\(F = \dfrac{(TSS - RSS)/p}{RSS/(n-p-1)}\), testing \(H_0: \beta_1 = \dots = \beta_p = 0\):
* F close to 1: no evidence of a relationship between the response and the predictors;
* F much greater than 1: evidence of a relationship, i.e. at least one of the predictors has a non-zero coefficient.
when p > n, namely the “number of features” is larger than the “number of observations”, a method called FORWARD SELECTION can be used (or some other high-dimensional approach). (89/434)
selection criteria on variable selections (which predictor is retained in the model):
- Mallow’s Cp
- Akaike information criterion (AIC)
- Bayesian information criterion (BIC)
- adjusted R-squared
3.3 three classical approaches for variable selection
- forward selection: start from 0 predictors and add 1, 2, 3, … until some stopping condition, at each step adding the variable that gives the LOWEST RSS;
- backward selection: start from p predictors and remove down to (p-1), (p-2), …, at each step dropping the variable with the largest p-value;
- mixed- selection(92/434)
3.4 Dummy variables (98/434)
3.5 Linear extension
- assumptions on Linear model relationship: additive and linear
- additive means the effect of one predictor on Y does not depend on the values of the others (no interaction effect such as X1*X2)
- linear means the impact of a 1-unit change in X is the same regardless of the magnitude of X (relaxed by polynomial terms)
- Hierarchical principle: if an interaction term (X1*X2) is included in the model, the respective MAIN effects (X1 and X2) should always be included as well, regardless of whether their coefficients are significant.
3.6 potential problems: (104/434)
- non-linearity of X-Y relationships:
- a residual plot (fitted values y-hat vs. residuals) can show whether there is non-linearity;
- some form of X, such as sqrt(X), log(X), X^3 can be used when there is non-linearity;
- Correlation of error terms:
- the standard assumption is that the error terms are uncorrelated, i.e. e(i) gives no information about e(i+1)
- when the errors are correlated, the estimated standard errors underestimate the true standard errors;
- e.g. accidentally doubling the sample (each observation appearing twice) would shrink the computed standard errors by a factor of sqrt(2), producing confidence intervals that are too narrow (106/434)
- non-constant variance of error terms (heteroscedasticity, funnel-shaped error terms)
- non-constant error variance often arises when larger values of Y come with error terms of greater magnitude
- a potential solution is to transform Y with a concave function such as log(Y) or sqrt(Y), which shrinks the larger responses;
- Outliers (Y)
- studentized residuals greater than 3 in absolute value indicate a higher probability of an outlier
- remove outliers carefully, because an outlier might instead mean that we are missing a predictor in the model;
- High-leverage points (X)
- an observation whose predictor value is far away from most of the observations has high leverage
- with multiple predictors, an observation can be within the usual range of each individual predictor and still be unusual in the full predictor space, creating a high-leverage point
- the leverage statistic is calculated to identify high-leverage points
- Collinearity (X-X)
- reduces the power of the hypothesis tests, since collinearity inflates the standard errors of the coefficients;
- a correlation close to +1 or -1 in the correlation matrix indicates collinearity;
- multicollinearity: a combination of 3 or more predictors is highly correlated, even if no single pair is;
- VIF (variance inflation factor) detects multicollinearity (always >= 1; = 1 means the complete absence of collinearity; values above 5 or 10 suggest a problematic amount) (114/434); see the sketch below.
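A hedged sketch of computing VIF; vif() here comes from the "car" package (not part of the book's labs), and the Credit model mirrors the book's collinearity example:
library(ISLR)   # Credit data
library(car)    # vif()
fit <- lm(Balance ~ Age + Rating + Limit, data = Credit)
vif(fit)        # values well above 5-10 suggest problematic collinearity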
Comparison of linear regression and KNN regression:
parametric and non-parametric
KNN regression and linear regression comparison:
- when p = 1 or 2, meaning one or two predictors, KNN works fine with a sample of 100 obs.; but with p = 20 and the same 100 obs., most observations have no close neighbours, leading to an explosion of the MSE;
- this phenomenon is also called “the curse of dimensionality”: poor prediction power in high dimensions; (120/434)
a parametric method tends to perform better than a non-parametric method when, on average, each predictor has only a small number of available observations;
3.8 linear regression lab
library(MASS)
library(ISLR)

lm.fit <- lm(medv ~ lstat, data = Boston)
names(lm.fit)        # other components stored in lm.fit
coef(lm.fit)         # coefficient estimates
confint(lm.fit)      # confidence intervals for the coefficients
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
                     # confidence interval for the fit; interval = "prediction" gives the prediction interval
par(mfrow = c(2, 2))
plot(lm.fit)         # the four diagnostic plots of the fitted model
plot(predict(lm.fit), residuals(lm.fit))                  # residuals vs. fitted values
summary(lm(medv ~ lstat * age, data = Boston))            # fit with an interaction term
lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = Boston)   # incorporate the polynomial term (127/434)
lm.fit <- lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats)   # fit with some interaction terms
attach(Carseats); contrasts(ShelveLoc)                    # contrasts() examines the dummy-variable coding
CHAPTER 4, Classification
Three classifiers are discussed: logistic regression, linear discriminant analysis, K-nearest neighbors. (Generalized additive models, trees, random forests, boosting and SVMs come in later chapters.) [2021/01/21]
multi-class categorical variables
maximum likelihood: least squares is a special case of maximum likelihood; the logistic model gives the probability p(X) via the logistic function (written out below)
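The logistic function and the resulting log-odds, written out:
\[
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}},
\qquad
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X
\]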
4.1 Linear discriminant analysis. (150/434).
- MORE popular when there are more than two response classes;
- LDA estimates the prior probabilities \(\pi_k\), the class means \(\mu_k\) and the shared variance (covariance matrix in the multivariate case) for the density functions, and plugs them into the Bayes formula.
- LDA assumes each class is drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes (see the discriminant function below).
- confusion matrix: domain knowledge should be applied when determining the costs associated with false positives / false negatives. (160/434)
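The resulting (multivariate) LDA discriminant, written out; an observation is assigned to the class with the largest \(\delta_k(x)\):
\[
\delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2}\, \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k
\]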
4.2 QDA:
- LDA tends to be a better bet than QDA if there are relatively few training observations, so that reducing variance is crucial;
- QDA is recommended if the training set is very large, so variance is not a big concern, so bias matters more.
4.3 comparisons of classification methods: KNN, logistic regression, LDA and QDA
- Both logistic regression and LDA produce linear decision boundaries and are parametric (166/434), lab (167).
4.6 Lab, stock data 2001-2005
library(ISLR)
library(MASS)
names(Smarket)
cor(Smarket[, -9])
# fit logistic regression with glm():
glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = Smarket, family = binomial)
summary(glm.fits)
predict(glm.fits, type = "response")
contrasts(Smarket$Direction)   # shows the dummy-variable coding
# LDA/QDA fits using "MASS"; train is the training-period indicator defined in the lab (173/434)
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
# KNN: knn() from the "class" package (176/434)
# Caravan insurance data: standardize with scale()
standardized.X <- scale(Caravan[, -86])
var(Caravan[, 1])            # 165
var(Caravan[, 2])            # 0.165
var(standardized.X[, 1])     # 1
var(standardized.X[, 2])     # 1  (178/434)
KNN limitation: Any variables that are on a large scale will have a much larger effect on the distance between the observations and hence on KNN classifier than variables on a small scale. (178/434)
non-parametric methods perform poorly when the number of predictors is large;
CHAPTER 5 RESAMPLING METHODS
- two most common methods: cross-validation and the bootstrap.
5.1 LOOCV:
- the LOOCV test-error estimate is the average of the n MSEs obtained by leaving one observation out at a time, so there is no randomness in the result: we always get the same estimate using LOOCV. (190/434)
- BUT LOOCV is expensive to implement, as the model has to be fit n times.
- LOOCV is general: it can be applied to any predictive model.
- K-fold CV: LOOCV is the special case where K equals n (the sample size).
- K-fold vs LOOCV: a bias-variance tradeoff; K-fold with K < n often gives a more accurate estimate of the test error rate than LOOCV.
- K-fold has higher bias compared with LOOCV
- K-fold has lower variance compared with LOOCV (193/434,20220211)
5.2 Cross Validation in classifying problems
- MSE in continuous; mis-classification rate in categorical;
5.3 Bootstrap
- illustration of asset allocation:
- the original sample of n = 3 contains three different obs.;
-> a bootstrap data set is created by re-sampling with replacement from the original data set, drawing n obs.; -> each bootstrap data set is used to obtain an estimate of alpha, the asset allocation coefficient (true value 0.6). (202/434)
5.4 Lab: cross-validation and bootstrap
# the LOOCV estimate can be computed automatically for any generalized linear model using glm() and cv.glm()
# glm() without passing family = binomial gives, by default, the same fit as lm()
# cv.glm() is part of the "boot" library
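A short sketch of cv.glm(), mirroring the book's Auto example (assumes the Auto data from ISLR):
library(ISLR); library(boot)
glm.fit <- glm(mpg ~ horsepower, data = Auto)  # no family argument, so equivalent to lm()
cv.err  <- cv.glm(Auto, glm.fit)               # LOOCV (default K = n)
cv.err$delta                                   # CV estimate of the test MSE
cv.err.10 <- cv.glm(Auto, glm.fit, K = 10)     # 10-fold CV: cheaper, a bit more bias, less variance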
######
# bootstrap estimate of alpha (asset allocation):
alpha.fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  return((var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y)))
}

library(ISLR)
library(boot)
set.seed(1)
alpha.fn(Portfolio, sample(100, 100, replace = TRUE))  # Portfolio is the data; alpha.fn returns the statistic
boot(Portfolio, alpha.fn, R = 1000)                    # 1000 bootstrap replicates
CHAPTER 6 LINEAR MODEL SELECTION AND REGULARIZATION
6.1 intro:
the necessity of using multiple fitting procedures other than only least square:
- prediction accuracy: if n >> p, ordinary least squares is fine, as it has low variance;
- if n is close to p, there is poor prediction on future observations not used in model training;
- if n < p, OLS can't be used at all, as the variance is infinite.
- by shrinking the estimated coefficients, the variance is substantially reduced at a negligible cost in increased bias.
- model interpretability: fewer predictors (removing irrelevant ones) makes the model easier to interpret
Three methods used to improve on the least squares fit:
Subset Selection: identify a subset of the total p predictors (best subset selection, forward stepwise selection, backward stepwise selection (requires n > p));
Shrinkage: regularization; shrink some coefficients toward zero or set them exactly to 0 (ridge regression, lasso; 225/434)
Dimension Reduction: reduce the p predictors to M, where M < p; the M new variables are linear combinations or projections of the p original variables.
to evaluate the test error, two approaches can be applied:
- make adjustments to the training error to account for the bias due to over-fitting, and use the adjusted value as an estimate of the test error; in the past, cross-validation was computationally prohibitive when n or p was large, so AIC/BIC/Cp/adjusted-R2 were popular:
\(C_p = \frac{1}{n}(RSS + 2 d \hat\sigma^2)\), smallest wins
\(AIC = \frac{1}{n \hat\sigma^2}(RSS + 2 d \hat\sigma^2)\), smallest wins
\(BIC = \frac{1}{n \hat\sigma^2}(RSS + \log(n)\, d\, \hat\sigma^2)\), smallest wins
adjusted \(R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}\), largest wins; \(R^2 = 1 - RSS/TSS\); here d is the number of predictors selected in the model
- directly estimate the test error using a CV approach.
6.2 Ridge Regression (226/434) -> shrink coefficients toward 0
ridge criterion = OLS RSS + shrinkage penalty, where
- shrinkage penalty = \(\lambda \sum_{j=1}^{p} \beta_j^2\);
- each value of lambda yields a different set of coefficient estimates;
l2 norm : \(\|\beta\|_2 = \sqrt{\beta_1^2 + \beta_2^2 + \dots + \beta_p^2}\), measuring the distance of the vector beta from 0.
It's best to apply ridge regression after standardizing the predictors, using \(\tilde x_{ij} = x_{ij} \big/ \sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar x_j)^2}\).
Ridge regression has an advantage over least squares because of a better bias-variance trade-off.
- As lambda increases, the coefficients are shrunk more heavily (though none are dropped), meaning an increase in bias and a decrease in variance
Ridge regression also has a computational advantage: it avoids searching over all 2^p possible models.
One disadvantage of ridge regression is that it can't eliminate unimportant predictors: all p predictors remain in the final model. The lasso can.
6.3 Lasso (229/434) –> feature selection
l1 penalty: shrinks the coefficients toward 0 and forces some coefficient estimates to be exactly 0 when lambda is sufficiently large -> performs feature selection, which ridge does not.
The lasso is therefore easier to interpret than ridge;
both Ridge and Lasso can be fit with about the same amount of work as a single least square fit.
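The two penalized criteria side by side, ridge (l2 penalty) and lasso (l1 penalty):
\[
\text{ridge: } \min_{\beta}\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2,
\qquad
\text{lasso: } \min_{\beta}\; \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|
\]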
6.4 dimension reduction:
reduce the (p+1) coefficients to a simpler set of (M+1) coefficients, where M < p;
PCA: a popular approach for deriving a low-dimensional set of features from a large set of variables. [PCR, partial least squares]
- first principal component: the direction along which the observations vary the most; the projected observations have the largest possible variance
- PCA is NOT a feature selection method!!!
- PCR (principal component regression) is closely related to ridge regression; one can think of ridge regression as a continuous version of PCR!
- when performing PCR, it's recommended to standardize each predictor.
Partial Least Squares (PLS): a dimension reduction method, a supervised version of PCR; it finds M new features that are related to both the predictors and the response, while PCR chooses its directions in an unsupervised way, without using Y.
- The predictors and response are standardized before performing PLS.
6.5 Considerations in high Dimensions (248/434)
- dimension refers to the magnitude (scale) of p; the larger p is, the higher the dimension.
- high-dimensional data usually means p > n, where p is the number of features and n is the number of observations.
- as the number of features included in the model grows beyond what the data can support, the testing MSE increases -> "curse of dimensionality"
- CAUTION: in the high-dimensional setting it is easy to obtain a model that has zero residuals, so one should never use the "sum of squared errors", "R-squared statistics", or other traditional measures of model fit on the TRAINING data set as evidence of a good model fit.
6.6 lab: subset selection models
- best subset selection
- use the function "regsubsets()"
- forward and backward stepwise selection
- Ridge regression and lasso
- package used “glmnet” (# glmnet can only take in numeric variables) (lab, 261/434)
- by default, the function “cv.glmnet()” performs 10-fold cross validation;
- alpha = 0 to fit RIDGE; alpha = 1 to fit LASSO.
- Principal Component Regression
- function “pcr()” from “pls” library to perform principal component regression
- the model is difficult to interpret because it does not perform variable selection or produce any coefficient estimates.
- Partial Least Square
- plsr() from “pls” library
library(leaps)
library(ISLR)
regfit.full <- regsubsets(Salary ~ ., data = Hitters)
summary(regfit.full)
grid <- 10^seq(10, -2, length = 100)   # grid of lambda values, set up for glmnet
plot(regfit.full, scale = "adjr2")
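A hedged expansion of the ridge/lasso snippets above into a runnable sketch (follows the book's Hitters lab; the object names are mine):
library(ISLR); library(glmnet)
Hitters <- na.omit(Hitters)                     # glmnet cannot handle missing values
x <- model.matrix(Salary ~ ., Hitters)[, -1]    # numeric matrix; factors expanded to dummies
y <- Hitters$Salary
grid <- 10^seq(10, -2, length = 100)            # lambda grid
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)  # alpha = 0 -> ridge
cv.out <- cv.glmnet(x, y, alpha = 1)                 # alpha = 1 -> lasso; 10-fold CV by default
bestlam <- cv.out$lambda.min
predict(cv.out, s = bestlam, newx = x[1:3, ])        # predictions at the CV-chosen lambda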
CHAPTER 7 MOVING BEYOND LINEARITY (275/434)
7.1 INTRO
- a list of methods:
- polynomial regression
- step function, cut variables into K regions;
- regression splines, a combination of polynomials and step functions: divide X into K regions and fit polynomials that are joined smoothly at the region boundaries (known as knots)
7.2 polynomial regression (276/434)
- it is unusual to use a degree (power) greater than 3 or 4 in polynomial regression, as strange shapes can appear near the boundary of the X variable;
7.3 Regression splines
- splines often give superior results to polynomial regression, because splines gain flexibility by increasing the number of knots while keeping the degree fixed; polynomials gain flexibility by raising the degree, which can lead to unexpectedly wild shapes.
7.4 smoothing splines:
- in fitting a smoothing spline, we don't choose the NUMBER or LOCATION of the knots; there is a knot at each training observation, and instead we choose the value of lambda
- the LOOCV error can be computed very efficiently for smoothing splines.
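A minimal smoothing-spline sketch on the book's Wage data (from ISLR):
library(ISLR)
fit  <- smooth.spline(Wage$age, Wage$wage, df = 16)   # specify lambda via effective degrees of freedom
fit2 <- smooth.spline(Wage$age, Wage$wage, cv = TRUE) # choose lambda by cross-validation (a warning about tied x values is expected)
fit2$df                                               # effective df selected by CV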
7.5 GAM, pros and cons
7.6 lab:
- polynomial regression and step functions
fit <- glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)   # I() builds the binary response on the fly
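A companion sketch for step functions with cut() (also from the Wage lab):
table(cut(Wage$age, 4))                 # cut age into 4 regions (cutpoints chosen automatically)
fit.step <- lm(wage ~ cut(age, 4), data = Wage)
coef(summary(fit.step))                 # one dummy coefficient per region (first region is the baseline)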
CHAPTER 8 TREE BASED METHODS
8.1 INTRO
bagging, boosting and random forests are discussed: each of these approaches involves producing multiple trees, which are combined to yield a single consensus prediction.
regression trees:
classification trees:
classification error rate: the fraction of the training observations in a region that do not belong to the most common class.
classification error: when prediction accuracy is the goal, the classification error rate is the preferred criterion.
gini index: a small value means a node contains predominantly observations from a single class (node purity).
cross-entropy: numerically similar to the Gini index, and likewise small when the node is pure (formulas below).
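Written out, with \(\hat p_{mk}\) the proportion of training observations in region m from class k:
\[
G = \sum_{k=1}^{K} \hat p_{mk}\,(1 - \hat p_{mk}),
\qquad
D = -\sum_{k=1}^{K} \hat p_{mk}\,\log \hat p_{mk}
\]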
trees versus linear models:
- trees are easier to interpret and visualize;
- trees tend to over-fit the data; they rely too heavily on the training sample, so they are not robust and do not reach a high level of prediction accuracy.
8.2 Bagging, Random Forest and Boosting
- all rely on trees as building blocks to construct more powerful prediction models.
8.2.1 Bagging
bagging is bootstrap aggregation, “averaging a set of observations reduces variances”
- average of the predicted values –> continuous response;
- majority vote -> categorical response
out-of-bag estimation (327/434)
Bagging improves the prediction accuracy at the expense of interpretability
Bagging: compute B estimates of y-hat using B separate (bootstrapped) training sets, and then average them to obtain a single low-variance model.
8.2.2 Random Forest
- random forests provide an improvement over bagged trees by way of a small tweak that de-correlates the trees:
- a random sample of m predictors out of the total p predictors is chosen as split candidates at each split.
- the major difference between bagging and random forests is the size m of the predictor subset; if m = p, the random forest is simply bagging (see the sketch below).
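A hedged sketch with the randomForest package (Boston data from MASS; the train/test split is made up, as in the lab):
library(MASS); library(randomForest)
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 13, importance = TRUE)  # mtry = p = 13 -> bagging
rf.boston  <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 6, importance = TRUE)   # mtry < p -> random forest (de-correlated trees)
importance(rf.boston)                                     # variable importance measures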
8.2.3 Boosting
- boosting fits small trees to the residuals of the current model and adds them to the fitted function, using the original predictors.
- it is a slow learning process; in general, statistical learning approaches that learn slowly tend to perform well. (331/434)
- boosting has three tuning parameters: the number of trees B, the shrinkage parameter lambda which controls the learning rate, and the number of splits d in each tree.
- Boosting can use smaller trees because the growth of each tree takes into account the trees that have already been grown, so smaller trees are sufficient.
- Using stumps leads to an additive model; a STUMP is a tree with only a single split.
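A hedged boosting sketch with the gbm package (same Boston data and train index as the random-forest sketch above; tuning values follow the book's lab):
library(gbm)
boost.boston <- gbm(medv ~ ., data = Boston[train, ], distribution = "gaussian",
                    n.trees = 5000,           # B: number of trees
                    interaction.depth = 4,    # d: splits per tree
                    shrinkage = 0.01)         # lambda: learning rate
yhat.boost <- predict(boost.boston, newdata = Boston[-train, ], n.trees = 5000)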
8.3 Lab: Decision Trees: (326/434), trees, CV.tree
- Random Forest: 337/434
- Bagging is a special case of random forest, m=p
- boosting: 339/434
CHAPTER 9 SUPPORT VECTOR MACHINE
- maximal margin classifier: the maximal margin hyperplane is the separating hyperplane farthest from the training observations.
- WHEN P IS LARGE, overfitting is a concern.
- the observations that lie on the margin are called support vectors.
- if a separating hyper-plane exists, maximal margin classifier is a natural way to perform classification.
9.1 non separable case:
support vector classifier: the generalization of the maximal margin classifier to the non-separable case is called the "support vector classifier".
if the maximal margin hyperplane is extremely sensitive to a change in a single observation, that suggests the model overfits the training data.
C is small (few violations allowed, narrow margin): low bias, high variance; C is large (many violations allowed, wide margin): high bias and low variance. (See the e1071 sketch below.)
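A hedged two-class sketch with e1071's svm() (made-up data, mirroring the book's lab). Note that e1071's cost argument is a violation penalty, so a SMALL cost produces the wide margins that correspond to a LARGE budget C in the text:
library(e1071)
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
y <- factor(c(rep(-1, 10), rep(1, 10)))
x[y == 1, ] <- x[y == 1, ] + 1            # shift one class so the classes overlap slightly
dat <- data.frame(x = x, y = y)
svmfit <- svm(y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
svmfit$index                              # indices of the support vectors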
Important properties of support vector machine:
- only observations that lie on the margin or violate it affect the hyperplane; observations strictly on the correct side of the margin do not affect the decision rule at all, however they move.
- observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors; only they affect the support vector classifier.
- LDA is heavily influenced by observations far from the decision boundary (its rule depends on the class means of all observations), while the support vector classifier is not; in this respect, logistic regression is more closely related to the support vector classifier. (359/434)
9.2 svm with more than two classes (364/434)
9.3 logistic regression vs. SVM (365/434)
logistic regression has a loss function similar to the SVM's; for observations far from the decision boundary the loss is not exactly 0, but it is very close to 0.
when the classes are well separated, SVMs tend to perform better than logistic regression;
nuisance parameter: a parameter which is not of immediate interest but which must be accounted for in analysis of parameters that are of interest.
SVM regression
CHAPTER 10 UNSUPERVISED LEARNING (381/434)
10.1 INTRO: PCA
- each of the dimensions (principal components) found by PCA is a linear combination of all p features.
- the first principal component loading vector has a very special property: it defines the line in p-dimensional space that is closest to the n observations.
- before doing PCA, variables should be scaled (standardized); but if all variables are measured in the same units, scaling may be skipped. (388/434)
- proportion of variance explained (PVE)
- scree plot
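A short PCA sketch on base R's USArrests data (as in the book's lab), showing the PVE and a scree plot:
pr.out <- prcomp(USArrests, scale = TRUE)  # scale each variable before PCA
pr.out$rotation                            # loading vectors
pr.var <- pr.out$sdev^2                    # variance explained by each component
pve <- pr.var / sum(pr.var)                # proportion of variance explained
plot(pve, type = "b",
     xlab = "Principal Component", ylab = "PVE")   # scree plot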