Material adapted from the UC Business Analytics R Programming Guide.
The USArrests data set is built into R; its first ten rows are shown below.
| | Murder | Assault | UrbanPop | Rape |
|---|---|---|---|---|
| Alabama | 13.2 | 236 | 58 | 21.2 |
| Alaska | 10.0 | 263 | 48 | 44.5 |
| Arizona | 8.1 | 294 | 80 | 31.0 |
| Arkansas | 8.8 | 190 | 50 | 19.5 |
| California | 9.0 | 276 | 91 | 40.6 |
| Colorado | 7.9 | 204 | 78 | 38.7 |
| Connecticut | 3.3 | 110 | 77 | 11.1 |
| Delaware | 5.9 | 238 | 72 | 15.8 |
| Florida | 15.4 | 335 | 80 | 31.9 |
| Georgia | 17.4 | 211 | 60 | 25.8 |
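Before running PCA it is worth checking the variance of each variable. The original chunk is not shown, but a minimal sketch of how the variance table below could be produced is:
# Variance of each column of USArrests
apply(USArrests, 2, var)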
| | Variance |
|---|---|
| Murder | 18.97047 |
| Assault | 6945.16571 |
| UrbanPop | 209.51878 |
| Rape | 87.72916 |
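Assault has a variance several orders of magnitude larger than the other variables, so it would dominate an unscaled PCA. We therefore standardize each variable to mean 0 and variance 1. The original chunk is not shown, but a sketch of how the scaled_df object used below could be created is:
# Standardize each column (mean 0, standard deviation 1)
scaled_df <- apply(USArrests, 2, scale)
head(scaled_df)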
| Murder | Assault | UrbanPop | Rape |
|---|---|---|---|
| 1.2425641 | 0.7828393 | -0.5209066 | -0.0034165 |
| 0.5078625 | 1.1068225 | -1.2117642 | 2.4842029 |
| 0.0716334 | 1.4788032 | 0.9989801 | 1.0428784 |
| 0.2323494 | 0.2308680 | -1.0735927 | -0.1849166 |
| 0.2782682 | 1.2628144 | 1.7589234 | 2.0678203 |
| 0.0257146 | 0.3988593 | 0.8608085 | 1.8649672 |
Next we compute the covariance matrix of the scaled data. The idea behind PCA is that not all dimensions of the data are equally interesting: each principal component is a linear combination of the original variables, and the first principal component is the combination that captures the largest variance.
The eigenvalues of the covariance matrix give the amount of variability explained by each component, and the eigenvectors (the sets of loadings) define the components themselves.
# Calculate eigenvalues and eigenvectors
arrests.cov <- cov(scaled_df)
arrests.eigen <- eigen(arrests.cov)
str(arrests.eigen)
## List of 2
## $ values : num [1:4] 2.48 0.99 0.357 0.173
## $ vectors: num [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
## - attr(*, "class")= chr "eigen"
What are the first two sets of loadings (principal components PC1 and PC2)? We can extract them from the eigenvectors; we will have two vectors. To access the vectors in arrests.eigen we use $:
# Extract the loadings
phi <- arrests.eigen$vectors[,1:2]
print(phi)
## [,1] [,2]
## [1,] -0.5358995 0.4181809
## [2,] -0.5831836 0.1879856
## [3,] -0.2781909 -0.8728062
## [4,] -0.5434321 -0.1673186
The sign of an eigenvector is arbitrary, and by default the eigenvectors returned here point in the negative direction. A positive direction leads to a more intuitive interpretation of the graphical results, so we multiply the default loadings by -1. The sets of loadings for the first principal component (PC1) and the second principal component (PC2) are shown below:
phi <- -1*phi
row.names(phi) <- c("Murder", "Assault", "UrbanPop", "Rape")
colnames(phi) <- c("PC1", "PC2")
kbl(phi) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | PC1 | PC2 |
|---|---|---|
| Murder | 0.5358995 | -0.4181809 |
| Assault | 0.5831836 | -0.1879856 |
| UrbanPop | 0.2781909 | 0.8728062 |
| Rape | 0.5434321 | 0.1673186 |
Each principal component vector defines a direction in feature space. Because the eigenvectors are mutually orthogonal, the principal components are uncorrelated with one another.
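As a quick illustrative check (not part of the original code), the two loading vectors are indeed orthonormal:
# The cross-product of the loading matrix with itself is the 2 x 2 identity
round(t(phi) %*% phi, 10)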
From the table, we can infer that PC1 corresponds roughly to an overall rate of serious crime, since Murder, Assault, and Rape have the largest loadings. PC2 is dominated by UrbanPop (the level of urbanization of the state) more than by the other three variables.
We can now calculate the principal component scores for each state.
# Calculate Principal Components scores
PC1 <- as.matrix(scaled_df) %*% phi[,1]
PC2 <- as.matrix(scaled_df) %*% phi[,2]
# Create data frame with Principal Components scores
PC <- data.frame(State = row.names(USArrests), PC1, PC2)
head(PC)
## State PC1 PC2
## 1 Alabama 0.9756604 -1.1220012
## 2 Alaska 1.9305379 -1.0624269
## 3 Arizona 1.7454429 0.7384595
## 4 Arkansas -0.1399989 -1.1085423
## 5 California 2.4986128 1.5274267
## 6 Colorado 1.4993407 0.9776297
We should also compute the proportion of variance explained (PVE), which tells us how many components to keep. A common rule of thumb is that the retained components should together explain at least 80% of the variability; the remaining components can be discarded. Below, the first two components explain 87% of the variability (62% + 25%).
PVE <- arrests.eigen$values / sum(arrests.eigen$values)
round(PVE, 2)
## [1] 0.62 0.25 0.09 0.04
Plotting the PVE and the cumulative PVE against the component number helps us decide how many PCs to keep; these visualizations are called scree plots.
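A minimal sketch of such plots with base graphics (illustrative; this chunk is not part of the original code):
# PVE per component
plot(PVE, type = "b", xlab = "Principal Component",
     ylab = "Proportion of Variance Explained", ylim = c(0, 1))
# Cumulative PVE
plot(cumsum(PVE), type = "b", xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained", ylim = c(0, 1))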
For the actual analysis, we will use the prcomp() function and scale our data by setting the scale argument to TRUE.
pca_result <- prcomp(USArrests, scale = TRUE)
pca_result$rotation <- -pca_result$rotation
pca_result$x <- -pca_result$x # flip to the positive direction; x contains the principal component scores
For visualization we will use a biplot. Notice that PC1 (x-axis) is represented by the three serious-crime vectors; the further a state lies to the right of zero, the more strongly it scores in that direction. PC2 (y-axis) is mainly represented by UrbanPop (positive direction), while the other variables cluster around zero on the vertical axis (look at where the arrows sit vertically).
biplot(pca_result, scale = 0)
Adapted from SDS 293 - Machine Learning Jordan Crouser (2016) and “Introduction to Statistical Learning with Applications in R” by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
Now we will perform LDA on the Smarket data. In R, we can fit an LDA model using the lda() function, which is part of the MASS library.
Note: dplyr and MASS both define a function called select(), so we need a little magic to make them play nicely.
pacman::p_load(MASS,dplyr,ISLR)
select <- dplyr::select
The syntax for the lda() function is identical to that
of lm(), and to that of glm() except for the
absence of the family option. As we did with logistic
regression and KNN, we’ll fit the model using only the observations
before 2005, and then test the model on the data from 2005.
train = Smarket %>%
filter(Year < 2005)
test = Smarket %>%
filter(Year >= 2005)
model_LDA=lda(Direction~Lag1+Lag2,data=train)
print(model_LDA)
## Call:
## lda(Direction ~ Lag1 + Lag2, data = train)
##
## Prior probabilities of groups:
## Down Up
## 0.491984 0.508016
##
## Group means:
## Lag1 Lag2
## Down 0.04279022 0.03389409
## Up -0.03954635 -0.03132544
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.6420190
## Lag2 -0.5135293
The LDA output indicates prior probabilities of \({\hat{\pi}}_1 = 0.492\) and \({\hat{\pi}}_2 = 0.508\); in other words, 49.2% of the training observations correspond to days during which the market went down.
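As an illustrative check (not part of the original lab), these priors are simply the class proportions in the training set:
# Proportion of Down and Up days among the training observations
train %>%
  count(Direction) %>%
  mutate(prop = n / sum(n))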
The function also provides the group means; these are the average of each predictor within each class, and are used by LDA as estimates of \(\mu_k\). These suggest that there is a tendency for the previous 2 days’ returns to be negative on days when the market increases, and a tendency for the previous days’ returns to be positive on days when the market declines.
The coefficients of linear discriminants output provides the linear combination of Lag1 and Lag2 that is used to form the LDA decision rule.
If \(−0.642\times{\tt Lag1}−0.514\times{\tt Lag2}\) is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline.
We can use the plot() function to produce plots of the
linear discriminants, obtained by computing \(−0.642\times{\tt Lag1}−0.514\times{\tt
Lag2}\) for each of the training observations.
plot(model_LDA)
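As an illustrative aside (not part of the original lab), the values shown in this plot are, up to centering of the predictors, just the linear combination above applied to the training data:
# Compute -0.642*Lag1 - 0.514*Lag2 for each training day
# (predict() additionally centers the predictors before applying these coefficients)
ld1_manual <- as.matrix(train[, c("Lag1", "Lag2")]) %*% model_LDA$scaling
head(ld1_manual)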
The predict() function returns a list with three
elements. The first element, class, contains LDA’s
predictions about the movement of the market. The second element,
posterior, is a matrix whose \(k^{th}\) column contains the posterior
probability that the corresponding observation belongs to the \(k^{th}\) class. Finally, x
contains the linear discriminants, described earlier.
predictions_LDA = data.frame(predict(model_LDA, test))
names(predictions_LDA)
## [1] "class" "posterior.Down" "posterior.Up" "LD1"
Let’s check out the confusion matrix to see how this model is doing.
We’ll want to compare the predicted class (which we can
find in the class column of the
predictions_LDA data frame) to the true
class.
predictions_LDA = cbind(test, predictions_LDA)
predictions_LDA %>%
count(class, Direction)
## class Direction n
## 1 Down Down 35
## 2 Down Up 35
## 3 Up Down 76
## 4 Up Up 106
predictions_LDA %>%
summarize(score = mean(class == Direction))
## score
## 1 0.5595238
The LDA predictions are identical to the ones from our logistic model:
# Logistic model, for comparison
model_logistic = glm(Direction~Lag1+Lag2, data=train ,family=binomial)
logistic_probs = data.frame(probs = predict(model_logistic, test, type="response"))
predictions_logistic = logistic_probs %>%
mutate(class = ifelse(probs>.5, "Up", "Down"))
predictions_logistic = cbind(test, predictions_logistic)
predictions_logistic %>%
count(class, Direction)
## class Direction n
## 1 Down Down 35
## 2 Down Up 35
## 3 Up Down 76
## 4 Up Up 106
predictions_logistic %>%
summarize(score = mean(class == Direction))
## score
## 1 0.5595238
We will now fit a QDA model to the Smarket data. QDA is
implemented in R using the qda() function,
which is also part of the MASS library. The syntax is identical to that
of lda().
model_QDA = qda(Direction~Lag1+Lag2, data=train)
model_QDA
## Call:
## qda(Direction ~ Lag1 + Lag2, data = train)
##
## Prior probabilities of groups:
## Down Up
## 0.491984 0.508016
##
## Group means:
## Lag1 Lag2
## Down 0.04279022 0.03389409
## Up -0.03954635 -0.03132544
The output contains the group means. But it does not contain the
coefficients of the linear discriminants, because the QDA classifier
involves a quadratic, rather than a linear, function of the
predictors. The predict() function works in exactly the
same fashion as for LDA.
predictions_QDA = data.frame(predict(model_QDA, test))
predictions_QDA = cbind(test, predictions_QDA)
predictions_QDA %>%
count(class, Direction)
## class Direction n
## 1 Down Down 30
## 2 Down Up 20
## 3 Up Down 81
## 4 Up Up 121
predictions_QDA %>%
summarize(score = mean(class == Direction))
## score
## 1 0.5992063
Interestingly, the QDA predictions are accurate almost 60% of the time, even though the 2005 data was not used to fit the model. This level of accuracy is quite impressive for stock market data, which is known to be quite hard to model accurately.
This suggests that the quadratic form assumed by QDA may capture the true relationship more accurately than the linear forms assumed by LDA and logistic regression. However, we recommend evaluating this method’s performance on a larger test set before betting that this approach will consistently beat the market!
Let’s see how the LDA/QDA approach performs on the
Carseats data set, which is part of the ISLR
library.
Recall: this is a simulated data set containing sales of child car seats at 400 different stores.
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
See if you can build a model that predicts ShelveLoc,
the shelf location (Bad, Good, or
Medium) of the product at each store. Don’t forget to hold
out some of the data for testing!
# Convert ShelveLoc to a factor (a no-op if it is already one)
Carseats$ShelveLoc <- factor(Carseats$ShelveLoc)
pacman::p_load(caret)
# Split into train and test data
t.idx = createDataPartition(Carseats$ShelveLoc,p=0.7,list=FALSE)
cseat_train = Carseats[t.idx,]
cseat_test = Carseats[-t.idx,]
model_LDA=lda(ShelveLoc~Sales+CompPrice+Income+Advertising+Population+Price+Age+Education+Urban+US,data=cseat_train)
print(model_LDA)
## Call:
## lda(ShelveLoc ~ Sales + CompPrice + Income + Advertising + Population +
## Price + Age + Education + Urban + US, data = cseat_train)
##
## Prior probabilities of groups:
## Bad Good Medium
## 0.2411348 0.2127660 0.5460993
##
## Group means:
## Sales CompPrice Income Advertising Population Price Age
## Bad 5.606471 123.8529 74.57353 6.514706 279.1765 115.6029 51.10294
## Good 10.332500 126.1167 70.75000 6.583333 270.8167 116.3000 51.96667
## Medium 7.263247 124.3636 67.27922 6.883117 256.0779 115.2338 55.12338
## Education UrbanYes USYes
## Bad 13.77941 0.7794118 0.6617647
## Good 13.70000 0.6666667 0.6833333
## Medium 13.87662 0.7012987 0.6363636
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sales 1.0047593184 -0.020508094
## CompPrice -0.0900719250 -0.003806911
## Income -0.0151494835 -0.018545360
## Advertising -0.1177273168 0.098632522
## Population -0.0003347859 -0.003323486
## Price 0.0941952405 -0.003812730
## Age 0.0468812991 0.033053262
## Education 0.0287507551 -0.016147239
## UrbanYes -0.2839039507 -0.516285602
## USYes 0.1881758685 -1.052780870
##
## Proportion of trace:
## LD1 LD2
## 0.9859 0.0141
plot(model_LDA)
predictions_LDA = data.frame(predict(model_LDA, cseat_test))
names(predictions_LDA)
## [1] "class" "posterior.Bad" "posterior.Good" "posterior.Medium"
## [5] "x.LD1" "x.LD2"
predictions_LDA = cbind(cseat_test, predictions_LDA)
predictions_LDA %>%
count(class, ShelveLoc)
## class ShelveLoc n
## 1 Bad Bad 23
## 2 Bad Medium 5
## 3 Good Good 21
## 4 Good Medium 1
## 5 Medium Bad 5
## 6 Medium Good 4
## 7 Medium Medium 59
predictions_LDA %>%
summarize(score = mean(class == ShelveLoc))
## score
## 1 0.8728814
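Since the prompt mentions the LDA/QDA approach, a natural follow-up (a sketch, not part of the original solution) is to fit QDA on the same split and compare test accuracy:
# Fit QDA with the same formula and training split, then score on the test set
model_QDA_cs <- qda(ShelveLoc ~ Sales + CompPrice + Income + Advertising +
                      Population + Price + Age + Education + Urban + US,
                    data = cseat_train)
predictions_QDA_cs <- cbind(cseat_test, data.frame(predict(model_QDA_cs, cseat_test)))
predictions_QDA_cs %>%
  summarize(score = mean(class == ShelveLoc))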