Our goal is to explain most of the variability in the data with a smaller number of variables. The PCA output can be used downstream for regression, factor analysis, and other ML tasks, and it also helps us plot multidimensional data (one dimension per variable).

Part 1 provides a detailed step-by-step explanation, and Part 2 uses R's built-in PCA functions.

PCA in R

Part 1. Concepts Explanation

Data

The USArrests data set is built into R.
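The first ten rows shown below can be reproduced with base R (a minimal sketch; the original code chunk is not shown):

data("USArrests")     # built-in data set from the datasets package
head(USArrests, 10)   # first ten states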

            Murder Assault UrbanPop Rape
Alabama       13.2     236       58 21.2
Alaska        10.0     263       48 44.5
Arizona        8.1     294       80 31.0
Arkansas       8.8     190       50 19.5
California     9.0     276       91 40.6
Colorado       7.9     204       78 38.7
Connecticut    3.3     110       77 11.1
Delaware       5.9     238       72 15.8
Florida       15.4     335       80 31.9
Georgia       17.4     211       60 25.8

Normalization

Each variable should be centered at zero for PCA. Notice that the variables are not yet scaled (each is measured on a different scale), so their variances differ widely.
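The per-variable variances shown below can be computed with, for example:

apply(USArrests, 2, var)   # variance of each column, on the original scales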
           Variance
Murder     18.97047
Assault  6945.16571
UrbanPop  209.51878
Rape       87.72916
We need to create a scaled version of the data for PCA. Note: PCA is influenced by the magnitude of each variable, so the results also depend on whether the variables have been scaled.
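A centered and scaled copy of the data can be built with base R's scale(); this is presumably how the scaled_df used in the next section was created:

scaled_df <- scale(USArrests)   # center each column at 0, scale to unit variance
head(scaled_df)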
              Murder   Assault   UrbanPop       Rape
Alabama    1.2425641 0.7828393 -0.5209066 -0.0034165
Alaska     0.5078625 1.1068225 -1.2117642  2.4842029
Arizona    0.0716334 1.4788032  0.9989801  1.0428784
Arkansas   0.2323494 0.2308680 -1.0735927 -0.1849166
California 0.2782682 1.2628144  1.7589234  2.0678203
Colorado   0.0257146 0.3988593  0.8608085  1.8649672

Principal Components

We need to generate the covariance matrix. The idea is that not all data dimensions are equally interesting, and PCA summarizes the variables as linear combinations (components). For example, the first principal component is the linear combination of the variables that captures the largest variance.

Eigenvalues and eigenvectors are calculated from the covariance matrix: each eigenvector is a set of loadings (the direction of a principal component), and the corresponding eigenvalue measures how much of the variability that component explains.

arrests.cov <- cov(scaled_df)
arrests.eigen <- eigen(arrests.cov)
str(arrests.eigen)
## List of 2
##  $ values : num [1:4] 2.48 0.99 0.357 0.173
##  $ vectors: num [1:4, 1:4] 0.536 0.583 0.278 0.543 0.418 ...
##  - attr(*, "class")= chr "eigen"

What are the first two sets of loadings (principal components PC1 and PC2)? We can extract them as the first two columns of the eigenvector matrix. To access the vectors stored in arrests.eigen we use $:

phi <- arrests.eigen$vectors[,1:2]
print(phi)
##           [,1]       [,2]
## [1,] 0.5358995  0.4181809
## [2,] 0.5831836  0.1879856
## [3,] 0.2781909 -0.8728062
## [4,] 0.5434321 -0.1673186

The sign of an eigenvector is arbitrary; eigen() returns one of the two possible orientations. Flipping the sign can make the graphical results easier to interpret, so we multiply the default loadings by -1 (the same flip is applied to the prcomp() output in Part 2). The loadings for the first principal component (PC1) and the second principal component (PC2) are shown below:

library(kableExtra)   # provides kbl() and kable_styling()

phi <- -1 * phi       # flip the sign of the loadings
row.names(phi) <- c("Murder", "Assault", "UrbanPop", "Rape")
colnames(phi) <- c("PC1", "PC2")
kbl(phi) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
                PC1        PC2
Murder   -0.5358995 -0.4181809
Assault  -0.5831836 -0.1879856
UrbanPop -0.2781909  0.8728062
Rape     -0.5434321  0.1673186

Each principal component vector defines a direction in feature space. Because the eigenvectors are mutually orthogonal, the principal components are uncorrelated with one another.
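As a quick check (not part of the original code), the cross-product of the eigenvector matrix is the identity matrix, which confirms that the eigenvectors are orthonormal:

round(crossprod(arrests.eigen$vectors), 10)   # t(V) %*% V = identity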

From the table, we can infer that PC1 corresponds to an overall rate of serious crimes: Murder, Assault, and Rape have the largest loadings in absolute value. PC2 is dominated by UrbanPop (the level of urbanization of the state), which loads on it more heavily than the other three variables.

We can calculate the principal component scores for each state.

PC1 <- as.matrix(scaled_df) %*% phi[,1]
PC2 <- as.matrix(scaled_df) %*% phi[,2]

PC <- data.frame(State = row.names(USArrests), PC1, PC2)
head(PC)
##        State        PC1        PC2
## 1    Alabama -0.9756604 -1.1220012
## 2     Alaska -1.9305379 -1.0624269
## 3    Arizona -1.7454429  0.7384595
## 4   Arkansas  0.1399989 -1.1085423
## 5 California -2.4986128  1.5274267
## 6   Colorado -1.4993407  0.9776297

Plotting

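The original plotting code is not shown; a minimal ggplot2 sketch that plots the states on the first two principal components might look like this:

library(ggplot2)

ggplot(PC, aes(x = PC1, y = PC2, label = State)) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_text(size = 3) +
  labs(title = "First two principal components of USArrests")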

We should also look at the proportion of variance explained (PVE), which gives insight into how many components to keep. A rule of thumb is that the retained components should together explain at least 80% of the variance; the remaining components can be discarded. Below, the first two components explain 87% of the variability (62% + 25%).

PVE <- arrests.eigen$values / sum(arrests.eigen$values)
round(PVE, 2)
## [1] 0.62 0.25 0.09 0.04

Plotting the PVE and the cumulative PVE helps determine how many PCs to choose. These visualizations are called scree plots (a sketch follows below).

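The original scree-plot code is not shown; a minimal ggplot2 sketch of the PVE and cumulative PVE might look like this:

scree_df <- data.frame(PC = 1:4, PVE = PVE, CumPVE = cumsum(PVE))

ggplot(scree_df, aes(x = PC)) +
  geom_line(aes(y = PVE)) +
  geom_point(aes(y = PVE)) +
  geom_line(aes(y = CumPVE), linetype = "dashed") +
  geom_point(aes(y = CumPVE)) +
  labs(y = "Proportion of variance explained",
       title = "Scree plot (solid: PVE, dashed: cumulative PVE)")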

Part 2. PCA Built-In Analysis

For the actual analysis, we will use the prcomp() function and standardize the data by setting its scale argument to TRUE.

pca_result <- prcomp(USArrests, scale = TRUE)
pca_result$rotation <- -pca_result$rotation   # flip the loadings to match the orientation used in Part 1
pca_result$x <- -pca_result$x                 # flip the scores accordingly
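As a cross-check against Part 1, the per-component standard deviations, PVE, and cumulative PVE are available from summary() (an extra step, not in the original code):

summary(pca_result)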

For visualization we will use a biplot. Notice that PC1 (x-axis) is represented by the three serious-crime vectors: the further an arrow extends from zero to the right, the more positive its direction. PC2 (y-axis) is mainly represented by UrbanPop (positive direction), while the other variables are clustered around zero vertically (look at where the arrows sit in vertical space).

biplot(pca_result, scale = 0)

Discriminant Analysis

Adapted from SDS 293 - Machine Learning by Jordan Crouser (2016) and “Introduction to Statistical Learning with Applications in R” by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.

4.6.3 Linear Discriminant Analysis

Now we will perform LDA on the Smarket data. In R, we can fit an LDA model using the lda() function, which is part of the MASS library. Note: dplyr and MASS have a name clash around the word select(), so we need to do a little magic to make them play nicely.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(dplyr)
library(ISLR)
library(readr)            # provides read_csv()
select <- dplyr::select   # make select() refer to dplyr, not MASS
Smarket <- read.csv("Smarket.csv")
Carseats <- read_csv("Carseats.csv")
## Rows: 400 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): ShelveLoc, Urban, US
## dbl (8): Sales, CompPrice, Income, Advertising, Population, Price, Age, Educ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The syntax for the lda() function is identical to that of lm(), and to that of glm() except for the absence of the family option. As we did with logistic regression and KNN, we’ll fit the model using only the observations before 2005, and then test the model on the data from 2005.

train <- Smarket %>%
  filter(Year < 2005) %>%
  mutate(Direction = factor(Direction, levels = c("Down","Up")))

test <- Smarket %>%
  filter(Year >= 2005) %>%
  mutate(Direction = factor(Direction, levels = c("Down","Up")))

model_LDA=lda(Direction~Lag1+Lag2,data=train)
print(model_LDA)
## Call:
## lda(Direction ~ Lag1 + Lag2, data = train)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544
## 
## Coefficients of linear discriminants:
##             LD1
## Lag1 -0.6420190
## Lag2 -0.5135293

The LDA output indicates prior probabilities of \({\hat{\pi}}_1 = 0.492\) and \({\hat{\pi}}_2 = 0.508\); in other words, 49.2% of the training observations correspond to days during which the market went down.
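Equivalently (a quick check, not part of the original lab), the priors are just the class proportions in the training data:

mean(train$Direction == "Down")   # about 0.492
mean(train$Direction == "Up")     # about 0.508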

The function also provides the group means; these are the average of each predictor within each class, and are used by LDA as estimates of \(\mu_k\). These suggest that there is a tendency for the previous 2 days’ returns to be negative on days when the market increases, and a tendency for the previous days’ returns to be positive on days when the market declines.

The coefficients of linear discriminants output provides the linear combination of Lag1 and Lag2 that is used to form the LDA decision rule.

If \(−0.642\times{\tt Lag1}−0.514\times{\tt Lag2}\) is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline.
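As an illustration (not part of the original lab), this linear combination can be computed directly from the coefficients stored in model_LDA$scaling; up to the centering that predict() applies internally, larger values favor an "Up" prediction:

raw_ld <- as.matrix(test[, c("Lag1", "Lag2")]) %*% model_LDA$scaling
head(raw_ld)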

We can use the plot() function to produce plots of the linear discriminants, obtained by computing \(−0.642\times{\tt Lag1}−0.514\times{\tt Lag2}\) for each of the training observations.

plot(model_LDA)

The predict() function returns a list with three elements. The first element, class, contains LDA’s predictions about the movement of the market. The second element, posterior, is a matrix whose \(k^{th}\) column contains the posterior probability that the corresponding observation belongs to the \(k^{th}\) class. Finally, x contains the linear discriminants, described earlier.

predictions_LDA = data.frame(predict(model_LDA, test))
names(predictions_LDA)
## [1] "class"          "posterior.Down" "posterior.Up"   "LD1"

Let’s check out the confusion matrix to see how this model is doing. We’ll want to compare the predicted class (which we can find in the class column of the predictions_LDA data frame) to the true class.

predictions_LDA = cbind(test, predictions_LDA)

predictions_LDA %>%
  count(class, Direction)
##   class Direction   n
## 1  Down      Down  35
## 2  Down        Up  35
## 3    Up      Down  76
## 4    Up        Up 106
predictions_LDA %>%
  summarize(score = mean(class == Direction))
##       score
## 1 0.5595238

The LDA predictions are identical to the ones from our logistic model:

model_logistic = glm(Direction ~ Lag1 + Lag2, data = train, family = binomial)

logistic_probs = data.frame(probs = predict(model_logistic, test, type="response"))

predictions_logistic = logistic_probs %>%
  mutate(class = ifelse(probs>.5, "Up", "Down"))

predictions_logistic = cbind(test, predictions_logistic)

predictions_logistic %>%
  count(class, Direction)
##   class Direction   n
## 1  Down      Down  35
## 2  Down        Up  35
## 3    Up      Down  76
## 4    Up        Up 106
predictions_logistic %>%
  summarize(score = mean(class == Direction))
##       score
## 1 0.5595238

4.6.4 Quadratic Discriminant Analysis

We will now fit a QDA model to the Smarket data. QDA is implemented in R using the qda() function, which is also part of the MASS library. The syntax is identical to that of lda().

model_QDA = qda(Direction~Lag1+Lag2, data=train)
model_QDA
## Call:
## qda(Direction ~ Lag1 + Lag2, data = train)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544

The output contains the group means. But it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors. The predict() function works in exactly the same fashion as for LDA.

predictions_QDA = data.frame(predict(model_QDA, test))

predictions_QDA = cbind(test, predictions_QDA)

predictions_QDA %>%
  count(class, Direction)
##   class Direction   n
## 1  Down      Down  30
## 2  Down        Up  20
## 3    Up      Down  81
## 4    Up        Up 121
predictions_QDA %>%
  summarize(score = mean(class == Direction))
##       score
## 1 0.5992063

Interestingly, the QDA predictions are accurate almost 60% of the time, even though the 2005 data was not used to fit the model. This level of accuracy is quite impressive for stock market data, which is known to be quite hard to model accurately.

This suggests that the quadratic form assumed by QDA may capture the true relationship more accurately than the linear forms assumed by LDA and logistic regression. However, we recommend evaluating this method’s performance on a larger test set before betting that this approach will consistently beat the market!

An Application to Carseats Data

Let’s see how the LDA/QDA approach performs on the Carseats data set, which is part of the ISLR library.

Recall: this is a simulated data set containing sales of child car seats at 400 different stores.

summary(Carseats)
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc              Age       
##  Min.   : 10.0   Min.   : 24.0   Length:400         Min.   :25.00  
##  1st Qu.:139.0   1st Qu.:100.0   Class :character   1st Qu.:39.75  
##  Median :272.0   Median :117.0   Mode  :character   Median :54.50  
##  Mean   :264.8   Mean   :115.8                      Mean   :53.32  
##  3rd Qu.:398.5   3rd Qu.:131.0                      3rd Qu.:66.00  
##  Max.   :509.0   Max.   :191.0                      Max.   :80.00  
##    Education       Urban                US           
##  Min.   :10.0   Length:400         Length:400        
##  1st Qu.:12.0   Class :character   Class :character  
##  Median :14.0   Mode  :character   Mode  :character  
##  Mean   :13.9                                        
##  3rd Qu.:16.0                                        
##  Max.   :18.0

See if you can build a model that predicts ShelveLoc, the shelf location (Bad, Good, or Medium) of the product at each store. Don’t forget to hold out some of the data for testing!

library(MASS)
library(dplyr)
library(ISLR)
select <- dplyr::select

set.seed(42)
n  <- nrow(Carseats)
id <- sample.int(n, size = floor(0.7*n))   # 70/30 train/test split
train <- Carseats[id, ]
test  <- Carseats[-id, ]

# ShelveLoc is read in as character; convert it to a factor with a fixed level order
train$ShelveLoc <- factor(train$ShelveLoc, levels = c("Bad","Medium","Good"))
test$ShelveLoc  <- factor(test$ShelveLoc,  levels = levels(train$ShelveLoc))

# numeric predictors used in the model
num_vars <- c("Sales","CompPrice","Income","Advertising",
              "Population","Price","Age","Education")

model_LDA <- lda(ShelveLoc ~ ., data = train %>% select(ShelveLoc, all_of(num_vars)))
print(model_LDA)
## Call:
## lda(ShelveLoc ~ ., data = train %>% select(ShelveLoc, all_of(num_vars)))
## 
## Prior probabilities of groups:
##       Bad    Medium      Good 
## 0.2607143 0.5178571 0.2214286 
## 
## Group means:
##           Sales CompPrice   Income Advertising Population    Price      Age
## Bad    5.547260  124.9589 69.97260    6.383562   280.8493 116.2329 49.98630
## Medium 7.313310  126.4759 65.97931    6.551724   253.7241 116.8759 53.45517
## Good   9.883387  124.4677 66.32258    7.064516   270.6290 119.2097 52.04839
##        Education
## Bad     14.01370
## Medium  14.11034
## Good    13.67742
## 
## Coefficients of linear discriminants:
##                       LD1          LD2
## Sales        0.9575679117  0.002911597
## CompPrice   -0.0971000761 -0.041590770
## Income      -0.0149270034  0.009502286
## Advertising -0.1038456831 -0.013723644
## Population  -0.0006247425  0.003287830
## Price        0.0941740437  0.018822015
## Age          0.0453198775 -0.035566735
## Education    0.0120090319 -0.076839899
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9917 0.0083
plot(model_LDA)

predictions_LDA <- data.frame(predict(model_LDA, newdata = test %>% select(all_of(num_vars))))
predictions_LDA <- cbind(test, predictions_LDA)

predictions_LDA %>%
  count(class, ShelveLoc)
##    class ShelveLoc  n
## 1    Bad       Bad 19
## 2    Bad    Medium  6
## 3 Medium       Bad  4
## 4 Medium    Medium 65
## 5 Medium      Good  2
## 6   Good    Medium  3
## 7   Good      Good 21
predictions_LDA %>%
  summarize(score = mean(class == ShelveLoc))
##   score
## 1 0.875
model_QDA <- qda(ShelveLoc ~ ., data = train %>% select(ShelveLoc, all_of(num_vars)))
model_QDA
## Call:
## qda(ShelveLoc ~ ., data = train %>% select(ShelveLoc, all_of(num_vars)))
## 
## Prior probabilities of groups:
##       Bad    Medium      Good 
## 0.2607143 0.5178571 0.2214286 
## 
## Group means:
##           Sales CompPrice   Income Advertising Population    Price      Age
## Bad    5.547260  124.9589 69.97260    6.383562   280.8493 116.2329 49.98630
## Medium 7.313310  126.4759 65.97931    6.551724   253.7241 116.8759 53.45517
## Good   9.883387  124.4677 66.32258    7.064516   270.6290 119.2097 52.04839
##        Education
## Bad     14.01370
## Medium  14.11034
## Good    13.67742
predictions_QDA <- data.frame(predict(model_QDA, newdata = test %>% select(all_of(num_vars))))
predictions_QDA <- cbind(test, predictions_QDA)

predictions_QDA %>%
  count(class, ShelveLoc)
##    class ShelveLoc  n
## 1    Bad       Bad 18
## 2    Bad    Medium  7
## 3 Medium       Bad  5
## 4 Medium    Medium 61
## 5 Medium      Good  1
## 6   Good    Medium  6
## 7   Good      Good 22
predictions_QDA %>%
  summarize(score = mean(class == ShelveLoc))
##       score
## 1 0.8416667