PCA

load library needed

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
## ✔ tibble  3.1.6     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

prepare dataset

data("USArrests")
head(USArrests)

calculate Principal component analysis

results <- prcomp(USArrests, scale. = T)

reverse the sign

results$rotation <- -1 * results$rotation

display principal components

results$rotation

##                PC1        PC2        PC3         PC4
## Murder   0.5358995 -0.4181809  0.3412327 -0.64922780
## Assault  0.5831836 -0.1879856  0.2681484  0.74340748
## UrbanPop 0.2781909  0.8728062  0.3780158 -0.13387773
## Rape     0.5434321  0.1673186 -0.8177779 -0.08902432

PC1 has high values, indicates principal component describes the most variation in these variables

Reverse the sign of the scores

results$x <- -1 * results$x

Display the first six scores

head(results$x)

##                   PC1        PC2         PC3          PC4
## Alabama     0.9756604 -1.1220012  0.43980366 -0.154696581
## Alaska      1.9305379 -1.0624269 -2.01950027  0.434175454
## Arizona     1.7454429  0.7384595 -0.05423025  0.826264240
## Arkansas   -0.1399989 -1.1085423 -0.11342217  0.180973554
## California  2.4986128  1.5274267 -0.59254100  0.338559240
## Colorado    1.4993407  0.9776297 -1.08400162 -0.001450164

Next, we create Bi-plot (a plot that projects each of the observations in the dataset onto a scatterplot that uses the first and second principal components as the axes)

biplot(results, scale = 0)

Each of the fifty states is visible in a simple two-dimensional space.
The states that are adjacent to one another on the scatter plot have comparable data patterns for the variables in the original dataset.
Additionally, we can see that some states are more strongly related with certain crimes than others.

Display states with the highest murder rates in the original dataset

head(USArrests[order(-USArrests$Murder),])

calculate total variance explained by each principal component

results$sdev^2 / sum(results$sdev^2)

## [1] 0.62006039 0.24744129 0.08914080 0.04335752

The first principal component is responsible for explaining 62% of the overall variance in the dataset.
The second principal component is responsible for explaining 24.7% of the total variation that was found in the dataset.
The third principal component is responsible for explaining 8.9% of the total variance present in the dataset.
The dataset has a total variance, and the fourth principal component is able to explain 4.3% of that total variance.

calculate total variance explained by each principal component

var_explained = results$sdev^2 / sum(results$sdev^2)

create a screen plot

qplot(c(1:4), var_explained) + 
  geom_line() + 
  xlab("Principal Component") + 
  ylab("Variance Explained") +
  ggtitle("Scree Plot") +
  ylim(0, 1)

PCA

Duque, Don Gerard H. BS-Math

2022-06-20