PCA

Perform and visualize PCA on the mtcars dataset

# Loading our dataset
df <- mtcars
head(df)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# Selecting the numerical data (excluding the categorical variables vs and am)

df <- mtcars[,c(1:7,10,11)]
head(df)

##                    mpg cyl disp  hp drat    wt  qsec gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22    3    1
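Positional indices such as c(1:7,10,11) are easy to get wrong and break silently if the column order ever changes. As a minimal sketch, the same selection can be written by column name; `df_by_name` is an illustrative name that does not appear in the original analysis.

```r
# Drop the two binary variables by name instead of by position.
# `df_by_name` is an illustrative name, not part of the original analysis.
df_by_name <- mtcars[, setdiff(names(mtcars), c("vs", "am"))]

# Confirm that this matches the index-based selection used above.
identical(df_by_name, mtcars[, c(1:7, 10, 11)])
```

Both forms give the same nine columns; the name-based form simply documents which variables are being excluded.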

# We then pass df to prcomp(), setting the arguments center and scale. to TRUE,
# and preview the resulting object with summary()

mtcars.pca <- prcomp(df, center = TRUE, scale. = TRUE)
summary(mtcars.pca)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.3782 1.4429 0.71008 0.51481 0.42797 0.35184 0.32413
## Proportion of Variance 0.6284 0.2313 0.05602 0.02945 0.02035 0.01375 0.01167
## Cumulative Proportion  0.6284 0.8598 0.91581 0.94525 0.96560 0.97936 0.99103
##                           PC8     PC9
## Standard deviation     0.2419 0.14896
## Proportion of Variance 0.0065 0.00247
## Cumulative Proportion  0.9975 1.00000

As a result we obtain 9 principal components, each of which explains a percentage of the total variation in the dataset. PC1 explains 63% of the total variance, which means that nearly two-thirds of the information in the dataset (9 variables) can be captured by that one principal component alone. PC2 explains a further 23%, so the first two components together account for about 86% of the variance; the remaining seven components share the rest.
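The proportions reported by summary() can be reproduced by hand from the standard deviations stored in the PCA object. The sketch below refits the model so that it runs on its own; the names `pca`, `pc_var`, `pve` and `cum_pve` are illustrative.

```r
# Refit the PCA so this snippet is self-contained.
pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale. = TRUE)

# Each component's variance is the square of its standard deviation.
pc_var <- pca$sdev^2

# Proportion of variance explained (PVE) and its running total.
pve <- pc_var / sum(pc_var)
cum_pve <- cumsum(pve)

round(pve[1:2], 4)    # share explained by PC1 and by PC2
round(cum_pve[2], 4)  # share explained by the first two components together
```

Because the variables were scaled to unit variance, the component variances sum to the number of variables (9), which is why dividing by their sum gives a proportion.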

# Calling str() to have a look at your PCA object

str(mtcars.pca)

## List of 5
##  $ sdev    : num [1:9] 2.378 1.443 0.71 0.515 0.428 ...
##  $ rotation: num [1:9, 1:9] -0.393 0.403 0.397 0.367 -0.312 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:9] "mpg" "cyl" "disp" "hp" ...
##   .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:9] 20.09 6.19 230.72 146.69 3.6 ...
##   ..- attr(*, "names")= chr [1:9] "mpg" "cyl" "disp" "hp" ...
##  $ scale   : Named num [1:9] 6.027 1.786 123.939 68.563 0.535 ...
##   ..- attr(*, "names")= chr [1:9] "mpg" "cyl" "disp" "hp" ...
##  $ x       : num [1:32, 1:9] -0.664 -0.637 -2.3 -0.215 1.587 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##   .. ..$ : chr [1:9] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

Here we note that our PCA object contains the following: the center point ($center) and scaling ($scale) of each variable, and the standard deviation ($sdev) of each principal component; the relationship (correlation or anticorrelation, etc.) between the initial variables and the principal components ($rotation); and the values of each sample in terms of the principal components ($x).
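To see how these pieces fit together, the scores in $x can be rebuilt from the other elements: standardize the data with $center and $scale, then multiply by $rotation. This is a minimal sketch that refits the model to stay self-contained; `scaled` and `scores` are illustrative names.

```r
# Refit the PCA so this snippet is self-contained.
df <- mtcars[, c(1:7, 10, 11)]
pca <- prcomp(df, center = TRUE, scale. = TRUE)

# Standardize the data with the centers and scales stored in the object.
scaled <- scale(df, center = pca$center, scale = pca$scale)

# Project the standardized data onto the loadings to obtain the scores.
scores <- scaled %*% pca$rotation

# Compare with the scores stored by prcomp(), ignoring attributes.
all.equal(scores, pca$x, check.attributes = FALSE)
```

The same two steps (standardize, then multiply by the rotation matrix) are what predict() applies when projecting new observations onto existing components.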

# We will now plot our PCA. This provides some very useful insights, e.g.
# which cars are most similar to each other

library(devtools)

## Loading required package: usethis

install_github("vqv/ggbiplot")

## Skipping install of 'ggbiplot' from a github remote, the SHA1 (7325e880) has not changed since last install.
##   Use `force = TRUE` to force installation

# Then loading our ggbiplot library

library(ggbiplot)

## Loading required package: ggplot2

## Loading required package: plyr

## Loading required package: scales

## Loading required package: grid

ggbiplot(mtcars.pca)

From the graph we see that the variables hp, cyl and disp all contribute to PC1, with higher values of those variables moving the samples to the right on the plot.
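The arrows in the biplot are drawn from the loadings, so the same reading can be checked numerically by inspecting the PC1 column of $rotation. The sketch below refits the model to stay self-contained; `pc1_loadings` is an illustrative name. The overall sign of a principal component is arbitrary, so only the relative signs and magnitudes are meaningful.

```r
# Refit the PCA so this snippet is self-contained.
pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale. = TRUE)

# Loadings of each variable on the first principal component, sorted.
pc1_loadings <- sort(pca$rotation[, "PC1"])
round(pc1_loadings, 3)
```

Variables at the same end of the sorted vector move samples in the same direction along PC1, while variables at opposite ends (such as mpg versus cyl) pull in opposite directions.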

To add more detail to the plot, we pass the row names (the car models) as labels.

ggbiplot(mtcars.pca, labels=rownames(mtcars), obs.scale = 1, var.scale = 1)

We now see which cars are similar to one another. The sports cars Maserati Bora, Ferrari Dino and Ford Pantera L all cluster together at the top.

We can also look at the origin of each of the cars by putting them into one of three categories i.e. US, Japanese and European cars.

mtcars.country <- c(rep("Japan", 3), rep("US",4), rep("Europe", 7),rep("US",3), "Europe", rep("Japan", 3), rep("US",4), rep("Europe", 3), "US", rep("Europe", 3))
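Because the origin labels are matched to the cars purely by row position, a quick sanity check is worthwhile before plotting. The sketch below repeats the vector so that it runs on its own; `origin_check` is an illustrative name.

```r
# Origin labels, in the same row order as mtcars (copied from above).
mtcars.country <- c(rep("Japan", 3), rep("US", 4), rep("Europe", 7),
                    rep("US", 3), "Europe", rep("Japan", 3), rep("US", 4),
                    rep("Europe", 3), "US", rep("Europe", 3))

# The labels are positional, so the lengths must agree.
stopifnot(length(mtcars.country) == nrow(mtcars))

# Pair each car with its assigned origin for a visual check.
origin_check <- data.frame(car = rownames(mtcars), origin = mtcars.country)
head(origin_check)

# Number of cars per origin.
table(mtcars.country)
```

Scanning `origin_check` makes it easy to spot a label that has slipped by one row, which a bare vector of rep() calls would hide.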

ggbiplot(mtcars.pca, ellipse = TRUE, labels = rownames(mtcars), groups = mtcars.country, obs.scale = 1, var.scale = 1)

We see that US cars form a cluster on the right. This cluster is characterized by high values for cyl, disp and wt. Japanese cars are characterized by high mpg.

European cars are somewhat in the middle and less tightly clustered than either group.
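The grouping seen in the plot can also be summarized numerically by averaging the PC1 and PC2 scores within each origin group. This is a minimal sketch that refits the model and repeats the origin vector to stay self-contained; `group_means` is an illustrative name.

```r
# Refit the PCA and rebuild the origin labels so this snippet is self-contained.
pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale. = TRUE)
mtcars.country <- c(rep("Japan", 3), rep("US", 4), rep("Europe", 7),
                    rep("US", 3), "Europe", rep("Japan", 3), rep("US", 4),
                    rep("Europe", 3), "US", rep("Europe", 3))

# Mean PC1 and PC2 score for each origin group.
group_means <- aggregate(pca$x[, 1:2],
                         by = list(origin = mtcars.country),
                         FUN = mean)
group_means
```

Group means that sit far apart on PC1 correspond to the separation between US and Japanese cars in the biplot. Since the sign of each component is arbitrary, it is the distance between groups that matters, not which side of zero they fall on.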

We now plot PC3 and PC4.

ggbiplot(mtcars.pca, ellipse = TRUE, choices = c(3, 4), labels = rownames(mtcars), groups = mtcars.country)

It is difficult to derive insights from this plot, mainly because PC3 and PC4 explain very small percentages of the total variation (about 5.6% and 2.9%, respectively). It would be surprising if they were very informative, separated the groups, or revealed apparent patterns.

Having performed PCA on this dataset, if we were to build a classification model to identify the origin of a car (i.e. European, Japanese or US), the variables cyl, disp, wt and mpg would be good candidate predictors, since they are the ones that separate the groups in our PCA.

Ruth Muriithi

1/19/2021