The goal of this assignment is to perform Principle Components Analysis (PCA) using R.
The dataset we will use is called Cereals.csv in the textbook datasets folder.
Before PCA, we find that there is high correlation between potassium/fiber (positive), calories/rating (negative), and sugars/rating (negative).
After performing PCA, we find that only 5 Principle Components are needed to explain over 80% of the variability in the data.
Each of the 5 has a variance of about or greater than 1, as seen in the scree plot.
Using rating as the response variable, we find that using the 5 PCs as the predictor variables makes for a good model.
Using ggbiplot() can help us see which variables are the biggest indicators in each principle component.
library("tidyverse")
library("reshape2")
Let’s first start by reading in the data set and previewing the data:
original_file <- read.csv("/Users/Simbo/Desktop/School/STAT415/DMBA-R-datasets/Cereals.csv")
head(original_file)
## name mfr type calories protein fat sodium fiber carbo
## 1 100%_Bran N C 70 4 1 130 10.0 5.0
## 2 100%_Natural_Bran Q C 120 3 5 15 2.0 8.0
## 3 All-Bran K C 70 4 1 260 9.0 7.0
## 4 All-Bran_with_Extra_Fiber K C 50 4 0 140 14.0 8.0
## 5 Almond_Delight R C 110 2 2 200 1.0 14.0
## 6 Apple_Cinnamon_Cheerios G C 110 2 2 180 1.5 10.5
## sugars potass vitamins shelf weight cups rating
## 1 6 280 25 3 1 0.33 68.40297
## 2 8 135 0 3 1 1.00 33.98368
## 3 5 320 25 3 1 0.33 59.42551
## 4 0 330 25 3 1 0.50 93.70491
## 5 8 NA 25 3 1 0.75 34.38484
## 6 10 70 25 1 1 0.75 29.50954
Note: At the end there are histograms showing the distribution of each numerical variable.
Before performing PCA, let’s look at the correlation matrix and heatmap to see what variables are commonly shared amongst certain cereals. For example, potassium/fiber are highly and positively correlated, while calories/rating and sugars/rating are highly and negatively correlated. The cereals rated best were probably not super sugary and high in calories.
correlation_matrix <- cor(na.omit(original_file[, -c(1:3)]))
round(correlation_matrix, 2)
## calories protein fat sodium fiber carbo sugars potass vitamins shelf
## calories 1.00 0.03 0.51 0.30 -0.30 0.27 0.57 -0.07 0.26 0.09
## protein 0.03 1.00 0.20 0.01 0.51 -0.04 -0.29 0.58 0.05 0.20
## fat 0.51 0.20 1.00 0.00 0.01 -0.28 0.29 0.20 -0.03 0.28
## sodium 0.30 0.01 0.00 1.00 -0.07 0.33 0.04 -0.04 0.33 -0.12
## fiber -0.30 0.51 0.01 -0.07 1.00 -0.38 -0.15 0.91 -0.04 0.31
## carbo 0.27 -0.04 -0.28 0.33 -0.38 1.00 -0.45 -0.37 0.25 -0.19
## sugars 0.57 -0.29 0.29 0.04 -0.15 -0.45 1.00 0.00 0.07 0.06
## potass -0.07 0.58 0.20 -0.04 0.91 -0.37 0.00 1.00 0.00 0.39
## vitamins 0.26 0.05 -0.03 0.33 -0.04 0.25 0.07 0.00 1.00 0.28
## shelf 0.09 0.20 0.28 -0.12 0.31 -0.19 0.06 0.39 0.28 1.00
## weight 0.70 0.23 0.22 0.31 0.25 0.14 0.46 0.42 0.32 0.19
## cups 0.09 -0.24 -0.16 0.12 -0.51 0.36 -0.03 -0.50 0.13 -0.35
## rating -0.69 0.47 -0.41 -0.38 0.60 0.06 -0.76 0.42 -0.21 0.05
## weight cups rating
## calories 0.70 0.09 -0.69
## protein 0.23 -0.24 0.47
## fat 0.22 -0.16 -0.41
## sodium 0.31 0.12 -0.38
## fiber 0.25 -0.51 0.60
## carbo 0.14 0.36 0.06
## sugars 0.46 -0.03 -0.76
## potass 0.42 -0.50 0.42
## vitamins 0.32 0.13 -0.21
## shelf 0.19 -0.35 0.05
## weight 1.00 -0.20 -0.30
## cups -0.20 1.00 -0.22
## rating -0.30 -0.22 1.00
ggplot(data = melt(correlation_matrix),
aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(limit = c(-1, 1), midpoint = 0,
low = "dark blue", high = "dark red", mid = "white")
We can also use the basic plot() function to plot each of the numerical variables against each other with different colors representing different manufacturers.
pairs(original_file[, -c(1:3)], upper.panel = NULL, pch = 16, cex = 0.5)
Next, let’s use the prcomp() function, apart of the stats base package, to analayze the variances of the 13 numerical variables, removing rows with NA as a value.
This function performs principle components analysis, but because the units are different, we need to set scale. = TRUE, to help “normalize” the variables before performing PCA.
Looking at the summary, we see that in order to explain over 80% of the variability in the data we would need 5 principle components. In order to explain 90%,w e would need 7 PCs.
pca <- prcomp(na.omit(original_file[,-c(1:3)]), scale. = T)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9062 1.7743 1.3818 1.00969 0.9947 0.84974 0.81946
## Proportion of Variance 0.2795 0.2422 0.1469 0.07842 0.0761 0.05554 0.05166
## Cumulative Proportion 0.2795 0.5217 0.6685 0.74696 0.8231 0.87861 0.93026
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.64515 0.56192 0.30301 0.25194 0.13897 1.499e-08
## Proportion of Variance 0.03202 0.02429 0.00706 0.00488 0.00149 0.000e+00
## Cumulative Proportion 0.96228 0.98657 0.99363 0.99851 1.00000 1.000e+00
Looking at the scree plot, we see 5 or 7 would make for good “elbows”. Some statisticians would say to keep all PCs with a variance greater than or about 1, so we will keep 5 components.
screeplot(pca, type = "lines", main = "Scree Plot")
Let’s look at the first 5 Principle Components. PC1 measures calories and cups vs. consumer rating, potassium, fiber, and protein. PC2 measures weight, calories, sugars, and fats vs. cups and rating. PC3 is sugars vs. sodium and carbohydrates. PC4 is protein vs. vitamins and shelf #. PC5 is fat, shelf #, and protein vs. sodium and sugars. These will be important in the following steps.
pca$rotation[, 1:5]
## PC1 PC2 PC3 PC4 PC5
## calories 0.29954236 0.3931479 -0.114857453 0.20435870 0.20389885
## protein -0.30735632 0.1653233 -0.277281953 0.30074318 0.31974897
## fat 0.03991542 0.3457243 0.204890102 0.18683311 0.58689327
## sodium 0.18339651 0.1372205 -0.389431009 0.12033726 -0.33836424
## fiber -0.45349036 0.1798119 -0.069766079 0.03917361 -0.25511906
## carbo 0.19244902 -0.1494483 -0.562452458 0.08783547 0.18274252
## sugars 0.22806849 0.3514345 0.355405174 -0.02270716 -0.31487243
## potass -0.40196429 0.3005442 -0.067620183 0.09087843 -0.14836048
## vitamins 0.11598020 0.1729092 -0.387858660 -0.60411064 -0.04928672
## shelf -0.17126336 0.2650503 0.001531036 -0.63887859 0.32910135
## weight 0.05029930 0.4503085 -0.247138314 0.15342874 -0.22128334
## cups 0.29463553 -0.2122479 -0.139999705 0.04748909 0.12081645
## rating -0.43837841 -0.2515389 -0.181842433 0.03831622 0.05758420
Next, let’s create a ggbiplot, with the numbers correspoding to the row number:
As we can see, the lines correspond to the values in the Principle Component. PC1 scores high with calories and cups, while PC2 scores high with weight, calories, sugars, and fats. PC1 scores low with rating, potassium, protein, and fiber, while PC2 scores low with cups and rating. This would be the case for each of the principle components.
ggbiplot(pca, labels = rownames(na.omit(original_file)))
Next using grouping variables, we will look at another style of biplots. We will use vitamins to show that points with high vitamin amounts score low in PC4 and score low in PC3, which was indicated by the large negative value in the the third and fourth component.
ggbiplot(pca,
ellipse =TRUE,
choices = c(3,4),
obs.scale = 1,
var.scale = 1,
groups = as.factor(na.omit(original_file)$vitamins)
)
As expected, we see vitamins has is largely negative. the points (blue) that score high i vitamins also score low in PC3 and PC4. Those cereals that have no vitamins are much higher on both PCs.
These biplots will work with all componenets and you can use any factor varaible to group by. Here is the ggbiplot, when grouping by cereal manufacturer. I don’t know what all the companies are but Q stands for Quaker, which might have less sodium then Kellogg cereals, as sodium pulls negatively for both PCS although Quaker observations remains quite high.
ggbiplot(pca,
ellipse =TRUE,
choices = c(3, 5),
obs.scale = 1,
var.scale = 1,
groups = na.omit(original_file)$mfr
)
Fitting a regression model to predict rating from the first 5 principle components results in significant p-values for 4 of the 5 principle component. PC4 is marginally insiginificant. The adjusted R-squared is 0.96, which is very high. An F-test results in a p-value of < 0.0001 and this model is pretty effective in predicting rating.
fit_rating_by_5PC <- lm(na.omit(original_file)$rating ~ pca$x[, 1:5])
summary(fit_rating_by_5PC)
##
## Call:
## lm(formula = na.omit(original_file)$rating ~ pca$x[, 1:5])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6469 -1.6663 -0.1419 1.6716 5.5948
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.3718 0.3145 134.739 <2e-16 ***
## pca$x[, 1:5]PC1 -6.1521 0.1661 -37.038 <2e-16 ***
## pca$x[, 1:5]PC2 -3.5300 0.1785 -19.782 <2e-16 ***
## pca$x[, 1:5]PC3 -2.5519 0.2291 -11.137 <2e-16 ***
## pca$x[, 1:5]PC4 0.5377 0.3136 1.715 0.0909 .
## pca$x[, 1:5]PC5 0.8081 0.3183 2.539 0.0134 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.705 on 68 degrees of freedom
## Multiple R-squared: 0.9654, Adjusted R-squared: 0.9628
## F-statistic: 379.3 on 5 and 68 DF, p-value: < 2.2e-16
ggplot(mapping = aes(x = na.omit(original_file)$rating, y = predict(fit_rating_by_5PC))) +
geom_point() +
stat_smooth(method = "lm", color = "red", se = FALSE) +
labs(x = "Observed Rating", y = "Predicted Rating",
title = "Predicting Cereal Rating using the first 5 Principle Components")
Here are the distributions of the numerical variables (last 13 columns):
for (i in 1:13){
hist(original_file[, i+3], #Skip first three categorical variables
# create titles/labels
main = paste("Distribution of", colnames(original_file[i+3])),
xlab = colnames(original_file[i+3])
)
}