Objective

The goal of this assignment is to perform Principle Components Analysis (PCA) using R.
The dataset we will use is called Cereals.csv in the textbook datasets folder.

Summary

Principle Components Analysis

Next, let’s use the prcomp() function, apart of the stats base package, to analayze the variances of the 13 numerical variables, removing rows with NA as a value.
This function performs principle components analysis, but because the units are different, we need to set scale. = TRUE, to help “normalize” the variables before performing PCA.
Looking at the summary, we see that in order to explain over 80% of the variability in the data we would need 5 principle components. In order to explain 90%,w e would need 7 PCs.

pca <- prcomp(na.omit(original_file[,-c(1:3)]), scale. = T)
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6     PC7
## Standard deviation     1.9062 1.7743 1.3818 1.00969 0.9947 0.84974 0.81946
## Proportion of Variance 0.2795 0.2422 0.1469 0.07842 0.0761 0.05554 0.05166
## Cumulative Proportion  0.2795 0.5217 0.6685 0.74696 0.8231 0.87861 0.93026
##                            PC8     PC9    PC10    PC11    PC12      PC13
## Standard deviation     0.64515 0.56192 0.30301 0.25194 0.13897 1.499e-08
## Proportion of Variance 0.03202 0.02429 0.00706 0.00488 0.00149 0.000e+00
## Cumulative Proportion  0.96228 0.98657 0.99363 0.99851 1.00000 1.000e+00

Looking at the scree plot, we see 5 or 7 would make for good “elbows”. Some statisticians would say to keep all PCs with a variance greater than or about 1, so we will keep 5 components.

screeplot(pca, type = "lines", main = "Scree Plot")

Let’s look at the first 5 Principle Components. PC1 measures calories and cups vs. consumer rating, potassium, fiber, and protein. PC2 measures weight, calories, sugars, and fats vs. cups and rating. PC3 is sugars vs. sodium and carbohydrates. PC4 is protein vs. vitamins and shelf #. PC5 is fat, shelf #, and protein vs. sodium and sugars. These will be important in the following steps.

pca$rotation[, 1:5]
##                  PC1        PC2          PC3         PC4         PC5
## calories  0.29954236  0.3931479 -0.114857453  0.20435870  0.20389885
## protein  -0.30735632  0.1653233 -0.277281953  0.30074318  0.31974897
## fat       0.03991542  0.3457243  0.204890102  0.18683311  0.58689327
## sodium    0.18339651  0.1372205 -0.389431009  0.12033726 -0.33836424
## fiber    -0.45349036  0.1798119 -0.069766079  0.03917361 -0.25511906
## carbo     0.19244902 -0.1494483 -0.562452458  0.08783547  0.18274252
## sugars    0.22806849  0.3514345  0.355405174 -0.02270716 -0.31487243
## potass   -0.40196429  0.3005442 -0.067620183  0.09087843 -0.14836048
## vitamins  0.11598020  0.1729092 -0.387858660 -0.60411064 -0.04928672
## shelf    -0.17126336  0.2650503  0.001531036 -0.63887859  0.32910135
## weight    0.05029930  0.4503085 -0.247138314  0.15342874 -0.22128334
## cups      0.29463553 -0.2122479 -0.139999705  0.04748909  0.12081645
## rating   -0.43837841 -0.2515389 -0.181842433  0.03831622  0.05758420

GGBiplot

Next, let’s create a ggbiplot, with the numbers correspoding to the row number:
As we can see, the lines correspond to the values in the Principle Component. PC1 scores high with calories and cups, while PC2 scores high with weight, calories, sugars, and fats. PC1 scores low with rating, potassium, protein, and fiber, while PC2 scores low with cups and rating. This would be the case for each of the principle components.

ggbiplot(pca, labels = rownames(na.omit(original_file)))

Next using grouping variables, we will look at another style of biplots. We will use vitamins to show that points with high vitamin amounts score low in PC4 and score low in PC3, which was indicated by the large negative value in the the third and fourth component.

ggbiplot(pca, 
         ellipse =TRUE,
         choices = c(3,4),
         obs.scale = 1,
         var.scale = 1,
         groups = as.factor(na.omit(original_file)$vitamins)
         )

As expected, we see vitamins has is largely negative. the points (blue) that score high i vitamins also score low in PC3 and PC4. Those cereals that have no vitamins are much higher on both PCs.

These biplots will work with all componenets and you can use any factor varaible to group by. Here is the ggbiplot, when grouping by cereal manufacturer. I don’t know what all the companies are but Q stands for Quaker, which might have less sodium then Kellogg cereals, as sodium pulls negatively for both PCS although Quaker observations remains quite high.

ggbiplot(pca, 
         ellipse =TRUE,
         choices = c(3, 5),
         obs.scale = 1,
         var.scale = 1,
         groups = na.omit(original_file)$mfr
         )

Model building

Fitting a regression model to predict rating from the first 5 principle components results in significant p-values for 4 of the 5 principle component. PC4 is marginally insiginificant. The adjusted R-squared is 0.96, which is very high. An F-test results in a p-value of < 0.0001 and this model is pretty effective in predicting rating.

fit_rating_by_5PC <- lm(na.omit(original_file)$rating ~ pca$x[, 1:5])
summary(fit_rating_by_5PC)
## 
## Call:
## lm(formula = na.omit(original_file)$rating ~ pca$x[, 1:5])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6469 -1.6663 -0.1419  1.6716  5.5948 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      42.3718     0.3145 134.739   <2e-16 ***
## pca$x[, 1:5]PC1  -6.1521     0.1661 -37.038   <2e-16 ***
## pca$x[, 1:5]PC2  -3.5300     0.1785 -19.782   <2e-16 ***
## pca$x[, 1:5]PC3  -2.5519     0.2291 -11.137   <2e-16 ***
## pca$x[, 1:5]PC4   0.5377     0.3136   1.715   0.0909 .  
## pca$x[, 1:5]PC5   0.8081     0.3183   2.539   0.0134 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.705 on 68 degrees of freedom
## Multiple R-squared:  0.9654, Adjusted R-squared:  0.9628 
## F-statistic: 379.3 on 5 and 68 DF,  p-value: < 2.2e-16
ggplot(mapping = aes(x = na.omit(original_file)$rating, y = predict(fit_rating_by_5PC))) + 
  geom_point() + 
  stat_smooth(method = "lm", color = "red", se = FALSE) + 
  labs(x = "Observed Rating", y = "Predicted Rating", 
       title = "Predicting Cereal Rating using the first 5 Principle Components")

Distribution of the numerical variables

Here are the distributions of the numerical variables (last 13 columns):

for (i in 1:13){
  hist(original_file[, i+3],  #Skip first three categorical variables
       # create titles/labels
       main = paste("Distribution of", colnames(original_file[i+3])),
       xlab = colnames(original_file[i+3])
       )
}