Principal Components Analysis

Objective
Summary
Looking at trends in the data
Principle Components Analysis
GGBiplot
Model building
Distribution of the numerical variables

Objective

The goal of this assignment is to perform Principle Components Analysis (PCA) using R.
The dataset we will use is called Cereals.csv in the textbook datasets folder.

Summary

Before PCA, we find that there is high correlation between potassium/fiber (positive), calories/rating (negative), and sugars/rating (negative).
After performing PCA, we find that only 5 Principle Components are needed to explain over 80% of the variability in the data.
Each of the 5 has a variance of about or greater than 1, as seen in the scree plot.
Using rating as the response variable, we find that using the 5 PCs as the predictor variables makes for a good model.
- The model passes the F-test with a p-value < 0.0001 and has an R-squared (adjusted) of 0.96.
- Only one of the 5 PCs is marginally significant, but still has an individual t-test p-value of < 0.1.
Using ggbiplot() can help us see which variables are the biggest indicators in each principle component.

Looking at trends in the data

library("tidyverse")
library("reshape2")

Let’s first start by reading in the data set and previewing the data:

original_file <- read.csv("/Users/Simbo/Desktop/School/STAT415/DMBA-R-datasets/Cereals.csv")
head(original_file)

##                        name mfr type calories protein fat sodium fiber carbo
## 1                 100%_Bran   N    C       70       4   1    130  10.0   5.0
## 2         100%_Natural_Bran   Q    C      120       3   5     15   2.0   8.0
## 3                  All-Bran   K    C       70       4   1    260   9.0   7.0
## 4 All-Bran_with_Extra_Fiber   K    C       50       4   0    140  14.0   8.0
## 5            Almond_Delight   R    C      110       2   2    200   1.0  14.0
## 6   Apple_Cinnamon_Cheerios   G    C      110       2   2    180   1.5  10.5
##   sugars potass vitamins shelf weight cups   rating
## 1      6    280       25     3      1 0.33 68.40297
## 2      8    135        0     3      1 1.00 33.98368
## 3      5    320       25     3      1 0.33 59.42551
## 4      0    330       25     3      1 0.50 93.70491
## 5      8     NA       25     3      1 0.75 34.38484
## 6     10     70       25     1      1 0.75 29.50954

Note: At the end there are histograms showing the distribution of each numerical variable.

Before performing PCA, let’s look at the correlation matrix and heatmap to see what variables are commonly shared amongst certain cereals. For example, potassium/fiber are highly and positively correlated, while calories/rating and sugars/rating are highly and negatively correlated. The cereals rated best were probably not super sugary and high in calories.

correlation_matrix <- cor(na.omit(original_file[, -c(1:3)]))

round(correlation_matrix, 2)

##          calories protein   fat sodium fiber carbo sugars potass vitamins shelf
## calories     1.00    0.03  0.51   0.30 -0.30  0.27   0.57  -0.07     0.26  0.09
## protein      0.03    1.00  0.20   0.01  0.51 -0.04  -0.29   0.58     0.05  0.20
## fat          0.51    0.20  1.00   0.00  0.01 -0.28   0.29   0.20    -0.03  0.28
## sodium       0.30    0.01  0.00   1.00 -0.07  0.33   0.04  -0.04     0.33 -0.12
## fiber       -0.30    0.51  0.01  -0.07  1.00 -0.38  -0.15   0.91    -0.04  0.31
## carbo        0.27   -0.04 -0.28   0.33 -0.38  1.00  -0.45  -0.37     0.25 -0.19
## sugars       0.57   -0.29  0.29   0.04 -0.15 -0.45   1.00   0.00     0.07  0.06
## potass      -0.07    0.58  0.20  -0.04  0.91 -0.37   0.00   1.00     0.00  0.39
## vitamins     0.26    0.05 -0.03   0.33 -0.04  0.25   0.07   0.00     1.00  0.28
## shelf        0.09    0.20  0.28  -0.12  0.31 -0.19   0.06   0.39     0.28  1.00
## weight       0.70    0.23  0.22   0.31  0.25  0.14   0.46   0.42     0.32  0.19
## cups         0.09   -0.24 -0.16   0.12 -0.51  0.36  -0.03  -0.50     0.13 -0.35
## rating      -0.69    0.47 -0.41  -0.38  0.60  0.06  -0.76   0.42    -0.21  0.05
##          weight  cups rating
## calories   0.70  0.09  -0.69
## protein    0.23 -0.24   0.47
## fat        0.22 -0.16  -0.41
## sodium     0.31  0.12  -0.38
## fiber      0.25 -0.51   0.60
## carbo      0.14  0.36   0.06
## sugars     0.46 -0.03  -0.76
## potass     0.42 -0.50   0.42
## vitamins   0.32  0.13  -0.21
## shelf      0.19 -0.35   0.05
## weight     1.00 -0.20  -0.30
## cups      -0.20  1.00  -0.22
## rating    -0.30 -0.22   1.00

ggplot(data = melt(correlation_matrix), 
       aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile(color = "white") + 
  scale_fill_gradient2(limit = c(-1, 1), midpoint = 0, 
                       low = "dark blue", high = "dark red", mid = "white")

We can also use the basic plot() function to plot each of the numerical variables against each other with different colors representing different manufacturers.

pairs(original_file[, -c(1:3)], upper.panel = NULL, pch = 16, cex = 0.5)

Principle Components Analysis

Next, let’s use the prcomp() function, apart of the stats base package, to analayze the variances of the 13 numerical variables, removing rows with NA as a value.
This function performs principle components analysis, but because the units are different, we need to set scale. = TRUE, to help “normalize” the variables before performing PCA.
Looking at the summary, we see that in order to explain over 80% of the variability in the data we would need 5 principle components. In order to explain 90%,w e would need 7 PCs.

pca <- prcomp(na.omit(original_file[,-c(1:3)]), scale. = T)
summary(pca)

## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6     PC7
## Standard deviation     1.9062 1.7743 1.3818 1.00969 0.9947 0.84974 0.81946
## Proportion of Variance 0.2795 0.2422 0.1469 0.07842 0.0761 0.05554 0.05166
## Cumulative Proportion  0.2795 0.5217 0.6685 0.74696 0.8231 0.87861 0.93026
##                            PC8     PC9    PC10    PC11    PC12      PC13
## Standard deviation     0.64515 0.56192 0.30301 0.25194 0.13897 1.499e-08
## Proportion of Variance 0.03202 0.02429 0.00706 0.00488 0.00149 0.000e+00
## Cumulative Proportion  0.96228 0.98657 0.99363 0.99851 1.00000 1.000e+00

Looking at the scree plot, we see 5 or 7 would make for good “elbows”. Some statisticians would say to keep all PCs with a variance greater than or about 1, so we will keep 5 components.

screeplot(pca, type = "lines", main = "Scree Plot")

Let’s look at the first 5 Principle Components. PC1 measures calories and cups vs. consumer rating, potassium, fiber, and protein. PC2 measures weight, calories, sugars, and fats vs. cups and rating. PC3 is sugars vs. sodium and carbohydrates. PC4 is protein vs. vitamins and shelf #. PC5 is fat, shelf #, and protein vs. sodium and sugars. These will be important in the following steps.

pca$rotation[, 1:5]

##                  PC1        PC2          PC3         PC4         PC5
## calories  0.29954236  0.3931479 -0.114857453  0.20435870  0.20389885
## protein  -0.30735632  0.1653233 -0.277281953  0.30074318  0.31974897
## fat       0.03991542  0.3457243  0.204890102  0.18683311  0.58689327
## sodium    0.18339651  0.1372205 -0.389431009  0.12033726 -0.33836424
## fiber    -0.45349036  0.1798119 -0.069766079  0.03917361 -0.25511906
## carbo     0.19244902 -0.1494483 -0.562452458  0.08783547  0.18274252
## sugars    0.22806849  0.3514345  0.355405174 -0.02270716 -0.31487243
## potass   -0.40196429  0.3005442 -0.067620183  0.09087843 -0.14836048
## vitamins  0.11598020  0.1729092 -0.387858660 -0.60411064 -0.04928672
## shelf    -0.17126336  0.2650503  0.001531036 -0.63887859  0.32910135
## weight    0.05029930  0.4503085 -0.247138314  0.15342874 -0.22128334
## cups      0.29463553 -0.2122479 -0.139999705  0.04748909  0.12081645
## rating   -0.43837841 -0.2515389 -0.181842433  0.03831622  0.05758420

GGBiplot

Next, let’s create a ggbiplot, with the numbers correspoding to the row number:
As we can see, the lines correspond to the values in the Principle Component. PC1 scores high with calories and cups, while PC2 scores high with weight, calories, sugars, and fats. PC1 scores low with rating, potassium, protein, and fiber, while PC2 scores low with cups and rating. This would be the case for each of the principle components.

ggbiplot(pca, labels = rownames(na.omit(original_file)))

Next using grouping variables, we will look at another style of biplots. We will use vitamins to show that points with high vitamin amounts score low in PC4 and score low in PC3, which was indicated by the large negative value in the the third and fourth component.

ggbiplot(pca, 
         ellipse =TRUE,
         choices = c(3,4),
         obs.scale = 1,
         var.scale = 1,
         groups = as.factor(na.omit(original_file)$vitamins)
         )

As expected, we see vitamins has is largely negative. the points (blue) that score high i vitamins also score low in PC3 and PC4. Those cereals that have no vitamins are much higher on both PCs.

These biplots will work with all componenets and you can use any factor varaible to group by. Here is the ggbiplot, when grouping by cereal manufacturer. I don’t know what all the companies are but Q stands for Quaker, which might have less sodium then Kellogg cereals, as sodium pulls negatively for both PCS although Quaker observations remains quite high.

ggbiplot(pca, 
         ellipse =TRUE,
         choices = c(3, 5),
         obs.scale = 1,
         var.scale = 1,
         groups = na.omit(original_file)$mfr
         )

Model building

Fitting a regression model to predict rating from the first 5 principle components results in significant p-values for 4 of the 5 principle component. PC4 is marginally insiginificant. The adjusted R-squared is 0.96, which is very high. An F-test results in a p-value of < 0.0001 and this model is pretty effective in predicting rating.

fit_rating_by_5PC <- lm(na.omit(original_file)$rating ~ pca$x[, 1:5])
summary(fit_rating_by_5PC)

## 
## Call:
## lm(formula = na.omit(original_file)$rating ~ pca$x[, 1:5])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6469 -1.6663 -0.1419  1.6716  5.5948 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      42.3718     0.3145 134.739   <2e-16 ***
## pca$x[, 1:5]PC1  -6.1521     0.1661 -37.038   <2e-16 ***
## pca$x[, 1:5]PC2  -3.5300     0.1785 -19.782   <2e-16 ***
## pca$x[, 1:5]PC3  -2.5519     0.2291 -11.137   <2e-16 ***
## pca$x[, 1:5]PC4   0.5377     0.3136   1.715   0.0909 .  
## pca$x[, 1:5]PC5   0.8081     0.3183   2.539   0.0134 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.705 on 68 degrees of freedom
## Multiple R-squared:  0.9654, Adjusted R-squared:  0.9628 
## F-statistic: 379.3 on 5 and 68 DF,  p-value: < 2.2e-16

ggplot(mapping = aes(x = na.omit(original_file)$rating, y = predict(fit_rating_by_5PC))) + 
  geom_point() + 
  stat_smooth(method = "lm", color = "red", se = FALSE) + 
  labs(x = "Observed Rating", y = "Predicted Rating", 
       title = "Predicting Cereal Rating using the first 5 Principle Components")

Distribution of the numerical variables

Here are the distributions of the numerical variables (last 13 columns):

for (i in 1:13){
  hist(original_file[, i+3],  #Skip first three categorical variables
       # create titles/labels
       main = paste("Distribution of", colnames(original_file[i+3])),
       xlab = colnames(original_file[i+3])
       )
}