library(dplyr) # for %>% and select()
Cereals <- Cereals %>% na.omit() # drop rows with missing values; we will use this reduced data frame later
Cereals2 <- Cereals
Cereals2$rating2 <- Cereals$rating > 50 # indicator for ratings above 50
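As a quick sanity check on this split, we can tabulate the new indicator (a minimal sketch using the objects above; the exact counts depend on which rows na.omit() kept):
table(Cereals2$rating2) # FALSE/TRUE counts of cereals rated at-or-below vs. above 50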
PCA <- Cereals %>%
  select(-c(name:type)) %>% # drop the non-numeric columns
  prcomp() # unscaled PCA, so high-variance variables dominate the first PCs
summary(PCA)
summary(PCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 83.7641 70.9143 22.64375 19.18148 8.42323 2.09167
## Proportion of Variance 0.5395 0.3867 0.03943 0.02829 0.00546 0.00034
## Cumulative Proportion 0.5395 0.9262 0.96560 0.99389 0.99935 0.99968
## PC7 PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 1.69942 0.77963 0.65783 0.37043 0.1864 0.06302 5.334e-08
## Proportion of Variance 0.00022 0.00005 0.00003 0.00001 0.0000 0.00000 0.000e+00
## Cumulative Proportion 0.99991 0.99995 0.99999 1.00000 1.0000 1.00000 1.000e+00
Note that 96.56% of the variability in our data is explained by PC1-PC3.
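A scree plot makes this drop-off visible at a glance (a minimal sketch using base R's screeplot() on the PCA object above):
screeplot(PCA, type = "lines", main = "Variance explained by each PC")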
library(pcaMethods) # Bioconductor package providing pca() and slplot()
PCAmethods <- pca(Cereals[, -c(1:3, 16)], scale = "uv", center = TRUE, nPcs = 2, method = "svd")
slplot(PCAmethods, scoresLoadings = c(TRUE, TRUE), scol = factor(Cereals2$rating2))
Note that the red circles are cereals rated above 50. From the plot, calories account for much of the variance among the lower-rated cereals, while higher fiber tends to go with higher ratings.
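To check this reading numerically rather than eyeballing the biplot, we can print the loadings themselves (a sketch; pcaMethods provides a loadings() accessor for the fitted object):
round(loadings(PCAmethods), 3) # inspect the signs and sizes of calories and fiber on PC1 and PC2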
# fit a linear model using the first 3 PCs
CerealsRat <- Cereals %>%
  select("rating")
modCereals <- lm(unlist(CerealsRat) ~ PCA$x[, 1:3])
summary(modCereals)
##
## Call:
## lm(formula = unlist(CerealsRat) ~ PCA$x[, 1:3])
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0281 -7.3987 0.1549 6.3447 20.4457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.37179 1.07929 39.259 < 2e-16 ***
## PCA$x[, 1:3]PC1 -0.07530 0.01297 -5.804 1.73e-07 ***
## PCA$x[, 1:3]PC2 -0.07174 0.01532 -4.682 1.35e-05 ***
## PCA$x[, 1:3]PC3 0.30795 0.04799 6.417 1.42e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.284 on 70 degrees of freedom
## Multiple R-squared: 0.5803, Adjusted R-squared: 0.5623
## F-statistic: 32.26 on 3 and 70 DF, p-value: 3.298e-13
Using the first 3 PCs to build a model for predicting rating, we find that it explains roughly 56% of the variation (adjusted R-squared = 0.5623), which is quite good given that we use only 3 components in place of the 12 numerical predictors (out of 16 columns total).
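An equivalent, arguably clearer way to express the same fit is to put the PC scores in a data frame first (a sketch; this should reproduce modCereals exactly):
pcScores <- as.data.frame(PCA$x[, 1:3]) # columns PC1, PC2, PC3
modCereals2 <- lm(Cereals$rating ~ PC1 + PC2 + PC3, data = pcScores)
summary(modCereals2)$adj.r.squared # ~0.5623, matching the summary above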
CerealsFull <- lm(rating ~ ., data = Cereals[, -c(1:3)])
summary(CerealsFull)
##
## Call:
## lm(formula = rating ~ ., data = Cereals[, -c(1:3)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.352e-07 -2.482e-07 4.551e-08 2.376e-07 5.430e-07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.493e+01 3.848e-07 1.428e+08 <2e-16 ***
## calories -2.227e-01 7.958e-09 -2.799e+07 <2e-16 ***
## protein 3.273e+00 5.724e-08 5.718e+07 <2e-16 ***
## fat -1.691e+00 8.279e-08 -2.043e+07 <2e-16 ***
## sodium -5.449e-02 5.101e-10 -1.068e+08 <2e-16 ***
## fiber 3.443e+00 4.906e-08 7.019e+07 <2e-16 ***
## carbo 1.092e+00 3.752e-08 2.912e+07 <2e-16 ***
## sugars -7.249e-01 3.550e-08 -2.042e+07 <2e-16 ***
## potass -3.399e-02 1.729e-09 -1.966e+07 <2e-16 ***
## vitamins -5.121e-02 2.026e-09 -2.528e+07 <2e-16 ***
## shelf -2.733e-08 5.587e-08 -4.890e-01 0.627
## weight -5.057e-07 5.731e-07 -8.820e-01 0.381
## cups 1.130e-07 1.998e-07 5.650e-01 0.574
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.099e-07 on 61 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.248e+16 on 12 and 61 DF, p-value: < 2.2e-16
This is the full model with 12 predictors and 1 response variable (rating). It explains essentially all of the variability (R-squared is numerically 1, and the residuals are on the order of 1e-7, which suggests rating is an almost exact linear function of the nutrition variables), though it uses many more dimensions than the PCA model.
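A one-line check of that claim, using the fitted object above:
max(abs(resid(CerealsFull))) # on the order of 5e-07, i.e. numerically zero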
# Compare PCA vs. full-model predictions
par(mfrow = c(1, 2))
plot(Cereals$rating, predict(modCereals), xlab = "Observed rating",
     ylab = "Predicted rating", main = "PCA")
abline(a = 0, b = 1, col = "red") # 45-degree reference line
plot(Cereals$rating, predict(CerealsFull), xlab = "Observed rating",
     ylab = "Predicted rating", main = "Full model")
abline(a = 0, b = 1, col = "red")
Side-by-side comparison of the two models. The PCA model's predictions are less precise, but it does quite well considering it uses only 3 components, i.e. 25% of the 12 explanatory variables available.
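To put a number on the visual comparison, we can compute in-sample RMSE for both models (a rough sketch; these are training errors, not cross-validated):
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(Cereals$rating, predict(modCereals)) # PCA model: a moderate error
rmse(Cereals$rating, predict(CerealsFull)) # full model: essentially zero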