library(dplyr) # for %>% and select()
Cereals <- Cereals %>% na.omit() # drop rows with missing values; we will use this reduced data frame later
Cereals2 <- Cereals
Cereals2$rating2 <- Cereals$rating > 50 # indicator for ratings above 50
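As a quick sanity check on this split, we can tabulate the new indicator (a minimal sketch using the objects above; the exact counts depend on which rows na.omit() kept):
table(Cereals2$rating2) # FALSE/TRUE counts of cereals rated at-or-below vs. above 50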
PCA <- Cereals %>%
  select(-c(name:type)) %>% # drop the non-numeric columns
  prcomp() # unscaled PCA, so high-variance variables dominate the first PCs
summary(PCA)
summary(PCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 83.7641 70.9143 22.64375 19.18148 8.42323 2.09167
## Proportion of Variance 0.5395 0.3867 0.03943 0.02829 0.00546 0.00034
## Cumulative Proportion 0.5395 0.9262 0.96560 0.99389 0.99935 0.99968
## PC7 PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 1.69942 0.77963 0.65783 0.37043 0.1864 0.06302 5.334e-08
## Proportion of Variance 0.00022 0.00005 0.00003 0.00001 0.0000 0.00000 0.000e+00
## Cumulative Proportion 0.99991 0.99995 0.99999 1.00000 1.0000 1.00000 1.000e+00
Note that 96.56% of the variability in our data is explained by PC1-PC3.
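A scree plot makes this drop-off visible at a glance (a minimal sketch using base R's screeplot() on the PCA object above):
screeplot(PCA, type = "lines", main = "Variance explained by each PC")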
library(pcaMethods) # Bioconductor package providing pca() and slplot()
PCAmethods <- pca(Cereals[, -c(1:3, 16)], scale = "uv", center = TRUE, nPcs = 2, method = "svd")
slplot(PCAmethods, scoresLoadings = c(TRUE, TRUE), scol = factor(Cereals2$rating2))
Note that the red circles are cereals rated above 50. From the plot, calories account for much of the variance among the lower-rated cereals, while higher fiber tends to go with higher ratings.
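To check this reading numerically rather than eyeballing the biplot, we can print the loadings themselves (a sketch; pcaMethods provides a loadings() accessor for the fitted object):
round(loadings(PCAmethods), 3) # inspect the signs and sizes of calories and fiber on PC1 and PC2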
# fit a linear model using the first 3 PCs
CerealsRat <- Cereals %>%
  select("rating")
modCereals <- lm(unlist(CerealsRat) ~ PCA$x[, 1:3])
summary(modCereals)
##
## Call:
## lm(formula = unlist(CerealsRat) ~ PCA$x[, 1:3])
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0281 -7.3987 0.1549 6.3447 20.4457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.37179 1.07929 39.259 < 2e-16 ***
## PCA$x[, 1:3]PC1 -0.07530 0.01297 -5.804 1.73e-07 ***
## PCA$x[, 1:3]PC2 -0.07174 0.01532 -4.682 1.35e-05 ***
## PCA$x[, 1:3]PC3 0.30795 0.04799 6.417 1.42e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.284 on 70 degrees of freedom
## Multiple R-squared: 0.5803, Adjusted R-squared: 0.5623
## F-statistic: 32.26 on 3 and 70 DF, p-value: 3.298e-13
Using the first 3 PCs to build a model for predicting rating, we find that it explains roughly 56% of the variation (adjusted R-squared = 0.5623), which is quite good given that we use only 3 components in place of the 12 numerical predictors (out of 16 columns total).
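An equivalent, arguably clearer way to express the same fit is to put the PC scores in a data frame first (a sketch; this should reproduce modCereals exactly):
pcScores <- as.data.frame(PCA$x[, 1:3]) # columns PC1, PC2, PC3
modCereals2 <- lm(Cereals$rating ~ PC1 + PC2 + PC3, data = pcScores)
summary(modCereals2)$adj.r.squared # ~0.5623, matching the summary above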
CerealsFull <- lm(rating ~ ., data = Cereals[, -c(1:3)])
summary(CerealsFull)
##
## Call:
## lm(formula = rating ~ ., data = Cereals[, -c(1:3)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.352e-07 -2.482e-07 4.551e-08 2.376e-07 5.430e-07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.493e+01 3.848e-07 1.428e+08 <2e-16 ***
## calories -2.227e-01 7.958e-09 -2.799e+07 <2e-16 ***
## protein 3.273e+00 5.724e-08 5.718e+07 <2e-16 ***
## fat -1.691e+00 8.279e-08 -2.043e+07 <2e-16 ***
## sodium -5.449e-02 5.101e-10 -1.068e+08 <2e-16 ***
## fiber 3.443e+00 4.906e-08 7.019e+07 <2e-16 ***
## carbo 1.092e+00 3.752e-08 2.912e+07 <2e-16 ***
## sugars -7.249e-01 3.550e-08 -2.042e+07 <2e-16 ***
## potass -3.399e-02 1.729e-09 -1.966e+07 <2e-16 ***
## vitamins -5.121e-02 2.026e-09 -2.528e+07 <2e-16 ***
## shelf -2.733e-08 5.587e-08 -4.890e-01 0.627
## weight -5.057e-07 5.731e-07 -8.820e-01 0.381
## cups 1.130e-07 1.998e-07 5.650e-01 0.574
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.099e-07 on 61 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.248e+16 on 12 and 61 DF, p-value: < 2.2e-16
This is the full model with 12 predictors and 1 response variable (rating). It explains essentially all of the variability (R-squared is numerically 1, and the residuals are on the order of 1e-7, which suggests rating is an almost exact linear function of the nutrition variables), though it uses many more dimensions than the PCA model.
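A one-line check of that claim, using the fitted object above:
max(abs(resid(CerealsFull))) # on the order of 5e-07, i.e. numerically zero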
# Compare PCA vs. full-model predictions
par(mfrow = c(1, 2))
plot(Cereals$rating, predict(modCereals), xlab = "Observed rating",
     ylab = "Predicted rating", main = "PCA")
abline(a = 0, b = 1, col = "red") # 45-degree reference line
plot(Cereals$rating, predict(CerealsFull), xlab = "Observed rating",
     ylab = "Predicted rating", main = "Full model")
abline(a = 0, b = 1, col = "red")
Side-by-side comparison of the two models. The PCA model's predictions are less precise, but it does quite well considering it uses only 3 components, i.e. 25% of the 12 explanatory variables available.
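To put a number on the visual comparison, we can compute in-sample RMSE for both models (a rough sketch; these are training errors, not cross-validated):
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(Cereals$rating, predict(modCereals)) # PCA model: a moderate error
rmse(Cereals$rating, predict(CerealsFull)) # full model: essentially zero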