title: “PCA - Cereals Analysis” author: “Vivek Teja” date: “6/19/2020” output: html_document — Objective:we will first clean a dataset and create some simple visualizations. Then, we will conduct a principal component analysis to reduce dimensionality. You must use R (NOT Tableau) to answer the questions. The data were collected on the nutritional information and consumer rating of 77 breakfast cereals. For each cereal we have included 13 numerical variables and 2 categorical variables (i.e. mfr and type). Variable Name Description Name: Name of cereal mfr: Manufacturer of cereal. {A=American Home Food Products, G=General Mills, K=Kelloggs, N=Nabisco, P=Post, Q=Qauker Oats, R=Ralston Purina} type: {cold, hot} calories: Calories per serving protein: Grams of protein fat: Grams of fat sodium: Milligrams of sodium fiber: Grams of dietary fiber carbo: Grams of complex carbohydrates sugars: Grams of sugars potassL Milligrams of potassium vitaminsL Vitamins and minerals-0, 25, or 100, indicating the typical percentage of FDA recommended shelf Display shelf: 1, 2, or 3, counting from the floor weight: Weight in ounces of one serving cups: Number of cups in one serving rating: A rating of the cereals calculated by Consumer Reports


Loading the Cereals Data Set

setwd("C:/Users/vamsi/Downloads")
library(readr)
library(ggplot2)
Cereals <- read_csv("Cereals.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   mfr = col_character(),
##   type = col_character(),
##   calories = col_double(),
##   protein = col_double(),
##   fat = col_double(),
##   sodium = col_double(),
##   fiber = col_double(),
##   carbo = col_double(),
##   sugars = col_double(),
##   potass = col_double(),
##   vitamins = col_double(),
##   shelf = col_double(),
##   weight = col_double(),
##   cups = col_double(),
##   rating = col_double()
## )
View(Cereals)
  1. Regarding two categorical variables (mfr and type), use a single proper visualization to show the distributions of both variables
ggplot(Cereals, aes(type, ..count..)) + geom_bar(aes(fill = mfr), position = "dodge")

  1. Calculate the average consumer rating for each combination of manufacturer (mfr) and type. Obtain a data frame (table) that follows the structure below. Use “0” to represent missing value if there is no any cereal for a certain combination of manufacturer and type. Take a screenshot of the data frame.
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
library(tidyr)
library(reshape)
## Warning: package 'reshape' was built under R version 3.6.3
## 
## Attaching package: 'reshape'
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths
q2<- sqldf("select type,mfr,avg(rating) as avg from Cereals group by type,mfr")

a<- sqldf("select distinct type from Cereals")

b<- sqldf("select distinct mfr from Cereals")

c<-crossing(a, b)

q2_<-sqldf("select b.type, b.mfr, case when avg is null then 0 else avg end as avg from (select c.*, a.avg from c left join q2 a on c.type=a.type and c.mfr =a.mfr)b ")

v<-data.frame(cast(q2_, type ~ mfr))
## Using avg as value column.  Use the value argument to cast to override this choice
v
##   type        A        G        K        N        P        Q      R
## 1    C  0.00000 34.48585 44.03846 68.65552 41.70574 41.78565 41.543
## 2    H 54.85092  0.00000  0.00000 64.53382  0.00000 50.82839  0.000
  1. Use a proper visualization to compare the distribution of calories by hot vs. cold cereals. What does this plot show us regarding mean, variance, and skewness?
t<- boxplot(calories~type,data=Cereals,main="Cold vs Hot",
              xlab="Type",ylab="calories",
              col="blue")

t
## $stats
##      [,1] [,2]
## [1,]   90  100
## [2,]  100  100
## [3,]  110  100
## [4,]  110  100
## [5,]  120  100
## 
## $n
## [1] 74  3
## 
## $conf
##          [,1] [,2]
## [1,] 108.1633  100
## [2,] 111.8367  100
## 
## $out
##  [1]  70  70  50 130 140 150 150 160 140 130  50  50  80 140
## 
## $group
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## $names
## [1] "C" "H"

Variance: High in cold than hot Skewness: For Cold, Median is almost near the end of 3 rd quartile, this means that mean is less than median, hence left skewed(Also when we check summary stats we have mean of 107.2 and median of 110 thus indicating left skewed) For Hot, Mean = Median and No skewness

  1. Use a proper visualization to show the distribution of consumer rating by different shelf height. If we were to predict consumer rating from shelf height, does it appear that we need to keep all three categories of shelf height or we could combine two of the categories as a new category?
d<- boxplot(rating~shelf,data=Cereals,main="cust ratings and shelf link",
              xlab="Ratings",ylab="Shelf",horizontal=TRUE, 
              col="green")

d
## $stats
##          [,1]     [,2]     [,3]
## [1,] 28.74241 18.04285 28.59278
## [2,] 35.72000 23.80404 36.78112
## [3,] 43.93113 31.23005 40.80468
## [4,] 51.21029 39.25920 52.69536
## [5,] 72.80179 59.36399 68.40297
## 
## $n
## [1] 20 21 36
## 
## $conf
##          [,1]     [,2]     [,3]
## [1,] 38.45843 25.90136 36.61394
## [2,] 49.40383 36.55875 44.99543
## 
## $out
## [1] 74.47295 64.53382 93.70491
## 
## $group
## [1] 1 2 3
## 
## $names
## [1] "1" "2" "3"
Cereals$shelf<- as.factor(Cereals$shelf)
p <- ggplot(data = Cereals,
             mapping = aes(x = rating, fill = shelf, color = shelf))
p + geom_density(alpha = 0.4)

Looking at the box and density plots, 1 and 3 can be combined

  1. Get the correlation table for the numerical attributes. In addition, generate a scatter plot array for these variables. Find one pair of variables with strongest positive correlation and find one pair of variables with strongest negative correlation
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.3
## corrplot 0.84 loaded
c<- Cereals[complete.cases(Cereals), ]
c$shelf<- as.numeric(c$shelf)
corrplot(cor(c[,4:16],use="complete.obs"),type="upper",main="Correlation plot")

  1. Standardize the numerical attributes and then conduct principal component analysis for the numerical variables except consumer rating to derive 4 principal components. Obtain the cumulative proportion of variance (of the numerical variables) captured by the 4 principal components??
scaled.dat <- scale(c[,4:16])
head(scaled.dat)
##        calories    protein        fat     sodium       fiber      carbo
## [1,] -1.8659155  1.3817478  0.0000000 -0.3910227  3.22866747 -2.5001396
## [2,]  0.6537514  0.4522084  3.9728810 -1.7804186 -0.07249167 -1.7292632
## [3,] -1.8659155  1.3817478  0.0000000  1.1795987  2.81602258 -1.9862220
## [4,] -2.8737823  1.3817478 -0.9932203 -0.2702057  4.87924705 -1.7292632
## [5,]  0.1498180 -0.4773310  0.9932203  0.2130625 -0.27881412 -1.0868662
## [6,]  0.1498180 -0.4773310 -0.9932203 -0.4514312 -0.48513656 -0.9583868
##          sugars     potass   vitamins      shelf     weight       cups
## [1,] -0.2542051  2.5605229 -0.1818422  0.9419715 -0.2008324 -2.0856582
## [2,]  0.2046041  0.5147738 -1.3032024  0.9419715 -0.2008324  0.7567534
## [3,] -0.4836096  3.1248675 -0.1818422  0.9419715 -0.2008324 -2.0856582
## [4,] -1.6306324  3.2659536 -0.1818422  0.9419715 -0.2008324 -1.3644493
## [5,]  0.6634132 -0.4022862 -0.1818422 -1.4616799 -0.2008324 -0.3038480
## [6,]  1.5810314 -0.9666308 -0.1818422 -0.2598542 -0.2008324  0.7567534
##          rating
## [1,]  1.8549038
## [2,] -0.5977113
## [3,]  1.2151965
## [4,]  3.6578436
## [5,] -0.9165248
## [6,] -0.6553998
qs<- c[,4:15]
PCA <- prcomp(qs, center = TRUE, scale = TRUE)
summary(PCA)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.8219 1.6270 1.3364 1.0087 0.99211 0.83053 0.81231
## Proportion of Variance 0.2766 0.2206 0.1488 0.0848 0.08202 0.05748 0.05499
## Cumulative Proportion  0.2766 0.4972 0.6461 0.7308 0.81287 0.87035 0.92534
##                            PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.64503 0.56012 0.30298 0.23726 0.13440
## Proportion of Variance 0.03467 0.02614 0.00765 0.00469 0.00151
## Cumulative Proportion  0.96001 0.98615 0.99380 0.99849 1.00000
screeplot(PCA, type = "l", npcs = 15, main = "Screeplot of the first 12 PCs")
abline(h = 1, col="red", lty=5)
legend("topright", legend=c("Eigenvalue = 1"), col=c("red"), lty=5, cex=0.6)

var<- PCA$sdev^2 
sum(var[1:4])/sum(var)
## [1] 0.7308454
cumpro <- cumsum(PCA$sdev^2 / sum(PCA$sdev^2))
plot(cumpro[0:12], xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(v = 4, col="blue", lty=5)
abline(h = 0.7308454, col="blue", lty=5)
legend("topleft", legend=c("Cut-off @ PC4"),        col=c("blue"), lty=5, cex=0.6)

library(factoextra)
## Warning: package 'factoextra' was built under R version 3.6.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_eig(PCA)