title: “PCA - Cereals Analysis” author: “Vivek Teja” date: “6/19/2020” output: html_document — Objective:we will first clean a dataset and create some simple visualizations. Then, we will conduct a principal component analysis to reduce dimensionality. You must use R (NOT Tableau) to answer the questions. The data were collected on the nutritional information and consumer rating of 77 breakfast cereals. For each cereal we have included 13 numerical variables and 2 categorical variables (i.e. mfr and type). Variable Name Description Name: Name of cereal mfr: Manufacturer of cereal. {A=American Home Food Products, G=General Mills, K=Kelloggs, N=Nabisco, P=Post, Q=Qauker Oats, R=Ralston Purina} type: {cold, hot} calories: Calories per serving protein: Grams of protein fat: Grams of fat sodium: Milligrams of sodium fiber: Grams of dietary fiber carbo: Grams of complex carbohydrates sugars: Grams of sugars potassL Milligrams of potassium vitaminsL Vitamins and minerals-0, 25, or 100, indicating the typical percentage of FDA recommended shelf Display shelf: 1, 2, or 3, counting from the floor weight: Weight in ounces of one serving cups: Number of cups in one serving rating: A rating of the cereals calculated by Consumer Reports
Loading the Cereals Data Set
setwd("C:/Users/vamsi/Downloads")
library(readr)
library(ggplot2)
Cereals <- read_csv("Cereals.csv")
## Parsed with column specification:
## cols(
## name = col_character(),
## mfr = col_character(),
## type = col_character(),
## calories = col_double(),
## protein = col_double(),
## fat = col_double(),
## sodium = col_double(),
## fiber = col_double(),
## carbo = col_double(),
## sugars = col_double(),
## potass = col_double(),
## vitamins = col_double(),
## shelf = col_double(),
## weight = col_double(),
## cups = col_double(),
## rating = col_double()
## )
View(Cereals)
ggplot(Cereals, aes(type, ..count..)) + geom_bar(aes(fill = mfr), position = "dodge")
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
library(tidyr)
library(reshape)
## Warning: package 'reshape' was built under R version 3.6.3
##
## Attaching package: 'reshape'
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
q2<- sqldf("select type,mfr,avg(rating) as avg from Cereals group by type,mfr")
a<- sqldf("select distinct type from Cereals")
b<- sqldf("select distinct mfr from Cereals")
c<-crossing(a, b)
q2_<-sqldf("select b.type, b.mfr, case when avg is null then 0 else avg end as avg from (select c.*, a.avg from c left join q2 a on c.type=a.type and c.mfr =a.mfr)b ")
v<-data.frame(cast(q2_, type ~ mfr))
## Using avg as value column. Use the value argument to cast to override this choice
v
## type A G K N P Q R
## 1 C 0.00000 34.48585 44.03846 68.65552 41.70574 41.78565 41.543
## 2 H 54.85092 0.00000 0.00000 64.53382 0.00000 50.82839 0.000
t<- boxplot(calories~type,data=Cereals,main="Cold vs Hot",
xlab="Type",ylab="calories",
col="blue")
t
## $stats
## [,1] [,2]
## [1,] 90 100
## [2,] 100 100
## [3,] 110 100
## [4,] 110 100
## [5,] 120 100
##
## $n
## [1] 74 3
##
## $conf
## [,1] [,2]
## [1,] 108.1633 100
## [2,] 111.8367 100
##
## $out
## [1] 70 70 50 130 140 150 150 160 140 130 50 50 80 140
##
## $group
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $names
## [1] "C" "H"
Variance: High in cold than hot Skewness: For Cold, Median is almost near the end of 3 rd quartile, this means that mean is less than median, hence left skewed(Also when we check summary stats we have mean of 107.2 and median of 110 thus indicating left skewed) For Hot, Mean = Median and No skewness
d<- boxplot(rating~shelf,data=Cereals,main="cust ratings and shelf link",
xlab="Ratings",ylab="Shelf",horizontal=TRUE,
col="green")
d
## $stats
## [,1] [,2] [,3]
## [1,] 28.74241 18.04285 28.59278
## [2,] 35.72000 23.80404 36.78112
## [3,] 43.93113 31.23005 40.80468
## [4,] 51.21029 39.25920 52.69536
## [5,] 72.80179 59.36399 68.40297
##
## $n
## [1] 20 21 36
##
## $conf
## [,1] [,2] [,3]
## [1,] 38.45843 25.90136 36.61394
## [2,] 49.40383 36.55875 44.99543
##
## $out
## [1] 74.47295 64.53382 93.70491
##
## $group
## [1] 1 2 3
##
## $names
## [1] "1" "2" "3"
Cereals$shelf<- as.factor(Cereals$shelf)
p <- ggplot(data = Cereals,
mapping = aes(x = rating, fill = shelf, color = shelf))
p + geom_density(alpha = 0.4)
Looking at the box and density plots, 1 and 3 can be combined
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.3
## corrplot 0.84 loaded
c<- Cereals[complete.cases(Cereals), ]
c$shelf<- as.numeric(c$shelf)
corrplot(cor(c[,4:16],use="complete.obs"),type="upper",main="Correlation plot")
scaled.dat <- scale(c[,4:16])
head(scaled.dat)
## calories protein fat sodium fiber carbo
## [1,] -1.8659155 1.3817478 0.0000000 -0.3910227 3.22866747 -2.5001396
## [2,] 0.6537514 0.4522084 3.9728810 -1.7804186 -0.07249167 -1.7292632
## [3,] -1.8659155 1.3817478 0.0000000 1.1795987 2.81602258 -1.9862220
## [4,] -2.8737823 1.3817478 -0.9932203 -0.2702057 4.87924705 -1.7292632
## [5,] 0.1498180 -0.4773310 0.9932203 0.2130625 -0.27881412 -1.0868662
## [6,] 0.1498180 -0.4773310 -0.9932203 -0.4514312 -0.48513656 -0.9583868
## sugars potass vitamins shelf weight cups
## [1,] -0.2542051 2.5605229 -0.1818422 0.9419715 -0.2008324 -2.0856582
## [2,] 0.2046041 0.5147738 -1.3032024 0.9419715 -0.2008324 0.7567534
## [3,] -0.4836096 3.1248675 -0.1818422 0.9419715 -0.2008324 -2.0856582
## [4,] -1.6306324 3.2659536 -0.1818422 0.9419715 -0.2008324 -1.3644493
## [5,] 0.6634132 -0.4022862 -0.1818422 -1.4616799 -0.2008324 -0.3038480
## [6,] 1.5810314 -0.9666308 -0.1818422 -0.2598542 -0.2008324 0.7567534
## rating
## [1,] 1.8549038
## [2,] -0.5977113
## [3,] 1.2151965
## [4,] 3.6578436
## [5,] -0.9165248
## [6,] -0.6553998
qs<- c[,4:15]
PCA <- prcomp(qs, center = TRUE, scale = TRUE)
summary(PCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8219 1.6270 1.3364 1.0087 0.99211 0.83053 0.81231
## Proportion of Variance 0.2766 0.2206 0.1488 0.0848 0.08202 0.05748 0.05499
## Cumulative Proportion 0.2766 0.4972 0.6461 0.7308 0.81287 0.87035 0.92534
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.64503 0.56012 0.30298 0.23726 0.13440
## Proportion of Variance 0.03467 0.02614 0.00765 0.00469 0.00151
## Cumulative Proportion 0.96001 0.98615 0.99380 0.99849 1.00000
screeplot(PCA, type = "l", npcs = 15, main = "Screeplot of the first 12 PCs")
abline(h = 1, col="red", lty=5)
legend("topright", legend=c("Eigenvalue = 1"), col=c("red"), lty=5, cex=0.6)
var<- PCA$sdev^2
sum(var[1:4])/sum(var)
## [1] 0.7308454
cumpro <- cumsum(PCA$sdev^2 / sum(PCA$sdev^2))
plot(cumpro[0:12], xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(v = 4, col="blue", lty=5)
abline(h = 0.7308454, col="blue", lty=5)
legend("topleft", legend=c("Cut-off @ PC4"), col=c("blue"), lty=5, cex=0.6)
library(factoextra)
## Warning: package 'factoextra' was built under R version 3.6.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_eig(PCA)