```
It is the first week of your internship at Kellogg’s cereal department as a data analyst. Congrats! Your first project is to present a report on the recent Consumer Reports review of 80 top cereal brands. This report is for members of the sales team, who vary in their understanding of statistics, but who all understand well-polished charts and graphics.
# (Hint: the col_types= argument can take a single string; ex: col_types = "lmnopqrs")
library(tidyverse)     # readr, dplyr, ggplot2
library(RColorBrewer)  # brewer.pal()

# Read the data, declaring each column's type up front
c <- read_csv("cereal.csv", col_types = cols(
  mfr = col_factor(levels = c("General Mills", "Kellogg's", "Nabisco", "Post", "Quaker Oats", "Ralston", "Homestat Farm", "NA"), ordered = FALSE),
  type = col_factor(levels = c("Hot", "Cold"), ordered = FALSE),
  target = col_factor(levels = c("Child", "Adult"), ordered = FALSE),
  shelf = col_factor(levels = c("1", "2", "3"), ordered = TRUE),
  calories = col_number(), protein = col_number(), fat = col_number(),
  sodium = col_number(), fiber = col_number(), carbs = col_number(),
  sugars = col_number(), potass = col_number(), vitamins = col_number(),
  weight = col_number(), cups = col_number(), rating = col_number()
))
summary(c)
## name mfr type target
## Length:77 Kellogg's :23 Hot : 3 Child:22
## Class :character General Mills:22 Cold:74 Adult:55
## Mode :character Quaker Oats :15
## Post : 9
## Nabisco : 6
## Ralston : 1
## (Other) : 1
## calories protein fat sodium
## Min. : 50.0 Min. :1.000 Min. :0.000 Min. : 0.0
## 1st Qu.:100.0 1st Qu.:2.000 1st Qu.:0.000 1st Qu.:130.0
## Median :110.0 Median :3.000 Median :1.000 Median :180.0
## Mean :106.9 Mean :2.545 Mean :1.013 Mean :159.7
## 3rd Qu.:110.0 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:210.0
## Max. :160.0 Max. :6.000 Max. :5.000 Max. :320.0
##
## fiber carbs sugars potass
## Min. : 0.000 Min. :-1.0 Min. :-1.000 Min. : -1.00
## 1st Qu.: 1.000 1st Qu.:12.0 1st Qu.: 3.000 1st Qu.: 40.00
## Median : 2.000 Median :14.0 Median : 7.000 Median : 90.00
## Mean : 2.152 Mean :14.6 Mean : 6.922 Mean : 96.08
## 3rd Qu.: 3.000 3rd Qu.:17.0 3rd Qu.:11.000 3rd Qu.:120.00
## Max. :14.000 Max. :23.0 Max. :15.000 Max. :330.00
##
## vitamins shelf weight cups rating
## Min. : 0.00 1:20 Min. :0.50 Min. :0.250 Min. :18.04
## 1st Qu.: 25.00 2:21 1st Qu.:1.00 1st Qu.:0.670 1st Qu.:33.17
## Median : 25.00 3:36 Median :1.00 Median :0.750 Median :40.40
## Mean : 28.25 Mean :1.03 Mean :0.821 Mean :42.67
## 3rd Qu.: 25.00 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:50.83
## Max. :100.00 Max. :1.50 Max. :1.500 Max. :93.70
##
# This is done above. The column spec I used was shelf = col_factor(levels = c("1", "2", "3"), ordered = TRUE), inside read_csv(..., col_types = cols(...)).
Your boss has several questions he wants answered. The first few are focused on the trends in the cereal market and the last few are about the Kellogg’s brand specifically.
report.theme <- theme(
  plot.background = element_rect(color = "grey100"),
  panel.border = element_rect(color = "black", size = .5, fill = NA),
  panel.background = element_rect(color = "white"),
  panel.grid.major = element_line(color = "black", size = .1),
  legend.key = element_rect(color = "white"),
  axis.title = element_text(hjust = 0, vjust = 0, size = 10, face = "bold"),
  axis.text.x = element_text(angle = 30, vjust = .75),
  plot.title = element_text(color = "Red", face = "bold"),
  strip.background = element_blank(),
  strip.text = element_text(size = 10, face = "italic")
)
colors1 <- c("black", brewer.pal(n = 7, "Set1"))
colors2 <- brewer.pal(n = 7, "Set1")
colors3 <- brewer.pal(9, "YlOrRd")[4:8]
For each graph we create, you will need to write 3-5 sentences interpreting the graph. The first few graphs we will walk through together. For the last few, you will decide which graphs best answer the question and justify your choice in the interpretation (we are not looking for one correct answer, just a good answer that shows thoughtfulness and effort). Make sure each graph has the correct theme and formatting, with appropriate titles.
Q1_A) What effect do calories have on ratings? If Kellogg’s has two cereals in production, one with 150 calories per serving and the other with 80 calories, where would it expect each rating to be?
- Create a jittered scatter plot that looks at the effect of calories on ratings. Distinguish each brand by color.
- Place a regression line for each brand using the "lm" method, with no standard error
- Include a second, general trend line for all the data (Hint: …aes(group = 1, col = "All")… )
- Change line colors to colors1 with the legend title as "Manufacturers" (Hint: scale_color_manual("Title", values = … ))
- Add an xlab and ylab so the axis titles are capitalized (Hint: Just retype the title in x/ylab() )
- Give the plot the title "Rating Trends by Calories"
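# Points and per-brand lines are coloured by mfr; the second smooth adds one overall line
ggplot(c, aes(x = calories, y = rating, colour = mfr)) +
  geom_point(position = "jitter") +
  geom_smooth(method = "lm", se = FALSE) +
  geom_smooth(aes(group = 1, colour = "All"), method = "lm", se = FALSE) +
  scale_color_manual("Manufacturers", values = colors1) +
  xlab("Calories") + ylab("Rating") +
  ggtitle("Rating Trends by Calories") +
  report.theme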
Interpretation & Answer to Q1_A: The 80-calorie cereal should be rated higher than the 150-calorie cereal, since ratings trend downward as calories increase.
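# A numeric check of the answer above, as a minimal sketch. Assumptions: a single
# straight-line fit over all cereals is adequate, and predict_rating() is a
# hypothetical helper name that does not come from the assignment.
predict_rating <- function(data, new_calories) {
  # Fit rating as a linear function of calories, matching the "lm" trend line
  fit <- lm(rating ~ calories, data = data)
  # Predicted rating at each requested calorie level
  predict(fit, newdata = data.frame(calories = new_calories))
}
# Usage with the cereal data loaded above: predict_rating(c, c(80, 150))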
Q1_B) What is unique about Nabisco’s rating trend? Why might that be the case? (Hint: Assume Consumer Reports surveyed only adults)
Interpretation & Answer to Q1_B: Nabisco’s ratings are higher than Kellogg’s, and its cereals do not go above 100 calories. The trend line looks this way because Nabisco’s brand is targeted more towards adults, while Kellogg’s cereals are more sugary, higher in calories, and more focused on children.
#It might help to see what brands are owned by Nabisco and which are owned by Kellogg's
c %>% filter(mfr == "Kellogg's")
## # A tibble: 23 x 17
## name mfr type target calories protein fat sodium fiber carbs
## <chr> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Nut&… Kell… Cold Adult 120 2 1 190 0 15
## 2 Mues… Kell… Cold Adult 160 3 2 150 3 17
## 3 Smac… Kell… Cold Child 110 2 1 70 1 9
## 4 Fros… Kell… Cold Child 110 1 0 200 1 14
## 5 Froo… Kell… Cold Child 110 2 1 125 1 11
## 6 Appl… Kell… Cold Child 110 2 0 125 1 11
## 7 Corn… Kell… Cold Child 110 1 0 90 1 13
## 8 Just… Kell… Cold Adult 140 3 1 170 2 20
## 9 Just… Kell… Cold Adult 110 2 1 170 1 17
## 10 Rais… Kell… Cold Adult 120 3 1 210 5 14
## # … with 13 more rows, and 7 more variables: sugars <dbl>, potass <dbl>,
## # vitamins <dbl>, shelf <ord>, weight <dbl>, cups <dbl>, rating <dbl>
c %>% filter(mfr == "Nabisco")
## # A tibble: 6 x 17
## name mfr type target calories protein fat sodium fiber carbs sugars
## <chr> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Stra… Nabi… Cold Adult 90 2 0 15 3 15 5
## 2 Crea… Nabi… Hot Adult 100 3 0 80 1 21 0
## 3 Shre… Nabi… Cold Adult 80 2 0 0 3 16 0
## 4 100%… Nabi… Cold Adult 70 4 1 130 10 5 6
## 5 Shre… Nabi… Cold Adult 90 3 0 0 3 20 0
## 6 Shre… Nabi… Cold Adult 90 3 0 0 4 19 0
## # … with 6 more variables: potass <dbl>, vitamins <dbl>, shelf <ord>,
## # weight <dbl>, cups <dbl>, rating <dbl>
Q2_A) What patterns do you observe in the high-calorie cereals? How is Kellogg’s doing compared to competitive brands?
- Copy the plot from Q1_A
- Change the method of all the trend lines to "loess"
- Zoom in so that we look at calories between 110 and 150 and ratings between 0 and 50 (Hint: use coord_cartesian() )
- Change the title to be more appropriate
- Colors, Axis labels, and Themes should all remain the same.
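ggplot(c, aes(x = calories, y = rating, colour = mfr)) +
  geom_point(position = "jitter") +
  geom_smooth(method = "loess", se = FALSE) +
  geom_smooth(aes(group = 1, colour = "All"), method = "loess", se = FALSE) +
  scale_color_manual("Manufacturers", values = colors1) +
  xlab("Calories") + ylab("Rating") +
  # Zoom without dropping data, so the trend lines are still fit on all cereals
  coord_cartesian(xlim = c(110, 150), ylim = c(0, 50)) +
  ggtitle("Rating Trends for High-Calorie Cereals") +
  report.theme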
Interpretation & Answer to Q2_A:
In this zoomed view, Kellogg’s is not doing that well and appears to have lower ratings than most brands.
Q3_A) How is Rating related to Manufacturer? Does Kellogg’s have a name-brand boost or not? If not, which manufacturer does get a boost?
# Drop the two manufacturers that have a single cereal each
c3 <- c %>% filter(!mfr %in% c("Ralston", "Homestat Farm"))
# X marks each manufacturer's mean rating
ggplot(c3, aes(x = mfr, y = rating, colour = mfr)) +
  geom_point(position = "jitter") +
  stat_summary(geom = "point", fun = mean, shape = "X", size = 5) +
  scale_color_manual("Manufacturers", values = colors3) +
  xlab("Manufacturer") + ylab("Rating") +
  ggtitle("Rating by Manufacturer") +
  report.theme
Interpretation & Answer to Q3_A:
Kellogg’s has a boost over General Mills, even though in the other chart it appeared that Kellogg’s was below General Mills.
Q3_B) There is another kind of plot that displays similar information to the plot above (i.e., it shows the distribution of ratings by brand). Create this plot. Which plot is better, in your opinion? Why?
- Filter out "Ralston" and "Homestat Farm". You don’t need them since they have one observation each.
- Create a different boxplot for each manufacturer that compares their respective rating distributions
- Each manufacturer should also have its own fill color
- Change the colors to colors3 and use the legend title "Manufacturers"
- Remember appropriate axis titles and plot title
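# fill (not colour) gives each manufacturer its own box fill, as the task asks
ggplot(c3, aes(x = mfr, y = rating, fill = mfr)) +
  geom_boxplot() +
  scale_fill_manual("Manufacturers", values = colors3) +
  xlab("Manufacturer") + ylab("Rating") +
  ggtitle("Rating Distribution by Manufacturer") +
  report.theme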
Interpretation & Answer to Q3_B: I think the boxplot is better because it makes it easier to show the range of the ratings per brand instead of looking at the dots and guessing.
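# To back up the visual comparison with numbers, a minimal sketch follows.
# rating_by_mfr() is a hypothetical helper, not part of the assignment. It reports
# the count, mean, and median rating per manufacturer, highest mean first.
library(dplyr)
rating_by_mfr <- function(data) {
  data %>%
    group_by(mfr) %>%
    summarise(n = n(), mean_rating = mean(rating), median_rating = median(rating)) %>%
    arrange(desc(mean_rating))
}
# Usage with the filtered data: rating_by_mfr(c3)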
Q4_A) The intern before you took his Business Analytics course at the University of Utah and made the following pie chart. What is wrong with the chart? Does it answer the question it sets out to answer? (Hint: Think about the title of the plot and the data the plot is based on)
I would say no. These companies each sell a variety of different cereals, and it would be beneficial to understand each cereal the company sells.
#Run this code and examine the pie chart
c %>% ggplot(aes(x=1, fill = mfr)) +
geom_bar() +
scale_fill_manual("Mfr", values=colors2) +
coord_polar(theta = "y") +
ggtitle("Market Share for each Company")
Q4_B) Your boss asks you to fix the pie chart so that it answers the question of how many cereals each company has in the Consumer Reports analysis. (This includes choosing the appropriate plot type.) Remember to include appropriate titles, themes, colors, and labels.
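# Use the full data set so every company in the analysis is counted
ggplot(c, aes(mfr, fill = mfr)) +
  geom_bar() +
  scale_fill_manual("Mfr", values = colors2) +
  xlab("Manufacturer") + ylab("Number of Cereals") +
  ggtitle("Number of Cereals by Company") +
  report.theme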
For the next three questions, your boss does not have any specific guidelines other than the graph 1) answers the given question, 2) follows the proper theme and design guidelines, and 3) includes a short justification and interpretation for each plot.
Q5) How does Kellogg’s compare calorie-wise to its three closest competitors: Post, Quaker Oats, and General Mills?
- Create a graphic that answers the question
- Make sure the graphic has the appropriate theme, colors, titles, labels, and format
- Include a brief justification for the graphic and interpret it so that it answers the question.
c4 <- c3 %>% filter(mfr != "Nabisco")
# A single fill scale; a second scale_fill_manual() would silently replace the first
ggplot(c4, aes(mfr, calories, fill = mfr)) +
  geom_boxplot() +
  scale_fill_manual("Mfr", values = colors2) +
  xlab("Manufacturer") + ylab("Calories") +
  ggtitle("Calories per Serving by Company") +
  report.theme
Interpretation & Answer to Q5:
ANSWER: General Mills has the highest typical calorie count, but Kellogg’s is not far behind. The lowest is Quaker Oats, which makes sense based on the type of cereal it sells. I used a boxplot because it shows the median and spread of calories for each company, which gives me a better perspective than looking at every individual calorie count or writing more code to compute averages for a bar chart. The boxplot also makes it visually easier to see which company is highest and which is lowest.
Q6) According to Consumer Reports, to which target audience (Adult or Child) should Kellogg’s market if sales are closely and positively correlated with rating?
- Create a graphic that answers the question
- Make sure the graphic has the appropriate theme, colors, titles, labels, and format
- Include a brief justification for the graphic and interpret it so that it answers the question.
c5 <- c %>% filter(mfr == "Kellogg's")
ggplot(c5, aes(target, rating, fill = target)) +
  geom_boxplot() +
  scale_fill_manual("Target", values = colors2) +
  xlab("Target") + ylab("Rating") +
  ggtitle("Kellogg's Ratings by Target") +
  report.theme
Interpretation & Answer to Q6: Kellogg’s should market to adults, since its adult cereals have a higher median rating than its children’s cereals. I used a boxplot so that I could easily compare the medians and also get a better understanding of the maximum and minimum ratings by target.
Q7) Does sugar level affect the ratings of cereals differently for different targets AND for different companies? Should Kellogg’s increase or decrease the sugar level of their children’s cereals?
- Create a graphic that answers the question
- Make sure the graphic has the appropriate theme, colors, titles, labels, and format
- Include a brief justification for the graphic and interpret it so that it answers the question.
# One panel per company; within each panel, one trend line per target
ggplot(c, aes(x = sugars, y = rating, colour = target)) +
  geom_point(position = "jitter") +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_manual("Target", values = colors1) +
  xlab("Sugars") + ylab("Rating") +
  facet_wrap(~ mfr) +
  ggtitle("Ratings based on Sugars by Company and Target") +
  report.theme
Interpretation & Answer to Q7:
ANSWER: Across the board, all companies follow the same trend: the lower the sugar, the higher the rating. Kellogg’s should not increase its sugar but should decrease it, so as to obtain the higher ratings that the other companies have. I used a scatter plot with a facet wrap so I could compare all the companies side by side. If I had used a single scatter plot, the data would have been messy and hard to read.
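# The per-panel trends can also be summarised numerically. A minimal sketch,
# assuming a straight-line relationship within each group; sugar_slopes() is a
# hypothetical helper, not part of the assignment. Groups with fewer than 3
# cereals are skipped because a slope from one or two points is not informative.
library(dplyr)
sugar_slopes <- function(data) {
  data %>%
    group_by(mfr, target) %>%
    filter(n() >= 3) %>%
    # Least-squares slope of rating on sugars: cov(x, y) / var(x)
    summarise(n = n(), slope = cov(sugars, rating) / var(sugars), .groups = "drop")
}
# Usage: sugar_slopes(c). A negative slope means rating falls as sugar rises.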