```

It is the first week of your internship at Kellogg’s cereal department as a data analyst. Congrats! Your first project is to present a report on the recent Consumer Reports review of 80 top cereal brands. This report is for members of the sales team, who vary in their understanding of statistics, but who all understand well-polished charts and graphics.

  1. First, let’s explore the data. Load in the “cereal.csv” datasheet. Include the col_types argument to make sure that each variable is the right class. Explore the data with any summary functions you’d like and fill in the table below with a brief explanation of each variable. (The shelf variable is completed for you as an example.)

     name: (char) the cereal’s name; summary() reports only its length
     mfr: (fact) manufacturer names as levels, not ordered
     type: (fact) Hot or Cold cereal, not ordered
     target: (fact) Adult or Child, not ordered
     calories: (num) calories through weight (excluding shelf) are plain numbers that describe the cereal
     protein: (num)
     fat: (num)
     sodium: (num)
     fiber: (num)
     carbs: (num)
     sugars: (num)
     potass: (num)
     vitamins: (num)
     shelf: (ordered fact) where on the shelf the cereal is located: 1-3, based on the level the cereal sits at
     weight: (num)
     cups: (num)
     rating: (num)

#(Hint: the col_types = argument can take a single compact string; ex: col_types = “lmnopqrs”)

library(tidyverse)     # read_csv(), dplyr verbs, ggplot2
library(RColorBrewer)  # brewer.pal(), used for the palettes later

# Read the data, declaring the class of every non-character column
c <- read_csv("cereal.csv", col_types = cols(
  mfr = col_factor(levels = c("General Mills", "Kellogg's", "Nabisco", "Post", "Quaker Oats", "Ralston", "Homestat Farm", "NA"), ordered = FALSE),
  type = col_factor(levels = c("Hot", "Cold"), ordered = FALSE),
  target = col_factor(levels = c("Child", "Adult"), ordered = FALSE),
  shelf = col_factor(levels = c("1", "2", "3"), ordered = TRUE),
  calories = col_number(), protein = col_number(), fat = col_number(),
  sodium = col_number(), fiber = col_number(), carbs = col_number(),
  sugars = col_number(), potass = col_number(), vitamins = col_number(),
  weight = col_number(), cups = col_number(), rating = col_number()
))
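
# A minimal sketch of the compact-string form of col_types mentioned in the hint,
# assuming readr 2.0 or later is installed (needed for I() literal input).
# The three-column CSV text below is made up for illustration; it is not cereal.csv.
# Each letter sets one column's type, in order: "c" = character, "f" = factor, "n" = number.
library(readr)
demo <- read_csv(I("name,type,calories\nBran Demo,Cold,70\nOat Demo,Hot,100\n"), col_types = "cfn")
sapply(demo, class)   # one class per column, in the order given by the string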

summary(c)
##      name                      mfr       type      target  
##  Length:77          Kellogg's    :23   Hot : 3   Child:22  
##  Class :character   General Mills:22   Cold:74   Adult:55  
##  Mode  :character   Quaker Oats  :15                       
##                     Post         : 9                       
##                     Nabisco      : 6                       
##                     Ralston      : 1                       
##                     (Other)      : 1                       
##     calories        protein           fat            sodium     
##  Min.   : 50.0   Min.   :1.000   Min.   :0.000   Min.   :  0.0  
##  1st Qu.:100.0   1st Qu.:2.000   1st Qu.:0.000   1st Qu.:130.0  
##  Median :110.0   Median :3.000   Median :1.000   Median :180.0  
##  Mean   :106.9   Mean   :2.545   Mean   :1.013   Mean   :159.7  
##  3rd Qu.:110.0   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:210.0  
##  Max.   :160.0   Max.   :6.000   Max.   :5.000   Max.   :320.0  
##                                                                 
##      fiber            carbs          sugars           potass      
##  Min.   : 0.000   Min.   :-1.0   Min.   :-1.000   Min.   : -1.00  
##  1st Qu.: 1.000   1st Qu.:12.0   1st Qu.: 3.000   1st Qu.: 40.00  
##  Median : 2.000   Median :14.0   Median : 7.000   Median : 90.00  
##  Mean   : 2.152   Mean   :14.6   Mean   : 6.922   Mean   : 96.08  
##  3rd Qu.: 3.000   3rd Qu.:17.0   3rd Qu.:11.000   3rd Qu.:120.00  
##  Max.   :14.000   Max.   :23.0   Max.   :15.000   Max.   :330.00  
##                                                                   
##     vitamins      shelf      weight          cups           rating     
##  Min.   :  0.00   1:20   Min.   :0.50   Min.   :0.250   Min.   :18.04  
##  1st Qu.: 25.00   2:21   1st Qu.:1.00   1st Qu.:0.670   1st Qu.:33.17  
##  Median : 25.00   3:36   Median :1.00   Median :0.750   Median :40.40  
##  Mean   : 28.25          Mean   :1.03   Mean   :0.821   Mean   :42.67  
##  3rd Qu.: 25.00          3rd Qu.:1.00   3rd Qu.:1.000   3rd Qu.:50.83  
##  Max.   :100.00          Max.   :1.50   Max.   :1.500   Max.   :93.70  
## 
  1. The shelf variable tells where on the store shelf a product is placed, where 1 = floor level, 2 = child eye-level, and 3 = adult eye-level. Make the shelf variable an ordered factor, with levels = c(1,2,3).
#This is done above. The specification I used was shelf=col_factor(levels=c("1", "2", "3"),ordered = TRUE), inside read_csv(..., col_types = cols(...))
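
# A small base-R sketch of what ordered = TRUE makes possible, using made-up shelf
# values (not rows from cereal.csv): ordered factors support comparisons and max().
shelf_demo <- factor(c(3, 1, 2, 3), levels = c(1, 2, 3), ordered = TRUE)
is.ordered(shelf_demo)   # TRUE
shelf_demo >= "2"        # flags products at child eye-level or higher
max(shelf_demo)          # highest shelf level present in the demo vector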

Your boss has several questions he wants answered. The first few are focused on the trends in the cereal market and the last few are about the Kellogg’s brand specifically.

  1. Kellogg’s has a strict report format. You must create a theme called report.theme that meets the following criteria:
     - has a plot.background that is “grey100”
     - has a panel.border that is “black”, has size = .5, and fill = NA
     - has a panel.background that is “white”
     - has panel.grid.major lines that are color = “black” and size = .1
     - has a legend.key that is “white”
     - has an axis.title where hjust=0, vjust=0, size=10, face=“bold”
     - has an axis.text.x where angle=30, vjust=.75
     - has a plot.title where color=“Red” and face=“bold”
     - has a strip.background that is blank
     - has a strip.text that is size=10 and face = “italic”
# Backgrounds are set with fill; color on an element_rect only changes its outline
report.theme <- theme(
  plot.background = element_rect(fill = "grey100"),
  panel.border = element_rect(color = "black", size = .5, fill = NA),
  panel.background = element_rect(fill = "white"),
  panel.grid.major = element_line(color = "black", size = .1),
  legend.key = element_rect(fill = "white"),
  axis.title = element_text(hjust = 0, vjust = 0, size = 10, face = "bold"),
  axis.text.x = element_text(angle = 30, vjust = .75),
  plot.title = element_text(color = "Red", face = "bold"),
  strip.background = element_blank(),
  strip.text = element_text(size = 10, face = "italic")
)
  1. Create 3 color palettes that will be used from here on out:
     - colors1 will be a vector of 8 colors where the first is “black” and the next 7 are from the brewer palette “Set1” (Hint: c(“black”,brewer.pal(n = # of colors , “palette”)))
     - colors2 will be a vector of 7 colors all from brewer palette “Set1”
     - colors3 will be a vector of 5 colors from the brewer palette “YlOrRd” with only darker oranges and reds (Provided)
colors1 <- c("black", brewer.pal(n = 7, "Set1"))
colors2 <- brewer.pal(n = 7, "Set1")
colors3 <- c(brewer.pal(9, "YlOrRd")[4:8])

For each graph we create, you will need to write 3-5 sentences interpreting the graph. The first few graphs we will walk through together. For the last few, you will decide which graphs best answer the question and justify your choice in the interpretation (we are not looking for 1 correct answer, just a good answer that shows thoughtfulness and effort). Make sure each graph has the correct theme and formatting with appropriate titles.

Q1_A) What effect do calories have on ratings? If Kellogg’s has two cereals in production, one with 150 calories per serving and the other with 80 calories, where would it expect each rating to be?
  - Create a jittered scatter plot that looks at the effect of calories on ratings. Distinguish each brand by color.
  - Place a regression line for each brand using the “lm” method, with no standard error
  - Include a second, general trend line for all the data (Hint: …aes(group = 1, col = “All”)… )
  - Change line colors to colors1 with the legend title as “Manufacturers” (Hint: scale_color_manual(“Title”, values = … ))
  - Add an xlab and ylab so the axis titles are capitalized (Hint: Just retype the title in x/ylab() )
  - Give the plot the title “Rating Trends by Calories”

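ggplot(c, aes(x = calories, y = rating, colour = mfr)) +
  geom_point(position = "jitter") +
  geom_smooth(method = "lm", se = FALSE) +                                  # one line per manufacturer
  geom_smooth(aes(group = 1, colour = "All"), method = "lm", se = FALSE) +  # overall trend line
  scale_color_manual("Manufacturers", values = colors1) +
  xlab("Calories") + ylab("Rating") +
  ggtitle("Rating Trends by Calories") +
  report.theme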

Interpretation & Answer to Q1_A: Ratings fall as calories rise, so the 80-calorie cereal should be rated higher than the 150-calorie cereal.
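
# A hedged sketch of how the two expected ratings could be estimated numerically
# instead of being read off the plot. predict_rating is a hypothetical helper, not part
# of the assignment; it assumes one straight line over all brands is an adequate
# summary and ignores manufacturer differences.
predict_rating <- function(cereals, at) {
  fit <- lm(rating ~ calories, data = cereals)          # overall linear trend
  predict(fit, newdata = data.frame(calories = at))     # fitted rating at each calorie value
}
# Usage with the data loaded above: predict_rating(c, c(80, 150))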

Q1_B) What is unique about Nabisco’s rating trend? Why might that be the case? (Hint: Assume Consumer Reports surveyed only adults)

Nabisco’s ratings are higher than Kellogg’s, and its cereals do not go above 100 calories. The trend line looks this way because Nabisco’s cereals are all targeted at adults, while Kellogg’s cereals are more sugary, higher in calories, and more child-focused.

#It might help to see what brands are owned by Nabisco and which are owned by Kellogg's

c %>% filter(mfr == "Kellogg's") 
## # A tibble: 23 x 17
##    name  mfr   type  target calories protein   fat sodium fiber carbs
##    <chr> <fct> <fct> <fct>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
##  1 Nut&… Kell… Cold  Adult       120       2     1    190     0    15
##  2 Mues… Kell… Cold  Adult       160       3     2    150     3    17
##  3 Smac… Kell… Cold  Child       110       2     1     70     1     9
##  4 Fros… Kell… Cold  Child       110       1     0    200     1    14
##  5 Froo… Kell… Cold  Child       110       2     1    125     1    11
##  6 Appl… Kell… Cold  Child       110       2     0    125     1    11
##  7 Corn… Kell… Cold  Child       110       1     0     90     1    13
##  8 Just… Kell… Cold  Adult       140       3     1    170     2    20
##  9 Just… Kell… Cold  Adult       110       2     1    170     1    17
## 10 Rais… Kell… Cold  Adult       120       3     1    210     5    14
## # … with 13 more rows, and 7 more variables: sugars <dbl>, potass <dbl>,
## #   vitamins <dbl>, shelf <ord>, weight <dbl>, cups <dbl>, rating <dbl>
c %>% filter(mfr == "Nabisco")
## # A tibble: 6 x 17
##   name  mfr   type  target calories protein   fat sodium fiber carbs sugars
##   <chr> <fct> <fct> <fct>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>
## 1 Stra… Nabi… Cold  Adult        90       2     0     15     3    15      5
## 2 Crea… Nabi… Hot   Adult       100       3     0     80     1    21      0
## 3 Shre… Nabi… Cold  Adult        80       2     0      0     3    16      0
## 4 100%… Nabi… Cold  Adult        70       4     1    130    10     5      6
## 5 Shre… Nabi… Cold  Adult        90       3     0      0     3    20      0
## 6 Shre… Nabi… Cold  Adult        90       3     0      0     4    19      0
## # … with 6 more variables: potass <dbl>, vitamins <dbl>, shelf <ord>,
## #   weight <dbl>, cups <dbl>, rating <dbl>

Q2_A) What patterns do you observe in the high-calorie cereals? How is Kellogg’s doing compared to competitive brands?
  - Copy the plot from Q1_A
  - Change the method of all the trend lines to “loess”
  - Zoom in so that we look at calories between 110 and 150 and ratings between 0 and 50 (Hint: use coord_cartesian() )
  - Change the title to be more appropriate
  - Colors, Axis labels, and Themes should all remain the same.

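ggplot(c, aes(x = calories, y = rating, colour = mfr)) +
  geom_point(position = "jitter") +
  geom_smooth(method = "loess", se = FALSE) +                                  # one line per manufacturer
  geom_smooth(aes(group = 1, colour = "All"), method = "loess", se = FALSE) +  # overall trend line
  scale_color_manual("Manufacturers", values = colors1) +
  xlab("Calories") + ylab("Rating") +
  ggtitle("Rating Trends for High-Calorie Cereals") +
  coord_cartesian(xlim = c(110, 150), ylim = c(0, 50)) +                       # zoom without dropping data
  report.theme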

Interpretation & Answer to Q2_A:

Zoomed in on the high-calorie range, Kellogg’s is not doing that well and appears to have lower ratings than most brands.

Q3_A) How is Rating related to Manufacturer? Does Kellogg’s have a name-brand boost or not? If not, which manufacturer does get a boost?

# Drop the two manufacturers that have only one cereal each
c3 <- c %>% filter(mfr != "Ralston", mfr != "Homestat Farm")

ggplot(c3, aes(x = mfr, y = rating, colour = mfr)) +
  geom_point(position = "jitter") +
  stat_summary(geom = "point", fun = mean, shape = "X", size = 5) +  # mean rating; fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
  scale_color_manual("Manufacturers", values = colors3) +
  xlab("Manufacturer") + ylab("Rating") +
  ggtitle("Ratings by Manufacturer") +
  report.theme

Interpretation & Answer to Q3_A:

Kellogg’s has a boost over General Mills here, whereas in the other chart it appeared that Kellogg’s was below General Mills.

Q3_B) There is another kind of plot that displays similar information to the plot above (i.e., it shows the distribution of ratings by brand). Create this plot. Which plot is better, in your opinion? Why?
  - Filter out “Ralston” and “Homestat Farm”. You don’t need them since they have one observation each.
  - Create a different boxplot for each manufacturer that compares their respective rating distribution
  - Each manufacturer should also have their own fill color
  - Change the colors to colors3 and the legend title to “Manufacturers”
  - Remember appropriate axis titles and plot title

ggplot(c3, aes(x = mfr, y = rating, fill = mfr)) +   # fill, not colour, gives each box its own fill color
  geom_boxplot() +
  scale_fill_manual("Manufacturers", values = colors3) +
  xlab("Manufacturer") + ylab("Rating") +
  ggtitle("Rating Distribution by Manufacturer") +
  report.theme

Interpretation & Answer to Q3_B: I think the boxplot is better because it makes it easier to show the range of the ratings per brand instead of looking at the dots and guessing.

Q4_A) The intern before you took his Business Analytics course at the University of Utah and made the following pie chart. What is wrong with the chart? Does it answer the question it sets out to answer? (Hint: Think about the title of the plot and the data the plot is based on)

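I would say no. The chart counts how many cereals each company has in the dataset, which is not the same as market share, so the title does not match the data. These companies also sell a variety of different cereals, and it would be beneficial to understand each cereal a company sells.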

#Run this code and examine the pie chart
c %>% ggplot(aes(x=1, fill = mfr)) +
  geom_bar() + 
  scale_fill_manual("Mfr", values=colors2) +
  coord_polar(theta = "y") +
  ggtitle("Market Share for each Company")

Q4_B) Your boss asks you to fix the pie chart so that it answers the question of how many cereals each company has in the Consumer Reports analysis. (This includes choosing the appropriate plot type.) Remember to include appropriate titles, themes, colors, and labels.

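# Use the full data (c) so that every company in the analysis is counted
ggplot(c, aes(x = mfr, fill = mfr)) +
  geom_bar() +
  scale_fill_manual("Manufacturers", values = colors2) +
  xlab("Manufacturer") + ylab("Number of Cereals") +
  ggtitle("Number of Cereals by Company") +
  report.theme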
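
# The bar heights can also be checked as a plain table. A minimal base-R sketch;
# count_by_mfr is a hypothetical helper name and needs no extra packages.
count_by_mfr <- function(cereals) {
  sort(table(cereals$mfr), decreasing = TRUE)   # cereals per manufacturer, largest first
}
# Usage with the data loaded above: count_by_mfr(c)
# The counts should agree with the mfr column of the summary(c) output.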

For the next three questions, your boss does not have any specific guidelines other than the graph 1) answers the given question, 2) follows the proper theme and design guidelines, and 3) includes a short justification and interpretation for each plot.

Q5) How does Kellogg’s compare calorie-wise to its three closest competitors: Post, Quaker Oats, and General Mills?
  - Create a graphic that answers the question
  - Make sure the graphic has the appropriate theme, colors, titles, labels, and format
  - Include a brief justification for the graphic and interpret it so that it answers the question.

# Keep only Kellogg's and its three closest competitors
c4 <- c3 %>% filter(mfr != "Nabisco")

ggplot(c4, aes(x = mfr, y = calories, fill = mfr)) +
  geom_boxplot() +
  scale_fill_manual("Manufacturers", values = colors2) +   # a single fill scale; a second one would replace it
  xlab("Manufacturer") + ylab("Calories") +
  ggtitle("Calories by Company") +
  report.theme

Interpretation & Answer to Q5:

ANSWER: General Mills has the highest average calories, but Kellogg’s is not far behind. The lowest is Quaker Oats, which makes sense based on the type of cereal they sell. I used a box plot to see the average calories, which gives me a better perspective than looking at every individual calorie count or writing more code to compute the averages for a bar chart. The boxplot also makes it visually easier to see who is highest and lowest.
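
# A boxplot's centre line is the median, not the mean. A short base-R sketch for
# putting both numbers side by side; calorie_summary is a hypothetical helper name.
calorie_summary <- function(cereals) {
  aggregate(calories ~ mfr, data = cereals,
            FUN = function(x) c(mean = mean(x), median = median(x)))
}
# Usage with the filtered data above: calorie_summary(c4)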

Q6) According to Consumer Reports, to which target audience (Adult or Child) should Kellogg’s market if sales are closely and positively correlated with rating?
  - Create a graphic that answers the question
  - Make sure the graphic has the appropriate theme, colors, titles, labels, and format
  - Include a brief justification for the graphic and interpret it so that it answers the question.

# Kellogg's cereals only
c5 <- c %>% filter(mfr == "Kellogg's")

ggplot(c5, aes(x = target, y = rating, fill = target)) +
  geom_boxplot() +
  scale_fill_manual("Target", values = colors2) +
  xlab("Target") + ylab("Rating") +
  ggtitle("Kellogg's Ratings by Target") +
  report.theme

Interpretation & Answer to Q6: They should target adults, as adult cereals have a higher average rating than children’s cereals. I used a boxplot so that I could easily compare the averages and also get a better understanding of the max and min ratings by target.

Q7) Does sugar level affect the ratings of cereals differently for different targets AND for different companies? Should Kellogg’s increase or decrease the sugar level of their children’s cereals?
  - Create a graphic that answers the question
  - Make sure the graphic has the appropriate theme, colors, titles, labels, and format
  - Include a brief justification for the graphic and interpret it so that it answers the question.

ggplot(c, aes(x = sugars, y = rating, colour = target)) +
  geom_point(position = "jitter") +
  geom_smooth(method = "lm", se = FALSE) +           # one line per target within each facet
  scale_color_manual("Target", values = colors1) +   # the legend shows targets, not manufacturers
  xlab("Sugars") + ylab("Rating") +
  ggtitle("Ratings Based on Sugars by Company and Target") +
  facet_wrap(~ mfr) +                                # one panel per manufacturer
  report.theme

Interpretation & Answer to Q7:

ANSWER: Across the board, all companies follow the same trend: the lower the sugar, the higher the rating. Kellogg’s should decrease rather than increase its sugar in order to obtain higher ratings, as the other companies have. I used a scatter plot with a facet wrap so I could compare all the companies together. If I had used a single scatterplot, the data would have been messy and hard to read.