library(tidyverse) # dplyr, ggplot2, and tibble
library(openintro) # fastfood data and qqnormsim()
head(fastfood) # look at the fastfood dataset
Make a plot of the calories from fat for the options from Dairy Queen and McDonald’s. How do their centers, shapes, and spreads compare?
Both the Dairy Queen and the McDonald’s fat calorie histograms are right skewed. The McDonald’s data has a wider spread, with several more high fat calorie values than the DQ data. The centers of the two graphs are similar, falling between 250 and 300 calories, with the McDonald’s center pulled upward by the higher values in the right tail.
fastfood %>%
filter(restaurant == "Mcdonalds" | restaurant == "Dairy Queen") %>% # filter for MickyDs & DQ
ggplot() +
geom_histogram(aes(x = cal_fat, # create a faceted histogram
fill = restaurant)) +
facet_grid(cols = vars(restaurant), scales = "free_y") + # facets by restaurant
scale_fill_manual(guide = NULL, values = c("blue", "gold2")) + # suppress guide add pretty colors
theme_bw() # simple plot theme
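The visual comparison of center, spread, and shape can be backed up with a few summary statistics. The sketch below uses base R only and a small hypothetical vector `x` (not taken from `fastfood`); a mean noticeably above the median is a common numeric hint of right skew.

```r
# Hypothetical right-skewed values, for illustration only (not from fastfood)
x <- c(100, 150, 180, 200, 220, 250, 300, 450, 700, 1200)

# Center (mean, median) and spread (sd, IQR) in one named vector
shape_summary <- c(
  mean   = mean(x),
  median = median(x),
  sd     = sd(x),
  iqr    = IQR(x)
)

# In a right-skewed sample the long upper tail pulls the mean above the median
mean_exceeds_median <- shape_summary[["mean"]] > shape_summary[["median"]]

print(shape_summary)
print(mean_exceeds_median)
```

The same summary could be computed per restaurant, for example on `dairy_queen$cal_fat`, assuming that object exists as defined later in this document.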
Based on this plot, does it appear that the data follow a nearly normal distribution?
The data appear to follow a nearly normal distribution. The distribution is unimodal, and the shape of most of the data is close to the normal curve. The fit is thrown off somewhat by the high values in the right tail.
dairy_queen <- fastfood %>% # create DQ dataset
filter(restaurant == "Dairy Queen")
dqmean <- mean(dairy_queen$cal_fat) # calc the mean and std. deviation
dqsd <- sd(dairy_queen$cal_fat)
ggplot(data = dairy_queen, aes(x = cal_fat)) + # create histogram
geom_histogram(aes(y = ..density..), # use the density function built into ggplot
fill = "lightblue") + # set histogram color
stat_function(fun = dnorm, # create a normal curve line
args = c(mean = dqmean, sd = dqsd), # using the mean and sd from DQ data
col = "tomato", size = 1.5) + # set line color and width
labs(x = "fat calories", # add label on x axis
title = "Dairy Queen density plot") + # plot title
theme_light() # simple plot theme
ggplot() +
geom_line(stat = "qq", # create normal quantile plot
data = dairy_queen, aes(sample = cal_fat), # for DQ fat calories data
color = "red", size = 1.5) + # set line color and width
labs(title = "Dairy Queen fat calories normal quantile plot") + # plot title
theme_light()
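A rough numeric companion to the density and quantile plots is the 68-95-99.7 rule: for nearly normal data, about 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean. The helper `within_sd()` below is a hypothetical name introduced for this sketch, and the input vector is a toy example rather than the Dairy Queen data.

```r
# Share of observations within k standard deviations of the sample mean
# (hypothetical helper, not part of any package)
within_sd <- function(x, k) {
  x <- x[!is.na(x)]                      # ignore missing values
  mean(abs(x - mean(x)) <= k * sd(x))    # proportion inside the band
}

x <- 1:9                                 # toy data for illustration only
coverage <- sapply(1:3, function(k) within_sd(x, k))
names(coverage) <- c("1 sd", "2 sd", "3 sd")

# Reference values for a normal distribution, for comparison
normal_reference <- pnorm(1:3) - pnorm(-(1:3))

print(round(coverage, 3))
print(round(normal_reference, 3))
```

Calling `within_sd(dairy_queen$cal_fat, 1)` and so on, assuming `dairy_queen` is defined as above, would give the observed shares to compare against the reference values.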
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?
No, not all of the points fall on the line. There is some wiggle in the middle, but every point is very close to the line, and the ends of the line are strongly linear.
set.seed(19) # set seed for random simulation
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd) # simulate normal data using DQ mean & sd
ggplot() +
geom_line(stat = "qq", aes(sample = sim_norm), # plot the simulated data in a normal quantile line
color= "blue", # set line color & width
size = 1.5) +
labs(title = "Normal simulation",
subtitle = "using Dairy Queen mean and sd") + # plot title
theme_light()
The normal quantile plots for the DQ data and the simulated normal data are close; the right skew and fat tails show up at the ends of the DQ qq line.
rc <- dairy_queen$restaurant # create a vector with DQ restaurant name
rc[43:84] <- "Normal" # append "Normal" labels for the 42 simulated values
cal_fat <- append(dairy_queen$cal_fat, sim_norm) # DQ cal_fat data followed by the simulated normal data
qqdata <- tibble("restaurant" = rc, cal_fat) # combine labels and values into one dataset
ggplot(data = qqdata) +
geom_line(stat = "qq", # create a normal quantile plot overlapping
aes(sample = cal_fat, # the two lines - DQ & normal sim data
color = restaurant),
size = 1.5) +
labs(title = "Dairy Queen qq plot v. Normal simulation") + # plot title
scale_color_manual(labels = c("DQ", "Normal"), # create guide
values = c("red", "blue")) + # assign colors
theme_light()
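The combined dataset above relies on the hard-coded index range `43:84`, which only works when the observed data has exactly 42 rows. A minimal sketch of a length-independent alternative follows, using base R and hypothetical values; the names `observed`, `simulated`, and `qq_long` are assumptions made for this example.

```r
set.seed(19)

# Hypothetical observed values, for illustration only (not from fastfood)
observed <- c(120, 180, 220, 260, 310, 450)
n <- length(observed)

# Simulate the same number of normal values with matching mean and sd
simulated <- rnorm(n, mean = mean(observed), sd = sd(observed))

# Build labels with rep() so nothing depends on a fixed row count
qq_long <- data.frame(
  source = rep(c("Observed", "Normal"), each = n),
  value  = c(observed, simulated)
)

print(table(qq_long$source))
```

Because the labels are derived from `n`, the same code keeps working if the filtered data gains or loses rows.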
Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories from fat are nearly normal?
The plots look similar; plots 7 and 8 in particular resemble the DQ data. The plots suggest that the DQ fat calorie data is nearly normal.
qqnormsim(sample = cal_fat, data = dairy_queen) + # create facet plot of DQ
theme_bw() # and several normal simulations
Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.
The McDonald’s data shows high values in the right tail; without these values, the data looks nearly normal.
mcdonalds <- fastfood %>% # create McDonald's dataset
filter(restaurant == "Mcdonalds")
qqnormsim(sample = cal_fat, data = mcdonalds) + # faceted plot of McD data and several normal simulations
labs(title = "McDonald's Fat Calorie distribution") +
theme_bw()
What is the probability that a randomly chosen Dairy Queen product has more than 600 calories from fat?
Using the normal distribution, the probability of more than 600 calories from fat is 1.5%.
1 - pnorm(q = 600, mean = dqmean, sd = dqsd) # use complement to find greater than
## [1] 0.01501523
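The complement used above is one of several equivalent ways to get an upper-tail probability from `pnorm()`. The sketch below shows three that agree; the mean and standard deviation are hypothetical placeholders, not the actual Dairy Queen values.

```r
# Hypothetical parameters, for illustration only
mu     <- 260
sigma  <- 150
cutoff <- 600

# 1. Complement of the lower tail (the approach used in this document)
p_complement <- 1 - pnorm(cutoff, mean = mu, sd = sigma)

# 2. Ask pnorm() for the upper tail directly
p_upper <- pnorm(cutoff, mean = mu, sd = sigma, lower.tail = FALSE)

# 3. Standardize to a z-score and use the standard normal
z   <- (cutoff - mu) / sigma
p_z <- pnorm(z, lower.tail = FALSE)

print(c(complement = p_complement, upper = p_upper, z_score = p_z))
```

The `lower.tail = FALSE` form tends to be more accurate for very small tail probabilities, where subtracting from 1 loses precision.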
The DQ data shows that the probability of more than 600 calories from fat is 4.8%. The difference is 3.3 percentage points, and the empirical value is more than three times the theoretical normal value, which shows the effect of the right-skewed data.
dairy_queen %>% # calculate percentage from DQ data
filter(cal_fat > 600) %>%
summarise(percent = n() / nrow(dairy_queen)) %>%
as.numeric()
## [1] 0.04761905
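The gap between the two estimates can be checked directly from the outputs printed above. This short sketch only reuses those two reported numbers; the variable names are chosen for the example.

```r
# Values copied from the outputs reported above
p_normal    <- 0.01501523   # theoretical: 1 - pnorm(600, dqmean, dqsd)
p_empirical <- 0.04761905   # empirical share of DQ items above 600

# Absolute gap in probability and ratio of empirical to theoretical
gap   <- p_empirical - p_normal
ratio <- p_empirical / p_normal

print(round(c(gap = gap, ratio = ratio), 3))
```

As a side note, an empirical proportion like this can also be computed compactly as `mean(dairy_queen$cal_fat > 600)`, since the mean of a logical vector is the share of `TRUE` values.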
Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?
Question 1. What is the probability that a randomly selected Arby’s product will have less than 9 grams of sugar?
arbys <- fastfood %>%
filter(restaurant == "Arbys")
rbmean <- mean(arbys$sugar)
rbsd <- sd(arbys$sugar)
ptn <- pnorm(9, rbmean, rbsd) %>% round(3)
paste("Theoretical Normal distribution percentage:", ptn)
## [1] "Theoretical Normal distribution percentage: 0.597"
ptm <- arbys %>%
filter(sugar < 9) %>%
summarise(percent = n() / nrow(arbys)) %>%
round(3)
paste("Empirical distribution percentage:", ptm)
## [1] "Empirical distribution percentage: 0.636"
paste("Difference = ", ptm - ptn)
## [1] "Difference = 0.039"
The two calculations differ by about 4 percentage points.
Question 2. What is the probability that a randomly selected Taco Bell product will have more than 5 grams of fiber?
tacobell <- fastfood %>%
filter(restaurant == "Taco Bell")
tbmean <- mean(tacobell$fiber)
tbsd <- sd(tacobell$fiber)
ptn <- 1 - pnorm(5, tbmean, tbsd) %>% round(3)
paste("Theoretical Normal distribution percentage:", ptn)
## [1] "Theoretical Normal distribution percentage: 0.592"
ptm <- tacobell %>%
filter(fiber > 5) %>%
summarise(percent = n() / nrow(tacobell)) %>%
round(3)
paste("Empirical distribution percentage:", ptm)
## [1] "Empirical distribution percentage: 0.443"
paste("Difference = ", ptn - ptm)
## [1] "Difference = 0.149"
The two calculations differ by almost 15 percentage points, so fiber content in Taco Bell products is probably not normally distributed.
The Arby’s sugar distribution is closer to its theoretical normal distribution than the Taco Bell fiber distribution.
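Both questions follow the same pattern: a normal-model probability, an empirical proportion, and the gap between them. A hedged sketch of that pattern as one reusable function is shown below; `compare_probs()` and its arguments are hypothetical names for this example, and the input is toy data.

```r
# Compare a normal-model probability with the empirical proportion
# (hypothetical helper, not part of any package)
compare_probs <- function(x, cutoff, lower_tail = TRUE) {
  x <- x[!is.na(x)]                                  # drop missing values
  theoretical <- pnorm(cutoff, mean = mean(x), sd = sd(x),
                       lower.tail = lower_tail)      # normal model
  empirical <- if (lower_tail) mean(x < cutoff) else mean(x > cutoff)
  c(theoretical = theoretical,
    empirical   = empirical,
    difference  = abs(empirical - theoretical))
}

x <- c(1, 2, 3, 4, 5)            # toy data for illustration only
result <- compare_probs(x, cutoff = 3.5)
print(round(result, 3))
```

Assuming `arbys` and `tacobell` exist as defined above, `compare_probs(arbys$sugar, 9)` and `compare_probs(tacobell$fiber, 5, lower_tail = FALSE)` would mirror the two questions, and the smaller `difference` identifies the closer agreement.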
Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?
ggplot(data = fastfood) +
geom_histogram(aes(x = sodium,
y = ..density..,
fill = restaurant)) +
labs(title = "Sodium content histograms") +
facet_wrap(vars(restaurant)) +
guides(fill = "none") +
theme_bw()
ggplot(data = fastfood) +
geom_line(stat = "qq",
aes(sample = sodium,
color = restaurant),
size = 1.5) +
labs(title = "Sodium content qq plots") +
facet_wrap(vars(restaurant)) +
guides(color = "none") +
theme_light()
Arby’s, Burger King, Taco Bell and Subway are close to normally distributed. Chick Fil-A, Dairy Queen, McDonald’s and Sonic are right skewed.
qqnormsim(sample = sodium, data = arbys) +
labs(title = "Arby's Sodium distribution") +
theme_bw()
fastfood %>% filter(restaurant == "Burger King") %>%
qqnormsim(sample = sodium) +
labs(title = "Burger King Sodium distribution") +
theme_bw()
fastfood %>% filter(restaurant == "Chick Fil-A") %>%
qqnormsim(sample = sodium) +
labs(title = "Chick Fil-A Sodium distribution") +
theme_bw()
qqnormsim(sample = sodium, data = dairy_queen) +
labs(title = "DQ Sodium distribution") +
theme_bw()
qqnormsim(sample = sodium, data = mcdonalds) +
labs(title = "McDonald's Sodium distribution") +
theme_bw()
fastfood %>% filter(restaurant == "Sonic") %>%
qqnormsim(sample = sodium) +
labs(title = "Sonic Sodium distribution") +
theme_bw()
fastfood %>% filter(restaurant == "Subway") %>%
qqnormsim(sample = sodium) +
labs(title = "Subway Sodium distribution") +
theme_bw()
qqnormsim(sample = sodium, data = tacobell) +
labs(title = "Taco Bell Sodium distribution") +
theme_bw()
Some of the normal probability plots for sodium distributions seem to have a step-wise pattern. Why do you think this might be the case?
The histograms for Chick Fil-A, DQ, and McDonald’s show gaps in their sodium content distributions, and these gaps appear as steps in the normal probability plots.
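One way such gaps and steps can arise is when values are reported on a coarse grid, so many items share the same value. The sketch below is an illustration under that assumption, using simulated data only; it does not establish how the sodium values in `fastfood` were recorded.

```r
set.seed(19)

# Hypothetical continuous values, then the same values rounded to the nearest 100
exact   <- rnorm(50, mean = 1200, sd = 400)
rounded <- round(exact, -2)

# Rounding collapses many distinct values into a few repeated ones
n_distinct_exact   <- length(unique(exact))
n_distinct_rounded <- length(unique(rounded))

# Runs of tied values become flat steps when sorted values are
# plotted against theoretical normal quantiles
tied_runs   <- rle(sort(rounded))$lengths
theoretical <- qnorm(ppoints(length(rounded)))

print(c(exact = n_distinct_exact, rounded = n_distinct_rounded))
print(max(tied_runs))
```

Plotting `sort(rounded)` against `theoretical` would show the staircase shape, while `sort(exact)` would trace a smoother line.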
Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
cfa <- fastfood %>% filter(restaurant == "Chick Fil-A")
ggplot(data = cfa) +
geom_line(stat = "qq",
aes(sample = total_carb),
color = "green4",
size = 1.5) +
labs(title = "Chick Fil-A total carbs qq plot") +
theme_light()
The total carbs data isn’t normally distributed and doesn’t look symmetric; it may have fat tails.
ggplot(data = cfa) +
geom_histogram(aes(x = total_carb,
y = ..density..
), fill = "green4") +
labs(title = "Chick Fil-A total carbs histogram") +
theme_bw()
The histogram shows that the data is not unimodal and is not normally distributed.