We’re programming within the tidyverse and using data from the openintro library.
library(tidyverse)
library(openintro)
This data set from the openintro library contains data on 515 menu items from some of the most popular fast food restaurants worldwide.
data("fastfood", package='openintro')
Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?
Both restaurants have similar medians of about 200 calories from fat, and both distributions are skewed to the right. However, the McDonald’s data has more outliers, which affect the mean.
# Save subsets of the data by restaurant
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")
# Plot the McDonald's calories due to fat per menu item
ggplot(data = mcdonalds) +
  geom_histogram(color = "#e9ecef", alpha = 0.6, position = 'identity',
                 mapping = aes(x = cal_fat), binwidth = 50) +
  ggtitle("McDonalds")
# Plot the Dairy Queen calories due to fat per menu item
ggplot(data = dairy_queen) +
  geom_histogram(color = "#e9ecef", alpha = 0.6, position = 'identity',
                 mapping = aes(x = cal_fat), binwidth = 50) +
  ggtitle("Dairy Queen")
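The comparison of centers and spreads can also be checked numerically. The following is a minimal sketch, assuming the `fastfood` data loads as above; the summary column names (`n_items`, `median_fat`, and so on) are illustrative and not part of the original lab.

```r
library(tidyverse)
library(openintro)

data("fastfood", package = "openintro")

# Summarise calories from fat for the two restaurants being compared.
# Column names here are illustrative choices, not from the original lab.
fat_summary <- fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen")) %>%
  group_by(restaurant) %>%
  summarise(
    n_items    = n(),
    median_fat = median(cal_fat, na.rm = TRUE),  # center, robust to outliers
    mean_fat   = mean(cal_fat, na.rm = TRUE),    # center, pulled by outliers
    iqr_fat    = IQR(cal_fat, na.rm = TRUE),     # spread, robust to outliers
    sd_fat     = sd(cal_fat, na.rm = TRUE)       # spread, sensitive to outliers
  )

fat_summary
```

A mean noticeably larger than the median for a restaurant would be consistent with the right skew seen in the histograms.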
Based on this plot, does it appear that the data follow a nearly normal distribution?
Yes. It has one mode and seemingly an equal number of observations on either side of the mean (the mean is close to the median); however, the histogram is choppy.
dqmean <- mean(dairy_queen$cal_fat)
dqsd <- sd(dairy_queen$cal_fat)
ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)
The points fall roughly on the line. This is similar to the probability plot for the real data.
# Plot the real data
ggplot(data = dairy_queen, aes(sample = cal_fat)) +
  geom_line(stat = "qq")
# Generate the simulated data
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
sim_norm_df <- data.frame(sim_norm)
# Plot the simulated data
ggplot(data = sim_norm_df, aes(sample = sim_norm)) +
  geom_line(stat = "qq")
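Judging whether the points "fall on the line" is easier when a reference line is drawn. The sketch below is one way to do that, assuming ggplot2's `geom_qq()` and `geom_qq_line()` are used instead of `geom_line(stat = "qq")`; the object name `dq_qq` and the plot title are illustrative.

```r
library(tidyverse)
library(openintro)

data("fastfood", package = "openintro")

dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

# Points are the sample quantiles plotted against theoretical normal quantiles.
# By default the reference line passes through the first and third quartiles.
dq_qq <- ggplot(data = dairy_queen, aes(sample = cal_fat)) +
  geom_qq() +
  geom_qq_line(colour = "tomato") +
  ggtitle("Dairy Queen: calories from fat")

dq_qq
```

Systematic curvature away from the line, rather than scatter around it, is what suggests a departure from normality.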
Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?
The plots suggest that the calories are nearly normal.
qqnormsim(sample = cal_fat, data = dairy_queen)
Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.
The calories from fat on the McDonald’s menu do not appear to come from a normal distribution because of the long right tail.
qqnormsim(sample = cal_fat, data = mcdonalds)
Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?
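What is the probability that a randomly chosen Dairy Queen menu item has fewer than 200 calories from fat? Theoretically 35.0%, empirically 42.9%, a difference of 7.9 percentage points.
What is the probability that a randomly chosen Dairy Queen menu item has more than 400 calories from fat? Theoretically 18.6%, empirically 21.4%, a difference of 2.8 percentage points. This one had the closer agreement between the two methods.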
# Probability for the theoretical normal distribution is 0.350
pnorm(q = 200, mean = dqmean, sd = dqsd)
## [1] 0.3495757
# Probability using the empirical distribution is 0.429
dairy_queen %>%
  filter(cal_fat < 200) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 × 1
## percent
## <dbl>
## 1 0.429
# Probability for the theoretical normal distribution is 0.186
1 - pnorm(q = 400, mean = dqmean, sd = dqsd)
## [1] 0.1863007
# Probability using the empirical distribution is 0.214
dairy_queen %>%
  filter(cal_fat > 400) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 × 1
## percent
## <dbl>
## 1 0.214
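The two calculations above follow the same pattern, so they can be wrapped in a small helper. This is a sketch only: `compare_probs()` is a hypothetical function name, and the toy vector is made up purely to show the call. It is not data from the lab.

```r
# Hypothetical helper: compare the normal-model probability with the
# observed proportion for a numeric vector x and a cutoff q.
compare_probs <- function(x, q, lower_tail = TRUE) {
  # Theoretical probability from a normal model fitted with the sample mean and sd
  theoretical <- pnorm(q, mean = mean(x), sd = sd(x), lower.tail = lower_tail)
  # Empirical probability: share of observations strictly below (or above) q
  empirical <- if (lower_tail) mean(x < q) else mean(x > q)
  c(theoretical = theoretical,
    empirical   = empirical,
    difference  = abs(theoretical - empirical))
}

# Toy vector for illustration only (not from the fastfood data)
toy <- c(1, 2, 3, 4, 5)
compare_probs(toy, q = 3)
```

With the lab's data, the equivalent calls would be `compare_probs(dairy_queen$cal_fat, 200)` and `compare_probs(dairy_queen$cal_fat, 400, lower_tail = FALSE)`.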
Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which one’s distribution is the closest to normal for sodium?
Burger King’s sodium distribution is the closest to normal. Arby’s seems to be the second closest.
arbys <- fastfood %>%
  filter(restaurant == "Arbys")
burger_king <- fastfood %>%
  filter(restaurant == "Burger King")
chick_fila <- fastfood %>%
  filter(restaurant == "Chick Fil-A")
sonic <- fastfood %>%
  filter(restaurant == "Sonic")
subway <- fastfood %>%
  filter(restaurant == "Subway")
taco_bell <- fastfood %>%
  filter(restaurant == "Taco Bell")
ggplot(data = dairy_queen, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  ggtitle("Dairy Queen")
ggplot(data = mcdonalds, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  ggtitle("McDonalds")
ggplot(data = arbys, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  ggtitle("Arbys")
ggplot(data = burger_king, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  ggtitle("Burger King")
ggplot(data = chick_fila, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  ggtitle("Chick Fil-A")
ggplot(data = sonic, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  ggtitle("Sonic")
ggplot(data = subway, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  ggtitle("Subway")
ggplot(data = taco_bell, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  ggtitle("Taco Bell")
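The eight per-restaurant plots can also be drawn in a single figure, which makes side-by-side comparison easier. The following is a sketch under the assumption that faceting by the `restaurant` column is acceptable for this comparison; the object name `sodium_qq` is illustrative.

```r
library(tidyverse)
library(openintro)

data("fastfood", package = "openintro")

# One normal probability plot per restaurant, each with a reference line.
# scales = "free_y" lets each panel use its own sodium range.
sodium_qq <- ggplot(data = fastfood, aes(sample = sodium)) +
  geom_qq() +
  geom_qq_line(colour = "tomato") +
  facet_wrap(~ restaurant, scales = "free_y") +
  ggtitle("Normal probability plots of sodium by restaurant")

sodium_qq
```

This avoids creating a separate data frame for each restaurant; the panel whose points track the reference line most closely is the best candidate for "closest to normal".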
Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?
Stepwise patterns are more prominent in discrete data than in continuous data. This is probably because menu items tend to fall into a few groups: not salty (ice cream), a little salty (burgers), or very salty (fries).
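One way to probe how discrete the sodium values are is to count how many distinct values occur and how many are round numbers. This is a rough sketch; the multiple-of-10 check is an assumption about how the values might be rounded, not something stated in the lab, and the column names are illustrative.

```r
library(tidyverse)
library(openintro)

data("fastfood", package = "openintro")

# Few distinct values relative to the number of items, or a high share of
# round numbers, would both be consistent with a stepwise QQ plot.
sodium_check <- fastfood %>%
  summarise(
    n_items              = n(),
    n_distinct_values    = n_distinct(sodium),
    share_multiple_of_10 = mean(sodium %% 10 == 0, na.rm = TRUE)
  )

sodium_check
```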
As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
While this isn’t Keto Bell, there appear to be many low-carb menu options. The distribution looks skewed to the right.
ggplot(data = taco_bell) +
  geom_histogram(color = "#e9ecef", alpha = 0.6, position = 'identity',
                 mapping = aes(x = total_carb), binwidth = 5) +
  ggtitle("Taco Bell")
tbmean <- mean(taco_bell$total_carb)
tbsd <- sd(taco_bell$total_carb)
ggplot(data = taco_bell, aes(x = total_carb)) +
  geom_blank() +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, args = c(mean = tbmean, sd = tbsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
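The exercise also asks for a normal probability plot of total carbohydrates. A minimal sketch of that plot for Taco Bell follows, together with a mean-versus-median check; the object names `tb_qq` and `tb_center` are illustrative.

```r
library(tidyverse)
library(openintro)

data("fastfood", package = "openintro")

taco_bell <- fastfood %>%
  filter(restaurant == "Taco Bell")

# Normal probability plot with a reference line through the quartiles
tb_qq <- ggplot(data = taco_bell, aes(sample = total_carb)) +
  geom_qq() +
  geom_qq_line(colour = "tomato") +
  ggtitle("Taco Bell: total carbohydrates")

tb_qq

# A mean above the median is consistent with right skew
tb_center <- taco_bell %>%
  summarise(
    mean_carb   = mean(total_carb, na.rm = TRUE),
    median_carb = median(total_carb, na.rm = TRUE)
  )

tb_center
```

In a plot like this, a right-skewed sample tends to curve upward, away from the reference line, in the upper tail.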