In this lab we work with the fastfood data from the
openintro package.
library(tidyverse)  # provides glimpse(), %>%, filter(), and ggplot()
data("fastfood", package = "openintro")
glimpse(fastfood)
## Rows: 515
## Columns: 17
## $ restaurant <chr> "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdon…
## $ item <chr> "Artisan Grilled Chicken Sandwich", "Single Bacon Smokehou…
## $ calories <dbl> 380, 840, 1130, 750, 920, 540, 300, 510, 430, 770, 380, 62…
## $ cal_fat <dbl> 60, 410, 600, 280, 410, 250, 100, 210, 190, 400, 170, 300,…
## $ total_fat <dbl> 7, 45, 67, 31, 45, 28, 12, 24, 21, 45, 18, 34, 20, 34, 8, …
## $ sat_fat <dbl> 2.0, 17.0, 27.0, 10.0, 12.0, 10.0, 5.0, 4.0, 11.0, 21.0, 4…
## $ trans_fat <dbl> 0.0, 1.5, 3.0, 0.5, 0.5, 1.0, 0.5, 0.0, 1.0, 2.5, 0.0, 1.5…
## $ cholesterol <dbl> 95, 130, 220, 155, 120, 80, 40, 65, 85, 175, 40, 95, 125, …
## $ sodium <dbl> 1110, 1580, 1920, 1940, 1980, 950, 680, 1040, 1040, 1290, …
## $ total_carb <dbl> 44, 62, 63, 62, 81, 46, 33, 49, 35, 42, 38, 48, 48, 67, 31…
## $ fiber <dbl> 3, 2, 3, 2, 4, 3, 2, 3, 2, 3, 2, 3, 3, 5, 2, 2, 3, 3, 5, 2…
## $ sugar <dbl> 11, 18, 18, 18, 18, 9, 7, 6, 7, 10, 5, 11, 11, 11, 6, 3, 1…
## $ protein <dbl> 37, 46, 70, 55, 46, 25, 15, 25, 25, 51, 15, 32, 42, 33, 13…
## $ vit_a <dbl> 4, 6, 10, 6, 6, 10, 10, 0, 20, 20, 2, 10, 10, 10, 2, 4, 6,…
## $ vit_c <dbl> 20, 20, 20, 25, 20, 2, 2, 4, 4, 6, 0, 10, 20, 15, 2, 6, 15…
## $ calcium <dbl> 20, 20, 50, 20, 20, 15, 10, 2, 15, 20, 15, 35, 35, 35, 4, …
## $ salad <chr> "Other", "Other", "Other", "Other", "Other", "Other", "Oth…
We will often focus on just the McDonald's and Dairy Queen items (the dataset spells the former "Mcdonalds").
mcdonalds <- fastfood %>%
filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
filter(restaurant == "Dairy Queen")
Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?
mc_dq <- fastfood %>%
filter(restaurant %in% c("Mcdonalds", "Dairy Queen"))
ggplot(mc_dq, aes(x = cal_fat)) +
geom_histogram(binwidth = 50, color = "white") +
facet_wrap(~ restaurant, ncol = 1, scales = "free_y") +
labs(
x = "Calories from fat",
y = "Count",
title = "Calories from fat by restaurant"
)
Answer (Exercise 1):
The distributions of calories from fat for McDonald's and Dairy Queen are
both right-skewed with a single main mode. Their centers are similar.
The spreads differ more: Dairy Queen items are more tightly clustered,
while McDonald's has the wider range and the longer right tail, with a
few items far above the rest of its menu.
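A numeric summary can back up the visual comparison of center and spread. The sketch below is not part of the original lab; it assumes the dplyr and openintro packages are installed, and the names fat_summary, n_items, mean_fat, median_fat, sd_fat, and iqr_fat are chosen here for illustration.

```r
library(dplyr)

# Load the lab's dataset (assumes the openintro package is installed)
data("fastfood", package = "openintro")

# Center (mean, median) and spread (sd, IQR) of calories from fat,
# one row per restaurant
fat_summary <- fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen")) %>%
  group_by(restaurant) %>%
  summarise(
    n_items    = n(),
    mean_fat   = mean(cal_fat, na.rm = TRUE),
    median_fat = median(cal_fat, na.rm = TRUE),
    sd_fat     = sd(cal_fat, na.rm = TRUE),
    iqr_fat    = IQR(cal_fat, na.rm = TRUE)
  )

fat_summary
```

Comparing mean_fat with median_fat gives a rough read on skew, and sd_fat and iqr_fat put numbers on the spread that the histograms show.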
Now we focus only on Dairy Queen items and examine whether the calories from fat are nearly normal.
dqmean <- mean(dairy_queen$cal_fat, na.rm = TRUE)
dqsd <- sd(dairy_queen$cal_fat, na.rm = TRUE)
dqmean
## [1] 260.4762
dqsd
## [1] 156.4851
We now draw a density histogram of
cal_fat for Dairy Queen and overlay a normal curve with the
same mean and standard deviation.
ggplot(dairy_queen, aes(x = cal_fat)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 50, color = "white") +
stat_function(fun = dnorm,
args = list(mean = dqmean, sd = dqsd),
col = "tomato", linewidth = 1) +
labs(
x = "Calories from fat (Dairy Queen)",
y = "Density",
title = "Dairy Queen calories from fat with normal curve"
)
Comment:
The histogram is unimodal and somewhat right-skewed. The normal curve
roughly follows the middle of the distribution but does not match the
heavier right tail; there are more very high-fat items than a perfect
normal distribution would predict.
Based on the normal probability plot for Dairy Queen’s calories from fat, does it appear that the data follow a nearly normal distribution?
We construct the Q–Q plot for Dairy Queen calories from fat.
ggplot(dairy_queen, aes(sample = cal_fat)) +
geom_line(stat = "qq") +
labs(
x = "Theoretical normal quantiles",
y = "Sample quantiles of cal_fat",
title = "Normal Q-Q plot: Dairy Queen calories from fat"
)
Answer (Exercise 2):
Most of the points lie roughly along a straight line in the middle, but
the upper tail bends upward away from that trend, showing that the
largest values are larger than would be expected under a normal
distribution. The data are close to normal in the center but have a
noticeably heavy right tail, so a normal model is only a rough
approximation.
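Judging whether the tail departs from a straight line is easier when a reference line is actually drawn. As a hedged alternative to the plot above (not the lab's original code), ggplot2's geom_qq() draws the individual points and geom_qq_line() adds a line through the first and third quartiles. The object name dq_qq is an assumption made here, and the sketch assumes ggplot2, dplyr, and openintro are installed.

```r
library(ggplot2)
library(dplyr)

# Rebuild the Dairy Queen subset so the example runs on its own
data("fastfood", package = "openintro")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

# Points from geom_qq(), reference line from geom_qq_line()
dq_qq <- ggplot(dairy_queen, aes(sample = cal_fat)) +
  geom_qq() +
  geom_qq_line(color = "tomato") +
  labs(
    x = "Theoretical normal quantiles",
    y = "Sample quantiles of cal_fat",
    title = "Normal Q-Q plot with reference line: Dairy Queen"
  )

dq_qq
```

Points that sit above the line at the right-hand end correspond to values larger than a normal distribution would predict.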
To see what Q–Q plots look like when we know the data are normal, we simulate a new dataset with the same mean and standard deviation and the same number of observations as the Dairy Queen items.
set.seed(1234)  # arbitrary seed so the simulated sample is reproducible
sim_norm <- rnorm(
  n = nrow(dairy_queen),
  mean = dqmean,
  sd = dqsd
)
We make a Q–Q plot for these simulated normal data.
ggplot(mapping = aes(sample = sim_norm)) +
geom_line(stat = "qq") +
labs(
x = "Theoretical normal quantiles",
y = "Sample quantiles of sim_norm",
title = "Normal Q-Q plot: simulated normal data"
)
Answer (Exercise 3, part 1):
For the simulated normal data, the points fall very close to a straight
line, with only small random wiggles. This is what we expect when the
data truly come from a normal distribution.
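It can help to see what a normal Q-Q plot computes. The base R sketch below builds the coordinates by hand: the sorted observations are paired with standard normal quantiles at evenly spaced probabilities from ppoints(). The data are simulated stand-ins (the sample size, mean, and sd are arbitrary choices made here), and the variable names are illustrative.

```r
# Simulated stand-in data; in the lab this would be dairy_queen$cal_fat
set.seed(42)
x <- rnorm(40, mean = 260, sd = 156)

n <- length(x)
sample_q <- sort(x)                 # ordered observations (y-axis)
theoretical_q <- qnorm(ppoints(n))  # matching standard normal quantiles (x-axis)

# Plotting one against the other gives the normal Q-Q plot
plot(theoretical_q, sample_q,
     xlab = "Theoretical normal quantiles",
     ylab = "Sample quantiles",
     main = "Q-Q plot built by hand")

# Base R's qqnorm() returns the same coordinates (in the original data order)
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theoretical_q)
all.equal(sort(qq$y), sample_q)
```

If the data are normal, each ordered observation is close to a linear function of its theoretical quantile, which is why the points fall near a straight line.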
We now compare the Q–Q plot of the real data to several simulated normal Q–Q plots.
First we define a helper function qqnormsim (in case it
is not already available in the environment).
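qqnormsim <- function(sample, data) {
  # Capture the unquoted column name and pull that column from the data
  var_name <- deparse(substitute(sample))
  x <- data[[var_name]]
  x <- x[!is.na(x)]
  # Draw a 3 x 3 grid and restore the previous layout on exit
  old_par <- par(mfrow = c(3, 3))
  on.exit(par(old_par))
  # Panel 1: the observed data
  qqnorm(x, main = paste("Data:", var_name))
  qqline(x)
  # Panels 2-9: normal samples with the same n, mean, and sd as the data
  for (i in 1:8) {
    sim <- rnorm(length(x), mean(x), sd(x))
    qqnorm(sim, main = paste("Normal sim", i))
    qqline(sim)
  }
}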
Now we call the function:
qqnormsim(sample = cal_fat, data = dairy_queen)
Answer (Exercise 3, part 2):
The Q–Q plot for the real Dairy Queen data looks
similar to many of the simulated normal plots, but its
upper tail shows slightly larger deviations from the diagonal than most
of the simulated plots. This suggests that assuming normality is
not terrible, but there is some extra skew and heavy
tail behavior.
Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?
Answer (Exercise 4):
Yes. When compared to the simulated normal Q–Q plots, the Dairy Queen
calories from fat plot shows patterns that are broadly
similar: roughly linear through the middle with moderate
deviations at the extreme tails. This provides reasonable evidence that
the distribution is approximately normal, though the
heavier right tail means it is not perfectly normal.
Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.
We repeat the analysis for Mcdonalds.
mc_mean <- mean(mcdonalds$cal_fat, na.rm = TRUE)
mc_sd <- sd(mcdonalds$cal_fat, na.rm = TRUE)
ggplot(mcdonalds, aes(x = cal_fat)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 50, color = "white") +
stat_function(fun = dnorm,
args = list(mean = mc_mean, sd = mc_sd),
col = "steelblue") +
labs(
x = "Calories from fat (Mcdonalds)",
y = "Density",
title = "Mcdonalds calories from fat with normal curve"
)
ggplot(mcdonalds, aes(sample = cal_fat)) +
geom_line(stat = "qq") +
labs(
x = "Theoretical normal quantiles",
y = "Sample quantiles of cal_fat",
title = "Normal Q-Q plot: Mcdonalds calories from fat"
)
Answer (Exercise 5):
For McDonald's, the histogram is unimodal but right-skewed. The Q-Q plot
is curved rather than straight: the points in the right tail rise
steeply above the trend of the middle points, and the left tail flattens
out instead of dipping below it. This convex shape indicates right skew.
The distribution is less normal than the Dairy Queen calories from fat
distribution, so McDonald's calories from fat do not appear to be well
modeled by a normal distribution.
We now use the normal model for Dairy Queen calories from fat to
answer probability questions, assuming cal_fat is
approximately N(dqmean, dqsd^2).
What is the probability that a randomly chosen Dairy Queen item has more than 600 calories from fat?
Theoretical probability (normal model):
p_theoretical <- 1 - pnorm(q = 600, mean = dqmean, sd = dqsd)
p_theoretical
## [1] 0.01501523
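The same tail probability can be reached by standardizing the cutoff first. The sketch below hard-codes the mean and standard deviation copied from the printed output earlier in this lab (so they are rounded), converts 600 to a z-score, and takes the upper-tail area of the standard normal. The names z, p_from_z, and p_direct are chosen here for illustration.

```r
# Values copied (rounded) from the printed dqmean and dqsd output
dqmean <- 260.4762
dqsd   <- 156.4851

# Standardize the cutoff: how many standard deviations is 600 above the mean?
z <- (600 - dqmean) / dqsd

# Upper-tail area under the standard normal curve
p_from_z <- pnorm(z, lower.tail = FALSE)

# The same quantity computed directly on the original scale
p_direct <- 1 - pnorm(600, mean = dqmean, sd = dqsd)

z
p_from_z
all.equal(p_from_z, p_direct)
```

A cutoff a little more than two standard deviations above the mean leaves only a small upper-tail area, which is why the normal model assigns this event such a low probability.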
Empirical probability (using the data):
p_empirical <- dairy_queen %>%
summarise(prop = mean(cal_fat > 600))
p_empirical
## # A tibble: 1 × 1
## prop
## <dbl>
## 1 0.0476
Answer:
The normal model gives a probability of about 0.015 that a Dairy Queen
item has more than 600 calories from fat, while the empirical proportion
from the data is about 0.048, roughly three times as large. The gap is
consistent with the heavy right tail seen in the Q-Q plot: the normal
model underestimates how often very high values occur. Because the
sample is small, the empirical proportion rests on only a few items and
is itself imprecise.
Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution. Which one had a closer agreement between the two methods?
Q1: For Dairy Queen, what is the probability that
cal_fat is less than 300?
# Theoretical (normal)
p1_theoretical <- pnorm(q = 300, mean = dqmean, sd = dqsd)
# Empirical
p1_empirical <- dairy_queen %>%
summarise(prop = mean(cal_fat < 300))
p1_theoretical
## [1] 0.5997007
p1_empirical
## # A tibble: 1 × 1
## prop
## <dbl>
## 1 0.667
Q2: For Mcdonalds, what is the probability that
cal_fat is between 250 and 450?
# Theoretical (normal)
p2_theoretical <- pnorm(450, mean = mc_mean, sd = mc_sd) -
pnorm(250, mean = mc_mean, sd = mc_sd)
# Empirical
p2_empirical <- mcdonalds %>%
summarise(prop = mean(cal_fat >= 250 & cal_fat <= 450))
p2_theoretical
## [1] 0.3356534
p2_empirical
## # A tibble: 1 × 1
## prop
## <dbl>
## 1 0.351
Answer (Exercise 6):
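For both questions the theoretical and empirical probabilities are
fairly similar, but the agreement is closer for the McDonald's question
(about 0.336 vs. 0.351, a difference near 0.015) than for the Dairy
Queen question (about 0.600 vs. 0.667, a difference near 0.067). This is
somewhat surprising, because the Dairy Queen distribution looked closer
to normal overall. Agreement on one particular interval does not show
that the normal model fits everywhere: a skewed distribution can match
the normal probability over a range near its center and still miss in
the tails.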
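The theoretical-versus-empirical comparison used in these questions follows one pattern, which can be wrapped in a small function. compare_probs below is a hypothetical helper written for this note, not part of the lab or of any package; it uses inclusive bounds and fits the normal model with the sample mean and standard deviation.

```r
# Hypothetical helper: P(lower <= X <= upper) computed two ways
compare_probs <- function(x, lower = -Inf, upper = Inf) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sd(x)
  # Normal-model probability using the sample mean and sd
  theoretical <- pnorm(upper, mean = m, sd = s) - pnorm(lower, mean = m, sd = s)
  # Share of observed values that fall in the interval
  empirical <- mean(x >= lower & x <= upper)
  c(theoretical = theoretical,
    empirical   = empirical,
    abs_diff    = abs(theoretical - empirical))
}

# Small made-up vector, only to show the call
toy <- c(1, 2, 3, 4, 5)
compare_probs(toy, lower = 2, upper = 4)
```

With the lab's data, a call such as compare_probs(mcdonalds$cal_fat, 250, 450) would mirror Q2. Because the bounds are inclusive, a strict inequality like cal_fat < 300 in Q1 can give a slightly different empirical value whenever items sit exactly on the cutoff.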
Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?
We first create Q–Q plots of sodium by restaurant.
ggplot(fastfood, aes(sample = sodium)) +
geom_line(stat = "qq") +
facet_wrap(~ restaurant, scales = "free") +
labs(
x = "Theoretical normal quantiles",
y = "Sample quantiles of sodium",
title = "Normal Q-Q plots of sodium by restaurant"
)
We can also look at histograms:
ggplot(fastfood, aes(x = sodium)) +
geom_histogram(binwidth = 100, color = "white") +
facet_wrap(~ restaurant, scales = "free_y") +
labs(
x = "Sodium (mg)",
y = "Count",
title = "Distribution of sodium by restaurant"
)
Answer (Exercise 7):
Based on the Q–Q plots and histograms, the sodium distributions for
Subway and Taco Bell appear closest to
normal: their Q–Q plot points follow a fairly straight line with only
mild deviations at the tails. Other restaurants show more curvature,
clustering, or strong skewness.
Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?
Answer (Exercise 8):
Sodium values are typically reported in discrete
increments (for example multiples of 10 or 20 mg). This
rounding causes many menu items to share exactly the same sodium value,
so when we sort the data there are repeated quantiles.
In a Q–Q plot these repeated values produce horizontal segments, giving
the plot a stepwise appearance instead of a smooth
line.
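The effect of rounding can be reproduced with simulated data. The sketch below draws made-up normal values (the sample size, mean, and sd are arbitrary and are not taken from the sodium data) and rounds them to a coarse step. The step of 100 is deliberately exaggerated so that the flat segments are easy to see.

```r
set.seed(1)
raw  <- rnorm(200, mean = 1200, sd = 300)  # made-up continuous values
step <- 100                                # exaggerated rounding increment
rounded <- round(raw / step) * step        # values reported in multiples of `step`

# Rounding collapses many distinct values onto a few repeated ones
length(unique(raw))
length(unique(rounded))

# Side-by-side Q-Q plots: smooth for the raw values, stepped for the rounded ones
old_par <- par(mfrow = c(1, 2))
qqnorm(raw, main = "Unrounded")
qqline(raw)
qqnorm(rounded, main = "Rounded")
qqline(rounded)
par(old_par)
```

Each flat segment in the rounded plot is a group of observations that share one reported value but are matched to different theoretical quantiles.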
Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
We choose Subway and examine
total_carb.
subway <- fastfood %>%
filter(restaurant == "Subway")
ggplot(subway, aes(sample = total_carb)) +
geom_line(stat = "qq") +
labs(
x = "Theoretical normal quantiles",
y = "Sample quantiles of total_carb",
title = "Normal Q-Q plot: Subway total carbohydrates"
)
ggplot(subway, aes(x = total_carb)) +
geom_histogram(binwidth = 5, color = "white") +
labs(
x = "Total carbohydrates (g)",
y = "Count",
title = "Histogram: Subway total carbohydrates"
)
Answer (Exercise 9):
The Q-Q plot for Subway total carbohydrates shows points that bend
upward in the right tail, and the histogram has a long
right tail. This indicates that the distribution of total carbohydrates
is right-skewed: most items have moderate carb levels,
with a smaller number of high-carb items pulling the distribution to the
right.
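A quick numeric cross-check on skew direction is to compare the mean with the median: a long right tail usually pulls the mean above the median. This is a rule of thumb rather than a guarantee. skew_check below is a hypothetical helper written for this note, demonstrated on simulated right-skewed data rather than on the Subway values.

```r
# Hypothetical helper: compare mean and median as a rough skew check
skew_check <- function(x) {
  x <- x[!is.na(x)]
  c(mean = mean(x),
    median = median(x),
    mean_minus_median = mean(x) - median(x))
}

# Simulated right-skewed data (exponential), only to show the call
set.seed(7)
right_skewed <- rexp(1000, rate = 1 / 50)
skew_check(right_skewed)
```

Calling skew_check(subway$total_carb) would apply the same check to the variable examined above. A clearly positive mean_minus_median would agree with the right skew read from the plots.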