# Candidate locations for fastfood.csv, tried in order until one loads.
urls <- c(
  "https://www.openintro.org/data/csv/fastfood.csv", # primary (works)
  "https://raw.githubusercontent.com/OpenIntroStat/openintro-r-package/master/inst/extdata/fastfood.csv",
  "https://raw.githubusercontent.com/OpenIntroStat/openintro/master/data/fastfood.csv"
)
fastfood <- NULL
for (u in urls) {
  # A failed download yields NULL instead of stopping the script.
  fastfood <- tryCatch(read.csv(u), error = function(e) NULL)
  if (!is.null(fastfood)) { ff_url <- u; break }
}
if (is.null(fastfood)) {
  message("Could not download from the known URLs. If you have the file locally, upload it (Files pane → Upload) and set:")
  message('fastfood <- read.csv("fastfood.csv")')
  stop("fastfood.csv not found online; please upload and rerun.")
}
# Fail early if the columns used below are missing.
if (!all(c("restaurant", "cal_fat") %in% names(fastfood))) {
  stop("Loaded file does not have expected columns: restaurant, cal_fat.")
}
str(fastfood[1:5])
## 'data.frame': 515 obs. of 5 variables:
## $ restaurant: chr "Mcdonalds" "Mcdonalds" "Mcdonalds" "Mcdonalds" ...
## $ item : chr "Artisan Grilled Chicken Sandwich" "Single Bacon Smokehouse Burger" "Double Bacon Smokehouse Burger" "Grilled Bacon Smokehouse Chicken Sandwich" ...
## $ calories : int 380 840 1130 750 920 540 300 510 430 770 ...
## $ cal_fat : int 60 410 600 280 410 250 100 210 190 400 ...
## $ total_fat : int 7 45 67 31 45 28 12 24 21 45 ...
Context. We’ll study calories from fat (cal_fat) for Dairy Queen and McDonald’s. We’ll: (E1) pick a variable and summarize it, (E2–E5) assess normality (histogram with normal overlay; QQ plots), and (E6–E7) compute probabilities two ways: theoretical normal vs. empirical.
dq <- subset(fastfood, restaurant == "Dairy Queen" & !is.na(cal_fat))
mc <- subset(fastfood, restaurant == "Mcdonalds" & !is.na(cal_fat))
n_dq <- nrow(dq); dq_mean <- mean(dq$cal_fat); dq_sd <- sd(dq$cal_fat)
n_mc <- nrow(mc); mc_mean <- mean(mc$cal_fat); mc_sd <- sd(mc$cal_fat)
data.frame(Restaurant = c("Dairy Queen", "McDonald’s"),
           n = c(n_dq, n_mc),
           mean_cal_fat = c(dq_mean, mc_mean),
           sd_cal_fat = c(dq_sd, mc_sd))
## Restaurant n mean_cal_fat sd_cal_fat
## 1 Dairy Queen 42 260.4762 156.4851
## 2 McDonald’s 57 285.6140 220.8993
Answer E1. Variable: cal_fat. Restaurants: Dairy Queen (n = 42, mean = 260.5, sd = 156.5) and McDonald’s (n = 57, mean = 285.6, sd = 220.9).
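The summary numbers already hint at how well a normal model can fit. The sketch below copies the means and standard deviations from the table above (it does not reload the CSV) and computes two quick diagnostics: the coefficient of variation, and the probability a normal model with those parameters would assign to negative calories. The data frame name `stats_copy` and its column names are illustrative.

```r
# Summary statistics copied from the table above (not recomputed from the CSV).
stats_copy <- data.frame(
  restaurant = c("Dairy Queen", "McDonald's"),
  m = c(260.4762, 285.6140),   # mean of cal_fat
  s = c(156.4851, 220.8993)    # standard deviation of cal_fat
)

# Coefficient of variation: spread relative to the mean.
stats_copy$cv <- stats_copy$s / stats_copy$m

# Probability mass a normal model with these parameters places below zero.
stats_copy$p_below_zero <- pnorm(0, mean = stats_copy$m, sd = stats_copy$s)

print(stats_copy)
```

Both standard deviations are more than half the mean, so the fitted normal curves put roughly 5% (Dairy Queen) and 10% (McDonald’s) of their area on negative values that cannot occur. That alone suggests the lower tail will not match a normal shape.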
hist(dq$cal_fat, breaks = 30, freq = FALSE,
     main = "Dairy Queen: cal_fat with Normal Overlay",
     xlab = "Calories from fat", ylab = "Density", col = "grey90", border = "white")
curve(dnorm(x, mean = dq_mean, sd = dq_sd), add = TRUE, lwd = 2)
Answer E2 (shape). The histogram is unimodal with mild right skew. The normal curve tracks the center reasonably well but misses the tails, so the distribution is only approximately normal.
set.seed(42)  # arbitrary seed so the simulated sample and its plot are reproducible
sim_norm_dq <- rnorm(n = n_dq, mean = dq_mean, sd = dq_sd)
qqnorm(sim_norm_dq, main = "QQ Plot — Simulated Normal (Dairy Queen)")
qqline(sim_norm_dq, col = "red", lwd = 2)
Answer E3. As expected for true normals, the simulated points track the QQ line closely, with only tiny tail deviations.
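One simulated sample shows only one possible outcome. A minimal sketch, assuming the Dairy Queen n, mean, and SD from the summary table, repeats the simulation many times and records how straight each QQ plot is, measured as the correlation between theoretical and sample quantiles. The helper `qq_cor` and the variable names are hypothetical and are not part of the original analysis.

```r
# Dairy Queen values copied from the summary table above.
n_sim <- 42
m_dq  <- 260.4762
s_dq  <- 156.4851

# Hypothetical helper: correlation between theoretical and sample quantiles,
# i.e. how close the QQ plot is to a straight line (1 = perfectly straight).
qq_cor <- function(x) {
  q <- qqnorm(x, plot.it = FALSE)  # list with x = theoretical, y = sample quantiles
  cor(q$x, q$y)
}

set.seed(2024)  # arbitrary seed, only for reproducibility
sim_cors <- replicate(500, qq_cor(rnorm(n_sim, m_dq, s_dq)))

# Range of "straightness" that chance alone produces at n = 42.
quantile(sim_cors, probs = c(0.05, 0.25, 0.50))
```

Computing `qq_cor(dq$cal_fat)` on the real data and comparing it with these quantiles gives a rough sense of whether the observed curvature exceeds ordinary simulation noise.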
qqnorm(dq$cal_fat, main = "QQ Plot — Dairy Queen cal_fat (real data)")
qqline(dq$cal_fat, col = "red", lwd = 2)
Answer E4. The real cal_fat values show systematic tail departures from the line (more curvature in the extremes), so the data are only approximately normal.
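To see what systematic tail departure looks like numerically, the sketch below compares quantiles of the fitted normal with those of a right-skewed distribution that has the same mean and SD. The gamma distribution is an assumption chosen only as a generic right-skewed stand-in; it is not fitted to the menu data, and the name `qq_compare` is illustrative.

```r
# Dairy Queen mean and SD copied from the summary table above.
m <- 260.4762
s <- 156.4851

# Assumption: a gamma distribution matched to the same mean and SD serves as a
# generic right-skewed stand-in. It is not fitted to the menu data.
shape <- (m / s)^2   # gamma mean = shape / rate   = m
rate  <- m / s^2     # gamma var  = shape / rate^2 = s^2

p <- c(0.01, 0.05, 0.50, 0.95, 0.99)
qq_compare <- data.frame(
  prob         = p,
  normal_q     = round(qnorm(p, mean = m, sd = s), 1),
  right_skew_q = round(qgamma(p, shape = shape, rate = rate), 1)
)
print(qq_compare)
```

The lowest normal quantile is negative while every gamma quantile is positive, and the upper gamma quantiles reach beyond the normal ones. On a QQ plot, that pattern typically appears as points bending upward away from the line at both ends.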
hist(mc$cal_fat, breaks = 30, freq = FALSE,
     main = "McDonald’s: cal_fat with Normal Overlay",
     xlab = "Calories from fat", ylab = "Density", col = "grey90", border = "white")
curve(dnorm(x, mean = mc_mean, sd = mc_sd), add = TRUE, lwd = 2)
qqnorm(mc$cal_fat, main = "QQ Plot — McDonald’s cal_fat (real data)")
qqline(mc$cal_fat, col = "red", lwd = 2)
Answer E5. McDonald’s looks a bit closer to normal in the center, though the tails still deviate. Overall: roughly normal.
We’ll evaluate two events per restaurant (four probabilities in total) with both methods:
1) \(P(200 \le \text{cal\_fat} \le 400)\)
2) \(P(\text{cal\_fat} \ge 500)\)
# Dairy Queen
p_theo_dq_200_400 <- pnorm(400, dq_mean, dq_sd) - pnorm(200, dq_mean, dq_sd)
p_emp_dq_200_400 <- mean(dq$cal_fat >= 200 & dq$cal_fat <= 400)
p_theo_dq_ge500 <- 1 - pnorm(500, dq_mean, dq_sd)
p_emp_dq_ge500 <- mean(dq$cal_fat >= 500)
# McDonald’s
p_theo_mc_200_400 <- pnorm(400, mc_mean, mc_sd) - pnorm(200, mc_mean, mc_sd)
p_emp_mc_200_400 <- mean(mc$cal_fat >= 200 & mc$cal_fat <= 400)
p_theo_mc_ge500 <- 1 - pnorm(500, mc_mean, mc_sd)
p_emp_mc_ge500 <- mean(mc$cal_fat >= 500)
out6 <- data.frame(
  Restaurant = c("Dairy Queen", "Dairy Queen", "McDonald’s", "McDonald’s"),
  Event = c("200<=cal_fat<=400", "cal_fat>=500", "200<=cal_fat<=400", "cal_fat>=500"),
  Theoretical_Normal = round(c(p_theo_dq_200_400, p_theo_dq_ge500, p_theo_mc_200_400, p_theo_mc_ge500), 3),
  Empirical_Sample = round(c(p_emp_dq_200_400, p_emp_dq_ge500, p_emp_mc_200_400, p_emp_mc_ge500), 3)
)
out6
## Restaurant Event Theoretical_Normal Empirical_Sample
## 1 Dairy Queen 200<=cal_fat<=400 0.464 0.357
## 2 Dairy Queen cal_fat>=500 0.063 0.071
## 3 McDonald’s 200<=cal_fat<=400 0.349 0.474
## 4 McDonald’s cal_fat>=500 0.166 0.105
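Answer E6 (compare). Agreement is mixed and does not follow a simple center-versus-tail pattern. For the middle interval (200–400), the normal model overestimates the Dairy Queen proportion (0.464 vs. 0.357) and underestimates the McDonald’s proportion (0.349 vs. 0.474), gaps of about 0.11 and 0.13. For the tail event (≥ 500), the absolute gaps are smaller (0.063 vs. 0.071 for Dairy Queen; 0.166 vs. 0.105 for McDonald’s), although the McDonald’s gap is large relative to the probability itself. Both kinds of mismatch are consistent with data that are only approximately normal.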
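The comparison is easier to read when the gaps are tabulated directly. The sketch below re-enters the rounded values from the table above (it does not recompute them from the data) and ranks the four events by absolute difference. The name `out6_copy` and its column names are illustrative.

```r
# Rounded values copied from the out6 table above.
out6_copy <- data.frame(
  restaurant  = c("Dairy Queen", "Dairy Queen", "McDonald's", "McDonald's"),
  event       = c("200<=cal_fat<=400", "cal_fat>=500",
                  "200<=cal_fat<=400", "cal_fat>=500"),
  theoretical = c(0.464, 0.063, 0.349, 0.166),
  empirical   = c(0.357, 0.071, 0.474, 0.105)
)

# Absolute gap between the normal-model probability and the sample proportion.
out6_copy$abs_gap <- round(abs(out6_copy$theoretical - out6_copy$empirical), 3)

# Largest disagreement first.
out6_copy[order(-out6_copy$abs_gap), ]
```

The gaps are 0.107, 0.008, 0.125, and 0.061 in table order, so the two 200–400 events show the largest absolute disagreement.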
We add:
- Q7a (Dairy Queen): \(P(\text{cal\_fat} \le 250)\)
- Q7b (McDonald’s): \(P(300 \le \text{cal\_fat} \le 450)\)
p_theo_dq_le250 <- pnorm(250, dq_mean, dq_sd)
p_emp_dq_le250 <- mean(dq$cal_fat <= 250)
p_theo_mc_300_450 <- pnorm(450, mc_mean, mc_sd) - pnorm(300, mc_mean, mc_sd)
p_emp_mc_300_450 <- mean(mc$cal_fat >= 300 & mc$cal_fat <= 450)
out7 <- data.frame(
  Restaurant = c("Dairy Queen", "McDonald’s"),
  Event = c("cal_fat<=250", "300<=cal_fat<=450"),
  Theoretical_Normal = round(c(p_theo_dq_le250, p_theo_mc_300_450), 3),
  Empirical_Sample = round(c(p_emp_dq_le250, p_emp_mc_300_450), 3)
)
out7
## Restaurant Event Theoretical_Normal Empirical_Sample
## 1 Dairy Queen cal_fat<=250 0.473 0.595
## 2 McDonald’s 300<=cal_fat<=450 0.246 0.246
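Answer E7. For McDonald’s, the two methods agree to three decimals (0.246 vs. 0.246). For Dairy Queen they differ noticeably (0.473 theoretical vs. 0.595 empirical), even though 250 lies close to the mean of 260.5. More than half of the Dairy Queen items fall at or below 250, as expected under right skew, where the mean sits above the median. The normal model is therefore an approximation for these menu data, and it can miss near the center as well as in the tails.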
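The empirical proportions rest on only 42 and 57 items, so each carries sampling noise. As a rough sketch, assuming the menu items can be treated like a random sample (a simplification, since a menu is not randomly drawn), the binomial standard error puts the gaps in context. All values are copied from the tables above, and the variable names are illustrative.

```r
# Sample sizes and rounded probabilities copied from the tables above.
n      <- c(dq = 42, mc = 57)
p_emp  <- c(dq = 0.595, mc = 0.246)   # empirical proportions
p_theo <- c(dq = 0.473, mc = 0.246)   # normal-model probabilities

# Item counts implied by the rounded proportions.
counts <- round(p_emp * n)

# Binomial standard error of each empirical proportion
# (assumes items behave like independent draws, which is a simplification).
se <- sqrt(p_emp * (1 - p_emp) / n)

# Gap between the two methods, expressed in standard errors.
gap_in_se <- (p_emp - p_theo) / se

print(rbind(counts = counts, se = round(se, 3), gap_in_se = round(gap_in_se, 2)))
```

The implied counts are 25 of 42 and 14 of 57 items. Under this simplification the Dairy Queen gap is on the order of one to two standard errors, so part of the disagreement may reflect the small sample as well as the skew.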
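- Normality: Both restaurants’ cal_fat distributions are roughly normal in the center, with tail deviations.
- Simulated vs. real: Simulated normal samples align with the QQ line; the real data depart from it at the extremes.
- Probabilities: The theoretical normal and empirical proportions agree closely for some events (McDonald’s 300–450: 0.246 vs. 0.246) and differ by more than 0.1 for others (Dairy Queen ≤ 250, and 200–400 at both restaurants), so the normal model is a rough approximation for these data.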