Setup

Load data (base R, no packages)

urls <- c(
  "https://www.openintro.org/data/csv/fastfood.csv",  # primary (works)
  "https://raw.githubusercontent.com/OpenIntroStat/openintro-r-package/master/inst/extdata/fastfood.csv",
  "https://raw.githubusercontent.com/OpenIntroStat/openintro/master/data/fastfood.csv"
)

fastfood <- NULL
for (u in urls) {
  fastfood <- tryCatch(read.csv(u), error = function(e) NULL)
  if (!is.null(fastfood)) { ff_url <- u; break }
}

if (is.null(fastfood)) {
  message("Could not download from the known URLs. If you have the file locally, upload it (Files pane → Upload) and set:")
  message('fastfood <- read.csv("fastfood.csv")')
  stop("fastfood.csv not found online; please upload and rerun.")
}

if (!all(c("restaurant","cal_fat") %in% names(fastfood))) {
  stop("Loaded file does not have expected columns: restaurant, cal_fat.")
}

str(fastfood[1:5])

## 'data.frame':    515 obs. of  5 variables:
##  $ restaurant: chr  "Mcdonalds" "Mcdonalds" "Mcdonalds" "Mcdonalds" ...
##  $ item      : chr  "Artisan Grilled Chicken Sandwich" "Single Bacon Smokehouse Burger" "Double Bacon Smokehouse Burger" "Grilled Bacon Smokehouse Chicken Sandwich" ...
##  $ calories  : int  380 840 1130 750 920 540 300 510 430 770 ...
##  $ cal_fat   : int  60 410 600 280 410 250 100 210 190 400 ...
##  $ total_fat : int  7 45 67 31 45 28 12 24 21 45 ...

Context. We’ll study calories from fat (cal_fat) for Dairy Queen and McDonald’s. We’ll: (E1) choose the variable and summarize it, (E2–E5) assess normality (histogram with normal overlay; QQ plots), and (E6–E7) compute probabilities two ways: from the theoretical normal model and from the empirical sample proportions.

E1. Choose restaurants & variable; compute n, mean, sd

dq <- subset(fastfood, restaurant == "Dairy Queen" & !is.na(cal_fat))
mc <- subset(fastfood, restaurant == "Mcdonalds"   & !is.na(cal_fat))

n_dq  <- nrow(dq); dq_mean <- mean(dq$cal_fat); dq_sd <- sd(dq$cal_fat)
n_mc  <- nrow(mc); mc_mean <- mean(mc$cal_fat); mc_sd <- sd(mc$cal_fat)
data.frame(Restaurant = c("Dairy Queen","McDonald’s"),
           n = c(n_dq, n_mc),
           mean_cal_fat = c(dq_mean, mc_mean),
           sd_cal_fat   = c(dq_sd,   mc_sd))

##    Restaurant  n mean_cal_fat sd_cal_fat
## 1 Dairy Queen 42     260.4762   156.4851
## 2  McDonald’s 57     285.6140   220.8993

Answer E1. Variable: cal_fat. Restaurants: Dairy Queen (n = 42, mean = 260.5, sd = 156.5) and McDonald’s (n = 57, mean = 285.6, sd = 220.9).

E2. Histogram with theoretical normal overlay (Dairy Queen)

hist(dq$cal_fat, breaks = 30, freq = FALSE,
     main = "Dairy Queen: cal_fat with Normal Overlay",
     xlab = "Calories from fat", ylab = "Density", col = "grey90", border = "white")
curve(dnorm(x, mean = dq_mean, sd = dq_sd), add = TRUE, lwd = 2)

Answer E2 (shape). The histogram is unimodal with mild right skew. The normal curve tracks the center reasonably well but misses the tails, so the normal model is only an approximation for these data.
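With only 42 Dairy Queen items, 30 breaks leaves most bins nearly empty, which can make the shape harder to judge. A minimal sketch, assuming a simulated stand-in sample (not the real data) drawn with the mean and sd reported in E1, compares base R's rule-of-thumb bin counts. The seed and the stand-in sample are illustrative choices, not part of the lab.

```r
# Stand-in sample: simulated, NOT the real Dairy Queen data.
set.seed(606)                       # arbitrary seed, for reproducibility only
x <- rnorm(42, mean = 260.4762, sd = 156.4851)

# Rule-of-thumb bin counts from grDevices (loaded by default in R).
k_sturges <- nclass.Sturges(x)      # ceiling(log2(n) + 1); depends only on n
k_fd      <- nclass.FD(x)           # Freedman-Diaconis; depends on the IQR
c(Sturges = k_sturges, FD = k_fd)

# Bin the stand-in sample at the coarser resolution without drawing a plot.
# A single number passed to breaks is treated by hist() as a suggestion.
h <- hist(x, breaks = k_sturges, plot = FALSE)
h$counts
```

With the lab's objects in memory, passing `breaks = nclass.Sturges(dq$cal_fat)` to the `hist()` call in E2 would give a coarser picture to set beside the 30-bin version.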

E3. Simulate from a matching Normal; QQ plot of the simulated sample

sim_norm_dq <- rnorm(n = n_dq, mean = dq_mean, sd = dq_sd)
qqnorm(sim_norm_dq, main = "QQ Plot — Simulated Normal (Dairy Queen)")
qqline(sim_norm_dq, col = "red", lwd = 2)

Answer E3. As expected for true normals, the simulated points track the QQ line closely, with only tiny tail deviations.
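A single simulated QQ plot is one draw, so it helps to see how much wobble is ordinary at n = 42. A rough sketch, assuming the E1 summary values for Dairy Queen and using the correlation between theoretical and sample quantiles as an ad hoc straightness score (an illustrative yardstick, not something the lab prescribes):

```r
set.seed(606)                       # arbitrary seed, for reproducibility only
n <- 42; mu <- 260.4762; sigma <- 156.4851   # Dairy Queen values from E1

# Straightness score: correlation of the QQ plot's x and y coordinates.
# qq_cor is a hypothetical helper name introduced for this sketch.
qq_cor <- function(x) {
  q <- qqnorm(x, plot.it = FALSE)   # returns theoretical (x) and sample (y) quantiles
  cor(q$x, q$y)
}

# Score 1000 samples that are normal by construction.
sim_cors <- replicate(1000, qq_cor(rnorm(n, mu, sigma)))
quantile(sim_cors, c(0.05, 0.50, 0.95))
```

With the lab's objects in memory, `qq_cor(dq$cal_fat)` can be compared against this range; a value below the 5% quantile would suggest more curvature than sampling noise alone tends to produce.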

E4. QQ plot for the real Dairy Queen data & comparison

qqnorm(dq$cal_fat, main = "QQ Plot — Dairy Queen cal_fat (real data)")
qqline(dq$cal_fat, col = "red", lwd = 2)

Answer E4. The real cal_fat values show systematic tail departures from the line (more curvature in the extremes than the simulated sample in E3), so the data are only approximately normal.

E5. Repeat the normality check for McDonald’s

hist(mc$cal_fat, breaks = 30, freq = FALSE,
     main = "McDonald’s: cal_fat with Normal Overlay",
     xlab = "Calories from fat", ylab = "Density", col = "grey90", border = "white")
curve(dnorm(x, mean = mc_mean, sd = mc_sd), add = TRUE, lwd = 2)

qqnorm(mc$cal_fat, main = "QQ Plot — McDonald’s cal_fat (real data)")
qqline(mc$cal_fat, col = "red", lwd = 2)

Answer E5. McDonald’s looks a bit closer to normal in the center, though the tails still deviate. Overall: roughly normal.
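Calories from fat cannot be negative, so one quick sanity check on the normal model is how much probability it places below zero. A small sketch, assuming the rounded E1 summary values stand in for the lab's `dq_mean`, `dq_sd`, `mc_mean`, and `mc_sd`:

```r
# Summary values copied from the E1 table.
dq_mean <- 260.4762; dq_sd <- 156.4851
mc_mean <- 285.6140; mc_sd <- 220.8993

# Probability the fitted normal model assigns to impossible negative values.
p_neg <- c(
  "Dairy Queen" = pnorm(0, dq_mean, dq_sd),
  "McDonald's"  = pnorm(0, mc_mean, mc_sd)
)
round(p_neg, 3)
```

The fitted normal places about 5% of its mass below zero for Dairy Queen and about 10% for McDonald’s, because each sd is large relative to its mean. That is one concrete way the model and the data must disagree in the lower tail.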

E6. Probability questions — theoretical Normal vs empirical proportions

We’ll compute two probabilities for each restaurant (four in total) with both methods:
- \(P(200 \le \text{cal\_fat} \le 400)\)
- \(P(\text{cal\_fat} \ge 500)\)

# Dairy Queen
p_theo_dq_200_400 <- pnorm(400, dq_mean, dq_sd) - pnorm(200, dq_mean, dq_sd)
p_emp_dq_200_400  <- mean(dq$cal_fat >= 200 & dq$cal_fat <= 400)
p_theo_dq_ge500   <- 1 - pnorm(500, dq_mean, dq_sd)
p_emp_dq_ge500    <- mean(dq$cal_fat >= 500)

# McDonald’s
p_theo_mc_200_400 <- pnorm(400, mc_mean, mc_sd) - pnorm(200, mc_mean, mc_sd)
p_emp_mc_200_400  <- mean(mc$cal_fat >= 200 & mc$cal_fat <= 400)
p_theo_mc_ge500   <- 1 - pnorm(500, mc_mean, mc_sd)
p_emp_mc_ge500    <- mean(mc$cal_fat >= 500)

out6 <- data.frame(
  Restaurant = c("Dairy Queen","Dairy Queen","McDonald’s","McDonald’s"),
  Event      = c("200<=cal_fat<=400","cal_fat>=500","200<=cal_fat<=400","cal_fat>=500"),
  Theoretical_Normal = round(c(p_theo_dq_200_400, p_theo_dq_ge500, p_theo_mc_200_400, p_theo_mc_ge500), 3),
  Empirical_Sample   = round(c(p_emp_dq_200_400,  p_emp_dq_ge500,  p_emp_mc_200_400,  p_emp_mc_ge500), 3)
)
out6

##    Restaurant             Event Theoretical_Normal Empirical_Sample
## 1 Dairy Queen 200<=cal_fat<=400              0.464            0.357
## 2 Dairy Queen      cal_fat>=500              0.063            0.071
## 3  McDonald’s 200<=cal_fat<=400              0.349            0.474
## 4  McDonald’s      cal_fat>=500              0.166            0.105

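Answer E6 (compare). Agreement is mixed. For the middle interval (200–400) the two methods differ by about 0.11 at Dairy Queen (0.464 vs 0.357) and about 0.125 at McDonald’s (0.349 vs 0.474), and the direction flips: the normal model overstates the interval for Dairy Queen and understates it for McDonald’s. The tail event (≥ 500) is closer in absolute terms: 0.063 vs 0.071 for Dairy Queen and 0.166 vs 0.105 for McDonald’s, where the normal model overstates the upper tail. These gaps are consistent with the skew and QQ-plot departures seen earlier.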
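The theoretical column comes from standardizing each cutoff and reading the standard normal CDF: \(P(a \le X \le b) = \Phi\big((b-\mu)/\sigma\big) - \Phi\big((a-\mu)/\sigma\big)\). A short sketch, assuming the rounded Dairy Queen mean and sd from E1, checks that the z-score form matches the `pnorm(x, mean, sd)` calls used in the E6 code:

```r
mu <- 260.4762; sigma <- 156.4851   # Dairy Queen values from the E1 table

# Standardize the two cutoffs.
z_lo <- (200 - mu) / sigma
z_hi <- (400 - mu) / sigma

# Same probability computed two ways.
p_z      <- pnorm(z_hi) - pnorm(z_lo)                      # standard normal CDF
p_direct <- pnorm(400, mu, sigma) - pnorm(200, mu, sigma)  # mean/sd form

round(c(z_lo = z_lo, z_hi = z_hi, p_z = p_z, p_direct = p_direct), 3)
```

Both forms give the 0.464 reported in the first row of the E6 table; the interval runs from roughly 0.39 sd below the mean to 0.89 sd above it.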

E7. Two more events (both methods) with interpretation

We add:
- Q7a (Dairy Queen): \(P(\text{cal\_fat} \le 250)\)
- Q7b (McDonald’s): \(P(300 \le \text{cal\_fat} \le 450)\)

p_theo_dq_le250 <- pnorm(250, dq_mean, dq_sd)
p_emp_dq_le250  <- mean(dq$cal_fat <= 250)

p_theo_mc_300_450 <- pnorm(450, mc_mean, mc_sd) - pnorm(300, mc_mean, mc_sd)
p_emp_mc_300_450  <- mean(mc$cal_fat >= 300 & mc$cal_fat <= 450)

out7 <- data.frame(
  Restaurant = c("Dairy Queen","McDonald’s"),
  Event = c("cal_fat<=250","300<=cal_fat<=450"),
  Theoretical_Normal = round(c(p_theo_dq_le250, p_theo_mc_300_450), 3),
  Empirical_Sample   = round(c(p_emp_dq_le250,  p_emp_mc_300_450), 3)
)
out7

##    Restaurant             Event Theoretical_Normal Empirical_Sample
## 1 Dairy Queen      cal_fat<=250              0.473            0.595
## 2  McDonald’s 300<=cal_fat<=450              0.246            0.246

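Answer E7. The McDonald’s interval (300–450) matches to three decimals (0.246 vs 0.246), but the Dairy Queen event (≤ 250) does not: the normal model gives 0.473, while 0.595 of the items fall at or below 250. That cutoff sits just under the Dairy Queen mean (260.5), so the gap reflects right skew, with more than half the items below the mean, rather than a tail effect. The normal model remains only an approximation for these menu data.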
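Empirical proportions from 42 or 57 items are noisy, which matters when reading the gaps in these tables. A rough sketch, assuming the data really were normal with the E1 Dairy Queen mean and sd (an assumption made only to gauge sampling noise), simulates how much the proportion at or below 250 varies across samples of size 42:

```r
set.seed(606)                       # arbitrary seed, for reproducibility only
n <- 42; mu <- 260.4762; sigma <- 156.4851   # Dairy Queen values from E1

# Proportion of values <= 250 in each of 5000 simulated normal samples.
sim_props <- replicate(5000, mean(rnorm(n, mu, sigma) <= 250))

p_theo <- pnorm(250, mu, sigma)     # the model's own probability for the event
c(theoretical = p_theo, sim_mean = mean(sim_props), sim_sd = sd(sim_props))

# Share of simulated samples at least as far from p_theo as the observed 0.595.
mean(abs(sim_props - p_theo) >= abs(0.595 - p_theo))
```

The simulated sd should land near the binomial value sqrt(p(1 - p)/n), roughly 0.08 here, so a gap of a few hundredths is ordinary sampling noise, while the observed 0.12 gap is about one and a half standard deviations.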

Brief conclusion

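Normality: Both restaurants’ cal_fat distributions are roughly normal in the center, with tail deviations.
Simulated vs. real: Simulated normals align with the QQ line; real data depart at the extremes.
Probabilities: The theoretical and empirical values agree closely for some events (within 0.01 for Dairy Queen ≥ 500 and McDonald’s 300–450) and differ by more than 0.1 for others (the 200–400 interval at both restaurants and Dairy Queen ≤ 250), so normal-based probabilities should be treated as rough estimates.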

DATA606 Lab 4 — The Normal Distribution

Sachi Kapoor

2025-10-04