The Normal Distribution

The data

In this lab we work with the fastfood data from the openintro package.

library(tidyverse)
data("fastfood", package = "openintro")
glimpse(fastfood)

## Rows: 515
## Columns: 17
## $ restaurant  <chr> "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdon…
## $ item        <chr> "Artisan Grilled Chicken Sandwich", "Single Bacon Smokehou…
## $ calories    <dbl> 380, 840, 1130, 750, 920, 540, 300, 510, 430, 770, 380, 62…
## $ cal_fat     <dbl> 60, 410, 600, 280, 410, 250, 100, 210, 190, 400, 170, 300,…
## $ total_fat   <dbl> 7, 45, 67, 31, 45, 28, 12, 24, 21, 45, 18, 34, 20, 34, 8, …
## $ sat_fat     <dbl> 2.0, 17.0, 27.0, 10.0, 12.0, 10.0, 5.0, 4.0, 11.0, 21.0, 4…
## $ trans_fat   <dbl> 0.0, 1.5, 3.0, 0.5, 0.5, 1.0, 0.5, 0.0, 1.0, 2.5, 0.0, 1.5…
## $ cholesterol <dbl> 95, 130, 220, 155, 120, 80, 40, 65, 85, 175, 40, 95, 125, …
## $ sodium      <dbl> 1110, 1580, 1920, 1940, 1980, 950, 680, 1040, 1040, 1290, …
## $ total_carb  <dbl> 44, 62, 63, 62, 81, 46, 33, 49, 35, 42, 38, 48, 48, 67, 31…
## $ fiber       <dbl> 3, 2, 3, 2, 4, 3, 2, 3, 2, 3, 2, 3, 3, 5, 2, 2, 3, 3, 5, 2…
## $ sugar       <dbl> 11, 18, 18, 18, 18, 9, 7, 6, 7, 10, 5, 11, 11, 11, 6, 3, 1…
## $ protein     <dbl> 37, 46, 70, 55, 46, 25, 15, 25, 25, 51, 15, 32, 42, 33, 13…
## $ vit_a       <dbl> 4, 6, 10, 6, 6, 10, 10, 0, 20, 20, 2, 10, 10, 10, 2, 4, 6,…
## $ vit_c       <dbl> 20, 20, 20, 25, 20, 2, 2, 4, 4, 6, 0, 10, 20, 15, 2, 6, 15…
## $ calcium     <dbl> 20, 20, 50, 20, 20, 15, 10, 2, 15, 20, 15, 35, 35, 35, 4, …
## $ salad       <chr> "Other", "Other", "Other", "Other", "Other", "Other", "Oth…

We will often focus on just McDonald's (stored as "Mcdonalds" in the restaurant column) and Dairy Queen items.

mcdonalds <- fastfood %>% 
  filter(restaurant == "Mcdonalds")

dairy_queen <- fastfood %>% 
  filter(restaurant == "Dairy Queen")

Exercise 1

Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

mc_dq <- fastfood %>% 
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen"))

ggplot(mc_dq, aes(x = cal_fat)) +
  geom_histogram(binwidth = 50, color = "white") +
  facet_wrap(~ restaurant, ncol = 1, scales = "free_y") +
  labs(
    x = "Calories from fat",
    y = "Count",
    title = "Calories from fat by restaurant"
  )

Answer (Exercise 1):
The distributions of calories from fat for McDonald's and Dairy Queen are both unimodal and right-skewed. Their centers are similar, with most items at both restaurants below about 400 calories from fat. The spreads differ more: McDonald's has the longer right tail, with a few items far above the rest, while Dairy Queen items fall in a narrower range.

The normal distribution

Now we focus only on Dairy Queen items and examine whether the calories from fat are nearly normal.

dqmean <- mean(dairy_queen$cal_fat, na.rm = TRUE)
dqsd   <- sd(dairy_queen$cal_fat, na.rm = TRUE)

dqmean

## [1] 260.4762

dqsd

## [1] 156.4851

We now draw a density histogram of cal_fat for Dairy Queen and overlay a normal curve with the same mean and standard deviation.

ggplot(dairy_queen, aes(x = cal_fat)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 50, color = "white") +
  stat_function(fun = dnorm,
                args = list(mean = dqmean, sd = dqsd),
                col = "tomato", linewidth = 1) +
  labs(
    x = "Calories from fat (Dairy Queen)",
    y = "Density",
    title = "Dairy Queen calories from fat with normal curve"
  )

Comment:
The histogram is unimodal and somewhat right-skewed. The normal curve tracks the middle of the distribution reasonably well but understates the right tail: there are more very high-fat items than a normal distribution with this mean and standard deviation would predict.
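The density scale matters here because the curve drawn by dnorm has total area 1, so the histogram bars must also have total area 1 for the two to be comparable. The sketch below checks this in base R on a hypothetical right-skewed sample; the gamma draw, the seed, and the sample size of 42 are stand-ins chosen for illustration, not the Dairy Queen data.

```r
# Hypothetical right-skewed sample standing in for cal_fat (not the real data)
set.seed(1)
x <- rgamma(42, shape = 3, scale = 90)

# Bin into width-50 bins without drawing anything
breaks <- seq(0, ceiling(max(x) / 50) * 50, by = 50)
h <- hist(x, breaks = breaks, plot = FALSE)

# On the density scale the bar areas add up to 1, like the area under dnorm()
density_area <- sum(h$density * diff(h$breaks))

# On the count scale the bar areas add up to n * binwidth instead
count_area <- sum(h$counts * diff(h$breaks))

density_area
count_area
```

On the count scale the bars would tower over the curve, which is why the plot above maps y to the density rather than the count.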

Exercise 2

Based on the normal probability plot for Dairy Queen’s calories from fat, does it appear that the data follow a nearly normal distribution?

We construct the Q–Q plot for Dairy Queen calories from fat.

ggplot(dairy_queen, aes(sample = cal_fat)) +
  geom_line(stat = "qq") +
  geom_qq_line(linetype = "dashed") +
  labs(
    x = "Theoretical normal quantiles",
    y = "Sample quantiles of cal_fat",
    title = "Normal Q-Q plot: Dairy Queen calories from fat"
  )

Answer (Exercise 2):
Through the middle of the distribution the points fall roughly along a straight line, but in the upper tail they bend above it: the largest values are larger than a normal distribution would produce. The data are therefore approximately normal in the center, with a noticeably heavy right tail.
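To make clear what a Q–Q plot actually shows, the sketch below builds its coordinates by hand in base R: the sorted sample on one axis and standard normal quantiles at evenly spaced plotting positions on the other. The sample is hypothetical (a gamma draw with an arbitrary seed), and the names theoretical, observed, and qq_manual are introduced only for this illustration.

```r
# Hypothetical right-skewed sample (not the Dairy Queen data)
set.seed(2)
x <- rgamma(42, shape = 3, scale = 90)

n <- length(x)

# Standard normal quantiles at the plotting positions used by ppoints()
theoretical <- qnorm(ppoints(n))

# Sample quantiles are simply the sorted observations
observed <- sort(x)

qq_manual <- data.frame(theoretical, observed)

# Base R's qqnorm() computes the same theoretical quantiles
qq_base <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq_base$x), theoretical)

head(qq_manual)
```

If the sample were normal, observed would be close to a linear function of theoretical; curvature in either tail signals a departure from normality.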

Simulated normal data

To see what Q–Q plots look like when we know the data are normal, we simulate a new dataset with the same mean and standard deviation and the same number of observations as the Dairy Queen items.

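# Fix the random seed so the simulated sample is reproducible (the value is arbitrary)
set.seed(20240101)
sim_norm <- rnorm(
  n = nrow(dairy_queen),
  mean = dqmean,
  sd   = dqsd
)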

Exercise 3

We make a Q–Q plot for these simulated normal data and compare it with the plot for the real data.

ggplot(mapping = aes(sample = sim_norm)) +
  geom_line(stat = "qq") +
  labs(
    x = "Theoretical normal quantiles",
    y = "Sample quantiles of sim_norm",
    title = "Normal Q-Q plot: simulated normal data"
  )

Answer (Exercise 3, part 1):
For the simulated normal data, the points fall very close to a straight line, with only small random wiggles. This is what we expect when the data truly come from a normal distribution.

Comparing many normal Q–Q plots

We now compare the Q–Q plot of the real data to several simulated normal Q–Q plots.

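First we define a helper function qqnormsim, in case it is not already available in the environment. This version uses base R graphics: it draws the Q–Q plot of the real data in the first panel, followed by eight Q–Q plots of simulated normal samples with the same size, mean, and standard deviation.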

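qqnormsim <- function(sample, data) {
  # Capture the unquoted column name (e.g. cal_fat) and pull that column
  var_name <- deparse(substitute(sample))
  x <- data[[var_name]]
  x <- x[!is.na(x)]
  # 3 x 3 grid of plots; restore the previous graphics settings on exit
  old_par <- par(mfrow = c(3, 3))
  on.exit(par(old_par))
  # First panel: the real data, titled with the variable name
  qqnorm(x, main = paste("Data:", var_name))
  qqline(x)
  # Remaining panels: normal samples matching the data's n, mean, and sd
  for (i in 1:8) {
    sim <- rnorm(length(x), mean(x), sd(x))
    qqnorm(sim, main = paste("Normal sim", i))
    qqline(sim)
  }
}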

Now we call the function:

qqnormsim(sample = cal_fat, data = dairy_queen)

Answer (Exercise 3, part 2):
The Q–Q plot for the real Dairy Queen data looks similar to many of the simulated normal plots, but its upper tail shows slightly larger deviations from the diagonal than most of the simulated plots. This suggests that assuming normality is not terrible, but there is some extra skew and heavy tail behavior.
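The visual comparison can be backed by a number. One option, sketched below, is to measure how straight a Q–Q plot is by the correlation between the sorted sample and the theoretical normal quantiles, and then see how that correlation varies across many simulated normal samples. The helper qq_cor is hypothetical (it is not part of any package used in this lab), and the sample size of 42 and the seed are placeholders; in practice n would be nrow(dairy_queen).

```r
# Straightness of a normal Q-Q plot: correlation between the sorted sample
# and the standard normal quantiles at the ppoints() plotting positions.
# qq_cor is a hypothetical helper written for this illustration.
qq_cor <- function(x) {
  x <- sort(x[!is.na(x)])
  cor(x, qnorm(ppoints(length(x))))
}

set.seed(3)
n <- 42  # placeholder sample size; replace with nrow(dairy_queen)

# Reference distribution: Q-Q correlations for 1000 truly normal samples
sim_cors <- replicate(1000, qq_cor(rnorm(n)))

# Typical and unusually low values under normality
quantile(sim_cors, c(0.05, 0.5))
```

With the real data loaded, qq_cor(dairy_queen$cal_fat) could be compared with the lower quantile of sim_cors: a value well below it would suggest more curvature than sampling variability alone explains.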

Exercise 4

Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?

Answer (Exercise 4):
Yes. When compared to the simulated normal Q–Q plots, the Dairy Queen calories from fat plot shows patterns that are broadly similar: roughly linear through the middle with moderate deviations at the extreme tails. This provides reasonable evidence that the distribution is approximately normal, though the heavier right tail means it is not perfectly normal.

Exercise 5

Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

We repeat the analysis for Mcdonalds.

mc_mean <- mean(mcdonalds$cal_fat, na.rm = TRUE)
mc_sd   <- sd(mcdonalds$cal_fat, na.rm = TRUE)

ggplot(mcdonalds, aes(x = cal_fat)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 50, color = "white") +
  stat_function(fun = dnorm,
                args = list(mean = mc_mean, sd = mc_sd),
                col = "steelblue") +
  labs(
    x = "Calories from fat (Mcdonalds)",
    y = "Density",
    title = "Mcdonalds calories from fat with normal curve"
  )

ggplot(mcdonalds, aes(sample = cal_fat)) +
  geom_line(stat = "qq") +
  geom_qq_line(linetype = "dashed") +
  labs(
    x = "Theoretical normal quantiles",
    y = "Sample quantiles of cal_fat",
    title = "Normal Q-Q plot: Mcdonalds calories from fat"
  )

Answer (Exercise 5):
For McDonald's, the histogram is unimodal and right-skewed. In the Q–Q plot the points curve upward in the right tail and flatten in the left tail, which is the signature of right skew. The departure from a straight line is more pronounced than for Dairy Queen, so calories from fat at McDonald's are not well modeled by a normal distribution.

Normal probabilities

We now use the normal model for Dairy Queen calories from fat to answer probability questions. We assume cal_fat is approximately N(dqmean, dqsd^2) and compare each model-based probability with the corresponding proportion in the data.

What is the probability that a randomly chosen Dairy Queen item has more than 600 calories from fat?

Theoretical probability (normal model):

p_theoretical <- 1 - pnorm(q = 600, mean = dqmean, sd = dqsd)
p_theoretical

## [1] 0.01501523

Empirical probability (using the data):

p_empirical <- dairy_queen %>%
  summarise(prop = mean(cal_fat > 600))

p_empirical

## # A tibble: 1 × 1
##     prop
##    <dbl>
## 1 0.0476

Answer:
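The normal model gives a probability of about 0.015 that a Dairy Queen item has more than 600 calories from fat, while the empirical proportion is about 0.048. The two differ by only about 3 percentage points, but the empirical value is roughly three times the theoretical one. The normal model understates the upper tail, which is consistent with the right skew seen in the histogram and the Q–Q plot. With a sample this small, each item above 600 also moves the empirical proportion noticeably.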
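The same theoretical probability can be reached by standardizing first, which makes the reasoning behind pnorm explicit. The sketch below uses the mean and standard deviation as printed earlier in this lab (rounded to the digits shown); z and p_upper are names introduced for this illustration.

```r
# Mean and standard deviation copied from the output printed earlier
dqmean <- 260.4762
dqsd   <- 156.4851

# Z-score: how many standard deviations 600 lies above the mean
z <- (600 - dqmean) / dqsd

# Upper-tail area of the standard normal beyond z
p_upper <- pnorm(z, lower.tail = FALSE)

z
p_upper
```

A cutoff a little more than two standard deviations above the mean leaves only a small upper-tail area, which agrees with the value of p_theoretical reported above.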

Exercise 6

Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution. Which one had a closer agreement between the two methods?

Question 1

Q1: For Dairy Queen, what is the probability that cal_fat is less than 300?

# Theoretical (normal)
p1_theoretical <- pnorm(q = 300, mean = dqmean, sd = dqsd)

# Empirical
p1_empirical <- dairy_queen %>% 
  summarise(prop = mean(cal_fat < 300))

p1_theoretical

## [1] 0.5997007

p1_empirical

## # A tibble: 1 × 1
##    prop
##   <dbl>
## 1 0.667

Question 2

Q2: For Mcdonalds, what is the probability that cal_fat is between 250 and 450?

# Theoretical (normal)
p2_theoretical <- pnorm(450, mean = mc_mean, sd = mc_sd) - 
  pnorm(250, mean = mc_mean, sd = mc_sd)

# Empirical
p2_empirical <- mcdonalds %>% 
  summarise(prop = mean(cal_fat >= 250 & cal_fat <= 450))

p2_theoretical

## [1] 0.3356534

p2_empirical

## # A tibble: 1 × 1
##    prop
##   <dbl>
## 1 0.351

Answer (Exercise 6):
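The theoretical and empirical probabilities are fairly similar for both questions, but the agreement is closer for Question 2 (McDonald's): 0.336 versus 0.351, a gap of about 0.015, compared with 0.600 versus 0.667, a gap of about 0.067, for Question 1 (Dairy Queen). This holds even though the Dairy Queen distribution looks closer to normal overall. Agreement depends on the particular region being asked about, so a single probability question is not a reliable guide to how normal a distribution is.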
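A quick way to compare the two questions is to compute the absolute gap between the theoretical and empirical values. The sketch below uses the numbers exactly as printed above (the empirical proportions are rounded to three digits in the tibble output); the vector name gaps and its element names are introduced for this illustration.

```r
# Values copied from the printed output for Questions 1 and 2
gaps <- c(
  dairy_queen_lt_300 = abs(0.5997007 - 0.667),
  mcdonalds_250_450  = abs(0.3356534 - 0.351)
)

# Absolute differences between the normal model and the data
round(gaps, 3)

# Question with the closer agreement
names(which.min(gaps))
```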

Exercise 7

Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

We first create Q–Q plots of sodium by restaurant.

ggplot(fastfood, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  facet_wrap(~ restaurant, scales = "free") +
  labs(
    x = "Theoretical normal quantiles",
    y = "Sample quantiles of sodium",
    title = "Normal Q-Q plots of sodium by restaurant"
  )

We can also look at histograms:

ggplot(fastfood, aes(x = sodium)) +
  geom_histogram(binwidth = 100, color = "white") +
  facet_wrap(~ restaurant, scales = "free_y") +
  labs(
    x = "Sodium (mg)",
    y = "Count",
    title = "Distribution of sodium by restaurant"
  )

Answer (Exercise 7):
Based on the Q–Q plots and histograms, the sodium distributions for Subway and Taco Bell appear closest to normal: their Q–Q plot points follow a fairly straight line with only mild deviations at the tails. Other restaurants show more curvature, clustering, or strong skewness.

Exercise 8

Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

Answer (Exercise 8):
Sodium values are typically reported in discrete increments (for example multiples of 10 or 20 mg). This rounding causes many menu items to share exactly the same sodium value, so when we sort the data there are repeated quantiles. In a Q–Q plot these repeated values produce horizontal segments, giving the plot a stepwise appearance instead of a smooth line.
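The effect of rounding can be reproduced with simulated values. The sketch below draws hypothetical continuous sodium amounts, rounds them to the nearest 10 mg, and counts how many distinct values remain; the mean, standard deviation, sample size, and seed are made up for illustration and are not estimates from the fastfood data.

```r
# Hypothetical continuous sodium amounts in mg (not the real data)
set.seed(4)
sodium_exact <- rnorm(100, mean = 1200, sd = 300)

# Round to the nearest 10 mg, as a nutrition label might
sodium_rounded <- round(sodium_exact / 10) * 10

# Rounding collapses nearby values into ties
n_distinct_exact   <- length(unique(sodium_exact))
n_distinct_rounded <- length(unique(sodium_rounded))

n_distinct_exact
n_distinct_rounded

# Each tie is a repeated sample quantile, i.e. a flat step in the Q-Q plot
sum(duplicated(sodium_rounded))
```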

Exercise 9

Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

We choose Subway and examine total_carb.

subway <- fastfood %>% 
  filter(restaurant == "Subway")

ggplot(subway, aes(sample = total_carb)) +
  geom_line(stat = "qq") +
  labs(
    x = "Theoretical normal quantiles",
    y = "Sample quantiles of total_carb",
    title = "Normal Q-Q plot: Subway total carbohydrates"
  )

ggplot(subway, aes(x = total_carb)) +
  geom_histogram(binwidth = 5, color = "white") +
  labs(
    x = "Total carbohydrates (g)",
    y = "Count",
    title = "Histogram: Subway total carbohydrates"
  )

Answer (Exercise 9):
The Q–Q plot for Subway total carbohydrates shows points that bend upward in the right tail, and the histogram has a long right tail. This indicates that the distribution of total carbohydrates is right–skewed: most items have moderate carb levels, with a smaller number of high–carb items pulling the distribution to the right.
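A numerical check can complement the two plots. The sketch below computes a moment-based sample skewness, which is positive for right-skewed data, zero for symmetric data, and negative for left-skewed data. The helper sample_skewness is hypothetical (written for this illustration, not taken from a package), and it is demonstrated on a simulated right-skewed sample rather than on the Subway data.

```r
# Moment-based sample skewness: third central moment divided by the
# 1.5 power of the second central moment (hypothetical helper)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical right-skewed sample standing in for total_carb
set.seed(5)
carbs_like <- rgamma(200, shape = 2, scale = 25)

sample_skewness(carbs_like)

# In right-skewed data the mean is usually pulled above the median
c(mean = mean(carbs_like), median = median(carbs_like))
```

With the real data loaded, sample_skewness(subway$total_carb) would give the corresponding value for Subway.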