The Normal Distribution

Getting Started

Load packages

We’re programming within the tidyverse and using data from the openintro library.

library(tidyverse)
library(openintro)

The data

This data set from the openintro library contains data on 515 menu items from some of the most popular fast food restaurants worldwide.

data("fastfood", package='openintro')

Exercises

Exercise 1

Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

Both distributions have similar medians of about 200 calories from fat, and both are skewed to the right. The McDonald’s data, however, has more high outliers, which pull its mean upward.

# Save subsets of the data by restaurant
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

# Plot the McDonald's calories due to fat per menu item
ggplot(data = mcdonalds) +
    geom_histogram(color = "#e9ecef", alpha=0.6, position = 'identity',
                   mapping = aes(x = cal_fat), binwidth = 50) +
    ggtitle("McDonalds")

# Plot the Dairy Queen calories due to fat per menu item
ggplot(data = dairy_queen) +
    geom_histogram(color = "#e9ecef", alpha=0.6, position = 'identity',
                   mapping = aes(x = cal_fat), binwidth = 50) +
    ggtitle("Dairy Queen")
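The visual comparison of center and spread can be backed with numbers. The following is a minimal sketch, assuming the `fastfood` data from openintro and the restaurant labels used in the filters above; the name `fat_summary` is introduced here only for illustration.

```r
library(dplyr)
library(openintro)

data("fastfood", package = "openintro")

# Center and spread of calories from fat for the two restaurants.
# `fat_summary` is an illustrative name, not part of the lab.
fat_summary <- fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen")) %>%
  group_by(restaurant) %>%
  summarise(
    n      = n(),
    median = median(cal_fat),
    mean   = mean(cal_fat),
    sd     = sd(cal_fat),
    iqr    = IQR(cal_fat)
  )

fat_summary
```

A mean that sits well above the median in the same row is consistent with a right-skewed distribution.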

Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

Yes. The distribution has one mode and roughly equal numbers of observations on either side of the mean (the mean is close to the median), although the histogram is choppy.

dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)

ggplot(data = dairy_queen, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = after_stat(density))) +
        stat_function(fun = dnorm, args = list(mean = dqmean, sd = dqsd), col = "tomato")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 3

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)

The points fall roughly on the line. This is similar to the probability plot for the real data.

# Plot the real data
ggplot(data = dairy_queen, aes(sample = cal_fat)) + 
  geom_line(stat = "qq")

# Generate the simulated data
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
sim_norm_df <- data.frame(sim_norm)

# Plot the simulated data
ggplot(data = sim_norm_df, aes(sample = sim_norm)) + 
  geom_line(stat = "qq")
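The exercise asks whether the points fall on the line, but `geom_line(stat = "qq")` only connects the points and draws no reference line. A possible variant, sketched here under the assumption that a reasonably recent ggplot2 is installed (`stat_qq_line()` is a ggplot2 function; `qq_plot` is an illustrative name), plots the quantiles as points and adds the reference line:

```r
library(ggplot2)
library(openintro)

data("fastfood", package = "openintro")
dairy_queen <- subset(fastfood, restaurant == "Dairy Queen")

# Normal probability plot with a reference line through the quartiles.
# `qq_plot` is an illustrative name, not part of the lab.
qq_plot <- ggplot(dairy_queen, aes(sample = cal_fat)) +
  stat_qq() +
  stat_qq_line(colour = "tomato")

qq_plot
```

The same two layers can be applied to `sim_norm_df` with `aes(sample = sim_norm)` to compare the simulated data against its own reference line.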

Exercise 4

Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?

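Yes. The normal probability plot for calories from fat looks similar to the plots for the simulated data, which suggests that the calories are nearly normal.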

qqnormsim(sample = cal_fat, data = dairy_queen)
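To make the idea behind this comparison concrete, here is a hand-rolled sketch of the same technique: draw several normal samples with the data's mean and standard deviation and show them next to the real data. It is an approximation written for illustration and is not the openintro implementation of `qqnormsim()`; the number of simulations (8), the seed, and the names `sims`, `panels`, and `sim_grid` are assumptions.

```r
library(ggplot2)
library(openintro)

data("fastfood", package = "openintro")
dairy_queen <- subset(fastfood, restaurant == "Dairy Queen")

set.seed(606)  # arbitrary seed so the simulated panels are reproducible
n  <- nrow(dairy_queen)
mu <- mean(dairy_queen$cal_fat)
s  <- sd(dairy_queen$cal_fat)

# Eight simulated normal samples, each the same size as the real data
sims <- do.call(rbind, lapply(1:8, function(i) {
  data.frame(panel = paste("sim", i), value = rnorm(n, mean = mu, sd = s))
}))

# Stack the real data on top of the simulated samples
panels <- rbind(data.frame(panel = "data", value = dairy_queen$cal_fat), sims)

# One normal probability plot per panel
sim_grid <- ggplot(panels, aes(sample = value)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ panel)

sim_grid
```

If the "data" panel is hard to tell apart from the simulated panels, the normal model is plausible for this variable.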

Exercise 5

Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

The calories from fat on the McDonald’s menu do not appear to come from a normal distribution: the long right tail shows that the distribution is skewed to the right.

qqnormsim(sample = cal_fat, data = mcdonalds)

Exercise 6

Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

What is the probability that a Dairy Queen menu item has less than 200 calories from fat?

Theoretically 35.0%, empirically 42.9%, for a difference of 7.9 percentage points.

What is the probability that a Dairy Queen menu item has more than 400 calories from fat?

Theoretically 18.6%, empirically 21.4%, for a difference of 2.8 percentage points. This question had the closer agreement between the two methods.

# Probability for the theoretical normal distribution is 0.350
pnorm(q = 200, mean = dqmean, sd = dqsd)

## [1] 0.3495757

# Probability using the empirical distribution is 0.429
dairy_queen %>% 
  filter(cal_fat < 200) %>%
  summarise(percent = n() / nrow(dairy_queen))

## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1   0.429

# Probability for the theoretical normal distribution is 0.186
1 - pnorm(q = 400, mean = dqmean, sd = dqsd)

## [1] 0.1863007

# Probability using the empirical distribution is 0.214
dairy_queen %>% 
  filter(cal_fat > 400) %>%
  summarise(percent = n() / nrow(dairy_queen))

## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1   0.214
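The four probabilities can also be collected in one table. This is a minimal sketch, assuming the same `fastfood` data; `dq_fat` and `compare` are illustrative names. It uses the fact that the mean of a logical vector is the proportion of `TRUE` values, and `lower.tail = FALSE` in place of `1 - pnorm(...)`.

```r
library(openintro)

data("fastfood", package = "openintro")

# Calories from fat for Dairy Queen items (illustrative name)
dq_fat <- fastfood$cal_fat[fastfood$restaurant == "Dairy Queen"]
m <- mean(dq_fat)
s <- sd(dq_fat)

compare <- data.frame(
  question    = c("P(cal_fat < 200)", "P(cal_fat > 400)"),
  # Normal model with the sample mean and standard deviation
  theoretical = c(pnorm(200, mean = m, sd = s),
                  pnorm(400, mean = m, sd = s, lower.tail = FALSE)),
  # Proportion of observed items meeting each condition
  empirical   = c(mean(dq_fat < 200), mean(dq_fat > 400))
)
compare$abs_diff <- abs(compare$theoretical - compare$empirical)

compare
```

The row with the smaller `abs_diff` is the question on which the two methods agree more closely.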

Exercise 7

Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

Burger King’s sodium distribution is the closest to normal. Arby’s appears to be the second closest.

arbys <- fastfood %>%
  filter(restaurant == "Arbys")
burger_king <- fastfood %>%
  filter(restaurant == "Burger King")
chick_fila <- fastfood %>%
  filter(restaurant == "Chick Fil-A")
sonic <- fastfood %>%
  filter(restaurant == "Sonic")
subway <- fastfood %>%
  filter(restaurant == "Subway")
taco_bell <- fastfood %>%
  filter(restaurant == "Taco Bell")

ggplot(data = dairy_queen, aes(sample = sodium)) + 
  geom_line(stat = "qq") +
  ggtitle("Dairy Queen")

ggplot(data = mcdonalds, aes(sample = sodium)) + 
  geom_line(stat = "qq") +
  ggtitle("McDonalds")

ggplot(data = arbys, aes(sample = sodium)) + 
  geom_line(stat = "qq") +
  ggtitle("Arbys")

ggplot(data = burger_king, aes(sample = sodium)) + 
  geom_line(stat = "qq") +
  ggtitle("Burger King")

ggplot(data = chick_fila, aes(sample = sodium)) + 
  geom_line(stat = "qq") +
  ggtitle("Chick Fil-A")

ggplot(data = sonic, aes(sample = sodium)) + 
  geom_line(stat = "qq") +
  ggtitle("Sonic")

ggplot(data = subway, aes(sample = sodium)) + 
  geom_line(stat = "qq") +
  ggtitle("Subway")

ggplot(data = taco_bell, aes(sample = sodium)) + 
  geom_line(stat = "qq") +
  ggtitle("Taco Bell")
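The eight separate plots can also be drawn in a single figure, which makes side-by-side comparison easier. This is one possible sketch, assuming the `restaurant` column of `fastfood` as the faceting variable; `sodium_qq` is an illustrative name, and the free y-axis scale is a design choice rather than a requirement.

```r
library(ggplot2)
library(openintro)

data("fastfood", package = "openintro")

# One normal probability plot of sodium per restaurant, with reference lines.
# `sodium_qq` is an illustrative name, not part of the lab.
sodium_qq <- ggplot(fastfood, aes(sample = sodium)) +
  stat_qq() +
  stat_qq_line(colour = "tomato") +
  facet_wrap(~ restaurant, scales = "free_y")

sodium_qq
```

Faceting removes the need for a separate filtered data frame per restaurant, and the reference line in each panel gives a common standard for judging which restaurant is closest to normal.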

Exercise 8

Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

Stepwise patterns appear when the data take a limited number of distinct values, so they are more prominent in discrete data than in continuous data. This is probably because menu items tend to fall into groups: not salty (ice cream), a little salty (burgers), or very salty (fries).
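A stepped normal probability plot means that many observations share the same value. One possible contributor, stated here as an assumption rather than something the lab confirms, is that the published sodium figures are rounded. The sketch below checks how many distinct values occur and what remainders are left after dividing by 10; the names `sodium`, `n_items`, and `n_distinct` are illustrative.

```r
library(openintro)

data("fastfood", package = "openintro")

# Sodium values with any missing entries dropped
sodium <- fastfood$sodium[!is.na(fastfood$sodium)]

# Number of items versus number of distinct sodium values
n_items    <- length(sodium)
n_distinct <- length(unique(sodium))
c(items = n_items, distinct = n_distinct)

# Remainders after dividing by 10: if nearly all are 0,
# the values are reported as multiples of 10
table(sodium %% 10)
```

Far fewer distinct values than items would indicate ties, which show up as flat steps in the plot.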

Exercise 9

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

While this isn’t Keto Bell, there appear to be many low-carb menu items, with a tail of higher-carb items. The distribution looks skewed to the right.

ggplot(data = taco_bell) +
    geom_histogram(color = "#e9ecef", alpha=0.6, position = 'identity',
                   mapping = aes(x = total_carb), binwidth = 5) +
    ggtitle("Taco Bell")

tbmean <- mean(taco_bell$total_carb)
tbsd   <- sd(taco_bell$total_carb)

ggplot(data = taco_bell, aes(x = total_carb)) +
        geom_blank() +
        geom_histogram(aes(y = after_stat(density))) +
        stat_function(fun = dnorm, args = list(mean = tbmean, sd = tbsd), col = "tomato")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
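The exercise also asks for a normal probability plot, which the histograms above do not provide. A minimal sketch, assuming the same Taco Bell subset (`carb_qq` is an illustrative name), follows. On a plot of this kind, right skew typically shows up as points bending upward and away from the reference line at the high end.

```r
library(ggplot2)
library(openintro)

data("fastfood", package = "openintro")
taco_bell <- subset(fastfood, restaurant == "Taco Bell")

# Normal probability plot of total carbohydrates with a reference line.
# `carb_qq` is an illustrative name, not part of the lab.
carb_qq <- ggplot(taco_bell, aes(sample = total_carb)) +
  stat_qq() +
  stat_qq_line(colour = "tomato") +
  ggtitle("Taco Bell")

carb_qq

# Numeric cross-check: a mean above the median is consistent with right skew
c(mean = mean(taco_bell$total_carb), median = median(taco_bell$total_carb))
```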

DATA606wk4Lab

PK O’Flaherty

2022-02-27