The Normal Distribution

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
head(fastfood)
## # A tibble: 6 x 17
##   restaurant item  calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>    <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Mcdonalds  Arti~      380      60         7       2       0            95
## 2 Mcdonalds  Sing~      840     410        45      17       1.5         130
## 3 Mcdonalds  Doub~     1130     600        67      27       3           220
## 4 Mcdonalds  Gril~      750     280        31      10       0.5         155
## 5 Mcdonalds  Cris~      920     410        45      12       0.5         120
## 6 Mcdonalds  Big ~      540     250        28      10       1            80
## # ... with 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
## #   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
## #   salad <chr>
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

Exercise 1

Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

glimpse(mcdonalds)
## Rows: 57
## Columns: 17
## $ restaurant  <chr> "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mc...
## $ item        <chr> "Artisan Grilled Chicken Sandwich", "Single Bacon Smoke...
## $ calories    <dbl> 380, 840, 1130, 750, 920, 540, 300, 510, 430, 770, 380,...
## $ cal_fat     <dbl> 60, 410, 600, 280, 410, 250, 100, 210, 190, 400, 170, 3...
## $ total_fat   <dbl> 7, 45, 67, 31, 45, 28, 12, 24, 21, 45, 18, 34, 20, 34, ...
## $ sat_fat     <dbl> 2.0, 17.0, 27.0, 10.0, 12.0, 10.0, 5.0, 4.0, 11.0, 21.0...
## $ trans_fat   <dbl> 0.0, 1.5, 3.0, 0.5, 0.5, 1.0, 0.5, 0.0, 1.0, 2.5, 0.0, ...
## $ cholesterol <dbl> 95, 130, 220, 155, 120, 80, 40, 65, 85, 175, 40, 95, 12...
## $ sodium      <dbl> 1110, 1580, 1920, 1940, 1980, 950, 680, 1040, 1040, 129...
## $ total_carb  <dbl> 44, 62, 63, 62, 81, 46, 33, 49, 35, 42, 38, 48, 48, 67,...
## $ fiber       <dbl> 3, 2, 3, 2, 4, 3, 2, 3, 2, 3, 2, 3, 3, 5, 2, 2, 3, 3, 5...
## $ sugar       <dbl> 11, 18, 18, 18, 18, 9, 7, 6, 7, 10, 5, 11, 11, 11, 6, 3...
## $ protein     <dbl> 37, 46, 70, 55, 46, 25, 15, 25, 25, 51, 15, 32, 42, 33,...
## $ vit_a       <dbl> 4, 6, 10, 6, 6, 10, 10, 0, 20, 20, 2, 10, 10, 10, 2, 4,...
## $ vit_c       <dbl> 20, 20, 20, 25, 20, 2, 2, 4, 4, 6, 0, 10, 20, 15, 2, 6,...
## $ calcium     <dbl> 20, 20, 50, 20, 20, 15, 10, 2, 15, 20, 15, 35, 35, 35, ...
## $ salad       <chr> "Other", "Other", "Other", "Other", "Other", "Other", "...
hist(mcdonalds$cal_fat)

hist(dairy_queen$cal_fat)

The McDonald's histogram has a stronger right skew than the Dairy Queen histogram. The center of the McDonald's cal_fat distribution sits farther toward the left of its range, while the center of the Dairy Queen distribution falls slightly closer to the middle. The McDonald's data also have a greater spread, with a maximum between 1200 and 1400, while the Dairy Queen maximum is between 600 and 700.
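To put numbers next to the visual comparison of center and spread, the usual summary statistics can be tabulated per restaurant. This is a minimal sketch, not part of the original lab; the output column names (n_items, mean_cal_fat, and so on) are made up for illustration.

```r
library(dplyr)
library(openintro)

# Summaries of calories from fat for the two restaurants compared above.
# The output column names are illustrative choices, not lab conventions.
cal_fat_summary <- fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen")) %>%
  group_by(restaurant) %>%
  summarise(
    n_items        = n(),
    mean_cal_fat   = mean(cal_fat),
    median_cal_fat = median(cal_fat),
    sd_cal_fat     = sd(cal_fat),
    iqr_cal_fat    = IQR(cal_fat),
    max_cal_fat    = max(cal_fat)
  )

cal_fat_summary
```

A mean noticeably larger than the median would be consistent with the right skew seen in the histograms, and the standard deviation and IQR give two views of the spread.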

The Normal Distribution

dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)
ggplot(data = dairy_queen, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = ..density..), bins=20) +
        stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "darkmagenta")

Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

Not really. The distribution has a slight right skew: the majority of the data sit around the 200 mark, while the center of the distribution seems to fall slightly to the right of that.

Evaluating the normal distribution

ggplot(data = dairy_queen, aes(sample = cal_fat)) + 
  geom_line(stat = "qq")

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)

Exercise 3

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a dataframe, it can be put directly into the sample argument and the data argument can be dropped.)

ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_line(stat = "qq")

This plot looks closer to a normal distribution than the plot for the original data: the points fall along a mostly diagonal line.
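The exercise asks whether the points fall on "the line", but the plot above draws no reference line. One way to add it, sketched here as an assumption rather than the lab's own method, is ggplot2's `stat_qq()` together with `stat_qq_line()`, which by default draws a line through the first and third quartiles. The seed value is arbitrary.

```r
library(ggplot2)
library(openintro)

dairy_queen <- subset(fastfood, restaurant == "Dairy Queen")

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
sim_norm <- rnorm(
  n    = nrow(dairy_queen),
  mean = mean(dairy_queen$cal_fat),
  sd   = sd(dairy_queen$cal_fat)
)

# Points are the simulated sample quantiles; the line is the reference
# through the 25th and 75th percentiles.
sim_qq <- ggplot(data = NULL, aes(sample = sim_norm)) +
  stat_qq() +
  stat_qq_line()

sim_qq
```

Replacing `aes(sample = sim_norm)` with `data = dairy_queen, aes(sample = cal_fat)` gives the matching plot for the real data, so the two can be compared against the same kind of line.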

qqnormsim(sample = cal_fat, data = dairy_queen)

Exercise 4

Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data?

The simulated plots look more normally distributed than the plot of the real data, but the real data's plot is similar enough that I think the cal_fat data for Dairy Queen are mostly normally distributed.

Exercise 5

Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

mmean <- mean(mcdonalds$cal_fat)
msd   <- sd(mcdonalds$cal_fat)
ggplot(data = mcdonalds, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = ..density..), bins=20) +
        stat_function(fun = dnorm, args = c(mean = mmean, sd = msd), col = "palegreen2")

sim_norm2 <- rnorm(n = nrow(mcdonalds), mean = mmean, sd = msd)
ggplot(data = NULL, aes(sample = sim_norm2)) +
  geom_line(stat = "qq")

qqnormsim(sample = cal_fat, data = mcdonalds)

Compared with the simulated plots, the McDonald's cal_fat data seem to be mostly normally distributed, as the points follow a mostly consistent diagonal line.

Normal Probabilities

1 - pnorm(q = 600, mean = dqmean, sd = dqsd)
## [1] 0.01501523
dairy_queen %>% 
  filter(cal_fat > 600) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1  0.0476

Exercise 6

Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

What is the probability that an item from Dairy Queen has less than 13 grams of sugar?
glimpse(dairy_queen$sugar)
##  num [1:42] 9 13 13 13 8 8 9 4 4 3 ...
dqsmean <- mean(dairy_queen$sugar)
dqssd  <- sd(dairy_queen$sugar)
pnorm(q=13, mean= dqsmean, sd=dqssd)
## [1] 0.9068705
dairy_queen %>% 
  filter(sugar < 13) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1   0.905

These two proportions are very similar (0.907 theoretical versus 0.905 empirical).

What is the probability that an item from McDonald's has less than 11 grams of sugar?
msmean <- mean(mcdonalds$sugar)
mssd  <- sd(mcdonalds$sugar)
pnorm(q=11, mean= msmean, sd=mssd)
## [1] 0.4979022
mcdonalds %>% 
  filter(sugar < 11) %>%
  summarise(percent = n() / nrow(mcdonalds))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1   0.579

These two are not as similar (0.498 theoretical versus 0.579 empirical).

The Dairy Queen sugar data show closer agreement between the two methods than the McDonald's data do.
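Both questions above repeat the same two steps: a `pnorm()` call using the sample mean and standard deviation, and a count of the observations below the cutoff. A small helper can bundle them. This is a sketch; `compare_prob_below()` is a hypothetical name introduced here, not a function from the lab or from any package.

```r
# Hypothetical helper (not from the lab or any package): compares the
# normal-model probability P(X < q) with the observed proportion below q.
compare_prob_below <- function(x, q) {
  x <- x[!is.na(x)]  # drop missing values before summarising
  c(
    theoretical = pnorm(q, mean = mean(x), sd = sd(x)),
    empirical   = mean(x < q)
  )
}

# Toy input so the block runs on its own: the mean is 3, so the normal
# model gives 0.5, while 2 of the 5 values are below 3, giving 0.4.
toy_result <- compare_prob_below(c(1, 2, 3, 4, 5), q = 3)
toy_result
```

With the lab's data loaded, `compare_prob_below(dairy_queen$sugar, q = 13)` should give the same two numbers as the Dairy Queen calculation above, assuming the sugar column has no missing values.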

Exercise 7

Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

msomean <- mean(mcdonalds$sodium)
msosd   <- sd(mcdonalds$sodium)
ggplot(data = mcdonalds, aes(x = sodium)) +
        geom_blank() +
        geom_histogram(aes(y = ..density..), bins=20) +
        stat_function(fun = dnorm, args = c(mean = msomean, sd = msosd), col = "salmon2")

sim_norm3 <- rnorm(n = nrow(mcdonalds), mean = msomean, sd = msosd)
ggplot(data = NULL, aes(sample = sim_norm3)) +
  geom_line(stat = "qq")

qqnormsim(sample = sodium, data = mcdonalds)

dqsomean <- mean(dairy_queen$sodium)
dqsosd   <- sd(dairy_queen$sodium)
ggplot(data = dairy_queen, aes(x = sodium)) +
        geom_blank() +
        geom_histogram(aes(y = ..density..), bins=20) +
        stat_function(fun = dnorm, args = c(mean = dqsomean, sd = dqsosd), col = "turquoise4")

sim_norm4 <- rnorm(n = nrow(dairy_queen), mean = dqsomean, sd = dqsosd)
ggplot(data = NULL, aes(sample = sim_norm4)) +
  geom_line(stat = "qq")

qqnormsim(sample = sodium, data = dairy_queen)

Of the two restaurants examined here, the McDonald's data seem to be more normally distributed.
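The question asks about all of the restaurants, while the code above covers two of them. One way to look at every restaurant at once, sketched here as a suggestion rather than the lab's prescribed approach, is to facet a single normal probability plot by restaurant.

```r
library(ggplot2)
library(openintro)

# One normal probability plot of sodium per restaurant, each with a
# reference line through the first and third quartiles.
sodium_qq <- ggplot(data = fastfood, aes(sample = sodium)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ restaurant, scales = "free_y")

sodium_qq
```

The panel whose points stay closest to its line is the strongest candidate for "closest to normal". The `scales = "free_y"` setting is a design choice that lets each panel use its own sodium range, so restaurants with a few very salty items do not flatten the others.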

Exercise 8

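The stepwise pattern in the sodium normal probability plots seems to happen because the sodium values are recorded as whole numbers on a finite scale. With no decimal values, several items share the same value, and these repeated values produce flat steps instead of a more continuous-looking line.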
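The effect of coarse, whole-number values can be illustrated with simulated data. In this sketch the mean, standard deviation, sample size, and rounding unit are arbitrary illustrative choices, not estimates from the fastfood data.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, only for reproducibility

# Simulated values; every parameter here is an illustrative assumption.
smooth_x  <- rnorm(n = 60, mean = 1000, sd = 300)
rounded_x <- round(smooth_x / 100) * 100  # round to the nearest 100

# Rounding collapses many values onto the same number (ties).
length(unique(smooth_x))
length(unique(rounded_x))

# The ties appear as flat steps in the normal probability plot.
rounded_qq <- ggplot(data = NULL, aes(sample = rounded_x)) +
  geom_line(stat = "qq")

rounded_qq
```

Plotting `smooth_x` the same way gives a line without the flat segments, which suggests that the steps come from the rounding and not from the shape of the underlying distribution.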

Exercise 9

dqcmean <- mean(dairy_queen$total_carb)
dqcsd   <- sd(dairy_queen$total_carb)
sim_norm5 <- rnorm(n = nrow(dairy_queen), mean = dqcmean, sd = dqcsd)
ggplot(data = NULL, aes(sample = sim_norm5)) +
  geom_line(stat = "qq")

ggplot(data = dairy_queen, aes(x = total_carb)) +
        geom_blank() +
        geom_histogram(aes(y = ..density..), bins=20) +
        stat_function(fun = dnorm, args = c(mean = dqcmean, sd = dqcsd), col = "darkorange2")

Based on the plot, the data appear to be mostly normally distributed, and the histogram is consistent with this.
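The normal probability plot above is drawn from sim_norm5, the simulated draws, so it shows what normal data of this size look like but not the observed values themselves. A sketch of the matching plot for the observed total_carb values, with a reference line added as an assumption of what is useful here, could look like this:

```r
library(ggplot2)
library(openintro)

dairy_queen <- subset(fastfood, restaurant == "Dairy Queen")

# Normal probability plot of the observed total carbohydrates,
# with a reference line to make any curvature easier to see.
carb_qq <- ggplot(data = dairy_queen, aes(sample = total_carb)) +
  stat_qq() +
  stat_qq_line()

carb_qq
```

In this orientation (theoretical quantiles on the x-axis), points that bend upward in a bowl shape suggest right skew, points that bend downward in an arch suggest left skew, and points that stay near the line suggest a roughly symmetric distribution.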