Lab 4: The Normal Distribution

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.5     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
head(fastfood)
## # A tibble: 6 x 17
##   restaurant item  calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>    <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Mcdonalds  Arti~      380      60         7       2       0            95
## 2 Mcdonalds  Sing~      840     410        45      17       1.5         130
## 3 Mcdonalds  Doub~     1130     600        67      27       3           220
## 4 Mcdonalds  Gril~      750     280        31      10       0.5         155
## 5 Mcdonalds  Cris~      920     410        45      12       0.5         120
## 6 Mcdonalds  Big ~      540     250        28      10       1            80
## # ... with 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
## #   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
## #   salad <chr>

Exercise 1: Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")

dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

ggplot(data = mcdonalds, aes(x = cal_fat)) +
  ggtitle("Calories from Fat on the McDonalds Menu") +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = dairy_queen, aes(x = cal_fat)) +
  ggtitle("Calories from Fat on the Dairy Queen Menu") +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Answer: Both histograms are centered at roughly 250 calories from fat, and both are right-skewed, the McDonalds distribution more strongly so. The McDonalds distribution also has the greater spread, driven by a few menu items with more than 750 calories from fat.
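
The visual comparison can be backed up with a few numbers. The sketch below is a minimal base R helper that reports center and spread for a numeric vector. The name `spread_summary` and the toy vectors are illustrative and not part of the lab; in the lab it would be applied to `mcdonalds$cal_fat` and `dairy_queen$cal_fat`.

```r
# Hypothetical helper: center and spread of a numeric vector.
spread_summary <- function(x) {
  c(median = median(x), mean = mean(x), sd = sd(x), iqr = IQR(x))
}

# Made-up toy values standing in for cal_fat; toy_a has one large item.
toy_a <- c(100, 150, 200, 250, 300, 900)
toy_b <- c(100, 150, 200, 250, 300, 350)

summary_a <- spread_summary(toy_a)
summary_b <- spread_summary(toy_b)
print(rbind(toy_a = summary_a, toy_b = summary_b))

# With the lab's objects (not run here):
# rbind(mcdonalds   = spread_summary(mcdonalds$cal_fat),
#       dairy_queen = spread_summary(dairy_queen$cal_fat))
```

A mean well above the median, together with a larger standard deviation, is what one would expect from the distribution with the longer right tail.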

Exercise 2: Based on this plot, does it appear that the data follow a nearly normal distribution?

dqmean <- mean(dairy_queen$cal_fat)
dqsd <- sd(dairy_queen$cal_fat)

ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = list(mean = dqmean, sd = dqsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Answer: Based on the graph above, the Dairy Queen data appear to follow a nearly normal distribution: the histogram has an identifiable bell shape.
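
The `stat_bin()` message in the output above asks for an explicit bin width. One common rule of thumb is the Freedman-Diaconis rule, which sets the width to 2 * IQR / n^(1/3). The sketch below is only one possible choice; the helper name `fd_binwidth` and the toy data are assumptions for illustration.

```r
# Freedman-Diaconis rule of thumb: 2 * IQR / n^(1/3).
fd_binwidth <- function(x) {
  2 * IQR(x) / length(x)^(1 / 3)
}

toy <- 1:8                  # made-up values: IQR = 3.5 and n^(1/3) = 2
width <- fd_binwidth(toy)
print(width)

# In the lab this could be passed to the histogram layer (not run here):
# geom_histogram(binwidth = fd_binwidth(dairy_queen$cal_fat))
```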

Exercise 3: Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a dataframe, it can be put directly into the sample argument and the data argument can be dropped.)

ggplot(data = dairy_queen, aes(sample = cal_fat)) + 
  ggtitle("Dairy Queen Menu Probability Plot") +
  geom_line(stat = "qq")

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)

ggplot(data = NULL, aes(sample = sim_norm)) + 
  ggtitle("Simulated Normal Probability Plot") +
  geom_line(stat = "qq")

Answer: No, not all of the points fall on the line. The plot for the simulated data curves upward like a “u”, while the plot for the real menu data curves downward like an “n”. The two lines may overlap near their ends.
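
A normal probability plot pairs the sorted sample with standard normal quantiles taken at evenly spaced probabilities. The sketch below rebuilds those coordinates in base R to show what is being drawn. The helper name `qq_points` and the simulated input are illustrative, not part of the lab.

```r
# Hypothetical helper: coordinates of a normal probability (QQ) plot.
qq_points <- function(x) {
  n <- length(x)
  data.frame(
    theoretical = qnorm(ppoints(n)),  # standard normal quantiles
    sample      = sort(x)             # ordered observations
  )
}

set.seed(1)
toy <- rnorm(40, mean = 260, sd = 150)  # made-up stand-in for cal_fat
pts <- qq_points(toy)
print(head(pts))

# Base R's qqnorm() returns the same pairs, in the original data order.
ref <- qqnorm(toy, plot.it = FALSE)
```

If the sample is close to normal, plotting `sample` against `theoretical` gives points near a straight line; curvature at either end points to skew or heavy tails.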

Exercise 4: Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories from fat are nearly normal?

qqnormsim(sample = cal_fat, data = dairy_queen)

Answer: The normal probability plot for the calories from fat does look similar to the simulated data plots. The plots do provide evidence that the calories from fat data are nearly normal.

Exercise 5: Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

qqnormsim(sample = cal_fat, data = mcdonalds)

Answer: The calories from fat on the McDonalds menu do not appear to come from a normal distribution. The probability plot shows a pronounced curve, and there is an outlier above 1000 calories from fat that does not appear in any of the simulations.

Exercise 6: Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

Question 1: What is the probability that a randomly chosen Dairy Queen product has fewer than 300 calories from fat?

pnorm(q = 300, mean = dqmean, sd = dqsd)
## [1] 0.5997007
dairy_queen %>%
  filter(cal_fat < 300) %>%
  summarize(percent = n() / nrow(dairy_queen))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1   0.667

Question 2: What is the probability that a randomly chosen McDonalds product contains more than 400 calories from fat?

mdmean <- mean(mcdonalds$cal_fat)
mdsd <- sd(mcdonalds$cal_fat)

1 - pnorm(q = 400, mean = mdmean, sd = mdsd)
## [1] 0.3022921
mcdonalds %>%
  filter(cal_fat > 400) %>%
  summarise(percent = n() / nrow(mcdonalds))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1   0.158

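Answer: The Dairy Queen question showed the closer agreement between the two methods. For Dairy Queen the normal model gives about 0.600 against an empirical 0.667 (a gap of about 0.07), whereas for McDonalds the normal model gives about 0.302 against an empirical 0.158 (a gap of about 0.14).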
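
The comparison between the two methods can be wrapped in a small helper. The sketch below is illustrative: `compare_prob` is a hypothetical name and the toy vector is made up. It computes P(X < q) under a normal model fitted to the data, the observed proportion below q, and the gap between them. The last lines recompute the gaps from the values printed in this lab.

```r
# Hypothetical helper: P(X < q) under a fitted normal model
# versus the observed proportion of values below q.
compare_prob <- function(x, q) {
  theoretical <- pnorm(q, mean = mean(x), sd = sd(x))
  empirical   <- mean(x < q)   # share of observations below q
  c(theoretical = theoretical,
    empirical   = empirical,
    abs_diff    = abs(theoretical - empirical))
}

toy <- c(100, 150, 200, 250, 300, 900)  # made-up values
res <- compare_prob(toy, q = 275)
print(round(res, 3))

# Gaps implied by the output reported in this lab.
dq_gap <- abs(0.5997007 - 0.667)
md_gap <- abs(0.3022921 - 0.158)
print(c(dairy_queen = dq_gap, mcdonalds = md_gap))
```

For an upper-tail question such as Question 2, the same idea applies with `1 - pnorm(...)` and `mean(x > q)`.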

Exercise 7: Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

unique(fastfood$restaurant)
## [1] "Mcdonalds"   "Chick Fil-A" "Sonic"       "Arbys"       "Burger King"
## [6] "Dairy Queen" "Subway"      "Taco Bell"
ggplot(data = mcdonalds, aes(sample = sodium)) + 
  ggtitle("McDonalds") +
  geom_line(stat = "qq")

chick_fil_a <- fastfood %>%
  filter(restaurant == "Chick Fil-A")
ggplot(data = chick_fil_a, aes(sample = sodium)) + 
  ggtitle("Chick Fil-A") +
  geom_line(stat = "qq")

sonic <- fastfood %>%
  filter(restaurant == "Sonic")
ggplot(data = sonic, aes(sample = sodium)) + 
  ggtitle("Sonic") +
  geom_line(stat = "qq")

arbys <- fastfood %>%
  filter(restaurant == "Arbys")
ggplot(data = arbys, aes(sample = sodium)) + 
  ggtitle("Arbys") +
  geom_line(stat = "qq")

burger_king <- fastfood %>%
  filter(restaurant == "Burger King")
ggplot(data = burger_king, aes(sample = sodium)) + 
  ggtitle("Burger King") +
  geom_line(stat = "qq")

ggplot(data = dairy_queen, aes(sample = sodium)) + 
  ggtitle("Dairy Queen") +
  geom_line(stat = "qq")

subway <- fastfood %>%
  filter(restaurant == "Subway")
ggplot(data = subway, aes(sample = sodium)) + 
  ggtitle("Subway") +
  geom_line(stat = "qq")

taco_bell <- fastfood %>%
  filter(restaurant == "Taco Bell")
ggplot(data = taco_bell, aes(sample = sodium)) + 
  ggtitle("Taco Bell") +
  geom_line(stat = "qq")

Answer: Burger King’s probability plot is the closest to the straight line expected under a normal distribution, so its sodium distribution is the closest to normal.
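
Judging eight plots by eye is subjective. As a numeric complement (not something the lab asks for), one could compute the correlation between the sorted sample and the normal quantiles, sometimes called the probability plot correlation coefficient; values nearer 1 correspond to a straighter probability plot. The helper name `ppcc`, the restaurant labels, and the toy values below are all assumptions for illustration.

```r
# Hypothetical helper: correlation between ordered data and normal quantiles.
ppcc <- function(x) {
  cor(sort(x), qnorm(ppoints(length(x))))
}

p <- ppoints(50)
toy <- data.frame(
  restaurant = rep(c("A", "B"), each = 50),    # made-up restaurant labels
  sodium     = c(1200 + 300 * qnorm(p),        # exactly normal quantiles
                 qexp(p, rate = 1 / 1200))     # right-skewed quantiles
)

scores <- tapply(toy$sodium, toy$restaurant, ppcc)
print(sort(scores, decreasing = TRUE))

# With the lab's data (not run here):
# sort(tapply(fastfood$sodium, fastfood$restaurant, ppcc), decreasing = TRUE)
```

A single number like this ignores where the departures from the line occur, so it works best alongside the plots rather than in place of them.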

Exercise 8: Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

Answer: The stepwise pattern may arise because items in the same section of a menu tend to contain similar amounts of sodium, so many items share the same or nearly the same value and show up as flat runs in the plot.
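
One way to probe a stepwise pattern is to count how many observations share a value, since tied values plot as horizontal runs in a probability plot. The sketch below is illustrative only: `tie_summary` is a hypothetical helper and the sodium values are made up. It describes the ties but does not establish their cause in the real data.

```r
# Hypothetical helper: how many observations share a value with another one.
tie_summary <- function(x) {
  counts <- table(x)
  c(n           = length(x),
    n_distinct  = length(counts),
    n_tied      = sum(counts[counts > 1]),  # observations sharing a value
    largest_tie = max(counts))
}

# Made-up sodium values (mg) with several repeats.
toy_sodium <- c(900, 900, 900, 1100, 1100, 1250, 1400, 1400, 1400, 1400)
ties <- tie_summary(toy_sodium)
print(ties)

# With the lab's objects (not run here):
# tie_summary(dairy_queen$sodium)
```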

Exercise 9: As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

ggplot(data = dairy_queen, aes(sample = total_carb)) + 
  geom_line(stat = "qq")

ggplot(data = dairy_queen, aes(x = total_carb)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Answer: The variable is right-skewed.
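
A numeric check can sit alongside the plots. The sketch below computes a moment-based sample skewness, which is positive when the right tail is longer and near zero for symmetric data. The helper name `sample_skewness` and both toy vectors are assumptions for illustration; in the lab it would be applied to `dairy_queen$total_carb`.

```r
# Moment-based sample skewness: positive for a longer right tail.
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

toy_right <- c(10, 12, 15, 18, 20, 25, 60, 95)  # made-up, long right tail
toy_sym   <- c(-2, -1, 0, 1, 2)                 # made-up, symmetric

skew_right <- sample_skewness(toy_right)
skew_sym   <- sample_skewness(toy_sym)
print(c(right = skew_right, symmetric = skew_sym))

# With the lab's objects (not run here):
# sample_skewness(dairy_queen$total_carb)
```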