The Normal Distribution

Getting Started

Load packages

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.5     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
head(fastfood)
## # A tibble: 6 x 17
##   restaurant item  calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>    <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Mcdonalds  Arti~      380      60         7       2       0            95
## 2 Mcdonalds  Sing~      840     410        45      17       1.5         130
## 3 Mcdonalds  Doub~     1130     600        67      27       3           220
## 4 Mcdonalds  Gril~      750     280        31      10       0.5         155
## 5 Mcdonalds  Cris~      920     410        45      12       0.5         120
## 6 Mcdonalds  Big ~      540     250        28      10       1            80
## # ... with 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
## #   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
## #   salad <chr>
glimpse(fastfood)
## Rows: 515
## Columns: 17
## $ restaurant  <chr> "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mc...
## $ item        <chr> "Artisan Grilled Chicken Sandwich", "Single Bacon Smoke...
## $ calories    <dbl> 380, 840, 1130, 750, 920, 540, 300, 510, 430, 770, 380,...
## $ cal_fat     <dbl> 60, 410, 600, 280, 410, 250, 100, 210, 190, 400, 170, 3...
## $ total_fat   <dbl> 7, 45, 67, 31, 45, 28, 12, 24, 21, 45, 18, 34, 20, 34, ...
## $ sat_fat     <dbl> 2.0, 17.0, 27.0, 10.0, 12.0, 10.0, 5.0, 4.0, 11.0, 21.0...
## $ trans_fat   <dbl> 0.0, 1.5, 3.0, 0.5, 0.5, 1.0, 0.5, 0.0, 1.0, 2.5, 0.0, ...
## $ cholesterol <dbl> 95, 130, 220, 155, 120, 80, 40, 65, 85, 175, 40, 95, 12...
## $ sodium      <dbl> 1110, 1580, 1920, 1940, 1980, 950, 680, 1040, 1040, 129...
## $ total_carb  <dbl> 44, 62, 63, 62, 81, 46, 33, 49, 35, 42, 38, 48, 48, 67,...
## $ fiber       <dbl> 3, 2, 3, 2, 4, 3, 2, 3, 2, 3, 2, 3, 3, 5, 2, 2, 3, 3, 5...
## $ sugar       <dbl> 11, 18, 18, 18, 18, 9, 7, 6, 7, 10, 5, 11, 11, 11, 6, 3...
## $ protein     <dbl> 37, 46, 70, 55, 46, 25, 15, 25, 25, 51, 15, 32, 42, 33,...
## $ vit_a       <dbl> 4, 6, 10, 6, 6, 10, 10, 0, 20, 20, 2, 10, 10, 10, 2, 4,...
## $ vit_c       <dbl> 20, 20, 20, 25, 20, 2, 2, 4, 4, 6, 0, 10, 20, 15, 2, 6,...
## $ calcium     <dbl> 20, 20, 50, 20, 20, 15, 10, 2, 15, 20, 15, 35, 35, 35, ...
## $ salad       <chr> "Other", "Other", "Other", "Other", "Other", "Other", "...
fastfood <- fastfood  # copy the packaged data set into the global environment

Focus on restaurant, calories, and calories from fat

mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

Exercise 1:

Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

Both the McDonalds and Dairy Queen distributions are unimodal, but skewed right (they both have a right-hand tail). However, McDonalds has a more prominent right skew (a much longer right-hand tail). The approximate medians (centers) for both restaurants are similar (both medians are around 200). McDonalds has a much wider range (~1250) compared to Dairy Queen (~750).

ggplot(mcdonalds, aes(x = cal_fat)) +
  geom_histogram(binwidth = 100)

ggplot(dairy_queen, aes(x=cal_fat)) +
  geom_histogram(binwidth = 100)
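To back up the eyeballed comparison of centers and spreads, here is a minimal sketch that computes numerical summaries for the two restaurants. The name fat_summary is my own, and na.rm = TRUE is only a precaution in case the column has missing values.

```r
library(dplyr)
library(openintro)  # provides the fastfood data used in this lab

# Numerical summaries of calories from fat for the two restaurants
fat_summary <- fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen")) %>%
  group_by(restaurant) %>%
  summarise(
    n      = n(),
    median = median(cal_fat, na.rm = TRUE),
    iqr    = IQR(cal_fat, na.rm = TRUE),
    min    = min(cal_fat, na.rm = TRUE),
    max    = max(cal_fat, na.rm = TRUE)
  )
fat_summary
```

The median and IQR are less sensitive to the long right-hand tails than the mean and range, so they are probably the fairer way to compare these two skewed distributions.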

The normal distribution

Plotting the normal distribution curve on top of the histogram:

dqmean <- mean(dairy_queen$cal_fat)
dqsd <- sd(dairy_queen$cal_fat)
ggplot(dairy_queen, aes(x=cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density)), binwidth = 100) +
  stat_function(fun = dnorm, args = list(mean = dqmean, sd = dqsd), col = "tomato")

Exercise 2:

Based on this plot, does it appear that the data follow a nearly normal distribution?

Since the lab is all about normal distributions, I am guessing the answer is supposed to be yes, but I think it’s a bit skewed right. I suppose it isn’t greatly skewed, so sure, it can be considered “nearly normal”.

Evaluating the normal distribution

Constructing a normal probability plot, also known as a normal Q-Q (“quantile-quantile”) plot:

ggplot(dairy_queen, aes(sample = cal_fat)) +
  geom_line(stat = "qq")

Note: instead of “x =” we use “sample =”
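To see what a normal Q-Q plot is actually plotting, here is a small sketch that builds the coordinates by hand: the sorted data go on the y-axis and the matching standard normal quantiles go on the x-axis. The sample is simulated stand-in data with a made-up mean and standard deviation, not the lab data, and the object names are my own.

```r
set.seed(1)
x <- rnorm(42, mean = 250, sd = 150)  # hypothetical stand-in sample

# ppoints() gives the plotting positions that base R's qqnorm() also uses
qq <- data.frame(
  theoretical = qnorm(ppoints(length(x))),  # standard normal quantiles (x-axis)
  sample      = sort(x)                     # observed values in order (y-axis)
)
head(qq)
# plot(qq) would draw the Q-Q plot from these coordinates
```

If the data were exactly normal, these points would fall on a straight line whose slope is the standard deviation and whose intercept is the mean.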

How close is close enough?

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)

Exercise 3:

Make a normal probability plot of sim_norm. Do all the points fall on the line? How does it compare to the probability plot for the real data?

No, not all the points fall perfectly along a straight line. The plot looks pretty similar to the one for the actual data, so I guess the actual data would be considered normally distributed… this is probably why they say in the textbook that it is hard to eyeball a histogram and decide whether it is normal or not.

ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_line(stat = "qq")

qqnormsim(sample = cal_fat, data = dairy_queen)
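I have not read the source of qqnormsim, so the following is only a rough sketch of the idea behind it, not the package's implementation: put the real data in one panel and several simulated normal samples (same mean, standard deviation, and size) in the others. The number of simulated panels, the seed, and the names sims, real, and qq_panels are all my own choices.

```r
library(dplyr)
library(ggplot2)
library(openintro)

dairy_queen <- fastfood %>% filter(restaurant == "Dairy Queen")
dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)
n      <- nrow(dairy_queen)

set.seed(2021)  # arbitrary seed so the simulated panels are reproducible
n_sims <- 8     # number of simulated panels (my choice for this sketch)

# One long data frame: n simulated values for each simulated panel
sims <- data.frame(
  panel = rep(paste("sim", seq_len(n_sims)), each = n),
  value = rnorm(n * n_sims, mean = dqmean, sd = dqsd)
)
real <- data.frame(panel = "data", value = dairy_queen$cal_fat)

# One Q-Q plot per panel; the "data" panel holds the real values
qq_panels <- ggplot(rbind(real, sims), aes(sample = value)) +
  stat_qq() +
  facet_wrap(~ panel)
```

Printing qq_panels draws the grid. If the "data" panel does not stand out from the simulated ones, the real data are plausibly near normal.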

Exercise 4:

Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories from fat are nearly normal?

Yes, the simulated data look comparable to the actual data. Therefore, these plots do provide evidence that the calories from fat in the Dairy Queen data set are nearly normal.

Exercise 5:

Using the same technique, determine whether or not the calories from fat from the McDonalds menu appear to come from a normal distribution.

The simulated plots support my original statement that the calories from fat in the McDonald’s data set do not follow a normal distribution. In fact, according to the textbook, a Q-Q line that curves upward at the top indicates a right-skewed distribution.

qqnormsim(sample = cal_fat, data = mcdonalds)

Normal probabilities

If the data are approximately normal, we can use tools such as qnorm and pnorm to answer probability questions about them.

For example, to find the probability that a randomly chosen item on the Dairy Queen menu has more than 600 calories from fat:

1 - pnorm(q = 600, mean = dqmean, sd = dqsd)
## [1] 0.01501523
1 - pnorm(600, dqmean, dqsd)
## [1] 0.01501523
dairy_queen %>%
  filter(cal_fat > 600) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1  0.0476
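The pnorm call above is the same as standardizing the cutoff first, z = (600 - mean) / sd, and taking the upper tail of the standard normal. Here is a short sketch of that, plus qnorm going the other direction, from a probability back to a cutoff. The names z, p_upper, and cutoff_95 are mine, and the 0.95 is just an illustrative choice.

```r
library(dplyr)
library(openintro)

dairy_queen <- fastfood %>% filter(restaurant == "Dairy Queen")
dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)

# Standardize the cutoff, then take the standard normal upper tail
z <- (600 - dqmean) / dqsd
p_upper <- pnorm(z, lower.tail = FALSE)  # same as 1 - pnorm(600, dqmean, dqsd)

# qnorm() inverts pnorm(): the value with 95% of the model's area below it
cutoff_95 <- qnorm(0.95, mean = dqmean, sd = dqsd)

p_upper
cutoff_95
```

Using lower.tail = FALSE avoids the subtraction from 1, which matters numerically only for very small tail probabilities but reads a little more directly.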

Exercise 6:

Write out two probability questions you would like to answer. Calculate those probabilities using both the theoretical normal distribution and the empirical distribution.

Probability that a randomly chosen item has fewer than 200 calories from fat:

pnorm(200, dqmean, dqsd)
## [1] 0.3495757
dairy_queen %>%
  filter(cal_fat < 200) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1   0.429

Probability that a randomly chosen item has between 200 and 600 calories from fat:

pnorm(600, dqmean, dqsd) - pnorm(200, dqmean, dqsd)
## [1] 0.6354091
dairy_queen %>%
  filter(cal_fat < 600 & cal_fat > 200) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1     0.5

The first probability, Pr{Y < 200}, is in closer agreement with the empirical distribution (0.350 theoretical vs. 0.429 empirical) than the second, Pr{200 < Y < 600} (0.635 vs. 0.5).
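Since each of these comparisons repeats the same two steps, they could be wrapped in a small function. This is only a sketch: compare_probs is a hypothetical helper I made up, not part of any package, and it uses strict inequalities to match the filter() calls above.

```r
library(dplyr)
library(openintro)

dairy_queen <- fastfood %>% filter(restaurant == "Dairy Queen")

# compare_probs() is a hypothetical helper, not part of any package.
# It returns P(lower < X < upper) under a normal model fitted to x,
# alongside the observed proportion of values in the same open interval.
compare_probs <- function(x, lower = -Inf, upper = Inf) {
  m <- mean(x)
  s <- sd(x)
  c(
    theoretical = pnorm(upper, m, s) - pnorm(lower, m, s),
    empirical   = mean(x > lower & x < upper)
  )
}

compare_probs(dairy_queen$cal_fat, upper = 200)
compare_probs(dairy_queen$cal_fat, lower = 200, upper = 600)
compare_probs(dairy_queen$cal_fat, lower = 600)
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is a shorter way to get the same number as filter() followed by n() / nrow().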

More practice

Exercise 7:

Consider the other variables in the data set. Out of the restaurants, which one’s distribution is closest to normal for sodium?

unique(fastfood$restaurant)
## [1] "Mcdonalds"   "Chick Fil-A" "Sonic"       "Arbys"       "Burger King"
## [6] "Dairy Queen" "Subway"      "Taco Bell"
chick_fil_a <- fastfood %>%
  filter(restaurant == "Chick Fil-A")
arbys <- fastfood %>%
  filter(restaurant == "Arbys")
sonic <- fastfood %>%
  filter(restaurant == "Sonic")
burger_king <- fastfood %>%
  filter(restaurant == "Burger King")
subway <- fastfood %>%
  filter(restaurant == "Subway")
taco_bell <- fastfood %>%
  filter(restaurant == "Taco Bell")
qqnormsim(sample = sodium, data = mcdonalds)

qqnormsim(sample = sodium, data = chick_fil_a)

qqnormsim(sample = sodium, data = sonic)

qqnormsim(sample = sodium, data = arbys)

qqnormsim(sample = sodium, data = burger_king)

qqnormsim(sample = sodium, data = dairy_queen)

qqnormsim(sample = sodium, data = subway)

qqnormsim(sample = sodium, data = taco_bell)

I suppose the closest to normal, in my opinion, is Burger King. The slope is fairly consistent between the actual data and the simulations. Additionally, it has the least upward curve at the top of the line compared to all the other restaurants.
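For a side-by-side look at the real data only, a single faceted plot avoids filtering each restaurant separately. This is a sketch under my own choices (the name sodium_qq and the free y-axis scales), and it does not replace qqnormsim, because it shows no simulated panels for comparison.

```r
library(ggplot2)
library(openintro)

# One Q-Q panel per restaurant, each with its own reference line
sodium_qq <- ggplot(fastfood, aes(sample = sodium)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ restaurant, scales = "free_y")
```

Printing sodium_qq draws all eight panels at once. The free y-axis lets each restaurant fill its own panel, at the cost of making spreads harder to compare across panels.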

Exercise 8:

Notice some of the normal probability plots for sodium have a step-wise pattern. Why do you think this might be the case?

Well, according to the textbook, this step-wise pattern is due to repeating the same values multiple times in a data set. This causes a “granularity” that appears “step-wise”.
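The sodium values printed by glimpse() above (1110, 1580, 1920, ...) all end in 0, so rounding is a plausible source of the repeated values. Here is a small sketch with simulated stand-in data (made-up mean and standard deviation, and rounding to the nearest 100 exaggerated on purpose) showing how rounding creates ties.

```r
set.seed(42)
raw     <- rnorm(50, mean = 1200, sd = 400)  # hypothetical sodium-like values
rounded <- round(raw, -2)                    # round to the nearest 100

length(unique(raw))      # number of distinct values before rounding
length(unique(rounded))  # fewer distinct values after rounding, so there are ties

# Q-Q coordinates for the rounded values
qq_rounded <- data.frame(
  theoretical = qnorm(ppoints(length(rounded))),
  sample      = sort(rounded)
)
# plot(qq_rounded) shows flat runs wherever consecutive sample values are tied
```

Each flat run is a group of tied values: the theoretical quantile keeps increasing while the sample value stays put, which is what produces the steps.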

Exercise 9:

Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this probability plot, is this variable skewed left, symmetric, or skewed right? Check with a histogram.

Let’s go with Chick Fil-A because I eat there most:

ggplot(chick_fil_a, aes(sample = total_carb)) +
  geom_line(stat = "qq")

Based on this normal probability plot, I would say it is heavily skewed right. Let’s look at it compared to simulation plots:

qqnormsim(sample = total_carb, data = chick_fil_a)

I don’t know, after looking at the simulations, I am starting to think it looks fairly normal…. maybe this is because the plots are small? It might be easier if I was able to include a line of best fit to every single graph. Maybe I am not good at determining normality based on either histograms or normal probability plots….

ggplot(chick_fil_a, aes(x = total_carb)) +
  geom_histogram(binwidth = 5)

Honestly, this histogram looks bimodal to me. Let’s overlay the normal distribution curve:

chickmean <- mean(chick_fil_a$total_carb)
chicksd <- sd(chick_fil_a$total_carb)
ggplot(chick_fil_a, aes(x=total_carb)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5) +
  stat_function(fun = dnorm, args = list(mean = chickmean, sd = chicksd), col = "tomato")

Yeah, definitely not normal. Final answer: this one was a trick question because the distribution is bimodal.

Updates

After Thursday’s class, we learned that our Q-Q plots should have a line of best fit. The code in the lab did not result in an output that displayed the line of best fit. I am trying out the new code sent out:

ggplot(dairy_queen, aes(sample = cal_fat)) +
  stat_qq() +
  stat_qq_line()

Okay, now that I have the line I feel good about my first answer where I said that the distribution was not completely normal - it had a right skew. This lab made me second-guess myself!

ggplot(data = NULL, aes(sample = sim_norm)) +
  stat_qq() +
  stat_qq_line()

ggplot(chick_fil_a, aes(sample = total_carb)) +
  stat_qq() +
  stat_qq_line()
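One caveat about the reference line: as far as I can tell from the ggplot2 documentation, stat_qq_line() does not fit a least-squares line. By default (line.p = c(0.25, 0.75)) it draws the line through the first and third quartiles, the same idea as base R's qqline(). Here is a sketch of that calculation on simulated right-skewed stand-in data; the sample and the object names are my own.

```r
set.seed(7)
x <- rexp(42, rate = 1 / 250)  # hypothetical right-skewed stand-in sample

probs <- c(0.25, 0.75)
y_q   <- quantile(x, probs, names = FALSE)  # sample quartiles
x_q   <- qnorm(probs)                       # standard normal quartiles

# Line through the two quartile points
slope     <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

c(intercept = intercept, slope = slope)
```

Because the line is anchored at the quartiles, points in the tails are free to drift away from it, which is what makes the upward curve of a right-skewed sample easy to spot.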