Set working directory

setwd("~/DATA110")

The normal distribution

In this lab, you’ll investigate the probability distribution that is most central to statistics: the normal distribution. If you are confident that your data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.

Getting Started

Load packages

In this lab, we will explore and visualize the data using the tidyverse suite of packages as well as the openintro package.

Let’s load the packages.

library(tidyverse)
## -- Attaching packages -------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ----------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata

The data

Either you can use glimpse like before, or head to do this.

head(fastfood)
## # A tibble: 6 x 17
##   restaurant item  calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>    <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Mcdonalds  Arti~      380      60         7       2       0            95
## 2 Mcdonalds  Sing~      840     410        45      17       1.5         130
## 3 Mcdonalds  Doub~     1130     600        67      27       3           220
## 4 Mcdonalds  Gril~      750     280        31      10       0.5         155
## 5 Mcdonalds  Cris~      920     410        45      12       0.5         120
## 6 Mcdonalds  Big ~      540     250        28      10       1            80
## # ... with 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
## #   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
## #   salad <chr>
glimpse(fastfood)
## Rows: 515
## Columns: 17
## $ restaurant  <chr> "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mc...
## $ item        <chr> "Artisan Grilled Chicken Sandwich", "Single Bacon Smoke...
## $ calories    <dbl> 380, 840, 1130, 750, 920, 540, 300, 510, 430, 770, 380,...
## $ cal_fat     <dbl> 60, 410, 600, 280, 410, 250, 100, 210, 190, 400, 170, 3...
## $ total_fat   <dbl> 7, 45, 67, 31, 45, 28, 12, 24, 21, 45, 18, 34, 20, 34, ...
## $ sat_fat     <dbl> 2.0, 17.0, 27.0, 10.0, 12.0, 10.0, 5.0, 4.0, 11.0, 21.0...
## $ trans_fat   <dbl> 0.0, 1.5, 3.0, 0.5, 0.5, 1.0, 0.5, 0.0, 1.0, 2.5, 0.0, ...
## $ cholesterol <dbl> 95, 130, 220, 155, 120, 80, 40, 65, 85, 175, 40, 95, 12...
## $ sodium      <dbl> 1110, 1580, 1920, 1940, 1980, 950, 680, 1040, 1040, 129...
## $ total_carb  <dbl> 44, 62, 63, 62, 81, 46, 33, 49, 35, 42, 38, 48, 48, 67,...
## $ fiber       <dbl> 3, 2, 3, 2, 4, 3, 2, 3, 2, 3, 2, 3, 3, 5, 2, 2, 3, 3, 5...
## $ sugar       <dbl> 11, 18, 18, 18, 18, 9, 7, 6, 7, 10, 5, 11, 11, 11, 6, 3...
## $ protein     <dbl> 37, 46, 70, 55, 46, 25, 15, 25, 25, 51, 15, 32, 42, 33,...
## $ vit_a       <dbl> 4, 6, 10, 6, 6, 10, 10, 0, 20, 20, 2, 10, 10, 10, 2, 4,...
## $ vit_c       <dbl> 20, 20, 20, 25, 20, 2, 2, 4, 4, 6, 0, 10, 20, 15, 2, 6,...
## $ calcium     <dbl> 20, 20, 50, 20, 20, 15, 10, 2, 15, 20, 15, 35, 35, 35, ...
## $ salad       <chr> "Other", "Other", "Other", "Other", "Other", "Other", "...

You’ll see that for every observation there are 17 measurements, many of which are nutritional facts.

You’ll be focusing on just three columns to get started: restaurant, calories, calories from fat.

Let’s first focus on just products from McDonalds and Dairy Queen.

mcdonalds <- fastfood %>% 
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>% 
  filter(restaurant == "Dairy Queen")
head(mcdonalds)
## # A tibble: 6 x 17
##   restaurant item  calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>    <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Mcdonalds  Arti~      380      60         7       2       0            95
## 2 Mcdonalds  Sing~      840     410        45      17       1.5         130
## 3 Mcdonalds  Doub~     1130     600        67      27       3           220
## 4 Mcdonalds  Gril~      750     280        31      10       0.5         155
## 5 Mcdonalds  Cris~      920     410        45      12       0.5         120
## 6 Mcdonalds  Big ~      540     250        28      10       1            80
## # ... with 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
## #   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
## #   salad <chr>

Exercise 1: Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

**Answer: Both appear skewed to the right. Their centers are about 200 and 250, respectively so fairly close. The spread of McDonalds is slightly wider as it extends beyond 1000 while Dairy Queen stops below 750. They do not appear bell-shaped to me.

dairy_queen %>%
  ggplot() +
  geom_histogram(aes(x = cal_fat), bins = 7) +
  ggtitle("Dairy Queen Calories from Fat") +
  xlab("Calories from Fat") +
  ylab("Frequency")

mcdonalds %>%
  ggplot() +
  geom_histogram(aes(x = cal_fat), bins = 7) +
  ggtitle("McDonalds Calories from Fat") +
  xlab("Calories from Fat") +
  ylab("Frequency")

The normal distribution

In your description of the distributions, did you use words like bell-shaped or normal? It’s tempting to say so when faced with a unimodal symmetric distribution.

To see how accurate that description is, you can plot a normal distribution curve on top of a histogram to see how closely the data follow a normal distribution. This normal curve should have the same mean and standard deviation as the data. You’ll be focusing on calories from fat from Dairy Queen products, so let’s store them as a separate object and then calculate some statistics that will be referenced later.

INSERT: Experiment with facet wrap to get side-by- side graphs

fastfood %>% 
  filter(restaurant  == "Mcdonalds" | restaurant == "Dairy Queen") %>% 
  ggplot() +
  geom_histogram(aes(x = cal_fat), bins = 7) +
  ggtitle("Calories from Fat by Restaurant") +
  xlab("Calories from Fat") +
  ylab("Frequency") +
  facet_wrap(. ~restaurant)

Return to main lab

dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)
dqmean
## [1] 260.4762
dqsd
## [1] 156.4851

Next, you make a density histogram to use as the backdrop and use the lines function to overlay a normal probability curve. The difference between a frequency histogram and a density histogram is that while in a frequency histogram the heights of the bars add up to the total number of observations, in a density histogram the areas of the bars add up to 1. The area of each bar can be calculated as simply the height times the width of the bar. Using a density histogram allows us to properly overlay a normal distribution curve over the histogram since the curve is a normal probability density function that also has area under the curve of 1. Frequency and density histograms both display the same exact shape; they only differ in their y-axis. You can verify this by comparing the frequency histogram you constructed earlier and the density histogram created by the commands below.

ggplot(data = dairy_queen, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = ..density..), bins = 7) +
        stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato") +
  xlab("calories from fat") +
  ggtitle("Dairy Queen")

mcmean <- mean(mcdonalds$cal_fat)
mcsd   <- sd(mcdonalds$cal_fat)
mcmean
## [1] 285.614
mcsd
## [1] 220.8993
ggplot(data = mcdonalds, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = ..density..), bins = 7) +
        stat_function(fun = dnorm, args = c(mean = mcmean, sd = mcsd), col = "tomato") +
  xlab("calories from fat") +
  ggtitle("McDonalds")

Exercise 2: Based on the this plot, does it appear that the data follow a nearly normal distribution?

Answer: It’s hard to tell so we will explore with a QQ plot

Explore Dairy Queen Data with a QQ plot to simulate a normal plot; we see below that it’s close to a diagonal line but, again, not clearly showing normal.

ggplot(data = dairy_queen, aes(sample = cal_fat)) + 
  geom_line(stat = "qq")

what do probability plots look like for data that I know came from a normal distribution? We can answer this by simulating data from a normal distribution using rnorm.

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)

Exercise 3: Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a dataframe, it can be put directly into the sample argument and the data argument can be dropped.)

qqnorm(sim_norm)

The above still does not show all data point on a straight line; it looks to similar to the real data.

Even better than comparing the original plot to a single plot generated from a normal distribution is to compare it to many more plots using the following function. It shows the Q-Q plot corresponding to the original data in the top left corner, and the Q-Q plots of 8 different simulated normal data. It may be helpful to click the zoom button in the plot window.

qqnormsim(sample = cal_fat, data = dairy_queen)

Exercise 4: Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the cal_fat data points are nearly normal?

Answer: Yes, they look very similar, so this is evidence that they may be nearly normal despite my assuming the histogram was off; however the Shapiro-Wild Test very strongly suggest this data is not normal (p = .007)

# perform shapiro test for normality
shapiro.test(dairy_queen$cal_fat)
## 
##  Shapiro-Wilk normality test
## 
## data:  dairy_queen$cal_fat
## W = 0.92299, p-value = 0.007568

Excercise 5: Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

checking the QQ plot again for McDonald’s using the real data:

ggplot(data = mcdonalds, aes(sample = cal_fat)) + 
  geom_line(stat = "qq")

Simulate the data from a normal distribution uing simulated data that is normalized using rnorm

sim_norm2 <- rnorm(n = nrow(mcdonalds), mean = mcmean, sd = mcsd)
qqnorm(sim_norm)

Now run multiple simulations of points from the data that is normalized to compare

qqnormsim(sample = cal_fat, data = mcdonalds)

*Answer: Again the McDonalds distribution does appear to be near normal after all; however the Shapiro-Wilk normality test very strongly suggest it is not a normal distribution.

# perform shapiro test for normality
shapiro.test(mcdonalds$cal_fat)
## 
##  Shapiro-Wilk normality test
## 
## data:  mcdonalds$cal_fat
## W = 0.77045, p-value = 4.749e-08

Normal Probabilities

Okay, so now you have a slew of tools to judge whether or not a variable is normally distributed. Why should you care?

If we assume that the calories from fat from Dairy Queen’s menu are normally distributed (a very close approximation is also okay), we can find this probability by calculating a Z score and consulting a Z table (also called a normal probability table). In R, this is done in one step with the function pnorm().

pnorm(q = 600, mean = dqmean, sd = dqsd)
## [1] 0.9849848

**Because we want values >600 the answer is 1-pnorm = 0.0150152

Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 600 then divide this number by the total sample size.

dairy_queen %>% 
  filter(cal_fat > 600) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1  0.0476

Although the probabilities are not exactly the same, they are reasonably close. The closer that your distribution is to being normal, the more accurate the theoretical probabilities will be.

Exercise 6: Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

Probability Chick Fil-A sugar is >10 and probability that calories are >400

chickfila <- fastfood %>% 
  filter(restaurant == "Chick Fil-A")
head(chickfila)
## # A tibble: 6 x 17
##   restaurant item  calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>    <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Chick Fil~ Char~      430     144        16     8           0          85
## 2 Chick Fil~ Char~      310      54         6     2           0          55
## 3 Chick Fil~ Chic~      270      99        11     2.5         0          45
## 4 Chick Fil~ 1 Pi~      120      54         6     3           0          25
## 5 Chick Fil~ 2 Pi~      230     108        12     3           0          55
## 6 Chick Fil~ 3 Pi~      350     153        17     3           0          70
## # ... with 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
## #   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
## #   salad <chr>
#puts Chick Fil-A into its own dataframe names chickfila

Find sd and mean for each: sugar and calories

sugarmean <- mean(chickfila$sugar)
sugarsd <- sd(chickfila$sugar)
calmean <- mean(chickfila$calories)
calsd <- sd(chickfila$calories)

####pnorm for sugar and calories

1-pnorm(10, mean = sugarmean, sd = sugarsd)
## [1] 0.05521115
1-pnorm(400, mean = calmean, sd = calsd)
## [1] 0.4718786
#theoretical probabilities are 1 - the results printed below

empirical method for sugar and calories

chickfila %>% 
  filter(sugar > 10) %>%
  summarise(percent = n() / nrow(chickfila))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1  0.0370
chickfila %>% 
  filter(calories > 400) %>%
  summarise(percent = n() / nrow(chickfila))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1   0.481

**They are as close as the others for McDonalds and Dairy Queen as far as appearing nearly normal

More Practice

Exercise 7: Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

**from the histograms it looks like Sonic or Burger King might be closest to normal.

fastfood %>% 
  group_by(restaurant) %>% 
  ggplot() +
  geom_histogram(aes(x = sodium), bins = 15) +
  ggtitle("Restaurant Sodium") +
  xlab("Sodium") +
  ylab("Frequency") +
  facet_wrap(. ~restaurant)

We can look at the qqplots

fastfood %>% 
group_by(restaurant) %>% 
ggplot(aes(sample = sodium)) + 
  geom_line(stat = "qq") +
  facet_wrap(.~restaurant)

**Looks like Burger King or Taco Bell are closest to normal. Let’s look at Burger King, Taco Bell, Sonic and Arbys by creating a dataframe for each to analyze

taco_bell <- fastfood %>% 
  filter(restaurant == "Taco Bell")
burger_king <- fastfood %>% 
  filter(restaurant == "Burger King")
sonic <- fastfood %>% 
  filter(restaurant == "Sonic")
arbys <- fastfood %>% 
  filter(restaurant == "Arbys")

Perform ShapiroTest to check for normality for each and put in a variable

**Answer: Arbys and Burger King are normal under this test whereas the results strongly suggest Taco Bell and Sonic are not

shapiro.test(taco_bell$sodium)
## 
##  Shapiro-Wilk normality test
## 
## data:  taco_bell$sodium
## W = 0.95501, p-value = 0.000699
shapiro.test(burger_king$sodium)
## 
##  Shapiro-Wilk normality test
## 
## data:  burger_king$sodium
## W = 0.97291, p-value = 0.1331
shapiro.test(sonic$sodium)
## 
##  Shapiro-Wilk normality test
## 
## data:  sonic$sodium
## W = 0.82286, p-value = 1.784e-06
shapiro.test(arbys$sodium)
## 
##  Shapiro-Wilk normality test
## 
## data:  arbys$sodium
## W = 0.97073, p-value = 0.1985

Exercise 8: Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. why do you think this might be the case?

**Answer: There are many repeated values

Exercise 9: As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed?

**Answer: appears skewed to the right

sonicplot <- sonic %>% 
  ggplot() +
  geom_line(aes(sample = total_carb), stat = "qq") +
  ggtitle("Sonic Normal Probability Plot - Carbohydrates")
sonicplot

Use a histogram to confirm your findings.

**Answer: confirms somewhat skewed to the right

sonic_hist <- sonic %>% 
  ggplot() +
  geom_histogram(aes(x = total_carb), binwidth = 10) +
  xlab("total carbohydrates") + 
  ylab("frequency") +
  ggtitle("Sonic Carbohydrates")
sonic_hist

Also can use density histogram

scarbsmean = mean(sonic$total_carb)
scarbssd = sd(sonic$total_carb)
ggplot(data = sonic, aes(x = total_carb)) +
  geom_blank() +
  geom_histogram(aes(y = ..density..), bins = 10)  +
  stat_function(fun = dnorm, args = c(mean = scarbsmean, sd = scarbssd), col = "red") +
  xlab("Total Carbohydrates")

  ggtitle("Sonic Carbohydrates")
## $title
## [1] "Sonic Carbohydrates"
## 
## attr(,"class")
## [1] "labels"