Load Packages

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata

Data

In this lab we’ll look at the normal distribution and at how to judge whether a data set is approximately normal. We’ll use the fastfood data set from the openintro package.

data("fastfood", package='openintro')
head(fastfood)
## # A tibble: 6 x 17
##   restaurant item       calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Mcdonalds  Artisan G~      380      60         7       2       0            95
## 2 Mcdonalds  Single Ba~      840     410        45      17       1.5         130
## 3 Mcdonalds  Double Ba~     1130     600        67      27       3           220
## 4 Mcdonalds  Grilled B~      750     280        31      10       0.5         155
## 5 Mcdonalds  Crispy Ba~      920     410        45      12       0.5         120
## 6 Mcdonalds  Big Mac         540     250        28      10       1            80
## # ... with 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
## #   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
## #   salad <chr>
mcdonalds <- fastfood %>%
  filter(restaurant=="Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant=="Dairy Queen")

Exercise 1

Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

Both histograms are roughly bell-shaped. The McDonald’s menu includes items with more calories from fat than the Dairy Queen menu. A noticeable difference is where the majority of values lie. The Dairy Queen histogram looks more spread out: there are large peaks around the 200, 440, and even 600 calorie marks. In the McDonald’s histogram, the majority of values fall between 0 and 500 calories from fat.

ggplot(data = mcdonalds, aes(x = cal_fat)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
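
To back up the visual comparison of center and spread, a numeric summary can help. The following is a minimal sketch, assuming the same fastfood data; the object name fat_summary and the choice of statistics are illustrative and not part of the original lab.

```r
library(dplyr)
library(openintro)  # provides the fastfood data set

# Center (mean, median) and spread (sd, IQR) of calories from fat
# for the two restaurants compared in this exercise
fat_summary <- fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen")) %>%
  group_by(restaurant) %>%
  summarise(
    n      = n(),
    mean   = mean(cal_fat),
    median = median(cal_fat),
    sd     = sd(cal_fat),
    iqr    = IQR(cal_fat)
  )

fat_summary
```

A mean well above the median would point to right skew, and comparing sd with iqr shows how much a few extreme items drive the spread.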

The normal distribution

A normal curve fitted to the data should have the same mean and standard deviation as the data. Here we store the mean and standard deviation of cal_fat for dairy_queen and for mcdonalds in separate objects so we can use them later when plotting.

dqmean <- mean(dairy_queen$cal_fat)
dqsd <- sd(dairy_queen$cal_fat)
mcmean <- mean(mcdonalds$cal_fat)
mcsd <- sd(mcdonalds$cal_fat)

Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

It’s difficult to tell from the plot alone, because the apparent shape of a histogram depends on the choice of bins. The histogram is roughly bell-shaped, so I would guess that the data are nearly normal, but I can’t say so definitively.

ggplot(data=dairy_queen, aes(x=cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
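
Because the impression of normality depends on the binning, it can help to set the bin width explicitly and redraw the plot a few times. This is a sketch only: the value binwidth = 50 is an arbitrary assumption, not a recommendation from the lab.

```r
library(ggplot2)
library(dplyr)
library(openintro)  # provides the fastfood data set

dairy_queen <- fastfood %>% filter(restaurant == "Dairy Queen")
dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)

# Density-scaled histogram with an explicit bin width (50 is an assumed value)
# and the normal curve with the same mean and standard deviation on top
p <- ggplot(dairy_queen, aes(x = cal_fat)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 50) +
  stat_function(fun = dnorm, args = list(mean = dqmean, sd = dqsd), col = "tomato")

p
```

Trying two or three different bin widths shows how much of the bell shape is a feature of the data rather than of the bins.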

Evaluating the normal distribution

ggplot(data = dairy_queen, aes(sample = cal_fat)) + 
  geom_line(stat = "qq")

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)

Exercise 3

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)

On the sim_norm plot the points don’t all fall on the line, but they follow it more closely than the points in the plot for the real data.

qqnorm(sim_norm) 
qqline(sim_norm)

qqnorm(dairy_queen$cal_fat)
qqline(dairy_queen$cal_fat)
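
To see what a normal probability plot actually draws, the coordinates can be computed by hand: the sorted sample values are plotted against standard normal quantiles at evenly spaced probabilities. The sketch below uses a small simulated sample (the seed and sample size are arbitrary assumptions) and checks the result against qqnorm().

```r
set.seed(1)        # arbitrary seed so the sketch is reproducible
x <- rnorm(42)     # simulated sample; size chosen only for illustration

n <- length(x)
theoretical <- qnorm(ppoints(n))  # standard normal quantiles at evenly spaced probabilities
observed    <- sort(x)            # ordered sample values

plot(theoretical, observed,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(x)  # reference line through the first and third quartiles

# qqnorm() computes the same coordinates, in the original order of x
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), observed)
```

If the sample is normal, the ordered values grow in step with the theoretical quantiles and the points stay near the line; curvature signals skew or heavy tails.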

Exercise 4

Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?

The normal probability plot for calories from fat looks similar to the plots of the simulated data. There are some small differences in the middle and at the end of the upper right tail. I think there is evidence that the calories from fat are nearly normal.

qqnormsim(sample = cal_fat, data = dairy_queen)

Exercise 5

Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

The McDonald’s plot is more curved than the simulated plots, so the calories from fat do not appear to come from a normal distribution.

qqnormsim(sample = cal_fat, data = mcdonalds)

Normal probabilities

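# pnorm() gives the lower tail P(cal_fat < 600); the upper tail, which is what
# the empirical share above 600 computed next measures, is 1 - 0.9849848, about 0.015
pnorm(q = 600, mean = dqmean, sd = dqsd)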
## [1] 0.9849848
dairy_queen %>% 
  filter(cal_fat > 600) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1  0.0476
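
For a like-for-like comparison, both numbers should describe the same event, here more than 600 calories from fat. A minimal sketch, assuming the same Dairy Queen subset; the names p_theory and p_empirical are illustrative.

```r
library(dplyr)
library(openintro)  # provides the fastfood data set

dairy_queen <- fastfood %>% filter(restaurant == "Dairy Queen")
dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)

# Theoretical P(cal_fat > 600): upper tail of the fitted normal curve
p_theory <- pnorm(q = 600, mean = dqmean, sd = dqsd, lower.tail = FALSE)

# Empirical P(cal_fat > 600): share of menu items above 600
p_empirical <- mean(dairy_queen$cal_fat > 600)

c(theoretical = p_theory, empirical = p_empirical)
```

Using lower.tail = FALSE is equivalent to subtracting the lower tail from 1, and mean() of a logical vector gives the proportion of TRUE values.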

Exercise 6

Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

In this hypothetical scenario I would like to find a meal with between 200 and 400 calories from fat at McDonald’s and at Dairy Queen. Comparing the theoretical and empirical probabilities, the normal model for Dairy Queen gave a higher probability than the empirical distribution (0.464 versus 0.357), while the opposite occurred for McDonald’s (0.349 versus 0.474). The two methods agreed slightly more closely for Dairy Queen, with a gap of about 0.11 compared with about 0.13 for McDonald’s.

pnorm(q = 400, mean = dqmean, sd = dqsd) -
  pnorm(q = 200, mean = dqmean, sd = dqsd)
## [1] 0.4641236
dairy_queen %>% 
  filter(cal_fat >= 200 & cal_fat <= 400) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1   0.357
pnorm(q = 400, mean = mcmean, sd = mcsd) -
  pnorm(q = 200, mean = mcmean, sd = mcsd)
## [1] 0.3485409
mcdonalds %>% 
  filter(cal_fat >= 200 & cal_fat <= 400) %>%
  summarise(percent = n() / nrow(mcdonalds))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1   0.474
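
The four calculations above repeat the same two steps, so they can be wrapped in a small helper. This is a sketch under stated assumptions: compare_interval is a hypothetical function written for illustration, not part of openintro or the lab.

```r
library(dplyr)
library(openintro)  # provides the fastfood data set

# Hypothetical helper: probability that a value lies in [lower, upper],
# once from the fitted normal curve and once from the observed data
compare_interval <- function(x, lower, upper) {
  m <- mean(x)
  s <- sd(x)
  c(
    theoretical = pnorm(upper, mean = m, sd = s) - pnorm(lower, mean = m, sd = s),
    empirical   = mean(x >= lower & x <= upper)
  )
}

dq <- fastfood %>% filter(restaurant == "Dairy Queen")
mc <- fastfood %>% filter(restaurant == "Mcdonalds")

dq_probs <- compare_interval(dq$cal_fat, 200, 400)
mc_probs <- compare_interval(mc$cal_fat, 200, 400)

# The absolute gap shows which restaurant has the closer agreement
rbind(
  dairy_queen = c(dq_probs, gap = unname(abs(diff(dq_probs)))),
  mcdonalds   = c(mc_probs, gap = unname(abs(diff(mc_probs))))
)
```

The same helper can be reused for a second probability question by changing the bounds or the variable.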

Exercise 7

Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

Looking at the distributions side by side, Arby’s data points appear visually to be the closest to normal for sodium.

ggplot(data = fastfood) +
  geom_histogram(aes(x = sodium)) +
  facet_wrap(~restaurant)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = fastfood, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  facet_wrap(~restaurant)
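
Judging closeness to normal is easier when each panel has a reference line. The sketch below is one possible variant of the faceted plot, using ggplot2's stat_qq() and stat_qq_line(); the free y scales and the line color are assumptions made for readability.

```r
library(ggplot2)
library(openintro)  # provides the fastfood data set

# One normal probability plot per restaurant, each with its own reference line
p <- ggplot(fastfood, aes(sample = sodium)) +
  stat_qq() +
  stat_qq_line(col = "tomato") +
  facet_wrap(~restaurant, scales = "free_y")

p
```

The restaurant whose points hug the line most closely across the whole range, including the tails, is the best candidate for a nearly normal sodium distribution.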

Exercise 8

Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

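I would say it is because many menu items share the same or very similar sodium values, likely because the amounts are reported as rounded numbers. Tied values sit at the same height on the plot, which produces flat steps.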
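
One way to check this explanation is to count how many distinct sodium values each restaurant has and how often the values are round numbers. This is a sketch only; the multiple-of-10 check is an assumption about how the values might be rounded, and sodium_ties is an illustrative name.

```r
library(dplyr)
library(openintro)  # provides the fastfood data set

# Few distinct values relative to the number of items means many ties,
# which appear as flat steps in a normal probability plot
sodium_ties <- fastfood %>%
  group_by(restaurant) %>%
  summarise(
    items                = n(),
    distinct_values      = n_distinct(sodium),
    share_multiple_of_10 = mean(sodium %% 10 == 0, na.rm = TRUE)
  )

sodium_ties
```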

Exercise 9

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

There are some outliers in the histogram, but overall I would say the distribution is close to symmetric.

taco_bell <- fastfood %>%
  filter(restaurant=="Taco Bell")
ggplot(data = taco_bell, aes(sample = total_carb)) + 
  geom_line(stat = "qq")

ggplot(data = taco_bell, aes(x=total_carb))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
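
A numeric check can complement the two plots. The sketch below compares the mean with the median and computes a simple moment-based sample skewness for the Taco Bell carbohydrates; this statistic is an addition for illustration and is not part of the original lab.

```r
library(dplyr)
library(openintro)  # provides the fastfood data set

taco_bell <- fastfood %>% filter(restaurant == "Taco Bell")
carbs <- taco_bell$total_carb

# Moment-based sample skewness: positive suggests right skew,
# negative suggests left skew, values near 0 suggest symmetry
centered <- carbs - mean(carbs)
skewness <- mean(centered^3) / mean(centered^2)^1.5

c(mean = mean(carbs), median = median(carbs), skewness = skewness)
```

A mean noticeably larger than the median, together with positive skewness, would indicate right skew; similar values and skewness near zero would support the symmetric reading.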