Load packages and the data

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
glimpse(fastfood)
## Rows: 515
## Columns: 17
## $ restaurant  <chr> "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdon~
## $ item        <chr> "Artisan Grilled Chicken Sandwich", "Single Bacon Smokehou~
## $ calories    <dbl> 380, 840, 1130, 750, 920, 540, 300, 510, 430, 770, 380, 62~
## $ cal_fat     <dbl> 60, 410, 600, 280, 410, 250, 100, 210, 190, 400, 170, 300,~
## $ total_fat   <dbl> 7, 45, 67, 31, 45, 28, 12, 24, 21, 45, 18, 34, 20, 34, 8, ~
## $ sat_fat     <dbl> 2.0, 17.0, 27.0, 10.0, 12.0, 10.0, 5.0, 4.0, 11.0, 21.0, 4~
## $ trans_fat   <dbl> 0.0, 1.5, 3.0, 0.5, 0.5, 1.0, 0.5, 0.0, 1.0, 2.5, 0.0, 1.5~
## $ cholesterol <dbl> 95, 130, 220, 155, 120, 80, 40, 65, 85, 175, 40, 95, 125, ~
## $ sodium      <dbl> 1110, 1580, 1920, 1940, 1980, 950, 680, 1040, 1040, 1290, ~
## $ total_carb  <dbl> 44, 62, 63, 62, 81, 46, 33, 49, 35, 42, 38, 48, 48, 67, 31~
## $ fiber       <dbl> 3, 2, 3, 2, 4, 3, 2, 3, 2, 3, 2, 3, 3, 5, 2, 2, 3, 3, 5, 2~
## $ sugar       <dbl> 11, 18, 18, 18, 18, 9, 7, 6, 7, 10, 5, 11, 11, 11, 6, 3, 1~
## $ protein     <dbl> 37, 46, 70, 55, 46, 25, 15, 25, 25, 51, 15, 32, 42, 33, 13~
## $ vit_a       <dbl> 4, 6, 10, 6, 6, 10, 10, 0, 20, 20, 2, 10, 10, 10, 2, 4, 6,~
## $ vit_c       <dbl> 20, 20, 20, 25, 20, 2, 2, 4, 4, 6, 0, 10, 20, 15, 2, 6, 15~
## $ calcium     <dbl> 20, 20, 50, 20, 20, 15, 10, 2, 15, 20, 15, 35, 35, 35, 4, ~
## $ salad       <chr> "Other", "Other", "Other", "Other", "Other", "Other", "Oth~
data("fastfood", package = "openintro")

Filter Data

mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

Exercise 1

Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

ggplot(data = mcdonalds, aes(x = cal_fat)) +
  geom_histogram(binwidth = 100)

ggplot(data = dairy_queen, aes(x = cal_fat))+
  geom_histogram(binwidth = 100)

The McDonald's data is right-skewed, while the Dairy Queen data looks roughly symmetric and closer to normal. The McDonald's data also has a much greater spread, with a maximum of about 1250 compared with about 750 for Dairy Queen. In both, the majority of the items fall around 200 calories from fat.

The normal distribution

Calculate mean and standard deviation

dqmean <- mean(dairy_queen$cal_fat)
dqsd <- sd(dairy_queen$cal_fat)

Plot density histogram and normal distribution line

ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density)), binwidth = 100) +
  stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")

Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

The data is approximately normal, but not perfectly so.

Evaluating the normal distribution

QQ Plot

ggplot(data = dairy_queen, aes(sample = cal_fat)) +
  geom_line(stat = "qq")

Simulated normal

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
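Because rnorm() draws random numbers, sim_norm will differ each time the report is knit. A minimal sketch of how the simulation could be made reproducible, assuming a fixed seed is acceptable here; the seed value and the stand-in mean, standard deviation, and sample size are hypothetical and not taken from the lab:

```r
# Hypothetical stand-ins for dqmean, dqsd, and nrow(dairy_queen)
example_mean <- 260
example_sd <- 156
example_n <- 42

set.seed(1234)  # arbitrary seed; any fixed integer works
sim_a <- rnorm(n = example_n, mean = example_mean, sd = example_sd)

set.seed(1234)  # resetting the seed reproduces the same draws
sim_b <- rnorm(n = example_n, mean = example_mean, sd = example_sd)

identical(sim_a, sim_b)  # TRUE
```

Calling set.seed() once before the rnorm() line would be enough to keep the simulated QQ plot stable between runs.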

Exercise 3

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a dataframe, it can be put directly into the sample argument and the data argument can be dropped.)

Normal plot

ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_line(stat = "qq")

The plot for the simulated data is fairly similar to the plot for the actual data.

Compare the data

qqnormsim(sample = cal_fat, data = dairy_queen)

Exercise 4

Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories from fat are nearly normal?

The plots for the actual data and the simulated data are very similar. This suggests that the calories from fat may be nearly normal, but a statistical test is needed to determine the strength of the evidence.
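One statistical test that could be used for this is the Shapiro-Wilk test, available in base R as shapiro.test(). The sketch below is only an illustration on synthetic data; the samples x_normal and x_skewed are hypothetical stand-ins and not the Dairy Queen values:

```r
set.seed(42)
# Synthetic stand-in for dairy_queen$cal_fat (hypothetical values)
x_normal <- rnorm(n = 42, mean = 260, sd = 156)
# A clearly right-skewed sample for contrast
x_skewed <- rexp(n = 42, rate = 1 / 260)

# Null hypothesis of the Shapiro-Wilk test: the sample comes from a normal distribution
res_normal <- shapiro.test(x_normal)
res_skewed <- shapiro.test(x_skewed)

res_normal$p.value  # a large p-value gives no evidence against normality
res_skewed$p.value  # a small p-value is evidence against normality
```

On the real data the call would be shapiro.test(dairy_queen$cal_fat). A large p-value does not prove normality; it only means the test found no evidence against it.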

Exercise 5

Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

Calculate mean and standard deviation

mdmean <- mean(mcdonalds$cal_fat)
mdsd <- sd(mcdonalds$cal_fat)

Plot density histogram and normal distribution line

ggplot(data = mcdonalds, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density)), binwidth = 100) +
  stat_function(fun = dnorm, args = c(mean = mdmean, sd = mdsd), col = "tomato")

Evaluating the normal distribution

QQ Plot

ggplot(data = mcdonalds, aes(sample = cal_fat)) +
  geom_line(stat = "qq")

Simulated normal

sim_norm_md <- rnorm(n = nrow(mcdonalds), mean = mdmean, sd = mdsd)

Normal plot

ggplot(data = NULL, aes(sample = sim_norm_md)) +
  geom_line(stat = "qq")

Compare the data

qqnormsim(sample = cal_fat, data = mcdonalds)

The plot for the actual data differs from the plots for the simulated data. Therefore, the actual data likely does not follow a normal distribution.

Normal probabilities

Probability that a randomly selected item from the Dairy Queen menu has more than 600 calories from fat

1 - pnorm(q = 600, mean = dqmean, sd = dqsd)
## [1] 0.01501523

Proportion of items > 600

dairy_queen %>%
  filter(cal_fat > 600) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1  0.0476
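The two calculations above can also be written in base R. This is a minimal sketch on a small made-up vector (the values in cal_fat below are hypothetical, not the Dairy Queen data); it uses the lower.tail argument of pnorm() in place of subtracting from 1, and the fact that the mean of a logical vector is the proportion of TRUE values:

```r
# Hypothetical calories-from-fat values, not the real Dairy Queen data
cal_fat <- c(120, 160, 180, 220, 240, 260, 310, 350, 430, 670)

m <- mean(cal_fat)
s <- sd(cal_fat)

# Theoretical upper-tail probability under a normal model
p_theoretical <- pnorm(q = 600, mean = m, sd = s, lower.tail = FALSE)

# Empirical proportion: share of observed values above the cutoff
p_empirical <- mean(cal_fat > 600)

c(theoretical = p_theoretical, empirical = p_empirical)
```

The gap between the two numbers is a rough indication of how well the normal model describes the upper tail of the data.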

Exercise 6

Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

What is the probability that a randomly chosen item on the McDonald’s menu has a sugar content greater than 15?

Calculate mean and standard deviation

sgmean <- mean(mcdonalds$sugar)
sgsd <- sd(mcdonalds$sugar)

Probability that a randomly selected item from the McDonald’s menu has sugar > 15

1 - pnorm(q = 15, mean = sgmean, sd = sgsd)
## [1] 0.3842

Actual proportion of items > 15

mcdonalds %>%
  filter(sugar > 15) %>%
  summarise(percent = n() / nrow(mcdonalds))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1   0.158

What is the probability that a randomly chosen item from the Dairy Queen menu has cholesterol over 100?

Calculate mean and standard deviation

cholmean <- mean(dairy_queen$cholesterol)
cholsd <- sd(dairy_queen$cholesterol)

Probability that a randomly selected item from the Dairy Queen menu has cholesterol > 100

1 - pnorm(q = 100, mean = cholmean, sd = cholsd)
## [1] 0.2577334

Actual proportion of items > 100

dairy_queen %>%
  filter(cholesterol > 100) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1   0.190

The two methods disagreed in both cases, but the agreement was closer for the Dairy Queen cholesterol question (theoretical 0.26 versus actual 0.19) than for the McDonald’s sugar question (theoretical 0.38 versus actual 0.16).

Exercise 7

Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")
chickfila <- fastfood %>%
  filter(restaurant == "Chick Fil-A")
sonic <- fastfood %>%
  filter(restaurant == "Sonic")
bk <- fastfood %>%
  filter(restaurant == "Burger King")
arbys <- fastfood %>%
  filter(restaurant == "Arbys")
subway <- fastfood %>%
  filter(restaurant == "Subway")
taco_bell <- fastfood %>%
  filter(restaurant == "Taco Bell")

McDonald’s:

qqnormsim(sample = sodium, data = mcdonalds)

Dairy Queen:

qqnormsim(sample = sodium, data = dairy_queen)

Arby’s:

qqnormsim(sample = sodium, data = arbys)

Burger King:

qqnormsim(sample = sodium, data = bk)

Chick Fil-A:

qqnormsim(sample = sodium, data = chickfila)

Sonic:

qqnormsim(sample = sodium, data = sonic)

Subway:

qqnormsim(sample = sodium, data = subway)

Taco Bell:

qqnormsim(sample = sodium, data = taco_bell)

The sodium data from Burger King appears to be the closest to normally distributed.
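The comparison above was made by eye. As a possible numeric cross-check, the sketch below computes the correlation between the sample quantiles and the theoretical normal quantiles for each group, a technique swapped in here for illustration and not part of the lab. The data frame toy, its restaurant labels, and the helper qq_cor() are all hypothetical:

```r
set.seed(7)
# Synthetic stand-in for fastfood (hypothetical restaurants and values)
toy <- data.frame(
  restaurant = rep(c("A", "B"), each = 50),
  sodium = c(rnorm(50, mean = 1200, sd = 300),  # roughly normal
             rexp(50, rate = 1 / 1200))         # right-skewed
)

# Correlation between theoretical and sample quantiles of the QQ plot
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)  # returns the QQ plot coordinates without drawing
  cor(qq$x, qq$y)
}

# One value per restaurant; values nearer 1 mean a straighter QQ plot
qq_by_restaurant <- sapply(split(toy$sodium, toy$restaurant), qq_cor)
qq_by_restaurant
```

With the real data, split(fastfood$sodium, fastfood$restaurant) would give one correlation per restaurant, which could then be ranked alongside the visual impression from the plots.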

Exercise 8

Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

Different items on the menu might have different “levels” of sodium. For example, a burger might have more sodium than a sandwich. Because there are many different types of burgers and sandwiches on the various restaurant menus, there are various “levels” to be found on the distribution of sodium as well, with items that are similar and have the same amount of sodium grouped together.
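Whatever the cause of the grouping, repeated values are what produce the flat steps. The sketch below illustrates this on simulated data, assuming for the sake of the example that values are reported rounded; the sodium values and the rounding to the nearest 100 are hypothetical choices made to keep the effect visible:

```r
set.seed(99)
# Hypothetical sodium values in mg, not taken from the fastfood data
raw <- rnorm(n = 50, mean = 1200, sd = 300)
rounded <- round(raw, digits = -2)  # round to the nearest 100

# Fewer distinct values means more ties, which appear as flat steps in a QQ plot
n_distinct_raw <- length(unique(raw))
n_distinct_rounded <- length(unique(rounded))

c(raw = n_distinct_raw, rounded = n_distinct_rounded)
```

Plotting qqnorm(raw) next to qqnorm(rounded) would show a smooth curve for the first and a staircase for the second.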

Exercise 9

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

Total carbohydrates in Sonic menu items:

qqnormsim(sample = total_carb, data = sonic)

The data is skewed right, as there are outliers in the upper right-hand corner of the plotted data.

Histogram:

ggplot(data = sonic, aes(x = total_carb)) +
  geom_histogram(binwidth = 10)

The histogram shows that the data is indeed skewed right.
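A numeric check can complement the two plots. The sketch below is a minimal illustration on made-up values (the vector total_carb is hypothetical, not the Sonic data) and uses one common moment-based definition of sample skewness; the helper skewness() is defined here and is not a base R function:

```r
# Hypothetical total carbohydrate values with a long right tail
total_carb <- c(20, 25, 30, 32, 35, 38, 40, 45, 60, 95, 140)

# Moment-based sample skewness: positive values indicate right skew
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skewness(total_carb)

# In right-skewed data the mean is usually pulled above the median
mean(total_carb) > median(total_carb)
```

On the real data, skewness(sonic$total_carb) would be expected to be positive if the right skew seen in the histogram holds.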