knitr::opts_chunk$set(eval = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
data("fastfood", package='openintro')
head(fastfood)
## # A tibble: 6 × 17
## restaurant item calories cal_fat total_fat sat_fat trans_fat cholesterol
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mcdonalds Artisan G… 380 60 7 2 0 95
## 2 Mcdonalds Single Ba… 840 410 45 17 1.5 130
## 3 Mcdonalds Double Ba… 1130 600 67 27 3 220
## 4 Mcdonalds Grilled B… 750 280 31 10 0.5 155
## 5 Mcdonalds Crispy Ba… 920 410 45 12 0.5 120
## 6 Mcdonalds Big Mac 540 250 28 10 1 80
## # … with 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
## # sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
## # salad <chr>
glimpse(fastfood)
## Rows: 515
## Columns: 17
## $ restaurant <chr> "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdon…
## $ item <chr> "Artisan Grilled Chicken Sandwich", "Single Bacon Smokehou…
## $ calories <dbl> 380, 840, 1130, 750, 920, 540, 300, 510, 430, 770, 380, 62…
## $ cal_fat <dbl> 60, 410, 600, 280, 410, 250, 100, 210, 190, 400, 170, 300,…
## $ total_fat <dbl> 7, 45, 67, 31, 45, 28, 12, 24, 21, 45, 18, 34, 20, 34, 8, …
## $ sat_fat <dbl> 2.0, 17.0, 27.0, 10.0, 12.0, 10.0, 5.0, 4.0, 11.0, 21.0, 4…
## $ trans_fat <dbl> 0.0, 1.5, 3.0, 0.5, 0.5, 1.0, 0.5, 0.0, 1.0, 2.5, 0.0, 1.5…
## $ cholesterol <dbl> 95, 130, 220, 155, 120, 80, 40, 65, 85, 175, 40, 95, 125, …
## $ sodium <dbl> 1110, 1580, 1920, 1940, 1980, 950, 680, 1040, 1040, 1290, …
## $ total_carb <dbl> 44, 62, 63, 62, 81, 46, 33, 49, 35, 42, 38, 48, 48, 67, 31…
## $ fiber <dbl> 3, 2, 3, 2, 4, 3, 2, 3, 2, 3, 2, 3, 3, 5, 2, 2, 3, 3, 5, 2…
## $ sugar <dbl> 11, 18, 18, 18, 18, 9, 7, 6, 7, 10, 5, 11, 11, 11, 6, 3, 1…
## $ protein <dbl> 37, 46, 70, 55, 46, 25, 15, 25, 25, 51, 15, 32, 42, 33, 13…
## $ vit_a <dbl> 4, 6, 10, 6, 6, 10, 10, 0, 20, 20, 2, 10, 10, 10, 2, 4, 6,…
## $ vit_c <dbl> 20, 20, 20, 25, 20, 2, 2, 4, 4, 6, 0, 10, 20, 15, 2, 6, 15…
## $ calcium <dbl> 20, 20, 50, 20, 20, 15, 10, 2, 15, 20, 15, 35, 35, 35, 4, …
## $ salad <chr> "Other", "Other", "Other", "Other", "Other", "Other", "Oth…
Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?
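Both distributions are unimodal and right skewed (the tail runs to the right). They differ mainly in center and in the length of the right tail: McDonald's has a higher minimum (50 vs. 0), median (240 vs. 220), mean (285.6 vs. 260.5), and maximum (1270 vs. 670) for calories from fat. The middle halves have similar widths (an IQR of 160 for McDonald's vs. 150 for Dairy Queen), so most of the extra spread at McDonald's comes from a few very high-fat items. The two histograms also use different x-axis scales (increments of 200 cal for McDonald's and 100 cal for Dairy Queen); that is a plotting artifact rather than a feature of the data, so the shapes should be compared with care.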
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")
summary(mcdonalds$cal_fat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.0 160.0 240.0 285.6 320.0 1270.0
hist(mcdonalds$cal_fat)
summary(dairy_queen$cal_fat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 160.0 220.0 260.5 310.0 670.0
hist(dairy_queen$cal_fat)
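Since the two base-R histograms pick their own axes, one way to compare shapes and spreads directly is to draw both restaurants with the same bin width on a shared x-axis. The following is a minimal sketch rather than part of the original lab; the object names `two_restaurants` and `p_cal_fat` and the bin width of 100 are assumptions chosen for illustration.

```r
library(tidyverse)
library(openintro)

# Keep only the two restaurants being compared
two_restaurants <- fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen"))

# Same bin width and a shared x-axis, so the panels are directly comparable
# (binwidth = 100 is an arbitrary illustrative choice)
p_cal_fat <- ggplot(two_restaurants, aes(x = cal_fat)) +
  geom_histogram(binwidth = 100, boundary = 0) +
  facet_wrap(~ restaurant, ncol = 1) +
  labs(x = "Calories from fat", y = "Number of items")
p_cal_fat

# Numeric companions to the plot: center and spread per restaurant
two_restaurants %>%
  group_by(restaurant) %>%
  summarise(median = median(cal_fat), iqr = IQR(cal_fat), sd = sd(cal_fat))
```

Stacking the panels vertically (`ncol = 1`) lines up the x-axes, which makes differences in the right tail easier to see.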
Based on the plot (below), does it appear that the data follow a nearly normal distribution?
Yes, approximately. The fit is not perfect: the tallest bars rise well above the normal curve, and there is a cluster of high density in the far right tail. With that said, there is a central peak near 220 cal_fat, and the density approaches 0 near the minimum and maximum cal_fat values, so the case can be made that the distribution is nearly normal.
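dqmean <- mean(dairy_queen$cal_fat)
dqsd <- sd(dairy_queen$cal_fat)
# Density-scaled histogram with the fitted normal curve overlaid.
# after_stat(density) replaces ..density.., which is deprecated as of ggplot2 3.4.0.
ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")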
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a dataframe, it can be put directly into the sample argument and the data argument can be dropped.)
No, not all of the points fall on the line. The probability plots for the simulated and real data are similar but not the same: the real data has a shallower slope from -2 to -1 on the x-axis and a steeper slope from 1 to 2. Other than that nuance, the plots are very similar.
# Construct a normal probability (Q-Q) plot
ggplot(data = dairy_queen, aes(sample = cal_fat)) +
  geom_line(stat = "qq")
# Simulate data from a normal distribution
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
# Plot the simulated data
ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_line(stat = "qq")
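The plots above connect the quantiles with a line, so there is no separate reference line to judge the points against. A possible variant, sketched here as an illustration and not taken from the lab, draws the points with `geom_qq()` and adds a reference line with `geom_qq_line()`, which by default passes through the first and third quartiles. The name `p_qq` and the axis labels are assumptions.

```r
library(tidyverse)
library(openintro)

dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

# Points show the sample quantiles; the line is a reference for a normal fit
p_qq <- ggplot(dairy_queen, aes(sample = cal_fat)) +
  geom_qq() +
  geom_qq_line(colour = "tomato") +
  labs(x = "Theoretical quantiles", y = "Sample quantiles (cal_fat)")
p_qq
```

Points that bend away from the line at either end indicate tails that are heavier or lighter than a normal distribution would produce.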
Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the cal_fat are nearly normal?
Per lab text: A data set that is nearly normal will result in a probability plot where the points closely follow a diagonal line. Any deviations from normality lead to deviations of these points from that line.
The Dairy Queen cal_fat data (plotted below) are nearly normal. The points closely follow a diagonal line and look very similar to some of the plots generated by qqnormsim (e.g., sim 7).
qqnormsim(sample = cal_fat, data = dairy_queen)
Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.
The McDonald’s cal_fat data (plotted below) are nearly normal. Although the slope is shallow at the low end and steeper at the high end, the points form a roughly diagonal line and closely mimic a couple of the simulated plots (e.g., sim 1 or 3) up until those higher values.
qqnormsim(sample = cal_fat, data = mcdonalds)
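To make the comparison with simulated data more concrete, the idea can be reproduced by hand: draw several normal samples with the same size, mean, and standard deviation as the real data, then plot them next to the real data. This is only a rough sketch of the idea and not the openintro implementation of `qqnormsim`; the seed, the choice of eight simulated panels, and the names `sims` and `qq_data` are assumptions.

```r
library(tidyverse)
library(openintro)

set.seed(1)  # arbitrary seed so the simulated panels are reproducible

mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
n  <- nrow(mcdonalds)
mu <- mean(mcdonalds$cal_fat)
s  <- sd(mcdonalds$cal_fat)

# Eight simulated samples from a normal with the data's mean and sd
sims <- map_dfr(1:8, function(i) {
  tibble(panel = paste("sim", i), value = rnorm(n, mean = mu, sd = s))
})

# One panel of real data plus the simulated panels
qq_data <- bind_rows(tibble(panel = "data", value = mcdonalds$cal_fat), sims)

ggplot(qq_data, aes(sample = value)) +
  geom_qq() +
  facet_wrap(~ panel)
```

If the "data" panel is hard to tell apart from the simulated panels, that supports the nearly normal reading; a panel that stands out (for example, with a much steeper upper tail) argues against it.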
Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?
Question 1: What is the probability that a randomly chosen Arby’s product has more than 20 g of protein?
# Arby's: more than 20 g of protein
arbys <- fastfood %>%
  filter(restaurant == "Arbys")
a_mean <- mean(arbys$protein)
a_sd <- sd(arbys$protein)
1 - pnorm(q = 20, mean = a_mean, sd = a_sd)
## [1] 0.7725201
arbys %>%
  filter(protein > 20) %>%
  summarise(percent = n() / nrow(arbys))
## # A tibble: 1 × 1
## percent
## <dbl>
## 1 0.745
Question 2: What is the probability that a randomly chosen product from any of these fast food restaurants has fewer than 1000 calories?
#< 1000 cals calculations:
ff_mean <- mean(fastfood$calories)
ff_sd <- sd(fastfood$calories)
pnorm(q = 1000, mean = ff_mean, sd = ff_sd)
## [1] 0.9516294
fastfood %>%
  filter(calories < 1000) %>%
  summarise(percent = n() / nrow(fastfood))
## # A tibble: 1 × 1
## percent
## <dbl>
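## 1 0.938
The second question showed closer agreement between the two methods: its theoretical and empirical probabilities differ by about 0.01 (0.952 vs. 0.938), compared with about 0.03 for the first question (0.773 vs. 0.745).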
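Both questions repeat the same two steps: a `pnorm()` call using the sample mean and standard deviation, and the observed share of items that meet the condition. As a sketch, those steps could be wrapped in a small helper. The function name `compare_probs` and its arguments are hypothetical and not part of the lab or of any package.

```r
library(tidyverse)
library(openintro)

# Hypothetical helper: P(X < cutoff) under a fitted normal vs. the observed share
compare_probs <- function(x, cutoff) {
  x <- x[!is.na(x)]  # drop missing values before estimating mean and sd
  tibble(
    theoretical = pnorm(cutoff, mean = mean(x), sd = sd(x)),
    empirical   = mean(x < cutoff)
  ) %>%
    mutate(abs_diff = abs(theoretical - empirical))
}

# Question 2: share of all items with fewer than 1000 calories
compare_probs(fastfood$calories, cutoff = 1000)
```

For a "greater than" question such as Question 1, the theoretical value is one minus the `pnorm()` result, and the empirical share would use `mean(x > cutoff)` instead.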
Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?
Based on the plots below, Burger King and Chick Fil-A had the distributions closest to normal for sodium.
# Arbys sodium plot
arbys <- fastfood %>%
  filter(restaurant == "Arbys")
qqnorm(arbys$sodium, main = "Arbys")
# Burger King sodium plot
bk <- fastfood %>%
  filter(restaurant == "Burger King")
qqnorm(bk$sodium, main = "Burger King")
# Chick Fil-A sodium plot **
cfa <- fastfood %>%
  filter(restaurant == "Chick Fil-A")
qqnorm(cfa$sodium, main = "Chick Fil-A")
# Dairy Queen sodium plot
dq <- fastfood %>%
  filter(restaurant == "Dairy Queen")
qqnorm(dq$sodium, main = "Dairy Queen")
# McDonald's sodium plot
mcd <- fastfood %>%
  filter(restaurant == "Mcdonalds")
qqnorm(mcd$sodium, main = "McDonald's")
# Sonic sodium plot
s <- fastfood %>%
  filter(restaurant == "Sonic")
qqnorm(s$sodium, main = "Sonic")
# Subway sodium plot
sw <- fastfood %>%
  filter(restaurant == "Subway")
qqnorm(sw$sodium, main = "Subway")
# Taco Bell sodium plot
tb <- fastfood %>%
  filter(restaurant == "Taco Bell")
qqnorm(tb$sodium, main = "Taco Bell")
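The eight per-restaurant plots can also be drawn in a single call, which puts them side by side for comparison. This is a sketch of an alternative layout, not the approach used above; the name `p_sodium` and the free y-axis setting are assumptions.

```r
library(tidyverse)
library(openintro)

# One Q-Q panel per restaurant, each with its own reference line
p_sodium <- ggplot(fastfood, aes(sample = sodium)) +
  geom_qq() +
  geom_qq_line(colour = "tomato") +
  facet_wrap(~ restaurant, scales = "free_y") +
  labs(x = "Theoretical quantiles", y = "Sodium (sample quantiles)")
p_sodium
```

Freeing the y-axis lets each panel use its own sodium range, so the judgment rests on how closely the points track the line rather than on differences in scale between restaurants.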
Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?
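The stepwise pattern could be due to these restaurants offering a variety of products (e.g., french fries, soft drinks, sandwiches, salads), with sodium content varying considerably across these food groups. Rounding may also contribute: the sodium values shown in the glimpse() output above are all multiples of 10, and repeated (tied) values appear as flat steps in a normal probability plot.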
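One way to probe the stepwise pattern is to count how many distinct sodium values each restaurant has relative to its number of items, and to check whether the values are rounded. The sketch below is exploratory and not part of the lab; the name `sodium_ties` and the multiple-of-10 check are assumptions about what might be worth examining.

```r
library(tidyverse)
library(openintro)

# Fewer distinct values than items means ties, which show up as flat steps
sodium_ties <- fastfood %>%
  group_by(restaurant) %>%
  summarise(
    items = n(),
    distinct_sodium = n_distinct(sodium),
    all_multiples_of_10 = all(sodium %% 10 == 0, na.rm = TRUE)
  )
sodium_ties
```

A restaurant with many items but relatively few distinct sodium values would be expected to show more pronounced steps.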
As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
Based on the normal probability plot (below), total carbohydrates at Dairy Queen is right skewed. The histogram confirms this: the data are concentrated on the left, with a tail running to the right.
# Normal probability plot for total carbohydrates from Dairy Queen
qqnorm(dq$total_carb, main = "Dairy Queen Carbs")
qqline(dq$total_carb)
# Use a histogram to confirm
hist(dq$total_carb)
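The visual reading of skewness can be backed up with two quick numbers: in a right-skewed distribution the mean typically sits above the median, and a moment-based skewness coefficient is positive. The following is a minimal sketch; `sample_skewness` is a hypothetical helper using one common definition (third central moment divided by the cubed sample standard deviation), not a function from the lab or from openintro.

```r
library(tidyverse)
library(openintro)

# Hypothetical helper: moment-based skewness; positive values suggest right skew
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

dq <- fastfood %>%
  filter(restaurant == "Dairy Queen")

# Mean above median and positive skewness both point toward a right skew
c(
  mean     = mean(dq$total_carb, na.rm = TRUE),
  median   = median(dq$total_carb, na.rm = TRUE),
  skewness = sample_skewness(dq$total_carb)
)
```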