library(tidyverse)
library(openintro)
This week we will look at fast food data. The data set contains data on 515 menu items from some of the most popular fast food restaurants worldwide. Let’s examine the data:
head(fastfood)
## # A tibble: 6 x 17
## restaurant item calories cal_fat total_fat sat_fat trans_fat cholesterol
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mcdonalds Arti… 380 60 7 2 0 95
## 2 Mcdonalds Sing… 840 410 45 17 1.5 130
## 3 Mcdonalds Doub… 1130 600 67 27 3 220
## 4 Mcdonalds Gril… 750 280 31 10 0.5 155
## 5 Mcdonalds Cris… 920 410 45 12 0.5 120
## 6 Mcdonalds Big … 540 250 28 10 1 80
## # … with 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
## # sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
## # salad <chr>
We’ll focus on three columns: restaurant, calories, and calories from fat.
We’ll start with products from McDonald’s and Dairy Queen.
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")
Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?
mcdonalds %>%
  ggplot() +
  geom_histogram(mapping = aes(x = calories), binwidth = 100) +
  labs(title = "Calorie distribution in McDonald's menu items")
mcdonalds %>%
  ggplot() +
  geom_histogram(mapping = aes(x = cal_fat), binwidth = 50) +
  labs(title = "Calories-from-fat distribution in McDonald's menu items")
dairy_queen %>%
  ggplot() +
  geom_histogram(mapping = aes(x = calories), binwidth = 50) +
  labs(title = "Calorie distribution in Dairy Queen menu items")
dairy_queen %>%
  ggplot() +
  geom_histogram(mapping = aes(x = cal_fat), binwidth = 50) +
  labs(title = "Calories-from-fat distribution in Dairy Queen menu items")
The two distributions are fairly similar. Both peak at around 500 calories per item, with a tail extending to the right (right-skewed), and both peak at around 200 calories from fat per item.
The data could plausibly be bell-shaped, but I would not conclude that yet.
We’ll plot bell curves on top of the histograms.
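The visual comparison of centers and spreads can be backed up with a few numbers. The following is a minimal sketch, assuming the `fastfood` data from `openintro` as loaded above; the object name `cal_fat_summary` is introduced here for illustration only.

```r
library(tidyverse)
library(openintro)

# Numeric summaries of calories from fat for the two restaurants
cal_fat_summary <- fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen")) %>%
  group_by(restaurant) %>%
  summarise(
    n      = n(),            # number of menu items
    mean   = mean(cal_fat),  # center (sensitive to the right tail)
    median = median(cal_fat),# center (robust to the right tail)
    sd     = sd(cal_fat),    # spread
    iqr    = IQR(cal_fat)    # spread (robust)
  )

cal_fat_summary
```

A mean that sits well above the median for a restaurant would be consistent with the right skew seen in the histograms.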
dqmean <- mean(dairy_queen$cal_fat)
dqsd <- sd(dairy_queen$cal_fat)
ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato") +
  labs(title = "Calories from Fat in Dairy Queen Food")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram broadly follows the overlaid bell curve, but I am still not convinced.
Since it is still difficult to say, we’ll try another approach with a normal Q-Q plot.
ggplot(data = dairy_queen, aes(sample = cal_fat)) +
  geom_line(stat = "qq")
Let’s compare this to data that we know comes from a normal distribution.
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a dataframe, it can be put directly into the sample argument and the data argument can be dropped.)
ggplot() +
  geom_line(mapping = aes(sample = sim_norm), stat = "qq")
qqnormsim(sample = cal_fat, data = dairy_queen)
Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories from fat are nearly normal?
I would say that the data on calories-from-fat in Dairy Queen is similar enough to the above plots to be considered approximately normal.
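Judging a Q-Q plot is easier with a reference line to compare against. The sketch below is one way to draw it, using `geom_qq()` and `geom_qq_line()` from ggplot2 in place of `geom_line(stat = "qq")`; the object name `qq_dq` and the axis labels are illustrative choices, not part of the original lab.

```r
library(tidyverse)
library(openintro)

dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

# Points show the sample quantiles against theoretical normal quantiles;
# the line passes through the first and third quartiles by default.
qq_dq <- ggplot(data = dairy_queen, aes(sample = cal_fat)) +
  geom_qq() +
  geom_qq_line(colour = "tomato") +
  labs(
    title = "Calories from fat in Dairy Queen food, normal Q-Q plot",
    x = "Theoretical quantiles",
    y = "Sample quantiles"
  )

qq_dq
```

Points that bend away from the line at the upper end would point to a heavier right tail than a normal distribution has.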
mcdmean <- mean(mcdonalds$cal_fat)
mcdsd <- sd(mcdonalds$cal_fat)
ggplot(data = mcdonalds, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = c(mean = mcdmean, sd = mcdsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = mcdonalds, aes(sample = cal_fat)) +
  geom_line(stat = "qq")
The McDonald’s data deviates substantially from the straight line that normally distributed data would trace on a Q-Q plot. I would have to conclude that this data is not approximately normal.
Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?
Personally, I am generally the most concerned with the sugar and protein content of my food. While I try to eat fast food as little as possible, I probably have food from Subway and Taco Bell more than any other on the list, so I will look at these.
subway <- fastfood %>%
  filter(restaurant == "Subway")
taco_bell <- fastfood %>%
  filter(restaurant == "Taco Bell")
subway_sugar_mean <- mean(subway$sugar)
subway_sugar_sd <- sd(subway$sugar)
subway_protein_mean <- mean(subway$protein)
subway_protein_sd <- sd(subway$protein)
tb_sugar_mean <- mean(taco_bell$sugar)
tb_sugar_sd <- sd(taco_bell$sugar)
tb_protein_mean <- mean(taco_bell$protein)
tb_protein_sd <- sd(taco_bell$protein)
ggplot(data = subway, aes(x = sugar)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5) +
  stat_function(fun = dnorm, args = c(mean = subway_sugar_mean, sd = subway_sugar_sd), col = "tomato") +
  labs(title = "Sugar in Subway Food")
ggplot(data = subway, aes(x = protein)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5) +
  stat_function(fun = dnorm, args = c(mean = subway_protein_mean, sd = subway_protein_sd), col = "tomato") +
  labs(title = "Protein in Subway Food")
ggplot(data = taco_bell, aes(x = sugar)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1) +
  stat_function(fun = dnorm, args = c(mean = tb_sugar_mean, sd = tb_sugar_sd), col = "tomato") +
  labs(title = "Sugar in Taco Bell Food")
ggplot(data = taco_bell, aes(x = protein)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density)), binwidth = 3) +
  stat_function(fun = dnorm, args = c(mean = tb_protein_mean, sd = tb_protein_sd), col = "tomato") +
  labs(title = "Protein in Taco Bell Food")
ggplot(data = subway, aes(sample = sugar)) +
  geom_line(stat = "qq") +
  labs(title = "Sugar in Subway food, actual vs. theoretical")
ggplot(data = subway, aes(sample = protein)) +
  geom_line(stat = "qq") +
  labs(title = "Protein in Subway food, actual vs. theoretical")
ggplot(data = taco_bell, aes(sample = sugar)) +
  geom_line(stat = "qq") +
  labs(title = "Sugar in Taco Bell food, actual vs. theoretical")
ggplot(data = taco_bell, aes(sample = protein)) +
  geom_line(stat = "qq") +
  labs(title = "Protein in Taco Bell food, actual vs. theoretical")
The protein content in offerings from Subway and Taco Bell appears fairly close to normal. The sugar content in each is significantly skewed to the right.
What is the probability that a Subway item has more than 5 units of sugar?
# theoretical
theo <- 1 - pnorm(q = 5, mean = subway_sugar_mean, sd = subway_sugar_sd)
# empirical
emp <- subway %>%
  filter(sugar > 5) %>%
  summarise(percent = n() / nrow(subway))
theo
## [1] 0.8181227
emp
## # A tibble: 1 x 1
## percent
## <dbl>
## 1 0.812
The theoretical normal distribution gives a probability of 0.818, versus about 0.81 from the empirical data, so the normal model agrees closely with the data for this question.
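The same two probabilities can be computed a little more compactly. This is a sketch of an equivalent formulation, not a change to the method: `lower.tail = FALSE` asks `pnorm()` for the upper-tail area directly, and the mean of a logical vector gives the proportion of items meeting the condition. The names `theo_upper` and `emp_upper` are introduced here for illustration.

```r
library(tidyverse)
library(openintro)

subway <- fastfood %>%
  filter(restaurant == "Subway")

# Theoretical: area above 5 under a normal curve fitted to Subway sugar
theo_upper <- pnorm(q = 5, mean = mean(subway$sugar), sd = sd(subway$sugar),
                    lower.tail = FALSE)

# Empirical: TRUE counts as 1 and FALSE as 0, so the mean is the proportion
emp_upper <- mean(subway$sugar > 5)

c(theoretical = theo_upper, empirical = emp_upper)
```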
What is the probability that a Taco Bell item has fewer than 15 grams of protein?
# theoretical
theo <- pnorm(q = 15, mean = tb_protein_mean, sd = tb_protein_sd)
# empirical
emp <- taco_bell %>%
  filter(protein < 15) %>%
  summarise(percent = n() / nrow(taco_bell))
theo
## [1] 0.3673821
emp
## # A tibble: 1 x 1
## percent
## <dbl>
## 1 0.435
The theoretical model predicts a probability of 0.367, while the empirical data shows a probability of 0.435. Of the two questions, the Subway sugar question had the closer agreement between methods: a gap of about 0.006, versus about 0.07 for Taco Bell protein.
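To see which question had the closer agreement, the gap between the two methods can be computed for each. The sketch below copies the four probabilities from the printed output above (the empirical values are shown by the tibble printout to three significant digits, so the gaps are approximate); the vector names are illustrative.

```r
# Probabilities copied from the output above
subway_sugar <- c(theoretical = 0.8181227, empirical = 0.812)
tb_protein   <- c(theoretical = 0.3673821, empirical = 0.435)

# Absolute difference between the normal model and the observed proportion
gaps <- c(
  subway_sugar = abs(subway_sugar[["theoretical"]] - subway_sugar[["empirical"]]),
  tb_protein   = abs(tb_protein[["theoretical"]] - tb_protein[["empirical"]])
)

round(gaps, 3)

# Name of the question with the smaller gap
names(which.min(gaps))
```

Close agreement at a single cutoff does not by itself show that a variable is normal; a skewed distribution can still match the normal model at one particular threshold.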
Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?
fastfood %>%
  ggplot() +
  geom_histogram(mapping = aes(x = sodium)) +
  facet_wrap(vars(restaurant))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
arbys <- fastfood %>% filter(restaurant == "Arbys")
sonic <- fastfood %>% filter(restaurant == "Sonic")
chickfila <- fastfood %>% filter(restaurant == "Chick Fil-A")
burgerking <- fastfood %>% filter(restaurant == "Burger King")
ggplot(data = subway, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  labs(title = "Sodium in Subway food, actual vs. theoretical")
ggplot(data = dairy_queen, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  labs(title = "Sodium in Dairy Queen food, actual vs. theoretical")
ggplot(data = arbys, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  labs(title = "Sodium in Arby's food, actual vs. theoretical")
ggplot(data = taco_bell, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  labs(title = "Sodium in Taco Bell food, actual vs. theoretical")
ggplot(data = mcdonalds, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  labs(title = "Sodium in McDonald's food, actual vs. theoretical")
ggplot(data = chickfila, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  labs(title = "Sodium in Chick-Fil-A food, actual vs. theoretical")
ggplot(data = sonic, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  labs(title = "Sodium in Sonic food, actual vs. theoretical")
ggplot(data = burgerking, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  labs(title = "Sodium in Burger King food, actual vs. theoretical")
I would say that the distribution of sodium in Arby’s food is closest to normal.
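The eight separate Q-Q plots can also be drawn as one faceted figure, which makes side-by-side comparison easier. This is a sketch of an alternative layout, assuming the `fastfood` data as loaded above; `geom_qq_line()` adds a reference line that the original plots do not have, and `sodium_qq` is an illustrative name.

```r
library(tidyverse)
library(openintro)

# One panel per restaurant; free y scales because sodium ranges differ
sodium_qq <- ggplot(data = fastfood, aes(sample = sodium)) +
  geom_qq(size = 0.8) +
  geom_qq_line(colour = "tomato") +
  facet_wrap(vars(restaurant), scales = "free_y") +
  labs(
    title = "Sodium by restaurant, normal Q-Q plots",
    x = "Theoretical quantiles",
    y = "Sample quantiles"
  )

sodium_qq
```

The panel whose points stay closest to its reference line across the whole range is the best candidate for "closest to normal".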
Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?
I think it comes down to two things. First, ingredients like sodium, which are added to food for flavor, tend to be added in specific set amounts rather than along a continuum. Second, some menu items contain significantly more sodium than others: for example, french fries will have much more sodium than a burger, which will in turn have much more than a salad.
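Steps in a Q-Q plot appear when many items share exactly the same value, because tied values form a flat run. One informal way to check for this is to count distinct sodium values per restaurant. The sketch below also looks at how often sodium is a multiple of 10, under the assumption (not confirmed by the data documentation) that published values are rounded; `sodium_steps` and its column names are illustrative.

```r
library(tidyverse)
library(openintro)

sodium_steps <- fastfood %>%
  group_by(restaurant) %>%
  summarise(
    items                = n(),
    distinct_values      = n_distinct(sodium),     # fewer distinct values means more ties
    share_multiple_of_10 = mean(sodium %% 10 == 0) # assumption: rounding to the nearest 10
  )

sodium_steps
```

A restaurant with far fewer distinct values than items, or a share of multiples of 10 close to 1, would be consistent with the stepwise pattern coming from repeated or rounded values.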
As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
ggplot(data = subway, aes(sample = total_carb)) +
  geom_line(stat = "qq")
The data looks right-skewed.
subway_carb_mean <- mean(subway$total_carb)
subway_carb_sd <- sd(subway$total_carb)
subway_carb_mean
## [1] 54.71875
subway_carb_sd
## [1] 33.31436
subway %>%
  ggplot() +
  geom_histogram(mapping = aes(x = total_carb), bins = 15)
With a mean of ~54.7 and most of the data occurring around or below the mean, the data does appear somewhat right-skewed.
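A quick numeric companion to the histogram is to compare the mean with the median: in a right-skewed distribution the long upper tail usually pulls the mean above the median. The sketch below is an informal check rather than a formal skewness test, and `carb_center` is an illustrative name.

```r
library(tidyverse)
library(openintro)

subway <- fastfood %>%
  filter(restaurant == "Subway")

# Mean above median is a common informal sign of right skew
carb_center <- subway %>%
  summarise(
    mean_carb   = mean(total_carb),
    median_carb = median(total_carb)
  )

carb_center
```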
…