Loading data and packages

library(tidyverse)
library(openintro)

Data

data("fastfood", package='openintro')
head(fastfood)
## # A tibble: 6 × 17
##   restaurant item       calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Mcdonalds  Artisan G…      380      60         7       2       0            95
## 2 Mcdonalds  Single Ba…      840     410        45      17       1.5         130
## 3 Mcdonalds  Double Ba…     1130     600        67      27       3           220
## 4 Mcdonalds  Grilled B…      750     280        31      10       0.5         155
## 5 Mcdonalds  Crispy Ba…      920     410        45      12       0.5         120
## 6 Mcdonalds  Big Mac         540     250        28      10       1            80
## # ℹ 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>, sugar <dbl>,
## #   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>

Let’s first focus on just products from McDonalds and Dairy Queen.

mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

Exercise 1

Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

summary(mcdonalds$cal_fat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    50.0   160.0   240.0   285.6   320.0  1270.0
hist(mcdonalds$cal_fat)

summary(dairy_queen$cal_fat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   160.0   220.0   260.5   310.0   670.0
hist(dairy_queen$cal_fat)
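
The summaries printed above already support a comparison of center, spread, and shape. A minimal sketch in base R, with the numbers copied by hand from that output (the `summaries` data frame and its column names are only an illustration, not part of the lab):

```r
# Values copied from the two summary() outputs above.
summaries <- data.frame(
  restaurant = c("Mcdonalds", "Dairy Queen"),
  q1     = c(160.0, 160.0),
  median = c(240.0, 220.0),
  mean   = c(285.6, 260.5),
  q3     = c(320.0, 310.0),
  max    = c(1270.0, 670.0)
)

# Spread: interquartile range (Q3 - Q1).
summaries$iqr <- summaries$q3 - summaries$q1

# Shape: a mean above the median hints at a right skew.
summaries$mean_minus_median <- summaries$mean - summaries$median

summaries[, c("restaurant", "median", "iqr", "mean_minus_median", "max")]
```

The centers are similar (medians of 240 and 220) and so are the middle spreads (IQRs of 160 and 150). In both restaurants the mean is above the median, which is consistent with a right skew, and the McDonald's maximum of 1270 suggests a longer right tail.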

The normal distribution

dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)
ggplot(data = dairy_queen, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = after_stat(density))) +
        stat_function(fun = dnorm, args = list(mean = dqmean, sd = dqsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

I think the data do appear roughly normal, although the histogram has some gaps between bars and a small peak at the right end.

Evaluating the normal distribution

ggplot(data = dairy_queen, aes(sample = cal_fat)) + 
  geom_line(stat = "qq")

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)

Exercise 3

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)

ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_line(stat = "qq")

qqnormsim(sample = cal_fat, data = dairy_queen)
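
To make clearer what the grid from qqnormsim is showing, here is a rough base-R analogue: the Q-Q plot of the data next to Q-Q plots of eight normal samples with the same size, mean, and standard deviation. The helper name `qq_sim_grid` and the stand-in data are my own assumptions for illustration; this is a sketch, not the openintro implementation.

```r
# Hypothetical helper: Q-Q plot of x plus 8 simulated normal samples
# that share x's length, mean, and standard deviation.
qq_sim_grid <- function(x, n_sim = 8) {
  x <- x[!is.na(x)]
  sims <- replicate(n_sim, rnorm(length(x), mean = mean(x), sd = sd(x)))

  # 3 x 3 layout: one panel for the data, eight for the simulations.
  old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))

  qqnorm(x, main = "data")
  qqline(x)
  for (i in seq_len(n_sim)) {
    qqnorm(sims[, i], main = paste("sim", i))
    qqline(sims[, i])
  }
  invisible(sims)  # one column per simulated sample
}

set.seed(1)
demo <- rexp(40, rate = 1 / 260)  # made-up stand-in data, not the lab data
sims <- qq_sim_grid(demo)
# With the lab data: qq_sim_grid(dairy_queen$cal_fat)
```

If the "data" panel bends more than any of the simulated panels, that is evidence against normality; if it blends in with them, the data are plausibly nearly normal.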

Exercise 4

Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?

At first glance the plots look very similar. Looking closely at the simulations, we can see some variation from one to the next, but the plot for the real data falls within that variation, so the calories from fat appear nearly normal.

Exercise 5

Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

Looking at the McDonald's plot, we can see that the distribution has a right skew.

ggplot(data = mcdonalds, aes(sample = calories)) + 
  geom_line(stat = "qq")

Normal probabilities

1 - pnorm(q = 600, mean = dqmean, sd = dqsd)
## [1] 0.01501523
dairy_queen %>% 
  filter(cal_fat > 600) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1  0.0476
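
The two calculations above can be wrapped into one small function so the theoretical and empirical answers sit side by side. This is a sketch in base R; `tail_probabilities` is a hypothetical helper name and `toy` is made-up data used only to show the call.

```r
# Hypothetical helper: P(X > q) from a fitted normal vs. the observed share.
tail_probabilities <- function(x, q) {
  x <- x[!is.na(x)]
  c(
    # Normal model with the sample mean and sd plugged in.
    theoretical = 1 - pnorm(q, mean = mean(x), sd = sd(x)),
    # Proportion of observations strictly above q.
    empirical   = mean(x > q)
  )
}

toy <- c(1, 2, 3, 4, 5)  # made-up values
tail_probabilities(toy, q = 4)
# With the lab data: tail_probabilities(dairy_queen$cal_fat, q = 600)
```

Here `mean(x > q)` does the same job as the `filter()` and `summarise()` steps: it is the share of rows above the cutoff.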

Exercise 6

Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

Question 1: What is the probability that a random item from any of the fast food restaurants has less than 500 calories?

Recommended amount of calories per meal: breakfast, 300-400 calories; lunch or dinner, 400-500 calories.

fastf_mean <- mean(fastfood$calories)
fastf_sd <- sd(fastfood$calories)

pnorm(q = 500, mean = fastf_mean, sd = fastf_sd)
## [1] 0.4564228
fastfood %>% 
  filter(calories < 500) %>%
  summarise(percent = n() / nrow(fastfood))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1   0.513

Question 2: What is the probability that a random item from any of the restaurants has less than 13 grams of saturated fat?

The American Heart Association recommends aiming for a dietary pattern that achieves 5% to 6% of calories from saturated fat. For example, if you need about 2,000 calories a day, no more than 120 of them should come from saturated fat. That’s about 13 grams of saturated fat per day.

ffs_mean <- mean(fastfood$sat_fat)
ffs_sd <- sd(fastfood$sat_fat)

pnorm(q = 13, mean = ffs_mean, sd = ffs_sd)
## [1] 0.7748942
fastfood %>% 
  filter(sat_fat < 13) %>%
  summarise(percent = n() / nrow(fastfood))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1   0.837
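
To see which question had the closer agreement, the four probabilities printed above can be compared directly. A minimal sketch, with the values copied by hand from the output (the `results` data frame is only an illustration):

```r
# Probabilities copied from the pnorm() and summarise() output above.
results <- data.frame(
  question    = c("calories < 500", "sat_fat < 13"),
  theoretical = c(0.4564228, 0.7748942),
  empirical   = c(0.513, 0.837)
)

# Absolute gap between the normal model and the observed proportion.
results$abs_diff <- abs(results$theoretical - results$empirical)

# Smallest gap first.
results[order(results$abs_diff), ]
```

The gap is about 0.057 for calories and about 0.062 for saturated fat, so the calories question shows slightly closer agreement. The empirical values are rounded to three digits in the printed output, so the difference between the two gaps should be read as small.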

Exercise 7

Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

Arbys

Arbys <- fastfood %>%
  filter(restaurant == "Arbys")

qqnorm(Arbys$sodium, main = "Arbys")

Burger King

Burgerk <- fastfood %>%
  filter(restaurant == "Burger King")

qqnorm(Burgerk$sodium, main = "Burger King")

Chick Fil-A

ChickF <- fastfood %>%
  filter(restaurant == "Chick Fil-A")

qqnorm(ChickF$sodium, main = "Chick Fil-A")

Dairy Queen

DairyQ <- fastfood %>%
  filter(restaurant == "Dairy Queen")

qqnorm(DairyQ$sodium, main = "Dairy Queen")

McDonalds

McD <- fastfood %>%
  filter(restaurant == "Mcdonalds")

qqnorm(McD$sodium, main = "McDonald's")

Sonic

Sonic <- fastfood %>%
  filter(restaurant == "Sonic")

qqnorm(Sonic$sodium, main = "Sonic")

Subway

Subway <- fastfood %>%
  filter(restaurant == "Subway")

qqnorm(Subway$sodium, main = "Subway")

Taco Bell

TacoB <- fastfood %>%
  filter(restaurant == "Taco Bell")

qqnorm(TacoB$sodium, main = "Taco Bell")

Of these plots, Burger King's sodium values appear closest to a normal distribution.
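
Judging eight plots by eye is subjective, so one possible cross-check is the correlation between the theoretical and sample quantiles of each Q-Q plot: the closer to 1, the straighter the plot. This is a sketch of that idea, not the method the lab prescribes; `qq_correlation` is a hypothetical helper and the two demo groups are simulated stand-ins.

```r
# Hypothetical helper: straightness of a normal Q-Q plot,
# measured as the correlation between its x and y coordinates.
qq_correlation <- function(x) {
  x <- x[!is.na(x)]
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

set.seed(1)
demo_groups <- list(
  normal_like = rnorm(200, mean = 1200, sd = 400),  # simulated
  skewed      = rexp(200, rate = 1 / 1200)          # simulated
)

# One value per group, highest (most normal-looking) first.
scores <- sort(sapply(demo_groups, qq_correlation), decreasing = TRUE)
scores
# With the lab data:
# sort(sapply(split(fastfood$sodium, fastfood$restaurant), qq_correlation),
#      decreasing = TRUE)
```

A single number cannot replace looking at the plots, especially for restaurants with few items, but it gives a consistent way to order them.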

Exercise 8

Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

I think it is because multiple items on the menu share the same sodium value, so the tied values show up as flat steps in the plot.
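
The stepwise pattern can be reproduced with simulated data: when values are rounded, many observations tie and the Q-Q plot turns into flat steps. In this sketch the "sodium" values are made up, and rounding to the nearest 100 mg is an assumption chosen to make the effect easy to see.

```r
set.seed(42)
raw <- rnorm(60, mean = 1200, sd = 400)  # made-up continuous values
rounded <- round(raw / 100) * 100        # assumed rounding to nearest 100

# Rounding collapses many distinct values into a few repeated ones.
n_unique_raw     <- length(unique(raw))
n_unique_rounded <- length(unique(rounded))
c(raw = n_unique_raw, rounded = n_unique_rounded)

# Side by side: smooth curve vs. staircase.
old_par <- par(mfrow = c(1, 2))
qqnorm(raw, main = "Raw values")
qqnorm(rounded, main = "Rounded values")
par(old_par)
```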

Exercise 9

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

Subway.

I picked Subway because I thought that it would be the “healthy” option, but I can see that the distribution is asymmetric, with departures from the line at both ends. I think the menu is the main reason why the distribution looks different.

hist(Subway$total_carb)

qqnorm(Subway$total_carb, main="Subway")
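
As a numeric companion to the histogram and the normal probability plot, a sample skewness can be computed: positive values point to a right skew, negative values to a left skew, and values near zero to rough symmetry. This is a sketch using one common moment-based formula; `sample_skewness` is a hypothetical helper and the toy vector is made up.

```r
# Hypothetical helper: third central moment divided by sd cubed.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

toy_right <- c(10, 12, 13, 15, 18, 22, 40, 75)  # made-up, long right tail

skew_right <- sample_skewness(toy_right)   # positive
skew_left  <- sample_skewness(-toy_right)  # mirrored data, negative
c(right = skew_right, left = skew_left)
# With the lab data: sample_skewness(Subway$total_carb)
```

On the normal probability plot, a right skew usually shows up as points curving upward at the right end, and a left skew as points curving downward at the left end.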

