R Markdown

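library(tidyverse)   # provides %>%, filter(), summarise(), and ggplot()
library(openintro)   # provides the fastfood data and qqnormsim()
data("fastfood", package = "openintro")
head(fastfood)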
## # A tibble: 6 × 17
##   restaur…¹ item  calor…² cal_fat total…³ sat_fat trans…⁴ chole…⁵ sodium total…⁶
##   <chr>     <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
## 1 Mcdonalds Arti…     380      60       7       2     0        95   1110      44
## 2 Mcdonalds Sing…     840     410      45      17     1.5     130   1580      62
## 3 Mcdonalds Doub…    1130     600      67      27     3       220   1920      63
## 4 Mcdonalds Gril…     750     280      31      10     0.5     155   1940      62
## 5 Mcdonalds Cris…     920     410      45      12     0.5     120   1980      81
## 6 Mcdonalds Big …     540     250      28      10     1        80    950      46
## # … with 7 more variables: fiber <dbl>, sugar <dbl>, protein <dbl>,
## #   vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>, and abbreviated
## #   variable names ¹​restaurant, ²​calories, ³​total_fat, ⁴​trans_fat,
## #   ⁵​cholesterol, ⁶​total_carb
#print(n=150,fastfood)

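# Subsets for the two restaurants compared below. This chunk is reconstructed:
# the original was not echoed, but the printed tibbles and later code imply it.
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")
mcdonalds
dairy_queen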

## # A tibble: 57 × 17
##    restau…¹ item  calor…² cal_fat total…³ sat_fat trans…⁴ chole…⁵ sodium total…⁶
##    <chr>    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
##  1 Mcdonal… Arti…     380      60       7       2     0        95   1110      44
##  2 Mcdonal… Sing…     840     410      45      17     1.5     130   1580      62
##  3 Mcdonal… Doub…    1130     600      67      27     3       220   1920      63
##  4 Mcdonal… Gril…     750     280      31      10     0.5     155   1940      62
##  5 Mcdonal… Cris…     920     410      45      12     0.5     120   1980      81
##  6 Mcdonal… Big …     540     250      28      10     1        80    950      46
##  7 Mcdonal… Chee…     300     100      12       5     0.5      40    680      33
##  8 Mcdonal… Clas…     510     210      24       4     0        65   1040      49
##  9 Mcdonal… Doub…     430     190      21      11     1        85   1040      35
## 10 Mcdonal… Doub…     770     400      45      21     2.5     175   1290      42
## # … with 47 more rows, 7 more variables: fiber <dbl>, sugar <dbl>,
## #   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>, and
## #   abbreviated variable names ¹​restaurant, ²​calories, ³​total_fat, ⁴​trans_fat,
## #   ⁵​cholesterol, ⁶​total_carb
## # A tibble: 42 × 17
##    restau…¹ item  calor…² cal_fat total…³ sat_fat trans…⁴ chole…⁵ sodium total…⁶
##    <chr>    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
##  1 Dairy Q… 1/2 …    1000     660      74      26       2     170   1610      40
##  2 Dairy Q… 1/2 …     800     460      51      20       2     135   1280      44
##  3 Dairy Q… 1/4 …     630     330      37      13       1      95   1250      44
##  4 Dairy Q… 1/4 …     540     270      30      11       1      70   1020      44
##  5 Dairy Q… 1/4 …     570     310      35      11       1      75    820      39
##  6 Dairy Q… Orig…     400     160      18       9       1      65    930      34
##  7 Dairy Q… Orig…     630     310      34      18       2     125   1240      34
##  8 Dairy Q… 4 Pi…    1030     480      53       9       1      80   2780     105
##  9 Dairy Q… 6 Pi…    1260     590      66      11       1     120   3500     121
## 10 Dairy Q… Baco…     420     240      26      11       1      60   1140      26
## # … with 32 more rows, 7 more variables: fiber <dbl>, sugar <dbl>,
## #   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>, and
## #   abbreviated variable names ¹​restaurant, ²​calories, ³​total_fat, ⁴​trans_fat,
## #   ⁵​cholesterol, ⁶​total_carb

Exercise 1: Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

hist(mcdonalds$cal_fat)

hist(dairy_queen$cal_fat)

summary(mcdonalds$cal_fat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    50.0   160.0   240.0   285.6   320.0  1270.0
summary(dairy_queen$cal_fat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   160.0   220.0   260.5   310.0   670.0

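Both distributions are right skewed. McDonald's has the longer right tail: its maximum is 1270 calories from fat, against 670 for Dairy Queen, and its mean (285.6) sits well above its median (240). Dairy Queen's histogram looks more irregular, but its mean (260.5) also exceeds its median (220). The centers are similar, and the middle halves of the two menus cover nearly the same range (160 to 320 for McDonald's, 160 to 310 for Dairy Queen), so the extra spread at McDonald's comes mainly from a few very high items. Note that the two histograms use different x-axis scales: the McDonald's axis is marked every 200 calories and the Dairy Queen axis every 100. Both distributions are unimodal and roughly bell-shaped in the middle, but the right tails keep them from being symmetric.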
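One quick numeric check of the skew and spread described above is to compare each mean with its median and to compute the interquartile range. The sketch below only reuses the values printed by `summary()`; the vector names `mcd` and `dq` are illustrative and not part of the original analysis.

```r
# Summary values copied from the summary() output above.
# The names `mcd` and `dq` are illustrative, not from the original analysis.
mcd <- c(q1 = 160, median = 240, mean = 285.6, q3 = 320, max = 1270)
dq  <- c(q1 = 160, median = 220, mean = 260.5, q3 = 310, max = 670)

# A mean above the median points to a right skew.
mean_minus_median <- c(
  mcdonalds   = unname(mcd["mean"] - mcd["median"]),
  dairy_queen = unname(dq["mean"] - dq["median"])
)

# The interquartile range measures the spread of the middle half of the items.
iqr <- c(
  mcdonalds   = unname(mcd["q3"] - mcd["q1"]),
  dairy_queen = unname(dq["q3"] - dq["q1"])
)

print(mean_minus_median)
print(iqr)
```

Both differences are positive, which is consistent with a right skew at each restaurant, while the two interquartile ranges are close to each other.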

dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)

ggplot(data = dairy_queen, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = ..density..)) +
        stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 2: Based on this plot, does it appear that the data follow a nearly normal distribution?

Yes. With the normal curve overlaid in red, the histogram looks nearly normal, with little visible skew to the left or right.
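The histogram above is drawn on the density scale so that its bars and the normal curve share the same units: the bar areas add up to 1, just like the area under the curve. A minimal base R sketch of that property follows. The data are simulated stand-ins (the mean and standard deviation are placeholders, not the Dairy Queen values), and `hist()` is used here in place of `geom_histogram()` to keep the example dependency-free.

```r
set.seed(10)                                  # arbitrary seed for reproducibility
x <- rnorm(42, mean = 260, sd = 150)          # hypothetical stand-in for cal_fat

h <- hist(x, breaks = 10, plot = FALSE)       # compute the bins without drawing
bar_area <- sum(h$density * diff(h$breaks))   # total area of the density-scaled bars
print(bar_area)
```

The `stat_bin()` message in the output above can be avoided by choosing the bins explicitly, for example by passing a `binwidth` argument to `geom_histogram()`; a suitable width depends on the data.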

Exercise 3: Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)

According to the National Institute of Standards and Technology, "The normal probability plot (Chambers et al., 1983) is a graphical technique for assessing whether or not a data set is approximately normally distributed."

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)

qqnorm(sim_norm)
qqline(sim_norm)

No. About three points stray from the line, so not every point falls on it, but the majority of the points fall on or very close to the line.
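Even data drawn from an exact normal distribution do not sit perfectly on the line, which is why a few stray points are expected. One way to see how much scatter is normal is to repeat the simulation and summarize each plot with the correlation between theoretical and sample quantiles. This is a rough sketch, not part of the lab: the helper name `qq_cor` is hypothetical, and standard normal draws are used because this correlation does not depend on the mean or standard deviation.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
n <- 42        # number of Dairy Queen items shown above

# Correlation between theoretical and sample quantiles for one sample.
# plot.it = FALSE returns the plot coordinates without drawing anything.
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

# Five simulated normal samples of the same size as the real data.
cors <- replicate(5, qq_cor(rnorm(n)))
print(round(cors, 3))
```

The correlations are high but below 1 and differ from sample to sample, so small departures from the line in a single plot are not evidence against normality.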

qqnormsim(sample = cal_fat, data = dairy_queen)

Exercise 4: Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?

Yes, the probability plot for the real data looks similar to the plots of the simulated data shown directly above.

Exercise 5: Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

qqnormsim(sample = cal_fat, data = mcdonalds)

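Yes, for most of the menu: according to the plots above, the McDonald's calories from fat follow the simulated normal plots reasonably well. The exception is the upper tail, where a few very high values (the maximum is 1270) pull the distribution to the right.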

Exercise 6: Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

What is the probability that a randomly chosen McDonald's product has more than 800 calories from fat?

McDmean <- mean(mcdonalds$cal_fat)
McDsd   <- sd(mcdonalds$cal_fat)

print(McDmean)
## [1] 285.614
print(McDsd)
## [1] 220.8993
1 - pnorm(q = 800, mean = McDmean, sd = McDsd)
## [1] 0.009940144
mcdonalds %>% 
  filter(cal_fat > 800) %>%
  summarise(percent = n() / nrow(mcdonalds))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1  0.0351
ggplot(data = mcdonalds, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = ..density..)) +
        stat_function(fun = dnorm, args = c(mean = McDmean, sd = McDsd), col = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
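The two McDonald's results above can be placed side by side using only the printed numbers. The sketch below recomputes the normal upper-tail probability from the printed mean and standard deviation and recovers the empirical count from the printed proportion; the variable names are illustrative.

```r
# Values printed above for McDonald's calories from fat.
mcd_mean <- 285.614
mcd_sd   <- 220.8993
n_items  <- 57                      # rows in the McDonald's tibble

# Theoretical upper-tail probability under the normal model.
p_normal <- pnorm(800, mean = mcd_mean, sd = mcd_sd, lower.tail = FALSE)

# Empirical proportion: 0.0351 of 57 items corresponds to 2 items above 800.
n_above     <- round(0.0351 * n_items)
p_empirical <- n_above / n_items

# How many times larger the empirical proportion is than the normal estimate.
ratio <- p_empirical / p_normal
print(c(normal = p_normal, empirical = p_empirical, ratio = ratio))
```

The empirical proportion is several times the normal estimate, so for this question the normal model understates how often very high values occur, which fits the long right tail of the McDonald's data.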

What is the probability that a randomly chosen product from Chick Fil-A has less than 400 calories from fat?

chickfilA <- fastfood %>%
  filter(restaurant == "Chick Fil-A")
head(chickfilA)
## # A tibble: 6 × 17
##   restaur…¹ item  calor…² cal_fat total…³ sat_fat trans…⁴ chole…⁵ sodium total…⁶
##   <chr>     <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
## 1 Chick Fi… Char…     430     144      16     8         0      85   1120      37
## 2 Chick Fi… Char…     310      54       6     2         0      55    820      36
## 3 Chick Fi… Chic…     270      99      11     2.5       0      45    800      26
## 4 Chick Fi… 1 Pi…     120      54       6     3         0      25    320       6
## 5 Chick Fi… 2 Pi…     230     108      12     3         0      55    630      13
## 6 Chick Fi… 3 Pi…     350     153      17     3         0      70    940      22
## # … with 7 more variables: fiber <dbl>, sugar <dbl>, protein <dbl>,
## #   vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>, and abbreviated
## #   variable names ¹​restaurant, ²​calories, ³​total_fat, ⁴​trans_fat,
## #   ⁵​cholesterol, ⁶​total_carb
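chickMean <- mean(chickfilA$cal_fat)
chickSD   <- sd(chickfilA$cal_fat)
# Theoretical probability under the normal model (lower tail: less than 400)
pnorm(q = 400, mean = chickMean, sd = chickSD)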

chickfilA %>% 
  filter(cal_fat < 400) %>%
  summarise(percent = n() / nrow(chickfilA))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1   0.926
# Plot the Chick Fil-A data, so the histogram matches the overlaid normal curve
ggplot(data = chickfilA, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = ..density..)) +
        stat_function(fun = dnorm, args = c(mean = chickMean, sd = chickSD), col = "purple")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 7: Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

unique(fastfood$restaurant)
## [1] "Mcdonalds"   "Chick Fil-A" "Sonic"       "Arbys"       "Burger King"
## [6] "Dairy Queen" "Subway"      "Taco Bell"
mac <- fastfood %>%
  filter(restaurant == "Mcdonalds")

qqnorm(mac$sodium, main = "Mcdonalds")
qqline(mac$sodium)

#------------------------
chick <- fastfood %>%
  filter(restaurant == "Chick Fil-A")

qqnorm(chick$sodium, main = "Chick Fil-A")
qqline(chick$sodium)

#------------------------

sonic <- fastfood %>%
  filter(restaurant == "Sonic")

qqnorm(sonic$sodium, main = "Sonic")
qqline(sonic$sodium)

#------------------------

arbys <- fastfood %>%
  filter(restaurant == "Arbys")

qqnorm(arbys$sodium, main = "Arbys")
qqline(arbys$sodium)

#------------------------
burger <- fastfood %>%
  filter(restaurant == "Burger King")

qqnorm(burger$sodium, main = "Burger King")
qqline(burger$sodium)

#------------------------
dq <- fastfood %>%
  filter(restaurant == "Dairy Queen")

qqnorm(dq$sodium, main = "Dairy Queen")
qqline(dq$sodium)

#------------------------
taco <- fastfood %>%
  filter(restaurant == "Taco Bell")

qqnorm(taco$sodium, main = "Taco Bell")
qqline(taco$sodium)

#------------------------
subway <- fastfood %>%
  filter(restaurant == "Subway")

qqnorm(subway$sodium, main = "Subway")
qqline(subway$sodium)

I think the Chick Fil-A distribution is the closest to normal because its plot has the fewest points that stray from the line.
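The eight blocks above repeat the same steps for each restaurant. A more compact sketch splits the values by group and applies one function to every group. To make the comparison less dependent on eyeballing outliers, it also reports the Shapiro-Wilk W statistic from `shapiro.test()`, where values closer to 1 indicate a shape closer to normal; this is a swapped-in numeric summary, not the method used above. The data frame `toy` and its groups are simulated stand-ins for `fastfood`.

```r
set.seed(1)   # arbitrary seed for reproducibility

# Hypothetical stand-in for fastfood: three groups with different shapes.
toy <- data.frame(
  restaurant = rep(c("A", "B", "C"), each = 40),
  sodium     = c(rnorm(40, mean = 1200, sd = 300),   # roughly normal
                 rexp(40, rate = 1 / 1200),          # right skewed
                 runif(40, min = 500, max = 2000))   # flat
)

# One vector of sodium values per restaurant.
by_restaurant <- split(toy$sodium, toy$restaurant)

# Shapiro-Wilk W statistic for each group.
w_stat <- vapply(by_restaurant,
                 function(x) unname(shapiro.test(x)$statistic),
                 numeric(1))

print(sort(w_stat, decreasing = TRUE))
```

With the real data, the same pattern would start from `split(fastfood$sodium, fastfood$restaurant)`, and a call to `qqnorm()` and `qqline()` inside the function would draw each plot.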

Exercise 8: Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

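The steps most likely come from repeated values. Sodium appears to be recorded in rounded amounts (every sodium value printed above is a multiple of 10), so several items can share exactly the same value. Tied values plot at the same height, which produces short horizontal runs followed by a jump to the next recorded value.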
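The effect of rounding on a normal probability plot can be illustrated with simulated values. In the sketch below the sodium values are hypothetical, and they are rounded to the nearest 100 rather than the nearest 10 to exaggerate the effect.

```r
set.seed(7)   # arbitrary seed for reproducibility

# Hypothetical unrounded sodium values.
sodium_exact <- rnorm(50, mean = 1200, sd = 150)

# Coarse rounding (nearest 100) to exaggerate the effect of recorded precision.
sodium_rounded <- round(sodium_exact / 100) * 100

# Rounding collapses many distinct values into a few repeated ones.
n_distinct_exact   <- length(unique(sodium_exact))
n_distinct_rounded <- length(unique(sodium_rounded))
print(c(exact = n_distinct_exact, rounded = n_distinct_rounded))

# Each count greater than 1 becomes a horizontal step in qqnorm(sodium_rounded).
print(table(sodium_rounded))
```

Calling `qqnorm(sodium_exact)` and `qqnorm(sodium_rounded)` one after the other shows a smooth cloud of points in the first plot and a staircase in the second.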

Exercise 9: As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

colnames(fastfood)
##  [1] "restaurant"  "item"        "calories"    "cal_fat"     "total_fat"  
##  [6] "sat_fat"     "trans_fat"   "cholesterol" "sodium"      "total_carb" 
## [11] "fiber"       "sugar"       "protein"     "vit_a"       "vit_c"      
## [16] "calcium"     "salad"
carbmean <- mean(mcdonalds$total_carb)
carbsd   <- sd(mcdonalds$total_carb)

ggplot(data = mcdonalds, aes(x = total_carb)) +
        geom_blank() +
        geom_histogram(aes(y = ..density..)) +
        stat_function(fun = dnorm, args = c(mean = carbmean, sd = carbsd), col = "green")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

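# Normal probability plot requested by the exercise
qqnorm(mcdonalds$total_carb)
qqline(mcdonalds$total_carb)

# Histogram to confirm the skew
hist(mcdonalds$total_carb)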

Total carbohydrates at McDonald's is clearly right skewed: the histogram has a longer tail toward high values.
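On a normal probability plot, a right skew shows up as points that curve upward away from the reference line at both ends. The sketch below checks this numerically on simulated right-skewed data. The values are hypothetical (exponential draws, not the McDonald's data), and the reference line is rebuilt through the first and third quartiles, which is what `qqline()` uses by default.

```r
set.seed(3)                       # arbitrary seed for reproducibility
x <- rexp(200, rate = 1 / 40)     # hypothetical right-skewed values

# Coordinates of the normal probability plot, without drawing it.
qq <- qqnorm(x, plot.it = FALSE)

# Reference line through the first and third quartiles.
slope     <- unname(diff(quantile(x, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75))))
intercept <- unname(quantile(x, 0.25)) - slope * qnorm(0.25)

# Vertical distance of each point from the line.
resid <- qq$y - (intercept + slope * qq$x)

# Distances for the smallest and largest observations.
tails <- c(lower = resid[which.min(qq$x)], upper = resid[which.max(qq$x)])
print(tails)

# A mean above the median is a second sign of right skew.
print(c(mean = mean(x), median = median(x)))
```

Both tail points lie above the line, giving the upward-bending shape. A left-skewed sample would bend the other way, and a symmetric one would stay close to the line at both ends.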