The normal distribution

library(tidyverse)
library(openintro)

data("fastfood", package='openintro')
head(fastfood)

## # A tibble: 6 × 17
##   restaurant item       calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Mcdonalds  Artisan G…      380      60         7       2       0            95
## 2 Mcdonalds  Single Ba…      840     410        45      17       1.5         130
## 3 Mcdonalds  Double Ba…     1130     600        67      27       3           220
## 4 Mcdonalds  Grilled B…      750     280        31      10       0.5         155
## 5 Mcdonalds  Crispy Ba…      920     410        45      12       0.5         120
## 6 Mcdonalds  Big Mac         540     250        28      10       1            80
## # ℹ 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>, sugar <dbl>,
## #   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>

Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

summary(mcdonalds$cal_fat)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    50.0   160.0   240.0   285.6   320.0  1270.0

hist(mcdonalds$cal_fat)

summary(dairy_queen$cal_fat)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   160.0   220.0   260.5   310.0   670.0

hist(dairy_queen$cal_fat)

Comparing the two histograms, McDonald’s has more items that are very high in calories from fat than Dairy Queen. The summaries agree: no Dairy Queen item exceeds 670 calories from fat, while McDonald’s has at least four items above 700 and a maximum of 1,270. The centers are similar (medians of 240 and 220, means of 285.6 and 260.5). The McDonald’s distribution is strongly right skewed: most items fall between 0 and 400 calories from fat, with a long tail toward high values. The Dairy Queen distribution is more compact and closer to bell shaped, although it is also slightly right skewed, with somewhat more items below the mean than above it.
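One quick numeric check on the direction of skew is to compare the mean with the median. The sketch below only reuses the mean and median values printed by summary() above; the variable names are my own. A mean that sits above the median suggests a longer right tail.

```r
# Mean and median copied from the summary() output above.
mc <- c(median = 240.0, mean = 285.6)
dq <- c(median = 220.0, mean = 260.5)

# A positive difference (mean above median) points to right skew.
mean_minus_median <- c(
  mcdonalds   = unname(mc["mean"] - mc["median"]),
  dairy_queen = unname(dq["mean"] - dq["median"])
)
mean_minus_median
```

Both differences are positive, which is consistent with a right tail in each histogram, and the gap is larger for McDonald’s.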

Based on this plot, does it appear that the data follow a nearly normal distribution?

Based on this plot, the data look nearly normal, but as noted in Exercise 1, the distribution is slightly right skewed.

dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)
ggplot(data = dairy_queen, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = after_stat(density))) +
        stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
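The histogram is drawn on the density scale so that it can share a y-axis with the normal curve: the bar areas then sum to 1, just as the area under the curve does. Here is a minimal base R sketch of the same overlay. The data are simulated stand-ins (not the Dairy Queen values), and the names cal, m, and s are hypothetical.

```r
set.seed(7)                                 # arbitrary seed, for reproducibility only
cal <- rgamma(42, shape = 3, scale = 90)    # hypothetical stand-in for cal_fat
m <- mean(cal)
s <- sd(cal)

# freq = FALSE puts the histogram on the density scale.
h <- hist(cal, freq = FALSE, main = "Density histogram", xlab = "cal_fat (simulated)")

# Overlay the normal curve with the same mean and standard deviation.
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "tomato")

# On the density scale the bar areas add up to 1.
area <- sum(h$density * diff(h$breaks))
area
```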

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)

Not all of the points fall exactly on a straight line, but the simulated data follow the line more closely than the real data do, so I would say that sim_norm is closer to being normally distributed than the real data.

ggplot(data = dairy_queen, aes(sample = cal_fat)) + 
  geom_line(stat = "qq")

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_line(stat = "qq")
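To make clear what a normal probability plot shows, the sketch below builds one by hand in base R: the sorted sample is plotted against the standard normal quantiles at evenly spaced probabilities. The data are simulated, and the mean, standard deviation, and sample size are hypothetical stand-ins rather than the Dairy Queen values.

```r
set.seed(1)                              # arbitrary seed, for reproducibility only
x <- rnorm(50, mean = 260, sd = 150)     # hypothetical sample

n <- length(x)
theoretical <- qnorm(ppoints(n))         # standard normal quantiles
observed    <- sort(x)                   # ordered sample values

plot(theoretical, observed,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")

# Base R's qqnorm() computes the same coordinates.
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
```

If the sample is close to normal, the points fall near a straight line whose intercept and slope are roughly the sample mean and standard deviation.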

Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?

The normal probability plot for calories from fat looks similar to the plots for the simulated data; the differences appear mainly at the right end of the plot. I believe the plots provide evidence that the calories from fat are nearly normal.

qqnormsim(sample = cal_fat, data = dairy_queen)
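As I understand it, qqnormsim() draws the plot for the real data next to several plots of simulated normal samples with the same mean, standard deviation, and size. The base R sketch below imitates that idea under that assumption. It is not the openintro implementation, and the data are a hypothetical right-skewed stand-in.

```r
set.seed(2)                                # arbitrary seed, for reproducibility only
x <- rgamma(60, shape = 2, scale = 130)    # hypothetical right-skewed sample
m <- mean(x)
s <- sd(x)

# Eight simulated normal samples matching the data's mean, sd, and size.
sims <- replicate(8, rnorm(length(x), mean = m, sd = s), simplify = FALSE)

old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
qqnorm(x, main = "data")
qqline(x)
for (i in seq_along(sims)) {
  qqnorm(sims[[i]], main = paste("sim", i))
  qqline(sims[[i]])
}
par(old_par)
```

If the panel for the real data is hard to tell apart from the simulated panels, the normal model is plausible; a panel that clearly stands out is evidence against it.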

Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

qqnormsim(sample = cal_fat, data = mcdonalds)

Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

#1 What is the probability that a randomly chosen item has fewer than 300 calories from fat at McDonald’s and at Dairy Queen?

Theoretical calculation for McDonald’s

mcmean <- mean(mcdonalds$cal_fat)
mcsd <- sd(mcdonalds$cal_fat)
# pnorm() returns the lower-tail probability P(X < 300) directly;
# 1 - pnorm() would give P(X > 300) instead.
pnorm(q = 300, mean = mcmean, sd = mcsd)

## [1] 0.5259626

Empirical calculation using the mcdonalds data set

mcdonalds %>%
  filter(cal_fat < 300) %>%
  summarise(percent = n() / nrow(mcdonalds))

## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1   0.632

Under the normal model there is a 52.6 percent chance, and empirically a 63.2 percent chance, that a randomly chosen McDonald’s item has fewer than 300 calories from fat. The two methods differ by about 10.6 percentage points.
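Because the direction of the tail is easy to get wrong, here is a small sketch of how the lower-tail, upper-tail, and empirical probabilities relate. The sample is simulated and all variable names are hypothetical; only pnorm() and its lower.tail argument come from base R.

```r
set.seed(3)                                # arbitrary seed, for reproducibility only
x <- rgamma(57, shape = 2, scale = 140)    # hypothetical stand-in for cal_fat
m <- mean(x)
s <- sd(x)

p_below_theory <- pnorm(300, mean = m, sd = s)        # P(X < 300)
p_above_theory <- 1 - pnorm(300, mean = m, sd = s)    # P(X > 300)

# The upper tail can also be requested directly.
p_above_direct <- pnorm(300, mean = m, sd = s, lower.tail = FALSE)

# Empirical version: the proportion of observations below 300.
p_below_empirical <- mean(x < 300)

c(theory = p_below_theory, empirical = p_below_empirical)
```

A question phrased as "fewer than" calls for the lower tail, so pnorm() is used on its own; the complement answers "more than".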

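Theoretical calculation for Dairy Queen

# Lower-tail probability P(X < 300), matching the question as asked
pnorm(q = 300, mean = dqmean, sd = dqsd)

## [1] 0.5997007

Empirical calculation using the dairy_queen data set

dairy_queen %>%
  filter(cal_fat < 300) %>%
  summarise(percent = n() / nrow(dairy_queen))

## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1   0.667

Under the normal model there is a 60.0 percent chance, and empirically a 66.7 percent chance, that a randomly chosen Dairy Queen item has fewer than 300 calories from fat. The two methods differ by about 6.7 percentage points. For McDonald’s the corresponding gap is about 10.6 points (63.2 percent empirical versus a normal-model lower-tail probability of 1 - 0.474 = 52.6 percent), so Dairy Queen shows the closer agreement between the two methods, by about 4 percentage points.

More Practice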

Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

Based on the plots below, Burger King, Taco Bell, and Subway had the sodium distributions closest to normal.

McDonald’s

mcd <- fastfood %>%
  filter(restaurant == "Mcdonalds")

qqnormsim(sample = sodium, data = mcd)

Dairy Queen

dq <- fastfood %>%
  filter(restaurant == "Dairy Queen")

qqnormsim(sample = sodium, data = dq)

Sonic

sc <- fastfood %>%
  filter(restaurant == "Sonic")

qqnormsim(sample = sodium, data = sc)

Subway

sw <- fastfood %>%
  filter(restaurant == "Subway")

qqnormsim(sample = sodium, data = sw)

Taco Bell

tb <- fastfood %>%
  filter(restaurant == "Taco Bell")

qqnormsim(sample = sodium, data = tb)

Chick Fil-A

cfa <- fastfood %>%
  filter(restaurant == "Chick Fil-A")

qqnormsim(sample = sodium, data = cfa)

Burger King

bk <- fastfood %>%
  filter(restaurant == "Burger King")

qqnormsim(sample = sodium, data = bk)

Arbys

arb <- fastfood %>%
  filter(restaurant == "Arbys")

qqnormsim(sample = sodium, data = arb)
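The eight blocks above repeat the same filter-then-plot step. One way to avoid the repetition is to split the variable by restaurant and loop over the groups. The sketch below shows that pattern in base R with a small made-up data frame; menu and its restaurant names are hypothetical, and qqnorm() stands in for qqnormsim().

```r
set.seed(5)                                # arbitrary seed, for reproducibility only
# Hypothetical data frame with the same two columns used above.
menu <- data.frame(
  restaurant = rep(c("A", "B", "C"), each = 30),
  sodium = c(rnorm(30, mean = 1100, sd = 300),
             rgamma(30, shape = 2, scale = 600),
             rnorm(30, mean = 1400, sd = 500))
)

# One numeric vector of sodium values per restaurant.
by_restaurant <- split(menu$sodium, menu$restaurant)

old_par <- par(mfrow = c(1, length(by_restaurant)))
for (name in names(by_restaurant)) {
  qqnorm(by_restaurant[[name]], main = name)
  qqline(by_restaurant[[name]])
}
par(old_par)
```

Putting the panels side by side also makes it easier to judge which restaurant is closest to normal.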

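Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

The stepwise pattern probably comes from repeated values: if sodium is reported in rounded amounts (for example, to the nearest 10 mg), many items share exactly the same value, and those ties show up as flat steps in the plot. Menus that mix a few very high-sodium items with many low-sodium ones can also make the plot look uneven.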
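One possible explanation, offered here as an assumption rather than something checked against the data, is rounding: when values are recorded in coarse increments, many observations tie and the normal probability plot shows flat steps. The sketch below illustrates this with simulated values; the names and the rounding increment of 100 are hypothetical.

```r
set.seed(4)                                          # arbitrary seed, for reproducibility only
sodium_exact   <- rnorm(50, mean = 1200, sd = 400)   # hypothetical unrounded values
sodium_rounded <- round(sodium_exact / 100) * 100    # rounded to the nearest 100

# Rounding collapses many distinct values into a few repeated ones.
n_unique <- c(exact   = length(unique(sodium_exact)),
              rounded = length(unique(sodium_rounded)))
n_unique

# The ties appear as horizontal steps in the plot.
qqnorm(sodium_rounded, main = "Rounded values")
qqline(sodium_rounded)
```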

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

Based on this normal probability plot, the distribution of total carbohydrates at Burger King appears roughly symmetric.

bkmean <- mean(bk$total_carb)
bksd   <- sd(bk$total_carb)
qqnormsim(sample = total_carb, data = bk)

Histogram

ggplot(data = bk, aes(x = total_carb)) +
        geom_blank() +
        geom_histogram(aes(y = after_stat(density))) +
        stat_function(fun = dnorm, args = c(mean = bkmean, sd = bksd), col = "tomato")
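A visual judgment of symmetry can be backed up with a number. The sketch below defines a simple moment-based skewness coefficient; sample_skewness is a hypothetical helper written for this illustration, not a function from base R or openintro. Values near 0 suggest symmetry, positive values suggest right skew, and negative values suggest left skew.

```r
# Moment-based sample skewness: third central moment over variance^(3/2).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(6)                       # arbitrary seed, for reproducibility only
symmetric_sample <- rnorm(500)    # simulated symmetric data
right_sample     <- rexp(500)     # simulated right-skewed data

c(symmetric = sample_skewness(symmetric_sample),
  right     = sample_skewness(right_sample))
```

Applied to a column such as bk$total_carb, a value close to 0 would support the reading of the plot as symmetric.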