Normal Distributions

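library(tidyverse); library(openintro)   # assumed setup: dplyr/ggplot2 plus the fastfood data and qqnormsim()
head(fastfood)   # look at the fastfood dataset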

Exercise 1 - plot fat calories

Make a plot of calories from fat for the options from Dairy Queen and McDonald’s. How do their centers, shapes, and spreads compare?

Both the Dairy Queen and the McDonald’s fat calorie histograms show right-skewed data. The McDonald’s data has a wider spread, with several more high fat calorie values than the DQ data. The centers of the two graphs are similar, falling between 250 and 300 calories, with the McDonald’s center pulled upward by the higher values in the right tail.

fastfood %>% 
  filter(restaurant == "Mcdonalds" | restaurant == "Dairy Queen") %>%   # keep McDonald's & DQ rows
  ggplot() +
  geom_histogram(aes(x = cal_fat,                                       # create a faceted histogram
                     fill = restaurant)) + 
  facet_grid(cols = vars(restaurant), scales = "free_y") +              # facets by restaurant
  scale_fill_manual(guide = "none", values = c("blue", "gold2")) +      # suppress legend, set colors
  theme_bw()                                                            # simple plot theme

Exercise 2 - Dairy Queen

Based on this plot, does it appear that the data follow a nearly normal distribution?

The data does follow a nearly normal distribution. It’s unimodal, and the curve of the majority of the data is similar to the normal curve line. It’s thrown off a bit by the high values in the right tail.

dairy_queen <- fastfood %>%                                    # create DQ dataset
  filter(restaurant == "Dairy Queen") 

dqmean <- mean(dairy_queen$cal_fat)                            # calc the mean and std. deviation
dqsd   <- sd(dairy_queen$cal_fat)

ggplot(data = dairy_queen, aes(x = cal_fat)) +                 # create histogram
  geom_histogram(aes(y = ..density..),                         # scale bars to density instead of counts
                 fill = "lightblue") +                         # set histogram color
  stat_function(fun = dnorm,                                   # create a normal curve line
                args = list(mean = dqmean, sd = dqsd),         # using the mean and sd from DQ data
                col = "tomato", size = 1.5) +                  # set line color and width
  labs(x = "fat calories",                                     # add label on x axis
       title = "Dairy Queen density plot") +                   # plot title
  theme_light()                                                # simple plot theme
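
Scaling the histogram to density is what lets the normal curve share its axis: the bars of a density histogram have a total area of 1, the same as the area under a normal density curve. The base R sketch below checks this on made-up values rather than the DQ data; the sample size, mean, and sd are assumptions chosen only for illustration.

```r
set.seed(42)
x <- rnorm(200, mean = 260, sd = 150)           # hypothetical stand-in for cal_fat

h <- hist(x, plot = FALSE)                      # compute the bins without drawing
bar_area <- sum(h$density * diff(h$breaks))     # bar height times bar width, summed
bar_area                                        # total area of the density histogram

m <- mean(x)
s <- sd(x)
# Area under the matching normal curve, integrated over mean +/- 8 sd
curve_area <- integrate(dnorm, lower = m - 8 * s, upper = m + 8 * s,
                        mean = m, sd = s)$value
curve_area
```

Both areas come out at (essentially) 1, so the two shapes are drawn on a comparable scale. A count-scaled histogram would not have this property.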

Exercise 3

ggplot() + 
   geom_line(stat = "qq",                                       # create normal quantile plot
            data = dairy_queen, aes(sample = cal_fat),          # for DQ fat calories data
            color = "red", size = 1.5) +                        # set line color and width
    labs(title = "Dairy Queen fat calories normal quantile plot") +    # plot title
  theme_light()
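
A normal quantile plot pairs the sorted sample values with the quantiles a standard normal distribution would produce at the same positions. The sketch below builds those pairs by hand in base R and compares them with `qqnorm()`. The sample is simulated with made-up parameters, and the claim that ggplot2's "qq" stat follows the same idea is an assumption rather than something checked here.

```r
set.seed(1)
x <- rnorm(42, mean = 260, sd = 156)   # hypothetical sample; any numeric vector works

n <- length(x)
theoretical <- qnorm(ppoints(n))       # standard normal quantiles at n plotting positions
sample_q    <- sort(x)                 # ordered sample values

# Base R's qqnorm() returns the same pairs, but in the original data order
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), sample_q)
```

If the sample is close to normal, plotting `sample_q` against `theoretical` gives a nearly straight line whose slope reflects the standard deviation and whose intercept reflects the mean.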

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

No, not all of the points fall on the line. There’s some jiggling in the middle, but the points are all very close to the line, and the ends of the line are strongly linear.

set.seed(19)                                                        # set seed for random simulation
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)  # simulate normal data using DQ mean & sd

ggplot() + 
  geom_line(stat = "qq", aes(sample = sim_norm),     # plot the simulated data as a normal quantile line
            color = "blue",                          # set line color & width
            size = 1.5) +
  labs(title = "Normal simulation",                  # plot title
       subtitle = "using Dairy Queen mean and sd") +
  theme_light()

The normal quantile plots for the DQ data and the simulated normal data are close, but the right skew and fat tails show up in the ends of the DQ qq line.

n_dq    <- nrow(dairy_queen)                              # number of DQ items
rc      <- rep(c("Dairy Queen", "Normal"), each = n_dq)   # label the DQ rows, then the simulated rows
cal_fat <- c(dairy_queen$cal_fat, sim_norm)               # DQ cal_fat data, then simulated normal data
qqdata  <- tibble(restaurant = rc, cal_fat = cal_fat)     # combine into one tibble
ggplot(data = qqdata) + 
   geom_line(stat = "qq",                             # create a normal quantile plot overlapping
            aes(sample = cal_fat,                     # the two lines - DQ & normal sim data
            color = restaurant),
            size = 1.5) +
      labs(title = "Dairy Queen qq plot v. Normal simulation") +  # plot title
    scale_color_manual(labels = c("DQ", "Normal"),                # create guide 
                       values = c("red", "blue")) +               # assign colors
  theme_light()

Exercise 4

Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories from fat are nearly normal?

The plots look similar; simulated plots 7 and 8 in particular have a good bit of similarity to the DQ data. The plots appear to show that the DQ fat calorie data is nearly normal.

qqnormsim(sample = cal_fat, data = dairy_queen) +   # create facet plot of DQ 
  theme_bw()                                        # and several normal simulations
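
The idea behind comparing real data with several simulations can be sketched in base R. This is not the implementation of `qqnormsim()`; it is a rough illustration that assumes 8 simulated samples and uses a made-up, skewed `observed` vector in place of the DQ data. The helper `qq_cor()` is hypothetical and simply measures how straight a normal quantile line is.

```r
set.seed(19)
observed <- rexp(42, rate = 1 / 260)   # hypothetical right-skewed stand-in data

# Draw 8 normal samples with the same size, mean, and sd as the observed data
sims <- replicate(8, rnorm(length(observed),
                           mean = mean(observed),
                           sd   = sd(observed)))
dim(sims)                              # one row per observation, one column per simulation

# Correlation between sorted values and normal quantiles:
# values closer to 1 mean a straighter normal quantile line
qq_cor <- function(v) cor(sort(v), qnorm(ppoints(length(v))))

obs_cor  <- qq_cor(observed)
sim_cors <- apply(sims, 2, qq_cor)
```

Comparing `obs_cor` with the range of `sim_cors` gives a numeric counterpart to eyeballing the panels: simulated normal samples show how much wobble to expect from sampling variation alone.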

Exercise 5

Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

The McDonald’s data shows high values in the right tail; without these values the data looks nearly normal.

mcdonalds <- fastfood %>%                                     # create McDonald's dataset
  filter(restaurant == "Mcdonalds")
qqnormsim(sample = cal_fat, data = mcdonalds) +               # faceted plot of McD data
  labs(title = "McDonald's Fat Calorie distribution") +       # and several normal simulations
  theme_bw()

What is the probability that a randomly chosen Dairy Queen product has more than 600 calories from fat?

Theoretical Normal distribution

Using the normal distribution, the probability of more than 600 calories from fat is 1.5%.

1 - pnorm(q = 600, mean = dqmean, sd = dqsd)            # use complement to find greater than
## [1] 0.01501523
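
The upper-tail probability can be written in a few equivalent ways. The sketch below uses made-up values for the mean and standard deviation (not the DQ estimates) to show that the complement, the `lower.tail = FALSE` argument, and the standardized z-score all agree.

```r
mu    <- 250   # hypothetical mean
sigma <- 150   # hypothetical standard deviation
q     <- 600   # cutoff of interest

p_complement <- 1 - pnorm(q, mean = mu, sd = sigma)                     # 1 minus the lower tail
p_upper      <- pnorm(q, mean = mu, sd = sigma, lower.tail = FALSE)     # upper tail directly

z            <- (q - mu) / sigma                                        # standardize the cutoff
p_standard   <- pnorm(z, lower.tail = FALSE)                            # standard normal upper tail

all.equal(p_complement, p_upper)
all.equal(p_upper, p_standard)
```

For cutoffs far into the tail, `lower.tail = FALSE` tends to be the numerically safer choice because it avoids subtracting a number very close to 1.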

Empirical distribution

The DQ data shows that the probability of more than 600 calories from fat is 4.8%. The difference is 3.3 percentage points, and the empirical value is more than three times the theoretical normal value, which shows the effect of the right-skewed data.

dairy_queen %>%                                         # calculate percentage from DQ data
  filter(cal_fat > 600) %>%
  summarise(percent = n() / nrow(dairy_queen)) %>% 
  as.numeric()
## [1] 0.04761905
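
The empirical probability is just the share of items above the cutoff. A compact base R equivalent takes the mean of a logical vector, since `TRUE` counts as 1 and `FALSE` as 0. The vector below is made up for illustration (42 values, 2 of them above 600) and is not the DQ data.

```r
cal_fat_toy <- c(rep(300, 40), 650, 700)        # made-up values: 42 items, 2 above 600

p_mean  <- mean(cal_fat_toy > 600)              # share of TRUE values, i.e. 2 / 42
p_count <- sum(cal_fat_toy > 600) / length(cal_fat_toy)   # same calculation written out

p_mean
p_count
```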

Exercise 6 Theoretical normal and Empirical probability

Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

Question 1. What is the probability that a randomly selected Arby’s product will have less than 9 grams of sugar?

arbys <- fastfood %>%
  filter(restaurant == "Arbys")
rbmean <- mean(arbys$sugar)
rbsd   <- sd(arbys$sugar)

ptn <- pnorm(9, rbmean, rbsd) %>% round(3)
paste("Theoretical Normal distribution percentage:", ptn)
## [1] "Theoretical Normal distribution percentage: 0.597"
ptm <- arbys %>% 
  filter(sugar < 9) %>%
  summarise(percent = n() / nrow(arbys)) %>% 
  round(3)
paste("Empirical distribution percentage:", ptm)
## [1] "Empirical distribution percentage: 0.636"
paste("Difference = ", ptm - ptn)
## [1] "Difference =  0.039"

The two calculations differ by about 4 percentage points.

Question 2. What is the probability that a randomly selected Taco Bell product will have more than 5 grams of fiber?

tacobell <- fastfood %>%
  filter(restaurant == "Taco Bell")
tbmean <- mean(tacobell$fiber)
tbsd   <- sd(tacobell$fiber)

ptn <- (1 - pnorm(5, tbmean, tbsd)) %>% round(3)   # parentheses so the complement itself is rounded
paste("Theoretical Normal distribution percentage:", ptn)
## [1] "Theoretical Normal distribution percentage: 0.592"
ptm <- tacobell %>% 
  filter(fiber > 5) %>%
  summarise(percent = n() / nrow(tacobell)) %>% 
  round(3)
paste("Empirical distribution percentage:", ptm)
## [1] "Empirical distribution percentage: 0.443"
paste("Difference = ", ptn - ptm)
## [1] "Difference =  0.149"

The two calculations differ by almost 15 percentage points, so fiber content in Taco Bell products is probably not normally distributed.

The Arby’s sugar distribution is closer to its theoretical normal distribution than the Taco Bell fiber distribution is.
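
Both questions follow the same pattern: fit a normal distribution with the sample mean and sd, compute the tail probability, and compare it with the observed share. One way to wrap that pattern is sketched below. `compare_probs()` is a hypothetical helper, not part of any package used in this lab, and the demo data is simulated rather than taken from `fastfood`.

```r
# Hypothetical helper: theoretical vs. empirical probability for one cutoff
compare_probs <- function(x, cutoff, upper = FALSE) {
  x <- x[!is.na(x)]                                   # drop missing values first
  theoretical <- pnorm(cutoff, mean = mean(x), sd = sd(x),
                       lower.tail = !upper)           # normal model with sample mean & sd
  empirical <- if (upper) mean(x > cutoff) else mean(x < cutoff)
  c(theoretical = theoretical,
    empirical   = empirical,
    difference  = abs(theoretical - empirical))
}

set.seed(3)
demo <- rnorm(1000, mean = 10, sd = 2)                # simulated, roughly normal data
result <- compare_probs(demo, cutoff = 9)             # P(X < 9)
round(result, 3)
```

With the lab's objects this could be called as, for example, `compare_probs(arbys$sugar, 9)` or `compare_probs(tacobell$fiber, 5, upper = TRUE)`; a small `difference` suggests closer agreement with the normal model at that cutoff.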

Exercise 7

Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

ggplot(data = fastfood) +
  geom_histogram(aes(x = sodium,
                     y = ..density..,
                     fill = restaurant)) + 
        labs(title = "Sodium content histograms") +
  facet_wrap(vars(restaurant)) +
  guides(fill = "none") +
  theme_bw()  

ggplot(data = fastfood) + 
   geom_line(stat = "qq", 
            aes(sample = sodium,
            color = restaurant),
            size = 1.5) +
      labs(title = "Sodium content qq plots") +
  facet_wrap(vars(restaurant)) +
  guides(color = "none") +
  theme_light()

Arby’s, Burger King, Taco Bell and Subway are close to normally distributed. Chick Fil-A, Dairy Queen, McDonald’s and Sonic are right skewed.

qqnormsim(sample = sodium, data = arbys) +
  labs(title = "Arby's Sodium distribution") +
  theme_bw()

fastfood %>% filter(restaurant == "Burger King") %>% 
qqnormsim(sample = sodium) +
  labs(title = "Burger King Sodium distribution") +
  theme_bw()

fastfood %>% filter(restaurant == "Chick Fil-A") %>% 
qqnormsim(sample = sodium) +
  labs(title = "Chick Fil-A Sodium distribution") +
  theme_bw()

qqnormsim(sample = sodium, data = dairy_queen) +
  labs(title = "DQ Sodium distribution") +
  theme_bw()

qqnormsim(sample = sodium, data = mcdonalds) +
  labs(title = "McDonald's Sodium distribution") +
  theme_bw()

fastfood %>% filter(restaurant == "Sonic") %>% 
qqnormsim(sample = sodium) +
  labs(title = "Sonic Sodium distribution") +
  theme_bw()

fastfood %>% filter(restaurant == "Subway") %>% 
qqnormsim(sample = sodium) +
  labs(title = "Subway Sodium distribution") +
  theme_bw()

qqnormsim(sample = sodium, data = tacobell) +
  labs(title = "Taco Bell Sodium distribution") +
  theme_bw()

Exercise 8 - Stepwise graphs

Some of the normal probability plots for sodium distributions seem to have a step-wise pattern. Why do you think this might be the case?

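Looking at the histograms for these restaurants, Chick Fil-A, DQ, and McDonald’s have gaps in their sodium content distributions. In a normal probability plot, a cluster of similar values appears as a flat run and a gap appears as a jump, which gives the step-wise pattern.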
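
Repeated values are one plausible source of such steps. The sketch below assumes, without checking the actual data, that reported values are coarsely rounded; it simulates smooth data, rounds it to the nearest 100, and shows the runs of tied values that would draw as flat segments in a quantile plot.

```r
set.seed(7)
smooth  <- rnorm(50, mean = 1200, sd = 400)   # hypothetical sodium-like values
rounded <- round(smooth / 100) * 100          # assumed coarse rounding to the nearest 100

n_smooth  <- length(unique(smooth))           # distinct values before rounding
n_rounded <- length(unique(rounded))          # distinct values after rounding

# Run lengths of the sorted rounded values: any run longer than 1 is a tie,
# which plots as a flat step followed by a jump to the next value
runs <- rle(sort(rounded))$lengths
n_smooth
n_rounded
runs
```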

Exercise 9 - Carbohydrates plots

Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

cfa <- fastfood %>% filter(restaurant == "Chick Fil-A")
  
ggplot(data = cfa) + 
   geom_line(stat = "qq", 
            aes(sample = total_carb),
            color = "green4",
            size = 1.5) +
      labs(title = "Chick Fil-A total carbs qq plot") +
  theme_light()

The total carbs data isn’t normally distributed and doesn’t look symmetric; it may have fat tails.

ggplot(data = cfa) +
  geom_histogram(aes(x = total_carb,
                     y = ..density..
                     ), fill = "green4") + 
        labs(title = "Chick Fil-A total carbs histogram") +
  theme_bw()  

The histogram shows that the data is not unimodal and is not normally distributed.