library(tidyverse)
library(openintro)
data("fastfood", package='openintro')

Exercise 1

The McDonald’s histogram is right-skewed. Calories from fat center around 200-300, with six products at roughly 250 calories. A few products are outliers, at 800 or more calories from fat. (Note: use xlim(0, 500) for estimates.)

The Dairy Queen histogram is less right-skewed. Its center is higher, at around 300 calories from fat. This is not surprising, as DQ has more dairy products than McDonald’s. However, DQ’s 90th percentile is around 400 and its outliers are all under 700. (Note: use xlim(0, 700) for estimates.)

library(ggplot2)
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")
ggplot(mcdonalds, aes(x = cal_fat)) + geom_histogram(fill = "#C0392B") + theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(dairy_queen, aes(x = cal_fat)) + geom_histogram(fill = "#E59866") + theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
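
The center and 90th-percentile estimates above were read off the histograms. A minimal sketch of how the same quantities could be computed directly is shown here. The helper name `summarise_shape` is hypothetical, and the values are simulated right-skewed stand-ins, not the `fastfood` data.

```r
# Hypothetical helper: numeric summaries to back up estimates read off a histogram.
summarise_shape <- function(x) {
  c(median = median(x),
    mean   = mean(x),
    p90    = unname(quantile(x, 0.90)))
}

set.seed(10)
# Simulated right-skewed stand-in values (gamma draws); the shape and scale
# are arbitrary assumptions, not estimates from the fastfood data.
x <- rgamma(500, shape = 2, scale = 100)
stats <- summarise_shape(x)
print(round(stats))

# A mean noticeably above the median is one sign of right skew.
print(stats[["mean"]] > stats[["median"]])
```

With the lab data, the same helper could be called as `summarise_shape(mcdonalds$cal_fat)` and `summarise_shape(dairy_queen$cal_fat)`.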

Exercise 2

The distribution appears to be nearly normal. However, I am more inclined to describe it as right-skewed, because the density has a large spread.

dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)
ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density))) +  # after_stat() replaces the deprecated ..density.. notation
  stat_function(fun = dnorm,
                args = c(mean = dqmean, sd = dqsd),
                col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 3

The simulated points fall close to the line, with deviations that grow toward the tails. DQ’s real data deviate from the line somewhat more, but the two plots are very similar.

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
ggplot(mapping = aes(sample = sim_norm)) + stat_qq(size = 2, color = "pink") + stat_qq_line(color = "red")

ggplot(data = dairy_queen, aes(sample = cal_fat)) + stat_qq(size = 2, color = "red") + stat_qq_line(color = "white")
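
To make clear what a normal probability plot shows, the sketch below builds the Q-Q coordinates by hand in base R and checks them against `qqnorm()`. It uses simulated stand-in values with arbitrary parameters, not the Dairy Queen data.

```r
# Build normal Q-Q coordinates by hand: sorted data against normal quantiles.
set.seed(42)
x <- rnorm(40, mean = 260, sd = 150)  # simulated stand-in; parameters are arbitrary

n <- length(x)
theoretical <- qnorm(ppoints(n))  # standard normal quantiles at the plotting positions
sample_q    <- sort(x)            # ordered observations

# Base R's qqnorm() computes the same coordinates (returned in data order).
ref <- qqnorm(x, plot.it = FALSE)
stopifnot(isTRUE(all.equal(sort(ref$x), theoretical)))
stopifnot(isTRUE(all.equal(sort(ref$y), sample_q)))

plot(theoretical, sample_q,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(x, col = "red")  # reference line through the first and third quartiles
```

Points that bend away from the reference line at either end correspond to tails that are heavier or lighter than a normal distribution would produce.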

Exercise 4

The simulated normal probability plots are very similar to the plot of the real data. The difference is in the shape: the simulated data appear to smooth out the deviations in the tails. There also appear to be more products sharing the same cal_fat value in the simulated plots than in the real data.

qqnormsim(sample = cal_fat, data = dairy_queen)

Exercise 5

I don’t believe McDonald’s menu follows a normal distribution. The difference between McDonald’s real data and the simulated data is more apparent here. The real data deviate more steeply in the tails, while the simulated plots balance out the tails with a more gradual increase across the samples. Compared with the original data, the simulated plots appear to have shifted quantiles and a mean that is shifted to the right.

qqnormsim(sample = cal_fat, data = mcdonalds)

Exercise 6

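What is the probability that a randomly chosen Subway product is considered low sodium (a cutoff of 140 in the code below)? What is the probability that a randomly chosen McDonald’s product has more sugar than the daily amount (a cutoff of 36 in the code below)?

For Subway, the normal model gives about 0.064 and the empirical proportion is about 0.010. For McDonald’s, pnorm(36, Mcmean, Mcsd) returns the lower tail, P(sugar < 36) = 0.969, so the probability of exceeding 36 is 1 - 0.969 = 0.031, against an empirical proportion of 0.035. The theoretical and empirical methods therefore agree more closely on the sugar question than on the Subway question.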

subway<- fastfood %>% filter(restaurant == "Subway")
sbmean <- mean(subway$sodium)
sbsd   <- sd(subway$sodium)
pnorm(140,sbmean,sbsd)
## [1] 0.06380986
subway %>% filter(sodium<140) %>%summarise(percent = n() / nrow(subway))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1  0.0104
Mcmean <- mean(mcdonalds$sugar)
Mcsd   <- sd(mcdonalds$sugar)
pnorm(36, Mcmean, Mcsd)  # lower tail, P(sugar < 36); the "more than 36" probability is 1 minus this
## [1] 0.9691213
mcdonalds %>% filter(sugar>36) %>%summarise(percent = n() / nrow(mcdonalds))
## # A tibble: 1 x 1
##   percent
##     <dbl>
## 1  0.0351
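
Because `pnorm()` returns the lower-tail probability by default, a "more than" question needs the upper tail. The sketch below compares the normal-model and empirical upper-tail probabilities. The helper name `compare_tail_probs` is hypothetical, and the values are simulated stand-ins with arbitrary parameters, not the `fastfood` data.

```r
# Hypothetical helper: normal-model vs. empirical probability of exceeding a threshold.
compare_tail_probs <- function(x, threshold) {
  theoretical <- pnorm(threshold, mean = mean(x), sd = sd(x), lower.tail = FALSE)
  empirical   <- mean(x > threshold)  # share of observations above the threshold
  c(theoretical = theoretical, empirical = empirical)
}

set.seed(1)
# Simulated stand-in values; the mean and sd here are arbitrary assumptions.
x <- rnorm(500, mean = 10, sd = 5)
probs <- compare_tail_probs(x, threshold = 20)
print(probs)

# The upper tail is the complement of the default lower tail.
lower <- pnorm(20, mean = mean(x), sd = sd(x))
stopifnot(isTRUE(all.equal(probs[["theoretical"]], 1 - lower)))
```

With the lab data, the sugar question could be checked as `compare_tail_probs(mcdonalds$sugar, 36)`.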

Exercise 7

Of all the restaurants, Burger King’s sodium distribution is the closest to normal. Its Q-Q plot follows the diagonal line most tightly, with only a few small deviations.

taco_bell <- fastfood %>% filter(restaurant == "Taco Bell")
arbys <- fastfood %>% filter(restaurant == "Arbys")
burger_king <- fastfood %>% filter(restaurant == "Burger King")
sonic <- fastfood %>% filter(restaurant == "Sonic")
chick_fla <- fastfood %>% filter(restaurant == "Chick Fil-A")
ggplot(data = dairy_queen, aes(sample = sodium)) +stat_qq(size = 2, color ="red") + stat_qq_line(color = "white")

ggplot(data = mcdonalds, aes(sample = sodium)) +stat_qq(size = 2, color ="red") + stat_qq_line(color = "white")

ggplot(data = taco_bell, aes(sample = sodium)) +stat_qq(size = 2, color ="red") + stat_qq_line(color = "white")

ggplot(data = subway, aes(sample = sodium)) +stat_qq(size = 2, color ="red") + stat_qq_line(color = "white")

ggplot(data = arbys, aes(sample = sodium)) +stat_qq(size = 2, color ="red") + stat_qq_line(color = "white")

ggplot(data = burger_king, aes(sample = sodium)) +stat_qq(size = 2, color ="red") + stat_qq_line(color = "white")

ggplot(data = sonic, aes(sample = sodium)) +stat_qq(size = 2, color ="red") + stat_qq_line(color = "white")

ggplot(data = chick_fla, aes(sample = sodium)) +stat_qq(size = 2, color ="red") + stat_qq_line(color = "white")
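
Judging which Q-Q plot is "closest" by eye can be hard with eight plots. One rough numeric aid, not part of the lab itself, is the correlation between theoretical and sample quantiles: values nearer 1 mean the points lie closer to a straight line. The sketch below uses a hypothetical helper `qq_cor` on simulated groups with made-up names and parameters.

```r
# Hypothetical helper: correlation between normal quantiles and sample quantiles.
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

set.seed(3)
# Simulated stand-in data; group names and distributions are made up.
df <- data.frame(
  group = rep(c("A", "B", "C", "D"), each = 25),
  value = c(rnorm(25, mean = 1000, sd = 300),
            rexp(25, rate = 1 / 1000),
            rnorm(25, mean = 1500, sd = 500),
            runif(25, min = 0, max = 2000))
)

# One score per group, sorted so the most line-like group comes first.
scores <- sapply(split(df$value, df$group), qq_cor)
print(sort(scores, decreasing = TRUE))
```

With the lab data, the equivalent call would be `sapply(split(fastfood$sodium, fastfood$restaurant), qq_cor)`. The score is only a summary, so it is best read alongside the plots rather than in place of them.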

Exercise 8

I assume the steps reflect the number of products at each restaurant that share the same sodium level. For example, the Chick Fil-A table contains 27 products. In its normal probability plot, sodium increases roughly linearly across the food items. In a plot with a stepwise pattern, the food items may differ substantially in their sodium levels from one step to the next.
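
One way to see how shared values produce steps is to round simulated data and compare the number of distinct values before and after. This is only a sketch of the mechanism on simulated stand-in values. It does not show how the sodium values in `fastfood` were recorded.

```r
set.seed(7)
# Simulated stand-in for a nutrient measured on a continuous scale;
# the mean and sd are arbitrary assumptions, not estimates from fastfood.
raw     <- rnorm(30, mean = 1200, sd = 400)
rounded <- round(raw / 100) * 100  # report each value to the nearest 100 units

n_raw     <- length(unique(raw))      # continuous draws are all distinct
n_rounded <- length(unique(rounded))  # far fewer distinct values after rounding
print(c(raw = n_raw, rounded = n_rounded))

# Tied values share one sample quantile, which draws flat runs in the Q-Q plot.
qqnorm(rounded, main = "Rounded values")
qqline(rounded, col = "red")
```

Each flat run in the resulting plot is a set of items with the same value, and each jump is the gap between two neighboring levels.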

Exercise 9

The Taco Bell plot deviates substantially from the line in the right tail. Based on this, the variable is right-skewed.

ggplot(data = taco_bell, aes(sample = total_carb)) + stat_qq(size = 2, color = "#8E44AD") + stat_qq_line(color = "#FBFCFC") + labs(x = "theoretical quantiles", y = "total_carb (sample quantiles)")

Looking at the histogram, the majority of Taco Bell’s food items are below 80 g. The peak of the histogram leans to the right rather than sitting at the center of the plot, where the peak of a normal curve would be.

ggplot(taco_bell, aes(x = total_carb)) + geom_histogram(fill = "#C0392B") + theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

