library(tidyverse)
library(openintro)

Exercise 1

Make a plot or plots to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

The shapes of both histograms look similar: each is unimodal and right-skewed. McDonald’s calories from fat have a higher minimum (50 vs. 0), mean (285.6 vs. 260.5), and maximum (1270 vs. 670) than Dairy Queen’s, so the McDonald’s center is higher and its range is wider. The x-axis of the McDonald’s histogram is marked in 200-calorie increments, while that of the Dairy Queen histogram is marked in 100-calorie increments.

Mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
summary(Mcdonalds$cal_fat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    50.0   160.0   240.0   285.6   320.0  1270.0
hist(Mcdonalds$cal_fat)

dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

summary(dairy_queen$cal_fat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   160.0   220.0   260.5   310.0   670.0
hist(dairy_queen$cal_fat)
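
The `summary()` output above covers center and range but not the standard deviation or IQR. As a minimal sketch, assuming the same `fastfood` data from `openintro`, both restaurants could be summarized in one table. The name `cal_fat_summary` is introduced here for illustration only.

```r
library(dplyr)
library(openintro)

# Side-by-side center and spread of calories from fat for the two restaurants
cal_fat_summary <- fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen")) %>%
  group_by(restaurant) %>%
  summarise(
    n      = n(),           # number of menu items
    median = median(cal_fat),
    mean   = mean(cal_fat),
    sd     = sd(cal_fat),   # standard deviation as a measure of spread
    iqr    = IQR(cal_fat)   # interquartile range, less sensitive to outliers
  )

cal_fat_summary
```

A larger `sd` or `iqr` for one restaurant would support the claim that its distribution is more spread out.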

Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

Answer: Yes, approximately. It is not exactly a normal distribution because the bars are taller toward the low end of the x-axis, closer to 0. For a normal distribution, the highest peak would sit at the center, around 330 cal_fat, midway between the minimum and maximum.

dqmean = mean(dairy_queen$cal_fat)
dqsd = sd(dairy_queen$cal_fat)

ggplot(data = dairy_queen, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = after_stat(density))) +
        stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 3

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

Answer: No, not all of the points fall on the line. The simulated-data plot has a steeper slope between -2 and 1 and a shallower slope from 1 to 2. Apart from that, the plots for the real data and the simulated data are similar in the range from -1 to 1.

# Draw a normal probability plot (Q-Q Plot)
ggplot(data = dairy_queen, aes(sample = cal_fat)) + 
  geom_line(stat = "qq")

# Simulate data from a normal distribution with the same mean and SD as cal_fat
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
# Normal probability plot (Q-Q plot) of the simulated data
ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_line(stat = "qq")

Exercise 4

Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?

Answer: Yes. Almost all of the points on the normal probability (Q-Q) plot fall along a diagonal line, which indicates that the data set is nearly normal, consistent with the plots generated in the previous exercise.

# Simulate 8 data sets from a normal distribution and draw their Q-Q plots
qqnormsim(sample = cal_fat, data = dairy_queen)

Exercise 5

Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

Answer: The calories data from McDonald’s menu are nearly normal too. The thicker band of points looks diagonal, although the slope is a little shallower at the start and end of the plot.

qqnormsim(sample = cal_fat, data = Mcdonalds)

Exercise 6

Write out two probability questions that you would like to answer about any of the restaurants in this data set. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one has a closer agreement between the two methods?

Q1: What is the probability that a Sonic menu item has a cholesterol value above 50?

Q2: What is the probability that a Subway menu item has a total_fat value above 35 g?

Answer: The Subway total_fat probabilities agree more closely (about 0.129 theoretical versus 0.135 empirical) than the Sonic cholesterol probabilities (about 0.719 versus 0.660).

# Dairy Queen: P(cal_fat > 600), theoretical first, then empirical
1 - pnorm(q = 600, mean = dqmean, sd = dqsd)
## [1] 0.01501523
dairy_queen %>%
  filter(cal_fat > 600) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1  0.0476
# Sonic: P(cholesterol > 50), theoretical first, then empirical
Sonic <- fastfood %>%
  filter(restaurant == "Sonic")
s_mean <- mean(Sonic$cholesterol)
s_sd <- sd(Sonic$cholesterol)

1 - pnorm(q = 50, mean = s_mean, sd = s_sd)
## [1] 0.7190123
Sonic %>% 
  filter(cholesterol > 50) %>%
  summarise(percent = n() / nrow(Sonic))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1   0.660
# Subway: P(total_fat > 35), theoretical first, then empirical
Subway <- fastfood %>%
  filter(restaurant == "Subway")
s_mean <- mean(Subway$total_fat)
s_sd <- sd(Subway$total_fat)

1 - pnorm(q = 35, mean = s_mean, sd = s_sd)
## [1] 0.1290602
Subway %>% 
  filter(total_fat > 35) %>%
  summarise(percent = n() / nrow(Subway))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1   0.135
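
The theoretical and empirical steps above repeat for every question, so they can be wrapped in one function. The sketch below is illustrative only: `compare_upper_tail` is a hypothetical helper, not part of `openintro` or the lab, and it is shown on simulated stand-in values rather than the restaurant data.

```r
# Hypothetical helper: compare P(X > cutoff) under a fitted normal model
# with the observed proportion of values above the cutoff.
compare_upper_tail <- function(x, cutoff) {
  x <- x[!is.na(x)]                                   # drop missing values
  theoretical <- 1 - pnorm(cutoff, mean = mean(x), sd = sd(x))
  empirical   <- mean(x > cutoff)                     # share of values above cutoff
  c(theoretical = theoretical,
    empirical   = empirical,
    abs_diff    = abs(theoretical - empirical))
}

set.seed(1)
sim_values <- rnorm(n = 500, mean = 30, sd = 10)      # simulated stand-in data
result <- compare_upper_tail(sim_values, cutoff = 35)
print(round(result, 3))
```

With the lab's objects, a call such as `compare_upper_tail(Subway$total_fat, 35)` should reproduce the two Subway probabilities, and a smaller `abs_diff` indicates closer agreement between the two methods.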

Exercise 7

Now let’s consider some of the other variables in the data set. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

Answer: Burger King & Chick Fil-A

#Arbys sodium plot
arbys <- fastfood %>%
  filter(restaurant == "Arbys")
qqnorm(arbys$sodium, main = "Arbys")

#Burger King sodium plot
bk <- fastfood %>%
  filter(restaurant == "Burger King")

qqnorm(bk$sodium, main = "Burger King")

#Chick Fil-A sodium plot **
cfa <- fastfood %>%
  filter(restaurant == "Chick Fil-A")

qqnorm(cfa$sodium, main = "Chick Fil-A")

#Dairy Queen sodium plot
dq <- fastfood %>%
  filter(restaurant == "Dairy Queen")

qqnorm(dq$sodium, main = "Dairy Queen")

#McDonald's sodium plot
mcd <- fastfood %>%
  filter(restaurant == "Mcdonalds")

qqnorm(mcd$sodium, main = "McDonald's")

#Sonic sodium plot
s <- fastfood %>%
  filter(restaurant == "Sonic")

qqnorm(s$sodium, main = "Sonic")

#Subway sodium plot
sw <- fastfood %>%
  filter(restaurant == "Subway")

qqnorm(sw$sodium, main = "Subway")

#Taco Bell sodium plot
tb <- fastfood %>%
  filter(restaurant == "Taco Bell")

qqnorm(tb$sodium, main = "Taco Bell")
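
The eight separate plots above could also be drawn in a single figure, which makes the restaurants easier to compare. This is a sketch of one possible alternative using `ggplot2` faceting, not the approach the lab prescribes. The reference line from `stat_qq_line()` and the name `sodium_qq` are additions for illustration.

```r
library(ggplot2)
library(openintro)

# One Q-Q panel per restaurant, each with its own reference line
sodium_qq <- ggplot(data = fastfood, aes(sample = sodium)) +
  stat_qq() +                                   # sample vs. theoretical quantiles
  stat_qq_line(colour = "tomato") +             # line through the quartiles
  facet_wrap(~ restaurant, scales = "free_y")   # separate y scale per restaurant

print(sodium_qq)
```

Panels whose points stay close to the line across the whole range are the better candidates for "closest to normal".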

Exercise 8

Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

Answer: A stepwise pattern suggests that the variable behaves as discrete (though not categorical): many items share the same sodium value, so the points line up in flat steps. This may reflect groups of similar food items within a restaurant's menu.
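
A small simulation can illustrate how repeated values produce steps. The sketch below assumes, purely for illustration, that the repeats come from rounding to the nearest 100. The data are simulated and none of the variable names come from the lab.

```r
set.seed(42)
smooth_values  <- rnorm(n = 100, mean = 1200, sd = 400)  # simulated, continuous
rounded_values <- round(smooth_values / 100) * 100       # rounded to nearest 100

# Rounding collapses many distinct values into a few repeated ones
n_unique_smooth  <- length(unique(smooth_values))
n_unique_rounded <- length(unique(rounded_values))
cat("Distinct values before rounding:", n_unique_smooth, "\n")
cat("Distinct values after rounding: ", n_unique_rounded, "\n")

# Side-by-side Q-Q plots: the rounded data show horizontal steps
par(mfrow = c(1, 2))
qqnorm(smooth_values, main = "Unrounded")
qqnorm(rounded_values, main = "Rounded to 100")
par(mfrow = c(1, 1))
```

Each flat step in the right-hand plot corresponds to one repeated value, which matches the pattern seen in some of the sodium plots.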

Exercise 9

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

Answer: This variable, “sodium”, is right skewed, and the histogram confirms this: the frequencies of observations on the right side are lower than the frequencies of observations on the left side.

# Normal probability plot for sodium at Dairy Queen
qqnorm(dq$sodium, main = "Dairy Queen")
qqline(dq$sodium)

hist(dq$sodium)

