library(tidyverse)
library(openintro)
library(dplyr)
library(ggplot2)

Exercise 1

Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

# Load the "fastfood" dataset
data("fastfood", package='openintro')
head(fastfood)
## # A tibble: 6 × 17
##   restaurant item       calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Mcdonalds  Artisan G…      380      60         7       2       0            95
## 2 Mcdonalds  Single Ba…      840     410        45      17       1.5         130
## 3 Mcdonalds  Double Ba…     1130     600        67      27       3           220
## 4 Mcdonalds  Grilled B…      750     280        31      10       0.5         155
## 5 Mcdonalds  Crispy Ba…      920     410        45      12       0.5         120
## 6 Mcdonalds  Big Mac         540     250        28      10       1            80
## # ℹ 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>, sugar <dbl>,
## #   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>
# Create two new data sets, "mcdonalds" and "dairy_queen"
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
head(mcdonalds)
## # A tibble: 6 × 17
##   restaurant item       calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Mcdonalds  Artisan G…      380      60         7       2       0            95
## 2 Mcdonalds  Single Ba…      840     410        45      17       1.5         130
## 3 Mcdonalds  Double Ba…     1130     600        67      27       3           220
## 4 Mcdonalds  Grilled B…      750     280        31      10       0.5         155
## 5 Mcdonalds  Crispy Ba…      920     410        45      12       0.5         120
## 6 Mcdonalds  Big Mac         540     250        28      10       1            80
## # ℹ 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>, sugar <dbl>,
## #   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")
head(dairy_queen)
## # A tibble: 6 × 17
##   restaurant  item      calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>       <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Dairy Queen 1/2 lb. …     1000     660        74      26         2         170
## 2 Dairy Queen 1/2 lb. …      800     460        51      20         2         135
## 3 Dairy Queen 1/4 lb. …      630     330        37      13         1          95
## 4 Dairy Queen 1/4 lb. …      540     270        30      11         1          70
## 5 Dairy Queen 1/4 lb. …      570     310        35      11         1          75
## 6 Dairy Queen Original…      400     160        18       9         1          65
## # ℹ 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>, sugar <dbl>,
## #   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>
hist(mcdonalds$cal_fat)

hist(dairy_queen$cal_fat)

These two histograms show the distribution of cal_fat for McDonald’s and Dairy Queen. McDonald’s items tend to have more calories from fat than Dairy Queen items. Both distributions are right-skewed, which means each menu has a few items with very high calories from fat. The default bins also differ: the McDonald’s histogram uses a bin width of 200 calories, whereas the Dairy Queen histogram uses a bin width of 100 calories, which suggests that the McDonald’s values are spread over a wider range.
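
To back up the visual comparison with numbers, a minimal sketch along these lines could summarise center and spread for the two restaurants and redraw the histograms with a shared bin width. It assumes the openintro, dplyr, and ggplot2 packages are installed. The names two_restaurants and cal_fat_summary, and the 100-calorie bin width, are illustrative choices rather than part of the lab.

```r
library(openintro)
library(dplyr)
library(ggplot2)

# Keep only the two restaurants being compared
two_restaurants <- fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen"))

# Center (mean, median) and spread (sd, IQR) of calories from fat
cal_fat_summary <- two_restaurants %>%
  group_by(restaurant) %>%
  summarise(
    n      = n(),
    mean   = mean(cal_fat),
    median = median(cal_fat),
    sd     = sd(cal_fat),
    iqr    = IQR(cal_fat)
  )
cal_fat_summary

# Same bin width and x-axis in both panels so the shapes are comparable
ggplot(two_restaurants, aes(x = cal_fat)) +
  geom_histogram(binwidth = 100) +
  facet_wrap(~restaurant, ncol = 1)
```

A mean noticeably larger than the median within a restaurant would be consistent with the right skew seen in the histograms.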

Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)
ggplot(data = dairy_queen, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = after_stat(density))) +
        stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The standard deviation measures how far individual values typically fall from the mean (here, the average calories from fat). The histogram follows the overlaid normal curve reasonably closely, with relatively few departures, so the “calories from fat” data for Dairy Queen appear to follow a nearly normal distribution.

Exercise 3

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)

ggplot(data = dairy_queen, aes(sample = cal_fat)) + 
  geom_line(stat = "qq")

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)

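# Create a Q-Q plot for the simulated data.
# The sample aesthetic is set in ggplot() so that both geom_qq() and
# geom_qq_line() inherit it; geom_qq_line() fails without a sample aesthetic.
qqplot_sim_norm <- ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Normal Q-Q Plot for Simulated Data")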

qqplot_sim_norm

The first plot is a normal probability (Q-Q) plot of the “calories from fat” data for Dairy Queen. The “sim_norm” variable holds a random sample drawn with ‘rnorm’ from a normal distribution whose mean and standard deviation match those of the Dairy Queen data, with the same number of observations. Not every simulated point falls exactly on the line, but most follow it closely, which is what an approximately normal sample looks like in a Q-Q plot.
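
For readers unsure what the points in a normal probability plot represent, the sketch below builds the coordinates by hand in base R: the sorted sample on the y-axis against standard normal quantiles on the x-axis. The sample is simulated with made-up parameters (mean 260, sd 150) purely for illustration, and x, theoretical, and observed are hypothetical names.

```r
set.seed(42)  # fixed seed so the simulated sample is reproducible

# Illustrative sample; in the lab this would be sim_norm or dairy_queen$cal_fat
x <- rnorm(n = 50, mean = 260, sd = 150)

# Theoretical quantiles: standard normal quantiles at evenly spaced probabilities
theoretical <- qnorm(ppoints(length(x)))

# Observed quantiles: the data sorted from smallest to largest
observed <- sort(x)

# Points close to a straight line suggest the sample is close to normal
plot(theoretical, observed,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(x)  # reference line through the first and third quartiles
```

Because the simulated sample really is normal, any wobble around the line here reflects sampling variation alone, which is a useful baseline when judging the real data.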

Exercise 4

Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?

qqnormsim(sample = cal_fat, data = dairy_queen)

qqnormsim draws a Q-Q plot of the original data alongside Q-Q plots of several simulated normal data sets, which shows how much departure from the line to expect from normal data of this size. The points for the original data follow the reference line about as closely as the points in the simulated plots do. So yes, the normal probability plot for the cal_fat data looks similar to the plots created for the simulated data, which is evidence that the calories from fat are nearly normal.
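
The idea behind this comparison can be sketched in base R. The helper below is hypothetical and is not openintro’s implementation of qqnormsim; it only illustrates the technique of placing the real data next to simulated normal samples that share its size, mean, and standard deviation. The stand-in data come from rexp with an arbitrary rate.

```r
# Hypothetical helper: Q-Q plot of the data plus n_sim simulated normal samples
qq_sim_grid <- function(x, n_sim = 8) {
  x <- x[!is.na(x)]
  # Simulated samples share the data's size, mean, and standard deviation
  sims <- replicate(n_sim, rnorm(length(x), mean = mean(x), sd = sd(x)),
                    simplify = FALSE)
  old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))
  qqnorm(x, main = "data")
  qqline(x)
  for (i in seq_along(sims)) {
    qqnorm(sims[[i]], main = paste("sim", i))
    qqline(sims[[i]])
  }
  invisible(sims)
}

set.seed(1)
sims <- qq_sim_grid(rexp(40, rate = 1 / 250))  # stand-in skewed data
```

In the lab, the call would be qq_sim_grid(dairy_queen$cal_fat). If the "data" panel is hard to pick out from the simulated panels, the data are plausibly normal.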

Exercise 5

Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

# Mean and standard deviation of calories from fat for McDonald's
mcmean <- mean(mcdonalds$cal_fat)
mcsd   <- sd(mcdonalds$cal_fat)
ggplot(data = mcdonalds, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = after_stat(density))) +
        stat_function(fun = dnorm, args = c(mean = mcmean, sd = mcsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Create a Q-Q plot for McDonald's data
ggplot(data = mcdonalds, aes(sample = cal_fat)) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Normal Q-Q Plot for Calories from Fat (McDonald's)")

qqnormsim(sample = cal_fat, data = mcdonalds)

The histogram somewhat resembles a normal distribution and appears to be unimodal (one main peak). The cal_fat data from McDonald’s menu approximately follow a normal distribution, but there are some departures from normality, most notably right skew.

Exercise 6

Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

Question 1. What is the probability that a randomly chosen Dairy Queen item has fewer than 600 calories from fat?

Question 2. What is the probability that a randomly chosen menu item from McDonald’s has between 200 and 600 calories from fat?

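# Theoretical P(cal_fat < 600) for Dairy Queen under the normal model;
# pnorm() already returns the lower tail, so no "1 -" is needed
pnorm(q = 600, mean = dqmean, sd = dqsd)
## [1] 0.9849848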
dairy_queen %>% 
  filter(cal_fat < 600) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1   0.952
mcdonalds %>% 
  filter(cal_fat >= 200 & cal_fat <= 600) %>%
  summarise(percent = n() / nrow(mcdonalds))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1   0.561
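
The exercise asks for a theoretical and an empirical probability for each question. A small helper along the following lines could compute both for an interval. compare_probs is a hypothetical name, and the toy vector is made up only to show the mechanics; it is not data from the lab.

```r
# Hypothetical helper: P(lower <= X <= upper) two ways for a numeric vector x
compare_probs <- function(x, lower, upper) {
  # Theoretical: area under a normal curve with the sample's mean and sd
  theoretical <- pnorm(upper, mean = mean(x), sd = sd(x)) -
    pnorm(lower, mean = mean(x), sd = sd(x))
  # Empirical: share of observed values that fall in the interval
  empirical <- mean(x >= lower & x <= upper)
  c(theoretical = theoretical, empirical = empirical)
}

# Toy values for illustration only; in the lab, pass mcdonalds$cal_fat
toy <- c(100, 200, 300, 400, 700)
result <- compare_probs(toy, lower = 200, upper = 600)
result
```

Calling compare_probs(mcdonalds$cal_fat, 200, 600) would give the theoretical counterpart to the empirical proportion above. The smaller the gap between the two numbers, the better the normal model describes that restaurant’s data.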

Exercise 7

Now let’s consider some of the other variables in the data set. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

# Find mean and standard deviation for each restaurant
restaurant_summary <- fastfood %>%
  group_by(restaurant) %>%
  summarise(mean_sodium = mean(sodium), sd_sodium = sd(sodium))
restaurant_summary
## # A tibble: 8 × 3
##   restaurant  mean_sodium sd_sodium
##   <chr>             <dbl>     <dbl>
## 1 Arbys             1515.      664.
## 2 Burger King       1224.      500.
## 3 Chick Fil-A       1151.      727.
## 4 Dairy Queen       1182.      610.
## 5 Mcdonalds         1438.     1036.
## 6 Sonic             1351.      665.
## 7 Subway            1273.      744.
## 8 Taco Bell         1014.      474.
# Plot sodium distribution for each restaurant
ggplot(data = fastfood, aes(x = sodium)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20) +
  facet_wrap(~restaurant, scales = "free") +
  geom_density() +
  labs(title = "Sodium Distribution by Restaurant")

The sodium distributions for Arbys and Burger King appear to be the closest to normal.
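
Histograms can hide departures from normality that a normal probability plot makes visible. As a complementary check, one Q-Q plot per restaurant could be drawn as sketched below. This assumes the openintro and ggplot2 packages are installed, and sodium_qq is an illustrative name.

```r
library(openintro)
library(ggplot2)

# One normal probability plot per restaurant; free scales because
# sodium ranges differ between restaurants
sodium_qq <- ggplot(fastfood, aes(sample = sodium)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~restaurant, scales = "free") +
  labs(title = "Normal Q-Q Plots for Sodium by Restaurant")
sodium_qq
```

The restaurants whose points stay closest to their reference line across the whole range are the best candidates for "closest to normal".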

Exercise 8

Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

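Some of the sodium plots have a stepwise pattern, most likely because sodium amounts are reported as rounded values, so several items share exactly the same value. The variable is numeric, not categorical, but these ties make it behave like discrete data rather than a truly continuous variable, and tied values appear as flat steps in a normal probability plot.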
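
The effect of rounding can be reproduced with simulated data. The sketch below uses made-up parameters (mean 1200, sd 500, rounding to the nearest 100) chosen only to exaggerate the pattern; they are not estimates from the fastfood data.

```r
set.seed(7)

# Continuous values versus the same values rounded to the nearest 100
raw     <- rnorm(60, mean = 1200, sd = 500)
rounded <- round(raw, -2)

# Rounding creates ties: far fewer distinct values than observations
n_distinct_raw     <- length(unique(raw))
n_distinct_rounded <- length(unique(rounded))
c(raw = n_distinct_raw, rounded = n_distinct_rounded)

# Tied values share a y-coordinate, which shows up as flat steps
old_par <- par(mfrow = c(1, 2))
qqnorm(raw, main = "Continuous")
qqline(raw)
qqnorm(rounded, main = "Rounded")
qqline(rounded)
par(old_par)
```

Both samples come from the same normal distribution, so the steps in the second panel are caused by the rounding alone.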

Exercise 9

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

# create a new data set "Subway"
Subway <- fastfood %>%
  filter(restaurant == "Subway")
head(Subway)
## # A tibble: 6 × 17
##   restaurant item       calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Subway     "6\" B.L.…      320      80         9       4         0          20
## 2 Subway     "Footlong…      640     160        18       8         0          40
## 3 Subway     "6\" BBQ …      430     160        18       6         0          50
## 4 Subway     "Footlong…      860     320        36      12         0         100
## 5 Subway     "6\" Big …      580     310        31      11         0          85
## 6 Subway     "Footlong…     1160     620        62      22         0         170
## # ℹ 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>, sugar <dbl>,
## #   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>
# Plot Total Carbohydrates distribution of Subway
# (x is total_carb, matching the title; no facet is needed for one restaurant)
ggplot(data = Subway, aes(x = total_carb)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20) +
  geom_density() +
  labs(title = "Total Carbohydrates Distribution of Subway")

# Create a Q-Q plot
ggplot(data = Subway, aes(sample = total_carb)) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Normal Q-Q Plot for Total Carbohydrates")

# Create a histogram
hist(Subway$total_carb)

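According to the normal probability plot, the “total carbohydrates” variable for Subway is right-skewed. In a normal probability plot, right skew shows up as points that curve upward, away from the reference line, at the high end. The histogram of total_carb serves as a check on this reading.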
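
A numeric check can complement the two plots. The sketch below defines a moment-based sample skewness. sample_skewness is a hypothetical helper, not a function from the lab or from openintro, and the example vectors are made up.

```r
# Hypothetical helper: moment-based sample skewness (positive = right-skewed)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  deviations <- x - mean(x)
  mean(deviations^3) / mean(deviations^2)^1.5
}

symmetric_example    <- c(1, 2, 3, 4, 5)
right_skewed_example <- c(1, 1, 2, 2, 3, 10)

sample_skewness(symmetric_example)     # zero for perfectly symmetric values
sample_skewness(right_skewed_example)  # positive: long right tail
```

Applied to the lab data, sample_skewness(Subway$total_carb) returning a clearly positive value would agree with the right skew read off the normal probability plot.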

