Load libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(ggplot2)

Get the data

data("fastfood", package='openintro')
head(fastfood)
## # A tibble: 6 × 17
##   restaur…¹ item  calor…² cal_fat total…³ sat_fat trans…⁴ chole…⁵ sodium total…⁶
##   <chr>     <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
## 1 Mcdonalds Arti…     380      60       7       2     0        95   1110      44
## 2 Mcdonalds Sing…     840     410      45      17     1.5     130   1580      62
## 3 Mcdonalds Doub…    1130     600      67      27     3       220   1920      63
## 4 Mcdonalds Gril…     750     280      31      10     0.5     155   1940      62
## 5 Mcdonalds Cris…     920     410      45      12     0.5     120   1980      81
## 6 Mcdonalds Big …     540     250      28      10     1        80    950      46
## # … with 7 more variables: fiber <dbl>, sugar <dbl>, protein <dbl>,
## #   vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>, and abbreviated
## #   variable names ¹​restaurant, ²​calories, ³​total_fat, ⁴​trans_fat,
## #   ⁵​cholesterol, ⁶​total_carb

Get the data for the McDonald's and Dairy Queen fast food restaurants

mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

Exercise 1: Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

** Ans 1: To answer Exercise 1, two histograms are created from the Dairy Queen and McDonald's calories-from-fat data:

hist(dairy_queen$cal_fat)

hist(mcdonalds$cal_fat)

mean(dairy_queen$cal_fat)
## [1] 260.4762
median(mcdonalds$cal_fat)
## [1] 240

Comparison:

From the two histograms above, the Dairy Queen calories-from-fat distribution is almost symmetric (slightly right skewed), whereas the McDonald's distribution is clearly right skewed. The mean is therefore a reasonable measure of center for the Dairy Queen data, while the median, which is less sensitive to a long right tail, is the better measure of center for the McDonald's data. The two centers are close: a mean of about 260 for Dairy Queen and a median of 240 for McDonald's. The Dairy Queen spread is also smaller than the McDonald's spread.
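The visual comparison can be backed up with numbers. The following is a minimal sketch, assuming the `fastfood` data from `openintro` as loaded above; the object name `fat_summary` and the summary column names are chosen here only for illustration.

```r
library(dplyr)
library(openintro)

data("fastfood", package = "openintro")

# Summarise center and spread of calories from fat for the two restaurants
fat_summary <- fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen")) %>%
  group_by(restaurant) %>%
  summarise(
    n              = n(),
    mean_cal_fat   = mean(cal_fat),
    median_cal_fat = median(cal_fat),
    sd_cal_fat     = sd(cal_fat),
    iqr_cal_fat    = IQR(cal_fat)
  )
fat_summary
```

A mean well above the median points to right skew, and the standard deviation and IQR columns give a direct comparison of spread.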

Dairy Queen and McDonald's calories-from-fat distribution plots

dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)
plot1<-ggplot(data = dairy_queen, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = after_stat(density))) +
        stat_function(fun = dnorm, args = list(mean = dqmean, sd = dqsd), col = "tomato")
plot1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mcdmean <- mean(mcdonalds$cal_fat)
mcdsd   <- sd(mcdonalds$cal_fat)
plot2<-ggplot(data = mcdonalds, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = after_stat(density))) +
        stat_function(fun = dnorm, args = list(mean = mcdmean, sd = mcdsd), col = "tomato")
plot2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 2: Based on the plots above, does it appear that the data follow a nearly normal distribution?

** Ans 2: From the two plots above, the Dairy Queen data (plot1) follow a nearly normal, roughly symmetric distribution. The McDonald's data (plot2), on the other hand, are right skewed and do not follow the normal curve closely.

Exercise 3: Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_line(stat = "qq")

# black: real data; red: simulated normal sample
ggplot(data = dairy_queen, aes(sample = cal_fat)) +
  geom_line(stat = "qq") +
  geom_line(aes(sample = sim_norm), stat = "qq", col = "red")

** Ans 3: In the probability plots of the simulated data above (in the combined plot, the red line is the simulated data), almost all of the points fall on a single straight line. Some deviation occurs at the tails, but the simulated data are, as expected, nearly normally distributed. The probability plot of the real data (black line) shows similar behavior.
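Whether the points "fall on the line" is easier to judge when a reference line is drawn. The sketch below is one way to do that, assuming a ggplot2 version that provides `stat_qq_line()`; the object name `qq_dq` is chosen here for illustration.

```r
library(ggplot2)
library(openintro)

data("fastfood", package = "openintro")
dairy_queen <- subset(fastfood, restaurant == "Dairy Queen")

qq_dq <- ggplot(dairy_queen, aes(sample = cal_fat)) +
  stat_qq() +                    # sample quantiles vs. theoretical normal quantiles
  stat_qq_line(col = "tomato")   # reference line through the first and third quartiles
qq_dq
```

Points that stay close to the reference line support the nearly normal reading, while systematic curvature at one end suggests skew.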

Exercise 4: Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?

qqnormsim(sample = cal_fat, data = dairy_queen)

** Ans 4: Comparing the plot of the real data with the 8 plots of simulated data, the calories from fat at Dairy Queen appear to follow a nearly normal distribution.
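The idea behind this comparison can be sketched by hand. The code below is not the implementation of `qqnormsim`; it is an illustrative approximation that draws eight normal samples with the same size, mean, and standard deviation as the data and facets them next to the real data. The seed and the names `panels` and `sim_grid` are arbitrary choices.

```r
library(ggplot2)
library(openintro)

data("fastfood", package = "openintro")
dairy_queen <- subset(fastfood, restaurant == "Dairy Queen")

set.seed(2023)  # arbitrary seed, only for reproducibility
x <- dairy_queen$cal_fat
n <- length(x)

# Eight simulated normal samples matching the data's size, mean, and sd
sims <- lapply(1:8, function(i) {
  data.frame(panel = paste("sim", i),
             value = rnorm(n, mean = mean(x), sd = sd(x)))
})

# Stack the real data on top of the simulated samples, one label per panel
panels <- rbind(data.frame(panel = "data", value = x), do.call(rbind, sims))

sim_grid <- ggplot(panels, aes(sample = value)) +
  stat_qq() +
  facet_wrap(~panel)
sim_grid
```

If the "data" panel is hard to tell apart from the simulated panels, the normal model is a plausible description of the variable.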

Exercise 5: Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

sim_norm <- rnorm(n = nrow(mcdonalds), mean = mcdmean, sd = mcdsd)
# black: real data; blue: simulated normal sample
ggplot(data = mcdonalds, aes(sample = cal_fat)) +
  geom_line(stat = "qq") +
  geom_line(aes(sample = sim_norm), stat = "qq", col = "blue")

qqnormsim(sample = cal_fat, data = mcdonalds)

** Ans 5: In the plots above, the McDonald's calories-from-fat points do not follow the diagonal line, i.e., the trend is not close to linear, and the plot does not match the simulated probability plots as closely as the Dairy Queen data did. Hence, the McDonald's calories from fat do not appear to be normally distributed.

Exercise 6: Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

** Ans 6: The following two questions are used to answer this exercise.

** Question 1: What is the probability that a randomly selected Subway item has fewer than 800 calories? We do not know in advance whether Subway's calories are normally distributed.

subway <- fastfood %>%
  filter(restaurant == "Subway")

swmean <- mean(subway$calories)
swsd <- sd(subway$calories)

pnorm(q = 800, mean = swmean, sd = swsd)
## [1] 0.8536674

So, the theoretical probability from the normal model, computed with pnorm, is about 85.37%.

subway %>% 
  filter(calories < 800) %>%
  summarise(percent = n() / nrow(subway))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1   0.833

The empirical probability is 83.33%, which is very close to the theoretical value computed with pnorm above. Hence, the Subway calories may be approximately normally distributed.
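The two calculations can be wrapped in one small function so that the same comparison is easy to repeat for other restaurants or cutoffs. This is a sketch only: `prob_below` is a hypothetical helper written for illustration, not a function from `openintro`.

```r
library(openintro)

data("fastfood", package = "openintro")

# Hypothetical helper: P(X < q) under a fitted normal model and as a sample proportion
prob_below <- function(x, q) {
  c(
    theoretical = pnorm(q, mean = mean(x), sd = sd(x)),
    empirical   = mean(x < q)   # share of observations strictly below q
  )
}

subway_cal <- fastfood$calories[fastfood$restaurant == "Subway"]
sw_probs <- prob_below(subway_cal, q = 800)
sw_probs
```

The two entries of `sw_probs` should reproduce the theoretical and empirical values reported above for Subway.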

** Question 2: What is the probability that a randomly selected Arbys item has more than 500 calories? We do not know in advance whether the Arbys calories are normally distributed.

arbys <- fastfood %>%
  filter(restaurant == "Arbys")
armean <- mean(arbys$calories)
arsd <- sd(arbys$calories)
1 - pnorm(q = 500, mean = armean, sd = arsd)
## [1] 0.5618231

So, the theoretical probability from the normal model, computed with pnorm, is about 56.18%.

arbys %>%
  filter(calories > 500) %>%
  summarise(percent = n() / nrow(arbys))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1     0.6

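The empirical probability is 60%, which is reasonably close to the theoretical value computed with pnorm above. Hence, the Arbys calories may also be approximately normally distributed. Of the two questions, Subway shows the closer agreement between the two methods: its gap is about 2 percentage points (85.37% vs. 83.33%), compared with about 4 percentage points for Arbys (56.18% vs. 60%).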
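To decide which question shows the closer agreement between the two methods, the gap between the theoretical and empirical probabilities can be computed directly from the four values reported above. The vector names below are chosen for illustration.

```r
# Probabilities reported above (theoretical from pnorm, empirical from the data)
p_subway <- c(theoretical = 0.8536674, empirical = 0.8333)
p_arbys  <- c(theoretical = 0.5618231, empirical = 0.6)

# Absolute gap between the two methods, in percentage points
gap <- c(
  Subway = abs(p_subway[["theoretical"]] - p_subway[["empirical"]]),
  Arbys  = abs(p_arbys[["theoretical"]]  - p_arbys[["empirical"]])
) * 100

round(gap, 2)          # Subway 2.04, Arbys 3.82
names(which.min(gap))  # "Subway"
```

The smaller gap for Subway means the normal model approximates the Subway calories better than the Arbys calories for these particular cutoffs.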

Exercise 7: Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

ggplot(data = fastfood, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  facet_wrap(~restaurant)

ggplot(data = fastfood) +
  geom_histogram(aes(x = sodium)) +
  facet_wrap(~restaurant)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

** Ans 7: From the probability plots and histograms above, the sodium distributions for Arbys, Burger King, and Taco Bell appear closest to normal. If I had to pick one, it would be Arbys, since its QQ plot is the closest to a straight line.
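Judging which QQ plot is "most linear" by eye is subjective. One possible numeric supplement, which is not part of the lab and is used here only as an illustration, is the correlation between the sample quantiles and the theoretical normal quantiles for each restaurant. The helper `qq_cor` is hypothetical.

```r
library(openintro)

data("fastfood", package = "openintro")

# Correlation between sample quantiles and theoretical normal quantiles;
# values nearer 1 mean the QQ plot is closer to a straight line
qq_cor <- function(x) {
  x <- x[!is.na(x)]
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

# One value per restaurant, sorted from most to least linear
sodium_qq <- sort(tapply(fastfood$sodium, fastfood$restaurant, qq_cor),
                  decreasing = TRUE)
round(sodium_qq, 3)
```

The restaurants at the top of this ranking have the most linear QQ plots. The ranking should be read alongside the plots rather than instead of them, since a single correlation can hide where the departures from the line occur.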

Exercise 8: Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

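** Ans 8: Some of the normal probability plots for sodium have a stepwise pattern because sodium behaves like a discrete variable: the values appear to be reported in rounded amounts (the sodium values printed above are all multiples of 10), so several items share exactly the same value. Tied values plot at the same height, which produces the flat steps.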
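The effect of rounding on a normal probability plot can be reproduced with simulated data. The sketch below uses made-up values (not the sodium data), with an arbitrary seed, mean, standard deviation, and rounding step chosen only to make the steps easy to see.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Hypothetical values: a smooth normal sample and the same sample
# rounded to the nearest 100, mimicking coarsely reported figures
smooth  <- rnorm(50, mean = 1200, sd = 300)
rounded <- round(smooth, -2)

# Rounding creates ties, so far fewer distinct values remain
c(distinct_smooth  = length(unique(smooth)),
  distinct_rounded = length(unique(rounded)))

# Side-by-side QQ plots: tied values show up as flat steps on the right
par(mfrow = c(1, 2))
qqnorm(smooth,  main = "Unrounded")
qqnorm(rounded, main = "Rounded to nearest 100")
```

The rounded sample comes from the same normal distribution, so the steps reflect how the values were recorded rather than a departure from normality.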

Exercise 9: As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

** Ans 9: To answer this question, consider the normal probability plot and the density histogram of McDonald's total carbohydrates below:

mcdmean <- mean(mcdonalds$total_carb)
mcdsd   <- sd(mcdonalds$total_carb)

sim_norm <- rnorm(n = nrow(mcdonalds), mean = mcdmean, sd = mcdsd)
# black: real data; blue: simulated normal sample
ggplot(data = mcdonalds, aes(sample = total_carb)) +
  geom_line(stat = "qq") +
  geom_line(aes(sample = sim_norm), stat = "qq", col = "blue")

ggplot(data = mcdonalds, aes(x = total_carb)) +
        geom_blank() +
        geom_histogram(aes(y = after_stat(density))) +
        stat_function(fun = dnorm, args = list(mean = mcdmean, sd = mcdsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In the normal probability plot above, the McDonald's total carbohydrate data do not follow a straight diagonal line, which indicates skewness. The histogram with the normal density curve overlaid makes the shape clearer: the data do not follow the normal curve and have a long right tail. The variable is right skewed.
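The direction of skew can also be checked numerically. The following sketch compares the mean with the median and computes a moment-based sample skewness; `sample_skewness` is a hypothetical helper written here for illustration, not a function from the packages used above.

```r
library(openintro)

data("fastfood", package = "openintro")
carb <- fastfood$total_carb[fastfood$restaurant == "Mcdonalds"]

# Moment-based sample skewness: positive values indicate a longer right tail,
# negative values a longer left tail, and values near 0 rough symmetry
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

c(mean     = mean(carb),
  median   = median(carb),
  skewness = sample_skewness(carb))
```

A mean above the median together with a positive skewness value would agree with the right-skewed reading of the plots.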