library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(ggplot2)
data("fastfood", package='openintro')
head(fastfood)
## # A tibble: 6 × 17
## restaur…¹ item calor…² cal_fat total…³ sat_fat trans…⁴ chole…⁵ sodium total…⁶
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mcdonalds Arti… 380 60 7 2 0 95 1110 44
## 2 Mcdonalds Sing… 840 410 45 17 1.5 130 1580 62
## 3 Mcdonalds Doub… 1130 600 67 27 3 220 1920 63
## 4 Mcdonalds Gril… 750 280 31 10 0.5 155 1940 62
## 5 Mcdonalds Cris… 920 410 45 12 0.5 120 1980 81
## 6 Mcdonalds Big … 540 250 28 10 1 80 950 46
## # … with 7 more variables: fiber <dbl>, sugar <dbl>, protein <dbl>,
## # vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>, and abbreviated
## # variable names ¹restaurant, ²calories, ³total_fat, ⁴trans_fat,
## # ⁵cholesterol, ⁶total_carb
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")
**Ans 1:** To answer Exercise 1, histograms of calories from fat are created for Dairy Queen and McDonald’s:
hist(dairy_queen$cal_fat)
hist(mcdonalds$cal_fat)
mean(dairy_queen$cal_fat)
## [1] 260.4762
median(mcdonalds$cal_fat)
## [1] 240
Making comparison:
From the two histograms above, the Dairy Queen distribution of calories from fat is close to symmetric (slightly right skewed), whereas the McDonald’s distribution is clearly right skewed. The mean is therefore a reasonable measure of center for the Dairy Queen data, and the median is a better measure of center for the McDonald’s data. The two centers are close: a mean of about 260 for Dairy Queen and a median of 240 for McDonald’s. The spread of the Dairy Queen data is also smaller than that of the McDonald’s data.
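The comparison of center and spread above can also be checked numerically. The sketch below is one way to do that, not part of the original lab: `center_spread()` is a hypothetical helper, and `example_cal_fat` holds made-up values chosen only to show a right-skewed case where the mean sits above the median.

```r
# Hypothetical helper (not from the lab): center and spread of a numeric vector.
center_spread <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x))
}

# Illustrative values only, NOT taken from the fastfood data.
# The single large value pulls the mean above the median (right skew).
example_cal_fat <- c(100, 150, 200, 220, 260, 300, 700)
stats <- center_spread(example_cal_fat)
print(round(stats, 1))

# With the lab objects loaded, the same helper could be applied as:
# center_spread(dairy_queen$cal_fat)
# center_spread(mcdonalds$cal_fat)
```

When the mean and median are close, either works as a measure of center. When the mean is noticeably larger, the median and IQR are usually the safer summary.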
dqmean <- mean(dairy_queen$cal_fat)
dqsd <- sd(dairy_queen$cal_fat)
plot1 <- ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
plot1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
mcdmean <- mean(mcdonalds$cal_fat)
mcdsd <- sd(mcdonalds$cal_fat)
plot2 <- ggplot(data = mcdonalds, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = c(mean = mcdmean, sd = mcdsd), col = "tomato")
plot2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
**Ans 2:** From the two plots above, the Dairy Queen data (plot1) follow a nearly symmetric, approximately normal distribution. The McDonald’s data (plot2), on the other hand, follow a right-skewed distribution.
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_line(stat = "qq")
ggplot(data = dairy_queen, aes(sample = cal_fat)) +
  geom_line(stat = "qq") +
  geom_blank() +
  geom_line(aes(sample = sim_norm), stat = "qq", col = "red")
**Ans 3:** In the normal probability plots above (in the two-line plot, the red line is the simulated data), almost all of the simulated points fall on a straight line. There are some deviations in the tails, but the simulated data are nearly normally distributed. The probability plot of the real data (black line) shows similar characteristics.
qqnormsim(sample = cal_fat, data = dairy_queen)
**Ans 4:** Comparing the plot of the real data with the 8 simulated plots, the Dairy Queen calories-from-fat data appear to follow a nearly normal distribution.
**Ans 6:** This exercise is answered by way of the following two questions.
**Question 1:** What is the probability of selecting a food item under 800 calories from Subway? We don’t know whether the Subway calories follow a normal distribution.
subway <- fastfood %>%
  filter(restaurant == "Subway")
swmean <- mean(subway$calories)
swsd <- sd(subway$calories)
pnorm(q = 800, mean = swmean, sd = swsd)
## [1] 0.8536674
So, the theoretical probability under a normal distribution, computed with pnorm, is about 85.37%.
subway %>%
  filter(calories < 800) %>%
  summarise(percent = n() / nrow(subway))
## # A tibble: 1 × 1
## percent
## <dbl>
## 1 0.833
The empirical probability is 83.3%, which is close to the theoretical value computed with pnorm above. Hence, the Subway calories may be approximately normally distributed.
**Question 2:** What is the probability of selecting a food item above 500 calories from Arbys? We don’t know whether the Arbys calories follow a normal distribution.
arbys <- fastfood %>%
  filter(restaurant == "Arbys")
armean <- mean(arbys$calories)
arsd <- sd(arbys$calories)
1 - pnorm(q = 500, mean = armean, sd = arsd)
## [1] 0.5618231
So, the theoretical probability under a normal distribution, computed with pnorm, is about 56.2%.
arbys %>%
  filter(calories > 500) %>%
  summarise(percent = n() / nrow(arbys))
## # A tibble: 1 × 1
## percent
## <dbl>
## 1 0.6
The empirical probability is 60%, about 4 percentage points above the theoretical value computed with pnorm above. Hence, the Arbys calories may be approximately normally distributed, though the agreement is weaker than for Subway.
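The theoretical-versus-empirical comparison used in both questions can be wrapped in one small function. This is a minimal sketch rather than the lab's method: `compare_prob()` is a hypothetical helper, and `example_calories` holds made-up values used only to show the call.

```r
# Hypothetical helper (not from the lab): compare the probability from a
# fitted normal model with the observed proportion in the data.
compare_prob <- function(x, cutoff, lower = TRUE) {
  # Normal-model probability using the sample mean and standard deviation.
  theoretical <- pnorm(cutoff, mean = mean(x), sd = sd(x), lower.tail = lower)
  # Observed share of values below (or above) the cutoff.
  empirical <- if (lower) mean(x < cutoff) else mean(x > cutoff)
  c(theoretical = theoretical,
    empirical = empirical,
    difference = theoretical - empirical)
}

# Illustrative values only, NOT taken from the fastfood data.
example_calories <- c(300, 400, 450, 500, 550, 600, 700, 900)
result <- compare_prob(example_calories, cutoff = 800)
print(round(result, 3))

# With the lab objects loaded, the two questions would correspond to:
# compare_prob(subway$calories, cutoff = 800)
# compare_prob(arbys$calories, cutoff = 500, lower = FALSE)
```

A small difference is consistent with approximate normality but does not prove it, since the comparison checks the fit at only one cutoff.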
ggplot(data = fastfood, aes(sample = sodium)) + geom_line(stat = "qq") + facet_wrap(~restaurant)
ggplot(data = fastfood) + geom_histogram(aes(x = sodium)) + facet_wrap(~restaurant)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
**Ans 7:** From the normal probability plots above, the sodium distributions for Arbys, Burger King, and Taco Bell appear closest to normal. If I had to pick one, it would be Arbys, as its Q-Q plot is the closest to a straight line.
**Ans 8:** Some of the normal probability plots for sodium show a stepwise pattern because sodium takes a limited set of discrete values, so repeated values appear as flat steps in the plot.
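One way to probe the discreteness explanation is to count how many values are repeated and whether they sit on a coarse grid. The sketch below is an assumption-laden illustration: `tie_summary()` is a hypothetical helper, the multiple-of-10 check is suggested only by the sodium values visible in the first six rows printed earlier, and `example_sodium` is made up.

```r
# Hypothetical helper (not from the lab): summarise ties in a numeric vector.
tie_summary <- function(x) {
  c(n = length(x),
    n_distinct = length(unique(x)),
    # Share of observations whose value occurs more than once.
    prop_tied = mean(duplicated(x) | duplicated(x, fromLast = TRUE)),
    # Share of observations that are exact multiples of 10.
    prop_multiple_of_10 = mean(x %% 10 == 0))
}

# Illustrative values only, NOT the real sodium column.
example_sodium <- c(950, 1110, 1110, 1580, 1580, 1580, 1920, 1940)
ties <- tie_summary(example_sodium)
print(ties)

# With the lab data loaded, the same check per restaurant could be:
# tapply(fastfood$sodium, fastfood$restaurant, function(x) tie_summary(x)[["prop_tied"]])
```

A high share of tied or grid-aligned values would support the stepwise reading of the plots.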
**Ans 9:** To answer this question, consider the normal probability plot and the density histogram of total carbohydrates for McDonald’s below:
mcdmean <- mean(mcdonalds$total_carb)
mcdsd <- sd(mcdonalds$total_carb)
sim_norm <- rnorm(n = nrow(mcdonalds), mean = mcdmean, sd = mcdsd)
ggplot(data = mcdonalds, aes(sample = total_carb)) +
  geom_line(stat = "qq") +
  geom_blank() +
  geom_line(aes(sample = sim_norm), stat = "qq", col = "blue")
ggplot(data = mcdonalds, aes(x = total_carb)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = c(mean = mcdmean, sd = mcdsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In the normal probability plot above, the McDonald’s total carbohydrate data do not follow a straight diagonal line, which indicates skewness. To see the shape more clearly, the normal curve overlaid on the density histogram is helpful. The overlay shows that the McDonald’s total carbohydrate data do not follow a normal distribution; they are right skewed.
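The visual judgment of skewness can be backed by a number. The sketch below computes a moment-based skewness coefficient, which the lab itself does not use: `sample_skewness()` is a hypothetical helper and `example_carbs` holds made-up values. A positive result points to a right skew, and a value near zero to a roughly symmetric shape.

```r
# Hypothetical helper (not from the lab): moment-based sample skewness,
# i.e. the third central moment divided by the variance to the power 3/2.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Illustrative values only, NOT the real total_carb column.
# One large value creates a long right tail.
example_carbs <- c(20, 30, 35, 40, 45, 50, 150)
skew <- sample_skewness(example_carbs)
print(skew)

# With the lab objects loaded, the same call would be:
# sample_skewness(mcdonalds$total_carb)
```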