The McDonald’s distribution is centered around 250 (median 240, mean about 286), is right-skewed, and has a wide spread. The Dairy Queen distribution is centered somewhere between 220 and 260, is slightly more symmetric than the McDonald’s distribution, and has less spread. Its center is not obvious from the graph, but the mean was about 260 and the median was 220.
library(tidyverse)  # provides %>%, filter() and ggplot()
data("fastfood", package = "openintro")
# One data frame per restaurant
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")
# Histograms of calories from fat
ggplot(mcdonalds, aes(x = cal_fat)) +
  geom_histogram(bins = 35)
ggplot(dairy_queen, aes(x = cal_fat)) +
  geom_histogram(bins = 35)
print("McDonald's")
## [1] "McDonald's"
## [1] 285.614
## [1] 240
## [1] 220.8993
## [1] 48796.49
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.0 160.0 240.0 285.6 320.0 1270.0
## [1] "Dairy Queen"
## [1] 260.4762
## [1] 220
## [1] 156.4851
## [1] 24487.57
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 160.0 220.0 260.5 310.0 670.0
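The code that printed these summaries is not shown in this write-up. A minimal sketch of how they could be reproduced, assuming the output order is mean, median, standard deviation, variance, then `summary()`. The helper name `describe_cal_fat` and the toy vector are hypothetical and only for illustration:

```r
# Hypothetical helper: returns the same five summaries, in the same order,
# as the printed output (mean, median, sd, variance, summary()).
describe_cal_fat <- function(x) {
  list(
    mean    = mean(x),
    median  = median(x),
    sd      = sd(x),
    var     = var(x),
    summary = summary(x)
  )
}

# Toy right-skewed values standing in for a cal_fat column (not real data)
toy <- c(50, 160, 240, 320, 1270)
stats <- describe_cal_fat(toy)

# A mean well above the median is a numeric hint of right skew
stats$mean > stats$median
```

With the real data this would be called as `describe_cal_fat(mcdonalds$cal_fat)` and `describe_cal_fat(dairy_queen$cal_fat)`.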
The data does not appear to follow a normal distribution. The shape of the curve is similar to the shape of the bars, but the height of the curve is much lower than the height of the bars. I would not claim that this is nearly a normal distribution.
dqmean <- mean(dairy_queen$cal_fat)
dqsd <- sd(dairy_queen$cal_fat)
ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = list(mean = dqmean, sd = dqsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Not all of the points fall on the line as seen in the first Q-Q plot. Most of the middle values fall on the line, but the ones near both ends seem to stray.
This plot is more normally distributed than the plot for the real data. The second Q-Q plot below shows many values far from the line on the right side of the plot, indicating it is not normally distributed.
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
ggplot(mapping = aes(sample = sim_norm)) +
  geom_line(stat = "qq")
Yes, it looks pretty similar to the plots created for the simulated data. I would say that this does provide evidence that the calories are nearly normal, because the graph of the original data is very similar to all the other plots, and most of the data points in each graph fall along the line. It is not exactly normal, but it is pretty close.
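The comparison against several simulated samples can be sketched in base R. This is an illustration under assumptions, not the code used for the plots discussed here: the helper `sim_qq_grid` is hypothetical, and `n = 42`, `mean = 260`, `sd = 156` are placeholder values (the last two rounded from the Dairy Queen summaries above).

```r
# Hypothetical helper: draw `reps` normal samples of size n and show one
# normal Q-Q plot per sample, so the real plot can be judged against them.
sim_qq_grid <- function(n, mean, sd, reps = 8) {
  sims <- replicate(reps, rnorm(n, mean = mean, sd = sd))  # n x reps matrix
  old_par <- par(mfrow = c(2, ceiling(reps / 2)))          # grid of panels
  on.exit(par(old_par))                                    # restore layout
  for (j in seq_len(reps)) {
    qqnorm(sims[, j], main = paste("Simulation", j))
    qqline(sims[, j])
  }
  invisible(sims)
}

set.seed(1)  # placeholder seed so the sketch is reproducible
sims <- sim_qq_grid(n = 42, mean = 260, sd = 156)
```

If the real Q-Q plot strays from the line no more than these simulated panels do, the deviation is within what sampling variation alone would produce.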
If you were to randomly order an item from the Chick Fil-A menu, what is the probability of choosing an item with a cholesterol greater than 100 mg?
Theoretical normal distribution probability: 0.3291412 Empirical distribution probability: 0.1111111 Difference (absolute value): 0.2180301
If you were to randomly order an item from Subway, what is the probability of choosing an item that contained total carbs less than 20 g?
Theoretical normal distribution probability: 0.1486697 Empirical distribution probability: 0.2291667 Difference (absolute value): 0.08049694
The two probabilities were closer for the second question than for the first. This would typically indicate that the Subway total-carb data are closer to a normal distribution than the Chick Fil-A cholesterol data. However, the Q-Q plots show that the Chick Fil-A cholesterol data had an outlier that affected normality, and the Q-Q plot for Subway did not look normal either.
chickFilA <- fastfood |> filter(restaurant == "Chick Fil-A")
chmean <- mean(chickFilA$cholesterol)
chsd <- sd(chickFilA$cholesterol)
(chProb <- 1 - pnorm(q = 100, mean = chmean, sd = chsd))
## [1] 0.3291412
(chActualProb <- chickFilA |>
  filter(cholesterol > 100) |>
  summarise(percent = n() / nrow(chickFilA)))
## # A tibble: 1 × 1
## percent
## <dbl>
## 1 0.111
subway <- fastfood |> filter(restaurant == "Subway")
smean <- mean(subway$total_carb)
ssd <- sd(subway$total_carb)
(subProb <- pnorm(q = 20, mean = smean, sd = ssd))
## [1] 0.1486697
## # A tibble: 1 × 1
## percent
## <dbl>
## 1 0.229
## percent
## 1 0.2180301
## percent
## 1 -0.08049694
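The Subway empirical proportion and the two differences printed above come from code that is not shown. A minimal base-R sketch of the same comparison, assuming the empirical probability is the share of items beyond the cutoff. The helper `compare_probs` and the toy vector are hypothetical:

```r
# Hypothetical helper: compares P(X < q) (or P(X > q)) under a normal model
# fitted with the sample mean and sd against the observed proportion.
compare_probs <- function(x, q, lower_tail = TRUE) {
  theoretical <- pnorm(q, mean = mean(x), sd = sd(x), lower.tail = lower_tail)
  empirical   <- if (lower_tail) mean(x < q) else mean(x > q)
  c(theoretical = theoretical,
    empirical   = empirical,
    abs_diff    = abs(theoretical - empirical))
}

# Toy values standing in for a nutrient column (not real data)
toy <- c(10, 15, 30, 40, 45, 50, 55, 60)
res_lower <- compare_probs(toy, q = 20)                      # P(X < 20)
res_upper <- compare_probs(toy, q = 20, lower_tail = FALSE)  # P(X > 20)
res_lower
```

With the real data the two questions would correspond to `compare_probs(chickFilA$cholesterol, 100, lower_tail = FALSE)` and `compare_probs(subway$total_carb, 20)`.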
Arby’s had the closest-to-normal distribution for sodium. I checked this with the Shapiro-Wilk test. Arby’s and Burger King were the only restaurants with p-values greater than 0.05 (0.1985 and 0.1331), so normality is not rejected for either. Since Arby’s had the greater p-value of the two, I decided it was the most normal.
## [1] "Mcdonalds" "Chick Fil-A" "Sonic" "Arbys" "Burger King"
## [6] "Dairy Queen" "Subway" "Taco Bell"
sonic <- fastfood |> filter(restaurant == "Sonic")
arbys <- fastfood |> filter(restaurant == "Arbys")
bk <- fastfood |> filter(restaurant == "Burger King")
tacoBell <- fastfood |> filter(restaurant == "Taco Bell")
qqnorm(mcdonalds$sodium)
qqline(mcdonalds$sodium)
##
## Shapiro-Wilk normality test
##
## data: mcdonalds$sodium
## W = 0.76922, p-value = 4.458e-08
##
## Shapiro-Wilk normality test
##
## data: chickFilA$sodium
## W = 0.86663, p-value = 0.002503
##
## Shapiro-Wilk normality test
##
## data: sonic$sodium
## W = 0.82286, p-value = 1.784e-06
##
## Shapiro-Wilk normality test
##
## data: arbys$sodium
## W = 0.97073, p-value = 0.1985
##
## Shapiro-Wilk normality test
##
## data: bk$sodium
## W = 0.97291, p-value = 0.1331
##
## Shapiro-Wilk normality test
##
## data: dairy_queen$sodium
## W = 0.84504, p-value = 4.715e-05
##
## Shapiro-Wilk normality test
##
## data: subway$sodium
## W = 0.92175, p-value = 2.515e-05
##
## Shapiro-Wilk normality test
##
## data: tacoBell$sodium
## W = 0.95501, p-value = 0.000699
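The eight tests above can also be run in a single pass. A minimal sketch, assuming a data frame with `restaurant` and `sodium` columns like `fastfood`. The simulated data frame `toy` below is a stand-in, not real data:

```r
# Stand-in data with the two columns used from `fastfood` (simulated values)
set.seed(42)
toy <- data.frame(
  restaurant = rep(c("A", "B"), each = 40),
  sodium     = c(rnorm(40, mean = 1200, sd = 300),  # roughly normal group
                 rexp(40, rate = 1 / 1200))         # right-skewed group
)

# One Shapiro-Wilk p-value per restaurant
p_values <- sapply(split(toy$sodium, toy$restaurant),
                   function(x) shapiro.test(x)$p.value)

# Largest p-value first: least evidence against normality
sort(p_values, decreasing = TRUE)
```

With the real data, `split(fastfood$sodium, fastfood$restaurant)` would replace the toy split. A p-value above 0.05 only means the test fails to reject normality. It does not prove the data are normal, so the Q-Q plots are still worth checking.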
This might be the case because of the different categories of food. Sandwiches probably have one range of sodium, fries probably have a different range, salads probably have yet another, and so on. Since different categories of food likely have sodium ranges that don’t overlap much, the pattern appears stepwise.
My guess was right-skewed because there seemed to be a tail to the right and above the line in the normal probability plot. The graph confirms this because it is right-skewed with the tail to the right.