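library(tidyverse)  # dplyr verbs, %>% and ggplot2, used throughout
library(openintro)  # provides the fastfood data and qqnormsim()
data("fastfood", package = "openintro")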
head(fastfood)
## # A tibble: 6 × 17
##   restaur…¹ item  calor…² cal_fat total…³ sat_fat trans…⁴ chole…⁵ sodium total…⁶
##   <chr>     <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
## 1 Mcdonalds Arti…     380      60       7       2     0        95   1110      44
## 2 Mcdonalds Sing…     840     410      45      17     1.5     130   1580      62
## 3 Mcdonalds Doub…    1130     600      67      27     3       220   1920      63
## 4 Mcdonalds Gril…     750     280      31      10     0.5     155   1940      62
## 5 Mcdonalds Cris…     920     410      45      12     0.5     120   1980      81
## 6 Mcdonalds Big …     540     250      28      10     1        80    950      46
## # … with 7 more variables: fiber <dbl>, sugar <dbl>, protein <dbl>,
## #   vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>, and abbreviated
## #   variable names ¹restaurant, ²calories, ³total_fat, ⁴trans_fat,
## #   ⁵cholesterol, ⁶total_carb
#print(n=150,fastfood)
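# Subset the two restaurants compared below, then print both subsets
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")
mcdonalds
dairy_queen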
## # A tibble: 57 × 17
##    restau…¹ item  calor…² cal_fat total…³ sat_fat trans…⁴ chole…⁵ sodium total…⁶
##    <chr>    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
##  1 Mcdonal… Arti…     380      60       7       2     0        95   1110      44
##  2 Mcdonal… Sing…     840     410      45      17     1.5     130   1580      62
##  3 Mcdonal… Doub…    1130     600      67      27     3       220   1920      63
##  4 Mcdonal… Gril…     750     280      31      10     0.5     155   1940      62
##  5 Mcdonal… Cris…     920     410      45      12     0.5     120   1980      81
##  6 Mcdonal… Big …     540     250      28      10     1        80    950      46
##  7 Mcdonal… Chee…     300     100      12       5     0.5      40    680      33
##  8 Mcdonal… Clas…     510     210      24       4     0        65   1040      49
##  9 Mcdonal… Doub…     430     190      21      11     1        85   1040      35
## 10 Mcdonal… Doub…     770     400      45      21     2.5     175   1290      42
## # … with 47 more rows, 7 more variables: fiber <dbl>, sugar <dbl>,
## #   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>, and
## #   abbreviated variable names ¹restaurant, ²calories, ³total_fat, ⁴trans_fat,
## #   ⁵cholesterol, ⁶total_carb
## # A tibble: 42 × 17
##    restau…¹ item  calor…² cal_fat total…³ sat_fat trans…⁴ chole…⁵ sodium total…⁶
##    <chr>    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
##  1 Dairy Q… 1/2 …    1000     660      74      26       2     170   1610      40
##  2 Dairy Q… 1/2 …     800     460      51      20       2     135   1280      44
##  3 Dairy Q… 1/4 …     630     330      37      13       1      95   1250      44
##  4 Dairy Q… 1/4 …     540     270      30      11       1      70   1020      44
##  5 Dairy Q… 1/4 …     570     310      35      11       1      75    820      39
##  6 Dairy Q… Orig…     400     160      18       9       1      65    930      34
##  7 Dairy Q… Orig…     630     310      34      18       2     125   1240      34
##  8 Dairy Q… 4 Pi…    1030     480      53       9       1      80   2780     105
##  9 Dairy Q… 6 Pi…    1260     590      66      11       1     120   3500     121
## 10 Dairy Q… Baco…     420     240      26      11       1      60   1140      26
## # … with 32 more rows, 7 more variables: fiber <dbl>, sugar <dbl>,
## #   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>, and
## #   abbreviated variable names ¹restaurant, ²calories, ³total_fat, ⁴trans_fat,
## #   ⁵cholesterol, ⁶total_carb
Exercise 1: Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?
hist(mcdonalds$cal_fat)
hist(dairy_queen$cal_fat)
summary(mcdonalds$cal_fat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    50.0   160.0   240.0   285.6   320.0  1270.0
summary(dairy_queen$cal_fat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0   160.0   220.0   260.5   310.0   670.0
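Both distributions are unimodal and right skewed, with most items concentrated at the lower end and a tail extending to the right. The centers are similar: the median is 240 calories from fat for McDonald's and 220 for Dairy Queen, and in both cases the mean (285.6 and 260.5) exceeds the median, which is consistent with right skew. McDonald's is more spread out, with a maximum of 1270 calories from fat compared with 670 for Dairy Queen, although the middle 50% of items covers a similar range at both restaurants (160 to 320 versus 160 to 310). The Dairy Queen histogram looks more irregular. Note that the x-axes differ: the tick marks are 200 calories apart for McDonald's and 100 apart for Dairy Queen, so the two plots are not drawn on the same scale. Apart from the right tails, both distributions are roughly bell-shaped.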
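As a quick numeric check on this comparison, the summary values printed above can be used to compute the interquartile range and the gap between mean and median for each restaurant. This is only a sketch: the numbers are copied by hand from the summary() output, and the names mcd, dq, iqr and mean_minus_median are introduced here for illustration.

```r
# Summary values copied from the summary() output above (rounded as printed)
mcd <- c(min = 50, q1 = 160, median = 240, mean = 285.6, q3 = 320, max = 1270)
dq  <- c(min = 0,  q1 = 160, median = 220, mean = 260.5, q3 = 310, max = 670)

# Interquartile range: spread of the middle 50% of menu items
iqr <- c(mcdonalds   = unname(mcd["q3"] - mcd["q1"]),
         dairy_queen = unname(dq["q3"] - dq["q1"]))

# A mean above the median is a common sign of right skew
mean_minus_median <- c(mcdonalds   = unname(mcd["mean"] - mcd["median"]),
                       dairy_queen = unname(dq["mean"] - dq["median"]))

print(iqr)
print(mean_minus_median)
```

Both gaps are positive and the interquartile ranges are close (160 and 150), so the larger overall spread for McDonald's comes mainly from its longer right tail.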
dqmean <- mean(dairy_queen$cal_fat)
dqsd <- sd(dairy_queen$cal_fat)
ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  # after_stat(density) is the current spelling of the older ..density..
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Exercise 2: Based on this plot, does it appear that the data follow a nearly normal distribution?
Yes. With the normal curve drawn in red over the histogram, the data appear to follow a nearly normal distribution, with almost no visible left or right skew.
Exercise 3: Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)
According to the National Institute of Standards and Technology (NIST), "The normal probability plot (Chambers et al., 1983) is a graphical technique for assessing whether or not a data set is approximately normally distributed."
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
qqnorm(sim_norm)
qqline(sim_norm)
No. About three points deviate noticeably from the line, so not every point falls on it, but the majority of the points fall on or very close to the line.
qqnormsim(sample = cal_fat, data = dairy_queen)
Exercise 4: Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?
Yes. The normal probability plot for the real data looks similar to the plots for the simulated data shown directly above.
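The idea behind comparing the data with simulated normal samples can be sketched in base R. The function below is not the openintro implementation of qqnormsim(). It is a hypothetical helper, qq_with_sims, that draws the normal probability plot for a vector next to plots for eight samples simulated from a normal distribution with the same mean and standard deviation. The input used here is simulated stand-in data, not the fastfood values.

```r
# Hypothetical helper (not part of openintro): QQ plot of x plus n_sim simulated normals
qq_with_sims <- function(x, n_sim = 8) {
  op <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  on.exit(par(op))          # restore the plotting layout afterwards
  qqnorm(x, main = "data")
  qqline(x)
  for (i in seq_len(n_sim)) {
    # Simulated sample with the same size, mean and sd as the data
    sim <- rnorm(length(x), mean = mean(x), sd = sd(x))
    qqnorm(sim, main = paste("sim", i))
    qqline(sim)
  }
  invisible(n_sim + 1)      # number of panels drawn
}

set.seed(42)
stand_in <- rexp(42, rate = 1 / 260)  # right-skewed stand-in with 42 values
panels <- qq_with_sims(stand_in)
```

If the "data" panel could be swapped with any of the simulated panels without standing out, the normal model is plausible. If it bends away from the line more than the simulated panels do, that is evidence against it.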
Exercise 5: Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.
qqnormsim(sample = cal_fat, data = mcdonalds)
Yes. Based on the normal probability plots above, the calories from fat on the McDonald's menu appear to come from a nearly normal distribution.
Exercise 6: Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?
What is the probability that a randomly chosen McDonald's product has more than 800 calories from fat?
McDmean <- mean(mcdonalds$cal_fat)
McDsd <- sd(mcdonalds$cal_fat)
print(McDmean)
## [1] 285.614
print(McDsd)
## [1] 220.8993
1 - pnorm(q = 800, mean = McDmean, sd = McDsd)
## [1] 0.009940144
mcdonalds %>%
  filter(cal_fat > 800) %>%
  summarise(percent = n() / nrow(mcdonalds))
## # A tibble: 1 × 1
## percent
## <dbl>
## 1 0.0351
ggplot(data = mcdonalds, aes(x = cal_fat)) +
  geom_blank() +
  # after_stat(density) is the current spelling of the older ..density..
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = c(mean = McDmean, sd = McDsd), col = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
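The two McDonald's answers above can be reproduced by hand from the printed mean and standard deviation. The sketch below standardizes the 800-calorie cutoff to a z-score and compares the normal-model tail probability with the empirical proportion. The variable names are introduced here for illustration, and the count of 2 items is inferred from 0.0351 × 57 rather than read from the data.

```r
# Values printed above for McDonald's calories from fat
mcd_mean <- 285.614
mcd_sd   <- 220.8993
n_items  <- 57            # rows in the mcdonalds tibble

# Theoretical: standardize the cutoff, then take the upper tail of N(0, 1)
z <- (800 - mcd_mean) / mcd_sd
p_theoretical <- pnorm(z, lower.tail = FALSE)

# Empirical: 0.0351 corresponds to about 2 of 57 items (inferred, see lead-in)
p_empirical <- 2 / n_items

print(round(z, 2))
print(signif(p_theoretical, 3))
print(round(p_empirical, 4))
print(p_empirical / p_theoretical)   # ratio of empirical to theoretical
```

The empirical proportion is roughly three and a half times the theoretical one, which suggests that the normal model understates the right tail for McDonald's.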
What is the probability that a randomly chosen product from Chick Fil-A has less than 400 calories from fat?
chickfilA <- fastfood %>%
  filter(restaurant == "Chick Fil-A")
head(chickfilA)
## # A tibble: 6 × 17
##   restaur…¹ item  calor…² cal_fat total…³ sat_fat trans…⁴ chole…⁵ sodium total…⁶
##   <chr>     <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
## 1 Chick Fi… Char…     430     144      16     8         0      85   1120      37
## 2 Chick Fi… Char…     310      54       6     2         0      55    820      36
## 3 Chick Fi… Chic…     270      99      11     2.5       0      45    800      26
## 4 Chick Fi… 1 Pi…     120      54       6     3         0      25    320       6
## 5 Chick Fi… 2 Pi…     230     108      12     3         0      55    630      13
## 6 Chick Fi… 3 Pi…     350     153      17     3         0      70    940      22
## # … with 7 more variables: fiber <dbl>, sugar <dbl>, protein <dbl>,
## #   vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>, and abbreviated
## #   variable names ¹restaurant, ²calories, ³total_fat, ⁴trans_fat,
## #   ⁵cholesterol, ⁶total_carb
chickMean <- mean(chickfilA$cal_fat)
chickSD <- sd(chickfilA$cal_fat)
chickfilA %>%
  filter(cal_fat < 400) %>%
  summarise(percent = n() / nrow(chickfilA))
## # A tibble: 1 × 1
## percent
## <dbl>
## 1 0.926
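The empirical proportion above has a theoretical counterpart: the probability below 400 under a normal model fitted with chickMean and chickSD. A minimal sketch of computing both at once is given below. The helper prob_below is hypothetical, and the demo input is only the six cal_fat values visible in head(chickfilA), so its result illustrates the calculation and is not the answer for the full Chick Fil-A menu.

```r
# Hypothetical helper: P(X < cutoff) under a fitted normal model and in the data
prob_below <- function(x, cutoff) {
  c(theoretical = pnorm(cutoff, mean = mean(x), sd = sd(x)),
    empirical   = mean(x < cutoff))
}

# Only the six cal_fat values shown in head(chickfilA); NOT the full menu
cal_fat_head <- c(144, 54, 99, 54, 108, 153)
res <- prob_below(cal_fat_head, cutoff = 400)
print(res)
```

With the full data loaded, prob_below(chickfilA$cal_fat, cutoff = 400) would return the theoretical value alongside the empirical 0.926.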
# Plot the Chick Fil-A data so the histogram matches the Chick Fil-A normal curve
ggplot(data = chickfilA, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = c(mean = chickMean, sd = chickSD), col = "purple")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Exercise 7: Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?
unique(fastfood$restaurant)
## [1] "Mcdonalds" "Chick Fil-A" "Sonic" "Arbys" "Burger King"
## [6] "Dairy Queen" "Subway" "Taco Bell"
mac <- fastfood %>%
  filter(restaurant == "Mcdonalds")
qqnorm(mac$sodium, main = "Mcdonalds")
qqline(mac$sodium)
#------------------------
chick <- fastfood %>%
  filter(restaurant == "Chick Fil-A")
qqnorm(chick$sodium, main = "Chick Fil-A")
qqline(chick$sodium)
#------------------------
sonic <- fastfood %>%
  filter(restaurant == "Sonic")
qqnorm(sonic$sodium, main = "Sonic")
qqline(sonic$sodium)
#------------------------
arbys <- fastfood %>%
  filter(restaurant == "Arbys")
qqnorm(arbys$sodium, main = "Arbys")
qqline(arbys$sodium)
#------------------------
burger <- fastfood %>%
  filter(restaurant == "Burger King")
qqnorm(burger$sodium, main = "Burger King")
qqline(burger$sodium)
#------------------------
dq <- fastfood %>%
  filter(restaurant == "Dairy Queen")
qqnorm(dq$sodium, main = "Dairy Queen")
qqline(dq$sodium)
#------------------------
taco <- fastfood %>%
  filter(restaurant == "Taco Bell")
qqnorm(taco$sodium, main = "Taco Bell")
qqline(taco$sodium)
#------------------------
subway <- fastfood %>%
  filter(restaurant == "Subway")
qqnorm(subway$sodium, main = "Subway")
qqline(subway$sodium)
Chick Fil-A's sodium distribution appears to be the closest to normal: its points stay nearest to the reference line, and it has the fewest outliers.
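The eight nearly identical blocks above could also be written as a single loop over restaurants. The sketch below shows the pattern with base R's split(). It uses simulated stand-in data with hypothetical restaurant labels A, B and C, not the fastfood values, so that it runs on its own.

```r
# Simulated stand-in data with hypothetical restaurant labels (not the fastfood values)
set.seed(1)
toy <- data.frame(
  restaurant = rep(c("A", "B", "C"), each = 30),
  sodium = c(rnorm(30, mean = 1200, sd = 300),   # roughly normal
             rexp(30, rate = 1 / 1000),          # right skewed
             runif(30, min = 300, max = 2500))   # flat
)

# One vector of sodium values per restaurant
by_restaurant <- split(toy$sodium, toy$restaurant)

# Draw one normal probability plot per restaurant, side by side
op <- par(mfrow = c(1, length(by_restaurant)))
for (name in names(by_restaurant)) {
  qqnorm(by_restaurant[[name]], main = name)
  qqline(by_restaurant[[name]])
}
par(op)
```

With the real data, split(fastfood$sodium, fastfood$restaurant) would take the place of the toy split, and the panel layout would need to hold eight plots.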
Exercise 8: Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?
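My guess is that sodium does not rise smoothly from item to item: it may take a large increase in calories before sodium also increases. A likely contributing factor is that sodium is reported in rounded amounts (every sodium value printed above is a multiple of 10), so several items share the same value and show up as flat steps in the plot.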
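One possible explanation that can be checked by simulation is rounding: when a continuous variable is recorded in coarse increments, many observations tie, and ties appear as flat steps in a normal probability plot. The sketch below uses simulated values, not the fastfood data, and the rounding to the nearest 100 is an exaggerated assumption chosen to make the steps easy to see.

```r
set.seed(7)
raw     <- rnorm(50, mean = 1200, sd = 300)  # simulated, not the fastfood values
rounded <- round(raw, -2)                    # round to the nearest 100

# Same sample before and after rounding
op <- par(mfrow = c(1, 2))
qqnorm(raw, main = "unrounded"); qqline(raw)
qqnorm(rounded, main = "rounded to 100"); qqline(rounded)
par(op)

# Rounding creates ties: far fewer distinct values than observations
n_distinct <- c(raw = length(unique(raw)), rounded = length(unique(rounded)))
print(n_distinct)
```

The rounded panel shows horizontal runs of points at each repeated value, the same kind of stepwise pattern seen in some of the sodium plots.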
Exercise 9: As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.
colnames(fastfood)
## [1] "restaurant" "item" "calories" "cal_fat" "total_fat"
## [6] "sat_fat" "trans_fat" "cholesterol" "sodium" "total_carb"
## [11] "fiber" "sugar" "protein" "vit_a" "vit_c"
## [16] "calcium" "salad"
carbmean <- mean(mcdonalds$total_carb)
carbsd <- sd(mcdonalds$total_carb)
ggplot(data = mcdonalds, aes(x = total_carb)) +
  geom_blank() +
  # after_stat(density) is the current spelling of the older ..density..
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = c(mean = carbmean, sd = carbsd), col = "green")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
hist(mcdonalds$total_carb)
The histogram shows that total carbohydrates at McDonald's are clearly right skewed.
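The exercise also asks for a normal probability plot. A minimal sketch of how right skew shows up in one is given below, using simulated right-skewed data rather than the McDonald's values. In qqnorm(), which puts sample quantiles on the vertical axis, right-skewed data typically curve upward away from the line at the high end. For the real data, the equivalent calls would be qqnorm(mcdonalds$total_carb) and qqline(mcdonalds$total_carb).

```r
set.seed(3)
skewed <- rlnorm(500, meanlog = 3.5, sdlog = 0.8)  # simulated right-skewed stand-in

# Normal probability plot next to a histogram of the same values
op <- par(mfrow = c(1, 2))
qqnorm(skewed, main = "normal probability plot"); qqline(skewed)
hist(skewed, main = "histogram", xlab = "value")
par(op)

# Numeric cross-check: right skew usually pulls the mean above the median
skew_gap <- mean(skewed) - median(skewed)
print(skew_gap > 0)
```

A left-skewed variable would bend the other way, falling below the line at the low end, and a symmetric variable would track the line at both ends.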