library(tidyverse)
library(openintro)
library(ggplot2)
library(dplyr)
data("fastfood")
data("dfdq")
## Warning in data("dfdq"): data set 'dfdq' not found
Focus on just three columns to get started: restaurant, calories, calories from fat.
Focus on just products from McDonalds and Dairy Queen.
#extract calories and cal_fat columns for McDonald's
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
mcdonalds[, c(1, 3, 4)]
## # A tibble: 57 x 3
## restaurant calories cal_fat
## <chr> <dbl> <dbl>
## 1 Mcdonalds 380 60
## 2 Mcdonalds 840 410
## 3 Mcdonalds 1130 600
## 4 Mcdonalds 750 280
## 5 Mcdonalds 920 410
## 6 Mcdonalds 540 250
## 7 Mcdonalds 300 100
## 8 Mcdonalds 510 210
## 9 Mcdonalds 430 190
## 10 Mcdonalds 770 400
## # ... with 47 more rows
rname <- mcdonalds[c(1)]
calories <- mcdonalds[c(3)]
cal_fat <- mcdonalds[c(4)]
dfm <- data.frame(cal_fat, calories)
#extract calories and cal_fat columns for Dairy Queen
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")
dairy_queen[, c(1, 2, 4)]
## # A tibble: 42 x 3
## restaurant item cal_fat
## <chr> <chr> <dbl>
## 1 Dairy Queen 1/2 lb. FlameThrower® GrillBurger 660
## 2 Dairy Queen 1/2 lb. GrillBurger with Cheese 460
## 3 Dairy Queen 1/4 lb. Bacon Cheese GrillBurger 330
## 4 Dairy Queen 1/4 lb. GrillBurger with Cheese 270
## 5 Dairy Queen 1/4 lb. Mushroom Swiss GrillBurger 310
## 6 Dairy Queen Original Cheeseburger 160
## 7 Dairy Queen Original Double Cheeseburger 310
## 8 Dairy Queen 4 Piece Chicken Strip Basket w/ Country Gravy 480
## 9 Dairy Queen 6 Piece Chicken Strip Basket w/ Country Gravy 590
## 10 Dairy Queen Bacon Cheese Dog 240
## # ... with 32 more rows
rname2 <- dairy_queen[c(1)]
calories2 <- dairy_queen[c(3)]
cal_fat2 <- dairy_queen[c(4)]
dfdq <- data.frame(cal_fat2, calories2)
Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?
Answer:
I used ggplot for McDonald’s. When I attempted the same for Dairy Queen, the following error appeared: “Error in is.finite(x) : default method not implemented for type ‘list’”. I tried several solutions found online, but none of them worked. From what I read, the error is apparently caused by one or more values in the Dairy Queen data. After trying many different kinds of plots, the only code that returned a result was “plot(dfdq)”. I also tried pulling the Dairy Queen columns out as vectors, but that failed as well:
#convert Dairy Queen data frame to vector and plot
vecdcalfat <- pull(dfdq, cal_fat2)
vecdcalories <-pull(dfdq, calories2)
ggplot(data=NULL,aes(x=vecdcalfat,y=vecdcalories))+geom_point()
## Error: Must extract column with a single valid subscript.
## x Subscript var has the wrong type tbl_df<cal_fat:double>.
## i It must be numeric or character.
## Run rlang::last_error() to see where the error occurred.
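A likely cause of both errors, offered as an assumption rather than a confirmed diagnosis: single-bracket indexing such as dairy_queen[c(4)] returns a one-column data frame, not a numeric vector, and data.frame() then keeps the inner column names (cal_fat, calories) instead of cal_fat2 and calories2. Because dfdq has no column called cal_fat2, ggplot() and pull() would pick up the global object cal_fat2, which is a list-like tibble. The sketch below reproduces the naming behavior in base R using the first three Dairy Queen rows printed in this report; dq_small is a hypothetical stand-in for dairy_queen.

```r
# Hypothetical stand-in for dairy_queen, built from the first three
# Dairy Queen rows printed in this report (item names abbreviated).
dq_small <- data.frame(
  restaurant = "Dairy Queen",
  item       = c("FlameThrower GrillBurger",
                 "GrillBurger with Cheese",
                 "Bacon Cheese GrillBurger"),
  calories   = c(1000, 800, 630),
  cal_fat    = c(660, 460, 330)
)

# Single brackets keep the data frame wrapper, and a data frame is a list.
cal_fat2  <- dq_small[c(4)]
calories2 <- dq_small[c(3)]
print(is.list(cal_fat2))

# data.frame() reuses the inner column names, so no column is called cal_fat2.
dfdq_bad <- data.frame(cal_fat2, calories2)
print(names(dfdq_bad))

# Double brackets (or $) extract plain numeric vectors, and naming the
# arguments gives the columns the intended names.
dfdq_ok <- data.frame(
  cal_fat2  = dq_small[["cal_fat"]],
  calories2 = dq_small[["calories"]]
)
print(names(dfdq_ok))
print(is.numeric(dfdq_ok$cal_fat2))
```

With columns named this way, ggplot(data = dfdq_ok, aes(x = cal_fat2, y = calories2)) + geom_point() should find both variables inside the data frame. Plotting dairy_queen directly with aes(x = cal_fat, y = calories) would avoid the intermediate objects altogether.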
Comparing the two plots of calories from fat against total calories for McDonald’s and Dairy Queen, the two restaurants appear to have similar distributions of calories from fat, with most items at or below roughly 800 calories. Without a fitted curve, and judging only from the data points, both look somewhat right-skewed: most items sit at lower values, with a few high-fat items stretching the upper tail. I would expect the mean, median, and mode of the two distributions to be similar.
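The question asks about the distribution of calories from fat, which a scatterplot against total calories shows only indirectly. As a rough numerical check on center, spread, and skew, the sketch below uses the 42 Dairy Queen cal_fat values copied from the print(dfdq) output in this report; dq_cal_fat and the other names are introduced here for illustration. A mean noticeably above the median is one common sign of right skew.

```r
# Dairy Queen calories from fat, copied from the print(dfdq) output
dq_cal_fat <- c(660, 460, 330, 270, 310, 160, 310, 480, 590, 240,
                220, 220, 180, 160, 180,  80,  80, 410, 670, 440,
                140, 200, 160, 310, 240, 130, 430, 310, 430, 180,
                180, 220, 280, 120, 270, 190, 170,  20, 140, 130,
                  0, 240)

# Center: mean and median
center <- c(mean = mean(dq_cal_fat), median = median(dq_cal_fat))

# Spread: standard deviation and interquartile range
spread <- c(sd = sd(dq_cal_fat), iqr = IQR(dq_cal_fat))

print(round(center, 1))
print(round(spread, 1))

# Crude skew indicator: which side the mean is pulled toward
skew_direction <- if (center[["mean"]] > center[["median"]]) "right" else "left or none"
print(skew_direction)
```

For these values the mean is about 260 and the median is 220, so the mean is pulled toward the high-fat items. Running the same summary on the McDonald’s column would allow a side-by-side comparison.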
#Plot for McDonalds using dataframe
data1m <- data.frame(cal_fat, calories) # Create first data frame - McDonald's
ggplot(data = data1m, aes(x = cal_fat, y = calories)) + geom_point() #plot data from McDonald's
#Plot for Dairy Queen
dfdq <- data.frame(cal_fat2, calories2)
#ggplot(data=dfdq,aes(x=cal_fat2,y=calories2))+geom_point() #plot data from Dairy Queen
data(dfdq)
## Warning in data(dfdq): data set 'dfdq' not found
print(dfdq)
## cal_fat calories
## 1 660 1000
## 2 460 800
## 3 330 630
## 4 270 540
## 5 310 570
## 6 160 400
## 7 310 630
## 8 480 1030
## 9 590 1260
## 10 240 420
## 11 220 390
## 12 220 380
## 13 180 330
## 14 160 290
## 15 180 350
## 16 80 310
## 17 80 250
## 18 410 550
## 19 670 1050
## 20 440 760
## 21 140 260
## 22 200 470
## 23 160 400
## 24 310 640
## 25 240 540
## 26 130 350
## 27 430 780
## 28 310 580
## 29 430 910
## 30 180 350
## 31 180 500
## 32 220 640
## 33 280 520
## 34 120 280
## 35 270 600
## 36 190 350
## 37 170 380
## 38 20 150
## 39 140 360
## 40 130 280
## 41 0 20
## 42 240 550
plot(dfdq)
dqmean <- mean(dairy_queen$cal_fat)
dqsd <- sd(dairy_queen$cal_fat)
mcmean <- mean(data1m$cal_fat)
mcsd <- sd(data1m$cal_fat)
ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Based on this plot, does it appear that the data follow a nearly normal distribution?
Answer: Judging by the histogram against the overlaid normal curve, the two sides look roughly symmetric, so the data may follow a nearly normal distribution.
ggplot(data = dairy_queen, aes(sample = cal_fat)) +
  geom_line(stat = "qq")
#simulating data from a normal distribution using rnorm.
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a dataframe, it can be put directly into the sample argument and the data argument can be dropped.)
Answer: See the normal probability plot of sim_norm below. Not all of the points fall on the line, but many of them do, which indicates the distribution could be normal.
ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_line(stat = "qq")
qqnormsim(sample = cal_fat, data = dairy_queen)
Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories from fat are nearly normal?
Answer: Most points fall on the line, and most items have fewer than about 800 calories. The calories-from-fat variable appears to be approximately normal.
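One hedged way to put a number on “most points fall on the line” is the correlation between the sample quantiles and the theoretical normal quantiles, compared against the same statistic for samples that really are normal. The sketch below does this in base R with the 42 Dairy Queen cal_fat values printed earlier in this report. qq_cor() is a helper introduced here, not part of openintro, and the choice of eight simulated samples is an assumption meant to mirror the panel of simulated plots that qqnormsim() draws.

```r
# Dairy Queen calories from fat, copied from the print(dfdq) output
dq_cal_fat <- c(660, 460, 330, 270, 310, 160, 310, 480, 590, 240,
                220, 220, 180, 160, 180,  80,  80, 410, 670, 440,
                140, 200, 160, 310, 240, 130, 430, 310, 430, 180,
                180, 220, 280, 120, 270, 190, 170,  20, 140, 130,
                  0, 240)

# Correlation between theoretical normal quantiles and sample quantiles.
# Values near 1 mean the Q-Q points lie close to a straight line.
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

observed <- qq_cor(dq_cal_fat)

set.seed(1)  # arbitrary seed, only for reproducibility
simulated <- replicate(
  8,
  qq_cor(rnorm(length(dq_cal_fat),
               mean = mean(dq_cal_fat),
               sd   = sd(dq_cal_fat)))
)

print(round(observed, 3))
print(round(sort(simulated), 3))

# Share of simulated normal samples at least as straight as the real data
print(mean(simulated >= observed))
```

If the observed correlation sits well below all of the simulated ones, that points to a departure from normality such as skew. If it falls within their range, the normal model is harder to rule out.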
Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.
Answer:
The distribution under the bell curve does not appear to be symmetrical. I still have questions about the other graphs: it is not clear to me whether the variable we are measuring is approximately normal when I compare the normal probability plots with the histogram and bell curve.
ggplot(data = data1m, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, args = c(mean = mcmean, sd = mcsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = data1m, aes(sample = cal_fat)) +
  geom_line(stat = "qq")
#simulating data from a normal distribution using rnorm.
sim_norm <- rnorm(n = nrow(data1m), mean = mcmean, sd = mcsd)
ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_line(stat = "qq")
qqnormsim(sample = cal_fat, data = data1m)
“What is the probability that a randomly chosen Dairy Queen product has more than 600 calories from fat?” If we assume that the calories from fat from Dairy Queen’s menu are normally distributed (a very close approximation is also okay), we can find this probability by calculating a Z score and consulting a Z table (also called a normal probability table). In R, this is done in one step with the function pnorm(). pnorm() gives the area under the normal curve below a given value, q, with a given mean and standard deviation. Since we’re interested in the probability that a Dairy Queen item has more than 600 calories from fat, we have to take one minus that probability.
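The Z-score route and the one-step pnorm() call should give the same upper-tail area. The sketch below checks this with the 42 Dairy Queen cal_fat values printed earlier in this report; dq_cal_fat, m, s, z, and the p_* names are introduced here for illustration.

```r
# Dairy Queen calories from fat, copied from the print(dfdq) output
dq_cal_fat <- c(660, 460, 330, 270, 310, 160, 310, 480, 590, 240,
                220, 220, 180, 160, 180,  80,  80, 410, 670, 440,
                140, 200, 160, 310, 240, 130, 430, 310, 430, 180,
                180, 220, 280, 120, 270, 190, 170,  20, 140, 130,
                  0, 240)

m <- mean(dq_cal_fat)
s <- sd(dq_cal_fat)

# Z score: how many standard deviations 600 lies above the mean
z <- (600 - m) / s

# Three equivalent ways to get P(X > 600) under a normal model
p_from_z     <- 1 - pnorm(z)                           # standard normal, via Z
p_one_step   <- 1 - pnorm(q = 600, mean = m, sd = s)   # as in this report
p_upper_tail <- pnorm(600, mean = m, sd = s, lower.tail = FALSE)

print(round(z, 2))
print(c(p_from_z, p_one_step, p_upper_tail))
```

Passing lower.tail = FALSE avoids the subtraction and tends to be more accurate numerically when the tail area is very small.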
1 - pnorm(q = 600, mean = dqmean, sd = dqsd)
## [1] 0.01501523
Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 600 then divide this number by the total sample size.
dairy_queen %>%
  filter(cal_fat > 600) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 x 1
## percent
## <dbl>
## 1 0.0476
Although the probabilities are not exactly the same, they are reasonably close. The closer that your distribution is to being normal, the more accurate the theoretical probabilities will be.
#repeating for McDonald's
1 - pnorm(q = 600, mean = mcmean, sd = mcsd)
## [1] 0.07733771
data1m %>%
  filter(cal_fat > 600) %>%
  summarise(percent = n() / nrow(data1m))
## percent
## 1 0.07017544
Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?
Answer:
Q1: What is the probability that a randomly chosen McDonald’s product has more than 600 calories from fat?
Q2: What is the probability that a randomly chosen Dairy Queen product has more than 25 grams of total fat?
Q1 had the closer agreement between the two methods, with both results near 0.07 (0.077 theoretical, 0.070 empirical), which suggests a normal model approximates that variable reasonably well. The Q2 results of about 0.59 and 0.48 tell me that either something is wrong with my calculation or the variable I selected, total fat, is not normally distributed. If it is not normally distributed, probabilities computed from the normal model won’t be accurate.
#Q1 - McDonald's theoretical
1 - pnorm(q = 600, mean = mcmean, sd = mcsd)
## [1] 0.07733771
#Q1 - McDonald's Empirical
data1m %>%
  filter(cal_fat > 600) %>%
  summarise(percent = n() / nrow(data1m))
## percent
## 1 0.07017544
#Set Dairy Queen for total_fat; plot
ggplot(data = dairy_queen, aes(sample = total_fat)) +
  geom_line(stat = "qq")
#Q2 - Dairy Queen; total_fat; theoretical
dqmean <- mean(dairy_queen$total_fat)
dqsd <- sd(dairy_queen$total_fat)
1 - pnorm(q = 25, mean = dqmean, sd = dqsd)
## [1] 0.5871316
#Q2 - Dairy Queen Empirical
dairy_queen %>%
  filter(total_fat > 25) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 x 1
## percent
## <dbl>
## 1 0.476