library(tidyverse)
library(openintro)
This week you’ll be working with fast food data. This data set contains data on 515 menu items from some of the most popular fast food restaurants worldwide. Let’s take a quick peek at the first few rows of the data.
You can use either glimpse, as before, or head to do this.
data("fastfood", package='openintro')
head(fastfood)
## # A tibble: 6 × 17
## restaur…¹ item calor…² cal_fat total…³ sat_fat trans…⁴ chole…⁵ sodium total…⁶
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mcdonalds Arti… 380 60 7 2 0 95 1110 44
## 2 Mcdonalds Sing… 840 410 45 17 1.5 130 1580 62
## 3 Mcdonalds Doub… 1130 600 67 27 3 220 1920 63
## 4 Mcdonalds Gril… 750 280 31 10 0.5 155 1940 62
## 5 Mcdonalds Cris… 920 410 45 12 0.5 120 1980 81
## 6 Mcdonalds Big … 540 250 28 10 1 80 950 46
## # … with 7 more variables: fiber <dbl>, sugar <dbl>, protein <dbl>,
## # vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>, and abbreviated
## # variable names ¹restaurant, ²calories, ³total_fat, ⁴trans_fat,
## # ⁵cholesterol, ⁶total_carb
## # ℹ Use `colnames()` to see all variable names
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")
#1. Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?
Looking at the Dairy Queen and McDonald's distributions, both are unimodal and skewed right, with a long right-hand tail. Both have a median of around 200 calories from fat, but their ranges are hugely different: about 750 for Dairy Queen versus about 1,250 for McDonald's.
ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_histogram(binwidth = 25)
ggplot(data = mcdonalds, aes(x = cal_fat)) +
  geom_histogram(binwidth = 25)
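To back up the comparison of centers and spreads read off the histograms, the numbers can also be computed directly. This is only a sketch: the object name cal_fat_summary and the choice of median, IQR, minimum, and maximum as summary statistics are my own assumptions, not part of the lab.

```r
library(dplyr)
library(openintro)

data("fastfood", package = "openintro")

# Center and spread of calories from fat for the two restaurants.
# cal_fat_summary is an illustrative name, not something the lab defines.
cal_fat_summary <- fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen")) %>%
  group_by(restaurant) %>%
  summarise(
    n_items        = n(),
    median_cal_fat = median(cal_fat, na.rm = TRUE),
    iqr_cal_fat    = IQR(cal_fat, na.rm = TRUE),
    min_cal_fat    = min(cal_fat, na.rm = TRUE),
    max_cal_fat    = max(cal_fat, na.rm = TRUE)
  )

cal_fat_summary
```

The median and IQR are used here instead of the mean and standard deviation because both distributions are skewed right.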
Looking at this plot, the data appear to follow a nearly normal probability distribution.
## Evaluating the normal distribution
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)
ggplot(data = dairy_queen, aes(sample = cal_fat)) +
  geom_line(stat = "qq")
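# Mean and standard deviation of Dairy Queen calories from fat
dqmean <- mean(dairy_queen$cal_fat)
dqsd <- sd(dairy_queen$cal_fat)
# Simulate normal data with the same size, mean, and sd (rnorm, not norm)
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
qqnorm(sim_norm)
qqline(sim_norm)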
qqnormsim(sample = cal_fat, data = dairy_queen)
Comparing the sim_norm Q-Q plot with the Dairy Queen Q-Q plot, the two
look similar. Not all of the simulated points fall exactly on the
straight line, and the actual data show a comparable pattern. This
suggests the Dairy Queen data are close to a normal distribution.
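To make the idea behind comparing real and simulated Q-Q plots more concrete, here is a rough base R sketch. It is not the openintro implementation of qqnormsim. The helper name qq_sim_grid is hypothetical, and example_values is synthetic stand-in data, not the Dairy Queen values.

```r
# Hypothetical helper (not from openintro): draw the Q-Q plot of the real
# data next to Q-Q plots of normal samples with the same n, mean, and sd.
qq_sim_grid <- function(x, n_sim = 8) {
  x <- x[!is.na(x)]
  old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))

  # Panel 1: the real data
  qqnorm(x, main = "Real data")
  qqline(x)

  # Remaining panels: simulated normal samples
  sims <- replicate(
    n_sim,
    rnorm(length(x), mean = mean(x), sd = sd(x)),
    simplify = FALSE
  )
  for (i in seq_along(sims)) {
    qqnorm(sims[[i]], main = paste("Simulation", i))
    qqline(sims[[i]])
  }
  invisible(sims)
}

# Synthetic, right-skewed stand-in values (an assumption for illustration)
set.seed(1)
example_values <- rexp(40, rate = 1 / 250)
sims <- qq_sim_grid(example_values)
```

With the lab data loaded, the same helper could be called as qq_sim_grid(dairy_queen$cal_fat). If the real-data panel does not stand out from the simulated panels, that is evidence the data are nearly normal.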
#4. Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?
qqnormsim(sample = cal_fat, data = dairy_queen)
The normal probability plot for the actual data looks similar to the plots for the simulated data. Therefore these plots do provide evidence that the Dairy Queen calories from fat are nearly normal.
The probability that a randomly chosen Dairy Queen item has fewer than 400 calories from fat:
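# Theoretical probability from a normal model with the sample mean and sd
pnorm(400, mean = mean(dairy_queen$cal_fat), sd = sd(dairy_queen$cal_fat))
# Empirical proportion of items below 400 calories from fat
dairy_queen %>%
  filter(cal_fat < 400) %>%
  summarise(percent = n() / nrow(dairy_queen))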
Empirically, about 81% of the items have fewer than 400 calories from fat, while the normal distribution gives a probability of about 78.5%.
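The two numbers come from two different calculations, which the sketch below lays out side by side. The values in x are synthetic (drawn with an arbitrary mean and sd that I picked for illustration), so the printed probabilities are not the Dairy Queen results.

```r
set.seed(42)

# Synthetic stand-in values; the mean and sd here are arbitrary assumptions
x <- rnorm(200, mean = 260, sd = 150)
threshold <- 400

# Theoretical: standardize the threshold, then use the standard normal CDF
z <- (threshold - mean(x)) / sd(x)
p_normal <- pnorm(z)

# The same value, letting pnorm do the standardizing
p_direct <- pnorm(threshold, mean = mean(x), sd = sd(x))

# Empirical: the share of observed values below the threshold
p_empirical <- mean(x < threshold)

c(normal = p_normal, direct = p_direct, empirical = p_empirical)
```

The closer the data are to a normal distribution, the closer the theoretical and empirical values should be.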
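# Normal-model probability of between 300 and 700 calories from fat
diff(pnorm(c(300, 700), mean = mean(dairy_queen$cal_fat), sd = sd(dairy_queen$cal_fat)))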
dairy_queen %>%
  filter(cal_fat < 700 & cal_fat > 300) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 × 1
## percent
## <dbl>
## 1 0.333
for (ids in unique(fastfood$restaurant)) {
  # Subset to one restaurant and draw its sodium Q-Q plot
  test <- fastfood %>%
    filter(restaurant == ids)
  qqnorm(test$sodium, main = ids)
  qqline(test$sodium)
}
This loop draws a normal Q-Q plot of sodium for each restaurant. Looking
through all the restaurants, Burger King might be the only one close to
a normal distribution. Arby's is a close second.
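As an alternative to looping over restaurants, the same Q-Q plots could be drawn in one faceted ggplot figure, which makes them easier to compare side by side. This is a sketch of one possible approach. The object name sodium_qq and the free y-axis scales are my own choices.

```r
library(ggplot2)
library(openintro)

data("fastfood", package = "openintro")

# One Q-Q panel per restaurant; sodium_qq is an illustrative name
sodium_qq <- ggplot(fastfood, aes(sample = sodium)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~ restaurant, scales = "free_y") +
  labs(x = "Theoretical quantiles", y = "Sodium")

sodium_qq
```

Because geom_qq and geom_qq_line are computed within each facet, every restaurant gets its own reference line.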
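Some of the sodium Q-Q plots show a stepwise pattern. A likely reason is that the sodium values are rounded (the values in the first rows shown above are all multiples of 10), so several items share the same value and the points line up in steps. The data might also be a bit skewed.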
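A stepwise Q-Q plot typically shows up when many observations share the same value, for example because the measurements are rounded. The sketch below illustrates this with synthetic values (not the sodium data): the same sample is plotted before and after rounding to the nearest 100. The mean, sd, and rounding level are arbitrary assumptions.

```r
set.seed(123)

# Synthetic values; mean, sd, and rounding level are arbitrary assumptions
smooth_values  <- rnorm(60, mean = 1200, sd = 400)
rounded_values <- round(smooth_values, -2)  # nearest 100

old_par <- par(mfrow = c(1, 2))
qqnorm(smooth_values, main = "Unrounded")
qqline(smooth_values)
qqnorm(rounded_values, main = "Rounded to nearest 100")
qqline(rounded_values)
par(old_par)

# Rounding collapses many distinct values into a few tied ones
n_distinct_smooth  <- length(unique(smooth_values))
n_distinct_rounded <- length(unique(rounded_values))
c(unrounded = n_distinct_smooth, rounded = n_distinct_rounded)
```

In the rounded panel, each flat step is a group of tied values, which is the same kind of pattern seen in some of the sodium plots.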
qqnorm(mcdonalds$total_carb, main = "Mcdonalds Carbs")
qqline(mcdonalds$total_carb)
# Mean and standard deviation of McDonald's total carbohydrates
mcd_carb_mean <- mean(mcdonalds$total_carb)
mcd_carb_sd <- sd(mcdonalds$total_carb)
ggplot(data = mcdonalds, aes(x = total_carb)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = list(mean = mcd_carb_mean, sd = mcd_carb_sd), col = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
I chose McDonald's total carbohydrates because I like McDonald's. I can
see that the data are somewhat skewed to the right. Apart from that
skew, the distribution closely resembles a normal distribution.