Getting Started

Load packages

library(tidyverse)
library(openintro)

The data

This week you’ll be working with fast food data. This data set contains data on 515 menu items from some of the most popular fast food restaurants worldwide. Let’s take a quick peek at the first few rows of the data.

You can use either glimpse(), as before, or head() to do this.

library(tidyverse)
library(openintro)
data("fastfood", package='openintro')
head(fastfood)
## # A tibble: 6 × 17
##   restaur…¹ item  calor…² cal_fat total…³ sat_fat trans…⁴ chole…⁵ sodium total…⁶
##   <chr>     <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
## 1 Mcdonalds Arti…     380      60       7       2     0        95   1110      44
## 2 Mcdonalds Sing…     840     410      45      17     1.5     130   1580      62
## 3 Mcdonalds Doub…    1130     600      67      27     3       220   1920      63
## 4 Mcdonalds Gril…     750     280      31      10     0.5     155   1940      62
## 5 Mcdonalds Cris…     920     410      45      12     0.5     120   1980      81
## 6 Mcdonalds Big …     540     250      28      10     1        80    950      46
## # … with 7 more variables: fiber <dbl>, sugar <dbl>, protein <dbl>,
## #   vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>, and abbreviated
## #   variable names ¹​restaurant, ²​calories, ³​total_fat, ⁴​trans_fat,
## #   ⁵​cholesterol, ⁶​total_carb
## # ℹ Use `colnames()` to see all variable names
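Alternatively, glimpse() prints every column with its type and the first few values, which is handy when there are more columns than fit across the screen. A minimal sketch:

glimpse(fastfood)  # compact, column-oriented preview of the data set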
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

1. Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

Looking at the Dairy Queen and McDonald's distributions of calories from fat, both are unimodal with a right-hand tail (skewed right). Both have a median around 200, but their ranges differ greatly: roughly 750 for Dairy Queen versus about 1,250 for McDonald's.

ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_histogram(binwidth = 25)

ggplot(data = mcdonalds, aes(x = cal_fat)) +
  geom_histogram(binwidth = 25)
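To compare the two restaurants on the same scale, one option (a sketch, not part of the original lab) is a single faceted histogram built from just these two subsets:

fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen")) %>%  # keep only the two restaurants of interest
  ggplot(aes(x = cal_fat)) +
  geom_histogram(binwidth = 25) +
  facet_wrap(~ restaurant, ncol = 1)  # stacked panels share the x-axis for easy comparison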

2. Based on this plot, does it appear that the data follow a nearly normal distribution?

Looking at this plot, the Dairy Queen data appear to follow a nearly normal distribution.

Evaluating the normal distribution
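The simulation and pnorm() calls below rely on dqmean and dqsd, the mean and standard deviation of Dairy Queen's calories from fat, which are not defined elsewhere in this document. A minimal sketch computing them:

dqmean <- mean(dairy_queen$cal_fat)  # mean calories from fat, Dairy Queen
dqsd   <- sd(dairy_queen$cal_fat)    # standard deviation, Dairy Queen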

3. Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)

ggplot(data = dairy_queen, aes(sample = cal_fat)) + 
  geom_line(stat = "qq")

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)  # simulate a normal sample matching the Dairy Queen sample size

qqnorm(sim_norm)  # normal probability plot of the simulated data
qqline(sim_norm)  # reference line
qqnormsim(sample = cal_fat, data = dairy_queen)

Looking at the Q-Q plot of sim_norm, the points do not fall perfectly on the line, especially in the tails. The plot looks similar to the probability plot for the real Dairy Queen data, which is consistent with the real data being close to a normal distribution.

4. Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?

qqnormsim(sample = cal_fat, data = dairy_queen)

Based on the simulated data, the normal probability plot of the actual data looks similar to the plots for the simulated data. Therefore, these plots do provide evidence that the Dairy Queen calories from fat are nearly normal.

5. Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

qqnormsim(sample = cal_fat, data = mcdonalds)

Comparing the simulated plots to the plot from the actual data set, the McDonald's data do not follow the simulated normal pattern. The data are skewed to the right, so the calories from fat on McDonald's menu do not appear to come from a normal distribution.

Normal probabilities

Okay, so now you have a slew of tools to judge whether or not a variable is normally distributed. Why should you care?

It turns out that statisticians know a lot about the normal distribution. Once you decide that a random variable is approximately normal, you can answer all sorts of questions about that variable related to probability. Take, for example, the question of, “What is the probability that a randomly chosen Dairy Queen product has more than 600 calories from fat?”

If we assume that the calories from fat from Dairy Queen’s menu are normally distributed (a very close approximation is also okay), we can find this probability by calculating a Z score and consulting a Z table (also called a normal probability table). In R, this is done in one step with the function pnorm().

1 - pnorm(q = 600, mean = dqmean, sd = dqsd)

Note that the function pnorm() gives the area under the normal curve below a given value, q, with a given mean and standard deviation. Since we’re interested in the probability that a Dairy Queen item has more than 600 calories from fat, we have to take one minus that probability.
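Equivalently, you can compute the Z score by hand and use the standard normal distribution (the default mean = 0 and sd = 1 in pnorm()); a quick sketch:

z <- (600 - dqmean) / dqsd  # how many standard deviations 600 lies above the mean
1 - pnorm(z)                # area above that Z score under the standard normal curve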

Assuming a normal distribution has allowed us to calculate a theoretical probability. If we want to calculate the probability empirically, we simply need to determine how many observations fall above 600 then divide this number by the total sample size.

dairy_queen %>% 
  filter(cal_fat > 600) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1  0.0476

Although the probabilities are not exactly the same, they are reasonably close. The closer that your distribution is to being normal, the more accurate the theoretical probabilities will be.

6. Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

First question: what is the probability that a randomly chosen Dairy Queen item has fewer than 400 calories from fat?

pnorm(400, dqmean, dqsd)  # theoretical probability under the normal model
dairy_queen %>%
  filter(cal_fat < 400) %>%
  summarise(percent = n() / nrow(dairy_queen))  # empirical proportion

Under the theoretical normal distribution there is about an 81% chance that an item has fewer than 400 calories from fat, while the empirical distribution gives about 78.5%. Second question: what is the probability that a Dairy Queen item has between 300 and 700 calories from fat?

pnorm(700, dqmean, dqsd) - pnorm(300, dqmean, dqsd)  # theoretical probability under the normal model
dairy_queen %>%
  filter(cal_fat < 700 & cal_fat > 300) %>%
  summarise(percent = n() / nrow(dairy_queen))  # empirical proportion
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1   0.333

7. Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

# Draw a base-R normal probability plot of sodium for each restaurant
for (ids in unique(fastfood$restaurant)) {
  test <- fastfood %>%
    filter(restaurant == ids)   # test now holds only this restaurant's items
  qqnorm(test$sodium, main = ids)
  qqline(test$sodium)
}

This code draws a normal probability plot of sodium for each restaurant. Looking through all the restaurants, Burger King appears to be the closest to a normal distribution, with Arby's a close second.
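A ggplot2 alternative to the base-R loop (a sketch) is a single faceted normal probability plot, which puts all the restaurants side by side:

ggplot(fastfood, aes(sample = sodium)) +
  geom_qq() +                                 # normal probability (Q-Q) points
  geom_qq_line() +                            # reference line for each panel
  facet_wrap(~ restaurant, scales = "free")   # one panel per restaurant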

8. Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

Some of the sodium plots have a stepwise pattern. This could partly be because the data are a bit skewed, but it is more likely because sodium values are reported as rounded numbers, so many items share exactly the same value and the plot rises in flat steps.
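One way to check this idea (a sketch) is to count how often individual sodium values repeat; many ties at rounded values would produce the flat steps seen in the plots:

fastfood %>%
  count(sodium, sort = TRUE) %>%  # how many items share each sodium value
  head()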

9. As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

qqnorm(mcdonalds$total_carb, main = "Mcdonalds Carbs")
qqline(mcdonalds$total_carb)

mcmean <- mean(mcdonalds$total_carb)  # mean total carbohydrates, McDonald's
mcsd   <- sd(mcdonalds$total_carb)    # standard deviation, McDonald's
ggplot(data = mcdonalds, aes(x = total_carb)) +
  geom_blank() +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, args = c(mean = mcmean, sd = mcsd), col = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

I chose McDonald's total carbohydrates because I like McDonald's. The normal probability plot and the histogram both show the data skewing to the right; aside from that right tail, the distribution is reasonably close to normal.