Data 606 Lab 4

Libraries

library(tidyverse)

library(openintro)

data("fastfood", package='openintro')
head(fastfood)

## # A tibble: 6 × 17
##   restaurant item       calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Mcdonalds  Artisan G…      380      60         7       2       0            95
## 2 Mcdonalds  Single Ba…      840     410        45      17       1.5         130
## 3 Mcdonalds  Double Ba…     1130     600        67      27       3           220
## 4 Mcdonalds  Grilled B…      750     280        31      10       0.5         155
## 5 Mcdonalds  Crispy Ba…      920     410        45      12       0.5         120
## 6 Mcdonalds  Big Mac         540     250        28      10       1            80
## # ℹ 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>, sugar <dbl>,
## #   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>

You’ll see that for every observation there are 17 measurements, many of which are nutritional facts.

You’ll be focusing on just three columns to get started: restaurant, calories, calories from fat.

Let’s first focus on just products from McDonalds and Dairy Queen.

Exercise 1

Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

McDonald’s graph

ggplot(data = mcdonalds, aes(x = cal_fat)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Dairy Queen graph

ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
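
To put numbers on the centers and spreads that the histograms show, a summary table can help. The following is a minimal sketch, assuming the dplyr and openintro packages are installed; the object name `fat_summary` and the column names it creates are chosen here for illustration only.

```r
library(dplyr)
library(openintro)

# Summarise calories from fat for the two restaurants compared above.
fat_summary <- fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen")) %>%
  group_by(restaurant) %>%
  summarise(
    n      = n(),
    mean   = mean(cal_fat),    # center
    median = median(cal_fat),  # center, less sensitive to outliers
    sd     = sd(cal_fat),      # spread
    iqr    = IQR(cal_fat)      # spread, less sensitive to outliers
  )

fat_summary
```

A mean well above the median would be consistent with a right-skewed shape.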

dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)

Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

I don’t think the data follow a normal distribution closely. The histogram is not symmetric: more of the data appears to be spread out toward the right.

ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)

Exercise 3

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)

Not every point falls exactly on a line, but most of the simulated data lie close to a straight line, which is what a roughly normal sample should look like. This plot looks somewhat closer to a straight line than the probability plot for the real data.

# sim_norm is a plain vector, so no data frame is needed
ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_line(stat = "qq")
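
The exercise asks whether the points fall on the line, but `geom_line(stat = "qq")` only connects the points and draws no reference line. A minimal sketch that adds one with ggplot2's `geom_qq()` and `geom_qq_line()` follows. The seed value and the names `sim_df` and `qq_plot` are assumptions made here for reproducibility and illustration; they are not part of the lab.

```r
library(ggplot2)
library(openintro)

dairy_queen <- subset(fastfood, restaurant == "Dairy Queen")

set.seed(606)  # arbitrary seed, only so the simulated sample is reproducible
sim_df <- data.frame(
  sim_norm = rnorm(
    n    = nrow(dairy_queen),
    mean = mean(dairy_queen$cal_fat),
    sd   = sd(dairy_queen$cal_fat)
  )
)

# Points are the sample quantiles; by default the line passes through the
# first and third quartiles of the sample and of the normal distribution.
qq_plot <- ggplot(sim_df, aes(sample = sim_norm)) +
  geom_qq() +
  geom_qq_line(col = "tomato")

qq_plot
```

Points that bend away from the line at either end indicate tails that are heavier or lighter than a normal distribution would produce.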

Exercise 4

Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?

The simulated plots follow a straight line more consistently than the plot of the real data. Even so, I think the plots show that the calories from fat are nearly normal.

qqnormsim(sample = cal_fat, data = dairy_queen)
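
`qqnormsim()` from openintro draws the normal probability plot of the data next to plots of several simulated normal samples. A rough hand-rolled version of the same idea is sketched below, assuming ggplot2 and openintro are installed. The number of simulated samples (8), the seed, and the names `qq_data` and `panel` are choices made for this sketch, and the layout will not match `qqnormsim()` exactly.

```r
library(ggplot2)
library(openintro)

dairy_queen <- subset(fastfood, restaurant == "Dairy Queen")
x <- dairy_queen$cal_fat

set.seed(606)  # arbitrary seed for reproducibility
n_sims <- 8    # assumed number of simulated samples

# One simulated normal sample per panel, matching the data's n, mean, and sd.
sims <- lapply(seq_len(n_sims), function(i) {
  data.frame(panel = paste("sim", i),
             value = rnorm(length(x), mean = mean(x), sd = sd(x)))
})

# Stack the real data on top of the simulated samples.
qq_data <- rbind(data.frame(panel = "data", value = x), do.call(rbind, sims))

ggplot(qq_data, aes(sample = value)) +
  geom_qq() +
  facet_wrap(~panel)
```

If the "data" panel would be hard to pick out from the simulated panels, that is informal evidence that the variable is nearly normal.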

Exercise 5

Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

Using the same technique, the calories from fat on the McDonald’s menu appear to follow a roughly normal distribution.

# Use McDonald's own mean and standard deviation, not Dairy Queen's
mcmean <- mean(mcdonalds$cal_fat)
mcsd   <- sd(mcdonalds$cal_fat)

ggplot(data = mcdonalds, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = c(mean = mcmean, sd = mcsd), col = "tomato")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

sim_norm2 <- rnorm(n = nrow(mcdonalds), mean = mcmean, sd = mcsd)

ggplot(data = NULL, aes(sample = sim_norm2)) +
  geom_line(stat = "qq")

qqnormsim(sample = cal_fat, data = mcdonalds)

Exercise 6

Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

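I would like to know the probability of finding an item with between 500 and 700 calories from fat at Dairy Queen and at McDonald’s. For Dairy Queen, the empirical probability (0.0714) is slightly higher than the theoretical one (0.0604). For McDonald’s it is the reverse: the theoretical probability (0.1356) is well above the empirical one (0.0526). The two methods agree more closely for Dairy Queen.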

pnorm(q = 700, mean = dqmean, sd = dqsd) -
  pnorm(q = 500, mean = dqmean, sd = dqsd)

## [1] 0.0604411

dairy_queen %>% 
  filter(cal_fat >= 500 & cal_fat <= 700) %>%
  summarise(percent = n() / nrow(dairy_queen))

## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1  0.0714

Mean and standard deviation for McDonald’s

mcmean <- mean(mcdonalds$cal_fat)
mcsd   <- sd(mcdonalds$cal_fat)

pnorm(q = 700, mean = mcmean, sd = mcsd) -
  pnorm(q = 500, mean = mcmean, sd = mcsd)

## [1] 0.1355608

mcdonalds %>% 
  filter(cal_fat >= 500 & cal_fat <= 700) %>%
  summarise(percent = n() / nrow(mcdonalds))

## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1  0.0526
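
The four calculations above repeat the same two steps, so they can be wrapped in a small helper. This is a sketch under assumptions: the function name `compare_probs` and its arguments are hypothetical, and it treats the interval as closed on both ends for the empirical proportion, as the `filter()` calls above do.

```r
library(openintro)

# Hypothetical helper: probability that x falls in [lower, upper],
# once from a fitted normal model and once from the observed data.
compare_probs <- function(x, lower, upper) {
  theoretical <- pnorm(upper, mean = mean(x), sd = sd(x)) -
    pnorm(lower, mean = mean(x), sd = sd(x))
  empirical <- mean(x >= lower & x <= upper)
  c(theoretical = theoretical,
    empirical   = empirical,
    abs_diff    = abs(theoretical - empirical))
}

dq_fat <- fastfood$cal_fat[fastfood$restaurant == "Dairy Queen"]
mc_fat <- fastfood$cal_fat[fastfood$restaurant == "Mcdonalds"]

compare_probs(dq_fat, 500, 700)
compare_probs(mc_fat, 500, 700)
```

The restaurant with the smaller `abs_diff` is the one where the two methods agree more closely.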

Exercise 7

Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

Judging visually from the plots below, the restaurant whose sodium distribution is closest to normal is Arby’s.

ggplot(data = fastfood) +
  geom_histogram(aes(x = sodium)) +
  facet_wrap(~restaurant)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qqnormsim(sample = sodium, data = fastfood)

ggplot(data = fastfood, aes(sample = sodium)) +
  geom_line(stat = "qq") +
  facet_wrap(~restaurant)

Exercise 8

Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

This may be due to the way the data were recorded. Looking at the values in the column, the sodium amounts appear to be rounded to the nearest 10, so many items share the same value and the points line up in steps.
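
The rounding explanation can be checked directly. The sketch below, assuming dplyr and openintro are installed, reports for each restaurant what share of sodium values are exact multiples of 10 and how many distinct values occur; the names `sodium_rounding`, `share_mult_10`, and `distinct_values` are made up for this example. A share near 1, together with few distinct values relative to the number of items, would support the rounding explanation.

```r
library(dplyr)
library(openintro)

sodium_rounding <- fastfood %>%
  filter(!is.na(sodium)) %>%
  group_by(restaurant) %>%
  summarise(
    items           = n(),
    share_mult_10   = mean(sodium %% 10 == 0),  # fraction that are multiples of 10
    distinct_values = n_distinct(sodium)        # ties produce the steps in a QQ plot
  )

sodium_rounding
```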

Exercise 9

As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

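The normal curve appeared completely flat when its mean and standard deviation were computed from cal_fat, which is on a different scale from total_carb. The overlay below uses the mean and standard deviation of total_carb instead, so the curve and the histogram can be compared to judge the skew.

chick <- fastfood %>%
  filter(restaurant == "Chick Fil-A")

ggplot(data = chick, aes(sample = total_carb)) +
  geom_line(stat = "qq")

# Parameters for the normal curve must come from the plotted variable
chickmean <- mean(chick$total_carb)
chicksd   <- sd(chick$total_carb)

ggplot(data = chick, aes(x = total_carb)) +
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = c(mean = chickmean, sd = chicksd), col = "tomato")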

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
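
A quick numeric check can back up the visual reading of skewness. As a rule of thumb (not a guarantee), a mean noticeably above the median suggests right skew and a mean below the median suggests left skew. A minimal sketch, assuming dplyr and openintro are installed; `carb_summary` and its column names are chosen for this example.

```r
library(dplyr)
library(openintro)

carb_summary <- fastfood %>%
  filter(restaurant == "Chick Fil-A") %>%
  summarise(
    mean_carb   = mean(total_carb, na.rm = TRUE),
    median_carb = median(total_carb, na.rm = TRUE),
    # positive: mean above median (rule-of-thumb sign of right skew)
    mean_minus_median = mean_carb - median_carb
  )

carb_summary
```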