title: “Lab 4: Distributions”
author: “Evan McLaughlin”
date: “9.27.2020”
knitr::opts_chunk$set(eval = TRUE, results = FALSE, fig.show = "show", message = FALSE)

library(tidyverse)
library(openintro)
library(ggplot2)
library(dplyr)
library(trelliscopejs)
head(fastfood)

Exercise 1

Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

Both distributions are right-skewed, and McDonald’s has a wider range and a larger mean. The McDonald’s histogram is binned in increments of 200, while the Dairy Queen histogram is binned in increments of 100.

mcdonalds <- fastfood %>% filter(restaurant == 'Mcdonalds')

dairy_queen <- fastfood %>% filter(restaurant == 'Dairy Queen')

summary(mcdonalds$cal_fat)
hist(mcdonalds$cal_fat)

summary(dairy_queen$cal_fat)
hist(dairy_queen$cal_fat)
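
To back the visual comparison with numbers, a short sketch like the following puts center and spread side by side. It uses simulated right-skewed values as stand-ins for `cal_fat`; the vectors `a` and `b` and the helper `describe()` are hypothetical names, not part of the lab. With the real data one would pass `mcdonalds$cal_fat` and `dairy_queen$cal_fat` instead.

```r
# Hypothetical helper: summarise center and spread of a numeric vector.
describe <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x))
}

set.seed(1)
# Simulated right-skewed stand-ins for the two restaurants' cal_fat values.
a <- rgamma(50, shape = 2, scale = 140)
b <- rgamma(40, shape = 3, scale = 85)

# One row per group, one column per statistic.
comparison <- rbind(a = describe(a), b = describe(b))
print(round(comparison, 1))
```

In a right-skewed sample the mean is usually pulled above the median, so reporting both gives a quick check on the shape seen in the histograms.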

Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

Mostly, but not entirely: the histogram spreads out to the right as the density falls off, and there are some notable outliers at the especially high-fat end.

dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)

# Density-scaled histogram with the fitted normal curve overlaid
ggplot(data = dairy_queen, aes(x = cal_fat)) +
        geom_blank() +
        geom_histogram(aes(y = after_stat(density))) +
        stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
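
The normal curve can only be compared with the histogram because the histogram is drawn on the density scale, where the bar areas sum to 1 just as the area under `dnorm` does. A minimal base R sketch of that property, using a simulated vector `x` as a stand-in for the lab data:

```r
set.seed(3)
# Simulated right-skewed stand-in for cal_fat (illustration only).
x <- rgamma(60, shape = 3, scale = 85)

# Compute the histogram without drawing it.
h <- hist(x, plot = FALSE)

# Area of each bar = density height * bin width.
bar_areas <- h$density * diff(h$breaks)

# Total area on the density scale.
total_area <- sum(bar_areas)
print(total_area)
```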

Exercise 3

Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a dataframe, it can be put directly into the sample argument and the data argument can be dropped.)

Not all the points fall on the line. The probability plots for the simulated and real data are similar, though. The simulated data has a larger slope between theoretical quantiles of 0.5 and 1.

library(dplyr)
ggplot(data = dairy_queen, aes(sample = cal_fat)) + 
  geom_line(stat = "qq")

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)

# Normal probability plot of the simulated data
ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_line(stat = "qq")
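
A normal probability plot pairs the sorted sample with the quantiles a normal distribution would produce at the same plotting positions. The sketch below builds those pairs by hand in base R and checks them against `qqnorm()`. The sample `x` is simulated purely for illustration and its mean and standard deviation are arbitrary assumptions.

```r
set.seed(42)
# Simulated sample for illustration only (not the lab data).
x <- rnorm(40, mean = 260, sd = 155)

n <- length(x)
# Theoretical standard-normal quantiles at the default plotting positions.
theoretical <- qnorm(ppoints(n))
# Sample quantiles are simply the sorted observations.
observed <- sort(x)

# qqnorm() computes the same pairs, returned in the original data order.
qq <- qqnorm(x, plot.it = FALSE)
stopifnot(isTRUE(all.equal(sort(qq$x), theoretical)))
stopifnot(isTRUE(all.equal(sort(qq$y), observed)))

# For roughly normal data the points lie close to a straight line,
# so the correlation between the two sets of quantiles is near 1.
qq_cor <- cor(theoretical, observed)
print(qq_cor)
```

Departures from the line in the upper right of such a plot correspond to a right tail that is heavier than the normal model predicts.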

Exercise 4

Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories from fat are nearly normal?

The DQ data is nearly normal and follows a fairly steady slope, similar to the simulated data.

qqnormsim(sample = cal_fat, data = dairy_queen)

Exercise 5

Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

The McDonald’s calories from fat are mostly normal, except that as the theoretical quantile approaches 1.5 the points jump upward compared with the simulated data.

qqnormsim(sample = cal_fat, data = mcdonalds)

Exercise 6

Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

Of my two probability questions, the McDonald’s one shows the closest agreement between the two methods.

What is the probability that a McDonald’s item, chosen at random, has 3 or more grams of fiber?

m_mean <- mean(mcdonalds$fiber)
m_sd <- sd(mcdonalds$fiber)

1 - pnorm(q=3, mean = m_mean, sd = m_sd)

mcdonalds %>%
  filter(fiber >= 3) %>%   # "3 or more" includes items with exactly 3 grams
  summarise(percent = n() / nrow(mcdonalds))

What is the probability that a Taco Bell item, chosen at random, has less than 800 calories?

tb <- fastfood %>%
  filter(restaurant == "Taco Bell")

tb_mean <- mean(tb$calories)
tb_sd <- sd(tb$calories)

# P(calories < 800) is the lower tail, so no subtraction from 1
pnorm(q = 800, mean = tb_mean, sd = tb_sd)

tb %>%
  filter(calories < 800) %>%
  summarise(percent = n() / nrow(tb))
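
The two-method comparison used in this exercise can be wrapped in one small function. This is a sketch with simulated data: `compare_prob()` and the vector `x` are hypothetical names introduced here, and the simulated mean and standard deviation are arbitrary.

```r
# Hypothetical helper: P(X < q) from a fitted normal model and from the data.
compare_prob <- function(x, q) {
  theoretical <- pnorm(q, mean = mean(x), sd = sd(x))
  empirical   <- mean(x < q)   # proportion of observations below q
  c(theoretical = theoretical,
    empirical   = empirical,
    difference  = abs(theoretical - empirical))
}

set.seed(7)
# Simulated stand-in for a calories column (illustration only).
x <- rnorm(100, mean = 450, sd = 180)

result <- compare_prob(x, q = 800)
print(round(result, 3))
```

For an upper-tail question such as "q or more", take one minus the theoretical lower-tail value and use `mean(x >= q)` on the empirical side. A smaller `difference` means the normal model and the data agree more closely for that question.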