R Lab 4 Distributions Tidyverse

Author

Rachel Saidi

Published

February 18, 2021

Loading packages

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.2.2
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.4      ✔ forcats 1.0.0 
Warning: package 'ggplot2' was built under R version 4.2.2
Warning: package 'tibble' was built under R version 4.2.2
Warning: package 'tidyr' was built under R version 4.2.2
Warning: package 'readr' was built under R version 4.2.2
Warning: package 'purrr' was built under R version 4.2.2
Warning: package 'dplyr' was built under R version 4.2.2
Warning: package 'stringr' was built under R version 4.2.2
Warning: package 'forcats' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(openintro)
Warning: package 'openintro' was built under R version 4.2.2
Loading required package: airports
Warning: package 'airports' was built under R version 4.2.2
Loading required package: cherryblossom
Warning: package 'cherryblossom' was built under R version 4.2.2
Loading required package: usdata
Warning: package 'usdata' was built under R version 4.2.2

Checking out “fastfood” dataset with head() function

data("fastfood", package = "openintro")  # load the dataset explicitly into the workspace
head(fastfood)
# A tibble: 6 × 17
  restaur…¹ item  calor…² cal_fat total…³ sat_fat trans…⁴ chole…⁵ sodium total…⁶
  <chr>     <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
1 Mcdonalds Arti…     380      60       7       2     0        95   1110      44
2 Mcdonalds Sing…     840     410      45      17     1.5     130   1580      62
3 Mcdonalds Doub…    1130     600      67      27     3       220   1920      63
4 Mcdonalds Gril…     750     280      31      10     0.5     155   1940      62
5 Mcdonalds Cris…     920     410      45      12     0.5     120   1980      81
6 Mcdonalds Big …     540     250      28      10     1        80    950      46
# … with 7 more variables: fiber <dbl>, sugar <dbl>, protein <dbl>,
#   vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>, and abbreviated
#   variable names ¹​restaurant, ²​calories, ³​total_fat, ⁴​trans_fat,
#   ⁵​cholesterol, ⁶​total_carb

Filtering for McDonalds and Dairy Queen

mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")

dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

Exercise 1

Create a histogram with an overlaid normal density curve of calories from fat for Dairy Queen and McDonalds

dqmean <- mean(dairy_queen$cal_fat)
dqsd   <- sd(dairy_queen$cal_fat)
ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), fill = "light green") +
  stat_function(fun = dnorm, args = list(mean = dqmean, sd = dqsd), color = "purple") +
  ggtitle("Histogram and Density Curve of Calories from Fat in Dairy Queen Products") +
  theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Creating McDonalds histogram with overlaid normal density curve for calories from fat

# Use the McDonalds mean and sd for the curve (not the Dairy Queen values)
mcmean <- mean(mcdonalds$cal_fat)
mcsd   <- sd(mcdonalds$cal_fat)
ggplot(data = mcdonalds, aes(x = cal_fat)) +
  geom_blank() +
  # after_stat(density) replaces the deprecated stat(density) notation
  geom_histogram(aes(y = after_stat(density)), fill = "light green") +
  stat_function(fun = dnorm, args = list(mean = mcmean, sd = mcsd), col = "purple") +
  theme_bw() +
  ggtitle("Histogram of Calories from Fat from McDonalds Products")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It appears that the distribution of calories from fat for Dairy Queen products is roughly normal, while the distribution for McDonalds products is strongly skewed to the right.

Exercise 2

Based on this plot, does it appear that the data follow a nearly normal distribution?

Answer: The distribution of calories from fat for Dairy Queen’s items is close to normal (bell-shaped), whereas the distribution for McDonalds’ items is not (this is less apparent in the histograms but is made much clearer by the density curve). The center of McDonalds’ curve is around 280 calories from fat, while the center of Dairy Queen’s curve is around 250 calories from fat. McDonalds’ curve is heavily skewed to the right, with a maximum value of over 1250 calories from fat. Dairy Queen’s curve is far less skewed, with a slight skew to the right and a maximum value around 675 calories from fat.

Evaluating the Normal Distribution

Constructing a normal probability plot of Dairy Queen’s “cal_fat”

ggplot(data= dairy_queen, aes(sample= cal_fat)) + 
  geom_line(stat= "qq", color = "purple") + stat_qq_line()+
    ggtitle("Quantile Plot of Calories from Fat from Dairy Queen Products") +
  theme_bw()

Simulating data from a normal distribution

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
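Because rnorm() draws random values, the simulated sample (and any plot built from it) changes on every run. A minimal sketch of making such a simulation reproducible with set.seed() follows; the mean, standard deviation, sample size, and seed are hypothetical stand-ins for illustration, not values taken from the lab's data.

```r
# Hypothetical stand-ins for dqmean, dqsd, and nrow(dairy_queen);
# these are illustrative values, not figures from the dataset.
example_mean <- 260
example_sd   <- 155
example_n    <- 42

set.seed(2021)  # arbitrary seed, chosen only for illustration
sim_a <- rnorm(n = example_n, mean = example_mean, sd = example_sd)

set.seed(2021)  # resetting the same seed reproduces the same draws
sim_b <- rnorm(n = example_n, mean = example_mean, sd = example_sd)

identical(sim_a, sim_b)  # TRUE: both runs produce the same sample
```

Calling set.seed() once before the simulation is enough to make a rendered report show the same Q-Q plot each time.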

Exercise 3

Creating normal probability plot of “sim_norm”

ggplot(data= NULL, aes(sample= sim_norm)) +
  geom_line(stat= "qq", color = "purple") + stat_qq_line() +
  theme_bw()

Answer: The points do not fall exactly on the diagonal reference line, but this plot follows the line more closely than the Dairy Queen plot does, so the simulated data are closer to a normal distribution than the Dairy Queen data. Notably, the values here tend to be above the diagonal line, while most of the values in our Dairy Queen distribution fall below it.

Create many Q-Q plot simulations against Dairy Queen data

qqnormsim(sample = cal_fat, data = dairy_queen) +
  theme_bw()

Exercise 4

Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data?

Answer: Yes, the Dairy Queen “cal_fat” normal probability plot is pretty closely aligned with all the simulated data probability plots, although it curves slightly below the y=x line while the simulations did not. The simulations are generally more closely aligned with the y=x line than our Dairy Queen data, although sim 2 is arguably around the same closeness to the Dairy Queen data. Sim 2 also has a clear “s” shape that other sims and our Dairy Queen data generally don’t have.
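The visual comparison above can be complemented with a rough numeric check. The sketch below is not how qqnormsim() is implemented; it only illustrates the same idea in base R: simulate several normal samples with the observed mean and standard deviation, then compare how straight each Q-Q plot is using the correlation between theoretical and sample quantiles. The data vector, the helper name qq_cor, and the choice of eight simulations are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed stand-in for an observed variable such as cal_fat
example_values <- rexp(40, rate = 1 / 250)

# Hypothetical helper: correlation between theoretical and sample quantiles.
# Values near 1 mean the Q-Q points lie close to a straight line.
qq_cor <- function(v) {
  qq <- qqnorm(v, plot.it = FALSE)  # returns the Q-Q coordinates without plotting
  cor(qq$x, qq$y)
}

# Eight normal samples that share the observed size, mean, and sd
sims <- replicate(
  8,
  rnorm(length(example_values), mean(example_values), sd(example_values)),
  simplify = FALSE
)

sim_cors <- vapply(sims, qq_cor, numeric(1))
obs_cor  <- qq_cor(example_values)

# Compare the observed straightness with the range seen under normality
obs_cor
range(sim_cors)
```

If the observed correlation sits well below the range of the simulated ones, that supports the visual impression that the data depart from normality.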

Exercise 5

Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

Create many Q-Q plot simulations against McDonalds data

qqnormsim(sample= cal_fat, data= mcdonalds) +
  theme_bw()

Answer: The McDonalds data do not appear to come from a normal distribution. The Q-Q plot more closely resembles an exponential or cubic growth curve than a straight line.

Calculate Z score

Answer “What is the probability that a randomly chosen Dairy Queen product has more than 600 calories from fat?” while assuming normal distribution (theoretical probability)

1 - pnorm(q= 600, mean= dqmean, sd= dqsd)
[1] 0.01501523
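The pnorm() call above standardizes the cutoff internally. A short sketch of the explicit Z-score route is shown here to make that step visible; the mean and standard deviation are hypothetical stand-ins for dqmean and dqsd, so the resulting probability is not the lab's result.

```r
# Hypothetical stand-ins for dqmean and dqsd (illustrative values only)
example_mean <- 260
example_sd   <- 155
cutoff       <- 600

# Z score: how many standard deviations the cutoff lies above the mean
z <- (cutoff - example_mean) / example_sd

# Upper-tail probability from the standard normal distribution
p_from_z <- pnorm(z, lower.tail = FALSE)

# Same probability computed directly on the original scale
p_direct <- 1 - pnorm(q = cutoff, mean = example_mean, sd = example_sd)

all.equal(p_from_z, p_direct)  # TRUE: the two routes agree
```

Using lower.tail = FALSE avoids the subtraction from 1, which can lose precision when the tail probability is very small.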

Calculate the same probability but empirically

dairy_queen %>%
  filter(cal_fat > 600) %>%
  summarise(percent= n() / nrow(dairy_queen))
# A tibble: 1 × 1
  percent
    <dbl>
1  0.0476

Exercise 6

“What is the probability that a randomly chosen Chick Fil-A product has more than 600 calories from fat?”

Filter for Chick Fil-A

chick_fil_a <- fastfood %>%
  filter(restaurant == "Chick Fil-A")

Theoretical probability

1 - pnorm(q= 600, mean= mean(chick_fil_a$cal_fat), sd= sd(chick_fil_a$cal_fat))
[1] 4.46471e-06

Empirical probability

chick_fil_a %>%
  filter(cal_fat > 600) %>%
  summarise(percent= n() / nrow(chick_fil_a))
# A tibble: 1 × 1
  percent
    <dbl>
1       0

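Answer: None of the Chick Fil-A items in the dataset exceeds 600 calories from fat, so the empirical probability is 0. The theoretical probability under the normal model is about 4.5e-06, which is effectively zero as well (healthy! well, sort of).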

“What is the probability that a randomly chosen Taco Bell product has more than 600 calories from fat?”

Filter for Taco Bell

taco_bell <- fastfood %>%
  filter(restaurant == "Taco Bell")

Theoretical probability

1 - pnorm(q= 600, mean= mean(taco_bell$cal_fat), sd= sd(taco_bell$cal_fat))
[1] 5.932939e-07

Empirical probability

taco_bell %>%
  filter(cal_fat > 600) %>%
  summarise(percent= n() / nrow(taco_bell))
# A tibble: 1 × 1
  percent
    <dbl>
1       0

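Answer: The empirical probability of randomly selecting a Taco Bell item above 600 calories from fat is also 0, and the theoretical probability under the normal model is about 5.9e-07, again effectively zero.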

Both empirical probabilities closely match the theoretical probabilities, mainly because I chose a relatively high cutoff (neither Taco Bell nor Chick Fil-A appears to have any items with over 600 calories from fat). I chose 600 so I could compare Chick Fil-A and Taco Bell with McDonalds and Dairy Queen.
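Since the same pair of calculations is repeated for each restaurant, it could be wrapped in a small function. The sketch below is one possible way to do that in base R; the function name tail_probs and the example values are hypothetical and are not part of the original lab.

```r
# Hypothetical helper: theoretical (normal model) and empirical probability
# that a value of x exceeds the cutoff. Assumes x has no missing values.
tail_probs <- function(x, cutoff) {
  c(
    theoretical = pnorm(cutoff, mean = mean(x), sd = sd(x), lower.tail = FALSE),
    empirical   = mean(x > cutoff)  # share of observations above the cutoff
  )
}

# Made-up calories-from-fat values, used only to show the call
example_cal_fat <- c(120, 180, 200, 250, 310, 400, 450, 620)

tail_probs(example_cal_fat, cutoff = 600)
```

With the lab's data, a call such as tail_probs(taco_bell$cal_fat, cutoff = 600) would return both probabilities in one step.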

Exercise 7

Filter for Sonic, Arbys, Burger King and Subway

sonic <- fastfood %>%
  filter(restaurant == "Sonic")

arbys <- fastfood %>%
  filter(restaurant == "Arbys")

burger_king <- fastfood %>%
  filter(restaurant == "Burger King")

subway <- fastfood %>%
  filter(restaurant == "Subway")

Create Q-Q plot for Sonic sodium data

ggplot(data= sonic, aes(sample= sodium)) + 
  geom_line(stat= "qq", color = "purple")+stat_qq_line() +
  ggtitle("QQPlot of Sodium Content in Sonic Products")+
  theme_bw()

Create Q-Q plot for Arbys sodium data

ggplot(data= arbys, aes(sample= sodium)) + 
  geom_line(stat= "qq", color = "purple")+stat_qq_line() +
  ggtitle("QQPlot of Sodium Content in Arbys Products")+
  theme_bw()

Create Q-Q plot for Burger King sodium data

ggplot(data= burger_king, aes(sample= sodium)) + 
  geom_line(stat= "qq",color = "purple")+ stat_qq_line() +
  ggtitle("QQPlot of Sodium Content in Burger King Products")+
  theme_bw()

Create Q-Q plot for Subway sodium data

ggplot(data= subway, aes(sample= sodium)) + 
  geom_line(stat= "qq",color = "purple")+ stat_qq_line() +
  ggtitle("QQPlot of Sodium Content in Subway Products")+
  theme_bw()

Creating Q-Q plot for McDonalds sodium data

ggplot(data= mcdonalds, aes(sample= sodium)) + 
  geom_line(stat= "qq",color = "purple")+stat_qq_line() +
  ggtitle("QQPlot of Sodium Content in McDonalds Products")+
  theme_bw()

Creating Q-Q plot for Dairy Queen sodium data

ggplot(data= dairy_queen, aes(sample= sodium)) + 
  geom_line(stat= "qq",color = "purple")+stat_qq_line() +
  ggtitle("QQPlot of Sodium Content in Dairy Queen Products")+
  theme_bw()

Creating Q-Q plot for Taco Bell sodium data

ggplot(data= taco_bell, aes(sample= sodium)) + 
  geom_line(stat= "qq",color = "purple")+stat_qq_line() +
  ggtitle("QQPlot of Sodium Content in Taco Bell Products")+
  theme_bw()

Creating Q-Q plot for Chick Fil-A sodium data

ggplot(data= chick_fil_a, aes(sample= sodium)) + 
  geom_line(stat= "qq",color = "purple")+stat_qq_line() +
  ggtitle("QQPlot of Sodium Content in Chick Fil-A Products")+
  theme_bw()

Arbys and Burger King appear to have the closest to normal distributions for their sodium data.
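Comparing eight separate plots is easier when they share one figure. The sketch below shows one way to draw a Q-Q panel per restaurant with facet_wrap(); it uses simulated stand-in data with made-up restaurant labels so that it runs on its own, and it draws points with stat_qq() rather than the connected lines used above.

```r
library(ggplot2)

# Hypothetical stand-in data: three made-up restaurants with simulated
# sodium values. With the lab's data, `fastfood` would take its place.
set.seed(42)  # arbitrary seed so the sketch is reproducible
example_sodium <- data.frame(
  restaurant = rep(c("A", "B", "C"), each = 30),
  sodium     = c(rnorm(30, 1200, 300), rexp(30, 1 / 1000), runif(30, 500, 2000))
)

# One Q-Q panel per restaurant; free y scales because sodium ranges differ
qq_panels <- ggplot(example_sodium, aes(sample = sodium)) +
  stat_qq(color = "purple") +
  stat_qq_line() +
  facet_wrap(~ restaurant, scales = "free_y") +
  ggtitle("Q-Q Plots of Sodium Content by Restaurant (simulated data)") +
  theme_bw()

qq_panels
```

Each panel gets its own reference line because stat_qq_line() is computed separately within every facet.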

Exercise 8

Explore histograms of Taco Bell and Burger King sodium data since they both showed stepwise patterns in their Q-Q plots

ggplot(data = taco_bell, aes(x = sodium)) +
  geom_blank() +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), bins = 7, fill = "light green") +
  stat_function(fun = dnorm, args = list(mean = mean(taco_bell$sodium),
                                         sd = sd(taco_bell$sodium)), col = "purple") +
  ggtitle("Histogram and Density Curve of Sodium Content in Taco Bell Products") +
  theme_bw()

ggplot(data = burger_king, aes(x = sodium)) +
  geom_blank() +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), bins = 7, fill = "light green") +
  stat_function(fun = dnorm, args = list(mean = mean(burger_king$sodium),
                                         sd = sd(burger_king$sodium)), col = "purple") +
  ggtitle("Histogram and Density Curve of Sodium Content in Burger King Products") +
  theme_bw()

The restaurants with the most normal distributions appear to be the ones with stepwise distributions. Perhaps there is a correlation?

Exercise 9

Create a normal probability plot for the total carbs of Taco Bell items

ggplot(data= taco_bell, aes(sample= total_carb)) + 
  geom_line(stat= "qq",color = "purple")+stat_qq_line() +
  ggtitle("QQPlot of Total Carbs Content in Taco Bell Products")+
  theme_bw()

Create density histogram of the total carbs of Taco Bell items with normal distribution curve

ggplot(data = taco_bell, aes(x = total_carb)) +
  geom_blank() +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), fill = "light green", bins = 7) +
  stat_function(fun = dnorm, args = list(mean = mean(taco_bell$total_carb),
                                         sd = sd(taco_bell$total_carb)), col = "purple") +
  # ggtitle() is attached with + so the title is drawn on the plot
  ggtitle("Density Curve of Total Carbs for Taco Bell") +
  theme_bw()

Total carbs data of Taco Bell items is skewed to the right.