The Normal Distribution

The Data

library(tidyverse)
library(openintro)
data(fastfood)
glimpse(fastfood)
## Rows: 515
## Columns: 17
## $ restaurant  <chr> "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdonalds", "Mcdon…
## $ item        <chr> "Artisan Grilled Chicken Sandwich", "Single Bacon Smokehou…
## $ calories    <dbl> 380, 840, 1130, 750, 920, 540, 300, 510, 430, 770, 380, 62…
## $ cal_fat     <dbl> 60, 410, 600, 280, 410, 250, 100, 210, 190, 400, 170, 300,…
## $ total_fat   <dbl> 7, 45, 67, 31, 45, 28, 12, 24, 21, 45, 18, 34, 20, 34, 8, …
## $ sat_fat     <dbl> 2.0, 17.0, 27.0, 10.0, 12.0, 10.0, 5.0, 4.0, 11.0, 21.0, 4…
## $ trans_fat   <dbl> 0.0, 1.5, 3.0, 0.5, 0.5, 1.0, 0.5, 0.0, 1.0, 2.5, 0.0, 1.5…
## $ cholesterol <dbl> 95, 130, 220, 155, 120, 80, 40, 65, 85, 175, 40, 95, 125, …
## $ sodium      <dbl> 1110, 1580, 1920, 1940, 1980, 950, 680, 1040, 1040, 1290, …
## $ total_carb  <dbl> 44, 62, 63, 62, 81, 46, 33, 49, 35, 42, 38, 48, 48, 67, 31…
## $ fiber       <dbl> 3, 2, 3, 2, 4, 3, 2, 3, 2, 3, 2, 3, 3, 5, 2, 2, 3, 3, 5, 2…
## $ sugar       <dbl> 11, 18, 18, 18, 18, 9, 7, 6, 7, 10, 5, 11, 11, 11, 6, 3, 1…
## $ protein     <dbl> 37, 46, 70, 55, 46, 25, 15, 25, 25, 51, 15, 32, 42, 33, 13…
## $ vit_a       <dbl> 4, 6, 10, 6, 6, 10, 10, 0, 20, 20, 2, 10, 10, 10, 2, 4, 6,…
## $ vit_c       <dbl> 20, 20, 20, 25, 20, 2, 2, 4, 4, 6, 0, 10, 20, 15, 2, 6, 15…
## $ calcium     <dbl> 20, 20, 50, 20, 20, 15, 10, 2, 15, 20, 15, 35, 35, 35, 4, …
## $ salad       <chr> "Other", "Other", "Other", "Other", "Other", "Other", "Oth…
mcdonalds <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

Exercise 1 Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

Based on the 2 histograms, the McDonalds data has more of a right skew, with the bulk of the data being on the left. The Dairy Queen data is somewhat more centrally located, with only a minor right skew. That being said, the data for McDonalds covers 1400 calories, and the DQ data is only 700. So the bulk of the data is actually in a similar place for both data sets- between 100-500 calories.

hist(mcdonalds$cal_fat)

hist(dairy_queen$cal_fat)

The normal distribution

dqmean <- mean(dairy_queen$cal_fat)
dqsd <- sd(dairy_queen$cal_fat)
ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 2 Based on the this plot, does it appear that the data follow a nearly normal distribution?

Based on this plot, the data do not follow a nearly normal distribution. The density data well exceeds the curve, there are significant gaps, and the right tail does not fit within the curve.

Evaluating the normal distribution

ggplot(data = dairy_queen, aes(sample = cal_fat)) +
  geom_line(stat = "qq")

library(ggplot2)
sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)
ggplot(data = dairy_queen, aes(sample = sim_norm)) +
  geom_line(stat = "qq")

Exercise 3 Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?

The plots are similar, though the center portion of the simulated data is closer to the line than the real data. The simulated data is more of a “straight” line whereas the real data has several curves and departures.

qqnormsim(sample = cal_fat, data = dairy_queen)

Exercise 4 Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories from fat are nearly normal?

The plot for the real data does look similar to the plots for the simulated data. None of the plots are an exact match for each other, but the real data is quite similar to sim 7 and sim 8 in particular, with the lower and upper tails. I would say the plots provide evidence that the calories from fat are nearly normal.

Exercise 5 Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

ggplot(data = mcdonalds, aes(sample = cal_fat)) +
  geom_line(stat = "qq")

qqnormsim(sample = cal_fat, data = mcdonalds)

Based on the data, the McDonalds data does not follow a normal distribution. The right tail curve does not match any of the simulated data, and the plot itself does not have a similar shape to any of the other data curves.

Normal probabilities

1 - pnorm(q = 600, mean = dqmean, sd = dqsd)
## [1] 0.01501523
dairy_queen %>%
  filter(cal_fat > 600) %>%
  summarise(percent = n() / nrow(dairy_queen))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1  0.0476

Exercise 6 Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

Chick_fil_a <- fastfood %>%
  filter(restaurant == "Chick Fil-A")
arbys <- fastfood %>%
  filter(restaurant == "Arbys")
cfmean <- mean(Chick_fil_a$cal_fat)
cfsd <- sd(Chick_fil_a$cal_fat)

armean <- mean(arbys$cal_fat)
arsd <- sd(arbys$cal_fat)

a Probability that a Chick-fil-A item has more then 300 calories from fat

ggplot(data = Chick_fil_a, aes(sample = cal_fat)) +
  geom_line(stat = "qq")

qqnormsim(sample = cal_fat, data = Chick_fil_a)

1 - pnorm(q = 300, mean = cfmean, sd = cfsd)
## [1] 0.06543432
Chick_fil_a %>%
  filter(cal_fat >300) %>%
  summarise(percent = n() / nrow(Chick_fil_a))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1  0.0741

b Probability that an Arbys item has fewer than 800 calories from fat

ggplot(data = arbys, aes(sample = cal_fat)) +
  geom_line(stat = "qq")

qqnormsim(sample = cal_fat, data = arbys)

1 - pnorm(q = 800, mean = armean, sd = arsd)
## [1] 3.39207e-07
arbys %>%
  filter(cal_fat > 800) %>%
  summarise(percent = n() / nrow(arbys))
## # A tibble: 1 × 1
##   percent
##     <dbl>
## 1       0
LS0tCnRpdGxlOiAiTGFiIDQiCmF1dGhvcjogIktpbSBLZW5kYWwiCmRhdGU6ICJgciBTeXMuRGF0ZSgpYCIKb3V0cHV0OiBvcGVuaW50cm86OmxhYl9yZXBvcnQKLS0tCgoKIyBUaGUgTm9ybWFsIERpc3RyaWJ1dGlvbgoKIyMgVGhlIERhdGEKCmBgYHtyIGxvYWQtcGFja2FnZXMsIG1lc3NhZ2U9RkFMU0V9CmxpYnJhcnkodGlkeXZlcnNlKQpsaWJyYXJ5KG9wZW5pbnRybykKZGF0YShmYXN0Zm9vZCkKZ2xpbXBzZShmYXN0Zm9vZCkKYGBgCgpgYGB7cn0KbWNkb25hbGRzIDwtIGZhc3Rmb29kICU+JQogIGZpbHRlcihyZXN0YXVyYW50ID09ICJNY2RvbmFsZHMiKQpkYWlyeV9xdWVlbiA8LSBmYXN0Zm9vZCAlPiUKICBmaWx0ZXIocmVzdGF1cmFudCA9PSAiRGFpcnkgUXVlZW4iKQpgYGAKCiMjIyBFeGVyY2lzZSAxIE1ha2UgYSBwbG90IChvciBwbG90cykgdG8gdmlzdWFsaXplIHRoZSBkaXN0cmlidXRpb25zIG9mIHRoZSBhbW91bnQgb2YgY2Fsb3JpZXMgZnJvbSBmYXQgb2YgdGhlIG9wdGlvbnMgZnJvbSB0aGVzZSB0d28gcmVzdGF1cmFudHMuIEhvdyBkbyB0aGVpciBjZW50ZXJzLCBzaGFwZXMsIGFuZCBzcHJlYWRzIGNvbXBhcmU/CipCYXNlZCBvbiB0aGUgMiBoaXN0b2dyYW1zLCB0aGUgTWNEb25hbGRzIGRhdGEgaGFzIG1vcmUgb2YgYSByaWdodCBza2V3LCB3aXRoIHRoZSBidWxrIG9mIHRoZSBkYXRhIGJlaW5nIG9uIHRoZSBsZWZ0LiBUaGUgRGFpcnkgUXVlZW4gZGF0YSBpcyBzb21ld2hhdCBtb3JlIGNlbnRyYWxseSBsb2NhdGVkLCB3aXRoIG9ubHkgYSBtaW5vciByaWdodCBza2V3LiBUaGF0IGJlaW5nIHNhaWQsIHRoZSBkYXRhIGZvciBNY0RvbmFsZHMgY292ZXJzIDE0MDAgY2Fsb3JpZXMsIGFuZCB0aGUgRFEgZGF0YSBpcyBvbmx5IDcwMC4gU28gdGhlIGJ1bGsgb2YgdGhlIGRhdGEgaXMgYWN0dWFsbHkgaW4gYSBzaW1pbGFyIHBsYWNlIGZvciBib3RoIGRhdGEgc2V0cy0gYmV0d2VlbiAxMDAtNTAwIGNhbG9yaWVzLioKCgpgYGB7cn0KaGlzdChtY2RvbmFsZHMkY2FsX2ZhdCkKYGBgCgoKYGBge3J9Cmhpc3QoZGFpcnlfcXVlZW4kY2FsX2ZhdCkKYGBgCgoKIyMgVGhlIG5vcm1hbCBkaXN0cmlidXRpb24KCmBgYHtyIGNvZGUtY2h1bmstbGFiZWx9CmRxbWVhbiA8LSBtZWFuKGRhaXJ5X3F1ZWVuJGNhbF9mYXQpCmRxc2QgPC0gc2QoZGFpcnlfcXVlZW4kY2FsX2ZhdCkKYGBgCgoKYGBge3J9CmdncGxvdChkYXRhID0gZGFpcnlfcXVlZW4sIGFlcyh4ID0gY2FsX2ZhdCkpICsKICBnZW9tX2JsYW5rKCkgKwogIGdlb21faGlzdG9ncmFtKGFlcyh5ID0gLi5kZW5zaXR5Li4pKSArCiAgc3RhdF9mdW5jdGlvbihmdW4gPSBkbm9ybSwgYXJncyA9IGMobWVhbiA9IGRxbWVhbiwgc2QgPSBkcXNkKSwgY29sID0gInRvbWF0byIpCmBgYAoKIyMjIEV4ZXJjaXNlIDIgQmFzZWQgb24gdGhlIHRoaXMgcGxvdCwgZG9lcyBpdCBhcHBlYXIgdGhhdCB0aGUgZGF0YSBmb2xsb3cgYSBuZWFybHkgbm9ybWFsIGRpc3RyaWJ1dGlvbj8KKkJhc2VkIG9uIHRoaXMgcGxvdCwgdGhlIGRhdGEgZG8gbm90IGZvbGxvdyBhIG5lYXJseSBub3JtYWwgZGlzdHJpYnV0aW9uLiBUaGUgZGVuc2l0eSBkYXRhIHdlbGwgZXhjZWVkcyB0aGUgY3VydmUsIHRoZXJlIGFyZSBzaWduaWZpY2FudCBnYXBzLCBhbmQgdGhlIHJpZ2h0IHRhaWwgZG9lcyBub3QgZml0IHdpdGhpbiB0aGUgY3VydmUuKgoKCiMjIEV2YWx1YXRpbmcgdGhlIG5vcm1hbCBkaXN0cmlidXRpb24KCmBgYHtyfQpnZ3Bsb3QoZGF0YSA9IGRhaXJ5X3F1ZWVuLCBhZXMoc2FtcGxlID0gY2FsX2ZhdCkpICsKICBnZW9tX2xpbmUoc3RhdCA9ICJxcSIpCmBgYApgYGB7cn0KbGlicmFyeShnZ3Bsb3QyKQpgYGAKCmBgYHtyfQpzaW1fbm9ybSA8LSBybm9ybShuID0gbnJvdyhkYWlyeV9xdWVlbiksIG1lYW4gPSBkcW1lYW4sIHNkID0gZHFzZCkKYGBgCgpgYGB7cn0KZ2dwbG90KGRhdGEgPSBkYWlyeV9xdWVlbiwgYWVzKHNhbXBsZSA9IHNpbV9ub3JtKSkgKwogIGdlb21fbGluZShzdGF0ID0gInFxIikKYGBgCgoKCiMjIyBFeGVyY2lzZSAzIE1ha2UgYSBub3JtYWwgcHJvYmFiaWxpdHkgcGxvdCBvZiBzaW1fbm9ybS4gRG8gYWxsIG9mIHRoZSBwb2ludHMgZmFsbCBvbiB0aGUgbGluZT8gSG93IGRvZXMgdGhpcyBwbG90IGNvbXBhcmUgdG8gdGhlIHByb2JhYmlsaXR5IHBsb3QgZm9yIHRoZSByZWFsIGRhdGE/IAoqVGhlIHBsb3RzIGFyZSBzaW1pbGFyLCB0aG91Z2ggdGhlIGNlbnRlciBwb3J0aW9uIG9mIHRoZSBzaW11bGF0ZWQgZGF0YSBpcyBjbG9zZXIgdG8gdGhlIGxpbmUgdGhhbiB0aGUgcmVhbCBkYXRhLiBUaGUgc2ltdWxhdGVkIGRhdGEgaXMgbW9yZSBvZiBhICJzdHJhaWdodCIgbGluZSB3aGVyZWFzIHRoZSByZWFsIGRhdGEgaGFzIHNldmVyYWwgY3VydmVzIGFuZCBkZXBhcnR1cmVzLioKCgoKYGBge3J9CnFxbm9ybXNpbShzYW1wbGUgPSBjYWxfZmF0LCBkYXRhID0gZGFpcnlfcXVlZW4pCmBgYAoKIyMjIEV4ZXJjaXNlIDQgRG9lcyB0aGUgbm9ybWFsIHByb2JhYmlsaXR5IHBsb3QgZm9yIHRoZSBjYWxvcmllcyBmcm9tIGZhdCBsb29rIHNpbWlsYXIgdG8gdGhlIHBsb3RzIGNyZWF0ZWQgZm9yIHRoZSBzaW11bGF0ZWQgZGF0YT8gVGhhdCBpcywgZG8gdGhlIHBsb3RzIHByb3ZpZGUgZXZpZGVuY2UgdGhhdCB0aGUgY2Fsb3JpZXMgZnJvbSBmYXQgYXJlIG5lYXJseSBub3JtYWw/CipUaGUgcGxvdCBmb3IgdGhlIHJlYWwgZGF0YSBkb2VzIGxvb2sgc2ltaWxhciB0byB0aGUgcGxvdHMgZm9yIHRoZSBzaW11bGF0ZWQgZGF0YS4gTm9uZSBvZiB0aGUgcGxvdHMgYXJlIGFuIGV4YWN0IG1hdGNoIGZvciBlYWNoIG90aGVyLCBidXQgdGhlIHJlYWwgZGF0YSBpcyBxdWl0ZSBzaW1pbGFyIHRvIHNpbSA3IGFuZCBzaW0gOCBpbiBwYXJ0aWN1bGFyLCB3aXRoIHRoZSBsb3dlciBhbmQgdXBwZXIgdGFpbHMuIEkgd291bGQgc2F5IHRoZSBwbG90cyBwcm92aWRlIGV2aWRlbmNlIHRoYXQgdGhlIGNhbG9yaWVzIGZyb20gZmF0IGFyZSBuZWFybHkgbm9ybWFsLioKCiMjIyBFeGVyY2lzZSA1IFVzaW5nIHRoZSBzYW1lIHRlY2huaXF1ZSwgZGV0ZXJtaW5lIHdoZXRoZXIgb3Igbm90IHRoZSBjYWxvcmllcyBmcm9tIE1jRG9uYWxk4oCZcyBtZW51IGFwcGVhciB0byBjb21lIGZyb20gYSBub3JtYWwgZGlzdHJpYnV0aW9uLgoKYGBge3J9CmdncGxvdChkYXRhID0gbWNkb25hbGRzLCBhZXMoc2FtcGxlID0gY2FsX2ZhdCkpICsKICBnZW9tX2xpbmUoc3RhdCA9ICJxcSIpCmBgYAoKYGBge3J9CnFxbm9ybXNpbShzYW1wbGUgPSBjYWxfZmF0LCBkYXRhID0gbWNkb25hbGRzKQpgYGAKCipCYXNlZCBvbiB0aGUgZGF0YSwgdGhlIE1jRG9uYWxkcyBkYXRhIGRvZXMgbm90IGZvbGxvdyBhIG5vcm1hbCBkaXN0cmlidXRpb24uIFRoZSByaWdodCB0YWlsIGN1cnZlIGRvZXMgbm90IG1hdGNoIGFueSBvZiB0aGUgc2ltdWxhdGVkIGRhdGEsIGFuZCB0aGUgcGxvdCBpdHNlbGYgZG9lcyBub3QgaGF2ZSBhIHNpbWlsYXIgc2hhcGUgdG8gYW55IG9mIHRoZSBvdGhlciBkYXRhIGN1cnZlcy4qCgojIyBOb3JtYWwgcHJvYmFiaWxpdGllcwoKYGBge3J9CjEgLSBwbm9ybShxID0gNjAwLCBtZWFuID0gZHFtZWFuLCBzZCA9IGRxc2QpCgpgYGAKCgpgYGB7cn0KZGFpcnlfcXVlZW4gJT4lCiAgZmlsdGVyKGNhbF9mYXQgPiA2MDApICU+JQogIHN1bW1hcmlzZShwZXJjZW50ID0gbigpIC8gbnJvdyhkYWlyeV9xdWVlbikpCmBgYAoKIyMjIEV4ZXJjaXNlIDYgV3JpdGUgb3V0IHR3byBwcm9iYWJpbGl0eSBxdWVzdGlvbnMgdGhhdCB5b3Ugd291bGQgbGlrZSB0byBhbnN3ZXIgYWJvdXQgYW55IG9mIHRoZSByZXN0YXVyYW50cyBpbiB0aGlzIGRhdGFzZXQuIENhbGN1bGF0ZSB0aG9zZSBwcm9iYWJpbGl0aWVzIHVzaW5nIGJvdGggdGhlIHRoZW9yZXRpY2FsIG5vcm1hbCBkaXN0cmlidXRpb24gYXMgd2VsbCBhcyB0aGUgZW1waXJpY2FsIGRpc3RyaWJ1dGlvbiAoZm91ciBwcm9iYWJpbGl0aWVzIGluIGFsbCkuIFdoaWNoIG9uZSBoYWQgYSBjbG9zZXIgYWdyZWVtZW50IGJldHdlZW4gdGhlIHR3byBtZXRob2RzPwoKCmBgYHtyfQpDaGlja19maWxfYSA8LSBmYXN0Zm9vZCAlPiUKICBmaWx0ZXIocmVzdGF1cmFudCA9PSAiQ2hpY2sgRmlsLUEiKQphcmJ5cyA8LSBmYXN0Zm9vZCAlPiUKICBmaWx0ZXIocmVzdGF1cmFudCA9PSAiQXJieXMiKQpgYGAKCmBgYHtyfQpjZm1lYW4gPC0gbWVhbihDaGlja19maWxfYSRjYWxfZmF0KQpjZnNkIDwtIHNkKENoaWNrX2ZpbF9hJGNhbF9mYXQpCgphcm1lYW4gPC0gbWVhbihhcmJ5cyRjYWxfZmF0KQphcnNkIDwtIHNkKGFyYnlzJGNhbF9mYXQpCmBgYAoKIyMjIyBhIFByb2JhYmlsaXR5IHRoYXQgYSBDaGljay1maWwtQSBpdGVtIGhhcyBtb3JlIHRoZW4gMzAwIGNhbG9yaWVzIGZyb20gZmF0CmBgYHtyfQpnZ3Bsb3QoZGF0YSA9IENoaWNrX2ZpbF9hLCBhZXMoc2FtcGxlID0gY2FsX2ZhdCkpICsKICBnZW9tX2xpbmUoc3RhdCA9ICJxcSIpCnFxbm9ybXNpbShzYW1wbGUgPSBjYWxfZmF0LCBkYXRhID0gQ2hpY2tfZmlsX2EpCjEgLSBwbm9ybShxID0gMzAwLCBtZWFuID0gY2ZtZWFuLCBzZCA9IGNmc2QpCkNoaWNrX2ZpbF9hICU+JQogIGZpbHRlcihjYWxfZmF0ID4zMDApICU+JQogIHN1bW1hcmlzZShwZXJjZW50ID0gbigpIC8gbnJvdyhDaGlja19maWxfYSkpCmBgYAoKCiMjIyMgYiBQcm9iYWJpbGl0eSB0aGF0IGFuIEFyYnlzIGl0ZW0gaGFzIGZld2VyIHRoYW4gODAwIGNhbG9yaWVzIGZyb20gZmF0CgpgYGB7cn0KZ2dwbG90KGRhdGEgPSBhcmJ5cywgYWVzKHNhbXBsZSA9IGNhbF9mYXQpKSArCiAgZ2VvbV9saW5lKHN0YXQgPSAicXEiKQpxcW5vcm1zaW0oc2FtcGxlID0gY2FsX2ZhdCwgZGF0YSA9IGFyYnlzKQoxIC0gcG5vcm0ocSA9IDgwMCwgbWVhbiA9IGFybWVhbiwgc2QgPSBhcnNkKQphcmJ5cyAlPiUKICBmaWx0ZXIoY2FsX2ZhdCA+IDgwMCkgJT4lCiAgc3VtbWFyaXNlKHBlcmNlbnQgPSBuKCkgLyBucm93KGFyYnlzKSkKYGBgCgo=