Hazal Gunduz

The normal distribution

In this lab, you’ll investigate the probability distribution that is most central to statistics: the normal distribution. If you are confident that your data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
data("fastfood", package = 'openintro')
head(fastfood)
## # A tibble: 6 × 17
##   restaurant item       calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>      <chr>         <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Mcdonalds  Artisan G…      380      60         7       2       0            95
## 2 Mcdonalds  Single Ba…      840     410        45      17       1.5         130
## 3 Mcdonalds  Double Ba…     1130     600        67      27       3           220
## 4 Mcdonalds  Grilled B…      750     280        31      10       0.5         155
## 5 Mcdonalds  Crispy Ba…      920     410        45      12       0.5         120
## 6 Mcdonalds  Big Mac         540     250        28      10       1            80
## # … with 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
## #   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
## #   salad <chr>
mcdonals <- fastfood %>%
  filter(restaurant == "Mcdonalds")
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

Exercise 1. Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants. How do their centers, shapes, and spreads compare?

summary(mcdonals$cal_fat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    50.0   160.0   240.0   285.6   320.0  1270.0
hist(mcdonals$cal_fat)

summary(dairy_queen$cal_fat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   160.0   220.0   260.5   310.0   670.0
hist(dairy_queen$cal_fat)
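To put numbers next to the two histograms, the center and spread of each restaurant can be tabulated side by side. This is a minimal sketch rather than part of the original lab: the name `cal_fat_summary` and the choice of the IQR as a second spread measure are assumptions made for illustration.

```r
library(dplyr)

data("fastfood", package = "openintro")

# Center (mean, median) and spread (sd, IQR) of calories from fat,
# one row per restaurant
cal_fat_summary <- fastfood %>%
  filter(restaurant %in% c("Mcdonalds", "Dairy Queen")) %>%
  group_by(restaurant) %>%
  summarise(
    n      = n(),
    mean   = mean(cal_fat),
    median = median(cal_fat),
    sd     = sd(cal_fat),
    iqr    = IQR(cal_fat)
  )
cal_fat_summary
```

A mean well above the median, as in both `summary()` outputs above, is a typical sign of a right-skewed shape.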

Exercise 2. Based on this plot, does it appear that the data follow a nearly normal distribution?

dqmean <- mean(dairy_queen$cal_fat)
dqsd <- sd(dairy_queen$cal_fat)

ggplot(data = dairy_queen, aes(x = cal_fat)) + 
  geom_blank() + 
  geom_histogram(aes(y = after_stat(density))) +
  stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 3. Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a data frame, it can be put directly into the sample argument and the data argument can be dropped.)

ggplot(data = dairy_queen, aes(sample = cal_fat)) +
  geom_line(stat = "qq")

sim_norm <- rnorm(n = nrow(dairy_queen), mean = dqmean, sd = dqsd)

ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_line(stat = "qq")
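The exercise asks whether the points fall on "the line", but the plots above draw only the points. One way to add a reference line is sketched below using `geom_qq()` and `geom_qq_line()` from ggplot2; the seed and the name `sim_qq` are assumptions added here so the example is reproducible, and they are not part of the original lab.

```r
library(ggplot2)

data("fastfood", package = "openintro")
dairy_queen <- subset(fastfood, restaurant == "Dairy Queen")

# Assumption: a fixed seed, used only to make this sketch reproducible
set.seed(1)
sim_norm <- rnorm(
  n    = nrow(dairy_queen),
  mean = mean(dairy_queen$cal_fat),
  sd   = sd(dairy_queen$cal_fat)
)

# Points are the sample quantiles against theoretical normal quantiles;
# the line passes through the first and third quartiles by default
sim_qq <- ggplot(data = NULL, aes(sample = sim_norm)) +
  geom_qq() +
  geom_qq_line(colour = "tomato")
sim_qq
```

Even for simulated normal data, the points usually wander off the line somewhat in the tails, which gives a sense of how much deviation is ordinary.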

Exercise 4. Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories are nearly normal?

qqnormsim(sample = cal_fat, data = dairy_queen)

Exercise 5. Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.

qqnormsim(sample = cal_fat, data = mcdonals)

Exercise 6. Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?

  1. Probability that cholesterol is greater than 20 at Burger King?
burgerking <- fastfood %>%
  dplyr::filter(restaurant == "Burger King")
head(burgerking)
## # A tibble: 6 × 17
##   restaurant  item      calories cal_fat total_fat sat_fat trans_fat cholesterol
##   <chr>       <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
## 1 Burger King American…     1550    1134       126      47       8           805
## 2 Burger King Bacon & …     1000     585        65      24       3           200
## 3 Burger King Bacon Ch…      330     140        16       7       0            55
## 4 Burger King Bacon Ch…      290     120        14       6       0.5          40
## 5 Burger King Bacon Ki…     1040     630        48      28       2.5         220
## 6 Burger King Bacon Ki…      730     351        39       9       0            90
## # … with 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
## #   sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
## #   salad <chr>
chols_mean <- mean(burgerking$cholesterol)
chols_sd <- sd(burgerking$cholesterol)
1 - pnorm(20, mean = chols_mean, sd = chols_sd)
## [1] 0.7743965
  2. Probability that sat_fat is greater than 4 at Burger King?
sat_fat_mean <- mean(burgerking$sat_fat)
sat_fat_sd <- sd(burgerking$sat_fat)
1 - pnorm(4, mean = sat_fat_mean, sd = sat_fat_sd)
## [1] 0.792606
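Exercise 6 also asks for the empirical probabilities, which the calculations above do not cover. A minimal sketch of the comparison follows. The empirical probability is taken as the share of Burger King items above the cutoff, and the names `chol_theory`, `chol_empirical`, `sat_fat_theory`, `sat_fat_empirical`, and `comparison` are introduced here for illustration only.

```r
library(dplyr)

data("fastfood", package = "openintro")
burgerking <- fastfood %>%
  filter(restaurant == "Burger King")

# Theoretical: upper-tail area under a normal curve with the sample mean and sd
chol_theory <- 1 - pnorm(20,
                         mean = mean(burgerking$cholesterol),
                         sd   = sd(burgerking$cholesterol))
sat_fat_theory <- 1 - pnorm(4,
                            mean = mean(burgerking$sat_fat),
                            sd   = sd(burgerking$sat_fat))

# Empirical: proportion of observed items strictly above the cutoff
chol_empirical    <- mean(burgerking$cholesterol > 20)
sat_fat_empirical <- mean(burgerking$sat_fat > 4)

comparison <- data.frame(
  question    = c("cholesterol > 20", "sat_fat > 4"),
  theoretical = c(chol_theory, sat_fat_theory),
  empirical   = c(chol_empirical, sat_fat_empirical)
)
comparison$abs_diff <- abs(comparison$theoretical - comparison$empirical)
comparison
```

The question with the smaller `abs_diff` is the one where the normal model and the observed data agree more closely.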

Exercise 7. Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?

fastfood %>%
  ggplot() +
  geom_histogram(aes(x = sodium), bins = 11) +
  ggtitle("Rest. Sodium") +
  xlab("Sodium") +
  ylab("Freq") +
  facet_wrap(~ restaurant)

fastfood %>%
  ggplot(aes(sample = sodium)) +
  geom_line(stat = "qq") +
  facet_wrap(~ restaurant)

Exercise 8. Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?

=> Certain foods offered may be more highly produced than others.
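Another possibility worth checking, offered here as an assumption rather than a finding, is that sodium is reported in rounded amounts. Many items would then share exactly the same value and produce flat steps in a normal probability plot. The sketch below counts distinct sodium values per restaurant; the name `sodium_ties` and the multiple-of-10 check are illustrative choices, not part of the original lab.

```r
library(dplyr)

data("fastfood", package = "openintro")

# For each restaurant: number of items, number of distinct sodium values,
# and the share of values that are exact multiples of 10
sodium_ties <- fastfood %>%
  group_by(restaurant) %>%
  summarise(
    items                = n(),
    distinct_values      = n_distinct(sodium),
    share_multiple_of_10 = mean(sodium %% 10 == 0, na.rm = TRUE)
  )
sodium_ties
```

If `distinct_values` is much smaller than `items`, ties are common and a stepwise plot is expected regardless of the underlying shape.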

Exercise 9. As you can see, normal probability plots can be used both to assess normality and visualize skewness. Make a normal probability plot for the total carbohydrates from a restaurant of your choice. Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed? Use a histogram to confirm your findings.

burgerkingplot <- burgerking %>%
  ggplot() +
  geom_line(aes(sample = total_carb), stat = "qq") +
  ggtitle("burgerking - carbohydrates")
burgerkingplot

burgerking_hist <- burgerking %>%
  ggplot() +
  geom_histogram(aes(x = total_carb), binwidth = 11) +
  xlab("total carbohydrates") + 
  ylab("frequency") +
  ggtitle("burgerking carbohydrates")
burgerking_hist
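A quick numeric check can sit alongside the two plots: for a right-skewed variable the mean is typically pulled above the median, and for a left-skewed one below it. This is only a rough heuristic, sketched under the assumption that `total_carb` has no missing values for Burger King; the name `carb_summary` is introduced here for illustration.

```r
library(dplyr)

data("fastfood", package = "openintro")

# Compare mean and median of total carbohydrates for Burger King
carb_summary <- fastfood %>%
  filter(restaurant == "Burger King") %>%
  summarise(
    carb_mean         = mean(total_carb),
    carb_median       = median(total_carb),
    mean_minus_median = carb_mean - carb_median
  )
carb_summary
```

A clearly positive `mean_minus_median` points toward right skew, a clearly negative one toward left skew, and a value near zero is consistent with symmetry.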

Rpubs => http://rpubs.com/gunduzhazal/814802