Why The Normal

Harold Nelson

2025-03-17

Setup

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Task 1

Create a histogram of the variable carat in the dataframe diamonds.

Solution

diamonds %>% 
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

Task 2

Does this appear to have a normal distribution?

Solution

No, there is extreme right skewness.
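
A quick numerical check can back up the visual impression. The sketch below relies on a rule of thumb, not a formal test: in a right-skewed distribution the long upper tail usually pulls the mean above the median. The names carat_mean and carat_median are illustrative and do not appear elsewhere in this document.

```r
library(ggplot2)  # provides the diamonds data frame

# In a right-skewed distribution the mean usually exceeds the median.
carat_mean   = mean(diamonds$carat)
carat_median = median(diamonds$carat)

c(mean = carat_mean, median = carat_median)
carat_mean > carat_median
```

If the last line returns TRUE, the result is consistent with the right skew seen in the histogram.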

Task 3

Compute the mean and standard deviation of the variable. Call them mean_carat and sd_carat. Display the values.

Solution

mean_carat = mean(diamonds$carat)
mean_carat
## [1] 0.7979397
sd_carat = sd(diamonds$carat)
sd_carat
## [1] 0.4740112

Task 4

Write a code snippet which computes an estimate of the mean based on a sample of size sample_size. Run it repeatedly with sample_size = 2.

Solution

sample_size = 2
xbar = mean(sample(diamonds$carat,sample_size))
xbar
## [1] 0.745

Task 5

Repeat this exercise with sample_size values of 5, 10, 100, 500, 1000. What do you observe as you increase the sample size?

Solution

The values of xbar vary less from run to run and stay closer to mean_carat (about 0.798) as the sample size increases.
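
Rather than editing sample_size by hand, the sample sizes can be run in one pass. This is a minimal sketch; the names sample_sizes, xbars and abs_error are illustrative, and the seed is arbitrary and only makes the run repeatable.

```r
library(ggplot2)  # provides the diamonds data frame

set.seed(1)  # arbitrary seed, only so the run is repeatable

sample_sizes = c(2, 5, 10, 100, 500, 1000)

# One estimate of the mean for each sample size
xbars = sapply(sample_sizes, function(n) mean(sample(diamonds$carat, n)))

# Absolute error of each estimate relative to the full-data mean
data.frame(sample_size = sample_sizes,
           xbar        = xbars,
           abs_error   = abs(xbars - mean(diamonds$carat)))
```

Each row is a single random draw, so an individual estimate from a larger sample can still land farther from the mean than one from a smaller sample. The pattern concerns the typical error, which is why it helps to rerun the code several times without the seed.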

Task 6

Let’s look at the distribution of the values of xbar.

Begin by setting a vector of 1,000 estimates equal to zero using rep(). Then use a for loop to fill this vector with estimates of xbar based on random samples of size sample_size.

Compute the mean of the estimates as mean_estimates and the standard deviation of the estimates as sd_estimates. Display these values, then plot a histogram of the estimates.

Solution

sample_size = 4
estimates = rep(0, 1000)  # placeholder vector for 1,000 sample means

for (i in 1:1000) {
  # Mean of one random sample of size sample_size
  estimates[i] = mean(sample(diamonds$carat, sample_size))
}

mean_estimates = mean(estimates)
mean_estimates
## [1] 0.790455
sd_estimates = sd(estimates)
sd_estimates
## [1] 0.2306748
data.frame(estimates) %>% 
  ggplot(aes(x = estimates)) +
  geom_histogram(binwidth = 0.05)

Task 7

Run the code with sample_size values of 1, 4, and 16. What do you observe?

The Central Limit Theorem

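The results of Tasks 6 and 7 illustrate two relationships that we need to remember: the mean of the sample means equals the mean of the original variable, and the standard deviation of the sample means equals the standard deviation of the original variable divided by the square root of the sample size.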

\[\mu_{\bar{x}}=\mu_{x}\] and \[\sigma_{\bar{x}} =\frac{\sigma_{x}}{\sqrt{n}}\]
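
The second relationship can be checked by simulation. The sketch below repeats the Task 6 experiment for sample sizes 1, 4, and 16 and compares the simulated standard deviation of xbar with sd(diamonds$carat) divided by the square root of n. The helper sd_of_xbar and the data frame comparison are hypothetical names introduced for this illustration, and the seed is arbitrary.

```r
library(ggplot2)  # provides the diamonds data frame

set.seed(1)  # arbitrary seed, only so the run is repeatable

# Hypothetical helper: standard deviation of `reps` sample means of size n
sd_of_xbar = function(n, reps = 1000) {
  sd(replicate(reps, mean(sample(diamonds$carat, n))))
}

sample_sizes = c(1, 4, 16)

comparison = data.frame(
  sample_size = sample_sizes,
  simulated   = sapply(sample_sizes, sd_of_xbar),
  theoretical = sd(diamonds$carat) / sqrt(sample_sizes)
)
comparison
```

The two columns should agree closely but not exactly, because the simulated values come from only 1,000 random samples. Each fourfold increase in sample size should roughly halve the standard deviation.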

More importantly, whatever the distribution of the original variable x (provided its variance is finite), the distribution of xbar approaches the normal distribution as sample_size becomes large.
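
One way to watch this convergence, sketched here as an illustration and not a formal test of normality, is to track the skewness of the simulated xbar values as sample_size grows. A normal distribution has skewness 0, so the values should shrink toward 0. The helpers skewness and xbar_draws are hypothetical names defined for this example (base R has no built-in skewness function), and the seed is arbitrary.

```r
library(ggplot2)  # provides the diamonds data frame

set.seed(1)  # arbitrary seed, only so the run is repeatable

# Hypothetical helper: a simple moment-based estimate of skewness
skewness = function(x) mean((x - mean(x))^3) / sd(x)^3

# Hypothetical helper: `reps` sample means, each from a sample of size n
xbar_draws = function(n, reps = 1000) {
  replicate(reps, mean(sample(diamonds$carat, n)))
}

sample_sizes = c(1, 4, 16, 64)
skews = sapply(sample_sizes, function(n) skewness(xbar_draws(n)))

data.frame(sample_size = sample_sizes, skewness = skews)
```

With sample_size = 1 the estimates are just individual carat values, so their skewness matches that of the original variable. A histogram of xbar_draws(64) should look far more symmetric than the histogram from Task 1.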