Using the “Number of Years in College” data, we will demonstrate the Central Limit Theorem.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.1
## ✔ tibble 2.0.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readr)
library(ggplot2)
## Parsed with column specification:
## cols(
## Years = col_double()
## )
## # A tibble: 6 x 1
## Years
## <dbl>
## 1 6
## 2 6
## 3 8
## 4 8
## 5 8
## 6 6
#Graph the data set
hist(years$Years, breaks = 5, border="blue", col="green", xlab = "Years in College")
#As we can see, the data is skewed right
#Let’s calculate the mean and standard deviation of the population
pop_mean <- mean(years$Years)
pop_sd <- sd(years$Years)
paste0("The Mean of the population: ", pop_mean)
## [1] "The Mean of the population: 3.52666666666667"
paste0("The Standard Deviation of the population: ", pop_sd)
## [1] "The Standard Deviation of the population: 2.29588679966326"
#Select a sample of size 30 and calculate the mean and standard deviation
set.seed(357)
n <- 30
sample1 <- sample(years$Years, size = n, replace = TRUE)
hist(sample1, breaks = 5, border="blue", col="green", xlab = "Years in College")
#calculate the mean and standard deviation
sample1_mean <- mean(sample1)
sample1_std <- sd(sample1)
paste0("The Mean of Sample1: ", sample1_mean)
## [1] "The Mean of Sample1: 3.65"
paste0("The Standard Deviation of Sample1: ", sample1_std)
## [1] "The Standard Deviation of Sample1: 2.27864902503882"
#Take another sample
sample2 <- sample(years$Years, size = n, replace = TRUE)
hist(sample2, breaks = 5, border="blue", col="green", xlab = "Years in College")
#Calculate the Mean and Standard Deviation
paste0("The Mean of Sample2: ", mean(sample2))
## [1] "The Mean of Sample2: 3.55"
paste0("The Standard Deviation of Sample2: ", sd(sample2))
## [1] "The Standard Deviation of Sample2: 1.96674946250809"
#Let’s look at one more sample
sample3 <- sample(years$Years, size = n, replace = TRUE)
hist(sample3, breaks = 5, border="blue", col="green", xlab = "Years in College")
#Mean and Standard Deviation of Sample 3
paste0("The Mean of Sample3: ", mean(sample3))
## [1] "The Mean of Sample3: 3.7"
paste0("The Standard Deviation of Sample3: ", sd(sample3))
## [1] "The Standard Deviation of Sample3: 1.85973672589837"
#Notice that our sample means are not equal to the population mean. To understand sampling, we need to look at many sample means and see how close they are to the true population mean. As in the chapters on descriptive statistics, we want to examine the behavior of the sample means.
We are going to select 10000 samples of size 30, graph the sample means, and find the mean of the sample means.
B <- 10000
N <- 30
x_bar <- replicate(B, {
x <- sample(years$Years, size = N, replace = TRUE)
mean(x)
})
hist(x_bar, border="blue", col = 'green', xlab = "Average Number of Years in College \nfor a sample of size 30", ylab = "Frequency", main = "Sampling Distribution of Sample Means")
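#print() ignores extra values passed after the string, so build the message with paste0()
paste0("The Mean of the sample means is ", round(mean(x_bar), digits = 4))
## [1] "The Mean of the sample means is 3.5299"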
sd(x_bar)
## [1] 0.4201272
#The sampling distribution of the sample mean is approximately normal, even though the population is skewed right! Let’s compare the mean of the sample means with the population mean
paste0("The Mean of the Population: ", round(mean(years$Years), digits = 4))
## [1] "The Mean of the Population: 3.5267"
paste0("The Mean of the sample means: ", round(mean(x_bar), digits = 4))
## [1] "The Mean of the sample means: 3.5299"
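One way to check the "approximately normal" claim numerically is to compare the skewness of the population with the skewness of the sample means. The sketch below is only an illustration: `population` is a simulated right-skewed stand-in (an assumption, not the `years` data), and `skewness()` is a small helper defined here, not a base R function.

```r
# Hypothetical right-skewed stand-in for years$Years (NOT the original data)
set.seed(101)
population <- rgamma(600, shape = 2, scale = 1.75)

# Moment-based skewness: near 0 for a symmetric distribution,
# positive for a distribution that is skewed right
skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

# 10000 sample means, each from a sample of size 30 drawn with replacement
x_bar <- replicate(10000, mean(sample(population, size = 30, replace = TRUE)))

skew_population <- skewness(population)
skew_x_bar <- skewness(x_bar)

# The sample means should be far less skewed than the population
c(population = skew_population, sample_means = skew_x_bar)
```

A skewness much closer to zero for `x_bar` than for `population` is consistent with the bell shape seen in the histogram of sample means.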
#The Central Limit Theorem states that for sufficiently large samples (a common rule of thumb is n ≥ 30), the sampling distribution of the sample mean is approximately normal and is centered at the true population mean, even when the population itself is skewed.
#If the Sampling Distribution of Sample Means is approximately a Normal Distribution, then the Empirical Rule holds.
#So we know that about 95% of the sample means will fall within two standard deviations (of the sampling distribution) of the true mean!
Recall that x_bar, our sample mean, is a random variable. Just like the coin toss, there is a standard deviation, or variability, in the set of all sample means. Let’s take a look at the standard deviation of the population and the standard deviation of the Sampling Distribution of Sample Means.
paste0("The standard deviation of the Population: ", round(sd(years$Years), digits = 4))
## [1] "The standard deviation of the Population: 2.2959"
paste0("The standard deviation of the sampling distribution of sample means: ", round(sd(x_bar), digits = 4))
## [1] "The standard deviation of the sampling distribution of sample means: 0.4201"
#The standard deviation of the sampling distribution of sample means, that is, the variability in the set of sample means, is much smaller than the standard deviation of the population. Using the Empirical Rule:
paste0("About 95% of all sample means are within plus or minus ", round(2*sd(x_bar), digits = 4), ' years of the true mean of ', round(mean(years$Years), digits = 4))
## [1] "About 95% of all sample means are within plus or minus 0.8403 years of the true mean of 3.5267"
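The Empirical Rule statement can be checked directly by counting how many simulated sample means fall within two standard deviations of the population mean. This is a minimal sketch under an assumption: `population` is a simulated right-skewed stand-in, not the `years` data, so the exact proportion will differ from what the original data would give.

```r
# Hypothetical right-skewed stand-in for years$Years (NOT the original data)
set.seed(2024)
population <- rgamma(600, shape = 2, scale = 1.75)

B <- 10000   # number of simulated samples
N <- 30      # size of each sample

# Sampling distribution of the sample mean
x_bar <- replicate(B, mean(sample(population, size = N, replace = TRUE)))

# Two standard deviations of the sampling distribution
margin <- 2 * sd(x_bar)

# Proportion of sample means within that margin of the population mean
within_two <- mean(abs(x_bar - mean(population)) <= margin)
within_two
```

If the sampling distribution is close to normal, `within_two` should come out near 0.95.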
#The Central Limit Theorem also states that the standard error of the Sampling Distribution of Sample Means is sigma/sqrt(n). Let’s see how well we simulated the standard error
paste0("The Standard Deviation of the Sampling Distribution of Sample Means: ", round(sd(x_bar), digits = 4))
## [1] "The Standard Deviation of the Sampling Distribution of Sample Means: 0.4201"
paste0("The Population Standard Deviation divided by the square root of n: ", round(sd(years$Years)/sqrt(N), digits = 4))
## [1] "The Population Standard Deviation divided by the square root of n: 0.4192"
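The sigma/sqrt(n) formula follows from the variance of a sum of independent draws. The short derivation below is a sketch that assumes the n observations X_1, ..., X_n are independent with common standard deviation sigma, which holds here because the samples are drawn with replace = TRUE. The last line plugs in the rounded values reported above.

```latex
\begin{aligned}
\operatorname{Var}(\bar{X})
  &= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
   = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
   = \frac{n\,\sigma^{2}}{n^{2}}
   = \frac{\sigma^{2}}{n}, \\[4pt]
\operatorname{SE}(\bar{X})
  &= \sqrt{\operatorname{Var}(\bar{X})}
   = \frac{\sigma}{\sqrt{n}}, \\[4pt]
\frac{\sigma}{\sqrt{n}}
  &\approx \frac{2.2959}{\sqrt{30}} \approx 0.4192 .
\end{aligned}
```

The simulated value of 0.4201 differs from 0.4192 only because the simulation uses a finite number of samples (10000).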