CLT Discussion

  1. The central limit theorem (CLT) states that if you take a large enough sample (n greater than or equal to 30 is a common rule of thumb), then the distribution of the sample means will be approximately normal. The CLT works even when the population itself is not normally distributed, as long as it has a finite variance. The sampling distribution is centered on the true/population mean at any sample size; what changes as you take more observations is that the sample means cluster more tightly around that mean.
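
Stated a bit more formally, in a standard textbook form: the symbols below are not from the original post and are assumed here, with X_1, ..., X_n independent draws from a population with mean \mu and finite variance \sigma^2, and \bar{X}_n their sample mean.

```latex
\[
\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \;\xrightarrow{d}\; \mathcal{N}(0, 1)
\quad \text{as } n \to \infty,
\qquad \text{so for large } n:\quad
\bar{X}_n \;\overset{\text{approx.}}{\sim}\; \mathcal{N}\!\left(\mu,\; \frac{\sigma^2}{n}\right).
\]
```

The \sigma^2 / n term is why the sample means bunch up around \mu as n grows.
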
mybinom <- rbinom(n=10000, size=50, prob=.25) #simulating 10,000 draws from a Binomial(50, 0.25) distribution to serve as the population
mean(mybinom)
## [1] 12.5025
sd(mybinom)
## [1] 3.095602
hist(x=mybinom)
library("psych")

describe(mybinom)
##    vars     n mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 10000 12.5 3.1     12   12.45 2.97   2  26    24 0.18        0 0.03
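
As a quick check on the simulated population, the mean and standard deviation of a binomial variable have closed forms: size * prob and sqrt(size * prob * (1 - prob)). The sketch below compares them with a fresh simulation. The seed and the names size, prob, theo_mean, theo_sd, and draws are my own assumptions and are not used elsewhere in this post.

```r
size <- 50
prob <- 0.25

theo_mean <- size * prob                      # 12.5
theo_sd   <- sqrt(size * prob * (1 - prob))   # about 3.06

set.seed(1)  # assumed seed, only so the sketch is reproducible
draws <- rbinom(n = 10000, size = size, prob = prob)

# Theoretical values next to their simulated counterparts
c(theoretical = theo_mean, simulated = mean(draws))
c(theoretical = theo_sd,   simulated = sd(draws))
```

The simulated values should land close to, but not exactly on, the theoretical ones, which is what mean(mybinom) and sd(mybinom) show above.
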
v <- matrix(data = 0,
            nrow = 10000,
            ncol = 1) #matrix of zeros to hold the sample means

v[1:20]
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
for (i in 1:10000){
  v[i] <- mean(sample(x = mybinom, size = 100, replace = TRUE))
} #each entry is the mean of a random sample of 100 observations
v[1:20]
##  [1] 12.61 12.37 12.10 12.92 12.50 12.08 12.71 13.14 12.43 12.25 12.29 12.91
## [13] 12.74 12.80 12.84 12.24 12.49 12.21 12.22 12.27
describe(v)
##    vars     n mean   sd median trimmed  mad   min   max range  skew kurtosis se
## X1    1 10000 12.5 0.31   12.5    12.5 0.31 11.28 13.74  2.46 -0.01    -0.02  0
hist(v)
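
The spread reported by describe(v) can be compared with what the theory predicts: the standard deviation of the sample mean should be close to the population standard deviation divided by sqrt(100), or roughly 3.1 / 10 = 0.31. The sketch below repeats the resampling step with replicate() in place of the for loop. The seed and the names pop and sample_means are assumptions of mine, not part of the original analysis.

```r
set.seed(42)  # assumed seed for reproducibility
pop <- rbinom(n = 10000, size = 50, prob = 0.25)

# 10,000 means, each from a sample of 100 drawn with replacement
sample_means <- replicate(n = 10000,
                          expr = mean(sample(x = pop, size = 100, replace = TRUE)))

sd(sample_means)      # observed standard error of the mean
sd(pop) / sqrt(100)   # predicted standard error of the mean
```

The two printed values should agree to about two decimal places.
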

v <- matrix(data = 0,
            nrow = 10000,
            ncol = 5) #one column of sample means per sample size

n2 <- c(5, 10, 50, 100, 1000)

for (j in 1:5){
  for (i in 1:10000){
    v[i,j] <- mean(sample(x       = mybinom,
                          size    = n2[j],
                          replace = TRUE))
  }
} #column j holds 10,000 means of samples of size n2[j]
colnames(v) <- c("Sample size=5", 
                 "Sample size=10", 
                 "Sample size=50",
                 "Sample size=100", 
                 "Sample size=1000")
summary(v)
##  Sample size=5   Sample size=10  Sample size=50  Sample size=100
##  Min.   : 7.60   Min.   : 9.10   Min.   :10.68   Min.   :11.31  
##  1st Qu.:11.60   1st Qu.:11.80   1st Qu.:12.22   1st Qu.:12.29  
##  Median :12.40   Median :12.50   Median :12.50   Median :12.50  
##  Mean   :12.49   Mean   :12.51   Mean   :12.51   Mean   :12.50  
##  3rd Qu.:13.40   3rd Qu.:13.20   3rd Qu.:12.80   3rd Qu.:12.71  
##  Max.   :18.40   Max.   :16.10   Max.   :14.32   Max.   :13.65  
##  Sample size=1000
##  Min.   :12.16   
##  1st Qu.:12.44   
##  Median :12.50   
##  Mean   :12.50   
##  3rd Qu.:12.57   
##  Max.   :12.91
hist(x    = mybinom, 
     main = "Histogram of a Binomial Distribution, N=10,000",
     xlab = ""
) #histogram of population

for (k in 1:5){
  hist(x    = v[,k],
       main = "Histogram of Mean of Binomial Distribution",
       xlim = c(8, 16),
       xlab = paste0("Sample Size ", n2[k], " (Column ", k, " from matrix)")
  )
} #sampling mean distributions
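
One rough way to put a number on how close each sampling distribution is to normal is its skewness, which is 0 for a normal distribution. This is a sketch only: skew() is a hypothetical helper written here (describe() above also reports a skew column), and the seed, pop, sizes, and means are assumed names.

```r
set.seed(7)  # assumed seed for reproducibility
pop   <- rbinom(n = 10000, size = 50, prob = 0.25)
sizes <- c(5, 10, 50, 100, 1000)

# Hypothetical helper: moment-based sample skewness
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# One column of 10,000 sample means per sample size
means <- sapply(sizes, function(n) {
  replicate(10000, mean(sample(pop, size = n, replace = TRUE)))
})
colnames(means) <- paste0("n=", sizes)

round(apply(means, 2, skew), 3)   # values near 0 indicate symmetry
```

Because the binomial population is only mildly skewed to begin with, I would expect all five values to be small, with the larger sample sizes closest to 0.
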

colMeans(v) #check means
##    Sample size=5   Sample size=10   Sample size=50  Sample size=100 
##         12.48658         12.51304         12.50767         12.50431 
## Sample size=1000 
##         12.50352
apply(X = v, 
      MARGIN = 2,
      FUN = sd) * sqrt(n2) #check SD: SD of the sample means times sqrt(n) should be close to sd(mybinom)
##    Sample size=5   Sample size=10   Sample size=50  Sample size=100 
##         3.067012         3.109175         3.115295         3.103632 
## Sample size=1000 
##         3.102940
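
The SD check above relies on the standard error formula. Taking sd(mybinom) = 3.0956 from earlier as \sigma (the symbols are my own notation), the predicted spread of the sample mean at each sample size works out as follows, rounded to four decimals:

```latex
\[
\mathrm{SD}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
\sqrt{n}\,\mathrm{SD}(\bar{X}_n) = \sigma \approx 3.0956
\]
\[
\begin{aligned}
n = 5:    &\quad 3.0956 / \sqrt{5}    \approx 1.3844 \\
n = 10:   &\quad 3.0956 / \sqrt{10}   \approx 0.9789 \\
n = 50:   &\quad 3.0956 / \sqrt{50}   \approx 0.4378 \\
n = 100:  &\quad 3.0956 / \sqrt{100}  \approx 0.3096 \\
n = 1000: &\quad 3.0956 / \sqrt{1000} \approx 0.0979
\end{aligned}
\]
```

Multiplying each column's observed SD by the square root of its sample size undoes this division, which is why every rescaled value comes out near 3.1.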

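Does the central limit theorem hold as expected? Yes, the CLT does hold as expected. As the sample size increased, the sample means clustered more tightly around the population mean of 12.5025 (as shown in the graphs): the range of the sample means shrank from 7.60 to 18.40 at a sample size of 5 down to 12.16 to 12.91 at a sample size of 1000. The column means were all within about 0.02 of the population mean regardless of sample size, and multiplying each column's standard deviation by the square root of its sample size gave values between 3.07 and 3.12, close to the population standard deviation of 3.10. However, just by looking at the column means, the sample size of 100 (12.5043) and the sample size of 1000 (12.5035) were almost equally close to the population mean, which I was a bit confused about, since I expected a clearer improvement. Differences that small are probably just random variation from one simulation run to the next, because the sampling distribution is centered on the population mean at every sample size. I wonder whether, if I were to take a larger sample (2000, for example), the spread would shrink further even if the mean barely moved.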
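
One way to explore the question about a sample size of 2000 is to compare each column mean's distance from the population mean with its Monte Carlo standard error, sd(column) / sqrt(number of replicates). The sketch below is an illustration under my own assumptions, not part of the original analysis: the seed and the names pop, reps, sizes, and result are all introduced here.

```r
set.seed(123)  # assumed seed for reproducibility
pop   <- rbinom(n = 10000, size = 50, prob = 0.25)
reps  <- 10000
sizes <- c(100, 1000, 2000)

# One row per sample size
result <- t(sapply(sizes, function(n) {
  m <- replicate(reps, mean(sample(pop, size = n, replace = TRUE)))
  c(n        = n,
    col_mean = mean(m),
    distance = abs(mean(m) - mean(pop)),  # gap to the population mean
    mc_se    = sd(m) / sqrt(reps))        # noise expected in col_mean
}))

result
```

If each distance is within a couple of multiples of its mc_se, then the ordering of the column means between sample sizes is likely just simulation noise, while the shrinking mc_se column shows the spread tightening as the sample size grows.
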