mybinom <- rbinom(n = 10000, size = 50, prob = 0.25) # simulate the population: 10,000 draws from Binomial(size = 50, prob = 0.25)
mean(mybinom)
## [1] 12.5025
sd(mybinom)
## [1] 3.095602
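As a rough cross-check on the simulated mean and SD, the theoretical moments of a Binomial(50, 0.25) variable can be computed directly from the formulas mean = size * prob and SD = sqrt(size * prob * (1 - prob)). The names `size`, `prob`, `theo_mean`, and `theo_sd` are illustrative and not part of the original analysis.

```r
# Theoretical moments of a Binomial(size, prob) variable (illustrative names)
size <- 50
prob <- 0.25

theo_mean <- size * prob                    # 12.5
theo_sd   <- sqrt(size * prob * (1 - prob)) # sqrt(9.375), about 3.06

c(mean = theo_mean, sd = theo_sd)
```

The simulated values (12.5025 and 3.095602) are close to these, as expected for 10,000 draws.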
hist(x=mybinom)
library("psych")
describe(mybinom)
##    vars     n mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 10000 12.5 3.1     12   12.45 2.97   2  26    24 0.18        0 0.03
v <- matrix(data = 0, nrow = 10000, ncol = 1) # preallocate a 10000 x 1 matrix of zeros
v[1:20]
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
for (i in 1:10000) {
  # store the mean of a random sample of 100 observations, drawn with replacement
  v[i] <- mean(sample(x = mybinom, size = 100, replace = TRUE))
}
v[1:20]
## [1] 12.61 12.37 12.10 12.92 12.50 12.08 12.71 13.14 12.43 12.25 12.29 12.91
## [13] 12.74 12.80 12.84 12.24 12.49 12.21 12.22 12.27
describe(v)
##    vars     n mean   sd median trimmed  mad   min   max range  skew kurtosis se
## X1    1 10000 12.5 0.31   12.5    12.5 0.31 11.28 13.74  2.46 -0.01    -0.02  0
hist(v)
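The SD of 0.31 reported for the sample means can be compared with the usual prediction, population SD divided by the square root of the sample size. The sketch below is a smaller, self-contained rerun under assumptions that are not in the original: a fixed seed, 2,000 replications instead of 10,000, and the illustrative names `pop`, `sample_means`, `predicted_se`, and `observed_se`.

```r
set.seed(1) # assumed seed, only so this sketch is reproducible

# Population: 10,000 draws from Binomial(50, 0.25), as in the analysis above
pop <- rbinom(n = 10000, size = 50, prob = 0.25)

# 2,000 means of samples of size 100, drawn with replacement
sample_means <- replicate(2000, mean(sample(x = pop, size = 100, replace = TRUE)))

predicted_se <- sd(pop) / sqrt(100) # population SD / sqrt(n)
observed_se  <- sd(sample_means)    # SD of the simulated sample means

c(predicted = predicted_se, observed = observed_se)
```

The two numbers should agree to within a few percent; the exact values depend on the seed.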
v <- matrix(data = 0, nrow = 10000, ncol = 5) # preallocate one column per sample size
n2 <- c(5, 10, 50, 100, 1000)
for (j in 1:5) {
  for (i in 1:10000) {
    v[i, j] <- mean(sample(x = mybinom, size = n2[j], replace = TRUE))
  }
}
colnames(v) <- c("Sample size=5",
                 "Sample size=10",
                 "Sample size=50",
                 "Sample size=100",
                 "Sample size=1000")
summary(v)
##  Sample size=5    Sample size=10   Sample size=50   Sample size=100
##  Min.   : 7.60    Min.   : 9.10    Min.   :10.68    Min.   :11.31
##  1st Qu.:11.60    1st Qu.:11.80    1st Qu.:12.22    1st Qu.:12.29
##  Median :12.40    Median :12.50    Median :12.50    Median :12.50
##  Mean   :12.49    Mean   :12.51    Mean   :12.51    Mean   :12.50
##  3rd Qu.:13.40    3rd Qu.:13.20    3rd Qu.:12.80    3rd Qu.:12.71
##  Max.   :18.40    Max.   :16.10    Max.   :14.32    Max.   :13.65
##  Sample size=1000
##  Min.   :12.16
##  1st Qu.:12.44
##  Median :12.50
##  Mean   :12.50
##  3rd Qu.:12.57
##  Max.   :12.91
hist(x = mybinom,
     main = "Histogram of a Binomial Distribution, N=10,000",
     xlab = "") # histogram of the population
for (k in 1:5) {
  hist(x = v[, k],
       main = "Histogram of Mean of Binomial Distribution",
       xlim = c(8, 16),
       xlab = paste0("Sample Size ", n2[k], " (Column ", k, " from matrix)"))
} # sampling distributions of the mean, one histogram per sample size
colMeans(v) #check means
##    Sample size=5   Sample size=10   Sample size=50  Sample size=100
##         12.48658         12.51304         12.50767         12.50431
## Sample size=1000
##         12.50352
# SD of each column of sample means, rescaled by sqrt(n); each result should be close to sd(mybinom)
apply(X = v, MARGIN = 2, FUN = sd) * sqrt(n2)
##    Sample size=5   Sample size=10   Sample size=50  Sample size=100
##         3.067012         3.109175         3.115295         3.103632
## Sample size=1000
##         3.102940
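The reason for multiplying each column's SD by the square root of its sample size can be written out briefly. This is the standard result for the mean of $n$ independent draws $X_1, \dots, X_n$ with common variance $\sigma^2$; the symbols $\bar{X}_n$ and $\sigma$ are notation introduced here, not in the original code.

$$
\operatorname{Var}(\bar{X}_n)
= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
= \frac{\sigma^2}{n}
\quad\Longrightarrow\quad
\operatorname{SD}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}},
\qquad
\sqrt{n}\,\operatorname{SD}(\bar{X}_n) = \sigma .
$$

Since sampling with replacement from `mybinom` gives independent draws, every rescaled SD should land near `sd(mybinom)`, which is about 3.10, and all five do.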
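Does the central limit theorem hold as expected? Yes. The population is slightly right-skewed (skew 0.18), yet the 10,000 means of samples of size 100 are essentially symmetric (skew -0.01, kurtosis -0.02), and the histograms of the sample means look more bell-shaped and more tightly clustered as the sample size grows. Every sampling distribution is centered on the population mean of about 12.5, and the spread shrinks with sample size: the sample means range from 7.60 to 18.40 at n = 5 but only from 12.16 to 12.91 at n = 1000. Multiplying each column's SD by the square root of its sample size gives about 3.1 every time, which matches the population SD of 3.0956, so the standard error behaves like the population SD divided by the square root of n. I was a bit unsure how to read the column means for the sample sizes of 100 and 1000. They are 12.50431 and 12.50352 against a population mean of 12.5025, so the larger sample size is in fact slightly closer, although both differences are under 0.002 and small enough to be simulation noise. If I were to take a larger sample (2000, for example), I would expect the sample means to cluster even more tightly around the population mean.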
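One way to judge whether the small gaps between the column means and the population mean are meaningful is to compare them with the simulation error of a column mean, which is roughly the population SD divided by sqrt(n * 10000), since each column averages 10,000 sample means. The sketch below only reuses numbers printed above; the names `pop_mean`, `pop_sd`, `reps`, `col_means`, `se_sample_mean`, and `se_col_mean` are illustrative.

```r
# Values copied from the printed output above
pop_mean  <- 12.5025   # mean(mybinom)
pop_sd    <- 3.095602  # sd(mybinom)
reps      <- 10000     # sample means per column
n2        <- c(5, 10, 50, 100, 1000)
col_means <- c(12.48658, 12.51304, 12.50767, 12.50431, 12.50352)

se_sample_mean <- pop_sd / sqrt(n2)            # spread of a single sample mean
se_col_mean    <- se_sample_mean / sqrt(reps)  # simulation error of a column mean

data.frame(n = n2,
           se_sample_mean = se_sample_mean,
           se_col_mean    = se_col_mean,
           abs_diff       = abs(col_means - pop_mean))
```

Each `abs_diff` is on the order of its `se_col_mean`, so the column means differ from the population mean by about what chance alone would produce.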