library("ggplot2")
  1. Write a program to find the average of 1000 random digits 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. Have the program test to see if the average lies within three standard deviations of the expected value of 4.5. Modify the program so that it repeats this simulation 1000 times and keeps track of the number of times the test is passed. Does your outcome agree with the Central Limit Theorem?

R Function

simulate <- function( simulations, samples ){
  
basket <- c(0,1,2,3,4,5,6,7,8,9)
stdDev <- sd(basket) # standard deviation
expected <- mean(basket) # mean
avgs <- rep(0, simulations) # store averages 
passed <- 0 # track passed tests


for(i in 1:simulations){
  trial <- sample(basket, size = samples, replace = TRUE)
  avg <- mean(trial)
  avgs[i] <- avg
  if(avg >= ( expected - 3*stdDev ) & avg <= ( expected + 3*stdDev ) ){
    passed <- passed + 1
  }
}
 
print( paste0("The average was within 3 standard deviations of the mean ", passed, " of ", simulations, " times."))

results <- data.frame(n = avgs)

ggplot(results, aes(n)) +
  geom_histogram(fill="steelblue", stat = "bin", bins=200) +
  labs( title="Histogram of final simulation", x= "Mean", y = "Frequncy") +
  geom_vline(xintercept = c(expected), linetype="dashed", color = "red") + 
  scale_x_continuous(limits=c(3, 6), expand = c(0, 0)) +
  theme_minimal()
}

A Single Simulation

simulate(1, 1000)
## [1] "The average was within 3 standard deviations of the mean 1 of 1 times."

As evident above, the simulation was within three standard deviations of the mean. Infact, it was within 1 standard deviation of the mean. Because were are taking 1000 samples this is not really surprising. Let’s see what the passing rate will be when doing 1000 simulations instead of one.

1000 Simulations

simulate(1000, 1000)
## [1] "The average was within 3 standard deviations of the mean 1000 of 1000 times."

Having a 1000 simulations, the histogram is centered very close to the expected value of 4.5. This is an evident improvement over doing a single simulation. It should also be noted that 100% percent of the means fell within 3 standard deviations of the mean. This distribution now resembles a normal distribution (thanks to the Central Limit Theorem) and it is expected that 99.7% of the data will be within 3 standard deviations of the mean, so this is not surprising. However, having a standard deviation of approximately 3.027 and a maximum value of 9, it seems impossible to get a value greater than 3 standard deviations of the mean. The same goes for the left side of the normal curve because 3 standard deviations below the mean of 4.5 will result in a negative value, which is impossible with a minimum value of 0 in the data set