Please go to the Dropbox folder (link in Course Introduction and Materials) and find the R Markdown file (.Rmd extension) for the Central Limit Theorem (CLT) code we saw in class.
Then:
Begin by setting the seed in R. The recommended way to specify seeds
is set.seed(seed = 42), where seed can take on any
single value that is interpreted as an integer (42 here, but you can put
your favorite number instead).
set.seed(seed = 7) #my favorite number!
Please Google and describe the Law of Large Numbers in your own words.
The Law of Large Numbers (LLN) states that the more times you independently repeat an experiment and average the results, the closer that average tends to be to the true expected value. In other words, to estimate the true average of an outcome, you need a large number of independent observations; a small sample can land far from the population value by chance. In the social sciences, we often begin with a pilot study with a small sample size, then re-run the experiment with a larger sample, and then replicate the study with different subsets of the population. After enough observations, the averaged results are a reliable estimate for the population as a whole.
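The idea can be sketched with a quick simulation. This is only an illustration, not part of the class code: it assumes a fair coin (probability 0.5), and the names `flips` and `running_mean` are made up for the example.

```r
# Illustrative sketch: running average of simulated fair coin flips
set.seed(7)
flips <- rbinom(n = 10000, size = 1, prob = 0.5)   # 1 = heads, 0 = tails
running_mean <- cumsum(flips) / seq_along(flips)   # average after each flip

# The running average after 10, 100, 1000 and 10000 flips
running_mean[c(10, 100, 1000, 10000)]

# Plot the running average against the true expected value (dashed line)
plot(running_mean, type = "l", ylim = c(0, 1),
     xlab = "Number of flips", ylab = "Running mean of heads")
abline(h = 0.5, lty = 2)
```

Early values of the running mean can wander, while the later ones should settle close to 0.5, which is the behavior the LLN describes.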
Please explain the CLT in your own words. You can, and should, read your textbook and/or online references to understand what the CLT is, its uses, et cetera. Furthermore, if you find any useful resource, include it in your post so that the rest of the class can have a look at it too. E.g., Aspect 3 in the section Overview of four aspects (1:16) in the YouTube video substantiates my claim more formally that xxx..., Josh Starmer at StatQuest..., JB Statistics Intro to CLT, significance/uses of CLT, Wikipedia, ...
The Central Limit Theorem (CLT) states that, with a large enough sample size, the sampling distribution of the mean will be approximately normal, regardless of the shape of the underlying distribution (e.g., Poisson, binomial, exponential), as long as that distribution has a finite variance. The sampling error also decreases as the sample size increases: the standard deviation of the sample mean (the standard error) is the population standard deviation divided by the square root of the sample size. In other words, as the sample size increases (a common rule of thumb in the social sciences is more than 30), the distribution of sample means takes on a bell curve that is centered on the population mean and becomes narrower.
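A minimal sketch of this claim, separate from the class code: it assumes an exponential distribution with rate 1 (population mean 1, population standard deviation 1), and the names `n`, `reps` and `sample_means` are chosen only for illustration.

```r
# Illustrative sketch: sample means of a skewed (exponential) distribution
set.seed(7)
n    <- 30     # observations per sample
reps <- 5000   # number of repeated samples

# Draw `reps` samples of size n and keep the mean of each one
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean(sample_means)   # should be close to the population mean, 1
sd(sample_means)     # should be close to 1 / sqrt(30), about 0.18

# The histogram should look roughly bell-shaped even though
# individual exponential draws are strongly right-skewed
hist(sample_means, breaks = 40,
     main = "Sample means of Exp(1), n = 30", xlab = "")
```

Changing `n` to a smaller value such as 2 should give a visibly skewed histogram, which is a simple way to see why the sample size matters.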
What are the similarities and differences between LLN and CLT? Write a few lines.
The Law of Large Numbers (LLN) and Central Limit Theorem (CLT) are similar in that both rely on a large sample size to make inferences about the general population. However, they differ in what they describe: the LLN says that the sample average converges to the true population average as the sample size (or the number of repetitions of the experiment) grows, whereas the CLT describes the shape of the distribution of that sample average, which becomes approximately normal with a large enough sample size.
For example, let’s imagine that I’m starting an ice cream factory and my business plan is to sell only five ice cream flavors. To decide on the flavors, I randomly send out surveys to residents living in my state. Many of the responses name orange sherbet as the favorite, a flavor made from a fruit that is native and unique to my state. Arguably, the sample size is small and not representative of all 50 US states at this point. However, the more residents I sample from other US states and the larger the sample becomes, the more likely it is that the distribution of flavors will change, with more basic and common flavors like chocolate and vanilla at the center of the distribution and less common flavors, like orange sherbet, in the tails. With the LLN, a larger number of responses from different US states will bring the sample results closer to the true nationwide preferences, while the CLT states that, with enough responses, the average results from repeated surveys will be approximately normally distributed around those true preferences.
Pick any distribution apart from normal, uniform, or Poisson. You can read about the distribution on Wikipedia and/or read how to implement the distribution in R (what parameters are required to generate the distribution).
Please describe this distribution first in 5 lines.
Binomial distributions look at the total count of outcomes in a succession of trials; each trial permits only two mutually exclusive outcomes (i.e., binary). For example, the number of people in a survey who choose chocolate ice cream as their favorite flavor would follow a binomial distribution. In other words, it models the number of successes in a fixed number of independent Bernoulli trials. The parameters required to model a binomial distribution are: 1) the number of trials (n), and 2) the probability of success on each trial (p); the random outcome is the number of successes (x) out of the n trials.
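As a quick reference for these parameters, the theoretical mean and standard deviation of a binomial distribution are n * p and sqrt(n * p * (1 - p)). The sketch below plugs in the values used in the simulation that follows (n = 10,000 trials, p = 0.5); the variable names `n_trials`, `p`, `theo_mean` and `theo_sd` are illustrative only.

```r
# Illustrative sketch: theoretical moments of a Binomial(n, p) distribution
n_trials <- 10000   # number of Bernoulli trials (the `size` argument in R)
p        <- 0.5     # probability of success on each trial (the `prob` argument)

theo_mean <- n_trials * p                   # expected number of successes
theo_sd   <- sqrt(n_trials * p * (1 - p))   # standard deviation of the count

theo_mean   # 5000
theo_sd     # 50

# Probability of exactly 5 successes in 10 fair trials: choose(10, 5) / 2^10
dbinom(x = 5, size = 10, prob = 0.5)   # 0.2460938
```

These theoretical values give a benchmark for the simulated mean and standard deviation of `binom.data`.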
Then, apply the CLT on the sample mean of this chosen distribution in R (adapt our class R code, or you can find an alternative code on the web too).
rm(list = ls()) # Clear environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 537496 28.8 1201002 64.2 NA 669400 35.8
## Vcells 995660 7.6 8388608 64.0 16384 1851617 14.2
cat("\f") # Clear console
set.seed(7)
binom.data <- rbinom(10000, size = 10000, prob = 0.5) #large sample size to apply CLT
library(psych)
describe(binom.data)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10000 5000.55 50.59 5000 5000.51 50.41 4794 5201 407 -0.01 0.02
## se
## X1 0.51
summary(binom.data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4794 4966 5000 5001 5035 5201
mu <- mean(binom.data) # check mean of actual distribution
mu
## [1] 5000.546
sigma <- sd(binom.data) # check sd of actual distribution
sigma
## [1] 50.59332
#visualize
hist(x = binom.data,
main = "Histogram of the Binomial Distribution with n = 10,000 and p = 0.5 ",
xlab = "")
z <- matrix(data = rep(x = 0,
times = 40000
),
nrow = 10000,
ncol = 4)
# i indexes rows and j indexes columns
# let n be the sample size
#sample mean of 2 obs, 6 obs, 30, 1000
n <- c(2, 6, 30, 1000)
for (j in 1:4){
for (i in 1:10000){
z[i,j] <- mean(
sample( x = binom.data,
size = n[j],
replace = TRUE
)
)
}
}
colnames(z) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=1000")
summary(z)
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## Min. :4860 Min. :4925 Min. :4963 Min. :4995
## 1st Qu.:4976 1st Qu.:4986 1st Qu.:4994 1st Qu.:4999
## Median :5000 Median :5000 Median :5001 Median :5001
## Mean :5000 Mean :5000 Mean :5001 Mean :5001
## 3rd Qu.:5024 3rd Qu.:5014 3rd Qu.:5007 3rd Qu.:5002
## Max. :5148 Max. :5085 Max. :5034 Max. :5006
# the mean stays near 5000 at every sample size; the spread of the sample means narrows as sample size increases
hist(z, xlab = "", main = "Histogram of Sample Mean ")
#check CLT with graph
par(mfrow=c(3,2))
length(binom.data)
## [1] 10000
hist(x = binom.data,
main = "Histogram of Binom Distribution, N=10,000",
xlab = ""
)
# matrix column 1-4 which contain sample means of randomly chosen samples from binom dist
for (k in 1:4){
hist(x = z[,k],
main = "Histogram of Sample Mean of Binom Dist, N=10,000",
xlim = c(0,6000), # plot the domain of the X axis from x=0 to x=6000 only
xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)")
)
}
#check sample mean to population mean
mu
## [1] 5000.546
#vs
apply(X = z,
MARGIN = 2, # function applied on columns
FUN = mean)
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## 5000.181 5000.241 5000.686 5000.556
#check sample sd to population sd
sigma
## [1] 50.59332
#vs
apply(X = z,
MARGIN = 2,
FUN = sd) * c(sqrt(2),
sqrt(6),
sqrt(30),
sqrt(1000)
)
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## 50.03404 50.63752 50.55205 50.31923
# the rescaled values all sit near sigma, so the sd of the sample mean shrinks like sigma / sqrt(n) as sample size grows
Alternatively, apply the CLT on any other sample statistic like, say, the sample median, sample 25th percentile, or even the sample 80th percentile. This may be marginally harder than the last part, but you can try to submit both.
Does the central limit theorem hold as expected? Please elaborate (at least 3 points).
You can post a few pictures to substantiate your claim while answering the CLT part above. Make sure there are comments in your code to explain and walk the reader through your logic.
percentile_25 <- quantile(x = binom.data, # numeric vector whose sample quantiles are wanted
probs = c(.25) # numeric vector of probabilities with values in [0,1]
)
percentile_25
## 25%
## 4966
z2 <- matrix(data = rep(x = 0,
times = 50000
),
nrow = 10000,
ncol = 5)
n <- c(2, 6, 30, 120, 1000)
for (j in 1:5){
for (i in 1:10000){
z2[i,j] <- quantile( # CHANGE FROM MEAN TO QUANTILE
x = sample( x = binom.data,
size = n[j],
replace = TRUE
),
probs = c(.25)
)
}
}
# label each column with its sample size
colnames(z2) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=120", "Sample size=1000")
summary(z2)
## Sample size=2 Sample size=6 Sample size=30 Sample size=120 Sample size=1000
## Min. :4849 Min. :4876 Min. :4919 Min. :4944 Min. :4959
## 1st Qu.:4962 1st Qu.:4958 1st Qu.:4960 1st Qu.:4963 1st Qu.:4965
## Median :4987 Median :4974 Median :4968 Median :4967 Median :4966
## Mean :4987 Mean :4974 Mean :4968 Mean :4967 Mean :4966
## 3rd Qu.:5012 3rd Qu.:4991 3rd Qu.:4976 3rd Qu.:4971 3rd Qu.:4968
## Max. :5126 Max. :5076 Max. :5014 Max. :4995 Max. :4974
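The 25th percentile of the full data set is 4966. As the summary of the five sample sizes shows, the average sample 25th percentile moves toward that value as the sample size grows: about 4987 at a sample size of 2, 4974 at 6, 4968 at 30, 4967 at 120, and 4966 at 1000. The spread also narrows, from a range of 4849 to 5126 at a sample size of 2 down to 4959 to 4974 at a sample size of 1000. So the CLT-like behavior shows up from a sample size of about 30: the sample 25th percentile concentrates around the population 25th percentile (not the mean). Whether it is the sample median, the 25th percentile, or the 80th percentile, the same pattern is expected here: the sample statistic approaches the population value as the sample size becomes large.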
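One more way to check the shape, beyond the summary table, is a normal Q-Q plot of the sample percentiles. The sketch below is a self-contained illustration and not the class code: it regenerates the data under the name `pop`, uses only 2,000 repetitions (an arbitrary choice to keep it fast), and stores the results in a made-up vector `q25_draws`.

```r
# Illustrative sketch: are sample 25th percentiles roughly normal at n = 1000?
set.seed(7)
pop <- rbinom(10000, size = 10000, prob = 0.5)   # same setup as binom.data

# 2,000 resamples of size 1000; keep the 25th percentile of each
q25_draws <- replicate(
  2000,
  quantile(sample(pop, size = 1000, replace = TRUE), probs = 0.25)
)

# Compare the average sample percentile with the percentile of the full data
c(sample_q25_mean = mean(q25_draws),
  population_q25  = unname(quantile(pop, probs = 0.25)))

# Points close to the reference line suggest an approximately normal shape
qqnorm(q25_draws, main = "Q-Q plot of sample 25th percentiles, n = 1000")
qqline(q25_draws)
```

Because the binomial counts are integers, the points may form small steps along the line; that is a feature of discrete data rather than a failure of the approximation.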
## check graphically
par(mfrow=c(3,2))
hist(x = binom.data,
main = "Histogram of Binom Dist, N=10,000",
xlab = ""
)
for (k in 1:5){
hist(x = z2[,k], #z2 now
main = "Histogram of Sample 25th Percentile of Binom Dist",
xlim = c(0, 6000),
xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)")
)
}