ADEC7310 Discussion 4

0. Begin by setting the seed in R. The recommended way to specify a seed is set.seed(seed = 42), where seed can be any single value that is interpreted as an integer (42 here, but you can use your favorite number instead).

set.seed(seed = 42) # Because 42 is the ONLY number that matters!

1. Please Google and describe Law of Large Numbers in your own words.

Law of Large Numbers (LLN): LLN states that as more observations are collected, the sample mean converges to the true population mean. This seems rather intuitive to me. As the simplest possible example, we know that the probability of flipping a fair coin and getting heads is 50%. But we can easily observe a scenario where we flip a coin 10 times and see 7 heads, an observed proportion of 70%. What LLN states is that the more observations we collect, the closer the sample mean gets to the true mean. In the case of the coin flips, if we instead flip the coin 100,000 times, the proportion of heads will converge toward the true 50% that we expect. LLN is simply a statement about the accuracy of sampling. As a formula, we can say:

\(\text{As } n \to \infty,\ \text{the sample mean } \bar{X}_n \to \mu\)
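The coin-flip example can be simulated directly. The following is a minimal sketch in base R, assuming a fair coin and an illustrative 100,000 flips; the variable names (`n_flips`, `flips`, `running_prop`) are my own and not part of the assignment.

```r
# Minimal LLN sketch: running proportion of heads in simulated fair-coin flips.
set.seed(42)                          # same seed used elsewhere in this post

n_flips <- 100000                     # illustrative number of flips (assumption)
flips <- rbinom(n_flips, size = 1, prob = 0.5)   # 1 = heads, 0 = tails

# Running proportion of heads after 1, 2, ..., n_flips flips
running_prop <- cumsum(flips) / seq_len(n_flips)

# Proportion of heads after 10, 1,000, and 100,000 flips
print(running_prop[c(10, 1000, 100000)])

# Plot the running proportion on a log x-axis so the early noise is visible
plot(running_prop, type = "l", log = "x",
     xlab = "Number of flips (log scale)", ylab = "Proportion of heads",
     main = "Law of Large Numbers: coin flips")
abline(h = 0.5, col = "red", lty = 2)   # true probability of heads
```

Early in the sequence the proportion can be far from 0.5, but the line should settle close to the dashed reference as the number of flips grows.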

2. Please explain CLT in your own words.

Central Limit Theorem (CLT): CLT states that as the sample size increases, the sampling distribution of the sample mean approaches a normal distribution, regardless of the shape of the population (provided it has finite variance). Personally, I find this fascinating and slightly less intuitive than LLN. One example is to consider rolling a single die. The probability of rolling a 1, 2, 3, 4, 5, or 6 is 1/6 each, so a plot of the outcomes over many rolls would be a boring rectangle. Now plot the average of 2 dice over 1000 rolls, then of 5 dice, then of 50 dice. It becomes obvious that the plot converges upon the familiar bell-shaped normal distribution.
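The dice example can be sketched in base R as follows. This assumes fair six-sided dice and 1000 repetitions per panel; `rolls`, `dice_counts`, and `dice_means` are illustrative names, not part of the assignment.

```r
# Minimal CLT sketch: distribution of the average of k fair dice.
set.seed(42)

rolls <- 1000                         # repetitions per histogram
dice_counts <- c(1, 2, 5, 50)         # number of dice averaged in each panel

# For each dice count k, average k fair dice and repeat `rolls` times
dice_means <- lapply(dice_counts, function(k) {
  replicate(rolls, mean(sample(1:6, size = k, replace = TRUE)))
})
names(dice_means) <- paste0("dice_", dice_counts)

# One histogram per dice count, arranged in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
for (i in seq_along(dice_counts)) {
  hist(dice_means[[i]], breaks = 20, probability = TRUE,
       main = paste("Average of", dice_counts[i], "dice"),
       xlab = "Average face value", col = "lightblue")
}
par(old_par)                          # restore the previous plot layout
```

With one die the histogram should look roughly flat. As more dice are averaged, it should tighten around 3.5 and take on a bell shape.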

3. What are the similarities and differences between LLN and CLT? Write a few lines.

Compare and contrast LLN and CLT: Both theorems are fundamental concepts in statistics, and both describe what happens as the sample size grows large. LLN tells us where the sample mean ends up (the population mean), whereas CLT describes the shape of the sample mean's distribution around that value. With CLT, we can therefore see not only where the center (mean) is but also the spread.
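In symbols, the contrast can be sketched as follows, assuming independent, identically distributed observations with mean \(\mu\) and finite standard deviation \(\sigma\) (standard textbook assumptions that are not stated above):

\(\text{LLN: } \bar{X}_n \to \mu \quad \text{as } n \to \infty\)

\(\text{CLT: } \dfrac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty\)

So LLN is a statement about a single limiting value, while CLT describes the approximate distribution \(\bar{X}_n \approx N(\mu, \sigma^2 / n)\) for large \(n\).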

4. Pick any distribution apart from normal, uniform, or Poisson. You can look up the distribution on Wikipedia and/or read how to implement it in R (what parameters are required to generate the distribution). Please describe this distribution first in 5 lines.

Negative Binomial Distribution: I found this explanation on YouTube: https://www.youtube.com/watch?v=VlVDKA9pg4A . It helped me contrast the binomial and negative binomial in a way I can understand. In the Binomial Distribution, we count how many successes we get in a fixed number of trials. The Negative Binomial answers the question: “How many trials until we get a fixed number of successes?” Note that R's rnbinom() uses a closely related form: it counts the number of failures before the target number of successes, not the total number of trials.
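To make the "failures versus trials" distinction concrete, here is a minimal sketch in base R. The values r = 3 and p = 0.4 are illustrative choices, and `failures` and `trials` are names introduced only for this example.

```r
# Minimal sketch: R's negative binomial counts failures, not total trials.
set.seed(42)

r <- 3        # target number of successes (illustrative)
p <- 0.4      # probability of success on each trial (illustrative)

# rnbinom() returns the number of failures before the r-th success
failures <- rnbinom(100000, size = r, prob = p)

# Total trials needed = failures + the r successes themselves
trials <- failures + r

cat("Mean failures (theory r*(1-p)/p):", mean(failures), "vs", r * (1 - p) / p, "\n")
cat("Mean trials   (theory r/p):      ", mean(trials), "vs", r / p, "\n")

# Probability of zero failures, i.e. r successes in a row, equals p^r
print(dnbinom(0, size = r, prob = p))
print(p^r)
```

The simulated means should land close to the theoretical values, and the two printed probabilities should agree.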

5A. Then, apply the CLT on the sample mean of this chosen distribution in R.

# Thank you, Arvinid Sharma for showing me these 
rm(list = ls()) # Clear environment
gc()            # Clear unused memory

##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  620411 33.2    1422270   76   702071 37.5
## Vcells 1155864  8.9    8388608   64  1927866 14.8

cat("\f")       # Clear the console

#dev.off()     # Clear par(mfrow=c(3,2)) type commands 

# The answer to the ultimate question of life, the universe, and everything!
set.seed(42)

# DISCLOSURE: I used this RPubs post
# https://rpubs.com/JoelChG/StatInf-CourseProject
# to organize the approach below. Since I have had mixed results with ggplot2, I am sticking with base R for plotting.
# I also referenced this site to help me understand the replicate function:
# https://www.geeksforgeeks.org/r-language/how-to-use-the-replicate-function-in-r

# Parameters for our negative binomial distribution
r <- 3      # number of successes we're waiting for
p <- 0.4    # probability of success

# Parameters of our population (mean and standard deviation)
mu <- r * (1 - p) / p 
sigma <- sqrt(r * (1 - p) / p^2) 

cat("Mean:", mu, "\n")

## Mean: 4.5

cat("Standard Deviation:", sigma, "\n\n")

## Standard Deviation: 3.354102

# Generate a large population to visualize
population <- rnbinom(100000, size = r, prob = p)

# Create sampling distributions with different sample sizes
simulations <- 10000

# Going to use the replicate function for the first time.
# The syntax for replicate is replicate(n, expression) where n is the number of times to evaluate the expression. In our case, 10000.

# Sample size = 5
sim5 <- replicate(simulations, mean(rnbinom(5, r, p)))
# For me: mean(rnbinom(5, r, p))

# Sample size = 30
sim30 <- replicate(simulations, mean(rnbinom(30, r, p)))
mean(rnbinom(30, r, p))

## [1] 4.333333

# Sample size = 100
sim100 <- replicate(simulations, mean(rnbinom(100, r, p)))

# Draw the plots in a 2 x 3 grid
par(mfrow = c(2, 3))

# Plot 1: Original population distribution
hist(population, breaks = 50, 
     main = "Original Negative Binomial\nPopulation",
     xlab = "Number of Failures",
     col = "lightblue",
     probability = TRUE,
     xlim = c(0, 30))
abline(v = mu, col = "red", lwd = 2, lty = 2)
text(mu + 4, 0.2, paste("μ =", round(mu, 2)), col = "red")

# Plot 2: n = 5
hist(sim5, breaks = 30,
     main = "Sampling Distribution\n(n = 5)",
     xlab = "Sample Mean",
     col = "lightgreen",
     probability = TRUE,
     xlim = c(0, 10))
abline(v = mu, col = "red", lwd = 2, lty = 2)
text(mu + 0.5, 0.31, paste("μ =", round(mu, 2)), col = "red")

# Overlay normal curve
curve(dnorm(x, mu, sigma/sqrt(5)), 
      add = TRUE, col = "blue", lwd = 2)

# Plot 3: n = 30
hist(sim30, breaks = 30,
     main = "Sampling Distribution\n(n = 30)",
     xlab = "Sample Mean",
     col = "violet",
     probability = TRUE,
     xlim = c(2, 7))
abline(v = mu, col = "red", lwd = 2, lty = 2)
text(mu + 0.5, 0.61, paste("μ =", round(mu, 2)), col = "red")

# Overlay normal curve
curve(dnorm(x, mu, sigma/sqrt(30)), 
      add = TRUE, col = "blue", lwd = 2)

# Empty plot just to make it prettier
frame()

# Plot 4: n = 100
hist(sim100, breaks = 30,
     main = "Sampling Distribution\n(n = 100)",
     xlab = "Sample Mean",
     col = "lightblue",
     probability = TRUE,
     xlim = c(3, 6))
abline(v = mu, col = "red", lwd = 2, lty = 2)
text(mu + 0.5, 0.8, paste("μ =", round(mu, 2)), col = "red")

# Overlay normal curve
curve(dnorm(x, mu, sigma/sqrt(100)), 
      add = TRUE, col = "blue", lwd = 2)
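The plots can be backed up with a quick numerical check. The sketch below re-runs the same kind of simulation in a self-contained way and compares the standard deviation of the sample means with the value the CLT predicts, sigma / sqrt(n). The table `check` and its column names are my own labels, and because the random draws are regenerated here, the numbers will not match the histograms above exactly.

```r
# Minimal sketch: compare simulated sample means with CLT predictions.
set.seed(42)

r <- 3                                 # number of successes
p <- 0.4                               # probability of success
mu <- r * (1 - p) / p                  # population mean (failures)
sigma <- sqrt(r * (1 - p) / p^2)       # population standard deviation
simulations <- 10000
sample_sizes <- c(5, 30, 100)

# One row per sample size: observed vs theoretical centre and spread
check <- t(sapply(sample_sizes, function(n) {
  means <- replicate(simulations, mean(rnbinom(n, size = r, prob = p)))
  c(n = n,
    mean_of_means = mean(means),
    sd_of_means = sd(means),
    theoretical_se = sigma / sqrt(n))
}))

print(round(check, 3))
```

If the CLT is behaving as expected, `mean_of_means` should sit near 4.5 for every n, and `sd_of_means` should track `theoretical_se` while shrinking as n grows.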

Does the central limit theorem hold as expected? Please elaborate (at least 3 points).

CLT is demonstrated. First, the original distribution is right-skewed (i.e., not normal). Second, even with n = 5, we already begin to see the familiar bell shape, although some right skew remains. Third, once n reaches 30, which satisfies the usual rule-of-thumb sample size, the sampling distribution looks very close to normal, and by n = 100 it has clearly converged on the normal curve. In every case the sample means are centered on the population mean of 4.5. That said, with a large enough sample size, we can use normal-based statistical methods for the sample mean.
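The fading skew can also be measured instead of judged by eye. The sketch below computes a moment-based skewness for the simulated sample means at each n. `sample_skewness` is a hypothetical helper written for this example (base R has no built-in skewness function), and the population skewness formula (2 - p) / sqrt(r(1 - p)) is the standard result for the failures-counting negative binomial.

```r
# Minimal sketch: skewness of the sample mean shrinks as n grows.
set.seed(42)

r <- 3
p <- 0.4
simulations <- 10000
sample_sizes <- c(5, 30, 100)

# Hypothetical helper: moment-based sample skewness (not part of base R)
sample_skewness <- function(x) {
  mean((x - mean(x))^3) / sd(x)^3
}

# Empirical skewness of the simulated sample means for each n
skew_by_n <- sapply(sample_sizes, function(n) {
  means <- replicate(simulations, mean(rnbinom(n, size = r, prob = p)))
  sample_skewness(means)
})
names(skew_by_n) <- paste0("n=", sample_sizes)
print(round(skew_by_n, 3))

# Theory: skewness of the mean of n draws = population skewness / sqrt(n)
pop_skew <- (2 - p) / sqrt(r * (1 - p))
print(round(pop_skew / sqrt(sample_sizes), 3))
```

A normal distribution has skewness 0, so values moving toward 0 as n increases are consistent with the convergence seen in the histograms.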