0 Begin with setting seed in R.

# Setting Seed
set.seed(seed = 52) # Used favorite number of 52

1 Please Google and describe Law of Large Numbers in your own words.

The law of large numbers (LLN) is a theorem stating that if we repeat an experiment independently a large number of times and average the results, the average should be close to the true expected value. This theorem plays a central role in probability and statistics. The law of large numbers comes in two main versions, called the weak and strong laws of large numbers.

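Weak Law of Large Numbers: The weak law of large numbers (Bernoulli’s theorem is its special case for repeated yes/no trials) states that if you have a sample of independent and identically distributed variables with a finite mean, then as the sample size grows, the sample mean converges in probability to the population mean. In other words, for any fixed margin of error, the probability that the sample mean misses the population mean by more than that margin shrinks to zero. This essentially means the larger the sample, the better the chance that the sample mean is close to the true population mean.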

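Strong Law of Large Numbers: The strong law of large numbers states that if you have a sample of independent and identically distributed variables with a finite mean, then as the sample size grows the sample mean converges to the expected value of the random variable almost surely, that is, with probability 1. This is a stronger guarantee than the weak law: it says the running sample mean eventually settles at the expected value, not just that large errors become unlikely at each sample size.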
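The convergence described above can be seen in a small simulation. This is only a sketch: the chi-square distribution with df = 5, the maximum sample size, and the checkpoint values are illustrative assumptions, not part of the assignment.

```r
# Minimal LLN sketch: running mean of chi-square draws
# (df = 5 and n_max are illustrative choices)
set.seed(52)
df <- 5                          # the mean of a chi-square distribution equals df
n_max <- 100000
draws <- rchisq(n_max, df = df)
running_mean <- cumsum(draws) / seq_len(n_max)

# Error of the running mean at a few sample sizes
checkpoints <- c(10, 100, 1000, 10000, 100000)
lln_table <- data.frame(n = checkpoints,
                        running_mean = running_mean[checkpoints],
                        abs_error = abs(running_mean[checkpoints] - df))
print(lln_table)

# Running mean against sample size (log scale), with the true mean as reference
plot(running_mean, type = "l", log = "x",
     xlab = "Sample size n", ylab = "Running mean")
abline(h = df, col = "red", lty = 2)
```

The error at any one checkpoint depends on the seed, but the running mean should settle near 5 as n grows.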

2 Please explain CLT in your own words.

The central limit theorem (CLT) lets us study populations with differently shaped distributions. It states that when we draw independent samples from a population with a finite variance, the distribution of the sample mean approaches a normal distribution as the sample size grows, whatever the shape of the population. This gives us the opportunity to apply the normal distribution to many different populations when the sample size is large enough. A common rule of thumb is a sample size greater than 30, although strongly skewed populations can need more. This is beneficial since we can apply it when data is drawn from populations that have an unknown shape, because the sample means are approximately normally distributed. As the sample size n increases, the distribution of the sample mean narrows, since its standard deviation (the standard error, sigma / sqrt(n)) decreases.
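The narrowing described above can be checked numerically. The sketch below compares the simulated standard deviation of the sample mean with sigma / sqrt(n). The chi-square population with df = 5, the sample sizes, and the number of replications are illustrative assumptions.

```r
# Minimal sketch: the standard error of the mean shrinks like sigma / sqrt(n)
# (df, sample_sizes, and reps are illustrative choices)
set.seed(52)
df <- 5
sigma <- sqrt(2 * df)            # standard deviation of a chi-square(df) variable
reps <- 5000                     # number of simulated samples per sample size
sample_sizes <- c(10, 40, 160)

# Standard deviation of the simulated sample means for each sample size
simulated_se <- sapply(sample_sizes, function(n) {
  sd(replicate(reps, mean(rchisq(n, df = df))))
})
theoretical_se <- sigma / sqrt(sample_sizes)

print(data.frame(n = sample_sizes, simulated_se, theoretical_se))
```

Each fourfold increase in n should roughly halve both columns.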

3 What are the similarities and differences between LLN and CLT?

Similarities: Both the LLN and the CLT describe how the sample mean behaves as the sample size increases. The LLN says that as the sample size increases, the sample mean gets closer to the true mean. Similarly, under the CLT, the distribution of the sample mean gets narrower as the sample size increases, because its standard deviation gets smaller. This tells us that the sample means are concentrated more tightly around the population mean. Both are also fundamental concepts in statistics.

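Differences: The LLN focuses on the value of the sample mean itself, while the CLT focuses on the distribution of the sample mean. The LLN tells us where the sample mean ends up (the population mean), whereas the CLT describes how the sample mean is scattered around that value: approximately normally, with a spread proportional to 1 / sqrt(n). This difference is highlighted by the fact that, as a rule of thumb, we can apply the CLT to samples over 30 and then rely on the known properties of the normal distribution to attach probabilities to the sample mean. The LLN on its own gives no such probabilities.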

4 Pick up any distribution apart from normal, uniform or poisson. You can Wikipedia about the distribution and/or read how to implement the distribution in R.

?Chisquare

I decided to look into the chi-square distribution. This is a continuous probability distribution that is widely used in hypothesis testing. Its single parameter k, the degrees of freedom (the df argument in R), determines the shape, center, and spread of the distribution. Below, I show four views of this distribution, one for each of the functions R gives us: dchisq, pchisq, qchisq, and rchisq.

Example 1 using the dchisq (Density) Function

x_dchisq <- seq(0, 30, by = 0.1)   # Specify x-values for dchisq function
y_dchisq <- dchisq(x_dchisq, df = 5)  # Apply dchisq function
plot(x_dchisq, y_dchisq, type = "l") # Plot density against the x-values

Example 2 using the pchisq (CDF) Function

x_pchisq <- seq(0, 30, by = 0.1)  # Specify x-values for pchisq function
y_pchisq <- pchisq(x_pchisq, df = 5)  # Apply pchisq function
plot(x_pchisq, y_pchisq, type = "l") # Plot cumulative probability against the x-values

Example 3 using qchisq (Quantile) Function

x_qchisq <- seq(0, 1, by = 0.01) # Specify probabilities for qchisq function
y_qchisq <- qchisq(x_qchisq, df = 5) # Apply qchisq function (p = 1 returns Inf)
plot(x_qchisq, y_qchisq, type = "l") # Plot quantiles against the probabilities

Example 4 using the rchisq (Random) Function

N <- 10000  # Specify sample size

y_rchisq <- rchisq(N, df = 5)   # Draw N chi squared distributed values

hist(y_rchisq,          # Histogram of randomly drawn chi-square values
     breaks = 100,
     main = "")
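The four functions above are tied together: pchisq is the integral of dchisq, qchisq inverts pchisq, and samples from rchisq follow the same distribution. Here is a minimal sketch of those relationships, with the evaluation point x0 = 4 and the sample size chosen purely for illustration.

```r
# Minimal sketch of how the d/p/q/r functions relate
# (x0 and the sample size are illustrative choices)
df <- 5
x0 <- 4

# p is the area under d from 0 to x0
area <- integrate(dchisq, lower = 0, upper = x0, df = df)$value
cdf_value <- pchisq(x0, df = df)

# q inverts p: feeding the probability back in recovers x0
recovered_x <- qchisq(cdf_value, df = df)

# r draws values whose share at or below x0 approximates p
set.seed(52)
empirical_cdf <- mean(rchisq(100000, df = df) <= x0)

print(c(area = area, cdf = cdf_value,
        recovered_x = recovered_x, empirical = empirical_cdf))
```

The first, second, and fourth values should agree closely, and recovered_x should return the starting point of 4.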

5A Then, apply the CLT on the sample mean of this chosen distribution in R. (adapt our class R code, or you can find an alternative code on the web too).

To start applying the CLT to the chi-square distribution, we need its mean and standard deviation so we can plot the matching normal distribution. The mean of a chi-square distribution is equal to its degrees of freedom, and its standard deviation is sqrt(2 * df). Here I used df = 30 (rather than the df = 5 of the examples above), which gives a mean of 30 and a standard deviation of sqrt(60), about 7.75. A chi-square variable with 30 degrees of freedom is a sum of 30 independent squared standard normal variables, so the CLT suggests it is approximately normal with these parameters. Plugging them into the dnorm function and plotting gives the following normal distribution.

# Define the sequence of x-values
x <- seq(0, 60, by = 0.1)

# Compute the corresponding y-values using the normal distribution
y_normal <- dnorm(x, mean = 30, sd = sqrt(60))  # mean = df = 30, sd = sqrt(2 * df)

# Plot the normal distribution
plot(x, 
     y_normal, 
     type = "l", 
     xlab = "x", 
     ylab = "Density", 
     main = "Normal Distribution")
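The CLT can also be applied directly to the sample mean by simulation. The sketch below draws many samples from the chi-square distribution with df = 5 used in the earlier examples, computes each sample's mean, and overlays the normal curve the CLT predicts, with mean df and standard deviation sqrt(2 * df / n). The sample size n = 40 and the number of replications are illustrative assumptions.

```r
# Minimal sketch: CLT for the sample mean of chi-square draws
# (n and reps are illustrative choices)
set.seed(52)
df <- 5
n <- 40                          # size of each sample
reps <- 10000                    # number of simulated samples
sample_means <- replicate(reps, mean(rchisq(n, df = df)))

# Histogram of the simulated sample means on the density scale
hist(sample_means, breaks = 50, freq = FALSE,
     main = "Sample means of chi-square draws", xlab = "Sample mean")

# Normal curve predicted by the CLT: mean = df, sd = sqrt(2 * df / n)
clt_sd <- sqrt(2 * df / n)
curve(dnorm(x, mean = df, sd = clt_sd), add = TRUE, col = "red", lwd = 2)

print(c(mean_of_means = mean(sample_means),
        sd_of_means = sd(sample_means),
        clt_sd = clt_sd))
```

The histogram should sit close to the red curve even though individual chi-square draws with df = 5 are clearly right-skewed.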

5b Alternatively, apply the CLT on any other sample statistic like say the sample median, sample 25th percentile or even the sample 80th percentile. This may be marginally harder than the last part, but you can try to submit both.

To apply the CLT to another sample statistic, let's say we are looking at NHL goal scorers and are given the sample statistic that the 95th percentile of goal scorers scored 33 goals in an 82-game regular season. For this sample, we drew 50 random players out of 1078. The sample was restricted to forwards, and the sample mean was 21 goals. Because the sample size is greater than 30, I use a normal distribution here; note that backing the standard deviation out of a percentile also assumes that goals per player are roughly normally distributed. The first steps are writing out the parameters and then calculating the standard deviation from the z-score of the 95th percentile: SD = (33 - 21) / z, where z = qnorm(0.95) is about 1.645. This gives an SD of about 7.3 goals. With that, I could plot the entire normal distribution, which I did below.

# Given Sample Statistics
Pct95 <- 33 # Sample 95th pct Goals in an NHL regular season
Sample.Players <- 50 # Sample of 50 players
sample.mean <- 21 # Sample Mean
Total.Players <- 1078 # Players in the NHL

# Setting Parameters for Normal Dist in CLT

# Zscore for 95th percentile in a standard normal distribution
z_score <- qnorm(0.95)

# Calculate the standard deviation using the desired mean
standard_deviation <- (Pct95 - sample.mean) / z_score
round(standard_deviation, digits = 2)

## [1] 7.3

# Plotting Normal Distribution

# Generate a range of values around the mean
x <- seq(from = sample.mean - 3 * standard_deviation,
         to = sample.mean + 3 * standard_deviation,
         length.out = 1000)

# Calculate the probability density function
pdf <- dnorm(x = x,
             mean = sample.mean,
             sd = standard_deviation)

# Plot the normal distribution
plot(x = x,
     y = pdf,
     type = 'l',
     col = 'blue',
     lwd = 2,
     xlab = 'Goals',
     ylab = 'Density',
     main = 'Normal Distribution')
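The prompt also mentions the sample median. A minimal simulation sketch of that case follows, using the chi-square distribution with df = 5 from question 4 instead of the NHL numbers. For large n the sample median is approximately normal around the population median m, with standard deviation 1 / (2 * sqrt(n) * f(m)), where f is the population density. The sample size n = 200 and the number of replications are illustrative assumptions.

```r
# Minimal sketch: approximate normality of the sample median
# (n and reps are illustrative choices)
set.seed(52)
df <- 5
n <- 200                         # size of each sample
reps <- 5000                     # number of simulated samples
sample_medians <- replicate(reps, median(rchisq(n, df = df)))

# Large-sample approximation for the sample median
pop_median <- qchisq(0.5, df = df)
asymptotic_sd <- 1 / (2 * sqrt(n) * dchisq(pop_median, df = df))

# Histogram of simulated medians with the approximating normal curve
hist(sample_medians, breaks = 50, freq = FALSE,
     main = "Sample medians of chi-square draws", xlab = "Sample median")
curve(dnorm(x, mean = pop_median, sd = asymptotic_sd),
      add = TRUE, col = "red", lwd = 2)

print(c(simulated_sd = sd(sample_medians), asymptotic_sd = asymptotic_sd))
```

The same pattern works for other percentiles by replacing median() with quantile() and 0.5 with the target probability p, in which case the numerator of the standard deviation becomes sqrt(p * (1 - p)) instead of 1 / 2.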

DrewBaker_Discussion5

2024-04-19