The Chi-Squared Distribution and the Central Limit Theorem
chisq <- rchisq(1000, 1)  # example draw: 1000 observations from a chi-squared distribution with 1 degree of freedom
Introduction
Can I break the Central Limit Theorem? We explore the Central Limit Theorem (CLT) and how it applies to the chi-squared distribution by varying the distribution's parameter space. Because the chi-squared distribution arises from a probability mechanism (the sum of squared standard normal variables) rather than from direct measurement, we rarely observe it directly in the real world, which makes simulation the natural way to study how quickly its sums and averages approach normality.
Chi-Squared Distribution
In the context of the chi-squared distribution, there is no parameter commonly denoted "lambda." Instead, the chi-squared distribution is characterized by a single parameter, the degrees of freedom k. It is a special case of the gamma distribution, with shape parameter k/2 and scale parameter 2. The constraint on the degrees of freedom is that k must be a positive integer (k > 0), because it represents the number of independent standard normal random variables that are squared and summed to obtain the chi-squared distribution.
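To illustrate this definition, here is a minimal R sketch (the choice of k = 3 and the number of simulated values are mine, purely for illustration): summing k squared standard normal draws reproduces the distribution that rchisq() samples directly.
set.seed(1)
k <- 3         # degrees of freedom (illustrative choice)
n_sim <- 10000 # number of simulated values
# By definition: each chi-squared value is the sum of k squared standard normals
z <- matrix(rnorm(n_sim * k), nrow = n_sim, ncol = k)
by_definition <- rowSums(z^2)
# Draw directly from the chi-squared distribution for comparison
direct <- rchisq(n_sim, df = k)
# Both samples should have mean near k and variance near 2k
c(mean(by_definition), mean(direct))
c(var(by_definition), var(direct))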
Background on the chi-squared distribution
The chi-squared distribution has its roots in statistical theory and has found extensive use in many fields, particularly in inferential statistics. Here is a brief overview of its history and background.
Karl Pearson and the chi-squared test: The chi-squared distribution is closely associated with Karl Pearson, a pioneering English mathematician and statistician. In the early 20th century, Pearson developed the chi-squared test to assess the fit between observed and expected frequencies in categorical data. The test was introduced in Pearson's work on the goodness-of-fit test, where he used a statistic that follows the chi-squared distribution.
Development and naming: The distribution itself was first introduced by Karl Pearson in the context of his chi-squared test around 1900. The name comes from the Greek letter χ (chi); the test statistic is written χ².
Use in various fields: The chi-squared distribution is widely employed in hypothesis testing, particularly where categorical data are involved. It is often used to test the independence of variables in contingency tables. In genetics, chi-squared tests are used to analyze the distribution of observed and expected ratios of different genotypes.
Degrees of freedom: The chi-squared distribution is characterized by its degrees of freedom (k), which determine the shape of the distribution. In these tests, the degrees of freedom are related to the number of categories or groups in the data.
If you have categorical data, you can conduct a chi-squared test to determine whether there is a significant association between categorical variables. The general steps are (a worked example follows this overview):
1. Formulate the null hypothesis and the alternative hypothesis.
2. Collect and organize the categorical data into a contingency table.
3. Calculate the expected frequencies under the assumption of independence.
4. Compute the chi-squared test statistic.
5. Determine the p-value associated with the test statistic.
6. Make a decision based on the p-value and the chosen significance level.
Certain assumptions should be met for the chi-squared test to be valid, such as the observations being independent and the expected frequencies in each cell being sufficiently large; the test is generally suited to large sample sizes. If a study involves experimental design or data collection, careful planning is essential to ensure valid results.
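As an illustration of these steps, here is a minimal R sketch using a small hypothetical 2 x 2 contingency table (the counts are invented for demonstration); chisq.test() carries out steps 3 through 5.
# Hypothetical 2x2 contingency table (invented counts, for illustration only)
obs <- matrix(c(30, 20,
                15, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"), outcome = c("success", "failure")))
test <- chisq.test(obs, correct = FALSE)  # Pearson chi-squared test of independence
test$expected   # expected frequencies under independence (step 3)
test$statistic  # chi-squared test statistic (step 4)
test$parameter  # degrees of freedom: (2 - 1) * (2 - 1) = 1
test$p.value    # p-value to compare with the significance level (steps 5 and 6)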
How it occurs in nature
Do we ever find it in nature? Although it describes a probability mechanism rather than a directly measured quantity, the chi-squared distribution appears across fields such as physics, biology, and statistics, most prominently in statistical analysis and hypothesis testing. In genetics, the chi-squared test is commonly used to analyze the distribution of observed and expected genetic ratios in offspring, as in Mendelian genetics experiments; researchers use it to determine whether observed data fit expected patterns or whether significant deviations point to factors such as genetic linkage, gene expression effects, or genetic drift. In physics, the chi-squared distribution arises when analyzing sums of squared random variables: the sum of squares of independent standard normal random variables follows a chi-squared distribution, a fact used in statistical mechanics and quantum mechanics calculations. In fields like finance and economics, the chi-squared distribution is used in modeling and analyzing the variability of asset prices or returns, among other applications. In essence, the chi-squared distribution is a fundamental statistical distribution that emerges in diverse phenomena and is widely applied in scientific research and analysis across disciplines.
What is the scope of the study
In a chi-squared test, the parameter space involves the degrees of freedom associated with the chi-squared distribution. The degrees of freedom are the crucial parameter that determines the shape and characteristics of the distribution. For the chi-squared goodness-of-fit test, the degrees of freedom equal the number of categories or groups minus one: with k categories, df = k − 1. This test assesses whether the observed distribution of categorical data fits a specified theoretical distribution. For the chi-squared test of independence, the degrees of freedom are determined by the number of rows (r) and columns (c) in the contingency table: df = (r − 1) × (c − 1), so a 3 × 4 table, for example, has (3 − 1) × (4 − 1) = 6 degrees of freedom. This test examines whether there is a significant association between two categorical variables. In this study, the parameter space we explore consists of the degrees of freedom of the simulated distribution together with the sample size n, laid out in the grid below.
N <- 1000      # number of sample means drawn per simulation condition
nrep <- 100    # repetitions of each condition
results <- NULL
# Build the grid of simulation conditions by looping over every parameter combination;
# for a gamma-style parameterization the parameters are theta, shape (k), and n
for (theta in c(0.1, 0.5, 1, 1.5, 2.5)) {
  for (k in c(4, 6.5, 7, 7.5, 8, 8.5, 9, 10, 15)) {
    for (n in seq(1, 50, 5)) {   # up to 50 because the CLT is usually invoked for n > 30
      for (rep in 1:nrep) {
        results <- c(results, c(n, N, theta, k))
      }
    }
  }
}
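The loop above stores every combination in one flat vector; as a sketch of how it could be reshaped for analysis (the column names are labels I am supplying, matching the order in which the values are appended):
# Reshape the flat results vector into one row per loop iteration
results_df <- as.data.frame(matrix(results, ncol = 4, byrow = TRUE))
names(results_df) <- c("n", "N", "theta", "k")
head(results_df)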
Methodology
How am I changing the parameter space?
1. Understand the chi-squared distribution: recognize that as the degrees of freedom increase, the chi-squared distribution approaches normality.
2. Decrease the degrees of freedom: adjust the degrees-of-freedom parameter (df) to a lower value. The lower the degrees of freedom, the more the distribution deviates from a normal distribution.
3. Assess the impact: generate random samples from the chi-squared distribution with reduced degrees of freedom and observe how the distribution of the sum (or average) behaves, compared with the same exercise at higher degrees of freedom.
4. Statistical testing: use statistical tests or graphical methods, such as Q-Q plots or hypothesis tests, to assess how well the resulting distribution adheres to normality.
It is important to note that while reducing the degrees of freedom influences the shape of the distribution, complete divergence from the central limit theorem may not be achievable, especially with a sufficiently large sample size, because of the robustness of the CLT. The impact on normality is more pronounced with smaller sample sizes and lower degrees of freedom.
Results
The results of a chi-squared test tell us whether there is a significant association between categorical variables, or whether observed categorical data fit a theoretical distribution. The test yields a test statistic and a p-value, and the interpretation depends on the context of the study and the null hypothesis.
1. Test statistic: the chi-squared test produces a test statistic (χ²) that quantifies the difference between the observed and expected frequencies. A larger test statistic indicates a greater discrepancy between observed and expected values.
2. Degrees of freedom: the degrees of freedom depend on the specific test (goodness-of-fit or independence) and determine the critical value for the test.
3. P-value: the p-value measures the probability of observing the test statistic (or a more extreme value) under the assumption that the null hypothesis is true. A small p-value (typically below a chosen significance level such as 0.05) is evidence against the null hypothesis.
Interpretation:
• Goodness-of-fit test: the null hypothesis (H0) is that the observed distribution fits the expected distribution; the alternative (H1) is that it does not. A small p-value suggests evidence to reject the hypothesis that the observed distribution fits the expected distribution.
• Test of independence: the null hypothesis (H0) is that there is no association between the two categorical variables; the alternative (H1) is that there is a significant association. A small p-value indicates evidence to reject the hypothesis that the variables are independent.
Decision rule:
• If the p-value is less than the chosen significance level (e.g., 0.05), reject the null hypothesis.
• If the p-value is greater than the significance level, fail to reject the null hypothesis.
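The same decision logic (compare a p-value to a chosen significance level) is what this study applies with the Shapiro-Wilk normality test: for a given condition we ask whether the sample means of chi-squared draws look normal. A minimal sketch of one such check, with illustrative parameter choices of my own (df = 2, n = 5, 1000 sample means):
set.seed(42)
df_low <- 2      # low degrees of freedom (illustrative)
n <- 5           # small sample size per mean (illustrative)
n_means <- 1000  # number of sample means
# Each sample mean averages n chi-squared draws; repeat n_means times
sample_means <- replicate(n_means, mean(rchisq(n, df = df_low)))
# Graphical check against normality
qqnorm(sample_means)
qqline(sample_means, col = "red")
# Shapiro-Wilk test: H0 = the sample means are normally distributed
shapiro.test(sample_means)   # a small p-value leads to rejecting normality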
In conclusion, the results of a chi-squared test help researchers make informed decisions about relationships between categorical variables or the goodness-of-fit of observed data to expected distributions. The test is widely used across fields for hypothesis testing and data analysis involving categorical data.
Summary and Conclusions
The Central Limit Theorem is a fundamental concept in statistics: the distribution of the sum or average of a large number of independent, identically distributed random variables approaches a normal distribution. The CLT is not universally applicable; it has specific conditions, one of which is that the original distribution must have a finite mean and variance.
In our case, the chi-squared distribution with k degrees of freedom is the sum of the squares of k independent standard normal random variables, so lowering the degrees of freedom directly changes its shape. This can make the normal approximation promised by the CLT slower to take hold.
If you lower the degrees of freedom, you alter the shape of the chi-squared distribution: as the degrees of freedom decrease, the distribution becomes more skewed and less symmetric. Its mean (k) and variance (2k) remain well defined, but they shrink, and the heavy right skew at low degrees of freedom means that sums or averages of only a few terms are far from normal.
Apparent breaking of the CLT in the context of a chi-squared distribution with lowered degrees of freedom can be traced to the following:
Finite mean and variance: the CLT assumes that the original distribution (here, the chi-squared distribution) has a finite mean and variance. The chi-squared distribution satisfies this for every positive number of degrees of freedom (mean k, variance 2k), so the theorem still applies in principle; what changes at low degrees of freedom is how quickly the normal approximation becomes accurate.
Impact on normality: the CLT does not require the underlying distribution to be normal, but the more skewed the distribution, the more slowly sums or averages converge to normality. Lowering the degrees of freedom makes the chi-squared distribution markedly more right-skewed, so a larger sample is needed before the normal approximation holds (a short sketch illustrating this follows this list).
Small sample size: lowering the degrees of freedom often corresponds to reducing the number of terms in the sum. If the sample size is small, the distribution of the sum can remain visibly non-normal, since the CLT's approximation only becomes accurate as the number of terms grows.
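A minimal sketch of this skewness effect, with illustrative choices of degrees of freedom and sample size and a simple moment-based skewness helper of my own:
set.seed(7)
n <- 10          # observations per sample mean (illustrative)
n_means <- 5000  # number of sample means
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3  # simple moment-based skewness
means_low  <- replicate(n_means, mean(rchisq(n, df = 2)))   # low degrees of freedom
means_high <- replicate(n_means, mean(rchisq(n, df = 10)))  # high degrees of freedom
# At the same sample size, the low-df means remain noticeably more right-skewed
skewness(means_low)
skewness(means_high)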
How to use this information
We live in an analytical, data-driven world, and we increasingly rely on data to improve decisions. The central limit theorem (CLT) is a crucial tool for statistical inference and hypothesis testing across science and analytics. However, when dealing with the chi-squared distribution at low degrees of freedom, the assumptions behind routine use of the CLT may not hold in practice, and the normal approximation may be unreliable. When the normal approximation is inaccurate, it leads to incorrect confidence intervals and hypothesis tests. Many statistical tests and procedures depend on the approximate normality the CLT provides; if that approximation fails because of the heavy skew of a chi-squared distribution with low degrees of freedom, the results of these tests may be misleading or less accurate. All in all, understanding the impact of low degrees of freedom in the chi-squared distribution is important for making well-informed statistical decisions. Researchers, practitioners, and analysts need to be aware of the limitations imposed by non-normality when working with data that follow a chi-squared distribution with few degrees of freedom. In such cases, alternative methods or distribution-specific approaches may be more suitable.
Appendix
https://www.scribbr.com/statistics/central-limit-theorem/
https://stats.libretexts.org/Bookshelves/Probability_Theory/Probability_Mathematical_Statistics_and_Stochastic_Processes_(Siegrist)/06%3A_Random_Samples/6.04%3A_The_Central_Limit_Theorem
https://www.statsdirect.co.uk/help/distributions/chi_square_distribution.htm
Code
# Low degrees of freedom (df = 2), 31 draws per sum
set.seed(123)
low_df <- 2
num_samples <- 1000   # number of simulated sums
sample_size <- 31     # chi-squared draws per sum
chi_data <- matrix(rchisq(num_samples * sample_size, df = low_df), nrow = num_samples)
sums <- rowSums(chi_data)
hist_info <- hist(sums, main = "Histogram of Sums with Theoretical Normal Distribution", xlab = "Sum", col = "lightblue", border = "black", freq = TRUE, breaks = 30)
theoretical_mean <- sample_size * low_df           # mean of the sum: n * df
theoretical_sd <- sqrt(2 * low_df * sample_size)   # sd of the sum: sqrt(2 * df * n)
total_obs <- length(sums)
scale_factor <- total_obs * diff(hist_info$mids)[1]  # convert density to count scale
curve(dnorm(x, mean = theoretical_mean, sd = theoretical_sd) * scale_factor, from = min(sums), to = max(sums), add = TRUE, col = "red", lwd = 2)
legend("topright", legend = c("Sum Distribution", "Normal Distribution"), col = c("lightblue", "red"), lty = 1, lwd = 2)
# High degrees of freedom (df = 10), 31 draws per sum
set.seed(123)
high_df <- 10
num_samples <- 1000   # number of simulated sums
sample_size <- 31     # chi-squared draws per sum
chi_data <- matrix(rchisq(num_samples * sample_size, df = high_df), nrow = num_samples)
sums <- rowSums(chi_data)
hist_info <- hist(sums, main = "Histogram of Sums with Theoretical Normal Distribution", xlab = "Sum", col = "lightblue", border = "black", freq = TRUE, breaks = 30)
theoretical_mean <- sample_size * high_df           # mean of the sum: n * df
theoretical_sd <- sqrt(2 * high_df * sample_size)   # sd of the sum: sqrt(2 * df * n)
total_obs <- length(sums)
scale_factor <- total_obs * diff(hist_info$mids)[1]  # convert density to count scale
curve(dnorm(x, mean = theoretical_mean, sd = theoretical_sd) * scale_factor, from = min(sums), to = max(sums), add = TRUE, col = "red", lwd = 2)
legend("topright", legend = c("Sum Distribution", "Normal Distribution"), col = c("lightblue", "red"), lty = 1, lwd = 2)
# Low degrees of freedom (df = 2), 50 draws per sum
set.seed(123)
low_df <- 2
num_samples <- 1000   # number of simulated sums
sample_size <- 50     # chi-squared draws per sum
chi_data <- matrix(rchisq(num_samples * sample_size, df = low_df), nrow = num_samples)
sums <- rowSums(chi_data)
hist_info <- hist(sums, main = "Histogram of Sums with Theoretical Normal Distribution", xlab = "Sum", col = "lightblue", border = "black", freq = TRUE, breaks = 30)
theoretical_mean <- sample_size * low_df           # mean of the sum: n * df
theoretical_sd <- sqrt(2 * low_df * sample_size)   # sd of the sum: sqrt(2 * df * n)
total_obs <- length(sums)
scale_factor <- total_obs * diff(hist_info$mids)[1]  # convert density to count scale
curve(dnorm(x, mean = theoretical_mean, sd = theoretical_sd) * scale_factor, from = min(sums), to = max(sums), add = TRUE, col = "red", lwd = 2)
legend("topright", legend = c("Sum Distribution", "Normal Distribution"), col = c("lightblue", "red"), lty = 1, lwd = 2)
# High degrees of freedom (df = 10), 50 draws per sum
set.seed(123)
high_df <- 10
num_samples <- 1000   # number of simulated sums
sample_size <- 50     # chi-squared draws per sum
chi_data <- matrix(rchisq(num_samples * sample_size, df = high_df), nrow = num_samples)
sums <- rowSums(chi_data)
hist_info <- hist(sums, main = "Histogram of Sums with Theoretical Normal Distribution", xlab = "Sum", col = "lightblue", border = "black", freq = TRUE, breaks = 30)
theoretical_mean <- sample_size * high_df           # mean of the sum: n * df
theoretical_sd <- sqrt(2 * high_df * sample_size)   # sd of the sum: sqrt(2 * df * n)
total_obs <- length(sums)
scale_factor <- total_obs * diff(hist_info$mids)[1]  # convert density to count scale
curve(dnorm(x, mean = theoretical_mean, sd = theoretical_sd) * scale_factor, from = min(sums), to = max(sums), add = TRUE, col = "red", lwd = 2)
legend("topright", legend = c("Sum Distribution", "Normal Distribution"), col = c("lightblue", "red"), lty = 1, lwd = 2)
# Low degrees of freedom (df = 2), 90 draws per sum
set.seed(123)
low_df <- 2
num_samples <- 1000   # number of simulated sums
sample_size <- 90     # chi-squared draws per sum
chi_data <- matrix(rchisq(num_samples * sample_size, df = low_df), nrow = num_samples)
sums <- rowSums(chi_data)
hist_info <- hist(sums, main = "Histogram of Sums with Theoretical Normal Distribution", xlab = "Sum", col = "lightblue", border = "black", freq = TRUE, breaks = 30)
theoretical_mean <- sample_size * low_df           # mean of the sum: n * df
theoretical_sd <- sqrt(2 * low_df * sample_size)   # sd of the sum: sqrt(2 * df * n)
total_obs <- length(sums)
scale_factor <- total_obs * diff(hist_info$mids)[1]  # convert density to count scale
curve(dnorm(x, mean = theoretical_mean, sd = theoretical_sd) * scale_factor, from = min(sums), to = max(sums), add = TRUE, col = "red", lwd = 2)
legend("topright", legend = c("Sum Distribution", "Normal Distribution"), col = c("lightblue", "red"), lty = 1, lwd = 2)
# High degrees of freedom (df = 10), 90 draws per sum
set.seed(123)
high_df <- 10
num_samples <- 1000   # number of simulated sums
sample_size <- 90     # chi-squared draws per sum
chi_data <- matrix(rchisq(num_samples * sample_size, df = high_df), nrow = num_samples)
sums <- rowSums(chi_data)
hist_info <- hist(sums, main = "Histogram of Sums with Theoretical Normal Distribution", xlab = "Sum", col = "lightblue", border = "black", freq = TRUE, breaks = 30)
theoretical_mean <- sample_size * high_df           # mean of the sum: n * df
theoretical_sd <- sqrt(2 * high_df * sample_size)   # sd of the sum: sqrt(2 * df * n)
total_obs <- length(sums)
scale_factor <- total_obs * diff(hist_info$mids)[1]  # convert density to count scale
curve(dnorm(x, mean = theoretical_mean, sd = theoretical_sd) * scale_factor, from = min(sums), to = max(sums), add = TRUE, col = "red", lwd = 2)
legend("topright", legend = c("Sum Distribution", "Normal Distribution"), col = c("lightblue", "red"), lty = 1, lwd = 2)
Final Report
100 repetitions
Using the Shapiro-Wilk procedure, we test the null hypothesis that the 1000 draws of sample means, each computed from n observations from the chi-squared distribution, approximate a normal distribution, against the alternative hypothesis that the 1000 sample means do not approximate a normal distribution.
The factors of the experiment are:
Sample size (n), with levels spanning the range of sample sizes explored in the experiment to assess the Central Limit Theorem's applicability.
Degrees of freedom (df) for the chi-squared distribution, with levels chosen to explore how varying the degrees of freedom affects the distribution's shape and its convergence to normality.
This factorial experiment is designed to rigorously test the central limit theorem's predictions about the normality of sample means, particularly through the lens of the chi-squared distribution's behavior under different sample sizes and degrees of freedom.
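A minimal sketch of such a factorial experiment (the levels below are illustrative stand-ins; the actual levels are those defined in the parameter grid earlier in the report): for each (df, n) combination and each repetition, draw 1000 sample means and record whether the Shapiro-Wilk test rejects normality at the 0.05 level.
set.seed(2024)
n_means <- 1000   # sample means per Shapiro-Wilk test
nrep <- 100       # repetitions per condition
df_levels <- c(2, 10)      # illustrative degrees-of-freedom levels
n_levels <- c(5, 30, 90)   # illustrative sample-size levels
experiment <- expand.grid(df = df_levels, n = n_levels)
experiment$reject_rate <- NA
for (i in seq_len(nrow(experiment))) {
  rejections <- replicate(nrep, {
    means <- replicate(n_means, mean(rchisq(experiment$n[i], df = experiment$df[i])))
    shapiro.test(means)$p.value < 0.05   # TRUE if normality is rejected
  })
  experiment$reject_rate[i] <- mean(rejections)  # fraction of repetitions rejecting normality
}
experiment   # higher rejection rates are expected for low df and small n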
