A Short Essay Describing Normal, t, chi-square, and F Distributions, Their Assumptions, and Their Connections

  • Develop a clear technical understanding of nonparametric cumulative distribution function (CDF) estimation and various kernel density estimators.

  • Translate mathematical formulas into R functions and apply them to solve related problems.

  • Create effective visualizations to demonstrate your understanding of key concepts in the following questions.


The Normal Distribution

The Normal distribution is a continuous, unimodal distribution with a symmetric, bell-shaped curve. It is characterized by two parameters: its mean, \(\mu\), and its variance, \(\sigma^2\), and is written as \(N(\mu, \sigma^2)\). The Standard Normal is the Normal distribution with \(\mu = 0\) and \(\sigma^2 = 1\), written as \(N(0, 1)\).

For a random sample \(X_1, X_2, \ldots, X_n\) from a \(N(\mu, \sigma^2)\) population, we are often interested in the sample mean, \(\bar{X}\), as an estimator of \(\mu\). The mean of the distribution of \(\bar{X}\) is still \(\mu\), but its standard deviation is \(\sigma / \sqrt{n}\). Because the second parameter in the \(N(\mu, \sigma^2)\) notation is a variance, this is written as \(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\). The sample mean can be standardized by computing its Z-score, \(Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\). A Z-score represents how many standard deviations a value is away from the mean: a positive Z-score means the value is to the right of the mean, and a negative Z-score means it is to the left. Once standardized, \(Z \sim N(0, 1)\), the Standard Normal.
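
The standardization above can be sketched in a few lines of R. The helper name `z_score_mean` and the numbers used (a population with \(\mu = 100\), \(\sigma = 15\), and a sample of \(n = 25\) with \(\bar{X} = 106\)) are illustrative assumptions, not values from this essay.

```r
# Hypothetical helper: Z-score of a sample mean when sigma is known
z_score_mean <- function(xbar, mu, sigma, n) {
  (xbar - mu) / (sigma / sqrt(n))
}

# Assumed illustrative values: mu = 100, sigma = 15, n = 25, xbar = 106
z <- z_score_mean(xbar = 106, mu = 100, sigma = 15, n = 25)
z  # (106 - 100) / (15 / 5) = 2 standard errors above the mean

# Upper-tail probability of a sample mean at least this large under N(0, 1)
pnorm(z, lower.tail = FALSE)
```

Here the standard error is \(15/\sqrt{25} = 3\), so a sample mean of 106 sits two standard errors above \(\mu\).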

Below is a visualization of several Normal distribution curves with different means and variances, showing how these values change the appearance of the curve. The first is the Standard Normal, with a mean of 0 and a variance of 1. Two other curves also have a mean of 0 but different variances. The curve with a variance of 4 is much flatter and wider than the Standard Normal, while the curve with a variance of 0.25 is much narrower, with a sharper and higher peak. This shows that a larger variance spreads the curve out and a smaller variance concentrates it. Finally, one more curve has a mean of 2 and a variance of 1. It has the same spread as the Standard Normal, because the variances are equal, but is shifted two units to the right because its mean is 2 rather than 0. In general, a positive mean shifts the curve to the right of the Standard Normal and a negative mean shifts it to the left.

x <- seq(-6, 6, length = 1000)

# Standard Normal: mean = 0, var = 1
y_standard <- dnorm(x, mean = 0, sd = 1)

# mean = 0, var = 4 
y_wide <- dnorm(x, mean = 0, sd = 2)

# mean = 0, var = 0.25 
y_narrow <- dnorm(x, mean = 0, sd = 0.5)

# mean = 2, var = 1
y_shifted <- dnorm(x, mean = 2, sd = 1)

plot(x, y_standard,
     type = "l",
     lwd = 3,
     col = "purple",
     ylim = c(0, max(y_narrow)),
     main = expression("Normal Distributions with Different Values of " * mu * " and " * sigma^2),
     xlab = "x",
     ylab = "Density")

lines(x, y_wide, col = "lightblue", lwd = 3, lty = 2)
lines(x, y_narrow, col = "green", lwd = 3, lty = 3)
lines(x, y_shifted, col = "pink", lwd = 3)

legend("topright",
       legend = c(
         expression(mu == 0 ~ "," ~ sigma^2 == 1),
         expression(mu == 0 ~ "," ~ sigma^2 == 4),
         expression(mu == 0 ~ "," ~ sigma^2 == 0.25),
         expression(mu == 2 ~ "," ~ sigma^2 == 1)
       ),
       col = c("purple", "lightblue", "green", "pink"),
       lty = c(1, 2, 3, 1),
       lwd = 3,
       bty = "n")

The Normal distribution is defined from \(-\infty\) to \(\infty\).

Assumptions of a Normal Distribution

In order to use a Normal distribution, the following assumptions must be met:

  • The observations are independent from one another.

  • The dependent variable must be continuous.

  • The sample errors are normally distributed.

  • The sample size is sufficiently large.

Regarding the last assumption, the sample size needed can vary, but n > 30 is a common rule of thumb. Its importance comes from one of the most fundamental theorems in statistics, the Central Limit Theorem (CLT). This theorem states that the sampling distribution of the sample mean approaches a Normal distribution as the sample size grows. This holds regardless of the shape of the population distribution, provided the population has a finite variance. Typically n = 30 is used as the marker of a sufficiently large sample, but a highly skewed population would likely need a much larger sample size for the sampling distribution to be approximately Normal.

The visualization below shows how the CLT applies to a sampling distribution. Sample means are simulated from a right-skewed Exponential population for three sample sizes: n = 5, n = 30, and n = 100. As the sample size increases, the sampling distribution of the mean looks more and more like a Normal distribution.

set.seed(123)

n_values <- c(5, 30, 100)
par(mfrow = c(1, 3))  # one row, one panel per sample size

for (n in n_values) {
  sample_means <- replicate(1000, mean(rexp(n)))
  
  hist(sample_means,
       probability = TRUE,
       breaks = 30,
       col = "lavender",
       border = "purple",
       main = paste("Sampling Distribution (n =", n, ")"),
       xlab = "Sample Mean")
  
  lines(density(sample_means), lwd = 2, col = "purple")
  curve(dnorm(x, mean(sample_means), sd(sample_means)),
        add = TRUE, col = "darkmagenta", lwd = 2, lty = 2)
}
par(mfrow = c(1, 1)) 

As we can see, as the sample size, n, increases, the sampling distribution of the sample mean becomes more like a Normal distribution, even though the population distribution is not Normal.
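
As a rough numerical companion to the histograms, the sketch below compares the simulated standard deviation of the sample means with the value \(\sigma/\sqrt{n}\) from the Normal section. It assumes the same setup as the simulation above: `rexp(n)` with its default rate of 1, for which the population standard deviation is 1. The name `clt_check` is introduced here for illustration.

```r
set.seed(123)

n_values <- c(5, 30, 100)

# For each n, simulate 1000 sample means from Exp(1) and compare their
# spread with the theoretical standard error sigma / sqrt(n), sigma = 1
clt_check <- sapply(n_values, function(n) {
  sample_means <- replicate(1000, mean(rexp(n)))
  c(n = n,
    simulated_sd = sd(sample_means),
    theoretical_sd = 1 / sqrt(n))
})

t(round(clt_check, 3))
```

The two columns should agree closely, and both shrink as n grows, which is why the histograms become narrower from left to right.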

The t-Distribution

The t-distribution is a continuous, unimodal distribution with a symmetric, bell-shaped curve. Its curve looks similar to that of a Normal distribution, but it has a flatter peak and thicker tails. A t-distribution is used in place of a Normal distribution when the population standard deviation is unknown and must be estimated from the sample. The difference matters most when the sample size is small, typically n < 30. So, while a Normal-based calculation assumes a known population standard deviation, a t-based calculation does not.

From a random sample \(X_1, X_2, \ldots, X_n\), let the sample mean be \(\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i\). Here the population standard deviation is unknown, so we use a t-distribution. The statistic is \(t = \frac{\bar{X} - \mu}{S / \sqrt{n}}\), where \(S\) is the sample standard deviation, the square root of the sample variance \(S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2\).
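
A minimal sketch of this calculation in R is shown below. The helper `t_stat`, the simulated data, and the hypothesized mean of 5 are illustrative assumptions. R's `sd()` uses the same \(n-1\) denominator as the sample variance formula above, so the hand-computed value should match the statistic reported by `t.test()`.

```r
# Hypothetical helper: one-sample t statistic for a hypothesized mean mu0
t_stat <- function(obs, mu0) {
  n <- length(obs)
  (mean(obs) - mu0) / (sd(obs) / sqrt(n))  # sd() divides by n - 1
}

set.seed(1)
obs <- rnorm(12, mean = 5, sd = 2)  # assumed illustrative sample, n = 12

t_manual  <- t_stat(obs, mu0 = 5)
t_builtin <- unname(t.test(obs, mu = 5)$statistic)

all.equal(t_manual, t_builtin)
```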

An important characteristic of a t-distribution is its degrees of freedom. The degrees of freedom, often represented as v, equal n-1, where n is the sample size. This is the single parameter of a t-distribution, and it is fixed once the sample size is known. The visualization below shows t-distribution curves for various degrees of freedom. A Normal curve is included for comparison.

x <- seq(-4, 4, length = 1000)
y_df2  <- dt(x, df = 2)
y_df5  <- dt(x, df = 5)
y_df30 <- dt(x, df = 30)
y_df50 <- dt(x, df = 50)
y_norm <- dnorm(x)
y_max <- max(y_df2, y_df5, y_df30, y_df50, y_norm)

plot(x, y_df2,
     type = "l",
     lwd = 2,
     col = "purple",
     ylim = c(0, y_max),
     main = "t-Distributions with Different Degrees of Freedom",
     ylab = "Density",
     xlab = "x")

lines(x, y_df5,  lwd = 2, col = "lightblue")
lines(x, y_df30, lwd = 2, col = "green")
lines(x, y_df50, lwd = 2, col = "brown", lty = 2)
lines(x, y_norm, lwd = 2, col = "pink")

legend("topright",
       legend = c("df = 2", "df = 5", "df = 30", "df = 50", "Normal"),
       col = c("purple", "lightblue", "green", "brown", "pink"),
       lty = c(1, 1, 1, 2, 1),
       lwd = 2,
       bty = "n")

The t-distribution is defined from \(-\infty\) to \(\infty\).

As seen in the visualization above, a t-distribution with fewer degrees of freedom has a flatter peak and heavier tails, while one with more degrees of freedom has a higher peak and thinner tails. The visualization also shows that as the degrees of freedom increase, the curve gets closer and closer to the Normal curve.
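
The same convergence can be seen numerically. The sketch below, using an assumed set of degrees of freedom and an illustrative 97.5% quantile, tabulates the t critical value and the two-sided tail probability beyond 2 for each case; the Normal values are printed last for comparison.

```r
dfs <- c(2, 5, 30, 50, 1000)  # assumed illustrative degrees of freedom

crit <- data.frame(
  df            = dfs,
  t_crit        = qt(0.975, df = dfs),   # 97.5% quantile of each t
  tail_beyond_2 = 2 * pt(-2, df = dfs)   # P(|T| > 2)
)
crit

# Standard Normal reference values
qnorm(0.975)
2 * pnorm(-2)
```

The critical values decrease toward the Normal value as the degrees of freedom grow, reflecting the thinner tails described above.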

Assumptions of a t-Distribution

In order to use a t-distribution, the following assumptions must be met:

  • The observations are independent from one another.

  • The dependent variable must be continuous.

  • The data follow an approximately Normal distribution.

  • The population standard deviation is unknown.

The Chi-Square Distribution

Another commonly used distribution is the Chi-Square distribution. The Chi-Square distribution is a special case of the Gamma distribution and can also be represented as a sum of squared Standard Normal random variables. If \(Z_1, Z_2, \ldots, Z_k \stackrel{iid}{\sim} N(0,1)\), then \(\sum_{i=1}^{k} Z_i^2 \sim \chi^2_k\), where k is the degrees of freedom. For a sample from a Normal population, the exact distribution of the scaled sample variance is \(\frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2\).

The shape of a Chi-Square distribution depends on its degrees of freedom, just as the shape of a t-distribution does. For the scaled sample variance above, the degrees of freedom are again n-1, where n is the sample size. One major difference from the Normal distribution and the t-distribution is that the Chi-Square distribution is asymmetric and does not follow the symmetric, bell-shaped curve of the previous two distributions.
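
Both representations can be checked by simulation. The sketch below is illustrative only: the values k = 4, n = 10, \(\mu = 50\), and \(\sigma = 3\) are assumptions chosen for the example. It relies on the standard fact, not stated above, that a Chi-Square distribution with k degrees of freedom has mean k.

```r
set.seed(42)

# (1) Sum of k squared standard Normal draws
k <- 4
sum_sq <- replicate(5000, sum(rnorm(k)^2))

# (2) Scaled sample variance (n - 1) * S^2 / sigma^2 from Normal samples
n <- 10
sigma <- 3
scaled_var <- replicate(5000, (n - 1) * var(rnorm(n, mean = 50, sd = sigma)) / sigma^2)

# Simulated means should be close to the degrees of freedom
c(simulated = mean(sum_sq),     df = k)
c(simulated = mean(scaled_var), df = n - 1)

# Simulated quartiles against the Chi-Square quartiles
probs <- c(0.25, 0.5, 0.75)
rbind(simulated   = quantile(scaled_var, probs),
      theoretical = qchisq(probs, df = n - 1))
```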

The visualization below shows the Chi-Square distributions for various degrees of freedom values.

x <- seq(0, 30, length = 1000)

y_df2  <- dchisq(x, df = 2)
y_df5  <- dchisq(x, df = 5)
y_df15 <- dchisq(x, df = 15)

y_max <- max(y_df2, y_df5, y_df15)

plot(x, y_df2,
     type = "l",
     lwd = 2,
     col = "purple",
     ylim = c(0, y_max),
     main = "Chi-Square Distributions with Different Degrees of Freedom",
     xlab = "x",
     ylab = "Density")

lines(x, y_df5,  lwd = 2, col = "lightblue")
lines(x, y_df15, lwd = 2, col = "green")

legend("topright",
       legend = c("df = 2", "df = 5", "df = 15"),
       col = c("purple", "lightblue", "green"),
       lwd = 2,
       bty = "n")

The Chi-Square distribution is defined from 0 to \(\infty\).

In the visualization above, it can be seen that as the degrees of freedom increase, the distribution curve becomes flatter and wider and shifts to the right. The smaller the degrees of freedom, the higher the peak of the distribution and the more quickly it flattens out. For these smaller degrees of freedom, the distribution is strongly right-skewed and asymmetric. As the degrees of freedom become larger and larger, the skew becomes less pronounced, and for very large degrees of freedom the shape gets closer and closer to that of a Normal distribution.
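
One way to quantify this drift toward a Normal shape is to compare each Chi-Square density with a Normal density that has the same mean and variance. The sketch below assumes the standard facts that a Chi-Square distribution with k degrees of freedom has mean k and variance 2k, which are not derived in this essay, and uses an illustrative measure of discrepancy (`max_gap`, the largest density gap relative to the Chi-Square peak).

```r
k_values <- c(5, 15, 100)  # assumed illustrative degrees of freedom

# Largest absolute gap between the chi-square(k) density and the matching
# N(k, 2k) density, expressed relative to the chi-square peak height
max_gap <- sapply(k_values, function(k) {
  grid <- seq(0, k + 6 * sqrt(2 * k), length.out = 2000)
  chi  <- dchisq(grid, df = k)
  nrm  <- dnorm(grid, mean = k, sd = sqrt(2 * k))
  max(abs(chi - nrm)) / max(chi)
})

data.frame(df = k_values, relative_gap = round(max_gap, 3))
```

The relative gap shrinks as the degrees of freedom increase, in line with the decreasing skew seen in the plot.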

Assumptions of a Chi-Square Distribution

In order to use a Chi-Square distribution, the following assumptions must be met:

  • The observations are independent from one another.

  • The sample size is sufficiently large.

  • The population follows a Normal distribution.

  • The Chi-Square statistic is formed from squared deviations.

The F Distribution

One other important distribution is the F distribution. The F distribution is the sampling distribution for the ratio of two independent sample variances. The F distribution is useful for comparing variances and is used in ANOVA (analysis of variance) and regression modeling.

For an F distribution, we have two independent random samples, \(\{X_1, X_2, \cdots, X_{n_1}\} \overset{i.i.d}{\sim} N(\mu_1, \sigma_1^2) \quad\text{ and } \quad \{Y_1, Y_2, \cdots, Y_{n_2}\} \overset{i.i.d}{\sim} N(\mu_2, \sigma_2^2)\). From these two samples, we have the sample variance of each, \(S_1^2 = \frac{1}{n_1-1} \sum_{i=1}^{n_1} (X_i - \bar{X})^2 \quad\text{ and } \quad S_2^2 = \frac{1}{n_2-1} \sum_{i=1}^{n_2} (Y_i - \bar{Y})^2\). The F statistic is then \(F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{n_1-1, n_2-1}\), and this distribution is exact for Normal samples. Thus, the F statistic is a ratio of the two sample variances, each scaled by its population variance. Once again, the F distribution depends on degrees of freedom, here one value for each sample: \(n_1-1\) for sample one and \(n_2-1\) for sample two, where \(n_1\) and \(n_2\) are the two sample sizes. These numerator and denominator degrees of freedom are the two parameters of an F distribution.

The following visualization shows F distributions for various values of the two degrees of freedom. Each F distribution has two parameters, df1 and df2, which are these two degrees of freedom values. This visualization shows how the F distribution changes in appearance based upon these two parameters.
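
The F statistic can be computed by hand and compared with R's built-in `var.test()`. In the sketch below, the sample sizes, means, and the common standard deviation of 2 are illustrative assumptions. Because both populations are given the same variance, the \(\sigma_1^2\) and \(\sigma_2^2\) terms cancel and the statistic is simply the ratio of sample variances.

```r
set.seed(7)

n1 <- 12
n2 <- 16
sample1 <- rnorm(n1, mean = 10, sd = 2)  # assumed illustrative data
sample2 <- rnorm(n2, mean = 4,  sd = 2)  # same sd, so sigma terms cancel

# Ratio of sample variances, numerator df = n1 - 1, denominator df = n2 - 1
f_manual  <- var(sample1) / var(sample2)
f_builtin <- unname(var.test(sample1, sample2)$statistic)
all.equal(f_manual, f_builtin)

# Upper-tail probability under F with (n1 - 1, n2 - 1) degrees of freedom
pf(f_manual, df1 = n1 - 1, df2 = n2 - 1, lower.tail = FALSE)
```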

x <- seq(0, 5, length = 1000)

y_2_10  <- df(x, df1 = 2,  df2 = 10)
y_5_10  <- df(x, df1 = 5,  df2 = 10)
y_10_10 <- df(x, df1 = 10, df2 = 10)

y_max <- max(y_2_10, y_5_10, y_10_10)

plot(x, y_2_10,
     type = "l",
     lwd = 2,
     col = "purple",
     ylim = c(0, y_max),
     main = "F Distributions with Different Degrees of Freedom",
     xlab = "x",
     ylab = "Density")

lines(x, y_5_10,  lwd = 2, col = "lightblue")
lines(x, y_10_10, lwd = 2, col = "green")

legend("topright",
       legend = c("df1 = 2, df2 = 10",
                  "df1 = 5, df2 = 10",
                  "df1 = 10, df2 = 10"),
       col = c("purple", "lightblue", "green"),
       lwd = 2,
       bty = "n")

The F distribution is defined from 0 to \(\infty\).

As seen above, the curves of an F distribution are skewed right and asymmetric. They do not follow the symmetric, bell-shaped curve of the Normal distribution; in fact, they look quite similar to the Chi-Square curves. As with the Chi-Square distribution, smaller degrees of freedom give steeper, more skewed curves, while larger degrees of freedom give curves with less skew. This similarity is not a coincidence: the F distribution can be defined from two independent Chi-Square random variables. If the samples are independent and normally distributed, then \(\frac{(n_1 - 1)S_1^2}{\sigma_1^2} \sim \chi^2_{n_1 - 1}\) and \(\frac{(n_2 - 1)S_2^2}{\sigma_2^2} \sim \chi^2_{n_2 - 1}\). Dividing each by its degrees of freedom and taking the ratio gives \(\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{n_1 - 1,\; n_2 - 1}\), which reduces to \(\frac{S_1^2}{S_2^2} \sim F_{n_1 - 1,\; n_2 - 1}\) when \(\sigma_1^2 = \sigma_2^2\). Overall, the F distribution is a natural tool for comparing the variances of two independent Normal populations.

Assumptions of an F Distribution

In order to use an F distribution, the following assumptions must be met:
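
This construction can be illustrated by simulation. The sketch below draws two independent Chi-Square samples, divides each by its degrees of freedom, and compares the quartiles of the resulting ratio with those of the F distribution. The degrees of freedom 5 and 10 and the number of draws are assumptions chosen for the example.

```r
set.seed(2024)

d1 <- 5    # numerator degrees of freedom (assumed)
d2 <- 10   # denominator degrees of freedom (assumed)

u <- rchisq(20000, df = d1)
v <- rchisq(20000, df = d2)

# Ratio of independent chi-squares, each divided by its degrees of freedom
f_sim <- (u / d1) / (v / d2)

probs <- c(0.25, 0.5, 0.75)
rbind(simulated   = quantile(f_sim, probs),
      theoretical = qf(probs, df1 = d1, df2 = d2))
```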

  • The observations are independent from one another.

  • Each of the two samples is drawn from a Normally distributed population.

  • The samples are drawn independently from one another.

  • The populations should have homogeneity of variances (equal variances).

Connections Between These Distributions

All four of these distributions, the Normal distribution, the t-distribution, the Chi-Square distribution, and the F distribution, are incredibly important to statistical analysis and random sampling.

These distributions connect to one another in several ways. For instance, the sum of k squared independent Standard Normal random variables follows a Chi-Square distribution with k degrees of freedom. In the special case of a single variable, if \(Z \sim N(0,1)\), then \(Z^2 \sim \chi^2_1\). Additionally, an F statistic is the ratio of two independent Chi-Square random variables, each divided by its degrees of freedom. So, while all four distributions are distinct from one another, they also overlap in several ways and show clear connections with each other.

The table below compares key features of the four distributions: a brief description of the shape of each distribution, its parameters, and its support, that is, the set of values the distribution can take on.
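
These relationships can be checked exactly through quantile functions, without simulation. The sketch below verifies that squaring a two-sided Normal quantile gives the corresponding \(\chi^2_1\) quantile. It also checks a related standard result that is not stated above and is included here as an assumption for illustration: the square of a t random variable with v degrees of freedom follows an F distribution with 1 and v degrees of freedom. The values p = 0.95 and v = 12 are arbitrary.

```r
p <- 0.95   # assumed illustrative probability level
v <- 12     # assumed illustrative degrees of freedom

# Z^2 ~ chi-square(1): P(Z^2 <= c) = p  <=>  sqrt(c) = qnorm((1 + p) / 2)
z_sq  <- qnorm((1 + p) / 2)^2
chi_q <- qchisq(p, df = 1)
all.equal(z_sq, chi_q)

# T^2 ~ F(1, v): the same argument applied to the symmetric t-distribution
t_sq <- qt((1 + p) / 2, df = v)^2
f_q  <- qf(p, df1 = 1, df2 = v)
all.equal(t_sq, f_q)
```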

library(knitr)  # provides kable()

dist_table <- data.frame(
  Distribution = c("Normal", "t", "Chi-square", "F"),
  Shape = c("Symmetric", "Symmetric, thicker tails", "Right-skewed", "Right-skewed"),
  Support = c("$(-\\infty, \\infty)$", "$(-\\infty, \\infty)$", "$(0, \\infty)$", "$(0, \\infty)$"),
  Parameters = c("$\\mu, \\sigma^2$", "df(v)", "df", "df$_1$, df$_2$")
)

kable(dist_table, format = "html", escape = FALSE)

Distribution | Shape | Support | Parameters
Normal | Symmetric | \((-\infty, \infty)\) | \(\mu, \sigma^2\)
t | Symmetric, thicker tails | \((-\infty, \infty)\) | df(v)
Chi-square | Right-skewed | \((0, \infty)\) | df
F | Right-skewed | \((0, \infty)\) | df\(_1\), df\(_2\)

Altogether, the Normal distribution, the t-distribution, the Chi-Square distribution, and the F distribution are all important statistical tools for describing sampling distributions and making inferences about the overall population. The four distributions differ in their shapes and in the parameters that define them. However, they are also closely connected: each of the other three can be obtained from Normal random variables through transformations such as squaring, summing, and taking ratios, which is why they work together so naturally in statistical procedures.

