Question 9, Section 9.3

Problem Description:

How large must n be before \(S_n = X_1+X_2+...+X_n\) is approximately normal? This number is often surprisingly small. Let us explore this question with a computer simulation. Choose \(n\) numbers from \([0, 1]\) with probability density \(f(x)\), where \(n = 3, 6, 12, 20\), and \(f(x)\) is each of the densities in Exercise 7. Compute their sum \(S_n\), repeat this experiment 1000 times, and make up a bar graph of 20 bars of the results. How large must \(n\) be before you get a good fit?

Note that the distributions given in question 7 are as follows:

\(f(x) = 1\)
\(f(x) = 2x\)
\(f(x) = 3x^2\)

Solution

First, we import a few packages necessary to solve the problem:

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::cross()  masks pracma::cross()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

The code below initializes all the parameters that are needed to run the simulation, and creates three matrices to hold the sample sums generated by the distributions of (a), (b), and (c).

sims <- 1000
ns <- c(3, 6, 12, 20)
sum_dist_a <- matrix(0, nrow=sims, ncol=length(ns))
sum_dist_b <- matrix(0, nrow=sims, ncol=length(ns))
sum_dist_c <- matrix(0, nrow=sims, ncol=length(ns))

The double for loop below actually calculates the sample sums by randomly generating \(n\) random values \(r\) between 0 and 1 and then determining when the CDF of \(f(x)\) is equal to that value. In other words, we want to find \(x\) such that \(P_X(x) = r\). As an example, the CDF of the distribution from part (b) is: \(\int_0^x f(x)\, dx = \int_0^x 2x \, dx = x^2\). Thus, for each value random value \(r\), we want to determine \(r = x^2 \Rightarrow x = \sqrt{r}\). The sample sum is then determined by summing each of these \(x\) values. The loop does this below 1,000 times for each value of \(n\) given in the problem description, for each of the three distributions:

for(j in 1:length(ns)){
  for(i in 1:sims){
    rs <- runif(ns[j])
    dist_a <- rs
    dist_b <- sqrt(rs)
    dist_c <- nthroot(rs, 3)
    Sn_a <- sum(dist_a)
    Sn_b <- sum(dist_b)
    Sn_c <- sum(dist_c)
    sum_dist_a[i,j] <- Sn_a
    sum_dist_b[i,j] <- Sn_b
    sum_dist_c[i,j] <- Sn_c
  }
}

The cell below creates a function change_to_df that transforms the matrices into dataframes that can be more easily plotted:

change_to_df <- function(mymat){
  df <- data.frame(mymat)
  colnames(df) <- ns
  print(col)
  df <- pivot_longer(df, cols=colnames(df), names_to='n', values_to='sum')
  df$n <- as.integer(df$n)
  df <- df %>%
    arrange(n)
  df <- df %>%
    mutate(n = as.character(n))
}

The cells below create histograms to show the distribution of sample sums for each \(f(x)\):

plt_data <- change_to_df(sum_dist_a)

## function (x, as.factor = FALSE) 
## {
##     if (as.factor) {
##         labs <- colnames(x, do.NULL = FALSE, prefix = "")
##         res <- factor(.Internal(col(dim(x))), labels = labs)
##         dim(res) <- dim(x)
##         res
##     }
##     else .Internal(col(dim(x)))
## }
## <bytecode: 0x7fb858948f00>
## <environment: namespace:base>

ggplot(data=plt_data, aes(x=sum, y=..density.., fill=n, color=n)) + 
  geom_histogram(alpha=0.3, position='identity', bins=80) +
  labs(title= 'f(x)=1', x='Sample Sum Value', y='Density')

plt_data <- change_to_df(sum_dist_b)

## function (x, as.factor = FALSE) 
## {
##     if (as.factor) {
##         labs <- colnames(x, do.NULL = FALSE, prefix = "")
##         res <- factor(.Internal(col(dim(x))), labels = labs)
##         dim(res) <- dim(x)
##         res
##     }
##     else .Internal(col(dim(x)))
## }
## <bytecode: 0x7fb858948f00>
## <environment: namespace:base>

ggplot(data=plt_data, aes(x=sum, y=..density.., fill=n, color=n)) + 
  geom_histogram(alpha=0.3, position='identity', bins=80) +
  labs(title= 'f(x)=x', x='Sample Sum Value', y='Density')

plt_data <- change_to_df(sum_dist_c)

## function (x, as.factor = FALSE) 
## {
##     if (as.factor) {
##         labs <- colnames(x, do.NULL = FALSE, prefix = "")
##         res <- factor(.Internal(col(dim(x))), labels = labs)
##         dim(res) <- dim(x)
##         res
##     }
##     else .Internal(col(dim(x)))
## }
## <bytecode: 0x7fb858948f00>
## <environment: namespace:base>

ggplot(data=plt_data, aes(x=sum, y=..density.., fill=n, color=n)) + 
  geom_histogram(alpha=0.3, position='identity', bins=100) +
  labs(title= 'f(x)=x^2', x='Sample Sum Value', y='Density')

We can see that, as \(n\) increases, the distribution of sample sums becomes increasingly normal, and that it really doesn’t take a very large \(n\) for normal looking behavior to begin.

Data 605 Discussion - Week 9

William Jasmine

2023-03-16

Question 9, Section 9.3

Problem Description:

Solution