1 Confidence Intervals
2 Point Estimators
- 2.1 Normal Distribution.
- 2.2 Gamma Distribution.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

1 Confidence Intervals

The goal of this problem is to understand Confidence Intervals using simulations. Recall that saying that you have a 95% confidence interval means that if under repeated sampling, 95% of the confidence interval around the sample means will contain the population mean.

1.1 Setting up some basic functions.

Write a function called ‘margin_error_95’ that takes input a vector of sample values, and outputs the margin of error for a 95% confidence interval.

margin_error_95 <- function(sample_vals, sig){
    n <- length(sample_vals)
    mar_err <- 1.96*(sig/sqrt(n))
}

Write a function ‘CI_95’ that takes input a vector of sample values, and outputs the 95% confidence interval for this sample. You can use the ‘margin_error_95’ function.

CI_95 <- function(sample_vals, sig){
    error <- margin_error_95(sample_vals, sig)
    CI <- mean(sample_vals) + c(-1, 1)*error
}

Write a function ‘sample_size_95’ that take input ‘width’ and ‘std_dev’ the desired width of the 95% confidence interval and the population std dev, and outputs the sample size required to get that that width for a 95% confidence interval.

sample_size_95 <- function(width, std_dev){
    # width = 2 * error, hence
    error <- width / 2
    n <- (1.96 * std_dev / error)^2
    return(n)
}

Write a function ‘width_95’ that take input ‘n’ and ‘std_dev’ the sample size and the population std dev, and outputs the width of the 95% confidence interval corresponding to this sample size.

width_95 <- function(n, std_dev){
    error <- 1.96 * std_dev / sqrt(n)
    return(2 * error)
}

mean_inside_CI <- function(samp, pop_mu, pop_sig){
    upper <- mean(samp) + margin_error_95(samp, pop_sig)
    lower <- mean(samp) - margin_error_95(samp, pop_sig)
    return(between(pop_mu, lower, upper))
}

1.2 Simulations for Confidence Intervals

We will simulate the calculation of a 95% confidence interval for a large number of samples of a fixed size from a population distribution whose mean we know.

We will then count the number of intervals that contain the population mean.

Use the replicate function to run a simulation of the following experiment:

set seed equal to 2020

calculate 100000 samples of size 50 from the population distribution: exponential with rate parameter \(\lambda = 2\).

use the mean_inside_CI function to check if the mean lies inside the confidence interval, and store this result in the vector ‘CI_95_exp’.

caculate the mean of ‘CI_95_exp’ to get the expected number of confidence intervals that contain the population mean.

set.seed(2020)
B <- 10000
n <- 50

CI_95_exp <- replicate(B, 
    {samp <- rexp(n, 2)  # generate a sample
    # check is mean is inside 95% CI
    mean_inside_CI(samp, pop_mu = 1 / 2 , pop_sig = 1 / 2)
    }
)

mean( CI_95_exp ) #calculate mean CI_95_exp to get empirical expected values

## [1] 0.9498

Use the replicate function to run a simulation of the following experiment:

set seed equal to 2020

calculate 100000 samples of size 50 from the population distribution: Gamma with shape parameter \(\alpha = 2\) and scale parameter \(\beta = 0.4\)

use the mean_inside_CI function to check if the mean lies inside the confidence interval, and store this result in the vector ‘CI_95_gamma’.

caculate the mean of ‘CI_95_gamma’ to get the expected number of confidence intervals that contain the population mean.

set.seed(2020)
B <- 10000
n <- 50

# gamma distribution parameters
alpha = 2
beta = 0.4

CI_95_gamma <- replicate(B,
    {samp <- rgamma(n, shape = alpha, rate = beta)
    #need to calculate mean and sd for gamma distribution from its parameters
    mean_inside_CI(samp, pop_mu = alpha / beta, pop_sig = sqrt(alpha) / beta ) 
    }
)

mean(CI_95_gamma)

## [1] 0.9483

1.3 Visualizing confidence intervals.

In this section we will visualize confidence intervals. Recall that “calculating a 95% confidence interval” means: under repeated sampling 95% of the confidence intervals around the sample mean will contain the population mean.

The following function “plot_CI_95” plots 95%-confidence intervals about 100 sample means coming from samples of size 30 and a normal population distribution with mean 5 and standard deviation 1.2.

The function takes input a random seed value and outputs a bunch of confidence intervals.

The confidence intervals that do not contain the population mean are plotted in red.

The sample means are plotted as black dots in the middle of the confidence interval.

The population mean is identified by a blue vertical line (note that the red confidence intervals do not intersect this line).

Get familiar with the function “plot_CI_95” by evaluating it for a few different seed values.

plot_CI_95 <- function(seed){
    B <- 100
    n <- 30
    mu <- 5
    sig <- 1.2
    
    set.seed(seed)
    # extract upper bound of CI's
    
    x_1 <- replicate(B,
                     {samp <- rnorm(n, mean = mu, sd = sig )
                     max(CI_95(samp, sig))
                     }
    )
    
    #extract lower bound of CI's
    
    set.seed(seed)
    
    x_0 <- replicate(B,
                     {samp <- rnorm(n, mean = mu, sd = sig )
                     min(CI_95(samp, sig))
                     }
    )
    
    set.seed(seed)
    
    means <- replicate(B, mean(rnorm(n, mean = mu, sd = sig)))
    
    plot(means, 1:B, pch = 20,
         xlim = c(mu - sig, mu + sig),
         ylim = c(0,B+1),
         xlab = "sample means",
         ylab = "index of the CI",
         main = paste(B, "Confidence intervals")
    )
    
    for (i in 1:B){
        if(between(mu, x_0[i], x_1[i])){
            segments(x_0[i], i, x_1[i], i, lwd = 2) #plot CI's that contain the mean in black
        } else {
            segments(x_0[i], i, x_1[i], i, col = "red", lwd = 2) #plot CI's that don't contain the mean in red
        }
    }
    
    abline(v=mu, col = "blue") #plot a vertical line at the population mean
}

Now use the “Plot_CI_95” function to plot at six different values of the random seed. What do you observe?

seeds <- c(2020, 42, 911, 14, 101, 34) #choose six different seed values

par(mfrow = c(1,2))

for (i in seeds) {
    plot_CI_95(i)
}

2 Point Estimators

Suppose we are given a population distribution with mean \(\mu\) and variance \(\sigma^2\). Consider the following two estimators for the population variance \(\sigma^2\):

\[ \theta_1 = \frac{\sum_{i=1}^n(x_i - \overline{x})^2}{n-1}.\]

\[ \theta_2 = \frac{\sum_{i=1}^n(x_i - \overline{x})^2}{n}.\]

Write two functions “theta_1” and “theta_2” that take input a sample of data values and the population mean and outputs the values of the function \(\theta_1\) and \(\theta_2\) respectively.

theta_1 <- function(samp){
    return(sum((samp-mean(samp))^2)/(length(samp)-1))
}

theta_2 <- function(samp){
    return(sum((samp - mean(samp))^2) / (length(samp)))
}

2.1 Normal Distribution.

Suppose we are working with samples of size 10 coming from a normal distribution \(N(\mu = 6, \sigma = 1.6)\). Run a simulation to calculate the empirical expected values of the estimators \(\theta_1\) and \(\theta_2\).

B <- 100000
sig <- 1.6

theta_1_values <- replicate(B,
                            {samp <- rnorm(10, mean = 6, sd = 1.6)
                            theta_1(samp)
                            }
)

theta_2_values <- replicate(B,
                            {samp <- rnorm(10, mean = 6, sd = 1.6)
                            theta_2(samp)
                            }
)

Print the empirical expected values for the two estimators.

mean(theta_1_values)

## [1] 2.554226

mean(theta_2_values)

## [1] 2.308749

sig^2

## [1] 2.56

Use the empirical expected values of \(\theta_1\) and \(\theta_2\) to calculate the empirical bias of these estimators.

bias_theta_1 <- abs(mean(theta_1_values) - sig^2)

bias_theta_2 <- abs(mean(theta_2_values) - sig^2)

bias_theta_1

## [1] 0.00577384

bias_theta_2

## [1] 0.2512512

What conclusions can you draw about the unbiasedness of these estimators?

YOUR ANSWER \(\theta_1\) is unbiased, while \(\theta_2\) is biased.

Now, suppose we are working with samples of size 500 coming from a normal distribution \(N(\mu = 6, \sigma^2 = 1.6)\). Run a simulation to calculate the empirical expected values of the estimators \(\theta_1\) and \(\theta_2\).

theta_1_values <- replicate(B,
                            {samp <- rnorm(500, mean=6, sd = sig)
                            theta_1(samp)
                            }
)

theta_2_values <- replicate(B,
                            {samp <- rnorm(500, mean=6, sd = sig)
                            theta_2(samp)
                            }
)

mean(theta_1_values) #empirical expected value of theta_1

## [1] 2.560173

mean(theta_2_values)#empirical expected value of theta_2

## [1] 2.554193

sig^2 #theoretical value of pop variance

## [1] 2.56

Use the empirical expected values of \(\theta_1\) and \(\theta_2\) to calculate the empirical bias of these estimators.

bias_theta_1 <- abs(mean(theta_1_values) - sig^2)
bias_theta_2 <- abs(mean(theta_2_values) - sig^2)

bias_theta_1 #bias for theta_1

## [1] 0.0001725795

bias_theta_2 #bias for theta_2

## [1] 0.005807232

What conclusions can you draw about the unbiasedness of these estimators when sample size is large?

Large sizes reduce the bias of \(\theta_2\) estimator to the point of it being insignificant. Still, we can see that \(theta_1\) is lass biased than \(theta_2\).

2.2 Gamma Distribution.

Suppose we are working with samples of size 10 coming from a Gamma distribution with shape parameter \(\alpha = 2\) and scale parameter \(\beta = 0.5\). Run a simulation to calculate the empirical expected values of the estimators \(\theta_1\) and \(\theta_2\).

alpha = 2
beta = 0.5

theta_1_values <- replicate(B,
                            {samp <- rgamma(10, shape = alpha, rate = beta)
                            theta_1(samp)
                            }
)

theta_2_values <- replicate(B,
                            {samp <- rgamma(10, shape = alpha, rate = beta)
                            theta_2(samp)
                            }
)

Use the empirical expected values of \(\theta_1\) and \(\theta_2\) to calculate the empirical bias of these estimators.

bias_theta_1 <- abs(mean(theta_1_values) - alpha / beta^2)
bias_theta_2 <- abs(mean(theta_2_values) - alpha / beta^2)

bias_theta_1 #bias for theta_1

## [1] 0.007121135

bias_theta_2 #bias for theta_2

## [1] 0.8013671

What conclusions can you draw about the unbiasedness of these estimators?

YOUR ANSWER \(\theta_2\) is much more biased for Gamma distribution. \(\theta_1\) is still unbiased.

Now, suppose we are working with samples of size 500 coming from a Gamma distribution with shape parameter \(\alpha = 2\) and scale parameter \(\beta = 0.5\). Run a simulation to calulate the empirical expected values of the estimators \(\theta_1\) and \(\theta_2\).

alpha = 2
beta = 0.5

theta_1_values <- replicate(B,
                            {samp <- rgamma(500, shape = alpha, rate = beta)
                            theta_1(samp)
                            }
)

theta_2_values <- replicate(B,
                            {samp <- rgamma(500, shape = alpha, rate = beta)
                            theta_2(samp)
                            }
)

Use the empirical expected values of \(\theta_1\) and \(\theta_2\) to calculate the empirical bias of these estimators.

bias_theta_1 <- abs(mean(theta_1_values) - alpha / beta^2)
bias_theta_2 <- abs(mean(theta_2_values) - alpha / beta^2)

bias_theta_1 #bias for theta_1

## [1] 0.004755056

bias_theta_2 #bias for theta_2

## [1] 0.01556244

What conclusions can you draw about the unbiasedness of these estimators when sample size is large?

Conclusion are similar to the case of normal distribution. Larger size reduces the bias of both estimators, and the bias of \(\theta_1\) is much smaller than that of \(\theta_2\).

Confidence Intervals and Point Estimates

Hana Goffe/ UID 117542202

1 Confidence Intervals

1.1 Setting up some basic functions.

1.2 Simulations for Confidence Intervals

1.3 Visualizing confidence intervals.

2 Point Estimators

2.1 Normal Distribution.

2.2 Gamma Distribution.