Statistical Inference Course Project: Hypothesis Testing

Overview

This project uses randomly generated data from an exponential distribution to investigate whether sample means generated in R conform to the results implied by the Central Limit Theorem. Specifically, the report applies inferential statistics to the simulated data to determine whether one can reject the null hypothesis that the distribution of sample means from an exponential distribution is consistent with a normal distribution.

Simulations used

The data behind the 1,000 sample means are generated by the R function rexp(n, lambda), which returns n observations from an exponential distribution where lambda is the rate parameter. The mean of an exponential distribution is 1/lambda, and its standard deviation is also 1/lambda.

The lambda value is set to 0.2 for all of the simulations, so a single observation has a mean and standard deviation of 1/lambda, or 5. The mean of a random sample of 40 such exponential random variables therefore has a theoretical mean of 5, a variance of 5^2/40, or 0.625, and a standard deviation of sqrt(0.625), which is about 0.791.
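
The arithmetic above can be checked directly in R. This is a minimal sketch; the variable names (pop.mean, mean.var, and so on) are chosen here for illustration and do not appear elsewhere in the report.

```r
lambda <- 0.2   # rate parameter used throughout the report
n <- 40         # observations per sample

pop.mean <- 1 / lambda        # mean of a single exponential observation
pop.sd <- 1 / lambda          # standard deviation of a single observation
mean.var <- pop.sd^2 / n      # variance of the mean of n observations
mean.sd <- sqrt(mean.var)     # standard deviation of the sample mean

round(c(mean = pop.mean, variance = mean.var, sd = mean.sd), 3)
```

The three rounded values should agree with the 5, 0.625, and 0.791 quoted in the text.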

A total of 40,000 observations will be generated to create a vector of 1,000 sample means.

set.seed(151)
group.size <- 40
num.sim <- 1000

# Generates 1,000 rows of 40 exponential random variables
sample.matrix <- matrix(rexp(num.sim*group.size,.2),num.sim) 
sample.mean.vec <- (apply(sample.matrix,1,mean))
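
As a sanity check on the layout of the simulated data, the sketch below rebuilds the matrix and confirms that each of the 1,000 rows holds one sample of 40 observations. It also compares apply(..., 1, mean) with rowMeans(), a base R shortcut that is assumed here to be an acceptable substitute; the names means.apply and means.rowmeans are illustrative only.

```r
set.seed(151)
group.size <- 40
num.sim <- 1000

# One row per simulated sample, one column per observation within the sample
sample.matrix <- matrix(rexp(num.sim * group.size, rate = 0.2), nrow = num.sim)
dim(sample.matrix)

means.apply <- apply(sample.matrix, 1, mean)   # row means, as in the report
means.rowmeans <- rowMeans(sample.matrix)      # equivalent base R shortcut
all.equal(means.apply, means.rowmeans)
```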

One of the tests in this project compares the sample means with observations from a normal distribution with the same mean and standard deviation. These observations are generated with the R function rnorm(m, mean, sd), where m is the number of observations drawn from a normal distribution with mean 5 and standard deviation sqrt(0.625). In this case, m is 1,000, which matches the number of sample means generated from the exponential distribution.

set.seed(181)
sample.norm.vec <- rnorm(1000,mean=5,sd=sqrt(0.625))

Comparing the sample and theoretical mean

The mean of the simulated sample means can be compared with the mean of a normal distribution with the same expected mean and standard deviation by testing the null hypothesis that the sample mean is not significantly different from the mean of that normal distribution.

This is consistent with the Central Limit Theorem, which states that as the sample size tends toward infinity, the distribution of the sample mean, suitably standardized, approaches the normal distribution.
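
The effect of sample size can be illustrated with a short simulation. This is only a sketch: the sample sizes 5, 40, and 200, the seed, and the helper function skewness() are assumptions made for illustration and are not part of the report's analysis. If the Central Limit Theorem applies, the observed standard deviation should track 5/sqrt(n) and the skewness should shrink toward 0 as n grows.

```r
set.seed(1)             # arbitrary seed, used only to make this sketch repeatable
lambda <- 0.2
num.sim <- 1000
sizes <- c(5, 40, 200)  # hypothetical sample sizes for illustration

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

clt.summary <- sapply(sizes, function(n) {
  means <- rowMeans(matrix(rexp(num.sim * n, rate = lambda), nrow = num.sim))
  c(n = n,
    observed.sd = sd(means),
    theoretical.sd = (1 / lambda) / sqrt(n),
    skewness = skewness(means))
})

round(t(clt.summary), 3)
```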

Below is a histogram of the 1,000 simulated sample means of 40 exponentials with a lambda value of 0.2. Superimposed on the histogram are a density curve and four vertical lines. The red line marks the expected mean of 5, and the blue line marks the sample mean.

The two heavier green lines mark the endpoints of the interval expected to contain 95% of the sample means, which extends 1.96 standard deviations below and above the expected mean, for a range of [3.45, 6.55].
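
The endpoints can be reproduced from the normal quantile function rather than the rounded constant 1.96. The sketch below also counts how many of the simulated sample means fall inside the range; the names interval and coverage are illustrative, and the data are regenerated with the report's seed so the block runs on its own.

```r
mu <- 5             # expected mean of the sample means
se <- sqrt(0.625)   # standard deviation of a mean of 40 observations

# Endpoints at the 2.5% and 97.5% normal quantiles around the expected mean
interval <- mu + c(-1, 1) * qnorm(0.975) * se
round(interval, 2)

# Share of simulated sample means that land inside the interval
set.seed(151)
sample.mean.vec <- rowMeans(matrix(rexp(1000 * 40, rate = 0.2), nrow = 1000))
coverage <- mean(sample.mean.vec >= interval[1] & sample.mean.vec <= interval[2])
coverage
```

A coverage value near 0.95 would be in line with the normal approximation.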

myhist <- hist(sample.mean.vec, breaks=30,
               xlab="Exponential sample mean", ylab="Frequency",
               main="Histogram of sample means with density overlay")
# Ratio of counts to density: number of observations times the bin width
multiplier <- length(sample.mean.vec) * diff(myhist$breaks)[1]
mydensity <- density(sample.mean.vec)
mydensity$y <- mydensity$y * multiplier
lines(mydensity) # Density curve rescaled to the count scale of the histogram
abline(v=5, col="red", lwd=2)
abline(v=3.45, col="green", lwd=3)
abline(v=6.55, col="green", lwd=3)
abline(v=mean(sample.mean.vec), col="blue", lwd=2)

paste("Based on the sample mean (blue vertical line) of ", round(mean(sample.mean.vec), digits=3),", the z-score for the sample mean would be z = (",
round(mean(sample.mean.vec), digits=3)," - 5)/", round(sqrt(0.625),digits=3), " = ",
round((mean(sample.mean.vec) - 5)/sqrt(0.625), digits=3),", which has the quantile value of ",round(pnorm(mean(sample.mean.vec),5,sqrt(0.625)),digits=3) ,", so one would not reject the hypothesis that the sample mean is equal to 5.", sep="")

[1] "Based on the sample mean (blue vertical line) of 4.965, the z-score for the sample mean would be z = (4.965 - 5)/0.791 = -0.044, which has the quantile value of 0.483, so one would not reject the hypothesis that the sample mean is equal to 5."
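
The z-score above measures the grand mean against the spread of a single sample mean. A stricter check, offered here as an assumption rather than as the report's method, uses the standard error of an average of 1,000 sample means, sqrt(0.625/1000), and compares the result with a one-sample t-test from t.test(). The names grand.mean, se.grand, z.grand, and p.two.sided are illustrative.

```r
set.seed(151)
sample.mean.vec <- rowMeans(matrix(rexp(1000 * 40, rate = 0.2), nrow = 1000))

grand.mean <- mean(sample.mean.vec)
se.grand <- sqrt(0.625 / length(sample.mean.vec))  # standard error of the grand mean
z.grand <- (grand.mean - 5) / se.grand
p.two.sided <- 2 * pnorm(-abs(z.grand))            # two-sided p-value for the z-score

# One-sample t-test of the same hypothesis, using the sample standard deviation
t.result <- t.test(sample.mean.vec, mu = 5)

c(z = z.grand, p.z = p.two.sided, p.t = t.result$p.value)
```

A two-sided p-value above 0.05 from either approach would again mean that the hypothesis of a mean equal to 5 is not rejected.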

Comparing the theoretical and sample variance of the distribution

The F-test, available in R as var.test(x, y), takes as input two vectors of values from normal distributions (x and y) and tests the null hypothesis that the variances of the two distributions are equal.

In this case, the “x” vector consists of 1,000 randomly generated values from a normal distribution with a mean of 5 and a standard deviation of sqrt(0.625), and the “y” vector is the 1,000 randomly generated sample means from the exponential distribution, which have the same theoretical mean and standard deviation and are assumed to be approximately normally distributed according to the Central Limit Theorem.

This test outputs the ratio of the variances of the two vectors, a 95% confidence interval for that ratio, and a p-value. If the confidence interval contains 1 (equivalently, if the p-value is above 0.05), then one cannot reject the null hypothesis that the variances are equal.

var.test.results <- var.test(sample.norm.vec,sample.mean.vec)
var.test.results


    F test to compare two variances

data:  sample.norm.vec and sample.mean.vec
F = 1.028, num df = 999, denom df = 999, p-value = 0.6629
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.9080053 1.1637861
sample estimates:
ratio of variances 
          1.027971

paste("The test gives an F-statistic (variance ratio) of ", round(var.test.results$statistic,digits=4), ", and the 95% confidence interval for the ratio, [", round(var.test.results$conf.int[1],digits=3),", ",round(var.test.results$conf.int[2], digits=3),"], contains 1.", sep="")

[1] "The test gives an F-statistic (variance ratio) of 1.028, and the 95% confidence interval for the ratio, [0.908, 1.164], contains 1."

paste("This information, along with the p-value of ", round(var.test.results$p.value,digits=4),", implies that the null hypothesis of equal variances is not rejected.", sep="" )

[1] "This information, along with the p-value of 0.6629, implies that the null hypothesis of equal variances is not rejected."
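
The quantities reported by var.test() can be reproduced by hand, which makes clear that the estimated ratio always lies inside its own confidence interval and that the decision depends on whether the interval contains 1. This is a sketch under the assumption of a two-sided test at the 95% level; the names x, y, f.ratio, ci, and p.value are illustrative, and the data are regenerated with the report's seeds.

```r
set.seed(181)
x <- rnorm(1000, mean = 5, sd = sqrt(0.625))                     # simulated normal values
set.seed(151)
y <- rowMeans(matrix(rexp(1000 * 40, rate = 0.2), nrow = 1000))  # exponential sample means

f.ratio <- var(x) / var(y)   # F-statistic: ratio of the two sample variances
df.x <- length(x) - 1
df.y <- length(y) - 1

# 95% confidence interval for the true ratio of variances
ci <- c(f.ratio / qf(0.975, df.x, df.y), f.ratio / qf(0.025, df.x, df.y))

# Two-sided p-value from the F distribution
p.lower <- pf(f.ratio, df.x, df.y)
p.value <- 2 * min(p.lower, 1 - p.lower)

reference <- var.test(x, y)
contains.one <- ci[1] < 1 && 1 < ci[2]   # TRUE means equal variances are not rejected
```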

Comparison of the simulated distribution and the normal distribution

The two tests on the mean and variance of the simulated sample means both implied that one could not reject the null hypotheses that the sample mean and sample variance were consistent with those of a normal distribution with mean 5 and standard deviation sqrt(0.625).

Another way to show that the distribution of the sample mean is consistent with a normal distribution is to use a Quantile-Quantile plot (Q-Q plot) to visually compare the quantiles of the 1,000 sample means from an exponential distribution with the theoretical quantiles of a normal distribution. The reference line is drawn through the quartiles of the 1,000 randomly generated values from a normal distribution. The result of the Q-Q plot function [qqnorm()] is below.

qqnorm(sample.mean.vec, main = "Normal Q-Q plot of exponential sample means",
       xlab = "Theoretical quantiles from normal distribution", ylab = "Sample mean quantiles")
# Reference line through the quartiles of the simulated normal values
qqline(sample.norm.vec)

The plotted points fall close to the solid line that represents the normal distribution, so it appears that the sample means are approximately normally distributed.
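
The points on a normal Q-Q plot can also be computed without drawing anything, which gives a rough numeric companion to the visual check. The sketch below pairs the sorted sample means with standard normal quantiles at the plotting positions from ppoints(), and reports their correlation. The correlation summary and the names theoretical.q, sample.q, and qq.cor are additions made for illustration, not part of the report.

```r
set.seed(151)
sample.mean.vec <- rowMeans(matrix(rexp(1000 * 40, rate = 0.2), nrow = 1000))

n <- length(sample.mean.vec)
theoretical.q <- qnorm(ppoints(n))   # standard normal quantiles at the plotting positions
sample.q <- sort(sample.mean.vec)    # ordered sample means

# qqnorm() with plot.it = FALSE returns the same coordinates without plotting
qq <- qqnorm(sample.mean.vec, plot.it = FALSE)
all.equal(sort(qq$x), theoretical.q)

# A correlation close to 1 indicates that the points lie near a straight line
qq.cor <- cor(theoretical.q, sample.q)
qq.cor
```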

Todd Curtis

February 11, 2015