Statistical Inference Course Project: First part

This is the project for the statistical inference class. In it, you will use simulation to explore inference and do some simple inferential data analysis. The project consists of two parts:

A simulation exercise.
Basic inferential data analysis.

Each pdf report should be no more than 3 pages with 3 pages of supporting appendix material if needed.

Instructions

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponential(0.2)s. You should.

Show where the distribution is centered at and compare it to the theoretical center of the distribution.
Show how variable it is and compare it to the theoretical variance of the distribution.
Show that the distribution is approximately normal.

Note that for point 3, focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.

Load the required packages

First, load the packages knitr and ggplot2

library(knitr) 
library(ggplot2)

Set the global options

The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also also 1/lambda. Set lambda = 0.2 for all of the simulations. In this simulation, you will investigate the distribution of averages of 40 exponential(0.2)s. Note that you will need to do a thousand or so simulated averages of 40 exponentials.

lambda = 0.2           # lambda for all simulations
n = 40                 # number of exponential random variables
nsims = 1:10000        # number of simulated averages
set.seed(901)
means <- data.frame(x=sapply(nsims,function(x) {mean(rexp(n,lambda))})) # necessary means
head(means,2)        # show the results

##          x
## 1 3.981570
## 2 4.148078

Answers to questions through analysis of the simulations

Analysys

Evaluate the stats needed to answer the questions.

mean(means$x)            # Center of the distribution of sample means

## [1] 5.001601

sd(means$x)              # Standard deviation of sample means

## [1] 0.7882242

(1/lambda)/(sqrt(40))    # Expected standard deviation

## [1] 0.7905694

var(means$x)             # Variance of our simulations

## [1] 0.6212973

((1/lambda)/sqrt(40))^2  # Expected variance

## [1] 0.625

hmeans <- rowMeans(means) # Center of the distribution of sample means

Questions and answers

1. Show where the distribution is centered at and compare it to the theoretical center of the distribution.

Answer: The Center of the distribution:5.0016014. Expected center: \(\lambda^{-1}\) = 5. The mean of the means of the exponential of 10000 simulations of 40 exponential(0.2)s is 5.0016014, which is very close to the expected mean of 1/0.2 = 5.0.

The distribution of sample means is centered at 5.0016014 and the theoretical center of the distribution is \(\lambda^{-1}\) = 5. The variance of sample means is 0.6212973 where the theoretical variance of the distribution is \(\sigma^2 / n = 1/(\lambda^2 n) = 1/(0.04 \times 40)\) = 0.625.

The averages of samples follow normal distribution for the central limit theorem. The first figure shows the density computed using the histogram and the normal density plotted with theoretical mean and variance values. The quantiles comparison figure suggests the normality.

2. Show how variable it is and compare it to the theoretical variance of the distribution.

Answer: The standard deviation of 0.7882242 is also close to the expected standard deviation of 0.7905694. Expected standard deviation using Central Limit Theorem. Likewise, the variance and expected variance are 0.6212973 and 0.625, respectively.

3. Show that the distribution is approximately normal.

Answer: The results of simulation are into a new vector called simulation, where the distribution is close to standard normal (theoretical). The plot of distribution (histogram and its estimated density) in black colour and the actual density of the standard normal in blue colour. If you can see, the red curve is close to the blue.

### System info My analysis has been originally created and run in RStudio Version 0.98.994 under OS X 10.9.5. Time and date about of the report generation

## [1] "Sat Nov 22 20:05:15 2014"

Appendix

histograms and plot codes

Histogram

    hmeans <- rowMeans(means)
    hist(hmeans, col = I('#00e6ff'), breaks=40, prob=TRUE,
    main="Distribution of averages of samples,
    from exponential distribution",
    xlab="")
    # density of the averages of samples
    lines(density(hmeans))
    # theoretical center of distribution
    abline(v=1/lambda, col="red")
    # theoretical density of the averages of samples
    xfit <- seq(min(hmeans), max(hmeans), length=100)
    yfit <- dnorm(xfit, mean=1/lambda, sd=(1/lambda/sqrt(n)))
    lines(xfit, yfit, pch=19, col="blue", lty=2)
    # add legend
    legend('topright', c("simulation", "theoretical"), lty=c(1,2), col=c("black", "blue"))

Code for qqplot

    qqnorm(hmeans, col=I('red')); qqline(hmeans, col=I('blue'));
    legend('topleft', c("qqnorm", "qqline"),col=c("red", "blue"))

Plot third figure

Code of my first aproach

    ggplot(data = means, aes(x = x)) +
    geom_histogram(aes(y=..density..), fill = I('#00e6ff'),
    binwidth = 0.20, color = I('blue')) +
    stat_function(fun = dnorm, arg = list(mean = 5, sd = sd(means$x)))