Instructions
This is the project for the statistical inference class. In it, you will use simulation to explore inference and do some simple inferential data analysis. The project consists of two parts:
A simulation exercise.
Basic inferential data analysis.
You will create a report to answer the questions. Given the nature of the series, ideally you’ll use knitr to create the reports and convert to a pdf. (I will post a very simple introduction to knitr). However, feel free to use whatever software that you would like to create your pdf.
Each pdf report should be no more than 3 pages with 3 pages of supporting appendix material if needed (code, figures, etcetera).
Problem
The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also also 1/lambda. Set lambda = 0.2 for all of the simulations. In this simulation, you will investigate the distribution of averages of 40 exponential(0.2)s. Note that you will need to do a thousand or so simulated averages of 40 exponentials.
Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponential(0.2)s. You should
Show where the distribution is centered at and compare it to the theoretical center of the distribution.
Show how variable it is and compare it to the theoretical variance of the distribution.
Show that the distribution is approximately normal.
Global Options
Be sure the knitr library is loaded as well as the ggplot2 library for the graphing later on.
Set echo to true to be sure all code is shown in the report.
One other note, to make this reproducible, meaning I get the same results each time it is run, the random seed needs to be set to a known value so the rexp function always generates the same array. In the following code I set it to 1.
setwd("~/Data/Statistical Inference Coure Project")
require(knitr)
## Loading required package: knitr
## Loading required package: knitr
require(ggplot2)
## Loading required package: ggplot2
## Loading required package: ggplot2
opts_chunk$set(echo=TRUE)
set.seed(1)
Set the variables as defined in the problem. number of values (n) = 40 lambda = 0.2 number of iterations, as least 1000. numsim=2000 #### Initialize the variables
n<-40
lambda<-0.2
numsim<-2000
Now Generate the data using rexp
Using the rexp function, along with the matrix function develop a dataset with the mean and lambda specified above
dataset<-matrix(rexp(n*numsim,lambda),numsim)
Answer the following questions
Question 1 - Show where the distribution is centered at and compare it to the theoretical center of the distribution.
Question 2 - Show how variable it is and compare it to the theoretical variance of the distribution.
Question 3 - Show that the distribution is approximately normal.
The theoretical mean was given in the problem as 1 / lambda or 1 / 0.2. The actual mean of the generated data can be calculated by using the apply & mean functions to obtain a mean for each row and then taking the mean of those numbers. Let’s get the Standard Deviation and Variance as well since we’ll need them later.
TheoryMean<-1/lambda
RowMeans<-apply(dataset,1,mean)
ActualMean<-mean(RowMeans)
TheorySD<-((1/lambda) * (1/sqrt(n)))
ActualSD<-sd(RowMeans)
TheoryVar<-TheorySD^2
ActualVar<-var(RowMeans)
To answer the questions, it is easy to look at the following table which shows the actual values compared to the theoretical values.
| Mean |
5 |
5.109 |
| Standard Deviation |
0.791 |
0.779 |
| Varaince |
0.625 |
0.607 |
Visually, the results look like this:
dfRowMeans<-data.frame(RowMeans) # convert to data.frame for ggplot
mp<-ggplot(dfRowMeans,aes(x=RowMeans))
mp<-mp+geom_histogram(binwidth = lambda,fill="green",color="black",aes(y = ..density..))
mp<-mp + labs(title="Density of 40 Numbers from Exponential Distribution", x="Mean of 40 Selections", y="Density")
mp<-mp + geom_vline(xintercept=ActualMean,size=1.0, color="black") # add a line for the actual mean
mp<-mp + stat_function(fun=dnorm,args=list(mean=ActualMean, sd=ActualSD),color = "blue", size = 1.0)
mp<-mp + geom_vline(xintercept=TheoryMean,size=1.0,color="yellow",linetype = "longdash")
mp<-mp + stat_function(fun=dnorm,args=list(mean=TheoryMean, sd=TheorySD),color = "red", size = 1.0)
mp

The theoretical mean is shown in the graph as a dotted yellow line, while the actual mean is shown by the black line. The normal curve formed by the the theoretical mean and standard deviation is shown in red. The curve formed by the actual mean and standard deviation is shown in blue. Finally, the density of the actual data is shown by the green bars.
To answer question 3, you can observe the that the Central Limit Theory is working here to make the actual data follow a normal curve by observing the shape of the actual data on the graph shown in blue as compared to the normal curve shown in red.
Summary
To answer question 1 - I havee demonstrated the actual mean compared to the theroetical mean. To Answer question 2 - I have demonstrated the actual standard deviation and variance compared to their theoretical values.
To answer question 3 - I graphed all the values above and was able to show that both the calculated curve is very close to the theoretical normal curve therefore proving the Central Limit Theory.