Statistical Inference Course Project Part 1

Simulation

The code in the snippet below runs the simulations: 1000 samples of 40 exponential draws each, with rate lambda = 0.2, and the mean of each sample.

#load the plotting library
library(ggplot2)

#variables that control the simulation
no_sim <- 1000
lambda <- 0.2
n <- 40

set.seed(234)

#Create a matrix with 1000 rows (one per simulation) and 40 columns (one per exponential draw), then take the mean of each row
sim_matrix <- matrix(rexp(no_sim * n, rate=lambda), no_sim, n)
sim_mean <- rowMeans(sim_matrix)
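
As a quick sanity check on the layout described above, the sketch below rebuilds the matrix with the same settings and confirms that each row holds one simulated sample of 40 draws, so that `rowMeans` returns 1000 averages. The `apply`-based cross-check is added here for illustration only and is not part of the original analysis.

```r
# Same settings as the simulation above
no_sim <- 1000
lambda <- 0.2
n <- 40

set.seed(234)
sim_matrix <- matrix(rexp(no_sim * n, rate = lambda), no_sim, n)
sim_mean <- rowMeans(sim_matrix)

dim(sim_matrix)    # rows = simulations, columns = draws per simulation
length(sim_mean)   # one mean per simulation

# rowMeans should agree with averaging each row explicitly
all.equal(sim_mean, apply(sim_matrix, 1, mean))
```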

A histogram of the 1000 simulated means gives a first look at their distribution:

hist(sim_mean, col = "red")

Sample Mean Comparison

The mean of the simulated sample means and the theoretical mean are calculated below:

mean_data <- mean(sim_mean)
theory_mean <- 1/lambda

The distribution of the simulated sample means is centered at 5.0015729, while the theoretical mean for lambda = 0.2 is 1/lambda = 5. The mean of the simulated data is therefore very close to the theoretical mean of the exponential distribution.
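
One way to judge whether "very close" is close enough is to compare the gap with the Monte Carlo standard error of the estimated mean. The sketch below is an illustration under the same simulation settings; the names `mc_se` and `gap` are introduced here and do not appear in the original analysis.

```r
no_sim <- 1000
lambda <- 0.2
n <- 40

set.seed(234)
sim_mean <- rowMeans(matrix(rexp(no_sim * n, rate = lambda), no_sim, n))

mean_data <- mean(sim_mean)
theory_mean <- 1 / lambda

# Standard error of the average of no_sim simulated means
mc_se <- sd(sim_mean) / sqrt(no_sim)
gap <- mean_data - theory_mean

cat("gap:", round(gap, 4), " Monte Carlo SE:", round(mc_se, 4), "\n")
```

Using the variance reported in this document (0.6631504), the standard error would be about sqrt(0.6631504 / 1000), roughly 0.026, so a gap of about 0.0016 is well within simulation noise.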

Variance Comparison

The variance of the simulated sample means and the theoretical variance of the mean of 40 draws are calculated below:

actual_var <- var(sim_mean)
theory_var <- (1/lambda)^2/n

The variance of the simulated sample means is 0.6631504, while the theoretical variance is 0.625, a difference of about 6%. The two values are again quite close to each other.
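
For reference, the theoretical value follows from the variance of an average of independent draws. The derivation below is a standard result, written with the symbols used in the code (lambda and n); the notation for the individual draws and their average is introduced here.

```latex
\operatorname{Var}(\bar{X}_n)
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{1}{\lambda^2 n}
  = \frac{(1/0.2)^2}{40}
  = \frac{25}{40}
  = 0.625
```

This uses the fact that each exponential draw has variance 1/lambda^2, which is what `(1/lambda)^2/n` computes.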

Approximately Normal Distribution

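To check that the distribution of the sample means is approximately normal, we use the following three steps:

1. Plot the density of the sample means and see how closely it resembles a normal distribution.
2. Compare the intervals implied by the simulated mean and variance with those implied by the theoretical values.
3. Draw a Q-Q plot of the sample quantiles against normal quantiles.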

Step 1

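#histogram of the sample means on the density scale, with a kernel density estimate overlaid in blue
plotdata <- data.frame(sim_mean)
m <- ggplot(plotdata, aes(x = sim_mean))
m <- m + geom_histogram(aes(y = after_stat(density)), colour = "black",
                        fill = "red")
m + geom_density(colour = "blue", linewidth = 1)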

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The figure above shows that the histogram of the sample means is roughly bell-shaped and centered near 5, so it can be adequately approximated by a normal distribution.
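
To compare the histogram against an actual normal curve rather than a smoothed version of the data, one option is to overlay the density implied by the central limit theorem. The sketch below assumes a reasonably recent ggplot2 (for `after_stat`); the names `clt_mean` and `clt_sd` are introduced here for illustration.

```r
library(ggplot2)

no_sim <- 1000
lambda <- 0.2
n <- 40

set.seed(234)
sim_mean <- rowMeans(matrix(rexp(no_sim * n, rate = lambda), no_sim, n))

# Normal approximation: mean 1/lambda, standard deviation (1/lambda)/sqrt(n)
clt_mean <- 1 / lambda
clt_sd <- (1 / lambda) / sqrt(n)

p <- ggplot(data.frame(sim_mean), aes(x = sim_mean)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 colour = "black", fill = "red") +
  stat_function(fun = dnorm, args = list(mean = clt_mean, sd = clt_sd),
                colour = "blue")
print(p)
```

If the normal approximation is adequate, the blue curve should track the tops of the histogram bars, with small departures in the right tail where the exponential skew lingers.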

Step 2

The sections above showed that the mean and variance of the simulated sample means closely match the theoretical values. Let's also compare the 95% intervals, which are calculated below:

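#sd(sim_mean) and sqrt(theory_var) are already standard errors of a mean of n draws,
#so they must not be divided by sqrt(n) a second time
actual_conf_interval <- round(mean(sim_mean) + c(-1, 1) * 1.96 * sd(sim_mean), 3)
theory_conf_interval <- round(theory_mean + c(-1, 1) * 1.96 * sqrt(theory_var), 3)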
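
A simple check on such an interval is to count how many simulated means fall inside it. The sketch below assumes the interval is meant to cover about 95% of sample means, that is, 1/lambda plus or minus 1.96 standard errors, where the standard error of a mean of n draws is (1/lambda)/sqrt(n). The names `theory_se`, `interval`, and `coverage` are introduced here for illustration.

```r
no_sim <- 1000
lambda <- 0.2
n <- 40

set.seed(234)
sim_mean <- rowMeans(matrix(rexp(no_sim * n, rate = lambda), no_sim, n))

theory_mean <- 1 / lambda
theory_se <- (1 / lambda) / sqrt(n)   # standard deviation of a mean of n draws

# Interval expected to contain roughly 95% of the sample means
interval <- theory_mean + c(-1, 1) * 1.96 * theory_se

# Share of simulated means that actually land inside it
coverage <- mean(sim_mean >= interval[1] & sim_mean <= interval[2])

round(interval, 3)
coverage
```

A coverage near 0.95 supports the normal approximation; dividing the standard error by sqrt(n) again would shrink the interval to a width that covers far fewer of the simulated means.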

Step 3

qqnorm(sim_mean)
qqline(sim_mean)
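
The Q-Q plot can also be read numerically by putting a few sample quantiles next to the matching normal quantiles. The sketch below is an illustrative companion to the plot, not part of the original analysis; the probability levels in `probs` are an arbitrary choice made here.

```r
no_sim <- 1000
lambda <- 0.2
n <- 40

set.seed(234)
sim_mean <- rowMeans(matrix(rexp(no_sim * n, rate = lambda), no_sim, n))

probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)

# Quantiles of the simulated means
sample_q <- quantile(sim_mean, probs)

# Quantiles of the normal distribution suggested by the central limit theorem
normal_q <- qnorm(probs, mean = 1 / lambda, sd = (1 / lambda) / sqrt(n))

round(cbind(sample_q, normal_q, diff = sample_q - normal_q), 3)
```

Small differences across the table correspond to points lying close to the reference line drawn by `qqline`; larger differences at the extremes would show up as the tails bending away from that line.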