Simulation and Basic inferential data analysis project

Overview

In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. You should

Show the sample mean and compare it to the theoretical mean of the distribution.
Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
Show that the distribution is approximately normal.

Simulations

With 10000 simulations we will create a data set for comparison to theory. There are 40 observations per simulation and the exponential distribution function will be set to “rexp(40, 0.2)”.

The mean of exponential distribution is 1/λ with standard deviation = 1/λ. We will generate and then calculate the average of each sample.

library(ggplot2)
library(knitr)
sim <- 10000
lambda <- 0.2
n <- 40

sim_df <- matrix(rexp(n=sim*n, rate=lambda),sim,n)
sample_mean <- rowMeans(sim_df)
head(sample_mean)

## [1] 5.254612 5.122077 5.163010 3.996602 4.376479 4.315604

Sample Mean VS Theoretical Mean

The theoretical mean of the average of samples = 1/λ .The following shows that the average from sample means and the theoretical mean are closer when simulation are greater( sim = 10000).

actual_mean <- mean(sample_mean) 
theoretical_mean <- 1/ lambda

df <-data.frame("Mean"=c(actual_mean,theoretical_mean), 
                     row.names = c("Mean from the samples ","Theoretical mean"))

df

##                            Mean
## Mean from the samples  4.996755
## Theoretical mean       5.000000

We can see that simulation with sim= 10000 is closer to theoretical value of 5, but with a plot, specifically histogram can help us to see that distribution.

Sample_mean_df <- as.data.frame(sample_mean)
ggplot(Sample_mean_df,aes(sample_mean)) + geom_histogram(position = "identity", alpha=0.5,col="white")+ geom_vline(xintercept = theoretical_mean, colour="red",show.legend=TRUE) + geom_vline(xintercept = actual_mean, colour="blue", show.legend=TRUE) +ggtitle ("Histogram of the sample means with sim = 10,000 ")+xlab("Sample mean")+ylab("Density")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can not not observe so well in the graph, but giving some limits to x-axis like a zoom we can see the difference:

ggplot(Sample_mean_df,aes(sample_mean)) + geom_histogram(position = "identity", alpha=0.5,col="white")+ geom_vline(xintercept = theoretical_mean, colour="red",show.legend=TRUE) + geom_vline(xintercept = actual_mean, colour="blue", show.legend=TRUE) + xlim(c(4.5, 5.5)) +ggtitle ("Histogram of the sample means ")+xlab("Sample mean")+ylab("Density")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 5284 rows containing non-finite values (stat_bin).

Sample Variance VS Theoretical Variance

The theoretical variance = (1/λ)^2/n, and we will calculate the variance of sample means and compare it with the theoretical variance, resulting that they are very close in value.

actual_variance <- var(sample_mean) 
theoretical_variance <- (1/ lambda)^2 /n 
df2<-data.frame("Variance"=c(actual_variance, theoretical_variance), 
                     row.names = c("Variance from the sample ","Theoretical variance"))
df2

##                            Variance
## Variance from the sample  0.6101953
## Theoretical variance      0.6250000

Distribution

We will compare the sample mean distribution with the theoretical, according with the central limit theorem (CLT), the averages of samples follow normal distribution. So our simulation matches the normal distribution. Also we create a Normal Probability Plot of Residuals below to confirm the fact that the distribution of sample means matches the theoretical normal distribution:

ggplot(Sample_mean_df, aes(sample_mean))+
        geom_histogram(aes(y=..density..), alpha=.2, position="identity", col="white")+
        geom_density(colour="red", size=1, show.legend = TRUE)+
        stat_function(fun = dnorm, colour = "blue", args = list(mean = theoretical_mean, sd = sqrt(theoretical_variance)), show.legend = TRUE)+
        ggtitle ("Histogram of sample means with the fitting normal curve ")+
        xlab("Sample mean")+
        ylab("Density")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

 qqnorm(sample_mean, main ="Normal probability plot")
 qqline(sample_mean,col = "red", lwd= 2)

The distribution of averages is approximately normal according histogram and Q-Q Plot.

Project_P1