This project is part of the course Statistical Inference by Coursera and consists of two parts:
The simulation exercise investigates the exponential distribution in R and compares it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations.
The exercise investigates the distribution of 40 exponentials, doing 1000 simulations. The goals are:
First we need to set the parameters. The function set.seed is used to facilitate the replication of this project.
library(ggplot2)
set.seed(123)
lambda <- 0.2 #rate parameter lambda
n <- 40 #number of exponentials
n_sim <- 1000 #number of simulations conducted
Then we can proceed with the simulations. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter (set to 0.2). The 1000x40 random exponentials generated are organised in a matrix of 1000 rows (simulations) and 40 columns (size of each simulation). After that we can find the mean of each simulation.
# Generate random exponentials
sim <- rexp(n*n_sim,lambda)
# Organize the random exponentials in a matrix
data_sim <- matrix(sim, ncol = n)
# Mean of each row (simulation)
sim_mean <- apply(data_sim,1,mean)
Then we can compare sample mean and theoretical mean.
# Mean of each row (simulation)
sim_mean <- apply(data_sim,1,mean)
# Sample mean
sample_mean <- mean(sim_mean)
sample_mean
## [1] 5.011911
# Theoretical mean
theo_mean <- 1/lambda
theo_mean
## [1] 5
The following plot shows the distribution of the simulations. We can notice how theoretical mean and sample mean are almost at the same level (red and green vertical lines).
ggplot(as.data.frame(sim_mean),aes(sim_mean))+
geom_histogram(fill='blue',binwidth = 0.2)+
ggtitle('Distribution of the simulations')+
geom_vline(xintercept = sample_mean,col='red')+
geom_vline(xintercept = theo_mean,col='green')+
xlab('Mean')+ylab('Frequency')
Here we compare the sample variance with the theoretical variance of the distribution. From the results we can notice that the two values are close.
# Sample variance
sample_var <- var(sim_mean)
sample_var
## [1] 0.6088292
# Theoretical variance
theo_var <- ((1/lambda)^2)/n
theo_var
## [1] 0.625
The final step of this part is to show that the distribution is approximately normal. This can be done plotting the histogram of the simulations and comparing it with a normal distribution curve with mean=theo_mean and var=theo_var. From the chart we can conclude that the distribution is approximately normal. The normal density curve is plotted using the function stat_function from the ggplot library. The y axis is expressed in density and not frequency.
ggplot(as.data.frame(sim_mean),aes(sim_mean))+
geom_histogram(aes(y=..density..),fill='blue',binwidth = 0.2)+
stat_function(fun=dnorm,color='red',args=list(mean=theo_mean,sd=sqrt(theo_var)),lwd=1.3)+
ggtitle('Distribution of simulations and Normal Curve')+
xlab('Mean')+ylab('Density')
The analysis conducted is summarised in the following points.
The second part focuses on the analysis of the ToothGrowth data in the R datasets package.The steps followed are:
Load the ToothGrowth data and perform some basic exploratory data analyses
data("ToothGrowth")
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
A basic summary is provided using summary and str function.
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
We can observe that the unique values in the column supp are VC and OJ.
unique(ToothGrowth$supp)
## [1] VC OJ
## Levels: OJ VC
We can also notice that the unique values in the column dose are 0.5, 1 and 2.
unique(ToothGrowth$dose)
## [1] 0.5 1.0 2.0
First we can compare the tooth growth observing the following chart where the data is grouped by supp and dose. The chart is built using the boxplot function from ggplot.
ToothGrowth$dose <- as.factor(ToothGrowth$dose) #convert dose in factor for plotting
ggplot(data=ToothGrowth,aes(dose,len))+geom_boxplot(aes(fill=dose))+
facet_grid(~supp)+ggtitle('Tooth Growth')
From the chart we can highlight two points.
In the following chart we can observe all the single points composing the data set.
ggplot(data = ToothGrowth, aes(dose,len,color=supp))+geom_point()+
ggtitle('Tooth Growth')
We can also calculate the mean of the two groups of supplements.
# Mean for OJ
mean_oj <- mean(ToothGrowth$len[ToothGrowth$supp=='OJ'])
mean_oj
## [1] 20.66333
# Mean for VC
mean_vc <- mean(ToothGrowth$len[ToothGrowth$supp=='VC'])
mean_vc
## [1] 16.96333
Using confidence intervals and hypothesis tests it is possible to compare tooth growth by supp and dose.
First we compare tooth growth by supplement. Using a t.test we verify if the average length of teeth for the two supplement types are significantly different.
# t test OJ and VC
t.test(len~supp, data = ToothGrowth)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
From this test we can see that, with a 0.05 significance level, the lengths for the groups OJ and VC are not significantly different. Into practice the type of supplement does not seem to influence the tooth growth.
Then we can compare the three dosage levels. We conduct a t test to see if the average length of the teeth for the dosage groups are significantly different.
#t test for dosage 0.5 and 1.0
tooth_05 <- ToothGrowth$len[ToothGrowth$dose==0.5]
tooth_10 <- ToothGrowth$len[ToothGrowth$dose==1]
tooth_20 <- ToothGrowth$len[ToothGrowth$dose==2]
t.test(tooth_05,tooth_10)
##
## Welch Two Sample t-test
##
## data: tooth_05 and tooth_10
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.983781 -6.276219
## sample estimates:
## mean of x mean of y
## 10.605 19.735
# t test for dosage 0.5 and 2.0
t.test(tooth_05,tooth_20)
##
## Welch Two Sample t-test
##
## data: tooth_05 and tooth_20
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -18.15617 -12.83383
## sample estimates:
## mean of x mean of y
## 10.605 26.100
# t test for dosage 1.0 and 2.0
t.test(tooth_10,tooth_20)
##
## Welch Two Sample t-test
##
## data: tooth_10 and tooth_20
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.996481 -3.733519
## sample estimates:
## mean of x mean of y
## 19.735 26.100
In all the three tests the null hypothesis (no difference in the average of the groups) is rejected. The p-value is always lower than 0.05 and the confidence interval dose not contain zero. According to these results, the dosage level can influence the tooth length. A higher dosage is correlated to a higher tooth growth.
According to the analysis conducted, the results can be summarised in the following points.