This project is part of the course Statistical Inference by Coursera and consists of two parts:

a simulation exercise
basic inferential data analysis

1. Simulation Exercise

Overview

The simulation exercise investigates the exponential distribution in R and compares it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations.

The exercise investigates the distribution of 40 exponentials, doing 1000 simulations. The goals are:

to show the sample mean and compare it to the theoretical mean of the distribution
to show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution
to show that the distribution is approximately normal.

1.1 Simulations

First we need to set the parameters. The function set.seed is used to facilitate the replication of this project.

library(ggplot2)
set.seed(123)
lambda <- 0.2   #rate parameter lambda
n <- 40       #number of exponentials
n_sim <- 1000   #number of simulations conducted

Then we can proceed with the simulations. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter (set to 0.2). The 1000x40 random exponentials generated are organised in a matrix of 1000 rows (simulations) and 40 columns (size of each simulation). After that we can find the mean of each simulation.

# Generate random exponentials
sim <- rexp(n*n_sim,lambda)       

# Organize the random exponentials in a matrix
data_sim <- matrix(sim, ncol = n)   

# Mean of each row (simulation)
sim_mean <- apply(data_sim,1,mean)

1.2 Sample Mean and Theoretical Mean

Then we can compare sample mean and theoretical mean.

# Mean of each row (simulation)
sim_mean <- apply(data_sim,1,mean)

# Sample mean
sample_mean <- mean(sim_mean)
sample_mean

## [1] 5.011911

# Theoretical mean
theo_mean <- 1/lambda
theo_mean

## [1] 5

The following plot shows the distribution of the simulations. We can notice how theoretical mean and sample mean are almost at the same level (red and green vertical lines).

ggplot(as.data.frame(sim_mean),aes(sim_mean))+
    geom_histogram(fill='blue',binwidth = 0.2)+
    ggtitle('Distribution of the simulations')+
    geom_vline(xintercept = sample_mean,col='red')+
    geom_vline(xintercept = theo_mean,col='green')+
    xlab('Mean')+ylab('Frequency')

1.3 Sample Variance and Theoretical Variance

Here we compare the sample variance with the theoretical variance of the distribution. From the results we can notice that the two values are close.

# Sample variance
sample_var <- var(sim_mean)
sample_var

## [1] 0.6088292

# Theoretical variance
theo_var <- ((1/lambda)^2)/n
theo_var

## [1] 0.625

1.4 Distribution approximately Normal

The final step of this part is to show that the distribution is approximately normal. This can be done plotting the histogram of the simulations and comparing it with a normal distribution curve with mean=theo_mean and var=theo_var. From the chart we can conclude that the distribution is approximately normal. The normal density curve is plotted using the function stat_function from the ggplot library. The y axis is expressed in density and not frequency.

ggplot(as.data.frame(sim_mean),aes(sim_mean))+
    geom_histogram(aes(y=..density..),fill='blue',binwidth = 0.2)+
    stat_function(fun=dnorm,color='red',args=list(mean=theo_mean,sd=sqrt(theo_var)),lwd=1.3)+
    ggtitle('Distribution of simulations and Normal Curve')+
    xlab('Mean')+ylab('Density')

1.5 Conclusions

The analysis conducted is summarised in the following points.

Sample mean and theoretical mean are close to each other.
Sample variance is close to the theoretical variance of the distribution.
The distribution of 1000 averages of 40 exponentials is close to the normal distributions (with mean and variance given as 1/lambda). This is due to the Central Limit Theorem (CLM) and the hight number of simulations (1000) conducted.

2. Basic Inferential Data Analysis

The second part focuses on the analysis of the ToothGrowth data in the R datasets package.The steps followed are:

Load the ToothGrowth data and perform some basic exploratory data analysis.
Provide a basic summary of the data.
Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose.
State your conclusions and the assumptions needed for your conclusions.

2.1 Data loading and basic exploratory data analyses

Load the ToothGrowth data and perform some basic exploratory data analyses

data("ToothGrowth")
head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

2.2 Basic summary of the data

A basic summary is provided using summary and str function.

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

We can observe that the unique values in the column supp are VC and OJ.

unique(ToothGrowth$supp)

## [1] VC OJ
## Levels: OJ VC

We can also notice that the unique values in the column dose are 0.5, 1 and 2.

unique(ToothGrowth$dose)

## [1] 0.5 1.0 2.0

2.3 Compare tooth growth by supp and dose

First we can compare the tooth growth observing the following chart where the data is grouped by supp and dose. The chart is built using the boxplot function from ggplot.

ToothGrowth$dose <- as.factor(ToothGrowth$dose)  #convert dose in factor for plotting
ggplot(data=ToothGrowth,aes(dose,len))+geom_boxplot(aes(fill=dose))+
    facet_grid(~supp)+ggtitle('Tooth Growth')

From the chart we can highlight two points.

The higher the dosage the higher the tooth length, this is valid for both types of supplement.
The difference between the types of supplement (OJ and VC) does not seem strong. At a lower dosage the tooth length is higher for OJ and in the group VC there seems to be a bigger difference among the different dosages.

In the following chart we can observe all the single points composing the data set.

ggplot(data = ToothGrowth, aes(dose,len,color=supp))+geom_point()+
    ggtitle('Tooth Growth')

We can also calculate the mean of the two groups of supplements.

# Mean for OJ
mean_oj <- mean(ToothGrowth$len[ToothGrowth$supp=='OJ'])
mean_oj

## [1] 20.66333

# Mean for VC
mean_vc <- mean(ToothGrowth$len[ToothGrowth$supp=='VC'])
mean_vc

## [1] 16.96333

Using confidence intervals and hypothesis tests it is possible to compare tooth growth by supp and dose.
First we compare tooth growth by supplement. Using a t.test we verify if the average length of teeth for the two supplement types are significantly different.

# t test OJ and VC
t.test(len~supp, data = ToothGrowth)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

From this test we can see that, with a 0.05 significance level, the lengths for the groups OJ and VC are not significantly different. Into practice the type of supplement does not seem to influence the tooth growth.

Then we can compare the three dosage levels. We conduct a t test to see if the average length of the teeth for the dosage groups are significantly different.

#t test for dosage 0.5 and 1.0
tooth_05 <- ToothGrowth$len[ToothGrowth$dose==0.5]
tooth_10 <- ToothGrowth$len[ToothGrowth$dose==1]
tooth_20 <- ToothGrowth$len[ToothGrowth$dose==2]
t.test(tooth_05,tooth_10)

## 
##  Welch Two Sample t-test
## 
## data:  tooth_05 and tooth_10
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean of x mean of y 
##    10.605    19.735

# t test for dosage 0.5 and 2.0
t.test(tooth_05,tooth_20)

## 
##  Welch Two Sample t-test
## 
## data:  tooth_05 and tooth_20
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.15617 -12.83383
## sample estimates:
## mean of x mean of y 
##    10.605    26.100

# t test for dosage 1.0 and 2.0
t.test(tooth_10,tooth_20)

## 
##  Welch Two Sample t-test
## 
## data:  tooth_10 and tooth_20
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean of x mean of y 
##    19.735    26.100

In all the three tests the null hypothesis (no difference in the average of the groups) is rejected. The p-value is always lower than 0.05 and the confidence interval dose not contain zero. According to these results, the dosage level can influence the tooth length. A higher dosage is correlated to a higher tooth growth.

2.4 Conclusions

According to the analysis conducted, the results can be summarised in the following points.

Supplement type does not seem to have effects on the tooth growth.
Dosage level seems to be correlated with an increase in tooth growth.

Statistical Inference Project

Flavio Angeli