Statistical Inference: Assignment

Synopsis

In this project, the exponential distribution has been investigated in R and its behavior compared with the predictions of the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. lambda has been set to 0.2 for all of the simulations. The distribution of averages of 40 exponentials has been investigated over 1000 simulations.

In the second part, the ToothGrowth data from the R datasets package has been analyzed.

Two packages are needed, ggplot2 and datasets:

library(ggplot2)
library(datasets)

Part 1: Simulation Exercise Instructions

1. Notations and simulation

The following notation is used, with the values stated in the synopsis:

- lambda = 0.2
- n = 40 (number of exponentials per average)
- simulations = 1000
- mean = standard deviation = 1/lambda = 5

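lambda <- 0.2               # rate parameter
n <- 40                     # number of exponentials per average
simulations <- 1000         # number of simulated averages
theoretical_sd <- 1/lambda  # renamed from sd to avoid masking the sd() function
set.seed(20)                # needed for reproducibility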

Then simulations have been run:

rexp.simulations <- rexp(simulations*n, rate=lambda)
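As an aside, the 40,000 draws can be arranged so that each row holds one simulated sample of 40 exponentials; the row means are then 1000 averages of the kind studied below. This is only a minimal sketch and not the approach used in the rest of the report, which draws the averages in a separate loop. The names draws and row_means are hypothetical.

```r
lambda <- 0.2
n <- 40
simulations <- 1000
set.seed(20)

# One row per simulation, one column per exponential draw (hypothetical layout)
draws <- matrix(rexp(simulations * n, rate = lambda),
                nrow = simulations, ncol = n)

# Average of each simulated sample of 40 exponentials
row_means <- rowMeans(draws)

length(row_means)  # one average per simulation
mean(row_means)    # should be close to 1/lambda = 5
```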

2. Show the sample mean and compare it to the theoretical mean of the distribution

Both means have been calculated as below. The data frame preplotting and the base plot object graph have also been prepared for the final graph:

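sample_mean <- mean(rexp.simulations)  # mean of all 40,000 simulated draws
theoretical_mean <- 1/lambda
# Averages of 40 exponentials, one per simulation
new_data <- numeric(simulations)
for (i in seq_len(simulations)) new_data[i] <- mean(rexp(n, lambda))
# Reference lines (sample and theoretical means) for the final graph
preplotting <- data.frame(header=c("Sample mean", "Theoretical mean"), values=c(sample_mean, theoretical_mean))
graph <- ggplot(NULL, aes(x=new_data))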

Finally, we got:

- sample_mean: 4.963828
- theoretical_mean: 5

The two means are very close, which is not surprising since the sample mean is computed from 40,000 simulated draws (1000 simulations of 40 exponentials).

3. Show the sample variance and compare it to the theoretical variance of the distribution

Both variances have been calculated as below:

sample_variance <- sd(new_data)^2 #where sd is the standard deviation
theoretical_variance <- ((1/lambda)^2)/n

Finally, we got:

- sample_variance: 0.5955671
- theoretical_variance: 0.625

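The two variances are close but not identical: the sample variance of the 1000 averages is about 5% below the theoretical value, a difference that is consistent with simulation noise.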
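The theoretical value follows from the variance of a sample mean, which is the variance of a single draw divided by n, with a standard deviation of 1/lambda for one exponential. The short check below reproduces that arithmetic; sigma and theoretical_se are hypothetical names added for illustration.

```r
lambda <- 0.2
n <- 40

sigma <- 1 / lambda                  # standard deviation of one exponential: 5
theoretical_variance <- sigma^2 / n  # variance of the mean of n draws: 25 / 40
theoretical_se <- sigma / sqrt(n)    # standard error of the mean (hypothetical name)

theoretical_variance  # 0.625
theoretical_se        # about 0.79
```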

4. Show that the distribution is approximately normal

Following the ‘motivating example’ from the assignment, we can create a plot that shows the distribution of the averages together with both means:

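# Histogram of the averages on a density scale, with a kernel density estimate
# after_stat(density) replaces the deprecated ..density.. (needs ggplot2 3.3.0 or later)
final_graph_1 <- graph + geom_histogram(aes(y=after_stat(density)), color="green", fill="white", binwidth=0.1) + geom_density(color="blue")
# Vertical lines for the sample and theoretical means
final_graph_1 <- final_graph_1 + geom_vline(data=preplotting, aes(xintercept=values, linetype=header, color=header), show.legend=TRUE)
final_graph_1 <- final_graph_1 + theme(legend.title=element_blank()) + labs(title = "Sample and theoretical means comparison", x ="Data", y = "Density")
final_graph_1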

We can see that the density looks approximately normal. Increasing n would bring the distribution of the averages closer to a normal distribution, whereas increasing simulations would only make the histogram and density estimate smoother.
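To illustrate the role of n, the sketch below compares the skewness of simulated averages for a few sample sizes. For the mean of n independent exponentials the theoretical skewness is 2/sqrt(n), so it shrinks toward 0 (the value for a normal distribution) as n grows. The helper skewness(), the sample sizes and the larger number of simulations are assumptions made for this illustration, not part of the analysis above.

```r
lambda <- 0.2
simulations <- 5000  # more simulations than in the report, for a steadier estimate
set.seed(20)

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

sizes <- c(5, 40, 200)  # hypothetical sample sizes
observed <- sapply(sizes, function(n) {
  averages <- replicate(simulations, mean(rexp(n, rate = lambda)))
  skewness(averages)
})

data.frame(n = sizes, observed = observed, theoretical = 2 / sqrt(sizes))
```

The observed column should track the theoretical one up to simulation noise, with the smallest skewness at the largest n.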

A Q-Q plot is another way to support this conclusion:

qqnorm(new_data)
qqline(new_data)

As can be seen, the points lie almost perfectly along the straight line, which indicates that the distribution is approximately normal.

Part 2: Basic Inferential Data Analysis Instructions

1. Load the ToothGrowth data and perform some basic exploratory data analyses

First we load the ToothGrowth data as below:

data(ToothGrowth)

We are interested in three variables:

- len: tooth length
- supp: supplement type (OJ or VC)
- dose: dosage in mg per day

The dosage takes only three values (0.5, 1 or 2), so its mean and median are not meaningful. Hence, we convert dose to a factor with as.factor as below:

ToothGrowth$dose <- as.factor(ToothGrowth$dose)
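One caveat with this conversion is that calling as.numeric() directly on a factor returns the internal level codes (1, 2, 3) rather than the original doses. The following is a minimal sketch of converting on a copy and recovering the numeric values; tg and dose_numeric are hypothetical names.

```r
data(ToothGrowth)
tg <- ToothGrowth              # hypothetical working copy, original left untouched
tg$dose <- as.factor(tg$dose)

levels(tg$dose)                # the three dose levels as labels
as.numeric(tg$dose)[1:3]       # level codes, not doses

# Going through as.character() recovers the original numeric doses
dose_numeric <- as.numeric(as.character(tg$dose))
identical(dose_numeric, ToothGrowth$dose)
```

Working on a copy keeps the numeric dose column available for later steps that need it.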

2. Provide a basic summary of the data

We can finally summarize the dataset:

summary(ToothGrowth)

##       len        supp     dose   
##  Min.   : 4.20   OJ:30   0.5:20  
##  1st Qu.:13.07   VC:30   1  :20  
##  Median :19.25           2  :20  
##  Mean   :18.81                   
##  3rd Qu.:25.27                   
##  Max.   :33.90

The sample is not very large (60 observations), so box plots give a useful visual comparison:

final_graph_2 <- ggplot(data = ToothGrowth, aes(dose, len)) + geom_boxplot(mapping=aes(group=dose)) + facet_wrap(~ supp, nrow=1)
final_graph_2
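A numeric counterpart to the box plots is the mean tooth length for each combination of supplement and dose. This sketch uses aggregate() from base R; group_means and group_counts are hypothetical names.

```r
data(ToothGrowth)

# Mean tooth length for each supplement type and dose level
group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
group_means

# Number of observations behind each mean
group_counts <- table(ToothGrowth$supp, ToothGrowth$dose)
group_counts
```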

3. Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and doses

We use Welch two-sample t-tests to assess the effect of the supplement type, and then of the dose, on tooth length:

t.test(len~supp, data=ToothGrowth)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

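The p-value is 0.061, which is higher than 0.05, so we fail to reject the null hypothesis at the 5% level. Consistently, the 95% confidence interval for the difference in means (-0.17 to 7.57) contains 0. In other words, the data do not provide sufficient evidence that the type of supplement affects tooth length.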
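To make the test less of a black box, the statistic, degrees of freedom, p-value and confidence interval can be reproduced by hand from the group means and variances, using the Welch-Satterthwaite approximation for the degrees of freedom. This is a sketch for illustration; all variable names are hypothetical.

```r
data(ToothGrowth)
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Squared standard error contributed by each group
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)
diff_means <- mean(oj) - mean(vc)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- diff_means / sqrt(se2_oj + se2_vc)
dof <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# Two-sided p-value and 95% confidence interval for the difference in means
p_value <- 2 * pt(-abs(t_stat), dof)
ci <- diff_means + c(-1, 1) * qt(0.975, dof) * sqrt(se2_oj + se2_vc)

c(t = t_stat, df = dof, p = p_value)
ci
```

The values should agree with the t.test output shown above.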

I had some trouble with the dataset because dose is now a factor, so I simply reloaded it to get the numeric dose back:

data(ToothGrowth)
t.test(ToothGrowth$len, ToothGrowth$dose)

## 
##  Welch Two Sample t-test
## 
## data:  ToothGrowth$len and ToothGrowth$dose
## t = 17.81, df = 59.798, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  15.66453 19.62881
## sample estimates:
## mean of x mean of y 
## 18.813333  1.166667

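The p-value is below 2.2e-16, far lower than 0.05, hence the null hypothesis can be rejected. In other words, the result is in line with an effect of the dose on tooth length, which the box plots also suggest. Note, however, that this particular call compares the mean of len (18.81) with the mean of the dose values themselves (1.17), so a comparison of len between dose levels would be the more direct test.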
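Because the call above compares the len column with the dose column, a more direct check of the dose effect is to compare tooth length between pairs of dose levels. The following is a sketch of that approach, not the analysis reported above. compare_doses, dose_pairs and p_values are hypothetical names, and no output is shown because these tests were not part of the original report.

```r
data(ToothGrowth)  # reload so that dose is numeric

# Welch t-test of tooth length between two dose levels (hypothetical helper)
compare_doses <- function(low, high) {
  subset_data <- ToothGrowth[ToothGrowth$dose %in% c(low, high), ]
  t.test(len ~ dose, data = subset_data)
}

dose_pairs <- list(c(0.5, 1), c(1, 2), c(0.5, 2))
p_values <- sapply(dose_pairs, function(p) compare_doses(p[1], p[2])$p.value)
names(p_values) <- sapply(dose_pairs, paste, collapse = " vs ")
p_values

# With three comparisons, a multiplicity correction may be appropriate
p.adjust(p_values, method = "bonferroni")
```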

4. State your conclusions and the assumptions needed for your conclusions

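At the 95% confidence level, the analysis indicates that the dose has an effect on tooth length.
At the same level, there is not sufficient evidence that the supplement type has an effect on tooth length.
These conclusions assume that the observations are independent, that tooth length is approximately normally distributed within each group, and that the group variances may differ (Welch's test).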