This document is presented as the final project for the course Statistical Inference. In the first part of the document, several sets of exponentially distributed data will be simulated, then see how the distribution of means will approximate according to the Central Limit Theorem. The second part consists of the statistical analysis for the Tooth Growth dataset included in the package datasets.
# Loading the required libraries from the start
library(ggplot2)
library(datasets)
library(knitr)
First, a set of n exponentials are going to be generated using the rexp function, said numbers will have a mean \(\lambda\). This is going to be repeated a fixed number of times.
# setting the seed for reproducibility
set.seed(12893)
# Defining constants
n <- 40
lambda <- 0.3
n_sims <- 1500
# Running the simulations
exp_distributions <- matrix(data=rexp(n*n_sims, lambda), nrow=n_sims)
exp_distributions_means <- data.frame(means=apply(exp_distributions, 1, mean))
After simulating 1500 distributions, the means are calculated an then plotted into a histogram.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histogram of means
The expected mean (\(\mu\)) of an exponential distribution is defined by the formula \(\mu = 1/\lambda\). The expected standard deviation is calculated as: \(\sigma = \frac{1}{\lambda\sqrt{n}}\). As for the variance (\(\sigma^{2}\)) its just \(\sigma^{2}\). On the other hand, the simulated mean, variance and standard deviation are calculated using the base R functions.
# Simulated mean, variance and standard deviation
mean_sim <- mean(exp_distributions_means$means)
sd_sim <- sd(exp_distributions_means$means)
var_sim <- var(exp_distributions_means$means)
# Theoretical mean, variance and standar deviation
mean_teo <- 1/lambda
sd_teo <- 1/(lambda*sqrt(n))
var_teo <- sd_teo^2
| Simulated | Theoretical | |
|---|---|---|
| Mean | 3.3261627 | 3.3333333 |
| Standard Dev. | 0.5198271 | 0.5270463 |
| Variance | 0.2702202 | 0.2777778 |
According to the results, both the theoretical and simulated measures are pretty close. This could be seen better using a plot in which the histogram, a trend line and a normal standard line -using the calculated values- are plotted.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Comparing the data to a normal distribution
In the figure it is clear that the normal distribution approximates really well to the distribution trend line.
The first step is to load the data and present some summary statistics.
# Second part
data("ToothGrowth")
# Visualizing the data structure
head(ToothGrowth, 5)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
# Dims of data and types
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
# Summary of the data
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
The data has 60 observations or 3 variables. * len: Tooth length, is a numeric type. * supp: Supplement given to the animal (VC or OJ), is a character type. * dose: Dose of supplement in milligrams/day, is a numeric type.
Visualizing the tooth growth data
According to the plot three main observations are drawn, the first one is that the \(2.0mg\) dosage is the most effective when it comes to tooth growth. The second one is that in general the supplment OJ has better effectiveness than VC. The final observation is that in \(2.0mg\) dosage both supplements have pretty much the same effect.
Before proceding with the hypothesis testing. Some assumptions have to be made, these are:
The data set is then splitted into groups that will help test our hypothesis.
# Splittin data set by dosage
dose_05 <- ToothGrowth$len[ToothGrowth$dose == 0.5]
dose_10 <- ToothGrowth$len[ToothGrowth$dose == 1.0]
dose_20 <- ToothGrowth$len[ToothGrowth$dose == 2.0]
# Splitting data by supplement
supp_oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
supp_vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]
# Splitting data by supplement and dosage = 2.0
supp_oj_20 <- ToothGrowth$len[ToothGrowth$supp == "OJ" & ToothGrowth$dose == 2.0]
supp_vc_20 <- ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == 2.0]
A one tail t-test is performed using the t.test function of R.
# Testing that the 2mg dosage is more effective than both others
# Testing that dose 1.0 is greater than dose 0.5
t.test(dose_10, dose_05, alternative = "greater", paired = FALSE,
var.equal = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: dose_10 and dose_05
## t = 6.4766, df = 37.986, p-value = 6.342e-08
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 6.753323 Inf
## sample estimates:
## mean of x mean of y
## 19.735 10.605
The p-value in this test is much smaller than our defined \(\alpha = 0.05\), and thus we can reject the null hypothesis. That means there is an almost non-existent probability of finding a higher tooth growth value for the \(0.5mg\) dosage compared to the \(1.0mg\) dosage.
# Testing that dose 2.0 is greater than dose 1.0
t.test(dose_20, dose_10, alternative = "greater", paired = FALSE,
var.equal = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: dose_20 and dose_10
## t = 4.9005, df = 37.101, p-value = 9.532e-06
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 4.17387 Inf
## sample estimates:
## mean of x mean of y
## 26.100 19.735
Again, p-value in this test is much smaller than our defined \(\alpha = 0.05\), and thus we can reject the null hypothesis. That means there is an almost non-existent probability of finding a higher tooth growth value for the \(1.0mg\) dosage compared to the \(2.0mg\) dosage.
# Testing that in general OJ has grater effects than vc
t.test(supp_oj, supp_vc, alternative = "greater", paired = FALSE,
var.equal = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: supp_oj and supp_vc
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.4682687 Inf
## sample estimates:
## mean of x mean of y
## 20.66333 16.96333
The p-value obtained indicates a 3% probability of finding a value in which the OJ supplement is less effective than the VC. Given that our tolerance is set to be 5%, the null hypothesis can be sucessfully rejected.
# Testing that oj 2.0 has similar effects than vc 2.0
t.test(supp_oj_20, supp_vc_20, alternative = "two.sided", paired = FALSE,
var.equal = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: supp_oj_20 and supp_vc_20
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.79807 3.63807
## sample estimates:
## mean of x mean of y
## 26.06 26.14
A p-value of \(0.9639\) is obtained, this indicates that the null hypothesis cannot be rejected. There is not enough evidence to show that the supplements have differents effects at such dosage.