Inferential statistics course project

Synopsis

This document is presented as the final project for the course Statistical Inference. In the first part of the document, several sets of exponentially distributed data will be simulated, then see how the distribution of means will approximate according to the Central Limit Theorem. The second part consists of the statistical analysis for the Tooth Growth dataset included in the package datasets.

# Loading the required libraries from the start
library(ggplot2)
library(datasets)
library(knitr)

1. Central Limit Theorem

Generating the data

First, a set of n exponentials are going to be generated using the rexp function, said numbers will have a mean \(\lambda\). This is going to be repeated a fixed number of times.

# setting the seed for reproducibility
set.seed(12893)

# Defining constants
n <- 40
lambda <- 0.3
n_sims <- 1500

# Running the simulations
exp_distributions <- matrix(data=rexp(n*n_sims, lambda), nrow=n_sims)
exp_distributions_means <- data.frame(means=apply(exp_distributions, 1, mean))

After simulating 1500 distributions, the means are calculated an then plotted into a histogram.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram of means

Sample vs Theoretical values

The expected mean (\(\mu\)) of an exponential distribution is defined by the formula \(\mu = 1/\lambda\). The expected standard deviation is calculated as: \(\sigma = \frac{1}{\lambda\sqrt{n}}\). As for the variance (\(\sigma^{2}\)) its just \(\sigma^{2}\). On the other hand, the simulated mean, variance and standard deviation are calculated using the base R functions.

# Simulated mean, variance and standard deviation
mean_sim <- mean(exp_distributions_means$means)
sd_sim <- sd(exp_distributions_means$means)
var_sim <- var(exp_distributions_means$means)

# Theoretical mean, variance and standar deviation
mean_teo <- 1/lambda
sd_teo <- 1/(lambda*sqrt(n))
var_teo <- sd_teo^2

	Simulated	Theoretical
Mean	3.3261627	3.3333333
Standard Dev.	0.5198271	0.5270463
Variance	0.2702202	0.2777778

According to the results, both the theoretical and simulated measures are pretty close. This could be seen better using a plot in which the histogram, a trend line and a normal standard line -using the calculated values- are plotted.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Comparing the data to a normal distribution

In the figure it is clear that the normal distribution approximates really well to the distribution trend line.

2. Inferential Data Analysis

The first step is to load the data and present some summary statistics.

# Second part
data("ToothGrowth")

# Visualizing the data structure
head(ToothGrowth, 5)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5

# Dims of data and types
str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

# Summary of the data
summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

The data has 60 observations or 3 variables. * len: Tooth length, is a numeric type. * supp: Supplement given to the animal (VC or OJ), is a character type. * dose: Dose of supplement in milligrams/day, is a numeric type.

Visualizing the tooth growth data

According to the plot three main observations are drawn, the first one is that the \(2.0mg\) dosage is the most effective when it comes to tooth growth. The second one is that in general the supplment OJ has better effectiveness than VC. The final observation is that in \(2.0mg\) dosage both supplements have pretty much the same effect.

Hypothesis testing

Before proceding with the hypothesis testing. Some assumptions have to be made, these are:

The variables must be independent and identically distributed (i.i.d.).
Variances of tooth growth are different when using different supplement and dosage.
Tooth growth follows a normal distribution.

The data set is then splitted into groups that will help test our hypothesis.

# Splittin data set by dosage
dose_05 <- ToothGrowth$len[ToothGrowth$dose == 0.5]
dose_10 <- ToothGrowth$len[ToothGrowth$dose == 1.0]
dose_20 <- ToothGrowth$len[ToothGrowth$dose == 2.0]

# Splitting data by supplement
supp_oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
supp_vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Splitting data by supplement and dosage = 2.0
supp_oj_20 <- ToothGrowth$len[ToothGrowth$supp == "OJ" & ToothGrowth$dose == 2.0]
supp_vc_20 <- ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == 2.0]

The \(2.0mg\) dosage is the most effective.

A one tail t-test is performed using the t.test function of R.

# Testing that the 2mg dosage is more effective than both others
# Testing that dose 1.0 is greater than dose 0.5
t.test(dose_10, dose_05, alternative = "greater", paired = FALSE,
       var.equal = FALSE, conf.level = 0.95)

## 
##  Welch Two Sample t-test
## 
## data:  dose_10 and dose_05
## t = 6.4766, df = 37.986, p-value = 6.342e-08
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  6.753323      Inf
## sample estimates:
## mean of x mean of y 
##    19.735    10.605

The p-value in this test is much smaller than our defined \(\alpha = 0.05\), and thus we can reject the null hypothesis. That means there is an almost non-existent probability of finding a higher tooth growth value for the \(0.5mg\) dosage compared to the \(1.0mg\) dosage.

# Testing that dose 2.0 is greater than dose 1.0
t.test(dose_20, dose_10, alternative = "greater", paired = FALSE,
       var.equal = FALSE, conf.level = 0.95)

## 
##  Welch Two Sample t-test
## 
## data:  dose_20 and dose_10
## t = 4.9005, df = 37.101, p-value = 9.532e-06
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  4.17387     Inf
## sample estimates:
## mean of x mean of y 
##    26.100    19.735

Again, p-value in this test is much smaller than our defined \(\alpha = 0.05\), and thus we can reject the null hypothesis. That means there is an almost non-existent probability of finding a higher tooth growth value for the \(1.0mg\) dosage compared to the \(2.0mg\) dosage.

Overall the OJ supplement is most effective when compared to the VC.

# Testing that in general OJ has grater effects than vc
t.test(supp_oj, supp_vc, alternative = "greater", paired = FALSE,
       var.equal = FALSE, conf.level = 0.95)

## 
##  Welch Two Sample t-test
## 
## data:  supp_oj and supp_vc
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.4682687       Inf
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333

The p-value obtained indicates a 3% probability of finding a value in which the OJ supplement is less effective than the VC. Given that our tolerance is set to be 5%, the null hypothesis can be sucessfully rejected.

For the \(2.0mg\) dosage both supplements have similar effects.

# Testing that oj 2.0 has similar effects than vc 2.0
t.test(supp_oj_20, supp_vc_20, alternative = "two.sided", paired = FALSE,
       var.equal = FALSE, conf.level = 0.95)

## 
##  Welch Two Sample t-test
## 
## data:  supp_oj_20 and supp_vc_20
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean of x mean of y 
##     26.06     26.14

A p-value of \(0.9639\) is obtained, this indicates that the null hypothesis cannot be rejected. There is not enough evidence to show that the supplements have differents effects at such dosage.