Overview:

This project covers the topics of the Coursera Statistical Inference class. It consists of two parts. The first is a simulation exercise that investigates the exponential distribution and compares the distribution of its sample means with what the Central Limit Theorem predicts. The second is a basic inferential data analysis of the ToothGrowth dataset that ships with R.

Part 1: Simulation Exercise

1. Show the sample mean and compare it to the theoretical mean of the distribution.

For this part, the theoretical mean is calculated and compared with the mean of 1000 simulated sample means, each taken over 40 exponential draws.

# Set seed for reproducibility
set.seed(1)
# Parameters: sample size and rate of the exponential distribution
n <- 40
lambda <- 0.2
# Theoretical mean of the exponential distribution
Tmean <- 1/lambda
# Simulate 1000 samples of n exponentials (one sample per row)
simData <- matrix(rexp(n*1000, rate=lambda),1000)
# Mean of each simulated sample
rowMean <- rowMeans(simData)
# Average of the 1000 sample means
Smean <- mean(rowMean)
# Histogram of the sample means, with the theoretical mean marked in red
hist(rowMean, xlab="Mean", ylab = "Frequency", main = "Sample means of the exponential distribution")
abline(v=Tmean, col="red", lwd=3)

As expected, the sample mean is very close to the theoretical mean.

# Theoretical Mean
print(paste("Theoretical mean is",Tmean))

## [1] "Theoretical mean is 5"

# Sample Mean
print(paste("Sample mean is",Smean))

## [1] "Sample mean is 4.99002520077716"

2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

Next, the theoretical variance and sample variance are calculated and compared.

# Theoretical variance
Tvariance <- (1/lambda)^2/(n)
print(paste("Theoretical variance is",Tvariance))

## [1] "Theoretical variance is 0.625"

# Sample Variance
Svariance <- var(rowMean)
print(paste("Sample variance is",Svariance))

## [1] "Sample variance is 0.617707174842697"

As expected, the sample variance is very close to the theoretical variance.
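The theoretical value used above follows from the variance of a mean of n independent draws. An exponential distribution with rate lambda has standard deviation 1/lambda, so with the values from this report (lambda = 0.2, n = 40) the arithmetic works out as:

```latex
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n} = \frac{(1/\lambda)^2}{n} = \frac{(1/0.2)^2}{40} = \frac{25}{40} = 0.625
```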

3. Show that the distribution is approximately normal.

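The histogram of the sample means closely follows a normal density. Two normal curves are overlaid for comparison: one using the sample mean and sample standard deviation (dotted red) and one using the theoretical mean and theoretical standard deviation (yellow). Both curves track the histogram closely, as the Central Limit Theorem predicts.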

# prob=TRUE plots densities, so the y axis is a density rather than a count
hist(rowMean,prob=TRUE, xlab="Mean", ylab = "Density", main="Distribution Comparison")
curve(dnorm(x, mean=Smean, sd=sqrt(Svariance)), col="red", lwd=2, lty = "dotted", add=TRUE, yaxt="n")
curve(dnorm(x, mean=Tmean, sd=sqrt(Tvariance)), col="yellow", lwd=2, add=TRUE, yaxt="n")
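
The visual comparison can be complemented with a rough numeric check. The sketch below is an illustration rather than part of the original analysis: it regenerates the same simulation, standardizes the sample means with the theoretical mean and standard error, and measures the share that falls within 1.96 standard errors, which would be about 0.95 for a normal distribution. The names sim_means, z and coverage are introduced here for illustration only.

```r
# Re-create the simulation so this block runs on its own
set.seed(1)
n <- 40
lambda <- 0.2
sim_means <- rowMeans(matrix(rexp(n * 1000, rate = lambda), 1000))

# Standardize with the theoretical mean (1/lambda) and
# the theoretical standard error ((1/lambda) / sqrt(n))
z <- (sim_means - 1 / lambda) / ((1 / lambda) / sqrt(n))

# Share of standardized means within +/- 1.96
coverage <- mean(abs(z) <= 1.96)
print(coverage)

# Normal Q-Q plot: points near the line suggest approximate normality
qqnorm(sim_means)
qqline(sim_means, col = "red")
```

A coverage value near 0.95 and Q-Q points lying close to the reference line are both consistent with approximate normality. Some departure in the tails is to be expected, because the mean of 40 exponentials is still slightly right-skewed.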

Part 2: Basic Inferential Data Analysis Instructions

1. Load the ToothGrowth data and perform some basic exploratory data analyses

# Load dataset for the analysis
library(datasets)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
data("ToothGrowth")

2. Provide a basic summary of the data.

The output below summarizes the dataset: tooth length (len), supplement type (supp) and dose. The boxplot then shows tooth length by supplement type for each dose.

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

head(ToothGrowth,10)

##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5

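# Boxplots of tooth length by supplement type, one panel per dose
# (ggplot() is used because qplot() is deprecated in recent ggplot2 releases)
ggplot(ToothGrowth, aes(x = supp, y = len)) + geom_boxplot(aes(fill = supp)) + facet_wrap(~dose) + labs(title = "Tooth growth by supplement type and dosage", x = "Supplement type", y = "Tooth length")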

3. Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there’s other approaches worth considering)

We split the dataset by the three levels of the dose column (0.5, 1 and 2) and run a t-test of len against supp within each level, to test whether the supplement type (OJ or VC) makes a statistically significant difference.

First, for dose = 0.5:

dosis_0.5 <- filter(ToothGrowth, dose == 0.5)
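# Welch two-sample t-test; the groups are unpaired by default, and recent R
# versions reject the `paired` argument in the formula interface
t_test_dosis_0.5 <- t.test(len ~ supp, data = dosis_0.5)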
t_test_dosis_0.5

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC 
##            13.23             7.98

For a dose of 0.5 mg/day, the p-value (0.006359) is lower than 0.05 and the 95% confidence interval (1.72 to 8.78) does not contain zero, so we reject the null hypothesis of equal means. Mean tooth length is significantly greater in the OJ group (13.23) than in the VC group (7.98).

Second, for dose = 1:

dosis_1.0 <- filter(ToothGrowth, dose == 1.0) 
# Welch two-sample t-test (unpaired by default)
t_test_dosis_1.0 <- t.test(len ~ supp, data = dosis_1.0)
t_test_dosis_1.0

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC 
##            22.70            16.77

For a dose of 1.0 mg/day, the p-value (0.001038) is lower than 0.05 and the 95% confidence interval (2.80 to 9.06) does not contain zero, so we again reject the null hypothesis of equal means. Mean tooth length is significantly greater in the OJ group (22.70) than in the VC group (16.77).

Third, for dose = 2:

dosis_2.0 <- filter(ToothGrowth, dose == 2.0)
# Welch two-sample t-test (unpaired by default)
t_test_dosis_2.0 <- t.test(len ~ supp, data = dosis_2.0)
t_test_dosis_2.0

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean in group OJ mean in group VC 
##            26.06            26.14

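For a dose of 2.0 mg/day, the p-value (0.9639) is greater than 0.05 and the 95% confidence interval (-3.80 to 3.64) contains zero, so we fail to reject the null hypothesis of equal means. The group means are almost identical: 26.06 for OJ and 26.14 for VC.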
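
The three per-dose tests can also be run in a single pass and collected into one table, which makes the comparison across doses easier to read. This is a minimal sketch using base R only; the names results, d and tt are introduced here for illustration and are not part of the original analysis.

```r
library(datasets)
data("ToothGrowth")

# Run a Welch two-sample t-test of len by supp within each dose level
results <- do.call(rbind, lapply(sort(unique(ToothGrowth$dose)), function(d) {
  tt <- t.test(len ~ supp, data = subset(ToothGrowth, dose == d))
  data.frame(
    dose     = d,
    p_value  = tt$p.value,
    ci_lower = tt$conf.int[1],  # lower bound of the 95% CI for OJ minus VC
    ci_upper = tt$conf.int[2]   # upper bound of the 95% CI for OJ minus VC
  )
}))

print(results)
```

Each row should reproduce the p-value and confidence interval reported for the corresponding dose above.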

4. State your conclusions and the assumptions needed for your conclusions.

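For doses of 0.5 and 1.0 mg/day, the supplement type makes a significant difference in mean tooth length, with OJ producing greater growth than VC. At 2.0 mg/day there is no significant difference between the two supplements.

These conclusions rely on the assumptions behind the tests used above: the OJ and VC groups are independent (unpaired) samples, their variances are not assumed to be equal (Welch test), and tooth length is approximately normally distributed within each group.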

Statistical Inference Project

Carlos M. Restrepo

7/1/2020