This project is expected to cover all the topics in the Coursera Statistical Inference class. It consists of two parts. The first is a simulation exercise that investigates the exponential distribution and compares the distribution of its sample means with what the Central Limit Theorem predicts. The second is a basic inferential data analysis of the ToothGrowth dataset in R.
In this part, the theoretical mean and the sample mean are calculated and compared.
# Set seed for reproducibility
set.seed(1)
# Variables
n <- 40
lambda <- 0.2
# Theoretical mean
Tmean <- 1/lambda
# Simulate 1000 samples of n = 40 exponentials (one sample per row)
simData <- matrix(rexp(n*1000, rate=lambda),1000)
# Compute the mean of each row (each simulated sample)
rowMean <- rowMeans(simData)
# Mean of the 1000 sample means
Smean <- mean(rowMean)
# Histogram of the sample means
hist(rowMean, xlab="Mean", ylab = "Frequency", main = "Mean of the exponential distribution")
abline(v=Tmean, col="red", lwd=3)
As expected, the sample mean is close to the theoretical mean.
# Theoretical Mean
print(paste("Theoretical mean is",Tmean))
## [1] "Theoretical mean is 5"
# Sample Mean
print(paste("Sample mean is",Smean))
## [1] "Sample mean is 4.99002520077716"
Next, the theoretical variance and sample variance are calculated and compared.
# Theoretical variance
Tvariance <- (1/lambda)^2/(n)
print(paste("Theoretical variance is",Tvariance))
## [1] "Theoretical variance is 0.625"
# Sample Variance
Svariance <- var(rowMean)
print(paste("Sample variance is",Svariance))
## [1] "Sample variance is 0.617707174842697"
As expected, the sample variance is close to the theoretical variance.
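The theoretical values used above follow from the standard moments of the exponential distribution. As a short worked derivation, for independent draws X_i from an exponential distribution with rate λ, and with λ = 0.2 and n = 40 taken from the code:

```latex
\begin{aligned}
\mathbb{E}[X_i] &= \frac{1}{\lambda} = 5, &
\operatorname{Var}(X_i) &= \frac{1}{\lambda^{2}} = 25, \\
\mathbb{E}[\bar{X}_n] &= \frac{1}{\lambda} = 5, &
\operatorname{Var}(\bar{X}_n) &= \frac{1}{n\lambda^{2}} = \frac{25}{40} = 0.625 .
\end{aligned}
```

The last expression is the quantity computed as Tvariance, and its square root is the standard error of the mean.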
The histogram of the sample means is close to a normal distribution: it closely follows both normal density curves, one using the sample mean and sample standard deviation (red, dotted) and one using the theoretical mean and standard deviation (yellow).
hist(rowMean,prob=TRUE, xlab="Mean", ylab = "Density", main="Distribution Comparison")
curve(dnorm(x, mean=Smean, sd=sqrt(Svariance)), col="red", lwd=2, lty = "dotted", add=TRUE, yaxt="n")
curve(dnorm(x, mean=Tmean, sd=sqrt(Tvariance)), col="yellow", lwd=2, add=TRUE, yaxt="n")
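Beyond the visual comparison, one rough numeric check is to standardize the simulated means and see what share falls within the central 95% of a standard normal. The following is a minimal sketch, not part of the original analysis; it regenerates the simulation with the same seed and parameters, and the names sim_means, z and coverage are introduced here for illustration.

```r
set.seed(1)          # same seed as the simulation above
n <- 40
lambda <- 0.2

# 1000 simulated sample means, each from n exponential draws
sim_means <- rowMeans(matrix(rexp(n * 1000, rate = lambda), nrow = 1000))

# Standardize with the theoretical mean and theoretical standard error
z <- (sim_means - 1 / lambda) / ((1 / lambda) / sqrt(n))

# Share of standardized means inside the central 95% region of N(0, 1)
coverage <- mean(abs(z) <= qnorm(0.975))
coverage
```

If the normal approximation is reasonable, the reported share should be near 0.95.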
# Load dataset for the analysis
library(datasets)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
data("ToothGrowth")
A basic summary of the data follows: tooth length (len), supplement type (supp) and dose (dose).
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
head(ToothGrowth,10)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
## 7 11.2 VC 0.5
## 8 11.2 VC 0.5
## 9 5.2 VC 0.5
## 10 7.0 VC 0.5
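The summaries above are marginal, so they do not show how tooth length varies across the six supplement and dose combinations. A grouped table of means may be a useful complement. This is a minimal sketch using base R's aggregate(); the name group_means is introduced here for illustration.

```r
library(datasets)
data("ToothGrowth")

# Mean tooth length for each supplement type and dose combination
group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
group_means
```

Each row of the result corresponds to one panel and supplement pairing in a plot faceted by dose.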
ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_point() +
  geom_boxplot(aes(fill = supp)) +
  facet_wrap(~dose) +
  labs(title = "tooth growth by supplement type and dosage", x = "supplement type", y = "tooth length")
We split the dataset by the three levels of the dose column (0.5, 1 and 2) and, within each level, run a two-sample t-test of len against supp. The aim is to test whether the supplement type (OJ or VC) makes a statistically significant difference in mean tooth length.
First, for dose = 0.5:
dosis_0.5 <- filter(ToothGrowth, dose == 0.5)
# Welch two-sample t-test (unpaired, the default)
t_test_dosis_0.5 <- t.test(len ~ supp, data = dosis_0.5)
t_test_dosis_0.5
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC
## 13.23 7.98
For the 0.5 mg/day dose, the p-value (0.006) is below 0.05 and the 95% confidence interval (1.72 to 8.78) excludes zero, so we reject the null hypothesis of equal means. Mean tooth length is significantly higher with OJ (13.23) than with VC (7.98).
Second, for dose = 1:
dosis_1.0 <- filter(ToothGrowth, dose == 1.0)
# Welch two-sample t-test (unpaired, the default)
t_test_dosis_1.0 <- t.test(len ~ supp, data = dosis_1.0)
t_test_dosis_1.0
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC
## 22.70 16.77
For the 1.0 mg/day dose, the p-value (0.001) is below 0.05 and the 95% confidence interval (2.80 to 9.06) excludes zero, so we again reject the null hypothesis of equal means. Mean tooth length is significantly higher with OJ (22.70) than with VC (16.77).
Third, for dose = 2:
dosis_2.0 <- filter(ToothGrowth, dose == 2.0)
# Welch two-sample t-test (unpaired, the default)
t_test_dosis_2.0 <- t.test(len ~ supp, data = dosis_2.0)
t_test_dosis_2.0
##
## Welch Two Sample t-test
##
## data: len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.79807 3.63807
## sample estimates:
## mean in group OJ mean in group VC
## 26.06 26.14
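For the 2.0 mg/day dose, the p-value (0.96) is well above 0.05 and the 95% confidence interval (-3.80 to 3.64) contains zero, so we fail to reject the null hypothesis of equal means. The group means are nearly identical: 26.06 for OJ and 26.14 for VC.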
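For doses below 2.0 mg/day (0.5 and 1.0), supplement type makes a statistically significant difference in mean tooth length, with OJ associated with greater growth than VC. At 2.0 mg/day, no significant difference between the supplements is detected.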
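The three per-dose tests above can also be run in one pass, which makes the p-values easier to compare side by side. This is a minimal sketch assuming the same Welch test with its default settings; by_dose and p_values are illustrative names that do not appear in the original analysis.

```r
library(datasets)
data("ToothGrowth")

# One data frame per dose level
by_dose <- split(ToothGrowth, ToothGrowth$dose)

# Welch two-sample t-test of len by supp within each dose; keep the p-value
p_values <- sapply(by_dose, function(d) t.test(len ~ supp, data = d)$p.value)

round(p_values, 4)
```

The resulting vector is named by dose level, so each entry can be read directly against the 0.05 threshold.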