Intro

This project is divided into two parts. The first part simulates averages of draws from the exponential distribution and compares the results with the predictions of the central limit theorem. The second part uses basic analysis techniques to explore the ToothGrowth dataset, which ships with R in the datasets package and can be loaded with the data() function.

Part 1: Simulation of Exponential Distribution

The aim of this first part of the project is to show how simulation can illustrate the central limit theorem. The distribution of averages of exponential random variables is simulated and then compared with what the central limit theorem predicts. The mean and standard deviation of the exponential distribution are both \(\frac{1}{\lambda}\). Let \(\lambda = 0.2\) for all simulations. Each simulation averages \(40\) exponentials, and \(1000\) simulations are conducted. These variables are defined below.

set.seed(24) #set seed for reproduction
n <- 40 # number of exponentials averaged per simulation
lambda <- 0.2  # theoretical value for lambda
num.sim <- 1000 # number of simulations
z <- 1.96 # z for 95% conf int.
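
As a quick check on the setup, the quantities that the central limit theorem predicts can be computed directly from \(\lambda\) and the sample size. The sketch below is an illustration rather than part of the original analysis; the names `mu`, `sigma`, and `se.mean` are introduced here for convenience.

```r
lambda <- 0.2  # rate of the exponential distribution
n <- 40        # number of exponentials averaged per simulation

mu <- 1 / lambda             # theoretical mean of one exponential draw
sigma <- 1 / lambda          # theoretical standard deviation of one draw
se.mean <- sigma / sqrt(n)   # CLT standard deviation of the mean of n draws

c(mean = mu, sd = sigma, se.mean = se.mean)
```

With these settings the mean and standard deviation are both 5, and the standard deviation of an average of 40 draws is roughly 0.79.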

1. Compare the Sample Mean and Theoretical Mean

## compare sample mean to theoretical mean
data1 <- matrix(rexp(n*num.sim, rate = lambda), num.sim)
means <- rowMeans(data1)
m.means <- mean(means)
t.mean <- 1/lambda #true mean
#histogram of sample means
hist(means, breaks = 25,
     xlab = "Mean",
     ylab = "Frequency",
     main = "Histogram of Simulated Means",
     col = "lightsteelblue1")
abline(v=t.mean, col="red", lwd=3)
legend("topright", lty = 1, lwd = 5, col = "red", legend = "theoretical mean")

t.mean; m.means
## [1] 5
## [1] 5.015078

The simulated sample means are approximately normally distributed, and the center of this distribution is very close to the line representing the theoretical mean. The mean of the simulated sample means is only about 0.015 units away from the theoretical mean of 5, which is consistent with the central limit theorem.
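
The spread of the sample means can be checked against the central limit theorem as well. The following sketch repeats the simulation with the same seed and settings, then measures how many sample means fall inside the 95% interval built from the z value defined earlier. The names `se.theory`, `se.sim`, `clt.interval`, and `coverage` are assumptions made for this illustration.

```r
set.seed(24)
n <- 40; lambda <- 0.2; num.sim <- 1000; z <- 1.96

# one row per simulation, n exponentials per row
means <- rowMeans(matrix(rexp(n * num.sim, rate = lambda), nrow = num.sim))

se.theory <- (1 / lambda) / sqrt(n)  # standard deviation of the mean per the CLT
se.sim <- sd(means)                  # standard deviation observed in the simulation

# interval in which about 95% of sample means should fall if they are normal
clt.interval <- 1 / lambda + c(-1, 1) * z * se.theory
coverage <- mean(means >= clt.interval[1] & means <= clt.interval[2])

c(se.theory = se.theory, se.sim = se.sim, coverage = coverage)
```

If the normal approximation is reasonable, `se.sim` should be close to `se.theory` and `coverage` should be close to 0.95.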

2. Compare the Sample Variance and Theoretical Variance

var1 <- apply(data1, 1, var)
hist(var1, breaks = 25,
     xlab = "Variance",
     ylab = "Frequency",
     main = "Histogram of Variances",
     col = "palegoldenrod")
abline(v = (1/lambda)^2, lty = 1, lwd = 5, col = "blue")
legend("topright", lty = 1, lwd = 5, col = "blue", legend = "theoretical variance")

mean(var1); (1/lambda)^2
## [1] 24.7696
## [1] 25

The simulated sample variances have a right-skewed distribution centered near the theoretical variance. The difference between the mean simulated variance and the theoretical variance is only about 0.23, so the simulation agrees well with theory.
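
The right skew can be quantified instead of judged by eye. The sketch below recomputes the sample variances with the same seed and reports a simple moment-based skewness; the `skew` calculation is an addition for illustration and is not part of the original analysis.

```r
set.seed(24)
n <- 40; lambda <- 0.2; num.sim <- 1000

data1 <- matrix(rexp(n * num.sim, rate = lambda), nrow = num.sim)
var1 <- apply(data1, 1, var)  # sample variance of each simulated row

# moment-based skewness: positive values indicate a longer right tail
skew <- mean((var1 - mean(var1))^3) / sd(var1)^3

c(mean = mean(var1), median = median(var1), skewness = skew)
```

A positive skewness together with a median below the mean would support the description of the histogram as right-skewed.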

3. Show that the distribution is approximately normal

qqnorm(means)
qqline(means,col= "Red")

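In this plot the sample quantiles of the simulated means are compared with the quantiles of a normal distribution. Points that lie close to the red reference line indicate that the distribution of sample means is approximately normal.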
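
Another way to look at the same question is to overlay the normal density that the central limit theorem predicts on a density-scaled histogram of the sample means. This is a sketch under the same seed and settings as above; the names `mu`, `se`, and `peak` are introduced here for illustration.

```r
set.seed(24)
n <- 40; lambda <- 0.2; num.sim <- 1000

means <- rowMeans(matrix(rexp(n * num.sim, rate = lambda), nrow = num.sim))

mu <- 1 / lambda                # mean predicted by the CLT
se <- (1 / lambda) / sqrt(n)    # standard deviation predicted by the CLT

# freq = FALSE puts the histogram on the density scale so the curve is comparable
hist(means, breaks = 25, freq = FALSE,
     xlab = "Mean",
     main = "Simulated Means with Normal Density",
     col = "lightsteelblue1")
curve(dnorm(x, mean = mu, sd = se), add = TRUE, col = "red", lwd = 2)

peak <- dnorm(mu, mean = mu, sd = se)  # height of the predicted density at its center
```

The closer the bars follow the red curve, the better the normal approximation.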

Part 2: Analyze the ToothGrowth Dataset

In this part of the project, some basic statistical analysis is run on the ToothGrowth dataset.

1. Exploratory Data Analysis

A good starting point for exploring data is to summarize the quantiles and retrieve the means and standard deviations of the various factors. The most obvious factor is the Supplement variable, which has two levels: VC and OJ.

data("ToothGrowth")
summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
names(ToothGrowth)[2] <- "Supplement" #better description
mean(ToothGrowth[ToothGrowth$Supplement == "VC", ]$len)
## [1] 16.96333
mean(ToothGrowth[ToothGrowth$Supplement == "OJ", ]$len)
## [1] 20.66333
sd(ToothGrowth[ToothGrowth$Supplement == "VC", ]$len)
## [1] 8.266029
sd(ToothGrowth[ToothGrowth$Supplement == "OJ", ]$len)
## [1] 6.605561
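
The same summaries can be broken down by dosage in a single step. The sketch below works on a fresh copy of the dataset, here called `tg` (a name chosen for this example), so the column keeps its original name `supp` and not the renamed `Supplement`.

```r
tg <- datasets::ToothGrowth  # fresh copy with the original column names

# rows are supplements (OJ, VC), columns are doses (0.5, 1, 2)
group.means <- with(tg, tapply(len, list(supp, dose), mean))
group.sds <- with(tg, tapply(len, list(supp, dose), sd))
group.n <- with(tg, tapply(len, list(supp, dose), length))

round(group.means, 2)
round(group.sds, 2)
group.n
```

Each of the six supplement-by-dose cells holds 10 observations, which is worth keeping in mind when interpreting the per-dose tests.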

To get a better understanding of this data, boxplots can be constructed, as demonstrated below.

require(ggplot2)
## Loading required package: ggplot2
qplot(x = Supplement, y = len, data = ToothGrowth, facets = ~dose,
      main = "Tooth Growth by Supplement Type and Dosage",
      xlab = "Supplement Type",
      ylab = "Tooth Length") +
  geom_boxplot(aes(fill = Supplement))

Based on these plots, tooth length increases with dosage for both supplements, and the difference between the supplements largely disappears at the highest dosage. However, to determine whether these differences are statistically significant, it is necessary to conduct some hypothesis tests. These tests are demonstrated in the next section.

2. Hypothesis Testing

The goal is to determine whether there is a statistically significant difference between the effects of VC and OJ on tooth length. First, the difference is tested without regard to dosage, then the difference is tested for each dosage amount.

ttest.all <- t.test(len~Supplement, data = ToothGrowth)
ttest.all
## 
##  Welch Two Sample t-test
## 
## data:  len by Supplement
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

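This test fails to reject the null hypothesis of equal means at the 5% level (p = 0.061), and the 95% confidence interval contains zero. Pooled across dosages, then, there is no statistically significant difference between the two supplement types. However, further investigation is required.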
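
To make the output above less of a black box, the Welch statistic, its degrees of freedom, the p-value, and the confidence interval can be reproduced by hand. The sketch below uses a fresh copy of the data called `tg` with the original column name `supp`; all variable names are chosen for this illustration.

```r
tg <- datasets::ToothGrowth
oj <- tg$len[tg$supp == "OJ"]
vc <- tg$len[tg$supp == "VC"]

# per-group variance of the mean (s^2 / n)
v.oj <- var(oj) / length(oj)
v.vc <- var(vc) / length(vc)

diff.means <- mean(oj) - mean(vc)
se.diff <- sqrt(v.oj + v.vc)
t.manual <- diff.means / se.diff

# Welch-Satterthwaite approximation to the degrees of freedom
df.manual <- (v.oj + v.vc)^2 /
  (v.oj^2 / (length(oj) - 1) + v.vc^2 / (length(vc) - 1))

p.manual <- 2 * pt(-abs(t.manual), df = df.manual)               # two-sided p-value
ci.manual <- diff.means + c(-1, 1) * qt(0.975, df.manual) * se.diff

c(t = t.manual, df = df.manual, p = p.manual)
ci.manual
```

These values should agree with the t, df, p-value, and confidence interval printed by t.test above.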

require(dplyr)
## Loading required package: dplyr
## Warning: package 'dplyr' was built under R version 3.4.4
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
dose05 <- filter(ToothGrowth, dose == 0.5) 
dose10 <- filter(ToothGrowth, dose == 1.0) 
dose20 <- filter(ToothGrowth, dose == 2.0)
ttest05 <-  t.test(len~Supplement, data = dose05)
ttest05
## 
##  Welch Two Sample t-test
## 
## data:  len by Supplement
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC 
##            13.23             7.98

This test rejects the null hypothesis that the two means are the same. So, at a 0.5 dosage level, there is a significant difference between the supplements.

ttest10 <- t.test(len~Supplement, data = dose10)
ttest10
## 
##  Welch Two Sample t-test
## 
## data:  len by Supplement
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC 
##            22.70            16.77

This test rejects the null hypothesis that the two means are the same. So, at a 1.0 dosage level, there is a significant difference between the supplements.

ttest20 <- t.test(len~Supplement, data = dose20)
ttest20
## 
##  Welch Two Sample t-test
## 
## data:  len by Supplement
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean in group OJ mean in group VC 
##            26.06            26.14

This test fails to reject the null hypothesis that the two means are the same. So, at a 2.0 dosage level, there is no significant difference between the supplements.
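
The three per-dose tests can also be run in one pass, which avoids copy-and-paste slips between dosage levels. The sketch below uses a fresh copy of the data called `tg` with the original column name `supp`. The Holm adjustment for running three tests is an addition that the original analysis does not perform; it is included only as an optional check.

```r
tg <- datasets::ToothGrowth

# split into one data frame per dose, then run the Welch test on each
by.dose <- split(tg, tg$dose)
p.raw <- sapply(by.dose, function(d) t.test(len ~ supp, data = d)$p.value)

# optional: adjust for the three comparisons (not part of the original analysis)
p.holm <- p.adjust(p.raw, method = "holm")

round(rbind(raw = p.raw, holm = p.holm), 4)
```

Under this adjustment the conclusions at the 5% level are unchanged: the difference remains significant at the 0.5 and 1.0 doses and not at 2.0.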

3. Conclusion

The tests rely on random and independent sampling, approximately normal data within each group, and an adequate sample size. The Welch t-test used here does not assume equal variances. Pooled across dosages, the difference between supplements is not significant at the 5% level. Tested by dosage, OJ gives a significantly greater mean tooth length than VC at the 0.5 and 1.0 doses, but not at the highest dose (2.0 mg/day). These results are consistent with the boxplots shown above.