Part 1
The Exponential Distribution and Comparison to the CLT
Synopsis

We simulate 1000 iterations of 40 exponentials using replicate() and then calculate the mean of each simulated sample. We compare this to the theoretical mean derived from the constants lambda <- 0.2; noSim <- 1000; and n <- 40 (lambda is set to 0.2, the number of simulations is 1000, and the number of exponentials in each simulation is 40). Our experiment shows little difference between the theoretical mean implied by these constants and the simulated mean. We also show that the variances are similar and that the distribution of the simulated means is approximately normal.

# Set seed for reproducibility of random number generation
# Set relevant variables: lambda, noSim (number of simulations), and n (number of exponentials)
#  per above instructions
set.seed(999); lambda <- 0.2; noSim <- 1000; n <- 40

# Simulation: using replicate() we run 1000 simulations of 40 exponentials, take the mean of each
# simulated sample (the columns of sExp), and then the mean of the resulting vector of means (mExp).
sExp <- replicate(noSim, rexp(n,lambda))
mExp <- colMeans(sExp)
sMean <- mean(mExp)
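
Because rexp(n, lambda) returns a vector of length n, replicate() stacks these vectors side by side, so sExp is a 40 x 1000 matrix in which each column is one simulation, and colMeans() yields the 1000 per-simulation means. A quick optional sanity check of that structure:

# Optional check: sExp should be a 40 x 1000 matrix (rows = exponentials, columns = simulations)
dim(sExp)     # expected: 40 1000
length(mExp)  # expected: 1000 (one mean per simulated sample of 40)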

# Set theoretical mean: for an exponential distribution with rate lambda, the mean is 1/lambda
tMean <- 1/lambda

q1Answer <- data.frame('Theoretical Mean' = tMean,
                       'Simulated Mean' = sMean)

print(q1Answer)
##   Theoretical.Mean Simulated.Mean
## 1                5       5.029028

From the above, we see there is not a large difference between the theoretical mean and the simulated mean.
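
If we want something more explicit than eyeballing the table, we can also compute the difference directly; this is a small optional sketch using the objects already defined above:

# Optional: absolute and relative difference between the simulated and theoretical means
abs(sMean - tMean)          # roughly 0.03 for this seed
abs(sMean - tMean) / tMean  # well under 1%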

hist(mExp, breaks = 40, xlab = "Mean", main = "1000 simulations of 40 exponentials", col = "orange")
abline(v = sMean, col = "blue")
abline(v = tMean, col = "red")

Visually, the histogram of the simulation results confirms this: the theoretical mean (red) and the simulated mean (blue) are so close that the two vertical lines nearly overlap.

# Simulated Standard Deviation and Variance
sSD <- sd(mExp)
sVariance <- sSD^2

# Theoretical Standard Deviation and Variance of the sample mean (per the CLT, sd = (1/lambda)/sqrt(n))
tSD <- (1/lambda)/sqrt(n)
tVariance <- tSD^2

kable(data.frame("Simulated Variance" = sVariance,
                 "Theoretical Variance" = tVariance))
| Simulated.Variance | Theoretical.Variance |
|-------------------:|---------------------:|
|           0.605616 |                0.625 |

We see here that the simulated variance is 0.605616 versus the theoretical variance of 0.625.
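
The theoretical value of 0.625 comes directly from the CLT: a single Exp(0.2) draw has variance (1/lambda)^2 = 25, and the variance of the mean of 40 independent draws is 25/40 = 0.625. A short optional check against the raw simulated draws:

# Optional: variance of the individual exponentials vs. variance of the sample means
var(as.vector(sExp))  # close to the theoretical (1/lambda)^2 = 25
var(mExp)             # close to 25/40 = 0.625
(1/lambda)^2 / n      # theoretical variance of the sample mean, 0.625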

# Plot histogram of the simulated means on the density scale
hist(mExp, breaks = 40, freq = FALSE, xlab = "Mean", main = "Comparison to a Normal Distribution", col = "green")
# Overlay the theoretical normal density with mean 1/lambda and sd (1/lambda)/sqrt(n)
xfit <- seq(min(mExp), max(mExp), length = 100)
yfit <- dnorm(xfit, mean = tMean, sd = tSD)
lines(xfit, yfit, col = 6, lwd = 1.5)

Taking the same histogram of simulation means, we overlay the normal density curve with the theoretical mean and standard deviation, and we can see visually that the simulated distribution is approximately normal in shape.
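
As an additional, optional check on normality, a normal Q-Q plot of the simulated means should fall close to a straight line:

# Optional: Q-Q plot of the simulated means against the normal quantiles
qqnorm(mExp, main = "Normal Q-Q Plot of Simulation Means")
qqline(mExp, col = "red")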


Conclusions and assumptions

We conclude that the mean of the 1000 simulated sample means is close to the theoretical mean of 1/lambda = 5 derived under the assumption that lambda equals 0.2, that the simulated variance is close to the theoretical variance of the sample mean, and that the distribution of the simulated means is approximately normal, as the CLT predicts.



Part 2

Now in the second portion of the project, we’re going to analyze the ToothGrowth data in the R datasets package.

Hypothesis Testing of the ToothGrowth dataset
Synopsis

The ToothGrowth dataset contains 60 observations of 3 variables on The Effect of Vitamin C on Tooth Growth in Guinea Pigs. We first use boxplots to visually explore the dataset, and then we conduct hypothesis tests of the null hypothesis that the mean of len (length of odontoblasts) is the same for two samples, where the two samples are the guinea pigs given supp == 'VC' (vitamin C) and those given supp == 'OJ' (orange juice).

Our exploratory charting below reveals that the means look different for the two supplement groups; however, when grouping by dose (dosage of 0.5, 1.0, or 2.0), our graphs reveal a similarity in means at the highest dose of 2.0. Our hypothesis testing largely matches this picture: at the 95% confidence level, the p-values for the two lower doses (dose == 0.5 and dose == 1.0) are well below 0.05, so we reject the null hypothesis there. The test comparing the two groups pooled across doses (supp == 'OJ' vs supp == 'VC') gives a p-value slightly above 0.05, and the test where dose == 2.0 gives a very large p-value, so in those two cases we fail to reject the null hypothesis.


data("ToothGrowth")
TG <- ToothGrowth
summary(TG)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
str(TG)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
# ggplot() comes from the ggplot2 package
library(ggplot2)
ggplot(TG, aes(x = supp, y = len, fill = supp)) +
    geom_boxplot(alpha = 0.88)

We can see from the above that the means for OJ vs VC as the supplement look quite different. If we were going on looks alone, we would reject our null hypothesis and conclude that the means are dissimilar.

ggplot(TG, aes(x = supp, y = len, fill = supp))+
    geom_boxplot(alpha = 0.76)+
    facet_grid(.~dose)

When we look at the means for OJ vs VC while considering dosage as a factor, the means appear dissimilar for the 0.5 and 1.0 dosage levels; however, there is much more similarity in the groups’ means at the higher dosage level of 2.0.
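
To back this visual impression with numbers, we can compute the group means directly; this optional sketch uses base R’s aggregate():

# Optional: mean tooth length by supplement and dose
aggregate(len ~ supp + dose, data = TG, FUN = mean)

The OJ means exceed the VC means at doses 0.5 and 1.0, while at dose 2.0 the two group means are nearly identical, matching the t-test estimates reported below.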

  1. Conduct a t-test on the dataset for the two sample sets grouped by supp.
test <- t.test(len ~ supp, data = TG, var.equal = FALSE, paired = FALSE, conf.level = 0.95)

testprint <- kable(data.frame(row.names = "By Supplement",
                               "t.stat" = test$statistic,
                                  "df" = test$parameter,
                                  "p-value" = test$p.value,
                                  "lower" = test$conf.int[1],
                                  "upper" = test$conf.int[2],
                                  "OJ mean" = test$estimate[1],
                                  "VC mean" = test$estimate[2]), caption = test$method)

testprint
Welch Two Sample t-test

|               |   t.stat |       df |   p.value |      lower |    upper |  OJ.mean |  VC.mean |
|:--------------|---------:|---------:|----------:|-----------:|---------:|---------:|---------:|
| By Supplement | 1.915268 | 55.30943 | 0.0606345 | -0.1710156 | 7.571016 | 20.66333 | 16.96333 |

In this case, we fail to reject the null hypothesis at the 5% significance level, as our p-value of 0.0606345 is greater than 0.05 and the 95% confidence interval contains 0.

test1 <- t.test(len ~ supp, data = TG[TG$dose == 0.5, ], var.equal = FALSE, paired = FALSE, conf.level = 0.95)
test2 <- t.test(len ~ supp, data = TG[TG$dose == 1.0, ], var.equal = FALSE, paired = FALSE, conf.level = 0.95)
test3 <- t.test(len ~ supp, data = TG[TG$dose == 2.0, ], var.equal = FALSE, paired = FALSE, conf.level = 0.95)

test1print <- kable(data.frame(row.names = "By Supplement, dose = 0.5",
                               "t.stat" = test1$statistic,
                                  "df" = test1$parameter,
                                  "p-value" = test1$p.value,
                                  "lower" = test1$conf.int[1],
                                  "upper" = test1$conf.int[2],
                                  "OJ mean" = test1$estimate[1],
                                  "VC mean" = test1$estimate[2]), caption = test$method)

test2print <- kable(data.frame(row.names = "By Supplement, dose = 1.0",
                               "t.stat" = test2$statistic,
                                  "df" = test2$parameter,
                                  "p-value" = test2$p.value,
                                  "lower" = test2$conf.int[1],
                                  "upper" = test2$conf.int[2],
                                  "OJ mean" = test2$estimate[1],
                                  "VC mean" = test2$estimate[2]), caption = test$method)

test3print <- kable(data.frame(row.names = "By Supplement, dose = 2.0",
                               "t.stat" = test3$statistic,
                                  "df" = test3$parameter,
                                  "p-value" = test3$p.value,
                                  "lower" = test3$conf.int[1],
                                  "upper" = test3$conf.int[2],
                                  "OJ mean" = test3$estimate[1],
                                  "VC mean" = test3$estimate[2]), caption = test$method)

Results

Welch Two Sample t-test

|                            |   t.stat |       df |   p.value |    lower |    upper | OJ.mean | VC.mean |
|:---------------------------|---------:|---------:|----------:|---------:|---------:|--------:|--------:|
| By Supplement, dose = 0.5  | 3.169733 | 14.96875 | 0.0063586 | 1.719057 | 8.780943 |   13.23 |    7.98 |

In this case, we reject the null hypothesis as our p-value of 0.0063586 is less than 0.05.

Welch Two Sample t-test

|                            |  t.stat |       df |   p.value |    lower |    upper | OJ.mean | VC.mean |
|:---------------------------|--------:|---------:|----------:|---------:|---------:|--------:|--------:|
| By Supplement, dose = 1.0  | 4.03277 | 15.35767 | 0.0010384 | 2.802148 | 9.057852 |    22.7 |   16.77 |

In this case, we reject the null hypothesis as our p-value of 0.0010384 is less than 0.05.

Welch Two Sample t-test

|                            |     t.stat |       df |   p.value |     lower |   upper | OJ.mean | VC.mean |
|:---------------------------|-----------:|---------:|----------:|----------:|--------:|--------:|--------:|
| By Supplement, dose = 2.0  | -0.0461361 | 14.03982 | 0.9638516 | -3.798071 | 3.63807 |   26.06 |   26.14 |

In this case, we FAIL to reject the null hypothesis as our p-value of 0.9638516 is greater than 0.05.


Conclusions and assumptions

We conclude that we reject the null hypothesis when comparing the sample means of the OJ and VC groups at the lower dosage levels (dose = 0.5 and dose = 1.0), which is consistent with our visual analysis above. When comparing the two supplement groups pooled across all doses, the p-value (0.0606345) falls just above 0.05, so at the 95% confidence level we fail to reject the null hypothesis, and we likewise fail to reject it in the case of dose = 2.0.

Our assumptions in these tests are that the groups (by supplement, and by supplement/dose) of individual subjects (guinea pigs) are independent and that their variances are not assumed equal (note we set var.equal = FALSE in the t-tests above).
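
If we wanted to sanity-check the unequal-variance assumption, we could optionally compare the sample variances of the groups directly:

# Optional: sample variance of len by supplement, and by supplement and dose
aggregate(len ~ supp, data = TG, FUN = var)
aggregate(len ~ supp + dose, data = TG, FUN = var)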