Statistical Inference Project

Project Part 1: Simulation Exercise

In this exercise we compare the distribution of sample means drawn from an exponential distribution with the normal distribution predicted by the Central Limit Theorem. Our sample distribution was created by drawing 40 exponential random values and taking their mean, repeated 1000 times. We were provided a rate parameter of Lambda = 0.2. Our comparisons will be both mathematical (% difference between the theoretical and sample mean and variance) and visual (a plot of the two distributions overlaid atop each other with their respective means).

library(knitr)
library(ggplot2)
opts_chunk$set(echo=TRUE, cache=TRUE, message=FALSE, results='hold')

set.seed(2831)                             ##Set a seed to ensure reproducibility
lambda<-0.2                                ##Pre-determined rate parameter
n<-40                                      ##Sample size (exponentials per simulation)
I<-1000                                    ##Number of simulations
mean_theory<-1/lambda                      ##Mean of the exponential distribution
sd_theory<-1/lambda                        ##Std Dev. of the exponential distribution
var_theory<-sd_theory^2/n                  ##Theoretical variance of the mean of n exponentials

exp_sim_1000<-replicate(I,rexp(n,lambda))  ##n x I matrix: one simulation per column
mean_exp_sim<-apply(exp_sim_1000,2,mean)   ##Mean of each column (one mean per simulation)
mean_sample<-mean(mean_exp_sim)            ##Overall mean of the 1000 simulated means
var_sample<-var(mean_exp_sim)              ##Variance of the 1000 simulated means

Comparison Framework

All comparisons were made between our sample (1000 simulated means, each taken over 40 draws from an exponential distribution) and the theoretical values: a mean of 1/Lambda and a variance of the sample mean of (1/Lambda)^2/n, where Lambda is the rate parameter and n = 40.
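As a short worked derivation of the theoretical values used in this comparison, consider independent exponential draws with rate Lambda. The symbols X_i (a single draw), \bar{X} (the mean of n draws), \mu and \sigma^2 are notation introduced here for illustration and do not appear in the original code.

$$
\mu = E[X_i] = \frac{1}{\lambda} = \frac{1}{0.2} = 5,
\qquad
\sigma^2 = \mathrm{Var}(X_i) = \frac{1}{\lambda^2} = 25
$$

$$
E[\bar{X}] = \mu = 5,
\qquad
\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n} = \frac{25}{40} = 0.625
$$

These are the quantities stored in mean_theory and var_theory.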

Mean Comparison

Our sample mean was calculated at 5.0188872
Our theoretical mean was calculated at 5
A small difference of 0.377744% exists

Variance Comparison

Our sample variance was calculated at 0.6421645
Our theoretical variance was calculated at 0.625
A small difference of 2.7463199% exists
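The percent differences reported above can be reproduced with a small helper. This is a minimal sketch: pct_diff is a hypothetical function name introduced here, and the inputs are the rounded values printed in this report rather than the full-precision objects from the simulation.

```r
# Hypothetical helper (not part of the original code): percent difference
# of a sample value relative to its theoretical counterpart.
pct_diff <- function(sample_value, theory_value) {
  100 * abs(sample_value - theory_value) / theory_value
}

# Rounded values as printed in the report
mean_pct <- pct_diff(5.0188872, 5)
var_pct  <- pct_diff(0.6421645, 0.625)

print(c(mean = mean_pct, variance = var_pct))
```

In the knitted report the same calculation would presumably use mean_sample, mean_theory, var_sample and var_theory directly.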

Conclusions of mean and variance comparisons

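Provided both the sample size and the number of simulations are large enough, the distribution of the sample means approaches a normal (Gaussian) distribution. That normal distribution is centered at the theoretical mean, with variance equal to the theoretical variance of the sample mean, which is why our sample mean and variance land so close to the theoretical values.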

Show that the distribution is approximately normal

plotted_means<-data.frame(mean_exp_sim)
hp<-ggplot(plotted_means, aes(x=mean_exp_sim))
##Histogram of the simulated means on a density scale
hp<-hp + geom_histogram(aes(y=..density..), colour="blue4", fill="dodgerblue2")
##Solid blue line: sample mean; dashed green line: theoretical mean
hp<-hp + geom_vline(aes(xintercept=mean_sample), colour="royalblue3", size=.75)
hp<-hp + geom_vline(aes(xintercept=mean_theory), colour="chartreuse1", size=.75, linetype=2)
##Normal curve from the CLT; dnorm takes a std. dev., so use sqrt(var_theory) rather than the variance 0.625
hp<-hp + stat_function(fun=dnorm,args=list(mean=mean_theory,sd=sqrt(var_theory)), colour="chartreuse2",size=.75, linetype=2)
hp<-hp+labs(title="Comparison of sampled means to the theoretical normal distribution", x="Means of 40 exponentials - overall sample mean in blue & theoretical mean in green",y="distribution density")
hp
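Beyond the visual comparison, one rough numerical check of normality is to standardize the simulated means and see what share falls within the central 95% band of a standard normal. The sketch below is an illustration under stated assumptions, not part of the original analysis: the names sim_means, z and coverage are introduced here, and the simulation is rerun so the block is self-contained.

```r
set.seed(2831)   # same seed as the report; other seeds should behave similarly
lambda <- 0.2
n <- 40
I <- 1000

# One mean of n exponentials per simulation
sim_means <- replicate(I, mean(rexp(n, lambda)))

# Standardize with the theoretical mean and std. dev. of the sample mean
z <- (sim_means - 1 / lambda) / sqrt((1 / lambda)^2 / n)

# Share of standardized means inside the central 95% normal band;
# a value near 0.95 is consistent with approximate normality
coverage <- mean(abs(z) <= qnorm(0.975))
print(coverage)

# Normal Q-Q plot: points close to the line suggest approximate normality
qqnorm(z)
qqline(z)
```

Because the exponential distribution is right-skewed, some departure from the line in the upper tail would not be surprising at n = 40.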

Project Part 2: ToothGrowth Data Set Exploration and Analysis

In this exercise we want to explore the ToothGrowth data set, which measures the effects of vitamin C on tooth growth in guinea pigs. The data set's R documentation can be opened by typing ?ToothGrowth at the R console. In our analysis we will summarize the data set and compare tooth growth between the two supplement types, both overall and at each dosage.

Summary of the ToothGrowth Data Set

Our data set has the following dimensions: 60, 3
Our 3 columns contain: len, supp, dose - representing the length of the tooth, the supplement type, and the dosage
Our classes of each column are as follows: len (length) is numeric, supp (supplement) is a factor, and dose (dosage) is numeric
The first row of the data set shows: 4.2, VC, 0.5
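The facts listed above can be pulled directly from the data set with base R. This is a minimal sketch; the variable names tg_dim, tg_names, tg_classes and tg_first are introduced here for illustration.

```r
data(ToothGrowth)                          # ships with R's datasets package

tg_dim     <- dim(ToothGrowth)             # rows and columns
tg_names   <- names(ToothGrowth)           # column names
tg_classes <- sapply(ToothGrowth, class)   # class of each column
tg_first   <- ToothGrowth[1, ]             # first observation

print(tg_dim)
print(tg_names)
print(tg_classes)
print(tg_first)
```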

A full summary of the data is below:

   kable(summary(ToothGrowth))

len	supp	dose
Min. : 4.20	OJ:30	Min. :0.500
1st Qu.:13.07	VC:30	1st Qu.:0.500
Median :19.25	NA	Median :1.000
Mean :18.81	NA	Mean :1.167
3rd Qu.:25.27	NA	3rd Qu.:2.000
Max. :33.90	NA	Max. :2.000

Investigation of the effect of supplement type, and of supplement type by dosage, on tooth growth

Let's investigate how tooth length relates to dosage for each of the two supplements.

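##One panel per supplement; dose is treated as a category on the x axis
tgp<-ggplot(ToothGrowth, aes(x=as.factor(dose),y=len,fill=supp))
##stat="identity" stacks the observations at each dose, so bar height is their summed length
tgp<-tgp + geom_bar(stat="identity")
tgp<-tgp + facet_grid(.~supp)
tgp<-tgp + labs(title="ToothGrowth Length by Dosage", x="Dosage (mg/day)", y="Total Tooth Length (mm)")
tgp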

Plotting the data shows us what we expected. We have two supplements, each with a range of tooth lengths. While both supplements have nearly the same maximum, the VC supplement has a smaller minimum. For both supplements, tooth length increases with dosage.
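The pattern described above can also be checked numerically by computing the mean tooth length for each supplement and dosage. This is a small base R sketch; group_means is a name introduced here for illustration.

```r
# Mean tooth length for each supplement/dose combination
group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
print(group_means)
```

Reading down the dose column within each supplement shows whether the mean length rises with dosage.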

Now let's use t.test to check whether the mean tooth length differs between the two supplement groups, using 95% confidence intervals

We're going to look at the p-values and 95% confidence intervals for two scenarios. In the first, we test for a difference between the supplement groups (agnostic of dosage), assuming equal and then unequal variances.
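For reference, the two-sample test described above can be written out as follows. The notation (\mu for population means, \bar{x} for sample means, s^2 for sample variances, n for group sizes) is introduced here and is not part of the original report.

$$
H_0: \mu_{OJ} = \mu_{VC}
\qquad
H_a: \mu_{OJ} \neq \mu_{VC}
$$

With unequal variances (Welch), the test statistic is

$$
t = \frac{\bar{x}_{OJ} - \bar{x}_{VC}}{\sqrt{\dfrac{s_{OJ}^2}{n_{OJ}} + \dfrac{s_{VC}^2}{n_{VC}}}}
$$

With equal variances assumed, the two sample variances are replaced by the pooled variance

$$
s_p^2 = \frac{(n_{OJ}-1)\,s_{OJ}^2 + (n_{VC}-1)\,s_{VC}^2}{n_{OJ} + n_{VC} - 2},
\qquad
t = \frac{\bar{x}_{OJ} - \bar{x}_{VC}}{s_p \sqrt{\dfrac{1}{n_{OJ}} + \dfrac{1}{n_{VC}}}}
$$

A 95% confidence interval for \mu_{OJ} - \mu_{VC} that contains zero corresponds to failing to reject H_0 at the 0.05 level.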

Table of t.test results comparing supplement to tooth length

##paired=FALSE is the default and is omitted; recent R versions reject 'paired' in the formula method
t.test_all_unequal<-t.test(len~supp,data=ToothGrowth,var.equal=FALSE)
t.test_all_equal<-t.test(len~supp,data=ToothGrowth,var.equal=TRUE)
results_all<-matrix(c(t.test_all_equal$conf.int,t.test_all_equal$p.value,t.test_all_unequal$conf.int,t.test_all_unequal$p.value),ncol=2,nrow=3)
rownames(results_all)<-c("Lower Bound 95% Confidence Interval","Upper Bound 95% Confidence Interval","P-Value")
colnames(results_all)<-c("var.equal=T","var.equal=F")
kable(results_all)

	var.equal=T	var.equal=F
Lower Bound 95% Confidence Interval	-0.1670064	-0.1710156
Upper Bound 95% Confidence Interval	7.5670064	7.5710156
P-Value	0.0603934	0.0606345

Conclusion: Regardless of an assumed equal or unequal variance, we end up with a 95% confidence interval that captures zero. In addition, our P-Value exceeds the 0.05 limit. As such, we cannot reject the null hypothesis that the two supplement types have the same mean effect on tooth growth.

Let's run the same t.tests, again at the 95% confidence level, but this time taking the dosage into account. Question: although I cannot reject the null hypothesis that the OJ and VC supplements affect tooth growth equally, maybe I CAN reject it after looking at specific dosage amounts?

Table of t.test results comparing supplement to tooth length by dosage. Assumed unequal variances.

##One Welch t.test per dosage; paired=FALSE is the default and is omitted
t.test_0.5<-t.test(len~supp,data=ToothGrowth[ToothGrowth$dose == .5, ],var.equal=FALSE)
t.test_1.0<-t.test(len~supp,data=ToothGrowth[ToothGrowth$dose == 1, ],var.equal=FALSE)
t.test_2.0<-t.test(len~supp,data=ToothGrowth[ToothGrowth$dose == 2, ],var.equal=FALSE)

results_by_dose<-matrix(c(t.test_0.5$conf.int,t.test_0.5$p.value,t.test_1.0$conf.int,t.test_1.0$p.value,t.test_2.0$conf.int,t.test_2.0$p.value),ncol=3,nrow=3)
rownames(results_by_dose)<-c("Lower Bound 95% Confidence Interval","Upper Bound 95% Confidence Interval","P-Value")
colnames(results_by_dose)<-c("Dosage = 0.5","Dosage = 1.0", "Dosage = 2.0")
kable(results_by_dose)

	Dosage = 0.5	Dosage = 1.0	Dosage = 2.0
Lower Bound 95% Confidence Interval	1.7190573	2.8021482	-3.7980705
Upper Bound 95% Confidence Interval	8.7809427	9.0578518	3.6380705
P-Value	0.0063586	0.0010384	0.9638516
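The three per-dosage tests above repeat the same call, so they could also be produced in one pass. The sketch below is an alternative formulation, not the report's own code: doses and by_dose are names introduced here, and paired is left at its default (unpaired).

```r
# Unique dosage levels in ascending order
doses <- sort(unique(ToothGrowth$dose))

# Welch t-test of len by supp within each dosage level;
# each column of the result holds one dosage's interval and p-value
by_dose <- sapply(doses, function(d) {
  tt <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == d, ],
               var.equal = FALSE)
  c(lower = tt$conf.int[1], upper = tt$conf.int[2], p_value = tt$p.value)
})
colnames(by_dose) <- paste("Dosage =", doses)

print(by_dose)
```

The interval is for the mean of the OJ group minus the mean of the VC group, so an interval entirely above zero indicates a larger mean length under OJ.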

CONCLUSION:

Looking at the data agnostic of dosage, our confidence interval containing zero and a P-Value > 0.05 both point to a failure to reject the null hypothesis, which is that there is no difference between the mean effects of the two supplements on tooth growth. While tooth growth clearly increases with dosage, which supplement is more effective does not show up in the pooled data. When we split the data by dosage the picture changes: at dosages of 0.5 and 1.0 the P-Values are low (0.0064 and 0.0010) and the confidence intervals lie entirely above zero, so we reject the null at those dosages, with OJ showing the larger mean tooth length. At a dosage of 2.0 the confidence interval contains zero and the P-Value is 0.96, so we fail to reject the null there.

ASSUMPTIONS:

1.) Independent population data
2.) Unequal variance (even though I checked the equal-variance case as well)
3.) Normal distributions
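The normality assumption listed above was not tested in the report. One possible way to probe it, shown here only as a sketch, is a Shapiro-Wilk test on tooth length within each supplement and dosage group. The choice of this test and the name normality_p are assumptions introduced here, and with only 10 animals per group the test has limited power.

```r
# Shapiro-Wilk p-value for tooth length within each supplement/dose group
normality_p <- sapply(
  split(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose)),
  function(x) shapiro.test(x)$p.value
)

# Small p-values would suggest a departure from normality in that group
print(round(normality_p, 4))
```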