In this exercise we compare the distribution of means of exponential samples with the prediction of the Central Limit Theorem. Our sample distribution was created by drawing 40 exponential random values 1000 times and taking the mean of each draw. We were provided a rate parameter of Lambda = 0.2. Our comparisons are both mathematical (% difference between the theoretical and sample mean and variance) and visual (a plot of the two distributions overlaid atop each other with their respective means).
library(knitr)
library(ggplot2)
opts_chunk$set(echo=TRUE, cache=TRUE, message=FALSE, results='hold')
set.seed(2831) ##Set a seed to ensure reproducibility
lambda<-0.2 ##Pre-determined rate parameter
n<-40 ##Number of samples
I<-1000 ##Number of simulations
mean_theory<-1/lambda ##Mean of the exponential distribution
sd_theory<-1/lambda ##Std Dev. of the exponential distribution
var_theory<-sd_theory^2/n ##Theoretical variance of the mean of n exponentials
exp_sim_1000<-replicate(I,rexp(n,lambda)) ##Run & save 1K exp. distribution simulations (one per column)
mean_exp_sim<-apply(exp_sim_1000,2,mean) ##Save the mean of each column (one mean per simulation)
mean_sample<-mean(mean_exp_sim) ##Take the overall mean of the 1K simulated means
var_sample<-var(mean_exp_sim) ##Take the variance of the 1K simulated means
All comparisons were made between our sample (1000 simulations of the mean of 40 exponential draws) and the theoretical values: a mean of 1/Lambda and a variance of (1/Lambda)^2/n, where Lambda is the rate parameter and n = 40.
Our sample mean was calculated at 5.0188872
Our theoretical mean was calculated at 5
A negligible 0.377744% difference exists
Our sample variance was calculated at 0.6421645
Our theoretical variance was calculated at 0.625
A small 2.7463199% difference exists
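The percent differences quoted above can be sketched as follows. This is a minimal, self-contained version that re-runs the simulation; `pct_diff` is a hypothetical helper name that does not appear in the original analysis, and the exact figures depend on the seed and R's random number generator.

```r
# Hypothetical helper (not part of the original report):
# absolute percent difference of an observed value relative to theory
pct_diff <- function(observed, theoretical) {
  100 * abs(observed - theoretical) / theoretical
}

set.seed(2831)        # same seed as the analysis
lambda <- 0.2         # rate parameter
n <- 40               # exponentials per simulation
I <- 1000             # number of simulations

# One mean per simulation
sample_means <- replicate(I, mean(rexp(n, lambda)))

mean_theory <- 1 / lambda            # theoretical mean
var_theory <- (1 / lambda)^2 / n     # theoretical variance of the sample mean

pct_diff(mean(sample_means), mean_theory)
pct_diff(var(sample_means), var_theory)
```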
Provided both our sample size and number of simulations are large enough, the distribution of the sample means approaches a normal (Gaussian) distribution centered on the theoretical mean, with a variance equal to the theoretical variance of the sample mean. The code below overlays that normal curve on a histogram of the simulated means.
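As a worked check of the theoretical values, and assuming the 40 draws in each simulation are independent exponentials with rate Lambda = 0.2, the Central Limit Theorem approximation for the sample mean can be written as:

```latex
\bar{X}_n \;\overset{\text{approx.}}{\sim}\; \mathcal{N}\!\left(\frac{1}{\lambda},\ \frac{1}{\lambda^{2} n}\right),
\qquad
\frac{1}{\lambda} = \frac{1}{0.2} = 5,
\qquad
\frac{1}{\lambda^{2} n} = \frac{25}{40} = 0.625,
\qquad
\sqrt{0.625} \approx 0.79
```

The last value is the standard deviation of the sample mean, which is the scale a normal density curve needs.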
plotted_means<-data.frame(mean_exp_sim)
hp<-ggplot(plotted_means, aes(x=mean_exp_sim))
hp<-hp + geom_histogram(aes(y=..density..), colour="blue4", fill="dodgerblue2")
hp<-hp + geom_vline(aes(xintercept=mean_sample), colour="royalblue3", size=.75)
hp<-hp + geom_vline(aes(xintercept=mean_theory), colour="chartreuse1", size=.75, linetype=2)
hp<-hp + stat_function(fun=dnorm,args=list(mean=mean_theory,sd=sqrt(var_theory)), colour="chartreuse2",size=.75, linetype=2) ##dnorm takes a std. dev., so use the square root of the theoretical variance
hp<-hp+labs(title="Comparison of sampled to theoretical exponential distribution data", x="40 Exponential distribution samples - overall sample mean in blue & theoretical mean in green",y="distribution density")
hp
In this exercise we explore the ToothGrowth data set, which measures the effect of vitamin C on tooth growth in guinea pigs. The data set's R documentation can be opened by typing ?ToothGrowth at the R console. In our analysis we summarize the data set and look for differences in tooth growth between the two supplement types, both overall and at each dosage.
Our data set has the following dimensions: 60, 3
Our 3 columns contain: len, supp, dose - representing the length of the tooth, the supplement type, and the dosage
The classes of the columns are as follows: len (length) is numeric, supp (supplement) is a factor, and dose (dosage) is numeric
The first row of the data set shows: 4.2, VC, 0.5
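The facts listed above can be reproduced with a few base R calls. This is a minimal sketch; the `tg_*` variable names are illustrative and are not taken from the original report.

```r
# ToothGrowth ships with base R in the datasets package
data(ToothGrowth)

tg_dim <- dim(ToothGrowth)                # rows, columns
tg_names <- names(ToothGrowth)            # column names
tg_classes <- sapply(ToothGrowth, class)  # class of each column
tg_first <- ToothGrowth[1, ]              # first row

print(tg_dim)
print(tg_names)
print(tg_classes)
print(tg_first)
```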
A full summary of the data is below:
kable(summary(ToothGrowth))
| len | supp | dose |
|---|---|---|
| Min. : 4.20 | OJ:30 | Min. :0.500 |
| 1st Qu.:13.07 | VC:30 | 1st Qu.:0.500 |
| Median :19.25 | NA | Median :1.000 |
| Mean :18.81 | NA | Mean :1.167 |
| 3rd Qu.:25.27 | NA | 3rd Qu.:2.000 |
| Max. :33.90 | NA | Max. :2.000 |
tgp<-ggplot(ToothGrowth, aes(x=as.factor(dose),y=len,fill=supp))
tgp<-tgp + geom_bar(stat="identity")
tgp<-tgp + facet_grid(.~supp)
tgp<-tgp + labs(title="ToothGrowth Length by Dosage", x="Dosage (mg)", y="Tooth Length (mm)")
tgp
Plotting the data shows us what we expected. We have two supplements, each with a range of tooth lengths. It shows us that while both supplements have nearly the same maximum, our VC supplement has a smaller minimum. We can clearly see a pattern: for both supplements, tooth length increases with dosage.
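To put numbers on the pattern described above, the mean tooth length for each supplement and dosage can be tabulated. This is a sketch using base R's aggregate; the name `mean_len` is illustrative and the table is not part of the original report.

```r
# Mean tooth length for every supplement/dosage combination
mean_len <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)

# Order by supplement, then dosage, so the trend within each supplement is easy to read
mean_len <- mean_len[order(mean_len$supp, mean_len$dose), ]
mean_len
```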
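We're going to take a look at the p-values and 95% confidence intervals for two scenarios. In the first, we'll check for differences between the supplement groups (agnostic of dosage), assuming both equal and unequal variances. In the second, we'll repeat the comparison within each dosage level, assuming unequal variances.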
t.test_all_unequal<-t.test(len~supp,data=ToothGrowth,paired=FALSE,var.equal=FALSE)
t.test_all_equal<-t.test(len~supp,data=ToothGrowth,paired=FALSE,var.equal=TRUE)
results_all<-matrix(c(t.test_all_equal$conf.int,t.test_all_equal$p.value,t.test_all_unequal$conf.int,t.test_all_unequal$p.value),ncol=2,nrow=3)
rownames(results_all)<-c("Lower Bound 95% Confidence Interval","Upper Bound 95% Confidence Interval","P-Value")
colnames(results_all)<-c("var.equal=T","var.equal=F")
kable(results_all)
| | var.equal=T | var.equal=F |
|---|---|---|
| Lower Bound 95% Confidence Interval | -0.1670064 | -0.1710156 |
| Upper Bound 95% Confidence Interval | 7.5670064 | 7.5710156 |
| P-Value | 0.0603934 | 0.0606345 |
Conclusion: Regardless of an assumed equal or unequal variance, we end up with a 95% confidence interval that contains zero. In addition, our p-value exceeds the 0.05 threshold. As such, we fail to reject the null hypothesis that the two supplement types have the same mean effect on tooth growth.
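The decision rule used in this conclusion can be sketched as follows. The names `ci_contains_zero` and `reject_null` are illustrative, and 0.05 is the threshold quoted in the text.

```r
# Welch two-sample t-test of tooth length by supplement, ignoring dosage
res <- t.test(len ~ supp, data = ToothGrowth, paired = FALSE, var.equal = FALSE)

ci <- res$conf.int                              # 95% confidence interval for the difference in means
ci_contains_zero <- ci[1] < 0 && ci[2] > 0      # TRUE when zero lies inside the interval
reject_null <- res$p.value < 0.05               # TRUE when the p-value is below the threshold

ci_contains_zero
reject_null
```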
t.test_0.5<-t.test(len~supp,data=ToothGrowth[ToothGrowth$dose == .5, ],paired=FALSE,var.equal=FALSE)
t.test_1.0<-t.test(len~supp,data=ToothGrowth[ToothGrowth$dose == 1, ],paired=FALSE,var.equal=FALSE)
t.test_2.0<-t.test(len~supp,data=ToothGrowth[ToothGrowth$dose == 2, ],paired=FALSE,var.equal=FALSE)
results_by_dose<-matrix(c(t.test_0.5$conf.int,t.test_0.5$p.value,t.test_1.0$conf.int,t.test_1.0$p.value,t.test_2.0$conf.int,t.test_2.0$p.value),ncol=3,nrow=3)
rownames(results_by_dose)<-c("Lower Bound 95% Confidence Interval","Upper Bound 95% Confidence Interval","P-Value")
colnames(results_by_dose)<-c("Dosage = 0.5","Dosage = 1.0", "Dosage = 2.0")
kable(results_by_dose)
| | Dosage = 0.5 | Dosage = 1.0 | Dosage = 2.0 |
|---|---|---|---|
| Lower Bound 95% Confidence Interval | 1.7190573 | 2.8021482 | -3.7980705 |
| Upper Bound 95% Confidence Interval | 8.7809427 | 9.0578518 | 3.6380705 |
| P-Value | 0.0063586 | 0.0010384 | 0.9638516 |
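The three per-dose tests can also be written as a single loop over the dosage levels. This is an alternative sketch, not the code used to produce the table; `doses` and `by_dose` are illustrative names.

```r
# Dosage levels present in the data
doses <- sort(unique(ToothGrowth$dose))

# One Welch t-test per dosage; each column holds the CI bounds and p-value
by_dose <- sapply(doses, function(d) {
  res <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == d, ],
                paired = FALSE, var.equal = FALSE)
  c(lower = res$conf.int[1], upper = res$conf.int[2], p.value = res$p.value)
})
colnames(by_dose) <- paste("Dosage =", format(doses, nsmall = 1))
by_dose
```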
Looking at the data agnostic of dosage, a confidence interval that contains zero and a p-value > 0.05 both point to a failure to reject the null hypothesis, which is that there is no difference between the mean tooth growth of the two supplements. While tooth growth clearly increases with dosage, which supplement is more effective overall did not emerge from these data. When we split the data by dosage, the 0.5 and 1.0 dosages each show a low p-value and a confidence interval entirely above zero, while the 2.0 dosage shows no difference (p-value of about 0.96). Because the pooled comparison fails to reject the null, I retain that general conclusion, while noting that the lower dosages point to a difference between supplements.
Assumptions:
1.) Independent population data
2.) Unequal variance (even though I checked the equal-variance case just in case)
3.) Normal distributions