Part one of this project compares the exponential distribution and samples of averages from the exponential distribution. It also compares samples of averages from the exponential distribution to a normal distribution. Part two of this project provides statistical analysis and exploratory data analysis to a dataset of medication and tooth growth in rats.
Two distributions were created to show the relationship between the Exponential distribution and averages taken from that distribution. These distributions are also compared with theoretical means and variance of the two distributions. First, I created the sample distributions, then calculated the means and variances of the two distributions, compared with the theoretical means and variances. Displayed first are the mean, then theoretical mean and variance of the exponential distribution. Second are the mean, then theoretical mean and variance of the averages from exponential distribution.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
ex<-rexp(1000, .2)
m_ex<-mean(ex)
v_ex<-var(ex)
mns = NULL
for (i in 1 : 1000) mns = c(mns, mean(rexp(40, .2)))
df<-data.frame(mns)
m_ex<-mean(ex)
v_ex<-var(ex)
var_exp<-(1/.2)^2
mean_exp<-1/.2
m_ex
## [1] 4.920314
mean_exp
## [1] 5
v_ex
## [1] 24.45852
var_exp
## [1] 25
ms<-mean(df$mns)
vs<-var(df$mns)
var_theo<-(1/.2)^2/40
mean_theo<-1/.2
ms
## [1] 4.990145
mean_theo
## [1] 5
vs
## [1] 0.597559
var_theo
## [1] 0.625
Below are two graphs. The graph to the left is a simulation of an exponential distribution with a blue line at the sample mean and red line at the sample variance. The sample mean and sample variance are very close to the theoretical mean and variance of an exponential distribution as noted below as shown below the graphs. The graph to the right shows averages from the exponential distribution. THe mean, theoretical mean, variance and theoretical variance are displayed as ablines on the two graphs. On the exponential distribution, the sample mean is the blue verticle line and the expected mean is in yellow. The sample variance is represented as a red verticle line with the expected variance shown as the green line. The averages of samples from the exponential distribution have sample means in red and theoretical means in blue. The theoretical variance is plotted in green abline and the sample variance is represented as black lines on each side of the sample and theoretical means. As you can see, the theoretical value is very close to the sample values.
par(mfrow=c(1,2))
par(mar=c(4, 4, 2, 2))
hist(ex, col="magenta", breaks=50, main="Exponential Distribution", xlab="Random Numbers \n from Exponential Distribution", freq=FALSE)
abline(v=mean_exp, col="yellow", lwd=5)
abline(v=var_exp, col="green", lwd=3)
abline(v=m_ex, col="blue", lwd=3)
abline(v=v_ex, col="red", lwd=3)
hist(mns, freq=FALSE, col="orange", breaks=50, main="Averages from \n Exponential Distribution", xlab="Random Numbers from \n Averages of Exponential Dist.")
abline(v=ms, col="red", lwd=5)
abline(v=mean_theo, col="blue", lwd=3)
abline(v=ms-vs, col="black", lwd=3)
abline(v=ms+vs, col="black", lwd=3)
abline(v=ms+var_theo, col="green", lwd=3)
abline(v=ms-var_theo, col="green", lwd=3)
As you can see with the histogram of averages from the exponential distribution and the curve plotted over the distribution, the averages from the exponential distribution are very close to a normal distribution with mean 5 and variance .625. As you can see, taking averages from the exponential distribution creates a normal distribution. This may be because repeatedly taking data from the same distribution simulates taking data from any distribution due to the properties of the Central Limit Theorem. Random samples from any distribution may mimic the normal distribution.
par(mfrow=c(1,1))
par(mar=c(2,2,1,1))
ggplot(df, aes(x = mns)) +
geom_histogram(aes(y =..density..),
breaks = seq(0, 10, by=.1),
colour = "black",
fill = "red")+
labs(title="Normal Curve Overlaying Exponential Distributed Averages", x="Averages of Exponential Distribution")+theme_bw()+
stat_function(fun = dnorm, args = list(mean = mean(df$mns), sd = sd(df$mns)), col="blue", lwd=2)
As noted above, this dataset contains two different treatments on rats at particular doseage levels and measures the growth of the tooth for these rats. See the summary of the data below:
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
A t-test is performed comparing each of the dosage amounts to each of the others. First is a t-test comparing dose .5 with dose 1; second is dose 1 compared with dose 2, and third is dose 2 compared with dose .5. As you can see, the p-values are very low for each of the dosages compared with each other. This shows that there is a difference in the way dosage amounts affect tooth growth.
dose1<-ToothGrowth %>% filter(dose==.5 | dose==1)
dose1$dose<-as.factor(dose1$dose)
dose1$supp<-NULL
dose2<-ToothGrowth %>% filter(dose==1 | dose==2)
dose2$dose<-as.factor(dose1$dose)
dose2$supp<-NULL
dose3<-ToothGrowth %>% filter(dose==2 | dose==.5)
dose3$dose<-as.factor(dose1$dose)
dose3$supp<-NULL
t.test(len~dose, data=dose1)
##
## Welch Two Sample t-test
##
## data: len by dose
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.983781 -6.276219
## sample estimates:
## mean in group 0.5 mean in group 1
## 10.605 19.735
t.test(len~dose, data=dose2)
##
## Welch Two Sample t-test
##
## data: len by dose
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.996481 -3.733519
## sample estimates:
## mean in group 0.5 mean in group 1
## 19.735 26.100
t.test(len~dose, data=dose3)
##
## Welch Two Sample t-test
##
## data: len by dose
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -18.15617 -12.83383
## sample estimates:
## mean in group 0.5 mean in group 1
## 10.605 26.100
don1<-ToothGrowth %>% filter(dose==.5)
don2<-ToothGrowth %>% filter(dose==1)
don3<-ToothGrowth %>% filter(dose==2)
Included are histograms with group means shown in an abline. This shows there are significant differences in the means for each dosage group.
par(mfrow=c(2,2))
par(mar=c(4,4, 2, 2))
par(oma=c(0,0,2,0))
hist(don1$len, col=rgb(0,0,1,.5), breaks=20, xlab="Dose .5", main="")
abline(v=mean(don1$len), col="red", breaks=20)
## Warning in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...):
## "breaks" is not a graphical parameter
hist(don2$len, col=rgb(1,0,0,.5), breaks=20, xlab="Dose 1", main="")
abline(v=mean(don2$len), col="blue", lwd=2)
hist(don3$len, col= rgb(0,1,0,.5), breaks=20, xlab="Dose 2", main="")
abline(v=mean(don3$len), col="orange", lwd=2)
boxplot(len~dose, data=ToothGrowth, col=c("blue", "red", "green"))
title(main="Tooth Growth by Dosage", outer=TRUE, cex=2)
As you can see, there is some difference between the means. The p-value is .06, which is between .05 and .1. Depending on the person looking at this data, the p-value may be significant or not.
t.test(ToothGrowth$len~ToothGrowth$supp, data=ToothGrowth)
##
## Welch Two Sample t-test
##
## data: ToothGrowth$len by ToothGrowth$supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
As you can see there is some difference in the means, but the significance may or may not be big enough to show a difference in the treatments.
supp1<-ToothGrowth %>% filter(supp=="OJ")
supp1$dose<-NULL
supp2<-ToothGrowth %>% filter(supp=="VC")
supp2$dose<-NULL
par(mfrow=c(2, 2))
par(mar=c(4,3,2,2))
hist(supp1$len, col=rgb(0,0,1,.5), xlab="OJ", main="")
abline(v=mean(supp1$len), col="red")
hist(supp2$len, col=rgb(1, 0,0,.5), xlab="VC", main="")
abline(v=mean(supp2$len), col="blue")
boxplot(len~supp, data=ToothGrowth, col=c("blue", "red"))
title(main="Tooth Growth by Supplement Type", outer=TRUE)