Statistical Inference Project by Tim Norton

Part one of this project compares the exponential distribution and samples of averages from the exponential distribution. It also compares samples of averages from the exponential distribution to a normal distribution. Part two of this project provides statistical analysis and exploratory data analysis to a dataset of medication and tooth growth in rats.

Part One:

Two distributions were created to show the relationship between the Exponential distribution and averages taken from that distribution. These distributions are also compared with theoretical means and variance of the two distributions. First, I created the sample distributions, then calculated the means and variances of the two distributions, compared with the theoretical means and variances. Displayed first are the mean, then theoretical mean and variance of the exponential distribution. Second are the mean, then theoretical mean and variance of the averages from exponential distribution.

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(plyr)

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

ex<-rexp(1000, .2)
m_ex<-mean(ex)
v_ex<-var(ex)

mns = NULL
for (i in 1 : 1000) mns = c(mns, mean(rexp(40, .2)))
df<-data.frame(mns)

m_ex<-mean(ex)
v_ex<-var(ex)
var_exp<-(1/.2)^2
mean_exp<-1/.2

m_ex

## [1] 4.920314

mean_exp

## [1] 5

v_ex

## [1] 24.45852

var_exp

## [1] 25

ms<-mean(df$mns)
vs<-var(df$mns)
var_theo<-(1/.2)^2/40
mean_theo<-1/.2

ms

## [1] 4.990145

mean_theo

## [1] 5

vs

## [1] 0.597559

var_theo

## [1] 0.625

Below are two graphs. The graph to the left is a simulation of an exponential distribution with a blue line at the sample mean and red line at the sample variance. The sample mean and sample variance are very close to the theoretical mean and variance of an exponential distribution as noted below as shown below the graphs. The graph to the right shows averages from the exponential distribution. THe mean, theoretical mean, variance and theoretical variance are displayed as ablines on the two graphs. On the exponential distribution, the sample mean is the blue verticle line and the expected mean is in yellow. The sample variance is represented as a red verticle line with the expected variance shown as the green line. The averages of samples from the exponential distribution have sample means in red and theoretical means in blue. The theoretical variance is plotted in green abline and the sample variance is represented as black lines on each side of the sample and theoretical means. As you can see, the theoretical value is very close to the sample values.

    par(mfrow=c(1,2))
par(mar=c(4, 4, 2, 2))
hist(ex, col="magenta", breaks=50, main="Exponential Distribution", xlab="Random Numbers \n from Exponential Distribution", freq=FALSE)

abline(v=mean_exp, col="yellow", lwd=5)
abline(v=var_exp, col="green", lwd=3)
abline(v=m_ex, col="blue", lwd=3)
abline(v=v_ex, col="red", lwd=3)

hist(mns, freq=FALSE, col="orange", breaks=50, main="Averages from \n Exponential Distribution", xlab="Random Numbers from \n Averages of Exponential Dist.")

abline(v=ms, col="red", lwd=5)
abline(v=mean_theo, col="blue", lwd=3)
abline(v=ms-vs, col="black", lwd=3)
abline(v=ms+vs, col="black", lwd=3)
abline(v=ms+var_theo, col="green", lwd=3)
abline(v=ms-var_theo, col="green", lwd=3)

Comparison Between the Averages from the Exponential Distribution and a Normal Distribution.

As you can see with the histogram of averages from the exponential distribution and the curve plotted over the distribution, the averages from the exponential distribution are very close to a normal distribution with mean 5 and variance .625. As you can see, taking averages from the exponential distribution creates a normal distribution. This may be because repeatedly taking data from the same distribution simulates taking data from any distribution due to the properties of the Central Limit Theorem. Random samples from any distribution may mimic the normal distribution.

par(mfrow=c(1,1))
par(mar=c(2,2,1,1))


ggplot(df, aes(x = mns)) + 
    geom_histogram(aes(y =..density..),
                   breaks = seq(0, 10, by=.1), 
                   colour = "black", 
                   fill = "red")+
                   labs(title="Normal Curve Overlaying Exponential Distributed Averages", x="Averages of Exponential Distribution")+theme_bw()+
    stat_function(fun = dnorm, args = list(mean = mean(df$mns), sd = sd(df$mns)), col="blue", lwd=2)

Part Two: Exploratory Data Analysis

As noted above, this dataset contains two different treatments on rats at particular doseage levels and measures the growth of the tooth for these rats. See the summary of the data below:

    summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

Below is data comparing diffences in tooth growth based on dosage amounts and tooth growth.

A t-test is performed comparing each of the dosage amounts to each of the others. First is a t-test comparing dose .5 with dose 1; second is dose 1 compared with dose 2, and third is dose 2 compared with dose .5. As you can see, the p-values are very low for each of the dosages compared with each other. This shows that there is a difference in the way dosage amounts affect tooth growth.

dose1<-ToothGrowth %>% filter(dose==.5 | dose==1)
dose1$dose<-as.factor(dose1$dose)
dose1$supp<-NULL

dose2<-ToothGrowth %>% filter(dose==1 | dose==2)
dose2$dose<-as.factor(dose1$dose)
dose2$supp<-NULL

dose3<-ToothGrowth %>% filter(dose==2 | dose==.5)
dose3$dose<-as.factor(dose1$dose)
dose3$supp<-NULL

t.test(len~dose, data=dose1)

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean in group 0.5   mean in group 1 
##            10.605            19.735

t.test(len~dose, data=dose2)

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean in group 0.5   mean in group 1 
##            19.735            26.100

t.test(len~dose, data=dose3)

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.15617 -12.83383
## sample estimates:
## mean in group 0.5   mean in group 1 
##            10.605            26.100

don1<-ToothGrowth %>% filter(dose==.5)
don2<-ToothGrowth %>% filter(dose==1)
don3<-ToothGrowth %>% filter(dose==2)

Below are graphs comparing the different dosages.

Included are histograms with group means shown in an abline. This shows there are significant differences in the means for each dosage group.

par(mfrow=c(2,2))
par(mar=c(4,4, 2, 2))
par(oma=c(0,0,2,0))
hist(don1$len, col=rgb(0,0,1,.5), breaks=20, xlab="Dose .5", main="")
abline(v=mean(don1$len), col="red", breaks=20)

## Warning in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...):
## "breaks" is not a graphical parameter

hist(don2$len, col=rgb(1,0,0,.5), breaks=20, xlab="Dose 1",  main="")
abline(v=mean(don2$len), col="blue", lwd=2)

hist(don3$len, col= rgb(0,1,0,.5), breaks=20, xlab="Dose 2",  main="")
abline(v=mean(don3$len), col="orange", lwd=2)

boxplot(len~dose, data=ToothGrowth, col=c("blue", "red", "green"))

title(main="Tooth Growth by Dosage", outer=TRUE, cex=2)

Below is a t-test comparing the effect of two treatments on tooth growth.

As you can see, there is some difference between the means. The p-value is .06, which is between .05 and .1. Depending on the person looking at this data, the p-value may be significant or not.

t.test(ToothGrowth$len~ToothGrowth$supp, data=ToothGrowth)

## 
##  Welch Two Sample t-test
## 
## data:  ToothGrowth$len by ToothGrowth$supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

Blow are graphs showing the relationship between the affects of treatments on tooth growth

As you can see there is some difference in the means, but the significance may or may not be big enough to show a difference in the treatments.

supp1<-ToothGrowth %>% filter(supp=="OJ")
supp1$dose<-NULL
supp2<-ToothGrowth %>% filter(supp=="VC")
supp2$dose<-NULL
par(mfrow=c(2, 2))
par(mar=c(4,3,2,2))
hist(supp1$len, col=rgb(0,0,1,.5), xlab="OJ", main="")
abline(v=mean(supp1$len), col="red")
hist(supp2$len, col=rgb(1, 0,0,.5), xlab="VC", main="")
abline(v=mean(supp2$len), col="blue")
boxplot(len~supp, data=ToothGrowth, col=c("blue", "red"))
title(main="Tooth Growth by Supplement Type", outer=TRUE)

Statistical Inference Project of Coursera

Tim Norton

June 23, 2018