Statistical Inference Course Project

This Final Course Project Consists of Two Parts:

A Simulation Exercise
Basic Inferential Data Analysis

Part 1: A Simulation Exercise

Overview

Part 1 of this project will investigate the exponential distribution in R and compare the distribution of its sample averages with what the Central Limit Theorem predicts. The exponential distribution will be simulated in R with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is \(\frac{1}{\lambda}\) and the standard deviation is also \(\frac{1}{\lambda}\). The rate will be set to \(\lambda = 0.2\) for all of the simulations. The distribution of averages of 40 exponentials will be investigated through simulation, with accompanying explanatory text.
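As a worked reference for the comparisons that follow, the standard results for the average of \(n\) independent exponentials can be written out with \(\lambda = 0.2\) and \(n = 40\). The symbols \(\bar{X}_n\) and \(\sigma\) are notation introduced here for illustration and do not appear in the project text.

```latex
\[
\begin{aligned}
E[\bar{X}_n] &= \frac{1}{\lambda} = \frac{1}{0.2} = 5, \\
\operatorname{Var}(\bar{X}_n) &= \frac{\sigma^2}{n} = \frac{1}{\lambda^2 n} = \frac{25}{40} = 0.625, \\
\operatorname{SD}(\bar{X}_n) &= \frac{1}{\lambda \sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.791 .
\end{aligned}
\]
```

The last line is the standard error of the mean. It is the square root of the variance on the second line, so the two should not be used interchangeably.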

1. Simulation

First, the seed will be set for reproducibility and the parameters will be set for the sample size, Lambda, and the number of simulations. Second, a NULL variable called “Exponentials” will be created and filled inside a for loop, which performs the simulation and builds a large collection of averages of 40 exponentials. Each average is computed from rexp(n, Lambda), as indicated in the overview.

set.seed(12345)
n<-40
Lambda<-0.2
Simulations<-1000

Exponentials <- NULL
for(i in 1:Simulations) {
  Exponentials <- c(Exponentials, mean(rexp(n, Lambda)))
}
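The loop above grows the vector by one element per iteration. As a possible alternative, the same collection can be sketched with replicate(). The name ExponentialsAlt is hypothetical and used only for this illustration. Because the sketch makes the same rexp() calls in the same order, it should reproduce the loop's values for a given seed.

```r
# Parameters copied from the simulation above.
set.seed(12345)
n <- 40
Lambda <- 0.2
Simulations <- 1000

# replicate() evaluates the expression once per simulation and
# returns the results as a numeric vector of length Simulations.
ExponentialsAlt <- replicate(Simulations, mean(rexp(n, Lambda)))

length(ExponentialsAlt)
```

This form avoids repeated copying of the result vector, which matters little at 1000 simulations but becomes noticeable for much larger runs.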

2. Sample Versus Theoretical Mean and Variance

Sample Versus Theoretical Mean: As explained in the overview above, the theoretical mean is \(\frac{1}{\lambda}\), and the sample mean is the mean of the variable “Exponentials” created by the previous for loop simulation.
Sample Versus Theoretical Variance: The theoretical variance of an average of \(n\) exponentials is \(\frac{(1/\lambda)^2}{n}\), which is the square of the standard error \(\frac{1/\lambda}{\sqrt{n}}\). The sample variance is the variance of the variable “Exponentials” created by the previous for loop simulation.

The absolute value of the differences is shown in the table below.

TheoMean<-1/Lambda; SampleMean<-mean(Exponentials)
MeanDiff<-abs(SampleMean-TheoMean)
CollMeans<-c(TheoMean, SampleMean, MeanDiff)

# Variance of the mean is sigma^2/n, not the standard error sigma/sqrt(n)
TheoVar<-(1/Lambda)^2/n; SampleVar<-var(Exponentials)
VarDiff<-abs(SampleVar-TheoVar)
CollVars<-c(TheoVar, SampleVar, VarDiff)

ExpoTable<-data.frame(CollMeans, CollVars)
names(ExpoTable)<-c("Mean","Variance")
row.names(ExpoTable)<-c("Theoretical","Sample","Absolute Difference")
ExpoTable

There is very little difference between the sample and theoretical values. The mean difference is only 0.028028. The theoretical variance is 0.625 and the sample variance in this run is about 0.595, a difference of about 0.03.

3. Plot Normal Distribution

The purpose of this final section of “Part 1: A Simulation Exercise” is to show that the distribution of a large collection of averages of 40 exponentials is approximately normal. The focus is on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.

set.seed(12345)
# Draw 1000 individual exponentials (not uniforms) for the comparison
Random <- rexp(1000, Lambda)

par(mfrow=c(1,2))
hist(Random, breaks = n, freq = F, 
     main = "Distribution of \n 1000 Random Exponentials",
     xlab = "Values")
  lines(density(Random), col = 3)
  legend("topright", lty = 1, col = 3, legend = "Density Line")
hist(Exponentials,breaks = n,freq = F, 
     main = "Distribution of \n Means of 40 Exponentials",
     xlab = "Means")
  lines(density(Exponentials), col = 3)
  legend("topright", lty = 1, col = 3, legend = "Density Line")

The two distributions are quite different. The individual random exponentials are strongly right-skewed, with most values near zero and a long right tail, while the averages of 40 exponentials are approximately normal and centered near the theoretical mean of 5.
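The visual comparison can be backed by a rough numeric check. The sketch below standardizes the simulated means and measures how many fall within 1.96 standard errors of the theoretical mean, which would be close to 95% for a normal distribution. The names StdError, ZScores and Coverage are hypothetical, and the means are re-created with replicate() so the block runs on its own, on the assumption that this matches the for loop used earlier.

```r
# Re-create the simulated means so this block is self-contained.
set.seed(12345)
n <- 40
Lambda <- 0.2
Exponentials <- replicate(1000, mean(rexp(n, Lambda)))

# Standardize each mean with the theoretical mean and standard error.
TheoMean <- 1 / Lambda
StdError <- (1 / Lambda) / sqrt(n)
ZScores <- (Exponentials - TheoMean) / StdError

# For a standard normal, about 95% of values lie within +/- 1.96.
Coverage <- mean(abs(ZScores) < 1.96)
Coverage
```

A coverage near 0.95 is consistent with approximate normality, although it is only one summary. A normal Q-Q plot via qqnorm(Exponentials) would show more detail in the tails.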

Part 2: Basic Inferential Data Analysis

1. Load Data & Perform Basic EDA

Our provided dataset records the “Effect of Vitamin C on Tooth Growth in Guinea Pigs.” The following key to the dataset's variables is provided on the Department of Mathematics of ETH Zurich website:^[1]

[,1] len = Tooth length
[,2] supp = Supplement type (VC or OJ)
[,3] dose = Dose in milligrams/day

library(datasets)
str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

2. Basic Data Summary

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

boxplot(len~supp*dose, data=ToothGrowth, 
  col=(c("orange","purple")),
  main="Tooth Growth", xlab="Supplement and Dose",ylab="Tooth Length")
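The boxplot shows six supplement and dose groups. As a numeric companion, the group means can be tabulated with aggregate(). The name GroupMeans is hypothetical and used only for this sketch.

```r
library(datasets)

# Mean tooth length for each supplement/dose combination.
GroupMeans <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
GroupMeans
```

Each row of the result corresponds to one box in the plot, which makes it easier to read off where OJ and VC differ.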

3. Compare Tooth Growth By Supp & Dose

After EDA, the next step is to use hypothesis testing to compare tooth growth by supp and dose. At each dose, our \(H_0\) states that the supplement type has no effect on mean tooth length, \(H_0: \mu_{OJ} - \mu_{VC} = 0\). Our \(H_1\) states that the supplement type does have an effect, \(H_1: \mu_{OJ} - \mu_{VC} \neq 0\).

tg<-ToothGrowth
d.5<-subset(tg,dose==0.5); d1<-subset(tg,dose==1);d2<-subset(tg,dose==2)
HT1<-t.test(d.5$len~d.5$supp);HT2<-t.test(d1$len~d1$supp)
HT3<-t.test(d2$len~d2$supp)

PVALUES<-c(HT1$p.value,HT2$p.value,HT3$p.value)
# Reject the null hypothesis at the 0.05 significance level
Reject_NullHypothesis<-PVALUES<0.05
HypothesisTable<-data.frame(PVALUES,Reject_NullHypothesis)
row.names(HypothesisTable)<-c("0.5 mg/day","1 mg/day","2 mg/day")
HypothesisTable

t.test(tg$len~tg$supp)

## 
##  Welch Two Sample t-test
## 
## data:  tg$len by tg$supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333
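The tests above compare the two supplements. To also compare tooth growth by dose, as the section title suggests, one possible sketch runs Welch t-tests between pairs of dose levels with both supplements pooled. The helper CompareDoses and the vector DosePValues are hypothetical names, and no correction for multiple comparisons is applied here.

```r
library(datasets)
tg <- ToothGrowth

# Welch t-test of tooth length between two dose levels, pooling supplements.
CompareDoses <- function(low, high) {
  pair <- subset(tg, dose %in% c(low, high))
  t.test(len ~ dose, data = pair)$p.value
}

DosePValues <- c("0.5 vs 1" = CompareDoses(0.5, 1),
                 "1 vs 2"   = CompareDoses(1, 2),
                 "0.5 vs 2" = CompareDoses(0.5, 2))

# TRUE where the null hypothesis of equal means is rejected at 0.05.
DosePValues < 0.05
```

With three pairwise tests, a stricter threshold such as a Bonferroni-adjusted 0.05/3 may be more appropriate than 0.05.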

4. Conclusions

Per the above hypothesis table, the null hypothesis is rejected at 2 of the 3 doses because the p-value is below 0.05. At 0.5 mg and 1 mg per day, OJ had a significantly greater effect on tooth length than VC. At 2 mg per day there was no significant difference between the two supplements. Pooled across all doses, the Welch test gives a p-value of 0.06063 and a 95% confidence interval of -0.171 to 7.571, which contains zero, so the overall difference between OJ and VC is not significant at the 0.05 level. In summary, the advantage of OJ over VC does not hold across all of the data, but it is significant at 0.5 mg and 1 mg per day.

Works Cited

^[1] McNeil, D. R. (1977). Interactive Data Analysis. New York: Wiley.
Crampton, E. W. (1947). The growth of the odontoblast of the incisor teeth as a criterion of vitamin C intake of the guinea pig. The Journal of Nutrition, 33(5), 491–504. doi: 10.1093/jn/33.5.491.