Peer-graded Assignment: Statistical Inference Course Project

Goal: Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials.

Xtrangeloop

October 25, 2019

Overview

The project is to investigate the exponential distribution in R and compare it with the Central Limit Theorem.
* The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter.
* The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda.
* Set lambda = 0.2 for all of the simulations.
* Investigate the distribution of averages of 40 exponentials, with a thousand simulations.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. You should:
1. Show the sample mean and compare it to the theoretical mean of the distribution.
2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
3. Show that the distribution is approximately normal.

Synopsis

Due to the central limit theorem (CLT), the distribution of averages of 40 exponentials is very close to a normal distribution.

Simulations

Prepare the workspace by loading the libraries. Define the simulation parameters, as described in the task. Simulate the sample data.

#Load the libraries
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)


#Define simulation parameters
Lambda <- .2 
Exponential <- 40
Sims <- 1000
set.seed(12345)


#Simulate
SimData <- replicate(Sims, rexp(Exponential, Lambda))

Sample Mean versus Theoretical Mean

Calculate the means for the sample dataset, as well as the actual and theoretical means.

#Calculate simulation, actual and theoretical means
MeanSimData <- apply(SimData, 2, mean)
ActualMean <- mean(MeanSimData)
TheoryMean <- 1/Lambda
#MeanSimData <- data.frame(MeanSimData, TheoryMean)


print(c("The Actual mean is", ActualMean, "while the Theorteical meanis", TheoryMean), quote = FALSE)

## [1] The Actual mean is           4.97197196424051            
## [3] while the Theorteical meanis 5

As printed above, the actual mean from the sample data is very close to the theoretical mean of normal data.

Sample Mean Distribution vs the Actual and Theoretical Means

PlotSimData <- data.frame(MeanSimData)

hist(MeanSimData, breaks = 10, col = "lightsteelblue2",
     xlim=c(3, 8.5), main = "Simulation Mean Distribution", 
     xlab = "Sample Mean")

#Plot Actual and Theoretical Means
abline(v=ActualMean, col="red", lwd=2)
abline(v=TheoryMean, col="blue", lwd=2)
legend(6, 260, legend=c("Actual Mean", "Theoretical Mean"),
       col=c("red", "blue"), lty=1:1, lwd=2, cex=1)

The above graph depicts the mean of the sample distribution with both the actual and theoretical means identified. As stated previously,the actual and theoretical means are very close.

Sample Variance vs Theoretical Variance

Calculate the standard deviations and variance for the sample dataset, both as actual and theoretical.

#Calculate actual and theoretical standard deviation and variance
ActualSD <- sd(MeanSimData)
TheorySD <- ((1/Lambda) * (1/sqrt(Exponential)))
ActualVar <- var(MeanSimData)
TheoryVar <- TheorySD^2

#Print Results
print(c("The standard deviation of the distribution is", ActualSD,
        "with a theorteical standard deviation of", TheorySD,
        "While the actual variance is", ActualVar,
        "with a theorteical variance of", TheoryVar), quote = FALSE)

## [1] The standard deviation of the distribution is
## [2] 0.771645582353234                            
## [3] with a theorteical standard deviation of     
## [4] 0.790569415042095                            
## [5] While the actual variance is                 
## [6] 0.595436904765261                            
## [7] with a theorteical variance of               
## [8] 0.625

As printed above, the actual and theoretical of both the standard deviation and the variance are very close, indicating that the simulated data approximates the expected normal results.

Distribution

According to the Central Limit Theorem, the averages of samples follow normal distribution, which has been continuously demonstrated by the closeness in values of the actual vs theoretical in multiple variables (mean, standard deviation, variance), and which continues to be depicted in the following graph.

#Plot the simulation distribution
hist(MeanSimData, breaks = 50, col = "lightblue",
     main = "Simulation Distribution", 
     xlab = "Simulation Mean", freq = FALSE)
lines(density(MeanSimData), col="orchid", lwd=2)

# Normal distribution line creation
x <- seq(min(MeanSimData), max(MeanSimData), length=2*Exponential)
y <- dnorm(x, mean=1/Lambda, sd=sqrt(((1/Lambda)/sqrt(Exponential))^2))
lines(x, y, pch=22, col="purple4", lwd=2)
legend(6.3, .56, legend=c("Actual Variance", "Theoretical Variance"),
       col=c("orchid", "purple4"), lty=1:1, lwd=2, cex=1)

The graph overlays the bell curves of the actual and theoretical distributions on top of the simulated data. Thereby visually reinforcing that the sampled data approximates normal distribution.

Part 2: Guinea Pigs: Tooth Growth by Supplement and Doses

Goal: Analyze the ToothGrowth data in the R datasets package

Overview

The ToothGrowth dataset contains data regarding the effect of vitamin C on the tooth growth of Guinea pigs. Specifically, the data includes the tooth length in each of 10 guinea pigs at three dosage levels of Vitamin C (0.5, 1, and 2 mg) and two delivery methods (orange juice or ascorbic acid).

Basic Exploratory Data Analysis

This data frame has 60 observations and 3 variables: + len - Tooth length (numeric). + supp - Supplement type: VC-“Vitamin C”" or OJ-“orange juice” (factor). + dose - Dose in milligrams (numeric).

#Load the Libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(datasets)


#Load the Data & Explore
data(ToothGrowth)
str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

# Exploration of the effect of dosed supplements on tooth growth  
ggplot(ToothGrowth, aes(x = dose, y = len)) + 
  geom_point(aes(color=supp))

Summarize the Data

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

tapply(ToothGrowth$len, ToothGrowth$supp, FUN=summary)

## $OJ
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.20   15.53   22.70   20.66   25.73   30.90 
## 
## $VC
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.20   11.20   16.50   16.96   23.10   33.90

The median best growth comes from the Oranje Juice, while the single highest numbers come from Ascorbic Acid.

Compare Tooth Growth by Supplement and Dose

The following plot shows data summary for each supplement:

qplot(data = ToothGrowth, supp, len, facets = ~dose, 
      main = "Supplement & Dose Effect on Guinea Pig's Tooth Growth",
      xlab = "Supplement", ylab = "Tooth Length") +
  geom_boxplot(aes(fill = supp)) +
  theme(legend.position = "none")

Hypothesis Testing, Assumptions, & Conclusions

Assumptions

The guinea pigs were randomly selected and assigned
The variables are independant and identically distributed
Tooth growth variance is different between supplement and dosage
Tooth growth approximates normal distribution

Dose Comparison

Null Hypothesis is that there will be no difference in tooth growth caused by the dosage. Alternatively, we hypothesize there will be more tooth growth as the dosage increases.

#Tooth Growth Exploration Based on Dose Comparison   
LowDose <- ToothGrowth$len[ToothGrowth$dose == .5]
MedDose <- ToothGrowth$len[ToothGrowth$dose == 1]
HigDose <- ToothGrowth$len[ToothGrowth$dose == 2]

t.test(LowDose, MedDose, alternative = "less", paired = FALSE, 
       var.equal = FALSE, conf.level = .95)

## 
##  Welch Two Sample t-test
## 
## data:  LowDose and MedDose
## t = -6.4766, df = 37.986, p-value = 6.342e-08
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -6.753323
## sample estimates:
## mean of x mean of y 
##    10.605    19.735

t.test(MedDose, HigDose, alternative = "less", paired = FALSE, 
       var.equal = FALSE, conf.level = .95)

## 
##  Welch Two Sample t-test
## 
## data:  MedDose and HigDose
## t = -4.9005, df = 37.101, p-value = 9.532e-06
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf -4.17387
## sample estimates:
## mean of x mean of y 
##    19.735    26.100

Conclusion is that we must reject the null hypothesis due to the fact that the p-value is lower than the default error value in both instances. In both instances, the p-value is close to zero, and so it is highly likely that the dosage does have an effect on tooth growth, such that a higher dosage results in increased tooth growth.

Supplement Comparison

Null Hypothesis is that there will be no difference in tooth growth caused by the supplement. Alternatively, we hypothesize there will be more tooth growth when using orange juice as the supplement.

#Tooth Growth Exploration Based on Supplement  
OJ = ToothGrowth$len[ToothGrowth$supp == "OJ"]
VC = ToothGrowth$len[ToothGrowth$supp == "VC"]

t.test(OJ, VC, alternative = "greater", paired = FALSE, 
       var.equal = FALSE, conf.level = .95)

## 
##  Welch Two Sample t-test
## 
## data:  OJ and VC
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.4682687       Inf
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333

Conclusion is that we must reject the null hypothesis due to the fact that the p-value is lower than the default error value as there is only a 3% chance to obtain extreme differences in tooth growth based solely on the supplement.

Supplement Comparison at Highest Dosage

Null Hypothesis is that there will be no difference in tooth growth caused by the supplement at the 2mg dosage. Alternatively, we hypothesize there will be a difference based on supplement at the highest dosage level.

#Tooth Growth Exploration Based on Supplement with Highest Dosage  
HiOJ = ToothGrowth$len[ToothGrowth$supp == "OJ" & ToothGrowth$dose == 2]
HiVC = ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == 2]

t.test(HiOJ, HiVC, alternative = "two.sided", paired = FALSE, 
       var.equal = FALSE, conf.level = .95)

## 
##  Welch Two Sample t-test
## 
## data:  HiOJ and HiVC
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean of x mean of y 
##     26.06     26.14

Conclusion is that we cannot reject the null hypothesis, as its p-value is higher than the error tolerance, and there is insufficient data to demonstrate a difference in tooth growth based on the supplement at the highest doage.