Synopsis

Course project for the Coursera Statistical Inference course. The project consists of two parts:

Part 1: Simulation Exercise
Part 2: Basic Inferential Data Analysis

Part 1

Part 1: Simulation Exercise Instructions

In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. You should:

1. Show the sample mean and compare it to the theoretical mean of the distribution.
2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
3. Show that the distribution is approximately normal.

In point 3, focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.

As a motivating example, compare the distribution of 1000 random uniforms
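
A minimal sketch of such a comparison is given here, under the assumption that the intended contrast is between 1000 raw uniforms and 1000 averages of 40 uniforms (by analogy with the 40 exponentials used in this project). The seed and the variable names are arbitrary choices for illustration.

```r
# Assumption: contrast raw uniforms with averages of 40 uniforms.
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# 1000 individual draws from Uniform(0, 1)
raw_uniforms <- runif(1000)

# 1000 averages, each over 40 uniforms (one average per matrix row)
uniform_means <- rowMeans(matrix(runif(1000 * 40), nrow = 1000))

# Side-by-side histograms: flat for raw draws, bell-shaped for averages
par(mfrow = c(1, 2))
hist(raw_uniforms, main = "1000 random uniforms", xlab = "Value")
hist(uniform_means, main = "1000 averages of 40 uniforms", xlab = "Average")
par(mfrow = c(1, 1))
```

The averages should cluster tightly around 0.5, while the raw draws spread evenly over the whole interval from 0 to 1.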

Question 1

Show the sample mean and compare it to the theoretical mean of the distribution.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(data.table)
## Warning: package 'data.table' was built under R version 3.3.1
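lambda <- 0.2     #### rate parameter
n <- 40           #### sample size: number of exponentials per average
sims <- 1000      #### number of times to run the simulation
quantile <- 1.96  #### approximate 97.5th percentile of the standard normal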

####Simulation
simulation <- matrix(rexp(n*sims, rate = lambda),sims)

###Mean of each row: one average of 40 exponentials per simulation
simulationmatrixmean <- rowMeans(simulation)
head(simulationmatrixmean) 
## [1] 4.715395 6.533664 5.815842 3.899973 3.547839 6.725399
###Mean of the 1000 averages
simulationmean <- mean(simulationmatrixmean)
simulationmean
## [1] 5.008377
####The mean of exponential distribution is given as 1/lambda
expmean <- 1/lambda
expmean
## [1] 5

The mean of the simulation (5.008) is very close to the theoretical mean (5).

Question 2

Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

###simulation variance
simvar <- var(simulationmatrixmean)
simvar  
## [1] 0.6678532
###theoretical variance of the mean of n exponentials: sd^2/n, with sd = 1/lambda
expvar <- ((1/lambda)^2)/n
expvar
## [1] 0.625

The variance of the simulated averages (0.668) is also close to the theoretical variance (0.625).
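
Because no seed is set in the code above, the simulated values change on every run. A small self-contained sketch of a repeatable version of the two comparisons follows. The seed value and the names sample_means and comparison are assumptions made for illustration, not part of the original analysis.

```r
# Assumption: any fixed seed will do; 2017 is an arbitrary choice.
set.seed(2017)

lambda <- 0.2   # rate parameter
n <- 40         # exponentials per average
sims <- 1000    # number of averages

# One average of n exponentials per matrix row
sample_means <- rowMeans(matrix(rexp(n * sims, rate = lambda), nrow = sims))

# Simulated vs. theoretical mean and variance of the averages
comparison <- data.frame(
  quantity    = c("mean", "variance"),
  simulated   = c(mean(sample_means), var(sample_means)),
  theoretical = c(1 / lambda, (1 / lambda)^2 / n)
)

# Relative difference, so both rows are on a comparable scale
comparison$relative_difference <-
  (comparison$simulated - comparison$theoretical) / comparison$theoretical

print(comparison)
```

Reporting the relative difference makes "very close" concrete, since the mean and the variance are on different scales.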

Question 3

Show that the distribution is approximately normal.

The standard deviation is needed for both the theoretical and the simulated distribution in order to plot both normal curves.

#### SD of simulation
simulation_sd <- sd(simulationmatrixmean)

#### Theoretical SD of the mean of n exponentials: (1/lambda) / sqrt(n)
expsd<- 1/(lambda * sqrt(n))

######plot data
plot <- ggplot(data.frame(simulationmatrixmean), aes(simulationmatrixmean)) +
  geom_histogram(aes(y = ..density..), color = "blue", fill = "blue") +
  labs(title = "Distribution of averages from random samples of 40",
       y = "Density", x = "Simulation Mean") +
  geom_vline(aes(xintercept = simulationmean, color = "simulation mean")) +
  geom_vline(aes(xintercept = expmean, color = "theoretical mean")) +
  #### yellow curve: normal density with the theoretical mean and SD
  stat_function(fun = dnorm, args = list(mean = expmean, sd = expsd),
                color = "yellow", size = 1) +
  #### green curve: normal density with the simulated mean and SD
  stat_function(fun = dnorm, args = list(mean = simulationmean, sd = simulation_sd),
                color = "green", size = 1)
plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The theoretical mean and the simulated mean are plotted as vertical lines and are almost on top of each other.

As a further check, calculate the interval that should contain about 95% of the averages if they are normally distributed (mean plus or minus 1.96 standard deviations), once from the simulation and once from theory, and compare the two.

###Interval from the simulation
simulationCI <- simulationmean + c(-1,1) * 1.96 * simulation_sd
round(simulationCI, 2)
## [1] 3.41 6.61
###Interval from theory
expCI <- expmean + c(-1,1) * 1.96 * expsd
round(expCI, 2)
## [1] 3.45 6.55

The two intervals are very similar. Together with the histogram, this is consistent with the averages being approximately normally distributed.
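
The instructions ask for a contrast between a large collection of random exponentials and a large collection of averages of 40 exponentials. One possible sketch of that contrast uses normal Q-Q plots and the sample skewness. The seed, the helper sample_skewness and the other names are assumptions introduced for illustration only.

```r
# Assumption: arbitrary seed so the sketch is repeatable.
set.seed(42)

lambda <- 0.2
n <- 40
sims <- 1000

raw_exponentials <- rexp(sims, rate = lambda)   # 1000 single draws
mean_of_40 <- rowMeans(matrix(rexp(n * sims, rate = lambda), nrow = sims))

# Hypothetical helper: moment-based sample skewness (0 for a symmetric sample)
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Theory: skewness 2 for the exponential, 2 / sqrt(n) for the mean of n draws
skew_raw <- sample_skewness(raw_exponentials)
skew_means <- sample_skewness(mean_of_40)

# Points close to the reference line indicate approximate normality
par(mfrow = c(1, 2))
qqnorm(raw_exponentials, main = "1000 random exponentials")
qqline(raw_exponentials)
qqnorm(mean_of_40, main = "1000 averages of 40 exponentials")
qqline(mean_of_40)
par(mfrow = c(1, 1))
```

The raw exponentials should curve strongly away from the reference line, while the averages should follow it much more closely. That difference is the effect the Central Limit Theorem describes.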

Part 2

Now in the second portion of the project, we’re going to analyze the ToothGrowth data in the R datasets package.

1. Load the ToothGrowth data and perform some basic exploratory data analyses.
2. Provide a basic summary of the data.
3. Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there’s other approaches worth considering.)
4. State your conclusions and the assumptions needed for your conclusions.

Question 1

Load the ToothGrowth data and perform some basic exploratory data analyses

data("ToothGrowth")
data<- ToothGrowth
str(data)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
summary(data)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
head(data)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
tail(data)
##     len supp dose
## 55 24.8   OJ    2
## 56 30.9   OJ    2
## 57 26.4   OJ    2
## 58 27.3   OJ    2
## 59 29.4   OJ    2
## 60 23.0   OJ    2

The data set has 60 rows and three columns. Two columns are numeric (len and dose) and one, supp, is a factor with two levels, OJ and VC. The len column has a mean of 18.81 and a median of 19.25; the dose column has a mean of 1.167 and a median of 1.

###subset data into OJ and VC
dataOJ<- data[which(data$supp=="OJ"),]
dataVC <- data[which(data$supp=="VC"),]

###find the mean by dose of the two datasets

aggregate( len ~ dose, dataOJ, mean )
##   dose   len
## 1  0.5 13.23
## 2  1.0 22.70
## 3  2.0 26.06
aggregate( len ~ dose, dataVC, mean )
##   dose   len
## 1  0.5  7.98
## 2  1.0 16.77
## 3  2.0 26.14
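
The same group means can also be collected in a single supp by dose table without subsetting first. The following is a minimal sketch using base R only; the name mean_len and the choice of a boxplot are assumptions for illustration.

```r
data("ToothGrowth")

# Mean tooth length for every supp/dose combination (rows: supp, columns: dose)
mean_len <- with(ToothGrowth,
                 tapply(len, list(supp = supp, dose = dose), mean))
print(mean_len)

# One box per supp/dose combination to show the spread within each group
boxplot(len ~ interaction(supp, dose), data = ToothGrowth,
        xlab = "Supplement and dose", ylab = "Tooth length")
```

A single table makes it easier to read across supplements at a fixed dose as well as down the doses for one supplement.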

Question 2

Provide a basic summary of the data

Mean tooth length increases with dose for both orange juice (OJ) and vitamin C (VC). This observation needs to be tested by way of confidence intervals and hypothesis tests.

Question 3

Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there’s other approaches worth considering)

Perform a t test comparing orange juice (OJ) and vitamin C (VC)

t.test(len~supp, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

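The p-value (0.06063) is above 0.05 and the 95% confidence interval (-0.17 to 7.57) contains zero, so the null hypothesis of no difference between the two supplement types cannot be rejected at the 5% level.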

Next, run t tests for each pair of doses: 0.5 and 1, 0.5 and 2, and 1 and 2

###Test for dosage 1 and .5
t.test(len~dose, data = data[data$dose == 1 | data$dose == .5, ])
## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean in group 0.5   mean in group 1 
##            10.605            19.735
###Test for dosage .5 and 2
t.test(len~dose, data = data[data$dose == 2 | data$dose == .5, ])
## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.15617 -12.83383
## sample estimates:
## mean in group 0.5   mean in group 2 
##            10.605            26.100
###Test for dosage 1 and 2 
t.test(len~dose, data = data[data$dose == 1 | data$dose == 2, ])
## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean in group 1 mean in group 2 
##          19.735          26.100
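
The supplement comparison above pools all three doses. A possible refinement, sketched here under the assumption that a separate Welch t test per dose is acceptable, compares OJ and VC within each dose level. The names doses and supp_by_dose are hypothetical.

```r
data("ToothGrowth")

# Dose levels present in the data
doses <- sort(unique(ToothGrowth$dose))

# One Welch two-sample t test of len by supp for each dose level
supp_by_dose <- do.call(rbind, lapply(doses, function(d) {
  tt <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == d, ])
  data.frame(dose     = d,
             ci_lower = tt$conf.int[1],   # 95% CI for mean(OJ) - mean(VC)
             ci_upper = tt$conf.int[2],
             p_value  = tt$p.value)
}))

print(supp_by_dose)
```

A row whose interval excludes zero would indicate a difference between the supplements at that dose. Each of these tests uses only 10 observations per group, so the intervals are wider than in the pooled test.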

Question 4

State your conclusions and the assumptions needed for your conclusions.

Conclusion

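For every pairwise comparison of doses, the p-value is less than 0.05 and the 95% confidence interval does not contain zero, so with this data the effect of dosage is significant: an increase in dosage is associated with an increase in tooth length. For supplement type pooled over all doses, the p-value (0.06) is above 0.05 and the confidence interval contains zero, so a difference between OJ and VC cannot be established at the 5% level.

These conclusions rest on the assumptions of the Welch two-sample t test used above: the observations are independent, the groups are not paired, tooth length within each group is approximately normally distributed, and the group variances are not assumed to be equal.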