Peer Graded Assignment: Statistical Inference Course Project

Overview

In this project we will investigate the exponential distribution in R and compare it with the Central Limit Theorem.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. Hereunder we are going to: * Show the sample mean and compare it to the theoretical mean of the distribution. * Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution. * Show that the distribution is approximately normal.

Simulations:

The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. Let set lambda = 0.2 for all of the simulations. Here we create the distribution of 1000 averages of 40 exponentials, which will see us do a thousand simulations. To provide reproducability for exponentials generation the seed function establishes start position so every time generation gives the same values.

Problem: Pt.1: Show the sample mean and compare it to the theoretical mean distribution.

Setting a seed for reproducibility:

set.seed(12345)

Setting a rate parameter lambda:

lambda = 0.2

Setting a size of a sample:

n = 40

Setting a number of simulations:

nosim=1000

Generating 1000 samples of 40 exponentials and calculating their mean values:

r <- replicate(nosim, rexp(n, lambda))
dim(r)

## [1]   40 1000

class(r)

## [1] "matrix"

We can see that r is a matrix of 40 rows and 1000 columns. Since each column contains a sample of 40 random exponentials we’ll apply mean() to columns to get 1000 sample means:

exp_means <- apply(r, 2, mean)

Show the sample mean and compare it to the theoretical mean of the distribution. Sample(empirical) mean:

e_mean <- mean(exp_means)
e_mean

## [1] 4.971972

Theoretical mean:

t_mean<- 1/lambda
t_mean

## [1] 5

We can see that the sample mean 5.02562 is good approximation of the theoretical mean t_mean = 5.

means <- cumsum(exp_means)/(1:nosim)
library(ggplot2)
g <- ggplot(data.frame(x = 1:nosim, y = means), aes(x = x, y = y))
g <- g + geom_hline(yintercept = t_mean, colour = 'red') + geom_line(size = 1)
g <- g + labs(x = "Number of simulations", y = "Cumulative mean")
g <- g + ggtitle('Sample Means of 1000 Samples of 40 Random Exponentials ')
g

As we can see from the graph, empirical mean is a consistent estimator of theoretical mean, because it converges to the value of theoretical mean. 2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution:

e_var <- var(exp_means)
e_var

## [1] 0.5954369

Theoretical variance of the distribution of sample means:

t_var <- (1/(lambda*sqrt(n)))^2
t_var

## [1] 0.625

As we can see, both empirical and theorietical variance of the distribution of sample means have value close to 0.6.

cumvar <- cumsum((exp_means - e_mean)^2)/(seq_along(exp_means - 1))
g <- ggplot(data.frame(x = 1:nosim, y = cumvar), aes(x = x, y = y))
g <- g + geom_hline(yintercept = t_var, colour = 'red') + geom_line(size = 1)
g <- g + labs(x = "Number of simulations", y = "Cumulative variance")
g <- g + ggtitle('Sample Variance of 1000 Samples of 40 Random Exponentials ')
g

As we can see from the graph, sample variance is a consistent estimator of the theoretical variance, because it converges to the value of the theoretical variance. 3. Show that the distribution is approximately normal:

g <- ggplot(data.frame(x = exp_means), aes(x = x))
g <- g + geom_histogram(aes(y = ..density..), colour = 'white', fill = 'light blue')
g <- g + stat_function(fun = dnorm, colour = 'red', args = list(mean = t_mean))
g <- g + geom_vline(xintercept = t_mean, colour = 'black', size = 1)
g <- g + ggtitle('Distribution of Averages of 1000 Samples of 40 Random Exponential Variables')
g <- g + xlab('Means')
g <- g + ylab('Density')
g

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As we can see from the graph the distribution of averages of 1000 samples of 40 iid exponentials is approximately normal.

Problem: Pt.2

Sample Mean versus Theoretical Mean: After 10000 times of the exponential distribution simulation, we could collect every mean data of the simulations. These mean data were exhibited on the histogram plot and the peak count of these mean data was compared to the theoretical mean of the exponential distribution which could calculated by the lambda value. The result shows that the peak of the mean data and the theoretical mean value were quit close.

Compare and plot the sample mean vs theoretical mean

We were asked to analyze the Tooth Growth dataset in R. From the R help file: “the response is the length of odontoblasts (teeth) in each of 10 guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods (orange juice or ascorbic acid).” The variable “len” is the tooth length in mm, “supp” is the supplement type delivery method (OJ = orange juice, VC = ascorbic acid), and “dose” is the dose level of vitamin C (0.5, 1, 2 mg).

Exploratory Data Analysis:

There are a total of 60 observations in this dataset. A summary of the data can be seen below.

library(plyr)
summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

library(plyr)
str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

library(plyr)
dim(ToothGrowth)

## [1] 60  3

ddply(ToothGrowth,"supp",summarize, avg=mean(len))

##   supp      avg
## 1   OJ 20.66333
## 2   VC 16.96333

ddply(ToothGrowth,"supp",summarize, StDev=sd(len))

##   supp    StDev
## 1   OJ 6.605561
## 2   VC 8.266029

ddply(ToothGrowth,"dose",summarize, avg=mean(len))

##   dose    avg
## 1  0.5 10.605
## 2  1.0 19.735
## 3  2.0 26.100

ddply(ToothGrowth,"dose",summarize, StDev=sd(len))

##   dose    StDev
## 1  0.5 4.499763
## 2  1.0 4.415436
## 3  2.0 3.774150

ddply(ToothGrowth,c("supp","dose"),summarize, avg=mean(len))

##   supp dose   avg
## 1   OJ  0.5 13.23
## 2   OJ  1.0 22.70
## 3   OJ  2.0 26.06
## 4   VC  0.5  7.98
## 5   VC  1.0 16.77
## 6   VC  2.0 26.14

ddply(ToothGrowth,c("supp","dose"),summarize, StDev=sd(len))

##   supp dose    StDev
## 1   OJ  0.5 4.459709
## 2   OJ  1.0 3.910953
## 3   OJ  2.0 2.655058
## 4   VC  0.5 2.746634
## 5   VC  1.0 2.515309
## 6   VC  2.0 4.797731

ddply(ToothGrowth,c("supp","dose"),summarize, StDev=sd(len))

##   supp dose    StDev
## 1   OJ  0.5 4.459709
## 2   OJ  1.0 3.910953
## 3   OJ  2.0 2.655058
## 4   VC  0.5 2.746634
## 5   VC  1.0 2.515309
## 6   VC  2.0 4.797731

The first plot shows the tooth length by supplement type. From the boxplot, it appears that they are not very dissimilar.

boxplot(ToothGrowth$len~ToothGrowth$supp,col=c("grey","red"),
main="Tooth Length by Supplement",ylab="Tooth Length (mm)",
xlab="Supplement (type)")

The next plot shows the tooth length by dose, 0.5, 1, and 2 mg. There does appear to be a difference between these, especially between the 0.5 and 2 mg doses.

boxplot(ToothGrowth$len~ToothGrowth$dose,col=c("lightgreen",
"green","darkgreen"),
main="Tooth Length by Dose",ylab="Tooth Length (mm)",xlab="Dose (mg)")

The next plot shows the interaction between dose and supplement type. When dose is considered, it appears that both supplement types increase the tooth length.

boxplot(len ~ interaction(supp,dose), data=ToothGrowth,
col=c("red","lightgreen","grey","green"," blue","darkgreen"),
main="Tooth Length by Dose and Supplement",ylab="Tooth Length (mm)",
xlab="Dose (mg) & Supplement (type) Interaction")

Hypothesis Tests There are many hypothesis tests that could be performed with this dataset. The plots above already give an idea of which tests could prove to be siginificant. We can already surmise that for tooth length, there is probably a highly significant difference between doses, but perhaps not a significant difference between supplement types. Two sample t-tests were performed. The assumptions underlying this test are: The populations from which the samples were drawn are normally distributed. The standard deviations of the populations are equal.

The samples were randomly drawn.

# T test for Supplement type
VC=subset(ToothGrowth, supp=="VC")
OJ=subset(ToothGrowth,supp=="OJ")
t.test(VC$len,OJ$len)

## 
##  Welch Two Sample t-test
## 
## data:  VC$len and OJ$len
## t = -1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.5710156  0.1710156
## sample estimates:
## mean of x mean of y 
##  16.96333  20.66333

# T test for doses
dose0.5=subset(ToothGrowth,dose==0.5)
dose2=subset(ToothGrowth,dose==2.0)
t.test(dose0.5$len,dose2$len)

## 
##  Welch Two Sample t-test
## 
## data:  dose0.5$len and dose2$len
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.15617 -12.83383
## sample estimates:
## mean of x mean of y 
##    10.605    26.100

tdose=round(t.test(dose0.5$len,dose2$len)$p.value,14)
dose0.5=subset(ToothGrowth,dose==0.5)
dose1=subset(ToothGrowth,dose==1.0)
t.test(dose0.5$len,dose1$len)

## 
##  Welch Two Sample t-test
## 
## data:  dose0.5$len and dose1$len
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean of x mean of y 
##    10.605    19.735

t.test(dose1$len,dose2$len)

## 
##  Welch Two Sample t-test
## 
## data:  dose1$len and dose2$len
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean of x mean of y 
##    19.735    26.100

# T test for supplement and dose
VCdose.5=subset(ToothGrowth,dose==0.5 & supp=="VC")
OJdose.5=subset(ToothGrowth,dose==0.5 & supp=="OJ")
t.test(VCdose.5$len,OJdose.5$len)

## 
##  Welch Two Sample t-test
## 
## data:  VCdose.5$len and OJdose.5$len
## t = -3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.780943 -1.719057
## sample estimates:
## mean of x mean of y 
##      7.98     13.23

VCdose1=subset(ToothGrowth,dose==1.0 & supp=="VC")
OJdose1=subset(ToothGrowth,dose==1.0 & supp=="OJ")
t.test(VCdose1$len,OJdose1$len)

## 
##  Welch Two Sample t-test
## 
## data:  VCdose1$len and OJdose1$len
## t = -4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.057852 -2.802148
## sample estimates:
## mean of x mean of y 
##     16.77     22.70

VCdose2=subset(ToothGrowth,dose==2.0 & supp=="VC")
OJdose2=subset(ToothGrowth,dose==2.0 & supp=="OJ")
t.test(VCdose2$len,OJdose2$len)

## 
##  Welch Two Sample t-test
## 
## data:  VCdose2$len and OJdose2$len
## t = 0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.63807  3.79807
## sample estimates:
## mean of x mean of y 
##     26.14     26.06

all of the t-tests performed resulted in significant differences in the means of the tooth length at the

p=0.05 level.

The t-test for the supplement type resulted in a p-value of 0.0606, just barely higher than the 0.05 significant level. Here we have very slight evidence that the tooth length means of the supplement types are not dissimilar.
We have strong evidence that dose results in different mean lengths for teeth. Between the 0.5 and 2.0 mg dose, a p-value of less than 0.000000001 is produced. This is highly significant.
When dose and supplement are considered, there are two significant difference at the 0.5 and 1.0 mg dose levels, between supplement types. The p-values for these tests (dose of 0.5 mg, by supplement type and dose of 1.0 mg, by supplement type) were 0.0063586 and 0.0010384 respectively.

##### Conclussion
Overall, we can conclude that as the dose of vitamin C increases, so does the tooth length. The supplement
type only appears to play a significant role at the 0.5 and 1.0 mg dose levels.

#### Session Information:

```r
sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] plyr_1.8.4    ggplot2_3.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2       knitr_1.24       magrittr_1.5     tidyselect_0.2.5
##  [5] munsell_0.5.0    colorspace_1.4-1 R6_2.4.0         rlang_0.4.1     
##  [9] stringr_1.4.0    dplyr_0.8.3      tools_3.6.1      grid_3.6.1      
## [13] gtable_0.3.0     xfun_0.8         withr_2.1.2      htmltools_0.3.6 
## [17] yaml_2.2.0       lazyeval_0.2.2   digest_0.6.20    assertthat_0.2.1
## [21] tibble_2.1.3     crayon_1.3.4     purrr_0.3.2      glue_1.3.1      
## [25] evaluate_0.14    rmarkdown_1.16   labeling_0.3     stringi_1.4.3   
## [29] compiler_3.6.1   pillar_1.4.2     scales_1.0.0     pkgconfig_2.0.2