Part 1: Simulation Exercise

Note: At the time of this assignment I was using RStudio Version 0.99.902 - © 2009-2016 RStudio, Inc. Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/538.1 (KHTML, like Gecko) rstudio Safari/538.1 Qt/5.4.1, under R version R x64 3.3.1.

I am required to investigate the exponential distribution in R and compare it with the Central Limit Theorem. I will use rexp(n, lambda); “lambda” is the rate parameter. Mean = 1/lambda = Standard Deviation. lambda has been give as = 0.2 for all of the simulations. It is suggested to investigate the distribution of averages of 40 exponentials using over 1000 simulations or more, so I will also attempt this.

I will Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials and ensure that I:

Show the sample mean and compare it to the theoretical mean of the distribution.
Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
Show that the distribution is approximately normal.

For point 3, the instructions are to focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials..

PS: I watched video 7 through 12 and downloaded the content to read.

Execution of the above instructions

We will take the provided information as follows:
lambda = 0.2
exponentials = n = 40
simulations = sim = 1000
Formula to use = rexp(n, lambda)
mean = standard deviation = 1/lambda = 5 (theoretical mean)

## let's set the above parameters in R
lambda <- 0.2
n <- 40
sim <- 1000
sd <- 1/lambda
set.seed(888) # for reproducibility

Now let’s simulate using the formula provided and save it for later.

rexp.sim <- rexp(sim*n, rate=lambda)
# will give 40000 numbers

Show the sample mean and compare it to the theoretical mean of the distribution.

Now to get the mean of our simulations (sample mean).

simMean <- mean(rexp.sim)
simMean

## [1] 4.984112

As I meantioned above, the theoretical mean is simply 1/lambda, which is 1/0.2 = 5. To reflect this:

ThMean <- 1/lambda
ThMean

## [1] 5

The assignment requests that the report structure address a slew of requirements. One such is that we are also required to provide titles and figures to highlight the comparisons. In my above example though I need to create a data frame or matrix in order to have something to plot that would help with illustrating the requirements.

I will use the ‘motivating example’ provided in the question to create the simulation of the distribution, and then plot it using ggplot.

data <- NULL
for (i in 1 : sim) data = c(data, mean(rexp(n, lambda)))
# this will give us a simulation of 1000 averages of 40 random uniforms.

Above we see that when we compare the means (theoretical and sample, or ThMean and simMean) they are almost identical. Now let’s plot them and compare them visually

# Need to create the vertical lines that will show us the sample and theoretical means in our visual illustraton.  
data.prep <- data.frame(header=c("Sample Mean", "Theoretical Mean"), values=c(simMean, ThMean))
# now for preferred plotting tool
library(ggplot2)
g <- ggplot(NULL, aes(x=data))
plot1 <-  g+geom_histogram(aes(y=..density..), color="black", fill="white", binwidth=0.1)+geom_density(color="red")+geom_vline(data=data.prep, aes(xintercept=values, linetype=header, color=header), show.legend=TRUE)
plot1

Based on the plot, we see that with a large enough simulation the sample means will get closer and cloer to the theoretical means.

Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

Based on the course and swirl exercises, we can find the standard deviation by just using the sd() function on the sample we’re interested in. In our case the sample mean.

# to calclate standard deviation of the sample 
sd(data)

## [1] 0.7915261

# then to get the variance of the sample
VarData <- sd(data)^2
VarData

## [1] 0.6265135

The theoretical variance of the distribution is calculated using theta^2/n. in this case theta = 1/lambda = 0.2, and n = 40. So:

VarTh <- (1/lambda)^2/n
VarTh

## [1] 0.625

We can conclude they are almost identical.

Show that the distribution is approximately normal.

I was not sure how I could do this so I scoured the internet and used the explanations shown here

So, basically on further checking R has a built-in function called QQ plot, and it does all the work for you. Where would we be without the folks who work hard on simplifying R.

qqnorm(data)
qqline(data, col=12)

Based on the forum which I read, this is not 100% proof that the distribution is Normal but it is as close as I am going to get to proving it. To quote Glen from the forum “No test will show you that your data has a normal distribution - it will only be able to show you when the data is sufficiently inconsistent with a normal that you would reject the null”. So since I am no statistics genius by a stretch, I will conclude this is about as Normal as the distribution can get until someone comes up with a better method to prove it.

Part 2: Basic Inferential Data Analysis Instructions

In this section we have been asked to analyze the ToothGrowth data in the R datasets package. SO first we need to load the package

library(datasets)
data(ToothGrowth)

## To get an idea of what this dataset is about type ?ToothGrowth.

Based on the summary it’s to check the response of cells responsible for tooth growth (odontoblasts) in guinea pigs by using two delivery methods and 3 different doses on 60 guinea pigs.

Perform some basic exploratory data anaylisis

I chose to keep it really basic and reviewed some of our previous lecture functions to get an idea.

head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Len = Tooth Length
Supp = Supplement type (H6H8O6 - Ascorbic Acid, or Orange Juice)
Dose = dosage in milligrams per day

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

Summary provides self-explanatory data.

One last thing I noted later as I was doing the assignment was that the dose wasn’t coming out clear with the plots. eg I wasn’t able to visually show the length per dosage. As you can see from the summary above the dosages shouldn’t even have median, mean etc. So we need to convert them as.factors.

ToothGrowth$dose <- as.factor(ToothGrowth$dose)
## now if we look at the summary again
summary(ToothGrowth)

##       len        supp     dose   
##  Min.   : 4.20   OJ:30   0.5:20  
##  1st Qu.:13.07   VC:30   1  :20  
##  Median :19.25           2  :20  
##  Mean   :18.81                   
##  3rd Qu.:25.27                   
##  Max.   :33.90

We see here now that we have got the three dosage types. One last exploration

str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: Factor w/ 3 levels "0.5","1","2": 1 1 1 1 1 1 1 1 1 1 ...

Provide a basic summary of the data.

I am going to assume that the summary in this case must be more about explaining the figures a bit more than the brief data exploratory analysis done above. Per our last couple of classes we learned bootstrapping, permutations and group comparisons where we were shown some box graphs. I am going to follow some of the methods provided in those classes.

library(ggplot2)
g <- ggplot(ToothGrowth, aes(x=dose, y=len))
Plot1 <- g+geom_boxplot(aes(fill=dose))
Plot1

The above doesn’t differentiate between the 2 supplement types so it does not give us any real visual comparisons. To remedy this:

SuppComparisonPlot <- Plot1+facet_wrap(~supp)
SuppComparisonPlot

Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there’s other approaches worth considering)

The sample size is quite small and in our lectures and swirl we used the t.test() function that provided the Welch Two Sample test. I have chosen to use it here because I cannot think of any other way of doing the R code to get right down to the details of each dosage and compare it.

t.test(len~supp, data=ToothGrowth)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

Rule of thumb that I noted is if the P-value is less than the 5% significance level then we reject the null hypothesis. However if it is more, then we keep the null hypothesis.

In the above case the P-Value is 0.061. 0.061 > 0.05, so we will keep the null hypothesis. Or in laymen’s terms: there is not enough evidence to suggest that the type of supplementation impacts tooth growth.

Now let’s try this again but by comparing the dosage levels. But before we do that, as I again found out the hard way, we need to revert back to the ToothGrowth data where dose was a numeric again. So let’s reload it.

data(ToothGrowth)

Quick test

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

Ok, we’re good to go. Again, using what we learned in class:

t.test(ToothGrowth$len, ToothGrowth$dose)

## 
##  Welch Two Sample t-test
## 
## data:  ToothGrowth$len and ToothGrowth$dose
## t = 17.81, df = 59.798, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  15.66453 19.62881
## sample estimates:
## mean of x mean of y 
## 18.813333  1.166667

So this time around, the P-value is waaaaay smaller than the 0.05 significance threshold, which means we can reject the null hypothesis. Or in laymen’s terms: The Dosage amount of each supplement does indeed show an impact on the tooh growth rate.

Conclusions and Assumptions

Guinea pigs given OJ or Ascorbic acid did not yield enough evidence to suggest their teeth grew faster or slower, or not at all.

However,changing the Dosage of the OJ or Ascorbic Acid yielded enough evidence that suggested significant tooth growth rate

My R skills were not good enough to try and go into depth by each dosage.

Thank you. Fin. Have a nice weekend.

Statistical Inference assignment

Hani Mebar

29 September 2016