Statistical Inference Peer Graded Assignment

PART 1: Overview: In this project, you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.
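
As a quick sanity check on the two facts above, the sketch below draws one large exponential sample and compares its mean and standard deviation with 1/lambda. The sample size of 100,000, the seed, and the names check_n and draws are illustrative assumptions, not part of the assignment.

```r
# Illustrative check (assumed sample size and seed, not from the assignment)
set.seed(2018)
lambda <- 0.2
check_n <- 100000
draws <- rexp(check_n, lambda)   # rexp(n, rate) from base R's stats package

# Both the sample mean and the sample SD should be close to 1/lambda = 5
c(mean = mean(draws), sd = sd(draws), theoretical = 1 / lambda)
```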

set.seed(1)    # allows results to be reproduced
lambda <- 0.2  # rate parameter, as directed
nexp <- 40     # number of exponentials per average, as directed
nsim <- 1000   # number of simulations, as directed

After the variables have been set, the simulations need to be run.

# Average of n exponential draws with rate lambda
exp_sim <- function(n, lambda) {
        mean(rexp(n, lambda))
}

# Preallocate one row per simulation: an index and the simulated mean
sim <- data.frame(Index = integer(nsim), Mean = numeric(nsim))

for (i in seq_len(nsim)) {
        sim[i, 1] <- i
        sim[i, 2] <- exp_sim(nexp, lambda)
}
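
The loop above can also be written without explicit indexing. The following is a minimal sketch using replicate() from base R; the name sim_means is introduced here for illustration. Because the seed and the order of the rexp() calls are the same as in the loop, it is expected to produce the same simulated averages.

```r
# Vectorized sketch of the same simulation (an alternative to the for loop)
set.seed(1)
lambda <- 0.2
nexp <- 40
nsim <- 1000

# replicate() evaluates the expression nsim times and returns a numeric vector
sim_means <- replicate(nsim, mean(rexp(nexp, lambda)))

length(sim_means)   # number of simulated averages
mean(sim_means)     # should be close to 1/lambda
```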

Calculate the sample mean and the theoretical mean.

sample_mean <- mean(sim$Mean)
sample_mean

## [1] 4.990025

theor_mean <- 1/lambda
theor_mean

## [1] 5

Create a histogram and display the two means being compared. The sample mean is 4.990025 and the theoretical mean is 5, so the center of the distribution of averages of 40 exponentials is almost identical to its theoretical center.

hist(sim$Mean,
        breaks = 50,
        main = "Exponential Distribution\n1000 means of 40 sample exponentials",
        prob = TRUE,
        xlab = "Spread")
abline(v = theor_mean, col = 6, lwd = 4)   # theoretical mean
abline(v = sample_mean, col = 4, lwd = 2)  # sample mean
legend("topright", c("Theoretical Mean", "Sample Mean"),
        lty = c(1, 1),
        lwd = c(4, 2),
        col = c(6, 4))

The next step in this analysis is to compare the variance of the 1000 simulated sample means with the theoretical variance of the mean of 40 exponentials, (1/lambda)^2/40. First, let’s look at the variances:

sample_var <- var(sim$Mean)
theor_var <- (1/lambda)^2 / nexp
sample_var

## [1] 0.6111165

theor_var

## [1] 0.625
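
The theoretical value follows from the variance of a sample mean of n independent draws. A short worked derivation, writing \sigma = 1/\lambda for the exponential standard deviation and n = 40 (notation introduced here for illustration):

```latex
\operatorname{Var}(\bar{X})
  = \frac{\sigma^{2}}{n}
  = \frac{(1/\lambda)^{2}}{n}
  = \frac{(1/0.2)^{2}}{40}
  = \frac{25}{40}
  = 0.625,
\qquad
\operatorname{sd}(\bar{X})
  = \frac{1/\lambda}{\sqrt{n}}
  = \frac{5}{\sqrt{40}}
  \approx 0.7906
```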

Then calculate the standard deviations:

sampleSD <- sd(sim$Mean)
theorSD <- (1/lambda) / sqrt(nexp)
sampleSD

## [1] 0.7817394

theorSD

## [1] 0.7905694

The sample SD of 0.782 is close to the theoretical SD of 0.791, just as the sample variance (0.611) is close to the theoretical variance (0.625). The histogram below plots the distribution of the simulated sample means and overlays a bell curve (normal distribution). As the Central Limit Theorem predicts, the distribution of the sample averages is approximately normal.

hist(sim$Mean,
        breaks = 50,
        main = "Exponential Distribution",
        xlab = "Spread",
        border = "gray",
        las = 1,
        prob = TRUE)
lines(density(sim$Mean))          # kernel density of the simulated means
abline(v = 1/lambda, col = 2)     # theoretical mean
xfit <- seq(min(sim$Mean), max(sim$Mean), length.out = 100)
yfit <- dnorm(xfit, mean = 1/lambda, sd = (1/lambda/sqrt(40)))
lines(xfit, yfit, col = 3, lty = 4, lwd = 2)   # theoretical normal curve
legend("topright", c("Simulated Values", "Theoretical Values"),
        lty = c(1, 4), lwd = 2, col = c(1, 3))

The q-q plot below also demonstrates normality. The theoretical quantiles again match closely with the actual quantiles.

qqnorm(sim$Mean,
        main = "Normal Q-Q Plot")
qqline(sim$Mean,
        col = 3, lwd = 2)

From these three charts and this analysis, it can be concluded that the distribution is approximately normal.
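
A numeric check can complement the charts. The sketch below re-runs the simulation and computes the share of simulated means that fall within 1.96 theoretical standard deviations of the theoretical mean; for a normal distribution that share is about 0.95. The names sim_means, theor_sd and coverage are introduced here for illustration, and the check is an informal addition rather than part of the original analysis.

```r
# Sketch: share of simulated means within 1.96 theoretical SDs of the mean
set.seed(1)
lambda <- 0.2
nexp <- 40
nsim <- 1000
sim_means <- replicate(nsim, mean(rexp(nexp, lambda)))

theor_mean <- 1 / lambda
theor_sd <- (1 / lambda) / sqrt(nexp)

# For a normal distribution this proportion is about 0.95
coverage <- mean(abs(sim_means - theor_mean) <= 1.96 * theor_sd)
coverage
```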

PART 2 – The second portion of this assignment was to review the ToothGrowth data that is included in the R datasets package. According to https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/ToothGrowth.html, “The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).” First, the data is reviewed with simple queries.

data(ToothGrowth)
head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

tail(ToothGrowth)

##     len supp dose
## 55 24.8   OJ    2
## 56 30.9   OJ    2
## 57 26.4   OJ    2
## 58 27.3   OJ    2
## 59 29.4   OJ    2
## 60 23.0   OJ    2

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

These show that there are 60 observations of 3 variables: length (len), supplement (supp) and dose. Length and dose are numeric variables and supp is a factor. Next, box plots will be used to compare length by dose and length by supplement type.

par(mfrow = c(1, 2))  # two plots side by side
boxplot(len~dose, data=ToothGrowth, col = "blue",
        main="Tooth Growth / Dose in mg per day",
        xlab = "Dose in mg / day", ylab = "Growth in length")

boxplot(len~supp, data=ToothGrowth, col = "green",
        main="Tooth Growth / Supplement Type",
        xlab = "Type of Supplement", ylab = "Growth in length")

These charts indicate that a higher dose of Vitamin C increases tooth length. They also suggest that administering the dose in orange juice leads to greater tooth growth than administering it as ascorbic acid.
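
The box plots look at dose and supplement separately. One way to see both factors at once is a table of group means. This is a minimal sketch using aggregate() from base R; the name group_means is introduced here for illustration.

```r
# Sketch: mean tooth length for each supplement and dose combination
data(ToothGrowth)
group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
group_means
```

Each row gives the mean length for one of the six supplement and dose groups of 10 guinea pigs.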

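Lastly, hypothesis tests were run to compare the length of the teeth by supplement and by dose. The rule is to reject the null hypothesis when the p-value <= alpha; alpha is taken to be 0.05 here, consistent with the 95 percent confidence intervals reported below. The p-values for the four dose comparisons are all less than .001, so the null hypothesis of equal means is rejected for each: tooth length increases from 0.5 to 1 mg/day and from 1 to 2 mg/day overall, and from 0.5 to 1 mg/day within both OJ and VC. The supplement comparison is different: its p-value is 0.061 and its confidence interval (-0.17 to 7.57) includes zero, so the null hypothesis of no difference between OJ and VC cannot be rejected at this level. Based on the results of the t-tests, we can conclude that higher doses result in greater tooth length in guinea pigs. OJ has the higher sample means (for example, 22.70 versus 16.77 at 1 mg/day), but the tests shown here do not establish that difference as statistically significant.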

SUPPLEMENTAL INFORMATION:

t.test(ToothGrowth$len[ToothGrowth$supp=="OJ"], ToothGrowth$len[ToothGrowth$supp=="VC"], paired=FALSE, var.equal=FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  ToothGrowth$len[ToothGrowth$supp == "OJ"] and ToothGrowth$len[ToothGrowth$supp == "VC"]
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333
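
The same Welch test can be written with the formula interface of t.test(), which makes it easy to extract the p-value and apply the decision rule in code. This is a sketch only: alpha = 0.05 is an assumed significance level (chosen to match the 95 percent interval), and the names supp_test, p_value and reject_null are introduced here for illustration.

```r
# Sketch: Welch test via the formula interface, then apply the decision rule
data(ToothGrowth)
alpha <- 0.05   # assumed significance level, not stated in the original

# The default for two groups is an unpaired Welch test (unequal variances)
supp_test <- t.test(len ~ supp, data = ToothGrowth)

p_value <- supp_test$p.value
reject_null <- p_value <= alpha   # rule: reject when p-value <= alpha

p_value
supp_test$conf.int
reject_null
```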

t.test(ToothGrowth$len[ToothGrowth$dose==0.5], ToothGrowth$len[ToothGrowth$dose==1],paired=FALSE, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  ToothGrowth$len[ToothGrowth$dose == 0.5] and ToothGrowth$len[ToothGrowth$dose == 1]
## t = -6.4766, df = 38, p-value = 1.266e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983748  -6.276252
## sample estimates:
## mean of x mean of y 
##    10.605    19.735

t.test(ToothGrowth$len[ToothGrowth$dose==1], ToothGrowth$len[ToothGrowth$dose==2],paired=FALSE, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  ToothGrowth$len[ToothGrowth$dose == 1] and ToothGrowth$len[ToothGrowth$dose == 2]
## t = -4.9005, df = 38, p-value = 1.811e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.994387 -3.735613
## sample estimates:
## mean of x mean of y 
##    19.735    26.100

t.test(ToothGrowth$len[ToothGrowth$dose==0.5 & ToothGrowth$supp=="OJ"],
       ToothGrowth$len[ToothGrowth$dose==1 & ToothGrowth$supp=="OJ"],
       paired=FALSE,var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  ToothGrowth$len[ToothGrowth$dose == 0.5 & ToothGrowth$supp == "OJ"] and ToothGrowth$len[ToothGrowth$dose == 1 & ToothGrowth$supp == "OJ"]
## t = -5.0486, df = 18, p-value = 8.358e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -13.410814  -5.529186
## sample estimates:
## mean of x mean of y 
##     13.23     22.70

t.test(ToothGrowth$len[ToothGrowth$dose==0.5 & ToothGrowth$supp=="VC"],
       ToothGrowth$len[ToothGrowth$dose==1 & ToothGrowth$supp=="VC"],
       paired=FALSE,var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  ToothGrowth$len[ToothGrowth$dose == 0.5 & ToothGrowth$supp == "VC"] and ToothGrowth$len[ToothGrowth$dose == 1 & ToothGrowth$supp == "VC"]
## t = -7.4634, df = 18, p-value = 6.492e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.264346  -6.315654
## sample estimates:
## mean of x mean of y 
##      7.98     16.77
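
The two within-supplement comparisons above repeat the same subsetting pattern. A small helper can reduce the repetition; compare_doses() and its argument names are hypothetical and not part of the original analysis, and the sketch keeps the same pooled-variance settings (var.equal = TRUE) used above.

```r
# Hypothetical helper (not in the original): compare two doses within one supplement
data(ToothGrowth)
compare_doses <- function(supp_type, dose_low, dose_high) {
  tg <- ToothGrowth
  x <- tg$len[tg$dose == dose_low & tg$supp == supp_type]
  y <- tg$len[tg$dose == dose_high & tg$supp == supp_type]
  # Same settings as the tests above: unpaired, equal variances assumed
  t.test(x, y, paired = FALSE, var.equal = TRUE)
}

oj_test <- compare_doses("OJ", 0.5, 1)
vc_test <- compare_doses("VC", 0.5, 1)
c(OJ = oj_test$p.value, VC = vc_test$p.value)
```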

Christine Arsenault

April 26, 2018