Statistical Inference Course Project

Overview

In this project we will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. We will investigate the distribution of averages of 40 exponentials. Note that we will need to do a thousand simulations.

Part 1 – simulation Exercise

Show the sample mean, and compare to the theoretical mean of the distribution.
Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
Show the distribution is approximately normal.

We set the seed for reproducibility, then define the lambda, number of exponentials, and the number of simulations to run.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.4.1

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Warning: package 'ggplot2' was built under R version 3.4.1

## Warning: package 'tidyr' was built under R version 3.4.1

## Warning: package 'readr' was built under R version 3.4.1

## Warning: package 'purrr' was built under R version 3.4.1

## Warning: package 'dplyr' was built under R version 3.4.1

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

set.seed(20)

lambda <- 0.2
n <- 40
sims <- 1000

We run the simulations/replications and then calculate the mean.

sim_exe <- replicate(sims, rexp(n, lambda))
mean_sim_exe <- apply(sim_exe, 2, mean)

hist(mean_sim_exe, xlab = "Average Exponentials", main = "Distribution of the Average Exponentials", breaks = 20)
abline(v = mean(mean_sim_exe), lwd = "5", col = "blue")

Sample mean vs Theoretical Mean

We can calculate the simulated mean and then the theoretical mean to compare the two. We see that both the theoretical and simulated means are very close at almost 5.

sim_mean <- mean(mean_sim_exe)
sim_mean

## [1] 4.963828

theoretical_mean <- 1/lambda
theoretical_mean

## [1] 5

Sample Variance vs Theoretical Variance

Again we can see that our simulated and theoretical variances are very close, at almost 0.6 for each.

sim_variance <- (sd(mean_sim_exe))^2
sim_variance

## [1] 0.5919712

theoretical_variance <- (1/lambda)^2 / n
theoretical_variance

## [1] 0.625

Distribution

Next we are looking at the distribution, which falls into a relatively normal distribution. We can see the min and max at about 3 and 8.

min(mean_sim_exe)

## [1] 2.911542

max(mean_sim_exe)

## [1] 8.392128

hist(mean_sim_exe, prob = TRUE, xlab = "Average Exponentials", main = "Distribution of the Average Exponentials", xlim = c(floor(min(mean_sim_exe)), ceiling(max(mean_sim_exe))))

hist(mean_sim_exe, prob = TRUE, xlab = "Average Exponentials", main = "Distribution of the Average Exponentials", breaks = 40)
lines(density(mean_sim_exe), col = "blue")
abline(v = mean(mean_sim_exe), lwd = "5", col = "blue")

qqnorm(mean_sim_exe)
qqline(mean_sim_exe, col = 4)

The bulk of the middle of the graph shows how our numbers align and follow a normal distribution. This is approximately normal with some variation out at the extremes of the values.

Part 2 – Basic inferential data analysis instructions

Load data and basic exploratory data analysis

We can load the data in, and then check out the first few rows, a summary of the dataset, and then plug it into the pipe/dplyr to get some general information about the treatment effects. We can use the ?ToothGrowth call to check out the background information. The pigs are receiving vitamin C either through orange juice or via supplement.

data("ToothGrowth")
tooth <- ToothGrowth

head(tooth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

summary(tooth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

ggplot(aes(x = dose, y = len), data = tooth) + 
  geom_boxplot(aes(fill = dose, group = dose)) + facet_grid(~ supp) + xlab("Dose") + ylab("Tooth Length") + ggtitle("Dose vs Tooth Length by Supplement Type")

Basic summary of the data

sum_dat <- tooth %>% group_by(supp, dose) %>% summarize(avg = mean(len))
sum_dat

## # A tibble: 6 x 3
## # Groups:   supp [?]
##     supp  dose   avg
##   <fctr> <dbl> <dbl>
## 1     OJ   0.5 13.23
## 2     OJ   1.0 22.70
## 3     OJ   2.0 26.06
## 4     VC   0.5  7.98
## 5     VC   1.0 16.77
## 6     VC   2.0 26.14

We can also see the averages by group (supp x dose) via this piped function. In general, I would say that higher does of either supplement increase the tooth length, however the OJ supplement appears to be more effective at the lower doses.

Confidence intervals/hypothesis tests to compare tooth growth by supp and dose.

We can now run a factorial ANOVA to determine if there are differences in the means between the treatment doses or the supplement type.

In order to satisfy our ANOVA we need normal distribution with equal variances. We can check the variances via the bartlett test, which shows non-significant results indicating no difference between variances.

bartlett.test(len ~ supp, data=tooth)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  len by supp
## Bartlett's K-squared = 1.4217, df = 1, p-value = 0.2331

bartlett.test(len ~ dose, data=tooth)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  len by dose
## Bartlett's K-squared = 0.66547, df = 2, p-value = 0.717

aov.out = aov(len ~ supp * dose, data=tooth)
options(show.signif.stars = F)
summary(aov.out)

##             Df Sum Sq Mean Sq F value   Pr(>F)
## supp         1  205.4   205.4  12.317 0.000894
## dose         1 2224.3  2224.3 133.415  < 2e-16
## supp:dose    1   88.9    88.9   5.333 0.024631
## Residuals   56  933.6    16.7

Our factorial ANOVA was significant (p < 0.05) for supplement, dose, and an interaction of supp by dose, indicating that there was a differential effect of the doses as well as the type of supplement. This confirms our visual test that there was a difference in the effectiveness of the different doses and the supplement type.

State conclusions and assumptions needed for conclusions

As mentioned before, we need normal distribution and equal variances. We determined that there was an effect of dose (higher doses were more effective) and supplement type (orange juice better at low doses than VitC alone).