Statistical Inference Course Project

Introduction

This project has two parts. Part 1 simulates averages of exponential random variables to show where their distribution is centered and how variable it is, compares both to the theoretical mean and variance, and checks that the distribution is approximately normal. Part 2 performs an exploratory data analysis of the ToothGrowth data, with at least one plot or table highlighting basic features of the data, followed by tests and confidence intervals whose results are interpreted in the context of the problem.

Part 1: Simulation Exercise

1. Show the sample mean and compare it to the theoretical mean of the distribution.

Set lambda = 0.2. The exponential distribution has mean 1/lambda and standard deviation 1/lambda. We study the distribution of averages of 40 exponentials, based on 1000 simulations.

library(ggplot2)

set.seed(12)
n <- 40
lambda <- 0.2

Mean simulation

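# Each column holds one simulation of n exponential draws
simulation_data <- replicate(1000, rexp(n, lambda))
# Average each column to get 1000 simulated means
mean_simulation <- colMeans(simulation_data)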

Sample Mean

sample_mean <- mean(mean_simulation)
sample_mean

## [1] 5.010015

Theoretical Mean

theoretical_mean <- 1/lambda
theoretical_mean

## [1] 5

Plot the simulated means, with the sample mean (red) and the theoretical mean (yellow) marked

hist(mean_simulation, xlab = "mean", main = "Exponential Function Simulations")
abline(v = sample_mean, col = "red")
abline(v = theoretical_mean, col = "yellow")

2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

Calculate the theoretical standard deviation and variance of the sample mean

expected_sd <- (1/lambda)/sqrt(n)
expected_var <- expected_sd^2
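
For reference, the two theoretical values follow from the standard result for the mean of n independent draws, each with standard deviation sigma = 1/lambda. Worked through with lambda = 0.2 and n = 40:

```latex
\[
\operatorname{sd}(\bar{X}) = \frac{\sigma}{\sqrt{n}} = \frac{1/\lambda}{\sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.791,
\qquad
\operatorname{Var}(\bar{X}) = \frac{\sigma^{2}}{n} = \frac{25}{40} = 0.625
\]
```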

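Calculate the standard deviation and variance of the simulated means

# Use distinct names so the base functions sd() and var() are not masked
sample_sd <- sd(mean_simulation)
sample_var <- var(mean_simulation)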
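
To make the comparison in this step visible, the sample and theoretical values can be printed side by side. The following is a minimal sketch that re-runs the simulation with the same seed so that it is self-contained. The data frame name `comparison` is an illustrative choice and not part of the original analysis.

```r
# Re-create the simulation so this block runs on its own
set.seed(12)
n <- 40
lambda <- 0.2
mean_simulation <- colMeans(replicate(1000, rexp(n, lambda)))

# Theoretical vs. sample spread of the simulated means
comparison <- data.frame(
  statistic   = c("standard deviation", "variance"),
  theoretical = c((1 / lambda) / sqrt(n), (1 / lambda)^2 / n),
  sample      = c(sd(mean_simulation), var(mean_simulation))
)
print(comparison)
```

With 1000 simulations the two columns should be close but not identical, since the sample values are themselves estimates.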

3. Show that the distribution is approximately normal.

Graph the distribution of the simulated means along with the theoretical normal density (blue curve)

smd <- seq(min(mean_simulation), max(mean_simulation), length=100)
smd_graph <- dnorm(smd, mean=theoretical_mean, sd=expected_sd)

# prob = TRUE puts the histogram on the density scale so the curve is comparable
hist(mean_simulation, 
  breaks = n, prob = TRUE, 
  xlab = "means", 
  ylab = "density", 
  main = "Density of Means")

lines(smd, smd_graph, col = "blue", lty = 5)
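
The visual comparison can be complemented with a simple numeric check. As a rough sketch (an addition, not part of the original analysis), the code below computes the share of simulated means that fall inside the central 95% band of the theoretical normal distribution. If the normal approximation is reasonable, that share should be near 0.95. The names `lower`, `upper`, and `coverage` are illustrative.

```r
# Re-create the simulation so this block runs on its own
set.seed(12)
n <- 40
lambda <- 0.2
mean_simulation <- colMeans(replicate(1000, rexp(n, lambda)))

theoretical_mean <- 1 / lambda
expected_sd <- (1 / lambda) / sqrt(n)

# Bounds of the central 95% band of N(theoretical_mean, expected_sd^2)
lower <- qnorm(0.025, mean = theoretical_mean, sd = expected_sd)
upper <- qnorm(0.975, mean = theoretical_mean, sd = expected_sd)

# Proportion of simulated means inside the band
coverage <- mean(mean_simulation >= lower & mean_simulation <= upper)
coverage
```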

Part 2: Basic Inferential Data Analysis

1. Load the ToothGrowth data and conduct exploratory data analysis

library(datasets)
library(ggplot2)

List the column names and the first rows

colnames(ToothGrowth)

## [1] "len"  "supp" "dose"

head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

2. Provide a basic summary of the data

List summary statistics

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

Plot tooth length by dose, split by supplement type

ToothGrowth$dose <- as.factor(ToothGrowth$dose)
ggplot(aes(x=dose, y=len), data = ToothGrowth) +
  geom_boxplot(aes(fill=dose)) +
  ggtitle("Tooth Length by Dose of Vitamin C") +
  xlab("Dose") +
  ylab("Tooth Length") +
  facet_grid(~supp) +
  theme(plot.title = element_text(lineheight = .9, face = "bold"))

Fit a two-way ANOVA with interaction

anova <- aov(len ~ supp * dose, data = ToothGrowth)
summary(anova)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## supp         1  205.4   205.4  15.572 0.000231 ***
## dose         2 2426.4  1213.2  92.000  < 2e-16 ***
## supp:dose    2  108.3    54.2   4.107 0.021860 *  
## Residuals   54  712.1    13.2                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Run Tukey's HSD test to find the pairwise comparisons that are not significant (adjusted p-value > 0.05); four of the supp:dose comparisons fall in this group

TukeyHSD(anova)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = len ~ supp * dose, data = ToothGrowth)
## 
## $supp
##       diff       lwr       upr     p adj
## VC-OJ -3.7 -5.579828 -1.820172 0.0002312
## 
## $dose
##         diff       lwr       upr   p adj
## 1-0.5  9.130  6.362488 11.897512 0.0e+00
## 2-0.5 15.495 12.727488 18.262512 0.0e+00
## 2-1    6.365  3.597488  9.132512 2.7e-06
## 
## $`supp:dose`
##                diff        lwr        upr     p adj
## VC:0.5-OJ:0.5 -5.25 -10.048124 -0.4518762 0.0242521
## OJ:1-OJ:0.5    9.47   4.671876 14.2681238 0.0000046
## VC:1-OJ:0.5    3.54  -1.258124  8.3381238 0.2640208
## OJ:2-OJ:0.5   12.83   8.031876 17.6281238 0.0000000
## VC:2-OJ:0.5   12.91   8.111876 17.7081238 0.0000000
## OJ:1-VC:0.5   14.72   9.921876 19.5181238 0.0000000
## VC:1-VC:0.5    8.79   3.991876 13.5881238 0.0000210
## OJ:2-VC:0.5   18.08  13.281876 22.8781238 0.0000000
## VC:2-VC:0.5   18.16  13.361876 22.9581238 0.0000000
## VC:1-OJ:1     -5.93 -10.728124 -1.1318762 0.0073930
## OJ:2-OJ:1      3.36  -1.438124  8.1581238 0.3187361
## VC:2-OJ:1      3.44  -1.358124  8.2381238 0.2936430
## OJ:2-VC:1      9.29   4.491876 14.0881238 0.0000069
## VC:2-VC:1      9.37   4.571876 14.1681238 0.0000058
## VC:2-OJ:2      0.08  -4.718124  4.8781238 1.0000000
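
Rather than reading the non-significant pairs off the table by eye, they can be filtered programmatically. The following is a small sketch under the same model as above. It works on a copy named `tooth`, and `fit`, `interaction_pairs`, and `not_significant` are illustrative names that do not appear in the original analysis.

```r
library(datasets)

# Work on a copy so the built-in data set is left untouched
tooth <- datasets::ToothGrowth
tooth$dose <- as.factor(tooth$dose)

# Same two-way model with interaction as in the analysis above
fit <- aov(len ~ supp * dose, data = tooth)

# Matrix of pairwise comparisons for the supp:dose interaction
interaction_pairs <- TukeyHSD(fit)[["supp:dose"]]

# Keep only comparisons whose adjusted p-value exceeds 0.05
not_significant <- interaction_pairs[interaction_pairs[, "p adj"] > 0.05, , drop = FALSE]
not_significant
```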

Check the normality assumption with a normal Q-Q plot of the residuals

plot(anova, 2)

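Tooth length increases with the dose of vitamin C: every pairwise difference between doses is significant (adjusted p < 0.001). The supplement type also matters overall, with OJ giving longer teeth than VC by 3.7 on average (p = 0.0002), but the significant supp:dose interaction (p = 0.022) shows that this depends on dose. OJ outperforms VC at doses 0.5 and 1, while at dose 2 the two supplements are indistinguishable (difference 0.08, adjusted p = 1.0). These conclusions assume that subjects were randomly assigned to the groups and that the residuals are approximately normal. In the normal Q-Q plot, observations 32 and 49 (OJ) and observation 23 (VC) appear as outliers.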
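
The supplement comparison at each dose can also be expressed as confidence intervals. As one possible sketch (an addition, not the method used above), the code below runs a Welch two-sample t-test of OJ against VC within each dose and collects the 95% interval for the difference in mean length. The names `doses` and `ci_by_dose` are illustrative, and these intervals are not adjusted for multiple comparisons, unlike the Tukey intervals.

```r
library(datasets)

# Use the untouched built-in data, where dose is still numeric
tooth <- datasets::ToothGrowth
doses <- sort(unique(tooth$dose))

# One Welch t-test (OJ minus VC) per dose level
ci_by_dose <- do.call(rbind, lapply(doses, function(d) {
  subset_d <- tooth[tooth$dose == d, ]
  tt <- t.test(len ~ supp, data = subset_d)  # unequal variances by default
  data.frame(
    dose    = d,
    lower   = tt$conf.int[1],
    upper   = tt$conf.int[2],
    p_value = tt$p.value
  )
}))
ci_by_dose
```

An interval that excludes zero indicates a difference between the supplements at that dose, at the 5% level for that single test.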