PART 1

This is the project for the statistical inference class. In it, you will use simulation to explore inference and do some simple inferential data analysis. The project consists of two parts:

  1. Simulation exercises.
  2. Basic inferential data analysis.

You will create a report to answer the questions. Please use knitr to create the reports and convert to a pdf. (I will post a very simple introduction to knitr). Each pdf report should be no more than 2 pages with 3 pages of supporting appendix material if needed (code, figures, etcetera). The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also also 1/lambda. Set lambda = 0.2 for all of the simulations. In this simulation, you will investigate the distribution of averages of 40 exponential(0.2)s. Note that you will need to do a thousand or so simulated averages of 40 exponentials.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponential(0.2)s. You should

  1. Show where the distribution is centered at and compare it to the theoretical center of the distribution.
setwd("C:/Raghu/Rscipts/data")
library(lattice)
library(gplots)
## Warning: package 'gplots' was built under R version 3.2.2
## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:stats':
## 
##     lowess
lambda <- 0.2
n <- 40
simulations <- 1:2000  # must be <=2000 - since Shapiro-wilk test is used
set.seed(33) #helps reproducing (can be any number)
means <- data.frame(x = sapply(simulations, function(x) {mean(rexp(n, lambda))}))

samplMean <- mean(means$x) 
theoreticalMean <-(1/lambda) 

The distribution is centered at 5.0005418 (sampling mean) which is similar to the theoretical mean which is 5

  1. Show how variable it is and compare it to the theoretical variance of the distribution.
samplVarience <- var(means$x) 
samplSD <- sd(means$x) 
theoreticalVarience <- ((1/lambda)/sqrt(n))^2 
theoreticalSD <- (1/lambda)/sqrt(n) 

Sampling variance and standard deviation are 0.6564438 and 0.8102122 respectively.

Theoretical variance and standard deviation are 0.625 and 0.7905694 respectively.

  1. Show that the distribution is approximately normal.

To check the normality Shapiro-Wilk normality test is used. After performing a test a plot can give some support to the test.

shapiro.test(means$x) # p-value
## 
##  Shapiro-Wilk normality test
## 
## data:  means$x
## W = 0.99289, p-value = 2.909e-08

p < 0.05 means the data is represented by a normal curve, which is the case

tmp <- hist(means$x, freq=TRUE, main="Data histogram and normal curve",
    col="lightgreen", axes = FALSE, xlab='Mean', yaxt='n', ylab='Frequency')
axis(2, pos = 2.5)
axis(1,at=seq(0,8,1), pos=0)

multiplier <- tmp$counts / tmp$density
mydensity <- density(means$x)
mydensity$y <- mydensity$y * multiplier[1]

myx <- seq(2.5, 8, length.out= 100)
mymean <- mean(means$x)
mysd <- sd(means$x)

normal <- dnorm(x = myx, mean = mymean, sd = mysd)
lines(myx, normal * multiplier[1], col = "blue", lwd = 3)

  1. Evaluate the coverage of the confidence interval for 1/lambda: \(X\pm1.96S_n\).
# 95% confidence interval (Z=1.96)
mean(means$x) + c(-1,1)*1.96*sd(means$x)/sqrt(nrow(means)) 
## [1] 4.965033 5.036051

Note both sampling and theoretical mean are contained within confidence interval.

PART 2

Now in the second portion of the class, we’re going to analyze the ToothGrowth data in the R datasets package.

  1. Load the ToothGrowth data and perform some basic exploratory data analyses
setwd("C:/Raghu/Rscipts/data")
library(lattice)
library(gplots)

data(ToothGrowth)
nrow(ToothGrowth)
## [1] 60
ncol(ToothGrowth)
## [1] 3
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
ToothGrowth$dose #considering 0.5, 1 and 2 as scales
##  [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## [18] 1.0 1.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 0.5 0.5 0.5 0.5
## [35] 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0
## [52] 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
ToothGrowth$dose = factor(ToothGrowth$dose, levels=c(0.5,1.0,2.0),
                          labels=c("low","med","high"))
attach(ToothGrowth)
table(supp,dose)
##     dose
## supp low med high
##   OJ  10  10   10
##   VC  10  10   10
bwplot(len ~ dose | supp)

  1. Provide a basic summary of the data.
summary(ToothGrowth)
##       len        supp      dose   
##  Min.   : 4.20   OJ:30   low :20  
##  1st Qu.:13.07   VC:30   med :20  
##  Median :19.25           high:20  
##  Mean   :18.81                    
##  3rd Qu.:25.27                    
##  Max.   :33.90
aggregate(len,list(supp,dose), mean)
##   Group.1 Group.2     x
## 1      OJ     low 13.23
## 2      VC     low  7.98
## 3      OJ     med 22.70
## 4      VC     med 16.77
## 5      OJ    high 26.06
## 6      VC    high 26.14
aggregate(len,list(supp,dose), sd)
##   Group.1 Group.2        x
## 1      OJ     low 4.459709
## 2      VC     low 2.746634
## 3      OJ     med 3.910953
## 4      VC     med 2.515309
## 5      OJ    high 2.655058
## 6      VC    high 4.797731
  1. Use confidence intervals and hypothesis tests to compare tooth growth by supp and dose. (Use the techniques from class even if there’s other approaches worth considering)

As “dose” variable(factor) is used, Anova analysis will be useful.

anova <- aov(len ~ supp * dose, data=ToothGrowth)
TukeyHSD(anova)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = len ~ supp * dose, data = ToothGrowth)
## 
## $supp
##       diff       lwr       upr     p adj
## VC-OJ -3.7 -5.579828 -1.820172 0.0002312
## 
## $dose
##            diff       lwr       upr   p adj
## med-low   9.130  6.362488 11.897512 0.0e+00
## high-low 15.495 12.727488 18.262512 0.0e+00
## high-med  6.365  3.597488  9.132512 2.7e-06
## 
## $`supp:dose`
##                  diff        lwr        upr     p adj
## VC:low-OJ:low   -5.25 -10.048124 -0.4518762 0.0242521
## OJ:med-OJ:low    9.47   4.671876 14.2681238 0.0000046
## VC:med-OJ:low    3.54  -1.258124  8.3381238 0.2640208
## OJ:high-OJ:low  12.83   8.031876 17.6281238 0.0000000
## VC:high-OJ:low  12.91   8.111876 17.7081238 0.0000000
## OJ:med-VC:low   14.72   9.921876 19.5181238 0.0000000
## VC:med-VC:low    8.79   3.991876 13.5881238 0.0000210
## OJ:high-VC:low  18.08  13.281876 22.8781238 0.0000000
## VC:high-VC:low  18.16  13.361876 22.9581238 0.0000000
## VC:med-OJ:med   -5.93 -10.728124 -1.1318762 0.0073930
## OJ:high-OJ:med   3.36  -1.438124  8.1581238 0.3187361
## VC:high-OJ:med   3.44  -1.358124  8.2381238 0.2936430
## OJ:high-VC:med   9.29   4.491876 14.0881238 0.0000069
## VC:high-VC:med   9.37   4.571876 14.1681238 0.0000058
## VC:high-OJ:high  0.08  -4.718124  4.8781238 1.0000000
#Plotting
plotmeans(len~interaction(supp,dose,sep=" "), 
          connect=list(c(1,3,5), c(2,4,6)), col=c("red","darkgreen"), 
          xlab="Dose and Supp Combination", 
          main = "Interaction plot with 95% confidence intervals")

  1. State your conclusions and the assumptions needed for your conclusions.

Assumption: The sample components are independent and identically distributed.

Conclussion: Tukey contrast reveals there no statistical evidence for other ‘noisy’ variables which are relevant, therefore changes in dose and size (independent variables) won’t affect “len” (dependent variable)