Part 1: Simulation Exercise

In this project we investigate the exponential distribution in R and compare simulated results with what the Central Limit Theorem predicts. The mean of the exponential distribution is 1/lambda and its standard deviation is also 1/lambda. We set lambda = 0.2 for all simulations. We investigate the distribution of averages of 40 exponentials using one thousand simulations. By the Central Limit Theorem, these averages should be approximately normal with mean 1/lambda = 5 and standard deviation (1/lambda)/sqrt(40), about 0.79.

1. Show the sample mean and compare it to the theoretical mean of the distribution.

knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
n<- 40
lambda <- 0.2
simdata<- matrix(rexp(1000*n,lambda),nrow = 1000, ncol = n)
mean_dist<- apply(simdata, 1, mean)
hist(mean_dist,breaks = 50, main = "The distribution of 1000 averages of 40 random exponentials", xlab = "Means", ylab = "Frequency" )
abline(v= 1/lambda, lty = 1, lwd = 3, col = "blue")
legend("topright", lty = 1, lwd = 3, col = "blue", legend = "Theoretical Mean")

sample_mean<- mean(mean_dist)
sample_mean

## [1] 5.033899

theoretical_mean<- 1/lambda
theoretical_mean

## [1] 5

The mean of the 1000 simulated averages (5.034) is close to the theoretical mean of 5, and the histogram is roughly bell-shaped and centered near that value.

2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

sample_var<- var(mean_dist)
sample_var

## [1] 0.6569091

theoretical_var<- (1/lambda)^2/n
theoretical_var

## [1] 0.625

The variance of the simulated averages (0.657) is close to the theoretical variance of the mean of 40 exponentials, (1/lambda)^2/n = 0.625.
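The comparison of simulated and theoretical values can also be collected in one small table. The following is a minimal sketch, not the report's original code: the seed (2019) and the names `nsim`, `sim_means`, and `comparison` are assumptions introduced here for illustration, and because the report does not state a seed, the numbers it prints will differ slightly from those shown above.

```r
# Hypothetical seed, chosen only so this sketch is reproducible;
# the original report does not state one.
set.seed(2019)

n      <- 40     # exponentials per average
lambda <- 0.2    # rate parameter
nsim   <- 1000   # number of simulated averages

# Each row holds one sample of n exponentials; rowMeans() averages each row.
sim_means <- rowMeans(matrix(rexp(nsim * n, lambda), nrow = nsim, ncol = n))

# Theoretical mean and variance of the average of n iid exponentials.
theo_mean <- 1 / lambda
theo_var  <- (1 / lambda)^2 / n

comparison <- data.frame(
  quantity    = c("mean", "variance"),
  simulated   = c(mean(sim_means), var(sim_means)),
  theoretical = c(theo_mean, theo_var)
)

# Relative difference between simulated and theoretical values.
comparison$rel_diff <-
  (comparison$simulated - comparison$theoretical) / comparison$theoretical

print(comparison)
```

Setting a seed makes the simulated figures repeatable between runs, which is useful when the text quotes specific values.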

3. Show that the distribution is approximately normal.

x <- seq(min(mean_dist), max(mean_dist), length.out = 100)
y <- dnorm(x, mean = theoretical_mean, sd = (1/lambda)/sqrt(n))
# prob = TRUE puts the histogram on the density scale, so the normal curve is comparable
hist(mean_dist, breaks = n, prob = TRUE, xlab = "Means", ylab = "Density", main = "Density of Means")
lines(x, y, lty = 5, col = "red")
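A normal Q-Q plot is a complementary way to judge approximate normality. The sketch below is an illustration under stated assumptions rather than part of the original analysis: it regenerates the averages with a hypothetical seed (2019), and the names `sim_means` and `z` are introduced here. The averages are standardized with the theoretical mean and standard error, so points close to the line y = x indicate agreement with the standard normal distribution.

```r
# Hypothetical seed for reproducibility; not taken from the original report.
set.seed(2019)

n      <- 40
lambda <- 0.2

# 1000 averages of n exponentials each.
sim_means <- rowMeans(matrix(rexp(1000 * n, lambda), nrow = 1000, ncol = n))

# Standardize using the theoretical mean and standard error of the average.
z <- (sim_means - 1 / lambda) / ((1 / lambda) / sqrt(n))

# Sample quantiles against standard normal quantiles.
qqnorm(z, main = "Normal Q-Q plot of standardized means")

# Reference line y = x: where the points would fall under an exact N(0, 1).
abline(a = 0, b = 1, col = "red", lty = 2)
```

A mild upward bend in the right tail would not be surprising, since averages of 40 exponentials still retain some right skew.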

Part 2: Analyze the Tooth Growth data in the R datasets package

1. Load the Tooth Growth data

library(stats)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data("ToothGrowth")

2. Summary of data

dim(ToothGrowth)

## [1] 60  3

str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

The ToothGrowth data set has 60 observations of 3 variables:

- len: tooth length, numeric
- supp: supplement type, a factor with two levels, OJ (orange juice) and VC (vitamin C)
- dose: dosage of the supplement, numeric

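# qplot() is deprecated in recent ggplot2 releases; this is the equivalent ggplot() call
ggplot(ToothGrowth, aes(x = supp, y = len)) + geom_point() + geom_boxplot(aes(fill = supp)) + facet_wrap(~dose) + labs(title = "Tooth growth by supplement type and dosage", x = "Supplement", y = "Tooth Length")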

3. Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose.

We use two-sample t-tests on the data. We first split the data into three groups by dose level (0.5, 1.0 and 2.0), so that the two supplements, OJ and VC, can be compared within each dose.

dose_0.5<-filter(ToothGrowth, dose== 0.5)
dose_1.0<-filter(ToothGrowth, dose== 1.0)
dose_2.0<-filter(ToothGrowth, dose== 2.0)

Now we test whether OJ and VC at the same dose produce a statistically significant difference in mean tooth length.

t.test(len~supp, paired = FALSE, data = dose_0.5)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC 
##            13.23             7.98

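Because the p-value (0.006) is below 0.05, we reject the null hypothesis of equal means in favor of the alternative hypothesis. At a dose of 0.5 there is a statistically significant difference in mean tooth length between the supplements: the 95% confidence interval for the difference (OJ minus VC), 1.72 to 8.78, lies entirely above zero.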

Now let's run the same test for a dose of 1.0.

t.test(len~supp, paired = FALSE, data = dose_1.0)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC 
##            22.70            16.77

Again the p-value (0.001) is below 0.05, so we reject the null hypothesis in favor of the alternative hypothesis: at a dose of 1.0, mean tooth length is significantly greater with OJ than with VC.

Now we run the same test for dosage level 2.

t.test(len~supp, paired = FALSE, data = dose_2.0)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean in group OJ mean in group VC 
##            26.06            26.14

In this case the p-value (0.96) is well above 0.05, so we fail to reject the null hypothesis. There is no statistically significant difference in mean tooth length between the supplements at a dose of 2.0, and the 95% confidence interval (-3.80 to 3.64) contains zero.
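The three tests above can be gathered into a single table, which makes the comparison across doses easier to read. This is a minimal sketch using base R only; the names `doses` and `results` and the column labels are introduced here for illustration. The `paired` argument is left out because an unpaired Welch test is the default for `t.test()`.

```r
data("ToothGrowth")

# Dose levels present in the data.
doses <- sort(unique(ToothGrowth$dose))

# Run a Welch two-sample t-test of len by supp within each dose
# and keep the group means, the 95% confidence interval and the p-value.
results <- do.call(rbind, lapply(doses, function(d) {
  tt <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == d, ])
  data.frame(
    dose     = d,
    mean_OJ  = unname(tt$estimate[1]),
    mean_VC  = unname(tt$estimate[2]),
    ci_lower = tt$conf.int[1],
    ci_upper = tt$conf.int[2],
    p_value  = tt$p.value
  )
}))

print(results)
```

Each row corresponds to one of the tests shown above, so the p-values and interval bounds should match the individual outputs.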

4. Conclusions and the assumptions needed for your conclusions.

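We assume that tooth length is approximately normally distributed in the population as a whole and within each supplement and dose group. We also treat the OJ and VC groups as independent samples (the tests are unpaired), and the Welch test does not require equal variances. The conclusion is that at doses of 0.5 and 1.0 the p-value is below 0.05, so we reject the null hypothesis in favor of the alternative hypothesis: OJ is associated with greater mean tooth length than VC. At a dose of 2.0 the p-value is above 0.05, so we fail to reject the null hypothesis and find no statistically significant difference between the two supplements.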

Statistical Inference Course Project1

priya malhotra

February 25, 2019