Overview

Part 1: Simulation Exercise Instructions

In this project we will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda, and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. We will investigate the distribution of averages of 40 exponentials.
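
As a quick sanity check of the 1/lambda claims above, the following sketch draws one large exponential sample and compares its empirical mean and standard deviation with 1/lambda. The seed and the sample size of 100000 are arbitrary choices for illustration and are not part of the assignment.

```r
# Illustrative check only: the seed and sample size are assumptions
set.seed(1)
lambda <- 0.2
draws <- rexp(100000, rate = lambda)

# Both the empirical mean and sd should be close to 1 / lambda = 5
c(mean = mean(draws), sd = sd(draws), theoretical = 1 / lambda)
```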

1. Show the sample mean and compare it to the theoretical mean of the distribution.

set.seed(262018)
mns <- NULL
n <- 40
lambda <- 0.2
# Draw 1000 samples of n exponentials and keep the mean of each sample
for (i in 1:1000) mns <- c(mns, mean(rexp(n, rate = lambda)))
hist(mns, main = "Sample Mean for Exponential Function", col = "blue")
abline(v = mean(mns), col = "red", lwd = 4)

sample_mean <- mean(mns)
paste("sample mean is ", sample_mean)
## [1] "sample mean is  5.05353749083723"
t_mean <- 1/lambda
paste("theoretical mean is ", t_mean)
## [1] "theoretical mean is  5"

2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

sample_var <- var(mns)
# Variance of the mean of n draws is sigma^2 / n, with sigma = 1/lambda
t_var <- (1/lambda)^2/n
paste0("Sample variance is ", sample_var)
## [1] "Sample variance is 0.628141741217932"
paste0("Theoretical variance is ", t_var)
## [1] "Theoretical variance is 0.625"
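
The theoretical value follows from the fact that the mean of n independent draws has variance sigma^2 / n. The sketch below works through that arithmetic and compares it with the sample variance reported above; the variable names are illustrative only.

```r
lambda <- 0.2
n <- 40

sigma <- 1 / lambda            # sd of a single exponential draw
var_of_mean <- sigma^2 / n     # variance of the mean of n independent draws
se_of_mean <- sqrt(var_of_mean)

# Sample variance copied from the output above
observed_var <- 0.628141741217932
rel_diff <- (observed_var - var_of_mean) / var_of_mean

c(variance = var_of_mean, std_error = se_of_mean, relative_difference = rel_diff)
```

With these inputs the theoretical variance is 25 / 40 = 0.625, and the simulated variance differs from it by about 0.5%.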

3. Show that the distribution is approximately normal.

m <- mean(mns)
std <- sd(mns)
hist(mns, xlab = "x-variable", prob = TRUE,
     main = "normal curve over histogram")
x <- seq(min(mns), max(mns), length.out = 2*1000)
# Sample curve: normal density with the observed mean and sd
y <- dnorm(x, mean = m, sd = std)
lines(x, y, col = "blue")
# Theoretical curve: normal density with mean 1/lambda and sd (1/lambda)/sqrt(n)
t_std <- (1/lambda)/sqrt(n)
y <- dnorm(x, mean = 1/lambda, sd = t_std)
lines(x, y, col = "red")
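
A normal Q-Q plot is another common way to judge approximate normality. The sketch below is an optional addition, not part of the original analysis; it regenerates the sample means so that it runs on its own, using replicate() in place of the loop.

```r
# Recreate 1000 means of 40 exponentials so the block is self-contained
set.seed(262018)
n <- 40
lambda <- 0.2
mns <- replicate(1000, mean(rexp(n, rate = lambda)))

# Points lying close to the reference line suggest approximate normality
qqnorm(mns, main = "Normal Q-Q plot of sample means")
qqline(mns, col = "red")
```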

Conclusion

The sample mean and variance are very close to the theoretical mean and variance (5.05 vs. 5 and 0.628 vs. 0.625). We also see that the distribution of sample means closely follows the theoretical normal distribution.

Part 2: Basic Inferential Data Analysis Instructions

Overview

Now in the second portion of the project, we’re going to analyze the ToothGrowth data in the R datasets package.

- Load the ToothGrowth data and perform some basic exploratory data analyses.
- Provide a basic summary of the data.
- Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there are other approaches worth considering.)
- State your conclusions and the assumptions needed for your conclusions.

library(ggplot2)
data("ToothGrowth")
summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
dim(ToothGrowth)
## [1] 60  3
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
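# Note: stat = "identity" stacks the 10 observations in each dose/supp cell,
# so each bar's height is the summed length for that cell
p <- ggplot(ToothGrowth, aes(factor(dose), len, fill = factor(supp)))
p <- p + geom_bar(stat = "identity")
p <- p + facet_grid(. ~ supp)
p <- p + xlab("dose") + ylab("Length") + ggtitle("Growth by supplement")
print(p)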

Visual inspection of data and plot

First of all, tooth length clearly increases with dose for both supplements (OJ and VC). Also, at doses 0.5 and 1, OJ appears more effective than VC.
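
The visual impression can be backed up with group means. The sketch below uses base R's aggregate() to tabulate mean tooth length for each supplement and dose combination; it is an illustrative addition, and the name group_means is hypothetical.

```r
data("ToothGrowth")

# Mean tooth length for each supplement/dose combination (6 cells of 10 rows)
group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
group_means
```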

Supplement vs. Growth

t.test(len~supp,data=ToothGrowth)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

Since the p-value is 0.06 > 0.05 and the 95 percent confidence interval (-0.17, 7.57) contains zero, we fail to reject the null hypothesis. Across all doses combined, there is no statistically significant difference in mean tooth length between the two supplements.
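
To make the test above less of a black box, the following sketch recomputes the Welch t statistic, the Welch-Satterthwaite degrees of freedom, and the 95 percent confidence interval by hand. The variable names are illustrative; the results should agree with the t.test() output shown above.

```r
data("ToothGrowth")
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Per-group variance of the mean (unequal variances allowed)
v_oj <- var(oj) / length(oj)
v_vc <- var(vc) / length(vc)

# Standard error of the difference in means and the t statistic
se_diff <- sqrt(v_oj + v_vc)
t_stat <- (mean(oj) - mean(vc)) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (v_oj + v_vc)^2 /
  (v_oj^2 / (length(oj) - 1) + v_vc^2 / (length(vc) - 1))

# 95 percent confidence interval for the difference in means (OJ - VC)
ci <- (mean(oj) - mean(vc)) + c(-1, 1) * qt(0.975, df) * se_diff

list(t = t_stat, df = df, conf_int = ci)
```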

Dose vs. Growth

toothSubset<-subset(ToothGrowth, ToothGrowth$dose %in% c(.5,1))
t.test(len~supp,data=toothSubset)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 3.0503, df = 36.553, p-value = 0.004239
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.875234 9.304766
## sample estimates:
## mean in group OJ mean in group VC 
##           17.965           12.375
toothSubset<-subset(ToothGrowth, ToothGrowth$dose %in% c(.5,1))
t.test(len~dose,data=toothSubset)
## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean in group 0.5   mean in group 1 
##            10.605            19.735
toothSubset<-subset(ToothGrowth, ToothGrowth$dose %in% c(1,2))
t.test(len~dose,data=toothSubset)
## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean in group 1 mean in group 2 
##          19.735          26.100
toothSubset<-subset(ToothGrowth, ToothGrowth$dose %in% c(.5,2))
t.test(len~dose,data=toothSubset)
## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.15617 -12.83383
## sample estimates:
## mean in group 0.5   mean in group 2 
##            10.605            26.100

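In all three pairwise dose comparisons, the p-value is far below 0.05 and the confidence interval excludes zero, so we reject the null hypothesis: mean tooth length increases significantly with dose. Restricted to the 0.5 and 1 doses, OJ also yields significantly greater tooth length than VC (p = 0.004).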
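
Because three dose comparisons are made on the same data, one may want to adjust for multiple testing. The sketch below applies a Bonferroni correction with base R's p.adjust() to the p-values copied from the output above. This is an optional check under the assumption that such a correction is wanted, not a step the original analysis performs.

```r
# p-values copied from the three dose comparisons above
p_raw <- c("0.5 vs 1" = 1.268e-07, "1 vs 2" = 1.906e-05, "0.5 vs 2" = 4.398e-14)

# Bonferroni multiplies each p-value by the number of tests (capped at 1)
p_adj <- p.adjust(p_raw, method = "bonferroni")
p_adj

# TRUE means every comparison stays significant at the 0.05 level
all(p_adj < 0.05)
```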

Conclusion

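While there is no statistically significant overall difference in tooth growth between supplements (p = 0.06), there is a strong relationship between dose and tooth growth, as shown by p-values far below 0.05 in every dose comparison. These conclusions assume that the groups are independent (unpaired) samples and that tooth length is roughly normally distributed within each group; the Welch test used here does not assume equal variances.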