This is a class project for Statistical Inference Course in Data Science Certification Program through John Hopkins University. This project conducts thousands of simulations of data of exponential distribution in R and assess how their means and variance compare with the theoretical ones. The second half of the project analyzes the ToothGrowth data and performs some basic exploratory analyses.
This project investigates the exponential distribution in R and compares it with the Central Limit Theorem. The exponential distribution was simulated in R using ‘rexp (n, lambda)’. If you are not sure, exponential distribution has a mean and standard deviation of 1 over lambda (1/ʎ). Lambda was set to (0.2) for all the simulations. This project investigates the distribution of averages of 40 exponentials using a thousand of simulations. This project attempted to answer following questions: 1. Does the sample mean vary from the theoretical mean? 2. Calculate the sample variance and compare it with the theoretical variance of the distribution. 3. Check if the distribution is approximately normal. The answer to question 3 should focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.
set.seed(-123)
n<-40
lambda<- 0.2
S<-1000
z<-1.96
mydata<-matrix(rexp(n*S,lambda),nrow=S)
str(mydata)
## num [1:1000, 1:40] 3.28 6.62 1.43 2.97 14.94 ...
Row_Mean<-rowMeans(mydata)
summary(Row_Mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.872 4.439 4.967 5.019 5.500 8.183
MeanOfMean<-mean(Row_Mean)
MeanOfMean
## [1] 5.018855
theoreticalMean<-1/lambda
theoreticalMean
## [1] 5
sampleVar<-var(Row_Mean)
sampleVar
## [1] 0.7073296
theoreticalVar<-(1/lambda)^2/(n)
theoreticalVar
## [1] 0.625
par(bg='grey')
hist(Row_Mean,
main="Histogram of Sample Data Distribution",
xlab="Mean",
xlim=c(2,8),
col="darkmagenta",
freq=FALSE
)
This second portion of the project analyzes the ToothGrowth data in the R dataset package and performs the following activities:
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(ToothGrowth)
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
ToothGrowth$dose<-as.factor(ToothGrowth$dose)
head(ToothGrowth$dose)
## [1] 0.5 0.5 0.5 0.5 0.5 0.5
## Levels: 0.5 1 2
SupMean=split(ToothGrowth$len,ToothGrowth$supp)
sapply(SupMean,mean)
## OJ VC
## 20.66333 16.96333
e<-ggplot(aes(x=supp,y=len),data=ToothGrowth)
e+geom_violin(aes(fill=supp),trim=FALSE)+
geom_boxplot(width=0.4)+scale_fill_manual(values=c("#E7B800", "#FC4E07"))+ geom_jitter(width = 0.1)+xlab("Supplement Type")+
ylab("Tooth Length")
Mean_dose=split(ToothGrowth$len,ToothGrowth$dose)
sapply(Mean_dose,mean)
## 0.5 1 2
## 10.605 19.735 26.100
e+ggtitle("Box-Plot Showing Dose of Supplement Type and Tooth Length")+
geom_boxplot(aes(fill=dose))+scale_fill_manual(values=c("#E7B800",
"#FC4E07","#00AFBB"))+xlab("Dose")+ylab("Tooth Length")
ToothGrowth %>%
group_by(supp,dose) %>%
summarise(Q25th_len=quantile(len,0.25),
Q50th_len=quantile(len,0.5),
Q75th_len=quantile(len,0.75),
average_lenth=mean(len),
SD_len=sd(len))->newtable
## `summarise()` regrouping output by 'supp' (override with `.groups` argument)
newtable
## # A tibble: 6 x 7
## # Groups: supp [2]
## supp dose Q25th_len Q50th_len Q75th_len average_lenth SD_len
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 OJ 0.5 9.7 12.2 16.2 13.2 4.46
## 2 OJ 1 20.3 23.5 25.6 22.7 3.91
## 3 OJ 2 24.6 26.0 27.1 26.1 2.66
## 4 VC 0.5 5.95 7.15 10.9 7.98 2.75
## 5 VC 1 15.3 16.5 17.3 16.8 2.52
## 6 VC 2 23.4 26.0 28.8 26.1 4.80
test=list()
dose=c(0.5,1,2)
for(m in dose){
Moj=ToothGrowth$len[ToothGrowth$dose==m & ToothGrowth$supp=="OJ"]
Mvc=ToothGrowth$len[ToothGrowth$dose==m & ToothGrowth$supp=="VC"]
t<-t.test(Moj,Mvc)
id<-paste0("OJ","-","VC",",",m)
test<-rbind(test,list(id=id,p.value=t$p.value,CI.LOW=t$conf.int[1],
CI.HIGH=t$conf.int[2]))
}
test
## id p.value CI.LOW CI.HIGH
## [1,] "OJ-VC,0.5" 0.006358607 1.719057 8.780943
## [2,] "OJ-VC,1" 0.001038376 2.802148 9.057852
## [3,] "OJ-VC,2" 0.9638516 -3.79807 3.63807