Statistical inference: Peer Assessment

1. Simulation Exercise

This section investigates the exponential distribution in R and compares it with the Central Limit Theorem. The exponential distribution is simulated with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda and its standard deviation is also 1/lambda. lambda is set to 0.2 for all simulations. Each average is taken over 40 exponentials, and the investigation uses 1000 simulations.

1.1 Simulating the means and plotting them

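library(data.table)
library(ggplot2)

lam <- 0.2    # rate parameter lambda
n <- 40       # exponentials per average
nsim <- 1000  # number of simulated averages
set.seed(1)
# Each entry is the mean of n draws from an exponential with rate lam
means <- data.table(x = sapply(1:nsim, function(i) mean(rexp(n, lam))))
ggplot(data = means, aes(x = x)) +
  geom_histogram()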

1.2 Compare sample mean and theoretical mean of the distribution.

Meansmpl <- mean(means$x)
Mean <- 1/lam
Meansmpl

## [1] 4.990025

Mean

## [1] 5

1.3 Compare the sample variance (via var) and the theoretical variance of the distribution. 1.5 Is the distribution normal?

Varsmpl <- var(means$x)
# Theoretical variance of the mean of n exponentials: (1/lambda)^2 / n
Var <- (1/lam)^2/n
Varsmpl

## [1] 0.6111165

Var

## [1] 0.625
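
The theoretical value used above follows from the fact that the variance of an average of n independent draws is the variance of a single draw divided by n. A minimal base-R sketch of that arithmetic, assuming the same lambda = 0.2, n = 40 and seed as in this section (the names sigma2, var_of_mean, sd_of_mean, sim_means and comparison are illustrative and not part of the original analysis):

```r
lambda <- 0.2   # rate parameter, as in the simulations
n <- 40         # exponentials per average

sigma2 <- (1 / lambda)^2         # variance of a single exponential draw
var_of_mean <- sigma2 / n        # variance of the mean of n draws
sd_of_mean <- sqrt(var_of_mean)  # standard deviation of the mean

# Re-run the simulation in base R to compare against the theory
set.seed(1)
sim_means <- sapply(1:1000, function(i) mean(rexp(n, lambda)))

comparison <- data.frame(
  quantity = c("mean", "variance"),
  theoretical = c(1 / lambda, var_of_mean),
  simulated = c(mean(sim_means), var(sim_means))
)
print(comparison)
```

The simulated column should sit close to the theoretical one, and the gap is expected to shrink as the number of simulated averages grows.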

ggplot(data = means, aes(x = x)) +
  geom_histogram(aes(y = after_stat(density))) +
  geom_density(color = "red", linewidth = 1) + # kernel density estimate of the simulated means
  labs(x = "Mean")
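
The red curve above is a smoothed estimate of the simulated data, so on its own it cannot show agreement with the normal distribution. One possible check, sketched here in base R graphics to stay dependency-free, overlays the normal density with mean 1/lambda and standard deviation (1/lambda)/sqrt(n) that the Central Limit Theorem predicts, and compares a few empirical quantiles with the normal ones. All variable names are illustrative assumptions.

```r
lambda <- 0.2
n <- 40
set.seed(1)
sim_means <- sapply(1:1000, function(i) mean(rexp(n, lambda)))

mu <- 1 / lambda               # mean predicted by the CLT
se <- (1 / lambda) / sqrt(n)   # standard deviation predicted by the CLT

# Histogram on the density scale with the predicted normal curve on top
hist(sim_means, breaks = 30, freq = FALSE, xlab = "Mean",
     main = "Simulated means vs. normal density")
curve(dnorm(x, mean = mu, sd = se), add = TRUE, col = "blue", lwd = 2)

# Share of standardized means inside +/-1.96; a normal distribution gives about 0.95
z <- (sim_means - mu) / se
coverage <- mean(abs(z) < 1.96)

# Empirical vs. normal quantiles at a few probabilities
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
quantile_check <- data.frame(
  prob = probs,
  simulated = unname(quantile(sim_means, probs)),
  normal = qnorm(probs, mean = mu, sd = se)
)
print(coverage)
print(quantile_check)
```

If the simulated quantiles track the normal ones closely and the coverage is near 0.95, the distribution of means is approximately normal.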

2. Basic Inferential Data Analysis. Analysis of the ToothGrowth data in the R datasets package.

The first step is to load the ToothGrowth data and perform some basic exploratory data analysis. The second step is to provide a basic summary of the data. The final step is to compare tooth growth by supp and dose.

2.1 Load the ToothGrowth data and perform some basic exploratory data analyses

tg <-  ToothGrowth
str(tg)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

unique(tg$dose)

## [1] 0.5 1.0 2.0

unique(tg$supp)

## [1] VC OJ
## Levels: OJ VC

unique(tg$len)

##  [1]  4.2 11.5  7.3  5.8  6.4 10.0 11.2  5.2  7.0 16.5 15.2 17.3 22.5 13.6 14.5
## [16] 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5 17.6  9.7  8.2
## [31]  9.4 19.7 20.0 25.2 25.8 21.2 27.3 22.4 24.5 24.8 30.9 29.4 23.0

2.2 Provide a basic summary of the data. Visualisation

s <- summary(tg)
s

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

ggplot(tg, aes(x = factor(dose), y = len, fill = factor(dose))) +
  geom_boxplot(notch = FALSE) +
  facet_grid(. ~ supp) +
  scale_x_discrete("Dose") +
  scale_y_continuous("Tooth growth") +
  scale_fill_discrete(name = "Dose (mg)") +
  ggtitle("Comparison of tooth growth by supp and dose") +
  geom_jitter(width = 0.1, alpha = 0.2)
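
The overall summary pools both supplements and all doses. A per-group table of means, sketched below with base R's aggregate(), may make the pattern in the boxplots easier to read. The names group_means and group_sizes are illustrative.

```r
tg <- ToothGrowth

# Mean tooth length for every supp x dose combination
group_means <- aggregate(len ~ supp + dose, data = tg, FUN = mean)

# Number of observations in each combination
group_sizes <- aggregate(len ~ supp + dose, data = tg, FUN = length)

print(group_means)
print(group_sizes)
```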

2.3 Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose.

library(dplyr)

# Convert all factor columns (here: supp) to character
fac_cols <- sapply(tg, is.factor)
tg[fac_cols] <- lapply(tg[fac_cols], as.character)

# One subset per dose level
dos_0.5 <- tg %>%
  filter(dose == 0.5)
dos_1 <- tg %>%
  filter(dose == 1)
dos_2 <- tg %>%
  filter(dose == 2)

t05 <- t.test(len ~ supp, 
              data = dos_0.5, 
              var.equal = FALSE)
t05

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
##  1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC 
##            13.23             7.98

t1 <- t.test(len ~ supp, 
              data = dos_1, 
              var.equal = FALSE)
t1

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
##  2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC 
##            22.70            16.77

t2 <- t.test(len ~ supp, 
              data = dos_2, 
              var.equal = FALSE)
t2

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean in group OJ mean in group VC 
##            26.06            26.14

tstat_sum <- data.frame(
  "p-value" = c(t05$p.value, t1$p.value, t2$p.value),
  "con_interval_low" = c(t05$conf.int[1],t1$conf.int[1], t2$conf.int[1]),
  "con_interval_high" = c(t05$conf.int[2],t1$conf.int[2], t2$conf.int[2]),
  row.names = c("dose_05","dose_1","dose_2"))
tstat_sum

##             p.value con_interval_low con_interval_high
## dose_05 0.006358607         1.719057          8.780943
## dose_1  0.001038376         2.802148          9.057852
## dose_2  0.963851589        -3.798070          3.638070
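
The tests above compare the two supplements within each dose. The section heading also mentions dose, so a possible complementary check, not part of the original analysis, is a Welch test between the lowest and highest dose with both supplements pooled. The names low_high and t_dose are illustrative.

```r
tg <- ToothGrowth

# Keep only the lowest and highest dose, pooling both supplements
low_high <- subset(tg, dose %in% c(0.5, 2))

# Welch two-sample t-test of tooth length between the two dose levels
t_dose <- t.test(len ~ dose, data = low_high, var.equal = FALSE)

# The interval is for mean(dose 0.5) minus mean(dose 2)
print(t_dose$conf.int)
print(t_dose$p.value)
```

A confidence interval lying entirely below zero would indicate greater mean tooth growth at the higher dose.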

2.4 Conclusions and assumptions

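The summary of the test statistics shows the following: 1) The null hypothesis is that there is no difference in mean tooth growth between the two methods of administration (OJ and VC) at a given dose. 2) At doses of 0.5 and 1 mg/day, the p-values (0.0064 and 0.0010) are below the threshold of 0.05 and the 95% confidence intervals (1.72 to 8.78 and 2.80 to 9.06) do not contain zero. We therefore reject the null hypothesis at the 5% significance level: the method of administration plays a role, with OJ giving greater mean tooth growth than VC. 3) At a dose of 2 mg/day, the p-value (0.96) is above the threshold of 0.05 and the 95% confidence interval (-3.80 to 3.64) contains zero. We fail to reject the null hypothesis, so there is no evidence that the method of administration matters at this dose. Assumptions: the two groups at each dose are treated as independent samples, tooth length is taken to be approximately normally distributed within each group, and the group variances are not assumed to be equal (Welch test, var.equal = FALSE).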
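
Three tests are judged against the same 0.05 threshold, so one optional robustness check is to adjust the p-values for multiple comparisons. The sketch below applies Holm's method via p.adjust() to the p-values reported in the summary table. The adjustment is added for illustration and is not part of the original analysis; p_raw, p_holm and reject are illustrative names.

```r
# Unadjusted p-values copied from the summary table
p_raw <- c(dose_05 = 0.006358607, dose_1 = 0.001038376, dose_2 = 0.963851589)

# Holm step-down adjustment for the three comparisons
p_holm <- p.adjust(p_raw, method = "holm")

# TRUE where the null hypothesis is still rejected at the 5% level
reject <- p_holm < 0.05
print(round(p_holm, 4))
print(reject)
```

With these inputs the decisions are unchanged: the tests at doses 0.5 and 1 stay below 0.05 after adjustment, and the test at dose 2 does not.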