Simulation

First, we conduct 1,000 simulations of the average of 40 exponential distribution, and collect the mean and variance of each simulation.

# Setting seed to ensure reproducibility
set.seed(123)

# Conducting 1000 simulations of 40 exponential distributions
lambda = 0.2
nosim <- 1000; n <- 40
exp_mean <- rep(NA, nosim); exp_var <- rep(NA, nosim)
for (i in 1:nosim){
        exp_mean[i] <- mean(rexp(n, rate = lambda))
        exp_var[i] <- var(rexp(n, rate = lambda))/n
}

The sample mean from the simulations is 5.0069281, while the theoretical mean (generated by taking 1/rate) of the distribution is 5.

The sample variance from the simulation is 0.6111245, while the theoretical variance of the distribution is 25. In this case, as the sample variance measures the variability of 40 exponential distributions from the population instead of only 1 distribution, it is likely to be lower than the theoretical variance by a factor of 40. This implies that as the sample size approaches infinity, the sample mean, 5.0069281, approaches the theoretical mean of 5.

First, we plot the distribution of the average of 40 exponential distributions from our previous simulations. Next, we plot the average of 40 exponential distributions along side the normal distribution to compare the 2 distributions. In addition, we also create confidence intervals for both the mean and the variance of the normalized distribution.

# Plotting the average of 40 exponential distributions
hist(exp_mean, xlab = "Mean", ylab = "Frequency Count", col = "blue",
     main = "Average of 40 Exponential Distributions (1000 simulations)")

To check whether our distribution follows a Normal Distribution, we check whether the 1 and 2 standard deviations from the mean captures 68% and 95% of the distribution.

# 1 standard deviation
mean_1sd <- mean(exp_mean) + 1 * sqrt(mean(exp_var))
actual_1sd <- quantile(exp_mean, c(0.84))
print(mean_1sd); print(actual_1sd)

## [1] 5.788673

##      84% 
## 5.756918

# 2 standard deviation
mean_2sd <- mean(exp_mean) + 2 * sqrt(mean(exp_var))
actual_2sd <- quantile(exp_mean, c(0.975))
print(mean_2sd); print(actual_2sd)

## [1] 6.570417

##    97.5% 
## 6.668337

As the difference between our distribution and a normal distribution appears to be close to 0, we can conclude that our distribution does indeed follow a normal distribution.

Lastly, we can also conduct a formal test (Shapiro-Wilk test) to verify whether our distribution adheres to a normal distribution.

p_value <- shapiro.test(exp_mean)[2]
stat_sig <- if (p_value<0.05){
        "does not follow a normal distribution"} else {
                "follows a normal distribution"}

Under the Shapiro-Wilk test of normality, our distribution does not follow a normal distribution.

Part 2

We begin by loading the ToothGrowth dataset, and generating a basic summary of the data.

library(datasets); data("ToothGrowth")
summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

We plot the data to get a better understanding of how the dataframe looks:

facet_plt <- xyplot(len ~ supp | factor(dose), data = ToothGrowth, 
                    xlab = "Supplement", ylab = "Tooth Growth",
                    main = "Impact of Supplement on Tooth Growth by Dosage")
facet_plt

From the plot, it appears that OJ tends to do better than VC when the dosage is low, but the effect is muted when the dosage is high.

We conduct our hypothesis tests (with the use of confidence intervals) to compare tooth growth by the different supplements, while controlling for dosage.

pVal <- rep(NA, length(unique(ToothGrowth$dose)))
stat_sig <- rep(NA, length(unique(ToothGrowth$dose)))
x <- 1

# Impact of Tooth Growth by Supplement, controlled for Dosage
for (i in unique(ToothGrowth$dose)){
        df <- ToothGrowth[ToothGrowth$dose == i,]
        test_result <- t.test(len ~ supp, data = df, paired = F, var_equal = F)
        pVal[x] <- test_result$p.value
        conf <- test_result$conf
        stat_sig[x] <- if ((conf[1] > 0 & conf[2] > 0)|(conf[1] < 0 & conf[2] < 0)){
                        "statistically significant"
                        } else {
                        "not statistically significant"
                        }
        x <- x + 1
}

Under a dosage of 0.5, the difference tooth growth across the two supplements is statistically significant.

Under a dosage of 1, the difference tooth growth across the two supplements is statistically significant.

Under a dosage of 2, the difference tooth growth across the two supplements is not statistically significant.

We now plot the data in a different perspective, to identify the impact of dosage on tooth growth, controlled for supplement

facet_plt2 <- xyplot(len ~ as.factor(dose) | supp, data = ToothGrowth, 
                    xlab = "Dosage (mg/day)", ylab = "Tooth Growth",
                    main = "Impact of Dosage on Tooth Growth by Supplement")
facet_plt2

We proceed to conduct a hypothesis tests to compare tooth growth by dosage, while controlling for the types of supplements used.

pVal2 <- rep(NA, length(unique(ToothGrowth$dose)) * length(unique(ToothGrowth$supp)))
stat_sig2 <- rep(NA, length(unique(ToothGrowth$dose)) * length(unique(ToothGrowth$supp)))
x <- 1

# Impact of Tooth Growth by Dosage, controlled for Supplement
for (i in unique(ToothGrowth$supp)){
        df <- filter(ToothGrowth, dose %in% c("0.5", "1") & supp == i)
        test_result <- t.test(len ~ dose, data = df, paired = F, var_equal = F)
        pVal2[x] <- test_result$p.value
        conf <- test_result$conf
        stat_sig2[x] <- if ((conf[1] > 0 & conf[2] > 0)|(conf[1] < 0 & conf[2] < 0)){
                        "statistically significant"
                        } else {
                        "not statistically significant"
                        }
        x <- x + 1
}

for (i in unique(ToothGrowth$supp)){
        df <- filter(ToothGrowth, dose %in% c("1", "2") & supp == i)
        test_result <- t.test(len ~ dose, data = df, paired = F, var_equal = F)
        pVal2[x] <- test_result$p.value
        conf <- test_result$conf
        stat_sig2[x] <- if ((conf[1] > 0 & conf[2] > 0)|(conf[1] < 0 & conf[2] < 0)){
                        "statistically significant"
                        } else {
                        "not statistically significant"
                        }
        x <- x + 1
}

for (i in unique(ToothGrowth$supp)){
        df <- filter(ToothGrowth, dose %in% c("0.5", "2") & supp == i)
        test_result <- t.test(len ~ dose, data = df, paired = F, var_equal = F)
        pVal2[x] <- test_result$p.value
        conf <- test_result$conf
        stat_sig2[x] <- if ((conf[1] > 0 & conf[2] > 0)|(conf[1] < 0 & conf[2] < 0)){
                        "statistically significant"
                        } else {
                        "not statistically significant"
                        }
        x <- x + 1
}

For VC:

Increasing the dosage from 0.5 to 1 is statistically significant, under a significance of 5%.

Increasing the dosage from 1 to 2 is statistically significant, under a significance of 5%.

Increasing the dosage from 0.5 to 2 is statistically significant, under a significance of 5%.

For OJ:

Increasing the dosage from 0.5 to 1 is statistically significant, under a significance of 5%.

Increasing the dosage from 1 to 2 is statistically significant, under a significance of 5%.

Increasing the dosage from 0.5 to 2 is statistically significant, under a significance of 5%.

Conclusion

All in all, higher dosage of vitamin C, regardless of the delivery methods, result in faster tooth growth. While orange juice tend to result in faster tooth growth at lower levels of Vitamin C intake as compared to ascorbic acid, the effects fade off as dosage levels increases.

Statistical Inference Course Project

Ng Bo Lin

09/04/2017

Part 1

Introduction

Simulation

Part 2

Conclusion