knitr::opts_chunk$set(echo = TRUE)

Part 1: Simulation Exercises

Illustrate via simulation and associated explanatory text, the properties of the distribution of the mean of 40 exponentials

I’m familiar with R, but Knitr is quite impressive. I particularly like the way a lot of the formatting follows familiar formatting conventions, such as Atlassian’s Confluence

Question 1: Show the sample mean and compare it to the theoretical mean distribution

setwd("/home/johnm/Desktop")

lambda <- 0.2
nrow <- 1000
ncol <- 40
simulatedData <- matrix(rexp(1000*40, lambda), nrow, ncol)
Means <- apply(simulatedData, 1, mean)
hist(Means, breaks = 50, main = "1,000 averages of 40 random exponentials", xlab = "Means", ylab = "Frequency")
abline(v = 1/lambda, lty = 1, lwd = 5)
legend("topright", lty = 1, lwd = 5, legend = "Theoretical Mean")

SampleMean <- NULL
for(index in 1:nrow)
{
  SampleMean <- c(SampleMean, mean(rexp(ncol, lambda)))
}
mean(SampleMean)

## [1] 5.016767

Conclusion

The figure above, and the calculation both show that this is close to the theoretical value for the mean of 5

Question 2: Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution

distVar <- apply(simulatedData, 1, var)
hist(distVar, breaks = 50, main = "This distribution of 1,000 variance of 40 random exponentials", xlab = "Value of variances", ylab = "Frequency of variance", col = "lightblue")
abline(v = (1/lambda)^2, lty = 1, lwd = 2, col = "red")
legend("topright", lty = 1, lwd = 5, col = "blue", legend = "Theoretical variance")

The sample variances have a distribution that approaches a normal distributoin, and is nearly centered on the theoretical variance

Question 3: Show that the distribution approximates a normal distribution

The plot below shows the distribution of exponentials with lambda equal to 0.2

par(mfrow = c(3, 1))
hist(simulatedData, breaks = 50, main = "Distribution of exponentials with lambda equals to 0.2", xlab = "Exponentials", col = "orange")

The plot below shows the distribution of 1,000 averages of 40 random exponentials

hist(Means, breaks = 50, main = "The distribution of 1,000 averages of 40 random exponentials", xlab = "Value of means", ylab = "Frequency of means", col = "pink")

The plot below shows the normal distribution, with the mean and standard deviation equal to that in the second plot above.

Comparing the second figure with the third figure, it shows that the distribution of the means tends toward the normal distribution with the same mean and standard deviation.

simNorm <- rnorm(1000, mean = mean(Means), sd = sd(Means))
hist(simNorm, breaks = 50, main = "A normal distribution with theoretical mean and sd of the exponentials", xlab = "Normal variables", col = "lightgreen")

Part 2: Analyze the ToothGrowth data in the R datasets package

Task 1 - Load the ToothGrowth data and perform some basic exploratory data analyses

library(stats)
data("ToothGrowth")
library(ggplot2)


suppBox <- ggplot(data = ToothGrowth, aes(x = supp, y = len))
suppBox +geom_boxplot(aes(fill = supp)) + labs(title = "Boxplots of Tooth Length By Supplement", x = "Supplement", y = "Tooth Lenth")

suppDoseBox <- ggplot(data = ToothGrowth, aes(x = supp, y = len))
suppDoseBox + geom_boxplot(aes(fill = supp)) + 
        facet_wrap(~ dose) +
        labs(title = "Tooth Length by Supplement and Dose", 
             x = "Supplement", y = "Tooth Length")

The box plots shows a strong, positive correlation between the dose and tooth growth; as the dose increases, the tooth growth also increases.

Task 2: Provide a basic summary of the data

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

sd(ToothGrowth$len)

## [1] 7.649315

summary(ToothGrowth[ToothGrowth$supp == "OJ", ])

##       len        supp         dose      
##  Min.   : 8.20   OJ:30   Min.   :0.500  
##  1st Qu.:15.53   VC: 0   1st Qu.:0.500  
##  Median :22.70           Median :1.000  
##  Mean   :20.66           Mean   :1.167  
##  3rd Qu.:25.73           3rd Qu.:2.000  
##  Max.   :30.90           Max.   :2.000

sd(ToothGrowth[ToothGrowth$supp == "OJ", ]$len)

## [1] 6.605561

summary(ToothGrowth[ToothGrowth$supp == "VC",])

##       len        supp         dose      
##  Min.   : 4.20   OJ: 0   Min.   :0.500  
##  1st Qu.:11.20   VC:30   1st Qu.:0.500  
##  Median :16.50           Median :1.000  
##  Mean   :16.96           Mean   :1.167  
##  3rd Qu.:23.10           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

sd(ToothGrowth[ToothGrowth$supp == "VC",]$len)

## [1] 8.266029

Results:

Dosage mean for 0.5 is 10.6
Dosage mean for 1.0 is 19.74
Dosage mean for 2.0 is 26.1
Standard deviation for overall is 7.65
The standard deviation for OJ supp is 6.61
The standard deviation for VC supp is 8.27.

Task 3: Use confidence intervals and/or hypothesis tests to compare tooth growth by ‘supp’ and ‘dose’.

t.test(x = ToothGrowth$len, data = ToothGrowth, paired = FALSE, conf.level = 0.95)$conf

## [1] 16.83731 20.78936
## attr(,"conf.level")
## [1] 0.95

The lower edge of the 95% confidence interval is 16.84 and the upper edge is 20.79 (truncated at three decimal places, and rounded to two decimal places)

Looking at the data for the means of ‘supp’ and ‘dosage’ shows the following:

summary(ToothGrowth[ToothGrowth$supp == "OJ", ]$len)[4]

##     Mean 
## 20.66333

summary(ToothGrowth[ToothGrowth$supp == "VC", ]$len)[4]

##     Mean 
## 16.96333

summary(ToothGrowth[ToothGrowth$dose == 0.5, ]$len)[4]

##   Mean 
## 10.605

summary(ToothGrowth[ToothGrowth$dose == 1.0, ]$len)[4]

##   Mean 
## 19.735

summary(ToothGrowth[ToothGrowth$dose == 2.0, ]$len)[4]

## Mean 
## 26.1

On this basis, the null hypothesis can be rejected. As mentioned earlier, there is a strong correlation between the dosage and the growth

Point 4 State your conclusions and the assumptions needed for your conclusions

Conclusion:

The dosage mean for 0.5 is 10.6
The dosage mean for 1.0 19.74

The null hypothesis can safely be rejected

However, the effect appears to relate to the dose only. The null hypthesis that there is no difference between OJ and VC cannot be rejected

Coursera Statistical Inference

John Murphy

2023-06-02