Part 1: Simulation Exercise

Overview: The purpose of this data analysis is to investigate the exponential distribution and compare the distribution of its sample means with what the Central Limit Theorem predicts. For this analysis, lambda is set to 0.2 for all of the simulations.

Objective: Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials.

Question 1: Show the sample mean and compare it to the theoretical mean of the distribution.

library(knitr)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE)
lambda <- 0.2
# 1000 simulated samples (rows), each of 40 exponential draws (columns)
simData <- matrix(rexp(1000*40, lambda), nrow = 1000, ncol = 40)
# Mean of each simulated sample of 40 exponentials
distMean <- apply(simData, 1, mean)
hist(distMean, breaks = 50, main = "Distribution of 1000 averages of 40 random exponentials", xlab = "Value of the means", ylab = "Frequency of the means", col = "green")
# Dashed line marks the theoretical mean 1/lambda
abline(v = 1/lambda, lty = 2, lwd = 8, col = "black")
legend("topright", lty = 2, lwd = 6, col = "black", legend = "theoretical mean")

The histogram of the 1000 sample means is roughly bell-shaped and centered near the theoretical mean 1/lambda = 5 (dashed line).
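The comparison asked for in Question 1 can also be made numerically. The following is a minimal sketch rather than part of the original analysis: the seed value and the names n, nsim, sampleMean and theoMean are assumptions introduced here for illustration, and the simulation is repeated so that the block runs on its own.

```r
# Hypothetical seed, chosen only so that this sketch is reproducible
set.seed(2024)

lambda <- 0.2   # rate used throughout the report
n <- 40         # exponentials per sample
nsim <- 1000    # number of simulated samples

# One simulated sample per row
simData <- matrix(rexp(nsim * n, lambda), nrow = nsim, ncol = n)
distMean <- rowMeans(simData)

sampleMean <- mean(distMean)   # center of the simulated averages
theoMean <- 1 / lambda         # theoretical mean of the exponential, here 5

print(c(sample = sampleMean, theoretical = theoMean))
```

The two printed values should be close to each other; how close depends on the seed and on the number of simulations.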

Question 2: Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

# Variance of each simulated sample of 40 exponentials
distVar <- apply(simData, 1, var)
hist(distVar, breaks = 50, main = "Distribution of 1000 variances of 40 random exponentials", xlab = "Value of variances", ylab = "Frequency of variances", col = "orange")
# Dashed line marks the theoretical variance (1/lambda)^2
abline(v = (1/lambda)^2, lty = 2, lwd = 8, col = "blue")
legend("topright", lty = 2, lwd = 6, col = "blue", legend = "theoretical variance")

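The sample variances are roughly centered on the theoretical variance (1/lambda)^2 = 25 (dashed line), although their distribution is right-skewed rather than exactly normal.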
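A numeric check can complement the histogram. The sketch below is an illustration under assumptions, not the original analysis: the seed and the variable names are hypothetical. It compares the average sample variance with (1/lambda)^2, and additionally the variance of the 1000 sample means with the value the Central Limit Theorem suggests, (1/lambda)^2 / n.

```r
# Hypothetical seed, chosen only so that this sketch is reproducible
set.seed(2024)

lambda <- 0.2
n <- 40
nsim <- 1000

simData <- matrix(rexp(nsim * n, lambda), nrow = nsim, ncol = n)
distMean <- rowMeans(simData)
distVar <- apply(simData, 1, var)

# Variance within each sample of 40, compared with the population variance
meanOfSampleVars <- mean(distVar)
theoVar <- (1 / lambda)^2            # 25

# Variance of the sample means, compared with sigma^2 / n
varOfMeans <- var(distMean)
theoVarOfMeans <- (1 / lambda)^2 / n # 25 / 40 = 0.625

print(c(meanOfSampleVars = meanOfSampleVars, theoVar = theoVar))
print(c(varOfMeans = varOfMeans, theoVarOfMeans = theoVarOfMeans))
```

The second comparison is the one that relates directly to the distribution of the mean of 40 exponentials named in the objective.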

Question 3: Show that the distribution is approximately normal.

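par(mfrow = c(3, 1))
# Panel 1: the raw exponential draws
hist(simData, breaks = 50, main = "Distribution of exponentials with lambda equal to 0.2", xlab = "Exponentials", col = "light pink")
# Panel 2: the averages of 40 exponentials
hist(distMean, breaks = 50, main = "Distribution of 1000 averages of 40 random exponentials", xlab = "Value of the means", ylab = "Frequency of means", col = "light green")
# Panel 3: normal draws with the same mean and sd as the averages
simNorm <- rnorm(1000, mean = mean(distMean), sd = sd(distMean))
hist(simNorm, breaks = 50, main = "Normal distribution with the same mean and sd as the averages", xlab = "Normal variables", col = "light blue")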

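The raw exponentials are strongly right-skewed, whereas the averages of 40 exponentials are approximately symmetric and bell-shaped, similar to a normal distribution with the same mean and standard deviation.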
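The visual comparison can be backed up with a skewness estimate and a normal Q-Q plot. This is a sketch under assumptions: the seed is arbitrary, and skewness() is a small helper defined here for illustration (it is not a base R function and was not part of the original analysis).

```r
# Hypothetical seed, chosen only so that this sketch is reproducible
set.seed(2024)

lambda <- 0.2
n <- 40
nsim <- 1000

simData <- matrix(rexp(nsim * n, lambda), nrow = nsim, ncol = n)
distMean <- rowMeans(simData)

# Illustrative helper: moment-based sample skewness (0 for a symmetric sample)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skewRaw <- skewness(as.vector(simData))  # raw exponential draws
skewMeans <- skewness(distMean)          # averages of 40 draws

print(c(raw = skewRaw, means = skewMeans))

# Points close to the dashed reference line indicate approximate normality
qqnorm(distMean, main = "Normal Q-Q plot of 1000 averages of 40 exponentials")
qqline(distMean, lty = 2)
```

A skewness much closer to zero for the averages than for the raw draws is what the Central Limit Theorem leads one to expect.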

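Conclusion: Some of the distributions are skewed, but their centers lie close to the theoretical values, and the distribution of the averages of 40 exponentials is approximately normal, as the Central Limit Theorem predicts.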

Part 2: Basic Inferential Data Analysis Instructions

Overview: For this data analysis we will analyze the ToothGrowth data set by comparing guinea pig tooth growth by supplement and dose. First, we perform an exploratory data analysis of the data set. Then we make the comparison with confidence intervals in order to draw conclusions about tooth growth.

Objective 1: Load the ToothGrowth data and perform some basic exploratory data analyses.

library(datasets)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stats)
data(ToothGrowth)
library(ggplot2)
tg <- ToothGrowth
# Relabel the supplement levels (OJ, VC) for the plot
levels(tg$supp) <- c("Orange Juice", "Ascorbic Acid")
ggplot(tg, aes(x=factor(dose), y=len)) + 
  facet_grid(.~supp) +
  geom_boxplot(aes(fill = supp), show.legend = FALSE) +
  labs(title="Guinea pig tooth length by dose for each supplement type", 
    x="Dose (mg/day)",
    y="Tooth Length")

The box plots show that tooth length increases with dose for both supplements.

Objective 2: Provide a basic summary of the data.

summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
summary(ToothGrowth[ToothGrowth$supp == "OJ", ])
##       len        supp         dose      
##  Min.   : 8.20   OJ:30   Min.   :0.500  
##  1st Qu.:15.53   VC: 0   1st Qu.:0.500  
##  Median :22.70           Median :1.000  
##  Mean   :20.66           Mean   :1.167  
##  3rd Qu.:25.73           3rd Qu.:2.000  
##  Max.   :30.90           Max.   :2.000
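The summaries above can be complemented by group means for every combination of supplement and dose. The following is a minimal sketch, not part of the original report; the names groupMeans and suppMeans are introduced here for illustration.

```r
library(datasets)

# Mean tooth length for each supplement and dose combination (6 groups)
groupMeans <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
print(groupMeans)

# Mean tooth length for each supplement, pooled over all doses
suppMeans <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)
print(suppMeans)
```

Using aggregate() with a formula keeps the result as a data frame, which is convenient if the group means are to be plotted or tabulated later.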

Objective 3: Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there are other approaches worth considering.)

t.test(ToothGrowth$len, conf.level = 0.95)$conf.int
## [1] 16.83731 20.78936
## attr(,"conf.level")
## [1] 0.95

Next, calculate the mean tooth length under each supplement:

summary(ToothGrowth[ToothGrowth$supp == "OJ", ]$len)[4]
##     Mean 
## 20.66333
summary(ToothGrowth[ToothGrowth$supp == "VC", ]$len)[4]
##     Mean 
## 16.96333

Both group means, 20.66 for OJ and 16.96 for VC, fall inside the 95% confidence interval for the overall mean tooth length (16.84 to 20.79).
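The interval above describes the overall mean, so it does not by itself compare the two supplements. One way to compare them directly is a two-sample t interval for the difference in means. The sketch below assumes unequal variances (Welch's test) and independent groups, which are choices made here for illustration and are not stated in the original analysis; the names suppTest and meanDiff are likewise hypothetical.

```r
library(datasets)

# Welch two-sample t-test of tooth length by supplement (OJ minus VC)
suppTest <- t.test(len ~ supp, data = ToothGrowth,
                   var.equal = FALSE, conf.level = 0.95)

# 95% confidence interval for the difference in mean tooth length
print(suppTest$conf.int)

# Estimated difference in means, OJ minus VC
meanDiff <- unname(suppTest$estimate[1] - suppTest$estimate[2])
print(meanDiff)
```

If the printed interval contains zero, the data do not rule out equal mean tooth length for the two supplements at the 5% level.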

Objective 4: State your conclusions and the assumptions needed for your conclusions.

Conclusion: There appears to be a difference in tooth growth between OJ and VC at doses of 0.5 and 1.0 mg/day. It seems both of the populations are approximately normally distributed, which is the assumption the t-based confidence intervals rely on.
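The dose-specific part of the conclusion can be checked by comparing the supplements separately at each dose. This is a sketch under assumptions rather than the original method: it uses Welch two-sample t-tests at the 95% level, and the names doses and byDose are introduced here for illustration.

```r
library(datasets)

doses <- sort(unique(ToothGrowth$dose))   # 0.5, 1, 2

# One Welch t-test (OJ versus VC) per dose; one row of results per dose
byDose <- t(sapply(doses, function(d) {
  res <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == d, ])
  c(dose = d,
    lower = res$conf.int[1],
    upper = res$conf.int[2],
    p.value = res$p.value)
}))

print(byDose)
```

Rows whose interval excludes zero indicate doses at which the two supplements differ in mean tooth length.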