This project is a part of Coursera’s Statistical Inference course. Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution. Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population. Inferential statistics can be contrasted with descriptive statistics. Descriptive statistics is solely concerned with properties of the observed data, and it does not rest on the assumption that the data come from a larger population (Source: Wikipedia).
The project consists of two parts:
The Central Limit Theorem states that the sampling distribution of the sample means approaches a normal distribution as the sample size gets larger — no matter what the shape of the population distribution. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.
For more information on Central Limit Theorem:
In this part, we will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.
Design parameters for the simulation:
set.seed(100)
lambda <- 0.2
simulations <- 1000
n <- 40
sd <- 1/lambda
We need to illustrate the following through simulation and associated explanatory text:
The theoretical mean is:
theoretical_mean <- 1/lambda
theoretical_mean
## [1] 5
The simulated mean is:
sim <- data.frame(x = sapply(1:simulations, function(x) mean(rexp(n, lambda))))
head(sim)
## x
## 1 4.137412
## 2 6.051703
## 3 4.415869
## 4 4.404714
## 5 3.210413
## 6 5.475307
simulation_mean <- mean(sim$x)
simulation_mean
## [1] 4.999702
The theoretical variance is:
theoretical_var <- (1/lambda)^2 / n
theoretical_var
## [1] 0.625
The simulated variance is:
simulation_var <- var(sim$x)
simulation_var
## [1] 0.6432442
Thus, we see that our simulation results converges to the theoretical values. Now, to visualize the results, we make use of ggplot library.
library(ggplot2)
g <- ggplot(data = sim, aes(x = x)) +
geom_histogram( aes(y = ..density..),
binwidth = 0.15,
color = I("black"),
fill = I("green")) +
stat_function(fun = dnorm,
args = list(mean = theoretical_mean, sd = sqrt(theoretical_var)),
size = 1.2) +
geom_vline(xintercept = simulation_mean,
size=0.7,
color="blue",
linetype = 'longdash') +
labs(title = "Sampling Distribution of the Sample Means",
x = "Sample Means",
y = "Frequency")
g
Inference: We can see that our sampling distribution approximates the normal distribution reasonably well for a sample size of 40. It is important to note that a greater sample size would result in an even closer approximation of the normal distribution, whereas a smaller sample size would result the opposite.
Now in the second portion of the project, we’re going to analyze the ToothGrowth data in the R datasets package. We begin by loading the dataset.
library(datasets)
data("ToothGrowth")
ToothGrowth <- as.data.frame(ToothGrowth)
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: Factor w/ 3 levels "0.5","1","2": 1 1 1 1 1 1 1 1 1 1 ...
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 0.5:20
## 1st Qu.:13.07 VC:30 1 :20
## Median :19.25 2 :20
## Mean :18.81
## 3rd Qu.:25.27
## Max. :33.90
Now, we plot Supplementary Type vs Tooth Length.
library(ggplot2)
g1 <- ggplot(aes(x = supp, y = len), data = ToothGrowth) +
geom_boxplot(aes(fill = supp)) +
ggtitle("Comparison between Length and Supplement") +
ylab("Length") + xlab("Supplement")
g1
From the figure, it appears that a longer length is detected for subjects tested with supplement OJ(Orange Juice) compared to VC(Vit C). Because of this, it might be worth it to do a hypothesis test for these. Let us check if the same can be said about a possible doses.
g2 <- ggplot(aes(dose, len), data = ToothGrowth) +
geom_boxplot(aes(fill=dose)) +
ggtitle("Boxplot of tooth lengths per dose") +
xlab("Dose in mg per day") + ylab("Tooth length in mm")
g2
Again, there exists a clear correlation, based on the boxplots. This motivates us to do a hypothesis test for the different doses as well.
To test the hyphothesis, a series of two-sided unpaired t-tests will be use to obtain the confidence intervals and p-values. Significance level to be tested will be at 0.05. The p-values will be adjusted using Bonferroni correction method (for conservative) and the comparative results show in the table below:
library(pander)
ttests <- lapply(c(.5, 1, 2), function(x) {
t.test(len ~ supp, data=subset(ToothGrowth, dose == x), paired=FALSE, var.equal=FALSE)})
ttests
## [[1]]
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC
## 13.23 7.98
##
##
## [[2]]
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC
## 22.70 16.77
##
##
## [[3]]
##
## Welch Two Sample t-test
##
## data: len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.79807 3.63807
## sample estimates:
## mean in group OJ mean in group VC
## 26.06 26.14
pvals <- c(ttests[[1]]$p.value, ttests[[2]]$p.value, ttests[[3]]$p.value)
stats <- c(ttests[[1]]$statistic, ttests[[2]]$statistic, ttests[[3]]$statistic)
adjusted_p <- p.adjust(pvals, method = "bonferroni")
lower <- sapply(c(ttests[[1]]$conf.int[1], ttests[[2]]$conf.int[1], ttests[[3]]$conf.int[1]), round, 3)
upper <- sapply(c(ttests[[1]]$conf.int[2], ttests[[2]]$conf.int[2], ttests[[3]]$conf.int[2]), round, 3)
df <- data.frame(dose=c(0.5, 1, 2), t=stats, p=pvals, adj=adjusted_p,
ci=paste0("[",paste(lower, upper, sep=", "), "]"))
colnames(df) <- c("Dose", "t", "p-value", "adj. p-value", "conf. int.")
pander(df, round=3, split.tables=120,
caption="**_Two-sided t-test comparison of Supplement by Dose_**")
| Dose | t | p-value | adj. p-value | conf. int. |
|---|---|---|---|---|
| 0.5 | 3.17 | 0.006 | 0.019 | [1.719, 8.781] |
| 1 | 4.033 | 0.001 | 0.003 | [2.802, 9.058] |
| 2 | -0.046 | 0.964 | 1 | [-3.798, 3.638] |