Statistical Inference - Course Project

library(ggplot2)
library(gridExtra)

knitr::opts_chunk$set(echo = TRUE)

Overview

This is the project report for the Statistical Inference course project. The project has 2 parts -

A simulation exercise
Basic inferential data analysis

The first part of the project requires simulation of a large number of samples of an exponential distribution in order to compare the statistics of the samples versus their theoretical values. We will also try to explore whether the distribution is approximately normal.

The second part of the project involves analysis of the ToothGrowth data. We will first conduct preliminary exploratory analysis of data and then go on to compare the data for tooth growth by supp and dose. We will use confidence intervals and hypothesis testing to complete this part of the project.

PART 1 - Simulation Exercise

The exponential distribution can be simulated in R using the rexp function. We will create 1000 different samples with 40 exponentials in each sample. We will then compute the mean and standard deviation of each row and plot a histogram of the row-wise means. We will then compare these statistics with their expected theoretical values for the exponential distribution.

For the purpose of this project, we will use an exponential distribution with lambda of 0.2.

samples <- 1000
n <- 40
lambda <- 0.2
binwidth <- 0.5

exp_data <- matrix(data = rexp(samples * n, lambda), nrow = samples, ncol = n)
exp_stats <- data.frame(cbind(apply(exp_data, 1, mean), apply(exp_data, 1, sd)))
names(exp_stats) <- c('Mean', 'SD')

theoretical_mean <- round(1 / lambda, digits = 3)
simulated_mean <- round(mean(exp_stats$Mean), digits = 3)

theoretical_sd <- round(1 / lambda, digits = 3)
simulated_sd <- round(mean(exp_stats$SD), digits = 3)

I - Comparison of Theoretical vs. Simulated Means

Theoretical Mean = 1 / lambda = 5
Simulated Mean = 5

II - Comparison of Theoretical vs. Simulated Standard Deviation

Theoretical Variance = 1 / lambda = 5
Simulated Variance = 4.876

III - A Normal Distribution

As can be seen from the above comparisons between theoretical and simulated statistics, the 1000 samples containting 40 exponentials each have a mean and variance close to the expected population mean and variance. This satisfies the properties of independent and identically distributed, or iid, variables.

Below are two plots that visually show how the sample mean and variance of the exponential distribution are approximately normally distributed.

g_mean <- ggplot(exp_stats, aes(exp_stats$Mean)) + 
    geom_histogram(aes(y = ..density..), binwidth = binwidth, 
                   col = 'white', fill = 'grey', alpha = 0.7) + 
    labs(x = 'Sample Mean', y = 'Density') +
    stat_function(fun = dnorm, args = list(mean = mean(exp_stats$Mean), 
                                           sd = sd(exp_stats$Mean)), 
                  col = 'red', lwd = 2, alpha = 0.7) +
    theme_minimal()


g_var <- ggplot(exp_stats, aes(exp_stats$SD)) + 
    geom_histogram(aes(y = ..density..), binwidth = binwidth, 
                   col = 'white', fill = 'grey', alpha = 0.7) +
    labs(x = 'Sample Variance', y = 'Density') +
    stat_function(fun = dnorm, args = list(mean = mean(exp_stats$SD), 
                                           sd = sd(exp_stats$SD)), 
                  col = 'blue', lwd = 2, alpha = 0.7) +
    theme_minimal()

grid.arrange(g_mean, g_var, ncol = 2)

PART 2 - Basic Inferential Data Analysis

The first step of the analysis requires us to load the ToothGrowth dataset available in R and perform a basic exploration of the data. This will involve looking at and understanding the structure and contents of the data set.

library(datasets)
library(dplyr)

data("ToothGrowth")

# Convert the dataframe to a tibble
ToothGrowth <- tbl_df(ToothGrowth)
ToothGrowth$dose <- as.factor(ToothGrowth$dose)

# Print the structure of the data frame
str(ToothGrowth)

## Classes 'tbl_df', 'tbl' and 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: Factor w/ 3 levels "0.5","1","2": 1 1 1 1 1 1 1 1 1 1 ...

It can be seen that the data consists of three variables, len, supp and dose. While supp and dose are factor variables, len is a numeric variable denoting the magnitude of tooth growth for different values of supp and dose. We will now try to deduce if there exists a statistically significant affect of supplements or dosage on tooth growth, ie. the len variable.

I - Tooth Growth by Supplement (supp)

We will first compute the mean and standard deviation of the len variable for the two groups of supplements, namely OJ and VC. We will also plot the tooth growth data for the two different supplements to visually inspect their affects.

ToothGrowth %>% 
    group_by(ToothGrowth$supp) %>% 
    summarize(mean = round(mean(len), 3), sd = round(sd(len), 3))

g_scatter <- ggplot(ToothGrowth, aes(supp, len)) + 
    geom_point(shape = 16, size = 3, aes(col = supp), alpha = 0.7) + 
    theme_minimal() + 
    theme(legend.position = "bottom", legend.box.background = element_rect(), 
          plot.title = element_text(hjust = 0.5)) +
    labs(title = "Tooth Growth by Supplement (supp)", 
         x = "", y = "Tooth Growth", col = "Supplement")

g_hist <- ggplot(ToothGrowth, aes(len, col = supp, fill = supp)) + 
    geom_density(lwd = 1, alpha = 0.1) + 
    theme_minimal() +
    theme(legend.position = "bottom", legend.box.background = element_rect(), 
          plot.title = element_text(hjust = 0.5)) +
    labs(title = "Tooth Growth Densities", 
         x = "Tooth Growth", y = "Density", col = "Supplement", fill = "Supplement")

grid.arrange(g_scatter, g_hist, ncol = 2)

For the purpose of our analysis, we will start with the following null hypothesis.

\(H_0: \mu_{OJ} = \mu_{VC}\)

\(H_a: \mu_{OJ} \neq \mu_{VC}\)

We will conduct a two-sided hypothesis test on the difference in means of the two groups of len values setting a 95% confidence interval.

len_oj <- ToothGrowth %>% filter(supp == "OJ")
len_vc <- ToothGrowth %>% filter(supp == "VC")

tsupp <- t.test(len_oj$len, len_vc$len, paired = FALSE, 
                alternative = 'two.sided', var.equal = FALSE,
                conf.level = 0.95)

Results

Confidence Interval: According to the t-test, the 95% confidence interval lies between -0.17 and 7.57.
P-value: The calculated p-value for our hypothesis is 6.06%.

Based on this information, we fail to reject our null hypothesis as the confidence interval includes zero and the p-value is greater than our threshold of 5%. Thus, we can’t conclude that there is a statistically significant difference in tooth growth for the two supplement groups.

II - Tooth Growth by Dosage (dose)

We will first compute the mean and standard deviation of the len variable for the three groups of dosage values, namely 0.5, 1.0 and 2.0. We will also plot the tooth growth data for the three different dosages to visually inspect their affects.

ToothGrowth %>% 
    group_by(ToothGrowth$dose) %>% 
    summarize(mean = round(mean(len), 3), sd = round(sd(len), 3))

g_scatter <- ggplot(ToothGrowth, aes(dose, len)) + 
    geom_point(shape = 16, size = 3, aes(col = dose), alpha = 0.7) + 
    theme_minimal() + 
    theme(legend.position = "bottom", legend.box.background = element_rect(), 
          plot.title = element_text(hjust = 0.5)) +
    labs(title = "Tooth Growth by Dosage (dose)", 
         x = "", y = "Tooth Growth", col = "Dosage")

g_hist <- ggplot(ToothGrowth, aes(len, col = dose, fill = dose)) + 
    geom_density(lwd = 1, alpha = 0.1) + 
    theme_minimal() +
    theme(legend.position = "bottom", legend.box.background = element_rect(), 
          plot.title = element_text(hjust = 0.5)) +
    labs(title = "Tooth Growth Densities", 
         x = "Tooth Growth", y = "Density", col = "Dosage", fill = "Dosage")

grid.arrange(g_scatter, g_hist, ncol = 2)

For the purpose of our analysis, we will compare the tooth growth for the smallest and largest values of dose, ie. 0.5 and 2.0. We will use the following null hypothesis.

\(H_0: \mu_{0.5} = \mu_{2.0}\)

\(H_a: \mu_{0.5} \neq \mu_{2.0}\)

We will conduct a two-sided hypothesis test on the difference in means of the two groups of len values setting a 95% confidence interval.

len_0.5 <- ToothGrowth %>% filter(dose == 0.5)
len_2.0 <- ToothGrowth %>% filter(dose == 2.0)

tdose <- t.test(len_0.5$len, len_2.0$len, paired = FALSE, 
                alternative = 'two.sided', var.equal = FALSE,
                conf.level = 0.95)

Results

Confidence Interval: According to the t-test, the 95% confidence interval lies between -18.16 and -12.83.
P-value: The calculated p-value for our hypothesis is ~0%.

Based on this information, we will reject our null hypothesis as both sides of the confidence interval are well below zero and the p-value is very close to zero. Thus, we can conclude that there is a statistically significant difference in tooth growth for different values of dose.

III - Conclusions

In this study, we used the ToothGrowth dataset to analyse the affect of supplements and dosage on tooth growth. We started with the null hypothesis that the distribution of tooth growth (len) doesn’t change for the different groups of supplements and dosages. We then used a t-test with 95% confidence intervals to determine whether this was really the case or whether we would be able to reject our null hypothesis based on the sample data.

After our analysis, we have not been able to establish a significant relation between tooth growth and supplements (OJ and VC). But we have found that there is a significant increase in tooth growth if the dosage is increased from 0.5 to 2.0. This would indicate a positive correlation between these two variables.