OVERVIEW

The material below was prepared for the Statistical Inference Course Project in Coursera’s Data Science Specialization by Johns Hopkins University. The project investigates the exponential distribution and compares it with the Central Limit Theorem, with illustrations given through simulation and explanatory text. The analysis relies on the following libraries:

# Load libraries

library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(datasets)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(dplyr)
library(knitr)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.3.3
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

PART 2: BASIC INFERENTIAL DATA ANALYSIS INSTRUCTIONS

Now in the second portion of the project, we’re going to analyze the ToothGrowth data in the R datasets package.

  • Load the ToothGrowth data and perform some basic exploratory data analyses.
  • Provide a basic summary of the data.
  • Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there are other approaches worth considering.)
  • State your conclusions and the assumptions needed for your conclusions.

OVERVIEW

The data set records tooth growth in guinea pigs that were given one of two supplement types at different doses.

LOAD THE DATA

data("ToothGrowth")
tg <- ToothGrowth

BASIC EXPLORATORY DATA ANALYSIS

str(tg)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
unique(tg$dose)
## [1] 0.5 1.0 2.0

There are only 3 unique doses, so we convert dose into a factor.

tg$dose <- factor(tg$dose)
str(tg)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: Factor w/ 3 levels "0.5","1","2": 1 1 1 1 1 1 1 1 1 1 ...
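
One caveat with this conversion, shown here as a small sketch rather than as part of the original analysis: calling as.numeric() directly on a factor returns the internal level codes, not the dose values. The names dose_f, codes and values are illustrative only.

```r
# Illustrative sketch: recovering numeric doses from a factor
data("ToothGrowth")
dose_f <- factor(ToothGrowth$dose)

levels(dose_f)   # the three dose labels, stored as character strings

# as.numeric() on a factor gives the level codes (1, 2, 3), not 0.5, 1, 2
codes <- as.numeric(dose_f)

# Converting through character recovers the original dose values
values <- as.numeric(as.character(dose_f))

unique(codes)
unique(values)
```

This matters only if the doses are later needed as numbers, for example to plot tooth length against dose on a continuous axis.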

Now let’s look at the summary statistics.

summary(tg)
##       len        supp     dose   
##  Min.   : 4.20   OJ:30   0.5:20  
##  1st Qu.:13.07   VC:30   1  :20  
##  Median :19.25           2  :20  
##  Mean   :18.81                   
##  3rd Qu.:25.27                   
##  Max.   :33.90

Box Plot based on type and dose

ggplot(tg,
       aes(x = dose,
           y = len,
           fill = supp)) +
  geom_boxplot() +
  labs(title = "Guinea Pig tooth length",
       subtitle = "Based on Dose and Supplement Type",
       y = "Tooth Length",
       x = "Dose (mg/day)",
       fill = "Supplement Type") +
  theme_test()

From the graph we can see that the two supplements differ clearly at the 0.5 and 1 mg/day doses and hardly at all at 2 mg/day. As the dose increases, tooth length increases as well.

Setting aside the 2 mg/day dose, the plot shows that the OJ supplement is associated with longer teeth than VC at both the 0.5 and 1 mg/day doses.
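
To check this reading of the plot numerically, the medians that the boxplot draws can be tabulated per group. The following is a minimal base R sketch; the object name tg_medians is chosen here for illustration and is not part of the original analysis.

```r
data("ToothGrowth")

# Median tooth length for each supplement/dose combination,
# i.e. the centre line of each box in the boxplot
tg_medians <- aggregate(len ~ supp + dose,
                        data = ToothGrowth,
                        FUN = median)

tg_medians
```

Each of the six rows corresponds to one box in the plot, so the OJ and VC medians can be compared dose by dose.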

STATISTICAL DIFFERENCES

Now we are going to look at these differences in more detail.

stats <- tg %>%
  group_by(supp, 
           Dose = dose) %>%
  summarise(mean = mean(len),
            .groups = "drop") %>%
  spread(supp,
         mean) %>%
  mutate(Diff = abs(VC - OJ))

kable(stats,
      align = "lccrr") %>%
  kable_styling(bootstrap_options = "bordered",
                full_width = FALSE) %>%
  row_spec(row = 0,
           bold = TRUE)
Dose   OJ      VC      Diff
0.5    13.23    7.98   5.25
1      22.70   16.77   5.93
2      26.06   26.14   0.08

As observed in the boxplot, at the 2 mg/day dose mean tooth length barely differs between the supplements (a difference of 0.08). This makes it harder to determine which supplement is more effective at this dose.
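
To judge how small a difference in means is relative to the spread within each group, the group standard deviations can be tabulated next to the means. This is a sketch in base R, assuming the untransformed ToothGrowth data; the names means, sds and diff_abs are illustrative.

```r
data("ToothGrowth")

# Mean and standard deviation of len for every dose (rows) and supplement (columns)
means <- with(ToothGrowth, tapply(len, list(dose = dose, supp = supp), mean))
sds   <- with(ToothGrowth, tapply(len, list(dose = dose, supp = supp), sd))

# Absolute difference in means between the supplements, per dose
diff_abs <- abs(means[, "VC"] - means[, "OJ"])

round(cbind(means, Diff = diff_abs), 2)
round(sds, 2)
```

The first table reproduces the means and differences above without the tidyverse; the second gives the within-group variability against which those differences can be compared.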

T-Test HYPOTHESIS TESTING

  • Null hypothesis: at a given dose, there is no difference in mean tooth length between OJ (orange juice) and VC (ascorbic acid, i.e. vitamin C).
  • Alternative hypothesis: at a given dose, mean tooth length differs between OJ and VC.
  • Significance level (alpha): 0.05, the conventional choice.

# Filter the data based on dose

dose_half <- filter(tg, 
                    dose == 0.5)
dose_one <- filter(tg,
                   dose == 1)
dose_two <- filter(tg,
                   dose == 2)

# Calculating the t-tests

t.test(len ~ supp,
       dose_half)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
##  1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC 
##            13.23             7.98
t.test(len ~ supp,
       dose_one)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
##  2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC 
##            22.70            16.77
t.test(len ~ supp,
       dose_two)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean in group OJ mean in group VC 
##            26.06            26.14
# Organise in a table

Dose <- c(0.5, 1, 2)
P_value <- c(0.0064, 0.0010, 0.9639)
Conf_int <- c("1.72, 8.78", "2.80, 9.06", "-3.80, 3.64")
Decision <- c("Reject null", "Reject null", "Do not reject null")
TG_ttest <- data.frame(Dose, Conf_int, P_value, Decision)

kable(TG_ttest,
      align = "lccrr") %>%
  kable_styling(bootstrap_options = "bordered",
                full_width = FALSE) %>%
  row_spec(row = 0,
           bold = TRUE)
Dose   Conf_int       P_value   Decision
0.5    1.72, 8.78     0.0064    Reject null
1.0    2.80, 9.06     0.0010    Reject null
2.0    -3.80, 3.64    0.9639    Do not reject null

As expected, the p-values for doses 0.5 and 1.0 are very small, reflecting the large differences in mean tooth length between the supplements at those doses.

Thus, for doses 0.5 and 1.0, since the p-values are smaller than 0.05, we reject the null hypothesis. This means that the type of supplement has a statistically significant effect at these doses, with Orange Juice being the more effective. For the 2.0 mg/day dose, we cannot reject the null hypothesis, as the p-value is greater than 0.05.
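
The values in the table above were typed in by hand from the t-test output. As a hedged alternative, the same quantities can be pulled directly from the objects returned by t.test, which avoids transcription errors. This is a sketch only; the names alpha, rows and ttest_table are illustrative and not part of the original analysis.

```r
data("ToothGrowth")

alpha <- 0.05  # significance level used in the hypotheses

# Run a Welch two-sample t-test of len by supp within each dose
# and collect the pieces reported in the table
doses <- sort(unique(ToothGrowth$dose))
rows <- lapply(doses, function(d) {
  res <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == d, ])
  data.frame(Dose     = d,
             CI_lower = res$conf.int[1],
             CI_upper = res$conf.int[2],
             P_value  = res$p.value,
             Decision = if (res$p.value < alpha) "Reject null" else "Do not reject null")
})
ttest_table <- do.call(rbind, rows)

ttest_table
```

The decision column is derived from the comparison with alpha, so the threshold is stated in exactly one place.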

CONCLUSION

The central assumption behind these results is that the sample is representative of the population and that no other variable is affecting tooth length.

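For the t-tests, two assumptions are made:

  • The data are not paired, meaning the two supplement groups are independent samples.
  • The variances of the two groups are not assumed to be equal (Welch’s t-test, which is the default in t.test).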
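
The effect of the variance assumption can be seen by running the test both ways on one dose. The sketch below uses the 0.5 mg/day group as an example; the names half, welch and pooled are illustrative, and the pooled-variance test is shown only for comparison, not as the method used in this report.

```r
data("ToothGrowth")
half <- ToothGrowth[ToothGrowth$dose == 0.5, ]

# Sample variance of tooth length in each supplement group
tapply(half$len, half$supp, var)

# Welch test (R's default, var.equal = FALSE): variances not assumed equal
welch  <- t.test(len ~ supp, data = half, var.equal = FALSE)

# Pooled-variance test for comparison: assumes a common variance
pooled <- t.test(len ~ supp, data = half, var.equal = TRUE)

# Degrees of freedom and p-values under each assumption
c(welch_df = unname(welch$parameter), pooled_df = unname(pooled$parameter))
c(welch_p = welch$p.value, pooled_p = pooled$p.value)
```

The Welch test uses fractional degrees of freedom that are never larger than the pooled test's n1 + n2 - 2, which makes it the more cautious choice when the group variances may differ.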

With that, the t-tests indicate that supplement type OJ is more effective than VC at doses of 1.0 mg/day or less. At the 2.0 mg/day dose, there is no statistically significant difference between the supplement types.