Overview

This project analyses the ToothGrowth dataset from the R datasets package. It performs a basic exploratory analysis of the data, postulates hypotheses comparing tooth growth by supplement type (supp) and dose, and draws conclusions from confidence intervals and p-values based on the t distribution.

The overall conclusion of the analysis is that supp OJ produces greater tooth growth than supp VC at dose levels 0.5 and 1. At dose 2, no significant difference is found.

Preprocessing

Let us first load the dataset and take a look at the first few rows.

        library(datasets)
        data(ToothGrowth)
        head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Let us check whether there are any NAs and then look at the structure of the data.

        ## Are there any NAs?
        ## complete.cases() returns TRUE for every row without a missing value
        if(all(complete.cases(ToothGrowth))){
                print("No NA")
        } else {
                print("NA present")
        }
## [1] "No NA"
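
A per-column count of missing values gives slightly more detail than a single yes/no check. The following is a minimal sketch using base R only; the name na_counts is introduced here for illustration and is not part of the original analysis.

```r
library(datasets)
data(ToothGrowth)

# Count the missing values in each column; all zeros means no NA anywhere
na_counts <- colSums(is.na(ToothGrowth))
print(na_counts)
```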

Summary of Data

        str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

There are only two unique values in supp and three in dose, which suggests cross-tabulating them.

        ## Get a view of count of Supp vs Dose
        table(ToothGrowth$supp, ToothGrowth$dose)
##     
##      0.5  1  2
##   OJ  10 10 10
##   VC  10 10 10

This indicates that there are 6 groups of data with 10 observations each. Let us plot len against supp and dose.

Exploratory plots

        library(ggplot2)
        ggplot(ToothGrowth, aes(x = supp, y = len)) +
                geom_point() +
                facet_grid(. ~ dose)

The plot hints at differences between the groups. To get a better view, let us draw a box plot for each supp and dose combination.

        len_suppdose  <- split(ToothGrowth[,1], ToothGrowth[,c('supp','dose')])
        boxplot(len_suppdose)
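
The visual impression from the box plot can be backed by per-group numbers. The sketch below tabulates the mean and standard deviation of len for each supp and dose combination with base R's aggregate(); the names group_mean, group_sd and group_stats are introduced here for illustration only.

```r
library(datasets)
data(ToothGrowth)

# Mean and standard deviation of len for each supp/dose combination
group_mean <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
group_sd   <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = sd)

# Both results list the six groups in the same order, so they can be combined
group_stats <- data.frame(group_mean[, c("supp", "dose")],
                          mean = group_mean$len,
                          sd   = group_sd$len)
print(group_stats)
```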

Hypothesis

The box plot suggests the following hypotheses:

  1. OJ produces greater tooth length than VC at the lower doses of 0.5 and 1
  2. There is no difference in tooth growth between OJ and VC at a dose of 2

To test these hypotheses, first check whether the data in each group are close enough to normally distributed to justify using either the normal distribution or the t distribution.

        par(mfrow = c(3,2))

        for(i in 1:6){
                qqnorm(len_suppdose[[i]], main = names(len_suppdose)[i])
                qqline(len_suppdose[[i]], col = "blue")
        }

The data do look approximately normal.

There are only 10 observations in each group, so let us use the t distribution rather than the normal distribution and run t.test between the OJ and VC groups at each dose.
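
As a numerical complement to the Q-Q plots, one option is a Shapiro-Wilk test on each group. This is only a sketch and was not part of the original analysis: with 10 observations per group the test has little power, so a large p-value is weak support for normality rather than proof of it. The name shapiro_p is introduced here for illustration.

```r
library(datasets)
data(ToothGrowth)

# Same six supp/dose groups as used for the Q-Q plots
len_suppdose <- split(ToothGrowth$len, ToothGrowth[, c("supp", "dose")])

# Shapiro-Wilk p-value per group; a small p-value would point to non-normality
shapiro_p <- sapply(len_suppdose, function(x) shapiro.test(x)$p.value)
print(round(shapiro_p, 3))
```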

Assumptions

  • The data does not indicate that the observations are paired, so let us assume paired = FALSE
  • The two groups receive different supplements, namely OJ and VC, so let us not assume equal variances (var.equal = FALSE)
  • The null hypothesis is that there is no difference in mean length between OJ and VC
  • We accept (that is, fail to reject) the null hypothesis if p > 0.05 or the lower confidence limit is < 0
  • The confidence level used is 95%
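
To make these assumptions concrete, the sketch below computes the unequal-variance (Welch) t statistic, its approximate degrees of freedom, the two-sided p-value and the 95% confidence interval by hand for dose 0.5, and compares them with what t.test returns. The variable names (x, y, se2_x, welch_df and so on) are introduced here for illustration only.

```r
library(datasets)
data(ToothGrowth)

x <- ToothGrowth$len[ToothGrowth$supp == "OJ" & ToothGrowth$dose == 0.5]
y <- ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == 0.5]

# Squared standard error of each group mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)
se_diff <- sqrt(se2_x + se2_y)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(x) - mean(y)) / se_diff
welch_df <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value and 95% confidence interval for mean(x) - mean(y)
p_value <- 2 * pt(-abs(t_stat), welch_df)
ci <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, welch_df) * se_diff

# Reference result from t.test with the same assumptions
ref <- t.test(x, y, paired = FALSE, var.equal = FALSE)
print(c(manual = t_stat, t.test = unname(ref$statistic)))
```

If the hand computation is right, t_stat, welch_df, p_value and ci agree with ref$statistic, ref$parameter, ref$p.value and ref$conf.int up to floating-point error.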

For each of the doses, namely 0.5, 1 and 2, we run a t.test against this null hypothesis.

        analysis <- data.frame(Supp1=character(0),
                                  Supp2=character(0),
                                  Dose=character(0),
                                  LCL=character(0),
                                  UCL=character(0),
                                  PValue=character(0),
                                  Hypothesis=character(0),
                               stringsAsFactors=FALSE)
                
        i <- 1L

        for(j in c(0.5, 1, 2)){

                ## subset the length for OJ and VC for each of the dose
                len_OJ <- ToothGrowth[as.character(ToothGrowth$supp) == "OJ" & 
                        ToothGrowth$dose  == j, "len"]
                len_VC <- ToothGrowth[as.character(ToothGrowth$supp) == "VC" & 
                        ToothGrowth$dose  == j, "len"]

                ## run the t test
                result <- t.test(len_OJ, 
                                 len_VC, 
                                 paired = FALSE, 
                                var.equal = FALSE )
                        
                ## Accept or reject the NULL hypothesis
                
                ## || is the scalar "or", which is what an if() condition expects
                hyp_test <- if(result$conf.int[1] < 0 || result$p.value > 0.05) {
                        "Accept Null"
                } else {
                        "Reject Null"
                }
        
                ## add the row to the dataframe
                analysis[i,] <- as.vector(c("OJ", 
                                            "VC",
                                            j, 
                                            round(result$conf.int[1],2), 
                                            round(result$conf.int[2],2),
                                            round(result$p.value, 4),
                                            hyp_test ))
        
                i <- i + 1
        }
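
The same comparison can be written more compactly with the formula interface of t.test, keeping the limits and p-values numeric instead of storing them as character strings. This is an alternative sketch, not the code that produced the results in this report; welch_by_dose and analysis_num are hypothetical names introduced for illustration. Because supp has the levels OJ and VC in that order, the interval is for the OJ mean minus the VC mean.

```r
library(datasets)
data(ToothGrowth)

# Hypothetical helper: Welch t-test of OJ vs VC at a single dose,
# returned as a one-row data frame with numeric columns
welch_by_dose <- function(d) {
  res <- t.test(len ~ supp,
                data = ToothGrowth[ToothGrowth$dose == d, ],
                var.equal = FALSE)
  data.frame(Dose = d,
             LCL = res$conf.int[1],
             UCL = res$conf.int[2],
             PValue = res$p.value,
             # Same decision rule as above: reject only if LCL > 0 and p <= 0.05
             RejectNull = res$conf.int[1] > 0 && res$p.value <= 0.05)
}

analysis_num <- do.call(rbind, lapply(c(0.5, 1, 2), welch_by_dose))
print(analysis_num, digits = 3)
```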

Now we can print the results of the analysis

Results

        library(xtable)
        print(xtable(analysis, 
                     caption = "T Test on Supp for varying dosage"), 
              type = "html")
T Test on Supp for varying dosage

     Supp1  Supp2  Dose   LCL    UCL    PValue  Hypothesis
  1  OJ     VC     0.5    1.72   8.78   0.0064  Reject Null
  2  OJ     VC     1      2.8    9.06   0.001   Reject Null
  3  OJ     VC     2      -3.8   3.64   0.9639  Accept Null

Conclusions

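As seen in the above table, and under the assumptions stated above, OJ produces significantly greater tooth length than VC at doses 0.5 and 1. At dose 2, the confidence interval contains 0 (p = 0.9639), so there is no evidence of a difference between OJ and VC.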