Part Two - Basic Inferential Data Analysis

1. The Data

We start by loading the ToothGrowth data frame. It contains the data collected in a study of the effect of vitamin C on tooth growth in guinea pigs.

# Loading the dataset
library(datasets)
data(ToothGrowth)

We look at which variables the data frame contains.

# Dataset description
head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

As we can see, the data frame contains 60 observations of 3 variables:

len - numeric, the tooth length;
supp - factor, the supplement type administered (VC or OJ);
dose - numeric, the administered dose (in milligrams per day).

Since the administered dose takes only three values (0.5, 1 and 2), we convert it from numeric to factor.
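The conversion itself is not shown above. A minimal sketch, assuming the dose column is simply overwritten in place with factor() (one common way to do it, not necessarily the one used to produce the outputs that follow):

```r
library(datasets)
data(ToothGrowth)

# Assumption: overwrite the numeric dose column with a factor, in place
ToothGrowth$dose <- factor(ToothGrowth$dose)

# The distinct dose values become the factor levels
levels(ToothGrowth$dose)
```

After this step, summary() and lm() treat dose as a categorical variable rather than a continuous one.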

2. Exploratory Data Analysis

We look at a summary of the dataset.

# Summary statistics for the variables
summary(ToothGrowth)

##       len        supp     dose   
##  Min.   : 4.20   OJ:30   0.5:20  
##  1st Qu.:13.07   VC:30   1  :20  
##  Median :19.25           2  :20  
##  Mean   :18.81                   
##  3rd Qu.:25.27                   
##  Max.   :33.90

Since supp and dose are factor variables, we cross-tabulate the observations by these two factors.

# Counting observations for each dose level and delivery method
table(ToothGrowth$dose, ToothGrowth$supp)

##      
##       OJ VC
##   0.5 10 10
##   1   10 10
##   2   10 10
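The table shows a balanced design, with ten animals in every cell, so the cell means can be compared directly. As a sketch, the mean tooth length in each cell can be computed with aggregate(); the name cell_means is only illustrative:

```r
library(datasets)
data(ToothGrowth)

# Mean tooth length for every supplement/dose combination
cell_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
cell_means
```

With equal cell sizes, the average of these six means equals the overall mean tooth length.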

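We plot the tooth lengths for each dose, split by supplement type, as bar charts. The bars stack the individual observations, so the height of each bar is the total tooth length of the ten animals in that group.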

library(ggplot2)

# Stacked bars: each bar's height is the summed tooth length for that dose and supplement
ggplot(data = ToothGrowth, aes(x = as.factor(dose), y = len, fill = supp)) +
    geom_bar(stat = "identity") +
    scale_fill_manual(values = c("#0072B2", "#D55E00")) +
    facet_grid(. ~ supp, scales = "free") +
    xlab("Administered dose (in mg/day)") +
    ylab("Tooth length") +
    guides(fill = guide_legend(title = "Supplement type"))

[Figure: bar chart of tooth length by administered dose, one panel per supplement type]

The plot suggests a clear positive association between tooth length (len) and the dose of vitamin C (dose) for both supplement types.
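To put a number on this visual impression, one option is to compute the correlation between tooth length and dose separately for each supplement type. The sketch below allows for dose having already been converted to a factor, so it maps the values back to numbers first; dose_num and cor_by_supp are illustrative names:

```r
library(datasets)
data(ToothGrowth)

# Works whether dose is still numeric or has already been converted to a factor
dose_num <- as.numeric(as.character(ToothGrowth$dose))

# Pearson correlation between tooth length and dose within each supplement type
cor_by_supp <- sapply(
    split(seq_len(nrow(ToothGrowth)), ToothGrowth$supp),
    function(idx) cor(ToothGrowth$len[idx], dose_num[idx])
)
cor_by_supp
```

A positive value for both OJ and VC would be consistent with what the plot shows.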

3. Hypothesis Testing

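We fit a linear model of tooth length on administered dose (dose) and supplement type (supp), and use confidence intervals for its coefficients to compare tooth growth across these groups. The 95% confidence intervals for the intercept and for the dose and supplement coefficients are as follows: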

fit <- lm(len ~ dose + supp, data=ToothGrowth)
confint(fit)

##                 2.5 %    97.5 %
## (Intercept) 10.475238 14.434762
## dose1        6.705297 11.554703
## dose2       13.070297 17.919703
## suppVC      -5.679762 -1.720238

This means that if we collected new data many times and recomputed the intervals each time, about 95% of those intervals would contain the true value of the corresponding coefficient.
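For a linear model, each of these intervals is the estimate plus or minus a t quantile times the standard error. A minimal sketch that rebuilds the intervals by hand, assuming dose has been converted to a factor as described earlier (coef_table, t_crit and manual_ci are illustrative names):

```r
library(datasets)
data(ToothGrowth)
ToothGrowth$dose <- factor(ToothGrowth$dose)  # dose as a factor, as in the text

fit <- lm(len ~ dose + supp, data = ToothGrowth)

# Estimates and standard errors from the coefficient table
coef_table <- coef(summary(fit))
est <- coef_table[, "Estimate"]
se  <- coef_table[, "Std. Error"]

# Two-sided 95% critical value of the t distribution on the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
manual_ci
```

The result should agree with confint(fit) up to rounding.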

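For each coefficient (i.e. the intercept, dose1, dose2 and suppVC), the null hypothesis is that the coefficient is zero. For dose1, dose2 and suppVC this means that the corresponding level does not change the mean tooth length relative to the reference level (a dose of 0.5 and orange juice, respectively).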

summary(fit)

## 
## Call:
## lm(formula = len ~ dose + supp, data = ToothGrowth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.085 -2.751 -0.800  2.446  9.650 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.4550     0.9883  12.603  < 2e-16 ***
## dose1         9.1300     1.2104   7.543 4.38e-10 ***
## dose2        15.4950     1.2104  12.802  < 2e-16 ***
## suppVC       -3.7000     0.9883  -3.744 0.000429 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.828 on 56 degrees of freedom
## Multiple R-squared:  0.7623, Adjusted R-squared:  0.7496 
## F-statistic: 59.88 on 3 and 56 DF,  p-value: < 2.2e-16

All p-values are below 0.05, so each null hypothesis can be rejected at the 5% level. The result suggests that both dose and supplement type explain a significant portion of the variability in tooth length.

The regression therefore addresses both questions at once: how the dose affects tooth length, and whether the supplement type (orange juice, OJ, or ascorbic acid, VC) has any effect on it.
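The supplement question can also be checked directly, ignoring dose, with a two-sample t-test. A sketch, assuming Welch's unequal-variance test (the default of t.test) is acceptable here; it is a complementary check rather than part of the model above, and supp_test is an illustrative name:

```r
library(datasets)
data(ToothGrowth)

# Welch two-sample t-test of tooth length between OJ and VC, pooling all doses
supp_test <- t.test(len ~ supp, data = ToothGrowth)

supp_test$conf.int  # 95% confidence interval for mean(OJ) - mean(VC)
supp_test$p.value
```

Because this test ignores dose, the variation due to dose ends up in the error term, so it can be less sensitive than the suppVC coefficient of the regression.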

Conclusions

The model explains about 76% of the variance in the data (multiple R-squared 0.7623, adjusted R-squared 0.7496).

The intercept is 12.455. It is the estimated mean tooth length for the reference group, that is, a dose of 0.5 mg delivered as orange juice (OJ).

Because dose is a factor, the model has one coefficient for each dose level other than the reference. The coefficient dose1 is 9.13: raising the dose from 0.5 mg to 1 mg increases the mean tooth length by 9.13 units, all other variables remaining equal (no change in the supplement type in our case). The coefficient dose2 is 15.495: raising the dose from 0.5 mg to 2 mg increases the mean tooth length by 15.495 units.

The last coefficient is for the supplement type. Since the supplement type is a categorical variable, a dummy variable is used. The computed coefficient is for suppVC and its value is -3.70: at a given dose, delivering it as ascorbic acid (VC) rather than orange juice decreases the mean tooth length by 3.70 units.
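One way to read the coefficients is to turn them into fitted mean lengths for each dose and supplement combination. A sketch, assuming dose is a factor and the same additive model is refitted; new_data and fitted_len are illustrative names:

```r
library(datasets)
data(ToothGrowth)
ToothGrowth$dose <- factor(ToothGrowth$dose)  # dose as a factor, as in the text

fit <- lm(len ~ dose + supp, data = ToothGrowth)

# All six dose/supplement combinations (dose varies fastest)
new_data <- expand.grid(dose = levels(ToothGrowth$dose),
                        supp = levels(ToothGrowth$supp))

# Fitted mean tooth length for each combination under the additive model
new_data$fitted_len <- predict(fit, newdata = new_data)
new_data
```

In this table the row for dose 0.5 with OJ reproduces the intercept, and at any fixed dose the VC row differs from the OJ row by the suppVC coefficient.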