Peer-graded Assignment: Statistical Inference Course Project Part 2

Part 2

We have been asked to analyze the ToothGrowth data in the R datasets package.

Load the ToothGrowth data and perform some basic exploratory data analyses.
Provide a basic summary of the data.
Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there are other approaches worth considering.)
State your conclusions and the assumptions needed for your conclusions.

Description of Tooth Growth Dataset:

The dataset records the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice (coded as OJ) or ascorbic acid (a form of vitamin C, coded as VC).

Part 2.1 (Exploratory Data Analysis)

First we load our Tooth Growth Dataset:

tooth <- ToothGrowth
str(tooth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

#checking that we don't have any NA's

tooth[!complete.cases(tooth),]

## [1] len  supp dose
## <0 rows> (or 0-length row.names)
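
The same check can be written more compactly. This is a minimal sketch in base R; the name `na.per.column` is introduced here only for illustration.

```r
# Sketch: confirm there are no missing values, column by column
tooth <- ToothGrowth                     # ships with the datasets package

anyNA(tooth)                             # TRUE if any value is missing
na.per.column <- colSums(is.na(tooth))   # number of NAs in each column
print(na.per.column)
```

A count of zero in every column agrees with the empty result of `complete.cases()` above.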

We have a data frame with 60 observations and three variables: the numeric response len, the factor supp (supplement type), and dose, which is stored as numeric but takes only three values. We proceed with summary statistics for the dataset:

summary(tooth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

IQR(tooth$len)

## [1] 12.2

The median (19.25) and mean (18.81) of odontoblast length are roughly equal, which suggests that the distribution is approximately symmetric and is consistent with approximate normality. The middle 50% of the data lies between 13.07 and 25.27 (an interquartile range of 12.2).
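
The statement about the middle 50% can be checked directly. A small sketch, assuming only base R; `q` and `share.inside` are names introduced here for illustration.

```r
# Sketch: quartiles, IQR, and the share of observations between the quartiles
tooth <- ToothGrowth

q <- quantile(tooth$len, probs = c(0.25, 0.75))   # first and third quartiles
share.inside <- mean(tooth$len >= q[1] & tooth$len <= q[2])

unname(diff(q))                        # same value as IQR(tooth$len)
share.inside                           # proportion of observations inside
mean(tooth$len) - median(tooth$len)    # small gap hints at rough symmetry
```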

tapply(tooth$len, tooth$supp, mean)

##       OJ       VC 
## 20.66333 16.96333

Comparing delivery methods, the orange juice (OJ) group has a higher average tooth length than the ascorbic acid (VC) group: 20.66 versus 16.96.
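
Whether this gap in means is larger than sampling variability can be examined with a two-sample t interval. The sketch below assumes the two supplement groups are independent and uses Welch's unequal-variance test, which is the default in `t.test()`; `supp.test` is an illustrative name, not part of the original analysis.

```r
# Sketch: Welch two-sample t-test of tooth length by supplement type
supp.test <- t.test(len ~ supp, data = ToothGrowth)

supp.test$estimate   # group means for OJ and VC
supp.test$conf.int   # 95% confidence interval for the difference (OJ minus VC)
supp.test$p.value
```

Because this comparison pools all three dose levels, the large dose effect ends up in the within-group variability, so it may be less sensitive than an analysis that accounts for dose.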

The dose variable can be treated as categorical since it takes only three levels. We convert it to a factor:

tooth$dose <- as.factor(tooth$dose)
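
One caution when dose is stored as a factor: calling `as.numeric()` on it returns the internal level codes (1, 2, 3) rather than the doses. The sketch below works on a separate copy so it does not disturb `tooth`; `dose.factor` and `dose.numeric` are illustrative names.

```r
# Sketch: converting dose to a factor and safely back to numbers
dose.factor <- as.factor(ToothGrowth$dose)

levels(dose.factor)               # the three dose levels as labels
head(as.numeric(dose.factor))     # level codes, NOT the doses themselves

# Go through character to recover the original numeric doses
dose.numeric <- as.numeric(as.character(dose.factor))
all.equal(dose.numeric, ToothGrowth$dose)
```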

We then average the length of odontoblasts again, this time for each combination of supplement and dose:

library(dplyr)  # provides %>%, group_by(), summarise(), arrange()

tooth.mean.group <- tooth %>%
        group_by(supp, dose ) %>%
        summarise(mean = mean(len)) %>%
        arrange(mean, dose)
print(tooth.mean.group)

## # A tibble: 6 x 3
## # Groups:   supp [2]
##   supp  dose   mean
##   <fct> <fct> <dbl>
## 1 VC    0.5    7.98
## 2 OJ    0.5   13.2 
## 3 VC    1     16.8 
## 4 OJ    1     22.7 
## 5 OJ    2     26.1 
## 6 VC    2     26.1

At the highest dose of 2 mg/day, mean tooth length is nearly identical for the two delivery methods (about 26.1 for both). At the lowest dose of 0.5 mg/day, the orange juice mean (13.2) is roughly 65% higher than the ascorbic acid mean (7.98).
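
The relative difference between the two delivery methods at each dose can be computed from the cell means. A minimal sketch in base R (so it does not depend on dplyr); `cell.means` and `pct.higher` are names introduced here.

```r
# Sketch: how much higher (in percent) the OJ mean is than the VC mean, per dose
cell.means <- with(ToothGrowth, tapply(len, list(supp, dose), mean))

pct.higher <- 100 * (cell.means["OJ", ] / cell.means["VC", ] - 1)
round(pct.higher, 1)
```

Positive values mean the OJ mean is higher; a value near zero at 2 mg/day corresponds to the nearly equal means in the table above.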

Next we look at how variable the data are within each supplement and dose group:

tooth.sd.group <- tooth %>%
        group_by(supp, dose ) %>%
        summarise(sd = sd(len)) %>%
        arrange(dose)
print(tooth.sd.group)

## # A tibble: 6 x 3
## # Groups:   supp [2]
##   supp  dose     sd
##   <fct> <fct> <dbl>
## 1 OJ    0.5    4.46
## 2 VC    0.5    2.75
## 3 OJ    1      3.91
## 4 VC    1      2.52
## 5 OJ    2      2.66
## 6 VC    2      4.80

boxplot(len~ supp * dose, data = tooth, col =c("gold", "green"))

As the boxplot shows, at the highest dose (2 mg/day) the tooth lengths for OJ and VC are about the same. The difference between the two methods is more pronounced at doses of 0.5 and 1 mg/day, where OJ is higher. Since we have two categorical variables (supp and dose), we are also interested in how they interact in their effect on tooth length:

interaction.plot(tooth$dose, tooth$supp, tooth$len, type="b",
                 col=c("red","blue"), pch=c(16, 18),
                 main = "Interaction between Dose and Supplement Type")

Mean tooth length increases with dose for both supplement types. OJ stays higher than VC at 0.5 and 1 mg/day, but at 2 mg/day the two means are nearly equal. Because the lines are not parallel, the effect of supplement type appears to depend on dose, which suggests an interaction.

Part 2.2: Inferential Analysis - Conducting an ANOVA Test

For the inferential analysis, we conduct a two-way ANOVA to determine whether the differences in mean tooth length between the groups (dose and supp) reflect true differences in the population means or are just due to sampling variability. Recall the formula for the F statistic:

F statistic = variation among sample means / variation within groups
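
To make the ratio concrete, the sketch below computes the F statistic for dose by hand from sums of squares. It assumes the balanced design of this dataset (10 animals in each supp-by-dose cell) and uses the residual variation of the full supp-by-dose model as the within-group term; all variable names are introduced here for illustration.

```r
# Sketch: F statistic for dose, computed by hand
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

grand.mean <- mean(tg$len)
dose.means <- tapply(tg$len, tg$dose, mean)    # one mean per dose level
n.per.dose <- as.numeric(table(tg$dose))       # number of animals per dose

# Variation among the dose means (between-group sum of squares)
ss.dose <- sum(n.per.dose * (dose.means - grand.mean)^2)
df.dose <- nlevels(tg$dose) - 1

# Variation within the six supp-by-dose cells (residual sum of squares)
cell.means <- ave(tg$len, tg$supp, tg$dose)    # each row's own cell mean
ss.resid <- sum((tg$len - cell.means)^2)
df.resid <- nrow(tg) - nlevels(tg$supp) * nlevels(tg$dose)

f.dose <- (ss.dose / df.dose) / (ss.resid / df.resid)
f.dose
```

Under these assumptions, `f.dose` should agree with the F value reported on the dose row of a two-way ANOVA table for `len ~ supp * dose`.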

The F statistic and its p-value indicate whether the variation among the group means is large relative to the variation within each group. We state our hypotheses:

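H0 (null hypothesis): the population mean tooth length is the same across groups; observed differences are due to chance (there’s nothing going on)

HA (alternative hypothesis): at least one group mean differs, so tooth length is related to supplement type and/or dose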

In order to apply the ANOVA test, we must assume:

The dependent variable is approximately normally distributed within each group
Standard deviations are roughly equal across groups
Observations are independent of each other

Recall from the exploratory data analysis that mean tooth length varies across the groups, while the group standard deviations are of similar magnitude (between 2.52 and 4.80).
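
The spread of the group standard deviations can be summarized in a single number. The sketch below applies an informal rule of thumb (an assumption added here, not part of the original analysis) that treats the equal-variance assumption as tolerable when the largest group SD is no more than about twice the smallest; `cell.sd` and `sd.ratio` are illustrative names.

```r
# Sketch: ratio of the largest to the smallest group standard deviation
cell.sd <- with(ToothGrowth, tapply(len, list(supp, dose), sd))

round(cell.sd, 2)
sd.ratio <- max(cell.sd) / min(cell.sd)
sd.ratio
```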

To check the normality assumption, we draw a Q-Q plot of the model residuals (if the points fall within the 95% confidence envelope, the normality assumption is reasonable):

library(car)  # provides qqPlot()

qqPlot(lm(len ~ dose*supp, data=tooth),        
       simulate=TRUE, main="Q-Q Plot", labels=FALSE)

Note that the model includes both factors and their interaction. The plot indicates that the residuals stay within the 95% confidence envelope and approximately follow a normal distribution.
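
As a numeric complement to the plot, a Shapiro-Wilk test can be run on the residuals of the same model. This is only a sketch and goes slightly beyond the graphical check used above; `fit` and `sw` are names introduced here, and the test relies on base R's `shapiro.test()`.

```r
# Sketch: Shapiro-Wilk normality test on the residuals of the two-way model
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

fit <- lm(len ~ supp * dose, data = tg)
sw <- shapiro.test(residuals(fit))
sw$statistic   # W close to 1 indicates residuals close to normal
sw$p.value     # a small p-value would cast doubt on normality
```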

The next step is fitting the ANOVA model:

anova <- aov(len ~ supp*dose, data = tooth) 
summary(anova)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## supp         1  205.4   205.4  15.572 0.000231 ***
## dose         2 2426.4  1213.2  92.000  < 2e-16 ***
## supp:dose    2  108.3    54.2   4.107 0.021860 *  
## Residuals   54  712.1    13.2                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on these results, both main effects and the interaction term are significant: the p-value is below 5% in each case. We therefore have strong evidence against the null hypothesis and reject it in favor of the alternative HA: tooth growth in the guinea pigs is related to both supplement type and dose, and the effect of supplement type depends on dose. These differences are unlikely to be due to chance or random sampling alone. The conclusions rely on the assumptions stated above: approximately normal residuals, roughly equal group variances, and independent observations.
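
Since the interaction is significant, it can help to compare the two supplement types separately at each dose. The sketch below runs a Welch t-test within each dose level using base R; it assumes independent groups, applies no multiple-comparison adjustment, and the names `by.dose` and `supp.by.dose` are introduced here for illustration.

```r
# Sketch: OJ versus VC at each dose, with 95% confidence intervals
by.dose <- split(ToothGrowth, ToothGrowth$dose)

supp.by.dose <- t(sapply(by.dose, function(d) {
  tt <- t.test(len ~ supp, data = d)   # Welch test, OJ minus VC
  c(lower = tt$conf.int[1], upper = tt$conf.int[2], p.value = tt$p.value)
}))

round(supp.by.dose, 4)
```

An interval that excludes zero at a given dose points to a difference between delivery methods at that dose, while an interval that contains zero does not.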

Jaime Paz

May 7th, 2018