Coursera Inference assgn2

The purpose of this exercise is to explore the ToothGrowth data from the R datasets package, provide a basic summary of the data, and use confidence intervals or hypothesis tests to compare tooth growth by the variables supp and dose. Conclusions are given as we go and summarized at the end.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

# First, give the table a shorter name
tg <- ToothGrowth
# Show the structure of the table
str(tg)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

The table has 60 observations of 3 variables: len, supp and dose. The factor supp has two levels:

unique(tg$supp)

## [1] VC OJ
## Levels: OJ VC

The numeric variable dose takes three distinct values:

unique(tg$dose)

## [1] 0.5 1.0 2.0

Four charts in the appendix compare tooth length across the levels of the grouping variables. The charts appear to show a difference in length between the groups, and the group summaries below suggest the same. Mean and standard deviation of len by groups in supp:

tg_supp<-summarize(group_by(tg,supp),m_supp=mean(len),sd_supp=sd(len))
tg_supp

Mean and standard deviation of len by groups in dose:

tg_dose<-summarize(group_by(tg,dose),m_dose=mean(len),sd_dose=sd(len))
tg_dose
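The two summaries above look at supp and dose one at a time. As a minimal sketch (not part of the original analysis), the same statistics can be computed for every supp and dose combination with base R's `aggregate`, which corresponds to the grouping used in Chart 4. The name `cell_stats` is chosen here for illustration only.

```r
# Mean and standard deviation of len for every supp x dose cell (base R only).
# cell_stats is an illustrative name, not one used elsewhere in this report.
cell_stats <- aggregate(len ~ supp + dose, data = ToothGrowth,
                        FUN = function(x) c(mean = mean(x), sd = sd(x)))
cell_stats
```

Each row describes one of the six groups; the `len` column holds a two-column matrix with the mean and the standard deviation.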

It is therefore worth running hypothesis tests to find out whether these differences are statistically significant. The histograms in the appendix suggest that the distributions within the groups are acceptably close to normal, which is an assumption of the t test. We will assume unequal variances, since at this point we cannot guarantee that the variances are equal.
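The normality judgment above rests on visual inspection of histograms. One possible numerical complement, which was not part of the original analysis, is a Shapiro-Wilk test within each group; a small p-value would point to a departure from normality. The names `sw_supp` and `sw_dose` are illustrative.

```r
# Shapiro-Wilk p-values for len within each supp group and each dose group.
# This is a supplementary check, not a step taken in the original report.
sw_supp <- tapply(ToothGrowth$len, ToothGrowth$supp,
                  function(x) shapiro.test(x)$p.value)
sw_dose <- tapply(ToothGrowth$len, ToothGrowth$dose,
                  function(x) shapiro.test(x)$p.value)
sw_supp
sw_dose
```

With groups of 20 to 30 observations the test has limited power, so its result is best read together with the histograms rather than instead of them.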

Test 1 supp=OJ vs supp=VC H_0=equal tooth length means, H_a=length mean in group supp OJ > length mean in group supp VC

t.test(tg$len~tg$supp,alternative="greater")

## 
##  Welch Two Sample t-test
## 
## data:  tg$len by tg$supp
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.4682687       Inf
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

At a significance level of 0.05, since the p-value (0.03032) is smaller than 0.05, we reject the null hypothesis that the means of both groups are equal, in favor of the alternative hypothesis that the mean length in group OJ is greater than the mean length in group VC.
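To make explicit what the Welch test computes, here is a sketch that rebuilds the statistic, the Welch-Satterthwaite degrees of freedom and the one-sided p-value by hand. The variable names (`oj`, `vc`, `se2_oj`, and so on) are illustrative; the result should agree with the `t.test` output for Test 1.

```r
# Split len by supplement type
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Squared standard error of each group mean
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)

# Welch t statistic: difference in means over the combined standard error
t_stat <- (mean(oj) - mean(vc)) / sqrt(se2_oj + se2_vc)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# One-sided p-value for H_a: mean(OJ) > mean(VC)
p_value <- pt(t_stat, df_welch, lower.tail = FALSE)

c(t = t_stat, df = df_welch, p = p_value)
```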

We now consider the other variable, dose. Since it takes three values and therefore defines three groups, and we have to use the techniques learned in the course, we will stick to the two sample t test and run one test per pair of dose groups.

Test 2 dose=1 vs dose=0.5 H_0=equal length means, H_a=length mean in group dose 1 > length mean in group dose 0.5

t.test(filter(tg,dose==1)$len,filter(tg,dose==0.5)$len,alternative="greater")

## 
##  Welch Two Sample t-test
## 
## data:  filter(tg, dose == 1)$len and filter(tg, dose == 0.5)$len
## t = 6.4766, df = 37.986, p-value = 6.342e-08
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  6.753323      Inf
## sample estimates:
## mean of x mean of y 
##    19.735    10.605

The p-value (6.342e-08) is very small, so we reject the null hypothesis and accept the alternative hypothesis.

Test 3 dose=2 vs dose=1 H_0=equal length means, H_a=length mean in group dose 2 > length mean in group dose 1

t.test(filter(tg,dose==2)$len,filter(tg,dose==1)$len,alternative="greater")

## 
##  Welch Two Sample t-test
## 
## data:  filter(tg, dose == 2)$len and filter(tg, dose == 1)$len
## t = 4.9005, df = 37.101, p-value = 9.532e-06
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  4.17387     Inf
## sample estimates:
## mean of x mean of y 
##    26.100    19.735

The p-value is again really small. We reject the null hypothesis and accept the alternative hypothesis.

The third pairwise comparison (dose=2 vs dose=0.5) is not run here. Since the mean length is significantly greater at dose 1 than at dose 0.5, and again at dose 2 than at dose 1, we expect the null hypothesis of equal means between the dose=2 and dose=0.5 groups to be rejected as well, in favor of the alternative hypothesis that the mean length in the dose=2 group is greater.
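For completeness, the comparison between dose=2 and dose=0.5 can be run in the same way as Tests 2 and 3. This sketch uses base R subsetting instead of `filter`, and `res_2_vs_05` is an illustrative name; it was not part of the original analysis.

```r
# One-sided Welch test: is mean len at dose 2 greater than at dose 0.5?
len_2  <- ToothGrowth$len[ToothGrowth$dose == 2]
len_05 <- ToothGrowth$len[ToothGrowth$dose == 0.5]
res_2_vs_05 <- t.test(len_2, len_05, alternative = "greater")
res_2_vs_05
```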

CONCLUSION By conducting a number of two sample t tests we have found that, at a 0.05 significance level, tooth length differs statistically significantly between the levels of both variables, supp and dose: mean length is greater with OJ than with VC, and it increases with dose.
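Because several tests are run on the same data, one might also want to check whether the conclusion survives a multiple-comparison correction. The sketch below, which goes beyond the original analysis, applies Holm's adjustment from base R's `p.adjust` to the three p-values reported above. Treating the three tests as one family is an assumption made here for illustration.

```r
# p-values copied from the three t.test outputs in this report
p_raw <- c(supp_OJ_vs_VC = 0.03032,
           dose_1_vs_0.5 = 6.342e-08,
           dose_2_vs_1   = 9.532e-06)

# Holm's step-down adjustment controls the family-wise error rate
p_holm <- p.adjust(p_raw, method = "holm")
p_holm

# Which tests remain significant at the 0.05 level after adjustment?
p_holm < 0.05
```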

Appendix These are the charts used to explore the data and to direct the analysis. Chart 1, overall distribution of len

g1<-ggplot(data.frame(tg),aes(x=len))+geom_histogram(binwidth=5,fill="cyan",alpha=0.2,col="pink3",size=1.15)
g1

Chart 2, comparison of groups by variable supp

g2<-ggplot(data.frame(tg),aes(x=len))+geom_histogram(binwidth=5,fill="cyan",alpha=0.2,col="pink3",size=1.15)+ facet_grid(.~supp)
g2

Chart 3, comparison of groups by variable dose

g3<-ggplot(data.frame(tg),aes(x=len))+geom_histogram(binwidth=5,fill="cyan",alpha=0.2,col="pink3",size=1.15)+ facet_grid(.~dose)
g3

Chart 4, comparison of groups by both variables, supp and dose

g4<-ggplot(data.frame(tg),aes(x=len))+geom_histogram(binwidth=5,fill="cyan",alpha=0.2,col="pink3",size=1.15)+ facet_grid(supp~dose)
g4

TheCmos

February 16, 2019