The second portion of the project, we’re going to analyze the ToothGrowth data in the R datasets package.
Load the ToothGrowth data and perform some basic exploratory data analyses
Provide a basic summary of the data.
Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there’s other approaches worth considering)
State your conclusions and the assumptions needed for your conclusions.
Sample Project Report Structure
Of course, there are multiple ways one could structure a report to address the requirements above. However, the more clearly you pose and answer each question, the easier it will be for reviewers to clearly identify and evaluate your work.
A sample set of headings that could be used to guide the creation of your report might be:
sessionInfo()
## R version 3.2.5 (2016-04-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.3 tools_3.2.5 htmltools_0.3.5
## [5] yaml_2.1.13 Rcpp_0.12.5 stringi_1.1.1 rmarkdown_0.9.6
## [9] knitr_1.12.3 stringr_1.0.0 digest_0.6.9 evaluate_0.9
head( ToothGrowth, 10 ) # see first 10 records
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
## 7 11.2 VC 0.5
## 8 11.2 VC 0.5
## 9 5.2 VC 0.5
## 10 7.0 VC 0.5
tail( ToothGrowth, 10 ) # see last 10 records
## len supp dose
## 51 25.5 OJ 2
## 52 26.4 OJ 2
## 53 22.4 OJ 2
## 54 24.5 OJ 2
## 55 24.8 OJ 2
## 56 30.9 OJ 2
## 57 26.4 OJ 2
## 58 27.3 OJ 2
## 59 29.4 OJ 2
## 60 23.0 OJ 2
unique(ToothGrowth$supp) # [1] VC OJ
## [1] VC OJ
## Levels: OJ VC
unique(ToothGrowth$dose) # [1] 0.5 1.0 2.0
## [1] 0.5 1.0 2.0
d1 <- ToothGrowth
d1oj <- d1[ d1$supp=="OJ", c(1,3) ] # isolate OV and reduce data, not relevant with 30 records
d1vc <- d1[ d1$supp=="VC", c(1,3) ] # isolate VC and reduce data
d1oj # Only half
## len dose
## 31 15.2 0.5
## 32 21.5 0.5
## 33 17.6 0.5
## 34 9.7 0.5
## 35 14.5 0.5
## 36 10.0 0.5
## 37 8.2 0.5
## 38 9.4 0.5
## 39 16.5 0.5
## 40 9.7 0.5
## 41 19.7 1.0
## 42 23.3 1.0
## 43 23.6 1.0
## 44 26.4 1.0
## 45 20.0 1.0
## 46 25.2 1.0
## 47 25.8 1.0
## 48 21.2 1.0
## 49 14.5 1.0
## 50 27.3 1.0
## 51 25.5 2.0
## 52 26.4 2.0
## 53 22.4 2.0
## 54 24.5 2.0
## 55 24.8 2.0
## 56 30.9 2.0
## 57 26.4 2.0
## 58 27.3 2.0
## 59 29.4 2.0
## 60 23.0 2.0
d1vc # Only half
## len dose
## 1 4.2 0.5
## 2 11.5 0.5
## 3 7.3 0.5
## 4 5.8 0.5
## 5 6.4 0.5
## 6 10.0 0.5
## 7 11.2 0.5
## 8 11.2 0.5
## 9 5.2 0.5
## 10 7.0 0.5
## 11 16.5 1.0
## 12 16.5 1.0
## 13 15.2 1.0
## 14 17.3 1.0
## 15 22.5 1.0
## 16 17.3 1.0
## 17 13.6 1.0
## 18 14.5 1.0
## 19 18.8 1.0
## 20 15.5 1.0
## 21 23.6 2.0
## 22 18.5 2.0
## 23 33.9 2.0
## 24 25.5 2.0
## 25 26.4 2.0
## 26 32.5 2.0
## 27 26.7 2.0
## 28 21.5 2.0
## 29 23.3 2.0
## 30 29.5 2.0
summary(d1oj) # Summary give me some information about the raw data
## len dose
## Min. : 8.20 Min. :0.500
## 1st Qu.:15.53 1st Qu.:0.500
## Median :22.70 Median :1.000
## Mean :20.66 Mean :1.167
## 3rd Qu.:25.73 3rd Qu.:2.000
## Max. :30.90 Max. :2.000
summary(d1vc) # Summary give me some information about the raw data
## len dose
## Min. : 4.20 Min. :0.500
## 1st Qu.:11.20 1st Qu.:0.500
## Median :16.50 Median :1.000
## Mean :16.96 Mean :1.167
## 3rd Qu.:23.10 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
# Summary
summary( lm(d1$len ~ d1$dose) )
##
## Call:
## lm(formula = d1$len ~ d1$dose)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4496 -2.7406 -0.7452 2.8344 10.1139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.4225 1.2601 5.89 2.06e-07 ***
## d1$dose 9.7636 0.9525 10.25 1.23e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.601 on 58 degrees of freedom
## Multiple R-squared: 0.6443, Adjusted R-squared: 0.6382
## F-statistic: 105.1 on 1 and 58 DF, p-value: 1.233e-14
The information provided is one object, 60 records, 3 fields on it. The first is the “length” obtained, second column is the “kind of tratement” (OJ or VC) and the third is the dose used. The first 30’s records are results from OJ and the others 30’s from VC tratement.
I can’t assure if the patiente is the same in both tratements (line 1 with 31, 2-32…. etc)
Relation between 0.5, 1 & 2 ( 2/4, 2/2, 2/1 ) using OJ
summary( d1oj [ d1oj$dose==0.5, 1 ] ) # more info about 0.5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.20 9.70 12.25 13.23 16.18 21.50
summary( d1oj [ d1oj$dose==1, 1 ] ) # more info about 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.50 20.30 23.45 22.70 25.65 27.30
summary( d1oj [ d1oj$dose==2, 1 ] ) # more info about 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.40 24.58 25.95 26.06 27.08 30.90
length( d1oj [ d1oj$dose==0.5, 1 ] ) # num of records, no make sense in this sample, because it's a very short list
## [1] 10
length( d1oj [ d1oj$dose==1, 1 ] ) #
## [1] 10
length( d1oj [ d1oj$dose==2, 1 ] ) #
## [1] 10
Relation between 0.5, 1 & 2 ( 2/4, 2/2, 2/1 ) using VC
summary( d1vc [ d1vc$dose==0.5, 1 ] ) # more info about 0.5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.20 5.95 7.15 7.98 10.90 11.50
summary( d1vc [ d1vc$dose==1, 1 ] ) # more info about 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.60 15.27 16.50 16.77 17.30 22.50
summary( d1vc [ d1vc$dose==2, 1 ] ) # more info about 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.50 23.38 25.95 26.14 28.80 33.90
length( d1vc [ d1vc$dose==0.5, 1 ] ) # num of records, no make sense in this sample, because it's a very short list
## [1] 10
length( d1vc [ d1vc$dose==1, 1 ] ) #
## [1] 10
length( d1vc [ d1vc$dose==2, 1 ] ) #
## [1] 10
As I see, the data is distributed 50/50 between OJ & VC. The data about them are again distributed 1/3, 1/3 & 1/3 between doses.
INFO: 10 records by each Dose by each Tratments Multiply it by 2 tratement multiply it by 3 doses. TOTAL 60 records ( it’s obvius in this data with only a few records )
par(mfrow = c(2,3) )
par(mfrow = c(2,3) )
boxplot( d1oj [ d1oj$dose==0.5, 1 ] , ylim=c(0,40), main="OJ - 0.5")
boxplot( d1oj [ d1oj$dose==1, 1 ] , ylim=c(0,40), main="OJ - 1")
boxplot( d1oj [ d1oj$dose==2, 1 ] , ylim=c(0,40), main="OJ - 2")
boxplot( d1vc [ d1vc$dose==0.5, 1 ] , ylim=c(0,40), main="VC - 0.5")
boxplot( d1vc [ d1vc$dose==1, 1 ] , ylim=c(0,40), main="VC - 1")
boxplot( d1vc [ d1vc$dose==2, 1 ] , ylim=c(0,40), main="VC - 2")
Multiply data by 4, 2 or 1 to adjust the level of dose between them.
par(mfrow = c(2,3) )
boxplot( d1oj [ d1oj$dose==0.5, 1 ] * 4 , ylim=c(0,90), main="OJ - 0.5")
boxplot( d1oj [ d1oj$dose==1, 1 ] * 2, ylim=c(0,90), main="OJ - 1")
boxplot( d1oj [ d1oj$dose==2, 1 ] * 1 , ylim=c(0,90), main="OJ - 2")
boxplot( (d1vc [ d1vc$dose==0.5, 1 ] * 4) , ylim=c(0,55), main="VC - 0.5")
boxplot( d1vc [ d1vc$dose==1, 1 ] * 2 , ylim=c(0,55), main="VC - 1")
boxplot( d1vc [ d1vc$dose==2, 1 ] * 1 , ylim=c(0,55), main="VC - 2")
par(mfrow = c(1,1) )
summary( d1oj [ d1oj$dose==0.5, 1 ] )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.20 9.70 12.25 13.23 16.18 21.50
summary( d1vc [ d1vc$dose==0.5, 1 ] )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.20 5.95 7.15 7.98 10.90 11.50
summary( d1oj [ d1oj$dose==1, 1 ] )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.50 20.30 23.45 22.70 25.65 27.30
summary( d1vc [ d1vc$dose==1, 1 ] )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.60 15.27 16.50 16.77 17.30 22.50
summary( d1oj [ d1oj$dose==2, 1 ] )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.40 24.58 25.95 26.06 27.08 30.90
summary( d1vc [ d1vc$dose==2, 1 ] )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.50 23.38 25.95 26.14 28.80 33.90
mean( d1oj [ d1oj$dose==0.5, 1 ] ) / mean( d1vc [ d1vc$dose==0.5, 1 ] )
## [1] 1.657895
mean( d1oj [ d1oj$dose==1, 1 ] ) / mean( d1vc [ d1vc$dose==1, 1 ] )
## [1] 1.353608
mean( d1oj [ d1oj$dose==2, 1 ] ) / mean( d1vc [ d1vc$dose==2, 1 ] )
## [1] 0.9969396
I multiply the results by 4 in the first case (0.5), x2 in the second (1) to compare them with the 3th sample (2). The relation with the data is 4,2,1.
ABOUT OJ Seeing graphics the initial inferences are: * between 0.5 and 1, the increase seams similar, and with a 0.5 dose the desviation is higher, less exact (less power) * comparing 2th and 3th, seams that dose 2 is higher and the tooth did not grow up, or in other words the “less dose” produce a higher increase of tooth. About costs and collateral effects, the 2nd option seams reasonable (less cost in drug and bigger increase) and the 1th option seams a bit irregular, but less dose again, so less costs.
ABOUT VC * The differences in the 3 options are similar, again 2 did not produce the same increase ratio comparing it with 0,5 or 1. * Comparing 0.5 and 1 seams that 0.5 are more disperse and the median is lower.
par(mfrow = c(1,3) )
# barplot(mean( d1oj [ d1oj$dose==0.5, 1 ] ) , mean( d1vc [ d1vc$dose==0.5, 1 ] ) )
# barplot(mean( d1oj [ d1oj$dose==1, 1 ] ) , mean( d1vc [ d1vc$dose==1, 1 ] ) )
# barplot(mean( d1oj [ d1oj$dose==2, 1 ] ) , mean( d1vc [ d1vc$dose==2, 1 ] ) )
# Slice by THERAPY
oj <- ToothGrowth[ToothGrowth$supp == "OJ",]
vc <- ToothGrowth[ToothGrowth$supp == "VC",]
# Slice by size of DOSE
oj05 <- oj[oj$dose==0.5,]
oj10 <- oj[oj$dose==1.0,]
oj20 <- oj[oj$dose==2.0,]
vc05 <- vc[vc$dose==0.5,]
vc10 <- vc[vc$dose==1.0,]
vc20 <- vc[vc$dose==2.0,]
# testing 0,5 1 & 2
t.test(len ~ supp, data=rbind(vc05,oj05), var.equal=FALSE)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC
## 13.23 7.98
t.test(len ~ supp, data=rbind(vc10,oj10), var.equal=FALSE)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC
## 22.70 16.77
t.test(len ~ supp, data=rbind(vc20,oj20), var.equal=FALSE)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.79807 3.63807
## sample estimates:
## mean in group OJ mean in group VC
## 26.06 26.14
# Testing 0.5 vs 1
t.test(len ~ dose, data=rbind(vc05,vc10), var.equal=TRUE)
##
## Two Sample t-test
##
## data: len by dose
## t = -7.4634, df = 18, p-value = 6.492e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.264346 -6.315654
## sample estimates:
## mean in group 0.5 mean in group 1
## 7.98 16.77
t.test(len ~ dose, data=rbind(oj05,oj10), var.equal=TRUE)
##
## Two Sample t-test
##
## data: len by dose
## t = -5.0486, df = 18, p-value = 8.358e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -13.410814 -5.529186
## sample estimates:
## mean in group 0.5 mean in group 1
## 13.23 22.70
# Testing 1 vs 2
t.test(len ~ dose, data=rbind(vc10,vc20), var.equal=TRUE)
##
## Two Sample t-test
##
## data: len by dose
## t = -5.4698, df = 18, p-value = 3.398e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -12.96896 -5.77104
## sample estimates:
## mean in group 1 mean in group 2
## 16.77 26.14
t.test(len ~ dose, data=rbind(oj10,oj20), var.equal=TRUE)
##
## Two Sample t-test
##
## data: len by dose
## t = -2.2478, df = 18, p-value = 0.03736
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.5005017 -0.2194983
## sample estimates:
## mean in group 1 mean in group 2
## 22.70 26.06
# Testing 0.5 vs 2
t.test(len ~ dose, data=rbind(vc05,vc20), var.equal=TRUE)
##
## Two Sample t-test
##
## data: len by dose
## t = -10.388, df = 18, p-value = 4.957e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -21.83284 -14.48716
## sample estimates:
## mean in group 0.5 mean in group 2
## 7.98 26.14
t.test(len ~ dose, data=rbind(oj05,oj20), var.equal=TRUE)
##
## Two Sample t-test
##
## data: len by dose
## t = -7.817, df = 18, p-value = 3.402e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -16.278223 -9.381777
## sample estimates:
## mean in group 0.5 mean in group 2
## 13.23 26.06
Simple graphic with all in the same.
moj <- aggregate( oj$len ~ oj$dose, , mean)
colnames(moj) <- c("dose", "len")
mvc <- aggregate( vc$len ~ vc$dose, , mean)
colnames(mvc) <- c("dose", "len")
# Trick to obtain values x
xl <- c( min( min(moj$dose), min(mvc$dose) ), max( max(moj$dose), max(mvc$dose) ) )
# Trick to obtain values y
yl <- c( min( min(moj$len), min(mvc$len) ), max( max(moj$len), max(mvc$len) ) )
plot( moj , xlim = xl, ylim = yl, type = "b", col = "green", pch = 1, lwd = 3, lty = 1,
xlab = "DOSE (0.5 1 2)", ylab = "LEN", main = "GRAPH - all in one")
points(mvc$dose, mvc$len, col = "red", pch = 5)
lines( mvc$dose, mvc$len, col = "red", lwd = 3, lty = 3)