Final Exam - Course 6 - Statistical Inference - Part 2

The second portion of the project, we’re going to analyze the ToothGrowth data in the R datasets package.

  1. Load the ToothGrowth data and perform some basic exploratory data analyses

  2. Provide a basic summary of the data.

  3. Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there’s other approaches worth considering)

  4. State your conclusions and the assumptions needed for your conclusions.

Parts of the complete report.

Sample Project Report Structure

Of course, there are multiple ways one could structure a report to address the requirements above. However, the more clearly you pose and answer each question, the easier it will be for reviewers to clearly identify and evaluate your work.

A sample set of headings that could be used to guide the creation of your report might be:

  1. Title (give an appropriate title) and Author Name
  2. Overview: In a few (2-3) sentences explain what is going to be reported on.
  3. Simulations: Include English explanations of the simulations you ran, with the accompanying R code. Your explanations should make clear what the R code accomplishes.
  4. Sample Mean versus Theoretical Mean: Include figures with titles. In the figures, highlight the means you are comparing. Include text that explains the figures and what is shown on them, and provides appropriate numbers.
  5. Sample Variance versus Theoretical Variance: Include figures (output from R) with titles. Highlight the variances you are comparing. Include text that explains your understanding of the differences of the variances.
  6. Distribution: Via figures and text, explain how one can tell the distribution is approximately normal.

Identifying data (Input)

sessionInfo()
## R version 3.2.5 (2016-04-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.3     tools_3.2.5     htmltools_0.3.5
##  [5] yaml_2.1.13     Rcpp_0.12.5     stringi_1.1.1   rmarkdown_0.9.6
##  [9] knitr_1.12.3    stringr_1.0.0   digest_0.6.9    evaluate_0.9
head( ToothGrowth, 10 ) # see first 10 records
##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5
tail( ToothGrowth, 10 ) # see last 10 records
##     len supp dose
## 51 25.5   OJ    2
## 52 26.4   OJ    2
## 53 22.4   OJ    2
## 54 24.5   OJ    2
## 55 24.8   OJ    2
## 56 30.9   OJ    2
## 57 26.4   OJ    2
## 58 27.3   OJ    2
## 59 29.4   OJ    2
## 60 23.0   OJ    2
unique(ToothGrowth$supp) # [1] VC OJ
## [1] VC OJ
## Levels: OJ VC
unique(ToothGrowth$dose)  # [1] 0.5 1.0 2.0
## [1] 0.5 1.0 2.0
d1 <- ToothGrowth

d1oj <- d1[ d1$supp=="OJ", c(1,3) ] # isolate OV and reduce data, not relevant with 30 records
d1vc <- d1[ d1$supp=="VC", c(1,3) ] # isolate VC and reduce data

d1oj # Only half
##     len dose
## 31 15.2  0.5
## 32 21.5  0.5
## 33 17.6  0.5
## 34  9.7  0.5
## 35 14.5  0.5
## 36 10.0  0.5
## 37  8.2  0.5
## 38  9.4  0.5
## 39 16.5  0.5
## 40  9.7  0.5
## 41 19.7  1.0
## 42 23.3  1.0
## 43 23.6  1.0
## 44 26.4  1.0
## 45 20.0  1.0
## 46 25.2  1.0
## 47 25.8  1.0
## 48 21.2  1.0
## 49 14.5  1.0
## 50 27.3  1.0
## 51 25.5  2.0
## 52 26.4  2.0
## 53 22.4  2.0
## 54 24.5  2.0
## 55 24.8  2.0
## 56 30.9  2.0
## 57 26.4  2.0
## 58 27.3  2.0
## 59 29.4  2.0
## 60 23.0  2.0
d1vc # Only half
##     len dose
## 1   4.2  0.5
## 2  11.5  0.5
## 3   7.3  0.5
## 4   5.8  0.5
## 5   6.4  0.5
## 6  10.0  0.5
## 7  11.2  0.5
## 8  11.2  0.5
## 9   5.2  0.5
## 10  7.0  0.5
## 11 16.5  1.0
## 12 16.5  1.0
## 13 15.2  1.0
## 14 17.3  1.0
## 15 22.5  1.0
## 16 17.3  1.0
## 17 13.6  1.0
## 18 14.5  1.0
## 19 18.8  1.0
## 20 15.5  1.0
## 21 23.6  2.0
## 22 18.5  2.0
## 23 33.9  2.0
## 24 25.5  2.0
## 25 26.4  2.0
## 26 32.5  2.0
## 27 26.7  2.0
## 28 21.5  2.0
## 29 23.3  2.0
## 30 29.5  2.0
summary(d1oj) # Summary give me some information about the raw data
##       len             dose      
##  Min.   : 8.20   Min.   :0.500  
##  1st Qu.:15.53   1st Qu.:0.500  
##  Median :22.70   Median :1.000  
##  Mean   :20.66   Mean   :1.167  
##  3rd Qu.:25.73   3rd Qu.:2.000  
##  Max.   :30.90   Max.   :2.000
summary(d1vc) # Summary give me some information about the raw data
##       len             dose      
##  Min.   : 4.20   Min.   :0.500  
##  1st Qu.:11.20   1st Qu.:0.500  
##  Median :16.50   Median :1.000  
##  Mean   :16.96   Mean   :1.167  
##  3rd Qu.:23.10   3rd Qu.:2.000  
##  Max.   :33.90   Max.   :2.000
# Summary
summary( lm(d1$len ~ d1$dose) )
## 
## Call:
## lm(formula = d1$len ~ d1$dose)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4496 -2.7406 -0.7452  2.8344 10.1139 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.4225     1.2601    5.89 2.06e-07 ***
## d1$dose       9.7636     0.9525   10.25 1.23e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.601 on 58 degrees of freedom
## Multiple R-squared:  0.6443, Adjusted R-squared:  0.6382 
## F-statistic: 105.1 on 1 and 58 DF,  p-value: 1.233e-14

Overview

The information provided is one object, 60 records, 3 fields on it. The first is the “length” obtained, second column is the “kind of tratement” (OJ or VC) and the third is the dose used. The first 30’s records are results from OJ and the others 30’s from VC tratement.

I can’t assure if the patiente is the same in both tratements (line 1 with 31, 2-32…. etc)

Relation between 0.5, 1 & 2 ( 2/4, 2/2, 2/1 ) using OJ

summary( d1oj [ d1oj$dose==0.5, 1 ] )  # more info about 0.5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.20    9.70   12.25   13.23   16.18   21.50
summary( d1oj [ d1oj$dose==1, 1 ] )    # more info about 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.50   20.30   23.45   22.70   25.65   27.30
summary( d1oj [ d1oj$dose==2, 1 ] )    # more info about 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.40   24.58   25.95   26.06   27.08   30.90
length( d1oj [ d1oj$dose==0.5, 1 ] )  # num of records, no make sense in this sample, because it's a very short list
## [1] 10
length( d1oj [ d1oj$dose==1, 1 ] )    # 
## [1] 10
length( d1oj [ d1oj$dose==2, 1 ] )    # 
## [1] 10

Relation between 0.5, 1 & 2 ( 2/4, 2/2, 2/1 ) using VC

summary( d1vc [ d1vc$dose==0.5, 1 ] )  # more info about 0.5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.20    5.95    7.15    7.98   10.90   11.50
summary( d1vc [ d1vc$dose==1, 1 ] )    # more info about 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.60   15.27   16.50   16.77   17.30   22.50
summary( d1vc [ d1vc$dose==2, 1 ] )    # more info about 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.50   23.38   25.95   26.14   28.80   33.90
length( d1vc [ d1vc$dose==0.5, 1 ] )  # num of records, no make sense in this sample, because it's a very short list
## [1] 10
length( d1vc [ d1vc$dose==1, 1 ] )    # 
## [1] 10
length( d1vc [ d1vc$dose==2, 1 ] )    # 
## [1] 10

As I see, the data is distributed 50/50 between OJ & VC. The data about them are again distributed 1/3, 1/3 & 1/3 between doses.

INFO: 10 records by each Dose by each Tratments Multiply it by 2 tratement multiply it by 3 doses. TOTAL 60 records ( it’s obvius in this data with only a few records )

More info about the data

Comparative between OJ (3 praphics) vs VC ( 3 graph.) without adjust the data.

par(mfrow = c(2,3) )

par(mfrow = c(2,3) )

boxplot( d1oj [ d1oj$dose==0.5, 1 ]  , ylim=c(0,40), main="OJ - 0.5")
boxplot( d1oj [ d1oj$dose==1, 1 ]  , ylim=c(0,40), main="OJ - 1")
boxplot( d1oj [ d1oj$dose==2, 1 ]  , ylim=c(0,40), main="OJ - 2")

boxplot( d1vc [ d1vc$dose==0.5, 1 ]  , ylim=c(0,40), main="VC - 0.5")
boxplot( d1vc [ d1vc$dose==1, 1 ]  , ylim=c(0,40), main="VC - 1")
boxplot( d1vc [ d1vc$dose==2, 1 ]  , ylim=c(0,40), main="VC - 2")

Comparative between OJ (3 praphics) vs VC ( 3 graph.) with “ADJUST” the numbers in relation with the dose level.

Multiply data by 4, 2 or 1 to adjust the level of dose between them.

par(mfrow = c(2,3) )

boxplot( d1oj [ d1oj$dose==0.5, 1 ] * 4  , ylim=c(0,90), main="OJ - 0.5")
boxplot( d1oj [ d1oj$dose==1, 1 ]  * 2, ylim=c(0,90), main="OJ - 1")
boxplot( d1oj [ d1oj$dose==2, 1 ] * 1 , ylim=c(0,90), main="OJ - 2")

boxplot( (d1vc [ d1vc$dose==0.5, 1 ] * 4) , ylim=c(0,55), main="VC - 0.5")
boxplot( d1vc [ d1vc$dose==1, 1 ] * 2 , ylim=c(0,55), main="VC - 1")
boxplot( d1vc [ d1vc$dose==2, 1 ] * 1 , ylim=c(0,55), main="VC - 2")

par(mfrow = c(1,1) )

Comparatives between size of DOSE.

summary( d1oj [ d1oj$dose==0.5, 1 ] )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.20    9.70   12.25   13.23   16.18   21.50
summary( d1vc [ d1vc$dose==0.5, 1 ] )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.20    5.95    7.15    7.98   10.90   11.50
summary( d1oj [ d1oj$dose==1, 1 ] )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.50   20.30   23.45   22.70   25.65   27.30
summary( d1vc [ d1vc$dose==1, 1 ] )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.60   15.27   16.50   16.77   17.30   22.50
summary( d1oj [ d1oj$dose==2, 1 ] )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.40   24.58   25.95   26.06   27.08   30.90
summary( d1vc [ d1vc$dose==2, 1 ] )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.50   23.38   25.95   26.14   28.80   33.90

Comparative between same SIZE

mean( d1oj [ d1oj$dose==0.5, 1 ] ) /  mean( d1vc [ d1vc$dose==0.5, 1 ] )
## [1] 1.657895
mean( d1oj [ d1oj$dose==1, 1 ] ) /  mean( d1vc [ d1vc$dose==1, 1 ] )
## [1] 1.353608
mean( d1oj [ d1oj$dose==2, 1 ] ) /  mean( d1vc [ d1vc$dose==2, 1 ] )
## [1] 0.9969396

Comparatives, seeing information in graphics.

I multiply the results by 4 in the first case (0.5), x2 in the second (1) to compare them with the 3th sample (2). The relation with the data is 4,2,1.

ABOUT OJ Seeing graphics the initial inferences are: * between 0.5 and 1, the increase seams similar, and with a 0.5 dose the desviation is higher, less exact (less power) * comparing 2th and 3th, seams that dose 2 is higher and the tooth did not grow up, or in other words the “less dose” produce a higher increase of tooth. About costs and collateral effects, the 2nd option seams reasonable (less cost in drug and bigger increase) and the 1th option seams a bit irregular, but less dose again, so less costs.

ABOUT VC * The differences in the 3 options are similar, again 2 did not produce the same increase ratio comparing it with 0,5 or 1. * Comparing 0.5 and 1 seams that 0.5 are more disperse and the median is lower.

COMPARATIVE Between SIZE of DOSE.

  • The Mean of OJ for 0.5 SIZE of DOSE was 65% high than same DOSE for VC. ( 13.23/7.98 )
  • The Mean of OJ for 1 SIZE of DOSE was 35% high than same DOSE for VC. ( 22.70/16.77 )
  • The Mean of OJ for 2 SIZE of DOSE was 1% lower than same DOSE for VC. ( 26.06/26.14 )

Graphics

par(mfrow = c(1,3) )
# barplot(mean( d1oj [ d1oj$dose==0.5, 1 ] ) ,  mean( d1vc [ d1vc$dose==0.5, 1 ] ) )
# barplot(mean( d1oj [ d1oj$dose==1, 1 ] ) ,  mean( d1vc [ d1vc$dose==1, 1 ] ) )
# barplot(mean( d1oj [ d1oj$dose==2, 1 ] ) ,  mean( d1vc [ d1vc$dose==2, 1 ] ) )

Verifying it, T-TEST

# Slice by THERAPY
oj <- ToothGrowth[ToothGrowth$supp == "OJ",]
vc <- ToothGrowth[ToothGrowth$supp == "VC",]

# Slice by size of DOSE
oj05 <- oj[oj$dose==0.5,]
oj10 <- oj[oj$dose==1.0,]
oj20 <- oj[oj$dose==2.0,]

vc05 <- vc[vc$dose==0.5,]
vc10 <- vc[vc$dose==1.0,]
vc20 <- vc[vc$dose==2.0,]

# testing 0,5 1 & 2
t.test(len ~ supp, data=rbind(vc05,oj05), var.equal=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC 
##            13.23             7.98
t.test(len ~ supp, data=rbind(vc10,oj10), var.equal=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC 
##            22.70            16.77
t.test(len ~ supp, data=rbind(vc20,oj20), var.equal=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean in group OJ mean in group VC 
##            26.06            26.14
# Testing 0.5 vs 1
t.test(len ~ dose, data=rbind(vc05,vc10), var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  len by dose
## t = -7.4634, df = 18, p-value = 6.492e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.264346  -6.315654
## sample estimates:
## mean in group 0.5   mean in group 1 
##              7.98             16.77
t.test(len ~ dose, data=rbind(oj05,oj10), var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  len by dose
## t = -5.0486, df = 18, p-value = 8.358e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -13.410814  -5.529186
## sample estimates:
## mean in group 0.5   mean in group 1 
##             13.23             22.70
# Testing 1 vs 2
t.test(len ~ dose, data=rbind(vc10,vc20), var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  len by dose
## t = -5.4698, df = 18, p-value = 3.398e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -12.96896  -5.77104
## sample estimates:
## mean in group 1 mean in group 2 
##           16.77           26.14
t.test(len ~ dose, data=rbind(oj10,oj20), var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  len by dose
## t = -2.2478, df = 18, p-value = 0.03736
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.5005017 -0.2194983
## sample estimates:
## mean in group 1 mean in group 2 
##           22.70           26.06
# Testing 0.5 vs 2
t.test(len ~ dose, data=rbind(vc05,vc20), var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  len by dose
## t = -10.388, df = 18, p-value = 4.957e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -21.83284 -14.48716
## sample estimates:
## mean in group 0.5   mean in group 2 
##              7.98             26.14
t.test(len ~ dose, data=rbind(oj05,oj20), var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  len by dose
## t = -7.817, df = 18, p-value = 3.402e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -16.278223  -9.381777
## sample estimates:
## mean in group 0.5   mean in group 2 
##             13.23             26.06

Graphics - All in one.

Simple graphic with all in the same.

moj <- aggregate( oj$len ~ oj$dose, , mean)
colnames(moj) <- c("dose", "len")

mvc <- aggregate( vc$len ~ vc$dose, , mean)
colnames(mvc) <- c("dose", "len")

# Trick to obtain values x
xl <- c( min( min(moj$dose), min(mvc$dose) ), max( max(moj$dose), max(mvc$dose) ) )

# Trick to obtain values y
yl <- c( min( min(moj$len),  min(mvc$len)  ), max( max(moj$len),  max(mvc$len)  ) )

plot( moj , xlim = xl, ylim = yl, type = "b", col = "green", pch = 1, lwd = 3, lty = 1, 
    xlab = "DOSE (0.5 1 2)", ylab = "LEN", main = "GRAPH - all in one")

points(mvc$dose, mvc$len, col = "red", pch = 5)
lines( mvc$dose, mvc$len, col = "red", lwd = 3, lty = 3)

EOF