Day 5 Example

Comparing the mean of multiple groups using the linear model (ANOVA)

In this example, we will compare the means of more than two groups using the Analysis Of VAriance model, or ANOVA. This will be done in just the same way as the two-group comparison, by using the linear model framework.

library(mosaic)
## Loading required package: car
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'mosaic'
## 
## The following objects are masked from 'package:dplyr':
## 
##     do, tally
## 
## The following object is masked from 'package:car':
## 
##     logit
## 
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cov, D, fivenum, IQR, median, prop.test, sd,
##     t.test, var
## 
## The following objects are masked from 'package:base':
## 
##     max, mean, min, print, prod, range, sample, sum
library(RCurl)
## Loading required package: bitops
library(knitr)
url<-"https://raw.githubusercontent.com/coreysparks/data/master/PRB2013_new.csv"
prbdata<-getURL(url)
prbdata<-read.csv(textConnection(prbdata), header=T, dec=",")

We can use a modeling approach to do the same type of statistical test as shown above The two-sample test can be thought of as a type of regression model We use the lm() function to specify a linear model for the difference between the two groups

recode a variable

prbdata$Africa<-ifelse(prbdata$Continent=="Africa",yes= "Africa",no= "Not Africa")

First, the two - group test again

test1<-lm(e0Total~Africa, data=prbdata)
summary(test1)
## 
## Call:
## lm(formula = e0Total ~ Africa, data = prbdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14.60  -4.27   0.40   4.48  19.40 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         59.60       0.87    68.5   <2e-16 ***
## AfricaNot Africa    14.89       1.02    14.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.45 on 204 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.513,  Adjusted R-squared:  0.511 
## F-statistic:  215 on 1 and 204 DF,  p-value: <2e-16

Which shows us the Not African countries, on average have a life expectancy 14.9 years higher than African countries.

Likewise, if we have more than two groups, we can set up a similar test But this time, we call the test the Analysis of Variance, but, it’s still just the good ol’ linear model!

test2<-lm(e0Total~Continent, data=prbdata)

But remember, in this test we have two classes of tests, the Global test that all at least two of the means are not equal, and the post-hoc tests for which specific means are different.

First, we do the global test

anova(test2)
## Analysis of Variance Table
## 
## Response: e0Total
##            Df Sum Sq Mean Sq F value Pr(>F)    
## Continent   5   9864    1973    52.2 <2e-16 ***
## Residuals 200   7559      38                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Which shows that, yes, at least two of the means are different (small p-value)

Next, we do the post-hoc tests to see WHICH means are different.

TukeyHSD(test2)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = x)
## 
## $Continent
##                               diff     lwr    upr  p adj
## Asia-Africa                 13.008   9.569 16.447 0.0000
## Europe-Africa               18.400  14.799 22.001 0.0000
## North America-Africa        15.400  11.243 19.557 0.0000
## Oceania-Africa              11.459   6.550 16.368 0.0000
## South America-Africa        14.092   8.637 19.548 0.0000
## Europe-Asia                  5.392   1.730  9.055 0.0005
## North America-Asia           2.392  -1.818  6.603 0.5765
## Oceania-Asia                -1.549  -6.503  3.405 0.9462
## South America-Asia           1.084  -4.412  6.581 0.9930
## North America-Europe        -3.000  -7.344  1.344 0.3533
## Oceania-Europe              -6.941 -12.009 -1.873 0.0015
## South America-Europe        -4.308  -9.907  1.292 0.2360
## Oceania-North America       -3.941  -9.418  1.536 0.3070
## South America-North America -1.308  -7.280  4.664 0.9887
## South America-Oceania        2.633  -3.884  9.151 0.8540

This shows us the tests between all pairs of continents it also shows us that the differences are mostly between Africa and non-African countries, with a couple of exceptions

Again, we can quantiles to help us make a grouping variable using the cut() function. Here I do this to GDP per capita

prbdata$GDPgroup<-cut(prbdata$GNIPPPperCapitaUSDollars2012, 
                      breaks=quantile(prbdata$GNIPPPperCapitaUSDollars2012,
                                      probs=c(0,.25, .5, .75, 1), na.rm=T))

And, I can do the ANOVA on this group variable:

test3<-lm(e0Total~GDPgroup, data=prbdata)
anova(test3)
## Analysis of Variance Table
## 
## Response: e0Total
##            Df Sum Sq Mean Sq F value Pr(>F)    
## GDPgroup    3   9581    3194     104 <2e-16 ***
## Residuals 174   5334      31                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(test3)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = x)
## 
## $GDPgroup
##                                           diff     lwr    upr  p adj
## (2.90e+03,9.39e+03]-(370,2.90e+03]      10.914  7.8686 13.959 0.0000
## (9.39e+03,2.18e+04]-(370,2.90e+03]      13.750 10.6879 16.812 0.0000
## (2.18e+04,8.47e+04]-(370,2.90e+03]      20.358 17.3130 23.403 0.0000
## (9.39e+03,2.18e+04]-(2.90e+03,9.39e+03]  2.836 -0.2087  5.881 0.0778
## (2.18e+04,8.47e+04]-(2.90e+03,9.39e+03]  9.444  6.4165 12.472 0.0000
## (2.18e+04,8.47e+04]-(9.39e+03,2.18e+04]  6.608  3.5630  9.653 0.0000