In this example, we will compare the means of more than two groups using the Analysis Of VAriance model, or ANOVA. This will be done in just the same way as the two-group comparison, by using the linear model framework.
library(mosaic)
## Loading required package: car
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'mosaic'
##
## The following objects are masked from 'package:dplyr':
##
## do, tally
##
## The following object is masked from 'package:car':
##
## logit
##
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cov, D, fivenum, IQR, median, prop.test, sd,
## t.test, var
##
## The following objects are masked from 'package:base':
##
## max, mean, min, print, prod, range, sample, sum
library(RCurl)
## Loading required package: bitops
library(knitr)
url<-"https://raw.githubusercontent.com/coreysparks/data/master/PRB2013_new.csv"
prbdata<-getURL(url)
prbdata<-read.csv(textConnection(prbdata), header=T, dec=",")
We can use a modeling approach to do the same type of statistical test as shown above The two-sample test can be thought of as a type of regression model We use the lm() function to specify a linear model for the difference between the two groups
recode a variable
prbdata$Africa<-ifelse(prbdata$Continent=="Africa",yes= "Africa",no= "Not Africa")
First, the two - group test again
test1<-lm(e0Total~Africa, data=prbdata)
summary(test1)
##
## Call:
## lm(formula = e0Total ~ Africa, data = prbdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.60 -4.27 0.40 4.48 19.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.60 0.87 68.5 <2e-16 ***
## AfricaNot Africa 14.89 1.02 14.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.45 on 204 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.513, Adjusted R-squared: 0.511
## F-statistic: 215 on 1 and 204 DF, p-value: <2e-16
Which shows us the Not African countries, on average have a life expectancy 14.9 years higher than African countries.
Likewise, if we have more than two groups, we can set up a similar test But this time, we call the test the Analysis of Variance, but, it’s still just the good ol’ linear model!
test2<-lm(e0Total~Continent, data=prbdata)
But remember, in this test we have two classes of tests, the Global test that all at least two of the means are not equal, and the post-hoc tests for which specific means are different.
First, we do the global test
anova(test2)
## Analysis of Variance Table
##
## Response: e0Total
## Df Sum Sq Mean Sq F value Pr(>F)
## Continent 5 9864 1973 52.2 <2e-16 ***
## Residuals 200 7559 38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Which shows that, yes, at least two of the means are different (small p-value)
Next, we do the post-hoc tests to see WHICH means are different.
TukeyHSD(test2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = x)
##
## $Continent
## diff lwr upr p adj
## Asia-Africa 13.008 9.569 16.447 0.0000
## Europe-Africa 18.400 14.799 22.001 0.0000
## North America-Africa 15.400 11.243 19.557 0.0000
## Oceania-Africa 11.459 6.550 16.368 0.0000
## South America-Africa 14.092 8.637 19.548 0.0000
## Europe-Asia 5.392 1.730 9.055 0.0005
## North America-Asia 2.392 -1.818 6.603 0.5765
## Oceania-Asia -1.549 -6.503 3.405 0.9462
## South America-Asia 1.084 -4.412 6.581 0.9930
## North America-Europe -3.000 -7.344 1.344 0.3533
## Oceania-Europe -6.941 -12.009 -1.873 0.0015
## South America-Europe -4.308 -9.907 1.292 0.2360
## Oceania-North America -3.941 -9.418 1.536 0.3070
## South America-North America -1.308 -7.280 4.664 0.9887
## South America-Oceania 2.633 -3.884 9.151 0.8540
This shows us the tests between all pairs of continents it also shows us that the differences are mostly between Africa and non-African countries, with a couple of exceptions
Again, we can quantiles to help us make a grouping variable using the cut() function. Here I do this to GDP per capita
prbdata$GDPgroup<-cut(prbdata$GNIPPPperCapitaUSDollars2012,
breaks=quantile(prbdata$GNIPPPperCapitaUSDollars2012,
probs=c(0,.25, .5, .75, 1), na.rm=T))
And, I can do the ANOVA on this group variable:
test3<-lm(e0Total~GDPgroup, data=prbdata)
anova(test3)
## Analysis of Variance Table
##
## Response: e0Total
## Df Sum Sq Mean Sq F value Pr(>F)
## GDPgroup 3 9581 3194 104 <2e-16 ***
## Residuals 174 5334 31
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(test3)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = x)
##
## $GDPgroup
## diff lwr upr p adj
## (2.90e+03,9.39e+03]-(370,2.90e+03] 10.914 7.8686 13.959 0.0000
## (9.39e+03,2.18e+04]-(370,2.90e+03] 13.750 10.6879 16.812 0.0000
## (2.18e+04,8.47e+04]-(370,2.90e+03] 20.358 17.3130 23.403 0.0000
## (9.39e+03,2.18e+04]-(2.90e+03,9.39e+03] 2.836 -0.2087 5.881 0.0778
## (2.18e+04,8.47e+04]-(2.90e+03,9.39e+03] 9.444 6.4165 12.472 0.0000
## (2.18e+04,8.47e+04]-(9.39e+03,2.18e+04] 6.608 3.5630 9.653 0.0000