Day 4 Example

Comparing the mean of two groups using the linear model

We will now begin testing whether the differences we have seen are “real” or not. We start with a t-test for a difference between two means. First, we load the mosaic library and our data.

library(mosaic)
## Loading required package: car
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'mosaic'
## 
## The following objects are masked from 'package:dplyr':
## 
##     do, tally
## 
## The following object is masked from 'package:car':
## 
##     logit
## 
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cov, D, fivenum, IQR, median, prop.test, sd,
##     t.test, var
## 
## The following objects are masked from 'package:base':
## 
##     max, mean, min, print, prod, range, sample, sum
library(RCurl)
## Loading required package: bitops
library(knitr)
url<-"https://raw.githubusercontent.com/coreysparks/data/master/PRB2013_new.csv"
prbdata<-getURL(url)
prbdata<-read.csv(textConnection(prbdata), header=T, dec=",")

Next, we use the ifelse() function to create a new variable, Africa, that labels each country as "Africa" or "Not Africa":

prbdata$Africa<-ifelse(prbdata$Continent=="Africa",yes= "Africa",no= "Not Africa")
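
As a small illustration of how ifelse() behaves, here is a sketch on a made-up vector (the `continent` values below are hypothetical and are not taken from prbdata). It also shows one thing worth checking after a recode: a missing value in the condition stays missing in the result.

```r
# Hypothetical toy vector, not the PRB data
continent <- c("Africa", "Asia", "Europe", "Africa", NA)

# ifelse() checks the condition element by element and returns
# the 'yes' value where it is TRUE and the 'no' value where it is FALSE
africa <- ifelse(continent == "Africa", yes = "Africa", no = "Not Africa")
africa

# Count the recoded groups, showing any missing values as their own category
table(africa, useNA = "ifany")
```

Running table() on the new variable is a quick way to confirm that the recode produced only the categories you intended.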

Now we can use our new variable for some descriptive analysis, comparing the mean and standard deviation of life expectancy (e0Total) between the two groups:

mean(e0Total~Africa, data=prbdata, na.rm=T)
##     Africa Not Africa 
##      59.60      74.49
sd(e0Total~Africa, data=prbdata, na.rm=T)
##     Africa Not Africa 
##      8.608      5.467
bwplot(e0Total~Africa, prbdata)

[Figure: box plot of e0Total by Africa]

Here is our test of whether average life expectancy is the same in African and non-African countries. To do this, we construct a linear model for the difference in means between the two groups, which looks like:

\( e0Total_i = a + b \cdot Africa_i + e_i \)

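Where a is the mean e0Total in Africa, and b is the difference between the mean of the non-African countries and the mean of the African countries. The Africa term is coded 0 for African countries and 1 for non-African countries, so b only enters the equation for the non-African group. e contains all the information on e0Total that the difference between groups doesn’t explain, and is called the residual.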
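
To make the roles of a and b concrete, here is a minimal sketch on made-up data (the `toy` data frame and its six values are hypothetical, chosen only so the arithmetic is easy to follow). It prints the 0/1 indicator that lm() builds from the grouping variable and checks that the coefficients reproduce the two group means.

```r
# Hypothetical toy data: life expectancy for six made-up countries
toy <- data.frame(
  e0     = c(55, 60, 65, 72, 75, 78),
  Africa = c("Africa", "Africa", "Africa",
             "Not Africa", "Not Africa", "Not Africa")
)

# The design matrix lm() builds: a column of 1s for the intercept (a)
# and a 0/1 indicator that is 1 for "Not Africa" (multiplied by b)
X <- model.matrix(e0 ~ Africa, data = toy)
X

# Fit the model and compare the coefficients with the raw group means
fit <- lm(e0 ~ Africa, data = toy)
a <- unname(coef(fit)[1])   # mean of the reference group ("Africa")
b <- unname(coef(fit)[2])   # mean("Not Africa") minus mean("Africa")
group_means <- tapply(toy$e0, toy$Africa, mean)

c(a = a, a_plus_b = a + b)
group_means
```

"Africa" becomes the reference group simply because it comes first alphabetically among the factor levels; relevel() could be used to change that.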

test1<-lm(e0Total~Africa, data=prbdata)
kable(summary(test1)$coef, digits=3)

|                 | Estimate| Std. Error| t value| Pr(>\|t\|)|
|:----------------|--------:|----------:|-------:|----------:|
|(Intercept)      |   59.600|      0.870|  68.536|      0.000|
|AfricaNot Africa |   14.890|      1.016|  14.660|      0.000|

It most certainly is not the same: the p-value for the Not Africa coefficient is very, very small, 0 to three decimal places. If the two groups really had the same mean, a difference as large as 14.89 years would be very unlikely to arise by chance.
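
A linear model with a single two-level grouping variable gives the same test as the classical two-sample t-test that assumes equal variances in both groups. The sketch below checks this on simulated data; the group sizes, means and standard deviations are made-up assumptions and this is not the PRB data. It calls stats::t.test() explicitly because mosaic masks t.test().

```r
set.seed(1)

# Simulated, hypothetical data: two groups with different means
sim <- data.frame(
  group = rep(c("Africa", "Not Africa"), times = c(50, 150)),
  e0    = c(rnorm(50, mean = 60, sd = 8), rnorm(150, mean = 74, sd = 6))
)

# Linear model: t statistic for the group coefficient
fit  <- lm(e0 ~ group, data = sim)
t_lm <- summary(fit)$coefficients["groupNot Africa", "t value"]

# Pooled-variance two-sample t-test (var.equal = TRUE matches lm's assumption)
tt   <- stats::t.test(e0 ~ group, data = sim, var.equal = TRUE)
t_tt <- unname(tt$statistic)

# Same magnitude; the sign flips because t.test() computes
# mean(Africa) - mean(Not Africa), while lm() reports the reverse
c(lm = t_lm, t.test = t_tt)
all.equal(abs(t_lm), abs(t_tt))
```

Note that t.test() defaults to var.equal = FALSE (the Welch test), which would generally give a slightly different t statistic and degrees of freedom than lm().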

The intercept, 59.6, is the mean e0Total for Africa. The mean for the non-African countries is 59.6 + 14.89 = 74.49, which is exactly what we saw in:

mean(e0Total~Africa, data=prbdata, na.rm=T)
##     Africa Not Africa 
##      59.60      74.49

But what about the assumptions of our model? Are the residuals normal? We can do a graphical check using a Q-Q plot. These plots compare the observed data’s quantiles to those expected from a normal distribution.

qqnorm(rstudent(test1), main="Q-Q Plot for Model Residuals")
qqline(rstudent(test1), col="red")

[Figure: normal Q-Q plot of the studentized residuals from test1]

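This looks pretty good! When the points stay close to the red reference line, the residuals are behaving roughly as a normal distribution would predict.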
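
To see what a Q-Q plot is doing under the hood, here is a minimal sketch that builds one by hand. The `res` vector is a hypothetical stand-in for model residuals (simulated standard normal draws), not the residuals from test1.

```r
set.seed(2)

# Hypothetical stand-in for model residuals: 100 draws from a standard normal
res <- rnorm(100)

# Observed quantiles: the residuals sorted from smallest to largest
observed <- sort(res)

# Expected quantiles: where those ranks would fall under a normal distribution
expected <- qnorm(ppoints(length(res)))

# Plotting one against the other gives the same picture as qqnorm()
plot(expected, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Q-Q Plot Built by Hand")
abline(a = 0, b = 1, col = "red")  # y = x line, appropriate for a standard normal

# qqnorm() returns the same coordinates (in the original data order)
qq <- qqnorm(res, plot.it = FALSE)
all.equal(sort(qq$x), expected)
```

The reference line here is simply y = x, whereas qqline() draws a line through the first and third quartiles, so the two lines will usually be close but not identical.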