The following code illustrates how I approached the models required by Lab 2. There are many ways to complete this lab, the code below illustrates one possible approach.
# Load Data
gard <- read.table("/Users/seth/Dropbox/GEOG 5023/GEOG 5023 - Spring 2013/Labs/Lab 2/jh_gardasil.dat",
header = T)
Examine completion by location:
# completion rates by location
locs <- table(gard$Completed, gard$Location)
locs
##
## 1 2 3 4
## 0 523 85 70 266
## 1 275 80 19 95
These numbers are useful but it would be better to have the completion rate:
locs[2, ]/colSums(locs)
## 1 2 3 4
## 0.34461 0.48485 0.21348 0.26316
The completion rates code above sums the columns of the locs table this gives us the total number of people treated in each location. The total number of people completing the treatment is given in the second row of the locs table, locs[2,] gives us the second row of the table (via the bracket/slice notation discussed in lab). We then calculate \( \frac{total\ completed}{total\ treated} \). Repeat this process for race:
race <- table(gard$Completed, gard$Race)
race[2, ]/colSums(race)
## 0 1 2 3
## 0.38251 0.23702 0.32692 0.36022
These tables are difficult to read because they use numeric codes to represent each location and race. We can convert these columns to factors, data structure for categorical variables. We've discussed factors before, though not in detail.
Factors have levels, if a variable is a factor each observation has to belong to one level. Levels can have text descriptions called labels . R is smart about these text descriptions and while they can be a pain to create they greatly simplify analysis with categorical variables. Below I will convert all of the categorical variables in the Gardasil table to factors, encoding their levels and labels. The order of the levels and the labels matter, labels are assigned to levels in the order they appear.
gard$MedAssist <- factor(gard$MedAssist, levels = c(0, 1), labels = c("NO_ASST",
"ASST"))
gard$Location <- factor(gard$Location, levels = c(1, 2, 3, 4), labels = c("Odenton",
"White Marsh", "Johns Hopkins", "Bayview"))
gard$LocationType <- factor(gard$LocationType, levels = c(0, 1), labels = c("Suburban",
"Urban"))
gard$Completed <- factor(gard$Completed, levels = c(0, 1), labels = c("No",
"Yes"))
gard$Race <- factor(gard$Race, levels = c(0, 1, 2, 3), labels = c("white", "black",
"hispanic", "other"))
Repeat the code above for completion rates by race. You'll notice that this makes output easier to read:
race <- table(gard$Completed, gard$Race)
race[2, ]/colSums(race)
## white black hispanic other
## 0.38251 0.23702 0.32692 0.36022
Once nice feature of the table object is that is has some convenient plotting and summary behavior:
plot(race)
You can do a \( \chi^2 \) test on the table using summary:
summary(race)
## Number of cases in table: 1413
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 27.1, df = 3, p-value = 0.0000056
The test indicates that there are significant difference in completion rates across the race categories. It doesn't tell us anything about the magnitude of the difference. For this purpose we'll calculate odds ratios using a logistic model.
We can fit a model examining how the age group variable relates to completion rates. We did not turn this variable into a factor.
lmAG <- glm(gard$Completed ~ gard$AgeGroup, family = binomial)
exp(cbind(OR = coef(lmAG), confint(lmAG)))
## OR 2.5 % 97.5 %
## (Intercept) 0.54405 0.46536 0.63454
## gard$AgeGroup 0.83275 0.66695 1.03929
To interpret this output we'd have to see which age category is assigned a zero. In any event it seems that the age category does not have an effect on the odds of completing the Gardasil regiment (the odds ratio straddles 1).
Fitting a model for race:
# Model for Race
lmRace <- glm(gard$Completed ~ gard$Race, family = binomial)
exp(cbind(OR = coef(lmRace), confint(lmRace)))
## OR 2.5 % 97.5 %
## (Intercept) 0.61947 0.53318 0.71847
## gard$Raceblack 0.50148 0.38370 0.65200
## gard$Racehispanic 0.78408 0.42157 1.40587
## gard$Raceother 0.90888 0.64818 1.26634
Here the odds ratios are easier to read because they are labeled with the appropriate factor labels. Notice that the effect for race==black seems to be negative but the other groups straddle 1. This tells us that the odds of african americans completing the regiment are lower than the reference group. In this case the reference group was white. The summary table provides further insight into the nature of racial patterns in completion rates:
summary(lmRace)
##
## Call:
## glm(formula = gard$Completed ~ gard$Race, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.982 -0.982 -0.736 1.386 1.697
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4789 0.0761 -6.30 0.0000000003 ***
## gard$Raceblack -0.6902 0.1352 -5.11 0.0000003277 ***
## gard$Racehispanic -0.2432 0.3053 -0.80 0.43
## gard$Raceother -0.0955 0.1706 -0.56 0.58
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1796 on 1412 degrees of freedom
## Residual deviance: 1768 on 1409 degrees of freedom
## AIC: 1776
##
## Number of Fisher Scoring iterations: 4
Now fit a model for location.
lmLoc <- glm(gard$Completed ~ gard$Location, family = binomial)
exp(cbind(OR = coef(lmLoc), confint(lmLoc)))
## OR 2.5 % 97.5 %
## (Intercept) 0.52581 0.45386 0.60783
## gard$LocationWhite Marsh 1.78995 1.27525 2.51147
## gard$LocationJohns Hopkins 0.51621 0.29687 0.85766
## gard$LocationBayview 0.67922 0.51382 0.89282
This output is confusing. Which category is the reference level?
The reference level was Odenton because it was the first in the list of locations. It would be better to choose a reference level that facilitated meaningful comparisons. Let's change the reference level to White Marsh because it had the best completion rates. All of our odds ratios should be less than 1 as a result…
gard$Location <- relevel(gard$Location, ref = "White Marsh")
The relevel function works on factors, it allows us to directly assign a reference level.
Now let's rerun the model:
lmLoc <- glm(gard$Completed ~ gard$Location, family = binomial)
exp(cbind(OR = coef(lmLoc), confint(lmLoc)))
## OR 2.5 % 97.5 %
## (Intercept) 0.94118 0.69281 1.27737
## gard$LocationOdenton 0.55868 0.39817 0.78416
## gard$LocationJohns Hopkins 0.28839 0.15633 0.51310
## gard$LocationBayview 0.37946 0.25781 0.55706
Notice the change in the odds ratios.
Location is an odd variable because it is coded two ways, one variable holds the facility name and the other records if it is urban or suburban.
lmLocT <- glm(gard$Completed ~ gard$LocationType, family = binomial)
exp(cbind(OR = coef(lmLocT), confint(lmLocT)))
## OR 2.5 % 97.5 %
## (Intercept) 0.58388 0.51182 0.66508
## gard$LocationTypeUrban 0.58109 0.45156 0.74406
Seems patients in urban locations are less likely to complete the treatment than suburban patients.
We have two ways of describing locations. We can compare them:
summary(lmLocT)
##
## Call:
## glm(formula = gard$Completed ~ gard$LocationType, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.959 -0.959 -0.764 1.413 1.657
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.5381 0.0668 -8.06 7.9e-16 ***
## gard$LocationTypeUrban -0.5429 0.1273 -4.26 2.0e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1796.0 on 1412 degrees of freedom
## Residual deviance: 1777.1 on 1411 degrees of freedom
## AIC: 1781
##
## Number of Fisher Scoring iterations: 4
summary(lmLoc)
##
## Call:
## glm(formula = gard$Completed ~ gard$Location, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.152 -0.919 -0.781 1.460 1.757
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0606 0.1558 -0.39 0.69714
## gard$LocationOdenton -0.5822 0.1727 -3.37 0.00075 ***
## gard$LocationJohns Hopkins -1.2434 0.3020 -4.12 0.0000382 ***
## gard$LocationBayview -0.9690 0.1963 -4.94 0.0000008 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1796.0 on 1412 degrees of freedom
## Residual deviance: 1764.9 on 1409 degrees of freedom
## AIC: 1773
##
## Number of Fisher Scoring iterations: 4
Based on the AIC it seems facility name give a better model fit.
We could also compare a full model that included both location variables to a single variable model.
lmTwoLoc <- glm(gard$Completed ~ gard$LocationType + gard$Location, family = binomial)
### drop in deviance
anova(lmTwoLoc, lmLoc, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: gard$Completed ~ gard$LocationType + gard$Location
## Model 2: gard$Completed ~ gard$Location
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1409 1765
## 2 1409 1765 0 4.55e-13
There is significant difference between models.
Now we have to make some decisions about which variables to include in our final model. We have seen that both location and race have an effect on completion rates. For my final model I have decided to examine race and location in combination. Do patients of different races at the same location have different odds of completion?
As we've discussed in lab coding categorical variables as factors makes dummy variables easy to manage. Here we make a dummy variable with 16 levels that describes race by treatment location.
gard$RaceLoc <- gard$Race:gard$Location
summary(gard$RaceLoc)
## white:White Marsh white:Odenton white:Johns Hopkins
## 132 402 26
## white:Bayview black:White Marsh black:Odenton
## 172 25 220
## black:Johns Hopkins black:Bayview hispanic:White Marsh
## 57 141 6
## hispanic:Odenton hispanic:Johns Hopkins hispanic:Bayview
## 15 3 28
## other:White Marsh other:Odenton other:Johns Hopkins
## 2 161 3
## other:Bayview
## 20
We have seen that white patients and patients in White Marsh have the highest completion rates. Lets make the reference level white patients in White Marsh, it seems likely that these patients will have the highest completion rates. We expect all of our odds ratios to be less than 1.
gard$RaceLoc <- relevel(gard$RaceLoc, ref = "white:White Marsh")
Now lets fit the mega model with 16 coefficients.
lmRaceLoc <- glm(Completed ~ RaceLoc, family = binomial, gard)
exp(cbind(OR = coef(lmRaceLoc), confint(lmRaceLoc)))
## OR 2.5 % 97.5 %
## (Intercept) 1.03077 7.3238e-01 1.45159
## RaceLocwhite:Odenton 0.56528 3.7963e-01 0.84057
## RaceLocwhite:Johns Hopkins 0.23099 7.3633e-02 0.60569
## RaceLocwhite:Bayview 0.51972 3.2600e-01 0.82456
## RaceLocblack:White Marsh 0.54571 2.1708e-01 1.29926
## RaceLocblack:Odenton 0.38930 2.4758e-01 0.60878
## RaceLocblack:Johns Hopkins 0.23199 1.0611e-01 0.47244
## RaceLocblack:Bayview 0.17936 9.9837e-02 0.31243
## RaceLochispanic:White Marsh 0.48507 6.5629e-02 2.57313
## RaceLochispanic:Odenton 0.64677 2.0653e-01 1.89529
## RaceLochispanic:Johns Hopkins 1.94030 1.8163e-01 42.34872
## RaceLochispanic:Bayview 0.32338 1.2047e-01 0.77919
## RaceLocother:White Marsh 755973.15395 1.1806e-20 NA
## RaceLocother:Odenton 0.54630 3.4075e-01 0.87158
## RaceLocother:Johns Hopkins 0.48507 2.2225e-02 5.18172
## RaceLocother:Bayview 0.41578 1.4003e-01 1.10411
summary(lmRaceLoc)
##
## Call:
## glm(formula = Completed ~ RaceLoc, family = binomial, data = gard)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.482 -0.945 -0.821 1.414 1.927
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.0303 0.1741 0.17 0.86181
## RaceLocwhite:Odenton -0.5704 0.2025 -2.82 0.00485 **
## RaceLocwhite:Johns Hopkins -1.4654 0.5272 -2.78 0.00544 **
## RaceLocwhite:Bayview -0.6545 0.2364 -2.77 0.00564 **
## RaceLocblack:White Marsh -0.6057 0.4516 -1.34 0.17984
## RaceLocblack:Odenton -0.9434 0.2292 -4.12 0.0000386588 ***
## RaceLocblack:Johns Hopkins -1.4611 0.3781 -3.86 0.00011 ***
## RaceLocblack:Bayview -1.7184 0.2901 -5.92 0.0000000032 ***
## RaceLochispanic:White Marsh -0.7235 0.8834 -0.82 0.41279
## RaceLochispanic:Odenton -0.4358 0.5551 -0.79 0.43240
## RaceLochispanic:Johns Hopkins 0.6628 1.2371 0.54 0.59208
## RaceLochispanic:Bayview -1.1289 0.4699 -2.40 0.01628 *
## RaceLocother:White Marsh 13.5358 378.5929 0.04 0.97148
## RaceLocother:Odenton -0.6046 0.2393 -2.53 0.01152 *
## RaceLocother:Johns Hopkins -0.7235 1.2371 -0.58 0.55867
## RaceLocother:Bayview -0.8776 0.5181 -1.69 0.09027 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1796.0 on 1412 degrees of freedom
## Residual deviance: 1735.9 on 1397 degrees of freedom
## AIC: 1768
##
## Number of Fisher Scoring iterations: 12
It seems that at the White March facility all races had equal odds of completion.