Seth's Lab 2

The following code illustrates how I approached the models required by Lab 2. There are many ways to complete this lab, the code below illustrates one possible approach.

# Load Data
gard <- read.table("/Users/seth/Dropbox/GEOG 5023/GEOG 5023 - Spring 2013/Labs/Lab 2/jh_gardasil.dat", 
    header = T)

Examine completion by location:

# completion rates by location
locs <- table(gard$Completed, gard$Location)
locs
##    
##       1   2   3   4
##   0 523  85  70 266
##   1 275  80  19  95

These numbers are useful but it would be better to have the completion rate:

locs[2, ]/colSums(locs)
##       1       2       3       4 
## 0.34461 0.48485 0.21348 0.26316

The completion rates code above sums the columns of the locs table this gives us the total number of people treated in each location. The total number of people completing the treatment is given in the second row of the locs table, locs[2,] gives us the second row of the table (via the bracket/slice notation discussed in lab). We then calculate \( \frac{total\ completed}{total\ treated} \). Repeat this process for race:

race <- table(gard$Completed, gard$Race)
race[2, ]/colSums(race)
##       0       1       2       3 
## 0.38251 0.23702 0.32692 0.36022

These tables are difficult to read because they use numeric codes to represent each location and race. We can convert these columns to factors, data structure for categorical variables. We've discussed factors before, though not in detail.

Factors have levels, if a variable is a factor each observation has to belong to one level. Levels can have text descriptions called labels . R is smart about these text descriptions and while they can be a pain to create they greatly simplify analysis with categorical variables. Below I will convert all of the categorical variables in the Gardasil table to factors, encoding their levels and labels. The order of the levels and the labels matter, labels are assigned to levels in the order they appear.

gard$MedAssist <- factor(gard$MedAssist, levels = c(0, 1), labels = c("NO_ASST", 
    "ASST"))

gard$Location <- factor(gard$Location, levels = c(1, 2, 3, 4), labels = c("Odenton", 
    "White Marsh", "Johns Hopkins", "Bayview"))

gard$LocationType <- factor(gard$LocationType, levels = c(0, 1), labels = c("Suburban", 
    "Urban"))

gard$Completed <- factor(gard$Completed, levels = c(0, 1), labels = c("No", 
    "Yes"))

gard$Race <- factor(gard$Race, levels = c(0, 1, 2, 3), labels = c("white", "black", 
    "hispanic", "other"))

Repeat the code above for completion rates by race. You'll notice that this makes output easier to read:

race <- table(gard$Completed, gard$Race)
race[2, ]/colSums(race)
##    white    black hispanic    other 
##  0.38251  0.23702  0.32692  0.36022

Once nice feature of the table object is that is has some convenient plotting and summary behavior:

plot(race)

plot of chunk unnamed-chunk-8

You can do a \( \chi^2 \) test on the table using summary:

summary(race)
## Number of cases in table: 1413 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 27.1, df = 3, p-value = 0.0000056

The test indicates that there are significant difference in completion rates across the race categories. It doesn't tell us anything about the magnitude of the difference. For this purpose we'll calculate odds ratios using a logistic model.

Fitting Models

We can fit a model examining how the age group variable relates to completion rates. We did not turn this variable into a factor.

lmAG <- glm(gard$Completed ~ gard$AgeGroup, family = binomial)
exp(cbind(OR = coef(lmAG), confint(lmAG)))
##                    OR   2.5 %  97.5 %
## (Intercept)   0.54405 0.46536 0.63454
## gard$AgeGroup 0.83275 0.66695 1.03929

To interpret this output we'd have to see which age category is assigned a zero. In any event it seems that the age category does not have an effect on the odds of completing the Gardasil regiment (the odds ratio straddles 1).

Fitting a model for race:

# Model for Race
lmRace <- glm(gard$Completed ~ gard$Race, family = binomial)
exp(cbind(OR = coef(lmRace), confint(lmRace)))
##                        OR   2.5 %  97.5 %
## (Intercept)       0.61947 0.53318 0.71847
## gard$Raceblack    0.50148 0.38370 0.65200
## gard$Racehispanic 0.78408 0.42157 1.40587
## gard$Raceother    0.90888 0.64818 1.26634

Here the odds ratios are easier to read because they are labeled with the appropriate factor labels. Notice that the effect for race==black seems to be negative but the other groups straddle 1. This tells us that the odds of african americans completing the regiment are lower than the reference group. In this case the reference group was white. The summary table provides further insight into the nature of racial patterns in completion rates:

summary(lmRace)
## 
## Call:
## glm(formula = gard$Completed ~ gard$Race, family = binomial)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.982  -0.982  -0.736   1.386   1.697  
## 
## Coefficients:
##                   Estimate Std. Error z value     Pr(>|z|)    
## (Intercept)        -0.4789     0.0761   -6.30 0.0000000003 ***
## gard$Raceblack     -0.6902     0.1352   -5.11 0.0000003277 ***
## gard$Racehispanic  -0.2432     0.3053   -0.80         0.43    
## gard$Raceother     -0.0955     0.1706   -0.56         0.58    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1796  on 1412  degrees of freedom
## Residual deviance: 1768  on 1409  degrees of freedom
## AIC: 1776
## 
## Number of Fisher Scoring iterations: 4

Now fit a model for location.

lmLoc <- glm(gard$Completed ~ gard$Location, family = binomial)
exp(cbind(OR = coef(lmLoc), confint(lmLoc)))
##                                 OR   2.5 %  97.5 %
## (Intercept)                0.52581 0.45386 0.60783
## gard$LocationWhite Marsh   1.78995 1.27525 2.51147
## gard$LocationJohns Hopkins 0.51621 0.29687 0.85766
## gard$LocationBayview       0.67922 0.51382 0.89282

This output is confusing. Which category is the reference level?

The reference level was Odenton because it was the first in the list of locations. It would be better to choose a reference level that facilitated meaningful comparisons. Let's change the reference level to White Marsh because it had the best completion rates. All of our odds ratios should be less than 1 as a result…

gard$Location <- relevel(gard$Location, ref = "White Marsh")

The relevel function works on factors, it allows us to directly assign a reference level.

Now let's rerun the model:

lmLoc <- glm(gard$Completed ~ gard$Location, family = binomial)
exp(cbind(OR = coef(lmLoc), confint(lmLoc)))
##                                 OR   2.5 %  97.5 %
## (Intercept)                0.94118 0.69281 1.27737
## gard$LocationOdenton       0.55868 0.39817 0.78416
## gard$LocationJohns Hopkins 0.28839 0.15633 0.51310
## gard$LocationBayview       0.37946 0.25781 0.55706

Notice the change in the odds ratios.

Location is an odd variable because it is coded two ways, one variable holds the facility name and the other records if it is urban or suburban.

lmLocT <- glm(gard$Completed ~ gard$LocationType, family = binomial)
exp(cbind(OR = coef(lmLocT), confint(lmLocT)))
##                             OR   2.5 %  97.5 %
## (Intercept)            0.58388 0.51182 0.66508
## gard$LocationTypeUrban 0.58109 0.45156 0.74406

Seems patients in urban locations are less likely to complete the treatment than suburban patients.

We have two ways of describing locations. We can compare them:

summary(lmLocT)
## 
## Call:
## glm(formula = gard$Completed ~ gard$LocationType, family = binomial)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.959  -0.959  -0.764   1.413   1.657  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -0.5381     0.0668   -8.06  7.9e-16 ***
## gard$LocationTypeUrban  -0.5429     0.1273   -4.26  2.0e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1796.0  on 1412  degrees of freedom
## Residual deviance: 1777.1  on 1411  degrees of freedom
## AIC: 1781
## 
## Number of Fisher Scoring iterations: 4
summary(lmLoc)
## 
## Call:
## glm(formula = gard$Completed ~ gard$Location, family = binomial)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.152  -0.919  -0.781   1.460   1.757  
## 
## Coefficients:
##                            Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)                 -0.0606     0.1558   -0.39   0.69714    
## gard$LocationOdenton        -0.5822     0.1727   -3.37   0.00075 ***
## gard$LocationJohns Hopkins  -1.2434     0.3020   -4.12 0.0000382 ***
## gard$LocationBayview        -0.9690     0.1963   -4.94 0.0000008 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1796.0  on 1412  degrees of freedom
## Residual deviance: 1764.9  on 1409  degrees of freedom
## AIC: 1773
## 
## Number of Fisher Scoring iterations: 4

Based on the AIC it seems facility name give a better model fit.

We could also compare a full model that included both location variables to a single variable model.

lmTwoLoc <- glm(gard$Completed ~ gard$LocationType + gard$Location, family = binomial)
### drop in deviance
anova(lmTwoLoc, lmLoc, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: gard$Completed ~ gard$LocationType + gard$Location
## Model 2: gard$Completed ~ gard$Location
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      1409       1765                     
## 2      1409       1765  0 4.55e-13

There is significant difference between models.

Now we have to make some decisions about which variables to include in our final model. We have seen that both location and race have an effect on completion rates. For my final model I have decided to examine race and location in combination. Do patients of different races at the same location have different odds of completion?

As we've discussed in lab coding categorical variables as factors makes dummy variables easy to manage. Here we make a dummy variable with 16 levels that describes race by treatment location.

gard$RaceLoc <- gard$Race:gard$Location
summary(gard$RaceLoc)
##      white:White Marsh          white:Odenton    white:Johns Hopkins 
##                    132                    402                     26 
##          white:Bayview      black:White Marsh          black:Odenton 
##                    172                     25                    220 
##    black:Johns Hopkins          black:Bayview   hispanic:White Marsh 
##                     57                    141                      6 
##       hispanic:Odenton hispanic:Johns Hopkins       hispanic:Bayview 
##                     15                      3                     28 
##      other:White Marsh          other:Odenton    other:Johns Hopkins 
##                      2                    161                      3 
##          other:Bayview 
##                     20

We have seen that white patients and patients in White Marsh have the highest completion rates. Lets make the reference level white patients in White Marsh, it seems likely that these patients will have the highest completion rates. We expect all of our odds ratios to be less than 1.

gard$RaceLoc <- relevel(gard$RaceLoc, ref = "white:White Marsh")

Now lets fit the mega model with 16 coefficients.

lmRaceLoc <- glm(Completed ~ RaceLoc, family = binomial, gard)
exp(cbind(OR = coef(lmRaceLoc), confint(lmRaceLoc)))
##                                         OR      2.5 %   97.5 %
## (Intercept)                        1.03077 7.3238e-01  1.45159
## RaceLocwhite:Odenton               0.56528 3.7963e-01  0.84057
## RaceLocwhite:Johns Hopkins         0.23099 7.3633e-02  0.60569
## RaceLocwhite:Bayview               0.51972 3.2600e-01  0.82456
## RaceLocblack:White Marsh           0.54571 2.1708e-01  1.29926
## RaceLocblack:Odenton               0.38930 2.4758e-01  0.60878
## RaceLocblack:Johns Hopkins         0.23199 1.0611e-01  0.47244
## RaceLocblack:Bayview               0.17936 9.9837e-02  0.31243
## RaceLochispanic:White Marsh        0.48507 6.5629e-02  2.57313
## RaceLochispanic:Odenton            0.64677 2.0653e-01  1.89529
## RaceLochispanic:Johns Hopkins      1.94030 1.8163e-01 42.34872
## RaceLochispanic:Bayview            0.32338 1.2047e-01  0.77919
## RaceLocother:White Marsh      755973.15395 1.1806e-20       NA
## RaceLocother:Odenton               0.54630 3.4075e-01  0.87158
## RaceLocother:Johns Hopkins         0.48507 2.2225e-02  5.18172
## RaceLocother:Bayview               0.41578 1.4003e-01  1.10411
summary(lmRaceLoc)
## 
## Call:
## glm(formula = Completed ~ RaceLoc, family = binomial, data = gard)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.482  -0.945  -0.821   1.414   1.927  
## 
## Coefficients:
##                               Estimate Std. Error z value     Pr(>|z|)    
## (Intercept)                     0.0303     0.1741    0.17      0.86181    
## RaceLocwhite:Odenton           -0.5704     0.2025   -2.82      0.00485 ** 
## RaceLocwhite:Johns Hopkins     -1.4654     0.5272   -2.78      0.00544 ** 
## RaceLocwhite:Bayview           -0.6545     0.2364   -2.77      0.00564 ** 
## RaceLocblack:White Marsh       -0.6057     0.4516   -1.34      0.17984    
## RaceLocblack:Odenton           -0.9434     0.2292   -4.12 0.0000386588 ***
## RaceLocblack:Johns Hopkins     -1.4611     0.3781   -3.86      0.00011 ***
## RaceLocblack:Bayview           -1.7184     0.2901   -5.92 0.0000000032 ***
## RaceLochispanic:White Marsh    -0.7235     0.8834   -0.82      0.41279    
## RaceLochispanic:Odenton        -0.4358     0.5551   -0.79      0.43240    
## RaceLochispanic:Johns Hopkins   0.6628     1.2371    0.54      0.59208    
## RaceLochispanic:Bayview        -1.1289     0.4699   -2.40      0.01628 *  
## RaceLocother:White Marsh       13.5358   378.5929    0.04      0.97148    
## RaceLocother:Odenton           -0.6046     0.2393   -2.53      0.01152 *  
## RaceLocother:Johns Hopkins     -0.7235     1.2371   -0.58      0.55867    
## RaceLocother:Bayview           -0.8776     0.5181   -1.69      0.09027 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1796.0  on 1412  degrees of freedom
## Residual deviance: 1735.9  on 1397  degrees of freedom
## AIC: 1768
## 
## Number of Fisher Scoring iterations: 12

It seems that at the White March facility all races had equal odds of completion.