Motivating example: contraceptive use data

Load the data from http://data.princeton.edu/wws509/datasets/#cuse

##     age    education wantsMore    notUsing          using      
##  <25  :4   high:8    no :8     Min.   :  8.00   Min.   : 4.00  
##  25-29:4   low :8    yes:8     1st Qu.: 31.00   1st Qu.: 9.50  
##  30-39:4                       Median : 56.50   Median :29.00  
##  40-49:4                       Mean   : 68.75   Mean   :31.69  
##                                3rd Qu.: 85.75   3rd Qu.:49.00  
##                                Max.   :212.00   Max.   :80.00

Problem 1

What is the mean fraction of women using birth control for each age group? Each education level? For women who do or don’t want more children?
- Hint: look at the “data wrangling” cheat sheet functions mutate, group_by, and summarize.

library(dplyr)
cuse2 <- mutate(cuse, fracusing = using / (using + notUsing))
cuse2 %>% group_by(age) %>% summarize(mean(fracusing))

## # A tibble: 4 x 2
##      age `mean(fracusing)`
##   <fctr>             <dbl>
## 1    <25         0.1877614
## 2  25-29         0.2714671
## 3  30-39         0.3879687
## 4  40-49         0.4694775

Here the %>% is called a “pipe”, and it sends the output of the previous function to the input of the next function. This could also have been done in one step:

mutate(cuse, fracusing = using / (using + notUsing)) %>%
  group_by(age) %>% 
  summarize(mean(fracusing))

## # A tibble: 4 x 2
##      age `mean(fracusing)`
##   <fctr>             <dbl>
## 1    <25         0.1877614
## 2  25-29         0.2714671
## 3  30-39         0.3879687
## 4  40-49         0.4694775

Here the result is not stored anywhere; if you wanted to store it in a variable called myanswer, you could have started the above command with myanswer <- or myanswer =.

Also note that the line breaks are optional: you could put this all on one line, but breaking it up makes it more readable.

Problem 2

Based on fit1, write on paper the model for expected probability of using birth control? Don’t forget the (inverse) logit function.

fit1 <- glm(cbind(using, notUsing) ~ age + education + wantsMore, 
           data=cuse, family=binomial("logit"))

I’m going to store the coefficients of fit1 in a new variable b just to save typing when writing out the formula, and round to two decimal places:

b <- round(coef(fit1), 2)

Here I’m going to take advantage of R Markdown’s support for LaTeX formulae to write out the fitted regression model. I also use the ` r 2+2 ` syntax (the result is 4) for putting results of R code inline, instead of writing out numbers of the coefficients. I never wrote the number “four”, but you’ll have to look at the .Rmd to see how I did that! Doing this in reports allows you to update them as input data changes, without having to manually re-enter or copy numbers.

$P = \textit{logit}^{-1} \left( -0.81 + 0.39 \times \textit{age25-29} + 0.91 \times \textit{age30-39} + 1.19 \times \textit{age40-49} + -0.32 \times \textit{educationlow} + -0.83 \times \textit{wantsMoreyes} \right)$

$\textit{logit}^{-1}(x) = \frac{1}{1+e^{-x}}$

Problem 3

Based on fit1, what is the expected probability of an individual 25-29 years old, with high education, who wants more children, using birth control? Calculate it manually, and using predict(fit1)

This can be done using the predict() function, using a new data.frame giving the values that you want to predict for:

invLogit <- function(x) 1/(1+exp(-x))
newdata=data.frame(age="25-29", education="high", wantsMore="yes")
invLogit( predict(fit1, newdata=newdata) )

##         1 
## 0.2223899

By default, predict() predicts on the original data, and taking a look, I noticed that the 7th value is the one we want to predict on:

invLogit( predict(fit1)[7] )

##         7 
## 0.2223899

If we were really doing this manually, on paper, we would just calculate the linear predictor, then use the inverse logit function. Note that education=“high” is the reference group, so we don’t use it here:

invLogit( b["(Intercept)"] + b["age25-29"] + b["wantsMoreyes"] )

## (Intercept) 
##   0.2227001

I don’t really care, but if the remnant “(Intercept)” name there were annoying me, I would assign this to a variable then get rid of that name:

res <- invLogit( b["(Intercept)"] + b["age25-29"] + b["wantsMoreyes"] )
names(res) <- NULL
res

## [1] 0.2227001

Problem 4

Based on fit1: Relative to women under 25 who want to have children, what is the predicted odds ratio for a woman 40-49 years old who does not want to have children will be taking birth control?

exp( coef(fit1)["age40-49"] - coef(fit1)["wantsMoreyes"])

## age40-49 
##  7.55488

Problem 5

Using a likelihood ratio test, is there evidence that a model with interactions improves on fit1 (no interactions)?

fit1.int <- glm(cbind(using, notUsing) ~ age * education * wantsMore, 
           data=cuse, family=binomial("logit"))
anova(fit1, fit1.int, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: cbind(using, notUsing) ~ age + education + wantsMore
## Model 2: cbind(using, notUsing) ~ age * education * wantsMore
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        10     29.917                          
## 2         0      0.000 10   29.917 0.0008838 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Problem 6

Which, if any, variables have the strongest interactions? I use the “stargazer” package here which is extremely flexible for showing regression results from all kinds of models, capable of comparing multiple models or multiple dependent variables in the same table, and providing formatting customized to numerous journal styles.

We see here that the strongest interaction coeffient is for age30-39:wantsMoreyes. It is statistically significant and negative (-1.28). It appears that being 30-39 years old and wanting more children has a multiplicative effect against using birth control, moreso than expected based on the age or wanting more children alone.




	Outcome: Using Birth Control
	No Interaction	With Interactions
	(1)	(2)

age25-29	0.39^** (0.04, 0.73)	0.73^* (-0.08, 1.54)
age30-39	0.91^*** (0.59, 1.23)	1.75^*** (0.99, 2.50)
age40-49	1.19^*** (0.77, 1.61)	2.56^*** (1.61, 3.51)
educationlow	-0.32^*** (-0.57, -0.08)	0.69 (-0.65, 2.04)
wantsMoreyes	-0.83^*** (-1.06, -0.60)	0.20 (-0.54, 0.95)
age25-29:educationlow		-0.46 (-2.07, 1.15)
age30-39:educationlow		-0.79 (-2.21, 0.63)
age40-49:educationlow		-1.60^** (-3.15, -0.05)
age25-29:wantsMoreyes		-0.38 (-1.30, 0.54)
age30-39:wantsMoreyes		-1.28^*** (-2.16, -0.40)
age40-49:wantsMoreyes		-1.15 (-2.55, 0.25)
educationlow:wantsMoreyes		-1.47^* (-3.08, 0.15)
age25-29:educationlow:wantsMoreyes		0.83 (-1.13, 2.79)
age30-39:educationlow:wantsMoreyes		1.29 (-0.47, 3.04)
age40-49:educationlow:wantsMoreyes		0.61 (-1.61, 2.83)
Constant	-0.81^*** (-1.12, -0.50)	-1.61^*** (-2.29, -0.93)

Observations	16	16
Log Likelihood	-50.71	-35.75
Akaike Inf. Crit.	113.43	103.51

Note:	p<0.1; p<0.05; p<0.01

Problem 7

Create a model matrix for a fit on age only, with contrasts between every pair of age groups. Between which age groups is the contrast significant?

semi-manually

First, I’ll do it manually, using the multcomp package and following an example from ?glht. I’m not going to bother anymore creating pretty tables with stargazer though…

fit = glm(cbind(using, notUsing) ~ age, data=cuse, family=binomial("logit"))
coef(fit)

## (Intercept)    age25-29    age30-39    age40-49 
##  -1.5071591   0.4606758   1.0482932   1.4246380

K <- rbind("25-29 - <25" = c(-1, 1, 0, 0),
           "30-39 - <25" = c(-1, 0, 1, 0),
           "40-49 - <25" = c(-1, 0, 0, 1),
           "30-39 - 25-29" = c(0, -1, 1, 0),
           "40-49 - 25-29" = c(0, -1, 0, 1),
           "40-49 - 30-39" = c(0, 0, -1, 1))
K

##               [,1] [,2] [,3] [,4]
## 25-29 - <25     -1    1    0    0
## 30-39 - <25     -1    0    1    0
## 40-49 - <25     -1    0    0    1
## 30-39 - 25-29    0   -1    1    0
## 40-49 - 25-29    0   -1    0    1
## 40-49 - 30-39    0    0   -1    1

fit.all.cont = multcomp::glht(fit, linfct=K)
summary(fit.all.cont)

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: glm(formula = cbind(using, notUsing) ~ age, family = binomial("logit"), 
##     data = cuse)
## 
## Linear Hypotheses:
##                    Estimate Std. Error z value Pr(>|z|)    
## 25-29 - <25 == 0     1.9678     0.2841   6.926   <0.001 ***
## 30-39 - <25 == 0     2.5555     0.2734   9.347   <0.001 ***
## 40-49 - <25 == 0     2.9318     0.2975   9.854   <0.001 ***
## 30-39 - 25-29 == 0   0.5876     0.1406   4.181   <0.001 ***
## 40-49 - 25-29 == 0   0.9640     0.1831   5.265   <0.001 ***
## 40-49 - 30-39 == 0   0.3763     0.1660   2.268   0.0981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Using Tukey contrasts

There is an easier way if you realize that these are the “Tukey” contrasts and use the contrMat function from the multcomp package:

K2 = multcomp::contrMat(1:4, type="Tukey")
K2

## 
##   Multiple Comparisons of Means: Tukey Contrasts
## 
##        1  2  3 4
## 2 - 1 -1  1  0 0
## 3 - 1 -1  0  1 0
## 4 - 1 -1  0  0 1
## 3 - 2  0 -1  1 0
## 4 - 2  0 -1  0 1
## 4 - 3  0  0 -1 1

fit.all.cont2 = multcomp::glht(fit, linfct=K2)
summary(fit.all.cont2)

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: glm(formula = cbind(using, notUsing) ~ age, family = binomial("logit"), 
##     data = cuse)
## 
## Linear Hypotheses:
##            Estimate Std. Error z value Pr(>|z|)    
## 2 - 1 == 0   1.9678     0.2841   6.926   <0.001 ***
## 3 - 1 == 0   2.5555     0.2734   9.347   <0.001 ***
## 4 - 1 == 0   2.9318     0.2975   9.854   <0.001 ***
## 3 - 2 == 0   0.5876     0.1406   4.181   <0.001 ***
## 4 - 2 == 0   0.9640     0.1831   5.265   <0.001 ***
## 4 - 3 == 0   0.3763     0.1660   2.268   0.0981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

We didn’t get the same informative coefficient names this time, but we could have just by renaming the rows in K2, for example:

rownames(K2) = rownames(K)
K2

## 
##   Multiple Comparisons of Means: Tukey Contrasts
## 
##                1  2  3 4
## 25-29 - <25   -1  1  0 0
## 30-39 - <25   -1  0  1 0
## 40-49 - <25   -1  0  0 1
## 30-39 - 25-29  0 -1  1 0
## 40-49 - 25-29  0 -1  0 1
## 40-49 - 30-39  0  0 -1 1

summary( multcomp::glht(fit, linfct=K2) )

## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: glm(formula = cbind(using, notUsing) ~ age, family = binomial("logit"), 
##     data = cuse)
## 
## Linear Hypotheses:
##                    Estimate Std. Error z value Pr(>|z|)    
## 25-29 - <25 == 0     1.9678     0.2841   6.926   <0.001 ***
## 30-39 - <25 == 0     2.5555     0.2734   9.347   <0.001 ***
## 40-49 - <25 == 0     2.9318     0.2975   9.854   <0.001 ***
## 30-39 - 25-29 == 0   0.5876     0.1406   4.181   <0.001 ***
## 40-49 - 25-29 == 0   0.9640     0.1831   5.265   <0.001 ***
## 40-49 - 30-39 == 0   0.3763     0.1660   2.268   0.0983 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

Other contrast schemes

There are plenty of canned contrast schemes provided by multcomp::contrMat. Note use of the example() function to run all examples from the contrmat() function. This is a great example of how there are many ways to analyze one particular experimental design, but that you need to know design matrices to utilize many of them.

In these examples, the first line n <- c(10,20,30,40) just signifies four levels, n = 1:4 would do exactly the same thing.

library(multcomp)
example("contrMat")

## 
## cntrMt>  n <- c(10,20,30,40)
## 
## cntrMt>  names(n) <- paste("group", 1:4, sep="")
## 
## cntrMt>  contrMat(n) # Dunnett is default
## 
##   Multiple Comparisons of Means: Dunnett Contrasts
## 
##                 group1 group2 group3 group4
## group2 - group1     -1      1      0      0
## group3 - group1     -1      0      1      0
## group4 - group1     -1      0      0      1
## 
## cntrMt>  contrMat(n, base = 2)   # use second level as baseline
## 
##   Multiple Comparisons of Means: Dunnett Contrasts
## 
##                 group1 group2 group3 group4
## group1 - group2      1     -1      0      0
## group3 - group2      0     -1      1      0
## group4 - group2      0     -1      0      1
## 
## cntrMt>  contrMat(n, type = "Tukey")
## 
##   Multiple Comparisons of Means: Tukey Contrasts
## 
##                 group1 group2 group3 group4
## group2 - group1     -1      1      0      0
## group3 - group1     -1      0      1      0
## group4 - group1     -1      0      0      1
## group3 - group2      0     -1      1      0
## group4 - group2      0     -1      0      1
## group4 - group3      0      0     -1      1
## 
## cntrMt>  contrMat(n, type = "Sequen")
## 
##   Multiple Comparisons of Means: Sequen Contrasts
## 
##                 group1 group2 group3 group4
## group2 - group1     -1      1      0      0
## group3 - group2      0     -1      1      0
## group4 - group3      0      0     -1      1
## 
## cntrMt>  contrMat(n, type = "AVE")
## 
##   Multiple Comparisons of Means: AVE Contrasts
## 
##      group1  group2  group3  group4
## C 1  1.0000 -0.2222 -0.3333 -0.4444
## C 2 -0.1250  1.0000 -0.3750 -0.5000
## C 3 -0.1429 -0.2857  1.0000 -0.5714
## C 4 -0.1667 -0.3333 -0.5000  1.0000
## 
## cntrMt>  contrMat(n, type = "Changepoint")
## 
##   Multiple Comparisons of Means: Changepoint Contrasts
## 
##      group1  group2  group3 group4
## C 1 -1.0000  0.2222  0.3333 0.4444
## C 2 -0.3333 -0.6667  0.4286 0.5714
## C 3 -0.1667 -0.3333 -0.5000 1.0000
## 
## cntrMt>  contrMat(n, type = "Williams")
## 
##   Multiple Comparisons of Means: Williams Contrasts
## 
##     group1 group2 group3 group4
## C 1     -1 0.0000 0.0000 1.0000
## C 2     -1 0.0000 0.4286 0.5714
## C 3     -1 0.2222 0.3333 0.4444
## 
## cntrMt>  contrMat(n, type = "Marcus")
## 
##   Multiple Comparisons of Means: Marcus Contrasts
## 
##      group1  group2  group3 group4
## C 1 -1.0000  0.2222  0.3333 0.4444
## C 2 -1.0000  0.0000  0.4286 0.5714
## C 3 -0.3333 -0.6667  0.4286 0.5714
## C 4 -1.0000  0.0000  0.0000 1.0000
## C 5 -0.3333 -0.6667  0.0000 1.0000
## C 6 -0.1667 -0.3333 -0.5000 1.0000
## 
## cntrMt>  contrMat(n, type = "McDermott")
## 
##   Multiple Comparisons of Means: McDermott Contrasts
## 
##      group1  group2 group3 group4
## C 1 -1.0000  1.0000    0.0      0
## C 2 -0.3333 -0.6667    1.0      0
## C 3 -0.1667 -0.3333   -0.5      1
## 
## cntrMt>  ### Umbrella-protected Williams contrasts, i.e. a sequence of 
## cntrMt>  ### Williams-type contrasts with groups of higher order 
## cntrMt>  ### stepwise omitted
## cntrMt>  contrMat(n, type = "UmbrellaWilliams")
## 
##   Multiple Comparisons of Means: UmbrellaWilliams Contrasts
## 
##     group1 group2 group3 group4
## C 1     -1 0.0000 0.0000 1.0000
## C 2     -1 0.0000 0.4286 0.5714
## C 3     -1 0.2222 0.3333 0.4444
## C 4     -1 0.0000 1.0000 0.0000
## C 5     -1 0.4000 0.6000 0.0000
## C 6     -1 1.0000 0.0000 0.0000
## 
## cntrMt>  ### comparison of each group with grand mean of all groups
## cntrMt>  contrMat(n, type = "GrandMean")
## 
##   Multiple Comparisons of Means: GrandMean Contrasts
## 
##     group1 group2 group3 group4
## C 1    0.9   -0.2   -0.3   -0.4
## C 2   -0.1    0.8   -0.3   -0.4
## C 3   -0.1   -0.2    0.7   -0.4
## C 4   -0.1   -0.2   -0.3    0.6

BIOS621 sessions 2-3 lab exercises

Levi Waldron