R Markdown

In this vignette, I’m going to talk about problems caused by overlooking the default settings for how the glm model handles missing values. While this vignette uses the glm model to illustrate a point, the learnings can be applied in general.

Specifically, I came across an error message contrasts can be applied only to factors with 2 or more levels that I couldn’t get my head around for a while. The goal of this vignette is to save you some time in investigating the cause of the error message, tl;dr check the missing data that is used to fit the model.

This vignette uses the titanic data.

In this vignette you will learn about:

Loading the packages, getting and setting the data

library(tidyverse)
library(magrittr)
library(purrr)

titanicData <- read_csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv")
glimpse(titanicData)
## Observations: 1,309
## Variables: 14
## $ pclass    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ survived  <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1...
## $ name      <chr> "Allen, Miss. Elisabeth Walton", "Allison, Master. H...
## $ sex       <chr> "female", "male", "female", "male", "female", "male"...
## $ age       <dbl> 29.00, 0.92, 2.00, 30.00, 25.00, 48.00, 63.00, 39.00...
## $ sibsp     <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0...
## $ parch     <dbl> 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1...
## $ ticket    <chr> "24160", "113781", "113781", "113781", "113781", "19...
## $ fare      <dbl> 211.3375, 151.5500, 151.5500, 151.5500, 151.5500, 26...
## $ cabin     <chr> "B5", "C22 C26", "C22 C26", "C22 C26", "C22 C26", "E...
## $ embarked  <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S", "C", "C...
## $ boat      <chr> "2", "11", NA, NA, NA, "3", "10", NA, "D", NA, NA, "...
## $ body      <dbl> NA, NA, NA, 135, NA, NA, NA, NA, NA, 22, 124, NA, NA...
## $ home.dest <chr> "St Louis, MO", "Montreal, PQ / Chesterville, ON", "...

Fitting the glm model

I want to fit a logistic regression model using three independent variables body, cabin, and boat and one dependent variable survived.

glm(survived ~ body + cabin + boat, family = binomial(logit), data = titanicData)
## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels

Notice the error message contrasts can be applied only to factors with 2 or more levels. However, at a glimpse (pun intended) above, it was clear that there are more than two levels for each of the independent variables in question. Just to be sure, let’s check how many unique values are there in body, cabin, and boat.

titanicData %>% 
  select(body, cabin, boat) %>% 
  map(compose(length, unique))
## $body
## [1] 122
## 
## $cabin
## [1] 187
## 
## $boat
## [1] 28

1

Clearly, there are more than two levels for each dependent variable. Fitting a glm model using each of the variable individually also doesn’t produce any error messages.

glm(survived ~ body, family = binomial(logit), data = titanicData)
glm(survived ~ cabin, family = binomial(logit), data = titanicData)
glm(survived ~ boat, family = binomial(logit), data = titanicData)

Note that the results have been hidden because they are very long and not pertinent. The point is that there are no error messages.

Check the missing values

To be honest, one of the first things that should have been done is handling the missing data. From the first glimpse output, it’s clear that there are some missing data in the form of NAs. However, none of the code above explicitly tell glm how to handle missing data. In such cases, glm looks at the R global default setting for na.action, which we can get by getOption("na.action").

getOption("na.action")
## [1] "na.omit"

na.omit means that the observation that contains an NA in any of the columns will be omitted. Let’s see the data.frame after all the NA values have been omitted, because this is exactly what the glm will use to fit the model.

titanicData %>% 
  select(body, cabin, boat) %>%
  na.omit
## # A tibble: 0 x 3
## # ... with 3 variables: body <dbl>, cabin <chr>, boat <chr>

Well well well… it turns out that there is no observation that doesn’t contain NA in any of body, cabin or boat, hence the error.

Visualising The Missing Data

One particular package that is super useful in visualising missing data is Amelia and the missmap function.

library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(titanicData)

It’s easy to see that the body, cabin and boat variables have a high percentage of missing data. This should give you an indication that there might be little overlap in the non-missing data for this combination of these variables.

Conclusion

Sometimes is not so clear why the error messages say the things they say. Asking some simple questions about your data and function, such as “Are there missing data?” and “How does this function handle missing data?” could save you a lot of time in “debugging” your code. Hopefully, this example has inspired you to be more aware of the types of issues that could happen in your data analysis and that it has added another item to your checklist.


  1. For a brief explanation of the map and compose functions see purrring in r