Imagine we had a simple logistic regression model with \(\Pr(Y=1)=g(-1 + x)\), where \(g(z)=\frac{1}{1+e^{-z}}\) is the logistic function. On top of this we add a probability of mislabeling: e.g., 0%, 1%, 3%, 10%, or 30% of the class labels are flipped. For each of these rates, every generated label is flipped independently with that probability. (See here for more on this model).
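
Writing the flip probability as \(\gamma\), the observed label \(Y\) then satisfies

\[
\Pr(Y = 1 \mid x) = (1 - \gamma)\,g(-1 + x) + \gamma\,\bigl(1 - g(-1 + x)\bigr),
\]

which compresses the observed class probabilities into \([\gamma, 1 - \gamma]\) and flattens the apparent relationship between \(x\) and the label.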

We simulate 25 replications at each mislabel rate, with 1000 observations per replication.

library(tidyverse)

set.seed(2016)  # set the seed before simulating so the data are reproducible

true_terms <- tibble(term = c("(Intercept)", "x"),
                     truth = c(-1, 1))

setup_data <- crossing(mislabel_rate = c(0, .01, .03, .1, .3),
                       replication = 1:25,
                       observation = 1:1000) %>%
  mutate(x = rnorm(n()),
         probability = plogis(-1 + x),
         mislabel = rbinom(n(), 1, mislabel_rate),
         true_class = rbinom(n(), 1, probability),
         class = ifelse(mislabel == 1, 1 - true_class, true_class))

setup_data
## # A tibble: 125,000 × 8
##    mislabel_rate replication observation           x probability mislabel
##            <dbl>       <int>       <int>       <dbl>       <dbl>    <int>
## 1              0           1           1  0.13038949   0.2953354        0
## 2              0           1           2 -0.22722207   0.2266680        0
## 3              0           1           3  0.07596937   0.2841373        0
## 4              0           1           4 -0.45949866   0.1885440        0
## 5              0           1           5 -0.64105034   0.1623222        0
## 6              0           1           6  0.41897078   0.3586958        0
## 7              0           1           7 -0.60672722   0.1670435        0
## 8              0           1           8  0.75357843   0.4387045        0
## 9              0           1           9  0.97551036   0.4938779        0
## 10             0           1          10 -0.12575119   0.2449461        0
## # ... with 124,990 more rows, and 2 more variables: true_class <int>,
## #   class <dbl>
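
As a quick sanity check, the realized flip rate in each condition should be close to the intended rate (a short summary; the column name below is just illustrative):

setup_data %>%
  group_by(mislabel_rate) %>%
  summarize(observed_flip_rate = mean(mislabel))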

If we fit an ordinary logistic regression to the mislabeled data, the resulting estimates are biased (as described in the paper linked above). How can I reduce this bias?

I’ve tried both glm and the glmRob function from the robust package (passing the true mislabel rate as mc.gamma), but glmRob barely decreased the bias at all:

library(robust)
library(broom)

# glmRob doesn't have a broom tidier yet, but its output can be tidied with tidy.lm
tidy.glmRob <- broom:::tidy.lm

# Fit either an ordinary logistic regression or a robust one;
# glmRob's misclass method is given the true mislabel rate as mc.gamma
fit_model <- function(data, type, rate) {
  if (type == "glm") {
    glm(class ~ x, data, family = "binomial")
  } else {
    glmRob(class ~ x, data, family = "binomial", method = "misclass", mc.gamma = rate)
  }
}
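
For example, we can fit and tidy a single replication (the filter values below are just illustrative; this uses the tidy.glmRob method defined above):

# example_data is an illustrative name for one simulated replication
example_data <- setup_data %>%
  filter(mislabel_rate == 0.1, replication == 1)

tidy(fit_model(example_data, "glmRob", 0.1))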

glm_models <- setup_data %>%
  nest(data = c(-mislabel_rate, -replication)) %>%
  crossing(type = c("glm", "glmRob")) %>%
  mutate(model = pmap(list(data, type, mislabel_rate), fit_model))

glm_models %>%
  mutate(tidied = map(model, tidy)) %>%
  unnest(tidied) %>%
  inner_join(true_terms, by = "term") %>%
  ggplot(aes(factor(mislabel_rate), estimate - truth, fill = type)) +
  geom_hline(yintercept = 0, lty = 2) +
  geom_boxplot() +
  facet_wrap(~ term) +
  xlab("Mislabel rate: % of class labels that were flipped") +
  ylab("Bias of resulting estimate") +
  ggtitle("Bias in logistic regression estimate caused by mislabel rate",
          subtitle = "On 25 trials, relative to true intercept of -1 and x coefficient of 1") +
  theme_bw()
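
For a numeric view of the same comparison, the mean bias in each condition could be summarized along these lines (a sketch built on the tidied results above):

glm_models %>%
  mutate(tidied = map(model, tidy)) %>%
  unnest(tidied) %>%
  inner_join(true_terms, by = "term") %>%
  group_by(type, term, mislabel_rate) %>%
  summarize(mean_bias = mean(estimate - truth), .groups = "drop")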

Am I using glmRob incorrectly? Is there another approach?