Competing risks

Competing risks simulation

The idea here is to show how even if there is no trend in the effectivness of teachers, that if there is a relationship between effectiveness and leaving the profession then fixed effects models will be biased.

The basic case: no change over time, and no attrition

This is the basic situation: 1000 teachers, who join the profession at the same time. They start with a random initial effectiveness that changes with a random slope. There is no attrition from the sample over 5 years.

data <- data.frame(
  slope = runif(1000, min = -0.1, max = 0.1),
  year1 = runif(1000, min = -1, max = 1))
data$year2 <- data$year1 + data$slope
data$year3 <- data$year2 + data$slope
data$year4 <- data$year3 + data$slope
data$year5 <- data$year4 + data$slope
head(data)
         slope      year1      year2      year3      year4      year5
1 -0.001248886 -0.6505664 -0.6518153 -0.6530642 -0.6543131 -0.6555619
2  0.074564929  0.3313937  0.4059587  0.4805236  0.5550885  0.6296535
3 -0.063092592 -0.1494449 -0.2125375 -0.2756301 -0.3387227 -0.4018153
4 -0.044058875 -0.1464149 -0.1904738 -0.2345327 -0.2785915 -0.3226504
5  0.079923874  0.4710250  0.5509489  0.6308727  0.7107966  0.7907205
6 -0.068772798  0.7759033  0.7071305  0.6383577  0.5695849  0.5008121

Potting these lines we can see how they look (first 20 teachers).

plot(data$year1[1], 1, ylim = c(-1.5,1.5), xlim = c(1,5))
for (i in 1:20){
  lines(x = 1:5, y = data[i, 2:6])
}

So now we can look at the association between years of experience using a fixed effects model.

y <- do.call(c, data[, 2:6]) 
t <- rep(1:5, each = 1000)
x <- rep(1:1000, times = 5)

summary(lm(y ~ t + x))

Call:
lm(formula = y ~ t + x)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.33236 -0.53618 -0.00945  0.53604  1.39595 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.223e-04  2.492e-02   0.021    0.983
t           -1.615e-03  6.041e-03  -0.267    0.789
x           -3.424e-05  2.959e-05  -1.157    0.247

Residual standard error: 0.6041 on 4997 degrees of freedom
Multiple R-squared:  0.0002821, Adjusted R-squared:  -0.0001181 
F-statistic: 0.7049 on 2 and 4997 DF,  p-value: 0.4942

Ok, so far so good. Now we can try adding some attrition into the data, but first at random. Once someone leaves this cohort, they leave for good.

data_r <- data 
leave_prob <- c(.1, .9)
data$leave1 <- sample(c(NA,1), 1000, replace = TRUE, prob = leave_prob)
data$leave2 <- sample(c(NA,1), 1000, replace = TRUE, prob = leave_prob)
data$leave3 <- sample(c(NA,1), 1000, replace = TRUE, prob = leave_prob)
data$leave4 <- sample(c(NA,1), 1000, replace = TRUE, prob = leave_prob)
data$leave5 <- sample(c(NA,1), 1000, replace = TRUE, prob = leave_prob)

data$year1 <- data$year1 * data$leave1
data$year2 <- data$year2 * data$leave1 * data$leave2
data$year3 <- data$year3 * data$leave1 * data$leave2 * data$leave3
data$year4 <- data$year4 * data$leave1 * data$leave2 * data$leave3 * data$leave4
data$year5 <- data$year5 * data$leave1 * data$leave2 * data$leave3 * data$leave4 * data$leave5

Potting these lines we can see, again how they look (first 20 teachers).

plot(data$year1[1], 1, ylim = c(-1.5,1.5), xlim = c(1,5)) 
for (i in 1:20){
  lines(x = 1:5, y = data[i, 2:6]) 
  }

So now we can look at the association between years of experience using a fixed effects model.

y <- do.call(c, data[, 2:6]) 
summary(lm(y ~ t + x))

Call:
lm(formula = y ~ t + x)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.31077 -0.52038 -0.02866  0.52850  1.42266 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  4.836e-02  2.763e-02   1.750  0.08014 . 
t           -1.085e-02  6.965e-03  -1.558  0.11932   
x           -1.044e-04  3.379e-05  -3.091  0.00201 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5928 on 3677 degrees of freedom
  (1320 observations deleted due to missingness)
Multiple R-squared:  0.003203,  Adjusted R-squared:  0.002661 
F-statistic: 5.908 on 2 and 3677 DF,  p-value: 0.002743

Again, this is fine, and the estimates are unbiased. Now we can introduce the dependency on effectiveness. For the sake of simplicity, we will say that the probability of leaving is 20% in year for those with below average effectiveness and 5% for those with above average effectiveness.

data <- data_r
dep_sample <- function(X){
  unlist(lapply(X, FUN = function(x){
    sample(c(NA,1), 1, prob = `if`(x > 0, c(.05, .95), c(.2, .8)))
  }))
}

data$leave1 <- dep_sample(data$year1)
data$leave2 <- dep_sample(data$year2)
data$leave3 <- dep_sample(data$year3)
data$leave4 <- dep_sample(data$year4)
data$leave5 <- dep_sample(data$year5)

data$year1 <- data$year1 * data$leave1
data$year2 <- data$year2 * data$leave1 * data$leave2
data$year3 <- data$year3 * data$leave1 * data$leave2 * data$leave3
data$year4 <- data$year4 * data$leave1 * data$leave2 * data$leave3 * data$leave4
data$year5 <- data$year5 * data$leave1 * data$leave2 * data$leave3 * data$leave4 * data$leave5

y <- do.call(c, data[, 2:6]) 
summary(lm(y ~ t + x))

Call:
lm(formula = y ~ t + x)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.43552 -0.51357  0.07212  0.50271  1.21560 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.019e-03  2.798e-02   0.036    0.971    
t            3.329e-02  7.083e-03   4.700  2.7e-06 ***
x           -2.109e-05  3.464e-05  -0.609    0.543    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.592 on 3543 degrees of freedom
  (1454 observations deleted due to missingness)
Multiple R-squared:  0.006302,  Adjusted R-squared:  0.005741 
F-statistic: 11.23 on 2 and 3543 DF,  p-value: 1.369e-05

Now the coefficient of t is significant and positive, driven by the missing data only.