MC-SIMEX-Korrektur von Messfehlern in nominalen Variablen

Funktion

Die Funktion mcsimextable() stellt die Häufigkeitsverteilung einer diskreten Variable dar, deren Fehlklassifikation durch Misclassification Simulation Extrapolation (MC-SIMEX) (Küchenhoff, Mwalili, & Lesaffre, 2006) korrigiert wurde. Der Algorithmus folgt der Beschreibung im Anhang von Hopkins und King (2011).

Die Funktion erfordert als Input die auszuzählende, mit Fehlklassifikation beobachtete Variable obs_data. Zusätzlich muss entweder die Fehlklassifikationsmatrix Pi oder ein Reliabilitätswert rel angegeben werden. n_MC_steps(Zahl der Fehlklassifikationsschritte) und n_MC_sets (Zahl der in einem Schritt simulierten Datensätze) definieren den MC-SIMEX-Prozess. Die voreingestellten Werte beruhen auf umfangreichen Simulationsstudien. proportion gibt an, ob das Ergebnis in absoluten Häufigkeiten (entspricht table(x)) oder relativen Häufigkeiten (entspricht prop.table(table(x))) ausgegeben werden soll.

Die Funktionen make_Pi_coef(), MC_one_step() und MC_steps() werden intern von mcsimextable() aufgerufen.

# One misclass step
mcsimextable = function(obs_data, Pi = NULL, rel = NULL, n_MC_sets = 100, n_MC_steps = 3, proportion = T) {
  lev = levels(factor(obs_data))
  k = length(lev)
  if (is.null(Pi) & is.null(rel)) {print("Please provide either misclassification matrix or reliability estimate.")}
  if (is.null(Pi) & !is.null(rel)){
    Pi = make_Pi_coef(rel, lev)
  }
  step1 = replicate(n_MC_sets, MC_steps(obs_data = obs_data, Pi = Pi, n_MC_steps = n_MC_steps, lev = lev, k = k), simplify = F)
  step2 = do.call("rbind", lapply(step1, function(i) t(sapply(i, table))))
  step3 = cbind(step2, MC_step = as.numeric(rownames(step2)))
  step4 = do.call("rbind", apply(step3[,1:k], MARGIN = 2, function(i) aggregate(i, by = list(step3[,k+1]), mean)))
  names(step4) = c("MC_step", "Y")
  step4$lev = rep(lev, each = n_MC_steps+1)
  step4$Y = log(step4$Y + 1)
  extrapol = lm(Y ~ poly(MC_step, 2)*lev, step4)
  output_table = predict(extrapol, expand.grid(lev = lev, MC_step = -1))
  names(output_table) = lev
  output_table = exp(output_table) - 1
  if(proportion) output_table = prop.table(output_table)
  if(!proportion) output_table = round(output_table, 0)
  output_table
}

make_Pi_coef = function(rel, lev) {
  k = length(lev)
  mc_matrix = matrix(NA, k, k, dimnames = list(lev, lev))
  diag(mc_matrix) = sqrt(rel)
  error = (1 - sqrt(rel)) / (k - 1)
  mc_matrix[is.na(mc_matrix)] = error
  mc_matrix
}

MC_one_step = function(.data, Pi, lev, k) {
  .data = factor(.data, levels = lev)
  MS = table(.data)
  unlist(sapply(1:k, function(i) sample(lev, size = MS[i], replace = T, prob = Pi[, i])))
}

MC_steps = function(obs_data, Pi, n_MC_steps, lev, k) {
  MC_SMPL = list(obs_data)
  for (i in 1:n_MC_steps) MC_SMPL[[i+1]] = MC_one_step(.data = MC_SMPL[[i]], Pi = Pi, lev = lev, k = k)
  names(MC_SMPL) = 0:n_MC_steps
  MC_SMPL
}

Anwendung

Die Variable “Thema” mit den Ausprägungen “Politik” und “Nicht Politik” wurde in 100 Artikeln mit einer Reliabilität von .8 kodiert.

thema = c(rep("Politik", 20), rep("Nicht Politik", 80))
prop.table(table(thema))

## thema
## Nicht Politik       Politik 
##           0.8           0.2

Eine Korrektur um die Fehlklassifikation ergibt:

mcsimextable(thema, rel = .8)

## Nicht Politik       Politik 
##      0.861694      0.138306

MC-SIMEX-Korrektur von Messfehlern in nominalen Variablen

Marko Bachl & Michael Scharkow

September 2015

Funktion

Anwendung