1. Das Projekt

In diesem Projekt wird versucht, die Vorhersagegenauigkeit eines Random-Forest-Modells durch das Tuning der Hyperparameter mtry, ntree und maxnodes zu verbessern.

Das Random-Forest-Verfahren ist ein Ensemble-Lernverfahren, das sowohl für Klassifikations- als auch Regressionsaufgaben verwendet wird. Der Algorithmus gehört zu den Methoden des überwachten Lernens. Um Vorhersagen oder Entscheidungen zu treffen, nutzt der Random Forest die Ergebnisse einer Vielzahl verschiedener Entscheidungsbäume, die während des Trainings aus zufällig ausgewählten Teilmengen des Datensatzes generiert werden. Die einzelnen Bäume bestehen aus mehreren Verzweigungen, die durch Regeln entstehen, welche die Daten basierend auf ihren Eigenschaften einer Klasse zuordnen. Jeder Baum trifft eine eigene Entscheidung. Die endgültige Entscheidung des Random Forests ergibt sich aus der Aggregation dieser individuellen Entscheidungen: Bei Regressionsaufgaben durch Mittelwertbildung und bei Klassifikationsaufgaben durch Mehrheitsentscheidung.

Obwohl der Random Forest möglicherweise nicht der optimale Algorithmus für das untersuchte Datenset ist, wurde er aufgrund seiner zahlreichen Vorteile gegenüber vielen anderen Algorithmen für Regression und Klassifikation ausgewählt.

Vorteile des Random-Forest-Algorithmus:

Kurze Trainingszeiten: Die Berechnung ist effizient, auch bei größeren Datensätzen
Höhere Genauigkeit: Durch die Aggregation vieler Entscheidungsbäume wird die Varianz reduziert, was zu robusteren Ergebnissen führt
Einfachheit: Der Algorithmus ist vergleichsweise leicht zu verstehen und zu interpretieren
Skalierbarkeit: Der Random Forest kann gut mit einer großen Anzahl von Merkmalen umgehen und identifiziert irrelevante Merkmale automatisch

Ladung der Pakete:

library(readxl)
library(writexl)
library(fmsb)
library(sjPlot)
library(ggplot2)
library(caret)
library(lattice)
library(partykit)
library(knitr)
library(tidyr)
library(e1071)
library(dplyr)
library(ggmosaic)
library(kableExtra)
library(tibble)

2. Daten

Die untersuchten Daten enthalten Informationen zu 2.804 Personen, die nach 1933 aus deutschsprachigen Ländern geflohen sind. Der Datensatz bildet die Grundlage der digitalen Datenbank Emigration und Exil von Wissenschaftlern und Ingenieuren 1930–1950 der Universität Stuttgart. Im Rahmen meiner Masterarbeit habe ich diese Daten nach dem Herausfiltern spezifischer Berufsgruppen standardisiert.

Ladung der Daten:

a <- read_excel("Exil.xlsx")
data <- data.frame(a)
data <- subset(data, Gestorben.bis.1945 != "gestorben")
names(data)

##  [1] "PersonID"           "Nachname"           "Vorname"           
##  [4] "Geschlecht"         "Geburtsjahr"        "Geburtsort"        
##  [7] "Sterbejahr"         "Ausreiseland"       "Religion_NS"       
## [10] "Beruf"              "Disziplin"          "Berufsgruppe"      
## [13] "Ausreisealter"      "Altersgruppe"       "Gestorben.bis.1945"
## [16] "Zieljahr1"          "Zielland1"          "Zieljahr2"         
## [19] "Zielland2"          "Zieljahr3"          "Zielland3"         
## [22] "Zieljahr4"          "Zielland4"          "Zieljahr5"         
## [25] "Zielland5"          "Zieljahr6"          "Zielland6"         
## [28] "Zieljahr7"          "Zielland7"          "Zieljahr8"         
## [31] "Zielland8"          "Zieljahr9"          "Zielland9"         
## [34] "Zieljahr10"         "Zielland10"         "Zieljahr11"        
## [37] "Zielland11"         "Zieljahr12"         "Zielland12"        
## [40] "Zieljahr13"         "Zielland13"         "Zieljahr14"        
## [43] "Zielland14"         "Anzahl.Zielorte"    "Remigration"

dim(data)

## [1] 2804   45

names(data)[names(data) == "Zieljahr1"] <- "Ausreisejahr"

Das komplette Dataset enthält 45 Variablen. Für dieses Projekt werden 7 Variablen ausgesucht.

df <- data[, c("Remigration", "Geschlecht", "Anzahl.Zielorte", "Ausreisealter", "Ausreisejahr","Berufsgruppe", "Religion_NS")]

Ausgesuchten Variablen:

Remigration (binäre Variable, gibt an, ob eine Person nach 1945 in die deutschsprachigen Länder zurückgekehrt ist oder nicht)
Geschlecht (binäre Variable)
Anzahl Zielorte (Anzahl der Zielorte, die bis 1945 von Emigranten angesteuert wurden)
Ausreisealter (Alter der Person zum Zeitpunkt der ersten Ausreise)
Ausreisejahr (Jahr der Ausreise aus dem Heimatsort)
Berufsgruppe
- Geisteswissenschaften
- Naturwissenschaften
- Medizin
- Kunst (Sprache): Sprachbezogene Künstler wie Schauspieler, Autoren usw.
- Kunst (ohne Sprache): nicht sprachbezogene Künstler wie Musiker, Maler usw.
Religion_NS (binäre Variable, gibt an, ob die Personen einen jüdischen Hintergrund hatten)

Im Rahmen des Projekts wird versucht vorherzusagen, welche Personen nach 1945 in die deutssprachigen Länder zurückgekehrt sind und welche nicht (Klassifikationsaufgabe). Damit gilt die binäre Variable Remigration als Zielvariable.

Die Untersuchung erfolgt mithilfe des Pakets caret.

3. Datenaufbereitung

Im Rahmen der Datenaufbereitung werden die NAs entfernt und die charakter-Variablen faktorisiert.

Entfernung der NAs:

sapply(df, function(x) sum(is.na(x)))

##     Remigration      Geschlecht Anzahl.Zielorte   Ausreisealter    Ausreisejahr 
##               0               0               0               5               0 
##    Berufsgruppe     Religion_NS 
##               0               0

df <- df[complete.cases(df),]

Nach der Entfernung der wenigen NAs besteht das Dataset aus 2799 Fälle.

dim(df)

## [1] 2799    7

Klassen der Variablen im Überblick:

sapply(df, FUN = class)

##     Remigration      Geschlecht Anzahl.Zielorte   Ausreisealter    Ausreisejahr 
##     "character"     "character"       "numeric"       "numeric"       "numeric" 
##    Berufsgruppe     Religion_NS 
##     "character"     "character"

Faktorisierung der Zielvariable:

df$Remigration[df$Remigration == "Rückkehr"] <- 1
df$Remigration[df$Remigration == "kein Rückkehr"] <- 0
df$Remigration <- as.factor(df$Remigration)

Faktorisierung der charakter-Variablen:

df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)
sapply(df, FUN = class)

##     Remigration      Geschlecht Anzahl.Zielorte   Ausreisealter    Ausreisejahr 
##        "factor"        "factor"       "numeric"       "numeric"       "numeric" 
##    Berufsgruppe     Religion_NS 
##        "factor"        "factor"

4. Visualisierung der Zielvariable

ggplot(df, aes(x = Remigration, fill = Remigration)) +
  geom_bar(width = 0.4, alpha = 0.8) +
  scale_fill_manual(values = c("skyblue", "darkred"), labels=c('kein Rückkehr', 'Rückkehr'))+
  labs(y = "Häufigkeit") +
  ggtitle("Remigration") +
  geom_text(stat = "count", aes(label = after_stat(count)), 
  vjust = -0.5, size = 3) +
  geom_hline(yintercept = 0, size = 1, colour="#333333") +
  theme_minimal()

prop.table(table(as.numeric(df$Remigration))) * 100

## 
##        1        2 
## 85.28046 14.71954

Wie erkennbar ist, liegt die Anzahl der Personen, die nach 1945 in deutschsprachige Länder remigriert sind, bei 412. Das entspricht etwa 15 % des gesamten Datensatzes (nach Abzug der fehlenden Werte). Die Anzahl der nicht remigrierten Personen ist deutlich höher und beträgt 2.387, was 85 % des Datensatzes ausmacht.

5. Split

Für die Modellierung werden die Daten zunächst in ein Trainings- und ein Test-Set aufgeteilt. Wie bei einer relativ kleinen Fallzahl üblich, umfasst das Trainings-Set 70 % und das Test-Set 30 % der Fälle.

Aufteilung der Daten in Training- und Test-Sets:

set.seed(404)
train.index <- createDataPartition(df$Remigration,
                                   p=0.7,
                                   list=FALSE)
head(train.index)

##      Resample1
## [1,]         1
## [2,]         3
## [3,]         6
## [4,]         8
## [5,]         9
## [6,]        11

training <- df[train.index,]
testing<- df[-train.index,]
nrow(training); nrow(testing)

## [1] 1960

## [1] 839

Nach der Aufteilung der Daten befinden sich im Training-Set 1960 und im Testing-Set 839 Fälle.

6. Wichtigsten Variablen

Training:

#set.seed(404)
#model.rf <- train(Remigration ~ .,
                       #data = training,
                       #method = 'rf')
#saveRDS(model.rf, "model_rf.RDS")
model.rf <- readRDS("model_rf.RDS")

Wichtigsten Variablen:

coefs.rf <- varImp(model.rf)
df.imp <-data.frame(coefs.rf$importance)
df.imp$Vars<-row.names(df.imp)
df.imp <- df.imp[order(df.imp$Overall, decreasing = T),]
df.imp$Vars <- factor(df.imp$Vars, levels = df.imp$Vars)
ggplot(df.imp, aes(y = Vars, x = Overall)) +
  geom_bar(stat = "identity", width = 0.03, fill = "steelblue") + 
  geom_point(color = "red") +
  labs(y = "") +
  ggtitle("Variable Importance") +
  theme_minimal() +
  theme(legend.position = "none")

Die wichtigsten Variablen sind Anzahl.Zielorte, Ausreisealter, Berufsgruppe und Ausreisejahr.

7. Ergebnisse des Basismodells

Nach dem Training des ersten Modells wird das Testen durchgeführt und die Ergebnisse der Modellierung ausgewertet. Die Auswertung erfolgt primär anhand der Parameter Accuracy, Kappa, Sensitivity und Specificity. Die weiteren Parameter PPV, NPV und F1 werden nicht detailliert analysiert und nur als Zusatzinformation ausgegeben.

Accuracy Gibt den Anteil aller Klassifizierungen an, die korrekt waren, unabhängig davon, ob sie positiv oder negativ sind
Kappa Vergleicht die beobachtete Accuracy mit der erwarteten Accuracy (durch Zufall). Es zeigt, wie gut ein Modell im Vergleich zu einem zufälligen Modell abschneidet
Sensitivity (auch als Recall bekannt) Gibt an, wie gut der Algorithmus die positiven Fälle aus der gesamten Menge der tatsächlich vorhandenen positiven Fälle erkennen kann
Specificity Gibt an, wie gut der Algorithmus die negativen Fälle aus der gesamten Menge der tatsächlich vorhandenen negativen Fälle erkennen kann
PPV (Positive Predictive Value, auch als Precision bekannt) Gibt an, wie viele der als positiv vorhergesagten Fälle tatsächlich positiv sind
NPV (Negative Predictive Value) Gibt an, wie viele der als negativ vorhergesagten Fälle tatsächlich negativ sind
F1 Das harmonische Mittel von PPV und Sensitivity bzw. von Precision und Recall

Testing:

predictions <- predict(model.rf, newdata = testing)
random_matrix <- confusionMatrix(predictions, testing$Remigration, positive = '1')
conf_matrix <- as.data.frame.matrix(random_matrix$table)
conf_matrix_with_labels <- conf_matrix %>%
  rownames_to_column(var = "Prediction")
conf_matrix_with_labels %>%
  kable(format = "html", caption = "Konfusionsmatrix") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE, 
                position = "left") %>%
  add_header_above(c(" " = 1, "Reference" = ncol(conf_matrix))) %>%
  column_spec(1:ncol(conf_matrix), width = "12em")

Konfusionsmatrix
	Reference
Prediction	0	1
0	709	105
1	7	18

Confusionsmatrix:

conf_df <- as.data.frame(as.table(random_matrix$table))
colnames(conf_df) <- c("Prediction", "Reference", "Freq")
conf_df$Group <-c("richtig", "falsch", "falsch", "richtig")
ggplot(data = conf_df) +
  geom_mosaic(aes(weight = Freq, x = product(Reference), fill = Group)) +
  labs(title = "Konfusionsmatrix", x = "Reference", y = "", fill = "Prediction") +
  theme_minimal() +
  scale_fill_manual(values = c("grey20", "skyblue")) + 
  theme(axis.text.y = element_blank(),  # Entfernen der y-Achsen-Beschriftungen
        axis.ticks.y = element_blank())

accuracy <- random_matrix$overall["Accuracy"]
kappa <- random_matrix$overall["Kappa"]
specificity <- random_matrix$byClass["Specificity"]
sensitivity <- random_matrix$byClass["Sensitivity"]
ppv <- random_matrix$byClass["Precision"]
npv <- random_matrix$byClass["Neg Pred Value"]
f1 <- random_matrix$byClass["F1"]

metrics_random <- data.frame(
  Kennzahl = c("Accuracy", "Kappa", "Specificity", "Sensitivity", "PPV", "NPV", "F1"),
  Wert = c(accuracy, kappa, specificity, sensitivity, ppv, npv, f1)
)

rownames(metrics_random) <- NULL
kable(metrics_random, caption = "Model Performance Metrics")

Model Performance Metrics
Kennzahl	Wert
Accuracy	0.8665077
Kappa	0.2038095
Specificity	0.9902235
Sensitivity	0.1463415
PPV	0.7200000
NPV	0.8710074
F1	0.2432432

Wie zu sehen ist, sind die Accuracy- und Specificity-Werte der ersten Modellierung relativ hoch und liegen bei jeweils 0.8665077 und 0.9902235. Dagegen sind die Kappa- und Sensitivity-Werte niedrig (jeweils 0.2038095 und 0.1463415). Das bedeutet, dass die Übereinstimmung des Modells mit den tatsächlichen Werten (Kappa) relativ gering ist und das Modell die Rückkehr relativ schlecht erkennen kann (Sensitivity). In Anbetracht der Verteilung der Variablenausprägungen der Zielvariablen (14.71954% für Rückkehr und 85.28046% für kein Rückkehr) liegt hier wahrscheinlich ein Class Imbalance-Problem vor. Bei unausbalancierten Daten neigen Modelle dazu, Beobachtungen der kleineren Klasse als Rauschen zu klassifizieren und ordnen neue Beobachtungen eher der größeren Klasse zu.

Im Folgenden wird versucht, die Ergebnisse der Modellierung durch das Tuning der Hyperparameter des Modells zu verbessern. Weitere Optimierungsmethoden wie etwa die Anpassung der Datenstruktur, Sampling oder die Validierung der Modelle werden in diesem Projekt nicht durchgeführt.

8. Hyperparameter Tuning

Die Optimierung des Modells erfolgt mithilfe der folgenden Hyperparameter:

mtry: Anzahl der Variablen, die bei jeder Aufteilung des Baums zufällig ausgewählt werden. Dieser Parameter beeinflusst sowohl die Diversität der einzelnen Bäume im Wald als auch die Genauigkeit des Modells
ntree: Anzahl der Bäume im Wald. Eine höhere Anzahl von Bäumen kann die Genauigkeit des Modells verbessern
maxnodes: Maximale Anzahl von Endknoten in jedem Entscheidungsbaum. Dieser Parameter begrenzt die Tiefe und Komplexität jedes Baums und reduziert dadurch das Risiko der Überanpassung (Overfitting)

8.1 Mtry

Im Standard-Modell werden in caret zunächst drei Modelle mit unterschiedlichen mtry-Werten erstellt. Der mtry-Wert mit der höchsten Accuracy wird ausgewählt. In diesem Fall liegt dieser Wert bei 2, mit einer maximalen Accuracy von 0,8608734. Angesichts des vorliegenden Class Imbalance Problems ist es jedoch vorteilhafter, das Modell im Hinblick auf Kappa zu optimieren. Kappa gibt einen besseren Hinweis auf die tatsächliche Leistungsfähigkeit eines Modells, indem es die Wahrscheinlichkeit berücksichtigt, dass der Erfolg auch durch Zufall zustande kommen könnte, und ist daher eine nützlichere Metrik bei unausgewogenen Daten.

print(model.rf)

## Random Forest 
## 
## 1960 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1960, 1960, 1960, 1960, 1960, 1960, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8608734  0.1915022
##   5     0.8387456  0.2648040
##   9     0.8291739  0.2530634
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

So wird im nächsten Schritt die mtry-Werte von 1 bis 6 (Anzahl der Perdicatoren in Dataset) systematisch eingesetzt.

set.seed(404)
#tune.mtry <- expand.grid(.mtry = c(1:6))
#model_mtry <- train(Remigration ~ .,
                 #data = training,
                 #tuneGrid = tune.mtry,
                 #metric = 'Kappa',
                 #method = 'rf')
#saveRDS(model_mtry, file = "model_mtry.RDS")
model_mtry <- readRDS("model_mtry.RDS")
print(model_mtry)

## Random Forest 
## 
## 1960 samples
##    6 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1960, 1960, 1960, 1960, 1960, 1960, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   1     0.8555061  0.0000000
##   2     0.8608722  0.1929628
##   3     0.8584052  0.2845575
##   4     0.8468263  0.2780580
##   5     0.8390747  0.2653309
##   6     0.8351180  0.2593445
## 
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.

results <- model_mtry$results
df_long <- pivot_longer(results, 
                        cols = c(Accuracy, Kappa),
                        names_to = "Wert",       
                        values_to = "Zahl") 
ggplot(df_long, aes(x = mtry, y = Zahl, color = Wert, group = Wert)) +
  geom_line(linewidth = 0.5) +  
  ylim(0, 1) +
  geom_point(size = 1.5) +
  scale_color_manual(values = c("darkred", "steelblue")) +
  labs(title = "Accuracy & Kappa",
       x = "Anzahl mtry",
       y = "Accuracy & Kappa") +
  scale_x_continuous(breaks = df_long$mtry) +
  geom_hline(yintercept = 0, size = 1, colour="#333333") +
  theme_minimal()

Die Abbildung zeigt, dass der optimale mtry-Wert, bei dem Kappa sein Maximum erreicht, bei 3 liegt. Die Accuracy nimmt mit steigendem mtry-Wert ab, jedoch nur in geringem Ausmaß.

predictions <- predict(model_mtry, newdata = testing)
conf.tune <- confusionMatrix(predictions, testing$Remigration, positive = '1')
conf_matrix <- as.data.frame.matrix(conf.tune$table)
conf_matrix_with_labels <- conf_matrix %>%
  rownames_to_column(var = "Prediction")
conf_matrix_with_labels %>%
  kable(format = "html", caption = "Konfusionsmatrix") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE, 
                position = "left") %>%
  add_header_above(c(" " = 1, "Reference" = ncol(conf_matrix))) %>%
  column_spec(1:ncol(conf_matrix), width = "12em")

Konfusionsmatrix
	Reference
Prediction	0	1
0	699	93
1	17	30

conf_df <- as.data.frame(as.table(conf.tune$table))
colnames(conf_df) <- c("Prediction", "Reference", "Freq")
conf_df$Group <-c("richtig", "falsch", "falsch", "richtig")
ggplot(data = conf_df) +
  geom_mosaic(aes(weight = Freq, x = product(Reference), fill = Group)) +
  labs(title = "Konfusionsmatrix", x = "Reference", y = "", fill = "Prediction") +
  theme_minimal() +
  scale_fill_manual(values = c("grey20", "skyblue")) + 
  theme(axis.text.y = element_blank(),  # Entfernen der y-Achsen-Beschriftungen
        axis.ticks.y = element_blank())

accuracy <- conf.tune$overall["Accuracy"]
kappa <- conf.tune$overall["Kappa"]
specificity <- conf.tune$byClass["Specificity"]
sensitivity <- conf.tune$byClass["Sensitivity"]
ppv <- conf.tune$byClass["Precision"]
npv <- conf.tune$byClass["Neg Pred Value"]
f1 <- conf.tune$byClass["F1"]

metrics_tune <- data.frame(
  Kennzahl = c("Accuracy", "Kappa", "Specificity", "Sensitivity", "PPV", "NPV", "F1"),
  Wert = c(accuracy, kappa, specificity, sensitivity, ppv, npv, f1)
)
rownames(metrics_tune) <- NULL
kable(metrics_tune, caption = "Metrics")

Metrics
Kennzahl	Wert
Accuracy	0.8688915
Kappa	0.2958617
Specificity	0.9762570
Sensitivity	0.2439024
PPV	0.6382979
NPV	0.8825758
F1	0.3529412

Wie man sieht, ist bei mtry = 3 das Kappa-Wert von 0.2038095 auf 0.2990918 und Sensitivity von 0.1463415 auf 0.2439024 gestiegen. Accuracy sinkt wie erwartet kaum und Specificity-Verlust liegt bei 0,02.

8.2 Ntry

Im nächsten Schritt wird die optimale Kombination aus mtry und ntree ermittelt. Standardmäßig ist die Anzahl der Bäume (ntree) auf 500 festgelegt.

#trees <- c(1000, 1500, 2000, 2500)
#results_ntree <- data.frame()
#for (tree in trees) {
  #set.seed(404)
  #temp.rf <- train(Remigration ~ .,
                   #data = training,
                   #method = "rf",
                   #metric = "Kappa",
                   #tuneGrid = expand.grid(.mtry = c(1:6)),
                   #ntree = tree)
    #temp.df <- data.frame(temp.rf$results[, c('mtry', 'Accuracy', 'Kappa')])
    #temp.df$trees <- tree
    
    #results_ntree <- rbind(results_ntree, temp.df)
#}
#saveRDS(results_ntree, file = "results_ntree.RDS")
results_ntry <- readRDS("results_ntree.RDS")

head(results_ntry, 15)

##    mtry  Accuracy     Kappa trees
## 1     1 0.8555061 0.0000000  1000
## 2     2 0.8608188 0.1919840  1000
## 3     3 0.8582923 0.2827687  1000
## 4     4 0.8463394 0.2774977  1000
## 5     5 0.8394562 0.2660146  1000
## 6     6 0.8350776 0.2596631  1000
## 7     1 0.8555061 0.0000000  1500
## 8     2 0.8606501 0.1889000  1500
## 9     3 0.8586666 0.2846469  1500
## 10    4 0.8464055 0.2781351  1500
## 11    5 0.8392145 0.2644075  1500
## 12    6 0.8346222 0.2561795  1500
## 13    1 0.8555061 0.0000000  2000
## 14    2 0.8609275 0.1920469  2000
## 15    3 0.8583296 0.2830154  2000

results_ntry[which.max(results_ntry$Kappa), ]

##   mtry  Accuracy     Kappa trees
## 9    3 0.8586666 0.2846469  1500

Wie man sieht, liegt der maximale Kappa-Wert bei 3 mtrys und 1500 ntrees.

results_long <- pivot_longer(results_ntry, 
                        cols = c(Accuracy, Kappa),
                        names_to = "Wert",       
                        values_to = "Zahl")

ggplot(results_long, aes(x = mtry, y = Zahl, color = Wert, group = Wert)) +
  geom_line(linewidth = 0.5) +  
  ylim(0, 1) +
  geom_point(size = 1.5) +
  scale_color_manual(values = c("darkred", "steelblue")) +
  labs(title = "Accuracy & Kappa",
       x = "Anzahl mtry",
       y = "Accuracy & Kappa") +
  scale_x_continuous(breaks = results_long$mtry) +
  geom_hline(yintercept = 0, size = 0.7, colour="#333333") +
  facet_wrap(.~trees) +
  theme(legend.position = "bottom") +
  theme_minimal()

Die Abbildung zeigt, dass bei steigendem Anzahl der ntrees weder Kappa noch Accuracy tatsächlich beeinflusst werden. Deswegen wird dieser Parameter in weiteren Untersuchungen nicht mehr berücksichtigt.

#set.seed(404)
#model_ntry_mtry <- train(Remigration ~ ., 
                  #data=training, 
                  #method='rf',
                  #tuneGrid = expand.grid(.mtry = 3),
                  #ntree = 1500) 
#saveRDS(model_ntry_mtry, file = "model_ntry_mtry.RDS")
model_ntry_mtry <- readRDS("model_ntry_mtry.RDS")

predictions <- predict(model_ntry_mtry, newdata = testing)
matrix_ntry_mtry <- confusionMatrix(predictions, testing$Remigration, positive = '1')
conf_matrix <- as.data.frame.matrix(matrix_ntry_mtry$table)
conf_matrix_with_labels <- conf_matrix %>%
  rownames_to_column(var = "Prediction")
conf_matrix_with_labels %>%
  kable(format = "html", caption = "Konfusionsmatrix") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE, 
                position = "left") %>%
  add_header_above(c(" " = 1, "Reference" = ncol(conf_matrix))) %>%
  column_spec(1:ncol(conf_matrix), width = "12em")

Konfusionsmatrix
	Reference
Prediction	0	1
0	701	92
1	15	31

conf_df <- as.data.frame(as.table(matrix_ntry_mtry$table))
colnames(conf_df) <- c("Prediction", "Reference", "Freq")
conf_df$Group <-c("richtig", "falsch", "falsch", "richtig")
ggplot(data = conf_df) +
  geom_mosaic(aes(weight = Freq, x = product(Reference), fill = Group)) +
  labs(title = "Konfusionsmatrix", x = "Reference", y = "", fill = "Prediction") +
  theme_minimal() +
  scale_fill_manual(values = c("grey20", "skyblue")) + 
  theme(axis.text.y = element_blank(),  # Entfernen der y-Achsen-Beschriftungen
        axis.ticks.y = element_blank())

accuracy <- matrix_ntry_mtry$overall["Accuracy"]
kappa <- matrix_ntry_mtry$overall["Kappa"]
specificity <- matrix_ntry_mtry$byClass["Specificity"]
sensitivity <- matrix_ntry_mtry$byClass["Sensitivity"]
ppv <- matrix_ntry_mtry$byClass["Precision"]
npv <- matrix_ntry_mtry$byClass["Neg Pred Value"]
f1 <- matrix_ntry_mtry$byClass["F1"]

metrics_tune <- data.frame(
  Kennzahl = c("Accuracy", "Kappa", "Specificity", "Sensitivity", "PPV", "NPV", "F1"),
  Wert = c(accuracy, kappa, specificity, sensitivity, ppv, npv, f1)
)
rownames(metrics_tune) <- NULL
kable(metrics_tune, caption = "Metrics")

Metrics
Kennzahl	Wert
Accuracy	0.8724672
Kappa	0.3119525
Specificity	0.9790503
Sensitivity	0.2520325
PPV	0.6739130
NPV	0.8839849
F1	0.3668639

Wie man sieht, haben Kappa- und Sensitivity-Werte bei mtry = 3 und ntree = 1500 nur leicht angestiegen im Vergleich zu Kappa- und Sensitivity-Werten bei mtry = 3 und ntree = 500 (bei default).

8.3 Maxnodes

Im nächsten Schritt wird die optimale Kombination von mtry und maxnodes ermittelt. Für maxnodes gibt es keine Standardvorgabe, was bedeutet, dass keine feste Grenze für die maximale Anzahl an Knoten in einem Entscheidungsbaum existiert. Der Random Forest-Algorithmus lässt die Bäume daher in der Regel so lange wachsen, bis ein „optimaler“ Zustand erreicht ist. Dies führt häufig dazu, dass die Bäume tief und komplex werden.

#max.nodes <- c(5:15)
#results_node <- data.frame()
#for (node in max.nodes) {
  #set.seed(404)
  #temp.rf <- train(Remigration ~ .,
                   #data = training,
                   #method = "rf",
                   #metric = "Kappa",
                   #tuneGrid = expand.grid(.mtry = c(1:6)),
                   #maxnodes = node,
                   #ntree = 1500)
  #temp.df <- data.frame(temp.rf$results[, c('mtry', 'Accuracy', 'Kappa')]) 
  #temp.df$max.node <- node 
  #results_node <- rbind(results_node, temp.df)
#}
#saveRDS(results_node, file = "results_node.rds")
results_node <- readRDS("results_node.rds")
head(results_node, 15)

##    mtry  Accuracy       Kappa max.node
## 1     1 0.8555061 0.000000000        5
## 2     2 0.8555061 0.000000000        5
## 3     3 0.8553369 0.002250422        5
## 4     4 0.8577732 0.090395933        5
## 5     5 0.8591778 0.153884377        5
## 6     6 0.8603817 0.180705110        5
## 7     1 0.8555061 0.000000000        6
## 8     2 0.8555061 0.000000000        6
## 9     3 0.8559863 0.035735656        6
## 10    4 0.8586804 0.128162246        6
## 11    5 0.8603353 0.178793010        6
## 12    6 0.8616648 0.213415401        6
## 13    1 0.8555061 0.000000000        7
## 14    2 0.8555061 0.000000000        7
## 15    3 0.8572299 0.075275655        7

results_node[which.max(results_node$Kappa), ]

##    mtry  Accuracy     Kappa max.node
## 66    6 0.8661054 0.2924901       15

Der maximale Kappa-Wert wird bei mtry = 6 und maxnodes = 15 erreicht.

node_long <- pivot_longer(results_node, 
                        cols = c(Accuracy, Kappa),
                        names_to = "Wert",       
                        values_to = "Zahl")

ggplot(node_long, aes(x = mtry, y = Zahl, color = Wert, group = Wert)) +
  geom_line(linewidth = 0.5) +  
  ylim(0, 1) +
  geom_hline(yintercept = 0.25, linetype = "dashed", color = "red") + 
  geom_point(size = 1) +
  scale_color_manual(values = c("darkred", "steelblue")) +
  labs(title = "Accuracy & Kappa",
       x = "Anzahl mtry",
       y = "Accuracy & Kappa") +
  geom_hline(yintercept = 0, size = 0.5, colour="#333333") +
  facet_wrap(.~max.node) +
  theme(legend.position = "bottom") +
  theme_minimal()

Aus der Abbildung kann man erkennen, dass Kappa-Werte mit dem Anstieg der Anzahl der maxnodes bis 12 auch sichtbar steigen. Bei maxnodes >= 11, ist der Anstieg von Kappa nur gering.

#set.seed(404)
#model_final <- train(Remigration ~ ., 
                  #data=training, 
                  #method='rf',
                  #tuneGrid = expand.grid(.mtry = 6),
                  #max.nodes = 15,
                  #ntree = 1500)
#saveRDS(model_final, file = "model_final.RDS")
model_final <- readRDS("model_final.RDS")

predictions <- predict(model_final, newdata = testing)
matrix_final <- confusionMatrix(predictions, testing$Remigration, positive = '1')
conf_matrix <- as.data.frame.matrix(matrix_final$table)
conf_matrix_with_labels <- conf_matrix %>%
  rownames_to_column(var = "Prediction")
conf_matrix_with_labels %>%
  kable(format = "html", caption = "Konfusionsmatrix") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE, 
                position = "left") %>%
  add_header_above(c(" " = 1, "Reference" = ncol(conf_matrix))) %>%
  column_spec(1:ncol(conf_matrix), width = "12em")

Konfusionsmatrix
	Reference
Prediction	0	1
0	673	77
1	43	46

conf_df <- as.data.frame(as.table(matrix_final$table))
colnames(conf_df) <- c("Prediction", "Reference", "Freq")
conf_df$Group <-c("richtig", "falsch", "falsch", "richtig")
ggplot(data = conf_df) +
  geom_mosaic(aes(weight = Freq, x = product(Reference), fill = Group)) +
  labs(title = "Konfusionsmatrix", x = "Reference", y = "", fill = "Prediction") +
  theme_minimal() +
  scale_fill_manual(values = c("grey20", "skyblue")) + 
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

accuracy <- matrix_final$overall["Accuracy"]
kappa <- matrix_final$overall["Kappa"]
specificity <- matrix_final$byClass["Specificity"]
sensitivity <- matrix_final$byClass["Sensitivity"]
ppv <- matrix_final$byClass["Precision"]
npv <- matrix_final$byClass["Neg Pred Value"]
f1 <- matrix_final$byClass["F1"]
metrics_final <- data.frame(
  Kennzahl = c("Accuracy", "Kappa", "Specificity", "Sensitivity", "PPV", "NPV", "F1"),
  Wert = c(accuracy, kappa, specificity, sensitivity, ppv, npv, f1)
)

rownames(metrics_final) <- NULL
kable(metrics_final, caption = "Model Performance Metrics")

Model Performance Metrics
Kennzahl	Wert
Accuracy	0.8569726
Kappa	0.3545078
Specificity	0.9399441
Sensitivity	0.3739837
PPV	0.5168539
NPV	0.8973333
F1	0.4339623

Mit mtry = 6 und maxnodes = 15 ist Kappa von 0.3119525 auf 0.3545078 und Sensitivity von 0.2520325 auf 0.3739837 gestiegen.

9. Visualisierung der Tuning-Ergebnisse

model_list <- list("origin" = model.rf, "mtry" = model_mtry, "mtry.ntry" = model_ntry_mtry, "mtry.nodes" = model_final)
results_final <- data.frame()
for (model_name in names(model_list)) {
  predictions <- predict(model_list[[model_name]], newdata = testing)
  conf_matrix <- confusionMatrix(predictions, testing$Remigration, positive = '1')
  kappa <- conf_matrix$overall["Kappa"]
  accuracy <- conf_matrix$overall["Accuracy"]
  sensitivity <- conf_matrix$byClass["Sensitivity"]
  specificity <- conf_matrix$byClass["Specificity"]
results_final<- rbind(results_final, data.frame(Model = model_name, Accuracy = accuracy, Kappa = kappa, Sensitivity = sensitivity,
             Specificity = specificity))
}

set.seed(404)
ggplot(results_final, aes(x = Accuracy, y = Kappa)) + 
  geom_point(aes(color = Model), size = 2.8) +
  scale_color_viridis_d() +
  xlim(0.855, 0.875) +
  ylim(0.2, 0.4) +
  geom_text(aes(label = Model), vjust = -1.2, size = 2.5) +
  ggtitle("Kappa & Accuracy") +
  theme_minimal()

ggplot(results_final, aes(x = Sensitivity, y = Specificity)) + 
  geom_point(aes(color = Model), size = 2.8) +
  xlim(0.14, 0.40) +
  ylim(0.93, 1) +
  scale_color_viridis_d() +
  geom_text(aes(label = Model), vjust = -1.2, size = 2.5) +
  ggtitle("Sensitivity & Specificity") +
  theme_minimal()

rownames(results_final) <- NULL
kable(results_final, caption = "Model Performance Metrics")

Model Performance Metrics
Model	Accuracy	Kappa	Sensitivity	Specificity
origin	0.8665077	0.2038095	0.1463415	0.9902235
mtry	0.8688915	0.2958617	0.2439024	0.9762570
mtry.ntry	0.8724672	0.3119525	0.2520325	0.9790503
mtry.nodes	0.8569726	0.3545078	0.3739837	0.9399441

Das Tuning des Modells führte zu einem Anstieg des Kappa-Wertes von 20,38% auf 35,45% sowie der Sensitivity von 14,63 % auf 37,40 %. Das bedeutet, dass die positive Klasse (Rückkehr) nun deutlich besser erkannt wird. Gleichzeitig sank die Accuracy um 0,01 und die Specificity um 0,05. Dennoch ist der neue Sensitivity-Wert von 37,40 % weiterhin unzureichend und sollte durch andere Methoden weiter verbessert werden.

Mögliche Verbesserungsmethoden sind:

Umstrukturierung der Daten: Einfügung neuer Variablen, Entfernung der unwichtigen Variablen, Bildung der Altersgruppen usw.
Auswahl eines optimalen Modells
Sampling der Daten
Auswahl der Validierungsmethode

Random Forest

Tuning von Hyperparameter

Maja Lichman