This project aims to combine three different models: logistic regression, decision trees, and KNN. The aim is to find reasonable predictors, if a human-being outlives the average person in the US in 2005.
library(class)
library(rpart)
library(caret)
## Lade nötiges Paket: ggplot2
## Lade nötiges Paket: lattice
library(gains)
# import the dataset
death_df <- read.csv("2005_data.csv", nrows = 50000)
#check for missing numbers
colSums(is.na(death_df))
str(death_df)
View(death_df)
Afterwards good possible predictors have to be extracted from the enormous number of columns in the dataset. Additionally, a column “Above_age” has to be created which holds the value 1, if the the detail_age column exceeds the average age 77.3, and 0 if it is lower than this value. Since the character names (M,F,D,M etc.) in various columns (sex, marital status etc.) led to problems, I followed the approach of converting these values into numbers.
# pre-processing
death_df <- na.omit(death_df[ ,c("sex","resident_status","manner_of_death","education_1989_revision","marital_status","hispanic_origin","injury_at_work","month_of_death","detail_age")])
death_df$Above_age <- ifelse(death_df$detail_age >77.3,1,0)
death_df <- death_df[,-c(9)]
death_df$sex <- as.factor(death_df$sex)
death_df$marital_status <- as.factor(death_df$marital_status)
death_df$injury_at_work <- as.factor(death_df$injury_at_work)
death_df$sex <- ifelse(death_df$sex == "M", 0, 1)
replacements <- c("M" = 1, "S" = 2, "U" = 3, "W" = 4, "D" = 5)
death_df$marital_status <- ifelse(death_df$marital_status %in% names(replacements), replacements[death_df$marital_status], death_df$marital_status)
replacements <- c("U" = 1, "N" = 2, "Y" = 3)
death_df$injury_at_work <- ifelse(death_df$injury_at_work %in% names(replacements), replacements[death_df$injury_at_work], death_df$injury_at_work)
In order to avoid over-fitting and improving the model, it is recommended to split the data into a training and validation dataset (here 70/30 split).
#partition the data into training and validation sets
set.seed(1)
train_index <- sample(c(1:dim(death_df)[1]), dim(death_df)[1]*0.7)
valid_index <- setdiff(c(1:dim(death_df)[1]), train_index)
train_df <- death_df[train_index, ]
valid_df <- death_df[valid_index, ]
As a first model, a logistic regression is used. First, all predictors are considered. After further review, the significant predictors will be extracted, leading to an improved model performance.
If the probability exceeds 0.5 of out-living the average American in 2005, the value in the confusion matrix will be set to 1, otherwise 0.
# logistic regression model
reg <- glm(Above_age ~., data = train_df, family = "binomial")
#confusion/classification matrix
confusionMatrix(factor(ifelse(predict(reg, valid_df, type = "response")>0.5, 1, 0)),
factor(valid_df$Above_age), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5878 1940
## 1 1052 3025
##
## Accuracy : 0.7485
## 95% CI : (0.7406, 0.7562)
## No Information Rate : 0.5826
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4694
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6093
## Specificity : 0.8482
## Pos Pred Value : 0.7420
## Neg Pred Value : 0.7519
## Prevalence : 0.4174
## Detection Rate : 0.2543
## Detection Prevalence : 0.3427
## Balanced Accuracy : 0.7287
##
## 'Positive' Class : 1
##
summary(reg)
##
## Call:
## glm(formula = Above_age ~ ., family = "binomial", data = train_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.7092527 0.2141932 -21.986 <2e-16 ***
## sex 0.0897316 0.0300173 2.989 0.0028 **
## resident_status -0.3189256 0.0313588 -10.170 <2e-16 ***
## manner_of_death 0.3751031 0.0208650 17.978 <2e-16 ***
## education_1989_revision -0.0010849 0.0010673 -1.016 0.3094
## marital_status 0.6212379 0.0099470 62.455 <2e-16 ***
## hispanic_origin -0.0016918 0.0007459 -2.268 0.0233 *
## injury_at_work 0.2533921 0.1290304 1.964 0.0496 *
## month_of_death -0.0038359 0.0039582 -0.969 0.3325
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37821 on 27752 degrees of freedom
## Residual deviance: 30495 on 27744 degrees of freedom
## AIC: 30513
##
## Number of Fisher Scoring iterations: 5
# create the second logistic regression model without the insignificant parameters
reg2 <- glm(Above_age ~manner_of_death+sex+resident_status+marital_status+injury_at_work, data = train_df, family = "binomial")
#confusion/classification matrix
confusionMatrix(factor(ifelse(predict(reg2, valid_df, type = "response")>0.5, 1, 0)),
factor(valid_df$Above_age), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5878 1939
## 1 1052 3026
##
## Accuracy : 0.7485
## 95% CI : (0.7407, 0.7563)
## No Information Rate : 0.5826
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4696
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6095
## Specificity : 0.8482
## Pos Pred Value : 0.7420
## Neg Pred Value : 0.7520
## Prevalence : 0.4174
## Detection Rate : 0.2544
## Detection Prevalence : 0.3428
## Balanced Accuracy : 0.7288
##
## 'Positive' Class : 1
##
summary(reg2)
##
## Call:
## glm(formula = Above_age ~ manner_of_death + sex + resident_status +
## marital_status + injury_at_work, family = "binomial", data = train_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.931469 0.197349 -24.989 < 2e-16 ***
## manner_of_death 0.376176 0.020848 18.043 < 2e-16 ***
## sex 0.091438 0.030005 3.047 0.00231 **
## resident_status -0.318271 0.031333 -10.158 < 2e-16 ***
## marital_status 0.621060 0.009944 62.457 < 2e-16 ***
## injury_at_work 0.255564 0.128905 1.983 0.04741 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37821 on 27752 degrees of freedom
## Residual deviance: 30503 on 27747 degrees of freedom
## AIC: 30515
##
## Number of Fisher Scoring iterations: 5
Since a K-nearest neighbor model also fits this use case, this approach will be considered as well. The neighbor number is set to k = 3 after trying different odd numbers (which did not have a significant boost in performance). Again, as mentioned above, if the probability exceeds 0.5 of out-living the average American in 2005, the value in the confusion matrix will be set to 1, otherwise 0.
# knn model
summary(train_df)
## sex resident_status manner_of_death education_1989_revision
## Min. :0.0000 Min. :1.000 Min. :1.000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:7.000 1st Qu.: 9.00
## Median :0.0000 Median :1.000 Median :7.000 Median :12.00
## Mean :0.4922 Mean :1.213 Mean :6.493 Mean :12.78
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:7.000 3rd Qu.:13.00
## Max. :1.0000 Max. :4.000 Max. :7.000 Max. :99.00
## marital_status hispanic_origin injury_at_work month_of_death
## Min. :1.000 Min. :100.0 Min. :1.000 Min. : 1.000
## 1st Qu.:2.000 1st Qu.:100.0 1st Qu.:2.000 1st Qu.: 3.000
## Median :2.000 Median :100.0 Median :2.000 Median : 6.000
## Mean :3.061 Mean :101.5 Mean :1.941 Mean : 6.407
## 3rd Qu.:5.000 3rd Qu.:100.0 3rd Qu.:2.000 3rd Qu.:10.000
## Max. :5.000 Max. :998.0 Max. :3.000 Max. :12.000
## Above_age
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4234
## 3rd Qu.:1.0000
## Max. :1.0000
summary(valid_df)
## sex resident_status manner_of_death education_1989_revision
## Min. :0.0000 Min. :1.000 Min. :1.000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:7.000 1st Qu.: 9.00
## Median :0.0000 Median :1.000 Median :7.000 Median :12.00
## Mean :0.4925 Mean :1.219 Mean :6.497 Mean :12.85
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:7.000 3rd Qu.:13.00
## Max. :1.0000 Max. :4.000 Max. :7.000 Max. :99.00
## marital_status hispanic_origin injury_at_work month_of_death
## Min. :1.000 Min. :100.0 Min. :1.000 Min. : 1.000
## 1st Qu.:2.000 1st Qu.:100.0 1st Qu.:2.000 1st Qu.: 3.000
## Median :2.000 Median :100.0 Median :2.000 Median : 6.000
## Mean :3.031 Mean :101.3 Mean :1.938 Mean : 6.387
## 3rd Qu.:5.000 3rd Qu.:100.0 3rd Qu.:2.000 3rd Qu.: 9.000
## Max. :5.000 Max. :998.0 Max. :3.000 Max. :12.000
## Above_age
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4174
## 3rd Qu.:1.0000
## Max. :1.0000
kn <- class::knn(train = train_df[, -9],
test = valid_df[,-9],
cl = train_df[,9],
k = 3, prob=TRUE)
#confusion/classification matrix
confusionMatrix(kn, factor(valid_df[,9]), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5562 1795
## 1 1368 3170
##
## Accuracy : 0.7341
## 95% CI : (0.7261, 0.742)
## No Information Rate : 0.5826
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4465
##
## Mcnemar's Test P-Value : 3.602e-14
##
## Sensitivity : 0.6385
## Specificity : 0.8026
## Pos Pred Value : 0.6985
## Neg Pred Value : 0.7560
## Prevalence : 0.4174
## Detection Rate : 0.2665
## Detection Prevalence : 0.3815
## Balanced Accuracy : 0.7205
##
## 'Positive' Class : 1
##
As a final model, the decision tree approach is considered. Since the
predict-function did not provide an output split into two columns, a
separate dataframe named combined_data had to be created,
in which the prediction value can be found in the 10th column.
# 3. Tree model
tr <- rpart(Above_age ~., data = death_df)
predictions <- predict(tr, valid_df)
# Combine original data and predictions
combined_data <- cbind(valid_df, predictions)
#confusion/classification matrix
confusionMatrix(factor(ifelse(combined_data[,10]>0.5, 1, 0)),
factor(valid_df$Above_age), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5848 1905
## 1 1082 3060
##
## Accuracy : 0.7489
## 95% CI : (0.741, 0.7567)
## No Information Rate : 0.5826
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4713
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6163
## Specificity : 0.8439
## Pos Pred Value : 0.7388
## Neg Pred Value : 0.7543
## Prevalence : 0.4174
## Detection Rate : 0.2573
## Detection Prevalence : 0.3482
## Balanced Accuracy : 0.7301
##
## 'Positive' Class : 1
##
Since the combination of the three models may lead to a better performance than just using each model on its own, this is what to be done in the following.
# combining the models
res<- data.frame(ActualClass = valid_df$Above_age,
LRProb = predict(reg2, valid_df, type = "response"),
LRPred = ifelse(predict(reg2, valid_df, type = "response")>0.5, 1, 0),
KNNProb = 1-attr(kn, "prob"),
KNNPred = kn,
TREEProb = combined_data[,10],
TREEPred = ifelse(combined_data[,10]>0.5, 1, 0))
options(digits = 1, scipen = 2)
head(res, 10)
## ActualClass LRProb LRPred KNNProb KNNPred TREEProb TREEPred
## 4 0 0.4 0 0.13 0 0.3 0
## 10 1 0.6 1 0.24 1 0.7 1
## 11 1 0.7 1 0.22 1 0.7 1
## 14 0 0.3 0 0.40 0 0.3 0
## 16 0 0.3 0 0.24 0 0.3 0
## 19 0 0.2 0 0.09 0 0.3 0
## 21 1 0.3 0 0.31 0 0.3 0
## 26 0 0.3 0 0.31 0 0.3 0
## 28 0 0.7 1 0.27 1 0.7 1
## 34 0 0.3 0 0.25 1 0.3 0
res$majority <- rowMeans(data.frame(res$LRPred, as.numeric(res$KNNPred),
res$TREEPred))>0.5
res$avg <- rowMeans(data.frame(res$LRProb, res$KNNProb, res$TREEProb))
head(res)
## ActualClass LRProb LRPred KNNProb KNNPred TREEProb TREEPred majority avg
## 4 0 0.4 0 0.13 0 0.3 0 FALSE 0.3
## 10 1 0.6 1 0.24 1 0.7 1 TRUE 0.5
## 11 1 0.7 1 0.22 1 0.7 1 TRUE 0.6
## 14 0 0.3 0 0.40 0 0.3 0 FALSE 0.3
## 16 0 0.3 0 0.24 0 0.3 0 FALSE 0.3
## 19 0 0.2 0 0.09 0 0.3 0 FALSE 0.2
options(digits = 7, scipen = 2)
# Evaluation of the model
confusionMatrix(factor(res$majority * 1), factor(valid_df[,9]), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5482 1664
## 1 1448 3301
##
## Accuracy : 0.7384
## 95% CI : (0.7304, 0.7463)
## No Information Rate : 0.5826
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4587
##
## Mcnemar's Test P-Value : 0.0001162
##
## Sensitivity : 0.6649
## Specificity : 0.7911
## Pos Pred Value : 0.6951
## Neg Pred Value : 0.7671
## Prevalence : 0.4174
## Detection Rate : 0.2775
## Detection Prevalence : 0.3992
## Balanced Accuracy : 0.7280
##
## 'Positive' Class : 1
##
confusionMatrix(factor((res$avg > 0.5)* 1), factor(valid_df[,9]), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5941 2090
## 1 989 2875
##
## Accuracy : 0.7412
## 95% CI : (0.7332, 0.749)
## No Information Rate : 0.5826
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4505
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5791
## Specificity : 0.8573
## Pos Pred Value : 0.7440
## Neg Pred Value : 0.7398
## Prevalence : 0.4174
## Detection Rate : 0.2417
## Detection Prevalence : 0.3248
## Balanced Accuracy : 0.7182
##
## 'Positive' Class : 1
##
As a final comment, the combination model did not lead to a better performance. The results stagnated at ~74%, which were approximately the individual score models, and sensitivity values lower than specificity values (mostly around 60% sensitivity, and 85% specificity).