Outline

This project aims to combine three different models: logistic regression, decision trees, and KNN. The aim is to find reasonable predictors, if a human-being outlives the average person in the US in 2005.

Importing the necessary packages and exploring the dataset

library(class)
library(rpart)
library(caret)
## Lade nötiges Paket: ggplot2
## Lade nötiges Paket: lattice
library(gains)

# import the dataset
death_df <- read.csv("2005_data.csv", nrows = 50000)

#check for missing numbers
colSums(is.na(death_df))
str(death_df)
View(death_df)

Pre-processing the data

Afterwards good possible predictors have to be extracted from the enormous number of columns in the dataset. Additionally, a column “Above_age” has to be created which holds the value 1, if the the detail_age column exceeds the average age 77.3, and 0 if it is lower than this value. Since the character names (M,F,D,M etc.) in various columns (sex, marital status etc.) led to problems, I followed the approach of converting these values into numbers.

# pre-processing
death_df <- na.omit(death_df[ ,c("sex","resident_status","manner_of_death","education_1989_revision","marital_status","hispanic_origin","injury_at_work","month_of_death","detail_age")])

death_df$Above_age <- ifelse(death_df$detail_age >77.3,1,0)
death_df <- death_df[,-c(9)]

death_df$sex <- as.factor(death_df$sex)
death_df$marital_status <- as.factor(death_df$marital_status)
death_df$injury_at_work <- as.factor(death_df$injury_at_work)

death_df$sex <- ifelse(death_df$sex == "M", 0, 1)
replacements <- c("M" = 1, "S" = 2, "U" = 3, "W" = 4, "D" = 5)
death_df$marital_status <- ifelse(death_df$marital_status %in% names(replacements), replacements[death_df$marital_status], death_df$marital_status)
replacements <- c("U" = 1, "N" = 2, "Y" = 3)
death_df$injury_at_work <- ifelse(death_df$injury_at_work %in% names(replacements), replacements[death_df$injury_at_work], death_df$injury_at_work)

Data Partitioning

In order to avoid over-fitting and improving the model, it is recommended to split the data into a training and validation dataset (here 70/30 split).

#partition the data into training and validation sets
set.seed(1)  
train_index <- sample(c(1:dim(death_df)[1]), dim(death_df)[1]*0.7)  
valid_index <- setdiff(c(1:dim(death_df)[1]), train_index)  
train_df <- death_df[train_index, ]
valid_df <- death_df[valid_index, ]

Logistic regression model

As a first model, a logistic regression is used. First, all predictors are considered. After further review, the significant predictors will be extracted, leading to an improved model performance.

If the probability exceeds 0.5 of out-living the average American in 2005, the value in the confusion matrix will be set to 1, otherwise 0.

# logistic regression model
reg <- glm(Above_age ~., data = train_df, family = "binomial")
#confusion/classification matrix
confusionMatrix(factor(ifelse(predict(reg, valid_df, type = "response")>0.5, 1, 0)), 
                factor(valid_df$Above_age), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5878 1940
##          1 1052 3025
##                                           
##                Accuracy : 0.7485          
##                  95% CI : (0.7406, 0.7562)
##     No Information Rate : 0.5826          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4694          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6093          
##             Specificity : 0.8482          
##          Pos Pred Value : 0.7420          
##          Neg Pred Value : 0.7519          
##              Prevalence : 0.4174          
##          Detection Rate : 0.2543          
##    Detection Prevalence : 0.3427          
##       Balanced Accuracy : 0.7287          
##                                           
##        'Positive' Class : 1               
## 
summary(reg)
## 
## Call:
## glm(formula = Above_age ~ ., family = "binomial", data = train_df)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -4.7092527  0.2141932 -21.986   <2e-16 ***
## sex                      0.0897316  0.0300173   2.989   0.0028 ** 
## resident_status         -0.3189256  0.0313588 -10.170   <2e-16 ***
## manner_of_death          0.3751031  0.0208650  17.978   <2e-16 ***
## education_1989_revision -0.0010849  0.0010673  -1.016   0.3094    
## marital_status           0.6212379  0.0099470  62.455   <2e-16 ***
## hispanic_origin         -0.0016918  0.0007459  -2.268   0.0233 *  
## injury_at_work           0.2533921  0.1290304   1.964   0.0496 *  
## month_of_death          -0.0038359  0.0039582  -0.969   0.3325    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37821  on 27752  degrees of freedom
## Residual deviance: 30495  on 27744  degrees of freedom
## AIC: 30513
## 
## Number of Fisher Scoring iterations: 5
# create the second logistic regression model without the insignificant parameters
reg2 <- glm(Above_age ~manner_of_death+sex+resident_status+marital_status+injury_at_work, data = train_df, family = "binomial")
#confusion/classification matrix
confusionMatrix(factor(ifelse(predict(reg2, valid_df, type = "response")>0.5, 1, 0)), 
                factor(valid_df$Above_age), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5878 1939
##          1 1052 3026
##                                           
##                Accuracy : 0.7485          
##                  95% CI : (0.7407, 0.7563)
##     No Information Rate : 0.5826          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4696          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6095          
##             Specificity : 0.8482          
##          Pos Pred Value : 0.7420          
##          Neg Pred Value : 0.7520          
##              Prevalence : 0.4174          
##          Detection Rate : 0.2544          
##    Detection Prevalence : 0.3428          
##       Balanced Accuracy : 0.7288          
##                                           
##        'Positive' Class : 1               
## 
summary(reg2)
## 
## Call:
## glm(formula = Above_age ~ manner_of_death + sex + resident_status + 
##     marital_status + injury_at_work, family = "binomial", data = train_df)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -4.931469   0.197349 -24.989  < 2e-16 ***
## manner_of_death  0.376176   0.020848  18.043  < 2e-16 ***
## sex              0.091438   0.030005   3.047  0.00231 ** 
## resident_status -0.318271   0.031333 -10.158  < 2e-16 ***
## marital_status   0.621060   0.009944  62.457  < 2e-16 ***
## injury_at_work   0.255564   0.128905   1.983  0.04741 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37821  on 27752  degrees of freedom
## Residual deviance: 30503  on 27747  degrees of freedom
## AIC: 30515
## 
## Number of Fisher Scoring iterations: 5

KNN Model

Since a K-nearest neighbor model also fits this use case, this approach will be considered as well. The neighbor number is set to k = 3 after trying different odd numbers (which did not have a significant boost in performance). Again, as mentioned above, if the probability exceeds 0.5 of out-living the average American in 2005, the value in the confusion matrix will be set to 1, otherwise 0.

# knn model
summary(train_df)
##       sex         resident_status manner_of_death education_1989_revision
##  Min.   :0.0000   Min.   :1.000   Min.   :1.000   Min.   : 0.00          
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:7.000   1st Qu.: 9.00          
##  Median :0.0000   Median :1.000   Median :7.000   Median :12.00          
##  Mean   :0.4922   Mean   :1.213   Mean   :6.493   Mean   :12.78          
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:7.000   3rd Qu.:13.00          
##  Max.   :1.0000   Max.   :4.000   Max.   :7.000   Max.   :99.00          
##  marital_status  hispanic_origin injury_at_work  month_of_death  
##  Min.   :1.000   Min.   :100.0   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:2.000   1st Qu.:100.0   1st Qu.:2.000   1st Qu.: 3.000  
##  Median :2.000   Median :100.0   Median :2.000   Median : 6.000  
##  Mean   :3.061   Mean   :101.5   Mean   :1.941   Mean   : 6.407  
##  3rd Qu.:5.000   3rd Qu.:100.0   3rd Qu.:2.000   3rd Qu.:10.000  
##  Max.   :5.000   Max.   :998.0   Max.   :3.000   Max.   :12.000  
##    Above_age     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4234  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
summary(valid_df)
##       sex         resident_status manner_of_death education_1989_revision
##  Min.   :0.0000   Min.   :1.000   Min.   :1.000   Min.   : 0.00          
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:7.000   1st Qu.: 9.00          
##  Median :0.0000   Median :1.000   Median :7.000   Median :12.00          
##  Mean   :0.4925   Mean   :1.219   Mean   :6.497   Mean   :12.85          
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:7.000   3rd Qu.:13.00          
##  Max.   :1.0000   Max.   :4.000   Max.   :7.000   Max.   :99.00          
##  marital_status  hispanic_origin injury_at_work  month_of_death  
##  Min.   :1.000   Min.   :100.0   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:2.000   1st Qu.:100.0   1st Qu.:2.000   1st Qu.: 3.000  
##  Median :2.000   Median :100.0   Median :2.000   Median : 6.000  
##  Mean   :3.031   Mean   :101.3   Mean   :1.938   Mean   : 6.387  
##  3rd Qu.:5.000   3rd Qu.:100.0   3rd Qu.:2.000   3rd Qu.: 9.000  
##  Max.   :5.000   Max.   :998.0   Max.   :3.000   Max.   :12.000  
##    Above_age     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4174  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
kn <- class::knn(train = train_df[, -9], 
                 test = valid_df[,-9], 
                 cl = train_df[,9], 
                 k = 3, prob=TRUE)

#confusion/classification matrix

confusionMatrix(kn, factor(valid_df[,9]), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5562 1795
##          1 1368 3170
##                                          
##                Accuracy : 0.7341         
##                  95% CI : (0.7261, 0.742)
##     No Information Rate : 0.5826         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4465         
##                                          
##  Mcnemar's Test P-Value : 3.602e-14      
##                                          
##             Sensitivity : 0.6385         
##             Specificity : 0.8026         
##          Pos Pred Value : 0.6985         
##          Neg Pred Value : 0.7560         
##              Prevalence : 0.4174         
##          Detection Rate : 0.2665         
##    Detection Prevalence : 0.3815         
##       Balanced Accuracy : 0.7205         
##                                          
##        'Positive' Class : 1              
## 

Decision tree model

As a final model, the decision tree approach is considered. Since the predict-function did not provide an output split into two columns, a separate dataframe named combined_data had to be created, in which the prediction value can be found in the 10th column.

# 3. Tree model

tr <- rpart(Above_age ~., data = death_df)

predictions <- predict(tr, valid_df)

# Combine original data and predictions
combined_data <- cbind(valid_df, predictions)



#confusion/classification matrix
confusionMatrix(factor(ifelse(combined_data[,10]>0.5, 1, 0)), 
                factor(valid_df$Above_age), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5848 1905
##          1 1082 3060
##                                          
##                Accuracy : 0.7489         
##                  95% CI : (0.741, 0.7567)
##     No Information Rate : 0.5826         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4713         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.6163         
##             Specificity : 0.8439         
##          Pos Pred Value : 0.7388         
##          Neg Pred Value : 0.7543         
##              Prevalence : 0.4174         
##          Detection Rate : 0.2573         
##    Detection Prevalence : 0.3482         
##       Balanced Accuracy : 0.7301         
##                                          
##        'Positive' Class : 1              
## 

Combination of models

Since the combination of the three models may lead to a better performance than just using each model on its own, this is what to be done in the following.

# combining the models
res<- data.frame(ActualClass = valid_df$Above_age, 
                 LRProb = predict(reg2, valid_df, type = "response"), 
                 LRPred = ifelse(predict(reg2, valid_df, type = "response")>0.5, 1, 0), 
                 KNNProb = 1-attr(kn, "prob"), 
                 KNNPred = kn, 
                 TREEProb = combined_data[,10], 
                 TREEPred = ifelse(combined_data[,10]>0.5, 1, 0))

options(digits = 1, scipen = 2)
head(res, 10)
##    ActualClass LRProb LRPred KNNProb KNNPred TREEProb TREEPred
## 4            0    0.4      0    0.13       0      0.3        0
## 10           1    0.6      1    0.24       1      0.7        1
## 11           1    0.7      1    0.22       1      0.7        1
## 14           0    0.3      0    0.40       0      0.3        0
## 16           0    0.3      0    0.24       0      0.3        0
## 19           0    0.2      0    0.09       0      0.3        0
## 21           1    0.3      0    0.31       0      0.3        0
## 26           0    0.3      0    0.31       0      0.3        0
## 28           0    0.7      1    0.27       1      0.7        1
## 34           0    0.3      0    0.25       1      0.3        0
res$majority <- rowMeans(data.frame(res$LRPred, as.numeric(res$KNNPred), 
                                    res$TREEPred))>0.5
res$avg <- rowMeans(data.frame(res$LRProb, res$KNNProb, res$TREEProb))

head(res)
##    ActualClass LRProb LRPred KNNProb KNNPred TREEProb TREEPred majority avg
## 4            0    0.4      0    0.13       0      0.3        0    FALSE 0.3
## 10           1    0.6      1    0.24       1      0.7        1     TRUE 0.5
## 11           1    0.7      1    0.22       1      0.7        1     TRUE 0.6
## 14           0    0.3      0    0.40       0      0.3        0    FALSE 0.3
## 16           0    0.3      0    0.24       0      0.3        0    FALSE 0.3
## 19           0    0.2      0    0.09       0      0.3        0    FALSE 0.2
options(digits = 7, scipen = 2)

Evaluation of the aggregated model

# Evaluation of the model
confusionMatrix(factor(res$majority * 1), factor(valid_df[,9]), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5482 1664
##          1 1448 3301
##                                           
##                Accuracy : 0.7384          
##                  95% CI : (0.7304, 0.7463)
##     No Information Rate : 0.5826          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4587          
##                                           
##  Mcnemar's Test P-Value : 0.0001162       
##                                           
##             Sensitivity : 0.6649          
##             Specificity : 0.7911          
##          Pos Pred Value : 0.6951          
##          Neg Pred Value : 0.7671          
##              Prevalence : 0.4174          
##          Detection Rate : 0.2775          
##    Detection Prevalence : 0.3992          
##       Balanced Accuracy : 0.7280          
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(factor((res$avg > 0.5)* 1), factor(valid_df[,9]), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5941 2090
##          1  989 2875
##                                          
##                Accuracy : 0.7412         
##                  95% CI : (0.7332, 0.749)
##     No Information Rate : 0.5826         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4505         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.5791         
##             Specificity : 0.8573         
##          Pos Pred Value : 0.7440         
##          Neg Pred Value : 0.7398         
##              Prevalence : 0.4174         
##          Detection Rate : 0.2417         
##    Detection Prevalence : 0.3248         
##       Balanced Accuracy : 0.7182         
##                                          
##        'Positive' Class : 1              
## 

As a final comment, the combination model did not lead to a better performance. The results stagnated at ~74%, which were approximately the individual score models, and sensitivity values lower than specificity values (mostly around 60% sensitivity, and 85% specificity).