Binary Classification Model of Suicide Case Study

BIST/STAT 5605. Applied Statistics II, Spring 2016

Course Project 2:

Binary Model of Suicide Case Study

Section 1: Introduction

According to the dataset provided in Kaggle: (https://www.kaggle.com/cdc/mortality). An analysis of death cases was came up. In this project, we focus on the suicide cases whose age was between 30 to 40. Since these people are the main strength of their family, and middle age crisis might cause them stressed or depressed. Analysis of suicide case could help us figure out what kind of people are more likely to commit a suicide, then we could provide help to them. It also helps police when they found a body and want to determine if the person was suicide.

Through a primary data clean process, we got 6 predictors and 1 response - suicide or not. We build a binary model based on predictors and response and did variable selection, verified assumptions, model selection, test goodness of fit, and made predictions.

Specifically, we did the Wald Test, Hosmer-Lemeshow Test, Analyse the Deviance residuals, outliers, Cook’s distance. Through Lilelihood Ratio test, we chose the best model and make prediction base on it.

Key words: GLM, prediction, death, suicide

Section 2: Description of Data

In the original dataset, listed data for each dead person are introduced.

o Id (integer primary key) - Main identifier, used for joining with DeathRecordId in EntityAxisConditions and RecordAxisConditions tables.

o ResidentStatus (integer) - (e.g. 1 = Residents, 2 = Intrastate resident, etc)

o Education1989Revision (integer) - Years of education using the 1989 revision format (e.g. 8 = 8 years of elementary education)

o Education2003Revision (integer) - Years of education using the 2003 revision code (e.g. 8 = Doctorate or professional degree)

o EducationReportingFlag (integer) - (0 = 1989 revision was used on death certificate, 1 = 2003 revision was used)

o MonthOfDeath (integer) - Month of death (e.g. 1 = January, 12 = December)

o Sex (text) - (M = Male, F = Female)

o AgeType (integer) - Units for the Age column (e.g. 1 = Years, 2 = Months)

o Age (integer) - Age at death (in AgeType units)

o AgeSubstitutionFlag (integer) - (1 = Calculated age is substituted for reported age)

o AgeRecode52 (integer) - Age recoded into 52 bins (e.g. 1 = Under 1 hour)

o AgeRecode27 (integer) - Age recoded into 27 bins (e.g. 1 = Under 1 month)

o AgeRecode12 (integer) - Age recoded into 12 bins (e.g. 1 = Under 1 year)

o InfantAgeRecode22 (integer) - In the event of an infant, Age recoded into 22 bins (e.g. 1 = Under 1 hour)

o PlaceOfDeathAndDecedentsStatus (integer) - (e.g. 6 = Nursing home/long term care)

o MaritalStatus (text) - (e.g. M = married, D = divorced, W = widowed)

o DayOfWeekOfDeath (text) - (e.g. 1 = Sunday, 7 = Saturday)

o CurrentDataYear (text) - Year on death record. Always 2014 for this dataset.

o InjuryAtWork (text) - Was the person injured at work? (Y = yes, N = no, U = unknown)

o MannerOfDeath (integer) - (e.g. 1 = Accident, 2 = Suicides)

o MethodOfDisposition (text) - (e.g. B = burial, C = cremation)

o Autopsy (text) - Was an autopsy performed? (Y = Yes, N = No, U = Unknown)

o ActivityCode (integer) - (e.g. 0 = While engaged in sports activity, 1 = While engaged in leisure activity)

o PlaceOfInjury (integer) - (e.g. 0 = Home, 1 = Residential institution)

o Icd10Code (text) - ICD-10 code for the underlying cause of death (e.g. I251 = Atherosclerotic heart disease)

o CauseRecode358 (integer) - Cause of death recoded into 358 bins

o CauseRecode113 (integer) - Cause of death recoded into 113 bins

o InfantCauseRecode130 (integer) - Infant cause of death recoded into 130 bins

o CauseRecode39 (integer) - Cause of death recoded into 39 bins

o NumberOfEntityAxisConditions (integer) - Number of entries for this death record in the EntityAxisConditions table

o NumberOfRecordAxisConditions (integer) - Number of entries for this death record in the RecordAxisConditions table

o Race (integer) - Reported race (e.g. 1 = White, 2 = Black)

o BridgedRaceFlag (integer) - (e.g. 1 = Race is bridged)

o RaceImputationFlag (integer) - (e.g. 1 = Unknown race is imputed)

o RaceRecode3 (integer) - Race recoded into 3 bins (e.g. 2 = Races other than White or Black)

o RaceRecode5 (integer) - Race recoded into 5 bins (e.g. 4 = Asian or Pacific Islander)

o HispanicOrigin (integer) - (e.g. 220 = Central and South American)

HispanicOriginRaceRecode (integer) - HispanicOrigin / Race recoded (e.g. 1 = Mexican)

Some of them are repeated, and in our study, we use the Age, Sex, Education, Marital Status, Race, Resident Status as our predictors.

What I have to mention is that all of the variables are qualitative predictors except age. So we created dummy variables for them.

Section 3. Methods (or Models)

Model: Binary Regression Model
\[Y_i |p_i~\sim Bern(p_i)\],
\[logit(p_i )=η=x_i' β\]
Format:
$Y$: n1 matrix (vector)
$x_i'$: 1p matrix
$β$: p*1 matrix
$p_i$: a number between 0 to 1
$p=k+1$ where $k$ is the number of variables

Meaning:
$Y$: a response variable (0 or 1)
$x_i'$’: predictors
$β$: partial regression coefficient
$p_i$: probability that a person was suicide
Assumptions:
1. dependent variable to be binary: Y=0 or 1;
2. observations are independent to each other

Section 4. Analysis of Data.

4.1 Model validation

4.1.1 Test good of fit

a. Wald test

Test hypothesis: \[H_0:β_j=0, H_1:β_j≠0\].
Test statistics: \[z^*=b_j/(se(b_j)) \sim N(0,1)\]
Decision rule:
If $|z^*|≤z_{(1-α/2)}=z_{0.975}$, conclude $H_0$.
If $|z^*|>z_{(1-α/2)}=z_{0.975}$, conclude $H_1$.
The full model is:

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

mod2 <- glm(Suicide~ Age + Sex2 + 
              Education2+Education3+Education4+Education5+
              Education6+Education7+Education8+Education9+
              Marital2+Marital3+Marital4+Marital5+
              Race2+Race3+Race4+Race5+Race6+Race7+Race8+
              Race10+Race11+Race12+Race13+Race14+Race15
            +Resident2+Resident3+Resident4,
            
            
            family=binomial, data = Train_data)

summary(mod2)

## 
## Call:
## glm(formula = Suicide ~ Age + Sex2 + Education2 + Education3 + 
##     Education4 + Education5 + Education6 + Education7 + Education8 + 
##     Education9 + Marital2 + Marital3 + Marital4 + Marital5 + 
##     Race2 + Race3 + Race4 + Race5 + Race6 + Race7 + Race8 + Race10 + 
##     Race11 + Race12 + Race13 + Race14 + Race15 + Resident2 + 
##     Resident3 + Resident4, family = binomial, data = Train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0061  -0.5628  -0.4337  -0.3059   2.9000  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.682e-02  3.512e-01   0.133 0.893954    
## Age         -4.914e-02  6.224e-03  -7.895 2.90e-15 ***
## Sex2        -8.215e-01  3.783e-02 -21.714  < 2e-16 ***
## Education2  -1.268e-02  6.550e-02  -0.194 0.846541    
## Education3   1.693e-01  5.161e-02   3.281 0.001033 ** 
## Education4   3.260e-01  5.951e-02   5.478 4.30e-08 ***
## Education5   3.974e-01  7.319e-02   5.429 5.67e-08 ***
## Education6   6.386e-01  6.691e-02   9.544  < 2e-16 ***
## Education7   6.480e-01  1.036e-01   6.255 3.98e-10 ***
## Education8   6.155e-01  1.733e-01   3.553 0.000381 ***
## Education9  -1.147e-01  1.545e-01  -0.742 0.457855    
## Marital2     2.906e-01  3.669e-02   7.920 2.38e-15 ***
## Marital3     7.494e-01  1.466e-01   5.113 3.17e-07 ***
## Marital4     4.812e-01  4.441e-02  10.834  < 2e-16 ***
## Marital5     5.637e-01  1.708e-01   3.301 0.000963 ***
## Race2       -2.328e-01  2.727e-01  -0.854 0.393240    
## Race3       -1.362e+00  2.775e-01  -4.908 9.19e-07 ***
## Race4       -5.045e-01  2.984e-01  -1.691 0.090872 .  
## Race5       -7.768e-02  3.582e-01  -0.217 0.828293    
## Race6       -3.412e-01  5.578e-01  -0.612 0.540714    
## Race7       -1.066e+01  9.237e+01  -0.115 0.908131    
## Race8       -6.431e-01  3.748e-01  -1.716 0.086176 .  
## Race10      -6.428e-01  3.637e-01  -1.768 0.077118 .  
## Race11       2.773e-01  3.958e-01   0.701 0.483503    
## Race12      -5.992e-01  6.701e-01  -0.894 0.371209    
## Race13      -2.358e-04  3.971e-01  -0.001 0.999526    
## Race14      -1.094e+00  1.078e+00  -1.015 0.309876    
## Race15      -6.769e-01  3.401e-01  -1.990 0.046547 *  
## Resident2   -6.241e-01  4.712e-02 -13.243  < 2e-16 ***
## Resident3   -4.569e-01  6.920e-02  -6.603 4.02e-11 ***
## Resident4   -9.871e-01  2.921e-01  -3.380 0.000726 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 29169  on 40708  degrees of freedom
## Residual deviance: 27514  on 40678  degrees of freedom
## AIC: 27576
## 
## Number of Fisher Scoring iterations: 11

From the result, we should drop predictors which doesn’t have “*" behind them, since their absolute z values is smaller than z_{0.975}=1.959964.

Then the reduced model is:

myvars <- names(Train_data) %in% c("Suicide", "Age","Sex2", 
                                   "Education3", 
                                   "Education4", "Education5",
                                   "Education6", "Education7",
                                   "Education8", "Marital2",
                                   "Marital3", "Marital4", "Marital5",
                                   "Race3", 
                                   "Race15", 
                                   "Resident2", "Resident3", "Resident4") 
Train_data_2 <- Train_data[myvars]
Val_data_2 <- Val_data[myvars]


mod_reduced <- glm(Suicide~.,
                   
                   
                   family=binomial, data = Train_data_2)

summary(mod_reduced)

## 
## Call:
## glm(formula = Suicide ~ ., family = binomial, data = Train_data_2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9448  -0.5645  -0.4339  -0.3062   2.9046  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.200424   0.217934  -0.920 0.357753    
## Age         -0.049241   0.006221  -7.916 2.46e-15 ***
## Sex2        -0.822448   0.037810 -21.752  < 2e-16 ***
## Education3   0.180478   0.041083   4.393 1.12e-05 ***
## Education4   0.337212   0.050642   6.659 2.76e-11 ***
## Education5   0.408539   0.066218   6.170 6.85e-10 ***
## Education6   0.650542   0.058983  11.029  < 2e-16 ***
## Education7   0.656794   0.098553   6.664 2.66e-11 ***
## Education8   0.623564   0.169468   3.680 0.000234 ***
## Marital2     0.290608   0.036649   7.929 2.20e-15 ***
## Marital3     0.747218   0.146504   5.100 3.39e-07 ***
## Marital4     0.483232   0.044365  10.892  < 2e-16 ***
## Marital5     0.522884   0.161383   3.240 0.001195 ** 
## Race3       -1.122026   0.056847 -19.738  < 2e-16 ***
## Race15      -0.437056   0.204643  -2.136 0.032704 *  
## Resident2   -0.625696   0.047092 -13.287  < 2e-16 ***
## Resident3   -0.455110   0.069080  -6.588 4.45e-11 ***
## Resident4   -0.998165   0.290724  -3.433 0.000596 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 29169  on 40708  degrees of freedom
## Residual deviance: 27535  on 40691  degrees of freedom
## AIC: 27571
## 
## Number of Fisher Scoring iterations: 5

From the result, all these 17 predictors are significant.

b. Hosmer-Lemeshow Test

# 3. Hosmer and Lemeshow goodness of fit (GOF) test
library(pscl)

## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library("ResourceSelection")

## ResourceSelection 0.3-2   2017-02-28

h1 <- hoslem.test(Train_data$Suicide, fitted(mod2))
h1

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  Train_data$Suicide, fitted(mod2)
## X-squared = 11.419, df = 8, p-value = 0.179

h2 <- hoslem.test(Train_data_2$Suicide, fitted(mod_reduced))
h2

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  Train_data_2$Suicide, fitted(mod_reduced)
## X-squared = 8.2704, df = 8, p-value = 0.4075

# p-value = 0.4075 > 0.05, indicate a reasonable fit to the data

We tested the full model and got the first result “h1”. Then tested the reduced model and got the second result “h2”. \[p-value_1 = 0.179 < 0.4075 = p-value_2\] These two $p-value$ are larger than 0.05, so we conclude $H_0$, this two model are satisfactory fit. But the reduced model is much better.

c. Likelihood Ratio Test

lrtest(mod2, mod_reduced)

## Likelihood ratio test
## 
## Model 1: Suicide ~ Age + Sex2 + Education2 + Education3 + Education4 + 
##     Education5 + Education6 + Education7 + Education8 + Education9 + 
##     Marital2 + Marital3 + Marital4 + Marital5 + Race2 + Race3 + 
##     Race4 + Race5 + Race6 + Race7 + Race8 + Race10 + Race11 + 
##     Race12 + Race13 + Race14 + Race15 + Resident2 + Resident3 + 
##     Resident4
## Model 2: Suicide ~ Age + Sex2 + Education3 + Education4 + Education5 + 
##     Education6 + Education7 + Education8 + Marital2 + Marital3 + 
##     Marital4 + Marital5 + Race3 + Race15 + Resident2 + Resident3 + 
##     Resident4
##   #Df LogLik  Df  Chisq Pr(>Chisq)  
## 1  31 -13757                        
## 2  18 -13767 -13 20.423    0.08516 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

qchisq(0.95,13)

## [1] 22.36203

Decision rule: If \[G^2≤ χ^2_{(1-α,q)}=χ^2_{0.95,3}\], conclude $H_0$. If \[G^2 > χ^2_{(1-α,q)}=χ^2_{0.95,3}\], conclude $H_1$. Since $G^2=20.423<χ^2_{0.95,13}=22.36203$, we conclude $H_0$, so this 13 predictors can be dropped from the regression model. The reduced model is much better.

4.1.2 Logic regression diagnostics

We fitted the training dataset to the reduced model, and did the following analysis.
##### a. Deviance residuals

plot(mod_reduced$linear.predictors, residuals(mod_reduced, type="deviance"),
     ylab='deviance residuals',
     xlab='linear predictor',
     main='Scattor plot of deviance residuals vs. linear predictors')

In binary regression, the plots of deviance residuals are strongly patterned, parti-cularly the plot against the linear predictor, where the residuals can take on only two values, depending on whether the response is equal to 0 or 1.

b. outliers

hii <- influence(mod_reduced)$hat # diagonal elements of hat matrix
plot(hii,cex = .6,main = "Index plot of the diagonal elements of the estimated hat matrix")
abline(h=2*18/40709)

If $h_ii>2p/n=(2(17+1))/40709=0.00088433$, the i^th case is considered to be an outlier. From the following index plot we can know that points higher than the straight line are outliers. Under this model, since the data size, there seem like a little bit more outliers. But as we can see the overlapped points under the straight have accumulated to a black rectangle, there are majority of points actually. So there are not too many outliers.

c. Cook’s distance

CooksD<-  cooks.distance(mod_reduced)
plot(CooksD,type="o",cex = .5,main = "Index plot of Cook's distance")

From the above three plots, we can see that there are 13 cases most influential.

4.2 Prediction on new observation

a. a. Receiver Operating Characteristic (ROC) Curve

library("ROCR")
pred <- prediction(fitted(mod_reduced),Train_data_2$Suicide)
perf <- performance(pred, "tpr", "fpr")
plot(perf); abline(0,1)

performance(pred,"auc") # Slot "y.values":[[1]][1] 0.7296682

## An object of class "performance"
## Slot "x.name":
## [1] "None"
## 
## Slot "y.name":
## [1] "Area under the ROC curve"
## 
## Slot "alpha.name":
## [1] "none"
## 
## Slot "x.values":
## list()
## 
## Slot "y.values":
## [[1]]
## [1] 0.6746106
## 
## 
## Slot "alpha.values":
## list()

The area under the ROC curve is useful to measure the model’s predictive power. A value of 0.5 means that the predictions were no better than random guessing. For the reduce model, the ROC area is 0.6746106, it seems good.

b. The best cutoff

orecasting a binary outcome for given levels $X_h$ of the X variables is simple in the sense that the outcome 1 will be predicted if the estimated value $π ̂_h$ is large, and the outcome 0 will be predicted if $π ̂_h $ is small. The difficulty in making predictions of a binary outcome is in determining the cutoff point, below which the outcome 0 is predicted and above which the outcome I is predicted.
Under the reduced model, the best cutoff is 0.009094114.

perf@alpha.values[[1]][which.max(perf@x.values[[1]]+perf@y.values[[1]])]

## [1] 0.009094114

The highest sensitivity plus specificity is achieved in this case when you predict the positive outcome when the predicted probability exceeds 0.009094114 and predict the negative outcome when the predicted probability does not exceed 0.009094.

Section 5. Concluding Remarks

In this analysis, we build a binary model to predict if a dead person was suicide. It can help police to handle cases and could help psychologist to analysis what kind of people are more likely to commit a suicide.

From the ROC Curve, AUC = 0.67, so the model is reasonable, so we could use this model to make predictions.

And to enhance the prediction ability of our model, we could gather more information of observations such as occupation, family background, hobbies and so on. Then the results could be more accurate.