According to the dataset provided in Kaggle: (https://www.kaggle.com/cdc/mortality). An analysis of death cases was came up. In this project, we focus on the suicide cases whose age was between 30 to 40. Since these people are the main strength of their family, and middle age crisis might cause them stressed or depressed. Analysis of suicide case could help us figure out what kind of people are more likely to commit a suicide, then we could provide help to them. It also helps police when they found a body and want to determine if the person was suicide.
Through a primary data clean process, we got 6 predictors and 1 response - suicide or not. We build a binary model based on predictors and response and did variable selection, verified assumptions, model selection, test goodness of fit, and made predictions.
Specifically, we did the Wald Test, Hosmer-Lemeshow Test, Analyse the Deviance residuals, outliers, Cook’s distance. Through Lilelihood Ratio test, we chose the best model and make prediction base on it.
Key words: GLM, prediction, death, suicide
In the original dataset, listed data for each dead person are introduced.
o Id (integer primary key) - Main identifier, used for joining with DeathRecordId in EntityAxisConditions and RecordAxisConditions tables.
o ResidentStatus (integer) - (e.g. 1 = Residents, 2 = Intrastate resident, etc)
o Education1989Revision (integer) - Years of education using the 1989 revision format (e.g. 8 = 8 years of elementary education)
o Education2003Revision (integer) - Years of education using the 2003 revision code (e.g. 8 = Doctorate or professional degree)
o EducationReportingFlag (integer) - (0 = 1989 revision was used on death certificate, 1 = 2003 revision was used)
o MonthOfDeath (integer) - Month of death (e.g. 1 = January, 12 = December)
o Sex (text) - (M = Male, F = Female)
o AgeType (integer) - Units for the Age column (e.g. 1 = Years, 2 = Months)
o Age (integer) - Age at death (in AgeType units)
o AgeSubstitutionFlag (integer) - (1 = Calculated age is substituted for reported age)
o AgeRecode52 (integer) - Age recoded into 52 bins (e.g. 1 = Under 1 hour)
o AgeRecode27 (integer) - Age recoded into 27 bins (e.g. 1 = Under 1 month)
o AgeRecode12 (integer) - Age recoded into 12 bins (e.g. 1 = Under 1 year)
o InfantAgeRecode22 (integer) - In the event of an infant, Age recoded into 22 bins (e.g. 1 = Under 1 hour)
o PlaceOfDeathAndDecedentsStatus (integer) - (e.g. 6 = Nursing home/long term care)
o MaritalStatus (text) - (e.g. M = married, D = divorced, W = widowed)
o DayOfWeekOfDeath (text) - (e.g. 1 = Sunday, 7 = Saturday)
o CurrentDataYear (text) - Year on death record. Always 2014 for this dataset.
o InjuryAtWork (text) - Was the person injured at work? (Y = yes, N = no, U = unknown)
o MannerOfDeath (integer) - (e.g. 1 = Accident, 2 = Suicides)
o MethodOfDisposition (text) - (e.g. B = burial, C = cremation)
o Autopsy (text) - Was an autopsy performed? (Y = Yes, N = No, U = Unknown)
o ActivityCode (integer) - (e.g. 0 = While engaged in sports activity, 1 = While engaged in leisure activity)
o PlaceOfInjury (integer) - (e.g. 0 = Home, 1 = Residential institution)
o Icd10Code (text) - ICD-10 code for the underlying cause of death (e.g. I251 = Atherosclerotic heart disease)
o CauseRecode358 (integer) - Cause of death recoded into 358 bins
o CauseRecode113 (integer) - Cause of death recoded into 113 bins
o InfantCauseRecode130 (integer) - Infant cause of death recoded into 130 bins
o CauseRecode39 (integer) - Cause of death recoded into 39 bins
o NumberOfEntityAxisConditions (integer) - Number of entries for this death record in the EntityAxisConditions table
o NumberOfRecordAxisConditions (integer) - Number of entries for this death record in the RecordAxisConditions table
o Race (integer) - Reported race (e.g. 1 = White, 2 = Black)
o BridgedRaceFlag (integer) - (e.g. 1 = Race is bridged)
o RaceImputationFlag (integer) - (e.g. 1 = Unknown race is imputed)
o RaceRecode3 (integer) - Race recoded into 3 bins (e.g. 2 = Races other than White or Black)
o RaceRecode5 (integer) - Race recoded into 5 bins (e.g. 4 = Asian or Pacific Islander)
o HispanicOrigin (integer) - (e.g. 220 = Central and South American)
HispanicOriginRaceRecode (integer) - HispanicOrigin / Race recoded (e.g. 1 = Mexican)
Some of them are repeated, and in our study, we use the Age, Sex, Education, Marital Status, Race, Resident Status as our predictors.
What I have to mention is that all of the variables are qualitative predictors except age. So we created dummy variables for them.
Model: Binary Regression Model
\[Y_i |p_i~\sim Bern(p_i)\],
\[logit(p_i )=η=x_i' β\]
Format:
\(Y\): n1 matrix (vector)
\(x_i'\): 1p matrix
\(β\): p*1 matrix
\(p_i\): a number between 0 to 1
\(p=k+1\) where \(k\) is the number of variables
Meaning:
\(Y\): a response variable (0 or 1)
\(x_i'\)’: predictors
\(β\): partial regression coefficient
\(p_i\): probability that a person was suicide
Assumptions:
1. dependent variable to be binary: Y=0 or 1;
2. observations are independent to each other
Test hypothesis: \[H_0:β_j=0, H_1:β_j≠0\].
Test statistics: \[z^*=b_j/(se(b_j)) \sim N(0,1)\]
Decision rule:
If \(|z^*|≤z_{(1-α/2)}=z_{0.975}\), conclude \(H_0\).
If \(|z^*|>z_{(1-α/2)}=z_{0.975}\), conclude \(H_1\).
The full model is:
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
mod2 <- glm(Suicide~ Age + Sex2 +
Education2+Education3+Education4+Education5+
Education6+Education7+Education8+Education9+
Marital2+Marital3+Marital4+Marital5+
Race2+Race3+Race4+Race5+Race6+Race7+Race8+
Race10+Race11+Race12+Race13+Race14+Race15
+Resident2+Resident3+Resident4,
family=binomial, data = Train_data)
summary(mod2)
##
## Call:
## glm(formula = Suicide ~ Age + Sex2 + Education2 + Education3 +
## Education4 + Education5 + Education6 + Education7 + Education8 +
## Education9 + Marital2 + Marital3 + Marital4 + Marital5 +
## Race2 + Race3 + Race4 + Race5 + Race6 + Race7 + Race8 + Race10 +
## Race11 + Race12 + Race13 + Race14 + Race15 + Resident2 +
## Resident3 + Resident4, family = binomial, data = Train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0061 -0.5628 -0.4337 -0.3059 2.9000
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.682e-02 3.512e-01 0.133 0.893954
## Age -4.914e-02 6.224e-03 -7.895 2.90e-15 ***
## Sex2 -8.215e-01 3.783e-02 -21.714 < 2e-16 ***
## Education2 -1.268e-02 6.550e-02 -0.194 0.846541
## Education3 1.693e-01 5.161e-02 3.281 0.001033 **
## Education4 3.260e-01 5.951e-02 5.478 4.30e-08 ***
## Education5 3.974e-01 7.319e-02 5.429 5.67e-08 ***
## Education6 6.386e-01 6.691e-02 9.544 < 2e-16 ***
## Education7 6.480e-01 1.036e-01 6.255 3.98e-10 ***
## Education8 6.155e-01 1.733e-01 3.553 0.000381 ***
## Education9 -1.147e-01 1.545e-01 -0.742 0.457855
## Marital2 2.906e-01 3.669e-02 7.920 2.38e-15 ***
## Marital3 7.494e-01 1.466e-01 5.113 3.17e-07 ***
## Marital4 4.812e-01 4.441e-02 10.834 < 2e-16 ***
## Marital5 5.637e-01 1.708e-01 3.301 0.000963 ***
## Race2 -2.328e-01 2.727e-01 -0.854 0.393240
## Race3 -1.362e+00 2.775e-01 -4.908 9.19e-07 ***
## Race4 -5.045e-01 2.984e-01 -1.691 0.090872 .
## Race5 -7.768e-02 3.582e-01 -0.217 0.828293
## Race6 -3.412e-01 5.578e-01 -0.612 0.540714
## Race7 -1.066e+01 9.237e+01 -0.115 0.908131
## Race8 -6.431e-01 3.748e-01 -1.716 0.086176 .
## Race10 -6.428e-01 3.637e-01 -1.768 0.077118 .
## Race11 2.773e-01 3.958e-01 0.701 0.483503
## Race12 -5.992e-01 6.701e-01 -0.894 0.371209
## Race13 -2.358e-04 3.971e-01 -0.001 0.999526
## Race14 -1.094e+00 1.078e+00 -1.015 0.309876
## Race15 -6.769e-01 3.401e-01 -1.990 0.046547 *
## Resident2 -6.241e-01 4.712e-02 -13.243 < 2e-16 ***
## Resident3 -4.569e-01 6.920e-02 -6.603 4.02e-11 ***
## Resident4 -9.871e-01 2.921e-01 -3.380 0.000726 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 29169 on 40708 degrees of freedom
## Residual deviance: 27514 on 40678 degrees of freedom
## AIC: 27576
##
## Number of Fisher Scoring iterations: 11
From the result, we should drop predictors which doesn’t have “*" behind them, since their absolute z values is smaller than z_{0.975}=1.959964.
Then the reduced model is:
myvars <- names(Train_data) %in% c("Suicide", "Age","Sex2",
"Education3",
"Education4", "Education5",
"Education6", "Education7",
"Education8", "Marital2",
"Marital3", "Marital4", "Marital5",
"Race3",
"Race15",
"Resident2", "Resident3", "Resident4")
Train_data_2 <- Train_data[myvars]
Val_data_2 <- Val_data[myvars]
mod_reduced <- glm(Suicide~.,
family=binomial, data = Train_data_2)
summary(mod_reduced)
##
## Call:
## glm(formula = Suicide ~ ., family = binomial, data = Train_data_2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9448 -0.5645 -0.4339 -0.3062 2.9046
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.200424 0.217934 -0.920 0.357753
## Age -0.049241 0.006221 -7.916 2.46e-15 ***
## Sex2 -0.822448 0.037810 -21.752 < 2e-16 ***
## Education3 0.180478 0.041083 4.393 1.12e-05 ***
## Education4 0.337212 0.050642 6.659 2.76e-11 ***
## Education5 0.408539 0.066218 6.170 6.85e-10 ***
## Education6 0.650542 0.058983 11.029 < 2e-16 ***
## Education7 0.656794 0.098553 6.664 2.66e-11 ***
## Education8 0.623564 0.169468 3.680 0.000234 ***
## Marital2 0.290608 0.036649 7.929 2.20e-15 ***
## Marital3 0.747218 0.146504 5.100 3.39e-07 ***
## Marital4 0.483232 0.044365 10.892 < 2e-16 ***
## Marital5 0.522884 0.161383 3.240 0.001195 **
## Race3 -1.122026 0.056847 -19.738 < 2e-16 ***
## Race15 -0.437056 0.204643 -2.136 0.032704 *
## Resident2 -0.625696 0.047092 -13.287 < 2e-16 ***
## Resident3 -0.455110 0.069080 -6.588 4.45e-11 ***
## Resident4 -0.998165 0.290724 -3.433 0.000596 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 29169 on 40708 degrees of freedom
## Residual deviance: 27535 on 40691 degrees of freedom
## AIC: 27571
##
## Number of Fisher Scoring iterations: 5
From the result, all these 17 predictors are significant.
# 3. Hosmer and Lemeshow goodness of fit (GOF) test
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library("ResourceSelection")
## ResourceSelection 0.3-2 2017-02-28
h1 <- hoslem.test(Train_data$Suicide, fitted(mod2))
h1
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: Train_data$Suicide, fitted(mod2)
## X-squared = 11.419, df = 8, p-value = 0.179
h2 <- hoslem.test(Train_data_2$Suicide, fitted(mod_reduced))
h2
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: Train_data_2$Suicide, fitted(mod_reduced)
## X-squared = 8.2704, df = 8, p-value = 0.4075
# p-value = 0.4075 > 0.05, indicate a reasonable fit to the data
We tested the full model and got the first result “h1”. Then tested the reduced model and got the second result “h2”. \[p-value_1 = 0.179 < 0.4075 = p-value_2\] These two \(p-value\) are larger than 0.05, so we conclude \(H_0\), this two model are satisfactory fit. But the reduced model is much better.
lrtest(mod2, mod_reduced)
## Likelihood ratio test
##
## Model 1: Suicide ~ Age + Sex2 + Education2 + Education3 + Education4 +
## Education5 + Education6 + Education7 + Education8 + Education9 +
## Marital2 + Marital3 + Marital4 + Marital5 + Race2 + Race3 +
## Race4 + Race5 + Race6 + Race7 + Race8 + Race10 + Race11 +
## Race12 + Race13 + Race14 + Race15 + Resident2 + Resident3 +
## Resident4
## Model 2: Suicide ~ Age + Sex2 + Education3 + Education4 + Education5 +
## Education6 + Education7 + Education8 + Marital2 + Marital3 +
## Marital4 + Marital5 + Race3 + Race15 + Resident2 + Resident3 +
## Resident4
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 31 -13757
## 2 18 -13767 -13 20.423 0.08516 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qchisq(0.95,13)
## [1] 22.36203
Decision rule: If \[G^2≤ χ^2_{(1-α,q)}=χ^2_{0.95,3}\], conclude \(H_0\). If \[G^2 > χ^2_{(1-α,q)}=χ^2_{0.95,3}\], conclude \(H_1\). Since \(G^2=20.423<χ^2_{0.95,13}=22.36203\), we conclude \(H_0\), so this 13 predictors can be dropped from the regression model. The reduced model is much better.
We fitted the training dataset to the reduced model, and did the following analysis.
##### a. Deviance residuals
plot(mod_reduced$linear.predictors, residuals(mod_reduced, type="deviance"),
ylab='deviance residuals',
xlab='linear predictor',
main='Scattor plot of deviance residuals vs. linear predictors')
In binary regression, the plots of deviance residuals are strongly patterned, parti-cularly the plot against the linear predictor, where the residuals can take on only two values, depending on whether the response is equal to 0 or 1.
hii <- influence(mod_reduced)$hat # diagonal elements of hat matrix
plot(hii,cex = .6,main = "Index plot of the diagonal elements of the estimated hat matrix")
abline(h=2*18/40709)
If \(h_ii>2p/n=(2(17+1))/40709=0.00088433\), the i^th case is considered to be an outlier. From the following index plot we can know that points higher than the straight line are outliers. Under this model, since the data size, there seem like a little bit more outliers. But as we can see the overlapped points under the straight have accumulated to a black rectangle, there are majority of points actually. So there are not too many outliers.
CooksD<- cooks.distance(mod_reduced)
plot(CooksD,type="o",cex = .5,main = "Index plot of Cook's distance")
From the above three plots, we can see that there are 13 cases most influential.
library("ROCR")
pred <- prediction(fitted(mod_reduced),Train_data_2$Suicide)
perf <- performance(pred, "tpr", "fpr")
plot(perf); abline(0,1)
performance(pred,"auc") # Slot "y.values":[[1]][1] 0.7296682
## An object of class "performance"
## Slot "x.name":
## [1] "None"
##
## Slot "y.name":
## [1] "Area under the ROC curve"
##
## Slot "alpha.name":
## [1] "none"
##
## Slot "x.values":
## list()
##
## Slot "y.values":
## [[1]]
## [1] 0.6746106
##
##
## Slot "alpha.values":
## list()
The area under the ROC curve is useful to measure the model’s predictive power. A value of 0.5 means that the predictions were no better than random guessing. For the reduce model, the ROC area is 0.6746106, it seems good.
orecasting a binary outcome for given levels \(X_h\) of the X variables is simple in the sense that the outcome 1 will be predicted if the estimated value \(π ̂_h\) is large, and the outcome 0 will be predicted if $π ̂_h $ is small. The difficulty in making predictions of a binary outcome is in determining the cutoff point, below which the outcome 0 is predicted and above which the outcome I is predicted.
Under the reduced model, the best cutoff is 0.009094114.
perf@alpha.values[[1]][which.max(perf@x.values[[1]]+perf@y.values[[1]])]
## [1] 0.009094114
The highest sensitivity plus specificity is achieved in this case when you predict the positive outcome when the predicted probability exceeds 0.009094114 and predict the negative outcome when the predicted probability does not exceed 0.009094.
In this analysis, we build a binary model to predict if a dead person was suicide. It can help police to handle cases and could help psychologist to analysis what kind of people are more likely to commit a suicide.
From the ROC Curve, AUC = 0.67, so the model is reasonable, so we could use this model to make predictions.
And to enhance the prediction ability of our model, we could gather more information of observations such as occupation, family background, hobbies and so on. Then the results could be more accurate.