Logistic Regression Model for Stroke Prediction

This model uses 2015 National Health Interview Survey (NHIS) data to study the association between different health behavior variables (both categorical and numerical) and stroke incidence (binary response variable).

The NHIS data is available for public use through the U.S. Census Bureau, and it is useful for epidemiological study of the entire U.S. Adult population and selected subgroups (youth data is also available in a separate data collection file). See here for 2015 data. The biological question under study here was: which factors are significant predictors of stroke in the U.S. adult population?

Data Input

Data was processed in Excel prior to loading to ensure data manipulation in R would run smoothly. Selected variables for study were: sex, marital status, age, age, region of residence, diabetes, hours of sleep, and BMI (calculated from height and weight). Observations that did not have complete responses for hours of sleep, height, weight, and stroke were removed since the model would count them as missing data anyways. Here’s a headshot of the “cropped”, input data.

data <- read.csv("cropsamadult2015.csv",header=T)      #load data
head(data)
##   STROKE    SEX       MARITAL AGE REGION DIABETES HOURSLEEP HEIGHT WEIGHT
## 1     No   Male Never Married  25  South       No         7     71    155
## 2     No Female Never Married  49   West       No         6     60    118
## 3     No   Male Never Married  61   West      Yes         6     66    190
## 4     No   Male      Divorced  53  South       No         4     66    130
## 5     No Female       Married  31   West       No         7     63    125
## 6     No Female       Married  58  South       No         8     65    175

R Processing

BMI information for each individual (observation) was calculated using the formula: BMI = (weight * 703)/height^2, and set to three digits. Here’s an updated headshot of the data:

data$BMI <- ((data$WEIGHT * 703) / (data$HEIGHT * data$HEIGHT))
options(digits = 3)                                    #calc BMI to 3
head(data)
##   STROKE    SEX       MARITAL AGE REGION DIABETES HOURSLEEP HEIGHT WEIGHT
## 1     No   Male Never Married  25  South       No         7     71    155
## 2     No Female Never Married  49   West       No         6     60    118
## 3     No   Male Never Married  61   West      Yes         6     66    190
## 4     No   Male      Divorced  53  South       No         4     66    130
## 5     No Female       Married  31   West       No         7     63    125
## 6     No Female       Married  58  South       No         8     65    175
##    BMI
## 1 21.6
## 2 23.0
## 3 30.7
## 4 21.0
## 5 22.1
## 6 29.1

Binomial Logistic Modeling

Unique variable length was checked with the sapply() and length(unique()) functions to ensure correct input data.

sapply(data, function(x) length(unique(x)))       #check unique lengths
##    STROKE       SEX   MARITAL       AGE    REGION  DIABETES HOURSLEEP 
##         2         2         7        68         4         5        21 
##    HEIGHT    WEIGHT       BMI 
##        18       200      2372

Binomial logistic regression models (3) were produced for stroke prediction since it is a categorical response variable, and the model with the lowest AIC level was selected as the best one.

#Binomial logistic regression model #1 (all predictor variables)
ylogit1 <- glm(STROKE ~ SEX + MARITAL + AGE + REGION + DIABETES + HOURSLEEP + BMI, data = data, family = "binomial")
summary(ylogit1)
## 
## Call:
## glm(formula = STROKE ~ SEX + MARITAL + AGE + REGION + DIABETES + 
##     HOURSLEEP + BMI, family = "binomial", data = data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.254  -0.280  -0.173  -0.112   3.400  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -6.18280    0.36364  -17.00  < 2e-16 ***
## SEXMale                      0.12804    0.06901    1.86  0.06354 .  
## MARITALLiving with Partner  -0.28531    0.20284   -1.41  0.15956    
## MARITALMarried              -0.62365    0.09215   -6.77  1.3e-11 ***
## MARITALNever Married        -0.29881    0.12855   -2.32  0.02010 *  
## MARITALSeparated             0.18251    0.18357    0.99  0.32011    
## MARITALUnknown              -0.08880    0.73740   -0.12  0.90415    
## MARITALWidowed              -0.12784    0.10277   -1.24  0.21350    
## AGE                          0.05236    0.00268   19.53  < 2e-16 ***
## REGIONNortheast             -0.37186    0.11392   -3.26  0.00110 ** 
## REGIONSouth                  0.00400    0.08810    0.05  0.96383    
## REGIONWest                  -0.10509    0.09590   -1.10  0.27314    
## DIABETESDon't know           1.32856    0.90703    1.46  0.14299    
## DIABETESNo                  -0.65693    0.18733   -3.51  0.00045 ***
## DIABETESRefused             -8.30005  196.96778   -0.04  0.96639    
## DIABETESYes                  0.30435    0.19164    1.59  0.11226    
## HOURSLEEP                    0.05333    0.02047    2.61  0.00916 ** 
## BMI                          0.00578    0.00625    0.92  0.35561    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8722.5  on 29519  degrees of freedom
## Residual deviance: 7514.3  on 29502  degrees of freedom
## AIC: 7550
## 
## Number of Fisher Scoring iterations: 10
#Binomial logistic regression model #2 (selected, 3 most statistically significant predictor variables)
ylogit2 <- glm(STROKE ~ MARITAL + AGE + DIABETES, data=data, family="binomial")
summary(ylogit2)
## 
## Call:
## glm(formula = STROKE ~ MARITAL + AGE + DIABETES, family = "binomial", 
##     data = data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.221  -0.282  -0.176  -0.112   3.374  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -5.72020    0.25997  -22.00  < 2e-16 ***
## MARITALLiving with Partner  -0.27173    0.20232   -1.34  0.17926    
## MARITALMarried              -0.59420    0.09130   -6.51  7.6e-11 ***
## MARITALNever Married        -0.28746    0.12783   -2.25  0.02453 *  
## MARITALSeparated             0.17345    0.18326    0.95  0.34392    
## MARITALUnknown              -0.10464    0.73696   -0.14  0.88709    
## MARITALWidowed              -0.14056    0.10161   -1.38  0.16656    
## AGE                          0.05288    0.00263   20.13  < 2e-16 ***
## DIABETESDon't know           1.46693    0.89910    1.63  0.10277    
## DIABETESNo                  -0.64485    0.18618   -3.46  0.00053 ***
## DIABETESRefused             -8.21287  196.96777   -0.04  0.96674    
## DIABETESYes                  0.34895    0.19103    1.83  0.06775 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8722.5  on 29519  degrees of freedom
## Residual deviance: 7541.5  on 29508  degrees of freedom
## AIC: 7565
## 
## Number of Fisher Scoring iterations: 10
#Binomial logistic regression model #3 (hours of sleep predictor variable)
ylogit3 <- glm(STROKE ~ HOURSLEEP, data=data, family="binomial")
summary(ylogit3)
## 
## Call:
## glm(formula = STROKE ~ HOURSLEEP, family = "binomial", data = data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.738  -0.277  -0.258  -0.239   2.930  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.4282     0.1548  -28.61   <2e-16 ***
## HOURSLEEP     0.1484     0.0205    7.26    4e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8722.5  on 29519  degrees of freedom
## Residual deviance: 8674.0  on 29518  degrees of freedom
## AIC: 8678
## 
## Number of Fisher Scoring iterations: 6

Discussion and Conclusions

Model #1 had the lowest AIC level (7550 vs. 7565 vs. 8678), so it is the best model of the three produced.

Predictor variable results (positive and negative)

Sex: (+) The parameter estimates suggest that being male increases the log likelihood of stroke by 0.12804 (P-value: 0.06354).

Marital Status: (-) The parameter estimates suggest that being married decreases the log likelihood of stroke by 0.62365 (P-value: 1.3e-11).

Age: (+) The parameter estimates suggest that an increase in age by one year increases the log likelihood of stroke by 0.05236 (P-value: < 2e-16).

Region: (-) The parameter estimates suggest that living in the northeast decreases the log likelihood of stroke by -0.37186 (P-value: 0.00110).

Diabetes: (-) The parameter estimates suggest that not having diabetes decreases the log likelhood of stroke by -0.65693 (P-value: 0.00045).

Hours of Sleep: (+) The parameter estimates suggest that one hour increase in sleep increases the log likelihood of stroke by 0.05333 (P-value: 0.00916); further analysis of sleep intervals needed, i.e. < 6 hours, 7-8 hours, > 9 hours.

BMI: (+) The parameter estimates suggest that an one-unit increase in BMI increases the log likelihood of stroke by 0.00578 (P-value: 0.35561).

The biological context/relevance of these findings centers on the fact that the best model includes more than just the statistically significant predictor variables; all of an individual’s health behaviors factor into the probability of stroke, so one cannot disregard other factors, but we can determine which are most influential.

ROC Plot

The ROC plot and respective area-under-curve (AUC) analysis determine how well a model can “separate the group being tested into those with and without the disease in question” (http://gim.unmc.edu/dxtests/roc3.htm). So, this model is fairly good/accurate in separating non-stroke from stroke individuals based on the factors (sex, marital status, age, region of residence, diabetes diagnosis, average sleep duration, and BMI) because the AUC was calculated to be 0.803, which is close to 1, which suggests that this is a fairly good model.

library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
#generating stroke probabilities using the first, 7-factor model
p <- predict(ylogit1, newdata=subset(data,select=c(2,3,4,5,6,7,10)), type="response")
pr <- prediction(p, data$STROKE)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plotting ROC 
plot(prf)

#calculating AUC
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.803

Generalized Linear Model Plot of Sleep (hours) vs. Stroke Incidence

Here’s plotting of the logistic model between hours of sleep and stroke incidence. Since there is a great difference in the number of stroke vs. non-stroke incidences, the logisitic model may not accurately display the association between hours of sleep and stroke incidence (that the 3rd logistic model equation weakly suggests there is).

#converting sleep and stroke data columns to numeric values
hsleepdata <- subset(data, data$HOURSLEEP < 16 & data$HOURSLEEP > 4)
hsleep <- hsleepdata$HOURSLEEP
hsleep2 <- as.numeric(hsleep)

stroke <- hsleepdata$STROKE
stroke2 <- as.factor(stroke)
levels(stroke2) <- c("0", "1")
stroke3 <- as.numeric(as.character(stroke2))
#creating new data frame w/ only numeric sleep and stroke columns
dat = as.data.frame(cbind(hsleep2,stroke3))


#generating new plot window + plotting glm
quartz(title="hoursleep vs. stroke")
plot(hsleep2,stroke3,xlab="Sleep (hours)",ylab="Stroke Incidence")
g=glm(stroke3~hsleep2,family=binomial,dat)
#plotting incidence curve
curve(predict(g,data.frame(hsleep2=x),type="resp"),add=TRUE)
points(hsleep2,fitted(g),pch=20)

Positive and Negative Results Continued

So I wasn’t able to make a great logisitic plot of hours of sleep vs. stroke incidence, but the model analyses (ROC and AUC) were still able highlight the best model of the three.

Future Steps!

  1. Study these same 7 predictor variables in relation to other circulatory diseases, including hypertension, coronary heart disease, and myocardial infarction.

  2. Produce a meta-analysis walkthrough for 2015+ data when released (data prior to 2015 is not available as a raw .csv file, thus unusable on my end).