This model uses 2015 National Health Interview Survey (NHIS) data to study the association between different health behavior variables (both categorical and numerical) and stroke incidence (binary response variable).
The NHIS data is available for public use through the U.S. Census Bureau, and it is useful for epidemiological study of the entire U.S. Adult population and selected subgroups (youth data is also available in a separate data collection file). See here for 2015 data. The biological question under study here was: which factors are significant predictors of stroke in the U.S. adult population?
Data was processed in Excel prior to loading to ensure data manipulation in R would run smoothly. Selected variables for study were: sex, marital status, age, age, region of residence, diabetes, hours of sleep, and BMI (calculated from height and weight). Observations that did not have complete responses for hours of sleep, height, weight, and stroke were removed since the model would count them as missing data anyways. Here’s a headshot of the “cropped”, input data.
data <- read.csv("cropsamadult2015.csv",header=T) #load data
head(data)
## STROKE SEX MARITAL AGE REGION DIABETES HOURSLEEP HEIGHT WEIGHT
## 1 No Male Never Married 25 South No 7 71 155
## 2 No Female Never Married 49 West No 6 60 118
## 3 No Male Never Married 61 West Yes 6 66 190
## 4 No Male Divorced 53 South No 4 66 130
## 5 No Female Married 31 West No 7 63 125
## 6 No Female Married 58 South No 8 65 175
BMI information for each individual (observation) was calculated using the formula: BMI = (weight * 703)/height^2, and set to three digits. Here’s an updated headshot of the data:
data$BMI <- ((data$WEIGHT * 703) / (data$HEIGHT * data$HEIGHT))
options(digits = 3) #calc BMI to 3
head(data)
## STROKE SEX MARITAL AGE REGION DIABETES HOURSLEEP HEIGHT WEIGHT
## 1 No Male Never Married 25 South No 7 71 155
## 2 No Female Never Married 49 West No 6 60 118
## 3 No Male Never Married 61 West Yes 6 66 190
## 4 No Male Divorced 53 South No 4 66 130
## 5 No Female Married 31 West No 7 63 125
## 6 No Female Married 58 South No 8 65 175
## BMI
## 1 21.6
## 2 23.0
## 3 30.7
## 4 21.0
## 5 22.1
## 6 29.1
Unique variable length was checked with the sapply() and length(unique()) functions to ensure correct input data.
sapply(data, function(x) length(unique(x))) #check unique lengths
## STROKE SEX MARITAL AGE REGION DIABETES HOURSLEEP
## 2 2 7 68 4 5 21
## HEIGHT WEIGHT BMI
## 18 200 2372
Binomial logistic regression models (3) were produced for stroke prediction since it is a categorical response variable, and the model with the lowest AIC level was selected as the best one.
#Binomial logistic regression model #1 (all predictor variables)
ylogit1 <- glm(STROKE ~ SEX + MARITAL + AGE + REGION + DIABETES + HOURSLEEP + BMI, data = data, family = "binomial")
summary(ylogit1)
##
## Call:
## glm(formula = STROKE ~ SEX + MARITAL + AGE + REGION + DIABETES +
## HOURSLEEP + BMI, family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.254 -0.280 -0.173 -0.112 3.400
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.18280 0.36364 -17.00 < 2e-16 ***
## SEXMale 0.12804 0.06901 1.86 0.06354 .
## MARITALLiving with Partner -0.28531 0.20284 -1.41 0.15956
## MARITALMarried -0.62365 0.09215 -6.77 1.3e-11 ***
## MARITALNever Married -0.29881 0.12855 -2.32 0.02010 *
## MARITALSeparated 0.18251 0.18357 0.99 0.32011
## MARITALUnknown -0.08880 0.73740 -0.12 0.90415
## MARITALWidowed -0.12784 0.10277 -1.24 0.21350
## AGE 0.05236 0.00268 19.53 < 2e-16 ***
## REGIONNortheast -0.37186 0.11392 -3.26 0.00110 **
## REGIONSouth 0.00400 0.08810 0.05 0.96383
## REGIONWest -0.10509 0.09590 -1.10 0.27314
## DIABETESDon't know 1.32856 0.90703 1.46 0.14299
## DIABETESNo -0.65693 0.18733 -3.51 0.00045 ***
## DIABETESRefused -8.30005 196.96778 -0.04 0.96639
## DIABETESYes 0.30435 0.19164 1.59 0.11226
## HOURSLEEP 0.05333 0.02047 2.61 0.00916 **
## BMI 0.00578 0.00625 0.92 0.35561
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8722.5 on 29519 degrees of freedom
## Residual deviance: 7514.3 on 29502 degrees of freedom
## AIC: 7550
##
## Number of Fisher Scoring iterations: 10
#Binomial logistic regression model #2 (selected, 3 most statistically significant predictor variables)
ylogit2 <- glm(STROKE ~ MARITAL + AGE + DIABETES, data=data, family="binomial")
summary(ylogit2)
##
## Call:
## glm(formula = STROKE ~ MARITAL + AGE + DIABETES, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.221 -0.282 -0.176 -0.112 3.374
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.72020 0.25997 -22.00 < 2e-16 ***
## MARITALLiving with Partner -0.27173 0.20232 -1.34 0.17926
## MARITALMarried -0.59420 0.09130 -6.51 7.6e-11 ***
## MARITALNever Married -0.28746 0.12783 -2.25 0.02453 *
## MARITALSeparated 0.17345 0.18326 0.95 0.34392
## MARITALUnknown -0.10464 0.73696 -0.14 0.88709
## MARITALWidowed -0.14056 0.10161 -1.38 0.16656
## AGE 0.05288 0.00263 20.13 < 2e-16 ***
## DIABETESDon't know 1.46693 0.89910 1.63 0.10277
## DIABETESNo -0.64485 0.18618 -3.46 0.00053 ***
## DIABETESRefused -8.21287 196.96777 -0.04 0.96674
## DIABETESYes 0.34895 0.19103 1.83 0.06775 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8722.5 on 29519 degrees of freedom
## Residual deviance: 7541.5 on 29508 degrees of freedom
## AIC: 7565
##
## Number of Fisher Scoring iterations: 10
#Binomial logistic regression model #3 (hours of sleep predictor variable)
ylogit3 <- glm(STROKE ~ HOURSLEEP, data=data, family="binomial")
summary(ylogit3)
##
## Call:
## glm(formula = STROKE ~ HOURSLEEP, family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.738 -0.277 -0.258 -0.239 2.930
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.4282 0.1548 -28.61 <2e-16 ***
## HOURSLEEP 0.1484 0.0205 7.26 4e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8722.5 on 29519 degrees of freedom
## Residual deviance: 8674.0 on 29518 degrees of freedom
## AIC: 8678
##
## Number of Fisher Scoring iterations: 6
Model #1 had the lowest AIC level (7550 vs. 7565 vs. 8678), so it is the best model of the three produced.
Sex: (+) The parameter estimates suggest that being male increases the log likelihood of stroke by 0.12804 (P-value: 0.06354).
Marital Status: (-) The parameter estimates suggest that being married decreases the log likelihood of stroke by 0.62365 (P-value: 1.3e-11).
Age: (+) The parameter estimates suggest that an increase in age by one year increases the log likelihood of stroke by 0.05236 (P-value: < 2e-16).
Region: (-) The parameter estimates suggest that living in the northeast decreases the log likelihood of stroke by -0.37186 (P-value: 0.00110).
Diabetes: (-) The parameter estimates suggest that not having diabetes decreases the log likelhood of stroke by -0.65693 (P-value: 0.00045).
Hours of Sleep: (+) The parameter estimates suggest that one hour increase in sleep increases the log likelihood of stroke by 0.05333 (P-value: 0.00916); further analysis of sleep intervals needed, i.e. < 6 hours, 7-8 hours, > 9 hours.
BMI: (+) The parameter estimates suggest that an one-unit increase in BMI increases the log likelihood of stroke by 0.00578 (P-value: 0.35561).
The biological context/relevance of these findings centers on the fact that the best model includes more than just the statistically significant predictor variables; all of an individual’s health behaviors factor into the probability of stroke, so one cannot disregard other factors, but we can determine which are most influential.
The ROC plot and respective area-under-curve (AUC) analysis determine how well a model can “separate the group being tested into those with and without the disease in question” (http://gim.unmc.edu/dxtests/roc3.htm). So, this model is fairly good/accurate in separating non-stroke from stroke individuals based on the factors (sex, marital status, age, region of residence, diabetes diagnosis, average sleep duration, and BMI) because the AUC was calculated to be 0.803, which is close to 1, which suggests that this is a fairly good model.
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
#generating stroke probabilities using the first, 7-factor model
p <- predict(ylogit1, newdata=subset(data,select=c(2,3,4,5,6,7,10)), type="response")
pr <- prediction(p, data$STROKE)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plotting ROC
plot(prf)
#calculating AUC
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.803
Here’s plotting of the logistic model between hours of sleep and stroke incidence. Since there is a great difference in the number of stroke vs. non-stroke incidences, the logisitic model may not accurately display the association between hours of sleep and stroke incidence (that the 3rd logistic model equation weakly suggests there is).
#converting sleep and stroke data columns to numeric values
hsleepdata <- subset(data, data$HOURSLEEP < 16 & data$HOURSLEEP > 4)
hsleep <- hsleepdata$HOURSLEEP
hsleep2 <- as.numeric(hsleep)
stroke <- hsleepdata$STROKE
stroke2 <- as.factor(stroke)
levels(stroke2) <- c("0", "1")
stroke3 <- as.numeric(as.character(stroke2))
#creating new data frame w/ only numeric sleep and stroke columns
dat = as.data.frame(cbind(hsleep2,stroke3))
#generating new plot window + plotting glm
quartz(title="hoursleep vs. stroke")
plot(hsleep2,stroke3,xlab="Sleep (hours)",ylab="Stroke Incidence")
g=glm(stroke3~hsleep2,family=binomial,dat)
#plotting incidence curve
curve(predict(g,data.frame(hsleep2=x),type="resp"),add=TRUE)
points(hsleep2,fitted(g),pch=20)
So I wasn’t able to make a great logisitic plot of hours of sleep vs. stroke incidence, but the model analyses (ROC and AUC) were still able highlight the best model of the three.
Study these same 7 predictor variables in relation to other circulatory diseases, including hypertension, coronary heart disease, and myocardial infarction.
Produce a meta-analysis walkthrough for 2015+ data when released (data prior to 2015 is not available as a raw .csv file, thus unusable on my end).