Odds ratios are provided at the bottom of this output.
Based loosely on the conceptual framework provided by Kim & Park (2019), we chose variables that fell into the four categories listed above.
We started with the simplest model which only included demographic variables. We added the medical history variables and analyzed any improvements of the new model over the old. The order was based on the Kim & Park (2019) paper as well, but also based on what variables we believed to be the most important (i.e., we included the “Cost and Healthcare Service Utilization” variables last because we thought they would be the best predictors). Overall, there were four logistic regressions.
We compared models using likelihood ratio tests and the Akaike Information Criterion (AIC), which I explain more below.
## psychiatric_history new_meds
## 3 3
## health_econ_data.NumbPreviousConcuss numb_symptoms
## 2 2
## health_econ_data.Duration
## 1
This model only contains the demographic variables.
##
## Call:
## glm(formula = Healthcare_Cost ~ Sex + Age, family = "binomial",
## data = logreg_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7389 -0.6903 -0.6548 -0.5940 1.9717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.77917 0.56450 -3.152 0.00162 **
## SexM -0.19118 0.23588 -0.810 0.41767
## Age 0.03371 0.03776 0.893 0.37195
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 460.82 on 460 degrees of freedom
## Residual deviance: 459.25 on 458 degrees of freedom
## AIC: 465.25
##
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.1687783 0.05346006 0.4923083
## SexM 0.8259876 0.51791227 1.3085292
## Age 1.0342866 0.96213073 1.1161131
This model contains the demographic variables and medical history related variables. A likelihood ratio test and the AIC were used to compare whether one model was better than the other.
The likelihood ratio test shows that these models are not significantly different, suggesting that the medical history variables do not improve the model’s ability to distinguish high vs. low cost users (p = 0.18). AICs are also very similar.
##
## Call:
## glm(formula = Healthcare_Cost ~ Sex + Age + Psychiatric_History +
## Numb_Prev_Concuss + Therapy_History, family = "binomial",
## data = logreg_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8991 -0.7062 -0.6498 -0.5115 2.1552
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.13599 0.61245 -3.488 0.000487 ***
## SexM -0.21174 0.24201 -0.875 0.381626
## Age 0.02069 0.03880 0.533 0.593820
## Psychiatric_HistoryYes 0.15101 0.25699 0.588 0.556804
## Numb_Prev_Concuss 0.07449 0.08382 0.889 0.374192
## Therapy_HistoryYes 0.53192 0.30907 1.721 0.085244 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 460.82 on 460 degrees of freedom
## Residual deviance: 454.40 on 455 degrees of freedom
## AIC: 466.4
##
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.1181272 0.03395513 0.3772894
## SexM 0.8091758 0.50115460 1.2971087
## Age 1.0209064 0.94757198 1.1037554
## Psychiatric_HistoryYes 1.1630073 0.69564716 1.9106492
## Numb_Prev_Concuss 1.0773343 0.90953643 1.2664780
## Therapy_HistoryYes 1.7021949 0.95149916 3.2206351
## Analysis of Deviance Table
##
## Model 1: Healthcare_Cost ~ Sex + Age
## Model 2: Healthcare_Cost ~ Sex + Age + Psychiatric_History + Numb_Prev_Concuss +
## Therapy_History
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 458 459.25
## 2 455 454.40 3 4.8439 0.1836
This model contains the demographic, medical history, and symptom severity variables. A likelihood ratio test and the AIC were used to compare whether one model was better than the other.
The likelihood ratio test shows that these models are significantly different, suggesting that symptom duration and number of symptoms improve the model’s ability to distinguish high vs. low cost users (p = 2.86e-05). The AIC is slightly smaller for model 3, indicating better fit. However, the odds ratios for symptom duration and number of symptoms are modest, despite being signficant.
##
## Call:
## glm(formula = Healthcare_Cost ~ Sex + Age + Psychiatric_History +
## Numb_Prev_Concuss + Therapy_History + Number_of_Symptoms +
## Symptom_Duration, family = "binomial", data = logreg_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8793 -0.7060 -0.5600 -0.3962 2.2542
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1479756 0.6786993 -4.638 3.51e-06 ***
## SexM -0.0904625 0.2493558 -0.363 0.71677
## Age 0.0104058 0.0399116 0.261 0.79431
## Psychiatric_HistoryYes -0.0830698 0.2684863 -0.309 0.75702
## Numb_Prev_Concuss 0.1046689 0.0860674 1.216 0.22394
## Therapy_HistoryYes 0.1699004 0.3281210 0.518 0.60460
## Number_of_Symptoms 0.3542698 0.1132735 3.128 0.00176 **
## Symptom_Duration 0.0008289 0.0002638 3.142 0.00168 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 460.82 on 460 degrees of freedom
## Residual deviance: 433.41 on 453 degrees of freedom
## AIC: 449.41
##
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.04293896 0.01078365 0.1553263
## SexM 0.91350857 0.55838403 1.4877022
## Age 1.01046012 0.93574715 1.0947846
## Psychiatric_HistoryYes 0.92028691 0.53714881 1.5434790
## Numb_Prev_Concuss 1.11034292 0.93369501 1.3112126
## Therapy_HistoryYes 1.18518685 0.63487011 2.3165313
## Number_of_Symptoms 1.42513959 1.14819210 1.7924992
## Symptom_Duration 1.00082927 1.00032444 1.0013658
## Analysis of Deviance Table
##
## Model 1: Healthcare_Cost ~ Sex + Age + Psychiatric_History + Numb_Prev_Concuss +
## Therapy_History
## Model 2: Healthcare_Cost ~ Sex + Age + Psychiatric_History + Numb_Prev_Concuss +
## Therapy_History + Number_of_Symptoms + Symptom_Duration
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 455 454.40
## 2 453 433.41 2 20.99 2.767e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This model contains the demographic, medical history, symptom severity, and service utilization variables. Service utilization variables were whether new medications were prescribed from the physician and the number of allied health services used. A likelihood ratio test and the AIC were used to compare whether one model was better than the other.
The likelihood ratio test shows that these models are significantly different, suggesting that utilizing certain aspects of healthcare services improve the model (p = 2.2e-16). The AIC is much smaller for model 4 (449.48 vs. 299.16), indicating better fit. The odds ratios for both service utilization variables can be considered large.
The best way to predict high cost users is to track early on how they’re utilizing healthcare services (e.g., if they’re being prescribed new medication and being referred to different services). Symptom severity may also play a role, but service utilization is much better at predicting this. Future research will benefit from embedding this in predictive modelling framework, which would (ideally) have a larger sample size during validation.
##
## Call:
## glm(formula = Healthcare_Cost ~ Sex + Age + Psychiatric_History +
## Numb_Prev_Concuss + Therapy_History + Number_of_Symptoms +
## Symptom_Duration + Services_Used + New_Meds, family = "binomial",
## data = logreg_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1886 -0.4838 -0.1478 -0.0525 3.2592
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.5049694 1.0922724 -6.871 6.38e-12 ***
## SexM 0.0475814 0.3125898 0.152 0.879016
## Age 0.0423248 0.0514294 0.823 0.410526
## Psychiatric_HistoryYes -0.0327524 0.3447867 -0.095 0.924320
## Numb_Prev_Concuss 0.1188643 0.1127340 1.054 0.291709
## Therapy_HistoryYes -0.3254394 0.3974621 -0.819 0.412904
## Number_of_Symptoms 0.1935829 0.1395477 1.387 0.165376
## Symptom_Duration 0.0006807 0.0003919 1.737 0.082363 .
## Services_Used 1.5541811 0.1863528 8.340 < 2e-16 ***
## New_MedsYes 1.1299791 0.3318910 3.405 0.000662 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 460.82 on 460 degrees of freedom
## Residual deviance: 279.19 on 451 degrees of freedom
## AIC: 299.19
##
## Number of Fisher Scoring iterations: 6
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.0005503427 5.681446e-05 0.004164988
## SexM 1.0487315231 5.669448e-01 1.938152273
## Age 1.0432332345 9.442848e-01 1.156024735
## Psychiatric_HistoryYes 0.9677781513 4.879261e-01 1.894508012
## Numb_Prev_Concuss 1.1262171155 9.009980e-01 1.402537705
## Therapy_HistoryYes 0.7222099306 3.326737e-01 1.594672160
## Number_of_Symptoms 1.2135899338 9.263303e-01 1.604263103
## Symptom_Duration 1.0006809680 9.998826e-01 1.001414247
## Services_Used 4.7312105817 3.368303e+00 7.010781964
## New_MedsYes 3.0955916767 1.634255e+00 6.027143615
## Analysis of Deviance Table
##
## Model 1: Healthcare_Cost ~ Sex + Age + Psychiatric_History + Numb_Prev_Concuss +
## Therapy_History + Number_of_Symptoms + Symptom_Duration
## Model 2: Healthcare_Cost ~ Sex + Age + Psychiatric_History + Numb_Prev_Concuss +
## Therapy_History + Number_of_Symptoms + Symptom_Duration +
## Services_Used + New_Meds
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 453 433.41
## 2 451 279.19 2 154.22 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Likelihood ratio test
##
## Model 1: Healthcare_Cost ~ Sex + Age + Psychiatric_History + Numb_Prev_Concuss +
## Therapy_History + Number_of_Symptoms + Symptom_Duration
## Model 2: Healthcare_Cost ~ Sex + Age + Psychiatric_History + Numb_Prev_Concuss +
## Therapy_History + Number_of_Symptoms + Symptom_Duration +
## Services_Used + New_Meds
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 8 -216.71
## 2 10 -139.59 2 154.23 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library("parameters")
## Registered S3 methods overwritten by 'parameters':
## method from
## as.double.parameters_kurtosis datawizard
## as.double.parameters_skewness datawizard
## as.double.parameters_smoothness datawizard
## as.numeric.parameters_kurtosis datawizard
## as.numeric.parameters_skewness datawizard
## as.numeric.parameters_smoothness datawizard
## print.parameters_distribution datawizard
## print.parameters_kurtosis datawizard
## print.parameters_skewness datawizard
## summary.parameters_kurtosis datawizard
## summary.parameters_skewness datawizard
model_parameters(logit4)
## Parameter | Log-Odds | SE | 95% CI | z | p
## ---------------------------------------------------------------------------------
## (Intercept) | -7.50 | 1.09 | [-9.78, -5.48] | -6.87 | < .001
## Sex [M] | 0.05 | 0.31 | [-0.57, 0.66] | 0.15 | 0.879
## Age | 0.04 | 0.05 | [-0.06, 0.14] | 0.82 | 0.411
## Psychiatric_History [Yes] | -0.03 | 0.34 | [-0.72, 0.64] | -0.09 | 0.924
## Numb_Prev_Concuss | 0.12 | 0.11 | [-0.10, 0.34] | 1.05 | 0.292
## Therapy_History [Yes] | -0.33 | 0.40 | [-1.10, 0.47] | -0.82 | 0.413
## Number_of_Symptoms | 0.19 | 0.14 | [-0.08, 0.47] | 1.39 | 0.165
## Symptom_Duration | 6.81e-04 | 3.92e-04 | [ 0.00, 0.00] | 1.74 | 0.082
## Services_Used | 1.55 | 0.19 | [ 1.21, 1.95] | 8.34 | < .001
## New_Meds [Yes] | 1.13 | 0.33 | [ 0.49, 1.80] | 3.40 | < .001
Using repeated cross-validation, a final logistic regression model classified participants as high or low cost. The classification performance is provided through the ROC, sensitivity and specificity.
## Loading required package: caret
##
## Attaching package: 'caret'
## The following object is masked from 'package:parameters':
##
## compare_models
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
## parameter ROC Sens Spec ROCSD SensSD SpecSD
## 1 none 0.8897708 0.9322372 0.4831111 0.05187104 0.04320644 0.1478534
The ROC and sensitivity are considered good/excellent. The specificity is mediocre. The coefficients for all predictors and their odds ratios are the same.
##
## Attaching package: 'MLeval'
## The following object is masked _by_ '.GlobalEnv':
##
## fit
## ***MLeval: Machine Learning Model Evaluation***
## Input: caret train function object
## Averaging probs.
## Group 1 type: repeatedcv
## Observations: 461
## Number of groups: 1
## Observations per group: 461
## Positive: high
## Negative: low
## Group: Group 1
## Positive: 92
## Negative: 369
## ***Performance Metrics***
## Group 1 Optimal Informedness = 0.6340874278308
## Group 1 AUC-ROC = 0.89
## $`Group 1`
## Score CI
## SENS 0.478 0.38-0.58
## SPEC 0.938 0.91-0.96
## MCC 0.472 <NA>
## Informedness 0.416 <NA>
## PREC 0.657 0.54-0.76
## NPV 0.878 0.84-0.91
## FPR 0.062 <NA>
## F1 0.553 <NA>
## TP 44.000 <NA>
## FP 23.000 <NA>
## TN 346.000 <NA>
## FN 48.000 <NA>
## AUC-ROC 0.890 0.84-0.94
## AUC-PR 0.620 <NA>
## AUC-PRG 0.230 <NA>
## Loading required package: rms
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
##
## Attaching package: 'rms'
## The following object is masked from 'package:lmtest':
##
## lrtest
## The following objects are masked from 'package:car':
##
## Predict, vif
## Logistic Regression Model
##
## rms::lrm(formula = Healthcare_Cost ~ Sex + Age + Psychiatric_History +
## Numb_Prev_Concuss + Therapy_History + Number_of_Symptoms +
## Symptom_Duration + Services_Used + New_Meds, data = logreg_df)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 461 LR chi2 181.63 R2 0.515 C 0.901
## low 369 d.f. 9 g 2.837 Dxy 0.801
## high 92 Pr(> chi2) <0.0001 gr 17.056 gamma 0.802
## max |deriv| 1e-05 gp 0.257 tau-a 0.257
## Brier 0.098
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -7.5050 1.0923 -6.87 <0.0001
## Sex=M 0.0476 0.3126 0.15 0.8790
## Age 0.0423 0.0514 0.82 0.4105
## Psychiatric_History=Yes -0.0328 0.3448 -0.09 0.9243
## Numb_Prev_Concuss 0.1189 0.1127 1.05 0.2917
## Therapy_History=Yes -0.3254 0.3975 -0.82 0.4129
## Number_of_Symptoms 0.1936 0.1396 1.39 0.1654
## Symptom_Duration 0.0007 0.0004 1.74 0.0824
## Services_Used 1.5542 0.1864 8.34 <0.0001
## New_Meds=Yes 1.1300 0.3319 3.40 0.0007
##
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 5.681446e-05 0.004164988
## SexM 5.669448e-01 1.938152273
## Age 9.442848e-01 1.156024735
## Psychiatric_HistoryYes 4.879261e-01 1.894508012
## Numb_Prev_Concuss 9.009980e-01 1.402537705
## Therapy_HistoryYes 3.326737e-01 1.594672160
## Number_of_Symptoms 9.263303e-01 1.604263103
## Symptom_Duration 9.998826e-01 1.001414247
## Services_Used 3.368303e+00 7.010781964
## New_MedsYes 1.634255e+00 6.027143615