Odds ratios are provided at the bottom of this output.

Logistic Regression Model

Predictors:

  • Demographic Vars:
  1. Age
  2. Sex
  • Medical History:
  1. Psychiatric History (y/n)
  2. Number of previous concussions
  3. History of other therapy?
  • PPCS Symptom Severity
  1. Number of Symptoms
  2. Duration
  • Cost and Healthcare Service Utilization
  1. Number of allied health services used
  2. New medication prescribed?

Strategy:

Based loosely on the conceptual framework provided by Kim & Park (2019), we chose variables that fell into the four categories listed above.

We started with the simplest model which only included demographic variables. We added the medical history variables and analyzed any improvements of the new model over the old. The order was based on the Kim & Park (2019) paper as well, but also based on what variables we believed to be the most important (i.e., we included the “Cost and Healthcare Service Utilization” variables last because we thought they would be the best predictors). Overall, there were four logistic regressions.

We compared models using likelihood ratio tests and the Akaike Information Criterion (AIC), which I explain more below.

Patterns of missing data

##                  psychiatric_history                             new_meds 
##                                    3                                    3 
## health_econ_data.NumbPreviousConcuss                        numb_symptoms 
##                                    2                                    2 
##            health_econ_data.Duration 
##                                    1

First Logit Model:

This model only contains the demographic variables.

## 
## Call:
## glm(formula = Healthcare_Cost ~ Sex + Age, family = "binomial", 
##     data = logreg_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7389  -0.6903  -0.6548  -0.5940   1.9717  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -1.77917    0.56450  -3.152  0.00162 **
## SexM        -0.19118    0.23588  -0.810  0.41767   
## Age          0.03371    0.03776   0.893  0.37195   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 460.82  on 460  degrees of freedom
## Residual deviance: 459.25  on 458  degrees of freedom
## AIC: 465.25
## 
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1687783 0.05346006 0.4923083
## SexM        0.8259876 0.51791227 1.3085292
## Age         1.0342866 0.96213073 1.1161131

Second Logit Model:

This model contains the demographic variables and medical history related variables. A likelihood ratio test and the AIC were used to compare whether one model was better than the other.

The likelihood ratio test shows that these models are not significantly different, suggesting that the medical history variables do not improve the model’s ability to distinguish high vs. low cost users (p = 0.18). AICs are also very similar.

## 
## Call:
## glm(formula = Healthcare_Cost ~ Sex + Age + Psychiatric_History + 
##     Numb_Prev_Concuss + Therapy_History, family = "binomial", 
##     data = logreg_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8991  -0.7062  -0.6498  -0.5115   2.1552  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.13599    0.61245  -3.488 0.000487 ***
## SexM                   -0.21174    0.24201  -0.875 0.381626    
## Age                     0.02069    0.03880   0.533 0.593820    
## Psychiatric_HistoryYes  0.15101    0.25699   0.588 0.556804    
## Numb_Prev_Concuss       0.07449    0.08382   0.889 0.374192    
## Therapy_HistoryYes      0.53192    0.30907   1.721 0.085244 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 460.82  on 460  degrees of freedom
## Residual deviance: 454.40  on 455  degrees of freedom
## AIC: 466.4
## 
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...
##                               OR      2.5 %    97.5 %
## (Intercept)            0.1181272 0.03395513 0.3772894
## SexM                   0.8091758 0.50115460 1.2971087
## Age                    1.0209064 0.94757198 1.1037554
## Psychiatric_HistoryYes 1.1630073 0.69564716 1.9106492
## Numb_Prev_Concuss      1.0773343 0.90953643 1.2664780
## Therapy_HistoryYes     1.7021949 0.95149916 3.2206351
## Analysis of Deviance Table
## 
## Model 1: Healthcare_Cost ~ Sex + Age
## Model 2: Healthcare_Cost ~ Sex + Age + Psychiatric_History + Numb_Prev_Concuss + 
##     Therapy_History
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       458     459.25                     
## 2       455     454.40  3   4.8439   0.1836

Third Logit Model:

This model contains the demographic, medical history, and symptom severity variables. A likelihood ratio test and the AIC were used to compare whether one model was better than the other.

The likelihood ratio test shows that these models are significantly different, suggesting that symptom duration and number of symptoms improve the model’s ability to distinguish high vs. low cost users (p = 2.86e-05). The AIC is slightly smaller for model 3, indicating better fit. However, the odds ratios for symptom duration and number of symptoms are modest, despite being signficant.

## 
## Call:
## glm(formula = Healthcare_Cost ~ Sex + Age + Psychiatric_History + 
##     Numb_Prev_Concuss + Therapy_History + Number_of_Symptoms + 
##     Symptom_Duration, family = "binomial", data = logreg_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8793  -0.7060  -0.5600  -0.3962   2.2542  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -3.1479756  0.6786993  -4.638 3.51e-06 ***
## SexM                   -0.0904625  0.2493558  -0.363  0.71677    
## Age                     0.0104058  0.0399116   0.261  0.79431    
## Psychiatric_HistoryYes -0.0830698  0.2684863  -0.309  0.75702    
## Numb_Prev_Concuss       0.1046689  0.0860674   1.216  0.22394    
## Therapy_HistoryYes      0.1699004  0.3281210   0.518  0.60460    
## Number_of_Symptoms      0.3542698  0.1132735   3.128  0.00176 ** 
## Symptom_Duration        0.0008289  0.0002638   3.142  0.00168 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 460.82  on 460  degrees of freedom
## Residual deviance: 433.41  on 453  degrees of freedom
## AIC: 449.41
## 
## Number of Fisher Scoring iterations: 4
## Waiting for profiling to be done...
##                                OR      2.5 %    97.5 %
## (Intercept)            0.04293896 0.01078365 0.1553263
## SexM                   0.91350857 0.55838403 1.4877022
## Age                    1.01046012 0.93574715 1.0947846
## Psychiatric_HistoryYes 0.92028691 0.53714881 1.5434790
## Numb_Prev_Concuss      1.11034292 0.93369501 1.3112126
## Therapy_HistoryYes     1.18518685 0.63487011 2.3165313
## Number_of_Symptoms     1.42513959 1.14819210 1.7924992
## Symptom_Duration       1.00082927 1.00032444 1.0013658
## Analysis of Deviance Table
## 
## Model 1: Healthcare_Cost ~ Sex + Age + Psychiatric_History + Numb_Prev_Concuss + 
##     Therapy_History
## Model 2: Healthcare_Cost ~ Sex + Age + Psychiatric_History + Numb_Prev_Concuss + 
##     Therapy_History + Number_of_Symptoms + Symptom_Duration
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       455     454.40                          
## 2       453     433.41  2    20.99 2.767e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Fourth Logit Model:

This model contains the demographic, medical history, symptom severity, and service utilization variables. Service utilization variables were whether new medications were prescribed from the physician and the number of allied health services used. A likelihood ratio test and the AIC were used to compare whether one model was better than the other.

The likelihood ratio test shows that these models are significantly different, suggesting that utilizing certain aspects of healthcare services improve the model (p = 2.2e-16). The AIC is much smaller for model 4 (449.48 vs. 299.16), indicating better fit. The odds ratios for both service utilization variables can be considered large.

The best way to predict high cost users is to track early on how they’re utilizing healthcare services (e.g., if they’re being prescribed new medication and being referred to different services). Symptom severity may also play a role, but service utilization is much better at predicting this. Future research will benefit from embedding this in predictive modelling framework, which would (ideally) have a larger sample size during validation.

## 
## Call:
## glm(formula = Healthcare_Cost ~ Sex + Age + Psychiatric_History + 
##     Numb_Prev_Concuss + Therapy_History + Number_of_Symptoms + 
##     Symptom_Duration + Services_Used + New_Meds, family = "binomial", 
##     data = logreg_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1886  -0.4838  -0.1478  -0.0525   3.2592  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -7.5049694  1.0922724  -6.871 6.38e-12 ***
## SexM                    0.0475814  0.3125898   0.152 0.879016    
## Age                     0.0423248  0.0514294   0.823 0.410526    
## Psychiatric_HistoryYes -0.0327524  0.3447867  -0.095 0.924320    
## Numb_Prev_Concuss       0.1188643  0.1127340   1.054 0.291709    
## Therapy_HistoryYes     -0.3254394  0.3974621  -0.819 0.412904    
## Number_of_Symptoms      0.1935829  0.1395477   1.387 0.165376    
## Symptom_Duration        0.0006807  0.0003919   1.737 0.082363 .  
## Services_Used           1.5541811  0.1863528   8.340  < 2e-16 ***
## New_MedsYes             1.1299791  0.3318910   3.405 0.000662 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 460.82  on 460  degrees of freedom
## Residual deviance: 279.19  on 451  degrees of freedom
## AIC: 299.19
## 
## Number of Fisher Scoring iterations: 6
## Waiting for profiling to be done...
##                                  OR        2.5 %      97.5 %
## (Intercept)            0.0005503427 5.681446e-05 0.004164988
## SexM                   1.0487315231 5.669448e-01 1.938152273
## Age                    1.0432332345 9.442848e-01 1.156024735
## Psychiatric_HistoryYes 0.9677781513 4.879261e-01 1.894508012
## Numb_Prev_Concuss      1.1262171155 9.009980e-01 1.402537705
## Therapy_HistoryYes     0.7222099306 3.326737e-01 1.594672160
## Number_of_Symptoms     1.2135899338 9.263303e-01 1.604263103
## Symptom_Duration       1.0006809680 9.998826e-01 1.001414247
## Services_Used          4.7312105817 3.368303e+00 7.010781964
## New_MedsYes            3.0955916767 1.634255e+00 6.027143615
## Analysis of Deviance Table
## 
## Model 1: Healthcare_Cost ~ Sex + Age + Psychiatric_History + Numb_Prev_Concuss + 
##     Therapy_History + Number_of_Symptoms + Symptom_Duration
## Model 2: Healthcare_Cost ~ Sex + Age + Psychiatric_History + Numb_Prev_Concuss + 
##     Therapy_History + Number_of_Symptoms + Symptom_Duration + 
##     Services_Used + New_Meds
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       453     433.41                          
## 2       451     279.19  2   154.22 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Likelihood ratio test
## 
## Model 1: Healthcare_Cost ~ Sex + Age + Psychiatric_History + Numb_Prev_Concuss + 
##     Therapy_History + Number_of_Symptoms + Symptom_Duration
## Model 2: Healthcare_Cost ~ Sex + Age + Psychiatric_History + Numb_Prev_Concuss + 
##     Therapy_History + Number_of_Symptoms + Symptom_Duration + 
##     Services_Used + New_Meds
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   8 -216.71                         
## 2  10 -139.59  2 154.23  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library("parameters")
## Registered S3 methods overwritten by 'parameters':
##   method                           from      
##   as.double.parameters_kurtosis    datawizard
##   as.double.parameters_skewness    datawizard
##   as.double.parameters_smoothness  datawizard
##   as.numeric.parameters_kurtosis   datawizard
##   as.numeric.parameters_skewness   datawizard
##   as.numeric.parameters_smoothness datawizard
##   print.parameters_distribution    datawizard
##   print.parameters_kurtosis        datawizard
##   print.parameters_skewness        datawizard
##   summary.parameters_kurtosis      datawizard
##   summary.parameters_skewness      datawizard
model_parameters(logit4)
## Parameter                 | Log-Odds |       SE |         95% CI |     z |      p
## ---------------------------------------------------------------------------------
## (Intercept)               |    -7.50 |     1.09 | [-9.78, -5.48] | -6.87 | < .001
## Sex [M]                   |     0.05 |     0.31 | [-0.57,  0.66] |  0.15 | 0.879 
## Age                       |     0.04 |     0.05 | [-0.06,  0.14] |  0.82 | 0.411 
## Psychiatric_History [Yes] |    -0.03 |     0.34 | [-0.72,  0.64] | -0.09 | 0.924 
## Numb_Prev_Concuss         |     0.12 |     0.11 | [-0.10,  0.34] |  1.05 | 0.292 
## Therapy_History [Yes]     |    -0.33 |     0.40 | [-1.10,  0.47] | -0.82 | 0.413 
## Number_of_Symptoms        |     0.19 |     0.14 | [-0.08,  0.47] |  1.39 | 0.165 
## Symptom_Duration          | 6.81e-04 | 3.92e-04 | [ 0.00,  0.00] |  1.74 | 0.082 
## Services_Used             |     1.55 |     0.19 | [ 1.21,  1.95] |  8.34 | < .001
## New_Meds [Yes]            |     1.13 |     0.33 | [ 0.49,  1.80] |  3.40 | < .001

Machine Learning Approach

Using repeated cross-validation, a final logistic regression model classified participants as high or low cost. The classification performance is provided through the ROC, sensitivity and specificity.

## Loading required package: caret
## 
## Attaching package: 'caret'
## The following object is masked from 'package:parameters':
## 
##     compare_models
## The following object is masked from 'package:survival':
## 
##     cluster
## The following object is masked from 'package:purrr':
## 
##     lift
##   parameter       ROC      Sens      Spec      ROCSD     SensSD    SpecSD
## 1      none 0.8897708 0.9322372 0.4831111 0.05187104 0.04320644 0.1478534

The ROC and sensitivity are considered good/excellent. The specificity is mediocre. The coefficients for all predictors and their odds ratios are the same.

Visualizing the ROC Curve and other performance metrics

## 
## Attaching package: 'MLeval'
## The following object is masked _by_ '.GlobalEnv':
## 
##     fit
## ***MLeval: Machine Learning Model Evaluation***
## Input: caret train function object
## Averaging probs.
## Group 1 type: repeatedcv
## Observations: 461
## Number of groups: 1
## Observations per group: 461
## Positive: high
## Negative: low
## Group: Group 1
## Positive: 92
## Negative: 369
## ***Performance Metrics***

## Group 1 Optimal Informedness = 0.6340874278308
## Group 1 AUC-ROC = 0.89

## $`Group 1`
##                Score        CI
## SENS           0.478 0.38-0.58
## SPEC           0.938 0.91-0.96
## MCC            0.472      <NA>
## Informedness   0.416      <NA>
## PREC           0.657 0.54-0.76
## NPV            0.878 0.84-0.91
## FPR            0.062      <NA>
## F1             0.553      <NA>
## TP            44.000      <NA>
## FP            23.000      <NA>
## TN           346.000      <NA>
## FN            48.000      <NA>
## AUC-ROC        0.890 0.84-0.94
## AUC-PR         0.620      <NA>
## AUC-PRG        0.230      <NA>

Visualizing the odds ratio and their confidence intervals in the fourth logistic regression.

## Loading required package: rms
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
## 
## Attaching package: 'rms'
## The following object is masked from 'package:lmtest':
## 
##     lrtest
## The following objects are masked from 'package:car':
## 
##     Predict, vif
## Logistic Regression Model
##  
##  rms::lrm(formula = Healthcare_Cost ~ Sex + Age + Psychiatric_History + 
##      Numb_Prev_Concuss + Therapy_History + Number_of_Symptoms + 
##      Symptom_Duration + Services_Used + New_Meds, data = logreg_df)
##  
##                         Model Likelihood    Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes          Indexes    
##  Obs           461    LR chi2     181.63    R2       0.515    C       0.901    
##   low          369    d.f.             9    g        2.837    Dxy     0.801    
##   high          92    Pr(> chi2) <0.0001    gr      17.056    gamma   0.802    
##  max |deriv| 1e-05                          gp       0.257    tau-a   0.257    
##                                             Brier    0.098                     
##  
##                          Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept               -7.5050 1.0923 -6.87  <0.0001 
##  Sex=M                    0.0476 0.3126  0.15  0.8790  
##  Age                      0.0423 0.0514  0.82  0.4105  
##  Psychiatric_History=Yes -0.0328 0.3448 -0.09  0.9243  
##  Numb_Prev_Concuss        0.1189 0.1127  1.05  0.2917  
##  Therapy_History=Yes     -0.3254 0.3975 -0.82  0.4129  
##  Number_of_Symptoms       0.1936 0.1396  1.39  0.1654  
##  Symptom_Duration         0.0007 0.0004  1.74  0.0824  
##  Services_Used            1.5542 0.1864  8.34  <0.0001 
##  New_Meds=Yes             1.1300 0.3319  3.40  0.0007  
## 
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Waiting for profiling to be done...
##                               2.5 %      97.5 %
## (Intercept)            5.681446e-05 0.004164988
## SexM                   5.669448e-01 1.938152273
## Age                    9.442848e-01 1.156024735
## Psychiatric_HistoryYes 4.879261e-01 1.894508012
## Numb_Prev_Concuss      9.009980e-01 1.402537705
## Therapy_HistoryYes     3.326737e-01 1.594672160
## Number_of_Symptoms     9.263303e-01 1.604263103
## Symptom_Duration       9.998826e-01 1.001414247
## Services_Used          3.368303e+00 7.010781964
## New_MedsYes            1.634255e+00 6.027143615