Biostatistics for Health Researchers II: Formative Assessment II

Author
Affiliation

Bongani Ncube(3002164)

University Of the Witwatersrand (School of Public Health)

Published

May 16, 2025

Keywords

Logistic Regression, Statistical modeling, Sensitivity analysis, logit, Risk Ratios

About the dataset
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
52 1 0 125 212 0 1 168 0 1.0 2 2 3 0
53 1 0 140 203 1 0 155 1 3.1 0 0 3 0
70 1 0 145 174 0 1 125 1 2.6 0 0 3 0
61 1 0 148 203 0 1 161 0 0.0 2 1 3 0
62 0 0 138 294 1 1 106 0 1.9 1 3 2 0
58 0 0 100 248 0 0 122 0 1.0 1 0 2 1

Single variable descriptive statistics

(a) Use appropriate descriptive statistics to summarise the variables, and include at least two types of graphical displays to support your summary

quietly import delimited heart

label define sex 1 "male" 0 "female", replace
label value sex sex
label define fbs 1 "true" 0 "false", replace
label value fbs fbs
label define target 0 "No disease" 1 "Disease", replace
label value target target
label define thal 1 "normal" 2 "fixed defect" 3 "reversible defect", replace
label value thal thal
label define exang 1 "yes" 0 "no", replace
label value exang exang

quietly save heart , replace
quietly import delimited heart

dtable, continuous(age trestbps chol thalach oldpeak) factor(exang sex target thal fbs restecg cp ca slope)  title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace)
Table1
-------------------------
              Summary    
-------------------------
N                   1,025
age        54.434 (9.072)
trestbps 131.612 (17.517)
chol     246.000 (51.593)
thalach  149.114 (23.006)
oldpeak     1.072 (1.175)
exang                    
  0           680 (66.3%)
  1           345 (33.7%)
sex                      
  0           312 (30.4%)
  1           713 (69.6%)
target                   
  0           499 (48.7%)
  1           526 (51.3%)
thal                     
  0              7 (0.7%)
  1             64 (6.2%)
  2           544 (53.1%)
  3           410 (40.0%)
fbs                      
  0           872 (85.1%)
  1           153 (14.9%)
restecg                  
  0           497 (48.5%)
  1           513 (50.0%)
  2             15 (1.5%)
cp                       
  0           497 (48.5%)
  1           167 (16.3%)
  2           284 (27.7%)
  3             77 (7.5%)
ca                       
  0           578 (56.4%)
  1           226 (22.0%)
  2           134 (13.1%)
  3             69 (6.7%)
  4             18 (1.8%)
slope                    
  0             74 (7.2%)
  1           482 (47.0%)
  2           469 (45.8%)
-------------------------
(collection DTable exported to file Statin_table.docx)
Comments
  • The table above shows frequencies and proportions for categorical data.
  • Continuous variables were summarised by means and standard deviations.
use heart 

quietly histogram age ,title("Histogram of Age") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph1 , replace)

quietly histogram trestbps ,title("Histogram of trestbps") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph2 , replace)

quietly histogram chol ,title("Histogram of chol") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph3 , replace)

quietly histogram thalach ,title("Histogram of thalach") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph4 , replace)

quietly histogram oldpeak ,title("Histogram of oldpeak") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph5 , replace)

graph pie, over(sex)  name(graph6 , replace) title("distribution of sex")

graph pie, over(exang) name(graph7 , replace) title("distribution of exang")

graph hbar (count), over(cp) blabel(bar) name(graph8 , replace) title("distribution of cp")

graph hbar (count), over(ca) blabel(bar) name(graph9 , replace) title("distribution of ca")

graph hbar (count), over(fbs) blabel(bar) name(graph10 , replace) title("distribution of fbs")

graph hbar (count), over(thal) blabel(bar) name(graph11 , replace) title("distribution of thal")

graph hbar (count), over(restecg) blabel(bar) name(graph12 , replace) title("distribution of restecg")

graph hbar (count), over(slope) blabel(bar) name(graph13 , replace) title("distribution of slope")

graph pie, over(target) name(graph14 , replace) title("distribution of target")

graph combine graph1 graph2 graph3 graph4 graph5 graph6 graph7 graph8 graph9 graph10 graph11 graph12 graph13 graph14  ,title("")  

quietly graph export all_graphs.svg, replace

[Figure: combined panel of descriptive graphs — histograms of age, trestbps, chol, thalach and oldpeak; pie charts of sex, exang and target; bar charts of cp, ca, fbs, thal, restecg and slope. Source: data source unknown.]

Summary statistics by Outcome Variable

quietly import delimited heart

dtable,by(target, tests testnotes) continuous(age trestbps chol thalach oldpeak) factor(exang sex thal fbs restecg cp ca slope)  title(Table1) titlestyles( font(, bold) ) export("target", as(docx) replace)
note: using test regress across levels of target for age, trestbps, chol,
      thalach, and oldpeak.
note: using test pearson across levels of target for exang, sex, thal, fbs,
      restecg, cp, ca, and slope.

Table1
------------------------------------------------------------------
                                   target                         
                 0                1              Total       Test 
------------------------------------------------------------------
N             499 (48.7%)      526 (51.3%)   1,025 (100.0%)       
age        56.569 (7.908)   52.409 (9.632)   54.434 (9.072) <0.001
trestbps 134.106 (18.577) 129.245 (16.112) 131.612 (17.517) <0.001
chol     251.293 (49.559) 240.979 (53.010) 246.000 (51.593)  0.001
thalach  139.130 (22.565) 158.586 (19.097) 149.114 (23.006) <0.001
oldpeak     1.600 (1.291)    0.570 (0.771)    1.072 (1.175) <0.001
exang                                                             
  0           225 (45.1%)      455 (86.5%)      680 (66.3%) <0.001
  1           274 (54.9%)       71 (13.5%)      345 (33.7%)       
sex                                                               
  0            86 (17.2%)      226 (43.0%)      312 (30.4%) <0.001
  1           413 (82.8%)      300 (57.0%)      713 (69.6%)       
thal                                                              
  0              4 (0.8%)         3 (0.6%)         7 (0.7%) <0.001
  1             43 (8.6%)        21 (4.0%)        64 (6.2%)       
  2           132 (26.5%)      412 (78.3%)      544 (53.1%)       
  3           320 (64.1%)       90 (17.1%)      410 (40.0%)       
fbs                                                               
  0           417 (83.6%)      455 (86.5%)      872 (85.1%)  0.188
  1            82 (16.4%)       71 (13.5%)      153 (14.9%)       
restecg                                                           
  0           283 (56.7%)      214 (40.7%)      497 (48.5%) <0.001
  1           204 (40.9%)      309 (58.7%)      513 (50.0%)       
  2             12 (2.4%)         3 (0.6%)        15 (1.5%)       
cp                                                                
  0           375 (75.2%)      122 (23.2%)      497 (48.5%) <0.001
  1             33 (6.6%)      134 (25.5%)      167 (16.3%)       
  2            65 (13.0%)      219 (41.6%)      284 (27.7%)       
  3             26 (5.2%)        51 (9.7%)        77 (7.5%)       
ca                                                                
  0           163 (32.7%)      415 (78.9%)      578 (56.4%) <0.001
  1           160 (32.1%)       66 (12.5%)      226 (22.0%)       
  2           113 (22.6%)        21 (4.0%)      134 (13.1%)       
  3            60 (12.0%)         9 (1.7%)        69 (6.7%)       
  4              3 (0.6%)        15 (2.9%)        18 (1.8%)       
slope                                                             
  0             46 (9.2%)        28 (5.3%)        74 (7.2%) <0.001
  1           324 (64.9%)      158 (30.0%)      482 (47.0%)       
  2           129 (25.9%)      340 (64.6%)      469 (45.8%)       
------------------------------------------------------------------
(collection DTable exported to file target.docx)
Comments
  • Preliminary results from chi-square tests of association and Student's t-tests indicate that, at the \(5\%\) level of significance, most variables are associated with heart disease status (diagnosed vs not diagnosed).
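As a quick cross-check on the tests reported by dtable, the individual comparisons can also be run directly. A minimal sketch, assuming the heart dataset is still in memory (age and exang are only illustrative choices of variables):

// two-sample t-test of a continuous variable across disease status
ttest age, by(target)

// chi-square test of association for a categorical variable
tabulate exang target, chi2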
use heart 

graph box age , over(target) title("distribution of Age by target") note( "Source : Data source unknown") ytitle("Fraction")  ytitle("Density") name(graph1 , replace)

graph box trestbps , over(target) title("distribution of trestbps by target") note( "Source : Data source unknown") ytitle("Fraction")  ytitle("Density") name(graph2 , replace)

graph box chol ,over(target) title("distribution of chol by target") note( "Source : Data source unknown") ytitle("Fraction")  ytitle("Density") name(graph3 , replace)

graph box thalach ,over(target) title("distribution of thalach by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph4 , replace)

graph box oldpeak ,over(target) title("distribution of oldpeak by target") note( "Source : Data source unknown") ytitle("Fraction")  ytitle("Density") name(graph5 , replace)

graph hbar ,over(target) over(sex) stack asyvars name(graph6 , replace) title("distribution of sex")

graph hbar ,over(target) over(exang) stack asyvars name(graph7 , replace) title("distribution of exang")

graph hbar ,over(target) over(cp) stack asyvars name(graph8 , replace) title("distribution of cp")

graph hbar ,over(target) over(ca)  stack asyvars name(graph9 , replace) title("distribution of ca")

graph hbar,over(target) over(fbs) stack asyvars name(graph10 , replace) title("distribution of fbs")

graph hbar ,over(target) over(thal) stack asyvars name(graph11 , replace) title("distribution of thal")

graph hbar ,over(target) over(restecg) stack asyvars name(graph12 , replace) title("distribution of restecg")

graph hbar ,over(target) over(slope)  stack asyvars name(graph13 , replace) title("distribution of slope")



graph combine graph1 graph2 graph3 graph4 graph5 graph6 graph7 graph8 graph9 graph10 graph11 graph12 graph13  ,title("") 

quietly graph export all_target.svg, replace

[Figure: combined panel of graphs by target — box plots of age, trestbps, chol, thalach and oldpeak by target; stacked bar charts of sex, exang, cp, ca, fbs, thal, restecg and slope by target. Source: data source unknown.]

2. Simple logistic regression [20 marks]

a) Fit a simple logistic regression model using age as the only risk factor of heart disease status

Tip

A logistic regression model is a generalized linear model and is made of a linear predictor:

\[\underbrace{g(\mu_i)}_{Link~~function} = \underbrace{\beta_0 + \beta_1X_1~+~...~+~\beta_pX_p}_{Linear~component}\]

quietly import delimited heart

logit target age
Iteration 0:  Log likelihood = -710.12021  
Iteration 1:  Log likelihood = -682.58876  
Iteration 2:  Log likelihood = -682.53284  
Iteration 3:  Log likelihood = -682.53284  

Logistic regression                                     Number of obs =  1,025
                                                        LR chi2(1)    =  55.17
                                                        Prob > chi2   = 0.0000
Log likelihood = -682.53284                             Pseudo R2     = 0.0388

------------------------------------------------------------------------------
      target | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0530036   .0073802    -7.18   0.000    -.0674686   -.0385386
       _cons |   2.943003   .4084775     7.20   0.000     2.142402    3.743605
------------------------------------------------------------------------------
Note
  • The equation below shows the simple logistic regression model fitted with age as the only predictor.

\[ \begin{aligned} \log(\frac{\mu_i}{1-\mu_i})= 2.943003-.0530036Age \end{aligned}\]

(b) Report the estimated coefficient for age, and interpret its sign and meaning in the context of heart disease risk

Comments
  • the model for the output above can be written as follows

\[ \begin{aligned} \text{logit}(\mu_i) &= \log(\frac{\mu_i}{1-\mu_i})\\ &= 2.943003-.0530036Age \end{aligned}\]

  • The estimated coefficient for age is -0.0530036, i.e. it is negative.
  • For a one-year increase in age, the log odds of being diagnosed with heart disease decrease by 0.0530036.
  • Thus, in this dataset, the estimated odds of having heart disease decrease with age.

(c) Determine whether age is significantly associated with the probability of developing heart disease and justify your conclusion with appropriate statistical reasoning

  • Age is significantly associated with the probability of developing heart disease at the 5% level of significance, since the corresponding p-value (reported as 0.000, i.e. <0.001) is less than 0.05.
  • The 95% confidence interval for the coefficient, \([-0.0674686, -0.0385386]\), does not contain zero, which supports the same conclusion.

(d) Based on the estimated coefficient, calculate the odds ratio associated with a one-year increase in age, and interpret its meaning in this context

Calculating odds

\[ \begin{aligned} \log(\frac{\mu_i}{1-\mu_i})&= 2.943003-.0530036Age\\ \frac{\mu_i}{1-\mu_i}&=e^{2.943003-.0530036Age}\\ \frac{\mu_i}{1-\mu_i}&=e^{2.943003}.e^{-.0530036Age} \end{aligned}\]

  • The odds ratio associated with a one-year increase in age is therefore \(e^{-0.0530036}=0.9483766\), meaning that for each additional year of age the odds of being diagnosed with heart disease decrease by \((1-0.9483766)\times 100\% \approx 5.16\%\).
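As a quick check, the same odds ratio can be recovered from the stored coefficient after fitting the model; a minimal sketch (the ten-year contrast is an illustrative extra):

quietly logit target age
display "OR per 1-year increase in age:  " exp(_b[age])
display "OR per 10-year increase in age: " exp(10*_b[age])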

Code below shows how to get the estimated odds ratios

quietly import delimited heart

logistic target age
Logistic regression                                     Number of obs =  1,025
                                                        LR chi2(1)    =  55.17
                                                        Prob > chi2   = 0.0000
Log likelihood = -682.53284                             Pseudo R2     = 0.0388

------------------------------------------------------------------------------
      target | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .9483766   .0069993    -7.18   0.000     .9347571    .9621946
       _cons |   18.97274   7.749939     7.20   0.000      8.51988    42.25001
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
Tip
  • Note that the probability of being diagnosed at a given age can be obtained from the fitted model using the formula below.

\[ \begin{aligned} \hat{p}(x) &= \frac{e^{\hat{\beta}_{0} + \hat{\beta}_{1}{age}}}{1 + e^{\hat{\beta}_{0} + \hat{\beta}_{1}{age}}}\\ &= \frac{e^{2.943003-0.0530036Age}}{1 + e^{2.943003-0.0530036Age}} \end{aligned}\]
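For illustration, the fitted probability at a particular age can be computed directly from the coefficients or with margins; a minimal sketch, assuming the simple model has just been fit and using age 50 purely as an example value:

quietly logit target age
display "Pr(disease | age = 50) = " invlogit(_b[_cons] + _b[age]*50)

// the same predicted probability via margins
margins, at(age = 50)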

(e) Evaluate the fit of the model using an appropriate measure and report what the results indicate about how well the model fits the data

quietly import delimited heart
quietly logit target
est store model1
quietly logit target age
est store model2
lrtest model1 model2 
Likelihood-ratio test
Assumption: model1 nested within model2

 LR chi2(1) =  55.17
Prob > chi2 = 0.0000
Comments
  • The likelihood ratio test above compares the null model with a model with age alone.
  • The \(p\text{-value}<0.001\) indicates that the model with age is a better fit than the model with no covariates at all.
  • Thus our current model fits the data better than an intercept-only (null) model.
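Information criteria give a complementary summary of fit; a minimal sketch, assuming the age-only model is the active estimation result:

quietly logit target age
estat ic   // reports Akaike's and Schwarz's information criteria (AIC, BIC)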

3. Model Selection and Evaluation [40 marks]

(a) In addition to age, select one categorical variable of your choice and develop a multiple logistic regression model (Model 2) that includes both the selected categorical variable and age. Report the coefficients and interpret the main effects

quietly import delimited heart

logit target age i.sex
Iteration 0:  Log likelihood = -710.12021  
Iteration 1:  Log likelihood =  -630.0552  
Iteration 2:  Log likelihood = -629.90565  
Iteration 3:  Log likelihood = -629.90562  

Logistic regression                                     Number of obs =  1,025
                                                        LR chi2(2)    = 160.43
                                                        Prob > chi2   = 0.0000
Log likelihood = -629.90562                             Pseudo R2     = 0.1130

------------------------------------------------------------------------------
      target | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0674349   .0080345    -8.39   0.000    -.0831823   -.0516875
       1.sex |  -1.533812   .1585261    -9.68   0.000    -1.844518   -1.223107
       _cons |   4.819817   .4868406     9.90   0.000     3.865627    5.774007
------------------------------------------------------------------------------
Comments
  • the model for the output above can be written as follows

\[ \begin{aligned} \text{logit}(\mu_i) &= \log(\frac{\mu_i}{1-\mu_i})\\ &= 4.819817-.0674349Age-1.533812Male \end{aligned}\]

  • The estimated coefficient for age is -0.0674349, i.e. it is negative.
  • For a one-year increase in age, the log odds of being diagnosed with heart disease decrease by 0.0674349, adjusting for the sex of the patient.
  • Adjusting for age, males have log odds of being diagnosed with heart disease that are 1.533812 lower than those of females.
  • Both variables are significant at the 5% level of significance since their p-values are less than 0.05.
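The same model can be reported on the odds-ratio scale, which is often easier to interpret; a minimal sketch (identical fit, exponentiated coefficients):

logit target age i.sex, or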

(b) Using the full set of potential risk factors, perform the best subset variable selection to identify the optimal set of risk factors based on the reported criteria. Based on each selection criterion, could you identify the best model? These are called Model 3, Model 4, and Model 5

quietly import delimited heart

gvselect <term> age trestbps chol thalach oldpeak exang sex thal fbs restecg cp ca slope, nmodels(1): logit target <term>
Optimal models: 

   # Preds        LL       AIC       BIC
         1 -597.6464  1199.293  1209.158
         2 -510.5703  1027.141  1041.938
         3 -462.3088  932.6176  952.3474
         4 -433.2307  876.4613  901.1235
         5 -411.6748  835.3496  864.9443
         6 -392.3187  798.6374  833.1645
         7 -380.2178  776.4355  815.8951
         8 -372.8109  763.6218  808.0138
         9 -366.9707  753.9414  803.2659
        10 -362.3495   746.699  800.9559
        11 -359.7491  743.4983  802.6877
        12 -359.5109  745.0218  809.1436
        13  -359.448   746.896  815.9503

predictors for each model:

1 : oldpeak
2 : cp oldpeak
3 : cp ca oldpeak
4 : cp ca sex oldpeak
5 : cp ca sex oldpeak exang
6 : cp ca sex thal oldpeak thalach
7 : cp ca sex thal oldpeak exang thalach
8 : cp ca sex thal oldpeak exang thalach trestbps
9 : cp ca sex thal oldpeak exang thalach trestbps chol
10 : cp ca sex thal oldpeak exang thalach trestbps slope chol
11 : cp ca sex thal oldpeak exang thalach trestbps slope chol restecg
12 : cp ca sex thal oldpeak exang thalach trestbps slope chol restecg age
13 : cp ca sex thal oldpeak exang thalach trestbps slope chol restecg age fbs
Model selection based on AIC, BIC and log likelihood
  • Three candidate models were selected, one under each criterion: BIC, AIC and log likelihood.

Bayesian Information Criterion (BIC)

The model with the smallest BIC (800.9559) has 10 variables:

  1. cp ca sex thal oldpeak exang thalach trestbps slope chol -> Model3

Akaike Information Criterion (AIC)

The model with the smallest Akaike information criterion (743.4983) has 11 variables and is given below:

  1. cp ca sex thal oldpeak exang thalach trestbps slope chol restecg -> Model4

Log likelihood

The model with the largest log likelihood (-359.448) has 13 variables:

  1. cp ca sex thal oldpeak exang thalach trestbps slope chol restecg age fbs -> Model5
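As a check on how gvselect computes these criteria, with \(\ell\) the log likelihood, \(k\) the number of estimated parameters (predictors plus the intercept) and \(n = 1{,}025\), Model 4 (11 predictors, so \(k = 12\)) gives

\[ \begin{aligned} AIC &= -2\ell + 2k = -2(-359.7491) + 2(12) \approx 743.498\\ BIC &= -2\ell + k\ln(n) = 719.4982 + 12\ln(1025) \approx 802.688 \end{aligned}\]

which matches the values reported in the table above.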

(c) Refit a logistic regression model using each of the selected subsets of predictors from part (b), resulting in three models: Model 3, Model 4, and Model 5. Report the fitted models and briefly interpret any differences in the retained variables

quietly import delimited heart
quietly logistic target cp ca sex thal oldpeak exang thalach trestbps slope chol
est store model3

quietly logistic target cp ca sex thal oldpeak exang thalach trestbps slope chol restecg
est store model4

quietly logistic target cp ca sex thal oldpeak exang thalach trestbps slope chol restecg age fbs
est store model5

esttab model3 model4 model5 ,se bic aic 
                      (1)             (2)             (3)   
                   target          target          target   
------------------------------------------------------------
target                                                      
cp                  0.832***        0.846***        0.855***
                 (0.0976)        (0.0984)         (0.100)   

ca                 -0.765***       -0.766***       -0.754***
                  (0.101)         (0.102)         (0.103)   

sex                -1.883***       -1.834***       -1.847***
                  (0.254)         (0.254)         (0.257)   

thal               -0.849***       -0.877***       -0.886***
                  (0.150)         (0.152)         (0.156)   

oldpeak            -0.545***       -0.564***       -0.571***
                  (0.113)         (0.115)         (0.116)   

exang              -0.965***       -0.990***       -0.991***
                  (0.222)         (0.223)         (0.224)   

thalach            0.0257***       0.0251***       0.0236***
                (0.00522)       (0.00521)       (0.00568)   

trestbps          -0.0202***      -0.0193***      -0.0182** 
                (0.00540)       (0.00544)       (0.00562)   

slope               0.569**         0.541**         0.534** 
                  (0.185)         (0.188)         (0.189)   

chol             -0.00688***     -0.00594**      -0.00567** 
                (0.00199)       (0.00203)       (0.00206)   

restecg                             0.428*          0.413*  
                                  (0.188)         (0.189)   

age                                              -0.00820   
                                                 (0.0126)   

fbs                                                -0.101   
                                                  (0.285)   

_cons               3.588**         3.181**         3.690** 
                  (1.139)         (1.159)         (1.401)   
------------------------------------------------------------
N                    1025            1025            1025   
AIC                 746.7           743.5           746.9   
BIC                 801.0           802.7           816.0   
------------------------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001
Comments
  • The three models were fitted and are displayed side by side above.
  • The models differ mainly in the number of predictors: Model 3 has 10, Model 4 has 11 and Model 5 has 13.
  • The models also differ in their AIC and BIC values.
  • In terms of the coefficients, the variables common to all three models are significant in each of them; the additional variables in Model 5 (age and fbs) are not significant.
  • The coefficients shared across the models are similar in magnitude.

(d) For each of the five models (Model 1 from 2a, Model 2 from 3a, and Models 3–5 from part 3c) generate the predicted probabilities of heart disease for each observation in the dataset

quietly import delimited heart


quietly logistic target age
est store model1
predict model1, pr

list target model1 in 1/5

quietly logistic target age i.sex
est store model2
predict model2, pr
list target model2 in 1/5

quietly logistic target cp ca sex thal oldpeak exang thalach trestbps slope chol
est store model3
predict model3, pr
list target model3 in 1/5

quietly logistic target cp ca sex thal oldpeak exang thalach trestbps slope chol restecg
est store model4
predict model4, pr
list target model4 in 1/5

quietly logistic target cp ca sex thal oldpeak exang thalach trestbps slope chol restecg age fbs
est store model5
predict model5, pr
list target model5 in 1/5

// Compute ROC curve

quietly roctab target model1, graph summary title("model with age") name(roc1)

quietly roctab target model2, graph summary title("model with age and sex") name(roc2)

quietly roctab target model3, graph summary title("model with 10 variables") name(roc3)

quietly roctab target model4, graph summary title("model with 11 variables") name(roc4)

quietly roctab target model5, graph summary title("model with 13 variables") name(roc5)

graph combine roc1 roc2 roc3 roc4 roc5 , title("Combined roc curves")

quietly graph export all_combined.svg, replace


roccomp target model1 model2 model3 model4 model5, graph summary
quietly graph export all_models.svg, replace
     | target     model1 |
     |-------------------|
  1. |      0   .5465688 |
  2. |      0   .5334034 |
  3. |      0   .3170747 |
  4. |      0    .427951 |
  5. |      0   .4150276 |
     +-------------------+

     +-------------------+
     | target     model2 |
     |-------------------|
  1. |      0     .44507 |
  2. |      0   .4284824 |
  3. |      0    .192408 |
  4. |      0   .3041687 |
  5. |      0    .654494 |
     +-------------------+

     +-------------------+
     | target     model3 |
     |-------------------|
  1. |      0   .1899559 |
  2. |      0   .0231344 |
  3. |      0   .0156339 |
  4. |      0   .3267112 |
  5. |      0   .0491517 |
     +-------------------+

     +-------------------+
     | target     model4 |
     |-------------------|
  1. |      0    .208391 |
  2. |      0   .0170882 |
  3. |      0   .0176785 |
  4. |      0   .3610301 |
  5. |      0   .0611438 |
     +-------------------+

     +-------------------+
     | target     model5 |
     |-------------------|
  1. |      0   .2055684 |
  2. |      0     .01548 |
  3. |      0   .0158628 |
  4. |      0   .3465444 |
  5. |      0   .0588899 |
     +-------------------+

                              ROC                     Asymptotic normal
                   Obs       area     Std. err.      [95% conf. interval]
-------------------------------------------------------------------------
model1           1,025     0.6387       0.0174        0.60453     0.67289
model2           1,025     0.7156       0.0159        0.68448     0.74666
model3           1,025     0.9233       0.0081        0.90736     0.93931
model4           1,025     0.9246       0.0080        0.90887     0.94038
model5           1,025     0.9250       0.0080        0.90931     0.94073
-------------------------------------------------------------------------
H0: area(model1) = area(model2) = area(model3) = area(model4) = area(model5)
    chi2(4) =   342.17       Prob>chi2 =   0.0000

[Figure: individual ROC curves for the five models ("model with age", AUC = 0.6387; "model with age and sex", AUC = 0.7156; "model with 10 variables", AUC = 0.9233; "model with 11 variables", AUC = 0.9246; "model with 13 variables", AUC = 0.9250) and a combined roccomp plot of all five curves against the 45-degree reference line.]

Comments
  • After fitting the logistic regression models, we used each of them to calculate the probability that a given observation has a positive outcome (i.e. is diagnosed with the disease), based on the values of the predictor variables.

  • The concordance statistic, which is equal to the AUC (area under the curve), was calculated for each model, and the following criteria were used to choose the best model:

    1. A value below 0.5 indicates a poor model.
    2. A value of 0.5 indicates that the model is no better at classifying outcomes than random chance.
    3. The closer the value is to 1, the better the model is at correctly classifying outcomes.
    4. A value of 1 means that the model is perfect at classifying outcomes.
  • To determine whether an observation should be classified as positive (diagnosed with disease), a cut-point is chosen such that observations with a fitted probability above the cut-point are classified as positive and those with a fitted probability below it are classified as negative.

  • The plotted ROC curves show us the values of sensitivity vs. 1-specificity as the value of the cut-off point moves from 0 to 1.

  • Models with high sensitivity and high specificity have a ROC curve that hugs the top-left corner of the plot (Models 3, 4 and 5 in our case).

  • Models with low sensitivity and low specificity have a curve close to the 45-degree diagonal line, as seen for Models 1 and 2.

  • The AUC (area under curve) values thus give us an idea of how well the models are able to distinguish between patients diagnosed with the disease and those not diagnosed. The AUC can range from 0 to 1. The higher the AUC, the better the model is at correctly classifying outcomes

  • In short, the results above suggest that Models 3, 4 and 5 were good at identifying patients diagnosed with the disease, judging by their areas under the ROC curve of approximately \(92.33\%, 92.46\% ~and~ 92.5\%\) respectively.

  • The table above also includes a test of equality of the AUCs; the result indicates that the differences in areas under the curves are statistically significant rather than the product of random chance.

  • Overall, Model 5, with an AUC of about 92.5%, performs slightly better than the other models since it has the largest AUC.
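The AUC of a single fitted model can also be obtained directly after estimation; a minimal sketch for Model 5 (this is an in-sample AUC, so it will be somewhat optimistic):

quietly logistic target cp ca sex thal oldpeak exang thalach trestbps slope chol restecg age fbs
lroc, nograph   // reports the area under the ROC curve for the fitted model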

Goodness of fit Hosmer–Lemeshow

quietly import delimited heart

quietly logistic target cp ca sex thal oldpeak exang thalach trestbps slope chol restecg age fbs
est store model5


estat classification

estat gof , group(10)
Logistic model for target

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |       479            98  |        577
     -     |        47           401  |        448
-----------+--------------------------+-----------
   Total   |       526           499  |       1025

Classified + if predicted Pr(D) >= .5
True D defined as target != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   91.06%
Specificity                     Pr( -|~D)   80.36%
Positive predictive value       Pr( D| +)   83.02%
Negative predictive value       Pr(~D| -)   89.51%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   19.64%
False - rate for true D         Pr( -| D)    8.94%
False + rate for classified +   Pr(~D| +)   16.98%
False - rate for classified -   Pr( D| -)   10.49%
--------------------------------------------------
Correctly classified                        85.85%
--------------------------------------------------

note: obs collapsed on 10 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: target

 Number of observations =  1,025
       Number of groups =     10
Hosmer–Lemeshow chi2(8) =  27.01
            Prob > chi2 = 0.0007
Comment
  • The classification accuracy of the model (85.85% correctly classified) is reasonably good.
  • However, the Hosmer–Lemeshow goodness-of-fit test suggests some lack of fit (\(\text{Prob} > \chi^2 = 0.0007\)). With a sample size this large the result should be interpreted with caution: with N = 1,025, a "significant" lack of fit can correspond to a small discrepancy of no practical importance.
  • More generally, goodness-of-fit tests assess a null hypothesis that we already know is, strictly speaking, false, and they do not provide a metric for judging whether the model is good enough for our purpose. A descriptive assessment of how well the model fits, followed by a subjective judgement of whether that fit is adequate, is more informative.
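One simple descriptive follow-up is to see how the classification measures move as the cut-point changes; a minimal sketch, assuming Model 5 is still the active estimation result and using 0.3 and 0.7 purely as illustrative cut-offs:

estat classification, cutoff(0.3)
estat classification, cutoff(0.7)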

Section B: 30 marks

Descriptive statistics for single variables

quietly import delimited health_insurance 
egen product2 = group(product), label
egen position_level2 = group(position_level), label
egen gender2 = group(gender), label
drop product
drop position_level
drop gender
rename product2 product
rename gender2 gender
rename position_level2 position_level
quietly save health_insurance , replace
dtable, continuous(age absent household) factor(gender product position_level)  title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace)
Table1
-------------------------------------
                          Summary    
-------------------------------------
N                               1,453
age                   40.919 (13.530)
absent                 14.467 (8.114)
household               3.257 (2.231)
group(gender)                        
  Female                  889 (61.2%)
  Male                    559 (38.5%)
  Non-binary                 5 (0.3%)
group(product)                       
  A                       495 (34.1%)
  B                       459 (31.6%)
  C                       499 (34.3%)
group(position_level)                
  1                       184 (12.7%)
  2                       404 (27.8%)
  3                       436 (30.0%)
  4                       231 (15.9%)
  5                       198 (13.6%)
-------------------------------------
(collection DTable exported to file Statin_table.docx)

Descriptive plots for single variables

use health_insurance

quietly histogram age ,title("Histogram of Age") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph1 , replace)

quietly histogram absent ,title("Histogram of absentism") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph2 , replace)

quietly histogram household ,title("Histogram of number of households") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph3 , replace)

graph pie, over(position_level)  name(graph4 , replace) title("distribution of Position level")

graph hbar (count), over(gender) blabel(bar) name(graph5 , replace) title("distribution of Gender")


graph hbar (count), over(product) blabel(bar) name(graph6 , replace) title("distribution of Products")


graph combine graph1 graph2 graph3 graph4 graph5 graph6  ,title("")  

quietly graph export all_multi.svg, replace

[Figure: combined panel — histograms of age, absenteeism (absent) and household size; pie chart of position level; bar charts of gender and product. Source: data source unknown.]

Descriptive statistics by outcome variable

use health_insurance
dtable,by(product, tests testnotes) continuous(age absent household) factor(gender  position_level)  title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace)
note: using test regress across levels of product for age, absent, and
      household.
note: using test pearson across levels of product for gender and
      position_level.

Table1
-------------------------------------------------------------------------------------------
                                                  group(product)                           
                             A              B               C             Total       Test 
-------------------------------------------------------------------------------------------
N                        495 (34.1%)     459 (31.6%)     499 (34.3%)  1,453 (100.0%)       
age                   28.913 (5.158) 44.865 (13.756) 49.198 (10.346) 40.919 (13.530) <0.001
absent                14.451 (8.030)  14.394 (8.092)  14.549 (8.232)  14.467 (8.114)  0.956
household              3.642 (2.339)   1.462 (1.201)   4.527 (1.740)   3.257 (2.231) <0.001
group(gender)                                                                              
  Female                 244 (49.3%)     404 (88.0%)     241 (48.3%)     889 (61.2%) <0.001
  Male                   249 (50.3%)      53 (11.5%)     257 (51.5%)     559 (38.5%)       
  Non-binary                2 (0.4%)        2 (0.4%)        1 (0.2%)        5 (0.3%)       
group(position_level)                                                                      
  1                       66 (13.3%)      59 (12.9%)      59 (11.8%)     184 (12.7%)  0.848
  2                      136 (27.5%)     133 (29.0%)     135 (27.1%)     404 (27.8%)       
  3                      144 (29.1%)     146 (31.8%)     146 (29.3%)     436 (30.0%)       
  4                       81 (16.4%)      65 (14.2%)      85 (17.0%)     231 (15.9%)       
  5                       68 (13.7%)      56 (12.2%)      74 (14.8%)     198 (13.6%)       
-------------------------------------------------------------------------------------------
(collection DTable exported to file Statin_table.docx)

Plots by Outcome variable

use health_insurance

graph box age ,over(product) title("boxplot of Age by product") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph1 , replace)

graph box absent ,over(product) title("boxplot of absentism by product") note( "Source : Data source unknown") ytitle("Fraction")  ytitle("Density") name(graph2 , replace)

graph box household ,over(product) title("boxplot of households by product ") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph3 , replace)

graph hbar,over(product) over(position_level)  stack asyvars name(graph4 , replace) title("distribution of Position level")

graph hbar ,over(product) over(gender)  stack asyvars name(graph5 , replace) title("distribution of Gender")

graph combine graph1 graph2 graph3 graph4 graph5  ,title("")  
quietly graph export all_multii.svg, replace

[Figure: combined panel — box plots of age, absenteeism (absent) and household size by product; stacked bar charts of position level and gender by product. Source: data source unknown.]

(b) Assess the nature of the product variable. Determine whether it is nominal, ordinal, or binary, and justify your classification

Note
use health_insurance
tab product
group(produ |
        ct) |      Freq.     Percent        Cum.
------------+-----------------------------------
          A |        495       34.07       34.07
          B |        459       31.59       65.66
          C |        499       34.34      100.00
------------+-----------------------------------
      Total |      1,453      100.00
  • The product variable has 3 levels, which represent 3 product types (\(A, B ~and~ C\)).
  • Thus the variable is nominal, because it represents distinct product types without an inherent order (i.e., A is not objectively less than B and B is not less than C).

5. Model selection and evaluation [20 marks]

(a) The most suitable model for this analysis is multinomial logistic regression, since the outcome (product) has three unordered categories

Interpretations

Log Odds and Odds Ratios

  • The logit functions (log odds) when the outcome is the product variable with categories (A, B and C) and the independent variable is age are shown below.
  1. The log odds for the comparison of B with A are

\[g_1(x) = ln\frac{P(product=B|age)} {P(product=A|age)}=\alpha_1 + (\beta_{11}age)\]

  2. The log odds for the comparison of C with A are

\[g_2(x) = ln\frac{P(Product=C|age)} {P(Product=A|age)}=\alpha_2 + (\beta_{21}age)\]

  • Here the base category is product A, hence all inferences are made with reference to it.

Simple model

use health_insurance
mlogit product age , baseoutcome(1)
mlogit product age , baseoutcome(1) rrr
Iteration 0:  Log likelihood = -1595.2728  
Iteration 1:  Log likelihood = -1230.0067  
Iteration 2:  Log likelihood = -1175.2485  
Iteration 3:  Log likelihood = -1171.0613  
Iteration 4:  Log likelihood = -1171.0379  
Iteration 5:  Log likelihood = -1171.0379  

Multinomial logistic regression                         Number of obs =  1,453
                                                        LR chi2(2)    = 848.47
                                                        Prob > chi2   = 0.0000
Log likelihood = -1171.0379                             Pseudo R2     = 0.2659

------------------------------------------------------------------------------
     product | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
A            |  (base outcome)
-------------+----------------------------------------------------------------
B            |
         age |   .1969927   .0122123    16.13   0.000      .173057    .2209283
       _cons |  -6.974368   .4166712   -16.74   0.000    -7.791029   -6.157708
-------------+----------------------------------------------------------------
C            |
         age |   .2274424   .0125945    18.06   0.000     .2027576    .2521272
       _cons |  -8.323104   .4420157   -18.83   0.000    -9.189439   -7.456769
------------------------------------------------------------------------------


Iteration 0:  Log likelihood = -1595.2728  
Iteration 1:  Log likelihood = -1230.0067  
Iteration 2:  Log likelihood = -1175.2485  
Iteration 3:  Log likelihood = -1171.0613  
Iteration 4:  Log likelihood = -1171.0379  
Iteration 5:  Log likelihood = -1171.0379  

Multinomial logistic regression                         Number of obs =  1,453
                                                        LR chi2(2)    = 848.47
                                                        Prob > chi2   = 0.0000
Log likelihood = -1171.0379                             Pseudo R2     = 0.2659

------------------------------------------------------------------------------
     product |        RRR   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
A            |  (base outcome)
-------------+----------------------------------------------------------------
B            |
         age |   1.217735   .0148713    16.13   0.000     1.188934    1.247234
       _cons |   .0009356   .0003898   -16.74   0.000     .0004134    .0021171
-------------+----------------------------------------------------------------
C            |
         age |   1.255385    .015811    18.06   0.000     1.224776     1.28676
       _cons |   .0002428   .0001073   -18.83   0.000     .0001021    .0005775
------------------------------------------------------------------------------
Note: _cons estimates baseline relative risk for each outcome.

Thus, for the simple model, the Stata output gives two logit models:

\[ \begin{aligned} g_1(x) &= ln\frac{P(Product=B|age)} {P(Product=A|age)}=-6.974368 + (0.1969927age)\\ RRR_1&=\frac{P(Product=B|age)} {P(Product=A|age)}=e^{-6.974368 + (0.1969927age)} \end{aligned}\]

Interpretation of coefficients

  • For a one-unit (one-year) change in age, the multinomial log odds of choosing Product B relative to Product A increase by 0.1969927; age is statistically significant at the 5% level of significance.

Interpretation of exponentiated coefficients

  • For a one-unit (one-year) change in age, the relative risk ratio of choosing Product B relative to Product A increases by a factor of \(e^{0.1969927}=1.217735\); age is statistically significant at the 5% level of significance.

\[ \begin{aligned} g_2(x) &= ln\frac{P(Product=C|age)} {P(Product=A|age)}=-8.323104 + (0.2274424age)\\ RRR_2&= \frac{P(Product=C|age)} {P(Product=A|age)}=e^{-8.323104 + (0.2274424age)} \end{aligned} \]

  • For a one-unit (one-year) change in age, the multinomial log odds of choosing Product C relative to Product A increase by 0.2274424; age is statistically significant at the 5% level of significance.

Interpretation of exponentiated coefficients

  • For a one-unit (one-year) change in age, the relative risk ratio of choosing Product C relative to Product A increases by a factor of \(e^{0.2274424}=1.255385\); age is statistically significant at the 5% level of significance.
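These relative risk ratios can also be recovered directly from the stored coefficients; a minimal sketch, assuming the simple mlogit model above is the active estimate and that its equations are named by the outcome value labels (B and C), as shown in the output:

display "RRR of age, B vs A: " exp([B]_b[age])
display "RRR of age, C vs A: " exp([C]_b[age])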

Multivariable model

use health_insurance
ssc install fitstat
mlogit product age absent household i.gender i.position_level  , baseoutcome(1)
mlogit product age absent household i.gender i.position_level  , baseoutcome(1) rrr
checking fitstat consistency and verifying not already installed...
all files already exist and are up to date.


Iteration 0:  Log likelihood = -1595.2728  
Iteration 1:  Log likelihood = -774.56045  
Iteration 2:  Log likelihood = -745.09229  
Iteration 3:  Log likelihood = -743.38342  
Iteration 4:  Log likelihood = -743.38057  
Iteration 5:  Log likelihood = -743.38057  

Multinomial logistic regression                        Number of obs =   1,453
                                                       LR chi2(18)   = 1703.78
                                                       Prob > chi2   =  0.0000
Log likelihood = -743.38057                            Pseudo R2     =  0.5340

------------------------------------------------------------------------------
     product | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
A            |  (base outcome)
-------------+----------------------------------------------------------------
B            |
         age |   .2449861   .0154906    15.82   0.000     .2146251    .2753472
      absent |   .0137522   .0131766     1.04   0.297    -.0120734    .0395778
   household |  -.9662861   .0698785   -13.83   0.000    -1.103245   -.8293269
             |
      gender |
       Male  |  -2.369476   .2334001   -10.15   0.000    -2.826932    -1.91202
 Non-binary  |    .240953   1.214742     0.20   0.843    -2.139897    2.621803
             |
position_l~l |
          2  |  -.2033913   .3119391    -0.65   0.514    -.8147807    .4079982
          3  |  -.6830315   .3107755    -2.20   0.028     -1.29214   -.0739228
          4  |  -1.510145   .4024609    -3.75   0.000    -2.298954   -.7213359
          5  |  -1.407559   .4024606    -3.50   0.000    -2.196368   -.6187512
             |
       _cons |  -5.207851   .5493343    -9.48   0.000    -6.284527   -4.131176
-------------+----------------------------------------------------------------
C            |
         age |   .2706643   .0157149    17.22   0.000     .2398636    .3014651
      absent |   .0046137   .0127058     0.36   0.717    -.0202892    .0295166
   household |   .2077009   .0497547     4.17   0.000     .1101834    .3052184
             |
      gender |
       Male  |   .1085684   .1964055     0.55   0.580    -.2763794    .4935162
 Non-binary  |  -1.293302   2.021219    -0.64   0.522    -5.254818    2.668215
             |
position_l~l |
          2  |  -.1427916   .3048914    -0.47   0.640    -.7403678    .4547845
          3  |  -.3147303    .306651    -1.03   0.305    -.9157553    .2862946
          4  |  -.8247833   .3575154    -2.31   0.021    -1.525501    -.124066
          5  |   -.730075   .3794344    -1.92   0.054    -1.473753    .0136028
             |
       _cons |  -10.54573   .6452422   -16.34   0.000    -11.81038   -9.281076
------------------------------------------------------------------------------


Iteration 0:  Log likelihood = -1595.2728  
Iteration 1:  Log likelihood = -774.56045  
Iteration 2:  Log likelihood = -745.09229  
Iteration 3:  Log likelihood = -743.38342  
Iteration 4:  Log likelihood = -743.38057  
Iteration 5:  Log likelihood = -743.38057  

Multinomial logistic regression                        Number of obs =   1,453
                                                       LR chi2(18)   = 1703.78
                                                       Prob > chi2   =  0.0000
Log likelihood = -743.38057                            Pseudo R2     =  0.5340

------------------------------------------------------------------------------
     product |        RRR   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
A            |  (base outcome)
-------------+----------------------------------------------------------------
B            |
         age |   1.277604   .0197908    15.82   0.000     1.239397    1.316988
      absent |   1.013847    .013359     1.04   0.297     .9879992    1.040371
   household |   .3804935   .0265883   -13.83   0.000     .3317925    .4363429
             |
      gender |
       Male  |   .0935297   .0218298   -10.15   0.000     .0591942    .1477815
 Non-binary  |   1.272461   1.545712     0.20   0.843     .1176669    13.76052
             |
position_l~l |
          2  |   .8159589   .2545295    -0.65   0.514     .4427364    1.503804
          3  |   .5050835   .1569676    -2.20   0.028     .2746823    .9287434
          4  |    .220878   .0888948    -3.75   0.000     .1003638    .4861024
          5  |   .2447399   .0984981    -3.50   0.000     .1112064    .5386167
             |
       _cons |   .0054734   .0030067    -9.48   0.000     .0018649     .016064
-------------+----------------------------------------------------------------
C            |
         age |   1.310835   .0205997    17.22   0.000     1.271076    1.351838
      absent |   1.004624   .0127646     0.36   0.717     .9799153    1.029957
   household |   1.230845   .0612404     4.17   0.000     1.116483    1.356921
             |
      gender |
       Male  |   1.114681   .2189296     0.55   0.580     .7585251    1.638066
 Non-binary  |   .2743634   .5545485    -0.64   0.522     .0052223    14.41421
             |
position_l~l |
          2  |   .8669347   .2643209    -0.47   0.640     .4769385    1.575834
          3  |   .7299857   .2238508    -1.03   0.305     .4002142    1.331485
          4  |     .43833   .1567097    -2.31   0.021     .2175122    .8833216
          5  |   .4818729   .1828392    -1.92   0.054     .2290642    1.013696
             |
       _cons |   .0000263    .000017   -16.34   0.000     7.43e-06    .0000932
------------------------------------------------------------------------------
Note: _cons estimates baseline relative risk for each outcome.

Interpretation

Product B relative to Product A

\[ \begin{aligned} ln\frac{P(Product=B|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)}&=-5.207851 + 0.2449861age+0.0137522Absent-0.9662861household\\ &\quad-2.369476Male+0.240953Nonbinary-0.2033913Position_2-0.6830315Position_3\\ &\quad-1.510145Position_4-1.407559Position_5\\ \frac{P(Product=B|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)}&= e^{-5.207851 + 0.2449861age+0.0137522Absent-0.9662861household-2.369476Male+0.240953Nonbinary-0.2033913Position_2-0.6830315Position_3-1.510145Position_4-1.407559Position_5} \end{aligned}\]

  • For a one-unit (one-year) change in age, the relative risk ratio of choosing Product B relative to Product A increases significantly by a factor of \(1.277604\), adjusting for the other covariates. Age is statistically significant \((p<0.05)\).
  • For a one-day increase in absenteeism over the previous year, the relative risk ratio of choosing Product B relative to Product A increases by a factor of \(1.013847\), adjusting for the other variables; however, absenteeism is not statistically significant \((p>0.05)\).
  • If one person is added to the employee's household, the relative risk ratio of choosing Product B relative to Product A is multiplied by \(0.3804935\), i.e. it decreases, adjusting for the other covariates in the model. Household size is statistically significant at the 5% level of significance.
  • The relative risk of choosing Product B relative to Product A is about \((1-0.0935297)\times 100\% \approx 91\%\) lower for males than for females, adjusting for all other covariates. Gender is a statistically significant predictor of the choice of insurance product \((p<0.05)\).
  • The relative risk ratio of choosing Product B relative to Product A is lower for position levels 2 to 5 than for position level 1, with the reduction generally growing as position level rises, holding all other covariates constant. In essence this means:
  1. decrease in RRR for position level 2 vs level 1 -> \(0.1840411\)
  2. decrease in RRR for position level 3 vs level 1 -> \(0.4949165^{***}\)
  3. decrease in RRR for position level 4 vs level 1 -> \(0.779122^{***}\)
  4. decrease in RRR for position level 5 vs level 1 -> \(0.7552601^{***}\)

Here we note that the starred comparisons are the only ones significant at the 5% level of significance.

Product C relative to Product A

\[ \begin{aligned} ln\frac{P(Product=C|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)}&=-10.54573+(0.2706643age)+(0.0046137 Absent)+0.2077009household+0.1085684Male-1.293302Nonbinary -0.1427916Position_2-0.3147303Position_3-0.8247833Position_4-0.730075Position_5\\ \quad this~equates~to~RRR~as~follows~\\ \frac{P(Product=C|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)}&=e^{-10.54573+(0.2706643age)+(0.0046137 Absent)+0.2077009household+0.1085684Male-1.293302Nonbinary -0.1427916Position_2-0.3147303Position_3-0.8247833Position_4-0.730075Position_5} \end{aligned}\]

  • For a one-unit increase in age, the relative risk ratio of choosing Product C relative to Product A increases significantly by a factor of \(1.310835\), adjusting for the other covariates. Age is statistically significant \((p<0.05)\).
  • For a one-day increase in absenteeism from the previous year, the relative risk ratio of choosing Product C relative to Product A increases by a factor of \(1.004624\), adjusting for the other covariates; however, absenteeism is not statistically significant \((p>0.05)\).
  • For each additional person in the employee’s household, the relative risk ratio of choosing Product C relative to Product A increases by a factor of \(1.230845\), adjusting for the other covariates. Household size is statistically significant at the 5% level of significance.
  • The relative risk ratio of choosing Product C relative to Product A is higher for males than for females by about 11.5% \((\mathrm{RRR}=1.11468)\), adjusting for all other covariates. Here, however, gender is not a statistically significant predictor of insurance product choice.
  • The relative risk ratio of choosing Product C relative to Product A decreases as position level rises from level 2 through level 5, relative to position level 1, holding the other covariates constant. In essence, the proportional decreases in the RRR (i.e. \(1-\mathrm{RRR}\)) are:
  1. decrease in RRR for position level 2 vs position level 1 -> \(0.1330653\)
  2. decrease in RRR for position level 3 vs position level 1 -> \(0.2700143\)
  3. decrease in RRR for position level 4 vs position level 1 -> \(0.56167^{***}\)
  4. decrease in RRR for position level 5 vs position level 1 -> \(0.5181271\)

Here we note that the starred comparison (position level 4 vs position level 1) is the only one significant at the 5% level of significance. The two equations above also determine the fitted probabilities for all three products, as shown below.
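Writing \(\eta_B\) and \(\eta_C\) for the two linear predictors above, this is a standard restatement of the multinomial logit model with Product A as the base outcome:

\[
P(Product=A \mid X) = \frac{1}{1 + e^{\eta_B} + e^{\eta_C}}, \qquad
P(Product=B \mid X) = \frac{e^{\eta_B}}{1 + e^{\eta_B} + e^{\eta_C}}, \qquad
P(Product=C \mid X) = \frac{e^{\eta_C}}{1 + e^{\eta_B} + e^{\eta_C}}.
\]

After the mlogit fit, these fitted probabilities can be obtained in Stata with predict (e.g. predict prA prB prC, which defaults to the pr statistic, one variable per outcome).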

(c) Evaluate and compare the overall fit of your models using appropriate statistical tests

Null model vs Age model

use health_insurance
quietly mlogit product, baseoutcome(1) rrr

est store model1

quietly mlogit product age  , baseoutcome(1) rrr
est store model2

**Likelihood ratio test**
  
estout model1 model2 , stats(bic aic)  
  
lrtest model1 model2 
                   model1       model2
                        b            b
--------------------------------------
A                                     
age                                  0
_cons                   0            0
--------------------------------------
B                                     
age                           .1969927
_cons           -.0755076    -6.974368
--------------------------------------
C                                     
age                           .2274424
_cons            .0080483    -8.323104
--------------------------------------
bic              3205.108     2371.201
aic              3194.546     2350.076
--------------------------------------


Likelihood-ratio test
Assumption: model1 nested within model2

 LR chi2(2) = 848.47
Prob > chi2 = 0.0000
Note
  • For the model with age, both AIC and BIC decrease substantially, indicating that the model with age fits better than the null model.
  • The likelihood ratio test likewise shows that the model with age performs better than the null model (the model with no covariates) at the 5% level of significance \((p<0.001)\); the statistic follows directly from the maximised log-likelihoods, as shown below.
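For reference (using the log-likelihoods reported in the fitstat output in part (d)), the likelihood-ratio statistic is twice the gap between the two maximised log-likelihoods:

\[
LR = 2\left(\ell_{age} - \ell_{null}\right) = 2\left[-1171.038 - (-1595.273)\right] = 848.47, \qquad df = 2,
\]

with 2 degrees of freedom because age enters each of the two non-base equations (B vs A and C vs A).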

Full model vs null model

use health_insurance
quietly mlogit product , baseoutcome(1) rrr

est store model1

quietly mlogit product age absent household i.gender i.position_level  , baseoutcome(1) rrr
est store model2

**Likelihood ratio test**
  
estout model1 model2 , stats(bic aic)  
  
lrtest model1 model2 
                   model1       model2
                        b            b
--------------------------------------
A                                     
age                                  0
absent                               0
household                            0
1.gender                             0
2.gender                             0
3.gender                             0
1.position~l                         0
2.position~l                         0
3.position~l                         0
4.position~l                         0
5.position~l                         0
_cons                   0            0
--------------------------------------
B                                     
age                           .2449861
absent                        .0137522
household                    -.9662861
1.gender                             0
2.gender                     -2.369476
3.gender                       .240953
1.position~l                         0
2.position~l                 -.2033913
3.position~l                 -.6830315
4.position~l                 -1.510145
5.position~l                 -1.407559
_cons           -.0755076    -5.207851
--------------------------------------
C                                     
age                           .2706643
absent                        .0046137
household                     .2077009
1.gender                             0
2.gender                      .1085684
3.gender                     -1.293302
1.position~l                         0
2.position~l                 -.1427916
3.position~l                 -.3147303
4.position~l                 -.8247833
5.position~l                  -.730075
_cons            .0080483    -10.54573
--------------------------------------
bic              3205.108     1632.389
aic              3194.546     1526.761
--------------------------------------


Likelihood-ratio test
Assumption: model1 nested within model2

LR chi2(18) = 1703.78
Prob > chi2 =  0.0000
Note
  • For the model with all variables, both AIC and BIC decrease substantially relative to the null model, indicating that the full model fits better.
  • The likelihood ratio test likewise shows that the full model performs better than the null model (the model with no covariates) at the 5% level of significance \((p<0.001)\); the degrees of freedom and the statistic are spelled out below.
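The 18 degrees of freedom correspond to the nine added coefficients (age, absenteeism, household size, two gender contrasts and four position-level contrasts) in each of the two non-base equations, and the statistic again follows from the maximised log-likelihoods reported by fitstat in part (d):

\[
LR = 2\left(\ell_{full} - \ell_{null}\right) = 2\left[-743.381 - (-1595.273)\right] = 1703.78, \qquad df = 9 \times 2 = 18.
\]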

(d) Which of the models would you consider the best, and justify your selection

use health_insurance
quietly mlogit product age, baseoutcome(1) rrr

est store model1

quietly mlogit product age absent household i.gender i.position_level  , baseoutcome(1) rrr
est store model2

**Likelihood ratio test**
  
lrtest model1 model2 
Likelihood-ratio test
Assumption: model1 nested within model2

LR chi2(16) = 855.31
Prob > chi2 = 0.0000
Note
  • The likelihood ratio test compares the two nested models directly and can be used to assess the contribution of the additional predictors. It tells us that adding the remaining variables to the age-only model has significant main effects on product choice (the statistic is decomposed below):

\[\chi^2_{(16)} = 855.31,\; p < .001\]
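Equivalently, this statistic can be computed from the maximised log-likelihoods of the two nested models, and it is exactly the difference between the two likelihood-ratio statistics reported in part (c):

\[
LR = 2\left(\ell_{full} - \ell_{age}\right) = 2\left[-743.381 - (-1171.038)\right] = 855.31 = 1703.78 - 848.47, \qquad df = 18 - 2 = 16.
\]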

use health_insurance

* Measures of fit for the model with age alone
quietly mlogit product age, baseoutcome(1) rrr
fitstat , save

* Measures of fit for the model with additional variables
quietly mlogit product age absent household i.gender i.position_level  , baseoutcome(1) rrr
fitstat, dif
Measures of Fit for mlogit of product

Log-Lik Intercept Only:    -1595.273     Log-Lik Full Model:        -1171.038
D(1447):                    2342.076     LR(2):                       848.470
                                         Prob > LR:                     0.000
McFadden's R2:                 0.266     McFadden's Adj R2:             0.262
Maximum Likelihood R2:         0.442     Cragg & Uhler's R2:            0.498
Count R2:                      0.509     Adj Count R2:                  0.255
AIC:                           1.620     AIC*n:                      2354.076
BIC:                       -8194.089     BIC':                       -833.907

(Indices saved in matrix fs_0)



Measures of Fit for mlogit of product

Log-Lik Intercept Only:    -1595.273     Log-Lik Full Model:         -743.381
D(1417):                    1486.761     LR(18):                     1703.784
                                         Prob > LR:                     0.000
McFadden's R2:                 0.534     McFadden's Adj R2:             0.511
Maximum Likelihood R2:         0.690     Cragg & Uhler's R2:            0.777
Count R2:                      0.566     Adj Count R2:                  0.341
AIC:                           1.073     AIC*n:                      1558.761
BIC:                       -8830.962     BIC':                      -1572.719
Comments
  • The first fitstat table above shows the fit measures for the model with the single predictor age; the second shows those for the full model.
  • On fitting the full model, the following changes were noted, all of which indicate improved fit:
  1. the log likelihood increased (moved closer to zero) from \(-1171.038\) to \(-743.381\)
  2. Akaike information criterion decreased from \(2354.076 ~to~ 1558.761\)
  3. Bayesian Information criterion decreased from \(-833.907 ~to~ -1572.719\)

All of these results suggest that adding the extra variables improves model fit. Hence the model with all variables performs better than the model with age alone and is the best of the models considered.
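As a cross-check on the information criteria (smaller is better, with \(AIC = 2k - 2\ln L\) and \(BIC = k\ln n - 2\ln L\)), Stata's built-in estat ic can be run after each fit. This is a minimal sketch; estat ic may count parameters differently from the older fitstat command, so the absolute values need not match the fitstat figures above exactly, but the conclusion that the full model is preferred is unchanged.

use health_insurance

* information criteria for the age-only model
quietly mlogit product age, baseoutcome(1) rrr
estat ic

* information criteria for the full model
quietly mlogit product age absent household i.gender i.position_level, baseoutcome(1) rrr
estat ic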