age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
58 | 0 | 0 | 100 | 248 | 0 | 0 | 122 | 0 | 1.0 | 1 | 0 | 2 | 1 |
single variable discriptive statistics
(a) Use appropriate descriptive statistics to summarise the variables, and include at least two types of graphical displays to support your summary
quietly import delimited heart
label define sex 1 "male" 0 "female", replace
label value sex sex
label define fbs 1 "true" 0 "false", replace
label value fbs fbs
label define target 0 "No disease" 1 "Disease", replace
label value target target
label define thal 1 "normal" 2 "fixed defect" 3 "reversible defect", replace
label value thal thal
label define exang 1 "yes" 0 "no", replace
label value exang exang
quietly save heart , replace
quietly import delimited heart
factor(exang sex target thal fbs restecg cp ca slope) title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace) dtable, continuous(age trestbps chol thalach oldpeak)
Table1
-------------------------
Summary
-------------------------
N 1,025
age 54.434 (9.072)
trestbps 131.612 (17.517)
chol 246.000 (51.593)
thalach 149.114 (23.006)
oldpeak 1.072 (1.175)
exang
0 680 (66.3%)
1 345 (33.7%)
sex
0 312 (30.4%)
1 713 (69.6%)
target
0 499 (48.7%)
1 526 (51.3%)
thal
0 7 (0.7%)
1 64 (6.2%)
2 544 (53.1%)
3 410 (40.0%)
fbs
0 872 (85.1%)
1 153 (14.9%)
restecg
0 497 (48.5%)
1 513 (50.0%)
2 15 (1.5%)
cp
0 497 (48.5%)
1 167 (16.3%)
2 284 (27.7%)
3 77 (7.5%)
ca
0 578 (56.4%)
1 226 (22.0%)
2 134 (13.1%)
3 69 (6.7%)
4 18 (1.8%)
slope
0 74 (7.2%)
1 482 (47.0%)
2 469 (45.8%)
-------------------------
(collection DTable exported to file Statin_table.docx)
- The table above shows frequencies and proportions for categorical data.
- Continuous data was summarised by means and standard deviations
use heart
quietly histogram age ,title("Histogram of Age") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph1 , replace)
quietly histogram trestbps ,title("Histogram of trestbps") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph2 , replace)
quietly histogram chol ,title("Histogram of chol") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph3 , replace)
quietly histogram thalach ,title("Histogram of thalach") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph4 , replace)
quietly histogram oldpeak ,title("Histogram of oldpeak") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph5 , replace)
graph pie, over(sex) name(graph6 , replace) title("distribution of sex")
graph pie, over(exang) name(graph7 , replace) title("distribution of exang")
graph hbar (count), over(cp) blabel(bar) name(graph8 , replace) title("distribution of cp")
graph hbar (count), over(ca) blabel(bar) name(graph9 , replace) title("distribution of ca")
graph hbar (count), over(fbs) blabel(bar) name(graph10 , replace) title("distribution of fbs")
graph hbar (count), over(thal) blabel(bar) name(graph11 , replace) title("distribution of thal")
graph hbar (count), over(restecg) blabel(bar) name(graph12 , replace) title("distribution of restecg")
graph hbar (count), over(slope) blabel(bar) name(graph13 , replace) title("distribution of slope")
graph pie, over(target) name(graph14 , replace) title("distribution of target")
graph combine graph1 graph2 graph3 graph4 graph5 graph6 graph7 graph8 graph9 graph10 graph11 graph12 graph13 graph14 ,title("")
quietly graph export all_graphs.svg, replace
Summary statistics by Outcome Variable
quietly import delimited heart
by(target, tests testnotes) continuous(age trestbps chol thalach oldpeak) factor(exang sex thal fbs restecg cp ca slope) title(Table1) titlestyles( font(, bold) ) export("target", as(docx) replace) dtable,
note: using test regress across levels of target for age, trestbps, chol,
thalach, and oldpeak.
note: using test pearson across levels of target for exang, sex, thal, fbs,
restecg, cp, ca, and slope.
Table1
------------------------------------------------------------------
target
0 1 Total Test
------------------------------------------------------------------
N 499 (48.7%) 526 (51.3%) 1,025 (100.0%)
age 56.569 (7.908) 52.409 (9.632) 54.434 (9.072) <0.001
trestbps 134.106 (18.577) 129.245 (16.112) 131.612 (17.517) <0.001
chol 251.293 (49.559) 240.979 (53.010) 246.000 (51.593) 0.001
thalach 139.130 (22.565) 158.586 (19.097) 149.114 (23.006) <0.001
oldpeak 1.600 (1.291) 0.570 (0.771) 1.072 (1.175) <0.001
exang
0 225 (45.1%) 455 (86.5%) 680 (66.3%) <0.001
1 274 (54.9%) 71 (13.5%) 345 (33.7%)
sex
0 86 (17.2%) 226 (43.0%) 312 (30.4%) <0.001
1 413 (82.8%) 300 (57.0%) 713 (69.6%)
thal
0 4 (0.8%) 3 (0.6%) 7 (0.7%) <0.001
1 43 (8.6%) 21 (4.0%) 64 (6.2%)
2 132 (26.5%) 412 (78.3%) 544 (53.1%)
3 320 (64.1%) 90 (17.1%) 410 (40.0%)
fbs
0 417 (83.6%) 455 (86.5%) 872 (85.1%) 0.188
1 82 (16.4%) 71 (13.5%) 153 (14.9%)
restecg
0 283 (56.7%) 214 (40.7%) 497 (48.5%) <0.001
1 204 (40.9%) 309 (58.7%) 513 (50.0%)
2 12 (2.4%) 3 (0.6%) 15 (1.5%)
cp
0 375 (75.2%) 122 (23.2%) 497 (48.5%) <0.001
1 33 (6.6%) 134 (25.5%) 167 (16.3%)
2 65 (13.0%) 219 (41.6%) 284 (27.7%)
3 26 (5.2%) 51 (9.7%) 77 (7.5%)
ca
0 163 (32.7%) 415 (78.9%) 578 (56.4%) <0.001
1 160 (32.1%) 66 (12.5%) 226 (22.0%)
2 113 (22.6%) 21 (4.0%) 134 (13.1%)
3 60 (12.0%) 9 (1.7%) 69 (6.7%)
4 3 (0.6%) 15 (2.9%) 18 (1.8%)
slope
0 46 (9.2%) 28 (5.3%) 74 (7.2%) <0.001
1 324 (64.9%) 158 (30.0%) 482 (47.0%)
2 129 (25.9%) 340 (64.6%) 469 (45.8%)
------------------------------------------------------------------
(collection DTable exported to file target.docx)
- Preliminary results from Chi-square tests of association and student T.tests indicate that at \(5\%\) level of significance, most variables are are associated with being diagnosed or not being diagnosed with the disease.
use heart
graph box age , over(target) title("distribution of Age by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph1 , replace)
graph box trestbps , over(target) title("distribution of trestbps by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph2 , replace)
graph box chol ,over(target) title("distribution of chol by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph3 , replace)
graph box thalach ,over(target) title("distribution of thalach by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph4 , replace)
graph box oldpeak ,over(target) title("distribution of oldpeak by target") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph5 , replace)
graph hbar ,over(target) over(sex) stack asyvars name(graph6 , replace) title("distribution of sex")
graph hbar ,over(target) over(exang) stack asyvars name(graph7 , replace) title("distribution of exang")
graph hbar ,over(target) over(cp) stack asyvars name(graph8 , replace) title("distribution of cp")
graph hbar ,over(target) over(ca) stack asyvars name(graph9 , replace) title("distribution of ca")
graph hbar,over(target) over(fbs) stack asyvars name(graph10 , replace) title("distribution of fbs")
graph hbar ,over(target) over(thal) stack asyvars name(graph11 , replace) title("distribution of thal")
graph hbar ,over(target) over(restecg) stack asyvars name(graph12 , replace) title("distribution of restecg")
graph hbar ,over(target) over(slope) stack asyvars name(graph13 , replace) title("distribution of slope")
graph combine graph1 graph2 graph3 graph4 graph5 graph6 graph7 graph8 graph9 graph10 graph11 graph12 graph13 ,title("")
quietly graph export all_target.svg, replace
2. Simple logistic regression [20 marks]
a) Fit a simple logistic regression model using age as the only risk factor of heart disease status
A logistic regression model is a generalized linear model and is made of a linear predictor:
\[\underbrace{g(\mu_i)}_{Link~~function} = \underbrace{\beta_0 + \beta_1X_1~+~...~+~\beta_pX_p}_{Linear~component}\]
quietly import delimited heart
logit target age
Iteration 0: Log likelihood = -710.12021
Iteration 1: Log likelihood = -682.58876
Iteration 2: Log likelihood = -682.53284
Iteration 3: Log likelihood = -682.53284
Logistic regression Number of obs = 1,025
LR chi2(1) = 55.17
Prob > chi2 = 0.0000
Log likelihood = -682.53284 Pseudo R2 = 0.0388
------------------------------------------------------------------------------
target | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | -.0530036 .0073802 -7.18 0.000 -.0674686 -.0385386
_cons | 2.943003 .4084775 7.20 0.000 2.142402 3.743605
------------------------------------------------------------------------------
- the equation below shows a simple logistic regression model fit on age as the predictor.
\[ \begin{aligned} \log(\frac{\mu_i}{1-\mu_i})= 2.943003-.0530036Age \end{aligned}\]
(b) Report the estimated coefficient for age, and interpret its sign and meaning in the context of heart disease risk
- the model for the output above can be written as follows
\[ \begin{aligned} \text{logit}(\mu_i) &= \log(\frac{\mu_i}{1-\mu_i})\\ &= 2.943003-.0530036Age \end{aligned}\]
- The estimated coefficient for age is
-0.0530036
and it is negative - Here we see that for a 1 year increase in age , the log odds of being diagnosed with heart disease decrease significantly by
0.0530036
- Thus the chances of developing heart disease decrease with age
(c) Determine whether age is significantly associated with the probability of developing heart disease and justify your conclusion with appropriate statistical reasoning
- Age is significantly associated with probability of developing heart disease at 5% level of significance since the corresponding p-value is 0.000 and is less than
0.05
- the coefficient’s confidence is given by \([-0.0674686 ,-0.0385386]\) and does not contain zero hence the variable is significant.
(d) Based on the estimated coefficient, calculate the odds ratio associated with a one-year increase in age, and interpret its meaning in this context
\[ \begin{aligned} \log(\frac{\mu_i}{1-\mu_i})&= 2.943003-.0530036Age\\ \frac{\mu_i}{1-\mu_i}&=e^{2.943003-.0530036Age}\\ \frac{\mu_i}{1-\mu_i}&=e^{2.943003}.e^{-.0530036Age} \end{aligned}\]
- the odds associated with age is therefore \(e^{-.0530036}=0.9483766\) meaning that for a 1 year increase in age the odds of being diagnosed with heart disease decrease by \((1-0.9483766)*100\% \approx 5.16\%\)
Code below shows how to get the estimated odds ratios
quietly import delimited heart
logistic target age
Logistic regression Number of obs = 1,025
LR chi2(1) = 55.17
Prob > chi2 = 0.0000
Log likelihood = -682.53284 Pseudo R2 = 0.0388
------------------------------------------------------------------------------
target | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | .9483766 .0069993 -7.18 0.000 .9347571 .9621946
_cons | 18.97274 7.749939 7.20 0.000 8.51988 42.25001
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
- also note that we can find the probability of being diagnosed at a given age by the following probability formula.
\[ \begin{aligned} \hat{p}(x) &= \frac{exp^{\hat{\beta}_{0} + \hat{\beta}_{1}{age}}}{1 + exp^{\hat{\beta}_{0} + \hat{\beta}_{1}{age}}}\\ &= \frac{exp^{2.943003-.0530036Age}}{1 + exp^{2.943003-.0530036Age}} \end{aligned}\]
(e) Evaluate the fit of the model using an appropriate measure and report what the results indicate about how well the model fits the data
quietly import delimited heart
quietly logit target
est store model1
quietly logit target age
est store model2
lrtest model1 model2
Likelihood-ratio test
Assumption: model1 nested within model2
LR chi2(1) = 55.17
Prob > chi2 = 0.0000
- The likelihood ratio test above compares the null model with a model with age alone.
- the \(p-value<0.001\) indicates that the model with age is better fit compared to a model with no covariates at all.
- thus our current model fits the data better than a model with no predictors.
3. Model Selection and Evaluation [40 marks]
(a) in addition to age, select one categorical variable of your choice and develop a multiple logistic regression model (Model 2) that includes both the selected categorical variable and age. Report the coefficients and interpret the main effects
quietly import delimited heart
logit target age i.sex
Iteration 0: Log likelihood = -710.12021
Iteration 1: Log likelihood = -630.0552
Iteration 2: Log likelihood = -629.90565
Iteration 3: Log likelihood = -629.90562
Logistic regression Number of obs = 1,025
LR chi2(2) = 160.43
Prob > chi2 = 0.0000
Log likelihood = -629.90562 Pseudo R2 = 0.1130
------------------------------------------------------------------------------
target | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | -.0674349 .0080345 -8.39 0.000 -.0831823 -.0516875
1.sex | -1.533812 .1585261 -9.68 0.000 -1.844518 -1.223107
_cons | 4.819817 .4868406 9.90 0.000 3.865627 5.774007
------------------------------------------------------------------------------
- the model for the output above can be written as follows
\[ \begin{aligned} \text{logit}(\mu_i) &= \log(\frac{\mu_i}{1-\mu_i})\\ &= 4.819817-.0674349Age-1.533812Male \end{aligned}\]
- The estimated coefficient for age is
-.0674349
and it is negative - Here we see that for a 1 year increase in age , the log odds of being diagnosed with heart disease decrease significantly by
0.0674349
while adjusting for gender of the patient. - whilst Adjusting for Age, males have
1.533812
less log odds of being diagnosed with heart disease as compared to females. - the two variables are significant at 5% level of significance since they have p-values less than 5%
(b) Using the full set of potential risk factors, perform the best subset variable selection to identify the optimal set of risk factors based on the reported criteria. Based on each selection criterion, could you identify the best model? These are called Model 3, Model 4, and Model 5
quietly import delimited heart
ca slope, nmodels(1): logit target <term> gvselect <term> age trestbps chol thalach oldpeak exang sex thal fbs restecg cp
Optimal models:
# Preds LL AIC BIC
1 -597.6464 1199.293 1209.158
2 -510.5703 1027.141 1041.938
3 -462.3088 932.6176 952.3474
4 -433.2307 876.4613 901.1235
5 -411.6748 835.3496 864.9443
6 -392.3187 798.6374 833.1645
7 -380.2178 776.4355 815.8951
8 -372.8109 763.6218 808.0138
9 -366.9707 753.9414 803.2659
10 -362.3495 746.699 800.9559
11 -359.7491 743.4983 802.6877
12 -359.5109 745.0218 809.1436
13 -359.448 746.896 815.9503
predictors for each model:
1 : oldpeak
2 : cp oldpeak
3 : cp ca oldpeak
4 : cp ca sex oldpeak
5 : cp ca sex oldpeak exang
6 : cp ca sex thal oldpeak thalach
7 : cp ca sex thal oldpeak exang thalach
8 : cp ca sex thal oldpeak exang thalach trestbps
9 : cp ca sex thal oldpeak exang thalach trestbps chol
10 : cp ca sex thal oldpeak exang thalach trestbps slope chol
11 : cp ca sex thal oldpeak exang thalach trestbps slope chol restecg
12 : cp ca sex thal oldpeak exang thalach trestbps slope chol restecg age
13 : cp ca sex thal oldpeak exang thalach trestbps slope chol restecg age fbs
- Three models were selected based on AIC ,BIC and loglikelihood
Bayesian Information Criterion (BIC)
The model with the least BIC of 800.9559 has 10 variables
cp ca sex thal oldpeak exang thalach trestbps slope chol -> Model3
Akaike Information Criterion (AIC)
The model with the least Akaike information criterion of 743.4983 has 11 variables and given below
cp ca sex thal oldpeak exang thalach trestbps slope chol restecg -> Model4
Log likelihood
the model with the highest loglikelihood of -359.448 has 13 variables
cp ca sex thal oldpeak exang thalach trestbps slope chol restecg age fbs -> Model5
(c) Refit a logistic regression model using each of the selected subsets of predictors from part (b),resulting in three models: Model 3, Model 4, and Model 5. Report the fitted models and briefly interpret any differences in the retained variables
quietly import delimited heart
quietly logistic target cp ca sex thal oldpeak exang thalach trestbps slope chol
est store model3
quietly logistic target cp ca sex thal oldpeak exang thalach trestbps slope chol restecg
est store model4
quietly logistic target cp ca sex thal oldpeak exang thalach trestbps slope chol restecg age fbs
est store model5
esttab model3 model4 model5 ,se bic aic
(1) (2) (3)
target target target
------------------------------------------------------------
target
cp 0.832*** 0.846*** 0.855***
(0.0976) (0.0984) (0.100)
ca -0.765*** -0.766*** -0.754***
(0.101) (0.102) (0.103)
sex -1.883*** -1.834*** -1.847***
(0.254) (0.254) (0.257)
thal -0.849*** -0.877*** -0.886***
(0.150) (0.152) (0.156)
oldpeak -0.545*** -0.564*** -0.571***
(0.113) (0.115) (0.116)
exang -0.965*** -0.990*** -0.991***
(0.222) (0.223) (0.224)
thalach 0.0257*** 0.0251*** 0.0236***
(0.00522) (0.00521) (0.00568)
trestbps -0.0202*** -0.0193*** -0.0182**
(0.00540) (0.00544) (0.00562)
slope 0.569** 0.541** 0.534**
(0.185) (0.188) (0.189)
chol -0.00688*** -0.00594** -0.00567**
(0.00199) (0.00203) (0.00206)
restecg 0.428* 0.413*
(0.188) (0.189)
age -0.00820
(0.0126)
fbs -0.101
(0.285)
_cons 3.588** 3.181** 3.690**
(1.139) (1.159) (1.401)
------------------------------------------------------------
N 1025 1025 1025
AIC 746.7 743.5 746.9
BIC 801.0 802.7 816.0
------------------------------------------------------------
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001
- The models were fit and placed side by side
- The models are essentially different in the context of number of variables,One has 10 , other 11 and the other 13 variables
- both models differ in their AIC and BIC.
- In terms of the coefficients, the variables common to both models are all significant in their respective models.
- the coefficients are not too different in magnitude
(d) For each of the five models (Model 1 from 2a, Model 2 from 3a, and Models 3–5 from part 3c) generate the predicted probabilities of heart disease for each observation in the datase
quietly import delimited heart
quietly logistic target age
est store model1
predict model1, pr
list target model1 in 1/5
quietly logistic target age i.sex
est store model2
predict model2, pr
list target model2 in 1/5
quietly logistic target cp ca sex thal oldpeak exang thalach trestbps slope chol
est store model3
predict model3, pr
list target model3 in 1/5
quietly logistic target cp ca sex thal oldpeak exang thalach trestbps slope chol restecg
est store model4
predict model4, pr
list target model4 in 1/5
quietly logistic target cp ca sex thal oldpeak exang thalach trestbps slope chol restecg age fbs
est store model5
predict model5, pr
list target model5 in 1/5
// Compute ROC curve
quietly roctab target model1, graph summary title("model with age") name(roc1)
quietly roctab target model2, graph summary title("model with age and sex") name(roc2)
quietly roctab target model3, graph summary title("model with 10 variables") name(roc3)
quietly roctab target model4, graph summary title("model with 11 variables") name(roc4)
quietly roctab target model5, graph summary title("model with 13 variables") name(roc5)
graph combine roc1 roc2 roc3 roc4 roc5 , title("Combined roc curves")
quietly graph export all_combined.svg, replace
roccomp target model1 model2 model3 model4 model5, graph summary
quietly graph export all_models.svg, replace
| target model1 |
|-------------------|
1. | 0 .5465688 |
2. | 0 .5334034 |
3. | 0 .3170747 |
4. | 0 .427951 |
5. | 0 .4150276 |
+-------------------+
+-------------------+
| target model2 |
|-------------------|
1. | 0 .44507 |
2. | 0 .4284824 |
3. | 0 .192408 |
4. | 0 .3041687 |
5. | 0 .654494 |
+-------------------+
+-------------------+
| target model3 |
|-------------------|
1. | 0 .1899559 |
2. | 0 .0231344 |
3. | 0 .0156339 |
4. | 0 .3267112 |
5. | 0 .0491517 |
+-------------------+
+-------------------+
| target model4 |
|-------------------|
1. | 0 .208391 |
2. | 0 .0170882 |
3. | 0 .0176785 |
4. | 0 .3610301 |
5. | 0 .0611438 |
+-------------------+
+-------------------+
| target model5 |
|-------------------|
1. | 0 .2055684 |
2. | 0 .01548 |
3. | 0 .0158628 |
4. | 0 .3465444 |
5. | 0 .0588899 |
+-------------------+
ROC Asymptotic normal
Obs area Std. err. [95% conf. interval]
-------------------------------------------------------------------------
model1 1,025 0.6387 0.0174 0.60453 0.67289
model2 1,025 0.7156 0.0159 0.68448 0.74666
model3 1,025 0.9233 0.0081 0.90736 0.93931
model4 1,025 0.9246 0.0080 0.90887 0.94038
model5 1,025 0.9250 0.0080 0.90931 0.94073
-------------------------------------------------------------------------
H0: area(model1) = area(model2) = area(model3) = area(model4) = area(model5)
chi2(4) = 342.17 Prob>chi2 = 0.0000
After we fit our logistic regression models, we used them to calculate the probability that a given observation has a positive outcome i.e diagnosed with the disease, based on the values of the predictor variables.
The concordance statistics which is equal to to the AUC (area under curve) where calculated and the following criterion was used to choose the best model
- A value below 0.5 indicates a poor model.
- A value of 0.5 indicates that the model is no better out classifying outcomes than random chance.
- The closer the value is to 1, the better the model is at correctly classifying outcomes.
- A value of 1 means that the model is perfect at classifying outcomes.
To determine if an observation should be classified as positive (diagnosed with disease), a different cut-points are chosen such that observations with a fitted probability above the cut-point are classified as positive and any observations with a fitted probability below the cut-point are classified as negative.
The plotted ROC curves show us the values of sensitivity vs. 1-specificity as the value of the cut-off point moves from 0 to 1.
Models with with high sensitivity and high specificity end up having a ROC curve that hugs the top left corner of the plot
(Model 3, model 4 and model 5)
in our case.models with low sensitivity and low specificity have a curve that is close to the 45-degree diagonal line just like
model 1 and model 2
.The AUC (area under curve) values thus give us an idea of how well the models are able to distinguish between patients diagnosed with the disease and those not diagnosed. The AUC can range from 0 to 1. The higher the AUC, the better the model is at correctly classifying outcomes
In short the above results suggest that logistic regression
model 3 , model 4 and model 5
where good at picking out patients diagnosed with the disease, judging by their area under the ROC curve of approximately \(\approx 92.33\%,92.46\% ~and~ 92.5\%\) respectively.We also note that the table above also includes a test of significance , where the result shows that the difference in areas under the curves is statistically significant and is not a result of random chance.
However , overally the model with area under the curve of model 92.5% performs fairly better than all other models since it has a larger AUC value.
Goodness of fit Hosmer–Lemeshow
quietly import delimited heart
quietly logistic target cp ca sex thal oldpeak exang thalach trestbps slope chol restecg age fbs
est store model5
estat classification
estat gof , group(10)
Logistic model for target
-------- True --------
Classified | D ~D | Total
-----------+--------------------------+-----------
+ | 479 98 | 577
- | 47 401 | 448
-----------+--------------------------+-----------
Total | 526 499 | 1025
Classified + if predicted Pr(D) >= .5
True D defined as target != 0
--------------------------------------------------
Sensitivity Pr( +| D) 91.06%
Specificity Pr( -|~D) 80.36%
Positive predictive value Pr( D| +) 83.02%
Negative predictive value Pr(~D| -) 89.51%
--------------------------------------------------
False + rate for true ~D Pr( +|~D) 19.64%
False - rate for true D Pr( -| D) 8.94%
False + rate for classified + Pr(~D| +) 16.98%
False - rate for classified - Pr( D| -) 10.49%
--------------------------------------------------
Correctly classified 85.85%
--------------------------------------------------
note: obs collapsed on 10 quantiles of estimated probabilities.
Goodness-of-fit test after logistic model
Variable: target
Number of observations = 1,025
Number of groups = 10
Hosmer–Lemeshow chi2(8) = 27.01
Prob > chi2 = 0.0007
- the accuracy of the model is pretty fair
- however the
goodness of fit
testhosmer-Lemeshow
shows that there is lack of fit in the model \(Prob > chi2 = 0.0007\) ,With a sample size this large, you should probably ignore the result. The problem is simply that with N = 1025, a “significant” lack of fit can be a small discrepancy that is of no practical importance. - in line with that
goodness of fit
test a hypothesis that we already know is false and they don’t give us a metric that we can use to assess whether the model is good enough for our purpose. Instead we need a descriptive analysis of how good the fit is, and than your subjective assessment of whether that is good enough or not.
Section B: 30 marks
Descriptive statistics for single variables
quietly import delimited health_insurance
egen product2 = group(product), label
egen position_level2 = group(position_level), label
egen gender2 = group(gender), label
drop product
drop position_level
drop gender
rename product2 product
rename gender2 gender
rename position_level2 position_level
quietly save health_insurance , replace
factor(gender product position_level) title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace) dtable, continuous(age absent household)
Table1
-------------------------------------
Summary
-------------------------------------
N 1,453
age 40.919 (13.530)
absent 14.467 (8.114)
household 3.257 (2.231)
group(gender)
Female 889 (61.2%)
Male 559 (38.5%)
Non-binary 5 (0.3%)
group(product)
A 495 (34.1%)
B 459 (31.6%)
C 499 (34.3%)
group(position_level)
1 184 (12.7%)
2 404 (27.8%)
3 436 (30.0%)
4 231 (15.9%)
5 198 (13.6%)
-------------------------------------
(collection DTable exported to file Statin_table.docx)
Descriptive plots for single variables
use health_insurance
quietly histogram age ,title("Histogram of Age") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph1 , replace)
quietly histogram absent ,title("Histogram of absentism") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph2 , replace)
quietly histogram household ,title("Histogram of number of households") note( "Source : Data source unknown") ytitle("Fraction") normal ytitle("Density") name(graph3 , replace)
graph pie, over(position_level) name(graph4 , replace) title("distribution of Position level")
graph hbar (count), over(gender) blabel(bar) name(graph5 , replace) title("distribution of Gender")
graph hbar (count), over(product) blabel(bar) name(graph6 , replace) title("distribution of Products")
graph combine graph1 graph2 graph3 graph4 graph5 graph6 ,title("")
quietly graph export all_multi.svg, replace
Descriptive statistics by outcome variable
use health_insurance
by(product, tests testnotes) continuous(age absent household) factor(gender position_level) title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace) dtable,
note: using test regress across levels of product for age, absent, and
household.
note: using test pearson across levels of product for gender and
position_level.
Table1
-------------------------------------------------------------------------------------------
group(product)
A B C Total Test
-------------------------------------------------------------------------------------------
N 495 (34.1%) 459 (31.6%) 499 (34.3%) 1,453 (100.0%)
age 28.913 (5.158) 44.865 (13.756) 49.198 (10.346) 40.919 (13.530) <0.001
absent 14.451 (8.030) 14.394 (8.092) 14.549 (8.232) 14.467 (8.114) 0.956
household 3.642 (2.339) 1.462 (1.201) 4.527 (1.740) 3.257 (2.231) <0.001
group(gender)
Female 244 (49.3%) 404 (88.0%) 241 (48.3%) 889 (61.2%) <0.001
Male 249 (50.3%) 53 (11.5%) 257 (51.5%) 559 (38.5%)
Non-binary 2 (0.4%) 2 (0.4%) 1 (0.2%) 5 (0.3%)
group(position_level)
1 66 (13.3%) 59 (12.9%) 59 (11.8%) 184 (12.7%) 0.848
2 136 (27.5%) 133 (29.0%) 135 (27.1%) 404 (27.8%)
3 144 (29.1%) 146 (31.8%) 146 (29.3%) 436 (30.0%)
4 81 (16.4%) 65 (14.2%) 85 (17.0%) 231 (15.9%)
5 68 (13.7%) 56 (12.2%) 74 (14.8%) 198 (13.6%)
-------------------------------------------------------------------------------------------
(collection DTable exported to file Statin_table.docx)
Plots by Outcome variable
use health_insurance
graph box age ,over(product) title("boxplot of Age by product") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph1 , replace)
graph box absent ,over(product) title("boxplot of absentism by product") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph2 , replace)
graph box household ,over(product) title("boxplot of households by product ") note( "Source : Data source unknown") ytitle("Fraction") ytitle("Density") name(graph3 , replace)
graph hbar,over(product) over(position_level) stack asyvars name(graph4 , replace) title("distribution of Position level")
graph hbar ,over(product) over(gender) stack asyvars name(graph5 , replace) title("distribution of Gender")
graph combine graph1 graph2 graph3 graph4 graph5 ,title("")
quietly graph export all_multii.svg, replace
(b) Assess the nature of the product variable. Determine whether it is nominal, ordinal, or binary, and justify your classification
use health_insurance
tab product
group(produ |
ct) | Freq. Percent Cum.
------------+-----------------------------------
A | 495 34.07 34.07
B | 459 31.59 65.66
C | 499 34.34 100.00
------------+-----------------------------------
Total | 1,453 100.00
- The product variable has 3 levels which represents 3 product types (\(A,B ~and~ C\))
- Thus the variable is of nominal type because it represents distinct product types without an inherent order (i.e.,
A
is not objectively less thanB
andB
is not less thanC
).”
5. Model selection and evaluation [20 marks]
(a) The most suitable model for this analysis is Multinomial logistic regression
Log Odds and Odds Ratios
- The logit functions (log odds) when our outcome is a Product variable with (A,B and C) categories and the independent variable is age is shown below
- the log odds for comparison between B and A is
\[g_1(x) = ln\frac{P(product=B|age)} {P(product=A|age)}=\alpha_1 + (\beta_{11}age)\]
- and, the log odds for comparison between C and A is
\[g_2(x) = ln\frac{P(Product=C|age)} {P(Product=A|age)}=\alpha_2 + (\beta_{21}age)\] + Here the base category is product A
hence all inferences are done with reference to it
Simple model
use health_insurance
mlogit product age , baseoutcome(1)
mlogit product age , baseoutcome(1) rrr
Iteration 0: Log likelihood = -1595.2728
Iteration 1: Log likelihood = -1230.0067
Iteration 2: Log likelihood = -1175.2485
Iteration 3: Log likelihood = -1171.0613
Iteration 4: Log likelihood = -1171.0379
Iteration 5: Log likelihood = -1171.0379
Multinomial logistic regression Number of obs = 1,453
LR chi2(2) = 848.47
Prob > chi2 = 0.0000
Log likelihood = -1171.0379 Pseudo R2 = 0.2659
------------------------------------------------------------------------------
product | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
A | (base outcome)
-------------+----------------------------------------------------------------
B |
age | .1969927 .0122123 16.13 0.000 .173057 .2209283
_cons | -6.974368 .4166712 -16.74 0.000 -7.791029 -6.157708
-------------+----------------------------------------------------------------
C |
age | .2274424 .0125945 18.06 0.000 .2027576 .2521272
_cons | -8.323104 .4420157 -18.83 0.000 -9.189439 -7.456769
------------------------------------------------------------------------------
Iteration 0: Log likelihood = -1595.2728
Iteration 1: Log likelihood = -1230.0067
Iteration 2: Log likelihood = -1175.2485
Iteration 3: Log likelihood = -1171.0613
Iteration 4: Log likelihood = -1171.0379
Iteration 5: Log likelihood = -1171.0379
Multinomial logistic regression Number of obs = 1,453
LR chi2(2) = 848.47
Prob > chi2 = 0.0000
Log likelihood = -1171.0379 Pseudo R2 = 0.2659
------------------------------------------------------------------------------
product | RRR Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
A | (base outcome)
-------------+----------------------------------------------------------------
B |
age | 1.217735 .0148713 16.13 0.000 1.188934 1.247234
_cons | .0009356 .0003898 -16.74 0.000 .0004134 .0021171
-------------+----------------------------------------------------------------
C |
age | 1.255385 .015811 18.06 0.000 1.224776 1.28676
_cons | .0002428 .0001073 -18.83 0.000 .0001021 .0005775
------------------------------------------------------------------------------
Note: _cons estimates baseline relative risk for each outcome.
thus for the simple model from the stata output we have two logit models:
\[ \begin{aligned} g_1(x) &= ln\frac{P(Product=B|age)} {P(Product=A|age)}=-6.974368 + (0.1969927age)\\ RRR_1&=\frac{P(Product=B|age)} {P(Product=A|age)}=e^{-6.974368 + (0.1969927age)} \end{aligned}\]
Interpretation of coefficients
- for one unit change in the variable age, the multinomial log-odds for choosing Product B relative to Product A will be increased by 0.1969927, age is also statistically significant at 5% level of significance.
Interpretation of exponenciated coefficients
- for one unit change in the variable age, the relative risk ratio for choosing Product B relative to product A will be increased by \(e^{0.1969927}=1.21773\), age is also statistically significant at 5% level of significance.
\[ \begin{aligned} g_2(x) &= ln\frac{P(Product=C|age)} {P(Product=A|age)}=-8.323104 + (0.2274424age)\\ RRR_2&= \frac{P(Product=C|age)} {P(Product=A|age)}=e^{-8.323104 + (0.2274424age)} \end{aligned} \]
- for one unit change in the variable age, the multinomial log-odds for choosing Product C relative to Product A will be increased by 0.2274424, age is also statistically significant at 5% level of significance.
Interpretation of exponenciated coefficients
- for one unit change in the variable age, the relative risk ratio for choosing Product C relative to product A will be increased by a factor of \(e^{0.2274424}=1.255385\), age is also statistically significant at 5% level of significance.
Multivariable model
use health_insurance
ssc install fitstat
mlogit product age absent household i.gender i.position_level , baseoutcome(1)
mlogit product age absent household i.gender i.position_level , baseoutcome(1) rrr
checking fitstat consistency and verifying not already installed...
all files already exist and are up to date.
Iteration 0: Log likelihood = -1595.2728
Iteration 1: Log likelihood = -774.56045
Iteration 2: Log likelihood = -745.09229
Iteration 3: Log likelihood = -743.38342
Iteration 4: Log likelihood = -743.38057
Iteration 5: Log likelihood = -743.38057
Multinomial logistic regression Number of obs = 1,453
LR chi2(18) = 1703.78
Prob > chi2 = 0.0000
Log likelihood = -743.38057 Pseudo R2 = 0.5340
------------------------------------------------------------------------------
product | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
A | (base outcome)
-------------+----------------------------------------------------------------
B |
age | .2449861 .0154906 15.82 0.000 .2146251 .2753472
absent | .0137522 .0131766 1.04 0.297 -.0120734 .0395778
household | -.9662861 .0698785 -13.83 0.000 -1.103245 -.8293269
|
gender |
Male | -2.369476 .2334001 -10.15 0.000 -2.826932 -1.91202
Non-binary | .240953 1.214742 0.20 0.843 -2.139897 2.621803
|
position_l~l |
2 | -.2033913 .3119391 -0.65 0.514 -.8147807 .4079982
3 | -.6830315 .3107755 -2.20 0.028 -1.29214 -.0739228
4 | -1.510145 .4024609 -3.75 0.000 -2.298954 -.7213359
5 | -1.407559 .4024606 -3.50 0.000 -2.196368 -.6187512
|
_cons | -5.207851 .5493343 -9.48 0.000 -6.284527 -4.131176
-------------+----------------------------------------------------------------
C |
age | .2706643 .0157149 17.22 0.000 .2398636 .3014651
absent | .0046137 .0127058 0.36 0.717 -.0202892 .0295166
household | .2077009 .0497547 4.17 0.000 .1101834 .3052184
|
gender |
Male | .1085684 .1964055 0.55 0.580 -.2763794 .4935162
Non-binary | -1.293302 2.021219 -0.64 0.522 -5.254818 2.668215
|
position_l~l |
2 | -.1427916 .3048914 -0.47 0.640 -.7403678 .4547845
3 | -.3147303 .306651 -1.03 0.305 -.9157553 .2862946
4 | -.8247833 .3575154 -2.31 0.021 -1.525501 -.124066
5 | -.730075 .3794344 -1.92 0.054 -1.473753 .0136028
|
_cons | -10.54573 .6452422 -16.34 0.000 -11.81038 -9.281076
------------------------------------------------------------------------------
Iteration 0: Log likelihood = -1595.2728
Iteration 1: Log likelihood = -774.56045
Iteration 2: Log likelihood = -745.09229
Iteration 3: Log likelihood = -743.38342
Iteration 4: Log likelihood = -743.38057
Iteration 5: Log likelihood = -743.38057
Multinomial logistic regression Number of obs = 1,453
LR chi2(18) = 1703.78
Prob > chi2 = 0.0000
Log likelihood = -743.38057 Pseudo R2 = 0.5340
------------------------------------------------------------------------------
product | RRR Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
A | (base outcome)
-------------+----------------------------------------------------------------
B |
age | 1.277604 .0197908 15.82 0.000 1.239397 1.316988
absent | 1.013847 .013359 1.04 0.297 .9879992 1.040371
household | .3804935 .0265883 -13.83 0.000 .3317925 .4363429
|
gender |
Male | .0935297 .0218298 -10.15 0.000 .0591942 .1477815
Non-binary | 1.272461 1.545712 0.20 0.843 .1176669 13.76052
|
position_l~l |
2 | .8159589 .2545295 -0.65 0.514 .4427364 1.503804
3 | .5050835 .1569676 -2.20 0.028 .2746823 .9287434
4 | .220878 .0888948 -3.75 0.000 .1003638 .4861024
5 | .2447399 .0984981 -3.50 0.000 .1112064 .5386167
|
_cons | .0054734 .0030067 -9.48 0.000 .0018649 .016064
-------------+----------------------------------------------------------------
C |
age | 1.310835 .0205997 17.22 0.000 1.271076 1.351838
absent | 1.004624 .0127646 0.36 0.717 .9799153 1.029957
household | 1.230845 .0612404 4.17 0.000 1.116483 1.356921
|
gender |
Male | 1.114681 .2189296 0.55 0.580 .7585251 1.638066
Non-binary | .2743634 .5545485 -0.64 0.522 .0052223 14.41421
|
position_l~l |
2 | .8669347 .2643209 -0.47 0.640 .4769385 1.575834
3 | .7299857 .2238508 -1.03 0.305 .4002142 1.331485
4 | .43833 .1567097 -2.31 0.021 .2175122 .8833216
5 | .4818729 .1828392 -1.92 0.054 .2290642 1.013696
|
_cons | .0000263 .000017 -16.34 0.000 7.43e-06 .0000932
------------------------------------------------------------------------------
Note: _cons estimates baseline relative risk for each outcome.
Interpretation
Product B relative to Product A
\[ \begin{aligned} ln\frac{P(Product=B|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)}&=-5.207851 + (0.2449861age)+(0.0137522Absent)-.9662861household-2.36947Male-0.240953Nonbinary -0.2033913Position_2-0.6830315Position_3-1.510145Position_4-1.407559Position_5\\ \frac{P(Product=B|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)}&= e^{-5.207851 + (0.2449861age)+(0.0137522Absent)-.9662861household-2.36947Male-0.240953Nonbinary -0.2033913Position_2-0.6830315Position_3-1.510145Position_4-1.407559Position_5\\} \end{aligned}\]
- for one unit change in the variable age, the relative risk ratio for choosing Product B relative to product A will be increased significantly by a factor of \(1.277604\) while adjusting for other covariates .Age is statistically significant \((p<0.05)\)
- for a 1 day increase in absentism from previous year, the relative risk ratio for choosing product B relative to product A increase by a factor of \(1.013847\) while adjusting for other variables, however absentism is not statistically significant \((p>0.05)\)
- If one person is added to the number of individuals in the employee’s household , the Relative Risk ratio of choosing Product B relative to product A decrease by a factor of \(.3804935\) while adjusting for other covariates in the model. Number of individuals in the household is also statistically significant at 5% level of significance.
- The relative risk ratio for choosing Product B relative to product A decrease by (1-.0935297):91% for males as compared to females while adjusting for all other covariates.Gender is a statistically significant predictor for choice of insurance product.\((p<0.05)\)
- The relative Risk ratio for choosing product B relative to product A increase as position level increase from position level 2 , 3, 4 and 5 relative to position 1 while holding all covariates constant. In essence this means that :
- decrease in RRR for Product 2 vs product 1 ->\(0.1840411\)
- decrease in RRR for Product 3 vs product 1 ->\(0.4949165^{***}\)
- decrease in RRR for Product 4 vs product 1 ->\(0.779122^{***}\)
- decrease in RRR for Product 5 vs product 1 ->\(0.7552601^{***}\)
here we not that the starred comparisons are the only significant ones at 5% level of significance.
Product C relative to Product A
\[ \begin{aligned} ln\frac{P(Product=C|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)}&=-10.54573+(0.2706643age)+(0.0046137 Absent)+0.2077009household+0.1085684Male-1.293302Nonbinary -0.1427916Position_2-0.3147303Position_3-0.8247833Position_4-0.730075Position_5\\ \quad this~equates~to~RRR~as~follows~\\ \frac{P(Product=C|X_1,X_2,...,X_n)} {P(Product=A|X_1,X_2,...,X_n)}&=e^{-10.54573+(0.2706643age)+(0.0046137 Absent)+0.2077009household+0.1085684Male-1.293302Nonbinary -0.1427916Position_2-0.3147303Position_3-0.8247833Position_4-0.730075Position_5} \end{aligned}\]
- for one unit change in the variable age, the relative risk ratio for choosing Product C relative to product A will be increased significantly by a factor of 1.310835 while adjusting for other covariates .Age is statistically significant
- for a 1 day increase in absentism from previous year, the relative risk ratio for choosing product C relative to product A increase by a factor of 1.004624 while adjusting for other variables, however absentism is not statistically significant
- If one person is added to the number of individuals in the employee’s household , the Relative Risk ratio of choosing Product C relative to product A Increase by a factor of 1.230845 while adjusting for other covariates in the model. Number of individuals in the household is also statistically significant at 5% level of significance.
- The relative risk ratio for choosing Product C relative to product A Increase by (1.11468-1):11.5% for males as compared to females while adjusting for all other covariates. Gender is However not a statistically significant predictor for choice of insurance product here.
- The relative Risk ratio for choosing product C relative to product A decrease as position level decrease from position level 2 , 3, 4 and 5 relative to position 1 while holding all covariates constant. In essence this means that :
- decrease in RRR for Product 2 vs product 1 ->\(0.1330653\)
- decrease in RRR for Product 3 vs product 1 ->\(0.2700143\)
- decrease in RRR for Product 4 vs product 1 ->\(0.56167^{***}\)
- decrease in RRR for Product 5 vs product 1 ->\(0.5181271\)
here we not that the starred comparisons are the only significant ones at 5% level of significance.
(c) Evaluate and compare the overall fit of your models using appropriate statistical tests
Null model vs Age model
use health_insurance
quietly mlogit product, baseoutcome(1) rrr
est store model1
quietly mlogit product age , baseoutcome(1) rrr
est store model2
ratio test**
**Likelihood
stats(bic aic)
estout model1 model2 ,
lrtest model1 model2
model1 model2
b b
--------------------------------------
A
age 0
_cons 0 0
--------------------------------------
B
age .1969927
_cons -.0755076 -6.974368
--------------------------------------
C
age .2274424
_cons .0080483 -8.323104
--------------------------------------
bic 3205.108 2371.201
aic 3194.546 2350.076
--------------------------------------
Likelihood-ratio test
Assumption: model1 nested within model2
LR chi2(2) = 848.47
Prob > chi2 = 0.0000
- For the model with age , AIC and BIC reduced significantly indicating that the model with age performs better than a null model.
- Through the likelihood ratio test we see that the model with age performs better than the null model (model with no covariates) at 5% level of significance (\(p<0.001\))
Full model vs null model
use health_insurance
quietly mlogit product , baseoutcome(1) rrr
est store model1
quietly mlogit product age absent household i.gender i.position_level , baseoutcome(1) rrr
est store model2
ratio test**
**Likelihood
stats(bic aic)
estout model1 model2 ,
lrtest model1 model2
model1 model2
b b
--------------------------------------
A
age 0
absent 0
household 0
1.gender 0
2.gender 0
3.gender 0
1.position~l 0
2.position~l 0
3.position~l 0
4.position~l 0
5.position~l 0
_cons 0 0
--------------------------------------
B
age .2449861
absent .0137522
household -.9662861
1.gender 0
2.gender -2.369476
3.gender .240953
1.position~l 0
2.position~l -.2033913
3.position~l -.6830315
4.position~l -1.510145
5.position~l -1.407559
_cons -.0755076 -5.207851
--------------------------------------
C
age .2706643
absent .0046137
household .2077009
1.gender 0
2.gender .1085684
3.gender -1.293302
1.position~l 0
2.position~l -.1427916
3.position~l -.3147303
4.position~l -.8247833
5.position~l -.730075
_cons .0080483 -10.54573
--------------------------------------
bic 3205.108 1632.389
aic 3194.546 1526.761
--------------------------------------
Likelihood-ratio test
Assumption: model1 nested within model2
LR chi2(18) = 1703.78
Prob > chi2 = 0.0000
- For the model with all variables , AIC and BIC reduced significantly indicating that the full model performs better than a null model.
- Through the likelihood ratio test we see that the model with age performs better than the null model (model with no covariates) at 5% level of significance (\(p<0.001\))
(d) Which of the models would you consider the best, and justify your selection
use health_insurance
quietly mlogit product age, baseoutcome(1) rrr
est store model1
quietly mlogit product age absent household i.gender i.position_level , baseoutcome(1) rrr
est store model2
ratio test**
**Likelihood
lrtest model1 model2
Likelihood-ratio test
Assumption: model1 nested within model2
LR chi2(16) = 855.31
Prob > chi2 = 0.0000
- The results of the likelihood ratio tests can be used to ascertain the significance of predictors to the model. This table tells us that addition of more variables to age had significant main effects on Product choice,
\[\chi^2_{(16)} = 855.31, p = .000\]
use health_insurance
for model with age alone*
*Measures quietly mlogit product age, baseoutcome(1) rrr
save
fitstat ,
*Model with additional variablesquietly mlogit product age absent household i.gender i.position_level , baseoutcome(1) rrr
fitstat, dif
Measures of Fit for mlogit of product
Log-Lik Intercept Only: -1595.273 Log-Lik Full Model: -1171.038
D(1447): 2342.076 LR(2): 848.470
Prob > LR: 0.000
McFadden's R2: 0.266 McFadden's Adj R2: 0.262
Maximum Likelihood R2: 0.442 Cragg & Uhler's R2: 0.498
Count R2: 0.509 Adj Count R2: 0.255
AIC: 1.620 AIC*n: 2354.076
BIC: -8194.089 BIC': -833.907
(Indices saved in matrix fs_0)
Measures of Fit for mlogit of product
Log-Lik Intercept Only: -1595.273 Log-Lik Full Model: -743.381
D(1417): 1486.761 LR(18): 1703.784
Prob > LR: 0.000
McFadden's R2: 0.534 McFadden's Adj R2: 0.511
Maximum Likelihood R2: 0.690 Cragg & Uhler's R2: 0.777
Count R2: 0.566 Adj Count R2: 0.341
AIC: 1.073 AIC*n: 1558.761
BIC: -8830.962 BIC': -1572.719
- The table below and the one above show results for a full model and the one with a single predictor age respectively.
- We see that on fitting the full model the following significant changes which show model improvement were noted:
- log likelihood increased from \(-1171.038~ to ~-743.381\)
- Akaike information criterion decreased from \(2354.076 ~to~ 1558.761\)
- Bayesian Information criterion decreased from \(-833.907 ~to~ -1572.719\)
all these results suggest that adding more variables to the model improved model fit. Hence the model with all variables performs better and the best than the model with AGE alone.