15 March 2019

Smoking Likelihood - Model Coefficients

The logistic model employed is Smoke ~ Pulse * BMI + . * Pulse + . * BMI, where BMI is the body mass index and Pulse is modelled, in these slides, by the ‘best’-fitting pulse variable: the Increment in Active Pulse Rate, i.e. the difference between a subject’s active and resting pulse rates. For the fitted model we show a table of the fitted coefficients and a scatter plot of the subject distribution:

Coefficient         Estimate   95% CI              SE      Pr(>|z|)
(Intercept)         -10.241    (-23.428,0.895)     6.100   0.093
Pulse               0.160      (-0.081,0.469)      0.135   0.235
BMI                 0.646      (-0.042,1.429)      0.368   0.079
Exercise            -0.667     (-4.700,3.419)      2.026   0.742
GenderMale          1.278      (-8.103,10.890)     4.761   0.788
Pulse:BMI           -0.011     (-0.031,0.001)      0.008   0.137
Pulse:Exercise      0.004      (-0.048,0.055)      0.026   0.874
Pulse:GenderMale    0.047      (-0.042,0.146)      0.047   0.316
BMI:Exercise        -0.018     (-0.245,0.195)      0.110   0.868
BMI:GenderMale      -0.139     (-0.705,0.421)      0.282   0.623
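
The p-value column can be approximately reproduced from the Estimate and SE columns. The sketch below assumes the p-values are two-sided Wald tests (z = Estimate / SE against a standard normal), which is the usual default for logistic regression output but is not stated in the source. The names `COEFFICIENTS` and `wald_p_value` are illustrative. Because the table values are rounded to three decimals, agreement is only approximate, most visibly for small coefficients such as Pulse:BMI. The asymmetric confidence intervals in the table suggest they were not computed as Estimate ± 1.96 SE, so the sketch does not attempt to reproduce them.

```python
from math import erfc, sqrt

# (estimate, standard error) pairs copied from the coefficient table.
COEFFICIENTS = {
    "(Intercept)": (-10.241, 6.100),
    "Pulse": (0.160, 0.135),
    "BMI": (0.646, 0.368),
    "Exercise": (-0.667, 2.026),
    "GenderMale": (1.278, 4.761),
    "Pulse:BMI": (-0.011, 0.008),
    "Pulse:Exercise": (0.004, 0.026),
    "Pulse:GenderMale": (0.047, 0.047),
    "BMI:Exercise": (-0.018, 0.110),
    "BMI:GenderMale": (-0.139, 0.282),
}


def wald_p_value(estimate, std_error):
    """Two-sided p-value for z = estimate / std_error (assumed Wald test)."""
    z = estimate / std_error
    # For a standard normal Z: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2.0))


wald = {name: wald_p_value(est, se) for name, (est, se) in COEFFICIENTS.items()}

for name, p in wald.items():
    print(f"{name:<18} p = {p:.3f}")
```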

Smoking Likelihood - Deviance Goodness of Fit

For the fitted model we display the scatter plot of the residual deviance with marked outliers and a table detailing the goodness of fit p-value.

Metric                      Value
Residual Deviance Mean      -0.078
Residual Deviance Domain    [-3.730,29.529]
Residual Deviance IQR       (-1.142,-1.036)
Degrees of Freedom          158
Deviance GOF p-value        0.998
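
For reference, a deviance goodness-of-fit p-value of this kind is conventionally the upper-tail probability of the residual deviance under a chi-squared reference distribution with the listed degrees of freedom. Writing \(D\) for the residual deviance (its value is not reported here), \(d_i\) for the individual deviance residuals and \(\ell\) for the log-likelihood, and assuming the standard construction was used:

\[
D = \sum_{i} d_i^{2} = 2\,\bigl(\ell_{\text{saturated}} - \ell_{\text{fitted}}\bigr),
\qquad
p = \Pr\bigl(\chi^2_{158} \ge D\bigr).
\]

A p-value close to 1, as in the table, corresponds to a value of \(D\) well below the 158 degrees of freedom.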

Smoking Likelihood - Classification Outcomes

For the fitted model we display the confusion matrices and accuracy metrics for the in- and out-of-sample classification at the optimal Classification Threshold input parameter of 0.112. For this threshold we also display the corresponding threshold optimisation and ROC charts.
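
To make the role of the threshold concrete, the sketch below turns the fitted coefficients into a smoking probability for one subject and compares it with the threshold. Several things here are assumptions rather than facts from the source: Exercise is treated as a numeric score, a subject is labelled a smoker when the probability is at or above the threshold, and the example subject is hypothetical. The coefficients are the rounded values from the coefficient table, so the resulting probability is illustrative only.

```python
from math import exp

# Rounded estimates copied from the coefficient table.
BETA = {
    "(Intercept)": -10.241,
    "Pulse": 0.160,
    "BMI": 0.646,
    "Exercise": -0.667,
    "GenderMale": 1.278,
    "Pulse:BMI": -0.011,
    "Pulse:Exercise": 0.004,
    "Pulse:GenderMale": 0.047,
    "BMI:Exercise": -0.018,
    "BMI:GenderMale": -0.139,
}

THRESHOLD = 0.1120626  # threshold printed with the confusion matrices


def linear_predictor(pulse, bmi, exercise, male):
    """Log-odds of smoking; `exercise` assumed numeric, `male` is 0 or 1."""
    return (
        BETA["(Intercept)"]
        + BETA["Pulse"] * pulse
        + BETA["BMI"] * bmi
        + BETA["Exercise"] * exercise
        + BETA["GenderMale"] * male
        + BETA["Pulse:BMI"] * pulse * bmi
        + BETA["Pulse:Exercise"] * pulse * exercise
        + BETA["Pulse:GenderMale"] * pulse * male
        + BETA["BMI:Exercise"] * bmi * exercise
        + BETA["BMI:GenderMale"] * bmi * male
    )


def smoking_probability(pulse, bmi, exercise, male):
    """Inverse logit of the linear predictor."""
    eta = linear_predictor(pulse, bmi, exercise, male)
    return 1.0 / (1.0 + exp(-eta))


def classify(prob, threshold=THRESHOLD):
    """Assumed rule: 'Smoker' when the probability reaches the threshold."""
    return "Smoker" if prob >= threshold else "Non-smoker"


# Hypothetical subject, not taken from the data set.
subject = dict(pulse=20.0, bmi=25.0, exercise=2.0, male=1)
prob = smoking_probability(**subject)
label = classify(prob)
print(f"probability = {prob:.3f}, classified as {label}")
```

A threshold this low, close to the share of smokers in the sample, means that even modest fitted probabilities are classified as smokers.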

τ=0.1120626     Observed (in-sample)      Observed (out-of-sample)
Prediction      Non-smoker   Smoker       Non-smoker   Smoker
Smoker          54           15           19           6
Non-smoker      101          5            32           0

Metric                     In-sample          Out-of-sample
Accuracy                   33.71%             33.33%
95% Confidence Interval    (26.76%,41.24%)    (21.40%,47.06%)
No Information Rate        88.57%             89.47%
Type II Error Rate         78.26%             76.00%
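
The ratios in the metrics table can be re-derived from the four cell counts of each confusion matrix. The sketch below reads the counts exactly as printed, with rows as predictions and columns as observations, and takes the first pair of columns to be in-sample. The function and key names are illustrative. Under that reading, the No Information Rate is the share of the larger observed class, the reported accuracy (33.71%, 33.33%) coincides with the off-diagonal share, and the reported Type II Error Rate (78.26%, 76.00%) coincides with the share of observed non-smokers among predicted smokers. The source does not say how the original software defined these quantities or ordered its class labels, so the code uses neutral names rather than asserting a definition.

```python
S, N = "Smoker", "Non-smoker"

# (predicted, observed) -> count, copied from the confusion matrices as printed.
in_sample = {(S, N): 54, (S, S): 15, (N, N): 101, (N, S): 5}
out_of_sample = {(S, N): 19, (S, S): 6, (N, N): 32, (N, S): 0}


def summarise(cells):
    """Return simple ratios computed from a 2x2 table of counts."""
    total = sum(cells.values())
    observed_non_smokers = cells[(S, N)] + cells[(N, N)]
    observed_smokers = cells[(S, S)] + cells[(N, S)]
    predicted_smokers = cells[(S, N)] + cells[(S, S)]
    return {
        "n": total,
        # Share of the larger observed class (the null model's hit rate).
        "no_information_rate": max(observed_non_smokers, observed_smokers) / total,
        # Cells where the printed row and column labels agree.
        "diagonal_share": (cells[(S, S)] + cells[(N, N)]) / total,
        # Cells where the printed row and column labels disagree.
        "off_diagonal_share": (cells[(S, N)] + cells[(N, S)]) / total,
        # Observed non-smokers as a share of the predicted-smoker row.
        "non_smokers_among_predicted_smokers": cells[(S, N)] / predicted_smokers,
    }


for name, cells in (("in-sample", in_sample), ("out-of-sample", out_of_sample)):
    summary = summarise(cells)
    print(name)
    for key, value in summary.items():
        print(f"  {key}: {value if key == 'n' else format(value, '.2%')}")
```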

Smoking Likelihood - Conclusions

  • The coefficients are generally not significant; the exception is the fit using the ‘Increment in Active Pulse Rate’ predictor, and even there significance is reached only at the 10% level. In most models even the intercept is non-significant.

  • The only predictor with any real significance (at the 10% level) in determining whether a subject may be a smoker is body mass index!

  • The deviance goodness of fit is generally ‘good’ - but only relative to the saturated model. However, there is clear heteroscedasticity in the residual distribution, so it is unlikely that the residual deviance can be assumed to follow a \(\chi^2_{158}\) distribution.

  • The heteroscedasticity suggests that the model is missing an important predictor for classifying a smoking outcome.

  • For all classification thresholds the accuracy is, to be blunt, laughable, and all fits exhibit high Type II Error rates. Even the No Information Rate of the null model is considerably better than the fitted model’s accuracy and always lies above the upper bound of the accuracy’s 95% confidence interval.



Therefore, whilst ‘many’ appear happy to use smoking as a predictor of some health metric, from this data it’s pretty clear that using those health metrics to predict whether a subject is a smoker or not is an exercise of, to be polite, dubious value and quality.