The purpose of this analysis is to inform a prediction algorithm to identify patients in the KDH-TB study at highest risk of M. tuberculosis bacteraemia, using patient factors typically available on day of recruitment to the study. The analysis will not attempt to describe mechanistic relationships or inferences between patient variables and presence of bacteraemia. The specific objective is to produce an algorithm with >90% positive predictive value assuming a prevalence of MTB bacteraemia similar to that seen during recruitment to date (~27%).
The most established approach to predictive classification models in clinical science is to define a score derived from regression modelling (e.g. Frammingham CHD risk score, APACHE score, FRAX WHO fracture risk assessment tool, CURB65 score etc.). Potential limitations of this approach include the following:
Based on these possible limitations it has been hypothesised that the complexity of clinical decision making may be better captured by contemporary ‘machine learning’ techniques used in fields allied to clinical science, such as the ’omics and other data sciences. Therefore, a secondary objective of this analysis is to test a range of prediction models and variable selection strategies to a clinical dataset, and explore accuracy / complexity trade-offs.
A related third objective is to check the external validity of MTB bacteraemia predictors previously reported in other cohorts. Building a predictive model using variables previously found to be significant should (theoretically) limit over-fitting and type-1 error of the model. The analysis therefore starts with a review of the previous relevant literature.
Pubmed was searched using all feild terms [“mycobacteraemia”], [“mycobacteri* AND bacteraemia”], [“tuberculosis” AND “blood culture”] to identify cohort studies with:
Ten cohorts were identified, 8 in subSaharan Africa, several from the same centres (Malawi is overrepresented), majority recruited inpatients. Five gave multivariate adjusted estimates of associations (table). No studies validated their models ‘out of sample’, although one study did use cross-validation within sample (Jacob et al. 2013).
The independent variable associations in these studies are summarised below where yellow,1 = an unadjusted association, and red,2 = an association remaining significant in a multivariate model. A limitation of this summary is that none of the studies unambiguously reports a full description of which variables were examined but found non-significant (i.e. some variables will not have been tested in all the cohorts).
The study by Jacob et al. is of particular interest as it is the best reported, and developed a risk score using their model (meaning, in addition to the variable set, the coefficients associated with the variables can be tested in our cohort).
Based partly on this literature, a structural model which attempts to group predictor variables into sets according to underlying phenomena was proposed (below).
The first 348 patients in KDHTB database were screened for inclusion. Inclusion criteria were:
Exclusion criteria were:
CONSORT flow chart
The following independent variables were identified as potentially available from day of recruitment data.
## [1] "Sex" "Age"
## [3] "PreviousTB" "Cough"
## [5] "LossOfAppetite" "LossOfWeight"
## [7] "DrenchingNightSweats" "Nausea.and.vomiting"
## [9] "Diarrhea" "Headache"
## [11] "BMI" "BPSystolic"
## [13] "ECOG.Score" "GCS"
## [15] "BPDiastolic" "TempDegreesC"
## [17] "RR" "HR"
## [19] "Lactate" "pH"
## [21] "SHCO3" "BE"
## [23] "Glucose" "Albumin"
## [25] "AST" "Sodium"
## [27] "Potassium" "Haemoglobin"
## [29] "MCV" "WhiteCellCount"
## [31] "Platelets" "CD4Count"
## [33] "ART" "Any.signs.of.Tuberculosis..CXR"
## [35] "Pulmonary.infiltrate" "Pleural.effusion"
## [37] "Mediastinal.lymphnodes" "miliary"
## [39] "cavitary" "nodular"
## [41] "CXR.sites" "Tuberculoma"
## [43] "Pericardial" "Peripheral_LN"
## [45] "Abdominal_LN" "Hepatic"
## [47] "Splenic" "TBM"
## [49] "Creatinine" "Splenic.microabscesses"
This ‘full’ variable set was intended to define a relatively unbiased feature selection starting point. However, some conservative variable manipulation and reduction aimed at increasing stablity of the data structure was performed as follows.
Low information variables Variables which showed ‘near zero’ variation across the cohort, (defined by having very few unique values relative to the number of samples plus a large ratio of the frequency of the most common value to the frequency of the second most common value) were identified using the nearZeroVar function of the Caret package in R. These variables (listed below) were removed from the data set as unlikely to provide significant information, and could negatively affect model performance.
## [1] "Tuberculoma"
Variables with high proportion of missing observations
Proportion of missing observations were examined in the following graphics. The histogram shows variables ordered by proportion missing. the second plot shows variables in columns and patients in rows (ordered by recruitment date); binary variables are coloured black for value = 1, white for value = 0, and red for missing. Numerical variables are plotted on grayscale with higher relative values a darker shade, and red for missing values. The pattern of missing data is non-random:
Distributions of numerical variables Normality of numerical variables were assessed by QQplots and histograms, the later are shown below.
CD4Count, total White cell count, creatinine, AST, glucose, lactate and age were log transformed to better approximate normality. A single outlier with bloop pressure 220/160 was removed from the blood pressure variables. ECOG score was converted to categorical variables because of non-normality (and also not interval measure).
Temperature observations presented a particular problem; with only 7% of the cohort having a fever defined as temperature >38.0oC, which seemed implausible and may relate to measurement error. However, recorded temperatures did correlate with other markers of inflammatory response (e.g. CRP, data not shown) so a new categorical variable ‘fever’ was defined as temperature >37.4oC was created (present in ~13% of cohort).
Pairwise correlations of numerical variables
Because collinearity may negatively affect the performance of some models, numerical variables were examined for extreme correlation and clustering. Pearson’s correlation coefficients for the numerical variables were calculated and are presented below as a heat map matrix with a hierachical clustering dendrogram showing distance between the variables. A second matrix (blue) where the correlation coefficients are converted to absolute values is also shown (such that strong negative correlation counts in clustering).
SHCO3 and Base Excess form a tight cluster (also related more distantly to pH, and negatively correlated with creatinine); a linear model showed that 90% of variation in BE was explained by SHCO3, haemoglobin and pH (results not shown) so BE variable was deleted. Diastolic and systolic BP are highly correlated, as was systolic BP and derived Pulse Pressure (not shown). Therefore a new variable was created representing pulse pressure variation unexplained by variation in systolic BP (residuals from a linear model predicting Pulse Pressure from systolic BP), and diastolic BP and Pulse Pressure variables renoved from the data set.
After the above manipulation 20 numeric predictors and 27 categorical predictors were left in the ‘full’ variable set.
A random split of the data set into a training set (75% of observations, n=207) and test set (25% of observations) balanced by bacteraema strata was performed.
Univariate associations with MTB bacteraemia for each predictor in the full variable set are shown below.
| “no” | “yes” | “p” | “test” | |
|---|---|---|---|---|
| “n” | 138 | 69 | ||
| “Sex = M (%)” | 68 (49.6) | 37 (54.4) | 0.555 | exact |
| “PreviousTB = Y (%)” | 61 (49.2) | 21 (33.3) | 0.044 | exact |
| “Cough = Y (%)” | 91 (72.2) | 43 (66.2) | 0.407 | exact |
| “LossOfAppetite = Y (%)” | 71 (56.3) | 36 (56.2) | 1.000 | exact |
| “LossOfWeight = Y (%)” | 112 (87.5) | 58 (87.9) | 1.000 | exact |
| “DrenchingNightSweats = Y (%)” | 73 (58.9) | 30 (46.9) | 0.125 | exact |
| “Nausea.and.vomiting = Y (%)” | 29 (22.7) | 16 (24.2) | 0.858 | exact |
| “Diarrhea = Y (%)” | 35 (27.6) | 21 (31.3) | 0.619 | exact |
| “Headache = Y (%)” | 28 (22.6) | 10 (15.6) | 0.338 | exact |
| “BPSystolic (mean (sd))” | 111.22 (16.82) | 112.32 (15.29) | 0.662 | |
| “ECOG.Score (%)” | 0.448 | exact | ||
| " 0" | 5 (3.8) | 0 (0.0) | ||
| " 1" | 7 (5.3) | 3 (4.5) | ||
| " 2" | 68 (51.1) | 34 (50.7) | ||
| " 3" | 26 (19.5) | 11 (16.4) | ||
| " 4" | 27 (20.3) | 19 (28.4) | ||
| “GCS = lowGCS (%)” | 10 (8.0) | 5 (8.2) | 1.000 | exact |
| “HR (mean (sd))” | 99.57 (19.80) | 109.22 (16.31) | 0.001 | |
| “pH (mean (sd))” | 7.38 (0.07) | 7.40 (0.07) | 0.051 | |
| “SHCO3 (mean (sd))” | 20.92 (4.12) | 19.37 (3.82) | 0.011 | |
| “Albumin (mean (sd))” | 26.08 (6.63) | 22.51 (5.42) | <0.001 | |
| “Sodium (mean (sd))” | 129.65 (4.81) | 127.29 (5.06) | 0.001 | |
| “Potassium (mean (sd))” | 4.08 (0.83) | 4.13 (0.94) | 0.675 | |
| “Haemoglobin (mean (sd))” | 9.48 (2.42) | 8.54 (2.19) | 0.007 | |
| “MCV (mean (sd))” | 84.71 (9.18) | 82.38 (8.28) | 0.081 | |
| “Platelets (mean (sd))” | 296.55 (129.17) | 221.87 (113.73) | <0.001 | |
| “ART (%)” | 0.565 | exact | ||
| " ART" | 54 (39.4) | 22 (31.9) | ||
| " Defaulted" | 27 (19.7) | 16 (23.2) | ||
| " naive" | 56 (40.9) | 31 (44.9) | ||
| “Any.signs.of.Tuberculosis..CXR = Y (%)” | 123 (96.9) | 60 (95.2) | 0.687 | exact |
| “Pulmonary.infiltrate = Y (%)” | 121 (88.3) | 58 (86.6) | 0.821 | exact |
| “Pleural.effusion = Y (%)” | 47 (34.3) | 13 (20.0) | 0.048 | exact |
| “Mediastinal.lymphnodes = Y (%)” | 57 (41.9) | 40 (60.6) | 0.016 | exact |
| “miliary = TRUE (%)” | 10 (7.2) | 15 (21.7) | 0.005 | exact |
| “cavitary = TRUE (%)” | 17 (12.3) | 8 (11.6) | 1.000 | exact |
| “nodular = TRUE (%)” | 58 (42.0) | 27 (39.1) | 0.765 | exact |
| “CXR.sites (mean (sd))” | 3.72 (1.92) | 4.03 (2.16) | 0.292 | |
| “Pericardial = Yes (%)” | 12 (8.7) | 8 (11.6) | 0.618 | exact |
| “Peripheral_LN = Yes (%)” | 2 (1.4) | 5 (7.2) | 0.042 | exact |
| “Abdominal_LN = Yes (%)” | 22 (15.9) | 24 (34.8) | 0.004 | exact |
| “Hepatic = Yes (%)” | 6 (4.3) | 6 (8.7) | 0.220 | exact |
| “Splenic = Yes (%)” | 24 (17.4) | 23 (33.3) | 0.013 | exact |
| “TBM = Yes (%)” | 15 (10.9) | 0 (0.0) | 0.003 | exact |
| “Splenic.microabscesses (%)” | 0.002 | exact | ||
| " N" | 22 (15.9) | 7 (10.1) | ||
| " noUSSdone" | 90 (65.2) | 33 (47.8) | ||
| " Y" | 26 (18.8) | 29 (42.0) | ||
| “logCD4 (mean (sd))” | 4.45 (0.95) | 3.27 (1.15) | <0.001 | |
| “logWCC (mean (sd))” | 1.98 (0.63) | 1.90 (0.61) | 0.389 | |
| “logCreatinine (mean (sd))” | 4.43 (0.64) | 4.76 (0.78) | 0.001 | |
| “logAST (mean (sd))” | 3.92 (0.81) | 4.40 (0.70) | <0.001 | |
| “logGlucose (mean (sd))” | 1.66 (0.27) | 1.73 (0.33) | 0.078 | |
| “logLactate (mean (sd))” | 0.58 (0.46) | 0.72 (0.53) | 0.052 | |
| “logRR (mean (sd))” | 3.11 (0.26) | 3.16 (0.24) | 0.189 | |
| “logAge (mean (sd))” | 3.60 (0.28) | 3.60 (0.23) | 0.931 | |
| “fever = Yes (%)” | 11 (8.0) | 17 (24.6) | 0.002 | |
| “pulsePr.res (mean (sd))” | -0.00 (9.17) | 0.35 (9.25) | 0.805 | |
| “bacteraemia = yes (%)” | 0 (0.0) | 69 (100.0) | <0.001 |
An exhaustive analysis of different combinations of predictors and their co-relation to bacteraemia is impractical. However, groupings of variables as defined by the proposed structural model in section 2.2 above were investigated as follows. For each a priori variable set {immunosupression variables; advanced TB variables; septic response variables; direct evidence of dissemination variables} pairwise relationships and their association with bacteraemia were visualised in a plot matrix. Secondly, principle components of each set were extracted using Principle Components Analysis (numerical variables) and Multiple Correspondence Analysis (categorical variables), or Factor Analysis for Mixed data. The relationship of derived components to bacteraemia was assessed visually, and using ROC curve and regression analysis. Because these methods are unable to handle missing observations, multivariate imputation of missing data by chained equations (MICE), which selects missing values from a probability distribution conditional on the levels of other variables, was used.
** 6.2.1 Advanced TB variable set**
Albumin was the strongest predictor of MTB bacteraemia amongst the variables chosen a priori to represent ‘advanced TB’. Haemoglobin and albumin were covaried (pearson’s r = 0.35) and the combination of the two appeared to marginally increase discrimination of bacteraemic and non-bacteraemic cases. Chest radiological extent of disease (‘CXR.sites’, number of anatomical sites with pathology on CXR) had higher mode in bacteraemic patients, but showed no clear relationship with other variables in this set. Loss of weight (LOW) and ECOG score (high=3-4, low = 1-2) showed surprisingly little relation to bacteraemia or any of the variables in this set.
_Relationship matrix, advanced TB variables, coloured by bacteraemia status blue-yes, red-no_
The fist dimension of a PCA combining albumin and haemoglobin explained ~7% variation in bacteraemia status, and had area under ROC curve similar or marginally better to either variable on their own for predicting bacteraemia (PC1 AUC = 0.659; Albumin AUC = 0.652; Haemoglobin AUC = 0.605). By contrast the first PC of the full variable set added little with an AUC of 0.674.
** 6.2.2 Septic variable set**
Predictors in the ‘sepsis set’ were relatively collinear: respiratory rate and heart rate correlated (Pearson’s r = 0.45); pyrexial patients had higher heart rates and plasma lactate. White cell count did not correlate well with the other variables, did not classify bacteraemic cases well, and did not seem to interact with the other variables relationship to bacteraemia.
_Relationship matrix, Sepsis variables, coloured by bacteraemia status blue-yes, red-no_
Cumulatively about half of the variability in the sepsis set variables could be captured on the first two principle components. Variation in bacteraemia status projected best on the 1st and 4th components of the PCA; these two dimensions were dominated by heart rate, respiratory rate, lactate, SHCO3 and fever. Removal of white cell count from the PCA did not adversly affect the predictive power of the PCs for bacteraemia, and all the predictive utility of this variable set could be captured in just two principle components.
##
## Call:
## roc.formula(formula = bacteraemia ~ pred1, data = septic)
##
## Data: pred1 in 138 controls (bacteraemia No.bacteraemia) < 69 cases (bacteraemia Bacteraemia).
## Area under the curve: 0.7396
##
## Call:
## roc.formula(formula = bacteraemia ~ pred2, data = septic)
##
## Data: pred2 in 138 controls (bacteraemia No.bacteraemia) < 69 cases (bacteraemia Bacteraemia).
## Area under the curve: 0.728
….
** 6.2.3 Immunosupression variable set**
Within the immunosupression variables, CD4 count was the best predictor for bacteraemia. It was uncorrelated with other variables in the set including ART status. Combination of CD4 with other variables in the set, pairwise or through principle component construction, did not appear to add discriminative power (data not shown).
_Relationship matrix, Immunosupression variables, coloured by bacteraemia status blue-yes, red-no_
** 6.2.4 Dissemination variable set**
Starting with the lymphoid tissue variables, as expected ‘Splenic’ which is a composite clinical diagnosis, and the USS finding ‘Splenic Microabscesses’ were strongly associated. Of more interest is the finding that ‘Splenic’ in the abscence of ‘splenic microabscesses’ was associatied with zero cases of MTB bacteraemia. Patients who had an USS and did not have splenic microabscesses were unlikely to have bacteraemia (NPV = 0.76). So as a predictor for MTB bacteraemia, splenic microabscesses has more specificity (less noise) than ‘splenic TB’.
Second, lymphoid variables, with the possible exception of mediastinal LNs, appeared additive such that involvement of more lymphoid pathology sites increased probability of MTB bacteraemia.
_Relationship matrix, disseminated TB variables, coloured by bacteraemia status blue-yes, red-no_
All 3 variables chosen to represent bone marrow pathology (haemoglobin, platelets and MCV) tended to be lower in bacteraemic patients. Among anaemic patients with bacteraemia, microcytosis was more common than macrocytosis. The combination of anaemia and thrombocytopenia seemed to be specific but not sensitive for bacteraemia, and there was no noticable correlation between haemoglobin and platlets.
Visualising some of the lymphoid and haematological variables together, along with ‘miliary disease on CXR’ variable showed the following. Miliary disease is uncommon but a reasonable positive predictor of bacteraemia (PPV = 0.6). This miliary PPV remained in patients without known abdominal LNs, and vice-versa, suggesting their predictive power was orthogonal. Miliary disease was associated with lower haemoglobin and platelets but this effect did not appear additive with the effect of bacteraemia on Hb and platelets.
About 27% variation in ‘dissemination’ variables was captured by the first dimension of the FAMD. The main contributors to this dimension were haemoglobin, splenic TB diagnosis, splenic microabscesses, abdominal LN, and hepatic TB diagnosis; it had ROC AUC of 0.69 for bacteraemia prediction. Bacteraemia status also loaded on the third FAMD dimension, dominated by platelets, peripheral LN, and miliary CXR, with ROC AUC 0.65. The combination of these two dimensions had increased predictive power for bacteraemia (ROC AUC = 0.74).
Conclusions from exploratory analysis
A specified objective of this analysis was to test the performance of different variable selection strategies in this data set. Therefore the following variable selections were defined:
Five model types were used:
All models were built on the training data set. Training of the models using 5 repeats of 10x k-fold cross validation was carried out using the caret package of R, such that ROC AUC of the predicted probabilities was maximised across a by selecting optimal model tuning parameters. Out-of-sample ROC AUC for each model was also estimated from the cross validation results, and then formally tested in the test set. Because some models require complete data to run, missing data was imputed using k-nearest neighbours method.
Ten additional models were created by running each model type with PCA pre-processing of both the full variable set and the univariate association variable set.
In addition to the pre-model variable selection procedures described previously, model based variable selection methods were assessed.
First, a LASSO (least absolute shrinkage and selection operator) regresison model was applied to the full variable data set using the glmnet package; this defined a variable sub-set with good multivariate predictive performance (BPSystolic, HR, SHCO3, Albumin, Sodium, MCV, Platelets, logCD4, pulsePr.res, Splenic.microabscesses, Haemoglobin, bacteraemia). This variable set was applied to each model type (= five additional models).
Second, model-type specific variable selection procedures were used. These model specific variable selection procedures generated a further 4 models.
All models were successfully fit, except logistic regression on the full variable set which did not converge (number of predictors too large for number of observed individuals).
The ROC AUC for each model is shown below. Grey points are the values derived from cross validation in the training set (estimating out of sample performance). The coloured points are the values in the test set with 95% CI.
The best model overall in tets set was a random forest on the full data set. However, most model types performed similarly except for the simple decison trees.
Variable selection was particularly important for logit models. Model specific variable selection was generally better than pre-model variable selection, but the univariate association and LASSO variable selections performed reasonably well in logistic regression, random forest and SVM. The supervised dimension reduction prior to modelling performed moderately well in logistic regression and SVM despite relative parsimony. Unsupervised PCA pre-processing added value in a decision tree model.
The following plot shows predicted probabilities against actual bacteraemia class for every model. In each panel, true positives are in the top right corner, true negatives are in the bottom left corner. The points are coloured according to predicted probabilities from one model (the logistic regression on the univariate association variables) with higher predicted probabilities a lighter shade of blue. It shows the models mostly made false predictions for the same individuals (although there are some exceptions). i.e. Some individuals consistently had high predicted probability of MTB bacteraemia, but did not have a bacteraemia. Thsi could represent noise in the outcome variable and therefore a possible ceiling to predictive model accuracy.
The full variable set random forest model includes >40 variables. Repeatedly rerunning this model with progressively less variables based on the variable importance measure shows that the optimal number of variables may be between 10 and 20.
Using 10 variables only, the ROC AUC for the random forest model on cross validation was .
Because the univariate association variable set and the LASSO variable set performed well in logistic regression, RF and SVM models, predictions from these models can all be used without additional work of variable input when it comes to using the model in practice. The mean predicted probabilities from all 6 models (‘ensemble model’) had higher ROC AUC than any individual model, and could add stability to the predictions.
##
## Call:
## roc.formula(formula = bacteraemia ~ p.Logit.lasso, data = ensTEST)
##
## Data: p.Logit.lasso in 44 controls (bacteraemia no) < 22 cases (bacteraemia yes).
## Area under the curve: 0.8316
##
## Call:
## roc.formula(formula = bacteraemia ~ p.RF.lasso, data = ensTEST)
##
## Data: p.RF.lasso in 44 controls (bacteraemia no) < 22 cases (bacteraemia yes).
## Area under the curve: 0.8187
##
## Call:
## roc.formula(formula = bacteraemia ~ p.svm.lasso, data = ensTEST)
##
## Data: p.svm.lasso in 44 controls (bacteraemia no) < 22 cases (bacteraemia yes).
## Area under the curve: 0.8378
##
## Call:
## roc.formula(formula = bacteraemia ~ p.Logit.univar, data = ensTEST)
##
## Data: p.Logit.univar in 44 controls (bacteraemia no) < 22 cases (bacteraemia yes).
## Area under the curve: 0.8605
##
## Call:
## roc.formula(formula = bacteraemia ~ p.RF.univar, data = ensTEST)
##
## Data: p.RF.univar in 44 controls (bacteraemia no) < 22 cases (bacteraemia yes).
## Area under the curve: 0.8605
##
## Call:
## roc.formula(formula = bacteraemia ~ p.svm.univar, data = ensTEST)
##
## Data: p.svm.univar in 44 controls (bacteraemia no) < 22 cases (bacteraemia yes).
## Area under the curve: 0.8471
##
## Call:
## roc.formula(formula = bacteraemia ~ meanPred, data = ensTEST)
##
## Data: meanPred in 44 controls (bacteraemia no) < 22 cases (bacteraemia yes).
## Area under the curve: 0.8626
In the test set, about 25% of patients have ensemble model predicted probability of bacteraemia greater than 0.5. For this cut-off the PPV would be ~80% in the test set.