1. Introduction

The purpose of this analysis is to inform a prediction algorithm to identify patients in the KDH-TB study at highest risk of M. tuberculosis bacteraemia, using patient factors typically available on day of recruitment to the study. The analysis will not attempt to describe mechanistic relationships or inferences between patient variables and presence of bacteraemia. The specific objective is to produce an algorithm with >90% positive predictive value assuming a prevalence of MTB bacteraemia similar to that seen during recruitment to date (~27%).

The most established approach to predictive classification models in clinical science is to define a score derived from regression modelling (e.g. Frammingham CHD risk score, APACHE score, FRAX WHO fracture risk assessment tool, CURB65 score etc.). Potential limitations of this approach include the following:

Based on these possible limitations it has been hypothesised that the complexity of clinical decision making may be better captured by contemporary ‘machine learning’ techniques used in fields allied to clinical science, such as the ’omics and other data sciences. Therefore, a secondary objective of this analysis is to test a range of prediction models and variable selection strategies to a clinical dataset, and explore accuracy / complexity trade-offs.

A related third objective is to check the external validity of MTB bacteraemia predictors previously reported in other cohorts. Building a predictive model using variables previously found to be significant should (theoretically) limit over-fitting and type-1 error of the model. The analysis therefore starts with a review of the previous relevant literature.


2. Associations with MTB bacteraemia in previous cohort studies

2.1 Pubmed literature

Pubmed was searched using all feild terms [“mycobacteraemia”], [“mycobacteri* AND bacteraemia”], [“tuberculosis” AND “blood culture”] to identify cohort studies with:

  • prospective routine mycobacterial blood culture;
  • recorded HIV prevalence;
  • published in last 15 years (~HAART era);
  • reporting ‘risk factors’ for MTB bacteraemia positivity.

Ten cohorts were identified, 8 in subSaharan Africa, several from the same centres (Malawi is overrepresented), majority recruited inpatients. Five gave multivariate adjusted estimates of associations (table). No studies validated their models ‘out of sample’, although one study did use cross-validation within sample (Jacob et al. 2013).

The independent variable associations in these studies are summarised below where yellow,1 = an unadjusted association, and red,2 = an association remaining significant in a multivariate model. A limitation of this summary is that none of the studies unambiguously reports a full description of which variables were examined but found non-significant (i.e. some variables will not have been tested in all the cohorts).

The study by Jacob et al. is of particular interest as it is the best reported, and developed a risk score using their model (meaning, in addition to the variable set, the coefficients associated with the variables can be tested in our cohort).

2.2 An a priori proposed structural model

Based partly on this literature, a structural model which attempts to group predictor variables into sets according to underlying phenomena was proposed (below).


3. Inclusion and Exclusion, sample size

The first 348 patients in KDHTB database were screened for inclusion. Inclusion criteria were:

Exclusion criteria were:

CONSORT flow chart

4. Variable manipulation, full variable set

The following independent variables were identified as potentially available from day of recruitment data.

##  [1] "Sex"                            "Age"                           
##  [3] "PreviousTB"                     "Cough"                         
##  [5] "LossOfAppetite"                 "LossOfWeight"                  
##  [7] "DrenchingNightSweats"           "Nausea.and.vomiting"           
##  [9] "Diarrhea"                       "Headache"                      
## [11] "BMI"                            "BPSystolic"                    
## [13] "ECOG.Score"                     "GCS"                           
## [15] "BPDiastolic"                    "TempDegreesC"                  
## [17] "RR"                             "HR"                            
## [19] "Lactate"                        "pH"                            
## [21] "SHCO3"                          "BE"                            
## [23] "Glucose"                        "Albumin"                       
## [25] "AST"                            "Sodium"                        
## [27] "Potassium"                      "Haemoglobin"                   
## [29] "MCV"                            "WhiteCellCount"                
## [31] "Platelets"                      "CD4Count"                      
## [33] "ART"                            "Any.signs.of.Tuberculosis..CXR"
## [35] "Pulmonary.infiltrate"           "Pleural.effusion"              
## [37] "Mediastinal.lymphnodes"         "miliary"                       
## [39] "cavitary"                       "nodular"                       
## [41] "CXR.sites"                      "Tuberculoma"                   
## [43] "Pericardial"                    "Peripheral_LN"                 
## [45] "Abdominal_LN"                   "Hepatic"                       
## [47] "Splenic"                        "TBM"                           
## [49] "Creatinine"                     "Splenic.microabscesses"

This ‘full’ variable set was intended to define a relatively unbiased feature selection starting point. However, some conservative variable manipulation and reduction aimed at increasing stablity of the data structure was performed as follows.

Low information variables Variables which showed ‘near zero’ variation across the cohort, (defined by having very few unique values relative to the number of samples plus a large ratio of the frequency of the most common value to the frequency of the second most common value) were identified using the nearZeroVar function of the Caret package in R. These variables (listed below) were removed from the data set as unlikely to provide significant information, and could negatively affect model performance.

## [1] "Tuberculoma"

Variables with high proportion of missing observations

Proportion of missing observations were examined in the following graphics. The histogram shows variables ordered by proportion missing. the second plot shows variables in columns and patients in rows (ordered by recruitment date); binary variables are coloured black for value = 1, white for value = 0, and red for missing. Numerical variables are plotted on grayscale with higher relative values a darker shade, and red for missing values. The pattern of missing data is non-random:

Distributions of numerical variables Normality of numerical variables were assessed by QQplots and histograms, the later are shown below.

CD4Count, total White cell count, creatinine, AST, glucose, lactate and age were log transformed to better approximate normality. A single outlier with bloop pressure 220/160 was removed from the blood pressure variables. ECOG score was converted to categorical variables because of non-normality (and also not interval measure).

Temperature observations presented a particular problem; with only 7% of the cohort having a fever defined as temperature >38.0oC, which seemed implausible and may relate to measurement error. However, recorded temperatures did correlate with other markers of inflammatory response (e.g. CRP, data not shown) so a new categorical variable ‘fever’ was defined as temperature >37.4oC was created (present in ~13% of cohort).

Pairwise correlations of numerical variables

Because collinearity may negatively affect the performance of some models, numerical variables were examined for extreme correlation and clustering. Pearson’s correlation coefficients for the numerical variables were calculated and are presented below as a heat map matrix with a hierachical clustering dendrogram showing distance between the variables. A second matrix (blue) where the correlation coefficients are converted to absolute values is also shown (such that strong negative correlation counts in clustering).

SHCO3 and Base Excess form a tight cluster (also related more distantly to pH, and negatively correlated with creatinine); a linear model showed that 90% of variation in BE was explained by SHCO3, haemoglobin and pH (results not shown) so BE variable was deleted. Diastolic and systolic BP are highly correlated, as was systolic BP and derived Pulse Pressure (not shown). Therefore a new variable was created representing pulse pressure variation unexplained by variation in systolic BP (residuals from a linear model predicting Pulse Pressure from systolic BP), and diastolic BP and Pulse Pressure variables renoved from the data set.

After the above manipulation 20 numeric predictors and 27 categorical predictors were left in the ‘full’ variable set.


5. Data splitting for training and test sets

A random split of the data set into a training set (75% of observations, n=207) and test set (25% of observations) balanced by bacteraema strata was performed.


6. Exploratory analysis using training set

6.1 Univariate associations

Univariate associations with MTB bacteraemia for each predictor in the full variable set are shown below.

“no” “yes” “p” “test”
“n” 138 69
“Sex = M (%)” 68 (49.6) 37 (54.4) 0.555 exact
“PreviousTB = Y (%)” 61 (49.2) 21 (33.3) 0.044 exact
“Cough = Y (%)” 91 (72.2) 43 (66.2) 0.407 exact
“LossOfAppetite = Y (%)” 71 (56.3) 36 (56.2) 1.000 exact
“LossOfWeight = Y (%)” 112 (87.5) 58 (87.9) 1.000 exact
“DrenchingNightSweats = Y (%)” 73 (58.9) 30 (46.9) 0.125 exact
“Nausea.and.vomiting = Y (%)” 29 (22.7) 16 (24.2) 0.858 exact
“Diarrhea = Y (%)” 35 (27.6) 21 (31.3) 0.619 exact
“Headache = Y (%)” 28 (22.6) 10 (15.6) 0.338 exact
“BPSystolic (mean (sd))” 111.22 (16.82) 112.32 (15.29) 0.662
“ECOG.Score (%)” 0.448 exact
" 0" 5 (3.8) 0 (0.0)
" 1" 7 (5.3) 3 (4.5)
" 2" 68 (51.1) 34 (50.7)
" 3" 26 (19.5) 11 (16.4)
" 4" 27 (20.3) 19 (28.4)
“GCS = lowGCS (%)” 10 (8.0) 5 (8.2) 1.000 exact
“HR (mean (sd))” 99.57 (19.80) 109.22 (16.31) 0.001
“pH (mean (sd))” 7.38 (0.07) 7.40 (0.07) 0.051
“SHCO3 (mean (sd))” 20.92 (4.12) 19.37 (3.82) 0.011
“Albumin (mean (sd))” 26.08 (6.63) 22.51 (5.42) <0.001
“Sodium (mean (sd))” 129.65 (4.81) 127.29 (5.06) 0.001
“Potassium (mean (sd))” 4.08 (0.83) 4.13 (0.94) 0.675
“Haemoglobin (mean (sd))” 9.48 (2.42) 8.54 (2.19) 0.007
“MCV (mean (sd))” 84.71 (9.18) 82.38 (8.28) 0.081
“Platelets (mean (sd))” 296.55 (129.17) 221.87 (113.73) <0.001
“ART (%)” 0.565 exact
" ART" 54 (39.4) 22 (31.9)
" Defaulted" 27 (19.7) 16 (23.2)
" naive" 56 (40.9) 31 (44.9)
“Any.signs.of.Tuberculosis..CXR = Y (%)” 123 (96.9) 60 (95.2) 0.687 exact
“Pulmonary.infiltrate = Y (%)” 121 (88.3) 58 (86.6) 0.821 exact
“Pleural.effusion = Y (%)” 47 (34.3) 13 (20.0) 0.048 exact
“Mediastinal.lymphnodes = Y (%)” 57 (41.9) 40 (60.6) 0.016 exact
“miliary = TRUE (%)” 10 (7.2) 15 (21.7) 0.005 exact
“cavitary = TRUE (%)” 17 (12.3) 8 (11.6) 1.000 exact
“nodular = TRUE (%)” 58 (42.0) 27 (39.1) 0.765 exact
“CXR.sites (mean (sd))” 3.72 (1.92) 4.03 (2.16) 0.292
“Pericardial = Yes (%)” 12 (8.7) 8 (11.6) 0.618 exact
“Peripheral_LN = Yes (%)” 2 (1.4) 5 (7.2) 0.042 exact
“Abdominal_LN = Yes (%)” 22 (15.9) 24 (34.8) 0.004 exact
“Hepatic = Yes (%)” 6 (4.3) 6 (8.7) 0.220 exact
“Splenic = Yes (%)” 24 (17.4) 23 (33.3) 0.013 exact
“TBM = Yes (%)” 15 (10.9) 0 (0.0) 0.003 exact
“Splenic.microabscesses (%)” 0.002 exact
" N" 22 (15.9) 7 (10.1)
" noUSSdone" 90 (65.2) 33 (47.8)
" Y" 26 (18.8) 29 (42.0)
“logCD4 (mean (sd))” 4.45 (0.95) 3.27 (1.15) <0.001
“logWCC (mean (sd))” 1.98 (0.63) 1.90 (0.61) 0.389
“logCreatinine (mean (sd))” 4.43 (0.64) 4.76 (0.78) 0.001
“logAST (mean (sd))” 3.92 (0.81) 4.40 (0.70) <0.001
“logGlucose (mean (sd))” 1.66 (0.27) 1.73 (0.33) 0.078
“logLactate (mean (sd))” 0.58 (0.46) 0.72 (0.53) 0.052
“logRR (mean (sd))” 3.11 (0.26) 3.16 (0.24) 0.189
“logAge (mean (sd))” 3.60 (0.28) 3.60 (0.23) 0.931
“fever = Yes (%)” 11 (8.0) 17 (24.6) 0.002
“pulsePr.res (mean (sd))” -0.00 (9.17) 0.35 (9.25) 0.805
“bacteraemia = yes (%)” 0 (0.0) 69 (100.0) <0.001

6.2 Multivariate exploration

An exhaustive analysis of different combinations of predictors and their co-relation to bacteraemia is impractical. However, groupings of variables as defined by the proposed structural model in section 2.2 above were investigated as follows. For each a priori variable set {immunosupression variables; advanced TB variables; septic response variables; direct evidence of dissemination variables} pairwise relationships and their association with bacteraemia were visualised in a plot matrix. Secondly, principle components of each set were extracted using Principle Components Analysis (numerical variables) and Multiple Correspondence Analysis (categorical variables), or Factor Analysis for Mixed data. The relationship of derived components to bacteraemia was assessed visually, and using ROC curve and regression analysis. Because these methods are unable to handle missing observations, multivariate imputation of missing data by chained equations (MICE), which selects missing values from a probability distribution conditional on the levels of other variables, was used.

** 6.2.1 Advanced TB variable set**

Albumin was the strongest predictor of MTB bacteraemia amongst the variables chosen a priori to represent ‘advanced TB’. Haemoglobin and albumin were covaried (pearson’s r = 0.35) and the combination of the two appeared to marginally increase discrimination of bacteraemic and non-bacteraemic cases. Chest radiological extent of disease (‘CXR.sites’, number of anatomical sites with pathology on CXR) had higher mode in bacteraemic patients, but showed no clear relationship with other variables in this set. Loss of weight (LOW) and ECOG score (high=3-4, low = 1-2) showed surprisingly little relation to bacteraemia or any of the variables in this set.

 _Relationship matrix, advanced TB variables, coloured by bacteraemia status blue-yes, red-no_

The fist dimension of a PCA combining albumin and haemoglobin explained ~7% variation in bacteraemia status, and had area under ROC curve similar or marginally better to either variable on their own for predicting bacteraemia (PC1 AUC = 0.659; Albumin AUC = 0.652; Haemoglobin AUC = 0.605). By contrast the first PC of the full variable set added little with an AUC of 0.674.

** 6.2.2 Septic variable set**

Predictors in the ‘sepsis set’ were relatively collinear: respiratory rate and heart rate correlated (Pearson’s r = 0.45); pyrexial patients had higher heart rates and plasma lactate. White cell count did not correlate well with the other variables, did not classify bacteraemic cases well, and did not seem to interact with the other variables relationship to bacteraemia.

 _Relationship matrix, Sepsis variables, coloured by bacteraemia status blue-yes, red-no_
 

Cumulatively about half of the variability in the sepsis set variables could be captured on the first two principle components. Variation in bacteraemia status projected best on the 1st and 4th components of the PCA; these two dimensions were dominated by heart rate, respiratory rate, lactate, SHCO3 and fever. Removal of white cell count from the PCA did not adversly affect the predictive power of the PCs for bacteraemia, and all the predictive utility of this variable set could be captured in just two principle components.

## 
## Call:
## roc.formula(formula = bacteraemia ~ pred1, data = septic)
## 
## Data: pred1 in 138 controls (bacteraemia No.bacteraemia) < 69 cases (bacteraemia Bacteraemia).
## Area under the curve: 0.7396
## 
## Call:
## roc.formula(formula = bacteraemia ~ pred2, data = septic)
## 
## Data: pred2 in 138 controls (bacteraemia No.bacteraemia) < 69 cases (bacteraemia Bacteraemia).
## Area under the curve: 0.728

….

** 6.2.3 Immunosupression variable set**

Within the immunosupression variables, CD4 count was the best predictor for bacteraemia. It was uncorrelated with other variables in the set including ART status. Combination of CD4 with other variables in the set, pairwise or through principle component construction, did not appear to add discriminative power (data not shown).

 _Relationship matrix, Immunosupression variables, coloured by bacteraemia status blue-yes, red-no_

** 6.2.4 Dissemination variable set**

Starting with the lymphoid tissue variables, as expected ‘Splenic’ which is a composite clinical diagnosis, and the USS finding ‘Splenic Microabscesses’ were strongly associated. Of more interest is the finding that ‘Splenic’ in the abscence of ‘splenic microabscesses’ was associatied with zero cases of MTB bacteraemia. Patients who had an USS and did not have splenic microabscesses were unlikely to have bacteraemia (NPV = 0.76). So as a predictor for MTB bacteraemia, splenic microabscesses has more specificity (less noise) than ‘splenic TB’.

Second, lymphoid variables, with the possible exception of mediastinal LNs, appeared additive such that involvement of more lymphoid pathology sites increased probability of MTB bacteraemia.

 _Relationship matrix, disseminated TB variables, coloured by bacteraemia status blue-yes, red-no_

All 3 variables chosen to represent bone marrow pathology (haemoglobin, platelets and MCV) tended to be lower in bacteraemic patients. Among anaemic patients with bacteraemia, microcytosis was more common than macrocytosis. The combination of anaemia and thrombocytopenia seemed to be specific but not sensitive for bacteraemia, and there was no noticable correlation between haemoglobin and platlets.

Visualising some of the lymphoid and haematological variables together, along with ‘miliary disease on CXR’ variable showed the following. Miliary disease is uncommon but a reasonable positive predictor of bacteraemia (PPV = 0.6). This miliary PPV remained in patients without known abdominal LNs, and vice-versa, suggesting their predictive power was orthogonal. Miliary disease was associated with lower haemoglobin and platelets but this effect did not appear additive with the effect of bacteraemia on Hb and platelets.

About 27% variation in ‘dissemination’ variables was captured by the first dimension of the FAMD. The main contributors to this dimension were haemoglobin, splenic TB diagnosis, splenic microabscesses, abdominal LN, and hepatic TB diagnosis; it had ROC AUC of 0.69 for bacteraemia prediction. Bacteraemia status also loaded on the third FAMD dimension, dominated by platelets, peripheral LN, and miliary CXR, with ROC AUC 0.65. The combination of these two dimensions had increased predictive power for bacteraemia (ROC AUC = 0.74).

Conclusions from exploratory analysis

  • There are multiple strong univariate associations with bacteraemia in the training dataset. Most have been previously reported (suggesting external validity), while others, such as platlets and splenic pathology, are novel but plausible. These ‘highly relevant’ variables may be less useful for prediction because of redundancy and suboptimal sensitivity / specificity trade-offs.
  • For example, abscence of splenic microabscess had high NPV, but presence of microabscesses had only moderate PPV, and USS is only available in a minority of patients; peripheral lymphnodes have good correlation with bacteraemia, but are seen in too small a proportion of the cohort to be a very useful sign.
  • Combinations of variables can add discriminatory power. Several ad hoc principle component analyses, supervised by prior beliefs about how the variables relate, suggested dimensions of variability with better or as good prediction for bacteraemia than the raw variables. This may suggest that rational and/or data driven dimension reduction could reduce noise with relatively little information loss.
  • For example, lymphoid variables are largely incomplete, typically because not all patients get an USS. Second, involvement of particular lymph node sites (e.g. peripheral) is presumably to an extent stochastic. Third, all patients have multiple potential aetiologies of lymphatic disease (e.g. HIV). These three sources of noise could potentially be reduced by a combination of k variables into n<k components or factors, allowing more robust prediction performance from models tuned to genuine variation rather than noise. This dimension reduction could be supervised (described in next section) or completely data driven (described later in model building).
  • Finally, no MTB bacteraemia was seen in any of the patients diagnosed with TB meningitis in the training data set. TBM patients tended to cluster quite separately to other patients in the cohort (data not shown). Therefore, TBM cases were excluded from further analysis.

7. Feature selection before model building

A specified objective of this analysis was to test the performance of different variable selection strategies in this data set. Therefore the following variable selections were defined:

  1. Full set: All the variables thought to be available on day 0, as described previously in section 4.
  2. Prior Literature set: All the variables from the full set which have previously been found to be significantly associated with MTB bacteraemia from the literature review in section 2 above {fever, logCD4, Haemoglobin, Sex, lymphadenopathy (not well defined in previous literature, here defined as peripheral OR mediastinal OR abdominal), Cough, current HAART, Diarrhea, Heart Rate, log Respiratory Rate, Sodium, LossOfWeight}.
  3. Jacob et al. set: Variables from full set which were used in the model published by Jacob et al. {fever, log CD4, Haemoglobin, Sex, current HAART, Heart Rate, Sodium}.
  4. Univariate associations set: All variables from full set which were found to have significant associations with bacteraemia at the alpha=0.05 level in section 6.1 above. {PreviousTB, Heart Rate, SHCO3, Albumin, Sodium, Haemoglobin, Platelets, Pleural Effusion, Mediastinal LN, miliary disease on CXR, Peripheral LN, Abdominal LN, Splenic TB diagnosis, Splenic microabscesses, log CD4, log Creatinine, log AST, fever}.
  5. Supervised dimension reduction set: Based on the analysis presented in section 6.2 several new variables were created intended to capture the most useful components of variation in the data. In short, variables from the proposed structural model were decomposed into ostensively more fundamental dimensions of variation defined by the PCA analyses in section 6.2. These new variables were:

8. Model building

Five model types were used:

  1. Logistic regression. Generalised linear models based on the binomial probability distribution were constructed using the glm() function in R. Method summary: the conditional mean of the log odds ratio of the bacteraemia outcome (logit) is modeled as a linear function of the predictor variables weighted by parameters derived from maximum likelihood estimation.
  2. Decision tree. Decision tree models were constructed using the recursive partitioning (rpart) package in R. Method summary: the rpart algorithm searches for a predictor variable which splits the individuals of the data set into two groups such that the homogenity (purity) of the outcome in the two groups is maximised. This is serially repeated in the ‘daughter’ groups until purity is not increased by further splits, or a miaximum number of branches has been reached. To avoid over-fitting a ‘compexity parameter’, \(cp\), is specified which penalises larger trees; cp can be optimised (‘tuned’) using cross-validation.
  3. Random forest. Random forest models were produced using the randomForest package in R. Method summary: For a data set with \(N\) individuals (rows) and \(V\) predictor variables, a large number, \(K\), of decison tree models are constructed as follows. To create tree \(k\), a random sample with replacement of \(n < N\) individuals is made, with about 1/3 individuals held in reserve. At each node of tree \(k\), \(v<V\) predictors are randomly chosen for consideration as potential splitting variables. Predictions for new cases can be made by applying all \(K\) decision trees to the new case and counting the predicted outcomes (‘votes’) across the trees. In addition, the accuracy of each tree is tracked by running the reserved cases down each decison tree to give an estimate of ‘out-of-bag’ error rate. The main parameter affecting random forest performance is the size of \(v\), which can be ‘tuned’ using cross-validation.
  4. Boosted tree. Boosted tree models were made using the gbm package in R. Method summary: Similarly to random forest models, the boosted tree algorithm builds multiple decision trees and combines their predictions. However, unlike in the random forest method, each new tree is constructed using information from the previously constructed trees and is weighted to return more accurate predictions for cases misclassified in the previous trees. Bootstrapping is not used, but a modified version of the data set is used for each new tree. This is equivalent to running a new model to ‘explain’ the residuals from the previous model. The amount of ‘explanation’ the new model is peritted to provide is an adjustable parameter (\(lamda\), the ‘shrinkage’ or ‘learning rate’). Other parameters include number of branches per tree, total number of trees, and maximum number of observations in each node of the tree; all can be optimised through cross validation.
  5. Support vector machine. SVM models were implimented in R using the svmLinear2 function of the e1071 package. Method summary: For \(V\) predictor variables, individuals are located in \(V-1\) dimensional space, the SVM algorithm finds the optimal separating hyperplane between the two classes by maximizing the margin between the classes’ closest points using quadratic algebra. A cost parameter for misclassifying an individual, and a parameter determining the ‘smoothing’ in the hyperplane can be tuned using cross validation to reduce over-fitting.

All models were built on the training data set. Training of the models using 5 repeats of 10x k-fold cross validation was carried out using the caret package of R, such that ROC AUC of the predicted probabilities was maximised across a by selecting optimal model tuning parameters. Out-of-sample ROC AUC for each model was also estimated from the cross validation results, and then formally tested in the test set. Because some models require complete data to run, missing data was imputed using k-nearest neighbours method.


9. Models with PCA pre-processing of variables

Ten additional models were created by running each model type with PCA pre-processing of both the full variable set and the univariate association variable set.


10. Model based variable selection

In addition to the pre-model variable selection procedures described previously, model based variable selection methods were assessed.

First, a LASSO (least absolute shrinkage and selection operator) regresison model was applied to the full variable data set using the glmnet package; this defined a variable sub-set with good multivariate predictive performance (BPSystolic, HR, SHCO3, Albumin, Sodium, MCV, Platelets, logCD4, pulsePr.res, Splenic.microabscesses, Haemoglobin, bacteraemia). This variable set was applied to each model type (= five additional models).

Second, model-type specific variable selection procedures were used. These model specific variable selection procedures generated a further 4 models.

  1. Stepwise logistic regression. Forward stepwise regression with variable inclusion based on Akaike information criterion carried out with the glmStepAIC function of the MASS package in R, using th efull and the univariate association variable sets as starting points.
  2. Random forest variable importance. The RF method generates a measure of importance for variable \(v\) by comparing the out-of-bag accuracy of all trees with and without \(v\) included. Using this index the random forest model was re-run using only variables above an arbitrary variable importance level.
  3. Boosted tree variable importance. A variable importance measure can be obtained from the boosted-tree method by tabulating the reduction in classification error attributed to each variable at each tree split and summing over each boosting iteration. Using this index the boosted tree model was re-run using only variables above an arbitrary variable importance level.
  4. SVM variable importance. No built-in importance score is available for SVM classification models; instead the area under the ROC curve for each variable can be used. Using this index the SVM model was re-run using only variables above an arbitrary variable importance level.

11. Model performance in cross-validated training sets and in test set.

11.1 Total number of models tested

  • 5 variable sets defined in section 7 assessed in each of the 5 model types (=25); PLUS
  • 10 models with pre-processing of variables with PCA (=35); PLUS
  • Variable set defined by LASSO in five model types (=40); PLUS
  • Models re-run with variables specifically important in that model type (TOTAL =45).

All models were successfully fit, except logistic regression on the full variable set which did not converge (number of predictors too large for number of observed individuals).

11.2 ROC AUC performance

The ROC AUC for each model is shown below. Grey points are the values derived from cross validation in the training set (estimating out of sample performance). The coloured points are the values in the test set with 95% CI.

The best model overall in tets set was a random forest on the full data set. However, most model types performed similarly except for the simple decison trees.

Variable selection was particularly important for logit models. Model specific variable selection was generally better than pre-model variable selection, but the univariate association and LASSO variable selections performed reasonably well in logistic regression, random forest and SVM. The supervised dimension reduction prior to modelling performed moderately well in logistic regression and SVM despite relative parsimony. Unsupervised PCA pre-processing added value in a decision tree model.

11.3 Comparing predicted probabilities between models

The following plot shows predicted probabilities against actual bacteraemia class for every model. In each panel, true positives are in the top right corner, true negatives are in the bottom left corner. The points are coloured according to predicted probabilities from one model (the logistic regression on the univariate association variables) with higher predicted probabilities a lighter shade of blue. It shows the models mostly made false predictions for the same individuals (although there are some exceptions). i.e. Some individuals consistently had high predicted probability of MTB bacteraemia, but did not have a bacteraemia. Thsi could represent noise in the outcome variable and therefore a possible ceiling to predictive model accuracy.

11.4 More parsimonious versions of best models

The full variable set random forest model includes >40 variables. Repeatedly rerunning this model with progressively less variables based on the variable importance measure shows that the optimal number of variables may be between 10 and 20.

Using 10 variables only, the ROC AUC for the random forest model on cross validation was .

11.5 Ensemble model

Because the univariate association variable set and the LASSO variable set performed well in logistic regression, RF and SVM models, predictions from these models can all be used without additional work of variable input when it comes to using the model in practice. The mean predicted probabilities from all 6 models (‘ensemble model’) had higher ROC AUC than any individual model, and could add stability to the predictions.

## 
## Call:
## roc.formula(formula = bacteraemia ~ p.Logit.lasso, data = ensTEST)
## 
## Data: p.Logit.lasso in 44 controls (bacteraemia no) < 22 cases (bacteraemia yes).
## Area under the curve: 0.8316
## 
## Call:
## roc.formula(formula = bacteraemia ~ p.RF.lasso, data = ensTEST)
## 
## Data: p.RF.lasso in 44 controls (bacteraemia no) < 22 cases (bacteraemia yes).
## Area under the curve: 0.8187
## 
## Call:
## roc.formula(formula = bacteraemia ~ p.svm.lasso, data = ensTEST)
## 
## Data: p.svm.lasso in 44 controls (bacteraemia no) < 22 cases (bacteraemia yes).
## Area under the curve: 0.8378
## 
## Call:
## roc.formula(formula = bacteraemia ~ p.Logit.univar, data = ensTEST)
## 
## Data: p.Logit.univar in 44 controls (bacteraemia no) < 22 cases (bacteraemia yes).
## Area under the curve: 0.8605
## 
## Call:
## roc.formula(formula = bacteraemia ~ p.RF.univar, data = ensTEST)
## 
## Data: p.RF.univar in 44 controls (bacteraemia no) < 22 cases (bacteraemia yes).
## Area under the curve: 0.8605
## 
## Call:
## roc.formula(formula = bacteraemia ~ p.svm.univar, data = ensTEST)
## 
## Data: p.svm.univar in 44 controls (bacteraemia no) < 22 cases (bacteraemia yes).
## Area under the curve: 0.8471
## 
## Call:
## roc.formula(formula = bacteraemia ~ meanPred, data = ensTEST)
## 
## Data: meanPred in 44 controls (bacteraemia no) < 22 cases (bacteraemia yes).
## Area under the curve: 0.8626

11.6 Predicted probability cut-off for optimal PPV

In the test set, about 25% of patients have ensemble model predicted probability of bacteraemia greater than 0.5. For this cut-off the PPV would be ~80% in the test set.