Applying H2O AutoML to detect multifactorial drivers of STI and HPV co-infection in high-risk populations

Background

In South Africa, women of reproductive age carry a disproportionate burden of sexually transmitted infections (STIs), including human papillomavirus (HPV). However, little is known about the epidemiology of sexually transmitted pathogens among women living in rural areas of the Eastern Cape Province, where prevalence data are scarce.

Aim

This study seeks to determine the optimal machine learning algorithm within the H2O AutoML framework for predicting and analyzing the prevalence of sexually transmitted infections (STIs) in South Africa, with a specific focus on the Eastern Cape Province. The findings of this research will augment the existing body of knowledge on STIs in the region, providing valuable information that can inform the development of targeted interventions and preventive strategies to reduce the burden of HPV infection on women’s health in rural communities.

Objective

The primary objective of this study is to investigate the risk factors associated with high-risk (HR) HPV infection among women residing in the Eastern Cape Province of South Africa.

Materials and Methods

A total of 205 cervical specimens were obtained from women aged 30 years or older at a rural community-based clinic, and subsequently subjected to comprehensive testing for a range of sexually transmitted infections (STIs) using a multiplex PCR STD direct flow chip assay on a manual Hybrispot platform (Master Diagnostica, Granada, Spain). The assay panel encompassed Chlamydia trachomatis (serovars A-K & L1-L3), Haemophilus ducreyi, Herpes Simplex Virus (Types 1 & 2), Neisseria gonorrhoeae, Treponema pallidum, Trichomonas vaginalis, as well as pathobionts including Mycoplasma genitalium, Mycoplasma hominis, and Ureaplasma spp. Additionally, high-risk human papillomavirus (HR-HPV) detection was performed using the Hybrid Capture-2 assay.

Statistical analysis

H2O AutoML

H2O AutoML, a component of the H2O platform, automates the training and tuning of supervised machine learning models. It is designed to be simple to use, which makes it attractive for enterprise environments, while still delivering high-performance models. It supports binary classification, multi-class classification, and regression on tabular datasets. A notable advantage of H2O AutoML is fast scoring, so multiple models can generate predictions in a short time. H2O AutoML also offers APIs in multiple languages, which broadens its usability across domains. As an open-source, distributed, and scalable automated supervised learning tool integrated into the H2O library, it has gained widespread popularity in both academic and industrial settings.

To evaluate the performance of learning models in this study, the proposed system employs a combination of classifiers in conjunction with the H2O AutoML technique. The learning models are trained using H2O version 3.10.3.1, with all learning algorithms implemented through the H2O AutoML module, ensuring a comprehensive assessment of model performance.

Explainable Machine Learning

Conventional machine learning models often require post hoc explanations to support decision-making. These explanations help readers understand the logic behind a prediction. Explainable machine learning (ML) techniques such as SHAP (SHapley Additive exPlanations) improve transparency by providing measures of feature importance for interpreting predictions. In this study, SHAP was used to interpret the machine learning-based predictions, offering a unified measure of feature importance.

SHAP was utilized to elucidate the importance of features in the study. While SHAP feature importance surpasses traditional alternatives, it may not provide substantial additional value in isolation. Beeswarm plots, however, offer a more comprehensive and informative visualization of SHAP values, revealing not only the relative importance of features but also their actual relationships with the predicted outcome. The SHAP summary graphic representation provides insights into the contribution of each feature to each instance of data. By aggregating feature contributions and the bias term, the model’s raw prediction (prior to applying the inverse link function) can be obtained. This graphic representation facilitates understanding of the significance of each feature in influencing the model’s predictions, thereby enhancing model interpretability.
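The additivity property described above can be illustrated with a minimal sketch. It assumes a binary classifier with a logit link, so the raw prediction is a log-odds value and the inverse link is the logistic function. The contribution values and the bias term are hypothetical and are not results from this study.

```python
import math


def raw_prediction(contributions, bias):
    """Add the per-feature SHAP contributions to the bias term (link scale)."""
    return bias + sum(contributions.values())


def inverse_logit(value):
    """Inverse of the logit link: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-value))


# Hypothetical contributions for a single specimen (not study results).
contributions = {
    "single_and_multiple_sti_infections_with_hpv": 0.90,
    "mycoplasma_hominis": -0.30,
    "age": 0.15,
}
bias = -0.75  # hypothetical bias term on the log-odds scale

raw = raw_prediction(contributions, bias)
probability = inverse_logit(raw)
print(f"raw prediction (log-odds): {raw:.3f}")
print(f"predicted probability: {probability:.3f}")
```

In this toy case the positive and negative contributions cancel the bias, so the raw prediction is approximately zero and the probability is approximately 0.5.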

We employed the open-source H2O.ai autoML package for the R programming language to facilitate model development. To ensure reproducibility, we excluded deep learning methods from this study. The autoML was trained on a randomly selected 80% of the dataset, utilizing 10-fold cross-validation. We instructed the autoML to generate 20 machine learning models, ranking them by their performance based on the area under the precision-recall curve (AUCPR) on the remaining 20% of the dataset. Additionally, we generated variable importance for the top-performing models and selected the 10 variables that exerted the greatest influence on model effectiveness. Variable importance was determined by calculating the relative influence of each variable, considering both its selection as a splitting criterion during tree building and the resulting improvement in squared error across all trees.
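The partitioning scheme described above can be sketched as follows. This is an illustration only: the function name, the seed, and the round-robin fold assignment are assumptions, and the exact rows and counts used in the study are not reported. The figure of 205 rows is taken from the number of specimens.

```python
import random


def split_and_fold(n_rows, train_fraction=0.8, n_folds=10, seed=42):
    """Randomly split row indices into training and hold-out sets,
    then assign each training row to one cross-validation fold."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)

    n_train = int(round(n_rows * train_fraction))
    train, holdout = indices[:n_train], indices[n_train:]

    # Round-robin assignment keeps the fold sizes within one row of each other.
    folds = {row: position % n_folds for position, row in enumerate(train)}
    return train, holdout, folds


train, holdout, folds = split_and_fold(205)
fold_sizes = [list(folds.values()).count(k) for k in range(10)]
print(len(train), len(holdout), fold_sizes)
```

With 205 rows, an exact 80/20 split yields 164 training rows and 41 hold-out rows, so each of the 10 folds holds 16 or 17 rows. Fold-level metrics on a dataset of this size are therefore based on very few observations.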

The relationships between independent variables and HR-HPV infection were examined using univariable and multivariable logistic regression models. The multivariable analysis included the variables that were statistically significant (p-value < 0.05) in the univariable models, in order to identify the variables independently associated with HR-HPV infection. Crude and adjusted odds ratio estimates, along with their 95% confidence intervals, were reported. A p-value < 0.05 was set as the threshold for statistical significance in all analyses.
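As an illustration of how a crude odds ratio and its 95% confidence interval can be obtained, the sketch below applies the standard Wald interval to a 2×2 table. The counts are hypothetical and do not come from this study. Adjusted odds ratios are instead taken from the coefficients of the multivariable logistic regression model, which is not reproduced here.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Crude odds ratio with a Wald confidence interval.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for one binary exposure (not study data).
crude_or, ci_lower, ci_upper = odds_ratio_wald(30, 20, 50, 100)
print(f"OR = {crude_or:.2f}, 95% CI: {ci_lower:.2f}-{ci_upper:.2f}")
```

An interval that excludes 1 corresponds to a p-value below 0.05 under the same Wald approximation.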

Results

Dataset

Table 1. Leaderboard of top 20 autoML models for HR HPV infection ranked by evaluation metrics using validation dataset

## # A tibble: 20 × 7
##    model_id                   auc logloss aucpr mean_per_class_error  rmse   mse
##    <chr>                    <dbl>   <dbl> <dbl>                <dbl> <dbl> <dbl>
##  1 GBM_2_AutoML_3_20240506… 0.859   0.420 0.758                0.217 0.376 0.141
##  2 GBM_4_AutoML_3_20240506… 0.851   0.432 0.735                0.200 0.384 0.147
##  3 GBM_3_AutoML_3_20240506… 0.849   0.434 0.754                0.221 0.383 0.147
##  4 XGBoost_grid_1_AutoML_3… 0.841   0.487 0.721                0.238 0.401 0.161
##  5 GLM_1_AutoML_3_20240506… 0.830   0.461 0.728                0.244 0.388 0.151
##  6 XGBoost_3_AutoML_3_2024… 0.830   0.469 0.707                0.247 0.396 0.157
##  7 XGBoost_grid_1_AutoML_3… 0.820   0.503 0.711                0.243 0.405 0.164
##  8 GBM_5_AutoML_3_20240506… 0.810   0.498 0.677                0.258 0.412 0.170
##  9 GBM_grid_1_AutoML_3_202… 0.809   0.485 0.701                0.263 0.401 0.161
## 10 DRF_1_AutoML_3_20240506… 0.800   0.490 0.691                0.271 0.402 0.162
## 11 GBM_grid_1_AutoML_3_202… 0.795   0.503 0.682                0.261 0.407 0.166
## 12 XGBoost_grid_1_AutoML_3… 0.789   0.532 0.647                0.250 0.419 0.175
## 13 XGBoost_2_AutoML_3_2024… 0.760   0.531 0.628                0.328 0.419 0.175
## 14 XGBoost_grid_1_AutoML_3… 0.755   0.544 0.604                0.279 0.426 0.182
## 15 GBM_grid_1_AutoML_3_202… 0.745   0.536 0.654                0.307 0.423 0.179
## 16 XRT_1_AutoML_3_20240506… 0.733   0.564 0.503                0.306 0.438 0.192
## 17 XGBoost_1_AutoML_3_2024… 0.723   0.561 0.560                0.319 0.435 0.189
## 18 GBM_grid_1_AutoML_3_202… 0.704   0.623 0.489                0.338 0.462 0.213
## 19 XGBoost_grid_1_AutoML_3… 0.694   0.569 0.469                0.317 0.440 0.193
## 20 XGBoost_grid_1_AutoML_3… 0.551   0.623 0.383                0.446 0.465 0.216
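The last two leaderboard columns are related by definition, since MSE is the square of RMSE. The sketch below is a simple consistency check on a few rows copied from Table 1. The tolerance of 0.001 is an assumption chosen to allow for the three-decimal rounding of both columns.

```python
# (rank, rmse, mse) copied from selected rows of Table 1.
leaderboard_rows = [
    (1, 0.376, 0.141),
    (2, 0.384, 0.147),
    (3, 0.383, 0.147),
    (10, 0.402, 0.162),
    (20, 0.465, 0.216),
]


def largest_gap(rows):
    """Largest absolute difference between rmse squared and the reported mse."""
    return max(abs(rmse ** 2 - mse) for _, rmse, mse in rows)


gap = largest_gap(leaderboard_rows)
print(f"largest |rmse^2 - mse| = {gap:.4f}")
print("consistent within rounding" if gap < 0.001 else "check the table")
```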

Table 2. Comparison of evaluation metrics and calibration among the different models built in H2O AutoML.

Model	Error Rate	Accuracy	Sensitivity	Specificity
GBM_2_AutoML	0.1807	0.8192	0.8796	0.7852
XGBoost_grid_1_AutoML	0.1667	0.8333	0.8584	0.8153
GLM_1_AutoML	0.2048	0.7952	0.8284	0.7801
DRF_1_AutoML	0.1864	0.8136	0.7860	0.8226
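The four metrics in Table 2 all derive from a confusion matrix. The sketch below shows the standard definitions using hypothetical counts, since the confusion matrices and classification thresholds behind Table 2 are not reported. It then checks that the error rate and accuracy reported in Table 2 sum to approximately 1, as the definitions require.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "error_rate": 1 - accuracy,
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Hypothetical counts for a 41-row hold-out set (not study data).
metrics = classification_metrics(tp=15, fp=5, tn=18, fn=3)
print({name: round(value, 4) for name, value in metrics.items()})

# (error rate, accuracy) as reported in Table 2.
table_2 = {
    "GBM_2_AutoML": (0.1807, 0.8192),
    "XGBoost_grid_1_AutoML": (0.1667, 0.8333),
    "GLM_1_AutoML": (0.2048, 0.7952),
    "DRF_1_AutoML": (0.1864, 0.8136),
}
sums = {model: error + accuracy for model, (error, accuracy) in table_2.items()}
print(sums)
```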

Figure 1: Distribution of other STIs among study participants

This bar chart illustrates the frequency and percentage of various combinations of other STIs reported by participants. The x-axis represents the percentage of participants reporting each category of STIs, while the y-axis lists the STI combinations. The total number of observations is 202 out of a possible 205, indicating the subset of participants who provided complete data on other STIs.

Key Findings:

Figure 1 shows the prevalence of different STI combinations among the study participants. The most common category was ‘UU’, reported by 63 participants (30.73%), followed by ‘Negative’ (no other STIs), reported by 44 participants (21.46%). The combination ‘MH,UU’ was reported by 38 participants (18.54%). Other STI combinations were considerably less frequent. This distribution highlights the variability in STI co-infections among participants, with a notable proportion having either a single type of infection (‘UU’) or no STIs. These findings emphasize the need for targeted screening and interventions tailored to specific STI combinations to manage and prevent the spread of these infections in the population.
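The percentages quoted above can be reproduced from the counts with a short calculation. The sketch suggests that the denominator is the full cohort of 205 rather than the 202 participants with complete data. This is an inference from the arithmetic, not a statement from the original analysis.

```python
def percent(count, denominator):
    """Percentage rounded to two decimals."""
    return round(100 * count / denominator, 2)


# Counts quoted in the text for Figure 1.
counts = {"UU": 63, "Negative": 44, "MH,UU": 38}

for category, count in counts.items():
    print(category, percent(count, 205), percent(count, 202))
```

Using 205 as the denominator gives 30.73%, 21.46%, and 18.54%, which match the reported values. Using 202 does not.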

Figure 2: Prevalence of co-infections in HPV-positive individuals

This bar chart illustrates the 20 most common co-infections detected in individuals who tested positive for HPV. Each bar represents the percentage of individuals with a specific combination of STIs, in addition to HPV. The x-axis displays the percentage of individuals, while the y-axis lists the co-infection combinations. The data is based on a sample of 194 individuals out of a possible 205.

Key Findings:

The analysis of co-infections in HPV-positive individuals reveals a diverse spectrum of bacterial and viral STIs. The most frequent co-infection is UU, affecting 40 individuals (19.51%). This is followed by the combination of MH and UU, affecting 29 individuals (14.15%), and UU with HPV, affecting 23 individuals (11.22%). Multiple infections involving MH, UU, TV, and HSV-2 are also present, each accounting for less than 6% of cases. These findings highlight the complexity of STI co-infections in individuals with HPV, emphasizing the need for comprehensive diagnostic and treatment strategies that can address multiple pathogens simultaneously.

Figure 3: Distribution and proportions of HIV infection relative to other STIs

This figure examines the relationship between HIV infection status and other STIs among a cohort of 205 subjects. The upper panel presents a bar chart showing the number of subjects categorized by their HIV status (negative or positive) across various STI status variables. The lower panel displays a stacked bar chart illustrating the proportion of subjects (in percentages) with HIV negative and positive statuses across the same variables, with blue indicating HIV negative and orange indicating HIV positive.

Key Findings:

The analysis reveals a diverse distribution of HIV infection across different STI variables. The UU category has the highest number of subjects, both HIV negative and positive, followed by the Negative category. The MH_UU and OTHERS categories also show significant counts. Notably, the proportion of HIV positive cases is high in categories such as UU_HSV2 (75%) and MH_UU_HSV2 (66.7%). In contrast, the UU_TV (25%) and UU (31.8%) categories show a lower proportion of HIV positive subjects, suggesting a weaker correlation between HIV infection and these conditions. These findings highlight the complex interplay between HIV and other STIs, emphasizing the need for targeted interventions in populations with specific STI co-infections.

Figure 4: Distribution and proportions of other STIs with HPV stratified by HIV infection

The upper panel displays a bar chart representing the count of individuals categorized by HIV infection status (positive in blue, negative in orange) across different STI status variables (MH, UU, OTHERS, Negative, UU.HPV, HPV, MH.UU.TV, MH.UU.HPV, UU.TV, MH.HPV, MH.UU, LU.TV). The lower panel shows a stacked bar chart detailing the proportions (%) of HIV infection status for each health variable. The total number of observations is 205.

Key Findings:

The bar chart in Figure 4 reveals the distribution of HIV infection status among 205 individuals across different STI status variables. Notably, the “UU” category has a higher number of HIV-negative individuals (31) compared to HIV-positive (9), while the “Negative” category shows a similar trend. In contrast, the “MH_UU” category has a balanced distribution between HIV-negative (15) and HIV-positive (14) statuses.

The proportions panel shows considerable variability in the distribution of HIV infection status across the STI status variables. For example, “UU_TV” has a higher proportion of HIV-negative individuals (80%) than HIV-positive individuals (20%). The “OTHERS” category shows a higher proportion of HIV-positive individuals (58.1%) relative to HIV-negative (41.9%), followed by individuals with HPV living with HIV (54.4%) and those with HPV but without HIV (45.5%). Individuals with no other STI detected were mostly HIV-negative (75.8%), with only 24.2% living with HIV. This variability suggests differing levels of HIV infection risk associated with specific STIs, which could reflect underlying epidemiological patterns or the effectiveness of targeted health interventions. Further statistical analysis is required to determine the significance of these observations and their potential implications for public health strategies.

Variable Importance Heatmap

The variable importance heatmap provides a comprehensive view of the variable significance across multiple models, allowing for comparisons and evaluations. However, the return of variable importance varies among models, with some requiring one-hot encoded versions of categorical data (e.g., Deep Learning, XGBoost). To facilitate comparisons across model types, a consolidated variable importance is computed by summarizing the importance across all one-hot encoded features. This approach enables the analysis and interpretation of variable importance across diverse models for further model development and refinement.
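The consolidation step described above can be sketched as a sum over the one-hot encoded columns that belong to the same original variable. The column naming scheme (`feature.level`), the function name, and the importance values are assumptions made for illustration and do not reproduce the H2O implementation.

```python
from collections import defaultdict


def consolidate_importance(importances, categorical_features, separator="."):
    """Sum the importance of one-hot encoded columns back onto the
    original categorical variable; numeric columns pass through unchanged."""
    totals = defaultdict(float)
    for column, value in importances.items():
        base = column
        for feature in categorical_features:
            if column.startswith(feature + separator):
                base = feature
                break
        totals[base] += value
    return dict(totals)


# Hypothetical importances from a model that one-hot encodes categoricals.
encoded = {
    "mycoplasma_hominis.Positive": 0.12,
    "mycoplasma_hominis.Negative": 0.08,
    "hiv_infection.Positive": 0.05,
    "hiv_infection.Negative": 0.05,
    "age": 0.30,
}
consolidated = consolidate_importance(
    encoded, categorical_features=["mycoplasma_hominis", "hiv_infection"]
)
print(consolidated)
```

After consolidation, each original variable has a single value, which makes tree-based models that handle categorical variables natively comparable with models that require one-hot encoding.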

Figure 5: Heatmap of variable importance across multiple predictive models for a study on STIs

This visualization displays the results of feature importance analysis for various machine learning models. The y-axis lists the input variables considered in the models, which include demographic factors (such as age), behavioral factors (like frequency of vaginal discharge), clinical factors (including HIV infection status and history of STIs), and other relevant variables. The x-axis shows the different model identifiers, which represent various configurations of XGBoost and Gradient Boosting Machine (GBM) models. The color intensity of each cell in the plot, ranging from light yellow to dark blue, indicates the relative importance of each variable in the corresponding model. Darker blue colors signify higher importance, while lighter yellow colors indicate lower importance. This visualization provides a clear representation of which variables are most influential in each model, allowing for a better understanding of the relationships between the input variables and the predicted outcomes.

Figure 6: Variable importance in Gradient Boosting Model (GBM) for predicting HR-HPV infection

This bar graph displays the relative importance of various predictors in the GBM model for predicting HR-HPV infection. The x-axis shows the normalized importance score, ranging from 0 to 1, while the y-axis lists the predictors. The predictors include single and multiple STIs with HPV, Mycoplasma hominis, age, frequency of vaginal discharge, Ureaplasma species, number of lifetime sexual partners, HIV infection, condom use, Trichomonas vaginalis, and contraception use.
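A score on a 0 to 1 scale is commonly obtained by dividing each variable's relative importance by the largest relative importance, so that the top variable scores exactly 1. The sketch below assumes this convention and uses hypothetical relative importance values. The actual values behind Figure 6 are not given in the text.

```python
def scale_importance(relative_importance):
    """Return scaled (divided by the maximum) and percentage
    (divided by the total) importance for each variable."""
    top = max(relative_importance.values())
    total = sum(relative_importance.values())
    return {
        name: {"scaled": value / top, "percentage": value / total}
        for name, value in relative_importance.items()
    }


# Hypothetical relative importance values (not study results).
relative = {
    "single_and_multiple_sti_infections_with_hpv": 40.0,
    "mycoplasma_hominis": 20.0,
    "age": 10.0,
}
scaled = scale_importance(relative)
for name, scores in scaled.items():
    print(name, round(scores["scaled"], 3), round(scores["percentage"], 3))
```

Because the scaling is relative to the top variable within one model, scaled scores indicate ranking within that model and are not directly comparable across models as absolute effect sizes.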

Key findings:

The analysis reveals that single and multiple STI infections with HPV have the highest importance scores, indicating a strong correlation with HR-HPV infection. Mycoplasma hominis and age also demonstrate significant importance, highlighting their relevance in the model. Other factors, such as frequency of vaginal discharge, Ureaplasma species, and number of lifetime partners, exhibit moderate importance. In contrast, factors like HIV infection, condom use, Trichomonas vaginalis, and contraception use have lower importance scores, suggesting a lesser impact on the model’s predictive ability for HR-HPV infection. This variable importance plot identifies the key factors that should be prioritized in clinical assessments and potential interventions for HR-HPV infection.

Figure 7: Variable importance in predicting HR HPV infection using XGBoost

This bar chart displays the relative importance of various predictors in an XGBoost model designed to identify risk factors associated with HR-HPV infection. The predictors include single and multiple STI infections with HPV, negativity for Mycoplasma hominis, age, negativity for Trichomonas vaginalis, having three or more lifetime partners, infrequent vaginal discharge, non-use of contraception, negativity for Ureaplasma species, and non-use of condoms. The importance scores are normalized and presented on a scale from 0 to 1.

Key Findings:

The analysis reveals that the top predictors of HR-HPV infection risk in the model are the presence of single and multiple STI infections with HPV, particularly multiple infections, and the absence of Mycoplasma hominis. Age is also a significant factor, suggesting that different age groups may have varying risks of HR HPV infection. Other notable predictors include the absence of Trichomonas vaginalis and having multiple lifetime partners. While less influential, factors such as infrequent vaginal discharge, non-use of contraception, absence of Ureaplasma species, and non-use of condoms still play a role in the model. These findings underscore the complexity of factors contributing to HR HPV infection and highlight the importance of considering a broad range of biological and behavioural factors in HR HPV infection prevention strategies.

Figure 8: Variable importance in predicting HR HPV infection using GLM

This bar chart illustrates the relative importance of various predictors in a Generalized Linear Model (GLM) analysing factors associated with HR HPV infection. The predictors include the presence of single and multiple STI infections with HPV, Ureaplasma species status, Mycoplasma hominis status, Chlamydia trachomatis status, and pregnancy status.

Key Findings:

The analysis reveals that the presence of single and multiple STI infections with HPV is the most influential predictor of HR-HPV infection, particularly when multiple infections are present. This factor demonstrates the highest variable importance, significantly surpassing all other factors. The negative status for single and multiple STI infections with HPV also shows substantial importance. The status of Ureaplasma species, Mycoplasma hominis, Chlamydia trachomatis, and pregnancy also contribute to the model, but there are no significant differences between the statuses of these pathogens, indicating that their presence or absence is not influential in the model.

Notably, pregnancy status (both ‘Yes’ and ‘No’) shows the least importance among the factors analyzed, suggesting that it has a minimal influence on the model outcomes related to HR HPV infection in the studied population. This finding implies that other factors may play more critical roles in the context of HR HPV infection.

From the original study conducted by Taku and colleagues

The association between sexually transmitted pathogens and HR-HPV infection was determined by fitting univariable and multivariable logistic regression models. The multivariable analysis used the variables that were statistically significant (p-value < 0.05) in the univariable logistic regression models to identify the sexually transmitted pathogens that are independently associated with HR-HPV infection. Crude and adjusted odds ratio estimates and their 95% confidence intervals were reported. Statistical significance for all analyses was set at p-value < 0.05.

The multivariable model identified two factors that are independently associated with HR-HPV infection: HSV-2 (adjusted odds ratio: 4.17, 95% confidence interval: 1.184-14.690, P = 0.026) and HIV infection (adjusted odds ratio: 2.11, 95% confidence interval: 1.145-3.873, P = 0.017). In contrast, the H2O AutoML GLM model found that the most influential factors associated with HR-HPV infection are single and multiple STI infections with HPV, Ureaplasma species, Mycoplasma hominis, Chlamydia trachomatis, and pregnancy.
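The reported odds ratios, confidence intervals, and p-values can be checked for internal consistency. The sketch below assumes Wald intervals, which are symmetric on the log scale. Under that assumption the odds ratio is the geometric mean of the interval bounds, and the standard error and p-value can be recovered from the interval. The function name and tolerances are illustrative.

```python
import math


def wald_summary(odds_ratio, lower, upper, z=1.96):
    """Recover the interval midpoint, standard error, and two-sided
    p-value implied by an odds ratio and its 95% Wald interval."""
    midpoint = math.sqrt(lower * upper)  # geometric mean of the bounds
    se_log_or = math.log(upper / lower) / (2 * z)
    z_statistic = math.log(odds_ratio) / se_log_or
    p_value = math.erfc(abs(z_statistic) / math.sqrt(2))
    return midpoint, se_log_or, p_value


# Values as reported for the multivariable model.
hsv2_mid, hsv2_se, hsv2_p = wald_summary(4.17, 1.184, 14.690)
hiv_mid, hiv_se, hiv_p = wald_summary(2.11, 1.145, 3.873)

print(f"HSV-2: midpoint {hsv2_mid:.2f}, p = {hsv2_p:.3f}")
print(f"HIV:   midpoint {hiv_mid:.2f}, p = {hiv_p:.3f}")
```

Both midpoints agree with the reported odds ratios, and the recovered p-values are close to the reported 0.026 and 0.017. Small differences are expected because the published estimates are rounded.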

Figure 9: Variable importance of factors associated with HR HPV infection as determined by DRF

The bar chart illustrates the relative importance of various factors in predicting HR-HPV infections. Factors are listed on the y-axis, and the normalized importance scores, ranging from 0 to 1, are plotted on the x-axis. The most influential factors include single_and_multiple_sti_infections_with_hpv, ureaplasma_species, mycoplasma_hominis, chlamydia trachomatis and pregnancy.

Key Findings:

Single and multiple STI infections with HPV shows the highest variable importance, indicating a strong association with HR HPV infection. The presence of single or multiple STIs alongside HPV strongly influences the model’s predictions. Age is the second most influential factor, suggesting that the likelihood of HR HPV infection varies with age. This aligns with epidemiological studies showing different HPV infection rates across age groups. Mycoplasma hominis, which is often found in the urogenital tract, also plays a notable role in the model, highlighting potential interactions between bacterial infections and HR HPV status. Other factors such as “frequency of vaginal discharge,” “vaginal sexual intercourse,” and “lifetime partners” also contribute to the model, albeit to a lesser extent. These factors are indicative of sexual behavior and hygiene practices that may affect HPV transmission. This analysis underscores the multifactorial nature of HR HPV infections, emphasizing the importance of considering a broad range of biological and behavioural factors in disease prediction and management strategies.

SHAP Summary

The SHAP summary plot visualises the contribution of each feature for each data instance, offering insights into feature importance. The sum of feature contributions, along with the bias term, equals the model’s raw prediction prior to the application of the inverse link function. This plot type is valuable for understanding feature effects, significance, and model reliance on feature interactions.

Figure 10: SHAP summary plot for GBM model

This SHAP summary plot illustrates the importance of various clinical and demographic features in predicting HR HPV infection using a Gradient Boosting Machine (GBM) model. Each dot in the plot represents a SHAP value for a feature in an individual prediction, with the color indicating the normalized value of the feature. The features are listed on the y-axis, and their corresponding SHAP contributions are plotted on the x-axis. Positive SHAP values push the model’s prediction higher (towards the positive class), while negative values push it lower.

Key Findings:

The SHAP summary plot provides valuable insights into the contributions of each feature to the predictive model’s outcomes. The features ‘single and multiple STI infections with HPV’, ‘Mycoplasma hominis’, ‘Ureaplasma species’, ‘age’, and ‘frequency of vaginal discharge’ exhibit both positive and negative SHAP values, indicating that their impact on the model’s prediction varies depending on the context or levels of these features. The complex relationships between these features and the predicted outcome are highlighted by the distribution of their SHAP values across both positive and negative contributions. In contrast, features such as ‘pregnancy’, ‘Chlamydia trachomatis’, ‘Neisseria gonorrhoeae’, and ‘Mycoplasma genitalium’ do not contribute significantly to the model’s predictions, as indicated by their lack of significant SHAP values.

Figure 11: SHAP summary plot for XGBoost model

This SHAP summary plot illustrates the impact of various features on the predictive outcome of the XGBoost model. The features are listed on the y-axis, ranging from ‘single and multiple STI infections with HPV’ to ‘pregnancy’, while the x-axis quantifies their SHAP contributions. Positive SHAP values push the model’s prediction towards the positive class, whereas negative values push it towards the negative class. Each point in the plot represents a SHAP value for a feature in an individual prediction, with the color reflecting the normalized value of the feature.

Key Findings:

The XGBoost model highlights the significant importance of ‘single and multiple STI infections with HPV’ in predicting HR HPV infection, as evidenced by its substantial positive SHAP contribution. Additionally, ‘Mycoplasma hominis’ and ‘Ureaplasma species’ demonstrate notable positive contributions, underscoring their crucial roles in the model’s predictions. The SHAP values for features like ‘HIV infection’, ‘condom use’, and ‘Trichomonas vaginalis’ exhibit variability, suggesting that these factors may influence the model outcome differently across different observations. In contrast, features such as ‘pregnancy’, ‘Chlamydia trachomatis’, ‘Neisseria gonorrhoeae’, and ‘Mycoplasma genitalium’ do not contribute significantly to the model’s predictions, as indicated by their lack of significant SHAP values.

Figure 12: SHAP summary plot for DRF model

This SHAP summary plot illustrates the impact of various features on the predictive output of the DRF model. Each point in the plot represents a SHAP value for a feature in a single prediction, with the color intensity indicating the normalized value of the feature, ranging from 0 (light pink) to 1 (dark blue). The features are ordered by the sum of SHAP value magnitudes across all samples, reflecting their overall importance to the model. Positive SHAP values suggest a feature contributes to increasing the model’s output, while negative values suggest a decrease.
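The ordering rule described above, in which features are ranked by the sum of SHAP value magnitudes across all samples, can be sketched as follows. The SHAP values are hypothetical and the function name is an assumption.

```python
def rank_by_total_magnitude(shap_rows):
    """Order features by the sum of absolute SHAP values across samples."""
    totals = {}
    for row in shap_rows:
        for feature, value in row.items():
            totals[feature] = totals.get(feature, 0.0) + abs(value)
    ranking = sorted(totals, key=totals.get, reverse=True)
    return ranking, totals


# Hypothetical SHAP values for two predictions (not study results).
shap_rows = [
    {"single_and_multiple_sti_infections_with_hpv": 0.8, "age": -0.2, "pregnancy": 0.0},
    {"single_and_multiple_sti_infections_with_hpv": -0.6, "age": 0.3, "pregnancy": 0.01},
]
ranking, totals = rank_by_total_magnitude(shap_rows)
print(ranking)
```

Taking absolute values before summing matters here. A feature with large contributions of opposite sign, like the first feature in this toy example, would look unimportant if the signed values were summed, yet it ranks first by magnitude.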

Key Findings:

The feature ‘single and multiple STI infections with HPV’ exhibits the highest variability in SHAP contributions, indicating a significant influence on the model’s predictions, with both strong positive and negative impacts. Other features, such as ‘Mycoplasma hominis’, ‘age’, ‘frequency of vaginal discharge’, ‘HIV infection’, and ‘Ureaplasma species’, demonstrate substantial positive contributions, highlighting their strong association with HR HPV infection predicted by the model. Features related to sexual behavior and contraceptive use, including ‘condom use’, ‘sexual partners in the past 1 month’, and ‘contraception use’, display moderate influence with positive contributions. The plot underscores the complexity of interactions among various health conditions and behaviors, emphasizing the importance of considering a broad range of factors in predictive modeling of health outcomes.

Partial Dependence Plots

The partial dependence plot (PDP) provides a visual representation of the marginal effect of a specific variable on the response variable, quantified as the change in the mean response. This plot assumes independence between the feature for which the PDP is computed and the remaining features.
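The computation behind a partial dependence curve can be sketched in a few lines: for each candidate value of the feature, every row is given that value, the model is scored, and the predictions are averaged. The toy model and the rows below are hypothetical stand-ins and do not represent any model fitted in this study.

```python
def partial_dependence(predict, rows, feature, grid):
    """Mean prediction when `feature` is forced to each value in `grid`."""
    curve = {}
    for value in grid:
        predictions = [predict({**row, feature: value}) for row in rows]
        curve[value] = sum(predictions) / len(predictions)
    return curve


def toy_model(row):
    """Arbitrary stand-in for a fitted classifier's predicted probability."""
    score = 0.3
    if row["mycoplasma_hominis"] == "Negative":
        score += 0.2
    if row["age"] < 45:
        score += 0.1
    return score


# Hypothetical rows (not study data).
rows = [
    {"age": 35, "mycoplasma_hominis": "Positive"},
    {"age": 50, "mycoplasma_hominis": "Negative"},
    {"age": 60, "mycoplasma_hominis": "Positive"},
]
curve = partial_dependence(
    toy_model, rows, "mycoplasma_hominis", ["Negative", "Positive"]
)
print(curve)
```

Because every row is assigned each grid value regardless of its other features, the procedure can create feature combinations that never occur in the data. This is why the independence assumption stated above matters when interpreting the curves.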

Figure 13: Partial dependence plot for single_and_multiple_infection_with_hpv

This partial dependence plot compares the mean responses of various machine learning models, including Distributed Random Forest (DRF_1), Gradient Boosting Machine (GBM_2), Generalized Linear Model (GLM_1), XGBoost, and Extremely Randomized Trees (XRT_1), to the feature ‘single_and_multiple_sti_infections_with_hpv’. The plot displays the mean response (HR HPV infection) of each model across three categories of the feature: multiple infection, negative, and single infection. Each point represents the mean response of the model for the respective category, with shaded regions indicating areas of higher data density.

Key Findings:

The models generally exhibit higher mean responses for the ‘multiple infection’ category compared to the ‘single infection’ and ‘negative’ categories, suggesting that multiple STI infections with HPV have a stronger influence on the model’s predictions. The XGBoost model shows the highest mean response across all categories, indicating a potentially higher sensitivity to the ‘single_and_multiple_sti_infections_with_hpv’ feature. In contrast, the mean responses for the ‘negative’ category are notably lower across all models, implying a lesser impact of the absence of STI infections on the predictions. The variability in mean responses between models for the same category highlights differences in how each model processes the feature’s influence on the outcome variable (HR HPV infection).

These findings underscore the importance of considering the type of machine learning model used when evaluating the impact of specific features, such as STI infections with HPV, on predictive outcomes in medical datasets.

Figure 14: Partial dependence plot for mycoplasma_hominis

This partial dependence plot compares the mean responses of different machine learning models to the presence (positive) and absence (negative) of Mycoplasma hominis. Each point represents the mean response (HR HPV infection) for a specific model, including Distributed Random Forest (DRF_1), Gradient Boosting Machine (GBM_2), Generalized Linear Model (GLM_1), eXtreme Gradient Boosting (XGBoost), and Extremely Randomized Trees (XRT_1). The shaded regions indicate areas of higher data density.

Key Findings:

The plot reveals the influence of the binary feature “Mycoplasma_hominis” on the predictive models’ responses. Notably, the presence of Mycoplasma hominis generally results in a lower mean response across all models compared to its absence, suggesting a potential negative impact of this feature on the outcome variable (HR HPV infection). The DRF_1 and XGBoost models exhibit the highest mean responses when Mycoplasma hominis is absent, indicating a stronger dependence on this feature compared to other models. This variability in responses highlights the importance of selecting the appropriate machine learning model based on feature interactions and their impact on predictive performance in medical data analysis.

Figure 15: Partial dependence plot for age

This plot compares the mean responses of different machine learning models as a function of age, with age plotted on the x-axis and the mean response (HR HPV infection) plotted on the y-axis. The models compared include Distributed Random Forest (DRF_1), Gradient Boosting Machine (GBM_2), Generalized Linear Model (GLM_1), XGBoost, and Extremely Randomized Trees (XRT_1). The histogram at the bottom of the plot shows the distribution of the sample population across different age groups.

Key Findings:

The partial dependence plot reveals distinct patterns in model responses to age across various machine learning models. The XRT_1 model exhibits a relatively stable response across age groups, with a slight increase in mean response for individuals around 60 years old. In contrast, the DRF_1 model shows a sharp peak in mean response for individuals in their early 40s, followed by a gradual decline and subsequent stabilization after the age of 50. The GLM_1 model demonstrates a gradual decrease in mean response as age increases, suggesting a possible linear relationship between age and the predicted outcome (HR HPV infection) in this model.

The XGBoost model displays a notable dip in mean response for individuals around the age of 50, followed by a sharp decrease for those in their late 50s and early 60s, indicating a non-linear interaction between age and other features within this model. The GBM_2 model shows a significant increase in mean response starting from individuals in their late 40s, followed by a gradual decrease from 40 to 55 years, and then stabilizes thereafter.

These patterns suggest that age has a variable impact on the predictions made by different machine learning models, which could be due to the differing nature of algorithms and their interaction with other features in the dataset. The histogram at the bottom of the plot indicates that the majority of the sample population is concentrated around the ages of 40 and 53, which may influence the model responses observed in this age range. Further investigation into feature interactions and model parameters is necessary to fully understand the implications of these findings.
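The contrast between the smooth GLM curve and the peaks seen in the tree-based models can be illustrated with a small sketch. It assumes two hypothetical scoring functions: `glm_like`, a logistic model whose log-odds fall linearly with age, and `tree_like`, a rule with a single age band as a tree-based learner might produce. Neither function, nor the rows and coefficients, comes from the study.

```python
import math
from statistics import mean

def pd_curve(predict, rows, feature, grid):
    """Partial dependence of `predict` on a numeric feature over `grid`."""
    return [mean(predict({**r, feature: v}) for r in rows) for v in grid]

def glm_like(row):
    # Hypothetical logistic model: log-odds decrease linearly with age.
    z = 1.5 - 0.04 * row["age"] + 0.3 * row["partners"]
    return 1.0 / (1.0 + math.exp(-z))

def tree_like(row):
    # Hypothetical age band, as a tree-based model might learn from splits.
    base = 0.6 if 40 <= row["age"] < 45 else 0.4
    return base + 0.05 * (row["partners"] > 2)

# Hypothetical rows and age grid, NOT study data.
rows = [
    {"age": 35, "partners": 1},
    {"age": 48, "partners": 3},
    {"age": 60, "partners": 2},
]
grid = [30, 40, 50, 60]

glm_curve = pd_curve(glm_like, rows, "age", grid)
tree_curve = pd_curve(tree_like, rows, "age", grid)
print([round(v, 3) for v in glm_curve])
print([round(v, 3) for v in tree_curve])
```

In this sketch the logistic curve can only move in one direction as age increases, whereas the banded rule produces a local peak. This is one plausible reason why a GLM and tree-based models can disagree on the shape of the age effect even when trained on the same data.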

Figure 16: Partial dependence plot for frequency_of_vaginal_discharge

This plot illustrates the mean response (HR HPV infection) of various predictive models to different categories of the frequency of vaginal discharge. The x-axis categorizes the frequency of vaginal discharge into four groups: ‘Current/last week’, ‘More than 6 months ago’, ‘More than a week and less than 6 months’, and ‘NA’ (data not available). The y-axis represents the mean response (HR HPV infection) from the models. Each model is represented by a unique color: DRF_1 (green), GBM_2 (orange), GLM_1 (purple), XGBoost (red), and XRT_1 (light green). The shaded regions indicate areas of higher data density or focus.

Key Findings:

The models generally show higher mean responses for the category ‘Current/last week’ compared to other categories, suggesting a stronger association or impact of recent vaginal discharge on the model outcomes. There is a noticeable variation in the mean responses among different models for the same category of vaginal discharge frequency, indicating differing sensitivities or specificities of the models to this feature.

The shaded area around the ‘More than 6 months ago’ category indicates a higher variability or uncertainty in the model responses for this group, which might be due to fewer data points or inherent variability in the data. These findings highlight the importance of the frequency of vaginal discharge as a predictive feature in these models and suggest potential differences in model performance that could guide further refinement and application in clinical settings.

Figure 17: Partial dependence plot for ureaplasma_species

This figure illustrates the partial dependence of the predicted response on the feature ‘ureaplasma_species’ across different machine learning models. The x-axis categorizes ‘ureaplasma_species’ into ‘Negative’ and ‘Positive’, while the y-axis quantifies the mean response (HR HPV infection). Each point represents the partial dependence for a specific model: DRF_1 (green), GBM_2 (orange), GLM_1 (purple), XGBoost (red), and XRT_1 (light green). The shaded regions indicate areas of higher data density or focus.

Key Findings:

Each model exhibits a unique response to changes in the ‘ureaplasma_species’ feature, indicating differences in how each model processes this feature. The ‘Positive’ category shows generally lower mean responses across all models, with values clustering around 0.35. The ‘Negative’ category demonstrates higher variability among the models, with mean responses ranging approximately from 0.35 to 0.60. The feature ‘ureaplasma_species’ appears to have a significant impact on the model outputs, particularly in distinguishing between the ‘Negative’ and ‘Positive’ categories. This suggests that ‘ureaplasma_species’ is a critical feature in influencing the models’ predictions, potentially reflecting its biological or clinical relevance.

Figure 18: Partial dependence plot for lifetime_partners

This figure illustrates the partial dependence of the mean response (HR HPV infection) on the feature “lifetime_partners” across different machine learning models. The x-axis categorizes the “lifetime_partners” into three groups, represented by arrows, while the y-axis quantifies the mean response. Each model is represented by a unique color: DRF_1 (green), GBM_2 (orange), GLM_1 (purple), XGBoost (red), and XRT_1 (light green). The shaded regions indicate areas of higher data density or focus.

The observations highlight the diverse ways in which the feature “lifetime_partners” impacts the model outputs, which could be crucial for understanding model behavior and guiding feature engineering and model selection in predictive tasks.

Figure 19: Partial dependence plot for HIV Infection

This plot illustrates the partial dependence of the predicted response on the binary feature ‘hiv_infection’ (Negative, Positive) for five different machine learning models: Distributed Random Forest (DRF_1), Gradient Boosting Machine (GBM_2), Generalized Linear Model (GLM_1), XGBoost, and Extremely Randomized Trees (XRT_1). The y-axis represents the mean response of the models, indicating how changes in the ‘hiv_infection’ status influence the model predictions. The shaded regions highlight the range of the feature values.

The partial dependence plot reveals distinct variations in the mean response (HR HPV infection) across the machine learning models when conditioned on the ‘hiv_infection’ feature, with the models exhibiting varying degrees of sensitivity to HIV infection status. For instance, the DRF_1 model shows a higher mean response for ‘Positive’ HIV status than for ‘Negative’, suggesting a strong dependence on this feature. The XRT_1 model shows the highest mean response for the ‘Negative’ HIV status, followed by the GBM_2 model.

This inter-model variability in how each model values the ‘hiv_infection’ feature underscores the importance of model selection and calibration in predictive analytics, particularly in healthcare settings where the accurate interpretation of such features is crucial. The insights from this plot could be instrumental in refining clinical decision support systems, ensuring that the predictive models used are appropriately sensitive to critical features such as HIV infection status, which could in turn lead to more personalized and effective patient care strategies. The analysis also provides a foundation for further investigation into model-specific responses to clinically relevant features.
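One simple way to quantify this kind of inter-model variability is to reduce each model's partial dependence to a single effect, the mean response for ‘Positive’ minus the mean response for ‘Negative’, and then check whether the models agree on its direction. The sketch below assumes hypothetical model names (`model_a`, `model_b`, `model_c`) and placeholder values; they are not read from Figure 19.

```python
def effect_summary(pd_by_model, baseline="Negative", level="Positive"):
    """Per-model change in mean response when moving from baseline to level."""
    effects = {
        name: values[level] - values[baseline]
        for name, values in pd_by_model.items()
    }
    # Collect the direction (positive or negative) of every non-zero effect.
    signs = {effect > 0 for effect in effects.values() if effect != 0}
    return effects, len(signs) <= 1  # True when all effects share a direction

# Hypothetical partial dependence values, NOT read from Figure 19.
pd_by_model = {
    "model_a": {"Negative": 0.40, "Positive": 0.55},
    "model_b": {"Negative": 0.50, "Positive": 0.45},
    "model_c": {"Negative": 0.42, "Positive": 0.44},
}

effects, agree = effect_summary(pd_by_model)
for name, effect in sorted(effects.items(), key=lambda kv: kv[1]):
    print(f"{name}: {effect:+.2f}")
print("models agree on direction:", agree)
```

When the models disagree on direction, as in this toy example, the feature's apparent effect depends on the chosen algorithm and should be interpreted with caution.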

Key findings from literature per covariate

Single and multiple STIs with HPV

Research consistently shows a relationship between HR-HPV infection and both single and multiple STIs. Vriend (2012) found that HR-HPV infection was associated with high-risk sexual behavior, and that multiple HR-HPV infections were more strongly linked to this behavior than single HR-HPV infections. Kim (2021) further supported this, showing that multiple HR-HPV infections were associated with high-grade SIL and persistent HPV infection. Pista (2011) and Chen (2022) also found that multiple HR-HPV infections were more prevalent in women with more severe cervical lesions, suggesting a potential link between these infections and disease progression.

Mycoplasma hominis

Research has shown a potential relationship between Mycoplasma hominis and high-risk human papillomavirus (hrHPV) infection. Adebamowo (2017) found a significant association between persistent M. hominis and persistent hrHPV, particularly in HIV-positive women. Noma (2021) also reported an increased risk of low-grade squamous intraepithelial cervical lesions in women with Ureaplasma parvum and hrHPV co-infections. However, Atallah (2019) and Taylor-Robinson (2020) caution that further studies are needed to confirm these associations and to consider other factors such as bacterial vaginosis.

Age

Research consistently shows a relationship between HR-HPV infection and age, with prevalence generally increasing with age (Kang, 2014; Han-feng, 2011). However, the association between HR-HPV infection and cervical abnormalities decreases with age (Javed, 2023). Furthermore, older age is associated with a lower clearance rate of HR-HPV infection, which is a key risk factor for cervical cancer (Li, 2019). These findings suggest that age plays a significant role in the prevalence, persistence, and clearance of HR-HPV infection.

Frequency of vaginal discharge

Research has shown a correlation between HR-HPV infection and the frequency of vaginal discharge. Zhang (2021) found a significant relationship between HPV infection and vaginitis, with a higher correlation in patients with bacterial vaginitis. This is further supported by Ghaniabadi (2020), who found a high incidence of bacterial vaginosis in women with HPV infection. Bacterial vaginosis, a common cause of abnormal vaginal discharge, was also found to be significantly associated with pregnancy (Habib, 2016). However, it’s important to note that vaginal discharge is not necessarily indicative of an STD, as highlighted by Gul (2011).

Ureaplasma species

Research suggests a potential relationship between HR-HPV infection and Ureaplasma species. Xian-qin (2011) found that Ureaplasma spp. may be involved in HPV infection and cytological abnormalities of the cervix. This is supported by Ekiel (2013), who reported a higher prevalence of urogenital mycoplasmas, including Ureaplasma, in HPV-positive hemodialyzed women. However, further studies are needed to establish a definitive link between these two factors.

Number of lifetime sexual partners

Research consistently shows a strong relationship between HR-HPV infection and the number of lifetime sexual partners. Karlsson (1995) and Vaccarella (2006) both found that a higher number of sexual partners significantly increased the risk of HPV infection. This was further supported by Nielsen (2008), who found that a higher number of sexual partners was the main risk factor for HR-HPV infection, particularly in younger women. Liu (2018) also found that men with multiple lifetime sexual partners were more likely to have higher levels of HR-HPV load, which is associated with increased persistence of the infection. These findings collectively suggest that a higher number of lifetime sexual partners is a significant risk factor for HR-HPV infection.

HIV infection

Research has consistently shown a significant relationship between HR-HPV infection and HIV acquisition. Studies have found that HR-HPV infection is associated with an increased risk of HIV acquisition among female sex workers (Auvert, 2011), and that HIV infection is a risk factor for HPV acquisition (Kosz, 2020). This relationship is particularly pronounced in HIV/AIDS patients, with a higher prevalence of HR-HPV infection compared to non-HIV cases (Hao-lan, 2010). Furthermore, the association of HR-HPV with HIV incidence suggests that HR-HPV may facilitate HIV acquisition (Auvert, 2010). These findings underscore the importance of addressing both HR-HPV and HIV infections in prevention and treatment strategies.

Condom use

Research consistently shows a significant relationship between consistent condom use and a lower prevalence of high-risk human papillomavirus (HR-HPV) infection. Liang (2006) suggests that men who use condoms more frequently may be less likely to have sexual contact with high-risk groups, leading to a lower prevalence of HPV infection. Nielson (2010) and Winer (2006) both found that consistent condom use was strongly associated with a lower prevalence of HPV infection in men and women, respectively. These findings highlight the importance of condom use in reducing the risk of HR-HPV infection.

Trichomonas vaginalis

Research suggests a potential relationship between HR-HPV infection and Trichomonas vaginalis (TV). While Bruni (2019) found no significant association between TV infection and HIV status, CD4+ T-cell count, or viral load, other studies have highlighted the potential impact of TV on cervical epithelial abnormalities (Honigberg, 1984) and its association with pelvic inflammatory disease, particularly in HIV-infected women (Moodley, 2002). Furthermore, Abdullah (2022) identified a significant increase in serum MCP-1 levels in women infected with TV, suggesting a potential link between TV pathogenicity and immune response. These findings collectively suggest a complex interplay between TV infection and various health outcomes, including the potential for an indirect relationship with HR-HPV infection.

Contraception use

Research has shown a complex relationship between HR-HPV infection and contraception use. Kayikcioglu (2020) found that HR-HPV infection was associated with contraceptive method, suggesting a potential link. Ghanem (2011) further supported this, reporting an association between hormonal contraceptive use and HPV-16 detection. However, Kops (2019) found no significant difference in the prevalence of high-risk HPV types between hormonal contraceptive users and non-users. Vatopoulou (2019) also found that HPV vaccination did not affect sexual behavior or choice of contraceptive method. These findings suggest a need for further research to clarify the relationship between HR-HPV infection and contraception use.

Vaginal sexual intercourse

Research consistently shows a relationship between HR-HPV infection and sexual behavior, particularly vaginal intercourse. Studies in Japan (Tokita, 2020), Spain (Rodríguez-Cerdeira, 2012), Lithuania (Bumbulienė, 2012), and Denmark (Kjaer, 2001) all found that a higher number of sexual partners, early age at first intercourse, and unprotected vaginal intercourse were associated with an increased risk of HR-HPV infection. These findings suggest that vaginal sexual intercourse is a significant factor in the transmission of HR-HPV.

Herpes simplex virus type 2

Research has shown a complex relationship between HR-HPV infection and herpes simplex virus type 2 (HSV-2). While HSV-2 has been associated with increased HIV viral load and potential acceleration of HIV disease progression (Tan, 2013), its specific role in HIV transmission is still unclear (Mbopi-Kéou, 2000). In the context of HR-HPV infection, a higher prevalence of active HSV-2 infection has been observed, suggesting a potential link between the two (Bahena-Román, 2020). However, the effectiveness of HSV-2 suppressive therapy in reducing HIV incidence has been questioned, indicating the need for further research on the relationship between these two viruses (Glynn, 2009).

Chlamydia trachomatis

Research has shown a potential relationship between HR-HPV infection and Chlamydia trachomatis. Bianchi (2016) found a higher risk of C. trachomatis infection in HR-HPV infected women, while Lima (2021) reported a higher prevalence of HR-HPV in women living with HIV, a population at increased risk for C. trachomatis. This is further supported by Joyee (2005), who found a significant association between C. trachomatis and HIV infections. However, Oriel (1978) noted that the presence of C. trachomatis was not always indicative of specific symptoms or clinical signs. These studies suggest a potential link between HR-HPV and C. trachomatis, particularly in high-risk populations.

Concluding Remarks

This study has presented a comprehensive analysis of the predictive performance of various machine learning models in identifying high-risk human papillomavirus (HR HPV) infection. The results demonstrate that the choice of machine learning algorithm can significantly impact the accuracy and reliability of HR HPV infection predictions. The partial dependence plots revealed distinct, model-specific patterns in how predictions respond to each feature, highlighting the complex interactions between clinical and demographic features and their impact on HR HPV infection risk. The findings suggest that features such as single and multiple STI infections with HPV, Mycoplasma hominis, age, frequency of vaginal discharge, and Ureaplasma species are critical predictors of HR HPV infection risk.

Furthermore, the study highlights the importance of model selection and calibration in predictive analytics, particularly in healthcare settings where accurate interpretation of clinical features is crucial. The notable inter-model variability in how each model values specific features underscores the need for careful consideration of model behavior and feature interactions in the development of clinical decision support systems. The insights from this study have important implications for the development of personalized and effective patient care strategies for HR HPV infection prevention and management. By leveraging the strengths of different machine learning models and incorporating domain knowledge of clinical and demographic features, healthcare providers can develop more accurate and reliable predictive models that inform targeted interventions and improve patient outcomes.

Future research directions may include further investigation into the model-specific responses to clinically relevant features, as well as the development of ensemble models that combine the strengths of different machine learning algorithms. Additionally, the integration of these predictive models with electronic health records and other data sources may facilitate the development of more comprehensive and personalized clinical decision support systems. Overall, this study contributes to the growing body of research on the application of machine learning in healthcare and highlights the potential of predictive analytics to improve patient care and outcomes in the context of HR HPV infection prevention and management.

This analysis provides valuable insights into the complex relationships between various clinical and demographic factors and the risk of HR HPV infection. By understanding these relationships and the strengths and limitations of different machine learning models, healthcare professionals can make more informed decisions and develop more effective strategies for prevention and early intervention.

About

KP Analytics Insights (Pty) Ltd is a reputable data-driven research and consulting company based in South Africa. Our primary objective is to facilitate evidence-based decision-making in the field of medical science by offering valuable insights and innovative solutions. We cater to a diverse range of clients, including postgraduate students, independent researchers, medical scientists, as well as organizations and institutions seeking comprehensive statistical analysis, interpretation, reporting, and actionable insights.

At KP Analytics Insights, our portfolio encompasses a wide array of projects focused on sexually transmitted infections (STIs), human papillomavirus (HPV) research, HIV-1 drug resistance, HBV drug resistance, and the enhancement of laboratory workflows through the implementation of statistical quality control methods. Moreover, we actively contribute to the advancement of medical research through rigorous peer reviews, publication of scholarly articles, and the development of interactive dashboards.
