Introduction

The COVID-19 pandemic has challenged health systems all over the world. In Mexico, as of April 2022, the health authorities had reported 5,671,144 confirmed cases and 323,318 deaths (overall mortality rate of 5.7%). A total of 672,987 patients required hospitalization, of whom 302,285 died, an in-hospital mortality rate of 44.9% (1). In-hospital mortality from COVID-19 varied widely between hospitals and institutions in Mexico. The Mexican Institute of Social Security (IMSS, by its Spanish initials) carries the largest burden of public health care and has the highest mortality among patients hospitalized with COVID-19 (2). Prediction models for COVID-19 can help guide evidence-based clinical decision making, but because the mortality of hospitalized patients varies so widely by country and context, it is imperative to develop prognostic prediction models tailored to the local setting.

Previous studies have evaluated laboratory, radiological and clinical data to develop prognostic prediction models with limited success, and some advanced diagnostic tests may not be available in low- and middle-income countries (LMIC) (3). The prognostic models or scores developed in Mexico to date have used an epidemiologic dataset of all patients with COVID-19 (outpatients and inpatients), have focused on predicting ICU admission, or, for hospitalized patients, have relied on integrating AI with chest tomography, which is not always available for all patients in public sector hospitals. Ours is the first study to specifically examine prognostic prediction models for in-hospital mortality among patients hospitalized in public sector hospitals in the state of Puebla, Mexico. Following the recommendations of the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) statement (4), we aimed to develop and validate a prognostic prediction model using sociodemographic, clinical and laboratory data that are easily obtained and are part of usual clinical management in Mexico. This work also highlights the importance of some sociodemographic variables for the outcome of patients hospitalized with COVID-19.

Once the relevant data inputs for the model are identified, it is also essential to develop a rigorous statistical approach to test the prognostic prediction model. In machine learning, decisions are made through models built with data (5). For binary outcomes, logistic regression has been used extensively; it can be regarded as the simplest prediction model when the outcome is categorical. More recently, more complex models have emerged. Random forests is an algorithm that can be considered an extension of classification and regression trees (CART). The CART algorithm is popular because it is simple to run and interpret, but it is unstable, meaning that small changes in the data may lead to large changes in the results. This problem can be mitigated by growing many trees from the same data set, which is what the random forests algorithm does (6). The goal is to improve prediction performance by building a number of individual trees that differ randomly from one another through bootstrapping (7).
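As an illustration of the bootstrapping idea described above, the following Python sketch (not the R code used in this study) grows 25 one-split trees, each on a bootstrap sample, and combines them by majority vote. The toy rows, the helper names and the use of one-split trees are hypothetical simplifications. The random selection of predictors at each split, which distinguishes random forests from plain bagging, is omitted.

```python
import random
from collections import Counter

def fit_stump(rows):
    """Fit a one-split tree: choose the feature, threshold and direction with fewest errors."""
    best = None
    n_features = len(rows[0][0])
    for j in range(n_features):
        for x, _ in rows:
            thr = x[j]
            for sign in (1, -1):
                # The stump predicts 1 when sign * (value - thr) > 0, otherwise 0.
                errors = sum(
                    (1 if sign * (xi[j] - thr) > 0 else 0) != yi for xi, yi in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, j, thr, sign)
    _, j, thr, sign = best
    return lambda x: 1 if sign * (x[j] - thr) > 0 else 0

def fit_bagged_stumps(rows, n_trees=25, seed=1):
    """Grow n_trees stumps, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]
        trees.append(fit_stump(sample))
    return trees

def predict_majority(trees, x):
    """Combine the individual trees by majority vote."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (oxygen saturation in %, age in years) -> 1 = died, 0 = survived.
toy_rows = [
    ((95, 40), 0), ((92, 55), 0), ((90, 48), 0), ((88, 60), 0),
    ((85, 70), 1), ((80, 66), 1), ((78, 72), 1), ((70, 58), 1),
]
trees = fit_bagged_stumps(toy_rows)
print(predict_majority(trees, (75, 65)), predict_majority(trees, (94, 45)))
```

Because each stump sees a slightly different sample, individual thresholds vary, but the vote across trees is far more stable than any single tree.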

Methods

Subjects

This is a retrospective cohort study conducted at the Hospital de Especialidades de Puebla, IMSS (HEP), and the Hospital General Dr. Eduardo Vazquez Navarro (HGEVN), Mexico. The first site is a 315-bed multi-specialty hospital and a referral center for workers from the states of Puebla, Tlaxcala and Oaxaca. In March 2020 it was converted to a hybrid COVID-19 hospital. The second site is a ### bed general hospital that serves the general population, mostly people living below the poverty line; it was also converted to a COVID-19 hospital. Data for developing the model were collected from the charts of patients admitted to the HEP between April 3 and October 17, 2020 with a clinical diagnosis of COVID-19 and a positive SARS-CoV-2 RT-PCR nasopharyngeal swab. This set is referred to as the model-building data set. Patients transferred to other facilities were not included. For validation of the model, data from an independent set of 100 patients from the IMSS hospital were prospectively collected during November 2020 (nonrandom data set). In addition, the model was tested with data from patients of the HGEVN, which belongs to the Secretaria de Salud del Estado de Puebla (external data set).

The data were recorded at admission, and vital status was recorded at discharge as either alive or dead. The data gathered included sociodemographic variables, baseline comorbidities, medications, vital signs on admission and initial laboratory tests (table 1S of the supplementary material (SM)). In addition, the Systematic COronary Risk Evaluation (SCORE) index was calculated, which gives the 10-year risk of fatal cardiovascular disease. This score is endorsed by the European Society of Cardiology for high-risk patients and was used because of the high prevalence of cardiovascular risk factors in the Mexican population. The score considers age from 40 years, gender, systolic blood pressure, smoking, and total cholesterol level (8,9). Socioeconomic level was classified as high, medium-high, medium-low or low. For this, an ad hoc scale was used to obtain a household profile that considered the patient's educational level, current job position and employer.

Ethics

This study was approved by the institutional review boards of both the HEP (No. R-2020-785-126) and the HGEVN (###) as minimal-risk research, and informed consent was waived. To preserve confidentiality, data were de-identified.

Statistical analysis and model building strategy

The statistical analysis was performed with R version 4.1.1 (10) and RStudio version 2021.09.0 (11). The logistic regression and random forests models were built using the caret package with the glm and rf methods, respectively. For exploratory data analysis and descriptive statistics, the tidyverse and ggplot2 packages were used.

Sample size. The criteria proposed by Riley for developing a multivariable prediction model were used (12). Using the pmsampsize package, a sample size of 512 patients for the model-building data set was considered adequate. The parameters were chosen as follows: a Nagelkerke R² of 0.4 was deemed appropriate, as direct measures were used; the overall outcome proportion (mortality rate) was assumed to be 0.4; and 25 candidate predictor parameters were estimated for potential inclusion in the model.

Preprocessing. Once the database was curated, a bivariate analysis was performed using χ², Mann-Whitney U or Student t tests for categorical, ordinal or numerical variables, respectively. Features with p < 0.01 were considered candidate predictors. Variables with near-zero variance were excluded. Single imputation with the median was used to handle missing values.
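The two preprocessing steps can be sketched in Python as follows (an illustration, not the R code used in this study). The near-zero-variance rule follows the documented defaults of caret's nearZeroVar function (frequency ratio above 95/5 and at most 10% unique values); whether the defaults were used here is an assumption, and the function names are hypothetical. The example counts are taken from table 2S.

```python
from collections import Counter
from statistics import median

def impute_median(values):
    """Single imputation: replace missing entries (None) with the observed median."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def is_near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a column that is constant or dominated by a single value.

    Thresholds are assumed to match the caret::nearZeroVar defaults.
    """
    counts = Counter(values).most_common()
    if len(counts) == 1:
        return True  # zero variance: only one distinct value
    freq_ratio = counts[0][1] / counts[1][1]  # most common vs second most common
    percent_unique = 100.0 * len(counts) / len(values)
    return freq_ratio > freq_cut and percent_unique <= unique_cut

# Counts from table 2S: COPD (636 No, 5 Yes) and diabetes mellitus (509 No, 132 Yes).
copd = [0] * 636 + [1] * 5
diabetes = [0] * 509 + [1] * 132
print(is_near_zero_variance(copd), is_near_zero_variance(diabetes))

# Hypothetical laboratory values with one missing entry.
print(impute_median([1.0, None, 3.0, 10.0]))
```

Under these assumed thresholds a comorbidity as rare as COPD would be flagged, whereas diabetes mellitus would be retained.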

Resampling. When building a prediction model, because the data are a random sample, the estimated parameters and the results obtained are also random variables. If the experiment is repeated, the results will differ. According to the law of large numbers, the average of the results converges as the number of repetitions grows. Thus, one of the most important concepts in machine learning is the use of a resampling technique to stabilize the results. Bootstrapping was used as the resampling method.
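A minimal Python sketch of bootstrap resampling is given below (an illustration, not the R code used in this study). Each resample draws as many rows as the data set contains, with replacement; the rows never drawn form the out-of-bag set, which is commonly used to evaluate the model fitted on that resample. The function name and the seed are hypothetical; the outcome counts (203 deaths among 641 patients) are those of the model-building data set.

```python
import random
from statistics import mean

def bootstrap_splits(n_rows, n_resamples=25, seed=2020):
    """Yield (in_bag, out_of_bag) row indices for each bootstrap resample."""
    rng = random.Random(seed)
    for _ in range(n_resamples):
        in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]  # with replacement
        drawn = set(in_bag)
        out_of_bag = [i for i in range(n_rows) if i not in drawn]
        yield in_bag, out_of_bag

outcomes = [1] * 203 + [0] * 438  # 1 = dead, 0 = alive
splits = list(bootstrap_splits(len(outcomes)))

# The estimate varies from one resample to the next; its average is more stable.
in_bag_mortality = [mean([outcomes[i] for i in in_bag]) for in_bag, _ in splits]
oob_fraction = [len(oob) / len(outcomes) for _, oob in splits]
print(round(min(in_bag_mortality), 3), round(max(in_bag_mortality), 3))
print(round(mean(in_bag_mortality), 3), round(mean(oob_fraction), 3))
```

On average roughly a third of the rows are left out of each resample, which is what makes an honest out-of-bag evaluation possible without a separate data set.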

Internal validation. The data set was partitioned into training and test sets. The training set comprised 85% of the data. During model building, accuracy was used as the performance metric.
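The partition can be sketched in Python as follows (an illustration, not the R code used in this study). It assumes that the split was stratified by outcome and that the training size was rounded up within each class, as caret's createDataPartition is understood to do; the function name and the seed are hypothetical.

```python
import math
import random

def stratified_split(labels, train_fraction=0.85, seed=42):
    """Return (train_idx, test_idx), sampling train_fraction of each outcome class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_train = math.ceil(train_fraction * len(members))  # assumed rounding rule
        train_idx.extend(members[:n_train])
        test_idx.extend(members[n_train:])
    return sorted(train_idx), sorted(test_idx)

labels = [1] * 203 + [0] * 438  # 203 deaths and 438 survivors in the model-building set
train_idx, test_idx = stratified_split(labels)
test_deaths = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_deaths)  # prints: 546 95 30
```

Under these assumptions the test set holds 95 patients (30 deaths and 65 survivors), which is compatible with the sensitivity and specificity reported for the internal data set.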

External validation. Two separate data sets were used for model validation. In the nonrandom data set, patients were taken from the same hospital (HEP) but from a later period: it comprised patients admitted in November 2020. The external data set was taken from a hospital with a different population coverage (HGEVN). The performance of the models was evaluated with overall accuracy, sensitivity, specificity and the F₁ score (the harmonic mean of precision and sensitivity). According to the TRIPOD statement, this design fulfills the requirements for both type 2b (validation set split by time) and type 3 (evaluation of the prediction model on separate data) analyses.

Results

After curating the databases, a total of 641 patients were included in the model-building data set, 100 in the nonrandom data set and 107 in the external data set. Table 1S of the SM lists the variables included in the model-building database, together with their coding and units. The baseline features of the patients used to build the models are shown in table @ref(tab:tb-blinefeat). It is worth noting that patients hospitalized because of COVID-19 tended to be male, older than 50 years and overweight.

The analyses of the categorical and numerical variables grouped by outcome are shown in tables 2S and 3S, respectively. A p value below 0.01 was used to select the predictors. Table @ref(tab:tb-varfinalmodel) summarizes the variables selected for building the prediction models. For both algorithms the resampling technique was the simple bootstrap with 25 repetitions.

The logistic regression algorithm was trained with no tuning parameters. The features of the final model are presented in table @ref(tab:tb-varglm).

The random forests algorithm was trained varying the number of randomly selected predictors and the number of trees. The values with the best accuracy were used in the final model (figures 1S and 2S of the supplementary material). The impact of the predictors in the final model is shown in table @ref(tab:tb-VarImpRF).

The performance of the logistic regression and random forests models is presented in table @ref(tab:tb-perfboth).

Baseline features of the study patients
Variable	Value	Missing cases
Somatic variables:	Median, (IQR), [range]
Age	55, (44 - 64) [19 - 96]	0
Weight	77, (68 - 86) [39 - 136]	3
Height	1.6, (1.6 - 1.7) [1.4 - 2]	3
BMI	28.8, (25.6 - 32) [17.3 - 49.8]	3
Sociodemographic variables:	No. (%)
Gender		0
Female	241 (37.6)
Male	400 (62.4)
Occupation		1
Outdoor work	40 (6.2)
Health care worker	97 (15.2)
Unemployed	112 (17.5)
Office job	113 (17.7)
Work at home	129 (20.2)
Work in a public area	149 (23.3)
Schooling		34
Illiteracy	40 (6.6)
Primary school	137 (22.6)
Junior high school	120 (19.8)
High school	150 (24.7)
College	147 (24.2)
Postgraduate studies	13 (2.1)
Socioeconomic level		0
Low, Medium-low	449 (70)
Medium-high, High	192 (30)
Initial symptom		3
Cough	122 (19.1)
Fever	166 (26)
Headache	35 (5.5)
Anosmia	2 (0.3)
Malaise	143 (22.4)
Muscular weakness	3 (0.5)
Diarrhea	23 (3.6)
Rash	2 (0.3)
Dyspnea	80 (12.5)
Chest pain	14 (2.2)
Inability to move	2 (0.3)
Sore throat	46 (7.2)

Variables used to create the prediction models
Variable	Missing cases	p^a
Occupation	1	< 0.000001
Age	0	< 0.000001
Respiratory rate	4	< 0.000001
Oxygen saturation	4	< 0.000001
Blood urea nitrogen	9	< 0.000001
Serum creatinine	9	< 0.000001
Blood glucose	9	< 0.000001
White blood cells	9	< 0.000001
Lymphocytes	9	< 0.000001
Neutrophils	9	< 0.000001
Potassium	23	< 0.000001
pH	53	< 0.000001
Arterial oxygen partial pressure	53	< 0.000001
Arterial carbon dioxide partial pressure	53	< 0.000001
Serum lactate dehydrogenase	135	< 0.000001
Schooling	34	< 0.000001
SCORE	7	< 0.000001
Socioeconomic level	0	0.00008
Diabetes mellitus	0	0.00203
D-dimer	29	0.00400
Gender	0	0.00554
Past smoker	0	0.00612
Arterial pressure, diastolic	4	0.00700
Chloride	23	0.00900
^a p value calculated with Student t test for numeric variables and χ² test for categorical variables.

Features of the logistic regression model
Predictor	Overall importance	Estimate	p value
Arterial oxygen partial pressure	100.00	-0.020	0.001
Oxygen saturation	95.76	-0.047	0.001
Serum lactate dehydrogenase	88.27	0.002	0.002
Arterial carbon dioxide partial pressure	83.63	0.045	0.004
Blood urea nitrogen	67.06	0.029	0.019
Arterial pressure, diastolic	59.82	-0.024	0.036
Respiratory rate	58.65	0.054	0.039
Lymphocytes	58.31	-0.001	0.040
SCORE	57.52	0.168	0.043
Socioeconomic level = Medium-high, High	41.14	-0.582	0.141
Age	39.67	0.018	0.154
pH	35.56	-2.035	0.198
Occupation = Work at home	35.21	-0.677	0.202
Past smoker = Yes	32.26	0.543	0.240
Schooling	27.58	-0.135	0.308
Chloride	26.92	-0.025	0.319
Neutrophils	25.17	0.000	0.348
Serum creatinine	23.73	-0.112	0.374
Gender = Male	23.44	-0.298	0.379
White blood cells	19.76	0.000	0.449
Diabetes mellitus = Yes	17.52	0.239	0.496
Occupation = Outdoor work	17.02	0.418	0.506
D-dimer	15.37	0.000	0.543
Blood glucose	14.69	-0.001	0.558
Potassium	13.54	0.115	0.584
Occupation = Work in public area	6.49	0.146	0.756
Socioeconomic level = Medium-high, High	0.85	0.063	0.903
Occupation = Office job	0.00	0.045	0.925
Intercept		18.641	0.144
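To aid the reading of the estimates, the Python sketch below (an illustration, not part of the original analysis) converts three coefficients from the table into odds ratios. It assumes that the estimates are on the log-odds scale, as is standard for a binomial model with a logit link; the function names are hypothetical.

```python
import math

# Estimates copied from the logistic regression table (assumed to be log-odds).
coefficients = {
    "Oxygen saturation": -0.047,
    "Respiratory rate": 0.054,
    "SCORE": 0.168,
}

def odds_ratio(beta, delta=1.0):
    """Multiplicative change in the odds of death for a change of delta units."""
    return math.exp(beta * delta)

def inverse_logit(linear_predictor):
    """Map a linear predictor (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))

for name, beta in coefficients.items():
    print(f"{name}: odds ratio per unit = {odds_ratio(beta):.3f}")

# A 5-point drop in oxygen saturation, holding the other predictors fixed.
or_saturation_drop = odds_ratio(coefficients["Oxygen saturation"], delta=-5)
print(f"{or_saturation_drop:.3f}")
```

A predicted probability for an individual patient would require the full linear predictor, that is, the intercept plus every coefficient multiplied by the patient's value, passed through inverse_logit.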

Impact of the predictors on the random forests model
Variable	Overall^a
Serum lactate dehydrogenase	100.0
D-dimer	94.3
Oxygen saturation	90.2
Arterial oxygen partial pressure	84.9
Neutrophils	68.3
Blood urea nitrogen	65.9
Arterial carbon dioxide partial pressure	65.6
White blood cells	62.3
pH	56.6
Age	49.9
Serum creatinine	49.7
Lymphocytes	48.2
Respiratory rate	41.1
Arterial pressure, diastolic	40.3
Potassium	36.3
Blood glucose	35.5
Schooling	35.5
Chloride	30.2
SCORE	25.4
Gender = Male	3.5
Socioeconomic level = Medium-high, High	3.1
Occupation = Work in public area	2.9
Occupation = Work at home	2.5
Occupation = Unemployed	1.0
Occupation = Office job	0.6
Diabetes mellitus = Yes	0.5
Occupation = Outdoor work	0.2
Past smoker = Yes	0.0
^a Numbers represent the relative overall impact of the variables on the prediction model.

Performance of the logistic regression and random forests models
Metric	Internal data set	Nonrandom data set	External data set
Logistic Regression Model
Accuracy^a	0.83 (0.74 - 0.9)	0.82 (0.73 - 0.89)	0.68 (0.59 - 0.77)
Sensitivity	0.667	0.676	0.697
Specificity	0.908	0.905	0.659
Positive predictive value	0.769	0.806	0.767
Negative predictive value	0.855	0.826	0.574
Random Forests Model
Accuracy^a	0.85 (0.77 - 0.92)	0.82 (0.73 - 0.89)	0.62 (0.52 - 0.71)
Sensitivity	0.633	0.595	0.667
Specificity	0.954	0.952	0.537
Positive predictive value	0.864	0.88	0.698
Negative predictive value	0.849	0.8	0.5
^a Values between parentheses represent the 95% confidence interval.
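The metrics in the table all derive from a 2 × 2 confusion matrix, as the Python sketch below illustrates (not part of the original analysis). The counts used are not reported in the source: they are reconstructed here, as an assumption, as the integers that reproduce the logistic regression column for the internal data set.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute standard metrics from confusion-matrix counts (death = positive class)."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),  # deaths correctly predicted
        "specificity": tn / (tn + fp),  # survivors correctly predicted
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Assumed counts: 30 deaths (20 detected) and 65 survivors (59 detected).
metrics = classification_metrics(tp=20, fn=10, tn=59, fp=6)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the function returns an accuracy of 0.83, sensitivity of 0.667, specificity of 0.908, positive predictive value of 0.769 and negative predictive value of 0.855, matching the reported values.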

Description of the variables in the working database.
Code	Description	Units
sexo	Gender	0 = Female, 1 = Male
ocupacion	Occupation	1 = Health care worker, 2 = Office job, 3 = Outdoor work, 4 = Work in public area, 5 = Work at home, 6 = Unemployed
escolaridad	Schooling	1 = Illiterate, 2 = Primary school, 3 = Junior high school, 4 = High school, 5 = College, 6 = Postgraduate studies
nivsoc	Socioeconomic level	0 = Low, Medium-low, 1 = Medium-high, High
app_0	Hypertension	1 = Yes, 0 = No
app_1	Allergic rhinitis	1 = Yes, 0 = No
app_2	Asthma	1 = Yes, 0 = No
app_3	Conjunctivitis	1 = Yes, 0 = No
app_4	Current smoker	1 = Yes, 0 = No
app_5	Past smoker	1 = Yes, 0 = No
app_6	Use of medication	1 = Yes, 0 = No
app_7	Use of dietary supplements	1 = Yes, 0 = No
app_8	Cardiovascular disease	1 = Yes, 0 = No
app_9	Diabetes mellitus	1 = Yes, 0 = No
app_10	Insulin resistance	1 = Yes, 0 = No
app_11	COPD	1 = Yes, 0 = No
app_12	Renal disease	1 = Yes, 0 = No
app_13	Cancer	1 = Yes, 0 = No
app_14	Lung disease other than COPD	1 = Yes, 0 = No
app_15	AIDS	1 = Yes, 0 = No
app_16	Autoimmune disease	1 = Yes, 0 = No
app_17	Cerebrovascular disease	1 = Yes, 0 = No
app_18	Overweight	1 = Yes, 0 = No
app_19	Obesity	1 = Yes, 0 = No
edad	Age	Years old
peso	Weight	kg
talla	Height	m
imc	Body mass index	kg/m²
temp	Temperature	°C
fc	Heart rate	Beats per minute
fr	Respiratory rate	Breaths per minute
tas	Arterial pressure, systolic	mm Hg
tad	Arterial pressure, diastolic	mm Hg
score	SCORE	Score points (0-16)
ing_disnea	Short of breath at the moment of hospitalization	1 = Yes, 0 = No
sintoma1	Prevalent symptom	1 = Cough, 2 = Fever, 3 = Headache, 4 = Anosmia, 5 = Malaise, 6 = Dizziness, 7 = Muscular weakness, 8 = Diarrhea, 9 = Ageusia, 10 = Rash, 11 = Dyspnea, 12 = Chest pain, 13 = Inability to move, 14 = Sore throat
sato2sin	Oxygen saturation	Percentage of saturated hemoglobin
urea	Blood urea	mg/dL
bun	Blood urea nitrogen	mg/dL
creat	Serum creatinine	mg/dL
colesterol	Serum cholesterol	mg/dL
gluc	Blood glucose	mg/dL
hb	Hemoglobin concentration	g/dL
leucos	White blood cells	number of cells/µL
plaq	Platelets	number of cells/µL
linfos	Lymphocytes	number of cells/µL
monos	Monocytes	number of cells/µL
eos	Eosinophils	number of cells/µL
basof	Basophils	number of cells/µL
neutros	Neutrophils	number of cells/µL
k	Potassium	mEq/L
na	Sodium	mEq/L
cl	Chloride	mEq/L
ca	Calcium	mEq/L
ph	pH	Units of pH
pao2	Arterial oxygen partial pressure	mm Hg
paco2	Arterial carbon dioxide partial pressure	mm Hg
hco3	Arterial bicarbonate	mEq/L
dhl	Serum lactate dehydrogenase	IU/L
alat	Serum Alanine aminotransferase	IU/L
aat	Serum Aspartate aminotransferase	IU/L
dimd	D-dimer	µg/mL
diasretraso	Hospitalization delay	Days
motivoegre	Vital status at the moment of discharge	0 = Alive, 1 = Dead

Categorical variables by outcome
Variable	Value	Total	Dead^a	Alive^a	p^b
Gender
	Female	241	60 (24.9)	181 (75.1)	0.006
	Male	400	143 (35.8)	257 (64.2)
Occupation
	Health care worker	97	16 (16.5)	81 (83.5)	< 0.001
	Office job	113	22 (19.5)	91 (80.5)
	Outdoor work	40	20 (50)	20 (50)
	Work in public area	149	49 (32.9)	100 (67.1)
	Work at home	129	40 (31)	89 (69)
	Unemployed	112	56 (50)	56 (50)
Socioeconomic level
	Low or medium-low	449	164 (36.5)	285 (63.5)	< 0.001
	Medium-high or high	192	39 (20.3)	153 (79.7)
Hypertension
	No	418	121 (28.9)	297 (71.1)	0.052
	Yes	223	82 (36.8)	141 (63.2)
Allergic rhinitis
	No	635	202 (31.8)	433 (68.2)	0.724
	Yes	6	1 (16.7)	5 (83.3)
Insulin resistance
	No	634	198 (31.2)	436 (68.8)	0.062
	Yes	7	5 (71.4)	2 (28.6)
COPD
	No	636	200 (31.4)	436 (68.6)	0.376
	Yes	5	3 (60)	2 (40)
Renal disease
	No	602	185 (30.7)	417 (69.3)	0.067
	Yes	39	18 (46.2)	21 (53.8)
Cancer
	No	622	194 (31.2)	428 (68.8)	0.214
	Yes	19	9 (47.4)	10 (52.6)
Lung disease other than COPD
	No	636	202 (31.8)	434 (68.2)	0.936
	Yes	5	1 (20)	4 (80)
Autoimmune disease
	No	619	192 (31)	427 (69)	0.099
	Yes	22	11 (50)	11 (50)
Cerebrovascular disease
	No	629	199 (31.6)	430 (68.4)	1
	Yes	12	4 (33.3)	8 (66.7)
Overweight
	No	393	116 (29.5)	277 (70.5)	0.165
	Yes	248	87 (35.1)	161 (64.9)
Obesity
	No	385	123 (31.9)	262 (68.1)	0.921
	Yes	256	80 (31.2)	176 (68.8)
Asthma
	No	634	201 (31.7)	433 (68.3)	1
	Yes	7	2 (28.6)	5 (71.4)
Current smoker
	No	573	177 (30.9)	396 (69.1)	0.274
	Yes	68	26 (38.2)	42 (61.8)
Past smoker
	No	596	180 (30.2)	416 (69.8)	0.006
	Yes	45	23 (51.1)	22 (48.9)
Use of medication
	No	449	135 (30.1)	314 (69.9)	0.215
	Yes	192	68 (35.4)	124 (64.6)
Use of dietary supplements
	No	633	200 (31.6)	433 (68.4)	1
	Yes	8	3 (37.5)	5 (62.5)
Cardiovascular disease
	No	622	195 (31.4)	427 (68.6)	0.458
	Yes	19	8 (42.1)	11 (57.9)
Diabetes mellitus
	No	509	146 (28.7)	363 (71.3)	0.002
	Yes	132	57 (43.2)	75 (56.8)
Short of breath
	No	49	8 (16.3)	41 (83.7)	0.025
	Yes	592	195 (32.9)	397 (67.1)
^a Values between parentheses represent the percentage for the corresponding category.
^b p value calculated with χ² test.

Exploratory data analysis of numerical variables
Variable	Valid cases	Status	Mean ± SD	p^a
D-dimer	425	Alive	841.7 ± 2271.01	0.004
	187	Dead	2297.73 ± 6738.31	0.004
Arterial pressure, diastolic	437	Alive	76.17 ± 10.81	0.007
	200	Dead	73.34 ± 12.81	0.007
Chloride	424	Alive	103.74 ± 5.46	0.009
	194	Dead	102.47 ± 5.67	0.009
Serum Aspartate aminotransferase	408	Alive	48.48 ± 37.92	0.01
	185	Dead	73.33 ± 127.89	0.01
Sodium	424	Alive	136.67 ± 4.21	0.02
	194	Dead	135.62 ± 5.54	0.02
Heart rate	437	Alive	97.69 ± 17.57	0.032
	200	Dead	101.21 ± 19.89	0.032
Blood urea	436	Alive	29 ± 24.87	0.032
	196	Dead	66.57 ± 243.04	0.032
Monocytes	435	Alive	558.24 ± 520.18	0.058
	196	Dead	646.1 ± 544.48	0.058
Basophils	435	Alive	47.15 ± 96.2	0.068
	196	Dead	65.49 ± 124.59	0.068
Days of delay	438	Alive	8.13 ± 5.01	0.131
	203	Dead	7.51 ± 4.76	0.131
Hemoglobin concentration	436	Alive	14.4 ± 1.89	0.151
	196	Dead	14.14 ± 2.23	0.151
Height	438	Alive	1.64 ± 0.1	0.312
	200	Dead	1.63 ± 0.09	0.312
Arterial pressure, systolic	437	Alive	123.8 ± 19.36	0.316
	199	Dead	125.86 ± 25.8	0.316
Body mass index	438	Alive	29.13 ± 5.13	0.323
	200	Dead	29.57 ± 5.22	0.323
Arterial bicarbonate	395	Alive	18.01 ± 3.84	0.422
	193	Dead	17.68 ± 5.05	0.422
Serum Alanine aminotransferase	406	Alive	52.04 ± 40.41	0.523
	179	Dead	56.06 ± 79.66	0.523
Calcium	33	Alive	7.25 ± 2.41	0.572
	12	Dead	7.51 ± 0.52	0.572
Serum cholesterol	436	Alive	141.87 ± 52.71	0.593
	201	Dead	139.68 ± 45.73	0.593
Weight	438	Alive	78 ± 15.48	0.662
	200	Dead	78.6 ± 16.26	0.662
Platelets	436	Alive	273077.75 ± 111349.37	0.685
	196	Dead	268900.51 ± 123348.71	0.685
Temperature	437	Alive	37.06 ± 0.82	0.774
	200	Dead	37.09 ± 0.93	0.774
Blood urea nitrogen	436	Alive	13.72 ± 12.03	< 0.001
	196	Dead	23 ± 19.35	< 0.001
Serum creatinine	436	Alive	1.01 ± 1.1	< 0.001
	196	Dead	1.57 ± 1.87	< 0.001
Serum lactate dehydrogenase	355	Alive	371.35 ± 200.29	< 0.001
	151	Dead	565.43 ± 250.71	< 0.001
Age	438	Alive	51.55 ± 13.87	< 0.001
	203	Dead	60.31 ± 13.17	< 0.001
Schooling	411	Alive	3.64 ± 1.25	< 0.001
	196	Dead	3.02 ± 1.34	< 0.001
Respiratory rate	437	Alive	24.46 ± 4.47	< 0.001
	200	Dead	26.48 ± 5.42	< 0.001
Blood glucose	436	Alive	142.09 ± 82.07	< 0.001
	196	Dead	173.9 ± 101.57	< 0.001
Potassium	424	Alive	4.01 ± 0.55	< 0.001
	194	Dead	4.34 ± 0.84	< 0.001
White blood cells	436	Alive	9088.81 ± 4586.03	< 0.001
	196	Dead	12463.13 ± 6413.31	< 0.001
Lymphocytes	436	Alive	1021.38 ± 510.79	< 0.001
	196	Dead	836.01 ± 463.22	< 0.001
Neutrophils	436	Alive	7439.26 ± 4360.39	< 0.001
	196	Dead	10593.56 ± 5988.25	< 0.001
Arterial carbon dioxide partial pressure	395	Alive	27.28 ± 7.33	< 0.001
	193	Dead	32.24 ± 15.14	< 0.001
Arterial oxygen partial pressure	395	Alive	74.37 ± 29.15	< 0.001
	193	Dead	58.2 ± 22.73	< 0.001
pH	395	Alive	7.43 ± 0.07	< 0.001
	193	Dead	7.36 ± 0.15	< 0.001
Oxygen saturation	437	Alive	89.64 ± 6.4	< 0.001
	200	Dead	83.01 ± 12.11	< 0.001
Severity score	435	Alive	1.24 ± 1.75	< 0.001
	199	Dead	2.56 ± 2.9	< 0.001
^a p value calculated with Student t test, except for schooling, number of comorbidities and severity score, where the Mann-Whitney U test was used.

Random forests model. Effect of the number of randomly selected predictors on accuracy.

Effect of the number of trees on the stabilization of the out-of-bag error.

1. Secretaria de Salud. Información referente a casos COVID-19 en México [Internet]. 2022. Available from: https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico

2. Mariano Sanchez Talanquer. La letalidad hospitalaria por covid-19 en México: desigualdades institucionales [Internet]. 2020. Available from: https://datos.nexos.com.mx/?p=1625

3. Adam L. Booth, Elizabeth Abels, Peter McCaffrey. Development of a prognostic model for mortality in COVID-19 infection using machine learning. Modern Pathology. 2020 Oct 16;1–10. doi:10.1038/s41379-020-00700-x

4. Gary S. Collins, Johannes B. Reitsma, Douglas G. Altman, Karel G. M. Moons. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. Journal of Clinical Epidemiology. 2015 Feb 1;68(2):112–21. doi:10.1016/j.jclinepi.2014.11.010

5. Rafael A. Irizarry. Introduction to Data Science: Data Analysis and Prediction Algorithms with R [Internet]. First Edition. London: Chapman & Hall/CRC; 2020. Available from: https://www.routledge.com/Introduction-to-Data-Science-Data-Analysis-and-Prediction-Algorithms-with/Irizarry/p/book/9780367357986

6. Andreas Ziegler, Inke R. König. Mining data with random forests: current options for real-world applications. WIREs Data Mining and Knowledge Discovery. 2014;4(1):55–63. doi:10.1002/widm.1114

7. Leo Breiman. Random Forests. Machine Learning. 2001 Oct 1;45(1):5–32. doi:10.1023/A:1010933404324

8. R. M. Conroy, K. Pyörälä, A. P. Fitzgerald, S. Sans, A. Menotti, G. De Backer, D. De Bacquer, P. Ducimetière, P. Jousilahti, U. Keil, I. Njølstad, R. G. Oganov, T. Thomsen, H. Tunstall-Pedoe, A. Tverdal, H. Wedel, P. Whincup, L. Wilhelmsen, I. M. Graham, on behalf of the SCORE project group. Estimation of ten-year risk of fatal cardiovascular disease in Europe: The SCORE project. European Heart Journal. 2003 Jun 1;24(11):987–1003. doi:10.1016/S0195-668X(03)00114-3

9. Mathijs O. Versteylen, Ivo A. Joosen, Leslee J. Shaw, Jagat Narula, Leonard Hofstra. Comparison of Framingham, PROCAM, SCORE, and Diamond Forrester to predict coronary atherosclerosis and cardiovascular events. Journal of Nuclear Cardiology. 2011 Jul 19;18(5):904. doi:10.1007/s12350-011-9425-5

10. R Core Team. R: A language and environment for statistical computing [Internet]. Vienna, Austria: R Foundation for Statistical Computing; 2020. Available from: https://www.R-project.org/

11. RStudio Team. RStudio: Integrated development environment for R [Internet]. Boston, MA: RStudio, PBC; 2021. Available from: http://www.rstudio.com/

12.