Introduction

The COVID-19 pandemic has challenged health systems all over the world. In Mexico, as of April 2022, the health authorities reported 5,671,144 confirmed patients and 323,318 deaths (overall mortality rate of 5.7%). A total of 672,987 patients needed hospitalization and 302,285 of them died, meaning that the in-hospital mortality rate in Mexico is 44.9% (1). In-hospital mortality rates of COVID-19 varied widely between Mexican hospitals and institutions. The Mexican Institute of Social Security (IMSS, after its initials in Spanish) carries the biggest burden of public health care and has the highest mortality among patients hospitalized with COVID-19 (2). Prediction models for COVID-19 can help guide evidence-based clinical decision making, but because the mortality of hospitalized patients with this disease varies so widely by country and context, it is imperative to develop prognostic prediction models tailored to the local reality.

Some previous studies have evaluated laboratory, radiological and clinical data to develop prognostic prediction models with limited success, and some advanced diagnostic tests may not be available in low- and middle-income countries (LMIC) (3). The prognostic models or scores developed in Mexico to date have used an epidemiologic data set of all patients with COVID-19 (outpatients and inpatients), have focused on predicting ICU admission, or, for hospitalized patients, have relied on the integration of AI with chest tomography, which is not always available for all patients in public sector hospitals. Ours is the first study to specifically examine prognostic prediction models for in-hospital mortality among patients hospitalized in public sector hospitals in the state of Puebla, Mexico. Following the recommendations of the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) statement (4), we aimed to develop and validate a prognostic prediction model using sociodemographic, clinical and laboratory data that can be easily obtained and are part of usual clinical management in Mexico. This work also highlights the importance of some sociodemographic variables for the outcome of patients hospitalized with COVID-19.

Once the relevant data inputs for the model are identified, it is also essential to develop a rigorous statistical approach to test the prognostic prediction model. In machine learning, decisions are made through models built with data (5). For binary outcomes, logistic regression has been used extensively; it can be regarded as the simplest prediction model when the outcome is categorical. More recently, more complex models have emerged. Random forests is an algorithm that can be considered an extension of classification and regression trees (CART). The CART algorithm is popular because it is simple to interpret and run, but it is unstable, meaning that small changes in the data may lead to large changes in the results. This problem can be mitigated by growing many trees from the same data set, which is what the random forests algorithm does (6). The goal is to improve prediction performance by building a number of individual trees that differ randomly from one another through bootstrapping (7).
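The idea of growing many trees on bootstrap samples and letting them vote can be illustrated with a minimal sketch. It is not the random forests algorithm used in this study: each "tree" is a single-split stump on one predictor, the random selection of predictors at each split is omitted, and the toy data and all function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a one-split tree: pick the threshold and labels with fewest training errors."""
    best = None
    for threshold in sorted(set(xs)):
        for below, above in ((0, 1), (1, 0)):
            errors = sum((below if x <= threshold else above) != y
                         for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    return best[1:]  # (threshold, label_below, label_above)

def predict_stump(stump, x):
    threshold, below, above = stump
    return below if x <= threshold else above

def fit_bagged_stumps(xs, ys, n_trees=25, seed=1):
    """Fit each stump on a bootstrap sample (drawn with replacement) of the data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Majority vote over the individual stumps."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: one predictor, outcome 1 for the larger values.
xs = list(range(10))
ys = [0] * 5 + [1] * 5
forest = fit_bagged_stumps(xs, ys)
print(predict_forest(forest, 0), predict_forest(forest, 9))
```

Because every stump sees a slightly different resample, an unusual split learned from one resample is outvoted by the others, which is the stabilizing effect described above.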

Methods

Subjects

This is a retrospective cohort study conducted at the Hospital de Especialidades de Puebla, IMSS (HEP), and the Hospital General Dr. Eduardo Vazquez Navarro (HGEVN), Mexico. The first site is a 315-bed multi-specialty hospital and a referral center for workers from the states of Puebla, Tlaxcala and Oaxaca. In March 2020 it was converted to a hybrid COVID-19 hospital. The second site is a ###-bed general hospital that serves the general population, mostly people below the poverty line; it was also converted to a COVID-19 hospital. Data for developing the model were collected from the charts of patients admitted to the HEP between April 3 and October 17, 2020, with a clinical diagnosis of COVID-19 and a positive SARS-CoV-2 RT-PCR nasopharyngeal swab. This set is referred to as the model-building data set. Patients transferred to other facilities were not included. For validation of the model, data from an independent set of 100 patients from the IMSS hospital were collected prospectively during November 2020 (nonrandom data set). In addition, the model was tested with data from patients of the HGEVN, which belongs to the Secretaria de Salud del Estado de Puebla (external data set).

Data were recorded at admission, and vital status (alive or dead) was recorded at discharge. The data gathered included sociodemographic variables, baseline comorbidities, medications, vital signs on admission and initial laboratory tests (table 1S of the supplementary material (SM)). In addition, the Systematic COronary Risk Evaluation (SCORE) index was calculated, which gives the 10-year risk of developing fatal cardiovascular disease. This score is endorsed by the European Society of Cardiology for high-risk patients and was used because of the high prevalence of cardiovascular risk factors in the Mexican population. The score considers age (from 40 years), gender, systolic blood pressure, smoking and total cholesterol level (8,9). Socioeconomic level was classified as high, medium-high, medium-low or low. For this, an ad hoc scale was used to profile the household based on educational level, current job position and the company where the patient works.

Ethics

This study was approved as minimal-risk research by the institutional review boards of both the HEP (No. R-2020-785-126) and the HGEVN (###), and informed consent was waived. To preserve confidentiality, data were deidentified.

Statistical analysis and model building strategy

The statistical analysis was performed with R version 4.1.1 (10) and RStudio version 2021.09.0 (11). The logistic regression and random forests models were built using the caret package with the glm and rf methods, respectively. The tidyverse and ggplot2 packages were used for exploratory data analysis and descriptive statistics.

Sample size. The criteria proposed by Riley et al. for developing a multivariable prediction model were used (12). Using the pmsampsize package, a sample size of 512 patients was considered adequate for the model-building data set. The parameters were set as follows: a Nagelkerke R² of 0.4 was deemed appropriate because direct measures were used, the overall outcome proportion (mortality rate) was assumed to be 0.4, and 25 candidate predictor parameters were anticipated for potential inclusion in the model.

Preprocessing. Once the database was curated, a bivariate analysis was performed using the χ² test, the Mann-Whitney U test or Student's t test for categorical, ordinal or numerical variables, respectively. Features with p < 0.01 were considered candidate predictors. Variables with near-zero variance were excluded. Single imputation with the median was used to handle missing values.
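Single imputation with the median can be sketched in a few lines. This is only an illustration: the function name and the values are hypothetical, missing values are represented as `None`, and the analysis itself was carried out in R.

```python
from statistics import median

def impute_median(values):
    """Replace missing entries (None) with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical laboratory values with two missing measurements.
ldh = [371.0, None, 565.0, 410.0, None, 298.0]
ldh_imputed = impute_median(ldh)
print(ldh_imputed)
```

In a workflow with separate training and test sets, the fill value would normally be computed on the training data and then reused for the other sets; the text does not state how this was handled here.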

Resampling. When building a prediction model, because the data are a random sample, the estimated parameters and the results obtained are also random variables: if the experiment is repeated, the results will be different. According to the law of large numbers, the average of the results tends to converge as the number of repetitions grows. Thus, one of the most important concepts in machine learning is the use of a resampling technique to stabilize the results. Bootstrapping was used as the resampling method.
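A bootstrap resample draws as many rows as the data set has, with replacement, so some rows appear several times and others not at all. The sketch below assumes the usual convention in which a model is fitted on the drawn rows and evaluated on the rows left out ("out of bag"). The training-set size of 546 and the seed are assumptions made for illustration; 25 repetitions is the number reported in the Results.

```python
import random

def bootstrap_split(n, rng):
    """Draw n row indices with replacement; rows never drawn are 'out of bag'."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag

rng = random.Random(2020)  # arbitrary seed, for reproducibility only
n_rows = 546               # assumed training-set size (about 85% of 641)
n_resamples = 25           # bootstrap repetitions

oob_fractions = []
for _ in range(n_resamples):
    in_bag, out_of_bag = bootstrap_split(n_rows, rng)
    # A model would be fitted on in_bag and scored on out_of_bag here.
    oob_fractions.append(len(out_of_bag) / n_rows)

print(round(sum(oob_fractions) / n_resamples, 2))
```

On average a little over a third of the rows are left out of each resample, so every repetition provides a held-out set on which accuracy can be estimated; averaging over the repetitions gives the stabilized estimate.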

Internal validation. The data set was partitioned into training and test sets. The training set comprised 85% of the data. During model building, accuracy was used as the performance metric.
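The partition can be sketched as follows. The sketch assumes that the split was stratified by outcome and that the training share was rounded up within each class; neither detail is stated in the text. The class sizes (203 deaths and 438 survivors among the 641 patients) are taken from the supplementary tables, and the function name and seed are hypothetical.

```python
import math
import random

def stratified_split(labels, train_fraction=0.85, seed=2020):
    """Split row indices into train/test, keeping the outcome proportion in both."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        k = math.ceil(len(indices) * train_fraction)  # assumed rounding rule
        train.extend(indices[:k])
        test.extend(indices[k:])
    return sorted(train), sorted(test)

# 1 = dead, 0 = alive in the model-building data set.
labels = [1] * 203 + [0] * 438
train, test = stratified_split(labels)
print(len(train), len(test), sum(labels[i] for i in test))
```

Stratifying keeps the mortality rate similar in both partitions, which matters when accuracy is the tuning metric and the outcome classes are unbalanced.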

External validation. Two separate data sets were used for model validation. The nonrandom data set comprised patients from the same hospital (HEP) admitted in a later period, November 2020. The external data set was taken from a hospital with a different population coverage (HGEVN). The performance of the models was evaluated with overall accuracy, sensitivity, specificity and the F1 score (the harmonic mean of positive predictive value and sensitivity). According to the TRIPOD statement, this design fulfills the requirements for both type 2b (validation set split by time) and type 3 (evaluation of the prediction model on separate data) analyses.
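The performance metrics all derive from the four cells of a confusion matrix. The sketch below treats death as the positive class, which is an assumption, and computes F1 in its conventional form, as the harmonic mean of positive predictive value and sensitivity. The counts are hypothetical: the confusion matrices are not reported, and these values were chosen only because they reproduce the metrics listed for the logistic regression model on the internal data set.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # deaths correctly predicted
    specificity = tn / (tn + fp)   # survivors correctly predicted
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }

# Hypothetical counts (not taken from the study records).
metrics = classification_metrics(tp=20, fp=6, tn=59, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting sensitivity and specificity alongside accuracy is useful here because a model can reach high accuracy on an unbalanced sample while still missing a large share of the deaths.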

Results

After curating the databases, a total of 641 patients were included in the model-building data set, 100 in the nonrandom data set and 107 in the external data set. Table 1S of the SM lists the variables included in the model-building database, together with their coding and units. The baseline features of the patients used to build the models are shown in table 1. It is worth noting that patients hospitalized because of COVID-19 tended to be male, older than 50 years and overweight.

The analyses of the categorical and numerical variables grouped by outcome are shown in tables 2S and 3S, respectively. A p value below 0.01 was used to select the predictors. Table 2 summarizes the variables selected for building the prediction models. For both algorithms the resampling technique was the simple bootstrap with 25 repetitions.

The logistic regression algorithm was trained with no tuning parameters. The features of the final model are presented in table 3.

The random forests algorithm was trained varying the number of randomly selected predictors and the number of trees. The values with the best accuracy were used in the final model (figures 1S and 2S of the supplementary material). The impact of the predictors in the final model is shown in table 4.

The performance of the logistic regression and random forests models is presented in table 5.

Table 1. Baseline features of the study patients

| Variable | Value | Missing cases |
| --- | --- | --- |
| Somatic variables | Median (IQR) [range] | |
| Age (years) | 55 (44 - 64) [19 - 96] | 0 |
| Weight (kg) | 77 (68 - 86) [39 - 136] | 3 |
| Height (m) | 1.6 (1.6 - 1.7) [1.4 - 2] | 3 |
| BMI (kg/m²) | 28.8 (25.6 - 32) [17.3 - 49.8] | 3 |
| Sociodemographic variables | No. (%) | |
| Gender | | 0 |
| - Female | 241 (37.6) | |
| - Male | 400 (62.4) | |
| Occupation | | 1 |
| - Outdoor work | 40 (6.2) | |
| - Health care worker | 97 (15.2) | |
| - Unemployed | 112 (17.5) | |
| - Office job | 113 (17.7) | |
| - Work at home | 129 (20.2) | |
| - Work in a public area | 149 (23.3) | |
| Schooling | | 34 |
| - Illiteracy | 40 (6.6) | |
| - Primary school | 137 (22.6) | |
| - Junior high school | 120 (19.8) | |
| - High school | 150 (24.7) | |
| - College | 147 (24.2) | |
| - Postgraduate studies | 13 (2.1) | |
| Socioeconomic level | | 0 |
| - Low, Medium-low | 449 (70) | |
| - Medium-high, High | 192 (30) | |
| Initial symptom | | 3 |
| - Cough | 122 (19.1) | |
| - Fever | 166 (26) | |
| - Headache | 35 (5.5) | |
| - Anosmia | 2 (0.3) | |
| - Malaise | 143 (22.4) | |
| - Muscular weakness | 3 (0.5) | |
| - Diarrhea | 23 (3.6) | |
| - Rash | 2 (0.3) | |
| - Dyspnea | 80 (12.5) | |
| - Chest pain | 14 (2.2) | |
| - Incapability to move | 2 (0.3) | |
| - Sore throat | 46 (7.2) | |

Table 2. Variables used to create the prediction models

| Variable | Missing cases | p^a |
| --- | --- | --- |
| Occupation | 1 | < 0.000001 |
| Age | 0 | < 0.000001 |
| Respiratory rate | 4 | < 0.000001 |
| Oxygen saturation | 4 | < 0.000001 |
| Blood urea nitrogen | 9 | < 0.000001 |
| Serum creatinine | 9 | < 0.000001 |
| Blood glucose | 9 | < 0.000001 |
| White blood cells | 9 | < 0.000001 |
| Lymphocytes | 9 | < 0.000001 |
| Neutrophils | 9 | < 0.000001 |
| Potassium | 23 | < 0.000001 |
| pH | 53 | < 0.000001 |
| Arterial oxygen partial pressure | 53 | < 0.000001 |
| Arterial carbon dioxide partial pressure | 53 | < 0.000001 |
| Serum lactate dehydrogenase | 135 | < 0.000001 |
| Schooling | 34 | < 0.000001 |
| SCORE | 7 | < 0.000001 |
| Socioeconomic level | 0 | 0.00008 |
| Diabetes mellitus | 0 | 0.00203 |
| D-dimer | 29 | 0.00400 |
| Gender | 0 | 0.00554 |
| Past smoker | 0 | 0.00612 |
| Arterial pressure, diastolic | 4 | 0.00700 |
| Chloride | 23 | 0.00900 |

^a p value calculated with Student's t test for numeric variables and the χ² test for categorical variables.

Table 3. Features of the logistic regression model

| Predictor | Overall importance | Estimate | p value |
| --- | --- | --- | --- |
| Arterial oxygen partial pressure | 100.00 | -0.020 | 0.001 |
| Oxygen saturation | 95.76 | -0.047 | 0.001 |
| Serum lactate dehydrogenase | 88.27 | 0.002 | 0.002 |
| Arterial carbon dioxide partial pressure | 83.63 | 0.045 | 0.004 |
| Blood urea nitrogen | 67.06 | 0.029 | 0.019 |
| Arterial pressure, diastolic | 59.82 | -0.024 | 0.036 |
| Respiratory rate | 58.65 | 0.054 | 0.039 |
| Lymphocytes | 58.31 | -0.001 | 0.040 |
| SCORE | 57.52 | 0.168 | 0.043 |
| Socioeconomic level = Medium-high, High | 41.14 | -0.582 | 0.141 |
| Age | 39.67 | 0.018 | 0.154 |
| pH | 35.56 | -2.035 | 0.198 |
| Occupation = Work at home | 35.21 | -0.677 | 0.202 |
| Past smoker = Yes | 32.26 | 0.543 | 0.240 |
| Schooling | 27.58 | -0.135 | 0.308 |
| Chloride | 26.92 | -0.025 | 0.319 |
| Neutrophils | 25.17 | 0.000 | 0.348 |
| Serum creatinine | 23.73 | -0.112 | 0.374 |
| Gender = Male | 23.44 | -0.298 | 0.379 |
| White blood cells | 19.76 | 0.000 | 0.449 |
| Diabetes mellitus = Yes | 17.52 | 0.239 | 0.496 |
| Occupation = Outdoor work | 17.02 | 0.418 | 0.506 |
| D-dimer | 15.37 | 0.000 | 0.543 |
| Blood glucose | 14.69 | -0.001 | 0.558 |
| Potassium | 13.54 | 0.115 | 0.584 |
| Occupation = Work in public area | 6.49 | 0.146 | 0.756 |
| Socioeconomic level = Medium-high, High | 0.85 | 0.063 | 0.903 |
| Occupation = Office job | 0.00 | 0.045 | 0.925 |
| Intercept | | 18.641 | 0.144 |
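The estimates of a logistic regression model are easier to read once converted to odds ratios. The sketch below assumes that the estimates are log-odds coefficients per one-unit increase of each predictor, as is standard for a binomial glm with a logit link; the three coefficients are copied from the logistic regression table and the variable names are illustrative.

```python
import math

# Coefficients (log-odds per unit increase) copied from the logistic regression table.
estimates = {
    "Oxygen saturation": -0.047,
    "Respiratory rate": 0.054,
    "SCORE": 0.168,
}

# The odds ratio is the exponential of the coefficient.
odds_ratios = {name: math.exp(beta) for name, beta in estimates.items()}

for name, odds_ratio in odds_ratios.items():
    print(f"{name}: odds ratio {odds_ratio:.3f}")
```

Under that assumption, each additional SCORE point multiplies the odds of in-hospital death by about 1.18 and each additional percentage point of oxygen saturation multiplies them by about 0.95, holding the other predictors fixed.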

Table 4. Impact of the predictors on the random forests model

| Variable | Overall^a |
| --- | --- |
| Serum lactate dehydrogenase | 100.0 |
| D-dimer | 94.3 |
| Oxygen saturation | 90.2 |
| Arterial oxygen partial pressure | 84.9 |
| Neutrophils | 68.3 |
| Blood urea nitrogen | 65.9 |
| Arterial carbon dioxide partial pressure | 65.6 |
| White blood cells | 62.3 |
| pH | 56.6 |
| Age | 49.9 |
| Serum creatinine | 49.7 |
| Lymphocytes | 48.2 |
| Respiratory rate | 41.1 |
| Arterial pressure, diastolic | 40.3 |
| Potassium | 36.3 |
| Blood glucose | 35.5 |
| Schooling | 35.5 |
| Chloride | 30.2 |
| SCORE | 25.4 |
| Sex = Male | 3.5 |
| Socioeconomic level = Medium-high, High | 3.1 |
| Occupation = Work in public area | 2.9 |
| Occupation = Work at home | 2.5 |
| Occupation = Unemployed | 1.0 |
| Occupation = Office job | 0.6 |
| Diabetes mellitus = Yes | 0.5 |
| Occupation = Outdoor work | 0.2 |
| Past smoker = Yes | 0.0 |

^a Numbers represent the relative overall impact of the variables on the prediction model.

Table 5. Performance of the logistic regression and random forests models

| Metric | Internal data set | Nonrandom data set | External data set |
| --- | --- | --- | --- |
| Logistic regression model | | | |
| Accuracy^a | 0.83 (0.74 - 0.90) | 0.82 (0.73 - 0.89) | 0.68 (0.59 - 0.77) |
| Sensitivity | 0.667 | 0.676 | 0.697 |
| Specificity | 0.908 | 0.905 | 0.659 |
| Positive predictive value | 0.769 | 0.806 | 0.767 |
| Negative predictive value | 0.855 | 0.826 | 0.574 |
| Random forests model | | | |
| Accuracy^a | 0.85 (0.77 - 0.92) | 0.82 (0.73 - 0.89) | 0.62 (0.52 - 0.71) |
| Sensitivity | 0.633 | 0.595 | 0.667 |
| Specificity | 0.954 | 0.952 | 0.537 |
| Positive predictive value | 0.864 | 0.88 | 0.698 |
| Negative predictive value | 0.849 | 0.8 | 0.5 |

^a Values between parentheses represent the 95% confidence interval.

Table 1S. Description of the variables in the working database

| Code | Description | Units or coding |
| --- | --- | --- |
| sexo | Gender | 0 = Female, 1 = Male |
| ocupacion | Occupation | 1 = Health care worker, 2 = Office job, 3 = Outdoor work, 4 = Work in public area, 5 = Work at home, 6 = Unemployed |
| escolaridad | Schooling | 1 = Illiterate, 2 = Primary school, 3 = Junior high school, 4 = High school, 5 = College, 6 = Postgraduate studies |
| nivsoc | Socioeconomic level | 0 = Low or medium-low, 1 = Medium-high or high |
| app_0 | Hypertension | 1 = Yes, 0 = No |
| app_1 | Allergic rhinitis | 1 = Yes, 0 = No |
| app_2 | Asthma | 1 = Yes, 0 = No |
| app_3 | Conjunctivitis | 1 = Yes, 0 = No |
| app_4 | Current smoker | 1 = Yes, 0 = No |
| app_5 | Past smoker | 1 = Yes, 0 = No |
| app_6 | Use of medication | 1 = Yes, 0 = No |
| app_7 | Use of dietary supplements | 1 = Yes, 0 = No |
| app_8 | Cardiovascular disease | 1 = Yes, 0 = No |
| app_9 | Diabetes mellitus | 1 = Yes, 0 = No |
| app_10 | Insulin resistance | 1 = Yes, 0 = No |
| app_11 | COPD | 1 = Yes, 0 = No |
| app_12 | Renal disease | 1 = Yes, 0 = No |
| app_13 | Cancer | 1 = Yes, 0 = No |
| app_14 | Lung disease other than COPD | 1 = Yes, 0 = No |
| app_15 | AIDS | 1 = Yes, 0 = No |
| app_16 | Autoimmune disease | 1 = Yes, 0 = No |
| app_17 | Cerebrovascular disease | 1 = Yes, 0 = No |
| app_18 | Overweight | 1 = Yes, 0 = No |
| app_19 | Obesity | 1 = Yes, 0 = No |
| edad | Age | Years |
| peso | Weight | kg |
| talla | Height | m |
| imc | Body mass index | kg/m² |
| temp | Temperature | °C |
| fc | Heart rate | Beats per minute |
| fr | Respiratory rate | Breaths per minute |
| tas | Arterial pressure, systolic | mm Hg |
| tad | Arterial pressure, diastolic | mm Hg |
| score | SCORE | Score points (0-16) |
| ing_disnea | Shortness of breath at the moment of hospitalization | 1 = Yes, 0 = No |
| sintoma1 | Prevalent symptom | 1 = Cough, 2 = Fever, 3 = Headache, 4 = Anosmia, 5 = Malaise, 6 = Dizziness, 7 = Muscular weakness, 8 = Diarrhea, 9 = Ageusia, 10 = Rash, 11 = Dyspnea, 12 = Chest pain, 13 = Incapability to move, 14 = Sore throat |
| sato2sin | Oxygen saturation | Percentage of saturated hemoglobin |
| urea | Blood urea | mg/dL |
| bun | Blood urea nitrogen | mg/dL |
| creat | Serum creatinine | mg/dL |
| colesterol | Serum cholesterol | mg/dL |
| gluc | Blood glucose | mg/dL |
| hb | Hemoglobin concentration | g/dL |
| leucos | White blood cells | Cells/µL |
| plaq | Platelets | Cells/µL |
| linfos | Lymphocytes | Cells/µL |
| monos | Monocytes | Cells/µL |
| eos | Eosinophils | Cells/µL |
| basof | Basophils | Cells/µL |
| neutros | Neutrophils | Cells/µL |
| k | Potassium | mEq/L |
| na | Sodium | mEq/L |
| cl | Chloride | mEq/L |
| ca | Calcium | mEq/L |
| ph | pH | pH units |
| pao2 | Arterial oxygen partial pressure | mm Hg |
| paco2 | Arterial carbon dioxide partial pressure | mm Hg |
| hco3 | Arterial bicarbonate | mEq/L |
| dhl | Serum lactate dehydrogenase | IU/L |
| alat | Serum alanine aminotransferase | IU/L |
| aat | Serum aspartate aminotransferase | IU/L |
| dimd | D-dimer | µg/mL |
| diasretraso | Hospitalization delay | Days |
| motivoegre | Vital status at the moment of discharge | 0 = Alive, 1 = Dead |

Table 2S. Categorical variables by outcome

| Variable | Value | Total | Dead^a | Alive^a | p^b |
| --- | --- | --- | --- | --- | --- |
| Gender | Female | 241 | 60 (24.9) | 181 (75.1) | 0.006 |
| | Male | 400 | 143 (35.8) | 257 (64.2) | |
| Occupation | Health care worker | 97 | 16 (16.5) | 81 (83.5) | < 0.001 |
| | Office job | 113 | 22 (19.5) | 91 (80.5) | |
| | Outdoor work | 40 | 20 (50) | 20 (50) | |
| | Work in public area | 149 | 49 (32.9) | 100 (67.1) | |
| | Work at home | 129 | 40 (31) | 89 (69) | |
| | Unemployed | 112 | 56 (50) | 56 (50) | |
| Socioeconomic level | Low or medium-low | 449 | 164 (36.5) | 285 (63.5) | < 0.001 |
| | Medium-high or high | 192 | 39 (20.3) | 153 (79.7) | |
| Hypertension | No | 418 | 121 (28.9) | 297 (71.1) | 0.052 |
| | Yes | 223 | 82 (36.8) | 141 (63.2) | |
| Allergic rhinitis | No | 635 | 202 (31.8) | 433 (68.2) | 0.724 |
| | Yes | 6 | 1 (16.7) | 5 (83.3) | |
| Insulin resistance | No | 634 | 198 (31.2) | 436 (68.8) | 0.062 |
| | Yes | 7 | 5 (71.4) | 2 (28.6) | |
| COPD | No | 636 | 200 (31.4) | 436 (68.6) | 0.376 |
| | Yes | 5 | 3 (60) | 2 (40) | |
| Renal disease | No | 602 | 185 (30.7) | 417 (69.3) | 0.067 |
| | Yes | 39 | 18 (46.2) | 21 (53.8) | |
| Cancer | No | 622 | 194 (31.2) | 428 (68.8) | 0.214 |
| | Yes | 19 | 9 (47.4) | 10 (52.6) | |
| Lung disease other than COPD | No | 636 | 202 (31.8) | 434 (68.2) | 0.936 |
| | Yes | 5 | 1 (20) | 4 (80) | |
| Autoimmune disease | No | 619 | 192 (31) | 427 (69) | 0.099 |
| | Yes | 22 | 11 (50) | 11 (50) | |
| Cerebrovascular disease | No | 629 | 199 (31.6) | 430 (68.4) | 1 |
| | Yes | 12 | 4 (33.3) | 8 (66.7) | |
| Overweight | No | 393 | 116 (29.5) | 277 (70.5) | 0.165 |
| | Yes | 248 | 87 (35.1) | 161 (64.9) | |
| Obesity | No | 385 | 123 (31.9) | 262 (68.1) | 0.921 |
| | Yes | 256 | 80 (31.2) | 176 (68.8) | |
| Asthma | No | 634 | 201 (31.7) | 433 (68.3) | 1 |
| | Yes | 7 | 2 (28.6) | 5 (71.4) | |
| Current smoker | No | 573 | 177 (30.9) | 396 (69.1) | 0.274 |
| | Yes | 68 | 26 (38.2) | 42 (61.8) | |
| Past smoker | No | 596 | 180 (30.2) | 416 (69.8) | 0.006 |
| | Yes | 45 | 23 (51.1) | 22 (48.9) | |
| Use of medication | No | 449 | 135 (30.1) | 314 (69.9) | 0.215 |
| | Yes | 192 | 68 (35.4) | 124 (64.6) | |
| Use of dietary supplements | No | 633 | 200 (31.6) | 433 (68.4) | 1 |
| | Yes | 8 | 3 (37.5) | 5 (62.5) | |
| Cardiovascular disease | No | 622 | 195 (31.4) | 427 (68.6) | 0.458 |
| | Yes | 19 | 8 (42.1) | 11 (57.9) | |
| Diabetes mellitus | No | 509 | 146 (28.7) | 363 (71.3) | 0.002 |
| | Yes | 132 | 57 (43.2) | 75 (56.8) | |
| Shortness of breath | No | 49 | 8 (16.3) | 41 (83.7) | 0.025 |
| | Yes | 592 | 195 (32.9) | 397 (67.1) | |

^a Values between parentheses represent the percentage for the corresponding category.

^b p value calculated with the χ² test.

Table 3S. Exploratory data analysis of numerical variables

| Variable | Valid cases | Status | Mean ± SD | p^a |
| --- | --- | --- | --- | --- |
| D-dimer | 425 | Alive | 841.7 ± 2271.01 | 0.004 |
| | 187 | Dead | 2297.73 ± 6738.31 | |
| Arterial pressure, diastolic | 437 | Alive | 76.17 ± 10.81 | 0.007 |
| | 200 | Dead | 73.34 ± 12.81 | |
| Chloride | 424 | Alive | 103.74 ± 5.46 | 0.009 |
| | 194 | Dead | 102.47 ± 5.67 | |
| Serum aspartate aminotransferase | 408 | Alive | 48.48 ± 37.92 | 0.01 |
| | 185 | Dead | 73.33 ± 127.89 | |
| Sodium | 424 | Alive | 136.67 ± 4.21 | 0.02 |
| | 194 | Dead | 135.62 ± 5.54 | |
| Heart rate | 437 | Alive | 97.69 ± 17.57 | 0.032 |
| | 200 | Dead | 101.21 ± 19.89 | |
| Blood urea | 436 | Alive | 29 ± 24.87 | 0.032 |
| | 196 | Dead | 66.57 ± 243.04 | |
| Monocytes | 435 | Alive | 558.24 ± 520.18 | 0.058 |
| | 196 | Dead | 646.1 ± 544.48 | |
| Basophils | 435 | Alive | 47.15 ± 96.2 | 0.068 |
| | 196 | Dead | 65.49 ± 124.59 | |
| Days of delay | 438 | Alive | 8.13 ± 5.01 | 0.131 |
| | 203 | Dead | 7.51 ± 4.76 | |
| Hemoglobin concentration | 436 | Alive | 14.4 ± 1.89 | 0.151 |
| | 196 | Dead | 14.14 ± 2.23 | |
| Height | 438 | Alive | 1.64 ± 0.1 | 0.312 |
| | 200 | Dead | 1.63 ± 0.09 | |
| Arterial pressure, systolic | 437 | Alive | 123.8 ± 19.36 | 0.316 |
| | 199 | Dead | 125.86 ± 25.8 | |
| Body mass index | 438 | Alive | 29.13 ± 5.13 | 0.323 |
| | 200 | Dead | 29.57 ± 5.22 | |
| Arterial bicarbonate | 395 | Alive | 18.01 ± 3.84 | 0.422 |
| | 193 | Dead | 17.68 ± 5.05 | |
| Serum alanine aminotransferase | 406 | Alive | 52.04 ± 40.41 | 0.523 |
| | 179 | Dead | 56.06 ± 79.66 | |
| Calcium | 33 | Alive | 7.25 ± 2.41 | 0.572 |
| | 12 | Dead | 7.51 ± 0.52 | |
| Serum cholesterol | 436 | Alive | 141.87 ± 52.71 | 0.593 |
| | 201 | Dead | 139.68 ± 45.73 | |
| Weight | 438 | Alive | 78 ± 15.48 | 0.662 |
| | 200 | Dead | 78.6 ± 16.26 | |
| Platelets | 436 | Alive | 273077.75 ± 111349.37 | 0.685 |
| | 196 | Dead | 268900.51 ± 123348.71 | |
| Temperature | 437 | Alive | 37.06 ± 0.82 | 0.774 |
| | 200 | Dead | 37.09 ± 0.93 | |
| Blood urea nitrogen | 436 | Alive | 13.72 ± 12.03 | < 0.001 |
| | 196 | Dead | 23 ± 19.35 | |
| Serum creatinine | 436 | Alive | 1.01 ± 1.1 | < 0.001 |
| | 196 | Dead | 1.57 ± 1.87 | |
| Serum lactate dehydrogenase | 355 | Alive | 371.35 ± 200.29 | < 0.001 |
| | 151 | Dead | 565.43 ± 250.71 | |
| Age | 438 | Alive | 51.55 ± 13.87 | < 0.001 |
| | 203 | Dead | 60.31 ± 13.17 | |
| Schooling | 411 | Alive | 3.64 ± 1.25 | < 0.001 |
| | 196 | Dead | 3.02 ± 1.34 | |
| Respiratory rate | 437 | Alive | 24.46 ± 4.47 | < 0.001 |
| | 200 | Dead | 26.48 ± 5.42 | |
| Blood glucose | 436 | Alive | 142.09 ± 82.07 | < 0.001 |
| | 196 | Dead | 173.9 ± 101.57 | |
| Potassium | 424 | Alive | 4.01 ± 0.55 | < 0.001 |
| | 194 | Dead | 4.34 ± 0.84 | |
| White blood cells | 436 | Alive | 9088.81 ± 4586.03 | < 0.001 |
| | 196 | Dead | 12463.13 ± 6413.31 | |
| Lymphocytes | 436 | Alive | 1021.38 ± 510.79 | < 0.001 |
| | 196 | Dead | 836.01 ± 463.22 | |
| Neutrophils | 436 | Alive | 7439.26 ± 4360.39 | < 0.001 |
| | 196 | Dead | 10593.56 ± 5988.25 | |
| Arterial carbon dioxide partial pressure | 395 | Alive | 27.28 ± 7.33 | < 0.001 |
| | 193 | Dead | 32.24 ± 15.14 | |
| Arterial oxygen partial pressure | 395 | Alive | 74.37 ± 29.15 | < 0.001 |
| | 193 | Dead | 58.2 ± 22.73 | |
| pH | 395 | Alive | 7.43 ± 0.07 | < 0.001 |
| | 193 | Dead | 7.36 ± 0.15 | |
| Oxygen saturation | 437 | Alive | 89.64 ± 6.4 | < 0.001 |
| | 200 | Dead | 83.01 ± 12.11 | |
| Severity score | 435 | Alive | 1.24 ± 1.75 | < 0.001 |
| | 199 | Dead | 2.56 ± 2.9 | |

^a p value calculated with Student's t test, except for schooling, number of comorbidities and severity score, where the Mann-Whitney U test was used.

Figure 1S. Random forests model. Effect of the number of randomly selected predictors on accuracy.

Figure 2S. Random forests model. Effect of the number of trees on the stabilization of the out-of-bag error.

1. Secretaria de Salud. Información referente a casos COVID-19 en México [Internet]. 2022. Available from: https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico
2. Mariano Sanchez Talanquer. La letalidad hospitalaria por covid-19 en México: desigualdades institucionales [Internet]. 2020. Available from: https://datos.nexos.com.mx/?p=1625
3. Adam L. Booth, Elizabeth Abels, Peter McCaffrey. Development of a prognostic model for mortality in COVID-19 infection using machine learning. Modern Pathology. 2020 Oct 16;1–10. doi:10.1038/s41379-020-00700-x
4. Gary S. Collins, Johannes B. Reitsma, Douglas G. Altman, Karel G. M. Moons. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. Journal of Clinical Epidemiology. 2015 Feb 1;68(2):112–21. doi:10.1016/j.jclinepi.2014.11.010
5. Rafael A. Irizarry. Introduction to Data Science: Data Analysis and Prediction Algorithms with R [Internet]. First Edition. London: Chapman and Hall/CRC; 2020. Available from: https://www.routledge.com/Introduction-to-Data-Science-Data-Analysis-and-Prediction-Algorithms-with/Irizarry/p/book/9780367357986
6. Andreas Ziegler, Inke R. König. Mining data with random forests: current options for real-world applications. WIREs Data Mining and Knowledge Discovery. 2014;4(1):55–63. doi:10.1002/widm.1114
7. Leo Breiman. Random Forests. Machine Learning. 2001 Oct 1;45(1):5–32. doi:10.1023/A:1010933404324
8. R. M. Conroy, K. Pyörälä, A. P. Fitzgerald, S. Sans, A. Menotti, G. De Backer, D. De Bacquer, P. Ducimetière, P. Jousilahti, U. Keil, I. Njølstad, R. G. Oganov, T. Thomsen, H. Tunstall-Pedoe, A. Tverdal, H. Wedel, P. Whincup, L. Wilhelmsen, I. M. Graham, on behalf of the SCORE project group. Estimation of ten-year risk of fatal cardiovascular disease in Europe: The SCORE project. European Heart Journal. 2003 Jun 1;24(11):987–1003. doi:10.1016/S0195-668X(03)00114-3
9. Mathijs O. Versteylen, Ivo A. Joosen, Leslee J. Shaw, Jagat Narula, Leonard Hofstra. Comparison of Framingham, PROCAM, SCORE, and Diamond Forrester to predict coronary atherosclerosis and cardiovascular events. Journal of Nuclear Cardiology. 2011 Jul 19;18(5):904. doi:10.1007/s12350-011-9425-5
10. R Core Team. R: A language and environment for statistical computing [Internet]. Vienna, Austria: R Foundation for Statistical Computing; 2020. Available from: https://www.R-project.org/
11. RStudio Team. RStudio: Integrated development environment for R [Internet]. Boston, MA: RStudio, PBC; 2021. Available from: http://www.rstudio.com/
12. Richard D. Riley, Joie Ensor, Kym I. E. Snell, Frank E. Harrell, Glen P. Martin, Johannes B. Reitsma, Karel G. M. Moons, Gary Collins, Maarten van Smeden. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020 Mar 18;368:m441. doi:10.1136/bmj.m441