Read the databases

# Packages used throughout (loaded once here so the script runs top to bottom)
library(readxl)         # read_excel
library(dplyr)          # group_by, sample_frac, anti_join
library(ggplot2)        # variable-importance plots
library(randomForest)   # randomForest, importance
library(caret)          # confusionMatrix
library(pROC)           # roc, auc, roc.test
library(rpart)          # rpart
library(rpart.plot)     # rpart.plot, rpart.rules
library(nnet)           # nnet
library(NeuralNetTools) # plotnet, garson
library(glmnet)         # cv.glmnet

IMMAm <- read_excel("IMMAm.xlsx")
summary(IMMAm)
##   Clas_ALMI              BMI             ARG             AFTG      
##  Length:123         Min.   :19.78   Min.   :22.95   Min.   :22.70  
##  Class :character   1st Qu.:25.37   1st Qu.:28.00   1st Qu.:27.45  
##  Mode  :character   Median :28.63   Median :30.50   Median :29.55  
##                     Mean   :28.99   Mean   :30.84   Mean   :30.04  
##                     3rd Qu.:32.19   3rd Qu.:33.62   3rd Qu.:32.25  
##                     Max.   :44.09   Max.   :44.00   Max.   :43.00  
##       ACG              FG              CG             CCG       
##  Min.   :18.63   Min.   :20.10   Min.   :28.50   Min.   :21.73  
##  1st Qu.:22.07   1st Qu.:22.62   1st Qu.:32.15   1st Qu.:26.35  
##  Median :23.89   Median :23.95   Median :34.50   Median :28.17  
##  Mean   :24.07   Mean   :24.08   Mean   :34.59   Mean   :28.06  
##  3rd Qu.:25.95   3rd Qu.:25.57   3rd Qu.:36.45   3rd Qu.:29.85  
##  Max.   :33.63   Max.   :29.65   Max.   :48.60   Max.   :35.56
IMMAh <- read_excel("IMMAh.xlsx")
summary(IMMAh)
##   Clas_ALMI              BMI             ARG             AFTG      
##  Length:60          Min.   :19.48   Min.   :24.00   Min.   :26.00  
##  Class :character   1st Qu.:26.39   1st Qu.:28.54   1st Qu.:29.11  
##  Mode  :character   Median :28.38   Median :30.55   Median :31.50  
##                     Mean   :28.58   Mean   :30.94   Mean   :31.62  
##                     3rd Qu.:30.24   3rd Qu.:32.96   3rd Qu.:33.30  
##                     Max.   :40.19   Max.   :37.95   Max.   :38.20  
##       ACG              FG              CG             CCG       
##  Min.   :22.30   Min.   :23.50   Min.   :29.75   Min.   :27.91  
##  1st Qu.:25.36   1st Qu.:25.50   1st Qu.:34.26   1st Qu.:31.67  
##  Median :26.91   Median :26.75   Median :36.45   Median :33.51  
##  Mean   :26.94   Mean   :27.03   Mean   :36.27   Mean   :33.53  
##  3rd Qu.:28.33   3rd Qu.:28.36   3rd Qu.:38.24   3rd Qu.:35.09  
##  Max.   :32.32   Max.   :32.00   Max.   :43.25   Max.   :39.12

Encode the dependent variable

# Recode the Spanish labels: "NORMAL" (normal ALMI) -> 1, "BAJO" (low ALMI) -> 2
IMMAm$Clas_ALMI <- ifelse(IMMAm$Clas_ALMI == "NORMAL", 1, ifelse(IMMAm$Clas_ALMI == "BAJO", 2, NA))
IMMAm$Clas_ALMI <- as.factor(IMMAm$Clas_ALMI)

IMMAh$Clas_ALMI <- ifelse(IMMAh$Clas_ALMI == "NORMAL", 1, ifelse(IMMAh$Clas_ALMI == "BAJO", 2, NA))
IMMAh$Clas_ALMI <- as.factor(IMMAh$Clas_ALMI)

# Drop any rows whose label matched neither category
IMMAm <- IMMAm[!is.na(IMMAm$Clas_ALMI), ]
IMMAh <- IMMAh[!is.na(IMMAh$Clas_ALMI), ]

Stratified 70/30 partition (females)

set.seed(123)

# Stratified split: sample 70% within each class for training; the remaining
# rows form the test set (anti_join on all columns assumes no duplicated rows)
trainsetALMIm <- IMMAm %>% group_by(Clas_ALMI) %>% sample_frac(0.7)
testsetALMIm  <- IMMAm %>% anti_join(trainsetALMIm, by = colnames(IMMAm))

Stratified 70/30 partition (males)

trainsetALMIh <- IMMAh %>% group_by(Clas_ALMI) %>% sample_frac(0.7)
testsetALMIh  <- IMMAh %>% anti_join(trainsetALMIh, by = colnames(IMMAh))
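
Because sample_frac() was applied within group_by(Clas_ALMI), the split is stratified; a quick check (added here as a suggestion, not part of the original script) confirms that class proportions are similar in the training and test sets:

table(trainsetALMIm$Clas_ALMI)
table(testsetALMIm$Clas_ALMI)
table(trainsetALMIh$Clas_ALMI)
table(testsetALMIh$Clas_ALMI)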

Diagnostic and prognostic models are extensively utilized in the healthcare field for disease detection, severity classification, risk assessment of future conditions, and risk stratification (Cook, 2008). These models have been applied in the development of diagnostic tools for various conditions, including cardiovascular diseases, overweight and obesity, atrial fibrillation, endometrial lesions, osteoporosis, and sarcopenia (Albores-Mendez et al., 2022; Butler et al., 2018; Chung et al., 2021; Cook, 2008; Hendriksen et al., 2013; Kemmler et al., 2018; Ou Yang et al., 2021; Wolf & DeLand, 2011; Zhang et al., 2021). A key advantage of incorporating risk prediction models into clinical practice is the provision of individualized risk estimates, which can enhance decision-making and improve healthcare outcomes (Hendriksen et al., 2013). Moreover, the integration of Artificial Intelligence (AI) techniques, particularly Machine Learning (ML), has emerged as a promising approach for developing predictive models in healthcare settings (Greener et al., 2022).

Appendicular Lean Mass (ALM) is a key diagnostic criterion for sarcopenia, yet one of the most difficult variables to assess in clinical practice. Estimating it from anthropometric variables is therefore both a priority and a practical alternative. In this study, the aim was to predict the binary classification of the ALM index (ALMI) as “normal” or “low.” Five machine learning (ML) models were applied: Decision Trees (DT), Logistic Regression (LR), Random Forests (RF), Artificial Neural Networks (ANN), and LASSO Regression (LASSO). The models were evaluated using AUC and specificity values.

1 Random Forest model

A Random Forest is a machine learning algorithm used for both classification and regression tasks. It belongs to the family of ensemble methods, which combine the predictions of many individual models to improve overall performance. It works by bootstrap-sampling the data, building a decision tree on each sample, and aggregating the trees' predictions by majority vote. Here, Random Forests were used to predict low Appendicular Lean Mass Index (ALMI) separately by sex.
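
The main tuning knobs are the number of trees (ntree) and the number of predictors tried at each split (mtry). The models below use the defaults plus ntree = 500; as an optional, hedged sketch (not part of the original analysis), mtry could be tuned by out-of-bag error with randomForest::tuneRF:

# Search over mtry using OOB error; ungroup() because the training set is grouped
set.seed(123)
tune_res <- tuneRF(x = as.data.frame(ungroup(trainsetALMIm)[, -1]),
                   y = ungroup(trainsetALMIm)$Clas_ALMI,
                   ntreeTry = 500, stepFactor = 1.5, improve = 0.01)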

Females

rf_script_m <- randomForest(Clas_ALMI ~ ., data = trainsetALMIm, ntree = 500)
pred_script_m <- predict(rf_script_m, newdata = testsetALMIm)
conf_mtx_script_m <- confusionMatrix(pred_script_m, testsetALMIm$Clas_ALMI, positive = "2")
print(conf_mtx_script_m)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 27  3
##          2  2  4
##                                          
##                Accuracy : 0.8611         
##                  95% CI : (0.705, 0.9533)
##     No Information Rate : 0.8056         
##     P-Value [Acc > NIR] : 0.273          
##                                          
##                   Kappa : 0.5312         
##                                          
##  Mcnemar's Test P-Value : 1.000          
##                                          
##             Sensitivity : 0.5714         
##             Specificity : 0.9310         
##          Pos Pred Value : 0.6667         
##          Neg Pred Value : 0.9000         
##              Prevalence : 0.1944         
##          Detection Rate : 0.1111         
##    Detection Prevalence : 0.1667         
##       Balanced Accuracy : 0.7512         
##                                          
##        'Positive' Class : 2              
## 
var_importanceMrf <- importance(rf_script_m)
var_importance_ALMImRF <- as.data.frame(var_importanceMrf)
var_importance_ALMImRF$Variable <- rownames(var_importance_ALMImRF)
ggplot(var_importance_ALMImRF, aes(x = reorder(Variable, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "#2F4731") +
  coord_flip() +
  labs(title = "Variable Importance in Random Forest",
       x = "Variables",
       y = "Mean Decrease in Gini") +
  theme_minimal()

prob_script_m <- predict(rf_script_m, newdata = testsetALMIm, type = "prob")[, "2"]
roc_script_m <- roc(testsetALMIm$Clas_ALMI, prob_script_m, levels = c("1", "2"))
## Setting direction: controls < cases
auc_script_m <- auc(roc_script_m)
plot(roc_script_m, main = "ROC Curve - RF Females")

Males

rf_script_h <- randomForest(Clas_ALMI ~ ., data = trainsetALMIh, ntree = 500)
pred_script_h <- predict(rf_script_h, newdata = testsetALMIh)
conf_mtx_script_h <- confusionMatrix(pred_script_h, testsetALMIh$Clas_ALMI, positive = "2")
print(conf_mtx_script_h)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 12  1
##          2  1  3
##                                           
##                Accuracy : 0.8824          
##                  95% CI : (0.6356, 0.9854)
##     No Information Rate : 0.7647          
##     P-Value [Acc > NIR] : 0.1998          
##                                           
##                   Kappa : 0.6731          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.7500          
##             Specificity : 0.9231          
##          Pos Pred Value : 0.7500          
##          Neg Pred Value : 0.9231          
##              Prevalence : 0.2353          
##          Detection Rate : 0.1765          
##    Detection Prevalence : 0.2353          
##       Balanced Accuracy : 0.8365          
##                                           
##        'Positive' Class : 2               
## 
var_importanceHrf <- importance(rf_script_h)
var_importance_ALMIhRF <- as.data.frame(var_importanceHrf)
var_importance_ALMIhRF$Variable <- rownames(var_importance_ALMIhRF)
ggplot(var_importance_ALMIhRF, aes(x = reorder(Variable, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "#2F4731") +
  coord_flip() +
  labs(title = "Variable Importance in Random Forest",
       x = "Variables",
       y = "Mean Decrease in Gini") +
  theme_minimal()

prob_script_h <- predict(rf_script_h, newdata = testsetALMIh, type = "prob")[, "2"]
roc_script_h <- roc(testsetALMIh$Clas_ALMI, prob_script_h, levels = c("1", "2"))
## Setting direction: controls < cases
auc_script_h <- auc(roc_script_h)
plot(roc_script_h, main = "ROC Curve- RF Males")

The modeling identified the most predictive variables: BMI and CG for females, and CG and ARG for males. When classifying the test-set cases, the Random Forest models reached AUCs of 0.82 for females and 0.79 for males.

2 Logistic Regression

Logistic regression is a supervised machine learning algorithm used for binary classification, i.e., it predicts whether a case belongs to one of two categories (e.g., yes/no, 0/1, positive/negative). It takes a weighted sum of the input variables, converts it into a probability, and applies a threshold to assign each case to one of the two categories (low or normal ALMI).
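
For intuition, the model passes the weighted sum through the logistic (inverse-logit) function to obtain a probability, which is then thresholded at 0.5. A toy sketch with made-up numbers (the coefficients below are illustrative, not fitted values):

b0 <- -8; b_bmi <- 0.25                  # hypothetical intercept and BMI weight
p_low <- plogis(b0 + b_bmi * 30)         # P(low ALMI | BMI = 30) in this toy model
ifelse(p_low >= 0.5, "low", "normal")    # same 0.5 threshold used below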

Females

glm_script_m <- glm(Clas_ALMI ~ ., data = trainsetALMIm, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm_script_m)
## 
## Call:
## glm(formula = Clas_ALMI ~ ., family = "binomial", data = trainsetALMIm)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)   7376.80  706862.64   0.010    0.992
## BMI           -185.10   15623.58  -0.012    0.991
## ARG            506.38   40247.17   0.013    0.990
## AFTG          -331.97   28252.76  -0.012    0.991
## ACG           -222.99   19574.73  -0.011    0.991
## FG             -48.82    6982.93  -0.007    0.994
## CG             -48.11   10491.04  -0.005    0.996
## CCG            -14.11    3338.89  -0.004    0.997
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.8708e+01  on 86  degrees of freedom
## Residual deviance: 4.2174e-07  on 79  degrees of freedom
## AIC: 16
## 
## Number of Fisher Scoring iterations: 25
prob_glm_script_m <- predict(glm_script_m, newdata = testsetALMIm, type = "response")
pred_glm_script_m <- ifelse(prob_glm_script_m >= 0.5, 2, 1)
pred_glm_script_m <- as.factor(pred_glm_script_m)
conf_glm_script_m <- confusionMatrix(pred_glm_script_m, testsetALMIm$Clas_ALMI, positive = "2")
print(conf_glm_script_m)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 27  3
##          2  2  4
##                                          
##                Accuracy : 0.8611         
##                  95% CI : (0.705, 0.9533)
##     No Information Rate : 0.8056         
##     P-Value [Acc > NIR] : 0.273          
##                                          
##                   Kappa : 0.5312         
##                                          
##  Mcnemar's Test P-Value : 1.000          
##                                          
##             Sensitivity : 0.5714         
##             Specificity : 0.9310         
##          Pos Pred Value : 0.6667         
##          Neg Pred Value : 0.9000         
##              Prevalence : 0.1944         
##          Detection Rate : 0.1111         
##    Detection Prevalence : 0.1667         
##       Balanced Accuracy : 0.7512         
##                                          
##        'Positive' Class : 2              
## 
roc_glm_script_m <- roc(testsetALMIm$Clas_ALMI, prob_glm_script_m, levels = c("1", "2"))
## Setting direction: controls < cases
auc_glm_script_m <- auc(roc_glm_script_m)
plot(roc_glm_script_m, main = "ROC Curve - GLM Females")

Males

glm_script_h <- glm(Clas_ALMI ~ ., data = trainsetALMIh, family = "binomial")
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm_script_h)
## 
## Call:
## glm(formula = Clas_ALMI ~ ., family = "binomial", data = trainsetALMIh)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1802.22  973842.13   0.002    0.999
## BMI             -3.13   14513.51   0.000    1.000
## ARG            -56.27   99777.92  -0.001    1.000
## AFTG            24.36   44860.02   0.001    1.000
## ACG             23.28   45842.11   0.001    1.000
## FG             -21.73   33221.02  -0.001    0.999
## CG              19.88   26609.56   0.001    0.999
## CCG            -47.95   29310.10  -0.002    0.999
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.0901e+01  on 41  degrees of freedom
## Residual deviance: 6.3139e-09  on 34  degrees of freedom
## AIC: 16
## 
## Number of Fisher Scoring iterations: 25
prob_glm_script_h <- predict(glm_script_h, newdata = testsetALMIh, type = "response")
pred_glm_script_h <- ifelse(prob_glm_script_h >= 0.5, 2, 1)
pred_glm_script_h <- as.factor(pred_glm_script_h)
conf_glm_script_h <- confusionMatrix(pred_glm_script_h, testsetALMIh$Clas_ALMI, positive = "2")
print(conf_glm_script_h)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 10  1
##          2  3  3
##                                          
##                Accuracy : 0.7647         
##                  95% CI : (0.501, 0.9319)
##     No Information Rate : 0.7647         
##     P-Value [Acc > NIR] : 0.6300         
##                                          
##                   Kappa : 0.4426         
##                                          
##  Mcnemar's Test P-Value : 0.6171         
##                                          
##             Sensitivity : 0.7500         
##             Specificity : 0.7692         
##          Pos Pred Value : 0.5000         
##          Neg Pred Value : 0.9091         
##              Prevalence : 0.2353         
##          Detection Rate : 0.1765         
##    Detection Prevalence : 0.3529         
##       Balanced Accuracy : 0.7596         
##                                          
##        'Positive' Class : 2              
## 
roc_glm_script_h <- roc(testsetALMIh$Clas_ALMI, prob_glm_script_h, levels = c("1", "2"))
## Setting direction: controls < cases
auc_glm_script_h <- auc(roc_glm_script_h)
plot(roc_glm_script_h, main = "ROC Curve - GLM Males")

The predictive values of the variables included in the model were obtained, although the convergence warnings above indicate quasi-complete separation in these small training sets, so the individual coefficients should be read with caution. When classifying the test-set cases, the models achieved AUCs of 0.76 for females and 0.77 for males; that is, the probability that a randomly chosen low-ALMI case receives a higher predicted risk than a randomly chosen normal case is 76% and 77%, respectively. Performance was thus similar for females and males.

3 Decision Tree

A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It works like a flowchart in which the data are split into branches according to decision rules, eventually leading to an outcome or prediction. It starts with a first question (split) on the most predictive variable at the root node; internal nodes pose further questions that divide the data, and the terminal nodes hold the final predictions (low or normal). An explicit rule listing for the fitted female tree is sketched after its ROC curve below.

Females

dt_script_m <- rpart(Clas_ALMI ~ ., data = trainsetALMIm, method = "class")
rpart.plot(dt_script_m,
           main = "Decision Tree - Females",
           digits = 3,
           box.col = "#367588",        # node fill
           border.col = "white",       # white node borders
           split.box.col = "#ADE8F4",  # split-box fill
           split.border.col = "white", # white split borders
           col = "white",              # white text
           shadow.col = "gray")

pred_dt_script_m <- predict(dt_script_m, newdata = testsetALMIm, type = "class")
conf_dt_script_m <- confusionMatrix(pred_dt_script_m, testsetALMIm$Clas_ALMI, positive = "2")
print(conf_dt_script_m)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 29  3
##          2  0  4
##                                           
##                Accuracy : 0.9167          
##                  95% CI : (0.7753, 0.9825)
##     No Information Rate : 0.8056          
##     P-Value [Acc > NIR] : 0.06112         
##                                           
##                   Kappa : 0.6824          
##                                           
##  Mcnemar's Test P-Value : 0.24821         
##                                           
##             Sensitivity : 0.5714          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9062          
##              Prevalence : 0.1944          
##          Detection Rate : 0.1111          
##    Detection Prevalence : 0.1111          
##       Balanced Accuracy : 0.7857          
##                                           
##        'Positive' Class : 2               
## 
prob_dt_script_m <- predict(dt_script_m, newdata = testsetALMIm)[, "2"]
roc_dt_script_m <- roc(testsetALMIm$Clas_ALMI, prob_dt_script_m, levels = c("1", "2"))
## Setting direction: controls < cases
auc_dt_script_m <- auc(roc_dt_script_m)
plot(roc_dt_script_m, main = "ROC Curve - DT Females")
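
The fitted tree can also be read as explicit if-then rules; a hedged one-liner using rpart.plot's rpart.rules() (an added inspection step, not in the original script):

rpart.rules(dt_script_m, cover = TRUE)   # one rule per leaf, with % of cases covered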

Males

dt_script_h <- rpart(Clas_ALMI ~ ., data = trainsetALMIh, method = "class")
rpart.plot(dt_script_h,
           main = "Decision Tree - Males",
           digits = 3,
           box.col = "#367588",        # node fill
           border.col = "white",       # white node borders
           split.box.col = "#ADE8F4",  # split-box fill
           split.border.col = "white", # white split borders
           col = "white",              # white text
           shadow.col = "gray")

pred_dt_script_h <- predict(dt_script_h, newdata = testsetALMIh, type = "class")
conf_dt_script_h <- confusionMatrix(pred_dt_script_h, testsetALMIh$Clas_ALMI, positive = "2")
print(conf_dt_script_h)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 11  1
##          2  2  3
##                                          
##                Accuracy : 0.8235         
##                  95% CI : (0.5657, 0.962)
##     No Information Rate : 0.7647         
##     P-Value [Acc > NIR] : 0.4069         
##                                          
##                   Kappa : 0.5487         
##                                          
##  Mcnemar's Test P-Value : 1.0000         
##                                          
##             Sensitivity : 0.7500         
##             Specificity : 0.8462         
##          Pos Pred Value : 0.6000         
##          Neg Pred Value : 0.9167         
##              Prevalence : 0.2353         
##          Detection Rate : 0.1765         
##    Detection Prevalence : 0.2941         
##       Balanced Accuracy : 0.7981         
##                                          
##        'Positive' Class : 2              
## 
prob_dt_script_h <- predict(dt_script_h, newdata = testsetALMIh)[, "2"]
roc_dt_script_h <- roc(testsetALMIh$Clas_ALMI, prob_dt_script_h, levels = c("1", "2"))
## Setting direction: controls < cases
auc_dt_script_h <- auc(roc_dt_script_h)
plot(roc_dt_script_h, main = "ROC Curve - DT Males")

BMI and CCG were identified as the most relevant predictors for women, and ARG for men, with the trees showing cut-off points for detecting low ALMI. Classifying the test-set cases to validate the models yielded AUCs of 0.84 for women and 0.80 for men, slightly lower in males.

4 Artificial Neural Network (ANN)

An Artificial Neural Network (ANN) is a type of machine learning model inspired by the structure and functioning of the human brain. It’s particularly powerful for detecting complex, non-linear relationships in data and is widely used in tasks like image recognition, natural language processing, and medical diagnosis.
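
Two practical caveats, sketched below as suggestions rather than part of the original pipeline: nnet() starts from random weights, so fixing the seed makes results reproducible, and it usually trains better on standardized inputs.

set.seed(123)                                 # reproducible starting weights
train_std <- as.data.frame(ungroup(trainsetALMIm))
train_std[, -1] <- scale(train_std[, -1])     # standardize the anthropometric predictors
ann_std <- nnet(Clas_ALMI ~ ., data = train_std, size = 5, maxit = 500,
                decay = 5e-4, trace = FALSE)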

Females

ann_script_m <- nnet(Clas_ALMI ~ ., data = trainsetALMIm, size = 5, maxit = 500, decay = 5e-4, trace = FALSE)
prob_ann_script_m <- predict(ann_script_m, newdata = testsetALMIm, type = "raw")
pred_ann_script_m <- ifelse(prob_ann_script_m > 0.5, 2, 1)
pred_ann_script_m <- as.factor(pred_ann_script_m)
conf_ann_script_m <- confusionMatrix(pred_ann_script_m, testsetALMIm$Clas_ALMI, positive = "2")
print(conf_ann_script_m)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 27  3
##          2  2  4
##                                          
##                Accuracy : 0.8611         
##                  95% CI : (0.705, 0.9533)
##     No Information Rate : 0.8056         
##     P-Value [Acc > NIR] : 0.273          
##                                          
##                   Kappa : 0.5312         
##                                          
##  Mcnemar's Test P-Value : 1.000          
##                                          
##             Sensitivity : 0.5714         
##             Specificity : 0.9310         
##          Pos Pred Value : 0.6667         
##          Neg Pred Value : 0.9000         
##              Prevalence : 0.1944         
##          Detection Rate : 0.1111         
##    Detection Prevalence : 0.1667         
##       Balanced Accuracy : 0.7512         
##                                          
##        'Positive' Class : 2              
## 
roc_ann_script_m <- roc(testsetALMIm$Clas_ALMI, as.vector(prob_ann_script_m), levels = c("1", "2"))
## Setting direction: controls < cases
auc_ann_script_m <- auc(roc_ann_script_m)
plot(roc_ann_script_m, main = "ROC Curve - ANN Females")

plotnet(ann_script_m, rep = "best", alpha = 1, circle_col = "gray", pos_col = "black", neg_col = "#71142A", wts.only = FALSE)

Males

ann_script_h <- nnet(Clas_ALMI ~ ., data = trainsetALMIh, size = 5, maxit = 500, decay = 5e-4, trace = FALSE)
prob_ann_script_h <- predict(ann_script_h, newdata = testsetALMIh, type = "raw")
pred_ann_script_h <- ifelse(prob_ann_script_h > 0.5, 2, 1)
pred_ann_script_h <- as.factor(pred_ann_script_h)
conf_ann_script_h <- confusionMatrix(pred_ann_script_h, testsetALMIh$Clas_ALMI, positive = "2")
print(conf_ann_script_h)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 11  1
##          2  2  3
##                                          
##                Accuracy : 0.8235         
##                  95% CI : (0.5657, 0.962)
##     No Information Rate : 0.7647         
##     P-Value [Acc > NIR] : 0.4069         
##                                          
##                   Kappa : 0.5487         
##                                          
##  Mcnemar's Test P-Value : 1.0000         
##                                          
##             Sensitivity : 0.7500         
##             Specificity : 0.8462         
##          Pos Pred Value : 0.6000         
##          Neg Pred Value : 0.9167         
##              Prevalence : 0.2353         
##          Detection Rate : 0.1765         
##    Detection Prevalence : 0.2941         
##       Balanced Accuracy : 0.7981         
##                                          
##        'Positive' Class : 2              
## 
roc_ann_script_h <- roc(testsetALMIh$Clas_ALMI, as.vector(prob_ann_script_h), levels = c("1", "2"))
## Setting direction: controls < cases
auc_ann_script_h <- auc(roc_ann_script_h)
plot(roc_ann_script_h, main = "ROC Curve - ANN Males")

plotnet(ann_script_h, rep = "best", alpha = 1, circle_col = "gray", pos_col = "black", neg_col = "#71142A", wts.only = FALSE)

Each fitted ANN consists of a single hidden layer with five neurons (H1 to H5) and one output neuron (O1) responsible for predicting the binary outcome: low vs. normal ALMI. Bias nodes B1 and B2 adjust the activation thresholds and aid the learning process. The connecting lines represent the synaptic weights, with thicker lines indicating stronger connections; line color denotes the sign of the weight, black for positive and burgundy for negative. The distribution of weights from the inputs to the hidden neurons appears more balanced in the male network (b). In the female model, AFTG emerges as the most influential variable, followed by ARG and BMI, indicating a greater predictive contribution from thigh-related anthropometric measures; in the male model, CCG appears most important, followed by CG and ARG. Regarding performance, the models achieved AUCs of 0.82 for females and 0.92 for males on the test datasets.
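
The importance ranking read off the weight diagrams can be cross-checked numerically; NeuralNetTools also implements Garson's algorithm (an added check, shown here for the female model):

garson(ann_script_m)   # relative importance of each input, plotted as a bar chart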

5 LASSO

Least Absolute Shrinkage and Selection Operator (LASSO) regression is a technique that combines coefficient shrinkage and variable selection to enhance model performance and interpretability in regression models.

Females

x_script_m <- model.matrix(Clas_ALMI ~ ., data = trainsetALMIm)[,-1]
y_script_m <- as.numeric(as.character(trainsetALMIm$Clas_ALMI))
lasso_script_m <- cv.glmnet(x_script_m, y_script_m, alpha = 1, family = "binomial")
x_test_script_m <- model.matrix(Clas_ALMI ~ ., data = testsetALMIm)[,-1]
prob_lasso_script_m <- predict(lasso_script_m, s = "lambda.min", newx = x_test_script_m, type = "response")
pred_lasso_script_m <- ifelse(prob_lasso_script_m > 0.5, 2, 1)
pred_lasso_script_m <- as.factor(pred_lasso_script_m)
conf_lasso_script_m <- confusionMatrix(pred_lasso_script_m, testsetALMIm$Clas_ALMI, positive = "2")
print(conf_lasso_script_m)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 28  3
##          2  1  4
##                                           
##                Accuracy : 0.8889          
##                  95% CI : (0.7394, 0.9689)
##     No Information Rate : 0.8056          
##     P-Value [Acc > NIR] : 0.1444          
##                                           
##                   Kappa : 0.6022          
##                                           
##  Mcnemar's Test P-Value : 0.6171          
##                                           
##             Sensitivity : 0.5714          
##             Specificity : 0.9655          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.9032          
##              Prevalence : 0.1944          
##          Detection Rate : 0.1111          
##    Detection Prevalence : 0.1389          
##       Balanced Accuracy : 0.7685          
##                                           
##        'Positive' Class : 2               
## 
roc_lasso_script_m <- roc(testsetALMIm$Clas_ALMI, as.vector(prob_lasso_script_m), levels = c("1", "2"))
## Setting direction: controls < cases
auc_lasso_script_m <- auc(roc_lasso_script_m)
plot(roc_lasso_script_m, main = "ROC Curve - LASSO Females")
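
Which predictors the penalty actually retained can be read from the coefficients at the selected lambda (an added inspection step; variables shrunk to zero were dropped):

coef(lasso_script_m, s = "lambda.min")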

Males

x_script_h <- model.matrix(Clas_ALMI ~ ., data = trainsetALMIh)[,-1]
y_script_h <- as.numeric(as.character(trainsetALMIh$Clas_ALMI))
lasso_script_h <- cv.glmnet(x_script_h, y_script_h, alpha = 1, family = "binomial")
## Warning in lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## Warning in lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## Warning in lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## Warning in lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## Warning in lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
## Warning in lognet(xd, is.sparse, ix, jx, y, weights, offset, alpha, nobs, : one
## multinomial or binomial class has fewer than 8 observations; dangerous ground
x_test_script_h <- model.matrix(Clas_ALMI ~ ., data = testsetALMIh)[,-1]
prob_lasso_script_h <- predict(lasso_script_h, s = "lambda.min", newx = x_test_script_h, type = "response")
pred_lasso_script_h <- ifelse(prob_lasso_script_h > 0.5, 2, 1)
pred_lasso_script_h <- as.factor(pred_lasso_script_h)
conf_lasso_script_h <- confusionMatrix(pred_lasso_script_h, testsetALMIh$Clas_ALMI, positive = "2")
print(conf_lasso_script_h)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 11  1
##          2  2  3
##                                          
##                Accuracy : 0.8235         
##                  95% CI : (0.5657, 0.962)
##     No Information Rate : 0.7647         
##     P-Value [Acc > NIR] : 0.4069         
##                                          
##                   Kappa : 0.5487         
##                                          
##  Mcnemar's Test P-Value : 1.0000         
##                                          
##             Sensitivity : 0.7500         
##             Specificity : 0.8462         
##          Pos Pred Value : 0.6000         
##          Neg Pred Value : 0.9167         
##              Prevalence : 0.2353         
##          Detection Rate : 0.1765         
##    Detection Prevalence : 0.2941         
##       Balanced Accuracy : 0.7981         
##                                          
##        'Positive' Class : 2              
## 
roc_lasso_script_h <- roc(testsetALMIh$Clas_ALMI, as.vector(prob_lasso_script_h), levels = c("1", "2"))
## Setting direction: controls < cases
auc_lasso_script_h <- auc(roc_lasso_script_h)
plot(roc_lasso_script_h, main = "ROC Curve - LASSO Males")

The LASSO models achieved AUCs of 0.84 for females and 0.87 for males on the test sets.

6 Graphical comparison of ROC curves by sex

Comparing the performance metrics of all the evaluated models: for females, the DT and LASSO models showed the best predictive performance, each achieving an AUC of 0.84; for males, the ANN outperformed the others with an AUC of 0.92. These results are represented visually by the ROC curves below. Because the primary goal in predicting sarcopenia is to accurately flag low-ALMI cases without mislabeling healthy ones, specificity is a particularly important metric. All models for females achieved specificity above 0.93, with the DT model exhibiting perfect specificity (1.00), meaning no participant with normal ALMI was misclassified as low. For male participants, specificity values were slightly lower but still clinically relevant: LR had the lowest specificity at 0.77, whereas the RF model performed best, reaching 0.92.

Females

plot(roc_glm_script_m, col="#65318e", lwd=2, main="ROC Curves – All models (Females)")
plot(roc_dt_script_m, col="#367588", add=TRUE, lwd=2)
plot(roc_script_m, col="#0B6623", add=TRUE, lwd=2) # Random Forest
plot(roc_ann_script_m, col="#660C07", add=TRUE, lwd=2)
plot(roc_lasso_script_m, col="#e3735e", add=TRUE, lwd=2)
legend("bottomright",
       legend=c("Logistic Regression", "Decision Tree", "Random Forest", "ANN", "LASSO"),
       col=c("#65318e", "#367588", "#0B6623", "#660C07", "#e3735e"), lwd=2, cex=0.8)

Males

plot(roc_glm_script_h, col="#65318e", lwd=2, main="ROC Curves – All models (Males)")
plot(roc_dt_script_h, col="#367588", add=TRUE, lwd=2)
plot(roc_script_h, col="#0B6623", add=TRUE, lwd=2) # Random Forest
plot(roc_ann_script_h, col="#660C07", add=TRUE, lwd=2)
plot(roc_lasso_script_h, col="#e3735e", add=TRUE, lwd=2)
legend("bottomright",
       legend=c("Logistic Regression", "Decision Tree", "Random Forest", "ANN", "LASSO"),
       col=c("#65318e", "#367588", "#0B6623", "#660C07", "#e3735e"), lwd=2, cex=0.8)

7 Summary metrics tables

Females

tab_m <- data.frame(
  Modelo = c("Logistic Regression", "Decision Tree", "Random Forest", "ANN", "LASSO"),
  Accuracy    = c(conf_glm_script_m$overall["Accuracy"], conf_dt_script_m$overall["Accuracy"], conf_mtx_script_m$overall["Accuracy"], conf_ann_script_m$overall["Accuracy"], conf_lasso_script_m$overall["Accuracy"]),
  Sensitivity = c(conf_glm_script_m$byClass["Sensitivity"], conf_dt_script_m$byClass["Sensitivity"], conf_mtx_script_m$byClass["Sensitivity"], conf_ann_script_m$byClass["Sensitivity"], conf_lasso_script_m$byClass["Sensitivity"]),
  Specificity = c(conf_glm_script_m$byClass["Specificity"], conf_dt_script_m$byClass["Specificity"], conf_mtx_script_m$byClass["Specificity"], conf_ann_script_m$byClass["Specificity"], conf_lasso_script_m$byClass["Specificity"]),
  Precision   = c(conf_glm_script_m$byClass["Precision"], conf_dt_script_m$byClass["Precision"], conf_mtx_script_m$byClass["Precision"], conf_ann_script_m$byClass["Precision"], conf_lasso_script_m$byClass["Precision"]),
  AUC         = c(auc_glm_script_m, auc_dt_script_m, auc_script_m, auc_ann_script_m, auc_lasso_script_m)
)
knitr::kable(tab_m, digits=3, caption="Table 1. Performance of ML models for the prediction of low ALMI in females")
Table 1. Performance of ML models for the prediction of low ALMI in females

| Model               | Accuracy | Sensitivity | Specificity | Precision |   AUC |
|---------------------|---------:|------------:|------------:|----------:|------:|
| Logistic Regression |    0.861 |       0.571 |       0.931 |     0.667 | 0.761 |
| Decision Tree       |    0.917 |       0.571 |       1.000 |     1.000 | 0.835 |
| Random Forest       |    0.861 |       0.571 |       0.931 |     0.667 | 0.815 |
| ANN                 |    0.861 |       0.571 |       0.931 |     0.667 | 0.818 |
| LASSO               |    0.889 |       0.571 |       0.966 |     0.800 | 0.842 |

Males

tab_h <- data.frame(
  Modelo = c("Logistic Regression", "Decision Tree", "Random Forest", "ANN", "LASSO"),
  Accuracy    = c(conf_glm_script_h$overall["Accuracy"], conf_dt_script_h$overall["Accuracy"], conf_mtx_script_h$overall["Accuracy"], conf_ann_script_h$overall["Accuracy"], conf_lasso_script_h$overall["Accuracy"]),
  Sensitivity = c(conf_glm_script_h$byClass["Sensitivity"], conf_dt_script_h$byClass["Sensitivity"], conf_mtx_script_h$byClass["Sensitivity"], conf_ann_script_h$byClass["Sensitivity"], conf_lasso_script_h$byClass["Sensitivity"]),
  Specificity = c(conf_glm_script_h$byClass["Specificity"], conf_dt_script_h$byClass["Specificity"], conf_mtx_script_h$byClass["Specificity"], conf_ann_script_h$byClass["Specificity"], conf_lasso_script_h$byClass["Specificity"]),
  Precision   = c(conf_glm_script_h$byClass["Precision"], conf_dt_script_h$byClass["Precision"], conf_mtx_script_h$byClass["Precision"], conf_ann_script_h$byClass["Precision"], conf_lasso_script_h$byClass["Precision"]),
  AUC         = c(auc_glm_script_h, auc_dt_script_h, auc_script_h, auc_ann_script_h, auc_lasso_script_h)
)
knitr::kable(tab_h, digits=3, caption="Table 2. Performance of ML models for the prediction of low ALMI in males")
Table 2. Performance of ML models for the prediction of low ALMI in males

| Model               | Accuracy | Sensitivity | Specificity | Precision |   AUC |
|---------------------|---------:|------------:|------------:|----------:|------:|
| Logistic Regression |    0.765 |        0.75 |       0.769 |      0.50 | 0.769 |
| Decision Tree       |    0.824 |        0.75 |       0.846 |      0.60 | 0.798 |
| Random Forest       |    0.882 |        0.75 |       0.923 |      0.75 | 0.788 |
| ANN                 |    0.824 |        0.75 |       0.846 |      0.60 | 0.923 |
| LASSO               |    0.824 |        0.75 |       0.846 |      0.60 | 0.865 |

References:

Cook, N.R. Statistical evaluation of prognostic versus diagnostic models: Beyond the ROC curve. Clin. Chem. 2008, 54(1), 17–23.

Greener, J.G.; Kandathil, S.M.; Moffat, L.; Jones, D.T. A guide to machine learning for biologists. Nat. Rev. Mol. Cell Biol. 2022, 23(1), 40–55.

Albores-Mendez, E.M.; Aguilera Hernández, A.D.; Melo-González, A.; Vargas-Hernández, M.A.; Gutierrez de la Cruz, N.; Vazquez-Guzman, M.A.; Castro-Marín, M.; Romero-Morelos, P.; Winkler, R. A diagnostic model for overweight and obesity from untargeted urine metabolomics of soldiers. PeerJ 2022, 10, e13754.

Butler, É.M.; Derraik, J.G.B.; Taylor, R.W.; Cutfield, W.S. Prediction models for early childhood obesity: applicability and existing issues. Horm. Res. Paediatr. 2018, 90(6), 358–367.

Chung, H.; Jo, Y.; Ryu, D.; Jeong, C.; Choe, S.; Lee, J. Artificial‐intelligence‐driven discovery of prognostic biomarker for sarcopenia. J. Cachexia Sarcopenia Muscle 2021, 12(6), 2220–2230.

Hendriksen, J.M.T.; Geersing, G.J.; Moons, K.G.M.; de Groot, J.A.H. Diagnostic and prognostic prediction models. J. Thromb. Haemost. 2013, 11, 129–141.

Kemmler, W.; von Stengel, S.; Kohl, M. Developing sarcopenia criteria and cutoffs for an older Caucasian cohort – a strictly biometrical approach. Clin. Interv. Aging 2018, 13, 1365–1373.

Ou Yang, W.-Y.; Lai, C.-C.; Tsou, M.-T.; Hwang, L.-C. Development of machine learning models for prediction of osteoporosis from clinical health examination data. Int. J. Environ. Res. Public Health 2021, 18(14), 7635.

Wolf, M.B.; DeLand, E.C. A comprehensive, computer-model-based approach for diagnosis and treatment of complex acid–base disorders in critically-ill patients. J. Clin. Monit. Comput. 2011, 25(6), 353–364.  

Zhang, Y.; Wang, Z.; Zhang, J.; Wang, C.; Wang, Y.; Chen, H.; Shan, L.; Huo, J.; Gu, J.; Ma, X. Deep learning model for classifying endometrial lesions. J. Transl. Med. 2021, 19(1), 10.