library(readr)

We will work with the dataset proposed in the assignment, which comes from a study that aims to estimate obesity levels in the populations of Colombia, Peru, and Mexico from their lifestyle and eating habits.

To take a first look at the variables in the dataset (Obesity.csv), we load it and apply the functions “dim” and “head”:

data <- read_csv("Obesity.csv")
## Rows: 2111 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Gender, family_history_with_overweight, FAVC, CAEC, SMOKE, SCC, CAL...
## dbl (8): Age, Height, Weight, FCVC, NCP, CH2O, FAF, TUE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 17
##   Gender   Age Height Weight family_history_with_overw…¹ FAVC   FCVC   NCP CAEC 
##   <chr>  <dbl>  <dbl>  <dbl> <chr>                       <chr> <dbl> <dbl> <chr>
## 1 Female    21   1.62   64   yes                         no        2     3 Some…
## 2 Female    21   1.52   56   yes                         no        3     3 Some…
## 3 Male      23   1.8    77   yes                         no        2     3 Some…
## 4 Male      27   1.8    87   no                          no        3     3 Some…
## 5 Male      22   1.78   89.8 no                          no        2     1 Some…
## 6 Male      29   1.62   53   no                          yes       2     3 Some…
## # ℹ abbreviated name: ¹​family_history_with_overweight
## # ℹ 8 more variables: SMOKE <chr>, CH2O <dbl>, SCC <chr>, FAF <dbl>, TUE <dbl>,
## #   CALC <chr>, MTRANS <chr>, NObeyesdad <chr>
dim(data)
## [1] 2111   17

Data preparation and exploratory analysis

We thus have a dataset of 17 variables and 2111 observations. To examine their nature we apply the functions “str” and “summary”, which report the type of each variable as well as the main descriptive statistics of the numeric ones:

str(data)
## spc_tbl_ [2,111 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Gender                        : chr [1:2111] "Female" "Female" "Male" "Male" ...
##  $ Age                           : num [1:2111] 21 21 23 27 22 29 23 22 24 22 ...
##  $ Height                        : num [1:2111] 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num [1:2111] 64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history_with_overweight: chr [1:2111] "yes" "yes" "yes" "no" ...
##  $ FAVC                          : chr [1:2111] "no" "no" "no" "no" ...
##  $ FCVC                          : num [1:2111] 2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : num [1:2111] 3 3 3 3 1 3 3 3 3 3 ...
##  $ CAEC                          : chr [1:2111] "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
##  $ SMOKE                         : chr [1:2111] "no" "yes" "no" "no" ...
##  $ CH2O                          : num [1:2111] 2 3 2 2 2 2 2 2 2 2 ...
##  $ SCC                           : chr [1:2111] "no" "yes" "no" "no" ...
##  $ FAF                           : num [1:2111] 0 3 2 2 0 0 1 3 1 1 ...
##  $ TUE                           : num [1:2111] 1 0 1 0 0 0 0 0 1 1 ...
##  $ CALC                          : chr [1:2111] "no" "Sometimes" "Frequently" "Frequently" ...
##  $ MTRANS                        : chr [1:2111] "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
##  $ NObeyesdad                    : chr [1:2111] "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Gender = col_character(),
##   ..   Age = col_double(),
##   ..   Height = col_double(),
##   ..   Weight = col_double(),
##   ..   family_history_with_overweight = col_character(),
##   ..   FAVC = col_character(),
##   ..   FCVC = col_double(),
##   ..   NCP = col_double(),
##   ..   CAEC = col_character(),
##   ..   SMOKE = col_character(),
##   ..   CH2O = col_double(),
##   ..   SCC = col_character(),
##   ..   FAF = col_double(),
##   ..   TUE = col_double(),
##   ..   CALC = col_character(),
##   ..   MTRANS = col_character(),
##   ..   NObeyesdad = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(data)
##     Gender               Age            Height          Weight      
##  Length:2111        Min.   :14.00   Min.   :1.450   Min.   : 39.00  
##  Class :character   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47  
##  Mode  :character   Median :22.78   Median :1.700   Median : 83.00  
##                     Mean   :24.31   Mean   :1.702   Mean   : 86.59  
##                     3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43  
##                     Max.   :61.00   Max.   :1.980   Max.   :173.00  
##  family_history_with_overweight     FAVC                FCVC      
##  Length:2111                    Length:2111        Min.   :1.000  
##  Class :character               Class :character   1st Qu.:2.000  
##  Mode  :character               Mode  :character   Median :2.386  
##                                                    Mean   :2.419  
##                                                    3rd Qu.:3.000  
##                                                    Max.   :3.000  
##       NCP            CAEC              SMOKE                CH2O      
##  Min.   :1.000   Length:2111        Length:2111        Min.   :1.000  
##  1st Qu.:2.659   Class :character   Class :character   1st Qu.:1.585  
##  Median :3.000   Mode  :character   Mode  :character   Median :2.000  
##  Mean   :2.686                                         Mean   :2.008  
##  3rd Qu.:3.000                                         3rd Qu.:2.477  
##  Max.   :4.000                                         Max.   :3.000  
##      SCC                 FAF              TUE             CALC          
##  Length:2111        Min.   :0.0000   Min.   :0.0000   Length:2111       
##  Class :character   1st Qu.:0.1245   1st Qu.:0.0000   Class :character  
##  Mode  :character   Median :1.0000   Median :0.6253   Mode  :character  
##                     Mean   :1.0103   Mean   :0.6579                     
##                     3rd Qu.:1.6667   3rd Qu.:1.0000                     
##                     Max.   :3.0000   Max.   :2.0000                     
##     MTRANS           NObeyesdad       
##  Length:2111        Length:2111       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

After this first look at the data, and before starting the exploratory analysis, we define each of the acronyms behind the variable names:

- Gender
- Age
- Height
- Weight
- Family history with overweight
- FAVC (Frequent consumption of high-caloric food)
- FCVC (Frequency of consumption of vegetables)
- NCP (Number of main meals)
- CAEC (Consumption of food between meals)
- SMOKE (Smoking habit)
- CH2O (Daily water intake)
- SCC (Calories consumption monitoring)
- FAF (Physical activity frequency)
- TUE (Time using technology devices)
- CALC (Consumption of alcohol)
- MTRANS (Mode of transportation)
- NObeyesdad (Classification of weight status or obesity)

With the variables understood, and having confirmed that there are no missing records, we convert the categorical variables to factors to ease the analysis of the dataset:
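The original report does not actually show this completeness check; a minimal sketch in base R that would confirm the absence of missing values:

# Count NA values per column; all zeros means no missing records
colSums(is.na(data))

# Single overall check: FALSE confirms a complete dataset
any(is.na(data))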

data$Gender=as.factor(data$Gender)
data$family_history_with_overweight=as.factor(data$family_history_with_overweight)
data$FAVC=as.factor(data$FAVC)
data$CAEC=as.factor(data$CAEC)
data$SMOKE=as.factor(data$SMOKE)
data$SCC=as.factor(data$SCC)
data$CALC=as.factor(data$CALC)
data$MTRANS=as.factor(data$MTRANS)
data$NObeyesdad=as.factor(data$NObeyesdad)
str(data)
## spc_tbl_ [2,111 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Gender                        : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
##  $ Age                           : num [1:2111] 21 21 23 27 22 29 23 22 24 22 ...
##  $ Height                        : num [1:2111] 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num [1:2111] 64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history_with_overweight: Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
##  $ FAVC                          : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
##  $ FCVC                          : num [1:2111] 2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : num [1:2111] 3 3 3 3 1 3 3 3 3 3 ...
##  $ CAEC                          : Factor w/ 4 levels "Always","Frequently",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ SMOKE                         : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ CH2O                          : num [1:2111] 2 3 2 2 2 2 2 2 2 2 ...
##  $ SCC                           : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ FAF                           : num [1:2111] 0 3 2 2 0 0 1 3 1 1 ...
##  $ TUE                           : num [1:2111] 1 0 1 0 0 0 0 0 1 1 ...
##  $ CALC                          : Factor w/ 4 levels "Always","Frequently",..: 3 4 2 2 4 4 4 4 2 3 ...
##  $ MTRANS                        : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
##  $ NObeyesdad                    : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Gender = col_character(),
##   ..   Age = col_double(),
##   ..   Height = col_double(),
##   ..   Weight = col_double(),
##   ..   family_history_with_overweight = col_character(),
##   ..   FAVC = col_character(),
##   ..   FCVC = col_double(),
##   ..   NCP = col_double(),
##   ..   CAEC = col_character(),
##   ..   SMOKE = col_character(),
##   ..   CH2O = col_double(),
##   ..   SCC = col_character(),
##   ..   FAF = col_double(),
##   ..   TUE = col_double(),
##   ..   CALC = col_character(),
##   ..   MTRANS = col_character(),
##   ..   NObeyesdad = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
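As a side note, the nine as.factor assignments above could be written in a single pass; a minimal sketch, assuming every remaining character column should become a factor:

# Convert all character columns to factors in one step
data[] <- lapply(data, function(x) if (is.character(x)) as.factor(x) else x)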

Since the dataset contains both weight and height, we can create a new variable that reduces its dimensionality: the body mass index (stored as IMC), which summarizes weight and height in a single value and can replace those two variables:

# IMC (BMI) = weight in kg divided by height in meters squared
data$IMC=data$Weight / (data$Height^2)
print(data)
## # A tibble: 2,111 × 18
##    Gender   Age Height Weight family_history_with_over…¹ FAVC   FCVC   NCP CAEC 
##    <fct>  <dbl>  <dbl>  <dbl> <fct>                      <fct> <dbl> <dbl> <fct>
##  1 Female    21   1.62   64   yes                        no        2     3 Some…
##  2 Female    21   1.52   56   yes                        no        3     3 Some…
##  3 Male      23   1.8    77   yes                        no        2     3 Some…
##  4 Male      27   1.8    87   no                         no        3     3 Some…
##  5 Male      22   1.78   89.8 no                         no        2     1 Some…
##  6 Male      29   1.62   53   no                         yes       2     3 Some…
##  7 Female    23   1.5    55   yes                        yes       3     3 Some…
##  8 Male      22   1.64   53   no                         no        2     3 Some…
##  9 Male      24   1.78   64   yes                        yes       3     3 Some…
## 10 Male      22   1.72   68   yes                        yes       2     3 Some…
## # ℹ 2,101 more rows
## # ℹ abbreviated name: ¹​family_history_with_overweight
## # ℹ 9 more variables: SMOKE <fct>, CH2O <dbl>, SCC <fct>, FAF <dbl>, TUE <dbl>,
## #   CALC <fct>, MTRANS <fct>, NObeyesdad <fct>, IMC <dbl>
str(data)
## spc_tbl_ [2,111 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Gender                        : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
##  $ Age                           : num [1:2111] 21 21 23 27 22 29 23 22 24 22 ...
##  $ Height                        : num [1:2111] 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num [1:2111] 64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history_with_overweight: Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
##  $ FAVC                          : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
##  $ FCVC                          : num [1:2111] 2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : num [1:2111] 3 3 3 3 1 3 3 3 3 3 ...
##  $ CAEC                          : Factor w/ 4 levels "Always","Frequently",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ SMOKE                         : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ CH2O                          : num [1:2111] 2 3 2 2 2 2 2 2 2 2 ...
##  $ SCC                           : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ FAF                           : num [1:2111] 0 3 2 2 0 0 1 3 1 1 ...
##  $ TUE                           : num [1:2111] 1 0 1 0 0 0 0 0 1 1 ...
##  $ CALC                          : Factor w/ 4 levels "Always","Frequently",..: 3 4 2 2 4 4 4 4 2 3 ...
##  $ MTRANS                        : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
##  $ NObeyesdad                    : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 2 ...
##  $ IMC                           : num [1:2111] 24.4 24.2 23.8 26.9 28.3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Gender = col_character(),
##   ..   Age = col_double(),
##   ..   Height = col_double(),
##   ..   Weight = col_double(),
##   ..   family_history_with_overweight = col_character(),
##   ..   FAVC = col_character(),
##   ..   FCVC = col_double(),
##   ..   NCP = col_double(),
##   ..   CAEC = col_character(),
##   ..   SMOKE = col_character(),
##   ..   CH2O = col_double(),
##   ..   SCC = col_character(),
##   ..   FAF = col_double(),
##   ..   TUE = col_double(),
##   ..   CALC = col_character(),
##   ..   MTRANS = col_character(),
##   ..   NObeyesdad = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

We now select the variables to be used in the analysis (all except Height and Weight) and verify the structure of the data frame with “str”:

data2=data[, !names(data) %in% c("Height", "Weight")]
str(data2)
## tibble [2,111 × 16] (S3: tbl_df/tbl/data.frame)
##  $ Gender                        : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
##  $ Age                           : num [1:2111] 21 21 23 27 22 29 23 22 24 22 ...
##  $ family_history_with_overweight: Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
##  $ FAVC                          : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
##  $ FCVC                          : num [1:2111] 2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : num [1:2111] 3 3 3 3 1 3 3 3 3 3 ...
##  $ CAEC                          : Factor w/ 4 levels "Always","Frequently",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ SMOKE                         : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ CH2O                          : num [1:2111] 2 3 2 2 2 2 2 2 2 2 ...
##  $ SCC                           : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ FAF                           : num [1:2111] 0 3 2 2 0 0 1 3 1 1 ...
##  $ TUE                           : num [1:2111] 1 0 1 0 0 0 0 0 1 1 ...
##  $ CALC                          : Factor w/ 4 levels "Always","Frequently",..: 3 4 2 2 4 4 4 4 2 3 ...
##  $ MTRANS                        : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
##  $ NObeyesdad                    : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 2 ...
##  $ IMC                           : num [1:2111] 24.4 24.2 23.8 26.9 28.3 ...

At this point, and before building the predictive models for the SMOKE and SCC variables, we consider it important to apply min-max normalization so that all numeric variables lie within the same range:

# Select the numeric columns and rescale each one to [0, 1]
numeric_vars=data2[, sapply(data2, is.numeric)]
normalize=function(x) {return((x - min(x)) / (max(x) - min(x)))}
normalized_data=as.data.frame(lapply(numeric_vars, normalize))
data2[, names(normalized_data)]=normalized_data
str(data2)
## tibble [2,111 × 16] (S3: tbl_df/tbl/data.frame)
##  $ Gender                        : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
##  $ Age                           : num [1:2111] 0.149 0.149 0.191 0.277 0.17 ...
##  $ family_history_with_overweight: Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
##  $ FAVC                          : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
##  $ FCVC                          : num [1:2111] 0.5 1 0.5 1 0.5 0.5 1 0.5 1 0.5 ...
##  $ NCP                           : num [1:2111] 0.667 0.667 0.667 0.667 0 ...
##  $ CAEC                          : Factor w/ 4 levels "Always","Frequently",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ SMOKE                         : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ CH2O                          : num [1:2111] 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ SCC                           : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ FAF                           : num [1:2111] 0 1 0.667 0.667 0 ...
##  $ TUE                           : num [1:2111] 0.5 0 0.5 0 0 0 0 0 0.5 0.5 ...
##  $ CALC                          : Factor w/ 4 levels "Always","Frequently",..: 3 4 2 2 4 4 4 4 2 3 ...
##  $ MTRANS                        : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
##  $ NObeyesdad                    : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 2 ...
##  $ IMC                           : num [1:2111] 0.301 0.297 0.285 0.366 0.406 ...
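The same rescaling could also be obtained with caret’s preProcess; a minimal sketch, assuming the caret package (loaded later for the data split) is acceptable here:

library(caret)

# method = "range" applies min-max rescaling to every numeric column
pp <- preProcess(as.data.frame(numeric_vars), method = "range")
rescaled <- predict(pp, as.data.frame(numeric_vars))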

We now have a clean dataset with normalized values, ready to begin the predictive analysis.

Predictive analysis: “SMOKE”

We continue by setting a seed with “set.seed” for reproducibility and splitting the data in order to study the binary classification of the “SMOKE” variable. We create a training set (70%) and a test set (30%):

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(123)
index=createDataPartition(data2$SMOKE, p = 0.7, list = FALSE)
train_data=data2[index, ]
test_data=data2[-index, ]
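createDataPartition performs a stratified split, so the class balance of SMOKE should be preserved in both subsets; a quick check, not shown in the original output:

# Class proportions should be nearly identical in both subsets
prop.table(table(train_data$SMOKE))
prop.table(table(test_data$SMOKE))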

Next we fit a logistic regression model on the training set, which we will then evaluate:

model=glm(SMOKE ~ ., data = train_data, family = "binomial")
print(model)
## 
## Call:  glm(formula = SMOKE ~ ., family = "binomial", data = train_data)
## 
## Coefficients:
##                       (Intercept)                         GenderMale  
##                          -25.5040                            -0.6828  
##                               Age  family_history_with_overweightyes  
##                            6.7731                             0.3354  
##                           FAVCyes                               FCVC  
##                           -0.3269                             1.2808  
##                               NCP                     CAECFrequently  
##                            0.3053                            -0.9819  
##                            CAECno                      CAECSometimes  
##                          -17.7126                            -1.7150  
##                              CH2O                             SCCyes  
##                           -0.9705                             0.9114  
##                               FAF                                TUE  
##                            0.5006                             1.4385  
##                    CALCFrequently                             CALCno  
##                           19.0598                            16.4910  
##                     CALCSometimes                         MTRANSBike  
##                           18.3930                           -16.3921  
##                   MTRANSMotorbike        MTRANSPublic_Transportation  
##                          -16.1889                             1.4048  
##                     MTRANSWalking            NObeyesdadNormal_Weight  
##                            1.9983                             1.7432  
##          NObeyesdadObesity_Type_I          NObeyesdadObesity_Type_II  
##                            1.1811                             2.3534  
##        NObeyesdadObesity_Type_III       NObeyesdadOverweight_Level_I  
##                          -16.2783                             0.1574  
##     NObeyesdadOverweight_Level_II                                IMC  
##                            1.2739                             0.3318  
## 
## Degrees of Freedom: 1477 Total (i.e. Null);  1450 Residual
## Null Deviance:       300.9 
## Residual Deviance: 229.2     AIC: 285.2
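The printed output shows only point estimates; the standard errors and p-values needed to judge each predictor could be inspected as sketched below. As a caveat, the very large coefficients (e.g. CALCFrequently or CAECno) may indicate quasi-complete separation in sparse factor levels, so those estimates should be read with caution.

# Coefficient table with standard errors, z-values and p-values
summary(model)$coefficients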

We now apply the “predict” function to the test set and compare the predicted classes with the actual values of SMOKE. A classification threshold of 0.15 (rather than the default 0.5) is used, presumably to compensate for the strong class imbalance:

predictions=predict(model, newdata = test_data, type = "response")
predicted_classes=ifelse(predictions > 0.15, 1, 0)
conf_matrix <- table(test_data$SMOKE, predicted_classes)

rownames(conf_matrix) <- c('no', 'yes')
colnames(conf_matrix) <- c('no', 'yes')
print(conf_matrix)
##      predicted_classes
##        no yes
##   no  611   9
##   yes   9   4

The confusion matrix shows that the model predicted the “no” (negative) class correctly 611 times and misclassified “no” as “yes” on 9 occasions. It correctly predicted the “yes” (positive) class 4 times, while 9 actual “yes” cases were missed.
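The 0.15 cutoff was fixed by hand; a sketch using the pROC package (an assumption, as it is not loaded in the original analysis) shows how the threshold could be chosen more systematically:

library(pROC)

# ROC curve of the predicted probabilities against the true labels
roc_obj <- roc(test_data$SMOKE, predictions)

# Threshold maximizing Youden's J (sensitivity + specificity - 1)
coords(roc_obj, "best", best.method = "youden")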

To evaluate the model we compute the following metrics: accuracy, sensitivity, precision, and F-score:

accuracy=sum(diag(conf_matrix)) / sum(conf_matrix)
sensitivity=conf_matrix["yes", "yes"] / sum(conf_matrix["yes", ])
precision=conf_matrix["yes", "yes"] / sum(conf_matrix[, "yes"])
F_score=(2 * precision * sensitivity) / (precision + sensitivity)
print(paste("Accuracy:", accuracy))
## [1] "Accuracy: 0.971563981042654"
print(paste("Sensitivity:", sensitivity))
## [1] "Sensitivity: 0.307692307692308"
print(paste("Precision:", precision))
## [1] "Precision: 0.307692307692308"
print(paste("F-score:", F_score))
## [1] "F-score: 0.307692307692308"

These metrics suggest that the model has high overall accuracy but relatively poor sensitivity and precision for the positive class. In other words, it classifies the negative samples well but is much less effective at identifying the positive ones (smokers).
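As a cross-check, caret (already loaded) provides confusionMatrix, which computes these metrics directly; a minimal sketch, treating “yes” as the positive class:

# Convert the 0/1 predictions back to factor labels matching SMOKE
pred_factor <- factor(ifelse(predicted_classes == 1, "yes", "no"),
                      levels = levels(test_data$SMOKE))
confusionMatrix(pred_factor, test_data$SMOKE, positive = "yes")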

Predictive analysis: “SCC”

Next we will train a new binary classification model, this time for the SCC variable, which indicates whether calorie intake is monitored. We again create a training set and a test set with which to evaluate the model.

library(caTools)
split2=sample.split(data2$SCC, SplitRatio = 0.7)
# Filter rows using the logical vector returned by sample.split
train_data2=subset(data2, split2 == TRUE)
test_data2=subset(data2, split2 == FALSE)

With the data prepared for the binary classification study, we fit a decision tree to classify the SCC variable:

library(rpart)
tree_model=rpart(SCC ~ ., data = train_data2, method = "class")
print(tree_model)
## n= 2111 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 2111 96 no (0.95452392 0.04547608)  
##     2) IMC>=0.3347113 1471 19 no (0.98708362 0.01291638) *
##     3) IMC< 0.3347113 640 77 no (0.87968750 0.12031250)  
##       6) NObeyesdad=Insufficient_Weight,Normal_Weight 559 52 no (0.90697674 0.09302326)  
##        12) FCVC< 0.9467742 339 11 no (0.96755162 0.03244838) *
##        13) FCVC>=0.9467742 220 41 no (0.81363636 0.18636364)  
##          26) CH2O< 0.9162967 182 24 no (0.86813187 0.13186813)  
##            52) Age>=0.1318942 94  4 no (0.95744681 0.04255319) *
##            53) Age< 0.1318942 88 20 no (0.77272727 0.22727273)  
##             106) FAVC=yes 68  9 no (0.86764706 0.13235294) *
##             107) FAVC=no 20  9 yes (0.45000000 0.55000000)  
##               214) CAEC=Always,Sometimes 11  3 no (0.72727273 0.27272727) *
##               215) CAEC=Frequently 9  1 yes (0.11111111 0.88888889) *
##          27) CH2O>=0.9162967 38 17 no (0.55263158 0.44736842)  
##            54) IMC>=0.1236712 29 10 no (0.65517241 0.34482759)  
##             108) CAEC=Frequently,Sometimes 21  5 no (0.76190476 0.23809524) *
##             109) CAEC=Always,no 8  3 yes (0.37500000 0.62500000) *
##            55) IMC< 0.1236712 9  2 yes (0.22222222 0.77777778) *
##       7) NObeyesdad=Overweight_Level_I 81 25 no (0.69135802 0.30864198)  
##        14) Age>=0.0850289 64  9 no (0.85937500 0.14062500)  
##          28) FAVC=yes 57  4 no (0.92982456 0.07017544) *
##          29) FAVC=no 7  2 yes (0.28571429 0.71428571) *
##        15) Age< 0.0850289 17  1 yes (0.05882353 0.94117647) *

We then apply the “predict” function to the test set and build a confusion matrix comparing the predictions with the actual values:

tree_predictions=predict(tree_model, test_data2, type = "class")
tree_conf_matrix <- table(test_data2$SCC, tree_predictions)

rownames(tree_conf_matrix) <- c('no', 'yes')
colnames(tree_conf_matrix) <- c('no', 'yes')
print(tree_conf_matrix)
##      tree_predictions
##         no  yes
##   no  2006    9
##   yes   55   41

The model appears to struggle to classify the “yes” class correctly: it produces a notable number of false negatives and relatively few true positives. We again evaluate the resulting metrics.

calculate_accuracy <- function(tree_conf_matrix) {
  sum(diag(tree_conf_matrix)) / sum(tree_conf_matrix)
}

calculate_precision_positive <- function(tree_conf_matrix) {
  tree_conf_matrix[2, 2] / sum(tree_conf_matrix[, 2])
}

calculate_recall <- function(tree_conf_matrix) {
  tree_conf_matrix[2, 2] / sum(tree_conf_matrix[2, ])
}

calculate_f1_score <- function(tree_conf_matrix) {
  precision <- calculate_precision_positive(tree_conf_matrix)
  recall <- calculate_recall(tree_conf_matrix)
  2 * ((precision * recall) / (precision + recall))
}

accuracy2 <- calculate_accuracy(tree_conf_matrix)
precision_positive2 <- calculate_precision_positive(tree_conf_matrix)
recall2 <- calculate_recall(tree_conf_matrix)
f1_score2 <- calculate_f1_score(tree_conf_matrix)
print(accuracy2)
## [1] 0.9696826
print(precision_positive2)
## [1] 0.82
print(recall2)
## [1] 0.4270833
print(f1_score2)
## [1] 0.5616438

These metrics indicate high overall accuracy and reasonable precision (0.82), but low recall (0.43) for the positive class: the model classifies negative samples well but misses more than half of the positive ones.
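One common remedy for this imbalance, sketched here as a suggestion rather than part of the original analysis, is to penalize false negatives more heavily through rpart’s loss matrix:

# Loss matrix: classifying an actual "yes" as "no" costs 5 times more
weighted_tree <- rpart(SCC ~ ., data = train_data2, method = "class",
                       parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)))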

We now apply yet another algorithm, this time an SVM:

library(e1071)
svm_model <- svm(SCC ~ ., data = train_data2, kernel = "radial")
print(svm_model)
## 
## Call:
## svm(formula = SCC ~ ., data = train_data2, kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  254

We continue with the predictions for this Support Vector Machine model:

svm_predictions=predict(svm_model, newdata = test_data2)
svm_conf_matrix=table(test_data2$SCC, svm_predictions)
# Reuse the metric helper functions defined above on the SVM confusion matrix
accuracy3<- calculate_accuracy(svm_conf_matrix)
precision_positive3 <- calculate_precision_positive(svm_conf_matrix)
recall3 <- calculate_recall(svm_conf_matrix)
f1_score3 <- calculate_f1_score(svm_conf_matrix)
print(accuracy3)
## [1] 0.9545239
print(precision_positive3)
## [1] NaN
print(recall3)
## [1] 0
print(f1_score3)
## [1] NaN

This analysis shows that the SVM reaches an accuracy of about 0.95, actually slightly below the decision tree, and the problems with precision and F1-score become worse: the NaN values arise because the model never predicts the positive class, so precision is undefined and recall is 0.
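A sketch of how the SVM hyperparameters could be tuned with e1071’s tune function, adding class weights to counter the imbalance (an extension, not part of the original analysis):

# Grid search over cost and gamma with cross-validation;
# class.weights penalizes errors on the rare "yes" class more heavily
tuned <- tune(svm, SCC ~ ., data = train_data2,
              kernel = "radial",
              class.weights = c(no = 1, yes = 10),
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.1, 1, 10)))

best_svm <- tuned$best.model
tuned_predictions <- predict(best_svm, newdata = test_data2)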