library(readr)
We will work with the dataset proposed in the activity, which comes from a study that aims to estimate obesity levels in the population of Colombia, Peru and Mexico as a function of their lifestyle and eating habits.
To get a first look at the variables in the dataset (Obesity.csv), we load it and apply the "dim" and "head" functions:
data <- read_csv("Obesity.csv")
## Rows: 2111 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Gender, family_history_with_overweight, FAVC, CAEC, SMOKE, SCC, CAL...
## dbl (8): Age, Height, Weight, FCVC, NCP, CH2O, FAF, TUE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
## # A tibble: 6 × 17
## Gender Age Height Weight family_history_with_overw…¹ FAVC FCVC NCP CAEC
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 Female 21 1.62 64 yes no 2 3 Some…
## 2 Female 21 1.52 56 yes no 3 3 Some…
## 3 Male 23 1.8 77 yes no 2 3 Some…
## 4 Male 27 1.8 87 no no 3 3 Some…
## 5 Male 22 1.78 89.8 no no 2 1 Some…
## 6 Male 29 1.62 53 no yes 2 3 Some…
## # ℹ abbreviated name: ¹family_history_with_overweight
## # ℹ 8 more variables: SMOKE <chr>, CH2O <dbl>, SCC <chr>, FAF <dbl>, TUE <dbl>,
## # CALC <chr>, MTRANS <chr>, NObeyesdad <chr>
dim(data)
## [1] 2111 17
We can see that the dataset has 17 variables and 2,111 observations. To examine their nature we apply the "str" and "summary" functions, which report the type of each variable as well as the main summary statistics of the numeric ones:
str(data)
## spc_tbl_ [2,111 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Gender : chr [1:2111] "Female" "Female" "Male" "Male" ...
## $ Age : num [1:2111] 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num [1:2111] 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num [1:2111] 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: chr [1:2111] "yes" "yes" "yes" "no" ...
## $ FAVC : chr [1:2111] "no" "no" "no" "no" ...
## $ FCVC : num [1:2111] 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP : num [1:2111] 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC : chr [1:2111] "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
## $ SMOKE : chr [1:2111] "no" "yes" "no" "no" ...
## $ CH2O : num [1:2111] 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC : chr [1:2111] "no" "yes" "no" "no" ...
## $ FAF : num [1:2111] 0 3 2 2 0 0 1 3 1 1 ...
## $ TUE : num [1:2111] 1 0 1 0 0 0 0 0 1 1 ...
## $ CALC : chr [1:2111] "no" "Sometimes" "Frequently" "Frequently" ...
## $ MTRANS : chr [1:2111] "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
## $ NObeyesdad : chr [1:2111] "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
## - attr(*, "spec")=
## .. cols(
## .. Gender = col_character(),
## .. Age = col_double(),
## .. Height = col_double(),
## .. Weight = col_double(),
## .. family_history_with_overweight = col_character(),
## .. FAVC = col_character(),
## .. FCVC = col_double(),
## .. NCP = col_double(),
## .. CAEC = col_character(),
## .. SMOKE = col_character(),
## .. CH2O = col_double(),
## .. SCC = col_character(),
## .. FAF = col_double(),
## .. TUE = col_double(),
## .. CALC = col_character(),
## .. MTRANS = col_character(),
## .. NObeyesdad = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(data)
## Gender Age Height Weight
## Length:2111 Min. :14.00 Min. :1.450 Min. : 39.00
## Class :character 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47
## Mode :character Median :22.78 Median :1.700 Median : 83.00
## Mean :24.31 Mean :1.702 Mean : 86.59
## 3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43
## Max. :61.00 Max. :1.980 Max. :173.00
## family_history_with_overweight FAVC FCVC
## Length:2111 Length:2111 Min. :1.000
## Class :character Class :character 1st Qu.:2.000
## Mode :character Mode :character Median :2.386
## Mean :2.419
## 3rd Qu.:3.000
## Max. :3.000
## NCP CAEC SMOKE CH2O
## Min. :1.000 Length:2111 Length:2111 Min. :1.000
## 1st Qu.:2.659 Class :character Class :character 1st Qu.:1.585
## Median :3.000 Mode :character Mode :character Median :2.000
## Mean :2.686 Mean :2.008
## 3rd Qu.:3.000 3rd Qu.:2.477
## Max. :4.000 Max. :3.000
## SCC FAF TUE CALC
## Length:2111 Min. :0.0000 Min. :0.0000 Length:2111
## Class :character 1st Qu.:0.1245 1st Qu.:0.0000 Class :character
## Mode :character Median :1.0000 Median :0.6253 Mode :character
## Mean :1.0103 Mean :0.6579
## 3rd Qu.:1.6667 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.0000
## MTRANS NObeyesdad
## Length:2111 Length:2111
## Class :character Class :character
## Mode :character Mode :character
##
##
##
After this overview of the data, and in order to begin the exploratory analysis of our dataset, we define the meaning of each variable's abbreviation:
- Gender
- Age
- Height
- Weight
- Family history with overweight
- FAVC (Frequent consumption of high-caloric food)
- FCVC (Frequency of consumption of vegetables)
- NCP (Number of main meals)
- CAEC (Consumption of food between meals)
- SMOKE (Smoking habit)
- CH2O (Daily water intake)
- SCC (Calories consumption monitoring)
- FAF (Physical activity frequency)
- TUE (Time using technology devices)
- CALC (Consumption of alcohol)
- MTRANS (Mode of transportation)
- NObeyesdad (Classification of weight status or obesity level)
Having understood the variables, and having seen that there are no missing values, we convert the categorical variables to factors to ease the analysis of our dataset:
cat_vars <- c("Gender", "family_history_with_overweight", "FAVC", "CAEC",
              "SMOKE", "SCC", "CALC", "MTRANS", "NObeyesdad")
data[cat_vars] <- lapply(data[cat_vars], as.factor)
str(data)
## spc_tbl_ [2,111 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
## $ Age : num [1:2111] 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num [1:2111] 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num [1:2111] 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
## $ FAVC : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
## $ FCVC : num [1:2111] 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP : num [1:2111] 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC : Factor w/ 4 levels "Always","Frequently",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ SMOKE : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ CH2O : num [1:2111] 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ FAF : num [1:2111] 0 3 2 2 0 0 1 3 1 1 ...
## $ TUE : num [1:2111] 1 0 1 0 0 0 0 0 1 1 ...
## $ CALC : Factor w/ 4 levels "Always","Frequently",..: 3 4 2 2 4 4 4 4 2 3 ...
## $ MTRANS : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
## $ NObeyesdad : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 2 ...
## - attr(*, "spec")=
## .. cols(
## .. Gender = col_character(),
## .. Age = col_double(),
## .. Height = col_double(),
## .. Weight = col_double(),
## .. family_history_with_overweight = col_character(),
## .. FAVC = col_character(),
## .. FCVC = col_double(),
## .. NCP = col_double(),
## .. CAEC = col_character(),
## .. SMOKE = col_character(),
## .. CH2O = col_double(),
## .. SCC = col_character(),
## .. FAF = col_double(),
## .. TUE = col_double(),
## .. CALC = col_character(),
## .. MTRANS = col_character(),
## .. NObeyesdad = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Our dataset contains both weight and height, which suggests a way to reduce its dimensionality: we can create the BMI (IMC), a single variable that relates weight and height, and use it in place of those two columns:
data$IMC <- data$Weight / (data$Height^2)  # BMI = weight (kg) / height (m)^2
print(data)
## # A tibble: 2,111 × 18
## Gender Age Height Weight family_history_with_over…¹ FAVC FCVC NCP CAEC
## <fct> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <fct>
## 1 Female 21 1.62 64 yes no 2 3 Some…
## 2 Female 21 1.52 56 yes no 3 3 Some…
## 3 Male 23 1.8 77 yes no 2 3 Some…
## 4 Male 27 1.8 87 no no 3 3 Some…
## 5 Male 22 1.78 89.8 no no 2 1 Some…
## 6 Male 29 1.62 53 no yes 2 3 Some…
## 7 Female 23 1.5 55 yes yes 3 3 Some…
## 8 Male 22 1.64 53 no no 2 3 Some…
## 9 Male 24 1.78 64 yes yes 3 3 Some…
## 10 Male 22 1.72 68 yes yes 2 3 Some…
## # ℹ 2,101 more rows
## # ℹ abbreviated name: ¹family_history_with_overweight
## # ℹ 9 more variables: SMOKE <fct>, CH2O <dbl>, SCC <fct>, FAF <dbl>, TUE <dbl>,
## # CALC <fct>, MTRANS <fct>, NObeyesdad <fct>, IMC <dbl>
str(data)
## spc_tbl_ [2,111 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
## $ Age : num [1:2111] 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num [1:2111] 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num [1:2111] 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
## $ FAVC : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
## $ FCVC : num [1:2111] 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP : num [1:2111] 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC : Factor w/ 4 levels "Always","Frequently",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ SMOKE : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ CH2O : num [1:2111] 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ FAF : num [1:2111] 0 3 2 2 0 0 1 3 1 1 ...
## $ TUE : num [1:2111] 1 0 1 0 0 0 0 0 1 1 ...
## $ CALC : Factor w/ 4 levels "Always","Frequently",..: 3 4 2 2 4 4 4 4 2 3 ...
## $ MTRANS : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
## $ NObeyesdad : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 2 ...
## $ IMC : num [1:2111] 24.4 24.2 23.8 26.9 28.3 ...
## - attr(*, "spec")=
## .. cols(
## .. Gender = col_character(),
## .. Age = col_double(),
## .. Height = col_double(),
## .. Weight = col_double(),
## .. family_history_with_overweight = col_character(),
## .. FAVC = col_character(),
## .. FCVC = col_double(),
## .. NCP = col_double(),
## .. CAEC = col_character(),
## .. SMOKE = col_character(),
## .. CH2O = col_double(),
## .. SCC = col_character(),
## .. FAF = col_double(),
## .. TUE = col_double(),
## .. CALC = col_character(),
## .. MTRANS = col_character(),
## .. NObeyesdad = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
We now select the variables we want to keep for the analysis (all except Height and Weight) and verify the structure of the data frame with the "str" function:
data2=data[, !names(data) %in% c("Height", "Weight")]
str(data2)
## tibble [2,111 × 16] (S3: tbl_df/tbl/data.frame)
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
## $ Age : num [1:2111] 21 21 23 27 22 29 23 22 24 22 ...
## $ family_history_with_overweight: Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
## $ FAVC : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
## $ FCVC : num [1:2111] 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP : num [1:2111] 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC : Factor w/ 4 levels "Always","Frequently",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ SMOKE : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ CH2O : num [1:2111] 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ FAF : num [1:2111] 0 3 2 2 0 0 1 3 1 1 ...
## $ TUE : num [1:2111] 1 0 1 0 0 0 0 0 1 1 ...
## $ CALC : Factor w/ 4 levels "Always","Frequently",..: 3 4 2 2 4 4 4 4 2 3 ...
## $ MTRANS : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
## $ NObeyesdad : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 2 ...
## $ IMC : num [1:2111] 24.4 24.2 23.8 26.9 28.3 ...
At this point, and before building the predictive models for the SMOKE and SCC variables, we consider it important to apply min-max normalization so that all numeric variables lie in the [0, 1] range.
numeric_vars=data2[, sapply(data2, is.numeric)]
normalize=function(x) {return((x - min(x)) / (max(x) - min(x)))}
normalized_data=as.data.frame(lapply(numeric_vars, normalize))
data2[, names(normalized_data)]=normalized_data
str(data2)
## tibble [2,111 × 16] (S3: tbl_df/tbl/data.frame)
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
## $ Age : num [1:2111] 0.149 0.149 0.191 0.277 0.17 ...
## $ family_history_with_overweight: Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
## $ FAVC : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
## $ FCVC : num [1:2111] 0.5 1 0.5 1 0.5 0.5 1 0.5 1 0.5 ...
## $ NCP : num [1:2111] 0.667 0.667 0.667 0.667 0 ...
## $ CAEC : Factor w/ 4 levels "Always","Frequently",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ SMOKE : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ CH2O : num [1:2111] 0.5 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ SCC : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ FAF : num [1:2111] 0 1 0.667 0.667 0 ...
## $ TUE : num [1:2111] 0.5 0 0.5 0 0 0 0 0 0.5 0.5 ...
## $ CALC : Factor w/ 4 levels "Always","Frequently",..: 3 4 2 2 4 4 4 4 2 3 ...
## $ MTRANS : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
## $ NObeyesdad : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 2 ...
## $ IMC : num [1:2111] 0.301 0.297 0.285 0.366 0.406 ...
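One caveat worth noting: the min-max ranges above were computed on the full dataset, before any train/test split, which leaks information from the future test rows into the scaling. A leakage-safe sketch, using a hypothetical helper, would fit the ranges on the training rows only and reuse them for the test rows:
# hypothetical helper: fit min/max on a reference (training) data frame and
# apply exactly the same scaling to new (test) data
minmax_apply <- function(newdata, ref) {
  rng <- lapply(ref, range)
  as.data.frame(Map(function(x, r) (x - r[1]) / (r[2] - r[1]), newdata, rng))
}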
We now have a clean dataset with normalized values, ready to begin the predictive analysis.
We continue by setting a seed with "set.seed" so the split is reproducible, and divide the data to study binary classification of the "SMOKE" variable, creating a training set (70%) and a test set (30%):
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(123)
index=createDataPartition(data2$SMOKE, p = 0.7, list = FALSE)
train_data=data2[index, ]
test_data=data2[-index, ]
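Since smokers are rare in this sample, it is worth checking, although this output was not part of the original analysis, that the stratified split preserves the class balance:
# class proportions in each split (createDataPartition stratifies on SMOKE)
prop.table(table(train_data$SMOKE))
prop.table(table(test_data$SMOKE))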
We then fit a logistic regression model on the training set, whose effectiveness we will evaluate afterwards:
model=glm(SMOKE ~ ., data = train_data, family = "binomial")
print(model)
##
## Call: glm(formula = SMOKE ~ ., family = "binomial", data = train_data)
##
## Coefficients:
## (Intercept) GenderMale
## -25.5040 -0.6828
## Age family_history_with_overweightyes
## 6.7731 0.3354
## FAVCyes FCVC
## -0.3269 1.2808
## NCP CAECFrequently
## 0.3053 -0.9819
## CAECno CAECSometimes
## -17.7126 -1.7150
## CH2O SCCyes
## -0.9705 0.9114
## FAF TUE
## 0.5006 1.4385
## CALCFrequently CALCno
## 19.0598 16.4910
## CALCSometimes MTRANSBike
## 18.3930 -16.3921
## MTRANSMotorbike MTRANSPublic_Transportation
## -16.1889 1.4048
## MTRANSWalking NObeyesdadNormal_Weight
## 1.9983 1.7432
## NObeyesdadObesity_Type_I NObeyesdadObesity_Type_II
## 1.1811 2.3534
## NObeyesdadObesity_Type_III NObeyesdadOverweight_Level_I
## -16.2783 0.1574
## NObeyesdadOverweight_Level_II IMC
## 1.2739 0.3318
##
## Degrees of Freedom: 1477 Total (i.e. Null); 1450 Residual
## Null Deviance: 300.9
## Residual Deviance: 229.2 AIC: 285.2
We now apply the "predict" function to the test set and compare the predicted classes with the actual values of the SMOKE variable:
predictions <- predict(model, newdata = test_data, type = "response")
# classify as a smoker above 0.15 rather than the default 0.5, given how rare "yes" is
predicted_classes <- ifelse(predictions > 0.15, 1, 0)
conf_matrix <- table(test_data$SMOKE, predicted_classes)
rownames(conf_matrix) <- c('no', 'yes')
colnames(conf_matrix) <- c('no', 'yes')  # relabel the 0/1 predictions
print(conf_matrix)
## predicted_classes
## no yes
## no 611 9
## yes 9 4
The confusion matrix shows that the model predicted the "no" (negative) class correctly 611 times, misclassified "no" as "yes" on 9 occasions, predicted the "yes" (positive) class correctly 4 times, and missed 9 actual "yes" cases.
To evaluate the model, we compute the following metrics: accuracy, sensitivity, precision and F-score:
accuracy=sum(diag(conf_matrix)) / sum(conf_matrix)
sensitivity=conf_matrix["yes", "yes"] / sum(conf_matrix["yes", ])
precision=conf_matrix["yes", "yes"] / sum(conf_matrix[, "yes"])
F_score=(2 * precision * sensitivity) / (precision + sensitivity)
print(paste("Accuracy:", accuracy))
## [1] "Accuracy: 0.971563981042654"
print(paste("Sensitivity:", sensitivity))
## [1] "Sensitivity: 0.307692307692308"
print(paste("Precision:", precision))
## [1] "Precision: 0.307692307692308"
print(paste("F-score:", F_score))
## [1] "F-score: 0.307692307692308"
These metrics suggest the model has a high overall accuracy but relatively poor sensitivity and precision for the positive class. In other words, it classifies negative samples well but is not as effective at identifying the positives (smokers).
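The 0.15 cut-off was chosen by hand; a possible refinement, sketched here under the assumption that the pROC package is installed, is to pick the threshold from the ROC curve:
library(pROC)
roc_obj <- roc(test_data$SMOKE, predictions)     # actual classes vs predicted probabilities
auc(roc_obj)                                     # overall discriminative ability of the model
coords(roc_obj, "best", best.method = "youden")  # threshold balancing sensitivity and specificity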
We will now specify and train a new binary classification model, this time for the SCC variable, which indicates whether calorie intake is monitored. Once again we create a training set and a test set so we can evaluate the model.
library(caTools)
set.seed(123)  # make the split reproducible
split2 <- sample.split(data2$SCC, SplitRatio = 0.7)
train_data2 <- subset(data2, split2 == TRUE)   # 70% of rows, stratified on SCC
test_data2 <- subset(data2, split2 == FALSE)   # remaining 30%
With the dataset ready, we apply a decision-tree algorithm to classify the SCC variable:
library(rpart)
tree_model=rpart(SCC ~ ., data = train_data2, method = "class")
print(tree_model)
## n= 2111
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 2111 96 no (0.95452392 0.04547608)
## 2) IMC>=0.3347113 1471 19 no (0.98708362 0.01291638) *
## 3) IMC< 0.3347113 640 77 no (0.87968750 0.12031250)
## 6) NObeyesdad=Insufficient_Weight,Normal_Weight 559 52 no (0.90697674 0.09302326)
## 12) FCVC< 0.9467742 339 11 no (0.96755162 0.03244838) *
## 13) FCVC>=0.9467742 220 41 no (0.81363636 0.18636364)
## 26) CH2O< 0.9162967 182 24 no (0.86813187 0.13186813)
## 52) Age>=0.1318942 94 4 no (0.95744681 0.04255319) *
## 53) Age< 0.1318942 88 20 no (0.77272727 0.22727273)
## 106) FAVC=yes 68 9 no (0.86764706 0.13235294) *
## 107) FAVC=no 20 9 yes (0.45000000 0.55000000)
## 214) CAEC=Always,Sometimes 11 3 no (0.72727273 0.27272727) *
## 215) CAEC=Frequently 9 1 yes (0.11111111 0.88888889) *
## 27) CH2O>=0.9162967 38 17 no (0.55263158 0.44736842)
## 54) IMC>=0.1236712 29 10 no (0.65517241 0.34482759)
## 108) CAEC=Frequently,Sometimes 21 5 no (0.76190476 0.23809524) *
## 109) CAEC=Always,no 8 3 yes (0.37500000 0.62500000) *
## 55) IMC< 0.1236712 9 2 yes (0.22222222 0.77777778) *
## 7) NObeyesdad=Overweight_Level_I 81 25 no (0.69135802 0.30864198)
## 14) Age>=0.0850289 64 9 no (0.85937500 0.14062500)
## 28) FAVC=yes 57 4 no (0.92982456 0.07017544) *
## 29) FAVC=no 7 2 yes (0.28571429 0.71428571) *
## 15) Age< 0.0850289 17 1 yes (0.05882353 0.94117647) *
We next apply the "predict" function to the test set and build a confusion matrix that compares the predictions with the actual values:
tree_predictions <- predict(tree_model, test_data2, type = "class")
# predict(type = "class") already returns a factor with levels "no"/"yes"
tree_conf_matrix <- table(test_data2$SCC, tree_predictions)
print(tree_conf_matrix)
## tree_predictions
## no yes
## no 2006 9
## yes 55 41
The model appears to struggle with the "yes" class: it produces a notable number of false negatives (55) and comparatively few true positives (41). We also evaluate the resulting metrics:
calculate_accuracy <- function(tree_conf_matrix) {
sum(diag(tree_conf_matrix)) / sum(tree_conf_matrix)
}
calculate_precision_positive <- function(tree_conf_matrix) {
tree_conf_matrix[2, 2] / sum(tree_conf_matrix[, 2])
}
calculate_recall <- function(tree_conf_matrix) {
tree_conf_matrix[2, 2] / sum(tree_conf_matrix[2, ])
}
calculate_f1_score <- function(tree_conf_matrix) {
precision <- calculate_precision_positive(tree_conf_matrix)
recall <- calculate_recall(tree_conf_matrix)
2 * ((precision * recall) / (precision + recall))
}
accuracy2 <- calculate_accuracy(tree_conf_matrix)
precision_positive2 <- calculate_precision_positive(tree_conf_matrix)
recall2 <- calculate_recall(tree_conf_matrix)
f1_score2 <- calculate_f1_score(tree_conf_matrix)
print(accuracy2)
## [1] 0.9696826
print(precision_positive2)
## [1] 0.82
print(recall2)
## [1] 0.4270833
print(f1_score2)
## [1] 0.5616438
These metrics indicate a high overall accuracy and a reasonable precision (0.82) for the positive class, but a low recall (0.43): the model classifies negative samples well while still missing more than half of the positives.
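One hypothetical way to trade some of that accuracy for recall, not explored further here, is to make false negatives more expensive through rpart's loss matrix:
# loss matrix: rows = actual class, columns = predicted class; an actual "yes"
# predicted as "no" (false negative) costs 5 times as much as a false positive
weighted_tree <- rpart(SCC ~ ., data = train_data2, method = "class",
                       parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)))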
We now apply yet another algorithm, this time an SVM:
library(e1071)
svm_model <- svm(SCC ~ ., data = train_data2, kernel = "radial")
print(svm_model)
##
## Call:
## svm(formula = SCC ~ ., data = train_data2, kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 254
We continue with the predictions for this Support Vector Machine model:
# predict() ignores tuning arguments such as "kernel" and "ranges"; hyperparameter
# search belongs in e1071::tune() (see the sketch at the end of this section)
svm_predictions <- predict(svm_model, newdata = test_data2)
svm_conf_matrix=table(test_data2$SCC, svm_predictions)
calculate_accuracy <- function(svm_conf_matrix) {
sum(diag(svm_conf_matrix)) / sum(svm_conf_matrix)
}
calculate_precision_positive <- function(svm_conf_matrix) {
svm_conf_matrix[2, 2] / sum(svm_conf_matrix[, 2])
}
calculate_recall <- function(svm_conf_matrix) {
svm_conf_matrix[2, 2] / sum(svm_conf_matrix[2, ])
}
calculate_f1_score <- function(svm_conf_matrix) {
precision <- calculate_precision_positive(svm_conf_matrix)
recall <- calculate_recall(svm_conf_matrix)
2 * ((precision * recall) / (precision + recall))
}
accuracy3<- calculate_accuracy(svm_conf_matrix)
precision_positive3 <- calculate_precision_positive(svm_conf_matrix)
recall3 <- calculate_recall(svm_conf_matrix)
f1_score3 <- calculate_f1_score(svm_conf_matrix)
print(accuracy3)
## [1] 0.9545239
print(precision_positive3)
## [1] NaN
print(recall3)
## [1] 0
print(f1_score3)
## [1] NaN
After this analysis we observe that the SVM's accuracy (0.954) is no real improvement: it exactly matches the share of the majority "no" class, because the model never predicts "yes" (recall of 0, with precision and F1-score undefined). The problems with the positive-class metrics therefore persist.
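Two hedged follow-ups, shown here only as a sketch and not actually run for this report, could address this: a grid search over the SVM hyperparameters with e1071::tune(), and class weights to counter the imbalance of SCC:
# grid search over cost and gamma (can be slow on 2,111 rows)
tuned <- tune(svm, SCC ~ ., data = train_data2,
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.1, 1, 10)))
summary(tuned)
# up-weight the rare "yes" class so the model stops predicting only "no"
weighted_svm <- svm(SCC ~ ., data = train_data2, kernel = "radial",
                    class.weights = c(no = 1, yes = 10))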