Objetivo

Descripción

Proceso

1. Cargar librerías

library(ggplot2)
library(dplyr)
library(knitr)
library(caret)
library(readr)
library(knitr)
library(e1071)

2. Cargar datos

datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/FundamentosMachineLearning/master/datos/adultos_clean.csv", encoding = "UTF-8")

kable(head(datos, 10), caption = "Los primeros 10 registros de datos")
Los primeros 10 registros de datos
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
1 25 Private Dropout 7 Not_married Black Male 40 <=50K 0.1095890 0.4000000 0.3979592 0
2 38 Private HighGrad 9 Married White Male 50 <=50K 0.2876712 0.5333333 0.5000000 0
3 28 Local-gov Community 12 Married White Male 40 >50K 0.1506849 0.7333333 0.3979592 1
4 44 Private Community 10 Married Black Male 40 >50K 0.3698630 0.6000000 0.3979592 1
5 18 ? Community 10 Not_married White Female 30 <=50K 0.0136986 0.6000000 0.2959184 0
6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0
7 29 ? HighGrad 9 Not_married Black Male 40 <=50K 0.1643836 0.5333333 0.3979592 0
8 63 Self-emp-not-inc Master 15 Married White Male 32 >50K 0.6301370 0.9333333 0.3163265 1
9 24 Private Community 10 Not_married White Female 40 <=50K 0.0958904 0.6000000 0.3979592 0
10 55 Private Dropout 4 Married White Male 10 <=50K 0.5205479 0.2000000 0.0918367 0
kable(tail(datos, 10), caption = "Los últimos 10 registros de datos")
Los últimos 10 registros de datos
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
48833 48833 32 Private Dropout 6 Married Amer-Indian-Eskimo Male 40 <=50K 0.2054795 0.3333333 0.3979592 0
48834 48834 43 Private Community 11 Married White Male 45 <=50K 0.3561644 0.6666667 0.4489796 0
48835 48835 32 Private Master 14 Not_married Asian-Pac-Islander Male 11 <=50K 0.2054795 0.8666667 0.1020408 0
48836 48836 53 Private Master 14 Married White Male 40 >50K 0.4931507 0.8666667 0.3979592 1
48837 48837 22 Private Community 10 Not_married White Male 40 <=50K 0.0684932 0.6000000 0.3979592 0
48838 48838 27 Private Community 12 Married White Female 38 <=50K 0.1369863 0.7333333 0.3775510 0
48839 48839 40 Private HighGrad 9 Married White Male 40 >50K 0.3150685 0.5333333 0.3979592 1
48840 48840 58 Private HighGrad 9 Widow White Female 40 <=50K 0.5616438 0.5333333 0.3979592 0
48841 48841 22 Private HighGrad 9 Not_married White Male 20 <=50K 0.0684932 0.5333333 0.1938776 0
48842 48842 52 Self-emp-inc HighGrad 9 Married White Female 40 >50K 0.4794521 0.5333333 0.3979592 1

3. Identificar variables

  • Convertir a factor la variable dependiente income10
datos$income10 <- factor(datos$income10)

str(datos)
## 'data.frame':    48842 obs. of  14 variables:
##  $ X                    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age                  : int  25 38 28 44 18 34 29 63 24 55 ...
##  $ workclass            : chr  "Private" "Private" "Local-gov" "Private" ...
##  $ education            : chr  "Dropout" "HighGrad" "Community" "Community" ...
##  $ educational.num      : int  7 9 12 10 10 6 9 15 10 4 ...
##  $ marital.status       : chr  "Not_married" "Married" "Married" "Married" ...
##  $ race                 : chr  "Black" "White" "White" "Black" ...
##  $ gender               : chr  "Male" "Male" "Male" "Male" ...
##  $ hours.per.week       : int  40 50 40 40 30 30 40 32 40 10 ...
##  $ income               : chr  "<=50K" "<=50K" ">50K" ">50K" ...
##  $ age.scale            : num  0.1096 0.2877 0.1507 0.3699 0.0137 ...
##  $ educational.num.scale: num  0.4 0.533 0.733 0.6 0.6 ...
##  $ hours.per.week.scale : num  0.398 0.5 0.398 0.398 0.296 ...
##  $ income10             : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1 ...
  • age la edad de la persona
  • workclass es un tipo o clase de trabajo de la persona, privado, gobierno, por su cuenta,
  • education indica el nivel educativo de la persona
  • educational es el valor numérico de education
  • marital.status es su estado civil
  • race es el tipo de raza de persona
  • gender es el género de la persona
  • hours.per.week son las horas que trbaja por semana
  • income son los ingresos
  • age_scale la edad escalada
  • hours.per.week escalada
  • income10 con valores de 0 gana menos de 50 mil y 1 mas de 50 mil

4. Crear datos de entrenamiento y validación

  • 70 % datos de entrenamiento
  • 30 % datos de validación
set.seed(2020)
entrena <- createDataPartition(y = datos$income10, p = 0.7, list = FALSE, times = 1)

# Datos entrenamiento
datos.entrenamiento <- datos[entrena, ]  # [renglones, columna]

# Datos validación
datos.validacion <- datos[-entrena, ]

kable(head(datos.entrenamiento, 10), caption = "Datos de entrenamiento  (primeros diez)", row.names = 1:nrow(datos.entrenamiento))
Datos de entrenamiento (primeros diez)
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
1 1 25 Private Dropout 7 Not_married Black Male 40 <=50K 0.1095890 0.4000000 0.3979592 0
2 2 38 Private HighGrad 9 Married White Male 50 <=50K 0.2876712 0.5333333 0.5000000 0
3 3 28 Local-gov Community 12 Married White Male 40 >50K 0.1506849 0.7333333 0.3979592 1
4 4 44 Private Community 10 Married Black Male 40 >50K 0.3698630 0.6000000 0.3979592 1
5 5 18 ? Community 10 Not_married White Female 30 <=50K 0.0136986 0.6000000 0.2959184 0
7 7 29 ? HighGrad 9 Not_married Black Male 40 <=50K 0.1643836 0.5333333 0.3979592 0
8 8 63 Self-emp-not-inc Master 15 Married White Male 32 >50K 0.6301370 0.9333333 0.3163265 1
9 9 24 Private Community 10 Not_married White Female 40 <=50K 0.0958904 0.6000000 0.3979592 0
10 10 55 Private Dropout 4 Married White Male 10 <=50K 0.5205479 0.2000000 0.0918367 0
11 11 65 Private HighGrad 9 Married White Male 40 >50K 0.6575342 0.5333333 0.3979592 1
kable(head(datos.validacion, 10), caption = "Datos de validación  (primeros diez)", row.names = 1:nrow(datos.entrenamiento))
Datos de validación (primeros diez)
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
6 6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0
15 15 48 Private HighGrad 9 Married White Male 48 >50K 0.4246575 0.5333333 0.4795918 1
17 17 20 State-gov Community 10 Not_married White Male 25 <=50K 0.0410959 0.6000000 0.2448980 0
36 36 65 ? HighGrad 9 Married White Male 40 <=50K 0.6575342 0.5333333 0.3979592 0
41 41 65 Private Master 14 Married White Male 50 >50K 0.6575342 0.8666667 0.5000000 1
49 49 52 Private Dropout 7 Separated Black Female 18 <=50K 0.4794521 0.4000000 0.1734694 0
51 51 18 Private Community 10 Not_married White Male 20 <=50K 0.0136986 0.6000000 0.1938776 0
54 54 22 Private HighGrad 9 Not_married White Male 60 >50K 0.0684932 0.5333333 0.6020408 1
56 56 21 Private Community 10 Not_married White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0
58 58 34 Local-gov Bachelors 13 Married White Male 50 >50K 0.2328767 0.8000000 0.5000000 1

5. Crear modelo de Máquinas de Soporte Vectorial SVM

  • Con el paquete e1071 se genera el modelo de SVM
  • Se utiliza la función svm
  • Kernel lineal
modelo1 <- svm(income10 ~ ., data = datos.entrenamiento, kernel = "linear", scale = TRUE, cost = .05)

6. Analizar y/o describir el modelo

summary(modelo1)
## 
## Call:
## svm(formula = income10 ~ ., data = datos.entrenamiento, kernel = "linear", 
##     cost = 0.05, scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.05 
## 
## Number of Support Vectors:  246
## 
##  ( 143 103 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
  • Modelo SVM de Kernel lineal
  • Costo del 0.05
  • 2 clases eb este caso etiquetas 0 y 1
  • Vectores de soporte 246 143 clase 0 y 103 clase 1

7. Realizar predicciones con el conjunto de datos de validación

prediccion <- predict(modelo1, datos.validacion)
  • Agregando columna al final de datos de validación para comparar
datos.validacion <- cbind(datos.validacion, prediccion = prediccion)

kable(head(datos.validacion, 50), caption = "Las predicciones, primeros 10 registros")
Las predicciones, primeros 10 registros
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10 prediccion
6 6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0 0
15 15 48 Private HighGrad 9 Married White Male 48 >50K 0.4246575 0.5333333 0.4795918 1 1
17 17 20 State-gov Community 10 Not_married White Male 25 <=50K 0.0410959 0.6000000 0.2448980 0 0
36 36 65 ? HighGrad 9 Married White Male 40 <=50K 0.6575342 0.5333333 0.3979592 0 0
41 41 65 Private Master 14 Married White Male 50 >50K 0.6575342 0.8666667 0.5000000 1 1
49 49 52 Private Dropout 7 Separated Black Female 18 <=50K 0.4794521 0.4000000 0.1734694 0 0
51 51 18 Private Community 10 Not_married White Male 20 <=50K 0.0136986 0.6000000 0.1938776 0 0
54 54 22 Private HighGrad 9 Not_married White Male 60 >50K 0.0684932 0.5333333 0.6020408 1 1
56 56 21 Private Community 10 Not_married White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0 0
58 58 34 Local-gov Bachelors 13 Married White Male 50 >50K 0.2328767 0.8000000 0.5000000 1 1
59 59 42 Self-emp-inc HighGrad 9 Married White Male 50 >50K 0.3424658 0.5333333 0.5000000 1 1
61 61 30 Private Bachelors 13 Not_married White Female 50 <=50K 0.1780822 0.8000000 0.5000000 0 0
64 64 33 Private HighGrad 9 Not_married White Female 40 <=50K 0.2191781 0.5333333 0.3979592 0 0
68 68 19 Private Community 10 Not_married White Male 20 <=50K 0.0273973 0.6000000 0.1938776 0 0
74 74 21 Private Community 10 Separated White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0 0
77 77 41 Private HighGrad 9 Married White Male 50 <=50K 0.3287671 0.5333333 0.5000000 0 0
79 79 50 Private HighGrad 9 Married White Male 40 <=50K 0.4520548 0.5333333 0.3979592 0 0
81 81 45 Self-emp-inc Community 10 Married White Male 50 <=50K 0.3835616 0.6000000 0.5000000 0 0
91 91 59 Private Bachelors 13 Not_married White Female 25 <=50K 0.5753425 0.8000000 0.2448980 0 0
95 95 34 Private Master 14 Not_married Amer-Indian-Eskimo Male 40 <=50K 0.2328767 0.8666667 0.3979592 0 0
96 96 20 Private HighGrad 9 Not_married White Male 40 <=50K 0.0410959 0.5333333 0.3979592 0 0
97 97 25 Private Bachelors 13 Not_married White Female 40 <=50K 0.1095890 0.8000000 0.3979592 0 0
102 102 33 Private Community 10 Not_married Black Female 35 <=50K 0.2191781 0.6000000 0.3469388 0 0
111 111 18 Private HighGrad 9 Not_married White Female 48 <=50K 0.0136986 0.5333333 0.4795918 0 0
112 112 28 Private Community 10 Married White Male 40 <=50K 0.1506849 0.6000000 0.3979592 0 0
120 120 43 Private Bachelors 13 Separated White Female 40 >50K 0.3561644 0.8000000 0.3979592 1 1
123 123 19 Private Community 10 Not_married White Male 30 <=50K 0.0273973 0.6000000 0.2959184 0 0
129 129 27 Self-emp-not-inc HighGrad 9 Married White Male 60 >50K 0.1369863 0.5333333 0.6020408 1 1
131 131 41 Private Community 10 Married White Male 40 <=50K 0.3287671 0.6000000 0.3979592 0 0
135 135 57 Private HighGrad 9 Married Black Male 48 <=50K 0.5479452 0.5333333 0.4795918 0 0
141 141 46 Private Master 14 Married White Male 40 <=50K 0.3972603 0.8666667 0.3979592 0 0
144 144 43 Self-emp-inc HighGrad 9 Married White Male 45 >50K 0.3561644 0.5333333 0.4489796 1 1
145 145 34 Private Master 14 Not_married White Female 30 <=50K 0.2328767 0.8666667 0.2959184 0 0
147 147 44 Private Community 11 Widow White Female 30 <=50K 0.3698630 0.6666667 0.2959184 0 0
153 153 50 Private Dropout 4 Married White Male 20 <=50K 0.4520548 0.2000000 0.1938776 0 0
160 160 38 Self-emp-inc Bachelors 13 Separated White Male 40 <=50K 0.2876712 0.8000000 0.3979592 0 0
161 161 55 Private Dropout 7 Married White Male 30 <=50K 0.5205479 0.4000000 0.2959184 0 0
163 163 22 Private HighGrad 9 Married White Male 45 <=50K 0.0684932 0.5333333 0.4489796 0 0
165 165 46 State-gov Master 14 Married White Male 45 >50K 0.3972603 0.8666667 0.4489796 1 1
167 167 58 Self-emp-not-inc PhD 16 Married White Male 16 >50K 0.5616438 1.0000000 0.1530612 1 1
168 168 42 Private HighGrad 9 Married White Male 45 <=50K 0.3424658 0.5333333 0.4489796 0 0
171 171 54 Private HighGrad 9 Married White Male 40 >50K 0.5068493 0.5333333 0.3979592 1 1
172 172 34 Private Master 14 Not_married White Male 40 <=50K 0.2328767 0.8666667 0.3979592 0 0
173 173 26 Private Bachelors 13 Not_married White Female 40 <=50K 0.1232877 0.8000000 0.3979592 0 0
175 175 48 Local-gov Master 14 Separated Black Female 40 <=50K 0.4246575 0.8666667 0.3979592 0 0
176 176 36 Private Community 10 Married White Male 45 >50K 0.2602740 0.6000000 0.4489796 1 1
185 185 44 Private HighGrad 9 Separated Black Female 40 <=50K 0.3698630 0.5333333 0.3979592 0 0
189 189 34 State-gov Bachelors 13 Not_married Black Male 40 <=50K 0.2328767 0.8000000 0.3979592 0 0
193 193 47 Private Community 10 Separated Black Female 37 <=50K 0.4109589 0.6000000 0.3673469 0 0
196 196 31 Private Community 10 Separated White Female 40 <=50K 0.1917808 0.6000000 0.3979592 0 0

8. Evaluar el modelo de predicción con matriz de confusión

matriz.confusion <- table(predicho  = prediccion, real = datos.validacion$income10)

matriz.confusion
##         real
## predicho     0     1
##        0 11146     0
##        1     0  3506
  • Accurancy = 100% dentro del algoritmo SVM de kernel lineal

9. Interpretar el caso

El conjunto de datos que se uso en este caso fue retomado del caso 11 el cual trata sobre los ingresos de personas de USA en el cual indica si es que estas reciben un ingreso de más de 50,000 dólares o menos o igual a 50,000 dólares, dependiendo de diferentes variables, en el caso 11 se uso un algoritmo de regresión logística y para este caso 13 se esta usando un algoritmo de máquinas de soporte vectorial con kernel lineal igualmente la variable dependiente que se usa es income que se representa con ´>50k´ y ´<=50k´ después convirtiéndola a un valor binario 1 y 0 respectivamente, y las variables independientes son:

• age la edad de la persona • workclass es un tipo o clase de trabajo de la persona, privado, gobierno, por su cuenta • education indica el nivel educativo de la persona • educational es el valor numérico de education • marital.status es su estado civil • race es el tipo de raza de persona • gender es el género de la persona • hours.per.week son las horas que trbaja por semana • income son los ingresos • age_scale la edad escalada • hours.per.week escalada

Dentro del modelo de maquinas de soporte vectorial como ya se menciono se uso un kernel lineal, con un costo del 0.05, fue apoyado por 246 vectores y por solo 2 clases como lo vimos en este caso que es 0 y 1. En cuanto a la matriz de confusión se obtiene que el numero de verdaderos positivos es de 11146 y que el caso de verdaderos negativos es de 3506 y en los demás valores de la matriz es de 0 lo que quiere decir que este modelo tiene un Accurancy de 100% en sus predicciones es decir es mucho más acertado en sus predicciones que el modelo de regresión logística.

10. Comparar con el algoritmo de regresión logística del caso 11

Modelo de regresión lógistica
formula = income10 ~ age.scale + workclass + education + marital.status + race + gender + hours.per.week.scale
modelo2 <- glm(formula, data = datos.entrenamiento, family =  'binomial')
summary(modelo2)
## 
## Call:
## glm(formula = formula, family = "binomial", data = datos.entrenamiento)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7337  -0.5768  -0.2588  -0.0654   3.3492  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -2.419e+00  2.228e-01 -10.858  < 2e-16 ***
## age.scale                  2.224e+00  1.053e-01  21.121  < 2e-16 ***
## workclassFederal-gov       1.421e+00  1.237e-01  11.485  < 2e-16 ***
## workclassLocal-gov         6.942e-01  1.100e-01   6.312 2.76e-10 ***
## workclassNever-worked     -8.124e+00  1.042e+02  -0.078   0.9379    
## workclassPrivate           8.124e-01  9.598e-02   8.464  < 2e-16 ***
## workclassSelf-emp-inc      1.218e+00  1.186e-01  10.270  < 2e-16 ***
## workclassSelf-emp-not-inc  1.878e-01  1.071e-01   1.753   0.0797 .  
## workclassState-gov         5.339e-01  1.223e-01   4.367 1.26e-05 ***
## workclassWithout-pay      -3.965e-01  8.276e-01  -0.479   0.6318    
## educationCommunity        -9.930e-01  4.428e-02 -22.426  < 2e-16 ***
## educationDropout          -2.782e+00  7.802e-02 -35.657  < 2e-16 ***
## educationHighGrad         -1.611e+00  4.523e-02 -35.610  < 2e-16 ***
## educationMaster            6.250e-01  6.110e-02  10.230  < 2e-16 ***
## educationPhD               1.077e+00  1.379e-01   7.814 5.55e-15 ***
## marital.statusNot_married -2.491e+00  5.355e-02 -46.511  < 2e-16 ***
## marital.statusSeparated   -2.102e+00  5.650e-02 -37.214  < 2e-16 ***
## marital.statusWidow       -2.163e+00  1.287e-01 -16.809  < 2e-16 ***
## raceAsian-Pac-Islander    -2.461e-02  2.074e-01  -0.119   0.9055    
## raceBlack                  4.784e-04  1.968e-01   0.002   0.9981    
## raceOther                 -9.881e-02  2.817e-01  -0.351   0.7258    
## raceWhite                  2.155e-01  1.876e-01   1.148   0.2509    
## genderMale                 9.432e-02  4.455e-02   2.117   0.0342 *  
## hours.per.week.scale       3.136e+00  1.398e-01  22.430  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37626  on 34189  degrees of freedom
## Residual deviance: 25011  on 34166  degrees of freedom
## AIC: 25059
## 
## Number of Fisher Scoring iterations: 11
Realizar predicciones
predicciones <- predict(modelo2, datos.validacion, se.fit = TRUE)

predicciones.ajustadas <- predicciones$fit

kable(head(predicciones.ajustadas, 10))
x
6 -5.1235447
15 -0.4589452
17 -4.1994476
36 -1.0094459
41 2.3586616
49 -4.8803438
51 -4.1419552
54 -3.3575676
56 -3.5047903
58 0.6710119
Convertir las predicciones en probabilidades
predicciones_prob <- exp(predicciones$fit) / (1 + exp(predicciones$fit))

kable(head(predicciones_prob))
x
6 0.0059196
15 0.3872361
17 0.0147821
36 0.2670883
41 0.9136202
49 0.0075372
Agregar una columna de la predicción al final del conjunto de datos de validación
  • Agregar una columna de los datos de validación con los valores probabilísticos de las predicciones con cbind() y con un uevo conjunto de datos llamado las .predicciones
  • Agregar columna con valor 1 cuando la predicción es mayor que 0.5 y 0 cuando la predicción es menor o igual a 0.5
  • Verificar las columnas income10 e income10.prediccion
las.predicciones <- cbind(datos.validacion, predicciones_prob)

las.predicciones <- las.predicciones %>%
  mutate(income10.prediccion =  if_else(predicciones_prob > 0.5, 1, 0))
  
kable(head(las.predicciones))
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10 prediccion predicciones_prob income10.prediccion
6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0 0 0.0059196 0
15 48 Private HighGrad 9 Married White Male 48 >50K 0.4246575 0.5333333 0.4795918 1 1 0.3872361 0
17 20 State-gov Community 10 Not_married White Male 25 <=50K 0.0410959 0.6000000 0.2448980 0 0 0.0147821 0
36 65 ? HighGrad 9 Married White Male 40 <=50K 0.6575342 0.5333333 0.3979592 0 0 0.2670883 0
41 65 Private Master 14 Married White Male 50 >50K 0.6575342 0.8666667 0.5000000 1 1 0.9136202 1
49 52 Private Dropout 7 Separated Black Female 18 <=50K 0.4794521 0.4000000 0.1734694 0 0 0.0075372 0
Matriz de confusión de predicciones con regresión lógistica
matriz_confusion_predicciones <- table(las.predicciones$income10, las.predicciones$income10.prediccion, dnn = c("income10", "predicciones"))
matriz_confusion_predicciones
##         predicciones
## income10     0     1
##        0 10357   789
##        1  1762  1744
  • Accurancy = 82.58% dentro del algoritmo de regresión logistica
Matriz de confusión de predicciones con SVM de kernel lineal
matriz.confusion <- table(predicho  = prediccion, real = datos.validacion$income10)

matriz.confusion
##         real
## predicho     0     1
##        0 11146     0
##        1     0  3506
  • Accurancy = 100% dentro del algoritmo SVM de kernel lineal

Como podemos ver en cuanto a las predicciones el algoritmo que tiene mayor porcentaje de Accurancy es el SVM de Kernel lineal concluyendo de esta manera que el algoritmo que se adecua más a este caso (conjunto de datos de personas de USA respecto a sus ingresos) es el algoritmo SVM de kernel lineal