Objetivo

Aplicar e interpretar el algoritmo de maquinas de soporte vectorial SVM con los datos de personas e ingresos de USA y comparar con el caso 11 de regresión loígistica el valor de exactitud.

Descripción

Construir un modelo de de maquinas de soporte vectorial SVM aplicado a datos de personas y sus ingresos en USA La variable dependiente es los ingresos identificado por 0 y 1, los ganan por debajo o igual a 50 Mil y los que ganan por encima de 50 Mil.

Proceso

  1. Cargar librerías
  2. Cargar datos
  3. Identificar variables
  4. Crear datos de entrenamiento y validación
  5. Crear modelo de Máquinas de Soporte Vectorial SVM *Deteminar anticipadamente e mejor costo del modelo de SVM Pendiente
  6. Analizar y/o describir el modelo
  7. Realizar predicciones con el conjunto de datos de validación
  8. Evaluar el modelo de predicción con matriz de confusión
  9. Interpretar el caso
  10. Comparar con el algoritmo de regresión logística del caso 11.

1. Cargar librerias

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
library(caret)
## Warning: package 'caret' was built under R version 4.0.3
## Loading required package: lattice
library(readr)
library(knitr)
library(e1071)
## Warning: package 'e1071' was built under R version 4.0.3

2. Cargar datos

datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/FundamentosMachineLearning/master/datos/adultos_clean.csv", encoding = "UTF-8")

# datos <- read.csv("../datos/adultos_clean.csv")

# kable(head(datos))
# kable(tail(datos))
kable(head(datos, 10), caption = "Los primeros 10 registros de datos")
Los primeros 10 registros de datos
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
1 25 Private Dropout 7 Not_married Black Male 40 <=50K 0.1095890 0.4000000 0.3979592 0
2 38 Private HighGrad 9 Married White Male 50 <=50K 0.2876712 0.5333333 0.5000000 0
3 28 Local-gov Community 12 Married White Male 40 >50K 0.1506849 0.7333333 0.3979592 1
4 44 Private Community 10 Married Black Male 40 >50K 0.3698630 0.6000000 0.3979592 1
5 18 ? Community 10 Not_married White Female 30 <=50K 0.0136986 0.6000000 0.2959184 0
6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0
7 29 ? HighGrad 9 Not_married Black Male 40 <=50K 0.1643836 0.5333333 0.3979592 0
8 63 Self-emp-not-inc Master 15 Married White Male 32 >50K 0.6301370 0.9333333 0.3163265 1
9 24 Private Community 10 Not_married White Female 40 <=50K 0.0958904 0.6000000 0.3979592 0
10 55 Private Dropout 4 Married White Male 10 <=50K 0.5205479 0.2000000 0.0918367 0
kable(tail(datos, 10), caption = "Los últimos 10 registros de datos")
Los últimos 10 registros de datos
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
48833 48833 32 Private Dropout 6 Married Amer-Indian-Eskimo Male 40 <=50K 0.2054795 0.3333333 0.3979592 0
48834 48834 43 Private Community 11 Married White Male 45 <=50K 0.3561644 0.6666667 0.4489796 0
48835 48835 32 Private Master 14 Not_married Asian-Pac-Islander Male 11 <=50K 0.2054795 0.8666667 0.1020408 0
48836 48836 53 Private Master 14 Married White Male 40 >50K 0.4931507 0.8666667 0.3979592 1
48837 48837 22 Private Community 10 Not_married White Male 40 <=50K 0.0684932 0.6000000 0.3979592 0
48838 48838 27 Private Community 12 Married White Female 38 <=50K 0.1369863 0.7333333 0.3775510 0
48839 48839 40 Private HighGrad 9 Married White Male 40 >50K 0.3150685 0.5333333 0.3979592 1
48840 48840 58 Private HighGrad 9 Widow White Female 40 <=50K 0.5616438 0.5333333 0.3979592 0
48841 48841 22 Private HighGrad 9 Not_married White Male 20 <=50K 0.0684932 0.5333333 0.1938776 0
48842 48842 52 Self-emp-inc HighGrad 9 Married White Female 40 >50K 0.4794521 0.5333333 0.3979592 1

3. Las variables

  • Convertir a factor la variable dependiente income 10
datos$income10 <- factor(datos$income10)

str(datos)
## 'data.frame':    48842 obs. of  14 variables:
##  $ X                    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age                  : int  25 38 28 44 18 34 29 63 24 55 ...
##  $ workclass            : chr  "Private" "Private" "Local-gov" "Private" ...
##  $ education            : chr  "Dropout" "HighGrad" "Community" "Community" ...
##  $ educational.num      : int  7 9 12 10 10 6 9 15 10 4 ...
##  $ marital.status       : chr  "Not_married" "Married" "Married" "Married" ...
##  $ race                 : chr  "Black" "White" "White" "Black" ...
##  $ gender               : chr  "Male" "Male" "Male" "Male" ...
##  $ hours.per.week       : int  40 50 40 40 30 30 40 32 40 10 ...
##  $ income               : chr  "<=50K" "<=50K" ">50K" ">50K" ...
##  $ age.scale            : num  0.1096 0.2877 0.1507 0.3699 0.0137 ...
##  $ educational.num.scale: num  0.4 0.533 0.733 0.6 0.6 ...
##  $ hours.per.week.scale : num  0.398 0.5 0.398 0.398 0.296 ...
##  $ income10             : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1 ...
  • age la edad de la persona
  • workclass es un tipo o clase de trabajo de la persona, privado, gobierno, por su cuenta,
  • education indica el nivel educativo de la persona
  • educational es el valor numérico de education
  • marital.status es su estado civil
  • race es el tipo de raza de persona
  • gender es el género de la persona
  • hours.per.week son las horas que trbaja por semana income son los ingresos
  • age_scale la edad escalada
  • hours.per.week escalada
  • income10 con valores de 0 gana menos de 50 mil y 1 mas de 50 mil

4. Crear datos de entrenamiento y validación

  • 70 % datos de entrenamiento
  • 30 % datos de validación
set.seed(2020)
entrena <- createDataPartition(y = datos$income10, p = 0.7, list = FALSE, times = 1)

# Datos entrenamiento
datos.entrenamiento <- datos[entrena, ]  # [renglones, columna]

# Datos validación
datos.validacion <- datos[-entrena, ]

kable(head(datos.entrenamiento, 10), caption = "Datos de entrenamiento  (primeros diez)", row.names = 1:nrow(datos.entrenamiento))
## Warning in if (is.na(row.names)) row.names = has_rownames(x): la condición tiene
## longitud > 1 y sólo el primer elemento será usado
## Warning in if (row.names) {: la condición tiene longitud > 1 y sólo el primer
## elemento será usado
Datos de entrenamiento (primeros diez)
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
1 1 25 Private Dropout 7 Not_married Black Male 40 <=50K 0.1095890 0.4000000 0.3979592 0
2 2 38 Private HighGrad 9 Married White Male 50 <=50K 0.2876712 0.5333333 0.5000000 0
3 3 28 Local-gov Community 12 Married White Male 40 >50K 0.1506849 0.7333333 0.3979592 1
4 4 44 Private Community 10 Married Black Male 40 >50K 0.3698630 0.6000000 0.3979592 1
5 5 18 ? Community 10 Not_married White Female 30 <=50K 0.0136986 0.6000000 0.2959184 0
7 7 29 ? HighGrad 9 Not_married Black Male 40 <=50K 0.1643836 0.5333333 0.3979592 0
8 8 63 Self-emp-not-inc Master 15 Married White Male 32 >50K 0.6301370 0.9333333 0.3163265 1
9 9 24 Private Community 10 Not_married White Female 40 <=50K 0.0958904 0.6000000 0.3979592 0
10 10 55 Private Dropout 4 Married White Male 10 <=50K 0.5205479 0.2000000 0.0918367 0
11 11 65 Private HighGrad 9 Married White Male 40 >50K 0.6575342 0.5333333 0.3979592 1
kable(head(datos.validacion, 10), caption = "Datos de validación  (primeros diez)", row.names = 1:nrow(datos.entrenamiento))
## Warning in if (is.na(row.names)) row.names = has_rownames(x): la condición tiene
## longitud > 1 y sólo el primer elemento será usado
## Warning in if (row.names) {: la condición tiene longitud > 1 y sólo el primer
## elemento será usado
Datos de validación (primeros diez)
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
6 6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0
15 15 48 Private HighGrad 9 Married White Male 48 >50K 0.4246575 0.5333333 0.4795918 1
17 17 20 State-gov Community 10 Not_married White Male 25 <=50K 0.0410959 0.6000000 0.2448980 0
36 36 65 ? HighGrad 9 Married White Male 40 <=50K 0.6575342 0.5333333 0.3979592 0
41 41 65 Private Master 14 Married White Male 50 >50K 0.6575342 0.8666667 0.5000000 1
49 49 52 Private Dropout 7 Separated Black Female 18 <=50K 0.4794521 0.4000000 0.1734694 0
51 51 18 Private Community 10 Not_married White Male 20 <=50K 0.0136986 0.6000000 0.1938776 0
54 54 22 Private HighGrad 9 Not_married White Male 60 >50K 0.0684932 0.5333333 0.6020408 1
56 56 21 Private Community 10 Not_married White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0
58 58 34 Local-gov Bachelors 13 Married White Male 50 >50K 0.2328767 0.8000000 0.5000000 1

5. Crear modelo de Máquinas de Soporte Vectorial SVM

  • Con el paquete e1071 se genera el modelo de SVM
  • Se utiliza la función svm
  • Kernel lineal
modelo1 <- svm(income10 ~ ., data = datos.entrenamiento, kernel = "linear", scale = TRUE, cost = .05)

6. Analizar y/o describir el modelo

summary(modelo1)
## 
## Call:
## svm(formula = income10 ~ ., data = datos.entrenamiento, kernel = "linear", 
##     cost = 0.05, scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.05 
## 
## Number of Support Vectors:  246
## 
##  ( 143 103 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

7. Realizar predicciones con el conjunto de datos de validación

prediccion <- predict(modelo1, datos.validacion)
  • Agregando columna al final de datos de validación para comparar
datos.validacion <- cbind(datos.validacion, prediccion = prediccion)

kable(head(datos.validacion, 50), caption = "Las predicciones, primeros 10 registros")
Las predicciones, primeros 10 registros
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10 prediccion
6 6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0 0
15 15 48 Private HighGrad 9 Married White Male 48 >50K 0.4246575 0.5333333 0.4795918 1 1
17 17 20 State-gov Community 10 Not_married White Male 25 <=50K 0.0410959 0.6000000 0.2448980 0 0
36 36 65 ? HighGrad 9 Married White Male 40 <=50K 0.6575342 0.5333333 0.3979592 0 0
41 41 65 Private Master 14 Married White Male 50 >50K 0.6575342 0.8666667 0.5000000 1 1
49 49 52 Private Dropout 7 Separated Black Female 18 <=50K 0.4794521 0.4000000 0.1734694 0 0
51 51 18 Private Community 10 Not_married White Male 20 <=50K 0.0136986 0.6000000 0.1938776 0 0
54 54 22 Private HighGrad 9 Not_married White Male 60 >50K 0.0684932 0.5333333 0.6020408 1 1
56 56 21 Private Community 10 Not_married White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0 0
58 58 34 Local-gov Bachelors 13 Married White Male 50 >50K 0.2328767 0.8000000 0.5000000 1 1
59 59 42 Self-emp-inc HighGrad 9 Married White Male 50 >50K 0.3424658 0.5333333 0.5000000 1 1
61 61 30 Private Bachelors 13 Not_married White Female 50 <=50K 0.1780822 0.8000000 0.5000000 0 0
64 64 33 Private HighGrad 9 Not_married White Female 40 <=50K 0.2191781 0.5333333 0.3979592 0 0
68 68 19 Private Community 10 Not_married White Male 20 <=50K 0.0273973 0.6000000 0.1938776 0 0
74 74 21 Private Community 10 Separated White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0 0
77 77 41 Private HighGrad 9 Married White Male 50 <=50K 0.3287671 0.5333333 0.5000000 0 0
79 79 50 Private HighGrad 9 Married White Male 40 <=50K 0.4520548 0.5333333 0.3979592 0 0
81 81 45 Self-emp-inc Community 10 Married White Male 50 <=50K 0.3835616 0.6000000 0.5000000 0 0
91 91 59 Private Bachelors 13 Not_married White Female 25 <=50K 0.5753425 0.8000000 0.2448980 0 0
95 95 34 Private Master 14 Not_married Amer-Indian-Eskimo Male 40 <=50K 0.2328767 0.8666667 0.3979592 0 0
96 96 20 Private HighGrad 9 Not_married White Male 40 <=50K 0.0410959 0.5333333 0.3979592 0 0
97 97 25 Private Bachelors 13 Not_married White Female 40 <=50K 0.1095890 0.8000000 0.3979592 0 0
102 102 33 Private Community 10 Not_married Black Female 35 <=50K 0.2191781 0.6000000 0.3469388 0 0
111 111 18 Private HighGrad 9 Not_married White Female 48 <=50K 0.0136986 0.5333333 0.4795918 0 0
112 112 28 Private Community 10 Married White Male 40 <=50K 0.1506849 0.6000000 0.3979592 0 0
120 120 43 Private Bachelors 13 Separated White Female 40 >50K 0.3561644 0.8000000 0.3979592 1 1
123 123 19 Private Community 10 Not_married White Male 30 <=50K 0.0273973 0.6000000 0.2959184 0 0
129 129 27 Self-emp-not-inc HighGrad 9 Married White Male 60 >50K 0.1369863 0.5333333 0.6020408 1 1
131 131 41 Private Community 10 Married White Male 40 <=50K 0.3287671 0.6000000 0.3979592 0 0
135 135 57 Private HighGrad 9 Married Black Male 48 <=50K 0.5479452 0.5333333 0.4795918 0 0
141 141 46 Private Master 14 Married White Male 40 <=50K 0.3972603 0.8666667 0.3979592 0 0
144 144 43 Self-emp-inc HighGrad 9 Married White Male 45 >50K 0.3561644 0.5333333 0.4489796 1 1
145 145 34 Private Master 14 Not_married White Female 30 <=50K 0.2328767 0.8666667 0.2959184 0 0
147 147 44 Private Community 11 Widow White Female 30 <=50K 0.3698630 0.6666667 0.2959184 0 0
153 153 50 Private Dropout 4 Married White Male 20 <=50K 0.4520548 0.2000000 0.1938776 0 0
160 160 38 Self-emp-inc Bachelors 13 Separated White Male 40 <=50K 0.2876712 0.8000000 0.3979592 0 0
161 161 55 Private Dropout 7 Married White Male 30 <=50K 0.5205479 0.4000000 0.2959184 0 0
163 163 22 Private HighGrad 9 Married White Male 45 <=50K 0.0684932 0.5333333 0.4489796 0 0
165 165 46 State-gov Master 14 Married White Male 45 >50K 0.3972603 0.8666667 0.4489796 1 1
167 167 58 Self-emp-not-inc PhD 16 Married White Male 16 >50K 0.5616438 1.0000000 0.1530612 1 1
168 168 42 Private HighGrad 9 Married White Male 45 <=50K 0.3424658 0.5333333 0.4489796 0 0
171 171 54 Private HighGrad 9 Married White Male 40 >50K 0.5068493 0.5333333 0.3979592 1 1
172 172 34 Private Master 14 Not_married White Male 40 <=50K 0.2328767 0.8666667 0.3979592 0 0
173 173 26 Private Bachelors 13 Not_married White Female 40 <=50K 0.1232877 0.8000000 0.3979592 0 0
175 175 48 Local-gov Master 14 Separated Black Female 40 <=50K 0.4246575 0.8666667 0.3979592 0 0
176 176 36 Private Community 10 Married White Male 45 >50K 0.2602740 0.6000000 0.4489796 1 1
185 185 44 Private HighGrad 9 Separated Black Female 40 <=50K 0.3698630 0.5333333 0.3979592 0 0
189 189 34 State-gov Bachelors 13 Not_married Black Male 40 <=50K 0.2328767 0.8000000 0.3979592 0 0
193 193 47 Private Community 10 Separated Black Female 37 <=50K 0.4109589 0.6000000 0.3673469 0 0
196 196 31 Private Community 10 Separated White Female 40 <=50K 0.1917808 0.6000000 0.3979592 0 0

8. Evaluar el modelo de predicción con matriz de confusión

Generando la matriz de confusión

mat.confusion <- table(predicho  = prediccion, real = datos.validacion$income10)

mat.confusion
##         real
## predicho     0     1
##        0 11146     0
##        1     0  3506

9. Interpretar caso

El objetivo es Aplicar e interpretar el algoritmo de maquinas de soporte vectorial SVM con los datos de personas e ingresos de USA y realozar una comparacion con el modelo del caso 11 de regresión loígistica el valor de exactitud. Como variable dependiente tenemos los ingresos identificados con 0 y 1 ( 0 es los que ganan por debajo o igual de 50 mil y 1 los que ganan por encima de 50 mil). Las variables independientes son age, workclass, education, marital.status, race, gender, hours.per.week, etc. La matriz de confusión se obtiene que el numero de verdaderos es de 11146 y que el caso de negativos es de 3506. El numero de vectores en los que se apoyo fueron 246 y 2 clases (0,1).

10. Comparar con el algoritmo de regresión logística del caso 11.

formula = income10 ~ age.scale + workclass + education + marital.status + race + gender + hours.per.week.scale

modelo2 <- glm(formula, data = datos.entrenamiento, family =  'binomial')
summary(modelo2)
## 
## Call:
## glm(formula = formula, family = "binomial", data = datos.entrenamiento)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7337  -0.5768  -0.2588  -0.0654   3.3492  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -2.419e+00  2.228e-01 -10.858  < 2e-16 ***
## age.scale                  2.224e+00  1.053e-01  21.121  < 2e-16 ***
## workclassFederal-gov       1.421e+00  1.237e-01  11.485  < 2e-16 ***
## workclassLocal-gov         6.942e-01  1.100e-01   6.312 2.76e-10 ***
## workclassNever-worked     -8.124e+00  1.042e+02  -0.078   0.9379    
## workclassPrivate           8.124e-01  9.598e-02   8.464  < 2e-16 ***
## workclassSelf-emp-inc      1.218e+00  1.186e-01  10.270  < 2e-16 ***
## workclassSelf-emp-not-inc  1.878e-01  1.071e-01   1.753   0.0797 .  
## workclassState-gov         5.339e-01  1.223e-01   4.367 1.26e-05 ***
## workclassWithout-pay      -3.965e-01  8.276e-01  -0.479   0.6318    
## educationCommunity        -9.930e-01  4.428e-02 -22.426  < 2e-16 ***
## educationDropout          -2.782e+00  7.802e-02 -35.657  < 2e-16 ***
## educationHighGrad         -1.611e+00  4.523e-02 -35.610  < 2e-16 ***
## educationMaster            6.250e-01  6.110e-02  10.230  < 2e-16 ***
## educationPhD               1.077e+00  1.379e-01   7.814 5.55e-15 ***
## marital.statusNot_married -2.491e+00  5.355e-02 -46.511  < 2e-16 ***
## marital.statusSeparated   -2.102e+00  5.650e-02 -37.214  < 2e-16 ***
## marital.statusWidow       -2.163e+00  1.287e-01 -16.809  < 2e-16 ***
## raceAsian-Pac-Islander    -2.461e-02  2.074e-01  -0.119   0.9055    
## raceBlack                  4.784e-04  1.968e-01   0.002   0.9981    
## raceOther                 -9.881e-02  2.817e-01  -0.351   0.7258    
## raceWhite                  2.155e-01  1.876e-01   1.148   0.2509    
## genderMale                 9.432e-02  4.455e-02   2.117   0.0342 *  
## hours.per.week.scale       3.136e+00  1.398e-01  22.430  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37626  on 34189  degrees of freedom
## Residual deviance: 25011  on 34166  degrees of freedom
## AIC: 25059
## 
## Number of Fisher Scoring iterations: 11