Objective

Apply and interpret the support vector machine (SVM) algorithm on data about people and their incomes in the USA, and compare its accuracy with the accuracy obtained with logistic regression in case 11.

Description

Build a support vector machine (SVM) model applied to data about people and their incomes in the USA.

The dependent variable is income, coded as 0 and 1: 0 for people who earn 50 thousand or less and 1 for people who earn more than 50 thousand.

Theoretical background

Pending: what is the support vector machine (SVM) algorithm?

Process

  1. Load libraries
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
library(caret)
## Warning: package 'caret' was built under R version 4.0.3
## Loading required package: lattice
library(readr)
library(e1071)
  2. Load data
datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/FundamentosMachineLearning/master/datos/adultos_clean.csv", encoding = "UTF-8")
kable(head(datos, 10), caption = "First 10 records of the data")
First 10 records of the data
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
1 25 Private Dropout 7 Not_married Black Male 40 <=50K 0.1095890 0.4000000 0.3979592 0
2 38 Private HighGrad 9 Married White Male 50 <=50K 0.2876712 0.5333333 0.5000000 0
3 28 Local-gov Community 12 Married White Male 40 >50K 0.1506849 0.7333333 0.3979592 1
4 44 Private Community 10 Married Black Male 40 >50K 0.3698630 0.6000000 0.3979592 1
5 18 ? Community 10 Not_married White Female 30 <=50K 0.0136986 0.6000000 0.2959184 0
6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0
7 29 ? HighGrad 9 Not_married Black Male 40 <=50K 0.1643836 0.5333333 0.3979592 0
8 63 Self-emp-not-inc Master 15 Married White Male 32 >50K 0.6301370 0.9333333 0.3163265 1
9 24 Private Community 10 Not_married White Female 40 <=50K 0.0958904 0.6000000 0.3979592 0
10 55 Private Dropout 4 Married White Male 10 <=50K 0.5205479 0.2000000 0.0918367 0
  3. Convert the dependent variable income10 to a factor
datos$income10 <- factor(datos$income10)

str(datos)
## 'data.frame':    48842 obs. of  14 variables:
##  $ X                    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age                  : int  25 38 28 44 18 34 29 63 24 55 ...
##  $ workclass            : chr  "Private" "Private" "Local-gov" "Private" ...
##  $ education            : chr  "Dropout" "HighGrad" "Community" "Community" ...
##  $ educational.num      : int  7 9 12 10 10 6 9 15 10 4 ...
##  $ marital.status       : chr  "Not_married" "Married" "Married" "Married" ...
##  $ race                 : chr  "Black" "White" "White" "Black" ...
##  $ gender               : chr  "Male" "Male" "Male" "Male" ...
##  $ hours.per.week       : int  40 50 40 40 30 30 40 32 40 10 ...
##  $ income               : chr  "<=50K" "<=50K" ">50K" ">50K" ...
##  $ age.scale            : num  0.1096 0.2877 0.1507 0.3699 0.0137 ...
##  $ educational.num.scale: num  0.4 0.533 0.733 0.6 0.6 ...
##  $ hours.per.week.scale : num  0.398 0.5 0.398 0.398 0.296 ...
##  $ income10             : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1 ...
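The `str(datos)` output shows that `workclass`, `education`, `marital.status`, `race`, `gender` and `income` are stored as character vectors. Modeling functions with a formula interface usually convert such columns to factors on their own, but the conversion can also be made explicit. A minimal sketch on a hypothetical toy data frame (`toy` and `is_chr` are illustrative names, not part of the original analysis):

```r
# Toy data frame standing in for `datos` (hypothetical values)
toy <- data.frame(
  workclass = c("Private", "Local-gov", "Private"),
  gender    = c("Male", "Female", "Male"),
  age       = c(25L, 38L, 28L),
  stringsAsFactors = FALSE
)

# Flag the character columns and convert each one to a factor
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

# Character columns are now factors; numeric columns are untouched
str(toy)
```

The same two lines could be applied to `datos`, assuming every character column there is meant to be categorical.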
  4. Create training and validation data
set.seed(2020)
entrena <- createDataPartition(y = datos$income10, p = 0.7, list = FALSE, times = 1)

# Training data (70% of the rows, stratified by income10)
datos.entrenamiento <- datos[entrena, ]  # [rows, columns]

# Validation data (the remaining 30%)
datos.validacion <- datos[-entrena, ]

kable(head(datos.entrenamiento, 10), caption = "Training data (first ten)", row.names = TRUE)
Training data (first ten)
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
1 1 25 Private Dropout 7 Not_married Black Male 40 <=50K 0.1095890 0.4000000 0.3979592 0
2 2 38 Private HighGrad 9 Married White Male 50 <=50K 0.2876712 0.5333333 0.5000000 0
3 3 28 Local-gov Community 12 Married White Male 40 >50K 0.1506849 0.7333333 0.3979592 1
4 4 44 Private Community 10 Married Black Male 40 >50K 0.3698630 0.6000000 0.3979592 1
5 5 18 ? Community 10 Not_married White Female 30 <=50K 0.0136986 0.6000000 0.2959184 0
7 7 29 ? HighGrad 9 Not_married Black Male 40 <=50K 0.1643836 0.5333333 0.3979592 0
8 8 63 Self-emp-not-inc Master 15 Married White Male 32 >50K 0.6301370 0.9333333 0.3163265 1
9 9 24 Private Community 10 Not_married White Female 40 <=50K 0.0958904 0.6000000 0.3979592 0
10 10 55 Private Dropout 4 Married White Male 10 <=50K 0.5205479 0.2000000 0.0918367 0
11 11 65 Private HighGrad 9 Married White Male 40 >50K 0.6575342 0.5333333 0.3979592 1
kable(head(datos.validacion, 10), caption = "Validation data (first ten)", row.names = TRUE)
Validation data (first ten)
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
6 6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0
15 15 48 Private HighGrad 9 Married White Male 48 >50K 0.4246575 0.5333333 0.4795918 1
17 17 20 State-gov Community 10 Not_married White Male 25 <=50K 0.0410959 0.6000000 0.2448980 0
36 36 65 ? HighGrad 9 Married White Male 40 <=50K 0.6575342 0.5333333 0.3979592 0
41 41 65 Private Master 14 Married White Male 50 >50K 0.6575342 0.8666667 0.5000000 1
49 49 52 Private Dropout 7 Separated Black Female 18 <=50K 0.4794521 0.4000000 0.1734694 0
51 51 18 Private Community 10 Not_married White Male 20 <=50K 0.0136986 0.6000000 0.1938776 0
54 54 22 Private HighGrad 9 Not_married White Male 60 >50K 0.0684932 0.5333333 0.6020408 1
56 56 21 Private Community 10 Not_married White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0
58 58 34 Local-gov Bachelors 13 Married White Male 50 >50K 0.2328767 0.8000000 0.5000000 1
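`createDataPartition()` samples within each level of the outcome, so the training and validation sets should keep roughly the same class proportions as the full data. The following is a small sketch of how that could be checked; the toy data frame and the names `toy`, `idx`, `train` and `valid` are assumptions made for illustration only.

```r
library(caret)

set.seed(2020)

# Toy data: 150 observations of class 0 and 50 of class 1 (hypothetical)
toy <- data.frame(
  x = rnorm(200),
  y = factor(rep(c(0, 1), times = c(150, 50)))
)

# Same call pattern as in the step above: 70% of the rows go to training
idx   <- createDataPartition(y = toy$y, p = 0.7, list = FALSE, times = 1)
train <- toy[idx, ]
valid <- toy[-idx, ]

# Class proportions in the full data and in each partition
prop.table(table(toy$y))
prop.table(table(train$y))
prop.table(table(valid$y))
```

With the real data, the same three `prop.table()` calls on `datos$income10`, `datos.entrenamiento$income10` and `datos.validacion$income10` would show whether the split preserved the balance between the two income groups.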
  5. Create the support vector machine (SVM) model
modelo1 <- svm(income10 ~ ., data = datos.entrenamiento, kernel = "linear", scale = TRUE, cost = .05)
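The formula `income10 ~ .` uses every other column as a predictor. In this data set that includes `income`, which is the same label as `income10` in text form, and `X`, which is only a row index. A model that sees `income` can read the answer directly, so its validation metrics say little about how well it predicts. The sketch below, on hypothetical toy data, shows one way to inspect which predictors a formula implies and how the two columns could be dropped; `toy`, `all_terms` and `kept_terms` are illustrative names.

```r
# Toy data with the same kind of columns as `datos` (hypothetical values)
toy <- data.frame(
  X        = 1:4,
  age      = c(25, 38, 28, 44),
  income   = c("<=50K", "<=50K", ">50K", ">50K"),
  income10 = factor(c(0, 0, 1, 1))
)

# Predictors implied by the formula `income10 ~ .`
all_terms <- attr(terms(income10 ~ ., data = toy), "term.labels")

# Same formula with the row index and the text label removed
kept_terms <- attr(terms(income10 ~ . - X - income, data = toy), "term.labels")

print(all_terms)
print(kept_terms)
```

On the real data this would correspond to a call such as `svm(income10 ~ . - X - income, data = datos.entrenamiento, kernel = "linear", scale = TRUE, cost = 0.05)`. That variant was not run here, so its results are not shown and would differ from the ones reported in this document.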
  6. Analyze and/or describe the model
summary(modelo1)
## 
## Call:
## svm(formula = income10 ~ ., data = datos.entrenamiento, kernel = "linear", 
##     cost = 0.05, scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.05 
## 
## Number of Support Vectors:  246
## 
##  ( 143 103 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1
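The summary reports a linear kernel, `cost = 0.05` and 246 support vectors, 143 from class 0 and 103 from class 1. In `e1071`, `cost` is the C constant that penalizes margin violations, so it influences how many observations end up as support vectors. A rough sketch on hypothetical toy data, assuming two overlapping classes, of how the effect of `cost` could be explored (the names `toy`, `fit_low` and `fit_high` and the value `cost = 10` are illustrative choices):

```r
library(e1071)

set.seed(2020)

# Two overlapping toy classes in two dimensions (hypothetical data)
n <- 100
toy <- data.frame(
  x1 = c(rnorm(n, mean = 0), rnorm(n, mean = 1.5)),
  x2 = c(rnorm(n, mean = 0), rnorm(n, mean = 1.5)),
  y  = factor(rep(c(0, 1), each = n))
)

# Same kernel and scaling as modelo1, with a low and a high cost
fit_low  <- svm(y ~ ., data = toy, kernel = "linear", scale = TRUE, cost = 0.05)
fit_high <- svm(y ~ ., data = toy, kernel = "linear", scale = TRUE, cost = 10)

# Total number of support vectors for each fit
c(low = fit_low$tot.nSV, high = fit_high$tot.nSV)

# Per-class counts, the pair of numbers that summary() prints in parentheses
fit_low$nSV
```

For the model in this document the equivalent values would be available as `modelo1$tot.nSV` and `modelo1$nSV`.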
  7. Make predictions on the validation data set
prediccion <- predict(modelo1, datos.validacion)
datos.validacion <- cbind(datos.validacion, prediccion = prediccion)

kable(head(datos.validacion, 50), caption = "Predictions, first 50 records")
Predictions, first 50 records
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10 prediccion
6 6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0 0
15 15 48 Private HighGrad 9 Married White Male 48 >50K 0.4246575 0.5333333 0.4795918 1 1
17 17 20 State-gov Community 10 Not_married White Male 25 <=50K 0.0410959 0.6000000 0.2448980 0 0
36 36 65 ? HighGrad 9 Married White Male 40 <=50K 0.6575342 0.5333333 0.3979592 0 0
41 41 65 Private Master 14 Married White Male 50 >50K 0.6575342 0.8666667 0.5000000 1 1
49 49 52 Private Dropout 7 Separated Black Female 18 <=50K 0.4794521 0.4000000 0.1734694 0 0
51 51 18 Private Community 10 Not_married White Male 20 <=50K 0.0136986 0.6000000 0.1938776 0 0
54 54 22 Private HighGrad 9 Not_married White Male 60 >50K 0.0684932 0.5333333 0.6020408 1 1
56 56 21 Private Community 10 Not_married White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0 0
58 58 34 Local-gov Bachelors 13 Married White Male 50 >50K 0.2328767 0.8000000 0.5000000 1 1
59 59 42 Self-emp-inc HighGrad 9 Married White Male 50 >50K 0.3424658 0.5333333 0.5000000 1 1
61 61 30 Private Bachelors 13 Not_married White Female 50 <=50K 0.1780822 0.8000000 0.5000000 0 0
64 64 33 Private HighGrad 9 Not_married White Female 40 <=50K 0.2191781 0.5333333 0.3979592 0 0
68 68 19 Private Community 10 Not_married White Male 20 <=50K 0.0273973 0.6000000 0.1938776 0 0
74 74 21 Private Community 10 Separated White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0 0
77 77 41 Private HighGrad 9 Married White Male 50 <=50K 0.3287671 0.5333333 0.5000000 0 0
79 79 50 Private HighGrad 9 Married White Male 40 <=50K 0.4520548 0.5333333 0.3979592 0 0
81 81 45 Self-emp-inc Community 10 Married White Male 50 <=50K 0.3835616 0.6000000 0.5000000 0 0
91 91 59 Private Bachelors 13 Not_married White Female 25 <=50K 0.5753425 0.8000000 0.2448980 0 0
95 95 34 Private Master 14 Not_married Amer-Indian-Eskimo Male 40 <=50K 0.2328767 0.8666667 0.3979592 0 0
96 96 20 Private HighGrad 9 Not_married White Male 40 <=50K 0.0410959 0.5333333 0.3979592 0 0
97 97 25 Private Bachelors 13 Not_married White Female 40 <=50K 0.1095890 0.8000000 0.3979592 0 0
102 102 33 Private Community 10 Not_married Black Female 35 <=50K 0.2191781 0.6000000 0.3469388 0 0
111 111 18 Private HighGrad 9 Not_married White Female 48 <=50K 0.0136986 0.5333333 0.4795918 0 0
112 112 28 Private Community 10 Married White Male 40 <=50K 0.1506849 0.6000000 0.3979592 0 0
120 120 43 Private Bachelors 13 Separated White Female 40 >50K 0.3561644 0.8000000 0.3979592 1 1
123 123 19 Private Community 10 Not_married White Male 30 <=50K 0.0273973 0.6000000 0.2959184 0 0
129 129 27 Self-emp-not-inc HighGrad 9 Married White Male 60 >50K 0.1369863 0.5333333 0.6020408 1 1
131 131 41 Private Community 10 Married White Male 40 <=50K 0.3287671 0.6000000 0.3979592 0 0
135 135 57 Private HighGrad 9 Married Black Male 48 <=50K 0.5479452 0.5333333 0.4795918 0 0
141 141 46 Private Master 14 Married White Male 40 <=50K 0.3972603 0.8666667 0.3979592 0 0
144 144 43 Self-emp-inc HighGrad 9 Married White Male 45 >50K 0.3561644 0.5333333 0.4489796 1 1
145 145 34 Private Master 14 Not_married White Female 30 <=50K 0.2328767 0.8666667 0.2959184 0 0
147 147 44 Private Community 11 Widow White Female 30 <=50K 0.3698630 0.6666667 0.2959184 0 0
153 153 50 Private Dropout 4 Married White Male 20 <=50K 0.4520548 0.2000000 0.1938776 0 0
160 160 38 Self-emp-inc Bachelors 13 Separated White Male 40 <=50K 0.2876712 0.8000000 0.3979592 0 0
161 161 55 Private Dropout 7 Married White Male 30 <=50K 0.5205479 0.4000000 0.2959184 0 0
163 163 22 Private HighGrad 9 Married White Male 45 <=50K 0.0684932 0.5333333 0.4489796 0 0
165 165 46 State-gov Master 14 Married White Male 45 >50K 0.3972603 0.8666667 0.4489796 1 1
167 167 58 Self-emp-not-inc PhD 16 Married White Male 16 >50K 0.5616438 1.0000000 0.1530612 1 1
168 168 42 Private HighGrad 9 Married White Male 45 <=50K 0.3424658 0.5333333 0.4489796 0 0
171 171 54 Private HighGrad 9 Married White Male 40 >50K 0.5068493 0.5333333 0.3979592 1 1
172 172 34 Private Master 14 Not_married White Male 40 <=50K 0.2328767 0.8666667 0.3979592 0 0
173 173 26 Private Bachelors 13 Not_married White Female 40 <=50K 0.1232877 0.8000000 0.3979592 0 0
175 175 48 Local-gov Master 14 Separated Black Female 40 <=50K 0.4246575 0.8666667 0.3979592 0 0
176 176 36 Private Community 10 Married White Male 45 >50K 0.2602740 0.6000000 0.4489796 1 1
185 185 44 Private HighGrad 9 Separated Black Female 40 <=50K 0.3698630 0.5333333 0.3979592 0 0
189 189 34 State-gov Bachelors 13 Not_married Black Male 40 <=50K 0.2328767 0.8000000 0.3979592 0 0
193 193 47 Private Community 10 Separated Black Female 37 <=50K 0.4109589 0.6000000 0.3673469 0 0
196 196 31 Private Community 10 Separated White Female 40 <=50K 0.1917808 0.6000000 0.3979592 0 0
  8. Evaluate the prediction model with a confusion matrix
mat.confusion <- table(predicho  = prediccion, real = datos.validacion$income10)

mat.confusion
##         real
## predicho     0     1
##        0 11146     0
##        1     0  3506
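The objective asks for the accuracy value, which can be derived from the confusion matrix. The matrix above has no off-diagonal counts, so every one of the 14652 validation records was classified correctly. A result this perfect is consistent with the `income` column being among the predictors in `income10 ~ .`, so it should be read with caution when comparing against logistic regression. A minimal sketch that rebuilds the matrix from the reported counts and computes the usual metrics (`accuracy`, `sensitivity` and `specificity` are names introduced here):

```r
# Confusion matrix rebuilt from the counts reported above
mat.confusion <- matrix(
  c(11146, 0,
    0,     3506),
  nrow = 2, byrow = TRUE,
  dimnames = list(predicho = c("0", "1"), real = c("0", "1"))
)

# Accuracy: correctly classified records over all records
accuracy <- sum(diag(mat.confusion)) / sum(mat.confusion)

# Sensitivity: share of real class 1 records predicted as 1
sensitivity <- mat.confusion["1", "1"] / sum(mat.confusion[, "1"])

# Specificity: share of real class 0 records predicted as 0
specificity <- mat.confusion["0", "0"] / sum(mat.confusion[, "0"])

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

In the original session the first assignment is unnecessary, since `mat.confusion` already exists; the three formulas can be applied to it directly.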