Objetivo

Aplicar e interpretar el algoritmo de maquinas de soporte vectorial SVM con los datos de personas e ingresos de USA y comparar con el caso 11 de regresión loígistica el valor de exactitud.

Descripción

Construir un modelo de de maquinas de soporte vectorial SVM aplicado a datos de personas y sus ingresos en USA

La variable dependiente es los ingresos identificado por 0 y 1, los ganan por debajo o igual a 50 Mil y los que ganan por encima de 50 Mil.

Cargar librerias

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
library(caret)
## Warning: package 'caret' was built under R version 4.0.3
## Loading required package: lattice
library(readr)
library(knitr)
library(e1071)
## Warning: package 'e1071' was built under R version 4.0.3

Cargar datos

getwd()
## [1] "C:/Users/pc/Documents/RStudio"
datos <- read.csv(file = "adultos_clean.csv", encoding = "UTF-8")
kable(head(datos, 10), caption = "Los primeros 10 registros de datos")
Los primeros 10 registros de datos
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
1 25 Private Dropout 7 Not_married Black Male 40 <=50K 0.1095890 0.4000000 0.3979592 0
2 38 Private HighGrad 9 Married White Male 50 <=50K 0.2876712 0.5333333 0.5000000 0
3 28 Local-gov Community 12 Married White Male 40 >50K 0.1506849 0.7333333 0.3979592 1
4 44 Private Community 10 Married Black Male 40 >50K 0.3698630 0.6000000 0.3979592 1
5 18 ? Community 10 Not_married White Female 30 <=50K 0.0136986 0.6000000 0.2959184 0
6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0
7 29 ? HighGrad 9 Not_married Black Male 40 <=50K 0.1643836 0.5333333 0.3979592 0
8 63 Self-emp-not-inc Master 15 Married White Male 32 >50K 0.6301370 0.9333333 0.3163265 1
9 24 Private Community 10 Not_married White Female 40 <=50K 0.0958904 0.6000000 0.3979592 0
10 55 Private Dropout 4 Married White Male 10 <=50K 0.5205479 0.2000000 0.0918367 0
kable(tail(datos, 10), caption = "Los últimos 10 registros de datos")
Los últimos 10 registros de datos
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
48833 48833 32 Private Dropout 6 Married Amer-Indian-Eskimo Male 40 <=50K 0.2054795 0.3333333 0.3979592 0
48834 48834 43 Private Community 11 Married White Male 45 <=50K 0.3561644 0.6666667 0.4489796 0
48835 48835 32 Private Master 14 Not_married Asian-Pac-Islander Male 11 <=50K 0.2054795 0.8666667 0.1020408 0
48836 48836 53 Private Master 14 Married White Male 40 >50K 0.4931507 0.8666667 0.3979592 1
48837 48837 22 Private Community 10 Not_married White Male 40 <=50K 0.0684932 0.6000000 0.3979592 0
48838 48838 27 Private Community 12 Married White Female 38 <=50K 0.1369863 0.7333333 0.3775510 0
48839 48839 40 Private HighGrad 9 Married White Male 40 >50K 0.3150685 0.5333333 0.3979592 1
48840 48840 58 Private HighGrad 9 Widow White Female 40 <=50K 0.5616438 0.5333333 0.3979592 0
48841 48841 22 Private HighGrad 9 Not_married White Male 20 <=50K 0.0684932 0.5333333 0.1938776 0
48842 48842 52 Self-emp-inc HighGrad 9 Married White Female 40 >50K 0.4794521 0.5333333 0.3979592 1
datos$income10 <- factor(datos$income10)

str(datos)
## 'data.frame':    48842 obs. of  14 variables:
##  $ X                    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age                  : int  25 38 28 44 18 34 29 63 24 55 ...
##  $ workclass            : chr  "Private" "Private" "Local-gov" "Private" ...
##  $ education            : chr  "Dropout" "HighGrad" "Community" "Community" ...
##  $ educational.num      : int  7 9 12 10 10 6 9 15 10 4 ...
##  $ marital.status       : chr  "Not_married" "Married" "Married" "Married" ...
##  $ race                 : chr  "Black" "White" "White" "Black" ...
##  $ gender               : chr  "Male" "Male" "Male" "Male" ...
##  $ hours.per.week       : int  40 50 40 40 30 30 40 32 40 10 ...
##  $ income               : chr  "<=50K" "<=50K" ">50K" ">50K" ...
##  $ age.scale            : num  0.1096 0.2877 0.1507 0.3699 0.0137 ...
##  $ educational.num.scale: num  0.4 0.533 0.733 0.6 0.6 ...
##  $ hours.per.week.scale : num  0.398 0.5 0.398 0.398 0.296 ...
##  $ income10             : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1 ...

Crear datos entrenamiento y validación

70 % datos de entrenamiento
30 % datos de validación
set.seed(2020)
entrena <- createDataPartition(y = datos$income10, p = 0.7, list = FALSE, times = 1)

datos.entrenamiento <- datos[entrena, ] 
datos.validacion <- datos[-entrena, ]

kable(head(datos.entrenamiento, 10), caption = "Datos de entrenamiento  (primeros diez)", row.names = 1:nrow(datos.entrenamiento))
## Warning in if (is.na(row.names)) row.names = has_rownames(x): la condición tiene
## longitud > 1 y sólo el primer elemento será usado
## Warning in if (row.names) {: la condición tiene longitud > 1 y sólo el primer
## elemento será usado
Datos de entrenamiento (primeros diez)
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
1 1 25 Private Dropout 7 Not_married Black Male 40 <=50K 0.1095890 0.4000000 0.3979592 0
2 2 38 Private HighGrad 9 Married White Male 50 <=50K 0.2876712 0.5333333 0.5000000 0
3 3 28 Local-gov Community 12 Married White Male 40 >50K 0.1506849 0.7333333 0.3979592 1
4 4 44 Private Community 10 Married Black Male 40 >50K 0.3698630 0.6000000 0.3979592 1
5 5 18 ? Community 10 Not_married White Female 30 <=50K 0.0136986 0.6000000 0.2959184 0
7 7 29 ? HighGrad 9 Not_married Black Male 40 <=50K 0.1643836 0.5333333 0.3979592 0
8 8 63 Self-emp-not-inc Master 15 Married White Male 32 >50K 0.6301370 0.9333333 0.3163265 1
9 9 24 Private Community 10 Not_married White Female 40 <=50K 0.0958904 0.6000000 0.3979592 0
10 10 55 Private Dropout 4 Married White Male 10 <=50K 0.5205479 0.2000000 0.0918367 0
11 11 65 Private HighGrad 9 Married White Male 40 >50K 0.6575342 0.5333333 0.3979592 1
kable(head(datos.validacion, 10), caption = "Datos de validación  (primeros diez)", row.names = 1:nrow(datos.entrenamiento))
## Warning in if (is.na(row.names)) row.names = has_rownames(x): la condición tiene
## longitud > 1 y sólo el primer elemento será usado
## Warning in if (row.names) {: la condición tiene longitud > 1 y sólo el primer
## elemento será usado
Datos de validación (primeros diez)
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
6 6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0
15 15 48 Private HighGrad 9 Married White Male 48 >50K 0.4246575 0.5333333 0.4795918 1
17 17 20 State-gov Community 10 Not_married White Male 25 <=50K 0.0410959 0.6000000 0.2448980 0
36 36 65 ? HighGrad 9 Married White Male 40 <=50K 0.6575342 0.5333333 0.3979592 0
41 41 65 Private Master 14 Married White Male 50 >50K 0.6575342 0.8666667 0.5000000 1
49 49 52 Private Dropout 7 Separated Black Female 18 <=50K 0.4794521 0.4000000 0.1734694 0
51 51 18 Private Community 10 Not_married White Male 20 <=50K 0.0136986 0.6000000 0.1938776 0
54 54 22 Private HighGrad 9 Not_married White Male 60 >50K 0.0684932 0.5333333 0.6020408 1
56 56 21 Private Community 10 Not_married White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0
58 58 34 Local-gov Bachelors 13 Married White Male 50 >50K 0.2328767 0.8000000 0.5000000 1

Creacion del modelo SVM

modelo1 <- svm(income10 ~ ., data = datos.entrenamiento, kernel = "linear", scale = TRUE, cost = .05)

Analizar y/o describir el modelo

summary(modelo1)
## 
## Call:
## svm(formula = income10 ~ ., data = datos.entrenamiento, kernel = "linear", 
##     cost = 0.05, scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.05 
## 
## Number of Support Vectors:  246
## 
##  ( 143 103 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Prediccion del modelo

prediccion <- predict(modelo1, datos.validacion)
datos.validacion <- cbind(datos.validacion, prediccion = prediccion)
kable(head(datos.validacion, 50), caption = "Las predicciones, primeros 10 registros")
Las predicciones, primeros 10 registros
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10 prediccion
6 6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0 0
15 15 48 Private HighGrad 9 Married White Male 48 >50K 0.4246575 0.5333333 0.4795918 1 1
17 17 20 State-gov Community 10 Not_married White Male 25 <=50K 0.0410959 0.6000000 0.2448980 0 0
36 36 65 ? HighGrad 9 Married White Male 40 <=50K 0.6575342 0.5333333 0.3979592 0 0
41 41 65 Private Master 14 Married White Male 50 >50K 0.6575342 0.8666667 0.5000000 1 1
49 49 52 Private Dropout 7 Separated Black Female 18 <=50K 0.4794521 0.4000000 0.1734694 0 0
51 51 18 Private Community 10 Not_married White Male 20 <=50K 0.0136986 0.6000000 0.1938776 0 0
54 54 22 Private HighGrad 9 Not_married White Male 60 >50K 0.0684932 0.5333333 0.6020408 1 1
56 56 21 Private Community 10 Not_married White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0 0
58 58 34 Local-gov Bachelors 13 Married White Male 50 >50K 0.2328767 0.8000000 0.5000000 1 1
59 59 42 Self-emp-inc HighGrad 9 Married White Male 50 >50K 0.3424658 0.5333333 0.5000000 1 1
61 61 30 Private Bachelors 13 Not_married White Female 50 <=50K 0.1780822 0.8000000 0.5000000 0 0
64 64 33 Private HighGrad 9 Not_married White Female 40 <=50K 0.2191781 0.5333333 0.3979592 0 0
68 68 19 Private Community 10 Not_married White Male 20 <=50K 0.0273973 0.6000000 0.1938776 0 0
74 74 21 Private Community 10 Separated White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0 0
77 77 41 Private HighGrad 9 Married White Male 50 <=50K 0.3287671 0.5333333 0.5000000 0 0
79 79 50 Private HighGrad 9 Married White Male 40 <=50K 0.4520548 0.5333333 0.3979592 0 0
81 81 45 Self-emp-inc Community 10 Married White Male 50 <=50K 0.3835616 0.6000000 0.5000000 0 0
91 91 59 Private Bachelors 13 Not_married White Female 25 <=50K 0.5753425 0.8000000 0.2448980 0 0
95 95 34 Private Master 14 Not_married Amer-Indian-Eskimo Male 40 <=50K 0.2328767 0.8666667 0.3979592 0 0
96 96 20 Private HighGrad 9 Not_married White Male 40 <=50K 0.0410959 0.5333333 0.3979592 0 0
97 97 25 Private Bachelors 13 Not_married White Female 40 <=50K 0.1095890 0.8000000 0.3979592 0 0
102 102 33 Private Community 10 Not_married Black Female 35 <=50K 0.2191781 0.6000000 0.3469388 0 0
111 111 18 Private HighGrad 9 Not_married White Female 48 <=50K 0.0136986 0.5333333 0.4795918 0 0
112 112 28 Private Community 10 Married White Male 40 <=50K 0.1506849 0.6000000 0.3979592 0 0
120 120 43 Private Bachelors 13 Separated White Female 40 >50K 0.3561644 0.8000000 0.3979592 1 1
123 123 19 Private Community 10 Not_married White Male 30 <=50K 0.0273973 0.6000000 0.2959184 0 0
129 129 27 Self-emp-not-inc HighGrad 9 Married White Male 60 >50K 0.1369863 0.5333333 0.6020408 1 1
131 131 41 Private Community 10 Married White Male 40 <=50K 0.3287671 0.6000000 0.3979592 0 0
135 135 57 Private HighGrad 9 Married Black Male 48 <=50K 0.5479452 0.5333333 0.4795918 0 0
141 141 46 Private Master 14 Married White Male 40 <=50K 0.3972603 0.8666667 0.3979592 0 0
144 144 43 Self-emp-inc HighGrad 9 Married White Male 45 >50K 0.3561644 0.5333333 0.4489796 1 1
145 145 34 Private Master 14 Not_married White Female 30 <=50K 0.2328767 0.8666667 0.2959184 0 0
147 147 44 Private Community 11 Widow White Female 30 <=50K 0.3698630 0.6666667 0.2959184 0 0
153 153 50 Private Dropout 4 Married White Male 20 <=50K 0.4520548 0.2000000 0.1938776 0 0
160 160 38 Self-emp-inc Bachelors 13 Separated White Male 40 <=50K 0.2876712 0.8000000 0.3979592 0 0
161 161 55 Private Dropout 7 Married White Male 30 <=50K 0.5205479 0.4000000 0.2959184 0 0
163 163 22 Private HighGrad 9 Married White Male 45 <=50K 0.0684932 0.5333333 0.4489796 0 0
165 165 46 State-gov Master 14 Married White Male 45 >50K 0.3972603 0.8666667 0.4489796 1 1
167 167 58 Self-emp-not-inc PhD 16 Married White Male 16 >50K 0.5616438 1.0000000 0.1530612 1 1
168 168 42 Private HighGrad 9 Married White Male 45 <=50K 0.3424658 0.5333333 0.4489796 0 0
171 171 54 Private HighGrad 9 Married White Male 40 >50K 0.5068493 0.5333333 0.3979592 1 1
172 172 34 Private Master 14 Not_married White Male 40 <=50K 0.2328767 0.8666667 0.3979592 0 0
173 173 26 Private Bachelors 13 Not_married White Female 40 <=50K 0.1232877 0.8000000 0.3979592 0 0
175 175 48 Local-gov Master 14 Separated Black Female 40 <=50K 0.4246575 0.8666667 0.3979592 0 0
176 176 36 Private Community 10 Married White Male 45 >50K 0.2602740 0.6000000 0.4489796 1 1
185 185 44 Private HighGrad 9 Separated Black Female 40 <=50K 0.3698630 0.5333333 0.3979592 0 0
189 189 34 State-gov Bachelors 13 Not_married Black Male 40 <=50K 0.2328767 0.8000000 0.3979592 0 0
193 193 47 Private Community 10 Separated Black Female 37 <=50K 0.4109589 0.6000000 0.3673469 0 0
196 196 31 Private Community 10 Separated White Female 40 <=50K 0.1917808 0.6000000 0.3979592 0 0
mat.confusion <- table(predicho  = prediccion, real = datos.validacion$income10)

mat.confusion
##         real
## predicho     0     1
##        0 11146     0
##        1     0  3506

Interpretacion

¿Que es SVM?

Maquina de vectores de soporte es un algoritmo de machine learning que puede emplear para clasificacion binaria o regresion.

Lineal K(x1,x2)=xT1x2 = Aprendizaje de dos clases

En este caos utilizamos el metodo lineal de Kernel

En donde por medio de la prediccion de los datos de entrena(x1) y valida(x2) se crea el modelo por medio de la funcion

modelo1 <- svm(..)

El metodo de kernel lleva los datos a un espacio vectorial donde aplica metodos lineales, en nuestro caso se hizo un vector de 246 con un rango aprox de ( 143 103 ).

Por uso de mapero los datos originales capturan caracteristicas reelevantes siguiendo una formula.

Especificamente no puedo explicar el metodo de kernel ya que es un algoritmo basado en distintas formulas pero por medio de R se puede facilitar el uso de el.