Objetivo

Aplicar e interpretar el algoritmo de maquinas de soporte vectorial SVM con los datos de personas e ingresos de USA y comparar con el caso 11 de regresión loígistica el valor de exactitud.

Descripción

Construir un modelo de de maquinas de soporte vectorial SVM aplicado a datos de personas y sus ingresos en USA

La variable dependiente es los ingresos identificado por 0 y 1, los ganan por debajo o igual a 50 Mil y los que ganan por encima de 50 Mil.

Fundamento teórico

Pendiente que es el algoritmo de MSV ?

Proceso

1. Cargar librerías

  • Posiblemente se utilicen algunas de ellas
library(ggplot2)
library(dplyr)
library(knitr)
library(caret)
library(readr)
library(knitr)
library(e1071)

2. Cargar datos

# datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/FundamentosMachineLearning/master/datos/adultos_clean.csv", encoding = "UTF-8")

datos <- read.csv("../datos/adultos_clean.csv")

# kable(head(datos))
# kable(tail(datos))
kable(head(datos, 10), caption = "Los primeros 10 registros de datos")
Los primeros 10 registros de datos
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
1 25 Private Dropout 7 Not_married Black Male 40 <=50K 0.1095890 0.4000000 0.3979592 0
2 38 Private HighGrad 9 Married White Male 50 <=50K 0.2876712 0.5333333 0.5000000 0
3 28 Local-gov Community 12 Married White Male 40 >50K 0.1506849 0.7333333 0.3979592 1
4 44 Private Community 10 Married Black Male 40 >50K 0.3698630 0.6000000 0.3979592 1
5 18 ? Community 10 Not_married White Female 30 <=50K 0.0136986 0.6000000 0.2959184 0
6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0
7 29 ? HighGrad 9 Not_married Black Male 40 <=50K 0.1643836 0.5333333 0.3979592 0
8 63 Self-emp-not-inc Master 15 Married White Male 32 >50K 0.6301370 0.9333333 0.3163265 1
9 24 Private Community 10 Not_married White Female 40 <=50K 0.0958904 0.6000000 0.3979592 0
10 55 Private Dropout 4 Married White Male 10 <=50K 0.5205479 0.2000000 0.0918367 0
kable(tail(datos, 10), caption = "Los últimos 10 registros de datos")
Los últimos 10 registros de datos
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
48833 48833 32 Private Dropout 6 Married Amer-Indian-Eskimo Male 40 <=50K 0.2054795 0.3333333 0.3979592 0
48834 48834 43 Private Community 11 Married White Male 45 <=50K 0.3561644 0.6666667 0.4489796 0
48835 48835 32 Private Master 14 Not_married Asian-Pac-Islander Male 11 <=50K 0.2054795 0.8666667 0.1020408 0
48836 48836 53 Private Master 14 Married White Male 40 >50K 0.4931507 0.8666667 0.3979592 1
48837 48837 22 Private Community 10 Not_married White Male 40 <=50K 0.0684932 0.6000000 0.3979592 0
48838 48838 27 Private Community 12 Married White Female 38 <=50K 0.1369863 0.7333333 0.3775510 0
48839 48839 40 Private HighGrad 9 Married White Male 40 >50K 0.3150685 0.5333333 0.3979592 1
48840 48840 58 Private HighGrad 9 Widow White Female 40 <=50K 0.5616438 0.5333333 0.3979592 0
48841 48841 22 Private HighGrad 9 Not_married White Male 20 <=50K 0.0684932 0.5333333 0.1938776 0
48842 48842 52 Self-emp-inc HighGrad 9 Married White Female 40 >50K 0.4794521 0.5333333 0.3979592 1

3. Las variables

  • Convertir a factor la variable dependiente income10
datos$income10 <- factor(datos$income10)

str(datos)
## 'data.frame':    48842 obs. of  14 variables:
##  $ X                    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age                  : int  25 38 28 44 18 34 29 63 24 55 ...
##  $ workclass            : chr  "Private" "Private" "Local-gov" "Private" ...
##  $ education            : chr  "Dropout" "HighGrad" "Community" "Community" ...
##  $ educational.num      : int  7 9 12 10 10 6 9 15 10 4 ...
##  $ marital.status       : chr  "Not_married" "Married" "Married" "Married" ...
##  $ race                 : chr  "Black" "White" "White" "Black" ...
##  $ gender               : chr  "Male" "Male" "Male" "Male" ...
##  $ hours.per.week       : int  40 50 40 40 30 30 40 32 40 10 ...
##  $ income               : chr  "<=50K" "<=50K" ">50K" ">50K" ...
##  $ age.scale            : num  0.1096 0.2877 0.1507 0.3699 0.0137 ...
##  $ educational.num.scale: num  0.4 0.533 0.733 0.6 0.6 ...
##  $ hours.per.week.scale : num  0.398 0.5 0.398 0.398 0.296 ...
##  $ income10             : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1 ...
  • age la edad de la persona
  • workclass es un tipo o clase de trabajo de la persona, privado, gobierno, por su cuenta,
  • education indica el nivel educativo de la persona
  • educational es el valor numérico de education
  • marital.status es su estado civil
  • race es el tipo de raza de persona
  • gender es el género de la persona
  • hours.per.week son las horas que trbaja por semana
  • income son los ingresos
  • age_scale la edad escalada
  • hours.per.week escalada
  • income10 con valores de 0 gana menos de 50 mil y 1 mas de 50 mil

4. Crear datos de entrenamiento y validación

  • 70 % datos de entrenamiento
  • 30 % datos de validación
set.seed(2020)
entrena <- createDataPartition(y = datos$income10, p = 0.7, list = FALSE, times = 1)

# Datos entrenamiento
datos.entrenamiento <- datos[entrena, ]  # [renglones, columna]

# Datos validación
datos.validacion <- datos[-entrena, ]

kable(head(datos.entrenamiento, 10), caption = "Datos de entrenamiento  (primeros diez)", row.names = 1:nrow(datos.entrenamiento))
Datos de entrenamiento (primeros diez)
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
1 1 25 Private Dropout 7 Not_married Black Male 40 <=50K 0.1095890 0.4000000 0.3979592 0
2 2 38 Private HighGrad 9 Married White Male 50 <=50K 0.2876712 0.5333333 0.5000000 0
3 3 28 Local-gov Community 12 Married White Male 40 >50K 0.1506849 0.7333333 0.3979592 1
4 4 44 Private Community 10 Married Black Male 40 >50K 0.3698630 0.6000000 0.3979592 1
5 5 18 ? Community 10 Not_married White Female 30 <=50K 0.0136986 0.6000000 0.2959184 0
7 7 29 ? HighGrad 9 Not_married Black Male 40 <=50K 0.1643836 0.5333333 0.3979592 0
8 8 63 Self-emp-not-inc Master 15 Married White Male 32 >50K 0.6301370 0.9333333 0.3163265 1
9 9 24 Private Community 10 Not_married White Female 40 <=50K 0.0958904 0.6000000 0.3979592 0
10 10 55 Private Dropout 4 Married White Male 10 <=50K 0.5205479 0.2000000 0.0918367 0
11 11 65 Private HighGrad 9 Married White Male 40 >50K 0.6575342 0.5333333 0.3979592 1
kable(head(datos.validacion, 10), caption = "Datos de validación  (primeros diez)", row.names = 1:nrow(datos.entrenamiento))
Datos de validación (primeros diez)
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
6 6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0
15 15 48 Private HighGrad 9 Married White Male 48 >50K 0.4246575 0.5333333 0.4795918 1
17 17 20 State-gov Community 10 Not_married White Male 25 <=50K 0.0410959 0.6000000 0.2448980 0
36 36 65 ? HighGrad 9 Married White Male 40 <=50K 0.6575342 0.5333333 0.3979592 0
41 41 65 Private Master 14 Married White Male 50 >50K 0.6575342 0.8666667 0.5000000 1
49 49 52 Private Dropout 7 Separated Black Female 18 <=50K 0.4794521 0.4000000 0.1734694 0
51 51 18 Private Community 10 Not_married White Male 20 <=50K 0.0136986 0.6000000 0.1938776 0
54 54 22 Private HighGrad 9 Not_married White Male 60 >50K 0.0684932 0.5333333 0.6020408 1
56 56 21 Private Community 10 Not_married White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0
58 58 34 Local-gov Bachelors 13 Married White Male 50 >50K 0.2328767 0.8000000 0.5000000 1
5. Crear modelo de Máquinas de Soporte Vectorial SVM
  • Con el paquete e1071 se genera el modelo de SVM
  • Se utiliza la función svm
  • Kernel lineal
modelo1 <- svm(income10 ~ ., data = datos.entrenamiento, kernel = "linear", scale = TRUE, cost = .05)

6. Analizar y/o describir el modelo

summary(modelo1)
## 
## Call:
## svm(formula = income10 ~ ., data = datos.entrenamiento, kernel = "linear", 
##     cost = 0.05, scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.05 
## 
## Number of Support Vectors:  246
## 
##  ( 143 103 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

7. Realizar predicciones con el conjunto de datos de validación

prediccion <- predict(modelo1, datos.validacion)
  • Agregando columna al final de datos de validación para comparar
datos.validacion <- cbind(datos.validacion, prediccion = prediccion)

kable(head(datos.validacion, 50), caption = "Las predicciones, primeros 10 registros")
Las predicciones, primeros 10 registros
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10 prediccion
6 6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0 0
15 15 48 Private HighGrad 9 Married White Male 48 >50K 0.4246575 0.5333333 0.4795918 1 1
17 17 20 State-gov Community 10 Not_married White Male 25 <=50K 0.0410959 0.6000000 0.2448980 0 0
36 36 65 ? HighGrad 9 Married White Male 40 <=50K 0.6575342 0.5333333 0.3979592 0 0
41 41 65 Private Master 14 Married White Male 50 >50K 0.6575342 0.8666667 0.5000000 1 1
49 49 52 Private Dropout 7 Separated Black Female 18 <=50K 0.4794521 0.4000000 0.1734694 0 0
51 51 18 Private Community 10 Not_married White Male 20 <=50K 0.0136986 0.6000000 0.1938776 0 0
54 54 22 Private HighGrad 9 Not_married White Male 60 >50K 0.0684932 0.5333333 0.6020408 1 1
56 56 21 Private Community 10 Not_married White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0 0
58 58 34 Local-gov Bachelors 13 Married White Male 50 >50K 0.2328767 0.8000000 0.5000000 1 1
59 59 42 Self-emp-inc HighGrad 9 Married White Male 50 >50K 0.3424658 0.5333333 0.5000000 1 1
61 61 30 Private Bachelors 13 Not_married White Female 50 <=50K 0.1780822 0.8000000 0.5000000 0 0
64 64 33 Private HighGrad 9 Not_married White Female 40 <=50K 0.2191781 0.5333333 0.3979592 0 0
68 68 19 Private Community 10 Not_married White Male 20 <=50K 0.0273973 0.6000000 0.1938776 0 0
74 74 21 Private Community 10 Separated White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0 0
77 77 41 Private HighGrad 9 Married White Male 50 <=50K 0.3287671 0.5333333 0.5000000 0 0
79 79 50 Private HighGrad 9 Married White Male 40 <=50K 0.4520548 0.5333333 0.3979592 0 0
81 81 45 Self-emp-inc Community 10 Married White Male 50 <=50K 0.3835616 0.6000000 0.5000000 0 0
91 91 59 Private Bachelors 13 Not_married White Female 25 <=50K 0.5753425 0.8000000 0.2448980 0 0
95 95 34 Private Master 14 Not_married Amer-Indian-Eskimo Male 40 <=50K 0.2328767 0.8666667 0.3979592 0 0
96 96 20 Private HighGrad 9 Not_married White Male 40 <=50K 0.0410959 0.5333333 0.3979592 0 0
97 97 25 Private Bachelors 13 Not_married White Female 40 <=50K 0.1095890 0.8000000 0.3979592 0 0
102 102 33 Private Community 10 Not_married Black Female 35 <=50K 0.2191781 0.6000000 0.3469388 0 0
111 111 18 Private HighGrad 9 Not_married White Female 48 <=50K 0.0136986 0.5333333 0.4795918 0 0
112 112 28 Private Community 10 Married White Male 40 <=50K 0.1506849 0.6000000 0.3979592 0 0
120 120 43 Private Bachelors 13 Separated White Female 40 >50K 0.3561644 0.8000000 0.3979592 1 1
123 123 19 Private Community 10 Not_married White Male 30 <=50K 0.0273973 0.6000000 0.2959184 0 0
129 129 27 Self-emp-not-inc HighGrad 9 Married White Male 60 >50K 0.1369863 0.5333333 0.6020408 1 1
131 131 41 Private Community 10 Married White Male 40 <=50K 0.3287671 0.6000000 0.3979592 0 0
135 135 57 Private HighGrad 9 Married Black Male 48 <=50K 0.5479452 0.5333333 0.4795918 0 0
141 141 46 Private Master 14 Married White Male 40 <=50K 0.3972603 0.8666667 0.3979592 0 0
144 144 43 Self-emp-inc HighGrad 9 Married White Male 45 >50K 0.3561644 0.5333333 0.4489796 1 1
145 145 34 Private Master 14 Not_married White Female 30 <=50K 0.2328767 0.8666667 0.2959184 0 0
147 147 44 Private Community 11 Widow White Female 30 <=50K 0.3698630 0.6666667 0.2959184 0 0
153 153 50 Private Dropout 4 Married White Male 20 <=50K 0.4520548 0.2000000 0.1938776 0 0
160 160 38 Self-emp-inc Bachelors 13 Separated White Male 40 <=50K 0.2876712 0.8000000 0.3979592 0 0
161 161 55 Private Dropout 7 Married White Male 30 <=50K 0.5205479 0.4000000 0.2959184 0 0
163 163 22 Private HighGrad 9 Married White Male 45 <=50K 0.0684932 0.5333333 0.4489796 0 0
165 165 46 State-gov Master 14 Married White Male 45 >50K 0.3972603 0.8666667 0.4489796 1 1
167 167 58 Self-emp-not-inc PhD 16 Married White Male 16 >50K 0.5616438 1.0000000 0.1530612 1 1
168 168 42 Private HighGrad 9 Married White Male 45 <=50K 0.3424658 0.5333333 0.4489796 0 0
171 171 54 Private HighGrad 9 Married White Male 40 >50K 0.5068493 0.5333333 0.3979592 1 1
172 172 34 Private Master 14 Not_married White Male 40 <=50K 0.2328767 0.8666667 0.3979592 0 0
173 173 26 Private Bachelors 13 Not_married White Female 40 <=50K 0.1232877 0.8000000 0.3979592 0 0
175 175 48 Local-gov Master 14 Separated Black Female 40 <=50K 0.4246575 0.8666667 0.3979592 0 0
176 176 36 Private Community 10 Married White Male 45 >50K 0.2602740 0.6000000 0.4489796 1 1
185 185 44 Private HighGrad 9 Separated Black Female 40 <=50K 0.3698630 0.5333333 0.3979592 0 0
189 189 34 State-gov Bachelors 13 Not_married Black Male 40 <=50K 0.2328767 0.8000000 0.3979592 0 0
193 193 47 Private Community 10 Separated Black Female 37 <=50K 0.4109589 0.6000000 0.3673469 0 0
196 196 31 Private Community 10 Separated White Female 40 <=50K 0.1917808 0.6000000 0.3979592 0 0

8. Evaluar el modelo de predicción con matriz de confusión

  • Generando la matriz de confusión
mat.confusion <- table(predicho  = prediccion, real = datos.validacion$income10)

mat.confusion
##         real
## predicho     0     1
##        0 11146     0
##        1     0  3506

El 100% de accuracy = exactitud ???? con el kernel lineal ???