Objetivo:

Aplicar e interpretar el algoritmo de máquinas de soporte vectorial (SVM) con el conjunto de datos de personas e ingresos en EUA y comparar los resultados obtenidos con el caso 11 de regresión logística, tomando el valor de exactitud.

Descripción:

Construir un modelo de máquinas de soporte vectorial (SVM) aplicado al conjunto de datos de personas y sus ingresos en EUA. Tomando como variable dependiente los ingresos, identificados por 0 y 1, siendo 0 los que ganan por debajo o igual a 50 mil y 1 los que ganan por encima de 50 mil.

Fundamento teórico:

Proceso:

1. Cargar librerías

library(ggplot2)
library(dplyr)
library(knitr)
library(caret)
library(readr)
library(e1071)

2. Cargar datos

datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/FundamentosMachineLearning/master/datos/adultos_clean.csv", encoding = "UTF-8")

kable(head(datos,10), caption = "Primeros 10 registros")
Primeros 10 registros
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
1 25 Private Dropout 7 Not_married Black Male 40 <=50K 0.1095890 0.4000000 0.3979592 0
2 38 Private HighGrad 9 Married White Male 50 <=50K 0.2876712 0.5333333 0.5000000 0
3 28 Local-gov Community 12 Married White Male 40 >50K 0.1506849 0.7333333 0.3979592 1
4 44 Private Community 10 Married Black Male 40 >50K 0.3698630 0.6000000 0.3979592 1
5 18 ? Community 10 Not_married White Female 30 <=50K 0.0136986 0.6000000 0.2959184 0
6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0
7 29 ? HighGrad 9 Not_married Black Male 40 <=50K 0.1643836 0.5333333 0.3979592 0
8 63 Self-emp-not-inc Master 15 Married White Male 32 >50K 0.6301370 0.9333333 0.3163265 1
9 24 Private Community 10 Not_married White Female 40 <=50K 0.0958904 0.6000000 0.3979592 0
10 55 Private Dropout 4 Married White Male 10 <=50K 0.5205479 0.2000000 0.0918367 0
kable(tail(datos,10), caption = "Últimos 10 registros")
Últimos 10 registros
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
48833 48833 32 Private Dropout 6 Married Amer-Indian-Eskimo Male 40 <=50K 0.2054795 0.3333333 0.3979592 0
48834 48834 43 Private Community 11 Married White Male 45 <=50K 0.3561644 0.6666667 0.4489796 0
48835 48835 32 Private Master 14 Not_married Asian-Pac-Islander Male 11 <=50K 0.2054795 0.8666667 0.1020408 0
48836 48836 53 Private Master 14 Married White Male 40 >50K 0.4931507 0.8666667 0.3979592 1
48837 48837 22 Private Community 10 Not_married White Male 40 <=50K 0.0684932 0.6000000 0.3979592 0
48838 48838 27 Private Community 12 Married White Female 38 <=50K 0.1369863 0.7333333 0.3775510 0
48839 48839 40 Private HighGrad 9 Married White Male 40 >50K 0.3150685 0.5333333 0.3979592 1
48840 48840 58 Private HighGrad 9 Widow White Female 40 <=50K 0.5616438 0.5333333 0.3979592 0
48841 48841 22 Private HighGrad 9 Not_married White Male 20 <=50K 0.0684932 0.5333333 0.1938776 0
48842 48842 52 Self-emp-inc HighGrad 9 Married White Female 40 >50K 0.4794521 0.5333333 0.3979592 1

3. Identificar variables

Convertimos la variable income10 a factor

datos$income10 <- factor(datos$income10)

4. Seleccionar datos de entrenamiento y validación

Conjuntos de entrenamiento y validación (70-30)

set.seed(2020)

particion <- createDataPartition(y = datos$income10, p = 0.7, list = FALSE, times = 1)

conjunto_entrenamiento <- datos[particion,]
conjunto_validacion <- datos[-particion,]

kable(head(conjunto_entrenamiento,10), caption = "Primeros 10 registros de entrenamiento", row.names = 1:nrow(conjunto_entrenamiento))
Primeros 10 registros de entrenamiento
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
1 1 25 Private Dropout 7 Not_married Black Male 40 <=50K 0.1095890 0.4000000 0.3979592 0
2 2 38 Private HighGrad 9 Married White Male 50 <=50K 0.2876712 0.5333333 0.5000000 0
3 3 28 Local-gov Community 12 Married White Male 40 >50K 0.1506849 0.7333333 0.3979592 1
4 4 44 Private Community 10 Married Black Male 40 >50K 0.3698630 0.6000000 0.3979592 1
5 5 18 ? Community 10 Not_married White Female 30 <=50K 0.0136986 0.6000000 0.2959184 0
7 7 29 ? HighGrad 9 Not_married Black Male 40 <=50K 0.1643836 0.5333333 0.3979592 0
8 8 63 Self-emp-not-inc Master 15 Married White Male 32 >50K 0.6301370 0.9333333 0.3163265 1
9 9 24 Private Community 10 Not_married White Female 40 <=50K 0.0958904 0.6000000 0.3979592 0
10 10 55 Private Dropout 4 Married White Male 10 <=50K 0.5205479 0.2000000 0.0918367 0
11 11 65 Private HighGrad 9 Married White Male 40 >50K 0.6575342 0.5333333 0.3979592 1
kable(head(conjunto_validacion,10), caption = "Primeros 10 registros de validación", row.names = 1:nrow(conjunto_validacion))
Primeros 10 registros de validación
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10
6 6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0
15 15 48 Private HighGrad 9 Married White Male 48 >50K 0.4246575 0.5333333 0.4795918 1
17 17 20 State-gov Community 10 Not_married White Male 25 <=50K 0.0410959 0.6000000 0.2448980 0
36 36 65 ? HighGrad 9 Married White Male 40 <=50K 0.6575342 0.5333333 0.3979592 0
41 41 65 Private Master 14 Married White Male 50 >50K 0.6575342 0.8666667 0.5000000 1
49 49 52 Private Dropout 7 Separated Black Female 18 <=50K 0.4794521 0.4000000 0.1734694 0
51 51 18 Private Community 10 Not_married White Male 20 <=50K 0.0136986 0.6000000 0.1938776 0
54 54 22 Private HighGrad 9 Not_married White Male 60 >50K 0.0684932 0.5333333 0.6020408 1
56 56 21 Private Community 10 Not_married White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0
58 58 34 Local-gov Bachelors 13 Married White Male 50 >50K 0.2328767 0.8000000 0.5000000 1

5. Crear modelo de SVM

Kernel líneal

modelo_1 <- svm(income10 ~ ., data = conjunto_entrenamiento, kernel = 'linear', scale = TRUE, cost = .05)

6. Analizar/describir el modelo

summary(modelo_1)

Call:
svm(formula = income10 ~ ., data = conjunto_entrenamiento, kernel = "linear", 
    cost = 0.05, scale = TRUE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  0.05 

Number of Support Vectors:  246

 ( 143 103 )


Number of Classes:  2 

Levels: 
 0 1

7. Realizar predicciones con datos de validación

predicciones <- predict(modelo_1, conjunto_validacion)

Nueva columna

predicciones_df <- data.frame(conjunto_validacion, predicciones)

kable(head(predicciones_df,10), caption = "Predicciones primeros 10 registros")
Predicciones primeros 10 registros
X age workclass education educational.num marital.status race gender hours.per.week income age.scale educational.num.scale hours.per.week.scale income10 predicciones
6 6 34 Private Dropout 6 Not_married White Male 30 <=50K 0.2328767 0.3333333 0.2959184 0 0
15 15 48 Private HighGrad 9 Married White Male 48 >50K 0.4246575 0.5333333 0.4795918 1 1
17 17 20 State-gov Community 10 Not_married White Male 25 <=50K 0.0410959 0.6000000 0.2448980 0 0
36 36 65 ? HighGrad 9 Married White Male 40 <=50K 0.6575342 0.5333333 0.3979592 0 0
41 41 65 Private Master 14 Married White Male 50 >50K 0.6575342 0.8666667 0.5000000 1 1
49 49 52 Private Dropout 7 Separated Black Female 18 <=50K 0.4794521 0.4000000 0.1734694 0 0
51 51 18 Private Community 10 Not_married White Male 20 <=50K 0.0136986 0.6000000 0.1938776 0 0
54 54 22 Private HighGrad 9 Not_married White Male 60 >50K 0.0684932 0.5333333 0.6020408 1 1
56 56 21 Private Community 10 Not_married White Female 40 <=50K 0.0547945 0.6000000 0.3979592 0 0
58 58 34 Local-gov Bachelors 13 Married White Male 50 >50K 0.2328767 0.8000000 0.5000000 1 1

8. Evaluar el modelo con matríz de confusión

matriz_conf <- table(predicho = predicciones, real = conjunto_validacion$income10)

matriz_conf
        real
predicho     0     1
       0 11146     0
       1     0  3506

9. Interpretar el caso

Hasta el momento no encuentro alguna referencia que me ayude a entender porque da el 100% de accuracy, pero sí revisamos el dataframe con las predicciones realizadas con el modelo, podemos ver que el modelo predice exactamente el valor de income10.En la matríz de predicción observamos que 11,146 personas obtienen 50 mil o menos de ingresos, mientras que 3,506 obtienen un ingreso mayor a 50 mil dolares.

10. Comparar con caso 11.

La diferencia con el caso 11 es que con máquina de soporte vectorial (kernel líneal) todas las predicciones realizadas son certeras, mientras que con regresión logística se obtuvo que 24,135 personas ganan 50 mil o menos de dicha cantidad y 4,107 ganan más de 50 mil, siendo este valor mucho más alto que el obtenido aquí.