Objetivo:
Aplicar e interpretar el algoritmo de máquinas de soporte vectorial (SVM) con el conjunto de datos de personas e ingresos en EUA y comparar los resultados obtenidos con el caso 11 de regresión logística, tomando el valor de exactitud.
Descripción:
Construir un modelo de máquinas de soporte vectorial (SVM) aplicado al conjunto de datos de personas y sus ingresos en EUA. Tomando como variable dependiente los ingresos, identificados por 0 y 1, siendo 0 los que ganan por debajo o igual a 50 mil y 1 los que ganan por encima de 50 mil.
Fundamento teórico:
Proceso:
- Cargar librerías
- Cargar datos
- Identificar variables
- Seleccionar datos de entrenamiento y validación
- Crear modelo de SVM
- Analizar/describir el modelo
- Realizar predicciones con datos de validación
- Evaluar el modelo con matríz de confusión
- Interpretar el caso
- Comparar con caso 11.
1. Cargar librerías
library(ggplot2)
library(dplyr)
library(knitr)
library(caret)
library(readr)
library(e1071)
2. Cargar datos
datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/FundamentosMachineLearning/master/datos/adultos_clean.csv", encoding = "UTF-8")
kable(head(datos,10), caption = "Primeros 10 registros")
Primeros 10 registros
| 1 |
25 |
Private |
Dropout |
7 |
Not_married |
Black |
Male |
40 |
<=50K |
0.1095890 |
0.4000000 |
0.3979592 |
0 |
| 2 |
38 |
Private |
HighGrad |
9 |
Married |
White |
Male |
50 |
<=50K |
0.2876712 |
0.5333333 |
0.5000000 |
0 |
| 3 |
28 |
Local-gov |
Community |
12 |
Married |
White |
Male |
40 |
>50K |
0.1506849 |
0.7333333 |
0.3979592 |
1 |
| 4 |
44 |
Private |
Community |
10 |
Married |
Black |
Male |
40 |
>50K |
0.3698630 |
0.6000000 |
0.3979592 |
1 |
| 5 |
18 |
? |
Community |
10 |
Not_married |
White |
Female |
30 |
<=50K |
0.0136986 |
0.6000000 |
0.2959184 |
0 |
| 6 |
34 |
Private |
Dropout |
6 |
Not_married |
White |
Male |
30 |
<=50K |
0.2328767 |
0.3333333 |
0.2959184 |
0 |
| 7 |
29 |
? |
HighGrad |
9 |
Not_married |
Black |
Male |
40 |
<=50K |
0.1643836 |
0.5333333 |
0.3979592 |
0 |
| 8 |
63 |
Self-emp-not-inc |
Master |
15 |
Married |
White |
Male |
32 |
>50K |
0.6301370 |
0.9333333 |
0.3163265 |
1 |
| 9 |
24 |
Private |
Community |
10 |
Not_married |
White |
Female |
40 |
<=50K |
0.0958904 |
0.6000000 |
0.3979592 |
0 |
| 10 |
55 |
Private |
Dropout |
4 |
Married |
White |
Male |
10 |
<=50K |
0.5205479 |
0.2000000 |
0.0918367 |
0 |
kable(tail(datos,10), caption = "Últimos 10 registros")
Últimos 10 registros
| 48833 |
48833 |
32 |
Private |
Dropout |
6 |
Married |
Amer-Indian-Eskimo |
Male |
40 |
<=50K |
0.2054795 |
0.3333333 |
0.3979592 |
0 |
| 48834 |
48834 |
43 |
Private |
Community |
11 |
Married |
White |
Male |
45 |
<=50K |
0.3561644 |
0.6666667 |
0.4489796 |
0 |
| 48835 |
48835 |
32 |
Private |
Master |
14 |
Not_married |
Asian-Pac-Islander |
Male |
11 |
<=50K |
0.2054795 |
0.8666667 |
0.1020408 |
0 |
| 48836 |
48836 |
53 |
Private |
Master |
14 |
Married |
White |
Male |
40 |
>50K |
0.4931507 |
0.8666667 |
0.3979592 |
1 |
| 48837 |
48837 |
22 |
Private |
Community |
10 |
Not_married |
White |
Male |
40 |
<=50K |
0.0684932 |
0.6000000 |
0.3979592 |
0 |
| 48838 |
48838 |
27 |
Private |
Community |
12 |
Married |
White |
Female |
38 |
<=50K |
0.1369863 |
0.7333333 |
0.3775510 |
0 |
| 48839 |
48839 |
40 |
Private |
HighGrad |
9 |
Married |
White |
Male |
40 |
>50K |
0.3150685 |
0.5333333 |
0.3979592 |
1 |
| 48840 |
48840 |
58 |
Private |
HighGrad |
9 |
Widow |
White |
Female |
40 |
<=50K |
0.5616438 |
0.5333333 |
0.3979592 |
0 |
| 48841 |
48841 |
22 |
Private |
HighGrad |
9 |
Not_married |
White |
Male |
20 |
<=50K |
0.0684932 |
0.5333333 |
0.1938776 |
0 |
| 48842 |
48842 |
52 |
Self-emp-inc |
HighGrad |
9 |
Married |
White |
Female |
40 |
>50K |
0.4794521 |
0.5333333 |
0.3979592 |
1 |
3. Identificar variables
Convertimos la variable income10 a factor
datos$income10 <- factor(datos$income10)
4. Seleccionar datos de entrenamiento y validación
Conjuntos de entrenamiento y validación (70-30)
set.seed(2020)
particion <- createDataPartition(y = datos$income10, p = 0.7, list = FALSE, times = 1)
conjunto_entrenamiento <- datos[particion,]
conjunto_validacion <- datos[-particion,]
kable(head(conjunto_entrenamiento,10), caption = "Primeros 10 registros de entrenamiento", row.names = 1:nrow(conjunto_entrenamiento))
Primeros 10 registros de entrenamiento
| 1 |
1 |
25 |
Private |
Dropout |
7 |
Not_married |
Black |
Male |
40 |
<=50K |
0.1095890 |
0.4000000 |
0.3979592 |
0 |
| 2 |
2 |
38 |
Private |
HighGrad |
9 |
Married |
White |
Male |
50 |
<=50K |
0.2876712 |
0.5333333 |
0.5000000 |
0 |
| 3 |
3 |
28 |
Local-gov |
Community |
12 |
Married |
White |
Male |
40 |
>50K |
0.1506849 |
0.7333333 |
0.3979592 |
1 |
| 4 |
4 |
44 |
Private |
Community |
10 |
Married |
Black |
Male |
40 |
>50K |
0.3698630 |
0.6000000 |
0.3979592 |
1 |
| 5 |
5 |
18 |
? |
Community |
10 |
Not_married |
White |
Female |
30 |
<=50K |
0.0136986 |
0.6000000 |
0.2959184 |
0 |
| 7 |
7 |
29 |
? |
HighGrad |
9 |
Not_married |
Black |
Male |
40 |
<=50K |
0.1643836 |
0.5333333 |
0.3979592 |
0 |
| 8 |
8 |
63 |
Self-emp-not-inc |
Master |
15 |
Married |
White |
Male |
32 |
>50K |
0.6301370 |
0.9333333 |
0.3163265 |
1 |
| 9 |
9 |
24 |
Private |
Community |
10 |
Not_married |
White |
Female |
40 |
<=50K |
0.0958904 |
0.6000000 |
0.3979592 |
0 |
| 10 |
10 |
55 |
Private |
Dropout |
4 |
Married |
White |
Male |
10 |
<=50K |
0.5205479 |
0.2000000 |
0.0918367 |
0 |
| 11 |
11 |
65 |
Private |
HighGrad |
9 |
Married |
White |
Male |
40 |
>50K |
0.6575342 |
0.5333333 |
0.3979592 |
1 |
kable(head(conjunto_validacion,10), caption = "Primeros 10 registros de validación", row.names = 1:nrow(conjunto_validacion))
Primeros 10 registros de validación
| 6 |
6 |
34 |
Private |
Dropout |
6 |
Not_married |
White |
Male |
30 |
<=50K |
0.2328767 |
0.3333333 |
0.2959184 |
0 |
| 15 |
15 |
48 |
Private |
HighGrad |
9 |
Married |
White |
Male |
48 |
>50K |
0.4246575 |
0.5333333 |
0.4795918 |
1 |
| 17 |
17 |
20 |
State-gov |
Community |
10 |
Not_married |
White |
Male |
25 |
<=50K |
0.0410959 |
0.6000000 |
0.2448980 |
0 |
| 36 |
36 |
65 |
? |
HighGrad |
9 |
Married |
White |
Male |
40 |
<=50K |
0.6575342 |
0.5333333 |
0.3979592 |
0 |
| 41 |
41 |
65 |
Private |
Master |
14 |
Married |
White |
Male |
50 |
>50K |
0.6575342 |
0.8666667 |
0.5000000 |
1 |
| 49 |
49 |
52 |
Private |
Dropout |
7 |
Separated |
Black |
Female |
18 |
<=50K |
0.4794521 |
0.4000000 |
0.1734694 |
0 |
| 51 |
51 |
18 |
Private |
Community |
10 |
Not_married |
White |
Male |
20 |
<=50K |
0.0136986 |
0.6000000 |
0.1938776 |
0 |
| 54 |
54 |
22 |
Private |
HighGrad |
9 |
Not_married |
White |
Male |
60 |
>50K |
0.0684932 |
0.5333333 |
0.6020408 |
1 |
| 56 |
56 |
21 |
Private |
Community |
10 |
Not_married |
White |
Female |
40 |
<=50K |
0.0547945 |
0.6000000 |
0.3979592 |
0 |
| 58 |
58 |
34 |
Local-gov |
Bachelors |
13 |
Married |
White |
Male |
50 |
>50K |
0.2328767 |
0.8000000 |
0.5000000 |
1 |
5. Crear modelo de SVM
Kernel líneal
modelo_1 <- svm(income10 ~ ., data = conjunto_entrenamiento, kernel = 'linear', scale = TRUE, cost = .05)
6. Analizar/describir el modelo
summary(modelo_1)
Call:
svm(formula = income10 ~ ., data = conjunto_entrenamiento, kernel = "linear",
cost = 0.05, scale = TRUE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: linear
cost: 0.05
Number of Support Vectors: 246
( 143 103 )
Number of Classes: 2
Levels:
0 1
7. Realizar predicciones con datos de validación
predicciones <- predict(modelo_1, conjunto_validacion)
Nueva columna
predicciones_df <- data.frame(conjunto_validacion, predicciones)
kable(head(predicciones_df,10), caption = "Predicciones primeros 10 registros")
Predicciones primeros 10 registros
| 6 |
6 |
34 |
Private |
Dropout |
6 |
Not_married |
White |
Male |
30 |
<=50K |
0.2328767 |
0.3333333 |
0.2959184 |
0 |
0 |
| 15 |
15 |
48 |
Private |
HighGrad |
9 |
Married |
White |
Male |
48 |
>50K |
0.4246575 |
0.5333333 |
0.4795918 |
1 |
1 |
| 17 |
17 |
20 |
State-gov |
Community |
10 |
Not_married |
White |
Male |
25 |
<=50K |
0.0410959 |
0.6000000 |
0.2448980 |
0 |
0 |
| 36 |
36 |
65 |
? |
HighGrad |
9 |
Married |
White |
Male |
40 |
<=50K |
0.6575342 |
0.5333333 |
0.3979592 |
0 |
0 |
| 41 |
41 |
65 |
Private |
Master |
14 |
Married |
White |
Male |
50 |
>50K |
0.6575342 |
0.8666667 |
0.5000000 |
1 |
1 |
| 49 |
49 |
52 |
Private |
Dropout |
7 |
Separated |
Black |
Female |
18 |
<=50K |
0.4794521 |
0.4000000 |
0.1734694 |
0 |
0 |
| 51 |
51 |
18 |
Private |
Community |
10 |
Not_married |
White |
Male |
20 |
<=50K |
0.0136986 |
0.6000000 |
0.1938776 |
0 |
0 |
| 54 |
54 |
22 |
Private |
HighGrad |
9 |
Not_married |
White |
Male |
60 |
>50K |
0.0684932 |
0.5333333 |
0.6020408 |
1 |
1 |
| 56 |
56 |
21 |
Private |
Community |
10 |
Not_married |
White |
Female |
40 |
<=50K |
0.0547945 |
0.6000000 |
0.3979592 |
0 |
0 |
| 58 |
58 |
34 |
Local-gov |
Bachelors |
13 |
Married |
White |
Male |
50 |
>50K |
0.2328767 |
0.8000000 |
0.5000000 |
1 |
1 |
8. Evaluar el modelo con matríz de confusión
matriz_conf <- table(predicho = predicciones, real = conjunto_validacion$income10)
matriz_conf
real
predicho 0 1
0 11146 0
1 0 3506
9. Interpretar el caso
Hasta el momento no encuentro alguna referencia que me ayude a entender porque da el 100% de accuracy, pero sí revisamos el dataframe con las predicciones realizadas con el modelo, podemos ver que el modelo predice exactamente el valor de income10.En la matríz de predicción observamos que 11,146 personas obtienen 50 mil o menos de ingresos, mientras que 3,506 obtienen un ingreso mayor a 50 mil dolares.
10. Comparar con caso 11.
La diferencia con el caso 11 es que con máquina de soporte vectorial (kernel líneal) todas las predicciones realizadas son certeras, mientras que con regresión logística se obtuvo que 24,135 personas ganan 50 mil o menos de dicha cantidad y 4,107 ganan más de 50 mil, siendo este valor mucho más alto que el obtenido aquí.