Apply and interpret the support vector machine (SVM) algorithm on the USA people-and-income data, and compare its accuracy with that of the logistic regression in Case 11.
Build a support vector machine (SVM) model applied to data on people and their incomes in the USA.
The dependent variable is income, coded as 0 and 1: 0 for those who earn 50 thousand or less and 1 for those who earn more than 50 thousand.
Pending: what is the SVM algorithm?
library(ggplot2)
library(dplyr)
library(knitr)
library(caret)
library(readr)
library(e1071)
# datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/FundamentosMachineLearning/master/datos/adultos_clean.csv", encoding = "UTF-8")
datos <- read.csv("../datos/adultos_clean.csv")
# kable(head(datos))
# kable(tail(datos))
kable(head(datos, 10), caption = "First 10 records of the data")
| X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 25 | Private | Dropout | 7 | Not_married | Black | Male | 40 | <=50K | 0.1095890 | 0.4000000 | 0.3979592 | 0 |
| 2 | 38 | Private | HighGrad | 9 | Married | White | Male | 50 | <=50K | 0.2876712 | 0.5333333 | 0.5000000 | 0 |
| 3 | 28 | Local-gov | Community | 12 | Married | White | Male | 40 | >50K | 0.1506849 | 0.7333333 | 0.3979592 | 1 |
| 4 | 44 | Private | Community | 10 | Married | Black | Male | 40 | >50K | 0.3698630 | 0.6000000 | 0.3979592 | 1 |
| 5 | 18 | ? | Community | 10 | Not_married | White | Female | 30 | <=50K | 0.0136986 | 0.6000000 | 0.2959184 | 0 |
| 6 | 34 | Private | Dropout | 6 | Not_married | White | Male | 30 | <=50K | 0.2328767 | 0.3333333 | 0.2959184 | 0 |
| 7 | 29 | ? | HighGrad | 9 | Not_married | Black | Male | 40 | <=50K | 0.1643836 | 0.5333333 | 0.3979592 | 0 |
| 8 | 63 | Self-emp-not-inc | Master | 15 | Married | White | Male | 32 | >50K | 0.6301370 | 0.9333333 | 0.3163265 | 1 |
| 9 | 24 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0958904 | 0.6000000 | 0.3979592 | 0 |
| 10 | 55 | Private | Dropout | 4 | Married | White | Male | 10 | <=50K | 0.5205479 | 0.2000000 | 0.0918367 | 0 |
kable(tail(datos, 10), caption = "Last 10 records of the data")
| | X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48833 | 48833 | 32 | Private | Dropout | 6 | Married | Amer-Indian-Eskimo | Male | 40 | <=50K | 0.2054795 | 0.3333333 | 0.3979592 | 0 |
| 48834 | 48834 | 43 | Private | Community | 11 | Married | White | Male | 45 | <=50K | 0.3561644 | 0.6666667 | 0.4489796 | 0 |
| 48835 | 48835 | 32 | Private | Master | 14 | Not_married | Asian-Pac-Islander | Male | 11 | <=50K | 0.2054795 | 0.8666667 | 0.1020408 | 0 |
| 48836 | 48836 | 53 | Private | Master | 14 | Married | White | Male | 40 | >50K | 0.4931507 | 0.8666667 | 0.3979592 | 1 |
| 48837 | 48837 | 22 | Private | Community | 10 | Not_married | White | Male | 40 | <=50K | 0.0684932 | 0.6000000 | 0.3979592 | 0 |
| 48838 | 48838 | 27 | Private | Community | 12 | Married | White | Female | 38 | <=50K | 0.1369863 | 0.7333333 | 0.3775510 | 0 |
| 48839 | 48839 | 40 | Private | HighGrad | 9 | Married | White | Male | 40 | >50K | 0.3150685 | 0.5333333 | 0.3979592 | 1 |
| 48840 | 48840 | 58 | Private | HighGrad | 9 | Widow | White | Female | 40 | <=50K | 0.5616438 | 0.5333333 | 0.3979592 | 0 |
| 48841 | 48841 | 22 | Private | HighGrad | 9 | Not_married | White | Male | 20 | <=50K | 0.0684932 | 0.5333333 | 0.1938776 | 0 |
| 48842 | 48842 | 52 | Self-emp-inc | HighGrad | 9 | Married | White | Female | 40 | >50K | 0.4794521 | 0.5333333 | 0.3979592 | 1 |
datos$income10 <- factor(datos$income10)
str(datos)
## 'data.frame': 48842 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 25 38 28 44 18 34 29 63 24 55 ...
## $ workclass : chr "Private" "Private" "Local-gov" "Private" ...
## $ education : chr "Dropout" "HighGrad" "Community" "Community" ...
## $ educational.num : int 7 9 12 10 10 6 9 15 10 4 ...
## $ marital.status : chr "Not_married" "Married" "Married" "Married" ...
## $ race : chr "Black" "White" "White" "Black" ...
## $ gender : chr "Male" "Male" "Male" "Male" ...
## $ hours.per.week : int 40 50 40 40 30 30 40 32 40 10 ...
## $ income : chr "<=50K" "<=50K" ">50K" ">50K" ...
## $ age.scale : num 0.1096 0.2877 0.1507 0.3699 0.0137 ...
## $ educational.num.scale: num 0.4 0.533 0.733 0.6 0.6 ...
## $ hours.per.week.scale : num 0.398 0.5 0.398 0.398 0.296 ...
## $ income10 : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1 ...
set.seed(2020)
entrena <- createDataPartition(y = datos$income10, p = 0.7, list = FALSE, times = 1)
# Training data
datos.entrenamiento <- datos[entrena, ] # [rows, columns]
# Validation data
datos.validacion <- datos[-entrena, ]
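Because `createDataPartition` samples within each class, the proportion of 0s and 1s should be roughly the same in both subsets. The sketch below illustrates that check on a small synthetic factor; `y` and `idx` are hypothetical stand-ins for `datos$income10` and `entrena`, and the 760/240 class counts are invented for illustration.

```r
library(caret)

set.seed(2020)
# Hypothetical stand-in for datos$income10: 760 zeros and 240 ones
y <- factor(c(rep(0, 760), rep(1, 240)))

# Same call pattern as above: stratified 70/30 split on the outcome
idx <- createDataPartition(y = y, p = 0.7, list = FALSE, times = 1)

# Class proportions in the full vector, the training part and the validation part
prop.table(table(y))
prop.table(table(y[idx]))
prop.table(table(y[-idx]))
```

On the real data, the same three `prop.table()` calls can be applied to `datos$income10`, `datos.entrenamiento$income10` and `datos.validacion$income10`.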
kable(head(datos.entrenamiento, 10), caption = "Training data (first ten)", row.names = TRUE)
| | X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 25 | Private | Dropout | 7 | Not_married | Black | Male | 40 | <=50K | 0.1095890 | 0.4000000 | 0.3979592 | 0 |
| 2 | 2 | 38 | Private | HighGrad | 9 | Married | White | Male | 50 | <=50K | 0.2876712 | 0.5333333 | 0.5000000 | 0 |
| 3 | 3 | 28 | Local-gov | Community | 12 | Married | White | Male | 40 | >50K | 0.1506849 | 0.7333333 | 0.3979592 | 1 |
| 4 | 4 | 44 | Private | Community | 10 | Married | Black | Male | 40 | >50K | 0.3698630 | 0.6000000 | 0.3979592 | 1 |
| 5 | 5 | 18 | ? | Community | 10 | Not_married | White | Female | 30 | <=50K | 0.0136986 | 0.6000000 | 0.2959184 | 0 |
| 7 | 7 | 29 | ? | HighGrad | 9 | Not_married | Black | Male | 40 | <=50K | 0.1643836 | 0.5333333 | 0.3979592 | 0 |
| 8 | 8 | 63 | Self-emp-not-inc | Master | 15 | Married | White | Male | 32 | >50K | 0.6301370 | 0.9333333 | 0.3163265 | 1 |
| 9 | 9 | 24 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0958904 | 0.6000000 | 0.3979592 | 0 |
| 10 | 10 | 55 | Private | Dropout | 4 | Married | White | Male | 10 | <=50K | 0.5205479 | 0.2000000 | 0.0918367 | 0 |
| 11 | 11 | 65 | Private | HighGrad | 9 | Married | White | Male | 40 | >50K | 0.6575342 | 0.5333333 | 0.3979592 | 1 |
kable(head(datos.validacion, 10), caption = "Validation data (first ten)", row.names = TRUE)
| | X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 6 | 34 | Private | Dropout | 6 | Not_married | White | Male | 30 | <=50K | 0.2328767 | 0.3333333 | 0.2959184 | 0 |
| 15 | 15 | 48 | Private | HighGrad | 9 | Married | White | Male | 48 | >50K | 0.4246575 | 0.5333333 | 0.4795918 | 1 |
| 17 | 17 | 20 | State-gov | Community | 10 | Not_married | White | Male | 25 | <=50K | 0.0410959 | 0.6000000 | 0.2448980 | 0 |
| 36 | 36 | 65 | ? | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.6575342 | 0.5333333 | 0.3979592 | 0 |
| 41 | 41 | 65 | Private | Master | 14 | Married | White | Male | 50 | >50K | 0.6575342 | 0.8666667 | 0.5000000 | 1 |
| 49 | 49 | 52 | Private | Dropout | 7 | Separated | Black | Female | 18 | <=50K | 0.4794521 | 0.4000000 | 0.1734694 | 0 |
| 51 | 51 | 18 | Private | Community | 10 | Not_married | White | Male | 20 | <=50K | 0.0136986 | 0.6000000 | 0.1938776 | 0 |
| 54 | 54 | 22 | Private | HighGrad | 9 | Not_married | White | Male | 60 | >50K | 0.0684932 | 0.5333333 | 0.6020408 | 1 |
| 56 | 56 | 21 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0547945 | 0.6000000 | 0.3979592 | 0 |
| 58 | 58 | 34 | Local-gov | Bachelors | 13 | Married | White | Male | 50 | >50K | 0.2328767 | 0.8000000 | 0.5000000 | 1 |
modelo1 <- svm(income10 ~ ., data = datos.entrenamiento, kernel = "linear", scale = TRUE, cost = .05)
summary(modelo1)
##
## Call:
## svm(formula = income10 ~ ., data = datos.entrenamiento, kernel = "linear",
## cost = 0.05, scale = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.05
##
## Number of Support Vectors: 246
##
## ( 143 103 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
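For reference, the `C-classification` type with a linear kernel reported in this summary corresponds to the standard soft-margin SVM. A sketch of the usual primal formulation follows, where $C$ plays the role of the `cost` argument (0.05 here), the labels are recoded as $y_i \in \{-1, +1\}$, and $w$, $b$, $\xi_i$ are conventional symbols that do not appear elsewhere in this document:

```latex
\min_{w,\,b,\,\xi} \quad \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n
```

A small $C$ tolerates more margin violations in exchange for a wider margin. The support vectors counted in the summary are the training points with a non-zero coefficient in the fitted solution.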
prediccion <- predict(modelo1, datos.validacion)
datos.validacion <- cbind(datos.validacion, prediccion = prediccion)
kable(head(datos.validacion, 50), caption = "Predictions, first 50 records")
| | X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 | prediccion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 6 | 34 | Private | Dropout | 6 | Not_married | White | Male | 30 | <=50K | 0.2328767 | 0.3333333 | 0.2959184 | 0 | 0 |
| 15 | 15 | 48 | Private | HighGrad | 9 | Married | White | Male | 48 | >50K | 0.4246575 | 0.5333333 | 0.4795918 | 1 | 1 |
| 17 | 17 | 20 | State-gov | Community | 10 | Not_married | White | Male | 25 | <=50K | 0.0410959 | 0.6000000 | 0.2448980 | 0 | 0 |
| 36 | 36 | 65 | ? | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.6575342 | 0.5333333 | 0.3979592 | 0 | 0 |
| 41 | 41 | 65 | Private | Master | 14 | Married | White | Male | 50 | >50K | 0.6575342 | 0.8666667 | 0.5000000 | 1 | 1 |
| 49 | 49 | 52 | Private | Dropout | 7 | Separated | Black | Female | 18 | <=50K | 0.4794521 | 0.4000000 | 0.1734694 | 0 | 0 |
| 51 | 51 | 18 | Private | Community | 10 | Not_married | White | Male | 20 | <=50K | 0.0136986 | 0.6000000 | 0.1938776 | 0 | 0 |
| 54 | 54 | 22 | Private | HighGrad | 9 | Not_married | White | Male | 60 | >50K | 0.0684932 | 0.5333333 | 0.6020408 | 1 | 1 |
| 56 | 56 | 21 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0547945 | 0.6000000 | 0.3979592 | 0 | 0 |
| 58 | 58 | 34 | Local-gov | Bachelors | 13 | Married | White | Male | 50 | >50K | 0.2328767 | 0.8000000 | 0.5000000 | 1 | 1 |
| 59 | 59 | 42 | Self-emp-inc | HighGrad | 9 | Married | White | Male | 50 | >50K | 0.3424658 | 0.5333333 | 0.5000000 | 1 | 1 |
| 61 | 61 | 30 | Private | Bachelors | 13 | Not_married | White | Female | 50 | <=50K | 0.1780822 | 0.8000000 | 0.5000000 | 0 | 0 |
| 64 | 64 | 33 | Private | HighGrad | 9 | Not_married | White | Female | 40 | <=50K | 0.2191781 | 0.5333333 | 0.3979592 | 0 | 0 |
| 68 | 68 | 19 | Private | Community | 10 | Not_married | White | Male | 20 | <=50K | 0.0273973 | 0.6000000 | 0.1938776 | 0 | 0 |
| 74 | 74 | 21 | Private | Community | 10 | Separated | White | Female | 40 | <=50K | 0.0547945 | 0.6000000 | 0.3979592 | 0 | 0 |
| 77 | 77 | 41 | Private | HighGrad | 9 | Married | White | Male | 50 | <=50K | 0.3287671 | 0.5333333 | 0.5000000 | 0 | 0 |
| 79 | 79 | 50 | Private | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.4520548 | 0.5333333 | 0.3979592 | 0 | 0 |
| 81 | 81 | 45 | Self-emp-inc | Community | 10 | Married | White | Male | 50 | <=50K | 0.3835616 | 0.6000000 | 0.5000000 | 0 | 0 |
| 91 | 91 | 59 | Private | Bachelors | 13 | Not_married | White | Female | 25 | <=50K | 0.5753425 | 0.8000000 | 0.2448980 | 0 | 0 |
| 95 | 95 | 34 | Private | Master | 14 | Not_married | Amer-Indian-Eskimo | Male | 40 | <=50K | 0.2328767 | 0.8666667 | 0.3979592 | 0 | 0 |
| 96 | 96 | 20 | Private | HighGrad | 9 | Not_married | White | Male | 40 | <=50K | 0.0410959 | 0.5333333 | 0.3979592 | 0 | 0 |
| 97 | 97 | 25 | Private | Bachelors | 13 | Not_married | White | Female | 40 | <=50K | 0.1095890 | 0.8000000 | 0.3979592 | 0 | 0 |
| 102 | 102 | 33 | Private | Community | 10 | Not_married | Black | Female | 35 | <=50K | 0.2191781 | 0.6000000 | 0.3469388 | 0 | 0 |
| 111 | 111 | 18 | Private | HighGrad | 9 | Not_married | White | Female | 48 | <=50K | 0.0136986 | 0.5333333 | 0.4795918 | 0 | 0 |
| 112 | 112 | 28 | Private | Community | 10 | Married | White | Male | 40 | <=50K | 0.1506849 | 0.6000000 | 0.3979592 | 0 | 0 |
| 120 | 120 | 43 | Private | Bachelors | 13 | Separated | White | Female | 40 | >50K | 0.3561644 | 0.8000000 | 0.3979592 | 1 | 1 |
| 123 | 123 | 19 | Private | Community | 10 | Not_married | White | Male | 30 | <=50K | 0.0273973 | 0.6000000 | 0.2959184 | 0 | 0 |
| 129 | 129 | 27 | Self-emp-not-inc | HighGrad | 9 | Married | White | Male | 60 | >50K | 0.1369863 | 0.5333333 | 0.6020408 | 1 | 1 |
| 131 | 131 | 41 | Private | Community | 10 | Married | White | Male | 40 | <=50K | 0.3287671 | 0.6000000 | 0.3979592 | 0 | 0 |
| 135 | 135 | 57 | Private | HighGrad | 9 | Married | Black | Male | 48 | <=50K | 0.5479452 | 0.5333333 | 0.4795918 | 0 | 0 |
| 141 | 141 | 46 | Private | Master | 14 | Married | White | Male | 40 | <=50K | 0.3972603 | 0.8666667 | 0.3979592 | 0 | 0 |
| 144 | 144 | 43 | Self-emp-inc | HighGrad | 9 | Married | White | Male | 45 | >50K | 0.3561644 | 0.5333333 | 0.4489796 | 1 | 1 |
| 145 | 145 | 34 | Private | Master | 14 | Not_married | White | Female | 30 | <=50K | 0.2328767 | 0.8666667 | 0.2959184 | 0 | 0 |
| 147 | 147 | 44 | Private | Community | 11 | Widow | White | Female | 30 | <=50K | 0.3698630 | 0.6666667 | 0.2959184 | 0 | 0 |
| 153 | 153 | 50 | Private | Dropout | 4 | Married | White | Male | 20 | <=50K | 0.4520548 | 0.2000000 | 0.1938776 | 0 | 0 |
| 160 | 160 | 38 | Self-emp-inc | Bachelors | 13 | Separated | White | Male | 40 | <=50K | 0.2876712 | 0.8000000 | 0.3979592 | 0 | 0 |
| 161 | 161 | 55 | Private | Dropout | 7 | Married | White | Male | 30 | <=50K | 0.5205479 | 0.4000000 | 0.2959184 | 0 | 0 |
| 163 | 163 | 22 | Private | HighGrad | 9 | Married | White | Male | 45 | <=50K | 0.0684932 | 0.5333333 | 0.4489796 | 0 | 0 |
| 165 | 165 | 46 | State-gov | Master | 14 | Married | White | Male | 45 | >50K | 0.3972603 | 0.8666667 | 0.4489796 | 1 | 1 |
| 167 | 167 | 58 | Self-emp-not-inc | PhD | 16 | Married | White | Male | 16 | >50K | 0.5616438 | 1.0000000 | 0.1530612 | 1 | 1 |
| 168 | 168 | 42 | Private | HighGrad | 9 | Married | White | Male | 45 | <=50K | 0.3424658 | 0.5333333 | 0.4489796 | 0 | 0 |
| 171 | 171 | 54 | Private | HighGrad | 9 | Married | White | Male | 40 | >50K | 0.5068493 | 0.5333333 | 0.3979592 | 1 | 1 |
| 172 | 172 | 34 | Private | Master | 14 | Not_married | White | Male | 40 | <=50K | 0.2328767 | 0.8666667 | 0.3979592 | 0 | 0 |
| 173 | 173 | 26 | Private | Bachelors | 13 | Not_married | White | Female | 40 | <=50K | 0.1232877 | 0.8000000 | 0.3979592 | 0 | 0 |
| 175 | 175 | 48 | Local-gov | Master | 14 | Separated | Black | Female | 40 | <=50K | 0.4246575 | 0.8666667 | 0.3979592 | 0 | 0 |
| 176 | 176 | 36 | Private | Community | 10 | Married | White | Male | 45 | >50K | 0.2602740 | 0.6000000 | 0.4489796 | 1 | 1 |
| 185 | 185 | 44 | Private | HighGrad | 9 | Separated | Black | Female | 40 | <=50K | 0.3698630 | 0.5333333 | 0.3979592 | 0 | 0 |
| 189 | 189 | 34 | State-gov | Bachelors | 13 | Not_married | Black | Male | 40 | <=50K | 0.2328767 | 0.8000000 | 0.3979592 | 0 | 0 |
| 193 | 193 | 47 | Private | Community | 10 | Separated | Black | Female | 37 | <=50K | 0.4109589 | 0.6000000 | 0.3673469 | 0 | 0 |
| 196 | 196 | 31 | Private | Community | 10 | Separated | White | Female | 40 | <=50K | 0.1917808 | 0.6000000 | 0.3979592 | 0 | 0 |
mat.confusion <- table(predicho = prediccion, real = datos.validacion$income10)
mat.confusion
## real
## predicho 0 1
## 0 11146 0
## 1 0 3506
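Accuracy can be read off the confusion matrix as the share of cases on the diagonal. The sketch below re-enters the four counts printed above by hand so that it runs on its own; `exactitud` is a name introduced here for illustration.

```r
# Confusion matrix values copied from the output above
mat.confusion <- matrix(c(11146, 0, 0, 3506), nrow = 2,
                        dimnames = list(predicho = c("0", "1"),
                                        real = c("0", "1")))

# Accuracy = correctly classified cases (diagonal) / all cases
exactitud <- sum(diag(mat.confusion)) / sum(mat.confusion)
exactitud
```

With `caret` already loaded, `confusionMatrix(prediccion, datos.validacion$income10)` should report the same figure along with sensitivity and specificity.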
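Is the accuracy really 100% with the linear kernel? Most likely not in any meaningful sense: the formula `income10 ~ .` uses every other column as a predictor, including `income`, the text label that `income10` encodes, so the model can read the answer directly. The figure should be re-checked after excluding `income` (and the row index `X`) before comparing it with the logistic regression of Case 11.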
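One way to probe a perfect score like this is to test whether any predictor duplicates the target. The following sketch rebuilds the ten `income` / `income10` pairs shown in the first table and cross-tabulates them, then assembles a formula without `income` and `X`. The names `muestra`, `fuga`, `predictores` and `formula.limpia` are hypothetical, and the choice of predictors is an assumption, not the author's specification.

```r
# First ten income / income10 pairs, copied from the first table above
muestra <- data.frame(
  income   = c("<=50K", "<=50K", ">50K", ">50K", "<=50K",
               "<=50K", "<=50K", ">50K", "<=50K", "<=50K"),
  income10 = factor(c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0))
)

# If income fully determines income10, each row of this table has one non-zero cell
tabla <- table(muestra$income, muestra$income10)
tabla

# TRUE when every income label maps to exactly one income10 level
fuga <- all(rowSums(tabla > 0) == 1)
fuga

# Hypothetical leakage-free formula: keep only columns that are not the target in disguise
predictores <- c("age.scale", "educational.num.scale", "hours.per.week.scale",
                 "workclass", "education", "marital.status", "race", "gender")
formula.limpia <- reformulate(predictores, response = "income10")
formula.limpia
```

The model could then be refit with `svm(formula.limpia, data = datos.entrenamiento, kernel = "linear", cost = .05)`. The resulting accuracy is not reported here because that refit has not been run.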