Apply and interpret the support vector machine (SVM) algorithm on data about people and their incomes in the USA, and compare its accuracy with that of the logistic regression from case 11.
Build a support vector machine (SVM) model applied to data on people and their incomes in the USA.
The dependent variable is income, coded as 0 and 1: 0 for people who earn 50 thousand or less, and 1 for people who earn more than 50 thousand.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
library(caret)
## Warning: package 'caret' was built under R version 4.0.3
## Loading required package: lattice
library(readr)
library(e1071)
## Warning: package 'e1071' was built under R version 4.0.3
getwd()
## [1] "C:/Users/pc/Documents/RStudio"
datos <- read.csv(file = "adultos_clean.csv", encoding = "UTF-8")
kable(head(datos, 10), caption = "The first 10 records of the data")
| X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 25 | Private | Dropout | 7 | Not_married | Black | Male | 40 | <=50K | 0.1095890 | 0.4000000 | 0.3979592 | 0 |
| 2 | 38 | Private | HighGrad | 9 | Married | White | Male | 50 | <=50K | 0.2876712 | 0.5333333 | 0.5000000 | 0 |
| 3 | 28 | Local-gov | Community | 12 | Married | White | Male | 40 | >50K | 0.1506849 | 0.7333333 | 0.3979592 | 1 |
| 4 | 44 | Private | Community | 10 | Married | Black | Male | 40 | >50K | 0.3698630 | 0.6000000 | 0.3979592 | 1 |
| 5 | 18 | ? | Community | 10 | Not_married | White | Female | 30 | <=50K | 0.0136986 | 0.6000000 | 0.2959184 | 0 |
| 6 | 34 | Private | Dropout | 6 | Not_married | White | Male | 30 | <=50K | 0.2328767 | 0.3333333 | 0.2959184 | 0 |
| 7 | 29 | ? | HighGrad | 9 | Not_married | Black | Male | 40 | <=50K | 0.1643836 | 0.5333333 | 0.3979592 | 0 |
| 8 | 63 | Self-emp-not-inc | Master | 15 | Married | White | Male | 32 | >50K | 0.6301370 | 0.9333333 | 0.3163265 | 1 |
| 9 | 24 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0958904 | 0.6000000 | 0.3979592 | 0 |
| 10 | 55 | Private | Dropout | 4 | Married | White | Male | 10 | <=50K | 0.5205479 | 0.2000000 | 0.0918367 | 0 |
kable(tail(datos, 10), caption = "The last 10 records of the data")
| | X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48833 | 48833 | 32 | Private | Dropout | 6 | Married | Amer-Indian-Eskimo | Male | 40 | <=50K | 0.2054795 | 0.3333333 | 0.3979592 | 0 |
| 48834 | 48834 | 43 | Private | Community | 11 | Married | White | Male | 45 | <=50K | 0.3561644 | 0.6666667 | 0.4489796 | 0 |
| 48835 | 48835 | 32 | Private | Master | 14 | Not_married | Asian-Pac-Islander | Male | 11 | <=50K | 0.2054795 | 0.8666667 | 0.1020408 | 0 |
| 48836 | 48836 | 53 | Private | Master | 14 | Married | White | Male | 40 | >50K | 0.4931507 | 0.8666667 | 0.3979592 | 1 |
| 48837 | 48837 | 22 | Private | Community | 10 | Not_married | White | Male | 40 | <=50K | 0.0684932 | 0.6000000 | 0.3979592 | 0 |
| 48838 | 48838 | 27 | Private | Community | 12 | Married | White | Female | 38 | <=50K | 0.1369863 | 0.7333333 | 0.3775510 | 0 |
| 48839 | 48839 | 40 | Private | HighGrad | 9 | Married | White | Male | 40 | >50K | 0.3150685 | 0.5333333 | 0.3979592 | 1 |
| 48840 | 48840 | 58 | Private | HighGrad | 9 | Widow | White | Female | 40 | <=50K | 0.5616438 | 0.5333333 | 0.3979592 | 0 |
| 48841 | 48841 | 22 | Private | HighGrad | 9 | Not_married | White | Male | 20 | <=50K | 0.0684932 | 0.5333333 | 0.1938776 | 0 |
| 48842 | 48842 | 52 | Self-emp-inc | HighGrad | 9 | Married | White | Female | 40 | >50K | 0.4794521 | 0.5333333 | 0.3979592 | 1 |
datos$income10 <- factor(datos$income10)
str(datos)
## 'data.frame': 48842 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 25 38 28 44 18 34 29 63 24 55 ...
## $ workclass : chr "Private" "Private" "Local-gov" "Private" ...
## $ education : chr "Dropout" "HighGrad" "Community" "Community" ...
## $ educational.num : int 7 9 12 10 10 6 9 15 10 4 ...
## $ marital.status : chr "Not_married" "Married" "Married" "Married" ...
## $ race : chr "Black" "White" "White" "Black" ...
## $ gender : chr "Male" "Male" "Male" "Male" ...
## $ hours.per.week : int 40 50 40 40 30 30 40 32 40 10 ...
## $ income : chr "<=50K" "<=50K" ">50K" ">50K" ...
## $ age.scale : num 0.1096 0.2877 0.1507 0.3699 0.0137 ...
## $ educational.num.scale: num 0.4 0.533 0.733 0.6 0.6 ...
## $ hours.per.week.scale : num 0.398 0.5 0.398 0.398 0.296 ...
## $ income10 : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1 ...
70% of the data for training
30% of the data for validation
set.seed(2020)
entrena <- createDataPartition(y = datos$income10, p = 0.7, list = FALSE, times = 1)
datos.entrenamiento <- datos[entrena, ]
datos.validacion <- datos[-entrena, ]
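A quick way to confirm that a partition like the one above behaves as intended is to compare the share of rows and the class proportions in each subset. The following is a minimal sketch on toy data; the `juguete` data frame and its class sizes are made up for illustration and are not the income data.

```r
library(caret)  # provides createDataPartition()

# Toy data (hypothetical): 1000 rows with an unbalanced 0/1 target
set.seed(2020)
juguete <- data.frame(
  x = rnorm(1000),
  income10 = factor(rep(c(0, 1), times = c(760, 240)))
)

# Same call pattern as in the document: stratified 70/30 split
idx <- createDataPartition(y = juguete$income10, p = 0.7, list = FALSE, times = 1)
entrenamiento <- juguete[idx, ]
validacion <- juguete[-idx, ]

# Share of rows that went to training (should be close to 0.7)
nrow(entrenamiento) / nrow(juguete)

# Class proportions should be similar in the full data and in each subset,
# because the sampling is done within each class
prop.table(table(juguete$income10))
prop.table(table(entrenamiento$income10))
prop.table(table(validacion$income10))
```

The same three `prop.table()` calls can be run on `datos`, `datos.entrenamiento` and `datos.validacion`.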
kable(head(datos.entrenamiento, 10), caption = "Training data (first ten)")
| | X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 25 | Private | Dropout | 7 | Not_married | Black | Male | 40 | <=50K | 0.1095890 | 0.4000000 | 0.3979592 | 0 |
| 2 | 2 | 38 | Private | HighGrad | 9 | Married | White | Male | 50 | <=50K | 0.2876712 | 0.5333333 | 0.5000000 | 0 |
| 3 | 3 | 28 | Local-gov | Community | 12 | Married | White | Male | 40 | >50K | 0.1506849 | 0.7333333 | 0.3979592 | 1 |
| 4 | 4 | 44 | Private | Community | 10 | Married | Black | Male | 40 | >50K | 0.3698630 | 0.6000000 | 0.3979592 | 1 |
| 5 | 5 | 18 | ? | Community | 10 | Not_married | White | Female | 30 | <=50K | 0.0136986 | 0.6000000 | 0.2959184 | 0 |
| 7 | 7 | 29 | ? | HighGrad | 9 | Not_married | Black | Male | 40 | <=50K | 0.1643836 | 0.5333333 | 0.3979592 | 0 |
| 8 | 8 | 63 | Self-emp-not-inc | Master | 15 | Married | White | Male | 32 | >50K | 0.6301370 | 0.9333333 | 0.3163265 | 1 |
| 9 | 9 | 24 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0958904 | 0.6000000 | 0.3979592 | 0 |
| 10 | 10 | 55 | Private | Dropout | 4 | Married | White | Male | 10 | <=50K | 0.5205479 | 0.2000000 | 0.0918367 | 0 |
| 11 | 11 | 65 | Private | HighGrad | 9 | Married | White | Male | 40 | >50K | 0.6575342 | 0.5333333 | 0.3979592 | 1 |
kable(head(datos.validacion, 10), caption = "Validation data (first ten)")
| | X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 6 | 34 | Private | Dropout | 6 | Not_married | White | Male | 30 | <=50K | 0.2328767 | 0.3333333 | 0.2959184 | 0 |
| 15 | 15 | 48 | Private | HighGrad | 9 | Married | White | Male | 48 | >50K | 0.4246575 | 0.5333333 | 0.4795918 | 1 |
| 17 | 17 | 20 | State-gov | Community | 10 | Not_married | White | Male | 25 | <=50K | 0.0410959 | 0.6000000 | 0.2448980 | 0 |
| 36 | 36 | 65 | ? | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.6575342 | 0.5333333 | 0.3979592 | 0 |
| 41 | 41 | 65 | Private | Master | 14 | Married | White | Male | 50 | >50K | 0.6575342 | 0.8666667 | 0.5000000 | 1 |
| 49 | 49 | 52 | Private | Dropout | 7 | Separated | Black | Female | 18 | <=50K | 0.4794521 | 0.4000000 | 0.1734694 | 0 |
| 51 | 51 | 18 | Private | Community | 10 | Not_married | White | Male | 20 | <=50K | 0.0136986 | 0.6000000 | 0.1938776 | 0 |
| 54 | 54 | 22 | Private | HighGrad | 9 | Not_married | White | Male | 60 | >50K | 0.0684932 | 0.5333333 | 0.6020408 | 1 |
| 56 | 56 | 21 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0547945 | 0.6000000 | 0.3979592 | 0 |
| 58 | 58 | 34 | Local-gov | Bachelors | 13 | Married | White | Male | 50 | >50K | 0.2328767 | 0.8000000 | 0.5000000 | 1 |
modelo1 <- svm(income10 ~ ., data = datos.entrenamiento, kernel = "linear", scale = TRUE, cost = .05)
summary(modelo1)
##
## Call:
## svm(formula = income10 ~ ., data = datos.entrenamiento, kernel = "linear",
## cost = 0.05, scale = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.05
##
## Number of Support Vectors: 246
##
## ( 143 103 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
prediccion <- predict(modelo1, datos.validacion)
datos.validacion <- cbind(datos.validacion, prediccion = prediccion)
kable(head(datos.validacion, 50), caption = "The predictions, first 50 records")
| | X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 | prediccion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 6 | 34 | Private | Dropout | 6 | Not_married | White | Male | 30 | <=50K | 0.2328767 | 0.3333333 | 0.2959184 | 0 | 0 |
| 15 | 15 | 48 | Private | HighGrad | 9 | Married | White | Male | 48 | >50K | 0.4246575 | 0.5333333 | 0.4795918 | 1 | 1 |
| 17 | 17 | 20 | State-gov | Community | 10 | Not_married | White | Male | 25 | <=50K | 0.0410959 | 0.6000000 | 0.2448980 | 0 | 0 |
| 36 | 36 | 65 | ? | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.6575342 | 0.5333333 | 0.3979592 | 0 | 0 |
| 41 | 41 | 65 | Private | Master | 14 | Married | White | Male | 50 | >50K | 0.6575342 | 0.8666667 | 0.5000000 | 1 | 1 |
| 49 | 49 | 52 | Private | Dropout | 7 | Separated | Black | Female | 18 | <=50K | 0.4794521 | 0.4000000 | 0.1734694 | 0 | 0 |
| 51 | 51 | 18 | Private | Community | 10 | Not_married | White | Male | 20 | <=50K | 0.0136986 | 0.6000000 | 0.1938776 | 0 | 0 |
| 54 | 54 | 22 | Private | HighGrad | 9 | Not_married | White | Male | 60 | >50K | 0.0684932 | 0.5333333 | 0.6020408 | 1 | 1 |
| 56 | 56 | 21 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0547945 | 0.6000000 | 0.3979592 | 0 | 0 |
| 58 | 58 | 34 | Local-gov | Bachelors | 13 | Married | White | Male | 50 | >50K | 0.2328767 | 0.8000000 | 0.5000000 | 1 | 1 |
| 59 | 59 | 42 | Self-emp-inc | HighGrad | 9 | Married | White | Male | 50 | >50K | 0.3424658 | 0.5333333 | 0.5000000 | 1 | 1 |
| 61 | 61 | 30 | Private | Bachelors | 13 | Not_married | White | Female | 50 | <=50K | 0.1780822 | 0.8000000 | 0.5000000 | 0 | 0 |
| 64 | 64 | 33 | Private | HighGrad | 9 | Not_married | White | Female | 40 | <=50K | 0.2191781 | 0.5333333 | 0.3979592 | 0 | 0 |
| 68 | 68 | 19 | Private | Community | 10 | Not_married | White | Male | 20 | <=50K | 0.0273973 | 0.6000000 | 0.1938776 | 0 | 0 |
| 74 | 74 | 21 | Private | Community | 10 | Separated | White | Female | 40 | <=50K | 0.0547945 | 0.6000000 | 0.3979592 | 0 | 0 |
| 77 | 77 | 41 | Private | HighGrad | 9 | Married | White | Male | 50 | <=50K | 0.3287671 | 0.5333333 | 0.5000000 | 0 | 0 |
| 79 | 79 | 50 | Private | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.4520548 | 0.5333333 | 0.3979592 | 0 | 0 |
| 81 | 81 | 45 | Self-emp-inc | Community | 10 | Married | White | Male | 50 | <=50K | 0.3835616 | 0.6000000 | 0.5000000 | 0 | 0 |
| 91 | 91 | 59 | Private | Bachelors | 13 | Not_married | White | Female | 25 | <=50K | 0.5753425 | 0.8000000 | 0.2448980 | 0 | 0 |
| 95 | 95 | 34 | Private | Master | 14 | Not_married | Amer-Indian-Eskimo | Male | 40 | <=50K | 0.2328767 | 0.8666667 | 0.3979592 | 0 | 0 |
| 96 | 96 | 20 | Private | HighGrad | 9 | Not_married | White | Male | 40 | <=50K | 0.0410959 | 0.5333333 | 0.3979592 | 0 | 0 |
| 97 | 97 | 25 | Private | Bachelors | 13 | Not_married | White | Female | 40 | <=50K | 0.1095890 | 0.8000000 | 0.3979592 | 0 | 0 |
| 102 | 102 | 33 | Private | Community | 10 | Not_married | Black | Female | 35 | <=50K | 0.2191781 | 0.6000000 | 0.3469388 | 0 | 0 |
| 111 | 111 | 18 | Private | HighGrad | 9 | Not_married | White | Female | 48 | <=50K | 0.0136986 | 0.5333333 | 0.4795918 | 0 | 0 |
| 112 | 112 | 28 | Private | Community | 10 | Married | White | Male | 40 | <=50K | 0.1506849 | 0.6000000 | 0.3979592 | 0 | 0 |
| 120 | 120 | 43 | Private | Bachelors | 13 | Separated | White | Female | 40 | >50K | 0.3561644 | 0.8000000 | 0.3979592 | 1 | 1 |
| 123 | 123 | 19 | Private | Community | 10 | Not_married | White | Male | 30 | <=50K | 0.0273973 | 0.6000000 | 0.2959184 | 0 | 0 |
| 129 | 129 | 27 | Self-emp-not-inc | HighGrad | 9 | Married | White | Male | 60 | >50K | 0.1369863 | 0.5333333 | 0.6020408 | 1 | 1 |
| 131 | 131 | 41 | Private | Community | 10 | Married | White | Male | 40 | <=50K | 0.3287671 | 0.6000000 | 0.3979592 | 0 | 0 |
| 135 | 135 | 57 | Private | HighGrad | 9 | Married | Black | Male | 48 | <=50K | 0.5479452 | 0.5333333 | 0.4795918 | 0 | 0 |
| 141 | 141 | 46 | Private | Master | 14 | Married | White | Male | 40 | <=50K | 0.3972603 | 0.8666667 | 0.3979592 | 0 | 0 |
| 144 | 144 | 43 | Self-emp-inc | HighGrad | 9 | Married | White | Male | 45 | >50K | 0.3561644 | 0.5333333 | 0.4489796 | 1 | 1 |
| 145 | 145 | 34 | Private | Master | 14 | Not_married | White | Female | 30 | <=50K | 0.2328767 | 0.8666667 | 0.2959184 | 0 | 0 |
| 147 | 147 | 44 | Private | Community | 11 | Widow | White | Female | 30 | <=50K | 0.3698630 | 0.6666667 | 0.2959184 | 0 | 0 |
| 153 | 153 | 50 | Private | Dropout | 4 | Married | White | Male | 20 | <=50K | 0.4520548 | 0.2000000 | 0.1938776 | 0 | 0 |
| 160 | 160 | 38 | Self-emp-inc | Bachelors | 13 | Separated | White | Male | 40 | <=50K | 0.2876712 | 0.8000000 | 0.3979592 | 0 | 0 |
| 161 | 161 | 55 | Private | Dropout | 7 | Married | White | Male | 30 | <=50K | 0.5205479 | 0.4000000 | 0.2959184 | 0 | 0 |
| 163 | 163 | 22 | Private | HighGrad | 9 | Married | White | Male | 45 | <=50K | 0.0684932 | 0.5333333 | 0.4489796 | 0 | 0 |
| 165 | 165 | 46 | State-gov | Master | 14 | Married | White | Male | 45 | >50K | 0.3972603 | 0.8666667 | 0.4489796 | 1 | 1 |
| 167 | 167 | 58 | Self-emp-not-inc | PhD | 16 | Married | White | Male | 16 | >50K | 0.5616438 | 1.0000000 | 0.1530612 | 1 | 1 |
| 168 | 168 | 42 | Private | HighGrad | 9 | Married | White | Male | 45 | <=50K | 0.3424658 | 0.5333333 | 0.4489796 | 0 | 0 |
| 171 | 171 | 54 | Private | HighGrad | 9 | Married | White | Male | 40 | >50K | 0.5068493 | 0.5333333 | 0.3979592 | 1 | 1 |
| 172 | 172 | 34 | Private | Master | 14 | Not_married | White | Male | 40 | <=50K | 0.2328767 | 0.8666667 | 0.3979592 | 0 | 0 |
| 173 | 173 | 26 | Private | Bachelors | 13 | Not_married | White | Female | 40 | <=50K | 0.1232877 | 0.8000000 | 0.3979592 | 0 | 0 |
| 175 | 175 | 48 | Local-gov | Master | 14 | Separated | Black | Female | 40 | <=50K | 0.4246575 | 0.8666667 | 0.3979592 | 0 | 0 |
| 176 | 176 | 36 | Private | Community | 10 | Married | White | Male | 45 | >50K | 0.2602740 | 0.6000000 | 0.4489796 | 1 | 1 |
| 185 | 185 | 44 | Private | HighGrad | 9 | Separated | Black | Female | 40 | <=50K | 0.3698630 | 0.5333333 | 0.3979592 | 0 | 0 |
| 189 | 189 | 34 | State-gov | Bachelors | 13 | Not_married | Black | Male | 40 | <=50K | 0.2328767 | 0.8000000 | 0.3979592 | 0 | 0 |
| 193 | 193 | 47 | Private | Community | 10 | Separated | Black | Female | 37 | <=50K | 0.4109589 | 0.6000000 | 0.3673469 | 0 | 0 |
| 196 | 196 | 31 | Private | Community | 10 | Separated | White | Female | 40 | <=50K | 0.1917808 | 0.6000000 | 0.3979592 | 0 | 0 |
mat.confusion <- table(predicho = prediccion, real = datos.validacion$income10)
mat.confusion
## real
## predicho 0 1
## 0 11146 0
## 1 0 3506
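The accuracy to compare against the logistic regression is the sum of the diagonal of the confusion matrix divided by the total number of validation records. The sketch below rebuilds the matrix from the counts printed above and computes it. A perfect score usually deserves a second look: the formula `income10 ~ .` uses every other column as a predictor, including `income`, which is the same information as `income10` in text form, and `X`, which is only a row counter. The helper `ajustar_sin_income()` is a hypothetical name, shown only as one possible way to refit without those two columns; its results are not reported here.

```r
library(e1071)  # provides svm()

# Confusion matrix rebuilt from the counts printed in the document
mat.confusion <- matrix(
  c(11146, 0, 0, 3506), nrow = 2,
  dimnames = list(predicho = c("0", "1"), real = c("0", "1"))
)

# Accuracy = correctly classified / total
exactitud <- sum(diag(mat.confusion)) / sum(mat.confusion)
exactitud  # (11146 + 3506) / 14652 = 1

# Hypothetical helper (an assumption, not part of the original analysis):
# refit the same linear SVM after dropping the row counter X and the
# text version of the target, income
ajustar_sin_income <- function(entrenamiento) {
  columnas <- setdiff(names(entrenamiento), c("X", "income"))
  svm(income10 ~ ., data = entrenamiento[, columnas],
      kernel = "linear", scale = TRUE, cost = 0.05)
}

# Usage with the objects from the document (not run here):
# modelo2 <- ajustar_sin_income(datos.entrenamiento)
# table(predicho = predict(modelo2, datos.validacion),
#       real = datos.validacion$income10)
```

An accuracy obtained this way would be the fairer figure to set against the logistic regression from case 11, provided that model did not use `income` as a predictor either.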
What is SVM?
A support vector machine is a machine learning algorithm that can be used for binary classification or for regression.
Linear kernel: $K(x_1, x_2) = x_1^T x_2$, used here to learn two classes
In this case we use the linear kernel method.
Here x1 and x2 are the predictor vectors of any two observations. The model is fitted on the training data (entrena) and then used to predict the validation data; it is created with the function
modelo1 <- svm(..)
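The linear kernel named above is simply the dot product of two observation vectors. A minimal sketch in base R, with made-up vectors `x1` and `x2` and a hypothetical helper `kernel_lineal()`:

```r
# Linear kernel: K(x1, x2) = t(x1) %*% x2, i.e. the dot product
kernel_lineal <- function(x1, x2) sum(x1 * x2)

# Two made-up observations with three numeric predictors each
x1 <- c(1, 2, 3)
x2 <- c(4, 5, 6)

kernel_lineal(x1, x2)     # 1*4 + 2*5 + 3*6 = 32
drop(crossprod(x1, x2))   # same value using base R matrix algebra

# For several observations stored as rows, the kernel (Gram) matrix
# holds K(xi, xj) for every pair of rows
X <- rbind(x1, x2)
K <- X %*% t(X)
K
```

The SVM only needs these pairwise values, which is why a different kernel function can be swapped in without changing the rest of the algorithm.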
The kernel method takes the data to a vector space where linear methods can be applied. In our case the fitted model kept 246 support vectors, split between the two classes as ( 143 103 ).
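The support vector counts shown by `summary()` can also be read directly from the fitted object. The sketch below uses two classes of the built-in `iris` data as a stand-in, so the numbers it prints are not those of the income model; `datos_demo` and `modelo_demo` are illustrative names.

```r
library(e1071)  # provides svm()

# Stand-in data: keep two of the three iris species to get a binary problem
datos_demo <- droplevels(subset(iris, Species != "setosa"))

modelo_demo <- svm(Species ~ ., data = datos_demo,
                   kernel = "linear", scale = TRUE, cost = 0.05)

modelo_demo$tot.nSV   # total number of support vectors
modelo_demo$nSV       # number of support vectors in each class
dim(modelo_demo$SV)   # the support vectors themselves, one per row
```

Applied to `modelo1`, `modelo1$tot.nSV` and `modelo1$nSV` would correspond to the 246 and ( 143 103 ) in the summary.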
Through a mapping, the original data are transformed so that their relevant features are captured, following a formula.
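To make the idea of a mapping concrete, the sketch below uses a degree-2 polynomial kernel rather than the linear kernel used in this case, because with the linear kernel the mapping is just the identity. It checks numerically that evaluating the kernel on the original vectors gives the same value as a dot product after an explicit mapping. The function `phi()` and the vectors are illustrative assumptions.

```r
# Degree-2 polynomial kernel on two-dimensional inputs: K(x, z) = (x . z)^2
kernel_poli2 <- function(x, z) sum(x * z)^2

# Explicit mapping to three dimensions with the property
# phi(x) . phi(z) = (x . z)^2
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)

x <- c(1, 2)
z <- c(3, 4)

kernel_poli2(x, z)      # (1*3 + 2*4)^2 = 121
sum(phi(x) * phi(z))    # dot product in the mapped space

# Both routes agree up to floating point error
isTRUE(all.equal(kernel_poli2(x, z), sum(phi(x) * phi(z))))
```

The kernel gives the result of working in the mapped space without ever computing `phi()` explicitly.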
I cannot explain the kernel method in detail here, since the algorithm rests on several formulas, but R makes it easy to use.