Process
- Load libraries
- Load data
- Identify variables
- Create training and validation data
- Create the Support Vector Machine (SVM) model
- Determine the best cost for the SVM model in advance (pending)
- Analyze and/or describe the model
- Make predictions with the validation data set
- Evaluate the prediction model with a confusion matrix
- Interpret the case
- Compare with the logistic regression algorithm from case 11
1. Load libraries
library(ggplot2)
library(dplyr)
library(knitr)
library(caret)
library(readr)
library(e1071)
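If any of these packages is missing, it can be installed first; a minimal sketch (not part of the original case):
# Install any required package that is not yet available
paquetes <- c("ggplot2", "dplyr", "knitr", "caret", "readr", "e1071")
faltantes <- paquetes[!paquetes %in% rownames(installed.packages())]
if (length(faltantes) > 0) install.packages(faltantes)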
2. Load data
datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/FundamentosMachineLearning/master/datos/adultos_clean.csv", encoding = "UTF-8")
kable(head(datos, 10), caption = "The first 10 records of the data")
The first 10 records of the data

| X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|-----|-----------|-----------|-----------------|----------------|------|--------|----------------|--------|-----------|-----------------------|----------------------|----------|
| 1 | 25 | Private | Dropout | 7 | Not_married | Black | Male | 40 | <=50K | 0.1095890 | 0.4000000 | 0.3979592 | 0 |
| 2 | 38 | Private | HighGrad | 9 | Married | White | Male | 50 | <=50K | 0.2876712 | 0.5333333 | 0.5000000 | 0 |
| 3 | 28 | Local-gov | Community | 12 | Married | White | Male | 40 | >50K | 0.1506849 | 0.7333333 | 0.3979592 | 1 |
| 4 | 44 | Private | Community | 10 | Married | Black | Male | 40 | >50K | 0.3698630 | 0.6000000 | 0.3979592 | 1 |
| 5 | 18 | ? | Community | 10 | Not_married | White | Female | 30 | <=50K | 0.0136986 | 0.6000000 | 0.2959184 | 0 |
| 6 | 34 | Private | Dropout | 6 | Not_married | White | Male | 30 | <=50K | 0.2328767 | 0.3333333 | 0.2959184 | 0 |
| 7 | 29 | ? | HighGrad | 9 | Not_married | Black | Male | 40 | <=50K | 0.1643836 | 0.5333333 | 0.3979592 | 0 |
| 8 | 63 | Self-emp-not-inc | Master | 15 | Married | White | Male | 32 | >50K | 0.6301370 | 0.9333333 | 0.3163265 | 1 |
| 9 | 24 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0958904 | 0.6000000 | 0.3979592 | 0 |
| 10 | 55 | Private | Dropout | 4 | Married | White | Male | 10 | <=50K | 0.5205479 | 0.2000000 | 0.0918367 | 0 |
kable(tail(datos, 10), caption = "The last 10 records of the data")
The last 10 records of the data

|   | X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|-----|-----------|-----------|-----------------|----------------|------|--------|----------------|--------|-----------|-----------------------|----------------------|----------|
| 48833 | 48833 | 32 | Private | Dropout | 6 | Married | Amer-Indian-Eskimo | Male | 40 | <=50K | 0.2054795 | 0.3333333 | 0.3979592 | 0 |
| 48834 | 48834 | 43 | Private | Community | 11 | Married | White | Male | 45 | <=50K | 0.3561644 | 0.6666667 | 0.4489796 | 0 |
| 48835 | 48835 | 32 | Private | Master | 14 | Not_married | Asian-Pac-Islander | Male | 11 | <=50K | 0.2054795 | 0.8666667 | 0.1020408 | 0 |
| 48836 | 48836 | 53 | Private | Master | 14 | Married | White | Male | 40 | >50K | 0.4931507 | 0.8666667 | 0.3979592 | 1 |
| 48837 | 48837 | 22 | Private | Community | 10 | Not_married | White | Male | 40 | <=50K | 0.0684932 | 0.6000000 | 0.3979592 | 0 |
| 48838 | 48838 | 27 | Private | Community | 12 | Married | White | Female | 38 | <=50K | 0.1369863 | 0.7333333 | 0.3775510 | 0 |
| 48839 | 48839 | 40 | Private | HighGrad | 9 | Married | White | Male | 40 | >50K | 0.3150685 | 0.5333333 | 0.3979592 | 1 |
| 48840 | 48840 | 58 | Private | HighGrad | 9 | Widow | White | Female | 40 | <=50K | 0.5616438 | 0.5333333 | 0.3979592 | 0 |
| 48841 | 48841 | 22 | Private | HighGrad | 9 | Not_married | White | Male | 20 | <=50K | 0.0684932 | 0.5333333 | 0.1938776 | 0 |
| 48842 | 48842 | 52 | Self-emp-inc | HighGrad | 9 | Married | White | Female | 40 | >50K | 0.4794521 | 0.5333333 | 0.3979592 | 1 |
3. Identify variables
- Convert the dependent variable income10 to a factor
datos$income10 <- factor(datos$income10)
str(datos)
## 'data.frame': 48842 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 25 38 28 44 18 34 29 63 24 55 ...
## $ workclass : chr "Private" "Private" "Local-gov" "Private" ...
## $ education : chr "Dropout" "HighGrad" "Community" "Community" ...
## $ educational.num : int 7 9 12 10 10 6 9 15 10 4 ...
## $ marital.status : chr "Not_married" "Married" "Married" "Married" ...
## $ race : chr "Black" "White" "White" "Black" ...
## $ gender : chr "Male" "Male" "Male" "Male" ...
## $ hours.per.week : int 40 50 40 40 30 30 40 32 40 10 ...
## $ income : chr "<=50K" "<=50K" ">50K" ">50K" ...
## $ age.scale : num 0.1096 0.2877 0.1507 0.3699 0.0137 ...
## $ educational.num.scale: num 0.4 0.533 0.733 0.6 0.6 ...
## $ hours.per.week.scale : num 0.398 0.5 0.398 0.398 0.296 ...
## $ income10 : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1 ...
- age: the person's age
- workclass: the person's type of employment (private, government, self-employed, etc.)
- education: the person's educational level
- educational.num: the numeric value of education
- marital.status: the person's marital status
- race: the person's race
- gender: the person's gender
- hours.per.week: the hours worked per week
- income: the income bracket (<=50K or >50K)
- age.scale: the scaled age
- educational.num.scale: the scaled educational.num
- hours.per.week.scale: the scaled hours.per.week
- income10: 0 when income is <=50K and 1 when it is >50K

The character variables (workclass, education, marital.status, race, gender and income) can also be converted to factors, as sketched below.
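An optional sketch using the dplyr verbs already loaded; svm() and glm() coerce character predictors inside their model frames in any case, but the explicit conversion makes later summaries clearer:
# Convert every character column to a factor
datos <- datos %>%
  mutate(across(where(is.character), as.factor))
str(datos$workclass) # now a factor with the work-class levels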
4. Create training and validation data
- 70% training data
- 30% validation data
set.seed(2020)
entrena <- createDataPartition(y = datos$income10, p = 0.7, list = FALSE, times = 1)
# Training data
datos.entrenamiento <- datos[entrena, ] # [rows, columns]
# Validation data
datos.validacion <- datos[-entrena, ]
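Because createDataPartition() samples within each level of income10, both subsets should keep roughly the same class proportions as the full data; an optional quick check:
# Class proportions in the full data and in each partition
prop.table(table(datos$income10))
prop.table(table(datos.entrenamiento$income10))
prop.table(table(datos.validacion$income10))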
kable(head(datos.entrenamiento, 10), caption = "Training data (first ten)", row.names = TRUE)
Training data (first ten)

|   | X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|-----|-----------|-----------|-----------------|----------------|------|--------|----------------|--------|-----------|-----------------------|----------------------|----------|
| 1 | 1 | 25 | Private | Dropout | 7 | Not_married | Black | Male | 40 | <=50K | 0.1095890 | 0.4000000 | 0.3979592 | 0 |
| 2 | 2 | 38 | Private | HighGrad | 9 | Married | White | Male | 50 | <=50K | 0.2876712 | 0.5333333 | 0.5000000 | 0 |
| 3 | 3 | 28 | Local-gov | Community | 12 | Married | White | Male | 40 | >50K | 0.1506849 | 0.7333333 | 0.3979592 | 1 |
| 4 | 4 | 44 | Private | Community | 10 | Married | Black | Male | 40 | >50K | 0.3698630 | 0.6000000 | 0.3979592 | 1 |
| 5 | 5 | 18 | ? | Community | 10 | Not_married | White | Female | 30 | <=50K | 0.0136986 | 0.6000000 | 0.2959184 | 0 |
| 7 | 7 | 29 | ? | HighGrad | 9 | Not_married | Black | Male | 40 | <=50K | 0.1643836 | 0.5333333 | 0.3979592 | 0 |
| 8 | 8 | 63 | Self-emp-not-inc | Master | 15 | Married | White | Male | 32 | >50K | 0.6301370 | 0.9333333 | 0.3163265 | 1 |
| 9 | 9 | 24 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0958904 | 0.6000000 | 0.3979592 | 0 |
| 10 | 10 | 55 | Private | Dropout | 4 | Married | White | Male | 10 | <=50K | 0.5205479 | 0.2000000 | 0.0918367 | 0 |
| 11 | 11 | 65 | Private | HighGrad | 9 | Married | White | Male | 40 | >50K | 0.6575342 | 0.5333333 | 0.3979592 | 1 |
kable(head(datos.validacion, 10), caption = "Validation data (first ten)", row.names = TRUE)
Validation data (first ten)

|   | X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|-----|-----------|-----------|-----------------|----------------|------|--------|----------------|--------|-----------|-----------------------|----------------------|----------|
| 6 | 6 | 34 | Private | Dropout | 6 | Not_married | White | Male | 30 | <=50K | 0.2328767 | 0.3333333 | 0.2959184 | 0 |
| 15 | 15 | 48 | Private | HighGrad | 9 | Married | White | Male | 48 | >50K | 0.4246575 | 0.5333333 | 0.4795918 | 1 |
| 17 | 17 | 20 | State-gov | Community | 10 | Not_married | White | Male | 25 | <=50K | 0.0410959 | 0.6000000 | 0.2448980 | 0 |
| 36 | 36 | 65 | ? | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.6575342 | 0.5333333 | 0.3979592 | 0 |
| 41 | 41 | 65 | Private | Master | 14 | Married | White | Male | 50 | >50K | 0.6575342 | 0.8666667 | 0.5000000 | 1 |
| 49 | 49 | 52 | Private | Dropout | 7 | Separated | Black | Female | 18 | <=50K | 0.4794521 | 0.4000000 | 0.1734694 | 0 |
| 51 | 51 | 18 | Private | Community | 10 | Not_married | White | Male | 20 | <=50K | 0.0136986 | 0.6000000 | 0.1938776 | 0 |
| 54 | 54 | 22 | Private | HighGrad | 9 | Not_married | White | Male | 60 | >50K | 0.0684932 | 0.5333333 | 0.6020408 | 1 |
| 56 | 56 | 21 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0547945 | 0.6000000 | 0.3979592 | 0 |
| 58 | 58 | 34 | Local-gov | Bachelors | 13 | Married | White | Male | 50 | >50K | 0.2328767 | 0.8000000 | 0.5000000 | 1 |
5. Create the Support Vector Machine (SVM) model
- The SVM model is generated with the e1071 package
- The svm() function is used
- Linear kernel
modelo1 <- svm(income10 ~ ., data = datos.entrenamiento, kernel = "linear", scale = TRUE, cost = .05)
6. Analyze and/or describe the model
summary(modelo1)
##
## Call:
## svm(formula = income10 ~ ., data = datos.entrenamiento, kernel = "linear",
## cost = 0.05, scale = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.05
##
## Number of Support Vectors: 246
##
## ( 143 103 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
- SVM model with a linear kernel
- Cost of 0.05 (the process list marks the search for a better cost as pending; see the tuning sketch below)
- 2 classes, in this case the labels 0 and 1
- 246 support vectors: 143 for class 0 and 103 for class 1
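The cost search marked as pending can be done with e1071::tune(), which cross-validates svm() over a grid of cost values. A minimal sketch; the grid, the 5-fold setting and the 5,000-row subsample (used only to keep run time reasonable) are assumptions, not values taken from this case:
# Cross-validated search for the best cost on a subsample of the training data
set.seed(2020)
muestra <- datos.entrenamiento[sample(nrow(datos.entrenamiento), 5000), ]
ajuste.costo <- tune(svm, income10 ~ ., data = muestra,
                     kernel = "linear", scale = TRUE,
                     ranges = list(cost = c(0.01, 0.05, 0.1, 1, 10)),
                     tunecontrol = tune.control(cross = 5))
summary(ajuste.costo)        # cross-validation error for each cost
ajuste.costo$best.parameters # cost with the lowest error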
7. Make predictions with the validation data set
prediccion <- predict(modelo1, datos.validacion)
- Appending a column at the end of the validation data for comparison
datos.validacion <- cbind(datos.validacion, prediccion = prediccion)
kable(head(datos.validacion, 50), caption = "The predictions, first 50 records")

The predictions, first 50 records

|   | X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 | prediccion |
|---|---|-----|-----------|-----------|-----------------|----------------|------|--------|----------------|--------|-----------|-----------------------|----------------------|----------|------------|
| 6 | 6 | 34 | Private | Dropout | 6 | Not_married | White | Male | 30 | <=50K | 0.2328767 | 0.3333333 | 0.2959184 | 0 | 0 |
| 15 | 15 | 48 | Private | HighGrad | 9 | Married | White | Male | 48 | >50K | 0.4246575 | 0.5333333 | 0.4795918 | 1 | 1 |
| 17 | 17 | 20 | State-gov | Community | 10 | Not_married | White | Male | 25 | <=50K | 0.0410959 | 0.6000000 | 0.2448980 | 0 | 0 |
| 36 | 36 | 65 | ? | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.6575342 | 0.5333333 | 0.3979592 | 0 | 0 |
| 41 | 41 | 65 | Private | Master | 14 | Married | White | Male | 50 | >50K | 0.6575342 | 0.8666667 | 0.5000000 | 1 | 1 |
| 49 | 49 | 52 | Private | Dropout | 7 | Separated | Black | Female | 18 | <=50K | 0.4794521 | 0.4000000 | 0.1734694 | 0 | 0 |
| 51 | 51 | 18 | Private | Community | 10 | Not_married | White | Male | 20 | <=50K | 0.0136986 | 0.6000000 | 0.1938776 | 0 | 0 |
| 54 | 54 | 22 | Private | HighGrad | 9 | Not_married | White | Male | 60 | >50K | 0.0684932 | 0.5333333 | 0.6020408 | 1 | 1 |
| 56 | 56 | 21 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0547945 | 0.6000000 | 0.3979592 | 0 | 0 |
| 58 | 58 | 34 | Local-gov | Bachelors | 13 | Married | White | Male | 50 | >50K | 0.2328767 | 0.8000000 | 0.5000000 | 1 | 1 |
| 59 | 59 | 42 | Self-emp-inc | HighGrad | 9 | Married | White | Male | 50 | >50K | 0.3424658 | 0.5333333 | 0.5000000 | 1 | 1 |
| 61 | 61 | 30 | Private | Bachelors | 13 | Not_married | White | Female | 50 | <=50K | 0.1780822 | 0.8000000 | 0.5000000 | 0 | 0 |
| 64 | 64 | 33 | Private | HighGrad | 9 | Not_married | White | Female | 40 | <=50K | 0.2191781 | 0.5333333 | 0.3979592 | 0 | 0 |
| 68 | 68 | 19 | Private | Community | 10 | Not_married | White | Male | 20 | <=50K | 0.0273973 | 0.6000000 | 0.1938776 | 0 | 0 |
| 74 | 74 | 21 | Private | Community | 10 | Separated | White | Female | 40 | <=50K | 0.0547945 | 0.6000000 | 0.3979592 | 0 | 0 |
| 77 | 77 | 41 | Private | HighGrad | 9 | Married | White | Male | 50 | <=50K | 0.3287671 | 0.5333333 | 0.5000000 | 0 | 0 |
| 79 | 79 | 50 | Private | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.4520548 | 0.5333333 | 0.3979592 | 0 | 0 |
| 81 | 81 | 45 | Self-emp-inc | Community | 10 | Married | White | Male | 50 | <=50K | 0.3835616 | 0.6000000 | 0.5000000 | 0 | 0 |
| 91 | 91 | 59 | Private | Bachelors | 13 | Not_married | White | Female | 25 | <=50K | 0.5753425 | 0.8000000 | 0.2448980 | 0 | 0 |
| 95 | 95 | 34 | Private | Master | 14 | Not_married | Amer-Indian-Eskimo | Male | 40 | <=50K | 0.2328767 | 0.8666667 | 0.3979592 | 0 | 0 |
| 96 | 96 | 20 | Private | HighGrad | 9 | Not_married | White | Male | 40 | <=50K | 0.0410959 | 0.5333333 | 0.3979592 | 0 | 0 |
| 97 | 97 | 25 | Private | Bachelors | 13 | Not_married | White | Female | 40 | <=50K | 0.1095890 | 0.8000000 | 0.3979592 | 0 | 0 |
| 102 | 102 | 33 | Private | Community | 10 | Not_married | Black | Female | 35 | <=50K | 0.2191781 | 0.6000000 | 0.3469388 | 0 | 0 |
| 111 | 111 | 18 | Private | HighGrad | 9 | Not_married | White | Female | 48 | <=50K | 0.0136986 | 0.5333333 | 0.4795918 | 0 | 0 |
| 112 | 112 | 28 | Private | Community | 10 | Married | White | Male | 40 | <=50K | 0.1506849 | 0.6000000 | 0.3979592 | 0 | 0 |
| 120 | 120 | 43 | Private | Bachelors | 13 | Separated | White | Female | 40 | >50K | 0.3561644 | 0.8000000 | 0.3979592 | 1 | 1 |
| 123 | 123 | 19 | Private | Community | 10 | Not_married | White | Male | 30 | <=50K | 0.0273973 | 0.6000000 | 0.2959184 | 0 | 0 |
| 129 | 129 | 27 | Self-emp-not-inc | HighGrad | 9 | Married | White | Male | 60 | >50K | 0.1369863 | 0.5333333 | 0.6020408 | 1 | 1 |
| 131 | 131 | 41 | Private | Community | 10 | Married | White | Male | 40 | <=50K | 0.3287671 | 0.6000000 | 0.3979592 | 0 | 0 |
| 135 | 135 | 57 | Private | HighGrad | 9 | Married | Black | Male | 48 | <=50K | 0.5479452 | 0.5333333 | 0.4795918 | 0 | 0 |
| 141 | 141 | 46 | Private | Master | 14 | Married | White | Male | 40 | <=50K | 0.3972603 | 0.8666667 | 0.3979592 | 0 | 0 |
| 144 | 144 | 43 | Self-emp-inc | HighGrad | 9 | Married | White | Male | 45 | >50K | 0.3561644 | 0.5333333 | 0.4489796 | 1 | 1 |
| 145 | 145 | 34 | Private | Master | 14 | Not_married | White | Female | 30 | <=50K | 0.2328767 | 0.8666667 | 0.2959184 | 0 | 0 |
| 147 | 147 | 44 | Private | Community | 11 | Widow | White | Female | 30 | <=50K | 0.3698630 | 0.6666667 | 0.2959184 | 0 | 0 |
| 153 | 153 | 50 | Private | Dropout | 4 | Married | White | Male | 20 | <=50K | 0.4520548 | 0.2000000 | 0.1938776 | 0 | 0 |
| 160 | 160 | 38 | Self-emp-inc | Bachelors | 13 | Separated | White | Male | 40 | <=50K | 0.2876712 | 0.8000000 | 0.3979592 | 0 | 0 |
| 161 | 161 | 55 | Private | Dropout | 7 | Married | White | Male | 30 | <=50K | 0.5205479 | 0.4000000 | 0.2959184 | 0 | 0 |
| 163 | 163 | 22 | Private | HighGrad | 9 | Married | White | Male | 45 | <=50K | 0.0684932 | 0.5333333 | 0.4489796 | 0 | 0 |
| 165 | 165 | 46 | State-gov | Master | 14 | Married | White | Male | 45 | >50K | 0.3972603 | 0.8666667 | 0.4489796 | 1 | 1 |
| 167 | 167 | 58 | Self-emp-not-inc | PhD | 16 | Married | White | Male | 16 | >50K | 0.5616438 | 1.0000000 | 0.1530612 | 1 | 1 |
| 168 | 168 | 42 | Private | HighGrad | 9 | Married | White | Male | 45 | <=50K | 0.3424658 | 0.5333333 | 0.4489796 | 0 | 0 |
| 171 | 171 | 54 | Private | HighGrad | 9 | Married | White | Male | 40 | >50K | 0.5068493 | 0.5333333 | 0.3979592 | 1 | 1 |
| 172 | 172 | 34 | Private | Master | 14 | Not_married | White | Male | 40 | <=50K | 0.2328767 | 0.8666667 | 0.3979592 | 0 | 0 |
| 173 | 173 | 26 | Private | Bachelors | 13 | Not_married | White | Female | 40 | <=50K | 0.1232877 | 0.8000000 | 0.3979592 | 0 | 0 |
| 175 | 175 | 48 | Local-gov | Master | 14 | Separated | Black | Female | 40 | <=50K | 0.4246575 | 0.8666667 | 0.3979592 | 0 | 0 |
| 176 | 176 | 36 | Private | Community | 10 | Married | White | Male | 45 | >50K | 0.2602740 | 0.6000000 | 0.4489796 | 1 | 1 |
| 185 | 185 | 44 | Private | HighGrad | 9 | Separated | Black | Female | 40 | <=50K | 0.3698630 | 0.5333333 | 0.3979592 | 0 | 0 |
| 189 | 189 | 34 | State-gov | Bachelors | 13 | Not_married | Black | Male | 40 | <=50K | 0.2328767 | 0.8000000 | 0.3979592 | 0 | 0 |
| 193 | 193 | 47 | Private | Community | 10 | Separated | Black | Female | 37 | <=50K | 0.4109589 | 0.6000000 | 0.3673469 | 0 | 0 |
| 196 | 196 | 31 | Private | Community | 10 | Separated | White | Female | 40 | <=50K | 0.1917808 | 0.6000000 | 0.3979592 | 0 | 0 |
8. Evaluate the prediction model with a confusion matrix
matriz.confusion <- table(predicho = prediccion, real = datos.validacion$income10)
matriz.confusion
## real
## predicho 0 1
## 0 11146 0
## 1 0 3506
- Accuracy = 100% for the SVM algorithm with a linear kernel
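caret (already loaded) reports the same matrix together with accuracy, sensitivity and specificity; a brief sketch, taking class 1 (>50K) as the positive class by assumption:
# Confusion matrix with summary statistics from caret
confusionMatrix(data = prediccion, reference = datos.validacion$income10, positive = "1")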
9. Interpret the case
The data set used in this case is the same one used in case 11. It describes the income of people in the USA and indicates whether each person earns more than 50,000 dollars or at most 50,000 dollars, depending on several variables. Case 11 used a logistic regression algorithm, while this case 13 uses a Support Vector Machine with a linear kernel. The dependent variable is again income, represented as ">50K" and "<=50K" and then converted to the binary values 1 and 0 respectively. The independent variables are:

- age: the person's age
- workclass: the person's type of employment (private, government, self-employed)
- education: the person's educational level
- educational.num: the numeric value of education
- marital.status: the person's marital status
- race: the person's race
- gender: the person's gender
- hours.per.week: the hours worked per week
- income: the income bracket
- age.scale: the scaled age
- hours.per.week.scale: the scaled hours per week

Within the Support Vector Machine model, as already mentioned, a linear kernel was used with a cost of 0.05; the model is supported by 246 support vectors and distinguishes only two classes, in this case 0 and 1. The confusion matrix shows that 11146 records of class 0 and 3506 records of class 1 were classified correctly, while every other cell of the matrix is 0, which means the model reaches an accuracy of 100% on the validation set, clearly higher than the logistic regression model. It is worth noting, however, that the SVM formula income10 ~ . keeps income (the text form of the target) among the predictors, which on its own explains the perfect separation, whereas the logistic regression of case 11 uses a formula that excludes it.
10. Compare with the logistic regression algorithm from case 11
Logistic regression model
formula = income10 ~ age.scale + workclass + education + marital.status + race + gender + hours.per.week.scale
modelo2 <- glm(formula, data = datos.entrenamiento, family = 'binomial')
summary(modelo2)
##
## Call:
## glm(formula = formula, family = "binomial", data = datos.entrenamiento)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7337 -0.5768 -0.2588 -0.0654 3.3492
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.419e+00 2.228e-01 -10.858 < 2e-16 ***
## age.scale 2.224e+00 1.053e-01 21.121 < 2e-16 ***
## workclassFederal-gov 1.421e+00 1.237e-01 11.485 < 2e-16 ***
## workclassLocal-gov 6.942e-01 1.100e-01 6.312 2.76e-10 ***
## workclassNever-worked -8.124e+00 1.042e+02 -0.078 0.9379
## workclassPrivate 8.124e-01 9.598e-02 8.464 < 2e-16 ***
## workclassSelf-emp-inc 1.218e+00 1.186e-01 10.270 < 2e-16 ***
## workclassSelf-emp-not-inc 1.878e-01 1.071e-01 1.753 0.0797 .
## workclassState-gov 5.339e-01 1.223e-01 4.367 1.26e-05 ***
## workclassWithout-pay -3.965e-01 8.276e-01 -0.479 0.6318
## educationCommunity -9.930e-01 4.428e-02 -22.426 < 2e-16 ***
## educationDropout -2.782e+00 7.802e-02 -35.657 < 2e-16 ***
## educationHighGrad -1.611e+00 4.523e-02 -35.610 < 2e-16 ***
## educationMaster 6.250e-01 6.110e-02 10.230 < 2e-16 ***
## educationPhD 1.077e+00 1.379e-01 7.814 5.55e-15 ***
## marital.statusNot_married -2.491e+00 5.355e-02 -46.511 < 2e-16 ***
## marital.statusSeparated -2.102e+00 5.650e-02 -37.214 < 2e-16 ***
## marital.statusWidow -2.163e+00 1.287e-01 -16.809 < 2e-16 ***
## raceAsian-Pac-Islander -2.461e-02 2.074e-01 -0.119 0.9055
## raceBlack 4.784e-04 1.968e-01 0.002 0.9981
## raceOther -9.881e-02 2.817e-01 -0.351 0.7258
## raceWhite 2.155e-01 1.876e-01 1.148 0.2509
## genderMale 9.432e-02 4.455e-02 2.117 0.0342 *
## hours.per.week.scale 3.136e+00 1.398e-01 22.430 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37626 on 34189 degrees of freedom
## Residual deviance: 25011 on 34166 degrees of freedom
## AIC: 25059
##
## Number of Fisher Scoring iterations: 11
Make predictions
predicciones <- predict(modelo2, datos.validacion, se.fit = TRUE)
predicciones.ajustadas <- predicciones$fit
kable(head(predicciones.ajustadas, 10))
|   | x |
|---|---|
| 6 | -5.1235447 |
| 15 | -0.4589452 |
| 17 | -4.1994476 |
| 36 | -1.0094459 |
| 41 | 2.3586616 |
| 49 | -4.8803438 |
| 51 | -4.1419552 |
| 54 | -3.3575676 |
| 56 | -3.5047903 |
| 58 | 0.6710119 |
Convert the predictions into probabilities
predicciones_prob <- exp(predicciones$fit) / (1 + exp(predicciones$fit))
kable(head(predicciones_prob))
|   | x |
|---|---|
| 6 | 0.0059196 |
| 15 | 0.3872361 |
| 17 | 0.0147821 |
| 36 | 0.2670883 |
| 41 | 0.9136202 |
| 49 | 0.0075372 |
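The manual transformation exp(fit) / (1 + exp(fit)) is the inverse logit, so the same probabilities can be obtained directly from predict() with type = "response"; a one-line check:
# type = "response" applies the inverse logit for a binomial glm
predicciones_prob2 <- predict(modelo2, datos.validacion, type = "response")
all.equal(as.numeric(predicciones_prob), as.numeric(predicciones_prob2)) # expected TRUE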
Add a prediction column at the end of the validation data set
- Add a column with the probability values of the predictions to the validation data using cbind(), creating a new data set called las.predicciones
- Add a column with the value 1 when the prediction is greater than 0.5 and 0 when it is less than or equal to 0.5
- Check the income10 and income10.prediccion columns
las.predicciones <- cbind(datos.validacion, predicciones_prob)
las.predicciones <- las.predicciones %>%
mutate(income10.prediccion = if_else(predicciones_prob > 0.5, 1, 0))
kable(head(las.predicciones))
| X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 | prediccion | predicciones_prob | income10.prediccion |
|---|-----|-----------|-----------|-----------------|----------------|------|--------|----------------|--------|-----------|-----------------------|----------------------|----------|------------|-------------------|---------------------|
| 6 | 34 | Private | Dropout | 6 | Not_married | White | Male | 30 | <=50K | 0.2328767 | 0.3333333 | 0.2959184 | 0 | 0 | 0.0059196 | 0 |
| 15 | 48 | Private | HighGrad | 9 | Married | White | Male | 48 | >50K | 0.4246575 | 0.5333333 | 0.4795918 | 1 | 1 | 0.3872361 | 0 |
| 17 | 20 | State-gov | Community | 10 | Not_married | White | Male | 25 | <=50K | 0.0410959 | 0.6000000 | 0.2448980 | 0 | 0 | 0.0147821 | 0 |
| 36 | 65 | ? | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.6575342 | 0.5333333 | 0.3979592 | 0 | 0 | 0.2670883 | 0 |
| 41 | 65 | Private | Master | 14 | Married | White | Male | 50 | >50K | 0.6575342 | 0.8666667 | 0.5000000 | 1 | 1 | 0.9136202 | 1 |
| 49 | 52 | Private | Dropout | 7 | Separated | Black | Female | 18 | <=50K | 0.4794521 | 0.4000000 | 0.1734694 | 0 | 0 | 0.0075372 | 0 |
Confusion matrix of the predictions with logistic regression
matriz_confusion_predicciones <- table(las.predicciones$income10, las.predicciones$income10.prediccion, dnn = c("income10", "predicciones"))
matriz_confusion_predicciones
## predicciones
## income10 0 1
## 0 10357 789
## 1 1762 1744
- Accuracy = 82.58% for the logistic regression algorithm
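The accuracy follows directly from the matrix as the number of correct predictions over the total; a quick check with the objects already created:
# Correct predictions (main diagonal) over all validation records: (10357 + 1744) / 14652
sum(diag(matriz_confusion_predicciones)) / sum(matriz_confusion_predicciones)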
Confusion matrix of the predictions with the linear-kernel SVM
matriz.confusion <- table(predicho = prediccion, real = datos.validacion$income10)
matriz.confusion
## real
## predicho 0 1
## 0 11146 0
## 1 0 3506
- Accuracy = 100% for the SVM algorithm with a linear kernel
As the predictions show, the algorithm with the higher accuracy is the linear-kernel SVM, so we conclude that the algorithm best suited to this case (the data set on the income of people in the USA) is the SVM with a linear kernel.
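A compact way to place the two results side by side, computed from the two confusion matrices built above:
# Accuracy of both models on the same validation set
comparacion <- data.frame(
  modelo = c("SVM, linear kernel", "Logistic regression"),
  accuracy = c(sum(diag(matriz.confusion)) / sum(matriz.confusion),
               sum(diag(matriz_confusion_predicciones)) / sum(matriz_confusion_predicciones))
)
kable(comparacion, caption = "Accuracy of both models on the validation data")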