Perform and interpret logistic regression with data on people and income in the USA.
Build a logistic regression model applied to data on people and their incomes in the USA.
The dependent variable is income, coded as 0 and 1: those who earn at or below 50 thousand dollars and those who earn above 50 thousand dollars, respectively.
The development follows this process:
Load libraries
Load data
Identify variables
Create training and validation data
Build the logistic regression model
Analyze and/or describe the model
Evaluate the model with a confusion matrix
Make predictions with the validation dataset
Interpret the case
library(ggplot2) # Graphics
library(dplyr) # Filter data
library(knitr) # Nicer data tables
library(caret) # Data partitioning
library(readr) # Import CSV
library(DT) # install.packages('DT') if not installed
Load the data from: "https://raw.githubusercontent.com/rpizarrog/FundamentosMachineLearning/master/datos/adultos_clean.csv"
datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/FundamentosMachineLearning/master/datos/adultos_clean.csv", encoding = "UTF-8")
# datatable(datos, caption = "The data", options = list(pageLength = 10))
kable(head(datos, 10), caption = "First ten records")
| X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 25 | Private | Dropout | 7 | Not_married | Black | Male | 40 | <=50K | 0.1095890 | 0.4000000 | 0.3979592 | 0 |
| 2 | 38 | Private | HighGrad | 9 | Married | White | Male | 50 | <=50K | 0.2876712 | 0.5333333 | 0.5000000 | 0 |
| 3 | 28 | Local-gov | Community | 12 | Married | White | Male | 40 | >50K | 0.1506849 | 0.7333333 | 0.3979592 | 1 |
| 4 | 44 | Private | Community | 10 | Married | Black | Male | 40 | >50K | 0.3698630 | 0.6000000 | 0.3979592 | 1 |
| 5 | 18 | ? | Community | 10 | Not_married | White | Female | 30 | <=50K | 0.0136986 | 0.6000000 | 0.2959184 | 0 |
| 6 | 34 | Private | Dropout | 6 | Not_married | White | Male | 30 | <=50K | 0.2328767 | 0.3333333 | 0.2959184 | 0 |
| 7 | 29 | ? | HighGrad | 9 | Not_married | Black | Male | 40 | <=50K | 0.1643836 | 0.5333333 | 0.3979592 | 0 |
| 8 | 63 | Self-emp-not-inc | Master | 15 | Married | White | Male | 32 | >50K | 0.6301370 | 0.9333333 | 0.3163265 | 1 |
| 9 | 24 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0958904 | 0.6000000 | 0.3979592 | 0 |
| 10 | 55 | Private | Dropout | 4 | Married | White | Male | 10 | <=50K | 0.5205479 | 0.2000000 | 0.0918367 | 0 |
kable(tail(datos, 10), caption = "Last ten records")
| | X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48833 | 48833 | 32 | Private | Dropout | 6 | Married | Amer-Indian-Eskimo | Male | 40 | <=50K | 0.2054795 | 0.3333333 | 0.3979592 | 0 |
| 48834 | 48834 | 43 | Private | Community | 11 | Married | White | Male | 45 | <=50K | 0.3561644 | 0.6666667 | 0.4489796 | 0 |
| 48835 | 48835 | 32 | Private | Master | 14 | Not_married | Asian-Pac-Islander | Male | 11 | <=50K | 0.2054795 | 0.8666667 | 0.1020408 | 0 |
| 48836 | 48836 | 53 | Private | Master | 14 | Married | White | Male | 40 | >50K | 0.4931507 | 0.8666667 | 0.3979592 | 1 |
| 48837 | 48837 | 22 | Private | Community | 10 | Not_married | White | Male | 40 | <=50K | 0.0684932 | 0.6000000 | 0.3979592 | 0 |
| 48838 | 48838 | 27 | Private | Community | 12 | Married | White | Female | 38 | <=50K | 0.1369863 | 0.7333333 | 0.3775510 | 0 |
| 48839 | 48839 | 40 | Private | HighGrad | 9 | Married | White | Male | 40 | >50K | 0.3150685 | 0.5333333 | 0.3979592 | 1 |
| 48840 | 48840 | 58 | Private | HighGrad | 9 | Widow | White | Female | 40 | <=50K | 0.5616438 | 0.5333333 | 0.3979592 | 0 |
| 48841 | 48841 | 22 | Private | HighGrad | 9 | Not_married | White | Male | 20 | <=50K | 0.0684932 | 0.5333333 | 0.1938776 | 0 |
| 48842 | 48842 | 52 | Self-emp-inc | HighGrad | 9 | Married | White | Female | 40 | >50K | 0.4794521 | 0.5333333 | 0.3979592 | 1 |
The variables are described below:
age: the person's age
workclass: the person's type or class of work (private, government, self-employed, ...)
education: the person's education level
educational.num: the numeric value of education
marital.status: the person's marital status
race: the person's race
gender: the person's gender
hours.per.week: the hours worked per week
income: the income category
age.scale: the scaled age
educational.num.scale and hours.per.week.scale: the scaled versions of educational.num and hours.per.week
income10: 0 if the person earns at or below 50 thousand, 1 if above 50 thousand
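As a quick check (a minimal sketch, not part of the original script), the column types and the class balance of income10 can be inspected:
# Column types and number of records per income10 class
str(datos)
table(datos$income10)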
Training data and validation data
The original dataset is partitioned 70/30, that is,
70% training data and
30% validation data.
The variable used to partition the data is income10, which carries values of 0 and 1.
set.seed(2021)
entrena <- createDataPartition(y = datos$income10, p = 0.7, list = FALSE, times = 1)
# Training data
datos.entrenamiento <- datos[entrena, ] # [rows, columns]
# Validation data
datos.validacion <- datos[-entrena, ]
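As a sanity check (a sketch, not in the original script), the proportion of income10 = 1 in each split can be compared; createDataPartition stratifies on y, so the proportions should be similar:
# Proportion of persons earning above 50K in the full data and in each split
round(c(original = mean(datos$income10),
        entrenamiento = mean(datos.entrenamiento$income10),
        validacion = mean(datos.validacion$income10)), 4)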
# kable(head(datos.entrenamiento, 10), caption = "Training data (first ten)")
# kable(head(datos.validacion, 10), caption = "Validation data (first ten)")
# datatable(datos.entrenamiento, caption = "Training data", options = list(pageLength = 50))
kable(head(datos.entrenamiento, 50), caption = "Training data (first fifty)")
| | X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 25 | Private | Dropout | 7 | Not_married | Black | Male | 40 | <=50K | 0.1095890 | 0.4000000 | 0.3979592 | 0 |
| 4 | 4 | 44 | Private | Community | 10 | Married | Black | Male | 40 | >50K | 0.3698630 | 0.6000000 | 0.3979592 | 1 |
| 7 | 7 | 29 | ? | HighGrad | 9 | Not_married | Black | Male | 40 | <=50K | 0.1643836 | 0.5333333 | 0.3979592 | 0 |
| 8 | 8 | 63 | Self-emp-not-inc | Master | 15 | Married | White | Male | 32 | >50K | 0.6301370 | 0.9333333 | 0.3163265 | 1 |
| 9 | 9 | 24 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0958904 | 0.6000000 | 0.3979592 | 0 |
| 10 | 10 | 55 | Private | Dropout | 4 | Married | White | Male | 10 | <=50K | 0.5205479 | 0.2000000 | 0.0918367 | 0 |
| 12 | 12 | 36 | Federal-gov | Bachelors | 13 | Married | White | Male | 40 | <=50K | 0.2602740 | 0.8000000 | 0.3979592 | 0 |
| 13 | 13 | 26 | Private | HighGrad | 9 | Not_married | White | Female | 39 | <=50K | 0.1232877 | 0.5333333 | 0.3877551 | 0 |
| 15 | 15 | 48 | Private | HighGrad | 9 | Married | White | Male | 48 | >50K | 0.4246575 | 0.5333333 | 0.4795918 | 1 |
| 16 | 16 | 43 | Private | Master | 14 | Married | White | Male | 50 | >50K | 0.3561644 | 0.8666667 | 0.5000000 | 1 |
| 19 | 19 | 37 | Private | HighGrad | 9 | Widow | White | Female | 20 | <=50K | 0.2739726 | 0.5333333 | 0.1938776 | 0 |
| 20 | 20 | 40 | Private | PhD | 16 | Married | Asian-Pac-Islander | Male | 45 | >50K | 0.3150685 | 1.0000000 | 0.4489796 | 1 |
| 21 | 21 | 34 | Private | Bachelors | 13 | Married | White | Male | 47 | >50K | 0.2328767 | 0.8000000 | 0.4693878 | 1 |
| 22 | 22 | 34 | Private | Community | 10 | Not_married | Black | Female | 35 | <=50K | 0.2328767 | 0.6000000 | 0.3469388 | 0 |
| 24 | 24 | 25 | Private | Bachelors | 13 | Not_married | White | Male | 43 | <=50K | 0.1095890 | 0.8000000 | 0.4285714 | 0 |
| 27 | 27 | 22 | Private | HighGrad | 9 | Not_married | White | Male | 20 | <=50K | 0.0684932 | 0.5333333 | 0.1938776 | 0 |
| 28 | 28 | 23 | Private | HighGrad | 9 | Separated | Black | Male | 54 | <=50K | 0.0821918 | 0.5333333 | 0.5408163 | 0 |
| 29 | 29 | 54 | Private | HighGrad | 9 | Married | White | Male | 35 | <=50K | 0.5068493 | 0.5333333 | 0.3469388 | 0 |
| 30 | 30 | 32 | Self-emp-not-inc | Community | 10 | Not_married | White | Male | 60 | <=50K | 0.2054795 | 0.6000000 | 0.6020408 | 0 |
| 32 | 32 | 56 | Self-emp-not-inc | Dropout | 7 | Widow | White | Female | 50 | <=50K | 0.5342466 | 0.4000000 | 0.5000000 | 0 |
| 33 | 33 | 24 | Self-emp-not-inc | Bachelors | 13 | Not_married | White | Male | 50 | <=50K | 0.0958904 | 0.8000000 | 0.5000000 | 0 |
| 35 | 35 | 26 | Private | HighGrad | 9 | Separated | White | Female | 40 | <=50K | 0.1232877 | 0.5333333 | 0.3979592 | 0 |
| 37 | 37 | 36 | Local-gov | Bachelors | 13 | Married | White | Male | 40 | >50K | 0.2602740 | 0.8000000 | 0.3979592 | 1 |
| 38 | 38 | 22 | Private | Dropout | 3 | Not_married | White | Male | 50 | <=50K | 0.0684932 | 0.1333333 | 0.5000000 | 0 |
| 40 | 40 | 20 | Private | HighGrad | 9 | Not_married | White | Male | 40 | <=50K | 0.0410959 | 0.5333333 | 0.3979592 | 0 |
| 41 | 41 | 65 | Private | Master | 14 | Married | White | Male | 50 | >50K | 0.6575342 | 0.8666667 | 0.5000000 | 1 |
| 42 | 42 | 44 | Self-emp-inc | Community | 11 | Married | White | Male | 45 | >50K | 0.3698630 | 0.6666667 | 0.4489796 | 1 |
| 44 | 44 | 29 | Private | Dropout | 7 | Married | White | Male | 40 | <=50K | 0.1643836 | 0.4000000 | 0.3979592 | 0 |
| 46 | 46 | 28 | Private | Community | 11 | Married | White | Female | 36 | >50K | 0.1506849 | 0.6666667 | 0.3571429 | 1 |
| 47 | 47 | 39 | Private | Dropout | 4 | Married | White | Male | 40 | <=50K | 0.3013699 | 0.2000000 | 0.3979592 | 0 |
| 48 | 48 | 54 | Private | Community | 10 | Married | White | Male | 50 | <=50K | 0.5068493 | 0.6000000 | 0.5000000 | 0 |
| 49 | 49 | 52 | Private | Dropout | 7 | Separated | Black | Female | 18 | <=50K | 0.4794521 | 0.4000000 | 0.1734694 | 0 |
| 51 | 51 | 18 | Private | Community | 10 | Not_married | White | Male | 20 | <=50K | 0.0136986 | 0.6000000 | 0.1938776 | 0 |
| 52 | 52 | 39 | Private | HighGrad | 9 | Separated | Black | Male | 40 | <=50K | 0.3013699 | 0.5333333 | 0.3979592 | 0 |
| 53 | 53 | 21 | Private | Community | 10 | Not_married | White | Female | 24 | <=50K | 0.0547945 | 0.6000000 | 0.2346939 | 0 |
| 54 | 54 | 22 | Private | HighGrad | 9 | Not_married | White | Male | 60 | >50K | 0.0684932 | 0.5333333 | 0.6020408 | 1 |
| 55 | 55 | 38 | Private | Dropout | 5 | Not_married | White | Male | 54 | <=50K | 0.2876712 | 0.2666667 | 0.5408163 | 0 |
| 56 | 56 | 21 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0547945 | 0.6000000 | 0.3979592 | 0 |
| 57 | 57 | 63 | Private | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.6301370 | 0.5333333 | 0.3979592 | 0 |
| 58 | 58 | 34 | Local-gov | Bachelors | 13 | Married | White | Male | 50 | >50K | 0.2328767 | 0.8000000 | 0.5000000 | 1 |
| 59 | 59 | 42 | Self-emp-inc | HighGrad | 9 | Married | White | Male | 50 | >50K | 0.3424658 | 0.5333333 | 0.5000000 | 1 |
| 60 | 60 | 33 | Private | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.2191781 | 0.5333333 | 0.3979592 | 0 |
| 62 | 62 | 39 | Private | Community | 10 | Separated | White | Male | 40 | <=50K | 0.3013699 | 0.6000000 | 0.3979592 | 0 |
| 64 | 64 | 33 | Private | HighGrad | 9 | Not_married | White | Female | 40 | <=50K | 0.2191781 | 0.5333333 | 0.3979592 | 0 |
| 65 | 65 | 47 | Local-gov | HighGrad | 9 | Separated | White | Female | 40 | <=50K | 0.4109589 | 0.5333333 | 0.3979592 | 0 |
| 66 | 66 | 41 | Private | Bachelors | 13 | Not_married | White | Female | 40 | <=50K | 0.3287671 | 0.8000000 | 0.3979592 | 0 |
| 67 | 67 | 41 | Self-emp-inc | Community | 12 | Married | White | Male | 60 | >50K | 0.3287671 | 0.7333333 | 0.6020408 | 1 |
| 68 | 68 | 19 | Private | Community | 10 | Not_married | White | Male | 20 | <=50K | 0.0273973 | 0.6000000 | 0.1938776 | 0 |
| 69 | 69 | 46 | Private | HighGrad | 9 | Separated | White | Male | 40 | <=50K | 0.3972603 | 0.5333333 | 0.3979592 | 0 |
| 70 | 70 | 43 | Private | HighGrad | 9 | Married | White | Male | 48 | <=50K | 0.3561644 | 0.5333333 | 0.4795918 | 0 |
# datatable(datos.validacion, caption = "Validation data", options = list(pageLength = 50))
kable(head(datos.validacion, 50), caption = "Validation data (first fifty)")
| | X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2 | 38 | Private | HighGrad | 9 | Married | White | Male | 50 | <=50K | 0.2876712 | 0.5333333 | 0.5000000 | 0 |
| 3 | 3 | 28 | Local-gov | Community | 12 | Married | White | Male | 40 | >50K | 0.1506849 | 0.7333333 | 0.3979592 | 1 |
| 5 | 5 | 18 | ? | Community | 10 | Not_married | White | Female | 30 | <=50K | 0.0136986 | 0.6000000 | 0.2959184 | 0 |
| 6 | 6 | 34 | Private | Dropout | 6 | Not_married | White | Male | 30 | <=50K | 0.2328767 | 0.3333333 | 0.2959184 | 0 |
| 11 | 11 | 65 | Private | HighGrad | 9 | Married | White | Male | 40 | >50K | 0.6575342 | 0.5333333 | 0.3979592 | 1 |
| 14 | 14 | 58 | ? | HighGrad | 9 | Married | White | Male | 35 | <=50K | 0.5616438 | 0.5333333 | 0.3469388 | 0 |
| 17 | 17 | 20 | State-gov | Community | 10 | Not_married | White | Male | 25 | <=50K | 0.0410959 | 0.6000000 | 0.2448980 | 0 |
| 18 | 18 | 43 | Private | HighGrad | 9 | Married | White | Female | 30 | <=50K | 0.3561644 | 0.5333333 | 0.2959184 | 0 |
| 23 | 23 | 72 | ? | Dropout | 4 | Separated | White | Female | 6 | <=50K | 0.7534247 | 0.2000000 | 0.0510204 | 0 |
| 25 | 25 | 25 | Private | Bachelors | 13 | Married | White | Male | 40 | <=50K | 0.1095890 | 0.8000000 | 0.3979592 | 0 |
| 26 | 26 | 45 | Self-emp-not-inc | HighGrad | 9 | Married | White | Male | 90 | >50K | 0.3835616 | 0.5333333 | 0.9081633 | 1 |
| 31 | 31 | 46 | State-gov | Community | 10 | Married | Black | Male | 38 | >50K | 0.3972603 | 0.6000000 | 0.3775510 | 1 |
| 34 | 34 | 23 | Local-gov | Community | 10 | Married | White | Male | 40 | <=50K | 0.0821918 | 0.6000000 | 0.3979592 | 0 |
| 36 | 36 | 65 | ? | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.6575342 | 0.5333333 | 0.3979592 | 0 |
| 39 | 39 | 17 | Private | Dropout | 6 | Not_married | White | Male | 40 | <=50K | 0.0000000 | 0.3333333 | 0.3979592 | 0 |
| 43 | 43 | 36 | Private | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.2602740 | 0.5333333 | 0.3979592 | 0 |
| 45 | 45 | 20 | State-gov | Community | 10 | Not_married | White | Male | 32 | <=50K | 0.0410959 | 0.6000000 | 0.3163265 | 0 |
| 50 | 50 | 56 | Self-emp-inc | HighGrad | 9 | Widow | White | Female | 50 | <=50K | 0.5342466 | 0.5333333 | 0.5000000 | 0 |
| 61 | 61 | 30 | Private | Bachelors | 13 | Not_married | White | Female | 50 | <=50K | 0.1780822 | 0.8000000 | 0.5000000 | 0 |
| 63 | 63 | 26 | Private | Master | 14 | Not_married | White | Female | 40 | <=50K | 0.1232877 | 0.8666667 | 0.3979592 | 0 |
| 74 | 74 | 21 | Private | Community | 10 | Separated | White | Female | 40 | <=50K | 0.0547945 | 0.6000000 | 0.3979592 | 0 |
| 76 | 76 | 17 | ? | Dropout | 6 | Not_married | White | Male | 40 | <=50K | 0.0000000 | 0.3333333 | 0.3979592 | 0 |
| 78 | 78 | 69 | Self-emp-inc | HighGrad | 9 | Married | White | Male | 30 | <=50K | 0.7123288 | 0.5333333 | 0.2959184 | 0 |
| 85 | 85 | 31 | Self-emp-not-inc | Community | 10 | Married | White | Male | 50 | <=50K | 0.1917808 | 0.6000000 | 0.5000000 | 0 |
| 88 | 88 | 55 | Private | HighGrad | 9 | Married | White | Male | 56 | >50K | 0.5205479 | 0.5333333 | 0.5612245 | 1 |
| 89 | 89 | 24 | Federal-gov | Community | 10 | Not_married | White | Male | 40 | <=50K | 0.0958904 | 0.6000000 | 0.3979592 | 0 |
| 91 | 91 | 59 | Private | Bachelors | 13 | Not_married | White | Female | 25 | <=50K | 0.5753425 | 0.8000000 | 0.2448980 | 0 |
| 92 | 92 | 49 | Federal-gov | Dropout | 4 | Not_married | Black | Male | 20 | <=50K | 0.4383562 | 0.2000000 | 0.1938776 | 0 |
| 93 | 93 | 33 | Private | Master | 14 | Married | White | Female | 10 | >50K | 0.2191781 | 0.8666667 | 0.0918367 | 1 |
| 106 | 106 | 36 | Private | Dropout | 6 | Separated | White | Female | 40 | <=50K | 0.2602740 | 0.3333333 | 0.3979592 | 0 |
| 107 | 107 | 41 | Local-gov | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.3287671 | 0.5333333 | 0.3979592 | 0 |
| 108 | 108 | 28 | Private | HighGrad | 9 | Not_married | White | Male | 40 | <=50K | 0.1506849 | 0.5333333 | 0.3979592 | 0 |
| 109 | 109 | 19 | Private | Community | 10 | Not_married | Black | Male | 16 | <=50K | 0.0273973 | 0.6000000 | 0.1530612 | 0 |
| 112 | 112 | 28 | Private | Community | 10 | Married | White | Male | 40 | <=50K | 0.1506849 | 0.6000000 | 0.3979592 | 0 |
| 116 | 116 | 26 | Private | HighGrad | 9 | Not_married | White | Male | 40 | <=50K | 0.1232877 | 0.5333333 | 0.3979592 | 0 |
| 118 | 118 | 23 | Private | Dropout | 7 | Not_married | White | Female | 24 | <=50K | 0.0821918 | 0.4000000 | 0.2346939 | 0 |
| 121 | 121 | 31 | Local-gov | Bachelors | 13 | Separated | White | Female | 60 | <=50K | 0.1917808 | 0.8000000 | 0.6020408 | 0 |
| 123 | 123 | 19 | Private | Community | 10 | Not_married | White | Male | 30 | <=50K | 0.0273973 | 0.6000000 | 0.2959184 | 0 |
| 124 | 124 | 41 | Private | Community | 10 | Separated | White | Female | 40 | <=50K | 0.3287671 | 0.6000000 | 0.3979592 | 0 |
| 136 | 136 | 30 | Private | Community | 12 | Not_married | White | Female | 40 | <=50K | 0.1780822 | 0.7333333 | 0.3979592 | 0 |
| 147 | 147 | 44 | Private | Community | 11 | Widow | White | Female | 30 | <=50K | 0.3698630 | 0.6666667 | 0.2959184 | 0 |
| 150 | 150 | 19 | Private | HighGrad | 9 | Not_married | White | Male | 30 | <=50K | 0.0273973 | 0.5333333 | 0.2959184 | 0 |
| 151 | 151 | 28 | Private | Community | 10 | Not_married | Black | Male | 14 | <=50K | 0.1506849 | 0.6000000 | 0.1326531 | 0 |
| 152 | 152 | 27 | Private | Dropout | 7 | Married | Black | Female | 32 | <=50K | 0.1369863 | 0.4000000 | 0.3163265 | 0 |
| 153 | 153 | 50 | Private | Dropout | 4 | Married | White | Male | 20 | <=50K | 0.4520548 | 0.2000000 | 0.1938776 | 0 |
| 162 | 162 | 32 | Private | HighGrad | 9 | Married | White | Male | 40 | <=50K | 0.2054795 | 0.5333333 | 0.3979592 | 0 |
| 163 | 163 | 22 | Private | HighGrad | 9 | Married | White | Male | 45 | <=50K | 0.0684932 | 0.5333333 | 0.4489796 | 0 |
| 167 | 167 | 58 | Self-emp-not-inc | PhD | 16 | Married | White | Male | 16 | >50K | 0.5616438 | 1.0000000 | 0.1530612 | 1 |
| 171 | 171 | 54 | Private | HighGrad | 9 | Married | White | Male | 40 | >50K | 0.5068493 | 0.5333333 | 0.3979592 | 1 |
| 173 | 173 | 26 | Private | Bachelors | 13 | Not_married | White | Female | 40 | <=50K | 0.1232877 | 0.8000000 | 0.3979592 | 0 |
The logistic regression model is built with the training data.
Logistic regression involves a dependent variable whose value is estimated from a particular set of values of the chosen independent variables. Here the model estimates whether a person's income is \(\leq 50\) thousand or \(> 50\) thousand, which is the same as income taking the values \(0\) and \(1\) in the variable income10.
A logistic regression model is built with the glm() function.
The dependent or predicted variable is 'income10', since it depends on all the other variables.
The independent or predictor variables are all the others: 'age.scale', 'workclass', 'education', 'marital.status', 'race', 'gender', 'hours.per.week.scale'.
The training dataset is used.
The purpose of building the logistic regression model is, among other things, to obtain the coefficients and the significance level of each independent or predictor variable, together with the associated significance tests.
The equation:
income10 = age.scale + workclass + education + marital.status + race + gender + hours.per.week.scale
The formula is assigned to a variable and then used to build the model.
This means that the variable 'income10' depends on all the other variables.
The formula given by the equation:
formula = income10 ~ age.scale + workclass + education + marital.status + race + gender + hours.per.week.scale
modelo <- glm(formula, data = datos.entrenamiento, family = 'binomial')
summary(modelo)
##
## Call:
## glm(formula = formula, family = "binomial", data = datos.entrenamiento)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7493 -0.5658 -0.2560 -0.0645 3.3674
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.66731 0.22926 -11.635 < 2e-16 ***
## age.scale 2.18002 0.10543 20.677 < 2e-16 ***
## workclassFederal-gov 1.47838 0.12615 11.719 < 2e-16 ***
## workclassLocal-gov 0.79868 0.11252 7.098 1.27e-12 ***
## workclassNever-worked -6.82935 110.17471 -0.062 0.95057
## workclassPrivate 0.92899 0.09891 9.392 < 2e-16 ***
## workclassSelf-emp-inc 1.39342 0.12086 11.529 < 2e-16 ***
## workclassSelf-emp-not-inc 0.35580 0.10934 3.254 0.00114 **
## workclassState-gov 0.62340 0.12443 5.010 5.44e-07 ***
## workclassWithout-pay -0.13094 0.82861 -0.158 0.87444
## educationCommunity -0.96160 0.04441 -21.652 < 2e-16 ***
## educationDropout -2.74502 0.07749 -35.423 < 2e-16 ***
## educationHighGrad -1.56040 0.04527 -34.466 < 2e-16 ***
## educationMaster 0.69192 0.06167 11.219 < 2e-16 ***
## educationPhD 1.19683 0.13819 8.661 < 2e-16 ***
## marital.statusNot_married -2.56414 0.05397 -47.507 < 2e-16 ***
## marital.statusSeparated -2.14513 0.05613 -38.218 < 2e-16 ***
## marital.statusWidow -2.17315 0.12459 -17.443 < 2e-16 ***
## raceAsian-Pac-Islander 0.24192 0.21280 1.137 0.25561
## raceBlack 0.18126 0.20232 0.896 0.37031
## raceOther 0.14492 0.28953 0.501 0.61669
## raceWhite 0.45193 0.19310 2.340 0.01926 *
## genderMale 0.08478 0.04429 1.914 0.05557 .
## hours.per.week.scale 2.95571 0.13959 21.174 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37718 on 34189 degrees of freedom
## Residual deviance: 24923 on 34166 degrees of freedom
## AIC: 24971
##
## Number of Fisher Scoring iterations: 11
summary() shows that most of the variables are statistically significant, except those without a '*'.
Of the various coefficients generated by the model, the one that applies depends on the value of the corresponding variable: each coefficient is multiplied by the value of its variable, and for a categorical variable the coefficient of the observed level is the one that enters.
For example, for workclass (the person's work class):
| workclass | coefficient |
|---|---|
| Federal government: workclassFederal-gov | 1.47838 |
| Local government: workclassLocal-gov | 0.79868 |
| Never worked: workclassNever-worked | -6.82935 |
| Private employment: workclassPrivate | 0.92899 |
| Self-employed, own incorporated business: workclassSelf-emp-inc | 1.39342 |
| Self-employed, not incorporated (informal): workclassSelf-emp-not-inc | 0.35580 |
| State government: workclassState-gov | 0.62340 |
| Without pay (volunteer work): workclassWithout-pay | -0.13094 |
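A common complementary reading of these coefficients (a side sketch, not part of the original analysis) is to exponentiate them into odds ratios, that is, the multiplicative change in the odds of income10 = 1 for each workclass level relative to the reference level:
# exp(coefficient) = factor by which the odds of earning >50K change
# for each workclass level versus the reference level
round(exp(coef(modelo))[grep("^workclass", names(coef(modelo)))], 3)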
Then a possible prediction would look like this:
\[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n \]
And ...
\[ \text{income10} = -2.66731 + \text{workclass}\cdot(\text{coefficient}) + \text{education}\cdot(\text{coefficient}) \\ + \text{marital.status}\cdot(\text{coefficient}) + \text{race}\cdot(\text{coefficient}) \\ + \text{gender}\cdot(\text{coefficient}) + \text{hours.per.week.scale}\cdot(\text{coefficient}) \\ + \text{the other variables}\cdot(\text{coefficients}) \]
It can be seen that having one's own incorporated business (Self-emp-inc) raises the estimated log-odds of earning above 50 thousand more than most other work classes; only Federal-gov has a larger coefficient.
Predictions on the validation set are made later in this case; the sketch below only illustrates how a single fitted probability arises from the coefficients.
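Strictly speaking, the linear combination above gives the log-odds (logit) of income10 = 1, and the probability is obtained by applying the logistic function. A minimal sketch, using the first training record as an illustrative example:
# Log-odds (linear predictor) for one record, then the logistic transform
persona <- datos.entrenamiento[1, ]
eta <- predict(modelo, newdata = persona, type = "link") # beta_0 + sum of coefficient * value
plogis(eta) # probability that income10 = 1
predict(modelo, newdata = persona, type = "response") # same probability, computed directly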
To evaluate the model's performance, the confusion matrix is created.
A confusion matrix is a tool for visualizing the performance of an algorithm used in supervised learning.
Each column of the matrix represents the number of predictions for each class, while each row represents the instances of the actual class.
One of the benefits of confusion matrices is that they make it easy to see whether the system is confusing the different classes or outcomes.
Confusion matrix
include_graphics("../Imagenes/matriz confusion.jpg")
The goal is to find how many cases the model classifies correctly and, from that, the percentage of correct classifications; this is done first with the fitted values on the training data (predictions on the validation data come later).
Using the training data
A data frame called comparar is created with three columns (income10, ajuste, income10ajustados): the original income10 with values \(0\) and \(1\), the fitted probability, and the fitted class. With columns 1 and 3 the confusion matrix can be generated.
The first ten and last ten records are shown.
# Actual values and fitted probabilities on the training data
comparar <- data.frame(datos.entrenamiento$income10, as.vector(modelo$fitted.values))
# Fitted class: 1 if the fitted probability exceeds 0.5, otherwise 0
comparar <- comparar %>%
  mutate(income10ajustados = if_else(modelo$fitted.values > 0.5, 1, 0))
colnames(comparar) <- c("income10", "ajuste", "income10ajustados")
kable(head(comparar, 10), caption = "Compare values, first ten")
| income10 | ajuste | income10ajustados |
|---|---|---|
| 0 | 0.0046497 | 0 |
| 1 | 0.3890455 | 0 |
| 0 | 0.0067516 | 0 |
| 1 | 0.7730817 | 1 |
| 0 | 0.0314673 | 0 |
| 0 | 0.0730755 | 0 |
| 0 | 0.7486522 | 1 |
| 0 | 0.0180558 | 0 |
| 1 | 0.3968239 | 0 |
| 1 | 0.8512711 | 1 |
kable(tail(comparar, 10), caption = "Compare values, last ten")
| | income10 | ajuste | income10ajustados |
|---|---|---|---|
| 34181 | 1 | 0.2408071 | 0 |
| 34182 | 0 | 0.0678873 | 0 |
| 34183 | 0 | 0.3931412 | 0 |
| 34184 | 0 | 0.0603121 | 0 |
| 34185 | 0 | 0.3817556 | 0 |
| 34186 | 0 | 0.4850764 | 0 |
| 34187 | 1 | 0.8508947 | 1 |
| 34188 | 0 | 0.0322399 | 0 |
| 34189 | 0 | 0.0679071 | 0 |
| 34190 | 0 | 0.0099146 | 0 |
# datatable(comparar, caption = "Compare values", options = list(pageLength = 10))
The plot colors each point by its actual income10 value (red = 0, blue = 1) and draws a horizontal line at the 0.5 probability threshold.
Red points above the line and blue points below it are the cases where the fitted class does not match the actual value, that is, the actual value was 0 but the fitted value is 1, or vice versa.
ggplot(data = comparar, aes(x = row.names(comparar), y = ajuste)) +
  geom_point(aes(colour = factor(income10))) +
  geom_hline(yintercept = 0.50)
The confusion matrix is generated from the data in the comparar data frame.
matriz_confusion <- table(comparar$income10, comparar$income10ajustados, dnn = c("income10", "income10ajustados (predicted)"))
kable(matriz_confusion, caption = "Confusion matrix")
| | 0 | 1 |
|---|---|---|
| 0 | 24068 | 1901 |
| 1 | 4004 | 4217 |
What does the confusion matrix mean?
The confusion matrix indicates how many records the model got right.
The model correctly classifies 24068 records as earning at or below 50 thousand, out of the 34190 total records in the training set.
The model correctly classifies 4217 records as earning above 50 thousand, out of the 34190 total records in the training set.
What percentage of correct classifications was there? The hits are the true positives (TP) and the true negatives (TN), according to the confusion matrix image.
The model is evaluated with these measures:
Precision \[ precision = \frac{TP}{TP + FP} \]
Recall \[ recall = \frac{TP}{TP + FN} \]
Accuracy \[ accuracy = \frac{TP + TN}{n} \]
Specificity (true negative rate) \[ specificity = \frac{TN}{TN + FP} \]
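A minimal sketch of these measures computed directly from matriz_confusion, assuming the orientation shown above (rows are the actual income10, columns the fitted income10ajustados); for this matrix the accuracy on the training data comes out to roughly 0.83:
# Extract the cells by their dimnames: rows = actual, columns = predicted
TP <- matriz_confusion["1", "1"] # actual 1, predicted 1
TN <- matriz_confusion["0", "0"] # actual 0, predicted 0
FP <- matriz_confusion["0", "1"] # actual 0, predicted 1
FN <- matriz_confusion["1", "0"] # actual 1, predicted 0
n  <- sum(matriz_confusion)
round(c(precision   = TP / (TP + FP),
        recall      = TP / (TP + FN),
        accuracy    = (TP + TN) / n,
        specificity = TN / (TN + FP)), 4)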