Realizar e interpretar regresión logísitica con datos de personas e ingresos de USA
Construir un modelo de regresión logísitca aplicado a datos de personas y sus ingresos en USA
La variable dependiente es los ingresos identificado por 0 y 1, los ganan por debajo o igual a 50 Mil y los que ganan por encima de 50 Mil.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
library(caret)
## Warning: package 'caret' was built under R version 4.0.3
## Loading required package: lattice
library(readr)
datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/FundamentosMachineLearning/master/datos/adultos_clean.csv", encoding = "UTF-8")
kable(head(datos))
| X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 25 | Private | Dropout | 7 | Not_married | Black | Male | 40 | <=50K | 0.1095890 | 0.4000000 | 0.3979592 | 0 |
| 2 | 38 | Private | HighGrad | 9 | Married | White | Male | 50 | <=50K | 0.2876712 | 0.5333333 | 0.5000000 | 0 |
| 3 | 28 | Local-gov | Community | 12 | Married | White | Male | 40 | >50K | 0.1506849 | 0.7333333 | 0.3979592 | 1 |
| 4 | 44 | Private | Community | 10 | Married | Black | Male | 40 | >50K | 0.3698630 | 0.6000000 | 0.3979592 | 1 |
| 5 | 18 | ? | Community | 10 | Not_married | White | Female | 30 | <=50K | 0.0136986 | 0.6000000 | 0.2959184 | 0 |
| 6 | 34 | Private | Dropout | 6 | Not_married | White | Male | 30 | <=50K | 0.2328767 | 0.3333333 | 0.2959184 | 0 |
kable(tail(datos))
| X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48837 | 48837 | 22 | Private | Community | 10 | Not_married | White | Male | 40 | <=50K | 0.0684932 | 0.6000000 | 0.3979592 | 0 |
| 48838 | 48838 | 27 | Private | Community | 12 | Married | White | Female | 38 | <=50K | 0.1369863 | 0.7333333 | 0.3775510 | 0 |
| 48839 | 48839 | 40 | Private | HighGrad | 9 | Married | White | Male | 40 | >50K | 0.3150685 | 0.5333333 | 0.3979592 | 1 |
| 48840 | 48840 | 58 | Private | HighGrad | 9 | Widow | White | Female | 40 | <=50K | 0.5616438 | 0.5333333 | 0.3979592 | 0 |
| 48841 | 48841 | 22 | Private | HighGrad | 9 | Not_married | White | Male | 20 | <=50K | 0.0684932 | 0.5333333 | 0.1938776 | 0 |
| 48842 | 48842 | 52 | Self-emp-inc | HighGrad | 9 | Married | White | Female | 40 | >50K | 0.4794521 | 0.5333333 | 0.3979592 | 1 |
set.seed(2020)
entrena <- createDataPartition(y = datos$income10, p = 0.7, list = FALSE, times = 1)
# Datos entrenamiento
datos.entrenamiento <- datos[entrena, ] # [renglones, columna]
# Datos validación
datos.validacion <- datos[-entrena, ]
kable(head(datos.entrenamiento, 10), caption = "Datos de entrenamiento (primeros diez)", row.names = 1:nrow(datos.entrenamiento))
## Warning in if (is.na(row.names)) row.names = has_rownames(x): la condición tiene
## longitud > 1 y sólo el primer elemento será usado
## Warning in if (row.names) {: la condición tiene longitud > 1 y sólo el primer
## elemento será usado
| X | age | workclass | education | educational.num | marital.status | race | gender | hours.per.week | income | age.scale | educational.num.scale | hours.per.week.scale | income10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 25 | Private | Dropout | 7 | Not_married | Black | Male | 40 | <=50K | 0.1095890 | 0.4000000 | 0.3979592 | 0 |
| 2 | 2 | 38 | Private | HighGrad | 9 | Married | White | Male | 50 | <=50K | 0.2876712 | 0.5333333 | 0.5000000 | 0 |
| 5 | 5 | 18 | ? | Community | 10 | Not_married | White | Female | 30 | <=50K | 0.0136986 | 0.6000000 | 0.2959184 | 0 |
| 7 | 7 | 29 | ? | HighGrad | 9 | Not_married | Black | Male | 40 | <=50K | 0.1643836 | 0.5333333 | 0.3979592 | 0 |
| 8 | 8 | 63 | Self-emp-not-inc | Master | 15 | Married | White | Male | 32 | >50K | 0.6301370 | 0.9333333 | 0.3163265 | 1 |
| 9 | 9 | 24 | Private | Community | 10 | Not_married | White | Female | 40 | <=50K | 0.0958904 | 0.6000000 | 0.3979592 | 0 |
| 10 | 10 | 55 | Private | Dropout | 4 | Married | White | Male | 10 | <=50K | 0.5205479 | 0.2000000 | 0.0918367 | 0 |
| 11 | 11 | 65 | Private | HighGrad | 9 | Married | White | Male | 40 | >50K | 0.6575342 | 0.5333333 | 0.3979592 | 1 |
| 13 | 13 | 26 | Private | HighGrad | 9 | Not_married | White | Female | 39 | <=50K | 0.1232877 | 0.5333333 | 0.3877551 | 0 |
| 14 | 14 | 58 | ? | HighGrad | 9 | Married | White | Male | 35 | <=50K | 0.5616438 | 0.5333333 | 0.3469388 | 0 |
Por medio de la función gml() se contruye un modelo de regresión logística
Variable dependiente o predictiva es ‘income10’, ya que depende de todas las demás variables
Variables independientes o predictoras, todas las demás: “age”, “workclass”, “education”, “educational.num”, “marital.status”, “race”, “gender”, “hours.per.week”
Se utiliza el conjunto de datos de entrenamiento
La finalidad de consruir el modelo de regresión logística es entre otros cosas, para conocer los coeficienes y el nivel de significación de cada variable independiente o predictora así como las pruebas t y F
La ecuación $ income10=age.scale+workclass+education+marital.status+race+gender+hours.per.week.scale$ que se asigna a una variable y se utiliza para construir el modelo, significa que la variable ingresos * ‘income10’ depende o es dependiente de todas las demás variables
La fórmula dada por la ecuacuación:
formula = income10 ~ age.scale + workclass + education + marital.status + race + gender + hours.per.week.scale
modelo <- glm(formula, data = datos.entrenamiento, family = 'binomial')
summary(modelo)
##
## Call:
## glm(formula = formula, family = "binomial", data = datos.entrenamiento)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7548 -0.5778 -0.2603 -0.0684 3.3249
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.76730 0.23517 -11.767 < 2e-16 ***
## age.scale 2.04720 0.10474 19.546 < 2e-16 ***
## workclassFederal-gov 1.39346 0.12545 11.107 < 2e-16 ***
## workclassLocal-gov 0.76117 0.11093 6.862 6.79e-12 ***
## workclassNever-worked -7.98118 98.02680 -0.081 0.93511
## workclassPrivate 0.88906 0.09705 9.161 < 2e-16 ***
## workclassSelf-emp-inc 1.32433 0.11899 11.130 < 2e-16 ***
## workclassSelf-emp-not-inc 0.31448 0.10762 2.922 0.00348 **
## workclassState-gov 0.48978 0.12316 3.977 6.99e-05 ***
## workclassWithout-pay 0.15627 0.85256 0.183 0.85457
## educationCommunity -1.00153 0.04434 -22.589 < 2e-16 ***
## educationDropout -2.69532 0.07543 -35.731 < 2e-16 ***
## educationHighGrad -1.58616 0.04511 -35.164 < 2e-16 ***
## educationMaster 0.59823 0.06103 9.803 < 2e-16 ***
## educationPhD 1.22350 0.13962 8.763 < 2e-16 ***
## marital.statusNot_married -2.53858 0.05407 -46.948 < 2e-16 ***
## marital.statusSeparated -2.10873 0.05619 -37.531 < 2e-16 ***
## marital.statusWidow -2.10698 0.12348 -17.064 < 2e-16 ***
## raceAsian-Pac-Islander 0.32012 0.21854 1.465 0.14298
## raceBlack 0.34440 0.20941 1.645 0.10005
## raceOther 0.26768 0.28964 0.924 0.35540
## raceWhite 0.58693 0.20059 2.926 0.00343 **
## genderMale 0.10151 0.04450 2.281 0.02252 *
## hours.per.week.scale 3.05128 0.13834 22.056 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37584 on 34189 degrees of freedom
## Residual deviance: 25090 on 34166 degrees of freedom
## AIC: 25138
##
## Number of Fisher Scoring iterations: 11
La mayoría de las variables tiene in efecto estadísticamente significativo en relación a la variable dependiente, dado los ’***’.
prediccion <- predict(object = modelo, newdata = datos.validacion)
predicciones <- data.frame(datos.validacion, prediccion)
age.scale=0.3342466
education.num=14
income10=0
hours.per.week.scale=0.39795918
workclass="Local-gov"
education="PhD"
marital.status= "Married"
race="White"
gender="Male"
nuevos.datos <- data.frame(race=race,gender=gender,marital.status=marital.status,education=education,workclass=workclass,age.scale=age.scale,education.num=education.num,income10=income10,hours.per.week.scale=hours.per.week.scale)
prediccion <- predict(object = modelo,
newdata = nuevos.datos)
paste("El valor del precio predicho es: ", round(prediccion, 2))
## [1] "El valor del precio predicho es: 1.8"