Análisis de regresión logística para predicción de ingresos de personas

La regresión logística se usa para predecir una clase, es decir, una probabilidad. La regresión logística puede predecir un resultado binario. Como ejemplo podría ser predecir si un préstamo es denegado / aceptado en función de muchos atributos.La regresión logística es de la forma 0/1. y = 0 si se rechaza un préstamo, y = 1 si se acepta. (Guru99, 2020)

En muchas aplicaciones de la regresión la variable dependiente asume sólo dos valores discretos, por ejemplo, en un banco suele necesitarse una ecuación de regresión estimada para predecir si a una persona se le aprobará su solicitud de tarjeta de crédito. A esta variable dependiente pueden dársele los valores y = 1 si la solicitud de tarjeta de crédito es aprobada, y y = 0 si es rechazada. (Anderson, Sweeney, & Williams, 2008).

Con la regresión logística, dado un conjunto particular de valores de las variables independientes elegidas, se estima la probabilidad de que el banco apruebe la solicitud de tarjeta de crédito. Un modelo de regresión logística difiere del modelo de regresión lineal de dos maneras: En primer lugar, la regresión logística solo acepta entradas dicotómicas (binarias) como una variable dependiente (es decir, un vector de 0 y 1). En segundo lugar, el resultado se mide mediante la siguiente función de enlace probabilístico llamada sigmoide debido a su forma de S

Descripción

Se utiliza un conjunto de datos de personas para ilustrar la regresión logística. El conjunto de datos denominado “adultos.csv” se utiliza tanto para la tarea de clasificacióny predicción.

Objetivo

Predecir si el ingreso anual en dólares de un individuo excederá los $ 50,000.

Cargar librerias

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.3
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## Warning: package 'tidyr' was built under R version 4.0.3
## Warning: package 'forcats' was built under R version 4.0.3
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(ggplot2)
library(fdth)
## 
## Attaching package: 'fdth'
## The following objects are masked from 'package:stats':
## 
##     sd, var
library(knitr)
library(caret)
## Warning: package 'caret' was built under R version 4.0.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(reshape)
## Warning: package 'reshape' was built under R version 4.0.3
## 
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
## 
##     rename
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor

Cargar datos

getwd()
## [1] "C:/Users/pc/Documents/RStudio"
datos <- read.csv(file = "adultos.csv", encoding = "UTF-8")

Explorando datos

kable(head(datos, 10))
x age workclass education educational.num marital.status race gender hours.per.week income
1 25 Private 11th 7 Never-married Black Male 40 <=50K
2 38 Private HS-grad 9 Married-civ-spouse White Male 50 <=50K
3 28 Local-gov Assoc-acdm 12 Married-civ-spouse White Male 40 >50K
4 44 Private Some-college 10 Married-civ-spouse Black Male 40 >50K
5 18 ? Some-college 10 Never-married White Female 30 <=50K
6 34 Private 10th 6 Never-married White Male 30 <=50K
7 29 ? HS-grad 9 Never-married Black Male 40 <=50K
8 63 Self-emp-not-inc Prof-school 15 Married-civ-spouse White Male 32 >50K
9 24 Private Some-college 10 Never-married White Female 40 <=50K
10 55 Private 7th-8th 4 Married-civ-spouse White Male 10 <=50K
kable(tail(datos,10))
x age workclass education educational.num marital.status race gender hours.per.week income
48833 48833 32 Private 10th 6 Married-civ-spouse Amer-Indian-Eskimo Male 40 <=50K
48834 48834 43 Private Assoc-voc 11 Married-civ-spouse White Male 45 <=50K
48835 48835 32 Private Masters 14 Never-married Asian-Pac-Islander Male 11 <=50K
48836 48836 53 Private Masters 14 Married-civ-spouse White Male 40 >50K
48837 48837 22 Private Some-college 10 Never-married White Male 40 <=50K
48838 48838 27 Private Assoc-acdm 12 Married-civ-spouse White Female 38 <=50K
48839 48839 40 Private HS-grad 9 Married-civ-spouse White Male 40 >50K
48840 48840 58 Private HS-grad 9 Widowed White Female 40 <=50K
48841 48841 22 Private HS-grad 9 Never-married White Male 20 <=50K
48842 48842 52 Self-emp-inc HS-grad 9 Married-civ-spouse White Female 40 >50K

La estructrua de los datos

Resumen de los datos:
  x Variable de consecutivo de los datos
  age la edad de la persona
  workclass es un tipo o clase de trabajo de la persona, privado, gobierno, por su cuenta,
  education indica el nivel educativo de la persona
  educational es el valor numérico de education
  marital es su estado civil
  race es el tipo de raza de persona
  gender es el género de la persona
  hours.per.week son las horas que trabaja por semana
  income son los ingresos
str(datos)
## 'data.frame':    48842 obs. of  10 variables:
##  $ x              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age            : int  25 38 28 44 18 34 29 63 24 55 ...
##  $ workclass      : chr  "Private" "Private" "Local-gov" "Private" ...
##  $ education      : chr  "11th" "HS-grad" "Assoc-acdm" "Some-college" ...
##  $ educational.num: int  7 9 12 10 10 6 9 15 10 4 ...
##  $ marital.status : chr  "Never-married" "Married-civ-spouse" "Married-civ-spouse" "Married-civ-spouse" ...
##  $ race           : chr  "Black" "White" "White" "Black" ...
##  $ gender         : chr  "Male" "Male" "Male" "Male" ...
##  $ hours.per.week : int  40 50 40 40 30 30 40 32 40 10 ...
##  $ income         : chr  "<=50K" "<=50K" ">50K" ">50K" ...
kable(summary(datos[-1]))
age workclass education educational.num marital.status race gender hours.per.week income
Min. :17.00 Length:48842 Length:48842 Min. : 1.00 Length:48842 Length:48842 Length:48842 Min. : 1.00 Length:48842
1st Qu.:28.00 Class :character Class :character 1st Qu.: 9.00 Class :character Class :character Class :character 1st Qu.:40.00 Class :character
Median :37.00 Mode :character Mode :character Median :10.00 Mode :character Mode :character Mode :character Median :40.00 Mode :character
Mean :38.64 NA NA Mean :10.08 NA NA NA Mean :40.42 NA
3rd Qu.:48.00 NA NA 3rd Qu.:12.00 NA NA NA 3rd Qu.:45.00 NA
Max. :90.00 NA NA Max. :16.00 NA NA NA Max. :99.00 NA

Proceso para analizar los datos

paso 1: Identificar variables numéricas

numericas <-select_if(datos, is.numeric)
kable(summary(numericas[-1])) 
age educational.num hours.per.week
Min. :17.00 Min. : 1.00 Min. : 1.00
1st Qu.:28.00 1st Qu.: 9.00 1st Qu.:40.00
Median :37.00 Median :10.00 Median :40.00
Mean :38.64 Mean :10.08 Mean :40.42
3rd Qu.:48.00 3rd Qu.:12.00 3rd Qu.:45.00
Max. :90.00 Max. :16.00 Max. :99.00

Análisis de la variable hours.per.week

ggplot(numericas, aes(x = hours.per.week)) +
    geom_density(alpha = .2, fill = "#FF6666")

distribucion <- fdt(numericas$hours.per.week,breaks="Sturges") 


kable(distribucion)
Class limits f rf rf(%) cf cf(%)
[0.99,6.8135) 410 0.0083944 0.8394415 410 0.8394415
[6.8135,12.637) 982 0.0201056 2.0105647 1392 2.8500061
[12.637,18.461) 1180 0.0241595 2.4159535 2572 5.2659596
[18.461,24.284) 2383 0.0487900 4.8789976 4955 10.1449572
[24.284,30.108) 2896 0.0592932 5.9293231 7851 16.0742803
[30.108,35.931) 2481 0.0507964 5.0796446 10332 21.1539249
[35.931,41.755) 24217 0.4958233 49.5823267 34549 70.7362516
[41.755,47.578) 3803 0.0778633 7.7863314 38352 78.5225830
[47.578,53.402) 5319 0.1089022 10.8902174 43671 89.4128005
[53.402,59.225) 1318 0.0269850 2.6984972 44989 92.1112977
[59.225,65.049) 2596 0.0531510 5.3150977 47585 97.4263953
[65.049,70.872) 483 0.0098890 0.9889030 48068 98.4152983
[70.872,76.696) 223 0.0045657 0.4565743 48291 98.8718726
[76.696,82.519) 237 0.0048524 0.4852381 48528 99.3571107
[82.519,88.343) 98 0.0020065 0.2006470 48626 99.5577577
[88.343,94.166) 52 0.0010647 0.1064657 48678 99.6642234
[94.166,99.99) 164 0.0033578 0.3357766 48842 100.0000000
x
start 0.990000
end 99.990000
h 5.823529
right 0.000000
barplot(height = distribucion$table$f, names.arg = distribucion$table$`Class limits`)

Análisis de la variable age

ggplot(numericas, aes(x = age)) +
    geom_density(alpha = .2, fill = "#FF6666")

distribucion <- fdt(numericas$age, breaks="Sturges") 


kable(distribucion)
Class limits f rf rf(%) cf cf(%)
[16.83,21.19) 4719 0.0966177 9.6617665 4719 9.661767
[21.19,25.54) 4908 0.1004873 10.0487286 9627 19.710495
[25.54,29.9) 4888 0.1000778 10.0077802 14515 29.718275
[29.9,34.26) 6494 0.1329593 13.2959338 21009 43.014209
[34.26,38.62) 5229 0.1070595 10.7059498 26238 53.720159
[38.62,42.97) 4793 0.0981328 9.8132755 31031 63.533434
[42.97,47.33) 5445 0.1114819 11.1481921 36476 74.681626
[47.33,51.69) 3435 0.0703288 7.0328815 39911 81.714508
[51.69,56.04) 3247 0.0664797 6.6479669 43158 88.362475
[56.04,60.4) 2078 0.0425454 4.2545350 45236 92.617010
[60.4,64.76) 1519 0.0311003 3.1100283 46755 95.727038
[64.76,69.11) 1086 0.0222350 2.2234962 47841 97.950534
[69.11,73.47) 479 0.0098071 0.9807133 48320 98.931248
[73.47,77.83) 272 0.0055690 0.5568978 48592 99.488145
[77.83,82.19) 154 0.0031530 0.3153024 48746 99.803448
[82.19,86.54) 30 0.0006142 0.0614225 48776 99.864870
[86.54,90.9) 66 0.0013513 0.1351296 48842 100.000000
x
start 16.830000
end 90.900000
h 4.357059
right 0.000000
barplot(height = distribucion$table$f, names.arg = distribucion$table$`Class limits`)

Estandarizar los valores numéricos

escalados <- datos[-1] %>%
    mutate(age.scale = rescale(age),educational.num.scale = rescale(educational.num), hours.per.week.scale = rescale(hours.per.week)  )

head(escalados, 10)
##    age        workclass    education educational.num     marital.status  race
## 1   25          Private         11th               7      Never-married Black
## 2   38          Private      HS-grad               9 Married-civ-spouse White
## 3   28        Local-gov   Assoc-acdm              12 Married-civ-spouse White
## 4   44          Private Some-college              10 Married-civ-spouse Black
## 5   18                ? Some-college              10      Never-married White
## 6   34          Private         10th               6      Never-married White
## 7   29                ?      HS-grad               9      Never-married Black
## 8   63 Self-emp-not-inc  Prof-school              15 Married-civ-spouse White
## 9   24          Private Some-college              10      Never-married White
## 10  55          Private      7th-8th               4 Married-civ-spouse White
##    gender hours.per.week income  age.scale educational.num.scale
## 1    Male             40  <=50K 0.10958904             0.4000000
## 2    Male             50  <=50K 0.28767123             0.5333333
## 3    Male             40   >50K 0.15068493             0.7333333
## 4    Male             40   >50K 0.36986301             0.6000000
## 5  Female             30  <=50K 0.01369863             0.6000000
## 6    Male             30  <=50K 0.23287671             0.3333333
## 7    Male             40  <=50K 0.16438356             0.5333333
## 8    Male             32   >50K 0.63013699             0.9333333
## 9  Female             40  <=50K 0.09589041             0.6000000
## 10   Male             10  <=50K 0.52054795             0.2000000
##    hours.per.week.scale
## 1            0.39795918
## 2            0.50000000
## 3            0.39795918
## 4            0.39795918
## 5            0.29591837
## 6            0.29591837
## 7            0.39795918
## 8            0.31632653
## 9            0.39795918
## 10           0.09183673
tail(escalados, 10)
##       age    workclass    education educational.num     marital.status
## 48833  32      Private         10th               6 Married-civ-spouse
## 48834  43      Private    Assoc-voc              11 Married-civ-spouse
## 48835  32      Private      Masters              14      Never-married
## 48836  53      Private      Masters              14 Married-civ-spouse
## 48837  22      Private Some-college              10      Never-married
## 48838  27      Private   Assoc-acdm              12 Married-civ-spouse
## 48839  40      Private      HS-grad               9 Married-civ-spouse
## 48840  58      Private      HS-grad               9            Widowed
## 48841  22      Private      HS-grad               9      Never-married
## 48842  52 Self-emp-inc      HS-grad               9 Married-civ-spouse
##                     race gender hours.per.week income  age.scale
## 48833 Amer-Indian-Eskimo   Male             40  <=50K 0.20547945
## 48834              White   Male             45  <=50K 0.35616438
## 48835 Asian-Pac-Islander   Male             11  <=50K 0.20547945
## 48836              White   Male             40   >50K 0.49315068
## 48837              White   Male             40  <=50K 0.06849315
## 48838              White Female             38  <=50K 0.13698630
## 48839              White   Male             40   >50K 0.31506849
## 48840              White Female             40  <=50K 0.56164384
## 48841              White   Male             20  <=50K 0.06849315
## 48842              White Female             40   >50K 0.47945205
##       educational.num.scale hours.per.week.scale
## 48833             0.3333333            0.3979592
## 48834             0.6666667            0.4489796
## 48835             0.8666667            0.1020408
## 48836             0.8666667            0.3979592
## 48837             0.6000000            0.3979592
## 48838             0.7333333            0.3775510
## 48839             0.5333333            0.3979592
## 48840             0.5333333            0.3979592
## 48841             0.5333333            0.1938776
## 48842             0.5333333            0.3979592

Identificar variables factor