Análisis de regresión logística para predicción de ingresos de personas
La regresión logística se usa para predecir una clase, es decir, una probabilidad. La regresión logística puede predecir un resultado binario. Como ejemplo podría ser predecir si un préstamo es denegado / aceptado en función de muchos atributos.La regresión logística es de la forma 0/1. y = 0 si se rechaza un préstamo, y = 1 si se acepta. (Guru99, 2020)
En muchas aplicaciones de la regresión la variable dependiente asume sólo dos valores discretos, por ejemplo, en un banco suele necesitarse una ecuación de regresión estimada para predecir si a una persona se le aprobará su solicitud de tarjeta de crédito. A esta variable dependiente pueden dársele los valores y = 1 si la solicitud de tarjeta de crédito es aprobada, y y = 0 si es rechazada. (Anderson, Sweeney, & Williams, 2008).
Con la regresión logística, dado un conjunto particular de valores de las variables independientes elegidas, se estima la probabilidad de que el banco apruebe la solicitud de tarjeta de crédito. Un modelo de regresión logística difiere del modelo de regresión lineal de dos maneras: En primer lugar, la regresión logística solo acepta entradas dicotómicas (binarias) como una variable dependiente (es decir, un vector de 0 y 1). En segundo lugar, el resultado se mide mediante la siguiente función de enlace probabilístico llamada sigmoide debido a su forma de S
Descripción
Se utiliza un conjunto de datos de personas para ilustrar la regresión logística. El conjunto de datos denominado “adultos.csv” se utiliza tanto para la tarea de clasificacióny predicción.
Objetivo
Predecir si el ingreso anual en dólares de un individuo excederá los $ 50,000.
Cargar librerias
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.3
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## Warning: package 'tidyr' was built under R version 4.0.3
## Warning: package 'forcats' was built under R version 4.0.3
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(ggplot2)
library(fdth)
##
## Attaching package: 'fdth'
## The following objects are masked from 'package:stats':
##
## sd, var
library(knitr)
library(caret)
## Warning: package 'caret' was built under R version 4.0.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(reshape)
## Warning: package 'reshape' was built under R version 4.0.3
##
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
##
## rename
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
Cargar datos
getwd()
## [1] "C:/Users/pc/Documents/RStudio"
datos <- read.csv(file = "adultos.csv", encoding = "UTF-8")
Explorando datos
kable(head(datos, 10))
| 1 |
25 |
Private |
11th |
7 |
Never-married |
Black |
Male |
40 |
<=50K |
| 2 |
38 |
Private |
HS-grad |
9 |
Married-civ-spouse |
White |
Male |
50 |
<=50K |
| 3 |
28 |
Local-gov |
Assoc-acdm |
12 |
Married-civ-spouse |
White |
Male |
40 |
>50K |
| 4 |
44 |
Private |
Some-college |
10 |
Married-civ-spouse |
Black |
Male |
40 |
>50K |
| 5 |
18 |
? |
Some-college |
10 |
Never-married |
White |
Female |
30 |
<=50K |
| 6 |
34 |
Private |
10th |
6 |
Never-married |
White |
Male |
30 |
<=50K |
| 7 |
29 |
? |
HS-grad |
9 |
Never-married |
Black |
Male |
40 |
<=50K |
| 8 |
63 |
Self-emp-not-inc |
Prof-school |
15 |
Married-civ-spouse |
White |
Male |
32 |
>50K |
| 9 |
24 |
Private |
Some-college |
10 |
Never-married |
White |
Female |
40 |
<=50K |
| 10 |
55 |
Private |
7th-8th |
4 |
Married-civ-spouse |
White |
Male |
10 |
<=50K |
kable(tail(datos,10))
| 48833 |
48833 |
32 |
Private |
10th |
6 |
Married-civ-spouse |
Amer-Indian-Eskimo |
Male |
40 |
<=50K |
| 48834 |
48834 |
43 |
Private |
Assoc-voc |
11 |
Married-civ-spouse |
White |
Male |
45 |
<=50K |
| 48835 |
48835 |
32 |
Private |
Masters |
14 |
Never-married |
Asian-Pac-Islander |
Male |
11 |
<=50K |
| 48836 |
48836 |
53 |
Private |
Masters |
14 |
Married-civ-spouse |
White |
Male |
40 |
>50K |
| 48837 |
48837 |
22 |
Private |
Some-college |
10 |
Never-married |
White |
Male |
40 |
<=50K |
| 48838 |
48838 |
27 |
Private |
Assoc-acdm |
12 |
Married-civ-spouse |
White |
Female |
38 |
<=50K |
| 48839 |
48839 |
40 |
Private |
HS-grad |
9 |
Married-civ-spouse |
White |
Male |
40 |
>50K |
| 48840 |
48840 |
58 |
Private |
HS-grad |
9 |
Widowed |
White |
Female |
40 |
<=50K |
| 48841 |
48841 |
22 |
Private |
HS-grad |
9 |
Never-married |
White |
Male |
20 |
<=50K |
| 48842 |
48842 |
52 |
Self-emp-inc |
HS-grad |
9 |
Married-civ-spouse |
White |
Female |
40 |
>50K |
La estructrua de los datos
Resumen de los datos:
x Variable de consecutivo de los datos
age la edad de la persona
workclass es un tipo o clase de trabajo de la persona, privado, gobierno, por su cuenta,
education indica el nivel educativo de la persona
educational es el valor numérico de education
marital es su estado civil
race es el tipo de raza de persona
gender es el género de la persona
hours.per.week son las horas que trabaja por semana
income son los ingresos
str(datos)
## 'data.frame': 48842 obs. of 10 variables:
## $ x : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 25 38 28 44 18 34 29 63 24 55 ...
## $ workclass : chr "Private" "Private" "Local-gov" "Private" ...
## $ education : chr "11th" "HS-grad" "Assoc-acdm" "Some-college" ...
## $ educational.num: int 7 9 12 10 10 6 9 15 10 4 ...
## $ marital.status : chr "Never-married" "Married-civ-spouse" "Married-civ-spouse" "Married-civ-spouse" ...
## $ race : chr "Black" "White" "White" "Black" ...
## $ gender : chr "Male" "Male" "Male" "Male" ...
## $ hours.per.week : int 40 50 40 40 30 30 40 32 40 10 ...
## $ income : chr "<=50K" "<=50K" ">50K" ">50K" ...
kable(summary(datos[-1]))
|
Min. :17.00 |
Length:48842 |
Length:48842 |
Min. : 1.00 |
Length:48842 |
Length:48842 |
Length:48842 |
Min. : 1.00 |
Length:48842 |
|
1st Qu.:28.00 |
Class :character |
Class :character |
1st Qu.: 9.00 |
Class :character |
Class :character |
Class :character |
1st Qu.:40.00 |
Class :character |
|
Median :37.00 |
Mode :character |
Mode :character |
Median :10.00 |
Mode :character |
Mode :character |
Mode :character |
Median :40.00 |
Mode :character |
|
Mean :38.64 |
NA |
NA |
Mean :10.08 |
NA |
NA |
NA |
Mean :40.42 |
NA |
|
3rd Qu.:48.00 |
NA |
NA |
3rd Qu.:12.00 |
NA |
NA |
NA |
3rd Qu.:45.00 |
NA |
|
Max. :90.00 |
NA |
NA |
Max. :16.00 |
NA |
NA |
NA |
Max. :99.00 |
NA |
Proceso para analizar los datos
paso 1: Identificar variables numéricas
numericas <-select_if(datos, is.numeric)
kable(summary(numericas[-1]))
|
Min. :17.00 |
Min. : 1.00 |
Min. : 1.00 |
|
1st Qu.:28.00 |
1st Qu.: 9.00 |
1st Qu.:40.00 |
|
Median :37.00 |
Median :10.00 |
Median :40.00 |
|
Mean :38.64 |
Mean :10.08 |
Mean :40.42 |
|
3rd Qu.:48.00 |
3rd Qu.:12.00 |
3rd Qu.:45.00 |
|
Max. :90.00 |
Max. :16.00 |
Max. :99.00 |
Análisis de la variable hours.per.week
ggplot(numericas, aes(x = hours.per.week)) +
geom_density(alpha = .2, fill = "#FF6666")

distribucion <- fdt(numericas$hours.per.week,breaks="Sturges")
kable(distribucion)
| [0.99,6.8135) |
410 |
0.0083944 |
0.8394415 |
410 |
0.8394415 |
| [6.8135,12.637) |
982 |
0.0201056 |
2.0105647 |
1392 |
2.8500061 |
| [12.637,18.461) |
1180 |
0.0241595 |
2.4159535 |
2572 |
5.2659596 |
| [18.461,24.284) |
2383 |
0.0487900 |
4.8789976 |
4955 |
10.1449572 |
| [24.284,30.108) |
2896 |
0.0592932 |
5.9293231 |
7851 |
16.0742803 |
| [30.108,35.931) |
2481 |
0.0507964 |
5.0796446 |
10332 |
21.1539249 |
| [35.931,41.755) |
24217 |
0.4958233 |
49.5823267 |
34549 |
70.7362516 |
| [41.755,47.578) |
3803 |
0.0778633 |
7.7863314 |
38352 |
78.5225830 |
| [47.578,53.402) |
5319 |
0.1089022 |
10.8902174 |
43671 |
89.4128005 |
| [53.402,59.225) |
1318 |
0.0269850 |
2.6984972 |
44989 |
92.1112977 |
| [59.225,65.049) |
2596 |
0.0531510 |
5.3150977 |
47585 |
97.4263953 |
| [65.049,70.872) |
483 |
0.0098890 |
0.9889030 |
48068 |
98.4152983 |
| [70.872,76.696) |
223 |
0.0045657 |
0.4565743 |
48291 |
98.8718726 |
| [76.696,82.519) |
237 |
0.0048524 |
0.4852381 |
48528 |
99.3571107 |
| [82.519,88.343) |
98 |
0.0020065 |
0.2006470 |
48626 |
99.5577577 |
| [88.343,94.166) |
52 |
0.0010647 |
0.1064657 |
48678 |
99.6642234 |
| [94.166,99.99) |
164 |
0.0033578 |
0.3357766 |
48842 |
100.0000000 |
|
| start |
0.990000 |
| end |
99.990000 |
| h |
5.823529 |
| right |
0.000000 |
|
barplot(height = distribucion$table$f, names.arg = distribucion$table$`Class limits`)

Análisis de la variable age
ggplot(numericas, aes(x = age)) +
geom_density(alpha = .2, fill = "#FF6666")

distribucion <- fdt(numericas$age, breaks="Sturges")
kable(distribucion)
| [16.83,21.19) |
4719 |
0.0966177 |
9.6617665 |
4719 |
9.661767 |
| [21.19,25.54) |
4908 |
0.1004873 |
10.0487286 |
9627 |
19.710495 |
| [25.54,29.9) |
4888 |
0.1000778 |
10.0077802 |
14515 |
29.718275 |
| [29.9,34.26) |
6494 |
0.1329593 |
13.2959338 |
21009 |
43.014209 |
| [34.26,38.62) |
5229 |
0.1070595 |
10.7059498 |
26238 |
53.720159 |
| [38.62,42.97) |
4793 |
0.0981328 |
9.8132755 |
31031 |
63.533434 |
| [42.97,47.33) |
5445 |
0.1114819 |
11.1481921 |
36476 |
74.681626 |
| [47.33,51.69) |
3435 |
0.0703288 |
7.0328815 |
39911 |
81.714508 |
| [51.69,56.04) |
3247 |
0.0664797 |
6.6479669 |
43158 |
88.362475 |
| [56.04,60.4) |
2078 |
0.0425454 |
4.2545350 |
45236 |
92.617010 |
| [60.4,64.76) |
1519 |
0.0311003 |
3.1100283 |
46755 |
95.727038 |
| [64.76,69.11) |
1086 |
0.0222350 |
2.2234962 |
47841 |
97.950534 |
| [69.11,73.47) |
479 |
0.0098071 |
0.9807133 |
48320 |
98.931248 |
| [73.47,77.83) |
272 |
0.0055690 |
0.5568978 |
48592 |
99.488145 |
| [77.83,82.19) |
154 |
0.0031530 |
0.3153024 |
48746 |
99.803448 |
| [82.19,86.54) |
30 |
0.0006142 |
0.0614225 |
48776 |
99.864870 |
| [86.54,90.9) |
66 |
0.0013513 |
0.1351296 |
48842 |
100.000000 |
|
| start |
16.830000 |
| end |
90.900000 |
| h |
4.357059 |
| right |
0.000000 |
|
barplot(height = distribucion$table$f, names.arg = distribucion$table$`Class limits`)

Estandarizar los valores numéricos
escalados <- datos[-1] %>%
mutate(age.scale = rescale(age),educational.num.scale = rescale(educational.num), hours.per.week.scale = rescale(hours.per.week) )
head(escalados, 10)
## age workclass education educational.num marital.status race
## 1 25 Private 11th 7 Never-married Black
## 2 38 Private HS-grad 9 Married-civ-spouse White
## 3 28 Local-gov Assoc-acdm 12 Married-civ-spouse White
## 4 44 Private Some-college 10 Married-civ-spouse Black
## 5 18 ? Some-college 10 Never-married White
## 6 34 Private 10th 6 Never-married White
## 7 29 ? HS-grad 9 Never-married Black
## 8 63 Self-emp-not-inc Prof-school 15 Married-civ-spouse White
## 9 24 Private Some-college 10 Never-married White
## 10 55 Private 7th-8th 4 Married-civ-spouse White
## gender hours.per.week income age.scale educational.num.scale
## 1 Male 40 <=50K 0.10958904 0.4000000
## 2 Male 50 <=50K 0.28767123 0.5333333
## 3 Male 40 >50K 0.15068493 0.7333333
## 4 Male 40 >50K 0.36986301 0.6000000
## 5 Female 30 <=50K 0.01369863 0.6000000
## 6 Male 30 <=50K 0.23287671 0.3333333
## 7 Male 40 <=50K 0.16438356 0.5333333
## 8 Male 32 >50K 0.63013699 0.9333333
## 9 Female 40 <=50K 0.09589041 0.6000000
## 10 Male 10 <=50K 0.52054795 0.2000000
## hours.per.week.scale
## 1 0.39795918
## 2 0.50000000
## 3 0.39795918
## 4 0.39795918
## 5 0.29591837
## 6 0.29591837
## 7 0.39795918
## 8 0.31632653
## 9 0.39795918
## 10 0.09183673
tail(escalados, 10)
## age workclass education educational.num marital.status
## 48833 32 Private 10th 6 Married-civ-spouse
## 48834 43 Private Assoc-voc 11 Married-civ-spouse
## 48835 32 Private Masters 14 Never-married
## 48836 53 Private Masters 14 Married-civ-spouse
## 48837 22 Private Some-college 10 Never-married
## 48838 27 Private Assoc-acdm 12 Married-civ-spouse
## 48839 40 Private HS-grad 9 Married-civ-spouse
## 48840 58 Private HS-grad 9 Widowed
## 48841 22 Private HS-grad 9 Never-married
## 48842 52 Self-emp-inc HS-grad 9 Married-civ-spouse
## race gender hours.per.week income age.scale
## 48833 Amer-Indian-Eskimo Male 40 <=50K 0.20547945
## 48834 White Male 45 <=50K 0.35616438
## 48835 Asian-Pac-Islander Male 11 <=50K 0.20547945
## 48836 White Male 40 >50K 0.49315068
## 48837 White Male 40 <=50K 0.06849315
## 48838 White Female 38 <=50K 0.13698630
## 48839 White Male 40 >50K 0.31506849
## 48840 White Female 40 <=50K 0.56164384
## 48841 White Male 20 <=50K 0.06849315
## 48842 White Female 40 >50K 0.47945205
## educational.num.scale hours.per.week.scale
## 48833 0.3333333 0.3979592
## 48834 0.6666667 0.4489796
## 48835 0.8666667 0.1020408
## 48836 0.8666667 0.3979592
## 48837 0.6000000 0.3979592
## 48838 0.7333333 0.3775510
## 48839 0.5333333 0.3979592
## 48840 0.5333333 0.3979592
## 48841 0.5333333 0.1938776
## 48842 0.5333333 0.3979592
Identificar variables factor