1. Introducción

Ejemplo del uso del modelo de regresión logística para predecir los ingresos de los diferentes habitantes de América del Norte y América del Sur. En este proyecto vemos desde el tratamiento del dataset para la limpieza de datos como la visualización de diferentes gráficas para intentar detectar diferentes comportamientos. Al final, usaremos el modelo de regresión logística para predecir los ingresos y valoraremos tanto su exactitud como precisión.

2. Lectura y estructura del dataset.

library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caTools)
library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
setwd("~/R/R-Course-HTML-Notes/R-for-Data-Science-and-Machine-Learning/Training Exercises/Machine Learning Projects/CSV files for ML Projects")
df<- read.csv("adult_sal.csv") 
head(df)
##   X age    type_employer fnlwgt education education_num            marital
## 1 1  39        State-gov  77516 Bachelors            13      Never-married
## 2 2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
## 3 3  38          Private 215646   HS-grad             9           Divorced
## 4 4  53          Private 234721      11th             7 Married-civ-spouse
## 5 5  28          Private 338409 Bachelors            13 Married-civ-spouse
## 6 6  37          Private 284582   Masters            14 Married-civ-spouse
##          occupation  relationship  race    sex capital_gain capital_loss
## 1      Adm-clerical Not-in-family White   Male         2174            0
## 2   Exec-managerial       Husband White   Male            0            0
## 3 Handlers-cleaners Not-in-family White   Male            0            0
## 4 Handlers-cleaners       Husband Black   Male            0            0
## 5    Prof-specialty          Wife Black Female            0            0
## 6   Exec-managerial          Wife White Female            0            0
##   hr_per_week       country income
## 1          40 United-States  <=50K
## 2          13 United-States  <=50K
## 3          40 United-States  <=50K
## 4          40 United-States  <=50K
## 5          40          Cuba  <=50K
## 6          40 United-States  <=50K

Eliminamos la columna X pues está repetida.

df <- df %>% select(-X)
head(df)
##   age    type_employer fnlwgt education education_num            marital
## 1  39        State-gov  77516 Bachelors            13      Never-married
## 2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
## 3  38          Private 215646   HS-grad             9           Divorced
## 4  53          Private 234721      11th             7 Married-civ-spouse
## 5  28          Private 338409 Bachelors            13 Married-civ-spouse
## 6  37          Private 284582   Masters            14 Married-civ-spouse
##          occupation  relationship  race    sex capital_gain capital_loss
## 1      Adm-clerical Not-in-family White   Male         2174            0
## 2   Exec-managerial       Husband White   Male            0            0
## 3 Handlers-cleaners Not-in-family White   Male            0            0
## 4 Handlers-cleaners       Husband Black   Male            0            0
## 5    Prof-specialty          Wife Black Female            0            0
## 6   Exec-managerial          Wife White Female            0            0
##   hr_per_week       country income
## 1          40 United-States  <=50K
## 2          13 United-States  <=50K
## 3          40 United-States  <=50K
## 4          40 United-States  <=50K
## 5          40          Cuba  <=50K
## 6          40 United-States  <=50K

Veamos la estructura del dataset.

str(df)
## 'data.frame':    32561 obs. of  15 variables:
##  $ age          : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ type_employer: Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
##  $ fnlwgt       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education    : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ education_num: int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital      : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ occupation   : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
##  $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital_gain : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital_loss : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_per_week  : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ country      : Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
##  $ income       : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
summary(df)
##       age                 type_employer       fnlwgt       
##  Min.   :17.00   Private         :22696   Min.   :  12285  
##  1st Qu.:28.00   Self-emp-not-inc: 2541   1st Qu.: 117827  
##  Median :37.00   Local-gov       : 2093   Median : 178356  
##  Mean   :38.58   ?               : 1836   Mean   : 189778  
##  3rd Qu.:48.00   State-gov       : 1298   3rd Qu.: 237051  
##  Max.   :90.00   Self-emp-inc    : 1116   Max.   :1484705  
##                  (Other)         :  981                    
##         education     education_num                    marital     
##  HS-grad     :10501   Min.   : 1.00   Divorced             : 4443  
##  Some-college: 7291   1st Qu.: 9.00   Married-AF-spouse    :   23  
##  Bachelors   : 5355   Median :10.00   Married-civ-spouse   :14976  
##  Masters     : 1723   Mean   :10.08   Married-spouse-absent:  418  
##  Assoc-voc   : 1382   3rd Qu.:12.00   Never-married        :10683  
##  11th        : 1175   Max.   :16.00   Separated            : 1025  
##  (Other)     : 5134                   Widowed              :  993  
##            occupation           relationship                   race      
##  Prof-specialty :4140   Husband       :13193   Amer-Indian-Eskimo:  311  
##  Craft-repair   :4099   Not-in-family : 8305   Asian-Pac-Islander: 1039  
##  Exec-managerial:4066   Other-relative:  981   Black             : 3124  
##  Adm-clerical   :3770   Own-child     : 5068   Other             :  271  
##  Sales          :3650   Unmarried     : 3446   White             :27816  
##  Other-service  :3295   Wife          : 1568                             
##  (Other)        :9541                                                    
##      sex         capital_gain    capital_loss     hr_per_week   
##  Female:10771   Min.   :    0   Min.   :   0.0   Min.   : 1.00  
##  Male  :21790   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00  
##                 Median :    0   Median :   0.0   Median :40.00  
##                 Mean   : 1078   Mean   :  87.3   Mean   :40.44  
##                 3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00  
##                 Max.   :99999   Max.   :4356.0   Max.   :99.00  
##                                                                 
##           country        income     
##  United-States:29170   <=50K:24720  
##  Mexico       :  643   >50K : 7841  
##  ?            :  583                
##  Philippines  :  198                
##  Germany      :  137                
##  Canada       :  121                
##  (Other)      : 1709

Podemos observar que en este dataset no hay NA’s como tal sino que vienen representados por el caracter ?.

3. Limpieza del dataset

sort(table(df$type_employer))
## 
##     Never-worked      Without-pay      Federal-gov     Self-emp-inc 
##                7               14              960             1116 
##        State-gov                ?        Local-gov Self-emp-not-inc 
##             1298             1836             2093             2541 
##          Private 
##            22696

Juntaremos los grupos Never-worked y Without-pay en un grupo llamado Unemployed.

df$type_employer <-  revalue(df$type_employer,c("Never-worked" = "Unemployed"))
df$type_employer <-  revalue(df$type_employer,c("Without-pay" = "Unemployed"))

Juntamos también los grupos State-gov y Local-gov en uno nuevo llamado SL-gov.

df$type_employer <-  revalue(df$type_employer,c("Local-gov" = "SL-gov"))
df$type_employer <-  revalue(df$type_employer,c("State-gov" = "SL-gov"))

Veamos la tabla de frecuencias de la columna Marital

sort(table(df$marital))
## 
##     Married-AF-spouse Married-spouse-absent               Widowed 
##                    23                   418                   993 
##             Separated              Divorced         Never-married 
##                  1025                  4443                 10683 
##    Married-civ-spouse 
##                 14976

Vamos a reducir estos 7 grupos en 3: Married, Not-Married, Never-Married. En esta ocasión, vamos a crear una función que remplace el estado civil.

civil_status <- function(marital){
  #marital <- as.character(marital)
  if (marital %in% c("Separated","Widowed","Divorced")){
    return('Not-Married')
  }else if(marital=='Never-married'){
    return('Never-married') #Podemos poner return(marital) pero en ese caso nos devuelve la variable como 5, ya que entra en la función como factor. Colocamos return('never-married') o descomentamos la primera linea de la función.
  }else{
    return('Married')
  }
}
df$marital <- sapply(df$marital, civil_status)
table(df$marital)
## 
##       Married Never-married   Not-Married 
##         15417         10683          6461

Comprobamos la columna Country

sort(table(df$country))
## 
##         Holand-Netherlands                   Scotland 
##                          1                         12 
##                   Honduras                    Hungary 
##                         13                         13 
## Outlying-US(Guam-USVI-etc)                 Yugoslavia 
##                         14                         16 
##                       Laos                   Thailand 
##                         18                         18 
##                   Cambodia            Trinadad&Tobago 
##                         19                         19 
##                       Hong                    Ireland 
##                         20                         24 
##                    Ecuador                     France 
##                         28                         29 
##                     Greece                       Peru 
##                         29                         31 
##                  Nicaragua                   Portugal 
##                         34                         37 
##                       Iran                      Haiti 
##                         43                         44 
##                     Taiwan                   Columbia 
##                         51                         59 
##                     Poland                      Japan 
##                         60                         62 
##                  Guatemala                    Vietnam 
##                         64                         67 
##         Dominican-Republic                      Italy 
##                         70                         73 
##                      China                      South 
##                         75                         80 
##                    Jamaica                    England 
##                         81                         90 
##                       Cuba                      India 
##                         95                        100 
##                El-Salvador                Puerto-Rico 
##                        106                        114 
##                     Canada                    Germany 
##                        121                        137 
##                Philippines                          ? 
##                        198                        583 
##                     Mexico              United-States 
##                        643                      29170

Agruparemos los países por sus respectivos continentes, para ello, vamos a crear primero los vectores de los continentes.

Asia <- c('Laos', 'Thailand', 'Cambodia', 'Hong', 'Iran', 'Taiwan', 'Japan', 'Vietnam', 'China', 'India', 'Philippines')
Europe <- c('Holand-Netherlands', 'Scotland', 'Hungary', 'Yugoslavia','Ireland', 'France', 'Greece', 'Portugal','Poland', 'Italy','England','Germany')
North.America <- c('United-States', 'Canada', 'Puerto-Rico')
Latin.and.South.America <- c('Honduras', 'Trinadad&Tobago', 'Ecuador', 'Peru', 'Nicaragua', 'Haiti','Columbia', 'Guatemala', 'Dominican-Republic', 'Jamaica', 'Cuba', 'El-Salvador', 'Mexico','Outlying-US(Guam-USVI-etc)')
Other <- c('South')

Creamos una función para asignar a cada país su continente en el dataset.

continent.assign <- function(country){
  if(country %in% Asia){
    return('Asia')
  }else if(country %in% Europe){
      return('Europe')
  }else if(country %in% North.America){
      return('North.America')
  }
  else if(country %in% Latin.and.South.America){
    return('Latin.and.South.America')
  }
  else{
    return('Other')
  }
}
df$country <- sapply(df$country, continent.assign)
table(df$country)
## 
##                    Asia                  Europe Latin.and.South.America 
##                     671                     521                    1301 
##           North.America                   Other 
##                   29405                     663

Veamos de nuevo la estrucuta del dataset.

str(df)
## 'data.frame':    32561 obs. of  15 variables:
##  $ age          : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ type_employer: Factor w/ 7 levels "?","Federal-gov",..: 3 7 5 5 5 5 5 7 5 5 ...
##  $ fnlwgt       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education    : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ education_num: int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital      : chr  "Never-married" "Married" "Not-Married" "Married" ...
##  $ occupation   : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
##  $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capital_gain : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital_loss : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hr_per_week  : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ country      : chr  "North.America" "North.America" "North.America" "North.America" ...
##  $ income       : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...

Ponemos tipo factor las columnas que hemos modificado previamente.

df$type_employer <- as.factor(df$type_employer)
df$marital <- as.factor(df$marital)
df$country <- as.factor(df$country)
names(df)[names(df)=='country'] <- "region"

4. Búsqueda y limpieza de características NA.

Usaremos el paquete Amelia y la función missmap() para ver si existen NA’s en el dataset. Como hemos comprobado anteriormente, no existen NA’s como tal sino están sustituídas por ?. Lo que haremos ahora es reemplazar todos esos valores por Na.

df[df=='?'] <- NA
missmap(df, main="Missingness Map", col=c("yellow","black"), legend=FALSE)

En esta ocasión vamos a eliminar las observaciones que tienen NA en la columna occupation.

df <- na.omit(df)
missmap(df, main="Missingness Map", col=c("yellow","black"), legend=FALSE)

5. Exploratory Data Analysis

Analicemos el dataset mediante su visualización gráfica.

ggplot(df, aes(age)) + geom_histogram(binwidth=1, col="black",aes(fill=factor(income))) + theme_bw() + scale_x_continuous(breaks=seq(0,80, by=20))

Observamos que hay más personas que ingresan más de 50k entre los 35-45 años de edad.

ggplot(df, aes(sex)) + geom_bar(aes(fill=factor(income)), position="dodge") + theme_bw()

En ambos sexos es más predominante la gente que cobra menos de 50K pero sin embargo, en el sexo femenino es más destacable esa diferencia.

ggplot(df, aes(hr_per_week)) + geom_histogram(bins = 30, aes(fill=factor(income)), col="black") + theme_bw() + scale_x_continuous(breaks=seq(0, 120, by=3))

La moda se encuentra en unas 40 horas semanales y es donde se encuentra el mayor número de gente que ingresa más de 50K.

ggplot(df, aes(region)) + geom_bar(aes(fill=income), col='black') + theme_bw() + theme(axis.text.x=element_text(angle=90, hjust=1))

En Norte América es donde más gente suele cobrar más de 50K.

ggplot(df, aes(education)) + geom_bar(aes(fill=factor(income)), col="black") + theme_bw() + theme(axis.text.x=element_text(angle=90, hjust=1))

ggplot(df, aes(occupation)) + geom_bar(aes(fill=income), col="black") + theme_bw() + theme(axis.text.x=element_text(angle=90))

ggplot(df, aes(type_employer)) + geom_bar(aes(fill=income), col="black", position="dodge") + theme_bw()

ggplot(df, aes(occupation, income)) + geom_jitter(aes(col=factor(region))) + theme_bw() + theme(axis.text.x=element_text(angle=90))

Podemos observar que, como habíamos comprobado anteriormente, la gente que ingresa más de 50K es de la región de Norte América, sin embargo, en algunas profesiones también destaca Asia sobre el resto de regiones.

ggplot(df, aes(education, income)) + geom_jitter(aes(col=factor(region))) + theme_bw() + theme(axis.text.x=element_text(angle=90))

Podemos observar que para ingresar >50K parece ser preferible tener un nivel de estudios superior.

6. Construcción del modelo

Pasamos ahora a construir el modelo de regresión logística para predecir si un grupo de observaciones ingresará más o menos de 50K en función de sus características dadas en las diferentes columnas.

head(df)
##   age    type_employer fnlwgt education education_num       marital
## 1  39           SL-gov  77516 Bachelors            13 Never-married
## 2  50 Self-emp-not-inc  83311 Bachelors            13       Married
## 3  38          Private 215646   HS-grad             9   Not-Married
## 4  53          Private 234721      11th             7       Married
## 5  28          Private 338409 Bachelors            13       Married
## 6  37          Private 284582   Masters            14       Married
##          occupation  relationship  race    sex capital_gain capital_loss
## 1      Adm-clerical Not-in-family White   Male         2174            0
## 2   Exec-managerial       Husband White   Male            0            0
## 3 Handlers-cleaners Not-in-family White   Male            0            0
## 4 Handlers-cleaners       Husband Black   Male            0            0
## 5    Prof-specialty          Wife Black Female            0            0
## 6   Exec-managerial          Wife White Female            0            0
##   hr_per_week                  region income
## 1          40           North.America  <=50K
## 2          13           North.America  <=50K
## 3          40           North.America  <=50K
## 4          40           North.America  <=50K
## 5          40 Latin.and.South.America  <=50K
## 6          40           North.America  <=50K
sample <- sample.split(df, SplitRatio=0.7)
train <- subset(df, sample==TRUE)
test <- subset(df, sample==FALSE)
model <- glm(income~., family=binomial(logit), data=train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
## 
## Call:
## glm(formula = income ~ ., family = binomial(logit), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.1655  -0.5081  -0.1900   0.0000   3.6897  
## 
## Coefficients: (1 not defined because of singularities)
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -5.591e+00  4.456e-01 -12.546  < 2e-16 ***
## age                            2.552e-02  2.042e-03  12.496  < 2e-16 ***
## type_employerSL-gov           -8.477e-01  1.277e-01  -6.638 3.18e-11 ***
## type_employerUnemployed       -1.321e+01  2.626e+02  -0.050  0.95989    
## type_employerPrivate          -6.043e-01  1.131e-01  -5.344 9.08e-08 ***
## type_employerSelf-emp-inc     -4.604e-01  1.479e-01  -3.113  0.00185 ** 
## type_employerSelf-emp-not-inc -1.090e+00  1.326e-01  -8.221  < 2e-16 ***
## fnlwgt                         7.829e-07  2.112e-07   3.707  0.00021 ***
## education11th                  1.538e-01  2.685e-01   0.573  0.56681    
## education12th                  4.954e-01  3.433e-01   1.443  0.14899    
## education1st-4th              -4.533e-01  6.370e-01  -0.712  0.47668    
## education5th-6th               2.498e-02  4.071e-01   0.061  0.95106    
## education7th-8th              -3.826e-01  2.926e-01  -1.307  0.19104    
## education9th                   3.915e-02  3.226e-01   0.121  0.90341    
## educationAssoc-acdm            1.300e+00  2.264e-01   5.744 9.27e-09 ***
## educationAssoc-voc             1.463e+00  2.179e-01   6.715 1.88e-11 ***
## educationBachelors             2.130e+00  2.038e-01  10.451  < 2e-16 ***
## educationDoctorate             3.111e+00  2.739e-01  11.355  < 2e-16 ***
## educationHS-grad               9.504e-01  1.990e-01   4.776 1.79e-06 ***
## educationMasters               2.426e+00  2.162e-01  11.218  < 2e-16 ***
## educationPreschool            -1.820e+01  1.096e+02  -0.166  0.86811    
## educationProf-school           3.102e+00  2.570e-01  12.073  < 2e-16 ***
## educationSome-college          1.305e+00  2.014e-01   6.477 9.38e-11 ***
## education_num                         NA         NA      NA       NA    
## maritalNever-married          -1.421e+00  2.005e-01  -7.086 1.38e-12 ***
## maritalNot-Married            -8.163e-01  1.986e-01  -4.111 3.95e-05 ***
## occupationArmed-Forces        -9.494e-01  1.679e+00  -0.565  0.57182    
## occupationCraft-repair         7.559e-02  9.764e-02   0.774  0.43884    
## occupationExec-managerial      8.186e-01  9.408e-02   8.702  < 2e-16 ***
## occupationFarming-fishing     -9.624e-01  1.706e-01  -5.640 1.70e-08 ***
## occupationHandlers-cleaners   -7.107e-01  1.729e-01  -4.110 3.96e-05 ***
## occupationMachine-op-inspct   -1.761e-01  1.235e-01  -1.426  0.15378    
## occupationOther-service       -8.200e-01  1.441e-01  -5.689 1.28e-08 ***
## occupationPriv-house-serv     -3.971e+00  1.834e+00  -2.165  0.03039 *  
## occupationProf-specialty       4.380e-01  9.920e-02   4.415 1.01e-05 ***
## occupationProtective-serv      5.939e-01  1.523e-01   3.900 9.61e-05 ***
## occupationSales                3.118e-01  1.004e-01   3.106  0.00190 ** 
## occupationTech-support         5.702e-01  1.367e-01   4.172 3.02e-05 ***
## occupationTransport-moving    -1.839e-01  1.217e-01  -1.511  0.13083    
## relationshipNot-in-family     -8.074e-01  1.958e-01  -4.123 3.74e-05 ***
## relationshipOther-relative    -1.179e+00  2.731e-01  -4.316 1.59e-05 ***
## relationshipOwn-child         -1.814e+00  2.467e-01  -7.355 1.92e-13 ***
## relationshipUnmarried         -9.826e-01  2.190e-01  -4.486 7.27e-06 ***
## relationshipWife               1.346e+00  1.276e-01  10.546  < 2e-16 ***
## raceAsian-Pac-Islander         5.621e-01  3.244e-01   1.733  0.08315 .  
## raceBlack                      3.776e-01  2.866e-01   1.317  0.18776    
## raceOther                     -9.116e-03  4.443e-01  -0.021  0.98363    
## raceWhite                      5.420e-01  2.727e-01   1.988  0.04685 *  
## sexMale                        8.289e-01  9.740e-02   8.510  < 2e-16 ***
## capital_gain                   3.278e-04  1.296e-05  25.289  < 2e-16 ***
## capital_loss                   5.978e-04  4.603e-05  12.987  < 2e-16 ***
## hr_per_week                    3.020e-02  2.046e-03  14.764  < 2e-16 ***
## regionEurope                   4.634e-01  2.563e-01   1.808  0.07056 .  
## regionLatin.and.South.America -3.516e-01  2.645e-01  -1.329  0.18375    
## regionNorth.America            3.023e-01  2.072e-01   1.459  0.14444    
## regionOther                   -7.595e-02  2.363e-01  -0.321  0.74795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23023  on 20477  degrees of freedom
## Residual deviance: 13227  on 20423  degrees of freedom
## AIC: 13337
## 
## Number of Fisher Scoring iterations: 13

Como podemos observar aparecen un número considerable de características que tienen importancia en el modelo. Muchas de ellas se podrían haber reducido si se hubiera reducido el número de factores en educación, occupation, etc.

Vamos a usar la función step() que nos permitirá comparar varios modelos de regresión logística mediante el método AIC( Akaike information criterion ). Lo que hace esta función es ir añadiendo características al modelo e ir comparándolos para ver cuál es el número de características que valen mejor para predecir la variable income.

new.step.model <- step(model)
## Start:  AIC=13337.11
## income ~ age + type_employer + fnlwgt + education + education_num + 
##     marital + occupation + relationship + race + sex + capital_gain + 
##     capital_loss + hr_per_week + region
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=13337.11
## income ~ age + type_employer + fnlwgt + education + marital + 
##     occupation + relationship + race + sex + capital_gain + capital_loss + 
##     hr_per_week + region
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                 Df Deviance   AIC
## <none>                13227 13337
## - race           4    13237 13339
## - fnlwgt         1    13241 13349
## - region         4    13250 13352
## - marital        2    13291 13397
## - sex            1    13302 13410
## - type_employer  5    13318 13418
## - age            1    13384 13492
## - capital_loss   1    13400 13508
## - hr_per_week    1    13451 13559
## - relationship   5    13463 13563
## - occupation    13    13616 13700
## - education     15    13966 14046
## - capital_gain   1    14469 14577

Veamos qué variables tomó el modelo

summary(new.step.model)
## 
## Call:
## glm(formula = income ~ age + type_employer + fnlwgt + education + 
##     marital + occupation + relationship + race + sex + capital_gain + 
##     capital_loss + hr_per_week + region, family = binomial(logit), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.1655  -0.5081  -0.1900   0.0000   3.6897  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -5.591e+00  4.456e-01 -12.546  < 2e-16 ***
## age                            2.552e-02  2.042e-03  12.496  < 2e-16 ***
## type_employerSL-gov           -8.477e-01  1.277e-01  -6.638 3.18e-11 ***
## type_employerUnemployed       -1.321e+01  2.626e+02  -0.050  0.95989    
## type_employerPrivate          -6.043e-01  1.131e-01  -5.344 9.08e-08 ***
## type_employerSelf-emp-inc     -4.604e-01  1.479e-01  -3.113  0.00185 ** 
## type_employerSelf-emp-not-inc -1.090e+00  1.326e-01  -8.221  < 2e-16 ***
## fnlwgt                         7.829e-07  2.112e-07   3.707  0.00021 ***
## education11th                  1.538e-01  2.685e-01   0.573  0.56681    
## education12th                  4.954e-01  3.433e-01   1.443  0.14899    
## education1st-4th              -4.533e-01  6.370e-01  -0.712  0.47668    
## education5th-6th               2.498e-02  4.071e-01   0.061  0.95106    
## education7th-8th              -3.826e-01  2.926e-01  -1.307  0.19104    
## education9th                   3.915e-02  3.226e-01   0.121  0.90341    
## educationAssoc-acdm            1.300e+00  2.264e-01   5.744 9.27e-09 ***
## educationAssoc-voc             1.463e+00  2.179e-01   6.715 1.88e-11 ***
## educationBachelors             2.130e+00  2.038e-01  10.451  < 2e-16 ***
## educationDoctorate             3.111e+00  2.739e-01  11.355  < 2e-16 ***
## educationHS-grad               9.504e-01  1.990e-01   4.776 1.79e-06 ***
## educationMasters               2.426e+00  2.162e-01  11.218  < 2e-16 ***
## educationPreschool            -1.820e+01  1.096e+02  -0.166  0.86811    
## educationProf-school           3.102e+00  2.570e-01  12.073  < 2e-16 ***
## educationSome-college          1.305e+00  2.014e-01   6.477 9.38e-11 ***
## maritalNever-married          -1.421e+00  2.005e-01  -7.086 1.38e-12 ***
## maritalNot-Married            -8.163e-01  1.986e-01  -4.111 3.95e-05 ***
## occupationArmed-Forces        -9.494e-01  1.679e+00  -0.565  0.57182    
## occupationCraft-repair         7.559e-02  9.764e-02   0.774  0.43884    
## occupationExec-managerial      8.186e-01  9.408e-02   8.702  < 2e-16 ***
## occupationFarming-fishing     -9.624e-01  1.706e-01  -5.640 1.70e-08 ***
## occupationHandlers-cleaners   -7.107e-01  1.729e-01  -4.110 3.96e-05 ***
## occupationMachine-op-inspct   -1.761e-01  1.235e-01  -1.426  0.15378    
## occupationOther-service       -8.200e-01  1.441e-01  -5.689 1.28e-08 ***
## occupationPriv-house-serv     -3.971e+00  1.834e+00  -2.165  0.03039 *  
## occupationProf-specialty       4.380e-01  9.920e-02   4.415 1.01e-05 ***
## occupationProtective-serv      5.939e-01  1.523e-01   3.900 9.61e-05 ***
## occupationSales                3.118e-01  1.004e-01   3.106  0.00190 ** 
## occupationTech-support         5.702e-01  1.367e-01   4.172 3.02e-05 ***
## occupationTransport-moving    -1.839e-01  1.217e-01  -1.511  0.13083    
## relationshipNot-in-family     -8.074e-01  1.958e-01  -4.123 3.74e-05 ***
## relationshipOther-relative    -1.179e+00  2.731e-01  -4.316 1.59e-05 ***
## relationshipOwn-child         -1.814e+00  2.467e-01  -7.355 1.92e-13 ***
## relationshipUnmarried         -9.826e-01  2.190e-01  -4.486 7.27e-06 ***
## relationshipWife               1.346e+00  1.276e-01  10.546  < 2e-16 ***
## raceAsian-Pac-Islander         5.621e-01  3.244e-01   1.733  0.08315 .  
## raceBlack                      3.776e-01  2.866e-01   1.317  0.18776    
## raceOther                     -9.116e-03  4.443e-01  -0.021  0.98363    
## raceWhite                      5.420e-01  2.727e-01   1.988  0.04685 *  
## sexMale                        8.289e-01  9.740e-02   8.510  < 2e-16 ***
## capital_gain                   3.278e-04  1.296e-05  25.289  < 2e-16 ***
## capital_loss                   5.978e-04  4.603e-05  12.987  < 2e-16 ***
## hr_per_week                    3.020e-02  2.046e-03  14.764  < 2e-16 ***
## regionEurope                   4.634e-01  2.563e-01   1.808  0.07056 .  
## regionLatin.and.South.America -3.516e-01  2.645e-01  -1.329  0.18375    
## regionNorth.America            3.023e-01  2.072e-01   1.459  0.14444    
## regionOther                   -7.595e-02  2.363e-01  -0.321  0.74795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23023  on 20477  degrees of freedom
## Residual deviance: 13227  on 20423  degrees of freedom
## AIC: 13337
## 
## Number of Fisher Scoring iterations: 13

Ha tomado las mismas que usamos nosotros antes debido a que ha usado el criterio AIC. Hay otros criterios que no vamos a usar para este dataset.

Aplicaremos ahora el modelo sobre el dataset de testeo para ver la eficacia del modelo.

test$predicted.income <- predict(model, test, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
table(test$income,test$predicted.income>0.5)
##        
##         FALSE TRUE
##   <=50K  7150  556
##   >50K    997 1537
acc <- (7174+1553)/(7174+1553+542+970)
paste('Exactitud: ', acc)
## [1] "Exactitud:  0.852329329036039"

Calculamos otra medida como el Ratio de Positivos Verdaderos o sensibilidad

1553/(1553+970)
## [1] 0.6155371

Calculamos también la precisión

1553/(1553+542)
## [1] 0.7412888

7. Conclusión.

No podemos decidir si el modelo es bueno o no porque habría que ver cuál es el objetivo principal. En algunas ocasiones, al margen de buscar una exactitud muy alta en el modelo, lo que nos interesa es que tenga una alta precisión o sensibilidad dependiendo del problema en el que nos encontremos.