Ejemplo del uso del modelo de regresión logística para predecir los ingresos de los diferentes habitantes de América del Norte y América del Sur. En este proyecto vemos desde el tratamiento del dataset para la limpieza de datos como la visualización de diferentes gráficas para intentar detectar diferentes comportamientos. Al final, usaremos el modelo de regresión logística para predecir los ingresos y valoraremos tanto su exactitud como precisión.
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caTools)
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
setwd("~/R/R-Course-HTML-Notes/R-for-Data-Science-and-Machine-Learning/Training Exercises/Machine Learning Projects/CSV files for ML Projects")
df<- read.csv("adult_sal.csv")
head(df)
## X age type_employer fnlwgt education education_num marital
## 1 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 3 38 Private 215646 HS-grad 9 Divorced
## 4 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 6 37 Private 284582 Masters 14 Married-civ-spouse
## occupation relationship race sex capital_gain capital_loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hr_per_week country income
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 Cuba <=50K
## 6 40 United-States <=50K
Eliminamos la columna X pues está repetida.
df <- df %>% select(-X)
head(df)
## age type_employer fnlwgt education education_num marital
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## occupation relationship race sex capital_gain capital_loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hr_per_week country income
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 Cuba <=50K
## 6 40 United-States <=50K
Veamos la estructura del dataset.
str(df)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ type_employer: Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education_num: int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital : Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital_gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital_loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_per_week : int 40 13 40 40 40 40 16 45 50 40 ...
## $ country : Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
summary(df)
## age type_employer fnlwgt
## Min. :17.00 Private :22696 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827
## Median :37.00 Local-gov : 2093 Median : 178356
## Mean :38.58 ? : 1836 Mean : 189778
## 3rd Qu.:48.00 State-gov : 1298 3rd Qu.: 237051
## Max. :90.00 Self-emp-inc : 1116 Max. :1484705
## (Other) : 981
## education education_num marital
## HS-grad :10501 Min. : 1.00 Divorced : 4443
## Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23
## Bachelors : 5355 Median :10.00 Married-civ-spouse :14976
## Masters : 1723 Mean :10.08 Married-spouse-absent: 418
## Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10683
## 11th : 1175 Max. :16.00 Separated : 1025
## (Other) : 5134 Widowed : 993
## occupation relationship race
## Prof-specialty :4140 Husband :13193 Amer-Indian-Eskimo: 311
## Craft-repair :4099 Not-in-family : 8305 Asian-Pac-Islander: 1039
## Exec-managerial:4066 Other-relative: 981 Black : 3124
## Adm-clerical :3770 Own-child : 5068 Other : 271
## Sales :3650 Unmarried : 3446 White :27816
## Other-service :3295 Wife : 1568
## (Other) :9541
## sex capital_gain capital_loss hr_per_week
## Female:10771 Min. : 0 Min. : 0.0 Min. : 1.00
## Male :21790 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00
## Median : 0 Median : 0.0 Median :40.00
## Mean : 1078 Mean : 87.3 Mean :40.44
## 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00
## Max. :99999 Max. :4356.0 Max. :99.00
##
## country income
## United-States:29170 <=50K:24720
## Mexico : 643 >50K : 7841
## ? : 583
## Philippines : 198
## Germany : 137
## Canada : 121
## (Other) : 1709
Podemos observar que en este dataset no hay NA’s como tal sino que vienen representados por el caracter ?.
sort(table(df$type_employer))
##
## Never-worked Without-pay Federal-gov Self-emp-inc
## 7 14 960 1116
## State-gov ? Local-gov Self-emp-not-inc
## 1298 1836 2093 2541
## Private
## 22696
Juntaremos los grupos Never-worked y Without-pay en un grupo llamado Unemployed.
df$type_employer <- revalue(df$type_employer,c("Never-worked" = "Unemployed"))
df$type_employer <- revalue(df$type_employer,c("Without-pay" = "Unemployed"))
Juntamos también los grupos State-gov y Local-gov en uno nuevo llamado SL-gov.
df$type_employer <- revalue(df$type_employer,c("Local-gov" = "SL-gov"))
df$type_employer <- revalue(df$type_employer,c("State-gov" = "SL-gov"))
Veamos la tabla de frecuencias de la columna Marital
sort(table(df$marital))
##
## Married-AF-spouse Married-spouse-absent Widowed
## 23 418 993
## Separated Divorced Never-married
## 1025 4443 10683
## Married-civ-spouse
## 14976
Vamos a reducir estos 7 grupos en 3: Married, Not-Married, Never-Married. En esta ocasión, vamos a crear una función que remplace el estado civil.
civil_status <- function(marital){
#marital <- as.character(marital)
if (marital %in% c("Separated","Widowed","Divorced")){
return('Not-Married')
}else if(marital=='Never-married'){
return('Never-married') #Podemos poner return(marital) pero en ese caso nos devuelve la variable como 5, ya que entra en la función como factor. Colocamos return('never-married') o descomentamos la primera linea de la función.
}else{
return('Married')
}
}
df$marital <- sapply(df$marital, civil_status)
table(df$marital)
##
## Married Never-married Not-Married
## 15417 10683 6461
Comprobamos la columna Country
sort(table(df$country))
##
## Holand-Netherlands Scotland
## 1 12
## Honduras Hungary
## 13 13
## Outlying-US(Guam-USVI-etc) Yugoslavia
## 14 16
## Laos Thailand
## 18 18
## Cambodia Trinadad&Tobago
## 19 19
## Hong Ireland
## 20 24
## Ecuador France
## 28 29
## Greece Peru
## 29 31
## Nicaragua Portugal
## 34 37
## Iran Haiti
## 43 44
## Taiwan Columbia
## 51 59
## Poland Japan
## 60 62
## Guatemala Vietnam
## 64 67
## Dominican-Republic Italy
## 70 73
## China South
## 75 80
## Jamaica England
## 81 90
## Cuba India
## 95 100
## El-Salvador Puerto-Rico
## 106 114
## Canada Germany
## 121 137
## Philippines ?
## 198 583
## Mexico United-States
## 643 29170
Agruparemos los países por sus respectivos continentes, para ello, vamos a crear primero los vectores de los continentes.
Asia <- c('Laos', 'Thailand', 'Cambodia', 'Hong', 'Iran', 'Taiwan', 'Japan', 'Vietnam', 'China', 'India', 'Philippines')
Europe <- c('Holand-Netherlands', 'Scotland', 'Hungary', 'Yugoslavia','Ireland', 'France', 'Greece', 'Portugal','Poland', 'Italy','England','Germany')
North.America <- c('United-States', 'Canada', 'Puerto-Rico')
Latin.and.South.America <- c('Honduras', 'Trinadad&Tobago', 'Ecuador', 'Peru', 'Nicaragua', 'Haiti','Columbia', 'Guatemala', 'Dominican-Republic', 'Jamaica', 'Cuba', 'El-Salvador', 'Mexico','Outlying-US(Guam-USVI-etc)')
Other <- c('South')
Creamos una función para asignar a cada país su continente en el dataset.
continent.assign <- function(country){
if(country %in% Asia){
return('Asia')
}else if(country %in% Europe){
return('Europe')
}else if(country %in% North.America){
return('North.America')
}
else if(country %in% Latin.and.South.America){
return('Latin.and.South.America')
}
else{
return('Other')
}
}
df$country <- sapply(df$country, continent.assign)
table(df$country)
##
## Asia Europe Latin.and.South.America
## 671 521 1301
## North.America Other
## 29405 663
Veamos de nuevo la estrucuta del dataset.
str(df)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ type_employer: Factor w/ 7 levels "?","Federal-gov",..: 3 7 5 5 5 5 5 7 5 5 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ education_num: int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital : chr "Never-married" "Married" "Not-Married" "Married" ...
## $ occupation : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capital_gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital_loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hr_per_week : int 40 13 40 40 40 40 16 45 50 40 ...
## $ country : chr "North.America" "North.America" "North.America" "North.America" ...
## $ income : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
Ponemos tipo factor las columnas que hemos modificado previamente.
df$type_employer <- as.factor(df$type_employer)
df$marital <- as.factor(df$marital)
df$country <- as.factor(df$country)
names(df)[names(df)=='country'] <- "region"
Usaremos el paquete Amelia y la función missmap() para ver si existen NA’s en el dataset. Como hemos comprobado anteriormente, no existen NA’s como tal sino están sustituídas por ?. Lo que haremos ahora es reemplazar todos esos valores por Na.
df[df=='?'] <- NA
missmap(df, main="Missingness Map", col=c("yellow","black"), legend=FALSE)
En esta ocasión vamos a eliminar las observaciones que tienen NA en la columna occupation.
df <- na.omit(df)
missmap(df, main="Missingness Map", col=c("yellow","black"), legend=FALSE)
Analicemos el dataset mediante su visualización gráfica.
ggplot(df, aes(age)) + geom_histogram(binwidth=1, col="black",aes(fill=factor(income))) + theme_bw() + scale_x_continuous(breaks=seq(0,80, by=20))
Observamos que hay más personas que ingresan más de 50k entre los 35-45 años de edad.
ggplot(df, aes(sex)) + geom_bar(aes(fill=factor(income)), position="dodge") + theme_bw()
En ambos sexos es más predominante la gente que cobra menos de 50K pero sin embargo, en el sexo femenino es más destacable esa diferencia.
ggplot(df, aes(hr_per_week)) + geom_histogram(bins = 30, aes(fill=factor(income)), col="black") + theme_bw() + scale_x_continuous(breaks=seq(0, 120, by=3))
La moda se encuentra en unas 40 horas semanales y es donde se encuentra el mayor número de gente que ingresa más de 50K.
ggplot(df, aes(region)) + geom_bar(aes(fill=income), col='black') + theme_bw() + theme(axis.text.x=element_text(angle=90, hjust=1))
En Norte América es donde más gente suele cobrar más de 50K.
ggplot(df, aes(education)) + geom_bar(aes(fill=factor(income)), col="black") + theme_bw() + theme(axis.text.x=element_text(angle=90, hjust=1))
ggplot(df, aes(occupation)) + geom_bar(aes(fill=income), col="black") + theme_bw() + theme(axis.text.x=element_text(angle=90))
ggplot(df, aes(type_employer)) + geom_bar(aes(fill=income), col="black", position="dodge") + theme_bw()
ggplot(df, aes(occupation, income)) + geom_jitter(aes(col=factor(region))) + theme_bw() + theme(axis.text.x=element_text(angle=90))
Podemos observar que, como habíamos comprobado anteriormente, la gente que ingresa más de 50K es de la región de Norte América, sin embargo, en algunas profesiones también destaca Asia sobre el resto de regiones.
ggplot(df, aes(education, income)) + geom_jitter(aes(col=factor(region))) + theme_bw() + theme(axis.text.x=element_text(angle=90))
Podemos observar que para ingresar >50K parece ser preferible tener un nivel de estudios superior.
Pasamos ahora a construir el modelo de regresión logística para predecir si un grupo de observaciones ingresará más o menos de 50K en función de sus características dadas en las diferentes columnas.
head(df)
## age type_employer fnlwgt education education_num marital
## 1 39 SL-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married
## 3 38 Private 215646 HS-grad 9 Not-Married
## 4 53 Private 234721 11th 7 Married
## 5 28 Private 338409 Bachelors 13 Married
## 6 37 Private 284582 Masters 14 Married
## occupation relationship race sex capital_gain capital_loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hr_per_week region income
## 1 40 North.America <=50K
## 2 13 North.America <=50K
## 3 40 North.America <=50K
## 4 40 North.America <=50K
## 5 40 Latin.and.South.America <=50K
## 6 40 North.America <=50K
sample <- sample.split(df, SplitRatio=0.7)
train <- subset(df, sample==TRUE)
test <- subset(df, sample==FALSE)
model <- glm(income~., family=binomial(logit), data=train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
##
## Call:
## glm(formula = income ~ ., family = binomial(logit), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.1655 -0.5081 -0.1900 0.0000 3.6897
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.591e+00 4.456e-01 -12.546 < 2e-16 ***
## age 2.552e-02 2.042e-03 12.496 < 2e-16 ***
## type_employerSL-gov -8.477e-01 1.277e-01 -6.638 3.18e-11 ***
## type_employerUnemployed -1.321e+01 2.626e+02 -0.050 0.95989
## type_employerPrivate -6.043e-01 1.131e-01 -5.344 9.08e-08 ***
## type_employerSelf-emp-inc -4.604e-01 1.479e-01 -3.113 0.00185 **
## type_employerSelf-emp-not-inc -1.090e+00 1.326e-01 -8.221 < 2e-16 ***
## fnlwgt 7.829e-07 2.112e-07 3.707 0.00021 ***
## education11th 1.538e-01 2.685e-01 0.573 0.56681
## education12th 4.954e-01 3.433e-01 1.443 0.14899
## education1st-4th -4.533e-01 6.370e-01 -0.712 0.47668
## education5th-6th 2.498e-02 4.071e-01 0.061 0.95106
## education7th-8th -3.826e-01 2.926e-01 -1.307 0.19104
## education9th 3.915e-02 3.226e-01 0.121 0.90341
## educationAssoc-acdm 1.300e+00 2.264e-01 5.744 9.27e-09 ***
## educationAssoc-voc 1.463e+00 2.179e-01 6.715 1.88e-11 ***
## educationBachelors 2.130e+00 2.038e-01 10.451 < 2e-16 ***
## educationDoctorate 3.111e+00 2.739e-01 11.355 < 2e-16 ***
## educationHS-grad 9.504e-01 1.990e-01 4.776 1.79e-06 ***
## educationMasters 2.426e+00 2.162e-01 11.218 < 2e-16 ***
## educationPreschool -1.820e+01 1.096e+02 -0.166 0.86811
## educationProf-school 3.102e+00 2.570e-01 12.073 < 2e-16 ***
## educationSome-college 1.305e+00 2.014e-01 6.477 9.38e-11 ***
## education_num NA NA NA NA
## maritalNever-married -1.421e+00 2.005e-01 -7.086 1.38e-12 ***
## maritalNot-Married -8.163e-01 1.986e-01 -4.111 3.95e-05 ***
## occupationArmed-Forces -9.494e-01 1.679e+00 -0.565 0.57182
## occupationCraft-repair 7.559e-02 9.764e-02 0.774 0.43884
## occupationExec-managerial 8.186e-01 9.408e-02 8.702 < 2e-16 ***
## occupationFarming-fishing -9.624e-01 1.706e-01 -5.640 1.70e-08 ***
## occupationHandlers-cleaners -7.107e-01 1.729e-01 -4.110 3.96e-05 ***
## occupationMachine-op-inspct -1.761e-01 1.235e-01 -1.426 0.15378
## occupationOther-service -8.200e-01 1.441e-01 -5.689 1.28e-08 ***
## occupationPriv-house-serv -3.971e+00 1.834e+00 -2.165 0.03039 *
## occupationProf-specialty 4.380e-01 9.920e-02 4.415 1.01e-05 ***
## occupationProtective-serv 5.939e-01 1.523e-01 3.900 9.61e-05 ***
## occupationSales 3.118e-01 1.004e-01 3.106 0.00190 **
## occupationTech-support 5.702e-01 1.367e-01 4.172 3.02e-05 ***
## occupationTransport-moving -1.839e-01 1.217e-01 -1.511 0.13083
## relationshipNot-in-family -8.074e-01 1.958e-01 -4.123 3.74e-05 ***
## relationshipOther-relative -1.179e+00 2.731e-01 -4.316 1.59e-05 ***
## relationshipOwn-child -1.814e+00 2.467e-01 -7.355 1.92e-13 ***
## relationshipUnmarried -9.826e-01 2.190e-01 -4.486 7.27e-06 ***
## relationshipWife 1.346e+00 1.276e-01 10.546 < 2e-16 ***
## raceAsian-Pac-Islander 5.621e-01 3.244e-01 1.733 0.08315 .
## raceBlack 3.776e-01 2.866e-01 1.317 0.18776
## raceOther -9.116e-03 4.443e-01 -0.021 0.98363
## raceWhite 5.420e-01 2.727e-01 1.988 0.04685 *
## sexMale 8.289e-01 9.740e-02 8.510 < 2e-16 ***
## capital_gain 3.278e-04 1.296e-05 25.289 < 2e-16 ***
## capital_loss 5.978e-04 4.603e-05 12.987 < 2e-16 ***
## hr_per_week 3.020e-02 2.046e-03 14.764 < 2e-16 ***
## regionEurope 4.634e-01 2.563e-01 1.808 0.07056 .
## regionLatin.and.South.America -3.516e-01 2.645e-01 -1.329 0.18375
## regionNorth.America 3.023e-01 2.072e-01 1.459 0.14444
## regionOther -7.595e-02 2.363e-01 -0.321 0.74795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23023 on 20477 degrees of freedom
## Residual deviance: 13227 on 20423 degrees of freedom
## AIC: 13337
##
## Number of Fisher Scoring iterations: 13
Como podemos observar aparecen un número considerable de características que tienen importancia en el modelo. Muchas de ellas se podrían haber reducido si se hubiera reducido el número de factores en educación, occupation, etc.
Vamos a usar la función step() que nos permitirá comparar varios modelos de regresión logística mediante el método AIC( Akaike information criterion ). Lo que hace esta función es ir añadiendo características al modelo e ir comparándolos para ver cuál es el número de características que valen mejor para predecir la variable income.
new.step.model <- step(model)
## Start: AIC=13337.11
## income ~ age + type_employer + fnlwgt + education + education_num +
## marital + occupation + relationship + race + sex + capital_gain +
## capital_loss + hr_per_week + region
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=13337.11
## income ~ age + type_employer + fnlwgt + education + marital +
## occupation + relationship + race + sex + capital_gain + capital_loss +
## hr_per_week + region
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## <none> 13227 13337
## - race 4 13237 13339
## - fnlwgt 1 13241 13349
## - region 4 13250 13352
## - marital 2 13291 13397
## - sex 1 13302 13410
## - type_employer 5 13318 13418
## - age 1 13384 13492
## - capital_loss 1 13400 13508
## - hr_per_week 1 13451 13559
## - relationship 5 13463 13563
## - occupation 13 13616 13700
## - education 15 13966 14046
## - capital_gain 1 14469 14577
Veamos qué variables tomó el modelo
summary(new.step.model)
##
## Call:
## glm(formula = income ~ age + type_employer + fnlwgt + education +
## marital + occupation + relationship + race + sex + capital_gain +
## capital_loss + hr_per_week + region, family = binomial(logit),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.1655 -0.5081 -0.1900 0.0000 3.6897
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.591e+00 4.456e-01 -12.546 < 2e-16 ***
## age 2.552e-02 2.042e-03 12.496 < 2e-16 ***
## type_employerSL-gov -8.477e-01 1.277e-01 -6.638 3.18e-11 ***
## type_employerUnemployed -1.321e+01 2.626e+02 -0.050 0.95989
## type_employerPrivate -6.043e-01 1.131e-01 -5.344 9.08e-08 ***
## type_employerSelf-emp-inc -4.604e-01 1.479e-01 -3.113 0.00185 **
## type_employerSelf-emp-not-inc -1.090e+00 1.326e-01 -8.221 < 2e-16 ***
## fnlwgt 7.829e-07 2.112e-07 3.707 0.00021 ***
## education11th 1.538e-01 2.685e-01 0.573 0.56681
## education12th 4.954e-01 3.433e-01 1.443 0.14899
## education1st-4th -4.533e-01 6.370e-01 -0.712 0.47668
## education5th-6th 2.498e-02 4.071e-01 0.061 0.95106
## education7th-8th -3.826e-01 2.926e-01 -1.307 0.19104
## education9th 3.915e-02 3.226e-01 0.121 0.90341
## educationAssoc-acdm 1.300e+00 2.264e-01 5.744 9.27e-09 ***
## educationAssoc-voc 1.463e+00 2.179e-01 6.715 1.88e-11 ***
## educationBachelors 2.130e+00 2.038e-01 10.451 < 2e-16 ***
## educationDoctorate 3.111e+00 2.739e-01 11.355 < 2e-16 ***
## educationHS-grad 9.504e-01 1.990e-01 4.776 1.79e-06 ***
## educationMasters 2.426e+00 2.162e-01 11.218 < 2e-16 ***
## educationPreschool -1.820e+01 1.096e+02 -0.166 0.86811
## educationProf-school 3.102e+00 2.570e-01 12.073 < 2e-16 ***
## educationSome-college 1.305e+00 2.014e-01 6.477 9.38e-11 ***
## maritalNever-married -1.421e+00 2.005e-01 -7.086 1.38e-12 ***
## maritalNot-Married -8.163e-01 1.986e-01 -4.111 3.95e-05 ***
## occupationArmed-Forces -9.494e-01 1.679e+00 -0.565 0.57182
## occupationCraft-repair 7.559e-02 9.764e-02 0.774 0.43884
## occupationExec-managerial 8.186e-01 9.408e-02 8.702 < 2e-16 ***
## occupationFarming-fishing -9.624e-01 1.706e-01 -5.640 1.70e-08 ***
## occupationHandlers-cleaners -7.107e-01 1.729e-01 -4.110 3.96e-05 ***
## occupationMachine-op-inspct -1.761e-01 1.235e-01 -1.426 0.15378
## occupationOther-service -8.200e-01 1.441e-01 -5.689 1.28e-08 ***
## occupationPriv-house-serv -3.971e+00 1.834e+00 -2.165 0.03039 *
## occupationProf-specialty 4.380e-01 9.920e-02 4.415 1.01e-05 ***
## occupationProtective-serv 5.939e-01 1.523e-01 3.900 9.61e-05 ***
## occupationSales 3.118e-01 1.004e-01 3.106 0.00190 **
## occupationTech-support 5.702e-01 1.367e-01 4.172 3.02e-05 ***
## occupationTransport-moving -1.839e-01 1.217e-01 -1.511 0.13083
## relationshipNot-in-family -8.074e-01 1.958e-01 -4.123 3.74e-05 ***
## relationshipOther-relative -1.179e+00 2.731e-01 -4.316 1.59e-05 ***
## relationshipOwn-child -1.814e+00 2.467e-01 -7.355 1.92e-13 ***
## relationshipUnmarried -9.826e-01 2.190e-01 -4.486 7.27e-06 ***
## relationshipWife 1.346e+00 1.276e-01 10.546 < 2e-16 ***
## raceAsian-Pac-Islander 5.621e-01 3.244e-01 1.733 0.08315 .
## raceBlack 3.776e-01 2.866e-01 1.317 0.18776
## raceOther -9.116e-03 4.443e-01 -0.021 0.98363
## raceWhite 5.420e-01 2.727e-01 1.988 0.04685 *
## sexMale 8.289e-01 9.740e-02 8.510 < 2e-16 ***
## capital_gain 3.278e-04 1.296e-05 25.289 < 2e-16 ***
## capital_loss 5.978e-04 4.603e-05 12.987 < 2e-16 ***
## hr_per_week 3.020e-02 2.046e-03 14.764 < 2e-16 ***
## regionEurope 4.634e-01 2.563e-01 1.808 0.07056 .
## regionLatin.and.South.America -3.516e-01 2.645e-01 -1.329 0.18375
## regionNorth.America 3.023e-01 2.072e-01 1.459 0.14444
## regionOther -7.595e-02 2.363e-01 -0.321 0.74795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23023 on 20477 degrees of freedom
## Residual deviance: 13227 on 20423 degrees of freedom
## AIC: 13337
##
## Number of Fisher Scoring iterations: 13
Ha tomado las mismas que usamos nosotros antes debido a que ha usado el criterio AIC. Hay otros criterios que no vamos a usar para este dataset.
Aplicaremos ahora el modelo sobre el dataset de testeo para ver la eficacia del modelo.
test$predicted.income <- predict(model, test, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
table(test$income,test$predicted.income>0.5)
##
## FALSE TRUE
## <=50K 7150 556
## >50K 997 1537
acc <- (7174+1553)/(7174+1553+542+970)
paste('Exactitud: ', acc)
## [1] "Exactitud: 0.852329329036039"
Calculamos otra medida como el Ratio de Positivos Verdaderos o sensibilidad
1553/(1553+970)
## [1] 0.6155371
Calculamos también la precisión
1553/(1553+542)
## [1] 0.7412888
No podemos decidir si el modelo es bueno o no porque habría que ver cuál es el objetivo principal. En algunas ocasiones, al margen de buscar una exactitud muy alta en el modelo, lo que nos interesa es que tenga una alta precisión o sensibilidad dependiendo del problema en el que nos encontremos.