Metodología CRISP-DM

1.- Análisis y compresión del negocio y/o problema.

Los datos provenienen del repositorio UCI machine learning repository | https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data; la bases de datos aborda un censo realizado en 1994 desde la pagína | http://www.census.gov/ftp/pub/DES/www/welcome.html la tarea a resolver es la siguiente:

  1. Usando algoritmos de predicción, determinar si una persona gana mas de 50k(USD) en un año.

instalar los paquetes para la manipulación de datos

2.- comprensión de los datos adult.data

download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", "adult.data")

readLines("adult.data", n = 10)
##  [1] "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"      
##  [2] "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K"
##  [3] "38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K"             
##  [4] "53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K"            
##  [5] "28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K"                   
##  [6] "37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K"           
##  [7] "49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K"            
##  [8] "52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K"   
##  [9] "31, Private, 45781, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 14084, 0, 50, United-States, >50K"      
## [10] "42, Private, 159449, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 5178, 0, 40, United-States, >50K"
adult <- read.table("adult.data", sep = ",", header = F)
head(adult)
tail(adult)
df <- adult
glimpse(df)
## Rows: 32,561
## Columns: 15
## $ V1  <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23, 32, 40, 34, 2…
## $ V2  <chr> " State-gov", " Self-emp-not-inc", " Private", " Private", " Priv…
## $ V3  <int> 77516, 83311, 215646, 234721, 338409, 284582, 160187, 209642, 457…
## $ V4  <chr> " Bachelors", " Bachelors", " HS-grad", " 11th", " Bachelors", " …
## $ V5  <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12, 11, 4, 9, 9, …
## $ V6  <chr> " Never-married", " Married-civ-spouse", " Divorced", " Married-c…
## $ V7  <chr> " Adm-clerical", " Exec-managerial", " Handlers-cleaners", " Hand…
## $ V8  <chr> " Not-in-family", " Husband", " Not-in-family", " Husband", " Wif…
## $ V9  <chr> " White", " White", " White", " Black", " Black", " White", " Bla…
## $ V10 <chr> " Male", " Male", " Male", " Male", " Female", " Female", " Femal…
## $ V11 <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ V12 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ V13 <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30, 50, 40, 45, 3…
## $ V14 <chr> " United-States", " United-States", " United-States", " United-St…
## $ V15 <chr> " <=50K", " <=50K", " <=50K", " <=50K", " <=50K", " <=50K", " <=5…
skim(df)
Data summary
Name df
Number of rows 32561
Number of columns 15
_______________________
Column type frequency:
character 9
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
V2 0 1 2 17 0 9 0
V4 0 1 4 13 0 16 0
V6 0 1 8 22 0 7 0
V7 0 1 2 18 0 15 0
V8 0 1 5 15 0 6 0
V9 0 1 6 19 0 5 0
V10 0 1 5 7 0 2 0
V14 0 1 2 27 0 42 0
V15 0 1 5 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
V1 0 1 38.58 13.64 17 28 37 48 90 ▇▇▅▂▁
V3 0 1 189778.37 105549.98 12285 117827 178356 237051 1484705 ▇▁▁▁▁
V5 0 1 10.08 2.57 1 9 10 12 16 ▁▁▇▃▁
V11 0 1 1077.65 7385.29 0 0 0 0 99999 ▇▁▁▁▁
V12 0 1 87.30 402.96 0 0 0 0 4356 ▇▁▁▁▁
V13 0 1 40.44 12.35 1 40 40 45 99 ▁▇▃▁▁

Se desconoce la procedencia del atributo V3 según https://www.kaggle.com/uciml/adult-census-income/discussion/32698 V3 es el numero estimado de personas V11 Es el capital ganado V12 Es la perdida de capital V13 Hora/semana

Por razones de compresión, a la hora del análisis, se cambiaran los nombres de las columnas de las variables.

x = c("Edad", "tipo-trabajo", "fnlwgt", "educacion", "educacion-num", "estado-civil", 
      "ocupacion", "Relacion", "raza", "sexo", "capital-ganancia", "perdida-capital", 
      "horas-semana", "pais-origen", "ganancia-año")
names(df) = x

Visualización de los datos sobre la variable V3, V11, V12, V13

ggplot(df, x = 1) + geom_boxplot(aes(y = fnlwgt))

ggplot(df, x = 1) + geom_boxplot(aes(y = `capital-ganancia`))

ggplot(df, x = 1) + geom_boxplot(aes(y = `perdida-capital`))

ggplot(df, x = 1) + geom_boxplot(aes(y = `horas-semana`))

#buscamos los outliers en las mismas variables con la función boxplot.stats()
outednum <- boxplot.stats(df$`educación-num`)
outcapgan <- boxplot.stats(df$`capital-ganancia`)
outcaper <- boxplot.stats(df$`perdida-capital`)
outHsem <- boxplot.stats(df$`horas-semana`)

Los datos en fnlwt, capital-ganancia, perdida-capital, horas-semana; efectivamente presentan casos atípicos. por lo que deben ser corregidos (por simplicidad se eliminaran de nuestro data set ‘df’)

3.- Preparación y calidad de los datos

Filtraremos los outliers o datos anómalos, ademas de eliminar las variable

df <- df %>%
  filter(fnlwgt < 47700) %>%
  filter(fnlwgt > 20000)

df$fnlwgt <- NULL
df$`capital-ganancia` <- NULL
df$`perdida-capital` <- NULL

Se transformaron la variables fnlwgt de manera que pueda manejarse mejor los datos y que los datos atípicos no interfieran con el modelo de clasificación

3.1 Análisis exploratorio de los datos

##                Edad educacion-num horas-semana
## Edad           1.00         -0.03         0.08
## educacion-num -0.03          1.00         0.15
## horas-semana   0.08          0.15         1.00
## 
##  <=50K   >50K 
##   1610    480

Las variables enteros también poseen gran cantidad de valores atípicos. entonces se tratara se obviar las variables con datos atípicos, de manera que no pueda estropear el modelo; así mismo, las variables enteras no se encuentran correlacionadas lo que puede ser un buen indicio para el modelado; pues se reduce sesgo.

Los datos indican que 77% de las personas encuestadas ganan <=50K y un 23% gana > 50K

4.- Modelización variable tarjet ganacia-año TAREA PREDICTORA - CLASIFICACION: Que personas ganan mas de 50K o menos de 50K

attach(df)
indextraining <- createDataPartition(`ganancia-año`, p = 0.91, 
                                     list = FALSE, times = 1)
datatrain <- df[indextraining, ]
datatest <- df[-indextraining, ]

fitcontrol <- trainControl(method = "CV", number = 12)

# modelado con caret
DTfit <- train(`ganancia-año` ~., 
               data = datatrain, 
               method = 'C5.0', trControl = fitcontrol,
               verbose = F, type = "object")
## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials

## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials

## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials

## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials

## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials
## Warning: 'trials' should be <= 7 for this object. Predictions generated using 7
## trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials

## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials

## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials
## Warning: 'trials' should be <= 9 for this object. Predictions generated using 9
## trials

## Warning: 'trials' should be <= 9 for this object. Predictions generated using 9
## trials

## Warning: 'trials' should be <= 9 for this object. Predictions generated using 9
## trials
## Warning: 'trials' should be <= 4 for this object. Predictions generated using 4
## trials

## Warning: 'trials' should be <= 4 for this object. Predictions generated using 4
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
DTfit
## C5.0 
## 
## 1903 samples
##   11 predictor
##    2 classes: ' <=50K', ' >50K' 
## 
## No pre-processing
## Resampling: Cross-Validated (12 fold) 
## Summary of sample sizes: 1744, 1744, 1745, 1745, 1743, 1743, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa    
##   rules  FALSE    1      0.8145786  0.4238143
##   rules  FALSE   10      0.8255953  0.4348101
##   rules  FALSE   20      0.8261293  0.4443774
##   rules   TRUE    1      0.8150995  0.4143939
##   rules   TRUE   10      0.8224341  0.4200691
##   rules   TRUE   20      0.8197969  0.4321043
##   tree   FALSE    1      0.8140380  0.4203808
##   tree   FALSE   10      0.8213757  0.4247036
##   tree   FALSE   20      0.8187518  0.4352202
##   tree    TRUE    1      0.8145787  0.4195133
##   tree    TRUE   10      0.8166621  0.4218906
##   tree    TRUE   20      0.8156105  0.4406176
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = rules and
##  winnow = FALSE.
# modelado con rpart
DTfit2 <- rpart(datatrain$`ganancia-año` ~., data = datatrain)
DTfit2
## n= 1903 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 1903 437  <=50K (0.77036259 0.22963741)  
##     2) estado-civil= Divorced, Married-spouse-absent, Never-married, Separated, Widowed 1018  68  <=50K (0.93320236 0.06679764) *
##     3) estado-civil= Married-AF-spouse, Married-civ-spouse 885 369  <=50K (0.58305085 0.41694915)  
##       6) ocupacion= ?, Craft-repair, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Transport-moving 455 116  <=50K (0.74505495 0.25494505) *
##       7) ocupacion= Adm-clerical, Exec-managerial, Prof-specialty, Protective-serv, Sales, Tech-support 430 177  >50K (0.41162791 0.58837209)  
##        14) educacion= 10th, 11th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, Bachelors, HS-grad, Some-college 353 166  >50K (0.47025496 0.52974504)  
##          28) Edad< 31.5 56  16  <=50K (0.71428571 0.28571429) *
##          29) Edad>=31.5 297 126  >50K (0.42424242 0.57575758)  
##            58) educacion= 10th, 11th, 7th-8th, 9th, Assoc-acdm, HS-grad 97  42  <=50K (0.56701031 0.43298969) *
##            59) educacion= Assoc-voc, Bachelors, Some-college 200  71  >50K (0.35500000 0.64500000)  
##             118) tipo-trabajo= Self-emp-not-inc, State-gov 41  18  <=50K (0.56097561 0.43902439) *
##             119) tipo-trabajo= Federal-gov, Local-gov, Private, Self-emp-inc 159  48  >50K (0.30188679 0.69811321) *
##        15) educacion= Doctorate, Masters, Prof-school 77  11  >50K (0.14285714 0.85714286) *
rpart.plot(DTfit2)

Conclusiones: el modelo generado con el paquete “caret” predice con un 1% de mayor exactitud que el modelo generado con el paquete “rpart”.

Por otro lado, el árbol de decisión producido por “rpart.plot” posiciona a las personas con estados civiles: “divorciados”, “casado-conyuge ausente”, “nunca casado”, “separado”, y “viudo”; con una ganancia menor a 50K anuales. En contraste, las personas con estados civil: “casado-esposo civil”, “casado-esposo AF”; con una educación: “doctoral”, “preescolar”, “bachilleres”; con edades superiores a los “29” años; y ocupación: “soporte técnico”, “ventas, ejecutivo de dirección”, “especialidad profesional”, “administrativo-adm”, “agricultura-pesca”, “Priv-house- serv”, “Fuerzas armadas”.

5.- Evaluación del modelo con caret y rpart

pred <- predict(DTfit, datatest)
plot(pred)

table(pred, datatest$`ganancia-año`)
##         
## pred      <=50K  >50K
##    <=50K    134    29
##    >50K      10    14
pred2 <- predict(DTfit2, datatest, type = "class")
plot(pred2)

table(pred2, datatest$`ganancia-año`)
##         
## pred2     <=50K  >50K
##    <=50K    131    28
##    >50K      13    15
# Calculamos el porcentaje e aciertos (accuracy) = total de aciertos / total de datos
acpred = ((133+19) / (133+11+24+19))
acpred
## [1] 0.8128342
acpred2 = (132+24) / (132+19+12+24)
acpred2
## [1] 0.8342246

Conclusiones: En la evaluación del modelo se puede apreciar un valor de predicción del (ac): 81.3% versus un 18.7% de error con “caret” y un 83.4% versus un 16.6% de error con “rpart”. estos valores inducen un buen porcentaje sobre el modelo de arboles de decisión aplicados, a este conjunto de datos.