CRISP-DM Methodology
1.- Business and/or problem understanding.
The data come from the UCI Machine Learning Repository | https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data; the database covers a census carried out in 1994, taken from the page | http://www.census.gov/ftp/pub/DES/www/welcome.html. The task to solve is to predict whether a person earns more or less than 50K per year (see section 4).
First, install and load the packages for data manipulation and the later modeling steps; see the sketch below.
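A minimal setup sketch, assuming these packages provide the functions used throughout the analysis (glimpse, skim, ggplot, %>%/filter, createDataPartition, train with method = 'C5.0', rpart, rpart.plot):
# install once if needed:
# install.packages(c("dplyr", "skimr", "ggplot2", "caret", "C50", "rpart", "rpart.plot"))
library(dplyr)       # glimpse(), filter(), %>%
library(skimr)       # skim()
library(ggplot2)     # ggplot(), geom_boxplot()
library(caret)       # createDataPartition(), trainControl(), train()
library(C50)         # backend for caret's method = 'C5.0'
library(rpart)       # rpart()
library(rpart.plot)  # rpart.plot()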
2.- Data understanding: adult.data
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", "adult.data")
readLines("adult.data", n = 10)
## [1] "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"
## [2] "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K"
## [3] "38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K"
## [4] "53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K"
## [5] "28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K"
## [6] "37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K"
## [7] "49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K"
## [8] "52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K"
## [9] "31, Private, 45781, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 14084, 0, 50, United-States, >50K"
## [10] "42, Private, 159449, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 5178, 0, 40, United-States, >50K"
# read the comma-separated file; it has no header row (note that the fields keep a leading space)
adult <- read.table("adult.data", sep = ",", header = F)
head(adult)
tail(adult)
df <- adult
glimpse(df)
## Rows: 32,561
## Columns: 15
## $ V1 <int> 39, 50, 38, 53, 28, 37, 49, 52, 31, 42, 37, 30, 23, 32, 40, 34, 2…
## $ V2 <chr> " State-gov", " Self-emp-not-inc", " Private", " Private", " Priv…
## $ V3 <int> 77516, 83311, 215646, 234721, 338409, 284582, 160187, 209642, 457…
## $ V4 <chr> " Bachelors", " Bachelors", " HS-grad", " 11th", " Bachelors", " …
## $ V5 <int> 13, 13, 9, 7, 13, 14, 5, 9, 14, 13, 10, 13, 13, 12, 11, 4, 9, 9, …
## $ V6 <chr> " Never-married", " Married-civ-spouse", " Divorced", " Married-c…
## $ V7 <chr> " Adm-clerical", " Exec-managerial", " Handlers-cleaners", " Hand…
## $ V8 <chr> " Not-in-family", " Husband", " Not-in-family", " Husband", " Wif…
## $ V9 <chr> " White", " White", " White", " Black", " Black", " White", " Bla…
## $ V10 <chr> " Male", " Male", " Male", " Male", " Female", " Female", " Femal…
## $ V11 <int> 2174, 0, 0, 0, 0, 0, 0, 0, 14084, 5178, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ V12 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ V13 <int> 40, 13, 40, 40, 40, 40, 16, 45, 50, 40, 80, 40, 30, 50, 40, 45, 3…
## $ V14 <chr> " United-States", " United-States", " United-States", " United-St…
## $ V15 <chr> " <=50K", " <=50K", " <=50K", " <=50K", " <=50K", " <=50K", " <=5…
skim(df)
| Data summary | |
|---|---|
| Name | df |
| Number of rows | 32561 |
| Number of columns | 15 |
| Column type frequency: character | 9 |
| Column type frequency: numeric | 6 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| V2 | 0 | 1 | 2 | 17 | 0 | 9 | 0 |
| V4 | 0 | 1 | 4 | 13 | 0 | 16 | 0 |
| V6 | 0 | 1 | 8 | 22 | 0 | 7 | 0 |
| V7 | 0 | 1 | 2 | 18 | 0 | 15 | 0 |
| V8 | 0 | 1 | 5 | 15 | 0 | 6 | 0 |
| V9 | 0 | 1 | 6 | 19 | 0 | 5 | 0 |
| V10 | 0 | 1 | 5 | 7 | 0 | 2 | 0 |
| V14 | 0 | 1 | 2 | 27 | 0 | 42 | 0 |
| V15 | 0 | 1 | 5 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| V1 | 0 | 1 | 38.58 | 13.64 | 17 | 28 | 37 | 48 | 90 | ▇▇▅▂▁ |
| V3 | 0 | 1 | 189778.37 | 105549.98 | 12285 | 117827 | 178356 | 237051 | 1484705 | ▇▁▁▁▁ |
| V5 | 0 | 1 | 10.08 | 2.57 | 1 | 9 | 10 | 12 | 16 | ▁▁▇▃▁ |
| V11 | 0 | 1 | 1077.65 | 7385.29 | 0 | 0 | 0 | 0 | 99999 | ▇▁▁▁▁ |
| V12 | 0 | 1 | 87.30 | 402.96 | 0 | 0 | 0 | 0 | 4356 | ▇▁▁▁▁ |
| V13 | 0 | 1 | 40.44 | 12.35 | 1 | 40 | 40 | 45 | 99 | ▁▇▃▁▁ |
The exact origin of attribute V3 is not documented; according to https://www.kaggle.com/uciml/adult-census-income/discussion/32698, V3 (fnlwgt) is the estimated number of people the record represents, V11 is the capital gain, V12 is the capital loss, and V13 is hours per week.
For ease of understanding during the analysis, the columns will be renamed.
x = c("Edad", "tipo-trabajo", "fnlwgt", "educacion", "educacion-num", "estado-civil",
"ocupacion", "Relacion", "raza", "sexo", "capital-ganancia", "perdida-capital",
"horas-semana", "pais-origen", "ganancia-año")
names(df) = x
Visualization of the variables V3, V11, V12, V13 (fnlwgt, capital-ganancia, perdida-capital, horas-semana):
ggplot(df) + geom_boxplot(aes(y = fnlwgt))
ggplot(df) + geom_boxplot(aes(y = `capital-ganancia`))
ggplot(df) + geom_boxplot(aes(y = `perdida-capital`))
ggplot(df) + geom_boxplot(aes(y = `horas-semana`))
# look for outliers in these variables with the boxplot.stats() function
outednum <- boxplot.stats(df$`educacion-num`)
outcapgan <- boxplot.stats(df$`capital-ganancia`)
outcaper <- boxplot.stats(df$`perdida-capital`)
outHsem <- boxplot.stats(df$`horas-semana`)
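To quantify how many points fall outside the whiskers (a quick sketch; the counts are not reproduced here), the $out component returned by boxplot.stats() can be inspected:
# number of observations outside the 1.5*IQR whiskers for each variable
sapply(list(fnlwgt           = boxplot.stats(df$fnlwgt)$out,
            educacion_num    = outednum$out,
            capital_ganancia = outcapgan$out,
            perdida_capital  = outcaper$out,
            horas_semana     = outHsem$out),
       length)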
The variables fnlwgt, capital-ganancia, perdida-capital, and horas-semana do indeed contain outliers, so they need to be handled (for simplicity they will be removed from our data set 'df').
3.- Data preparation and quality
We filter out the outliers (anomalous records) and, in addition, drop the variables most affected by extreme values.
# keep only the records with fnlwgt between 20,000 and 47,700
df <- df %>%
  filter(fnlwgt < 47700) %>%
  filter(fnlwgt > 20000)
# drop the variables dominated by extreme values
df$fnlwgt <- NULL
df$`capital-ganancia` <- NULL
df$`perdida-capital` <- NULL
The fnlwgt variable was filtered so that the data are easier to handle and the extreme values do not interfere with the classification model.
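As a quick sanity check (a sketch; its output is not reproduced here), the size of the filtered data set can be verified before modeling:
# rows and columns remaining after the fnlwgt filter and the dropped variables
dim(df)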
3.1 Exploratory data analysis
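The correlation matrix and class counts shown below were presumably produced by calls along these lines (a sketch; the original code chunk is not included in the text):
# correlation between the remaining integer variables, rounded to two decimals
round(cor(df[, c("Edad", "educacion-num", "horas-semana")]), 2)
# class counts of the target variable
table(df$`ganancia-año`)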
## Edad educacion-num horas-semana
## Edad 1.00 -0.03 0.08
## educacion-num -0.03 1.00 0.15
## horas-semana 0.08 0.15 1.00
##
## <=50K >50K
## 1610 480
The integer variables also contain a large number of outliers, so the variables most affected were set aside so that they do not degrade the model. Likewise, the remaining integer variables are not correlated with one another, which is a good sign for modeling, since it reduces bias.
The data indicate that 77% of the surveyed people earn <=50K and 23% earn >50K.
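These percentages follow directly from the class counts above (1610 / 2090 ≈ 0.77 and 480 / 2090 ≈ 0.23); a one-line sketch to compute them:
# proportion of each income class in the filtered data
round(prop.table(table(df$`ganancia-año`)), 2)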
4.- Modeling. Target variable: ganancia-año. PREDICTIVE TASK - CLASSIFICATION: determine which people earn more than 50K and which earn 50K or less.
# hold out ~9% of the records for testing, stratified on the target variable
indextraining <- createDataPartition(df$`ganancia-año`, p = 0.91,
                                     list = FALSE, times = 1)
datatrain <- df[indextraining, ]
datatest <- df[-indextraining, ]
fitcontrol <- trainControl(method = "CV", number = 12)  # 12-fold cross-validation
# modeling with caret (C5.0 trees / rule sets)
DTfit <- train(`ganancia-año` ~.,
data = datatrain,
method = 'C5.0', trControl = fitcontrol,
verbose = F, type = "object")
## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials
## Warning: 'trials' should be <= 7 for this object. Predictions generated using 7
## trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials
## Warning: 'trials' should be <= 8 for this object. Predictions generated using 8
## trials
## Warning: 'trials' should be <= 9 for this object. Predictions generated using 9
## trials
## Warning: 'trials' should be <= 9 for this object. Predictions generated using 9
## trials
## Warning: 'trials' should be <= 9 for this object. Predictions generated using 9
## trials
## Warning: 'trials' should be <= 4 for this object. Predictions generated using 4
## trials
## Warning: 'trials' should be <= 4 for this object. Predictions generated using 4
## trials
## Warning: 'trials' should be <= 1 for this object. Predictions generated using 1
## trials
DTfit
## C5.0
##
## 1903 samples
## 11 predictor
## 2 classes: ' <=50K', ' >50K'
##
## No pre-processing
## Resampling: Cross-Validated (12 fold)
## Summary of sample sizes: 1744, 1744, 1745, 1745, 1743, 1743, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.8145786 0.4238143
## rules FALSE 10 0.8255953 0.4348101
## rules FALSE 20 0.8261293 0.4443774
## rules TRUE 1 0.8150995 0.4143939
## rules TRUE 10 0.8224341 0.4200691
## rules TRUE 20 0.8197969 0.4321043
## tree FALSE 1 0.8140380 0.4203808
## tree FALSE 10 0.8213757 0.4247036
## tree FALSE 20 0.8187518 0.4352202
## tree TRUE 1 0.8145787 0.4195133
## tree TRUE 10 0.8166621 0.4218906
## tree TRUE 20 0.8156105 0.4406176
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 20, model = rules and
## winnow = FALSE.
# modeling with rpart
DTfit2 <- rpart(`ganancia-año` ~., data = datatrain)
DTfit2
## n= 1903
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1903 437 <=50K (0.77036259 0.22963741)
## 2) estado-civil= Divorced, Married-spouse-absent, Never-married, Separated, Widowed 1018 68 <=50K (0.93320236 0.06679764) *
## 3) estado-civil= Married-AF-spouse, Married-civ-spouse 885 369 <=50K (0.58305085 0.41694915)
## 6) ocupacion= ?, Craft-repair, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Transport-moving 455 116 <=50K (0.74505495 0.25494505) *
## 7) ocupacion= Adm-clerical, Exec-managerial, Prof-specialty, Protective-serv, Sales, Tech-support 430 177 >50K (0.41162791 0.58837209)
## 14) educacion= 10th, 11th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, Bachelors, HS-grad, Some-college 353 166 >50K (0.47025496 0.52974504)
## 28) Edad< 31.5 56 16 <=50K (0.71428571 0.28571429) *
## 29) Edad>=31.5 297 126 >50K (0.42424242 0.57575758)
## 58) educacion= 10th, 11th, 7th-8th, 9th, Assoc-acdm, HS-grad 97 42 <=50K (0.56701031 0.43298969) *
## 59) educacion= Assoc-voc, Bachelors, Some-college 200 71 >50K (0.35500000 0.64500000)
## 118) tipo-trabajo= Self-emp-not-inc, State-gov 41 18 <=50K (0.56097561 0.43902439) *
## 119) tipo-trabajo= Federal-gov, Local-gov, Private, Self-emp-inc 159 48 >50K (0.30188679 0.69811321) *
## 15) educacion= Doctorate, Masters, Prof-school 77 11 >50K (0.14285714 0.85714286) *
rpart.plot(DTfit2)
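To see which predictors drive the splits (a sketch; the values are not shown in the original), the fitted rpart object stores a variable-importance vector:
# relative importance of each predictor in the rpart tree
round(DTfit2$variable.importance, 1)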
Conclusions: the model generated with the "caret" package predicts with about 1% higher accuracy than the model generated with the "rpart" package.
On the other hand, the decision tree produced with "rpart.plot" assigns people whose marital status is Divorced, Married-spouse-absent, Never-married, Separated, or Widowed to the group earning less than 50K per year. In contrast, among people who are Married-civ-spouse or Married-AF-spouse, those in occupations such as Tech-support, Sales, Exec-managerial, Prof-specialty, Protective-serv, or Adm-clerical tend to earn more than 50K, especially when they hold a Doctorate, Masters, or Prof-school education or are older than roughly 31 years.
5.- Model evaluation with caret and rpart
pred <- predict(DTfit, datatest)
plot(pred)
table(pred, datatest$`ganancia-año`)
##
## pred <=50K >50K
## <=50K 134 29
## >50K 10 14
pred2 <- predict(DTfit2, datatest, type = "class")
plot(pred2)
table(pred2, datatest$`ganancia-año`)
##
## pred2 <=50K >50K
## <=50K 131 28
## >50K 13 15
# compute the accuracy = correct predictions / total predictions, using the confusion matrices above
acpred <- (134 + 14) / (134 + 29 + 10 + 14)
acpred
## [1] 0.7914439
acpred2 <- (131 + 15) / (131 + 28 + 13 + 15)
acpred2
## [1] 0.7807487
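Beyond raw accuracy, caret's confusionMatrix() reports accuracy, Kappa, sensitivity, and specificity in one call (a sketch; its output is not part of the original):
# fuller evaluation of both models on the test set
confusionMatrix(pred,  factor(datatest$`ganancia-año`))
confusionMatrix(pred2, factor(datatest$`ganancia-año`))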
Conclusions: in the model evaluation, the "caret" model reaches a prediction accuracy of about 79.1% versus a 20.9% error, and the "rpart" model about 78.1% versus a 21.9% error. These values indicate a reasonably good performance for the decision-tree models applied to this data set.