The company is a shopping credit card who offer cards to the people with discount and fees to buy in the shopping, in the previous campaign the acceptance rate is near to the 20% and the late payments rate is over the 50%, they want that a algorithm returns the top 100 potentials clients.
For they it´s more important that the client don´t finish in late payment and after that the prospect accept the card, when have the results of campaign will have a return score where punish more the late payment (-7500), the accepted card with no late payment will be +5000, and if not accept no change in the result.
To solve who are the best clients i have different DB with historical information, and need to create a algorithm to solve.
I load 5 DB that we have with information, the description of each one is:
financial_report: Bank and credit card info, loans, debts, numbers of cards, etc.
social_security: Numbers of child, wage, place where live, retired, pensioner, work in public or in private.
cellphone: Information about numbers of line, company of the phone line, late payment, movements.
clients: Internal information of clients that have shopping card, late payments, days of late, movements with the card.
previous results: Information for other campaigns that the company made, who accepts the card its the most important.
I merge the 5 DB into 1 big DB with all the information
library(reshape)
financial_report <- read.csv(file = "Base Info Financiera.csv", sep = ";", header = T, stringsAsFactors = F)
social_security <- read.csv(file = "Base info de ANSES.csv", sep = ";", header = T, stringsAsFactors = F)
cellphone <- read.csv(file = "Base Info de Telefonía Movil.csv", sep = ";", header = T, stringsAsFactors = F)
clients <- read.csv(file = "Base Info de Clientes.csv", sep = ";", header = T, stringsAsFactors = F)
previous_results <- read.csv(file = "Base Resultado Campaña.csv", sep = ";", header = T, stringsAsFactors = F)
#### JOINEO BASES
df_list <- list(financial_report, social_security, cellphone,clients,previous_results)
DataBase <- merge_recurse(df_list)
rm(financial_report, social_security, cellphone,clients,previous_results, df_list)
table (DataBase$HISTORIAL_ATRASO) #Historial_Atraso = Late payment historial
El ultimo mes En el ultimo año
60 47
En el último año En el último mes
284 33
En los últimos 5 años No registra mora en los últimos 5 años
2362 1214
#El ultimo mes = the last month
#En el ultimo año = the last year
#En el último año = the last year
#En el último mes = the last month
#En los últimos 5 años = the last 5 years
#No registra mora en los últimos 5 años = no late payments in 5 years
I summary the DB to see what have each row, if we have outliers or NA values, the type of each row, and after the summary in the next steps i will take out some values or parameters
summary (DataBase)
IDENTIFICADOR.CLIENTE TIENE_CAJAAHORRO_SISTFIN TIENE_TARJETA_SISTFIN TIENE_PRESTAMO_SISTFIN INGRESO_MENSUAL
Min. : 1 Min. :0.0000 Min. :0.0000 Min. :0.000 Min. : 5000
1st Qu.:1001 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.: 25000
Median :2000 Median :1.0000 Median :0.0000 Median :0.000 Median : 36000
Mean :2000 Mean :0.7037 Mean :0.3663 Mean :0.201 Mean : 41717
3rd Qu.:3000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.: 46000
Max. :4000 Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :1000000
DEUDA_SISTEMAFINANCIERO INGRESO_FORMAL RELACION_CUOTA_INGRESO HISTORIAL_ATRASO TRABAJA.EN.SECTOR.PUBLICO
Min. : 0 Min. : 0 Min. :0.0000 Length:4000 Min. :0.0000
1st Qu.: 0 1st Qu.: 11700 1st Qu.:0.0000 Class :character 1st Qu.:0.0000
Median : 0 Median : 25600 Median :0.0000 Mode :character Median :0.0000
Mean : 4319 Mean : 30463 Mean :0.1546 Mean :0.0935
3rd Qu.: 5000 3rd Qu.: 36900 3rd Qu.:0.1600 3rd Qu.:0.0000
Max. :40000 Max. :1000000 Max. :7.2000 Max. :1.0000
NA's :58
JUBILADO PLANES.SOCIALES TIENE_MOVISTAR TIENE_PERSONAL TIENE_CLARO edad
Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. : 25.00
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 37.00
Median :0.0000 Median :0.00000 Median :0.0000 Median :0.000 Median :0.0000 Median : 50.00
Mean :0.1862 Mean :0.09875 Mean :0.4873 Mean :0.409 Mean :0.1037 Mean : 49.37
3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.: 62.00
Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :110.00
HIJOS_MENORES ESTADO_CIVIL PROVINCIA NACIONALIDAD ACEPTO.TARJETA
Min. : 0.000 Length:4000 Length:4000 Length:4000 Min. :0.000
1st Qu.: 0.000 Class :character Class :character Class :character 1st Qu.:0.000
Median : 1.000 Mode :character Mode :character Mode :character Median :0.000
Mean : 1.593 Mean :0.214
3rd Qu.: 2.000 3rd Qu.:0.000
Max. :25.000 Max. :1.000
TUVO_ATRASO_CON_LA_.TARJETA
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.1123
3rd Qu.:0.0000
Max. :1.0000
The cells of DEUDA_SISTEMAFINANCIERO refer to know if the clients have debts with a public or private bank, in this case they have some NA, i will put in 0, thinking that if it´s no info, it´s no debts
DataBase$DEUDA_SISTEMAFINANCIERO <- ifelse(is.na(DataBase$DEUDA_SISTEMAFINANCIERO), 0, DataBase$DEUDA_SISTEMAFINANCIERO )
First i search if in the cells of Nacionalidad (nationality) have nulls or no
After that i search the uniques value of Nacionalidad (nationality), Provincia (State), Estado civil (civil status)
I create the table of Nationality to know the proportion of each country
I delete Nationality and State because don´t bring me important information
I made a table of Civil status to know if it´s important or no
I made the table of the acceptation of the card to see the balance of the class (21% accept the card)
anyNA(DataBase$NACIONALIDAD)
[1] FALSE
unique(DataBase$NACIONALIDAD)
[1] "Argentina" "Armenia" "" "Chile" "Albania"
unique(DataBase$PROVINCIA)
[1] "BUENOS AIRES" "CHUBUT" "CORDOBA"
[4] "SANTA CRUZ" "CIUDAD AUTÓNOMA DE BUENOS AIRES" "MENDOZA"
[7] "SAN JUAN" "SANTA FE" "RIO NEGRO"
[10] "LA PAMPA" "CORRIENTES" "ENTRE RIOS"
[13] "NEUQUEN" "TIERRA DEL FUEGO" "CATAMARCA"
[16] "SALTA" "TUCUMAN" ""
[19] "CHACO" "MISIONES" "LA RIOJA"
[22] "JUJUY" "SANTIAGO DEL ESTERO" "SAN LUIS"
[25] "FORMOSA"
unique(DataBase$ESTADO_CIVIL)
[1] "CASADO" "DIVORCIADO" "VIUDO" "SOLTERO" ""
table(DataBase$NACIONALIDAD)
Albania Argentina Armenia Chile
197 1 3773 28 1
DataBase$NACIONALIDAD <- NULL
DataBase$PROVINCIA <- NULL
table (DataBase$ESTADO_CIVIL)
CASADO DIVORCIADO SOLTERO VIUDO
30 1795 409 1723 43
prop.table(table(DataBase$ACEPTO.TARJETA))
0 1
0.786 0.214
The class of the late payment (Historial_atraso) will be 1 if the person don´t regist late payment in the last five years
Also if the civil status is in blank, i determined that is soltero (single)
We see in a table the proportion of each civil status
DataBase$HISTORIAL_ATRASO <- ifelse(DataBase$HISTORIAL_ATRASO=="En los últimos 5 años" |DataBase$HISTORIAL_ATRASO=="No registra mora en los últimos 5 años", 0, 1 )
DataBase$ESTADO_CIVIL <- ifelse(DataBase$ESTADO_CIVIL == "", "SOLTERO",DataBase$ESTADO_CIVIL )
table (DataBase$ESTADO_CIVIL)
CASADO DIVORCIADO SOLTERO VIUDO
1795 409 1753 43
I made boxplot graph of ingreso mensual (monthly income), ingreso formal (formal income), relacion cuota ingreso (ratio between income and monthly payment), edad (age) and hijos menores (minor children)
After see the outliers, i put a maximun for each category
I summary again the DB to see the new values
boxplot(DataBase$INGRESO_MENSUAL)
boxplot(DataBase$INGRESO_FORMAL)
boxplot(DataBase$RELACION_CUOTA_INGRESO)
boxplot(DataBase$edad)
boxplot(DataBase$HIJOS_MENORES)
DataBase$INGRESO_MENSUAL <- ifelse(DataBase$INGRESO_MENSUAL > 80000, 80000,DataBase$INGRESO_MENSUAL)
DataBase$INGRESO_FORMAL <- ifelse(DataBase$INGRESO_FORMAL > 80000, 80000,DataBase$INGRESO_FORMAL)
DataBase$RELACION_CUOTA_INGRESO <- ifelse(DataBase$RELACION_CUOTA_INGRESO> 0.7, 0.7, DataBase$RELACION_CUOTA_INGRESO)
DataBase$edad <- ifelse(DataBase$edad > 80, 80, DataBase$edad)
DataBase$HIJOS_MENORES <- ifelse(DataBase$HIJOS_MENORES > 10, 10, DataBase$HIJOS_MENORES )
summary(DataBase)
IDENTIFICADOR.CLIENTE TIENE_CAJAAHORRO_SISTFIN TIENE_TARJETA_SISTFIN TIENE_PRESTAMO_SISTFIN INGRESO_MENSUAL
Min. : 1 Min. :0.0000 Min. :0.0000 Min. :0.000 Min. : 5000
1st Qu.:1001 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:25000
Median :2000 Median :1.0000 Median :0.0000 Median :0.000 Median :36000
Mean :2000 Mean :0.7037 Mean :0.3663 Mean :0.201 Mean :36427
3rd Qu.:3000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.:46000
Max. :4000 Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :80000
DEUDA_SISTEMAFINANCIERO INGRESO_FORMAL RELACION_CUOTA_INGRESO HISTORIAL_ATRASO TRABAJA.EN.SECTOR.PUBLICO
Min. : 0 Min. : 0 Min. :0.0000 Min. :0.000 Min. :0.0000
1st Qu.: 0 1st Qu.:11700 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
Median : 0 Median :25600 Median :0.0000 Median :0.000 Median :0.0000
Mean : 4256 Mean :25173 Mean :0.1239 Mean :0.106 Mean :0.0935
3rd Qu.: 5000 3rd Qu.:36900 3rd Qu.:0.1600 3rd Qu.:0.000 3rd Qu.:0.0000
Max. :40000 Max. :80000 Max. :0.7000 Max. :1.000 Max. :1.0000
JUBILADO PLANES.SOCIALES TIENE_MOVISTAR TIENE_PERSONAL TIENE_CLARO edad
Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :25.0
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:37.0
Median :0.0000 Median :0.00000 Median :0.0000 Median :0.000 Median :0.0000 Median :50.0
Mean :0.1862 Mean :0.09875 Mean :0.4873 Mean :0.409 Mean :0.1037 Mean :49.3
3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:62.0
Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :80.0
HIJOS_MENORES ESTADO_CIVIL ACEPTO.TARJETA TUVO_ATRASO_CON_LA_.TARJETA
Min. : 0.000 Length:4000 Min. :0.000 Min. :0.0000
1st Qu.: 0.000 Class :character 1st Qu.:0.000 1st Qu.:0.0000
Median : 1.000 Mode :character Median :0.000 Median :0.0000
Mean : 1.331 Mean :0.214 Mean :0.1123
3rd Qu.: 2.000 3rd Qu.:0.000 3rd Qu.:0.0000
Max. :10.000 Max. :1.000 Max. :1.0000
I separated the DB into two parts, the 80% of the results in training to the machine learning model, and the 20% in testing to predict and see how its worked the algorithm.
The new database are Entrenamiento (training) and Validacion (testing)
set.seed(3234)
muestra <- floor(nrow(DataBase)*0.8)
trIndex <- sample(nrow(DataBase), muestra, replace=F)
vaIndex <- seq_len(nrow(DataBase))[!(seq_len(nrow(DataBase)) %in% trIndex)]
Entrenamiento <- DataBase[trIndex,]
Validacion <- DataBase[vaIndex,]
A logistic regression to determinate if the person will accept or no the card its the first step, in this case i want to determinate the acceptation or no of the card with this data:
TIENE_CAJAAHORRO_SISTFIN (Have check account in bank)
INGRESO_FORMAL (Formal income)
edad (age)
JUBILADO (retired)
PLANES.SOCIALES (Have social government income)
DataBaseRegresionTarjeta <- DataBase
modelo <- glm(DataBaseRegresionTarjeta$ACEPTO.TARJETA ~ TIENE_CAJAAHORRO_SISTFIN + INGRESO_FORMAL + edad + JUBILADO + PLANES.SOCIALES, data=DataBaseRegresionTarjeta, family=binomial(link="logit"))
summary(modelo)
Call:
glm(formula = DataBaseRegresionTarjeta$ACEPTO.TARJETA ~ TIENE_CAJAAHORRO_SISTFIN +
INGRESO_FORMAL + edad + JUBILADO + PLANES.SOCIALES, family = binomial(link = "logit"),
data = DataBaseRegresionTarjeta)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5749 -0.5527 -0.3941 -0.3418 2.4200
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.103e+00 2.266e-01 -4.870 1.12e-06 ***
TIENE_CAJAAHORRO_SISTFIN -9.011e-01 9.451e-02 -9.534 < 2e-16 ***
INGRESO_FORMAL -7.924e-06 2.623e-06 -3.021 0.00252 **
edad -7.614e-03 4.483e-03 -1.698 0.08941 .
JUBILADO 2.295e+00 1.436e-01 15.983 < 2e-16 ***
PLANES.SOCIALES 2.543e+00 1.303e-01 19.519 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4153.7 on 3999 degrees of freedom
Residual deviance: 3114.3 on 3994 degrees of freedom
AIC: 3126.3
Number of Fisher Scoring iterations: 5
I predict into the testing model the col of accept the card or no.
prediccion <- predict(modelo,Validacion,type='response')
str(prediccion)
Named num [1:800] 0.6555 0.1532 0.0638 0.077 0.0532 ...
- attr(*, "names")= chr [1:800] "1" "2" "13" "14" ...
summary(prediccion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.04299 0.07345 0.09684 0.21805 0.32159 0.96340
Validacion$Pred <- prediccion
I calculated the ROCR curve, the idea it´s that this results was over 0.75
library(ROCR)
pred_lrr <- prediction(prediccion, Validacion$ACEPTO.TARJETA)
auc <- performance(pred_lrr,"auc")
auc@y.values[[1]]
[1] 0.8186458
We create a new database where only have the people who accept the cards, because if we want to made the late payment with all people (accepted or no the card) we will have a fail result
DataBaseMora <- subset(DataBase, subset = DataBase$ACEPTO.TARJETA == 1)
I separated the DB of late payment into two parts, the 80% of the results in training to the machine learning model, and the 20% in testing to predict and see how its worked the algorithm.
The new database are Entrenamiento (training) and Validacion (testing)
set.seed(3234)
muestraMora <- floor(nrow(DataBaseMora)*0.8)
trIndexMora <- sample(nrow(DataBaseMora), muestraMora, replace=F)
vaIndexMora <- seq_len(nrow(DataBaseMora))[!(seq_len(nrow(DataBaseMora)) %in% trIndexMora)]
EntrenamientoMora <- DataBaseMora[trIndexMora,]
ValidacionMora <- DataBaseMora[vaIndexMora,]
In this case for the late payment i will use two methods, the tree and the logistic regression, first i created the tree with :
TIENE_CAJAAHORRO_SISTFIN (Have check account in bank)
TIENE_PRESTAMO_SISTFIN (Have loans in banks)
INGRESO_MENSUAL (monthly income)
DEUDA_SISTEMAFINANCIERO (debts in the financial system)
JUBILADO (retired)
PLANES.SOCIALES (Have social government income)
library(rpart)
library(rpart.plot)
fit <- rpart(EntrenamientoMora$TUVO_ATRASO_CON_LA_.TARJETA ~ TIENE_CAJAAHORRO_SISTFIN + TIENE_PRESTAMO_SISTFIN + INGRESO_MENSUAL + DEUDA_SISTEMAFINANCIERO + JUBILADO + PLANES.SOCIALES, data=EntrenamientoMora)
rpart.plot(fit, extra=0, type=2)
I predict into the late payment testing model the col of have late payment or no with a tree.
prediccionMora <- predict(fit,ValidacionMora,method='class')
ValidacionMora$Pred <- prediccionMora
A logistic regression to determinate if the person will be or no in the late payment its the next step, in this case i want to determinate with the same data of the tree
modeloMora <- glm(DataBaseMora$TUVO_ATRASO_CON_LA_.TARJETA ~ TIENE_CAJAAHORRO_SISTFIN + TIENE_PRESTAMO_SISTFIN + INGRESO_MENSUAL + DEUDA_SISTEMAFINANCIERO + JUBILADO + PLANES.SOCIALES, data=DataBaseMora, family=binomial(link="logit"))
summary(modeloMora)
Call:
glm(formula = DataBaseMora$TUVO_ATRASO_CON_LA_.TARJETA ~ TIENE_CAJAAHORRO_SISTFIN +
TIENE_PRESTAMO_SISTFIN + INGRESO_MENSUAL + DEUDA_SISTEMAFINANCIERO +
JUBILADO + PLANES.SOCIALES, family = binomial(link = "logit"),
data = DataBaseMora)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5904 -0.6308 0.1900 0.6335 2.5658
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.188e+00 2.810e-01 7.787 6.84e-15 ***
TIENE_CAJAAHORRO_SISTFIN -1.044e+00 2.314e-01 -4.512 6.43e-06 ***
TIENE_PRESTAMO_SISTFIN 1.317e+00 2.834e-01 4.645 3.40e-06 ***
INGRESO_MENSUAL -3.592e-05 6.630e-06 -5.417 6.06e-08 ***
DEUDA_SISTEMAFINANCIERO 5.778e-05 1.653e-05 3.495 0.000474 ***
JUBILADO -2.638e+00 1.892e-01 -13.941 < 2e-16 ***
PLANES.SOCIALES 1.125e+00 2.037e-01 5.526 3.28e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1184.61 on 855 degrees of freedom
Residual deviance: 748.74 on 849 degrees of freedom
AIC: 762.74
Number of Fisher Scoring iterations: 5
I predict into the late payment testing model the col of have late payment or no with a logistic.
prediccionMora2 <- predict(modeloMora,ValidacionMora,type='response')
ValidacionMora$Pred2 <- prediccionMora2
I calculated the ROCR curve, the idea it´s that this results was over 0.75
pred_lrr <- prediction(prediccionMora, ValidacionMora$TUVO_ATRASO_CON_LA_.TARJETA)
auc <- performance(pred_lrr,"auc")
auc@y.values[[1]]
[1] 0.8491076
I load the database with information of the new potentials clients that the company card want to call to do a marketing campaign
Base_Nuevos_clientes <- read.csv(file = "BASE TEST_sin_targets.csv", sep = ";", header = T, stringsAsFactors = F)
I load the database with information of the new potentials clients that the company card want to call to do a marketing campaign
summary(Base_Nuevos_clientes)
IDENTIFICADOR.CLIENTE TIENE_CAJAAHORRO_SISTFIN TIENE_TARJETA_SISTFIN TIENE_PRESTAMO_SISTFIN
Min. :4001 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:4250 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :4500 Median :1.0000 Median :0.0000 Median :0.0000
Mean :4500 Mean :0.7097 Mean :0.3734 Mean :0.1832
3rd Qu.:4750 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :4999 Max. :1.0000 Max. :1.0000 Max. :1.0000
TRABAJA.EN.SECTOR.PUBLICO JUBILADO PLANES.SOCIALES edad HIJOS_MENORES
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 25.00 Min. :0.000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 37.00 1st Qu.:0.000
Median :0.0000 Median :0.0000 Median :0.0000 Median : 50.00 Median :1.000
Mean :0.1031 Mean :0.1792 Mean :0.0981 Mean : 49.21 Mean :1.134
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.: 62.00 3rd Qu.:2.000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :110.00 Max. :8.000
ESTADO_CIVIL PROVINCIA NACIONALIDAD TIENE_MOVISTAR TIENE_PERSONAL TIENE_CLARO
Length:999 Length:999 Length:999 Min. :0.0000 Min. :0.0000 Min. :0.0000
Class :character Class :character Class :character 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Mode :character Mode :character Mode :character Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.3363 Mean :0.3283 Mean :0.3353
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000
INGRESO_MENSUAL DEUDA_SISTEMAFINANCIERO INGRESO_FORMAL RELACION_CUOTA_INGRESO HISTORIAL_ATRASO
Min. : 5000 Min. : 0 Min. : 0 Min. :0.0000 Length:999
1st Qu.:25000 1st Qu.: 0 1st Qu.:12600 1st Qu.:0.0000 Class :character
Median :35000 Median : 0 Median :25600 Median :0.0000 Mode :character
Mean :36288 Mean : 4498 Mean :25243 Mean :0.1591
3rd Qu.:46000 3rd Qu.: 6000 3rd Qu.:36000 3rd Qu.:0.1900
Max. :75000 Max. :40000 Max. :75000 Max. :3.6000
I will make in only one step all the clean and transform to the DB similar to the original
Base_Nuevos_clientes$NACIONALIDAD <- NULL
Base_Nuevos_clientes$PROVINCIA <- NULL
Base_Nuevos_clientes$HISTORIAL_ATRASO <- ifelse(Base_Nuevos_clientes$HISTORIAL_ATRASO=="En los últimos 5 años" |Base_Nuevos_clientes$HISTORIAL_ATRASO=="No registra mora en los últimos 5 años", 0, 1 )
Base_Nuevos_clientes$ESTADO_CIVIL <- ifelse(Base_Nuevos_clientes$ESTADO_CIVIL == "", "SOLTERO",Base_Nuevos_clientes$ESTADO_CIVIL )
Base_Nuevos_clientes$INGRESO_MENSUAL <- ifelse(Base_Nuevos_clientes$INGRESO_MENSUAL > 80000, 80000,Base_Nuevos_clientes$INGRESO_MENSUAL)
Base_Nuevos_clientes$INGRESO_FORMAL <- ifelse(Base_Nuevos_clientes$INGRESO_FORMAL > 80000, 80000,Base_Nuevos_clientes$INGRESO_FORMAL)
Base_Nuevos_clientes$RELACION_CUOTA_INGRESO <- ifelse(Base_Nuevos_clientes$RELACION_CUOTA_INGRESO> 0.7, 0.7, Base_Nuevos_clientes$RELACION_CUOTA_INGRESO)
Base_Nuevos_clientes$edad <- ifelse(Base_Nuevos_clientes$edad > 80, 80, Base_Nuevos_clientes$edad)
Base_Nuevos_clientes$HIJOS_MENORES <- ifelse(Base_Nuevos_clientes$HIJOS_MENORES > 10, 10, Base_Nuevos_clientes$HIJOS_MENORES )
I will predict the late payment into the new clients with the tree and regression made before, with a assemble between row means.
prediccionMora <- predict(fit,Base_Nuevos_clientes,method='class')
prediccionMora2 <- predict(modeloMora,Base_Nuevos_clientes,type='response')
summary(prediccionMora)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.07692 0.21875 0.67857 0.57140 0.67857 1.00000
Base_Nuevos_clientes$PredMora <- prediccionMora
Base_Nuevos_clientes$PredMora2 <- prediccionMora2
Base_Nuevos_clientes$ensamble <- rowMeans(Base_Nuevos_clientes[,c("PredMora", "PredMora2")])
I will predict the acceptation of the card into the new clients with the regression made before
prediccion <- predict(modelo,Base_Nuevos_clientes,type='response')
str(prediccion)
Named num [1:999] 0.0715 0.0626 0.0814 0.077 0.0581 ...
- attr(*, "names")= chr [1:999] "1" "2" "3" "4" ...
summary(prediccion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.04465 0.07147 0.08976 0.20912 0.21266 0.96340
Base_Nuevos_clientes$PredAceptacion <- prediccion
Now the next step it´s create the table with the 100 best potential clients, thinking that it´s more important that the new client don´t finish in late payment and then the probability of accept the card.
Only sned the IDENTIFICADOR.CLIENTE (Client ID)
summary(Base_Nuevos_clientes$PredAceptacion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.04465 0.07147 0.08976 0.20912 0.21266 0.96340
summary(Base_Nuevos_clientes$PredMora)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.07692 0.21875 0.67857 0.57140 0.67857 1.00000
BaseMejoresClientes <- subset(Base_Nuevos_clientes, subset = Base_Nuevos_clientes$PredAceptacion > 0.20 & Base_Nuevos_clientes$ensamble < 0.175)
BaseMejoresClientes <- BaseMejoresClientes %>% select(IDENTIFICADOR.CLIENTE)
The final step is to save the table in a CSV with the 100 best client to send a the card company
write.table (BaseMejoresClientes,file="Clientes_potenciales.csv",row.names=FALSE,col.names=TRUE,quote=TRUE,sep=",")
After a time the company gave me back the results of the campaign over this 100 clients
The result have the 100 observation of the clients that was sended and says if accept the card of it was in late payment.
results <- read.csv(file = "resultado.csv", sep = ";", header = T, stringsAsFactors = F)
Let´s graph the performance, remember that acceptance of card is 20% and late payment is 50%
library(ggplot2)
DF_graph = data.frame(resultado = "variable", acceptance = sum(results$ACEPTO.TARJETA)/100 , latepayment = sum(results$TUVO_ATRASO_CON_LA_.TARJETA)/100)
DF_graph = melt(data= DF_graph, id="resultado")
DF_graph %>%
ggplot(aes(x = variable , y= value, fill = variable)) +
geom_col() +
labs(x = "variable", y = "% of", title = "Final results of campaign") +
scale_y_continuous(limits = c(0.00, 0.99)) +
theme_replace()
The algorithm performance was very good, the percentage of acceptation was improve from 20% to 43% and the late payment that was the main problem was reduced from 50% to 3%.
This work demonstrates the importance of machine learning to improve business