We have data on the relationship between customers and a store that runs marketing campaigns. The dataset ranges from whether each customer participated in the five previous promotional campaigns, to details as specific as how much they spend on particular products, along with aspects of their family life such as the number of children at home and their marital status.
We first load the libraries and the data, then drop the rows with missing values and remove the ID column, since it carries no information relevant to the model.
library(tidyverse)
library(caret)
library(caTools)
library(rpart)
library(rpart.plot)
library(rattle)
library(kableExtra)
# Read the semicolon-separated data, drop rows with missing values,
# and remove the ID column, which adds nothing to the model
data <- read.csv("C:/Users/luisa/Downloads/marketing_campaign.csv", sep = ";", fileEncoding = "UTF-8-BOM")
data <- na.omit(data)
data <- data[, -1]
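As a quick sanity check, the cleaned data should have 2216 rows (the model output further down confirms n = 2216) and 28 variables once the ID column is gone:

# Expect 2216 rows and 28 columns after cleaning
dim(data)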
kbl(head(data, 20)) %>%
  kable_paper() %>%
  scroll_box(width = "900px", height = "400px")

| | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1957 | Graduation | Single | 58138 | 0 | 0 | 2012-09-04 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 2 | 1954 | Graduation | Single | 46344 | 1 | 1 | 2014-03-08 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 1965 | Graduation | Together | 71613 | 0 | 0 | 2013-08-21 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 1984 | Graduation | Together | 26646 | 1 | 0 | 2014-02-10 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 5 | 1981 | PhD | Married | 58293 | 1 | 0 | 2014-01-19 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 6 | 1967 | Master | Together | 62513 | 0 | 1 | 2013-09-09 | 16 | 520 | 42 | 98 | 0 | 42 | 14 | 2 | 6 | 4 | 10 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 7 | 1971 | Graduation | Divorced | 55635 | 0 | 1 | 2012-11-13 | 34 | 235 | 65 | 164 | 50 | 49 | 27 | 4 | 7 | 3 | 7 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 8 | 1985 | PhD | Married | 33454 | 1 | 0 | 2013-05-08 | 32 | 76 | 10 | 56 | 3 | 1 | 23 | 2 | 4 | 0 | 4 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 9 | 1974 | PhD | Together | 30351 | 1 | 0 | 2013-06-06 | 19 | 14 | 0 | 24 | 3 | 3 | 2 | 1 | 3 | 0 | 2 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 10 | 1950 | PhD | Together | 5648 | 1 | 1 | 2014-03-13 | 68 | 28 | 0 | 6 | 1 | 1 | 13 | 1 | 1 | 0 | 0 | 20 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 12 | 1976 | Basic | Married | 7500 | 0 | 0 | 2012-11-13 | 59 | 6 | 16 | 11 | 11 | 1 | 16 | 1 | 2 | 0 | 3 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 13 | 1959 | Graduation | Divorced | 63033 | 0 | 0 | 2013-11-15 | 82 | 194 | 61 | 480 | 225 | 112 | 30 | 1 | 3 | 4 | 8 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 14 | 1952 | Master | Divorced | 59354 | 1 | 1 | 2013-11-15 | 53 | 233 | 2 | 53 | 3 | 5 | 14 | 3 | 6 | 1 | 5 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 15 | 1987 | Graduation | Married | 17323 | 0 | 0 | 2012-10-10 | 38 | 3 | 14 | 17 | 6 | 1 | 5 | 1 | 1 | 0 | 3 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 16 | 1946 | PhD | Single | 82800 | 0 | 0 | 2012-11-24 | 23 | 1006 | 22 | 115 | 59 | 68 | 45 | 1 | 7 | 6 | 12 | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 3 | 11 | 1 |
| 17 | 1980 | Graduation | Married | 41850 | 1 | 1 | 2012-12-24 | 51 | 53 | 5 | 19 | 2 | 13 | 4 | 3 | 3 | 0 | 3 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 18 | 1946 | Graduation | Together | 37760 | 0 | 0 | 2012-08-31 | 20 | 84 | 5 | 38 | 150 | 12 | 28 | 2 | 4 | 1 | 6 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 19 | 1949 | Master | Married | 76995 | 0 | 1 | 2013-03-28 | 91 | 1012 | 80 | 498 | 0 | 16 | 176 | 2 | 11 | 4 | 9 | 5 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 11 | 0 |
| 20 | 1985 | 2n Cycle | Single | 33812 | 1 | 0 | 2012-11-03 | 86 | 4 | 17 | 19 | 30 | 24 | 39 | 2 | 2 | 1 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 21 | 1982 | Graduation | Married | 37040 | 0 | 0 | 2012-08-08 | 41 | 86 | 2 | 73 | 69 | 38 | 48 | 1 | 4 | 2 | 5 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
We convert Response, Education, and Marital_Status to factors and Dt_Customer to a date, to make these variables easier for the tree to handle. We also split the data so that a classification tree is fit on 80% of it and tested on the remaining 20%.
# Response is the class label; Dt_Customer becomes a proper Date
data$Response <- factor(data$Response, levels = c(0, 1))
data$Dt_Customer <- as.Date(data$Dt_Customer)
# Explicit levels for the categorical predictors
data$Marital_Status <- factor(data$Marital_Status,
                              levels = c("YOLO", "Absurd", "Alone",
                                         "Single", "Together", "Married",
                                         "Divorced", "Widow"))
data$Education <- factor(data$Education, levels = c("Basic", "Graduation",
                                                    "2n Cycle", "Master",
                                                    "PhD"))
# Look at how many 0s and 1s the Response variable contains
table(data$Response)
##
##    0    1
## 1883  333
# Reproducible, stratified 80/20 split on the response
set.seed(123)
split <- sample.split(data$Response, SplitRatio = 0.8)
train <- subset(data, split == TRUE)
test <- subset(data, split == FALSE)
# The proportion of 0s and 1s, close to 15% ones, is preserved in train and test
table(train$Response)
##
##    0    1
## 1506  266
table(test$Response)
##
##   0   1
## 377  67
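The counts above imply 266/1772 ≈ 0.150 in train and 67/444 ≈ 0.151 in test, matching the 333/2216 ≈ 0.150 in the full data; this can be confirmed directly:

# Class proportions in each partition (both close to 15% ones)
prop.table(table(train$Response))
prop.table(table(test$Response))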
We will first fit a full model and see from it which variables contribute the most.
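The fitting call itself is omitted in the original; judging by the formula echoed in the summary below, it was equivalent to the following (the object name full_tree is our choice):

# Full classification tree on every predictor; note that, per the echoed call
# below, it is fit on the complete dataset (data), not on train
full_tree <- rpart(Response ~ ., data = data, method = "class")
printcp(full_tree)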
##
## Classification tree:
## rpart(formula = Response ~ ., data = data, method = "class")
##
## Variables actually used in tree construction:
## [1] AcceptedCmp1 AcceptedCmp3 AcceptedCmp5
## [4] Dt_Customer Marital_Status NumCatalogPurchases
## [7] Recency
##
## Root node error: 333/2216 = 0.15027
##
## n= 2216
##
## CP nsplit rel error xerror xstd
## 1 0.060060 0 1.00000 1.00000 0.050515
## 2 0.054054 1 0.93994 0.99099 0.050327
## 3 0.039039 2 0.88589 0.94895 0.049430
## 4 0.027027 3 0.84685 0.93393 0.049102
## 5 0.016517 5 0.79279 0.92192 0.048836
## 6 0.014014 7 0.75976 0.89790 0.048297
## 7 0.010000 10 0.71772 0.85285 0.047253
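As an aside, the CP table above is what rpart uses for cost-complexity pruning. Instead of hand-picking the listed variables, one could keep the subtree with the lowest cross-validated error; a sketch, using the hypothetical full_tree object introduced above:

# Prune at the CP value with the smallest cross-validated error (xerror)
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)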
With the relevant variables identified, we fit the classification tree on the training data.
tree <- rpart(Response ~ AcceptedCmp1 + AcceptedCmp3 + AcceptedCmp5 + Dt_Customer +
                Marital_Status + NumCatalogPurchases + Recency,
              data = train, method = "class")

We now use this tree to predict on the test data and look at its fit metrics.
obs <- test$Response
# Drop column 28 (Response) before predicting
pred <- predict(tree, newdata = test[-28], type = "class")
confusionMatrix(obs, pred, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 369 8
## 1 48 19
##
## Accuracy : 0.8739
## 95% CI : (0.8394, 0.9033)
## No Information Rate : 0.9392
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3477
##
## Mcnemar's Test P-Value : 1.872e-07
##
## Sensitivity : 0.70370
## Specificity : 0.88489
## Pos Pred Value : 0.28358
## Neg Pred Value : 0.97878
## Prevalence : 0.06081
## Detection Rate : 0.04279
## Detection Prevalence : 0.15090
## Balanced Accuracy : 0.79430
##
## 'Positive' Class : 1
##
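One caveat: caret's confusionMatrix() expects the predictions as its first argument and the observations as the second (confusionMatrix(data, reference)). The call above passes them the other way around, so in the printed output Sensitivity and Pos Pred Value (and likewise Specificity and Neg Pred Value) trade places relative to the usual convention: the tree's actual recall for class 1 is 19/67 ≈ 0.28. The conventional call would be:

# Predictions first, observations second, per caret's signature
confusionMatrix(pred, obs, positive = "1")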
Accuracy and balanced accuracy look good, but this is largely because zeros make up the bulk of the dataset and the tree predicts them very well; as noted above, it identifies only 19 of the 67 actual responders in the test set.
Let's see what the tree looks like:
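The rendered figure is not reproduced here; with rpart.plot and rattle already loaded, either of these calls draws the fitted tree (which one the author used is not shown):

# Two common ways to visualize an rpart tree
rpart.plot(tree)
# fancyRpartPlot(tree)   # rattle's styled version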
Now let's work with the model's predicted probabilities instead and look at the ROC curve:
library(ROCR)
tree1 <- rpart(Response ~ AcceptedCmp1 + AcceptedCmp3 + AcceptedCmp5 + Dt_Customer +
Marital_Status + NumCatalogPurchases + Recency,
data = train)
pred <- ROCR::prediction(predict(tree1, type = "prob")[, 2], train$Response)
plot(performance(pred, "tpr", "fpr"))
abline(0, 1, lty = 2)
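The area under this curve is not reported in the source; it can be extracted from the same ROCR prediction object (a sketch; note this is the training-set curve):

# AUC of the training-set ROC curve
auc <- performance(pred, measure = "auc")
unlist(auc@y.values)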
Using the information from this curve, let's look at the fit metrics. We first need the predicted class-1 probabilities on the test set (the assignment below is reconstructed; only its output appears in the source):

head(obs)
## [1] 0 0 1 0 0 0
## Levels: 0 1
test_pred <- predict(tree1, newdata = test)[, 2]
head(test_pred)
##          2          8          9         18         23         27
## 0.09370904 0.09370904 0.09370904 0.09370904 0.09370904 0.09370904
# Classify as 1 when the predicted probability reaches 0.4, a cutoff chosen from the ROC curve
test_pred <- ifelse(test_pred < 0.4, 0, 1)
obs <- factor(obs, levels = c(0, 1))
test_pred <- factor(test_pred, levels = c(0, 1))
confusionMatrix(obs, test_pred, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 369 8
## 1 48 19
##
## Accuracy : 0.8739
## 95% CI : (0.8394, 0.9033)
## No Information Rate : 0.9392
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3477
##
## Mcnemar's Test P-Value : 1.872e-07
##
## Sensitivity : 0.70370
## Specificity : 0.88489
## Pos Pred Value : 0.28358
## Neg Pred Value : 0.97878
## Prevalence : 0.06081
## Detection Rate : 0.04279
## Detection Prevalence : 0.15090
## Balanced Accuracy : 0.79430
##
## 'Positive' Class : 1
##
Both ways of using the model, direct class predictions and thresholded probabilities, give essentially identical metrics, so either one is a reasonable way to predict with this classification tree.