Classification Tree

Datos

Tenemos datos sobre la relación entre clientes y cierta tienda que realiza campañas de marketing donde se toman en cuenta datos desde la participación o no de ellos en las 5 campañas promocionales anteriores, hasta información tan específica como la cantidad que gastan en determinados productos, así como aaspectos de su vida familiar como el número de niños o su estado civil.

Cargamos primero las librerías y los datos, posteriormente le quitamos los datos faltantes y quitamos la columna de ID ya que no aporta información relevante para el modelo.

library(tidyverse)
library(caret)
library(caTools)
library(rpart)
library(rpart.plot)
library(rattle)
library(kableExtra)
data <- read.csv("C:/Users/luisa/Downloads/marketing_campaign.csv", sep = ";", fileEncoding = "UTF-8-BOM")
data <- na.omit(data)
data <- data[,-1]
kbl(head(data,20)) %>%
  kable_paper() %>%
  scroll_box(width = "900px", height = "400px")

	Year_Birth	Education	Marital_Status	Income	Kidhome	Teenhome	Dt_Customer	Recency	MntWines	MntFruits	MntMeatProducts	MntFishProducts	MntSweetProducts	MntGoldProds	NumDealsPurchases	NumWebPurchases	NumCatalogPurchases	NumStorePurchases	NumWebVisitsMonth	AcceptedCmp3	AcceptedCmp5	AcceptedCmp1	Z_CostContact	Z_Revenue	Response
1	1957	Graduation	Single	58138	0	0	2012-09-04	58	635	88	546	172	88	88	3	8	10	4	7	0	0	0	3	11	1
2	1954	Graduation	Single	46344	1	1	2014-03-08	38	11	1	6	2	1	6	2	1	1	2	5	0	0	0	3	11	0
3	1965	Graduation	Together	71613	0	0	2013-08-21	26	426	49	127	111	21	42	1	8	2	10	4	0	0	0	3	11	0
4	1984	Graduation	Together	26646	1	0	2014-02-10	26	11	4	20	10	3	5	2	2	0	4	6	0	0	0	3	11	0
5	1981	PhD	Married	58293	1	0	2014-01-19	94	173	43	118	46	27	15	5	5	3	6	5	0	0	0	3	11	0
6	1967	Master	Together	62513	0	1	2013-09-09	16	520	42	98	0	42	14	2	6	4	10	6	0	0	0	3	11	0
7	1971	Graduation	Divorced	55635	0	1	2012-11-13	34	235	65	164	50	49	27	4	7	3	7	6	0	0	0	3	11	0
8	1985	PhD	Married	33454	1	0	2013-05-08	32	76	10	56	3	1	23	2	4	0	4	8	0	0	0	3	11	0
9	1974	PhD	Together	30351	1	0	2013-06-06	19	14	0	24	3	3	2	1	3	0	2	9	0	0	0	3	11	1
10	1950	PhD	Together	5648	1	1	2014-03-13	68	28	0	6	1	1	13	1	1	0	0	20	1	0	0	3	11	0
12	1976	Basic	Married	7500	0	0	2012-11-13	59	6	16	11	11	1	16	1	2	0	3	8	0	0	0	3	11	0
13	1959	Graduation	Divorced	63033	0	0	2013-11-15	82	194	61	480	225	112	30	1	3	4	8	2	0	0	0	3	11	0
14	1952	Master	Divorced	59354	1	1	2013-11-15	53	233	2	53	3	5	14	3	6	1	5	6	0	0	0	3	11	0
15	1987	Graduation	Married	17323	0	0	2012-10-10	38	3	14	17	6	1	5	1	1	0	3	8	0	0	0	3	11	0
16	1946	PhD	Single	82800	0	0	2012-11-24	23	1006	22	115	59	68	45	1	7	6	12	3	0	1	1	3	11	1
17	1980	Graduation	Married	41850	1	1	2012-12-24	51	53	5	19	2	13	4	3	3	0	3	8	0	0	0	3	11	0
18	1946	Graduation	Together	37760	0	0	2012-08-31	20	84	5	38	150	12	28	2	4	1	6	7	0	0	0	3	11	0
19	1949	Master	Married	76995	0	1	2013-03-28	91	1012	80	498	0	16	176	2	11	4	9	5	0	0	1	3	11	0
20	1985	2n Cycle	Single	33812	1	0	2012-11-03	86	4	17	19	30	24	39	2	2	1	3	6	0	0	0	3	11	0
21	1982	Graduation	Married	37040	0	0	2012-08-08	41	86	2	73	69	38	48	1	4	2	5	8	0	0	0	3	11	0

Árbol de Clasificación

Vamos a pasar las variables de Response, Education y Marital_Status a factor y Dt_Customer a fecha para facilitar el trabajo del árbol con estas variables y también vamos dividir los datos para ajustarle a un 80% de ellos un árbol de clasificación y probarlo con el 20% restante.

data$Response <- factor(data$Response, levels = c(0,1))
data$Dt_Customer <- as.Date(data$Dt_Customer)
data$Marital_Status <- factor(data$Marital_Status,
                               levels = c("YOLO", "Absurd", "Alone", 
                                          "Single", "Together", "Married",
                                          "Divorced", "Widow"))
data$Education <- factor(data$Education, levels = c("Basic", "Graduation",
                                                      "2n Cycle", "Master",
                                                      "PhD"))

#Observemos la cantidad de 0 y 1 en la variable Response
table(data$Response)

## 
##    0    1 
## 1883  333

set.seed(123)
split <- sample.split(data$Response, SplitRatio = 0.8)
train <- subset(data, split == T)
test <- subset(data, split == F)
#Se mantiene la proporción de 0 y 1 en train y test, que es cercana al 15%
table(train$Response)

## 
##    0    1 
## 1506  266

table(test$Response)

## 
##   0   1 
## 377  67

Pasaremos un modelo completo y veremos de ahí que variables son las que más aportan al modelo.

complete <- rpart(Response ~., data = data, method = "class" )
printcp(complete)

## 
## Classification tree:
## rpart(formula = Response ~ ., data = data, method = "class")
## 
## Variables actually used in tree construction:
## [1] AcceptedCmp1        AcceptedCmp3        AcceptedCmp5       
## [4] Dt_Customer         Marital_Status      NumCatalogPurchases
## [7] Recency            
## 
## Root node error: 333/2216 = 0.15027
## 
## n= 2216 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.060060      0   1.00000 1.00000 0.050515
## 2 0.054054      1   0.93994 0.99099 0.050327
## 3 0.039039      2   0.88589 0.94895 0.049430
## 4 0.027027      3   0.84685 0.93393 0.049102
## 5 0.016517      5   0.79279 0.92192 0.048836
## 6 0.014014      7   0.75976 0.89790 0.048297
## 7 0.010000     10   0.71772 0.85285 0.047253

Una vez que tenemos las variables significativas, pasamos a ajustar el árbol de clasificación.

tree <- rpart(Response ~ AcceptedCmp1 + AcceptedCmp3 + AcceptedCmp5 + Dt_Customer +
              Marital_Status + NumCatalogPurchases + Recency,
              data = train, method = "class")

Ahora hacemos la predicción con este árbol con los datos de test y vemos las métricas de ajuste de este árbol

obs <- test$Response
pred <- predict(tree, newdata = test[-28], type = "class")
confusionMatrix(obs, pred, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 369   8
##          1  48  19
##                                           
##                Accuracy : 0.8739          
##                  95% CI : (0.8394, 0.9033)
##     No Information Rate : 0.9392          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3477          
##                                           
##  Mcnemar's Test P-Value : 1.872e-07       
##                                           
##             Sensitivity : 0.70370         
##             Specificity : 0.88489         
##          Pos Pred Value : 0.28358         
##          Neg Pred Value : 0.97878         
##              Prevalence : 0.06081         
##          Detection Rate : 0.04279         
##    Detection Prevalence : 0.15090         
##       Balanced Accuracy : 0.79430         
##                                           
##        'Positive' Class : 1               
##

Podemos ver que tiene buen accuracy y balanced accuracy, debido a la gran parte de ceros que hay en el dataset y que los predice muy bien.

Veamos como queda nuestro árbol:

fancyRpartPlot(tree)

Ahora si tratamos el modelo con probabilidades y vemos la curva ROC:

library(ROCR)

tree1 <- rpart(Response ~ AcceptedCmp1 + AcceptedCmp3 + AcceptedCmp5 + Dt_Customer +
              Marital_Status + NumCatalogPurchases + Recency,
              data = train)

pred <- ROCR::prediction(predict(tree1, type = "prob")[, 2], train$Response)
plot(performance(pred, "tpr", "fpr"))
abline(0, 1, lty = 2)

plot(performance(pred, "acc"))

Usando la información de esta curva, veamos las métricas de ajuste

obs <- as.numeric(test$Response) - 1 
head(obs)

## [1] 0 0 1 0 0 0

test_pred <- predict(tree1, newdata = test[-28], type = "prob")[,2]
head(test_pred)

##          2          8          9         18         23         27 
## 0.09370904 0.09370904 0.09370904 0.09370904 0.09370904 0.09370904

test_pred <- ifelse(test_pred < 0.4, 0, 1)
obs <- factor(obs, levels = c(0,1))
test_pred <- factor(test_pred, levels = c(0,1))
confusionMatrix(obs, test_pred, positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 369   8
##          1  48  19
##                                           
##                Accuracy : 0.8739          
##                  95% CI : (0.8394, 0.9033)
##     No Information Rate : 0.9392          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3477          
##                                           
##  Mcnemar's Test P-Value : 1.872e-07       
##                                           
##             Sensitivity : 0.70370         
##             Specificity : 0.88489         
##          Pos Pred Value : 0.28358         
##          Neg Pred Value : 0.97878         
##              Prevalence : 0.06081         
##          Detection Rate : 0.04279         
##    Detection Prevalence : 0.15090         
##       Balanced Accuracy : 0.79430         
##                                           
##        'Positive' Class : 1               
##

Ambos manera de interpretar el modelo tienen métricas similares, por lo que cualquiera de las maneras es adecuada para predecir usando este árbol de clasificación

Classification Tree

Luis Alberto Abaunza Guevara

24/11/2020

Datos

Árbol de Clasificación