Inicio del proyecto fiabilidad
Ver video de explicación https://youtu.be/T6YHKmvO_yo carageremos los archivos de tarjeta de credito
card_base <- read.csv("CardBase.csv", stringsAsFactors=FALSE)
customer_base <- read.csv("CustomerBase.csv", stringsAsFactors=FALSE)
transaction_base <- read.csv("TransactionBase.csv", stringsAsFactors=FALSE)
revisemos card base
summary(card_base)
Card_Number Card_Family Credit_Limit Cust_ID
Length:500 Length:500 Min. : 2000 Length:500
Class :character Class :character 1st Qu.: 35000 Class :character
Mode :character Mode :character Median :139000 Mode :character
Mean :252614
3rd Qu.:412500
Max. :899000
revisemos la base de clientes
summary(customer_base)
Cust_ID Age Customer_Segment Customer_Vintage_Group
Length:5674 Min. :20.00 Length:5674 Length:5674
Class :character 1st Qu.:27.00 Class :character Class :character
Mode :character Median :35.00 Mode :character Mode :character
Mean :35.06
3rd Qu.:43.00
Max. :50.00
por ultimo veamos la tabla de transacciones
summary(transaction_base)
Transaction_ID Transaction_Date Credit_Card_ID Transaction_Value Transaction_Segment
Length:10000 Length:10000 Length:10000 Min. : 103 Length:10000
Class :character Class :character Class :character 1st Qu.:12318 Class :character
Mode :character Mode :character Mode :character Median :24721 Mode :character
Mean :24759
3rd Qu.:37036
Max. :49995
como podemos observar necesitamos relacionar los datasets y prepar la relacion de joins
transaction_base<- transaction_base%>%rename(Card_Number = Credit_Card_ID)
full_card <- inner_join(card_base,customer_base , by= c("Cust_ID") )
full_card <- inner_join(full_card,transaction_base , "Card_Number" )
ya hemos armado nuestra tabla, en este momento procedremos a ver la data y la estructura
head(full_card)
summary(full_card)
Card_Number Card_Family Credit_Limit Cust_ID Age Customer_Segment
Length:10000 Length:10000 Min. : 2000 Length:10000 Min. :20.00 Length:10000
Class :character Class :character 1st Qu.: 35000 Class :character 1st Qu.:28.00 Class :character
Mode :character Mode :character Median :140000 Mode :character Median :36.00 Mode :character
Mean :250840 Mean :35.57
3rd Qu.:406000 3rd Qu.:43.00
Max. :899000 Max. :50.00
Customer_Vintage_Group Transaction_ID Transaction_Date Transaction_Value Transaction_Segment
Length:10000 Length:10000 Length:10000 Min. : 103 Length:10000
Class :character Class :character Class :character 1st Qu.:12318 Class :character
Mode :character Mode :character Mode :character Median :24721 Mode :character
Mean :24759
3rd Qu.:37036
Max. :49995
Convertiremos nuestros campos caracters en factores dentro de la nueva tabla
full_card$Card_Family <- as.factor(full_card$Card_Family)
full_card$Customer_Segment <- as.factor(full_card$Customer_Segment)
full_card$Customer_Vintage_Group <- as.factor(full_card$Customer_Vintage_Group)
full_card$Transaction_Segment <- as.factor(full_card$Transaction_Segment)
full_card$fecha=as.Date(full_card$Transaction_Date,format="%d-%b-%y")
full_card$week_day=weekdays(full_card$fecha)
full_card$week_day <- as.factor(full_card$week_day)
Veamos nuestra estructura nueva
str(full_card)
'data.frame': 10000 obs. of 13 variables:
$ Card_Number : chr "8638-5407-3631-8196" "8638-5407-3631-8196" "8638-5407-3631-8196" "8638-5407-3631-8196" ...
$ Card_Family : Factor w/ 3 levels "Gold","Platinum",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Credit_Limit : int 530000 530000 530000 530000 530000 530000 530000 530000 530000 530000 ...
$ Cust_ID : chr "CC67088" "CC67088" "CC67088" "CC67088" ...
$ Age : int 27 27 27 27 27 27 27 27 27 27 ...
$ Customer_Segment : Factor w/ 3 levels "Diamond","Gold",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Customer_Vintage_Group: Factor w/ 3 levels "VG1","VG2","VG3": 1 1 1 1 1 1 1 1 1 1 ...
$ Transaction_ID : chr "CTID85973577" "CTID13215166" "CTID41016745" "CTID82512584" ...
$ Transaction_Date : chr "13-Feb-16" "30-Oct-16" "9-May-16" "31-Aug-16" ...
$ Transaction_Value : int 31592 5894 20925 36556 36715 30996 43853 45751 16565 19345 ...
$ Transaction_Segment : Factor w/ 15 levels "SEG11","SEG12",..: 11 2 1 1 8 5 13 7 11 11 ...
$ fecha : Date, format: "2016-02-13" "2016-10-30" "2016-05-09" "2016-08-31" ...
$ week_day : Factor w/ 7 levels "Friday","Monday",..: 3 4 2 7 1 4 4 1 4 6 ...
Procedremos a seleccionar los campos que deseamos para la clasificacion:
La idea de este proyecto es determianr Customer_Segment , procedemos a seleccionar unicament elos campos que nos van a server y a limpiar la data
cards<-full_card%>%dplyr::select( Customer_Segment ,Card_Family,Credit_Limit,Age,Transaction_Segment,Transaction_Value,week_day)
ncolmax<-7
set.seed(123)
cards<-cards[complete.cases(cards),]
cards<-cards[sample(nrow(cards), 2000), ]
se extrae al azar 1000 elementos de toda la colección
str(cards)
'data.frame': 2000 obs. of 7 variables:
$ Customer_Segment : Factor w/ 3 levels "Diamond","Gold",..: 1 1 3 1 2 1 1 3 2 1 ...
$ Card_Family : Factor w/ 3 levels "Gold","Platinum",..: 3 3 1 3 3 2 2 1 1 3 ...
$ Credit_Limit : int 147000 492000 32000 405000 361000 96000 63000 35000 2000 575000 ...
$ Age : int 44 35 34 48 35 34 35 48 39 44 ...
$ Transaction_Segment: Factor w/ 15 levels "SEG11","SEG12",..: 13 2 11 13 2 4 1 8 5 5 ...
$ Transaction_Value : int 13128 48453 16656 41754 42618 39382 8951 25688 15360 17744 ...
$ week_day : Factor w/ 7 levels "Friday","Monday",..: 7 6 5 3 4 4 7 6 6 2 ...
head(cards)
summary(cards)
Customer_Segment Card_Family Credit_Limit Age Transaction_Segment Transaction_Value week_day
Diamond :934 Gold :716 Min. : 2000 Min. :20.00 SEG19 : 157 Min. : 109 Friday :300
Gold :635 Platinum:476 1st Qu.: 34000 1st Qu.:28.00 SEG16 : 148 1st Qu.:12487 Monday :272
Platinum:431 Premium :808 Median :135000 Median :36.00 SEG13 : 146 Median :25274 Saturday :307
Mean :251012 Mean :35.81 SEG12 : 143 Mean :24822 Sunday :271
3rd Qu.:412500 3rd Qu.:44.00 SEG14 : 138 3rd Qu.:36758 Thursday :288
Max. :899000 Max. :50.00 SEG20 : 138 Max. :49983 Tuesday :270
(Other):1130 Wednesday:292
Definimos tres vectores uno para el indice del 70% de la muestra y 30% para test
ncolmax<-7
train_index <- createDataPartition(y = cards$Customer_Segment,p=0.70,list = F)
training <- cards[train_index,]
testing <- cards[-train_index,]
p.Card_Family <- ggplot(training, aes(Customer_Segment, Card_Family))
p.Card_Family <- p.Card_Family + geom_boxplot() + ggtitle("Card_Family")
p.Credit_Limit <- ggplot(training, aes(Customer_Segment, Credit_Limit))
p.Credit_Limit <- p.Credit_Limit + geom_boxplot() + ggtitle("Credit_Limit")
p.Age <- ggplot(training, aes(Customer_Segment, Age))
p.Age <- p.Age + geom_boxplot() + ggtitle("Age")
p.Customer_Segment <- ggplot(training, aes(Customer_Segment, Customer_Segment))
p.Customer_Segment <- p.Customer_Segment + geom_boxplot() + ggtitle("Customer_Segment")
p.Transaction_Value <- ggplot(training, aes(Customer_Segment, Transaction_Value))
p.Transaction_Value <- p.Transaction_Value + geom_boxplot() + ggtitle("Transaction_Value")
p.week_day <- ggplot(training, aes(Customer_Segment, week_day))
p.week_day <- p.week_day + geom_boxplot() + ggtitle("week_day")
gridExtra::grid.arrange(p.Card_Family, p.Credit_Limit, p.Age, p.Customer_Segment,p.Transaction_Value,p.week_day, ncol = 2, nrow = 4)
Dado que nuestra variable a determinar es una dimencion con mas de una categoria implementaremos un modelo de arboles para determinar el segmento
###Arbol de decicion simple
modFit <- caret::train(Customer_Segment ~ ., method = "rpart", data = training)
rattle::fancyRpartPlot(modFit$finalModel)
verificaremos nuestra prediccion y la matriz de confusión
predictions <- predict(modFit, newdata = testing[,c(2:ncolmax,1)])
confusionMatrix(predictions, testing$Customer_Segment)
Confusion Matrix and Statistics
Reference
Prediction Diamond Gold Platinum
Diamond 246 120 100
Gold 34 70 29
Platinum 0 0 0
Overall Statistics
Accuracy : 0.5275
95% CI : (0.4867, 0.5681)
No Information Rate : 0.4674
P-Value [Acc > NIR] : 0.001845
Kappa : 0.1652
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: Diamond Class: Gold Class: Platinum
Sensitivity 0.8786 0.3684 0.0000
Specificity 0.3103 0.8460 1.0000
Pos Pred Value 0.5279 0.5263 NaN
Neg Pred Value 0.7444 0.7425 0.7846
Prevalence 0.4674 0.3172 0.2154
Detection Rate 0.4107 0.1169 0.0000
Detection Prevalence 0.7780 0.2220 0.0000
Balanced Accuracy 0.5945 0.6072 0.5000
como podemos ver en la siguiente tabla, se detalla un Accuracy de casi el 41%, al verificar el Mean de nuestro prediction vemos que tambien igual
mean(predictions==testing$Customer_Segment)
[1] 0.5275459
Ahora bien verificaremos nuevamente el modelo haciendo uso de RPART para poder generar el scalfolding function Utilizando un metodo de anova calcularemos el arbol de forma manual
modFit_rpat <- rpart(formula = Customer_Segment ~Card_Family+Credit_Limit+Age+Transaction_Segment+Transaction_Value+week_day
, data = training, method = "anova" )
rattle::fancyRpartPlot(modFit_rpat)
Verificaremos el resultado del arbol
modFit_rpat <- rpart(formula = Customer_Segment ~Card_Family+Credit_Limit+Age+Transaction_Segment+Transaction_Value+week_day
, data = training )
Procedmos a calcular nuestras prediccions de tipo clase
predictions_rpat <- predict(modFit_rpat, newdata = testing,type="class")
table(predicho=predictions_rpat, real=testing$Customer_Segment)
real
predicho Diamond Gold Platinum
Diamond 246 126 99
Gold 21 57 23
Platinum 13 7 7
error_clas <- mean(predictions_rpat != testing$Customer_Segment)
paste("El error de clasificación es del:", 100 * error_clas, "%.",
sum(predictions_rpat == testing$Customer_Segment),
"clasificaciones correctas de un total de", length(predictions_rpat))
[1] "El error de clasificación es del: 48.2470784641068 %. 310 clasificaciones correctas de un total de 599"
como podemos ver bajo esta variación del arbol la exactitud decae a un 45% Procederemos a ejecutar un bosque de arboles para determinar el mejor modelo para esto utilizaremos la funcion randomForest
modFit.rf <- randomForest::randomForest(Customer_Segment ~ ., data = cards, )
modFit.rf
Call:
randomForest(formula = Customer_Segment ~ ., data = cards, )
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 40.1%
Confusion matrix:
Diamond Gold Platinum class.error
Diamond 770 119 45 0.1755889
Gold 304 300 31 0.5275591
Platinum 231 72 128 0.7030162
evaluaremos el resultado del bosque generado
predictions.rf <- predict(modFit.rf, newdata = testing[,c(1:ncolmax)])
confusionMatrix(predictions.rf, testing$Customer_Segment)
Confusion Matrix and Statistics
Reference
Prediction Diamond Gold Platinum
Diamond 280 0 0
Gold 0 190 0
Platinum 0 0 129
Overall Statistics
Accuracy : 1
95% CI : (0.9939, 1)
No Information Rate : 0.4674
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Diamond Class: Gold Class: Platinum
Sensitivity 1.0000 1.0000 1.0000
Specificity 1.0000 1.0000 1.0000
Pos Pred Value 1.0000 1.0000 1.0000
Neg Pred Value 1.0000 1.0000 1.0000
Prevalence 0.4674 0.3172 0.2154
Detection Rate 0.4674 0.3172 0.2154
Detection Prevalence 0.4674 0.3172 0.2154
Balanced Accuracy 1.0000 1.0000 1.0000
graficaremos el confjunto de errores en la iteracion de los arboles
plot(modFit.rf, main = "Razon del random de arboles")
Como podemos obervar en la tabla y en la grafica el mejor modelo reprecenta una exactitud del 83% la importancia de este valor corresponde a la tabla de variables y su aporte
varImpPlot(modFit.rf, pch = 20, main = "aporte de Variables para el bosque")
##Kfold para modelo de arbol Ahora procederemos a realizar el kfold sobre todos nuestros posibles modelos
Iniciamos determinado la funcion de calculo de metrica dado un conjunto de features, tomemos encuenta que el error ahora es calculado
get_metric_part <- function(features,dataset,label){
formulat <- paste0(features, collapse="+")
formulat <- paste0(label,"~",formulat,
collapse = "")
fit<- rpart(formula=formulat, data = dataset)
test_pred <- predict(fit, newdata = dataset,type="class")
se <-mean(test_pred!=dataset[,label])
print( paste0(formulat," ERR=",se))
return(se*se)
}
Definimos nuestra funcion de segmentar n-k capas
fit_nk_part <- function(k,dataset,label){
features <- colnames(dataset)
features <- setdiff(features , label)
features_subset<- combn(features,k,simplify = FALSE)
metric <- lapply(features_subset,
get_metric_part,
dataset=dataset,
label=label)
index<-
which.min(metric %>% unlist() )
print( ".... Mejor Modelo (Respuesta) ....")
return(features_subset[[index]])
}
Verificamos el posible resultado. el sistema debe de retornar la menor error el listado que se detalla corresponde a todos los posibles modelso que ha iterado
fit_nk_part(5,training,"Customer_Segment")
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment+Transaction_Value ERR=0.443254817987152"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment+week_day ERR=0.443254817987152"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Value+week_day ERR=0.471092077087794"
[1] "Customer_Segment~Card_Family+Credit_Limit+Transaction_Segment+Transaction_Value+week_day ERR=0.483226266952177"
[1] "Customer_Segment~Card_Family+Age+Transaction_Segment+Transaction_Value+week_day ERR=0.494646680942184"
[1] "Customer_Segment~Credit_Limit+Age+Transaction_Segment+Transaction_Value+week_day ERR=0.443254817987152"
[1] ".... Mejor Modelo (Respuesta) ...."
[1] "Card_Family" "Credit_Limit" "Age" "Transaction_Segment" "Transaction_Value"
Como podemos ver ya tenemos la respuesta de la evaluacion de cada modelo
Quitaremos los logs de las funciones
get_metric_part <- function(features,dataset,label){
formulat <- paste0(features, collapse="+")
formulat <- paste0(label,"~",formulat,
collapse = "")
fit<- rpart(formula=formulat, data = dataset)
test_pred <- predict(fit, newdata = dataset,type="class")
se <-mean(test_pred!=dataset[,label])
return(se*se)
}
Definimos nuestra funcion de segmentar n-k capas
fit_nk_part <- function(k,dataset,label){
features <- colnames(dataset)
features <- setdiff(features , label)
features_subset<- combn(features,k,simplify = FALSE)
metric <- lapply(features_subset,
get_metric_part,
dataset=dataset,
label=label)
index<-
which.min(metric %>% unlist() )
return(features_subset[[index]])
}
calcularemos la lista de los mejores modelos a testear con un maximo de 6 posibles mejores modelos para predecir
n_features <- ncol(training)-1
list_models<-
lapply(1:n_features,
fit_nk_part,
dataset=training,
label = 'Customer_Segment')
list_models
[[1]]
[1] "Credit_Limit"
[[2]]
[1] "Credit_Limit" "Age"
[[3]]
[1] "Credit_Limit" "Age" "Transaction_Segment"
[[4]]
[1] "Card_Family" "Credit_Limit" "Age" "Transaction_Segment"
[[5]]
[1] "Card_Family" "Credit_Limit" "Age" "Transaction_Segment" "Transaction_Value"
[[6]]
[1] "Card_Family" "Credit_Limit" "Age" "Transaction_Segment" "Transaction_Value"
[6] "week_day"
acontinuacion definiremos la fuccion que escanea y calcula el train y test para cada modelo a testear
loocv <- function(index, dataset, label, features){
train <- dataset[-index,]
test <- dataset[index,]
formulat <- paste0(label, "~", paste0(features, collapse="+"), collapse = "")
fit<- rpart(formula=formulat, data = train )
test_pred <- predict(fit, test,type="class")
se <-mean(test_pred!=test[,label])
resprint<- paste(formulat, " Err:", se, " correctos:", sum(test_pred == test$Customer_Segment),
" de", length(test_pred))
print(resprint)
return(se*se)
}
Definimos la funcion fold de neustros modelos en los que hacemos test de cada segmento que nos pueda calcular carret
folds <- function(k, dataset, label, features){
se_folds_all <- lapply(createFolds(dataset[,label], k = k, list = TRUE, returnTrain = FALSE),
loocv,
label=label,
dataset=dataset,
features=features)
return(mean(se_folds_all %>% unlist()))
}
Aplicaremos un Lapply a la lista de modelos para nuestra funcion de fold en k segmentos de 4
test <- lapply(list_models,
folds,
k=6,
dataset=cards,
label="Customer_Segment"
)
[1] "Customer_Segment~Credit_Limit Err: 0.486486486486487 correctos: 171 de 333"
[1] "Customer_Segment~Credit_Limit Err: 0.477477477477477 correctos: 174 de 333"
[1] "Customer_Segment~Credit_Limit Err: 0.471471471471471 correctos: 176 de 333"
[1] "Customer_Segment~Credit_Limit Err: 0.523952095808383 correctos: 159 de 334"
[1] "Customer_Segment~Credit_Limit Err: 0.514970059880239 correctos: 162 de 334"
[1] "Customer_Segment~Credit_Limit Err: 0.510510510510511 correctos: 163 de 333"
[1] "Customer_Segment~Credit_Limit+Age Err: 0.525525525525526 correctos: 158 de 333"
[1] "Customer_Segment~Credit_Limit+Age Err: 0.483483483483483 correctos: 172 de 333"
[1] "Customer_Segment~Credit_Limit+Age Err: 0.513513513513513 correctos: 162 de 333"
[1] "Customer_Segment~Credit_Limit+Age Err: 0.473053892215569 correctos: 176 de 334"
[1] "Customer_Segment~Credit_Limit+Age Err: 0.491017964071856 correctos: 170 de 334"
[1] "Customer_Segment~Credit_Limit+Age Err: 0.516516516516517 correctos: 161 de 333"
[1] "Customer_Segment~Credit_Limit+Age+Transaction_Segment Err: 0.539156626506024 correctos: 153 de 332"
[1] "Customer_Segment~Credit_Limit+Age+Transaction_Segment Err: 0.493975903614458 correctos: 168 de 332"
[1] "Customer_Segment~Credit_Limit+Age+Transaction_Segment Err: 0.497005988023952 correctos: 168 de 334"
[1] "Customer_Segment~Credit_Limit+Age+Transaction_Segment Err: 0.523952095808383 correctos: 159 de 334"
[1] "Customer_Segment~Credit_Limit+Age+Transaction_Segment Err: 0.467065868263473 correctos: 178 de 334"
[1] "Customer_Segment~Credit_Limit+Age+Transaction_Segment Err: 0.502994011976048 correctos: 166 de 334"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment Err: 0.507507507507508 correctos: 164 de 333"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment Err: 0.48948948948949 correctos: 170 de 333"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment Err: 0.476047904191617 correctos: 175 de 334"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment Err: 0.502994011976048 correctos: 166 de 334"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment Err: 0.452095808383234 correctos: 183 de 334"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment Err: 0.521084337349398 correctos: 159 de 332"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment+Transaction_Value Err: 0.5 correctos: 167 de 334"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment+Transaction_Value Err: 0.495495495495495 correctos: 168 de 333"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment+Transaction_Value Err: 0.487951807228916 correctos: 170 de 332"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment+Transaction_Value Err: 0.505988023952096 correctos: 165 de 334"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment+Transaction_Value Err: 0.504504504504504 correctos: 165 de 333"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment+Transaction_Value Err: 0.508982035928144 correctos: 164 de 334"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment+Transaction_Value+week_day Err: 0.495495495495495 correctos: 168 de 333"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment+Transaction_Value+week_day Err: 0.451807228915663 correctos: 182 de 332"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment+Transaction_Value+week_day Err: 0.531531531531532 correctos: 156 de 333"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment+Transaction_Value+week_day Err: 0.538922155688623 correctos: 154 de 334"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment+Transaction_Value+week_day Err: 0.467065868263473 correctos: 178 de 334"
[1] "Customer_Segment~Card_Family+Credit_Limit+Age+Transaction_Segment+Transaction_Value+week_day Err: 0.494011976047904 correctos: 169 de 334"
print(paste("index:",which.min(test)))
[1] "index: 4"
list_models[which.min(test)]
[[1]]
[1] "Card_Family" "Credit_Limit" "Age" "Transaction_Segment"
Seleccionamos al que menor error tenga listado de indices con error:
test
[[1]]
[1] 0.24788
[[2]]
[1] 0.2508829
[[3]]
[1] 0.2545661
[[4]]
[1] 0.242118
[[5]]
[1] 0.2505374
[[6]]
[1] 0.2474678
list_models
[[1]]
[1] "Credit_Limit"
[[2]]
[1] "Credit_Limit" "Age"
[[3]]
[1] "Credit_Limit" "Age" "Transaction_Segment"
[[4]]
[1] "Card_Family" "Credit_Limit" "Age" "Transaction_Segment"
[[5]]
[1] "Card_Family" "Credit_Limit" "Age" "Transaction_Segment" "Transaction_Value"
[[6]]
[1] "Card_Family" "Credit_Limit" "Age" "Transaction_Segment" "Transaction_Value"
[6] "week_day"
modFit.rf <- randomForest::randomForest(Customer_Segment ~ Card_Family+Credit_Limit+Age+Transaction_Segment, data = cards )
modFit.rf
Call:
randomForest(formula = Customer_Segment ~ Card_Family + Credit_Limit + Age + Transaction_Segment, data = cards)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 24.35%
Confusion matrix:
Diamond Gold Platinum class.error
Diamond 795 102 37 0.1488223
Gold 157 451 27 0.2897638
Platinum 111 53 267 0.3805104
evaluaremos el resultado del bosque generado
predictions.rf <- predict(modFit.rf, newdata = testing[,c(1:ncolmax)])
confusionMatrix(predictions.rf, testing$Customer_Segment)
Confusion Matrix and Statistics
Reference
Prediction Diamond Gold Platinum
Diamond 279 0 2
Gold 1 190 0
Platinum 0 0 127
Overall Statistics
Accuracy : 0.995
95% CI : (0.9854, 0.999)
No Information Rate : 0.4674
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9921
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: Diamond Class: Gold Class: Platinum
Sensitivity 0.9964 1.0000 0.9845
Specificity 0.9937 0.9976 1.0000
Pos Pred Value 0.9929 0.9948 1.0000
Neg Pred Value 0.9969 1.0000 0.9958
Prevalence 0.4674 0.3172 0.2154
Detection Rate 0.4658 0.3172 0.2120
Detection Prevalence 0.4691 0.3189 0.2120
Balanced Accuracy 0.9951 0.9988 0.9922