In this project we will attempt to predict whether customers of Santander, one of the largest banks, are satisfied or dissatisfied. The main goal is to identify unhappy customers so that the bank can take steps to improve their situation. This matters because dissatisfied customers rarely voice their complaints and can therefore churn without any possibility of retention.
We have hundreds of anonymised features available, and one of our major challenges will be to decide which of them are the most important.
Let’s start out by loading our dependencies. For reading the files we will use the readr package: it is extremely fast and conveniently returns tibbles instead of plain data frames. For data wrangling, dplyr is probably the best package available; for machine learning we will use caret, and for analyzing the resulting model pROC and ROCR.
library(readr)
library(dplyr)
library(caret)
library(pROC)
library(ROCR)
In this example we will be working just with the training dataset. Let’s read it in.
train.df.raw <- read_csv("data/train.csv")
## Warning: 694 parsing failures.
## row col expected actual
## 1040 saldo_var13_largo no trailing characters .53
## 1040 saldo_medio_var13_largo_ult1 no trailing characters .53
## 1147 saldo_var33 no trailing characters .66
## 1147 saldo_medio_var33_hace2 no trailing characters .95
## 1147 saldo_medio_var33_hace3 no trailing characters .16
## .... ............................ ...................... ......
## See problems(...) for more details.
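These failures happen because readr guesses each column’s type from an initial chunk of rows; columns that look like integers early on turn out to contain decimals further down. The affected values are simply read in as NAs, which is fine for our purposes, but for a clean read one could force every column to be parsed as a double:
# alternative read (sketch): parse all columns as doubles to avoid guessing errors
train.df.raw <- read_csv("data/train.csv",
                         col_types = cols(.default = col_double()))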
To reduce computation time we will work with just a small fraction of the data. In the same cleaning pipeline we can drop any rows containing NAs; fortunately there are only a few of those (the failed parses above are simply read in as NAs). We can also remove the ID column, since we will not be doing any merges at this point. And since we will later be fitting a logistic regression, there is no need to standardize the data.
# seed the RNG so the 10% sample is reproducible
set.seed(42)
df <-
  train.df.raw %>%
  sample_frac(size = 0.1) %>%
  dplyr::select(-ID) %>%
  na.omit()
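Before throwing rows away, it is worth verifying that there really are only a few missing values. A quick check on the raw data:
# quick check: columns with the most missing values
colSums(is.na(train.df.raw)) %>% sort(decreasing = TRUE) %>% head()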
The next step is to convert the variables into their proper types. In this case that is easy: all independent variables are numeric, and the dependent variable, TARGET, can be coded as a factor.
# convert datatypes
df <- apply(df, 2, as.numeric) %>% as.data.frame()
df$TARGET <- as.factor(df$TARGET)
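For what it is worth, an equivalent dplyr formulation avoids the round-trip through a matrix that apply performs:
# alternative (sketch): same conversion, staying within a data frame
df <- df %>% mutate_all(as.numeric)
df$TARGET <- as.factor(df$TARGET)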
Before we can start preparing our model, we have to split the data into training and test sets and set up cross-validation. We also set a seed (set.seed) for reproducibility purposes.
set.seed(42)
# train/test split
trainIndex <- createDataPartition(df$TARGET, p = 0.8,
                                  list = FALSE,
                                  times = 1)
santanderTrain <- df[ trainIndex, ]
santanderTest  <- df[-trainIndex, ]
# cross-validation
ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE)
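One caveat, going by caret’s documented defaults: without an explicit repeats argument, method = "repeatedcv" runs only a single repeat, which amounts to plain 10-fold CV. To actually repeat the folds, one could write:
# sketch: explicitly repeat 10-fold cross-validation three times
ctrl.repeated <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                              savePredictions = TRUE)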
Then we can proceed with training the model. This is the most time-consuming step, so be patient. Our model of choice is logistic regression: a classic algorithm that is well suited to classification tasks such as this one, and it is also extremely interpretable and fast.
# train model
# note: glm has no tuning parameters, so caret's tuneLength would be a no-op
model.caret <- train(TARGET ~ ., data = santanderTrain,
                     method = "glm", family = "binomial",
                     trControl = ctrl)
After our model is trained we can inspect it.
# inspect model
# summary(model.caret)
varImp(model.caret) %>% plot()
We can see that a handful of variables carry most of the predictive power. To speed up computation later we could do some feature selection and use just those variables, as sketched below. As a sanity check, a Random Forest achieved similar results for variable importance.
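A minimal sketch of such a selection, where the cutoff of 20 variables is an arbitrary assumption rather than a tuned value (the row names of caret’s importance table match our column names here, since all predictors are numeric):
# sketch: keep only the 20 most important variables plus the target
imp <- varImp(model.caret)$importance
top_vars <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
santanderTrain.small <- santanderTrain[, c(top_vars, "TARGET")]
Now we can use our model to make some predictions!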
# make predictions
predicted_values.caret <- predict(model.caret, newdata = santanderTest)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
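The rank-deficiency warning suggests that in our 10% sample some columns are constant or linear combinations of others. One way to mitigate this, sketched rather than applied here, is caret’s nearZeroVar filter, run on the predictors before training:
# sketch: drop near-zero-variance predictors before fitting
# (TARGET is excluded first, since its heavy imbalance could flag it too)
predictors <- santanderTrain[, setdiff(names(santanderTrain), "TARGET")]
nzv <- nearZeroVar(predictors)
if (length(nzv) > 0) {
  santanderTrain.filtered <- cbind(predictors[, -nzv, drop = FALSE],
                                   TARGET = santanderTrain$TARGET)
}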
Warning aside, the prediction seems to have worked. But just looking at the resulting vector will not tell us how much to trust the model. Let’s print out a confusion matrix to see its classification accuracy.
# model evaluation
confusionMatrix(data=predicted_values.caret, santanderTest$TARGET)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1442 63
## 1 10 0
##
## Accuracy : 0.9518
## 95% CI : (0.9398, 0.962)
## No Information Rate : 0.9584
## P-Value [Acc > NIR] : 0.9094
##
## Kappa : -0.0115
## Mcnemar's Test P-Value : 1.157e-09
##
## Sensitivity : 0.9931
## Specificity : 0.0000
## Pos Pred Value : 0.9581
## Neg Pred Value : 0.0000
## Prevalence : 0.9584
## Detection Rate : 0.9518
## Detection Prevalence : 0.9934
## Balanced Accuracy : 0.4966
##
## 'Positive' Class : 0
##
accuracy <- table(predicted_values.caret, santanderTest[,"TARGET"])
sum(diag(accuracy))/sum(accuracy)
## [1] 0.9518152
Another technique to assist us in assessing the model is the ROC curve. As a first illustration, we plot the ROC curve of a single predictor, var38, on its own, rather than that of the full model.
# roc
f1 <- roc(TARGET ~ var38, data = santanderTrain)
plot(f1, col = 'red')
##
## Call:
## roc.formula(formula = TARGET ~ var38, data = santanderTrain)
##
## Data: var38 in 5808 controls (TARGET 0) > 253 cases (TARGET 1).
## Area under the curve: 0.5803
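To assess the full model instead, one could feed its predicted class probabilities into pROC. A sketch, assuming predict.train returns one probability column per class (caret may also warn that the factor levels 0 and 1 are not valid R variable names):
# sketch: ROC curve of the fitted model on the held-out test set
probs <- predict(model.caret, newdata = santanderTest, type = "prob")
roc.model <- roc(response = santanderTest$TARGET, predictor = probs[, 2])
plot(roc.model, col = "blue")
auc(roc.model)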
At first glance the accuracy looks spectacular, but the confusion matrix tells a more sobering story: the no-information rate is 95.8%, so always predicting "satisfied" would score slightly higher, and a specificity of 0 means the model identifies essentially none of the dissatisfied customers. When we use the model on the complete dataset, the raw accuracy is also much lower (around a 79% classification rate). Still, this is a reasonable baseline for such a simple model. Natural next steps would be to address the class imbalance and to try other models, such as Random Forest, SVM or gradient boosting.
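As one illustration of addressing the imbalance, caret’s trainControl accepts a sampling argument that down-samples the majority class inside each resampling iteration. A sketch, not run here:
# sketch: down-sample the majority class within the CV loop
ctrl.down <- trainControl(method = "repeatedcv", number = 10,
                          sampling = "down", savePredictions = TRUE)
model.down <- train(TARGET ~ ., data = santanderTrain,
                    method = "glm", family = "binomial",
                    trControl = ctrl.down)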