Summary

This is a walkthrough of a model from the Artificial Neural Network section of the Udemy course "Machine Learning A-Z™: Hands-On Python & R In Data Science".

We use data on a bank's clients to estimate which clients have a higher risk of leaving the bank. The success rate of this model is about 85%.

Datasets and libraries

Please download the CSV file from my Dropbox https://db.tt/D1m9lxYlXi and make sure it is in your working directory.

library(caTools) # for sample.split (train/test split); scale() itself is base R
library(h2o) # the deep learning library
set.seed(123) # for reproducibility
dataset = read.csv('Churn_Modelling.csv')

The dataset contains identifying information for 10,000 clients of a bank, along with some information linked to their accounts. The last column, Exited, is the one that interests us because it is what we try to predict.

head(dataset)
##   RowNumber CustomerId  Surname CreditScore Geography Gender Age Tenure
## 1         1   15634602 Hargrave         619    France Female  42      2
## 2         2   15647311     Hill         608     Spain Female  41      1
## 3         3   15619304     Onio         502    France Female  42      8
## 4         4   15701354     Boni         699    France Female  39      1
## 5         5   15737888 Mitchell         850     Spain Female  43      2
## 6         6   15574012      Chu         645     Spain   Male  44      8
##     Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
## 1      0.00             1         1              1       101348.88      1
## 2  83807.86             1         0              1       112542.58      0
## 3 159660.80             3         1              0       113931.57      1
## 4      0.00             2         0              0        93826.63      0
## 5 125510.82             1         1              1        79084.10      0
## 6 113755.78             2         1              0       149756.71      1

To build the model, we remove the columns that will not help us predict the outcome: RowNumber, CustomerId and Surname.

dataset = dataset[4:14]
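Selecting by position depends on the column order of the CSV. As a minimal sketch, the same step can be written by name; the toy data frame below is an assumption for illustration (made-up values, only a few of the real columns), not the actual dataset.

```r
# Toy data frame reusing some column names from the head() output (values are made up)
toy = data.frame(RowNumber = 1:2,
                 CustomerId = c(101, 102),
                 Surname = c('A', 'B'),
                 CreditScore = c(600, 700),
                 Exited = c(1, 0))

# Drop the identifier columns by name instead of by position
drop_cols = c('RowNumber', 'CustomerId', 'Surname')
toy_reduced = toy[, !(names(toy) %in% drop_cols)]
names(toy_reduced)
```

On the full dataset, this should keep the same 11 columns as dataset[4:14], provided the column names match the head() output.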

We also need to encode the text columns as factors and then convert them to numeric codes.

dataset$Geography = as.numeric(factor(dataset$Geography,
                                      levels = c('France', 'Spain', 'Germany'),
                                      labels = c(1, 2, 3)))
dataset$Gender = as.numeric(factor(dataset$Gender,
                                   levels = c('Female', 'Male'),
                                   labels = c(1, 2)))

Splitting dataset and feature scaling

We now split the dataset into a training set and a test set, then apply feature scaling to normalise the data. It is very important to normalise after splitting; otherwise information from the test set leaks into the training set and biases the evaluation.

split = sample.split(dataset$Exited, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
training_set[-11] = scale(training_set[-11])
test_set[-11] = scale(test_set[-11])
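The code above scales the test set with its own mean and standard deviation. A common alternative, sketched below on toy data, is to reuse the training set's statistics so that both sets go through exactly the same transformation. The toy frames and their column names are assumptions for illustration; this is not what produced the results reported here.

```r
set.seed(123)
# Toy frames: two features and a label in the last column (made-up data)
toy_train = data.frame(x1 = rnorm(8, 50, 10), x2 = rnorm(8, 5, 2), y = rbinom(8, 1, 0.5))
toy_test  = data.frame(x1 = rnorm(4, 50, 10), x2 = rnorm(4, 5, 2), y = rbinom(4, 1, 0.5))
raw_test_x1 = toy_test$x1  # kept only to check the result afterwards

# Scale the training features and keep the statistics that scale() used
scaled_train = scale(toy_train[-3])
train_center = attr(scaled_train, 'scaled:center')
train_scale  = attr(scaled_train, 'scaled:scale')
toy_train[-3] = scaled_train

# Apply the training statistics to the test features
toy_test[-3] = scale(toy_test[-3], center = train_center, scale = train_scale)
```

With this variant the test features will generally not have mean 0 and standard deviation 1 exactly, which is expected.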

Creation of the deep learning model

We create the artificial neural network with the rectifier as the activation function and two hidden layers of five neurons each.

h2o.init(nthreads = -1) # -1 means all CPUs on the host
model = h2o.deeplearning(y = 'Exited', # define the outcome to predict
                         training_frame = as.h2o(training_set), # define training set
                         activation = 'Rectifier', # define the activation function
                         hidden = c(5,5),  # two hidden layers of 5 neurons each
                         epochs = 100, # number of passes over the training data
                         train_samples_per_iteration = -2) # Number of training samples (globally) per MapReduce iteration, -2 autotunes automatically
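To make hidden = c(5,5) and the rectifier concrete, here is a rough sketch of a single forward pass through a network of that shape in base R. The weights are random and the names (relu, W1, h1, and so on) are made up for illustration; this is not how h2o stores or trains its model.

```r
set.seed(123)
# Rectifier: negative values become 0, positive values pass through
relu = function(z) pmax(z, 0)

x  = matrix(rnorm(10), nrow = 1)                        # one client, 10 scaled features
W1 = matrix(rnorm(10 * 5), nrow = 10); b1 = rep(0, 5)   # input -> hidden layer 1
W2 = matrix(rnorm(5 * 5), nrow = 5);   b2 = rep(0, 5)   # hidden layer 1 -> hidden layer 2
W3 = matrix(rnorm(5), nrow = 5);       b3 = 0           # hidden layer 2 -> output

h1  = relu(x %*% W1 + b1)   # 5 activations, all >= 0
h2  = relu(h1 %*% W2 + b2)  # 5 activations, all >= 0
out = h2 %*% W3 + b3        # a single number, the network's raw output
```

Training consists of adjusting the weights and biases so that this output gets close to the Exited column.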

Predicting the Test set results

y_pred = h2o.predict(model, newdata = as.h2o(test_set[-11]))
y_pred = (y_pred > 0.5)
y_pred = as.vector(y_pred)

Prediction Analysis

The success rate of the model over the test set is 86.6%.

cm = table(test_set[, 11], y_pred)
successrate = (cm[1,1] + cm[2,2]) / sum(cm)
cm
##    y_pred
##        0    1
##   0 1515   78
##   1  190  217
successrate
## [1] 0.866
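The success rate alone hides how the model does on the clients who actually left. As a sketch, the confusion matrix printed above can be re-entered by hand to derive a few more figures; the names precision, recall and baseline are introduced here for illustration.

```r
# Confusion matrix from the output above: rows = actual, columns = predicted
cm = matrix(c(1515, 190, 78, 217), nrow = 2,
            dimnames = list(actual = c('0', '1'), predicted = c('0', '1')))

accuracy  = sum(diag(cm)) / sum(cm)     # share of correct predictions
precision = cm['1', '1'] / sum(cm[, '1'])  # predicted leavers who really left
recall    = cm['1', '1'] / sum(cm['1', ])  # real leavers the model caught
baseline  = sum(cm['0', ]) / sum(cm)    # accuracy of always predicting "stays"

round(c(accuracy = accuracy, precision = precision,
        recall = recall, baseline = baseline), 3)
```

Here the model catches 217 of the 407 clients who left (a recall of roughly 53%), and always predicting 0 would already be right for 1593 of the 2000 test clients, so the 86.6% figure should be read against that baseline.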