This is a walkthrough of a model covered in the Artificial Neural Network section of the Udemy course "Machine Learning A-Z™: Hands-On Python & R In Data Science".
We use information about a bank's clients to predict which clients are at higher risk of leaving the bank. The success rate of this model is about 85%.
Please download the CSV file from my Dropbox (https://db.tt/D1m9lxYlXi) and make sure the file is in your working directory.
library(caTools) # for sample.split (train/test split)
library(h2o) # the deep learning library
set.seed(123) # for reproducibility
dataset = read.csv('Churn_Modelling.csv')
The dataset contains identifying information for 10,000 clients of a bank, along with some information about their accounts. The last column, Exited, is the one that interests us because it is what we are trying to predict.
head(dataset)
## RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure
## 1 1 15634602 Hargrave 619 France Female 42 2
## 2 2 15647311 Hill 608 Spain Female 41 1
## 3 3 15619304 Onio 502 France Female 42 8
## 4 4 15701354 Boni 699 France Female 39 1
## 5 5 15737888 Mitchell 850 Spain Female 43 2
## 6 6 15574012 Chu 645 Spain Male 44 8
## Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
## 1 0.00 1 1 1 101348.88 1
## 2 83807.86 1 0 1 112542.58 0
## 3 159660.80 3 1 0 113931.57 1
## 4 0.00 2 0 0 93826.63 0
## 5 125510.82 1 1 1 79084.10 0
## 6 113755.78 2 1 0 149756.71 1
To build the model, we remove the columns that will not help us predict the outcome (RowNumber, CustomerId and Surname).
dataset = dataset[4:14]
We also need to encode the text columns (Geography and Gender) as factors, which we then convert to numeric codes.
dataset$Geography = as.numeric(factor(dataset$Geography,
                                      levels = c('France', 'Spain', 'Germany'),
                                      labels = c(1, 2, 3)))
dataset$Gender = as.numeric(factor(dataset$Gender,
                                   levels = c('Female', 'Male'),
                                   labels = c(1, 2)))
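As a small illustration of what this encoding does, the sketch below applies the same `factor` call to a made-up vector (`geo` is a hypothetical example, not taken from the dataset). `as.numeric` returns the position of each value in `levels`, so France becomes 1, Spain 2 and Germany 3.

```r
# Hypothetical sample values, for illustration only
geo <- c("Spain", "France", "Germany", "France")

# Same encoding as above: each value is replaced by its position in `levels`
geo_encoded <- as.numeric(factor(geo,
                                 levels = c("France", "Spain", "Germany"),
                                 labels = c(1, 2, 3)))
print(geo_encoded)  # 2 1 3 1
```

Keep in mind that integer codes give the countries an arbitrary order; this is a simplification that keeps every column numeric for the scaling step.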
We will now split the dataset into a training set and a test set, then apply feature scaling to normalise the data. It is very important to normalise the data after splitting it; otherwise information from the test set leaks into the training set and biases the evaluation.
split = sample.split(dataset$Exited, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
training_set[-11] = scale(training_set[-11])
test_set[-11] = scale(test_set[-11])
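The code above scales the test set with its own means and standard deviations. A common alternative is to reuse the statistics computed on the training set, so that both sets go through exactly the same transformation. The sketch below shows this variant on toy data; the data frames `train` and `test` and their values are hypothetical, and this is not the approach used in the course.

```r
# Toy data standing in for training_set / test_set (hypothetical values)
train <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(10, 20, 30, 40), y = c(0, 1, 0, 1))
test  <- data.frame(x1 = c(2, 5),       x2 = c(15, 35),         y = c(1, 0))

# Scale the training predictors (column 3 is the outcome) and keep the statistics
scaled_train <- scale(train[-3])
centers <- attr(scaled_train, "scaled:center")  # training means
scales  <- attr(scaled_train, "scaled:scale")   # training standard deviations
train[-3] <- scaled_train

# Apply the training means and standard deviations to the test predictors
test[-3] <- scale(test[-3], center = centers, scale = scales)

print(colMeans(train[-3]))  # approximately 0 for each predictor
print(test)
```

With this variant the test set will generally not have a mean of exactly 0 after scaling, which is expected.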
We now create the Artificial Neural Network, using the rectifier as the activation function, and predict the outcome for the test set.
h2o.init(nthreads = -1) # -1 means all CPUs on the host
model = h2o.deeplearning(y = 'Exited', # the outcome to predict
                         training_frame = as.h2o(training_set), # the training set
                         activation = 'Rectifier', # the activation function
                         hidden = c(5,5), # two hidden layers of 5 neurons each
                         epochs = 100, # number of passes over the training set
                         train_samples_per_iteration = -2) # training samples (globally) per MapReduce iteration; -2 means auto-tuning
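y_pred = h2o.predict(model, newdata = as.h2o(test_set[-11])) # predicted value for each test client
y_pred = (y_pred > 0.5) # classify a client as leaving when the prediction exceeds 0.5
y_pred = as.vector(y_pred) # convert the H2O frame to a plain R vector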
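The 'Rectifier' activation chosen above is the function max(0, x) applied to each neuron's input. The sketch below is a base R illustration only, not h2o's internal implementation; the function name `rectifier` is introduced here for the example.

```r
# Rectifier (ReLU): negative inputs become 0, positive inputs pass through unchanged
rectifier <- function(x) pmax(0, x)

print(rectifier(c(-2, -0.5, 0, 0.5, 2)))  # 0.0 0.0 0.0 0.5 2.0
```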
The success rate of the model on the test set is 86.6%.
cm = table(test_set[, 11], y_pred)
successrate = (cm[1,1]+ cm[2,2])/(cm[1,2]+ cm[2,1]+cm[1,1]+ cm[2,2])
cm
## y_pred
## 0 1
## 0 1515 78
## 1 190 217
successrate
## [1] 0.866
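The success rate alone can hide how the model behaves on each class. As a minimal sketch, the confusion matrix printed above can be rebuilt by hand to derive a few more figures; the variable names `accuracy`, `baseline`, `recall` and `precision` are introduced here and are not part of the original code.

```r
# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm <- matrix(c(1515, 190, 78, 217), nrow = 2,
             dimnames = list(actual = c("0", "1"), y_pred = c("0", "1")))

accuracy  <- sum(diag(cm)) / sum(cm)        # share of correct predictions
baseline  <- sum(cm["0", ]) / sum(cm)       # accuracy of always predicting 0
recall    <- cm["1", "1"] / sum(cm["1", ])  # share of leaving clients that were detected
precision <- cm["1", "1"] / sum(cm[, "1"])  # share of predicted leavers who actually left

print(c(accuracy = accuracy, baseline = baseline,
        recall = recall, precision = precision))
```

With these counts the accuracy is 0.866, while always predicting that a client stays would already reach about 0.80, and the model detects roughly 53% of the clients who actually left.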