Data Dictionary

The data has 64000 rows and 9 variables. The dependent variable is a factor with 2 levels.

This data frame contains the following columns:

recency : months since last purchase

history : $ value of the historical purchases

used_discount : indicates whether the customer has used a discount before

used_bogo : indicates whether the customer has used a buy-one-get-one offer before

zip_code : whether the customer lives in a Rural, Urban or Suburban area

is_referral : whether the customer was acquired through referral; 0 signifies no referral acquisition, 1 signifies acquisition through referral

channel : the channel through which the customer reaches the product: Phone, Web or Multichannel

offer : the offer sent to the customer: Discount / Buy One Get One / No Offer

conversion : the dependent variable; whether the customer converted (bought or not); 0 signifies no conversion, 1 signifies conversion

## Importing the dataset

dataset = read.csv("data.csv")
str(dataset)
## 'data.frame':    64000 obs. of  9 variables:
##  $ recency      : int  10 6 7 9 2 6 9 9 9 10 ...
##  $ history      : num  142.4 329.1 180.7 675.8 45.3 ...
##  $ used_discount: int  1 1 0 1 1 0 1 0 1 0 ...
##  $ used_bogo    : int  0 1 1 0 0 1 0 1 1 1 ...
##  $ zip_code     : chr  "Surburban" "Rural" "Surburban" "Rural" ...
##  $ is_referral  : int  0 1 1 1 0 0 1 0 1 1 ...
##  $ channel      : chr  "Phone" "Web" "Web" "Web" ...
##  $ offer        : chr  "Buy One Get One" "No Offer" "Buy One Get One" "Discount" ...
##  $ conversion   : int  0 0 0 0 0 1 0 0 0 0 ...
names(dataset)
## [1] "recency"       "history"       "used_discount" "used_bogo"    
## [5] "zip_code"      "is_referral"   "channel"       "offer"        
## [9] "conversion"

So I can see that most of the independent variables are character or integer. The zip_code, channel and offer columns are categorical variables but are stored as character, so I need to encode their levels numerically first and then convert them into factors.

# Encode the categories as "0"/"1"/"2"; note "Surburban" is the spelling used in the raw data
dataset$zip_code = ifelse(dataset$zip_code == "Surburban", "0", ifelse(dataset$zip_code == "Rural","1","2"))
dataset$channel = ifelse(dataset$channel == "Phone", "0", ifelse(dataset$channel == "Web","1","2"))
dataset$offer = ifelse(dataset$offer == "Buy One Get One", "0", ifelse(dataset$offer == "Discount","1","2"))
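
The same mapping can also be done in one step with base R's factor(); a minimal sketch on a toy vector (the vector values are invented for illustration):

zip = c("Surburban", "Rural", "Urban", "Rural")
as.character(factor(zip, levels = c("Surburban", "Rural", "Urban"), labels = c("0", "1", "2")))
## [1] "0" "1" "2" "1"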

Now that all of our categorical independent variables are encoded properly, we can convert them into factors and proceed with the model building. But first I want to check the correlation between all the variables to get a quick glance at the data. Correlation can be calculated only for numeric variables, and all the variables we have are either character or integer, so I convert them into the numeric data type.

Converting the variables into the numeric data type:

dataset$zip_code = as.numeric(dataset$zip_code)
dataset$channel = as.numeric(dataset$channel)
dataset$offer = as.numeric(dataset$offer)
dataset$conversion = as.numeric(dataset$conversion)

dataset$recency = as.numeric(dataset$recency)
dataset$used_discount = as.numeric(dataset$used_discount)
dataset$used_bogo = as.numeric(dataset$used_bogo)
dataset$is_referral = as.numeric(dataset$is_referral)
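
As a side note, the same conversion can be written in a single pass over all columns; a compact equivalent sketch:

# convert every column of the data frame to numeric in one pass
dataset[] = lapply(dataset, as.numeric)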

Do we have any NA values in the dataset? It is better to check for them and deal with them properly.

sum(is.na(dataset))
## [1] 0

So the sum is zero, which means we do not have any NA values in the dataset. That is a good thing.
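
If there had been missing values, a per-column count would show where they are; a minimal sketch:

# NA count per column (all zeros for this dataset)
colSums(is.na(dataset))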

Now that all the variables are converted into the numeric data type, a correlation matrix can be calculated and a corrplot can be plotted. A positive correlation means a proportional relation between two variables, a negative correlation means an inverse relation, and the higher the absolute correlation value, the stronger the relation.

str(dataset)
## 'data.frame':    64000 obs. of  9 variables:
##  $ recency      : num  10 6 7 9 2 6 9 9 9 10 ...
##  $ history      : num  142.4 329.1 180.7 675.8 45.3 ...
##  $ used_discount: num  1 1 0 1 1 0 1 0 1 0 ...
##  $ used_bogo    : num  0 1 1 0 0 1 0 1 1 1 ...
##  $ zip_code     : num  0 1 0 1 2 0 0 2 1 2 ...
##  $ is_referral  : num  0 1 1 1 0 0 1 0 1 1 ...
##  $ channel      : num  0 1 1 1 1 0 0 0 0 1 ...
##  $ offer        : num  0 2 0 1 0 0 0 0 1 0 ...
##  $ conversion   : num  0 0 0 0 0 1 0 0 0 0 ...
cr = cor(dataset)
library(corrplot)
## corrplot 0.84 loaded
corrplot(cr, method = "number", title = "Correlation between Variables", bg = "brown")

Findings:

1. There is not much correlation between the independent variables.
2. That is a good thing, because we do not have to worry about multicollinearity in the data.
3. The highest correlation is between the channel and history variables, which suggests that customers who use the Web channel rather than the Phone channel tend to have a higher historical purchase value.

Now let’s take a quick look at the distribution of recency across all the customers in the dataset.

hist(dataset$recency)

The values show a good spread across the recency range.

dataset$conversion = factor(dataset$conversion)

Now let’s divide the dataset into training and test sets. For that, I will use the caTools library. A supervised machine learning model works on the concept that the machine learns to predict from the training data: a model is built on the training set, and that model is then used to predict the test set.

So here we are dividing the data into training and test sets in the proportion of 80% and 20% respectively.

#Splitting the dataset into the Training set and Test set
#install.packages('caTools')
library(caTools)
set.seed(12345)
split = sample.split(dataset$conversion, SplitRatio = 0.8)

training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
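
As a quick sanity check, the split sizes should match the 80/20 ratio (the counts below follow from splitting 64000 rows):

nrow(training_set)
## [1] 51200
nrow(test_set)
## [1] 12800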

Now that the data is split into training and test sets, we can start with the model building process. Here I have decided to use the K-nearest neighbors (KNN) model. I also built decision tree, random forest and support vector machine (kernel) models on this dataset, but I got the highest accuracy with the KNN model. So I have decided to go with only this model in this project and try to increase the accuracy by tuning it.

1. K-nearest neighbors is a supervised machine learning technique.
2. It produces a non-linear decision boundary, unlike logistic classification, which is linear; that is one reason KNN can give more accurate predictions.
3. The KNN algorithm classifies a new data point based on its similarity to existing points.
4. The KNN algorithm can be used for regression as well.

So how does KNN work?

STEP 1: Choose the number K of neighbors; in most cases we choose k = 5.

STEP 2: Take the K nearest neighbors of the new data point, according to the Euclidean distance.

STEP 3: Among these K neighbors, count the number of data points in each category.

STEP 4: Assign the new data point to the category with the most neighbors. That is how the new data point is placed into a category.

KNN simply stores the training data; when a new data point arrives, the model classifies it into a category based on its similarity to the data already available.

So how does it find that similarity? It places the new data point in the feature space and then finds its nearest neighbors by computing Euclidean distances. Then, based on the value of K given to the model, it counts that many neighbors and categorizes the new point according to the category most of those neighbors belong to.
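
To make these mechanics concrete, here is a minimal by-hand sketch on toy data (the points and labels are invented for illustration):

# four training points in 2-D with known classes
train_x = matrix(c(1, 1,
                   1, 2,
                   5, 5,
                   6, 5), ncol = 2, byrow = TRUE)
train_y = c("0", "0", "1", "1")
new_x = c(2, 2)                              # the new point to classify
# Euclidean distance from the new point to every training point
d = sqrt(rowSums((train_x - matrix(new_x, nrow(train_x), 2, byrow = TRUE))^2))
k = 3
nearest = order(d)[1:k]                      # indices of the k closest points
names(which.max(table(train_y[nearest])))    # majority vote among the neighbors
## [1] "0"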

Since K-nearest neighbors uses Euclidean distance to find neighbors, variables on larger scales can dominate variables on smaller scales. So feature scaling is compulsory when using K-nearest neighbors.

Scaling the training and test datasets

We have to leave out the dependent variable while scaling, because the target variable needs to remain intact.

#--------------------------------------------------------
# K-Nearest Neighbors (K-NN)

#Scaling the training and test set data
training_set[-9] = scale(training_set[-9])
test_set[-9] = scale(test_set[-9])
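
Note that the code above scales the test set with its own means and standard deviations. A stricter alternative reuses the training-set parameters on the test set, so that no information from the test set leaks into the preprocessing; a sketch, not what was run here:

# reuse the training-set centering/scaling parameters on the test set
train_scaled = scale(training_set[-9])
test_set[-9] = scale(test_set[-9],
                     center = attr(train_scaled, "scaled:center"),
                     scale = attr(train_scaled, "scaled:scale"))
training_set[-9] = train_scaled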

So I have decided to take the value of k as 5 for the model, which means that while classifying a point the model will look at its 5 nearest neighbors. The value of k should be odd, so that with two classes a voting tie is impossible. I have also set the seed, because class::knn breaks ties at random, and this keeps the result reproducible.

Fitting K-NN to the Training set and Predicting the Test set results

Here the model building and the test set prediction happen simultaneously, since KNN simply stores the training data.

library(class)
set.seed(123)
y_pred = knn(train = training_set[, -9],
             test = test_set[, -9],
             cl = training_set[, 9],
             k = 5,
             prob = TRUE)

Now that our model is built and the test set has been predicted, the predictions are stored in y_pred. To know the accuracy we need to build a confusion matrix, which shows how much the model's predictions deviate from the actual dependent variable of the test data.

# Making the Confusion Matrix
cm = table(actual = test_set[, 9], predicted = y_pred)
(cm)
##       predicted
## actual     0     1
##      0 10563   358
##      1  1772   107
# an accuracy of (10563 + 107) / 12800 = 83.36%
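
As a quick check, the accuracy can be computed directly from the confusion matrix:

# accuracy = correctly classified / total
sum(diag(cm)) / sum(cm)
## [1] 0.8335938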

IMPROVING THE MODEL

Grid Search

So the achieved accuracy is 83.36%, but the KNN model has a tuning parameter, k, which we picked arbitrarily; we can tune it to find the model that gives the highest accuracy. To improve our model, we will use grid search. Grid search evaluates a grid of candidate tuning-parameter values and reports the one that gives the highest accuracy.

# Applying Grid Search to find the best parameters
# install.packages('caret')
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
classifier = train(form = conversion ~ ., data = training_set, method = 'knn')

The result from the grid search is stored in classifier; accuracy is used to select the best tuning parameter.

(classifier)
## k-Nearest Neighbors 
## 
## 51200 samples
##     8 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 51200, 51200, 51200, 51200, 51200, 51200, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa     
##   5  0.8013953  0.04034556
##   7  0.8180592  0.03950692
##   9  0.8279086  0.03567666
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
classifier$bestTune
##   k
## 3 9
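
Note that caret's default grid only tried k = 5, 7 and 9. If we wanted, a wider grid and an explicit resampling scheme could be supplied; a sketch (the grid values and the 5-fold cross-validation here are illustrative choices, not what was run above):

# search odd k from 5 to 21 with 5-fold cross-validation
ctrl = trainControl(method = "cv", number = 5)
set.seed(123)
classifier_wide = train(form = conversion ~ ., data = training_set,
                        method = 'knn',
                        trControl = ctrl,
                        tuneGrid = expand.grid(k = seq(5, 21, by = 2)))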

So the best accuracy we can get from the KNN model is at k = 9. Let’s set k to 9 in the model and see how much the accuracy improves.

set.seed(123)
y_pred1 = knn(train = training_set[, -9],
             test = test_set[, -9],
             cl = training_set[, 9],
             k = 9,
             prob = TRUE)

The improved predictions are stored in y_pred1. Now let’s find out the accuracy of the model.

# Making the Confusion Matrix
cm1 = table(actual = test_set[, 9], predicted = y_pred1)
(cm1)
##       predicted
## actual     0     1
##      0 10774   147
##      1  1830    49

So the improved accuracy is ((10774 + 49) / 12800) * 100 = 84.55%.

End Notes

1. The dataset is imbalanced: most of the observations are customers who did not buy the product. The ratio of customers not buying to buying is roughly 6:1 (a quick check is sketched below).

2. So most models do not work well on this dataset unless we balance it by oversampling or undersampling.

3. By trial and error across the classification models I built, I have come to the conclusion that KNN gives the highest accuracy and the least biased result on this dataset.

4. An artificial neural network might be built on this dataset to get an even higher accuracy.
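
For note 1, a quick way to see the class imbalance directly (a sketch; output omitted):

# proportion of converters vs non-converters in the full dataset
prop.table(table(dataset$conversion))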