The Books By Mail company is interested in offering a new title, The Art History of Florence, to its existing customers. To test demand, the company sent a test mailing to 1000 customers; of these, 83 actually purchased the book, a response rate of 8.3 percent. The company also sent an identical mailing to another 1000 customers to serve as a holdout sample. The models below use two input variables: months since last purchase and number of art books purchased.

Loading the dataset

train.data <- read.csv("D:/PGP BA-BI Course Materials/PREDICTIVE MODELING/GROUP ASSIGNMENT/PaulBooks1.csv")
head(train.data)
##     ID Months NoBought Purchase
## 1 1001     24        0        0
## 2 1002     16        0        0
## 3 1003     15        0        0
## 4 1004     22        0        0
## 5 1006      6        2        0
## 6 1007     11        0        0

Exploratory Data Analysis

Structure of the data

str(train.data)
## 'data.frame':    1000 obs. of  4 variables:
##  $ ID      : int  1001 1002 1003 1004 1006 1007 1008 1009 1010 1011 ...
##  $ Months  : int  24 16 15 22 6 11 11 26 14 15 ...
##  $ NoBought: int  0 0 0 0 2 0 1 2 1 0 ...
##  $ Purchase: int  0 0 0 0 0 0 0 0 0 0 ...

Summary of the data

summary(train.data)
##        ID           Months         NoBought        Purchase    
##  Min.   :1001   Min.   : 1.00   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1251   1st Qu.: 7.00   1st Qu.:0.000   1st Qu.:0.000  
##  Median :1500   Median :12.00   Median :0.000   Median :0.000  
##  Mean   :1500   Mean   :12.46   Mean   :0.389   Mean   :0.083  
##  3rd Qu.:1750   3rd Qu.:15.00   3rd Qu.:1.000   3rd Qu.:0.000  
##  Max.   :2000   Max.   :35.00   Max.   :5.000   Max.   :1.000

Visualization of independent variables

opar2 <- par(no.readonly = TRUE)
par(mfcol = c(1,2),bg ="gray63", col="white", col.axis = "white", col.lab = "white", col.main = "white")
boxplot(train.data$Months,main = "Boxplot of Months",xlab = "No. of months since last purchase",border ="white",horizontal = TRUE,col = "aquamarine3")
hist(train.data$Months,probability = TRUE, border = "white",main = "Histogram of Months",xlab = "No. of months since last purchase",col = "aquamarine3")
d <- density(train.data$Months)
lines(d,col="red",lwd = 2.0)

opar2 <- par(no.readonly = TRUE)
par(mfcol = c(1,2),bg ="gray63", col="white", col.axis = "white", col.lab = "white", col.main = "white")
boxplot(train.data$NoBought,main = "Boxplot of NoBought",xlab = "No. of art books purchased",border ="white",horizontal = TRUE,col = "aquamarine3")
hist(train.data$NoBought,probability = TRUE, border = "white",main = "Histogram of NoBought",xlab = "No. of art books purchased",col = "aquamarine3")
d <- density(train.data$NoBought)
lines(d,col="red",lwd = 2.0)

Visualization of target variable

train.data$Purchase <- as.factor(train.data$Purchase)

opar2 <- par(no.readonly = TRUE)
par(mfcol = c(1,2),bg ="gray63", col="white", col.axis = "white", col.lab = "white", col.main = "white")
barplot(table(train.data$Purchase),main = "Response rate of book purchase",xlab = "Response Rate",ylab = "Counts",border = TRUE,
        col = c("aquamarine3","aquamarine4"),legend.text =c("No","Yes"))

Correlation matrix plot

library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.2
## corrplot 0.84 loaded
cor.data <- cor(train.data[,c(2,3)])
corrplot(cor.data,method = "circle",tl.col = "black",bg = "grey14",
         addgrid.col = "gray50", tl.offset = 1,col = colorRampPalette(c("firebrick2","ivory2","aquamarine3"))(50))

Relationship between dependent and independent variables

opar1 <- par(no.readonly = TRUE)
par(mfrow=c(1,2), bg ="gray63",col="white", col.axis = "white", col.lab = "white", col.main = "white" )
boxplot(train.data$Months~train.data$Purchase,main = "Boxplot pattern of Purchase vs Months", xlab = "Response Rate of Book Purchase", ylab = "No. of months since last purchase",
        col = "lightskyblue",border = par("fg"))
boxplot(train.data$NoBought~train.data$Purchase,main = "Boxplot pattern of Purchase vs NoBought", xlab = "Response Rate of Book Purchase", 
        ylab = "No. of art books purchased", col = "lightskyblue",border = par("fg"))

par(opar1)

Model development on training data

Linear Discriminant model using library MASS

library(MASS)
train.data$Purchase <- as.factor(train.data$Purchase)
Model <- lda(Purchase~Months+NoBought,data = train.data)
Model
## Call:
## lda(Purchase ~ Months + NoBought, data = train.data)
## 
## Prior probabilities of groups:
##     0     1 
## 0.917 0.083 
## 
## Group means:
##      Months  NoBought
## 0 12.731734 0.3336968
## 1  9.409639 1.0000000
## 
## Coefficients of linear discriminants:
##                  LD1
## Months   -0.05098078
## NoBought  1.41242601

From the linear discriminant model we can see that the group means of each variable differ between the two classes. The model also gives the discriminant coefficients for the two variables: -0.0510 for Months and 1.4124 for NoBought.

With these coefficients, a discriminant score is calculated for each record and its class is predicted.
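As a quick sketch, the scores and predicted classes can be obtained from the fitted model with predict(), which for an lda object returns the predicted class, the posterior probabilities and the discriminant scores:

lda.pred <- predict(Model, train.data)
head(lda.pred$x)                             # discriminant scores (LD1) per record
table(train.data$Purchase, lda.pred$class)   # confusion matrix on the training data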

Fisher’s Discriminant Model

library(DiscriMiner)
## Warning: package 'DiscriMiner' was built under R version 3.4.3
X <- train.data[,c(2,3)]
train.data$Purchase <- as.factor(train.data$Purchase)
Y <- train.data[,4]
Fisher <- desDA(X,Y)
Fisher
## 
## Descriptive Discriminant Analysis
## ---------------------------------
## $power      discriminant power
## $values     table of eigenvalues
## $discrivar  discriminant variables
## $discor     correlations
## $scores     discriminant scores
## ---------------------------------
## 
## $power
##           cor_ratio   wilks_lamb  F_statistic  p_values  
## Months     0.0131350   0.9868650  13.2832222    0.0002816
## NoBought   0.0728742   0.9271258  78.4450638    0.0000000
## 
## 
## $values
##      value     proportion  accumulated
## DF1    0.0868  100.0000    100.0000   
## 
## 
## $discrivar
##                DF1
## constant   0.08558
## Months    -0.05098
## NoBought   1.41243
## 
## 
## $discor
##               DF1
## Months    -0.3909
## NoBought   0.9207
## 
## 
## $scores
##         z1
## 1  -1.1380
## 2  -0.7301
## 3  -0.6791
## 4  -1.0360
## 5   2.6046
## 6  -0.4752
## ...

Fisher’s model gives the coefficients of the variables directly, and the output also indicates the importance of each variable through its correlation with the discriminant function.

In the output, the p-values for both variables are very small. Hence we conclude that both are statistically significant and are able to separate the customers into buyers and non-buyers of the book.

The correlation ratio values in $power show that NoBought (0.0729) is about 5.5 times as important as Months (0.0131) in statistically separating the customers into buyers and non-buyers of the book.

Along with the coefficients there is also a constant term, analogous to the intercept in regression. Hence the equation is,

                Z = 0.08558 - 0.05098(Months) + 1.41243(NoBought) 

From this equation a z score is calculated for each customer and the class is predicted. The discriminant correlations ($discor) show that Months is correlated at -0.39 with the discriminant function while NoBought is correlated at 0.92. From this we infer that the number of art books purchased is far more important than the number of months since last purchase in classifying the customers.
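As a quick check, the z score can be computed by hand from the coefficients above and compared with the scores returned in Fisher$scores:

# z score from the discriminant function; the coefficients above are rounded,
# so the result matches the z1 column of Fisher$scores approximately
z <- 0.08558 - 0.05098 * train.data$Months + 1.41243 * train.data$NoBought
head(round(z, 4))
head(Fisher$scores)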

Mahalanobis Discriminant Model

maha <- linDA(X,Y)
maha
## 
## Linear Discriminant Analysis
## -------------------------------------------
## $functions        discrimination functions
## $confusion        confusion matrix
## $scores           discriminant scores
## $classification   assigned class
## $error_rate       error rate
## -------------------------------------------
## 
## $functions
##           0       1     
## constant  -1.473  -4.295
## Months     0.200   0.143
## NoBought   0.699   2.267
## 
## 
## $confusion
##         predicted
## original    0    1
##        0  907   10
##        1   72   11
## 
## 
## $error_rate
## [1] 0.082
## 
## 
## $scores
##            0           1
## 1  3.3151450  -0.8650023
## 2  1.7189944  -2.0082531
## 3  1.5194755  -2.1511595
## 4  2.9161073  -1.1508150
## 5  1.1208676   1.0966496
## 6  0.7214002  -2.7227849
## ...
## 
## $classification
## [1] 0 0 0 0 0 0
## Levels: 0 1
## ...

The Mahalanobis discriminant model outputs two separate classification functions, one for each class. From these equations, we can compute a score for each record under both the not-buying (0) and buying (1) classes.

             For not buying (class 0) : -1.473 + 0.200(Months) + 0.699(NoBought)
                 For buying (class 1) : -4.295 + 0.143(Months) + 2.267(NoBought)

Thus, two scores are calculated for each record from the above equations, and the record is assigned to the class with the higher score.
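As a sketch, the same classification can be reproduced by hand from the two functions above (the coefficients are rounded to three decimals, so the scores are approximate):

score0 <- -1.473 + 0.200 * train.data$Months + 0.699 * train.data$NoBought   # not buying
score1 <- -4.295 + 0.143 * train.data$Months + 2.267 * train.data$NoBought   # buying
pred.class <- ifelse(score1 > score0, 1, 0)   # assign the class with the higher score
head(pred.class)                              # compare with head(maha$classification)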

Testing the model accuracy on test data

test.data <- read.csv("D:/PGP BA-BI Course Materials/PREDICTIVE MODELING/GROUP ASSIGNMENT/PaulBooks2.csv")
test.data$Purchase <- as.factor(test.data$Purchase)
pred <- classify(maha,test.data[,2:3])
table(test.data$Purchase,pred$pred_class)
##    
##       0   1
##   0 908  11
##   1  67  14

The model also works well on the test data, with an overall accuracy of about 92 percent ((908 + 14)/1000 = 0.922). However, it correctly identifies very few of the actual buyers (14 out of 81). Since the class distribution is heavily imbalanced, a proper cut-off value should be chosen based on business considerations.
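As an illustrative sketch of such a cut-off (the 0.10 threshold here is an assumption for illustration, not a recommendation), a custom threshold can be applied to the posterior probabilities from the MASS model fitted earlier:

test.pred <- predict(Model, test.data)
cutoff <- 0.10   # hypothetical business-chosen threshold, lower than the default 0.5
pred.cut <- ifelse(test.pred$posterior[, "1"] > cutoff, 1, 0)
table(test.data$Purchase, pred.cut)   # flags more likely buyers, at the cost of more false positives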