The Books By Mail company is interested in offering a new title, The Art History of Florence, to its existing customers. To gauge demand, the company sent a test mailing to 1000 customers, of whom 83 actually purchased the book, a response rate of 8.3 percent. An identical mailing was sent to another 1000 customers to serve as a holdout sample. Two input variables are available for each customer: the number of months since the last purchase and the number of art books previously purchased.
train.data <- read.csv("D:/PGP BA-BI Course Materials/PREDICTIVE MODELING/GROUP ASSIGNMENT/PaulBooks1.csv")
head(train.data)
## ID Months NoBought Purchase
## 1 1001 24 0 0
## 2 1002 16 0 0
## 3 1003 15 0 0
## 4 1004 22 0 0
## 5 1006 6 2 0
## 6 1007 11 0 0
str(train.data)
## 'data.frame': 1000 obs. of 4 variables:
## $ ID : int 1001 1002 1003 1004 1006 1007 1008 1009 1010 1011 ...
## $ Months : int 24 16 15 22 6 11 11 26 14 15 ...
## $ NoBought: int 0 0 0 0 2 0 1 2 1 0 ...
## $ Purchase: int 0 0 0 0 0 0 0 0 0 0 ...
summary(train.data)
## ID Months NoBought Purchase
## Min. :1001 Min. : 1.00 Min. :0.000 Min. :0.000
## 1st Qu.:1251 1st Qu.: 7.00 1st Qu.:0.000 1st Qu.:0.000
## Median :1500 Median :12.00 Median :0.000 Median :0.000
## Mean :1500 Mean :12.46 Mean :0.389 Mean :0.083
## 3rd Qu.:1750 3rd Qu.:15.00 3rd Qu.:1.000 3rd Qu.:0.000
## Max. :2000 Max. :35.00 Max. :5.000 Max. :1.000
opar2 <- par(no.readonly = TRUE)
par(mfcol = c(1,2),bg ="gray63", col="white", col.axis = "white", col.lab = "white", col.main = "white")
boxplot(train.data$Months, main = "Boxplot of Months", xlab = "No. of months since last purchase", border = "white", horizontal = TRUE, col = "aquamarine3")
hist(train.data$Months, probability = TRUE, border = "white", main = "Histogram of Months", xlab = "No. of months since last purchase", col = "aquamarine3")
d <- density(train.data$Months)
lines(d,col="red",lwd = 2.0)
opar2 <- par(no.readonly = TRUE)
par(mfcol = c(1,2),bg ="gray63", col="white", col.axis = "white", col.lab = "white", col.main = "white")
boxplot(train.data$NoBought, main = "Boxplot of NoBought", xlab = "No. of art books purchased", border = "white", horizontal = TRUE, col = "aquamarine3")
hist(train.data$NoBought, probability = TRUE, border = "white", main = "Histogram of NoBought", xlab = "No. of art books purchased", col = "aquamarine3")
d <- density(train.data$NoBought)
lines(d,col="red",lwd = 2.0)
train.data$Purchase <- as.factor(train.data$Purchase)
opar2 <- par(no.readonly = TRUE)
par(mfcol = c(1,2),bg ="gray63", col="white", col.axis = "white", col.lab = "white", col.main = "white")
barplot(table(train.data$Purchase), main = "Response rate of book purchase", xlab = "Response Rate", ylab = "Counts", border = TRUE,
        col = c("aquamarine3","aquamarine4"), legend.text = c("No","Yes"))
library(corrplot)
## corrplot 0.84 loaded
cor.data <- cor(train.data[,c(2,3)])
corrplot(cor.data,method = "circle",tl.col = "black",bg = "grey14",
addgrid.col = "gray50", tl.offset = 1,col = colorRampPalette(c("firebrick2","ivory2","aquamarine3"))(50))
opar1 <- par(no.readonly = TRUE)
par(mfrow=c(1,2), bg ="gray63",col="white", col.axis = "white", col.lab = "white", col.main = "white" )
boxplot(train.data$Months~train.data$Purchase, main = "Boxplot pattern of Purchase vs Months", xlab = "Response Rate of Book Purchase", ylab = "No. of months since last purchase",
        col = "lightskyblue", border = par("fg"))
boxplot(train.data$NoBought~train.data$Purchase, main = "Boxplot pattern of Purchase vs NoBought", xlab = "Response Rate of Book Purchase",
        ylab = "No. of art books purchased", col = "lightskyblue", border = par("fg"))
par(opar1)
library(MASS)
train.data$Purchase <- as.factor(train.data$Purchase)
Model <- lda(Purchase~Months+NoBought,data = train.data)
Model
## Call:
## lda(Purchase ~ Months + NoBought, data = train.data)
##
## Prior probabilities of groups:
## 0 1
## 0.917 0.083
##
## Group means:
## Months NoBought
## 0 12.731734 0.3336968
## 1 9.409639 1.0000000
##
## Coefficients of linear discriminants:
## LD1
## Months -0.05098078
## NoBought 1.41242601
From the linear discriminant model we can see that the group means of both variables differ between the two classes. The model also gives the coefficients of the linear discriminant: -0.0510 for Months and 1.4124 for NoBought. With these coefficients a discriminant score is calculated for each record and its class is predicted.
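As a quick illustration (a minimal sketch reusing the Model object fitted above), these scores and the predicted classes can be obtained with predict():
pred.train <- predict(Model)
head(pred.train$x)      # LD1 discriminant score for each record
head(pred.train$class)  # predicted class (0 = no purchase, 1 = purchase)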
library(DiscriMiner)
X <- train.data[,c(2,3)]
train.data$Purchase <- as.factor(train.data$Purchase)
Y <- train.data[,4]
Fisher <- desDA(X,Y)
Fisher
##
## Descriptive Discriminant Analysis
## ---------------------------------
## $power discriminant power
## $values table of eigenvalues
## $discrivar discriminant variables
## $discor correlations
## $scores discriminant scores
## ---------------------------------
##
## $power
## cor_ratio wilks_lamb F_statistic p_values
## Months 0.0131350 0.9868650 13.2832222 0.0002816
## NoBought 0.0728742 0.9271258 78.4450638 0.0000000
##
##
## $values
## value proportion accumulated
## DF1 0.0868 100.0000 100.0000
##
##
## $discrivar
## DF1
## constant 0.08558
## Months -0.05098
## NoBought 1.41243
##
##
## $discor
## DF1
## Months -0.3909
## NoBought 0.9207
##
##
## $scores
## z1
## 1 -1.1380
## 2 -0.7301
## 3 -0.6791
## 4 -1.0360
## 5 2.6046
## 6 -0.4752
## ...
Fisher’s model gives the coefficients of the variables directly, and the output also indicates the importance of each variable through its correlation with the discriminant function.
In the output, the p-values for both variables are very small. Hence, we conclude that both are statistically significant and able to separate the customers into buyers and non-buyers of the book.
The correlation ratios in $power show that NoBought (0.0729) is roughly 5.5 times more important than Months (0.0131) in statistically separating the two classes.
Along with the coefficients there is also a constant term, analogous to the intercept in regression. Hence the discriminant function is
Z = 0.08558 - 0.05098(Months) + 1.41243(NoBought)
From this equation a z score is calculated for each customer and the class is predicted. The discriminant correlations in $discor show that Months is correlated with the function at -0.39 and NoBought at 0.92. From this we infer that the number of art books purchased is far more important than the number of months since the last purchase in predicting which customers will respond.
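As a sanity check (a minimal sketch using the rounded coefficients from $discrivar above), the scores reported in $scores can be reproduced directly:
z <- 0.08558 - 0.05098*train.data$Months + 1.41243*train.data$NoBought
head(round(z, 4))  # reproduces Fisher$scores up to coefficient rounding, e.g. ~ -1.138 for record 1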
maha <- linDA(X,Y)
maha
##
## Linear Discriminant Analysis
## -------------------------------------------
## $functions discrimination functions
## $confusion confusion matrix
## $scores discriminant scores
## $classification assigned class
## $error_rate error rate
## -------------------------------------------
##
## $functions
## 0 1
## constant -1.473 -4.295
## Months 0.200 0.143
## NoBought 0.699 2.267
##
##
## $confusion
## predicted
## original 0 1
## 0 907 10
## 1 72 11
##
##
## $error_rate
## [1] 0.082
##
##
## $scores
## 0 1
## 1 3.3151450 -0.8650023
## 2 1.7189944 -2.0082531
## 3 1.5194755 -2.1511595
## 4 2.9161073 -1.1508150
## 5 1.1208676 1.0966496
## 6 0.7214002 -2.7227849
## ...
##
## $classification
## [1] 0 0 0 0 0 0
## Levels: 0 1
## ...
The Mahalanobis (linear) discriminant model gives its output as two separate classification functions, one per class. From these functions we can compute, for each record, a score for not buying and a score for buying the book:
For not buying (class 0): -1.473 + 0.200(Months) + 0.699(NoBought)
For buying (class 1): -4.295 + 0.143(Months) + 2.267(NoBought)
Two scores are thus calculated for each record, and the record is assigned to the class with the higher score.
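For example (a minimal sketch using the rounded coefficients from $functions above), record 1 has Months = 24 and NoBought = 0:
score0 <- -1.473 + 0.200*24 + 0.699*0   # ~3.32, the class-0 score in $scores
score1 <- -4.295 + 0.143*24 + 2.267*0   # ~-0.86, the class-1 score in $scores
score0 > score1                         # TRUE, so record 1 is assigned class 0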
test.data <- read.csv("D:/PGP BA-BI Course Materials/PREDICTIVE MODELING/GROUP ASSIGNMENT/PaulBooks2.csv")
test.data$Purchase <- as.factor(test.data$Purchase)
pred <- classify(maha,test.data[,2:3])
table(test.data$Purchase,pred$pred_class)
##
## 0 1
## 0 908 11
## 1 67 14
The model performs comparably on the test data, with an overall accuracy of about 92 percent ((908 + 14)/1000 = 0.922). However, it correctly identifies very few of the actual buyers: only 14 of 81. Since the classes are heavily imbalanced, a proper cut-off value should be chosen based on business intuition rather than the default.
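A minimal sketch of this evaluation (reusing Model, pred and test.data from above; the 0.2 cut-off is purely illustrative, not a recommended value):
cm <- table(test.data$Purchase, pred$pred_class)
sum(diag(cm))/sum(cm)   # overall accuracy: (908 + 14)/1000 = 0.922
cm[2, 2]/sum(cm[2, ])   # sensitivity: only 14 of 81 actual buyers identified
post <- predict(Model, test.data[, 2:3])$posterior[, "1"]   # posterior P(Purchase = 1)
table(test.data$Purchase, ifelse(post > 0.2, 1, 0))         # lower cut-off flags more buyers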