##Overview
A superstore is planning for the year-end sale. They want to launch a new offer, a gold membership that gives a 20% discount on all purchases for only $499 instead of the usual $999. It will be valid only for existing customers, and a phone-call campaign targeting them is currently being planned. The management feels that the best way to reduce the cost of the campaign is to build a predictive model that classifies the customers who are likely to purchase the offer.
##Objectives
This work aims to meet the following objectives:
i. To predict the likelihood of a customer giving a positive response.
ii. To identify the different factors that affect a customer’s response.
##Questions
The work will seek to answer the following questions:
i) What is the likelihood of a customer giving a positive response?
ii) What are the factors that affect a customer’s response?
##Metrics for Success
Success for this work means building a model that correctly predicts the likelihood of a customer giving a positive response. Success also means correctly analyzing and establishing the factors that contribute to a customer responding positively to the Superstore campaign.
We aim to achieve an accuracy level of 90%.
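Here, accuracy means the share of customers whose response the model predicts correctly. A minimal sketch (an illustration with hypothetical vector names, not part of the analysis below) of how that target would be checked:
#Proportion of correct predictions, given hypothetical 0/1 vectors `predicted` and `actual`
accuracy_rate <- function(predicted, actual) {
  mean(predicted == actual)
}
#Example: accuracy_rate(c(1, 0, 1, 0), c(1, 1, 1, 0)) returns 0.75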
##The Experimental Design
Below are the steps taken in this analysis:
a. Loading the required libraries
b. Loading and previewing the data
c. Cleaning the data
d. Exploratory Data Analysis (EDA)
e. Creating a predictive model
f. Drawing conclusions
g. Giving recommendations
##Data Relevance and Validation
The data available is relevant for the intended analysis. It contains information that is significant for predicting the likelihood of a customer giving a positive response: customers’ income, education level, marital status, children in the household, shopping trends, responses, and complaints.
##Understanding the context
The dataset we are working with contains the following columns:
a. Id - unique ID of each customer
b. Year_Birth - the customer’s year of birth
c. Complain - 1 if the customer complained in the last 2 years
d. Dt_Customer - date of the customer’s enrollment with the company
e. Education - customer’s level of education
f. Marital_Status - customer’s marital status
g. Kidhome - number of small children in the customer’s household
h. Teenhome - number of teenagers in the customer’s household
i. Income - customer’s yearly household income
j. MntFishProducts - amount spent on fish products in the last 2 years
k. MntMeatProducts - amount spent on meat products in the last 2 years
l. MntFruits - amount spent on fruit products in the last 2 years
m. MntSweetProducts - amount spent on sweet products in the last 2 years
n. MntWines - amount spent on wine products in the last 2 years
o. MntGoldProds - amount spent on gold products in the last 2 years
p. NumDealsPurchases - number of purchases made with a discount
q. NumCatalogPurchases - number of purchases made using a catalog (buying goods to be shipped through the mail)
r. NumStorePurchases - number of purchases made directly in stores
s. NumWebPurchases - number of purchases made through the company’s website
t. NumWebVisitsMonth - number of visits to the company’s website in the last month
u. Response (target) - 1 if the customer accepted the offer in the last campaign, 0 otherwise
v. Recency - number of days since the last purchase
The dataset has 2240 observations and 22 columns.
The dataset can be accessed from this link: Superstore Marketing Campaign.
##Reading and Understanding The Dataset
superstore_data <- read.csv("C:/Users/HP/Downloads/superstore_data.csv") #Reading the data set
#Previewing the first five rows of the data set
head(superstore_data,5)
## Id Year_Birth Education Marital_Status Income Kidhome Teenhome
## 1 1826 1970 Graduation Divorced 84835 0 0
## 2 1 1961 Graduation Single 57091 0 0
## 3 10476 1958 Graduation Married 67267 0 1
## 4 1386 1967 Graduation Together 32474 1 1
## 5 5371 1989 Graduation Single 21474 1 0
## Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts
## 1 6/16/2014 0 189 104 379 111
## 2 6/15/2014 0 464 5 64 7
## 3 5/13/2014 0 134 11 59 15
## 4 11/5/2014 0 10 0 1 0
## 5 8/4/2014 0 6 16 24 11
## MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
## 1 189 218 1 4
## 2 0 37 1 7
## 3 2 30 1 3
## 4 0 0 1 1
## 5 0 34 2 3
## NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Response Complain
## 1 4 6 1 1 0
## 2 3 7 5 1 0
## 3 2 5 2 0 0
## 4 0 2 7 0 0
## 5 1 2 7 1 0
#Previewing the last 5 rows of the data set
tail(superstore_data,5)
## Id Year_Birth Education Marital_Status Income Kidhome Teenhome
## 2236 10142 1976 PhD Divorced 66476 0 1
## 2237 5263 1977 2n Cycle Married 31056 1 0
## 2238 22 1976 Graduation Divorced 46310 1 0
## 2239 528 1978 Graduation Married 65819 0 0
## 2240 4070 1969 PhD Married 94871 0 2
## Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts
## 2236 7/3/2013 99 372 18 126 47
## 2237 1/22/2013 99 5 10 13 3
## 2238 3/12/2012 99 185 2 88 15
## 2239 11/29/2012 99 267 38 701 149
## 2240 1/9/2012 99 169 24 553 188
## MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
## 2236 48 78 2 5
## 2237 8 16 1 1
## 2238 5 14 2 6
## 2239 165 63 1 5
## 2240 0 144 1 8
## NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Response Complain
## 2236 2 11 4 0 0
## 2237 0 3 8 0 0
## 2238 1 5 8 0 0
## 2239 4 10 3 0 0
## 2240 5 4 7 1 0
#Checking for the data types of the variables
str(superstore_data)
## 'data.frame': 2240 obs. of 22 variables:
## $ Id : int 1826 1 10476 1386 5371 7348 4073 1991 4047 9477 ...
## $ Year_Birth : int 1970 1961 1958 1967 1989 1958 1954 1967 1954 1954 ...
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr "Divorced" "Single" "Married" "Together" ...
## $ Income : int 84835 57091 67267 32474 21474 71691 63564 44931 65324 65324 ...
## $ Kidhome : int 0 0 0 1 1 0 0 0 0 0 ...
## $ Teenhome : int 0 0 1 1 0 0 0 1 1 1 ...
## $ Dt_Customer : chr "6/16/2014" "6/15/2014" "5/13/2014" "11/5/2014" ...
## $ Recency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MntWines : int 189 464 134 10 6 336 769 78 384 384 ...
## $ MntFruits : int 104 5 11 0 16 130 80 0 0 0 ...
## $ MntMeatProducts : int 379 64 59 1 24 411 252 11 102 102 ...
## $ MntFishProducts : int 111 7 15 0 11 240 15 0 21 21 ...
## $ MntSweetProducts : int 189 0 2 0 0 32 34 0 32 32 ...
## $ MntGoldProds : int 218 37 30 0 34 43 65 7 5 5 ...
## $ NumDealsPurchases : int 1 1 1 1 2 1 1 1 3 3 ...
## $ NumWebPurchases : int 4 7 3 1 3 4 10 2 6 6 ...
## $ NumCatalogPurchases: int 4 3 2 0 1 7 10 1 2 2 ...
## $ NumStorePurchases : int 6 7 5 2 2 5 7 3 9 9 ...
## $ NumWebVisitsMonth : int 1 5 2 7 7 2 6 5 4 4 ...
## $ Response : int 1 1 0 0 1 1 1 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
The output shows that the dataset has 2240 rows and 22 columns; three columns are of character type and the rest are integers.
#Checking the dimension of the data
dim(superstore_data)
## [1] 2240 22
This confirms that the dataset has 2240 rows and 22 columns.
#Checking for missing variables
colSums(is.na(superstore_data))
## Id Year_Birth Education Marital_Status
## 0 0 0 0
## Income Kidhome Teenhome Dt_Customer
## 24 0 0 0
## Recency MntWines MntFruits MntMeatProducts
## 0 0 0 0
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## 0 0 0 0
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 0 0 0 0
## Response Complain
## 0 0
This shows that the Income column is the only one with missing values (24 of them).
#Replacing the missing values with the mean of the column
superstore_data$Income[is.na(superstore_data$Income)]<-mean(superstore_data$Income, na.rm=TRUE)
#Checking to confirm the missing variables have been filled
colSums(is.na(superstore_data))
## Id Year_Birth Education Marital_Status
## 0 0 0 0
## Income Kidhome Teenhome Dt_Customer
## 0 0 0 0
## Recency MntWines MntFruits MntMeatProducts
## 0 0 0 0
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## 0 0 0 0
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 0 0 0 0
## Response Complain
## 0 0
Observation: This shows that all missing values have been replaced with the column mean; the dataset now has no missing values.
# Checking for the unique values in the Marital Status column
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
unique_values<-unique(superstore_data$Marital_Status)
unique_values
## [1] "Divorced" "Single" "Married" "Together" "Widow" "YOLO" "Alone"
## [8] "Absurd"
#Recoding the non-standard values in the Marital Status column to Single
library(stringr)
superstore_data$Marital_Status<-str_replace(superstore_data$Marital_Status,
"YOLO","Single")
superstore_data$Marital_Status<-str_replace(superstore_data$Marital_Status,
"Alone","Single")
superstore_data$Marital_Status<-str_replace(superstore_data$Marital_Status,
"Absurd","Single")
# Checking to see whether the variables in the Marital Status column have been changed
unique_marital<-unique(superstore_data$Marital_Status)
unique_marital
## [1] "Divorced" "Single" "Married" "Together" "Widow"
##Performing Exploratory Data Analysis
# A pie chart of the educational level of the customers
edu<-table(superstore_data$Education)
edu
##
## 2n Cycle Basic Graduation Master PhD
## 203 54 1127 370 486
pie(edu,
col = hcl.colors(length(edu), "BluYl"))
Observation: Most of the customers are in the Graduation sub-category, followed by those who hold PhDs and Masters. The fewest are those with Basic education.
library("ggplot2") #Library that supports plotting
#Ggplot comparing Income vs Education
ggplot(superstore_data, aes(Education,Income)) + geom_bar(stat = "identity", color = "purple") +theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "black")) + ggtitle("Income vs Education") + theme_bw()
Observation: The plot shows that customers in the Graduation category account for the highest income, while those with Basic education account for the least. (The bars sum income across customers, so group size also drives bar height.)
#A scatter plot of year of birth vs sweet products
ggplot(superstore_data, aes(x= Year_Birth, y = MntSweetProducts)) + geom_point(size = 2.5, color="navy") + xlab("Year of Birth") + ylab("MntSweetProducts") + ggtitle("Year of Birth vs MntSweetProducts")
Observation: The scatter plot shows that customers born between 1950 and 1980 account for most of the sweet-product purchases.
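To back this reading up with numbers, here is a brief sketch (an addition, not part of the original workflow) that totals sweet-product spend by decade of birth using the dplyr package loaded earlier:
#Total sweet-product spend and customer count per decade of birth
superstore_data %>%
  mutate(decade = floor(Year_Birth / 10) * 10) %>%
  group_by(decade) %>%
  summarise(total_sweets = sum(MntSweetProducts), customers = n()) %>%
  arrange(desc(total_sweets))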
# A table of complaints
complains<-table(superstore_data$Complain)
complains
##
## 0 1
## 2219 21
#Barplot of complaints
barplot(complains)
Observation: The barplot of complaints shows that the data is highly imbalanced; very few customers complained.
#Comparing the Year of birth against Complain
ggplot(superstore_data, aes(Year_Birth,Complain)) + geom_bar(stat = "identity", color = "purple") +theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "black")) + ggtitle("Year of Birth vs Complain") + theme_bw()
Observation: Most complaints were recorded from customers born between
1950 and 1980
#Barplot for response
responses<-table(superstore_data$Response)
barplot(responses)
Observation: This barplot shows the distribution of Response, which is clearly imbalanced.
#Comparing the Year of birth against Response
ggplot(superstore_data, aes(Year_Birth,Response)) + geom_bar(stat = "identity", color = "purple") +theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "black")) + ggtitle("Year of Birth vs Response") + theme_bw()
Observation: Most positive responses came from customers born from 1940 onward, with the highest number from customers born around 1970.
#Scatter plot between Year of birth and Fish products
ggplot(superstore_data, aes(x= Year_Birth, y = MntFishProducts)) + geom_point(size = 2.5, color="navy") + xlab("Year of Birth") + ylab("MntFishProducts") + ggtitle("Year of Birth vs MntFishProducts")
# Scatter plot showing year of birth versus amount of wine purchased
ggplot(superstore_data, aes(x= Year_Birth, y = MntWines)) + geom_point(size = 2.5, color="navy") + xlab("Year of Birth") + ylab("MntWines") + ggtitle("Year of Birth vs MntWines")
#Boxplot between Income and Marital Status
ggplot(superstore_data, aes(Marital_Status,Income)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "red")) + xlab("Marital_Status") + ylab("Income") + ggtitle("Income vs Marital Status")
Observation: Almost all categories have similar incomes; the widowed have a slightly higher median income, while those who are together include some extreme income outliers.
#Converting Dt_Customer from character (chr) to the Date type
library("data.table") #Loading the data.table package
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
date_variable <- as.Date(superstore_data$Dt_Customer, format = "%m/%d/%Y") #the dates are stored as month/day/year
str(superstore_data)
## 'data.frame': 2240 obs. of 22 variables:
## $ Id : int 1826 1 10476 1386 5371 7348 4073 1991 4047 9477 ...
## $ Year_Birth : int 1970 1961 1958 1967 1989 1958 1954 1967 1954 1954 ...
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr "Divorced" "Single" "Married" "Together" ...
## $ Income : num 84835 57091 67267 32474 21474 ...
## $ Kidhome : int 0 0 0 1 1 0 0 0 0 0 ...
## $ Teenhome : int 0 0 1 1 0 0 0 1 1 1 ...
## $ Dt_Customer : chr "6/16/2014" "6/15/2014" "5/13/2014" "11/5/2014" ...
## $ Recency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MntWines : int 189 464 134 10 6 336 769 78 384 384 ...
## $ MntFruits : int 104 5 11 0 16 130 80 0 0 0 ...
## $ MntMeatProducts : int 379 64 59 1 24 411 252 11 102 102 ...
## $ MntFishProducts : int 111 7 15 0 11 240 15 0 21 21 ...
## $ MntSweetProducts : int 189 0 2 0 0 32 34 0 32 32 ...
## $ MntGoldProds : int 218 37 30 0 34 43 65 7 5 5 ...
## $ NumDealsPurchases : int 1 1 1 1 2 1 1 1 3 3 ...
## $ NumWebPurchases : int 4 7 3 1 3 4 10 2 6 6 ...
## $ NumCatalogPurchases: int 4 3 2 0 1 7 10 1 2 2 ...
## $ NumStorePurchases : int 6 7 5 2 2 5 7 3 9 9 ...
## $ NumWebVisitsMonth : int 1 5 2 7 7 2 6 5 4 4 ...
## $ Response : int 1 1 0 0 1 1 1 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
# Dropping the Dt Customer and Id column
superstore_data1<-select(superstore_data, -Dt_Customer)
superstore_data2<-select(superstore_data1, -Id)
head(superstore_data2)
## Year_Birth Education Marital_Status Income Kidhome Teenhome Recency MntWines
## 1 1970 Graduation Divorced 84835 0 0 0 189
## 2 1961 Graduation Single 57091 0 0 0 464
## 3 1958 Graduation Married 67267 0 1 0 134
## 4 1967 Graduation Together 32474 1 1 0 10
## 5 1989 Graduation Single 21474 1 0 0 6
## 6 1958 PhD Single 71691 0 0 0 336
## MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds
## 1 104 379 111 189 218
## 2 5 64 7 0 37
## 3 11 59 15 2 30
## 4 0 1 0 0 0
## 5 16 24 11 0 34
## 6 130 411 240 32 43
## NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases
## 1 1 4 4 6
## 2 1 7 3 7
## 3 1 3 2 5
## 4 1 1 0 2
## 5 2 3 1 2
## 6 1 4 7 5
## NumWebVisitsMonth Response Complain
## 1 1 1 0
## 2 5 1 0
## 3 2 0 0
## 4 7 0 0
## 5 7 1 0
## 6 2 1 0
str(superstore_data2)
## 'data.frame': 2240 obs. of 20 variables:
## $ Year_Birth : int 1970 1961 1958 1967 1989 1958 1954 1967 1954 1954 ...
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr "Divorced" "Single" "Married" "Together" ...
## $ Income : num 84835 57091 67267 32474 21474 ...
## $ Kidhome : int 0 0 0 1 1 0 0 0 0 0 ...
## $ Teenhome : int 0 0 1 1 0 0 0 1 1 1 ...
## $ Recency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MntWines : int 189 464 134 10 6 336 769 78 384 384 ...
## $ MntFruits : int 104 5 11 0 16 130 80 0 0 0 ...
## $ MntMeatProducts : int 379 64 59 1 24 411 252 11 102 102 ...
## $ MntFishProducts : int 111 7 15 0 11 240 15 0 21 21 ...
## $ MntSweetProducts : int 189 0 2 0 0 32 34 0 32 32 ...
## $ MntGoldProds : int 218 37 30 0 34 43 65 7 5 5 ...
## $ NumDealsPurchases : int 1 1 1 1 2 1 1 1 3 3 ...
## $ NumWebPurchases : int 4 7 3 1 3 4 10 2 6 6 ...
## $ NumCatalogPurchases: int 4 3 2 0 1 7 10 1 2 2 ...
## $ NumStorePurchases : int 6 7 5 2 2 5 7 3 9 9 ...
## $ NumWebVisitsMonth : int 1 5 2 7 7 2 6 5 4 4 ...
## $ Response : int 1 1 0 0 1 1 1 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
#Performing label encoding on Categorical columns
superstore_data2$Marital_Status <- as.numeric(factor(superstore_data2$Marital_Status))
superstore_data2$Education <- as.numeric(factor(superstore_data2$Education))
#Installing the correlation packages
#install.packages("corrplot")
library(corrplot)
## corrplot 0.92 loaded
#Ensuring the Response column is stored as an integer before computing correlations
superstore_data2$Response<-as.integer(superstore_data2$Response)
#Checking the data types of the variables
sapply(superstore_data2,class)
## Year_Birth Education Marital_Status Income
## "integer" "numeric" "numeric" "numeric"
## Kidhome Teenhome Recency MntWines
## "integer" "integer" "integer" "integer"
## MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## "integer" "integer" "integer" "integer"
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## "integer" "integer" "integer" "integer"
## NumStorePurchases NumWebVisitsMonth Response Complain
## "integer" "integer" "integer" "integer"
#Performing Correlation
superstore_data2.cor=cor(superstore_data2)
#Visualizing the correlation
corrplot(superstore_data2.cor)
Observation: There is a positive correlation between MntWines and the other spending variables, and a negative correlation between the number of web visits (NumWebVisitsMonth) and Income. In the correlation plot, red dots indicate negative correlations and blue dots indicate positive correlations.
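Since the second objective is to identify the factors that affect a customer’s response, a short sketch (an addition that reuses the correlation matrix computed above) ranks the features by the strength of their correlation with Response:
#Ranking features by absolute correlation with Response
response_cor <- superstore_data2.cor[, "Response"]
sort(abs(response_cor[names(response_cor) != "Response"]), decreasing = TRUE)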
#Summary of dataset
summary(superstore_data2)
## Year_Birth Education Marital_Status Income
## Min. :1893 Min. :1.000 Min. :1.000 Min. : 1730
## 1st Qu.:1959 1st Qu.:3.000 1st Qu.:2.000 1st Qu.: 35539
## Median :1970 Median :3.000 Median :3.000 Median : 51742
## Mean :1969 Mean :3.394 Mean :2.735 Mean : 52247
## 3rd Qu.:1977 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.: 68290
## Max. :1996 Max. :5.000 Max. :5.000 Max. :666666
## Kidhome Teenhome Recency MntWines
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:24.00 1st Qu.: 23.75
## Median :0.0000 Median :0.0000 Median :49.00 Median : 173.50
## Mean :0.4442 Mean :0.5062 Mean :49.11 Mean : 303.94
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:74.00 3rd Qu.: 504.25
## Max. :2.0000 Max. :2.0000 Max. :99.00 Max. :1493.00
## MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.0 1st Qu.: 16.0 1st Qu.: 3.00 1st Qu.: 1.00
## Median : 8.0 Median : 67.0 Median : 12.00 Median : 8.00
## Mean : 26.3 Mean : 166.9 Mean : 37.53 Mean : 27.06
## 3rd Qu.: 33.0 3rd Qu.: 232.0 3rd Qu.: 50.00 3rd Qu.: 33.00
## Max. :199.0 Max. :1725.0 Max. :259.00 Max. :263.00
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 9.00 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 0.000
## Median : 24.00 Median : 2.000 Median : 4.000 Median : 2.000
## Mean : 44.02 Mean : 2.325 Mean : 4.085 Mean : 2.662
## 3rd Qu.: 56.00 3rd Qu.: 3.000 3rd Qu.: 6.000 3rd Qu.: 4.000
## Max. :362.00 Max. :15.000 Max. :27.000 Max. :28.000
## NumStorePurchases NumWebVisitsMonth Response Complain
## Min. : 0.00 Min. : 0.000 Min. :0.0000 Min. :0.000000
## 1st Qu.: 3.00 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.000000
## Median : 5.00 Median : 6.000 Median :0.0000 Median :0.000000
## Mean : 5.79 Mean : 5.317 Mean :0.1491 Mean :0.009375
## 3rd Qu.: 8.00 3rd Qu.: 7.000 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :13.00 Max. :20.000 Max. :1.0000 Max. :1.000000
Observation: Looking at the Income column, the mean income of customers in this dataset is 52247, with a maximum of 666666 and a minimum of 1730; the maximum looks like an outlier.
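A minimal sketch (not part of the original cleaning) that counts how many Income values lie beyond the usual boxplot fence of Q3 + 1.5 * IQR:
#Counting Income values above the upper fence
income_q <- quantile(superstore_data2$Income, c(0.25, 0.75))
upper_fence <- income_q[2] + 1.5 * (income_q[2] - income_q[1])
sum(superstore_data2$Income > upper_fence)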
#Normality test
#Installing the required packages (note: the Anderson-Darling test further below uses the nortest package)
if (!requireNamespace("remotes")) install.packages("remotes")
## Loading required namespace: remotes
remotes::install_github("asael697/nortsTest",dependencies = TRUE)
## Skipping install of 'nortsTest' from a github remote, the SHA1 (b9198769) has not changed since last install.
## Use `force = TRUE` to force installation
#Plotting the frequency distribution of the Response variable
response<-as.integer(superstore_data2$Response)
hist(response, col = 'steelblue', main = 'Distribution of Response',
xlab = 'response')
Observation: The data does not appear to be normally distributed. To confirm this, we conduct an Anderson-Darling (A-D) test to formally test for normality.
# Checking for normality of data
# Anderson-Darling normality test
#install.packages("nortest")
library(nortest)
nortest::ad.test(response)
##
## Anderson-Darling normality test
##
## data: response
## A = 665.85, p-value < 2.2e-16
Observation: The p-value of the test is less than 0.05, so we have sufficient evidence to reject the null hypothesis and conclude that the Response column does not follow a normal distribution.
##Modelling using the Decision Tree Algorithm
set.seed(1234) #to make the results reproducible
random<-runif(2240) #generating uniform random numbers, one per row
superstore_random2<-superstore_data2[order(random),] #shuffling the rows into a random order
head(superstore_random2)
## Year_Birth Education Marital_Status Income Kidhome Teenhome Recency
## 783 1989 4 1 10979 0 0 34
## 473 1945 5 2 70356 0 0 20
## 746 1981 1 4 19414 1 0 32
## 996 1968 3 3 57107 0 1 44
## 1575 1976 3 2 17649 1 0 70
## 1801 1976 3 2 17148 1 0 80
## MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 783 8 4 10 2 2
## 473 1349 16 249 43 16
## 746 2 3 12 3 5
## 996 159 0 120 0 0
## 1575 15 1 23 0 5
## 1801 9 9 11 13 10
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 783 4 2 3 0
## 473 33 2 10 6
## 746 7 1 1 0
## 996 137 7 4 4
## 1575 1 3 3 0
## 1801 16 4 3 1
## NumStorePurchases NumWebVisitsMonth Response Complain
## 783 3 5 0 0
## 473 9 6 0 0
## 746 3 8 0 0
## 996 8 8 1 0
## 1575 3 8 0 0
## 1801 3 8 0 0
#Tabulating the Response variable for a clear view of the class distribution
responses<-table(superstore_random2$Response)
head(responses)
##
## 0 1
## 1906 334
Observation: The output shows 1906 negative (0) responses and 334 positive (1) responses, so the target is imbalanced.
#Decision Tree
#loading required library
library(rpart)
# set the response variable
y <- superstore_random2$Response
# set the predictor variables
x<-select(superstore_random2, -Response)
head(x)
## Year_Birth Education Marital_Status Income Kidhome Teenhome Recency
## 783 1989 4 1 10979 0 0 34
## 473 1945 5 2 70356 0 0 20
## 746 1981 1 4 19414 1 0 32
## 996 1968 3 3 57107 0 1 44
## 1575 1976 3 2 17649 1 0 70
## 1801 1976 3 2 17148 1 0 80
## MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 783 8 4 10 2 2
## 473 1349 16 249 43 16
## 746 2 3 12 3 5
## 996 159 0 120 0 0
## 1575 15 1 23 0 5
## 1801 9 9 11 13 10
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 783 4 2 3 0
## 473 33 2 10 6
## 746 7 1 1 0
## 996 137 7 4 4
## 1575 1 3 3 0
## 1801 16 4 3 1
## NumStorePurchases NumWebVisitsMonth Complain
## 783 3 5 0
## 473 9 6 0
## 746 3 8 0
## 996 8 8 0
## 1575 3 8 0
## 1801 3 8 0
# fit the decision tree model
tree_model <- rpart(y ~ ., data = x, method = "class")
# make predictions
predictions <- predict(tree_model, newdata = x, type = "class")
# view the tree output
print(tree_model)
## n= 2240
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 2240 334 0 (0.85089286 0.14910714)
## 2) MntWines< 815.5 2009 238 0 (0.88153310 0.11846690)
## 4) Recency>=20.5 1577 127 0 (0.91946734 0.08053266)
## 8) Income< 85613 1539 109 0 (0.92917479 0.07082521) *
## 9) Income>=85613 38 18 0 (0.52631579 0.47368421)
## 18) MntWines< 563.5 24 7 0 (0.70833333 0.29166667) *
## 19) MntWines>=563.5 14 3 1 (0.21428571 0.78571429) *
## 5) Recency< 20.5 432 111 0 (0.74305556 0.25694444)
## 10) NumCatalogPurchases< 0.5 129 11 0 (0.91472868 0.08527132) *
## 11) NumCatalogPurchases>=0.5 303 100 0 (0.66996700 0.33003300)
## 22) NumStorePurchases>=2.5 272 77 0 (0.71691176 0.28308824)
## 44) MntMeatProducts< 498 239 57 0 (0.76150628 0.23849372)
## 88) NumWebVisitsMonth< 7.5 202 38 0 (0.81188119 0.18811881)
## 176) NumDealsPurchases< 6.5 194 32 0 (0.83505155 0.16494845) *
## 177) NumDealsPurchases>=6.5 8 2 1 (0.25000000 0.75000000) *
## 89) NumWebVisitsMonth>=7.5 37 18 1 (0.48648649 0.51351351)
## 178) Income>=48828 13 3 0 (0.76923077 0.23076923) *
## 179) Income< 48828 24 8 1 (0.33333333 0.66666667) *
## 45) MntMeatProducts>=498 33 13 1 (0.39393939 0.60606061) *
## 23) NumStorePurchases< 2.5 31 8 1 (0.25806452 0.74193548) *
## 3) MntWines>=815.5 231 96 0 (0.58441558 0.41558442)
## 6) MntMeatProducts< 720 200 74 0 (0.63000000 0.37000000)
## 12) NumStorePurchases>=9.5 88 20 0 (0.77272727 0.22727273) *
## 13) NumStorePurchases< 9.5 112 54 0 (0.51785714 0.48214286)
## 26) Recency>=66.5 38 10 0 (0.73684211 0.26315789) *
## 27) Recency< 66.5 74 30 1 (0.40540541 0.59459459)
## 54) MntSweetProducts>=95 12 3 0 (0.75000000 0.25000000) *
## 55) MntSweetProducts< 95 62 21 1 (0.33870968 0.66129032) *
## 7) MntMeatProducts>=720 31 9 1 (0.29032258 0.70967742) *
# Visualizing the tree
#loading required package
#install.packages("rpart.plot")
library(rpart.plot)
# fit the decision tree model
tree_model <- rpart(y ~ ., data = x, method = "class")
# visualize the tree
rpart.plot(tree_model)
##Evaluating our model performance
# fit the decision tree model
tree_model <- rpart(y ~ ., data = x, method = "class")
# make predictions
predictions <- predict(tree_model, newdata = x, type = "class")
# calculate accuracy
accuracy <- mean(predictions == y)
#Getting the accuracy
print(accuracy)
## [1] 0.884375
Observation: Accuracy stands at about 88%. Note that this is measured on the same data used to fit the tree, so it is an optimistic (training) estimate.
#confusionMatrix()
#loading the required library
#install.packages("lattice")
library(caret) #Used for computing the confusion matrix
## Loading required package: lattice
# fit the decision tree model
tree_model <- rpart(y ~ ., data = x, method = "class")
# make predictions
predictions <- predict(tree_model, newdata = x, type = "class")
# convert predictions and y to factors with the same levels
predictions <- as.factor(predictions)
y <- as.factor(y)
levels(predictions) <- levels(y)
# create the confusion matrix
confusionMatrix(data = predictions, reference = y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1842 195
## 1 64 139
##
## Accuracy : 0.8844
## 95% CI : (0.8704, 0.8973)
## No Information Rate : 0.8509
## P-Value [Acc > NIR] : 2.522e-06
##
## Kappa : 0.4564
##
## Mcnemar's Test P-Value : 6.594e-16
##
## Sensitivity : 0.9664
## Specificity : 0.4162
## Pos Pred Value : 0.9043
## Neg Pred Value : 0.6847
## Prevalence : 0.8509
## Detection Rate : 0.8223
## Detection Prevalence : 0.9094
## Balanced Accuracy : 0.6913
##
## 'Positive' Class : 0
##
Overall accuracy stands at about 88%, but the model correctly identifies only about 42% of the customers who actually responded (139 of 334), reflecting the class imbalance.
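Because the tree above was fitted and evaluated on the same 2240 rows, these figures are training-set estimates. A hedged sketch (an addition, not the author’s original workflow) of a simple holdout check on the shuffled data:
#Fit on 80% of the shuffled rows and score the remaining 20%
set.seed(1234)
train_idx <- sample(seq_len(nrow(superstore_random2)), size = round(0.8 * nrow(superstore_random2)))
tree_train <- superstore_random2[train_idx, ]
tree_test <- superstore_random2[-train_idx, ]
holdout_tree <- rpart(Response ~ ., data = tree_train, method = "class")
holdout_pred <- predict(holdout_tree, newdata = tree_test, type = "class")
mean(as.character(holdout_pred) == as.character(tree_test$Response)) #holdout accuracy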
###Dealing with class imbalance by oversampling
#install.packages("tidyverse")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ purrr 0.3.5
## ✔ tidyr 1.2.1 ✔ forcats 0.5.2
## ✔ readr 2.1.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ data.table::between() masks dplyr::between()
## ✖ dplyr::filter() masks stats::filter()
## ✖ data.table::first() masks dplyr::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ data.table::last() masks dplyr::last()
## ✖ purrr::lift() masks caret::lift()
## ✖ purrr::transpose() masks data.table::transpose()
Let’s first check the frequency of the Response variable’s levels (after converting it to a factor).
#Converting the classes to factor
superstore_data2$Response<-as.factor(superstore_data2$Response)
prop.table(table(superstore_data2$Response))
##
## 0 1
## 0.8508929 0.1491071
As we see, the minority class “1” makes up only about 15% of the cases. We also show the summary of the data to take an overall look at all the features and check for missing values or unusual outliers.
#checking the summary
summary(superstore_data2)
## Year_Birth Education Marital_Status Income
## Min. :1893 Min. :1.000 Min. :1.000 Min. : 1730
## 1st Qu.:1959 1st Qu.:3.000 1st Qu.:2.000 1st Qu.: 35539
## Median :1970 Median :3.000 Median :3.000 Median : 51742
## Mean :1969 Mean :3.394 Mean :2.735 Mean : 52247
## 3rd Qu.:1977 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.: 68290
## Max. :1996 Max. :5.000 Max. :5.000 Max. :666666
## Kidhome Teenhome Recency MntWines
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:24.00 1st Qu.: 23.75
## Median :0.0000 Median :0.0000 Median :49.00 Median : 173.50
## Mean :0.4442 Mean :0.5062 Mean :49.11 Mean : 303.94
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:74.00 3rd Qu.: 504.25
## Max. :2.0000 Max. :2.0000 Max. :99.00 Max. :1493.00
## MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.0 1st Qu.: 16.0 1st Qu.: 3.00 1st Qu.: 1.00
## Median : 8.0 Median : 67.0 Median : 12.00 Median : 8.00
## Mean : 26.3 Mean : 166.9 Mean : 37.53 Mean : 27.06
## 3rd Qu.: 33.0 3rd Qu.: 232.0 3rd Qu.: 50.00 3rd Qu.: 33.00
## Max. :199.0 Max. :1725.0 Max. :259.00 Max. :263.00
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 9.00 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 0.000
## Median : 24.00 Median : 2.000 Median : 4.000 Median : 2.000
## Mean : 44.02 Mean : 2.325 Mean : 4.085 Mean : 2.662
## 3rd Qu.: 56.00 3rd Qu.: 3.000 3rd Qu.: 6.000 3rd Qu.: 4.000
## Max. :362.00 Max. :15.000 Max. :27.000 Max. :28.000
## NumStorePurchases NumWebVisitsMonth Response Complain
## Min. : 0.00 Min. : 0.000 0:1906 Min. :0.000000
## 1st Qu.: 3.00 1st Qu.: 3.000 1: 334 1st Qu.:0.000000
## Median : 5.00 Median : 6.000 Median :0.000000
## Mean : 5.79 Mean : 5.317 Mean :0.009375
## 3rd Qu.: 8.00 3rd Qu.: 7.000 3rd Qu.:0.000000
## Max. :13.00 Max. :20.000 Max. :1.000000
Looking at this summary, we do not have any critical issues such as missing values.
##Data partition
Before applying any subsampling method, we first split the data into a training set and a testing set; only the training set is subsampled.
library(caret) #provides the upSample function for the upsampling technique
set.seed(1234)
index<-createDataPartition(superstore_data2$Response, p=0.8, list=FALSE)
train<-superstore_data2[index,]
test<-superstore_data2[-index,]
set.seed(111)
trainup<-upSample(x = train[, -which(names(train) == "Response")], #dropping the target from the predictors
                  y = train$Response)
table(trainup$Class) #upSample stores the balanced target in a column named Class
##
## 0 1
## 1525 1525
Observation: As we see, the two classes now have the same size of 1525 observations each.
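The balanced set trainup is not used further below, so as a brief sketch (an addition, not the author’s next step) here is how it could feed the same decision tree and be scored on the untouched test split:
#Refitting the decision tree on the upsampled training data
balanced_tree <- rpart(Class ~ ., data = trainup, method = "class")
balanced_pred <- predict(balanced_tree, newdata = test, type = "class")
mean(as.character(balanced_pred) == as.character(test$Response)) #accuracy on the test split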
Now let’s use the Naive Bayes algorithm for prediction (fitted below on a fresh 70/30 split of the original data).
# Installing Packages
#install.packages("e1071")
#install.packages("caTools")
#install.packages("class")
# Loading package
library(e1071)
library(caTools)
library(class)
# Splitting data into train
# and test data
split <- sample.split(superstore_data2$Response, SplitRatio = 0.7) #splitting on the target vector
train_cl <- subset(superstore_data2, split == "TRUE")
test_cl <- subset(superstore_data2, split == "FALSE")
# Feature Scaling of the first four columns (computed for illustration; the Naive Bayes model below uses the unscaled data)
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4])
# Fitting Naive Bayes Model
# to training dataset
set.seed(120) # Setting Seed
classifier_cl <- naiveBayes(Response ~ ., data = train_cl)
classifier_cl
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.8571429 0.1428571
##
## Conditional probabilities:
## Year_Birth
## Y [,1] [,2]
## 0 1968.808 11.76192
## 1 1968.911 12.71809
##
## Education
## Y [,1] [,2]
## 0 3.363095 1.117388
## 1 3.660714 1.124960
##
## Marital_Status
## Y [,1] [,2]
## 0 2.732887 1.062845
## 1 2.705357 1.093301
##
## Income
## Y [,1] [,2]
## 0 50691.51 20485.67
## 1 61214.57 22848.57
##
## Kidhome
## Y [,1] [,2]
## 0 0.4575893 0.5454630
## 1 0.3303571 0.4808138
##
## Teenhome
## Y [,1] [,2]
## 0 0.5498512 0.5502724
## 1 0.2946429 0.4854547
##
## Recency
## Y [,1] [,2]
## 0 51.29539 28.58788
## 1 36.42411 27.96448
##
## MntWines
## Y [,1] [,2]
## 0 273.2388 304.7892
## 1 513.3080 435.0880
##
## MntFruits
## Y [,1] [,2]
## 0 23.76042 37.47206
## 1 38.36607 47.57764
##
## MntMeatProducts
## Y [,1] [,2]
## 0 147.1429 207.2389
## 1 297.0625 285.9604
##
## MntFishProducts
## Y [,1] [,2]
## 0 34.78051 53.14568
## 1 55.39286 62.30611
##
## MntSweetProducts
## Y [,1] [,2]
## 0 24.36310 39.09171
## 1 38.79018 44.43240
##
## MntGoldProds
## Y [,1] [,2]
## 0 42.10863 52.20101
## 1 63.85714 58.55010
##
## NumDealsPurchases
## Y [,1] [,2]
## 0 2.351935 1.938852
## 1 2.196429 1.976722
##
## NumWebPurchases
## Y [,1] [,2]
## 0 3.918899 2.749421
## 1 4.901786 2.500080
##
## NumCatalogPurchases
## Y [,1] [,2]
## 0 2.443452 2.822063
## 1 4.419643 3.276875
##
## NumStorePurchases
## Y [,1] [,2]
## 0 5.787202 3.298405
## 1 6.214286 3.211327
##
## NumWebVisitsMonth
## Y [,1] [,2]
## 0 5.341518 2.430607
## 1 5.147321 2.600568
##
## Complain
## Y [,1] [,2]
## 0 0.011160714 0.10509220
## 1 0.008928571 0.09427902
Observation: The model estimates, for each feature separately, its conditional distribution given the class; for these numeric features it stores the class-wise mean and standard deviation shown in the two columns above. The a-priori probabilities are also calculated and indicate the class distribution of the training data.
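To make the table concrete: for a numeric feature, the classifier scores each class with a Gaussian density built from that class-wise mean ([,1]) and standard deviation ([,2]). A small illustration (the Income figures are copied from the output above; an income of 70000 is just a hypothetical example):
#Per-class Gaussian likelihood of an income of 70000
income_means <- c("0" = 50691.51, "1" = 61214.57)
income_sds <- c("0" = 20485.67, "1" = 22848.57)
dnorm(70000, mean = income_means, sd = income_sds)
#These per-feature likelihoods are multiplied together and by the a-priori probabilities; the class with the larger product is predicted.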
# Predicting on the test data
y_pred <- predict(classifier_cl, newdata = test_cl)
# Confusion Matrix
cm <- table(test_cl$Response, y_pred)
cm
## y_pred
## 0 1
## 0 454 108
## 1 60 50
# Model Evaluation
confusionMatrix(cm)
## Confusion Matrix and Statistics
##
## y_pred
## 0 1
## 0 454 108
## 1 60 50
##
## Accuracy : 0.75
## 95% CI : (0.7155, 0.7823)
## No Information Rate : 0.7649
## P-Value [Acc > NIR] : 0.8304408
##
## Kappa : 0.2232
##
## Mcnemar's Test P-Value : 0.0002877
##
## Sensitivity : 0.8833
## Specificity : 0.3165
## Pos Pred Value : 0.8078
## Neg Pred Value : 0.4545
## Prevalence : 0.7649
## Detection Rate : 0.6756
## Detection Prevalence : 0.8363
## Balanced Accuracy : 0.5999
##
## 'Positive' Class : 0
##
We can see that with Naive Bayes the accuracy is 75%, lower than the 88% achieved by the Decision Tree algorithm; its balanced accuracy (about 60%) is also lower than the tree’s (about 69%).
##Conclusion
The decision tree model predicts the likelihood of a positive response with an accuracy of about 88%, slightly short of the 90% target, while the Naive Bayes model reaches 75%. The tree splits suggest that spending on wines and meat products, recency of the last purchase, income, and the number of catalog and store purchases are the factors most strongly associated with a positive response.
##Recommendation
The campaign should prioritize customers the model flags as likely responders, particularly recent, higher-spending customers. Because positive responses are rare, future work should train on the balanced (upsampled) data and judge models on sensitivity to the responder class as well as overall accuracy.