##Overview
A superstore is planning for the year-end sale. They want to launch a new offer, a gold membership that gives a 20% discount on all purchases for only $499 instead of the usual $999. It will be valid only for existing customers, and a phone-call campaign targeting them is currently being planned. The management feels that the best way to reduce the cost of the campaign is to build a predictive model that classifies the customers who are likely to purchase the offer.
##Objectives
This work aims to meet the following objectives:
i. To predict the likelihood of a customer giving a positive response.
ii. To identify the different factors that affect a customer’s response.
##Questions
The work will seek to answer the following questions:
i) What is the likelihood of a customer giving a positive response?
ii) What are the factors that affect a customer’s response?
##Metrics for Success
Success for this work means building a model that correctly predicts the likelihood of a customer giving a positive response. Success also means correctly analyzing and establishing the factors that contribute to a customer responding positively to the Superstore campaign.
We aim to achieve an accuracy level of 90%.
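Here, accuracy means the share of customers whose response the model predicts correctly. A minimal sketch (an illustration with hypothetical vector names, not part of the analysis below) of how that target would be checked:
#Proportion of correct predictions, given hypothetical 0/1 vectors `predicted` and `actual`
accuracy_rate <- function(predicted, actual) {
  mean(predicted == actual)
}
#Example: accuracy_rate(c(1, 0, 1, 0), c(1, 1, 1, 0)) returns 0.75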
##The Experimental Design
Below are the steps taken in this analysis:
a. Loading the required libraries
b. Loading and previewing the data
c. Cleaning the data
d. Exploratory Data Analysis (EDA)
e. Creating a predictive model
f. Drawing conclusions
g. Giving recommendations
##Data Relevance and Validation
The data available is relevant for the intended analysis. It contains information that is significant for predicting the likelihood of a customer giving a positive response: customers’ income, education level, marital status, children in the household, shopping trends, responses, and complaints.
##Understanding the context
The dataset we are working with contains the following columns:
a. Id - unique ID of each customer
b. Year_Birth - the customer’s year of birth
c. Complain - 1 if the customer complained in the last 2 years
d. Dt_Customer - date of the customer’s enrollment with the company
e. Education - customer’s level of education
f. Marital_Status - customer’s marital status
g. Kidhome - number of small children in the customer’s household
h. Teenhome - number of teenagers in the customer’s household
i. Income - customer’s yearly household income
j. MntFishProducts - amount spent on fish products in the last 2 years
k. MntMeatProducts - amount spent on meat products in the last 2 years
l. MntFruits - amount spent on fruit products in the last 2 years
m. MntSweetProducts - amount spent on sweet products in the last 2 years
n. MntWines - amount spent on wine products in the last 2 years
o. MntGoldProds - amount spent on gold products in the last 2 years
p. NumDealsPurchases - number of purchases made with a discount
q. NumCatalogPurchases - number of purchases made using a catalog (buying goods to be shipped through the mail)
r. NumStorePurchases - number of purchases made directly in stores
s. NumWebPurchases - number of purchases made through the company’s website
t. NumWebVisitsMonth - number of visits to the company’s website in the last month
u. Response (target) - 1 if the customer accepted the offer in the last campaign, 0 otherwise
v. Recency - number of days since the last purchase
The dataset has 2240 observations and 22 columns.
The dataset can be accessed from this link: Superstore Marketing Campaign.
##Reading and Understanding The Dataset
superstore_data <- read.csv("C:/Users/HP/Downloads/superstore_data.csv") #Reading the data set
#Previewing the first five rows of the data set
head(superstore_data,5)
## Id Year_Birth Education Marital_Status Income Kidhome Teenhome
## 1 1826 1970 Graduation Divorced 84835 0 0
## 2 1 1961 Graduation Single 57091 0 0
## 3 10476 1958 Graduation Married 67267 0 1
## 4 1386 1967 Graduation Together 32474 1 1
## 5 5371 1989 Graduation Single 21474 1 0
## Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts
## 1 6/16/2014 0 189 104 379 111
## 2 6/15/2014 0 464 5 64 7
## 3 5/13/2014 0 134 11 59 15
## 4 11/5/2014 0 10 0 1 0
## 5 8/4/2014 0 6 16 24 11
## MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
## 1 189 218 1 4
## 2 0 37 1 7
## 3 2 30 1 3
## 4 0 0 1 1
## 5 0 34 2 3
## NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Response Complain
## 1 4 6 1 1 0
## 2 3 7 5 1 0
## 3 2 5 2 0 0
## 4 0 2 7 0 0
## 5 1 2 7 1 0
#Previewing the last 5 rows of the data set
tail(superstore_data,5)
## Id Year_Birth Education Marital_Status Income Kidhome Teenhome
## 2236 10142 1976 PhD Divorced 66476 0 1
## 2237 5263 1977 2n Cycle Married 31056 1 0
## 2238 22 1976 Graduation Divorced 46310 1 0
## 2239 528 1978 Graduation Married 65819 0 0
## 2240 4070 1969 PhD Married 94871 0 2
## Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts
## 2236 7/3/2013 99 372 18 126 47
## 2237 1/22/2013 99 5 10 13 3
## 2238 3/12/2012 99 185 2 88 15
## 2239 11/29/2012 99 267 38 701 149
## 2240 1/9/2012 99 169 24 553 188
## MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
## 2236 48 78 2 5
## 2237 8 16 1 1
## 2238 5 14 2 6
## 2239 165 63 1 5
## 2240 0 144 1 8
## NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Response Complain
## 2236 2 11 4 0 0
## 2237 0 3 8 0 0
## 2238 1 5 8 0 0
## 2239 4 10 3 0 0
## 2240 5 4 7 1 0
#Checking for the data types of the variables
str(superstore_data)
## 'data.frame': 2240 obs. of 22 variables:
## $ Id : int 1826 1 10476 1386 5371 7348 4073 1991 4047 9477 ...
## $ Year_Birth : int 1970 1961 1958 1967 1989 1958 1954 1967 1954 1954 ...
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr "Divorced" "Single" "Married" "Together" ...
## $ Income : int 84835 57091 67267 32474 21474 71691 63564 44931 65324 65324 ...
## $ Kidhome : int 0 0 0 1 1 0 0 0 0 0 ...
## $ Teenhome : int 0 0 1 1 0 0 0 1 1 1 ...
## $ Dt_Customer : chr "6/16/2014" "6/15/2014" "5/13/2014" "11/5/2014" ...
## $ Recency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MntWines : int 189 464 134 10 6 336 769 78 384 384 ...
## $ MntFruits : int 104 5 11 0 16 130 80 0 0 0 ...
## $ MntMeatProducts : int 379 64 59 1 24 411 252 11 102 102 ...
## $ MntFishProducts : int 111 7 15 0 11 240 15 0 21 21 ...
## $ MntSweetProducts : int 189 0 2 0 0 32 34 0 32 32 ...
## $ MntGoldProds : int 218 37 30 0 34 43 65 7 5 5 ...
## $ NumDealsPurchases : int 1 1 1 1 2 1 1 1 3 3 ...
## $ NumWebPurchases : int 4 7 3 1 3 4 10 2 6 6 ...
## $ NumCatalogPurchases: int 4 3 2 0 1 7 10 1 2 2 ...
## $ NumStorePurchases : int 6 7 5 2 2 5 7 3 9 9 ...
## $ NumWebVisitsMonth : int 1 5 2 7 7 2 6 5 4 4 ...
## $ Response : int 1 1 0 0 1 1 1 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
The output shows that the dataset has 2240 rows and 22 columns; three columns are of character type and the rest are integers.
#Checking the dimension of the data
dim(superstore_data)
## [1] 2240 22
This confirms that the dataset has 2240 rows and 22 columns.
#Checking for missing variables
colSums(is.na(superstore_data))
## Id Year_Birth Education Marital_Status
## 0 0 0 0
## Income Kidhome Teenhome Dt_Customer
## 24 0 0 0
## Recency MntWines MntFruits MntMeatProducts
## 0 0 0 0
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## 0 0 0 0
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 0 0 0 0
## Response Complain
## 0 0
This shows that the Income column is the only one with missing values (24 of them).
#Replacing the missing values with the mean of the column
superstore_data$Income[is.na(superstore_data$Income)]<-mean(superstore_data$Income, na.rm=TRUE)
#Checking to confirm the missing variables have been filled
colSums(is.na(superstore_data))
## Id Year_Birth Education Marital_Status
## 0 0 0 0
## Income Kidhome Teenhome Dt_Customer
## 0 0 0 0
## Recency MntWines MntFruits MntMeatProducts
## 0 0 0 0
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## 0 0 0 0
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 0 0 0 0
## Response Complain
## 0 0
Observation: This shows that all missing values have been replaced with the column mean; the dataset now has no missing values.
# Checking for the unique values in the Marital Status column
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
unique_values<-unique(superstore_data$Marital_Status)
unique_values
## [1] "Divorced" "Single" "Married" "Together" "Widow" "YOLO" "Alone"
## [8] "Absurd"
#Recoding the non-standard values in the Marital Status column to Single
library(stringr)
superstore_data$Marital_Status<-str_replace(superstore_data$Marital_Status,
"YOLO","Single")
superstore_data$Marital_Status<-str_replace(superstore_data$Marital_Status,
"Alone","Single")
superstore_data$Marital_Status<-str_replace(superstore_data$Marital_Status,
"Absurd","Single")
# Checking to see whether the variables in the Marital Status column have been changed
unique_marital<-unique(superstore_data$Marital_Status)
unique_marital
## [1] "Divorced" "Single" "Married" "Together" "Widow"
##Performing Exploratory Data Analysis
# A pie chart of the educational level of the customers
edu<-table(superstore_data$Education)
edu
##
## 2n Cycle Basic Graduation Master PhD
## 203 54 1127 370 486
pie(edu,
col = hcl.colors(length(edu), "BluYl"))
Observation: Most of the customers are in the Graduation sub-category, followed by those who hold PhDs and Masters. The fewest are those with Basic education.
library("ggplot2") #Library that supports plotting
#Ggplot comparing Income vs Education
ggplot(superstore_data, aes(Education,Income)) + geom_bar(stat = "identity", color = "purple") +theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "black")) + ggtitle("Income vs Education") + theme_bw()
Observation: The plot shows that customers in the Graduation category account for the highest income, while those with Basic education account for the least. (The bars sum income across customers, so group size also drives bar height.)
#A scatter plot of year of birth vs sweet products
ggplot(superstore_data, aes(x= Year_Birth, y = MntSweetProducts)) + geom_point(size = 2.5, color="navy") + xlab("Year of Birth") + ylab("MntSweetProducts") + ggtitle("Year of Birth vs MntSweetProducts")
Observation: The scatter plot shows that customers born between 1950 and 1980 account for most of the sweet-product purchases.
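To back this reading up with numbers, here is a brief sketch (an addition, not part of the original workflow) that totals sweet-product spend by decade of birth using the dplyr package loaded earlier:
#Total sweet-product spend and customer count per decade of birth
superstore_data %>%
  mutate(decade = floor(Year_Birth / 10) * 10) %>%
  group_by(decade) %>%
  summarise(total_sweets = sum(MntSweetProducts), customers = n()) %>%
  arrange(desc(total_sweets))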
# A table of complaints
complains<-table(superstore_data$Complain)
complains
##
## 0 1
## 2219 21
#Barplot of complaints
barplot(complains)
Observation: The barplot of complaints shows that the data is highly imbalanced; very few customers complained.
#Comparing the Year of birth against Complain
ggplot(superstore_data, aes(Year_Birth,Complain)) + geom_bar(stat = "identity", color = "purple") +theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "black")) + ggtitle("Year of Birth vs Complain") + theme_bw()
Observation: Most complaints were recorded from customers born between
1950 and 1980
#Barplot for response
responses<-table(superstore_data$Response)
barplot(responses)
Observation: This barplot shows the distribution of Response, which is clearly imbalanced.
#Comparing the Year of birth against Response
ggplot(superstore_data, aes(Year_Birth,Response)) + geom_bar(stat = "identity", color = "purple") +theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "black")) + ggtitle("Year of Birth vs Response") + theme_bw()
Observation: Most positive responses came from customers born from 1940 onward, with the highest number from customers born around 1970.
#Scatter plot between Year of birth and Fish products
ggplot(superstore_data, aes(x= Year_Birth, y = MntFishProducts)) + geom_point(size = 2.5, color="navy") + xlab("Year of Birth") + ylab("MntFishProducts") + ggtitle("Year of Birth vs MntFishProducts")
# Scatter plot showing year of birth versus amount of wine purchased
ggplot(superstore_data, aes(x= Year_Birth, y = MntWines)) + geom_point(size = 2.5, color="navy") + xlab("Year of Birth") + ylab("MntWines") + ggtitle("Year of Birth vs MntWines")
#Boxplot between Income and Marital Status
ggplot(superstore_data, aes(Marital_Status,Income)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "red")) + xlab("Marital_Status") + ylab("Income") + ggtitle("Income vs Marital Status")
Observation: Almost all categories have similar incomes; the widowed have a slightly higher median income, while those who are together include some extreme income outliers.
#Converting Dt_Customer from character (chr) to the Date type
library("data.table") #Loading the data.table package
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
date_variable <- as.Date(superstore_data$Dt_Customer, format = "%m/%d/%Y") #the dates are stored as month/day/year
str(superstore_data)
## 'data.frame': 2240 obs. of 22 variables:
## $ Id : int 1826 1 10476 1386 5371 7348 4073 1991 4047 9477 ...
## $ Year_Birth : int 1970 1961 1958 1967 1989 1958 1954 1967 1954 1954 ...
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr "Divorced" "Single" "Married" "Together" ...
## $ Income : num 84835 57091 67267 32474 21474 ...
## $ Kidhome : int 0 0 0 1 1 0 0 0 0 0 ...
## $ Teenhome : int 0 0 1 1 0 0 0 1 1 1 ...
## $ Dt_Customer : chr "6/16/2014" "6/15/2014" "5/13/2014" "11/5/2014" ...
## $ Recency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MntWines : int 189 464 134 10 6 336 769 78 384 384 ...
## $ MntFruits : int 104 5 11 0 16 130 80 0 0 0 ...
## $ MntMeatProducts : int 379 64 59 1 24 411 252 11 102 102 ...
## $ MntFishProducts : int 111 7 15 0 11 240 15 0 21 21 ...
## $ MntSweetProducts : int 189 0 2 0 0 32 34 0 32 32 ...
## $ MntGoldProds : int 218 37 30 0 34 43 65 7 5 5 ...
## $ NumDealsPurchases : int 1 1 1 1 2 1 1 1 3 3 ...
## $ NumWebPurchases : int 4 7 3 1 3 4 10 2 6 6 ...
## $ NumCatalogPurchases: int 4 3 2 0 1 7 10 1 2 2 ...
## $ NumStorePurchases : int 6 7 5 2 2 5 7 3 9 9 ...
## $ NumWebVisitsMonth : int 1 5 2 7 7 2 6 5 4 4 ...
## $ Response : int 1 1 0 0 1 1 1 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
# Dropping the Dt Customer and Id column
superstore_data1<-select(superstore_data, -Dt_Customer)
superstore_data2<-select(superstore_data1, -Id)
head(superstore_data2)
## Year_Birth Education Marital_Status Income Kidhome Teenhome Recency MntWines
## 1 1970 Graduation Divorced 84835 0 0 0 189
## 2 1961 Graduation Single 57091 0 0 0 464
## 3 1958 Graduation Married 67267 0 1 0 134
## 4 1967 Graduation Together 32474 1 1 0 10
## 5 1989 Graduation Single 21474 1 0 0 6
## 6 1958 PhD Single 71691 0 0 0 336
## MntFruits MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds
## 1 104 379 111 189 218
## 2 5 64 7 0 37
## 3 11 59 15 2 30
## 4 0 1 0 0 0
## 5 16 24 11 0 34
## 6 130 411 240 32 43
## NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases
## 1 1 4 4 6
## 2 1 7 3 7
## 3 1 3 2 5
## 4 1 1 0 2
## 5 2 3 1 2
## 6 1 4 7 5
## NumWebVisitsMonth Response Complain
## 1 1 1 0
## 2 5 1 0
## 3 2 0 0
## 4 7 0 0
## 5 7 1 0
## 6 2 1 0
str(superstore_data2)
## 'data.frame': 2240 obs. of 20 variables:
## $ Year_Birth : int 1970 1961 1958 1967 1989 1958 1954 1967 1954 1954 ...
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr "Divorced" "Single" "Married" "Together" ...
## $ Income : num 84835 57091 67267 32474 21474 ...
## $ Kidhome : int 0 0 0 1 1 0 0 0 0 0 ...
## $ Teenhome : int 0 0 1 1 0 0 0 1 1 1 ...
## $ Recency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MntWines : int 189 464 134 10 6 336 769 78 384 384 ...
## $ MntFruits : int 104 5 11 0 16 130 80 0 0 0 ...
## $ MntMeatProducts : int 379 64 59 1 24 411 252 11 102 102 ...
## $ MntFishProducts : int 111 7 15 0 11 240 15 0 21 21 ...
## $ MntSweetProducts : int 189 0 2 0 0 32 34 0 32 32 ...
## $ MntGoldProds : int 218 37 30 0 34 43 65 7 5 5 ...
## $ NumDealsPurchases : int 1 1 1 1 2 1 1 1 3 3 ...
## $ NumWebPurchases : int 4 7 3 1 3 4 10 2 6 6 ...
## $ NumCatalogPurchases: int 4 3 2 0 1 7 10 1 2 2 ...
## $ NumStorePurchases : int 6 7 5 2 2 5 7 3 9 9 ...
## $ NumWebVisitsMonth : int 1 5 2 7 7 2 6 5 4 4 ...
## $ Response : int 1 1 0 0 1 1 1 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
#Performing label encoding on Categorical columns
superstore_data2$Marital_Status <- as.numeric(factor(superstore_data2$Marital_Status))
superstore_data2$Education <- as.numeric(factor(superstore_data2$Education))
#Installing the correlation packages
#install.packages("corrplot")
library(corrplot)
## corrplot 0.92 loaded
#Ensuring the Response column is stored as an integer before computing correlations
superstore_data2$Response<-as.integer(superstore_data2$Response)
#Checking the data types of the variables
sapply(superstore_data2,class)
## Year_Birth Education Marital_Status Income
## "integer" "numeric" "numeric" "numeric"
## Kidhome Teenhome Recency MntWines
## "integer" "integer" "integer" "integer"
## MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## "integer" "integer" "integer" "integer"
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## "integer" "integer" "integer" "integer"
## NumStorePurchases NumWebVisitsMonth Response Complain
## "integer" "integer" "integer" "integer"
#Performing Correlation
superstore_data2.cor=cor(superstore_data2)
#Visualizing the correlation
corrplot(superstore_data2.cor)
Observation: There is a positive correlation between MntWines and the other spending variables, and a negative correlation between the number of web visits (NumWebVisitsMonth) and Income. In the correlation plot, red dots indicate negative correlations and blue dots indicate positive correlations.
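Since the second objective is to identify the factors that affect a customer’s response, a short sketch (an addition that reuses the correlation matrix computed above) ranks the features by the strength of their correlation with Response:
#Ranking features by absolute correlation with Response
response_cor <- superstore_data2.cor[, "Response"]
sort(abs(response_cor[names(response_cor) != "Response"]), decreasing = TRUE)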
#Summary of dataset
summary(superstore_data2)
## Year_Birth Education Marital_Status Income
## Min. :1893 Min. :1.000 Min. :1.000 Min. : 1730
## 1st Qu.:1959 1st Qu.:3.000 1st Qu.:2.000 1st Qu.: 35539
## Median :1970 Median :3.000 Median :3.000 Median : 51742
## Mean :1969 Mean :3.394 Mean :2.735 Mean : 52247
## 3rd Qu.:1977 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.: 68290
## Max. :1996 Max. :5.000 Max. :5.000 Max. :666666
## Kidhome Teenhome Recency MntWines
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:24.00 1st Qu.: 23.75
## Median :0.0000 Median :0.0000 Median :49.00 Median : 173.50
## Mean :0.4442 Mean :0.5062 Mean :49.11 Mean : 303.94
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:74.00 3rd Qu.: 504.25
## Max. :2.0000 Max. :2.0000 Max. :99.00 Max. :1493.00
## MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.0 1st Qu.: 16.0 1st Qu.: 3.00 1st Qu.: 1.00
## Median : 8.0 Median : 67.0 Median : 12.00 Median : 8.00
## Mean : 26.3 Mean : 166.9 Mean : 37.53 Mean : 27.06
## 3rd Qu.: 33.0 3rd Qu.: 232.0 3rd Qu.: 50.00 3rd Qu.: 33.00
## Max. :199.0 Max. :1725.0 Max. :259.00 Max. :263.00
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 9.00 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 0.000
## Median : 24.00 Median : 2.000 Median : 4.000 Median : 2.000
## Mean : 44.02 Mean : 2.325 Mean : 4.085 Mean : 2.662
## 3rd Qu.: 56.00 3rd Qu.: 3.000 3rd Qu.: 6.000 3rd Qu.: 4.000
## Max. :362.00 Max. :15.000 Max. :27.000 Max. :28.000
## NumStorePurchases NumWebVisitsMonth Response Complain
## Min. : 0.00 Min. : 0.000 Min. :0.0000 Min. :0.000000
## 1st Qu.: 3.00 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.000000
## Median : 5.00 Median : 6.000 Median :0.0000 Median :0.000000
## Mean : 5.79 Mean : 5.317 Mean :0.1491 Mean :0.009375
## 3rd Qu.: 8.00 3rd Qu.: 7.000 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :13.00 Max. :20.000 Max. :1.0000 Max. :1.000000
Observation: Looking at the Income column, the mean income of customers in this dataset is 52247, with a maximum of 666666 and a minimum of 1730; the maximum looks like an outlier.
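A minimal sketch (not part of the original cleaning) that counts how many Income values lie beyond the usual boxplot fence of Q3 + 1.5 * IQR:
#Counting Income values above the upper fence
income_q <- quantile(superstore_data2$Income, c(0.25, 0.75))
upper_fence <- income_q[2] + 1.5 * (income_q[2] - income_q[1])
sum(superstore_data2$Income > upper_fence)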
#Normality test
#Installing the required packages (note: the Anderson-Darling test further below uses the nortest package)
if (!requireNamespace("remotes")) install.packages("remotes")
## Loading required namespace: remotes
remotes::install_github("asael697/nortsTest",dependencies = TRUE)
## Skipping install of 'nortsTest' from a github remote, the SHA1 (b9198769) has not changed since last install.
## Use `force = TRUE` to force installation
#Plotting the frequency distribution of the Response variable
response<-as.integer(superstore_data2$Response)
hist(response, col = 'steelblue', main = 'Distribution of Response',
xlab = 'response')
Observation: The data does not appear to be normally distributed. To confirm this, we conduct an Anderson-Darling (A-D) test to formally test for normality.
# Checking for normality of data
# Anderson-Darling normality test
#install.packages("nortest")
library(nortest)
nortest::ad.test(response)
##
## Anderson-Darling normality test
##
## data: response
## A = 665.85, p-value < 2.2e-16
Observation: The p-value of the test is less than 0.05, so we have sufficient evidence to reject the null hypothesis and conclude that the Response column does not follow a normal distribution.
##Modelling using the Decision Tree Algorithm
set.seed(1234) #to make the results reproducible
random<-runif(2240) #generating uniform random numbers, one per row
superstore_random2<-superstore_data2[order(random),] #shuffling the rows into a random order
head(superstore_random2)
## Year_Birth Education Marital_Status Income Kidhome Teenhome Recency
## 783 1989 4 1 10979 0 0 34
## 473 1945 5 2 70356 0 0 20
## 746 1981 1 4 19414 1 0 32
## 996 1968 3 3 57107 0 1 44
## 1575 1976 3 2 17649 1 0 70
## 1801 1976 3 2 17148 1 0 80
## MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 783 8 4 10 2 2
## 473 1349 16 249 43 16
## 746 2 3 12 3 5
## 996 159 0 120 0 0
## 1575 15 1 23 0 5
## 1801 9 9 11 13 10
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 783 4 2 3 0
## 473 33 2 10 6
## 746 7 1 1 0
## 996 137 7 4 4
## 1575 1 3 3 0
## 1801 16 4 3 1
## NumStorePurchases NumWebVisitsMonth Response Complain
## 783 3 5 0 0
## 473 9 6 0 0
## 746 3 8 0 0
## 996 8 8 1 0
## 1575 3 8 0 0
## 1801 3 8 0 0
#Tabulating the Response variable for a clear view of the class distribution
responses<-table(superstore_random2$Response)
head(responses)
##
## 0 1
## 1906 334
Observation: The output shows 1906 negative (0) responses and 334 positive (1) responses, so the target is imbalanced.
#Decision Tree
#loading required library
library(rpart)
# set the response variable
y <- superstore_random2$Response
# set the predictor variables
x<-select(superstore_random2, -Response)
head(x)
## Year_Birth Education Marital_Status Income Kidhome Teenhome Recency
## 783 1989 4 1 10979 0 0 34
## 473 1945 5 2 70356 0 0 20
## 746 1981 1 4 19414 1 0 32
## 996 1968 3 3 57107 0 1 44
## 1575 1976 3 2 17649 1 0 70
## 1801 1976 3 2 17148 1 0 80
## MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 783 8 4 10 2 2
## 473 1349 16 249 43 16
## 746 2 3 12 3 5
## 996 159 0 120 0 0
## 1575 15 1 23 0 5
## 1801 9 9 11 13 10
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 783 4 2 3 0
## 473 33 2 10 6
## 746 7 1 1 0
## 996 137 7 4 4
## 1575 1 3 3 0
## 1801 16 4 3 1
## NumStorePurchases NumWebVisitsMonth Complain
## 783 3 5 0
## 473 9 6 0
## 746 3 8 0
## 996 8 8 0
## 1575 3 8 0
## 1801 3 8 0
# fit the decision tree model
tree_model <- rpart(y ~ ., data = x, method = "class")
# make predictions
predictions <- predict(tree_model, newdata = x, type = "class")
# view the tree output
print(tree_model)
## n= 2240
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 2240 334 0 (0.85089286 0.14910714)
## 2) MntWines< 815.5 2009 238 0 (0.88153310 0.11846690)
## 4) Recency>=20.5 1577 127 0 (0.91946734 0.08053266)
## 8) Income< 85613 1539 109 0 (0.92917479 0.07082521) *
## 9) Income>=85613 38 18 0 (0.52631579 0.47368421)
## 18) MntWines< 563.5 24 7 0 (0.70833333 0.29166667) *
## 19) MntWines>=563.5 14 3 1 (0.21428571 0.78571429) *
## 5) Recency< 20.5 432 111 0 (0.74305556 0.25694444)
## 10) NumCatalogPurchases< 0.5 129 11 0 (0.91472868 0.08527132) *
## 11) NumCatalogPurchases>=0.5 303 100 0 (0.66996700 0.33003300)
## 22) NumStorePurchases>=2.5 272 77 0 (0.71691176 0.28308824)
## 44) MntMeatProducts< 498 239 57 0 (0.76150628 0.23849372)
## 88) NumWebVisitsMonth< 7.5 202 38 0 (0.81188119 0.18811881)
## 176) NumDealsPurchases< 6.5 194 32 0 (0.83505155 0.16494845) *
## 177) NumDealsPurchases>=6.5 8 2 1 (0.25000000 0.75000000) *
## 89) NumWebVisitsMonth>=7.5 37 18 1 (0.48648649 0.51351351)
## 178) Income>=48828 13 3 0 (0.76923077 0.23076923) *
## 179) Income< 48828 24 8 1 (0.33333333 0.66666667) *
## 45) MntMeatProducts>=498 33 13 1 (0.39393939 0.60606061) *
## 23) NumStorePurchases< 2.5 31 8 1 (0.25806452 0.74193548) *
## 3) MntWines>=815.5 231 96 0 (0.58441558 0.41558442)
## 6) MntMeatProducts< 720 200 74 0 (0.63000000 0.37000000)
## 12) NumStorePurchases>=9.5 88 20 0 (0.77272727 0.22727273) *
## 13) NumStorePurchases< 9.5 112 54 0 (0.51785714 0.48214286)
## 26) Recency>=66.5 38 10 0 (0.73684211 0.26315789) *
## 27) Recency< 66.5 74 30 1 (0.40540541 0.59459459)
## 54) MntSweetProducts>=95 12 3 0 (0.75000000 0.25000000) *
## 55) MntSweetProducts< 95 62 21 1 (0.33870968 0.66129032) *
## 7) MntMeatProducts>=720 31 9 1 (0.29032258 0.70967742) *
# Visualizing the tree
#loading required package
#install.packages("rpart.plot")
library(rpart.plot)
# fit the decision tree model
tree_model <- rpart(y ~ ., data = x, method = "class")
# visualize the tree
rpart.plot(tree_model)
##Evaluating our model performance
# fit the decision tree model
tree_model <- rpart(y ~ ., data = x, method = "class")
# make predictions
predictions <- predict(tree_model, newdata = x, type = "class")
# calculate accuracy
accuracy <- mean(predictions == y)
#Getting the accuracy
print(accuracy)
## [1] 0.884375
Observation: Accuracy stands at about 88%. Note that this is measured on the same data used to fit the tree, so it is an optimistic (training) estimate.
#confusionMatrix()
#loading the required library
#install.packages("lattice")
library(caret) #Used for computing the confusion matrix
## Loading required package: lattice
# fit the decision tree model
tree_model <- rpart(y ~ ., data = x, method = "class")
# make predictions
predictions <- predict(tree_model, newdata = x, type = "class")
# convert predictions and y to factors with the same levels
predictions <- as.factor(predictions)
y <- as.factor(y)
levels(predictions) <- levels(y)
# create the confusion matrix
confusionMatrix(data = predictions, reference = y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1842 195
## 1 64 139
##
## Accuracy : 0.8844
## 95% CI : (0.8704, 0.8973)
## No Information Rate : 0.8509
## P-Value [Acc > NIR] : 2.522e-06
##
## Kappa : 0.4564
##
## Mcnemar's Test P-Value : 6.594e-16
##
## Sensitivity : 0.9664
## Specificity : 0.4162
## Pos Pred Value : 0.9043
## Neg Pred Value : 0.6847
## Prevalence : 0.8509
## Detection Rate : 0.8223
## Detection Prevalence : 0.9094
## Balanced Accuracy : 0.6913
##
## 'Positive' Class : 0
##
Overall accuracy stands at about 88%, but the model correctly identifies only about 42% of the customers who actually responded (139 of 334), reflecting the class imbalance.
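Because the tree above was fitted and evaluated on the same 2240 rows, these figures are training-set estimates. A hedged sketch (an addition, not the author’s original workflow) of a simple holdout check on the shuffled data:
#Fit on 80% of the shuffled rows and score the remaining 20%
set.seed(1234)
train_idx <- sample(seq_len(nrow(superstore_random2)), size = round(0.8 * nrow(superstore_random2)))
tree_train <- superstore_random2[train_idx, ]
tree_test <- superstore_random2[-train_idx, ]
holdout_tree <- rpart(Response ~ ., data = tree_train, method = "class")
holdout_pred <- predict(holdout_tree, newdata = tree_test, type = "class")
mean(as.character(holdout_pred) == as.character(tree_test$Response)) #holdout accuracy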
###Dealing with class imbalance by oversampling
#install.packages("tidyverse")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ purrr 0.3.5
## ✔ tidyr 1.2.1 ✔ forcats 0.5.2
## ✔ readr 2.1.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ data.table::between() masks dplyr::between()
## ✖ dplyr::filter() masks stats::filter()
## ✖ data.table::first() masks dplyr::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ data.table::last() masks dplyr::last()
## ✖ purrr::lift() masks caret::lift()
## ✖ purrr::transpose() masks data.table::transpose()
Let’s first check the frequency of the Response variable’s levels (after converting it to a factor).
#Converting the classes to factor
superstore_data2$Response<-as.factor(superstore_data2$Response)
prop.table(table(superstore_data2$Response))
##
## 0 1
## 0.8508929 0.1491071
As we see, the minority class “1” makes up only about 15% of the cases. We also show the summary of the data to take an overall look at all the features and check for missing values or unusual outliers.
#checking the summary
summary(superstore_data2)
## Year_Birth Education Marital_Status Income
## Min. :1893 Min. :1.000 Min. :1.000 Min. : 1730
## 1st Qu.:1959 1st Qu.:3.000 1st Qu.:2.000 1st Qu.: 35539
## Median :1970 Median :3.000 Median :3.000 Median : 51742
## Mean :1969 Mean :3.394 Mean :2.735 Mean : 52247
## 3rd Qu.:1977 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.: 68290
## Max. :1996 Max. :5.000 Max. :5.000 Max. :666666
## Kidhome Teenhome Recency MntWines
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:24.00 1st Qu.: 23.75
## Median :0.0000 Median :0.0000 Median :49.00 Median : 173.50
## Mean :0.4442 Mean :0.5062 Mean :49.11 Mean : 303.94
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:74.00 3rd Qu.: 504.25
## Max. :2.0000 Max. :2.0000 Max. :99.00 Max. :1493.00
## MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.0 1st Qu.: 16.0 1st Qu.: 3.00 1st Qu.: 1.00
## Median : 8.0 Median : 67.0 Median : 12.00 Median : 8.00
## Mean : 26.3 Mean : 166.9 Mean : 37.53 Mean : 27.06
## 3rd Qu.: 33.0 3rd Qu.: 232.0 3rd Qu.: 50.00 3rd Qu.: 33.00
## Max. :199.0 Max. :1725.0 Max. :259.00 Max. :263.00
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 9.00 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 0.000
## Median : 24.00 Median : 2.000 Median : 4.000 Median : 2.000
## Mean : 44.02 Mean : 2.325 Mean : 4.085 Mean : 2.662
## 3rd Qu.: 56.00 3rd Qu.: 3.000 3rd Qu.: 6.000 3rd Qu.: 4.000
## Max. :362.00 Max. :15.000 Max. :27.000 Max. :28.000
## NumStorePurchases NumWebVisitsMonth Response Complain
## Min. : 0.00 Min. : 0.000 0:1906 Min. :0.000000
## 1st Qu.: 3.00 1st Qu.: 3.000 1: 334 1st Qu.:0.000000
## Median : 5.00 Median : 6.000 Median :0.000000
## Mean : 5.79 Mean : 5.317 Mean :0.009375
## 3rd Qu.: 8.00 3rd Qu.: 7.000 3rd Qu.:0.000000
## Max. :13.00 Max. :20.000 Max. :1.000000
Looking at this summary, we do not have any critical issues such as missing values.
##Data partition
Before applying any subsampling method, we first split the data into a training set and a testing set; only the training set is subsampled.
library(caret) #provides the upSample function for the upsampling technique
set.seed(1234)
index<-createDataPartition(superstore_data2$Response, p=0.8, list=FALSE)
train<-superstore_data2[index,]
test<-superstore_data2[-index,]
set.seed(111)
trainup<-upSample(x = train[, -which(names(train) == "Response")], #dropping the target from the predictors
                  y = train$Response)
table(trainup$Class) #upSample stores the balanced target in a column named Class
##
## 0 1
## 1525 1525
Observation: As we see, the two classes now have the same size of 1525 observations each.
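The balanced set trainup is not used further below, so as a brief sketch (an addition, not the author’s next step) here is how it could feed the same decision tree and be scored on the untouched test split:
#Refitting the decision tree on the upsampled training data
balanced_tree <- rpart(Class ~ ., data = trainup, method = "class")
balanced_pred <- predict(balanced_tree, newdata = test, type = "class")
mean(as.character(balanced_pred) == as.character(test$Response)) #accuracy on the test split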
Now let’s use the Naive Bayes algorithm for prediction (fitted below on a fresh 70/30 split of the original data).
# Installing Packages
#install.packages("e1071")
#install.packages("caTools")
#install.packages("class")
# Loading package
library(e1071)
library(caTools)
library(class)
# Splitting data into train
# and test data
split <- sample.split(superstore_data2$Response, SplitRatio = 0.7) #splitting on the target vector
train_cl <- subset(superstore_data2, split == "TRUE")
test_cl <- subset(superstore_data2, split == "FALSE")
# Feature Scaling of the first four columns (computed for illustration; the Naive Bayes model below uses the unscaled data)
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4])
# Fitting Naive Bayes Model
# to training dataset
set.seed(120) # Setting Seed
classifier_cl <- naiveBayes(Response ~ ., data = train_cl)
classifier_cl
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.8571429 0.1428571
##
## Conditional probabilities:
## Year_Birth
## Y [,1] [,2]
## 0 1968.808 11.76192
## 1 1968.911 12.71809
##
## Education
## Y [,1] [,2]
## 0 3.363095 1.117388
## 1 3.660714 1.124960
##
## Marital_Status
## Y [,1] [,2]
## 0 2.732887 1.062845
## 1 2.705357 1.093301
##
## Income
## Y [,1] [,2]
## 0 50691.51 20485.67
## 1 61214.57 22848.57
##
## Kidhome
## Y [,1] [,2]
## 0 0.4575893 0.5454630
## 1 0.3303571 0.4808138
##
## Teenhome
## Y [,1] [,2]
## 0 0.5498512 0.5502724
## 1 0.2946429 0.4854547
##
## Recency
## Y [,1] [,2]
## 0 51.29539 28.58788
## 1 36.42411 27.96448
##
## MntWines
## Y [,1] [,2]
## 0 273.2388 304.7892
## 1 513.3080 435.0880
##
## MntFruits
## Y [,1] [,2]
## 0 23.76042 37.47206
## 1 38.36607 47.57764
##
## MntMeatProducts
## Y [,1] [,2]
## 0 147.1429 207.2389
## 1 297.0625 285.9604
##
## MntFishProducts
## Y [,1] [,2]
## 0 34.78051 53.14568
## 1 55.39286 62.30611
##
## MntSweetProducts
## Y [,1] [,2]
## 0 24.36310 39.09171
## 1 38.79018 44.43240
##
## MntGoldProds
## Y [,1] [,2]
## 0 42.10863 52.20101
## 1 63.85714 58.55010
##
## NumDealsPurchases
## Y [,1] [,2]
## 0 2.351935 1.938852
## 1 2.196429 1.976722
##
## NumWebPurchases
## Y [,1] [,2]
## 0 3.918899 2.749421
## 1 4.901786 2.500080
##
## NumCatalogPurchases
## Y [,1] [,2]
## 0 2.443452 2.822063
## 1 4.419643 3.276875
##
## NumStorePurchases
## Y [,1] [,2]
## 0 5.787202 3.298405
## 1 6.214286 3.211327
##
## NumWebVisitsMonth
## Y [,1] [,2]
## 0 5.341518 2.430607
## 1 5.147321 2.600568
##
## Complain
## Y [,1] [,2]
## 0 0.011160714 0.10509220
## 1 0.008928571 0.09427902
Observation: The model estimates, for each feature separately, its conditional distribution given the class; for these numeric features it stores the class-wise mean and standard deviation shown in the two columns above. The a-priori probabilities are also calculated and indicate the class distribution of the training data.
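To make the table concrete: for a numeric feature, the classifier scores each class with a Gaussian density built from that class-wise mean ([,1]) and standard deviation ([,2]). A small illustration (the Income figures are copied from the output above; an income of 70000 is just a hypothetical example):
#Per-class Gaussian likelihood of an income of 70000
income_means <- c("0" = 50691.51, "1" = 61214.57)
income_sds <- c("0" = 20485.67, "1" = 22848.57)
dnorm(70000, mean = income_means, sd = income_sds)
#These per-feature likelihoods are multiplied together and by the a-priori probabilities; the class with the larger product is predicted.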
# Predicting on the test data
y_pred <- predict(classifier_cl, newdata = test_cl)
# Confusion Matrix
cm <- table(test_cl$Response, y_pred)
cm
## y_pred
## 0 1
## 0 454 108
## 1 60 50
# Model Evaluation
confusionMatrix(cm)
## Confusion Matrix and Statistics
##
## y_pred
## 0 1
## 0 454 108
## 1 60 50
##
## Accuracy : 0.75
## 95% CI : (0.7155, 0.7823)
## No Information Rate : 0.7649
## P-Value [Acc > NIR] : 0.8304408
##
## Kappa : 0.2232
##
## Mcnemar's Test P-Value : 0.0002877
##
## Sensitivity : 0.8833
## Specificity : 0.3165
## Pos Pred Value : 0.8078
## Neg Pred Value : 0.4545
## Prevalence : 0.7649
## Detection Rate : 0.6756
## Detection Prevalence : 0.8363
## Balanced Accuracy : 0.5999
##
## 'Positive' Class : 0
##
We can see that with Naive Bayes the accuracy is 75%, lower than the 88% achieved by the Decision Tree algorithm; its balanced accuracy (about 60%) is also lower than the tree’s (about 69%).
##Conclusion
The decision tree model predicts the likelihood of a positive response with an accuracy of about 88%, slightly short of the 90% target, while the Naive Bayes model reaches 75%. The tree splits suggest that spending on wines and meat products, recency of the last purchase, income, and the number of catalog and store purchases are the factors most strongly associated with a positive response.
##Recommendation
The campaign should prioritize customers the model flags as likely responders, particularly recent, higher-spending customers. Because positive responses are rare, future work should train on the balanced (upsampled) data and judge models on sensitivity to the responder class as well as overall accuracy.