A retail company “ABC Private Limited” wants to understand the customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared purchase summary of various customers for selected high volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and Total purchase_amount from last month.
Now, they want to build a model to predict the purchase amount of customer against various products which will help them to create personalized offer for customers against different products.
setwd("C:/Users/Dell/Desktop/Project/Capstone Project")
ABC_train=read.csv("Training Data Set.csv")
str(ABC_train)
## 'data.frame': 550068 obs. of 12 variables:
## $ User_ID : int 1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
## $ Product_ID : Factor w/ 3631 levels "P00000142","P00000242",..: 673 2377 853 829 2735 1832 1746 3321 3605 2632 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
## $ Age : Factor w/ 7 levels "0-17","18-25",..: 1 1 1 1 7 3 5 5 5 3 ...
## $ Occupation : int 10 10 10 10 16 15 7 7 7 20 ...
## $ City_Category : Factor w/ 3 levels "A","B","C": 1 1 1 1 3 1 2 2 2 1 ...
## $ Stay_In_Current_City_Years: Factor w/ 5 levels "0","1","2","3",..: 3 3 3 3 5 4 3 3 3 2 ...
## $ Marital_Status : int 0 0 0 0 0 0 1 1 1 1 ...
## $ Product_Category_1 : int 3 1 12 12 8 1 1 1 1 8 ...
## $ Product_Category_2 : int NA 6 NA 14 NA 2 8 15 16 NA ...
## $ Product_Category_3 : int NA 14 NA NA NA NA 17 NA NA NA ...
## $ Purchase : int 8370 15200 1422 1057 7969 15227 19215 15854 15686 7871 ...
We see that the data type of “Marital Status” is integer. However, it is a categorical variable with value 1 showing “Married” and 0 showing “Unmarried”. Thus, we have to convert into Factor.
ABC_train$Marital_Status=factor(ABC_train$Marital_Status)
str(ABC_train)
## 'data.frame': 550068 obs. of 12 variables:
## $ User_ID : int 1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
## $ Product_ID : Factor w/ 3631 levels "P00000142","P00000242",..: 673 2377 853 829 2735 1832 1746 3321 3605 2632 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
## $ Age : Factor w/ 7 levels "0-17","18-25",..: 1 1 1 1 7 3 5 5 5 3 ...
## $ Occupation : int 10 10 10 10 16 15 7 7 7 20 ...
## $ City_Category : Factor w/ 3 levels "A","B","C": 1 1 1 1 3 1 2 2 2 1 ...
## $ Stay_In_Current_City_Years: Factor w/ 5 levels "0","1","2","3",..: 3 3 3 3 5 4 3 3 3 2 ...
## $ Marital_Status : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 2 2 ...
## $ Product_Category_1 : int 3 1 12 12 8 1 1 1 1 8 ...
## $ Product_Category_2 : int NA 6 NA 14 NA 2 8 15 16 NA ...
## $ Product_Category_3 : int NA 14 NA NA NA NA 17 NA NA NA ...
## $ Purchase : int 8370 15200 1422 1057 7969 15227 19215 15854 15686 7871 ...
summary(ABC_train)
## User_ID Product_ID Gender Age
## Min. :1000001 P00265242: 1880 F:135809 0-17 : 15102
## 1st Qu.:1001516 P00025442: 1615 M:414259 18-25: 99660
## Median :1003077 P00110742: 1612 26-35:219587
## Mean :1003029 P00112142: 1562 36-45:110013
## 3rd Qu.:1004478 P00057642: 1470 46-50: 45701
## Max. :1006040 P00184942: 1440 51-55: 38501
## (Other) :540489 55+ : 21504
## Occupation City_Category Stay_In_Current_City_Years Marital_Status
## Min. : 0.000 A:147720 0 : 74398 0:324731
## 1st Qu.: 2.000 B:231173 1 :193821 1:225337
## Median : 7.000 C:171175 2 :101838
## Mean : 8.077 3 : 95285
## 3rd Qu.:14.000 4+: 84726
## Max. :20.000
##
## Product_Category_1 Product_Category_2 Product_Category_3 Purchase
## Min. : 1.000 Min. : 2.00 Min. : 3.0 Min. : 12
## 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 9.0 1st Qu.: 5823
## Median : 5.000 Median : 9.00 Median :14.0 Median : 8047
## Mean : 5.404 Mean : 9.84 Mean :12.7 Mean : 9264
## 3rd Qu.: 8.000 3rd Qu.:15.00 3rd Qu.:16.0 3rd Qu.:12054
## Max. :20.000 Max. :18.00 Max. :18.0 Max. :23961
## NA's :173638 NA's :383247
library(psych)
describe(ABC_train)[,c(2,3,4,5,8,9)]
## n mean sd median min
## User_ID 550068 1003028.84 1727.59 1003077 1000001
## Product_ID* 550068 1708.47 1012.20 1667 1
## Gender* 550068 1.75 0.43 2 1
## Age* 550068 3.50 1.35 3 1
## Occupation 550068 8.08 6.52 7 0
## City_Category* 550068 2.04 0.76 2 1
## Stay_In_Current_City_Years* 550068 2.86 1.29 3 1
## Marital_Status* 550068 1.41 0.49 1 1
## Product_Category_1 550068 5.40 3.94 5 1
## Product_Category_2 376430 9.84 5.09 9 2
## Product_Category_3 166821 12.67 4.13 14 3
## Purchase 550068 9263.97 5023.07 8047 12
## max
## User_ID 1006040
## Product_ID* 3631
## Gender* 2
## Age* 7
## Occupation 20
## City_Category* 3
## Stay_In_Current_City_Years* 5
## Marital_Status* 2
## Product_Category_1 20
## Product_Category_2 18
## Product_Category_3 18
## Purchase 23961
We can see that there are some missing values in the data set (like in Product_Category_2 and Product_Category_3). NA’s can be a hindrance in the statistical analysis, due to they need to be converted into 0’s.
ABC_train[is.na(ABC_train)]=0
t=prop.table(table(ABC_train$Marital_Status,ABC_train$Gender))
barplot(t,beside=T,col=c("red","blue"),main="Customers Distribution with respect to Gender and Marital Status",legend=rownames(t),xlab="Gender",ylab="Percent of Total",ylim=c(0,0.5))
barplot(prop.table(table(ABC_train$Age)),col="red",main="Customer Distribution with respect to Age Group",xlab="Age Group",ylab="Percent of Total",ylim=c(0,0.4))
barplot(prop.table(table(ABC_train$City_Category)),col="darkblue",main="Customer Distribution with respect to City Category",xlab="City Category",ylab="Percent of Total",ylim=c(0,0.4))
barplot(prop.table(table(ABC_train$Stay_In_Current_City_Years)),col="red",main="Customer Distribution with respect to Number of Years Stay in Current City",xlab="Number of Years Stay in Current City",ylab="Percent of Total",ylim=c(0,0.4))
barplot(prop.table(table(ABC_train$Occupation)),col="darkblue",main="Customer Distribution with respect to Occupation",xlab="Occupation",ylab="Percent of Total",ylim=c(0,0.2))
barplot(prop.table(table(ABC_train$Product_Category_1)),col="red",main="Customer Distribution with respect to Occupation",xlab="Occupation",ylab="Percent of Total")
p1=aggregate(Purchase~Gender,data=ABC_train,mean)
p1
## Gender Purchase
## 1 F 8734.566
## 2 M 9437.526
plot(aggregate(Purchase~Age,data=ABC_train,mean),main="Purchase Amount with respect to Age Group")
plot(aggregate(Purchase~Occupation,data=ABC_train,mean),spread=FALSE,smoother.args=list(lty=2),pch=19,main="Purchase Amount with respect to Occupation")
## Warning in plot.window(...): "spread" is not a graphical parameter
## Warning in plot.window(...): "smoother.args" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "spread" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "smoother.args" is not a graphical
## parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "spread" is
## not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...):
## "smoother.args" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "spread" is
## not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...):
## "smoother.args" is not a graphical parameter
## Warning in box(...): "spread" is not a graphical parameter
## Warning in box(...): "smoother.args" is not a graphical parameter
## Warning in title(...): "spread" is not a graphical parameter
## Warning in title(...): "smoother.args" is not a graphical parameter
plot(aggregate(Purchase~City_Category,data=ABC_train,mean),spread=FALSE,smoother.args=list(lty=2),pch=19,main="Purchase Amount with respect to City Category")
plot(aggregate(Purchase~Stay_In_Current_City_Years,data=ABC_train,mean),main="Purchase Amount with respect to Number of Years Stay in Current City")
p2=aggregate(Purchase~Marital_Status,data=ABC_train,mean)
p2
## Marital_Status Purchase
## 1 0 9265.908
## 2 1 9261.175
plot(aggregate(Purchase~Product_Category_1,data=ABC_train,mean),spread=FALSE,smoother.args=list(lty=2),pch=19,main="Purchase Amount with respect to Category of Products")
## Warning in plot.window(...): "spread" is not a graphical parameter
## Warning in plot.window(...): "smoother.args" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "spread" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "smoother.args" is not a graphical
## parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "spread" is
## not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...):
## "smoother.args" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "spread" is
## not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...):
## "smoother.args" is not a graphical parameter
## Warning in box(...): "spread" is not a graphical parameter
## Warning in box(...): "smoother.args" is not a graphical parameter
## Warning in title(...): "spread" is not a graphical parameter
## Warning in title(...): "smoother.args" is not a graphical parameter
Since the data set is enormously huge with each customer purchasing different products, the correlations between the “Customer’s Demographics” and “Purchase” cannot be easily established by taking every record in the data set into consideration.
Hence, we simplify the situation by averaging the purchase amounts of different products associated with a single customer and considering that average value in building correlations.
Users=unique(ABC_train[,c(1,3,4,5,6,7,8)])
p=aggregate(ABC_train$Purchase,by=list(User=ABC_train$User_ID),mean)
Users$Avg_Purchase=p[,2]
str(Users)
## 'data.frame': 5891 obs. of 8 variables:
## $ User_ID : int 1000001 1000002 1000003 1000004 1000005 1000006 1000007 1000008 1000009 1000010 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 2 2 2 2 1 2 2 2 1 ...
## $ Age : Factor w/ 7 levels "0-17","18-25",..: 1 7 3 5 3 6 4 3 3 4 ...
## $ Occupation : int 10 16 15 7 20 9 1 12 17 1 ...
## $ City_Category : Factor w/ 3 levels "A","B","C": 1 3 1 2 1 1 2 3 3 2 ...
## $ Stay_In_Current_City_Years: Factor w/ 5 levels "0","1","2","3",..: 3 5 4 3 2 2 2 5 1 5 ...
## $ Marital_Status : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 2 2 1 2 ...
## $ Avg_Purchase : num 9546 10526 11781 14748 7745 ...
Users$Gender=as.integer(Users$Gender)
Users$Age=as.integer(Users$Age)
Users$City_Category=as.integer(Users$City_Category)
Users$Stay_In_Current_City_Years=as.integer(Users$Stay_In_Current_City_Years)
Users$Marital_Status=as.integer(Users$Marital_Status)
Users$Avg_Purchase=as.integer(Users$Avg_Purchase)
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(Users[,c("Gender","Age","Occupation","City_Category","Stay_In_Current_City_Years","Marital_Status")], spread=FALSE, smoother.args=list(lty=2), main="Scatter Plot Matrix")
On studying the Scatter Plot Matrix closely, we spot the correlations which have been numerically expressed below
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
Users1=Users[,-c(1)]
cormatrix=rcorr(as.matrix(Users1))
cormatrix
## Gender Age Occupation City_Category
## Gender 1.00 -0.01 0.15 0.01
## Age -0.01 1.00 0.08 0.10
## Occupation 0.15 0.08 1.00 0.04
## City_Category 0.01 0.10 0.04 1.00
## Stay_In_Current_City_Years 0.01 -0.01 0.01 0.00
## Marital_Status -0.01 0.33 0.03 0.05
## Avg_Purchase 0.01 -0.02 -0.01 -0.01
## Stay_In_Current_City_Years Marital_Status
## Gender 0.01 -0.01
## Age -0.01 0.33
## Occupation 0.01 0.03
## City_Category 0.00 0.05
## Stay_In_Current_City_Years 1.00 -0.01
## Marital_Status -0.01 1.00
## Avg_Purchase 0.01 -0.02
## Avg_Purchase
## Gender 0.01
## Age -0.02
## Occupation -0.01
## City_Category -0.01
## Stay_In_Current_City_Years 0.01
## Marital_Status -0.02
## Avg_Purchase 1.00
##
## n= 5891
##
##
## P
## Gender Age Occupation City_Category
## Gender 0.4140 0.0000 0.4675
## Age 0.4140 0.0000 0.0000
## Occupation 0.0000 0.0000 0.0016
## City_Category 0.4675 0.0000 0.0016
## Stay_In_Current_City_Years 0.3382 0.5499 0.4926 0.9565
## Marital_Status 0.2570 0.0000 0.0140 0.0003
## Avg_Purchase 0.3185 0.2123 0.4266 0.3465
## Stay_In_Current_City_Years Marital_Status
## Gender 0.3382 0.2570
## Age 0.5499 0.0000
## Occupation 0.4926 0.0140
## City_Category 0.9565 0.0003
## Stay_In_Current_City_Years 0.4080
## Marital_Status 0.4080
## Avg_Purchase 0.6093 0.1621
## Avg_Purchase
## Gender 0.3185
## Age 0.2123
## Occupation 0.4266
## City_Category 0.3465
## Stay_In_Current_City_Years 0.6093
## Marital_Status 0.1621
## Avg_Purchase
. The upper matrix is the Correlation Matrix . The lower matrix gives the Pearson Rank Test
library(corrgram)
corrgram(Users1, order=TRUE, main="Corrgram",lower.panel=panel.pts, upper.panel=panel.pie, diag.panel=panel.minmax, text.panel=panel.txt)
After visualizing data, we need to formulate a hypothesis which needs to be tested with the help a regression model:
“Which customer demographics explain the average amount of purchase made by the customer?”
In order to be able to answer this question, we need to develop a null hypothesis which states that “There is no significant difference in the amount purchased by customers with respect to their demographics”.
t.test(Users$Gender,Users$Avg_Purchase)
##
## Welch Two Sample t-test
##
## data: Users$Gender and Users$Avg_Purchase
## t = -388.48, df = 5890, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9614.908 -9518.358
## sample estimates:
## mean of x mean of y
## 1.717196 9568.350195
t.test(Users$Marital_Status,Users$Avg_Purchase)
##
## Welch Two Sample t-test
##
## data: Users$Marital_Status and Users$Avg_Purchase
## t = -388.5, df = 5890, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9615.205 -9518.655
## sample estimates:
## mean of x mean of y
## 1.419963 9568.350195
t.test(Users$Age,Users$Avg_Purchase)
##
## Welch Two Sample t-test
##
## data: Users$Age and Users$Avg_Purchase
## t = -388.41, df = 5890, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9613.005 -9516.455
## sample estimates:
## mean of x mean of y
## 3.620438 9568.350195
Since the value of p is less than 0.05, there is a significant difference in the “Purchase Amounts” with respect to Gender, Age and Marital Status. Thus, the null hypothesis of “Purchase Amount” being universally same is rejected.