A retail company “ABC Private Limited” wants to understand the customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared purchase summary of various customers for selected high volume products from last month.

The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and Total purchase_amount from last month.

Now, they want to build a model to predict the purchase amount of customer against various products which will help them to create personalized offer for customers against different products.

READING DATA

1. Reading Data from a CSV format file into R

setwd("C:/Users/Dell/Desktop/Project/Capstone Project")
ABC_train=read.csv("Training Data Set.csv")

2. Checking the data types of various variables involved

str(ABC_train)
## 'data.frame':    550068 obs. of  12 variables:
##  $ User_ID                   : int  1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
##  $ Product_ID                : Factor w/ 3631 levels "P00000142","P00000242",..: 673 2377 853 829 2735 1832 1746 3321 3605 2632 ...
##  $ Gender                    : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
##  $ Age                       : Factor w/ 7 levels "0-17","18-25",..: 1 1 1 1 7 3 5 5 5 3 ...
##  $ Occupation                : int  10 10 10 10 16 15 7 7 7 20 ...
##  $ City_Category             : Factor w/ 3 levels "A","B","C": 1 1 1 1 3 1 2 2 2 1 ...
##  $ Stay_In_Current_City_Years: Factor w/ 5 levels "0","1","2","3",..: 3 3 3 3 5 4 3 3 3 2 ...
##  $ Marital_Status            : int  0 0 0 0 0 0 1 1 1 1 ...
##  $ Product_Category_1        : int  3 1 12 12 8 1 1 1 1 8 ...
##  $ Product_Category_2        : int  NA 6 NA 14 NA 2 8 15 16 NA ...
##  $ Product_Category_3        : int  NA 14 NA NA NA NA 17 NA NA NA ...
##  $ Purchase                  : int  8370 15200 1422 1057 7969 15227 19215 15854 15686 7871 ...

We see that the data type of “Marital Status” is integer. However, it is a categorical variable with value 1 showing “Married” and 0 showing “Unmarried”. Thus, we have to convert into Factor.

3. Appropriate conversion of data types of variables

ABC_train$Marital_Status=factor(ABC_train$Marital_Status)

4. Checking the data types of various variables involved AGAIN

str(ABC_train)
## 'data.frame':    550068 obs. of  12 variables:
##  $ User_ID                   : int  1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
##  $ Product_ID                : Factor w/ 3631 levels "P00000142","P00000242",..: 673 2377 853 829 2735 1832 1746 3321 3605 2632 ...
##  $ Gender                    : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
##  $ Age                       : Factor w/ 7 levels "0-17","18-25",..: 1 1 1 1 7 3 5 5 5 3 ...
##  $ Occupation                : int  10 10 10 10 16 15 7 7 7 20 ...
##  $ City_Category             : Factor w/ 3 levels "A","B","C": 1 1 1 1 3 1 2 2 2 1 ...
##  $ Stay_In_Current_City_Years: Factor w/ 5 levels "0","1","2","3",..: 3 3 3 3 5 4 3 3 3 2 ...
##  $ Marital_Status            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 2 2 ...
##  $ Product_Category_1        : int  3 1 12 12 8 1 1 1 1 8 ...
##  $ Product_Category_2        : int  NA 6 NA 14 NA 2 8 15 16 NA ...
##  $ Product_Category_3        : int  NA 14 NA NA NA NA 17 NA NA NA ...
##  $ Purchase                  : int  8370 15200 1422 1057 7969 15227 19215 15854 15686 7871 ...

DATA SUMMARY

5. Summarizing the statistics- Mean, standard deviation, median, minimum, maximum of variables

summary(ABC_train)
##     User_ID            Product_ID     Gender        Age        
##  Min.   :1000001   P00265242:  1880   F:135809   0-17 : 15102  
##  1st Qu.:1001516   P00025442:  1615   M:414259   18-25: 99660  
##  Median :1003077   P00110742:  1612              26-35:219587  
##  Mean   :1003029   P00112142:  1562              36-45:110013  
##  3rd Qu.:1004478   P00057642:  1470              46-50: 45701  
##  Max.   :1006040   P00184942:  1440              51-55: 38501  
##                    (Other)  :540489              55+  : 21504  
##    Occupation     City_Category Stay_In_Current_City_Years Marital_Status
##  Min.   : 0.000   A:147720      0 : 74398                  0:324731      
##  1st Qu.: 2.000   B:231173      1 :193821                  1:225337      
##  Median : 7.000   C:171175      2 :101838                                
##  Mean   : 8.077                 3 : 95285                                
##  3rd Qu.:14.000                 4+: 84726                                
##  Max.   :20.000                                                          
##                                                                          
##  Product_Category_1 Product_Category_2 Product_Category_3    Purchase    
##  Min.   : 1.000     Min.   : 2.00      Min.   : 3.0       Min.   :   12  
##  1st Qu.: 1.000     1st Qu.: 5.00      1st Qu.: 9.0       1st Qu.: 5823  
##  Median : 5.000     Median : 9.00      Median :14.0       Median : 8047  
##  Mean   : 5.404     Mean   : 9.84      Mean   :12.7       Mean   : 9264  
##  3rd Qu.: 8.000     3rd Qu.:15.00      3rd Qu.:16.0       3rd Qu.:12054  
##  Max.   :20.000     Max.   :18.00      Max.   :18.0       Max.   :23961  
##                     NA's   :173638     NA's   :383247
library(psych)
describe(ABC_train)[,c(2,3,4,5,8,9)]
##                                  n       mean      sd  median     min
## User_ID                     550068 1003028.84 1727.59 1003077 1000001
## Product_ID*                 550068    1708.47 1012.20    1667       1
## Gender*                     550068       1.75    0.43       2       1
## Age*                        550068       3.50    1.35       3       1
## Occupation                  550068       8.08    6.52       7       0
## City_Category*              550068       2.04    0.76       2       1
## Stay_In_Current_City_Years* 550068       2.86    1.29       3       1
## Marital_Status*             550068       1.41    0.49       1       1
## Product_Category_1          550068       5.40    3.94       5       1
## Product_Category_2          376430       9.84    5.09       9       2
## Product_Category_3          166821      12.67    4.13      14       3
## Purchase                    550068    9263.97 5023.07    8047      12
##                                 max
## User_ID                     1006040
## Product_ID*                    3631
## Gender*                           2
## Age*                              7
## Occupation                       20
## City_Category*                    3
## Stay_In_Current_City_Years*       5
## Marital_Status*                   2
## Product_Category_1               20
## Product_Category_2               18
## Product_Category_3               18
## Purchase                      23961

We can see that there are some missing values in the data set (like in Product_Category_2 and Product_Category_3). NA’s can be a hindrance in the statistical analysis, due to they need to be converted into 0’s.

5. NA’s value treatment

ABC_train[is.na(ABC_train)]=0

VISUALIZATION

Plotting Customer Demographics

6a. Plot of Customer Distribution with respect to Gender and Marital Status

t=prop.table(table(ABC_train$Marital_Status,ABC_train$Gender))
barplot(t,beside=T,col=c("red","blue"),main="Customers Distribution with respect to Gender and Marital Status",legend=rownames(t),xlab="Gender",ylab="Percent of Total",ylim=c(0,0.5))

6b. Plot of Customer Distribution with respect to Age Group

barplot(prop.table(table(ABC_train$Age)),col="red",main="Customer Distribution with respect to Age Group",xlab="Age Group",ylab="Percent of Total",ylim=c(0,0.4))

6c. Plot of Customer Distribution with respect to City Category

barplot(prop.table(table(ABC_train$City_Category)),col="darkblue",main="Customer Distribution with respect to City Category",xlab="City Category",ylab="Percent of Total",ylim=c(0,0.4))

6d. Plot of Customer Distribution with respect to Number of Years Stay in Current City

barplot(prop.table(table(ABC_train$Stay_In_Current_City_Years)),col="red",main="Customer Distribution with respect to Number of Years Stay in Current City",xlab="Number of Years Stay in Current City",ylab="Percent of Total",ylim=c(0,0.4))

6e. Plot of Customer Distribution with respect to Occupation

barplot(prop.table(table(ABC_train$Occupation)),col="darkblue",main="Customer Distribution with respect to Occupation",xlab="Occupation",ylab="Percent of Total",ylim=c(0,0.2))

6f. Plot of Customer Distribution with respect to Product Category

barplot(prop.table(table(ABC_train$Product_Category_1)),col="red",main="Customer Distribution with respect to Occupation",xlab="Occupation",ylab="Percent of Total")

Purchase Amount versus different Customer Demographics

7a. Purchase Amount with respect to Gender

p1=aggregate(Purchase~Gender,data=ABC_train,mean)
p1
##   Gender Purchase
## 1      F 8734.566
## 2      M 9437.526

7b. Purchase Amount with respect to Age Group

plot(aggregate(Purchase~Age,data=ABC_train,mean),main="Purchase Amount with respect to Age Group")

7c. Purchase Amount with respect to Occupation

plot(aggregate(Purchase~Occupation,data=ABC_train,mean),spread=FALSE,smoother.args=list(lty=2),pch=19,main="Purchase Amount with respect to Occupation")
## Warning in plot.window(...): "spread" is not a graphical parameter
## Warning in plot.window(...): "smoother.args" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "spread" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "smoother.args" is not a graphical
## parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "spread" is
## not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...):
## "smoother.args" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "spread" is
## not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...):
## "smoother.args" is not a graphical parameter
## Warning in box(...): "spread" is not a graphical parameter
## Warning in box(...): "smoother.args" is not a graphical parameter
## Warning in title(...): "spread" is not a graphical parameter
## Warning in title(...): "smoother.args" is not a graphical parameter

7d. Purchase Amount with respect to City Category

plot(aggregate(Purchase~City_Category,data=ABC_train,mean),spread=FALSE,smoother.args=list(lty=2),pch=19,main="Purchase Amount with respect to City Category")

7e. Purchase Amount with respect to Number of Years Stay in Current City

plot(aggregate(Purchase~Stay_In_Current_City_Years,data=ABC_train,mean),main="Purchase Amount with respect to Number of Years Stay in Current City")

7f. Purchase Amount with respect to Marital Status

p2=aggregate(Purchase~Marital_Status,data=ABC_train,mean)
p2
##   Marital_Status Purchase
## 1              0 9265.908
## 2              1 9261.175

7g. Purchase Amount with respect to Category of Products

plot(aggregate(Purchase~Product_Category_1,data=ABC_train,mean),spread=FALSE,smoother.args=list(lty=2),pch=19,main="Purchase Amount with respect to Category of Products")
## Warning in plot.window(...): "spread" is not a graphical parameter
## Warning in plot.window(...): "smoother.args" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "spread" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "smoother.args" is not a graphical
## parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "spread" is
## not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...):
## "smoother.args" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "spread" is
## not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...):
## "smoother.args" is not a graphical parameter
## Warning in box(...): "spread" is not a graphical parameter
## Warning in box(...): "smoother.args" is not a graphical parameter
## Warning in title(...): "spread" is not a graphical parameter
## Warning in title(...): "smoother.args" is not a graphical parameter

INFERENCE

BUILDING CORRELATIONS

Since the data set is enormously huge with each customer purchasing different products, the correlations between the “Customer’s Demographics” and “Purchase” cannot be easily established by taking every record in the data set into consideration.

Hence, we simplify the situation by averaging the purchase amounts of different products associated with a single customer and considering that average value in building correlations.

8. Making a new data set comprising of all the customers and their average purchase amounts

Users=unique(ABC_train[,c(1,3,4,5,6,7,8)])
p=aggregate(ABC_train$Purchase,by=list(User=ABC_train$User_ID),mean)
Users$Avg_Purchase=p[,2]

9. Checking and converting the data types of variables into “Interger Type”

str(Users)
## 'data.frame':    5891 obs. of  8 variables:
##  $ User_ID                   : int  1000001 1000002 1000003 1000004 1000005 1000006 1000007 1000008 1000009 1000010 ...
##  $ Gender                    : Factor w/ 2 levels "F","M": 1 2 2 2 2 1 2 2 2 1 ...
##  $ Age                       : Factor w/ 7 levels "0-17","18-25",..: 1 7 3 5 3 6 4 3 3 4 ...
##  $ Occupation                : int  10 16 15 7 20 9 1 12 17 1 ...
##  $ City_Category             : Factor w/ 3 levels "A","B","C": 1 3 1 2 1 1 2 3 3 2 ...
##  $ Stay_In_Current_City_Years: Factor w/ 5 levels "0","1","2","3",..: 3 5 4 3 2 2 2 5 1 5 ...
##  $ Marital_Status            : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 2 2 1 2 ...
##  $ Avg_Purchase              : num  9546 10526 11781 14748 7745 ...
Users$Gender=as.integer(Users$Gender)
Users$Age=as.integer(Users$Age)
Users$City_Category=as.integer(Users$City_Category)
Users$Stay_In_Current_City_Years=as.integer(Users$Stay_In_Current_City_Years)
Users$Marital_Status=as.integer(Users$Marital_Status)
Users$Avg_Purchase=as.integer(Users$Avg_Purchase)

10. Contructing a Scatter Plot Matrix between the variables

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(Users[,c("Gender","Age","Occupation","City_Category","Stay_In_Current_City_Years","Marital_Status")], spread=FALSE, smoother.args=list(lty=2), main="Scatter Plot Matrix")

On studying the Scatter Plot Matrix closely, we spot the correlations which have been numerically expressed below

12. Correlations expressed numerically

library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, units
Users1=Users[,-c(1)]
cormatrix=rcorr(as.matrix(Users1))
cormatrix
##                            Gender   Age Occupation City_Category
## Gender                       1.00 -0.01       0.15          0.01
## Age                         -0.01  1.00       0.08          0.10
## Occupation                   0.15  0.08       1.00          0.04
## City_Category                0.01  0.10       0.04          1.00
## Stay_In_Current_City_Years   0.01 -0.01       0.01          0.00
## Marital_Status              -0.01  0.33       0.03          0.05
## Avg_Purchase                 0.01 -0.02      -0.01         -0.01
##                            Stay_In_Current_City_Years Marital_Status
## Gender                                           0.01          -0.01
## Age                                             -0.01           0.33
## Occupation                                       0.01           0.03
## City_Category                                    0.00           0.05
## Stay_In_Current_City_Years                       1.00          -0.01
## Marital_Status                                  -0.01           1.00
## Avg_Purchase                                     0.01          -0.02
##                            Avg_Purchase
## Gender                             0.01
## Age                               -0.02
## Occupation                        -0.01
## City_Category                     -0.01
## Stay_In_Current_City_Years         0.01
## Marital_Status                    -0.02
## Avg_Purchase                       1.00
## 
## n= 5891 
## 
## 
## P
##                            Gender Age    Occupation City_Category
## Gender                            0.4140 0.0000     0.4675       
## Age                        0.4140        0.0000     0.0000       
## Occupation                 0.0000 0.0000            0.0016       
## City_Category              0.4675 0.0000 0.0016                  
## Stay_In_Current_City_Years 0.3382 0.5499 0.4926     0.9565       
## Marital_Status             0.2570 0.0000 0.0140     0.0003       
## Avg_Purchase               0.3185 0.2123 0.4266     0.3465       
##                            Stay_In_Current_City_Years Marital_Status
## Gender                     0.3382                     0.2570        
## Age                        0.5499                     0.0000        
## Occupation                 0.4926                     0.0140        
## City_Category              0.9565                     0.0003        
## Stay_In_Current_City_Years                            0.4080        
## Marital_Status             0.4080                                   
## Avg_Purchase               0.6093                     0.1621        
##                            Avg_Purchase
## Gender                     0.3185      
## Age                        0.2123      
## Occupation                 0.4266      
## City_Category              0.3465      
## Stay_In_Current_City_Years 0.6093      
## Marital_Status             0.1621      
## Avg_Purchase

. The upper matrix is the Correlation Matrix . The lower matrix gives the Pearson Rank Test

Significant correlations include Gender~Occupation, Marital Status~Age. Other slight correlations are City Category~Age, Occupation~Age

13. Constructing a corrgram between diffferent variables

library(corrgram)
corrgram(Users1, order=TRUE, main="Corrgram",lower.panel=panel.pts, upper.panel=panel.pie, diag.panel=panel.minmax, text.panel=panel.txt)

HYPOTHESIS TESTING

After visualizing data, we need to formulate a hypothesis which needs to be tested with the help a regression model:

“Which customer demographics explain the average amount of purchase made by the customer?”

In order to be able to answer this question, we need to develop a null hypothesis which states that “There is no significant difference in the amount purchased by customers with respect to their demographics”.

14. t-test between Average Purchase Amount and Gender

t.test(Users$Gender,Users$Avg_Purchase)
## 
##  Welch Two Sample t-test
## 
## data:  Users$Gender and Users$Avg_Purchase
## t = -388.48, df = 5890, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9614.908 -9518.358
## sample estimates:
##   mean of x   mean of y 
##    1.717196 9568.350195

15. t-test between Average Purchase Amount and Marital Status

t.test(Users$Marital_Status,Users$Avg_Purchase)
## 
##  Welch Two Sample t-test
## 
## data:  Users$Marital_Status and Users$Avg_Purchase
## t = -388.5, df = 5890, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9615.205 -9518.655
## sample estimates:
##   mean of x   mean of y 
##    1.419963 9568.350195

16. t-test between Average Purchase Amount and Age

t.test(Users$Age,Users$Avg_Purchase)
## 
##  Welch Two Sample t-test
## 
## data:  Users$Age and Users$Avg_Purchase
## t = -388.41, df = 5890, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9613.005 -9516.455
## sample estimates:
##   mean of x   mean of y 
##    3.620438 9568.350195

Since the value of p is less than 0.05, there is a significant difference in the “Purchase Amounts” with respect to Gender, Age and Marital Status. Thus, the null hypothesis of “Purchase Amount” being universally same is rejected.