Retailing involves the process of selling consumer goods or services to customers through multiple channels of distribution to earn a profit. The term “retailer” is typically applied where a service provider fills the small orders of a large number of individuals, who are end-users, rather than large orders of a small number of wholesale, corporate or government clientele.
One such retailer is ABC Private Limited. It sells products belonging to different categories to different customers. Being an offline store, it has the advantage of letting customers physically inspect a wide variety of products in real time. However, with the e-commerce industry growing tremendously and other retail companies resorting to analytics to devise better marketing strategies, the company faces a lot of competition.
The retail company caters to the needs of a wide customer base. The demographics of the customers have been studied and recorded. Gender, Age Group, Occupation, Kind of City the customer dwells in, Marital Status and Number of Years Spent in the Current City are the pieces of information gathered from the customer when he or she makes a purchase. Each product put up for sale is assigned to at least one category and at most three. The above information can be analysed to predict how much a customer will spend while shopping at ABC.
Now, the company wants to understand customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared a purchase summary of various customers for selected high-volume products from last month. The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and the total purchase_amount from last month. They want to build a model to predict a customer's purchase amount for various products, which will help them create personalized offers for customers on different products. Accordingly, we construct the following hypothesis:
Hypothesis: Which customer demographics best explain the average purchase amount of a customer?
For this study, we collected data from the website https://datahack.analyticsvidhya.com/contest/black-friday/. The variables in the data set have the following definitions:
User_ID - User ID
Product_ID - Product ID
Gender - Sex of User
Age - Age in bins
Occupation - Occupation (Masked)
City_Category - Category of the City (A,B,C)
Stay_In_Current_City_Years - Number of years of stay in the current city
Marital_Status - Marital Status
Product_Category_1 - Product Category (Masked)
Product_Category_2 - Product may also belong to another category (Masked)
Product_Category_3 - Product may also belong to another category (Masked)
Purchase - Purchase Amount (Target Variable)
In order to test the hypothesis, we proposed the following model:
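\[
\begin{aligned}
\mathrm{Purchase} ={}& \alpha_0 + \alpha_1\,\mathrm{Gender}_{\mathrm{M}} + \alpha_2\,\mathrm{Age}_{\text{18-25}} + \alpha_3\,\mathrm{Age}_{\text{26-35}} + \alpha_4\,\mathrm{Age}_{\text{36-45}} + \alpha_5\,\mathrm{Age}_{\text{46-50}} \\
&+ \alpha_6\,\mathrm{Age}_{\text{51-55}} + \alpha_7\,\mathrm{Age}_{\text{55+}} + \alpha_8\,\mathrm{Occupation} + \alpha_9\,\mathrm{City\_Category}_{\mathrm{B}} + \alpha_{10}\,\mathrm{City\_Category}_{\mathrm{C}} \\
&+ \alpha_{11}\,\mathrm{Stay\_In\_Current\_City\_Years}_{1} + \alpha_{12}\,\mathrm{Stay\_In\_Current\_City\_Years}_{2} + \alpha_{13}\,\mathrm{Stay\_In\_Current\_City\_Years}_{3} \\
&+ \alpha_{14}\,\mathrm{Stay\_In\_Current\_City\_Years}_{\text{4+}} + \alpha_{15}\,\mathrm{Marital\_Status}_{1} + \alpha_{16}\,\mathrm{Product\_Category\_1} \\
&+ \alpha_{17}\,\mathrm{Product\_Category\_2} + \alpha_{18}\,\mathrm{Product\_Category\_3} + \epsilon
\end{aligned}
\]
Here the dependent variable is Purchase. Gender, Age, City_Category, Stay_In_Current_City_Years and Marital_Status enter as 0/1 indicator variables, while Occupation and the three product categories enter as numeric codes.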
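The indicator terms in the model (GenderM, City_CategoryB and so on) come from how R encodes categorical predictors. The sketch below uses base R's model.matrix() on a small hypothetical data frame; the toy rows are made up for illustration and are not taken from the data set. R absorbs the first level of each factor into the intercept as the baseline, which is why the model has a GenderM term but no GenderF term.

```r
# Hypothetical toy rows, made up for illustration only (not from the data set)
toy <- data.frame(
  Gender        = factor(c("F", "M", "M", "F")),
  City_Category = factor(c("A", "B", "C", "B"))
)

# model.matrix() expands each factor into 0/1 indicator columns and
# drops the first level of every factor, which becomes the baseline
X <- model.matrix(~ Gender + City_Category, data = toy)
print(colnames(X))
print(X)
```

Each coefficient on an indicator is therefore read as a difference from the baseline level, for example male versus female customers with everything else held fixed.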
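# Set the working directory to the project folder (machine-specific path)
setwd("C:/Users/Dell/Desktop/Project/Capstone Project")
ABC_train <- read.csv("Training Data Set.csv")
# Treat the marital status code as a categorical variable
ABC_train$Marital_Status <- factor(ABC_train$Marital_Status)
# Linear regression of Purchase on the demographics and product categories
m1 <- lm(Purchase ~ Gender + Age + Occupation + City_Category + Stay_In_Current_City_Years + Marital_Status + Product_Category_1 + Product_Category_2 + Product_Category_3, data = ABC_train)
summary(m1)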
##
## Call:
## lm(formula = Purchase ~ Gender + Age + Occupation + City_Category +
## Stay_In_Current_City_Years + Marital_Status + Product_Category_1 +
## Product_Category_2 + Product_Category_3, data = ABC_train)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11109.9  -2840.6   -437.6   2779.0  19941.8
##
## Coefficients:
##                               Estimate Std. Error  t value Pr(>|t|)
## (Intercept)                  11469.093     85.178  134.649  < 2e-16 ***
## GenderM                        319.925     27.487   11.639  < 2e-16 ***
## Age18-25                       414.023     71.812    5.765 8.16e-09 ***
## Age26-35                       533.391     69.809    7.641 2.17e-14 ***
## Age36-45                       668.706     71.846    9.308  < 2e-16 ***
## Age46-50                       687.862     79.716    8.629  < 2e-16 ***
## Age51-55                      1040.773     81.748   12.732  < 2e-16 ***
## Age55+                         900.212     91.234    9.867  < 2e-16 ***
## Occupation                       7.643      1.774    4.307 1.65e-05 ***
## City_CategoryB                 209.119     28.901    7.236 4.65e-13 ***
## City_CategoryC                 837.074     30.425   27.512  < 2e-16 ***
## Stay_In_Current_City_Years1    115.460     36.506    3.163  0.00156 **
## Stay_In_Current_City_Years2    212.938     40.516    5.256 1.48e-07 ***
## Stay_In_Current_City_Years3     66.809     41.227    1.621  0.10512
## Stay_In_Current_City_Years4+   148.511     42.530    3.492  0.00048 ***
## Marital_Status1                -39.823     24.673   -1.614  0.10652
## Product_Category_1            -827.614      5.113 -161.867  < 2e-16 ***
## Product_Category_2              25.321      3.377    7.498 6.53e-14 ***
## Product_Category_3              73.052      3.285   22.236  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4631 on 166802 degrees of freedom
## (383247 observations deleted due to missingness)
## Multiple R-squared: 0.1699, Adjusted R-squared: 0.1698
## F-statistic: 1897 on 18 and 166802 DF, p-value: < 2.2e-16
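The summary reports that 383247 observations were deleted due to missingness. This is lm()'s default handling of NA values: any row with a missing value in a model variable is dropped before fitting. A minimal sketch of that behaviour, on hypothetical toy data made up for illustration (not from the data set):

```r
# Hypothetical toy data with missing values, made up for illustration only
toy <- data.frame(
  y  = c(10, 12, 9, 15, 11, 14),
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(2, NA, 1, NA, 3, 5)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# na.omit (the default na.action) drops every incomplete row before fitting
fit <- lm(y ~ x1 + x2, data = toy, na.action = na.omit)
n_used    <- nobs(fit)
n_dropped <- nrow(toy) - n_used
cat("rows used:", n_used, " rows dropped:", n_dropped, "\n")
```

Running colSums(is.na(ABC_train)) in the same way would show which columns carry the gaps; presumably they are Product_Category_2 and Product_Category_3, since a product need not belong to more than one category. In the summary, 166802 residual degrees of freedom plus 19 estimated coefficients imply that 166821 rows were actually used in the fit.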
We estimated the effect of each customer demographic on the Purchase Amount with the simplest model, a linear regression fitted by ordinary least squares. The coefficients of the regression can be studied to assess which demographic is of more importance than the others.
coef(m1)
##                  (Intercept)                      GenderM
##                 11469.092934                   319.925227
##                     Age18-25                     Age26-35
##                   414.022514                   533.391444
##                     Age36-45                     Age46-50
##                   668.706288                   687.862193
##                     Age51-55                       Age55+
##                  1040.773098                   900.211598
##                   Occupation               City_CategoryB
##                     7.643229                   209.118510
##               City_CategoryC  Stay_In_Current_City_Years1
##                   837.073958                   115.459813
##  Stay_In_Current_City_Years2  Stay_In_Current_City_Years3
##                   212.938441                    66.808870
## Stay_In_Current_City_Years4+              Marital_Status1
##                   148.510845                   -39.823226
##           Product_Category_1           Product_Category_2
##                  -827.613566                    25.320631
##           Product_Category_3
##                    73.051732
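A factor such as Age enters the model through several indicator coefficients, so comparing single coefficients across predictors only gives a rough picture. One possible complement is drop1(), which tests the removal of each predictor as a whole. The sketch below runs it on R's built-in warpbreaks data set rather than on ABC_train; applying it to m1 is an assumption about how the analysis could be extended, not something the original study ran.

```r
# Built-in data set with two categorical predictors: wool (2 levels), tension (3 levels)
fit <- lm(breaks ~ wool + tension, data = warpbreaks)

# One row per predictor: F test for dropping that predictor entirely,
# regardless of how many indicator columns it contributes
tab <- drop1(fit, test = "F")
print(tab)
```

The equivalent call on the fitted model would be drop1(m1, test = "F"), where a larger F value suggests that the predictor as a whole explains more of the variation in Purchase.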
Judging by the size of the estimated coefficients, Product Category 1, Age Group and City Category seem to have more impact than Gender, Number of Years of Stay in the Current City and Occupation, while Marital Status is not statistically significant at the 5% level (p = 0.107). The Multiple R-squared is 0.17, which means that the model explains about 17% of the variance in the purchase amount. Although this value is low, the model still captures part of the uncertainty in customer purchasing behaviour.
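The Multiple R-squared reported by summary() equals one minus the ratio of the residual sum of squares to the total sum of squares of the dependent variable. The sketch below checks this identity on R's built-in cars data set, used here only as a stand-in for ABC_train.

```r
# Simple regression on a built-in data set (stand-in for the real data)
fit <- lm(dist ~ speed, data = cars)

# Residual and total sums of squares
rss <- sum(residuals(fit)^2)
tss <- sum((cars$dist - mean(cars$dist))^2)

# R-squared by hand versus the value reported by summary()
r2_manual  <- 1 - rss / tss
r2_summary <- summary(fit)$r.squared
print(c(manual = r2_manual, summary = r2_summary))
```

Read this way, an R-squared of 0.17 says that the residual variance is 17% smaller than the variance of Purchase itself.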
It is well known that customers can be very erratic as far as purchasing is concerned. With such considerations, this model makes it possible to approximate how much each kind of customer is likely to spend with the company. This will aid in making personalized offers to customers and initiating loyalty programs which offer credit points on purchases.
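As a worked example of such an approximation, the sketch below plugs a hypothetical customer and product into the fitted equation, using coefficients copied from the regression summary. The customer profile and the product category codes are assumptions chosen only for illustration.

```r
# Coefficients copied from the regression summary (rounded to 3 decimals)
b <- c(
  intercept = 11469.093, GenderM = 319.925, Age26_35 = 533.391,
  Occupation = 7.643, City_CategoryC = 837.074, Stay1 = 115.460,
  Marital_Status1 = -39.823,
  Product_Category_1 = -827.614, Product_Category_2 = 25.321,
  Product_Category_3 = 73.052
)

# Hypothetical customer and product, chosen only for illustration:
# male, aged 26-35, occupation code 4, city category C, 1 year in the city,
# unmarried, product category codes 1, 2 and 5
x <- c(
  intercept = 1, GenderM = 1, Age26_35 = 1,
  Occupation = 4, City_CategoryC = 1, Stay1 = 1,
  Marital_Status1 = 0,
  Product_Category_1 = 1, Product_Category_2 = 2, Product_Category_3 = 5
)

# Predicted purchase amount = sum of coefficient * value over all terms
predicted_purchase <- sum(b * x[names(b)])
print(predicted_purchase)
```

For this profile the predicted purchase amount is about 12894. With the fitted model at hand, predict(m1, newdata = ...) would give the same kind of estimate directly, assuming newdata has the same columns and factor levels as ABC_train.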