A retail company “ABC Private Limited” wants to understand the customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared purchase summary of various customers for selected high volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and Total purchase_amount from last month.
Now, they want to build a model to predict the purchase amount of customer against various products which will help them to create personalized offer for customers against different products.
User_ID | User ID
Product_ID | Product ID
Gender | Sex of User
Age | Age in bins
Occupation | Occupation (Masked)
City_Category | Category of the City (A,B,C)
Stay_In_Current_City_Years | Number of years stay in current city
Marital_Status | Marital Status
Product_Category_1 | Product Category (Masked)
Product_Category_2 | Product may belongs to other category also (Masked)
Product_Category_3 | Product may belongs to other category also (Masked)
Purchase | Purchase Amount (Target Variable)
A thorough EDA the dataset reveals some interesting correlations and factors influencing the final purchase amount. Correlatons between categorical variables Using Chi Square tests were used to determine features for the linear regression model to predict purchase amount. The product Id, along with the City_Category come out to be the major influencers of purchase amount. Other columns have been included in the regression model since they too exhibit some correlation with the purchase amount, as depicted by chi sq tests. Unmarried people, aged 46-50 in City category C seem to be major spenders.
The model output with an R-Squared value of 0.64 is able to predict the amount with decent accuracy. The next step would be to improve it’s accuracy by using Random Forest approach, along with a gradient boosting method.
SalesRaw$Marital_Status <- as.factor(SalesRaw$Marital_Status)
SalesRaw$Product_Category_1 <- as.factor(SalesRaw$Product_Category_1)
SalesRaw$Product_Category_2 <- as.factor(SalesRaw$Product_Category_2)
SalesRaw$Product_Category_3 <- as.factor(SalesRaw$Product_Category_3)
SalesRaw$Occupation <- as.factor(SalesRaw$Occupation)
SalesRaw$User_ID <- as.factor(SalesRaw$User_ID)
SalesRawSummary <- summary(SalesRaw)
SalesRawSummary
## User_ID Product_ID Gender Age
## 1001680: 1026 P00265242: 1880 F:135809 0-17 : 15102
## 1004277: 979 P00025442: 1615 M:414259 18-25: 99660
## 1001941: 898 P00110742: 1612 26-35:219587
## 1001181: 862 P00112142: 1562 36-45:110013
## 1000889: 823 P00057642: 1470 46-50: 45701
## 1003618: 767 P00184942: 1440 51-55: 38501
## (Other):544713 (Other) :540489 55+ : 21504
## Occupation City_Category Stay_In_Current_City_Years Marital_Status
## 4 : 72308 A:147720 0 : 74398 0:324731
## 0 : 69638 B:231173 1 :193821 1:225337
## 7 : 59133 C:171175 2 :101838
## 1 : 47426 3 : 95285
## 17 : 40043 4+: 84726
## 20 : 33562
## (Other):227958
## Product_Category_1 Product_Category_2 Product_Category_3 Purchase
## 5 :150933 8 : 64088 16 : 32636 Min. : 12
## 1 :140378 14 : 55108 15 : 28013 1st Qu.: 5823
## 8 :113925 2 : 49217 14 : 18428 Median : 8047
## 11 : 24287 16 : 43255 17 : 16702 Mean : 9264
## 2 : 23864 15 : 37855 5 : 16658 3rd Qu.:12054
## 6 : 20466 (Other):126907 (Other): 54384 Max. :23961
## (Other): 76215 NA's :173638 NA's :383247
GenderAge <- xtabs(~SalesRaw$Gender + SalesRaw$Age )
GenderAge
## SalesRaw$Age
## SalesRaw$Gender 0-17 18-25 26-35 36-45 46-50 51-55 55+
## F 5083 24628 50752 27170 13199 9894 5083
## M 10019 75032 168835 82843 32502 28607 16421
prop.table(GenderAge,2)*100
## SalesRaw$Age
## SalesRaw$Gender 0-17 18-25 26-35 36-45 46-50 51-55
## F 33.65779 24.71202 23.11248 24.69708 28.88121 25.69803
## M 66.34221 75.28798 76.88752 75.30292 71.11879 74.30197
## SalesRaw$Age
## SalesRaw$Gender 55+
## F 23.63746
## M 76.36254
chisqtestGenPurchase <- chisq.test(SalesRaw$Gender, SalesRaw$Purchase)
## Warning in chisq.test(SalesRaw$Gender, SalesRaw$Purchase): Chi-squared
## approximation may be incorrect
chisqtestGenPurchase
##
## Pearson's Chi-squared test
##
## data: SalesRaw$Gender and SalesRaw$Purchase
## X-squared = 23742, df = 18104, p-value < 2.2e-16
chisqtestAgePurchase <- chisq.test(SalesRaw$Age, SalesRaw$Purchase)
## Warning in chisq.test(SalesRaw$Age, SalesRaw$Purchase): Chi-squared
## approximation may be incorrect
chisqtestAgePurchase
##
## Pearson's Chi-squared test
##
## data: SalesRaw$Age and SalesRaw$Purchase
## X-squared = 118080, df = 108620, p-value < 2.2e-16
GenderOccupation <- xtabs(~SalesRaw$Gender + SalesRaw$Occupation )
GenderOccupation
## SalesRaw$Occupation
## SalesRaw$Gender 0 1 2 3 4 5 6 7 8
## F 18112 17984 8629 7919 17836 2220 8160 10028 361
## M 51526 29442 17959 9731 54472 9957 12195 49105 1185
## SalesRaw$Occupation
## SalesRaw$Gender 9 10 11 12 13 14 15 16 17
## F 5843 4003 1500 3469 1498 6763 2390 4107 3929
## M 448 8927 10086 27710 6230 20546 9775 21264 36114
## SalesRaw$Occupation
## SalesRaw$Gender 18 19 20
## F 230 2017 8811
## M 6392 6444 24751
round(prop.table(GenderOccupation,2)*100,2)
## SalesRaw$Occupation
## SalesRaw$Gender 0 1 2 3 4 5 6 7 8
## F 26.01 37.92 32.45 44.87 24.67 18.23 40.09 16.96 23.35
## M 73.99 62.08 67.55 55.13 75.33 81.77 59.91 83.04 76.65
## SalesRaw$Occupation
## SalesRaw$Gender 9 10 11 12 13 14 15 16 17
## F 92.88 30.96 12.95 11.13 19.38 24.76 19.65 16.19 9.81
## M 7.12 69.04 87.05 88.87 80.62 75.24 80.35 83.81 90.19
## SalesRaw$Occupation
## SalesRaw$Gender 18 19 20
## F 3.47 23.84 26.25
## M 96.53 76.16 73.75
chisqtestOccPurchase <- chisq.test(SalesRaw$Purchase, SalesRaw$Occupation)
## Warning in chisq.test(SalesRaw$Purchase, SalesRaw$Occupation): Chi-squared
## approximation may be incorrect
chisqtestOccPurchase
##
## Pearson's Chi-squared test
##
## data: SalesRaw$Purchase and SalesRaw$Occupation
## X-squared = 373980, df = 362080, p-value < 2.2e-16
AgeOccupation <- xtabs(~SalesRaw$Age + SalesRaw$Occupation )
AgeOccupation
## SalesRaw$Occupation
## SalesRaw$Age 0 1 2 3 4 5 6 7 8 9
## 0-17 2134 387 144 0 113 0 0 139 29 0
## 18-25 9095 3820 4364 1860 48241 1450 1144 2078 14 559
## 26-35 34204 19080 12617 8159 21829 6082 7216 24060 378 1489
## 36-45 13393 9501 5183 4126 1747 3066 4822 18762 98 3096
## 46-50 4488 7089 2124 1599 129 1187 2561 6664 549 528
## 51-55 4602 4410 1344 1094 249 377 3952 5355 317 398
## 55+ 1722 3139 812 812 0 15 660 2075 161 221
## SalesRaw$Occupation
## SalesRaw$Age 10 11 12 13 14 15 16 17 18 19
## 0-17 10951 18 237 15 93 0 0 35 0 807
## 18-25 1649 717 4585 0 4388 906 1816 3944 1085 2500
## 26-35 26 5009 15279 0 13446 6874 7070 17064 2243 3468
## 36-45 170 2732 6848 427 5590 2585 7572 10252 1527 1008
## 46-50 0 1584 2491 631 1445 854 3032 4662 1124 261
## 51-55 0 1383 1417 1785 1012 514 3918 2528 531 200
## 55+ 134 143 322 4870 1335 432 1963 1558 112 217
## SalesRaw$Occupation
## SalesRaw$Age 20
## 0-17 0
## 18-25 5445
## 26-35 13994
## 36-45 7508
## 46-50 2699
## 51-55 3115
## 55+ 801
round(prop.table(AgeOccupation,2)*100,2)
## SalesRaw$Occupation
## SalesRaw$Age 0 1 2 3 4 5 6 7 8 9
## 0-17 3.06 0.82 0.54 0.00 0.16 0.00 0.00 0.24 1.88 0.00
## 18-25 13.06 8.05 16.41 10.54 66.72 11.91 5.62 3.51 0.91 8.89
## 26-35 49.12 40.23 47.45 46.23 30.19 49.95 35.45 40.69 24.45 23.67
## 36-45 19.23 20.03 19.49 23.38 2.42 25.18 23.69 31.73 6.34 49.21
## 46-50 6.44 14.95 7.99 9.06 0.18 9.75 12.58 11.27 35.51 8.39
## 51-55 6.61 9.30 5.05 6.20 0.34 3.10 19.42 9.06 20.50 6.33
## 55+ 2.47 6.62 3.05 4.60 0.00 0.12 3.24 3.51 10.41 3.51
## SalesRaw$Occupation
## SalesRaw$Age 10 11 12 13 14 15 16 17 18 19
## 0-17 84.69 0.16 0.76 0.19 0.34 0.00 0.00 0.09 0.00 9.54
## 18-25 12.75 6.19 14.71 0.00 16.07 7.45 7.16 9.85 16.38 29.55
## 26-35 0.20 43.23 49.00 0.00 49.24 56.51 27.87 42.61 33.87 40.99
## 36-45 1.31 23.58 21.96 5.53 20.47 21.25 29.85 25.60 23.06 11.91
## 46-50 0.00 13.67 7.99 8.17 5.29 7.02 11.95 11.64 16.97 3.08
## 51-55 0.00 11.94 4.54 23.10 3.71 4.23 15.44 6.31 8.02 2.36
## 55+ 1.04 1.23 1.03 63.02 4.89 3.55 7.74 3.89 1.69 2.56
## SalesRaw$Occupation
## SalesRaw$Age 20
## 0-17 0.00
## 18-25 16.22
## 26-35 41.70
## 36-45 22.37
## 46-50 8.04
## 51-55 9.28
## 55+ 2.39
cityoccupation <- xtabs(~SalesRaw$City_Category + SalesRaw$Occupation)
cityoccupation
## SalesRaw$Occupation
## SalesRaw$City_Category 0 1 2 3 4 5 6 7
## A 18858 12809 9232 5645 24205 2383 3755 16216
## B 29850 19756 11366 6905 29854 6615 10953 22986
## C 20930 14861 5990 5100 18249 3179 5647 19931
## SalesRaw$Occupation
## SalesRaw$City_Category 8 9 10 11 12 13 14 15
## A 104 708 2226 2470 7080 393 7745 3035
## B 829 3224 4230 5595 13184 2392 11256 5262
## C 613 2359 6474 3521 10915 4943 8308 3868
## SalesRaw$Occupation
## SalesRaw$City_Category 16 17 18 19 20
## A 6095 8177 1605 2236 12743
## B 10879 16111 2126 3400 14400
## C 8397 15755 2891 2825 6419
round(prop.table(cityoccupation,2)*100,2)
## SalesRaw$Occupation
## SalesRaw$City_Category 0 1 2 3 4 5 6 7
## A 27.08 27.01 34.72 31.98 33.47 19.57 18.45 27.42
## B 42.86 41.66 42.75 39.12 41.29 54.32 53.81 38.87
## C 30.06 31.34 22.53 28.90 25.24 26.11 27.74 33.71
## SalesRaw$Occupation
## SalesRaw$City_Category 8 9 10 11 12 13 14 15
## A 6.73 11.25 17.22 21.32 22.71 5.09 28.36 24.95
## B 53.62 51.25 32.71 48.29 42.28 30.95 41.22 43.26
## C 39.65 37.50 50.07 30.39 35.01 63.96 30.42 31.80
## SalesRaw$Occupation
## SalesRaw$City_Category 16 17 18 19 20
## A 24.02 20.42 24.24 26.43 37.97
## B 42.88 40.23 32.11 40.18 42.91
## C 33.10 39.35 43.66 33.39 19.13
genderageSales <- SalesRaw %>% group_by(SalesRaw$Gender, SalesRaw$Age) %>% summarise(Avgsales = mean(Purchase))
## Warning: package 'bindrcpp' was built under R version 3.4.1
genderageSales
## # A tibble: 14 x 3
## # Groups: SalesRaw$Gender [?]
## `SalesRaw$Gender` `SalesRaw$Age` Avgsales
## <fctr> <fctr> <dbl>
## 1 F 0-17 8338.772
## 2 F 18-25 8343.180
## 3 F 26-35 8728.252
## 4 F 36-45 8959.844
## 5 F 46-50 8842.099
## 6 F 51-55 9042.450
## 7 F 55+ 9007.036
## 8 M 0-17 9235.174
## 9 M 18-25 9440.943
## 10 M 26-35 9410.338
## 11 M 36-45 9453.194
## 12 M 46-50 9357.472
## 13 M 51-55 9705.095
## 14 M 55+ 9438.196
par(mfrow = c(3,2))
CitySales <- plot(SalesRaw$City_Category, SalesRaw$Purchase, xlab = "city")
OccupationSales <- plot(SalesRaw$Occupation, SalesRaw$Purchase, xlab = "Occupation")
GenderSales <- plot(SalesRaw$Gender, SalesRaw$Purchase, xlab = "Gender")
AgeSales <- plot(SalesRaw$Age, SalesRaw$Purchase , xlab = "Age")
StaySales <- plot(SalesRaw$Stay_In_Current_City_Years, SalesRaw$Purchase , xlab = "Stay Duration Years")
MartialstatusSales <- plot(SalesRaw$Marital_Status, SalesRaw$Purchase , xlab = "Marital Status")
CitySales <- SalesRaw %>% group_by(SalesRaw$City_Category) %>% summarise(Avgsales = mean(Purchase))
CitySales
## # A tibble: 3 x 2
## `SalesRaw$City_Category` Avgsales
## <fctr> <dbl>
## 1 A 8911.939
## 2 B 9151.301
## 3 C 9719.921
nullcheck <- is.na(SalesRaw$Product_Category_1)
unique(nullcheck)
## [1] FALSE
model_v1 <- lm(Purchase ~ Gender + Age + Occupation + City_Category + Stay_In_Current_City_Years + Marital_Status + Product_Category_1 , data=SalesRaw)
summary(model_v1)
##
## Call:
## lm(formula = Purchase ~ Gender + Age + Occupation + City_Category +
## Stay_In_Current_City_Years + Marital_Status + Product_Category_1,
## data = SalesRaw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15546.3 -1594.6 397.5 1963.4 8379.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.334e+04 4.351e+01 306.546 < 2e-16 ***
## GenderM -5.037e+01 9.875e+00 -5.101 3.39e-07 ***
## Age18-25 -1.773e+02 4.084e+01 -4.342 1.41e-05 ***
## Age26-35 -7.597e+01 4.070e+01 -1.866 0.06197 .
## Age36-45 5.244e+01 4.128e+01 1.270 0.20403
## Age46-50 7.647e+01 4.312e+01 1.774 0.07614 .
## Age51-55 3.225e+02 4.361e+01 7.395 1.42e-13 ***
## Age55+ 2.175e+02 4.616e+01 4.712 2.46e-06 ***
## Occupation1 -2.832e+01 1.811e+01 -1.563 0.11798
## Occupation2 5.749e+01 2.180e+01 2.637 0.00836 **
## Occupation3 2.463e+02 2.553e+01 9.651 < 2e-16 ***
## Occupation4 1.590e+02 1.737e+01 9.152 < 2e-16 ***
## Occupation5 5.352e+01 2.970e+01 1.802 0.07158 .
## Occupation6 2.090e+02 2.428e+01 8.610 < 2e-16 ***
## Occupation7 1.195e+02 1.705e+01 7.013 2.34e-12 ***
## Occupation8 -3.202e+02 7.777e+01 -4.117 3.84e-05 ***
## Occupation9 1.090e+02 4.041e+01 2.697 0.00701 **
## Occupation10 -7.522e+01 4.377e+01 -1.718 0.08573 .
## Occupation11 1.274e+02 3.037e+01 4.194 2.74e-05 ***
## Occupation12 2.865e+02 2.064e+01 13.881 < 2e-16 ***
## Occupation13 9.252e+01 3.904e+01 2.370 0.01779 *
## Occupation14 2.033e+02 2.158e+01 9.424 < 2e-16 ***
## Occupation15 3.901e+02 2.967e+01 13.145 < 2e-16 ***
## Occupation16 1.116e+02 2.234e+01 4.996 5.84e-07 ***
## Occupation17 1.838e+02 1.909e+01 9.624 < 2e-16 ***
## Occupation18 -2.209e+01 3.894e+01 -0.567 0.57048
## Occupation19 -3.088e+02 3.493e+01 -8.840 < 2e-16 ***
## Occupation20 -8.167e+01 2.014e+01 -4.054 5.03e-05 ***
## City_CategoryB 1.348e+02 1.016e+01 13.262 < 2e-16 ***
## City_CategoryC 5.544e+02 1.101e+01 50.339 < 2e-16 ***
## Stay_In_Current_City_Years1 -6.840e-02 1.307e+01 -0.005 0.99583
## Stay_In_Current_City_Years2 3.787e+01 1.459e+01 2.596 0.00942 **
## Stay_In_Current_City_Years3 -6.566e+00 1.484e+01 -0.442 0.65820
## Stay_In_Current_City_Years4+ 2.198e+01 1.521e+01 1.445 0.14847
## Marital_Status1 -5.590e+01 8.835e+00 -6.327 2.50e-10 ***
## Product_Category_12 -2.346e+03 2.112e+01 -111.106 < 2e-16 ***
## Product_Category_13 -3.492e+03 2.273e+01 -153.619 < 2e-16 ***
## Product_Category_14 -1.125e+04 2.899e+01 -387.969 < 2e-16 ***
## Product_Category_15 -7.343e+03 1.124e+01 -653.594 < 2e-16 ***
## Product_Category_16 2.244e+03 2.257e+01 99.431 < 2e-16 ***
## Product_Category_17 2.816e+03 5.009e+01 56.221 < 2e-16 ***
## Product_Category_18 -6.102e+03 1.212e+01 -503.394 < 2e-16 ***
## Product_Category_19 1.922e+03 1.491e+02 12.893 < 2e-16 ***
## Product_Category_110 6.038e+03 4.289e+01 140.772 < 2e-16 ***
## Product_Category_111 -8.886e+03 2.097e+01 -423.800 < 2e-16 ***
## Product_Category_112 -1.229e+04 4.873e+01 -252.209 < 2e-16 ***
## Product_Category_113 -1.289e+04 4.127e+01 -312.260 < 2e-16 ***
## Product_Category_114 -4.483e+02 7.769e+01 -5.771 7.89e-09 ***
## Product_Category_115 1.177e+03 3.885e+01 30.294 < 2e-16 ***
## Product_Category_116 1.174e+03 3.146e+01 37.311 < 2e-16 ***
## Product_Category_117 -3.527e+03 1.257e+02 -28.066 < 2e-16 ***
## Product_Category_118 -1.067e+04 5.455e+01 -195.578 < 2e-16 ***
## Product_Category_119 -1.370e+04 7.573e+01 -180.873 < 2e-16 ***
## Product_Category_120 -1.335e+04 6.025e+01 -221.486 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3014 on 550014 degrees of freedom
## Multiple R-squared: 0.6401, Adjusted R-squared: 0.6401
## F-statistic: 1.846e+04 on 53 and 550014 DF, p-value: < 2.2e-16