PROBLEM STATEMENT

A retail company “ABC Private Limited” wants to understand the customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared purchase summary of various customers for selected high volume products from last month.

The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and Total purchase_amount from last month.

Now, they want to build a model to predict the purchase amount of customer against various products which will help them to create personalized offer for customers against different products.

DESCRIPTION OF DATASET

User_ID | User ID
Product_ID | Product ID
Gender | Sex of User
Age | Age in bins
Occupation | Occupation (Masked)
City_Category | Category of the City (A,B,C)
Stay_In_Current_City_Years | Number of years stay in current city
Marital_Status | Marital Status
Product_Category_1 | Product Category (Masked)
Product_Category_2 | Product may belongs to other category also (Masked)
Product_Category_3 | Product may belongs to other category also (Masked)
Purchase | Purchase Amount (Target Variable)

SYNOPSIS

A thorough EDA the dataset reveals some interesting correlations and factors influencing the final purchase amount. Correlatons between categorical variables Using Chi Square tests were used to determine features for the linear regression model to predict purchase amount. The product Id, along with the City_Category come out to be the major influencers of purchase amount. Other columns have been included in the regression model since they too exhibit some correlation with the purchase amount, as depicted by chi sq tests. Unmarried people, aged 46-50 in City category C seem to be major spenders.

The model output with an R-Squared value of 0.64 is able to predict the amount with decent accuracy. The next step would be to improve it’s accuracy by using Random Forest approach, along with a gradient boosting method.

EDA and Model output

SalesRaw$Marital_Status <- as.factor(SalesRaw$Marital_Status)

SalesRaw$Product_Category_1 <- as.factor(SalesRaw$Product_Category_1)
SalesRaw$Product_Category_2 <- as.factor(SalesRaw$Product_Category_2)
SalesRaw$Product_Category_3 <- as.factor(SalesRaw$Product_Category_3)
SalesRaw$Occupation <- as.factor(SalesRaw$Occupation)
SalesRaw$User_ID <- as.factor(SalesRaw$User_ID)
SalesRawSummary <- summary(SalesRaw)
SalesRawSummary
##     User_ID           Product_ID     Gender        Age        
##  1001680:  1026   P00265242:  1880   F:135809   0-17 : 15102  
##  1004277:   979   P00025442:  1615   M:414259   18-25: 99660  
##  1001941:   898   P00110742:  1612              26-35:219587  
##  1001181:   862   P00112142:  1562              36-45:110013  
##  1000889:   823   P00057642:  1470              46-50: 45701  
##  1003618:   767   P00184942:  1440              51-55: 38501  
##  (Other):544713   (Other)  :540489              55+  : 21504  
##    Occupation     City_Category Stay_In_Current_City_Years Marital_Status
##  4      : 72308   A:147720      0 : 74398                  0:324731      
##  0      : 69638   B:231173      1 :193821                  1:225337      
##  7      : 59133   C:171175      2 :101838                                
##  1      : 47426                 3 : 95285                                
##  17     : 40043                 4+: 84726                                
##  20     : 33562                                                          
##  (Other):227958                                                          
##  Product_Category_1 Product_Category_2 Product_Category_3    Purchase    
##  5      :150933     8      : 64088     16     : 32636     Min.   :   12  
##  1      :140378     14     : 55108     15     : 28013     1st Qu.: 5823  
##  8      :113925     2      : 49217     14     : 18428     Median : 8047  
##  11     : 24287     16     : 43255     17     : 16702     Mean   : 9264  
##  2      : 23864     15     : 37855     5      : 16658     3rd Qu.:12054  
##  6      : 20466     (Other):126907     (Other): 54384     Max.   :23961  
##  (Other): 76215     NA's   :173638     NA's   :383247
GenderAge <- xtabs(~SalesRaw$Gender + SalesRaw$Age )
GenderAge
##                SalesRaw$Age
## SalesRaw$Gender   0-17  18-25  26-35  36-45  46-50  51-55    55+
##               F   5083  24628  50752  27170  13199   9894   5083
##               M  10019  75032 168835  82843  32502  28607  16421
prop.table(GenderAge,2)*100
##                SalesRaw$Age
## SalesRaw$Gender     0-17    18-25    26-35    36-45    46-50    51-55
##               F 33.65779 24.71202 23.11248 24.69708 28.88121 25.69803
##               M 66.34221 75.28798 76.88752 75.30292 71.11879 74.30197
##                SalesRaw$Age
## SalesRaw$Gender      55+
##               F 23.63746
##               M 76.36254
chisqtestGenPurchase <- chisq.test(SalesRaw$Gender, SalesRaw$Purchase)
## Warning in chisq.test(SalesRaw$Gender, SalesRaw$Purchase): Chi-squared
## approximation may be incorrect
chisqtestGenPurchase
## 
##  Pearson's Chi-squared test
## 
## data:  SalesRaw$Gender and SalesRaw$Purchase
## X-squared = 23742, df = 18104, p-value < 2.2e-16
chisqtestAgePurchase <- chisq.test(SalesRaw$Age, SalesRaw$Purchase)
## Warning in chisq.test(SalesRaw$Age, SalesRaw$Purchase): Chi-squared
## approximation may be incorrect
chisqtestAgePurchase
## 
##  Pearson's Chi-squared test
## 
## data:  SalesRaw$Age and SalesRaw$Purchase
## X-squared = 118080, df = 108620, p-value < 2.2e-16
GenderOccupation <- xtabs(~SalesRaw$Gender + SalesRaw$Occupation )
GenderOccupation
##                SalesRaw$Occupation
## SalesRaw$Gender     0     1     2     3     4     5     6     7     8
##               F 18112 17984  8629  7919 17836  2220  8160 10028   361
##               M 51526 29442 17959  9731 54472  9957 12195 49105  1185
##                SalesRaw$Occupation
## SalesRaw$Gender     9    10    11    12    13    14    15    16    17
##               F  5843  4003  1500  3469  1498  6763  2390  4107  3929
##               M   448  8927 10086 27710  6230 20546  9775 21264 36114
##                SalesRaw$Occupation
## SalesRaw$Gender    18    19    20
##               F   230  2017  8811
##               M  6392  6444 24751
round(prop.table(GenderOccupation,2)*100,2)
##                SalesRaw$Occupation
## SalesRaw$Gender     0     1     2     3     4     5     6     7     8
##               F 26.01 37.92 32.45 44.87 24.67 18.23 40.09 16.96 23.35
##               M 73.99 62.08 67.55 55.13 75.33 81.77 59.91 83.04 76.65
##                SalesRaw$Occupation
## SalesRaw$Gender     9    10    11    12    13    14    15    16    17
##               F 92.88 30.96 12.95 11.13 19.38 24.76 19.65 16.19  9.81
##               M  7.12 69.04 87.05 88.87 80.62 75.24 80.35 83.81 90.19
##                SalesRaw$Occupation
## SalesRaw$Gender    18    19    20
##               F  3.47 23.84 26.25
##               M 96.53 76.16 73.75
chisqtestOccPurchase <- chisq.test(SalesRaw$Purchase, SalesRaw$Occupation)
## Warning in chisq.test(SalesRaw$Purchase, SalesRaw$Occupation): Chi-squared
## approximation may be incorrect
chisqtestOccPurchase
## 
##  Pearson's Chi-squared test
## 
## data:  SalesRaw$Purchase and SalesRaw$Occupation
## X-squared = 373980, df = 362080, p-value < 2.2e-16
AgeOccupation <- xtabs(~SalesRaw$Age + SalesRaw$Occupation )
AgeOccupation
##             SalesRaw$Occupation
## SalesRaw$Age     0     1     2     3     4     5     6     7     8     9
##        0-17   2134   387   144     0   113     0     0   139    29     0
##        18-25  9095  3820  4364  1860 48241  1450  1144  2078    14   559
##        26-35 34204 19080 12617  8159 21829  6082  7216 24060   378  1489
##        36-45 13393  9501  5183  4126  1747  3066  4822 18762    98  3096
##        46-50  4488  7089  2124  1599   129  1187  2561  6664   549   528
##        51-55  4602  4410  1344  1094   249   377  3952  5355   317   398
##        55+    1722  3139   812   812     0    15   660  2075   161   221
##             SalesRaw$Occupation
## SalesRaw$Age    10    11    12    13    14    15    16    17    18    19
##        0-17  10951    18   237    15    93     0     0    35     0   807
##        18-25  1649   717  4585     0  4388   906  1816  3944  1085  2500
##        26-35    26  5009 15279     0 13446  6874  7070 17064  2243  3468
##        36-45   170  2732  6848   427  5590  2585  7572 10252  1527  1008
##        46-50     0  1584  2491   631  1445   854  3032  4662  1124   261
##        51-55     0  1383  1417  1785  1012   514  3918  2528   531   200
##        55+     134   143   322  4870  1335   432  1963  1558   112   217
##             SalesRaw$Occupation
## SalesRaw$Age    20
##        0-17      0
##        18-25  5445
##        26-35 13994
##        36-45  7508
##        46-50  2699
##        51-55  3115
##        55+     801
round(prop.table(AgeOccupation,2)*100,2)
##             SalesRaw$Occupation
## SalesRaw$Age     0     1     2     3     4     5     6     7     8     9
##        0-17   3.06  0.82  0.54  0.00  0.16  0.00  0.00  0.24  1.88  0.00
##        18-25 13.06  8.05 16.41 10.54 66.72 11.91  5.62  3.51  0.91  8.89
##        26-35 49.12 40.23 47.45 46.23 30.19 49.95 35.45 40.69 24.45 23.67
##        36-45 19.23 20.03 19.49 23.38  2.42 25.18 23.69 31.73  6.34 49.21
##        46-50  6.44 14.95  7.99  9.06  0.18  9.75 12.58 11.27 35.51  8.39
##        51-55  6.61  9.30  5.05  6.20  0.34  3.10 19.42  9.06 20.50  6.33
##        55+    2.47  6.62  3.05  4.60  0.00  0.12  3.24  3.51 10.41  3.51
##             SalesRaw$Occupation
## SalesRaw$Age    10    11    12    13    14    15    16    17    18    19
##        0-17  84.69  0.16  0.76  0.19  0.34  0.00  0.00  0.09  0.00  9.54
##        18-25 12.75  6.19 14.71  0.00 16.07  7.45  7.16  9.85 16.38 29.55
##        26-35  0.20 43.23 49.00  0.00 49.24 56.51 27.87 42.61 33.87 40.99
##        36-45  1.31 23.58 21.96  5.53 20.47 21.25 29.85 25.60 23.06 11.91
##        46-50  0.00 13.67  7.99  8.17  5.29  7.02 11.95 11.64 16.97  3.08
##        51-55  0.00 11.94  4.54 23.10  3.71  4.23 15.44  6.31  8.02  2.36
##        55+    1.04  1.23  1.03 63.02  4.89  3.55  7.74  3.89  1.69  2.56
##             SalesRaw$Occupation
## SalesRaw$Age    20
##        0-17   0.00
##        18-25 16.22
##        26-35 41.70
##        36-45 22.37
##        46-50  8.04
##        51-55  9.28
##        55+    2.39
cityoccupation <- xtabs(~SalesRaw$City_Category + SalesRaw$Occupation)
cityoccupation
##                       SalesRaw$Occupation
## SalesRaw$City_Category     0     1     2     3     4     5     6     7
##                      A 18858 12809  9232  5645 24205  2383  3755 16216
##                      B 29850 19756 11366  6905 29854  6615 10953 22986
##                      C 20930 14861  5990  5100 18249  3179  5647 19931
##                       SalesRaw$Occupation
## SalesRaw$City_Category     8     9    10    11    12    13    14    15
##                      A   104   708  2226  2470  7080   393  7745  3035
##                      B   829  3224  4230  5595 13184  2392 11256  5262
##                      C   613  2359  6474  3521 10915  4943  8308  3868
##                       SalesRaw$Occupation
## SalesRaw$City_Category    16    17    18    19    20
##                      A  6095  8177  1605  2236 12743
##                      B 10879 16111  2126  3400 14400
##                      C  8397 15755  2891  2825  6419
round(prop.table(cityoccupation,2)*100,2)
##                       SalesRaw$Occupation
## SalesRaw$City_Category     0     1     2     3     4     5     6     7
##                      A 27.08 27.01 34.72 31.98 33.47 19.57 18.45 27.42
##                      B 42.86 41.66 42.75 39.12 41.29 54.32 53.81 38.87
##                      C 30.06 31.34 22.53 28.90 25.24 26.11 27.74 33.71
##                       SalesRaw$Occupation
## SalesRaw$City_Category     8     9    10    11    12    13    14    15
##                      A  6.73 11.25 17.22 21.32 22.71  5.09 28.36 24.95
##                      B 53.62 51.25 32.71 48.29 42.28 30.95 41.22 43.26
##                      C 39.65 37.50 50.07 30.39 35.01 63.96 30.42 31.80
##                       SalesRaw$Occupation
## SalesRaw$City_Category    16    17    18    19    20
##                      A 24.02 20.42 24.24 26.43 37.97
##                      B 42.88 40.23 32.11 40.18 42.91
##                      C 33.10 39.35 43.66 33.39 19.13
genderageSales <- SalesRaw %>% group_by(SalesRaw$Gender, SalesRaw$Age) %>% summarise(Avgsales = mean(Purchase))
## Warning: package 'bindrcpp' was built under R version 3.4.1
genderageSales
## # A tibble: 14 x 3
## # Groups:   SalesRaw$Gender [?]
##    `SalesRaw$Gender` `SalesRaw$Age` Avgsales
##               <fctr>         <fctr>    <dbl>
##  1                 F           0-17 8338.772
##  2                 F          18-25 8343.180
##  3                 F          26-35 8728.252
##  4                 F          36-45 8959.844
##  5                 F          46-50 8842.099
##  6                 F          51-55 9042.450
##  7                 F            55+ 9007.036
##  8                 M           0-17 9235.174
##  9                 M          18-25 9440.943
## 10                 M          26-35 9410.338
## 11                 M          36-45 9453.194
## 12                 M          46-50 9357.472
## 13                 M          51-55 9705.095
## 14                 M            55+ 9438.196
par(mfrow = c(3,2))
CitySales <- plot(SalesRaw$City_Category, SalesRaw$Purchase, xlab = "city")
OccupationSales <- plot(SalesRaw$Occupation, SalesRaw$Purchase, xlab = "Occupation")
GenderSales <- plot(SalesRaw$Gender, SalesRaw$Purchase, xlab = "Gender")
AgeSales <- plot(SalesRaw$Age, SalesRaw$Purchase , xlab = "Age")
StaySales <- plot(SalesRaw$Stay_In_Current_City_Years, SalesRaw$Purchase , xlab = "Stay Duration Years")
MartialstatusSales <- plot(SalesRaw$Marital_Status, SalesRaw$Purchase , xlab = "Marital Status")

CitySales <- SalesRaw %>% group_by(SalesRaw$City_Category) %>% summarise(Avgsales = mean(Purchase))
CitySales
## # A tibble: 3 x 2
##   `SalesRaw$City_Category` Avgsales
##                     <fctr>    <dbl>
## 1                        A 8911.939
## 2                        B 9151.301
## 3                        C 9719.921
nullcheck <- is.na(SalesRaw$Product_Category_1)
unique(nullcheck)
## [1] FALSE
model_v1 <- lm(Purchase ~ Gender + Age + Occupation + City_Category + Stay_In_Current_City_Years + Marital_Status + Product_Category_1  , data=SalesRaw)
summary(model_v1)
## 
## Call:
## lm(formula = Purchase ~ Gender + Age + Occupation + City_Category + 
##     Stay_In_Current_City_Years + Marital_Status + Product_Category_1, 
##     data = SalesRaw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15546.3  -1594.6    397.5   1963.4   8379.4 
## 
## Coefficients:
##                                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                   1.334e+04  4.351e+01  306.546  < 2e-16 ***
## GenderM                      -5.037e+01  9.875e+00   -5.101 3.39e-07 ***
## Age18-25                     -1.773e+02  4.084e+01   -4.342 1.41e-05 ***
## Age26-35                     -7.597e+01  4.070e+01   -1.866  0.06197 .  
## Age36-45                      5.244e+01  4.128e+01    1.270  0.20403    
## Age46-50                      7.647e+01  4.312e+01    1.774  0.07614 .  
## Age51-55                      3.225e+02  4.361e+01    7.395 1.42e-13 ***
## Age55+                        2.175e+02  4.616e+01    4.712 2.46e-06 ***
## Occupation1                  -2.832e+01  1.811e+01   -1.563  0.11798    
## Occupation2                   5.749e+01  2.180e+01    2.637  0.00836 ** 
## Occupation3                   2.463e+02  2.553e+01    9.651  < 2e-16 ***
## Occupation4                   1.590e+02  1.737e+01    9.152  < 2e-16 ***
## Occupation5                   5.352e+01  2.970e+01    1.802  0.07158 .  
## Occupation6                   2.090e+02  2.428e+01    8.610  < 2e-16 ***
## Occupation7                   1.195e+02  1.705e+01    7.013 2.34e-12 ***
## Occupation8                  -3.202e+02  7.777e+01   -4.117 3.84e-05 ***
## Occupation9                   1.090e+02  4.041e+01    2.697  0.00701 ** 
## Occupation10                 -7.522e+01  4.377e+01   -1.718  0.08573 .  
## Occupation11                  1.274e+02  3.037e+01    4.194 2.74e-05 ***
## Occupation12                  2.865e+02  2.064e+01   13.881  < 2e-16 ***
## Occupation13                  9.252e+01  3.904e+01    2.370  0.01779 *  
## Occupation14                  2.033e+02  2.158e+01    9.424  < 2e-16 ***
## Occupation15                  3.901e+02  2.967e+01   13.145  < 2e-16 ***
## Occupation16                  1.116e+02  2.234e+01    4.996 5.84e-07 ***
## Occupation17                  1.838e+02  1.909e+01    9.624  < 2e-16 ***
## Occupation18                 -2.209e+01  3.894e+01   -0.567  0.57048    
## Occupation19                 -3.088e+02  3.493e+01   -8.840  < 2e-16 ***
## Occupation20                 -8.167e+01  2.014e+01   -4.054 5.03e-05 ***
## City_CategoryB                1.348e+02  1.016e+01   13.262  < 2e-16 ***
## City_CategoryC                5.544e+02  1.101e+01   50.339  < 2e-16 ***
## Stay_In_Current_City_Years1  -6.840e-02  1.307e+01   -0.005  0.99583    
## Stay_In_Current_City_Years2   3.787e+01  1.459e+01    2.596  0.00942 ** 
## Stay_In_Current_City_Years3  -6.566e+00  1.484e+01   -0.442  0.65820    
## Stay_In_Current_City_Years4+  2.198e+01  1.521e+01    1.445  0.14847    
## Marital_Status1              -5.590e+01  8.835e+00   -6.327 2.50e-10 ***
## Product_Category_12          -2.346e+03  2.112e+01 -111.106  < 2e-16 ***
## Product_Category_13          -3.492e+03  2.273e+01 -153.619  < 2e-16 ***
## Product_Category_14          -1.125e+04  2.899e+01 -387.969  < 2e-16 ***
## Product_Category_15          -7.343e+03  1.124e+01 -653.594  < 2e-16 ***
## Product_Category_16           2.244e+03  2.257e+01   99.431  < 2e-16 ***
## Product_Category_17           2.816e+03  5.009e+01   56.221  < 2e-16 ***
## Product_Category_18          -6.102e+03  1.212e+01 -503.394  < 2e-16 ***
## Product_Category_19           1.922e+03  1.491e+02   12.893  < 2e-16 ***
## Product_Category_110          6.038e+03  4.289e+01  140.772  < 2e-16 ***
## Product_Category_111         -8.886e+03  2.097e+01 -423.800  < 2e-16 ***
## Product_Category_112         -1.229e+04  4.873e+01 -252.209  < 2e-16 ***
## Product_Category_113         -1.289e+04  4.127e+01 -312.260  < 2e-16 ***
## Product_Category_114         -4.483e+02  7.769e+01   -5.771 7.89e-09 ***
## Product_Category_115          1.177e+03  3.885e+01   30.294  < 2e-16 ***
## Product_Category_116          1.174e+03  3.146e+01   37.311  < 2e-16 ***
## Product_Category_117         -3.527e+03  1.257e+02  -28.066  < 2e-16 ***
## Product_Category_118         -1.067e+04  5.455e+01 -195.578  < 2e-16 ***
## Product_Category_119         -1.370e+04  7.573e+01 -180.873  < 2e-16 ***
## Product_Category_120         -1.335e+04  6.025e+01 -221.486  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3014 on 550014 degrees of freedom
## Multiple R-squared:  0.6401, Adjusted R-squared:  0.6401 
## F-statistic: 1.846e+04 on 53 and 550014 DF,  p-value: < 2.2e-16