Retail sales prediction

Lakshmi Narayanan B

2021-10-24


Abstract

  • This study examines predicting retail outlet sales through statistical learning and machine learning models.

Executive summary

Aim

  • The aim is to explore the relationship between the variables that affect product sales and, in particular, to build a model that predicts the sales at a given outlet.

Methodology

Data

  • Two cross-sectional multivariate data sets, a training data set and a test data set, were used for this analysis. Both contain eleven (11) predictor variables; only the training data set contains the outcome variable. The currency in these data sets is assumed to be INR.

  • The training data set had 8523 sample observations with eleven (11) predictor variables, and one (1) outcome variable, Item Outlet Sales.

  • The test data set had 5681 sample observations with eleven (11) predictor variables.

Technique(s)

  • Supervised machine learning.

  • Two machine learning techniques, namely, multi-variate log-log linear regression and random forest models were used to analyse and predict the retail outlet sales.

  • The training data set was randomly split into two data sets with ratio of 60:40 for model selection and evaluation, respectively.

  • The outlet sales were predicted using the test data set.

Results

Multi-variate log-log linear regression

  • The log-log linear model predicted the outlet sales based on Item MRP, Item Fat Content and Outlet Identifier. The prediction error was 1100.67 INR (RMSEP).

Random Forest regression

  • The Random Forest model predicted the outlet sales based on Item MRP, Outlet Type and Outlet Identifier. The prediction error was 1149.71 INR (RMSEP).

Conclusion

  • From a model interpretability and prediction performance standpoint, the multi-variate log-log linear regression model (RMSEP 1100.67 INR) is somewhat better than the Random Forest regression model (RMSEP 1149.71 INR).

  • Additionally, both models could be trained with additional variables such as season, day and time to assess whether the models' performance improves.

Technical notes

Initial hypotheses

The hypotheses in this section were generated based on the objective of this paper, and prior knowledge about the subject-matter. The hypotheses were defined prior to analysing the data to help with better understanding of data analysis in the forthcoming sections.

Based on prior knowledge, a variety of factors affect the value and volume of product sales. Key determinants include the five Ps (place, product, price, promotion and positioning) and season.

  • Place:
    • The location of a retail outlet influences product sales. For example, urban, semi-urban and rural locations indicate different customer segments and different levels of consumer demand.
    • The capacity of a retail outlet determines the supply and selection of product groups, thus influencing product sales.
    • Thus, a retail outlet’s location and size together drive competition.
  • Product:
    • The brand, type, and packaging of products will influence sales.
  • Price:
    • The price of a product will influence product sales. For example, luxury products will be priced higher than utility-oriented products.
    • Thus, a lower product price leads to higher demand, which results in a higher volume of sales.
  • Promotion:
    • New product promotions and campaigns at a retail outlet will attract more shoppers, leading to higher sales.
  • Positioning:
    • The brand positioning of a retail outlet determines its attractiveness and customer loyalty: the higher the attractiveness, the higher the customer loyalty, leading to increased sales.
  • Season:
    • The season and the time of week, month and year will influence retail sales.

In summary, the initial assumption is that season and the five Ps outlined above could affect product sales.

Exploratory data analysis

Descriptive statistics of the training and test data sets

Data Frame Summary

trainR
Dimensions: 8523 x 12
Duplicates: 0
No. Variable [type]: Stats / Values with Freqs (% of Valid). Missing
1. Item_Identifier [factor]: DRA12: 6 (0.1%); DRA24: 7 (0.1%); DRA59: 8 (0.1%); DRB01: 3 (0.0%); DRB13: 5 (0.1%); DRB24: 4 (0.0%); DRB25: 6 (0.1%); DRB48: 7 (0.1%); DRC01: 6 (0.1%); DRC12: 4 (0.0%); [1549 others]: 8467 (99.3%). Missing: 0 (0%)
2. Item_Weight [numeric]: Mean (sd): 12.9 (4.6); min < med < max: 4.6 < 12.6 < 21.4; IQR (CV): 8.1 (0.4); 415 distinct values. Missing: 1463 (17.17%)
3. Item_Fat_Content [factor]: LF: 316 (3.7%); low fat: 112 (1.3%); Low Fat: 5089 (59.7%); reg: 117 (1.4%); Regular: 2889 (33.9%). Missing: 0 (0%)
4. Item_Visibility [numeric]: Mean (sd): 0.1 (0.1); min < med < max: 0 < 0.1 < 0.3; IQR (CV): 0.1 (0.8); 7880 distinct values. Missing: 0 (0%)
5. Item_Type [factor]: Baking Goods: 648 (7.6%); Breads: 251 (2.9%); Breakfast: 110 (1.3%); Canned: 649 (7.6%); Dairy: 682 (8.0%); Frozen Foods: 856 (10.0%); Fruits and Vegetables: 1232 (14.5%); Hard Drinks: 214 (2.5%); Health and Hygiene: 520 (6.1%); Household: 910 (10.7%); [6 others]: 2451 (28.8%). Missing: 0 (0%)
6. Item_MRP [numeric]: Mean (sd): 141 (62.3); min < med < max: 31.3 < 143 < 266.9; IQR (CV): 91.8 (0.4); 5938 distinct values. Missing: 0 (0%)
7. Outlet_Identifier [factor]: OUT010: 555 (6.5%); OUT013: 932 (10.9%); OUT017: 926 (10.9%); OUT018: 928 (10.9%); OUT019: 528 (6.2%); OUT027: 935 (11.0%); OUT035: 930 (10.9%); OUT045: 929 (10.9%); OUT046: 930 (10.9%); OUT049: 930 (10.9%). Missing: 0 (0%)
8. Outlet_Establishment_Year [integer]: Mean (sd): 1997.8 (8.4); min < med < max: 1985 < 1999 < 2009; IQR (CV): 17 (0); 1985: 1463 (17.2%); 1987: 932 (10.9%); 1997: 930 (10.9%); 1998: 555 (6.5%); 1999: 930 (10.9%); 2002: 929 (10.9%); 2004: 930 (10.9%); 2007: 926 (10.9%); 2009: 928 (10.9%). Missing: 0 (0%)
9. Outlet_Size [factor]: High: 932 (15.2%); Medium: 2793 (45.7%); Small: 2388 (39.1%). Missing: 2410 (28.28%)
10. Outlet_Location_Type [factor]: Tier 1: 2388 (28.0%); Tier 2: 2785 (32.7%); Tier 3: 3350 (39.3%). Missing: 0 (0%)
11. Outlet_Type [factor]: Grocery Store: 1083 (12.7%); Supermarket Type1: 5577 (65.4%); Supermarket Type2: 928 (10.9%); Supermarket Type3: 935 (11.0%). Missing: 0 (0%)
12. Item_Outlet_Sales [numeric]: Mean (sd): 2181.3 (1706.5); min < med < max: 33.3 < 1794.3 < 13087; IQR (CV): 2267 (0.8); 3493 distinct values. Missing: 0 (0%)

The summary output above is for the training data set.

About the data set:

  • There are 8523 sample observations and 12 variables in the data set.

  • There are seven (7) categorical variables.

    • Three (3) variables pertaining to items, namely, Item ID, Item Fat Content and Item Type.
    • Four (4) variables pertaining to outlets, namely, Outlet ID, Outlet Size, Outlet Location Type and Outlet Type.
  • There are five (5) numeric variables.

    • Three (3) variables pertaining to items, namely, Item Weight, Item Visibility and Item MRP.
    • One variable (1) pertaining to outlets, Outlet Establishment Year.
    • One variable (1) pertaining to both items and outlets, Item Outlet Sales.
  • The outcome variable is Item Outlet Sales.

Missing data:

  • Categorical variable(s): One (1) variable has missing data, Outlet Size.
  • Numeric variable(s): One (1) variable has missing data, Item Weight.

Class distribution:

  • Item Fat Content: 64.7% were low fat, and 35.3% were regular. The labels in this categorical variable are spelled inconsistently (LF, low fat, Low Fat, reg, and Regular) and must be merged.
  • Item Type: The distribution is fragmented across sixteen (16) item types. Thus, a better way of re-grouping the item types should be determined.
  • Outlet ID: There are ten (10) outlets. The largest, OUT027, accounts for 11% of the observations.
  • Outlet Size: Among the non-missing values, 45.7% were Medium, 39.1% were Small, and 15.2% were High.
  • Outlet Location Type: 39.3% were Tier-3, 32.7% were Tier-2, and 28% were Tier-1.
  • Outlet Type: 65.4% were Supermarket Type1, 12.7% were Grocery Store, 11% were Supermarket Type3 and 10.9% were Supermarket Type2.
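
The low fat and regular shares quoted for Item Fat Content can be reproduced by merging the five raw labels. The sketch below uses the label counts from the training-set summary; the `canonical` mapping is an illustrative assumption about how the spellings would be merged, not the author's code.

```python
from collections import Counter

# Label counts copied from the training-set summary above.
raw_counts = {"LF": 316, "low fat": 112, "Low Fat": 5089, "reg": 117, "Regular": 2889}

# Hypothetical mapping from each raw spelling to one canonical label.
canonical = {
    "LF": "Low Fat", "low fat": "Low Fat", "Low Fat": "Low Fat",
    "reg": "Regular", "Regular": "Regular",
}

merged = Counter()
for label, count in raw_counts.items():
    merged[canonical[label]] += count

total = sum(merged.values())
shares = {label: round(100 * count / total, 1) for label, count in merged.items()}
print(total, shares)
```

The merged counts sum to the 8523 training observations, and the rounded shares match the 64.7% and 35.3% reported above.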

Outcome variable:

  • Item Outlet Sales: The distribution appears to be right-skewed. The mean sales (2181.3) are considerably larger than the median sales (1794.3). Therefore, it would be sensible to transform this variable.

Data Frame Summary

testR
Dimensions: 5681 x 11
Duplicates: 0
No. Variable [type]: Stats / Values with Freqs (% of Valid). Missing
1. Item_Identifier [factor]: DRA12: 3 (0.1%); DRA24: 3 (0.1%); DRA59: 2 (0.0%); DRB01: 5 (0.1%); [1539 others]: 5668 (99.8%). Missing: 0 (0%)
2. Item_Weight [numeric]: Mean (sd): 12.7 (4.7); min < med < max: 4.6 < 12.5 < 21.4; IQR (CV): 8.1 (0.4); 410 distinct values. Missing: 976 (17.18%)
3. Item_Fat_Content [factor]: LF: 206 (3.6%); low fat: 66 (1.2%); Low Fat: 3396 (59.8%); reg: 78 (1.4%); Regular: 1935 (34.1%). Missing: 0 (0%)
4. Item_Visibility [numeric]: Mean (sd): 0.1 (0.1); min < med < max: 0 < 0.1 < 0.3; IQR (CV): 0.1 (0.8); 5277 distinct values. Missing: 0 (0%)
5. Item_Type [factor]: Baking Goods: 438 (7.7%); Breads: 165 (2.9%); Breakfast: 76 (1.3%); Canned: 435 (7.7%); [12 others]: 4567 (80.4%). Missing: 0 (0%)
6. Item_MRP [numeric]: Mean (sd): 141 (61.8); min < med < max: 32 < 141.4 < 266.6; IQR (CV): 91.6 (0.4); 4402 distinct values. Missing: 0 (0%)
7. Outlet_Identifier [factor]: OUT010: 370 (6.5%); OUT013: 621 (10.9%); OUT017: 617 (10.9%); OUT018: 618 (10.9%); [6 others]: 3455 (60.8%). Missing: 0 (0%)
8. Outlet_Establishment_Year [integer]: Mean (sd): 1997.8 (8.4); min < med < max: 1985 < 1999 < 2009; IQR (CV): 17 (0); 9 distinct values. Missing: 0 (0%)
9. Outlet_Size [factor]: High: 621 (15.2%); Medium: 1862 (45.7%); Small: 1592 (39.1%). Missing: 1606 (28.27%)
10. Outlet_Location_Type [factor]: Tier 1: 1592 (28.0%); Tier 2: 1856 (32.7%); Tier 3: 2233 (39.3%). Missing: 0 (0%)
11. Outlet_Type [factor]: Grocery Store: 722 (12.7%); Supermarket Type1: 3717 (65.4%); Supermarket Type2: 618 (10.9%); Supermarket Type3: 624 (11.0%). Missing: 0 (0%)

The summary output above is for the test data set.

About the data set:

  • There are 5681 sample observations and 11 variables in the data set.

  • There are seven (7) categorical variables.

    • Three (3) variables pertaining to items, namely, Item ID, Item Fat Content and Item Type.
    • Four (4) variables pertaining to outlets, namely, Outlet ID, Outlet Size, Outlet Location Type and Outlet Type.
  • There are four (4) numeric variables.

    • Three (3) variables pertaining to items, namely, Item Weight, Item Visibility and Item MRP.
    • One variable (1) pertaining to outlets, Outlet Establishment Year.
    • The outcome variable, Item Outlet Sales, is not present in the test data set.

Missing data:

  • Categorical variable(s): One (1) variable has missing data, Outlet Size.
  • Numeric variable(s): One (1) variable has missing data, Item Weight.

Class distribution:

  • Item Fat Content: 64.6% were low fat, and 35.4% were regular. The labels in this categorical variable are spelled inconsistently (LF, low fat, Low Fat, reg, and Regular) and must be merged.
  • Item Type: The distribution is fragmented across sixteen (16) item types.
  • Outlet ID: The distribution is fragmented across ten (10) outlets.
  • Outlet Size: Among the non-missing values, 45.7% were Medium, 39.1% were Small, and 15.2% were High.
  • Outlet Location Type: 39.3% were Tier-3, 32.7% were Tier-2, and 28% were Tier-1.
  • Outlet Type: 65.4% were Supermarket Type1, 12.7% were Grocery Store, 11% were Supermarket Type3 and 10.9% were Supermarket Type2.

Outcome variable:

  • The outcome variable Item Outlet Sales must be predicted by the model(s).

Visualisation of the data

Univariate analysis
Univariate analysis: Descriptive statistics

Statistic   Item Weight   Item Visibility   Item MRP   Item-Outlet Sales
Min.        4.555         0.00000           31.29      33.29
1st Qu.     8.774         0.02699           93.83      834.25
Median      12.600        0.05393           143.01     1794.33
Mean        12.858        0.06613           140.99     2181.29
3rd Qu.     16.850        0.09459           185.64     3101.30
Max.        21.350        0.32839           266.89     13086.97
NA's        1463          0                 0          0

  • The histograms and box-plots above represent the distribution, shape and spread of the numeric variables Item Weight, Item Visibility, Item MRP and Item-Outlet Sales, from the training data set.

Distribution and skewness:

  • Item Weight: This variable shows a multi-modal distribution.

  • Item Visibility: This variable shows a right-skewed distribution, suggesting that only a few items have high visibility.

  • Item MRP: This variable shows a multi-modal distribution.

  • Item-Outlet Sales: This variable shows a right-skewed distribution, suggesting that a few item-outlet combinations account for high sales values.

Therefore, the skewness of item visibility and item-outlet sales should be treated using an appropriate transformation.
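
To illustrate why a transformation helps, the sketch below computes a moment-based skewness coefficient before and after a log-transformation. The `sales` values are made up for illustration and are not taken from the data set; the `skewness` helper is a hypothetical name.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Made-up right-skewed sales values (not from the Big Mart data).
sales = [120.0, 340.0, 560.0, 800.0, 1100.0, 1800.0, 2500.0, 4000.0, 9000.0]
log_sales = [math.log(v) for v in sales]

print("raw:", round(skewness(sales), 2))
print("log:", round(skewness(log_sales), 2))
```

A coefficient well above zero indicates a long right tail; after the log-transformation the coefficient for these toy values is much closer to zero.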

Outliers:

  • Item-Outlet Sales: There appear to be four (4) outliers.

Bivariate analysis

The box-plots above show the bivariate analysis between categorical variables and item-outlet sales from the training data set.

Outlet Identifier:

  • The median item-outlet sales in outlet 27 are higher than in the rest of the outlets across all item types.
  • The median sales in outlets 10 and 19 appear to be lower than in the rest of the outlets.
  • Outlet 27 has a couple of outliers in the non-consumables item type.

Outlet Size:

  • The median item-outlet sales appear to be more or less the same across all outlet sizes.
  • The small size has a couple of outliers in the non-consumables item type.

Outlet Location Type:

  • The median item-outlet sales appear to be more or less the same across all outlet location types.
  • The tier-3 outlet location type has a couple of outliers in the non-consumables item type.

Outlet Type:

  • The median item-outlet sales in supermarket type-3 appear to be higher than in other outlet types for all item types.
  • The median sales in the grocery store outlet type are lower than in all other outlet types.
  • The supermarket type-3 has a couple of outliers in the non-consumables item type.

In summary, the median item-outlet sales appear to vary across outlets and outlet types. In particular, the median item-outlet sales were higher in outlet 27 and in supermarket type-3 than in the others.

The box-plot above shows the bivariate analysis between outlet established year/age and item-outlet sales from the training data set.

Outlet Age:

  • The median item-outlet sales are lower for the outlet aged 15 years than for the others.
  • In the non-consumables item type, the median item-outlet sales for outlets aged between 4 and 14 years appear to be higher than in other item types. Otherwise, the item-outlet sales appear to be more or less the same across item types.
  • There are a few outliers for outlets aged 6, 16, 26 and 28 years across item types.
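
Outlet age can be derived from Outlet Establishment Year by subtracting it from a reference year. The report does not state the reference year; 2013 is an assumption inferred from the ages quoted above (4 to 28 years) and the establishment years in the summary (1985 to 2009).

```python
# Assumed reference year, inferred from the quoted ages; not stated in the report.
REFERENCE_YEAR = 2013

# Establishment years listed in the training-set summary.
establishment_years = [1985, 1987, 1997, 1998, 1999, 2002, 2004, 2007, 2009]

outlet_age = {year: REFERENCE_YEAR - year for year in establishment_years}
for year, age in outlet_age.items():
    print(year, age)
```

Under this assumption, the outlet established in 1998 is the one aged 15 years, and the outliers at ages 6, 16, 26 and 28 correspond to the outlets established in 2007, 1997, 1987 and 1985.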

  • Item MRP and item-outlet sales have a positive linear relationship. There is some non-constant scatter or variation in the data.

  • Item weight and sales have a mostly non-linear relationship, although consumables show some positive linear relationship.

  • Item visibility and sales show some positive linear relationship, but the plots show heavy non-constant variation in the data.

  • There are outliers in the consumables and non-consumables item types across all three plots.

Correlation analysis

  • The heatmap above shows the correlation between numeric variables in the training data set.

  • The variables item MRP and item-outlet sales are positively correlated (0.62).

  • The variables item visibility and item-outlet sales are weakly negatively correlated (-0.09).

  • In summary, there is no extensive multicollinearity in the data.
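
The coefficients in the heatmap are pairwise correlations. As a minimal sketch, assuming Pearson correlation was used, the coefficient for one pair of variables can be computed as below; the `mrp` and `sales` values are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up values (not from the Big Mart data).
mrp = [50.0, 90.0, 120.0, 150.0, 200.0, 250.0]
sales = [400.0, 1500.0, 900.0, 2600.0, 2100.0, 4200.0]

print(round(pearson(mrp, sales), 2))
```

Multicollinearity would show up as coefficients close to +1 or -1 between pairs of predictor variables, which is what the heatmap is checked for.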

Data cleaning

Missing value treatment

## Training data set:
## Item_Identifier                 0
## Item_Weight                  1463
## Item_Fat_Content                0
## Item_Visibility                 0
## Item_Type                       0
## Item_MRP                        0
## Outlet_Identifier               0
## Outlet_Establishment_Year       0
## Outlet_Size                  2410
## Outlet_Location_Type            0
## Outlet_Type                     0
## Item_Outlet_Sales               0
## dtype: int64
## Test data set:
## Item_Identifier                 0
## Item_Weight                   976
## Item_Fat_Content                0
## Item_Visibility                 0
## Item_Type                       0
## Item_MRP                        0
## Outlet_Identifier               0
## Outlet_Establishment_Year       0
## Outlet_Size                  1606
## Outlet_Location_Type            0
## Outlet_Type                     0
## dtype: int64
  • The item weight and outlet size variables have missing values in both the training and test data sets.

  • Item weight is a continuous numeric variable, and outlet size is an ordinal categorical variable.

  • Therefore, it is sensible to impute a missing item weight with the mean weight of the same item (by Item Identifier), and a missing outlet size with the mode of outlet size for the same outlet type.

## Mode for each outlet type: 
##  Outlet_Type Grocery Store Supermarket Type1 Supermarket Type2 Supermarket Type3
## Outlet_Size         Small             Small            Medium            Medium
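
The imputation described above can be sketched as follows: a group mean for item weight and a group mode for outlet size. The rows and item identifiers are hypothetical, and this is plain Python rather than the pandas code used to produce the output in this report.

```python
from collections import Counter, defaultdict
from statistics import fmean

# Hypothetical rows; None marks a missing value.
rows = [
    {"Item_Identifier": "ITEM_A", "Item_Weight": 9.3, "Outlet_Type": "Supermarket Type1", "Outlet_Size": "Small"},
    {"Item_Identifier": "ITEM_A", "Item_Weight": None, "Outlet_Type": "Supermarket Type1", "Outlet_Size": None},
    {"Item_Identifier": "ITEM_B", "Item_Weight": 5.92, "Outlet_Type": "Supermarket Type1", "Outlet_Size": "Small"},
    {"Item_Identifier": "ITEM_B", "Item_Weight": None, "Outlet_Type": "Supermarket Type1", "Outlet_Size": "Medium"},
    {"Item_Identifier": "ITEM_C", "Item_Weight": None, "Outlet_Type": "Grocery Store", "Outlet_Size": "Small"},
]

# Collect the observed weights per item and the observed sizes per outlet type.
weights = defaultdict(list)
sizes = defaultdict(Counter)
for row in rows:
    if row["Item_Weight"] is not None:
        weights[row["Item_Identifier"]].append(row["Item_Weight"])
    if row["Outlet_Size"] is not None:
        sizes[row["Outlet_Type"]][row["Outlet_Size"]] += 1

# Fill each gap from its own group; leave it missing if the group has no data.
for row in rows:
    item, outlet_type = row["Item_Identifier"], row["Outlet_Type"]
    if row["Item_Weight"] is None and item in weights:
        row["Item_Weight"] = fmean(weights[item])
    if row["Outlet_Size"] is None and outlet_type in sizes:
        row["Outlet_Size"] = sizes[outlet_type].most_common(1)[0][0]

still_missing = [r["Item_Identifier"] for r in rows if r["Item_Weight"] is None]
print(still_missing)
```

An item with no observed weight anywhere in the data (ITEM_C here) cannot be imputed this way and stays missing.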

Review the missing value treatment

## Training data set:
## Item_Identifier              0
## Item_Weight                  4
## Item_Fat_Content             0
## Item_Visibility              0
## Item_Type                    0
## Item_MRP                     0
## Outlet_Identifier            0
## Outlet_Establishment_Year    0
## Outlet_Size                  0
## Outlet_Location_Type         0
## Outlet_Type                  0
## Item_Outlet_Sales            0
## dtype: int64
## Test data set:
## Item_Identifier              0
## Item_Weight                  1
## Item_Fat_Content             0
## Item_Visibility              0
## Item_Type                    0
## Item_MRP                     0
## Outlet_Identifier            0
## Outlet_Establishment_Year    0
## Outlet_Size                  0
## Outlet_Location_Type         0
## Outlet_Type                  0
## dtype: int64

After imputation, four (4) missing values remain in the item weight variable of the training data set, and one (1) remains in the test data set. Further diagnosis is needed to understand the issue.

## Training data set:
## 927     FDN52
## 1922    FDK57
## 4187    FDE52
## 5022    FDQ60
## Name: Item_Identifier, dtype: object
## Training data set:
## True     8383
## False     140
## Name: Item_Identifier, dtype: int64
## Test data set:
## True    5681
## Name: Item_Identifier, dtype: int64

There are four (4) products with a missing item weight, and these products have no other observations in the training data set from which a mean weight could be computed. Therefore, these four (4) products and their observations are excluded from both the training and test data sets.

Next, there are 140 products that are in the training data set but not in the test data set. No further action is required, as every product in the test data set is also present in the training data set.
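
The True/False counts above come from a membership check between the two data sets. A minimal sketch of such a check, using hypothetical identifiers and plain Python sets, is:

```python
# Hypothetical item identifiers, one entry per observation.
train_items = ["ITEM_A", "ITEM_B", "ITEM_C", "ITEM_D", "ITEM_A"]
test_items = ["ITEM_A", "ITEM_B", "ITEM_D"]

train_set, test_set = set(train_items), set(test_items)

# For each observation, is its item also present in the other data set?
train_in_test = [item in test_set for item in train_items]
test_in_train = [item in train_set for item in test_items]

print("train:", train_in_test.count(True), train_in_test.count(False))
print("test: ", test_in_train.count(True), test_in_train.count(False))
```

Only a False in the test-side check would be a problem for prediction, because the model would then meet an item it has never seen.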

## Training data set:
## Item_Identifier              0
## Item_Weight                  0
## Item_Fat_Content             0
## Item_Visibility              0
## Item_Type                    0
## Item_MRP                     0
## Outlet_Identifier            0
## Outlet_Establishment_Year    0
## Outlet_Size                  0
## Outlet_Location_Type         0
## Outlet_Type                  0
## Item_Outlet_Sales            0
## dtype: int64
## Test data set:
## Item_Identifier              0
## Item_Weight                  0
## Item_Fat_Content             0
## Item_Visibility              0
## Item_Type                    0
## Item_MRP                     0
## Outlet_Identifier            0
## Outlet_Establishment_Year    0
## Outlet_Size                  0
## Outlet_Location_Type         0
## Outlet_Type                  0
## dtype: int64
## Test data set:
## True    5650
## Name: Item_Identifier, dtype: int64

The missing value treatment, and the treatment for product mismatch between training and test data sets look okay.

Feature engineering

Skewness treatment

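  • As seen in the univariate analysis, the Item Outlet Sales distribution was right-skewed, so it is sensible to log-transform this variable. Item MRP was also log-transformed (log_Item_MRP in the model output below), which gives the log-log specification, although its distribution was described earlier as multi-modal rather than right-skewed.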

Derive new variables for years of operation (i.e., outlet age), item category for training and test data sets

Clean item fat-content variable

## Training data set:
## Food              6121
## Non-consumable    1599
## Consumable         799
## Name: Item_Category, dtype: int64
## Test data set:
## Food              4045
## Non-consumable    1087
## Consumable         518
## Name: Item_Category, dtype: int64
## Training data set:
## Low Fat       3917
## Regular       3003
## Non-edible    1599
## Name: Item_Fat_Content, dtype: int64
## Test data set:
## Low Fat       2573
## Regular       1990
## Non-edible    1087
## Name: Item_Fat_Content, dtype: int64
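
The counts above are consistent with every Non-consumable item being relabelled Non-edible (1599 in both training tables). The report does not state how Item Category was derived. The sketch below assumes it comes from the two-letter prefix of Item Identifier, with the prefix-to-category mapping being a guess; the identifiers other than DRA12 are hypothetical.

```python
# Assumed mapping from identifier prefix to category (not stated in the report).
PREFIX_TO_CATEGORY = {"FD": "Food", "DR": "Consumable", "NC": "Non-consumable"}

# Canonical fat-content labels, keyed by the lower-cased raw spelling.
FAT_LABELS = {"lf": "Low Fat", "low fat": "Low Fat", "reg": "Regular", "regular": "Regular"}

def derive(item_identifier, raw_fat_content):
    """Return (Item_Category, cleaned Item_Fat_Content) for one observation."""
    category = PREFIX_TO_CATEGORY[item_identifier[:2]]
    if category == "Non-consumable":
        fat = "Non-edible"  # fat content is not meaningful for these items
    else:
        fat = FAT_LABELS[raw_fat_content.strip().lower()]
    return category, fat

print(derive("DRA12", "LF"))
print(derive("NCX00", "Low Fat"))
print(derive("FDX00", "reg"))
```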

Inspect feature engineering effects

  • Following log-transformation, the histogram above shows that Item Outlet Sales is reasonably close to a Normal distribution.

  • The heatmap above shows the correlation between the log-transformed Item Outlet Sales and the existing variables: log-transformed Item MRP is positively correlated with it, and Item Visibility is negatively correlated.

  • In summary, there are no major concerns (i.e., no extensive multicollinearity) with the feature engineering and transformation.

  • Finally, the training and test data sets are ready for model selection, evaluation and prediction.

Model selection, evaluation and prediction

  • The training data set was split into two data sets with 60:40 ratio. The former will be used for model selection and latter will be used for model evaluation purposes.
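
A random 60:40 split can be sketched as below. The seed, the `split_rows` helper and the rounding rule are assumptions; rounding up is chosen here only because it reproduces the 5112 observations implied by the regression output later in this section (5098 residual degrees of freedom plus 14 estimated coefficients).

```python
import math
import random

def split_rows(rows, train_fraction=0.6, seed=42):
    """Shuffle a copy of the rows and cut it into two parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = math.ceil(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 8519 cleaned training observations.
rows = list(range(8519))
selection, evaluation = split_rows(rows)

print(len(selection), len(evaluation))
```

Fixing the seed makes the split reproducible, so that both models are selected and evaluated on the same observations.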

Model: Linear regression analysis

Model selection
##                                                                     Best Subsets Regression                                                                     
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
## Model Index    Predictors
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
##      1         Outlet_Identifier                                                                                                                                 
##      2         Outlet_Identifier log_Item_MRP                                                                                                                    
##      3         Item_Fat_Content Outlet_Identifier log_Item_MRP                                                                                                   
##      4         Item_Weight Item_Fat_Content Outlet_Identifier log_Item_MRP                                                                                       
##      5         Item_Weight Item_Fat_Content Item_Visibility Outlet_Identifier log_Item_MRP                                                                       
##      6         Item_Weight Item_Fat_Content Item_Visibility Outlet_Identifier log_Item_MRP Item_Category                                                         
##      7         Item_Weight Item_Fat_Content Item_Visibility Outlet_Identifier Outlet_Size log_Item_MRP Item_Category                                             
##      8         Item_Weight Item_Fat_Content Item_Visibility Outlet_Identifier Outlet_Size Outlet_Location_Type log_Item_MRP Item_Category                        
##      9         Item_Weight Item_Fat_Content Item_Visibility Outlet_Identifier Outlet_Size Outlet_Location_Type Outlet_Type log_Item_MRP Item_Category            
##     10         Item_Weight Item_Fat_Content Item_Visibility Outlet_Identifier Outlet_Size Outlet_Location_Type Outlet_Type log_Item_MRP Item_Category Outlet_Age 
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
## 
##                                                      Subsets Regression Summary                                                     
## ------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                              
## Model    R-Square    R-Square    R-Square      C(p)          AIC        SBIC       SBC          MSEP        FPE       HSP      APC  
## ------------------------------------------------------------------------------------------------------------------------------------
##   1        0.4635      0.4625      0.4614    5850.8746    11585.3163      NA    11657.2491    2875.1348    0.5635    1e-04    0.5370 
##   2        0.7503      0.7498      0.7492      -6.0403     7677.0608      NA     7755.5330    1338.2708    0.2624    1e-04    0.2500 
##   3        0.7505      0.7499      0.7492      -7.7255     7677.3656      NA     7768.9164    1337.5656    0.2623    1e-04    0.2499 
##   4        0.7505      0.7499      0.7491      -5.8305     7679.2603      NA     7777.3504    1337.8000    0.2624    1e-04    0.2500 
##   5        0.7505      0.7498       0.749      -3.9272     7681.1633      NA     7785.7928    1338.0367    0.2625    1e-04    0.2501 
##   6        0.7505      0.7498      0.7489      -2.0000     7685.0902      NA     7802.7985    1338.2797    0.2626    1e-04    0.2502 
##   7        0.7505      0.7498      0.7489      -2.0000     7689.0902      NA     7819.8772    1338.2797    0.2626    1e-04    0.2502 
##   8        0.7505      0.7498      0.7489      -2.0000     7693.0902      NA     7836.9559    1338.2797    0.2626    1e-04    0.2502 
##   9        0.7505      0.7498      0.7489      -2.0000     7699.0902      NA     7862.5739    1338.2797    0.2626    1e-04    0.2502 
##  10        0.7505      0.7498      0.7489      -2.0000     7701.0902      NA     7871.1132    1338.2797    0.2626    1e-04    0.2502 
## ------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

Using the number of regressors, AIC (Akaike Information Criterion), MSEP (estimated error of prediction) and R-squared as goodness-of-fit statistics, two regression models appear sensible for further evaluation on the held-out evaluation split.

In particular, models 3 and 4 appear to be sensible candidates for evaluating the prediction of outlet sales.
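
As a check on the summary table, adjusted R-squared can be recomputed from R-squared, the sample size and the number of estimated coefficients. The sample size of 5112 is inferred from the regression output later in this section (5098 residual degrees of freedom plus 14 estimated coefficients), and the inputs are the rounded values from the table, so the result is approximate.

```python
def adjusted_r_squared(r_squared, n_obs, n_coefficients):
    """Adjusted R-squared; n_coefficients excludes the intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_coefficients - 1)

N_OBS = 5112  # inferred from the regression output, not stated directly

# Model 3 estimates 12 coefficients and model 4 estimates 13 (dummy variables included).
adj_model_3 = adjusted_r_squared(0.7505, N_OBS, 12)
adj_model_4 = adjusted_r_squared(0.7505, N_OBS, 13)

print(round(adj_model_3, 4), round(adj_model_4, 4))
```

With the same R-squared, the extra regressor in model 4 is penalised, which is why the smaller model is preferred when the fit is otherwise equal.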

Model evaluation
## 
## Call:
## lm(formula = log_Item_Outlet_Sales ~ Item_Weight + Item_Fat_Content + 
##     Outlet_Identifier + log_Item_MRP, data = trainSet1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.04997 -0.27229  0.05395  0.36987  1.40557 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.4897603  0.0743027   6.591  4.8e-11 ***
## Item_Weight                -0.0004973  0.0015345  -0.324   0.7459    
## Item_Fat_ContentNon-edible  0.0146930  0.0196949   0.746   0.4557    
## Item_Fat_ContentRegular     0.0307994  0.0160165   1.923   0.0545 .  
## Outlet_IdentifierOUT013     1.9475487  0.0361812  53.828  < 2e-16 ***
## Outlet_IdentifierOUT017     2.0293153  0.0356984  56.846  < 2e-16 ***
## Outlet_IdentifierOUT018     1.8157205  0.0361653  50.206  < 2e-16 ***
## Outlet_IdentifierOUT019     0.0397723  0.0404119   0.984   0.3251    
## Outlet_IdentifierOUT027     2.5011464  0.0357753  69.913  < 2e-16 ***
## Outlet_IdentifierOUT035     2.0267209  0.0358661  56.508  < 2e-16 ***
## Outlet_IdentifierOUT045     1.9417941  0.0363195  53.464  < 2e-16 ***
## Outlet_IdentifierOUT046     2.0167335  0.0360832  55.891  < 2e-16 ***
## Outlet_IdentifierOUT049     2.0391918  0.0358143  56.938  < 2e-16 ***
## log_Item_MRP                1.0402138  0.0135991  76.492  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.512 on 5098 degrees of freedom
## Multiple R-squared:  0.7505, Adjusted R-squared:  0.7499 
## F-statistic:  1180 on 13 and 5098 DF,  p-value: < 2.2e-16
## [1] 1100.81

The log-log linear regression model with four regressors, namely, Item Weight, Item Fat Content, Outlet ID, and log-Item MRP, was fitted to the model-selection split of the training data set (trainSet1). From the model summary output above, log-Item MRP and Outlet Identifier (all levels except OUT019) were statistically significant, suggesting that retail outlet sales are associated with item price and outlet. The Item Weight regressor was not statistically significant (\(P = 0.75\)), and Item Fat Content (Regular) was only marginally significant (\(P = 0.055\)). The root mean-squared error of prediction on the held-out evaluation split was 1100.81 INR; the test data set has no outcome variable, so no error can be computed on it.

Therefore, it is sensible to try fitting the regression model with three regressors and assess the significance and prediction performance.

## 
## Call:
## lm(formula = log_Item_Outlet_Sales ~ Item_Fat_Content + Outlet_Identifier + 
##     log_Item_MRP, data = trainSet1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.04641 -0.27217  0.05384  0.36929  1.40562 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.48373    0.07193   6.725 1.94e-11 ***
## Item_Fat_ContentNon-edible  0.01437    0.01967   0.731   0.4650    
## Item_Fat_ContentRegular     0.03074    0.01601   1.920   0.0549 .  
## Outlet_IdentifierOUT013     1.94773    0.03617  53.844  < 2e-16 ***
## Outlet_IdentifierOUT017     2.02940    0.03569  56.855  < 2e-16 ***
## Outlet_IdentifierOUT018     1.81577    0.03616  50.213  < 2e-16 ***
## Outlet_IdentifierOUT019     0.03983    0.04041   0.986   0.3243    
## Outlet_IdentifierOUT027     2.50129    0.03577  69.928  < 2e-16 ***
## Outlet_IdentifierOUT035     2.02693    0.03586  56.528  < 2e-16 ***
## Outlet_IdentifierOUT045     1.94196    0.03631  53.479  < 2e-16 ***
## Outlet_IdentifierOUT046     2.01688    0.03608  55.905  < 2e-16 ***
## Outlet_IdentifierOUT049     2.03933    0.03581  56.951  < 2e-16 ***
## log_Item_MRP                1.04013    0.01360  76.505  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.512 on 5099 degrees of freedom
## Multiple R-squared:  0.7505, Adjusted R-squared:  0.7499 
## F-statistic:  1278 on 12 and 5099 DF,  p-value: < 2.2e-16
## [1] 1100.669

The log-log linear regression model with three regressors, namely, Item Fat Content, Outlet ID, and log-Item MRP, was fitted to the model-selection split of the training data set (trainSet1). From the model summary output above, log-Item MRP and Outlet Identifier (all levels except OUT019) were statistically significant, suggesting that retail outlet sales are associated with item price and outlet. In particular, the expected sales at outlet 27 appear to be higher than at other outlets. Item Fat Content (Regular) was only marginally significant (\(P = 0.055\)). The root mean-squared error of prediction on the held-out evaluation split was 1100.67 INR.
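
Because the model predicts log sales while the error is reported in INR, the predictions have to be back-transformed before the error is computed. The sketch below assumes a plain exponential back-transformation; the report does not state which back-transformation was used, and the values are made up.

```python
import math

def rmsep(actual, predicted_log):
    """Root mean-squared error on the original scale, from log-scale predictions."""
    squared_errors = [(a - math.exp(p)) ** 2 for a, p in zip(actual, predicted_log)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up sales (INR) and log-scale predictions, for illustration only.
actual_sales = [1000.0, 2500.0, 400.0]
predicted_log_sales = [math.log(1100.0), math.log(2300.0), math.log(450.0)]

print(round(rmsep(actual_sales, predicted_log_sales), 2))
```

Computing the error on the original scale is what makes the figure comparable with the Random Forest result, whose outcome is also on the log scale.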

Model equation:

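\(\log(\text{Item Outlet Sales}_i) = \beta_0 + \beta_{X} X_i + \beta_{Z} Z_i + \beta_{3} \log(\text{Item MRP}_i) + \epsilon_i\)

where \(\epsilon_i \sim N(0, \sigma^2)\), \(X_i\) is the set of two dummy variables for Item Fat Content (three levels, with Low Fat as the reference level), and \(Z_i\) is the set of nine dummy variables for Outlet Identifier (ten levels, with OUT010 as the reference level).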
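
The fitted coefficients can be read on the original scale by exponentiating. The sketch below uses the estimates from the three-regressor model summary above; the `predict_sales` helper and the example MRP of 141 INR (close to the mean Item MRP) are illustrative assumptions.

```python
import math

# Estimates copied from the three-regressor model summary above.
INTERCEPT = 0.48373
BETA_OUT027 = 2.50129
BETA_LOG_MRP = 1.04013

# Multiplicative effect of outlet 27 relative to the reference outlet (OUT010).
outlet_ratio = math.exp(BETA_OUT027)

# Multiplicative effect on sales of a 10% higher Item MRP.
mrp_effect = 1.10 ** BETA_LOG_MRP

def predict_sales(item_mrp, outlet_effect=0.0, fat_effect=0.0):
    """Back-transformed prediction for one item; the defaults are the reference levels."""
    log_sales = INTERCEPT + fat_effect + outlet_effect + BETA_LOG_MRP * math.log(item_mrp)
    return math.exp(log_sales)

print(round(outlet_ratio, 1), round(mrp_effect, 3))
print(round(predict_sales(141.0, outlet_effect=BETA_OUT027)))
```

Under the Normal error assumption, exponentiating a log-scale prediction estimates the median rather than the mean of sales, so these figures are indicative only.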

Model: Other regression techniques

Regularised regression techniques such as Lasso and Ridge were not fitted to this data set, as there was no extensive multicollinearity in the data.

Model: Random Forest model

Model selection
## 
## Call:
##  randomForest(formula = log_Item_Outlet_Sales ~ ., data = trainSet1,      mtry = 3, ntree = 150, importance = TRUE, na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 150
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.2844208
##                     % Var explained: 72.86

The random forest model was fitted to the model-selection split of the training data set (trainSet1). Models were fitted with three to eight variables tried at each split (mtry). Models with four or more variables appeared to over-fit the training data, and their predictive power was no better than that of the model with three variables. The random forest model with three variables explained 72.86% of the variance (the out-of-bag pseudo R-squared reported as % Var explained).
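
The % Var explained figure is a pseudo R-squared, that is, one minus the ratio of the out-of-bag mean squared residual to the variance of the outcome. The sketch below shows the calculation and backs out the outcome variance implied by the two reported figures; the `pseudo_r_squared` helper is a hypothetical name, and the population variance is used as a simplification.

```python
from statistics import fmean, pvariance

def pseudo_r_squared(actual, oob_predicted):
    """1 - MSE / Var(y), with both terms on the same (log) scale."""
    mse = fmean([(a - p) ** 2 for a, p in zip(actual, oob_predicted)])
    return 1 - mse / pvariance(actual)

# Figures reported in the random forest output above.
reported_mse = 0.2844208
reported_r2 = 0.7286

# Variance of log sales implied by those two figures.
implied_variance = reported_mse / (1 - reported_r2)
print(round(implied_variance, 3))
```

Because the outcome is log sales, both the mean of squared residuals and this variance are on the log scale, not in INR.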

Random Forest variable importance

Variable               %IncMSE   IncNodePurity
Item_Weight            0.00      233.00
Item_Fat_Content       0.00      43.58
Item_Visibility        0.01      279.37
Outlet_Identifier      0.43      1161.11
Outlet_Size            0.08      72.70
Outlet_Location_Type   0.04      52.66
Outlet_Type            0.41      1107.55
log_Item_MRP           0.53      1603.76
Item_Category          0.00      36.87
Outlet_Age             0.06      116.33

From the output above, it appears that log-Item MRP, Outlet Identifier and Outlet Type are the most important predictors of item outlet sales.
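
The %IncMSE column is a permutation-based measure: a variable is important if shuffling its values makes the predictions worse. The sketch below illustrates the idea on a toy model; it is a simplification, since the randomForest package computes the measure on the out-of-bag samples of each tree, and all names and values here are hypothetical.

```python
import random

def mse(model, rows, target):
    """Mean squared error of the model's predictions."""
    return sum((y - model(r)) ** 2 for r, y in zip(rows, target)) / len(rows)

def permutation_importance(model, rows, target, feature, seed=0):
    """Increase in MSE after shuffling one feature's values across rows."""
    shuffled_values = [r[feature] for r in rows]
    random.Random(seed).shuffle(shuffled_values)
    permuted = [{**r, feature: v} for r, v in zip(rows, shuffled_values)]
    return mse(model, permuted, target) - mse(model, rows, target)

def toy_model(row):
    """Hypothetical fitted model that depends on log_Item_MRP only."""
    return 0.5 + 1.0 * row["log_Item_MRP"]

rows = [{"log_Item_MRP": 3.5 + 0.2 * i, "Item_Weight": 5.0 + i} for i in range(10)]
target = [toy_model(r) for r in rows]

for name in ("log_Item_MRP", "Item_Weight"):
    print(name, round(permutation_importance(toy_model, rows, target, name), 3))
```

A variable the model ignores, such as Item_Weight in this toy example, gets an importance of zero, which mirrors the near-zero %IncMSE values in the table above.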

From the plot above, it appears that bagging around 100 trees reduces the prediction error to approximately 25%.

Model evaluation
## [1] 1149.713

The root mean-squared error of prediction on the held-out evaluation split was 1149.71 INR.

References

  1. Big Mart Sales Prediction, Anonymous, 2016, Data set, https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/.

  2. summarytools, Dominic Comtois, 2020, R package, https://cran.r-project.org/web/packages/summarytools/index.html.

  3. ggplot2, Hadley Wickham, 2016, R package, https://ggplot2.tidyverse.org.

  4. gridExtra, Baptiste Auguie, Anton Antonov, 2017, R package, https://cran.r-project.org/web/packages/gridExtra/index.html.

  5. knitr, Several, 2020, R package, https://cran.r-project.org/web/packages/knitr/index.html.

  6. dplyr, Several, 2020, R package, https://cran.r-project.org/web/packages/dplyr/index.html.

  7. ggpubr, Alboukadel Kassambara, 2020, R package, https://cran.csiro.au/web/packages/ggpubr/ggpubr.pdf.

  8. ggcorrplot, Alboukadel Kassambara, 2019, R package, https://cran.r-project.org/web/packages/ggcorrplot/index.html.

  9. tidyverse, Hadley Wickham, 2019, R package, https://cran.r-project.org/web/packages/tidyverse/index.html.

  10. randomForest, Several, 2018, R package, https://cran.r-project.org/web/packages/randomForest/index.html.

  11. pandas, Several, 2020, Python module, https://pandas.pydata.org/about/citing.html.

  12. matplotlib, J.D. Hunter, 2007, Python module, https://matplotlib.org.

  13. seaborn, Several, 2017, Python module, https://seaborn.pydata.org.

  14. prettydoc, Several, 2020, R package, https://cran.r-project.org/web/packages/prettydoc/vignettes/cayman.html.

Keywords

  • retail sales, multivariate, machine learning, supervised learning, prediction, regression, random forest.