Two cross-sectional multivariate data sets with eleven (11) predictor variables were used for this analysis; the training data set additionally contains one (1) outcome variable. The currency used in these data sets is assumed to be INR.
The training data set had 8523 sample observations with eleven (11) predictor variables, and one (1) outcome variable, Item Outlet Sales.
The test data set had 5681 sample observations with eleven (11) predictor variables.
Supervised machine learning.
Two machine learning techniques, namely multi-variate log-log linear regression and random forest, were used to analyse and predict the retail outlet sales.
The training data set was randomly split into two subsets in a 60:40 ratio for model selection and evaluation, respectively.
The outlet sales were predicted using the test data set.
From a model interpretability and prediction performance standpoint, the multi-variate log-log linear regression model performed somewhat better than the random forest regression model (root mean-squared error of prediction of 1100.67 INR versus 1149.71 INR).
Additionally, both models could be trained further with additional variables such as season, day and time to assess the models' performance.
The hypotheses in this section were generated from the objective of this paper and from prior knowledge of the subject matter. They were defined before analysing the data, to guide the data analysis in the forthcoming sections.
Based on prior knowledge, a variety of factors are known to affect the value and volume of product sales. Some of the key determinants of sales are the five Ps.
In summary, the initial assumption is that season and the five Ps outlined above could affect product sales.
| No | Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|---|
| 1 | Item_Identifier [factor] | 1. DRA12 2. DRA24 3. DRA59 4. DRB01 5. DRB13 6. DRB24 7. DRB25 8. DRB48 9. DRC01 10. DRC12 [ 1549 others ] | | 0 (0%) |
| 2 | Item_Weight [numeric] | Mean (sd) : 12.9 (4.6) min < med < max: 4.6 < 12.6 < 21.4 IQR (CV) : 8.1 (0.4) | 415 distinct values | 1463 (17.17%) |
| 3 | Item_Fat_Content [factor] | 1. LF 2. low fat 3. Low Fat 4. reg 5. Regular | | 0 (0%) |
| 4 | Item_Visibility [numeric] | Mean (sd) : 0.1 (0.1) min < med < max: 0 < 0.1 < 0.3 IQR (CV) : 0.1 (0.8) | 7880 distinct values | 0 (0%) |
| 5 | Item_Type [factor] | 1. Baking Goods 2. Breads 3. Breakfast 4. Canned 5. Dairy 6. Frozen Foods 7. Fruits and Vegetables 8. Hard Drinks 9. Health and Hygiene 10. Household [ 6 others ] | | 0 (0%) |
| 6 | Item_MRP [numeric] | Mean (sd) : 141 (62.3) min < med < max: 31.3 < 143 < 266.9 IQR (CV) : 91.8 (0.4) | 5938 distinct values | 0 (0%) |
| 7 | Outlet_Identifier [factor] | 1. OUT010 2. OUT013 3. OUT017 4. OUT018 5. OUT019 6. OUT027 7. OUT035 8. OUT045 9. OUT046 10. OUT049 | | 0 (0%) |
| 8 | Outlet_Establishment_Year [integer] | Mean (sd) : 1997.8 (8.4) min < med < max: 1985 < 1999 < 2009 IQR (CV) : 17 (0) | | 0 (0%) |
| 9 | Outlet_Size [factor] | 1. High 2. Medium 3. Small | | 2410 (28.28%) |
| 10 | Outlet_Location_Type [factor] | 1. Tier 1 2. Tier 2 3. Tier 3 | | 0 (0%) |
| 11 | Outlet_Type [factor] | 1. Grocery Store 2. Supermarket Type1 3. Supermarket Type2 4. Supermarket Type3 | | 0 (0%) |
| 12 | Item_Outlet_Sales [numeric] | Mean (sd) : 2181.3 (1706.5) min < med < max: 33.3 < 1794.3 < 13087 IQR (CV) : 2267 (0.8) | 3493 distinct values | 0 (0%) |
The summary output above is for the training data set.
About the data set:
There are 8523 sample observations and 12 variables in the data set.
There are seven (7) categorical variables.
There are five (5) numeric variables.
The outcome variable is Item Outlet Sales.
Missing data:
Class distribution:
Outcome variable:
| No | Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|---|
| 1 | Item_Identifier [factor] | 1. DRA12 2. DRA24 3. DRA59 4. DRB01 [ 1539 others ] | | 0 (0%) |
| 2 | Item_Weight [numeric] | Mean (sd) : 12.7 (4.7) min < med < max: 4.6 < 12.5 < 21.4 IQR (CV) : 8.1 (0.4) | 410 distinct values | 976 (17.18%) |
| 3 | Item_Fat_Content [factor] | 1. LF 2. low fat 3. Low Fat 4. reg 5. Regular | | 0 (0%) |
| 4 | Item_Visibility [numeric] | Mean (sd) : 0.1 (0.1) min < med < max: 0 < 0.1 < 0.3 IQR (CV) : 0.1 (0.8) | 5277 distinct values | 0 (0%) |
| 5 | Item_Type [factor] | 1. Baking Goods 2. Breads 3. Breakfast 4. Canned [ 12 others ] | | 0 (0%) |
| 6 | Item_MRP [numeric] | Mean (sd) : 141 (61.8) min < med < max: 32 < 141.4 < 266.6 IQR (CV) : 91.6 (0.4) | 4402 distinct values | 0 (0%) |
| 7 | Outlet_Identifier [factor] | 1. OUT010 2. OUT013 3. OUT017 4. OUT018 [ 6 others ] | | 0 (0%) |
| 8 | Outlet_Establishment_Year [integer] | Mean (sd) : 1997.8 (8.4) min < med < max: 1985 < 1999 < 2009 IQR (CV) : 17 (0) | 9 distinct values | 0 (0%) |
| 9 | Outlet_Size [factor] | 1. High 2. Medium 3. Small | | 1606 (28.27%) |
| 10 | Outlet_Location_Type [factor] | 1. Tier 1 2. Tier 2 3. Tier 3 | | 0 (0%) |
| 11 | Outlet_Type [factor] | 1. Grocery Store 2. Supermarket Type1 3. Supermarket Type2 4. Supermarket Type3 | | 0 (0%) |
The summary output above is for the test data set.
About the data set:
There are 5681 sample observations and 11 variables in the data set.
There are seven (7) categorical variables.
There are four (4) numeric variables.
Missing data:
Class distribution:
Outcome variable:
| Statistic | Item Weight | Item Visibility | Item MRP | Item-Outlet Sales |
|---|---|---|---|---|
| Min. | 4.555 | 0.00000 | 31.29 | 33.29 |
| 1st Qu. | 8.774 | 0.02699 | 93.83 | 834.25 |
| Median | 12.600 | 0.05393 | 143.01 | 1794.33 |
| Mean | 12.858 | 0.06613 | 140.99 | 2181.29 |
| 3rd Qu. | 16.850 | 0.09459 | 185.64 | 3101.30 |
| Max. | 21.350 | 0.32839 | 266.89 | 13086.97 |
| NA's | 1463 | 0 | 0 | 0 |
Distribution and skewness:
Item Weight: This variable shows a multi-modal distribution.
Item Visibility: This variable shows a right-skewed distribution, suggesting that only a few items have high visibility.
Item MRP: This variable shows a multi-modal distribution.
Item-Outlet Sales: This variable shows a right-skewed distribution, suggesting that only a few item-and-outlet combinations lead to high sales values.
Therefore, the skewness of item visibility and item-outlet sales must be treated using an appropriate transformation.
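As a minimal sketch of why a transformation helps, the snippet below computes a moment-based skewness before and after a natural-log transform. The sales values are hypothetical and chosen only to mimic a right-skewed shape; the `skewness` helper is an illustrative function, not part of the original analysis. A plain log is undefined at zero, so a variable with zero values (item visibility has a minimum of 0) would need a different transform.

```python
import math

def skewness(values):
    """Moment-based (population) skewness: the mean of cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

# Hypothetical right-skewed sales values (INR), for illustration only.
sales = [120.0, 250.0, 400.0, 800.0, 1500.0, 3000.0, 13000.0]

# The natural log compresses the long right tail.
log_sales = [math.log(v) for v in sales]

skew_raw = skewness(sales)
skew_log = skewness(log_sales)
print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

The skewness stays positive after the transform in this toy example, but it is much closer to zero.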
Outliers:
The box-plots above show the bivariate analysis between categorical variables and item-outlet sales from the training data set.
Outlet Identifier:
Outlet Size:
Outlet Location Type:
Outlet Type:
In summary, the median item-outlet sales appear to vary across outlets and outlet types. In particular, the median item-outlet sales were higher in outlet OUT027 and in supermarket type 3 than in the others.
The box-plot above shows the bivariate analysis between outlet establishment year/age and item-outlet sales from the training data set.
Outlet Age:
Item MRP and item-outlet sales have a positive linear relationship, with some non-constant scatter (variation) in the data.
Item weight and sales have a mostly non-linear relationship, although consumables show some positive linear relationship.
Item visibility and sales show some positive linear relationship, but the plots show heavy non-constant variation in the data.
There are outliers in the consumable and non-consumable item types across all three plots.
The heatmap above shows the correlation between numeric variables in the training data set.
The variables item MRP and item-outlet sales are positively correlated (0.62).
The variables item visibility and item-outlet sales are weakly negatively correlated (-0.09).
In summary, there is no extensive multicollinearity in the data.
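The correlations reported above are presumably Pearson coefficients (an assumption, since the heatmap code is not shown). A minimal sketch of that calculation on hypothetical MRP and sales values follows; the `pearson` helper and the data are illustrative only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical values, for illustration only (not taken from the data set).
item_mrp = [50.0, 90.0, 140.0, 180.0, 250.0]
item_sales = [600.0, 1500.0, 1400.0, 3200.0, 3900.0]

r = pearson(item_mrp, item_sales)
print(f"r = {r:.2f}")
```

Applying the same function to every pair of numeric predictors gives the matrix behind a correlation heatmap; pairwise coefficients near zero between predictors are what supports the "no extensive multicollinearity" reading.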
## Training data set:
## Item_Identifier 0
## Item_Weight 1463
## Item_Fat_Content 0
## Item_Visibility 0
## Item_Type 0
## Item_MRP 0
## Outlet_Identifier 0
## Outlet_Establishment_Year 0
## Outlet_Size 2410
## Outlet_Location_Type 0
## Outlet_Type 0
## Item_Outlet_Sales 0
## dtype: int64
## Test data set:
## Item_Identifier 0
## Item_Weight 976
## Item_Fat_Content 0
## Item_Visibility 0
## Item_Type 0
## Item_MRP 0
## Outlet_Identifier 0
## Outlet_Establishment_Year 0
## Outlet_Size 1606
## Outlet_Location_Type 0
## Outlet_Type 0
## dtype: int64
The item weight and outlet size variables have missing values in both the training and test data sets.
Item weight is a continuous numeric variable, and outlet size is an ordinal categorical variable.
Therefore, it is sensible to impute a missing item weight with the mean weight of the same product (item), and to impute a missing outlet size with the mode of outlet size for the corresponding outlet type.
## Mode for each outlet type:
## Outlet_Type Grocery Store Supermarket Type1 Supermarket Type2 Supermarket Type3
## Outlet_Size Small Small Medium Medium
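The imputation rule described above can be sketched as follows, assuming each record is a plain dictionary with `None` marking a missing value. The records, the item identifiers (`A1`, `B2`, `C3`) and the `impute` helper are hypothetical and for illustration only; the original analysis used data-frame operations that are not shown in this report.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical records for illustration; None marks a missing value.
rows = [
    {"Item_Identifier": "A1", "Item_Weight": 9.0, "Outlet_Type": "Grocery Store", "Outlet_Size": "Small"},
    {"Item_Identifier": "A1", "Item_Weight": None, "Outlet_Type": "Grocery Store", "Outlet_Size": None},
    {"Item_Identifier": "B2", "Item_Weight": 12.0, "Outlet_Type": "Supermarket Type2", "Outlet_Size": "Medium"},
    {"Item_Identifier": "B2", "Item_Weight": 14.0, "Outlet_Type": "Supermarket Type2", "Outlet_Size": None},
    {"Item_Identifier": "C3", "Item_Weight": None, "Outlet_Type": "Supermarket Type2", "Outlet_Size": "Medium"},
]

def impute(records):
    """Fill Item_Weight with the per-item mean and Outlet_Size with the per-outlet-type mode."""
    weights = defaultdict(list)     # observed weights per item
    sizes = defaultdict(Counter)    # observed size counts per outlet type
    for rec in records:
        if rec["Item_Weight"] is not None:
            weights[rec["Item_Identifier"]].append(rec["Item_Weight"])
        if rec["Outlet_Size"] is not None:
            sizes[rec["Outlet_Type"]][rec["Outlet_Size"]] += 1

    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the input records stay untouched
        known = weights.get(rec["Item_Identifier"])
        if rec["Item_Weight"] is None and known:
            rec["Item_Weight"] = mean(known)
        counts = sizes.get(rec["Outlet_Type"])
        if rec["Outlet_Size"] is None and counts:
            rec["Outlet_Size"] = counts.most_common(1)[0][0]
        filled.append(rec)
    return filled

imputed = impute(rows)
```

In this sketch, item `C3` has no observed weight anywhere, so its weight stays missing after imputation. Products with no other observations cannot be imputed by a per-item mean.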
Review the missing value treatment
## Training data set:
## Item_Identifier 0
## Item_Weight 4
## Item_Fat_Content 0
## Item_Visibility 0
## Item_Type 0
## Item_MRP 0
## Outlet_Identifier 0
## Outlet_Establishment_Year 0
## Outlet_Size 0
## Outlet_Location_Type 0
## Outlet_Type 0
## Item_Outlet_Sales 0
## dtype: int64
## Test data set:
## Item_Identifier 0
## Item_Weight 1
## Item_Fat_Content 0
## Item_Visibility 0
## Item_Type 0
## Item_MRP 0
## Outlet_Identifier 0
## Outlet_Establishment_Year 0
## Outlet_Size 0
## Outlet_Location_Type 0
## Outlet_Type 0
## dtype: int64
After imputation, four (4) missing values remain in the item weight variable of the training data set, and one (1) remains in the test data set. Further diagnosis is needed to understand the issue.
## Training data set:
## 927 FDN52
## 1922 FDK57
## 4187 FDE52
## 5022 FDQ60
## Name: Item_Identifier, dtype: object
## Training data set:
## True 8383
## False 140
## Name: Item_Identifier, dtype: int64
## Test data set:
## True 5681
## Name: Item_Identifier, dtype: int64
Four (4) products have a missing item weight, and these four (4) products have no other observations in the training data set from which a mean weight could be taken. Therefore, these four (4) products and their observations must be excluded from both the training and test data sets.
Next, 140 observations in the training data set belong to products that are not present in the test data set. No further action is required, because every product in the test data set also appears in the training data set.
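The membership check and the exclusion step can be sketched as below. The product identifiers (`P1` to `P4`) and both helper functions are hypothetical; the tallies mirror the `True`/`False` counts printed above, which count observations rather than distinct products.

```python
from collections import Counter

def membership_counts(left_ids, right_ids):
    """Count how many entries of left_ids do (True) or do not (False) occur in right_ids."""
    right = set(right_ids)
    return Counter(item in right for item in left_ids)

def drop_products(records, excluded_ids):
    """Return the records whose Item_Identifier is not in excluded_ids."""
    excluded = set(excluded_ids)
    return [rec for rec in records if rec["Item_Identifier"] not in excluded]

# Hypothetical identifiers, for illustration only.
train = [{"Item_Identifier": i} for i in ["P1", "P1", "P2", "P3", "P4"]]
test = [{"Item_Identifier": i} for i in ["P1", "P2", "P4", "P4"]]

train_ids = [rec["Item_Identifier"] for rec in train]
test_ids = [rec["Item_Identifier"] for rec in test]

# P3 appears only in the training data, so one training row is counted as False.
counts = membership_counts(train_ids, test_ids)

# Suppose P4 is a product whose weight could not be imputed: drop it from both sets.
train_clean = drop_products(train, {"P4"})
test_clean = drop_products(test, {"P4"})
```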
## Training data set:
## Item_Identifier 0
## Item_Weight 0
## Item_Fat_Content 0
## Item_Visibility 0
## Item_Type 0
## Item_MRP 0
## Outlet_Identifier 0
## Outlet_Establishment_Year 0
## Outlet_Size 0
## Outlet_Location_Type 0
## Outlet_Type 0
## Item_Outlet_Sales 0
## dtype: int64
## Test data set:
## Item_Identifier 0
## Item_Weight 0
## Item_Fat_Content 0
## Item_Visibility 0
## Item_Type 0
## Item_MRP 0
## Outlet_Identifier 0
## Outlet_Establishment_Year 0
## Outlet_Size 0
## Outlet_Location_Type 0
## Outlet_Type 0
## dtype: int64
## Test data set:
## True 5650
## Name: Item_Identifier, dtype: int64
The missing value treatment and the treatment of the product mismatch between the training and test data sets look satisfactory.
## Training data set:
## Food 6121
## Non-consumable 1599
## Consumable 799
## Name: Item_Category, dtype: int64
## Test data set:
## Food 4045
## Non-consumable 1087
## Consumable 518
## Name: Item_Category, dtype: int64
## Training data set:
## Low Fat 3917
## Regular 3003
## Non-edible 1599
## Name: Item_Fat_Content, dtype: int64
## Test data set:
## Low Fat 2573
## Regular 1990
## Non-edible 1087
## Name: Item_Fat_Content, dtype: int64
The heatmap above shows the correlation between the log-transformed outcome variable, Item Outlet Sales, and the predictors log-transformed Item MRP (positively correlated) and Item Visibility (negatively correlated).
In summary, there are no major concerns (i.e., no extensive multicollinearity) with the feature engineering and transformation.
Finally, the training and test data sets are ready for model selection, evaluation and prediction.
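The random 60:40 split of the training data into a model-selection subset and an evaluation subset can be sketched as follows. The `split_rows` helper, the seed and the rounding rule are assumptions for illustration; the splitting routine actually used is not shown in this report.

```python
import random

def split_rows(rows, train_fraction=0.6, seed=42):
    """Shuffle a copy of rows and split it into (selection, evaluation) subsets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical row indices standing in for the training observations.
rows = list(range(100))
selection, evaluation = split_rows(rows)
print(len(selection), len(evaluation))
```

Fixing the seed makes the split repeatable, so that both models are fitted and evaluated on exactly the same subsets.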
## Best Subsets Regression
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
## Model Index Predictors
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
## 1 Outlet_Identifier
## 2 Outlet_Identifier log_Item_MRP
## 3 Item_Fat_Content Outlet_Identifier log_Item_MRP
## 4 Item_Weight Item_Fat_Content Outlet_Identifier log_Item_MRP
## 5 Item_Weight Item_Fat_Content Item_Visibility Outlet_Identifier log_Item_MRP
## 6 Item_Weight Item_Fat_Content Item_Visibility Outlet_Identifier log_Item_MRP Item_Category
## 7 Item_Weight Item_Fat_Content Item_Visibility Outlet_Identifier Outlet_Size log_Item_MRP Item_Category
## 8 Item_Weight Item_Fat_Content Item_Visibility Outlet_Identifier Outlet_Size Outlet_Location_Type log_Item_MRP Item_Category
## 9 Item_Weight Item_Fat_Content Item_Visibility Outlet_Identifier Outlet_Size Outlet_Location_Type Outlet_Type log_Item_MRP Item_Category
## 10 Item_Weight Item_Fat_Content Item_Visibility Outlet_Identifier Outlet_Size Outlet_Location_Type Outlet_Type log_Item_MRP Item_Category Outlet_Age
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
##
## Subsets Regression Summary
## ------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## ------------------------------------------------------------------------------------------------------------------------------------
## 1 0.4635 0.4625 0.4614 5850.8746 11585.3163 NA 11657.2491 2875.1348 0.5635 1e-04 0.5370
## 2 0.7503 0.7498 0.7492 -6.0403 7677.0608 NA 7755.5330 1338.2708 0.2624 1e-04 0.2500
## 3 0.7505 0.7499 0.7492 -7.7255 7677.3656 NA 7768.9164 1337.5656 0.2623 1e-04 0.2499
## 4 0.7505 0.7499 0.7491 -5.8305 7679.2603 NA 7777.3504 1337.8000 0.2624 1e-04 0.2500
## 5 0.7505 0.7498 0.749 -3.9272 7681.1633 NA 7785.7928 1338.0367 0.2625 1e-04 0.2501
## 6 0.7505 0.7498 0.7489 -2.0000 7685.0902 NA 7802.7985 1338.2797 0.2626 1e-04 0.2502
## 7 0.7505 0.7498 0.7489 -2.0000 7689.0902 NA 7819.8772 1338.2797 0.2626 1e-04 0.2502
## 8 0.7505 0.7498 0.7489 -2.0000 7693.0902 NA 7836.9559 1338.2797 0.2626 1e-04 0.2502
## 9 0.7505 0.7498 0.7489 -2.0000 7699.0902 NA 7862.5739 1338.2797 0.2626 1e-04 0.2502
## 10 0.7505 0.7498 0.7489 -2.0000 7701.0902 NA 7871.1132 1338.2797 0.2626 1e-04 0.2502
## ------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
Using the number of regressors, AIC (Akaike Information Criterion), MSEP (estimated error of prediction) and R-squared as selection criteria, two regression models appear sensible for further evaluation.
In particular, models 3 and 4 appear to be sensible candidates for evaluating the prediction of outlet sales.
##
## Call:
## lm(formula = log_Item_Outlet_Sales ~ Item_Weight + Item_Fat_Content +
## Outlet_Identifier + log_Item_MRP, data = trainSet1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.04997 -0.27229 0.05395 0.36987 1.40557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4897603 0.0743027 6.591 4.8e-11 ***
## Item_Weight -0.0004973 0.0015345 -0.324 0.7459
## Item_Fat_ContentNon-edible 0.0146930 0.0196949 0.746 0.4557
## Item_Fat_ContentRegular 0.0307994 0.0160165 1.923 0.0545 .
## Outlet_IdentifierOUT013 1.9475487 0.0361812 53.828 < 2e-16 ***
## Outlet_IdentifierOUT017 2.0293153 0.0356984 56.846 < 2e-16 ***
## Outlet_IdentifierOUT018 1.8157205 0.0361653 50.206 < 2e-16 ***
## Outlet_IdentifierOUT019 0.0397723 0.0404119 0.984 0.3251
## Outlet_IdentifierOUT027 2.5011464 0.0357753 69.913 < 2e-16 ***
## Outlet_IdentifierOUT035 2.0267209 0.0358661 56.508 < 2e-16 ***
## Outlet_IdentifierOUT045 1.9417941 0.0363195 53.464 < 2e-16 ***
## Outlet_IdentifierOUT046 2.0167335 0.0360832 55.891 < 2e-16 ***
## Outlet_IdentifierOUT049 2.0391918 0.0358143 56.938 < 2e-16 ***
## log_Item_MRP 1.0402138 0.0135991 76.492 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.512 on 5098 degrees of freedom
## Multiple R-squared: 0.7505, Adjusted R-squared: 0.7499
## F-statistic: 1180 on 13 and 5098 DF, p-value: < 2.2e-16
## [1] 1100.81
The log-log linear regression model with four regressors, namely Item Weight, Item Fat Content, Outlet Identifier and log-Item MRP, was fitted to the training subset (trainSet1). From the model summary output above, log-Item MRP and Outlet Identifier were statistically significant, suggesting that retail outlet sales were associated with item price and outlet. The Item Weight regressor was not statistically significant (\(p = 0.75\)), and Item Fat Content (Regular) was only marginally significant (\(p = 0.05\)). The root mean-squared error of prediction was 1100.81 INR.
Therefore, it is sensible to try fitting the regression model with three regressors and assess the significance and prediction performance.
##
## Call:
## lm(formula = log_Item_Outlet_Sales ~ Item_Fat_Content + Outlet_Identifier +
## log_Item_MRP, data = trainSet1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.04641 -0.27217 0.05384 0.36929 1.40562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.48373 0.07193 6.725 1.94e-11 ***
## Item_Fat_ContentNon-edible 0.01437 0.01967 0.731 0.4650
## Item_Fat_ContentRegular 0.03074 0.01601 1.920 0.0549 .
## Outlet_IdentifierOUT013 1.94773 0.03617 53.844 < 2e-16 ***
## Outlet_IdentifierOUT017 2.02940 0.03569 56.855 < 2e-16 ***
## Outlet_IdentifierOUT018 1.81577 0.03616 50.213 < 2e-16 ***
## Outlet_IdentifierOUT019 0.03983 0.04041 0.986 0.3243
## Outlet_IdentifierOUT027 2.50129 0.03577 69.928 < 2e-16 ***
## Outlet_IdentifierOUT035 2.02693 0.03586 56.528 < 2e-16 ***
## Outlet_IdentifierOUT045 1.94196 0.03631 53.479 < 2e-16 ***
## Outlet_IdentifierOUT046 2.01688 0.03608 55.905 < 2e-16 ***
## Outlet_IdentifierOUT049 2.03933 0.03581 56.951 < 2e-16 ***
## log_Item_MRP 1.04013 0.01360 76.505 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.512 on 5099 degrees of freedom
## Multiple R-squared: 0.7505, Adjusted R-squared: 0.7499
## F-statistic: 1278 on 12 and 5099 DF, p-value: < 2.2e-16
## [1] 1100.669
The log-log linear regression model with three regressors, namely Item Fat Content, Outlet Identifier and log-Item MRP, was fitted to the training subset (trainSet1). From the model summary output above, log-Item MRP and Outlet Identifier were statistically significant, suggesting that retail outlet sales are associated with item price and outlet. In particular, the expected sales at outlet OUT027 appear to be higher than at the other outlets. Item Fat Content (Regular) is only marginally significant (\(p = 0.05\)). The root mean-squared error of prediction for the test data set was 1100.67 INR.
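Because the model predicts log sales while the error is reported in INR, the predictions were presumably transformed back to the original scale before the error was computed; this is an assumption, as the evaluation code is not shown. A minimal sketch with hypothetical values and a hypothetical `rmse_original_scale` helper:

```python
import math

def rmse_original_scale(actual_sales, predicted_log_sales):
    """RMSE in the original units after back-transforming log-scale predictions."""
    predicted_sales = [math.exp(p) for p in predicted_log_sales]
    squared_errors = [(a - p) ** 2 for a, p in zip(actual_sales, predicted_sales)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical actual sales (INR) and log-scale predictions, for illustration only.
actual = [1000.0, 2000.0, 500.0]
predicted_log = [math.log(900.0), math.log(2300.0), math.log(500.0)]

rmse = rmse_original_scale(actual, predicted_log)
print(f"RMSE: {rmse:.2f} INR")
```

A plain `exp` back-transform estimates the conditional median rather than the mean of sales; a retransformation correction could be applied, but nothing in this report indicates that one was used.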
Model equation:
\(\log(\text{Item Outlet Sales}_i) = \beta_0 + \beta_{X} X_i + \beta_{Z} Z_i + \beta_{3} \log(\text{Item MRP}_i) + \epsilon_i\),
where \(\epsilon_i \sim N(0, \sigma^2)\), \(X\) is Item Fat Content (three levels, represented by two indicator variables), and \(Z\) is Outlet Identifier (ten levels, represented by nine indicator variables).
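To make the log-log form concrete, the sketch below converts two coefficients from the three-regressor summary output into multiplicative effects on sales. The coefficient values are copied from that output; reading OUT010 as the reference outlet is inferred from its absence among the listed coefficients.

```python
import math

# Coefficients copied from the three-regressor model summary.
beta_log_mrp = 1.04013   # log_Item_MRP
beta_out027 = 2.50129    # Outlet_IdentifierOUT027 (reference level assumed to be OUT010)

# In a log-log model the slope is an elasticity: a 10% higher MRP multiplies
# expected sales by 1.10 ** beta, holding the other regressors fixed.
mrp_effect = 1.10 ** beta_log_mrp - 1

# An indicator coefficient is a log ratio relative to the reference level.
outlet_ratio = math.exp(beta_out027)

print(f"10% higher MRP -> about {mrp_effect:.1%} higher sales")
print(f"OUT027 vs reference outlet -> about {outlet_ratio:.1f}x sales")
```

Under these assumptions, a 10% higher MRP corresponds to roughly 10% higher sales, and sales at OUT027 are roughly twelve times those of the reference outlet for the same item price and fat content.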
Regularised regression techniques such as Lasso and Ridge were not considered necessary for this data set, as there was no extensive multicollinearity in the data.
##
## Call:
## randomForest(formula = log_Item_Outlet_Sales ~ ., data = trainSet1, mtry = 3, ntree = 150, importance = TRUE, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 150
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.2844208
## % Var explained: 72.86
The random forest model was fitted to the training subset (trainSet1). Models were fitted with three to eight variables tried at each split (mtry). Models with four or more variables appeared to over-fit the training data, and their predictive power was no better than that of the model with three variables. The random forest model with three variables explained 72.86% of the variance in log item-outlet sales.
| Variable | %IncMSE | IncNodePurity |
|---|---|---|
| Item_Weight | 0.00 | 233.00 |
| Item_Fat_Content | 0.00 | 43.58 |
| Item_Visibility | 0.01 | 279.37 |
| Outlet_Identifier | 0.43 | 1161.11 |
| Outlet_Size | 0.08 | 72.70 |
| Outlet_Location_Type | 0.04 | 52.66 |
| Outlet_Type | 0.41 | 1107.55 |
| log_Item_MRP | 0.53 | 1603.76 |
| Item_Category | 0.00 | 36.87 |
| Outlet_Age | 0.06 | 116.33 |
From the output above, it appears that log-Item MRP, Outlet Identifier and Outlet Type are the most important predictors of item outlet sales in the random forest model.
From the plot above, it appears that bagging around 100 trees reduces the prediction error to approximately 25%.
## [1] 1149.713
The root mean-squared error of prediction of the random forest model for the test data set was 1149.71 INR.
Big Mart Sales Prediction, Anonymous, 2016, Data set, https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/.
summarytools, Dominic Comtois, 2020, R package, https://cran.r-project.org/web/packages/summarytools/index.html.
ggplot2, Hadley Wickham, 2016, R package, https://ggplot2.tidyverse.org.
gridExtra, Baptiste Auguie, Anton Antonov, 2017, R package, https://cran.r-project.org/web/packages/gridExtra/index.html.
knitr, Several, 2020, R package, https://cran.r-project.org/web/packages/knitr/index.html.
dplyr, Several, 2020, R package, https://cran.r-project.org/web/packages/dplyr/index.html.
ggpubr, Alboukadel Kassambara, 2020, R package, https://cran.csiro.au/web/packages/ggpubr/ggpubr.pdf.
ggcorrplot, Alboukadel Kassambara, 2019, R package, https://cran.r-project.org/web/packages/ggcorrplot/index.html.
tidyverse, Hadley Wickham, 2019, R package, https://cran.r-project.org/web/packages/tidyverse/index.html.
randomForest, Several, 2018, R package, https://cran.r-project.org/web/packages/randomForest/index.html.
pandas, Several, 2020, Python module, https://pandas.pydata.org/about/citing.html.
matplotlib, J.D. Hunter, 2007, Python module, https://matplotlib.org.
seaborn, Several, 2017, Python module, https://seaborn.pydata.org.
prettydoc, Several, 2020, R package, https://cran.r-project.org/web/packages/prettydoc/vignettes/cayman.html.