Contents

1.Data Preparation
1.1 Data Source
1.2 Load the Dataset & Required Packages
1.3 Data Preparation
1.4 Basic Summary of the Dataset
2.Exploratory Data Analysis
2.1 Item Sales Distribution
2.2 Follow-up analysis
3.Insights & Conclusion

1.Data Preparation

1.1 Data Source

  1. Open the contest page: https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/
  2. Download the “Train File” and the “Test File”
  3. Save both files to your working directory

1.2 Load the Dataset & Required Packages

  • This project uses three R packages: tidyverse, gridExtra and hexbin.

1.3 Data Preparation

  • Missing values have been imputed
  • Mislabeled observations have been corrected
  • A new variable has been created
  • For the variable Item_Fat_Content, “LF” and “low fat” have been changed to “Low Fat”, and “reg” has been changed to “Regular”
  • For the variable Outlet_Size, empty strings (“”) have been replaced with NA
  • For the variable Item_Weight, missing values (NA) have been replaced by the mean
  • A new variable, Log2_MRP (the base-2 logarithm of Item_MRP), has been created for better comparison of retail prices
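
The original preparation code is not shown. A minimal sketch of the steps listed above, using dplyr on a small made-up data frame (the `toy` data and the choice of `case_when()`, `na_if()` and `if_else()` are assumptions, not the author's code), might look like this:

```r
library(dplyr)

# Toy stand-in for the training data; the rows are made up for illustration.
toy <- data.frame(
  Item_Fat_Content = c("Low Fat", "LF", "low fat", "reg", "Regular"),
  Outlet_Size      = c("Medium", "", "High", "", "Small"),
  Item_Weight      = c(9.3, NA, 17.5, NA, 8.93),
  Item_MRP         = c(249.8, 48.3, 141.6, 182.1, 53.9),
  stringsAsFactors = FALSE
)

prepared <- toy %>%
  mutate(
    # Collapse the inconsistent fat-content labels into two levels.
    Item_Fat_Content = factor(case_when(
      Item_Fat_Content %in% c("LF", "low fat") ~ "Low Fat",
      Item_Fat_Content == "reg"                ~ "Regular",
      TRUE                                     ~ Item_Fat_Content
    )),
    # Treat empty strings as missing values.
    Outlet_Size = factor(na_if(Outlet_Size, "")),
    # Replace missing weights with the mean of the observed weights.
    Item_Weight = if_else(is.na(Item_Weight),
                          mean(Item_Weight, na.rm = TRUE),
                          Item_Weight),
    # Base-2 logarithm of the retail price.
    Log2_MRP = log2(Item_MRP)
  )

prepared
```

Note that mean imputation gives every previously missing weight the same value, which is worth keeping in mind when reading the weight plots later on.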

1.4 Basic Summary of the Dataset

BigMart has collected 2013 sales data for 1559 products across 10 stores in different cities. Our training dataset has 8523 observations and 12 original variables. One additional variable, Log2_MRP, has been derived from Item_MRP, giving 13 variables in total.

The structure of the data is shown below:

## Classes 'tbl_df', 'tbl' and 'data.frame':    8523 obs. of  13 variables:
##  $ Item_Identifier          : Factor w/ 1559 levels "DRA12","DRA24",..: 157 9 663 1122 1298 759 697 739 441 991 ...
##  $ Item_Weight              : num  9.3 5.92 17.5 19.2 8.93 ...
##  $ Item_Fat_Content         : Factor w/ 2 levels "Low Fat","Regular": 1 2 1 2 1 2 2 1 2 2 ...
##  $ Item_Visibility          : num  0.016 0.0193 0.0168 0 0 ...
##  $ Item_Type                : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
##  $ Item_MRP                 : num  249.8 48.3 141.6 182.1 53.9 ...
##  $ Outlet_Identifier        : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4 2 6 8 3 ...
##  $ Outlet_Establishment_Year: int  1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
##  $ Outlet_Size              : Factor w/ 3 levels "High","Medium",..: 2 2 2 NA 1 2 1 2 NA NA ...
##  $ Outlet_Location_Type     : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
##  $ Outlet_Type              : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
##  $ Log2_MRP                 : num  7.96 5.59 7.15 7.51 5.75 ...
##  $ Item_Outlet_Sales        : num  3735 443 2097 732 995 ...

The numeric summary of the data is shown below:

##  Item_Identifier  Item_Weight     Item_Fat_Content Item_Visibility  
##  FDG33  :  10    Min.   : 4.555   Low Fat:5517     Min.   :0.00000  
##  FDW13  :  10    1st Qu.: 9.310   Regular:3006     1st Qu.:0.02699  
##  DRE49  :   9    Median :12.793                    Median :0.05393  
##  DRN47  :   9    Mean   :12.847                    Mean   :0.06613  
##  FDD38  :   9    3rd Qu.:16.000                    3rd Qu.:0.09459  
##  FDF52  :   9    Max.   :21.350                    Max.   :0.32839  
##  (Other):8467                                                       
##                  Item_Type       Item_MRP      Outlet_Identifier
##  Fruits and Vegetables:1232   Min.   : 31.29   OUT027 : 935     
##  Snack Foods          :1200   1st Qu.: 93.83   OUT013 : 932     
##  Household            : 910   Median :143.01   OUT035 : 930     
##  Frozen Foods         : 856   Mean   :140.99   OUT046 : 930     
##  Dairy                : 682   3rd Qu.:185.64   OUT049 : 930     
##  Canned               : 649   Max.   :266.89   OUT045 : 929     
##  (Other)              :2994                    (Other):2937     
##  Outlet_Establishment_Year Outlet_Size   Outlet_Location_Type
##  Min.   :1985              High  : 932   Tier 1:2388         
##  1st Qu.:1987              Medium:2793   Tier 2:2785         
##  Median :1999              Small :2388   Tier 3:3350         
##  Mean   :1998              NA's  :2410                       
##  3rd Qu.:2004                                                
##  Max.   :2009                                                
##                                                              
##             Outlet_Type      Log2_MRP     Item_Outlet_Sales 
##  Grocery Store    :1083   Min.   :4.968   Min.   :   33.29  
##  Supermarket Type1:5577   1st Qu.:6.552   1st Qu.:  834.25  
##  Supermarket Type2: 928   Median :7.160   Median : 1794.33  
##  Supermarket Type3: 935   Mean   :6.966   Mean   : 2181.29  
##                           3rd Qu.:7.536   3rd Qu.: 3101.30  
##                           Max.   :8.060   Max.   :13086.97  
## 

2.Exploratory Data Analysis

Let’s begin the analysis with a question about item sales.

2.1 What does the distribution of item sales look like?

  • The overall distribution of item sales is right-skewed
  • There are more items with low sales than with high sales
  • Outliers are found in item sales, beyond 10,000
  • Those outliers, though few in number, account for the highest item sales

These points are addressed one by one in the subsections that follow.
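
As a quick numeric complement to a histogram, right skew can be checked by comparing the mean with the median (in the summary of section 1.4, the mean of Item_Outlet_Sales, 2181.29, exceeds the median, 1794.33) and by computing a sample skewness. The sketch below uses simulated values rather than the real data, and the `skewness()` helper is a hypothetical function defined here, not part of the packages named above:

```r
# Simulated right-skewed sample standing in for Item_Outlet_Sales (made-up values).
set.seed(1)
sales <- rlnorm(1000, meanlog = 7.3, sdlog = 0.8)

# Moment-based sample skewness: positive values indicate a longer right tail.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

c(mean = mean(sales), median = median(sales), skewness = skewness(sales))
```

A mean above the median together with a positive skewness is consistent with a right-skewed distribution.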

2.1.1 Why is the overall distribution of item sales right-skewed?

First, I will create a histogram for each continuous variable.
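
The original plotting code is not shown. One possible way to draw the three histograms side by side with ggplot2 and gridExtra is sketched below; the `toy` data are simulated, and the bin count and layout are assumptions:

```r
library(ggplot2)
library(gridExtra)

# Simulated stand-in for the three continuous predictors (made-up values).
set.seed(42)
toy <- data.frame(
  Item_Weight     = runif(200, min = 4.5, max = 21.4),
  Item_Visibility = rexp(200, rate = 15),
  Item_MRP        = runif(200, min = 31, max = 267)
)

vars <- c("Item_Weight", "Item_Visibility", "Item_MRP")

# Build one histogram per variable; .data[[v]] looks the column up by name.
plots <- lapply(vars, function(v) {
  ggplot(toy, aes(x = .data[[v]])) +
    geom_histogram(bins = 30) +
    labs(x = v, y = "Count")
})

# Arrange the three histograms in a single row.
grid.arrange(grobs = plots, ncol = 3)
```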

  • Of the three plots above, one (item visibility) is also right-skewed, similar to the distribution of item sales.
  • Does that imply a correlation between item visibility and item sales?
  • Let’s assume that the higher the visibility, the higher the sales.

  • Fig.1f and 1g show a rather counterintuitive relationship.
  • Most of the items cluster at the bottom left, meaning low visibility and low sales.
  • Items with visibility less than 0.1 have achieved sales of up to 10000.
  • Yet items with visibility greater than 0.2 only have sales below 1000, although there are fewer items in this cluster.
  • There may be hidden factors at work. We will look into this in more depth in a later section.
  • Follow-up analysis 1 - Fig.1f: The cluster of item visibility

What about the relationship between item sales and retail price? Is it the case that the higher the retail price, the higher the item sales?

  • There’s a distribution pattern in Fig.1h.
  • The retail price divides into 4 clusters, with an obvious boundary between adjacent clusters.
  • The plot confirms our assumption about the relationship between these two variables.
  • Item sales increase as retail price increases.
  • Fig.1i shows that the variation in item sales increases as the retail price increases.
  • This difference in variation may be due to factors such as different product types or different store strategies.
  • Thus, further investigation is needed.
  • In short, an item’s maximum retail price might be a strong predictor in a model of item sales.
  • Follow-up analysis 2 - Fig.1h: The cluster of item retail price

So far, we have not found the reason for the right-skewed distribution of item sales. Let’s check the relationship between item weight and item sales.

  • Fig.1c and 1j show a strange pattern.
  • There’s a darker vertical line in the middle of both plots.
  • Apart from this line, the relationship between sales and weight is not obvious.
  • The vertical line itself needs further investigation.
  • Follow-up analysis 3 - Fig.1c: Strange vertical line in the middle of the distribution of item weight

Back to our question: the right-skewed distribution of item sales is not explained by item weight, item visibility or item retail price. What if we break the distribution of item sales down by the categorical variables, using facets?

  • Plotting item sales against the different categorical variables still does not reveal the reason for the right-skewed distribution.
  • Either the dataset is insufficient to provide the answer, or the skew is coincidental.
  • In Fig.1k, the shapes of the distributions are similar. The plot shows that there are more low fat items than regular items.
  • Fig.1m reveals something interesting.
  • Store 010 and store 019 have sales concentrated at the far left, meaning that most of the items in these two stores have low sales.
  • That may account for the right-skewed distribution of the overall item sales.
  • Yet, when they are excluded from the plot, the shape of the distribution does not change much.
  • This is illustrated by the following plot, Fig.1r.

  • Fig.1n shows nothing interesting.
  • Since we have 10 stores in our dataset and each store is tied to a single establishment year, the distribution of item sales by establishment year looks similar to that by store ID in Fig.1m.
  • This also explains the shape of the distributions in Fig.1o, 1p and 1q.
  • Fig.1o shows another counterintuitive fact: larger outlet size does not imply higher item sales.
  • In Fig.1q, the shape of the distribution of item sales in grocery stores looks familiar. We may wonder whether outlet 010 and outlet 019 belong to that category. Let’s find out:
## # A tibble: 2 x 3
## # Groups: Outlet_Type [?]
##   Outlet_Type   Outlet_Identifier count
##   <fct>         <fct>             <int>
## 1 Grocery Store OUT010              555
## 2 Grocery Store OUT019              528
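
A table like the one above can be obtained with a filter followed by a grouped count. The following is a minimal sketch on made-up rows (the `toy` data frame is an assumption, and the author's actual code may differ):

```r
library(dplyr)

# Toy stand-in for the outlet columns of the training data (made-up rows).
toy <- data.frame(
  Outlet_Type = c("Grocery Store", "Grocery Store", "Grocery Store",
                  "Supermarket Type1", "Supermarket Type3"),
  Outlet_Identifier = c("OUT010", "OUT010", "OUT019", "OUT013", "OUT027"),
  stringsAsFactors = FALSE
)

# Keep only grocery stores, then count rows per outlet.
grocery_outlets <- toy %>%
  filter(Outlet_Type == "Grocery Store") %>%
  count(Outlet_Type, Outlet_Identifier, name = "count")

grocery_outlets
```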

2.1.2 What do the outliers indicate?

  • In Fig.1s, the vertical line marks the mean of the overall item sales
  • Outlet 027 performs the best: it is above the average and contains the four highest item sales
  • Other outlets’ item sales are either around the mean or below it
  • Outlet 010 and outlet 019 (both grocery stores) have the worst performance in item sales
  • Yet they also hold fewer items (555 and 528) than the other outlets (about 930 each)
  • The plot is further supported by the following table:
## # A tibble: 10 x 5
##    Outlet_Identifier count     sum   avg   IQR
##    <fct>             <int>   <dbl> <dbl> <dbl>
##  1 OUT027              935 3453926  3694  2931
##  2 OUT035              930 2268123  2439  2071
##  3 OUT049              930 2183970  2348  2099
##  4 OUT017              926 2167465  2341  1901
##  5 OUT013              932 2142664  2299  2094
##  6 OUT046              930 2118395  2278  1958
##  7 OUT045              929 2036725  2192  1861
##  8 OUT018              928 1851823  1995  1721
##  9 OUT010              555  188340   339   301
## 10 OUT019              528  179694   340   307
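
The per-outlet summary above (count, total, mean and inter-quartile range of item sales, sorted by total) can be sketched with `group_by()` and `summarise()`. The `toy` sales figures below are made up, and this is one possible formulation rather than the author's code:

```r
library(dplyr)

# Toy stand-in for outlet-level sales (made-up values).
toy <- data.frame(
  Outlet_Identifier = c("OUT027", "OUT027", "OUT027",
                        "OUT010", "OUT010", "OUT013", "OUT013"),
  Item_Outlet_Sales = c(3000, 4000, 5000, 300, 400, 2000, 2500),
  stringsAsFactors = FALSE
)

outlet_summary <- toy %>%
  group_by(Outlet_Identifier) %>%
  summarise(
    count = n(),                     # number of items sold at the outlet
    sum   = sum(Item_Outlet_Sales),  # total sales
    avg   = mean(Item_Outlet_Sales), # mean sales per item
    IQR   = IQR(Item_Outlet_Sales)   # spread of the middle 50%
  ) %>%
  arrange(desc(sum))                 # best-performing outlet first

outlet_summary
```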

Do the outliers of each outlet share any common characteristics? We filtered the outliers (i.e. item sales greater than 1.5 * inter-quartile range of item sales) from each outlet, then compared their distributions with the overall ones.
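
The exact filter used is not shown. The sketch below assumes the conventional boxplot rule, under which a value is an outlier if it lies more than 1.5 × IQR above the upper quartile, applied separately within each outlet. Both the rule and the `toy` data are assumptions and may differ from what was actually run:

```r
library(dplyr)

# Toy stand-in for outlet-level sales (made-up values).
toy <- data.frame(
  Outlet_Identifier = rep(c("OUT_A", "OUT_B"), each = 6),
  Item_Outlet_Sales = c(100, 110, 120, 130, 140, 1000,
                        10, 12, 14, 16, 18, 20),
  stringsAsFactors = FALSE
)

# Within each outlet, keep rows above Q3 + 1.5 * IQR (assumed outlier rule).
outliers <- toy %>%
  group_by(Outlet_Identifier) %>%
  filter(Item_Outlet_Sales >
           quantile(Item_Outlet_Sales, 0.75, names = FALSE) +
           1.5 * IQR(Item_Outlet_Sales)) %>%
  ungroup()

outliers
```

Grouping before filtering means each outlet gets its own threshold, so a sale that is ordinary in a large supermarket can still count as an outlier in a small grocery store.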

  • The shape of the distribution changes from multiple peaks in Fig.1e to a single peak in Fig.1t
  • Fig.1t shows a left-skewed distribution, suggesting that most of the outliers in item sales might be attributed to higher retail prices

  • Comparing Fig.1c and Fig.1u, the two plots are almost identical

  • Comparing Fig.1d and Fig.1v, the two plots are almost identical

  • Fig.1w shows that the distributions are roughly the same

## # A tibble: 16 x 5
##    Item_Type                 n  sales   avg   IQR
##    <fct>                 <int>  <dbl> <dbl> <dbl>
##  1 Fruits and Vegetables    59 363753  6165  2203
##  2 Snack Foods              45 277147  6159  2218
##  3 Household                35 218462  6242  2135
##  4 Frozen Foods             30 181329  6044  1412
##  5 Canned                   26 152902  5881  1469
##  6 Dairy                    22 150422  6837  2463
##  7 Meat                     22 135645  6166  1700
##  8 Baking Goods             17  97164  5716  1302
##  9 Health and Hygiene       15  89796  5986   918
## 10 Soft Drinks              10  70464  7046  2346
## 11 Breads                    7  44202  6315  1445
## 12 Hard Drinks               7  40375  5768  1518
## 13 Breakfast                 4  30427  7607   745
## 14 Starchy Foods             4  23740  5935  1634
## 15 Others                    3  15189  5063   483
## 16 Seafood                   1   5705  5705     0
  • In the overall distribution in Fig.1x, the most popular item types in each outlet are snack foods and fruits and vegetables. The two types are similar in count.
  • In Fig.1y, for the outlier subset, fruits and vegetables far outnumber snack foods at some outlets, such as outlets 013, 027, 045 and 046
  • Among these four outlets, outlet 027 performs the best in item sales
  • Selling more fruits and vegetables might boost sales

2.2 Follow-up analysis

2.2.1 Follow-up 1 - Fig.1f: The cluster of item visibility

## # A tibble: 2 x 2
##   Outlet_Identifier Visibility_Greater_Than_0.2
##   <fct>                                   <int>
## 1 OUT010                                     62
## 2 OUT019                                     72

  • From the above plots, only outlet 010 and outlet 019 have items with visibility greater than 0.2
  • As we found earlier, both are grocery stores
  • Item visibility at these two outlets does not correlate with their item sales
  • The physical layout may differ between grocery stores and supermarkets, and such a difference could account for the difference in item visibility between outlets
  • The dataset does not contain enough information to examine the outlets’ physical structure
  • The distributions of item visibility in the other outlets are similar, i.e. right-skewed
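
The count of high-visibility items per outlet shown at the start of this subsection can be reproduced along the following lines. The `toy` rows are made up, and the 0.2 threshold is the one used in the text:

```r
library(dplyr)

# Toy stand-in for the visibility column (made-up values).
toy <- data.frame(
  Outlet_Identifier = c("OUT010", "OUT010", "OUT019", "OUT013", "OUT027"),
  Item_Visibility   = c(0.25, 0.05, 0.31, 0.07, 0.02),
  stringsAsFactors = FALSE
)

# Keep items with visibility above 0.2, then count them per outlet.
# Outlets with no such items do not appear in the result.
high_visibility <- toy %>%
  filter(Item_Visibility > 0.2) %>%
  count(Outlet_Identifier, name = "Visibility_Greater_Than_0.2")

high_visibility
```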

2.2.2 Follow-up 2 - Fig.1h: The cluster of item retail price

  • When we zoom in on the range of item sales from 0 to 1000, the plot still shows a tendency to cluster, but it is not as obvious as in Fig.1h

2.2.3 Follow-up 3 - Fig.1c: Strange vertical line at the middle of the distribution of item weight

  • In Fig.2d, item weight varies among items. In most stores the weights are evenly spread; the exceptions are outlets 019 and 027
  • The item weights of outlets 019 and 027 form a vertical line in the plot

  • In Fig.2e, both outlet 019 and outlet 027 sell the same types of items as the other outlets, but their items’ weights are all the same.
  • We suspect a data entry error, because items of different kinds cannot all have the same weight.
  • Another possible cause is the mean imputation described in section 1.3: if Item_Weight was missing for every item at these two outlets, all of those items would have received the same mean value.
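
One way to confirm that an outlet's weights are constant is to count the distinct weight values per outlet; a constant column can arise when every missing value is replaced by the same mean, as in section 1.3. The `toy` data below are made up for illustration:

```r
library(dplyr)

# Toy stand-in: OUT019 has a single repeated weight, OUT013 has varied weights.
toy <- data.frame(
  Outlet_Identifier = c("OUT019", "OUT019", "OUT019",
                        "OUT013", "OUT013", "OUT013"),
  Item_Weight       = c(12.85, 12.85, 12.85, 9.3, 17.5, 8.93),
  stringsAsFactors = FALSE
)

weight_check <- toy %>%
  group_by(Outlet_Identifier) %>%
  summarise(
    n_items            = n(),
    n_distinct_weights = n_distinct(Item_Weight), # 1 means a constant weight
    sd_weight          = sd(Item_Weight)          # 0 for a constant weight
  )

weight_check
```

Running the same check on the raw data, before imputation, would show whether the weights for these outlets were missing to begin with.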

3.Insights & Conclusion