Big Mart Sales Data Analysis, Visualization, Modelling

Looking at the structure of our data.

## Classes 'tbl_df', 'tbl' and 'data.frame':    14204 obs. of  13 variables:
##  $ Item_Identifier          : chr  "FDA15" "DRC01" "FDN15" "FDX07" ...
##  $ Item_Weight              : num  9.3 5.92 17.5 19.2 8.93 ...
##  $ Item_Fat_Content         : chr  "Low Fat" "Regular" "Low Fat" "Regular" ...
##  $ Item_Visibility          : num  0.016 0.0193 0.0168 0 0 ...
##  $ Item_Type                : chr  "Dairy" "Soft Drinks" "Meat" "Fruits and Vegetables" ...
##  $ Item_MRP                 : num  249.8 48.3 141.6 182.1 53.9 ...
##  $ Outlet_Identifier        : chr  "OUT049" "OUT018" "OUT049" "OUT010" ...
##  $ Outlet_Establishment_Year: int  1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
##  $ Outlet_Size              : chr  "Medium" "Medium" "Medium" NA ...
##  $ Outlet_Location_Type     : chr  "Tier 1" "Tier 3" "Tier 1" "Tier 3" ...
##  $ Outlet_Type              : chr  "Supermarket Type1" "Supermarket Type2" "Supermarket Type1" "Grocery Store" ...
##  $ Item_Outlet_Sales        : num  3735 443 2097 732 995 ...
##  $ source                   : chr  "train" "train" "train" "train" ...

Preliminary notes

So we have 12 columns here, plus the ‘source’ column we added. Our first task is to think about which variables we can actually use in our analysis.
That is,
1. is the variable something that would plausibly have an impact on the sales of the product,
2. given the variable represents a metric we are interested in, is the data included physically sufficient for us to use (e.g. not 80% NA values, etc.)

Right off the bat, we can see there are a few variables that may not be interesting as they stand now, just by looking at the names and first few values:

Item_weight: Customers won’t generally inclined to buy a product because it is heavier, and the heaviness off a product shouldn’t be a factor when they decide on their purchase. If it is really a larger product, say an appliance, the weight will generally be true of all similar appliances, and wouldn’t be specific to that particular Item_ID
Item_MRP: The MRP (list price) will likely be correlated with sales automatically since sales are calculated as the product of price and quantity. We may want to adjust for this later
Outlet_Establishment_Year: While the ‘run down’ factor could be influential in the popularity of a location, the establishment year doesn’t give us this information. Neither does it tell us whether or not an older location has developed a stronger longterm customer base.

Now, looking at the remaining 9 variables we intend to analyze:

Item_Identifier: Product ID (used to identify each unique product)
Item_Fat_Content: Consumers generally attracted to items with lower fat
Item_Visibility: Higher visibility means customers are more likely to actually see the product during their trip to the store, and perhaps this increases the likelihood that a product will be purchased
Item_Type: Product category may be correlated with sales
Outlet_Identifier: Store ID (used to identify each location)
Outlet_Size: Linked with store capacity; customers will be more inclined to visit stores with a wide variety of products so as to do all their shopping in one place (convenience factor)
Outlet_Location_Type: Tier 1~3, More urban (tier 1) stores ought to have the benefit of higher population density and generally customers with higher incomes
Outlet_Type: Supermarket Type1/2/3 or Grocery Store. Again the capacity and variety available makes locations more or less attractive (convenience factor)
Item_Outlet_Sales: Our dependent variable; the outcome we are looking to explain

Check for null values
Now we’ll look for NAs and get to know what we’re physically working with in greater detail

## Print number of NA values in each column in data  
sapply(data, function(x) sum(length(which(is.na(x)))))

##           Item_Identifier               Item_Weight 
##                         0                      2439 
##          Item_Fat_Content           Item_Visibility 
##                         0                         0 
##                 Item_Type                  Item_MRP 
##                         0                         0 
##         Outlet_Identifier Outlet_Establishment_Year 
##                         0                         0 
##               Outlet_Size      Outlet_Location_Type 
##                      4016                         0 
##               Outlet_Type         Item_Outlet_Sales 
##                         0                      5681 
##                    source 
##                         0

Looks like we have 3 variables with a number of NAs.
Note that the 5681 NAs in outlet sales correspond exactly to the number of rows in the ‘test’ dataframe, so we need not worry about these.
We’ll come back to deal with the ‘Item_Weight’ and ‘Outlet_Size’ nulls.

Inspecting numeric and categorical variables in further detail

## Let's have a look at the quantitative variables in data  
data %>%  
        select_if(is.numeric) %>%  
        summary()

##   Item_Weight     Item_Visibility      Item_MRP     
##  Min.   : 4.555   Min.   :0.00000   Min.   : 31.29  
##  1st Qu.: 8.710   1st Qu.:0.02704   1st Qu.: 94.01  
##  Median :12.600   Median :0.05402   Median :142.25  
##  Mean   :12.793   Mean   :0.06595   Mean   :141.00  
##  3rd Qu.:16.750   3rd Qu.:0.09404   3rd Qu.:185.86  
##  Max.   :21.350   Max.   :0.32839   Max.   :266.89  
##  NA's   :2439                                       
##  Outlet_Establishment_Year Item_Outlet_Sales 
##  Min.   :1985              Min.   :   33.29  
##  1st Qu.:1987              1st Qu.:  834.25  
##  Median :1999              Median : 1794.33  
##  Mean   :1998              Mean   : 2181.29  
##  3rd Qu.:2004              3rd Qu.: 3101.30  
##  Max.   :2009              Max.   :13086.97  
##                            NA's   :5681

Some points to note:

Item_Visibility has a minimum of 0, which doesn’t make much sense.
Outlet_Establishment_Year format won’t help us in our analysis. If anything we’ll change the values to something like ‘years since establishment’.

## To inspect our categorical variables, print the number of unique values per column  
sapply(data, function(x) length(unique(x)))

##           Item_Identifier               Item_Weight 
##                      1559                       416 
##          Item_Fat_Content           Item_Visibility 
##                         5                     13006 
##                 Item_Type                  Item_MRP 
##                        16                      8052 
##         Outlet_Identifier Outlet_Establishment_Year 
##                        10                         9 
##               Outlet_Size      Outlet_Location_Type 
##                         4                         3 
##               Outlet_Type         Item_Outlet_Sales 
##                         4                      3494 
##                    source 
##                         2

So looks like we have 1559 unique items, 10 unique store locations.

Next we’ll print the frequency of each unique category for each of our categorical variables (excluding Item/Outlet ID, source)

categorical_freqs <- data %>%  
        select_if(is.character) %>%  
        select(-one_of(c('Item_Identifier', 'Outlet_Identifier', 'source'))) %>%  
        apply(2, table)  
  
categorical_freqs

## $Item_Fat_Content
## 
##      LF low fat Low Fat     reg Regular 
##     522     178    8485     195    4824 
## 
## $Item_Type
## 
##          Baking Goods                Breads             Breakfast 
##                  1086                   416                   186 
##                Canned                 Dairy          Frozen Foods 
##                  1084                  1136                  1426 
## Fruits and Vegetables           Hard Drinks    Health and Hygiene 
##                  2013                   362                   858 
##             Household                  Meat                Others 
##                  1548                   736                   280 
##               Seafood           Snack Foods           Soft Drinks 
##                    89                  1989                   726 
##         Starchy Foods 
##                   269 
## 
## $Outlet_Size
## 
##   High Medium  Small 
##   1553   4655   3980 
## 
## $Outlet_Location_Type
## 
## Tier 1 Tier 2 Tier 3 
##   3980   4641   5583 
## 
## $Outlet_Type
## 
##     Grocery Store Supermarket Type1 Supermarket Type2 Supermarket Type3 
##              1805              9294              1546              1559

Notes:

Looks like fat content has a few redundant labels - LF vs. low fat vs. Low Fat, reg vs. Regular
There are a lot of Item_Type categories with few values. It might be wise to combine some of them
We’ll investigate whether there are significant differences between the Type1, Type2 and Type3 supermarkets. These also could potentially be combined later on

Data tidying with dplyr

Now it’s time to start tidying up our dataset. Taking into account the above notes/observations, let’s start cleaning up.

First we’ll take care of the null values we discovered in ‘Item_Weight’ and ‘Outlet_Size’

Item_Weight Nulls

It appears that these NA values in Item_Weight are coming from the OUT027 and OUT019 locations

## Warning: Removed 2439 rows containing non-finite values (stat_boxplot).

So it looks like 2 outlets just haven’t reported any data on item weights. Luckily the distributions of Item_Weights don’t seem to vary between locations (to the point that it’s unrealistic, especially considering that these data include 4 categories of outlet types.. But whatever it works in favor of our convenience here)

#Take a quick look at the distribution of 'Item_Weight' to see if there's any reason to prefer mean over median or vice versa
summary(data$Item_Weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   4.555   8.710  12.600  12.790  16.750  21.350    2439

Doesn’t appear that the choice will make much of a difference. We’ll go ahead and impute by the median for ‘Item_Weight’.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.555   9.300  12.600  12.760  16.000  21.350

Outlet Size Nulls
Let’s quickly look at which outlets have NAs for Outlet_Size

## # A tibble: 1 x 1
##   Outlet_Type  
##   <chr>        
## 1 Grocery Store

## # A tibble: 1 x 1
##   Outlet_Type      
##   <chr>            
## 1 Supermarket Type1

## # A tibble: 1 x 1
##   Outlet_Type      
##   <chr>            
## 1 Supermarket Type1

So all we need to do here is replace the nulls in Outlet_Size with the mode Outlet_Size for a given outlet’s Outlet_Type.
We can impute by the mode grouped by ‘Outlet_Type’

## # A tibble: 4 x 2
##   Outlet_Type       mode  
##   <chr>             <chr> 
## 1 Grocery Store     <NA>  
## 2 Supermarket Type1 Small 
## 3 Supermarket Type2 Medium
## 4 Supermarket Type3 Medium

It stands to reason that grocery stores would be the the smallest type.

Once more print out the number of missing values for each column, confirming that Item_Weight and Outlet_Size are indeed null-free.

##           Item_Identifier               Item_Weight 
##                         0                         0 
##          Item_Fat_Content           Item_Visibility 
##                         0                         0 
##                 Item_Type                  Item_MRP 
##                         0                         0 
##         Outlet_Identifier Outlet_Establishment_Year 
##                         0                         0 
##               Outlet_Size      Outlet_Location_Type 
##                         0                         0 
##               Outlet_Type         Item_Outlet_Sales 
##                         0                      5681 
##                    source 
##                         0

Next we’ll perform some feature engineering - merging and/or modifying variables where it seems appropriate.

Item Visibility
We come back to the issue of ‘Item_Visibility’, which had the mysterious minimum of zero.
The zero values are most likely just missing information, so we’ll impute Item_Visibility based on the given item’s (Item_Identifier’s) mean visibility.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.003575 0.031150 0.057190 0.069710 0.096930 0.328400

Now we have absolute visibility stats for all products in all stores, but in order to determine how a particular product’s visibility at a given location impacts its sales at that location, we need to know how the product’s visibility at that particular location compares with the product’s visibility at other locations.

To understand this, we’ll need to create a new metric, namely, the ratio of the product’s visibility at a given store to its mean visibility across all stores.

#Calculate the new variable, 'Item_Visibility_MeanRatio'
data <- data %>%
        group_by(Item_Identifier) %>%
        mutate(Item_Mean_Visibility = mean(Item_Visibility)) %>%
        mutate(Item_Visibility_MeanRatio = Item_Visibility / Item_Mean_Visibility) %>%
        select(-Item_Mean_Visibility)

Item Type Recategorization

Here we rearranged item types into more singificant-looking groups based on considerations of mean sales and a measure of daily necessity.

## # A tibble: 3 x 1
## # Groups:   Item_Type_Aggregate [3]
##   Item_Type_Aggregate
##   <chr>              
## 1 High Sales         
## 2 Low Sales          
## 3 Medium Sales

Outlet Establishment Year

Next we come to a variable we can definitely improve on for our analysis: ‘Outlet_Establishment_Year’.
We’re going to create a new variable, ‘Outlet_Years’, to more intuitively represent location age.

#Subtract year established from 2013, as 2013 is the year in which this sales data was collected
data  <- data %>%
        mutate(Outlet_Years = 2013 - Outlet_Establishment_Year)
summary(data$Outlet_Years)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00    9.00   14.00   15.17   26.00   28.00

Item Fat Content
Next we’ll deal with the mislabelling of the Item_Fat_Content categories

#Replace redundant category names with 'Low Fat' and 'Regular' for the sake of consistency
data$Item_Fat_Content <- recode(data$Item_Fat_Content,
                                LF = 'Low Fat', 
                                `low fat` = 'Low Fat',
                                reg = 'Regular')
  
#And we musn't forget to change the fat content label for non-consumable items
data <- data %>%
        mutate(Item_Fat_Content = replace(Item_Fat_Content, grepl('NC', Item_Identifier), 'Non-Consumable'))
#Check to make sure we have just the two categories remaining
data %>% group_by(Item_Fat_Content) %>% select(Item_Fat_Content) %>% apply(2, table)

##                Item_Fat_Content
## Low Fat                    6499
## Non-Consumable             2686
## Regular                    5019

Before constructing a model with our tidied and modified data, let’s do some visualizaiton of the relationships between individual attributes and overall sales

Sales on Item_MRP

#Simply Item_Outlet_Sales on Item_MRP
ggplot(data, aes(x = Item_MRP, y = Item_Outlet_Sales)) +
        geom_density(stat = 'identity') +
        geom_smooth(color = 'red', size = 1.5) +
        ggtitle('Sales on Item_MRP') +
        bigmart_theme

#And dividing sales by the price to adjust for the fact that sales is basically price * quantity
ggplot(data, aes(x = Item_MRP, y = Item_Outlet_Sales / Item_MRP)) +
        geom_density(stat = 'identity') +
        geom_smooth(color = 'red', size = 1.5) +
        ggtitle('Sales/Item_MRP on Item_MRP') +
        bigmart_theme

As anticipated, the correlation between price and sales comes from the way in which Item_Outlet_Sales is calculated to begin with

To see visualizations for a wider range of attributes against Item_Outlet_Sales, check out my Shiny app:
https://warriwes.shinyapps.io/BigmartSales_Viz/

The following are just a few highlights and my comments

Sales on Item_Visibility_MeanRatio

We can see here that the item visibility distribution is pulled way out by products in grocery stores. What this tells us is that grocery stores in fact have much less inventory, and therefore the ‘visibility’ awarded to each individual item is automatically greater than it would be in a supermarket

Sales on Item_Weight

From this distribution of item weights, we can see where we imputed the median for the OUT019 and OUT027 locations, which happened to be the two Type 3 supermarkets. Item_Weight doesn’t look like it will be a very helpful variable anyway, which we suspected from the beginning

Sales on Item_Type_Aggregate

Just using sales vs. our modified item types for a clear picture, we can see that all-in-all, outlet type has a major effect on sales. Interestingly, Type 1 supermarkets surpass Type 2 supermarkets overwhelmingly, but even more clear is the gap between grocery stores and all supermarket types.

Preparing data for model development and testing

Prepare to split back into train and test sets

#Drop columns we replaced
data <- data %>%
        select(-c(Item_Type, Outlet_Establishment_Year))
#Set item and outlet ID pairs to row names and coerce character columns into factors (by converting data to data.frame)
ID_cols <- paste(data$Item_Identifier, data$Outlet_Identifier)
data_df <- as.data.frame(unclass(data), row.names = ID_cols)
#Split back into train and test on 'source' column
train <- data_df %>%
        filter(source == 'train')
test <- data_df %>%
        filter(source == 'test')
#Drop unnecessary columns
train <- train %>%
        select(-source)
test <- test %>%
        select(-c(source, Item_Outlet_Sales))

Random Forest Model

Now it’s time to test out our model..

library(randomForest)

rf <- randomForest(Item_Outlet_Sales ~ ., ntree = 100, data = train_1)
plot(rf)

Just another way to visualize the results:

Making it clear how biased the model is toward Item_MRP

Let’s say we just model against quantities of items sold (Sales / MRP)…

library(randomForest)
rf <- randomForest(Item_Outlet_Sales / Item_MRP ~ ., ntree = 100, data = train_1)
plot(rf)

Removing Item_MRP gives us a clearer picture of how the other attributes influence the quantity of a given item that is sold

Clearly the MRP of a product will impact its total sales. This is true of any item in any location, but we can draw more helpful conclusions from the feature importance coefficients on the previous slide, with Item_MRP taken out of the picture, in addition to the univariate analysis we performed earlier.

Outlet_Type is a major factor in how much of an item is sold, primarily as a result of the high variation between sales at grocery stores vs. supermarkets. It would be interesting to take grocery stores out of the equation entirely in order to observe the resulting impact on Outlet_Type importance, as well as that of Item_Visibility, plus it would allow us to consider the variation between different supermarket types in more detail.
Item_Visibility, Item_Visibility_MeanRatio are influential in our model, in fact the ratio variable we engineered is second only to outlet type after adjusting for Item_MRP. This tells us that in fact the overall organization and display strategy taken at Big Mart stores is an incredibly important part of their stores’ operations. This ought to be taken into consideration when making decisions about store layout, as well as product promotions, etc.
Attributes such as Item_Fat_Content turn out to be relatively insignificant when it comes to overall sales. This is likely because consumer choices with respect to nutrition, etc., are not homogenous throughout the customer population. Also, ‘Low_Fat’ does not necessarily mean ‘nutritious’, and with this dataset we don’t have access to information such as whether these items were specifically advertised as low-fat/healthy, or whether that was just Big Mart’s catoregization, for example.
One other note I will mention is that personally I thought Outlet_Location_Type might play a larger role in determining sales figures. However, I suppose the location of a given outlet has more impact on the outlet’s total sales and turnover, as opposed to the sales of unique items. For instance it stands to reason that outlets located in urban areas with a larger customer base would stock a wider variety of products, and maybe have less room to boost individual items’ visibility numbers. Also our Outlet_Type variable may be capturing a portion of the effect of Outlet_Location_Type.

I think there is certainly room to go deeper with this dataset. I would consider the above a debriefing for the client to keep in mind, and look forward to picking apart the nitty gritty details.