Data and other contest resources can be found on the Analytics Vidhya website:
https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/

So the goal here is to:

try to understand the properties of products and stores which play a key role in increasing sales at BigMart.


Let’s begin by reading in the data

library(tidyverse)  
## Read in train and test data  
train <- read_csv('train.csv')  
test <- read_csv('test.csv')  
  
## We'll label our data based on whether it comes from the train or test set to be separated later, but for now we'll join the tibbles to perform some tidying operations  
train['source'] <- 'train'  
test['source'] <- 'test'  
data <- train %>%  
        full_join(test)  
## Check out the dimensions of train, test, and data just to make sure everything joined as planned  
dim(train)  
## [1] 8523   13
dim(test)  
## [1] 5681   12
dim(data)  
## [1] 14204    13
## Look at data in a bit more detail  
str(data)  
## Classes 'tbl_df', 'tbl' and 'data.frame':    14204 obs. of  13 variables:
##  $ Item_Identifier          : chr  "FDA15" "DRC01" "FDN15" "FDX07" ...
##  $ Item_Weight              : num  9.3 5.92 17.5 19.2 8.93 ...
##  $ Item_Fat_Content         : chr  "Low Fat" "Regular" "Low Fat" "Regular" ...
##  $ Item_Visibility          : num  0.016 0.0193 0.0168 0 0 ...
##  $ Item_Type                : chr  "Dairy" "Soft Drinks" "Meat" "Fruits and Vegetables" ...
##  $ Item_MRP                 : num  249.8 48.3 141.6 182.1 53.9 ...
##  $ Outlet_Identifier        : chr  "OUT049" "OUT018" "OUT049" "OUT010" ...
##  $ Outlet_Establishment_Year: int  1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
##  $ Outlet_Size              : chr  "Medium" "Medium" "Medium" NA ...
##  $ Outlet_Location_Type     : chr  "Tier 1" "Tier 3" "Tier 1" "Tier 3" ...
##  $ Outlet_Type              : chr  "Supermarket Type1" "Supermarket Type2" "Supermarket Type1" "Grocery Store" ...
##  $ Item_Outlet_Sales        : num  3735 443 2097 732 995 ...
##  $ source                   : chr  "train" "train" "train" "train" ...
head(data)  
## # A tibble: 6 x 13
##   Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type  
##   <chr>                 <dbl> <chr>                      <dbl> <chr>      
## 1 FDA15                  9.30 Low Fat                   0.0160 Dairy      
## 2 DRC01                  5.92 Regular                   0.0193 Soft Drinks
## 3 FDN15                 17.5  Low Fat                   0.0168 Meat       
## 4 FDX07                 19.2  Regular                   0.     Fruits and~
## 5 NCD19                  8.93 Low Fat                   0.     Household  
## 6 FDP36                 10.4  Regular                   0.     Baking Goo~
## # ... with 8 more variables: Item_MRP <dbl>, Outlet_Identifier <chr>,
## #   Outlet_Establishment_Year <int>, Outlet_Size <chr>,
## #   Outlet_Location_Type <chr>, Outlet_Type <chr>,
## #   Item_Outlet_Sales <dbl>, source <chr>
names(data)  
##  [1] "Item_Identifier"           "Item_Weight"              
##  [3] "Item_Fat_Content"          "Item_Visibility"          
##  [5] "Item_Type"                 "Item_MRP"                 
##  [7] "Outlet_Identifier"         "Outlet_Establishment_Year"
##  [9] "Outlet_Size"               "Outlet_Location_Type"     
## [11] "Outlet_Type"               "Item_Outlet_Sales"        
## [13] "source"

Preliminary notes
So we have 12 columns here, plus the ‘source’ column we added. Our first task is to think about which variables we can actually use in our analysis.
That is:
1: Is the variable something that would plausibly have an impact on the sales of the product?
2: Given that the variable represents a metric we are interested in, is the data complete enough for us to use (e.g. not 80% NA values, etc.)?

Right off the bat, we can see there are a few variables that may not be interesting as they stand now, just by looking at the names and first few values:
–Item_Weight: Customers generally won’t be inclined to buy a product because it is heavier, and the weight of a product shouldn’t be a factor when they decide on their purchase. If it really is a larger product, say an appliance, the weight will generally be true of all similar appliances, and wouldn’t be specific to that particular Item_ID
–Item_MRP: The MRP (list price) will likely be correlated with sales automatically since sales are calculated as the product of price and quantity. We may want to adjust for this later
–Outlet_Establishment_Year: While the ‘run down’ factor could influence the popularity of a location, the establishment year doesn’t give us this information. Nor does it tell us whether or not an older location has developed a stronger long-term customer base.

Now, looking at the remaining 9 variables we intend to analyze:
–Item_Identifier: Product ID (used to identify each unique product)
–Item_Fat_Content: Consumers are generally attracted to items with lower fat content
–Item_Visibility: Higher visibility means customers are more likely to actually see the product during their trip to the store, and perhaps this increases the likelihood that a product will be purchased
–Item_Type: Product category may be correlated with sales
–Outlet_Identifier: Store ID (used to identify each location)
–Outlet_Size: Linked with store capacity; customers will be more inclined to visit stores with a wide variety of products so as to do all their shopping in one place (convenience factor)
–Outlet_Location_Type: Tier 1 through Tier 3; more urban (Tier 1) stores ought to have the benefit of higher population density and customers with generally higher incomes
–Outlet_Type: Supermarket Type1/2/3 or Grocery Store. Again the capacity and variety available makes locations more or less attractive (convenience factor)
–Item_Outlet_Sales: Our dependent variable; the outcome we are looking to explain

## Just as an example, a simple way to look at the different values a discrete variable can take:  
unique(data$Outlet_Type)  
## [1] "Supermarket Type1" "Supermarket Type2" "Grocery Store"    
## [4] "Supermarket Type3"

Check for null values
Now we’ll look for NAs and get to know what we’re working with in greater detail

## Print number of NA values in each column in data  
sapply(data, function(x) sum(is.na(x)))  
##           Item_Identifier               Item_Weight 
##                         0                      2439 
##          Item_Fat_Content           Item_Visibility 
##                         0                         0 
##                 Item_Type                  Item_MRP 
##                         0                         0 
##         Outlet_Identifier Outlet_Establishment_Year 
##                         0                         0 
##               Outlet_Size      Outlet_Location_Type 
##                      4016                         0 
##               Outlet_Type         Item_Outlet_Sales 
##                         0                      5681 
##                    source 
##                         0

Looks like we have 3 variables with a number of NAs.
Note that the 5681 NAs in outlet sales correspond exactly to the number of rows in the ‘test’ dataframe, so we need not worry about these.
We’ll come back to deal with the ‘Item_Weight’ and ‘Outlet_Size’ nulls.
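
Just to be thorough, we can verify that claim directly (a quick check; the comparison below should return TRUE):

#Confirm the NAs in Item_Outlet_Sales line up exactly with the test rows
all(is.na(data$Item_Outlet_Sales) == (data$source == 'test'))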

Inspecting numeric and categorical variables in further detail

## Let's have a look at the quantitative variables in data  
data %>%  
        select_if(is.numeric) %>%  
        summary()  
##   Item_Weight     Item_Visibility      Item_MRP     
##  Min.   : 4.555   Min.   :0.00000   Min.   : 31.29  
##  1st Qu.: 8.710   1st Qu.:0.02704   1st Qu.: 94.01  
##  Median :12.600   Median :0.05402   Median :142.25  
##  Mean   :12.793   Mean   :0.06595   Mean   :141.00  
##  3rd Qu.:16.750   3rd Qu.:0.09404   3rd Qu.:185.86  
##  Max.   :21.350   Max.   :0.32839   Max.   :266.89  
##  NA's   :2439                                       
##  Outlet_Establishment_Year Item_Outlet_Sales 
##  Min.   :1985              Min.   :   33.29  
##  1st Qu.:1987              1st Qu.:  834.25  
##  Median :1999              Median : 1794.33  
##  Mean   :1998              Mean   : 2181.29  
##  3rd Qu.:2004              3rd Qu.: 3101.30  
##  Max.   :2009              Max.   :13086.97  
##                            NA's   :5681

Some points to note:
–Item_Visibility has a minimum of 0, which doesn’t make much sense.
–Outlet_Establishment_Year format won’t help us in our analysis. If anything we’ll change the values to something like ‘years since establishment’.

## To inspect our categorical variables, print the number of unique values per column  
sapply(data, function(x) length(unique(x)))  
##           Item_Identifier               Item_Weight 
##                      1559                       416 
##          Item_Fat_Content           Item_Visibility 
##                         5                     13006 
##                 Item_Type                  Item_MRP 
##                        16                      8052 
##         Outlet_Identifier Outlet_Establishment_Year 
##                        10                         9 
##               Outlet_Size      Outlet_Location_Type 
##                         4                         3 
##               Outlet_Type         Item_Outlet_Sales 
##                         4                      3494 
##                    source 
##                         2

So it looks like we have 1559 unique items and 10 unique store locations.

Next we’ll print the frequency of each unique category for each of our categorical variables (excluding Item/Outlet ID, source)

categorical_freqs <- data %>%  
        select_if(is.character) %>%  
        select(-one_of(c('Item_Identifier', 'Outlet_Identifier', 'source'))) %>%  
        apply(2, table)  
  
categorical_freqs  
## $Item_Fat_Content
## 
##      LF low fat Low Fat     reg Regular 
##     522     178    8485     195    4824 
## 
## $Item_Type
## 
##          Baking Goods                Breads             Breakfast 
##                  1086                   416                   186 
##                Canned                 Dairy          Frozen Foods 
##                  1084                  1136                  1426 
## Fruits and Vegetables           Hard Drinks    Health and Hygiene 
##                  2013                   362                   858 
##             Household                  Meat                Others 
##                  1548                   736                   280 
##               Seafood           Snack Foods           Soft Drinks 
##                    89                  1989                   726 
##         Starchy Foods 
##                   269 
## 
## $Outlet_Size
## 
##   High Medium  Small 
##   1553   4655   3980 
## 
## $Outlet_Location_Type
## 
## Tier 1 Tier 2 Tier 3 
##   3980   4641   5583 
## 
## $Outlet_Type
## 
##     Grocery Store Supermarket Type1 Supermarket Type2 Supermarket Type3 
##              1805              9294              1546              1559

Notes:
–Looks like fat content has a few redundant labels - LF vs. low fat vs. Low Fat, reg vs. Regular
–There are a lot of Item_Type categories with few values. It might be wise to combine some of them
–We’ll investigate whether there are significant differences between the Type1, Type2 and Type3 supermarkets. These also could potentially be combined later on


Data tidying with dplyr

Now it’s time to start tidying up our dataset, taking into account the notes and observations above.

First we’ll take care of the null values we discovered in ‘Item_Weight’ and ‘Outlet_Size’

Item_Weight Nulls

#First check out where these nulls are coming from  
data %>%  
        filter(is.na(Item_Weight)) %>%  
        head(20)  
## # A tibble: 20 x 13
##    Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type 
##    <chr>                 <dbl> <chr>                      <dbl> <chr>     
##  1 FDP10                    NA Low Fat                  0.127   Snack Foo~
##  2 DRI11                    NA Low Fat                  0.0342  Hard Drin~
##  3 FDW12                    NA Regular                  0.0354  Baking Go~
##  4 FDC37                    NA Low Fat                  0.0576  Baking Go~
##  5 FDC14                    NA Regular                  0.0722  Canned    
##  6 FDV20                    NA Regular                  0.0595  Fruits an~
##  7 FDX10                    NA Regular                  0.123   Snack Foo~
##  8 FDB34                    NA Low Fat                  0.0265  Snack Foo~
##  9 FDS02                    NA Regular                  0.255   Dairy     
## 10 FDI26                    NA Low Fat                  0.0611  Canned    
## 11 FDF09                    NA Low Fat                  0.0121  Fruits an~
## 12 FDY40                    NA Regular                  0.150   Frozen Fo~
## 13 FDY45                    NA Low Fat                  0.0260  Snack Foo~
## 14 FDN48                    NA Low Fat                  0.114   Baking Go~
## 15 NCL18                    NA Low Fat                  0.293   Household 
## 16 FDR12                    NA Regular                  0.0314  Baking Go~
## 17 FDQ49                    NA Regular                  0.0391  Breakfast 
## 18 FDU04                    NA Low Fat                  0.00971 Frozen Fo~
## 19 NCP18                    NA Low Fat                  0.0285  Household 
## 20 FDD10                    NA Regular                  0.0458  Snack Foo~
## # ... with 8 more variables: Item_MRP <dbl>, Outlet_Identifier <chr>,
## #   Outlet_Establishment_Year <int>, Outlet_Size <chr>,
## #   Outlet_Location_Type <chr>, Outlet_Type <chr>,
## #   Item_Outlet_Sales <dbl>, source <chr>

It appears that these NA values in Item_Weight are coming from the OUT027 and OUT019 locations
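
Rather than eyeballing the first 20 rows, a quick count makes this explicit (output omitted here):

#Count Item_Weight NAs per outlet; only OUT019 and OUT027 should appear
data %>%
        filter(is.na(Item_Weight)) %>%
        count(Outlet_Identifier)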

data %>%
        ggplot(aes(x=Outlet_Identifier, y=Item_Weight)) +
        geom_boxplot(fill = 'gray') +
        theme_classic() +
        bigmart_theme +
        ggtitle('Item Weights Reported by Outlet')
## Warning: Removed 2439 rows containing non-finite values (stat_boxplot).

So it looks like 2 outlets just haven’t reported any data on item weights. Luckily, the distributions of Item_Weight don’t seem to vary between locations (to an almost unrealistic degree, especially considering that these data include 4 categories of outlet types, but it works in our favor here).

#Take a quick look at the distribution of 'Item_Weight' to see if there's any reason to prefer mean over median or vice versa
summary(data$Item_Weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   4.555   8.710  12.600  12.790  16.750  21.350    2439

Doesn’t appear that the choice will make much of a difference. We’ll go ahead and impute by the median for ‘Item_Weight’.

#Impute Item_Weight by its median to take care of missing values
data <- data %>%
        replace_na(list(Item_Weight = median(data$Item_Weight, na.rm = TRUE)))
summary(data$Item_Weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.555   9.300  12.600  12.760  16.000  21.350

Outlet Size Nulls
Let’s quickly look at which outlets have NAs for Outlet_Size

data %>%
        ggplot(aes(x=Outlet_Identifier, y=Outlet_Size)) +
        geom_point(shape = 8, size = 5) +
        theme_classic() +
        bigmart_theme +
        ggtitle('Outlet Size by Outlet')

#Print Outlet_Types of the 3 outlets missing Outlet_Size:
data %>% filter(Outlet_Identifier == 'OUT010') %>% distinct(Outlet_Type)
## # A tibble: 1 x 1
##   Outlet_Type  
##   <chr>        
## 1 Grocery Store
data %>% filter(Outlet_Identifier == 'OUT013') %>% distinct(Outlet_Type)
## # A tibble: 1 x 1
##   Outlet_Type      
##   <chr>            
## 1 Supermarket Type1
data %>% filter(Outlet_Identifier == 'OUT045') %>% distinct(Outlet_Type)
## # A tibble: 1 x 1
##   Outlet_Type      
##   <chr>            
## 1 Supermarket Type1

So all we need to do here is replace the nulls in Outlet_Size with the modal Outlet_Size for the given outlet’s Outlet_Type.
We can impute by the mode grouped by ‘Outlet_Type’

calculate_mode <- function(x) {
  #Return the most frequently occurring value in x
  uniq <- unique(x)
  uniq[which.max(tabulate(match(x, uniq)))]  #value with the highest count
}

data %>%
    group_by(Outlet_Type) %>%
    summarise(mode = calculate_mode(Outlet_Size))
## # A tibble: 4 x 2
##   Outlet_Type       mode  
##   <chr>             <chr> 
## 1 Grocery Store     <NA>  
## 2 Supermarket Type1 Small 
## 3 Supermarket Type2 Medium
## 4 Supermarket Type3 Medium

It stands to reason that grocery stores would be the smallest type.

#Replace NAs in Outlet_Size with the modal size for the outlet's Outlet_Type.
#Note: It would actually be quicker for me (and R) in this case to just assign the value 'Small' to our 2 Supermarket Type1 and 1 Grocery Store outlets with NAs in Outlet_Size, but writing out the conditional is just more fun ;p And it's an example of what one might do if there were more null cases
data <- data %>%
        mutate(Outlet_Size = ifelse(is.na(Outlet_Size),
                                    ifelse(Outlet_Type %in% c('Grocery Store', 'Supermarket Type1'),
                                           'Small', 'Medium'),
                                    Outlet_Size))

#Once more print out the number of missing values for each column, confirming that Item_Weight and Outlet_Size are indeed null-free
sapply(data, function(x) sum(is.na(x)))
##           Item_Identifier               Item_Weight 
##                         0                         0 
##          Item_Fat_Content           Item_Visibility 
##                         0                         0 
##                 Item_Type                  Item_MRP 
##                         0                         0 
##         Outlet_Identifier Outlet_Establishment_Year 
##                         0                         0 
##               Outlet_Size      Outlet_Location_Type 
##                         0                         0 
##               Outlet_Type         Item_Outlet_Sales 
##                         0                      5681 
##                    source 
##                         0

Next we’ll perform some feature engineering - merging and/or modifying variables where it seems appropriate.

Outlet Type Combinations?
First we’ll take a look at the ‘Outlet_Type’ variable, in particular, whether Supermarket Type2 and Supermarket Type3 ought to be combined into one category.

#Print out the mean sales grouped by Outlet_Type
data %>%
        group_by(Outlet_Type) %>%
        summarise(mean_sales = mean(Item_Outlet_Sales, na.rm = TRUE))
## # A tibble: 4 x 2
##   Outlet_Type       mean_sales
##   <chr>                  <dbl>
## 1 Grocery Store           340.
## 2 Supermarket Type1      2316.
## 3 Supermarket Type2      1995.
## 4 Supermarket Type3      3694.

The mean sales prove to be substantially different for Supermarket Type2 and Type3 (in fact there’s a decent spread overall), so we won’t merge these categories. Merging two categories that may each be impacting our output variable differently could confound our analysis.
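
For anyone wanting a quick statistical check of that gap, a Welch two-sample t-test on the labelled training rows would look something like this (just a rough sketch; sales are skewed, so take the p-value with a grain of salt):

#Informal check of the Type2 vs. Type3 mean-sales difference
type23 <- data %>%
        filter(source == 'train',
               Outlet_Type %in% c('Supermarket Type2', 'Supermarket Type3'))
t.test(Item_Outlet_Sales ~ Outlet_Type, data = type23)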

Item Visibility
Second, we’ll get back to the issue of ‘Item_Visibility’, which had the mysterious minimum of zero.
The zero values are most likely just missing information, so we’ll impute Item_Visibility based on the given item’s (Item_Identifier’s) mean visibility.

#Group by the item ID and replace zero values in Item_Visibility with the corresponding item's average visibility across all stores
data <- data %>%
        group_by(Item_Identifier) %>%
        mutate(Item_Visibility = replace(Item_Visibility, Item_Visibility == 0, mean(Item_Visibility)))
summary(data$Item_Visibility)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.003575 0.031150 0.057190 0.069710 0.096930 0.328400

Now we have absolute visibility stats for all products in all stores, but in order to determine how a particular product’s visibility at a given location impacts its sales at that location, we need to know how the product’s visibility at that particular location compares with the product’s visibility at other locations.
To understand this, we’ll need to create a new metric, namely, the ratio of the product’s visibility at a given store to its mean visibility across all stores.

#Calculate the new variable, 'Item_Visibility_MeanRatio'
data <- data %>%
        group_by(Item_Identifier) %>%
        mutate(Item_Mean_Visibility = mean(Item_Visibility)) %>%
        mutate(Item_Visibility_MeanRatio = Item_Visibility / Item_Mean_Visibility) %>%
        select(-Item_Mean_Visibility)

Item Type Recategorization

Here let’s take a look at mean sales by Item_Type to see if there’s any worthwhile rearranging we can do

data %>%
        group_by(Item_Type) %>%
        summarise(mean_by_type = mean(Item_Outlet_Sales, na.rm=TRUE)) %>%
        arrange(desc(mean_by_type))
## # A tibble: 16 x 2
##    Item_Type             mean_by_type
##    <chr>                        <dbl>
##  1 Starchy Foods                2374.
##  2 Seafood                      2326.
##  3 Fruits and Vegetables        2289.
##  4 Snack Foods                  2277.
##  5 Household                    2259.
##  6 Dairy                        2233.
##  7 Canned                       2225.
##  8 Breads                       2204.
##  9 Meat                         2159.
## 10 Hard Drinks                  2139.
## 11 Frozen Foods                 2133.
## 12 Breakfast                    2112.
## 13 Health and Hygiene           2010.
## 14 Soft Drinks                  2007.
## 15 Baking Goods                 1953.
## 16 Others                       1926.

I’m curious to see what the ‘Others’ category looks like

data %>%
        filter(Item_Type == 'Others') %>%
        do(head(.))
## # A tibble: 180 x 14
## # Groups:   Item_Identifier [30]
##    Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type
##    <chr>                 <dbl> <chr>                      <dbl> <chr>    
##  1 NCI31                  20.0 Low Fat                   0.0817 Others   
##  2 NCI31                  20.0 Low Fat                   0.0815 Others   
##  3 NCI31                  20.0 low fat                   0.136  Others   
##  4 NCI31                  20.0 Low Fat                   0.0815 Others   
##  5 NCI31                  12.6 Low Fat                   0.0809 Others   
##  6 NCI31                  20.0 Low Fat                   0.0813 Others   
##  7 NCJ19                  18.6 LF                        0.118  Others   
##  8 NCJ19                  18.6 Low Fat                   0.118  Others   
##  9 NCJ19                  18.6 Low Fat                   0.198  Others   
## 10 NCJ19                  12.6 Low Fat                   0.118  Others   
## # ... with 170 more rows, and 9 more variables: Item_MRP <dbl>,
## #   Outlet_Identifier <chr>, Outlet_Establishment_Year <int>,
## #   Outlet_Size <chr>, Outlet_Location_Type <chr>, Outlet_Type <chr>,
## #   Item_Outlet_Sales <dbl>, source <chr>, Item_Visibility_MeanRatio <dbl>

Actually, by going through all 180 entries, it is clear that all ‘Others’ items are some form of non-consumable (marked with NC in the Item_Identifier column). I also just noticed that the NC items have an Item_Fat_Content entry as well, namely ‘Low Fat’. We’ll change this later.
Let’s find out what Item_Types are included in the non-consumables category

data %>%
        filter(grepl('NC', Item_Identifier)) %>%
        group_by(Item_Type) %>%
        distinct(Item_Type)
## # A tibble: 3 x 1
## # Groups:   Item_Type [3]
##   Item_Type         
##   <chr>             
## 1 Household         
## 2 Health and Hygiene
## 3 Others

Only these 3 Item_Types appear among the non-consumables. It may make sense to combine ‘Others’ with ‘Health and Hygiene’, but mean sales for the Household type are considerably higher than for the other two.

As such, maybe there’s a better way to group Item_Types than by ID code. (In addition to NC = ‘non-consumables’, there are FD = ‘foods’ and DR = ‘drinks’, but it doesn’t immediately seem to make sense to combine item types into these three categories.)
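
For reference, pulling out that ID-prefix category would be a one-liner if we ever want it (a sketch; the ID_Category column name is just my own label, not part of the dataset):

#Derive the broad category encoded in the first two letters of Item_Identifier
data %>%
        ungroup() %>%
        mutate(ID_Category = recode(substr(Item_Identifier, 1, 2),
                                    FD = 'Food', DR = 'Drinks', NC = 'Non-Consumable')) %>%
        count(ID_Category)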

In fact, let’s combine item types by utility/necessity, which seems to me like it may be more correlated with sales than whether the item happens to be food or not. Consumers will be looking to purchase some of all three ID categories (food, beverage, and non-consumable) in order to sustain their livelihood. Is there not some form of decreasing marginal returns to consumer surplus when filling your shopping cart with food vs. beverage vs. household goods, etc.? Won’t you get more benefit from the first unit of laundry detergent than the 400th potato?

So, that said, let’s group the item types as best we can by their average sales and a judgement of necessity.

Looking at the mean_by_type column for the 16 item types above, let’s see if we can’t create groups based on the mean sales, and how likely the AVERAGE consumer may be to buy from a certain group.

What first catches my eye is the top selling item types, from ‘Starchy Foods’ down to ‘Breads’. Once you get into the ‘Meat’ and ‘Hard Drinks’ categories, you may be cutting out a larger portion of consumers who won’t be purchasing these items, or at least not as regularly as, say, ‘Fruits and Vegetables’.
From ‘Meat’ down to ‘Breakfast’ seems to make a nice 2nd category. After ‘Breakfast’, average sales drop by over 100, and we get categories that you’d maybe just purchase in smaller quantities when you run out of supplies (namely ‘Health and Hygiene’ and ‘Baking Goods’), or maybe on special occasions (for ‘Soft Drinks’).

Note: The top two categories, ‘Starchy Foods’ and ‘Seafood’, also stand out to me, but ‘Starchy Foods’ just seems like a really vague way to categorize (aren’t ‘Breads’ starchy as well?), and we don’t have enough information in this dataset to know exactly what’s in there (though I imagine it would be pastas, rice, etc.). That said, I’m hesitant to put it in one category alone with ‘Seafood’. Seafood tends to be expensive in the U.S., so it makes sense that mean sales would be relatively high, whereas ‘Starchy Foods’ is probably leading sales due to the quantity that customers buy. These are completely different products with high sales for different reasons, so I won’t create a new category on the basis of sales numbers just for these two.

OK, so let’s go ahead and label these 3 groups in a new column

#Add item types to variables representing our new categories
sales_high <- c('Starchy Foods', 'Seafood', 'Fruits and Vegetables', 'Snack Foods', 'Household', 'Dairy', 'Canned', 'Breads')
sales_med <- c('Meat', 'Hard Drinks', 'Frozen Foods', 'Breakfast')

#Create the categories 'High Sales', 'Medium Sales', and 'Low Sales' in new column, 'Item_Type_Aggregate'
data <- data %>%
        ungroup() %>%
        mutate(Item_Type_Aggregate = ifelse(Item_Type %in% sales_high, 'High Sales',
                                            ifelse(Item_Type %in% sales_med, 'Medium Sales', 'Low Sales')))
data %>%
        group_by(Item_Type_Aggregate) %>%
        distinct(Item_Type_Aggregate)
## # A tibble: 3 x 1
## # Groups:   Item_Type_Aggregate [3]
##   Item_Type_Aggregate
##   <chr>              
## 1 High Sales         
## 2 Low Sales          
## 3 Medium Sales
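
And to see how the rows distribute across the three new groups (counts not shown here):

#Count rows per aggregated sales category
data %>%
        count(Item_Type_Aggregate)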

Outlet Establishment Year
Next we come to a variable we can definitely improve on for our analysis: ‘Outlet_Establishment_Year’.
We’re going to create a new variable, ‘Outlet_Years’, to more intuitively represent location age.

#Subtract year established from 2013, as 2013 is the year in which this sales data was collected
data  <- data %>%
        mutate(Outlet_Years = 2013 - Outlet_Establishment_Year)
summary(data$Outlet_Years)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00    9.00   14.00   15.17   26.00   28.00

Item Fat Content
Next we’ll deal with the mislabelling of the Item_Fat_Content categories

#Just a reminder of what the existing category names are
data %>%
        group_by(Item_Fat_Content) %>%
        distinct(Item_Fat_Content)
## # A tibble: 5 x 1
## # Groups:   Item_Fat_Content [5]
##   Item_Fat_Content
##   <chr>           
## 1 Low Fat         
## 2 Regular         
## 3 low fat         
## 4 LF              
## 5 reg
#Replace redundant category names with 'Low Fat' and 'Regular' for the sake of consistency
data$Item_Fat_Content <- recode(data$Item_Fat_Content,
                                LF = 'Low Fat', 
                                `low fat` = 'Low Fat',
                                reg = 'Regular')
  
#And we mustn't forget to change the fat content label for non-consumable items
data <- data %>%
        mutate(Item_Fat_Content = replace(Item_Fat_Content, grepl('NC', Item_Identifier), 'Non-Consumable'))
#Check to make sure we have just the three expected categories remaining
data %>% group_by(Item_Fat_Content) %>% select(Item_Fat_Content) %>% apply(2, table)
##                Item_Fat_Content
## Low Fat                    6499
## Non-Consumable             2686
## Regular                    5019

Before constructing a model with our tidied and modified data, let’s do some visualization of the relationships between individual attributes and overall sales

Sales on Item_MRP

#Simply plot Item_Outlet_Sales on Item_MRP
ggplot(data, aes(x = Item_MRP, y = Item_Outlet_Sales)) +
        geom_density(stat = 'identity') +
        geom_smooth(color = 'red', size = 1.5) +
        ggtitle('Sales on Item_MRP') +
        bigmart_theme

#And dividing sales by the price to adjust for the fact that sales is basically price * quantity
ggplot(data, aes(x = Item_MRP, y = Item_Outlet_Sales / Item_MRP)) +
        geom_density(stat = 'identity') +
        geom_smooth(color = 'red', size = 1.5) +
        ggtitle('Sales/Item_MRP on Item_MRP') +
        bigmart_theme

As anticipated, the correlation between price and sales comes from the way in which Item_Outlet_Sales is calculated in the first place.
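
If we want a single number for that relationship, the correlation on the training rows does the trick (a quick check; output omitted):

#Correlation between list price and sales in the labelled data
data %>%
        filter(source == 'train') %>%
        summarise(cor_mrp_sales = cor(Item_MRP, Item_Outlet_Sales))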

To see visualizations for a wider range of attributes against Item_Outlet_Sales, check out my Shiny app:
https://warriwes.shinyapps.io/BigmartSales_Viz/

Below are just a few highlights and my comments:

Sales on Item_Visibility_MeanRatio

We can see here that the item visibility distribution is pulled way out by products in grocery stores. What this tells us is that grocery stores in fact carry much less inventory, and therefore the ‘visibility’ accorded to each individual item is automatically greater than it would be in a supermarket
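
A quick numeric check of this (output omitted; grocery stores should show the highest average visibility):

#Average item visibility by outlet type
data %>%
        group_by(Outlet_Type) %>%
        summarise(mean_visibility = mean(Item_Visibility))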

Sales on Item_Weight

From this distribution of item weights, we can see where we imputed the median for the OUT019 and OUT027 locations, the two outlets that didn’t report weights. Item_Weight doesn’t look like it will be a very helpful variable anyway, which we suspected from the beginning

Sales on Item_Type_Aggregate

Using sales vs. our modified item types for a clearer picture, we can see that, all in all, outlet type has a major effect on sales. Interestingly, Type 1 supermarkets overwhelmingly surpass Type 2 supermarkets, but even clearer is the gap between grocery stores and all supermarket types.


Preparing data for model development and testing
Prepare to split back into train and test sets

#Drop columns we replaced
data <- data %>%
        select(-c(Item_Type, Outlet_Establishment_Year))
#Set item and outlet ID pairs to row names and coerce character columns into factors (by converting data to data.frame)
ID_cols <- paste(data$Item_Identifier, data$Outlet_Identifier)
data_df <- as.data.frame(unclass(data), row.names = ID_cols)
#Split back into train and test on 'source' column
train <- data_df %>%
        filter(source == 'train')
test <- data_df %>%
        filter(source == 'test')
#Drop unnecessary columns
train <- train %>%
        select(-source)
test <- test %>%
        select(-c(source, Item_Outlet_Sales))

Export prepared train/test sets to csv files

write_csv(train, 'train_new.csv')
write_csv(test, 'test_new.csv')
#Drop ID columns before modelling
train_1 <- train %>%
        select(-c(Item_Identifier, Outlet_Identifier))

Random Forest Model

Now it’s time to test out our model.

library(randomForest)
rf <- randomForest(Item_Outlet_Sales ~ ., ntree = 100, data = train_1)
plot(rf)

print(rf)
## 
## Call:
##  randomForest(formula = Item_Outlet_Sales ~ ., data = train_1,      ntree = 100) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 1232897
##                     % Var explained: 57.66
varImpPlot(rf, sort = T, n.var = 10, 
           main = 'Variable Importance')

Just another way to visualize the results:

#Create variable importance data.frame
var_imp = data.frame(importance(rf, type=2))

#Add row names columns
var_imp$Variables = row.names(var_imp)  
var_imp <- var_imp[order(var_imp$IncNodePurity, decreasing = T),]

#Plot bar graph showing variable importance 
ggplot(var_imp, aes(x = reorder(Variables, IncNodePurity), y = IncNodePurity, fill = IncNodePurity)) + 
        geom_col() +
        theme(
                axis.text.x = element_text(angle = 45, hjust = 0.85),
                legend.position = 'none'
                ) +
        bold_theme +
        ggtitle('Variable Importance (Sales ~)') +
        coord_flip()

This makes it clear how biased the model is toward Item_MRP.
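
As an aside, turning this model into contest predictions would look something like the sketch below (the submission columns and file name here are my guesses at the required format):

#Predict sales for the prepared test set and assemble a submission file
test_1 <- test %>%
        select(-c(Item_Identifier, Outlet_Identifier))
submission <- tibble(Item_Identifier   = test$Item_Identifier,
                     Outlet_Identifier = test$Outlet_Identifier,
                     Item_Outlet_Sales = predict(rf, newdata = test_1))
write_csv(submission, 'submission.csv')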


Let’s say we instead model the quantity of items sold (Sales / MRP)…

library(randomForest)
rf <- randomForest(Item_Outlet_Sales / Item_MRP ~ ., ntree = 100, data = train_1)
plot(rf)

print(rf)
## 
## Call:
##  randomForest(formula = Item_Outlet_Sales/Item_MRP ~ ., data = train_1,      ntree = 100) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 50.92052
##                     % Var explained: 39.74
varImpPlot(rf, sort = T, n.var = 9, 
           main = 'Variable Importance (Sales / Item_MRP ~)')

#Create variable importance data.frame
var_imp = data.frame(importance(rf, type=2))

#Add row names columns
var_imp$Variables = row.names(var_imp)  
var_imp <- var_imp[order(var_imp$IncNodePurity, decreasing = T),]

#Plot bar graph showing variable importance 
ggplot(var_imp, aes(x = reorder(Variables, IncNodePurity), y = IncNodePurity, fill = IncNodePurity)) +
        geom_col() +
        theme(
                axis.text.x = element_text(angle = 45, hjust = 0.85),
                legend.position = 'none'
                ) +
        bold_theme +
        ggtitle('Variable Importance (Sales / Item_MRP ~)') +
        xlab('Variables') +
        coord_flip()

Removing Item_MRP gives us a clearer picture of how the other attributes influence the quantity of a given item that is sold.