WarriWes
March 25, 2018
Data and other contest resources can be found on the Analytics Vidhya website:
https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/
try to understand the properties of products and stores which play a key role in increasing sales at BigMart.
Let’s begin by reading in the data
Looking at the structure of our data.
## Classes 'tbl_df', 'tbl' and 'data.frame': 14204 obs. of 13 variables:
## $ Item_Identifier : chr "FDA15" "DRC01" "FDN15" "FDX07" ...
## $ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
## $ Item_Fat_Content : chr "Low Fat" "Regular" "Low Fat" "Regular" ...
## $ Item_Visibility : num 0.016 0.0193 0.0168 0 0 ...
## $ Item_Type : chr "Dairy" "Soft Drinks" "Meat" "Fruits and Vegetables" ...
## $ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
## $ Outlet_Identifier : chr "OUT049" "OUT018" "OUT049" "OUT010" ...
## $ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
## $ Outlet_Size : chr "Medium" "Medium" "Medium" NA ...
## $ Outlet_Location_Type : chr "Tier 1" "Tier 3" "Tier 1" "Tier 3" ...
## $ Outlet_Type : chr "Supermarket Type1" "Supermarket Type2" "Supermarket Type1" "Grocery Store" ...
## $ Item_Outlet_Sales : num 3735 443 2097 732 995 ...
## $ source : chr "train" "train" "train" "train" ...
So we have 12 columns here, plus the ‘source’ column we added. Our first task is to think about which variables we can actually use in our analysis.
That is,
1. is the variable something that would plausibly have an impact on the sales of the product,
2. given the variable represents a metric we are interested in, is the data included physically sufficient for us to use (e.g. not 80% NA values, etc.)
Right off the bat, we can see there are a few variables that may not be interesting as they stand now, just by looking at the names and first few values:
Item_weight: Customers won’t generally inclined to buy a product because it is heavier, and the heaviness off a product shouldn’t be a factor when they decide on their purchase. If it is really a larger product, say an appliance, the weight will generally be true of all similar appliances, and wouldn’t be specific to that particular Item_ID
Item_MRP: The MRP (list price) will likely be correlated with sales automatically since sales are calculated as the product of price and quantity. We may want to adjust for this later
Outlet_Establishment_Year: While the ‘run down’ factor could be influential in the popularity of a location, the establishment year doesn’t give us this information. Neither does it tell us whether or not an older location has developed a stronger longterm customer base.
Now, looking at the remaining 9 variables we intend to analyze:
Item_Identifier: Product ID (used to identify each unique product)
Item_Fat_Content: Consumers generally attracted to items with lower fat
Item_Visibility: Higher visibility means customers are more likely to actually see the product during their trip to the store, and perhaps this increases the likelihood that a product will be purchased
Item_Type: Product category may be correlated with sales
Outlet_Identifier: Store ID (used to identify each location)
Outlet_Size: Linked with store capacity; customers will be more inclined to visit stores with a wide variety of products so as to do all their shopping in one place (convenience factor)
Outlet_Location_Type: Tier 1~3, More urban (tier 1) stores ought to have the benefit of higher population density and generally customers with higher incomes
Outlet_Type: Supermarket Type1/2/3 or Grocery Store. Again the capacity and variety available makes locations more or less attractive (convenience factor)
Item_Outlet_Sales: Our dependent variable; the outcome we are looking to explain
Check for null values
Now we’ll look for NAs and get to know what we’re physically working with in greater detail
## Print number of NA values in each column in data
sapply(data, function(x) sum(length(which(is.na(x))))) ## Item_Identifier Item_Weight
## 0 2439
## Item_Fat_Content Item_Visibility
## 0 0
## Item_Type Item_MRP
## 0 0
## Outlet_Identifier Outlet_Establishment_Year
## 0 0
## Outlet_Size Outlet_Location_Type
## 4016 0
## Outlet_Type Item_Outlet_Sales
## 0 5681
## source
## 0
Looks like we have 3 variables with a number of NAs.
Note that the 5681 NAs in outlet sales correspond exactly to the number of rows in the ‘test’ dataframe, so we need not worry about these.
We’ll come back to deal with the ‘Item_Weight’ and ‘Outlet_Size’ nulls.
## Let's have a look at the quantitative variables in data
data %>%
select_if(is.numeric) %>%
summary() ## Item_Weight Item_Visibility Item_MRP
## Min. : 4.555 Min. :0.00000 Min. : 31.29
## 1st Qu.: 8.710 1st Qu.:0.02704 1st Qu.: 94.01
## Median :12.600 Median :0.05402 Median :142.25
## Mean :12.793 Mean :0.06595 Mean :141.00
## 3rd Qu.:16.750 3rd Qu.:0.09404 3rd Qu.:185.86
## Max. :21.350 Max. :0.32839 Max. :266.89
## NA's :2439
## Outlet_Establishment_Year Item_Outlet_Sales
## Min. :1985 Min. : 33.29
## 1st Qu.:1987 1st Qu.: 834.25
## Median :1999 Median : 1794.33
## Mean :1998 Mean : 2181.29
## 3rd Qu.:2004 3rd Qu.: 3101.30
## Max. :2009 Max. :13086.97
## NA's :5681
## To inspect our categorical variables, print the number of unique values per column
sapply(data, function(x) length(unique(x))) ## Item_Identifier Item_Weight
## 1559 416
## Item_Fat_Content Item_Visibility
## 5 13006
## Item_Type Item_MRP
## 16 8052
## Outlet_Identifier Outlet_Establishment_Year
## 10 9
## Outlet_Size Outlet_Location_Type
## 4 3
## Outlet_Type Item_Outlet_Sales
## 4 3494
## source
## 2
So looks like we have 1559 unique items, 10 unique store locations.
Next we’ll print the frequency of each unique category for each of our categorical variables (excluding Item/Outlet ID, source)
categorical_freqs <- data %>%
select_if(is.character) %>%
select(-one_of(c('Item_Identifier', 'Outlet_Identifier', 'source'))) %>%
apply(2, table)
categorical_freqs ## $Item_Fat_Content
##
## LF low fat Low Fat reg Regular
## 522 178 8485 195 4824
##
## $Item_Type
##
## Baking Goods Breads Breakfast
## 1086 416 186
## Canned Dairy Frozen Foods
## 1084 1136 1426
## Fruits and Vegetables Hard Drinks Health and Hygiene
## 2013 362 858
## Household Meat Others
## 1548 736 280
## Seafood Snack Foods Soft Drinks
## 89 1989 726
## Starchy Foods
## 269
##
## $Outlet_Size
##
## High Medium Small
## 1553 4655 3980
##
## $Outlet_Location_Type
##
## Tier 1 Tier 2 Tier 3
## 3980 4641 5583
##
## $Outlet_Type
##
## Grocery Store Supermarket Type1 Supermarket Type2 Supermarket Type3
## 1805 9294 1546 1559
Notes:
Looks like fat content has a few redundant labels - LF vs. low fat vs. Low Fat, reg vs. Regular
There are a lot of Item_Type categories with few values. It might be wise to combine some of them
We’ll investigate whether there are significant differences between the Type1, Type2 and Type3 supermarkets. These also could potentially be combined later on
Now it’s time to start tidying up our dataset. Taking into account the above notes/observations, let’s start cleaning up.
First we’ll take care of the null values we discovered in ‘Item_Weight’ and ‘Outlet_Size’
It appears that these NA values in Item_Weight are coming from the OUT027 and OUT019 locations
## Warning: Removed 2439 rows containing non-finite values (stat_boxplot).
So it looks like 2 outlets just haven’t reported any data on item weights. Luckily the distributions of Item_Weights don’t seem to vary between locations (to the point that it’s unrealistic, especially considering that these data include 4 categories of outlet types.. But whatever it works in favor of our convenience here)
#Take a quick look at the distribution of 'Item_Weight' to see if there's any reason to prefer mean over median or vice versa
summary(data$Item_Weight)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.555 8.710 12.600 12.790 16.750 21.350 2439
Doesn’t appear that the choice will make much of a difference. We’ll go ahead and impute by the median for ‘Item_Weight’.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.555 9.300 12.600 12.760 16.000 21.350
Outlet Size Nulls
Let’s quickly look at which outlets have NAs for Outlet_Size
## # A tibble: 1 x 1
## Outlet_Type
## <chr>
## 1 Grocery Store
## # A tibble: 1 x 1
## Outlet_Type
## <chr>
## 1 Supermarket Type1
## # A tibble: 1 x 1
## Outlet_Type
## <chr>
## 1 Supermarket Type1
So all we need to do here is replace the nulls in Outlet_Size with the mode Outlet_Size for a given outlet’s Outlet_Type.
We can impute by the mode grouped by ‘Outlet_Type’
## # A tibble: 4 x 2
## Outlet_Type mode
## <chr> <chr>
## 1 Grocery Store <NA>
## 2 Supermarket Type1 Small
## 3 Supermarket Type2 Medium
## 4 Supermarket Type3 Medium
It stands to reason that grocery stores would be the the smallest type.
Once more print out the number of missing values for each column, confirming that Item_Weight and Outlet_Size are indeed null-free.
## Item_Identifier Item_Weight
## 0 0
## Item_Fat_Content Item_Visibility
## 0 0
## Item_Type Item_MRP
## 0 0
## Outlet_Identifier Outlet_Establishment_Year
## 0 0
## Outlet_Size Outlet_Location_Type
## 0 0
## Outlet_Type Item_Outlet_Sales
## 0 5681
## source
## 0
Item Visibility
We come back to the issue of ‘Item_Visibility’, which had the mysterious minimum of zero.
The zero values are most likely just missing information, so we’ll impute Item_Visibility based on the given item’s (Item_Identifier’s) mean visibility.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003575 0.031150 0.057190 0.069710 0.096930 0.328400
Now we have absolute visibility stats for all products in all stores, but in order to determine how a particular product’s visibility at a given location impacts its sales at that location, we need to know how the product’s visibility at that particular location compares with the product’s visibility at other locations.
To understand this, we’ll need to create a new metric, namely, the ratio of the product’s visibility at a given store to its mean visibility across all stores.
#Calculate the new variable, 'Item_Visibility_MeanRatio'
data <- data %>%
group_by(Item_Identifier) %>%
mutate(Item_Mean_Visibility = mean(Item_Visibility)) %>%
mutate(Item_Visibility_MeanRatio = Item_Visibility / Item_Mean_Visibility) %>%
select(-Item_Mean_Visibility)Item Type Recategorization
Here we rearranged item types into more singificant-looking groups based on considerations of mean sales and a measure of daily necessity.
## # A tibble: 3 x 1
## # Groups: Item_Type_Aggregate [3]
## Item_Type_Aggregate
## <chr>
## 1 High Sales
## 2 Low Sales
## 3 Medium Sales
Outlet Establishment Year
Next we come to a variable we can definitely improve on for our analysis: ‘Outlet_Establishment_Year’.
We’re going to create a new variable, ‘Outlet_Years’, to more intuitively represent location age.
#Subtract year established from 2013, as 2013 is the year in which this sales data was collected
data <- data %>%
mutate(Outlet_Years = 2013 - Outlet_Establishment_Year)
summary(data$Outlet_Years)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 9.00 14.00 15.17 26.00 28.00
Item Fat Content
Next we’ll deal with the mislabelling of the Item_Fat_Content categories
#Replace redundant category names with 'Low Fat' and 'Regular' for the sake of consistency
data$Item_Fat_Content <- recode(data$Item_Fat_Content,
LF = 'Low Fat',
`low fat` = 'Low Fat',
reg = 'Regular')
#And we musn't forget to change the fat content label for non-consumable items
data <- data %>%
mutate(Item_Fat_Content = replace(Item_Fat_Content, grepl('NC', Item_Identifier), 'Non-Consumable'))
#Check to make sure we have just the two categories remaining
data %>% group_by(Item_Fat_Content) %>% select(Item_Fat_Content) %>% apply(2, table)## Item_Fat_Content
## Low Fat 6499
## Non-Consumable 2686
## Regular 5019
Sales on Item_MRP
#Simply Item_Outlet_Sales on Item_MRP
ggplot(data, aes(x = Item_MRP, y = Item_Outlet_Sales)) +
geom_density(stat = 'identity') +
geom_smooth(color = 'red', size = 1.5) +
ggtitle('Sales on Item_MRP') +
bigmart_theme#And dividing sales by the price to adjust for the fact that sales is basically price * quantity
ggplot(data, aes(x = Item_MRP, y = Item_Outlet_Sales / Item_MRP)) +
geom_density(stat = 'identity') +
geom_smooth(color = 'red', size = 1.5) +
ggtitle('Sales/Item_MRP on Item_MRP') +
bigmart_themeAs anticipated, the correlation between price and sales comes from the way in which Item_Outlet_Sales is calculated to begin with
To see visualizations for a wider range of attributes against Item_Outlet_Sales, check out my Shiny app:
https://warriwes.shinyapps.io/BigmartSales_Viz/
Sales on Item_Visibility_MeanRatio
We can see here that the item visibility distribution is pulled way out by products in grocery stores. What this tells us is that grocery stores in fact have much less inventory, and therefore the ‘visibility’ awarded to each individual item is automatically greater than it would be in a supermarket
Sales on Item_Weight
From this distribution of item weights, we can see where we imputed the median for the OUT019 and OUT027 locations, which happened to be the two Type 3 supermarkets. Item_Weight doesn’t look like it will be a very helpful variable anyway, which we suspected from the beginning
Sales on Item_Type_Aggregate
Just using sales vs. our modified item types for a clear picture, we can see that all-in-all, outlet type has a major effect on sales. Interestingly, Type 1 supermarkets surpass Type 2 supermarkets overwhelmingly, but even more clear is the gap between grocery stores and all supermarket types.
Preparing data for model development and testing
Prepare to split back into train and test sets
#Drop columns we replaced
data <- data %>%
select(-c(Item_Type, Outlet_Establishment_Year))
#Set item and outlet ID pairs to row names and coerce character columns into factors (by converting data to data.frame)
ID_cols <- paste(data$Item_Identifier, data$Outlet_Identifier)
data_df <- as.data.frame(unclass(data), row.names = ID_cols)
#Split back into train and test on 'source' column
train <- data_df %>%
filter(source == 'train')
test <- data_df %>%
filter(source == 'test')
#Drop unnecessary columns
train <- train %>%
select(-source)
test <- test %>%
select(-c(source, Item_Outlet_Sales))Random Forest Model
Now it’s time to test out our model..
library(randomForest)rf <- randomForest(Item_Outlet_Sales ~ ., ntree = 100, data = train_1)
plot(rf)Making it clear how biased the model is toward Item_MRP
library(randomForest)
rf <- randomForest(Item_Outlet_Sales / Item_MRP ~ ., ntree = 100, data = train_1)
plot(rf)Removing Item_MRP gives us a clearer picture of how the other attributes influence the quantity of a given item that is sold
Clearly the MRP of a product will impact its total sales. This is true of any item in any location, but we can draw more helpful conclusions from the feature importance coefficients on the previous slide, with Item_MRP taken out of the picture, in addition to the univariate analysis we performed earlier.
Outlet_Type is a major factor in how much of an item is sold, primarily as a result of the high variation between sales at grocery stores vs. supermarkets. It would be interesting to take grocery stores out of the equation entirely in order to observe the resulting impact on Outlet_Type importance, as well as that of Item_Visibility, plus it would allow us to consider the variation between different supermarket types in more detail.
Item_Visibility, Item_Visibility_MeanRatio are influential in our model, in fact the ratio variable we engineered is second only to outlet type after adjusting for Item_MRP. This tells us that in fact the overall organization and display strategy taken at Big Mart stores is an incredibly important part of their stores’ operations. This ought to be taken into consideration when making decisions about store layout, as well as product promotions, etc.
Attributes such as Item_Fat_Content turn out to be relatively insignificant when it comes to overall sales. This is likely because consumer choices with respect to nutrition, etc., are not homogenous throughout the customer population. Also, ‘Low_Fat’ does not necessarily mean ‘nutritious’, and with this dataset we don’t have access to information such as whether these items were specifically advertised as low-fat/healthy, or whether that was just Big Mart’s catoregization, for example.
One other note I will mention is that personally I thought Outlet_Location_Type might play a larger role in determining sales figures. However, I suppose the location of a given outlet has more impact on the outlet’s total sales and turnover, as opposed to the sales of unique items. For instance it stands to reason that outlets located in urban areas with a larger customer base would stock a wider variety of products, and maybe have less room to boost individual items’ visibility numbers. Also our Outlet_Type variable may be capturing a portion of the effect of Outlet_Location_Type.
I think there is certainly room to go deeper with this dataset. I would consider the above a debriefing for the client to keep in mind, and look forward to picking apart the nitty gritty details.