Data and other contest resources can be found on the Analytics Vidhya website:
https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/
So the goal here is to try to understand the properties of products and stores which play a key role in increasing sales at BigMart.
Let’s begin by reading in the data
library(tidyverse)
## Read in train and test data
train <- read_csv('train.csv')
test <- read_csv('test.csv')
## We'll label our data based on whether it comes from the train or test set to be separated later, but for now we'll join the tibbles to perform some tidying operations
train['source'] <- 'train'
test['source'] <- 'test'
data <- train %>%
  full_join(test) #full_join matches on all shared columns by default
## Check out the dimensions of train, test, and data just to make sure everything joined as planned
dim(train)
## [1] 8523 13
dim(test)
## [1] 5681 12
dim(data)
## [1] 14204 13
## Look at the data in a bit more detail
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 14204 obs. of 13 variables:
## $ Item_Identifier : chr "FDA15" "DRC01" "FDN15" "FDX07" ...
## $ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
## $ Item_Fat_Content : chr "Low Fat" "Regular" "Low Fat" "Regular" ...
## $ Item_Visibility : num 0.016 0.0193 0.0168 0 0 ...
## $ Item_Type : chr "Dairy" "Soft Drinks" "Meat" "Fruits and Vegetables" ...
## $ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
## $ Outlet_Identifier : chr "OUT049" "OUT018" "OUT049" "OUT010" ...
## $ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
## $ Outlet_Size : chr "Medium" "Medium" "Medium" NA ...
## $ Outlet_Location_Type : chr "Tier 1" "Tier 3" "Tier 1" "Tier 3" ...
## $ Outlet_Type : chr "Supermarket Type1" "Supermarket Type2" "Supermarket Type1" "Grocery Store" ...
## $ Item_Outlet_Sales : num 3735 443 2097 732 995 ...
## $ source : chr "train" "train" "train" "train" ...
head(data)
## # A tibble: 6 x 13
## Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type
## <chr> <dbl> <chr> <dbl> <chr>
## 1 FDA15 9.30 Low Fat 0.0160 Dairy
## 2 DRC01 5.92 Regular 0.0193 Soft Drinks
## 3 FDN15 17.5 Low Fat 0.0168 Meat
## 4 FDX07 19.2 Regular 0. Fruits and~
## 5 NCD19 8.93 Low Fat 0. Household
## 6 FDP36 10.4 Regular 0. Baking Goo~
## # ... with 8 more variables: Item_MRP <dbl>, Outlet_Identifier <chr>,
## # Outlet_Establishment_Year <int>, Outlet_Size <chr>,
## # Outlet_Location_Type <chr>, Outlet_Type <chr>,
## # Item_Outlet_Sales <dbl>, source <chr>
names(data)
## [1] "Item_Identifier" "Item_Weight"
## [3] "Item_Fat_Content" "Item_Visibility"
## [5] "Item_Type" "Item_MRP"
## [7] "Outlet_Identifier" "Outlet_Establishment_Year"
## [9] "Outlet_Size" "Outlet_Location_Type"
## [11] "Outlet_Type" "Item_Outlet_Sales"
## [13] "source"
Preliminary notes
So we have 12 columns here, plus the ‘source’ column we added. Our first task is to think about which variables we can actually use in our analysis.
That is:
1: is the variable something that would plausibly have an impact on the sales of the product, and
2: given that the variable represents a metric we are interested in, is the included data actually sufficient for us to use (e.g. not 80% NA values, etc.)?
Right off the bat, we can see there are a few variables that may not be interesting as they stand now, just by looking at the names and first few values:
–Item_Weight: Customers generally won’t be inclined to buy a product because it is heavier, and the heaviness of a product shouldn’t be a factor when they decide on their purchase. If it really is a larger product, say an appliance, the weight will generally be true of all similar appliances, and wouldn’t be specific to that particular Item_ID
–Item_MRP: The MRP (list price) will likely be correlated with sales automatically since sales are calculated as the product of price and quantity. We may want to adjust for this later
–Outlet_Establishment_Year: While the ‘run down’ factor could be influential in the popularity of a location, the establishment year doesn’t give us this information. Neither does it tell us whether or not an older location has developed a stronger long-term customer base.
Now, looking at the remaining 9 variables we intend to analyze:
–Item_Identifier: Product ID (used to identify each unique product)
–Item_Fat_Content: Consumers are generally attracted to items with lower fat
–Item_Visibility: Higher visibility means customers are more likely to actually see the product during their trip to the store, and perhaps this increases the likelihood that a product will be purchased
–Item_Type: Product category may be correlated with sales
–Outlet_Identifier: Store ID (used to identify each location)
–Outlet_Size: Linked with store capacity; customers will be more inclined to visit stores with a wide variety of products so as to do all their shopping in one place (convenience factor)
–Outlet_Location_Type: Tier 1/2/3; more urban (Tier 1) stores ought to have the benefit of higher population density and customers with generally higher incomes
–Outlet_Type: Supermarket Type1/2/3 or Grocery Store. Again the capacity and variety available makes locations more or less attractive (convenience factor)
–Item_Outlet_Sales: Our dependent variable; the outcome we are looking to explain
## Just as an example, a simple way to look at the different values a discrete variable can take:
unique(data$Outlet_Type)
## [1] "Supermarket Type1" "Supermarket Type2" "Grocery Store"
## [4] "Supermarket Type3"
Check for null values
Now we’ll look for NAs and get to know what we’re physically working with in greater detail
## Print number of NA values in each column in data
sapply(data, function(x) sum(is.na(x)))
## Item_Identifier Item_Weight
## 0 2439
## Item_Fat_Content Item_Visibility
## 0 0
## Item_Type Item_MRP
## 0 0
## Outlet_Identifier Outlet_Establishment_Year
## 0 0
## Outlet_Size Outlet_Location_Type
## 4016 0
## Outlet_Type Item_Outlet_Sales
## 0 5681
## source
## 0
Looks like we have 3 variables with a number of NAs.
Note that the 5681 NAs in outlet sales correspond exactly to the number of rows in the ‘test’ dataframe, so we need not worry about these.
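As a quick sanity check (a minimal sketch, not part of the original pipeline), we can confirm that directly:
#All rows missing Item_Outlet_Sales should come from the test set
data %>%
  filter(is.na(Item_Outlet_Sales)) %>%
  count(source)
#Expected: a single row with source = 'test' and n = 5681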
We’ll come back to deal with the ‘Item_Weight’ and ‘Outlet_Size’ nulls.
Inspecting numeric and categorical variables in further detail
## Let's have a look at the quantitative variables in data
data %>%
select_if(is.numeric) %>%
summary()
## Item_Weight Item_Visibility Item_MRP
## Min. : 4.555 Min. :0.00000 Min. : 31.29
## 1st Qu.: 8.710 1st Qu.:0.02704 1st Qu.: 94.01
## Median :12.600 Median :0.05402 Median :142.25
## Mean :12.793 Mean :0.06595 Mean :141.00
## 3rd Qu.:16.750 3rd Qu.:0.09404 3rd Qu.:185.86
## Max. :21.350 Max. :0.32839 Max. :266.89
## NA's :2439
## Outlet_Establishment_Year Item_Outlet_Sales
## Min. :1985 Min. : 33.29
## 1st Qu.:1987 1st Qu.: 834.25
## Median :1999 Median : 1794.33
## Mean :1998 Mean : 2181.29
## 3rd Qu.:2004 3rd Qu.: 3101.30
## Max. :2009 Max. :13086.97
## NA's :5681
Some points to note:
–Item_Visibility has a minimum of 0, which doesn’t make much sense.
–Outlet_Establishment_Year format won’t help us in our analysis. If anything we’ll change the values to something like ‘years since establishment’.
## To inspect our categorical variables, print the number of unique values per column
sapply(data, function(x) length(unique(x)))
## Item_Identifier Item_Weight
## 1559 416
## Item_Fat_Content Item_Visibility
## 5 13006
## Item_Type Item_MRP
## 16 8052
## Outlet_Identifier Outlet_Establishment_Year
## 10 9
## Outlet_Size Outlet_Location_Type
## 4 3
## Outlet_Type Item_Outlet_Sales
## 4 3494
## source
## 2
So it looks like we have 1559 unique items and 10 unique store locations.
Next we’ll print the frequency of each unique category for each of our categorical variables (excluding Item/Outlet ID, source)
categorical_freqs <- data %>%
select_if(is.character) %>%
select(-one_of(c('Item_Identifier', 'Outlet_Identifier', 'source'))) %>%
apply(2, table)
categorical_freqs
## $Item_Fat_Content
##
## LF low fat Low Fat reg Regular
## 522 178 8485 195 4824
##
## $Item_Type
##
## Baking Goods Breads Breakfast
## 1086 416 186
## Canned Dairy Frozen Foods
## 1084 1136 1426
## Fruits and Vegetables Hard Drinks Health and Hygiene
## 2013 362 858
## Household Meat Others
## 1548 736 280
## Seafood Snack Foods Soft Drinks
## 89 1989 726
## Starchy Foods
## 269
##
## $Outlet_Size
##
## High Medium Small
## 1553 4655 3980
##
## $Outlet_Location_Type
##
## Tier 1 Tier 2 Tier 3
## 3980 4641 5583
##
## $Outlet_Type
##
## Grocery Store Supermarket Type1 Supermarket Type2 Supermarket Type3
## 1805 9294 1546 1559
Notes:
–Looks like fat content has a few redundant labels - LF vs. low fat vs. Low Fat, reg vs. Regular
–There are a lot of Item_Type categories with few values. It might be wise to combine some of them
–We’ll investigate whether there are significant differences between the Type1, Type2 and Type3 supermarkets. These also could potentially be combined later on
Data tidying with dplyr
Now it’s time to start tidying up our dataset. Taking into account the above notes/observations, let’s start cleaning up.
First we’ll take care of the null values we discovered in ‘Item_Weight’ and ‘Outlet_Size’
Item_Weight Nulls
#First check out where these nulls are coming from
data %>%
  filter(is.na(Item_Weight)) %>%
  head(20)
## # A tibble: 20 x 13
## Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type
## <chr> <dbl> <chr> <dbl> <chr>
## 1 FDP10 NA Low Fat 0.127 Snack Foo~
## 2 DRI11 NA Low Fat 0.0342 Hard Drin~
## 3 FDW12 NA Regular 0.0354 Baking Go~
## 4 FDC37 NA Low Fat 0.0576 Baking Go~
## 5 FDC14 NA Regular 0.0722 Canned
## 6 FDV20 NA Regular 0.0595 Fruits an~
## 7 FDX10 NA Regular 0.123 Snack Foo~
## 8 FDB34 NA Low Fat 0.0265 Snack Foo~
## 9 FDS02 NA Regular 0.255 Dairy
## 10 FDI26 NA Low Fat 0.0611 Canned
## 11 FDF09 NA Low Fat 0.0121 Fruits an~
## 12 FDY40 NA Regular 0.150 Frozen Fo~
## 13 FDY45 NA Low Fat 0.0260 Snack Foo~
## 14 FDN48 NA Low Fat 0.114 Baking Go~
## 15 NCL18 NA Low Fat 0.293 Household
## 16 FDR12 NA Regular 0.0314 Baking Go~
## 17 FDQ49 NA Regular 0.0391 Breakfast
## 18 FDU04 NA Low Fat 0.00971 Frozen Fo~
## 19 NCP18 NA Low Fat 0.0285 Household
## 20 FDD10 NA Regular 0.0458 Snack Foo~
## # ... with 8 more variables: Item_MRP <dbl>, Outlet_Identifier <chr>,
## # Outlet_Establishment_Year <int>, Outlet_Size <chr>,
## # Outlet_Location_Type <chr>, Outlet_Type <chr>,
## # Item_Outlet_Sales <dbl>, source <chr>
It appears that these NA values in Item_Weight are coming from the OUT027 and OUT019 locations
data %>%
ggplot(aes(x=Outlet_Identifier, y=Item_Weight)) +
geom_boxplot(fill = 'gray') +
theme_classic() +
bigmart_theme +
ggtitle('Item Weights Reported by Outlet')
## Warning: Removed 2439 rows containing non-finite values (stat_boxplot).
So it looks like 2 outlets just haven’t reported any data on item weights. Luckily, the distributions of Item_Weight don’t seem to vary between locations (to a degree that is almost unrealistic, especially considering that these data include 4 categories of outlet types, but it works in favor of our convenience here).
#Take a quick look at the distribution of 'Item_Weight' to see if there's any reason to prefer mean over median or vice versa
summary(data$Item_Weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.555 8.710 12.600 12.790 16.750 21.350 2439
Doesn’t appear that the choice will make much of a difference. We’ll go ahead and impute by the median for ‘Item_Weight’.
#Impute Item_Weight by its median to take care of missing values
data <- data %>%
replace_na(list(Item_Weight = median(data$Item_Weight, na.rm = TRUE)))
summary(data$Item_Weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.555 9.300 12.600 12.760 16.000 21.350
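As an aside, since a given Item_Identifier should weigh the same at every outlet, we could instead have backfilled each missing weight from the same item’s records at the outlets that did report it. A minimal sketch of that alternative (it would need to run before the global-median imputation above; data_alt is a hypothetical name):
#Alternative (not used here): impute each item's weight from that item's own median
data_alt <- data %>%
  group_by(Item_Identifier) %>%
  mutate(Item_Weight = ifelse(is.na(Item_Weight),
                              median(Item_Weight, na.rm = TRUE),
                              Item_Weight)) %>%
  ungroup()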
Outlet Size Nulls
Let’s quickly look at which outlets have NAs for Outlet_Size
data %>%
ggplot(aes(x=Outlet_Identifier, y=Outlet_Size)) +
geom_point(shape = 8, size = 5) +
theme_classic() +
bigmart_theme +
ggtitle('Outlet Size by Outlet')
#Print Outlet_Types of the 3 outlets missing Outlet_Size:
data %>% filter(Outlet_Identifier == 'OUT010') %>% distinct(Outlet_Type)
## # A tibble: 1 x 1
## Outlet_Type
## <chr>
## 1 Grocery Store
data %>% filter(Outlet_Identifier == 'OUT013') %>% distinct(Outlet_Type)
## # A tibble: 1 x 1
## Outlet_Type
## <chr>
## 1 Supermarket Type1
data %>% filter(Outlet_Identifier == 'OUT045') %>% distinct(Outlet_Type)
## # A tibble: 1 x 1
## Outlet_Type
## <chr>
## 1 Supermarket Type1
So all we need to do here is replace the nulls in Outlet_Size with the mode Outlet_Size for a given outlet’s Outlet_Type.
We can impute by the mode grouped by ‘Outlet_Type’
calculate_mode <- function(x) {
  uniq <- unique(x)
  #match() maps each value to its position in uniq, tabulate() counts the
  #occurrences of each position, and which.max() picks the most frequent one
  uniq[which.max(tabulate(match(x, uniq)))]
}
data %>%
group_by(Outlet_Type) %>%
summarise(mode = calculate_mode(Outlet_Size))
## # A tibble: 4 x 2
## Outlet_Type mode
## <chr> <chr>
## 1 Grocery Store <NA>
## 2 Supermarket Type1 Small
## 3 Supermarket Type2 Medium
## 4 Supermarket Type3 Medium
It stands to reason that grocery stores would be the smallest type.
#Replace NAs in Outlet_Size with the mode Outlet_Size for the given outlet's Outlet_Type.
#Note: It is actually quicker for me (and R) in this case to just assign the value 'Small' to our 2 Supermarket Type1 and 1 Grocery Store outlets with NAs in Outlet_Size, but writing out the conditionals is just more fun ;p And it's an example of what one might do if there were more null cases. (Note the vectorized | rather than the scalar ||, and that both branches of each ifelse() are supplied.)
data <- data %>%
  mutate(Outlet_Size = ifelse(is.na(Outlet_Size),
                              ifelse(Outlet_Type == 'Grocery Store' | Outlet_Type == 'Supermarket Type1',
                                     'Small', 'Medium'),
                              Outlet_Size))
#Once more print out the number of missing values for each column, confirming that Item_Weight and Outlet_Size are indeed null-free
sapply(data, function(x) sum(is.na(x)))
## Item_Identifier Item_Weight
## 0 0
## Item_Fat_Content Item_Visibility
## 0 0
## Item_Type Item_MRP
## 0 0
## Outlet_Identifier Outlet_Establishment_Year
## 0 0
## Outlet_Size Outlet_Location_Type
## 0 0
## Outlet_Type Item_Outlet_Sales
## 0 5681
## source
## 0
Next we’ll perform some feature engineering - merging and/or modifying variables where it seems appropriate.
Outlet Type Combinations?
First we’ll take a look at the ‘Outlet_Type’ variable, in particular, whether Supermarket Type2 and Supermarket Type3 ought to be combined into one category.
#Print out the mean sales grouped by Outlet_Type
data %>%
group_by(Outlet_Type) %>%
summarise(mean_sales = mean(Item_Outlet_Sales, na.rm = TRUE))
## # A tibble: 4 x 2
## Outlet_Type mean_sales
## <chr> <dbl>
## 1 Grocery Store 340.
## 2 Supermarket Type1 2316.
## 3 Supermarket Type2 1995.
## 4 Supermarket Type3 3694.
The mean sales prove to be quite different for Supermarket Type2 and Type3 (in fact there’s a decent spread overall), so we won’t merge these categories. Merging two categories that could each be impacting our output variable differently could confound our analysis.
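If we wanted something more formal than eyeballing the group means, a quick Welch two-sample t-test would do (a sketch, not part of the original analysis):
#Formally compare mean sales between Type2 and Type3 supermarkets
type2_sales <- data %>% filter(Outlet_Type == 'Supermarket Type2', !is.na(Item_Outlet_Sales)) %>% pull(Item_Outlet_Sales)
type3_sales <- data %>% filter(Outlet_Type == 'Supermarket Type3', !is.na(Item_Outlet_Sales)) %>% pull(Item_Outlet_Sales)
t.test(type2_sales, type3_sales) #a tiny p-value supports keeping the categories separate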
Item Visibility
Second, we’ll get back to the issue of ‘Item_Visibility’, which had the mysterious minimum of zero.
The zero values are most likely just missing information, so we’ll impute Item_Visibility based on the given item’s (Item_Identifier’s) mean visibility.
#Group by the item ID and replace zero values in Item_Visibility with the corresponding item's average visibility across all stores (computed over the nonzero entries so the zeros don't drag the mean down)
data <- data %>%
  group_by(Item_Identifier) %>%
  mutate(Item_Visibility = replace(Item_Visibility, Item_Visibility == 0,
                                   mean(Item_Visibility[Item_Visibility > 0])))
summary(data$Item_Visibility)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003575 0.031150 0.057190 0.069710 0.096930 0.328400
Now we have absolute visibility stats for all products in all stores, but in order to determine how a particular product’s visibility at a given location impacts its sales at that location, we need to know how the product’s visibility at that particular location compares with the product’s visibility at other locations.
To understand this, we’ll need to create a new metric, namely, the ratio of the product’s visibility at a given store to its mean visibility across all stores.
#Calculate the new variable, 'Item_Visibility_MeanRatio'
data <- data %>%
group_by(Item_Identifier) %>%
mutate(Item_Mean_Visibility = mean(Item_Visibility)) %>%
mutate(Item_Visibility_MeanRatio = Item_Visibility / Item_Mean_Visibility) %>%
select(-Item_Mean_Visibility)
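By construction, each item’s ratios average to 1 across the stores carrying it, so the overall distribution of the new variable should be centered right around 1. A quick check (sketch):
#The new ratio should be centered near 1
summary(data$Item_Visibility_MeanRatio)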
Item Type Recategorization
Here let’s take a look at mean sales by Item_Type to see if there’s any worthwhile rearranging we can do
data %>%
group_by(Item_Type) %>%
summarise(mean_by_type = mean(Item_Outlet_Sales, na.rm=TRUE)) %>%
arrange(desc(mean_by_type))
## # A tibble: 16 x 2
## Item_Type mean_by_type
## <chr> <dbl>
## 1 Starchy Foods 2374.
## 2 Seafood 2326.
## 3 Fruits and Vegetables 2289.
## 4 Snack Foods 2277.
## 5 Household 2259.
## 6 Dairy 2233.
## 7 Canned 2225.
## 8 Breads 2204.
## 9 Meat 2159.
## 10 Hard Drinks 2139.
## 11 Frozen Foods 2133.
## 12 Breakfast 2112.
## 13 Health and Hygiene 2010.
## 14 Soft Drinks 2007.
## 15 Baking Goods 1953.
## 16 Others 1926.
I’m curious to see what the ‘Others’ category looks like
data %>%
  filter(Item_Type == 'Others') %>%
  do(head(.)) #data is still grouped by Item_Identifier here, so head() returns up to 6 rows per group (hence 180 rows)
## # A tibble: 180 x 14
## # Groups: Item_Identifier [30]
## Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type
## <chr> <dbl> <chr> <dbl> <chr>
## 1 NCI31 20.0 Low Fat 0.0817 Others
## 2 NCI31 20.0 Low Fat 0.0815 Others
## 3 NCI31 20.0 low fat 0.136 Others
## 4 NCI31 20.0 Low Fat 0.0815 Others
## 5 NCI31 12.6 Low Fat 0.0809 Others
## 6 NCI31 20.0 Low Fat 0.0813 Others
## 7 NCJ19 18.6 LF 0.118 Others
## 8 NCJ19 18.6 Low Fat 0.118 Others
## 9 NCJ19 18.6 Low Fat 0.198 Others
## 10 NCJ19 12.6 Low Fat 0.118 Others
## # ... with 170 more rows, and 9 more variables: Item_MRP <dbl>,
## # Outlet_Identifier <chr>, Outlet_Establishment_Year <int>,
## # Outlet_Size <chr>, Outlet_Location_Type <chr>, Outlet_Type <chr>,
## # Item_Outlet_Sales <dbl>, source <chr>, Item_Visibility_MeanRatio <dbl>
Actually, by going through all 180 entries, it is clear that all ‘Others’ items are some form of non-consumable (marked with NC in the ‘Item_Identifier’ column). I also just noticed that NC items have an Item_Fat_Content entry, namely ‘Low Fat’. We’ll change this later.
Let’s find out what Item_Types are included in the non-consumables category
data %>%
filter(grepl('NC', Item_Identifier)) %>%
group_by(Item_Type) %>%
distinct(Item_Type)
## # A tibble: 3 x 1
## # Groups: Item_Type [3]
## Item_Type
## <chr>
## 1 Household
## 2 Health and Hygiene
## 3 Others
Only these 3 Item_Types contain non-consumables. It may make sense to combine Others with Health and Hygiene, but mean sales for the Household type are considerably higher than for the other two.
As such, maybe there’s a better way to group Item_Types than by ID code. (In addition to NC = ‘non-consumables’, there is FD = ‘foods’ and DR = ‘drinks’, but it doesn’t immediately seem to make sense to combine item types into these three categories.)
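For reference, here is a quick way to see how the rows split across those three ID prefixes (a sketch; substr() grabs the first two characters of the item ID):
#Count rows by the two-letter prefix of Item_Identifier
data %>%
  ungroup() %>%
  mutate(ID_prefix = substr(Item_Identifier, 1, 2)) %>%
  count(ID_prefix)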
In fact, let’s combine item types by utility/necessity, which seems to me like it may be more correlated with sales than whether the item happens to be food or not. Consumers will be looking to purchase some of all three ID categories (food, beverage, and non-consumable) in order to sustain their livelihood. Is there not some form of decreasing marginal returns to consumer surplus when filling your shopping cart with food vs. beverage vs. household goods, etc.? Won’t you get more benefit from the first unit of laundry detergent than the 400th potato?
So, that said, let’s group the item types as best we can by their average sales and a judgement of necessity.
Looking at the mean_by_type column for the 16 item types above, let’s see if we can’t create groups based on the mean sales, and how likely the AVERAGE consumer may be to buy from a certain group.
What first catches my eye is the top selling item types, from ‘Starchy Foods’ down to ‘Breads’. Once you get into the ‘Meat’ and ‘Hard Drinks’ categories, you may be cutting out a larger portion of consumers who won’t be purchasing these items, or at least not as regularly as, say, ‘Fruits and Vegetables’.
From ‘Meat’ down to ‘Breakfast’ seems to make a nice 2nd category. After ‘Breakfast’, average sales drop by over 100, and we get categories that you’d maybe just purchase in smaller quantities when you run out of supplies (namely ‘Health and Hygiene’ and ‘Baking Goods’), or maybe on special occasions (for ‘Soft Drinks’).
Note: The top two categories, ‘Starchy Foods’ and ‘Seafood’, also stand out to me, but ‘Starchy Foods’ just seems a really vague way to categorize (aren’t ‘Breads’ starchy as well?), and we don’t have enough information in this dataset to know exactly what’s in there (though I imagine it would be pastas, rice, etc.). That said, I’m hesitant to put it in a category alone with ‘Seafood’. Seafood tends to be expensive in the U.S., so it makes sense that mean sales would be relatively high, but ‘Starchy Foods’ is probably leading sales due to the quantity that customers buy. These are completely different products with high sales for different reasons, so I won’t create a new category on the basis of sales numbers just for these two.
OK, so let’s go ahead and label these 3 groups in a new column
#Add item types to variables representing our new categories
sales_high <- c('Starchy Foods', 'Seafood', 'Fruits and Vegetables', 'Snack Foods', 'Household', 'Dairy', 'Canned', 'Breads')
sales_med <- c('Meat', 'Hard Drinks', 'Frozen Foods', 'Breakfast')
#Create the categories 'High Sales', 'Medium Sales', and 'Low Sales' in new column, 'Item_Type_Aggregate'
data <- data %>%
  ungroup() %>%
  mutate(Item_Type_Aggregate = ifelse(Item_Type %in% sales_high, 'High Sales',
                                      ifelse(Item_Type %in% sales_med, 'Medium Sales', 'Low Sales')))
data %>%
group_by(Item_Type_Aggregate) %>%
distinct(Item_Type_Aggregate)
## # A tibble: 3 x 1
## # Groups: Item_Type_Aggregate [3]
## Item_Type_Aggregate
## <chr>
## 1 High Sales
## 2 Low Sales
## 3 Medium Sales
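And a quick look at how many rows land in each new bucket (sketch):
#Row counts per aggregated category
data %>% count(Item_Type_Aggregate)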
Outlet Establishment Year
Next we come to a variable we can definitely improve on for our analysis: ‘Outlet_Establishment_Year’.
We’re going to create a new variable, ‘Outlet_Years’, to more intuitively represent location age.
#Subtract year established from 2013, as 2013 is the year in which this sales data was collected
data <- data %>%
mutate(Outlet_Years = 2013 - Outlet_Establishment_Year)
summary(data$Outlet_Years)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 9.00 14.00 15.17 26.00 28.00
Item Fat Content
Next we’ll deal with the mislabelling of the Item_Fat_Content categories
#Just a reminder of what the existing category names are
data %>%
group_by(Item_Fat_Content) %>%
distinct(Item_Fat_Content)
## # A tibble: 5 x 1
## # Groups: Item_Fat_Content [5]
## Item_Fat_Content
## <chr>
## 1 Low Fat
## 2 Regular
## 3 low fat
## 4 LF
## 5 reg
#Replace redundant category names with 'Low Fat' and 'Regular' for the sake of consistency
data$Item_Fat_Content <- recode(data$Item_Fat_Content,
LF = 'Low Fat',
`low fat` = 'Low Fat',
reg = 'Regular')
#And we musn't forget to change the fat content label for non-consumable items
data <- data %>%
mutate(Item_Fat_Content = replace(Item_Fat_Content, grepl('NC', Item_Identifier), 'Non-Consumable'))
#Check to make sure we have just the three expected categories remaining
data %>% group_by(Item_Fat_Content) %>% select(Item_Fat_Content) %>% apply(2, table)
## Item_Fat_Content
## Low Fat 6499
## Non-Consumable 2686
## Regular 5019
Before constructing a model with our tidied and modified data, let’s do some visualization of the relationships between individual attributes and overall sales.
Sales on Item_MRP
#Simply Item_Outlet_Sales on Item_MRP
ggplot(data, aes(x = Item_MRP, y = Item_Outlet_Sales)) +
geom_density(stat = 'identity') +
geom_smooth(color = 'red', size = 1.5) +
ggtitle('Sales on Item_MRP') +
bigmart_theme
#And dividing sales by the price to adjust for the fact that sales is basically price * quantity
ggplot(data, aes(x = Item_MRP, y = Item_Outlet_Sales / Item_MRP)) +
geom_density(stat = 'identity') +
geom_smooth(color = 'red', size = 1.5) +
ggtitle('Sales/Item_MRP on Item_MRP') +
bigmart_theme
As anticipated, the correlation between price and sales comes from the way in which Item_Outlet_Sales is calculated to begin with
To see visualizations for a wider range of attributes against Item_Outlet_Sales, check out my Shiny app:
https://warriwes.shinyapps.io/BigmartSales_Viz/
Below are just a few highlights and my comments:
Sales on Item_Visibility_MeanRatio
We can see here that the item visibility distribution is pulled way out by products in grocery stores. What this tells us is that grocery stores in fact have much less inventory, and therefore the ‘visibility’ awarded to each individual item is automatically greater than it would be in a supermarket
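A quick check consistent with this reading (a sketch; output not shown): compare average visibility by outlet type.
#Grocery stores should show a noticeably higher average per-item visibility
data %>%
  group_by(Outlet_Type) %>%
  summarise(mean_visibility = mean(Item_Visibility))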
Sales on Item_Weight
From this distribution of item weights, we can see where we imputed the median for the OUT019 and OUT027 locations (one of the grocery stores and the Type 3 supermarket, respectively). Item_Weight doesn’t look like it will be a very helpful variable anyway, which we suspected from the beginning
Sales on Item_Type_Aggregate
Just using sales vs. our modified item types for a clear picture, we can see that, all in all, outlet type has a major effect on sales. Interestingly, Type 1 supermarkets overwhelmingly surpass Type 2 supermarkets, but even clearer is the gap between grocery stores and all supermarket types.
Preparing data for model development and testing
Prepare to split back into train and test sets
#Drop columns we replaced
data <- data %>%
select(-c(Item_Type, Outlet_Establishment_Year))
#Set item and outlet ID pairs to row names and coerce character columns into factors (by converting data to data.frame)
ID_cols <- paste(data$Item_Identifier, data$Outlet_Identifier)
data_df <- as.data.frame(unclass(data), row.names = ID_cols, stringsAsFactors = TRUE) #stringsAsFactors must be set explicitly as of R 4.0
#Split back into train and test on 'source' column
train <- data_df %>%
filter(source == 'train')
test <- data_df %>%
filter(source == 'test')
#Drop unnecessary columns
train <- train %>%
select(-source)
test <- test %>%
select(-c(source, Item_Outlet_Sales))
Export prepared train/test sets to csv files
write_csv(train, 'train_new.csv')
write_csv(test, 'test_new.csv')
train_1 <- train %>%
select(-c(Item_Identifier, Outlet_Identifier))
Random Forest Model
Now it’s time to test out our model.
library(randomForest)
rf <- randomForest(Item_Outlet_Sales ~ ., ntree = 100, data = train_1)
plot(rf)
print(rf)
##
## Call:
## randomForest(formula = Item_Outlet_Sales ~ ., data = train_1, ntree = 100)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 1232897
## % Var explained: 57.66
varImpPlot(rf, sort = T, n.var = 10,
main = 'Variable Importance')
Just another way to visualize the results:
#Create variable importance data.frame
var_imp = data.frame(importance(rf, type=2))
#Add row names columns
var_imp$Variables = row.names(var_imp)
var_imp <- var_imp[order(var_imp$IncNodePurity, decreasing = T),]
#Plot bar graph showing variable importance
ggplot(var_imp, aes(x = reorder(Variables, IncNodePurity), y = IncNodePurity, fill = IncNodePurity)) +
geom_col() +
theme(
axis.text.x = element_text(angle = 45, hjust = 0.85),
legend.position = 'none'
) +
bold_theme +
ggtitle('Variable Importance (Sales ~)') +
coord_flip()
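Since the contest is scored on test-set predictions, here is a minimal sketch of how one might generate a submission from this model (the file name and exact column layout are my own assumptions, not taken from the contest spec):
#Predict sales for the held-out test set and write a submission file
preds <- predict(rf, newdata = test)
submission <- tibble(Item_Identifier = test$Item_Identifier,
                     Outlet_Identifier = test$Outlet_Identifier,
                     Item_Outlet_Sales = preds)
write_csv(submission, 'rf_submission.csv')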
Making it clear how biased the model is toward Item_MRP
Let’s say we just model against quantities of items sold (Sales / MRP)…
library(randomForest)
#Note: `.` expands to all columns not already in the formula, so both Item_Outlet_Sales and Item_MRP are excluded from the predictors here
rf <- randomForest(Item_Outlet_Sales / Item_MRP ~ ., ntree = 100, data = train_1)
plot(rf)
print(rf)
##
## Call:
## randomForest(formula = Item_Outlet_Sales/Item_MRP ~ ., data = train_1, ntree = 100)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 50.92052
## % Var explained: 39.74
varImpPlot(rf, sort = T, n.var = 9,
main = 'Variable Importance (Sales / Item_MRP ~)')
#Create variable importance data.frame
var_imp = data.frame(importance(rf, type=2))
#Add row names columns
var_imp$Variables = row.names(var_imp)
var_imp <- var_imp[order(var_imp$IncNodePurity, decreasing = T),]
#Plot bar graph showing variable importance
ggplot(var_imp, aes(x = reorder(Variables, IncNodePurity), y = IncNodePurity, fill = IncNodePurity)) +
geom_col() +
theme(
axis.text.x = element_text(angle = 45, hjust = 0.85),
legend.position = 'none'
) +
bold_theme +
ggtitle('Variable Importance (Sales / Item_MRP ~)') +
xlab('Variables') +
coord_flip()
Removing Item_MRP gives us a clearer picture of how the other attributes influence the quantity of a given item that is sold.