Sales Analysis - Full Report

1. About the Project

The dataset comprises of sales data (of a renowned Super Market) for 1559 products across 10 stores in different cities (broadly classified based on the purchase power parity, working population, size and few other factors).

The project aims to build a predictive model to analyze the sales of each product at a particular store. With this we shall understand the properties of products and stores which play a key role in increasing sales. The results of the model will be used to provide recommendations to improve the sales.

1.1. NOTES

To evaluate how good is a model, let us understand the impact of wrong predictions. If we predict sales to be higher than what they might be, the store will spend a lot of money making unnecessary arrangement which would lead to excess inventory. On the other side if I predict it too low, I will lose out on sales opportunity.

2. Creating an appropriate Environment

rm(list = ls())
setwd('/Users/Mughundhan/UIC/UIC Academics/FALL 2017/BIZ ANALYTICS STATS/Project/Mid Report')
library(lubridate) # for csv files
library(leaflet)   # interactive maps
library(dplyr)     # for piping purpose %>%
library(data.table)# aggregate
library(ggplot2)   # barplot
library(mice)      # imputing with plausible data values (drawn from a distribution specifically designed for each missing datapoint)
library(rpart)     # Decision Trees
library(VIM)       # Visual Representation for MICE
library(data.table)
train <- read.csv("Train.csv", header=T, na.strings=c("","NA")) #Empty spaces to be replaced by NA
test <- read.csv("Test.csv", header=T, na.strings=c("","NA"))
test$Item_Outlet_Sales <- NA
fdata <- rbind(test, train)
fdata <- as.data.table(fdata)

3. Data Exploration

3.1 Data Dictionary

Let us have a look at the description of each variable in the dataset:

Item_Identifier: Unique Product ID
Item_Weight: Weight of the Product
Item_Fat_Content: How much fat content the product contains (Low, Regular)
Item_Visibility: The percent of total display area of all products in a store allocated to the particular product
Item_Type: The Category to which the product belongs (eg: Breakfast, Soft Drinks, Household etc)
Item_MRP: Maximum Retail Price of the Product (Indian Rupees)
Outlet_Identifier: Unique Store ID - multiple stores located at different cities
Outlet_Establishment_Year: The year, when the store started its operation
Outlet_Size: Size of the store (High, Medium, Small)
Outlet_Location_Type: The type of the city in which the store is located (Tier1, Tier2 ….)
Outlet_Type: The type of the outlet (Grocery store or a Super Market)
Item_Outlet_Sales: Sales of the product in the particular store. [Outcome Variable to be predicted]

3.2 Overview of the dataset with R

Let us now perform basic operations to have a look at the summary and the structure of the dataset.

summary(fdata)

##  Item_Identifier  Item_Weight     Item_Fat_Content Item_Visibility  
##  DRA24  :   10   Min.   : 4.555   LF     : 522     Min.   :0.00000  
##  DRA59  :   10   1st Qu.: 8.710   Low Fat:8485     1st Qu.:0.02704  
##  DRB25  :   10   Median :12.600   Regular:4824     Median :0.05402  
##  DRC25  :   10   Mean   :12.793   low fat: 178     Mean   :0.06595  
##  DRC27  :   10   3rd Qu.:16.750   reg    : 195     3rd Qu.:0.09404  
##  DRC36  :   10   Max.   :21.350                    Max.   :0.32839  
##  (Other):14144   NA's   :2439                                       
##                  Item_Type       Item_MRP      Outlet_Identifier
##  Fruits and Vegetables:2013   Min.   : 31.29   OUT027 :1559     
##  Snack Foods          :1989   1st Qu.: 94.01   OUT013 :1553     
##  Household            :1548   Median :142.25   OUT035 :1550     
##  Frozen Foods         :1426   Mean   :141.00   OUT046 :1550     
##  Dairy                :1136   3rd Qu.:185.86   OUT049 :1550     
##  Baking Goods         :1086   Max.   :266.89   OUT045 :1548     
##  (Other)              :5006                    (Other):4894     
##  Outlet_Establishment_Year Outlet_Size   Outlet_Location_Type
##  Min.   :1985              High  :1553   Tier 1:3980         
##  1st Qu.:1987              Medium:4655   Tier 2:4641         
##  Median :1999              Small :3980   Tier 3:5583         
##  Mean   :1998              NA's  :4016                       
##  3rd Qu.:2004                                                
##  Max.   :2009                                                
##                                                              
##             Outlet_Type   Item_Outlet_Sales 
##  Grocery Store    :1805   Min.   :   33.29  
##  Supermarket Type1:9294   1st Qu.:  834.25  
##  Supermarket Type2:1546   Median : 1794.33  
##  Supermarket Type3:1559   Mean   : 2181.29  
##                           3rd Qu.: 3101.30  
##                           Max.   :13086.97  
##                           NA's   :5681

str(fdata)

## Classes 'data.table' and 'data.frame':   14204 obs. of  12 variables:
##  $ Item_Identifier          : Factor w/ 1559 levels "DRA12","DRA24",..: 1104 1068 1407 810 1185 462 605 267 669 171 ...
##  $ Item_Weight              : num  20.75 8.3 14.6 7.32 NA ...
##  $ Item_Fat_Content         : Factor w/ 5 levels "LF","Low Fat",..: 2 5 2 2 3 3 3 2 3 2 ...
##  $ Item_Visibility          : num  0.00756 0.03843 0.09957 0.01539 0.1186 ...
##  $ Item_Type                : Factor w/ 16 levels "Baking Goods",..: 14 5 12 14 5 7 1 1 14 1 ...
##  $ Item_MRP                 : num  107.9 87.3 241.8 155 234.2 ...
##  $ Outlet_Identifier        : Factor w/ 10 levels "OUT010","OUT013",..: 10 3 1 3 6 9 4 6 8 3 ...
##  $ Outlet_Establishment_Year: int  1999 2007 1998 2007 1985 1997 2009 1985 2002 2007 ...
##  $ Outlet_Size              : Factor w/ 3 levels "High","Medium",..: 2 NA NA NA 2 3 2 2 NA NA ...
##  $ Outlet_Location_Type     : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 2 3 2 3 1 3 3 2 2 ...
##  $ Outlet_Type              : Factor w/ 4 levels "Grocery Store",..: 2 2 1 2 4 2 3 4 2 2 ...
##  $ Item_Outlet_Sales        : num  NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>

Observation

There are 11 + 1 variables in the dataset (1-target variable: Item_Outlet_Sales)
We shall perform number operations on 3 numerical variables: Item_Weight, Item_Visibility, Item_MRP
There are several factor variables which will be transformed into character variables for feature engineering purpose: Item_Fat_Content, Outlet_Identifier, Outlet_Size, Outlet_Location_Type, Outlet_Type
There is only one variable with information regarding the date: Outlet_Establishment_Year. We might perform simple numerical operations since only the year is given.
Few variables (Outlet_Size, Item_Weight) contain missing values which needs to be imputed.

3.2 Deeper Insights from the dataset using R functions

##           Item_Identifier               Item_Weight 
##                      1559                       416 
##          Item_Fat_Content           Item_Visibility 
##                         5                     13006 
##                 Item_Type                  Item_MRP 
##                        16                      8052 
##         Outlet_Identifier Outlet_Establishment_Year 
##                        10                         9 
##               Outlet_Size      Outlet_Location_Type 
##                         4                         3 
##               Outlet_Type         Item_Outlet_Sales 
##                         4                      3494

##           Item_Identifier               Item_Weight 
##                         0                      2439 
##          Item_Fat_Content           Item_Visibility 
##                         0                         0 
##                 Item_Type                  Item_MRP 
##                         0                         0 
##         Outlet_Identifier Outlet_Establishment_Year 
##                         0                         0 
##               Outlet_Size      Outlet_Location_Type 
##                      4016                         0 
##               Outlet_Type         Item_Outlet_Sales 
##                         0                      5681

## 
##      LF Low Fat Regular low fat     reg 
##     522    8485    4824     178     195

## 
##          Baking Goods                Breads             Breakfast 
##                  1086                   416                   186 
##                Canned                 Dairy          Frozen Foods 
##                  1084                  1136                  1426 
## Fruits and Vegetables           Hard Drinks    Health and Hygiene 
##                  2013                   362                   858 
##             Household                  Meat                Others 
##                  1548                   736                   280 
##               Seafood           Snack Foods           Soft Drinks 
##                    89                  1989                   726 
##         Starchy Foods 
##                   269

## 
## Tier 1 Tier 2 Tier 3 
##   3980   4641   5583

## 
##   High Medium  Small 
##   1553   4655   3980

## 
##     Grocery Store Supermarket Type1 Supermarket Type2 Supermarket Type3 
##              1805              9294              1546              1559

Observation:

We can observe the number of missing values and the number of unique values (levels) in each column using sapply.
The graphs display the distribution and contribution of each sub-category corresponding to that variable.

4. Hypotheses Generation

Based on the basic data exploration, we shall have two levels of hypotheses: 1. Store-level; 2. Product-level. Both plays a crucial role in determining the sales of each product at specific stores located across different cities. The hypotheses generated at both the levels based on the available dataset are as follows:

I. Product-Level Hypotheses

Item_Fat_Content: Items are classified based on the fat content. Since we consume on low fat items as a part of our regular diet, It is highly possible that Low fat items are generally sold more than the items with high fat content.
Item_Type: Items which we use on regular basis - like ready to eat, soft drinks has higher probability of being sold when compared with luxury items.
Item_MRP: More expensive items might be bought occasionally. Items with lower prices might be a product which is being used on a regular basis. Thus, Low priced items might have sold better than expensive items.

II. Store-Level Hypotheses

Outlet_Size: Bigger outlets might attract bigger crowds. This results in increasing the sales of the products in that specific store.
Outlet_Location_Type: Bigger cities or cities with high population density has a larger customer base for the stores at their location. Stores located in Tier-1 cities might have better sales than stores located in other types of cities.
Outlet_Type: Similar to the previous hypotheses. Supermarkets look more fancy than grocery shops. Among supermarket, the highest among this sub-classification might attract larger crowds and emerge as the best selling store when compared with other outlet types.

5. Handling Missing Values

5.1 Finding the missing values

Identifying the missing values column-wise. The name of the column and the corresponding number of missing values in each column is given.

5.2 Imputing the missing values

Item_Weight and Item_Identifier: Taking average of Item_Weight based on Item_Identifier and imputing missing values in Item_Weight

length(unique(fdata$Item_Identifier)) #Identify no. of unique values in the Item_Identifier attribute

## [1] 1559

avg_Item_Weight <- aggregate(Item_Weight~Item_Identifier, data=fdata, FUN=function(x) c(mean=mean(x), count=length(x))) #making an aggregate - similar to group by feature in SQL
avg_Item_Weight <- as.data.table(avg_Item_Weight) #converting into data.table for easier computation

cdata <- merge(fdata, avg_Item_Weight, by="Item_Identifier") #merging the data

for(i in 1:nrow(cdata))
{
  if(is.na(cdata[i,2]))
  {
    cdata$Item_Weight.x[i] <- cdata$Item_Weight.y[i] #missing weights replaced by average weight of the item depending on the unique Item_Identifier
  }
}

fdata <- cdata[ ,1:(ncol(cdata)-1)] #deleting the unnecessary column created during the imputation process

#View(cdata)
names(fdata)[names(fdata)=="Item_Weight.x"] <- "Item_Weight" #Renaming the attribute
sapply(fdata, function(x) sum(is.na(x))) #Number of Missing Values in each column

##           Item_Identifier               Item_Weight 
##                         0                         0 
##          Item_Fat_Content           Item_Visibility 
##                         0                         0 
##                 Item_Type                  Item_MRP 
##                         0                         0 
##         Outlet_Identifier Outlet_Establishment_Year 
##                         0                         0 
##               Outlet_Size      Outlet_Location_Type 
##                      4016                         0 
##               Outlet_Type         Item_Outlet_Sales 
##                         0                      5681

#View(fdata)

rm(cdata, i)

Outlet_Size and Outlet_Type: Taking average of Outlet_Size based on Outlet_Type and imputing missing values in Outlet_Size

table(fdata$Outlet_Type, fdata$Outlet_Size)

##                    
##                     High Medium Small
##   Grocery Store        0      0   880
##   Supermarket Type1 1553   1550  3100
##   Supermarket Type2    0   1546     0
##   Supermarket Type3    0   1559     0

round(prop.table(table(fdata$Outlet_Type, fdata$Outlet_Size), 1), 2) #Identify the proportion

##                    
##                     High Medium Small
##   Grocery Store     0.00   0.00  1.00
##   Supermarket Type1 0.25   0.25  0.50
##   Supermarket Type2 0.00   1.00  0.00
##   Supermarket Type3 0.00   1.00  0.00

Observation:

All Grocery Store -> Small
Most Super Market 1 -> Small
All Super Market 2 -> Medium
All Super Market 3 -> Medium

fdata$Outlet_Size[is.na(fdata$Outlet_Size) & fdata$Outlet_Type == "Grocery Store"] <- "Small"
fdata$Outlet_Size[is.na(fdata$Outlet_Size) & fdata$Outlet_Type == "Supermarket Type1"] <- "Small"
fdata$Outlet_Size[is.na(fdata$Outlet_Size) & fdata$Outlet_Type == "Supermarket Type2"] <- "Medium"
fdata$Outlet_Size[is.na(fdata$Outlet_Size) & fdata$Outlet_Type == "Supermarket Type3"] <- "Medium"
sapply(fdata, function(x) sum(is.na(x))) #Number of Missing Values in each column

##           Item_Identifier               Item_Weight 
##                         0                         0 
##          Item_Fat_Content           Item_Visibility 
##                         0                         0 
##                 Item_Type                  Item_MRP 
##                         0                         0 
##         Outlet_Identifier Outlet_Establishment_Year 
##                         0                         0 
##               Outlet_Size      Outlet_Location_Type 
##                         0                         0 
##               Outlet_Type         Item_Outlet_Sales 
##                         0                      5681

table(fdata$Outlet_Type, fdata$Outlet_Size)

##                    
##                     High Medium Small
##   Grocery Store        0      0  1805
##   Supermarket Type1 1553   1550  6191
##   Supermarket Type2    0   1546     0
##   Supermarket Type3    0   1559     0

round(prop.table(table(fdata$Outlet_Type, fdata$Outlet_Size), 1), 2)

##                    
##                     High Medium Small
##   Grocery Store     0.00   0.00  1.00
##   Supermarket Type1 0.17   0.17  0.67
##   Supermarket Type2 0.00   1.00  0.00
##   Supermarket Type3 0.00   1.00  0.00

6. Feature Engineering

We explored some nuances in the data in the data exploration section. Now let us try to resolve them and make our data ready for analysis. We will also create some new variables using the existing ones in this section.

6.1. Consider combining Outlet_Type

During exploration, we decided to consider combining the Supermarket Type2 and Type3 variables. But is that a good idea? A quick way to check that could be to analyze the mean sales by type of store. If they have similar sales, then keeping them separate won???t help much.

avg_Item_Sales <- aggregate(Item_Outlet_Sales~Outlet_Type, data=fdata, FUN=function(x) c(mean=mean(x), count=length(x)))
avg_Item_Sales <- as.data.table(avg_Item_Sales)
rm(avg_Item_Sales)

Observation This shows significant difference between Supermarket Type2 and Type3 variables, hence we???ll leave them as it is.

6.2. Modify Item_Visibility

We noticed that the minimum value here is 0, which makes no practical sense. Lets consider it like missing information and impute it with mean visibility of that product.

summary(fdata$Item_Visibility)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.02704 0.05402 0.06595 0.09404 0.32839

rm(cdata)
length(unique(fdata$Item_Identifier))

## [1] 1559

avg_Item_Visibility <- aggregate(Item_Visibility~Item_Identifier, data=fdata, FUN=function(x) c(mean=mean(x), count=length(x)))
avg_Item_Visibility <- as.data.table(avg_Item_Visibility)

cdata <- merge(fdata, avg_Item_Visibility, by="Item_Identifier")

for(i in 1:nrow(cdata))
{
  if(cdata[i,4]==0)
  {
    cdata$Item_Visibility.x[i] <- cdata$Item_Visibility.y[i]
  }
}

fdata <- cdata[ ,1:(ncol(cdata)-1)]
names(fdata)[names(fdata)=="Item_Visibility.x"] <- "Item_Visibility"
#summary(fdata$Item_Visibility)

Observation No values with value zero in Item_Visibility variable

NOTE We hypothesized that products with higher visibility are likely to sell more. But along with comparing products on absolute terms, we should look at the visibility of the product in that particular store as compared to the mean visibility of that product across all stores. This will give some idea about how much importance was given to that product in a store as compared to other stores.

#colnames(fdata)

rm(cdata, i)
cdata <- merge(fdata, avg_Item_Visibility, by="Item_Identifier")
ncol(fdata)

## [1] 12

fdata <- cdata


names(fdata)[names(fdata)=="Item_Visibility.y"] <- "Item_Visibility_MeanRatio"
names(fdata)[names(fdata)=="Item_Visibility.x"] <- "Item_Visibility"
colnames(fdata)

##  [1] "Item_Identifier"           "Item_Weight"              
##  [3] "Item_Fat_Content"          "Item_Visibility"          
##  [5] "Item_Type"                 "Item_MRP"                 
##  [7] "Outlet_Identifier"         "Outlet_Establishment_Year"
##  [9] "Outlet_Size"               "Outlet_Location_Type"     
## [11] "Outlet_Type"               "Item_Outlet_Sales"        
## [13] "Item_Visibility_MeanRatio"

rm(cdata)
fdata$Item_Visibility_MeanRatio <- as.numeric(fdata$Item_Visibility_MeanRatio)
class(fdata$Item_Visibility_MeanRatio)

## [1] "numeric"

class(fdata$Item_Visibility)

## [1] "numeric"

fdata$Item_Visibility_MeanRatio1 <- fdata$Item_Visibility/fdata$Item_Visibility_MeanRatio
quantile(fdata$Item_Visibility_MeanRatio1)

##        0%       25%       50%       75%      100% 
## 0.8445628 0.9251308 0.9990698 1.0420067 3.0100939

fdata$Item_Visibility_MeanRatio <- fdata$Item_Visibility_MeanRatio1
quantile(fdata$Item_Visibility_MeanRatio1)

##        0%       25%       50%       75%      100% 
## 0.8445628 0.9251308 0.9990698 1.0420067 3.0100939

ncol(fdata)

## [1] 14

fdata <- fdata[, 1:(ncol(fdata)-1)]
#head(fdata)

6.3. Broad category of Type of Item

Earlier we saw that the Item_Type variable has 16 categories which might prove to be very useful in analysis. So its a good idea to combine them. One way could be to manually assign a new category to each. But there???s a catch here. If you look at the Item_Identifier, i.e. the unique ID of each item, it starts with either F, D or N. If you see the categories, these look like being Food, Drinks and Non-Consumables. So I???ve used the Item_Identifier variable to create a new column:

fdata$Item_Type_Combined <- "NA"

fdata$Item_Type_Combined[grepl("^[fF].*", fdata$Item_Identifier) ] <- "Food"
fdata$Item_Type_Combined[grepl("^[dD].*", fdata$Item_Identifier) ] <- "Drinks"
fdata$Item_Type_Combined[grepl("^[nN].*", fdata$Item_Identifier) ] <- "Non-Consumable"

table(fdata$Item_Type_Combined)

## 
##         Drinks           Food Non-Consumable 
##           1317          10201           2686

6.4. Determine the years of operation of a store

We wanted to make a new column depicting the years of operation of a store. [NOTE: We are using 2013 Sales Data]

fdata$Outlet_Years <- 2013 - fdata$Outlet_Establishment_Year
#summary(fdata$Outlet_Years)
table(fdata$Outlet_Years)

## 
##    4    6    9   11   14   15   16   26   28 
## 1546 1543 1550 1548 1550  925 1550 1553 2439

Observation: All the stores are 4-28 years old

6.5. Modify categories of Item_Fat_Content

We found typos and difference in representation in categories of Item_Fat_Content variable.

fdata$Item_Fat_Content.y <- "NA"
fdata$Item_Fat_Content.y[grepl("^[lL].*", fdata$Item_Fat_Content) ] <- "Low Fat"
fdata$Item_Fat_Content.y[grepl("^[rR].*", fdata$Item_Fat_Content) ] <- "Regular"

fdata$Item_Fat_Content.y[fdata$Item_Type_Combined=="Non-Consumable"] <- "Non-Edible"
fdata$Item_Fat_Content <- fdata$Item_Fat_Content.y
table(fdata$Item_Fat_Content)

## 
##    Low Fat Non-Edible    Regular 
##       6499       2686       5019

fdata <- fdata[ ,1:(ncol(fdata)-1)]
#View(fdata)

6.6. Exploratory Data Analysis

Observation:

Sales Pattern based on Fat Content: All three performs almost similar
Sales Pattern based on Outlet’s age: Outlets which are 28 years old performs far better and the outlet which is 16 years old is amongst the worst performers.
Sales Pattern based on Type of Item: All three performs almost similar
Sales Pattern based on Outlet: Outlet027 outperforms other outlets
Sales Pattern based on Outlet’s size: The medium sized outlets perform better.
Sales Pattern based on Location Type: Tier-3 Performs better as Hypothesized.

6.6. One-Hot Encoding

One-Hot-Coding refers to creating dummy variables, one for each category of a categorical variable.

For example, the Item_Fat_Content has 3 categories ??? ???Low Fat???, ???Regular??? and ???Non-Edible???. One hot coding will remove this variable and generate 3 new variables. Each will have binary numbers ??? 0 (if the category is not present) and 1(if category is present). [Creates dummy variables]
‘Item_Fat_Content’
‘Outlet_Location_Type’
‘Outlet_Size’
‘Item_Type_Combined’
‘Outlet_Type’
‘Outlet_Identifier’

NOTE: all columns - Item_Identifier, Item_Weight, Item_Fat_Content, Item_Visibility, Item_Type, Item_MRP, Outlet_Identifier, Outlet_Establishment_Year, Outlet_Size, Outlet_Location_Type, Outlet_Type, Item_Outlet_Sales, Item_Visibility_MeanRatio, Item_Type_Combined, Outlet_Years

##           Item_Identifier               Item_Weight 
##                      1559                       415 
##          Item_Fat_Content           Item_Visibility 
##                         3                     13688 
##                 Item_Type                  Item_MRP 
##                        16                      8052 
##         Outlet_Identifier Outlet_Establishment_Year 
##                        10                         9 
##               Outlet_Size      Outlet_Location_Type 
##                         3                         3 
##               Outlet_Type         Item_Outlet_Sales 
##                         4                      3494 
## Item_Visibility_MeanRatio        Item_Type_Combined 
##                     13287                         3 
##              Outlet_Years 
##                         9