PROBLEM STATEMENT

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Each product has certain attributes that set it apart from other products, and the same is true of each store.

The aim is to build a predictive model that estimates the sales of each product at a particular store, helping the decision makers at BigMart identify the properties of products and stores that play a key role in increasing overall sales.

HYPOTHESIS GENERATION

We can start the process by working on four levels: Store Level, Product Level, Customer Level and Macro Level.

Store Level Hypotheses

City Type: Stores located in urban or Tier 1 cities should have higher sales because of the higher income levels of people there.

Population Density: Stores located in densely populated areas should have higher sales because of more demand.

Store Capacity: Stores that are very big in size should have higher sales, as they act like one-stop shops and people would prefer getting everything from one place.

Competitors: Stores having similar establishments nearby should have lower sales because of more competition.

Marketing: Stores which have a good marketing division should have higher sales as it will be able to attract customers through the right offers and advertising.

Location: Stores located within popular marketplaces should have higher sales because of better access to customers.

Ambiance: Stores which are well-maintained and managed by polite and humble people are expected to have higher footfall and thus higher sales.

Product Level Hypotheses

Brand: Branded products should have higher sales because of higher customer trust.

Packaging: Products with good packaging can attract customers and sell more.

Utility: Daily-use products should have a higher tendency to sell compared to specific-use products.

Display Area: Products which are given bigger shelves in the store are likely to catch attention first and sell more.

Visibility in Store: The location of a product in a store will impact sales. Products placed right at the entrance will catch the customer's eye before those at the back.

Advertising: Better in-store advertising of products should lead to higher sales in most cases.

Promotional Offers: Products accompanied with attractive offers and discounts will sell more.

Customer Level Hypotheses

Customer Behavior: Stores keeping the right set of products to meet the local needs of customers will have higher sales.

Job Profile: Customers working at executive levels would have a higher chance of purchasing high-priced products than customers at entry or mid-senior levels.

Family Size: The larger the family, the more a customer will spend on products.

Annual Income: The higher a customer's annual income, the more likely they are to buy high-cost products.

Past Purchase History: Availability of this information can help us determine how frequently a product is purchased by a user.

Macro Level Hypotheses

Environment: If the environment is declared safe by the government, customers would be more likely to purchase products without worrying about whether they are environment friendly.

Economic Growth: If the economy shows consistent growth, per capita income will rise, and therefore the buying power of customers will increase.

LOADING PACKAGES

Here we will load packages for reading, manipulating, and visualizing the data, and finally for modeling.

library(ggplot2)            # used for plotting 
library(plyr)               # used for renaming the levels in factor
library(dplyr)              # used for data manipulation and joining
library(cowplot)            # used for combining multiple plots 
library(VIM)                # used for checking the missing data 
library(data.table)         # used for reading and manipulation of data
library(caret)              # used for modeling
library(corrplot)           # used for making correlation plot
library(xgboost)            # used for building XGBoost model
library(moments)            # used for skewness
library(GoodmanKruskal)     # used for association among categorical values 
library(DMwR)               # used for regression evaluation metrics
library(modelr)             # used for regression metrics
library(glmnet)             # used for ridge and lasso
library(earth)              # used for multivariate adaptive regression splines
library(kknn)               # used for knn regression
library(rpart)              # used for rpart decision trees
library(RColorBrewer)       # used for fancyrpartplot
library(rattle)             # used for fancyRpartPlot
library(nortest)            # used for anderson darling test for normality
library(car)                # used for durbin watson test for independency and for qq plot
library(plotmo)             # used for testing regression assumptions

LOADING THE DATA

bigmart_train<-read.csv("E:/R Markdown/projects/big mart sales/Train.csv")
bigmart_test<-read.csv("E:/R Markdown/projects/big mart sales/Test.csv")

COMBINE TRAIN AND TEST DATA: Let's append the test data to the train data so that we can explore the combined data and split it back later.

bigmart_test$Item_Outlet_Sales<-"NA"
bigmart_combine<-rbind(bigmart_train, bigmart_test)     # combining train and test datasets

Here Item_Outlet_Sales in the test data is assigned the string "NA", since it is the value we have to predict later; this also gives both data frames the same columns so that rbind() works.
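A side effect of assigning the string "NA" is that the whole Item_Outlet_Sales column becomes character, which is why it has to be coerced back to numeric later (with a coercion warning). A minimal alternative sketch, assuming we simply want a numeric placeholder from the start:

bigmart_test$Item_Outlet_Sales <- NA_real_               # numeric NA instead of the string "NA"
bigmart_combine <- rbind(bigmart_train, bigmart_test)    # no coercion needed later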

UNDERSTANDING THE DATA STRUCTURE AND CONTENTS

DIMENSION OF DATA:

dim(bigmart_combine)
[1] 14204    12

The bigmart combined dataset has 14204 observations/rows with 12 features/variables/columns.

DATA FEATURES:

variable.names(bigmart_combine)     #alternatively we can even use 'names()'
 [1] "Item_Identifier"           "Item_Weight"              
 [3] "Item_Fat_Content"          "Item_Visibility"          
 [5] "Item_Type"                 "Item_MRP"                 
 [7] "Outlet_Identifier"         "Outlet_Establishment_Year"
 [9] "Outlet_Size"               "Outlet_Location_Type"     
[11] "Outlet_Type"               "Item_Outlet_Sales"        

DATA STRUCTURE: The str() function gives a short summary of all the features present in a dataframe. Let's apply it to the combined data.

str(bigmart_combine)
'data.frame':   14204 obs. of  12 variables:
 $ Item_Identifier          : Factor w/ 1559 levels "DRA12","DRA24",..: 157 9 663 1122 1298 759 697 739 441 991 ...
 $ Item_Weight              : num  9.3 5.92 17.5 19.2 8.93 ...
 $ Item_Fat_Content         : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
 $ Item_Visibility          : num  0.016 0.0193 0.0168 0 0 ...
 $ Item_Type                : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
 $ Item_MRP                 : num  249.8 48.3 141.6 182.1 53.9 ...
 $ Outlet_Identifier        : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4 2 6 8 3 ...
 $ Outlet_Establishment_Year: int  1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
 $ Outlet_Size              : Factor w/ 4 levels "","High","Medium",..: 3 3 3 1 2 3 2 3 1 1 ...
 $ Outlet_Location_Type     : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
 $ Outlet_Type              : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
 $ Item_Outlet_Sales        : chr  "3735.138" "443.4228" "2097.27" "732.38" ...

As we can see, there are 7 factor/categorical variables, 4 numeric/integer variables, and 1 character variable (Item_Outlet_Sales, which became character when we appended the "NA" strings for the test rows) present in the dataset.

Converting The Classes:

bigmart_combine$Item_Outlet_Sales<-as.numeric(bigmart_combine$Item_Outlet_Sales)     # the "NA" strings become numeric NAs, hence the warning below
## Warning: NAs introduced by coercion
bigmart_combine$Outlet_Establishment_Year<-as.numeric(bigmart_combine$Outlet_Establishment_Year)

Renaming The Levels In Item_Fat_Content:

bigmart_combine$Item_Fat_Content<-mapvalues(bigmart_combine$Item_Fat_Content,from = c('LF','low fat','Low Fat','reg','Regular'), to=c('lowfat','lowfat','lowfat','regular','regular'))   
                         # mapvalues() from plyr package

Here the levels 'LF', 'low fat', and 'Low Fat' represent the same category and are combined into one (lowfat); similarly, 'reg' and 'Regular' are combined into regular.
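As a quick sanity check (a sketch; its output is not shown in the original walkthrough), we can confirm that only the two merged levels remain:

table(bigmart_combine$Item_Fat_Content)        # should now show only 'lowfat' and 'regular'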

EXPLORATORY DATA ANALYSIS

After understanding the dimensions and properties of data, we have to deep dive and explore the data visually. It helps us in understanding the nature of data in terms of distribution of the individual variables/features, finding missing values, relationship with other variables and many other things. It is done as univariate EDA and bivariate EDA.

UNIVARIATE EDA

It involves exploring variables individually. We will try to visualize the continuous variables using histograms and categorical variables using bar plots.

Target Variable

Since our target variable is continuous, we can visualize it by plotting its histogram.

ggplot(bigmart_combine) + geom_histogram(aes(bigmart_combine$Item_Outlet_Sales), binwidth = 100, fill = "darkgreen",colour="black") + xlab("Item_Outlet_Sales") + theme_minimal()+theme_linedraw() + geom_vline(xintercept=mean(bigmart_combine$Item_Outlet_Sales,na.rm = TRUE), color="red") + geom_vline(xintercept=median(bigmart_combine$Item_Outlet_Sales,na.rm = TRUE), color="blue")
Warning: Removed 5681 rows containing non-finite values (stat_bin).

                                        # ggplot() from ggplot2 package

Observations: The mean (red line) is greater than the median (blue line) and the distribution has a long tail to the right, so Item_Outlet_Sales is right-skewed.
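We can back this visual impression up numerically with the skewness() function from the moments package loaded earlier (a quick check; the value is not reported in the original walkthrough):

skewness(bigmart_combine$Item_Outlet_Sales, na.rm = TRUE)      # a value well above 0 confirms right skew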

Independent Variables (numeric variables)

Now let's check the numeric independent variables. We'll again use histograms, as they help us visualize the distributions of the variables.

p1<- ggplot(bigmart_combine) + geom_histogram(aes(Item_Weight), binwidth = 0.5, fill = "darkgreen") + geom_vline(xintercept=mean(bigmart_combine$Item_Weight,na.rm = TRUE), color="red") + geom_vline(xintercept=median(bigmart_combine$Item_Weight,na.rm = TRUE), color="blue")

p2<- ggplot(bigmart_combine) + geom_histogram(aes(Item_Visibility), binwidth = 0.005, fill = "darkgreen") + geom_vline(xintercept=mean(bigmart_combine$Item_Visibility), color="red") + geom_vline(xintercept=median(bigmart_combine$Item_Visibility), color="blue")

p3<- ggplot(bigmart_combine) + geom_histogram(aes(Item_MRP), binwidth = 1, fill = "darkgreen") + geom_vline(xintercept=mean(bigmart_combine$Item_MRP), color="red") + geom_vline(xintercept=median(bigmart_combine$Item_MRP), color="blue")

p4<- ggplot(bigmart_combine) + geom_histogram(aes(Outlet_Establishment_Year), binwidth = 1.2, fill = "darkgreen") + geom_vline(xintercept=mean(bigmart_combine$Outlet_Establishment_Year), color="red") + geom_vline(xintercept=median(bigmart_combine$Outlet_Establishment_Year), color="blue")

plot_grid(p1, p2, p3, p4, nrow = 2)           # plot_grid() from cowplot package
Warning: Removed 2439 rows containing non-finite values (stat_bin).

Observations: For Item_Weight and Item_Visibility the mean (red line) is greater than the median (blue line) and the distributions have long right tails, so they are right-skewed; for Item_MRP and Outlet_Establishment_Year the mean is less than the median, so they are left-skewed. There is no clear-cut pattern in Item_Weight. We can clearly see 4 different distributions for Item_MRP, which is an interesting insight. There are fewer observations for the outlets established in 1998 than for the other years.

Independent Variables (categorical variables)

Now we'll try to explore and gain some insights from the categorical variables. A categorical variable or feature can take only a finite set of values.

ggplot(bigmart_combine %>% group_by(Item_Type) %>% summarise(Count = n())) + geom_bar(aes(x=Item_Type, y=Count), stat = "identity", fill = "coral1") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + geom_label(aes(Item_Type, Count, label = Count), vjust = 0.5) + xlab("") + ggtitle("Item_Type")

ggplot(bigmart_combine %>% group_by(Outlet_Identifier) %>% summarise(Count = n())) + geom_bar(aes(x=Outlet_Identifier, y=Count), stat = "identity", fill = "coral1") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + geom_label(aes(Outlet_Identifier, Count, label = Count), vjust = 0.5) + xlab("") + ggtitle("Outlet_Identifier")

p5 <- ggplot(bigmart_combine %>% group_by(Item_Fat_Content) %>% summarise(Count = n())) + geom_bar(aes(x=Item_Fat_Content, y=Count), stat = "identity", fill = "coral1") + geom_label(aes(Item_Fat_Content, Count, label = Count), vjust = 0.5) + xlab("") + ggtitle("Item_Fat_Content")

p6 <- ggplot(bigmart_combine %>% group_by(Outlet_Size) %>% summarise(Count = n())) + geom_bar(aes(x=Outlet_Size, y=Count), stat = "identity", fill = "coral1") + geom_label(aes(Outlet_Size, Count, label = Count), vjust = 0.5) + xlab("") + ggtitle("Outlet_Size")

p7 <- ggplot(bigmart_combine %>% group_by(Outlet_Location_Type) %>% summarise(Count = n())) + geom_bar(aes(x=Outlet_Location_Type, y=Count), stat = "identity", fill = "coral1") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + geom_label(aes(Outlet_Location_Type, Count, label = Count), vjust = 0.5) + xlab("") + ggtitle("Outlet_Location_Type")

p8 <- ggplot(bigmart_combine %>% group_by(Outlet_Type) %>% summarise(Count = n())) + geom_bar(aes(x=Outlet_Type, y=Count), stat = "identity", fill = "coral1") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + geom_label(aes(Outlet_Type, Count, label = Count), vjust = 0.5) + xlab("") + ggtitle("Outlet_Type")

plot_grid(p5, p6, nrow = 1) 

plot_grid(p7, p8, nrow = 1)           # plot_grid() from cowplot package

Observations: In the Outlet_Size plot, Outlet_Size is blank or missing for 4016 observations. We will investigate this in the bivariate analysis to decide how to substitute the missing values in Outlet_Size. Supermarket Type 1 seems to be the most popular category of Outlet_Type.

BIVARIATE EDA

After looking at every feature individually, let’s now do some bivariate analysis. Here we’ll explore the independent variables with respect to the target variable. The objective is to discover hidden relationships between the independent variable and the target variable and use those findings in missing data imputation and feature engineering in the next module. We will make use of SCATTER PLOTS for the continuous or numeric variables and VIOLIN PLOTS for the categorical variables.

Target Variable vs Independent Numerical Variables

# Item_Weight vs Item_Outlet_Sales
ggplot(bigmart_combine) + geom_point(aes(Item_Weight, Item_Outlet_Sales), colour = "violet", alpha = 0.3) + theme(axis.title = element_text(size = 8.5))
Warning: Removed 7144 rows containing missing values (geom_point).

Observation: Item_Outlet_Sales is spread well across the entire range of the Item_Weight without any obvious pattern.

# Item_Visibility vs Item_Outlet_Sales
p10 <- ggplot(bigmart_combine) + geom_point(aes(Item_Visibility, Item_Outlet_Sales), colour = "violet", alpha = 0.3) + theme(axis.title = element_text(size = 8.5))

# Item_MRP vs Item_Outlet_Sales
p11 <- ggplot(bigmart_combine) + geom_point(aes(Item_MRP, Item_Outlet_Sales), colour = "violet", alpha = 0.3) + theme(axis.title = element_text(size = 8.5))

plot_grid(p10, p11, nrow = 1)
Warning: Removed 5681 rows containing missing values (geom_point).

Warning: Removed 5681 rows containing missing values (geom_point).

Observations: In Item_Visibility vs Item_Outlet_Sales, there is a string of points at Item_Visibility = 0.0 which seems strange as item visibility cannot be completely zero. We will take note of this issue and deal with it in the later stages. In the plot of Item_MRP vs Item_Outlet_Sales, we can clearly see 4 segments of prices that can be used in feature engineering to create a new variable.

# Outlet_Establishment_Year vs Item_Outlet_Sales
ggplot(bigmart_combine) + geom_point(aes(Outlet_Establishment_Year, Item_Outlet_Sales), colour = "violet", alpha = 0.3) + theme(axis.title = element_text(size = 8.5))
Warning: Removed 5681 rows containing missing values (geom_point).

Target Variable vs Independent Categorical Variables

Now we'll visualize the categorical variables with respect to Item_Outlet_Sales and check the distribution of the target variable across all the categories of each categorical variable.

We could have used boxplots here, but instead we’ll use the violin plots as they show the full distribution of the data. The width of a violin plot at a particular level indicates the concentration or density of data at that level. The height of a violin tells us about the range of the target variable values.

# Item_Type vs Item_Outlet_Sales
p12 <- ggplot(bigmart_combine) + 
      geom_violin(aes(Item_Type, Item_Outlet_Sales), fill = "magenta") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1),
            axis.text = element_text(size = 6),
            axis.title = element_text(size = 8.5))

# Item_Fat_Content vs Item_Outlet_Sales
p13 <- ggplot(bigmart_combine) + 
      geom_violin(aes(Item_Fat_Content, Item_Outlet_Sales), fill = "magenta") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1),
            axis.text = element_text(size = 8),
            axis.title = element_text(size = 8.5))

# Outlet_Identifier vs Item_Outlet_Sales
p14 <- ggplot(bigmart_combine) + 
      geom_violin(aes(Outlet_Identifier, Item_Outlet_Sales), fill = "magenta") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1),
            axis.text = element_text(size = 8),
            axis.title = element_text(size = 8.5))

# Outlet_Location_Type vs Item_Outlet_Sales
p15 <- ggplot(bigmart_combine) + geom_violin(aes(Outlet_Location_Type, Item_Outlet_Sales), fill = "magenta")

# Outlet_Type vs Item_Outlet_Sales
p16 <- ggplot(bigmart_combine) + geom_violin(aes(Outlet_Type, Item_Outlet_Sales), fill = "magenta")

second_row_3 <- plot_grid(p13, p14, ncol = 2)
Warning: Removed 5681 rows containing non-finite values (stat_ydensity).

Warning: Removed 5681 rows containing non-finite values (stat_ydensity).
plot_grid(p12, second_row_3, ncol = 1)
Warning: Removed 5681 rows containing non-finite values (stat_ydensity).

plot_grid(p15, p16, nrow = 1)
Warning: Removed 5681 rows containing non-finite values (stat_ydensity).

Warning: Removed 5681 rows containing non-finite values (stat_ydensity).

NOTE: The warnings appear because these 5681 rows belong to the test data, where the outcome is NA.

Observations: The distribution of Item_Outlet_Sales across the categories of Item_Type is not very distinct (almost all categories look alike), and the same is true for Item_Fat_Content. The distributions for the OUT010 and OUT019 categories of Outlet_Identifier are quite similar to each other and very different from the rest of the categories. Tier 1 and Tier 3 locations of Outlet_Location_Type look similar. In the Outlet_Type plot, Grocery Store has most of its data points around the lower sales values as compared to the other categories.

# Outlet_Size vs Item_Outlet_Sales
ggplot(bigmart_combine) + geom_violin(aes(Outlet_Size, Item_Outlet_Sales), fill = "magenta")
Warning: Removed 5681 rows containing non-finite values (stat_ydensity).

Observations: In the univariate analysis, we found blank values in the Outlet_Size variable. From the above distribution of the target variable across Outlet_Size, the distribution for 'Small' Outlet_Size is almost identical to the distribution of the blank category (the first violin). So we can substitute the blanks in Outlet_Size with 'Small', as sketched below. We should note that this is not the only way to impute missing values.
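A minimal sketch of that substitution (one possible approach; it is not executed in this walkthrough, where the blanks are instead dealt with at the label-encoding stage):

# treat blank Outlet_Size values as "Small", based on the similar violin shapes
bigmart_combine$Outlet_Size <- as.character(bigmart_combine$Outlet_Size)
bigmart_combine$Outlet_Size[bigmart_combine$Outlet_Size == ""] <- "Small"
bigmart_combine$Outlet_Size <- as.factor(bigmart_combine$Outlet_Size)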

These are the kind of insights that we can extract by visualizing our data.

MISSING VALUE TREATMENT

Missing data can have a severe impact on building predictive models because the missing values might contain some vital information which could help in making better predictions. So, it becomes imperative to carry out missing data imputation.

Check For Missing Values:

aggr(bigmart_combine,numbers=TRUE,plot = FALSE)      #aggr() is in VIM package

 Missings in variables:
          Variable Count
       Item_Weight  2439
 Item_Outlet_Sales  5681

As we can see above, we have missing values in Item_Weight and Item_Outlet_Sales. The missing data in Item_Outlet_Sales can be ignored, since those rows belong to the test dataset and are held out for prediction. We'll now impute Item_Weight with the mean weight of the corresponding Item_Identifier.

Impute Missing Values:

missing_index <- which(is.na(bigmart_combine$Item_Weight))      # rows with missing Item_Weight
for(i in missing_index){
  item = bigmart_combine$Item_Identifier[i]
  # replace the missing weight with the mean weight of the same item across all outlets
  bigmart_combine$Item_Weight[i] = mean(bigmart_combine$Item_Weight[bigmart_combine$Item_Identifier == item], na.rm = T)
}
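The same imputation can also be written without an explicit loop. A vectorized dplyr sketch that is equivalent in effect (note it returns a tibble, so wrap the result in as.data.frame() if a plain data frame is required):

# group-wise mean imputation of Item_Weight by Item_Identifier
bigmart_combine <- bigmart_combine %>%
  group_by(Item_Identifier) %>%
  mutate(Item_Weight = ifelse(is.na(Item_Weight), mean(Item_Weight, na.rm = TRUE), Item_Weight)) %>%
  ungroup()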

Again Check For Missing Values:

aggr(bigmart_combine,numbers=TRUE,plot = FALSE)     #aggr() is in VIM package

 Missings in variables:
          Variable Count
 Item_Outlet_Sales  5681

Now we can find missing data only in Item_Outlet_Sales which belongs to test dataset.

Replacing Zeros In The Item_Visibility Variable: Zeros shouldn't exist in Item_Visibility, since every item stocked in a BigMart store is visible to some degree. So the zeros in Item_Visibility are replaced by the mean Item_Visibility of the corresponding Item_Identifier. Let's first visualize the distribution below.

ggplot(bigmart_combine) + geom_histogram(aes(Item_Visibility), bins = 100)

Replace The Zeros:

zero_index = which(bigmart_combine$Item_Visibility == 0)
for(i in zero_index){
  item = bigmart_combine$Item_Identifier[i]
  bigmart_combine$Item_Visibility[i] = mean(bigmart_combine$Item_Visibility[bigmart_combine$Item_Identifier == item], na.rm = T)
}

Again Visualize The Plot:

ggplot(bigmart_combine) + geom_histogram(aes(Item_Visibility), bins = 100)

In the above histogram, we can see that the issue of zero item visibility has been resolved.

FEATURE ENGINEERING

Most of the times, the given features in a dataset are not sufficient to give satisfactory predictions. In such cases, we have to create new features which might help in improving the model’s performance. Let’s try to create some new features for our dataset.

In this section we will create the following new features:

Item_Type_New: broader categories for the variable Item_Type.
Item_category: categorical variable derived from Item_Identifier.
Outlet_Years: years of operation for outlets.
Price_per_unit_wt: Item_MRP / Item_Weight.
Item_MRP_clusters: binned feature for Item_MRP.

Create Item_Type_New: We can look at the Item_Type variable, classify its categories into perishable and non_perishable as per our understanding, and turn this into a new feature.

perishable = c("Breads", "Breakfast", "Dairy", "Fruits and Vegetables", "Meat", "Seafood")

non_perishable = c("Baking Goods", "Canned", "Frozen Foods", "Hard Drinks", "Health and Hygiene", "Household", "Soft Drinks")

# create a new feature 'Item_Type_new'

bigmart_combine$Item_Type_New <- ifelse(bigmart_combine$Item_Type %in% perishable, "perishable", ifelse(bigmart_combine$Item_Type %in% non_perishable, "non_perishable", "not_sure"))

bigmart_combine$Item_Type_New<-as.factor(bigmart_combine$Item_Type_New)

Create Item_category: Let’s compare Item_Type with the first 2 characters of Item_Identifier, i.e., ‘DR’(drink), ‘FD’(food), and ‘NC’(non-consumable).

table(bigmart_combine$Item_Type, substr(bigmart_combine$Item_Identifier, 1, 2))
                       
                          DR   FD   NC
  Baking Goods             0 1086    0
  Breads                   0  416    0
  Breakfast                0  186    0
  Canned                   0 1084    0
  Dairy                  229  907    0
  Frozen Foods             0 1426    0
  Fruits and Vegetables    0 2013    0
  Hard Drinks            362    0    0
  Health and Hygiene       0    0  858
  Household                0    0 1548
  Meat                     0  736    0
  Others                   0    0  280
  Seafood                  0   89    0
  Snack Foods              0 1989    0
  Soft Drinks            726    0    0
  Starchy Foods            0  269    0

Based on the above table we can create a new feature called Item_category.

bigmart_combine$Item_category <- substr(bigmart_combine$Item_Identifier, 1, 2)

bigmart_combine$Item_category<-as.factor(bigmart_combine$Item_category)

We will also change the values of Item_Fat_Content wherever Item_category is 'NC', because non-consumable items cannot have any fat content.

bigmart_combine$Item_Fat_Content<-as.character(bigmart_combine$Item_Fat_Content)

bigmart_combine$Item_Fat_Content[bigmart_combine$Item_category == "NC"] <- "non-edible"

bigmart_combine$Item_Fat_Content<-as.factor(bigmart_combine$Item_Fat_Content)

Create Outlet_Years: Since the sales data was collected for the year 2013, we can compute each outlet's years of operation up to 2013 in a new feature called Outlet_Years.

bigmart_combine$Outlet_Years<-2013-bigmart_combine$Outlet_Establishment_Year

Create Price_per_unit_wt:

bigmart_combine$Price_per_unit_wt<-bigmart_combine$Item_MRP/bigmart_combine$Item_Weight

Create Item_MRP_clusters: Earlier, in the Item_MRP vs Item_Outlet_Sales plot, we saw that Item_MRP was spread across 4 chunks. Now let's assign a label to each of these chunks and use this label as a new variable.

bigmart_combine$Item_MRP_clusters<-ifelse(bigmart_combine$Item_MRP<69,"1",ifelse(bigmart_combine$Item_MRP>=69 & bigmart_combine$Item_MRP<136,"2",ifelse(bigmart_combine$Item_MRP>=136 & bigmart_combine$Item_MRP<203,"3","4")))

bigmart_combine$Item_MRP_clusters<-as.numeric(bigmart_combine$Item_MRP_clusters)

ENCODING CATEGORICAL VARIABLES

Most machine learning algorithms produce better results with numerical variables, so it is essential to encode the categorical variables present in the data. There are two common techniques for converting categorical variables into numerical ones: label encoding and one hot encoding. Label encoding simply converts each category of a variable to a number, whereas one hot encoding converts each category into a new binary (1/0) column.

Label encoding for the categorical variables: We will label encode Outlet_Size and Outlet_Location_Type as these are ordinal variables.

#encoding Outlet_Size
# note: blank ("") values fall into the final else branch and are coded as 4;
# based on the EDA they could instead be coded like "Small" (i.e. 1)
bigmart_combine$Outlet_Size <- ifelse(bigmart_combine$Outlet_Size=="Small",1,ifelse(bigmart_combine$Outlet_Size=="Medium",2,ifelse(bigmart_combine$Outlet_Size=="High",3,4)))

#encoding Outlet_Location_Type                   
bigmart_combine$Outlet_Location_Type <- ifelse(bigmart_combine$Outlet_Location_Type=="Tier 3",3,
ifelse(bigmart_combine$Outlet_Location_Type=="Tier 2",2,1))
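One hot encoding, mentioned above, is not applied in this walkthrough (the remaining factors are handled by the model functions, e.g. via model.matrix() or caret's formula interface). If we wanted explicit binary columns, caret's dummyVars() could be used along these lines (a sketch; the ohe_vars selection is illustrative):

# one hot encode the remaining nominal factors
ohe_vars <- c("Item_Fat_Content", "Item_Type", "Outlet_Identifier",
              "Outlet_Type", "Item_Type_New", "Item_category")
ohe_spec <- dummyVars(~ ., data = bigmart_combine[, ohe_vars])
ohe_df   <- as.data.frame(predict(ohe_spec, newdata = bigmart_combine[, ohe_vars]))
bigmart_combine_ohe <- cbind(bigmart_combine[, !(names(bigmart_combine) %in% ohe_vars)], ohe_df)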

DATA PREPROCESSING

Let's transform our data before feeding it to the algorithms.

Removing Skewness: Skewness is undesirable for predictive modeling. In our data the Item_Visibility and Price_per_unit_wt variables are highly skewed, so we will treat their skewness with log transformations.

skewness(bigmart_combine$Item_Visibility)
[1] 1.254178
skewness(bigmart_combine$Price_per_unit_wt)
[1] 1.304689
bigmart_combine$Item_Visibility<-log(bigmart_combine$Item_Visibility+1)        # add 1 before taking the log to avoid log(0)
bigmart_combine$Price_per_unit_wt<-log(bigmart_combine$Price_per_unit_wt+1)

The skewness of both features is greater than zero (right-skewed); the log transformation reduces this skewness.

Scaling Numeric Predictors (Standardization): Centering and scaling helps linear models and is especially important for regularized models such as ridge and lasso. Let's center and scale the numeric variables so that each has a mean of zero and a standard deviation of one.

num_vars <- which(sapply(bigmart_combine, is.numeric))     # index of numeric features
num_vars_names <- names(num_vars)
num_vars_names
 [1] "Item_Weight"               "Item_Visibility"          
 [3] "Item_MRP"                  "Outlet_Establishment_Year"
 [5] "Outlet_Size"               "Outlet_Location_Type"     
 [7] "Item_Outlet_Sales"         "Outlet_Years"             
 [9] "Price_per_unit_wt"         "Item_MRP_clusters"        
bigmart_combine_numeric <- bigmart_combine[,setdiff(num_vars_names, "Item_Outlet_Sales")]

prepro_num <- preProcess(bigmart_combine_numeric, method=c("center", "scale"))

bigmart_combine_numeric_norm <- predict(prepro_num, bigmart_combine_numeric)
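Note that in this walkthrough the scaled values are kept in the separate data frame bigmart_combine_numeric_norm and the models below are fit on the original (unscaled) columns. If we wanted the linear models to use the standardized predictors instead, the scaled columns could be written back, along the lines of this sketch (not executed here):

# overwrite the numeric predictor columns with their centered and scaled versions
bigmart_combine[, names(bigmart_combine_numeric_norm)] <- bigmart_combine_numeric_norm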

Remove Insignificant Features:

bigmart_combine<-select(bigmart_combine,-c("Item_Identifier"))

Reordering The Features:

bigmart_combine<-bigmart_combine[,c(1:10,12:16,11)]     # move Item_Outlet_Sales to the last column

Splitting the combined data bigmart_combine back into train and test sets:

train <- bigmart_combine[1:8523,]
test <- bigmart_combine[8524:14204,]
test <- test[, !(colnames(test) %in% c("Item_Outlet_Sales"))]        #remove outcome variable from test data

Correlated Variables: Let's examine the correlations among the numeric features in the train dataset. Correlation ranges from -1 to 1: values below 0 indicate negative correlation, values above 0 indicate positive correlation, and 0 indicates no correlation. Highly correlated predictor features are undesirable if we are using linear regression.

cat.var<-c("Item_Fat_Content","Item_Type","Outlet_Identifier","Outlet_Type","Item_Type_New","Item_category")

num_var<-select(train,-c(cat.var))
corrplot(cor(num_var),method = "pie",type="lower",tl.cex = 0.9)

The correlation plot above shows the correlation between all possible pairs of variables in our data. The correlation between any two variables is represented by a pie: a blueish pie indicates positive correlation, a reddish pie indicates negative correlation, and the magnitude of the correlation is denoted by the area covered by the pie.
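If we prefer an automatic check to a visual one, caret's findCorrelation() flags predictors whose absolute pairwise correlation exceeds a chosen cutoff (a sketch; the 0.75 cutoff is illustrative):

# flag numeric predictors with pairwise correlation above 0.75
cor_mat  <- cor(select(num_var, -Item_Outlet_Sales))
high_cor <- findCorrelation(cor_mat, cutoff = 0.75, names = TRUE)
high_cor    # Outlet_Years and Outlet_Establishment_Year are perfectly correlated by construction, so one of them is flagged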

Associated Variables: Let's examine the association between the categorical features in the train dataset using an association plot from the GoodmanKruskal package.

cat.var.frame<-subset(train,select = cat.var)
gkmatrix<-GKtauDataframe(cat.var.frame)
plot(gkmatrix,corrColors="blue",diagSize=0.5)

In this plot the diagonal elements (K) give the number of unique levels of each variable. The off-diagonal elements contain the forward and backward tau measures for each variable pair; the numerical values in each row indicate how well that row's variable predicts the column variable. Values near 0 (e.g. 0, 0.1) indicate little or no association, moderate values (e.g. 0.32, 0.41) indicate some association, and values near 1 (e.g. 0.85, 0.94) mean one variable is almost perfectly predictable from the other. For example, the forward association tau(x, y) with x = Item_Type and y = Item_category is very high (a strong association): if we know Item_Type, we can easily predict whether Item_category is "DR", "FD", or "NC". The reverse association tau(y, x), on the contrary, is much weaker, since knowing only the broad category does not pin down the specific Item_Type.

PREDICTIVE MODEL BUILDING

We will start off with simpler models and gradually move on to more sophisticated ones. We will build the following models: linear regression, ridge regression, lasso regression, multivariate adaptive regression splines (MARS), k-nearest neighbours regression, and a CART decision tree.

Model Build:

#LINEAR
model_lr <- lm(Item_Outlet_Sales ~ ., data=train)

#RIDGE
data <- model.matrix(Item_Outlet_Sales ~ ., data=train)
x <- data[,-1]
kfold <- trainControl(method="cv",number = 5)
grid <- expand.grid(alpha=0,lambda=seq(0.001,0.1,by=0.0002))
model_ridge <- train(x=x, y=train$Item_Outlet_Sales, method="glmnet",trControl=kfold,tuneGrid=grid)

#LASSO
grid <- expand.grid(alpha=1,lambda=seq(0.001,0.1,by=0.0002))
model_lasso <- train(x=x, y=train$Item_Outlet_Sales, method="glmnet",trControl=kfold,tuneGrid=grid)

#MARS
model_earth <- train(Item_Outlet_Sales ~ ., data=train,method="earth",trControl=kfold)

#KNN
model_knn <- train(Item_Outlet_Sales ~ ., data=train, method="knn")

#CART DECISION TREES
model_cart <- rpart(Item_Outlet_Sales ~ ., data=train, method="anova")

Predict On Train Data:

#LINEAR
pred_lr <- model_lr$fitted.values

#RIDGE
pred_ridge <- predict(model_ridge,x,s="lambda.min")

#LASSO
pred_lasso <- predict(model_lasso,x,s="lambda.min")

#MARS
pred_earth <- predict(model_earth,train)

#KNN
pred_knn <- predict(model_knn,train,type="raw")

#CART DECISION TREES
pred_cart <- predict(model_cart,train)

Evaluation Metrics:

#LINEAR
mae_lr <-mean(abs(train$Item_Outlet_Sales-pred_lr))
mape_lr <- mape(model_lr,data=train)                  #from modelr package
## Warning in predict.lm(model, data): prediction from a rank-deficient fit
## may be misleading
mse_lr <-mean((train$Item_Outlet_Sales-pred_lr)^2)
rmse_lr <-sqrt(mean((train$Item_Outlet_Sales-pred_lr)^2))
rss_lr <-sum((train$Item_Outlet_Sales-pred_lr)^2)
tss_lr <-sum((train$Item_Outlet_Sales- mean(train$Item_Outlet_Sales))^2)
r2_lr <-1-rss_lr/tss_lr

#RIDGE
model_ridge$bestTune
##     alpha lambda
## 496     0    0.1
mae_ridge <-mean(abs(train$Item_Outlet_Sales-pred_ridge))
mse_ridge <-mean((train$Item_Outlet_Sales-pred_ridge)^2)
rmse_ridge <-sqrt(mean((train$Item_Outlet_Sales-pred_ridge)^2))
rss_ridge <-sum((train$Item_Outlet_Sales-pred_ridge)^2)
tss_ridge <-sum((train$Item_Outlet_Sales- mean(train$Item_Outlet_Sales))^2)
r2_ridge <-1-rss_ridge/tss_ridge

#LASSO
model_lasso$bestTune
##     alpha lambda
## 496     1    0.1
mae_lasso <-mean(abs(train$Item_Outlet_Sales-pred_lasso))
mse_lasso <-mean((train$Item_Outlet_Sales-pred_lasso)^2)
rmse_lasso <-sqrt(mean((train$Item_Outlet_Sales-pred_lasso)^2))
rss_lasso <-sum((train$Item_Outlet_Sales-pred_lasso)^2)
tss_lasso <-sum((train$Item_Outlet_Sales- mean(train$Item_Outlet_Sales))^2)
r2_lasso <-1-rss_lasso/tss_lasso

#MARS
model_earth
## Multivariate Adaptive Regression Spline 
## 
## 8523 samples
##   15 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 6817, 6818, 6819, 6820, 6818 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE      Rsquared   MAE      
##   2       1405.239  0.3219139  1032.4784
##   4       1197.229  0.5076950   902.1694
##   7       1130.424  0.5610998   837.7611
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 7 and degree = 1.
plot(model_earth)

mae_earth <-mean(abs(train$Item_Outlet_Sales-pred_earth))
mape_earth <- mape(model_earth,data=train)                  #from modelr package
mse_earth <-mean((train$Item_Outlet_Sales-pred_earth)^2)
rmse_earth <-sqrt(mean((train$Item_Outlet_Sales-pred_earth)^2))
rss_earth <-sum((train$Item_Outlet_Sales-pred_earth)^2)
tss_earth <-sum((train$Item_Outlet_Sales- mean(train$Item_Outlet_Sales))^2)
r2_earth <-1-rss_earth/tss_earth

#KNN
model_knn
## k-Nearest Neighbors 
## 
## 8523 samples
##   15 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 8523, 8523, 8523, 8523, 8523, 8523, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE     
##   5  1371.825  0.3944455  969.3772
##   7  1346.426  0.4016669  959.4338
##   9  1333.581  0.4051064  957.5883
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
plot(model_knn)

mae_knn <-mean(abs(train$Item_Outlet_Sales-pred_knn))
mape_knn <- mape(model_knn,data=train)                  #from modelr package
mse_knn <-mean((train$Item_Outlet_Sales-pred_knn)^2)
rmse_knn <-sqrt(mean((train$Item_Outlet_Sales-pred_knn)^2))
rss_knn <-sum((train$Item_Outlet_Sales-pred_knn)^2)
tss_knn <-sum((train$Item_Outlet_Sales- mean(train$Item_Outlet_Sales))^2)
r2_knn <-1-rss_knn/tss_knn

#CART DECISION TREES
model_cart$cptable
##           CP nsplit rel error    xerror        xstd
## 1 0.23662590      0 1.0000000 1.0001852 0.020597092
## 2 0.16317858      1 0.7633741 0.7641894 0.016073699
## 3 0.05892632      2 0.6001955 0.6005667 0.013825073
## 4 0.03299713      3 0.5412692 0.5417462 0.011466529
## 5 0.03296604      5 0.4752749 0.4919921 0.010934631
## 6 0.01367571      6 0.4423089 0.4450521 0.010392761
## 7 0.01098391      7 0.4286332 0.4319647 0.010287538
## 8 0.01000000      8 0.4176493 0.4269543 0.009892866
fancyRpartPlot(model_cart,type = 2,cex=0.8,palettes = c("Purples","Oranges"))

mae_cart <-mean(abs(train$Item_Outlet_Sales-pred_cart))
mape_cart <- mape(model_cart,data=train)                  #from modelr package
rmse_cart <-sqrt(mean((train$Item_Outlet_Sales-pred_cart)^2))
rss_cart <-sum((train$Item_Outlet_Sales-pred_cart)^2)
tss_cart <-sum((train$Item_Outlet_Sales- mean(train$Item_Outlet_Sales))^2)
r2_cart <-1-rss_cart/tss_cart
methods <- data.frame(METHODS= c("Linear_regression","Ridge_regression","Lasso_regression","MARS (earth)","K_nearest_neighbours","Cart_decision_tree"))
mae <- data.frame(MAE=c(mae_lr,mae_ridge,mae_lasso,mae_earth,mae_knn,mae_cart))
mape <- data.frame(MAPE=c(mape_lr,NA,NA,mape_earth,mape_knn,mape_cart))     # same method order as above; MAPE was not computed for ridge and lasso
mse <- data.frame(MSE=c(mse_lr,mse_ridge,mse_lasso,mse_earth,mse_knn,NA))
rmse <- data.frame(RMSE=c(rmse_lr,rmse_ridge,rmse_lasso,rmse_earth,rmse_knn,rmse_cart))
Rsquare <- data.frame(R_SQUARE=c(r2_lr,r2_ridge,r2_lasso,r2_earth,r2_knn,r2_cart))

summary_methods <- cbind(methods,mae,mse,rmse,Rsquare,mape)
summary_methods
##                METHODS      MAE     MSE     RMSE  R_SQUARE      MAPE
## 1    Linear_regression 835.6736 1270157 1127.013 0.5637895 1.0348271
## 2     Ridge_regression 837.1619 1283619 1132.969 0.5591663        NA
## 3     Lasso_regression 835.3176 1270353 1127.099 0.5637225        NA
## 4         MARS (earth) 836.7614 1273918 1128.680 0.5624981 1.0395011
## 5 K_nearest_neighbours 801.6950 1232301 1110.090 0.5767907 0.8559474
## 6   Cart_decision_tree 780.6560      NA 1102.774 0.5823507 0.6007582
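The per-model metric computations above repeat the same formulas; they could be wrapped in a small helper function (a sketch of a possible refactor, not used for the results above):

# compute MAE, MSE, RMSE and R-squared from actual and predicted values
reg_metrics <- function(actual, predicted) {
  resid <- actual - predicted
  c(MAE  = mean(abs(resid)),
    MSE  = mean(resid^2),
    RMSE = sqrt(mean(resid^2)),
    R2   = 1 - sum(resid^2) / sum((actual - mean(actual))^2))
}

reg_metrics(train$Item_Outlet_Sales, pred_cart)      # example usage with the CART predictions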

After trying and testing 6 different algorithms, the best scores on the training data (lowest MAE and RMSE, highest R-squared) are achieved by the CART decision tree.

Check For Assumptions:

residuals <- train$Item_Outlet_Sales-pred_cart
#NORMALITY
ad.test(residuals)           #from nortest package
## 
##  Anderson-Darling normality test
## 
## data:  residuals
## A = 102.75, p-value < 2.2e-16
qqPlot(residuals)                  #from car package

## [1] 7189 5224

Since the p-value of the Anderson-Darling test is less than 0.05, we reject the null hypothesis of normality; together with the Q-Q plot, this tells us the residuals are not normally distributed.

#INDEPENDENCY
durbinWatsonTest(residuals)        #from car package
## [1] 2.032728

With a D-W statistic of about 2.03, there is no meaningful autocorrelation among the residuals (the error terms are independent of each other). Hence the independence assumption is met.

#LINEARITY
plotres(model_cart,which = 3)       #from plotmo package

Here the red line is roughly linear, but the residuals are not scattered randomly around it; there is a systematic pattern, so the linearity assumption is not met.

#HOMOSCEDASTICITY
plotres(model_cart,which = 6)       #from plotmo package

Here, too, the residuals are not scattered evenly around the fitted values (the error variance is not constant). Hence homoscedasticity is not met (the residuals are heteroscedastic).

CHECKING FOR VARIANCE

To check whether the CART model's performance is stable, we refit it on two random subsets of the training data and compare the resulting metrics with those obtained on the full training set.

metrics <- data.frame(METRICS =c("MAE","MAPE","RMSE","R-SQUARE"))
summary_metrics <- data.frame(train = c(mae_cart,mape_cart,rmse_cart,r2_cart))

set.seed(123)
split <- createDataPartition(train$Item_Outlet_Sales,p=0.7,list = FALSE)
train1 <- train[split,]
train2 <- train[-split,]

model_cart1 <- rpart(Item_Outlet_Sales ~ ., data=train1, method="anova")
pred_cart1 <- predict(model_cart1,train1)

mae_cart1 <-mean(abs(train1$Item_Outlet_Sales-pred_cart1))
mape_cart1 <- mape(model_cart1,data=train1)                      #from modelr package
rmse_cart1 <-sqrt(mean((train1$Item_Outlet_Sales-pred_cart1)^2))
rss_cart1 <-sum((train1$Item_Outlet_Sales-pred_cart1)^2)
tss_cart1 <-sum((train1$Item_Outlet_Sales- mean(train1$Item_Outlet_Sales))^2)
r2_cart1 <-1-rss_cart1/tss_cart1

summary_metrics1 <- data.frame(train1 = c(mae_cart1,mape_cart1,rmse_cart1,r2_cart1))

set.seed(1234)

model_cart2 <- rpart(Item_Outlet_Sales ~ ., data=train2, method="anova")
pred_cart2 <- predict(model_cart2,train2)

mae_cart2 <-mean(abs(train2$Item_Outlet_Sales-pred_cart2))
mape_cart2 <- mape(model_cart2,data=train2)                       #from modelr package
rmse_cart2 <-sqrt(mean((train2$Item_Outlet_Sales-pred_cart2)^2))
rss_cart2 <-sum((train2$Item_Outlet_Sales-pred_cart2)^2)
tss_cart2 <-sum((train2$Item_Outlet_Sales- mean(train2$Item_Outlet_Sales))^2)
r2_cart2 <-1-rss_cart2/tss_cart2

summary_metrics2 <- data.frame(train2 = c(mae_cart2,mape_cart2,rmse_cart2,r2_cart2))

summary_CART_metrics <- cbind(metrics,summary_metrics,summary_metrics1,summary_metrics2)
summary_CART_metrics
##    METRICS        train       train1       train2
## 1      MAE  780.6560068  785.9666595  766.1150248
## 2     MAPE    0.6007582    0.6046097    0.5947985
## 3     RMSE 1102.7741584 1103.4496374 1087.5799615
## 4 R-SQUARE    0.5823507    0.5857430    0.5845548

Here the accuracy metrics are consistent across the different subsets of the training data, so we can rely on this model and use it to predict sales on unseen data (i.e., the test data).

PREDICT ON FUTURE DATASET

pred_cart_test <- predict(model_cart,test)
test$Item_Outlet_Sales <- pred_cart_test
test$Item_Identifier <- bigmart_test$Item_Identifier
final_test <- test[,c("Item_Identifier","Outlet_Identifier","Item_Outlet_Sales")] 

write.csv(final_test,file = "E:/R Markdown/projects/big mart sales/PREDICTED_TESTDATA.csv")

CONCLUSION

Sales prediction is a very common real-life problem that almost every company faces at some point. If done correctly, it can have a significant impact on the success and performance of that company. According to a study, companies with accurate sales predictions are 10% more likely to grow their revenue year-over-year and 7.3% more likely to hit quota.

What else can be tried:

There are still quite a few things that can be tried to improve our models' predictions. We can create and add more variables, try different models with different subsets of features, etc. Some of the ideas based on our EDA are listed below:
We can transform the target variable to reduce its skewness.
We can also make independent vs independent variable visualizations to discover some more patterns.
We found out four groups of Item_MRP in EDA. We can create a separate model for each of these groups and then combine the predictions.
Based on the violin plots, we can group the categories of the categorical variables which are quite similar in shape and size.
We can even use different algorithms such as XGBoost (extreme gradient boosting) or neural network based models; a minimal XGBoost sketch is given after this list.
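Since the xgboost package is already loaded, a minimal sketch of such a model might look like the following (the hyperparameters are illustrative and untuned, and this is not part of the results above):

# build a design matrix and train a basic XGBoost regressor
x_train <- model.matrix(Item_Outlet_Sales ~ ., data = train)[, -1]
dtrain  <- xgb.DMatrix(data = x_train, label = train$Item_Outlet_Sales)

params <- list(objective = "reg:linear",       # "reg:squarederror" in newer xgboost versions
               eta = 0.05, max_depth = 6,
               subsample = 0.8, colsample_bytree = 0.8)

set.seed(123)
model_xgb <- xgb.train(params = params, data = dtrain, nrounds = 500)
pred_xgb  <- predict(model_xgb, dtrain)        # predictions on the training matrix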