This dataset contains information about BigMart a nation wide supermarket chain. Bigmart’s board of directors have given a challenge to all the data scientists stating to create a model that can predict the sales, per product, for each store. BigMart has collected sales data from the year 2013, for 1559 products across 10 stores in different cities. With this information the corporation hopes we can identify the products and stores which play a key role in their sales and use that information to take the correct measures to ensure success of their business.
As an aspiring data scientist, understanding how to clean and model data like this is of great importance to me. In this kernel, I will use the BigMart data clean it and explore which factors affect the Item_Outlet_Sales. I will perform a regression analysis on the same.
The data has following features, Item_Outlet_Sales being the target (dependent) variable:
library(nnet) #for multiple regression
library(dplyr) #for data cleaning
library(caret) #for trainiing the model
library(glmnet) #for lasso regression
library(Amelia) #for predictive imputatin
library(stringr) #for slicing character variables
library(ggplot2) #for visualisation
library(mltools) #for one hot encoding
library(corrplot) #for plotting a correlation graph
library(tidyverse) #for data cleaning
library(data.table) #for data manupulation
train <- read.csv("/Users/kritik/Desktop/Data Science/Datasets/Analytics Vidhya/bigmart_train.csv",na.strings = c('',' ','NA'),header = T)
test <- read.csv("/Users/kritik/Desktop/Data Science/Datasets/Analytics Vidhya/bigmart_test.csv",na.strings = c('',' ','NA'),header = T)
In read.csv- * header is a logical value indicating whether the file contains the names of the variables as its first line. * na.strings = c() is a character vector of strings which are to be interpreted as NA values, therefore any blank cells/spaces will be considered as NA.
dim(train); dim(test)
## [1] 8523 12
## [1] 5681 11
test$Item_Outlet_Sales <- NA #Setting the response variable in test dataset to NA
bigmart <- suppressWarnings(bind_rows(train,test))
dim(bigmart)
## [1] 14204 12
head(bigmart)
## Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## 1 FDA15 9.300 Low Fat 0.01604730
## 2 DRC01 5.920 Regular 0.01927822
## 3 FDN15 17.500 Low Fat 0.01676007
## 4 FDX07 19.200 Regular 0.00000000
## 5 NCD19 8.930 Low Fat 0.00000000
## 6 FDP36 10.395 Regular 0.00000000
## Item_Type Item_MRP Outlet_Identifier
## 1 Dairy 249.8092 OUT049
## 2 Soft Drinks 48.2692 OUT018
## 3 Meat 141.6180 OUT049
## 4 Fruits and Vegetables 182.0950 OUT010
## 5 Household 53.8614 OUT013
## 6 Baking Goods 51.4008 OUT018
## Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
## 1 1999 Medium Tier 1
## 2 2009 Medium Tier 3
## 3 1999 Medium Tier 1
## 4 1998 <NA> Tier 3
## 5 1987 High Tier 3
## 6 2009 Medium Tier 3
## Outlet_Type Item_Outlet_Sales
## 1 Supermarket Type1 3735.1380
## 2 Supermarket Type2 443.4228
## 3 Supermarket Type1 2097.2700
## 4 Grocery Store 732.3800
## 5 Supermarket Type1 994.7052
## 6 Supermarket Type2 556.6088
str(bigmart)
## 'data.frame': 14204 obs. of 12 variables:
## $ Item_Identifier : chr "FDA15" "DRC01" "FDN15" "FDX07" ...
## $ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
## $ Item_Fat_Content : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
## $ Item_Visibility : num 0.016 0.0193 0.0168 0 0 ...
## $ Item_Type : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
## $ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
## $ Outlet_Identifier : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4 2 6 8 3 ...
## $ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
## $ Outlet_Size : Factor w/ 3 levels "High","Medium",..: 2 2 2 NA 1 2 1 2 NA NA ...
## $ Outlet_Location_Type : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
## $ Outlet_Type : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
## $ Item_Outlet_Sales : num 3735 443 2097 732 995 ...
Creating a function which gives us all the necessary details needed for data cleaning and preparation in one place.
dataInfo <- function(df) {
NAvalues <- list(); NAclass <- list(); UniqueVals <- list(); ColName <- names(df); Index <- c(1:dim(df)[2])
for(i in 1:dim(df)[2]) {
NAvalues[i] <- sum(is.na(df[,i]))
NAclass[i] <- class(df[,i])
if(class(df[,i])=="numeric" | class(df[,i])=="integer") {
UniqueVals[i] <- 'Not Applicable'
}
else{
UniqueVals[i] <- length(unique(df[,i]))
}
}
NAlist <- cbind(Index,ColName,NAvalues,NAclass,UniqueVals)
}
Info <- dataInfo(bigmart); Info
## Index ColName NAvalues NAclass
## [1,] 1 "Item_Identifier" 0 "character"
## [2,] 2 "Item_Weight" 2439 "numeric"
## [3,] 3 "Item_Fat_Content" 0 "factor"
## [4,] 4 "Item_Visibility" 0 "numeric"
## [5,] 5 "Item_Type" 0 "factor"
## [6,] 6 "Item_MRP" 0 "numeric"
## [7,] 7 "Outlet_Identifier" 0 "factor"
## [8,] 8 "Outlet_Establishment_Year" 0 "integer"
## [9,] 9 "Outlet_Size" 4016 "factor"
## [10,] 10 "Outlet_Location_Type" 0 "factor"
## [11,] 11 "Outlet_Type" 0 "factor"
## [12,] 12 "Item_Outlet_Sales" 5681 "numeric"
## UniqueVals
## [1,] 1559
## [2,] "Not Applicable"
## [3,] 5
## [4,] "Not Applicable"
## [5,] 16
## [6,] "Not Applicable"
## [7,] 10
## [8,] "Not Applicable"
## [9,] 4
## [10,] 3
## [11,] 4
## [12,] "Not Applicable"
One can conclude that the variables in this dataset may be divided into two categories-
head(table(bigmart$Item_Identifier))
##
## DRA12 DRA24 DRA59 DRB01 DRB13 DRB24
## 9 10 10 8 9 8
min(table(bigmart$Item_Identifier))
## [1] 7
bigmart$Item_Identifier <- as.factor(bigmart$Item_Identifier)
On taking a deeper look at the Item_Identifier variable we notice the following-
Since the second. and third obsevation are somewhat similar to unique ID for each product and it is difficult to get some meaning out of it we will only look into the first observation.
We will try create a new column for Item Category.
bigmart$Item_Category <- str_sub(bigmart$Item_Identifier,start = 1L, end = 2L)
bigmart$Item_Category <- as.factor(bigmart$Item_Category)
head(bigmart$Item_Category)
## [1] FD DR FD FD NC FD
## Levels: DR FD NC
From above we can infer that every product in the dataset appears atleast 7 times. From common knowledge we know that product’s weight remains same be it any sample. Therefore we can impute the missing value in Item_Weight with the weight of same product as identified by Item_Identifier.
imputeWeight <- function(df) {
df %>%
group_by(Item_Identifier) %>%
mutate(Item_Weight = ifelse(is.na(Item_Weight),mean(Item_Weight,na.rm = T),Item_Weight)) %>%
ungroup()
}
Identify_Weight <- bigmart[,c(1,2)]
Identify_Weight <- imputeWeight(Identify_Weight)
bigmart[,c(1,2)] <- Identify_Weight
ggplot(bigmart) + geom_histogram(aes(Item_Weight),binwidth = 0.5,fill = 'skyblue', col = 'black')
From the above graph no noticeable skewness pattern is visible hence no transformation is needed.
Info <- dataInfo(bigmart); Info
## Index ColName NAvalues NAclass
## [1,] 1 "Item_Identifier" 0 "factor"
## [2,] 2 "Item_Weight" 0 "numeric"
## [3,] 3 "Item_Fat_Content" 0 "factor"
## [4,] 4 "Item_Visibility" 0 "numeric"
## [5,] 5 "Item_Type" 0 "factor"
## [6,] 6 "Item_MRP" 0 "numeric"
## [7,] 7 "Outlet_Identifier" 0 "factor"
## [8,] 8 "Outlet_Establishment_Year" 0 "integer"
## [9,] 9 "Outlet_Size" 4016 "factor"
## [10,] 10 "Outlet_Location_Type" 0 "factor"
## [11,] 11 "Outlet_Type" 0 "factor"
## [12,] 12 "Item_Outlet_Sales" 5681 "numeric"
## [13,] 13 "Item_Category" 0 "factor"
## UniqueVals
## [1,] 1559
## [2,] "Not Applicable"
## [3,] 5
## [4,] "Not Applicable"
## [5,] 16
## [6,] "Not Applicable"
## [7,] 10
## [8,] "Not Applicable"
## [9,] 4
## [10,] 3
## [11,] 4
## [12,] "Not Applicable"
## [13,] 3
As we can see that Item_Weight has zero NA values now!
glimpse(bigmart$Item_Fat_Content)
## Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
unique(bigmart$Item_Fat_Content)
## [1] Low Fat Regular low fat LF reg
## Levels: LF low fat Low Fat reg Regular
As we can see that Item_Fat_Content has redundant variables, the ideal levels should be: 1. ‘LF’, ‘low fat’ and ‘Low Fat’ -> ‘Low Fat’ 2. ‘reg’ and ‘Regular’ -> ‘Regular’
levels(bigmart$Item_Fat_Content) <- list('Low Fat' = 'LF','Low Fat' = 'low fat','Regular'='reg')
unique(bigmart$Item_Fat_Content)
## [1] Low Fat Regular
## Levels: Low Fat Regular
As we found out during the analysis of Item_Identifier Variable, some products are Non Consumable hence we will replace the fat content in those rows by ‘Not Consumable’.
NC <- as.data.table(bigmart[,c(3,13)])
NC[Item_Category=='NC',Item_Fat_Content:='Non Consumable',Item_Fat_Content]
bigmart[,c(3,13)] <- NC
ggplot(bigmart %>% group_by(Item_Fat_Content) %>% summarise(Count = n())) +
geom_bar(aes(Item_Fat_Content, Count), stat = 'identity', fill = 'skyblue',col='black')
Info <- dataInfo(bigmart); Info
## Index ColName NAvalues NAclass
## [1,] 1 "Item_Identifier" 0 "factor"
## [2,] 2 "Item_Weight" 0 "numeric"
## [3,] 3 "Item_Fat_Content" 0 "factor"
## [4,] 4 "Item_Visibility" 0 "numeric"
## [5,] 5 "Item_Type" 0 "factor"
## [6,] 6 "Item_MRP" 0 "numeric"
## [7,] 7 "Outlet_Identifier" 0 "factor"
## [8,] 8 "Outlet_Establishment_Year" 0 "integer"
## [9,] 9 "Outlet_Size" 4016 "factor"
## [10,] 10 "Outlet_Location_Type" 0 "factor"
## [11,] 11 "Outlet_Type" 0 "factor"
## [12,] 12 "Item_Outlet_Sales" 5681 "numeric"
## [13,] 13 "Item_Category" 0 "factor"
## UniqueVals
## [1,] 1559
## [2,] "Not Applicable"
## [3,] 3
## [4,] "Not Applicable"
## [5,] 16
## [6,] "Not Applicable"
## [7,] 10
## [8,] "Not Applicable"
## [9,] 4
## [10,] 3
## [11,] 4
## [12,] "Not Applicable"
## [13,] 3
I noticed while pre-viewing the data that few rows in Item_Visibility are zero and it doesn’t make sense for an item to be occupying shelf space and yet have zero visibility. Hence we conclude that this is erraneous data.
We know that every product in the dataset appears atleast 7 times. A particular product has to have similar visibility because of its appearance, be it in any outlet. Therefore we will impute the 0 values in Item_Visibility with the mean visibiility of same product as identified by Item_Identifier.
bigmart[,4][bigmart[,4]==0] <- NA
imputeFat <- function(df) {
df %>%
group_by(Item_Identifier) %>%
mutate(Item_Visibility = ifelse(is.na(Item_Visibility),mean(Item_Visibility,na.rm=T),Item_Visibility)) %>%
ungroup()
}
Vis <- bigmart[,c(1,4,5)]
Vis <- imputeFat(Vis)
bigmart[,c(1,4,5)] <- Vis
ggplot(bigmart) + geom_histogram(aes(Item_Visibility),binwidth = 0.005, fill = 'skyblue', col = 'black')
As we see Item_Visibility is right skewed hence we will have to tranform the data.
Info <- dataInfo(bigmart); Info
## Index ColName NAvalues NAclass
## [1,] 1 "Item_Identifier" 0 "factor"
## [2,] 2 "Item_Weight" 0 "numeric"
## [3,] 3 "Item_Fat_Content" 0 "factor"
## [4,] 4 "Item_Visibility" 0 "numeric"
## [5,] 5 "Item_Type" 0 "factor"
## [6,] 6 "Item_MRP" 0 "numeric"
## [7,] 7 "Outlet_Identifier" 0 "factor"
## [8,] 8 "Outlet_Establishment_Year" 0 "integer"
## [9,] 9 "Outlet_Size" 4016 "factor"
## [10,] 10 "Outlet_Location_Type" 0 "factor"
## [11,] 11 "Outlet_Type" 0 "factor"
## [12,] 12 "Item_Outlet_Sales" 5681 "numeric"
## [13,] 13 "Item_Category" 0 "factor"
## UniqueVals
## [1,] 1559
## [2,] "Not Applicable"
## [3,] 3
## [4,] "Not Applicable"
## [5,] 16
## [6,] "Not Applicable"
## [7,] 10
## [8,] "Not Applicable"
## [9,] 4
## [10,] 3
## [11,] 4
## [12,] "Not Applicable"
## [13,] 3
We will now replace the NA values in Outlet_Size by predictive Imputation.
Outlet_Identifier, Outlet_Establishment_Year, Outlet_Location_Type and Outlet_Type are the only variables that the Outlet_Size should depend on hence we will use the Amelia function from Amelia library to impute the NA values in Outlet_Size.
sub_mart <- bigmart %>%
select(Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type)
sub_mart$Outlet_Identifier <- as.factor(sub_mart$Outlet_Identifier)
sub_mart$Outlet_Establishment_Year <- as.factor(sub_mart$Outlet_Establishment_Year)
sub_mart$Outlet_Size <- factor(sub_mart$Outlet_Size,levels = c("Small","Medium","High"),labels = c(1,2,3),ordered = T)
sub_mart$Outlet_Location_Type <- factor(sub_mart$Outlet_Location_Type,levels = c("Tier 3","Tier 2","Tier 1"),labels = c(3,2,1),ordered = T)
sub_mart$Outlet_Type <- as.factor(sub_mart$Outlet_Type)
str(sub_mart)
## 'data.frame': 14204 obs. of 5 variables:
## $ Outlet_Identifier : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4 2 6 8 3 ...
## $ Outlet_Establishment_Year: Factor w/ 9 levels "1985","1987",..: 5 9 5 4 2 9 2 1 6 8 ...
## $ Outlet_Size : Ord.factor w/ 3 levels "1"<"2"<"3": 2 2 2 NA 3 2 3 2 NA NA ...
## $ Outlet_Location_Type : Ord.factor w/ 3 levels "3"<"2"<"1": 3 1 3 1 1 1 1 1 2 2 ...
## $ Outlet_Type : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
imputed_mart <- amelia(sub_mart, m = 5, ords = c("Outlet_Size","Outlet_Location_Type"),
noms = c("Outlet_Establishment_Year", "Outlet_Identifier","Outlet_Type"),
incheck = TRUE,empri = 1)
## Warning in amcheck(x = x, m = m, idvars = numopts$idvars, priors = priors, : The variables (or variable with levels) Outlet_Establishment_Year1987, Outlet_Establishment_Year1997, Outlet_Establishment_Year1999, Outlet_Establishment_Year2004, Outlet_Establishment_Year2009, Outlet_Size.L, Outlet_Size.Q, Outlet_Location_Type.L, Outlet_Location_Type.Q, Outlet_TypeSupermarket Type1, Outlet_TypeSupermarket Type2, Outlet_TypeSupermarket Type3 are perfectly collinear with another variable in the data.
## -- Imputation 1 --
##
## 1 2 3 4 5 6 7 8 9
##
## -- Imputation 2 --
##
## 1 2 3 4 5 6 7 8 9
##
## -- Imputation 3 --
##
## 1 2 3 4 5 6 7 8
##
## -- Imputation 4 --
##
## 1 2 3 4 5 6 7 8 9
##
## -- Imputation 5 --
##
## 1 2 3 4 5 6 7 8
anova(fit1,fit2,fit3,fit4,fit5,test = "Chisq")
## Likelihood ratio tests of Multinomial Models
##
## Response: Outlet_Size
## Model
## 1 Outlet_Location_Type + Outlet_Establishment_Year + Outlet_Identifier + Outlet_Type
## 2 Outlet_Location_Type + Outlet_Establishment_Year + Outlet_Identifier + Outlet_Type
## 3 Outlet_Location_Type + Outlet_Establishment_Year + Outlet_Identifier + Outlet_Type
## 4 Outlet_Location_Type + Outlet_Establishment_Year + Outlet_Identifier + Outlet_Type
## 5 Outlet_Location_Type + Outlet_Establishment_Year + Outlet_Identifier + Outlet_Type
## Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
## 1 28388 8071.067
## 2 28388 8070.989 1 vs 2 0 0.07799501 0
## 3 28388 7976.382 2 vs 3 0 94.60686197 0
## 4 28388 8073.995 3 vs 4 0 -97.61310039 1
## 5 28388 8070.431 4 vs 5 0 3.56389740 0
While writing this notebook, 4th imputation appeared to be the best one from the above statistics, hence I have used the fourth imputation.
bigmart$Outlet_Size <- imputed_mart$imputations$imp4$Outlet_Size
ggplot(bigmart %>% group_by(Outlet_Size) %>% summarise(Count = n())) +
geom_bar(aes(Outlet_Size, Count), stat = 'identity', fill = 'skyblue',col = 'black')
On analysis of the variables and common knowledge we know that most products are sold at a base rate (which is usually with respect to a base weight). Hence here we will create a new variable which tells us about the Price of the Item per unit Weight.
bigmart$Item_Price_Weight <- bigmart$Item_MRP / bigmart$Item_Weight
We know that people only shop at the place they may have been to earlier as well. Hence we will create a varriable which has the information about the Years of Operation of each store. Since this data is from 2013 we will calculate operational years with 2013.
bigmart$Outlet_Operational_Years <- 2013 - bigmart$Outlet_Establishment_Year
Info <- dataInfo(bigmart); Info
## Warning in NAclass[i] <- class(df[, i]): number of items to replace is not
## a multiple of replacement length
## Warning in if (class(df[, i]) == "numeric" | class(df[, i]) == "integer")
## {: the condition has length > 1 and only the first element will be used
## Index ColName NAvalues NAclass
## [1,] 1 "Item_Identifier" 0 "factor"
## [2,] 2 "Item_Weight" 0 "numeric"
## [3,] 3 "Item_Fat_Content" 0 "factor"
## [4,] 4 "Item_Visibility" 0 "numeric"
## [5,] 5 "Item_Type" 0 "factor"
## [6,] 6 "Item_MRP" 0 "numeric"
## [7,] 7 "Outlet_Identifier" 0 "factor"
## [8,] 8 "Outlet_Establishment_Year" 0 "integer"
## [9,] 9 "Outlet_Size" 0 "ordered"
## [10,] 10 "Outlet_Location_Type" 0 "factor"
## [11,] 11 "Outlet_Type" 0 "factor"
## [12,] 12 "Item_Outlet_Sales" 5681 "numeric"
## [13,] 13 "Item_Category" 0 "factor"
## [14,] 14 "Item_Price_Weight" 0 "numeric"
## [15,] 15 "Outlet_Operational_Years" 0 "numeric"
## UniqueVals
## [1,] 1559
## [2,] "Not Applicable"
## [3,] 3
## [4,] "Not Applicable"
## [5,] 16
## [6,] "Not Applicable"
## [7,] 10
## [8,] "Not Applicable"
## [9,] 3
## [10,] 3
## [11,] 4
## [12,] "Not Applicable"
## [13,] 3
## [14,] "Not Applicable"
## [15,] "Not Applicable"
Machine learning algorithms and deep learning neural networks require that input and output variables are numbers. This means that categorical data must be encoded to numbers before we can use it to fit and evaluate a model. There are many ways to encode categorical variables for modelling, although the most common are as follows:
We will perform label encoding on the following variables- ‘Outlet_Size’, ‘Outlet_Location_Type’
bigmart$Outlet_Size <- ifelse(bigmart$Outlet_Size == 1, 0, ifelse(bigmart$Outlet_Size == 2, 1, 2))
bigmart$Outlet_Location_Type <- ifelse(bigmart$Outlet_Location_Type == 3, 0, ifelse(bigmart$Outlet_Location_Type == 2, 1, 2))
We will perform one hot encoding on the following variables- ‘Item_Fat_Content’, ‘Item_Type’, ‘Outlet_Identifier’, ‘Outlet_Type’, ‘Item_Category’.
ohe_var <- bigmart[,c(3,5,7,11,13)]
bigmart_ohe <- one_hot(as.data.table(ohe_var))
bigmart[,c(3,5,7,11,13)] <- NULL
bigmart <- bind_cols(bigmart,bigmart_ohe)
Now we will split the data back into orignal traininig and testing
test <- subset(bigmart,is.na(bigmart$Item_Outlet_Sales))
train <- subset(bigmart,!is.na(bigmart$Item_Outlet_Sales))
Dividing Predictor and Response into two different variables
Ytrain <- train[,8]
Xtrain <- train[,-8]
Now we will apply Lassoo regression to the training data.
set.seed(123)
my_control = trainControl(method="cv", number=5)
Grid = expand.grid(alpha = 1, lambda = seq(11,13,by = 0.0001))
lasso_mod = train(x = train[, -c(1,8)], y = train$Item_Outlet_Sales,
method='glmnet', trControl= my_control, tuneGrid = Grid)
head(lasso_mod$results)
## alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 11.0000 1129.883 0.5619019 836.4769 7.793448 0.006164546 5.248766
## 2 1 11.0001 1129.883 0.5619019 836.4769 7.793454 0.006164549 5.248766
## 3 1 11.0002 1129.883 0.5619019 836.4768 7.793461 0.006164553 5.248767
## 4 1 11.0003 1129.883 0.5619019 836.4768 7.793468 0.006164556 5.248767
## 5 1 11.0004 1129.883 0.5619019 836.4768 7.793474 0.006164560 5.248767
## 6 1 11.0005 1129.883 0.5619019 836.4768 7.793481 0.006164563 5.248767
Predicting the ‘Item_Outlet_Sales’ for testing data
test$Item_Outlet_Sales <- predict(lasso_mod,newdata=test)
I hope you enjoyed this analysis! I think going forward it would be interesting to use a method other than lasso regression. Comments are Welcome
Connect with me here- Click on any one of the below to connect with me there LinkedIn- linkedin.com/in/kritikseth GitHub- github.com/kritik-seth Kaggle- kaggle.com/kritikseth