Data Cleaning and Regression on BigMart Data in R

This dataset contains information about BigMart a nation wide supermarket chain. Bigmart’s board of directors have given a challenge to all the data scientists stating to create a model that can predict the sales, per product, for each store. BigMart has collected sales data from the year 2013, for 1559 products across 10 stores in different cities. With this information the corporation hopes we can identify the products and stores which play a key role in their sales and use that information to take the correct measures to ensure success of their business.

As an aspiring data scientist, understanding how to clean and model data like this is of great importance to me. In this kernel, I will use the BigMart data clean it and explore which factors affect the Item_Outlet_Sales. I will perform a regression analysis on the same.

BigMart Data

The data has following features, Item_Outlet_Sales being the target (dependent) variable:

Item_Identifier: Unique product ID
Item_Weight: Weight of the product
Item_Fat_Content: Levels of fat in different types of food
Item_Visibility: The total display area allocated to a product in each store
Item_Type: The category to which the product belongs
Item_MRP: Maximum Retail Price (list price) of the product
Outlet_Identifier: Unique store ID
Outlet_Establishment_Year: The year in which store was established
Outlet_Size: The size of the store
Outlet_Location_Type: The type of city in which the store is located
Outlet_Type: What type of supermarket is the store
Item_Outlet_Sales: Sales of the product in that particular store

We will structure the code as follows:

Loading the data
Exploratory Data Analysis (EDA) and Data Preparation
Regression

Libraries required-

library(nnet) #for multiple regression
library(dplyr) #for data cleaning
library(caret) #for trainiing the model
library(glmnet) #for lasso regression
library(Amelia) #for predictive imputatin
library(stringr) #for slicing character variables
library(ggplot2) #for visualisation
library(mltools) #for one hot encoding
library(corrplot) #for plotting a correlation graph
library(tidyverse) #for data cleaning
library(data.table) #for data manupulation

1. Loading the data

train <- read.csv("/Users/kritik/Desktop/Data Science/Datasets/Analytics Vidhya/bigmart_train.csv",na.strings = c('',' ','NA'),header = T)
test <- read.csv("/Users/kritik/Desktop/Data Science/Datasets/Analytics Vidhya/bigmart_test.csv",na.strings = c('',' ','NA'),header = T)

In read.csv- * header is a logical value indicating whether the file contains the names of the variables as its first line. * na.strings = c() is a character vector of strings which are to be interpreted as NA values, therefore any blank cells/spaces will be considered as NA.

For any further analysis and cleaning we will combine the testing and training data

dim(train); dim(test)

## [1] 8523   12

## [1] 5681   11

test$Item_Outlet_Sales <- NA #Setting the response variable in test dataset to NA
bigmart <- suppressWarnings(bind_rows(train,test))
dim(bigmart)

## [1] 14204    12

Previewing the data

head(bigmart)

##   Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## 1           FDA15       9.300          Low Fat      0.01604730
## 2           DRC01       5.920          Regular      0.01927822
## 3           FDN15      17.500          Low Fat      0.01676007
## 4           FDX07      19.200          Regular      0.00000000
## 5           NCD19       8.930          Low Fat      0.00000000
## 6           FDP36      10.395          Regular      0.00000000
##               Item_Type Item_MRP Outlet_Identifier
## 1                 Dairy 249.8092            OUT049
## 2           Soft Drinks  48.2692            OUT018
## 3                  Meat 141.6180            OUT049
## 4 Fruits and Vegetables 182.0950            OUT010
## 5             Household  53.8614            OUT013
## 6          Baking Goods  51.4008            OUT018
##   Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
## 1                      1999      Medium               Tier 1
## 2                      2009      Medium               Tier 3
## 3                      1999      Medium               Tier 1
## 4                      1998        <NA>               Tier 3
## 5                      1987        High               Tier 3
## 6                      2009      Medium               Tier 3
##         Outlet_Type Item_Outlet_Sales
## 1 Supermarket Type1         3735.1380
## 2 Supermarket Type2          443.4228
## 3 Supermarket Type1         2097.2700
## 4     Grocery Store          732.3800
## 5 Supermarket Type1          994.7052
## 6 Supermarket Type2          556.6088

str(bigmart)

## 'data.frame':    14204 obs. of  12 variables:
##  $ Item_Identifier          : chr  "FDA15" "DRC01" "FDN15" "FDX07" ...
##  $ Item_Weight              : num  9.3 5.92 17.5 19.2 8.93 ...
##  $ Item_Fat_Content         : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
##  $ Item_Visibility          : num  0.016 0.0193 0.0168 0 0 ...
##  $ Item_Type                : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
##  $ Item_MRP                 : num  249.8 48.3 141.6 182.1 53.9 ...
##  $ Outlet_Identifier        : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4 2 6 8 3 ...
##  $ Outlet_Establishment_Year: int  1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
##  $ Outlet_Size              : Factor w/ 3 levels "High","Medium",..: 2 2 2 NA 1 2 1 2 NA NA ...
##  $ Outlet_Location_Type     : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
##  $ Outlet_Type              : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
##  $ Item_Outlet_Sales        : num  3735 443 2097 732 995 ...

Creating a function which gives us all the necessary details needed for data cleaning and preparation in one place.

dataInfo <- function(df) {
  NAvalues <- list(); NAclass <- list(); UniqueVals <- list(); ColName <- names(df); Index <- c(1:dim(df)[2])
  for(i in 1:dim(df)[2]) {
    NAvalues[i] <- sum(is.na(df[,i]))
    NAclass[i] <- class(df[,i])
    if(class(df[,i])=="numeric" | class(df[,i])=="integer") {
      UniqueVals[i] <- 'Not Applicable'
    }
    else{
      UniqueVals[i] <- length(unique(df[,i]))
    }
  }
  NAlist <- cbind(Index,ColName,NAvalues,NAclass,UniqueVals) 
}

Info <- dataInfo(bigmart); Info

##       Index ColName                     NAvalues NAclass    
##  [1,] 1     "Item_Identifier"           0        "character"
##  [2,] 2     "Item_Weight"               2439     "numeric"  
##  [3,] 3     "Item_Fat_Content"          0        "factor"   
##  [4,] 4     "Item_Visibility"           0        "numeric"  
##  [5,] 5     "Item_Type"                 0        "factor"   
##  [6,] 6     "Item_MRP"                  0        "numeric"  
##  [7,] 7     "Outlet_Identifier"         0        "factor"   
##  [8,] 8     "Outlet_Establishment_Year" 0        "integer"  
##  [9,] 9     "Outlet_Size"               4016     "factor"   
## [10,] 10    "Outlet_Location_Type"      0        "factor"   
## [11,] 11    "Outlet_Type"               0        "factor"   
## [12,] 12    "Item_Outlet_Sales"         5681     "numeric"  
##       UniqueVals      
##  [1,] 1559            
##  [2,] "Not Applicable"
##  [3,] 5               
##  [4,] "Not Applicable"
##  [5,] 16              
##  [6,] "Not Applicable"
##  [7,] 10              
##  [8,] "Not Applicable"
##  [9,] 4               
## [10,] 3               
## [11,] 4               
## [12,] "Not Applicable"

Basic Analysis from Variable names and data previewing

One can conclude that the variables in this dataset may be divided into two categories-

‘Item’ Related (Item_Identifier, Item_Weight, Item_Fat_Content, Item_Visibility, Item_Type, Item_MRP)
‘Outlet’ Related (Outlet_Identifier, Outler_Establishment_Year, Outlet_Size, Outlet_Location_Type, Outlet_Type)

We will deal with one variable at a time

1. Item_Identifier

head(table(bigmart$Item_Identifier))

## 
## DRA12 DRA24 DRA59 DRB01 DRB13 DRB24 
##     9    10    10     8     9     8

min(table(bigmart$Item_Identifier))

## [1] 7

bigmart$Item_Identifier <- as.factor(bigmart$Item_Identifier)

On taking a deeper look at the Item_Identifier variable we notice the following-

The first two letters in the Item_Identifier are of three types- ‘DR’, ‘FD’ and ‘NC’.These should stand for ‘Drinks’, ‘Food’ and ‘Non Consumable’ respectively.
The third letter which ranges from A - Z could stand for subcategories in a product Type.
The numbers in Item_Identifier could stand for a unique ID assigned to a product in a category.

Since the second. and third obsevation are somewhat similar to unique ID for each product and it is difficult to get some meaning out of it we will only look into the first observation.

We will try create a new column for Item Category.

bigmart$Item_Category <- str_sub(bigmart$Item_Identifier,start = 1L, end = 2L)
bigmart$Item_Category <- as.factor(bigmart$Item_Category)
head(bigmart$Item_Category)

## [1] FD DR FD FD NC FD
## Levels: DR FD NC

2. Item_Weight

From above we can infer that every product in the dataset appears atleast 7 times. From common knowledge we know that product’s weight remains same be it any sample. Therefore we can impute the missing value in Item_Weight with the weight of same product as identified by Item_Identifier.

imputeWeight <- function(df) {
  df %>%
    group_by(Item_Identifier) %>%
    mutate(Item_Weight = ifelse(is.na(Item_Weight),mean(Item_Weight,na.rm = T),Item_Weight)) %>%
    ungroup()
}

Identify_Weight <- bigmart[,c(1,2)]
Identify_Weight <- imputeWeight(Identify_Weight)
bigmart[,c(1,2)] <- Identify_Weight

ggplot(bigmart) + geom_histogram(aes(Item_Weight),binwidth = 0.5,fill = 'skyblue', col = 'black')

From the above graph no noticeable skewness pattern is visible hence no transformation is needed.

Info <- dataInfo(bigmart); Info

##       Index ColName                     NAvalues NAclass  
##  [1,] 1     "Item_Identifier"           0        "factor" 
##  [2,] 2     "Item_Weight"               0        "numeric"
##  [3,] 3     "Item_Fat_Content"          0        "factor" 
##  [4,] 4     "Item_Visibility"           0        "numeric"
##  [5,] 5     "Item_Type"                 0        "factor" 
##  [6,] 6     "Item_MRP"                  0        "numeric"
##  [7,] 7     "Outlet_Identifier"         0        "factor" 
##  [8,] 8     "Outlet_Establishment_Year" 0        "integer"
##  [9,] 9     "Outlet_Size"               4016     "factor" 
## [10,] 10    "Outlet_Location_Type"      0        "factor" 
## [11,] 11    "Outlet_Type"               0        "factor" 
## [12,] 12    "Item_Outlet_Sales"         5681     "numeric"
## [13,] 13    "Item_Category"             0        "factor" 
##       UniqueVals      
##  [1,] 1559            
##  [2,] "Not Applicable"
##  [3,] 5               
##  [4,] "Not Applicable"
##  [5,] 16              
##  [6,] "Not Applicable"
##  [7,] 10              
##  [8,] "Not Applicable"
##  [9,] 4               
## [10,] 3               
## [11,] 4               
## [12,] "Not Applicable"
## [13,] 3

As we can see that Item_Weight has zero NA values now!

3. Item_Fat_Content

glimpse(bigmart$Item_Fat_Content)

##  Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...

unique(bigmart$Item_Fat_Content)

## [1] Low Fat Regular low fat LF      reg    
## Levels: LF low fat Low Fat reg Regular

As we can see that Item_Fat_Content has redundant variables, the ideal levels should be: 1. ‘LF’, ‘low fat’ and ‘Low Fat’ -> ‘Low Fat’ 2. ‘reg’ and ‘Regular’ -> ‘Regular’

levels(bigmart$Item_Fat_Content) <- list('Low Fat' = 'LF','Low Fat' = 'low fat','Regular'='reg')
unique(bigmart$Item_Fat_Content)

## [1] Low Fat Regular
## Levels: Low Fat Regular

As we found out during the analysis of Item_Identifier Variable, some products are Non Consumable hence we will replace the fat content in those rows by ‘Not Consumable’.

NC <- as.data.table(bigmart[,c(3,13)])
NC[Item_Category=='NC',Item_Fat_Content:='Non Consumable',Item_Fat_Content]
bigmart[,c(3,13)] <- NC

ggplot(bigmart %>% group_by(Item_Fat_Content) %>% summarise(Count = n())) + 
  geom_bar(aes(Item_Fat_Content, Count), stat = 'identity', fill = 'skyblue',col='black')

Info <- dataInfo(bigmart); Info

##       Index ColName                     NAvalues NAclass  
##  [1,] 1     "Item_Identifier"           0        "factor" 
##  [2,] 2     "Item_Weight"               0        "numeric"
##  [3,] 3     "Item_Fat_Content"          0        "factor" 
##  [4,] 4     "Item_Visibility"           0        "numeric"
##  [5,] 5     "Item_Type"                 0        "factor" 
##  [6,] 6     "Item_MRP"                  0        "numeric"
##  [7,] 7     "Outlet_Identifier"         0        "factor" 
##  [8,] 8     "Outlet_Establishment_Year" 0        "integer"
##  [9,] 9     "Outlet_Size"               4016     "factor" 
## [10,] 10    "Outlet_Location_Type"      0        "factor" 
## [11,] 11    "Outlet_Type"               0        "factor" 
## [12,] 12    "Item_Outlet_Sales"         5681     "numeric"
## [13,] 13    "Item_Category"             0        "factor" 
##       UniqueVals      
##  [1,] 1559            
##  [2,] "Not Applicable"
##  [3,] 3               
##  [4,] "Not Applicable"
##  [5,] 16              
##  [6,] "Not Applicable"
##  [7,] 10              
##  [8,] "Not Applicable"
##  [9,] 4               
## [10,] 3               
## [11,] 4               
## [12,] "Not Applicable"
## [13,] 3

4. Item_Visibility

I noticed while pre-viewing the data that few rows in Item_Visibility are zero and it doesn’t make sense for an item to be occupying shelf space and yet have zero visibility. Hence we conclude that this is erraneous data.

We know that every product in the dataset appears atleast 7 times. A particular product has to have similar visibility because of its appearance, be it in any outlet. Therefore we will impute the 0 values in Item_Visibility with the mean visibiility of same product as identified by Item_Identifier.

bigmart[,4][bigmart[,4]==0] <- NA

imputeFat <- function(df) {
  df %>%
    group_by(Item_Identifier) %>% 
    mutate(Item_Visibility = ifelse(is.na(Item_Visibility),mean(Item_Visibility,na.rm=T),Item_Visibility)) %>%
    ungroup()
}

Vis <- bigmart[,c(1,4,5)]
Vis <- imputeFat(Vis)
bigmart[,c(1,4,5)] <- Vis

ggplot(bigmart) + geom_histogram(aes(Item_Visibility),binwidth = 0.005, fill = 'skyblue', col = 'black')

As we see Item_Visibility is right skewed hence we will have to tranform the data.

Info <- dataInfo(bigmart); Info

##       Index ColName                     NAvalues NAclass  
##  [1,] 1     "Item_Identifier"           0        "factor" 
##  [2,] 2     "Item_Weight"               0        "numeric"
##  [3,] 3     "Item_Fat_Content"          0        "factor" 
##  [4,] 4     "Item_Visibility"           0        "numeric"
##  [5,] 5     "Item_Type"                 0        "factor" 
##  [6,] 6     "Item_MRP"                  0        "numeric"
##  [7,] 7     "Outlet_Identifier"         0        "factor" 
##  [8,] 8     "Outlet_Establishment_Year" 0        "integer"
##  [9,] 9     "Outlet_Size"               4016     "factor" 
## [10,] 10    "Outlet_Location_Type"      0        "factor" 
## [11,] 11    "Outlet_Type"               0        "factor" 
## [12,] 12    "Item_Outlet_Sales"         5681     "numeric"
## [13,] 13    "Item_Category"             0        "factor" 
##       UniqueVals      
##  [1,] 1559            
##  [2,] "Not Applicable"
##  [3,] 3               
##  [4,] "Not Applicable"
##  [5,] 16              
##  [6,] "Not Applicable"
##  [7,] 10              
##  [8,] "Not Applicable"
##  [9,] 4               
## [10,] 3               
## [11,] 4               
## [12,] "Not Applicable"
## [13,] 3

5. Outlet_Size

We will now replace the NA values in Outlet_Size by predictive Imputation.

Outlet_Identifier, Outlet_Establishment_Year, Outlet_Location_Type and Outlet_Type are the only variables that the Outlet_Size should depend on hence we will use the Amelia function from Amelia library to impute the NA values in Outlet_Size.

sub_mart <- bigmart  %>%
  select(Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type)

sub_mart$Outlet_Identifier <- as.factor(sub_mart$Outlet_Identifier)
sub_mart$Outlet_Establishment_Year <- as.factor(sub_mart$Outlet_Establishment_Year)
sub_mart$Outlet_Size <- factor(sub_mart$Outlet_Size,levels = c("Small","Medium","High"),labels = c(1,2,3),ordered = T)
sub_mart$Outlet_Location_Type <- factor(sub_mart$Outlet_Location_Type,levels = c("Tier 3","Tier 2","Tier 1"),labels = c(3,2,1),ordered = T)
sub_mart$Outlet_Type <- as.factor(sub_mart$Outlet_Type)

str(sub_mart)

## 'data.frame':    14204 obs. of  5 variables:
##  $ Outlet_Identifier        : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4 2 6 8 3 ...
##  $ Outlet_Establishment_Year: Factor w/ 9 levels "1985","1987",..: 5 9 5 4 2 9 2 1 6 8 ...
##  $ Outlet_Size              : Ord.factor w/ 3 levels "1"<"2"<"3": 2 2 2 NA 3 2 3 2 NA NA ...
##  $ Outlet_Location_Type     : Ord.factor w/ 3 levels "3"<"2"<"1": 3 1 3 1 1 1 1 1 2 2 ...
##  $ Outlet_Type              : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...

imputed_mart <- amelia(sub_mart, m = 5, ords = c("Outlet_Size","Outlet_Location_Type"),
                                   noms = c("Outlet_Establishment_Year", "Outlet_Identifier","Outlet_Type"),
                                   incheck = TRUE,empri = 1)

## Warning in amcheck(x = x, m = m, idvars = numopts$idvars, priors = priors, : The variables (or variable with levels) Outlet_Establishment_Year1987, Outlet_Establishment_Year1997, Outlet_Establishment_Year1999, Outlet_Establishment_Year2004, Outlet_Establishment_Year2009, Outlet_Size.L, Outlet_Size.Q, Outlet_Location_Type.L, Outlet_Location_Type.Q, Outlet_TypeSupermarket Type1, Outlet_TypeSupermarket Type2, Outlet_TypeSupermarket Type3 are perfectly collinear with another variable in the data.

## -- Imputation 1 --
## 
##   1  2  3  4  5  6  7  8  9
## 
## -- Imputation 2 --
## 
##   1  2  3  4  5  6  7  8  9
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5  6  7  8
## 
## -- Imputation 4 --
## 
##   1  2  3  4  5  6  7  8  9
## 
## -- Imputation 5 --
## 
##   1  2  3  4  5  6  7  8

anova(fit1,fit2,fit3,fit4,fit5,test = "Chisq")

## Likelihood ratio tests of Multinomial Models
## 
## Response: Outlet_Size
##                                                                                Model
## 1 Outlet_Location_Type + Outlet_Establishment_Year + Outlet_Identifier + Outlet_Type
## 2 Outlet_Location_Type + Outlet_Establishment_Year + Outlet_Identifier + Outlet_Type
## 3 Outlet_Location_Type + Outlet_Establishment_Year + Outlet_Identifier + Outlet_Type
## 4 Outlet_Location_Type + Outlet_Establishment_Year + Outlet_Identifier + Outlet_Type
## 5 Outlet_Location_Type + Outlet_Establishment_Year + Outlet_Identifier + Outlet_Type
##   Resid. df Resid. Dev   Test    Df     LR stat. Pr(Chi)
## 1     28388   8071.067                                  
## 2     28388   8070.989 1 vs 2     0   0.07799501       0
## 3     28388   7976.382 2 vs 3     0  94.60686197       0
## 4     28388   8073.995 3 vs 4     0 -97.61310039       1
## 5     28388   8070.431 4 vs 5     0   3.56389740       0

While writing this notebook, 4th imputation appeared to be the best one from the above statistics, hence I have used the fourth imputation.

bigmart$Outlet_Size <- imputed_mart$imputations$imp4$Outlet_Size

ggplot(bigmart %>% group_by(Outlet_Size) %>% summarise(Count = n())) + 
  geom_bar(aes(Outlet_Size, Count), stat = 'identity', fill = 'skyblue',col = 'black')

Feature Engineering

On analysis of the variables and common knowledge we know that most products are sold at a base rate (which is usually with respect to a base weight). Hence here we will create a new variable which tells us about the Price of the Item per unit Weight.

bigmart$Item_Price_Weight <- bigmart$Item_MRP / bigmart$Item_Weight

We know that people only shop at the place they may have been to earlier as well. Hence we will create a varriable which has the information about the Years of Operation of each store. Since this data is from 2013 we will calculate operational years with 2013.

bigmart$Outlet_Operational_Years <- 2013 - bigmart$Outlet_Establishment_Year

Info <- dataInfo(bigmart); Info

## Warning in NAclass[i] <- class(df[, i]): number of items to replace is not
## a multiple of replacement length

## Warning in if (class(df[, i]) == "numeric" | class(df[, i]) == "integer")
## {: the condition has length > 1 and only the first element will be used

##       Index ColName                     NAvalues NAclass  
##  [1,] 1     "Item_Identifier"           0        "factor" 
##  [2,] 2     "Item_Weight"               0        "numeric"
##  [3,] 3     "Item_Fat_Content"          0        "factor" 
##  [4,] 4     "Item_Visibility"           0        "numeric"
##  [5,] 5     "Item_Type"                 0        "factor" 
##  [6,] 6     "Item_MRP"                  0        "numeric"
##  [7,] 7     "Outlet_Identifier"         0        "factor" 
##  [8,] 8     "Outlet_Establishment_Year" 0        "integer"
##  [9,] 9     "Outlet_Size"               0        "ordered"
## [10,] 10    "Outlet_Location_Type"      0        "factor" 
## [11,] 11    "Outlet_Type"               0        "factor" 
## [12,] 12    "Item_Outlet_Sales"         5681     "numeric"
## [13,] 13    "Item_Category"             0        "factor" 
## [14,] 14    "Item_Price_Weight"         0        "numeric"
## [15,] 15    "Outlet_Operational_Years"  0        "numeric"
##       UniqueVals      
##  [1,] 1559            
##  [2,] "Not Applicable"
##  [3,] 3               
##  [4,] "Not Applicable"
##  [5,] 16              
##  [6,] "Not Applicable"
##  [7,] 10              
##  [8,] "Not Applicable"
##  [9,] 3               
## [10,] 3               
## [11,] 4               
## [12,] "Not Applicable"
## [13,] 3               
## [14,] "Not Applicable"
## [15,] "Not Applicable"

Encoding the Variables

Machine learning algorithms and deep learning neural networks require that input and output variables are numbers. This means that categorical data must be encoded to numbers before we can use it to fit and evaluate a model. There are many ways to encode categorical variables for modelling, although the most common are as follows:

Label Encoding: Where each label is converted to a unique number (for Ordinal Variables)
One Hot Encoding: Where each label is mapped to a binary vector (for Nominal Variables)

1. Label Encoding

We will perform label encoding on the following variables- ‘Outlet_Size’, ‘Outlet_Location_Type’

bigmart$Outlet_Size <- ifelse(bigmart$Outlet_Size == 1, 0, ifelse(bigmart$Outlet_Size == 2, 1, 2))
bigmart$Outlet_Location_Type <- ifelse(bigmart$Outlet_Location_Type == 3, 0, ifelse(bigmart$Outlet_Location_Type == 2, 1, 2))

2. One Hot Encoding

We will perform one hot encoding on the following variables- ‘Item_Fat_Content’, ‘Item_Type’, ‘Outlet_Identifier’, ‘Outlet_Type’, ‘Item_Category’.

ohe_var <- bigmart[,c(3,5,7,11,13)]
bigmart_ohe <- one_hot(as.data.table(ohe_var))
bigmart[,c(3,5,7,11,13)] <- NULL
bigmart <- bind_cols(bigmart,bigmart_ohe)

Splitting the Data

Now we will split the data back into orignal traininig and testing

test <- subset(bigmart,is.na(bigmart$Item_Outlet_Sales))
train <- subset(bigmart,!is.na(bigmart$Item_Outlet_Sales))

Dividing Predictor and Response into two different variables

Ytrain <- train[,8]
Xtrain <- train[,-8]

3. Regression

Now we will apply Lassoo regression to the training data.

set.seed(123)
my_control = trainControl(method="cv", number=5)
Grid = expand.grid(alpha = 1, lambda = seq(11,13,by = 0.0001))
lasso_mod = train(x = train[, -c(1,8)], y = train$Item_Outlet_Sales,
                             method='glmnet', trControl= my_control, tuneGrid = Grid)

head(lasso_mod$results)

##   alpha  lambda     RMSE  Rsquared      MAE   RMSESD  RsquaredSD    MAESD
## 1     1 11.0000 1129.883 0.5619019 836.4769 7.793448 0.006164546 5.248766
## 2     1 11.0001 1129.883 0.5619019 836.4769 7.793454 0.006164549 5.248766
## 3     1 11.0002 1129.883 0.5619019 836.4768 7.793461 0.006164553 5.248767
## 4     1 11.0003 1129.883 0.5619019 836.4768 7.793468 0.006164556 5.248767
## 5     1 11.0004 1129.883 0.5619019 836.4768 7.793474 0.006164560 5.248767
## 6     1 11.0005 1129.883 0.5619019 836.4768 7.793481 0.006164563 5.248767

Predicting the ‘Item_Outlet_Sales’ for testing data

test$Item_Outlet_Sales <- predict(lasso_mod,newdata=test)

I hope you enjoyed this analysis! I think going forward it would be interesting to use a method other than lasso regression. Comments are Welcome

Connect with me here- Click on any one of the below to connect with me there LinkedIn- linkedin.com/in/kritikseth GitHub- github.com/kritik-seth Kaggle- kaggle.com/kritikseth