The dataset comprises of sales data (of a renowned Super Market) for 1559 products across 10 stores in different cities (broadly classified based on the purchase power parity, working population, size and few other factors).
The project aims to build a predictive model to analyze the sales of each product at a particular store. With this we shall understand the properties of products and stores which play a key role in increasing sales. The results of the model will be used to provide recommendations to improve the sales.
rm(list = ls())
setwd('/Users/Mughundhan/UIC/UIC Academics/FALL 2017/BIZ ANALYTICS STATS/Project/Mid Report')
library(lubridate) # for csv files
library(leaflet) # interactive maps
library(dplyr) # for piping purpose %>%
library(data.table)# aggregate
library(ggplot2) # barplot
library(mice) # imputing with plausible data values (drawn from a distribution specifically designed for each missing datapoint)
library(rpart) # Decision Trees
library(VIM) # Visual Representation for MICE
library(data.table)
train <- read.csv("Train.csv", header=T, na.strings=c("","NA")) #Empty spaces to be replaced by NA
test <- read.csv("Test.csv", header=T, na.strings=c("","NA"))
test$Item_Outlet_Sales <- NA
fdata <- rbind(test, train)
fdata <- as.data.table(fdata)
Let us have a look at the description of each variable in the dataset:
Let us now perform basic operations to have a look at the summary and the structure of the dataset.
summary(fdata)
## Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## DRA24 : 10 Min. : 4.555 LF : 522 Min. :0.00000
## DRA59 : 10 1st Qu.: 8.710 Low Fat:8485 1st Qu.:0.02704
## DRB25 : 10 Median :12.600 Regular:4824 Median :0.05402
## DRC25 : 10 Mean :12.793 low fat: 178 Mean :0.06595
## DRC27 : 10 3rd Qu.:16.750 reg : 195 3rd Qu.:0.09404
## DRC36 : 10 Max. :21.350 Max. :0.32839
## (Other):14144 NA's :2439
## Item_Type Item_MRP Outlet_Identifier
## Fruits and Vegetables:2013 Min. : 31.29 OUT027 :1559
## Snack Foods :1989 1st Qu.: 94.01 OUT013 :1553
## Household :1548 Median :142.25 OUT035 :1550
## Frozen Foods :1426 Mean :141.00 OUT046 :1550
## Dairy :1136 3rd Qu.:185.86 OUT049 :1550
## Baking Goods :1086 Max. :266.89 OUT045 :1548
## (Other) :5006 (Other):4894
## Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
## Min. :1985 High :1553 Tier 1:3980
## 1st Qu.:1987 Medium:4655 Tier 2:4641
## Median :1999 Small :3980 Tier 3:5583
## Mean :1998 NA's :4016
## 3rd Qu.:2004
## Max. :2009
##
## Outlet_Type Item_Outlet_Sales
## Grocery Store :1805 Min. : 33.29
## Supermarket Type1:9294 1st Qu.: 834.25
## Supermarket Type2:1546 Median : 1794.33
## Supermarket Type3:1559 Mean : 2181.29
## 3rd Qu.: 3101.30
## Max. :13086.97
## NA's :5681
str(fdata)
## Classes 'data.table' and 'data.frame': 14204 obs. of 12 variables:
## $ Item_Identifier : Factor w/ 1559 levels "DRA12","DRA24",..: 1104 1068 1407 810 1185 462 605 267 669 171 ...
## $ Item_Weight : num 20.75 8.3 14.6 7.32 NA ...
## $ Item_Fat_Content : Factor w/ 5 levels "LF","Low Fat",..: 2 5 2 2 3 3 3 2 3 2 ...
## $ Item_Visibility : num 0.00756 0.03843 0.09957 0.01539 0.1186 ...
## $ Item_Type : Factor w/ 16 levels "Baking Goods",..: 14 5 12 14 5 7 1 1 14 1 ...
## $ Item_MRP : num 107.9 87.3 241.8 155 234.2 ...
## $ Outlet_Identifier : Factor w/ 10 levels "OUT010","OUT013",..: 10 3 1 3 6 9 4 6 8 3 ...
## $ Outlet_Establishment_Year: int 1999 2007 1998 2007 1985 1997 2009 1985 2002 2007 ...
## $ Outlet_Size : Factor w/ 3 levels "High","Medium",..: 2 NA NA NA 2 3 2 2 NA NA ...
## $ Outlet_Location_Type : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 2 3 2 3 1 3 3 2 2 ...
## $ Outlet_Type : Factor w/ 4 levels "Grocery Store",..: 2 2 1 2 4 2 3 4 2 2 ...
## $ Item_Outlet_Sales : num NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, ".internal.selfref")=<externalptr>
Observation
## Item_Identifier Item_Weight
## 1559 416
## Item_Fat_Content Item_Visibility
## 5 13006
## Item_Type Item_MRP
## 16 8052
## Outlet_Identifier Outlet_Establishment_Year
## 10 9
## Outlet_Size Outlet_Location_Type
## 4 3
## Outlet_Type Item_Outlet_Sales
## 4 3494
## Item_Identifier Item_Weight
## 0 2439
## Item_Fat_Content Item_Visibility
## 0 0
## Item_Type Item_MRP
## 0 0
## Outlet_Identifier Outlet_Establishment_Year
## 0 0
## Outlet_Size Outlet_Location_Type
## 4016 0
## Outlet_Type Item_Outlet_Sales
## 0 5681
##
## LF Low Fat Regular low fat reg
## 522 8485 4824 178 195
##
## Baking Goods Breads Breakfast
## 1086 416 186
## Canned Dairy Frozen Foods
## 1084 1136 1426
## Fruits and Vegetables Hard Drinks Health and Hygiene
## 2013 362 858
## Household Meat Others
## 1548 736 280
## Seafood Snack Foods Soft Drinks
## 89 1989 726
## Starchy Foods
## 269
##
## Tier 1 Tier 2 Tier 3
## 3980 4641 5583
##
## High Medium Small
## 1553 4655 3980
##
## Grocery Store Supermarket Type1 Supermarket Type2 Supermarket Type3
## 1805 9294 1546 1559
Observation:
Based on the basic data exploration, we shall have two levels of hypotheses: 1. Store-level; 2. Product-level. Both plays a crucial role in determining the sales of each product at specific stores located across different cities. The hypotheses generated at both the levels based on the available dataset are as follows:
Item_Fat_Content: Items are classified based on the fat content. Since we consume on low fat items as a part of our regular diet, It is highly possible that Low fat items are generally sold more than the items with high fat content.
Item_Type: Items which we use on regular basis - like ready to eat, soft drinks has higher probability of being sold when compared with luxury items.
Item_MRP: More expensive items might be bought occasionally. Items with lower prices might be a product which is being used on a regular basis. Thus, Low priced items might have sold better than expensive items.
Outlet_Size: Bigger outlets might attract bigger crowds. This results in increasing the sales of the products in that specific store.
Outlet_Location_Type: Bigger cities or cities with high population density has a larger customer base for the stores at their location. Stores located in Tier-1 cities might have better sales than stores located in other types of cities.
Outlet_Type: Similar to the previous hypotheses. Supermarkets look more fancy than grocery shops. Among supermarket, the highest among this sub-classification might attract larger crowds and emerge as the best selling store when compared with other outlet types.
Identifying the missing values column-wise. The name of the column and the corresponding number of missing values in each column is given.
length(unique(fdata$Item_Identifier)) #Identify no. of unique values in the Item_Identifier attribute
## [1] 1559
avg_Item_Weight <- aggregate(Item_Weight~Item_Identifier, data=fdata, FUN=function(x) c(mean=mean(x), count=length(x))) #making an aggregate - similar to group by feature in SQL
avg_Item_Weight <- as.data.table(avg_Item_Weight) #converting into data.table for easier computation
cdata <- merge(fdata, avg_Item_Weight, by="Item_Identifier") #merging the data
for(i in 1:nrow(cdata))
{
if(is.na(cdata[i,2]))
{
cdata$Item_Weight.x[i] <- cdata$Item_Weight.y[i] #missing weights replaced by average weight of the item depending on the unique Item_Identifier
}
}
fdata <- cdata[ ,1:(ncol(cdata)-1)] #deleting the unnecessary column created during the imputation process
#View(cdata)
names(fdata)[names(fdata)=="Item_Weight.x"] <- "Item_Weight" #Renaming the attribute
sapply(fdata, function(x) sum(is.na(x))) #Number of Missing Values in each column
## Item_Identifier Item_Weight
## 0 0
## Item_Fat_Content Item_Visibility
## 0 0
## Item_Type Item_MRP
## 0 0
## Outlet_Identifier Outlet_Establishment_Year
## 0 0
## Outlet_Size Outlet_Location_Type
## 4016 0
## Outlet_Type Item_Outlet_Sales
## 0 5681
#View(fdata)
rm(cdata, i)
table(fdata$Outlet_Type, fdata$Outlet_Size)
##
## High Medium Small
## Grocery Store 0 0 880
## Supermarket Type1 1553 1550 3100
## Supermarket Type2 0 1546 0
## Supermarket Type3 0 1559 0
round(prop.table(table(fdata$Outlet_Type, fdata$Outlet_Size), 1), 2) #Identify the proportion
##
## High Medium Small
## Grocery Store 0.00 0.00 1.00
## Supermarket Type1 0.25 0.25 0.50
## Supermarket Type2 0.00 1.00 0.00
## Supermarket Type3 0.00 1.00 0.00
Observation:
fdata$Outlet_Size[is.na(fdata$Outlet_Size) & fdata$Outlet_Type == "Grocery Store"] <- "Small"
fdata$Outlet_Size[is.na(fdata$Outlet_Size) & fdata$Outlet_Type == "Supermarket Type1"] <- "Small"
fdata$Outlet_Size[is.na(fdata$Outlet_Size) & fdata$Outlet_Type == "Supermarket Type2"] <- "Medium"
fdata$Outlet_Size[is.na(fdata$Outlet_Size) & fdata$Outlet_Type == "Supermarket Type3"] <- "Medium"
sapply(fdata, function(x) sum(is.na(x))) #Number of Missing Values in each column
## Item_Identifier Item_Weight
## 0 0
## Item_Fat_Content Item_Visibility
## 0 0
## Item_Type Item_MRP
## 0 0
## Outlet_Identifier Outlet_Establishment_Year
## 0 0
## Outlet_Size Outlet_Location_Type
## 0 0
## Outlet_Type Item_Outlet_Sales
## 0 5681
table(fdata$Outlet_Type, fdata$Outlet_Size)
##
## High Medium Small
## Grocery Store 0 0 1805
## Supermarket Type1 1553 1550 6191
## Supermarket Type2 0 1546 0
## Supermarket Type3 0 1559 0
round(prop.table(table(fdata$Outlet_Type, fdata$Outlet_Size), 1), 2)
##
## High Medium Small
## Grocery Store 0.00 0.00 1.00
## Supermarket Type1 0.17 0.17 0.67
## Supermarket Type2 0.00 1.00 0.00
## Supermarket Type3 0.00 1.00 0.00
We explored some nuances in the data in the data exploration section. Now let us try to resolve them and make our data ready for analysis. We will also create some new variables using the existing ones in this section.
During exploration, we decided to consider combining the Supermarket Type2 and Type3 variables. But is that a good idea? A quick way to check that could be to analyze the mean sales by type of store. If they have similar sales, then keeping them separate won???t help much.
avg_Item_Sales <- aggregate(Item_Outlet_Sales~Outlet_Type, data=fdata, FUN=function(x) c(mean=mean(x), count=length(x)))
avg_Item_Sales <- as.data.table(avg_Item_Sales)
rm(avg_Item_Sales)
Observation This shows significant difference between Supermarket Type2 and Type3 variables, hence we???ll leave them as it is.
We noticed that the minimum value here is 0, which makes no practical sense. Lets consider it like missing information and impute it with mean visibility of that product.
summary(fdata$Item_Visibility)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.02704 0.05402 0.06595 0.09404 0.32839
rm(cdata)
length(unique(fdata$Item_Identifier))
## [1] 1559
avg_Item_Visibility <- aggregate(Item_Visibility~Item_Identifier, data=fdata, FUN=function(x) c(mean=mean(x), count=length(x)))
avg_Item_Visibility <- as.data.table(avg_Item_Visibility)
cdata <- merge(fdata, avg_Item_Visibility, by="Item_Identifier")
for(i in 1:nrow(cdata))
{
if(cdata[i,4]==0)
{
cdata$Item_Visibility.x[i] <- cdata$Item_Visibility.y[i]
}
}
fdata <- cdata[ ,1:(ncol(cdata)-1)]
names(fdata)[names(fdata)=="Item_Visibility.x"] <- "Item_Visibility"
#summary(fdata$Item_Visibility)
Observation No values with value zero in Item_Visibility variable
NOTE We hypothesized that products with higher visibility are likely to sell more. But along with comparing products on absolute terms, we should look at the visibility of the product in that particular store as compared to the mean visibility of that product across all stores. This will give some idea about how much importance was given to that product in a store as compared to other stores.
#colnames(fdata)
rm(cdata, i)
cdata <- merge(fdata, avg_Item_Visibility, by="Item_Identifier")
ncol(fdata)
## [1] 12
fdata <- cdata
names(fdata)[names(fdata)=="Item_Visibility.y"] <- "Item_Visibility_MeanRatio"
names(fdata)[names(fdata)=="Item_Visibility.x"] <- "Item_Visibility"
colnames(fdata)
## [1] "Item_Identifier" "Item_Weight"
## [3] "Item_Fat_Content" "Item_Visibility"
## [5] "Item_Type" "Item_MRP"
## [7] "Outlet_Identifier" "Outlet_Establishment_Year"
## [9] "Outlet_Size" "Outlet_Location_Type"
## [11] "Outlet_Type" "Item_Outlet_Sales"
## [13] "Item_Visibility_MeanRatio"
rm(cdata)
fdata$Item_Visibility_MeanRatio <- as.numeric(fdata$Item_Visibility_MeanRatio)
class(fdata$Item_Visibility_MeanRatio)
## [1] "numeric"
class(fdata$Item_Visibility)
## [1] "numeric"
fdata$Item_Visibility_MeanRatio1 <- fdata$Item_Visibility/fdata$Item_Visibility_MeanRatio
quantile(fdata$Item_Visibility_MeanRatio1)
## 0% 25% 50% 75% 100%
## 0.8445628 0.9251308 0.9990698 1.0420067 3.0100939
fdata$Item_Visibility_MeanRatio <- fdata$Item_Visibility_MeanRatio1
quantile(fdata$Item_Visibility_MeanRatio1)
## 0% 25% 50% 75% 100%
## 0.8445628 0.9251308 0.9990698 1.0420067 3.0100939
ncol(fdata)
## [1] 14
fdata <- fdata[, 1:(ncol(fdata)-1)]
#head(fdata)
Earlier we saw that the Item_Type variable has 16 categories which might prove to be very useful in analysis. So its a good idea to combine them. One way could be to manually assign a new category to each. But there???s a catch here. If you look at the Item_Identifier, i.e. the unique ID of each item, it starts with either F, D or N. If you see the categories, these look like being Food, Drinks and Non-Consumables. So I???ve used the Item_Identifier variable to create a new column:
fdata$Item_Type_Combined <- "NA"
fdata$Item_Type_Combined[grepl("^[fF].*", fdata$Item_Identifier) ] <- "Food"
fdata$Item_Type_Combined[grepl("^[dD].*", fdata$Item_Identifier) ] <- "Drinks"
fdata$Item_Type_Combined[grepl("^[nN].*", fdata$Item_Identifier) ] <- "Non-Consumable"
table(fdata$Item_Type_Combined)
##
## Drinks Food Non-Consumable
## 1317 10201 2686
We wanted to make a new column depicting the years of operation of a store. [NOTE: We are using 2013 Sales Data]
fdata$Outlet_Years <- 2013 - fdata$Outlet_Establishment_Year
#summary(fdata$Outlet_Years)
table(fdata$Outlet_Years)
##
## 4 6 9 11 14 15 16 26 28
## 1546 1543 1550 1548 1550 925 1550 1553 2439
Observation: All the stores are 4-28 years old
We found typos and difference in representation in categories of Item_Fat_Content variable.
fdata$Item_Fat_Content.y <- "NA"
fdata$Item_Fat_Content.y[grepl("^[lL].*", fdata$Item_Fat_Content) ] <- "Low Fat"
fdata$Item_Fat_Content.y[grepl("^[rR].*", fdata$Item_Fat_Content) ] <- "Regular"
fdata$Item_Fat_Content.y[fdata$Item_Type_Combined=="Non-Consumable"] <- "Non-Edible"
fdata$Item_Fat_Content <- fdata$Item_Fat_Content.y
table(fdata$Item_Fat_Content)
##
## Low Fat Non-Edible Regular
## 6499 2686 5019
fdata <- fdata[ ,1:(ncol(fdata)-1)]
#View(fdata)
Observation:
One-Hot-Coding refers to creating dummy variables, one for each category of a categorical variable.
For example, the Item_Fat_Content has 3 categories ??? ???Low Fat???, ???Regular??? and ???Non-Edible???. One hot coding will remove this variable and generate 3 new variables. Each will have binary numbers ??? 0 (if the category is not present) and 1(if category is present). [Creates dummy variables]
‘Outlet_Identifier’
NOTE: all columns - Item_Identifier, Item_Weight, Item_Fat_Content, Item_Visibility, Item_Type, Item_MRP, Outlet_Identifier, Outlet_Establishment_Year, Outlet_Size, Outlet_Location_Type, Outlet_Type, Item_Outlet_Sales, Item_Visibility_MeanRatio, Item_Type_Combined, Outlet_Years
## Item_Identifier Item_Weight
## 1559 415
## Item_Fat_Content Item_Visibility
## 3 13688
## Item_Type Item_MRP
## 16 8052
## Outlet_Identifier Outlet_Establishment_Year
## 10 9
## Outlet_Size Outlet_Location_Type
## 3 3
## Outlet_Type Item_Outlet_Sales
## 4 3494
## Item_Visibility_MeanRatio Item_Type_Combined
## 13287 3
## Outlet_Years
## 9
Observation: New Columns added are as follows:-
Observation: New Columns added are as follows:-
Observation: New Columns added are as follows:-
Observation: New Columns added are as follows:-
Observation: New Columns added are as follows:-
Observation: Nine columns are added, each indicating the unique outlet identifier. With this, we can find which outlet has made most of the sales.
Lets look at the 3 columns formed from Item_Fat_Content
OHECdata <- as.data.table(OHECdata)
head(cbind(OHECdata$Item_Fat_ContentLow.Fat, OHECdata$Item_Fat_ContentNon.Edible, OHECdata$Item_Fat_ContentRegular), 20)
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 1 0 0
## [3,] 1 0 0
## [4,] 1 0 0
## [5,] 1 0 0
## [6,] 1 0 0
## [7,] 1 0 0
## [8,] 1 0 0
## [9,] 1 0 0
## [10,] 0 0 1
## [11,] 0 0 1
## [12,] 0 0 1
## [13,] 0 0 1
## [14,] 0 0 1
## [15,] 0 0 1
## [16,] 0 0 1
## [17,] 0 0 1
## [18,] 0 0 1
## [19,] 0 0 1
## [20,] 0 0 1
Observation: We can see the binary values in the columns - One Hot Encoding worked!
Let us now export the dataset as follows:
OHECdata <- as.data.frame(OHECdata)
drop_columns <- c("Item_Type","Outlet_Establishment_Year")
Export_data <- OHECdata[ , !(names(OHECdata) %in% drop_columns)]
#head(Export_data)
Export_data <- as.data.table(Export_data)
test_Export <- Export_data[is.na(Item_Outlet_Sales), ]
train_Export <- Export_data[!is.na(Item_Outlet_Sales), ]
# write.csv(Export_data, "data_Export.csv")
# write.csv(test_Export, "test_Export.csv")
# write.csv(train_Export, "train_Export.csv")
rm(list = ls())
Now let us read the train and the test dataset separately for the purpose of model building.
rm(list=ls())
train <- read.csv("train_Export.csv", header=T, na.strings=c("","NA"))
test <- read.csv("test_Export.csv", header=T, na.strings=c("","NA"))
fdata <- read.csv("data_Export.csv", header=T, na.strings=c("","NA"))
train <- as.data.table(train)
test <- as.data.table(test)
fdata <- as.data.table(fdata)
#glimpse(train)
We can see that the data is properly exported on performing one-hot-encoding (with 0’s and 1’s indicating its presence). Now that we have the data ready, its time to start making predictive models.
Baseline model is the one which requires no predictive model and its like an informed guess. For instance, in this case lets predict the sales as the overall average sales.
NOTE: If the score of the predictive algorithm is below this, then there is something going seriously wrong and the data is to be checked.
#Mean based:
mean_sales <- mean(train$Item_Outlet_Sales)
drop_columns <- c("X", "Item_Identifier", "Outlet_Identifier", "Item_Outlet_Sales")
baseline_model <- test[,!(names(test) %in% drop_columns)] #input_variables_values_training_datasets
baseline_model$Item_Outlet_Sales <- mean_sales
Observation: We can see that every observation in the Item_Outlet_Sales is predicted to be 2181.29. This is the average or mean of the Item_Outlet_Sales. Thus, gives a very poor model. The aim of this model is to have a benchamark below which our subsequent models should not perform.
train <- as.data.frame(train)
library(rpart) # Decision Trees
dt <- rpart(Item_Outlet_Sales ~ Outlet_IdentifierOUT046 + Outlet_IdentifierOUT045 + Outlet_IdentifierOUT049
+ Outlet_IdentifierOUT035 + Outlet_IdentifierOUT018 + Outlet_IdentifierOUT019 + Outlet_IdentifierOUT027
+ Outlet_IdentifierOUT017 + Outlet_IdentifierOUT013 + Outlet_IdentifierOUT010 + Outlet_TypeSupermarket.Type3
+ Outlet_TypeSupermarket.Type2 + Outlet_TypeSupermarket.Type1 + Item_Type_CombinedFood +
Item_Type_CombinedNon.Consumable + Outlet_TypeGrocery.Store + Item_Type_CombinedDrinks + Outlet_SizeSmall
+ Outlet_SizeMedium + Outlet_Location_TypeTier.2 + Outlet_Location_TypeTier.3 + Outlet_Location_TypeTier.1
+ Outlet_SizeHigh + Item_Fat_ContentRegular + Item_Fat_ContentNon.Edible + Item_Fat_ContentLow.Fat
+ Outlet_Years + Item_Visibility_MeanRatio + Item_MRP, data = train, method = "anova")
plot(dt)
text(dt, pretty = 0, cex = 0.5)
summary(dt)
drop_columns <- c("X", "Item_Identifier", "Outlet_Identifier", "Item_Outlet_Sales")
dt_test <- test[,!(names(test) %in% drop_columns)] #input_variables_values_training_datasets
class(dt)
predicted_sales_dt <- predict(dt, dt_test)
head(predicted_sales_dt)
#dt_test$Item_Outlet_Sales <- predicted_sales_dt
Observation:
Variable importance: (most important variable at 1.)
Further, we have predicted the Item_Outlet_Sales based on this decision tree model and have stored. The rmse and cp for the decision tree is computed and displayed at the end (along with the model comparison chunk)
## [1] 987
## % Inc MSE
## Item_MRP 261.841372
## Outlet_TypeGrocery.Store 35.308514
## Item_Visibility_MeanRatio 31.700735
## Outlet_Years 23.665615
## Outlet_TypeSupermarket.Type3 21.101159
## Outlet_IdentifierOUT027 20.082356
## Outlet_TypeSupermarket.Type1 18.293701
## Item_Type_CombinedFood 16.912389
## Outlet_IdentifierOUT010 13.520863
## Outlet_IdentifierOUT018 11.687310
## Outlet_TypeSupermarket.Type2 11.500453
## Outlet_IdentifierOUT019 11.046375
## Outlet_SizeSmall 10.240351
## Item_Fat_ContentRegular 9.760270
## Outlet_SizeMedium 9.370689
## Outlet_IdentifierOUT035 8.785113
## Item_Type_CombinedNon.Consumable 8.684652
## Outlet_Location_TypeTier.3 7.101722
## Item_Fat_ContentNon.Edible 6.963881
## Outlet_IdentifierOUT045 6.447246
## Outlet_IdentifierOUT013 5.909847
## Item_Type_CombinedDrinks 5.816765
## Outlet_Location_TypeTier.2 5.197840
## Outlet_SizeHigh 4.971637
## Outlet_IdentifierOUT049 4.823207
## Item_Fat_ContentLow.Fat 4.673488
## Outlet_IdentifierOUT046 4.409267
## Outlet_Location_TypeTier.1 4.271646
## Outlet_IdentifierOUT017 -4.150751
Observation:
2.We have predicted the Item_Outlet_Sales based on this Random Forest model and have stored.
## [1] "X" "Item_Identifier"
## [3] "Item_Weight" "Item_Visibility"
## [5] "Item_Fat_Content" "Outlet_Location_Type"
## [7] "Outlet_Size" "Item_Type_Combined"
## [9] "Outlet_Type" "Item_MRP"
## [11] "Outlet_Identifier" "Item_Outlet_Sales"
## [13] "Item_Visibility_MeanRatio" "Outlet_Years"
## [15] "Item_Fat_ContentLow.Fat" "Item_Fat_ContentNon.Edible"
## [17] "Item_Fat_ContentRegular" "Outlet_Location_TypeTier.1"
## [19] "Outlet_Location_TypeTier.2" "Outlet_Location_TypeTier.3"
## [21] "Outlet_SizeHigh" "Outlet_SizeMedium"
## [23] "Outlet_SizeSmall" "Item_Type_CombinedDrinks"
## [25] "Item_Type_CombinedFood" "Item_Type_CombinedNon.Consumable"
## [27] "Outlet_TypeGrocery.Store" "Outlet_TypeSupermarket.Type1"
## [29] "Outlet_TypeSupermarket.Type2" "Outlet_TypeSupermarket.Type3"
## [31] "Outlet_IdentifierOUT010" "Outlet_IdentifierOUT013"
## [33] "Outlet_IdentifierOUT017" "Outlet_IdentifierOUT018"
## [35] "Outlet_IdentifierOUT019" "Outlet_IdentifierOUT027"
## [37] "Outlet_IdentifierOUT035" "Outlet_IdentifierOUT045"
## [39] "Outlet_IdentifierOUT046" "Outlet_IdentifierOUT049"
##
## Call:
## lm(formula = Item_Outlet_Sales ~ Outlet_IdentifierOUT046 + Outlet_IdentifierOUT045 +
## Outlet_IdentifierOUT049 + Outlet_IdentifierOUT035 + Outlet_IdentifierOUT018 +
## Outlet_IdentifierOUT019 + Outlet_IdentifierOUT027 + Outlet_IdentifierOUT017 +
## Outlet_IdentifierOUT013 + Outlet_IdentifierOUT010 + Outlet_TypeSupermarket.Type3 +
## Outlet_TypeSupermarket.Type2 + Outlet_TypeSupermarket.Type1 +
## Item_Type_CombinedFood + Item_Type_CombinedNon.Consumable +
## Outlet_TypeGrocery.Store + Item_Type_CombinedDrinks + Outlet_SizeSmall +
## Outlet_SizeMedium + Outlet_Location_TypeTier.2 + Outlet_Location_TypeTier.3 +
## Outlet_Location_TypeTier.1 + Outlet_SizeHigh + Item_Fat_ContentRegular +
## Item_Fat_ContentNon.Edible + Item_Fat_ContentLow.Fat + Outlet_Years +
## Item_Visibility_MeanRatio + Item_MRP, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4331.8 -677.8 -88.9 572.5 7942.5
##
## Coefficients: (15 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1986.3501 169.3349 -11.730 <2e-16
## Outlet_IdentifierOUT046 1959.9281 83.7867 23.392 <2e-16
## Outlet_IdentifierOUT045 1891.8411 83.5331 22.648 <2e-16
## Outlet_IdentifierOUT049 2057.6792 84.1093 24.464 <2e-16
## Outlet_IdentifierOUT035 2104.4092 83.8369 25.101 <2e-16
## Outlet_IdentifierOUT018 1683.1160 83.6592 20.119 <2e-16
## Outlet_IdentifierOUT019 12.0442 68.8349 0.175 0.861
## Outlet_IdentifierOUT027 3411.2793 84.0027 40.609 <2e-16
## Outlet_IdentifierOUT017 2062.8085 83.1515 24.808 <2e-16
## Outlet_IdentifierOUT013 1990.9033 83.5285 23.835 <2e-16
## Outlet_IdentifierOUT010 NA NA NA NA
## Outlet_TypeSupermarket.Type3 NA NA NA NA
## Outlet_TypeSupermarket.Type2 NA NA NA NA
## Outlet_TypeSupermarket.Type1 NA NA NA NA
## Item_Type_CombinedFood 16.1708 43.9286 0.368 0.713
## Item_Type_CombinedNon.Consumable -10.1285 49.0070 -0.207 0.836
## Outlet_TypeGrocery.Store NA NA NA NA
## Item_Type_CombinedDrinks NA NA NA NA
## Outlet_SizeSmall NA NA NA NA
## Outlet_SizeMedium NA NA NA NA
## Outlet_Location_TypeTier.2 NA NA NA NA
## Outlet_Location_TypeTier.3 NA NA NA NA
## Outlet_Location_TypeTier.1 NA NA NA NA
## Outlet_SizeHigh NA NA NA NA
## Item_Fat_ContentRegular 40.7363 28.2782 1.441 0.150
## Item_Fat_ContentNon.Edible NA NA NA NA
## Item_Fat_ContentLow.Fat NA NA NA NA
## Outlet_Years NA NA NA NA
## Item_Visibility_MeanRatio 71.1070 98.4560 0.722 0.470
## Item_MRP 15.5588 0.1966 79.132 <2e-16
##
## (Intercept) ***
## Outlet_IdentifierOUT046 ***
## Outlet_IdentifierOUT045 ***
## Outlet_IdentifierOUT049 ***
## Outlet_IdentifierOUT035 ***
## Outlet_IdentifierOUT018 ***
## Outlet_IdentifierOUT019
## Outlet_IdentifierOUT027 ***
## Outlet_IdentifierOUT017 ***
## Outlet_IdentifierOUT013 ***
## Outlet_IdentifierOUT010
## Outlet_TypeSupermarket.Type3
## Outlet_TypeSupermarket.Type2
## Outlet_TypeSupermarket.Type1
## Item_Type_CombinedFood
## Item_Type_CombinedNon.Consumable
## Outlet_TypeGrocery.Store
## Item_Type_CombinedDrinks
## Outlet_SizeSmall
## Outlet_SizeMedium
## Outlet_Location_TypeTier.2
## Outlet_Location_TypeTier.3
## Outlet_Location_TypeTier.1
## Outlet_SizeHigh
## Item_Fat_ContentRegular
## Item_Fat_ContentNon.Edible
## Item_Fat_ContentLow.Fat
## Outlet_Years
## Item_Visibility_MeanRatio
## Item_MRP ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1128 on 8508 degrees of freedom
## Multiple R-squared: 0.5635, Adjusted R-squared: 0.5627
## F-statistic: 784.4 on 14 and 8508 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Item_Outlet_Sales ~ Outlet_IdentifierOUT046 + Outlet_IdentifierOUT045 +
## Outlet_IdentifierOUT049 + Outlet_IdentifierOUT035 + Outlet_IdentifierOUT018 +
## Outlet_IdentifierOUT019 + Outlet_IdentifierOUT027 + Outlet_IdentifierOUT017 +
## Outlet_IdentifierOUT013 + Item_Type_CombinedFood + Item_Type_CombinedNon.Consumable +
## Item_Fat_ContentRegular + Item_Visibility_MeanRatio + Item_MRP,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4331.8 -677.8 -88.9 572.5 7942.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1986.3501 169.3349 -11.730 <2e-16
## Outlet_IdentifierOUT046 1959.9281 83.7867 23.392 <2e-16
## Outlet_IdentifierOUT045 1891.8411 83.5331 22.648 <2e-16
## Outlet_IdentifierOUT049 2057.6792 84.1093 24.464 <2e-16
## Outlet_IdentifierOUT035 2104.4092 83.8369 25.101 <2e-16
## Outlet_IdentifierOUT018 1683.1160 83.6592 20.119 <2e-16
## Outlet_IdentifierOUT019 12.0442 68.8349 0.175 0.861
## Outlet_IdentifierOUT027 3411.2793 84.0027 40.609 <2e-16
## Outlet_IdentifierOUT017 2062.8085 83.1515 24.808 <2e-16
## Outlet_IdentifierOUT013 1990.9033 83.5285 23.835 <2e-16
## Item_Type_CombinedFood 16.1708 43.9286 0.368 0.713
## Item_Type_CombinedNon.Consumable -10.1285 49.0070 -0.207 0.836
## Item_Fat_ContentRegular 40.7363 28.2782 1.441 0.150
## Item_Visibility_MeanRatio 71.1070 98.4560 0.722 0.470
## Item_MRP 15.5588 0.1966 79.132 <2e-16
##
## (Intercept) ***
## Outlet_IdentifierOUT046 ***
## Outlet_IdentifierOUT045 ***
## Outlet_IdentifierOUT049 ***
## Outlet_IdentifierOUT035 ***
## Outlet_IdentifierOUT018 ***
## Outlet_IdentifierOUT019
## Outlet_IdentifierOUT027 ***
## Outlet_IdentifierOUT017 ***
## Outlet_IdentifierOUT013 ***
## Item_Type_CombinedFood
## Item_Type_CombinedNon.Consumable
## Item_Fat_ContentRegular
## Item_Visibility_MeanRatio
## Item_MRP ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1128 on 8508 degrees of freedom
## Multiple R-squared: 0.5635, Adjusted R-squared: 0.5627
## F-statistic: 784.4 on 14 and 8508 DF, p-value: < 2.2e-16
Observation:
Based on the correlation plot we can observe that Item_MRP is strongly correlated to the Item_Outlet_Sales: This is in-line with our hypotheses.
Further we can see that the Item’s Visibility ratio is negavtively correlated with the Item_Outlet_Sales: This is not in-line with our hypotheses.
## Linear Regression
##
## 8523 samples
## 14 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 7671, 7670, 7671, 7670, 7671, 7671, ...
## Resampling results:
##
## RMSE Rsquared
## 1128.826 0.5627596
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## CART
##
## 8523 samples
## 29 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 7670, 7671, 7672, 7671, 7671, 7670, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared
## 0.05892632 1294.557 0.4251454
## 0.16317858 1412.077 0.3125517
## 0.23662590 1592.082 0.2225731
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.05892632.
## [1] 987
Inferences:
Based on the model comparison we can see that Random Forests outperform Decision Trees and Linear Regression Models. This is because of the optimal selection of the parameters and the dependent variables.
To make the model better:
We can try several sets of parameters to identify the optimal set of predictors. With these predictors, we can make use of the RandomForest model.