The dataset comprises sales data from a well-known supermarket chain for 1559 products across 10 stores in different cities (broadly classified by purchasing power parity, working population, size and a few other factors).
The project aims to build a predictive model of the sales of each product at a particular store. With this model, we can identify the properties of products and stores that play a key role in increasing sales. The results of the model will be used to recommend ways to improve sales.
rm(list = ls())
setwd('/Users/Mughundhan/UIC/UIC Academics/FALL 2017/BIZ ANALYTICS STATS/Project')
library(lubridate) # date and time handling
library(leaflet) # interactive maps
#library(dplyr) # for piping purpose %>%
#library(rMaps) # route-map
library(data.table)# aggregate
library(ggplot2) # barplot
library(mice) # imputing with plausible data values (drawn from a distribution specifically designed for each missing datapoint)
train <- read.csv("Train.csv", header=T, na.strings=c("","NA")) #Empty spaces to be replaced by NA
test <- read.csv("Test.csv", header=T, na.strings=c("","NA"))
test$Item_Outlet_Sales <- NA
fdata <- rbind(test, train)
fdata <- as.data.table(fdata)
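Because the test rows are stacked on top of the training rows with Item_Outlet_Sales set to NA, the two sets can be separated again once cleaning is done. The following is a minimal sketch of that round trip on toy data frames; the names `train_toy`, `test_toy` and `combined` and all values are made up for illustration, and it assumes the training target itself contains no missing values.

```r
# Toy stand-ins for Train.csv and Test.csv (values are invented)
train_toy <- data.frame(Item_MRP = c(107.9, 87.3, 241.8),
                        Item_Outlet_Sales = c(3735.1, 443.4, 2097.3))
test_toy  <- data.frame(Item_MRP = c(155.0, 234.2))

# Same pattern as above: give the test set an empty target, then stack
test_toy$Item_Outlet_Sales <- NA
combined <- rbind(test_toy, train_toy)

# ... cleaning and feature engineering would happen on `combined` here ...

# Split back: rows with a missing target are the test rows
# (assumes the training target has no missing values of its own)
is_test   <- is.na(combined$Item_Outlet_Sales)
train_out <- combined[!is_test, ]
test_out  <- combined[is_test, ]

nrow(train_out)
nrow(test_out)
```

Cleaning the combined table once keeps factor levels and imputed values consistent between the two sets.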
Let us have a look at the description of each variable in the dataset:
The structure of the dataset (all variables and their corresponding data types) is as follows:
str(fdata)
## Classes 'data.table' and 'data.frame': 14204 obs. of 12 variables:
## $ Item_Identifier : Factor w/ 1559 levels "DRA12","DRA24",..: 1104 1068 1407 810 1185 462 605 267 669 171 ...
## $ Item_Weight : num 20.75 8.3 14.6 7.32 NA ...
## $ Item_Fat_Content : Factor w/ 5 levels "LF","Low Fat",..: 2 5 2 2 3 3 3 2 3 2 ...
## $ Item_Visibility : num 0.00756 0.03843 0.09957 0.01539 0.1186 ...
## $ Item_Type : Factor w/ 16 levels "Baking Goods",..: 14 5 12 14 5 7 1 1 14 1 ...
## $ Item_MRP : num 107.9 87.3 241.8 155 234.2 ...
## $ Outlet_Identifier : Factor w/ 10 levels "OUT010","OUT013",..: 10 3 1 3 6 9 4 6 8 3 ...
## $ Outlet_Establishment_Year: int 1999 2007 1998 2007 1985 1997 2009 1985 2002 2007 ...
## $ Outlet_Size : Factor w/ 3 levels "High","Medium",..: 2 NA NA NA 2 3 2 2 NA NA ...
## $ Outlet_Location_Type : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 2 3 2 3 1 3 3 2 2 ...
## $ Outlet_Type : Factor w/ 4 levels "Grocery Store",..: 2 2 1 2 4 2 3 4 2 2 ...
## $ Item_Outlet_Sales : num NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, ".internal.selfref")=<externalptr>
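The str() output lists five levels for Item_Fat_Content, beginning with "LF" and "Low Fat", which look like two spellings of the same category. The sketch below shows one way to inspect and collapse such labels. The label vector and the mapping are hypothetical, since the full set of levels is not shown in the output; the real levels should be checked with levels() first.

```r
# Hypothetical labels; the real level set should be checked with levels()
fat <- factor(c("LF", "Low Fat", "Regular", "Low Fat", "reg"))

levels(fat)   # inspect the distinct spellings first
table(fat)    # and how often each one occurs

# Assumed mapping from variant spellings to canonical labels
canonical <- c("LF" = "Low Fat", "Low Fat" = "Low Fat",
               "reg" = "Regular", "Regular" = "Regular")

# Look up each label by name and rebuild the factor
fat_clean <- factor(unname(canonical[as.character(fat)]))
table(fat_clean)
```

Any label missing from the mapping would become NA, which makes an unexpected spelling easy to spot.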
Observation
Based on this basic data exploration, we form hypotheses at two levels: (1) store level and (2) product level. Both play a crucial role in determining the sales of each product at specific stores located across different cities. The hypotheses generated at both levels from the available dataset are as follows:
Item_Fat_Content: Items are classified by fat content. Since low-fat items are commonly part of a regular diet, low-fat items may sell more than items with a higher fat content.
Item_Type: Items used on a regular basis, such as ready-to-eat foods and soft drinks, have a higher probability of being sold than luxury items.
Item_MRP: More expensive items might be bought only occasionally, whereas lower-priced items are more likely to be products used on a regular basis. Thus, low-priced items might sell better than expensive items.
Outlet_Size: Bigger outlets might attract bigger crowds, which would increase the sales of products in those stores.
Outlet_Location_Type: Bigger cities, or cities with high population density, have a larger customer base for the stores located there. Stores in Tier 1 cities might therefore have better sales than stores in other types of cities.
Outlet_Type: As with the previous hypothesis, supermarkets tend to look more appealing than grocery stores. Among supermarkets, the highest category in this sub-classification might attract the largest crowds and emerge as the best-selling outlet type.
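A first check of hypotheses like these is to compare average sales across the levels of one variable, using only rows where the target is known. The sketch below does this for Outlet_Type on toy data; the "Supermarket" label and all sales figures are invented, and a difference in group means is only a rough indication, not a test of the hypothesis.

```r
# Toy data; labels echo the dataset, sales values are invented
sales <- data.frame(
  Outlet_Type = c("Grocery Store", "Grocery Store", "Supermarket",
                  "Supermarket", "Supermarket", "Supermarket"),
  Item_Outlet_Sales = c(300, 500, 2000, 2400, 1600, NA)
)

# Keep only rows with a known target (the training rows)
train_rows <- sales[!is.na(sales$Item_Outlet_Sales), ]

# Mean sales per outlet type
by_type <- aggregate(Item_Outlet_Sales ~ Outlet_Type,
                     data = train_rows, FUN = mean)

# Highest-selling type first
by_type[order(-by_type$Item_Outlet_Sales), ]
```

The same pattern applies to Item_Fat_Content, Item_Type, Outlet_Size and Outlet_Location_Type by changing the grouping column in the formula.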
The basic idea is to perform the following three steps in a sequential manner:
##           Item_Identifier               Item_Weight
##                         0                      2439
##          Item_Fat_Content           Item_Visibility
##                         0                         0
##                 Item_Type                  Item_MRP
##                         0                         0
##         Outlet_Identifier Outlet_Establishment_Year
##                         0                         0
##               Outlet_Size      Outlet_Location_Type
##                      4016                         0
##               Outlet_Type         Item_Outlet_Sales
##                         0                      5681
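We can see that a few attributes have missing values: Item_Weight (2439) and Outlet_Size (4016). The 5681 missing Item_Outlet_Sales values correspond to the test rows, whose target was set to NA before the two files were combined. We need to impute the genuinely missing values using an appropriate technique, and we also need to carry out outlier analysis.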
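Per-column counts of missing values can be obtained with colSums(is.na(...)), and the gaps can then be filled. The script loads mice for imputation; the sketch below deliberately swaps in a simpler technique (group mean for the numeric column, most frequent level for the factor) to show the idea on toy data. It assumes that an item's weight is the same wherever the item is sold, which should be verified on the real data, and all values are illustrative.

```r
# Toy data with gaps; values are illustrative only
df <- data.frame(
  Item_Identifier = c("A", "A", "B", "B", "B"),
  Item_Weight     = c(9.3, NA, 17.5, 17.5, NA),
  Outlet_Size     = factor(c("Medium", NA, "Small", "Medium", "Medium"))
)

# Missing values per column
colSums(is.na(df))

# Item_Weight: fill each gap with the mean weight of the same item
# (assumes weight is a property of the item, not of the store)
gap <- is.na(df$Item_Weight)
item_mean <- ave(df$Item_Weight, df$Item_Identifier,
                 FUN = function(w) mean(w, na.rm = TRUE))
df$Item_Weight[gap] <- item_mean[gap]

# Outlet_Size: fill gaps with the most frequent level
size_counts <- table(df$Outlet_Size)   # table() drops NA by default
mode_size   <- names(size_counts)[which.max(size_counts)]
df$Outlet_Size[is.na(df$Outlet_Size)] <- mode_size

colSums(is.na(df))   # all zeros after imputation
```

Mode imputation ignores any relationship between Outlet_Size and other store attributes, so a model-based approach may be preferable on the full dataset.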
In the given bar plot, we can see the distribution of items across the sub-categories of supermarket (outlet) type. This gives us a rough idea of how important the outlet type is in determining sales.
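A bar plot of this kind can be sketched as follows. This is an illustration on toy data rather than the code behind the original figure: the outlet-type labels and counts are invented, and the plot is only built when ggplot2 is installed.

```r
# Toy data; outlet-type labels and counts are invented for illustration
outlets <- data.frame(
  Outlet_Type = c("Grocery Store", "Supermarket", "Supermarket",
                  "Supermarket", "Grocery Store", "Supermarket")
)

# Number of rows per outlet type (these are the bar heights)
type_counts <- table(outlets$Outlet_Type)
print(type_counts)

# Build the bar chart only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(outlets, ggplot2::aes(x = Outlet_Type)) +
    ggplot2::geom_bar() +
    ggplot2::labs(x = "Outlet type", y = "Number of items")
  # print(p) renders the plot in an interactive session
}
```

Note that bar heights here are row counts; to relate outlet type to sales directly, the y-axis would need to show a sales summary such as the mean per type.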
Reference Link: https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/
I am using a dataset that is currently posted as a challenge whose deadline is 72 days from now. To access the dataset, you need an account and must sign up for the competition. The competition closes on: Sun Dec 31 2017 12:29:59 GMT-0600 (Central Standard Time).
To evaluate how good a model is, we need to understand the impact of wrong predictions. If we predict sales to be higher than they turn out to be, the store will spend a lot of money on unnecessary arrangements, leading to excess inventory. On the other hand, if we predict sales too low, the store will lose out on sales opportunities. A delicate balance has to be maintained.
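One way to make this trade-off concrete is to report an overall error alongside separate totals for over- and under-prediction. The sketch below uses root mean squared error (RMSE) as the overall measure; that choice is an assumption here, not something stated above, and the actual and predicted values are hypothetical.

```r
# Hypothetical actual and predicted sales for five item-store pairs
actual    <- c(100, 200, 300, 400, 500)
predicted <- c(110, 190, 330, 400, 460)

# Signed error: positive means the model over-predicted
err <- predicted - actual

# Overall error: root mean squared error
rmse <- sqrt(mean(err^2))

# Split the error by direction
over_total  <- sum(pmax(err, 0))    # excess stock ordered
under_total <- sum(pmax(-err, 0))   # sales missed

rmse
over_total
under_total
```

RMSE treats both directions symmetrically; if excess inventory and lost sales have different costs, the two totals can be weighted separately.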