1. About the Project / Data-Set

The dataset comprises sales data (from a renowned supermarket chain) for 1559 products across 10 stores in different cities (broadly classified based on purchasing power parity, working population, size and a few other factors).

The project aims to build a predictive model to estimate the sales of each product at a particular store. With this model we shall understand which properties of products and stores play a key role in increasing sales. The results of the model will be used to provide recommendations for improving sales.

rm(list = ls())
setwd('/Users/Mughundhan/UIC/UIC Academics/FALL 2017/BIZ ANALYTICS STATS/Project')
library(lubridate) # date handling
library(leaflet)   # interactive maps
#library(dplyr)     # for piping purpose %>%
#library(rMaps)     # route-map
library(data.table) # aggregation
library(ggplot2)   # bar plots
library(mice)      # imputing with plausible data values (drawn from a distribution specifically designed for each missing datapoint)
train <- read.csv("Train.csv", header=T, na.strings=c("","NA")) # empty strings read as NA
test <- read.csv("Test.csv", header=T, na.strings=c("","NA"))
test$Item_Outlet_Sales <- NA # add the missing outcome column so test and train align
fdata <- rbind(test, train)  # combine for joint cleaning and feature engineering
fdata <- as.data.table(fdata)

1.1 Data Dictionary

Let us have a look at the description of each variable in the dataset:

  1. Item_Identifier: Unique product ID
  2. Item_Weight: Weight of the product
  3. Item_Fat_Content: Whether the product is low fat or regular
  4. Item_Visibility: The percentage of the total display area of all products in a store allocated to the particular product
  5. Item_Type: The category to which the product belongs (e.g. Breakfast, Soft Drinks)
  6. Item_MRP: Maximum Retail Price of the product (in Indian Rupees)
  7. Outlet_Identifier: Unique store ID - multiple stores located in different cities
  8. Outlet_Establishment_Year: The year in which the store started operating
  9. Outlet_Size: Size of the store (High, Medium, Small)
  10. Outlet_Location_Type: The type of city in which the store is located (Tier 1, Tier 2, Tier 3)
  11. Outlet_Type: The type of outlet (Grocery Store or one of the Supermarket types)
  12. Item_Outlet_Sales: Sales of the product in the particular store [outcome variable to be predicted]

1.2 Inference from Attributes

The structure of the dataset (all variables and their corresponding data types) is given as follows:

str(fdata)
## Classes 'data.table' and 'data.frame':   14204 obs. of  12 variables:
##  $ Item_Identifier          : Factor w/ 1559 levels "DRA12","DRA24",..: 1104 1068 1407 810 1185 462 605 267 669 171 ...
##  $ Item_Weight              : num  20.75 8.3 14.6 7.32 NA ...
##  $ Item_Fat_Content         : Factor w/ 5 levels "LF","Low Fat",..: 2 5 2 2 3 3 3 2 3 2 ...
##  $ Item_Visibility          : num  0.00756 0.03843 0.09957 0.01539 0.1186 ...
##  $ Item_Type                : Factor w/ 16 levels "Baking Goods",..: 14 5 12 14 5 7 1 1 14 1 ...
##  $ Item_MRP                 : num  107.9 87.3 241.8 155 234.2 ...
##  $ Outlet_Identifier        : Factor w/ 10 levels "OUT010","OUT013",..: 10 3 1 3 6 9 4 6 8 3 ...
##  $ Outlet_Establishment_Year: int  1999 2007 1998 2007 1985 1997 2009 1985 2002 2007 ...
##  $ Outlet_Size              : Factor w/ 3 levels "High","Medium",..: 2 NA NA NA 2 3 2 2 NA NA ...
##  $ Outlet_Location_Type     : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 2 3 2 3 1 3 3 2 2 ...
##  $ Outlet_Type              : Factor w/ 4 levels "Grocery Store",..: 2 2 1 2 4 2 3 4 2 2 ...
##  $ Item_Outlet_Sales        : num  NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>

Observations

  1. There are 12 variables in the dataset: 11 predictors and 1 target variable (Item_Outlet_Sales).
  2. We shall perform numerical operations on the 3 numeric variables: Item_Weight, Item_Visibility and Item_MRP.
  3. Several factor variables will be transformed into character variables for feature engineering purposes: Item_Fat_Content, Outlet_Identifier, Outlet_Size, Outlet_Location_Type and Outlet_Type (see the sketch after this list).
  4. There is only one variable with date information: Outlet_Establishment_Year. Since only the year is given, we can derive simple numeric features from it, such as the outlet's age.
  5. A few variables (Outlet_Size, Item_Weight) contain missing values, which need to be imputed.
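
A minimal sketch of these transformations (assumption: 2013 is used as the reference year for the outlet-age feature; any fixed year works):

cat_vars <- c("Item_Fat_Content", "Outlet_Identifier", "Outlet_Size",
              "Outlet_Location_Type", "Outlet_Type")
fdata[, (cat_vars) := lapply(.SD, as.character), .SDcols = cat_vars] # factors -> characters
fdata[, Outlet_Age := 2013 - Outlet_Establishment_Year]             # assumed reference year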

2. Hypotheses Generation

Based on the basic data exploration, we shall form hypotheses at two levels: 1. product level; 2. store level. Both play a crucial role in determining the sales of each product at specific stores located in different cities. The hypotheses generated at both levels, based on the available dataset, are as follows:

I. Product-Level Hypotheses
  1. Item_Fat_Content: Items are classified based on their fat content. Since low-fat items form part of many regular diets, it is highly possible that low-fat items generally sell more than items with regular fat content.

  2. Item_Type: Items used on a regular basis, such as ready-to-eat food and soft drinks, have a higher probability of being sold than luxury items.

  3. Item_MRP: More expensive items might be bought only occasionally, while lower-priced items tend to be used on a regular basis. Thus, low-priced items might sell better than expensive ones.

II. Store-Level Hypotheses
  1. Outlet_Size: Bigger outlets might attract bigger crowds, which would increase the sales of products in those stores.

  2. Outlet_Location_Type: Bigger cities, or cities with high population density, provide a larger customer base for the stores located there. Stores in Tier 1 cities might therefore have better sales than stores in other types of cities.

  3. Outlet_Type: Similar to the previous hypothesis: supermarkets look fancier than grocery stores. Among the supermarkets, the highest sub-type in this classification might attract larger crowds and emerge as the best-selling outlet type compared with the others.

3. Ideas for the Project / Future Plans

Idea (1): Exploratory Data Analysis followed by Model Building:

The basic idea is to perform the following three steps in a sequential manner:

  1. Data Cleaning: We need to clean the data prior to performing data analysis or modeling in order to obtain reliable results.
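
The missing-value counts per attribute can be obtained with a call along these lines (a reconstruction; the original call is not shown):

colSums(is.na(fdata)) # number of missing values per column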
##           Item_Identifier               Item_Weight 
##                         0                      2439 
##          Item_Fat_Content           Item_Visibility 
##                         0                         0 
##                 Item_Type                  Item_MRP 
##                         0                         0 
##         Outlet_Identifier Outlet_Establishment_Year 
##                         0                         0 
##               Outlet_Size      Outlet_Location_Type 
##                      4016                         0 
##               Outlet_Type         Item_Outlet_Sales 
##                         0                      5681

We can see that there are missing values in a few attributes. We need to impute them using an appropriate technique. Further, we need to work on outlier analysis as well.
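
A minimal imputation sketch using the mice package loaded earlier (the method choices are the mice defaults: predictive mean matching for the numeric Item_Weight, polytomous regression for the categorical Outlet_Size; the column selection is an assumption):

imp_cols <- c("Item_Weight", "Outlet_Size", "Item_MRP", "Item_Visibility", "Outlet_Type")
imp_df <- as.data.frame(fdata)[, imp_cols]          # Item_Outlet_Sales is NA by design for test rows, so it is left out
imp_df$Outlet_Size <- as.factor(imp_df$Outlet_Size) # mice expects factors for categorical columns
imp_df$Outlet_Type <- as.factor(imp_df$Outlet_Type)
imp <- mice(imp_df, m = 5, seed = 42, printFlag = FALSE)
fdata_imp <- complete(imp, 1)                       # first of the five completed datasets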

  2. Exploratory Data Analysis: In order to gain better insights into our dataset and understand the existing patterns, we need to plot graphs. This would eventually allow us to understand the contribution of each sub-classification within the attributes (hypothesis testing) and identify the relative importance of each attribute in order to assign weights. Let us make this clear with an example:

In such a bar plot, we can see the distribution of items in each sub-category of supermarket. This gives us a rough idea of the importance of the supermarket type in determining sales.
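
A sketch of such a bar plot with ggplot2 (already loaded); counting training items per outlet type is an assumed example of the sub-categories mentioned:

ggplot(fdata[!is.na(Item_Outlet_Sales)], aes(x = Outlet_Type)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Item count by outlet type", x = "Outlet Type", y = "Number of items") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))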

  3. Model Building: Apply several modeling techniques (Decision Trees, Random Forest and Linear Regression, to name a few) and choose the best-performing model for the final evaluation.
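
A minimal model-building sketch (train_clean and test_clean are hypothetical names for the cleaned train/test splits produced by the earlier steps):

library(randomForest)  # assumes the randomForest package is installed
set.seed(123)
rf_fit <- randomForest(Item_Outlet_Sales ~ Item_MRP + Outlet_Type +
                         Item_Visibility + Outlet_Location_Type,
                       data = train_clean, ntree = 500)
rf_pred <- predict(rf_fit, newdata = test_clean)  # predicted sales for the test rows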

Idea (2): Exploratory Data Analysis followed by Ensemble Modeling:

The basic idea is to perform the following three steps in a sequential manner:

  1. Data Cleaning: Same as previously mentioned
  2. Exploratory Data Analysis: Same as previously mentioned
  3. Ensemble Modeling: Combine 2 or more models / classifiers, which can be similar or dissimilar, to build a more robust system. In simple words, combine several weak learners to produce a strong prediction. Based on the results and knowledge we gather from the Exploratory Data Analysis step, we shall decide to perform one of the following (see the sketch after this list):
    • Averaging: Taking the average of the predictions from multiple models, for regression problems or when predicting probabilities in classification problems.
    • Majority vote: Taking the prediction with the maximum votes across multiple models' predictions, when predicting the outcome of a classification problem.
    • Weighted average: Applying different weights to the predictions from multiple models before averaging, thereby giving higher or lower importance to specific models' outputs.
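
A minimal sketch of the averaging and weighted-average schemes (pred_lm, pred_rf and pred_dt are hypothetical prediction vectors from three fitted regression models over the same rows):

pred_lm <- c(2100, 3400, 800)  # toy values standing in for real model predictions
pred_rf <- c(2000, 3600, 900)
pred_dt <- c(2250, 3300, 850)

avg_pred <- (pred_lm + pred_rf + pred_dt) / 3  # simple averaging
w <- c(0.3, 0.5, 0.2)                          # assumed weights, e.g. from validation error
wavg_pred <- w[1] * pred_lm + w[2] * pred_rf + w[3] * pred_dt  # weighted average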

4. Data Source

Reference Link: https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/

I am using a dataset that is currently posted as a challenge, with a submission deadline 72 days from now. In order to access the dataset, we need to create an account and sign up for the competition. The competition closes on: Sun Dec 31 2017 12:29:59 GMT-0600 (Central Standard Time).

5. Impact of wrong predictions

To evaluate how good a model is, we need to understand the impact of wrong predictions. If we predict sales to be higher than they turn out to be, the store will spend a lot of money on unnecessary arrangements, leading to excess inventory. On the other hand, if we predict sales too low, the store will lose out on sales opportunities. A delicate balance must be maintained!

6. References

  1. Handling Missing Values: https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/
  2. Feature Engineering: http://trevorstephens.com/kaggle-titanic-tutorial/r-part-4-feature-engineering/
  3. Model Building: http://blog.learningtree.com/how-to-build-a-predictive-model-using-r/
  4. Ensemble Modeling: https://machinelearningmastery.com/machine-learning-ensembles-with-r/
  5. Hypotheses Generation: https://discuss.analyticsvidhya.com/t/why-and-when-is-hypothesis-generation-important/2109