description

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.

Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.

start with the data

We start by loading the data and checking its dimensions.

library(dummies)  # used later to convert factor columns to dummy variables
## dummies-1.5.6 provided by Decision Patterns
if(file.exists("train_d.csv") && file.exists("test_d.csv")){
  MainData <- read.csv(file = "train_d.csv")  # read the training data; the train/test split is done later with caret
}
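
As a quick sanity check (a small addition to the original script), we can look at the dimensions and count the missing values per column:

dim(MainData)             # the training file has 8523 rows and 12 columns
colSums(is.na(MainData))  # Item_Weight is the column treated for NAs below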

cleaning

Since the data we have is a little messy and manual feature engineering looks difficult, we decide to run a Principal Component Analysis (PCA) to make the most of the data.

First we clean the data, then we implement PCA on it and keep the first N most important dimensions. PCA needs numerical input, so factor variables have to be converted to numeric type first; the dummies package does this for us (a toy example follows below).
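
As a tiny illustration of what dummy coding does (a toy example, not part of the pipeline), a factor with k levels becomes k indicator columns of 0s and 1s:

library(dummies)
toy <- data.frame(size = factor(c("Small", "Medium", "Small")))
dummy.data.frame(toy)  # one 0/1 column per level, e.g. sizeMedium and sizeSmall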

MainData$Item_Weight[is.na(MainData$Item_Weight)] <- median(MainData$Item_Weight, na.rm = T)  # replace missing weights with the median so the model can be fit
MainData$Item_Visibility <- ifelse(MainData$Item_Visibility == 0,
                                   median(MainData$Item_Visibility), MainData$Item_Visibility)  # replace zero visibility values with the median

levels(MainData$Outlet_Size)[1] <- "other"  # the first Outlet_Size level is an empty string; rename it to "other"

preparing for PCA

At this point our dataset contains the following features:

names(MainData)
##  [1] "Item_Identifier"           "Item_Weight"              
##  [3] "Item_Fat_Content"          "Item_Visibility"          
##  [5] "Item_Type"                 "Item_MRP"                 
##  [7] "Outlet_Identifier"         "Outlet_Establishment_Year"
##  [9] "Outlet_Size"               "Outlet_Location_Type"     
## [11] "Outlet_Type"               "Item_Outlet_Sales"

“Item_Outlet_Sales” is the outcome we want to predict, so we need to remove it from the data before running PCA. Let’s first see how it is distributed:

hist(MainData$Item_Outlet_Sales, col = c(7),xlab = "outlet sales" , ylab = "frequency", main = "histogram of outlet sales distribution")

Since the outcome is right-skewed rather than Gaussian, it is better to normalize and scale the data in order to get a better model.
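
One common way to do that for a right-skewed outcome is a log transform; the sketch below only illustrates the idea, while the rest of this analysis keeps the outcome on its original scale and relies on scaling inside PCA:

hist(log1p(MainData$Item_Outlet_Sales), col = c(7),
     xlab = "log(1 + outlet sales)", ylab = "frequency",
     main = "histogram of log-transformed outlet sales")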

Next we exclude two identifier columns, Item_Identifier and Outlet_Identifier, which do not carry useful predictive information.

my_data <- subset(MainData, select = -c(Item_Outlet_Sales, Item_Identifier,Outlet_Identifier))

In order to run PCA the dataset has to be fully numerical, with no factor columns. Let’s check:

str(my_data)
## 'data.frame':    8523 obs. of  9 variables:
##  $ Item_Weight              : num  9.3 5.92 17.5 19.2 8.93 ...
##  $ Item_Fat_Content         : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
##  $ Item_Visibility          : num  0.016 0.0193 0.0168 0.0539 0.0539 ...
##  $ Item_Type                : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
##  $ Item_MRP                 : num  249.8 48.3 141.6 182.1 53.9 ...
##  $ Outlet_Establishment_Year: int  1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
##  $ Outlet_Size              : Factor w/ 4 levels "other","High",..: 3 3 3 1 2 3 2 3 1 1 ...
##  $ Outlet_Location_Type     : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
##  $ Outlet_Type              : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...

Using the dummies library we convert the factor columns to numeric indicator columns:

new_my_data <- dummy.data.frame(my_data, names = c("Item_Fat_Content","Item_Type",
                                                   "Outlet_Establishment_Year","Outlet_Size",
                                                   "Outlet_Location_Type","Outlet_Type"))
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts =
## FALSE): non-list contrasts argument ignored

## (the same warning is emitted once for each converted variable)

split into train and test

We split the data into two separate datasets. Needless to say, the model should not see the test data at all during training.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
tridx <- createDataPartition(MainData$Item_Outlet_Sales , p = 0.8 , list = F)
train <- new_my_data[tridx, ]
test  <- new_my_data[-tridx,]
dim(train)
## [1] 6820   44
dim(test)
## [1] 1703   44

PCA part

We typically turn to Principal Component Analysis in situations like the following:

  1. most of the variables are correlated, and running a model on all of them gives poor accuracy.
  2. we are undecided about which variables to keep or drop.
  3. we want a systematic way to reduce the data to a few important dimensions.

For this particular dataset we use PCA to capture as much of the data variation as possible, via the eigenvalues and eigenvectors of the correlation matrix of the variables.

pca.model <- prcomp(train , scale. = T)
summary(pca.model)
## Importance of components:
##                           PC1     PC2     PC3    PC4     PC5    PC6
## Standard deviation     2.1359 1.79299 1.65690 1.5934 1.47826 1.4242
## Proportion of Variance 0.1037 0.07306 0.06239 0.0577 0.04966 0.0461
## Cumulative Proportion  0.1037 0.17674 0.23914 0.2968 0.34651 0.3926
##                            PC7     PC8    PC9  PC10    PC11    PC12
## Standard deviation     1.39262 1.12865 1.0980 1.090 1.07316 1.06609
## Proportion of Variance 0.04408 0.02895 0.0274 0.027 0.02617 0.02583
## Cumulative Proportion  0.43668 0.46563 0.4930 0.520 0.54620 0.57203
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     1.06027 1.05200 1.04931 1.04331 1.03921 1.03565
## Proportion of Variance 0.02555 0.02515 0.02502 0.02474 0.02454 0.02438
## Cumulative Proportion  0.59758 0.62274 0.64776 0.67250 0.69704 0.72142
##                           PC19    PC20   PC21    PC22    PC23    PC24
## Standard deviation     1.02847 1.02404 1.0147 1.01327 1.00995 0.99916
## Proportion of Variance 0.02404 0.02383 0.0234 0.02333 0.02318 0.02269
## Cumulative Proportion  0.74546 0.76929 0.7927 0.81603 0.83921 0.86190
##                           PC25    PC26    PC27    PC28    PC29    PC30
## Standard deviation     0.99545 0.98194 0.96097 0.95540 0.93289 0.85367
## Proportion of Variance 0.02252 0.02191 0.02099 0.02075 0.01978 0.01656
## Cumulative Proportion  0.88442 0.90633 0.92732 0.94807 0.96785 0.98441
##                           PC31      PC32      PC33      PC34      PC35
## Standard deviation     0.82826 1.152e-14 4.203e-15 3.386e-15 3.039e-15
## Proportion of Variance 0.01559 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.00000 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC36      PC37      PC38     PC39     PC40
## Standard deviation     2.929e-15 2.535e-15 2.259e-15 1.63e-15 1.33e-15
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.00e+00 0.00e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.00e+00 1.00e+00
##                             PC41      PC42      PC43     PC44
## Standard deviation     9.199e-16 8.101e-16 6.431e-16 1.63e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.00e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.00e+00

Looking at the summary we can see that the first 30 components explain more than 98% of the variance in the data.
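
The squared standard deviations above are the per-component variances, so we can make the same choice programmatically; the snippet below is a small helper, not part of the original script:

var_explained <- pca.model$sdev^2 / sum(pca.model$sdev^2)  # proportion of variance per component
num_comp <- which(cumsum(var_explained) >= 0.98)[1]        # first component reaching 98% cumulative variance
num_comp                                                   # about 30 for this data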

PCA interpretation

Let’s first look at the most important directions in our PCA, using the fviz_pca_biplot function from the factoextra package.

library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_pca_biplot(pca.model)

The longer a variable’s arrow in the biplot, the more it contributes to the data variation, e.g. the outlet location tiers and outlet types.
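
To back this up numerically (a quick check, not in the original write-up), we can inspect the loadings stored in pca.model$rotation, e.g. the variables with the largest absolute loadings on the first component:

head(sort(abs(pca.model$rotation[, "PC1"]), decreasing = TRUE))  # variables weighing most on PC1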

fviz_pca_contrib(pca.model,choice = "var")
## Warning in fviz_pca_contrib(pca.model, choice = "var"): The function
## fviz_pca_contrib() is deprecated. Please use the function fviz_contrib()
## which can handle outputs of PCA, CA and MCA functions.

modeling

First let’s fit a decision tree to our training data using the rpart library. We create a new training dataset by adding the outcome from the original data to the component scores from the PCA model, and then keep the outcome plus the first 30 principal components as the input for the decision tree.

library(rpart)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
train.pca <- data.frame(Item_Outlet_Sales = MainData[tridx,"Item_Outlet_Sales"],pca.model$x)
train.pca_30 <- train.pca[,1:31]  # outcome column plus the first 30 principal components

decision tree

rpart.fit <- rpart(data = train.pca_30 , Item_Outlet_Sales ~ ., method = "anova")

rpart.fit
## n= 6820 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 6820 19713440000 2181.119  
##    2) PC3< -1.682351 876    60159430  344.002 *
##    3) PC3>=-1.682351 5944 16261070000 2451.866  
##      6) PC27>=0.02602754 2909  4259163000 1780.706  
##       12) PC9< -0.09082829 1304  1153548000 1395.208 *
##       13) PC9>=-0.09082829 1605  2754387000 2093.907  
##         26) PC27>=0.6759859 759   813376200 1593.743 *
##         27) PC27< 0.6759859 846  1580788000 2542.636 *
##      7) PC27< 0.02602754 3035  9435555000 3095.162  
##       14) PC9< -1.005565 648   996097200 2023.451 *
##       15) PC9>=-1.005565 2387  7493142000 3386.100  
##         30) PC6< 1.834383 2032  5075126000 3127.741  
##           60) PC24< 0.429728 1577  3559242000 2952.866 *
##           61) PC24>=0.429728 455  1300510000 3733.843 *
##         31) PC6>=1.834383 355  1506016000 4864.931 *

Since rpart produces a tree of decisions, we can visualize it simply using the rpart.plot library.

library(rpart.plot)
rpart.plot(rpart.fit)

We can see that the most important components for the splits are PC3 and PC27, followed by PC9, PC6 and PC24.
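
rpart also records an importance score per variable, so we can confirm this reading of the plot directly (a quick check on the fitted model above):

head(rpart.fit$variable.importance)  # PC3 and PC27 should appear near the top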

Random Forest

rf.fit <- randomForest(data = train.pca_30 ,Item_Outlet_Sales ~ .)
rf.fit
## 
## Call:
##  randomForest(formula = Item_Outlet_Sales ~ ., data = train.pca_30) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##           Mean of squared residuals: 1316642
##                     % Var explained: 54.45
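
For the random forest we can run the same kind of check using the importance measures it tracks (a brief sketch using the randomForest package):

head(importance(rf.fit))  # node-purity importance per principal component
varImpPlot(rf.fit)        # plot the most important components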

RMSE

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure how far the data points fall from the regression line, and RMSE measures how spread out these residuals are; in other words, it tells you how concentrated the data are around the line of best fit.
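
For the predictions below it amounts to the square root of the mean squared difference between predicted and real values; a minimal helper would look like this (the code further down computes the same quantity inline):

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))  # root mean squared error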


test data and models

test.pca <- predict(pca.model , newdata = test )  # project the test set onto the PCs learned from the training data
test.pca_30 <- as.data.frame(test.pca[,1:30])     # keep the first 30 components, matching the training set

test.rpart <- predict(rpart.fit , newdata = test.pca_30)
test.rf  <- predict(rf.fit , newdata = test.pca_30)

We now visualize the models’ predictions against the real values and then calculate RMSEs.

test.pred <- data.frame(rpart = test.rpart , rf = test.rf , real = MainData[-tridx,"Item_Outlet_Sales"])
library(ggplot2)

g <- ggplot(data = test.pred ) + geom_density(aes(real , col = "real data")) + geom_density(aes(rf,col = "random forest")) + geom_density(aes(rpart,col = "rpart"))
g

It can be seen that random forest gives a much better model; still, we calculate RMSE as our performance metric.

nro = nrow(test.pred) 
rmse.rf <- sqrt(sum((test.pred$real - test.pred$rf)^2/nro))
rmse.rpart <- sqrt(sum((test.pred$real - test.pred$rpart)^2/nro))
rmse.rf
## [1] 1134.897
rmse.rpart
## [1] 1314.355

The RMSE of the random forest model is better than that of rpart, but it could likely be improved further, since a random forest has many tunable parameters.
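
A possible next step (a sketch only, not run here) would be to tune mtry with randomForest::tuneRF and to try more trees:

# hypothetical tuning step; column 1 of train.pca_30 is the outcome
tuned <- tuneRF(x = train.pca_30[, -1], y = train.pca_30$Item_Outlet_Sales,
                ntreeTry = 500, stepFactor = 1.5, improve = 0.01)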