The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Certain attributes of each product and each store have also been defined. The aim is to build a predictive model that estimates the sales of each product at a particular store.
Using this model, BigMart will try to understand which properties of products and stores play a key role in increasing sales.
Please note that the data may have missing values, as some stores might not report all of their data due to technical glitches; these will have to be treated accordingly.
First we load the data and check its dimensions and structure.
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
if(file.exists("train_d.csv") && file.exists("test_d.csv")){
MainData <- read.csv(file = "train_d.csv") ## read data
}
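Before any cleaning, a quick look at the dimensions and missing values (a minimal sketch, assuming MainData is loaded as above):
dim(MainData)                 # number of rows and columns
str(MainData)                 # column types (several factors, some numerics)
colSums(is.na(MainData))      # count of missing values per column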
Since the data is a bit messy and hand-crafted feature engineering looks hard, we decide to run a Principal Component Analysis (PCA) to make the most of the available features. The plan is to clean the data, convert it to a purely numerical form, fit a PCA, and keep only the first N most important components. PCA needs numerical input, so the factor columns have to be converted to dummy variables first; the dummies package does this for us.
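As a toy illustration of what this conversion does (a hypothetical one-column data frame, not part of the BigMart data):
toy <- data.frame(Outlet_Size = factor(c("Small", "Medium", "High", "Medium")))
dummy.data.frame(toy, names = "Outlet_Size")
# each factor level becomes its own 0/1 indicator column
# (Outlet_SizeHigh, Outlet_SizeMedium, Outlet_SizeSmall)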
MainData$Item_Weight[is.na(MainData$Item_Weight)] <- median(MainData$Item_Weight, na.rm = TRUE) # replace missing item weights with the median so the models can be fit
MainData$Item_Visibility <- ifelse(MainData$Item_Visibility == 0, median(MainData$Item_Visibility), MainData$Item_Visibility)
# replace item visibility zeros with the median visibility
levels(MainData$Outlet_Size)[1] <- "other" # rename the empty Outlet_Size factor level to "other"
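A couple of quick sanity checks on the cleaning steps (a sketch):
sum(is.na(MainData$Item_Weight))       # should be 0 after the median imputation
sum(MainData$Item_Visibility == 0)     # should be 0 after replacing the zeros
table(MainData$Outlet_Size)            # the formerly empty level now shows up as "other"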
In this step we create a subset of the dataset, which currently contains the following features:
names(MainData)
## [1] "Item_Identifier" "Item_Weight"
## [3] "Item_Fat_Content" "Item_Visibility"
## [5] "Item_Type" "Item_MRP"
## [7] "Outlet_Identifier" "Outlet_Establishment_Year"
## [9] "Outlet_Size" "Outlet_Location_Type"
## [11] "Outlet_Type" "Item_Outlet_Sales"
“Item_Outlet_Sales” is the outcome we want to predict, so we need to remove it from the data before running the PCA. Let’s first see how it is distributed:
hist(MainData$Item_Outlet_Sales, col = c(7),xlab = "outlet sales" , ylab = "frequency", main = "histogram of outlet sales distribution")
Since the outcome is clearly not Gaussian-shaped (it is heavily right-skewed), it is better to normalize and scale the data in order to get a better model.
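For comparison, an optional sketch of the same histogram on a log scale, which looks much closer to a bell shape (we do not apply this transform in the models below):
hist(log(MainData$Item_Outlet_Sales), col = c(7),
     xlab = "log(outlet sales)", ylab = "frequency",
     main = "histogram of log-transformed outlet sales")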
In the next step we exclude two identifier columns that are not useful as predictors: Item_Identifier and Outlet_Identifier.
my_data <- subset(MainData, select = -c(Item_Outlet_Sales, Item_Identifier,Outlet_Identifier))
To run a PCA the dataset must be fully numerical, with no factor columns. Let's check:
str(my_data)
## 'data.frame': 8523 obs. of 9 variables:
## $ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
## $ Item_Fat_Content : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
## $ Item_Visibility : num 0.016 0.0193 0.0168 0.0539 0.0539 ...
## $ Item_Type : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
## $ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
## $ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
## $ Outlet_Size : Factor w/ 4 levels "other","High",..: 3 3 3 1 2 3 2 3 1 1 ...
## $ Outlet_Location_Type : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
## $ Outlet_Type : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
Using the dummies package we convert the factor columns to numerical dummy variables:
new_my_data <- dummy.data.frame(my_data, names = c("Item_Fat_Content","Item_Type",
"Outlet_Establishment_Year","Outlet_Size",
"Outlet_Location_Type","Outlet_Type"))
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts =
## FALSE): non-list contrasts argument ignored
(the same warning is emitted once for each converted factor column and can safely be ignored)
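To confirm the conversion worked, a quick check that every column is now numeric and that the one-hot expansion produced the expected number of columns (a sketch):
all(sapply(new_my_data, is.numeric))   # should be TRUE: no factor columns remain
ncol(new_my_data)                      # should match the 44 columns seen in the train/test dims below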
Now we split the data into two separate sets, training and test; needless to say, the model should never see the test data during training.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
tridx <- createDataPartition(MainData$Item_Outlet_Sales , p = 0.8 , list = F)
train <- new_my_data[tridx, ]
test <- new_my_data[-tridx,]
dim(train)
## [1] 6820 44
dim(test)
## [1] 1703 44
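Note that createDataPartition draws a random sample, so the exact split (and every result below) depends on the random number state; for a reproducible report one would fix the seed before partitioning, e.g. (the seed value is arbitrary):
set.seed(123)   # arbitrary seed, only to make the 80/20 split reproducible
tridx <- createDataPartition(MainData$Item_Outlet_Sales, p = 0.8, list = FALSE)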
Principal Component Analysis is typically used when a dataset has many (often correlated) features and we want to reduce the dimensionality while keeping most of the information. In this particular dataset we use PCA to capture the largest share of the data variation through the eigenvalues and eigenvectors of the scaled data.
pca.model <- prcomp(train , scale. = T)
summary(pca.model)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.1359 1.79299 1.65690 1.5934 1.47826 1.4242
## Proportion of Variance 0.1037 0.07306 0.06239 0.0577 0.04966 0.0461
## Cumulative Proportion 0.1037 0.17674 0.23914 0.2968 0.34651 0.3926
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 1.39262 1.12865 1.0980 1.090 1.07316 1.06609
## Proportion of Variance 0.04408 0.02895 0.0274 0.027 0.02617 0.02583
## Cumulative Proportion 0.43668 0.46563 0.4930 0.520 0.54620 0.57203
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 1.06027 1.05200 1.04931 1.04331 1.03921 1.03565
## Proportion of Variance 0.02555 0.02515 0.02502 0.02474 0.02454 0.02438
## Cumulative Proportion 0.59758 0.62274 0.64776 0.67250 0.69704 0.72142
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 1.02847 1.02404 1.0147 1.01327 1.00995 0.99916
## Proportion of Variance 0.02404 0.02383 0.0234 0.02333 0.02318 0.02269
## Cumulative Proportion 0.74546 0.76929 0.7927 0.81603 0.83921 0.86190
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.99545 0.98194 0.96097 0.95540 0.93289 0.85367
## Proportion of Variance 0.02252 0.02191 0.02099 0.02075 0.01978 0.01656
## Cumulative Proportion 0.88442 0.90633 0.92732 0.94807 0.96785 0.98441
## PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.82826 1.152e-14 4.203e-15 3.386e-15 3.039e-15
## Proportion of Variance 0.01559 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.00000 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC36 PC37 PC38 PC39 PC40
## Standard deviation 2.929e-15 2.535e-15 2.259e-15 1.63e-15 1.33e-15
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.00e+00 0.00e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.00e+00 1.00e+00
## PC41 PC42 PC43 PC44
## Standard deviation 9.199e-16 8.101e-16 6.431e-16 1.63e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.00e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.00e+00
Looking at the summary, we can see that the first 30 components capture more than 98% of the variance in the data.
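The same cut-off can be read directly from the cumulative proportion of variance explained (a sketch based on the fitted pca.model):
pca.var  <- pca.model$sdev^2                       # variance of each component
cum.prop <- cumsum(pca.var) / sum(pca.var)         # cumulative proportion explained
plot(cum.prop, type = "b",
     xlab = "number of principal components",
     ylab = "cumulative proportion of variance explained")
abline(h = 0.98, lty = 2)
which(cum.prop >= 0.98)[1]                         # smallest number of components reaching 98%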
Let's first look at the most important variable vectors in the PCA, using the fviz_pca_biplot() function from the factoextra package.
library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_pca_biplot(pca.model)
The longest vectors contribute the most to the data variation, e.g. the outlet location tiers and outlet types.
fviz_contrib(pca.model, choice = "var")
First, let's fit a decision tree to the training data using the rpart package. We build a new training data frame by combining the outcome from the original dataset with the component scores from the PCA model, and then use the first 30 components plus the outcome as the input for the decision tree.
library(rpart)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
train.pca <- data.frame(Item_Outlet_Sales = MainData[tridx,"Item_Outlet_Sales"],pca.model$x)
train.pca_30 <- train.pca[,1:31]
rpart.fit <- rpart(data = train.pca_30 , Item_Outlet_Sales ~ ., method = "anova")
rpart.fit
## n= 6820
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 6820 19713440000 2181.119
## 2) PC3< -1.682351 876 60159430 344.002 *
## 3) PC3>=-1.682351 5944 16261070000 2451.866
## 6) PC27>=0.02602754 2909 4259163000 1780.706
## 12) PC9< -0.09082829 1304 1153548000 1395.208 *
## 13) PC9>=-0.09082829 1605 2754387000 2093.907
## 26) PC27>=0.6759859 759 813376200 1593.743 *
## 27) PC27< 0.6759859 846 1580788000 2542.636 *
## 7) PC27< 0.02602754 3035 9435555000 3095.162
## 14) PC9< -1.005565 648 996097200 2023.451 *
## 15) PC9>=-1.005565 2387 7493142000 3386.100
## 30) PC6< 1.834383 2032 5075126000 3127.741
## 60) PC24< 0.429728 1577 3559242000 2952.866 *
## 61) PC24>=0.429728 455 1300510000 3733.843 *
## 31) PC6>=1.834383 355 1506016000 4864.931 *
Since rpart builds a tree of decision rules, we can draw a simple visualization of it using the rpart.plot package.
library(rpart.plot)
rpart.plot(rpart.fit)
We can see that the most important components in the tree are PC3 and PC27, followed by PC9, PC6 and PC24.
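The same ranking can be checked numerically, since rpart stores an importance score for every predictor it used (a quick sketch):
head(sort(rpart.fit$variable.importance, decreasing = TRUE))   # top components in the tree
Next, we fit a random forest on the same 30 components to see whether an ensemble of trees improves on the single tree.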
rf.fit <- randomForest(data = train.pca_30 ,Item_Outlet_Sales ~ .)
rf.fit
##
## Call:
## randomForest(formula = Item_Outlet_Sales ~ ., data = train.pca_30)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 10
##
## Mean of squared residuals: 1316642
## % Var explained: 54.45
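As a quick check on which components drive the random forest, we can look at its built-in importance measures (a sketch using functions from the randomForest package):
head(importance(rf.fit))   # IncNodePurity of each component
varImpPlot(rf.fit)         # dot chart of component importance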
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.
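This definition corresponds to a one-line helper (a sketch; the manual computation further below is equivalent):
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))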
RMSE
test.pca <- predict(pca.model , newdata = test )
test.pca_30 <- as.data.frame(test.pca[,1:30])
test.rpart <- predict(rpart.fit , newdata = test.pca_30)
test.rf <- predict(rf.fit , newdata = test.pca_30)
We now compare the two models' predictions with the real values, first visually and then via RMSE.
test.pred <- data.frame(rpart = test.rpart , rf = test.rf , real = MainData[-tridx,"Item_Outlet_Sales"])
library(ggplot2)
g <- ggplot(data = test.pred ) + geom_density(aes(real , col = "real data")) + geom_density(aes(rf,col = "random forest")) + geom_density(aes(rpart,col = "rpart"))
g
The density plot suggests that the random forest gives a much better fit to the real distribution; still, we calculate RMSE as a quantitative performance measure.
rmse.rf <- sqrt(mean((test.pred$real - test.pred$rf)^2))
rmse.rpart <- sqrt(mean((test.pred$real - test.pred$rpart)^2))
rmse.rf
## [1] 1134.897
rmse.rpart
## [1] 1314.355
The RMSE of the random forest model is better (lower) than that of rpart, but it could likely be improved further, since the random forest has several tunable parameters.
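As an example of such tuning, one could search over mtry (the number of components tried at each split) with tuneRF from the randomForest package; a hedged sketch, with arbitrary seed and try-out values:
set.seed(123)                                   # arbitrary seed for reproducibility
tuned <- tuneRF(x = train.pca_30[, -1],         # the 30 principal components
                y = train.pca_30$Item_Outlet_Sales,
                ntreeTry = 200,                 # trees grown per candidate mtry (arbitrary)
                stepFactor = 1.5,               # factor by which mtry is inflated/deflated
                improve = 0.01)                 # minimum relative improvement to keep searching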