The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.
Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.
Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.
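As a minimal sketch of what treating missing values can look like, the snippet below counts the NAs in each column and fills a numeric column with its median. The toy data frame, its column names, and its values are made up for illustration and are not the BigMart data.

```r
# Toy data (hypothetical values, not the BigMart data)
toy <- data.frame(
  weight = c(9.3, NA, 17.5, NA, 8.9),
  price  = c(249.8, 48.3, 141.6, 182.1, 53.9)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Replace missing weights with the median of the observed weights
toy$weight[is.na(toy$weight)] <- median(toy$weight, na.rm = TRUE)
print(toy)
```

The median is used here rather than the mean because it is less sensitive to extreme values; other imputation strategies may suit a given column better.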
First, we load the data; its dimensions can then be checked with dim(train) and dim(test).
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
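# stringsAsFactors = TRUE keeps categorical columns as factors (no longer the
# default since R 4.0.0), which the levels() call further down relies on.
if (file.exists("train_d.csv") && file.exists("test_d.csv")) {
  train <- read.csv(file = "train_d.csv", stringsAsFactors = TRUE)
  test <- read.csv(file = "test_d.csv", stringsAsFactors = TRUE)
}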
Since the data is somewhat messy and feature engineering looks difficult, we decide to run a Principal Component Analysis (PCA) to make the most of our data.
First we combine the training and test data so that both receive the same preprocessing. PCA only works on numeric data, so we need to convert the factor variables to numeric dummy variables; the dummies package does this for us. We then fit the PCA on the training rows and keep the first N most important components.
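# Add a placeholder target so that test has the same columns as train
test$Item_Outlet_Sales <- 1
combi <- rbind(train, test)
# Replace missing item weights with the median item weight
combi$Item_Weight[is.na(combi$Item_Weight)] <- median(combi$Item_Weight, na.rm = TRUE)
# Count the remaining NAs in each column
NAs <- apply(combi, 2, function(x) sum(is.na(x)))
# Treat a visibility of 0 as missing and replace it with the median visibility
combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0, median(combi$Item_Visibility), combi$Item_Visibility)
# Rename the first level of Outlet_Size to "other"
levels(combi$Outlet_Size)[1] <- "other"
# Drop the target and the identifier columns before encoding
my_data <- subset(combi, select = -c(Item_Outlet_Sales, Item_Identifier, Outlet_Identifier))
# Convert the categorical variables to dummy (0/1) columns
new_my_data <- dummy.data.frame(my_data, names = c("Item_Fat_Content", "Item_Type",
                                                   "Outlet_Establishment_Year", "Outlet_Size",
                                                   "Outlet_Location_Type", "Outlet_Type"))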
## Warning in model.matrix.default(~x - 1, model.frame(~x - 1), contrasts =
## FALSE): non-list contrasts argument ignored
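To illustrate what dummy encoding produces, here is a small sketch that uses base R's model.matrix on a made-up data frame. The column names and values are hypothetical, and this is only meant to show the shape of the result, not to replace the dummies call above.

```r
# Toy data with one categorical and one numeric column (hypothetical values)
toy <- data.frame(
  size   = factor(c("Small", "Medium", "High", "Small")),
  weight = c(9.3, 5.9, 17.5, 19.2)
)

# "- 1" drops the intercept so that every factor level gets its own 0/1 column
size_dummies <- model.matrix(~ size - 1, data = toy)

# Combine the numeric column with the dummy columns
encoded <- cbind(toy["weight"], size_dummies)
print(encoded)
```

Each row has exactly one 1 among the dummy columns, so the encoded data is fully numeric and can be passed to a PCA.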
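Principal component analysis is useful when a dataset has many, possibly correlated, variables and we want to summarize them with a smaller set of uncorrelated ones.
For this particular data, we use PCA to capture most of the variation in the data through its eigenvalues and eigenvectors.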
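# Split the encoded data back into training and test rows
pca.train <- new_my_data[1:nrow(train), ]
pca.test <- new_my_data[-(1:nrow(train)), ]
####
# Fit the PCA on the training rows only, standardizing each variable
prin_comp <- prcomp(pca.train, scale. = TRUE)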
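The link between prcomp and eigenvalues/eigenvectors can be checked on a small example. The sketch below uses the built-in USArrests data (an assumption made purely for illustration, since the BigMart files are not bundled with R) and compares a scaled PCA with the eigendecomposition of the correlation matrix.

```r
# Built-in dataset used only for illustration
X <- USArrests

# PCA on standardized variables
pc <- prcomp(X, scale. = TRUE)

# Eigendecomposition of the correlation matrix
eig <- eigen(cor(X))

# The component variances (sdev^2) equal the eigenvalues
same_values <- isTRUE(all.equal(pc$sdev^2, eig$values))

# The loadings equal the eigenvectors, up to the sign of each column
same_vectors <- isTRUE(all.equal(abs(unname(pc$rotation)), abs(eig$vectors)))

print(c(same_values = same_values, same_vectors = same_vectors))
```

Each eigenvector gives a direction in the variable space, and its eigenvalue gives the variance of the data along that direction, which is why the components with the largest eigenvalues are kept.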
Let’s first look at the most important vectors in our PCA using the fviz_pca_biplot function from the factoextra package.
library(factoextra)
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_pca_biplot(prin_comp)
To get a better intuition for how much each variable contributes to our PCA model, we can also plot the variable contributions:
# fviz_pca_contrib() is deprecated; fviz_contrib() handles PCA output directly
fviz_contrib(prin_comp, choice = "var")
Finally, to reduce the data we can keep the first n components of our PCA, choosing n by checking how much of the variation they explain.
std_dev <- prin_comp$sdev
pr_var <- std_dev^2
pr_var_ratio <- pr_var/sum(pr_var)
plot(cumsum(pr_var_ratio) , ylab = "variation represented" , xlab = "first n vectors" , type = "b")
By choosing the first 30 components we capture almost 98% of the variation in the data.
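Instead of reading the number of components off the plot, it can also be computed from the cumulative variance ratio. The sketch below assumes the built-in USArrests data and a 95% threshold, both chosen only for illustration; the names threshold and n_comp are hypothetical.

```r
# PCA on a built-in dataset (illustration only)
pc <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained by each component
var_ratio <- pc$sdev^2 / sum(pc$sdev^2)
cum_ratio <- cumsum(var_ratio)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold <- 0.95
n_comp <- which(cum_ratio >= threshold)[1]

print(round(cum_ratio, 3))
print(n_comp)
```

Applied to the objects above, the same idea would be which(cumsum(pr_var_ratio) >= 0.98)[1].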
First, let’s fit a decision tree to our training data using the rpart library.
train.data <- data.frame(Item_Outlet_Sales = train$Item_Outlet_Sales, prin_comp$x)
# Keep the target plus the first 30 principal components
train.data <- train.data[, 1:31]
#####
library(rpart)
rpart.model <- rpart(Item_Outlet_Sales ~ .,data = train.data, method = "anova")
It’s time to see how our model performs by feeding the test data to it:
test.data <- predict(prin_comp, newdata = pca.test)
test.data <- as.data.frame(test.data)
##
test.data <- test.data[,1:30]
rpart.prediction <- predict(rpart.model, test.data)
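The predict call on the PCA object projects new rows onto the components learned from the training rows, reusing the training centers and scales. The sketch below checks this by hand on the built-in USArrests data, split arbitrarily into "training" and "test" rows for illustration.

```r
# Arbitrary split of a built-in dataset (illustration only)
train_x <- USArrests[1:40, ]
test_x  <- USArrests[41:50, ]

# PCA fitted on the training rows only
pc <- prcomp(train_x, scale. = TRUE)

# Projection of the test rows via predict()
proj <- predict(pc, newdata = test_x)

# The same projection by hand: standardize with the training
# center/scale, then multiply by the rotation (loadings) matrix
manual <- scale(test_x, center = pc$center, scale = pc$scale) %*% pc$rotation

same_projection <- isTRUE(all.equal(proj, manual, check.attributes = FALSE))
print(same_projection)
```

This is why the test data must not be given its own PCA: the model was trained on coordinates defined by the training components, and the test rows have to be expressed in the same coordinates.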
Time to check the performance of our model by comparing it in the rankings:
# Named sample_sub to avoid masking base R's sample() function
sample_sub <- read.csv("SampleSubmission_TmnO39y.txt")
final.sub <- data.frame(Item_Identifier = sample_sub$Item_Identifier, Outlet_Identifier = sample_sub$Outlet_Identifier, Item_Outlet_Sales = rpart.prediction)
write.csv(final.sub, "pca_res.csv", row.names = FALSE)
We scored 2370 in the challenge rankings, with an RMSE of 1360.
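RMSE is the square root of the mean squared difference between actual and predicted values. Since the test set has no true sales, one way to estimate it before submitting is to hold out part of the training data. The sketch below shows the idea on the built-in mtcars data with a linear model as a stand-in; the helper rmse, the split size, and the model are assumptions for illustration, not the method used above.

```r
# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hold out 8 rows of a built-in dataset (illustration only)
set.seed(42)
holdout_idx <- sample(nrow(mtcars), 8)

# Fit on the remaining rows; lm is a stand-in for any regression model
fit <- lm(mpg ~ wt + hp, data = mtcars[-holdout_idx, ])

# Error on rows the model has not seen
holdout_pred <- predict(fit, newdata = mtcars[holdout_idx, ])
holdout_rmse <- rmse(mtcars$mpg[holdout_idx], holdout_pred)
print(holdout_rmse)
```

An RMSE measured on held-out rows is usually a better guide to leaderboard performance than an RMSE measured on the rows used for fitting.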
Let’s try a better method. The log below shows its training RMSE over 100 iterations:
## [1] train-rmse:541.589844
## [2] train-rmse:451.123779
## [3] train-rmse:368.730408
## [4] train-rmse:299.430084
## [5] train-rmse:181.904053
## [6] train-rmse:173.837524
## [7] train-rmse:162.528152
## [8] train-rmse:153.858719
## [9] train-rmse:147.710571
## [10] train-rmse:137.899185
## [11] train-rmse:125.625687
## [12] train-rmse:122.851151
## [13] train-rmse:120.512001
## [14] train-rmse:117.840874
## [15] train-rmse:112.614693
## [16] train-rmse:105.957954
## [17] train-rmse:99.190308
## [18] train-rmse:90.387604
## [19] train-rmse:80.306549
## [20] train-rmse:73.782272
## [21] train-rmse:71.400276
## [22] train-rmse:70.472832
## [23] train-rmse:69.751778
## [24] train-rmse:68.015793
## [25] train-rmse:66.326378
## [26] train-rmse:64.488556
## [27] train-rmse:62.780884
## [28] train-rmse:61.162640
## [29] train-rmse:60.007889
## [30] train-rmse:59.542305
## [31] train-rmse:58.935711
## [32] train-rmse:57.498085
## [33] train-rmse:56.008247
## [34] train-rmse:54.307816
## [35] train-rmse:52.584499
## [36] train-rmse:51.092667
## [37] train-rmse:49.460197
## [38] train-rmse:47.916119
## [39] train-rmse:46.718010
## [40] train-rmse:45.712540
## [41] train-rmse:44.949753
## [42] train-rmse:43.670277
## [43] train-rmse:42.501442
## [44] train-rmse:41.587372
## [45] train-rmse:40.393139
## [46] train-rmse:39.140247
## [47] train-rmse:38.635883
## [48] train-rmse:37.838943
## [49] train-rmse:37.017212
## [50] train-rmse:36.623428
## [51] train-rmse:35.714645
## [52] train-rmse:35.606525
## [53] train-rmse:35.429985
## [54] train-rmse:35.346100
## [55] train-rmse:35.256935
## [56] train-rmse:35.176727
## [57] train-rmse:34.589478
## [58] train-rmse:34.515263
## [59] train-rmse:34.459290
## [60] train-rmse:34.356960
## [61] train-rmse:34.138229
## [62] train-rmse:34.107868
## [63] train-rmse:34.063545
## [64] train-rmse:34.014412
## [65] train-rmse:33.962090
## [66] train-rmse:33.917343
## [67] train-rmse:33.858509
## [68] train-rmse:33.632145
## [69] train-rmse:33.592514
## [70] train-rmse:33.549198
## [71] train-rmse:33.512203
## [72] train-rmse:33.128979
## [73] train-rmse:33.059139
## [74] train-rmse:32.856865
## [75] train-rmse:32.650135
## [76] train-rmse:32.246986
## [77] train-rmse:32.003891
## [78] train-rmse:31.958500
## [79] train-rmse:31.914558
## [80] train-rmse:31.868992
## [81] train-rmse:31.826027
## [82] train-rmse:31.779884
## [83] train-rmse:31.454367
## [84] train-rmse:31.427706
## [85] train-rmse:31.405697
## [86] train-rmse:31.291155
## [87] train-rmse:31.272430
## [88] train-rmse:31.226568
## [89] train-rmse:31.096087
## [90] train-rmse:31.067686
## [91] train-rmse:31.033606
## [92] train-rmse:31.001806
## [93] train-rmse:30.968382
## [94] train-rmse:30.901787
## [95] train-rmse:30.856886
## [96] train-rmse:30.812117
## [97] train-rmse:30.775145
## [98] train-rmse:30.728165
## [99] train-rmse:30.691801
## [100] train-rmse:30.628496