The goal is to forecast the total amount of product sold in every shop for the test data set. The data is provided by 1C, a Russian software firm. Daily historical sales for stores and products are provided from January 2013 to October 2015, and we need to forecast sales per shop and product for November 2015.
Why did I select this dataset? Store sales data is a window into human behaviour: for instance, it helps us understand the demographics of shoppers, what they purchase, and how price influences the decision to buy.
What drives variation in the variable? Pricing is likely to be somewhat seasonal and dependent on each product's supply chain. Festive discounts and similar factors can also shift buying behaviour. Forecast accuracy will therefore be moderate, since there is considerable day-to-day variation.
To begin the project we install a few libraries for data wrangling and import the training data set.
options(warn = -1)
rm(list = ls())                               # start from a clean workspace
pkgs <- c("dplyr", "tidyverse", "lubridate", "data.table", "ggplot2")
lib <- installed.packages()[, "Package"]
to_install <- setdiff(pkgs, lib)              # install only the packages that are missing
if (length(to_install)) install.packages(to_install)
setwd("C:/Users/nawal/OneDrive/Desktop/Aditya Course/Time Series/competitive-data-science-predict-future-sales")
sales_train <- read.csv("sales_train.csv")
head(sales_train)
## date date_block_num shop_id item_id item_price item_cnt_day
## 1 02.01.2013 0 59 22154 999.00 1
## 2 03.01.2013 0 25 2552 899.00 1
## 3 05.01.2013 0 25 2552 899.00 -1
## 4 06.01.2013 0 25 2554 1709.05 1
## 5 15.01.2013 0 25 2555 1099.00 1
## 6 10.01.2013 0 25 2564 349.00 1
dim(sales_train)
## [1] 2935849 6
str(sales_train)
## 'data.frame': 2935849 obs. of 6 variables:
## $ date : chr "02.01.2013" "03.01.2013" "05.01.2013" "06.01.2013" ...
## $ date_block_num: int 0 0 0 0 0 0 0 0 0 0 ...
## $ shop_id : int 59 25 25 25 25 25 25 25 25 25 ...
## $ item_id : int 22154 2552 2552 2554 2555 2564 2565 2572 2572 2573 ...
## $ item_price : num 999 899 899 1709 1099 ...
## $ item_cnt_day : num 1 1 -1 1 1 1 1 1 1 3 ...
shop_id: unique identifier for the shop
item_id: unique identifier of a product
item_price: current price of an item
date: date in the format dd.mm.yyyy
item_cnt_day: number of products sold that day (a negative value indicates a return)
date_block_num: a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1, ..., October 2015 is 33
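As a quick illustration of this mapping (a minimal sketch; block_num is a hypothetical helper, not part of the competition files), the month index can be recomputed from a raw date string:

# sketch: recompute date_block_num from a dd.mm.yyyy date string
block_num <- function(d) {
  d <- as.Date(d, "%d.%m.%Y")
  (lubridate::year(d) - 2013) * 12 + (lubridate::month(d) - 1)
}
block_num(c("02.01.2013", "15.10.2015"))   # expected: 0 and 33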
Along with the sales training data we also have item names, category names, and shop names, which we won't use: the names are in Russian, which isn't easy for an Indian to interpret :). We will build our forecast on shop_id and item_id. The next step is summary statistics and data wrangling.
options(warn=-1)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
sales_train$date1 <- as.Date(sales_train$date, "%d.%m.%Y")   # parse the dd.mm.yyyy strings
sales_train$year  <- lubridate::year(sales_train$date1)
sales_train$month <- lubridate::month(sales_train$date1)
sales_train$day   <- lubridate::day(sales_train$date1)
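# sanity check (a sketch, not in the original script): parsing should yield no NA dates
stopifnot(!any(is.na(sales_train$date1)))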
sales_train_store <- sales_train %>%
  group_by(shop_id, date_block_num) %>%
  summarize(package_moved = sum(item_cnt_day)) %>%
  arrange(shop_id, date_block_num)
## `summarise()` has grouped output by 'shop_id'. You can override using the `.groups` argument.
sales_train_store <- filter(sales_train_store, date_block_num %in% 22:33)   # keep the last 12 months (Nov-2014 through Oct-2015)
summary(sales_train_store)
## shop_id date_block_num package_moved
## Min. : 2 Min. :22.0 Min. : -1
## 1st Qu.:18 1st Qu.:24.0 1st Qu.: 966
## Median :34 Median :27.0 Median : 1340
## Mean :32 Mean :27.3 Mean : 1915
## 3rd Qu.:47 3rd Qu.:30.0 3rd Qu.: 2084
## Max. :59 Max. :33.0 Max. :14610
ggplot2::ggplot(sales_train_store, ggplot2::aes(x = date_block_num, y = package_moved, color = shop_id)) +
  geom_point() + geom_line() +
  theme_classic() + theme(panel.grid = element_blank())
options(warn=-1)
sales_store_req <- sales_train %>%
  group_by(shop_id) %>%
  summarise(package_moved = sum(item_cnt_day)) %>%
  arrange(desc(package_moved))
### shop_id 31 and 25 move the highest number of packages, so we restrict the analysis to the top shop, shop_id 31
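# quick check (a sketch, not in the original script): the top rows should confirm shops 31 and 25 lead
head(sales_store_req, 5)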
sales_train_top_store <- filter(sales_train, shop_id == 31)
sales_train_top_store$shop_id <- as.factor(sales_train_top_store$shop_id)   # treat shop_id as a discrete label for plotting
sales_train_top_store_month <- sales_train_top_store %>%
  group_by(shop_id, date1) %>%
  summarize(package_moved = sum(item_cnt_day)) %>%   # daily totals for shop 31, despite the _month suffix
  arrange(date1)
## `summarise()` has grouped output by 'shop_id'. You can override using the `.groups` argument.
ggplot2::ggplot(sales_train_top_store_month, ggplot2::aes(x = date1, y = package_moved, color = shop_id)) +
  geom_point() + geom_line() +
  theme_classic() + theme(panel.grid = element_blank())
options(warn=-1)
boxplot(sales_train_top_store_month$package_moved, main = "Daily packages moved by shop 31")
summary(sales_train_top_store_month$package_moved)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 102.0 221.5 275.0 301.4 358.5 1080.0
options(warn=-1)
# cap outliers: replace daily totals above 700 with 650 before re-plotting
sales_train_top_store_month$package_moved <- ifelse(sales_train_top_store_month$package_moved > 700, 650, sales_train_top_store_month$package_moved)
ggplot2::ggplot(sales_train_top_store_month, ggplot2::aes(x = date1, y = package_moved, color = shop_id)) +
  geom_point() + geom_line() +
  theme_classic() + theme(panel.grid = element_blank())
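The 700/650 cut-offs above are eyeballed from the boxplot. As an alternative sketch (not the approach used here), the cap could be tied to an upper quantile of the series instead of a hard-coded value:

# sketch (not the author's method): winsorise at the 99th percentile instead of a fixed 700/650
cap <- quantile(sales_train_top_store_month$package_moved, 0.99, na.rm = TRUE)
sales_train_top_store_month$package_moved <- pmin(sales_train_top_store_month$package_moved, cap)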
options(warn=-1)
sales_train_top_store_month$index <- seq_len(nrow(sales_train_top_store_month))   # 1031 daily observations
reg <- lm(package_moved ~ index, data = sales_train_top_store_month)
summary(reg)
##
## Call:
## lm(formula = package_moved ~ index, data = sales_train_top_store_month)
##
## Residuals:
## Min 1Q Median 3Q Max
## -192.54 -66.97 -20.58 43.50 426.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 384.83352 6.02257 63.90 <2e-16 ***
## index -0.16765 0.01011 -16.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96.62 on 1029 degrees of freedom
## Multiple R-squared: 0.2109, Adjusted R-squared: 0.2101
## F-statistic: 275 on 1 and 1029 DF, p-value: < 2.2e-16
## The fit is poor: adjusted R-squared is only about 21%, so a plain linear trend is not an adequate model for this series.
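As a supporting diagnostic (a sketch, not part of the original analysis), the residuals of an adequate model should be roughly uncorrelated; an ACF plot of the trend residuals makes any leftover weekly or seasonal structure visible:

# sketch: inspect the trend model's residuals for leftover autocorrelation
acf(resid(reg), main = "ACF of linear-trend residuals")   # large spikes suggest seasonality the trend misses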