Assignment 01

About Store Sales Data set

The goal is to forecast the total amount of product sold every shop for the test data set. The data is provided by the Russian software firm - 1C. The daily historical data of stores and products are provided from Jan-2013 to Oct-2015. We need to forecast the sales for shop and products for Nov-2015.

Why I have selected this dataset? The store sales is something helped to study behavior of human For instance it help us to understand the nature of demography , what they purchase and how price defined the point of purchase.

What drives variation in the variable? I think pricing will be bit seasonal as well as dependent on the supply chain of the product. The festive discounts and other factor also can lead to change in buying behavior. The forecast of the pricing will be bit moderate as there are chances of variation on date level.

To begin the project we will install some libraries for data wrangling and we will import the training data set.

options(warn=-1)
rm(list = ls())
pkgs <- c("dplyr", "tidyverse", "lubridate", "data.table", "ggplot2")
lib <- installed.packages()[, "Package"]
install.packages(setdiff(pkgs, lib))
setwd("C:/Users/nawal/OneDrive/Desktop/Aditya Course/Time Series/competitive-data-science-predict-future-sales")

sales_train<- read.csv("sales_train.csv")
head(sales_train)

##         date date_block_num shop_id item_id item_price item_cnt_day
## 1 02.01.2013              0      59   22154     999.00            1
## 2 03.01.2013              0      25    2552     899.00            1
## 3 05.01.2013              0      25    2552     899.00           -1
## 4 06.01.2013              0      25    2554    1709.05            1
## 5 15.01.2013              0      25    2555    1099.00            1
## 6 10.01.2013              0      25    2564     349.00            1

dim(sales_train)

## [1] 2935849       6

str(sales_train)

## 'data.frame':    2935849 obs. of  6 variables:
##  $ date          : chr  "02.01.2013" "03.01.2013" "05.01.2013" "06.01.2013" ...
##  $ date_block_num: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ shop_id       : int  59 25 25 25 25 25 25 25 25 25 ...
##  $ item_id       : int  22154 2552 2552 2554 2555 2564 2565 2572 2572 2573 ...
##  $ item_price    : num  999 899 899 1709 1099 ...
##  $ item_cnt_day  : num  1 1 -1 1 1 1 1 1 1 3 ...

Description of sales training data set

Shop_id : unique identifier for the shop

item_id: unique identifier of a product

item_price: current price of an item

date : date in format dd/mm/yyyy

item_cnt_day : number of product sold

date_block_num : a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,…, October 2015 is 33

Along with sales training data set we have information about item name , categories name , shop name which we won’t be using in our data set since the naming are in Russian which won’t be easy for an Indian to interpret :) . We will try to forecast our model on the shop_id and item_id. We will be doing summary statisting and data wrangling in next step.

options(warn=-1)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
sales_train$date1<- as.Date(sales_train$date,"%d.%m.%Y")
sales_train$year<- lubridate::year(sales_train$date1)
sales_train$month<- lubridate::month(sales_train$date1)
sales_train$day<- lubridate::day(sales_train$date1)

sales_train_store <- sales_train%>%group_by(shop_id,date_block_num)%>%summarize(package_moved =sum(item_cnt_day))%>%arrange(shop_id,date_block_num)

## `summarise()` has grouped output by 'shop_id'. You can override using the `.groups` argument.

sales_train_store <- filter(sales_train_store , date_block_num %in% 22:33)

summary(sales_train_store)

##     shop_id   date_block_num package_moved  
##  Min.   : 2   Min.   :22.0   Min.   :   -1  
##  1st Qu.:18   1st Qu.:24.0   1st Qu.:  966  
##  Median :34   Median :27.0   Median : 1340  
##  Mean   :32   Mean   :27.3   Mean   : 1915  
##  3rd Qu.:47   3rd Qu.:30.0   3rd Qu.: 2084  
##  Max.   :59   Max.   :33.0   Max.   :14610

ggplot2::ggplot(sales_train_store, ggplot2::aes(x=date_block_num,y=package_moved,color=shop_id))+geom_point()+geom_line()+
                                                                                 theme_classic()+theme(panel.grid=element_blank())

Plot doesn’t look that great so we will keep only top store with highest packaged moved forour analysis in the given time frame

options(warn=-1)
sales_store_req <- sales_train %>% group_by(shop_id) %>% summarise(package_moved = sum(item_cnt_day))%>%arrange(desc(package_moved ))

### shop id =31 & 25 have highest number of package moved in category so we cut_off our data only for this 2 id

sales_train_top_store <- filter(sales_train, shop_id ==31)

sales_train_top_store$shop_id<- as.vector(sales_train_top_store$shop_id)

sales_train_top_store_month <- sales_train_top_store%>% group_by(shop_id,date1) %>% summarize(package_moved = sum(item_cnt_day))%>% arrange(date1)

## `summarise()` has grouped output by 'shop_id'. You can override using the `.groups` argument.

ggplot2::ggplot(sales_train_top_store_month, ggplot2::aes(x=date1,y=package_moved,color=shop_id))+geom_point()+geom_line()+
                                                                                 theme_classic()+theme(panel.grid=element_blank())

The below graph represent the date over date of total packaged moved by a shop in 33 months.As we can see the sales of the total items are changing over time

options(warn=-1)
boxplot(sales_train_top_store_month$package_moved,main = "Average Box moved on daily basis by store")

summary(sales_train_top_store_month$package_moved)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   102.0   221.5   275.0   301.4   358.5  1080.0

As box plot and summary stats suggest there are some outlier that we need to remove or either impute some max value

options(warn=-1)

sales_train_top_store_month$package_moved <- ifelse(sales_train_top_store_month$package_moved > 700 ,650,sales_train_top_store_month$package_moved)

ggplot2::ggplot(sales_train_top_store_month, ggplot2::aes(x=date1,y=package_moved,color=shop_id))+geom_point()+geom_line()+
                                                                                 theme_classic()+theme(panel.grid=element_blank())

Visualization look better as compared to the last as we have imputed the values for the outliers with mean +3*sd

options(warn=-1)
sales_train_top_store_month$index<- 1:1031   

reg<- lm(package_moved ~ index,data= sales_train_top_store_month)

summary(reg)

## 
## Call:
## lm(formula = package_moved ~ index, data = sales_train_top_store_month)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -192.54  -66.97  -20.58   43.50  426.71 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 384.83352    6.02257   63.90   <2e-16 ***
## index        -0.16765    0.01011  -16.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 96.62 on 1029 degrees of freedom
## Multiple R-squared:  0.2109, Adjusted R-squared:  0.2101 
## F-statistic:   275 on 1 and 1029 DF,  p-value: < 2.2e-16

##The fitness of the model is very low as Adjusted R-sq is 21% so it is not a good fit for time series data.