TimeSeries and Forecasting Assignment#1

Introduction

Background

We have daily historic data of the sales from January 2013 to October 2015..

For scope this project we are interested o forecast the total amount of products sold in every shop for the test set.

Source: Kaggle.com

Data used for the analysis

Following Data sets were used:

Sales_train - The training set. Daily historical data from January 2013 to October 2015.
test - the test set. You need to forecast the sales for these shops and products for November 2015.
items - supplemental information about the items/products.
item_categories - supplemental information about the items categories
shops - supplemental information about the shops.

Packages Required

Following packages were used:

tidyverse - Which will provide us functionality to model, transform, and visualize data.
ggplot2 - Used for plotting charts
plotly - For web-based graphs via the open source JavaScript graphing library plotly.js for interactive charts
corrplot - For displaying correlation matrices and confidence intervals
factoextra - To visualize the output of multivariate data analysis
funModeling - Exploratory Data Analysis and Data Preparation Tool-Box
RColorBrewer - To help you choose sensible colour schemes for figures in R
Lubridate - It is a package that eases working with Date and Time datatypes
Knitr - It enables the integration of R code into R markdown and in our case we used it to display the variables in a neat scrollable tabular format.
DT - Data objects in R can be rendered as HTML by importing this package.
cowplot - For providing addition functionalities to ggplot.
vtable - To print the summary statistics of the data

library(tidyverse)
library(ggplot2)
library(plotly)
library(corrplot)
library(factoextra)
library(knitr)
library(RColorBrewer)
library(funModeling)
library(knitr)
library(lubridate)
library(DT)
library(cowplot)
library(vtable)

Data Dictionary

Description of Attributes. The attributes are as follows

ID: an Id that represents a (Shop, Item) tuple within the test set
shop_id: unique identifier of a shop
item_id: unique identifier of a product
item_category_id: unique identifier of item category
item_cnt_day: number of products sold. We are predicting a monthly amount of this measure
item_price: current price of an item.
date: date in format dd/mm/yyyy
date_block_num: a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,…, October 2015 is 33
item_name : name of item
shop_name: Name of the shop
item_category_name : Name of the category

Data Preparation

This sections contains all the procedures followed in preparing the data for analysis. Each step has been explained with code for those steps.

Data Source

The dataset used for this project is sales prediction dataset from Kaggle provided in the course curriculum

Data Loading

#### Reading Data 
Item_cat <- read_csv('/Users/satyamanikantakota/Desktop/BANA - SEM2/Time Series/competitive-data-science-predict-future-sales/item_categories.csv')

Item <- read_csv('/Users/satyamanikantakota/Desktop/BANA - SEM2/Time Series/competitive-data-science-predict-future-sales/items.csv')

Sales <- read_csv('/Users/satyamanikantakota/Desktop/BANA - SEM2/Time Series/competitive-data-science-predict-future-sales/sales_train.csv')

Shops <- read_csv('/Users/satyamanikantakota/Desktop/BANA - SEM2/Time Series/competitive-data-science-predict-future-sales/shops.csv')

Test_set <- read_csv('/Users/satyamanikantakota/Desktop/BANA - SEM2/Time Series/competitive-data-science-predict-future-sales/test.csv')

### Checking dimension of Data
dim(Item_cat)

## [1] 84  2

dim(Item)

## [1] 22170     3

dim(Sales)

## [1] 2935849       6

dim(Shops)

## [1] 60  2

dim(Test_set)

## [1] 214200      3

The main data set sales has 2,935,849 observations and with 6 columns

The dataset consists of the following variables:

#### Checking column name
names(Sales)

## [1] "date"           "date_block_num" "shop_id"        "item_id"       
## [5] "item_price"     "item_cnt_day"

Data Cleaning

Step 1: Handling Missing and Empty Values

#### Counting NA values in every column
colSums(is.na(Sales))

##           date date_block_num        shop_id        item_id     item_price 
##              0              0              0              0              0 
##   item_cnt_day 
##              0

Step 2: Checking the structure and changing datatypes of certain variables

#### checking Structure of the data 
str(Sales)

## spec_tbl_df [2,935,849 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date          : chr [1:2935849] "02.01.2013" "03.01.2013" "05.01.2013" "06.01.2013" ...
##  $ date_block_num: num [1:2935849] 0 0 0 0 0 0 0 0 0 0 ...
##  $ shop_id       : num [1:2935849] 59 25 25 25 25 25 25 25 25 25 ...
##  $ item_id       : num [1:2935849] 22154 2552 2552 2554 2555 ...
##  $ item_price    : num [1:2935849] 999 899 899 1709 1099 ...
##  $ item_cnt_day  : num [1:2935849] 1 1 -1 1 1 1 1 1 1 3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date = col_character(),
##   ..   date_block_num = col_double(),
##   ..   shop_id = col_double(),
##   ..   item_id = col_double(),
##   ..   item_price = col_double(),
##   ..   item_cnt_day = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

#### checking Summary of the data 
summary(Sales)

##      date           date_block_num     shop_id      item_id     
##  Length:2935849     Min.   : 0.00   Min.   : 0   Min.   :    0  
##  Class :character   1st Qu.: 7.00   1st Qu.:22   1st Qu.: 4476  
##  Mode  :character   Median :14.00   Median :31   Median : 9343  
##                     Mean   :14.57   Mean   :33   Mean   :10197  
##                     3rd Qu.:23.00   3rd Qu.:47   3rd Qu.:15684  
##                     Max.   :33.00   Max.   :59   Max.   :22169  
##    item_price        item_cnt_day     
##  Min.   :    -1.0   Min.   : -22.000  
##  1st Qu.:   249.0   1st Qu.:   1.000  
##  Median :   399.0   Median :   1.000  
##  Mean   :   890.9   Mean   :   1.243  
##  3rd Qu.:   999.0   3rd Qu.:   1.000  
##  Max.   :307980.0   Max.   :2169.000

#### Changing datatype of some columns
Sales<-Sales %>%
  mutate(date=as.Date(paste(substr(Sales$date,4,5),substr(Sales$date,1,2),substr(Sales$date,7,11),sep="/"),"%m/%d/%Y"))

Step 3: Removing Duplicate

#### removing duplicated data 
Sales <- Sales[!duplicated(Sales),]
dim(Sales)

## [1] 2935843       6

Sales$month <- lubridate::month(Sales$date)
Sales$day <- lubridate::day(Sales$date)
Sales$quarter <- lubridate::quarter(Sales$date)
Sales$day_w <- lubridate::wday(Sales$date)
Sales$year <- lubridate::year(Sales$date)
Sales$year_mon <- substr(Sales$date,1,7)

Data Preview

A preview of the clean dataset is given below:

### displaying top 100 rows
output_data <- head(Sales, n = 10)
datatable(output_data, filter = 'top', options = list(pageLength = 25))

Exploratory Data Analysis

Salesbyyearmonth <- Sales %>% group_by(year_mon) %>% summarise(Total_Sales=sum(item_cnt_day,na.rm=TRUE))
Salesbyyear <- Sales %>% group_by(year) %>% summarise(Total_Sales=sum(item_cnt_day,na.rm=TRUE))

Salesbyyearmonth%>%ggplot(aes(year_mon,Total_Sales))+geom_col(fill="lightblue")+theme_classic()+scale_x_discrete(guide = guide_axis(n.dodge=3))+ scale_y_continuous(guide = guide_axis(check.overlap = TRUE))+ 
  geom_text(aes(label=Total_Sales),position=position_stack(vjust=0.5),size=3)+ggtitle("Total sale by Year and month")

Salesbyyear%>%ggplot(aes(year,Total_Sales))+geom_col(fill="lightblue")+theme_classic()+scale_x_discrete(guide = guide_axis(n.dodge=3))+ scale_y_continuous(guide = guide_axis(check.overlap = TRUE))+ 
  geom_text(aes(label=Total_Sales),position=position_stack(vjust=0.5),size=3)+ggtitle("Total sales amount value by Year")

item_id <-Sales%>%group_by(item_id)%>%summarise(sold=sum(item_cnt_day,na.rm=TRUE))%>%arrange(desc(sold))
item_id%>%head(50) %>% ggplot(aes(x=reorder(item_id,sold),y=sold,fill=item_id))+
  geom_col()+theme_classic()+coord_flip()+
  ggtitle("Top50 Items by Units Sold")

item_cat <-Sales%>%group_by(shop_id)%>%summarise(sold=sum(item_cnt_day,na.rm=TRUE))%>%arrange(desc(sold))
item_cat%>%head(50)%>%ggplot(aes(x=reorder(shop_id,sold),y=sold,fill=shop_id))+
  geom_col()+ coord_flip()+theme_classic()+
  ggtitle("Top 50 Shops by Units")

item_shop <-Sales%>%group_by(shop_id,item_id)%>%summarise(sold_qties=sum(item_cnt_day,na.rm=TRUE))%>%arrange(desc(sold_qties))
item_shop%>%head(25)%>%ggplot(aes(x=reorder(shop_id,sold_qties),y=sold_qties,fill=as.factor(item_id)))+
  geom_col()+theme_classic()+coord_flip()+ggtitle("Most Popular Item by Shop by Unit")

sales_by_yearmonth <- Sales %>% group_by(year,month) %>% summarise(sold_val=sum(item_cnt_day,na.rm=TRUE)) %>% arrange(desc(sold_val))

sales_by_yearmonth%>%ggplot(aes(as.factor(month),sold_val,color=as.factor(year),group=as.factor(year)))+
  geom_line()+theme_classic()+geom_text(aes(label=round(sold_val,0)),size=3)+
  ggtitle("Sales by Month & year")

Linear Model

# Fit the model

lmsales<- Sales %>% group_by(year,month,shop_id,item_id) %>% summarise(total_sales=sum(item_cnt_day)) 
mod <- lm(total_sales ~ item_id + shop_id + year + month ,data=lmsales)
summary(mod)

## 
## Call:
## lm(formula = total_sales ~ item_id + shop_id + year + month, 
##     data = lmsales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -24.53   -1.32   -1.12   -0.25 2250.61 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.250e+01  1.780e+01   2.388   0.0169 *  
## item_id     -1.531e-05  1.093e-06 -14.009   <2e-16 ***
## shop_id      9.251e-05  4.126e-04   0.224   0.8226    
## year        -2.003e-02  8.838e-03  -2.266   0.0234 *  
## month        4.266e-02  1.989e-03  21.451   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.648 on 1609119 degrees of freedom
## Multiple R-squared:  0.0004223,  Adjusted R-squared:  0.0004198 
## F-statistic:   170 on 4 and 1609119 DF,  p-value: < 2.2e-16

par(mfrow = c(1, 1))  # three rows and two columns
plot(mod, which = 1:6)

mod_broom <-  lmsales %>%
  lm(total_sales ~ item_id + shop_id + year + month , data = .) %>%
  broom::augment() %>%
  mutate(row_num = 1:n()) 
head(mod_broom)

## # A tibble: 6 × 12
##   total_sales item_id shop_id  year month .fitted .resid    .hat .sigma  .cooksd
##         <dbl>   <dbl>   <dbl> <dbl> <dbl>   <dbl>  <dbl>   <dbl>  <dbl>    <dbl>
## 1           6      32       0  2013     1    2.22  3.78  6.95e-6   8.65 2.65e- 7
## 2           3      33       0  2013     1    2.22  0.777 6.95e-6   8.65 1.12e- 8
## 3           1      35       0  2013     1    2.22 -1.22  6.95e-6   8.65 2.78e- 8
## 4           1      43       0  2013     1    2.22 -1.22  6.95e-6   8.65 2.78e- 8
## 5           2      51       0  2013     1    2.22 -0.223 6.95e-6   8.65 9.23e-10
## 6           1      61       0  2013     1    2.22 -1.22  6.94e-6   8.65 2.78e- 8
## # … with 2 more variables: .std.resid <dbl>, row_num <int>