We have daily historic data of the sales from January 2013 to October 2015..
For scope this project we are interested o forecast the total amount of products sold in every shop for the test set.
Source: Kaggle.com
Following Data sets were used:
Sales_train - The training set. Daily historical data from January 2013 to October 2015.
test - the test set. You need to forecast the sales for these shops and products for November 2015.
items - supplemental information about the items/products.
item_categories - supplemental information about the items categories
shops - supplemental information about the shops.
Following packages were used:
tidyverse - Which will provide us functionality to model, transform, and visualize data.
ggplot2 - Used for plotting charts
plotly - For web-based graphs via the open source JavaScript graphing library plotly.js for interactive charts
corrplot - For displaying correlation matrices and confidence intervals
factoextra - To visualize the output of multivariate data analysis
funModeling - Exploratory Data Analysis and Data Preparation Tool-Box
RColorBrewer - To help you choose sensible colour schemes for figures in R
Lubridate - It is a package that eases working with Date and Time datatypes
Knitr - It enables the integration of R code into R markdown and in our case we used it to display the variables in a neat scrollable tabular format.
DT - Data objects in R can be rendered as HTML by importing this package.
cowplot - For providing addition functionalities to ggplot.
vtable - To print the summary statistics of the data
library(tidyverse)
library(ggplot2)
library(plotly)
library(corrplot)
library(factoextra)
library(knitr)
library(RColorBrewer)
library(funModeling)
library(knitr)
library(lubridate)
library(DT)
library(cowplot)
library(vtable)
Description of Attributes. The attributes are as follows
ID: an Id that represents a (Shop, Item) tuple within the test set
shop_id: unique identifier of a shop
item_id: unique identifier of a product
item_category_id: unique identifier of item category
item_cnt_day: number of products sold. We are predicting a monthly amount of this measure
item_price: current price of an item.
date: date in format dd/mm/yyyy
date_block_num: a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,…, October 2015 is 33
item_name : name of item
shop_name: Name of the shop
item_category_name : Name of the category
This sections contains all the procedures followed in preparing the data for analysis. Each step has been explained with code for those steps.
The dataset used for this project is sales prediction dataset from Kaggle provided in the course curriculum
#### Reading Data
Item_cat <- read_csv('/Users/satyamanikantakota/Desktop/BANA - SEM2/Time Series/competitive-data-science-predict-future-sales/item_categories.csv')
Item <- read_csv('/Users/satyamanikantakota/Desktop/BANA - SEM2/Time Series/competitive-data-science-predict-future-sales/items.csv')
Sales <- read_csv('/Users/satyamanikantakota/Desktop/BANA - SEM2/Time Series/competitive-data-science-predict-future-sales/sales_train.csv')
Shops <- read_csv('/Users/satyamanikantakota/Desktop/BANA - SEM2/Time Series/competitive-data-science-predict-future-sales/shops.csv')
Test_set <- read_csv('/Users/satyamanikantakota/Desktop/BANA - SEM2/Time Series/competitive-data-science-predict-future-sales/test.csv')
### Checking dimension of Data
dim(Item_cat)
## [1] 84 2
dim(Item)
## [1] 22170 3
dim(Sales)
## [1] 2935849 6
dim(Shops)
## [1] 60 2
dim(Test_set)
## [1] 214200 3
The main data set sales has 2,935,849 observations and with 6 columns
The dataset consists of the following variables:
#### Checking column name
names(Sales)
## [1] "date" "date_block_num" "shop_id" "item_id"
## [5] "item_price" "item_cnt_day"
Step 1: Handling Missing and Empty Values
#### Counting NA values in every column
colSums(is.na(Sales))
## date date_block_num shop_id item_id item_price
## 0 0 0 0 0
## item_cnt_day
## 0
Step 2: Checking the structure and changing datatypes of certain variables
#### checking Structure of the data
str(Sales)
## spec_tbl_df [2,935,849 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : chr [1:2935849] "02.01.2013" "03.01.2013" "05.01.2013" "06.01.2013" ...
## $ date_block_num: num [1:2935849] 0 0 0 0 0 0 0 0 0 0 ...
## $ shop_id : num [1:2935849] 59 25 25 25 25 25 25 25 25 25 ...
## $ item_id : num [1:2935849] 22154 2552 2552 2554 2555 ...
## $ item_price : num [1:2935849] 999 899 899 1709 1099 ...
## $ item_cnt_day : num [1:2935849] 1 1 -1 1 1 1 1 1 1 3 ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_character(),
## .. date_block_num = col_double(),
## .. shop_id = col_double(),
## .. item_id = col_double(),
## .. item_price = col_double(),
## .. item_cnt_day = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
#### checking Summary of the data
summary(Sales)
## date date_block_num shop_id item_id
## Length:2935849 Min. : 0.00 Min. : 0 Min. : 0
## Class :character 1st Qu.: 7.00 1st Qu.:22 1st Qu.: 4476
## Mode :character Median :14.00 Median :31 Median : 9343
## Mean :14.57 Mean :33 Mean :10197
## 3rd Qu.:23.00 3rd Qu.:47 3rd Qu.:15684
## Max. :33.00 Max. :59 Max. :22169
## item_price item_cnt_day
## Min. : -1.0 Min. : -22.000
## 1st Qu.: 249.0 1st Qu.: 1.000
## Median : 399.0 Median : 1.000
## Mean : 890.9 Mean : 1.243
## 3rd Qu.: 999.0 3rd Qu.: 1.000
## Max. :307980.0 Max. :2169.000
#### Changing datatype of some columns
Sales<-Sales %>%
mutate(date=as.Date(paste(substr(Sales$date,4,5),substr(Sales$date,1,2),substr(Sales$date,7,11),sep="/"),"%m/%d/%Y"))
Step 3: Removing Duplicate
#### removing duplicated data
Sales <- Sales[!duplicated(Sales),]
dim(Sales)
## [1] 2935843 6
Sales$month <- lubridate::month(Sales$date)
Sales$day <- lubridate::day(Sales$date)
Sales$quarter <- lubridate::quarter(Sales$date)
Sales$day_w <- lubridate::wday(Sales$date)
Sales$year <- lubridate::year(Sales$date)
Sales$year_mon <- substr(Sales$date,1,7)
A preview of the clean dataset is given below:
### displaying top 100 rows
output_data <- head(Sales, n = 10)
datatable(output_data, filter = 'top', options = list(pageLength = 25))
Salesbyyearmonth <- Sales %>% group_by(year_mon) %>% summarise(Total_Sales=sum(item_cnt_day,na.rm=TRUE))
Salesbyyear <- Sales %>% group_by(year) %>% summarise(Total_Sales=sum(item_cnt_day,na.rm=TRUE))
Salesbyyearmonth%>%ggplot(aes(year_mon,Total_Sales))+geom_col(fill="lightblue")+theme_classic()+scale_x_discrete(guide = guide_axis(n.dodge=3))+ scale_y_continuous(guide = guide_axis(check.overlap = TRUE))+
geom_text(aes(label=Total_Sales),position=position_stack(vjust=0.5),size=3)+ggtitle("Total sale by Year and month")
Salesbyyear%>%ggplot(aes(year,Total_Sales))+geom_col(fill="lightblue")+theme_classic()+scale_x_discrete(guide = guide_axis(n.dodge=3))+ scale_y_continuous(guide = guide_axis(check.overlap = TRUE))+
geom_text(aes(label=Total_Sales),position=position_stack(vjust=0.5),size=3)+ggtitle("Total sales amount value by Year")
item_id <-Sales%>%group_by(item_id)%>%summarise(sold=sum(item_cnt_day,na.rm=TRUE))%>%arrange(desc(sold))
item_id%>%head(50) %>% ggplot(aes(x=reorder(item_id,sold),y=sold,fill=item_id))+
geom_col()+theme_classic()+coord_flip()+
ggtitle("Top50 Items by Units Sold")
item_cat <-Sales%>%group_by(shop_id)%>%summarise(sold=sum(item_cnt_day,na.rm=TRUE))%>%arrange(desc(sold))
item_cat%>%head(50)%>%ggplot(aes(x=reorder(shop_id,sold),y=sold,fill=shop_id))+
geom_col()+ coord_flip()+theme_classic()+
ggtitle("Top 50 Shops by Units")
item_shop <-Sales%>%group_by(shop_id,item_id)%>%summarise(sold_qties=sum(item_cnt_day,na.rm=TRUE))%>%arrange(desc(sold_qties))
item_shop%>%head(25)%>%ggplot(aes(x=reorder(shop_id,sold_qties),y=sold_qties,fill=as.factor(item_id)))+
geom_col()+theme_classic()+coord_flip()+ggtitle("Most Popular Item by Shop by Unit")
sales_by_yearmonth <- Sales %>% group_by(year,month) %>% summarise(sold_val=sum(item_cnt_day,na.rm=TRUE)) %>% arrange(desc(sold_val))
sales_by_yearmonth%>%ggplot(aes(as.factor(month),sold_val,color=as.factor(year),group=as.factor(year)))+
geom_line()+theme_classic()+geom_text(aes(label=round(sold_val,0)),size=3)+
ggtitle("Sales by Month & year")
# Fit the model
lmsales<- Sales %>% group_by(year,month,shop_id,item_id) %>% summarise(total_sales=sum(item_cnt_day))
mod <- lm(total_sales ~ item_id + shop_id + year + month ,data=lmsales)
summary(mod)
##
## Call:
## lm(formula = total_sales ~ item_id + shop_id + year + month,
## data = lmsales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.53 -1.32 -1.12 -0.25 2250.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.250e+01 1.780e+01 2.388 0.0169 *
## item_id -1.531e-05 1.093e-06 -14.009 <2e-16 ***
## shop_id 9.251e-05 4.126e-04 0.224 0.8226
## year -2.003e-02 8.838e-03 -2.266 0.0234 *
## month 4.266e-02 1.989e-03 21.451 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.648 on 1609119 degrees of freedom
## Multiple R-squared: 0.0004223, Adjusted R-squared: 0.0004198
## F-statistic: 170 on 4 and 1609119 DF, p-value: < 2.2e-16
par(mfrow = c(1, 1)) # three rows and two columns
plot(mod, which = 1:6)
mod_broom <- lmsales %>%
lm(total_sales ~ item_id + shop_id + year + month , data = .) %>%
broom::augment() %>%
mutate(row_num = 1:n())
head(mod_broom)
## # A tibble: 6 × 12
## total_sales item_id shop_id year month .fitted .resid .hat .sigma .cooksd
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 32 0 2013 1 2.22 3.78 6.95e-6 8.65 2.65e- 7
## 2 3 33 0 2013 1 2.22 0.777 6.95e-6 8.65 1.12e- 8
## 3 1 35 0 2013 1 2.22 -1.22 6.95e-6 8.65 2.78e- 8
## 4 1 43 0 2013 1 2.22 -1.22 6.95e-6 8.65 2.78e- 8
## 5 2 51 0 2013 1 2.22 -0.223 6.95e-6 8.65 9.23e-10
## 6 1 61 0 2013 1 2.22 -1.22 6.94e-6 8.65 2.78e- 8
## # … with 2 more variables: .std.resid <dbl>, row_num <int>