Section 1

Description of the dataset and its source

I want to choose the Store-sales dataset for my Forecasting and Time Series assignments and the final project. This dataset was published as a part of the final project for the How to win a Data Science Competition course on Coursera. This dataset consists of the daily sales data of the Russian Software firm 1C Company. I want to work on this dataset as I have prior work experience on sales data of a manufacturing company. The similar domain knowledge can ease my understanding.

The dataset has daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. The list of shops and products slightly changes every month. The variation in these variables drives the variation in sales. Creating a robust model that can handle such variations is a challenging task.

data <- read.csv("C:/UC MSBANA/Forecasting/competitive-data-science-predict-future-sales/sales_train.csv")
data$date <- as.Date(data$date, "%d.%m.%Y")
head(data)
##         date date_block_num shop_id item_id item_price item_cnt_day
## 1 2013-01-02              0      59   22154     999.00            1
## 2 2013-01-03              0      25    2552     899.00            1
## 3 2013-01-05              0      25    2552     899.00           -1
## 4 2013-01-06              0      25    2554    1709.05            1
## 5 2013-01-15              0      25    2555    1099.00            1
## 6 2013-01-10              0      25    2564     349.00            1
summary(data)
##       date            date_block_num     shop_id      item_id     
##  Min.   :2013-01-01   Min.   : 0.00   Min.   : 0   Min.   :    0  
##  1st Qu.:2013-08-01   1st Qu.: 7.00   1st Qu.:22   1st Qu.: 4476  
##  Median :2014-03-04   Median :14.00   Median :31   Median : 9343  
##  Mean   :2014-04-03   Mean   :14.57   Mean   :33   Mean   :10197  
##  3rd Qu.:2014-12-05   3rd Qu.:23.00   3rd Qu.:47   3rd Qu.:15684  
##  Max.   :2015-10-31   Max.   :33.00   Max.   :59   Max.   :22169  
##    item_price        item_cnt_day     
##  Min.   :    -1.0   Min.   : -22.000  
##  1st Qu.:   249.0   1st Qu.:   1.000  
##  Median :   399.0   Median :   1.000  
##  Mean   :   890.9   Mean   :   1.243  
##  3rd Qu.:   999.0   3rd Qu.:   1.000  
##  Max.   :307980.0   Max.   :2169.000

Data Wrangling

library("stringr")
library("dplyr")
date_split<- data.frame(do.call("rbind", strsplit(as.character(data$date), "-", fixed = TRUE)))

data$month = date_split$X2
data$year = date_split$X1
head(data)
##         date date_block_num shop_id item_id item_price item_cnt_day month year
## 1 2013-01-02              0      59   22154     999.00            1    01 2013
## 2 2013-01-03              0      25    2552     899.00            1    01 2013
## 3 2013-01-05              0      25    2552     899.00           -1    01 2013
## 4 2013-01-06              0      25    2554    1709.05            1    01 2013
## 5 2013-01-15              0      25    2555    1099.00            1    01 2013
## 6 2013-01-10              0      25    2564     349.00            1    01 2013
  • Calculating Sales column from price of the item and quantity sold, and displaying the first 5 mean sales values per month.
data$Sales = data$item_cnt_day * data$item_price
data %>% 
  group_by(date_block_num)  %>% 
  summarize(mean_sales = round(mean(Sales))) %>% 
  ungroup() %>% 
  arrange(date_block_num) %>% head(5) 
## # A tibble: 5 x 2
##   date_block_num mean_sales
##            <int>      <dbl>
## 1              0        795
## 2              1        835
## 3              2        865
## 4              3        735
## 5              4        709

Section 2

library(tidyverse)
data %>% 
  group_by(year) %>% 
  summarize(mean_sales = mean(Sales)) %>% 
  ungroup() %>% 
  ggplot(aes(x = year, y = mean_sales, fill = year)) +
   geom_bar(stat = "identity") + geom_text(
    aes(label = round(mean_sales)),
    position = position_dodge(width = 1),
    hjust = 0.5,
    vjust = 2.5,
    size = 3
  )+
  theme_bw()+
  xlab("Year") + 
  ylab("Total Sales across the years in dollars") + 
  ggtitle("Mean Sales from 2013 to 2015")+
  scale_fill_discrete(name='Year')

data %>% 
  group_by(date_block_num) %>% 
  summarize(mean_Sales = mean(Sales)) %>% 
  ungroup() %>% 
  ggplot() + 
  geom_line(aes(date_block_num, mean_Sales, group = 1, col= "red"))+
  theme_bw()+
  xlab("Month Number") + 
  ylab("Mean Sales per month") + 
  ggtitle("Mean Sales across all shops and products over time")

data %>% 
  group_by(date_block_num) %>% 
  summarize(monthly_count = sum(item_cnt_day)) %>% 
  ungroup() %>% 
  ggplot() + 
  geom_line(aes(date_block_num, monthly_count, group = 1, col= "red"))+
  theme_bw()+
  xlab("Month Number") + 
  ylab("Total products sold per month") + 
  ggtitle("Total products sold across all shops over time")

library(kableExtra)
data %>% 
  group_by(year) %>% 
  summarize(mean_Sales = mean(Sales)) %>% 
  ungroup() %>% 
  rename(
    Year = year,
    `Mean Sales per year (in dollars)`=mean_Sales
  ) %>% 
  kbl(format='html') %>% 
  kable_paper("hover", full_width = T)
Year Mean Sales per year (in dollars)
2013 960.5248
2014 1275.5263
2015 1362.8147

Box Plots

boxplot(data$item_price, ylab = "Item Price", main = "Box Plot for Item Price")

boxplot(data$item_cnt_day, ylab = "Item Count per day", main = "Box Plot for Number of products sold each day")

Generating Summary statistics for time-series data

library(vtable)
new_data <- within(data, rm(year))
st(new_data)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
date_block_num 2935849 14.57 9.423 0 7 23 33
shop_id 2935849 33.002 16.227 0 22 47 59
item_id 2935849 10197.227 6324.297 0 4476 15684 22169
item_price 2935849 890.853 1729.8 -1 249 999 307980
item_cnt_day 2935849 1.243 2.619 -22 1 1 2169
Sales 2935849 1157.732 5683.604 -68970 249 1078.2 1829990

Section 3

Initial Data Analysis

We can generate the summary statistics for the dataset.

data$month <- as.numeric(data$month)
data$year <- as.numeric(data$year)
str(data)
## 'data.frame':    2935849 obs. of  9 variables:
##  $ date          : Date, format: "2013-01-02" "2013-01-03" ...
##  $ date_block_num: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ shop_id       : int  59 25 25 25 25 25 25 25 25 25 ...
##  $ item_id       : int  22154 2552 2552 2554 2555 2564 2565 2572 2572 2573 ...
##  $ item_price    : num  999 899 899 1709 1099 ...
##  $ item_cnt_day  : num  1 1 -1 1 1 1 1 1 1 3 ...
##  $ month         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ year          : num  2013 2013 2013 2013 2013 ...
##  $ Sales         : num  999 899 -899 1709 1099 ...
summary(data)
##       date            date_block_num     shop_id      item_id     
##  Min.   :2013-01-01   Min.   : 0.00   Min.   : 0   Min.   :    0  
##  1st Qu.:2013-08-01   1st Qu.: 7.00   1st Qu.:22   1st Qu.: 4476  
##  Median :2014-03-04   Median :14.00   Median :31   Median : 9343  
##  Mean   :2014-04-03   Mean   :14.57   Mean   :33   Mean   :10197  
##  3rd Qu.:2014-12-05   3rd Qu.:23.00   3rd Qu.:47   3rd Qu.:15684  
##  Max.   :2015-10-31   Max.   :33.00   Max.   :59   Max.   :22169  
##    item_price        item_cnt_day          month             year     
##  Min.   :    -1.0   Min.   : -22.000   Min.   : 1.000   Min.   :2013  
##  1st Qu.:   249.0   1st Qu.:   1.000   1st Qu.: 3.000   1st Qu.:2013  
##  Median :   399.0   Median :   1.000   Median : 6.000   Median :2014  
##  Mean   :   890.9   Mean   :   1.243   Mean   : 6.248   Mean   :2014  
##  3rd Qu.:   999.0   3rd Qu.:   1.000   3rd Qu.: 9.000   3rd Qu.:2014  
##  Max.   :307980.0   Max.   :2169.000   Max.   :12.000   Max.   :2015  
##      Sales        
##  Min.   : -68970  
##  1st Qu.:    249  
##  Median :    449  
##  Mean   :   1158  
##  3rd Qu.:   1078  
##  Max.   :1829990

Outliers

  • item_price has minimum value of -1 which is clearly an outlier, since price of an item can not be negative.
  • item_cnt_day has negative value which is another outlier because the quantity of items sold can never be negative.
  • Sales column has negative value because the sales value is calculated by multiplying the item_price and item_cnt_day which are also negative values.

Missing Values

colSums(is.na(data))
##           date date_block_num        shop_id        item_id     item_price 
##              0              0              0              0              0 
##   item_cnt_day          month           year          Sales 
##              0              0              0              0

From the above output, we note that there are no missing values in any columns.

Correlation matrix

data2 <- within(data, rm(date))
cor(data2)
##                date_block_num      shop_id      item_id  item_price
## date_block_num    1.000000000  0.019273446  0.009355676  0.09501013
## shop_id           0.019273446  1.000000000  0.029396392 -0.02403405
## item_id           0.009355676  0.029396392  1.000000000 -0.13410404
## item_price        0.095010128 -0.024034051 -0.134104042  1.00000000
## item_cnt_day      0.009402007 -0.005229685  0.016650304  0.01119663
## month             0.243938535  0.020574098 -0.002033734  0.04404866
## year              0.928280414  0.011804588  0.010339705  0.08019235
## Sales             0.040056155 -0.013424049 -0.062941234  0.43372505
##                item_cnt_day        month         year       Sales
## date_block_num  0.009402007  0.243938535  0.928280414  0.04005616
## shop_id        -0.005229685  0.020574098  0.011804588 -0.01342405
## item_id         0.016650304 -0.002033734  0.010339705 -0.06294123
## item_price      0.011196625  0.044048656  0.080192346  0.43372505
## item_cnt_day    1.000000000  0.018013875  0.002699486  0.40795434
## month           0.018013875  1.000000000 -0.134203363  0.03035856
## year            0.002699486 -0.134203363  1.000000000  0.02928882
## Sales           0.407954342  0.030358560  0.029288823  1.00000000
  • From the above correlation matrix, we note that there is a positive correlation between Sales and item_price, item_cnt_day.
  • We try to plot the Scatterplots between Sales and Item Price, Sales and No. of Products sold per day to check if there is any relationship between the variables.

Scatterplots

plot(data$Sales, data$item_price, xlab = "Item Price", ylab = "Sales", main = "Sales vs Item Price")

plot(data$Sales, data$item_cnt_day, xlab = "No. of products sold per day", ylab = "Sales", main = "Sales vs No of Products sold")

  • From the above scatterplots, we can not identify any significant pattern between Sales and Item Price, No. of Products sold per day.

Building Linear Regression Model for Sales vs time(month number)

The task in this project is to predict the sales values of products over time.

library(sjPlot)
library(sjmisc)
library(sjlabelled)
lm01 = lm(Sales ~ date_block_num, data = data)
tab_model(lm01)
  Sales
Predictors Estimates CI p
(Intercept) 805.72 793.75 – 817.68 <0.001
date block num 24.16 23.47 – 24.85 <0.001
Observations 2935849
R2 / R2 adjusted 0.002 / 0.002
  • From the above linear regression model, we observe that the p-value is very close to zero. So, there should be a statistically significant relationship between Sales and date_block_num.

  • From the summary statistics, adjusted R-Squared value is 0.002. It implies that 0.2% variation in Sales is explained by date_block_num.

Linear Regression Model for Time Series Data

  • One of the assumptions of Linear Regression models is that residuals are not correlated. When we have time-series data, there is a high chance of autocorrelated residuals. In that case, the linear regression model may not capture all the trends in the data.