Assignment 1

Section 1

Description of the dataset and its source

I want to choose the Store-sales dataset for my Forecasting and Time Series assignments and the final project. This dataset was published as a part of the final project for the How to win a Data Science Competition course on Coursera. This dataset consists of the daily sales data of the Russian Software firm 1C Company. I want to work on this dataset as I have prior work experience on sales data of a manufacturing company. The similar domain knowledge can ease my understanding.

The dataset has daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. The list of shops and products slightly changes every month. The variation in these variables drives the variation in sales. Creating a robust model that can handle such variations is a challenging task.

data <- read.csv("C:/UC MSBANA/Forecasting/competitive-data-science-predict-future-sales/sales_train.csv")
data$date <- as.Date(data$date, "%d.%m.%Y")
head(data)

##         date date_block_num shop_id item_id item_price item_cnt_day
## 1 2013-01-02              0      59   22154     999.00            1
## 2 2013-01-03              0      25    2552     899.00            1
## 3 2013-01-05              0      25    2552     899.00           -1
## 4 2013-01-06              0      25    2554    1709.05            1
## 5 2013-01-15              0      25    2555    1099.00            1
## 6 2013-01-10              0      25    2564     349.00            1

summary(data)

##       date            date_block_num     shop_id      item_id     
##  Min.   :2013-01-01   Min.   : 0.00   Min.   : 0   Min.   :    0  
##  1st Qu.:2013-08-01   1st Qu.: 7.00   1st Qu.:22   1st Qu.: 4476  
##  Median :2014-03-04   Median :14.00   Median :31   Median : 9343  
##  Mean   :2014-04-03   Mean   :14.57   Mean   :33   Mean   :10197  
##  3rd Qu.:2014-12-05   3rd Qu.:23.00   3rd Qu.:47   3rd Qu.:15684  
##  Max.   :2015-10-31   Max.   :33.00   Max.   :59   Max.   :22169  
##    item_price        item_cnt_day     
##  Min.   :    -1.0   Min.   : -22.000  
##  1st Qu.:   249.0   1st Qu.:   1.000  
##  Median :   399.0   Median :   1.000  
##  Mean   :   890.9   Mean   :   1.243  
##  3rd Qu.:   999.0   3rd Qu.:   1.000  
##  Max.   :307980.0   Max.   :2169.000

Data Wrangling

library("stringr")
library("dplyr")
date_split<- data.frame(do.call("rbind", strsplit(as.character(data$date), "-", fixed = TRUE)))

data$month = date_split$X2
data$year = date_split$X1
head(data)

##         date date_block_num shop_id item_id item_price item_cnt_day month year
## 1 2013-01-02              0      59   22154     999.00            1    01 2013
## 2 2013-01-03              0      25    2552     899.00            1    01 2013
## 3 2013-01-05              0      25    2552     899.00           -1    01 2013
## 4 2013-01-06              0      25    2554    1709.05            1    01 2013
## 5 2013-01-15              0      25    2555    1099.00            1    01 2013
## 6 2013-01-10              0      25    2564     349.00            1    01 2013

Calculating Sales column from price of the item and quantity sold, and displaying the first 5 mean sales values per month.

data$Sales = data$item_cnt_day * data$item_price
data %>% 
  group_by(date_block_num)  %>% 
  summarize(mean_sales = round(mean(Sales))) %>% 
  ungroup() %>% 
  arrange(date_block_num) %>% head(5)

## # A tibble: 5 x 2
##   date_block_num mean_sales
##            <int>      <dbl>
## 1              0        795
## 2              1        835
## 3              2        865
## 4              3        735
## 5              4        709

Section 2

The below plot shows the Mean sales(in dollars) for all items for the years 2013 to 2015.

library(tidyverse)
data %>% 
  group_by(year) %>% 
  summarize(mean_sales = mean(Sales)) %>% 
  ungroup() %>% 
  ggplot(aes(x = year, y = mean_sales, fill = year)) +
   geom_bar(stat = "identity") + geom_text(
    aes(label = round(mean_sales)),
    position = position_dodge(width = 1),
    hjust = 0.5,
    vjust = 2.5,
    size = 3
  )+
  theme_bw()+
  xlab("Year") + 
  ylab("Total Sales across the years in dollars") + 
  ggtitle("Mean Sales from 2013 to 2015")+
  scale_fill_discrete(name='Year')

The below plot shows the Mean sales(in dollars) across all shops and products for months ranging from Jan 2013 to Oct 2015.
From the plot we note that, on an average, sales have increased over time from Jan 2013 to October 2015.

data %>% 
  group_by(date_block_num) %>% 
  summarize(mean_Sales = mean(Sales)) %>% 
  ungroup() %>% 
  ggplot() + 
  geom_line(aes(date_block_num, mean_Sales, group = 1, col= "red"))+
  theme_bw()+
  xlab("Month Number") + 
  ylab("Mean Sales per month") + 
  ggtitle("Mean Sales across all shops and products over time")

From the below line plot, we note that the total products sold decreased from months 0 to 33.

data %>% 
  group_by(date_block_num) %>% 
  summarize(monthly_count = sum(item_cnt_day)) %>% 
  ungroup() %>% 
  ggplot() + 
  geom_line(aes(date_block_num, monthly_count, group = 1, col= "red"))+
  theme_bw()+
  xlab("Month Number") + 
  ylab("Total products sold per month") + 
  ggtitle("Total products sold across all shops over time")

The below table represents the Mean Sales for all items sold across the years 2013, 2014, 2015.

library(kableExtra)
data %>% 
  group_by(year) %>% 
  summarize(mean_Sales = mean(Sales)) %>% 
  ungroup() %>% 
  rename(
    Year = year,
    `Mean Sales per year (in dollars)`=mean_Sales
  ) %>% 
  kbl(format='html') %>% 
  kable_paper("hover", full_width = T)

Year	Mean Sales per year (in dollars)
2013	960.5248
2014	1275.5263
2015	1362.8147

Box Plots

boxplot(data$item_price, ylab = "Item Price", main = "Box Plot for Item Price")

boxplot(data$item_cnt_day, ylab = "Item Count per day", main = "Box Plot for Number of products sold each day")

Generating Summary statistics for time-series data

library(vtable)
new_data <- within(data, rm(year))
st(new_data)

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
date_block_num	2935849	14.57	9.423	0	7	23	33
shop_id	2935849	33.002	16.227	0	22	47	59
item_id	2935849	10197.227	6324.297	0	4476	15684	22169
item_price	2935849	890.853	1729.8	-1	249	999	307980
item_cnt_day	2935849	1.243	2.619	-22	1	1	2169
Sales	2935849	1157.732	5683.604	-68970	249	1078.2	1829990

Section 3

Initial Data Analysis

We can generate the summary statistics for the dataset.

data$month <- as.numeric(data$month)
data$year <- as.numeric(data$year)
str(data)

## 'data.frame':    2935849 obs. of  9 variables:
##  $ date          : Date, format: "2013-01-02" "2013-01-03" ...
##  $ date_block_num: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ shop_id       : int  59 25 25 25 25 25 25 25 25 25 ...
##  $ item_id       : int  22154 2552 2552 2554 2555 2564 2565 2572 2572 2573 ...
##  $ item_price    : num  999 899 899 1709 1099 ...
##  $ item_cnt_day  : num  1 1 -1 1 1 1 1 1 1 3 ...
##  $ month         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ year          : num  2013 2013 2013 2013 2013 ...
##  $ Sales         : num  999 899 -899 1709 1099 ...

summary(data)

##       date            date_block_num     shop_id      item_id     
##  Min.   :2013-01-01   Min.   : 0.00   Min.   : 0   Min.   :    0  
##  1st Qu.:2013-08-01   1st Qu.: 7.00   1st Qu.:22   1st Qu.: 4476  
##  Median :2014-03-04   Median :14.00   Median :31   Median : 9343  
##  Mean   :2014-04-03   Mean   :14.57   Mean   :33   Mean   :10197  
##  3rd Qu.:2014-12-05   3rd Qu.:23.00   3rd Qu.:47   3rd Qu.:15684  
##  Max.   :2015-10-31   Max.   :33.00   Max.   :59   Max.   :22169  
##    item_price        item_cnt_day          month             year     
##  Min.   :    -1.0   Min.   : -22.000   Min.   : 1.000   Min.   :2013  
##  1st Qu.:   249.0   1st Qu.:   1.000   1st Qu.: 3.000   1st Qu.:2013  
##  Median :   399.0   Median :   1.000   Median : 6.000   Median :2014  
##  Mean   :   890.9   Mean   :   1.243   Mean   : 6.248   Mean   :2014  
##  3rd Qu.:   999.0   3rd Qu.:   1.000   3rd Qu.: 9.000   3rd Qu.:2014  
##  Max.   :307980.0   Max.   :2169.000   Max.   :12.000   Max.   :2015  
##      Sales        
##  Min.   : -68970  
##  1st Qu.:    249  
##  Median :    449  
##  Mean   :   1158  
##  3rd Qu.:   1078  
##  Max.   :1829990

Outliers

item_price has minimum value of -1 which is clearly an outlier, since price of an item can not be negative.
item_cnt_day has negative value which is another outlier because the quantity of items sold can never be negative.
Sales column has negative value because the sales value is calculated by multiplying the item_price and item_cnt_day which are also negative values.

Missing Values

colSums(is.na(data))

##           date date_block_num        shop_id        item_id     item_price 
##              0              0              0              0              0 
##   item_cnt_day          month           year          Sales 
##              0              0              0              0

From the above output, we note that there are no missing values in any columns.

Correlation matrix

data2 <- within(data, rm(date))
cor(data2)

##                date_block_num      shop_id      item_id  item_price
## date_block_num    1.000000000  0.019273446  0.009355676  0.09501013
## shop_id           0.019273446  1.000000000  0.029396392 -0.02403405
## item_id           0.009355676  0.029396392  1.000000000 -0.13410404
## item_price        0.095010128 -0.024034051 -0.134104042  1.00000000
## item_cnt_day      0.009402007 -0.005229685  0.016650304  0.01119663
## month             0.243938535  0.020574098 -0.002033734  0.04404866
## year              0.928280414  0.011804588  0.010339705  0.08019235
## Sales             0.040056155 -0.013424049 -0.062941234  0.43372505
##                item_cnt_day        month         year       Sales
## date_block_num  0.009402007  0.243938535  0.928280414  0.04005616
## shop_id        -0.005229685  0.020574098  0.011804588 -0.01342405
## item_id         0.016650304 -0.002033734  0.010339705 -0.06294123
## item_price      0.011196625  0.044048656  0.080192346  0.43372505
## item_cnt_day    1.000000000  0.018013875  0.002699486  0.40795434
## month           0.018013875  1.000000000 -0.134203363  0.03035856
## year            0.002699486 -0.134203363  1.000000000  0.02928882
## Sales           0.407954342  0.030358560  0.029288823  1.00000000

From the above correlation matrix, we note that there is a positive correlation between Sales and item_price, item_cnt_day.
We try to plot the Scatterplots between Sales and Item Price, Sales and No. of Products sold per day to check if there is any relationship between the variables.

Scatterplots

plot(data$Sales, data$item_price, xlab = "Item Price", ylab = "Sales", main = "Sales vs Item Price")

plot(data$Sales, data$item_cnt_day, xlab = "No. of products sold per day", ylab = "Sales", main = "Sales vs No of Products sold")

From the above scatterplots, we can not identify any significant pattern between Sales and Item Price, No. of Products sold per day.

Building Linear Regression Model for Sales vs time(month number)

The task in this project is to predict the sales values of products over time.

library(sjPlot)
library(sjmisc)
library(sjlabelled)
lm01 = lm(Sales ~ date_block_num, data = data)
tab_model(lm01)

	Sales
Predictors	Estimates	CI	p
(Intercept)	805.72	793.75 – 817.68	<0.001
date block num	24.16	23.47 – 24.85	<0.001
Observations	2935849
R² / R² adjusted	0.002 / 0.002

From the above linear regression model, we observe that the p-value is very close to zero. So, there should be a statistically significant relationship between Sales and date_block_num.
From the summary statistics, adjusted R-Squared value is 0.002. It implies that 0.2% variation in Sales is explained by date_block_num.

Linear Regression Model for Time Series Data

One of the assumptions of Linear Regression models is that residuals are not correlated. When we have time-series data, there is a high chance of autocorrelated residuals. In that case, the linear regression model may not capture all the trends in the data.