I want to choose the Store-sales dataset for my Forecasting and Time Series assignments and the final project. This dataset was published as a part of the final project for the How to win a Data Science Competition course on Coursera. This dataset consists of the daily sales data of the Russian Software firm 1C Company. I want to work on this dataset as I have prior work experience on sales data of a manufacturing company. The similar domain knowledge can ease my understanding.
The dataset has daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. The list of shops and products slightly changes every month. The variation in these variables drives the variation in sales. Creating a robust model that can handle such variations is a challenging task.
data <- read.csv("C:/UC MSBANA/Forecasting/competitive-data-science-predict-future-sales/sales_train.csv")
data$date <- as.Date(data$date, "%d.%m.%Y")
head(data)
## date date_block_num shop_id item_id item_price item_cnt_day
## 1 2013-01-02 0 59 22154 999.00 1
## 2 2013-01-03 0 25 2552 899.00 1
## 3 2013-01-05 0 25 2552 899.00 -1
## 4 2013-01-06 0 25 2554 1709.05 1
## 5 2013-01-15 0 25 2555 1099.00 1
## 6 2013-01-10 0 25 2564 349.00 1
summary(data)
## date date_block_num shop_id item_id
## Min. :2013-01-01 Min. : 0.00 Min. : 0 Min. : 0
## 1st Qu.:2013-08-01 1st Qu.: 7.00 1st Qu.:22 1st Qu.: 4476
## Median :2014-03-04 Median :14.00 Median :31 Median : 9343
## Mean :2014-04-03 Mean :14.57 Mean :33 Mean :10197
## 3rd Qu.:2014-12-05 3rd Qu.:23.00 3rd Qu.:47 3rd Qu.:15684
## Max. :2015-10-31 Max. :33.00 Max. :59 Max. :22169
## item_price item_cnt_day
## Min. : -1.0 Min. : -22.000
## 1st Qu.: 249.0 1st Qu.: 1.000
## Median : 399.0 Median : 1.000
## Mean : 890.9 Mean : 1.243
## 3rd Qu.: 999.0 3rd Qu.: 1.000
## Max. :307980.0 Max. :2169.000
library("stringr")
library("dplyr")
date_split<- data.frame(do.call("rbind", strsplit(as.character(data$date), "-", fixed = TRUE)))
data$month = date_split$X2
data$year = date_split$X1
head(data)
## date date_block_num shop_id item_id item_price item_cnt_day month year
## 1 2013-01-02 0 59 22154 999.00 1 01 2013
## 2 2013-01-03 0 25 2552 899.00 1 01 2013
## 3 2013-01-05 0 25 2552 899.00 -1 01 2013
## 4 2013-01-06 0 25 2554 1709.05 1 01 2013
## 5 2013-01-15 0 25 2555 1099.00 1 01 2013
## 6 2013-01-10 0 25 2564 349.00 1 01 2013
Sales column from price of the item and quantity sold, and displaying the first 5 mean sales values per month.data$Sales = data$item_cnt_day * data$item_price
data %>%
group_by(date_block_num) %>%
summarize(mean_sales = round(mean(Sales))) %>%
ungroup() %>%
arrange(date_block_num) %>% head(5)
## # A tibble: 5 x 2
## date_block_num mean_sales
## <int> <dbl>
## 1 0 795
## 2 1 835
## 3 2 865
## 4 3 735
## 5 4 709
library(tidyverse)
data %>%
group_by(year) %>%
summarize(mean_sales = mean(Sales)) %>%
ungroup() %>%
ggplot(aes(x = year, y = mean_sales, fill = year)) +
geom_bar(stat = "identity") + geom_text(
aes(label = round(mean_sales)),
position = position_dodge(width = 1),
hjust = 0.5,
vjust = 2.5,
size = 3
)+
theme_bw()+
xlab("Year") +
ylab("Total Sales across the years in dollars") +
ggtitle("Mean Sales from 2013 to 2015")+
scale_fill_discrete(name='Year')
The below plot shows the Mean sales(in dollars) across all shops and products for months ranging from Jan 2013 to Oct 2015.
From the plot we note that, on an average, sales have increased over time from Jan 2013 to October 2015.
data %>%
group_by(date_block_num) %>%
summarize(mean_Sales = mean(Sales)) %>%
ungroup() %>%
ggplot() +
geom_line(aes(date_block_num, mean_Sales, group = 1, col= "red"))+
theme_bw()+
xlab("Month Number") +
ylab("Mean Sales per month") +
ggtitle("Mean Sales across all shops and products over time")
data %>%
group_by(date_block_num) %>%
summarize(monthly_count = sum(item_cnt_day)) %>%
ungroup() %>%
ggplot() +
geom_line(aes(date_block_num, monthly_count, group = 1, col= "red"))+
theme_bw()+
xlab("Month Number") +
ylab("Total products sold per month") +
ggtitle("Total products sold across all shops over time")
library(kableExtra)
data %>%
group_by(year) %>%
summarize(mean_Sales = mean(Sales)) %>%
ungroup() %>%
rename(
Year = year,
`Mean Sales per year (in dollars)`=mean_Sales
) %>%
kbl(format='html') %>%
kable_paper("hover", full_width = T)
| Year | Mean Sales per year (in dollars) |
|---|---|
| 2013 | 960.5248 |
| 2014 | 1275.5263 |
| 2015 | 1362.8147 |
boxplot(data$item_price, ylab = "Item Price", main = "Box Plot for Item Price")
boxplot(data$item_cnt_day, ylab = "Item Count per day", main = "Box Plot for Number of products sold each day")
library(vtable)
new_data <- within(data, rm(year))
st(new_data)
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| date_block_num | 2935849 | 14.57 | 9.423 | 0 | 7 | 23 | 33 |
| shop_id | 2935849 | 33.002 | 16.227 | 0 | 22 | 47 | 59 |
| item_id | 2935849 | 10197.227 | 6324.297 | 0 | 4476 | 15684 | 22169 |
| item_price | 2935849 | 890.853 | 1729.8 | -1 | 249 | 999 | 307980 |
| item_cnt_day | 2935849 | 1.243 | 2.619 | -22 | 1 | 1 | 2169 |
| Sales | 2935849 | 1157.732 | 5683.604 | -68970 | 249 | 1078.2 | 1829990 |
We can generate the summary statistics for the dataset.
data$month <- as.numeric(data$month)
data$year <- as.numeric(data$year)
str(data)
## 'data.frame': 2935849 obs. of 9 variables:
## $ date : Date, format: "2013-01-02" "2013-01-03" ...
## $ date_block_num: int 0 0 0 0 0 0 0 0 0 0 ...
## $ shop_id : int 59 25 25 25 25 25 25 25 25 25 ...
## $ item_id : int 22154 2552 2552 2554 2555 2564 2565 2572 2572 2573 ...
## $ item_price : num 999 899 899 1709 1099 ...
## $ item_cnt_day : num 1 1 -1 1 1 1 1 1 1 3 ...
## $ month : num 1 1 1 1 1 1 1 1 1 1 ...
## $ year : num 2013 2013 2013 2013 2013 ...
## $ Sales : num 999 899 -899 1709 1099 ...
summary(data)
## date date_block_num shop_id item_id
## Min. :2013-01-01 Min. : 0.00 Min. : 0 Min. : 0
## 1st Qu.:2013-08-01 1st Qu.: 7.00 1st Qu.:22 1st Qu.: 4476
## Median :2014-03-04 Median :14.00 Median :31 Median : 9343
## Mean :2014-04-03 Mean :14.57 Mean :33 Mean :10197
## 3rd Qu.:2014-12-05 3rd Qu.:23.00 3rd Qu.:47 3rd Qu.:15684
## Max. :2015-10-31 Max. :33.00 Max. :59 Max. :22169
## item_price item_cnt_day month year
## Min. : -1.0 Min. : -22.000 Min. : 1.000 Min. :2013
## 1st Qu.: 249.0 1st Qu.: 1.000 1st Qu.: 3.000 1st Qu.:2013
## Median : 399.0 Median : 1.000 Median : 6.000 Median :2014
## Mean : 890.9 Mean : 1.243 Mean : 6.248 Mean :2014
## 3rd Qu.: 999.0 3rd Qu.: 1.000 3rd Qu.: 9.000 3rd Qu.:2014
## Max. :307980.0 Max. :2169.000 Max. :12.000 Max. :2015
## Sales
## Min. : -68970
## 1st Qu.: 249
## Median : 449
## Mean : 1158
## 3rd Qu.: 1078
## Max. :1829990
item_price has minimum value of -1 which is clearly an outlier, since price of an item can not be negative.item_cnt_day has negative value which is another outlier because the quantity of items sold can never be negative.Sales column has negative value because the sales value is calculated by multiplying the item_price and item_cnt_day which are also negative values.colSums(is.na(data))
## date date_block_num shop_id item_id item_price
## 0 0 0 0 0
## item_cnt_day month year Sales
## 0 0 0 0
From the above output, we note that there are no missing values in any columns.
data2 <- within(data, rm(date))
cor(data2)
## date_block_num shop_id item_id item_price
## date_block_num 1.000000000 0.019273446 0.009355676 0.09501013
## shop_id 0.019273446 1.000000000 0.029396392 -0.02403405
## item_id 0.009355676 0.029396392 1.000000000 -0.13410404
## item_price 0.095010128 -0.024034051 -0.134104042 1.00000000
## item_cnt_day 0.009402007 -0.005229685 0.016650304 0.01119663
## month 0.243938535 0.020574098 -0.002033734 0.04404866
## year 0.928280414 0.011804588 0.010339705 0.08019235
## Sales 0.040056155 -0.013424049 -0.062941234 0.43372505
## item_cnt_day month year Sales
## date_block_num 0.009402007 0.243938535 0.928280414 0.04005616
## shop_id -0.005229685 0.020574098 0.011804588 -0.01342405
## item_id 0.016650304 -0.002033734 0.010339705 -0.06294123
## item_price 0.011196625 0.044048656 0.080192346 0.43372505
## item_cnt_day 1.000000000 0.018013875 0.002699486 0.40795434
## month 0.018013875 1.000000000 -0.134203363 0.03035856
## year 0.002699486 -0.134203363 1.000000000 0.02928882
## Sales 0.407954342 0.030358560 0.029288823 1.00000000
Sales and item_price, item_cnt_day.plot(data$Sales, data$item_price, xlab = "Item Price", ylab = "Sales", main = "Sales vs Item Price")
plot(data$Sales, data$item_cnt_day, xlab = "No. of products sold per day", ylab = "Sales", main = "Sales vs No of Products sold")
The task in this project is to predict the sales values of products over time.
library(sjPlot)
library(sjmisc)
library(sjlabelled)
lm01 = lm(Sales ~ date_block_num, data = data)
tab_model(lm01)
| Sales | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 805.72 | 793.75 – 817.68 | <0.001 |
| date block num | 24.16 | 23.47 – 24.85 | <0.001 |
| Observations | 2935849 | ||
| R2 / R2 adjusted | 0.002 / 0.002 | ||
From the above linear regression model, we observe that the p-value is very close to zero. So, there should be a statistically significant relationship between Sales and date_block_num.
From the summary statistics, adjusted R-Squared value is 0.002. It implies that 0.2% variation in Sales is explained by date_block_num.