The data is about sales of the thousands of products of different families sold at Favorita stores located in Ecuador. The training data includes dates, store and product information, whether that item was being promoted, as well as the sales numbers.

The data is publically available on Kaggle website.

I found that this dataset have other csv than test.csv, which includes price of oil and Holiday event. This information can be used along with information given in test.csv to forecast.

I find this tak challanging as it involves varios factors along with forecasting sales.

The relation of oil prices, the impact of holidays, the seasnality of products and trend all these things need to be involved to get a good forecast. There are some missing values which we need to take care of too.

File Descriptions and Data Field Information

train.csv

The training data, comprising time series of features store_nbr, family, and onpromotion as well as the target sales. store_nbr identifies the store at which the products are sold. family identifies the type of product sold. onpromotion gives the total number of items in a product family that were being promoted at a store at a given date.

test.csv

The test data, having the same features as the training data. You will predict the target sales for the dates in this file. The dates in the test data are for the 15 days after the last date in the training data.

stores.csv

Store metadata, including city, state, type, and cluster. cluster is a grouping of similar stores.

oil.csv

Daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and it’s economical health is highly vulnerable to shocks in oil prices.)

holidays_events.csv

Holidays and Events, with metadata

**************************************************************************************************





START

DATA LOADING

df_train <- readr::read_csv('C:\\Users\\vibze\\Time series\\train.csv')
## Rows: 3000888 Columns: 6
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (1): family
## dbl  (4): id, store_nbr, sales, onpromotion
## date (1): date
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_stores <- readr::read_csv('C:\\Users\\vibze\\Time series\\stores.csv')
## Rows: 54 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): city, state, type
## dbl (2): store_nbr, cluster
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_oil <- readr::read_csv('C:\\Users\\vibze\\Time series\\oil.csv')
## Rows: 1218 Columns: 2
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl  (1): dcoilwtico
## date (1): date
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_holiday <- readr::read_csv('C:\\Users\\vibze\\Time series\\holidays_events.csv')
## Rows: 350 Columns: 6
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (4): type, locale, locale_name, description
## lgl  (1): transferred
## date (1): date
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.


SUMMARY FOR EVERY DATA SET

summary(df_train)
##        id               date              store_nbr       family         
##  Min.   :      0   Min.   :2013-01-01   Min.   : 1.0   Length:3000888    
##  1st Qu.: 750222   1st Qu.:2014-02-26   1st Qu.:14.0   Class :character  
##  Median :1500444   Median :2015-04-24   Median :27.5   Mode  :character  
##  Mean   :1500444   Mean   :2015-04-24   Mean   :27.5                     
##  3rd Qu.:2250665   3rd Qu.:2016-06-19   3rd Qu.:41.0                     
##  Max.   :3000887   Max.   :2017-08-15   Max.   :54.0                     
##      sales           onpromotion     
##  Min.   :     0.0   Min.   :  0.000  
##  1st Qu.:     0.0   1st Qu.:  0.000  
##  Median :    11.0   Median :  0.000  
##  Mean   :   357.8   Mean   :  2.603  
##  3rd Qu.:   195.8   3rd Qu.:  0.000  
##  Max.   :124717.0   Max.   :741.000
summary(df_holiday)
##       date                type              locale          locale_name       
##  Min.   :2012-03-02   Length:350         Length:350         Length:350        
##  1st Qu.:2013-12-23   Class :character   Class :character   Class :character  
##  Median :2015-06-08   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2015-04-24                                                           
##  3rd Qu.:2016-07-03                                                           
##  Max.   :2017-12-26                                                           
##  description        transferred    
##  Length:350         Mode :logical  
##  Class :character   FALSE:338      
##  Mode  :character   TRUE :12       
##                                    
##                                    
## 
summary(df_oil)
##       date              dcoilwtico    
##  Min.   :2013-01-01   Min.   : 26.19  
##  1st Qu.:2014-03-03   1st Qu.: 46.41  
##  Median :2015-05-02   Median : 53.19  
##  Mean   :2015-05-02   Mean   : 67.71  
##  3rd Qu.:2016-06-30   3rd Qu.: 95.66  
##  Max.   :2017-08-31   Max.   :110.62  
##                       NA's   :43
summary(df_stores)
##    store_nbr         city              state               type          
##  Min.   : 1.00   Length:54          Length:54          Length:54         
##  1st Qu.:14.25   Class :character   Class :character   Class :character  
##  Median :27.50   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :27.50                                                           
##  3rd Qu.:40.75                                                           
##  Max.   :54.00                                                           
##     cluster      
##  Min.   : 1.000  
##  1st Qu.: 4.000  
##  Median : 8.500  
##  Mean   : 8.481  
##  3rd Qu.:13.000  
##  Max.   :17.000

JOINING ALL THE DATA SET WITH COMMON DATE

df_main <- df_train %>% left_join(df_oil, by ="date")

df_main <- df_main %>% left_join(df_holiday, by="date")

df_main <- df_main %>% left_join(df_stores, by="store_nbr")

MAIN DATA SET

We can see from below summary that we have id as primary key, data from 01 Jan 2013 to 15 Aug 2017. There are lot of NA values in the oil prices

summary(df_main)
##        id               date              store_nbr       family         
##  Min.   :      0   Min.   :2013-01-01   Min.   : 1.0   Length:3054348    
##  1st Qu.: 754677   1st Qu.:2014-03-01   1st Qu.:14.0   Class :character  
##  Median :1507572   Median :2015-04-28   Median :27.5   Mode  :character  
##  Mean   :1504277   Mean   :2015-04-26   Mean   :27.5                     
##  3rd Qu.:2255120   3rd Qu.:2016-06-22   3rd Qu.:41.0                     
##  Max.   :3000887   Max.   :2017-08-15   Max.   :54.0                     
##                                                                          
##      sales         onpromotion        dcoilwtico        type.x         
##  Min.   :     0   Min.   :  0.000   Min.   : 26.2    Length:3054348    
##  1st Qu.:     0   1st Qu.:  0.000   1st Qu.: 46.4    Class :character  
##  Median :    11   Median :  0.000   Median : 53.4    Mode  :character  
##  Mean   :   359   Mean   :  2.618   Mean   : 68.0                      
##  3rd Qu.:   196   3rd Qu.:  0.000   3rd Qu.: 95.8                      
##  Max.   :124717   Max.   :741.000   Max.   :110.6                      
##                                     NA's   :955152                     
##     locale          locale_name        description        transferred    
##  Length:3054348     Length:3054348     Length:3054348     Mode :logical  
##  Class :character   Class :character   Class :character   FALSE:486486   
##  Mode  :character   Mode  :character   Mode  :character   TRUE :16038    
##                                                           NA's :2551824  
##                                                                          
##                                                                          
##                                                                          
##      city              state              type.y             cluster      
##  Length:3054348     Length:3054348     Length:3054348     Min.   : 1.000  
##  Class :character   Class :character   Class :character   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 8.500  
##                                                           Mean   : 8.481  
##                                                           3rd Qu.:13.000  
##                                                           Max.   :17.000  
## 



SECTION 2 - GRAPHS

Lets see the performance of sales over time

df_1 <- df_main %>% 
           group_by(date) %>% 
           summarise(sales=sum(sales))

library(ggplot2)
 
plot_1 <- ggplot(df_1,aes(x=date,y=sales)) + geom_line(color="lightblue")+geom_smooth() + labs(title = "SALES")


plot_1
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'



### Months with high sale to check if we can find any relation ### The high July and December sales can been seen in graph. This can be seen as this is seasonal data with 6 month of season.

plot_4 <- df_main %>% 
  group_by(month_n) %>% 
  summarise(sales=sum(sales)) %>% 
  ggplot(aes(x=reorder(month_n,sales),y=sales,fill="red")) + geom_bar(stat="identity") + labs(title="sales by month") + xlab("Month") +theme(legend.position = "none")

plot_4



### The distribution of sales in the different families shows that Groceries contribute to the highest sale among all

plot_3 <- df_main %>% 
  group_by(family) %>% 
  summarise(sales=sum(sales)) %>% 
  ggplot(aes(x=reorder(family,sales),y=sales))+ geom_bar(stat="identity",fill="green",color="black")+coord_flip()+labs(title = "SALES BY FAMILY") + ylab("Family Category")

plot_3



### Given a look at oil graph, we may consider that seems to be no significant relationship between oil prices and sales

plot_2 <- ggplot(df_main) + geom_line(aes(x=date,y=oil,color="continent")) + labs(title = "OIL PRICE TREND")

plot_2
## Warning: Removed 1782 row(s) containing missing values (geom_path).



from this graph we can check that which type of product has seasonal sales and according to how high sale is, we can figure out how much effect it could have in overall sale and its season.

plot_5 <- df_main %>% 
  group_by(date,family) %>% 
  summarise(sales=sum(sales)) %>% 
  ggplot(aes(x=date,y=sales,color=family))+geom_smooth()+ labs(title = "SALES FOR EACH FAMILY TYPE") + ylab("SALE")+ theme(plot.title=element_text(size=12, face="bold"),
                                                                                                                           legend.key.height = unit(0.4, 'cm'), #change legend key height
        legend.key.width = unit(1, 'cm'),
        legend.text = element_text(size=5))
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
plot_5
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'



### The sale by each family type shows us which family product has more sales to consider if there seaonality will reflect in the overall sale.

#library(scales)
#install.packages("repr")

plot_5+facet_wrap(~family) +
                  theme(plot.title=element_text(size=12, face="bold"), 
                  axis.text.x=element_text(size=3), 
                  axis.text.y=element_text(size=3),
                  axis.title.x=element_text(size=3),
                  axis.title.y=element_text(size=3),
                  legend.position="none")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

st(df_main)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
id 3054348 1504276.775 866261.012 0 754676.75 2255120.25 3000887
store_nbr 3054348 27.5 15.586 1 14 41 54
sales 3054348 359.021 1107.286 0 0 196.011 124717
onpromotion 3054348 2.617 12.255 0 0 0 741
oil 2099196 68.016 25.691 26.19 46.41 95.81 110.62
locale 502524
… Local 208494 41.5%
… National 261954 52.1%
… Regional 32076 6.4%
transferred 502524
… No 486486 96.8%
… Yes 16038 3.2%
type.y 3054348
… A 509058 16.7%
… B 452496 14.8%
… C 848430 27.8%
… D 1018116 33.3%
… E 226248 7.4%
cluster 3054348 8.481 4.65 1 4 13 17
month_n 3054348 6.223 3.374 1 3 9 12


Regression against time.



df_main <- df_main %>% 
              group_by(date) %>% 
              mutate(sum_sale=sum(sales)) %>% 
              ungroup()




 plot(df_main$date,df_main$sum_sale)
abline(lm(sum_sale~date,data=df_main),col="red")



The summary shows that there is linear relationship between the date and total sales per day. As date and sales are both increasing.

lmmodel <- lm(sum_sale~date,data=df_main)

summary(lmmodel)
## 
## Call:
## lm(formula = sum_sale ~ date, data = df_main)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -893582 -145601  -54246   76375 2305452 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.639e+06  5.137e+03   -1098   <2e-16 ***
## date         3.812e+02  3.102e-01    1229   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264300 on 3054346 degrees of freedom
## Multiple R-squared:  0.3309, Adjusted R-squared:  0.3309 
## F-statistic: 1.51e+06 on 1 and 3054346 DF,  p-value: < 2.2e-16

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.



The following generated heatmap shows that there is no significant relationship between any other variables to consider.

Here we haven’t yet encoded our character variables to numeric variables. The ? in the table represents there are lot of Null values in the oil table making it hard to determaine corelation.

So we will try by skipping null values.

df_num <- df_main %>% 
          select(c="id","sales","oil","month_n","cluster")


m = cor(df_num)

corrplot(m,type="lower",method = "number")

y = cor(df_num,use="pairwise.complete.obs")

corrplot(y,type="lower",method = "number")