Antallagma Forecasting

1. Problem Statement

Welcome to Antallagma, a digital exchange for trading goods. Antallagma began operating five years ago and has supported more than a million transactions to date. The platform brings the workings of a traditional exchange to an online portal.

Buyers place bids stating the price they are willing to pay (the “bid value”) and the quantity they want to buy. Sellers, in turn, post an ask price and the quantity they are willing to sell. The portal matches buyers and sellers in real time to create trades. All trades are settled at the end of the day at the median price of all agreed trades.
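As a small illustration of the settlement rule, the sketch below computes a day's settlement price and traded volume for one item. The `trades` data frame and its values are hypothetical, and the median is assumed to be unweighted (the text does not say whether quantities weight it).

```r
# Hypothetical trades matched during one day for a single item
trades <- data.frame(
  price    = c(10.0, 10.5, 9.8, 10.2, 10.4),
  quantity = c(3, 1, 2, 5, 4)
)

# Settlement price: median of the agreed trade prices (assumed unweighted)
settle_price <- median(trades$price)

# Daily volume: total quantity traded that day
daily_volume <- sum(trades$quantity)

print(settle_price)
print(daily_volume)
```

These two numbers per item and day correspond to the Price and Number_Of_Sales targets described in the data dictionary.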

You are one of the traders on the exchange and can supply every material traded on it. To improve your logistics, you want to predict the median trade price and the trade volume for each item on the exchange. You can then use these predictions to build an optimized inventory strategy.

You are expected to forecast the trade volumes and trade prices of all items traded on Antallagma for a period of 6 months.

2. Data Dictionary

Train and test datasets are given.

Variable          Definition
ID                Unique_transaction_ID
Item_ID           Unique ID of the product
Datetime          Date of Sale
Price             Median Price at Sale on that day (Target Variable_1)
Number_Of_Sales   Total Item Sold on that day (Target Variable_2)
Category_1        Unordered Masked feature
Category_2        Ordered Masked feature
Category_3        Binary Masked feature

3. Load the Required Libraries and Data Understanding

library(fpp2)

## Loading required package: forecast

## Loading required package: fma

## Loading required package: expsmooth

## Loading required package: ggplot2

library(plyr)

## 
## Attaching package: 'plyr'

## The following object is masked from 'package:fma':
## 
##     ozone

train <- read.csv("E:/mahesh/Downloads/Forecast/data/train_RTwONnY/train.csv")
test <- read.csv("E:/mahesh/Downloads/Forecast/data/test_XaRbxhd/test.csv")

# dimension of train and test datasets
print(dim(train))

## [1] 881876      8

print(dim(test))

## [1] 266248      6

#colnames of train and test datasets
print(colnames(train))

## [1] "ID"              "Item_ID"         "Datetime"        "Category_3"     
## [5] "Category_2"      "Category_1"      "Price"           "Number_Of_Sales"

print(colnames(test))

## [1] "Item_ID"    "Datetime"   "Category_1" "Category_2" "Category_3"
## [6] "ID"

#top records of train data
print(head(train,5))

##               ID Item_ID   Datetime Category_3 Category_2 Category_1
## 1 30495_20140101   30495 2014-01-01          0          2         90
## 2 30375_20140101   30375 2014-01-01          0          2        307
## 3 30011_20140101   30011 2014-01-01          0          3         67
## 4 30864_20140101   30864 2014-01-01          0          2        315
## 5 30780_20140101   30780 2014-01-01          1          2        132
##     Price Number_Of_Sales
## 1 165.123               1
## 2  68.666               5
## 3 253.314               2
## 4 223.122               1
## 5  28.750               1
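The rows above suggest that ID is simply Item_ID and Datetime joined by an underscore. The sketch below checks that assumed pattern (`<Item_ID>_<YYYYMMDD>`) on three rows copied from the output and shows how an ID could be split back into its parts. The variable names are illustrative only.

```r
# Values copied from the first rows of the training data
item_id  <- c(30495, 30375, 30011)
datetime <- as.Date(c("2014-01-01", "2014-01-01", "2014-01-01"))
id       <- c("30495_20140101", "30375_20140101", "30011_20140101")

# Rebuild the ID from its two parts (assumed pattern: <Item_ID>_<YYYYMMDD>)
rebuilt <- paste0(item_id, "_", format(datetime, "%Y%m%d"))
stopifnot(identical(rebuilt, id))

# Going the other way: split an ID back into item and date
parts <- do.call(rbind, strsplit(id, "_", fixed = TRUE))
parsed <- data.frame(
  Item_ID  = as.integer(parts[, 1]),
  Datetime = as.Date(parts[, 2], format = "%Y%m%d")
)
print(parsed)
```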

#structure of train dataset (str() prints its own output, so no print() is needed)
str(train)

## 'data.frame':    881876 obs. of  8 variables:
##  $ ID             : Factor w/ 881876 levels "29654_20160617",..: 428990 363754 174166 623169 581156 659074 863720 450865 165854 203328 ...
##  $ Item_ID        : int  30495 30375 30011 30864 30780 30927 31342 30540 29999 30068 ...
##  $ Datetime       : Factor w/ 912 levels "2014-01-01","2014-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Category_3     : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Category_2     : num  2 2 3 2 2 1 2 3 3 2 ...
##  $ Category_1     : int  90 307 67 315 132 38 38 38 67 195 ...
##  $ Price          : num  165.1 68.7 253.3 223.1 28.8 ...
##  $ Number_Of_Sales: int  1 5 2 1 1 402 832 423 133 375 ...

#keep only the columns used for modelling (drops ID, Datetime and Category_1 to Category_3)
train <- train[, c("Item_ID", "Price", "Number_Of_Sales")]
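Once Datetime is dropped, each item's series is defined by row order alone, so the models implicitly assume one row per day in date order. The sketch below shows one way such an assumption could be checked before the column is removed. The `toy` data frame and the `check_item` helper are hypothetical and are not part of the original analysis.

```r
# Hypothetical toy data: item 1 is complete, item 2 is missing 2014-01-02
toy <- data.frame(
  Item_ID  = c(1, 1, 1, 2, 2),
  Datetime = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03",
                       "2014-01-01", "2014-01-03"))
)

# For one item: are the rows in date order, and how many calendar days are absent?
check_item <- function(d) {
  days_spanned <- as.integer(difftime(max(d$Datetime), min(d$Datetime),
                                      units = "days")) + 1L
  data.frame(
    Item_ID      = d$Item_ID[1],
    sorted       = all(diff(as.numeric(d$Datetime)) >= 0),
    missing_days = days_spanned - length(unique(d$Datetime))
  )
}

# Apply the check to every item and stack the results
gaps <- do.call(rbind, lapply(split(toy, toy$Item_ID), check_item))
print(gaps)
```

On the real data, Datetime was read as a factor, so it would first need converting with `as.Date(as.character(train$Datetime))`.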

#convert the Item_ID variable to a factor in the train and test datasets
train$Item_ID <- as.factor(train$Item_ID)
test$Item_ID <- as.factor(test$Item_ID)

#check the number of levels of Item_ID in the train and test datasets
print(nlevels(train$Item_ID))

## [1] 1529

print(nlevels(test$Item_ID))

## [1] 1447

#keep only the training rows whose Item_ID also appears in the test dataset
train1  <- train[train$Item_ID %in% test$Item_ID, ] %>% droplevels()

#Again check the number of levels of train and test datasets
print(nlevels(train1$Item_ID))

## [1] 1447

print(nlevels(test$Item_ID))

## [1] 1447
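The filtering step pairs `%in%` with `droplevels()`. The toy sketch below (hypothetical item labels) shows why the second call matters: subsetting a factor keeps its unused levels, and a later `split()` on that factor would then produce empty groups for items that have no rows.

```r
# Hypothetical item IDs; "C" appears only in the training data
train_items <- factor(c("A", "A", "B", "C", "C"))
test_items  <- factor(c("A", "B"))

# Keep entries whose item also occurs in the test set
kept <- train_items[train_items %in% test_items]

# The unused level "C" is still attached after subsetting
print(nlevels(kept))

# droplevels() removes it
print(nlevels(droplevels(kept)))

# Items present in train but absent from test
print(setdiff(levels(train_items), levels(test_items)))
```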

#split the training data into one data frame per Item_ID
df_list <- split(train1, train1$Item_ID)

Model Implementation (ETS):

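#fit one ETS model per item for Price and one for Number_Of_Sales
results <- list()    #Price forecasts, keyed by Item_ID
results1 <- list()   #Number_Of_Sales forecasts, keyed by Item_ID

#ets model
for(i in seq_along(df_list)) {
  a <- df_list[[i]]
  #use the names of df_list (from train1) so each key matches its data;
  #levels(train$Item_ID) still holds items that were filtered out
  item <- names(df_list)[i]
  #estimate a Box-Cox parameter for each target series
  l <- BoxCox.lambda(a$Price)
  ll <- BoxCox.lambda(a$Number_Of_Sales)
  #forecast 184 daily values ahead (the 6-month horizon)
  results[[item]] <- a$Price %>% ets(lambda = l) %>% forecast(h = 184)
  results1[[item]] <- a$Number_Of_Sales %>% ets(lambda = ll) %>% forecast(h = 184)
}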
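The loop passes a Box-Cox parameter to `ets()`, so each model is fitted on a transformed scale and the forecasts are transformed back. As a rough sketch of what that transformation does for positive values, the base-R functions below implement the standard Box-Cox formula and its inverse. The names `box_cox` and `inv_box_cox`, the sample values and the choice `lambda = 0.3` are all illustrative assumptions. In the analysis itself the parameter is estimated by `BoxCox.lambda()`.

```r
# Box-Cox transform for positive y: log(y) when lambda is 0,
# otherwise (y^lambda - 1) / lambda
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Inverse transform, used to bring values back to the original scale
inv_box_cox <- function(z, lambda) {
  if (lambda == 0) exp(z) else (lambda * z + 1)^(1 / lambda)
}

# Hypothetical daily sales counts with a wide spread
y <- c(1, 5, 2, 402, 832, 423, 133, 375)
lambda <- 0.3  # arbitrary illustrative value, not an estimate

z <- box_cox(y, lambda)
back <- inv_box_cox(z, lambda)

print(round(z, 3))
stopifnot(isTRUE(all.equal(back, y)))  # the round trip recovers the data
```

The transform compresses large values more than small ones, which can stabilise the variance of series such as Number_Of_Sales that range from single units to several hundred.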

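#convert the output to the required format
price <- NULL
numberofsales <- NULL

#collect the point forecasts item by item, in the order of df_list
for(i in seq_along(results)){
  price <- append(price, as.numeric(results[[i]]$mean))
  numberofsales <- append(numberofsales, as.numeric(results1[[i]]$mean))
}

#sort test by ID so its rows line up with the item-by-item forecasts
#(plyr, which provides arrange(), is already loaded above)
test <- arrange(test, ID)
output <- data.frame(test$ID, price, numberofsales)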
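The final data frame is built by position, so it relies on two things: every item contributing exactly 184 forecasts, and the sorted test IDs following the same item-by-item order as the forecasts. The sketch below checks both using only figures printed earlier in this document. The end date 2016-12-31 is an assumption (six calendar months starting 2016-07-01), and the toy IDs are hypothetical.

```r
# Assumed forecast window: 2016-07-01 through 2016-12-31
horizon_days <- seq(as.Date("2016-07-01"), as.Date("2016-12-31"), by = "day")
h <- length(horizon_days)
print(h)  # 184

# Row count reported by dim(test) and item count reported by nlevels()
n_test_rows <- 266248
n_items <- 1447
stopifnot(n_items * h == n_test_rows)

# Toy check of the ordering assumption: sorting IDs of the form
# <Item_ID>_<YYYYMMDD> groups rows by item, then by date
toy_ids <- c("29655_20160702", "29654_20160702",
             "29655_20160701", "29654_20160701")
sorted_ids <- sort(toy_ids)
print(sorted_ids)
```

The string sort matches the numeric item order only while all Item_ID values have the same number of digits, which holds for the IDs shown here but is worth confirming on the full data.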

Forecasted Output

##          test.ID     price numberofsales
## 1 29654_20160701 0.7089991      295.7515
## 2 29654_20160702 0.7089991      295.7515
## 3 29654_20160703 0.7089991      295.7515
## 4 29654_20160704 0.7089991      295.7515
## 5 29654_20160705 0.7089991      295.7515
## 6 29654_20160706 0.7089991      295.7515