library(tidyverse) #loads readr, dplyr, tidyr and ggplot2
library(lubridate)
library(fpp2)
library(zoo)
FinTech is interested in predicting expenses for a particular financial line item as a % of Gross Bookings. To help explore this question, we have provided a sample dataset of vendors associated with this line item, the expenses paid out, and the total Gross Bookings attributed to that vendor and product. Please include the code you wrote for the analysis and delete the dataset when you have finished with the challenge. Using the attached dataset, please do the following:
1. Perform any cleaning, exploratory analysis, and/or visualizations to use the provided data for this analysis (a few sentences/plots describing your approach will suffice).
We begin by importing the file with the readr package and assigning proper column names and column types to the data frame.
#importing file, assigning relevant column names and data types
file <- read_csv("/Users/akul/Desktop/Uber/FintechDA_Exercise/Exercise.csv")
names(file) <- c('Date','VendorName','Country','Product','GrossBookings','Fees')
file$Date <- mdy(file$Date)
file$VendorName <- as.factor(file$VendorName)
file$Country <- as.factor(file$Country)
file$Product <- as.factor(file$Product)
#GrossBookings is kept numeric rather than integer: country-level sums exceed the 32-bit integer limit
file$GrossBookings <- as.numeric(file$GrossBookings)
file$Fees <- as.integer(file$Fees)
#Checking for the number of rows
nrow(file)
## [1] 7911
# checking the structure of the dataframe
glimpse(file)
## Observations: 7,911
## Variables: 6
## $ Date <date> 2015-01-01, 2015-01-01, 2015-01-01, 2015-01-01,...
## $ VendorName <fct> Vendor A, Vendor A, Vendor A, Vendor A, Vendor A...
## $ Country <fct> Country 1, Country 2, Country 3, Country 4, Coun...
## $ Product <fct> Product 1, Product 2, Product 3, Product 1, Prod...
## $ GrossBookings <dbl> 63, 39295, 127, 4, 4734, 60, -90, 1460, 381472, ...
## $ Fees <int> 0, 2169, 3, 0, 171, 5, 40, 844, 10681, 385, 23, ...
There are a total of 7,911 rows in the data frame.
#Checking for missing values
sum(is.na(file))
## [1] 289
There are a total of 289 missing values in the data frame.
#removing missing values
file<- na.omit(file)
sum(is.na(file))
## [1] 0
To process the data properly, we remove the rows containing missing values; the data frame now has no missing values.
#checking if any duplicate values exist
file %>% group_by(Date,Country,Product,VendorName) %>% mutate(n=n()) %>% filter(n>1)
## # A tibble: 0 x 7
## # Groups: Date, Country, Product, VendorName [0]
## # ... with 7 variables: Date <date>, VendorName <fct>, Country <fct>,
## # Product <fct>, GrossBookings <int>, Fees <int>, n <int>
Here we check for duplicated Date/Country/Product/Vendor combinations in the data frame; there are no duplicate values.
#Summary of the variables
summary(file)
## Date VendorName Country Product
## Min. :2015-01-01 Vendor A:5608 Country 2 : 272 Product 9 :1452
## 1st Qu.:2015-06-01 Vendor F: 941 Country 7 : 202 Product 4 :1446
## Median :2015-11-01 Vendor B: 596 Country 35: 183 Product 2 : 665
## Mean :2015-09-25 Vendor C: 379 Country 1 : 178 Product 3 : 631
## 3rd Qu.:2016-02-01 Vendor D: 48 Country 25: 176 Product 10: 618
## Max. :2016-04-01 Vendor E: 16 Country 6 : 171 Product 6 : 596
## (Other) : 51 (Other) :6457 (Other) :2231
## GrossBookings Fees
## Min. : -2701 Min. : 0
## 1st Qu.: 75 1st Qu.: 3
## Median : 2483 Median : 105
## Mean : 444523 Mean : 15516
## 3rd Qu.: 39706 3rd Qu.: 1362
## Max. :68736613 Max. :6135380
##
From the summary of the data frame, the following observations can be made:
GrossBookings has negative values, which should be removed or converted to positive (we take absolute values below).
Our analysis defines an expense-percentage metric (Fees / Gross Bookings), so rows with GrossBookings of 0 must be removed; otherwise the percentage would be undefined.
The data covers a period of 16 months, from January 2015 to April 2016.
The large gap between the median and the mean for Fees and Gross Bookings indicates highly right-skewed distributions, i.e., most values lie well below the mean.
#converting negative gross bookings to positive and removing zero-booking rows
file <- file %>%
  mutate(GrossBookings = abs(GrossBookings)) %>%
  filter(GrossBookings > 0)
nrow(file)
## [1] 7561
After removing the missing values, converting negatives, and dropping gross bookings of 0, we have a total of 7,561 rows; hence 350 rows were removed in all.
ggplot(file, aes(x = Fees, y = GrossBookings, col = Product)) + geom_point() + ggtitle("Scatter plot of Gross Bookings vs Fees by Product")
From the scatter plot, we see a strong linear relationship between Fees and GrossBookings across all product categories. There is one major outlier for Product 6 with a very high Fee value. Further investigation shows that this point (February 2015, Product 6, Country 2, Vendor B) has a Fee of 6,135,380, which looks like a data-entry error with an extra trailing 0: the corrected value of 613,538 would fall neatly between the January 2015 and March 2015 values. We remove the extra 0 and correct this entry in the forecasting section of the analysis.
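As a quick way to locate this point programmatically (a sketch; the original analysis does not show this step), we can pull the row with the largest Fee:
# sketch: locate the row with the maximum Fee (the suspected data-entry error)
file %>% filter(Fees == max(Fees))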
ggplot(file, aes(x = GrossBookings, fill = Product)) + geom_density(alpha = 0.8) +
  scale_x_log10() + ggtitle("Density Plot for Gross Bookings")
ggplot(file, aes(x = Fees, fill = Product)) + geom_density(alpha = 0.8) +
  scale_x_log10() + ggtitle("Density Plot for Fees")
The density plots show great variability in Fees and Gross Bookings across product lines: some product lines vary across a very narrow range, whereas others span a very broad spectrum.
ggplot(file, aes(y=GrossBookings, x = Product)) + geom_boxplot(aes(fill=Product)) + theme(axis.text.x=element_text(angle=45, hjust=1)) + scale_y_log10() +
ylab("Gross Bookings") +
xlab("Product") + ggtitle("BoxPlot for Gross Bookings")
ggplot(file, aes(y=Fees, x = Product)) + geom_boxplot(aes(fill=Product)) + theme(axis.text.x=element_text(angle=45, hjust=1)) + scale_y_log10() +
ylab("Fees") +
xlab("Products") + ggtitle("BoxPlot for Fees")
Outlier Analysis
Gross Bookings: Product 10, Product 14, Product 15, Product 5, and Product 7 have no outliers, whereas all the other product lines have major outliers, especially Product 9, on both the lower and the higher end.
Fees: a huge outlier is seen for Product 6, consistent with our finding from the scatter plot.
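As a rough numeric companion to the boxplots (a sketch on the log10 scale used in the plots; not part of the original analysis), we can count the points falling outside the 1.5 × IQR whiskers per product:
# sketch: count of 1.5*IQR outliers per product, on the log10 scale of the boxplots
file %>%
  group_by(Product) %>%
  mutate(lg = log10(GrossBookings)) %>%
  summarise(n_outliers = sum(lg < quantile(lg, 0.25) - 1.5 * IQR(lg) |
                             lg > quantile(lg, 0.75) + 1.5 * IQR(lg)))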
data1 <- file %>% group_by(Date) %>% summarise(tot_gross = sum(GrossBookings),
tot_fees = sum(Fees), Exp_percentage = tot_fees/tot_gross)
lbls <- paste0(month.abb[month(data1$Date)], " ", lubridate::year(data1$Date))
brks <- data1$Date
# plot
ggplot(data1, aes(x=Date)) +
geom_line(aes(y=tot_gross)) +
labs(title="Gross Booking Total Monthly Time Series",
caption="Source: Uber Data",
y="Sum of Gross Bookings by Date") +
scale_x_date(labels = lbls,
breaks = brks) + # change to monthly ticks and labels
theme(axis.text.x = element_text(angle = 90, vjust=0.5), # rotate x axis text
panel.grid.minor = element_blank())
The trend for total gross bookings by date is very strong, increasing smoothly over time; the value in April 2016 is roughly three times that of January 2015.
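As a quick check of this roughly threefold growth (a sketch using the data1 aggregation built above; not in the original analysis):
# sketch: ratio of the last to the first monthly gross-bookings total
with(data1, tot_gross[Date == max(Date)] / tot_gross[Date == min(Date)])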
lbls <- paste0(month.abb[month(data1$Date)], " ", lubridate::year(data1$Date))
brks <- data1$Date
ggplot(data1, aes(x=Date)) +
geom_line(aes(y=tot_fees)) +
labs(title="Monthly Time Series for Total Fees",
caption="Source: Uber Data",
y="Sum of Fees by Date") +
scale_x_date(labels = lbls,
breaks = brks) + # change to monthly ticks and labels
theme(axis.text.x = element_text(angle = 90, vjust=0.5), # rotate x axis text
panel.grid.minor = element_blank())
For the total Fees variable, we see a huge spike in February 2015, which can be attributed to the outlier detected earlier; overall the trend is strongly positive, similar to gross bookings.
lbls <- paste0(month.abb[month(data1$Date)], " ", lubridate::year(data1$Date))
brks <- data1$Date
# plot
ggplot(data1, aes(x=Date)) +
geom_line(aes(y=Exp_percentage)) +
labs(title="Monthly Time Series for Expense Percentage",
caption="Source: Uber Data",
y="Expense Percentage by Date") +
scale_x_date(labels = lbls,
breaks = brks) + # change to monthly ticks and labels
theme(axis.text.x = element_text(angle = 90, vjust=0.5), # rotate x axis text
panel.grid.minor = element_blank())
Similar to total Fees by date, we see the large rise in February 2015; otherwise the trend is somewhat erratic with a slight increase. As the series contains only 16 monthly data points, it is very difficult to establish seasonality.
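One rough visual check for seasonality, given the short series, is a season plot from the forecast package (loaded via fpp2). This is a sketch, not part of the original analysis; with barely more than one full year of data it can only hint at a pattern:
# sketch: season plot of the monthly expense percentage
exp_ts <- ts(data1$Exp_percentage, start = c(2015, 1), frequency = 12)
ggseasonplot(exp_ts) + ggtitle("Season Plot of Expense Percentage")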
Total Gross Booking by Product
data2 <- file %>% group_by(Product) %>% summarise(sum_gross_booking = sum(GrossBookings),
sum_fees = sum(Fees), Expense_percentage = sum_fees/sum_gross_booking)
ggplot(data = data2, aes(x = reorder(Product, sum_gross_booking), y = sum_gross_booking, label = sum_gross_booking)) + geom_bar(stat = "identity",
aes(fill = Product)) + labs(title = "Gross Bookings Distribution by Product", x = "Product",
y = "Gross Bookings", color = NULL) + geom_text(color="black", size=2) +coord_flip()
data2 %>% arrange(desc(sum_gross_booking)) %>% head(n=10)
## # A tibble: 10 x 4
## Product sum_gross_booking sum_fees Expense_percentage
## <fct> <dbl> <int> <dbl>
## 1 Product 9 1530970522. 55032280 0.0359
## 2 Product 6 586008844. 20631630 0.0352
## 3 Product 4 573234677. 25227228 0.0440
## 4 Product 13 184160011. 3660102 0.0199
## 5 Product 7 175653012. 3272300 0.0186
## 6 Product 22 74103150. 2283648 0.0308
## 7 Product 21 62173027. 1237713 0.0199
## 8 Product 5 57227648. 1255641 0.0219
## 9 Product 2 36861170. 1580043 0.0429
## 10 Product 19 29310746. 208736 0.00712
data3 <- file %>% filter(Product %in% c("Product 9","Product 6","Product 4","Product 13","Product 7","Product 22","Product 21","Product 5","Product 2","Product 19")) %>% group_by(Date,Product) %>% summarise(total_gross = sum(GrossBookings),total_fees = sum(Fees),exp_percentage = total_fees/total_gross)
lbls <- paste0(month.abb[month(data3$Date)], " ", lubridate::year(data3$Date))
brks <- data3$Date
ggplot(data3, aes(x=Date)) +
geom_line(aes(y=total_gross, col=Product)) +
labs(title="Time Series for Gross Booking by product",
caption="Source: Uber Data",
y="Sum of Gross Booking across product Lines") +
scale_x_date(labels = lbls,
breaks = brks) + # change to monthly ticks and labels
theme(axis.text.x = element_text(angle = 90, vjust=0.5), # rotate x axis text
panel.grid.minor = element_blank())
Here we begin our analysis of the Gross Bookings variable. Product 9 accounts for a high number of bookings and has the greatest share at roughly 1.5 billion, followed by Product 6 and Product 4 at roughly 0.6 billion each; these can collectively be called the 'Big 3'. All the product lines outside the top 13 hold very little share.
From the time series plot, we can see that Products 9, 6, and 4 are growing rapidly, which is a good sign since these products also hold the highest number of bookings; some product lines, such as Product 21 and Product 5, were introduced later on.
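To quantify the 'Big 3' share (a sketch built on the data2 summary above; not in the original analysis):
# sketch: each product's share of total gross bookings, top 3
data2 %>%
  mutate(share = sum_gross_booking / sum(sum_gross_booking)) %>%
  arrange(desc(share)) %>%
  head(n = 3)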
Total Gross Booking by Vendor
data3 <- file %>% group_by(VendorName) %>% summarise(sum_gross_booking = sum(GrossBookings),
sum_fees = sum(Fees), Expense_percentage = sum_fees/sum_gross_booking)
ggplot(data3, aes(x = reorder(VendorName, sum_gross_booking), y = sum_gross_booking,label=sum_gross_booking)) +
geom_point(stat='identity', fill="black", size=6) +
geom_segment(aes(y = 0,
x = VendorName,
yend = sum_gross_booking,
xend = VendorName),
color = "black") +
geom_text(color="red", size=3) +
labs(title="Gross Booking by Vendor", y = "sum of gross booking", x ="Vendor") +
coord_flip()
data4 <- file %>% group_by(Date,VendorName) %>% summarise(total_gross = sum(GrossBookings),total_fees = sum(Fees),exp_percentage = total_fees/total_gross)
lbls <- paste0(month.abb[month(data4$Date)], " ", lubridate::year(data4$Date))
brks <- data4$Date
ggplot(data4, aes(x=Date)) +
geom_line(aes(y=total_gross, col=VendorName)) +
labs(title="Time Series - Gross Booking by Vendor",
caption="Source: Uber Data",
y="sum of gross bookings") +
scale_x_date(labels = lbls,
breaks = brks) + # change to monthly ticks and labels
theme(axis.text.x = element_text(angle = 90, vjust=0.5), # rotate x axis text
panel.grid.minor = element_blank())
From the diverging point plot, we see that Vendor D, Vendor A, Vendor F, and Vendor B hold the largest share of bookings compared to the rest of the vendors, with Vendor D holding close to 910 million.
The trend analysis shows that Vendor D, Vendor F, and Vendor B have grown over time, while all the other vendors have a somewhat flat trend.
Vendor D's growth exceeds Vendor A's. Vendors D, A, and B show almost similar patterns of rises and drops; there could be some seasonality involved, but with so little data it is difficult to conclude.
Total Gross Booking by Country
data5 <- file %>% group_by(Country) %>% summarise(sum_gross_booking = sum(GrossBookings),
sum_fees = sum(Fees), Expense_percentage = sum_fees/sum_gross_booking)
data5 %>% arrange(desc(sum_gross_booking)) %>% head(n=10)
## # A tibble: 10 x 4
## Country sum_gross_booking sum_fees Expense_percentage
## <fct> <dbl> <int> <dbl>
## 1 Country 2 2168076928. 74002459 0.0341
## 2 Country 35 314456982. 5628034 0.0179
## 3 Country 3 268718127. 7966423 0.0296
## 4 Country 7 124782407. 3224414 0.0258
## 5 Country 18 87483474. 6356664 0.0727
## 6 Country 8 72803404. 4298168 0.0590
## 7 Country 54 60684454. 975648 0.0161
## 8 Country 19 30796422. 515417 0.0167
## 9 Country 40 30732266. 1734012 0.0564
## 10 Country 29 23485182. 1562883 0.0665
ggplot(data5 %>% arrange(desc(sum_gross_booking)) %>% head(n=10) , aes(x = reorder(Country, sum_gross_booking), y = sum_gross_booking,label=sum_gross_booking)) +
geom_point(stat='identity', fill="black", size=6) +
geom_segment(aes(y = 0,
x = Country,
yend = sum_gross_booking,
xend = Country),
color = "black") +
geom_text(color="red", size=3) +
labs(title="Gross Booking by Country", y = "sum of gross booking", x ="Country") +
coord_flip()
data6 <- file %>% filter(Country %in% c("Country 2","Country 35","Country 3","Country 7","Country 18","Country 8","Country 54","Country 19","Country 40","Country 29")) %>% group_by(Date,Country) %>% summarise(total_gross = sum(GrossBookings),total_fees = sum(Fees),exp_percentage = total_fees/total_gross)
lbls <- paste0(month.abb[month(data6$Date)], " ", lubridate::year(data6$Date))
brks <- data6$Date
ggplot(data6, aes(x=Date)) +
geom_line(aes(y=total_gross, col=Country)) +
labs(title="Time Series - Gross Booking by Country",
caption="Source: Uber Data",
y="sum of gross bookings") +
scale_x_date(labels = lbls,
breaks = brks) + # change to monthly ticks and labels
theme(axis.text.x = element_text(angle = 90, vjust=0.5), # rotate x axis text
panel.grid.minor = element_blank())
For the country-wise analysis, we see that Country 2 holds the maximum share among all countries, close to 2 billion, and the trend analysis shows its gross bookings more than doubled over the given period. Countries 35 and 40 also show good growth trends; for all the other countries the trend is fairly flat.
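As a quick check of the 'more than doubled' claim for Country 2 (a sketch using the data6 aggregation above; not in the original analysis):
# sketch: ratio of Country 2's last to first monthly gross bookings
data6 %>%
  ungroup() %>%
  filter(Country == "Country 2") %>%
  summarise(growth = total_gross[Date == max(Date)] / total_gross[Date == min(Date)])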
Analysis of Fees by Country
data7 <- file %>% group_by(Country) %>% summarise(sum_gross_booking = sum(GrossBookings),
sum_fees = sum(Fees), Expense_percentage = sum_fees/sum_gross_booking) %>%
arrange(desc(sum_fees)) %>% head(n=15)
ggplot(data7, aes(x = reorder(Country, sum_fees), y = sum_fees,label=sum_fees)) +
geom_point(stat='identity', fill="black", size=6) +
geom_segment(aes(y = 0,
x = Country,
yend = sum_fees,
xend = Country),
color = "black") +
geom_text(color="red", size=3) +
labs(title="Fees by Country" ,y = "Sum of Fees", x ="Country") +
coord_flip()
data8 <- file %>% filter(Country %in% c("Country 2", "Country 3","Country 18","Country 35","Country 8","Country 7","Country 40","Country 29","Country 31","Country 53")) %>% group_by(Date,Country) %>% summarise(total_gross = sum(GrossBookings),total_fees = sum(Fees),exp_percentage = total_fees/total_gross)
lbls <- paste0(month.abb[month(data8$Date)], " ", lubridate::year(data8$Date))
brks <- data8$Date
ggplot(data8, aes(x=Date)) +
geom_line(aes(y=total_fees, col=Country)) +
labs(title="Time Series - Fees by Country",
caption="Source: Uber Data",
y="Sum of Fees") +
scale_x_date(labels = lbls,
breaks = brks) +
theme(axis.text.x = element_text(angle = 90, vjust=0.5), # rotate x axis text
panel.grid.minor = element_blank())
Similar to gross bookings, Country 2 shows a strong growth trend for Fees and also accounts for the largest share of fees.
Analysis of Fees by Vendor
data9 <- file %>% group_by(VendorName) %>% summarise(sum_gross_booking = sum(GrossBookings),
sum_fees = sum(Fees), Expense_percentage = sum_fees/sum_gross_booking)
ggplot(data9, aes(x = reorder(VendorName, sum_fees), y = sum_fees,label=sum_fees)) +
geom_point(stat='identity',fill="black", size=6) +
geom_segment(aes(y = 0,
x = VendorName,
yend = sum_fees,
xend = VendorName),
color = "black") +
geom_text(color="red", size=3) +
labs(title="Fees by Vendor" ,y = "Sum of Fees", x ="Vendor") +
coord_flip()
data8 <- file %>% group_by(Date,VendorName) %>% summarise(total_gross = sum(GrossBookings),total_fees = sum(Fees),exp_percentage = total_fees/total_gross)
lbls <- paste0(month.abb[month(data8$Date)], " ", lubridate::year(data8$Date))
brks <- data8$Date
ggplot(data8, aes(x=Date)) +
geom_line(aes(y=total_fees, col=VendorName)) +
labs(title="Time Series - Fees by Vendor",
caption="Source: Uber Data",
y="Sum of Fees") +
scale_x_date(labels = lbls,
breaks = brks) +
theme(axis.text.x = element_text(angle = 90, vjust=0.5), # rotate x axis text
panel.grid.minor = element_blank())
Vendor A has the highest fees and Vendor D's are approximately the same, whereas Vendor D's gross bookings exceed Vendor A's; this indicates that Vendor A is more expensive than Vendor D. Similarly, Vendor B is more expensive than Vendor F.
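To quantify 'more expensive', we can rank vendors by the expense-percentage metric already computed in data9 (a quick sketch, not in the original analysis):
# sketch: vendors ranked by fees as a share of gross bookings
data9 %>% arrange(desc(Expense_percentage))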
Analysis of Fees by Product
data9 <- file %>% group_by(Product) %>% summarise(sum_gross_booking = sum(GrossBookings),
sum_fees = sum(Fees), Expense_percentage = sum_fees/sum_gross_booking)
ggplot(data = data9, aes(x = reorder(Product, sum_fees), y = sum_fees,label=sum_fees)) +
geom_bar(stat = "identity",
aes(fill = Product)) + labs(title="Fees by Product" ,y = "Sum of Fees", x ="Product") + geom_text(color="black", size=2) + coord_flip()
data9 %>% arrange(desc(sum_fees)) %>% head(n=10)
## # A tibble: 10 x 4
## Product sum_gross_booking sum_fees Expense_percentage
## <fct> <dbl> <int> <dbl>
## 1 Product 9 1530970522. 55032280 0.0359
## 2 Product 4 573234677. 25227228 0.0440
## 3 Product 6 586008844. 20631630 0.0352
## 4 Product 13 184160011. 3660102 0.0199
## 5 Product 7 175653012. 3272300 0.0186
## 6 Product 12 27426126. 2367693 0.0863
## 7 Product 22 74103150. 2283648 0.0308
## 8 Product 2 36861170. 1580043 0.0429
## 9 Product 5 57227648. 1255641 0.0219
## 10 Product 21 62173027. 1237713 0.0199
data10 <- file %>% filter(Product %in% c("Product 9","Product 6","Product 4","Product 13","Product 7","Product 22","Product 12","Product 5","Product 2","Product 21")) %>% group_by(Date,Product) %>% summarise(total_gross = sum(GrossBookings),total_fees = sum(Fees),exp_percentage = total_fees/total_gross)
lbls <- paste0(month.abb[month(data10$Date)], " ", lubridate::year(data10$Date))
brks <- data10$Date
ggplot(data10, aes(x=Date)) +
geom_line(aes(y=total_fees, col=Product)) +
labs(title="Time Series - Fees by Product",
caption="Source: Uber Data",
y="Sum of Fees") +
scale_x_date(labels = lbls,
breaks = brks) + # change to monthly ticks and labels
theme(axis.text.x = element_text(angle = 90, vjust=0.5), # rotate x axis text
panel.grid.minor = element_blank())
We see that Product 9 accounts for the highest fees, almost double those of Product 4, which aligns with its gross-bookings growth pattern. The trend shows Products 9 and 4 growing rapidly, both more than doubling over the given time frame.
Products 4 and 6 have similar gross bookings, but Product 4 has higher fees than Product 6, meaning Product 4 is more expensive than Product 6.
Summary
In this section, we first remove the outlier and prepare the data for the train and test sets. Since we have to predict the expense percentage, we build two forecasting models, one for Gross Bookings and one for Fees, and calculate the percentage from their forecasts. For data preparation, we aggregate both fields at the Date level by taking the sum of Fees and Gross Bookings and divide them into a training set (January 2015 to March 2016) and a test set (April 2016); as there are only 16 data points, we cannot make large training and test sets. To test the validity of the models, we also began the analysis by aggregating the data at the Date and Product level and building separate models for the 21 product lines, thereby gaining good accuracy on the test set.
For the forecasting model we tried several techniques: Random Forest, ARIMA methods, the Holt-Winters model, and TBATS ("Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components"). We used Mean Absolute Percentage Error (MAPE) to compare the models across the 21 product lines, achieving the best performance with TBATS; a sketch of this comparison appears below. Decent results were achieved with Random Forest as well, but due to the statistical nature of time-series models we also obtain confidence bands, which have broader business implications than a point prediction without confidence. Hence, TBATS was chosen as the final forecasting model.
TBATS incorporates a state space model that generalizes those underpinning exponential smoothing. It allows for an automatic Box-Cox transformation and ARMA errors, providing a very flexible way of accounting for seasonality and trend. Its parameters include a Box-Cox transformation parameter, ARMA error terms, and damping; seasonality and trend are handled through Fourier terms and smoothing parameters, which can be tuned further. Due to the limited data, however, the test-set MAPE can vary from the figures given in this report; when forecasting over longer periods, we see very smooth trends. After building the forecasting model, the expense percentage predicted for May 2016 is 3.52%.
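As an illustration of the model comparison described above (a hedged sketch, not the exact code used: ts_train and ts_test are placeholder names for the aggregated training and test ts objects, built later in this section as ts_data and ts_data_test, and ets() stands in for the Holt-Winters family since a fully seasonal Holt-Winters fit may not run on 15 monthly points):
# sketch: comparing candidate models by test-set MAPE (forecast package, via fpp2)
fits <- list(arima = forecast(auto.arima(ts_train), h = 1),
             ets   = forecast(ets(ts_train), h = 1),   # exponential smoothing family
             tbats = forecast(tbats(ts_train), h = 1))
sapply(fits, function(f) accuracy(f, ts_test)["Test set", "MAPE"])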
#Outlier correction
ggplot(file %>% filter(Product == 'Product 6' & VendorName == "Vendor B" & Country == "Country 2"),aes(x=Date,y=Fees)) + geom_point() + geom_line() + ggtitle("Before outlier correction")
#outlier value
file[file$Product == 'Product 6' & file$VendorName == "Vendor B" & file$Country == "Country 2" & file$Date == as.Date("2015-02-01"),]
## # A tibble: 1 x 6
## Date VendorName Country Product GrossBookings Fees
## <date> <fct> <fct> <fct> <dbl> <int>
## 1 2015-02-01 Vendor B Country 2 Product 6 22749990. 6135380
#correcting the outlier value seems to have an extra 0
file[file$Product == 'Product 6' & file$VendorName == "Vendor B" & file$Country == "Country 2" & file$Date == as.Date("2015-02-01"),]$Fees <- 613538
ggplot(file %>% filter(Product == 'Product 6' & VendorName == "Vendor B" & Country == "Country 2"),aes(x=Date,y=Fees)) + geom_point() + geom_line() + ggtitle("After outlier correction")
After correcting the much-discussed outlier from the EDA section, we begin our model-building process.
#preparing data for forecasting
model_data <- file %>% group_by(Date) %>% summarise(sum_gross_booking = sum(GrossBookings), sum_fees = sum(Fees))
train <- model_data %>% filter(Date < as.Date("2016-04-01"))
test <- model_data %>% filter(Date == as.Date("2016-04-01"))
Train and test sets are prepared: the training data contains values aggregated at the date level from January 2015 to March 2016, on which the forecasting model is built, and we test performance on the test set, which contains the values for April 2016.
#forecasting value for gross booking
data_agg7 <- train[,2]
data_agg8 <- test[,2]
ts_data <- ts(data_agg7$sum_gross_booking, start = c(2015,1), frequency = 12)
ts_data_test <- ts(data_agg8$sum_gross_booking, start = c(2016,4), frequency = 12)
autoplot(ts_data) + ggtitle("Time Series of Total Gross Bookings by Date")
ggAcf(ts_data, lag.max = 12) + ggtitle("ACF of Total Gross Bookings by Date")
ggPacf(ts_data) + ggtitle("PACF of Total Gross Bookings by Date")
Box.test(ts_data, lag = 12, fitdf = 0, type = "Lj")
##
## Box-Ljung test
##
## data: ts_data
## X-squared = 63.053, df = 12, p-value = 6.232e-09
We first plot the time series for gross bookings, which shows a growing trend over time. The ACF plot shows significant autocorrelation at the first two lags, indicating there is sufficient information to build a forecasting model. We further conduct a Ljung-Box test, which gives a p-value of 6.232e-09, indicating the time series is not a white-noise process.
tbats_fit <- tbats(ts_data)
s <- forecast(tbats_fit,h=1)
accuracy(s, ts_data_test)
## ME RMSE MAE MPE MAPE MASE
## Training set -2182985 11709749 8755590 -1.507053 4.345111 0.04998570
## Test set -11838872 11838872 11838872 -3.715805 3.715805 0.06758817
## ACF1
## Training set -0.08430022
## Test set NA
autoplot(forecast(tbats_fit, h=20)) + ggtitle("Forecast for Total Gross booking")
After fitting the forecast model, we assess its accuracy on the train and test sets, obtaining a Mean Absolute Percentage Error of 4.35 on the training set and 3.72 on the test set. When we plot the forecast for the next 20 intervals, the model looks smooth with tight confidence intervals.
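As an additional sanity check (a sketch; not part of the original analysis), the residuals of the fitted TBATS model can be inspected with the forecast package's diagnostics:
# sketch: residual diagnostics (time plot, ACF, histogram, Ljung-Box test)
checkresiduals(tbats_fit)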
#forecasting value for fees
data_agg7 <- train[,3]
data_agg8 <- test[,3]
ts_data <- ts(data_agg7$sum_fees, start = c(2015,1), frequency = 12)
ts_data_test <- ts(data_agg8$sum_fees, start = c(2016,4), frequency = 12)
autoplot(ts_data) + ggtitle("Time Series of Total Fees by Date")
ggAcf(ts_data, lag.max = 10) + ggtitle("ACF of Total Fees")
ggPacf(ts_data) + ggtitle("PACF of Total Fees")
Box.test(ts_data, lag = 12, fitdf = 0, type = "Lj")
##
## Box-Ljung test
##
## data: ts_data
## X-squared = 59.543, df = 12, p-value = 2.735e-08
tbats_fit <- tbats(ts_data)
s <- forecast(tbats_fit,h=1)
accuracy(s, ts_data_test)
## ME RMSE MAE MPE MAPE
## Training set 77435.57 437189.88 322949.80 1.2215451 4.5551206
## Test set -45295.32 45295.32 45295.32 -0.4051141 0.4051141
## MASE ACF1
## Training set 0.049075724 -0.2346443
## Test set 0.006883115 NA
autoplot(forecast(tbats_fit, h=20)) + ggtitle("Forecast for Total Fees")
Similar to the Gross Bookings model, we model total Fees, obtaining a Mean Absolute Percentage Error of 4.56 on the training set and 0.41 on the test set; the 20-step forecast also appears smooth with tight confidence bands. After validating the modelling process, we train the model on the complete data from January 2015 to April 2016, forecast May 2016, and calculate the expense percentage for May 2016.
#training on complete data for gross booking
ts_data <- ts(model_data$sum_gross_booking, start = c(2015,1), frequency = 12)
tbats_fit_gross <- tbats(ts_data)
gross_may_forecast <- forecast(tbats_fit_gross,h=1)
gross_may_forecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## May 2016 335292389 321413698 349171079 314066763 356518014
ts_data <- ts(model_data$sum_fees, start = c(2015,1), frequency = 12)
tbats_fit_fee <- tbats(ts_data)
fee_may_forecast <- forecast(tbats_fit_fee,h=1)
fee_may_forecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## May 2016 11804209 11261517 12346901 10974233 12634185
#Forecasted expense percentage
fee_may_forecast$mean/gross_may_forecast$mean * 100
## May
## 2016 3.520572
The expense percentage forecast for May 2016 is 3.52%.
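To convey a rough sense of uncertainty around this figure (a sketch, not part of the original analysis; dividing interval endpoints treats the two forecasts as independent and does not give a formal interval for a ratio):
# sketch: crude bounds for the expense percentage from the 95% forecast intervals
lo <- fee_may_forecast$lower[, "95%"] / gross_may_forecast$upper[, "95%"] * 100
hi <- fee_may_forecast$upper[, "95%"] / gross_may_forecast$lower[, "95%"] * 100
c(lower = lo, upper = hi)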