Install Required Packages

library(readr)
library(dplyr)
library(tidyr)
library(lubridate)
library(tidyverse)      
library(fpp2)           
library(zoo)

Exploratory Data Analysis and Data Cleaning

FinTech is interested in predicting expenses for a particular financial line item as a % of Gross Bookings. To help explore this question, we have provided a sample dataset of vendors associated with this line item, the expenses paid out, and the total Gross Bookings attributed to that vendor and product. Please include the code you wrote for the analysis and delete the dataset when you have finished with the challenge. Using the attached dataset, please do the following: 1. Perform any cleaning, exploratory analysis, and/or visualizations to use the provided data for this analysis (a few sentences/plots describing your approach will suffice).

Firstly, we begin with importing the file through readr package, assign proper column names and column types to the dataframe.

#importing file, assigning relevant column names and data types
file <- read_csv("/Users/akul/Desktop/Uber/FintechDA_Exercise/Exercise.csv")
names(file) <- c('Date','VendorName','Country','Product','GrossBookings','Fees')
file$Date <- mdy(file$Date)
file$VendorName <- as.factor(file$VendorName)
file$Country <- as.factor(file$Country)
file$Product <- as.factor(file$Product)
file$GrossBookings <- as.integer(file$GrossBookings)
file$Fees <- as.integer(file$Fees)
#Checking for the number of rows
nrow(file)
## [1] 7911
# checking the structure of the dataframe
glimpse(file)
## Observations: 7,911
## Variables: 6
## $ Date          <date> 2015-01-01, 2015-01-01, 2015-01-01, 2015-01-01,...
## $ VendorName    <fct> Vendor A, Vendor A, Vendor A, Vendor A, Vendor A...
## $ Country       <fct> Country 1, Country 2, Country 3, Country 4, Coun...
## $ Product       <fct> Product 1, Product 2, Product 3, Product 1, Prod...
## $ GrossBookings <int> 63, 39295, 127, 4, 4734, 60, -90, 1460, 381472, ...
## $ Fees          <int> 0, 2169, 3, 0, 171, 5, 40, 844, 10681, 385, 23, ...

There are a total of 7911 rows in the data frame

#Checking for missing values
sum(is.na(file))
## [1] 289

There are a total of 289 missing values in the data frame

#removing missing values
file<- na.omit(file)
sum(is.na(file))
## [1] 0

In order to process the data properly, we remove the missing values, now we have a total of 0 missing values

#checking if any duplicate values exist
file %>% group_by(Date,Country,Product,VendorName) %>% mutate(n=n()) %>% filter(n>1)
## # A tibble: 0 x 7
## # Groups:   Date, Country, Product, VendorName [0]
## # ... with 7 variables: Date <date>, VendorName <fct>, Country <fct>,
## #   Product <fct>, GrossBookings <int>, Fees <int>, n <int>

Here, we check for the duplicated values in the dataframe, there are no duplicate values.

#Summary of the variables
summary(file)
##       Date               VendorName         Country           Product    
##  Min.   :2015-01-01   Vendor A:5608   Country 2 : 272   Product 9 :1452  
##  1st Qu.:2015-06-01   Vendor F: 941   Country 7 : 202   Product 4 :1446  
##  Median :2015-11-01   Vendor B: 596   Country 35: 183   Product 2 : 665  
##  Mean   :2015-09-25   Vendor C: 379   Country 1 : 178   Product 3 : 631  
##  3rd Qu.:2016-02-01   Vendor D:  48   Country 25: 176   Product 10: 618  
##  Max.   :2016-04-01   Vendor E:  16   Country 6 : 171   Product 6 : 596  
##                       (Other) :  51   (Other)   :6457   (Other)   :2231  
##  GrossBookings           Fees        
##  Min.   :   -2701   Min.   :      0  
##  1st Qu.:      75   1st Qu.:      3  
##  Median :    2483   Median :    105  
##  Mean   :  444523   Mean   :  15516  
##  3rd Qu.:   39706   3rd Qu.:   1362  
##  Max.   :68736613   Max.   :6135380  
## 

From the Summary of the dataframe the following observations can be made:

  1. GrossBookings has negative values, which should be removed or changed to positive

  2. In our analysis we will be defining the metric of expense percentage hence GrossBookings of value 0 should be removed as expense percentage would be infinite otherwise

  3. The data is available for a period of 16 months from January 2015 to April 2016

  4. As there is a huge difference between the Median and Mean value for Fees and Gross Bookings, we can concur a highly skewed distribution i.e. more values are less than the mean than greater than the mean.

file <- file %>% mutate(GrossBookings = 
                          ifelse(GrossBookings < 0,
                                 (-1 * GrossBookings),
                                 GrossBookings))

file <- file %>% filter(GrossBookings > 0)
nrow(file)
## [1] 7561

After removing the missing values and gross bookings of 0, we have a total of 7561 rows and, hence a total of 350 rows were removed

ggplot(file, aes(x = Fees, y = GrossBookings, col = Product)) + geom_point() + ggtitle("Scatter plot of Gross Bookings vs Fees by Product")

From the Scatter Plot, we see a very good linear trend between Fees and GrossBookings for all the product categories and there is a major outlier for product 6, which has a very high value of Fee. Further investigation of this data point yields that it has a Fee value of 6135380, which looks to be a data entry error with an extra 0, hence we update this point to 613538, this data point is present for February 2015 Product 6 Country 2 and Vendor B, as 613538 would lie between the values for January 2015 and March 2015, we remove the 0 and correct this entry later on in the forecasting section in the analysis.

ggplot(file, aes(x = GrossBookings, fill=Product)) + geom_density(aes(alpha=0.8)) +
  scale_x_log10() + ggtitle("Density Plot for GrossBookings")

ggplot(file, aes(x = Fees, fill=Product)) + geom_density(aes(alpha=0.8)) +
  scale_x_log10() + ggtitle("Density Plot for Fees")

The Density Plots show that there is a great variability for Fees and Gross Bookings for different product lines, there are some product lines which vary across a very narraw range whereas others vary over a very broad spectrum.

ggplot(file, aes(y=GrossBookings, x = Product)) + geom_boxplot(aes(fill=Product)) +  theme(axis.text.x=element_text(angle=45, hjust=1)) + scale_y_log10() +
  ylab("Gross Bookings") +
  xlab("Product") + ggtitle("BoxPlot for Gross Bookings")

ggplot(file, aes(y=Fees, x = Product)) + geom_boxplot(aes(fill=Product)) +  theme(axis.text.x=element_text(angle=45, hjust=1)) + scale_y_log10() +
  ylab("Fees") +
  xlab("Products") + ggtitle("BoxPlot for Fees")

Outlier Analysis

  1. Gross Bookings: Product 10, Product 14, Product 15, Product 5, Product 7 have no outliers present whereas all the product lines have major outliers especially Product 9 both on the lower and the higher end.

  2. Fees- Huge Outlier seen for product 6, building on our finding from the scatter plot

data1 <- file %>% group_by(Date) %>% summarise(tot_gross = sum(GrossBookings), 
    tot_fees = sum(Fees), Exp_percentage = tot_fees/tot_gross)
lbls <- paste0(month.abb[month(data1$Date)], " ", lubridate::year(data1$Date))
brks <- data1$Date

# plot

ggplot(data1, aes(x=Date)) + 
  geom_line(aes(y=tot_gross)) + 
  labs(title="Gross Booking Total Monthly Time Series", 
       caption="Source: Uber Data", 
       y="Sum of Gross Bookings by Date") +  
  scale_x_date(labels = lbls, 
               breaks = brks) +  # change to monthly ticks and labels
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),  # rotate x axis text
        panel.grid.minor = element_blank())

The trend for total gross bookings by Date seems to be very strong, and smoothly increasing over time, we see that value in April 2016 is almost thrice as that of January 2015

lbls <- paste0(month.abb[month(data1$Date)], " ", lubridate::year(data1$Date))
brks <- data1$Date

ggplot(data1, aes(x=Date)) + 
  geom_line(aes(y=tot_fees)) + 
  labs(title="Monthly Time Series for Total Fees", 
       caption="Source: Uber Data", 
       y="Sum of Fees by Date") +  
  scale_x_date(labels = lbls, 
               breaks = brks) +  # change to monthly ticks and labels
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),  # rotate x axis text
        panel.grid.minor = element_blank())

For the total Fees variable, we see a huge spike in the month of February 2015, it can be attributed to the oulier we detected earlier, overall the trend looks to be highly positive similar to gross bookings

lbls <- paste0(month.abb[month(data1$Date)], " ", lubridate::year(data1$Date))
brks <- data1$Date

# plot

ggplot(data1, aes(x=Date)) + 
  geom_line(aes(y=Exp_percentage)) + 
  labs(title="Monthly Time Series for Expense Percentage", 
       caption="Source: Uber Data", 
       y="Expense Percentage by Date") +  
  scale_x_date(labels = lbls, 
               breaks = brks) +  # change to monthly ticks and labels
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),  # rotate x axis text
        panel.grid.minor = element_blank())

Similar to Total Fees by date we see the huge rise in the month of February 2015, the trend is a little random otherwise with slightly increasing, as the data is only available for 15 data points, it becomes very difficult to concur seasonality.

Analysis for Gross Booking

Total Gross Booking by Product

data2 <- file %>% group_by(Product) %>% summarise(sum_gross_booking = sum(GrossBookings),
    sum_fees = sum(Fees), Expense_percentage = sum_fees/sum_gross_booking)

ggplot(data = data2, aes(x = reorder(Product, sum_gross_booking), y = sum_gross_booking, label = sum_gross_booking)) + geom_bar(stat = "identity",
    aes(fill = Product)) + labs(title = "Gross Bookings Distribution by Product", x = "Product", 
    y = "Gross Bookings", color = NULL) + geom_text(color="black", size=2) +coord_flip()

data2 %>% arrange(desc(sum_gross_booking)) %>% head(n=10)
## # A tibble: 10 x 4
##    Product    sum_gross_booking sum_fees Expense_percentage
##    <fct>                  <dbl>    <int>              <dbl>
##  1 Product 9        1530970522. 55032280            0.0359 
##  2 Product 6         586008844. 20631630            0.0352 
##  3 Product 4         573234677. 25227228            0.0440 
##  4 Product 13        184160011.  3660102            0.0199 
##  5 Product 7         175653012.  3272300            0.0186 
##  6 Product 22         74103150.  2283648            0.0308 
##  7 Product 21         62173027.  1237713            0.0199 
##  8 Product 5          57227648.  1255641            0.0219 
##  9 Product 2          36861170.  1580043            0.0429 
## 10 Product 19         29310746.   208736            0.00712
data3 <- file %>% filter(Product %in% c("Product 9","Product 6","Product 4","Product 13","Product 7","Product 22","Product 21","Product 5","Product 2","Product 19")) %>% group_by(Date,Product) %>% summarise(total_gross = sum(GrossBookings),total_fees = sum(Fees),exp_percentage = total_fees/total_gross)

lbls <- paste0(month.abb[month(data3$Date)], " ", lubridate::year(data3$Date))
brks <- data3$Date

ggplot(data3, aes(x=Date)) + 
  geom_line(aes(y=total_gross, col=Product)) + 
  labs(title="Time Series for Gross Booking by product", 
       caption="Source: Uber Data", 
       y="Sum of Gross Booking across product Lines") +  
  scale_x_date(labels = lbls, 
               breaks = brks) +  # change to monthly ticks and labels
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),  # rotate x axis text
        panel.grid.minor = element_blank())

Here, we begin our analyis for Gross Bookings variable, Product 9 accounts for a high number of bookings and has the greatest share of 1.5 Billion followed by Product 6 and Product 3 which are around 5 billion each, which can be collectively called as the ‘Big 3’, all the product lines except the top 13 seem to have very little share.

From the time series plot, we can see that Product 9, 6 and 4 are growing rapidly which is a good indication of the strategy as these products hold the highest number of bookings as well, there are some product lines such as Product 21 and Product 5 that got introduced later on.

Total Gross Booking by Vendor

data3 <- file %>% group_by(VendorName) %>% summarise(sum_gross_booking = sum(GrossBookings),
    sum_fees = sum(Fees), Expense_percentage = sum_fees/sum_gross_booking)

ggplot(data3, aes(x = reorder(VendorName, sum_gross_booking), y = sum_gross_booking,label=sum_gross_booking)) + 
  geom_point(stat='identity', fill="black", size=6)  +
  geom_segment(aes(y = 0, 
                   x = VendorName, 
                   yend = sum_gross_booking, 
                   xend = VendorName), 
               color = "black") +
  geom_text(color="red", size=3) +
  labs(title="Gross Booking by Vendor", y = "sum of gross booking", x ="Vendor")  +
  coord_flip()

data4 <- file %>% group_by(Date,VendorName) %>% summarise(total_gross = sum(GrossBookings),total_fees = sum(Fees),exp_percentage = total_fees/total_gross)

lbls <- paste0(month.abb[month(data4$Date)], " ", lubridate::year(data4$Date))
brks <- data4$Date

ggplot(data4, aes(x=Date)) + 
  geom_line(aes(y=total_gross, col=VendorName)) + 
  labs(title="Time Series - Gross Booking by Vendor", 
       caption="Source: Uber Data", 
       y="sum of gross bookings") +  
  scale_x_date(labels = lbls, 
               breaks = brks) +  # change to monthly ticks and labels
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),  # rotate x axis text
        panel.grid.minor = element_blank())

From the diverging point plot, we see that Vendor D, Vendor A, Vendor F and Vendor B hold the maximum percentage of bookings campared to the rest of the vendors with Vendor D holding close to 910 Million.

The trend analysis shows that Vendor D, Vendor F and Vendor B have grown overtime. All the other vendors except these 3 have a somewhat flat trend.

Vendor D’s growth is more than that of Vendor A’s. Vendor D,A and B are showing almost similar patterns of growth and drop, there could be some seasonality involved but due to lack of data it becomes difficult to conclude.

Total Gross Booking by Country

data5 <- file %>% group_by(Country) %>% summarise(sum_gross_booking = sum(GrossBookings),
    sum_fees = sum(Fees), Expense_percentage = sum_fees/sum_gross_booking)

data5 %>% arrange(desc(sum_gross_booking)) %>% head(n=10)
## # A tibble: 10 x 4
##    Country    sum_gross_booking sum_fees Expense_percentage
##    <fct>                  <dbl>    <int>              <dbl>
##  1 Country 2        2168076928. 74002459             0.0341
##  2 Country 35        314456982.  5628034             0.0179
##  3 Country 3         268718127.  7966423             0.0296
##  4 Country 7         124782407.  3224414             0.0258
##  5 Country 18         87483474.  6356664             0.0727
##  6 Country 8          72803404.  4298168             0.0590
##  7 Country 54         60684454.   975648             0.0161
##  8 Country 19         30796422.   515417             0.0167
##  9 Country 40         30732266.  1734012             0.0564
## 10 Country 29         23485182.  1562883             0.0665
ggplot(data5 %>% arrange(desc(sum_gross_booking)) %>% head(n=10) , aes(x = reorder(Country, sum_gross_booking), y = sum_gross_booking,label=sum_gross_booking)) + 
  geom_point(stat='identity', fill="black", size=6)  +
  geom_segment(aes(y = 0, 
                   x = Country, 
                   yend = sum_gross_booking, 
                   xend = Country), 
               color = "black") +
  geom_text(color="red", size=3) +
  labs(title="Gross Booking by Country", y = "sum of gross booking", x ="Country")  +
  coord_flip()

data6 <- file %>% filter(Country %in% c("Country 2","Country 35","Country 3","Country 7","Country 18","Country 8","Country 54","Country 19","Country 40","Country 29")) %>% group_by(Date,Country) %>% summarise(total_gross = sum(GrossBookings),total_fees = sum(Fees),exp_percentage = total_fees/total_gross)

lbls <- paste0(month.abb[month(data6$Date)], " ", lubridate::year(data6$Date))
brks <- data6$Date

ggplot(data6, aes(x=Date)) + 
  geom_line(aes(y=total_gross, col=Country)) + 
  labs(title="Time Series - Gross Booking by Country", 
       caption="Source: Uber Data", 
       y="sum of gross bookings") +  
  scale_x_date(labels = lbls, 
               breaks = brks) +  # change to monthly ticks and labels
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),  # rotate x axis text
        panel.grid.minor = element_blank())

For Country wise analysis, we see that country 2 holds the maximum share among all the countries i.e. close to 2 Billion and from the trend analysis also we see that the gross bookings for country 2 have more than doubled in the given time period, Country 35 and 40 also have a good growth trend, for the all the other countries the trend is fairly flat.

Analysis for Fees

Analysis of fees by Country

data7 <- file %>% group_by(Country) %>% summarise(sum_gross_booking = sum(GrossBookings),
    sum_fees = sum(Fees), Expense_percentage = sum_fees/sum_gross_booking) %>% 
  arrange(desc(sum_fees)) %>% head(n=15)

ggplot(data7, aes(x = reorder(Country, sum_fees), y = sum_fees,label=sum_fees)) + 
  geom_point(stat='identity', fill="black", size=6)  +
  geom_segment(aes(y = 0, 
                   x = Country, 
                   yend = sum_fees, 
                   xend = Country), 
               color = "black") +
  geom_text(color="red", size=3) +
  labs(title="Fees by Country" ,y = "Sum of Fees", x ="Country") +
  coord_flip()

data8 <- file %>% filter(Country %in% c("Country 2", "Country 3","Country 18","Country 35","Country 8","Country 7","Country 40","Country 29","Country 31","Country 53")) %>% group_by(Date,Country) %>% summarise(total_gross = sum(GrossBookings),total_fees = sum(Fees),exp_percentage = total_fees/total_gross)

lbls <- paste0(month.abb[month(data8$Date)], " ", lubridate::year(data8$Date))
brks <- data8$Date

ggplot(data8, aes(x=Date)) + 
  geom_line(aes(y=total_fees, col=Country)) + 
  labs(title="Time Series - Fees by Country", 
       caption="Source: Uber Data", 
       y="Sum of Fees") +  
  scale_x_date(labels = lbls, 
               breaks = brks) +
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),  # rotate x axis text
        panel.grid.minor = element_blank())

Similar to gross booking, Country 2 has good growth trend for fees as well as accounts for the maximum share in the fees

Analysis of fees by Vendor

data9 <- file %>% group_by(VendorName) %>% summarise(sum_gross_booking = sum(GrossBookings),
    sum_fees = sum(Fees), Expense_percentage = sum_fees/sum_gross_booking)

ggplot(data9, aes(x = reorder(VendorName, sum_fees), y = sum_fees,label=sum_fees)) + 
  geom_point(stat='identity',fill="black", size=6)  +
  geom_segment(aes(y = 0, 
                   x = VendorName, 
                   yend = sum_fees, 
                   xend = VendorName), 
               color = "black") +
  geom_text(color="red", size=3) +
  labs(title="Fees by Vendor" ,y = "Sum of Fees", x ="Vendor")  +
  coord_flip()

data8 <- file  %>% group_by(Date,VendorName) %>% summarise(total_gross = sum(GrossBookings),total_fees = sum(Fees),exp_percentage = total_fees/total_gross)

lbls <- paste0(month.abb[month(data8$Date)], " ", lubridate::year(data8$Date))
brks <- data8$Date

ggplot(data8, aes(x=Date)) + 
  geom_line(aes(y=total_fees, col=VendorName)) + 
  labs(title="Time Series - Fees by Vendor", 
       caption="Source: Uber Data", 
       y="Sum of Fees") +  
  scale_x_date(labels = lbls, 
               breaks = brks) +
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),  # rotate x axis text
        panel.grid.minor = element_blank())

Vendor A has the highest fees and Vendor D’s is approximately the same, whereas gross booking by Vendor D is more than that of Vendor A, indicating that Vendor A is more expensive than Vendor D similarly Vendor B is more expensive than Vendor F

Analysis of fees by Product

data9 <- file %>% group_by(Product) %>% summarise(sum_gross_booking = sum(GrossBookings),
    sum_fees = sum(Fees), Expense_percentage = sum_fees/sum_gross_booking)

ggplot(data = data9, aes(x = reorder(Product, sum_fees), y = sum_fees,label=sum_fees)) + 
  geom_bar(stat = "identity",
    aes(fill = Product)) + labs(title="Fees by Product" ,y = "Sum of Fees", x ="Product") + geom_text(color="black", size=2) + coord_flip()

data9 %>% arrange(desc(sum_fees)) %>% head(n=10)
## # A tibble: 10 x 4
##    Product    sum_gross_booking sum_fees Expense_percentage
##    <fct>                  <dbl>    <int>              <dbl>
##  1 Product 9        1530970522. 55032280             0.0359
##  2 Product 4         573234677. 25227228             0.0440
##  3 Product 6         586008844. 20631630             0.0352
##  4 Product 13        184160011.  3660102             0.0199
##  5 Product 7         175653012.  3272300             0.0186
##  6 Product 12         27426126.  2367693             0.0863
##  7 Product 22         74103150.  2283648             0.0308
##  8 Product 2          36861170.  1580043             0.0429
##  9 Product 5          57227648.  1255641             0.0219
## 10 Product 21         62173027.  1237713             0.0199
data10 <- file %>% filter(Product %in% c("Product 9","Product 6","Product 4","Product 13","Product 7","Product 22","Product 12","Product 5","Product 2","Product 21")) %>% group_by(Date,Product) %>% summarise(total_gross = sum(GrossBookings),total_fees = sum(Fees),exp_percentage = total_fees/total_gross)

lbls <- paste0(month.abb[month(data10$Date)], " ", lubridate::year(data10$Date))
brks <- data10$Date

ggplot(data10, aes(x=Date)) + 
  geom_line(aes(y=total_fees, col=Product)) + 
  labs(title="Time Series - Fees by Product", 
       caption="Source: Uber Data", 
       y="Sum of Fees") +  
  scale_x_date(labels = lbls, 
               breaks = brks) +  # change to monthly ticks and labels
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),  # rotate x axis text
        panel.grid.minor = element_blank())

We see that Product 9 accounts for the highest Fees which aligns with the growth patterns of gross bookings for product 9, almost double than that of Product 4. From the growth trend we see the Product 9 and Product 4 are growing rapidly, both have more than doubled their values in the given time frame.

Product 9 has the highest Fees which is aligned with the Product 9 Gross booking.Product 4 and 6 have similar gross bookings but Product 4 has higher fees than Product 6 means Product 4 is more expensive than Product 6.

Forecasting

  1. Build a predictive model to help Uber determine the % of Gross Bookings that will be spent in May 2016. Discuss why you chose your approach, what alternatives you considered, and any concerns you have. How valid is your model? Include any key indicators of model performance.

Summary

In this section, we first begin by removing the outlier and preparing data for train and test sets, as we have to predict the percentage of expense we make two forecasting models one for predicting gross bookings and the other for predicting fees and calculate the percentage, for data preparation, we aggregate both the fields at the Date level by taking sum of Fees and Gross Bookings and diving them in train set(Jan 2015 to March 2016) and test Set (April 2016), as there are only 16 data points we cannot make huge training and test sets, however to test the validity of the models, I had begun my analysis by aggregating the data at the Date and Product level and making different models for 21 product lines and thereby gaining good accuracy on the test set. For making the forecast model I tried different techniques such as Random Forest, Arima Methods, Holtz- Winter Model, TBATS (“Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components”). I used the metric of Mean Absolute Percentage Error to compare different models across the 21 product lines achieving the best performance through TBATS. Decent results were achieved through Random Forest as well, but due to core statistical nature of time-series models we get the confidence bands as well, which can be have broader business implications than a prediction with no confidence. Hence, the final forecasting model was chosen through implementation of TBATS. The TBATS incorporates a state space model that is a generalization of those underpinning exponential smoothing. It also allows for automatic Box-Cox transformation and ARMA errors, it provides a very flexible way of accounting for seasonality and trends. The parameters include a box-cox transformation parameter, ARMA errors, damping. The seasonality and trend is easily handled in this model through the fourier transformations and smoothening parameters, which can be further tuned. Due to limited data however, the accuracy on test set in terms of Mean Absolute Percentage Error can vary from the metric given in this report, however when forecasting for longer periods, we see very smooth trends. After making the forecasting model, the expense percentage predicted for May 2016 is 3.52%.

#Outlier correction

ggplot(file %>% filter(Product == 'Product 6' & VendorName == "Vendor B" & Country == "Country 2"),aes(x=Date,y=Fees)) + geom_point() + geom_line() + ggtitle("Before outlier correction")

#outlier value
file[file$Product == 'Product 6' & file$VendorName == "Vendor B" & file$Country == "Country 2" & file$Date == as.Date("2015-02-01"),]
## # A tibble: 1 x 6
##   Date       VendorName Country   Product   GrossBookings    Fees
##   <date>     <fct>      <fct>     <fct>             <dbl>   <int>
## 1 2015-02-01 Vendor B   Country 2 Product 6     22749990. 6135380
#correcting the outlier value seems to have an extra 0
file[file$Product == 'Product 6' & file$VendorName == "Vendor B" & file$Country == "Country 2" & file$Date == as.Date("2015-02-01"),]$Fees <- 613538

ggplot(file %>% filter(Product == 'Product 6' & VendorName == "Vendor B" & Country == "Country 2"),aes(x=Date,y=Fees)) + geom_point() + geom_line() + ggtitle("After outlier correction")

After removing the much discussed outlier from the EDA section, we begin our model building process.

#preparing data for forecasting
model_data <- file %>% group_by(Date) %>% summarise(sum_gross_booking = sum(GrossBookings), sum_fees = sum(Fees))
train <- model_data %>% filter(Date < as.Date("2016-04-01"))
test <- model_data %>% filter(Date == as.Date("2016-04-01"))

Train and Test sets are prepared, the training data contains values aggregated at date level from January 2015 to March 2016, the forecasting model is built on this, and we test the performance on test set which has the values for April 2016.

#forecasting value for gross booking
data_agg7 <- train[,2]
data_agg8 <- test[,2]
ts_data <- ts(data_agg7$sum_gross_booking, start = c(2015,1), frequency = 12)
ts_data_test <- ts(data_agg8$sum_gross_booking, start = c(2016,4), frequency = 12)
autoplot(ts_data) + ggtitle("Time Series of Total Gross Bookings by Date")

ggAcf(ts_data, lag.max = 12) + ggtitle("ACF of Total Gross Bookings by Date")

ggPacf(ts_data) + ggtitle("PACF of Total Gross Bookings by Date")

Box.test(ts_data, lag = 12, fitdf = 0, type = "Lj")
## 
##  Box-Ljung test
## 
## data:  ts_data
## X-squared = 63.053, df = 12, p-value = 6.232e-09

We first begin by plotting the time series for gross booking, which shows a growing trend overtime, from the ACF plot also we see that autocorrelation for the first 2 lags is significant, indicating there is suffcient information to make a forecasting model. We further conduct a Ljung-Box test which gives us a p-value of 6.232e-09, indicating the time-series is not a random white noise process.

tbats_fit <- tbats(ts_data)
s <- forecast(tbats_fit,h=1)
accuracy(s, ts_data_test)
##                     ME     RMSE      MAE       MPE     MAPE       MASE
## Training set  -2182985 11709749  8755590 -1.507053 4.345111 0.04998570
## Test set     -11838872 11838872 11838872 -3.715805 3.715805 0.06758817
##                     ACF1
## Training set -0.08430022
## Test set              NA
autoplot(forecast(tbats_fit, h=20)) + ggtitle("Forecast for Total Gross booking")

After making the forecast model we assess its accuarcy on the train and test set, getting a Mean Absolute Percentage Error of 4.34 on the training set and 3.71 on the test set. When we plot the forecast for the next 20 intervals, the model looks smooth with tight condifence intervals.

#forecasting value for fees

data_agg7 <- train[,3]
data_agg8 <- test[,3]
ts_data <- ts(data_agg7$sum_fees, start = c(2015,1), frequency = 12)
ts_data_test <- ts(data_agg8$sum_fees, start = c(2016,4), frequency = 12)
autoplot(ts_data) + ggtitle("Time Series of Total Fees by Date")

ggAcf(ts_data, lag.max = 10) + ggtitle("ACF of Total Fees")

ggPacf(ts_data) + ggtitle("PACF of Total Fees")

Box.test(ts_data, lag = 12, fitdf = 0, type = "Lj")
## 
##  Box-Ljung test
## 
## data:  ts_data
## X-squared = 59.543, df = 12, p-value = 2.735e-08
tbats_fit <- tbats(ts_data)
s <- forecast(tbats_fit,h=1)
accuracy(s, ts_data_test)
##                     ME      RMSE       MAE        MPE      MAPE
## Training set  77435.57 437189.88 322949.80  1.2215451 4.5551206
## Test set     -45295.32  45295.32  45295.32 -0.4051141 0.4051141
##                     MASE       ACF1
## Training set 0.049075724 -0.2346443
## Test set     0.006883115         NA
autoplot(forecast(tbats_fit, h=20)) + ggtitle("Forecast for Total Fees")

Similar to making the forecasting model for Gross Booking, we model the total fees getting a Mean Absolute Percentage Error of 4.55 on Training set and 0.4 on the test set, the next 20 forecast also appear to be pretty smooth with tight confidence bands. After validating the modelling process, we train the model on the complete data from January 2015 to April 2016 and make forecasts for May 2016 and calculate the expense percentage for May 2016.

#training on complete data for gross booking
ts_data <- ts(model_data$sum_gross_booking, start = c(2015,1), frequency = 12)
tbats_fit_gross <- tbats(ts_data)
gross_may_forecast <- forecast(tbats_fit_gross,h=1)
gross_may_forecast
##          Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## May 2016      335292389 321413698 349171079 314066763 356518014
ts_data <- ts(model_data$sum_fees, start = c(2015,1), frequency = 12)
tbats_fit_fee <- tbats(ts_data)
fee_may_forecast <- forecast(tbats_fit_fee,h=1)
fee_may_forecast
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## May 2016       11804209 11261517 12346901 10974233 12634185
#Forecated expense percentage
fee_may_forecast$mean/gross_may_forecast$mean * 100
##           May
## 2016 3.520572

The Expense percentage calculated for May 2016 is 3.52 percent