Joshua Hummell

Abstract

The advertising agency team, which partners with a Fortune 500 company's eCommerce team, wants to change how it allocates budget each month. Traditionally, the team built tools in Excel to help allocate the budget, but as budgets have grown and uncertainty has risen, the current method has become unsatisfactory. The team advertises on Search and Social channels, and budget optimization is performed on each group of channels individually.

The premise of this paper can be summed up with three questions: What is the best mix of spend within the two budgets? How do these contributions change over time? And is there a way to predict the performance of the budgets?

The method for addressing these questions consists of four processes: data exploration, data preparation, modeling with Meta’s Robyn, and forecasting revenue to find the Return on Ad Spend (ROAS), henceforth referred to as the ROAS model. The data is initially cleaned and then split in two: one part for the Robyn model and the other for the ROAS model. Both have data from nine unique channels, but Robyn uses data from 10-01-2018 to 04-30-2023 to better account for trend, seasonality, and holiday influences, which equates to 55,044 rows of data, while the ROAS model uses data from 01-01-2021 for search channels and from 01-01-2019 for social channels.

The Robyn model proved to be the most challenging because it requires significant computational power and time to run, since it uses Nevergrad for hyperparameter tuning. For the ROAS model, experimentation began with GLS, moved to penalized linear models (Ridge Regression), then a MARS model, before settling on a boosted model from the XGBoost package.

For Search, the model created attained an RSquared of .93 and an RMSE of ~$279k, which is suitable when revenue is in the hundreds of thousands to the low millions. For Social, the model created attained an RSquared of .92 and an RMSE of roughly $1 million, which is suitable since Social revenue is in the tens of millions. In the end, the ROAS model was run on the budgets provided by the team and the budgets provided by Robyn, and it estimated that Social ROAS would increase by 2.2% (increasing revenue by ~$438k) and Search ROAS would increase by 7% (increasing revenue by ~$500k).

Key Words: Budget Optimization, Robyn, Prophet, Regression, Gradient Boosting

Introduction

Joshua Hummell is the sole author of this paper. He is currently a Manager, Data and Analysis at Digitas, an advertising subsidiary of Publicis Groupe. He sits on the Fortune 500 company eCommerce team, handling the engineering, analysis, and all other data-related tasks for the Investment and Strategy teams.

Problem Statement

John Wanamaker, the original American marketer, once said, “half the money I spend on advertising is wasted; the trouble is I don’t know which half” (Echoes, 1919, p. 333). This is the inherent problem with advertising across multiple channels: you get results, but are they the best results?

Each month, the eCommerce team at the Fortune 500 company gives the agency a separate budget to split between social media and search engine channels, with the main goal of driving the highest revenue at the highest efficiency, i.e., ROAS. Products sold span multiple lines of business and include hundreds of individual SKUs. The eCommerce team allocates the budget across several “keys”: for social, the key is channel and campaign type; for search, it is platform, channel, and account.

The team currently uses Excel calculations coupled with their intuition and experience to allocate the budget. Recently, the team ran into issues where the budgets were neither spending properly nor attaining ROAS thresholds: in some cases budgets are overspent, and in others they do not spend enough. In the chart below, you can see that ROAS has fluctuated wildly over the last few months.

The goal of this project is to aid the team by creating a Budget Allocation Model to affirm or correct the current budgets. The paper will look at creating optimal budget mixes using Robyn, a tool built by Meta. Robyn is an experimental, ML-powered and semi-automated Marketing Mix Modeling (MMM) open-source package built by the Meta team on top of Prophet, an open-source algorithm for generating time-series models, also built by Meta. This paper does not pursue a traditional MMM, but budget optimization follows a similar process, and therefore Robyn is co-opted for this purpose. The paper will then investigate methods to forecast both the team’s allocation and Robyn’s allocation to predict revenue and ROAS, and then compare the results.

The sponsoring organization for the project is the agency that runs eCommerce ads for the Fortune 500 company. In alignment with the Fortune 500 company, this project will answer three especially important questions.

  1. What is the best level to predict the budget at?

  2. Can the Robyn Model be exploited to create better budgeting mixes than the team?

  3. Can ROAS be forecasted for the current budget and the Robyn budget?

Literature Review

The literature review encompasses several different topics that, when combined, create a holistic proposal for an effective Budget Allocation model.

Marketing Mix Models

While a Marketing Mix Model appears to be different from budget allocation, there are many similarities, as will be shown in this section.

Hierarchical Marketing Mix Models with Sign Constraints is a paper that holistically covers the issues when building an MMM model (Chen, Hao, et al. 2021). The article stipulates that “first, besides being affected by marketing activities, sales volume is also affected by many non-marketing factors, such as prices, holidays, seasonality, etc.” The Fortune 500 company deals with these issues throughout the year. In fact, there are several price changes, holidays, and seasonal sales that need to be accounted for when building a budget allocation model.

In addition, the paper discusses the importance of working with media across several channels. First, marketing activities have different responses to one another; the difference can be seen when comparing a promotional sale vs TV advertising. Second, when creating a model, one needs to take into account carry-over effects of advertising, i.e., Adstock. Adstock is the lag effect seen when looking at high-level and low-level marketing, e.g., a TV ad might prep a consumer for a purchase, but an Instagram ad leads them to the purchase. One way this can be considered is by using target rating points (TRPs), where one looks at the percentage of the target audience that sees the advertisements (Chen, Hao, et al. 2021). This can be done by calculating \[TRP = \frac{I}{A}\] where \(I\) stands for impressions and \(A\) stands for the intended target audience. However, since the optimized budget is not always running, this would be difficult to implement. Ultimately, the solution for implementing an Adstock transformation comes in the transformations and endogeneity section.

The last topic discussed in this paper is the idea of saturation, the concept that the relationship between spend and revenue has a limit. In an ideal world, the relationship between advertising spend and revenue would be linear, but that is not the case in the real world. As will be shown in the transformations and endogeneity section, the relationship typically follows an S- or C-shaped curve. For saturation, the paper suggests using a Weibull transformation to find the limits (Chen, Hao, et al. 2021). This method is used by the Robyn model to account for saturation.

In “Challenges and Opportunities In Media Mix Modeling”, a Google-published paper, the authors survey the current methodologies for Marketing Mix Models and discuss their challenges and solutions. The article warns about bias that comes from targeting groups of people in advertising. This should be considered when using the Fortune 500’s data, as some of the audiences target consumers who have made purchases before; a model created with this data would not be general but would apply only to the audiences that were selected (Chan & Perry, 2017).

One method to correct this is to use search/audience volumes, since “query volume can be used in a MMM to make the estimate of paid search impact be more causal” (Chen, et al. 2017). The paper also recommends testing with simulated data, as it “allows for more control of the conditions of the marketing environment, and provides a flexible and inexpensive environment from which to conduct virtual experiments.” However, it would be difficult to add this data since it lives with another team that will not readily share it. There is another solution: since these audiences are used when an updated version of a product is launched, there is a seasonal component, so if seasonality is accounted for, there will not be an issue. As shown later in the results, this proves to be an effective method.

The article also gives a general regression model: \[y_t = F(x_{t-L+1},\dots,x_t,\, z_{t-L+1},\dots,z_t;\,\Phi), \quad t=1,\dots,T,\]

  • Where \(y_t\) is the sales at time \(t\)
  • \(F(\cdot)\) is the regression function
  • \(x_t = \{x_{t,m},\space m=1,...,M\}\) is a vector of the ad channel variables at time \(t\)
  • \(z_t = \{z_{t,c}, \space c=1,...,C\}\) is a vector of control variables at time \(t\)
  • \(\Phi\) is the vector of parameters in the model
  • \(L\) indicates the longest lag effect that media or control variables have on sales (Chan, D., & Perry, M. 2017).

After understanding an MMM, it is easy to see how it can be used for budget allocation. Essentially, budget allocation follows a process like that of an MMM: first, one needs to understand the saturation for each key and use that data to build response curves; then, given the response curves, one can find the optimal spend for each key via constrained optimization of the curves, which are modeled with a Hill function. The Robyn section will show that the Robyn package does this for MMM, allowing it to be co-opted for budget allocation (“Analysts Guide to MMM”. 2021). A minimal sketch of this idea appears below.
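To make this concrete, here is a minimal, illustrative sketch in R of a Hill-type saturation curve and a brute-force split of a fixed budget across two hypothetical keys. The parameterization and the alpha/gamma values are assumptions for illustration, not Robyn's internal implementation.

hill <- function(spend, alpha, gamma) {
  # Hill saturation: alpha controls the curve's shape, gamma the half-saturation point
  spend^alpha / (spend^alpha + gamma^alpha)
}

budget <- 100000
splits <- seq(0, budget, by = 1000)  # candidate spend on key A; key B gets the rest

# Combined response for each candidate split; 50000/80000 are illustrative max revenues
total_response <- 50000 * hill(splits, alpha = 2, gamma = 30000) +
  80000 * hill(budget - splits, alpha = 1.5, gamma = 60000)

best <- splits[which.max(total_response)]
cat("Spend on key A:", best, "| key B:", budget - best, "\n")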

Penalized Regression, MARS, and XGBoost

Penalized models are important for advertising data, as there is the potential for collinearity within the data (Assaf, et al. 2019). First, the data should be checked for collinearity: “the test can be made more formal by using a Kolmogorov-Smirnov test for testing the equality of the two distributions” (Assaf, et al. 2019). Another method is to use a correlation plot, which is done in the initial analysis.

The next paper discussed is Machine Learning Advances for Time Series Forecasting. There have been several advances in penalized models over the years, from Ridge Regression to the Least Absolute Shrinkage and Selection Operator (LASSO) and Adaptive LASSO, to Elastic Net, which was proposed to combine the strengths of Ridge Regression and LASSO (Masini, Medeiros, & Mendes. 2023). These models keep all the predictor variables in the model but constrain (regularize) the regression coefficients by shrinking them toward zero. If the amount of shrinkage is large enough, these methods can also perform variable selection by shrinking some coefficients exactly to zero, as illustrated below.
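As a quick illustration, the glmnet package (also loaded in the appendix) fits all three with a single mixing parameter alpha; the data here is simulated purely for demonstration.

library(glmnet)
set.seed(42)
x_train <- matrix(rnorm(100 * 5), ncol = 5)  # simulated predictors
y_train <- rnorm(100)                        # simulated response

ridge   <- cv.glmnet(x_train, y_train, alpha = 0)   # ridge regression
lasso   <- cv.glmnet(x_train, y_train, alpha = 1)   # LASSO
elastic <- cv.glmnet(x_train, y_train, alpha = 0.5) # elastic net mixes both

# Coefficients at the cross-validated lambda; LASSO and elastic net can
# shrink some coefficients exactly to zero, performing variable selection
coef(elastic, s = "lambda.min")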

Another advance is Multivariate Adaptive Regression Splines (MARS). MARS is a tool that uses linear models to capture non-linear relationships (Friedman. 1991), providing a convenient alternative to polynomial regression by assessing “cutpoints” (knots) similar to step functions (Boemke, Bradley. 2018). Boemke also offers several methods for implementing MARS in R; a minimal example follows.
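A minimal sketch with the earth package (the implementation Boemke's guide uses), on simulated data with a built-in knot; the data and knot location are assumptions for illustration.

library(earth)
set.seed(42)
df <- data.frame(spend = runif(200, 0, 100))
df$revenue <- 5 * pmin(df$spend, 60) + rnorm(200, sd = 10)  # response flattens past 60

mars_fit <- earth(revenue ~ spend, data = df)
summary(mars_fit)  # reports the hinge functions and the selected knot (~60)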

Last is a tree-based method called XGBoost, a boosting implementation optimized to take full advantage of modern computers through both parallel computing and cache-aware pre-fetching (Chen, Tianqi, and Carlos Guestrin. 2016). In ML, boosting is a sequential ensemble learning technique that converts weak learners into strong learners to increase the accuracy of the model: it takes base models, such as SVMs or decision trees, combines them, and optimizes them to minimize the residuals (Wade, Corey, and Kevin Glynn. 2020). As will be shown, this method results in the lowest RMSE and highest RSquared for the ROAS model.

Transformations and Endogeneity

As shown in the marketing mix models section, advertising data requires several transformations to build the best model. For the lag between ads and purchases, Adstock, there are several transformations that can account for the lag (Joseph, Joy V. 2006). Since the eCommerce team deals with lower-funnel conversions, it benefits from higher-level tactics performed by other teams. Even so, it will later be shown that Adstock transformations help the model.

Also discussed in the marketing mix models section is saturation, the concept that eventually more spend does not lead to an increase in revenue. “The usual approach to account for saturation is to transform the advertising variable to a non-linear scale, for example log or negative exponential transformations. It is this transformed variable that is used in the sales response models” (Joseph, Joy V. 2006). One can also experiment with some of the other methods the author proposes, such as a Simple Decay-Effect Model, Log Decay Model, Negative Exponential Decay Model, or a Logistic (S-Curve) Decay Model; a sketch of the simplest of these appears below.
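As a concrete example, the geometric decay form of Adstock takes only a few lines of R; this mirrors the helper used in the appendix, where lambda is the carry-over rate (0.1 was ultimately chosen for Search).

GeometricSimpleAdstock <- function(advertising, lambda) {
  # each period inherits lambda * the previous period's adstocked value
  as.numeric(stats::filter(advertising, lambda, method = "recursive"))
}

weekly_spend <- c(100, 0, 0, 50, 0)
GeometricSimpleAdstock(weekly_spend, 0.1)
# returns 100.00 10.00 1.00 50.10 5.01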

A larger issue that marketers face is endogeneity, where the estimated effect of a marketing variable on the dependent variable is distorted due to correlation between one or more independent variables and the error term (Ebbes, Papies, van Heerde. 2022). As an example, the price of ice cream could be increased on hotter days, which also bring more sales; if the modeler were not familiar with this strategy, temperature would become part of the error term and the price raises would be associated with higher sales. In the team’s data, a new channel is often added when launching a product; if this were not known, it could be assumed that the new channel led other channels to increase sales.

There are two approaches to avoid endogeneity. The first is to use the REndo package, which addresses endogeneity without external instrumental variables, building on the work of several papers (Gui, 2020). The second is to add instrumental variables (IVs): “to address the endogeneity problem, a popular approach is to find one or more additional variables… which correlate with the price variable but not with the unobserved determinants of sales (that are part of the error term)” (Ebbes, Papies, van Heerde. 2022). In the case of the ice cream vendor, the data scientist could add cost of goods sold as an IV. For advertising purposes and for building out the budget optimization, dummy variables can be added for sales, new products, or even holidays, as sketched below.
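A short sketch with fastDummies (the package used in the appendix); the launch_flag column is hypothetical and stands in for any known launch calendar.

library(fastDummies)
df <- data.frame(
  month       = c("2023-01", "2023-02", "2023-03"),
  launch_flag = c("none", "launch", "none"),  # hypothetical launch calendar
  holidays    = c(2, 1, 0)
)
# launch months become explicit regressors instead of ending up in the error term
df <- dummy_cols(df, select_columns = "launch_flag", remove_first_dummy = TRUE)
head(df)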

Macroeconomic Factors, i.e., External Data

Another crucial factor in budget optimization is adding macroeconomic factors. Adding external data to a model accounts for exogenous causes and helps reduce error. This allows for augmented decision-making and can create a better understanding of consumer decisions (Brown. 2021).

For budget optimization, it is important to look at the macroeconomic trends that are affecting the country. Exogenous data is publicly available through the Federal Reserve Economic Data (FRED) website (Wagner, 2021). The FRED series used, each lagged by two months (see the appendix), are:

  • Consumer Sentiment (UMCSENT)
  • Real Disposable Personal Income (DSPIC96)
  • Average Gasoline Price, Unleaded Regular (APU000074714)
  • Median CPI Inflation (MEDCPIM158SFRBCLE)

Meta Tools

The last topic of discussion is Meta’s open-source data science solutions. Meta has published tools created specifically for time series forecasting and MMMs. The appeal of using Meta’s tooling for the MMM is that, once the initial methodologies are worked out and the model is selected, any junior analyst can run it. Robyn also automatically applies transformations for both Adstock and saturation, and even recalculates each month so that model drift is mitigated.

Robyn is Meta’s ML-powered and semi-automated Marketing Mix Modeling (MMM) open-source package. Robyn aims to reduce human bias in the modeling process, especially by automating modelers’ decisions like adstocking, saturation, trend & seasonality as well as model validation. Moreover, the budget allocator & calibration enable actionability and causality of the results (Facebookexperimental. 2021). For budget optimization, it uses Meta’s Prophet and Nevergrad tools, along with automated ridge regression to regularize multicollinearity and prevent overfitting.

Prophet is Meta’s open-source library for forecasting time series data and is included in the Robyn code to decompose the time series into trend, seasonality, and holidays (“Analysts Guide to MMM”. 2021). Unlike other time series models, Prophet allows multivariate models to account for holidays/promotions (“Analysts Guide to MMM”. 2021). This in turn changes the general time series equation from

\(y(t) = g(t) + s(t) + e(t)\) to \(y(t) = g(t) + s(t) + h(t) + e(t)\)

where \(g\) stands for trend, \(s\) for season, \(h\) for holiday, and \(e\) for the error term. A minimal standalone Prophet fit is sketched below.
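For illustration, here is a minimal standalone Prophet fit on simulated weekly revenue with a user-supplied holiday table. Robyn calls Prophet internally, so this step is not required in the actual pipeline; the sketch assumes the prophet package is installed, and the data and holiday dates are placeholders.

library(prophet)
weekly <- data.frame(
  ds = seq(as.Date("2021-01-04"), by = "week", length.out = 120),
  y  = rnorm(120, mean = 1e6, sd = 2e5)  # simulated weekly revenue
)
hols <- data.frame(holiday = "black_friday",
                   ds = as.Date(c("2021-11-26", "2022-11-25")))

m        <- prophet(weekly, holidays = hols, weekly.seasonality = FALSE)
future   <- make_future_dataframe(m, periods = 8, freq = "week")
forecast <- predict(m, future)  # yhat decomposes into g(t) + s(t) + h(t)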

Nevergrad is an open-source platform for black-box optimization. It is a population-control algorithm that fixes a bias found in other population-control methods and has proven robust for noisy optimization of continuous variables (Bennet, Doerr, Moreau, Rapin, Teytaud, & Teytaud 2021). Within Robyn, it is the tool used to find the optimal model: the best model fit, the optimal saturation curve, and the optimal Adstock transformation (“Analysts Guide to MMM”. 2021).

Methodology

As stated in the Introduction, the paper’s purpose is to answer three questions:

  1. What is the best level to predict the budget at?

  2. Can the Robyn Model be exploited to create better budgeting mixes than the team?

  3. Can ROAS be forecasted for the current budget and the Robyn budget?

Another stipulation is that there must be an Excel-based template for entering the budget mix, and the results must also be shared in Excel.

The method for addressing these questions consists of four separate processes: Data Exploration, Data Preparation, Robyn Modeling, and ROAS Modeling and forecasting. The nature of each is described below:

  • Data Exploration: Investigating and pruning variables contained within the data set, checking data quality and fidelity as well as understanding the distributions and shapes of the variables. In this section, the best level to predict the budget will be determined. The method will be based on team input and data availability. This leads to the use of “keys”.

  • Data Preparation: Missing or invalid data values need to be handled, and the master data partitioned into dedicated “Training” and “Evaluation” subsets. Training and evaluation are necessary to confirm the model’s accuracy before the results are shared with the team. This paper splits by time, with training occurring before March 2023 and testing on March and April 2023.

  • Robyn Modeling: Robyn requires the user to submit data in a specific format for the open-source package to work. Then, the best-suited model will be chosen to predict the optimal budget mix. The result of this step is a chart that shows the budget mix between channels for May 2023.

  • ROAS Modeling: This section will answer whether the Robyn model can create better budgeting mixes than the team. It will also delve into the forecast for the budgets. It will begin with identification, development, and testing of task-appropriate models. There will be an experimental component for tuning hyperparameters and understanding how transformations affect the model’s results. The optimal model will be chosen as the one with the highest RSquared value and the lowest RMSE. Then, the eCommerce team will be consulted to confirm the estimations fall within the range of their expected historical performance. The results will be a forecast for May based on the eCommerce team’s budget mix and the Robyn budget mix. There will be a follow-up at the end of the month to see if the actual numbers approximate the forecasted values.

During the ROAS Modeling phase, several experiments were run on the data to get the best results possible, including:

  • Logging transformations
  • Box-Cox transformations
  • Adstock transformations
  • Poly transformations
  • Adding constants to linear regression
  • Tuning hyperparameters for Ridge Regression, MARS, and XGBoost
  • Testing minimum thresholds for data points
  • Analyzing several different leaf counts for XGBoost
  • Trying different objectives and evaluation metrics for XGBoost

The experimentation also included adding in new columns for the models, including:

  • Count of Campaigns
  • Count of Ad Groups
  • Count of Ad Names
  • Audience sizes (Social only)
  • Clicks
  • Impressions
  • CTR (Click Through Rate)

Ultimately, the transformations, hyperparameters, and columns used were selected after careful evaluation and can be seen in the next section.

Experimentation and Results

Data Exploration and Preparation

For the current iteration of the data set, the exploration covered the data's shape, continuous values, missing values, and the time series properties described below.

Data Shape

Initially, the data constituted 24,969 rows with 10 columns. Columns originally included were LOB (Line of Business), Campaign, Ad Group, and Ad Name. Since these points were too granular for building a replicable model and forecasting (there are thousands of ads), it was decided to base the model on the variables shown in the data dictionary.

Continuous Values

The same data is used for both Robyn and the forecast, but aggregated at different levels: the Robyn model is built at the weekly level while the ROAS model is aggregated to the monthly level. Below are six charts: three for the Robyn model’s data at the weekly level, highlighting revenue, spend, and ROAS, and three for the ROAS model at the monthly level, highlighting the same measures. Both are grouped to show the channels’ performance.

For spend, you can see that most of the data is right-skewed, meaning most spends are on the lower side, and the same pattern holds for revenue. ROAS, however, is more normally distributed, with less of a right skew.

At the monthly level, the data has less skew, but several channels do not have sufficient monthly data. The team later advised that the only channels being budgeted for May are SSC and Social. The conclusion section discusses methods for working with new or sparse advertising channels.

Missing Values

There were several spends of $0 that nonetheless showed revenue. This is due to the attribution window from the advertising channels. Attribution windows are usually 7 days, so if a user clicks on an ad and makes a purchase within the next 7 days, it is counted as a sale. This creates a lag in performance, since sales can be counted after a campaign has ended. To account for this in the ROAS model, those weeks were removed, since a campaign that has ended should not be included; they were kept in the Robyn model, however, to better account for Adstock. A sketch of the filter follows.
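A sketch of that filter, using the cost and revenue columns from the appendix; roas_input is a hypothetical name for the filtered data.

library(dplyr)
# drop weeks with $0 spend but positive revenue (post-campaign attribution)
roas_input <- ecom_new %>%
  filter(!(cost == 0 & revenue > 0))
# the Robyn input keeps these rows to better capture Adstock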

Time Series

For Robyn with Social, the model is trained on data from 10-01-2018 to 04-30-2023; after several experiments, this timeline was found suitable for Social. For Search, after several experiments, the timeline chosen is from week 53 of 2020 (12-28-2020) to 04-30-2023. For the ROAS model, the data window is the same as Robyn Search.

For ROAS modeling, keys with under 12 months of data were filtered out, since they would not support a reliable model. The ideal minimum would be 24 months, but many keys fell short because many channels were added after 2021. For Robyn, since the data is at the week level, there are more than enough data points.

Data Preparation

Separation

Since budgets are set at two levels, social and search, the data is split by filtering channels: OLV and Social are considered social, while all other channels are considered search.

Keynames

Collaborating with the team, key pairs were identified on which to base the budget allocation. The social team wanted the budget and forecast at Channel + Campaign Type, and the search team wanted Channel + Platform + Account.

Dummy Variables

For Robyn, there is no need to create the dummy variables, but holidays were added to the Robyn list of holidays.

For the ROAS model, dummy variables were added to capture the time series components in the linear, non-linear, and tree-based models. For Social, dummy variables for months, holidays, quarter, year, method, and keynames were created. For Search, keyname, year, holidays, and months were chosen through experimentation as the most important.

Aggregation

For ROAS modeling, since the budget is fluid at the beginning and end of the month, data was aggregated from week to month. In the aggregation, each week that begins in one month and ends in another is treated as part of the former month: the week that begins 05-29-2023 and ends 06-04-2023, for example, is aggregated to May. Since the team’s budgets change weekly, the last week of May has a dedicated budget despite partly falling in June. The aggregation is sketched below.
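A sketch of the aggregation, mirroring the floor_date() step in the appendix; the grouping is simplified to keyname here, whereas the appendix groups by account, channel, campaign type, and platform before the key is built.

library(dplyr)
library(lubridate)
ecom_monthly <- ecom_new %>%
  mutate(month = floor_date(as.Date(week_from), "month")) %>%  # 2023-05-29 -> 2023-05-01
  group_by(keyname, month) %>%
  summarise(cost = sum(cost), revenue = sum(revenue), .groups = "drop")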

Multicollinearity

Penalized regression, MARS, and XGBoost models account for any multicollinearity present in the data.

Robyn Modeling

Since social and search have separate budgets, a Robyn model was created for each. Shown below are the illustrations and outputs for Search.

Creating the Robyn model took three steps. The first was to format the data the way Robyn requires: the data was spread so that the keynames became columns containing each week’s spend, alongside a column for the week and a column for total weekly revenue (see the sketch below). Robyn also requires the user to pass their own list of holidays.
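A sketch of that reshaping with pivot_wider(), the modern equivalent of the spread() call in the appendix, using the spend2 and rev objects built there.

library(dplyr)
library(tidyr)
robyn_input <- spend2 %>%                       # long form: date, key, spend
  pivot_wider(names_from = key, values_from = spend, values_fill = 0) %>%
  left_join(rev, by = "date")                   # append total weekly revenue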

Second, the keys needed to be specified. Then one passes the variables to the Robyn engine, including instructing it to model Trend, Season, and Holiday. Finally, for the Nevergrad optimizations, one passes the hyperparameter bounds to test for the Adstock transformations and the saturation transformation (specifying the Weibull CDF), along with the parameters for searching for the optimal models.

The chart below shows how the hyperparameters affect the saturation curves.

Search Saturation Curves

The chart below shows how the hyperparameters affect the adstock transformation.

Search Adstock Transformations

Robyn, with hyperparameters set, then runs in the background. First, it transforms cost and revenue to account for Adstock and to create saturation curves. It then fits many ridge regression models that take into account Trend, Seasonality, and Holidays, and tunes the hyperparameters via the Nevergrad gradient-free optimization platform, performing a multi-objective optimization that balances the relationship between spend share and each channel’s coefficient decomposition share and yields a set of Pareto-optimal model solutions.

For each model, Robyn ran 5 trials with 2,000 iterations each; runtime depends on the machine used.

Once run, Robyn outputs two charts showing convergence across the trials. The convergence cloud chart shows how RSquared and normalized RMSE are respectively increased and reduced over time. The convergence distribution chart shows the same information with selected iterations, from the earliest to the latest.

Convergence Cloud

Convergence Distribution

Lastly, Robyn outputs several models and an Excel file in which to search for the best model by one’s criteria. Some models completely cut out channels and recommend $0 spend on them, which is unrealistic for most keys. This paper chose the model that included all keys with the lowest NRMSE and the highest RSquared. The optimal model in this case was 2_128_4.

Model 2_128_4

At the top of the output are the NRMSE and the RSquared. For the model chosen, the NRMSE was suitable compared with the other models, and while the RSquared was a bit low at .667, this model used all keys, whereas some higher-scoring models have fewer keys but higher RSquared values. The reason some models cannot consider all keys is that some keys are newer or had gaps in their data from being paused, so a lower RSquared was expected and is considered acceptable.

Below that, you can see the % effect each variable has on revenue, with Trend accounting for most of the effect.

Below that, you can see where the saturation curves have predicted the optimal share of spend to get the best ROAS.

Below that, you see the optimal Adstock for each key.

Below that, you see the Response curves and mean spends for each channel.

Moving to the top right, you can see the actual vs predicted revenues for each week.

Below that, a spot is reserved for a chart of bootstrap results; this was not run due to computational limits, hence there is no chart there.

Below that, you can see the response to ads, comparing carryover effects with immediate action. Most keys show carryover, with the exception of one key that shows immediate results and another that has no discernible performance attributed to it, the same key that shows a negative effect on revenue.

Lastly, you can see the fitted values vs residuals.

Having chosen this model, the last step in the Robyn workflow is to allocate spend across channels given a certain budget. For May, the budget is ~$1MM USD and the run period is 35 days (May 1st through June 4th). For the allocation, the min and max constraints were set to 80% and 120% of historical spend.

Then the model is run and results in another chart highlighting the optimal mix.

Model 2_128_4 Budget Allocation

In this output, the top of the chart shows overspending: the model indicates the budget should be reduced by 31% to reach optimal spend levels. However, since the Fortune 500 company stipulates that the agency must spend the whole budget, this recommendation can be ignored. The only chart that matters for this paper is the initial vs. optimized budget allocation; when allocating, those percentages should be used. Notice that the negative key is no longer shown: Robyn does not want any money spent on it, as it attributes a negative effect on revenue.

The same process is repeated for social.

ROAS Modeling

Again, separate models were created for the Search and Social channels. The chart below displays the results for each model created. Initially, the models were created at the weekly level, but in discussion the team preferred the budgets to be set at the monthly level, as that is how they input them into the platforms. The platforms then dictate spend changes each week and automatically spend throughout the month.

The process included cleaning the data, adding the variables and dummy variables seen above, and then applying transformations. For all linear models, the DLDM Adstock transformation led to the best results, while untransformed cost led to the best results in the extreme boosted model. For GLS, revenue was logged and spend was poly-transformed.

The Search Adstock lambda (the decay rate for the model) was 0.1; for Social it was 0.005.

For the Search boosted model, the max tree depth was set to 3 and, for the Social boosted model, to 4; in both, the objective was squared log error and the evaluation metric was RMSE, as sketched below.
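A sketch of those settings in xgboost, using the train/test matrices built in the appendix; nrounds = 100 is an assumed value, as the paper does not state the number of boosting rounds.

library(xgboost)
dtrain <- xgb.DMatrix(data = train_x, label = train_y)

params <- list(
  max_depth   = 3,                      # 4 for the Social model
  objective   = "reg:squaredlogerror",  # squared log error objective
  eval_metric = "rmse"
)
xgb_fit <- xgb.train(params = params, data = dtrain, nrounds = 100)
preds   <- predict(xgb_fit, xgb.DMatrix(test_x))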

Search Models:

Social Models:

Conclusions and Next Steps

The paper started with the following questions:

  1. What is the best level to predict the budget at?

  2. Can the Robyn Model be exploited to create better budgeting mixes than the team?

  3. Can ROAS be forecasted for the current budget and the Robyn budget?

This paper concludes that the best level at which to predict the budget is monthly, based on the keys the team wanted to forecast. Robyn worked brilliantly in creating optimal budget mixes, and the ROAS model forecasts revenue for both the team’s budget and the Robyn budget with a high degree of accuracy.

The chart below shows each key with the budget given by the team, the forecasted revenue for the team’s budget, the Robyn budget mix, and the projected revenue from the Robyn model.

Search Results

For Search, the overall ROAS would increase from $6.59 to $7.08, increasing revenue by ~$541k.

Social Results:

For Social, the overall ROAS would increase from $13.15 to $13.44, increasing revenue by ~$438k.

The ROAS model indicates that the Robyn budget mix has both the better ROAS and the better overall revenue. The files are output and saved as Excel so they can easily be reviewed by the team. In discussion, the team requested that the revenue and ROAS values be included in the Excel so they can validate the ROAS forecast on their end. To validate the ROAS model, the May data will be pulled in June and compared with the model’s output to better understand the forecast and its error.

The model will be run again in June so the team can implement the budget mix from Robyn, allowing an evaluation of how accurate the Robyn model is on unseen data. If the team does not agree to evaluate the whole estimation, the test can be run on the key with the lowest impact on spend, so as to limit the risk to the team.

Eventually, the model would need to be housed in the cloud, as the time and resources required are too great for a work computer. This would allow a junior analyst to collaborate with the team to get the Excel, upload it, and get an output. That is essential since some budgets are sent over at the last minute and need to be allocated quickly for the investment team to create the ads.

One point of concern is the process of adding new channels/platforms/accounts. A framework will test the response for each new addition: the team would fluctuate budgets over the course of 8 weeks so a model can be built on the fluctuations, leading to optimal spend early on. If the new addition meets initial expected goals, it can be included in the Robyn model after 12 months of activity, when it will show response to trend, seasonality, and holidays.

This is the case with Snapchat, a channel the team is adding to their list of channels. Snapchat ran only once before, with a goal of lead generation rather than sales; in addition, that budget was determined by the clients, so it did not make sense to include it in this iteration of the model. However, the team will implement the framework so that performance can be gauged and forecasted later in the year.

This framework will also be applied to campaigns that have been paused for an extended amount of time. The emphasis for the Fortune 500 company team is that, to maintain the highest accuracy possible, there should be no gaps in running keys. This might cost more money in the short term but will lead to better ROI and better performance in the long term.

Bibliography

“Analysts Guide to MMM.” (2021). Robyn, https://facebookexperimental.github.io/Robyn/docs/analysts-guide-to-MMM.

Assaf, A. George, et al. (2019). “Diagnosing and Correcting the Effects of Multicollinearity: Bayesian Implications of Ridge Regression.” Tourism Management, vol. 71, pp. 1–8., https://doi.org/10.1016/j.tourman.2018.09.008.

Boemke, Bradley. (2018). “Multivariate Adaptive Regression Splines.” Multivariate Adaptive Regression Splines · UC Business Analytics R Programming Guide, uc-r.github.io/mars. Accessed 11 May 2023.

Brown, Sara. (2021). “Why External Data Should Be Part of Your Data Strategy.” MIT Sloan, https://mitsloan.mit.edu/ideas-made-to-matter/why-external-data-should-be-part-your-data-strategy.

Chan, D., & Perry, M. (2017). Challenges and opportunities in media mix modeling.

Chen, A., Chan, D., Perry, M., Jin, Y., Sun, Y., Wang, Y. & Koehler, J. (2017). Bias correction for paid search in media mix modeling. Forthcoming on https://research.google.com.

Chen, Hao, et al. (2021) “Hierarchical Marketing Mix Models with Sign Constraints.” Journal of Applied Statistics, vol. 48, no. 13-15, pp. 2944–2960., https://doi.org/10.1080/02664763.2021.1946020.

Chen, Tianqi, and Carlos Guestrin. (2016). “Xgboost: A scalable tree boosting system.” Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining.

Ebbes, P., Papies, D., van Heerde, H.J. (2022). Dealing with Endogeneity: A Nontechnical Guide for Marketing Researchers. In: Homburg, C., Klarmann, M., Vomberg, A. (eds) Handbook of Market Research. Springer, Cham. https://doi.org/10.1007/978-3-319-57413-4_8

Ebbes, P., Wedel, M., Boeckenholt, U., & Steerneman, A. (2005). “Solving and Testing for Regressor-Error (In)Dependence When no Instrumental Variables Are Available: With New Evidence for the Effect of Education on Income.” Quantitative Marketing and Economics, 3(4), 365–392.

Echoes, Winona: A Book of Sermons and Addresses Delivered at the Annual Bible Conference. (1919). Winona Lake, Indiana, United States: Committee on Publication. p. 333.

Facebookexperimental. (2021). “Facebookexperimental/Robyn: Robyn Is an Experimental, Automated and Open-Sourced Marketing Mix Modeling (MMM) Package from Facebook Marketing Science.” GitHub, https://github.com/facebookexperimental/Robyn.

Friedman, Jerome H. (1991). “Multivariate Adaptive Regression Splines,” The Annals of Statistics, Ann. Statist. 19(1), 1-67, (March, 1991)

Gui, Raluca. (2020). REndo, v2.4.8.

Joseph, Joy V. (2006). “Understanding Advertising Adstock Transformations.” SSRN Electronic Journal, https://doi.org/10.2139/ssrn.924128.

Masini, R. P., Medeiros, M. C., & Mendes, E. F. (2023). Machine learning advances for time series forecasting. J Econ Surv, 37, 76– 111. https://doi.org/10.1111/joes.12429

Pauline Bennet, Carola Doerr, Antoine Moreau, Jeremy Rapin, Fabien Teytaud, and Olivier Teytaud. (2021). Nevergrad: black-box optimization platform. SIGEVOlution 14, 1, 8–15. https://doi.org/10.1145/3460310.3460312

Wade, Corey, and Kevin Glynn. (2020). Hands-On Gradient Boosting with XGBoost and Scikit-Learn. Packt Publishing.

Wagner, Rich. (2021) “Council Post: External Data: The Key to Building Predictive Models That Help Navigate Uncertainty.” Forbes, Forbes Magazine, https://www.forbes.com/sites/forbestechcouncil/2021/09/30/external-data-the-key-to-building-predictive-models-that-help-navigate-uncertainty/?sh=1568640a348f.

Zivot, Eric, and Jiahui Wang. (2003). “Vector Autoregressive Models for Multivariate Time Series.” Modeling Financial Time Series with S-Plus®, pp. 369–413., https://doi.org/10.1007/978-0-387-21763-5_11.

Appendix with code

Packages

library("broom")
library("caret")
library("dplyr")
library("earth")
library("fabletools")
library("fastDummies")
library("fpp3")
library("glmnet")
library("httr")
library("kableExtra")
library("lubridate")
library("mboost")
library("olsrr")
library("readxl")
library("reshape2")
library("reticulate")
library("Robyn")
library("stringr")
library("tidyverse")
library("timetk")
library("tsibble")
library("urca")
library("writexl")
library("xgboost")

Robyn

Social

# Specify the directory
robyn_directory <- paste0(path, "Social/")  # paste0 takes no sep argument
# filter for the channels
social <- ecom_new_r %>% filter(channel == "Social" | channel == "OLV")

# Create the Key

social$key <- paste0(social$platform, ":", social$campaign_type)
  
social <- social %>% dplyr::select(-c("account", "channel", "campaign_type", "platform", "lob", "week_from"))

# Separate revenue
rev <- social %>% dplyr::select(date, revenue) %>% group_by(date) %>% summarise(revenue = sum(revenue))
rev_c <- sum(rev$revenue)

#Select Spend
spend <- social %>% dplyr::select(date, key, cost)
# Get a check
sum_c <- sum(spend$cost)

# Remove duplicates
spend2 <- spend %>% group_by(date, key) %>% summarise(spend = sum(cost))

# Spread Data
spend3 <- spend2 %>% spread(key = key, value=spend)

# Remove NAs
spend3[,2:length(spend3)][is.na(spend3[,2:length(spend3)])] <- 0
# Confirm accurate
sum(colSums(spend3[,2:length(spend3)])) == sum_c
# merge spend and rev
social <- merge(spend3, rev, by = "date")

# check Rev
rev_c == sum(social$revenue)


columns <- paste(colnames(social))
match <- c("date","Meta.1", "Meta.2", "Meta.3", "Meta.4", "Meta.5",
                        "Meta.6", "Meta.7","Meta.8" ,"Meta.9","Meta.10",          
                        "Meta.11","Meta.12", "Snapchat.1", "TikTok.1","TikTok.2",
                        "Youtube.1", "Youtube.2", "Youtube.3", "total_revenue")

columns
columns <- data.frame(columns, match)


social <- social %>% rename_at(vars(as.character(columns$columns)), ~ as.character(columns$match))

social2 <- social %>% select("date", "Meta.12", "Meta.4", "Meta.2", "total_revenue")


colnames(social)
sapply(social2, class)



social2[,2:4] <- sapply(social2[,2:4], as.numeric)


meltdf <- melt(social2,id="date")

for (i in unique(meltdf$variable)) {
  print(meltdf %>% filter(variable == i) %>%
  ggplot(aes(x=date,y=value,colour=variable,group=variable)) + geom_line()) }

# We can see continuous data in all groups with the exception of App_Retargeting, which may lead to skewed results for that channel. Later it was determined to forecast an appropriate number despite only having 15 weeks worth of data. 


for (i in unique(meltdf$variable)) {
  print(meltdf %>% filter(variable == i) %>%
  ggplot(aes(x=date,y=value,colour=variable,group=variable)) + geom_boxplot()) }

# Boxplots
for (i in unique(meltdf$variable)) {
  print(meltdf %>% filter(variable == i & value > 0) %>%
  ggplot(aes(x=date,y=value,colour=variable,group=variable)) + geom_boxplot()) }

# Here we can see that each key has some outliers, likely due to sales; these will be handled by the time series portion of Robyn.

Robyn Engine Setup

InputCollect <- robyn_inputs(
  dt_input = social2,
  dt_holidays = dt_prophet_holidays,
  date_var = "date", # date format must be "2020-01-01"
  dep_var = "total_revenue", # there should be only one dependent variable
  dep_var_type = "revenue", # "revenue" (ROI) or "conversion" (CPA)
  prophet_vars = c("trend", "season", "holiday"), # "trend","season", "weekday" & "holiday"
  prophet_country = "US", # input one country. dt_prophet_holidays includes 59 countries by default
  #context_vars = c("event"), # e.g. competitors, discount, unemployment etc
  paid_media_spends = c("Meta.12", "Meta.4", "Meta.2"), # mandatory input
  paid_media_vars = c("Meta.12", "Meta.4", "Meta.2"), # mandatory.
  # paid_media_vars must have same order as paid_media_spends. Use media exposure metrics like
  # impressions, GRP etc. If not applicable, use spend instead.
  window_start = "2021-03-01",
  window_end = "2023-04-17",
  adstock = "weibull_cdf" # geometric, weibull_cdf or weibull_pdf.
)

print(InputCollect)

hyper_names(adstock = InputCollect$adstock, all_media = InputCollect$all_media)

hyperparameters <- list(

Meta.2_alphas = c(1,3),
Meta.2_gammas = c(0.1,.9), 
Meta.2_scales = c(0,.1), 
Meta.2_shapes = c(2.01,10), 
Meta.4_alphas = c(1,3), 
Meta.4_gammas = c(0.1,.9), 
Meta.4_scales = c(0,.1), 
Meta.4_shapes = c(2.01,10), 
Meta.12_alphas = c(1,3), 
Meta.12_gammas = c(0.1,.9), 
Meta.12_scales = c(0,.1), 
Meta.12_shapes = c(2.01,10)
)

#### Step 3: Build initial model

## Run all trials and iterations. Use ?robyn_run to check parameter definition
OutputModels <- robyn_run(
  InputCollect = InputCollect, # feed in all model specification
  cores = NULL, # NULL defaults to (max available - 1)
  iterations = 2000, # 2000 recommended for the dummy dataset with no calibration
  trials = 5, # 5 recommended for the dummy dataset
  ts_validation = TRUE, # 3-way-split time series for NRMSE validation.
  add_penalty_factor = FALSE # Experimental feature. Use with caution.
)
print(OutputModels)

## Calculate Pareto fronts, cluster and export results and plots. See ?robyn_outputs
OutputCollect <- robyn_outputs(
  InputCollect, OutputModels,
  pareto_fronts = "auto", # automatically pick how many pareto-fronts to fill min_candidates (100)
  # min_candidates = 100, # top pareto models for clustering. Default to 100
  # calibration_constraint = 0.1, # range c(0.01, 0.1) & default at 0.1
  csv_out = "pareto", # "pareto", "all", or NULL (for none)
  clusters = FALSE, # Set to TRUE to cluster similar models by ROAS. See ?robyn_clusters
  export = create_files, # this will create files locally
  plot_folder = robyn_directory, # path for plots exports and files creation
  plot_pareto = TRUE # Set to FALSE to deactivate plotting and saving model one-pagers
)
print(OutputCollect)

Budget Allocator

select_model <- "2_104_12"

ExportedModel <- robyn_write(InputCollect, OutputCollect, select_model, export = create_files)
print(ExportedModel)

# NOTE: The order of constraints should follow:
InputCollect$paid_media_spends

# Scenario "max_response": "What's the max. return given certain spend?"
# Example 1: max_response default setting: maximize response for latest month
AllocatorCollect1 <- robyn_allocator(
  InputCollect = InputCollect,
  OutputCollect = OutputCollect,
  select_model = select_model,
  date_range = "last_5", # Default last month as initial period
  total_budget = 1422029-75000, # When NULL, default is total spend in date_range
  expected_spend = 1422029-75000,
  expected_spend_days = 35,
  channel_constr_low = 0.8,
  channel_constr_up = c(1.2, 1.2, 1.2),
  # channel_constr_multiplier = 3,
  scenario = "max_response_expected_spend",
  export = create_files
)

ROAS Model

Data Downloading

path_work <- "C:/Users/joshumme/Downloads/Budget Optimization/"
path_home <- "C:/Users/humme/My Drive/CUNY Classes/Data 698  - Analytics Master's Research Project/Budget Optimization/"

path <-  path_work

# Get data
ecom_new <- read.csv(paste0(path, "Ecom_historical_data_exp.csv"), col.names = c("account", "channel", "campaign_type", "platform", "week_from", "cost", "revenue"))

# Get Holidays
holidays <- read_excel(paste0(path, "holidays.xlsx"))

# Create month
ecom_new$month = as.character(floor_date(as.Date(ecom_new$week_from, tryFormats = "%Y-%m-%d"), "month"))

# Aggregate on the month level
ecom_new <- ecom_new %>% dplyr::select(-week_from) %>% group_by(account,channel, campaign_type,platform, month) %>% summarise(cost = sum(cost), revenue = sum(revenue))

# Add in dates and holidays

holidays$week_from <- strptime(as.character(holidays$week_from),  "%Y-%m-%d")
holidays$week_from <- format(holidays$week_from, "%Y-%m-%d")

# Create month in Holidays
holidays$month = as.character(floor_date(as.Date(holidays$week_from, tryFormats = "%Y-%m-%d"), "month"))

# Get a count for each month
holidays <- holidays %>% select(month, holiday) %>% group_by(month) %>% summarise(holidays = n())
holidays$month = as.Date(holidays$month, tryFormats = "%Y-%m-%d")

ecom_new$month = as.Date(ecom_new$month, tryFormats = "%Y-%m-%d")

# Merge holidays
ecom_new <- ecom_new %>%
  left_join(holidays, by = "month")

# Filter for data after 2018-12-31
ecom_new <- ecom_new %>% filter(month> "2018-12-31")


# Add in Year variable
ecom_new$year = lubridate::year(as.Date(ecom_new$month, tryFormats = "%Y-%m-%d"))


# Add in and lag External data by two months

ecom_new$year_month = yearmonth(as.Date(ecom_new$month, tryFormats = "%Y-%m-%d"))

month <- 5
pmonth <- month-1

consumer_sent <- read_csv(paste0(path, "consumer_sent.csv"))
consumer_sent$year_month = yearmonth(as.Date(consumer_sent$DATE, tryFormats = "%m/%d/%Y"))
consumer_sent <-consumer_sent %>% dplyr::select(-DATE)
consumer_sent[nrow(consumer_sent) + 1,] <- list(NA, make_yearmonth(year = 2023, month = pmonth))
consumer_sent[nrow(consumer_sent) + 1,] <- list(NA, make_yearmonth(year = 2023, month = month))
consumer_sent <- consumer_sent %>% tk_augment_lags(1, .lags = 2, .names = 'auto')


disp_inc <- read_csv(paste0(path, "disp_inc.csv"))
disp_inc$year_month = yearmonth(as.Date(disp_inc$DATE, tryFormats = "%m/%d/%Y"))
disp_inc <- disp_inc %>% dplyr::select(-DATE)
disp_inc[nrow(disp_inc) + 1,] <- list(NA, make_yearmonth(year = 2023, month = pmonth))
disp_inc[nrow(disp_inc) + 1,] <- list(NA, make_yearmonth(year = 2023, month = month))
disp_inc <-disp_inc %>% tk_augment_lags(1, .lags = 2, .names = 'auto')


avg_gas <- read_csv(paste0(path, "avg_gas.csv"))
avg_gas$year_month = yearmonth(as.Date(avg_gas$DATE, tryFormats = "%m/%d/%Y"))
avg_gas <-avg_gas %>% dplyr::select(-DATE)
avg_gas[nrow(avg_gas) + 1,] <- list(NA, make_yearmonth(year = 2023, month = pmonth))
avg_gas[nrow(avg_gas) + 1,] <- list(NA, make_yearmonth(year = 2023, month = month))
avg_gas <-avg_gas %>% tk_augment_lags(1, .lags = 2, .names = 'auto')


avg_inf <- read_csv(paste0(path, "avg_inf.csv"))
avg_inf$year_month = yearmonth(as.Date(avg_inf$DATE, tryFormats = "%m/%d/%Y"))
avg_inf <-avg_inf %>% dplyr::select(-DATE)
avg_inf[nrow(avg_inf) + 1,] <- list(NA, make_yearmonth(year = 2023, month = pmonth))
avg_inf[nrow(avg_inf) + 1,] <- list(NA, make_yearmonth(year = 2023, month = month))
avg_inf <-avg_inf %>% tk_augment_lags(1, .lags = 2, .names = 'auto')


ecom_new <- ecom_new %>%
  left_join(consumer_sent[,2:3], by = "year_month") %>%
  left_join(disp_inc[,2:3], by = "year_month") %>%
  left_join(avg_gas[,2:3], by = "year_month") %>%
  left_join(avg_inf[,2:3], by = "year_month") 


colnames(ecom_new)[which(names(ecom_new) == "UMCSENT_lag2")] <- "consumer_sentiment"
colnames(ecom_new)[which(names(ecom_new) == "DSPIC96_lag2")] <- "disposable_income"
colnames(ecom_new)[which(names(ecom_new) == "APU000074714_lag2")] <- "avg_gas"
colnames(ecom_new)[which(names(ecom_new) == "MEDCPIM158SFRBCLE_lag2")] <- "avg_inflation"


head(ecom_new)
# Filter out non conversion based campaign types
ecom_new <- ecom_new %>%
  filter(campaign_type != "Lead Gen" & campaign_type != "App Install")

#there is only one occurrence of Lead Gen - On Site, and it should be app retargeting
ecom_new$campaign_type[ecom_new$campaign_type == "Lead Gen - On Site"] <- "App Retargeting"

# Remove xlob since we have no say in monthly budget (spent yearly)
ecom_new <- ecom_new %>%
  filter(!grepl("xlob",tolower(account)))

# Make sure account names are cleaned
ecom_new$account <- gsub("-", "", ecom_new$account)
ecom_new$account <- gsub("   ", "_", ecom_new$account)
ecom_new$account <- gsub("  ", "_", ecom_new$account)
ecom_new$account <- gsub(" ", "_", ecom_new$account)



#Check the class
sapply(ecom_new, class)
# Get a list of names
ecom_names <- colnames(ecom_new)

# See if there are missing values
colnames(ecom_new)[unlist(lapply(ecom_new, function(x) any(is.na(x))))]

ecom_new[!complete.cases(ecom_new), ]


# All of the missing values are for metrics, so if there is no performance, we can make 0
ecom_new$cost[is.na(ecom_new$cost)] <- 0

ecom_new$holidays[is.na(ecom_new$holidays)] <- 0


# Now let's specify data types (assign the result so the conversion persists)
ecom_new <- transform(ecom_new,
                cost = as.numeric(cost),
                revenue = as.numeric(revenue))

# Last, let's make the numbers more readable and divide by one million
ecom_new$revenue = ecom_new$revenue /1e6
ecom_new$cost = ecom_new$cost /1e6

# remove year month since we no longer need it
ecom_new <- ecom_new %>% dplyr::select(-year_month)

# Check out the summary
summary(ecom_new)

# get a look at the time series data

#Spend
ecom_new %>%
  dplyr::select(month, cost) %>%
  group_by(month) %>% 
  summarise(spend = sum(cost)) %>%
    ggplot(aes(x=as.Date(month), y=spend)) +
    geom_line() + 
    xlab("Week and Year") +
    scale_x_date(date_labels = "%W %y", date_breaks = "8 week")

# Revenue
ecom_new %>%
  dplyr::select(month, revenue) %>%
  group_by(month) %>% 
  summarise(revenue = sum(revenue)) %>%
    ggplot(aes(x=as.Date(month), y=revenue)) +
    geom_line() + 
    xlab("Week and Year") +
    scale_x_date(date_labels = "%W %y", date_breaks = "8 week")



# ROAS
ecom_new %>%
  dplyr::select(month, revenue, cost) %>%
  group_by(month) %>% 
  summarise(revenue = sum(revenue),
            spend = sum(cost)) %>%
  mutate(roas = revenue / spend) %>%
    ggplot(aes(x=as.Date(month), y=roas)) +
    geom_line() + 
    xlab("Week and Year") +
    scale_x_date(date_labels = "%W %y", date_breaks = "8 week")


# Look at histograms for revenue and spend as well as compare spend with Revenue
ecom_plots <- ecom_new %>%
              dplyr::select(month, revenue, cost) %>%
              group_by(month) %>% 
              summarise(revenue = sum(revenue),
                        spend = sum(cost)) %>%
              mutate(roas = revenue / spend)


hist(ecom_plots$revenue)

hist(ecom_plots$spend)


ecom_plots %>%
    ggplot(aes(x=spend, y=revenue)) +
    geom_point() + 
    xlab("Spend and Revenue") 

# Look at seasonal decomp

ecom_tsibble <- ecom_new %>%
              dplyr::select(month, revenue, cost) %>%
              group_by(month) %>% 
              summarise(revenue = sum(revenue),
                        spend = sum(cost)) %>%
              mutate(roas = revenue / spend)


ecom_tsibble <- ecom_tsibble %>%
                  mutate(month = yearmonth(month)) %>%
                  as_tsibble(index = month) %>%
                  fill_gaps()

ecom_tsibble$roas[ecom_tsibble$roas == "Inf"] <- 0

ecom_tsibble %>%
  model(STL(roas ~ trend(window = 7) + season(window = "periodic"), robust = TRUE) ) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for roas")


ecom_tsibble %>%
  model(STL(spend ~ trend(window = 7) + season(window = "periodic"), robust = TRUE) ) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for spend")

ecom_tsibble %>%
  model(STL(revenue ~ trend(window = 7) + season(window = "periodic"), robust = TRUE) ) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for revenue")

rm(ecom_tsibble)

Search

# select the needed columns
ecom_search <- ecom_new %>%
  filter(channel != "Social" & channel != "OLV")

# filter for the date needed

ecom_search <- ecom_search %>% filter(month > "2020-12-31")

# Create Key
ecom_search$keyname <- paste0(ecom_search$channel, ":",ecom_search$platform, ":", ecom_search$account)

# Remove old columns

ecom_search <-  ecom_search %>%
  dplyr::select(-c("account","campaign_type", "platform"))


# Transform data
ecom_search %>% dplyr::mutate_all(as.factor) %>% str

ecom_search <- transform(ecom_search,
                cost = as.numeric(cost),
                revenue = as.numeric(revenue))


# create dummy variables
ecom_search$months = lubridate::month(as.Date(ecom_search$month, tryFormats = "%Y-%m-%d"))

ecom_search <- dummy_cols(ecom_search, select_columns = c("keyname", "year", "holidays", "months"), remove_first_dummy = TRUE)

ecom_search <- ecom_search %>% dplyr::select(-c("holidays",  "year", "months" ))

# remove the channel column, now represented by the keyname dummies
# (account, campaign_type, and platform were already dropped above)
ecom_search <- ecom_search %>%
  dplyr::select(-c("channel"))

# Look at the correlation plot

ecom_soc_cor <- ecom_search  %>% ungroup() %>%  dplyr::select(-keyname)

ecom_soc_cor[] <-lapply(ecom_soc_cor, function(x) as.numeric(as.character(x)))

m <- stats::cor(ecom_soc_cor, method = "pearson")

corrplot::corrplot(m)
## We can see the highest correlation is cost and then several keynames


# Look at hist
for (i in unique(ecom_search$keyname)) {
  
  splot <- ecom_search %>%
    filter(keyname == i) %>%
    ggplot(aes(x=revenue)) +
    geom_histogram(bins = 25)  +
    xlab(paste0("Revenue for ", i)) 
  
  print(splot)
  
}


# remove all data with less than 12 months of data

ng_data <- ecom_search %>%
  group_by(keyname) %>%
  summarise(count = n()) %>%
  filter(count< 12) %>%
  arrange(count)

ng_data


ecom_search_clean <- ecom_search %>%
  filter(!keyname %in% ng_data$keyname)

# plot revenue logged

for (i in unique(ecom_search_clean$keyname)) {
  
  splot <- ecom_search_clean %>%
    filter(keyname == i) %>%
    ggplot(aes(x=log(revenue))) +
    geom_histogram(bins = 25)  +
    xlab(paste0("Revenue for ", i)) 
  
  print(splot)
  
}

# compare spend and revenue relationship via a plot

for (i in unique(ecom_search_clean$keyname)) {
  
  splot <- ecom_search_clean %>%
    filter(keyname == i) %>%
    ggplot(aes(x=cost, y=log(revenue))) +
    geom_point() +
    stat_smooth(method="glm", formula = y~poly(x,2)) +
    xlab(paste0("Spend and Revenue for ", i)) 
  
  print(splot)
  
}

# Add in AdStock Transformations - https://medium.com/@kyliefu/implementation-of-the-advertising-adstock-theory-in-r-62c2cc4b82fd

# Simple geometric model

GeometricSimpleAdstock <- function(advertising, lambda){
  
  # Recursive filter: adstock_t = advertising_t + lambda * adstock_(t-1)
  adstock <- as.numeric(stats::filter(advertising, lambda, method="recursive"))
  
  return(adstock)   
}


ecom_search_clean <- ecom_search_clean %>% 
   group_by(keyname) %>%
   arrange(month) %>%
   mutate(spend_SGM=GeometricSimpleAdstock(cost, .1))


# Geometric log-decay model - we add 1 inside the log because several dates have zero spend

GeometricLogAdstock <- function(advertising, lambda){
  
  # Same recursion applied to log(spend + 1)
  adstock <- as.numeric(stats::filter(log(advertising+1), lambda, method="recursive"))
  
  return(adstock)
}
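
To sanity-check the transformations, here is a toy example (numbers invented for illustration): with lambda = 0.5, each month carries over half of the prior month's adstock.

# adstock_t = spend_t + lambda * adstock_(t-1)
toy_spend <- c(100, 0, 50, 0)
GeometricSimpleAdstock(toy_spend, 0.5)
# 100.0  50.0  75.0  37.5
GeometricLogAdstock(toy_spend, 0.5)
# the same carry-over, applied on the log(spend + 1) scale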

# Now let's implement the two different models and see how they change the data

ecom_search_clean <- ecom_search_clean %>% 
   group_by(keyname) %>%
   arrange(month) %>%
   mutate(spend_DLDM=GeometricLogAdstock(cost, .1))

# Create train and test splits. Since this is quasi-time-series data, we hold out the last four months for testing
set.seed(41)
ecom_search_clean <- ecom_search_clean %>% dplyr::select(-keyname)

ecom_search_clean_train <- ecom_search_clean %>% filter(month < "2023-01-01")
ecom_search_clean_test <-  ecom_search_clean %>% filter(month >= "2023-01-01")


ecom_search_clean_test_rev <- ecom_search_clean_test %>%  dplyr::select(revenue)


ecom_search_clean_test <- ecom_search_clean_test %>% dplyr::select(-revenue)


ecom_search_clean_train <- ecom_search_clean_train %>% dplyr::select(-c("month"))

ecom_search_clean_test <- ecom_search_clean_test %>% dplyr::select(-c("month"))

# Linear Regression

set.seed(56)

ecom.fit1s <- ecom_search_clean_train %>% dplyr::select(-c(spend_SGM, spend_DLDM)) %>% glm((log(revenue+1)) ~ poly(cost) + .,  data = ., family = "gaussian")

ecom.fit2s <- ecom_search_clean_train %>% dplyr::select(-c(cost, spend_DLDM)) %>% glm((log(revenue+1)) ~ poly(spend_SGM) + .,  data = ., family = "gaussian")

ecom.fit3s <- ecom_search_clean_train %>% dplyr::select(-c(spend_SGM, cost)) %>% glm((log(revenue+1)) ~ poly(spend_DLDM)+ .,  data = ., family = "gaussian")


summary(ecom.fit1s)

ggResidpanel::resid_panel(ecom.fit1s, plots='default', smoother = TRUE)

ecom.pred1s <- predict(ecom.fit1s, newdata = ecom_search_clean_test)
ecom.pred2s <- predict(ecom.fit2s, newdata = ecom_search_clean_test)
ecom.pred3s <- predict(ecom.fit3s, newdata = ecom_search_clean_test)

ecom.pred1s  <- exp(ecom.pred1s)-1
ecom.pred2s  <- exp(ecom.pred2s)-1
ecom.pred3s  <- exp(ecom.pred3s)-1
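
A caveat worth flagging (not part of the original workflow): back-transforming with exp() recovers something closer to the conditional median than the conditional mean, which tends to understate revenue. Duan's smearing estimator is a common correction; a minimal sketch for the first model:

# Duan's smearing estimator (illustrative): rescale the exponentiated
# prediction by the mean of the exponentiated log-scale residuals
smear <- mean(exp(residuals(ecom.fit1s)))
ecom.pred1s_smeared <- exp(predict(ecom.fit1s, newdata = ecom_search_clean_test)) * smear - 1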



ecom.results1s <- postResample(ecom.pred1s, ecom_search_clean_test_rev$revenue)
ecom.results2s <- postResample(ecom.pred2s, ecom_search_clean_test_rev$revenue)
ecom.results3s <- postResample(ecom.pred3s, ecom_search_clean_test_rev$revenue)


rbind(ecom.results1s, ecom.results2s, ecom.results3s)


# RMSE in dollars (revenue is modeled in millions)
ecom.results1s[1] * 1e6


cbind(ecom.pred1s,ecom_search_clean_test_rev$revenue)

# The winner is the model using the DLDM AdStock transformation, so we will use it in the other models.

# Ridge Regression

train_x = data.matrix(ecom_search_clean_train %>% dplyr::select(-c(revenue, spend_SGM, cost)))
train_y = data.matrix(ecom_search_clean_train$revenue)

test_x = data.matrix(ecom_search_clean_test %>% dplyr::select(-c(spend_SGM, cost)))
test_y = data.matrix(ecom_search_clean_test_rev$revenue)


#perform k-fold cross-validation to find optimal lambda value
cv_model <- cv.glmnet(train_x, train_y, alpha = 0)

#find optimal lambda value that minimizes test MSE
best_lambda <- cv_model$lambda.min
best_lambda


ridge_search <- glmnet(train_x, train_y, alpha = 0, lambda = best_lambda)
summary(ridge_search)
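
summary() on a glmnet object reports little beyond the call; coef() returns the shrunken coefficients at the fitted lambda. A small convenience sketch (not in the original) to surface the largest effects:

# Ten largest absolute ridge coefficients at best_lambda (intercept excluded)
ridge_coefs <- coef(ridge_search)
head(sort(abs(ridge_coefs[-1, 1]), decreasing = TRUE), 10)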


pred_y = predict(ridge_search, test_x)

ecom.results_ridge <- postResample(pred_y, test_y)

ecom.results_ridge


## MARS Model


ecom_mars <- ecom_search_clean_train %>% dplyr::select(-c(spend_SGM, cost))
ecom_mars_test <- ecom_search_clean_test %>% dplyr::select(-c(spend_SGM, cost))


mars2 <- earth(
  revenue ~ .,  
  data = ecom_mars,
  degree = 1
)

plot(mars2, which = 1)


ecom.pred1s <- predict(mars2, newdata = ecom_mars_test)
ecom.results1s <- postResample(ecom.pred1s, ecom_search_clean_test_rev$revenue)
ecom.results1s 
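
To see which predictors the MARS model actually retains, the earth package's evimp() ranks variable importance; a one-line check:

# Predictors ranked by the number of model subsets, GCV, and RSS criteria
evimp(mars2)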

# XGBoost model - for this one we found that raw cost was the best spend variable

train_x = data.matrix(ecom_search_clean_train %>% dplyr::select(-c(revenue, spend_SGM, spend_DLDM)))
train_y = data.matrix(ecom_search_clean_train$revenue)

test_x = data.matrix(ecom_search_clean_test %>% dplyr::select(-c(spend_SGM, spend_DLDM)))
test_y = data.matrix(ecom_search_clean_test_rev$revenue)


#define final training and testing sets
xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)


set.seed(1452)
#defining a watchlist
watchlist = list(train=xgb_train, test=xgb_test)

#fit XGBoost model and display training and testing data at each iteration
model = xgb.train(data = xgb_train, max.depth = 3, 
                  watchlist=watchlist, 
                  nrounds = 100,
                  objective = "reg:squaredlogerror")


#define final model
model_xgboost_search = xgboost(data = xgb_train, 
                        max.depth = 3, 
                        nrounds = 68, 
                        verbose = 0, 
                        objective = "reg:squaredlogerror")

summary(model_xgboost_search)

importance_matrix <- xgb.importance(model = model_xgboost_search)

print(importance_matrix)

xgb.plot.importance(importance_matrix = importance_matrix)

#use model to make predictions on test data
pred_y = predict(model_xgboost_search, test_x)

ecom.results4 <- postResample(pred_y, test_y)

ecom.results4
# RMSE in dollars (revenue is modeled in millions)
ecom.results4[1] * 1e6

x = 1:length(test_y)                   # visualize the model, actual and predicted data
plot(x, test_y, col = "red", type = "l")
lines(x, pred_y, col = "blue", type = "l")
legend(x = 1, y = 38,  legend = c("original test_y", "predicted test_y"), 
       col = c("red", "blue"), box.lty = 1, cex = 0.8, lty = c(1, 1))


setwd(path)
# if (file.exists('search_model.json')) file.remove('search_model.json')
# xgb.save(model_xgboost_search, 'search_model.json')

Social

# Select the social channels
ecom_social <- ecom_new %>%
  filter(channel == "Social"| channel == "OLV")

# Create Key
ecom_social$keyname <- paste0(ecom_social$platform, ":", ecom_social$campaign_type)

# Add in Quarter and months
ecom_social$quarter = lubridate::quarter(as.Date(ecom_social$month, tryFormats = "%Y-%m-%d"))
ecom_social$months = lubridate::month(as.Date(ecom_social$month, tryFormats = "%Y-%m-%d"))

# Remove old columns

ecom_social <-  ecom_social %>%
  dplyr::select(-c("channel","campaign_type", "platform",  "account" ))

## Add in a method indicator (budgets from 2022-12-31 onward use the new method)
ecom_social<- ecom_social %>%
  mutate(method = case_when(month >= "2022-12-31" ~ 'New',
                             TRUE ~ 'Old'))

# Inspect column types
ecom_social %>% dplyr::mutate_all(as.factor) %>% str()

# Ensure cost and revenue are numeric (transform() must be assigned back)
ecom_social <- transform(ecom_social,
                         cost = as.numeric(cost),
                         revenue = as.numeric(revenue))

# Create Dummies
ecom_social <- dummy_cols(ecom_social , select_columns = c("holidays", "quarter",  "year", "method", "keyname", "months"), remove_first_dummy = TRUE)

ecom_social <- ecom_social %>% dplyr::select(-c("holidays", "year", "method", "quarter"))

# Look at the correlation plot

ecom_soc_cor <- ecom_social  %>% ungroup() %>%  dplyr::select(-keyname)

ecom_soc_cor[] <-lapply(ecom_soc_cor, function(x) as.numeric(as.character(x)))

m <- stats::cor(ecom_soc_cor, method = "pearson")

corrplot::corrplot(m)


# The plot shows that cost, keyname Meta 4, holidays_4, and avg_gas correlate most strongly with revenue


# Remove rows with zero spend
ecom_social <- ecom_social %>% filter(cost != 0)

# Remove keynames with fewer than 12 months of data
ng_data <- ecom_social %>%
  group_by(keyname) %>%
  summarise(count = n()) %>%
  filter(count< 12) %>%
  arrange(count)

ng_data


ecom_social <- ecom_social %>%
  filter(!keyname %in% ng_data$keyname)

# Add in AdStocks
ecom_social <- ecom_social %>% 
   group_by(keyname) %>%
   arrange(month) %>%
   mutate(spend_SGM=GeometricSimpleAdstock(cost, .005))

ecom_social <- ecom_social %>% 
   group_by(keyname) %>%
   arrange(month) %>%
   mutate(spend_DLDM=GeometricLogAdstock(cost, .005))

# Split Train and Test

ecom_social<- ecom_social %>% dplyr::select(-keyname)

ecom_social_clean_train <- ecom_social  %>% filter(month < "2023-03-01")
ecom_social_clean_test <-  ecom_social  %>% filter(month >= "2023-03-01")

ecom_social_clean_test_rev <- ecom_social_clean_test %>%  dplyr::select(revenue)

ecom_social_clean_test  <- ecom_social_clean_test  %>% dplyr::select(-revenue)

ecom_social_clean_train <- ecom_social_clean_train %>% dplyr::select(-month)
ecom_social_clean_test <- ecom_social_clean_test %>% dplyr::select(-month)

# Linear Regression

set.seed(13)

# Check the number of unique values per column
sapply(lapply(ecom_social_clean_train, unique), length)

ecom.fit1 <- ecom_social_clean_train %>%  dplyr::select(-c(spend_SGM, spend_DLDM)) %>% glm(log(revenue+1) ~ poly(cost)  +  .,  data = ., family = "gaussian")

ecom.fit2 <- ecom_social_clean_train %>%  dplyr::select(-c(cost, spend_DLDM)) %>% glm(log(revenue+1) ~ poly(spend_SGM) +  .,  data = ., family = "gaussian")

ecom.fit3 <- ecom_social_clean_train %>%  dplyr::select(-c(spend_SGM, cost)) %>% glm(log(revenue+1) ~ poly(spend_DLDM)  +  .,  data = ., family = "gaussian")


#summary(ecom.fit1)

ggResidpanel::resid_panel(ecom.fit1, plots='default', smoother = TRUE)

ecom.pred1 <- predict(ecom.fit1, newdata = ecom_social_clean_test)
ecom.pred2 <- predict(ecom.fit2, newdata = ecom_social_clean_test)
ecom.pred3 <- predict(ecom.fit3, newdata = ecom_social_clean_test)

ecom.pred1  <- exp(ecom.pred1)-1
ecom.pred2  <- exp(ecom.pred2)-1
ecom.pred3  <- exp(ecom.pred3)-1



ecom.results1 <- postResample(ecom.pred1, ecom_social_clean_test_rev$revenue)
ecom.results2 <- postResample(ecom.pred2, ecom_social_clean_test_rev$revenue)
ecom.results3 <- postResample(ecom.pred3, ecom_social_clean_test_rev$revenue)

rbind(ecom.results1, ecom.results2, ecom.results3)
summary(ecom.fit1)

ecom.results1[1] * 1e6

# As before, the DLDM transformation performed best, so we will use it for the remaining models except XGBoost

# Ridge Regression

train_x = data.matrix(ecom_social_clean_train %>% dplyr::select(-c(revenue, spend_SGM, cost)))
train_y = data.matrix(ecom_social_clean_train$revenue)

test_x = data.matrix(ecom_social_clean_test %>% dplyr::select(-c(spend_SGM, cost)))
test_y = data.matrix(ecom_social_clean_test_rev$revenue)


#perform k-fold cross-validation to find optimal lambda value
cv_model <- cv.glmnet(train_x, train_y, alpha = 0)

#find optimal lambda value that minimizes test MSE
best_lambda <- cv_model$lambda.min
best_lambda


ridge_social <- glmnet(train_x, train_y, alpha = 0, lambda = best_lambda)
summary(ridge_social)


pred_y = predict(ridge_social, test_x)

ecom.results_ridge <- postResample(pred_y, test_y)

ecom.results_ridge

## MARS Model

ecom_mars <- ecom_social_clean_train %>% dplyr::select(-c(spend_SGM, cost))
ecom_mars_test <- ecom_social_clean_test %>% dplyr::select(-c(spend_SGM, cost))


mars2 <- earth(
  revenue ~ .,  
  data = ecom_mars,
  degree = 1
)

plot(mars2, which = 1)


ecom.pred1s <- predict(mars2, newdata = ecom_mars_test)
ecom.results1s <- postResample(ecom.pred1s, ecom_social_clean_test_rev$revenue)
ecom.results1s 


# XGBoost

set.seed(13)

# Create the train and test splits

train_x = data.matrix(ecom_social_clean_train %>% dplyr::select(-c(revenue, spend_SGM, spend_DLDM)))
train_y = data.matrix(ecom_social_clean_train$revenue)

test_x = data.matrix(ecom_social_clean_test %>% dplyr::select(-c(spend_SGM, spend_DLDM)))
test_y1 = data.matrix(ecom_social_clean_test_rev$revenue)

#define final training and testing sets
xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y1)


#defining a watchlist
watchlist = list(train=xgb_train, test=xgb_test)

#fit XGBoost model and display training and testing data at each iteration
model = xgb.train(data = xgb_train, max.depth = 4, 
                  watchlist=watchlist, 
                  nrounds = 100,
                  objective = "reg:squaredlogerror",
                  eval_metric = "rmse")


#define final model
model_xgboost_social = xgboost(data = xgb_train, 
                        max.depth = 4, 
                        nrounds = 56, 
                        verbose = 0, 
                        objective = "reg:squaredlogerror",
                        eval_metric = "rmse")

summary(model_xgboost_social)

importance_matrix <- xgb.importance(model = model_xgboost_social)

print(importance_matrix)

xgb.plot.importance(importance_matrix = importance_matrix)

#use model to make predictions on test data
pred_y1 = predict(model_xgboost_social, test_x)

ecom.results4 <- postResample(pred_y1, test_y1)

ecom.results4

x = 1:length(test_y1) # visualize actual vs. predicted values
plot(x, test_y1, col = "red", type = "l")
lines(x, pred_y1, col = "blue", type = "l")
legend(x = 1, y = 38,  legend = c("original test_y1", "predicted test_y1"), 
       col = c("red", "blue"), box.lty = 1, cex = 0.8, lty = c(1, 1))


setwd(path)
# if (file.exists("social_model.json")) file.remove("social_model.json")
# xgb.save(model_xgboost_social, "social_model.json")

Forecasting

setwd(path)

# Forecast month (May 2023) and the prior month
month <- 5
pmonth <- month - 1

# Load the XGBoost models
model_xgboost_social <- xgb.load('social_model.json')

model_xgboost_search <- xgb.load('search_model.json')

# Load the holiday calendar
holidays <- read_excel("holidays.xlsx")

# Load search budget
search <- read_excel(paste0(path,"Search Budget.xlsx"))
# Clean data
names(search) <- tolower(names(search))
names(search) <- gsub(" ", "_", names(search))
search <- search %>% rename("account" = "account_name")

search$account <- gsub("-", "", search$account)
search$account <- gsub("   ", "_", search$account)
search$account <- gsub("  ", "_", search$account)
search$account <- gsub(" ", "_", search$account)

search$keyname <- paste0(search$channel, ":",search$platform, ":", search$account)

search <- search %>% filter(as.numeric(budget_raw) > 0)


# Load social
social <- read_excel(paste0(path,"Social Budget.xlsx"))
# Clean data
names(social) <- tolower(names(social))
names(social) <- gsub(" ", "_", names(social))
social <- social %>% filter(as.numeric(budget_raw) > 0)

# Set the forecast month on both budget files
search$month <- as.Date("2023-05-01")

social$month <- as.Date("2023-05-01")

## Add in Holidays
holidays$week_from <- strptime(as.character(holidays$week_from),  "%Y-%m-%d")
holidays$week_from <- format(holidays$week_from, "%Y-%m-%d")

holidays$month = as.character(floor_date(as.Date(holidays$week_from, tryFormats = "%Y-%m-%d"), "month"))

holidays <- holidays %>% select(month, holiday) %>% group_by(month) %>% summarise(holidays = n())
holidays$month = as.Date(holidays$month, tryFormats = "%Y-%m-%d")

search<- search %>%
  left_join(holidays, by = "month")

social<- social %>%
  left_join(holidays, by = "month")



## Add in FRED macroeconomic data

search$year_month = yearmonth(as.Date(search$month, tryFormats = "%Y-%m-%d"))
social$year_month = yearmonth(as.Date(social$month, tryFormats = "%Y-%m-%d"))

consumer_sent <- read_csv("consumer_sent.csv", show_col_types = FALSE)
consumer_sent$year_month = yearmonth(as.Date(consumer_sent$DATE, tryFormats = "%m/%d/%Y"))
consumer_sent <-consumer_sent %>% dplyr::select(-DATE)
consumer_sent[nrow(consumer_sent) + 1,] <- list(NA, make_yearmonth(year = 2023, month = pmonth))
consumer_sent[nrow(consumer_sent) + 1,] <- list(NA, make_yearmonth(year = 2023, month = month))
consumer_sent <- consumer_sent %>% tk_augment_lags(1, .lags = 2, .names = 'auto')


disp_inc <- read_csv("disp_inc.csv", show_col_types = FALSE) 
disp_inc$year_month = yearmonth(as.Date(disp_inc$DATE, tryFormats = "%m/%d/%Y"))
disp_inc <- disp_inc %>% dplyr::select(-DATE)
disp_inc[nrow(disp_inc) + 1,] <- list(NA, make_yearmonth(year = 2023, month = pmonth))
disp_inc[nrow(disp_inc) + 1,] <- list(NA, make_yearmonth(year = 2023, month = month))
disp_inc <-disp_inc %>% tk_augment_lags(1, .lags = 2, .names = 'auto')


avg_gas <- read_csv("avg_gas.csv", show_col_types = FALSE) 
avg_gas$year_month = yearmonth(as.Date(avg_gas$DATE, tryFormats = "%m/%d/%Y"))
avg_gas <-avg_gas %>% dplyr::select(-DATE)
avg_gas[nrow(avg_gas) + 1,] <- list(NA, make_yearmonth(year = 2023, month = pmonth))
avg_gas[nrow(avg_gas) + 1,] <- list(NA, make_yearmonth(year = 2023, month = month))
avg_gas <-avg_gas %>% tk_augment_lags(1, .lags = 2, .names = 'auto')


avg_inf <- read_csv("avg_inf.csv", show_col_types = FALSE)
avg_inf$year_month = yearmonth(as.Date(avg_inf$DATE, tryFormats = "%m/%d/%Y"))
avg_inf <-avg_inf %>% dplyr::select(-DATE)
avg_inf[nrow(avg_inf) + 1,] <- list(NA, make_yearmonth(year = 2023, month = pmonth))
avg_inf[nrow(avg_inf) + 1,] <- list(NA, make_yearmonth(year = 2023, month = month))
avg_inf <-avg_inf %>% tk_augment_lags(1, .lags = 2, .names = 'auto')
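
The four indicator loads above repeat the same steps; a hypothetical helper (assuming each FRED CSV holds a DATE column plus a single value column, and relying on the month/pmonth values defined earlier) could reduce the duplication:

# Illustrative refactor only - not used below
load_fred_lagged <- function(file) {
  df <- read_csv(file, show_col_types = FALSE)
  df$year_month <- yearmonth(as.Date(df$DATE, tryFormats = "%m/%d/%Y"))
  df <- df %>% dplyr::select(-DATE)
  # Append placeholder rows for the forecast month and the month before it
  df[nrow(df) + 1, ] <- list(NA, make_yearmonth(year = 2023, month = pmonth))
  df[nrow(df) + 1, ] <- list(NA, make_yearmonth(year = 2023, month = month))
  df %>% tk_augment_lags(1, .lags = 2, .names = 'auto')
}
# e.g. consumer_sent <- load_fred_lagged("consumer_sent.csv")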


search <- search %>%
  left_join(consumer_sent[,2:3], by = "year_month") %>%
  left_join(disp_inc[,2:3], by = "year_month") %>%
  left_join(avg_gas[,2:3], by = "year_month") %>%
  left_join(avg_inf[,2:3], by = "year_month") 

# Rename the lagged FRED series to descriptive names
colnames(search)[which(names(search) == "UMCSENT_lag2")] <- "consumer_sentiment"
colnames(search)[which(names(search) == "DSPIC96_lag2")] <- "disposable_income"
colnames(search)[which(names(search) == "APU000074714_lag2")] <- "avg_gas"
colnames(search)[which(names(search) == "MEDCPIM158SFRBCLE_lag2")] <- "avg_inflation"

social <- social %>%
  left_join(consumer_sent[,2:3], by = "year_month") %>%
  left_join(disp_inc[,2:3], by = "year_month") %>%
  left_join(avg_gas[,2:3], by = "year_month") %>%
  left_join(avg_inf[,2:3], by = "year_month") 


colnames(social)[which(names(social) == "UMCSENT_lag2")] <- "consumer_sentiment"
colnames(social)[which(names(social) == "DSPIC96_lag2")] <- "disposable_income"
colnames(social)[which(names(social) == "APU000074714_lag2")] <- "avg_gas"
colnames(social)[which(names(social) == "MEDCPIM158SFRBCLE_lag2")] <- "avg_inflation"

Run for the Robyn-Allocated Search Budget

search_tmp <- search
search_tmp <- dummy_cols(search_tmp, select_columns = c("holidays", "keyname"), remove_first_dummy = FALSE)
# Hard-code the dummies for the forecast month: May 2023, which has 3 holidays
search_tmp$year_2023 <- 1
search_tmp$months_5 <- 1
search_tmp$holidays_3 <- 1

# Build the feature set for the Robyn-allocated budget

search_f <- search_tmp %>% dplyr::select(-c(budget_raw, budget_perc.,budget_rev, platform,channel, keyname, account, year_month, forcast_perc., forecast_rev, month, holidays))  %>% rename("cost" = "forcast_raw")

# Get the column names used to train the model
search_list <- colnames(ecom_search_clean_train %>% select(-c(spend_SGM, spend_DLDM, revenue)))
search_list_new <- colnames(search_f)

# Add any training columns missing from the forecast data as zeros
search_list_updated <- setdiff(search_list, search_list_new)

for (i in search_list_updated){
    search_f[i]=0
}

# Verify nothing is still missing (the result should be empty)
search_list_new <- colnames(search_f)
search_list_updated <- setdiff(search_list, search_list_new)


search_f <- search_f %>% select(all_of(search_list))
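
This align-the-columns pattern repeats for all four scoring frames; a hypothetical helper could centralize it:

# Illustrative refactor only - not used below
align_to_train <- function(df, train_cols) {
  for (col in setdiff(train_cols, colnames(df))) {
    df[[col]] <- 0  # add absent training dummies as zeros
  }
  df %>% dplyr::select(all_of(train_cols))  # drop extras and fix the order
}
# e.g. search_f <- align_to_train(search_f, search_list)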

search_f %>% dplyr::mutate_all(as.factor) %>% str()

search_f <- transform(search_f,
                      cost = as.numeric(cost))

# Scale spend to millions, matching the training data
search_f$cost <- search_f$cost / 1e6


# run
test_x = data.matrix(search_f)

pred_f = predict(model_xgboost_search, test_x)
pred_f <- pred_f * 1e6

Run for the Team-Allocated Search Budget

search_t <- search_tmp %>% dplyr::select(-c(forcast_raw, budget_perc.,budget_rev, platform,channel, account, year_month, forcast_perc., keyname, forecast_rev, month, holidays))  %>% rename("cost" = "budget_raw")

# Align columns with the training data, as above
search_list <- colnames(ecom_search_clean_train %>% select(-c(spend_SGM, spend_DLDM, revenue)))
search_list_new <- colnames(search_t)

search_list_updated <- setdiff(search_list, search_list_new)

for (i in search_list_updated){
    search_t[i]=0
}
search_list_new <- colnames(search_t)

search_list_updated <- setdiff(search_list, search_list_new)


search_t <- search_t %>% select(all_of(search_list))


search_t %>% dplyr::mutate_all(as.factor) %>% str()

search_t <- transform(search_t,
                      cost = as.numeric(cost))

# Run: scale spend to millions (matching the training data), then predict
search_t$cost <- search_t$cost / 1e6

test_x = data.matrix(search_t)

pred_t = predict(model_xgboost_search, test_x)

pred_t <- pred_t * 1e6 

search$budget_rev <- pred_t

search$forecast_rev <- pred_f

# Zero out predicted revenue where the allocated spend is zero
search$forecast_rev[search$forcast_raw == 0] <- 0


paste0("Meta %s will result in a $", round(sum(search$forecast_rev)/sum(search$forcast_raw),2), " ROAS")

paste0("Team's %s will result in a $", round(sum(search$budget_rev) /sum(search$budget_raw),2), " ROAS")

search$forcast_ROAS <- search$forecast_rev/search$forcast_raw

search$team_ROAS <- search$budget_rev /search$budget_raw

Run for the Robyn-Allocated Social Budget

social_tmp <- social
social_tmp <- dummy_cols(social_tmp, select_columns = c("holidays", "keyname"), remove_first_dummy = FALSE)
# Hard-code the dummies for the forecast month: May 2023, which has 3 holidays
social_tmp$year_2023 <- 1
social_tmp$months_5 <- 1
social_tmp$holidays_3 <- 1

# Build the feature set for the Robyn-allocated budget

social_f <- social_tmp %>% dplyr::select(-c(keyname, budget_raw, budget_perc.,budget_rev, platform, campaign_type, year_month, forcast_perc., forecast_rev, month, holidays))  %>% rename("cost" = "forcast_raw")

# Get the column names used to train the model
social_list <- colnames(ecom_social_clean_train %>% select(-c(spend_SGM, spend_DLDM, revenue)))
social_list_new <- colnames(social_f)

# Add any training columns missing from the forecast data as zeros
social_list_updated <- setdiff(social_list, social_list_new)

for (i in social_list_updated){
    social_f[i]=0
}

# Verify nothing is still missing (the result should be empty)
social_list_new <- colnames(social_f)
social_list_updated <- setdiff(social_list, social_list_new)


social_f <- social_f %>% select(all_of(social_list))

social_f %>% dplyr::mutate_all(as.factor) %>% str()

social_f <- transform(social_f,
                      cost = as.numeric(cost))

# Scale spend to millions, matching the training data
social_f$cost <- social_f$cost / 1e6


# run
test_x = data.matrix(social_f)

pred_f = predict(model_xgboost_social, test_x)
pred_f <- pred_f * 1e6

Run for the Team-Allocated Social Budget

social_t <- social_tmp %>% dplyr::select(-c(forcast_raw, budget_perc.,budget_rev, platform,campaign_type, year_month, forcast_perc., keyname, forecast_rev, month, holidays))  %>% rename("cost" = "budget_raw")

# Align columns with the training data, as above
social_list <- colnames(ecom_social_clean_train %>% select(-c(spend_SGM, spend_DLDM, revenue)))
social_list_new <- colnames(social_t)

social_list_updated <- setdiff(social_list, social_list_new)

for (i in social_list_updated){
    social_t[i]=0
}
social_list_new <- colnames(social_t)

social_list_updated <- setdiff(social_list, social_list_new)


social_t <- social_t %>% select(all_of(social_list))


social_t %>% dplyr::mutate_all(as.factor) %>% str()

social_t <- transform(social_t,
                      cost = as.numeric(cost))

# Run: scale spend to millions (matching the training data), then predict
social_t$cost <- social_t$cost / 1e6

test_x = data.matrix(social_t)

pred_t = predict(model_xgboost_social, test_x)

pred_t <- pred_t * 1e6 

social$budget_rev <- pred_t

social$forecast_rev <- pred_f

# Zero out predicted revenue where the allocated spend is zero
social$forecast_rev[social$forcast_raw == 0] <- 0


paste0("Meta %s will result in a $", round(sum(social$forecast_rev)/sum(social$forcast_raw),2), " ROAS")

paste0("Team's %s will result in a $", round(sum(social$budget_rev) /sum(social$budget_raw),2), " ROAS")

social$forcast_ROAS <- social$forecast_rev/social$forcast_raw

social$team_ROAS <- social$budget_rev /social$budget_raw
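
To express the comparison as a single lift figure for each channel group, a small convenience function (not in the original) can summarize both allocations:

# Percentage ROAS lift of the Robyn allocation over the team's allocation
roas_lift <- function(df) {
  robyn <- sum(df$forecast_rev) / sum(df$forcast_raw)
  team  <- sum(df$budget_rev)  / sum(df$budget_raw)
  c(robyn_roas = robyn, team_roas = team, lift_pct = 100 * (robyn / team - 1))
}
roas_lift(search)
roas_lift(social)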

Output the results to Excel:

setwd(path)
write_xlsx(social %>% dplyr::select(-c(month, holidays, year_month, consumer_sentiment, disposable_income, avg_gas, avg_inflation)), "updated_social_budgets.xlsx")

write_xlsx(search %>% dplyr::select(-c(month, holidays, year_month, consumer_sentiment, disposable_income, avg_gas, avg_inflation)), "updated_search_budgets.xlsx")