1 Introduction to the course

This course is part of the “CADI”: Business Analytics for beginners. Business analytics (BA) covers several functional areas in business, such as marketing, finance, human resources, and logistics. In this course I will focus on basic applications of BA in Marketing, Economics, and Finance using the R/RStudio platform.

The objective of this course is to illustrate applications of BA in Marketing, Economics, and Finance. Participants will learn basic data analytic skills related to a) data management and visualization, and b) basic statistical analysis applied to marketing, economics, and finance.

We will focus on 3 applications of BA:

  1. Marketing analytics: models for price elasticity of demand

  2. Economics analytics: forecasting the US and the Mexican economies in the context of COVID

  3. Financial analytics: basic financial analysis of stocks and portfolios

The course is practically oriented. I have designed these problems to illustrate the potential of BA competencies to dig into important and current business problems. If you want to learn more about the theories and econometric methods used in this CADI, let me know and I can share my “work-in-progress” e-book about financial econometrics and quantitative analysis.

It is assumed that participants have very basic knowledge of the R/RStudio platform, and might not have deep knowledge of the economic, financial, and econometric theories used here. However, all the exercises can be re-run by any participant; you only need patience!

2 An introduction to Data Science and Business Analytics

There is no formal definition of Data Science. However, it can be seen as the intersection of statistics, mathematics, computer science, and any knowledge domain that demands the analysis of big volumes of data, such as Banking, Finance, Economics, Marketing, Demographics, Genetics, Astronomy, etc. If the domain is a business discipline, then we can call it Business Analytics, or just Data Science for Business.

Data Science includes not only data management - data collection, data merging, and data preparation - but also data analysis tailored to understanding a problem in a specific domain, making predictions, and finally improving decision making in organizations. For data analysis we can use techniques based on a) statistical modeling and b) machine learning. Machine learning is itself a combination of statistical, mathematical, and artificial intelligence methods.

Focusing on the business domain, we can classify business analytics in the following:

Descriptive analytics refers to the analysis of historical business/economic data to better understand the current business situation.

Diagnostic analytics refers to the ability to find business insights - mainly business opportunities and threats - after analyzing present and past business data.

Predictive analytics refers to the ability to create possible future scenarios based on historical data and assumptions about organizational variables.

Prescriptive analytics refers to the ability to identify and propose a specific strategic business plan according to the previous descriptive, diagnostic, and predictive analyses.

For descriptive analytics the main techniques are data management, statistical analysis and data visualization. For predictive analytics the main techniques are statistical modeling and machine learning.

Machine learning is a very dynamic and changing field. It refers to automatic algorithms that can learn from facts (data), identify patterns, and then provide a solution to a problem or come up with forecasts of variables.

Machine learning is a combination of artificial intelligence techniques along with statistical and mathematical methods. The main purpose of machine learning is to process data and come up with models to a) understand patterns, and b) predict variables. Then, machine learning is a set of algorithms that receive data, select a model, train the model with the data in order to calibrate the model, and then execute the model and come up with insights and/or predictions.

The machine learning techniques we will focus on in this course are mostly statistical techniques applied to Marketing and Finance.

Successful data-driven companies apply Data Science to stay competitive in the market. These organizations combine a variety of internal and external data sources in an analytics engine and translate the data into quantifiable, actionable insights to make effective decisions for the organization.

3 Application 1: Estimating price elasticity of demand

3.1 Price change versus sales volume change

A company is analyzing a price promotion for one of its consumer products for next month. The current price for each unit of the product is $100.00, and the total volume sold this month was 1,000 units. If the company offers a sales promotion of 20% off for next month, what is the minimum % increase in volume (% in units) of the product needed in order to end up with no loss in value sales (0% change in value sales) for the next month?

More generally, for each of the following price discounts, what should be the % increase in sales volume in order to achieve the same value sales as this month?

% Discount depth    % Volume growth to level sales
-5%                 ?
-10%                ?
-15%                ?
-20%                ?
-25%                ?
-30%                ?
-35%                ?
-40%                ?
-45%                ?
-50%                ?

Can you find a way to quickly calculate the missing values of this table in R?

Many people would think that the solution is to apply the same % price discount to the % volume increase, but let’s see what happens.

For the case of a -20% price change, the discounted price would be $80.00. If next month’s volume sales increase by +20%, the units sold would be 1,200, and total value sales would be $80.00 * 1,200, which is $96,000.00. Value sales would then be down $4,000.00 compared to this month’s value sales.
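We can verify this arithmetic quickly in R (a minimal check using the numbers above):

# Check the -20% discount with a naive +20% volume increase:
price1 = 100 * (1 - 0.20)   # discounted price = 80
vol1 = 1000 * (1 + 0.20)    # volume with +20% growth = 1200
price1 * vol1               # value sales = 96,000, i.e., $4,000 below 100,000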

3.1.1 Data management to create a table of price scenarios

We will define the variables for the problem, and create an R data frame with the price scenarios.

We first clear the environment of variables, and create the variables for the problem:

# To clear our environment we use the remove function rm:
rm(list=ls())
# ls() is a function that returns all the names of the R objects we
#   have in the environment

# We create variables for price and volume of the product in the
#   current month:

Price0 = 100
Vol0 = 1000

# We calculate value sales of this month:
Sales0 = Price0 * Vol0

For price, volume and value sales we will create a table for the different price discount scenarios. We create a data frame for the table:

# We create a vector as a sequence for the discount scenarios:
D = seq(from=0,to=-0.50, by=-0.05)
# We create a data frame using the discount vector
scenarios = as.data.frame(D)

Now we can create columns for the different prices, volume and % increases for each scenario:

scenarios$Price1 = Price0 * (1+scenarios$D)

We start assigning the intuitive (but wrong) volume increase for each scenario and see what happens to value sales:

# G column will be the volume growth in %, and will be equal to 
#  the % price discount:
scenarios$G = -1*scenarios$D
# Create the column for volume sales for each scenario (Vol1):
scenarios$Vol1 = Vol0 * (1+scenarios$G)

Now we can create a column for value sales in money:

scenarios$Sales1 = scenarios$Price1 * scenarios$Vol1

We see that volume sales increases, while value sales decreases with each discount depth scenario:

View(scenarios)
D Price1 G Vol1 Sales1
0.00 100 0.00 1000 100000
-0.05 95 0.05 1050 99750
-0.10 90 0.10 1100 99000
-0.15 85 0.15 1150 97750
-0.20 80 0.20 1200 96000
-0.25 75 0.25 1250 93750
-0.30 70 0.30 1300 91000
-0.35 65 0.35 1350 87750
-0.40 60 0.40 1400 84000
-0.45 55 0.45 1450 79750
-0.50 50 0.50 1500 75000

Then we need to find a higher % volume growth (value of G) for each discount depth to end up with the same value sales across all discount depths. We could do this by trial and error, but it would take some time. The question is: can we find a formula for G that depends on the discount depth? Let’s play with the relationship between price, price discount, volume, volume growth, and value sales:

Value sales is equal to unit price times volume sales. Using S to denote value sales:

\[ S_{0}=Price_{0}*Vol_{0} \]

\[ S_{1}=Price_{1}*Vol_{1} \] Where

\(S_0\) = Value sales this month
\(S_1\) = Value sales next month
\(Price_0\) = Unit price this month
\(Price_1\) = Unit price next month
\(Vol_0\) = Volume sales this month
\(Vol_1\) = Volume sales next month

We can express next month's unit price and volume in terms of this month's price and volume: \[ Price_{1}=Price_{0}*\left(1+D\right) \] \[ Vol_{1}=Vol_{0}*\left(1+G\right) \] Where:

D = % unit price change from this month to the next month
G = % growth of units sold from this month to the next month

If we want to force that the value sales of next month (\(S_1\)) be the same as the value sales of this month (\(S_0\)), then:

\[ S_{0}=S_{1} \] \[ Price_{0}*Vol_{0}=Price_{1}*Vol_{1} \] Expressing \(Price_1\) and \(Vol_1\) as functions of \(Price_0\) and \(Vol_0\):

\[ Price_{0}*Vol_{0}=Price_{0}*\left(1+D\right)*Vol_{0}*\left(1+G\right) \] Dividing both sides by \(Price_0 * Vol_0\):

\[ 1=\left(1+D\right)*\left(1+G\right) \] Dividing both sides by (1+D):

\[ \frac{1}{\left(1+D\right)}=\left(1+G\right) \]

Subtracting 1 from both sides:

\[ G=\frac{1}{\left(1+D\right)}-1 \]

Let’s apply this formula to update G, Vol1 and Sales1 in our table of discount scenarios:

scenarios$G = 1 / (1+scenarios$D) - 1
scenarios$Vol1 = Vol0 * (1+scenarios$G)
scenarios$Sales1 = scenarios$Price1 * scenarios$Vol1

We see the content of the scenarios table:

View(scenarios)
D Price1 G Vol1 Sales1
0.00 100 0.000000 1000.00 100000
-0.05 95 0.052632 1052.63 100000
-0.10 90 0.111111 1111.11 100000
-0.15 85 0.176471 1176.47 100000
-0.20 80 0.250000 1250.00 100000
-0.25 75 0.333333 1333.33 100000
-0.30 70 0.428571 1428.57 100000
-0.35 65 0.538462 1538.46 100000
-0.40 60 0.666667 1666.67 100000
-0.45 55 0.818182 1818.18 100000
-0.50 50 1.000000 2000.00 100000

We got the expected result: value sales for next month (Sales1) are the same for every discount scenario. Interestingly, we can see that we always need a higher % volume growth than the absolute value of the % price change. For a discount of 50%, we need to sell 100% more (double) to get the same value sales!

Now that we have played with the relationship between % price change and % volume change, we will review the concept of price elasticity of demand.

3.2 Price elasticity of demand

Price elasticity measures the sensitivity of changes in volume sales with respect to changes in unit price. We can use % as a measure of change in volume sales and price change. Then, we can state price elasticity of demand of a product as:

\[ E=\frac{\%\Delta Volume}{\%\Delta Price} \] Where E= Price Elasticity of demand

Using our notation: \[ E=\frac{G}{D} \] It is easy to understand the concept of price elasticity, but the estimation of elasticity is not quite straightforward. Let’s see how we can calculate E for each price discount of our example. We add a new column for price elasticity to our scenarios table:

scenarios$E = scenarios$G / scenarios$D 

We see the content of the scenarios table:

View(scenarios)
D Price1 G Vol1 Sales1 E
0.00 100 0.000000 1000.00 100000 NaN
-0.05 95 0.052632 1052.63 100000 -1.05263
-0.10 90 0.111111 1111.11 100000 -1.11111
-0.15 85 0.176471 1176.47 100000 -1.17647
-0.20 80 0.250000 1250.00 100000 -1.25000
-0.25 75 0.333333 1333.33 100000 -1.33333
-0.30 70 0.428571 1428.57 100000 -1.42857
-0.35 65 0.538462 1538.46 100000 -1.53846
-0.40 60 0.666667 1666.67 100000 -1.66667
-0.45 55 0.818182 1818.18 100000 -1.81818
-0.50 50 1.000000 2000.00 100000 -2.00000

We see that for each discount depth the estimate of E is different. This sounds weird, doesn’t it? If this estimation is correct, do we need to calculate a different elasticity value for each discount depth?

Before responding to these questions, we will show a scatter plot to better appreciate the relationship between price % change and volume % change considering all discount scenarios:

plot.default(x=scenarios$D, y=scenarios$G)

We see a curvilinear relationship between price % change and volume % change to keep the same level of value sales.

We can also see the relationship between price and volume (not in % changes; in values):

plot(x=scenarios$Price1,y=scenarios$Vol1)

We also see a curvilinear relationship between price and volume to keep the same level of value sales.

Usually volume % change is a function of price % change; the deeper the price discount the higher the volume % change. What we plot is a mathematical function of the %volume change with respect to %price change:

\[ G=f(D) \] Then, elasticity can be defined as the derivative of G with respect to D, since elasticity measures how much G changes with changes in D: \[ E=\frac{dG}{dD}=G' \] Remember that the derivative is the slope of the tangent to the function at a specific value of the X variable. Looking at the previous plot, if we draw an imaginary tangent to the curve at each value of the discount %, we see that the more negative the discount, the steeper the tangent, so the elasticity changes with different % price discounts.
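We can check this numerically with a short sketch (my own addition, assuming the scenarios table created above), approximating the derivative with finite differences between consecutive discount scenarios:

# Approximate dG/dD with finite differences between consecutive scenarios:
slopes = diff(scenarios$G) / diff(scenarios$D)
slopes
# The slope becomes more negative as the discount deepens, so elasticity
#  measured with simple % changes is not constant.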

Then, the question can be: can we have a measure of elasticity with the same value for any value of %price discount?

The response is YES! If we calculate price elasticity using continuously compounded (cc) % changes for price and volume, instead of simple % changes, then we will end up with a measure of elasticity that does not change with changes in %price.

One way to calculate cc % change of a variable is just to take the natural log of 1 + simple % change. To calculate the cc % change of price and volume:

\[ d=log\left(1+D\right) \] \[ g=log\left(1+G\right) \] Then we use these cc % changes to estimate elasticity:

\[ e=\frac{g}{d} \] I used small letters (d, g) for the cc % changes of discount and sales volume growth; and e for the new elasticity measure using the cc % changes.

Let’s do these calculations for our example:

We first calculate the continuously compounded % change of price and volume. We will add these variables as columns of our scenarios table:

scenarios$d = log(1+scenarios$D)
scenarios$g = log(1+scenarios$G)
scenarios$e = scenarios$g/scenarios$d

We see the content of the scenarios table:

View(scenarios)
D Price1 G Vol1 Sales1 E d g e
0.00 100 0.000000 1000.00 100000 NaN 0.000000 0.000000 NaN
-0.05 95 0.052632 1052.63 100000 -1.05263 -0.051293 0.051293 -1
-0.10 90 0.111111 1111.11 100000 -1.11111 -0.105361 0.105361 -1
-0.15 85 0.176471 1176.47 100000 -1.17647 -0.162519 0.162519 -1
-0.20 80 0.250000 1250.00 100000 -1.25000 -0.223144 0.223144 -1
-0.25 75 0.333333 1333.33 100000 -1.33333 -0.287682 0.287682 -1
-0.30 70 0.428571 1428.57 100000 -1.42857 -0.356675 0.356675 -1
-0.35 65 0.538462 1538.46 100000 -1.53846 -0.430783 0.430783 -1
-0.40 60 0.666667 1666.67 100000 -1.66667 -0.510826 0.510826 -1
-0.45 55 0.818182 1818.18 100000 -1.81818 -0.597837 0.597837 -1
-0.50 50 1.000000 2000.00 100000 -2.00000 -0.693147 0.693147 -1

Interestingly, the magnitudes of d and g are exactly the same for each discount scenario, but with opposite signs. Then the new elasticity measure e is equal to -1 for all discount scenarios! Now we can use a SINGLE measure of price elasticity of demand no matter the discount % applied to the unit price.

Another advantage of using cc % changes of price and volume is that the relationship between d and g is now linear instead of curvilinear:

plot(x=scenarios$d,y=scenarios$g)

The slope of a tangent of this line will always be equal to -1, that is the elasticity measure e!

Another interesting relationship is between the log of price and the log of volume:

plot(x=log(scenarios$Price1),y=log(scenarios$Vol1))

This relationship is also linear, and the slope of the tangent for each level of values for log of price is also -1, which is the new elasticity measure.

3.3 How sales move with changes in price elasticity

We can simulate how sales volume and sales value change with different values of the price elasticity of a product. We need to express volume growth G in terms of elasticity e, and recalculate most of the columns of our scenarios table.

We start with the formula for e, solve for g, and then convert g (the cc % volume change) into G (the simple % volume change). Since \(e=\frac{g}{d}\), then \(g=e*d\). Now we can get G from g. Since \(g=log\left(1+G\right)\), applying the exponential function to both sides we get:

\[ exp\left(g\right)=\left(1+G\right) \]

exp(g) means raising Euler’s number (2.71828…) to the power g, so \(G=exp\left(g\right)-1\).

Then we can update the columns for g, G, Vol1, and Sales1 in our scenario table with specific values for elasticity e.

Let’s start by moving elasticity from -1 to -2 and see how the sales volume (Vol1) and sales value (Sales1) change:

e=-2
scenarios$e=e
scenarios$g=scenarios$e*scenarios$d
scenarios$G=exp(scenarios$g) - 1
scenarios$Vol1=Vol0*(1+scenarios$G)
scenarios$Sales1=scenarios$Price1* scenarios$Vol1
scenarios$E=scenarios$G/scenarios$D

We see the content of the scenarios table:

View(scenarios)
D Price1 G Vol1 Sales1 E d g e
0.00 100 0.000000 1000.00 100000 NaN 0.000000 0.000000 -2
-0.05 95 0.108033 1108.03 105263 -2.16066 -0.051293 0.102587 -2
-0.10 90 0.234568 1234.57 111111 -2.34568 -0.105361 0.210721 -2
-0.15 85 0.384083 1384.08 117647 -2.56055 -0.162519 0.325038 -2
-0.20 80 0.562500 1562.50 125000 -2.81250 -0.223144 0.446287 -2
-0.25 75 0.777778 1777.78 133333 -3.11111 -0.287682 0.575364 -2
-0.30 70 1.040816 2040.82 142857 -3.46939 -0.356675 0.713350 -2
-0.35 65 1.366864 2366.86 153846 -3.90532 -0.430783 0.861566 -2
-0.40 60 1.777778 2777.78 166667 -4.44444 -0.510826 1.021651 -2
-0.45 55 2.305785 3305.79 181818 -5.12397 -0.597837 1.195674 -2
-0.50 50 3.000000 4000.00 200000 -6.00000 -0.693147 1.386294 -2

We can see that not only sales volume increases, but also sales value increases with each level of discount depth.

Then, what can you learn/conclude about price elasticity e with respect to changes in sales? In which situations might it be a good idea to offer aggressive discounts for a product? Briefly explain.

3.4 Estimation of price elasticity with real data

In the previous sections we estimated price elasticity of a hypothetical product with specific scenarios of discount depth. Now the question is, how can we estimate price elasticity of a consumer product?

First we need historical sales data for the consumer product: sales in value ($) and sales in volume, from which we can calculate an average unit price per period. We can get daily, weekly, monthly, or quarterly data. Using monthly or quarterly data is a good idea since the volatility of sales at the daily and weekly level is high, making it harder to get a robust estimation of elasticity. In addition, monthly or quarterly data allows us to easily measure the seasonality of a consumer product.

Imagine we have monthly data for sales volume and sales value for a consumer product for the past 36 months. How can you estimate price elasticity?

It is easy to calculate the unit price for each month: we just divide sales value by sales volume. We then have a column for historical prices, a column for sales volume, and a column for sales value, and we can estimate price elasticity using a simple regression model.

For the previous hypothetical exercise, we can also estimate price elasticity using a simple regression model. The way we set up this simple regression is using the natural logarithm of sales volume as the dependent variable, and the natural log of price as the independent variable. We can express this regression model as:

\[ lnvol_{t}=b_{0}+b_{1}*lnprice_{t}+\varepsilon_{t} \]

Where:

\(lnvol_{t}\) = log of sales volume at period t, which is the dependent variable of the model.

\(lnprice_{t}\) = log of unit price at period t, which is the independent or explanatory variable of the model.

\(\varepsilon_{t}\) is called the error of the model, and it is supposed to behave like a normal distribution with mean=0 and a specific standard deviation: \(\varepsilon_{t}\sim N(0,\sigma_{\varepsilon})\)

\(b_{0}\)= coefficient for the intercept; it represents the value of the dependent variable, in this case, log of sales volume when the value of the independent variable (log of price) is equal to zero. In other words, it is the point where the regression line crosses the Y axis.

\(b_{1}\) = coefficient for the slope of the regression line. This coefficient is a measure of sensitivity of how much the dependent variable moves for each 1-unit move in the independent variable.

The equation for the regression model is actually the expected value of the dependent variable. Since the mean of the error is zero, then:

\[ E\left[lnvol\right]=b_{0}+b_{1}*lnprice_{t} \]

This regression equation is actually a line that represents all the pairs of values for log of price and log of sales volume.

Let’s run a regression model using the hypothetical example of the discount scenarios:

model1 <- lm(log(scenarios$Vol1)~log(scenarios$Price1))
summary(model1)
## 
## Call:
## lm(formula = log(scenarios$Vol1) ~ log(scenarios$Price1))
## 
## Residuals:
##                   Min                    1Q                Median 
## -0.000000000000006438 -0.000000000000000288  0.000000000000000052 
##                    3Q                   Max 
##  0.000000000000001649  0.000000000000002513 
## 
## Coefficients:
##                                  Estimate          Std. Error          t value
## (Intercept)           16.1180956509583417  0.0000000000000155 1040120010071390
## log(scenarios$Price1) -2.0000000000000058  0.0000000000000036 -554948114163277
##                                  Pr(>|t|)    
## (Intercept)           <0.0000000000000002 ***
## log(scenarios$Price1) <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0000000000000026 on 9 degrees of freedom
## Multiple R-squared:     1,   Adjusted R-squared:     1 
## F-statistic: 3.08e+29 on 1 and 9 DF,  p-value: <0.0000000000000002

The intercept coefficient \(b_{0}\) = 16.118096 and \(b_{1}\) = -2, which is the same value of the elasticity e we got previously using cc % changes in price and volume! In this hypothetical case, all the data points lie exactly on the regression line, so all errors are zero and the standard errors (standard deviations of the regression coefficients) are practically zero.
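As a small complement (my own addition), we can extract the slope coefficient directly from the fitted model instead of reading it from the summary:

# coef() returns the vector of estimated coefficients of model1;
#  the second element is the slope, which is the elasticity estimate:
b1 = coef(model1)[2]
b1   # approximately -2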

Now let’s move to a real-world consumer product and estimate its price elasticity. We will download a dataset of historical monthly data for 4 consumer products sold in retail stores. The dataset was constructed from a real-world dataset of consumer products of one specific category sold in big retail stores in Mexico. Do the following to download the dataset directly into your R session:

# Load the readxl package to read Excel files:
library(readxl)
# Download the excel file from a web site:
download.file("http://www.apradie.com/datos/salesdata1.xlsx",
              "sales1.xlsx", mode="wb")
# The first parameter is the link and the second is a name for the
#  local file (sales1.xlsx)

# Use the function read_excel() to import the dataset into your R:
sales <- read_excel("sales1.xlsx")

We can see the content of this file. We display retail sales data for the first and the last months:

View(head(sales))
month qsku1 vsku1 qsku2 vsku2 qsku3 vsku3 qsku4 vsku4
2017-05-01 266745 1733845 213811 3998265 579472 3882460 1737617 19353875
2017-06-01 293427 1819250 235841 4268730 604328 4048995 1743237 18915640
2017-07-01 231065 1432600 201680 3630235 595902 3813775 1690972 18800880
2017-08-01 196930 1201275 190614 3507305 608106 3831065 1835160 20683400
2017-09-01 174157 1132020 171971 3129870 998830 5493565 2067265 21311065
2017-10-01 168432 1077965 177789 3289090 1556771 7939530 2512034 24061980
View(tail(sales))
month qsku1 vsku1 qsku2 vsku2 qsku3 vsku3 qsku4 vsku4
2019-12-01 82863 629755 147879 3061090 439893 2947280 1624348 16548050
2020-01-01 95525 725990 99323 2055990 420908 2777990 1604222 15487280
2020-02-01 82477 635075 130238 2669880 331109 2284655 1730077 16541685
2020-03-01 112063 862885 166859 3437290 337043 2325600 1913984 19594795
2020-04-01 90136 676020 151963 3160840 197738 1364390 1457315 16395955
2020-05-01 91567 686755 159098 3372880 196111 1313945 1382876 15917535

The columns that start with v are value sales, and the columns that start with q are volume sales (in units). Then, vsku1 is the value sales (in $) of product 1, qsku1 is the # of units sold of product 1, etc.

We will work with product 1. We calculate an average monthly price for product 1, and add columns for the log of price and the log of sales volume:

sales$psku1=sales$vsku1/sales$qsku1
sales$logpsku1=log(sales$psku1)
sales$logqsku1=log(sales$qsku1)

We can start visualizing the sales data for this product:

plot(x=sales$month,y=sales$qsku1,type="l")

We can see that the number of units sold of product 1 has been dramatically declining over the last 36 months.

We can calculate the % decline of volume sales from the first month until the most recent month:

library(xts)
# Calculating the % growth of volume sales during the whole period:
G1 = last(sales$qsku1)/first(sales$qsku1) - 1
G1
## [1] -0.656725

The % change in volume sales of product 1 over the whole period is -65.672459%. If you were responsible for this product in a company, you would be very, very concerned.

Now we do a plot for historical price of product 1:

plot(x=sales$month,y=sales$psku1,type="l")

We can see that the unit price of product 1 has been going up for the last 36 months. We can calculate the % price increase in the whole period:

D1 = last(sales$psku1) / first(sales$psku1) - 1 
D1
## [1] 0.153849

The price increased 15.384869% in the last 36 months (from the first month up to the last month of the data).

Now we can visualize how sales volume changes with different price levels. We plot price vs volume in log scale:

plot(x=sales$logpsku1,y=sales$logqsku1)
abline(lm(sales$logqsku1 ~ sales$logpsku1),col='blue')

I also plotted the regression line between price and volume (in log scale). We clearly see a negative relationship between price and sales volume.

Now we can run a simple linear regression model using the log of sales volume as dependent variable, and the log of price as independent variable:

model1<-lm(sales$logqsku1 ~ sales$logpsku1)
summary(model1)
## 
## Call:
## lm(formula = sales$logqsku1 ~ sales$logpsku1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2456 -0.1348 -0.0146  0.0952  0.4289 
## 
## Coefficients:
##                Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)      18.438      0.720   25.60 < 0.0000000000000002 ***
## sales$logpsku1   -3.404      0.372   -9.15       0.000000000083 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.165 on 35 degrees of freedom
## Multiple R-squared:  0.705,  Adjusted R-squared:  0.697 
## F-statistic: 83.7 on 1 and 35 DF,  p-value: 0.0000000000829

Then, the beta 1 coefficient of this regression is -3.404429, which is an estimation of price elasticity of demand.

We can interpret this price elasticity as follows: for each 1% increase in the price of the product, its demand is expected to decrease by about 3.4%. This result makes sense after looking at the volume decline vs. the price increase.
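As an optional complement (my own addition, not part of the original analysis), we can also get a 95% confidence interval for this elasticity estimate with confint():

# 95% confidence intervals for the coefficients of model1;
#  the row for sales$logpsku1 is the price elasticity estimate:
confint(model1, level = 0.95)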

It is important to note that this measure of price elasticity does not consider important factors that affect consumer products in real life. Important drivers of demand for consumer products are 1) seasonality, 2) organic growth trend, 3) cross-elasticities with competing/substitute products and with complementary products, 4) other commercial factors such as distribution, advertising, and promotions, and 5) external events such as economic recessions.
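As an illustration only (a hypothetical sketch of my own, not part of the workshop analysis), one simple way to start controlling for the first factor, seasonality, is to add month-of-year dummies to the log-log regression; month_of_year is a helper column I create here for that purpose:

# Hypothetical sketch: add month-of-year dummies to control for seasonality
sales$month_of_year = factor(format(as.Date(sales$month), "%m"))
model_seasonal = lm(logqsku1 ~ logpsku1 + month_of_year, data = sales)
# The coefficient of logpsku1 is now the elasticity estimate after
#  controlling for the month of the year:
summary(model_seasonal)$coefficients["logpsku1", ]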

3.5 Alternative measures for direct elasticity of demand

In the case of consumer products it is recommended to include some of the factors mentioned above in the econometric model in order to calculate a more robust measure of direct price elasticity, as well as measures of cross-elasticities with significant competing or complementary products. We need to use more advanced econometric models for this. Due to their complexity, we cannot cover them in detail in this workshop, but we will mention some of the most important econometric models that consider these factors.

The most popular models are called ARIMA/SARIMA and ARIMAX models. ARIMA stands for Autoregressive Integrated Moving Average models. SARIMA stands for Seasonal ARIMA models. ARIMAX stands for ARIMA/SARIMA models with explanatory variables (X’s variables) or factors.

In the next BA application we will design and run a simple ARIMA/SARIMA model to forecast the Mexican and US Economy.

4 Application 2: Examining the Mexican Economy in the context of COVID

In this exercise we will forecast the Mexican Economy for the next 5 years considering the effect of COVID. We will use the GDP as the main measure for the Mexican economy.

We will apply ARIMA/SARIMA and ARIMAX models in this problem. We will not cover the econometric theory of these models due to their complexity, but we will focus on their application; by working on this problem I try to illustrate the power of these models to forecast economic series.

The questions we will try to respond are the following:

  1. How have previous economic crises affected the Mexican Economy?

  2. How is the COVID crisis affecting, and how will it continue to affect, the Mexican Economy?

  3. How is the Mexican Economy related to the United States (US) Economy?

  4. What is the forecast for the Mexican Economy for the next 5 years, considering only its historical GDP and the effect of the US Economy? With the answer to this question we will see how fast the economy will recover and reach the levels it had just prior to the crisis.

4.1 Data management for the Mexican and US GDP

Let’s start by downloading real data for the Mexican and US GDP. I first have to install the following packages to download online economic/financial data and to work with time-series data:

# The quantmod package has functions to download real economic and financial data. It also has functions for several financial analysis:

install.packages("quantmod")

# The tseries and the astsa packages have functions to manage time-series datasets
install.packages("tseries")
install.packages("astsa")

# The fpp2 package has several functions to run ARIMA/SARIMA/ARIMAX models, and also to do forecasting of time series
install.packages("fpp2")

We load these packages on memory:

# We clear all the R variables/objects from the environment:
rm(list=ls())

library(quantmod)
library(fpp2)
library(tseries)
library(astsa)

We start downloading both Mexico and US GDP from the FRED Data source. FRED stands for Federal Reserve Economic Data. FRED is managed by the Federal Reserve Bank of St. Louis, one of the Federal Reserve Banks of the United States.

# I define a vector with the tickers I need from FRED:
tickers = c("NAEXKP01MXQ189S", "GDPC1")
# I download the quarterly GDP data:
getSymbols(tickers, src="FRED")
## [1] "NAEXKP01MXQ189S" "GDPC1"
# getSymbols creates xts R objects with the online data of the tickers I specified. xts stands for "extensible" time series objects.
# I select quarters from 1993 to the most recent quarter:
GDPMX <- NAEXKP01MXQ189S["1993-01-01/"]
GDPUS <- GDPC1["1993-01-01/"]
# I name the columns of the time series:
names(GDPMX)<- c("GDP")
names(GDPUS)<- c("GDP")

Both the Mexican and the US GDP are at constant prices, so they are adjusted for inflation. The Mexican GDP is measured in Mexican pesos, while the US GDP is in billions of US dollars. To keep both measures in billions, I divide the Mexican GDP series by 1 billion:

GDPMX$GDP=GDPMX$GDP / 1000000000

For both GDP datasets, I add the log of the GDP index, the simple annual % growth of the GDP, and the cc annual % growth of the GDP:

GDPMX$lnGDP = log(GDPMX$GDP)
GDPUS$lnGDP = log(GDPUS$GDP)
GDPMX$R = GDPMX$GDP / lag(GDPMX$GDP, k = 4) - 1
GDPUS$R = GDPUS$GDP / lag(GDPUS$GDP, k = 4) - 1
# I used the lag function to get the GDP value of the 4 previous quarters in order to calculate the annual % growth
GDPMX$r = diff(log(GDPMX$GDP),lag=4)
GDPUS$r = diff(log(GDPUS$GDP),lag=4)
# The diff function calculates a difference between the log of the GDP and its own value 4 quarters ago. This is the method to calculate the cc % annual growth of the GDP.

I add a dummy (binary variable) to indicate recession/crisis periods for the whole history. For the quarters where the annual % GDP growth is negative, I assign 1; 0 otherwise:

GDPMX$crisis = 0
GDPMX$crisis[GDPMX$r<0]=1
GDPUS$crisis = 0
GDPUS$crisis[GDPUS$r<0]=1

I calculate a shock for the crisis periods relative to the maximum shock in the history of the series. I do this for both the Mexican and the US GDP:

#I get a summary of the annual % growth ONLY for the crisis periods
summary(GDPUS$r[GDPUS$crisis==1])
##      Index                  r          
##  Min.   :2008-10-01   Min.   :-0.0947  
##  1st Qu.:2009-01-23   1st Qu.:-0.0384  
##  Median :2009-05-16   Median :-0.0322  
##  Mean   :2012-11-15   Mean   :-0.0427  
##  3rd Qu.:2017-07-24   3rd Qu.:-0.0294  
##  Max.   :2020-07-01   Max.   :-0.0279
# I get the minimum or more negative value of the annual growth:
maxusshock<-min(GDPUS$r[GDPUS$crisis==1])
# I calculate the magnitude of the shock for each quarter. This will be a percentage with respect to the maximum shock in the history of the economy: 
GDPUS$shock = GDPUS$r/maxusshock * GDPUS$crisis
# For those cases with NA values, I assign a zero. I do this to maximize the number of observations used in the model:
GDPUS$shock[is.na(GDPUS$r)] = 0

# I do the same for the Mexico GDP:
summary(GDPMX$r[GDPMX$crisis==1])
##      Index                  r           
##  Min.   :1995-01-01   Min.   :-0.20658  
##  1st Qu.:2001-05-16   1st Qu.:-0.07535  
##  Median :2008-10-01   Median :-0.01938  
##  Mean   :2007-03-08   Mean   :-0.04453  
##  3rd Qu.:2014-08-16   3rd Qu.:-0.00842  
##  Max.   :2020-07-01   Max.   :-0.00123
maxmxshock<-min(GDPMX$r[GDPMX$crisis==1])
GDPMX$shock = GDPMX$r/maxmxshock * GDPMX$crisis
GDPMX$shock[is.na(GDPMX$r)] = 0

I will start by modeling the annual cc % growth of the GDP, since national economies usually have seasonal patterns during the year. But first I have to check whether the annual cc % GDP growth behaves like a stationary time-series variable. In order to model a time-series variable, it is necessary to use stationary variables.

4.1.1 Test for stationarity of the US and Mexican GDP annual cc % growth

A stationary time-series variable has the following characteristics: a) the mean and the volatility of the variable are about the same for any time period, and b) the autocorrelations of the variable depend only on the lag, and not on the time period.

Then I run a Dickey-Fuller unit-root test to examine whether the annual cc % growth (the seasonal difference of the log of the GDP), quarter by quarter, is stationary:

adf.test(na.omit(diff(GDPUS$lnGDP, lag = 4)), 
         alternative = "stationary", k=0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  na.omit(diff(GDPUS$lnGDP, lag = 4))
## Dickey-Fuller = -3.776, Lag order = 0, p-value = 0.0227
## alternative hypothesis: stationary

The p-value is <0.05, so I can reject the null hypothesis that the seasonal difference is a random walk. Then I have statistical evidence to treat this series as stationary.

I do the same for the Mexican GDP series:

adf.test(na.omit(diff(GDPMX$lnGDP, lag = 4)), 
         alternative = "stationary", k = 0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  na.omit(diff(GDPMX$lnGDP, lag = 4))
## Dickey-Fuller = -3.631, Lag order = 0, p-value = 0.0337
## alternative hypothesis: stationary

I found the same result as for the US GDP, so I can use the annual difference of the log of GDP as the variable to model with an ARIMA/SARIMA model.

4.2 Forecasting the US Economy considering the COVID shock

4.2.1 Calibrating the model for the US Economy

I start by creating the autocorrelation and partial autocorrelation plots for the dependent variable, which is the annual difference of the log of the US GDP (cc % growth of US GDP):

acf2(na.omit(diff(GDPUS$lnGDP,lag=4)))

##      [,1]  [,2] [,3]  [,4] [,5]  [,6] [,7]  [,8] [,9] [,10] [,11] [,12] [,13]
## ACF  0.75  0.48 0.35  0.24 0.17  0.13 0.09  0.06 0.06  0.05  0.03  0.02  0.01
## PACF 0.75 -0.17 0.13 -0.08 0.06 -0.02 0.01 -0.01 0.04 -0.03  0.00  0.00  0.00
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## ACF   0.01  0.02  0.03  0.02  0.00 -0.02 -0.07 -0.09
## PACF  0.01  0.03  0.01 -0.05  0.01 -0.05 -0.07  0.02

The ac plot shows a classical AR(1) pattern: the autocorrelations between the variable and its first lags decay slowly, while the partial ac plot shows a sharp drop in the partial autocorrelations after lag 1.

I start with an AR(1) model and I also include the explanatory variable, which is the COVID shock:

m1<-Arima(GDPUS$lnGDP,order=c(1,0,0),
           seasonal = list(order=c(0,1,0),period=4),
           include.constant  = TRUE,
           xreg = GDPUS$shock)
m1
## Series: GDPUS$lnGDP 
## Regression with ARIMA(1,0,0)(0,1,0)[4] errors 
## 
## Coefficients:
##         ar1  drift   shock
##       0.900  0.006  -0.085
## s.e.  0.042  0.002   0.005
## 
## sigma^2 estimated as 0.0000512:  log likelihood=380.91
## AIC=-753.81   AICc=-753.42   BIC=-743.12
# In the order parameter I indicate that the model has 1 AR lag, 
#  0 differences, and 0 MA lags.
# In the seasonal option I indicate that I need to do an annual/seasonal difference;  period=4 means that I have 4 quarters in a season/year

I check whether the residuals of this model behave like a white noise with no significant autocorrelations:

acf2(m1$residuals)

##      [,1] [,2] [,3]  [,4] [,5]  [,6] [,7]  [,8]  [,9] [,10] [,11] [,12] [,13]
## ACF  0.17 0.11 0.05 -0.18 0.16  0.01 0.08 -0.24 -0.16  0.02 -0.03  0.14  0.06
## PACF 0.17 0.08 0.02 -0.20 0.23 -0.03 0.07 -0.37  0.05  0.06  0.08 -0.07  0.14
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## ACF   0.01 -0.05  0.04  0.05 -0.03  0.03 -0.14 -0.09
## PACF  0.01 -0.03 -0.02  0.02 -0.05  0.00 -0.12  0.03

Lag 4 is negative and significant in the pac, so I will add a seasonal MA term in my model since the season has 4 periods:

m2<-Arima(GDPUS$lnGDP,order=c(1,0,0),
          seasonal = list(order=c(0,1,1),period=4),
          include.constant  = TRUE,
          xreg = GDPUS$shock)
m2
## Series: GDPUS$lnGDP 
## Regression with ARIMA(1,0,0)(0,1,1)[4] errors 
## 
## Coefficients:
##         ar1    sma1  drift   shock
##       0.992  -0.858  0.006  -0.092
## s.e.  0.010   0.079  0.001   0.005
## 
## sigma^2 estimated as 0.0000391:  log likelihood=394.98
## AIC=-779.97   AICc=-779.37   BIC=-766.6

I check the ac plots of the residuals of this model:

acf2(m2$residuals)

Now lag 1 is positive and significant. This is a sign that I need to add an MA(1) since I had already added an AR(1):

m3<-Arima(GDPUS$lnGDP,order=c(1,0,1),
          seasonal = list(order=c(0,1,1),period=4),
          include.constant  = TRUE,
          xreg = GDPUS$shock)
m3
## Series: GDPUS$lnGDP 
## Regression with ARIMA(1,0,1)(0,1,1)[4] errors 
## 
## Coefficients:
##         ar1    ma1    sma1  drift   shock
##       0.991  0.189  -0.903  0.006  -0.091
## s.e.  0.011  0.088   0.099  0.001   0.004
## 
## sigma^2 estimated as 0.0000376:  log likelihood=397.23
## AIC=-782.46   AICc=-781.62   BIC=-766.43

Instead of using the log series, I will use the original series of GDP, and I indicate in the Arima function that the log transformation needs to be done before estimating the model. I found this option better since the final forecast will be automatically calculated in the original units of GDP, not in log of GDP:

m4<-Arima(GDPUS$GDP,order=c(1,0,1),
          seasonal = list(order=c(0,1,1),period=4),
          include.constant  = TRUE, 
          lambda = 0,
          xreg = GDPUS$shock)
m4
## Series: GDPUS$GDP 
## Regression with ARIMA(1,0,1)(0,1,1)[4] errors 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##         ar1    ma1    sma1  drift   shock
##       0.991  0.189  -0.903  0.006  -0.091
## s.e.  0.011  0.088   0.099  0.001   0.004
## 
## sigma^2 estimated as 0.0000376:  log likelihood=397.23
## AIC=-782.46   AICc=-781.62   BIC=-766.43

I check the ac plots of the residuals of this model:

acf2(m4$residuals[5:111])

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]  [,8]  [,9] [,10] [,11] [,12] [,13]
## ACF  0.03 0.18 0.11 0.10 0.18 0.10 0.18 -0.13 -0.03  0.07 -0.01  0.02  0.02
## PACF 0.03 0.18 0.10 0.07 0.15 0.07 0.12 -0.21 -0.13  0.06  0.00 -0.01  0.06
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## ACF   0.01 -0.07 -0.05  0.04 -0.09  0.02 -0.16 -0.02
## PACF  0.04 -0.03 -0.09  0.00 -0.07  0.02 -0.15  0.02

I stop here and accept this model, since there are no more significant lags in the autocorrelation plots, indicating that the residuals behave like white noise.
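As an optional complementary check (my own addition), the Ljung-Box test from base R can formally test whether the residuals behave like white noise; fitdf adjusts the degrees of freedom for the 5 estimated coefficients of m4:

# Ljung-Box test on the residuals of m4 (excluding the first 4 observations,
#  which are lost with the seasonal difference). The null hypothesis is that
#  the residuals are white noise; a large p-value supports the model.
Box.test(m4$residuals[5:111], lag = 10, type = "Ljung-Box", fitdf = 5)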

4.2.2 Forecasting the US GDP for the next 5 years

Now I will do a forecast. Before this, I have to fill in future values for the COVID shock on the US Economy. I will consider that the COVID shock will last until Q2 of 2022, diminishing over time.

I create a date index for the next 20 quarters and an xts dataset for the future US shock:

#Creating a sequence of dates for the future 20 quarters:
futurequarters <- seq(as.Date(end(GDPMX)),
                      by="quarter",
                      length.out = 21)[-1]
# I show the last future quarters
tail(futurequarters)
## [1] "2024-04-01" "2024-07-01" "2024-10-01" "2025-01-01" "2025-04-01"
## [6] "2025-07-01"
# I create a vector with 20 zeros:
FUTURESHOCKUS<-rep(0,20)
# Now I create an xts object with shocks equal to zero for the future 20 quarters:
FUTURESHOCKUS<-xts(FUTURESHOCKUS,order.by=futurequarters)
names(FUTURESHOCKUS)<-c("shock")

# Now I assign the shock magnitude for the next quarters:
FUTURESHOCKUS$shock["2020-10-01"]=0.30
FUTURESHOCKUS$shock["2021-01-01"]=0.35
FUTURESHOCKUS$shock["2021-04-01"]=0.30
FUTURESHOCKUS$shock["2021-07-01"]=0.25
FUTURESHOCKUS$shock["2021-10-01"]=0.20
FUTURESHOCKUS$shock["2022-01-01"]=0.20
FUTURESHOCKUS$shock["2022-04-01"]=0.10

I re-run the model, using the historical shock as explanatory variable:

mgdpus<-Arima(GDPUS$GDP,order=c(1,0,1),
           seasonal = list(order=c(0,1,1),period=4),
           include.constant  = TRUE,
           xreg = GDPUS$shock,
           lambda = 0)
mgdpus
## Series: GDPUS$GDP 
## Regression with ARIMA(1,0,1)(0,1,1)[4] errors 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##         ar1    ma1    sma1  drift   shock
##       0.991  0.189  -0.903  0.006  -0.091
## s.e.  0.011  0.088   0.099  0.001   0.004
## 
## sigma^2 estimated as 0.0000376:  log likelihood=397.23
## AIC=-782.46   AICc=-781.62   BIC=-766.43

With this model I run the forecast using the future 20 quarters along with the expected COVID shocks:

usforecast<-forecast(mgdpus,xreg=FUTURESHOCKUS$shock)
autoplot(usforecast)

tail(usforecast$mean)
## Time Series:
## Start = 126 
## End = 131 
## Frequency = 1 
## [1] 20756.6 20884.6 21013.3 21095.4 21219.9 21351.2
summary(usforecast)
## 
## Forecast method: Regression with ARIMA(1,0,1)(0,1,1)[4] errors
## 
## Model Information:
## Series: GDPUS$GDP 
## Regression with ARIMA(1,0,1)(0,1,1)[4] errors 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##         ar1    ma1    sma1  drift   shock
##       0.991  0.189  -0.903  0.006  -0.091
## s.e.  0.011  0.088   0.099  0.001   0.004
## 
## sigma^2 estimated as 0.0000376:  log likelihood=397.23
## AIC=-782.46   AICc=-781.62   BIC=-766.43
## 
## Error measures:
##                    ME    RMSE     MAE         MPE     MAPE     MASE      ACF1
## Training set -5.23218 87.4948 64.3446 -0.00655596 0.445069 0.490473 0.0549532
## 
## Forecasts:
##     Point Forecast   Lo 80   Hi 80   Lo 95   Hi 95
## 112        18731.0 18584.4 18878.9 18507.2 18957.6
## 113        18716.8 18490.7 18945.7 18372.1 19068.0
## 114        18911.7 18625.7 19202.0 18476.1 19357.5
## 115        19114.1 18777.9 19456.4 18602.3 19640.0
## 116        19318.6 18931.0 19714.1 18729.0 19926.8
## 117        19392.8 18960.7 19834.7 18735.9 20072.6
## 118        19684.9 19207.5 20174.1 18959.5 20438.0
## 119        19987.2 19466.8 20521.5 19196.8 20810.1
## 120        20109.4 19547.6 20687.4 19256.6 21000.0
## 121        20187.1 19586.9 20805.7 19276.5 21140.7
## 122        20305.3 19668.0 20963.2 19338.8 21320.1
## 123        20430.1 19757.4 21125.6 19410.3 21503.4
## 124        20555.4 19844.5 21291.9 19478.1 21692.4
## 125        20635.3 19888.7 21410.1 19504.4 21831.9
## 126        20756.6 19974.5 21569.4 19572.4 22012.5
## 127        20884.6 20068.2 21734.3 19649.0 22198.0
## 128        21013.3 20159.6 21903.0 19721.9 22389.2
## 129        21095.4 20207.2 22022.7 19752.2 22529.9
## 130        21219.9 20296.6 22185.1 19824.3 22713.7
## 131        21351.2 20393.8 22353.5 19904.5 22903.0

The blue section is the forecast. We can see the 80% and the 95% confidence intervals of the GDP forecast, and the middle blue line is the point forecast. We can see that the US GDP is expected to reach its maximum historical level (before COVID) in less than 2 years. To check this in more detail, we can calculate the maximum historical GDP and then see when the forecast surpasses that maximum:

max_gdpus = max(GDPUS$GDP)
max_gdpus
## [1] 19254

The maximum historical GDP was 19253.959 and it happened in the quarter starting 2019-10-01 (Q4 of 2019):

View(tail(GDPUS,6))
Date GDP lnGDP R r crisis shock
105 2019-01-01 18950.3 9.84958 0.022658 0.022405 0 0.000000
106 2019-04-01 19020.6 9.85328 0.019632 0.019441 0 0.000000
107 2019-07-01 19141.7 9.85963 0.020765 0.020552 0 0.000000
108 2019-10-01 19254.0 9.86547 0.023389 0.023120 0 0.000000
109 2020-01-01 19010.8 9.85276 0.003193 0.003188 0 0.000000
110 2020-04-01 17302.5 9.75861 -0.090328 -0.094671 1 1.000000
111 2020-07-01 18596.5 9.83073 -0.028483 -0.028897 1 0.305236

Looking at the forecast for the next quarters:

View(usforecast$mean)
x
2020-10-01 18731.0
2021-01-01 18716.8
2021-04-01 18911.7
2021-07-01 19114.1
2021-10-01 19318.6
2022-01-01 19392.8
2022-04-01 19684.9
2022-07-01 19987.2
2022-10-01 20109.4
2023-01-01 20187.1
2023-04-01 20305.3
2023-07-01 20430.1
2023-10-01 20555.4
2024-01-01 20635.3
2024-04-01 20756.6
2024-07-01 20884.6
2024-10-01 21013.3
2025-01-01 21095.4
2025-04-01 21219.9
2025-07-01 21351.2

We see that the forecast surpasses this historical maximum in the quarter starting 2021-10-01 (Q4 of 2021).
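Instead of scanning the table by eye, a short sketch (my own addition, assuming the usforecast, futurequarters and max_gdpus objects created above) can find that quarter programmatically:

# Build an xts series with the point forecasts and find the first quarter
#  whose forecast exceeds the historical maximum:
usfc = xts(as.numeric(usforecast$mean), order.by = futurequarters)
index(usfc)[as.numeric(usfc) > max_gdpus][1]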

4.3 Forecasting the Mexican Economy considering the COVID shock

4.3.1 Calibrating the model for the Mexican Economy

Now I do the model for the Mexican GDP. In this case, I assume that the US economy significantly influences the Mexican economy.

Then I will include the following explanatory variables of this model:

1 - The GDPUS (historical and forecasted)

2 - A COVID shock for the Mexican Economy

I check the ac and pac plot for the seasonal difference of the log of the Mexican GDP:

acf2(na.omit(diff(GDPMX$lnGDP,lag=4)))

##      [,1]  [,2] [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
## ACF  0.71  0.35  0.1 -0.06 -0.08 -0.08 -0.07 -0.06 -0.08 -0.09 -0.10 -0.10
## PACF 0.71 -0.31  0.0 -0.11  0.11 -0.08  0.00 -0.04 -0.05 -0.01 -0.04 -0.03
##      [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## ACF  -0.07 -0.03  0.00  0.00 -0.05 -0.11 -0.16 -0.16 -0.11
## PACF  0.01  0.01  0.01 -0.07 -0.07 -0.07 -0.07  0.01 -0.01

I start with an AR(1) model and I also include the explanatory variables, which are the COVID shock for Mexico and the log of the US GDP:

m1<-Arima(GDPMX$GDP,order=c(1,0,0),
          seasonal = list(order=c(0,1,0),period=4),
          include.constant  = TRUE,
          lambda = 0,
          xreg = cbind(GDPMX$shock,GDPUS$lnGDP))
m1
## Series: GDPMX$GDP 
## Regression with ARIMA(1,0,0)(0,1,0)[4] errors 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##         ar1  drift   shock  lnGDP
##       0.824  0.003  -0.161  0.435
## s.e.  0.055  0.001   0.011  0.115
## 
## sigma^2 estimated as 0.000101:  log likelihood=342.21
## AIC=-674.43   AICc=-673.83   BIC=-661.06

With the lambda=0 option, R applies the log transformation to the dependent variable, so the model works with the log of the Mexican GDP. The lambda=0 option does not apply to the explanatory variables, so I included the log of the US GDP as an explanatory variable.

I check the ac and the pac of the residuals of m1:

acf2(m1$residuals)

##      [,1] [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
## ACF  0.29 0.12 -0.06 -0.23 -0.03 -0.14 -0.26 -0.22 -0.14  0.05  0.07  0.03
## PACF 0.29 0.04 -0.12 -0.20  0.12 -0.13 -0.28 -0.12  0.01  0.04 -0.10 -0.08
##      [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## ACF   0.18  0.03  0.13 -0.03 -0.10 -0.05 -0.10 -0.18 -0.09
## PACF  0.20 -0.10 -0.01 -0.14  0.02 -0.03 -0.09 -0.23  0.05
# I do not include the first 4 residuals since they cannot be properly
#   estimated: the model applies a seasonal (lag-4) difference, so the
#   differenced values for the first 4 quarters are not available
#   (even though R still reports residuals for those rows).
acf2(m1$residuals[5:111])

##      [,1] [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
## ACF  0.29 0.11 -0.07 -0.24 -0.02 -0.11 -0.24 -0.19 -0.14  0.06  0.07  0.02
## PACF 0.29 0.03 -0.12 -0.21  0.13 -0.12 -0.27 -0.11  0.01  0.04 -0.10 -0.07
##      [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## ACF   0.16  0.01  0.10 -0.06 -0.12 -0.06 -0.11 -0.17 -0.08
## PACF  0.19 -0.11 -0.01 -0.15  0.02 -0.04 -0.09 -0.24  0.04

Lag 4 is significantly different from zero, so I include a seasonal MA term for lag 4.

I will not include a constant in the Mexican GDP model, since I assume that the Mexican Economy will grow mostly due to the US Economy and its own trend, measured by the autocorrelation of the previous annual % growth.

m2<-Arima(GDPMX$GDP,order=c(1,0,0),
          seasonal = list(order=c(0,1,1),period=4),
          include.constant  = FALSE,
          lambda = 0,
          xreg = cbind(GDPMX$shock,GDPUS$lnGDP))
m2
## Series: GDPMX$GDP 
## Regression with ARIMA(1,0,0)(0,1,1)[4] errors 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##         ar1    sma1   shock  lnGDP
##       0.960  -0.733  -0.139  0.617
## s.e.  0.038   0.127   0.013  0.129
## 
## sigma^2 estimated as 0.0000824:  log likelihood=351.61
## AIC=-693.22   AICc=-692.63   BIC=-679.86
acf2(m2$residuals[5:111])

##      [,1] [,2]  [,3] [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12] [,13]
## ACF  0.12 0.01 -0.14 0.14  0.01 -0.05 -0.28 -0.07 -0.04  0.01 -0.04 -0.06  0.12
## PACF 0.12 0.00 -0.14 0.18 -0.03 -0.07 -0.23 -0.03 -0.03 -0.04  0.02 -0.06  0.14
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## ACF  -0.02  0.11 -0.12 -0.03 -0.07  0.03 -0.22  0.00
## PACF -0.14  0.11 -0.15 -0.07 -0.02 -0.05 -0.17  0.02

I keep this model since there are no significant autocorrelations in the residuals.

4.4 Forecasting the Mexican GDP

I fill in the future shocks for the Mexican Economy. I will assume that the COVID shock for the Mexican Economy will last much longer than in the case of the US Economy: the shock at the end of 2020 will be a bit more than 50% of the maximum shock we had in 2020, it will keep increasing for the next 2 quarters, and then it will diminish from Q3 of 2021 until Q4 of 2023:

FUTURESHOCKMX<-rep(0,20)
# Now I create an xts object with shocks equal to zero for the future 20 quarters:
FUTURESHOCKMX<-xts(FUTURESHOCKMX,order.by=futurequarters)
names(FUTURESHOCKMX)<-c("shock")


FUTURESHOCKMX$shock["2020-10-01"]=0.55
FUTURESHOCKMX$shock["2021-01-01"]=0.70
FUTURESHOCKMX$shock["2021-04-01"]=0.80
FUTURESHOCKMX$shock["2021-07-01"]=0.60
FUTURESHOCKMX$shock["2021-10-01"]=0.40

FUTURESHOCKMX$shock["2022-01-01"]=0.30
FUTURESHOCKMX$shock["2022-04-01"]=0.25
FUTURESHOCKMX$shock["2022-07-01"]=0.25
FUTURESHOCKMX$shock["2022-10-01"]=0.20

FUTURESHOCKMX$shock["2023-01-01"]=0.20
FUTURESHOCKMX$shock["2023-04-01"]=0.15
FUTURESHOCKMX$shock["2023-07-01"]=0.10
FUTURESHOCKMX$shock["2023-10-01"]=0.10

I re-run the model, using the historical shock as explanatory variable:

mgdpmx<-Arima(GDPMX$GDP,order=c(1,0,0),
              seasonal = list(order=c(0,1,1),period=4),
              xreg = cbind(GDPMX$shock,GDPUS$lnGDP),
              include.constant  = FALSE,
              lambda = 0)
mgdpmx
## Series: GDPMX$GDP 
## Regression with ARIMA(1,0,0)(0,1,1)[4] errors 
## Box Cox transformation: lambda= 0 
## 
## Coefficients:
##         ar1    sma1   shock  lnGDP
##       0.960  -0.733  -0.139  0.617
## s.e.  0.038   0.127   0.013  0.129
## 
## sigma^2 estimated as 0.0000824:  log likelihood=351.61
## AIC=-693.22   AICc=-692.63   BIC=-679.86

I need to create a series for the forecasted log of the US GDP and include it as explanatory variable in the forecast:

FUTURElnGDP<-xts(log(usforecast$mean),order.by=futurequarters)
names(FUTURElnGDP)<-c("lnGDP")

mxforecast<-forecast(mgdpmx,xreg=cbind(FUTURESHOCKMX$shock,FUTURElnGDP$lnGDP))
autoplot(mxforecast)

tail(mxforecast$mean)
## Time Series:
## Start = 126 
## End = 131 
## Frequency = 1 
## [1] 4817.22 4823.13 4843.26 4868.79 4881.97 4888.07

The blue section is the forecast. We can see the 80% and the 95% confidence intervals of the GDP forecast, and the middle blue line is the point forecast. We can see that the Mexican GDP will reach the maximum level it had before COVID in about 3 years. To check this in more detail, we can calculate the maximum historical GDP and then see when the forecast surpasses that maximum:

max_gdpmx = max(GDPMX$GDP)
max_gdpmx
## [1] 4644.9

The maximum historical GDP was 4644.900687 and it happened in Q1 of 2019:

View(tail(GDPMX,7))
Date GDP lnGDP R r crisis shock
105 2019-01-01 4644.90 8.44352 0.003677 0.003671 0 0.000000
106 2019-04-01 4640.47 8.44257 0.002360 0.002357 0 0.000000
107 2019-07-01 4631.25 8.44058 -0.001232 -0.001233 1 0.005969
108 2019-10-01 4601.71 8.43418 -0.006710 -0.006733 1 0.032593
109 2020-01-01 4544.77 8.42173 -0.021558 -0.021794 1 0.105499
110 2020-04-01 3774.38 8.23599 -0.186638 -0.206579 1 1.000000
111 2020-07-01 4231.77 8.35038 -0.086257 -0.090206 1 0.436667

Looking at the forecast for the next quarters:

View(mxforecast$mean)
x
2020-10-01 4185.44
2021-01-01 4108.52
2021-04-01 4073.71
2021-07-01 4205.29
2021-10-01 4353.91
2022-01-01 4437.83
2022-04-01 4505.87
2022-07-01 4536.63
2022-10-01 4587.14
2023-01-01 4611.08
2023-04-01 4655.61
2023-07-01 4693.74
2023-10-01 4713.21
2024-01-01 4804.34
2024-04-01 4817.22
2024-07-01 4823.13
2024-10-01 4843.26
2025-01-01 4868.79
2025-04-01 4881.97
2025-07-01 4888.07

We see that this historical maximum GDP for Mexico will be surpassed around Q2 of 2023, but the growth rate of GDP after that quarter looks much smaller than the historical average annual growth. We can actually estimate the GDP growth for the current presidential period from 2018 to 2024:

GDPg2024 = as.numeric(mxforecast$mean["2024-10-01"] / GDPMX$GDP["2018-10-01"]) - 1
cat("The 2018-2024 GDP Growth is expected to be ", 100*GDPg2024,"%")
## The 2018-2024 GDP Growth is expected to be  4.54284 %

The estimated growth for the 6-year Mexico presidential period from 2018 to 2024 would be 4.5428%. We calculate the compound average annual growth rate (CAGR) for the 2018-2024 “sexenio” as follows:

GDPcagr2024 = (1+GDPg2024)^(1/6) - 1
GDPcagr2024
## [1] 0.00743193

The expected average annual GDP growth for this “sexenio” is estimated around 0.7432%.

This might be the worst economic growth of any “sexenio” in Mexico at least since 1994; growth in the previous “sexenios” has been between 11% and 19%. We can calculate the GDP growth of all the presidential “sexenios” as follows:

GDPg2024 = as.numeric(mxforecast$mean["2024-10-01"] / GDPMX$GDP["2018-10-01"]) - 1
cat("The 2018-2024 GDP Growth is expected to be ", 100*GDPg2024,"%")
## The 2018-2024 GDP Growth is expected to be  4.54284 %
GDPg2018 = as.numeric(GDPMX$GDP["2018-10-01"]) / as.numeric(GDPMX$GDP["2012-10-01"]) - 1 
cat("The 2012-2018 GDP growth was ",100*GDPg2018,"%")
## The 2012-2018 GDP growth was  14.2795 %
GDPg2012 = as.numeric(GDPMX$GDP["2012-10-01"]) / as.numeric(GDPMX$GDP["2006-10-01"]) - 1 
cat("The 2006-2012 GDP growth was ",100*GDPg2012,"%")
## The 2006-2012 GDP growth was  11.108 %
GDPg2006 = as.numeric(GDPMX$GDP["2006-10-01"]) / as.numeric(GDPMX$GDP["2000-10-01"]) - 1 
cat("The 2000-2006 GDP growth was ",100*GDPg2006,"%")
## The 2000-2006 GDP growth was  12.7661 %
GDPg2000 = as.numeric(GDPMX$GDP["2000-10-01"]) / as.numeric(GDPMX$GDP["1994-10-01"]) - 1 
cat("The 1994-2000 GDP growth was ",100*GDPg2000,"%")
## The 1994-2000 GDP growth was  19.1432 %
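Since these calculations are repetitive, we could also compute all the historical sexenio growth rates in one pass. This is just a sketch, assuming the GDPMX object above contains all the quarters referenced:

# End-of-sexenio quarters (Q4 of each election year):
sexenio_ends <- c("1994-10-01","2000-10-01","2006-10-01","2012-10-01","2018-10-01")
# GDP level at the end of each sexenio:
gdp_at_end <- sapply(sexenio_ends, function(d) as.numeric(GDPMX$GDP[d]))
# Growth of each sexenio relative to the previous one (in %):
sexenio_growth <- 100 * (gdp_at_end[-1] / gdp_at_end[-length(gdp_at_end)] - 1)
round(sexenio_growth, 2)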

5 Application 3: Basic financial analytics

In this section I will work on the following: a) data management of financial data, b) return calculations, c) descriptive statistics for finance, d) visualization of financial data, and e) portfolio formation and basic estimations.

5.1 Data management of financial data

5.1.1 Data collection

We will use the quantmod package to import real online data from Yahoo Finance. This package contains the getSymbols() function, which creates an xts (extensible time series) object in the environment with the downloaded data from the Internet.

If you have not installed it, use the install.packages() function. We start by loading the package:

#install.packages("quantmod")
library(quantmod)

The getSymbols() function downloads online, up-to-date financial data such as stock prices, ETF prices, interest rates and exchange rates. getSymbols() allows you to download this data from multiple sources: Yahoo Finance, the FRED database and Oanda. These sources offer thousands of financial and economic data series from many market exchanges, as well as macroeconomic variables for most countries.

You can type ? followed by a function name (for example, ?getSymbols) in the console or in an R script and run it to learn more about the syntax of any function. This will display the R documentation of the function in the bottom-right pane:

?getSymbols

Now, we will work with historical data of the Bitcoin cryptocurrency and the TESLA stock. We download daily quotations of these instruments from January 1, 2017 to date from Yahoo Finance:

getSymbols(c("BTC-USD","TSLA"), from="2017-01-01", src="yahoo", periodicity="daily")
## [1] "BTC-USD" "TSLA"

This function will create an xts-zoo R object for each ticker. Each object has the corresponding historical daily prices. xts stands for extensible time-series. An xts-zoo object is designed to easily manipulate time series data.

BTC-USD and TSLA are the ticker names in Yahoo Finance. The from argument indicates the initial date from which you want to bring data. The to argument is the end date of the series you want to download; in this case we omit the to argument in order to download the most recent data. The src argument indicates the source of the data, in this case Yahoo Finance. Finally, the periodicity argument specifies the granularity of the data ("daily", "weekly", "monthly" or "quarterly"), also passed as a character value.
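For example, if we only wanted Tesla's 2020 history, we could also specify the to argument. This is just an illustrative sketch (the object name tsla2020 is hypothetical); setting auto.assign = FALSE returns the data instead of overwriting the TSLA object we use later:

# Download only 2020 daily data for Tesla into a separate object:
tsla2020 <- getSymbols("TSLA", from = "2020-01-01", to = "2020-12-31",
                       src = "yahoo", periodicity = "daily", auto.assign = FALSE)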

You can check the content of any of these datasets with View(). When tickers have special characters, we have to refer to the object using backticks (`):

You can list the FIRST 5 rows of a dataset by using head():

head(`BTC-USD`,5)
##            BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
## 2017-01-01      963.658      1003.08     958.699       998.325      147775008
## 2017-01-02      998.617      1031.39     996.702      1021.750      222184992
## 2017-01-03     1021.600      1044.08    1021.600      1043.840      185168000
## 2017-01-04     1044.400      1159.42    1044.400      1154.730      344945984
## 2017-01-05     1156.730      1191.10     910.417      1013.380      510199008
##            BTC-USD.Adjusted
## 2017-01-01          998.325
## 2017-01-02         1021.750
## 2017-01-03         1043.840
## 2017-01-04         1154.730
## 2017-01-05         1013.380

Also, you can list the LAST 5 rows of a dataset. Note that you can change the number of rows you want to display.

tail(`BTC-USD`, 5)
##            BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
## 2021-01-17      36163.6      36722.4     34069.3       35791.3    52359854336
## 2021-01-18      35792.2      37299.3     34883.8       36630.1    49511702429
## 2021-01-19      36642.2      37755.9     36069.8       36069.8    57244195486
## 2021-01-20           NA           NA          NA            NA             NA
## 2021-01-21           NA           NA          NA            NA             NA
##            BTC-USD.Adjusted
## 2021-01-17          35791.3
## 2021-01-18          36630.1
## 2021-01-19          36069.8
## 2021-01-20               NA
## 2021-01-21               NA

For each period, Yahoo Finance keeps track of the open, high, low, close (OHLC) and adjusted prices. It also keeps track of the volume traded in each period. Adjusted prices are relevant for stocks, not for currencies, since they account for dividend payments and stock splits. Then, for the Bitcoin series we can use either the close or the adjusted price to calculate daily returns.

Let’s see some of the benefits of using xts-zoo objects. We can, for example, select columns using any of the following functions, where x represents a generic xts-zoo object (a short usage example follows the list):

  • Op(x): Extract the Opening prices of the period.
  • Hi(x): Extract the Highest price of the period.
  • Lo(x): Extract the Lowest price of the period.
  • Cl(x): Extract the closing prices of the period.
  • Vo(x): Extract the volume traded of the period.
  • Ad(x): Extract the Adjusted prices of the period.
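For instance, a quick sketch using the BTC-USD object we downloaded above:

# Extract the closing and the adjusted prices of Bitcoin:
btc_close <- Cl(`BTC-USD`)
btc_adj   <- Ad(`BTC-USD`)
# Compare the first rows of both series:
head(merge(btc_close, btc_adj), 3)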

5.1.2 Merging and cleaning financial datasets

We can use the merge() function to create a consolidated dataset of one or more xts-zoo objects:

prices <- merge(`BTC-USD`, TSLA)
# I can select only the adjusted prices in order to calculate returns:
adjprices <-Ad(prices) 

Now we have an xts-zoo object with 2 columns. I can change the names of the columns to simpler names:

names(adjprices)<-c("bitcoin","tesla")

Now I can make reference to the adjusted prices using these names.

In finance, when managing daily data it is very common to have gaps in the series. What does this mean? It means that the series contains some missing days; for example, stock series have no data for weekends or holidays. R deals well with these calendar gaps because it recognizes that we are working with a time series object, so the time index itself has NO GAPS and computing returns is not a problem. However, R does not automatically deal with empty values (called NA’s), and it is a good idea to have a dataset free of NA’s. So, I can use the na.omit function:

adjprices <- na.omit(adjprices)
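As a quick sanity check (just a sketch using the objects created above), we can count how many rows were dropped and confirm that no missing values remain:

# Rows removed by na.omit:
nrow(merge(`BTC-USD`, TSLA)) - nrow(adjprices)
# Remaining NA's per column (should all be zero):
colSums(is.na(adjprices))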

5.2 Return calculation

We can create xts-zoo objects for simple and continuously compounded returns of both instruments:

# Calculating continuously compounded daily returns:
ccreturns <- diff(log(adjprices)) 
# Calculating simple daily returns:
returns <- adjprices / lag(adjprices,n=1) - 1
# We can also calculate simple returns using the diff function:
returns2<- diff(adjprices) / lag(adjprices,n=1)

The diff function works with xts objects to calculate the difference between the value of one period and the value of the previous period. The lag function returns the previous value of the time series.
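A tiny illustration (sketch) of how these two functions relate, using the Bitcoin prices:

# Current price, previous price and their difference, side by side:
check <- merge(adjprices$bitcoin,
               lag(adjprices$bitcoin, 1),
               diff(adjprices$bitcoin))
colnames(check) <- c("price", "previous_price", "price_change")
head(check, 3)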

We can calculate the holding period return (HPR) over any time window for each instrument. For example, if we want to calculate the whole-period return from the first day of the series to the most recent one, we can do the following:

HPR1 <- 100 * (as.vector(last(adjprices)) / as.vector(first(adjprices)) - 1)
HPR1
## [1] 3355.49 1846.06
cat("The holding period return for Bitcoin is: ", HPR1[1], "%")
## The holding period return for Bitcoin is:  3355.49 %
cat("The holding period return for Tesla is: ", HPR1[2], "%")
## The holding period return for Tesla is:  1846.06 %

We could also use the continuously compounded returns to calculate the same holding period simple returns:

HPR2 <- 100*(exp(colSums(ccreturns, na.rm=TRUE)) - 1)
HPR2
## bitcoin   tesla 
## 3355.49 1846.06

5.3 Visualization of financial data

One of the advantages of the quantmod package is that it has some built-in data visualization capabilities. For example, the chartSeries function is a plotting tool designed to create standard financial charts given a time series object. The chartSeries function also includes some arguments that allow the user to modify the cosmetics of the plot, such as the theme argument.

Let’s see the performance of Bitcoin prices:

chartSeries(`BTC-USD`, theme = ("white"))

Let’s see the performance of TESLA prices:

chartSeries(TSLA, theme = ("black"))

We can see exponential growth of both the Bitcoin and the Tesla prices over the last months. If you had invested in Bitcoin or Tesla in July 2020 and had sold your position in early January 2021, you would have multiplied your investment by around 4 times (a period return of about 300%)!
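If we want to zoom into that period, chartSeries also accepts an xts-style subset argument. A quick sketch (the date range is just illustrative):

# Plot only Tesla prices from July 2020 to January 2021:
chartSeries(TSLA, theme = ("white"), subset = "2020-07::2021-01")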

We can visualize the daily returns over time:

plot(returns$bitcoin)

plot(returns$tesla)

With these plots we can appreciate the daily volatility of each instrument over time. Volatility is a measure of how dispersed the returns are around their mean; it is basically the standard deviation of returns.
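For example, we can compute the daily volatility of each instrument directly as the standard deviation of its returns (a quick sketch using the returns object above):

# Daily standard deviation of simple returns (NA's removed first):
apply(na.omit(returns), 2, sd)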

5.4 Basics of Portfolio analysis

We will use the PerformanceAnalytics and related packages for this section:

# Install the packages only if you need:
#install.packages("tidyverse")
#install.packages("PerformanceAnalytics")
#install.packages("highcharter")
# Load the packages:
library(tidyverse)
library(PerformanceAnalytics)
library(highcharter)

We will do an exercise with a simple portfolio composed of 5 US stocks: Tesla, Microsoft, Nextera, Walmart and Pfizer. You can change the tickers of the stocks as you wish, as long as Yahoo Finance has data for them. I selected stocks from the following industries: high-tech, clean energy, pharmaceutical and retail.

5.4.1 Automation of data collection and data management

With the following code we will automatically download data from the 5 stocks, merge the data and calculate returns:

symbols <- c("TSLA","MSFT","NEE","WMT","PFE")

prices <- getSymbols(symbols,src = 'yahoo',
             periodicity = "monthly",
             from = "2018-01-01",
             auto.assign = TRUE,
             warnings = FALSE) %>%
  map(~Ad(get(.))) %>%
  reduce(merge) %>%
  `colnames<-`(symbols)

Note that I am performing several data management steps in sequence. The pipe operator %>% passes the result of one step to the next. The previous code does the following:

  1. Download price data for the 5 stocks

  2. Apply (map) the Ad() function to each downloaded xts-zoo dataset in order to get the adjusted stock prices

  3. Merge all the xts-zoo objects into one object

  4. Finally, rename the columns of the merged object and store the result as prices.

5.4.2 Return calculations

We calculate returns of all stocks using the Return.calculate function:

Returns<-Return.calculate(prices) %>%
  na.omit() 

The na.omit function was also applied to clean the dataset and drop the rows with NA values.

We calculate the continuously compounded returns (also called log returns) of all stocks:

returns<-Return.calculate(prices,method = "log") %>%
         na.omit() 
# We could calculate this using the diff and log functions:
returns2<-na.omit(diff(log(prices)))
# The returns and returns2 objects will have exactly the same returns

5.4.3 Descriptive statistics of returns

We calculate descriptive statistics for the returns of all stocks:

table.Stats(returns) 
##                    TSLA    MSFT     NEE     WMT     PFE
## Observations    36.0000 36.0000 36.0000 36.0000 36.0000
## NAs              0.0000  0.0000  0.0000  0.0000  0.0000
## Minimum         -0.2539 -0.0835 -0.0621 -0.1692 -0.1453
## Quartile 1      -0.0856 -0.0018 -0.0019 -0.0148 -0.0294
## Median           0.0438  0.0261  0.0172  0.0076 -0.0084
## Arithmetic Mean  0.0690  0.0250  0.0233  0.0103  0.0042
## Geometric Mean   0.0503  0.0237  0.0222  0.0089  0.0019
## Quartile 3       0.2226  0.0561  0.0519  0.0459  0.0331
## Maximum          0.5547  0.1278  0.1781  0.0964  0.1628
## SE Mean          0.0341  0.0088  0.0083  0.0090  0.0114
## LCL Mean (0.95) -0.0002  0.0073  0.0065 -0.0079 -0.0190
## UCL Mean (0.95)  0.1383  0.0428  0.0402  0.0285  0.0273
## Variance         0.0419  0.0028  0.0025  0.0029  0.0047
## Stdev            0.2047  0.0526  0.0497  0.0538  0.0684
## Skewness         0.3872 -0.2415  0.6751 -0.8903  0.3867
## Kurtosis        -0.5337 -0.5447  1.1027  1.6477  0.3583

We can see the arithmetic mean, geometric mean and median of returns. We can also see risk measures such as the standard deviation and variance. Finally, we can see skewness and kurtosis, which are important measures for understanding the distribution of returns.

5.4.4 Visualization of risk and return of stocks

We can do a Box plot of the returns to better appreciate historical median return and risk by looking at the median, the quartiles Q1 and Q3, volatility and extreme values of these returns:

chart.Boxplot(returns)

It is easy to see that Tesla is the stock with the highest risk. The red circles show the mean, the mid line is the median (50th percentile), and each box contains 50% of the data, from Q1 (25th percentile) to Q3 (75th percentile). The dots are extreme values in the context of each stock’s own distribution.

Now we can start evaluating the performance over time of each stock by looking at how much money we would have made if we had invested $1.00 in each stock at the beginning of the period:

charts.PerformanceSummary(Returns, 
                          main = "Performance of $1.00 over time",
                          wealth.index = TRUE)

Since Tesla has had an extraordinary performance over the last months, it is hard to appreciate the performance of the rest of the stocks. So, we can drop Tesla from the plot:

charts.PerformanceSummary(Returns[,2:5], 
                        main = "Performance of $1.00 over time",
                          wealth.index = TRUE)

5.5 Portfolio formation

I start by creating the weights of 2 portfolios. One will be an equally-weighted portfolio, and the other will be an aggressive portfolio that assigns high weights to risky stocks and low weights to conservative stocks.

I start creating a vector of weights for the equally weighted portfolio:

w_ew = rep(0.2,5)
# The rep function repeats a value n times

Now I create a vector of weights for the aggressive portfolio. I will assign the following weights:

Tesla: 40%, Microsoft: 30%, Nextera: 20%, Walmart: 0%, Pfizer: 10%

w_aggressive = c(0.4,0.3,0.2,0,0.1)
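A quick sanity check (sketch): the weights of each portfolio should add up to 1, meaning 100% of the invested money is allocated:

sum(w_ew)
sum(w_aggressive)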

5.6 Portfolio return calculation

We calculate the historical return of the equally-weighted portfolio using the Return.portfolio function:

portfolio_returns_ew <-
  Return.portfolio(Returns,
                   weights = w_ew) %>%
  `colnames<-`("returns")

We calculate the historical return of the aggressive portfolio using the Return.portfolio function:

portfolio_returns_ag <-
  Return.portfolio(Returns,
                   weights = w_aggressive) %>%
  `colnames<-`("returns")
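Before plotting, we can compare how much $1.00 invested in each portfolio would have grown over the whole period. This is only a minimal sketch using the return series just created:

# Ending value of $1.00 invested in each portfolio:
prod(1 + as.numeric(portfolio_returns_ew$returns))
prod(1 + as.numeric(portfolio_returns_ag$returns))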

With the portfolio returns I can plot a performance chart of each portfolio:

charts.PerformanceSummary(portfolio_returns_ew, 
                          main = "Equally-weighted Portfolio")

charts.PerformanceSummary(portfolio_returns_ag, 
                          main = "Aggressive Portfolio Performance")

We finally do a dynamic plot of historical monthly returns of each portfolio using the highchart function from the highcharter package:

highchart(type = "stock") %>% 
  hc_title(text = "Equally-weighted Portfolio") %>% 
  hc_add_series(portfolio_returns_ew$returns, 
                name = "Monthly Returns", 
                color = "cornflowerblue") %>% 
  hc_add_theme(hc_theme_flat()) %>% 
  hc_navigator(enabled = FALSE) %>% 
  hc_scrollbar(enabled = FALSE) %>% 
  hc_legend(enabled = TRUE) %>% 
  hc_exporting(enabled = TRUE)

We can interact with the plot to zoom in and out and to select time periods.

We do the same for the aggressive portfolio:

highchart(type = "stock") %>% 
  hc_title(text = "Aggressive Portfolio") %>% 
  hc_add_series(portfolio_returns_ag$returns, 
                name = "Monthly Returns", 
                color = "cornflowerblue") %>% 
  hc_add_theme(hc_theme_flat()) %>% 
  hc_navigator(enabled = FALSE) %>% 
  hc_scrollbar(enabled = FALSE) %>% 
  hc_legend(enabled = TRUE) %>% 
  hc_exporting(enabled = TRUE)

This is just an example of basic financial analysis that we can do in R.

I hope you enjoyed the workshop! At the very least, I hope you leave with the curiosity to keep exploring the fascinating world of Business Analytics.