This workshop illustrates the potential of Business Analytics. I will focus on the use of R and RStudio. If you are interested in applications/examples using Excel, Power BI and my own company's BI platform, you can watch a 2020 conference talk I gave about Business Analytics using these tools at:
https://www.youtube.com/watch?v=_0n38vIzOAI.
At the end of the workshop you will work on finding insights about the economic development of your region/industry. You will have to play with the following dashboard:
In this workshop I include three applications in three business areas as examples of how technical capabilities in BA can bring competitive advantage to an organization.
Business analytics (BA) covers several functional areas of business, such as marketing, finance, human resources and logistics. In this workshop I will focus on basic applications of BA in Finance and Economics using the R/RStudio platform.
The objective of this workshop is to illustrate applications of BA in Marketing, Economics and Finance. Participants can learn basic data analytic skills related to a) data management and visualization, and b) basic statistical analysis applied to marketing, economics and finance.
We will focus on 3 applications of BA:
Marketing analytics: models for price elasticity of demand
Economics analytics: forecasting the US and the Mexican economies in the context of COVID
Financial analytics: basic financial analysis of stocks and portfolios
The course is practice-oriented. I have designed these problems to illustrate the potential of BA competencies to dig into important and current business problems. If you want to learn more about the theories and econometric methods used in this CADI, let me know and I can share with you my “work-in-progress” e-book about financial econometrics and quantitative analysis.
It is assumed that participants have very basic knowledge of the R/RStudio platform, and might not have deep knowledge about the economic, financial and econometric theories used here. However, all the exercises can be re-run by any student; you only need patience!
There is no formal definition of Data Science. However, it can be seen as the intersection of statistics, mathematics, computer science and any knowledge domain that demands the analysis of large volumes of data, such as banking, finance, economics, marketing, demographics, genetics or astronomy. If the domain is a business discipline, then we can call it Business Analytics, or simply Data Science for Business.
Data Science includes not only data management - data collection, data merging and data preparation - but also data analysis tailored to understanding a problem in a specific domain, so that we can make predictions and ultimately improve decision making in organizations. For data analysis we can use techniques based on a) statistical modeling, and also b) machine learning. Actually, machine learning is a combination of statistical, mathematical and artificial intelligence methods.
Focusing on the business domain, we can classify business analytics as follows:
Descriptive analytics refers to the analysis of historical business/economic data to better understand the current business situation. Diagnostic analytics refers to the ability to find business insights - mainly business opportunities and threats - after analyzing present and past business data. Predictive analytics refers to the ability to create possible future scenarios based on historical data and assumptions about organizational variables. Prescriptive analytics refers to the ability to identify/propose a specific strategic business plan according to the previous descriptive, diagnostic and predictive analysis.
For descriptive analytics the main techniques are data management, statistical analysis and data visualization. For predictive analytics the main techniques are statistical modeling and machine learning.
Machine learning is a very dynamic and changing field. It refers to automatic algorithms that can learn from facts (data), identify patterns, and then provide a solution to a problem or come up with forecasts of variables.
Machine learning is a combination of artificial intelligence techniques along with statistical and mathematical methods. The main purpose of machine learning is to process data and come up with models to a) understand patterns, and b) predict variables. Then, machine learning is a set of algorithms that receive data, select a model, train the model with the data in order to calibrate the model, and then execute the model and come up with insights and/or predictions.
The machine learning techniques we will focus on in this course are mostly statistical techniques applied to Marketing and Finance.
Successful data-driven companies apply Data Science to stay competitive in the market. These organizations combine a variety of internal and external data sources in an analytics engine and translate the data into quantifiable, actionable insights to support effective decisions for the organization.
A company is analyzing a price promotion for one of its consumer products for next month. The current price for each unit of the product is $100.00, and the total volume sold this month was 1,000 units. If the company offers a sales promotion of 20% off for next month, what is the minimum % increase in volume (% in units) needed in order to end up with no loss in value sales (a 0% change in value sales) for the next month?
More generally, for each of the following price discounts, what should be the % increase in sales volume in order to achieve the same value sales as this month?
% Discount depth | % Volume growth to level sales |
---|---|
-5% | ? |
-10% | ? |
-15% | ? |
-20% | ? |
-25% | ? |
-30% | ? |
-35% | ? |
-40% | ? |
-45% | ? |
-50% | ? |
Can you find a way to quickly calculate the missing values of this table in R?
Many people would think that the solution is to apply the same % price discount to the % volume increase, but let's see what happens.
For the case of a -20% price change, we would have a discounted price of $80.00. If next month volume sales increase by +20%, then the units sold would be 1,200 units, and the total value sales would be $80.00 * 1,200, which is $96,000.00. So value sales would be down $4,000.00 compared to this month's value sales.
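As a quick sanity check, we can reproduce this arithmetic in a couple of lines of base R:
# Quick check of the -20% discount with a naive +20% volume increase:
new_price  = 100 * (1 - 0.20)         # discounted price: 80
new_volume = 1000 * (1 + 0.20)        # units sold with +20% volume growth: 1,200
new_sales  = new_price * new_volume   # value sales next month: 96,000
new_sales - 100 * 1000                # shortfall vs this month: -4,000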
We will define the variables for the problem, and create an R data frame with the price scenarios.
We first clear the environment of variables, and create the variables for the problem:
# To clear our environment we use the remove function rm:
rm(list=ls())
# ls() is a function that returns all the names of the R objects we
# have in the environment
# We create variables for price and volume of the product in the
# current month:
Price0 = 100
Vol0 = 1000
# We calculate value sales of this month:
Sales0 = Price0 * Vol0
For price, volume and value sales we will create a table for the different price discount scenarios. We create a data frame for the table:
# We create a vector as a sequence for the discount scenarios:
D = seq(from=0, to=-0.50, by=-0.05)
# We create a data frame using the discount vector
scenarios = as.data.frame(D)
Now we can create columns for the different prices, volume and % increases for each scenario:
scenarios$Price1 = Price0 * (1+scenarios$D)
We start assigning the intuitive (but wrong) volume increase for each scenario and see what happens to value sales:
# G column will be the volume growth in %, and will be equal to
# the % price discount:
scenarios$G = -1*scenarios$D
# Create the column for volume sales for each scenario (Vol1):
scenarios$Vol1 = Vol0 * (1+scenarios$G)
Now we can create a column for value sales in money:
scenarios$Sales1 = scenarios$Price1 * scenarios$Vol1
We see that volume sales increase, while value sales decrease, with each discount depth scenario:
View(scenarios)
D | Price1 | G | Vol1 | Sales1 |
---|---|---|---|---|
0.00 | 100 | 0.00 | 1000 | 100000 |
-0.05 | 95 | 0.05 | 1050 | 99750 |
-0.10 | 90 | 0.10 | 1100 | 99000 |
-0.15 | 85 | 0.15 | 1150 | 97750 |
-0.20 | 80 | 0.20 | 1200 | 96000 |
-0.25 | 75 | 0.25 | 1250 | 93750 |
-0.30 | 70 | 0.30 | 1300 | 91000 |
-0.35 | 65 | 0.35 | 1350 | 87750 |
-0.40 | 60 | 0.40 | 1400 | 84000 |
-0.45 | 55 | 0.45 | 1450 | 79750 |
-0.50 | 50 | 0.50 | 1500 | 75000 |
Then we need to find a higher % volume growth (values of G) for each discount depth to end up with the same value sales for all discount depths. We could do this by trial and error, but it might take some time. The question, then, is: can we find a formula for G that depends on the discount depth? Let's play with the relationship between price, price discount, volume, volume growth and value sales:
Value sales equals unit price times volume sales. Then:
\[ S_{0}=Price_{0}*Vol_{0} \]
\[ S_{1}=Price_{1}*Vol_{1} \] Where
\(S_0\) = Value sales this month
\(S_1\) = Value sales next month
\(Price_0\) = Unit price this month
\(Price_1\) = Unit price next month
\(Vol_0\) = Volume sales this month
\(Vol_1\) = Volume sales next month
We can express the unit price and volume of next month with respect to this month's price and volume: \[ Price_{1}=Price_{0}*\left(1+D\right) \] \[ Vol_{1}=Vol_{0}*\left(1+G\right) \] Where:
D = % unit price change from this month to the next month
G = % growth of units sold from this month to the next
If we want to force that the value sales of next month (\(S_1\)) be the same as the value sales of this month (\(S_0\)), then:
\[ S_{0}=S_{1} \] \[ Price_{0}*Vol_{0}=Price_{1}*Vol_{1} \] Expressing \(Price_1\) and \(Vol_1\) as functions of \(Price_0\) and \(Vol_0\):
\[ Price_{0}*Vol_{0}=Price_{0}*\left(1+D\right)*Vol_{0}*\left(1+G\right) \] Dividing both sides by \(Price_0 * Vol_0\):
\[ 1=\left(1+D\right)*\left(1+G\right) \] Dividing both sides by (1+D):
\[ \frac{1}{\left(1+D\right)}=\left(1+G\right) \]
Subtracting 1 from both sides:
\[ G=\frac{1}{\left(1+D\right)}-1 \]
Let’s apply this formula to update G, Vol1 and Sales1 in our table of discount scenarios:
scenarios$G = 1 / (1+scenarios$D) - 1
scenarios$Vol1 = Vol0 * (1+scenarios$G)
scenarios$Sales1 = scenarios$Price1 * scenarios$Vol1
We see the content of the scenarios table:
View(scenarios)
D | Price1 | G | Vol1 | Sales1 |
---|---|---|---|---|
0.00 | 100 | 0.000000 | 1000.00 | 100000 |
-0.05 | 95 | 0.052632 | 1052.63 | 100000 |
-0.10 | 90 | 0.111111 | 1111.11 | 100000 |
-0.15 | 85 | 0.176471 | 1176.47 | 100000 |
-0.20 | 80 | 0.250000 | 1250.00 | 100000 |
-0.25 | 75 | 0.333333 | 1333.33 | 100000 |
-0.30 | 70 | 0.428571 | 1428.57 | 100000 |
-0.35 | 65 | 0.538462 | 1538.46 | 100000 |
-0.40 | 60 | 0.666667 | 1666.67 | 100000 |
-0.45 | 55 | 0.818182 | 1818.18 | 100000 |
-0.50 | 50 | 1.000000 | 2000.00 | 100000 |
We got the expected result. Value sales for next month (Sales1) are the same for each discount scenario. Interestingly, we can see that we always need a % volume growth that is higher than the absolute value of the % price change. For a discount of 50%, we need to sell 100% more (double) to get the same value sales!
Now that we have played with the relationship between % price change and % volume change, we will review the concept of price elasticity of demand.
Price elasticity measures the sensitivity of changes in volume sales with respect to changes in unit price. We can use % as a measure of change in volume sales and price change. Then, we can state price elasticity of demand of a product as:
\[ E=\frac{\%\Delta Volume}{\%\Delta Price} \] Where E= Price Elasticity of demand
Using our notation: \[ E=\frac{G}{D} \] It is easy to understand the concept of price elasticity, but the estimation of elasticity is not quite straightforward. Let's see how we can calculate E for each price discount of our example. We add a new column for price elasticity in our scenarios table:
scenarios$E = scenarios$G / scenarios$D
We see the content of the scenarios table:
View(scenarios)
D | Price1 | G | Vol1 | Sales1 | E |
---|---|---|---|---|---|
0.00 | 100 | 0.000000 | 1000.00 | 100000 | NaN |
-0.05 | 95 | 0.052632 | 1052.63 | 100000 | -1.05263 |
-0.10 | 90 | 0.111111 | 1111.11 | 100000 | -1.11111 |
-0.15 | 85 | 0.176471 | 1176.47 | 100000 | -1.17647 |
-0.20 | 80 | 0.250000 | 1250.00 | 100000 | -1.25000 |
-0.25 | 75 | 0.333333 | 1333.33 | 100000 | -1.33333 |
-0.30 | 70 | 0.428571 | 1428.57 | 100000 | -1.42857 |
-0.35 | 65 | 0.538462 | 1538.46 | 100000 | -1.53846 |
-0.40 | 60 | 0.666667 | 1666.67 | 100000 | -1.66667 |
-0.45 | 55 | 0.818182 | 1818.18 | 100000 | -1.81818 |
-0.50 | 50 | 1.000000 | 2000.00 | 100000 | -2.00000 |
We see that for each discount depth, the estimate of E is different. This sounds weird, doesn't it? If this estimation is correct, do we then need to calculate an elasticity value for each value of discount depth?
Before responding to these questions, we will show a scatter plot to better appreciate the relationship between price % change and volume % change considering all discount scenarios:
plot.default(x=scenarios$D, y=scenarios$G)
We see a curvilinear relationship between price % change and volume % change to keep the same level of value sales.
We can also see the relationship between price and volume (not in % changes; in values):
plot(x=scenarios$Price1,y=scenarios$Vol1)
We also see a curvilinear relationship between price and volume to keep the same level of value sales.
Usually volume % change is a function of price % change; the deeper the price discount, the higher the volume % change. What we plotted is a mathematical function of % volume change with respect to % price change:
\[ G=f(D) \] Then, elasticity can be defined as the derivative of G with respect to D, since elasticity is how much G changes with changes in D: \[ E=\frac{\delta G}{\delta D}=G' \] Remember that the derivative is the slope of a tangent to the function at a specific value of the X variable. Looking at the previous plot, if we draw an imaginary tangent to the curve at each value of the discount %, we can see that the more negative the discount, the steeper the tangent, so the elasticity changes for different % price discounts.
Then, the question can be: can we have a measure of elasticity with the same value for any value of %price discount?
The answer is YES! If we calculate price elasticity using continuously compounded (cc) % changes for price and volume, instead of simple % changes, then we end up with a measure of elasticity that does not change with changes in % price.
One way to calculate cc % change of a variable is just to take the natural log of 1 + simple % change. To calculate the cc % change of price and volume:
\[ d=log\left(1+D\right) \] \[ g=log\left(1+G\right) \] Then we use these cc % changes to estimate elasticity:
\[ e=\frac{g}{d} \] I used small letters (d, g) for the cc % changes of discount and sales volume growth; and e for the new elasticity measure using the cc % changes.
Let’s do these calculations for our example:
We first calculate the continuously compounded % change of price and volume. We will add these variables as columns of our scenarios table:
scenarios$d = log(1+scenarios$D)
scenarios$g = log(1+scenarios$G)
scenarios$e = scenarios$g / scenarios$d
We see the content of the scenarios table:
View(scenarios)
D | Price1 | G | Vol1 | Sales1 | E | d | g | e |
---|---|---|---|---|---|---|---|---|
0.00 | 100 | 0.000000 | 1000.00 | 100000 | NaN | 0.000000 | 0.000000 | NaN |
-0.05 | 95 | 0.052632 | 1052.63 | 100000 | -1.05263 | -0.051293 | 0.051293 | -1 |
-0.10 | 90 | 0.111111 | 1111.11 | 100000 | -1.11111 | -0.105361 | 0.105361 | -1 |
-0.15 | 85 | 0.176471 | 1176.47 | 100000 | -1.17647 | -0.162519 | 0.162519 | -1 |
-0.20 | 80 | 0.250000 | 1250.00 | 100000 | -1.25000 | -0.223144 | 0.223144 | -1 |
-0.25 | 75 | 0.333333 | 1333.33 | 100000 | -1.33333 | -0.287682 | 0.287682 | -1 |
-0.30 | 70 | 0.428571 | 1428.57 | 100000 | -1.42857 | -0.356675 | 0.356675 | -1 |
-0.35 | 65 | 0.538462 | 1538.46 | 100000 | -1.53846 | -0.430783 | 0.430783 | -1 |
-0.40 | 60 | 0.666667 | 1666.67 | 100000 | -1.66667 | -0.510826 | 0.510826 | -1 |
-0.45 | 55 | 0.818182 | 1818.18 | 100000 | -1.81818 | -0.597837 | 0.597837 | -1 |
-0.50 | 50 | 1.000000 | 2000.00 | 100000 | -2.00000 | -0.693147 | 0.693147 | -1 |
Interestingly, the magnitudes of d and g are exactly the same for each discount scenario, but with opposite sign. Then the new elasticity measure e is equal to -1 for all discount scenarios! Now we can use ONLY one measure of price elasticity of demand, no matter which discount % is applied to the unit price.
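This is not a coincidence. Since every scenario keeps value sales constant, we know from the earlier derivation that \(\left(1+D\right)*\left(1+G\right)=1\); taking natural logs of both sides gives:
\[ log\left(1+D\right)+log\left(1+G\right)=0 \]
\[ d+g=0 \]
So g = -d, and therefore e = g/d = -1 for any discount depth that leaves value sales unchanged.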
Another advantage of using cc % changes of price and volume is that the relationship between d and g is now linear instead of curvilinear:
plot(x=scenarios$d,y=scenarios$g)
The slope of a tangent to this line is always equal to -1, which is the elasticity measure e!
Another interesting relationship is between the log of price and the log of volume:
plot(x=log(scenarios$Price1),y=log(scenarios$Vol1))
This relationship is also linear, and the slope of the tangent for each level of values for log of price is also -1, which is the new elasticity measure.
We can simulate how sales volume and value sales change with different values of price elasticity of a product. We need to express volume growth G in terms of elasticity e, and recalculate most of the columns of our scenarios table.
We start with the formula for e, solve for g, and then convert g (cc % volume change) into G (simple % volume change). Since \(e=\frac{g}{d}\), then \(g=e*d\). Now we can get G from g. Since \(g=log\left(1+G\right)\), we apply the exponential function to both sides and get:
\[ exp\left(g\right)=\left(1+G\right) \]
exp(g) means raising Euler's number (2.718…) to the power g.
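As a quick check of this conversion, take the -20% discount row of the equal-sales scenarios above (where e = -1): g = -1 * log(1 - 0.20) ≈ 0.2231, and exp(0.2231) - 1 = 0.25, the 25% volume growth shown in the table. In R:
# Check the g-to-G conversion for the -20% discount row with e = -1:
g_check = -1 * log(1 - 0.20)   # cc % volume change: about 0.2231
exp(g_check) - 1               # simple % volume change: 0.25, as in the table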
Then we can update the columns for g, G, Vol1, and Sales1 in our scenario table with specific values for elasticity e.
Let's start by moving elasticity from -1 to -2 and see how the sales volume (Vol1) and value sales (Sales1) change:
e = -2
scenarios$e = e
scenarios$g = scenarios$e * scenarios$d
scenarios$G = exp(scenarios$g) - 1
scenarios$Vol1 = Vol0 * (1+scenarios$G)
scenarios$Sales1 = scenarios$Price1 * scenarios$Vol1
scenarios$E = scenarios$G / scenarios$D
We see the content of the scenarios table:
View(scenarios)
D | Price1 | G | Vol1 | Sales1 | E | d | g | e |
---|---|---|---|---|---|---|---|---|
0.00 | 100 | 0.000000 | 1000.00 | 100000 | NaN | 0.000000 | 0.000000 | -2 |
-0.05 | 95 | 0.108033 | 1108.03 | 105263 | -2.16066 | -0.051293 | 0.102587 | -2 |
-0.10 | 90 | 0.234568 | 1234.57 | 111111 | -2.34568 | -0.105361 | 0.210721 | -2 |
-0.15 | 85 | 0.384083 | 1384.08 | 117647 | -2.56055 | -0.162519 | 0.325038 | -2 |
-0.20 | 80 | 0.562500 | 1562.50 | 125000 | -2.81250 | -0.223144 | 0.446287 | -2 |
-0.25 | 75 | 0.777778 | 1777.78 | 133333 | -3.11111 | -0.287682 | 0.575364 | -2 |
-0.30 | 70 | 1.040816 | 2040.82 | 142857 | -3.46939 | -0.356675 | 0.713350 | -2 |
-0.35 | 65 | 1.366864 | 2366.86 | 153846 | -3.90532 | -0.430783 | 0.861566 | -2 |
-0.40 | 60 | 1.777778 | 2777.78 | 166667 | -4.44444 | -0.510826 | 1.021651 | -2 |
-0.45 | 55 | 2.305785 | 3305.79 | 181818 | -5.12397 | -0.597837 | 1.195674 | -2 |
-0.50 | 50 | 3.000000 | 4000.00 | 200000 | -6.00000 | -0.693147 | 1.386294 | -2 |
We can see that not only sales volume increases, but also sales value increases with each level of discount depth.
Then, what can you learn/conclude about price elasticity e with respect to changes in sales? In which situations might it be a good idea to offer aggressive discounts for a product? Briefly explain.
In the previous sections we estimated price elasticity of a hypothetical product with specific scenarios of discount depth. Now the question is, how can we estimate price elasticity of a consumer product?
First we need historical sales data of the consumer product. We need historical sales in value ($) and sales in volume, so we can calculate an average unit price per period. We can get daily, weekly, monthly or quarterly data. Using monthly or quarterly data is a good idea since the volatility of sales at a weekly or daily level is high, making it harder to get a robust estimate of elasticity. In addition, using monthly or quarterly data allows us to easily measure the seasonality of a consumer product.
Imagine we have monthly data for sales volume and sales value for a consumer product for the past 36 months. How can you estimate price elasticity?
It is easy to calculate the unit price for each month: we just divide value sales by volume sales for each month. Now we have a column for historical prices, a column for sales volume and a column for value sales. Then we can estimate price elasticity using a simple regression model.
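As a minimal sketch of this idea, assume a data frame mydata with 36 monthly rows and hypothetical columns vsales (value sales) and qsales (volume sales); the object and column names here are illustrative, not part of the dataset used below:
# Hypothetical monthly data frame 'mydata' with columns vsales and qsales:
mydata$price <- mydata$vsales / mydata$qsales       # average unit price per month
elastmodel   <- lm(log(qsales) ~ log(price), data = mydata)
summary(elastmodel)                                  # the slope estimate is the price elasticity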
For the previous hypothetical exercise, we can also estimate price elasticity using a simple regression model. The way we set up this simple regression is using the natural logarithm of sales volume as the dependent variable, and the natural log of price as the independent variable. We can express this regression model as:
\[ lnvol_{t}=b_{0}+b_{1}*lnprice_{t}+\varepsilon_{t} \]
Where:
\(lnvol_{t}\) = log of sales volume at period t, which is the dependent variable of the model.
\(lnprice_{t}\) = log of unit price at period t, which is the independent or explanatory variable of the model.
\(\varepsilon_{t}\) is called the error of the model, and it is supposed to behave like a normal distribution with mean=0 and a specific standard deviation: \(\varepsilon_{t}\sim N(0,\sigma_{\varepsilon})\)
\(b_{0}\)= coefficient for the intercept; it represents the value of the dependent variable, in this case, log of sales volume when the value of the independent variable (log of price) is equal to zero. In other words, it is the point where the regression line crosses the Y axis.
\(b_{1}\) = coefficient for the slope of the regression line. This coefficient is a measure of sensitivity of how much the dependent variable moves for each 1-unit move in the independent variable.
The equation for the regression model is actually the expected value of the dependent variable. Since the mean of the error is zero, then:
\[ E\left[lnvol\right]=b_{0}+b_{1}*lnprice_{t} \]
This regression equation is actually a line that represents all the pairs of values for log of price and log of sales volume.
Let’s run a regression model using the hypothetical example of the discount scenarios:
model1 <- lm(log(scenarios$Vol1) ~ log(scenarios$Price1))
summary(model1)
##
## Call:
## lm(formula = log(scenarios$Vol1) ~ log(scenarios$Price1))
##
## Residuals:
## Min 1Q Median
## -0.000000000000006438 -0.000000000000000288 0.000000000000000052
## 3Q Max
## 0.000000000000001649 0.000000000000002513
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 16.1180956509583417 0.0000000000000155 1040120010071390
## log(scenarios$Price1) -2.0000000000000058 0.0000000000000036 -554948114163277
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## log(scenarios$Price1) <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0000000000000026 on 9 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.08e+29 on 1 and 9 DF, p-value: <0.0000000000000002
The coefficient \(b_{0}\) or intercept = 16.118096, and \(b_{1}\) = -2, which is the same value of elasticity e we got previously using cc % changes in price and volume! In this hypothetical case, all the data points lie exactly on the regression line, so all errors are equal to zero and the standard errors (standard deviations of the regression coefficients) are practically zero.
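If we only need the elasticity estimate itself (rather than the full summary), we can pull the slope coefficient directly from the fitted lm object:
# The second coefficient of model1 is the slope b1, our elasticity estimate:
coef(model1)[2]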
Now let's move to a real-world consumer product and estimate its price elasticity. We will download a dataset of historical monthly data of 4 consumer products sold in retail stores. The dataset was constructed from a real-world dataset of consumer products of one specific category sold in big retail stores in Mexico. Do the following to download the dataset directly into R:
# Load the readxl package to read Excel files:
library(readxl)
# Download the excel file from a web site:
download.file("http://www.apradie.com/datos/salesdata1.xlsx",
"sales1.xlsx", mode="wb")
# The first parameter is the link and the second is a name for the
# local file (sales1.xlsx)
# Use the function read_excel() to import the dataset into your R:
sales <- read_excel("sales1.xlsx")
We can see the content of this file. We will display the retail sales data for the first and the last months:
View(head(sales))
month | qsku1 | vsku1 | qsku2 | vsku2 | qsku3 | vsku3 | qsku4 | vsku4 |
---|---|---|---|---|---|---|---|---|
2017-05-01 | 266745 | 1733845 | 213811 | 3998265 | 579472 | 3882460 | 1737617 | 19353875 |
2017-06-01 | 293427 | 1819250 | 235841 | 4268730 | 604328 | 4048995 | 1743237 | 18915640 |
2017-07-01 | 231065 | 1432600 | 201680 | 3630235 | 595902 | 3813775 | 1690972 | 18800880 |
2017-08-01 | 196930 | 1201275 | 190614 | 3507305 | 608106 | 3831065 | 1835160 | 20683400 |
2017-09-01 | 174157 | 1132020 | 171971 | 3129870 | 998830 | 5493565 | 2067265 | 21311065 |
2017-10-01 | 168432 | 1077965 | 177789 | 3289090 | 1556771 | 7939530 | 2512034 | 24061980 |
View(tail(sales))
month | qsku1 | vsku1 | qsku2 | vsku2 | qsku3 | vsku3 | qsku4 | vsku4 |
---|---|---|---|---|---|---|---|---|
2019-12-01 | 82863 | 629755 | 147879 | 3061090 | 439893 | 2947280 | 1624348 | 16548050 |
2020-01-01 | 95525 | 725990 | 99323 | 2055990 | 420908 | 2777990 | 1604222 | 15487280 |
2020-02-01 | 82477 | 635075 | 130238 | 2669880 | 331109 | 2284655 | 1730077 | 16541685 |
2020-03-01 | 112063 | 862885 | 166859 | 3437290 | 337043 | 2325600 | 1913984 | 19594795 |
2020-04-01 | 90136 | 676020 | 151963 | 3160840 | 197738 | 1364390 | 1457315 | 16395955 |
2020-05-01 | 91567 | 686755 | 159098 | 3372880 | 196111 | 1313945 | 1382876 | 15917535 |
The columns that start with v are value sales, and the columns that start with q are volume sales (in units). So vsku1 is the value sales (in $) of product 1, qsku1 is the number of units sold of product 1, etc.
We will work with product 1. We can calculate an average monthly price for product 1, and also columns for the log of price and the log of volume sales:
sales$psku1 = sales$vsku1 / sales$qsku1
sales$logpsku1 = log(sales$psku1)
sales$logqsku1 = log(sales$qsku1)
We can start visualizing the sales data for this product:
plot(x=sales$month,y=sales$qsku1,type="l")
We can see that the number of units sold of product 1 has been dropping dramatically over the last 36 months.
We can calculate the % decline of volume sales from the first month until the most recent month:
library(xts)
# Calculating the % growth of volume sales during the whole period:
G1 = last(sales$qsku1) / first(sales$qsku1) - 1
G1
## [1] -0.656725
The % decline in volume sales of product 1 is -65.672459%. If you were responsible for this product in a company, you should be very, very concerned.
Now we do a plot for historical price of product 1:
plot(x=sales$month,y=sales$psku1,type="l")
We can see that the unit price of product 1 has been going up for the last 36 months. We can calculate the % price increase in the whole period:
D1 = last(sales$psku1) / first(sales$psku1) - 1
D1
## [1] 0.153849
The price increased 15.384869% in the last 36 months (from the first month up to the last month of the data).
Now we can visualize how sales volume changes with different price levels. We plot price vs volume in log scale:
plot(x=sales$logpsku1,y=sales$logqsku1)
abline(lm(sales$logqsku1 ~ sales$logpsku1),col='blue')
I also plotted the regression line between price and volume (in log scale). We clearly see a negative relationship between price and sales volume.
Now we can run a simple linear regression model using the log of sales volume as dependent variable, and the log of price as independent variable:
model1 <- lm(sales$logqsku1 ~ sales$logpsku1)
summary(model1)
##
## Call:
## lm(formula = sales$logqsku1 ~ sales$logpsku1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2456 -0.1348 -0.0146 0.0952 0.4289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.438 0.720 25.60 < 0.0000000000000002 ***
## sales$logpsku1 -3.404 0.372 -9.15 0.000000000083 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.165 on 35 degrees of freedom
## Multiple R-squared: 0.705, Adjusted R-squared: 0.697
## F-statistic: 83.7 on 1 and 35 DF, p-value: 0.0000000000829
Then, the \(b_{1}\) coefficient of this regression is -3.404429, which is an estimate of the price elasticity of demand.
We can interpret this price elasticity as follows: for each 1% increase in the price of the product, its demand is expected to decrease by about 3.40%. This result makes sense after looking at the volume decline vs the price increase.
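We can also use this estimate, together with the relationship \(g=e*d\) derived earlier, to approximate the volume impact of a hypothetical price move; the +5% price increase below is only an illustrative figure:
# Elasticity estimate (slope) from the regression:
e_sku1 <- coef(model1)[2]
# cc % price change for a hypothetical +5% price increase:
d_new <- log(1 + 0.05)
# Implied simple % change in volume (roughly -15% with an elasticity of about -3.40):
exp(e_sku1 * d_new) - 1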
It is important to note that this measure of price elasticity does not consider important factors that affect consumer products in real life. Important drivers of demand for consumer products are 1) seasonality, 2) the organic growth trend, 3) cross-elasticities with competing/substitute products and with complementary products, 4) other commercial factors such as distribution, advertising and promotions, and 5) external events such as economic recessions.
In the case of consumer products it is recommended to include some of the factors mentioned above in the econometric model in order to calculate a more robust measure of direct price elasticity, and also measures of cross-elasticities with significant competing or complementary products. We need more advanced econometric models for this. Due to their complexity, we cannot cover them in detail in this workshop, but we will mention some of the most important econometric models that consider these factors.
The most popular models are called ARIMA/SARIMA and ARIMAX models. ARIMA stands for Autoregressive Integrated Moving Average models. SARIMA stands for Seasonal ARIMA models. ARIMAX stands for ARIMA/SARIMA models with explanatory variables (X’s variables) or factors.
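As a hedged illustration only (not the full demand model with the drivers listed above), the auto.arima() function from the forecast package (loaded by fpp2) can search for a reasonable ARIMA/SARIMA specification for the monthly log volume of product 1:
# A minimal sketch, assuming the fpp2/forecast package is installed:
library(fpp2)
# Monthly log volume of product 1 as a ts object (frequency = 12; the data starts in May 2017):
qsku1_ts <- ts(sales$logqsku1, start = c(2017, 5), frequency = 12)
# Let auto.arima() search for an ARIMA/SARIMA specification:
m_sku1 <- auto.arima(qsku1_ts)
summary(m_sku1)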
In the next BA application we will design and run a simple ARIMA/SARIMA model to forecast the Mexican and US Economy.
In this exercise I will forecast the Mexican Economy for the next 5 years, considering the effect of COVID as an external shock. We will use the GDP as the main measure of the Mexican economy.
I will apply ARIMA/SARIMA and ARIMAX models to this problem. I will not cover the econometric theory of these models due to their complexity. Instead, I will focus on their application, and by working on this problem I will try to illustrate the power of these models for forecasting economic series.
The questions we will try to answer are the following:
How have previous economic crises affected the Mexican Economy?
How did COVID impact, and how does it continue to affect, the Mexican Economy?
How is the Mexican Economy related to the United States (US) Economy?
What is the forecast for the Mexican Economy for the next 5 years, considering only its historical GDP and the effect of the US Economy? With the answer to this question I will see how fast the economy will recover and reach the levels it had just prior to the crisis.
Let’s start by downloading real data of the Mexico and US GDP. I have to install the following packages to download online economic/financial data and packages to work with time series data:
# The quantmod package has functions to download real economic and financial data. It also has functions for several financial analysis:
#install.packages("quantmod")
# The tseries and the astsa packages have functions to manage time-series datasets
#install.packages("tseries")
#install.packages("astsa")
# The fpp2 package has several functions to run ARIMA/SARIMA/ARIMAX models, and also to do forecasting of time series
#install.packages("fpp2")
We load these packages on memory:
# We clear all the R variables/objects from the environment:
#rm(list=ls())
library(quantmod)
library(fpp2)
library(tseries)
library(astsa)
We start downloading both Mexico and US GDP from the FRED Data source. FRED stands for Federal Reserve Economic Data. FRED is managed by the Federal Reserve Bank of St. Louis, one of the Federal Reserve Banks of the United States.
# I define a vector with the tickers I need from FRED:
= c("NAEXKP01MXQ189S", "GDPC1")
tickers # I download the quarterly GDP data:
getSymbols(tickers, src="FRED")
## [1] "NAEXKP01MXQ189S" "GDPC1"
# getSymbols creates xts R objects with the online data of the tickers I specified. xts stands for "extensible" time series objects.
# I select quarters from 1993 to the most recent quarter:
GDPMX <- NAEXKP01MXQ189S["1993-01-01/"]
GDPUS <- GDPC1["1993-01-01/"]
# I name the columns of the time series:
names(GDPMX) <- c("GDP")
names(GDPUS) <- c("GDP")
Both the Mexican and the US GDP are at constant prices, so they are adjusted for inflation. The Mexican GDP is measured in Mexican pesos, and the US GDP in billions of US dollars. To keep both measures in billions, I divide the Mexican GDP series by 1 billion:
GDPMX$GDP = GDPMX$GDP / 1000000000
For both GDP datasets, I add the log of the GDP index, the simple annual % growth of the GDP, and the cc annual % growth of the GDP:
GDPMX$lnGDP = log(GDPMX$GDP)
GDPUS$lnGDP = log(GDPUS$GDP)
GDPMX$R = GDPMX$GDP / lag(GDPMX$GDP, k = 4) - 1
GDPUS$R = GDPUS$GDP / lag(GDPUS$GDP, k = 4) - 1
# I used the lag function to get the GDP value from 4 quarters earlier in order to calculate the annual % growth
GDPMX$r = diff(log(GDPMX$GDP), lag = 4)
GDPUS$r = diff(log(GDPUS$GDP), lag = 4)
# The diff function calculates the difference between the log of the GDP and its own value 4 quarters ago. This is the method to calculate the cc % annual growth of the GDP.
I add a dummy (binary variable) to indicate recession/crisis periods for the whole history. For the quarters where the annual % GDP growth is negative, I assign 1; 0 otherwise:
GDPMX$crisis = 0
GDPMX$crisis[GDPMX$r < 0] = 1
GDPUS$crisis = 0
GDPUS$crisis[GDPUS$r < 0] = 1
I calculate a shock for the crisis periods relative to the maximum shock in the history. I do this for both the Mexican and the US GDP:
#I get a summary of the annual % growth ONLY for the crisis periods
summary(GDPUS$r[GDPUS$crisis==1])
## Index r
## Min. :2008-10-01 Min. :-0.0952
## 1st Qu.:2009-02-15 1st Qu.:-0.0371
## Median :2009-07-01 Median :-0.0319
## Mean :2013-12-31 Mean :-0.0399
## 3rd Qu.:2020-05-16 3rd Qu.:-0.0275
## Max. :2020-10-01 Max. :-0.0229
# I get the minimum or more negative value of the annual growth:
maxusshock <- min(GDPUS$r[GDPUS$crisis==1])
# I calculate the magnitude of the shock for each quarter. This will be a percentage with respect to the maximum shock in the history of the economy:
GDPUS$shock = GDPUS$r / maxusshock * GDPUS$crisis
# For those cases with NA values, I assign a zero. I do this to maximize the number of observations used in the model:
GDPUS$shock[is.na(GDPUS$r)] = 0
# I do the same for the Mexico GDP:
summary(GDPMX$r[GDPMX$crisis==1])
## Index r
## Min. :1995-01-01 Min. :-0.20881
## 1st Qu.:2001-07-24 1st Qu.:-0.06830
## Median :2009-02-15 Median :-0.01890
## Mean :2008-12-22 Mean :-0.04153
## 3rd Qu.:2019-09-08 3rd Qu.:-0.00805
## Max. :2021-01-01 Max. :-0.00105
maxmxshock <- min(GDPMX$r[GDPMX$crisis==1])
GDPMX$shock = GDPMX$r / maxmxshock * GDPMX$crisis
GDPMX$shock[is.na(GDPMX$r)] = 0
I will start modeling the annual cc % growth of the GDP. I do this since national economies usually have seasonality patterns during the year. But I first have to check whether the annual cc % GDP growth behaves like a stationary time series variable. In order to model a time series variable, it is necessary to use stationary variables.
A stationary time-series variable has the following characteristics: a) the mean and the volatility of the variable are about the same for any time period, and b) the autocorrelations of the variable depend only on the lag, and not on the time period.
Then I run an augmented Dickey-Fuller unit root test to examine whether the annual cc % growth (the seasonal difference of the log of the GDP), quarter by quarter, is stationary:
adf.test(na.omit(diff(GDPUS$lnGDP, lag = 4)),
alternative = "stationary",k=1)
##
## Augmented Dickey-Fuller Test
##
## data: na.omit(diff(GDPUS$lnGDP, lag = 4))
## Dickey-Fuller = -4.361, Lag order = 1, p-value = 0.01
## alternative hypothesis: stationary
The p-value is < 0.05, so I can reject the null hypothesis that the seasonal difference is a random walk. Then I have statistical evidence to treat this series as stationary.
I do the same for the Mexican GDP series:
adf.test(na.omit(diff(GDPMX$lnGDP, lag = 4)),
alternative = "stationary", k = 1)
##
## Augmented Dickey-Fuller Test
##
## data: na.omit(diff(GDPMX$lnGDP, lag = 4))
## Dickey-Fuller = -5.059, Lag order = 1, p-value = 0.01
## alternative hypothesis: stationary
I find the same result as for the US GDP, so I can use the annual difference of the log of GDP as the variable to model with an ARIMA/SARIMA model.
I start by creating the autocorrelation and partial autocorrelation plots for the dependent variable, which is the annual difference of the log of the US GDP (cc % growth of US GDP):
acf2(na.omit(diff(GDPUS$lnGDP,lag=4)))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## ACF 0.66 0.44 0.21 -0.07 0.01 0.02 0.04 0.04 0.04 0.03 0.03 0.03 0.03
## PACF 0.66 0.01 -0.16 -0.27 0.37 0.01 -0.10 -0.17 0.26 -0.01 -0.07 -0.11 0.21
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## ACF 0.02 0.02 0.03 0.02 0.02 0.00 -0.05 -0.06
## PACF -0.03 -0.05 -0.05 0.13 -0.04 -0.07 -0.11 0.11
The ACF plot shows a classical AR(1) pattern, since the autocorrelations between the variable and its first lags decay slowly, while the PACF plot shows a sharp drop in the partial autocorrelations after lag 1. Also, there is a noticeable partial autocorrelation around lag 4, which is a seasonal autocorrelation.
With this information I calibrate the model, including the important autocorrelations. I also include the explanatory variable, which is the COVID shock:
library(lmtest)
mgdpus <- Arima(GDPUS$GDP, order = c(1,0,2), # I include an AR(1) term and 2 MA terms
                seasonal = list(order=c(0,1,1), period=4), # I indicate to calculate a seasonal difference and a seasonal MA term
                include.constant = TRUE,
                lambda = 0, # indicates to use the log transformation to get the annual % growth
                xreg = GDPUS$shock)
coeftest(mgdpus)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.986227 0.016808 58.677 < 0.0000000000000002 ***
## ma1 0.146520 0.098159 1.493 0.136
## ma2 0.142896 0.089965 1.588 0.112
## sma1 -0.873660 0.106673 -8.190 0.000000000000000261 ***
## drift 0.006122 0.001234 4.960 0.000000704539877110 ***
## shock -0.093007 0.004247 -21.899 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mgdpus$aic
## [1] -823.055
# In the order parameter I indicate that the model has 1 AR lag,
# 0 (non-seasonal) differences, and 2 MA lags.
# In the seasonal option I indicate that I need to do an annual/seasonal difference; period=4 means that I have 4 quarters in a season/year
I check whether the residuals of this model behave like white noise, with no significant autocorrelations:
acf2(mgdpus$residuals, max.lag = 8)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## ACF 0.08 0.09 0.11 0.14 0.12 0.07 0.10 -0.12
## PACF 0.08 0.08 0.09 0.12 0.10 0.04 0.05 -0.18
The residuals look like white noise, so I can say that the model is well calibrated.
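A complementary check (a hedged sketch, not part of the original workflow) is a Ljung-Box test on the residuals; the base Box.test() function tests the joint null hypothesis that the first residual autocorrelations are zero:
# Ljung-Box test on the first 8 residual autocorrelations:
Box.test(mgdpus$residuals, lag = 8, type = "Ljung-Box")
# A large p-value supports the white-noise assumption for the residuals.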
Now I will do a forecast. Before that, I have to fill out future data for the COVID shock on the US Economy. I will assume that the COVID shock stops having an impact from 2022 Q1 onwards. However, I will add a RECESSION shock, considering the high inflation and the current economic slowdown in the US. I will assume small shocks for the following quarters (calibrated according to shocks in the history):
I create a date index for the next 20 quarters and an xts dataset for the future US shock:
#Creating a sequence of dates for the future 20 quarters:
futurequarters <- seq(as.Date(end(GDPMX)),
                      by = "quarter",
                      length.out = 21)[-1]
# I show the last future quarters
tail(futurequarters)
## [1] "2025-10-01" "2026-01-01" "2026-04-01" "2026-07-01" "2026-10-01"
## [6] "2027-01-01"
# I create a vector with 20 zeros:
FUTURESHOCKUS <- rep(0, 20)
# Now I create an xts object with shocks equal to zero for the future 20 quarters:
FUTURESHOCKUS <- xts(FUTURESHOCKUS, order.by = futurequarters)
names(FUTURESHOCKUS) <- c("shock")
# Now I assign the shock magnitude for the next quarters:
FUTURESHOCKUS$shock["2022-04-01"] = 0.30
FUTURESHOCKUS$shock["2022-07-01"] = 0.20
FUTURESHOCKUS$shock["2022-10-01"] = 0.10
FUTURESHOCKUS$shock["2023-04-01"] = 0.10
I re-run the model, using the historical shock as explanatory variable:
mgdpus <- Arima(GDPUS$GDP, order = c(1,0,2),
                seasonal = list(order=c(0,1,1), period=4),
                include.constant = TRUE,
                xreg = GDPUS$shock,
                lambda = 0)
coeftest(mgdpus)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.986227 0.016808 58.677 < 0.0000000000000002 ***
## ma1 0.146520 0.098159 1.493 0.136
## ma2 0.142896 0.089965 1.588 0.112
## sma1 -0.873660 0.106673 -8.190 0.000000000000000261 ***
## drift 0.006122 0.001234 4.960 0.000000704539877110 ***
## shock -0.093007 0.004247 -21.899 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With this model I run the forecast for the next 20 quarters along with the expected future shocks:
usforecast <- forecast(mgdpus, xreg = FUTURESHOCKUS$shock)
autoplot(usforecast)
tail(usforecast$mean)
## Time Series:
## Start = 132
## End = 137
## Frequency = 1
## [1] 21501.1 21529.4 21699.4 21835.7 21985.8 22015.4
summary(usforecast)
##
## Forecast method: Regression with ARIMA(1,0,2)(0,1,1)[4] errors
##
## Model Information:
## Series: GDPUS$GDP
## Regression with ARIMA(1,0,2)(0,1,1)[4] errors
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ma1 ma2 sma1 drift shock
## 0.986 0.147 0.143 -0.874 0.006 -0.093
## s.e. 0.017 0.098 0.090 0.107 0.001 0.004
##
## sigma^2 = 0.0000386: log likelihood = 418.53
## AIC=-823.05 AICc=-821.99 BIC=-803.96
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -3.34948 91.5064 66.4819 0.00223007 0.44811 0.491073 -0.00260383
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 118 19339.9 19186.5 19494.4 19105.8 19576.8
## 119 19622.7 19388.1 19860.2 19265.1 19987.0
## 120 19939.6 19629.8 20254.4 19467.7 20423.0
## 121 20150.4 19780.6 20527.1 19587.5 20729.4
## 122 20119.5 19693.0 20555.1 19471.0 20789.6
## 123 20432.9 19948.9 20928.7 19697.3 21196.0
## 124 20571.4 20037.9 21119.1 19761.1 21415.0
## 125 20597.2 20021.7 21189.2 19723.6 21509.5
## 126 20758.5 20133.5 21402.8 19810.3 21752.0
## 127 20887.5 20216.6 21580.6 19870.2 21956.8
## 128 21029.7 20314.7 21770.0 19946.1 22172.3
## 129 21056.8 20304.3 21837.1 19917.0 22261.8
## 130 21222.4 20423.7 22052.2 20013.2 22504.6
## 131 21354.9 20512.9 22231.5 20080.7 22710.0
## 132 21501.1 20616.7 22423.5 20163.3 22927.7
## 133 21529.4 20609.6 22490.4 20138.7 23016.3
## 134 21699.4 20734.4 22709.4 20241.0 23263.0
## 135 21835.7 20828.0 22892.0 20313.6 23471.8
## 136 21985.8 20936.1 23088.2 20400.8 23693.9
## 137 22015.4 20931.2 23155.8 20379.0 23783.3
The blue section is the forecast. We can see the 80% and the 95% confidence intervals of the GDP forecast, and the middle blue line is the point forecast. We can see that the US GDP is expected to reach its maximum historical level (before COVID) in less than 2 years.
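As a quick, hedged way to check this reading with code, we can compare the point forecast against the maximum historical level of the US GDP:
# Maximum historical level of the US GDP:
maxusgdp <- as.numeric(max(GDPUS$GDP))
# First forecasted quarter whose point forecast exceeds that level:
which(as.numeric(usforecast$mean) > maxusgdp)[1]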
Now I do the model for the Mexican GDP. In this case, I assume that the US economy significantly influences the Mexican economy.
Then I will include the following explanatory variables of this model:
1 - The GDPUS (historical and forecasted)
2 - A COVID shock for the Mexican Economy
I check the ACF and PACF plots for the seasonal difference of the log of the Mexican GDP:
acf2(na.omit(diff(GDPMX$lnGDP,lag=4)))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## ACF 0.62 0.32 0.06 -0.22 -0.11 -0.07 -0.06 -0.07 -0.09 -0.09 -0.08 -0.08
## PACF 0.62 -0.09 -0.16 -0.28 0.34 -0.09 -0.13 -0.16 0.19 -0.07 -0.09 -0.13
## [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## ACF -0.06 -0.03 -0.01 0.00 -0.04 -0.08 -0.14 -0.14 -0.09
## PACF 0.17 -0.03 -0.04 -0.14 0.06 -0.09 -0.10 -0.06 0.10
I start with an AR(1) term plus an MA(1) term, and I also include the explanatory variables, which are the COVID shock and the log of the US GDP:
mgdpmx <- Arima(GDPMX$GDP, order = c(1,0,1),
                seasonal = list(order=c(0,1,1), period=4),
                include.constant = TRUE,
                lambda = 0,
                xreg = cbind(GDPMX$shock, GDPUS$lnGDP))
coeftest(mgdpmx)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.915445 0.051281 17.852 < 0.0000000000000002 ***
## ma1 0.169549 0.091861 1.846 0.0649 .
## sma1 -0.680434 0.160065 -4.251 0.0000213 ***
## drift 0.001918 0.001119 1.714 0.0865 .
## shock -0.152578 0.012636 -12.074 < 0.0000000000000002 ***
## lnGDP 0.500565 0.119941 4.173 0.0000300 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mgdpmx$aic
## [1] -725.568
With the lambda=0 option, R applies the log transformation to the dependent variable, that is, the log of the MX GDP. The lambda=0 option does not apply to the explanatory variables, so I included the log of the US GDP as an explanatory variable.
I check the ACF and PACF of the residuals:
acf2(mgdpmx$residuals,max.lag = 8)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## ACF 0.03 0.16 -0.14 0.18 0.00 0.03 -0.24 -0.02
## PACF 0.03 0.16 -0.15 0.17 0.03 -0.05 -0.21 -0.02
I keep this model since there are no significant autocorrelations in the residuals.
I fill out the future shocks for the Mexican Economy. As in the US model, I will not consider a COVID effect in the future, but I will consider a shock due to the recent high inflation and the economic slowdown in the US and Mexico. I will calibrate the shocks based on past shocks of the Mexican economy (they are in the dataset as the shock column). I will consider a slightly higher impact compared to the US shock.
Since the US GDP (whose model already includes a shock) enters the Mexican model as an explanatory variable, I will define the future shock for Mexico as the extra shock beyond the US shock, to avoid a double shock impact.
FUTURESHOCKMX <- rep(0, 20)
# Now I create an xts object for the future Mexican shock:
FUTURESHOCKMX <- xts(FUTURESHOCKMX, order.by = futurequarters)
names(FUTURESHOCKMX) <- c("shock")
# I set the shock as the "additional" shock beyond the shock that affected the US economy.
FUTURESHOCKMX$shock["2022-04-01"] = 0.10
FUTURESHOCKMX$shock["2022-07-01"] = 0.10
FUTURESHOCKMX$shock["2022-10-01"] = 0.05
FUTURESHOCKMX$shock["2023-04-01"] = 0.05
I re-run the model, using the historical shock as explanatory variable:
mgdpmx <- Arima(GDPMX$GDP, order = c(1,0,1),
                seasonal = list(order=c(0,1,1), period=4),
                xreg = cbind(GDPMX$shock, GDPUS$lnGDP),
                include.constant = TRUE,
                lambda = 0)
coeftest(mgdpmx)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.915445 0.051281 17.852 < 0.0000000000000002 ***
## ma1 0.169549 0.091861 1.846 0.0649 .
## sma1 -0.680434 0.160065 -4.251 0.0000213 ***
## drift 0.001918 0.001119 1.714 0.0865 .
## shock -0.152578 0.012636 -12.074 < 0.0000000000000002 ***
## lnGDP 0.500565 0.119941 4.173 0.0000300 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I need to create a series for the forecasted log of the US GDP and include it as explanatory variable in the forecast:
FUTURElnGDP <- xts(log(usforecast$mean), order.by = futurequarters)
names(FUTURElnGDP) <- c("lnGDP")
mxforecast <- forecast(mgdpmx, xreg = cbind(FUTURESHOCKMX$shock, FUTURElnGDP$lnGDP))
autoplot(mxforecast)
tail(mxforecast$mean)
## Time Series:
## Start = 132
## End = 137
## Frequency = 1
## [1] 4815.77 4846.61 4851.55 4881.88 4905.75 4937.36
The blue section is the forecast. We can see the 80% and the 95% confidence intervals of the GDP forecast, and the middle blue line is the point forecast. We can see that the Mexican GDP will reach the maximum levels it had before COVID only after 2024, at the earliest.
We can estimate the GDP growth for the current presidential period from 2018 to 2024. Before that, we need to collapse the data from quarterly to annual and append the forecast to the history:
library(xts)
mxgdpdf <- as.data.frame(mxforecast$mean)
names(mxgdpdf) <- c("GDP")
histgdpdf <- as.data.frame(GDPMX$GDP)
# append history with forecast:
mxgdpall <- rbind(histgdpdf, mxgdpdf)

mxgdpall <- as.data.frame(mxgdpall)

allq <- seq(as.Date(start(GDPMX)),
            by = "quarter",
            length.out = nrow(mxgdpall))
mxgdpall$q <- allq
mxgdpall$year <- format(mxgdpall$q, format = "%Y")

mxgdpy <- aggregate(mxgdpall$GDP,
                    by = list(mxgdpall$year), FUN = sum, na.rm = TRUE)
names(mxgdpy) <- c("year","GDP")
rownames(mxgdpy) <- unique(mxgdpall$year)
GDPg2024 = as.numeric(mxgdpy[mxgdpy$year=="2024","GDP"] / mxgdpy[mxgdpy$year=="2018","GDP"]) - 1
cat("The 2018-2024 GDP Growth is expected to be ", 100*GDPg2024, "%")
## The 2018-2024 GDP Growth is expected to be 1.37423 %
The estimated growth for the 6-year Mexico presidential period from 2018 to 2024 would be 1.3742%. We calculate the compound average annual growth rate (CAGR) for the 2018-2024 “sexenio” as follows:
GDPcagr2024 = (1+GDPg2024)^(1/6) - 1
GDPcagr2024
## [1] 0.00227738
The expected average annual GDP growth for this “sexenio” is estimated around 0.2277%.
This would be the worst economic growth of any “sexenio” in Mexico at least since 1994; as the calculations below show, the previous “sexenios” grew between roughly 10% and 21%. We can calculate the GDP growth of all the presidential “sexenios” as follows:
= as.numeric(mxgdpy[mxgdpy$year=="2024","GDP"] / mxgdpy[mxgdpy$year=="2018","GDP"]) - 1
GDPg2024 cat("The 2018-2024 GDP Growth is expected to be ", 100*GDPg2024,"%\n")
## The 2018-2024 GDP Growth is expected to be 1.37423 %
= as.numeric(mxgdpy[mxgdpy$year=="2018","GDP"] / mxgdpy[mxgdpy$year=="2012","GDP"]) - 1
GDPg2018 cat("The 2012-2018 GDP growth was ",100*GDPg2018,"%\n")
## The 2012-2018 GDP growth was 15.5836 %
= as.numeric(mxgdpy[mxgdpy$year=="2012","GDP"] / mxgdpy[mxgdpy$year=="2006","GDP"]) - 1
GDPg2012 cat("The 2006-2012 GDP growth was ",100*GDPg2012,"%\n")
## The 2006-2012 GDP growth was 10.4563 %
= as.numeric(mxgdpy[mxgdpy$year=="2006","GDP"] / mxgdpy[mxgdpy$year=="2000","GDP"]) - 1
GDPg2006 cat("The 2000-2006 GDP growth was ",100*GDPg2006,"%\n")
## The 2000-2006 GDP growth was 12.429 %
= as.numeric(mxgdpy[mxgdpy$year=="2000","GDP"] / mxgdpy[mxgdpy$year=="1994","GDP"]) - 1
GDPg2000 cat("The 1994-2000 GDP growth was ",100*GDPg2000,"%\n")
## The 1994-2000 GDP growth was 20.9527 %
In this section I will work on the following: a) Data management of financial data, b) return calculations, c) descriptive statistics for finance, d) visualization of financial data, and e) Portfolio formation and basic estimations.
We will use the quantmod package to import real online data from Yahoo Finance. This package contains the getSymbols() function, which creates an xts (extensible time series) object in the environment with the downloaded data from the Internet.
If you have not installed it, use the install.packages() function. We start by loading the package:
#install.packages("quantmod")
library(quantmod)
The getSymbols() function downloads online, up-to-date financial data, such as stock prices, ETF prices, interest rates, exchange rates, etc. getSymbols() allows us to download data from multiple sources: Yahoo Finance, the FRED database and Oanda. These sources offer thousands of financial and economic data series from many market exchanges, plus macroeconomic variables of most countries.
You can type ? followed by a function name in the console or in an R script and run it to learn more about the syntax of any function. This will display the R documentation of the function in the bottom-right pane:
?getSymbols
Now, we will work with historical data of the Bitcoin cryptocurrency and the TESLA stock. We download daily quotations of these instruments from January 1, 2017 to date from Yahoo Finance:
getSymbols(c("BTC-USD","TSLA"), from="2017-01-01", src="yahoo", periodicity="daily")
## [1] "BTC-USD" "TSLA"
This function will create an xts-zoo R object for each ticker. Each object has the corresponding historical daily prices. xts stands for extensible time-series. An xts-zoo object is designed to easily manipulate time series data.
BTC-USD and TSLA are the ticker names in Yahoo Finance. The from argument is used to indicate the initial date from which you want to bring data. The to argument is the end date of the series you want to download. In this case we omit the to argument in order to download the most recent data. The src argument indicates the source of the data, in this case it is Yahoo Finance. Finally, the periodicity argument specifies the granularity of the data (daily, weekly, monthly, quarterly) also as a character vector.
You can check the content of any of these datasets with View(). When tickers have special characters, we have to refer to the object using backticks (`):
You can list the FIRST 5 rows of a dataset by using head():
head(`BTC-USD`,5)
## BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
## 2017-01-01 963.658 1003.08 958.699 998.325 147775008
## 2017-01-02 998.617 1031.39 996.702 1021.750 222184992
## 2017-01-03 1021.600 1044.08 1021.600 1043.840 185168000
## 2017-01-04 1044.400 1159.42 1044.400 1154.730 344945984
## 2017-01-05 1156.730 1191.10 910.417 1013.380 510199008
## BTC-USD.Adjusted
## 2017-01-01 998.325
## 2017-01-02 1021.750
## 2017-01-03 1043.840
## 2017-01-04 1154.730
## 2017-01-05 1013.380
Also, you can list the LAST 5 rows of a dataset. Note that you can change the number of rows you want to display.
tail(`BTC-USD`, 5)
## BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
## 2022-07-10 21591.1 21591.1 20727.1 20860.4 28688807249
## 2022-07-11 20856.4 20856.4 19924.5 19970.6 24150249025
## 2022-07-12 19970.5 20043.4 19308.5 19323.9 25810220018
## 2022-07-13 19326.0 20223.1 19000.0 20212.1 33042430345
## 2022-07-14 20248.8 20765.8 19692.7 20510.9 31665979392
## BTC-USD.Adjusted
## 2022-07-10 20860.4
## 2022-07-11 19970.6
## 2022-07-12 19323.9
## 2022-07-13 20212.1
## 2022-07-14 20510.9
For each period, Yahoo Finance keeps track of the open, high, low, close (OHLC) and adjusted prices. It also keeps track of the volume traded in each period. Adjusted prices are used for stocks, not for currencies; they take into account dividend payments and stock splits. For the Bitcoin series we can therefore use either the close or the adjusted price to calculate daily returns.
Let's see some of the benefits of using xts-zoo objects. We can, for example, select columns using any of the following functions, where x represents a generic xts-zoo object:
Op(x): Extract the Opening prices of the period.
Hi(x): Extract the Highest price of the period.
Lo(x): Extract the Lowest price of the period.
Cl(x): Extract the Closing prices of the period.
Vo(x): Extract the Volume traded of the period.
Ad(x): Extract the Adjusted prices of the period.
We can use the merge() function to create a consolidated dataset of one or more xts-zoo objects:
prices <- merge(`BTC-USD`, TSLA)
# I can select only the adjusted prices in order to calculate returns:
adjprices <- Ad(prices)
Now we have an xts-zoo object with 2 columns. I can change the names of the columns to simpler names:
names(adjprices)<-c("bitcoin","tesla")
Now I can make reference to the adjusted prices using these names.
In Finance, when managing daily data it is very common to have gaps in the series. What does this mean? It means that the series contains some missing days. For example, for stock series there is no data for weekends or holidays. However, R deals with gaps because it recognizes that we are working with a time series object. Thus, we have a time variable with NO GAPS, which avoids problems when computing returns. However, R does not deal automatically with empty values (called NAs). It is a good idea to have a dataset free of NAs, so I use the na.omit function:
adjprices <- na.omit(adjprices)
We can create xts-zoo objects for simple and continuously compounded returns of both instruments:
# Calculating continuously compounded daily returns:
ccreturns <- diff(log(adjprices))
# Calculating simple daily returns:
returns <- adjprices / lag(adjprices, n=1) - 1
# We can also calculate simple returns using the diff function:
returns2 <- diff(adjprices) / lag(adjprices, n=1)
The diff function works with xts to calculate the difference between the value of one period minus the previous one. The lag function gets the previous value of the time series.
We can calculate the holding period return (HPR) over any time window for each instrument. For example, if we want to calculate the whole-period return from the first day of the series to the most recent one, we can do the following:
HPR1 <- 100 * as.vector(last(adjprices)) / as.vector(first(adjprices)) - 1
HPR1
## [1] 1935.32 1637.60
cat("The holding period return for Bitcoin is: ", HPR1[1], "%")
## The holding period return for Bitcoin is: 1935.32 %
cat("The holding period return for Bitcoin is: ", HPR1[2], "%")
## The holding period return for Bitcoin is: 1637.6 %
We could also use the continuously compounded returns to calculate the same holding period returns:
HPR2 <- 100 * exp(colSums(ccreturns, na.rm = TRUE)) - 1
HPR2
## bitcoin tesla
## 1935.32 1637.60
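A third, equivalent check (my own, not in the original script) uses the product of gross simple returns: the product of (1 + r) over all days equals the last price divided by the first, so it reproduces the same numbers:
# Cumulative product of gross simple returns, per column:
100 * apply(1 + na.omit(returns), 2, prod) - 1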
One of the advantages of the quantmod package is that it has some built-in data visualization capabilities. For example, the chartSeries function is a plotting tool designed to create standard financial charts given a time series object. The chartSeries function also includes some arguments that allow the user to modify the cosmetics of the plot, such as the theme argument.
Let’s see the performance of Bitcoin prices:
chartSeries(`BTC-USD`, theme = ("white"))
Let’s see the performance of TESLA prices:
chartSeries(TSLA, theme = ("black"))
We can see episodes of exponential growth in both the Bitcoin and Tesla prices. If you had invested in Bitcoin or Tesla in July 2020 and had sold your position in early January 2021, you would have multiplied your investment by around 4 times (a period return of about 300%)!
We can visualize the daily returns over time:
plot(returns$bitcoin)
plot(returns$tesla)
With these plots we can appreciate the daily volatility of each instrument over time. Volatility is a measure of how dispersed the returns are around their mean; it is basically the standard deviation of returns.
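As a quick illustration (my own addition), we can compute the daily volatility of each instrument as the standard deviation of its daily returns; multiplying by the square root of 252 trading days is a common (approximate) annualization convention:
# Daily volatility (standard deviation of daily simple returns) per column:
daily_vol <- apply(returns, 2, sd, na.rm = TRUE)
daily_vol
# Approximate annualized volatility, assuming ~252 trading days per year:
daily_vol * sqrt(252)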
We will use the PerformanceAnalytics and related packages for this section:
# Install the packages only if you need:
#install.packages("tidyverse")
#install.packages("PerformanceAnalytics")
#install.packages("highcharter")
# Load the packages:
library(tidyverse)
library(PerformanceAnalytics)
library(highcharter)
We will do an exercise with a simple portfolio composed of 5 US stocks: Tesla, Microsoft, Nextera, WalMart and Pfizer. You can change the tickers of the stocks as you wish, as long as Yahoo Finance has data for your tickers. I selected stocks from the following industries: high-tech, clean energy, pharmaceutical, and retail.
With the following code we will automatically download data from the 5 stocks, merge the data and calculate returns:
<- c("TSLA","MSFT","NEE","WMT","PFE")
symbols
<- getSymbols(symbols,src = 'yahoo',
prices periodicity = "monthly",
from = "2018-01-01",
auto.assign = TRUE,
warnings = FALSE) %>%
map(~Ad(get(.))) %>%
reduce(merge) %>%
`colnames<-`(symbols)
As you can see, I am performing several data management steps in sequence. The %>% (pipe) operator separates each step. The previous code does the following:
Downloads price data for the 5 stocks
Applies (maps) the Ad() function to each xts-zoo object in order to get the adjusted stock prices
Merges all the xts-zoo objects into one object
Finally, renames the columns of the integrated object and returns it as prices
An equivalent step-by-step version without the pipe is sketched below.
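The following is a minimal sketch (my own, using only functions already loaded in this session) of the same pipeline written step by step; prices_list and prices2 are hypothetical names:
# 1) Download the monthly data (objects TSLA, MSFT, ... are created in the workspace):
getSymbols(symbols, src = 'yahoo', periodicity = "monthly",
           from = "2018-01-01", auto.assign = TRUE, warnings = FALSE)
# 2) Extract the adjusted prices of each ticker:
prices_list <- lapply(symbols, function(s) Ad(get(s)))
# 3) Merge them into a single xts-zoo object:
prices2 <- Reduce(merge, prices_list)
# 4) Rename the columns:
colnames(prices2) <- symbols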
We calculate returns of all stocks using the Return.calculate function:
Returns <- Return.calculate(prices) %>%
  na.omit()
The na.omit function was also applied to clean the dataset and drop the rows with NA values.
We calculate the continuously compounded returns (also called log returns) of all stocks:
returns <- Return.calculate(prices, method = "log") %>%
  na.omit()
# We could calculate this using the diff and log functions:
returns2 <- na.omit(diff(log(prices)))
# The returns and returns2 objects will have exactly the same returns
We calculate descriptive statistics for the returns of all stocks:
table.Stats(returns)
## TSLA MSFT NEE WMT PFE
## Observations 54.0000 54.0000 54.0000 54.0000 54.0000
## NAs 0.0000 0.0000 0.0000 0.0000 0.0000
## Minimum -0.2539 -0.1052 -0.1782 -0.1734 -0.1453
## Quartile 1 -0.1141 -0.0195 -0.0043 -0.0294 -0.0296
## Median 0.0328 0.0221 0.0210 0.0112 -0.0007
## Arithmetic Mean 0.0427 0.0191 0.0150 0.0045 0.0102
## Geometric Mean 0.0269 0.0176 0.0130 0.0029 0.0076
## Quartile 3 0.1689 0.0557 0.0520 0.0424 0.0654
## Maximum 0.5547 0.1624 0.1614 0.0969 0.2057
## SE Mean 0.0255 0.0078 0.0084 0.0078 0.0101
## LCL Mean (0.95) -0.0085 0.0035 -0.0018 -0.0111 -0.0099
## UCL Mean (0.95) 0.0939 0.0348 0.0318 0.0202 0.0304
## Variance 0.0352 0.0033 0.0038 0.0033 0.0055
## Stdev 0.1876 0.0573 0.0615 0.0574 0.0739
## Skewness 0.6296 -0.0299 -0.9379 -0.8875 0.3382
## Kurtosis -0.1079 -0.3059 2.0150 1.3052 0.0006
We can see the arithmetic mean, geometric mean and median of the returns. We can also see risk measures such as the standard deviation and variance. Finally, we can see skewness and kurtosis, which are important measures for understanding the shape of the return distribution.
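If we want to look at those shape measures on their own, PerformanceAnalytics provides dedicated functions (this extra check is mine, not part of the original script):
# Skewness of the monthly returns, by stock:
skewness(returns)
# Kurtosis of the monthly returns, by stock:
kurtosis(returns)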
We can do a box plot of the returns to better appreciate the historical median return and risk of each stock by looking at the median, the quartiles Q1 and Q3, the volatility and the extreme values of the returns:
chart.Boxplot(returns)
It is easy to see that Tesla is the stock with the highest risk. The red circles show the mean, the mid line is the median (50th percentile), and the boxes contain 50% of the data, from Q1 (25th percentile) to Q3 (75th percentile). The dots are considered extreme values in the context of each stock's own distribution.
Now we can start evaluating the performance of each stock over time by looking at how much money we would have made if we had invested $1.00 in each stock at the beginning of the period:
charts.PerformanceSummary(Returns,
main = "Performance of $1.00 over time",
wealth.index = TRUE)
Since Tesla has had an extraordinary performance over the last months of the sample, it is hard to appreciate the performance of the rest of the stocks. So, we can drop Tesla from the plot:
charts.PerformanceSummary(Returns[,2:5],
main = "Performance of $1.00 over time",
wealth.index = TRUE)
I start by creating the weights of 2 portfolios. One will be an equally weighted portfolio, and the other will be an aggressive portfolio that assigns high weights to risky stocks and low weights to conservative stocks.
I first create a vector of weights for the equally weighted portfolio:
w_ew = rep(0.2, 5)
# The rep function repeats a value n times
Now I create a vector of weights for the aggressive portfolio. I will assign the following weights:
Tesla: 40%, Microsoft: 30%, Nextera: 20%, WalMart: 0%, Pfizer: 10%.
w_aggressive = c(0.4, 0.3, 0.2, 0, 0.1)
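A small sanity check (my addition): portfolio weights should add up to 1.
sum(w_ew)          # should be 1
sum(w_aggressive)  # should be 1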
We calculate the historical return of the equally-weighted portfolio using the Return.portfolio function:
portfolio_returns_ew <- Return.portfolio(Returns,
                                         weights = w_ew) %>%
  `colnames<-`("returns")
We calculate the historical return of the aggressive portfolio using the Return.portfolio function:
portfolio_returns_ag <- Return.portfolio(Returns,
                                         weights = w_aggressive) %>%
  `colnames<-`("returns")
With the portfolio returns I can plot a performance chart of each portfolio:
charts.PerformanceSummary(portfolio_returns_ew,
main = "Equally-weighted Portfolio")
charts.PerformanceSummary(portfolio_returns_ag,
main = "Aggressive Portfolio Performance")
We finally do a dynamic plot of historical monthly returns of each portfolio using the highchart function from the highcharter package:
highchart(type = "stock") %>%
hc_title(text = "Equally-weighted Portfolio") %>%
hc_add_series(portfolio_returns_ew$returns,
name = "Monthly Returns",
color = "cornflowerblue") %>%
hc_add_theme(hc_theme_flat()) %>%
hc_navigator(enabled = FALSE) %>%
hc_scrollbar(enabled = FALSE) %>%
hc_legend(enabled = TRUE) %>%
hc_exporting(enabled = TRUE)
We can interact with the plot to zoom in and out and to select time periods.
We do the same for the aggressive portfolio:
highchart(type = "stock") %>%
hc_title(text = "Aggressive Portfolio") %>%
hc_add_series(portfolio_returns_ag$returns,
name = "Monthly Returns",
color = "cornflowerblue") %>%
hc_add_theme(hc_theme_flat()) %>%
hc_navigator(enabled = FALSE) %>%
hc_scrollbar(enabled = FALSE) %>%
hc_legend(enabled = TRUE) %>%
hc_exporting(enabled = TRUE)
This is just an example of basic financial analysis that we can do in R.
I hope you enjoyed the workshop! At the very least, I hope you ended up with the curiosity to keep learning about this fascinating world of Business Analytics.