install.packages("quantmod")
install.packages("PerformanceAnalytics")
Workshop 1, Algorithms and data analysis
0.1 General Directions for each workshop
You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOGIN with your @tec.mx account and then create a Google Colab document for each workshop.
You must share each Colab document (workshop) with the following accounts:
cdorante.tec@gmail.com cdorante@tec.mx
You must give Edit privileges to these accounts.
In Google Colab you can work with Python or R notebooks. The default is Python notbooks, so you must go to “Runtime” or “Entorno de Ejecución”, select Change runtime type” or “Cambiar entorno de ejecución”, and then select R (instead of Python).
Your Notebook will have a default name like “Untitled2.ipynb”. Click on this name and change it to “W1-Econometrics-YourFirstName YourLastname”.
Pay attention in class to learn how to write text and R code into your Notebook.
In your Workshop Notebook you have to:
Replicate all the R Code along with its output.
You have to do any QUESTION AND/OR CHALLENGE asked in the workshop. The challenges can be: responses to specific questions and/or do an exercise/challenge.
For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.
- It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook to study for the FINAL EXAM. Your own workshop/notebook will be very helpful for your further study.
Once you finish your workshop, make sure that you RUN ALL CHUNKS. You can run each code chunk by clicking on the “Run” button located in the top-left section of each chunk. You can also run all the chunks in one-shot with Ctrl-F9. You have to submit to Canvas the web link of your Google Colab workshop.
1 Data management for time-series financial variables
1.1 Installation of R packages
Google Colab has many R packages already installed, but not all. In this workshop we use R packages that are already installed in Colab, and others that we need to install before loading them in Colab.
We start installing the packages that are NOT installed in Colab:
The installation of packages in Colab might take up to 30 seg or 1 minute each, so after they are installed you can add the # character before the install.packages command to avoid re-installing the packages and waiting for a long time.
Once these packages are installed, we need to load them in memory:
# Now I load the packages in Memory:
library(quantmod)
library(PerformanceAnalytics)
Now we load the packages of the workshop that are already installed in R:
library(readxl)
library(dplyr)
1.2 Data collection
We start by downloading online real stock prices from Yahoo Finance.
We start clearing our R environment:
# To avoid scientific notation for numbers:
options(scipen=999)
1.2.1 The quantmod package
The quantmod package is designed to help financial traders in developing econometric-based trading models.
It has several functions for financial data collection, data management and data visualization.
This package contains the getSymbols() function, which creates an xts (extensible time series) object in the environment with the downloaded data from the Internet.
1.2.2 Downloading real financial prices
The getSymbols() function enables its user to download online and up-to-date financial data, such as stock prices, ETF prices, interest rates, exchange rates, etc. getSymbols() allows to download this data from multiple sources: Yahoo Finance, Google Finance, FRED and Oanda. These sources have thousands of finance and economic data series from many market exchanges and other macroeconomic variables around the world.
Now, we will work with historical data of the Bitcoin cryptocurrency and the TESLA stock. We download daily quotations of these instruments from January 1, 2020 to the most complete recent month from Yahoo Finance:
getSymbols(c("BTC-USD","TSLA"), from="2020-01-01",
to="2024-09-13", src="yahoo", periodicity="daily")
[1] "BTC-USD" "TSLA"
This function will create an xts-zoo R object for each ticker. Each object has the corresponding historical daily prices. xts stands for extensible time-series. An xts-zoo object is designed to easily manipulate time series data.
BTC-USD and TSLA are the ticker names in Yahoo Finance. The from argument is used to indicate the initial date from which you want to bring data. The to argument is the end date of the series you want to download. In this case we omit the to argument in order to download the most recent data. The src argument indicates the source of the data, in this case it is Yahoo Finance. Finally, the periodicity argument specifies the granularity of the data (daily, weekly, monthly, quarterly).
1.2.3 Show the content of datasets
We can view the content of an R object just by typing the name of the object in the R code. If the dataset is long we can view the first rows or the last rows of the dataset by using the following functions:
head(`BTC-USD`,n=5)
BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
2020-01-01 7194.892 7254.331 7174.944 7200.174 18565664997
2020-01-02 7202.551 7212.155 6935.270 6985.470 20802083465
2020-01-03 6984.429 7413.715 6914.996 7344.884 28111481032
2020-01-04 7345.375 7427.386 7309.514 7410.657 18444271275
2020-01-05 7410.452 7544.497 7400.536 7411.317 19725074095
BTC-USD.Adjusted
2020-01-01 7200.174
2020-01-02 6985.470
2020-01-03 7344.884
2020-01-04 7410.657
2020-01-05 7411.317
When tickers have special characters, we have to make reference to the object with simple quotes (``).
Also, you can list the LAST 5 rows of the dataset. Note that you can change number of rows you want to display.
tail(`BTC-USD`, n=10)
BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
2024-09-04 57430.35 58511.57 55673.16 57971.54 35627680312
2024-09-05 57971.70 58300.58 55712.45 56160.49 31030280656
2024-09-06 56160.19 56976.11 52598.70 53948.75 49361693566
2024-09-07 53949.09 54838.14 53740.07 54139.69 19061486526
2024-09-08 54147.93 55300.86 53653.76 54841.57 18268287531
2024-09-09 54851.89 58041.12 54598.43 57019.54 34618096173
2024-09-10 57020.10 58029.98 56419.41 57648.71 28857630507
2024-09-11 57650.29 57991.32 55567.34 57343.17 37049062672
2024-09-12 57343.17 58534.36 57330.10 58127.01 33835707949
2024-09-13 58130.32 60648.02 57650.11 60571.30 32490528356
BTC-USD.Adjusted
2024-09-04 57971.54
2024-09-05 56160.49
2024-09-06 53948.75
2024-09-07 54139.69
2024-09-08 54841.57
2024-09-09 57019.54
2024-09-10 57648.71
2024-09-11 57343.17
2024-09-12 58127.01
2024-09-13 60571.30
The xts datasets created with the getSymbols function have different columns for open, low, high, close, adjusted and volume data. These datasets are also called OHLC datasets.
The adjusted prices are used for stocks, not for currencies. Adjusted prices is an adjustment of the closing prices after considering dividend payments and stock splits. Then, for the Bitcoin series we can use close or adjusted price to calculate returns.
1.3 Data selection
1.3.1 Column selection
We can easily select either columns or rows of a dataset. For example, if we want to select the adjusted price column of the TSLA xts dataset, we use the $ and the name of the column, and then assign it to any new name:
= TSLA$TSLA.Adjusted
adjTSLA head(adjTSLA)
TSLA.Adjusted
2020-01-02 28.68400
2020-01-03 29.53400
2020-01-06 30.10267
2020-01-07 31.27067
2020-01-08 32.80933
2020-01-09 32.08933
Another way to select a column is by the # of the column(s):
#Selecting the adjusted price, which is the column #6:
= TSLA[, 6] adjTSLA
Note that the notation to make reference to column 6 and ALL rows of TSLA is:
TSLA[ , 6]
In this case, I want to keep ALL rows, so we leave empty the # of rows, and indicate that we want only column # 6.
R datasets have 2 dimensions: [#rows, #columns]. We can show the dimensions of any R dataset:
dim(TSLA)
[1] 1182 6
We see that the # of rows of TSLA dataset is 1182, and the # of columns is 6.
We can make reference to any subset of the dataset using this notation of [#rows, #columns]
1.3.2 Row selection
We can also select specific rows of a dataset following the same notation of [#rows, #columns]. If I want to select only the first rows of TSLA dataset, I can do the following:
= TSLA[1:10,]
TSLAfirstdays TSLAfirstdays
TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
2020-01-02 28.30000 28.71333 28.11400 28.68400 142981500 28.68400
2020-01-03 29.36667 30.26667 29.12800 29.53400 266677500 29.53400
2020-01-06 29.36467 30.10400 29.33333 30.10267 151995000 30.10267
2020-01-07 30.76000 31.44200 30.22400 31.27067 268231500 31.27067
2020-01-08 31.58000 33.23267 31.21533 32.80933 467164500 32.80933
2020-01-09 33.14000 33.25333 31.52467 32.08933 426606000 32.08933
2020-01-10 32.11933 32.32933 31.58000 31.87667 194392500 31.87667
2020-01-13 32.90000 35.04200 32.80000 34.99067 397764000 34.99067
2020-01-14 36.28400 36.49400 34.99333 35.86133 434943000 35.86133
2020-01-15 35.31733 35.85600 34.45267 34.56667 260532000 34.56667
We used the notation 1:10 in the row specification, indicating that we want the sequence from 1 to 10.
We can also select specific rows and columns. For example, if I only want to select the first 10 days of the adjusted prices of TESLA:
= TSLA[1:10,6]
TSLA_adjusted_first_prices head(TSLA_adjusted_first_prices)
TSLA.Adjusted
2020-01-02 28.68400
2020-01-03 29.53400
2020-01-06 30.10267
2020-01-07 31.27067
2020-01-08 32.80933
2020-01-09 32.08933
If we want to select specific columns, we can use the container c() vector to indicate which column # we want to select. For example, if for some reason we want to select the Open and Close columns (columns #1 and #4), and select the first 10 days of TESLA prices:
= TSLA[1:10,c(1,4)]
TSLA_Open_Close head(TSLA_Open_Close)
TSLA.Open TSLA.Close
2020-01-02 28.30000 28.68400
2020-01-03 29.36667 29.53400
2020-01-06 29.36467 30.10267
2020-01-07 30.76000 31.27067
2020-01-08 31.58000 32.80933
2020-01-09 33.14000 32.08933
For xts datasets, we can also select rows using the date index. Each xts dataset has a time index that we can use. For example, if we want to select all TESLA prices of only year 2021 we can do the following:
= TSLA["2021-01-01/2021-12-31", ]
TESLA_2021 head(TESLA_2021)
TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
2021-01-04 239.8200 248.1633 239.0633 243.2567 145914600 243.2567
2021-01-05 241.2200 246.9467 239.7333 245.0367 96735600 245.0367
2021-01-06 252.8300 258.0000 249.7000 251.9933 134100000 251.9933
2021-01-07 259.2100 272.3300 258.4000 272.0133 154496700 272.0133
2021-01-08 285.3333 294.8300 279.4633 293.3400 225166500 293.3400
2021-01-11 283.1333 284.8100 267.8733 270.3967 177904800 270.3967
We can also combine column selection with row selection. If we want the same date selection, but only adjusted prices:
= TSLA["2021-01-01/2021-12-31", "TSLA.Adjusted"] TESLA_2021_adjusted
In this case, to select the column I used the column name instead of the # of the column.
1.4 Data Merging:
We can use the merge()
function to join two or more xts time-series datasets into an integrated dataset:
<- merge(`BTC-USD`$`BTC-USD.Adjusted`, TSLA$TSLA.Adjusted)
adjprices # I can select only the adjusted prices in order to calculate returns:
The quantmod has the Ad function to get Adjusted prices. We can do the same we did above using this function:
<- merge(Ad(`BTC-USD`), Ad(TSLA$TSLA.Adjusted)) adjprices
Now we have an xts-zoo objects with the 2 adjusted prices. We can change the names of the columns:
names(adjprices)<-c("bitcoin","tesla")
Now we can make reference to the adjusted prices using these names.
1.5 Data cleaning
In Finance, when managing daily data it is very common to have gaps in the series. What does this mean? It means that the contains some null values for some days. For example, for stock series there is no data for weekends or holidays:
head(adjprices,n=10)
bitcoin tesla
2020-01-01 7200.174 NA
2020-01-02 6985.470 28.68400
2020-01-03 7344.884 29.53400
2020-01-04 7410.657 NA
2020-01-05 7411.317 NA
2020-01-06 7769.219 30.10267
2020-01-07 8163.692 31.27067
2020-01-08 8079.863 32.80933
2020-01-09 7879.071 32.08933
2020-01-10 8166.554 31.87667
In this case, Bitcoin has price data for any day including weekends and holidays. When we did the merge, the TESLA column will have NA values for these days. If we keep this dataset as it is, when calculating daily returns for TESLA, it will not be possible to calculate returns for Modays since there is no price value for Sunday. An easy way to deal with NA values is to delete the rows that have any NA value in any column. We can do this with the na.omit function:
However, R deals with gaps because it recognizes that we are working with a time series object. It is a good idea to have a data set free of NA’s. So, I can use the function na.omit:
<- na.omit(adjprices) adjprices
Now the holidays and weekends were deleted:
head(adjprices,n=6)
bitcoin tesla
2020-01-02 6985.470 28.68400
2020-01-03 7344.884 29.53400
2020-01-06 7769.219 30.10267
2020-01-07 8163.692 31.27067
2020-01-08 8079.863 32.80933
2020-01-09 7879.071 32.08933
1.6 Visualization of prices
Visualize how Bitcoin has been valued over time:
plot(`BTC-USD`)
We can do a better visualization if we use the chartSeries function from the quantmod package:
chartSeries(`BTC-USD`, theme=("white"))
We can also do a plot of specific periods. For example, we can see how Bitcoin price behaved only in 2021:
chartSeries(`BTC-USD`, subset = '2021-01-01/2021-12-31')
1.7 Financial returns
A financial simple return for a stock (R_{t}) is calculated as a percentage change of price from the previous period (t-1) to the present period (t):
R_{t}=\frac{\left(Adjprice_{t}-Adjprice_{t-1}\right)}{Adjprice_{t-1}}=\frac{Adjprice_{t}}{Adjprice_{t-1}}-1
For example, if the adjusted price of a stock at the end of January 2022 was $100.00, and its previous (December 2021) adjusted price was $80.00, then the monthly simple return of the stock in January 2022 will be:
R_{Jan2022}=\frac{Adprice_{Jan2022}}{Adprice_{Dec2021}}-1=\frac{100}{80}-1=0.25
We can use returns in decimal or in percentage (multiplying by 100). We will keep using decimals.
In Finance it is very recommended to calculate continuously compounded returns (cc returns) and using cc returns instead of simple returns for data analysis, statistics and econometric models. cc returns are also called log returns.
One way to calculate cc returns is by subtracting the log of the current adjusted price (at t) minus the log of the previous adjusted price (at t-1):
r_{t}=log(Adjprice_{t})-log(Adjprice_{t-1})
This is also called as the difference of the log of the price.
We can also calculate cc returns as the log of the current adjusted price (at t) divided by the previous adjusted price (at t-1):
r_{t}=log\left(\frac{Adjprice_{t}}{Adjprice_{t-1}}\right)
cc returns are usually represented by small r, while simple returns are represented by capital R.
1.8 Financial return calculation for time-series
1.8.1 Simple returns
We must use adjusted stock prices to calculate financial returns. To calculate a lagged (past) value of price of a time-series variable we can use the function lag from the stats package.
= adjprices / stats::lag(adjprices,n=1) - 1
R head(R)
bitcoin tesla
2020-01-02 NA NA
2020-01-03 0.05145166 0.02963326
2020-01-06 0.05777286 0.01925464
2020-01-07 0.05077385 0.03880052
2020-01-08 -0.01026859 0.04920483
2020-01-09 -0.02485086 -0.02194501
Since adjprices has 2 columns, then R calculates the daily simple returns for all columns of the dataset.
Note that we use the name stats:: before the lag function. We did this since there are other R packages that have a function called lag, so R can be confused if we do not specify which package we are using for the lag function.
1.8.2 Continuously compounded returns
Now we calculate the daily continuously compounded returns using adjusted prices. We use the diff and log functions. The diff function calculates the difference between the value of a time-series variable and its past value:
= diff(log(adjprices))
r head(r)
bitcoin tesla
2020-01-02 NA NA
2020-01-03 0.05017175 0.02920268
2020-01-06 0.05616562 0.01907161
2020-01-07 0.04952689 0.03806671
2020-01-08 -0.01032167 0.04803257
2020-01-09 -0.02516485 -0.02218939
Remember that the first difference of log prices is actually the continuously compounded returns of the period.
We can visualize both daily returns over time for Tesla:
plot(R$tesla)
We can observe increasing volatility in 2020 due to the pandemic. Volatility can be calculated as the standard deviation of returns.
1.9 Descriptive statistics with time-series
We will use the table.Stats from the PerformanceAnalytics package.
We load the package and get the main descriptive statistics of both, Tesla and Bitcoin daily returns:
<- table.Stats(R)
table1 table1
bitcoin tesla
Observations 1181.0000 1181.0000
NAs 1.0000 1.0000
Minimum -0.3717 -0.2106
Quartile 1 -0.0160 -0.0201
Median 0.0009 0.0020
Arithmetic Mean 0.0027 0.0026
Geometric Mean 0.0018 0.0018
Quartile 3 0.0213 0.0228
Maximum 0.2111 0.1989
SE Mean 0.0012 0.0012
LCL Mean (0.95) 0.0003 0.0002
UCL Mean (0.95) 0.0050 0.0050
Variance 0.0017 0.0018
Stdev 0.0413 0.0420
Skewness -0.4806 0.1004
Kurtosis 8.5237 2.8587
We can also get specific descriptive statistics using specific functions such as mean and sd:
= mean(R$tesla, na.rm=TRUE)
mean_tesla_R = median(R$tesla, na.rm=TRUE)
median_tesla_R = sd(R$tesla, na.rm=TRUE)
sd_tesla_R
cat("Tesla daily mean return is ",mean_tesla_R, "\n")
Tesla daily mean return is 0.002645963
cat("Tesla daily median return is ",median_tesla_R, "\n")
Tesla daily median return is 0.001975226
cat("Tesla daily volatility is ", sd_tesla_R)
Tesla daily volatility is 0.04204204
When columns have NA values, some descriptive statistics functions cannot be calculated, unless we use the option na.rm=TRUE (means remove the NA values before calculation)
When median and mean are very different, this is a sign of a non-normality in the distribution of the variable. Here we see that Tesla median return is much less than its mean (0.14% vs 0.31%). The median is a much better measure of central tendency in Financial returns.
If Kurtosis of the variable is greater than 3, then the variable has more extreme values than a normal distributed variable. In this case, we see that both returns have Kurtosis much greater than 3.
Financial returns usually have more extreme values than normal distributed variables, so traditional standard deviation might not provide a complete view of the dispersion of returns. The quartiles provide us a better perception about the volatility of financial returns. The box-plot gives us a good perception of volatility, mean, median (Q2), quartile 1 (Q1) and quartile 3 (Q3) of returns:
chart.Boxplot(R)
It is easy to see that Bitcoin is riskier than Tesla since we have a wider range of returns in both, negative and positive returns. The red circles show the mean, the mid line is the median (50 percentile), the boxes include the 50% of the data from the Quartile 1 or Q1 (25 percentile) to the Q3 (75 percentile). The vertical lines limit non-extreme values. The dots are considered extreme values in the context of its own distribution.
1.10 Visualizing holing return over time
We can use the charts.PerformanceSummary of returns to visualize how much we would made over time if we had invested $1.00 for both, Bitcoin and Tesla, and hold your investment in the whole period of the data:
charts.PerformanceSummary(R,
main = "Performance of $1.00 invested in Bitcoin and Tesla",
wealth.index = TRUE)
We see that if we had invested in Tesla $1.00 in Jan 1, 2020, today we would have more than 7 times my money at the end of the period!
We can calculate the exact holding period return by getting the percentage growth of the series from the first adjusted price up to the last adjusted price:
= as.numeric(adjprices$tesla[nrow(adjprices)]) / as.numeric(adjprices$tesla[1]) - 1
hpr_tesla hpr_tesla
[1] 7.011783
I used the function nrow that returns the # of rows of the data frame. In this case, I am dividing the last value by the first value of Tesla price, and then subtract 1.
We can do the same, but instead of using the numeric index for the rows, use the date index:
= as.numeric(adjprices$tesla['2024-09-12']) / as.numeric(adjprices$tesla['2020-01-02']) - 1
hpr_tesla2 hpr_tesla2
[1] 7.011783
We have to multiply by 100 to get the return in percentage:
* 100 hpr_tesla
[1] 701.1783
1.11 CHALLENGE 1
Calculate and display the Holding-period-return for Bitcoin from Jan 2020 to Sep 12, 2024.
Calculate and display the Holding-period-return for Bitcoin from the first day of January 2024 to Sep 12, 2024.
2 Data structures
In Finance there are basically the following dataset structures or data types:
- Time-series: in this structure you can have many periods, and information for one “subject” (subject can be company, index, industry, etc). Also, you can have more than one subject, but the information is placed as new column for each subject. For example: the dataset created after running getsymbols:
p_stock1 r_stock1 p_stock2 r_stock2
2014m1 10 0.02 20 0.01
2014m2 11 0.10 21 0.05
- Cross-sectional structure: in this structure, you usually have many “subjects”, for ONLY ONE period For example, key performance indicators of Mexican companies for the last quarter of 2016:
Ticker ROA ATO ProfitMargin
1 ALFAA.MX 0.023 0.9 0.15
2 AMXL.MX 0.015 1.1 0.10
- Panel-data structure: it is combination of time-series and cross-sectional. In this structure, we have more than one subject, and for each subject we can have more than one period. Example:
Ticker quarter ROA ATO ProfitMargin
1 ALFAA.MX 2014q1 0.2 1.1 0.17
2 ALFAA.MX 2014q2 0.21 1.2 0.16
3 ... ... ... ... ...
4 BIMBOA.MX 2014q1 0.15 0.8 0.10
5 BIMBOA.MX 2014q2 0.20 0.9 0.05
6 ... ... ... ... ...
7 BIMBOA.MX 2017q1 0.15 1.1 0.09
3 Data management for panel-data
For financial analysis, data management is a very important process. Most of the time, before designing and running a financial / econometric model, it is needed to do simple and sophisticated data management.
In this example we will learn how to collapse a dataset and merge one dataset with another considering the data structures we learned in the previous section.
We will merge 2 dataset: a time-series dataset with monthly historical data for the Mexican IPyC Market index, and a panel dataset with quarterly financial information for many Mexican public firms (A public firm is a company that issue shares in the market).
We can only merge datasets with the same granularity. In other words, we cannot merge a monthly with a quarterly dataset. In this case, the market monthly dataset must be aggregated to quarters, and then we can merge the resulting dataset with the other quarterly dataset.
We will work with an online panel data of Mexican firms. Download the dataset: http://www.apradie.com/datos/datamx2020q4.xlsx using the readxl package (you have to install this package). This dataset has real quarterly financial data of Mexican rms (from the BMV) for many years. This information comes from Economatica, a leading software and database economic and financial information. This dataset is a panel dataset, which is a combination of cross sectional and time-series dataset. Navigate through the data (using View() function) to learn more about this dataset.
# Download the excel file from a web site:
download.file("http://www.apradie.com/datos/datamx2020q4.xlsx",
"firmspaneldata.xlsx", mode="wb")
# The first parameter is the link and the second is a name for the
# local file
# Use the function read_excel()
<- read_excel("firmspaneldata.xlsx")
paneldataset
# The quarter column has a format with the time zone. I create a new column with only the date:
$q = as.Date(paneldataset$quarter) paneldataset
3.1 Data calculations with panel data
We will use the dplyr package to easily do data selection and calculations for panel datasets.
This dataset has historical quarterly financial data of active and not active firms in the Bolsa Mexicana de Valores. We will keep ONLY active firms in the dataset from 2010. You can do this by typing:
= paneldataset %>%
activedata filter(status=="active", year>=2010)
As you see, we start with the paneldataset, and then we do a sub-setting or selection of rows indicating specific conditions.
Let’s select a few important columns from this dataset to perform some basic financial ratio calculations.
We will use the dplyr package. This package is designed for data management of data frames and panel data. We link data management processes by the operand %>%.
<-activedata %>%
activedataselect(firm,q,year,revenue,cogs,ebit,totalassets,
adjustedstockprice, naics1)
Here we start with the activedata dataset, and then select a few columns.
The dplyr package is very powerful for data management. Click HERE (https://datacarpentry.org/R-genomics/04-dplyr.html) a good summary of the main functionality of this package.
3.1.1 Calculating financial returns and ratios with panel data
As we did for time-series data, we can calculate returns with a panel dataset. We follow the same formula mentioned above, but we have to consider that we are using a panel-data structure:
<- activedata %>%
activedata group_by(firm) %>%
arrange(firm, q) %>%
mutate(stockret = log(adjustedstockprice) - log(lag(adjustedstockprice)) )
We added the column r to the panel data. Since R knows that we have a panel data structure, it calculates returns correctly for each firm-quarter. When we have a change in firm, then returns do not have to be calculated. You can have a look to the data and check whether the first quarters for each firms have NA values.
3.1.2 Using conditionals to create columns
Let’s calculate Return on Assets using EBIT (Earnings before interest and taxes) instead of Net income:
In this case, if the denominator of the ratio is zero, we assign a null value (NA); otherwise, we calculate the ratio. If we do not validate for this conditional, R calculates a value Inf (Infinite) that causes problems when running models.
<- activedata %>%
activedata mutate(ROA = ifelse(totalassets==0,NA,ebit / totalassets) )
Now we calculate Profit Margin (PM) and Asset Turn Over (ATO). We will use EBIT instead of Net Income to calculate PM:
<- activedata %>%
activedata mutate(PM = ifelse(revenue==0,NA,ebit / revenue ),
ATO = ifelse(totalassets==0,NA,revenue / totalassets))
Here we use a conditional with the function ifelse. The first parameter of ifelse is a logical condition that can be true or false. If the condition is true, then the value assigned will be the second parameter; if it is false, the value assigned will be the third parameter.
The ifelse function is very useful when creating financial signals that compare financial values. Here is the general description of the ifelse function:
ifelse(data_condition, value_if_TRUE, value_if_FALSE)
For example:
=5
a=10
bifelse(a>b,1,0)
[1] 0
Since a is NOT greater than b, then the value that was returned was zero.
We can use this function to create financial signals as new columns in a panel dataset or time-series dataset.
Let’s create a binary variable as a signal of positive ROA in a quarter compared with its ROA of one year ago (4 quarters ago). In other words, if the firm had a ROA in a specific quarter that is greater than its ROA 4 quarters ago, we will assign 1; 0 otherwise:
<- activedata %>%
activedata
mutate(ROAsignal = ifelse(ROA>lag(ROA,4),1,0))
4 CHALLENGE:
CREATE A COLUMN IN THE PANEL DATASET CALLED ROA_ABOVE THAT WILL BE ONE (=1) IF THE FIRM-QUARTER HAD A ROA GREATER THAN THE AVERAGE ROA OF ITS INDUSTRY IN THE CORRESPONDING QUARTER; AND ASSIGN A ZERO (=0) OTHERWISE.
HINT: Using dplyr is easier. You can ask Chat-GPT or Gemini to help you out, but try to understand the solution! If you use Gemini or ChatGPT, MENTION this in your workshop.
5 Descriptive statistics with panel data
Doing descriptive statistics with panel data is not quite equal as in the case of time-series structure. It is a good idea to do descriptive statistics by period of the panel data to avoid getting wrong summaries.
We can start by doing a descriptive statistics selecting only the last quarter of the data:
= activedata %>%
data2019q4 filter(q=="2019-10-01") %>%
select(firm, revenue, ebit, totalassets, ROA, PM)
summary(data2019q4,na.rm=TRUE)
firm revenue ebit
Length:152 Min. : 0 Min. : -6734642
Class :character 1st Qu.: 3257100 1st Qu.: 266741
Mode :character Median : 10634607 Median : 1673254
Mean : 50962874 Mean : 6282411
3rd Qu.: 35395790 3rd Qu.: 5079130
Max. :1007347869 Max. :154840535
NA's :30 NA's :25
totalassets ROA PM
Min. : 24541 Min. :-0.37696 Min. :-3.28603
1st Qu.: 10110939 1st Qu.: 0.03230 1st Qu.: 0.06562
Median : 31564719 Median : 0.06030 Median : 0.11673
Mean : 103148171 Mean : 0.06135 Mean : 0.18612
3rd Qu.: 77058094 3rd Qu.: 0.08508 3rd Qu.: 0.26798
Max. :1580009936 Max. : 0.42413 Max. : 0.98586
NA's :8 NA's :25 NA's :32
We can get descriptive statistics by groups.
In this case, we will select the last quarter of the dataset, group the firms by industry, and for each industry we get the median of the main variables:
<-unique(activedata$naics1)
industries
%>%
activedata filter(q=='2019-10-01') %>%
group_by(naics1) %>%
summarize(firms = n(),
median_total_assets = median(totalassets, na.rm = TRUE),
median_revenue = median(revenue, na.rm = TRUE),
median_ebit = median(ebit, na.rm = TRUE))
# A tibble: 17 × 5
naics1 firms median_total_assets median_revenue median_ebit
<chr> <int> <dbl> <dbl> <dbl>
1 Accommodation and Food … 4 13012570 2921780. 494324.
2 Administrative and Supp… 2 6641031 7061152. 741712.
3 Agriculture, Forestry, … 2 36248426. 38465597 2619456.
4 Arts, Entertainment, an… 2 8075112. 6862020 845286.
5 Construction 17 12334830. 4361809 1065160.
6 Finance and Insurance 33 28222952 524279 78285
7 Health Care and Social … 1 5829852 3698068 358695
8 Information 10 28681463 12814470 1673254
9 Management of Companies… 1 151432402 102477596 11453226
10 Manufacturing 39 44292990 25936810 2989902
11 Mining, Quarrying, and … 5 35489167. 8717697. 352864.
12 Professional, Scientifi… 1 19332591 12755893 2434974
13 Public Administration 1 78995636 10762172 1841889
14 Real Estate and Rental … 9 26925420 4098367 1789606.
15 Retail Trade 13 58341576 57586340. 5137620
16 Transportation and Ware… 10 30341542 14191390. 3564913
17 Utilities 2 102864999. 16192952. 8991664.
Here we can see a good picture of the whole Mexican Financial Market. It is important to note that the best central tendency measure for any financial statement variable such as revenue, total assets is the MEDIAN since the distribution of these variables is always skewed to the right with very few big, big firms. The arithmetic mean of financial variables do not provide a good representation of the typical company in a market.
6 Datacamp online courses
You will receive an email invitation to register in datacamp.com. Datacamp is one of the best online learning sites for Data Science applied to Business and Finance.
You will receive free access for the whole semester! Accept the invitation to be registered in datacamp.
YUO MUST TAKE Chapter 1 : Transforming Data with dplyr from the course: Data Manipulation with dplyr
The following courses/chapters are OPTIONAL (but recommended to do during the first 2-3 weeks of the course):
Chapter 2 (Aggregating data) from the same course: Manipulation with dplyr
If you want learn more about getting financial data, you can check the course: Importing and Managing Financial Data in R
7 W1 submission
The grade of this Workshop will be the following:
Complete (100%): If you submit an ORIGINAL and COMPLETE HTML file with all the activities, with your notes, and with your OWN RESPONSES to questions
Incomplete (75%): If you submit an ORIGINAL HTML file with ALL the activities but you did NOT RESPOND to the questions and/or you did not do all activities and respond to some of the questions.
Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop or you completed more but parts of your work is a copy-paste from other workshops.
Not submitted (0%)
Remember that you have to submit your Google Colab LINK, and you have to SHARE it with me.