install.packages("quantmod")
install.packages("PerformanceAnalytics")
Workshop 1
FZ2022 Algorithms and Data Analytics
1 Introduction
1.1 Introduction to Algorithms and Data Analytics for Financial Data
This course is designed to provide a comprehensive introduction to the principles and techniques of algorithms and data analytics, with a focus on financial datasets. Through a combination of theoretical concepts and practical applications, you will learn how to handle and analyze large sets of financial data efficiently. The course covers several essential topics:
1.1.1 Data Cleaning and Data Management
You will begin by mastering the fundamentals of data collection, cleaning, and preparation. This includes working with financial datasets, performing return calculations, understanding data structures, and effectively merging data from multiple sources. Ensuring data quality is critical, especially when working with large, complex financial information.
1.1.2 Big Data
The course introduces you to handling big data in finance. We’ll work with a large panel dataset comprising historical financial statement variables for numerous firms, giving you hands-on experience with the unique challenges and opportunities presented by big data. You will learn to apply algorithms that allow for scalable data processing.
1.1.3 Econometrics and Logit Models
Some basic econometric techniques will be explored, with a particular focus on logit models. These methods will help you understand relationships in financial data, such as predicting binary outcomes (e.g., credit default, fraud detection) from financial variables.
1.1.4 Introduction to Machine Learning
Finally, we’ll delve into basic machine learning techniques and explore how they can be applied to financial datasets. You will get an introduction to supervised learning, with a focus on classification and regression, and how these techniques are used in real-world financial analytics.
1.2 Workshop Overview
This is an individual workshop that serves as a hands-on introduction to data management for financial datasets. In this workshop, you will practice key tasks such as:
- Data collection and cleaning: Ensuring the accuracy and completeness of financial data.
- Return calculation: Understanding how to compute returns from raw financial data.
- Data structures: Working with different data formats, from structured tables to more complex time series and panel datasets.
- Data merging: Combining datasets to form comprehensive views for analysis.
Throughout the workshop, you’ll begin working with a large panel dataset of historical financial statement variables for many firms. This dataset will serve as the foundation for many exercises in the course, giving you ample opportunities to apply the techniques and tools you learn. The exercises in this workshop are based on Dorantes (2024).
1.3 General Directions for each workshop
You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOGIN with your @tec.mx account and then create a Google Colab notebook for each workshop.
You must share each Colab document (workshop) with me via e-mail:
sergio.castellanos@tec.mx
You must give Edit privileges to this account.
In Google Colab, you can work with Python or R notebooks. The default is Python notebooks, so you must go to “Runtime” or “Entorno de Ejecución”, select “Change runtime type” or “Cambiar entorno de ejecución”, and then select R (instead of Python).
Your notebook will have a default name like “Untitled2.ipynb”. Click on this name and change it to “W1_FZ2022_YourFirstName_YourLastname”.
Pay attention in class to learn how to write text and R code into your notebook.
In your Workshop Notebook, you have to:
- Replicate all the R code along with its output.
- You must complete any question and/or challenge asked in the workshop. The challenges can include specific questions or require you to complete an exercise/challenge.
For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.
- It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook to study for the FINAL EXAM. Your own workshop/notebook will be very helpful for further study.
Once you finish your workshop, make sure that you RUN ALL CHUNKS. You can run each code chunk by clicking on the “Run” button located in the top-left section of each chunk. You can also run all the chunks at once with Ctrl-F9. You must submit the web link of your Google Colab workshop to Canvas.
2 Data management for time-series financial variables
2.1 Installation of R packages
Google Colab has many R packages already installed, but not all. In this workshop, we will use R packages that are already installed in Colab, and others that we need to install before loading them in Colab.
We start by installing the packages that are NOT pre-installed in Colab:
The installation of packages in Colab might take up to 30 seconds or 1 minute each, so after they are installed, you can add the #
character before the install.packages
command to avoid re-installing the packages and waiting for a long time.
Once these packages are installed, we need to load them into memory:
# Now I load the packages in Memory:
library(quantmod)
library(PerformanceAnalytics)
Now we load the packages of the workshop that are already installed in R:
library(readxl)
library(dplyr)
2.2 Data collection
We start by downloading real stock prices online from Yahoo Finance.
First, let’s clear our R environment:
# To avoid scientific notation for numbers:
options(scipen=999)
2.2.1 The quantmod package
The quantmod package is designed to assist financial traders in developing econometric-based trading models.
It provides several functions for financial data collection, data management, and data visualization.
This package includes the getSymbols()
function, which creates an xts (extensible time series) object in the environment with the downloaded data from the Internet.
2.2.2 Downloading Real Financial Prices
The getSymbols()
function enables users to download real-time, up-to-date financial data, such as stock prices, ETF prices, interest rates, exchange rates, and more. getSymbols()
allows for the download of data from multiple sources, including Yahoo Finance, Google Finance, FRED, and Oanda. These sources offer thousands of financial and economic data series from many market exchanges and other macroeconomic variables worldwide.
Now, we will work with historical data of the Bitcoin cryptocurrency and Tesla stock. We will download the daily quotations of these instruments from January 1, 2020, to the most recent complete month from Yahoo Finance:
getSymbols(c("BTC-USD","TSLA"), from="2020-01-01",
to="2024-08-31", src="yahoo", periodicity="daily")
[1] "BTC-USD" "TSLA"
This function will create an xts-zoo R object for each ticker. Each object will contain the corresponding historical daily prices. xts stands for extensible time-series, and an xts-zoo object is designed to facilitate the manipulation of time series data.
BTC-USD and TSLA are the ticker names used in Yahoo Finance. The from
argument specifies the start date for the data you want to retrieve. The to
argument defines the end date of the series, but in this case, we omit it to download the most recent data. The src
argument indicates the data source, which in this case is Yahoo Finance. Lastly, the periodicity
argument determines the frequency of the data (daily, weekly, monthly, quarterly).
2.2.3 Show the content of datasets
We can view the content of an R object just by typing the name of the object in the R code. If the dataset is long we can view the first rows or the last rows of the dataset by using the following functions:
head(`BTC-USD`,n=5)
BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
2020-01-01 7194.892 7254.331 7174.944 7200.174 18565664997
2020-01-02 7202.551 7212.155 6935.270 6985.470 20802083465
2020-01-03 6984.429 7413.715 6914.996 7344.884 28111481032
2020-01-04 7345.375 7427.386 7309.514 7410.657 18444271275
2020-01-05 7410.452 7544.497 7400.536 7411.317 19725074095
BTC-USD.Adjusted
2020-01-01 7200.174
2020-01-02 6985.470
2020-01-03 7344.884
2020-01-04 7410.657
2020-01-05 7411.317
When tickers have special characters, we have to make reference to the object with simple quotes (``).
Also, you can list the LAST 5 rows of the dataset. Note that you can change number of rows you want to display.
tail(`BTC-USD`, n=10)
BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
2024-08-22 61168.32 61408.11 59815.25 60381.91 27625734377
2024-08-23 60380.95 64947.06 60372.05 64094.36 42530509233
2024-08-24 64103.87 64513.79 63619.92 64178.99 21430585163
2024-08-25 64176.37 64996.42 63833.52 64333.54 18827683555
2024-08-26 64342.23 64489.71 62849.56 62880.66 27682040631
2024-08-27 62879.71 63210.80 58116.75 59504.13 39103882198
2024-08-28 59507.93 60236.45 57890.68 59027.62 40289564698
2024-08-29 59027.47 61184.08 58786.23 59388.18 32224990582
2024-08-30 59388.60 59896.89 57768.53 59119.48 32292756405
2024-08-31 59117.48 59432.59 58768.79 58969.90 12403470760
BTC-USD.Adjusted
2024-08-22 60381.91
2024-08-23 64094.36
2024-08-24 64178.99
2024-08-25 64333.54
2024-08-26 62880.66
2024-08-27 59504.13
2024-08-28 59027.62
2024-08-29 59388.18
2024-08-30 59119.48
2024-08-31 58969.90
The xts datasets created with the getSymbols()
function contain different columns for open, low, high, close, adjusted, and volume data. These datasets are often referred to as OHLC datasets (Open, High, Low, Close).
Adjusted prices are used for stocks, not for currencies. Adjusted prices account for dividend payments and stock splits, providing a more accurate reflection of stock price movements over time. For the Bitcoin series, we can use either the close or adjusted price to calculate returns.
2.3 Data selection
2.3.1 Column selection
We can easily select either columns or rows from a dataset. For example, if we want to select the adjusted price column from the TSLA xts dataset, we use the $
operator followed by the column name, and then assign it to a new variable name:
= TSLA$TSLA.Adjusted
adjTSLA head(adjTSLA)
TSLA.Adjusted
2020-01-02 28.68400
2020-01-03 29.53400
2020-01-06 30.10267
2020-01-07 31.27067
2020-01-08 32.80933
2020-01-09 32.08933
Another way to select a column is by using the column number(s):
#Selecting the adjusted price, which is the column #6:
= TSLA[, 6] adjTSLA
Note that the notation to reference column 6 and ALL rows of the TSLA dataset is:
TSLA[ , 6]
In this case, we want to keep ALL rows, so we leave the row index empty and specify only column 6.
R datasets have 2 dimensions: [#rows, #columns]. We can display the dimensions of any R dataset using the dim()
function.
dim(TSLA)
[1] 1174 6
We see that the # of rows of TSLA dataset is 1174, and the # of columns is 6.
We can make reference to any subset of the dataset using this notation of [#rows, #columns]
2.3.2 Row selection
We can also select specific rows of a dataset following the same notation of [#rows, #columns]. If I want to select only the first rows of TSLA dataset, I can do the following:
= TSLA[1:10,]
TSLAfirstdays TSLAfirstdays
TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
2020-01-02 28.30000 28.71333 28.11400 28.68400 142981500 28.68400
2020-01-03 29.36667 30.26667 29.12800 29.53400 266677500 29.53400
2020-01-06 29.36467 30.10400 29.33333 30.10267 151995000 30.10267
2020-01-07 30.76000 31.44200 30.22400 31.27067 268231500 31.27067
2020-01-08 31.58000 33.23267 31.21533 32.80933 467164500 32.80933
2020-01-09 33.14000 33.25333 31.52467 32.08933 426606000 32.08933
2020-01-10 32.11933 32.32933 31.58000 31.87667 194392500 31.87667
2020-01-13 32.90000 35.04200 32.80000 34.99067 397764000 34.99067
2020-01-14 36.28400 36.49400 34.99333 35.86133 434943000 35.86133
2020-01-15 35.31733 35.85600 34.45267 34.56667 260532000 34.56667
We used the notation 1:10 in the row specification, indicating that we want the sequence from 1 to 10.
We can also select specific rows and columns. For example, if I only want to select the first 10 days of the adjusted prices of TESLA:
= TSLA[1:10,6]
TSLA_adjusted_first_prices head(TSLA_adjusted_first_prices)
TSLA.Adjusted
2020-01-02 28.68400
2020-01-03 29.53400
2020-01-06 30.10267
2020-01-07 31.27067
2020-01-08 32.80933
2020-01-09 32.08933
If we want to select specific columns, we can use the container c() vector to indicate which column # we want to select. For example, if for some reason we want to select the Open and Close columns (columns #1 and #4), and select the first 10 days of TESLA prices:
= TSLA[1:10,c(1,4)]
TSLA_Open_Close head(TSLA_Open_Close)
TSLA.Open TSLA.Close
2020-01-02 28.30000 28.68400
2020-01-03 29.36667 29.53400
2020-01-06 29.36467 30.10267
2020-01-07 30.76000 31.27067
2020-01-08 31.58000 32.80933
2020-01-09 33.14000 32.08933
For xts datasets, we can also select rows using the date index. Each xts dataset has a time index that we can use. For example, if we want to select all TESLA prices of only year 2021 we can do the following:
= TSLA["2021-01-01/2021-12-31", ]
TESLA_2021 head(TESLA_2021)
TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
2021-01-04 239.8200 248.1633 239.0633 243.2567 145914600 243.2567
2021-01-05 241.2200 246.9467 239.7333 245.0367 96735600 245.0367
2021-01-06 252.8300 258.0000 249.7000 251.9933 134100000 251.9933
2021-01-07 259.2100 272.3300 258.4000 272.0133 154496700 272.0133
2021-01-08 285.3333 294.8300 279.4633 293.3400 225166500 293.3400
2021-01-11 283.1333 284.8100 267.8733 270.3967 177904800 270.3967
We can also combine column selection with row selection. If we want the same date selection, but only adjusted prices:
= TSLA["2021-01-01/2021-12-31", "TSLA.Adjusted"] TESLA_2021_adjusted
In this case, to select the column I used the column name instead of the # of the column.
2.4 Data Merging:
We can use the merge()
function to join two or more xts time-series datasets into an integrated dataset:
<- merge(`BTC-USD`$`BTC-USD.Adjusted`, TSLA$TSLA.Adjusted)
adjprices # I can select only the adjusted prices in order to calculate returns:
The quantmod has the Ad function to get Adjusted prices. We can do the same we did above using this function:
<- merge(Ad(`BTC-USD`), Ad(TSLA$TSLA.Adjusted)) adjprices
Now we have an xts-zoo objects with the 2 adjusted prices. We can change the names of the columns:
names(adjprices)<-c("bitcoin","tesla")
Now we can make reference to the adjusted prices using these names.
2.5 Data cleaning
In Finance, when managing daily data it is very common to have gaps in the series. What does this mean? It means that the contains some null values for some days. For example, for stock series there is no data for weekends or holidays:
head(adjprices,n=10)
bitcoin tesla
2020-01-01 7200.174 NA
2020-01-02 6985.470 28.68400
2020-01-03 7344.884 29.53400
2020-01-04 7410.657 NA
2020-01-05 7411.317 NA
2020-01-06 7769.219 30.10267
2020-01-07 8163.692 31.27067
2020-01-08 8079.863 32.80933
2020-01-09 7879.071 32.08933
2020-01-10 8166.554 31.87667
In this case, Bitcoin has price data for every day, including weekends and holidays. When we merged the datasets, the TESLA column will have NA
values for these days. If we keep this dataset as it is, calculating daily returns for TESLA will be problematic for Mondays, as there will be no price value for Sundays.
An easy way to handle NA
values is to remove rows that contain any NA
values in any column. This can be done using the na.omit()
function. Although R can manage gaps in time series objects, it is still a good practice to ensure that your dataset is free of NA
values.
<- na.omit(adjprices) adjprices
Now the holidays and weekends were deleted:
head(adjprices,n=6)
bitcoin tesla
2020-01-02 6985.470 28.68400
2020-01-03 7344.884 29.53400
2020-01-06 7769.219 30.10267
2020-01-07 8163.692 31.27067
2020-01-08 8079.863 32.80933
2020-01-09 7879.071 32.08933
2.6 Visualization of prices
Visualize how Bitcoin has been valued over time:
plot(`BTC-USD`)
We can do a better visualization if we use the chartSeries function from the quantmod package:
chartSeries(`BTC-USD`, theme=("white"))
We can also do a plot of specific periods. For example, we can see how Bitcoin price behaved only in 2021:
chartSeries(`BTC-USD`, subset = '2021-01-01/2021-12-31')
2.7 Financial returns
A financial simple return for a stock (R_{t}) is calculated as a percentage change of price from the previous period (t-1) to the present period (t):
R_{t}=\frac{\left(Adjprice_{t}-Adjprice_{t-1}\right)}{Adjprice_{t-1}}=\frac{Adjprice_{t}}{Adjprice_{t-1}}-1
For example, if the adjusted price of a stock at the end of January 2022 was $100.00, and its previous (December 2021) adjusted price was $80.00, then the monthly simple return of the stock in January 2022 will be:
R_{Jan2022}=\frac{Adprice_{Jan2022}}{Adprice_{Dec2021}}-1=\frac{100}{80}-1=0.25
We can use returns in decimal or in percentage (multiplying by 100). We will keep using decimals.
In Finance it is very recommended to calculate continuously compounded returns (cc returns) and using cc returns instead of simple returns for data analysis, statistics and econometric models. cc returns are also called log returns.
One way to calculate cc returns is by subtracting the log of the current adjusted price (at t) minus the log of the previous adjusted price (at t-1):
r_{t}=log(Adjprice_{t})-log(Adjprice_{t-1})
This is also called as the difference of the log of the price.
We can also calculate cc returns as the log of the current adjusted price (at t) divided by the previous adjusted price (at t-1):
r_{t}=log\left(\frac{Adjprice_{t}}{Adjprice_{t-1}}\right)
cc returns are usually represented by small r, while simple returns are represented by capital R.
2.8 Financial return calculation for time-series
2.8.1 Simple returns
We must use adjusted stock prices to calculate financial returns. To calculate a lagged (past) value of price of a time-series variable we can use the function lag from the stats package.
= adjprices / stats::lag(adjprices,n=1) - 1
R head(R)
bitcoin tesla
2020-01-02 NA NA
2020-01-03 0.05145166 0.02963326
2020-01-06 0.05777286 0.01925464
2020-01-07 0.05077385 0.03880052
2020-01-08 -0.01026859 0.04920483
2020-01-09 -0.02485086 -0.02194501
Since adjprices has 2 columns, then R calculates the daily simple returns for all columns of the dataset.
Note that we use the name stats:: before the lag function. We did this since there are other R packages that have a function called lag, so R can be confused if we do not specify which package we are using for the lag function.
2.8.2 Continuously compounded returns
Now we calculate the daily continuously compounded returns using adjusted prices. We use the diff and log functions. The diff function calculates the difference between the value of a time-series variable and its past value:
= diff(log(adjprices))
r head(r)
bitcoin tesla
2020-01-02 NA NA
2020-01-03 0.05017175 0.02920268
2020-01-06 0.05616562 0.01907161
2020-01-07 0.04952689 0.03806671
2020-01-08 -0.01032167 0.04803257
2020-01-09 -0.02516485 -0.02218939
Remember that the first difference of log prices is actually the continuously compounded returns of the period.
We can visualize both daily returns over time for Tesla:
plot(R$tesla)
We can observe increasing volatility in 2020 due to the pandemic. Volatility can be calculated as the standard deviation of returns.
2.9 Descriptive statistics with time-series
We will use the table.Stats from the PerformanceAnalytics package.
We load the package and get the main descriptive statistics of both, Tesla and Bitcoin daily returns:
<- table.Stats(R)
table1 table1
bitcoin tesla
Observations 1173.0000 1173.0000
NAs 1.0000 1.0000
Minimum -0.3717 -0.2106
Quartile 1 -0.0159 -0.0202
Median 0.0009 0.0019
Arithmetic Mean 0.0027 0.0026
Geometric Mean 0.0018 0.0017
Quartile 3 0.0213 0.0226
Maximum 0.2111 0.1989
SE Mean 0.0012 0.0012
LCL Mean (0.95) 0.0003 0.0002
UCL Mean (0.95) 0.0051 0.0050
Variance 0.0017 0.0018
Stdev 0.0413 0.0420
Skewness -0.4839 0.1093
Kurtosis 8.5246 2.8804
We can also get specific descriptive statistics using specific functions such as mean and sd:
= mean(R$tesla, na.rm=TRUE)
mean_tesla_R = median(R$tesla, na.rm=TRUE)
median_tesla_R = sd(R$tesla, na.rm=TRUE)
sd_tesla_R
cat("Tesla daily mean return is ",mean_tesla_R, "\n")
Tesla daily mean return is 0.002597413
cat("Tesla daily median return is ",median_tesla_R, "\n")
Tesla daily median return is 0.001871393
cat("Tesla daily volatility is ", sd_tesla_R)
Tesla daily volatility is 0.04204226
When columns have NA values, some descriptive statistics functions cannot be calculated, unless we use the option na.rm=TRUE (means remove the NA values before calculation)
When median and mean are very different, this is a sign of a non-normality in the distribution of the variable. Here we see that Tesla median return is much less than its mean (0.14% vs 0.31%). The median is a much better measure of central tendency in Financial returns.
If Kurtosis of the variable is greater than 3, then the variable has more extreme values than a normal distributed variable. In this case, we see that both returns have Kurtosis much greater than 3.
Financial returns usually have more extreme values than normal distributed variables, so traditional standard deviation might not provide a complete view of the dispersion of returns. The quartiles provide us a better perception about the volatility of financial returns. The box-plot gives us a good perception of volatility, mean, median (Q2), quartile 1 (Q1) and quartile 3 (Q3) of returns:
chart.Boxplot(R)
It is easy to see that Bitcoin is riskier than Tesla since we have a wider range of returns in both, negative and positive returns. The red circles show the mean, the mid line is the median (50 percentile), the boxes include the 50% of the data from the Quartile 1 or Q1 (25 percentile) to the Q3 (75 percentile). The vertical lines limit non-extreme values. The dots are considered extreme values in the context of its own distribution.
2.10 Visualizing holing return over time
We can use the charts.PerformanceSummary of returns to visualize how much we would made over time if we had invested $1.00 for both, Bitcoin and Tesla, and hold your investment in the whole period of the data:
charts.PerformanceSummary(R,
main = "Performance of $1.00 invested in Bitcoin and Tesla",
wealth.index = TRUE)
We see that if we had invested in Tesla $1.00 in Jan 1, 2020, today we would have more than 7 times my money at the end of the period!
We can calculate the exact holding period return by getting the percentage growth of the series from the first adjusted price up to the last adjusted price:
= as.numeric(adjprices$tesla[nrow(adjprices)]) / as.numeric(adjprices$tesla[1]) - 1
hpr_tesla hpr_tesla
[1] 6.46444
I used the function nrow that returns the # of rows of the data frame. In this case, I am dividing the last value by the first value of Tesla price, and then subtract 1.
We can do the same, but instead of using the numeric index for the rows, use the date index:
= as.numeric(adjprices$tesla['2024-09-12']) / as.numeric(adjprices$tesla['2020-01-02']) - 1
hpr_tesla2 hpr_tesla2
numeric(0)
We have to multiply by 100 to get the return in percentage:
* 100 hpr_tesla
[1] 646.444
2.11 CHALLENGE 1
Calculate and display the Holding-period-return for Bitcoin from Jan 2020 to Sep 12, 2024.
Calculate and display the Holding-period-return for Bitcoin from the first day of January 2024 to Sep 12, 2024.
3 Data structures
In Finance there are basically the following dataset structures or data types:
- Time-series: in this structure you can have many periods, and information for one “subject” (subject can be company, index, industry, etc). Also, you can have more than one subject, but the information is placed as new column for each subject. For example: the dataset created after running getsymbols:
p_stock1 r_stock1 p_stock2 r_stock2
2014m1 10 0.02 20 0.01
2014m2 11 0.10 21 0.05
- Cross-sectional structure: in this structure, you usually have many “subjects”, for ONLY ONE period For example, key performance indicators of Mexican companies for the last quarter of 2016:
Ticker ROA ATO ProfitMargin
1 ALFAA.MX 0.023 0.9 0.15
2 AMXL.MX 0.015 1.1 0.10
- Panel-data structure: it is combination of time-series and cross-sectional. In this structure, we have more than one subject, and for each subject we can have more than one period. Example:
Ticker quarter ROA ATO ProfitMargin
1 ALFAA.MX 2014q1 0.2 1.1 0.17
2 ALFAA.MX 2014q2 0.21 1.2 0.16
3 ... ... ... ... ...
4 BIMBOA.MX 2014q1 0.15 0.8 0.10
5 BIMBOA.MX 2014q2 0.20 0.9 0.05
6 ... ... ... ... ...
7 BIMBOA.MX 2017q1 0.15 1.1 0.09
4 Data management for panel-data
For financial analysis, data management is a very important process. Most of the time, before designing and running a financial / econometric model, it is needed to do simple and sophisticated data management.
In this example we will learn how to collapse a dataset and merge one dataset with another considering the data structures we learned in the previous section.
We will merge 2 dataset: a time-series dataset with monthly historical data for the Mexican IPyC Market index, and a panel dataset with quarterly financial information for many Mexican public firms (A public firm is a company that issue shares in the market).
We can only merge datasets with the same granularity. In other words, we cannot merge a monthly with a quarterly dataset. In this case, the market monthly dataset must be aggregated to quarters, and then we can merge the resulting dataset with the other quarterly dataset.
We will work with an online panel data of Mexican firms. Download the dataset: http://www.apradie.com/datos/datamx2020q4.xlsx using the readxl package (you have to install this package). This dataset has real quarterly financial data of Mexican rms (from the BMV) for many years. This information comes from Economatica, a leading software and database economic and financial information. This dataset is a panel dataset, which is a combination of cross sectional and time-series dataset. Navigate through the data (using View() function) to learn more about this dataset.
# Download the excel file from a web site:
download.file("http://www.apradie.com/datos/datamx2020q4.xlsx",
"firmspaneldata.xlsx", mode="wb")
# The first parameter is the link and the second is a name for the
# local file
# Use the function read_excel()
<- read_excel("firmspaneldata.xlsx")
paneldataset
# The quarter column has a format with the time zone. I create a new column with only the date:
$q = as.Date(paneldataset$quarter) paneldataset
4.1 Data calculations with panel data
We will use the dplyr package to easily do data selection and calculations for panel datasets.
This dataset has historical quarterly financial data of active and not active firms in the Bolsa Mexicana de Valores. We will keep ONLY active firms in the dataset from 2010. You can do this by typing:
= paneldataset %>%
activedata filter(status=="active", year>=2010)
As you see, we start with the paneldataset, and then we do a sub-setting or selection of rows indicating specific conditions.
Let’s select a few important columns from this dataset to perform some basic financial ratio calculations.
We will use the dplyr package. This package is designed for data management of data frames and panel data. We link data management processes by the operand %>%.
<-activedata %>%
activedataselect(firm,q,year,revenue,cogs,ebit,totalassets,
adjustedstockprice, naics1)
Here we start with the activedata dataset, and then select a few columns.
The dplyr package is very powerful for data management. Click HERE (https://datacarpentry.org/R-genomics/04-dplyr.html) a good summary of the main functionality of this package.
4.1.1 Calculating financial returns and ratios with panel data
As we did for time-series data, we can calculate returns with a panel dataset. We follow the same formula mentioned above, but we have to consider that we are using a panel-data structure:
<- activedata %>%
activedata group_by(firm) %>%
arrange(firm, q) %>%
mutate(stockret = log(adjustedstockprice) - log(lag(adjustedstockprice)) )
We added the column r to the panel data. Since R knows that we have a panel data structure, it calculates returns correctly for each firm-quarter. When we have a change in firm, then returns do not have to be calculated. You can have a look to the data and check whether the first quarters for each firms have NA values.
4.1.2 Using conditionals to create columns
Let’s calculate Return on Assets using EBIT (Earnings before interest and taxes) instead of Net income:
In this case, if the denominator of the ratio is zero, we assign a null value (NA); otherwise, we calculate the ratio. If we do not validate for this conditional, R calculates a value Inf (Infinite) that causes problems when running models.
<- activedata %>%
activedata mutate(ROA = ifelse(totalassets==0,NA,ebit / totalassets) )
Now we calculate Profit Margin (PM) and Asset Turn Over (ATO). We will use EBIT instead of Net Income to calculate PM:
<- activedata %>%
activedata mutate(PM = ifelse(revenue==0,NA,ebit / revenue ),
ATO = ifelse(totalassets==0,NA,revenue / totalassets))
Here we use a conditional with the function ifelse. The first parameter of ifelse is a logical condition that can be true or false. If the condition is true, then the value assigned will be the second parameter; if it is false, the value assigned will be the third parameter.
The ifelse function is very useful when creating financial signals that compare financial values. Here is the general description of the ifelse function:
ifelse(data_condition, value_if_TRUE, value_if_FALSE)
For example:
=5
a=10
bifelse(a>b,1,0)
[1] 0
Since a is NOT greater than b, then the value that was returned was zero.
We can use this function to create financial signals as new columns in a panel dataset or time-series dataset.
Let’s create a binary variable as a signal of positive ROA in a quarter compared with its ROA of one year ago (4 quarters ago). In other words, if the firm had a ROA in a specific quarter that is greater than its ROA 4 quarters ago, we will assign 1; 0 otherwise:
<- activedata %>%
activedata
mutate(ROAsignal = ifelse(ROA>lag(ROA,4),1,0))
5 CHALLENGE:
CREATE A COLUMN IN THE PANEL DATASET CALLED ROA_ABOVE THAT WILL BE ONE (=1) IF THE FIRM-QUARTER HAD A ROA GREATER THAN THE AVERAGE ROA OF ITS INDUSTRY IN THE CORRESPONDING QUARTER; AND ASSIGN A ZERO (=0) OTHERWISE.
HINT: Using dplyr is easier. You can ask Chat-GPT or Gemini to help you out, but try to understand the solution! If you use Gemini or ChatGPT, MENTION this in your workshop.
6 Descriptive statistics with panel data
Doing descriptive statistics with panel data is not quite equal as in the case of time-series structure. It is a good idea to do descriptive statistics by period of the panel data to avoid getting wrong summaries.
We can start by doing a descriptive statistics selecting only the last quarter of the data:
= activedata %>%
data2019q4 filter(q=="2019-10-01") %>%
select(firm, revenue, ebit, totalassets, ROA, PM)
summary(data2019q4,na.rm=TRUE)
firm revenue ebit
Length:152 Min. : 0 Min. : -6734642
Class :character 1st Qu.: 3257100 1st Qu.: 266741
Mode :character Median : 10634607 Median : 1673254
Mean : 50962874 Mean : 6282411
3rd Qu.: 35395790 3rd Qu.: 5079130
Max. :1007347869 Max. :154840535
NA's :30 NA's :25
totalassets ROA PM
Min. : 24541 Min. :-0.37696 Min. :-3.28603
1st Qu.: 10110939 1st Qu.: 0.03230 1st Qu.: 0.06562
Median : 31564719 Median : 0.06030 Median : 0.11673
Mean : 103148171 Mean : 0.06135 Mean : 0.18612
3rd Qu.: 77058094 3rd Qu.: 0.08508 3rd Qu.: 0.26798
Max. :1580009936 Max. : 0.42413 Max. : 0.98586
NA's :8 NA's :25 NA's :32
We can get descriptive statistics by groups.
In this case, we will select the last quarter of the dataset, group the firms by industry, and for each industry we get the median of the main variables:
<-unique(activedata$naics1)
industries
%>%
activedata filter(q=='2019-10-01') %>%
group_by(naics1) %>%
summarize(firms = n(),
median_total_assets = median(totalassets, na.rm = TRUE),
median_revenue = median(revenue, na.rm = TRUE),
median_ebit = median(ebit, na.rm = TRUE))
# A tibble: 17 × 5
naics1 firms median_total_assets median_revenue median_ebit
<chr> <int> <dbl> <dbl> <dbl>
1 Accommodation and Food … 4 13012570 2921780. 494324.
2 Administrative and Supp… 2 6641031 7061152. 741712.
3 Agriculture, Forestry, … 2 36248426. 38465597 2619456.
4 Arts, Entertainment, an… 2 8075112. 6862020 845286.
5 Construction 17 12334830. 4361809 1065160.
6 Finance and Insurance 33 28222952 524279 78285
7 Health Care and Social … 1 5829852 3698068 358695
8 Information 10 28681463 12814470 1673254
9 Management of Companies… 1 151432402 102477596 11453226
10 Manufacturing 39 44292990 25936810 2989902
11 Mining, Quarrying, and … 5 35489167. 8717697. 352864.
12 Professional, Scientifi… 1 19332591 12755893 2434974
13 Public Administration 1 78995636 10762172 1841889
14 Real Estate and Rental … 9 26925420 4098367 1789606.
15 Retail Trade 13 58341576 57586340. 5137620
16 Transportation and Ware… 10 30341542 14191390. 3564913
17 Utilities 2 102864999. 16192952. 8991664.
Here we can see a good picture of the whole Mexican Financial Market. It is important to note that the best central tendency measure for any financial statement variable such as revenue, total assets is the MEDIAN since the distribution of these variables is always skewed to the right with very few big, big firms. The arithmetic mean of financial variables do not provide a good representation of the typical company in a market.
7 Datacamp online courses
You will receive an email invitation to register on datacamp.com. DataCamp is one of the best online learning platforms for Data Science applied to Business and Finance.
You will receive free access for the entire semester! Accept the invitation to be registered on DataCamp.
You MUST TAKE Chapter 1: Transforming Data with dplyr from the course: Data Manipulation with dplyr.
The following courses/chapters are OPTIONAL (but recommended to complete during the first 2-3 weeks of the course):
- Chapter 2: Aggregating Data from the same course: Data Manipulation with dplyr.
If you want to learn more about acquiring financial data, you can explore the course: Importing and Managing Financial Data in R.
8 W1 submission
The grade for this workshop will be as follows:
Complete (100%): If you submit an ORIGINAL and COMPLETE notebook with all the activities, your notes, and your OWN RESPONSES to questions. Submitting someone else’s work or having two or more people submit the same answers (exact texts or with minor changes or translations) will be considered plagiarism and a violation of academic integrity (FIA).
Incomplete (75%): If you submit an ORIGINAL notebook with ALL the activities but did NOT RESPOND to the questions and/or did not complete all activities, or if you only responded to some questions. Remember, submitting identical or slightly altered work from another person is considered a breach of academic integrity (FIA).
Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop or if you completed more but parts of your work are copy-pasted from other notebooks. This too will be considered an FIA if found to be plagiarized.
Not submitted (0%): No submission, no credit.
It is important to remember that submitting someone else’s work, or identical work (even with minor changes or translations) between two or more students, is strictly prohibited and will be considered an academic integrity violation (FIA).
Remember that you have to submit the Google Colab LINK to your notebook, and you must SHARE it with me, ensuring that I have edit access.