Workshop 1

FZ2022 Algorithms and Data Analytics

Author: Sergio Castellanos-Gamboa, PhD

Affiliation: Tecnológico de Monterrey

Published: September 17, 2024

1 Introduction

1.1 Introduction to Algorithms and Data Analytics for Financial Data

This course is designed to provide a comprehensive introduction to the principles and techniques of algorithms and data analytics, with a focus on financial datasets. Through a combination of theoretical concepts and practical applications, you will learn how to handle and analyze large sets of financial data efficiently. The course covers several essential topics:

1.1.1 Data Cleaning and Data Management

You will begin by mastering the fundamentals of data collection, cleaning, and preparation. This includes working with financial datasets, performing return calculations, understanding data structures, and effectively merging data from multiple sources. Ensuring data quality is critical, especially when working with large, complex financial information.

1.1.2 Big Data

The course introduces you to handling big data in finance. We’ll work with a large panel dataset comprising historical financial statement variables for numerous firms, giving you hands-on experience with the unique challenges and opportunities presented by big data. You will learn to apply algorithms that allow for scalable data processing.

1.1.3 Econometrics and Logit Models

Some basic econometric techniques will be explored, with a particular focus on logit models. These methods will help you understand relationships in financial data, such as predicting binary outcomes (e.g., credit default, fraud detection) from financial variables.

1.1.4 Introduction to Machine Learning

Finally, we’ll delve into basic machine learning techniques and explore how they can be applied to financial datasets. You will get an introduction to supervised learning, with a focus on classification and regression, and how these techniques are used in real-world financial analytics.


1.2 Workshop Overview

This is an individual workshop that serves as a hands-on introduction to data management for financial datasets. In this workshop, you will practice key tasks such as:

  • Data collection and cleaning: Ensuring the accuracy and completeness of financial data.
  • Return calculation: Understanding how to compute returns from raw financial data.
  • Data structures: Working with different data formats, from structured tables to more complex time series and panel datasets.
  • Data merging: Combining datasets to form comprehensive views for analysis.

Throughout the workshop, you’ll begin working with a large panel dataset of historical financial statement variables for many firms. This dataset will serve as the foundation for many exercises in the course, giving you ample opportunities to apply the techniques and tools you learn. The exercises in this workshop are based on Dorantes (2024).

1.3 General Directions for Each Workshop

You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOGIN with your @tec.mx account and then create a Google Colab notebook for each workshop.

You must share each Colab document (workshop) with me via e-mail:

sergio.castellanos@tec.mx

You must give Edit privileges to this account.

In Google Colab, you can work with Python or R notebooks. The default is Python notebooks, so you must go to “Runtime” or “Entorno de Ejecución”, select “Change runtime type” or “Cambiar entorno de ejecución”, and then select R (instead of Python).

Your notebook will have a default name like “Untitled2.ipynb”. Click on this name and change it to “W1_FZ2022_YourFirstName_YourLastname”.

Pay attention in class to learn how to write text and R code into your notebook.

In your Workshop Notebook, you have to:

  • Replicate all the R code along with its output.
  • You must answer every question and complete every challenge in the workshop. A challenge can include specific questions or an exercise to complete.

For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.

  • It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook to study for the FINAL EXAM. Your own workshop/notebook will be very helpful for further study.

Once you finish your workshop, make sure that you RUN ALL CHUNKS. You can run each code chunk by clicking on the “Run” button located in the top-left section of each chunk. You can also run all the chunks at once with Ctrl-F9. You must submit the web link of your Google Colab workshop to Canvas.

2 Data management for time-series financial variables

2.1 Installation of R packages

Google Colab has many R packages already installed, but not all. In this workshop, we will use R packages that are already installed in Colab, and others that we need to install before loading them in Colab.

We start by installing the packages that are NOT pre-installed in Colab:

install.packages("quantmod")
install.packages("PerformanceAnalytics")

The installation of packages in Colab might take up to 30 seconds or 1 minute each, so after they are installed, you can add the # character before the install.packages command to avoid re-installing the packages and waiting for a long time.
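For example, after the first successful installation, the chunk can simply look like this:

# install.packages("quantmod")
# install.packages("PerformanceAnalytics")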

Once these packages are installed, we need to load them into memory:

# Now I load the packages in Memory:
library(quantmod)
library(PerformanceAnalytics)

Now we load the packages for this workshop that are already pre-installed in Colab:

library(readxl)
library(dplyr)

2.2 Data collection

We start by downloading real stock prices online from Yahoo Finance.

First, let's clear our R environment and turn off scientific notation for numbers:

# Clear all objects from the R environment:
rm(list=ls())
# To avoid scientific notation for numbers: 
options(scipen=999)

2.2.1 The quantmod package

The quantmod package is designed to assist financial traders in developing econometric-based trading models.

It provides several functions for financial data collection, data management, and data visualization.

This package includes the getSymbols() function, which creates an xts (extensible time series) object in the environment with the downloaded data from the Internet.

2.2.2 Downloading Real Financial Prices

The getSymbols() function enables users to download real-time, up-to-date financial data, such as stock prices, ETF prices, interest rates, exchange rates, and more. getSymbols() allows for the download of data from multiple sources, including Yahoo Finance, Google Finance, FRED, and Oanda. These sources offer thousands of financial and economic data series from many market exchanges and other macroeconomic variables worldwide.
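As a quick illustration of another source (a sketch, not needed for this workshop), the following call would download the daily 10-year US Treasury yield from FRED; DGS10 is the FRED series ID:

# Sketch (not run): 10-year US Treasury yield from FRED
# getSymbols("DGS10", src="FRED")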

Now, we will work with historical data of the Bitcoin cryptocurrency and Tesla stock. We will download the daily quotations of these instruments from January 1, 2020, to the most recent complete month from Yahoo Finance:

getSymbols(c("BTC-USD","TSLA"), from="2020-01-01", 
           to="2024-08-31", src="yahoo", periodicity="daily")
[1] "BTC-USD" "TSLA"   

This function will create an xts-zoo R object for each ticker. Each object will contain the corresponding historical daily prices. xts stands for extensible time-series, and an xts-zoo object is designed to facilitate the manipulation of time series data.

BTC-USD and TSLA are the ticker names used in Yahoo Finance. The from argument specifies the start date for the data you want to retrieve. The to argument defines the end date of the series; here we set it to August 31, 2024, the most recent complete month at the time of writing (if you omit it, getSymbols() downloads data up to the most recent day). The src argument indicates the data source, which in this case is Yahoo Finance. Lastly, the periodicity argument determines the frequency of the data (daily, weekly, monthly, quarterly).
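As a sketch, the same call with periodicity="monthly" would return one row per month. To avoid overwriting the daily TSLA object we just created, we can use the auto.assign=FALSE argument, which returns the data instead of creating an object named TSLA (the name tsla_monthly is our own choice):

# Sketch: monthly instead of daily quotations, stored in a new object
tsla_monthly <- getSymbols("TSLA", from="2020-01-01", to="2024-08-31",
                           src="yahoo", periodicity="monthly",
                           auto.assign=FALSE)
head(tsla_monthly)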

2.2.3 Showing the content of datasets

We can view the content of an R object just by typing the name of the object in the R code. If the dataset is long, we can view the first or last rows of the dataset using the following functions:

head(`BTC-USD`,n=5)
           BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
2020-01-01     7194.892     7254.331    7174.944      7200.174    18565664997
2020-01-02     7202.551     7212.155    6935.270      6985.470    20802083465
2020-01-03     6984.429     7413.715    6914.996      7344.884    28111481032
2020-01-04     7345.375     7427.386    7309.514      7410.657    18444271275
2020-01-05     7410.452     7544.497    7400.536      7411.317    19725074095
           BTC-USD.Adjusted
2020-01-01         7200.174
2020-01-02         6985.470
2020-01-03         7344.884
2020-01-04         7410.657
2020-01-05         7411.317

When tickers have special characters, we have to reference the object using backticks (`), as in `BTC-USD`.

You can also list the LAST rows of the dataset with the tail() function; here we display the last 10 rows. Note that you can change the number of rows you want to display.

tail(`BTC-USD`, n=10)
           BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
2024-08-22     61168.32     61408.11    59815.25      60381.91    27625734377
2024-08-23     60380.95     64947.06    60372.05      64094.36    42530509233
2024-08-24     64103.87     64513.79    63619.92      64178.99    21430585163
2024-08-25     64176.37     64996.42    63833.52      64333.54    18827683555
2024-08-26     64342.23     64489.71    62849.56      62880.66    27682040631
2024-08-27     62879.71     63210.80    58116.75      59504.13    39103882198
2024-08-28     59507.93     60236.45    57890.68      59027.62    40289564698
2024-08-29     59027.47     61184.08    58786.23      59388.18    32224990582
2024-08-30     59388.60     59896.89    57768.53      59119.48    32292756405
2024-08-31     59117.48     59432.59    58768.79      58969.90    12403470760
           BTC-USD.Adjusted
2024-08-22         60381.91
2024-08-23         64094.36
2024-08-24         64178.99
2024-08-25         64333.54
2024-08-26         62880.66
2024-08-27         59504.13
2024-08-28         59027.62
2024-08-29         59388.18
2024-08-30         59119.48
2024-08-31         58969.90

The xts datasets created with the getSymbols() function contain different columns for open, low, high, close, adjusted, and volume data. These datasets are often referred to as OHLC datasets (Open, High, Low, Close).

Adjusted prices are used for stocks, not for currencies. Adjusted prices account for dividend payments and stock splits, providing a more accurate reflection of stock price movements over time. For the Bitcoin series, we can use either the close or adjusted price to calculate returns.

2.3 Data selection

2.3.1 Column selection

We can easily select either columns or rows from a dataset. For example, if we want to select the adjusted price column from the TSLA xts dataset, we use the $ operator followed by the column name, and then assign it to a new variable name:

adjTSLA = TSLA$TSLA.Adjusted
head(adjTSLA)
           TSLA.Adjusted
2020-01-02      28.68400
2020-01-03      29.53400
2020-01-06      30.10267
2020-01-07      31.27067
2020-01-08      32.80933
2020-01-09      32.08933

Another way to select a column is by using the column number(s):

#Selecting the adjusted price, which is the column #6: 
adjTSLA = TSLA[, 6]

Note that the notation to reference column 6 and ALL rows of the TSLA dataset is:

TSLA[ , 6]

In this case, we want to keep ALL rows, so we leave the row index empty and specify only column 6.

R datasets have 2 dimensions: [#rows, #columns]. We can display the dimensions of any R dataset using the dim() function.

dim(TSLA)
[1] 1174    6

We see that the # of rows of TSLA dataset is 1174, and the # of columns is 6.

We can make reference to any subset of the dataset using this notation of [#rows, #columns]
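For example, the following selects rows 20 to 25 of the adjusted price column:

# Rows 20 to 25, column 6 (the adjusted price):
TSLA[20:25, 6]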

2.3.2 Row selection

We can also select specific rows of a dataset following the same notation of [#rows, #columns]. If I want to select only the first rows of TSLA dataset, I can do the following:

TSLAfirstdays = TSLA[1:10,]
TSLAfirstdays
           TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
2020-01-02  28.30000  28.71333 28.11400   28.68400   142981500      28.68400
2020-01-03  29.36667  30.26667 29.12800   29.53400   266677500      29.53400
2020-01-06  29.36467  30.10400 29.33333   30.10267   151995000      30.10267
2020-01-07  30.76000  31.44200 30.22400   31.27067   268231500      31.27067
2020-01-08  31.58000  33.23267 31.21533   32.80933   467164500      32.80933
2020-01-09  33.14000  33.25333 31.52467   32.08933   426606000      32.08933
2020-01-10  32.11933  32.32933 31.58000   31.87667   194392500      31.87667
2020-01-13  32.90000  35.04200 32.80000   34.99067   397764000      34.99067
2020-01-14  36.28400  36.49400 34.99333   35.86133   434943000      35.86133
2020-01-15  35.31733  35.85600 34.45267   34.56667   260532000      34.56667

We used the notation 1:10 in the row specification, indicating that we want the sequence from 1 to 10.

We can also select specific rows and columns. For example, if I only want to select the first 10 days of the adjusted prices of TESLA:

TSLA_adjusted_first_prices = TSLA[1:10,6]
head(TSLA_adjusted_first_prices)
           TSLA.Adjusted
2020-01-02      28.68400
2020-01-03      29.53400
2020-01-06      30.10267
2020-01-07      31.27067
2020-01-08      32.80933
2020-01-09      32.08933

If we want to select specific columns, we can use the c() function to build a vector of the column numbers we want. For example, if for some reason we want to select the Open and Close columns (columns #1 and #4) for the first 10 days of TESLA prices:

TSLA_Open_Close = TSLA[1:10,c(1,4)]
head(TSLA_Open_Close)
           TSLA.Open TSLA.Close
2020-01-02  28.30000   28.68400
2020-01-03  29.36667   29.53400
2020-01-06  29.36467   30.10267
2020-01-07  30.76000   31.27067
2020-01-08  31.58000   32.80933
2020-01-09  33.14000   32.08933

For xts datasets, we can also select rows using the date index. Each xts dataset has a time index that we can use. For example, if we want to select all TESLA prices for 2021 only, we can do the following:

TESLA_2021 = TSLA["2021-01-01/2021-12-31", ]
head(TESLA_2021)
           TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
2021-01-04  239.8200  248.1633 239.0633   243.2567   145914600      243.2567
2021-01-05  241.2200  246.9467 239.7333   245.0367    96735600      245.0367
2021-01-06  252.8300  258.0000 249.7000   251.9933   134100000      251.9933
2021-01-07  259.2100  272.3300 258.4000   272.0133   154496700      272.0133
2021-01-08  285.3333  294.8300 279.4633   293.3400   225166500      293.3400
2021-01-11  283.1333  284.8100 267.8733   270.3967   177904800      270.3967

We can also combine column selection with row selection. If we want the same date selection, but only adjusted prices:

TESLA_2021_adjusted = TSLA["2021-01-01/2021-12-31", "TSLA.Adjusted"]

In this case, to select the column I used the column name instead of the # of the column.

2.4 Data merging

We can use the merge() function to join two or more xts time-series datasets into an integrated dataset:

# Select only the adjusted prices in order to calculate returns later:
adjprices <- merge(`BTC-USD`$`BTC-USD.Adjusted`, TSLA$TSLA.Adjusted)

The quantmod package has the Ad() function to extract adjusted prices. We can do the same as above using this function:

adjprices <- merge(Ad(`BTC-USD`), Ad(TSLA))

Now we have an xts-zoo object with the 2 adjusted prices. We can change the names of the columns:

names(adjprices)<-c("bitcoin","tesla")

Now we can make reference to the adjusted prices using these names.
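For example:

# Reference the renamed columns directly:
head(adjprices$bitcoin, n=3)
head(adjprices$tesla, n=3)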

2.5 Data cleaning

In Finance, when managing daily data, it is very common to have gaps in the series. What does this mean? It means that the series contains missing values for some days. For example, stock series have no data for weekends or holidays:

head(adjprices,n=10)
            bitcoin    tesla
2020-01-01 7200.174       NA
2020-01-02 6985.470 28.68400
2020-01-03 7344.884 29.53400
2020-01-04 7410.657       NA
2020-01-05 7411.317       NA
2020-01-06 7769.219 30.10267
2020-01-07 8163.692 31.27067
2020-01-08 8079.863 32.80933
2020-01-09 7879.071 32.08933
2020-01-10 8166.554 31.87667

In this case, Bitcoin has price data for every day, including weekends and holidays. When we merged the datasets, the tesla column got NA values for those days. If we keep the dataset as it is, calculating daily returns for Tesla will be problematic for Mondays, since there is no price value for Sundays.

An easy way to handle NA values is to remove rows that contain any NA values in any column. This can be done using the na.omit() function. Although R can manage gaps in time series objects, it is still a good practice to ensure that your dataset is free of NA values.

adjprices <- na.omit(adjprices)

Now the rows for holidays and weekends have been deleted:

head(adjprices,n=6)
            bitcoin    tesla
2020-01-02 6985.470 28.68400
2020-01-03 7344.884 29.53400
2020-01-06 7769.219 30.10267
2020-01-07 8163.692 31.27067
2020-01-08 8079.863 32.80933
2020-01-09 7879.071 32.08933
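An alternative to deleting rows, shown here only as a sketch, is to fill each gap with the last available price using the na.locf() function from the zoo package (loaded automatically with quantmod). Note that this implies a zero return for non-trading days, so whether it is appropriate depends on your analysis:

# Sketch (not run): carry the last observed price forward instead of
# deleting the rows with NA values
# adjprices_filled <- zoo::na.locf(merge(Ad(`BTC-USD`), Ad(TSLA)))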

2.6 Visualization of prices

Visualize how Bitcoin has been valued over time:

plot(`BTC-USD`)

We can produce a better visualization using the chartSeries function from the quantmod package:

chartSeries(`BTC-USD`, theme="white")

We can also do a plot of specific periods. For example, we can see how Bitcoin price behaved only in 2021:

chartSeries(`BTC-USD`, subset = '2021-01-01/2021-12-31')

2.7 Financial returns

A financial simple return for a stock (R_{t}) is calculated as a percentage change of price from the previous period (t-1) to the present period (t):

R_{t}=\frac{\left(Adjprice_{t}-Adjprice_{t-1}\right)}{Adjprice_{t-1}}=\frac{Adjprice_{t}}{Adjprice_{t-1}}-1

For example, if the adjusted price of a stock at the end of January 2022 was $100.00, and its previous (December 2021) adjusted price was $80.00, then the monthly simple return of the stock in January 2022 will be:

R_{Jan2022}=\frac{Adjprice_{Jan2022}}{Adjprice_{Dec2021}}-1=\frac{100}{80}-1=0.25

We can use returns in decimal or in percentage (multiplying by 100). We will keep using decimals.
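We can verify the example above directly in R:

# Simple return when the price moves from $80 to $100:
100/80 - 1
[1] 0.25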

In Finance, it is strongly recommended to calculate continuously compounded returns (cc returns) and to use them instead of simple returns for data analysis, statistics, and econometric models. cc returns are also called log returns.

One way to calculate cc returns is by subtracting the log of the previous adjusted price (at t-1) from the log of the current adjusted price (at t):

r_{t}=log(Adjprice_{t})-log(Adjprice_{t-1})

This is also called the first difference of the log price.

We can also calculate cc returns as the log of the ratio of the current adjusted price (at t) to the previous adjusted price (at t-1):

r_{t}=log\left(\frac{Adjprice_{t}}{Adjprice_{t-1}}\right)

cc returns are usually represented by small r, while simple returns are represented by capital R.
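Continuing the example above, the cc return for the same price movement is slightly smaller than the simple return of 0.25:

# cc (log) return when the price moves from $80 to $100:
log(100/80)
[1] 0.2231436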

2.8 Financial return calculation for time-series

2.8.1 Simple returns

We must use adjusted stock prices to calculate financial returns. To calculate a lagged (past) value of a time-series variable, we can use the lag function from the stats package.

R = adjprices / stats::lag(adjprices,n=1) - 1 
head(R)
               bitcoin       tesla
2020-01-02          NA          NA
2020-01-03  0.05145166  0.02963326
2020-01-06  0.05777286  0.01925464
2020-01-07  0.05077385  0.03880052
2020-01-08 -0.01026859  0.04920483
2020-01-09 -0.02485086 -0.02194501

Since adjprices has 2 columns, R calculates the daily simple returns for all columns of the dataset.

Note that we write stats:: before the lag function. We do this because other R packages also have a function called lag, so R could pick the wrong one if we do not specify which package we mean.
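A sketch of the two qualified calls (some_vector here is a hypothetical plain R vector, not part of this workshop):

# stats::lag() is the time-series method we used above:
# stats::lag(adjprices, n=1)
# dplyr also defines a lag() for plain vectors, used inside dplyr pipelines:
# dplyr::lag(some_vector, n=1)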

2.8.2 Continuously compounded returns

Now we calculate the daily continuously compounded returns using adjusted prices. We use the diff and log functions. The diff function calculates the difference between the value of a time-series variable and its past value:

r = diff(log(adjprices))
head(r)
               bitcoin       tesla
2020-01-02          NA          NA
2020-01-03  0.05017175  0.02920268
2020-01-06  0.05616562  0.01907161
2020-01-07  0.04952689  0.03806671
2020-01-08 -0.01032167  0.04803257
2020-01-09 -0.02516485 -0.02218939

Remember that the first difference of log prices is actually the continuously compounded returns of the period.

We can visualize the daily simple returns of Tesla over time:

plot(R$tesla)

We can observe increasing volatility in 2020 due to the pandemic. Volatility can be calculated as the standard deviation of returns.
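As a quick sketch, we can compute the daily volatility of Tesla with sd() and, under the common simplifying assumption of about 252 trading days per year, annualize it by multiplying by the square root of 252:

# Daily volatility (standard deviation of daily returns):
sd(R$tesla, na.rm=TRUE)
# A common annualization convention (assumes ~252 trading days per year):
sd(R$tesla, na.rm=TRUE) * sqrt(252)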

2.9 Descriptive statistics with time-series

We will use the table.Stats function from the PerformanceAnalytics package.

We already loaded this package, so we can directly get the main descriptive statistics of both Bitcoin and Tesla daily returns:

table1 <- table.Stats(R)
table1
                  bitcoin     tesla
Observations    1173.0000 1173.0000
NAs                1.0000    1.0000
Minimum           -0.3717   -0.2106
Quartile 1        -0.0159   -0.0202
Median             0.0009    0.0019
Arithmetic Mean    0.0027    0.0026
Geometric Mean     0.0018    0.0017
Quartile 3         0.0213    0.0226
Maximum            0.2111    0.1989
SE Mean            0.0012    0.0012
LCL Mean (0.95)    0.0003    0.0002
UCL Mean (0.95)    0.0051    0.0050
Variance           0.0017    0.0018
Stdev              0.0413    0.0420
Skewness          -0.4839    0.1093
Kurtosis           8.5246    2.8804

We can also get specific descriptive statistics using functions such as mean and sd:

mean_tesla_R = mean(R$tesla, na.rm=TRUE)
median_tesla_R = median(R$tesla, na.rm=TRUE)
sd_tesla_R = sd(R$tesla, na.rm=TRUE)

cat("Tesla daily mean return is ",mean_tesla_R, "\n")
Tesla daily mean return is  0.002597413 
cat("Tesla daily median return is ",median_tesla_R, "\n")
Tesla daily median return is  0.001871393 
cat("Tesla daily volatility is ", sd_tesla_R)
Tesla daily volatility is  0.04204226

When columns have NA values, some descriptive statistics functions return NA unless we use the option na.rm=TRUE (which removes the NA values before the calculation).
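For example:

mean(R$tesla)              # returns NA because the first return is NA
mean(R$tesla, na.rm=TRUE)  # removes the NA and returns the mean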

When the median and the mean are very different, this is a sign of non-normality in the distribution of the variable. Here we see that Tesla's median daily return is noticeably lower than its mean (about 0.19% vs 0.26%). The median is a much better measure of central tendency for financial returns.

The Kurtosis row reported by table.Stats is excess kurtosis, so a normally distributed variable would have a value of 0, and values above 0 indicate more extreme values than a normal distribution. In this case, both returns show positive excess kurtosis, with Bitcoin far above Tesla.

Financial returns usually have more extreme values than normally distributed variables, so the traditional standard deviation might not provide a complete view of the dispersion of returns. The quartiles give us a better perception of the volatility of financial returns. The box plot gives a good view of the mean, median (Q2), quartile 1 (Q1), and quartile 3 (Q3) of returns:

chart.Boxplot(R)

It is easy to see that Bitcoin is riskier than Tesla, since it has a wider range of returns in both the negative and positive directions. The red circles show the means, the mid lines are the medians (50th percentile), and the boxes contain the middle 50% of the data, from quartile 1 or Q1 (25th percentile) to Q3 (75th percentile). The vertical lines delimit the non-extreme values, and the dots are extreme values in the context of each series' own distribution.

2.10 Visualizing holding return over time

We can use the charts.PerformanceSummary function to visualize how much we would have made over time if we had invested $1.00 in each of Bitcoin and Tesla and held the investment for the whole period of the data:

charts.PerformanceSummary(R,
                          main = "Performance of $1.00 invested in Bitcoin and Tesla",
                          wealth.index = TRUE)

We see that if we had invested $1.00 in Tesla at the beginning of January 2020, we would have more than 7 times our money at the end of the period!

We can calculate the exact holding period return by getting the percentage growth of the series from the first adjusted price up to the last adjusted price:

hpr_tesla = as.numeric(adjprices$tesla[nrow(adjprices)]) / as.numeric(adjprices$tesla[1]) - 1
hpr_tesla
[1] 6.46444

We used the nrow function, which returns the # of rows of the data frame. In this case, we divide the last value of the Tesla price by the first value, and then subtract 1.

We can do the same, but instead of using numeric indexes for the rows, use the date index. The dates must exist in the index of the xts object; after removing weekends and holidays, our merged series starts on January 2, 2020 and ends on August 30, 2024:

hpr_tesla2 = as.numeric(adjprices$tesla['2024-08-30']) / as.numeric(adjprices$tesla['2020-01-02']) - 1
hpr_tesla2
[1] 6.46444

If we ask for a date that is not in the index (for example, '2024-09-12', which is beyond our sample), the selection returns an empty vector (numeric(0)).

We have to multiply by 100 to get the return in percentage:

hpr_tesla * 100 
[1] 646.444

2.11 CHALLENGE 1

  • Calculate and display the Holding-period-return for Bitcoin from January 2020 to the end of the sample (August 2024).

  • Calculate and display the Holding-period-return for Bitcoin from the first day of January 2024 to the end of the sample.

3 Data structures

In Finance there are basically the following dataset structures or data types:

  1. Time-series: in this structure you can have many periods and information for one “subject” (a subject can be a company, an index, an industry, etc.). You can also have more than one subject, but the information for each subject is placed in new columns. For example, the dataset created after running getSymbols():
       p_stock1 r_stock1 p_stock2 r_stock2
2014m1       10     0.02       20     0.01
2014m2       11     0.10       21     0.05
  2. Cross-sectional structure: in this structure, you usually have many “subjects” for ONLY ONE period. For example, key performance indicators of Mexican companies for the last quarter of 2016:
    Ticker   ROA ATO ProfitMargin
1 ALFAA.MX 0.023 0.9         0.15
2  AMXL.MX 0.015 1.1         0.10
  3. Panel-data structure: a combination of time-series and cross-sectional. In this structure, we have more than one subject, and for each subject we can have more than one period. Example:
     Ticker quarter  ROA ATO ProfitMargin
1  ALFAA.MX  2014q1  0.2 1.1         0.17
2  ALFAA.MX  2014q2 0.21 1.2         0.16
3       ...     ...  ... ...          ...
4 BIMBOA.MX  2014q1 0.15 0.8         0.10
5 BIMBOA.MX  2014q2 0.20 0.9         0.05
6       ...     ...  ... ...          ...
7 BIMBOA.MX  2017q1 0.15 1.1         0.09

4 Data management for panel-data

For financial analysis, data management is a very important process. Most of the time, before designing and running a financial / econometric model, we need to perform both simple and sophisticated data management.

In this example we will learn how to collapse a dataset and merge one dataset with another considering the data structures we learned in the previous section.

We will merge 2 datasets: a time-series dataset with monthly historical data for the Mexican IPyC market index, and a panel dataset with quarterly financial information for many Mexican public firms (a public firm is a company that issues shares in the market).

We can only merge datasets with the same granularity. In other words, we cannot merge a monthly dataset with a quarterly one. In this case, the monthly market dataset must be aggregated to quarters, and then we can merge the resulting dataset with the quarterly dataset.
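As a sketch of this kind of aggregation (assuming a hypothetical monthly xts object called ipc_monthly, which we do not create in this workshop), the xts package provides the to.quarterly() function to collapse a monthly series to quarterly frequency:

# Sketch (not run): collapse a hypothetical monthly xts series to quarters
# ipc_quarterly <- to.quarterly(ipc_monthly)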

We will work with an online panel dataset of Mexican firms. Download the dataset http://www.apradie.com/datos/datamx2020q4.xlsx using the readxl package (which we loaded at the beginning of the workshop). This dataset has real quarterly financial data of Mexican firms (from the BMV) for many years. The information comes from Economatica, a leading provider of economic and financial data. This is a panel dataset, which is a combination of cross-sectional and time-series data. Navigate through the data (using the View() function) to learn more about this dataset.

# Download the excel file from a web site:
download.file("http://www.apradie.com/datos/datamx2020q4.xlsx",
              "firmspaneldata.xlsx", mode="wb")
# The first parameter is the link and the second is a name for the
#  local file

# Use the function read_excel()
paneldataset <- read_excel("firmspaneldata.xlsx")

# The quarter column has a format with the time zone. I create a new column with only the date:
paneldataset$q = as.Date(paneldataset$quarter)

4.1 Data calculations with panel data

We will use the dplyr package to easily do data selection and calculations for panel datasets.

This dataset has historical quarterly financial data of active and inactive firms in the Bolsa Mexicana de Valores. We will keep ONLY active firms in the dataset, from 2010 onward. You can do this by typing:

activedata = paneldataset %>% 
  filter(status=="active", year>=2010)  

As you can see, we start with paneldataset, and then we do a subsetting or selection of rows by indicating specific conditions.

Let’s select a few important columns from this dataset to perform some basic financial ratio calculations.

We will use the dplyr package. This package is designed for data management of data frames and panel data. We chain data management steps with the pipe operator %>%.

activedata<-activedata %>%
    select(firm,q,year,revenue,cogs,ebit,totalassets,
           adjustedstockprice, naics1)

Here we start with the activedata dataset, and then select a few columns.

The dplyr package is very powerful for data management. Click HERE (https://datacarpentry.org/R-genomics/04-dplyr.html) for a good summary of the main functionality of this package.

4.1.1 Calculating financial returns and ratios with panel data

As we did for time-series data, we can calculate returns with a panel dataset. We follow the same formula mentioned above, but we have to consider that we are using a panel-data structure:

activedata <-  activedata %>% 
   group_by(firm) %>% 
   arrange(firm, q) %>% 
   mutate(stockret = log(adjustedstockprice) - log(lag(adjustedstockprice)) )

We added the column stockret to the panel data. Since we grouped by firm, R calculates returns correctly for each firm-quarter; when there is a change of firm, the return is not calculated from the previous firm's price. You can have a look at the data and check that the first quarter of each firm has an NA value.
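A quick way to run this check, sketched here with dplyr functions, is to keep the first row of each firm and inspect the stockret column, which should contain NA values:

# Check: the first quarter of each firm should have stockret = NA
activedata %>%
  group_by(firm) %>%
  slice_head(n = 1) %>%
  select(firm, q, stockret)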

4.1.2 Using conditionals to create columns

Let’s calculate Return on Assets using EBIT (Earnings before interest and taxes) instead of Net income:

In this case, if the denominator of the ratio is zero, we assign a null value (NA); otherwise, we calculate the ratio. If we do not check this condition, R assigns the value Inf (infinite), which causes problems when running models.

activedata <- activedata %>%
  mutate(ROA = ifelse(totalassets==0,NA,ebit / totalassets) )

Now we calculate Profit Margin (PM) and Asset Turn Over (ATO). We will use EBIT instead of Net Income to calculate PM:

activedata <- activedata %>% 
  mutate(PM = ifelse(revenue==0,NA,ebit / revenue ),
         ATO = ifelse(totalassets==0,NA,revenue / totalassets))

Here we use a conditional with the function ifelse. The first parameter of ifelse is a logical condition that can be true or false. If the condition is true, then the value assigned will be the second parameter; if it is false, the value assigned will be the third parameter.

The ifelse function is very useful when creating financial signals that compare financial values. Here is the general description of the ifelse function:

ifelse(data_condition, value_if_TRUE, value_if_FALSE)

For example:

a=5
b=10
ifelse(a>b,1,0)
[1] 0

Since a is NOT greater than b, then the value that was returned was zero.

We can use this function to create financial signals as new columns in a panel dataset or time-series dataset.

Let’s create a binary variable as a signal of positive ROA in a quarter compared with its ROA of one year ago (4 quarters ago). In other words, if the firm had a ROA in a specific quarter that is greater than its ROA 4 quarters ago, we will assign 1; 0 otherwise:

activedata <- activedata %>%
  mutate(ROAsignal = ifelse(ROA > lag(ROA,4), 1, 0))

5 CHALLENGE

CREATE A COLUMN IN THE PANEL DATASET CALLED ROA_ABOVE THAT WILL BE ONE (=1) IF THE FIRM-QUARTER HAD A ROA GREATER THAN THE AVERAGE ROA OF ITS INDUSTRY IN THE CORRESPONDING QUARTER; AND ASSIGN A ZERO (=0) OTHERWISE.

HINT: Using dplyr is easier. You can ask Chat-GPT or Gemini to help you out, but try to understand the solution! If you use Gemini or ChatGPT, MENTION this in your workshop.

6 Descriptive statistics with panel data

Doing descriptive statistics with panel data is not quite the same as with a time-series structure. It is a good idea to compute descriptive statistics by period of the panel data to avoid getting misleading summaries.

We can start by computing descriptive statistics for a single quarter, in this case the fourth quarter of 2019:

data2019q4 = activedata %>%
  filter(q=="2019-10-01") %>% 
  select(firm, revenue, ebit, totalassets, ROA, PM)

summary(data2019q4,na.rm=TRUE)
     firm              revenue                ebit          
 Length:152         Min.   :         0   Min.   : -6734642  
 Class :character   1st Qu.:   3257100   1st Qu.:   266741  
 Mode  :character   Median :  10634607   Median :  1673254  
                    Mean   :  50962874   Mean   :  6282411  
                    3rd Qu.:  35395790   3rd Qu.:  5079130  
                    Max.   :1007347869   Max.   :154840535  
                    NA's   :30           NA's   :25         
  totalassets              ROA                 PM          
 Min.   :     24541   Min.   :-0.37696   Min.   :-3.28603  
 1st Qu.:  10110939   1st Qu.: 0.03230   1st Qu.: 0.06562  
 Median :  31564719   Median : 0.06030   Median : 0.11673  
 Mean   : 103148171   Mean   : 0.06135   Mean   : 0.18612  
 3rd Qu.:  77058094   3rd Qu.: 0.08508   3rd Qu.: 0.26798  
 Max.   :1580009936   Max.   : 0.42413   Max.   : 0.98586  
 NA's   :8            NA's   :25         NA's   :32        

We can get descriptive statistics by groups.

In this case, we will select the same quarter (2019 Q4), group the firms by industry, and for each industry get the median of the main variables:

industries<-unique(activedata$naics1)

activedata %>% 
   filter(q=='2019-10-01') %>%
  group_by(naics1) %>% 
  summarize(firms = n(),
            median_total_assets = median(totalassets, na.rm = TRUE),
            median_revenue = median(revenue, na.rm = TRUE),
            median_ebit = median(ebit, na.rm = TRUE))
# A tibble: 17 × 5
   naics1                   firms median_total_assets median_revenue median_ebit
   <chr>                    <int>               <dbl>          <dbl>       <dbl>
 1 Accommodation and Food …     4           13012570        2921780.     494324.
 2 Administrative and Supp…     2            6641031        7061152.     741712.
 3 Agriculture, Forestry, …     2           36248426.      38465597     2619456.
 4 Arts, Entertainment, an…     2            8075112.       6862020      845286.
 5 Construction                17           12334830.       4361809     1065160.
 6 Finance and Insurance       33           28222952         524279       78285 
 7 Health Care and Social …     1            5829852        3698068      358695 
 8 Information                 10           28681463       12814470     1673254 
 9 Management of Companies…     1          151432402      102477596    11453226 
10 Manufacturing               39           44292990       25936810     2989902 
11 Mining, Quarrying, and …     5           35489167.       8717697.     352864.
12 Professional, Scientifi…     1           19332591       12755893     2434974 
13 Public Administration        1           78995636       10762172     1841889 
14 Real Estate and Rental …     9           26925420        4098367     1789606.
15 Retail Trade                13           58341576       57586340.    5137620 
16 Transportation and Ware…    10           30341542       14191390.    3564913 
17 Utilities                    2          102864999.      16192952.    8991664.

Here we can see a good picture of the whole Mexican financial market. It is important to note that the best central tendency measure for financial statement variables such as revenue or total assets is the MEDIAN, since the distribution of these variables is always skewed to the right, with very few very large firms. The arithmetic mean of financial variables does not provide a good representation of the typical company in a market.

7 Datacamp online courses

You will receive an email invitation to register on datacamp.com. DataCamp is one of the best online learning platforms for Data Science applied to Business and Finance.

You will receive free access for the entire semester! Accept the invitation to be registered on DataCamp.

You MUST TAKE Chapter 1: Transforming Data with dplyr from the course: Data Manipulation with dplyr.

The following courses/chapters are OPTIONAL (but recommended to complete during the first 2-3 weeks of the course):

  • Chapter 2: Aggregating Data from the same course: Data Manipulation with dplyr.

If you want to learn more about acquiring financial data, you can explore the course: Importing and Managing Financial Data in R.

8 W1 submission

The grade for this workshop will be as follows:

  • Complete (100%): If you submit an ORIGINAL and COMPLETE notebook with all the activities, your notes, and your OWN RESPONSES to questions. Submitting someone else’s work or having two or more people submit the same answers (exact texts or with minor changes or translations) will be considered plagiarism and a violation of academic integrity (FIA).

  • Incomplete (75%): If you submit an ORIGINAL notebook with ALL the activities but did NOT RESPOND to the questions and/or did not complete all activities, or if you only responded to some questions. Remember, submitting identical or slightly altered work from another person is considered a breach of academic integrity (FIA).

  • Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop or if you completed more but parts of your work are copy-pasted from other notebooks. This too will be considered an FIA if found to be plagiarized.

  • Not submitted (0%): No submission, no credit.

It is important to remember that submitting someone else’s work, or identical work (even with minor changes or translations) between two or more students, is strictly prohibited and will be considered an academic integrity violation (FIA).

Remember that you have to submit the Google Colab LINK to your notebook, and you must SHARE it with me, ensuring that I have edit access.