Workshop 1, Algorithms and data analysis

Author

Alberto Dorantes, Ph.D.

Published

September 17, 2024

Abstract
This is an INDIVIDUAL workshop. In this workshop we learn about data management for financial datasets: data collection, data cleaning, return calculation, data structures, and data merging. In addition, we start playing with a big panel-dataset of historical financial statement variables for many firms.

0.1 General Directions for each workshop

You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOGIN with your @tec.mx account and then create a Google Colab document for each workshop.

You must share each Colab document (workshop) with the following accounts:

cdorante.tec@gmail.com cdorante@tec.mx

You must give Edit privileges to these accounts.

In Google Colab you can work with Python or R notebooks. The default is Python notbooks, so you must go to “Runtime” or “Entorno de Ejecución”, select Change runtime type” or “Cambiar entorno de ejecución”, and then select R (instead of Python).

Your Notebook will have a default name like “Untitled2.ipynb”. Click on this name and change it to “W1-Econometrics-YourFirstName YourLastname”.

Pay attention in class to learn how to write text and R code into your Notebook.

In your Workshop Notebook you have to:

  • Replicate all the R Code along with its output.

  • You have to do any QUESTION AND/OR CHALLENGE asked in the workshop. The challenges can be: responses to specific questions and/or do an exercise/challenge.

For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.

  • It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook to study for the FINAL EXAM. Your own workshop/notebook will be very helpful for your further study.

Once you finish your workshop, make sure that you RUN ALL CHUNKS. You can run each code chunk by clicking on the “Run” button located in the top-left section of each chunk. You can also run all the chunks in one-shot with Ctrl-F9. You have to submit to Canvas the web link of your Google Colab workshop.

1 Data management for time-series financial variables

1.1 Installation of R packages

Google Colab has many R packages already installed, but not all. In this workshop we use R packages that are already installed in Colab, and others that we need to install before loading them in Colab.

We start installing the packages that are NOT installed in Colab:

install.packages("quantmod")
install.packages("PerformanceAnalytics")

The installation of packages in Colab might take up to 30 seg or 1 minute each, so after they are installed you can add the # character before the install.packages command to avoid re-installing the packages and waiting for a long time.

Once these packages are installed, we need to load them in memory:

# Now I load the packages in Memory:
library(quantmod)
library(PerformanceAnalytics)

Now we load the packages of the workshop that are already installed in R:

library(readxl)
library(dplyr)

1.2 Data collection

We start by downloading online real stock prices from Yahoo Finance.

We start clearing our R environment:

# To avoid scientific notation for numbers: 
options(scipen=999)

1.2.1 The quantmod package

The quantmod package is designed to help financial traders in developing econometric-based trading models.

It has several functions for financial data collection, data management and data visualization.

This package contains the getSymbols() function, which creates an xts (extensible time series) object in the environment with the downloaded data from the Internet.

1.2.2 Downloading real financial prices

The getSymbols() function enables its user to download online and up-to-date financial data, such as stock prices, ETF prices, interest rates, exchange rates, etc. getSymbols() allows to download this data from multiple sources: Yahoo Finance, Google Finance, FRED and Oanda. These sources have thousands of finance and economic data series from many market exchanges and other macroeconomic variables around the world.

Now, we will work with historical data of the Bitcoin cryptocurrency and the TESLA stock. We download daily quotations of these instruments from January 1, 2020 to the most complete recent month from Yahoo Finance:

getSymbols(c("BTC-USD","TSLA"), from="2020-01-01", 
           to="2024-09-13", src="yahoo", periodicity="daily")
[1] "BTC-USD" "TSLA"   

This function will create an xts-zoo R object for each ticker. Each object has the corresponding historical daily prices. xts stands for extensible time-series. An xts-zoo object is designed to easily manipulate time series data.

BTC-USD and TSLA are the ticker names in Yahoo Finance. The from argument is used to indicate the initial date from which you want to bring data. The to argument is the end date of the series you want to download. In this case we omit the to argument in order to download the most recent data. The src argument indicates the source of the data, in this case it is Yahoo Finance. Finally, the periodicity argument specifies the granularity of the data (daily, weekly, monthly, quarterly).

1.2.3 Show the content of datasets

We can view the content of an R object just by typing the name of the object in the R code. If the dataset is long we can view the first rows or the last rows of the dataset by using the following functions:

head(`BTC-USD`,n=5)
           BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
2020-01-01     7194.892     7254.331    7174.944      7200.174    18565664997
2020-01-02     7202.551     7212.155    6935.270      6985.470    20802083465
2020-01-03     6984.429     7413.715    6914.996      7344.884    28111481032
2020-01-04     7345.375     7427.386    7309.514      7410.657    18444271275
2020-01-05     7410.452     7544.497    7400.536      7411.317    19725074095
           BTC-USD.Adjusted
2020-01-01         7200.174
2020-01-02         6985.470
2020-01-03         7344.884
2020-01-04         7410.657
2020-01-05         7411.317

When tickers have special characters, we have to make reference to the object with simple quotes (``).

Also, you can list the LAST 5 rows of the dataset. Note that you can change number of rows you want to display.

tail(`BTC-USD`, n=10)
           BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
2024-09-04     57430.35     58511.57    55673.16      57971.54    35627680312
2024-09-05     57971.70     58300.58    55712.45      56160.49    31030280656
2024-09-06     56160.19     56976.11    52598.70      53948.75    49361693566
2024-09-07     53949.09     54838.14    53740.07      54139.69    19061486526
2024-09-08     54147.93     55300.86    53653.76      54841.57    18268287531
2024-09-09     54851.89     58041.12    54598.43      57019.54    34618096173
2024-09-10     57020.10     58029.98    56419.41      57648.71    28857630507
2024-09-11     57650.29     57991.32    55567.34      57343.17    37049062672
2024-09-12     57343.17     58534.36    57330.10      58127.01    33835707949
2024-09-13     58130.32     60648.02    57650.11      60571.30    32490528356
           BTC-USD.Adjusted
2024-09-04         57971.54
2024-09-05         56160.49
2024-09-06         53948.75
2024-09-07         54139.69
2024-09-08         54841.57
2024-09-09         57019.54
2024-09-10         57648.71
2024-09-11         57343.17
2024-09-12         58127.01
2024-09-13         60571.30

The xts datasets created with the getSymbols function have different columns for open, low, high, close, adjusted and volume data. These datasets are also called OHLC datasets.

The adjusted prices are used for stocks, not for currencies. Adjusted prices is an adjustment of the closing prices after considering dividend payments and stock splits. Then, for the Bitcoin series we can use close or adjusted price to calculate returns.

1.3 Data selection

1.3.1 Column selection

We can easily select either columns or rows of a dataset. For example, if we want to select the adjusted price column of the TSLA xts dataset, we use the $ and the name of the column, and then assign it to any new name:

adjTSLA = TSLA$TSLA.Adjusted
head(adjTSLA)
           TSLA.Adjusted
2020-01-02      28.68400
2020-01-03      29.53400
2020-01-06      30.10267
2020-01-07      31.27067
2020-01-08      32.80933
2020-01-09      32.08933

Another way to select a column is by the # of the column(s):

#Selecting the adjusted price, which is the column #6: 
adjTSLA = TSLA[, 6]

Note that the notation to make reference to column 6 and ALL rows of TSLA is:

TSLA[ , 6]

In this case, I want to keep ALL rows, so we leave empty the # of rows, and indicate that we want only column # 6.

R datasets have 2 dimensions: [#rows, #columns]. We can show the dimensions of any R dataset:

dim(TSLA)
[1] 1182    6

We see that the # of rows of TSLA dataset is 1182, and the # of columns is 6.

We can make reference to any subset of the dataset using this notation of [#rows, #columns]

1.3.2 Row selection

We can also select specific rows of a dataset following the same notation of [#rows, #columns]. If I want to select only the first rows of TSLA dataset, I can do the following:

TSLAfirstdays = TSLA[1:10,]
TSLAfirstdays
           TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
2020-01-02  28.30000  28.71333 28.11400   28.68400   142981500      28.68400
2020-01-03  29.36667  30.26667 29.12800   29.53400   266677500      29.53400
2020-01-06  29.36467  30.10400 29.33333   30.10267   151995000      30.10267
2020-01-07  30.76000  31.44200 30.22400   31.27067   268231500      31.27067
2020-01-08  31.58000  33.23267 31.21533   32.80933   467164500      32.80933
2020-01-09  33.14000  33.25333 31.52467   32.08933   426606000      32.08933
2020-01-10  32.11933  32.32933 31.58000   31.87667   194392500      31.87667
2020-01-13  32.90000  35.04200 32.80000   34.99067   397764000      34.99067
2020-01-14  36.28400  36.49400 34.99333   35.86133   434943000      35.86133
2020-01-15  35.31733  35.85600 34.45267   34.56667   260532000      34.56667

We used the notation 1:10 in the row specification, indicating that we want the sequence from 1 to 10.

We can also select specific rows and columns. For example, if I only want to select the first 10 days of the adjusted prices of TESLA:

TSLA_adjusted_first_prices = TSLA[1:10,6]
head(TSLA_adjusted_first_prices)
           TSLA.Adjusted
2020-01-02      28.68400
2020-01-03      29.53400
2020-01-06      30.10267
2020-01-07      31.27067
2020-01-08      32.80933
2020-01-09      32.08933

If we want to select specific columns, we can use the container c() vector to indicate which column # we want to select. For example, if for some reason we want to select the Open and Close columns (columns #1 and #4), and select the first 10 days of TESLA prices:

TSLA_Open_Close = TSLA[1:10,c(1,4)]
head(TSLA_Open_Close)
           TSLA.Open TSLA.Close
2020-01-02  28.30000   28.68400
2020-01-03  29.36667   29.53400
2020-01-06  29.36467   30.10267
2020-01-07  30.76000   31.27067
2020-01-08  31.58000   32.80933
2020-01-09  33.14000   32.08933

For xts datasets, we can also select rows using the date index. Each xts dataset has a time index that we can use. For example, if we want to select all TESLA prices of only year 2021 we can do the following:

TESLA_2021 = TSLA["2021-01-01/2021-12-31", ]
head(TESLA_2021)
           TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
2021-01-04  239.8200  248.1633 239.0633   243.2567   145914600      243.2567
2021-01-05  241.2200  246.9467 239.7333   245.0367    96735600      245.0367
2021-01-06  252.8300  258.0000 249.7000   251.9933   134100000      251.9933
2021-01-07  259.2100  272.3300 258.4000   272.0133   154496700      272.0133
2021-01-08  285.3333  294.8300 279.4633   293.3400   225166500      293.3400
2021-01-11  283.1333  284.8100 267.8733   270.3967   177904800      270.3967

We can also combine column selection with row selection. If we want the same date selection, but only adjusted prices:

TESLA_2021_adjusted = TSLA["2021-01-01/2021-12-31", "TSLA.Adjusted"]

In this case, to select the column I used the column name instead of the # of the column.

1.4 Data Merging:

We can use the merge() function to join two or more xts time-series datasets into an integrated dataset:

adjprices <- merge(`BTC-USD`$`BTC-USD.Adjusted`, TSLA$TSLA.Adjusted)
# I can select only the adjusted prices in order to calculate returns:

The quantmod has the Ad function to get Adjusted prices. We can do the same we did above using this function:

adjprices <- merge(Ad(`BTC-USD`), Ad(TSLA$TSLA.Adjusted))

Now we have an xts-zoo objects with the 2 adjusted prices. We can change the names of the columns:

names(adjprices)<-c("bitcoin","tesla")

Now we can make reference to the adjusted prices using these names.

1.5 Data cleaning

In Finance, when managing daily data it is very common to have gaps in the series. What does this mean? It means that the contains some null values for some days. For example, for stock series there is no data for weekends or holidays:

head(adjprices,n=10)
            bitcoin    tesla
2020-01-01 7200.174       NA
2020-01-02 6985.470 28.68400
2020-01-03 7344.884 29.53400
2020-01-04 7410.657       NA
2020-01-05 7411.317       NA
2020-01-06 7769.219 30.10267
2020-01-07 8163.692 31.27067
2020-01-08 8079.863 32.80933
2020-01-09 7879.071 32.08933
2020-01-10 8166.554 31.87667

In this case, Bitcoin has price data for any day including weekends and holidays. When we did the merge, the TESLA column will have NA values for these days. If we keep this dataset as it is, when calculating daily returns for TESLA, it will not be possible to calculate returns for Modays since there is no price value for Sunday. An easy way to deal with NA values is to delete the rows that have any NA value in any column. We can do this with the na.omit function:

However, R deals with gaps because it recognizes that we are working with a time series object. It is a good idea to have a data set free of NA’s. So, I can use the function na.omit:

adjprices <- na.omit(adjprices)

Now the holidays and weekends were deleted:

head(adjprices,n=6)
            bitcoin    tesla
2020-01-02 6985.470 28.68400
2020-01-03 7344.884 29.53400
2020-01-06 7769.219 30.10267
2020-01-07 8163.692 31.27067
2020-01-08 8079.863 32.80933
2020-01-09 7879.071 32.08933

1.6 Visualization of prices

Visualize how Bitcoin has been valued over time:

plot(`BTC-USD`)

We can do a better visualization if we use the chartSeries function from the quantmod package:

chartSeries(`BTC-USD`, theme=("white"))

We can also do a plot of specific periods. For example, we can see how Bitcoin price behaved only in 2021:

chartSeries(`BTC-USD`, subset = '2021-01-01/2021-12-31')

1.7 Financial returns

A financial simple return for a stock (R_{t}) is calculated as a percentage change of price from the previous period (t-1) to the present period (t):

R_{t}=\frac{\left(Adjprice_{t}-Adjprice_{t-1}\right)}{Adjprice_{t-1}}=\frac{Adjprice_{t}}{Adjprice_{t-1}}-1

For example, if the adjusted price of a stock at the end of January 2022 was $100.00, and its previous (December 2021) adjusted price was $80.00, then the monthly simple return of the stock in January 2022 will be:

R_{Jan2022}=\frac{Adprice_{Jan2022}}{Adprice_{Dec2021}}-1=\frac{100}{80}-1=0.25

We can use returns in decimal or in percentage (multiplying by 100). We will keep using decimals.

In Finance it is very recommended to calculate continuously compounded returns (cc returns) and using cc returns instead of simple returns for data analysis, statistics and econometric models. cc returns are also called log returns.

One way to calculate cc returns is by subtracting the log of the current adjusted price (at t) minus the log of the previous adjusted price (at t-1):

r_{t}=log(Adjprice_{t})-log(Adjprice_{t-1})

This is also called as the difference of the log of the price.

We can also calculate cc returns as the log of the current adjusted price (at t) divided by the previous adjusted price (at t-1):

r_{t}=log\left(\frac{Adjprice_{t}}{Adjprice_{t-1}}\right)

cc returns are usually represented by small r, while simple returns are represented by capital R.

1.8 Financial return calculation for time-series

1.8.1 Simple returns

We must use adjusted stock prices to calculate financial returns. To calculate a lagged (past) value of price of a time-series variable we can use the function lag from the stats package.

R = adjprices / stats::lag(adjprices,n=1) - 1 
head(R)
               bitcoin       tesla
2020-01-02          NA          NA
2020-01-03  0.05145166  0.02963326
2020-01-06  0.05777286  0.01925464
2020-01-07  0.05077385  0.03880052
2020-01-08 -0.01026859  0.04920483
2020-01-09 -0.02485086 -0.02194501

Since adjprices has 2 columns, then R calculates the daily simple returns for all columns of the dataset.

Note that we use the name stats:: before the lag function. We did this since there are other R packages that have a function called lag, so R can be confused if we do not specify which package we are using for the lag function.

1.8.2 Continuously compounded returns

Now we calculate the daily continuously compounded returns using adjusted prices. We use the diff and log functions. The diff function calculates the difference between the value of a time-series variable and its past value:

r = diff(log(adjprices))
head(r)
               bitcoin       tesla
2020-01-02          NA          NA
2020-01-03  0.05017175  0.02920268
2020-01-06  0.05616562  0.01907161
2020-01-07  0.04952689  0.03806671
2020-01-08 -0.01032167  0.04803257
2020-01-09 -0.02516485 -0.02218939

Remember that the first difference of log prices is actually the continuously compounded returns of the period.

We can visualize both daily returns over time for Tesla:

plot(R$tesla)

We can observe increasing volatility in 2020 due to the pandemic. Volatility can be calculated as the standard deviation of returns.

1.9 Descriptive statistics with time-series

We will use the table.Stats from the PerformanceAnalytics package.

We load the package and get the main descriptive statistics of both, Tesla and Bitcoin daily returns:

table1 <- table.Stats(R)
table1
                  bitcoin     tesla
Observations    1181.0000 1181.0000
NAs                1.0000    1.0000
Minimum           -0.3717   -0.2106
Quartile 1        -0.0160   -0.0201
Median             0.0009    0.0020
Arithmetic Mean    0.0027    0.0026
Geometric Mean     0.0018    0.0018
Quartile 3         0.0213    0.0228
Maximum            0.2111    0.1989
SE Mean            0.0012    0.0012
LCL Mean (0.95)    0.0003    0.0002
UCL Mean (0.95)    0.0050    0.0050
Variance           0.0017    0.0018
Stdev              0.0413    0.0420
Skewness          -0.4806    0.1004
Kurtosis           8.5237    2.8587

We can also get specific descriptive statistics using specific functions such as mean and sd:

mean_tesla_R = mean(R$tesla, na.rm=TRUE)
median_tesla_R = median(R$tesla, na.rm=TRUE)
sd_tesla_R = sd(R$tesla, na.rm=TRUE)

cat("Tesla daily mean return is ",mean_tesla_R, "\n")
Tesla daily mean return is  0.002645963 
cat("Tesla daily median return is ",median_tesla_R, "\n")
Tesla daily median return is  0.001975226 
cat("Tesla daily volatility is ", sd_tesla_R)
Tesla daily volatility is  0.04204204

When columns have NA values, some descriptive statistics functions cannot be calculated, unless we use the option na.rm=TRUE (means remove the NA values before calculation)

When median and mean are very different, this is a sign of a non-normality in the distribution of the variable. Here we see that Tesla median return is much less than its mean (0.14% vs 0.31%). The median is a much better measure of central tendency in Financial returns.

If Kurtosis of the variable is greater than 3, then the variable has more extreme values than a normal distributed variable. In this case, we see that both returns have Kurtosis much greater than 3.

Financial returns usually have more extreme values than normal distributed variables, so traditional standard deviation might not provide a complete view of the dispersion of returns. The quartiles provide us a better perception about the volatility of financial returns. The box-plot gives us a good perception of volatility, mean, median (Q2), quartile 1 (Q1) and quartile 3 (Q3) of returns:

chart.Boxplot(R)

It is easy to see that Bitcoin is riskier than Tesla since we have a wider range of returns in both, negative and positive returns. The red circles show the mean, the mid line is the median (50 percentile), the boxes include the 50% of the data from the Quartile 1 or Q1 (25 percentile) to the Q3 (75 percentile). The vertical lines limit non-extreme values. The dots are considered extreme values in the context of its own distribution.

1.10 Visualizing holing return over time

We can use the charts.PerformanceSummary of returns to visualize how much we would made over time if we had invested $1.00 for both, Bitcoin and Tesla, and hold your investment in the whole period of the data:

charts.PerformanceSummary(R,
                          main = "Performance of $1.00 invested in Bitcoin and Tesla",
                          wealth.index = TRUE)

We see that if we had invested in Tesla $1.00 in Jan 1, 2020, today we would have more than 7 times my money at the end of the period!

We can calculate the exact holding period return by getting the percentage growth of the series from the first adjusted price up to the last adjusted price:

hpr_tesla = as.numeric(adjprices$tesla[nrow(adjprices)]) / as.numeric(adjprices$tesla[1]) - 1
hpr_tesla
[1] 7.011783

I used the function nrow that returns the # of rows of the data frame. In this case, I am dividing the last value by the first value of Tesla price, and then subtract 1.

We can do the same, but instead of using the numeric index for the rows, use the date index:

hpr_tesla2 = as.numeric(adjprices$tesla['2024-09-12']) / as.numeric(adjprices$tesla['2020-01-02']) - 1
hpr_tesla2
[1] 7.011783

We have to multiply by 100 to get the return in percentage:

hpr_tesla * 100 
[1] 701.1783

1.11 CHALLENGE 1

  • Calculate and display the Holding-period-return for Bitcoin from Jan 2020 to Sep 12, 2024.

  • Calculate and display the Holding-period-return for Bitcoin from the first day of January 2024 to Sep 12, 2024.

2 Data structures

In Finance there are basically the following dataset structures or data types:

  1. Time-series: in this structure you can have many periods, and information for one “subject” (subject can be company, index, industry, etc). Also, you can have more than one subject, but the information is placed as new column for each subject. For example: the dataset created after running getsymbols:
       p_stock1 r_stock1 p_stock2 r_stock2
2014m1       10     0.02       20     0.01
2014m2       11     0.10       21     0.05
  1. Cross-sectional structure: in this structure, you usually have many “subjects”, for ONLY ONE period For example, key performance indicators of Mexican companies for the last quarter of 2016:
    Ticker   ROA ATO ProfitMargin
1 ALFAA.MX 0.023 0.9         0.15
2  AMXL.MX 0.015 1.1         0.10
  1. Panel-data structure: it is combination of time-series and cross-sectional. In this structure, we have more than one subject, and for each subject we can have more than one period. Example:
     Ticker quarter  ROA ATO ProfitMargin
1  ALFAA.MX  2014q1  0.2 1.1         0.17
2  ALFAA.MX  2014q2 0.21 1.2         0.16
3       ...     ...  ... ...          ...
4 BIMBOA.MX  2014q1 0.15 0.8         0.10
5 BIMBOA.MX  2014q2 0.20 0.9         0.05
6       ...     ...  ... ...          ...
7 BIMBOA.MX  2017q1 0.15 1.1         0.09

3 Data management for panel-data

For financial analysis, data management is a very important process. Most of the time, before designing and running a financial / econometric model, it is needed to do simple and sophisticated data management.

In this example we will learn how to collapse a dataset and merge one dataset with another considering the data structures we learned in the previous section.

We will merge 2 dataset: a time-series dataset with monthly historical data for the Mexican IPyC Market index, and a panel dataset with quarterly financial information for many Mexican public firms (A public firm is a company that issue shares in the market).

We can only merge datasets with the same granularity. In other words, we cannot merge a monthly with a quarterly dataset. In this case, the market monthly dataset must be aggregated to quarters, and then we can merge the resulting dataset with the other quarterly dataset.

We will work with an online panel data of Mexican firms. Download the dataset: http://www.apradie.com/datos/datamx2020q4.xlsx using the readxl package (you have to install this package). This dataset has real quarterly financial data of Mexican rms (from the BMV) for many years. This information comes from Economatica, a leading software and database economic and financial information. This dataset is a panel dataset, which is a combination of cross sectional and time-series dataset. Navigate through the data (using View() function) to learn more about this dataset.

# Download the excel file from a web site:
download.file("http://www.apradie.com/datos/datamx2020q4.xlsx",
              "firmspaneldata.xlsx", mode="wb")
# The first parameter is the link and the second is a name for the
#  local file

# Use the function read_excel()
paneldataset <- read_excel("firmspaneldata.xlsx")

# The quarter column has a format with the time zone. I create a new column with only the date:
paneldataset$q = as.Date(paneldataset$quarter)

3.1 Data calculations with panel data

We will use the dplyr package to easily do data selection and calculations for panel datasets.

This dataset has historical quarterly financial data of active and not active firms in the Bolsa Mexicana de Valores. We will keep ONLY active firms in the dataset from 2010. You can do this by typing:

activedata = paneldataset %>% 
  filter(status=="active", year>=2010)  

As you see, we start with the paneldataset, and then we do a sub-setting or selection of rows indicating specific conditions.

Let’s select a few important columns from this dataset to perform some basic financial ratio calculations.

We will use the dplyr package. This package is designed for data management of data frames and panel data. We link data management processes by the operand %>%.

activedata<-activedata %>%
    select(firm,q,year,revenue,cogs,ebit,totalassets,
           adjustedstockprice, naics1)

Here we start with the activedata dataset, and then select a few columns.

The dplyr package is very powerful for data management. Click HERE (https://datacarpentry.org/R-genomics/04-dplyr.html) a good summary of the main functionality of this package.

3.1.1 Calculating financial returns and ratios with panel data

As we did for time-series data, we can calculate returns with a panel dataset. We follow the same formula mentioned above, but we have to consider that we are using a panel-data structure:

activedata <-  activedata %>% 
   group_by(firm) %>% 
   arrange(firm, q) %>% 
   mutate(stockret = log(adjustedstockprice) - log(lag(adjustedstockprice)) )

We added the column r to the panel data. Since R knows that we have a panel data structure, it calculates returns correctly for each firm-quarter. When we have a change in firm, then returns do not have to be calculated. You can have a look to the data and check whether the first quarters for each firms have NA values.

3.1.2 Using conditionals to create columns

Let’s calculate Return on Assets using EBIT (Earnings before interest and taxes) instead of Net income:

In this case, if the denominator of the ratio is zero, we assign a null value (NA); otherwise, we calculate the ratio. If we do not validate for this conditional, R calculates a value Inf (Infinite) that causes problems when running models.

activedata <- activedata %>%
  mutate(ROA = ifelse(totalassets==0,NA,ebit / totalassets) )

Now we calculate Profit Margin (PM) and Asset Turn Over (ATO). We will use EBIT instead of Net Income to calculate PM:

activedata <- activedata %>% 
  mutate(PM = ifelse(revenue==0,NA,ebit / revenue ),
         ATO = ifelse(totalassets==0,NA,revenue / totalassets))

Here we use a conditional with the function ifelse. The first parameter of ifelse is a logical condition that can be true or false. If the condition is true, then the value assigned will be the second parameter; if it is false, the value assigned will be the third parameter.

The ifelse function is very useful when creating financial signals that compare financial values. Here is the general description of the ifelse function:

ifelse(data_condition, value_if_TRUE, value_if_FALSE)

For example:

a=5
b=10
ifelse(a>b,1,0)
[1] 0

Since a is NOT greater than b, then the value that was returned was zero.

We can use this function to create financial signals as new columns in a panel dataset or time-series dataset.

Let’s create a binary variable as a signal of positive ROA in a quarter compared with its ROA of one year ago (4 quarters ago). In other words, if the firm had a ROA in a specific quarter that is greater than its ROA 4 quarters ago, we will assign 1; 0 otherwise:

activedata <- activedata %>%
  
  mutate(ROAsignal = ifelse(ROA>lag(ROA,4),1,0))

4 CHALLENGE:

CREATE A COLUMN IN THE PANEL DATASET CALLED ROA_ABOVE THAT WILL BE ONE (=1) IF THE FIRM-QUARTER HAD A ROA GREATER THAN THE AVERAGE ROA OF ITS INDUSTRY IN THE CORRESPONDING QUARTER; AND ASSIGN A ZERO (=0) OTHERWISE.

HINT: Using dplyr is easier. You can ask Chat-GPT or Gemini to help you out, but try to understand the solution! If you use Gemini or ChatGPT, MENTION this in your workshop.

5 Descriptive statistics with panel data

Doing descriptive statistics with panel data is not quite equal as in the case of time-series structure. It is a good idea to do descriptive statistics by period of the panel data to avoid getting wrong summaries.

We can start by doing a descriptive statistics selecting only the last quarter of the data:

data2019q4 = activedata %>%
  filter(q=="2019-10-01") %>% 
  select(firm, revenue, ebit, totalassets, ROA, PM)

summary(data2019q4,na.rm=TRUE)
     firm              revenue                ebit          
 Length:152         Min.   :         0   Min.   : -6734642  
 Class :character   1st Qu.:   3257100   1st Qu.:   266741  
 Mode  :character   Median :  10634607   Median :  1673254  
                    Mean   :  50962874   Mean   :  6282411  
                    3rd Qu.:  35395790   3rd Qu.:  5079130  
                    Max.   :1007347869   Max.   :154840535  
                    NA's   :30           NA's   :25         
  totalassets              ROA                 PM          
 Min.   :     24541   Min.   :-0.37696   Min.   :-3.28603  
 1st Qu.:  10110939   1st Qu.: 0.03230   1st Qu.: 0.06562  
 Median :  31564719   Median : 0.06030   Median : 0.11673  
 Mean   : 103148171   Mean   : 0.06135   Mean   : 0.18612  
 3rd Qu.:  77058094   3rd Qu.: 0.08508   3rd Qu.: 0.26798  
 Max.   :1580009936   Max.   : 0.42413   Max.   : 0.98586  
 NA's   :8            NA's   :25         NA's   :32        

We can get descriptive statistics by groups.

In this case, we will select the last quarter of the dataset, group the firms by industry, and for each industry we get the median of the main variables:

industries<-unique(activedata$naics1)

activedata %>% 
   filter(q=='2019-10-01') %>%
  group_by(naics1) %>% 
  summarize(firms = n(),
            median_total_assets = median(totalassets, na.rm = TRUE),
            median_revenue = median(revenue, na.rm = TRUE),
            median_ebit = median(ebit, na.rm = TRUE))
# A tibble: 17 × 5
   naics1                   firms median_total_assets median_revenue median_ebit
   <chr>                    <int>               <dbl>          <dbl>       <dbl>
 1 Accommodation and Food …     4           13012570        2921780.     494324.
 2 Administrative and Supp…     2            6641031        7061152.     741712.
 3 Agriculture, Forestry, …     2           36248426.      38465597     2619456.
 4 Arts, Entertainment, an…     2            8075112.       6862020      845286.
 5 Construction                17           12334830.       4361809     1065160.
 6 Finance and Insurance       33           28222952         524279       78285 
 7 Health Care and Social …     1            5829852        3698068      358695 
 8 Information                 10           28681463       12814470     1673254 
 9 Management of Companies…     1          151432402      102477596    11453226 
10 Manufacturing               39           44292990       25936810     2989902 
11 Mining, Quarrying, and …     5           35489167.       8717697.     352864.
12 Professional, Scientifi…     1           19332591       12755893     2434974 
13 Public Administration        1           78995636       10762172     1841889 
14 Real Estate and Rental …     9           26925420        4098367     1789606.
15 Retail Trade                13           58341576       57586340.    5137620 
16 Transportation and Ware…    10           30341542       14191390.    3564913 
17 Utilities                    2          102864999.      16192952.    8991664.

Here we can see a good picture of the whole Mexican Financial Market. It is important to note that the best central tendency measure for any financial statement variable such as revenue, total assets is the MEDIAN since the distribution of these variables is always skewed to the right with very few big, big firms. The arithmetic mean of financial variables do not provide a good representation of the typical company in a market.

6 Datacamp online courses

You will receive an email invitation to register in datacamp.com. Datacamp is one of the best online learning sites for Data Science applied to Business and Finance.

You will receive free access for the whole semester! Accept the invitation to be registered in datacamp.

YUO MUST TAKE Chapter 1 : Transforming Data with dplyr from the course: Data Manipulation with dplyr

The following courses/chapters are OPTIONAL (but recommended to do during the first 2-3 weeks of the course):

Chapter 2 (Aggregating data) from the same course: Manipulation with dplyr

If you want learn more about getting financial data, you can check the course: Importing and Managing Financial Data in R

7 W1 submission

The grade of this Workshop will be the following:

  • Complete (100%): If you submit an ORIGINAL and COMPLETE HTML file with all the activities, with your notes, and with your OWN RESPONSES to questions

  • Incomplete (75%): If you submit an ORIGINAL HTML file with ALL the activities but you did NOT RESPOND to the questions and/or you did not do all activities and respond to some of the questions.

  • Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop or you completed more but parts of your work is a copy-paste from other workshops.

  • Not submitted (0%)

Remember that you have to submit your Google Colab LINK, and you have to SHARE it with me.