Abstract In this workshop we learn about data management for financial datasets: data collection, data cleaning, return calculation, data structures, and data merging. In addition, we start playing with a big panel-dataset of historical financial statement variables for many firms.

Data management for time-series financial variables

Data collection

We start by downloading online real stock prices from Yahoo Finance. We start clearing our R environment:

rm(list=ls())
# To avoid scientific notation for numbers: 
options(scipen=999)

The quantmod package

The quantmod package is designed to help financial traders in developing econometric-based trading models. It has several functions for financial data collection, data management and data visualization.

This package contains the getSymbols() function, which creates an xts (extensible time series) object in the environment with the downloaded data from the Internet.

Load the quantmod package

library(quantmod)

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: TTR

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Downloading real financial prices

The getSymbols() function enables its user to download online and up-to-date financial data, such as stock prices, ETF prices, interest rates, exchange rates, etc. getSymbols() allows to download this data from multiple sources: Yahoo Finance, Google Finance, FRED and Oanda. These sources have thousands of finance and economic data series from many market exchanges and other macroeconomic variables around the world.

Now, we will work with historical data of the Bitcoin cryptocurrency and the TESLA stock. We download daily quotations of these instruments from January 1, 2017 to Aug 31, 2021 from Yahoo Finance:

getSymbols(c("BTC-USD","TSLA"), from="2017-01-01", 
           to="2021-08-31", src="yahoo", periodicity="daily")

## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting 
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.

## Warning: BTC-USD contains missing values. Some functions will not work if
## objects contain missing values in the middle of the series. Consider using
## na.omit(), na.approx(), na.fill(), etc to remove or replace them.

## [1] "BTC-USD" "TSLA"

This function will create an xts-zoo R object for each ticker. Each object has the corresponding historical daily prices. xts stands for extensible time-series. An xts-zoo object is designed to easily manipulate time series data.

BTC-USD and TSLA are the ticker names in Yahoo Finance. The from argument is used to indicate the initial date from which you want to bring data. The to argument is the end date of the series you want to download. In this case we omit the to argument in order to download the most recent data. The src argument indicates the source of the data, in this case it is Yahoo Finance. Finally, the periodicity argument specifies the granularity of the data (daily, weekly, monthly, quarterly).

Show the content of datasets

We can view the content of an R object just by typing the name of the object in the R code. If the dataset is long we can view the first rows or the last rows of the dataset by using the following functions:

head(`BTC-USD`,n=5)

##            BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
## 2017-01-01      963.658      1003.08     958.699       998.325      147775008
## 2017-01-02      998.617      1031.39     996.702      1021.750      222184992
## 2017-01-03     1021.600      1044.08    1021.600      1043.840      185168000
## 2017-01-04     1044.400      1159.42    1044.400      1154.730      344945984
## 2017-01-05     1156.730      1191.10     910.417      1013.380      510199008
##            BTC-USD.Adjusted
## 2017-01-01          998.325
## 2017-01-02         1021.750
## 2017-01-03         1043.840
## 2017-01-04         1154.730
## 2017-01-05         1013.380

When tickers have special characters, we have to make reference to the object with simple quotes (``). Also, you can list the LAST 5 rows of the dataset. Note that you can change number of rows you want to display.

tail(`BTC-USD`, n=10)

##            BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
## 2021-08-22     48869.11     49471.61    48199.94      49321.65    25370975378
## 2021-08-23     49291.68     50482.08    49074.61      49546.15    34305053719
## 2021-08-24     49562.35     49878.77    47687.12      47706.12    35361168834
## 2021-08-25     47727.26     49202.88    47163.61      48960.79    32646349931
## 2021-08-26     49002.64     49347.58    46405.78      46942.22    32666549568
## 2021-08-27     46894.55     49112.79    46394.28      49058.67    34511076995
## 2021-08-28     49072.59     49283.50    48499.24      48902.40    28568103401
## 2021-08-29     48911.25     49644.11    47925.86      48829.83    25889650240
## 2021-08-30     48834.85     48925.61    46950.27      47054.98    31847007016
## 2021-08-31     47024.34     48189.55    46750.09      47166.69    34730363427
##            BTC-USD.Adjusted
## 2021-08-22         49321.65
## 2021-08-23         49546.15
## 2021-08-24         47706.12
## 2021-08-25         48960.79
## 2021-08-26         46942.22
## 2021-08-27         49058.67
## 2021-08-28         48902.40
## 2021-08-29         48829.83
## 2021-08-30         47054.98
## 2021-08-31         47166.69

The xts datasets created with the getSymbols function have different columns for open, low, high, close, adjusted and volume data. These datasets are also called OHLC datasets.

The adjusted prices are used for stocks, not for currencies. Adjusted prices is an adjustment of the closing prices after considering dividend payments and stock splits. Then, for the Bitcoin series we can use close or adjusted price to calculate returns.

Data selection

Column selection

We can easily select either columns or rows of a dataset. For example, if we want to select the adjusted price column of the TSLA xts dataset, we use the $ and the name of the column, and then assign it to any new name:

adjTSLA = TSLA$TSLA.Adjusted
head(adjTSLA)

##            TSLA.Adjusted
## 2017-01-03        43.398
## 2017-01-04        45.398
## 2017-01-05        45.350
## 2017-01-06        45.802
## 2017-01-09        46.256
## 2017-01-10        45.974

Another way to select a column is by the # of the column(s):

#Selecting the adjusted price, which is the column #6: 
adjTSLA = TSLA[, 6]

Note that the notation to make reference to column 6 and ALL rows of TSLA is: TSLA[ , 6]

In this case, I want to keep ALL rows, so we leave empty the # of rows, and indicate that we want only column # 6.

R datasets have 2 dimensions: [#rows, #columns]. We can show the dimensions of any R dataset:

dim(TSLA)

## [1] 1173    6

We see that the # of rows of TSLA dataset is 1173, and the # of columns is 6.

We can make reference to any subset of the dataset using this notation of [#rows, #columns]

Row selection

We can also select specific rows of a dataset following the same notation of [#rows, #columns]. If I want to select only the first rows of TSLA dataset, I can do the following:

TSLAfirstdays = TSLA[1:10,]
TSLAfirstdays

##            TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
## 2017-01-03    42.972    44.066   42.192     43.398    29616500        43.398
## 2017-01-04    42.950    45.600   42.862     45.398    56067500        45.398
## 2017-01-05    45.284    45.496   44.390     45.350    29558500        45.350
## 2017-01-06    45.386    46.062   45.090     45.802    27639500        45.802
## 2017-01-09    45.794    46.384   45.600     46.256    19897500        46.256
## 2017-01-10    46.400    46.400   45.378     45.974    18300000        45.974
## 2017-01-11    45.814    45.996   45.336     45.946    18254000        45.946
## 2017-01-12    45.812    46.140   45.116     45.918    18951000        45.918
## 2017-01-13    46.000    47.570   45.918     47.550    30465000        47.550
## 2017-01-17    47.340    47.992   46.874     47.116    23087500        47.116

We used the notation 1:10 in the row specification, indicating that we want the sequence from 1 to 10.

We can also select specific rows and columns. For example, if I only want to select the first 10 days of the adjusted prices of TESLA:

TSLA_adjusted_first_prices = TSLA[1:10,6]
head(TSLA_adjusted_first_prices)

##            TSLA.Adjusted
## 2017-01-03        43.398
## 2017-01-04        45.398
## 2017-01-05        45.350
## 2017-01-06        45.802
## 2017-01-09        46.256
## 2017-01-10        45.974

If we want to select specific columns, we can use the container c() vector to indicate which column # we want to select. For example, if for some reason we want to select the Open and Close columns (columns #1 and #4), and select the first 10 days of TESLA prices:

TSLA_Open_Close = TSLA[1:10,c(1,4)]
head(TSLA_Open_Close)

##            TSLA.Open TSLA.Close
## 2017-01-03    42.972     43.398
## 2017-01-04    42.950     45.398
## 2017-01-05    45.284     45.350
## 2017-01-06    45.386     45.802
## 2017-01-09    45.794     46.256
## 2017-01-10    46.400     45.974

For xts datasets, we can also select rows using the date index. Each xts dataset has a time index that we can use. For example, if we want to select all TESLA prices of only year 2020 we can do the following:

TESLA_2020 = TSLA["2020-01-01/2020-12-31", ]
head(TESLA_2020)

##            TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
## 2020-01-02    84.900    86.140   84.342     86.052    47660500        86.052
## 2020-01-03    88.100    90.800   87.384     88.602    88892500        88.602
## 2020-01-06    88.094    90.312   88.000     90.308    50665000        90.308
## 2020-01-07    92.280    94.326   90.672     93.812    89410500        93.812
## 2020-01-08    94.740    99.698   93.646     98.428   155721500        98.428
## 2020-01-09    99.420    99.760   94.574     96.268   142202000        96.268

We can also combine column selection with row selection. If we want the same date selection, but only adjusted prices:

TESLA_2020_adjusted = TSLA["2020-01-01/2020-12-31", "TSLA.Adjusted"]

In this case, to select the column I used the column name instead of the # of the column.

Data Merging:

We can use the merge() function to join two or more xts time-series datasets into an integrated dataset:

adjprices <- merge(`BTC-USD`$`BTC-USD.Adjusted`, TSLA$TSLA.Adjusted)
# I can select only the adjusted prices in order to calculate returns:

The quantmod has the Ad function to get Adjusted prices. We can do the same we did above using this function:

adjprices <- merge(Ad(`BTC-USD`), Ad(TSLA$TSLA.Adjusted))

Now we have an xts-zoo objects with the 2 adjusted prices. We can change the names of the columns:

names(adjprices)<-c("bitcoin","tesla")

Now we can make reference to the adjusted prices using these names.

Data cleaning

In Finance, when managing daily data it is very common to have gaps in the series. What does this mean? It means that the contains some null values for some days. For example, for stock series there is no data for weekends or holidays:

head(adjprices,n=10)

##             bitcoin  tesla
## 2017-01-01  998.325     NA
## 2017-01-02 1021.750     NA
## 2017-01-03 1043.840 43.398
## 2017-01-04 1154.730 45.398
## 2017-01-05 1013.380 45.350
## 2017-01-06  902.201 45.802
## 2017-01-07  908.585     NA
## 2017-01-08  911.199     NA
## 2017-01-09  902.828 46.256
## 2017-01-10  907.679 45.974

In this case, Bitcoin has price data for any day including weekends and holidays. When we did the merge, the TESLA column will have NA values for these days. If we keep this dataset as it is, when calculating daily returns for TESLA, it will not be possible to calculate returns for Modays since there is no price value for Sunday. An easy way to deal with NA values is to delete the rows that have any NA value in any column. We can do this with the na.omit function:

However, R deals with gaps because it recognizes that we are working with a time series object. It is a good idea to have a data set free of NA’s. So, I can use the function na.omit:

adjprices <- na.omit(adjprices)

Now the holidays and weekends were deleted:

head(adjprices,n=6)

##             bitcoin  tesla
## 2017-01-03 1043.840 43.398
## 2017-01-04 1154.730 45.398
## 2017-01-05 1013.380 45.350
## 2017-01-06  902.201 45.802
## 2017-01-09  902.828 46.256
## 2017-01-10  907.679 45.974

Visualization of prices

Visualize how Bitcoin has been valued over time:

plot(`BTC-USD`)

We can do a better visualization if we use the chartSeries function from the quantmod package:

chartSeries(`BTC-USD`, theme=("white"))

We can also do a plot of specific periods. For example, we can see how Bitcoin price behaved only in 2020:

chartSeries(`BTC-USD`, subset = '2020-01-01/2020-12-31')

Financial returns

A financial simple return for a stock (Rt) is calculated as a percentage change of price from the previous period (t-1) to the present period (t):

\[R_{t}=\frac{Adjprice_{t}}{Adjprice_{t-1}}-1\] We can use returns in decimal or in percentage (multiplying by 100). We will keep using decimals.

In Finance it is very recommended to calculate continuously compounded returns (cc returns) and using cc returns instead of simple returns for data analysis, statistics and econometric models. cc returns are also called log returns.

One way to calculate cc returns is by subtracting the log of the current adjusted price (at t) minus the log of the previous adjusted price (at t-1):

\[r_{t}=log(Adjprice_{t})-log(Adjprice_{t-1})\] This is also called as the difference of the log of the price.

Financial return calculation for time-series

Simple returns

We must use adjusted stock prices to calculate financial returns. To calculate a lagged (past) value of price of a time-series variable we can use the function lag from the stats package.

R = adjprices / stats::lag(adjprices,n=1) - 1 
head(R)

##                  bitcoin        tesla
## 2017-01-03            NA           NA
## 2017-01-04  0.1062327729  0.046085074
## 2017-01-05 -0.1224095481 -0.001057337
## 2017-01-06 -0.1097110812  0.009966924
## 2017-01-09  0.0006949826  0.009912297
## 2017-01-10  0.0053731309 -0.006096550

Since adjprices has 2 columns, then R calculates the daily simple returns for all columns of the dataset.

Note that we use the name stats:: before the lag function. We did this since there are other R packages that have a function called lag, so R can be confused if we do not specify which package we are using for the lag function.

Continuously compounded returns

Now we calculate the daily continuously compounded returns using adjusted prices. We use the diff and log functions. The diff function calculates the difference between the value of a time-series variable and its past value:

r = diff(log(adjprices))
head(r)

##                  bitcoin        tesla
## 2017-01-03            NA           NA
## 2017-01-04  0.1009603447  0.045054695
## 2017-01-05 -0.1305752498 -0.001057897
## 2017-01-06 -0.1162092411  0.009917582
## 2017-01-09  0.0006947412  0.009863492
## 2017-01-10  0.0053587471 -0.006115209

Remember that the first difference of log prices is actually the continuously compounded returns of the period.

We can visualize both daily returns over time for Tesla:

plot(R$tesla)

We can observe increasing volatility in 2018 and strong volatility in the pandemic months. Volatility can be calculated as the standard deviation of returns.

Descriptive statistics with time-series

We will use the table.Stats from the PerformanceAnalytics package. You have to install this package by clicking the Packages tab in the right-bottom windows of RStudio, then click Install and type PerformanceAnalytics.

We load the package and get the main descriptive statistics of both, Tesla and Bitcoin daily returns:

library(PerformanceAnalytics)

## 
## Attaching package: 'PerformanceAnalytics'

## The following object is masked from 'package:graphics':
## 
##     legend

table1 <- table.Stats(R)
table1

We can also get specific descriptive statistics using specific functions such as mean and sd:

mean_tesla_R = mean(R$tesla, na.rm=TRUE)
median_tesla_R = median(R$tesla, na.rm=TRUE)
sd_tesla_R = sd(R$tesla, na.rm=TRUE)

cat("Tesla daily mean return is ",mean_tesla_R, "\n")

## Tesla daily mean return is  0.003155467

cat("Tesla daily median return is ",median_tesla_R, "\n")

## Tesla daily median return is  0.001496471

cat("Tesla daily volatility is ", sd_tesla_R)

## Tesla daily volatility is  0.03845195

When columns have NA values, some descriptive statistics functions cannot be calculated, unless we use the option na.rm=TRUE (means remove the NA values before calculation)

When median and mean are very different, this is a sign of a non-normality in the distribution of the variable. Here we see that Tesla median return is much less than its mean (0.14% vs 0.31%). The median is a much better measure of central tendency in Financial returns.

If Kurtosis of the variable is greater than 3, then the variable has more extreme values than a normal distributed variable. In this case, we see that both returns have Kurtosis much greater than 3.

Financial returns usually have more extreme values than normal distributed variables, so traditional standard deviation might not provide a complete view of the dispersion of returns. Looking at quartiles provide us a better perception about the volatility of financial returns. The box-plot gives us a good perception of volatility, mean and median of returns:

chart.Boxplot(R)

It is easy to see that Bitcoin is riskier than Tesla since we have a wider range of returns in both, negative and positive returns. The red circles show the mean, the mid line is the median (50 percentile), the boxes include the 50% of the data from the Quartile 1 or Q1 (25 percentile) to the Q3 (75 percentile). The vertical lines limit non-extreme values. The dots are considered extreme values in the context of its own distribution.

Visualizing holing return over time

We can use the charts.PerformanceSummary of returns to visualize how much we would made over time if we had invested $1.00 and hold it in the whole period of the data:

charts.PerformanceSummary(R$tesla, 
                          main = "Performance of $1.00 Tesla",
                          wealth.index = TRUE)

We see that if we had invested in Tesla $1.00 in Jan 1, 2017, today we would have more than 15 times my money at the end of the period!

We can calculate the exact holding period return by getting the percentage growth of the series from the first adjusted price up to the last adjusted price:

hpr_tesla = as.numeric(adjprices$tesla[nrow(adjprices)]) / as.numeric(adjprices$tesla[1]) - 1
hpr_tesla

## [1] 15.84202

We can do the same for Bitcoin:

charts.PerformanceSummary(R$bitcoin, 
                          main = "Performance of $1.00 in Bitcoin",
                          wealth.index = TRUE)

With financial xts datasets, it is easy to get aggregation of financial prices. We can use the to.weekly, to.monthly or to.quarterly functions from the quantmod package:

adjprices_monthlyTSLA = to.monthly(adjprices$tesla)
adjprices_monthlyBTC = to.monthly(adjprices$bitcoin)

We can also use the Return.Calculate function from PerformanceAnalytics to easily calulate returns from prices:

R_monthly_TSLA = Return.calculate(adjprices_monthlyTSLA$`adjprices$tesla.Close`)
R_monthly_TSLA = na.omit(R_monthly_TSLA)
r_monthly_TSLA = Return.calculate(adjprices_monthlyTSLA$`adjprices$tesla.Close`,method = "log")
r_monthly_TSLA = na.omit(r_monthly_TSLA)

R_monthly_BTC = Return.calculate(adjprices_monthlyBTC$`adjprices$bitcoin.Close`)
R_monthly_BTC = na.omit(R_monthly_BTC)
r_monthly_BTC = Return.calculate(adjprices_monthlyBTC$`adjprices$bitcoin.Close`,method = "log")
r_monthly_BTC = na.omit(r_monthly_BTC)

Now we can visualize monthly risk of both instruments:

table.Stats(R_monthly_TSLA)

table.Stats(R_monthly_BTC)

chart.Boxplot(R_monthly_BTC)

chart.Boxplot(R_monthly_TSLA)

Data structures

In Finance there are basically the following dataset structures or data types:

Time-series: in this structure you can have many periods, and information for one “subject” (subject can be company, index, industry, etc). Also, you can have more than one subject, but the information is placed as new column for each subject. For example: the dataset created after running getsymbols.
Cross-sectional structure: in this structure, you usually have many “subjects”, for ONLY ONE period For example, key performance indicators of Mexican companies for the last quarter of 2016.
Panel-data structure: it is combination of time-series and cross-sectional. In this structure, we have more than one subject, and for each subject we can have more than one period.

Data management for panel-data

For financial analysis, data management is a very important process. Most of the time, before designing and running a financial / econometric model, it is needed to do simple and sophisticated data management.

In this example we will learn how to collapse a dataset and merge one dataset with another considering the data structures we learned in the previous section.

We will merge 2 dataset: a time-series dataset with monthly historical data for the Mexican IPyC Market index, and a panel dataset with quarterly financial information for many Mexican public firms (A public firm is a company that issue shares in the market).

We can only merge datasets with the same granularity. In other words, we cannot merge a monthly with a quarterly dataset. In this case, the market monthly dataset must be aggregated to quarters, and then we can merge the resulting dataset with the other quarterly dataset.

We will work with an online panel data of Mexican firms. Download the dataset: http://www.apradie.com/datos/datamx2020q4.xlsx using the readxl package (you have to install this package). This dataset has real quarterly financial data of Mexican rms (from the BMV) for many years. This information comes from Economatica, a leading software and database economic and financial information. This dataset is a panel dataset, which is a combination of cross sectional and time-series dataset. Navigate through the data (using View() function) to learn more about this dataset.

# Load the package
library(readxl)
# Download the excel file from a web site:
download.file("http://www.apradie.com/datos/datamx2020q4.xlsx",
              "firmspaneldata.xlsx", mode="wb")
# The first parameter is the link and the second is a name for the
#  local file

# Use the function read_excel()
paneldataset <- read_excel("firmspaneldata.xlsx")

We need to merge the market monthly return to this panel dataset.

Then, we need to:

Download the monthly market index from Yahoo Finance (getsymbols)
Transform (Collapse) the dataset from monthly to quarterly
Merge the resulting quarterly market data with the panel data of public firms.

We download the ipyc data for the same perids as the firmpaneldata:

library(quantmod)
getSymbols("^MXX", from="2000-01-01", to= "2019-12-31",
            periodicity="monthly", src="yahoo")

## [1] "^MXX"

We need to aggregate (collapse) the dataset from monthly to quarterly, keeping the LAST market index per quarter. We can use the to.quarterly function from quantmod:

QMXX <- to.quarterly(MXX,indexAt='startof')

This function creates an OHLC xts dataset for each quarter. We can have a look of the content:

head(QMXX)

##            MXX.Open MXX.High MXX.Low MXX.Close MXX.Volume MXX.Adjusted
## 2000-01-01  7185.71  8417.33 6510.84   7473.25          0      7473.25
## 2000-04-01  7459.75  7639.38 5516.77   6948.33          0      6948.33
## 2000-07-01  6953.73  7548.57 6171.95   6334.56          0      6334.56
## 2000-10-01  6334.64  6502.60 5148.02   5652.19          0      5652.19
## 2001-01-01  5651.35  6615.83 5512.77   5727.89          0      5727.89
## 2001-04-01  5733.67  6895.70 5541.68   6666.17 4888424600      6666.17

We see that the Open, High, Low, Close and Adjusted prices for each quarter were calculated. We only need the Adjusted price to calculate the market returns, so we select only the Close column:

QMXX = QMXX$MXX.Adjusted
# Change the name of the column:
colnames(QMXX) <- "MXXindex"

Now we can calculate cc return of the market

QMXX$MXXreturns <- diff(log(QMXX))

Now we are almost ready to merge this quarterly data to the panel dataset using merge.

We need to have both datasets with a common column. In this case, the common column is quarter. Both datasets must be of the same class, in this case, data frame class.

The QMXX has the quarter as index, but not as part of a column. We create a data frame with the quarter column equal to its index:

# Create a dataframe with a quarter column for the QMXX
QMXX.df<-data.frame(quarter=index(QMXX),coredata(QMXX))
# I extracted the quarter from the index
# coredata is a function that extract the data of an object
#   without extracting formatting

The common column must be of the same type, so we make sure that the column quarter of dataset is a Date type:

paneldataset$quarter<-as.Date(paneldataset$quarter)

Now we can do a MANY-TO-ONE merge of both datasets:

paneldataset<-merge(paneldataset,QMXX.df,by="quarter")

# Now I have a dataset for all firms-quarters, and the
# MXX index and MXX return is merged for each firm

Setting a panel data structure

We need to tell R that the dataset is panel data with the structure firm-quarter.

We define the dataset as a panel data using the pdata.frame function. You need to install the package “plm”.

We indicate the index of the panel data as follows: firmcode will be the column for the subjects and quarter column for the time:

library(plm)
paneldataset <- pdata.frame(paneldataset, index= c("firmcode","quarter"))

Data calculations with panel data

We will use the dplyr package to easily do data selection and calculations for panel datasets.

You have to install this package by clicking the Packages tab in the right-bottom windows of RStudio, then click Install and type dplyr.

Now we load the dplyr package:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plm':
## 
##     between, lag, lead

## The following objects are masked from 'package:xts':
## 
##     first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

This dataset has historical quarterly financial data of active and not active firms in the Bolsa Mexicana de Valores. We will keep ONLY active firms in the dataset from 2010. You can do this by typing:

activedata<-paneldataset[paneldataset$status=="active",]
activedata<-activedata[activedata$year>=2010,]

As you see, we start with the paneldataset, and then we do a sub-setting or selection of rows indicating specific conditions.

Let’s select a few important columns from this dataset to perform some basic financial ratio calculations.

We will use the dplyr package. This package is designed for data management of data frames and panel data. We link data management processes by the operand %>%.

activedata<-activedata %>%
    select(firm,quarter,year,revenue,cogs,ebit,totalassets,
           adjustedstockprice, naics1)

Calculating financial returns and ratios with panel data

As we did for time-series data, we can calculate returns with a panel dataset. We follow the same formula mentioned above, but now we use the panel dataset:

activedata$r = diff(log(activedata$adjustedstockprice))

We added the column r to the panel data. Since R knows that we have a panel data structure, it calculates returns correctly for each firm-quarter. When we have a change in firm, then returns do not have to be calculated. You can have a look to the data and check whether the first quarters for each firms have NA values.

Let’s calculate Return on Assets using EBIT (Earnings before interest and taxes) instead of Net income:

activedata$ROA = activedata$ebit / activedata$totalassets

Now we calculate Profit Margin (PM) and Asset Turn Over (ATO). We will use EBIT instead of Net Income to calculate PM:

activedata$PM = activedata$ebit / activedata$revenue
activedata$ATO = activedata$revenue / activedata$totalassets

Using conditionals to create columns

Let’s create a binary variable as a signal of positive ROA in a quarter compared with its ROA of one year ago (4 quarters ago). In other words, if the firm had a ROA in a specific quarter that is greater than its ROA 4 quarters ago, we will assign 1; 0 otherwise:

activedata$ROAsignal = ifelse(activedata$ROA>
                    plm::lag(activedata$ROA, 4),1,0)

Here se use a conditional with the function ifelse. The first parameter of ifelse is a logical condition that can be true or false. If the condition is true, then the value assigned will be the second parameter; if it is false, the value assigned will be the third parameter.

The ifelse function is very useful when creating financial signals that compare financial values. Here is the general description of the ifelse function:

ifelse(data_condition, value_if_TRUE, value_if_FALSE)

For example:

a=5
b=10
ifelse(a>b,1,0)

## [1] 0

Since a is NOT greater than b, then the value that was returned was zero.

We can use this function to create financial signals as new columns in a panel dataset or time-series dataset.

Descriptive statistics with panel data

Doing descriptive statistics with panel data is not quite equal as in the case of time-series structure. It is a good idea to do descriptive statistics by period of the panel data to avoid getting wrong summaries.

We can start by doing a descriptive statistics selecting only the last quarter of the data:

data2019q4 = activedata[activedata$quarter=='2019-10-01',]
table.Stats(data2019q4[,c(4:7,10)])

## Warning in if (class(x) == "matrix" | class(x) == "data.frame") {: the condition
## has length > 1 and only the first element will be used

## Warning in if (class(x) == "numeric") {: the condition has length > 1 and only
## the first element will be used

## Warning in log(1 + x): NaNs produced

Here we will use the dplyr package to select, group and get descriptive statistics by groups.

In this case, we will select the last quarter of the dataset, group the firms by industry, and for each industry we get the median of the main variables:

industries<-unique(activedata$naics1)
activedata %>% 
   filter(quarter=='2019-10-01') %>%
  group_by(naics1) %>% 
  summarize(firms = n(),
            median_total_assets = median(totalassets, na.rm = TRUE),
            median_revenue = median(revenue, na.rm = TRUE),
            median_ebit = median(ebit, na.rm = TRUE))

Here we can see a good picture of the whole Mexican Financial Market. It is important to note that the best central tendency measure for any financial statement variable such as revenue, total assets is the median since the distribution of these variables is always skewed to the right with very few big, big firms. The arithmetic mean of financial variables do not provide a good representation of the typical company in a market.

LOOKING AT THE 2 TABLES ABOVE, PROVIDE A GOOD DESCRIPTION OF A TYPICAL (AVERAGE) FIRM IN MEXICO, AND A DESCRIPTION OF THE MEXICAN MARKET IN TERMS OF FIRMS BY INDUSTRY AND FIRM SIZE FOR EACH INDUSTRY.

TRAS HABER EMPLEADO LOS CALCULOS NECESARIOS PARA EVALUAR A LAS EMPRESAS MEXICANAS CON SOPORTE ESTADÍSTICO, ENCONTRAMOS LOS SIGUIENTES HALLAZGOS DE VALOR:

· DE LAS 122 EMPRESAS OBSERVADAS, SE REPORTA UNA MEDIANA DE INGRESOS POR $10,634,607 MXN. ES DECIR, EL INGRESO TÍPICO DE UNA EMPRESA REGISTRADA EN LA BMV ES DE ESA CIFRA.

· DE LAS 122 EMPRESAS OBSERVADAS, UNA EMPRESA REPORTÓ UN EBIT DE -$6,734,642 MXN. HABRÍA QUE DETERMINAR QUÉ EMPRESA ES PARA NO PONER NI POR ERROR UN PESO SOBRE ESTA.

· EN PROMEDIO, LAS EMPRESAS MEXICANAS REPORTAN INGRESOS POR $50,962,873 MXN.

· LA EMPRESA CON MAYOR INGRESO FUE POR $1,007,347,869 MXN.

AHORA BIEN, CON RELACIÓN AL ANÁLISIS SECTORIAL, PODEMOS INTERPRETAR LO SIGUIENTE:

· DENTRO DE LA CARTERA DE EMPRESAS REGISTRADAS EN LA BMV, 39 PERTENECEN A LA INDUSTRIA DE MANUFACTURA, 33 PERTENECEN A LA INDUSTRIA FINANCIERA Y SE SEGUROS, Y POR ÚLTIMO, 17 PERTENECEN A EMPRESAS DE CONSTRUCCIÓN.

· LAS INDUSTRIAS MÁS REPRESENTATIVAS DE MÉXICO, EN MATERIA DE INGRESOS, SON: Management of Companies and Enterprises (102,477,596), Agriculture, Forestry, Fishing and Hunting (38,465,597), Manufacturing (25,936,810), Y POR ÚLTIMO, Information (12,814,470).

· LAS INDUSTRIAS CON MENOS ACTIVOS BAJO GESTIÓN, SON: Health Care and Social Assistance, Administrative and Support and Waste Management and Remediation Services Y Arts, Entertainment, and Recreation.

Workshop 1 - Algorithms and data analysis

Rafael Romo - A01701493