Abstract In this workshop we learn about data management for financial datasets: data collection, data cleaning, return calculation, data structures, and data merging. In addition, we start playing with a big panel-dataset of historical financial statement variables for many firms.
We start by downloading real stock price data from Yahoo Finance. First, we clear our R environment:
rm(list=ls())
# To avoid scientific notation for numbers:
options(scipen=999)
The quantmod package is designed to help financial traders in developing econometric-based trading models. It has several functions for financial data collection, data management and data visualization.
This package contains the getSymbols() function, which creates an xts (extensible time series) object in the environment with the downloaded data from the Internet.
library(quantmod)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
The getSymbols() function allows you to download online and up-to-date financial data, such as stock prices, ETF prices, interest rates, exchange rates, etc. getSymbols() can pull this data from multiple sources: Yahoo Finance, Google Finance, FRED and Oanda. These sources have thousands of finance and economic data series from many market exchanges and other macroeconomic variables around the world.
Now we will work with historical data for the Bitcoin cryptocurrency and the Tesla stock. We download daily quotations of both instruments from January 1, 2017 to August 31, 2021 from Yahoo Finance:
getSymbols(c("BTC-USD","TSLA"), from="2017-01-01",
to="2021-08-31", src="yahoo", periodicity="daily")
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## Warning: BTC-USD contains missing values. Some functions will not work if
## objects contain missing values in the middle of the series. Consider using
## na.omit(), na.approx(), na.fill(), etc to remove or replace them.
## [1] "BTC-USD" "TSLA"
This function will create an xts-zoo R object for each ticker. Each object has the corresponding historical daily prices. xts stands for extensible time-series. An xts-zoo object is designed to easily manipulate time series data.
BTC-USD and TSLA are the ticker names in Yahoo Finance. The from argument indicates the initial date from which you want to bring data, and the to argument is the end date of the series you want to download. If you omit the to argument, getSymbols() downloads data up to the most recent date available. The src argument indicates the source of the data, in this case Yahoo Finance. Finally, the periodicity argument specifies the granularity of the data (daily, weekly, monthly, quarterly).
We can view the content of an R object just by typing the name of the object in the R code. If the dataset is long we can view the first rows or the last rows of the dataset by using the following functions:
head(`BTC-USD`,n=5)
## BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
## 2017-01-01 963.658 1003.08 958.699 998.325 147775008
## 2017-01-02 998.617 1031.39 996.702 1021.750 222184992
## 2017-01-03 1021.600 1044.08 1021.600 1043.840 185168000
## 2017-01-04 1044.400 1159.42 1044.400 1154.730 344945984
## 2017-01-05 1156.730 1191.10 910.417 1013.380 510199008
## BTC-USD.Adjusted
## 2017-01-01 998.325
## 2017-01-02 1021.750
## 2017-01-03 1043.840
## 2017-01-04 1154.730
## 2017-01-05 1013.380
When tickers have special characters, we have to refer to the object using backticks (`). We can also list the LAST rows of the dataset with the tail() function; note that you can change the number of rows you want to display.
tail(`BTC-USD`, n=10)
## BTC-USD.Open BTC-USD.High BTC-USD.Low BTC-USD.Close BTC-USD.Volume
## 2021-08-22 48869.11 49471.61 48199.94 49321.65 25370975378
## 2021-08-23 49291.68 50482.08 49074.61 49546.15 34305053719
## 2021-08-24 49562.35 49878.77 47687.12 47706.12 35361168834
## 2021-08-25 47727.26 49202.88 47163.61 48960.79 32646349931
## 2021-08-26 49002.64 49347.58 46405.78 46942.22 32666549568
## 2021-08-27 46894.55 49112.79 46394.28 49058.67 34511076995
## 2021-08-28 49072.59 49283.50 48499.24 48902.40 28568103401
## 2021-08-29 48911.25 49644.11 47925.86 48829.83 25889650240
## 2021-08-30 48834.85 48925.61 46950.27 47054.98 31847007016
## 2021-08-31 47024.34 48189.55 46750.09 47166.69 34730363427
## BTC-USD.Adjusted
## 2021-08-22 49321.65
## 2021-08-23 49546.15
## 2021-08-24 47706.12
## 2021-08-25 48960.79
## 2021-08-26 46942.22
## 2021-08-27 49058.67
## 2021-08-28 48902.40
## 2021-08-29 48829.83
## 2021-08-30 47054.98
## 2021-08-31 47166.69
The xts datasets created with the getSymbols() function have separate columns for the open, high, low, close, volume and adjusted data. These datasets are also called OHLC datasets.
The adjusted price is an adjustment of the closing price that accounts for dividend payments and stock splits. It is relevant for stocks, not for currencies, so for the Bitcoin series we can use either the close or the adjusted price to calculate returns.
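As a quick check (a minimal sketch, assuming the `BTC-USD` object downloaded above), we can use the quantmod helper functions Cl() and Ad(), which extract the close and adjusted columns of an OHLC object, and confirm that both columns are identical for Bitcoin:
# Cl() extracts the close column and Ad() the adjusted column of an OHLC object
btc_close <- Cl(`BTC-USD`)
btc_adjusted <- Ad(`BTC-USD`)
# For a currency like Bitcoin both columns contain the same values:
head(merge(btc_close, btc_adjusted))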
We can easily select either columns or rows of a dataset. For example, if we want to select the adjusted price column of the TSLA xts dataset, we use the $ and the name of the column, and then assign it to any new name:
adjTSLA = TSLA$TSLA.Adjusted
head(adjTSLA)
## TSLA.Adjusted
## 2017-01-03 43.398
## 2017-01-04 45.398
## 2017-01-05 45.350
## 2017-01-06 45.802
## 2017-01-09 46.256
## 2017-01-10 45.974
Another way to select a column is by the # of the column(s):
#Selecting the adjusted price, which is the column #6:
adjTSLA = TSLA[, 6]
Note that the notation to refer to column 6 and ALL rows of TSLA is: TSLA[ , 6]
In this case we want to keep ALL rows, so we leave the row specification empty and indicate that we want only column # 6.
R datasets have 2 dimensions: [#rows, #columns]. We can show the dimensions of any R dataset:
dim(TSLA)
## [1] 1173 6
We see that the # of rows of TSLA dataset is 1173, and the # of columns is 6.
We can make reference to any subset of the dataset using this notation of [#rows, #columns]
We can also select specific rows of a dataset following the same notation of [#rows, #columns]. If I want to select only the first rows of TSLA dataset, I can do the following:
TSLAfirstdays = TSLA[1:10,]
TSLAfirstdays
## TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
## 2017-01-03 42.972 44.066 42.192 43.398 29616500 43.398
## 2017-01-04 42.950 45.600 42.862 45.398 56067500 45.398
## 2017-01-05 45.284 45.496 44.390 45.350 29558500 45.350
## 2017-01-06 45.386 46.062 45.090 45.802 27639500 45.802
## 2017-01-09 45.794 46.384 45.600 46.256 19897500 46.256
## 2017-01-10 46.400 46.400 45.378 45.974 18300000 45.974
## 2017-01-11 45.814 45.996 45.336 45.946 18254000 45.946
## 2017-01-12 45.812 46.140 45.116 45.918 18951000 45.918
## 2017-01-13 46.000 47.570 45.918 47.550 30465000 47.550
## 2017-01-17 47.340 47.992 46.874 47.116 23087500 47.116
We used the notation 1:10 in the row specification, indicating that we want the sequence from 1 to 10.
We can also select specific rows and columns. For example, if I only want to select the first 10 days of the adjusted prices of TESLA:
TSLA_adjusted_first_prices = TSLA[1:10,6]
head(TSLA_adjusted_first_prices)
## TSLA.Adjusted
## 2017-01-03 43.398
## 2017-01-04 45.398
## 2017-01-05 45.350
## 2017-01-06 45.802
## 2017-01-09 46.256
## 2017-01-10 45.974
If we want to select specific columns, we can use the c() function to build a vector with the column numbers we want. For example, if for some reason we want to select the Open and Close columns (columns #1 and #4), and keep the first 10 days of Tesla prices:
TSLA_Open_Close = TSLA[1:10,c(1,4)]
head(TSLA_Open_Close)
## TSLA.Open TSLA.Close
## 2017-01-03 42.972 43.398
## 2017-01-04 42.950 45.398
## 2017-01-05 45.284 45.350
## 2017-01-06 45.386 45.802
## 2017-01-09 45.794 46.256
## 2017-01-10 46.400 45.974
For xts datasets, we can also select rows using the date index. Each xts dataset has a time index that we can use. For example, if we want to select all TESLA prices of only year 2020 we can do the following:
TESLA_2020 = TSLA["2020-01-01/2020-12-31", ]
head(TESLA_2020)
## TSLA.Open TSLA.High TSLA.Low TSLA.Close TSLA.Volume TSLA.Adjusted
## 2020-01-02 84.900 86.140 84.342 86.052 47660500 86.052
## 2020-01-03 88.100 90.800 87.384 88.602 88892500 88.602
## 2020-01-06 88.094 90.312 88.000 90.308 50665000 90.308
## 2020-01-07 92.280 94.326 90.672 93.812 89410500 93.812
## 2020-01-08 94.740 99.698 93.646 98.428 155721500 98.428
## 2020-01-09 99.420 99.760 94.574 96.268 142202000 96.268
We can also combine column selection with row selection. If we want the same date selection, but only adjusted prices:
TESLA_2020_adjusted = TSLA["2020-01-01/2020-12-31", "TSLA.Adjusted"]
In this case, to select the column we used the column name instead of the # of the column.
We can use the merge() function to join two or more xts time-series datasets into an integrated dataset. Here we select only the adjusted price columns, which we will use to calculate returns:
adjprices <- merge(`BTC-USD`$`BTC-USD.Adjusted`, TSLA$TSLA.Adjusted)
The quantmod package also has the Ad() function to extract adjusted prices, so we can do the same as above using this function:
adjprices <- merge(Ad(`BTC-USD`), Ad(TSLA))
Now we have an xts-zoo object with the two series of adjusted prices. We can change the names of the columns:
names(adjprices)<-c("bitcoin","tesla")
Now we can make reference to the adjusted prices using these names.
In Finance, when managing daily data it is very common to have gaps in the series. What does this mean? It means that the series contains missing (NA) values for some days. For example, for stock series there is no data for weekends or holidays:
head(adjprices,n=10)
## bitcoin tesla
## 2017-01-01 998.325 NA
## 2017-01-02 1021.750 NA
## 2017-01-03 1043.840 43.398
## 2017-01-04 1154.730 45.398
## 2017-01-05 1013.380 45.350
## 2017-01-06 902.201 45.802
## 2017-01-07 908.585 NA
## 2017-01-08 911.199 NA
## 2017-01-09 902.828 46.256
## 2017-01-10 907.679 45.974
In this case, Bitcoin has price data for every day, including weekends and holidays. When we did the merge, the Tesla column got NA values for those days. If we keep the dataset as it is, when calculating daily returns for Tesla it will not be possible to calculate the return for Mondays, since there is no price for the previous Sunday. An easy way to deal with NA values is to delete the rows that have an NA in any column. Although R recognizes that we are working with a time-series object and can deal with gaps, it is a good idea to have a dataset free of NAs, so we use the na.omit() function:
adjprices <- na.omit(adjprices)
Now the holidays and weekends were deleted:
head(adjprices,n=6)
## bitcoin tesla
## 2017-01-03 1043.840 43.398
## 2017-01-04 1154.730 45.398
## 2017-01-05 1013.380 45.350
## 2017-01-06 902.201 45.802
## 2017-01-09 902.828 46.256
## 2017-01-10 907.679 45.974
We can visualize how the Bitcoin price has evolved over time:
plot(`BTC-USD`)
We can do a better visualization if we use the chartSeries function from the quantmod package:
chartSeries(`BTC-USD`, theme=("white"))
We can also do a plot of specific periods. For example, we can see how Bitcoin price behaved only in 2020:
chartSeries(`BTC-USD`, subset = '2020-01-01/2020-12-31')
A financial simple return for a stock (Rt) is calculated as a percentage change of price from the previous period (t-1) to the present period (t):
\[R_{t}=\frac{Adjprice_{t}}{Adjprice_{t-1}}-1\] We can use returns in decimal or in percentage (multiplying by 100). We will keep using decimals.
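As a quick check of the formula (illustrative arithmetic only), we can compute Tesla's simple return for 2017-01-04 by hand using the two adjusted prices shown earlier (43.398 and 45.398):
# Tesla adjusted prices for 2017-01-03 and 2017-01-04 (from the head() output above):
p0 <- 43.398
p1 <- 45.398
p1 / p0 - 1
# about 0.0461, i.e. a 4.6% daily return, matching the value we compute below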
In Finance it is strongly recommended to calculate continuously compounded returns (cc returns) and to use them instead of simple returns for data analysis, statistics and econometric models. cc returns are also called log returns.
One way to calculate cc returns is by subtracting the log of the current adjusted price (at t) minus the log of the previous adjusted price (at t-1):
\[r_{t}=log(Adjprice_{t})-log(Adjprice_{t-1})\] This is also called the first difference of the log price.
We must use adjusted stock prices to calculate financial returns. To calculate a lagged (previous) value of a time-series variable we can use the lag function from the stats package.
R = adjprices / stats::lag(adjprices,n=1) - 1
head(R)
## bitcoin tesla
## 2017-01-03 NA NA
## 2017-01-04 0.1062327729 0.046085074
## 2017-01-05 -0.1224095481 -0.001057337
## 2017-01-06 -0.1097110812 0.009966924
## 2017-01-09 0.0006949826 0.009912297
## 2017-01-10 0.0053731309 -0.006096550
Since adjprices has two columns, R calculates the daily simple returns for both columns of the dataset.
Note that we write the prefix stats:: before the lag function. We do this because other R packages also define a function called lag, so R could pick the wrong one if we do not specify which package we are using.
Now we calculate the daily continuously compounded returns using adjusted prices. We use the diff and log functions. The diff function calculates the difference between the value of a time-series variable and its past value:
r = diff(log(adjprices))
head(r)
## bitcoin tesla
## 2017-01-03 NA NA
## 2017-01-04 0.1009603447 0.045054695
## 2017-01-05 -0.1305752498 -0.001057897
## 2017-01-06 -0.1162092411 0.009917582
## 2017-01-09 0.0006947412 0.009863492
## 2017-01-10 0.0053587471 -0.006115209
Remember that the first difference of log prices is actually the continuously compounded returns of the period.
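Both measures are directly related: the cc return is the natural log of one plus the simple return, r = log(1 + R). A minimal sketch to verify this with the R and r objects calculated above:
# The difference between cc returns and log(1 + simple returns) should be
# essentially zero (up to machine precision):
max(abs(r - log(1 + R)), na.rm = TRUE)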
We can visualize both daily returns over time for Tesla:
plot(R$tesla)
We can observe increasing volatility in 2018 and strong volatility in the pandemic months. Volatility can be calculated as the standard deviation of returns.
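To see this changing volatility in numbers, we can compute the standard deviation of Tesla's daily returns for each calendar year with the apply.yearly() function from the xts package (a sketch using the R object created above):
# Standard deviation of Tesla daily returns, year by year:
apply.yearly(R$tesla, sd, na.rm = TRUE)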
We will use the table.Stats() function from the PerformanceAnalytics package. You have to install this package by clicking the Packages tab in the bottom-right pane of RStudio, then clicking Install and typing PerformanceAnalytics.
We load the package and get the main descriptive statistics of both, Tesla and Bitcoin daily returns:
library(PerformanceAnalytics)
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
table1 <- table.Stats(R)
table1
We can also get specific descriptive statistics using specific functions such as mean and sd:
mean_tesla_R = mean(R$tesla, na.rm=TRUE)
median_tesla_R = median(R$tesla, na.rm=TRUE)
sd_tesla_R = sd(R$tesla, na.rm=TRUE)
cat("Tesla daily mean return is ",mean_tesla_R, "\n")
## Tesla daily mean return is 0.003155467
cat("Tesla daily median return is ",median_tesla_R, "\n")
## Tesla daily median return is 0.001496471
cat("Tesla daily volatility is ", sd_tesla_R)
## Tesla daily volatility is 0.03845195
When columns have NA values, some descriptive statistics functions cannot be calculated, unless we use the option na.rm=TRUE (means remove the NA values before calculation)
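For example, with a simple numeric vector that contains an NA:
x <- c(1, 2, NA, 4)
mean(x)               # returns NA because of the missing value
mean(x, na.rm = TRUE) # returns 2.333333, the mean of the non-missing values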
When the median and the mean are very different, this is a sign of non-normality in the distribution of the variable. Here we see that Tesla's median return is much less than its mean (about 0.15% vs 0.32%). The median is a much better measure of central tendency for financial returns.
If Kurtosis of the variable is greater than 3, then the variable has more extreme values than a normal distributed variable. In this case, we see that both returns have Kurtosis much greater than 3.
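We can check this with the kurtosis() function from PerformanceAnalytics. Note that by default it reports excess kurtosis (kurtosis minus 3), so here we request the raw moment measure to compare against 3 (a sketch using the R object from above):
# Raw (moment) kurtosis; values much greater than 3 indicate fat tails:
kurtosis(R$tesla, method = "moment", na.rm = TRUE)
kurtosis(R$bitcoin, method = "moment", na.rm = TRUE)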
Financial returns usually have more extreme values than normally distributed variables, so the traditional standard deviation might not provide a complete view of the dispersion of returns. Looking at quartiles provides a better perception of the volatility of financial returns. The box plot gives us a good view of the volatility, mean and median of returns:
chart.Boxplot(R)
It is easy to see that Bitcoin is riskier than Tesla since it has a wider range of both negative and positive returns. The red circles show the means, the middle lines are the medians (50th percentile), and the boxes contain 50% of the data, from quartile Q1 (25th percentile) to Q3 (75th percentile). The vertical lines delimit the non-extreme values, and the dots are considered extreme values in the context of each distribution.
We can use the charts.PerformanceSummary() function on the returns to visualize how much we would have made over time if we had invested $1.00 and held it over the whole period of the data:
charts.PerformanceSummary(R$tesla,
main = "Performance of $1.00 Tesla",
wealth.index = TRUE)
We see that if we had invested $1.00 in Tesla on Jan 1, 2017, we would have had more than 15 times our money by the end of the period!
We can calculate the exact holding period return by getting the percentage growth of the series from the first adjusted price up to the last adjusted price:
hpr_tesla = as.numeric(adjprices$tesla[nrow(adjprices)]) / as.numeric(adjprices$tesla[1]) - 1
hpr_tesla
## [1] 15.84202
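The same holding period return can also be obtained by compounding the daily returns: with simple returns we multiply the gross returns, and with cc returns we simply sum them and exponentiate. A minimal sketch using the R and r objects from above:
# Compounding the daily simple returns of Tesla:
prod(1 + R$tesla, na.rm = TRUE) - 1
# Equivalent calculation with cc returns:
exp(sum(r$tesla, na.rm = TRUE)) - 1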
We can do the same for Bitcoin:
charts.PerformanceSummary(R$bitcoin,
main = "Performance of $1.00 in Bitcoin",
wealth.index = TRUE)
With financial xts datasets, it is easy to get aggregation of financial prices. We can use the to.weekly, to.monthly or to.quarterly functions from the quantmod package:
adjprices_monthlyTSLA = to.monthly(adjprices$tesla)
adjprices_monthlyBTC = to.monthly(adjprices$bitcoin)
We can also use the Return.calculate() function from PerformanceAnalytics to easily calculate returns from prices:
R_monthly_TSLA = Return.calculate(adjprices_monthlyTSLA$`adjprices$tesla.Close`)
R_monthly_TSLA = na.omit(R_monthly_TSLA)
r_monthly_TSLA = Return.calculate(adjprices_monthlyTSLA$`adjprices$tesla.Close`,method = "log")
r_monthly_TSLA = na.omit(r_monthly_TSLA)
R_monthly_BTC = Return.calculate(adjprices_monthlyBTC$`adjprices$bitcoin.Close`)
R_monthly_BTC = na.omit(R_monthly_BTC)
r_monthly_BTC = Return.calculate(adjprices_monthlyBTC$`adjprices$bitcoin.Close`,method = "log")
r_monthly_BTC = na.omit(r_monthly_BTC)
Now we can visualize monthly risk of both instruments:
table.Stats(R_monthly_TSLA)
table.Stats(R_monthly_BTC)
chart.Boxplot(R_monthly_BTC)
chart.Boxplot(R_monthly_TSLA)
In Finance there are basically the following dataset structures or data types:
Time-series: in this structure you can have many periods and information for one "subject" (a subject can be a company, an index, an industry, etc.). You can also have more than one subject, but the information for each subject is placed in a new column. For example: the dataset created after running getSymbols().
Cross-sectional: in this structure you usually have many "subjects" for ONLY ONE period. For example, key performance indicators of Mexican companies for the last quarter of 2016.
Panel-data: this structure is a combination of time-series and cross-sectional. We have more than one subject, and for each subject we can have more than one period.
For financial analysis, data management is a very important process. Most of the time, before designing and running a financial / econometric model, it is needed to do simple and sophisticated data management.
In this example we will learn how to collapse a dataset and merge one dataset with another considering the data structures we learned in the previous section.
We will merge 2 datasets: a time-series dataset with monthly historical data for the Mexican IPyC market index, and a panel dataset with quarterly financial information for many Mexican public firms (a public firm is a company that issues shares in the market).
We can only merge datasets with the same granularity. In other words, we cannot merge a monthly with a quarterly dataset. In this case, the market monthly dataset must be aggregated to quarters, and then we can merge the resulting dataset with the other quarterly dataset.
We will work with an online panel dataset of Mexican firms. Download the dataset http://www.apradie.com/datos/datamx2020q4.xlsx using the readxl package (you have to install this package). This dataset has real quarterly financial data of Mexican firms (from the BMV) for many years. The information comes from Economatica, a leading provider of economic and financial data. This dataset is a panel dataset, which is a combination of a cross-sectional and a time-series dataset. Navigate through the data (using the View() function) to learn more about it.
# Load the package
library(readxl)
# Download the excel file from a web site:
download.file("http://www.apradie.com/datos/datamx2020q4.xlsx",
"firmspaneldata.xlsx", mode="wb")
# The first parameter is the link and the second is a name for the
# local file
# Use the function read_excel()
paneldataset <- read_excel("firmspaneldata.xlsx")
We need to merge the monthly market return to this panel dataset. To do so, we need to:
1. Download the monthly IPyC market index data for the same periods as the firm panel dataset.
2. Aggregate (collapse) the monthly data to quarters.
3. Calculate the quarterly market returns.
4. Merge the resulting quarterly dataset with the panel dataset.
We start by downloading the IPyC data:
library(quantmod)
getSymbols("^MXX", from="2000-01-01", to= "2019-12-31",
periodicity="monthly", src="yahoo")
## [1] "^MXX"
We need to aggregate (collapse) the dataset from monthly to quarterly, keeping the LAST market index per quarter. We can use the to.quarterly function from quantmod:
QMXX <- to.quarterly(MXX,indexAt='startof')
This function creates an OHLC xts dataset with one row per quarter. We can have a look at the content:
head(QMXX)
## MXX.Open MXX.High MXX.Low MXX.Close MXX.Volume MXX.Adjusted
## 2000-01-01 7185.71 8417.33 6510.84 7473.25 0 7473.25
## 2000-04-01 7459.75 7639.38 5516.77 6948.33 0 6948.33
## 2000-07-01 6953.73 7548.57 6171.95 6334.56 0 6334.56
## 2000-10-01 6334.64 6502.60 5148.02 5652.19 0 5652.19
## 2001-01-01 5651.35 6615.83 5512.77 5727.89 0 5727.89
## 2001-04-01 5733.67 6895.70 5541.68 6666.17 4888424600 6666.17
We see that the Open, High, Low, Close and Adjusted prices for each quarter were calculated. We only need the Adjusted price to calculate the market returns, so we select only the Adjusted column:
QMXX = QMXX$MXX.Adjusted
# Change the name of the column:
colnames(QMXX) <- "MXXindex"
Now we can calculate the cc returns of the market:
QMXX$MXXreturns <- diff(log(QMXX))
Now we are almost ready to merge this quarterly data to the panel dataset using merge.
We need to have both datasets with a common column. In this case, the common column is quarter. Both datasets must be of the same class, in this case, data frame class.
The QMXX dataset has the quarter as its index, but not as a column. We create a data frame with a quarter column equal to this index:
# Create a dataframe with a quarter column for the QMXX
QMXX.df<-data.frame(quarter=index(QMXX),coredata(QMXX))
# I extracted the quarter from the index
# coredata is a function that extracts the data of an object
# without its index or formatting
The common column must be of the same type in both datasets, so we make sure that the quarter column of the panel dataset is of Date type:
paneldataset$quarter<-as.Date(paneldataset$quarter)
Now we can do a MANY-TO-ONE merge of both datasets:
paneldataset<-merge(paneldataset,QMXX.df,by="quarter")
# Now I have a dataset for all firms-quarters, and the
# MXX index and MXX return is merged for each firm
We need to tell R that the dataset is panel data with the structure firm-quarter.
We define the dataset as a panel data using the pdata.frame function. You need to install the package “plm”.
We indicate the index of the panel data as follows: firmcode will be the column for the subjects and quarter column for the time:
library(plm)
paneldataset <- pdata.frame(paneldataset, index= c("firmcode","quarter"))
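We can verify the panel structure with the pdim() function from plm, which reports the number of firms, the number of time periods, and whether the panel is balanced (a sketch using the paneldataset object just created):
# Panel dimensions: number of firms (n), time periods (T) and observations (N)
pdim(paneldataset)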
We will use the dplyr package to easily do data selection and calculations for panel datasets.
You have to install this package by clicking the Packages tab in the bottom-right pane of RStudio, then clicking Install and typing dplyr.
Now we load the dplyr package:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plm':
##
## between, lag, lead
## The following objects are masked from 'package:xts':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
This dataset has historical quarterly financial data of active and inactive firms listed on the Bolsa Mexicana de Valores. We will keep ONLY active firms and observations from 2010 onwards. You can do this by typing:
activedata<-paneldataset[paneldataset$status=="active",]
activedata<-activedata[activedata$year>=2010,]
As you see, we start with the paneldataset, and then we do a sub-setting or selection of rows indicating specific conditions.
Let’s select a few important columns from this dataset to perform some basic financial ratio calculations.
We will use the dplyr package. This package is designed for data management of data frames and panel data. We chain data management operations with the pipe operator %>%.
activedata<-activedata %>%
select(firm,quarter,year,revenue,cogs,ebit,totalassets,
adjustedstockprice, naics1)
As we did for time-series data, we can calculate returns with a panel dataset. We follow the same formula mentioned above, but now we use the panel dataset:
activedata$r = diff(log(activedata$adjustedstockprice))
We added the column r to the panel data. Since R knows that we have a panel data structure, it calculates returns correctly for each firm-quarter; when there is a change of firm, the return is not calculated (it would mix prices of two different firms). You can have a look at the data and check that the first quarter of each firm has an NA return, as in the sketch below.
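For example (a minimal sketch; the firm shown is just whichever one happens to appear first in the dataset), we can inspect the return column of a single firm:
# Look at the first rows of the return column for one firm:
firstfirm <- activedata$firm[1]
head(activedata[activedata$firm == firstfirm, c("firm", "quarter", "r")])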
Let’s calculate Return on Assets using EBIT (Earnings before interest and taxes) instead of Net income:
activedata$ROA = activedata$ebit / activedata$totalassets
Now we calculate Profit Margin (PM) and Asset Turn Over (ATO). We will use EBIT instead of Net Income to calculate PM:
activedata$PM = activedata$ebit / activedata$revenue
activedata$ATO = activedata$revenue / activedata$totalassets
Let’s create a binary variable as a signal of positive ROA in a quarter compared with its ROA of one year ago (4 quarters ago). In other words, if the firm had a ROA in a specific quarter that is greater than its ROA 4 quarters ago, we will assign 1; 0 otherwise:
activedata$ROAsignal = ifelse(activedata$ROA>
plm::lag(activedata$ROA, 4),1,0)
Here we use a conditional with the ifelse function. The first parameter of ifelse is a logical condition that can be true or false. If the condition is true, the value assigned is the second parameter; if it is false, the value assigned is the third parameter.
The ifelse function is very useful when creating financial signals that compare financial values. Here is the general description of the ifelse function:
ifelse(data_condition, value_if_TRUE, value_if_FALSE)
For example:
a=5
b=10
ifelse(a>b,1,0)
## [1] 0
Since a is NOT greater than b, then the value that was returned was zero.
We can use this function to create financial signals as new columns in a panel dataset or time-series dataset.
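For instance, a hypothetical signal that flags the quarters in which a firm had a positive stock return could be created like this (the column name possreturn is just an example):
# 1 if the firm's cc return in that quarter was positive, 0 otherwise:
activedata$possreturn <- ifelse(activedata$r > 0, 1, 0)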
Doing descriptive statistics with panel data is not quite the same as with a time-series structure. It is a good idea to compute descriptive statistics by period of the panel data to avoid getting misleading summaries.
We can start by computing descriptive statistics for only the last quarter of the data:
data2019q4 = activedata[activedata$quarter=='2019-10-01',]
table.Stats(data2019q4[,c(4:7,10)])
## Warning in if (class(x) == "matrix" | class(x) == "data.frame") {: the condition
## has length > 1 and only the first element will be used
## Warning in if (class(x) == "numeric") {: the condition has length > 1 and only
## the first element will be used
## Warning in log(1 + x): NaNs produced
Here we will use the dplyr package to select, group and get descriptive statistics by groups.
In this case, we will select the last quarter of the dataset, group the firms by industry, and for each industry we get the median of the main variables:
industries<-unique(activedata$naics1)
activedata %>%
filter(quarter=='2019-10-01') %>%
group_by(naics1) %>%
summarize(firms = n(),
median_total_assets = median(totalassets, na.rm = TRUE),
median_revenue = median(revenue, na.rm = TRUE),
median_ebit = median(ebit, na.rm = TRUE))
Here we can see a good picture of the whole Mexican financial market. It is important to note that the best central tendency measure for any financial statement variable, such as revenue or total assets, is the median, since the distribution of these variables is always skewed to the right, with very few very large firms. The arithmetic mean of financial variables does not provide a good representation of the typical company in a market.
LOOKING AT THE 2 TABLES ABOVE, PROVIDE A GOOD DESCRIPTION OF A TYPICAL (AVERAGE) FIRM IN MEXICO, AND A DESCRIPTION OF THE MEXICAN MARKET IN TERMS OF FIRMS BY INDUSTRY AND FIRM SIZE FOR EACH INDUSTRY.
Having performed the necessary calculations to evaluate Mexican firms with statistical support, we find the following noteworthy results:
· Of the 122 firms observed, the median revenue is $10,634,607 MXN. In other words, that figure is the typical revenue of a firm listed on the BMV.
· Of the 122 firms observed, one firm reported an EBIT of -$6,734,642 MXN. It would be worth identifying which firm this is, so as not to put a single peso into it, even by mistake.
· On average, Mexican firms report revenue of $50,962,873 MXN.
· The highest revenue reported by a firm was $1,007,347,869 MXN.
Regarding the industry analysis, we can interpret the following:
· Among the firms listed on the BMV, 39 belong to the Manufacturing industry, 33 to the Finance and Insurance industry, and 17 are Construction firms.
· The most representative industries in Mexico, in terms of revenue, are: Management of Companies and Enterprises (102,477,596), Agriculture, Forestry, Fishing and Hunting (38,465,597), Manufacturing (25,936,810), and Information (12,814,470).
· The industries with the fewest assets under management are: Health Care and Social Assistance; Administrative and Support and Waste Management and Remediation Services; and Arts, Entertainment, and Recreation.