1 Data management and Introduction to Multiple Regression

Before we continue to learn more advanced econometric models, it is very important to emphasize the importance of data management in Finance. If we do not have a cleaned and well structured dataset, then our econometric models might not provide the accurate insights we are looking for. Real data sets in Finance are usually not quite cleaned and structured. No matter the quality of econometric models, if the input data is not cleaned or reliable, then the output will not be reliable either. Remember the popular say: garbage in, garbage out.

We will start learning the traditional dataset structures, and then the basics of data manipulation and data cleaning. At the end we will start learning about multiple regression models.

2 Data management

2.1 Data set structures

In time-series econometrics there are basically the following dataset structures or types:

Time-series: in this structure you can have many periods, and information for one “subject” (subject can be company, index, industry, etc). Also, you can have more than one subject, but the information is placed as new column for each subject. For example: the dataset created after running getsymbols:

##        p_stock1 r_stock1 p_stock2 r_stock2
## 2014m1       10     0.02       20     0.01
## 2014m2       11     0.10       21     0.05

Cross-sectional structure: in this structure, you usually have many “subjects”, for ONLY ONE period For example, key performance indicators of Mexican companies for the last quarter of 2016:

##     Ticker   ROA ATO ProfitMargin
## 1 ALFAA.MX 0.023 0.9         0.15
## 2  AMXL.MX 0.015 1.1         0.10

Panel-data structure: it is combination of time-series and cross-sectional. In this structure, we have more than one subject, and for each subject we can have more than one period. Example:

##      Ticker quarter  ROA ATO ProfitMargin
## 1  ALFAA.MX  2014q1  0.2 1.1         0.17
## 2  ALFAA.MX  2014q2 0.21 1.2         0.16
## 3       ...     ...  ... ...          ...
## 4 BIMBOA.MX  2014q1 0.15 0.8         0.10
## 5 BIMBOA.MX  2014q2 0.20 0.9         0.05
## 6       ...     ...  ... ...          ...
## 7 BIMBOA.MX  2017q1 0.15 1.1         0.09

2.2 Introduction to Data management

In Econometrics, data management skills are very important. Most of the time, before designing and running an econometric model, it is needed to do simple and sophisticated data management. It is very common to merge data sets before we start our econometric analysis.

In this example we will learn how to collapse a dataset and merge one dataset with another considering the data structures we learned in the previous section.

In this case, you have to create a dataset with quarterly information for the IPyC market index using monthly data. Then, you have to merge this dataset with a panel-dataset of financial information for many Mexican stocks. In more detail, you have to do the following:

Download the dataset: http://www.apradie.com/datos/datamx2020q4.xlsx using the readxl package (you have to install this package). This dataset has real quarterly financial data of Mexican rms (from the BMV) for many years. This information comes from Economatica, a leading software and database economic and financial information. This dataset is a panel dataset, which is a combination of cross sectional and time-series dataset. Navigate through the data (using View() function) to learn more about this dataset. Save this dataset with a name.

# Load the package
library(readxl)
# Download the excel file from a web site:
download.file("http://www.apradie.com/datos/datamx2020q4.xlsx",
              "dataw6.xlsx", mode="wb")
# The first parameter is the link and the second is a name for the
#  local file

# Use the function read_excel()
dataset <- read_excel("dataw6.xlsx")

We need to attach the market return to my dataset. Then, I need to:

Download the monthly market index from Yahoo Finance (getsymbols)
Transform (Collapse) the dataset from monthly to quarterly

Now download the ipyc:

library(quantmod)
getSymbols("^MXX", from="2000-01-01", to= "2019-12-31",
            periodicity="monthly", src="yahoo")

## [1] "^MXX"

We need to aggregate (collapse) the dataset from monthly to quarterly, keeping the LAST market index per quarter. We can use the to.quarterly function from quantmod:

QMXX <- to.quarterly(MXX,indexAt='startof')

This function creates an OHLC xts dataset for each quarter. We can have a look of the content:

head(QMXX)

##            MXX.Open MXX.High MXX.Low MXX.Close MXX.Volume MXX.Adjusted
## 2000-01-01  7185.71  8417.33 6510.84   7473.25          0      7473.25
## 2000-04-01  7459.75  7639.38 5516.77   6948.33          0      6948.33
## 2000-07-01  6953.73  7548.57 6171.95   6334.56          0      6334.56
## 2000-10-01  6334.64  6502.60 5148.02   5652.19          0      5652.19
## 2001-01-01  5651.35  6615.83 5512.77   5727.89          0      5727.89
## 2001-04-01  5733.67  6895.70 5541.68   6666.17 4888424600      6666.17

We see that the Open, High, Low, Close and Adjusted prices for each quarter were calculated. We only need the Adjusted price to calculate the market returns, so we select only the Close column:

QMXX = QMXX$MXX.Adjusted
# Change the name of the column:
colnames(QMXX) <- "MXXindex"

Now we can calculate cc return of the market

QMXX$MXXreturns <- diff(log(QMXX))

Now we are almost ready to merge this quarterly data to the panel dataset using merge.

We need to have both datasets with a common column. In this case, the common column is quarter. Both datasets must be of the same class, in this case, data frame class.

The QMXX has the quarter as index, but not as part of a column. We create a data frame with the quarter column equal to its index:

# Create a dataframe with a quarter column for the QMXX
QMXX.df<-data.frame(quarter=index(QMXX),coredata(QMXX))
# I extracted the quarter from the index
# coredata is a function that extract the data of an object
#   without extracting formatting

The common column must be of the same type, so we make sure that the column quarter of dataset is a Date type:

dataset$quarter<-as.Date(dataset$quarter)

Now we do a MANY-TO-ONE merge of both datasets:

dataset<-merge(dataset,QMXX.df,by="quarter")
# Now I have a dataset for all firms-quarters, and the
# MXX index and MXX return is merged for each firm

Now we define the dataset as a panel data using the pdata.frame function. You need to install the package “plm”.

library(plm)
datasetp <- pdata.frame(dataset, index= c("firmcode","quarter"))

I need to tell R that the dataset is panel data with the structure firm-quarter.

Before running multiple regressions with this panel dataset I need some data cleanning.

3 Data cleaning - Extreme values

When one or more independent variables of a regression model have very extreme values, the regression coefficients are usually not reliable. The extreme values of a variable are called outliers. The OLS method used for regression is very sensible for outliers. In other words, when you have outliers in an independent variable, your regression coefficients can be biased and not reliable.

In this section we continue using the same data set. You have to do the following:

Learn about the Book-to-market ratio (BMR). As comments, EXPLAIN what the BMR is.

Calculate the BMR for all firm-quarters of the dataset. Consider book value of equity as total assets minus total liabilities. Calculate market value of the fir according to the stock price and the number of shares outstanding:

datasetp$bookvalue= datasetp$totalassets - datasetp$totalliabilities
datasetp$marketvalue =  datasetp$originalhistoricalstockprice*datasetp$sharesoutstanding

Check that I used original stock price, not adjusted stock price. In financial markets, the adjusted stock prices are calculated after considering dividend payments and stock splits. A stock split is when a firm decides to divide the value of its stock price by 2, 3 or other multiple with the only purpose to avoid the perception that the stock is expensive. For example, late August 2020 Apple and Tesla decided to do stock split. Apple did a split on a 4-for-1 basis. This means that if the stock price was about USD$400.00 on that day, then its price was reduced to USD$100.00, but they multiplied the number of shares (shares outstanding) by 4 to keep the same market value of the firm. In this historical dataset the shares outstanding is the historical, so we need to use the historical/original stock price without considering stock splits nor dividend payments.

We calculated book value and market value for ALL rms and ALL quarters! Now we can generate the book-to-market ratio for all firms all quarters:

datasetp$bmr= datasetp$bookvalue / datasetp$marketvalue

If bmr=1 means that the firm book value is about the same as firm market value. If that is the case, then the market value has not grown beyond book-value, meaning that the firm has not created value beyond its book value.

If bmr>1 means that the market value is less than book value. So, if bmr>1 means that the firm has significantly lost shareholder’s wealth, and it might incur in bankrupt risk.

Then, what would be the bmr level that all firms are looking for? One of the main purposes of the executives is to MAXIMIZE shareholder’s value. The way to increase shareholder’s value is to increase its market value, and the only way to increase market value is to increase stock price.

Then, the bmr level that all executives prefer is a value much less than 1.

If bmr=0.5 means that the firm market value is the double of its book value. In this case, the main difference between market value and book value is the wealth that the firm has created thanks to its valuable intangible assets such as prestige, high quality, and innovation.

Then, what do you think it might be the relationship between bmr and stock return? Intuitively, we might think that a small value of bmr is a good news, then the stock return might be related in a negative way. If bmr goes down (good news), then the stock return might go up. Then, it might be expected that the relationship between bmr and stock return is linear and negative. Some finance research (like Fama & French, 1995), mainly with US firms has found that the relationship between bmr and future stock return is negative.

However, there are mixed findings about the relationship between bmr and stock returns. Some research (like Piotrosky, 2000) has found that firms with high bmr, but strong financials (like earnings per share) usually provides significant positive stock returns. In this special case, bmr has found to be positively related to stock returns.

Finance research has found that bmr influences earnings per share, which in turn influences current and future stock returns. It is some times that firms with low bmr do not experience significant high stock returns due to the high expectations of investors. This is an interesting finding in finance that is still being researched!

In the following models we will see what happens for Mexican firms.

Keep ONLY active firms in the dataset. You can do this by typing:

datasetpa<-datasetp[datasetp$status=="active",]

In R we can subset (select) specific rows of a data frame by including a specific condition as the row index of the data frame.

Before understanding subsetting, we need to understand how we can get access to the rows and columns of a data frame in R.

R datasets have 2 dimensions: [#rows, #columns]. We can show the dimensions of any R dataset:

dim(datasetpa)

## [1] 12160    48

We see that the # of rows of datasetpa is 12160, and the # of columns is 48.

We can make reference to any subset of the dataset using this notation of [#rows, #columns]

Run the histogram command to see the distribution of bmr.

hist(datasetpa$bmr, main="Histogram of the bmr data", 
     xlab="bmr", col="dark blue")

Do a boxplot to identify possible outliers in bmr. You can type

boxplot(datasetpa$bmr)

outlier_values <- boxplot.stats(datasetpa$bmr)$out

There are many outliers according to the boxplot. The boxplot identifies outliers those that are outside 1.5 time interquartile range. We can see the # of outliers according to the boxplot:

length(outlier_values)

## [1] 560

There are 560 outliers according to the boxplot. We can use the winsorize function to treat outliers.

Install the user command winsorize, which performs a winsorization process. With your words EXPLAIN what is winsorization.

library(robustHD)

Apply the winsor command to BMR at the 2 percentile level. Comment what you did with the winsor command. Type the following:

winsbmr <- winsorize(na.omit(datasetpa$bmr), standarized= "True", prob= 0.98, return= "data")

hist(winsbmr, main="Histogram of the winsorized bmr data", 
     xlab="bmr", col="dark blue")

3.1 Simple and Multiple regression models

Using the same dataset we constructed in the previous sections, design the following regression models:

Model 1. A market model for all quarters for ALFAA:

#First, I get all of the cc stock returns
datasetpa$stockreturn <- log(datasetpa$adjustedstockprice
        / (lag(datasetpa$adjustedstockprice, 1)))
# We can also calculate the cc stock return using the difference of the 
#  log prices of one period with respect to the previous period:

datasetpa$stockreturn2<-diff(log(datasetpa$adjustedstockprice))
# We get the same result as in the previous column stockreturn

# Now I do a new object with ALFAA information to make the following 
# regression models

# Now we will select ONLY the rows for the ALFAA ticker. 
# We can do this by selecting those raws of the datasetpa that the firm column be
#  equal to "ALFAA":
ALFAAdata<-as.data.frame(datasetpa[(datasetpa$firm=="ALFAA"),])

# This is a way to do a SUBSETTING  of a data frame. In this case, I set a condition
#   in the rows of the dataset datasetpa, and leave nothing for the columns, 
#   indicating that I want all columns of the dataset


#I set the stock returns as the dependent variable, and the market returns as independent:

reg1 <- lm(ALFAAdata$stockreturn ~ ALFAAdata$MXXreturns)
s_reg1<- summary(reg1)
s_reg1

## 
## Call:
## lm(formula = ALFAAdata$stockreturn ~ ALFAAdata$MXXreturns)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32800 -0.07931 -0.00343  0.10590  0.39669 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -0.01143    0.01673  -0.683    0.496    
## ALFAAdata$MXXreturns  1.62667    0.17725   9.177 5.46e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1445 on 77 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5224, Adjusted R-squared:  0.5162 
## F-statistic: 84.22 on 1 and 77 DF,  p-value: 5.46e-14

Calculate the confidence intervals:

# To do a rough estimate of the 95% confidence interval for B0:
B0 <- s_reg1$coefficients[1,1]
STDERRORB0 <- s_reg1$coefficients[1,2]
minB0 <- B0  - (2* STDERRORB0 )
maxB0 <- B0  + (2* STDERRORB0 )

cat("The B0 confidence interval goes from about ", minB0, "to", maxB0)

## The B0 confidence interval goes from about  -0.04488987 to 0.02202375

# Instead of using the critical value of t=2, I can use the exact t value that
#  gives me the 95% confidence intervale:
t_critical_value <- abs(qt(0.025,reg1$df.residual))
t_critical_value

## [1] 1.991254

# Note that the function qt gets the t critical value so that the area under the
# function t from -infinite to the t critical value is equal to 0.025.
# This t-critical value depends on the degrees of freedom, which is equal to 
#  the number of observations of the regression minus the number of coefficients
#  estimated in the regression. This t-critical value will always be close to 2.

# I indicate 0.025 since the confidence interval is 95%, so I have 2.5% in the
# left, and 2.5% to the right. I specify the degrees of freedom calculated in the
#  previous model.

# To do an exact estimate of the 95% confidence interval for B0:
minB0 <- B0  - (t_critical_value* STDERRORB0 )
maxB0 <- B0  + (t_critical_value* STDERRORB0 )

cat("The exact B0 confidence interval goes from", minB0, "to", maxB0)

## The exact B0 confidence interval goes from -0.04474357 to 0.02187745

# To estimate the 95% confidence interval for B1:
B1<- s_reg1$coefficients[2,1]
STDERRORB1 <- s_reg1$coefficients[2,2]
minB1 <- B1  - (t_critical_value* STDERRORB1 )
maxB1 <- B1  + (t_critical_value* STDERRORB1 )

cat("The B1 confidence interval goes from", minB1, "to", maxB1)

## The B1 confidence interval goes from 1.273728 to 1.979622

This is a market model for Alfa.

The dependent variable is Alfa quarterly returns, and the independent variable is the market returns. We are using Alfa and market returns from quarter 1 2000 to quarter 4 2019.

There is statistical evidence to say that the market returns are linearly and positively related to Alfa’s returns since its beta1 is significantly greater than zero. On average, when the market moves in 1%, Alfa returns will move in about 1.6266749%.

What can we say about the market risk of Alfa? its beta1 is 1.6266749, and its 95% confidence interval goes from 1.2737282 to 1.9796216. Then, we can say that 95% of the cases, its beta1 will be bigger than 1. In other words, we can say that there is enough evidence to say that Alfa is significantly riskier than the market, considering Alfa history from 2000 up to 2019.

Even though beta0 is negative, there is no statistical evidence to say that beta0 is significantly less than 0 since its 95% C.I. can move from -0.0447436 to 0.0218774.

In other words, we cannot say that Alfa is offering returns systematically below the market returns.

Model 2. Using the same market model for ALFAA, now you have to include the BMR as another explanatory (independent) variable in your market model.

reg2 <- lm(ALFAAdata$stockreturn ~ ALFAAdata$MXXreturns + 
             ALFAAdata$bmr)

s_reg2 <- summary(reg2)
s_reg2

## 
## Call:
## lm(formula = ALFAAdata$stockreturn ~ ALFAAdata$MXXreturns + ALFAAdata$bmr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30797 -0.09343 -0.01791  0.09929  0.42392 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.02069    0.02666   0.776    0.440    
## ALFAAdata$MXXreturns  1.58092    0.17819   8.872 2.35e-13 ***
## ALFAAdata$bmr        -0.02579    0.01676  -1.539    0.128    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1432 on 76 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5368, Adjusted R-squared:  0.5246 
## F-statistic: 44.04 on 2 and 76 DF,  p-value: 1.985e-13

This is a multiple regression model to explain Alfa returns with the following explanatory variables: a) market returns and b) Alfa’s book-to-market ratio We are using all quarters that Alfa has been in the Mexican market (from q1 2000 to 2019q4).

Now we have 3 beta coefficients: beta1 (B1), the sensitivity of stock returns related to changes in market returns, beta2 (B2), the sensitivity of stock returns to changes in book-to-marekt ratio, and B0, the expected value of stock return when both independent variables are zero. In this case, B0 does not take the same meaning as in the case of the CAPM since now we included another explanatory variable.

Regarding the effect of the market: I will calculate the 95% confidence interval:

B1<- s_reg2$coefficients[2,1]
STDERRORB1<- s_reg2$coefficients[2,2]
minB1 <- B1  - (t_critical_value* STDERRORB1  )
maxB1 <- B1  + (t_critical_value* STDERRORB1 )

cat("The confidence interval of beta1 goes from", minB1, "to", maxB1)

## The confidence interval of beta1 goes from 1.226091 to 1.935743

After considering the effect of bmr on Alfa quarterly returns, the effect of market returns is still significant and positive since its 95% C.I. goes from 1.2260915 to 1.9357427.
On average, considering no change in book-to-market ratio, when the market quarterly return increases in 1%, Alfa quarterly returns increases in about 1.5809171. This increase can go from 1.2260915 to 1.9357427, so it is significantly bigger than 1. Then, we can say that Alfa stock is significantly riskier than the market after considering the effect of book-to-market ratio.

Regarding the effect of BMR:

  - After considering the effect of the market, the effect of BMR on
    stock quarterly returns of ALFA is negative BUT NOT SIGNIFICANT. 

 I will calculate the 95% confidence interval of B2, the beta coefficient of BMR:

B2<- s_reg2$coefficients[3,1]
STDERRORB2<- s_reg2$coefficients[3,2]
minB2 <- B2  - (t_critical_value* STDERRORB2  )
maxB2 <- B2  + (t_critical_value* STDERRORB2 )

cat("The 95%C.I. of beta2 goes from", minB2, "to", maxB2)

## The 95%C.I. of beta2 goes from -0.05915722 to 0.007578538

After considering the effect of the market return on Alfa quarterly returns, the effect of BMR is negative, but not significant at the 95% C.I since its 95% C.I. goes from -0.0591572 to 0.0075785. We can see that most of the values of this range are negative.

We can calculate the 90% C.I. instead of the 95% C.I. to see whether at 90% confidence we can say that B2 is negative:

B2<- s_reg2$coefficients[3,1]
STDERRORB2<- s_reg2$coefficients[3,2]
# I calculate the critical t value to get the 90% C.I. 
t_critical_value90 <- abs(qt(0.050,reg2$df.residual))

minB2 <- B2  - (t_critical_value90* STDERRORB2  )
maxB2 <- B2  + (t_critical_value90* STDERRORB2 )

cat("The 90%C.I. of beta2 goes from", minB2, "to", maxB2)

## The 90%C.I. of beta2 goes from -0.05369264 to 0.002113959

The 90%C.I. ends on a positive value, so I cannot say that beta2 is significantly negative at the 90% C.I. It is expected that in most stocks this beta2 should be significantly negative since BMR and stock returns usually have a negative relationship. 

If we calculate 1-pvalue of beta2, we get the probability that beta2 will be negative. In this case, (1-pvalue)= (1--1.5389992) = 0.9832428. Then, we can say that beta2 is negative with this % of confidence.

On average, considering no change in market return, when the book-to-market ratio increases in 1%, Alfa quarterly returns changes in about -0.0257893. This increase can go from -0.0536926 to 0.002114, so it is not significantly less than 0 (although it was close to be significant).

Regarding the constant (b0):

I fist calculate b0 and its 95% confidence intervale:

B0 <- s_reg2$coefficients[1,1]
STDERRORB0 <- s_reg2$coefficients[1,2]
minB0 <- B0  - (2* STDERRORB0 )
maxB0 <- B0  + (2* STDERRORB0 )

When the market return is 0 AND BMR is also 0, the expected stock return of Alfa is 0.0206917%. In this case, it is difficult that BMR can be zero. One possibility is that the firm goes to bankrupcy when the book value is close to zero. Another option is that the market value is much bigger than book value, so bmr diminishes to values about 0.1 or less. We can infer that the less the BMR (if market value increases much faster than book-value), the higher will be the expected return of Alfa when the market return is zero.

IT IS ALSO IMPORTANT TO NOTE THAT the beta coefficients are PARTIAL coefficients since its value is a measure of HOW MUCH the independent variable (like BMR) is explaining the dependent variable (stock return) AFTER considering the effect of the the other(s) independent variable(s).

As we can see, the results of this model partially confirms finance research that say that bmr and stock returns are negatively related (the significance of beta2 of bmr was not quite significant). This regression is using time-series structure since we are ONLY using the rows (periods) for ALFA company.

Workshop 6 SOLUTION, Financial Econometrics I

Data structures, outliers and introduction to multiple regression

Alberto Dorantes, Ph.D.

Sep 29, 2021