Workshop 1 Solution, Financial Modeling and Programming

Review of logit regression and its applications in Finance

The logit model is one type of non-linear regression model where the dependent variable is binary. The logistic or logit model is used to examine the relationship between one or more quantitative variables and the probability of an event happening (1=the event happens; 0 otherwise). For example, a bank that gives loans to businesses might be very interested in knowing which are the factors/variables/characteristics of firms that are more related to loan defaults. If the bank understands these factors, then it can improve its decisions about which firms deserve a loan, and minimize the losses due to loan defaults.

In this workshop, we will define the event to be whether a firm in a specific quarter had higher stock return compared to the market (S&P500 index) return. If the stock return is higher than the market return, then we codify the binary variable equal to 1; 0 otherwise.

Then, in this case, the dependent variable of the regression is the binary variable with 1 if the stock beats the market and 0 otherwise. The independent or explanatory variables can be any financial indicator/ratio/variable that we believe is related to the likelihood of a stock to beat the market in the near future.

Let’s work with an example of a model with only 1 independent variable X1. Imagine that X1 is the variable earnings per share (eps) of a firm, and the event is that the firm beats the market. Then, let’s do a simple example with specific values for b0, b1, and a range of values for eps from -1 to 1 jumping by 0.1:

# The seq function creates a numeric vector. We specify the fist, last and the jumps: 
eps=seq(from=-1,to=1,by=0.1)
eps
##  [1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1  0.0  0.1  0.2  0.3  0.4
## [16]  0.5  0.6  0.7  0.8  0.9  1.0
# This vector has 21 values for x1 including the zero
# I define b0=-0.5 and b1=10:
b0=-0.5
b1=10

# I define a temporal variable Y to be the regression equation:
Y = b0 + b1*eps

# Since eps is a vector, then R performs a row-wise calculation using the equation for all values of eps

# Now I create a vector with the values of the function according to the equation of the probability:
prob = 1 / (1 + exp(-(Y)) )

# I display the probability values of the function for all values of eps:
prob
##  [1] 2.753569e-05 7.484623e-05 2.034270e-04 5.527786e-04 1.501182e-03
##  [6] 4.070138e-03 1.098694e-02 2.931223e-02 7.585818e-02 1.824255e-01
## [11] 3.775407e-01 6.224593e-01 8.175745e-01 9.241418e-01 9.706878e-01
## [16] 9.890131e-01 9.959299e-01 9.984988e-01 9.994472e-01 9.997966e-01
## [21] 9.999252e-01
# Finally, I plot the function values for all values of x1:
plot(x=eps,y=prob,type="line")
## Warning in plot.xy(xy, type, ...): plot type 'line' will be truncated to first
## character

Here we can see that function is not linear with changes in eps. There is a specific range of values for eps close to 0 when the probability that the firm beats the market increases very fast up to a value of about 0.4 where the probability grows very slow with any more increase in eps.

The interpretations of the magnitude of the coefficients b0 and b1 in logistic regression is not quite the same as the case of multiple regression. However, the interpretation of the sign of the coefficient (positive or negative) and the level of significance (pvalue) is the same as in the multiple regression model. What we can say up to know is that if b1 is positive and significant (its p-value < 0.05), then it means that the variable, in this case, eps is significantly and positively related to the probability that a firm beats the market returns.

Before going to the interpretation of the magnitudes of the coefficient, here is a quick explanation of how the logistic regression works and how it is estimated in any specialized software (such as R).

Let’s continue with the same event, which is that the firm return beats the market (in other words, that the firm return is higher than the market return). Then:

p = probability that the firm beats the market (Event=1); or that the event happens.

(1-p) = probability that the firm DOES NOT beat the market (Event=0); or that the event does not happen.

To have a dependent variable that can get any numeric value from a negative value to a positive value, we can do the following mathematical transformation with these probabilities:

\[Y=log(\frac{P}{1-P})\]

The \(\frac{P}{1-P}\) is called the odds ratio: \[ODDS RATIO = (\frac{P}{1-P})\] The odds ratio is the ratio of the probability of the event happening to the probability of the event NOT happening. Since p can have a value from 0 to 1, then the possible values of ODDSRATIO can be from 0 (when p=0) to infinity (when p=1). Since we want a variable that can have values from any negative to any positive value, then we can just apply the logarithmic function, and then the range of this log will be from any negative value to any positive value.

Now that we have a transformed variable Y (the log of ODDSRATIO) that uses the probability p, then we can use this variable as the dependent variable for our regression mode.

Then, if X1 increases in one unit, then the ODDSRATIO will be equal to ODDSRATIO times eb1. Then, eb1 will be the factor that indicates how many times the ODDSRATIO changes with a 1-unit change in X1.

Then:

·If eb1 = 1, then the ODDSRATIO does not change, meaning that there is no relationship between the variable X1 and the probability of the event.

·If eb1 > 1, then the ODDSRATIO will grow by this factor, meaning that there is a positive relationship between X1 and the probability of the event.

·If eb1 < 1, then the ODDSRATIO will decrease by this factor, meaning that there is a negative relationship between X1 and the probability of the event.

Then, when we see the output of a logistic regression, we need to apply the exponential function to the coefficients to provide a meaningful interpretation of the magnitude of the coefficients in the model.

Data collection

The dataset we will use for this workshop and for the final project is located in a public we site.

This dataset has real historical information of all US public firms listed in NASDAQ and NYSE. You can download the dataset as follows.

You first need to install the package readxl and dplyr. You can do so using the right-hand side window of RStudio in the “Package” tab.

Before running the rest of the chunks, you need to save your .Rmd in your computer.

# To avoid scientific notation:
options(scipen=999)
#Activate the libraries
library(readxl)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Clear the environment
rm(list=ls())

# Download the panel dataset from an online text file: (this must be done only once- set this as a comment afterwards)
download.file("http://www.apradie.com/datos/usactive2020q.csv",
              "usactive2020q.csv")

# The second parameter is a name for the
#  local file in the same folder where you saved your .Rmd 

Now we import this file into our R environment:

uspanel <- read.csv("usactive2020q.csv")

We import the data dictionary for the variables of this dataset:

download.file("http://www.apradie.com/datos/usdatadictionary.csv",
              "usdatadictionary.csv")

dictionary <- read.csv("usdatadictionary.csv")

All the information of the financial variables is in 1000’s (thousands). For example, if you see that a firm revenue in a specific quarter is equal to 1,000,000 this means that the firm sold 1 thousand millions.

Setting the dataset as panel data

This dataset has the panel data structure: each firm has many variables for many periods (quarters), and the data of each firm is piled one over the other one.

It is very important to set the dataset as panel data. We use the library plm to do this:

library(plm)
## 
## Attaching package: 'plm'
## The following objects are masked from 'package:dplyr':
## 
##     between, lag, lead
uspanel <- pdata.frame(uspanel, index= c("ticker","q"))

This is a very important step since we need to calculate returns that use lagged (previous) values of stock prices. Remember that we have different firms piled over another, so to get lagged values of a variable, it is necessary to know that after the last period of one firm, the next row will be the first period of another firm!

Calculation of financial variables and ratios

Here we illustrate how to calculate a few financial variables, ratios and returns for all firms and all quarters.

· Gross profit (grossprofit): Revenue - Cost of good Sold

uspanel$grossprofit = uspanel$revenue - uspanel$cogs

· Earnings before interest and taxes (ebit): Gross profit - Sales & general administrative expenses

uspanel$ebit = uspanel$grossprofit - uspanel$sgae

· Net Income (netincome): ebit - financial expenses - income taxes

uspanel$netincome = uspanel$ebit - uspanel$finexp - uspanel$incometax

· Annual market return: use adjusted stock price and remember that you have quarterly data.

Here we have to use the lagged value of adjusted stock price:

uspanel$stockannual_R = uspanel$adjprice / plm::lag(uspanel$adjprice,k=4) - 1

· Market capitalization: (marketcap): original stock price * shares outstanding.

uspanel$marketcap = uspanel$originalprice * uspanel$sharesoutstanding

This is the market value of the firm in each quarter. We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares.

Check that we used original stock price, not adjusted stock price. In financial markets, the adjusted stock prices are calculated after considering dividend payments and stock splits. A stock split is when a firm decides to divide the value of its stock price by 2, 3 or other multiple with the only purpose to avoid the perception that the stock is expensive. For example, late August 2020 Apple and Tesla decided to do stock split. Apple did a split on a 4-for-1 basis. This means that if the stock price was about USD $400.00 on that day, then its price was reduced to USD $100.00, but they multiplied the number of shares (shares outstanding) by 4 to keep the same market value of the firm. In this historical dataset the shares outstanding is the historical, so we need to use the historical/original stock price without considering stock splits nor dividend payments.

Now we calculate the operational earnings per share ratio:

· Operational earnings per share (oeps): ebit / shares outstanding

uspanel$oeps = ifelse(uspanel$sharesoutstanding==0,NA,uspanel$ebit / uspanel$sharesoutstanding)

· Operational earnings per share deflated by stock price (oepsp) : oeps / original stock price

uspanel$oepsp = ifelse(uspanel$originalprice==0,NA,uspanel$oeps / uspanel$originalprice)

Note that we use the ifelse function to validate in case that the denominator has a zero value. This is important since R can calculate infinite values that can cause problems in our analysis.

Winsorization of ratios

Before we run any regression model, it is very important to check that the independent variables do not have very extreme values. When one or more independent variables have very extreme values, the estimation of the regression coefficients and standard errors can be biased (not reliable).

We can deal with extreme variables (called outliers) applying the winsorization process. The winsorization process replaces the very extreme values at a specific percentile with a value of the variable in that percentile.

We first see the histogram of the oepsp ratio to see the distribution of extreme values:

hist(uspanel$oepsp)

We see extreme values to the left and few to the right. Apply a winsorization of 2% to the left and 0.5% to the right:

library(statar)
uspanel$oepspw <- winsorize(uspanel$oepsp, probs = c(0.02,0.995))
## 1.42 % observations replaced at the bottom
## 0.36 % observations replaced at the top
hist(uspanel$oepspw)

Running a logit model with financial ratios

As an example of a logistic model, we will run a model to examine how the oepsp is related with the probability of a firm to beat the market return.

We first have to calculate the market annual return, and then compare whether the stock return is higher than the market return. If that is the case, we assign a dummy variable = 1; 0 otherwise:

# Creating the market annual return:
uspanel$marketannual_R = uspanel$SP500index / plm::lag(uspanel$SP500index,k=4) - 1

#Creating the binary variable to see if the stock return is higher than the market return:
uspanel$r_above_market = ifelse(is.na(uspanel$stockannual_R),NA,
                            ifelse(uspanel$stockannual_R>uspanel$marketannual_R,1,0))
# we can see how many 1's and 0's were calculated:
table(uspanel$r_above_market)
## 
##     0     1 
## 54941 44919

Now we are ready to run the logit model. We need to use the glm function:

logitm1 <- glm(r_above_market ~ oepspw ,data = uspanel, family = "binomial",na.action = na.omit)
summary(logitm1)
## 
## Call:
## glm(formula = r_above_market ~ oepspw, family = "binomial", data = uspanel, 
##     na.action = na.omit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6041  -1.0989  -0.9478   1.2552   1.6292  
## 
## Coefficients:
##              Estimate Std. Error z value            Pr(>|z|)    
## (Intercept) -0.249627   0.006659  -37.49 <0.0000000000000002 ***
## oepspw       1.633521   0.048049   34.00 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 135000  on 98064  degrees of freedom
## Residual deviance: 133771  on 98063  degrees of freedom
##   (53295 observations deleted due to missingness)
## AIC: 133775
## 
## Number of Fisher Scoring iterations: 4

YOU HAVE TO INTERPRET THIS MODEL WITH YOUR WORDS.

LOOKING AT THE BETA COEFFICIENT OF THE EXPLANATORY VARIABLE oepspw WE CAN SEE THAT THE COEFFICIENT IS POSITIVE AND STATISTICALLY SIGNIFICANT (SINCE ITS P-VALUE IS MUCH LESS THAN 0.05). THEN, WE CAN SAY THAT EARNINGS PER SHARE DEFLATED BY PRICE IS SIGNIFICANTLY AND POSITIVELY RELATED WITH THE PROBABILITY OF A STOCK BEATING THE MARKET ONE YEAR IN THE FUTUER (IN TERMS OF ANNUAL RETURN).

AS I MENTIONED EARLIER, THE WAY WE INTERPRET THE MAGNITUD OF THE COEFFICIENT IS NOT THE SAME AS IN THE CASE OF MULTIPLE REGRESSION. AS EXPLAINED ABOVE, WE NEED TO EXPONENTIATE THE COEFFICIENT TO HAVE A BETTER INTERPRETATION. ANOTHER WAY TO UNDERSTAND WHY WE NEED TO GET THE EXPONENTIAL OF THE COEFFICIENT IS THAT THE ACTUAL DEPENDENT VARIABLE OF THE LOGIT MODEL IS THE LOGARITHM OF ODD RATIO. BE AWARE THAT THE DEPENDENT VARIABLE IS NOT THE PROBABILITY OF THE EVENT! ALSO, THE DEPENDENT VARIABLE IS NOT THE ODDS RATIO; IT IS THE LOGARITHM OF ODD RATIO. THEN, IF WE APPLY THE EXPONENTIAL FUNCTION TO THE LOGARITHM OF THE ODDS RATIO WE GET THE ACTUAL ODDS RATIO.

SINCE THE EXPECTED VALUE OF Y IS ACTUALLY THE REGRESSION EQUATION, THEN I CAN APPLY THE EXPONENTIAL EQUATION TO THE REGRESSION EQUATION AND GET TRANSFORMED BETAS THAT ARE RELATED TO THE ODDS RATIO (NOT TO THE LOG OF ODDS RATIO). THIS IS CONSISTENT WITH A PREVIOUS EXPLANATION ABOUT HOW TO INTERPRET THE MAGNITUDE OF A COEFFICIENT OF A LOGIT REGRESSSION.

· DADO QUE ESTAMOS TRABAJANDO CON EL LOGARITMO DEL ODDSRATIO, ES IMPORTANTE RECORDAR QUE EL DATO SIGNIFICATIVO ES EL EXPONENTE DE LOS COEFICIENTES DE LA REGRESIÓN.

IF THE EXPLANATORY VARIABLE INCREASES IN 1 UNIT. THEN, (eb1) IS THE TRANSFORMED COEFFICIENT WE NEED TO ESTIMATE IN ORDER TO HAVE A BETTER INTERPRETATION OF THE REGRESSION OUTPUT. LET’S GET THE EXPONENTIAL OF THE REGRESSION COEFFICIENTS:

# I USE THE coef FUNCTION TO EXTRACT THE COEFFICIENT FROM THE LOGIT MODEL:
coefficients <- coef(logitm1)

coefficients
## (Intercept)      oepspw 
##  -0.2496267   1.6335214
# I GET b0 AND b1 FROM THIS VECTOR:
b0 = coefficients[1]
b1 = coefficients[2]

NOW I CAN APPLY THE EXPONENTIAL FUNCTION TO THIS VECTOR THAT CONTAINS THE COEFFICIENTS b0 AND b1:

odd_coeffs <- exp(coefficients)
odd_coeffs
## (Intercept)      oepspw 
##   0.7790916   5.1218791
# I SAVE THE MODIFIED COEFFICIENTS:
modified_b0 = odd_coeffs[1]
modified_b1 = odd_coeffs[2]

NOW THE TRANSFORMED b1 IS 5.1218791. WE INTERPRET THIS COEFFICIENT AS FOLLOWS:

IF A FIRM IMPROVES ITS oepspw BY ONE UNIT, ITS PROBABILITY TO BEAT THE MARKET IN THE SAME QUARTER WILL BE 5.1218791 TIMES HIGHER THAN THE CASE THAT THE FIRM DOES NOT IMPROVE ITS oepspw.

THE INTERPRETATION OF b0 IS THE FOLLOWING. IF oepspw IS EQUAL TO ZERO, THEN THE TRANSFORMED COEFFICIENT b0 WILL BE THE EXPECTED ODDS RATIO. IN THIS CASE, THE TRANSFORMED COEFFICIENT OF b0 IS 0.7790916, which is less than 1. THIS MEANS THAT IF A FIRM HAS AN oepsp=0, THEN IT WILL BE MORE LIKELY THAT THE FIRM DOES NOT BEAT THE MARKET. IN OTHER WORDS, THE PROBABILITY OF BEATING THE MARKET WILL BE LESS THAN THE PROBABILITY OF NOT BEATING THE MARKET.

IN THE REAL WORLD, IT IS IMPOSSIBLE TO IMPROVE EARNINGS PER SHARE DEFLATED BY PRICE IN ONE UNIT SINCE THE NORMAL RANGE OF OEPSP IS BETWEEN -0.5 TO ABOUT +0.4. THEN, IT COULD BE A GOOD IDEA TO THINK IN AN INTERPRETATION USING AN SCENARIO THAT THE FIRM INCREASES ITS oepspw BY +0.1 INSTEAD OF +1. HOWEVER, TO DO THIS, WE NEED TO GET A NEW TRANSFORMED COEFFICIENT. FOLLOWING THE PREVIOUS EQUATION OF THE SCENARIO WHEN THE oepsw CHANGES BY +1 UNIT:

IF oepsw INCREASEES IN +0.1, THEN, I NEED TO GET THE EXPONENTIAL OF 0.1 TIMES oepspw:

new_modified_b1 = exp(0.1*b1)
new_modified_b1
##   oepspw 
## 1.177451

WE CAN INTERPRET THIS NEW MODIFIED COEFFICIENT AS FOLLOWS. IF A FIRM IMPROVES ITS oepspw BY 0.1, ITS PROBABILITY TO BEAT THE MARKET IN THE SAME QUARTER WILL BE 1.1774512 TIMES HIGHER THAN THE CASE THAT THE FIRM DOES NOT IMPROVE ITS oepspw.

Another interesting model is to change the dependent variable to be whether the stock will beat the market return in the future 1 year later. We can do this by calculating the future value 4 quarters later for the binary variable r_above_market using the lag function of the plm package:

class(uspanel)
## [1] "pdata.frame" "data.frame"
uspanel$F4_r_above_market <- plm::lag(uspanel$r_above_market,-4)

We use the lag function, but with a negative value to indicate that we want a future value of the variable. It is always good to check whether the dataset is a pdata.frame before we run the lag function.

Now we can run the logit model with this future binary variable to see how much the earnings per share of one quarter is related with the probability that the firm beats the market 1 year later:

logitm2 <- glm(F4_r_above_market ~ oepspw ,data = uspanel, family = "binomial",na.action = na.omit)
summary(logitm2)
## 
## Call:
## glm(formula = F4_r_above_market ~ oepspw, family = "binomial", 
##     data = uspanel, na.action = na.omit)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.441  -1.098  -1.026   1.255   1.513  
## 
## Coefficients:
##              Estimate Std. Error z value            Pr(>|z|)    
## (Intercept) -0.232986   0.006776  -34.38 <0.0000000000000002 ***
## oepspw       1.121954   0.048392   23.18 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 130696  on 94911  degrees of freedom
## Residual deviance: 130142  on 94910  degrees of freedom
##   (56448 observations deleted due to missingness)
## AIC: 130146
## 
## Number of Fisher Scoring iterations: 4

I WILL INTERPRET THIS MODEL USING THE MODIFIED COEFFICIENT IN THE CASE THAT THE oepspw CHANGES IN 0.1 UNIT. THEN, FOLLOWING THE SAME LOGIT WE WORKED THE PREVIOUS MODEL, WE CALCULATE THE MODIFIED BETA COEFFICIENTS:

# I EXTRACT THE COEFFICIENTS FROM THE MODEL:
coefficients2 <- coef(logitm2)

# I APPLY THE EXPONENTIAL FUNCTION TO BETA0: 
modified_b0 = exp(coefficients2[1])
modified_b0
## (Intercept) 
##   0.7921644
# I APPLY THE EXPONENTIAL FUNCTION TO BETA1 TIMES 0.1:
modified_b1 = exp(0.1*coefficients2[2])
modified_b1
##   oepspw 
## 1.118731

WE CAN INTERPRET THIS NEW MODIFIED COEFFICIENT AS FOLLOWS. IF A FIRM IMPROVES ITS oepspw BY 0.1, ITS PROBABILITY TO BEAT THE MARKET ONE YEAR LATER WILL BE 1.1187314 TIMES HIGHER THAN THE CASE THAT THE FIRM DOES NOT IMPROVE ITS oepspw.

Prediction with the logit model

We can use the last model to predict the probability whether the firm will beat the market return 1 year in the future. For this prediction, it might be a good idea to select only the firms in the last quarter of 2020, so we can predict which firm will beat the market in 2021:

We create a dataset with only the Q4 of 2020:

firmsQ42020 <- uspanel %>%
          select(ticker,q,year, F4_r_above_market,oepspw) %>%
          filter(q=="2020-10-01") %>%
        as.data.frame()
hist(firmsQ42020$oepspw)

Now we run the prediction using the model 2 with this new dataset:

firmsQ42020 <- firmsQ42020 %>% 
  mutate(pred=predict.glm(logitm2,newdata=firmsQ42020,type=c("response")) )

We can do a histogram to see how this predicted probability behaves:

hist(firmsQ42020$pred,breaks=20)

We can also see how the predicted probability of beating the market changes with changes in earnings per share:

# The plot function expect the x and y values to be vectors, not columns of data frames.
# Then, I use the as.vector function before I do the plot:
plot(x=as.vector(firmsQ42020$oepspw),y=as.vector(firmsQ42020$pred))

It is curious to see that the relationship between oepspw and the probability of beating the market looks similar to a linear relationship. However, we can see a tiny curvature, indicating that the model is doing the non-linear effect, but if we had more extreme values of oepspw we could see the S-shape relationship.

Selection of best stocks based on results of the logit model

We can sort the firms according to the predicted probability of beating the benchmark, and then select those with the highest probability:

top40firms <- firmsQ42020 %>%
     arrange(desc(firmsQ42020$pred)) %>%
     top_n(40) 
## Selecting by pred
top40firms

CHALLENGE

BASED ON THIS WORKSHOP, DEFINE 3 TO 4 FINANCIAL VARIABLES/RATIOS AS EXPLANATORY VARIABLES THAT COULD IMPACT THE PROBABILITY OF A STOCK RETURN BEATING THE MARKET RETURN.

Generate the ratios, run at least one logit regression model (with 2 to 3 explanatory variables) and create the predictions and then select best 40 companies. For this exercise, it is optional to do the winsorization of the explanatory variables (For the final project, you must do the winsorization process).

TO AVOID BIAS IN THE SELECTION OF FINANCIAL VARIABLES/RATIOS FOR THE FINAL PROJECT, IN THIS SOLUTION I WILL JUST ADD ONE VARIABLE AND PROVIDE AN INTERPRETATION OF A LOGIT MODEL WITH 2 INDEPENDENT VARIABLES. I WILL ONLY ADD FIRM SIZE AS THE LOGARITHM OF MARKET CAPITALIZATION.

I GENERATE THE VARIABLE FIRM SIZE:

uspanel$size = log(uspanel$marketcap)
# I CAN SEE THE HISTOGRAM OF SIZE:
hist(uspanel$size)

IN THE CASE OF INCLUDING A FINANCIAL VARIABLE AS EXPLANATORY (NOT A RATIO), IT IS RECOMMENDED TO APPLY THE LOGARITHM SINCE MOST OF THE FINANCIAL VARIABLES HAVE A DISTRIBUTION VERY SKEWED TO THE RIGHT. WHEN WE APPLY THE LOGARITHM TO THESE TYPES OF VARIABLES, THE DISTRIBUTION OF THIS LOG BEHAVES VERY SIMILAR TO A NORMAL DISTRIBUTION.

BE CAREFUL ADDING MORE THAN ONE FINANCIAL VARIABLE THAT ARE VERY RELATED SINCE YOU CAN END UP IN MULTICOLLIENARITY. FOR EXAMPLE, IF YOU ADD THE LOG OF REVENUE AND COGS AS INDEPENDENT VARIABLES, IT IS VERY LIKELY THAT BOTH WILL BE VERY CORRELATED SINCE COGS IS VARIABLE COST THAT DEPENDS ON THE LEVEL OF REVENUES.

IN THE CASE OF ADDING FINANCIAL RATIOS, REMEMBER TO CHECK WHETHER THE RATIO HAS EXTREME VALUES. IF THAT IS THE CASE, THEN YOU NEED TO DO ITS WINSORIZATION BEFORE INCLUDING THE RATIO IN THE MODEL AS EXPLANATORY VARIABLE.

I ADD SIZE TO THE MODEL. I WILL USE MODEL 2 THAT HAS THE PROBABILITY OF BEATING THE MARKET ONE YEAR IN THE FUTURE:

logitm3 <- glm(F4_r_above_market ~ oepspw + size,data = uspanel, family = "binomial",na.action = na.omit)
summary(logitm3)
## 
## Call:
## glm(formula = F4_r_above_market ~ oepspw + size, family = "binomial", 
##     data = uspanel, na.action = na.omit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4874  -1.1095  -0.9868   1.2370   1.7264  
## 
## Coefficients:
##              Estimate Std. Error z value            Pr(>|z|)    
## (Intercept) -0.900872   0.043512  -20.70 <0.0000000000000002 ***
## oepspw       0.953737   0.049651   19.21 <0.0000000000000002 ***
## size         0.048622   0.003126   15.55 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 130696  on 94911  degrees of freedom
## Residual deviance: 129899  on 94909  degrees of freedom
##   (56448 observations deleted due to missingness)
## AIC: 129905
## 
## Number of Fisher Scoring iterations: 4

INTERESTINGLY, THE BETA COEFFICIENT OF oepspw CHANGED FROM THE PREVIOUS MODEL (IT CHANGED FROM 1.12 TO 0.95). IT IS COMMON THAT A NEW VARIABLE ADDED TO A MODEL CAN DIMINISH THE EXPLANATORY VARIABLE OF THE PREVIOUS VARIABLES INCLUDED IN THE MODEL.

THE EFFECT OF SIZE ON THE PROBABILITY OF BEATING THE MARKET ONE YEAR IN THE FUTURE IS POSITIVE AND SIGNIFICANT AFTER CONSIDERING THE EFFECT OF EARNINGS PER SHARE DEFLATED BY PRICE. IN OTHER WORDS, THE BIGGER THE COMPANY, THE MORE LIKELY THAT THE COMPANY WILL BEAT THE MARKET ONE YEAR LATER. INTERESTINGLY, THIS IS AGAINST PREVIOUS RESEARCH IN FINANCE THAT SAYS THAT IN THE LONG RUN, SMALL FIRMS ON AVERAGE EXPERINENCE HIGHER RETURNS THAN BIG FIRMS.

THE EFFECT OF oepspw ON THE PROBABILITY OF BEATING THE MARKET ONE YEAR IN THE FUTURE IS POSITIVE AND SIGNIFICANT AFTER CONSIDERING THE EFFECT OF SIZE.

TO KNOW WHICH OF THE TWO INDEPENDENT VARIABLES HAVE HIGHER EXPLANATORY POWER, WE NEED TO CHECK THE ABSOLUTE VALUE OF THEIR Z VALUE (NOT THE MAGNITUDE OF THEIR COEFFICIENTS). IN THIS CASE, WE CAN SEE THAT oepsw HAS MORE EXPLANATORY POWER THAN FIRM SIZE SINCE THE Z VALUE OF oepspw IS GREATER IN MAGNITUDE.

LET’S INTERPRET THE MAGNITUDE OF THE MODIFIED COEFFICIENT OF SIZE. WE FOLLOW THE SAME LINE OF REASONING WE DID WITH oepspw:

# GETTING THE COEFFICIENT MATRIX:

coefficients = coef(logitm3)
# IN THIS CASE, THE MATRIX WILL HAVE 3 ROWS:
coefficients
## (Intercept)      oepspw        size 
## -0.90087238  0.95373689  0.04862248
b0 = coefficients[1]
b1 = coefficients[2]
b2 = coefficients[3]

# APPLYING THE EXPONENTIAL TO THE COEFFICIENT MATRIX:
odd_coeffs <- exp(coefficients)
odd_coeffs
## (Intercept)      oepspw        size 
##   0.4062151   2.5953902   1.0498239
#WE CAN APPLY THE EXPONENTIAL FUNCTION TO THIS MATRIX, OR APPLY THE EXPONENTIAL FUNCTION
#  TO EACH COEFFICIENT:
modified_b0 = exp(b0)
modified_b1 = exp(b1)
modified_b2 = exp(b2)

WE CAN INTERPRET THE SIZE COEFFICIENT AS FOLLOWS. SUPPOSE YOU HAVE FIRM A AND FIRM B WITH THE SAME VALUE OF EARNINGS PER SHARE DEFLATED BY PRICE. IMAGINE THAT FIRM B IS BIGGER IN SIZE THAN FIRM A. FIRM B HAS A SIZE EQUAL TO THE SIZE OF FIRM A PLUS 1. IN THIS CASE, FIRM B WILL BE 1.04 MORE LIKELY THAN FIRM A TO BEAT THE MARKET IN THE FUTURE ON YEAR LATER.

SINCE IT IS HARD TO IMAGINE HOW FIRM B IS BIGGER THAN FIRM A IN 1 UNIT OF SIZE (SINCE SIZE IS THE LOGARITHM OF MARKET VALUE), THEN WE CAN DO THE FOLLOWING TO HAVE A BETTER INTERPRETATION.

SINCE SIZE IS THE LOG OF MARKET CAPITALIZATION, LET’S COMPARE THE SIZE OF FIRM A AND FIRM B IN TERMS OF MARKET CAPITALIZATION:

IMAGINE THAT FIRM A HAS A MARKET CAPITALIZATION OF $US 1,000 MILLION. THEN LET’S SEE ITS SIZE AS THE LOG OF ITS MARKET CAP, AND ALSO LET’S CALCULATE THE SIZE OF FIRM B:

MARKETCAP_FIRM_A = 1000
SIZE_FIRM_A = log(MARKETCAP_FIRM_A)
# SIZE OF FIRM A:
SIZE_FIRM_A
## [1] 6.907755
# NOW, SIZE OF FIRM B IS 1 PLUS THE SIZE OF FIRM A:
SIZE_FIRM_B = SIZE_FIRM_A + 1
# SINCE SIZE IS THE LOG, THEN WE CAN APPLY THE EXPONENTIAL FUNCTION TO CALCULATE THE MARKET CAP OF FIRM B:
MARKETCAP_FIRM_B = exp(SIZE_FIRM_B)
# THE MARKET CAP OF FIRM B WOULD B:
MARKETCAP_FIRM_B
## [1] 2718.282

THIS MEANS THAT FIRM B WILL 2,718,28 MILLIONS OF MARKET CAP, WHICH IS 2.71 TIMES BIGGER THAN FIRM A. THEN, WE CAN RE-FRAME THE INTERPRETATION AS FOLLOWS:

IMAGINE THAT FIRM B IS 2.71 TIMES BIGGER (IN MARKET CAP) THAN FIRM A. ALSO, FIRM A AND FIRM B HAVE THE SAME VALUE OF EARNINGS PER SHARE DEFLATED BY PRICE. IN THIS CASE, FIRM B WILL BE 1.04 MORE LIKELY THAN FIRM A TO BEAT THE MARKET IN THE FUTURE ON YEAR LATER.

FOR A MORE EASY INTERPRETATION, WE CAN CALCULATE THE FACTOR WE NEED TO USE TO GET FIRM B EQUAL TO THE DOUBLE OF FIRM A. IN THIS CASE, WE ONLY NEED TO GET THE LOG OF 2, AND USE THIS VALUE AS THE MULTIPLIER TO THE BETA COEFFICIENT WHEN TRANSFORMING THE BETA:

# DEFINE THE NUMBER OF TIMES FIRM B IS BIGGER THAN FIRM A:
SIZE_FACTOR = 2
# THE MULTIPLIER WILL BE THE LOG OF 2

# CALCULATING THE NEW MODIFIED BETA:

new_modified_b1 = exp(log(SIZE_FACTOR)*b2)
new_modified_b1
##     size 
## 1.034277

REFRAMING THE INTERPRETATION:

IMAGINE THAT FIRM B IS THE DOUBLE IN SIZE (IN MARKET CAP) THAN FIRM A. ALSO, FIRM A AND FIRM B HAVE THE SAME VALUE OF EARNINGS PER SHARE DEFLATED BY PRICE. IN THIS CASE, FIRM B WILL BE 1.03 MORE LIKELY THAN FIRM A TO BEAT THE MARKET IN THE FUTURE ON YEAR LATER.

THE COEFFICIENT OF oepspw diminished to 0.9537. APPLYING THE EXPONENTIAL FUNCTION AND CONSIDERING A CHANGE OF oepspw OF 0.1:

new_modified_b1 = exp(0.1*coefficients[2])
new_modified_b1
##  oepspw 
## 1.10007

THIS MEANS THAT A FIRM THAT DOES NOT CHANGE ITS SIZE, AND ITS EPS CHANGES IN 0.1, ITS PROBABILITY TO BEAT THE MARKET ONE YEAR IN THE FUTURE INCREASES IN 1.1 TIMES.

Workshop 2, Financial Modeling and Programming

The Linear regression model

The simple linear regression model is used to understand the linear relationship between two variables assuming that one variable, the independent variable (IV), can be used as a predictor of the other variable, the dependent variable (DV). In this part we illustrate a simple regression model with the Market Model.

The Market Model states that the expected return of a stock is given by its alpha coefficient (b0) plus its market beta coefficient (b1) multiplied times the market return. In mathematical terms:

\[E[Ri]=α+β(RM)\] We can express the same equation using B0 as alpha, and B1 as market beta: \[E[Ri]=β0+β1(RM)\] We can estimate the alpha and market beta coefficient by running a simple linear regression model specifying that the market return is the independent variable and the stock return is the dependent variable. It is strongly recommended to use continuously compounded returns instead of simple returns to estimate the market regression model. The market regression model can be expressed as: \[r(i,t)=b0+b1∗r(M,t)+εt\] Where:

εt is the error at time t. Thanks to the Central Limit Theorem, this error behaves like a Normal distributed random variable ∼ N(0, σε); the error term εt is expected to have mean=0 and a specific standard deviation σε (also called volatility).

r(i,t) is the return of the stock i at time t.

r(M,t) is the market return at time t.

b0 and b1 are called regression coefficients.

Running a market regression model with real data

Data collection

We first load the quantmod package and download monthly price data for Tesla and the S&P500 market index. We also merge both datasets into one:

# load package quantmod
library(quantmod)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
# Download the data
getSymbols(c("TSLA", "^GSPC"), from="2017-01-01", to= "2020-12-31", periodicity="monthly")
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting 
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## [1] "TSLA"  "^GSPC"
#Merge both xts-zoo objects into one dataset, but selecting only adjusted prices:

adjprices<-Ad(merge(TSLA,GSPC))

Return calculation

We calculate continuously returns for both, Tesla and the S&P500:

returns <- diff(log(adjprices)) 
#I dropped the na's:
returns <- na.omit(returns)

#I renamed the columns:
colnames(returns) <- c("TSLA", "GSPC")

Visualize the relationship

Do a scatter plot putting the S&P500 returns as the independent variable (X) and the stock return as the dependent variable (Y). We also add a line that better represents the relationship between the stock returns and the market returns.Type:

plot.default(x=returns$GSPC,y=returns$TSLA)
abline(lm(returns$TSLA ~ returns$GSPC),col='blue')

# As you see, I indicated that the Market returns goes in the X axis and 
#   Tesla returns in the Y axis. 
# In the market model, the independent variable is the market returns, while
#   the dependent variable is the stock return

Sometimes graphs can be deceiving. Always check the Y sale and the X scale. In this case, the X goes from -0.10 to 0.10, while the Y scale goes from -0.20 to 0.40. Then, the real slope of the line should be steeper.

WHAT DOES THE PLOT TELL YOU? BRIEFLY EXPLAIN

IN THIS GRAPH, WE CAN CLEARLY SEE THE MATHEMATICAL RELATIONSHIP BETWEEN THE TESLA STOCK RETURNS AND THE S&P500 RETURNS. AS WE CAN SEE, AT THE X AXIS WE HAVE THE INDEPENDENT VARIABLE, S&P 500 RETURNS; AND AT THE Y AXIS WE HAVE THE DEPENDENT VARIABLE, TESLA RETURNS IN THIS CASE. SO, THE TRACED LINEAR LINE COULD TELL US THE PROBABLE RETURNS TESLA COULD HAVE IN THE FUTURE ACCORDING TO ITS HISTORICAL RELATIONSHIP WITH THE MARKET RETURNS.

Running the market regression model

We can run the market regression model with the lm() function. The first parameter of the function is the DEPENDENT VARIABLE (in this case, the stock return), and the second parameter must be the INDEPENDENT VARIABLE, also named the EXPLANATORY VARIABLE (in this case, the market return).

What you will get is called The Market Regression Model. You are trying to examine how the market returns can explain stock returns from Jan 2017 to Dec 2020.

Assign your market model to an object named “model1”:

model1 <- lm(TSLA ~ GSPC, data=returns)
summary(model1)
## 
## Call:
## lm(formula = TSLA ~ GSPC, data = returns)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26666 -0.10957 -0.00789  0.10140  0.40797 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03660    0.02422   1.511 0.137772    
## GSPC         1.83953    0.49909   3.686 0.000611 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.162 on 45 degrees of freedom
## Multiple R-squared:  0.2319, Adjusted R-squared:  0.2148 
## F-statistic: 13.58 on 1 and 45 DF,  p-value: 0.0006106

YOU HAVE TO INTERPRET THIS MODEL. MAKE SURE YOU RESPOND THE FOLLOWING THE QUESTIONS:

  • IS THE STOCK SIGNIFICANTLY OFFERING RETURNS OVER THE MARKET? YES, IT IS OFFERING SIGNIFICANTLY RETURNS OVER THE MARKET SINCE THE BETA IS HIGHER THAN 1. NEVERTHELESS, IF THE MARKET GOES DOWN, THE STOCK WILL GO PROPORTIONALLY MORE DOWN.

  • IS THE STOCK SIGNIFICANTLY RISKIER THAN THE MARKET? CORRECT, SINCE WE CAN SEE THE B1 IS EQUAL TO 1.84; THIS MEANS THAT FOR EACH 1% OF INCREMENT INN THE INDEX, TESLA WILL INCREASE BY 1.84%. IF THE B1 WERE 1 THE STOCK WOULD PAID AS MUCH AS THE MARKET PAY. CLEARLY, A BETA OVER 1.5 WE CAN CONSIDER IT SIGNIFICANTLY RISKIER.

We can save the regression summary in another variable and display the regression coefficients:

s <-summary(model1)
s$coefficients
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 0.0366007 0.02422234 1.511031 0.1377715880
## GSPC        1.8395302 0.49909297 3.685746 0.0006106426

The beta0 coefficient is in the first row (the intercept), while the beta1 coefficient is in the second raw.

The regression coefficients will be in the first column of this matrix. The second column is the standard error of the coefficients, which are the standard deviation of the coefficients. The third column is the t-value of each coefficient, and finally their p-values in the fourth column.

We can get beta0, beta1, standard errors, t-values and p-values and store in other variables:

b0= s$coefficients[1,1]
b1= s$coefficients[2,1]

stderr_b0 = s$coefficients[1,2]
stderr_b1 = s$coefficients[2,2]

tval_b0 = s$coefficients[1,3]
tval_b1 = s$coefficients[2,3]

pvalue_b0 = s$coefficients[1,4]
pvalue_b1 = s$coefficients[2,4]

The reference [2,1] of the matrix s$coefficients means that I want to get the element of the matrix that is in the row #2 and the column #1. In this position we can find the beta1 of the model.

Then, any R matrix you can make reference as matrix[#row,#column] to get the value of any cell according to its position in row and column numbers.

We can also get the coefficients applying the function coef to the regression object:

coefs <- coef(model1)
coefs
## (Intercept)        GSPC 
##   0.0366007   1.8395302
b0 = coefs[1]
b1 = coefs[2]

Automating the calculation of market models for many stocks

Automating downloading and return calculation

If we want to analyze the market risk of many stocks, we need to calculate the market model for many stocks. The challenge here will be to automate the process of estimating market models for any list of tickers.

Let’s start with an easy example of 2 stocks:

Here we will learn how to automate the downloading of many stock prices and calculate returns starting with a definition of a set of tickers.

We define a vector with the tickers for Oracle, Netflix, and the S&P500, and download the monthly stock prices:

setSymbolLookup(GSPC = list(name="^GSPC"))
# Set the ticker ^GSPC as GSPC (previously defined name)

tickers <- c("ORCL","NFLX", "GSPC")

getSymbols(tickers, periodicity = "monthly",
             from = "2017-01-01", to = "2020-12-31")
## [1] "ORCL" "NFLX" "GSPC"

To avoid writing the name of the tickers in the code, we can use the get function. The get function receives a name of a dataset and then create a new dataset.

We apply the get function to the tickers vector:

datasets_list <- lapply(tickers, get)

Now we have the 3 datasets in a list; each element of the list is an xts dataset with the stock price data.

Now we merge all datasets using the function call. I do this to generalize our program for any number of tickers:

prices <- do.call(merge, datasets_list)
# We select only Adjusted prices:
prices <- Ad(prices)
# We change the name of the columns to be equal to the tickers vector:
names(prices)<-tickers

The do.call function is similar to the lapply function. For the do.call function, the first argument is the function, and the second argument is the list of datasets.

With this code, you can change the list of tickers from 2 to any set of tickers, and you will end up with a clean dataset of only adjusted prices for all the tickers.

Now we calculate continuously compounded returns for all tickers:

returns <- na.omit(diff(log(prices)))

Running manually the market model for 2 stocks

Now we can run the market model for both stocks manually:

model1 <- lm(returns$ORCL ~ returns$GSPC)
summary(model1)
## 
## Call:
## lm(formula = returns$ORCL ~ returns$GSPC)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.086366 -0.026758 -0.002029  0.032639  0.092158 
## 
## Coefficients:
##              Estimate Std. Error t value   Pr(>|t|)    
## (Intercept)  0.003902   0.006577   0.593      0.556    
## returns$GSPC 0.722066   0.135521   5.328 0.00000306 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.044 on 45 degrees of freedom
## Multiple R-squared:  0.3868, Adjusted R-squared:  0.3732 
## F-statistic: 28.39 on 1 and 45 DF,  p-value: 0.000003064
model2 <- lm(returns$NFLX ~ returns$GSPC)
summary(model2)
## 
## Call:
## lm(formula = returns$NFLX ~ returns$GSPC)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.197992 -0.044412  0.001473  0.042125  0.276328 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.01964    0.01339   1.467  0.14930   
## returns$GSPC  0.84664    0.27585   3.069  0.00363 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08955 on 45 degrees of freedom
## Multiple R-squared:  0.1731, Adjusted R-squared:  0.1547 
## F-statistic:  9.42 on 1 and 45 DF,  p-value: 0.00363
# Since ORCL is in the 1st column, and the GSPC index in the 3rd column:
model1 <- lm(returns[,1] ~ returns[,3])
# Since NFLX is in the 2nd column, and the GSPC index in the 3rd column:
model2 <- lm(returns[,2] ~ returns[,3])
summary(model1)
## 
## Call:
## lm(formula = returns[, 1] ~ returns[, 3])
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.086366 -0.026758 -0.002029  0.032639  0.092158 
## 
## Coefficients:
##              Estimate Std. Error t value   Pr(>|t|)    
## (Intercept)  0.003902   0.006577   0.593      0.556    
## returns[, 3] 0.722066   0.135521   5.328 0.00000306 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.044 on 45 degrees of freedom
## Multiple R-squared:  0.3868, Adjusted R-squared:  0.3732 
## F-statistic: 28.39 on 1 and 45 DF,  p-value: 0.000003064
summary(model2)
## 
## Call:
## lm(formula = returns[, 2] ~ returns[, 3])
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.197992 -0.044412  0.001473  0.042125  0.276328 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.01964    0.01339   1.467  0.14930   
## returns[, 3]  0.84664    0.27585   3.069  0.00363 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08955 on 45 degrees of freedom
## Multiple R-squared:  0.1731, Adjusted R-squared:  0.1547 
## F-statistic:  9.42 on 1 and 45 DF,  p-value: 0.00363

The only thing that changes in the lm models is the column name used for the dependent variable (1 or 2); the rest of the parameters is exactly the same.

How can we run these models using a loop?

A loop helps us to execute a set of commands many times. Let’s see this example:

for (i in 1:2) {
  model <- lm(returns[,i] ~ returns[,3])
  print(summary(model))
}
## 
## Call:
## lm(formula = returns[, i] ~ returns[, 3])
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.086366 -0.026758 -0.002029  0.032639  0.092158 
## 
## Coefficients:
##              Estimate Std. Error t value   Pr(>|t|)    
## (Intercept)  0.003902   0.006577   0.593      0.556    
## returns[, 3] 0.722066   0.135521   5.328 0.00000306 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.044 on 45 degrees of freedom
## Multiple R-squared:  0.3868, Adjusted R-squared:  0.3732 
## F-statistic: 28.39 on 1 and 45 DF,  p-value: 0.000003064
## 
## 
## Call:
## lm(formula = returns[, i] ~ returns[, 3])
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.197992 -0.044412  0.001473  0.042125  0.276328 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.01964    0.01339   1.467  0.14930   
## returns[, 3]  0.84664    0.27585   3.069  0.00363 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08955 on 45 degrees of freedom
## Multiple R-squared:  0.1731, Adjusted R-squared:  0.1547 
## F-statistic:  9.42 on 1 and 45 DF,  p-value: 0.00363

Use a Loop to generate information and store it in a Matrix.

An analyst made an analysis and tells you that financial leverage has a negative quadratic effect on the expected financial performance. This is a typical inverted U-shaped relationship. In other words, financial leverage has a positive effect on performance up to a certain level of leverage; after this level the relationship becomes negative: the more financial leverage, the less the firm performance. The equation is the following:

\[Performance=20−(leverage−4)^{2}\] Leverage is measure from level 0 (no leverage) up to the maximum leverage = 10. Calculate the expected firm performance for leverage levels 0 to 10 and store the results in a matrix:

# I initialize the matrix as empty:
matrix_results = c()
# I loop over i that will take values from 0 to 1 jumping by 1:
for (i in seq(from=0,to=10, by=1)) {
  # I calculate the performance according to the leverage level i:
  performance = 20 - (i - 4)^2
  # I put the leverage and performance in a vector:
  result = c(i,performance)
  # I append/ attach the result vector to the matrix
  # rbind stands for row bind, so I attach the result vector as a row below to whatever the matrix has 
  #   at the moment of the iteration
  matrix_results = rbind(matrix_results,result)
}
# I change the matrix to a data frame; data frames have advantages compared to matrices

matrix_results = as.data.frame(matrix_results)
# I set the names for the columns:
names(matrix_results) = c("leverage","performance")
matrix_results
# plotting performance for different leverage levels:
plot(x=matrix_results$leverage, y=matrix_results$performance)

Imagine that I want to select only those cases where performance was positive. I can do a selection of a data frame as follows:

library(dplyr)
selected_cases <- matrix_results %>%
      filter(performance>0)
selected_cases

CHALLENGE 1

The problem here is that we run both models to create the linear regressions, but we only displayed the models, but did not save the beta information for each stock. Adjust the loop of section 4.2 to store the data in a matrix that contains B0, and B1 coefficients, standard errors and PValues for each company.

Answer the following questions:

HOW CAN WE STORE THIS INFORMATION FOR BOTH STOCKS? WHICH STRUCTURE CAN YOU USE?

CREANDO UNA NUEVA MATRIZ DONDE TRAS CREAR EL LOOP NOS GUARDE TODOS LOS DATOS Y FORME ESTE NUEVA DATASET. LA ESTRUCTURA VENDRÍA SIENDO UNA MATRIZ DONDE INDIQUEMOS COMO FILAS LAS ACCIONES Y COMO COLUMNAS LOS COEFICIENTES.

EXTRACT THE COEFFICIENTS, CREATING A MATRIX WITH A LOOP.

#MANUALLY

# Model 1 - ORACLE

s1 = summary(model1)

ORCL_B0 = s$coefficients[1,]
ORCL_B1 = s$coefficients[2,]

ORCL_R = c(ORCL_B0, ORCL_B1)
ORCL_R
##     Estimate   Std. Error      t value     Pr(>|t|)     Estimate   Std. Error 
## 0.0366007042 0.0242223393 1.5110309431 0.1377715880 1.8395301552 0.4990929691 
##      t value     Pr(>|t|) 
## 3.6857464820 0.0006106426
# Model 2 - NETFLIX

s2 = summary(model2)

NFLX_B0 = s2$coefficients[1,]
NFLX_B1 = s2$coefficients[2,]

NFLX_R = c(NFLX_B0, NFLX_B1)
NFLX_R
##    Estimate  Std. Error     t value    Pr(>|t|)    Estimate  Std. Error 
## 0.019641642 0.013387878 1.467121389 0.149297636 0.846642606 0.275852622 
##     t value    Pr(>|t|) 
## 3.069184556 0.003630354
matrix_results_w2 = c()
for (i in 1:2) {
  model <- lm(returns[,i] ~ returns[,3])
  s_w2 = summary(model)
  
#Coeficientes a almacenar
  B0_Est = s_w2$coefficients[1,1]
  B0_Std_Err = s_w2$coefficients[1,2]
  B0_TV = s_w2$coefficients[1,3]
  B0_PV = s_w2$coefficients[1,4]
  B1_Est = s_w2$coefficients[2,1]
  B1_Std_Err = s_w2$coefficients[2,2]
  B1_TV = s_w2$coefficients[2,3]
  B1_PV = s_w2$coefficients[2,4]
  
#Creamos la matriz  
  RESULT = c(i, B0_Est, B0_Std_Err, B0_TV, B0_PV, B1_Est, B1_Std_Err, B1_TV, B1_PV)
  matrix_results_w2 = rbind(matrix_results_w2, RESULT)
}

matrix_results_w2 = as.data.frame(matrix_results_w2)
names(matrix_results_w2) = c("Ticke No.", "B0 Coef.", "B0 Std. Error", "B0 T Value", "B0 P Value", "B1 Coef.", "B1 Std. Error", "B1 T Value", "B1 P Value")

matrix_results_w2

HOW CAN YOU SELECT THOSE STOCKS THAT HAVE A SIGNIFICANT BETA0, GREATER THAN ZERO?

APLICANDO UN FILTRO QUE ARROJE MEDIANTE UNA CONDICIONAL AQUELLAS ACCIONES QUE SUPEREN UN B0 SUPERIOR A 0.

library(dplyr)
B0_Above_0 <- matrix_results_w2 %>%
      filter(B0_Est > 0)
B0_Above_0

AT THIS PART, WE CAN SEE THE ONLY STOCK THAT HAVE A SIGNIFICANT BETA0 GREATER THAN 0 IS THE SECOND ONE; PRECISELY NETFLIX. THE FILTER TELL US BOTH ARE GREATER TAN 0, BUT THE VERY FIRST ONE IN MY CRITERIA ISN’T RELEVANT. NEVERTHELESS, BOTH BETAS HAVE A P VALUE HIGHER THAN .05, SO WE CAN CONCLUDE THEY AREN’T STATISCALLY SUPPORTED.

CHALLENGE 2

Based on the prelinimary selection of 40 to 50 stocks on the Challenge of workshop 1: 1.- Create a list with the ticker of each company 2.- Using a Loop, generate the market model for each company and store the data 3.- Based on the data of each company select a set of 10 companies based on any of these criteria:

CRITERIA A: Stocks that are SIGNIFICANTLY offering returns over the market

top40firms <- firmsQ42020 %>%
     arrange(desc(firmsQ42020$pred)) %>%
     top_n(40) 
## Selecting by pred
top40firms
setSymbolLookup(GSPC = list(name="^GSPC"))
# Set the ticker ^GSPC as GSPC (previously defined name)
tickers_w2 <- c("BRK-B", "COHN", "CPSS", "CYH", "GHLD", "GNK", "HCHC", "IBEX", "MTEX", "NMRK", "GSPC")

getSymbols(tickers_w2, periodicity = "monthly",
             from = "2017-01-01", to = "2020-12-31")
## Warning in read.table(file = file, header = header, sep = sep,
## quote = quote, : incomplete final line found by readTableHeader
## on 'https://query1.finance.yahoo.com/v7/finance/download/GHLD?
## period1=1483228800&period2=1609372800&interval=1mo&events=history&crumb=Qf0.CRwQfm8'
## pausing 1 second between requests for more than 5 symbols
## pausing 1 second between requests for more than 5 symbols
## pausing 1 second between requests for more than 5 symbols
## Warning in read.table(file = file, header = header, sep = sep,
## quote = quote, : incomplete final line found by readTableHeader
## on 'https://query2.finance.yahoo.com/v7/finance/download/IBEX?
## period1=1483228800&period2=1609372800&interval=1mo&events=history&crumb=Qf0.CRwQfm8'
## pausing 1 second between requests for more than 5 symbols
## pausing 1 second between requests for more than 5 symbols
## pausing 1 second between requests for more than 5 symbols
## pausing 1 second between requests for more than 5 symbols
##  [1] "BRK-B" "COHN"  "CPSS"  "CYH"   "GHLD"  "GNK"   "HCHC"  "IBEX"  "MTEX" 
## [10] "NMRK"  "GSPC"
datasets_list_w2 <- lapply(tickers_w2, get)

prices_w2 <- do.call(merge, datasets_list_w2)
# We select only Adjusted prices:
prices_w2 <- Ad(prices_w2)
# We change the name of the columns to be equal to the tickers vector:
names(prices_w2)<-tickers_w2

returns_w2<- na.omit(diff(log(prices_w2)))

#model1_w2 <- lm(returns_w2$BRK-B ~ returns_w2$GSPC)
#summary(model1_w2)

#model2_w2 <- lm(returns_w2$COHN ~ returns_w2$GSPC)
#summary(model2_w2)

#model3_w2 <- lm(returns_w2$CPSS ~ returns_w2$GSPC)
#summary(model3_w2)

#model4_w2 <- lm(returns_w2$CYH ~ returns_w2$GSPC)
#summary(model4_w2)

#model5_w2 <- lm(returns_w2$GNK ~ returns_w2$GSPC)
#summary(model5_w2)

# Since ORCL is in the 1st column, and the GSPC index in the 3rd column:
#model1 <- lm(returns[,1] ~ returns[,3])
# Since NFLX is in the 2nd column, and the GSPC index in the 3rd column:
#model2 <- lm(returns[,2] ~ returns[,3])
#summary(model1)

#for (i in 1:5) {
  #model_w2 <- lm(returns[,i] ~ returns[,3])
  #print(summary(model_w2))
#}

#Coeficientes a almacenar
  #B0_Est = s_w2$coefficients[1,1]
  #B0_Std_Err = s_w2$coefficients[1,2]
  #B0_TV = s_w2$coefficients[1,3]
  #B0_PV = s_w2$coefficients[1,4]
  #B1_Est = s_w2$coefficients[2,1]
  #B1_Std_Err = s_w2$coefficients[2,2]
  #B1_TV = s_w2$coefficients[2,3]
  #B1_PV = s_w2$coefficients[2,4]
  
#Creamos la matriz  
  #RESULT = c(i, B0_Est, B0_Std_Err, B0_TV, B0_PV, B1_Est, B1_Std_Err, B1_TV, B1_PV)
  #matrix_results_w2 = rbind(matrix_results_w2, RESULT)
#}

#matrix_results_w2 = as.data.frame(matrix_results_w2)
#names(matrix_results_w2) = c("Ticke No.", "B0 Coef.", "B0 Std. Error", "B0 T Value", "B0 P Value", "B1 Coef.", "B1 Std. Error", "B1 T Value", "B1 P Value")

#matrix_results_w2_2

CRITERIA B: Stocks that are SIGNIFICANTLY less risky than the market according to the market model regressions.

On the following phases of the problem setup, you will be generating the portfolios with this stocks.