In this workshop we learn about logistic regression applied to fundamental analysis in Finance. In addition, we practice data management programming skills using a big panel dataset of historical financial statement variables, and we review the logit regression model and its application to fundamental analysis.
We will work with a real dataset of the historical financial variables of ALL US public firms listed on the NYSE and NASDAQ exchanges.
We will use a logit model to examine whether some financial ratios are related to the probability that the future stock return (1 year later) is higher than the future market return (1 year later).
We will learn basic programming skills for the required data management and data modeling.
The logit model is one type of non-linear regression model where the dependent variable is binary. The logistic or logit model is used to examine the relationship between one or more quantitative variables and the probability of an event happening (1=the event happens; 0 otherwise). For example, a bank that gives loans to businesses might be very interested in knowing which are the factors / variables / characteristics of firms that are more related to loan defaults. If the bank understands these factors, then it can improve its decisions about which firms deserve a loan, and minimize the losses due to loan defaults.
In this workshop, we will define the event to be whether a firm in a specific quarter had higher stock return compared to the market (S&P500 index) return. If the stock return is higher than the market return, then we codify the binary variable equal to 1; 0 otherwise.
Then, in this case, the dependent variable of the regression is the binary variable with 1 if the stock beats the market and 0 otherwise. The independent or explanatory variables can be any financial indicator/ratio/variable that we believe is related to the likelihood of a stock to beat the market in the near future.
We can define this logistic model using the following mathematical function. Imagine that Y is the binary dependent variable, then the probability that the event happens (Event=1) can be defined as:
\[Prob(Event=1)=f(X_{1},X_{2},...,X_{n})\] The binary variable Event can be either 1 or 0, but the probability of Event=1 is a continuous value from 0 to 1. The function f is a non-linear function defined as follows:
\[Prob(Event=1)=\frac{1}{1+e^{-(b_{0}+b_{1}X_{1}+b_{2}X_{2}+...+b_{n}X_{n})}}\]
As we can see, the argument of the exponential function is actually a traditional regression equation. We can re-express this equation as follows:
\[Y=b_{0}+b_{1}X_{1}+b_{2}X_{2}+...+b_{n}X_{n}\]
Now we use Y in the original non-linear function:
\[Prob(Event=1)=\frac{1}{1+e^{-Y}}\]
This is a non-linear function since the value of the function does not move in a linear way with a change in the value of one independent variable X.
Let’s work with an example of a model with only 1 independent variable X1. Imagine that X1 is the variable earnings per share (eps) of a firm, and the event is that the firm beats the market. Then, let’s do a simple example with specific values for b0, b1, and a range of values for eps from -1 to 1 jumping by 0.1:
# The seq function creates a numeric vector. We specify the first value, the last value and the step:
eps=seq(from=-1,to=1,by=0.1)
eps
## [1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4
## [16] 0.5 0.6 0.7 0.8 0.9 1.0
# This vector has 21 values for x1 including the zero
# I define b0=-0.5 and b1=10:
b0=-0.5
b1=10
# I define a temporal variable Y to be the regression equation:
Y = b0 + b1*eps
# Since eps is a vector, then R performs a row-wise calculation using the equation for all values of eps
# Now I create a vector with the values of the function according to the equation of the probability:
prob = 1 / (1 + exp(-(Y)) )
# I display the probability values of the function for all values of eps:
prob
## [1] 2.753569e-05 7.484623e-05 2.034270e-04 5.527786e-04 1.501182e-03
## [6] 4.070138e-03 1.098694e-02 2.931223e-02 7.585818e-02 1.824255e-01
## [11] 3.775407e-01 6.224593e-01 8.175745e-01 9.241418e-01 9.706878e-01
## [16] 9.890131e-01 9.959299e-01 9.984988e-01 9.994472e-01 9.997966e-01
## [21] 9.999252e-01
# Finally, I plot the function values for all values of x1:
plot(x=eps,y=prob,type="l")
Here we can see that the function is not linear with changes in eps. There is a range of values of eps close to 0 where the probability that the firm beats the market increases very fast; once eps is above about 0.4, the probability is already close to 1 and grows very slowly with any further increase in eps.
The interpretation of the magnitude of the coefficients b0 and b1 in logistic regression is not quite the same as in the case of multiple regression. However, the interpretation of the sign of the coefficient (positive or negative) and its level of significance (p-value) is the same as in the multiple regression model. What we can say up to now is that if b1 is positive and significant (its p-value<0.05), then the variable, in this case eps, is significantly and positively related to the probability that a firm beats the market return.
Before going into the interpretation of the magnitude of the coefficients, here is a quick explanation of how the logistic regression works and how it is estimated in specialized software (such as R).
Let’s continue with the same event, which is that the firm return beats the market (in other words, that the firm return is higher than the market return). Then:
p = probability that the firm beats the market (Event=1); or that the event happens.
(1-p) = probability that the firm DOES NOT beat the market (Event=0); or that the event does not happen.
To have a dependent variable that can get any numeric value from a negative value to a positive value, we can do the following mathematical transformation with these probabilities:
\[Y=log(\frac{P}{1-P})\]
The ratio P/(1-P) is called the odds ratio:
\[ODDSRATIO =(\frac{P}{1-P})\]
The odds ratio is the ratio of the probability of the event happening to the probability of the event NOT happening. Since p can have a value from 0 to 1, then the possible values of ODDSRATIO can be from 0 (when p=0) to infinity (when p=1). Since we want a variable that can have values from any negative to any positive value, then we can just apply the logarithmic function, and then the range of this log will be from any negative value to any positive value.
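A quick numeric illustration of these ranges (just a sketch with three arbitrary probabilities):
# Odds ratio and its logarithm for a low, a middle and a high probability
p = c(0.01, 0.5, 0.99)
odds = p / (1 - p)   # close to 0, exactly 1, and very large
log(odds)            # very negative, 0, and very positive: unbounded in both directions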
Now that we have a transformed variable Y (the log of ODDSRATIO) that uses the probability p, then we can use this variable as the dependent variable for our regression model:
\[Y = log(\frac{P}{1-P}) = b_{0}+b_{1}X_{1}\]
This is the actual regression model that R estimates, so the coefficients b0 and b1 define the logarithm of the odds ratio, not the actual probability p of the event happening!
How can we interpret the magnitude of the beta coefficients of this regression? Let’s do a mathematical trick from the previous equation. We can apply the exponential function to both sides of the equation:
\[e^{Y}=e^{b_{0}}e^{b_{1}X_{1}}= ODDSRATIO\]
Then, if X1 increases by one unit, the ODDSRATIO will be equal to the previous ODDSRATIO times e^b1. In other words, e^b1 is the factor that indicates how many times the ODDSRATIO changes with a 1-unit increase in X1.
Then:
· If e^b1 = 1, then the ODDSRATIO does not change, meaning that there is no relationship between the variable X1 and the probability of the event.
· If e^b1 > 1, then the ODDSRATIO grows by this factor, meaning that there is a positive relationship between X1 and the probability of the event.
· If e^b1 < 1, then the ODDSRATIO decreases by this factor, meaning that there is a negative relationship between X1 and the probability of the event.
Then, when we see the output of a logistic regression, we need to apply the exponential function to the coefficients to provide a meaningful interpretation of the magnitude of the coefficients in the model.
If we want to estimate the probability p for any value of X1, we just need to do some algebraic manipulations to the previous equation:
\[Y=log(\frac{P}{1-P})=b_{0}+b_{1}X_{1}\]
Solving for the probability P:
\[P=\frac{1}{1+e^{-(b_{0}+b_{1}X_{1})}}\]
which is exactly the non-linear function we defined at the beginning.
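As a quick numeric check of this algebra, we can reuse the b0, b1, eps and prob objects from the illustrative example above: transforming the log-odds back into probabilities should reproduce exactly the prob vector we computed earlier (a minimal sketch, not part of the main workflow):
# Compute the log-odds with the illustrative coefficients defined above
logodds = b0 + b1*eps
# Transform the log-odds back into probabilities
prob_from_logodds = 1 / (1 + exp(-logodds))
# This should be identical to the prob vector computed earlier
all.equal(prob_from_logodds, prob)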
# To avoid scientific notation:
options(scipen=999)
#Activate the libraries
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Clear the environment
rm(list=ls())
# Download the panel dataset from an online excel file: (this must be done only once- set this as a comment afterwards)
download.file("http://www.apradie.com/datos/usactive2020q.xlsx",
"usactive2020q.xlsx", mode="wb")
# The second parameter is a name for the
# local file in the same folder where you saved your .Rmd
Now we import this Excel file into our R environment:
uspanel <- read_excel("usactive2020q.xlsx",sheet = "data")
We import the data dictionary for the variables of this dataset:
dictionary <- read_excel("usactive2020q.xlsx",sheet = "data dictionary")
All the information of the financial variables is in 1000's (thousands). For example, if you see that a firm's revenue in a specific quarter is equal to 1,000,000, this means that the firm sold 1,000 million (one billion) dollars.
#Setting the dataset as panel data
This dataset has a panel data structure: each firm has many variables for many periods (quarters), and the data of each firm is stacked one on top of the other.
It is very important to set the dataset as panel data. We use the library plm to do this:
library(plm)
##
## Attaching package: 'plm'
## The following objects are masked from 'package:dplyr':
##
## between, lag, lead
uspanel <- pdata.frame(uspanel, index= c("ticker","q"))
This is a very important step since we need to calculate returns that use lagged (previous) values of stock prices. Remember that the firms are stacked one on top of the other, so to get lagged values of a variable, it is necessary to know that after the last period of one firm, the next row is the first period of another firm!
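As a small sketch of why this matters (the helper column names below are hypothetical; adjprice is the adjusted price column used later in this workshop):
# Panel-aware lag: plm::lag on a pdata.frame shifts values within each firm (ticker),
# so the first 4 quarters of each firm get NA instead of another firm's prices:
# uspanel$adjprice_lag4 <- plm::lag(uspanel$adjprice, k=4)
# A naive lag that ignores the firm index would mix prices of different firms
# at the firm boundaries (do NOT use this to calculate returns):
# uspanel$adjprice_lag4_wrong <- dplyr::lag(as.numeric(uspanel$adjprice), 4)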
#Brief descriptive statistics
We can use the dplyr package to get important descriptive statistics for the US firms. Since we have panel data, we can select only the last quarter of 2020 to get one row per firm and then calculate important descriptive statistics about the sample.
##Creating a cross-sectional dataset by selecting last fiscal quarter of 2020
If we do descriptive statistics with the panel data, the information will be repeated since we have many rows for the same company. We need cross-sectional data where the information of each firm appears in only 1 row of the dataset.
In the panel data we have 4 quarters for each year, but only 1 quarter has the cumulative information for the whole fiscal year. In the US, most firms end their fiscal year in Q4; however, a few firms end the fiscal year in one of the other 3 quarters. To identify when each firm's fiscal year ends, there is a column in the dataset called fiscalmonth. If fiscalmonth=12, that quarter is the end of the corresponding fiscal year.
Then, we can select only those rows with year=2020 and fiscalmonth=12 to end up with the last fiscal annual data for each firm:
usfirms2020 <- uspanel %>%
filter(year==2020, fiscalmonth==12)
#Descriptive statistics for 2020
With this dataset we can do basic descriptive statistics to learn how the sample of firms is composed.
Let’s calculate some basic financial variables first.
As a measure of firm size we can calculate the market capitalization (or market value) of firms by multiplying the stock price times the number of shares outstanding:
marketcap = original stock price * shares outstanding.
We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares:
usfirms2020$marketcap = usfirms2020$originalprice * usfirms2020$sharesoutstanding
We can now compute summary statistics to learn the number of firms in the sample and important percentiles of market capitalization, which tell us about the size of the firms.
Remember that values are stored in thousands (1,000s) of US dollars.
size_summary <- usfirms2020 %>%
summarize(firms = n(),
median_marketcap = median(marketcap, na.rm = TRUE),
Q1_marketcap = quantile(marketcap, probs=c(0.25),na.rm=TRUE),
Q3_marketcap = quantile(marketcap,probs=c(0.75),na.rm=TRUE)
)
size_summary
We have 3,212 firms in the sample. The typical size of a US public firm in terms of market capitalization is given by the median, which is $1,392,349,065.28 (about $1,392 million). The median is the best measure of central tendency for any financial variable when we have many firms, since the distribution of these variables is always skewed to the right (there are very few very big firms and many firms with a more reasonable size).
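A quick optional check of this right skew is to compare the mean with the median of market capitalization; the mean is pulled far above the median by the few very large firms:
# Compare the mean against the median of market capitalization (a right-skewed variable)
mean(usfirms2020$marketcap, na.rm=TRUE)
median(usfirms2020$marketcap, na.rm=TRUE)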
Now we do a summary by industry:
by_industry_summary<- usfirms2020 %>%
group_by(NAICS) %>%
summarize(firms = n(),
median_marketcap = median(marketcap, na.rm = TRUE),
Q1_marketcap = quantile(marketcap, probs=c(0.25),na.rm=TRUE),
Q3_marketcap = quantile(marketcap,probs=c(0.75),na.rm=TRUE)
) %>%
arrange(desc(firms))
We display the results:
by_industry_summary
The Manufacturing industry is the most populated with 1,343 firms, followed by the Finance and Insurance industry with 697 firms.
Utilities is the industry with the biggest firms in terms of market size, with a median market capitalization of more than USD $5,609 million, followed by the Information industry with more than USD $3,661 million.
We can also know which are the 10 biggest firms in terms of market capitalization:
top_10_bigfirms <- usfirms2020 %>%
arrange(desc(marketcap)) %>%
top_n(10) %>%
select(ticker, company, NAICS, marketcap, revenue)
## Selecting by marketcap
We display the result:
top_10_bigfirms
The biggest US firm in terms of market capitalization as of Q4 2020 is Apple, with almost USD $2 trillion. In the US, one trillion is a million millions. Remember that the typical firm size was only about $1,392 million, so Apple is more than 1,000 times the size of a typical US firm!
#Calculation of financial variables and ratios
Here we illustrate how to calculate a few financial variables, ratios and returns for all firms and all quarters.
Gross profit (grossprofit): Revenue - Cost of good Sold
uspanel$grossprofit = uspanel$revenue - uspanel$cogs
Earnings before interest and taxes (ebit): Gross profit - Sales & general administrative expenses
uspanel$ebit = uspanel$grossprofit - uspanel$sgae
Net Income (netincome): ebit - financial expenses - income taxes
uspanel$netincome = uspanel$ebit - uspanel$finexp - uspanel$incometax
Annual stock return (stockannual_R): we use the adjusted stock price, and remember that we have quarterly data, so we need the value of the adjusted stock price lagged 4 quarters:
uspanel$stockannual_R = uspanel$adjprice / plm::lag(uspanel$adjprice,k=4) - 1
Market capitalization: (marketcap): original stock price * shares outstanding.
uspanel$marketcap = uspanel$originalprice * uspanel$sharesoutstanding
This is the market value of the firm in each quarter. We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares.
Check that we used the original stock price, not the adjusted stock price. In financial markets, adjusted stock prices are calculated after considering dividend payments and stock splits. A stock split is when a firm decides to divide its stock price by 2, 3 or another multiple, with the sole purpose of avoiding the perception that the stock is expensive. For example, in late August 2020 Apple and Tesla decided to do stock splits. Apple did a split on a 4-for-1 basis. This means that if the stock price was about USD $400.00 on that day, then its price was reduced to USD $100.00, but the number of shares outstanding was multiplied by 4 to keep the market value of the firm the same. In this historical dataset the shares outstanding are the historical shares, so we need to use the historical/original stock price without considering stock splits or dividend payments.
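We can verify the split arithmetic with a couple of lines (the numbers below are hypothetical round figures, only to show why the market value does not change):
# Hypothetical 4-for-1 split: the price is divided by 4 and the shares are multiplied by 4
price_before = 400
shares_before = 1000
price_after = price_before / 4
shares_after = shares_before * 4
# The market value (price * shares) stays exactly the same:
price_after * shares_after == price_before * shares_before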
Now we calculate the operational earnings per share ratio:
· Operational earnings per share (oeps): ebit / shares outstanding
uspanel$oeps = ifelse(uspanel$sharesoutstanding==0,NA,uspanel$ebit / uspanel$sharesoutstanding)
· Operational earnings per share deflated by stock price (oepsp) : oeps / original stock price
uspanel$oepsp = ifelse(uspanel$originalprice==0,NA,uspanel$oeps / uspanel$originalprice)
Note that we use the ifelse function to validate whether the denominator has a zero value. This is important since R can produce infinite values that would cause problems in our analysis.
#Winsorization of ratios
Before we run any regression model, it is very important to check that the independent variables do not have very extreme values. When one or more independent variables have very extreme values, the estimation of the regression coefficients and standard errors can be biased (not reliable).
We can deal with extreme values (called outliers) by applying a winsorization process. Winsorization replaces the values beyond a specific percentile with the value of the variable at that percentile.
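As a sketch of what winsorization does in code (the function below is a hand-written illustration; in this workshop we use the winsorize function of the statar package, which does essentially this clamping):
# Hand-made winsorization: values below the lower percentile and above the
# upper percentile are replaced with the values at those percentiles
manual_winsorize <- function(x, lower=0.02, upper=0.995) {
  cuts <- quantile(x, probs=c(lower, upper), na.rm=TRUE)
  pmin(pmax(x, cuts[1]), cuts[2])
}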
We first see the histogram of the oepsp ratio to see the distribution of extreme values:
hist(uspanel$oepsp)
We see extreme values to the left and a few to the right.
Apply a winsorization of 2% to the left and 0.5% to the right:
library(statar)
uspanel$oepspw <- winsorize(uspanel$oepsp, probs = c(0.02,0.995))
## 1.42 % observations replaced at the bottom
## 0.36 % observations replaced at the top
hist(uspanel$oepspw)
#Running a logit model with financial ratios
As an example of a logistic model, we will run a model to examine how oepsp is related to the probability that a firm beats the market return.
We first have to calculate the annual market return, and then check whether the stock return is higher than the market return. If that is the case, we assign a dummy variable equal to 1; 0 otherwise:
# Creating the market annual return:
uspanel$marketannual_R = uspanel$SP500index / plm::lag(uspanel$SP500index,k=4) - 1
#Creating the binary variable to see if the stock return is higher than the market return:
uspanel$r_above_market = ifelse(is.na(uspanel$stockannual_R),NA,
ifelse(uspanel$stockannual_R>uspanel$marketannual_R,1,0))
# we can see how many 1's and 0's were calculated:
table(uspanel$r_above_market)
##
## 0 1
## 54941 44919
Now we are ready to run the logit model. We need to use the glm function:
logitm1 <- glm(r_above_market ~ oepspw ,data = uspanel, family = "binomial",na.action = na.omit)
summary(logitm1)
##
## Call:
## glm(formula = r_above_market ~ oepspw, family = "binomial", data = uspanel,
## na.action = na.omit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6041 -1.0989 -0.9478 1.2552 1.6292
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.249627 0.006659 -37.49 <0.0000000000000002 ***
## oepspw 1.633521 0.048049 34.00 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 135000 on 98064 degrees of freedom
## Residual deviance: 133771 on 98063 degrees of freedom
## (53295 observations deleted due to missingness)
## AIC: 133775
##
## Number of Fisher Scoring iterations: 4
In this first model, the dependent variable is the probability of beating the market, and the independent variable is the winsorized oepsp, that is, operational earnings per share deflated by price. As we can see, the oepspw coefficient is positive and significant, with a p-value below 0.05. This means that oepspw is positively related to the probability that a firm beats the market return. In conclusion, the model gives enough statistical support to say that the higher the oepspw, the higher the probability that the firm beats the market.
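To interpret the magnitude of the coefficient, we exponentiate it, as explained in the odds-ratio section above (a quick sketch using the fitted model object):
# Exponentiate the coefficients to read them as factors of change in the odds ratio
exp(coef(logitm1))
# exp(b1) is the factor by which the odds of beating the market are multiplied
# when oepspw increases by 1 unit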
Another interesting model is to change the dependent variable to whether the stock beats the market return 1 year later. We can do this by calculating the value of the binary variable r_above_market 4 quarters in the future using the lag function of the plm package:
class(uspanel)
## [1] "pdata.frame" "data.frame"
uspanel$F4_r_above_market <- plm::lag(uspanel$r_above_market,-4)
We use the lag function, but with a negative value to indicate that we want a future value of the variable. It is always good to check whether the dataset is a pdata.frame before we run the lag function.
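As an optional sanity check of this forward shift, we can look at a few quarters of one firm; the value of F4_r_above_market in quarter t should equal r_above_market in quarter t+4 (this sketch just takes the first firm of the panel):
# Display the first 8 quarters of the first firm to verify the 4-quarter forward shift
usdf <- as.data.frame(uspanel)
head(usdf[usdf$ticker==usdf$ticker[1], c("q","r_above_market","F4_r_above_market")], 8)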
Now we can run the logit model with this future binary variable to see how the earnings per share of one quarter are related to the probability that the firm beats the market 1 year later:
logitm2 <- glm(F4_r_above_market ~ oepspw ,data = uspanel, family = "binomial",na.action = na.omit)
summary(logitm2)
##
## Call:
## glm(formula = F4_r_above_market ~ oepspw, family = "binomial",
## data = uspanel, na.action = na.omit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.441 -1.098 -1.026 1.255 1.513
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.232986 0.006776 -34.38 <0.0000000000000002 ***
## oepspw 1.121954 0.048392 23.18 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130696 on 94911 degrees of freedom
## Residual deviance: 130142 on 94910 degrees of freedom
## (56448 observations deleted due to missingness)
## AIC: 130146
##
## Number of Fisher Scoring iterations: 4
In this second model, just as in the first one, the dependent variable is the probability of beating the market, but now 1 year in the future, and the independent variable is again the winsorized oepsp, that is, operational earnings per share deflated by price. As we can see, the oepspw coefficient is positive and significant, with a p-value below 0.05. This means that oepspw is positively related to the probability that a firm beats the market return the following year. In conclusion, the model gives enough statistical support to say that the higher the oepspw, the higher the probability that the firm beats the market 1 year later; in other words, operational earnings per share deflated by price is an explanatory variable for this event.
#Prediction with the logit model
We can use the last model to predict the probability that each firm will beat the market return 1 year in the future. For this prediction, it is a good idea to select only the firms in the last quarter of 2020, so we can predict which firms will beat the market in 2021:
We create a dataset with only the Q4 of 2020:
firmsQ42020 <- uspanel %>%
select(ticker,q,year, F4_r_above_market,oepspw) %>%
filter(q=="2020-10-01") %>%
as.data.frame()
hist(firmsQ42020$oepspw)
Now we run the prediction using the model 2 with this new dataset:
firmsQ42020 <- firmsQ42020 %>%
mutate(pred=predict.glm(logitm2,newdata=firmsQ42020,type=c("response")) )
We can do a histogram to see how this predicted probability behaves:
hist(firmsQ42020$pred,breaks=20)
We can also see how the predicted probability of beating the market changes with changes in earnings per share:
# The plot function expects the x and y values to be vectors, not columns of data frames.
# Then, I use the as.vector function before I do the plot:
plot(x=as.vector(firmsQ42020$oepspw),y=as.vector(firmsQ42020$pred))
It is curious to see that the relationship between oepspw and the predicted probability of beating the market looks almost linear. However, we can see a slight curvature, indicating that the model does capture the non-linear effect; if we had more extreme values of oepspw, we would see the full S-shaped relationship.
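To make the S-shape visible, we can (optionally) evaluate the fitted model over a wider, synthetic range of oepspw than the one observed after winsorization (the grid limits below are arbitrary, chosen only for illustration):
# Build a synthetic grid of oepspw values and predict the probability for each one
oepspw_grid <- data.frame(oepspw = seq(from=-3, to=3, by=0.05))
pred_grid <- predict.glm(logitm2, newdata=oepspw_grid, type="response")
# The plot of predicted probability vs oepspw now shows the full S-shape
plot(x=oepspw_grid$oepspw, y=pred_grid, type="l")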
#Selection of best stocks based on results of the logit model
We can sort the firms according to the predicted probability of beating the benchmark, and then select those with the highest probability:
top40firms <- firmsQ42020 %>%
arrange(desc(firmsQ42020$pred)) %>%
top_n(40)
## Selecting by pred
top40firms
#CHALLENGE
BASED ON THIS WORKSHOP, DEFINE 3 TO 4 FINANCIAL VARIABLES/RATIOS AS EXPLANATORY VARIABLES THAT COULD IMPACT THE PROBABILITY OF A STOCK RETURN BEATING THE MARKET RETURN.
Generate the ratios, run at least one logit regression model (with 2 to 3 explanatory variables) and create the predictions and then select best 40 companies. For this exercise, it is optional to do the winsorization of the explanatory variables (For the final project, you must do the winsorization process).
uspanel$marketcap_w <- winsorize(uspanel$marketcap, probs = c(0.02,0.995))
## 1.43 % observations replaced at the bottom
## 0.36 % observations replaced at the top
uspanel$ebit_w <- winsorize(uspanel$ebit, probs = c(0.02,0.995))
## 1.48 % observations replaced at the bottom
## 0.37 % observations replaced at the top
uspanel$netincome_w <- winsorize(uspanel$netincome, probs = c(0.02,0.995))
## 1.26 % observations replaced at the bottom
## 0.32 % observations replaced at the top
#Quarterly prediction model
CHM1 <- glm(r_above_market ~ marketcap_w + ebit_w + netincome_w ,data = uspanel, family = "binomial", na.action = na.omit)
summary(CHM1)
##
## Call:
## glm(formula = r_above_market ~ marketcap_w + ebit_w + netincome_w,
## family = "binomial", data = uspanel, na.action = na.omit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.555 -1.067 -1.063 1.285 2.068
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2736115139398 0.0074200971611 -36.874 < 0.0000000000000002 ***
## marketcap_w 0.0000000163587 0.0000000006559 24.942 < 0.0000000000000002 ***
## ebit_w -0.0000000970319 0.0000000272283 -3.564 0.000366 ***
## netincome_w -0.0000000962931 0.0000000377477 -2.551 0.010742 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 114759 on 83401 degrees of freedom
## Residual deviance: 113755 on 83398 degrees of freedom
## (67958 observations deleted due to missingness)
## AIC: 113763
##
## Number of Fisher Scoring iterations: 3
All the selected variables appear to be significant and explanatory in the quarterly exercise.
class(uspanel)
## [1] "pdata.frame" "data.frame"
uspanel$F4_r_above_market2 <- plm::lag(uspanel$r_above_market,-4)
#Annual prediction model
CHM2 <- glm(F4_r_above_market2 ~ marketcap_w + ebit_w + netincome_w ,data = uspanel, family = "binomial",na.action = na.omit)
summary(CHM2)
##
## Call:
## glm(formula = F4_r_above_market2 ~ marketcap_w + ebit_w + netincome_w,
## family = "binomial", data = uspanel, na.action = na.omit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.644 -1.086 -1.084 1.270 1.673
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2233055680110 0.0074668936674 -29.906 < 0.0000000000000002 ***
## marketcap_w 0.0000000045275 0.0000000005691 7.956 0.00000000000000177 ***
## ebit_w 0.0000000575888 0.0000000260989 2.207 0.02734 *
## netincome_w -0.0000001159188 0.0000000365873 -3.168 0.00153 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111271 on 80815 degrees of freedom
## Residual deviance: 111123 on 80812 degrees of freedom
## (70544 observations deleted due to missingness)
## AIC: 111131
##
## Number of Fisher Scoring iterations: 3
All the selected variables appear to be significant and explanatory in the annual exercise.
challengefirmsQ42020 <- uspanel %>%
select(ticker,q,year, F4_r_above_market2, marketcap_w, ebit_w, netincome_w) %>%
filter(q=="2020-10-01") %>%
as.data.frame()
hist(challengefirmsQ42020$marketcap_w)
hist(challengefirmsQ42020$ebit_w)
hist(challengefirmsQ42020$netincome_w)
Using the annual model, we generate the predictions:
challengefirmsQ42020 <- challengefirmsQ42020 %>%
mutate(pred=predict.glm(CHM2,newdata=challengefirmsQ42020,type=c("response")) )
Now we can plot the predicted probability of beating the market against each of the selected variables:
plot(x=as.vector(challengefirmsQ42020$marketcap_w),y=as.vector(challengefirmsQ42020$pred))
plot(x=as.vector(challengefirmsQ42020$ebit_w),y=as.vector(challengefirmsQ42020$pred))
plot(x=as.vector(challengefirmsQ42020$netincome_w),y=as.vector(challengefirmsQ42020$pred))
Finally, we list the 50 firms with the highest predicted probability of beating the market according to the explanatory variables selected above:
Top50Firms<- challengefirmsQ42020 %>%
arrange(desc(challengefirmsQ42020$pred)) %>%
top_n(50)
## Selecting by pred
Top50Firms