In this workshop we learn how to run and interpret a multiple regression model when one or more independent variables are categorical. In addition, we learn about the multicollinearity problem in multiple regression models.
You will work in RStudio. Create an R Notebook document to write whatever is asked in this workshop.
At the beginning of the R Notebook write Workshop 8 - Financial Econometrics I and your name (as we did in previous workshops).
You have to replicate all the steps explained in this workshop, and ALSO you have to do whatever is asked. Any QUESTION or any STEP you need to do will be written in CAPITAL LETTERS. For ANY QUESTION, you have to RESPOND IN CAPITAL LETTERS right after the question.
It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your notebook. Your own workshop/notebook will be very helpful for your further study.
Keep saving your .Rmd file, and ONLY SUBMIT the .html version of your .Rmd file.
In the previous workshop we examined whether the market return, BMR and EPSP influence stock returns. We ran the models using both contemporaneous values of the variables and lagged values of the independent variables. Remember the time-series functions in R:
lag(variable, #) refers to lag number # of the variable. If you want to go forward (ahead) in the data, you can use negative numbers.
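For example (a minimal sketch with a made-up two-firm panel, just to show the syntax; the toy data below is not part of our dataset):
library(plm)
toy <- pdata.frame(data.frame(firm    = c("A","A","A","B","B","B"),
                              quarter = c(1,2,3,1,2,3),
                              r       = c(0.01,0.02,0.03,0.04,0.05,0.06)),
                   index = c("firm","quarter"))
plm::lag(toy$r, 1)   # value of r one quarter before (lag 1)
plm::lag(toy$r, -1)  # value of r one quarter ahead (a lead, using a negative number)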
Now we will keep using these time-series functions for regression models that include categorical variables. Until now we have included continuous independent (explanatory) variables in our regression models. A categorical variable is usually a non-numeric variable that represents a small number of categories. When we include a categorical variable in a regression model, it needs special treatment.
A categorical variable might not have a meaningful numeric ranking, but it is useful for classification. For example, the variable Industry is categorical since we cannot sum, average, or sort its values; we can only count how many observations belong to each industry.
Nevertheless, some numeric variables can also be treated as categorical. For example, year can be treated either as a numeric or as a categorical variable in a regression, as the sketch below shows.
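A minimal sketch of the two treatments (the toy data frame toyret and its columns are made up, not part of our dataset):
toyret <- data.frame(year = rep(2018:2020, each = 2), ret = rnorm(6))
lm(ret ~ year, data = toyret)          # year as a numeric variable: one slope for the time trend
lm(ret ~ factor(year), data = toyret)  # year as a categorical variable: one dummy per year except the base year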
When we are interested in evaluating the effect of one variable on a dependent variable, it is recommended to include control variables to make our results more robust. A control variable is a variable that might not be the focus of study, but in previous research or analysis, this control variable has been shown to be related to the dependent variable of study.
Then, if we run a regression to examine whether one explanatory (independent) variable has an effect on the dependent variable, and we include one or two control variables, and we still find a significant relationship between the explanatory variable and the dependent variable, we can say that this effect holds even after accounting for the control variables. In this case, our result is more robust, and we have stronger statistical evidence about the effect of the independent variable on the dependent variable.
Now we will use categorical variables as control variables in multiple regression models. Before doing this, I will briefly explain what we need to do before we include a categorical variable in a regression model.
If we want to include a simple binary categorical variable such as manufacturing vs non-manufacturing company, we need a categorical variable in our dataset with this information. Categorical variables that have only 2 values are called dummy variables. Every observation of the dataset - in our case, every firm-quarter - will be classified as either a manufacturing or a non-manufacturing firm. We cannot include this variable directly since it is not numeric. We need to code the dummy variable with 2 numeric values such as 0 and 1, or 1 and 2. We can assign 1 to manufacturing firms and 0 to non-manufacturing firms (or vice versa). A minimal sketch of this coding is shown below, and we will practice with some exercises later.
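The sketch below uses a hypothetical data frame and column names, just to illustrate the 0/1 coding:
firms <- data.frame(type = c("manufacturing", "non-manufacturing", "manufacturing"))
# Numeric 0/1 coding of the dummy variable:
firms$manuf <- ifelse(firms$type == "manufacturing", 1, 0)
# Alternatively, declare the variable as a factor and let lm()/plm()
# create the dummy automatically when we run the regression:
firms$type_f <- factor(firms$type)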
We download the Excel dataset from the web:
library(readxl)
download.file("http://www.apradie.com/datos/datamx2020q4.xlsx", "dataw8.xlsx", mode="wb")
trying URL 'http://www.apradie.com/datos/datamx2020q4.xlsx'
Content type 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' length 5163617 bytes (4.9 MB)
==================================================
downloaded 4.9 MB
data <- read_excel("dataw8.xlsx")
data <- data[data$status=="active",]
data$bookvalue <- data$totalassets-data$totalliabilities
data$mktval <- data$originalhistoricalstockprice * data$sharesoutstanding
data$bmr <- data$bookvalue / data$mktval
data$eps <- data$ebit / data$sharesoutstanding
data$epsp <- data$eps / data$originalhistoricalstockprice
Create winsorized variables for the book-to-market ratio and earnings per share deflated by price, winsorizing at the 1st and 99th percentiles. Remember to use the original stock price to calculate the market value and earnings per share deflated by price.
library(statar)
library(plm)
data$bmr_w <- winsorize(data$bmr,probs = c(0.01,0.99))
0.60 % observations replaced at the bottom
0.60 % observations replaced at the top
par(mfrow=c(1,2))
hist(data$bmr, col="lightblue", main = "bmr")
hist(data$bmr_w, col = "blue", main="bmr winsorized")
data$epsp_w <- winsorize(data$epsp, probs=c(0.01,0.99))
0.44 % observations replaced at the bottom
0.44 % observations replaced at the top
par(mfrow=c(1,2))
hist(data$epsp, main = "epsp", col="orange")
hist(data$epsp_w, main="epsp winsorized", col="gold")
We set the dataset as panel data. This is important since we will use future returns in some regressions, so we need to create columns for future stock returns one quarter and one year later:
# The index of the data frame must be firmcode-quarter
data <- pdata.frame(data, index = c("firmcode", "quarter"))
# Calculate firm return
data$stockreturn <- diff(log(data$adjustedstockprice))
# We add columns for future returns one year later and one quarter later:
data$F4r <- plm::lag(data$stockreturn,-4)
data$F1r <- plm::lag(data$stockreturn,-1)
Let’s create a categorical variable to classify firms according to their size (the natural log of market value). We classify firms into 4 groups: very small, small, big, and very big, using the size quartiles. Since we have panel data, it is a good idea to calculate these quartiles by quarter, so each firm is compared with the other firms in the same quarter.
We can do so with the function xtile from package statar, which emulates some functions from Stata.
The xtile function assigns each observation to a quantile group. When combined with group_by() and summarise(), the groups are computed quarter by quarter. In this case, since we specify n=4, xtile will split firms at the 25th, 50th and 75th size percentiles to create 4 groups. Then sizetype0 will contain a variable equal to 1, 2, 3 or 4. Firms with value 1 are the smallest firms, and firms with value 4 are the biggest firms.
We will use the dplyr package to create groups by size.
We can do this in R as follows:
# Create variable size as the log of market value
data$size <- log(data$mktval)
# We take the log since the market value variable usually does not
# behave like a normal distributed variable (it is skewed to the right)
# When we take the log of skewed variables, then the log will behave
# close to a normal variable
# Load packages dplyr
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:plm’:
between, lag, lead
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
# xtile function calculates percentiles by groups. In this case, we
# group the data by quarter, and for each quarter we will classify
# firms in 4: very big, big, small, very small
sizetype0 <- data %>%
group_by(quarter) %>%
summarise(sizetype = xtile(size, n=4), firmcode=firmcode)
`summarise()` has grouped output by 'quarter'. You can override using the `.groups` argument.
# In this case, n=4 means that it will generate sizetype0 according to
# the 25 percentile to create 4 groups. Those with values=1 will be
# the smallest firms and those with values=4 will be the biggest firms.
# Now we merge the panel data with the sizetype0
data <- merge(data,sizetype0,by=c("firmcode","quarter"))
# We again set the data as panel-data since the merge function changed the class of
# the dataset to data frame:
data <- pdata.frame(data, index = c("firmcode", "quarter"))
We can see the different values of the new sizetype variable and how many firm-quarters fall in each group:
table(data$sizetype)
1 2 3 4
1900 1855 1883 1838
Now we will analyze the effect of BMR on firm return (one year later) after considering the effect of size type. Then, we can include this new categorical variable in the regression as follows:
# Before running a regression with a categorical independent variable,
# we need to indicate to R that this variable is a "factor" variable:
data$size1 <- factor(data$sizetype, c(1,2,3,4), labels = c("Very Small", "Small", "Big", "Very Big"))
# Look at the coding
contrasts(data$size1)
Small Big Very Big
Very Small 0 0 0
Small 1 0 0
Big 0 1 0
Very Big 0 0 1
We run the multiple regression model with the categorical variable size type (size1) as follows.
We want to examine whether there is a difference of future stock returns between size groups, and whether book-to-market ratio has a relationship with future stock returns 1 year later (4 quarters in the future).
We can run this regression using the plm function:
reg1 <-plm(plm::lag(stockreturn,-4) ~ size1 + bmr_w, data = data, model="pooling")
reg1results<-summary(reg1)
reg1results
Pooling Model
Call:
plm(formula = plm::lag(stockreturn, -4) ~ size1 + bmr_w, data = data,
model = "pooling")
Unbalanced Panel: n = 143, T = 1-76, N = 6577
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-1.3004498 -0.0769402 -0.0014358 0.0802195 1.5023835
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) -0.0148149 0.0063925 -2.3176 0.020504 *
size1Small 0.0188085 0.0067562 2.7839 0.005387 **
size1Big 0.0240453 0.0068316 3.5197 0.000435 ***
size1Very Big 0.0290969 0.0072614 4.0071 6.216e-05 ***
bmr_w 0.0165505 0.0026145 6.3302 2.610e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 222.54
Residual Sum of Squares: 221.1
R-Squared: 0.0064381
Adj. R-Squared: 0.0058334
F-statistic: 10.6464 on 4 and 6572 DF, p-value: 1.3426e-08
# We used the lag function with -4 to indicate 4 quarters in the future for stock returns
We can also run this type of regression with the simple lm function, using the future-return columns we created when we set up the panel data. In this case we use the stock return one quarter later (F1r) as the dependent variable:
reg2 <-lm(F1r ~ size1 + bmr_w, data = data)
reg2results<-summary(reg2)
reg2results
Call:
lm(formula = F1r ~ size1 + bmr_w, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.29767 -0.07594 -0.00053 0.07907 1.49003
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.014324 0.006073 -2.359 0.018366 *
size1Small 0.018915 0.006454 2.931 0.003394 **
size1Big 0.024994 0.006538 3.823 0.000133 ***
size1Very Big 0.027483 0.006947 3.956 7.69e-05 ***
bmr_w 0.013662 0.002497 5.470 4.65e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1837 on 7094 degrees of freedom
(5061 observations deleted due to missingness)
Multiple R-squared: 0.004896, Adjusted R-squared: 0.004335
F-statistic: 8.726 on 4 and 7094 DF, p-value: 5.051e-07
INTERPRET THE OUTPUT OF THE MODEL. Here are some explanations for your interpretation:
AFTER CONSIDERING THE EFFECT OF SIZE TYPE, THE EFFECT OF THE BOOK-TO-MARKET RATIO (bmr_w) ON THE FUTURE STOCK RETURN ONE QUARTER LATER IS POSITIVE AND SIGNIFICANT. FOR EACH +1 MOVEMENT OF bmr_w, THE AVERAGE MOVEMENT OF FUTURE STOCK RETURNS (F1r) IS ABOUT 0.0136, OR 1.36%.
TAKING INTO CONSIDERATION THE EFFECT OF THE BOOK-TO-MARKET RATIO, SMALL FIRMS OFFER FUTURE RETURNS ABOUT 1.89% ABOVE VERY-SMALL FIRMS. ANALYZING THE DUMMY VARIABLE COEFFICIENT size1Big, AND TAKING INTO ACCOUNT THE EFFECT OF bmr_w, BIG FIRMS OFFER SIGNIFICANTLY HIGHER FUTURE RETURNS, ABOUT 2.50% HIGHER THAN VERY-SMALL FIRMS. VERY-BIG FIRMS OFFER ABOUT 2.75% HIGHER FUTURE RETURNS WHEN COMPARED TO VERY-SMALL FIRMS.
We have 5 regression coefficients. I will name the first coefficient alpha0 (instead of beta0). I will name the coefficients of the dummy variables of the categorical (factor) variable alpha1, alpha2, and alpha3. Finally, I will name the coefficient for BMR beta1.
Then, the expected value of the regression will be:
E[F1r] = α0 + α1*size1Small + α2*size1Big + α3*size1VeryBig + β1*bmr_w
size1Small, size1Big and size1VeryBig are dummy variables with values equal to 0 or 1. As you see, there is no dummy variable for size1VerySmall since this category is the base category. The coding of this categorical variable can be better understood with the following table:
contrasts(data$size1)
Small Big Very Big
Very Small 0 0 0
Small 1 0 0
Big 0 1 0
Very Big 0 0 1
Then, we can plug the values of these dummy variables according to the size of the firms, and come up with one regression equation for each group of firms according to firm size.
For small firms size1Small=1 and the other dummy variables will be equal to 0. For Big firms size1Big=1, size1Small=0, and size1VeryBig=0. For Very Big firms size1VeryBig=1, size1Small=0 and size1Big=0. For Very Small firms all the dummy variables will be = 0.
Then for small firms only the dummy variable size1Small=1 and the other dummy variables =0. We can express the regression equation for SMALL firms as follows:
E[F1r | firm=Small] = α0 + α1*size1Small + β1*bmr_w
E[F1r | firm=Small] = (-0.0143241 + 0.0189146) + 0.0136617*bmr_w
As you see, the terms for α2 and α3 disappear since their dummy variables are equal to zero.
For Big firms only the dummy variable size1Big=1 and the other dummy variables =0. We can express the regression equation for BIG firms as follows:
E[F1r | firm=Big] = α0 + α2*size1Big + β1*bmr_w
E[F1r | firm=Big] = (-0.0143241 + 0.0249939) + 0.0136617*bmr_w
As you see, the terms for α1 and α3 disappear since their dummy variables are equal to zero.
For Very Big firms only the dummy variable size1VeryBig=1 and the other dummy variables =0. We can express the regression equation for VERY BIG firms as follows:
E[F1r | firm=Very Big] = α0 + α3*size1VeryBig + β1*bmr_w
E[F1r | firm=Very Big] = (-0.0143241 + 0.0274828) + 0.0136617*bmr_w
As you see, the terms for α1 and α2 disappear since their dummy variables are equal to zero.
Finally, the regression equation for the VERY SMALL firms is the following:
E[F1r | firm=Very Small] = α0 + β1*bmr_w
E[F1r | firm=Very Small] = (-0.0143241) + 0.0136617*bmr_w
As you see, the terms for α1, α2 and α3 disappear since their dummy variables are equal to zero.
Then, we have 4 regression equations, one for each type of firm according to its size.
Then the α1 coefficient is the vertical distance between the regression lines of the VERY SMALL and the SMALL firms. The α2 coefficient is the vertical distance between the regression lines of the VERY SMALL and the BIG firms. And the α3 coefficient is the vertical distance between the regression lines of the VERY SMALL and the VERY BIG firms.
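We can recover these four intercepts directly from the estimated coefficients of reg2 (a quick check; note that the name of the last coefficient contains a space because the factor label is "Very Big"):
b <- coef(reg2)
b["(Intercept)"]                       # intercept for Very Small firms
b["(Intercept)"] + b["size1Small"]     # intercept for Small firms
b["(Intercept)"] + b["size1Big"]       # intercept for Big firms
b["(Intercept)"] + b["size1Very Big"]  # intercept for Very Big firms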
Use predict.lm() to predict future stock returns for different levels of BMR (from 0.6 to 1.6, moving by 0.1) and for each of the four size groups.
# Make prediction using predict.lm()
newx <- data.frame(bmr_w = rep(seq(0.6, 1.6, by=0.1), 4), size1 = levels(data$size1))
pr_reg1 <- predict.lm(reg2, newx, interval = "confidence")
colnames(pr_reg1) <- c("StockReturn", "lwr", "upr")
pred_reg1 <- cbind(newx, pr_reg1)
# Plot
library(ggplot2)
ggplot(pred_reg1, aes(x = bmr_w, y=StockReturn, color=size1)) +
geom_point(size = 2) + geom_line() +
geom_errorbar(aes(ymax = upr, ymin = lwr))
INTERPRET the graph and explain how the graph is related to the 4 equations explained above. After analyzing the graph and the regression equations, how can you interpret the alpha1 coefficient? What does the alpha1 coefficient represent? Briefly explain.
FIRST THINGS FIRST, THE HORIZONTAL AXIS REPRESENTS THE BOOK-TO-MARKET RATIO, WHILE THE VERTICAL AXIS REPRESENTS THE FUTURE STOCK RETURN ONE QUARTER LATER. ANALYZING THE FOUR SIZE GROUPS, I CAN INFER THAT THE RELATIONSHIP IS POSITIVE: THE HIGHER THE BOOK-TO-MARKET RATIO, THE HIGHER THE FUTURE STOCK RETURN, ALTHOUGH VERY SMALL FIRMS OFFER LITTLE TO NO RETURN IN COMPARISON TO ALL OTHER GROUPS. THE VERTICAL LINES SHOW THE 95% CONFIDENCE INTERVAL OF THE PREDICTION AT EACH VALUE OF bmr_w. THE alpha1 COEFFICIENT REPRESENTS THE VERTICAL DISTANCE BETWEEN THE VERY-SMALL AND SMALL REGRESSION LINES: IT IS THE EXTRA FUTURE RETURN THAT SMALL FIRMS OFFER OVER VERY-SMALL FIRMS AT ANY LEVEL OF bmr_w.
Now we will practice with another model using the categorical variable industry as a control variable.
Design a regression model to examine whether BMR (winsorized) influences the future stock return ONE quarter later, adding the categorical variable industry as a control variable. In this case the categorical variable has more than 2 values since it represents the industry of the firms. In this dataset the industry is the 1-digit NAICS (North American Industry Classification System) classification. To see how many different industry classifications we have in this variable, type:
table(data$naics1)
Accommodation and Food Services: 320
Administrative and Support and Waste Management and Remediation Services: 160
Agriculture, Forestry, Fishing and Hunting: 160
Arts, Entertainment, and Recreation: 160
Construction: 1360
Finance and Insurance: 2640
Health Care and Social Assistance: 80
Information: 800
Management of Companies and Enterprises: 80
Manufacturing: 3120
Mining, Quarrying, and Oil and Gas Extraction: 400
Professional, Scientific, and Technical Services: 80
Public Administration: 80
Real Estate and Rental and Leasing: 720
Retail Trade: 1040
Transportation and Warehousing: 800
Utilities: 160
You will see 17 different industries and the frequency of firm-quarters for each industry. You will see that the industry with the most firm-quarters is the manufacturing industry.
To run a regression using this classification, we need to treat this variable as Categorical, NOT numeric. In this case, this variable has more than 2 values, so we need to create many dummy variables and include them in the regression.
For example, imagine we ONLY have 3 industry types: manufacturing, retail and finance services. In this case we need to create only 2 dummy variables.
Then, if we have only 3 firms, one in each category, we might have something like:

Firm   Industry
A      Manufacturing
B      Retail
C      Finance

We need to create N-1 dummy variables, where N is the number of different industries (in this case, 3). Then, we need to create 2 dummy variables and end up with something like:

Firm   Industry        D_Retail   D_Finance
A      Manufacturing   0          0
B      Retail          1          0
C      Finance         0          1

Now we have 2 dummy variables to represent 1 categorical variable with 3 values. You see that there is NO dummy variable for the manufacturing firms since it is not necessary: if the firm is NOT Retail and NOT Finance, then it must be a manufacturing firm.
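We can check this coding scheme directly in R with a tiny made-up example (the industry vector below is hypothetical, not part of our dataset):
industry <- factor(c("Manufacturing", "Retail", "Finance"),
                   levels = c("Manufacturing", "Retail", "Finance"))
model.matrix(~ industry)
# The result has an intercept column plus N-1 = 2 dummy columns
# (industryRetail and industryFinance); Manufacturing is the base
# (reference) category, so it gets no dummy of its own.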
Fortunately, we do not have to create each dummy variable manually. R performs this automatically and temporarily. Then, we run the regression as:
# I transform the industry column into a factor column to consider
# this variable as a categorical variable:
industries<-unique(data$naics1)
# With the unique function I get the unique names of industries from
# the panel dataset
# I shorten the name of the 2nd industry to improve the display of the
# regression models:
industries2<-industries
industries2[2]<-"Administrative and Waste Management"
industries
[1] "Manufacturing"
[2] "Administrative and Support and Waste Management and Remediation Services"
[3] "Finance and Insurance"
[4] "Transportation and Warehousing"
[5] "Construction"
[6] "Information"
[7] "Mining, Quarrying, and Oil and Gas Extraction"
[8] "Agriculture, Forestry, Fishing and Hunting"
[9] "Retail Trade"
[10] "Accommodation and Food Services"
[11] "Arts, Entertainment, and Recreation"
[12] "Real Estate and Rental and Leasing"
[13] "Utilities"
[14] "Management of Companies and Enterprises"
[15] "Professional, Scientific, and Technical Services"
[16] "Health Care and Social Assistance"
[17] "Public Administration"
data$naicsf<-factor(data$naics1,industries,labels=industries2)
reg3<-lm(F1r ~ bmr_w + naicsf, data = data)
summary(reg3)
Call:
lm(formula = F1r ~ bmr_w + naicsf, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.32226 -0.07538 -0.00606 0.08093 1.46856
Coefficients:
                                                         Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                              0.011702    0.004624    2.531    0.0114 *
bmr_w                                                    0.010409    0.002332    4.463  8.22e-06 ***
naicsfAdministrative and Waste Management               -0.014430    0.016834   -0.857    0.3914
naicsfFinance and Insurance                              0.009673    0.006758    1.431    0.1524
naicsfTransportation and Warehousing                    -0.004392    0.010670   -0.412    0.6806
naicsfConstruction                                      -0.046688    0.007817   -5.973  2.45e-09 ***
naicsfInformation                                       -0.012114    0.009161   -1.322    0.1861
naicsfMining, Quarrying, and Oil and Gas Extraction     -0.010426    0.011879   -0.878    0.3802
naicsfAgriculture, Forestry, Fishing and Hunting         0.019905    0.015100    1.318    0.1875
naicsfRetail Trade                                       0.004037    0.007514    0.537    0.5911
naicsfAccommodation and Food Services                   -0.015823    0.013736   -1.152    0.2494
naicsfArts, Entertainment, and Recreation               -0.027757    0.017535   -1.583    0.1135
naicsfReal Estate and Rental and Leasing                -0.015399    0.013424   -1.147    0.2514
naicsfUtilities                                          0.030428    0.032158    0.946    0.3441
naicsfManagement of Companies and Enterprises            0.020675    0.021003    0.984    0.3250
naicsfProfessional, Scientific, and Technical Services   0.003414    0.027358    0.125    0.9007
naicsfHealth Care and Social Assistance                  0.013675    0.021241    0.644    0.5197
naicsfPublic Administration                             -0.176974    0.129685   -1.365    0.1724
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1833 on 7081 degrees of freedom
(5061 observations deleted due to missingness)
Multiple R-squared: 0.01063, Adjusted R-squared: 0.008259
F-statistic: 4.477 on 17 and 7081 DF, p-value: 2.128e-09
If the predictor (x, independent variable) is a categorical variable, R will automatically (and temporarily) create N-1 dummy variables with values 0 or 1 and put them in the multiple regression model. That is the reason you will see many independent variables, but all of these are just dummy variables related to the 1-digit NAICS industry classification. If you want to figure out the dummy coding scheme of a categorical variable, use the function contrasts() or model.matrix().
You will notice that R did not include a dummy for the 1st industry type. We say that the reference or base industry is the 1st industry (here, Manufacturing). In other words, the coefficient of each dummy variable represents how much more or less the respective industry returns compared to the base industry.
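If you ever want to compare industries against a different base category, you can re-level the factor before running the regression. A sketch (here I arbitrarily pick Finance and Insurance as the base; the new variable and model names are just illustrative):
data$naicsf2 <- relevel(data$naicsf, ref = "Finance and Insurance")
reg3b <- lm(F1r ~ bmr_w + naicsf2, data = data)
# The bmr_w coefficient does not change; only the industry dummies are now
# expressed relative to Finance and Insurance instead of Manufacturing.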
You have to pay attention to this explanation in class. Do the following:
INTERPRET the regression results with your own words. Write down the regression equation for the first industry that appears in the regression output.
IN THIS REGRESSION WE ANALYZE HOW THE BOOK-TO-MARKET RATIO IS RELATED TO FUTURE STOCK RETURNS ONE QUARTER LATER, USING INDUSTRY AS A CONTROL VARIABLE. THE SAMPLE INCLUDES ALL PUBLIC MEXICAN FIRMS FROM 2000 UP TO 2019. AFTER CONSIDERING THE INDUSTRY EFFECTS, THE BOOK-TO-MARKET RATIO IS POSITIVELY RELATED TO FUTURE STOCK RETURNS ONE QUARTER LATER. FOR EACH +1 MOVEMENT OF bmr_w, WE EXPECT F1r TO MOVE APPROXIMATELY 0.0104, WHICH IS 1.04%. CONSTRUCTION FIRMS OFFER ABOUT 4.67% LESS FUTURE STOCK RETURNS COMPARED TO MANUFACTURING FIRMS (OUR BASE GROUP). THE REST OF THE INDUSTRY GROUPS DO NOT OFFER RETURNS SIGNIFICANTLY DIFFERENT FROM THE BASE GROUP. THE REGRESSION EQUATION FOR THE FIRST INDUSTRY LISTED IN THE OUTPUT (ADMINISTRATIVE AND WASTE MANAGEMENT) IS: E[F1r] = (0.011702 - 0.014430) + 0.010409*bmr_w = -0.002728 + 0.010409*bmr_w.
Do the following:
Generate a categorical variable for size. Consider large firms those that are above the 66th size percentile; consider small firms those that are below the 33rd percentile; consider middle-size firms those that are between the 33rd and the 66th size percentiles. You have to remember what percentiles are! Here you have to be careful since we have historical data for many quarters, so we have to do this classification quarter by quarter.
library(dplyr)
sizetype1 <- data %>%
group_by(quarter) %>%
summarise(sizetype1 = xtile(size, n=3), firmcode=firmcode)
`summarise()` has grouped output by 'quarter'. You can override using the `.groups` argument.
# Now we merge the panel data with sizetype1
data <- merge(data,sizetype1,by=c("firmcode","quarter"))
# We indicate to do the merge by firm-quarter
# The index of the data frame must be firmcode-quarter
data <- pdata.frame(data, index = c("firmcode", "quarter"))
# Code dummy variable
data$size3 <- factor(data$sizetype1, c(1,2,3), labels = c("Small","Medium", "Big"))
# Look at the coding
contrasts(data$size3)
Medium Big
Small 0 0
Medium 1 0
Big 0 1
You can check the content of this new variable using the table() function:
table(data$size3)
Small Medium Big
2519 2515 2442
Now we are ready to use this categorical variable in a regression model.
Run a multiple regression model to examine whether BMR(winsorized), EPSP(winsorized) and the categorical variable sizetype influence future stock returns one quarter later. INTERPRET the output of the model.
reg4 <- lm(F1r ~ bmr_w + epsp_w + size3, data = data)
summary(reg4)
Call:
lm(formula = F1r ~ bmr_w + epsp_w + size3, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.29367 -0.07474 0.00327 0.07853 1.50234
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.015913 0.006434 -2.473 0.013429 *
bmr_w 0.006059 0.003335 1.817 0.069314 .
epsp_w 0.106705 0.026620 4.008 6.2e-05 ***
size3Medium 0.021635 0.006526 3.315 0.000922 ***
size3Big 0.022712 0.006951 3.268 0.001091 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1815 on 5191 degrees of freedom
(6964 observations deleted due to missingness)
Multiple R-squared: 0.006195, Adjusted R-squared: 0.005429
F-statistic: 8.089 on 4 and 5191 DF, p-value: 1.691e-06
INTERPRET THE MODEL: THE EFFECT OF THE BOOK-TO-MARKET RATIO ON FUTURE STOCK RETURNS ONE QUARTER LATER IS POSITIVE AND MARGINALLY SIGNIFICANT, WITH A P-VALUE OF ABOUT 0.07. THE EFFECT OF EARNINGS PER SHARE DEFLATED BY PRICE ON FUTURE STOCK RETURNS ONE QUARTER LATER IS POSITIVE AND VERY SIGNIFICANT, WITH A P-VALUE SMALLER THAN 0.01. FOR EACH +1 MOVEMENT OF epsp_w, THE FUTURE STOCK RETURN IS EXPECTED TO MOVE BY ABOUT 0.1067. MEDIUM AND BIG FIRMS OFFER SIGNIFICANTLY HIGHER RETURNS THAN SMALL FIRMS: MEDIUM-SIZE FIRMS OFFER FUTURE RETURNS ABOUT 2.16% HIGHER THAN SMALL FIRMS, WHEREAS BIG FIRMS OFFER FUTURE RETURNS ABOUT 2.27% HIGHER THAN SMALL FIRMS.
With this model, make the following predictions:
Predict the firm return for a company with a BMR that moves from 0.4 to 1.6 by steps of 0.1, and for the 3 size categories (holding epsp_w at its mean). You can do this as:
# Make prediction using predict.lm()
newx2 <- data.frame(bmr_w = rep(seq(0.4, 1.6, by=0.1), 3),
epsp_w = mean(data$epsp_w,na.rm=TRUE),
size3 = levels(data$size3))
pr_reg3 <- predict.lm(reg4, newx2, interval = "confidence")
colnames(pr_reg3) <- c("StockReturn", "lwr", "upr")
pred_reg3 <- cbind(newx2, pr_reg3)
# Plot
ggplot(pred_reg3, aes(x = bmr_w, y=StockReturn, color=size3)) +
geom_point(size = 2) + geom_line() +
geom_errorbar(aes(ymax = upr, ymin = lwr))
SOMETHING INTERESTING IS THAT ALL GROUPS SHARE THE SAME SLOPE AND KEEP THEIR ORDER. FOR EVERY GROUP THERE IS A POSITIVE BMR EFFECT. THE VERTICAL DISTANCE BETWEEN THE SMALL AND BIG LINES SHOWS HOW MUCH EXTRA RETURN BIG FIRMS EARN OVER SMALL FIRMS. THE DISTANCE BETWEEN THE MEDIUM AND SMALL LINES NOT ONLY GIVES US THE EXTRA RETURN BUT ALSO EQUALS THE ALPHA COEFFICIENT (size3Medium) FOR MEDIUM-SIZE FIRMS.