1 Introduction to Categorical independent variables

In the previous workshop we examined whether the market return, BMR and EPSP influence stock returns. We ran the models using both contemporaneous values of the variables and lagged values of the independent variables. Remember the time-series functions in R:

lag(variable, #) returns lag number # of the variable. If you want to go forward (ahead) in the data, you can use negative numbers.

Now we will keep using these time series functions for regression models that include categorical variables. Until now we have included continuous independent (explanatory) variables in our regression models.

A categorical variable is usually a non-numeric variable that represents categories and takes only a few distinct values. When we include a categorical variable in a regression model, it needs special treatment.

A categorical variable might not have a numeric or ranking meaning, but it is useful for classification. For example, the variable Industry is categorical since we cannot sum, average or sort its values; we can only count the different types of industries.

Nevertheless, some numeric variables can also be treated as categorical variables. For example, year can be treated as either a numeric or a categorical variable in a regression.

When we are interested in evaluating the effect of one variable on a dependent variable, it is recommended to include control variables to make our results more robust. A control variable is a variable that might not be the focus of study, but in previous research or analysis, this control variable has been shown to be related to the dependent variable of study.

Then, suppose we run a regression to examine whether one explanatory (independent) variable has an effect on the dependent variable, and we include one or two control variables. If we find a significant relationship between the explanatory variable and the dependent variable, we can say that this effect holds even after considering the effect of the control variables. In this case, our result will be more robust, and we will have stronger statistical evidence about the effect of the independent variable on the dependent variable.

Now we will use categorical variables as control variables in multiple regression models. Before doing this, I will briefly explain what we need to do before we include a categorical variable in a regression model.

If we want to include a simple binary categorical variable such as manufacturing company vs non-manufacturing company, then we need to have a categorical variable in our dataset with this information. Categorical variables that have 2 values are called dummy variables. Every observation of the dataset (in our case, every firm-quarter) will be classified as either a manufacturing or a non-manufacturing firm. We cannot include this variable directly since it is not numeric. We need to code the dummy variable with 2 numeric values such as 0 and 1, or 1 and 2. We can assign 1 to manufacturing firms, and 0 to non-manufacturing firms (or vice versa). We will practice with some exercises about this.
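As a minimal sketch of this coding step, the following creates a 0/1 dummy from a text column. The data frame `firms` and its values are hypothetical toy data, not part of the workshop dataset.

```r
# Toy data: hypothetical firms and industry labels (not from the workshop dataset)
firms <- data.frame(
  firm     = c("A", "B", "C", "D"),
  industry = c("Manufacturing", "Retail", "Manufacturing", "Finance")
)

# Code the dummy: 1 for manufacturing firms, 0 for all other firms
firms$manuf <- ifelse(firms$industry == "Manufacturing", 1, 0)

firms
# Count how many firms fall in each group
table(firms$manuf)
```

Assigning 0 to manufacturing and 1 to the rest would work equally well; only the sign of the dummy coefficient would change.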

2.1 Data management

We download the Excel dataset from the web:

library(readxl)
# Download the excel file from a web site:
download.file("http://www.apradie.com/datos/datamx2020q4.xlsx", "dataw7.xlsx", mode="wb")
# Save the data from the file in an R object
data <- read_excel("dataw7.xlsx")

Do the following data management to create financial ratios to be used in the workshop:

Select only active firms from the dataset.

Create variables for book value, market value, book-to-market value, earnings per share, earnings per share deflated by price.

# Select only active firms:

data <- data[data$status=="active",]

# Create book value variable
data$bookvalue <- data$totalassets-data$totalliabilities

# Calculate marketvalue
data$mktval <- data$originalhistoricalstockprice * data$sharesoutstanding

# Create book-to-market ratio variable
data$bmr <- data$bookvalue / data$mktval

# Create earnings per share variable
data$eps <- data$ebit / data$sharesoutstanding

# Create earnings per share deflated by price column
data$epsp <- data$eps / data$originalhistoricalstockprice

Create winsorized variables for the book-to-market ratio and for earnings per share deflated by price, using the 1% and 99% percentiles. Remember to use the original stock price to calculate the market value and earnings per share deflated by price.

# Loading the libraries statar and plm
library(statar)
library(plm)

data$bmr_w <- winsorize(data$bmr,probs = c(0.01,0.99))

## 0.60 % observations replaced at the bottom

## 0.60 % observations replaced at the top

par(mfrow=c(1,2))
hist(data$bmr, col="lightblue", main = "bmr")
hist(data$bmr_w, col = "blue", main="bmr winsorized")

data$epsp_w <- winsorize(data$epsp, probs=c(0.01,0.99))

## 0.44 % observations replaced at the bottom

## 0.44 % observations replaced at the top

par(mfrow=c(1,2))
hist(data$epsp, main = "epsp", col="orange")
hist(data$epsp_w, main="epsp winsorized", col="gold")
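To make clear what winsorizing does to the extreme values, here is a base-R sketch under the assumption that values below the 1st percentile and above the 99th percentile are replaced by those percentiles. The helper `winsorize_sketch` is hypothetical and is not the statar implementation; it only illustrates the idea on simulated data.

```r
# Hypothetical helper (NOT statar's winsorize): cap values at two percentiles
winsorize_sketch <- function(x, probs = c(0.01, 0.99)) {
  cuts <- unname(quantile(x, probs = probs, na.rm = TRUE))
  # Values below the lower cut are raised to it; values above the upper cut are lowered to it
  pmin(pmax(x, cuts[1]), cuts[2])
}

set.seed(123)
x   <- c(rnorm(200), -50, 50)   # simulated data with two artificial outliers
x_w <- winsorize_sketch(x)

range(x)     # the outliers dominate the range
range(x_w)   # after winsorizing, the range is much narrower
```

The number of observations does not change: extreme values are replaced, not dropped.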

We set the dataset as panel data. This is important since we will use future returns in some regressions, so we need to create columns for future stock returns one quarter and one year later:

# The index of the data frame must be firmcode-quarter
data <- pdata.frame(data, index = c("firmcode", "quarter"))

# Calculate firm return
data$stockreturn <- diff(log(data$adjustedstockprice))

# We add columns for future returns one year later and one quarter later:
data$F4r <- plm::lag(data$stockreturn,-4)
data$F1r <- plm::lag(data$stockreturn,-1)
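The reason the panel structure matters is that a future return must come from the same firm. The following base-R sketch shows that idea on a hypothetical toy panel. The helper `lead_k` is an illustration only: it assumes the rows are sorted by firm and quarter and, unlike the plm function, it does not check for gaps in the time index.

```r
# Toy panel: 2 hypothetical firms, 4 quarters each, sorted by firm and quarter
panel <- data.frame(
  firm    = rep(c("A", "B"), each = 4),
  quarter = rep(1:4, times = 2),
  ret     = c(0.01, 0.02, 0.03, 0.04, 0.10, 0.20, 0.30, 0.40)
)

# Hypothetical helper: move values k periods forward and pad the end with NA
lead_k <- function(x, k) c(x[-seq_len(k)], rep(NA, k))[seq_along(x)]

# Apply the shift separately inside each firm, so that the last quarter of
# firm A never receives a return from firm B
panel$F1ret <- ave(panel$ret, panel$firm, FUN = function(x) lead_k(x, 1))
panel
```

The last quarter of each firm gets NA, since there is no later observation for that firm.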

Let’s create a categorical variable to classify firms according to their size, measured as the natural log of market value. We classify firms into 4 groups: very small, small, big and very big. Since we have panel data, it is a good idea to classify firms quarter by quarter, so that each firm is compared only with the other firms in the same quarter.

We can do so with the function xtile from package statar, which emulates some functions from Stata.

The xtile function assigns each observation to a quantile group. When combined with group_by() and summarise(), the groups are calculated by quarter. In this case, since I specify n=4, the cutoffs are the 25th, 50th and 75th percentiles of size, and the resulting variable sizetype takes values from 1 to 4. Those with value 1 will be the smallest firms, and those with value 4 will be the biggest firms.

We will use the dplyr package to create groups by size.

We can do this in R as follows:

# Create variable size as the log of market value
data$size <- log(data$mktval)
# We take the log since market value usually is not normally
#  distributed (it is skewed to the right).
# The log of a right-skewed variable usually behaves
#  closer to a normally distributed variable


# Load packages dplyr

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plm':
## 
##     between, lag, lead

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# xtile function calculates percentiles by groups. In this case, we 
#  group the data by quarter, and for each quarter we will classify
#  firms in 4: very big, big, small, very small  
sizetype0 <- data %>%
  group_by(quarter) %>%
  summarise(sizetype = xtile(size, n=4), firmcode=firmcode)

## `summarise()` has grouped output by 'quarter'. You can override using the `.groups` argument.

# In this case, n=4 means that sizetype is generated from the 25th, 50th 
#   and 75th percentiles to create 4 groups. Those with value 1 will be 
#   the smallest firms and those with value 4 will be the biggest firms.

# Now we merge the panel data with the sizetype0 

data <- merge(data,sizetype0,by=c("firmcode","quarter"))

# We again set the data as panel-data since the merge function changed the class of
#   the dataset to data frame:
data <- pdata.frame(data, index = c("firmcode", "quarter"))

We can see the different values of sizetype and the number of times each appears in the dataset:

table(data$sizetype)

## 
##    1    2    3    4 
## 1900 1855 1883 1838
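The roughly equal group counts follow from classifying by quartile within each quarter. A base-R sketch of the same idea on simulated data is shown below. The helper `quantile_group` is hypothetical and only approximates what xtile does; the handling of ties and boundary values may differ.

```r
set.seed(1)
# Toy data: 8 hypothetical firms in each of 2 quarters; firms are larger in the 2nd quarter
toy <- data.frame(
  quarter = rep(c("2020q1", "2020q2"), each = 8),
  size    = c(rnorm(8, mean = 10), rnorm(8, mean = 12))
)

# Hypothetical helper: assign each value to one of n quantile groups (1 = smallest)
quantile_group <- function(x, n) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1), na.rm = TRUE)
  cut(x, breaks = breaks, labels = FALSE, include.lowest = TRUE)
}

# Compute the groups separately inside each quarter
toy$sizetype <- ave(toy$size, toy$quarter, FUN = function(x) quantile_group(x, 4))

table(toy$quarter, toy$sizetype)
```

Even though all firms are larger in the second quarter, each quarter still has firms in groups 1 to 4, because the cutoffs are recalculated for every quarter.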

2.2 Regression model with categorical variable

Now we will analyze the effect of BMR on firm return (one year later) after considering the effect of size type. We can include this new categorical variable in the regression as follows:

# Before running a regression with a categorical independent variable, 
#  we need to tell R that this variable is a "factor" variable: 
data$size1 <- factor(data$sizetype, c(1,2,3,4), labels = c("Very Small", "Small", "Big", "Very Big")) 

# Look at the coding
contrasts(data$size1)

##            Small Big Very Big
## Very Small     0   0        0
## Small          1   0        0
## Big            0   1        0
## Very Big       0   0        1

We run the multiple regression model with categorical variable size type (size1) as follows.

We want to examine whether there is a difference of future stock returns between size groups, and whether book-to-market ratio has a relationship with future stock returns 1 year later (4 quarters in the future).

We can run this regression using the plm function:

# We need to detach (unload) the dplyr library since its lag() function
#   masks the lag() function of the plm package 
detach(package:dplyr)

reg1 <-plm(lag(stockreturn,-4) ~ size1 + bmr_w, data = data, model="pooling")
reg1results<-summary(reg1)
reg1results

## Pooling Model
## 
## Call:
## plm(formula = lag(stockreturn, -4) ~ size1 + bmr_w, data = data, 
##     model = "pooling")
## 
## Unbalanced Panel: n = 143, T = 1-76, N = 6577
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -1.3004498 -0.0769402 -0.0014358  0.0802195  1.5023835 
## 
## Coefficients:
##                 Estimate Std. Error t-value  Pr(>|t|)    
## (Intercept)   -0.0148149  0.0063925 -2.3176  0.020504 *  
## size1Small     0.0188085  0.0067562  2.7839  0.005387 ** 
## size1Big       0.0240453  0.0068316  3.5197  0.000435 ***
## size1Very Big  0.0290969  0.0072614  4.0071 6.216e-05 ***
## bmr_w          0.0165505  0.0026145  6.3302 2.610e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    222.54
## Residual Sum of Squares: 221.1
## R-Squared:      0.0064381
## Adj. R-Squared: 0.0058334
## F-statistic: 10.6464 on 4 and 6572 DF, p-value: 1.3426e-08

# We used the lag function with -4 to indicate 4 quarters in the future for stock returns

We can also run the same regression using the simple lm function, but we need to create a new variable for future stock returns:

reg2 <-lm(F4r ~ size1 + bmr_w, data = data)
#reg2 <-lm(stockreturn ~ size1 + bmr_w, data = data)
reg2results<-summary(reg2)
reg2results

## 
## Call:
## lm(formula = F4r ~ size1 + bmr_w, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.30045 -0.07694 -0.00144  0.08022  1.50238 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.014815   0.006392  -2.318 0.020504 *  
## size1Small     0.018808   0.006756   2.784 0.005387 ** 
## size1Big       0.024045   0.006832   3.520 0.000435 ***
## size1Very Big  0.029097   0.007261   4.007 6.22e-05 ***
## bmr_w          0.016550   0.002615   6.330 2.61e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1834 on 6572 degrees of freedom
##   (5583 observations deleted due to missingness)
## Multiple R-squared:  0.006438,   Adjusted R-squared:  0.005833 
## F-statistic: 10.65 on 4 and 6572 DF,  p-value: 1.343e-08

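INTERPRETATION OF THE OUTPUT OF THE MODEL

AFTER CONTROLLING FOR FIRM SIZE, THE EFFECT OF BMR ON THE STOCK RETURN ONE YEAR LATER IS POSITIVE AND SIGNIFICANT (P-VALUE < 0.001): FOR EACH ONE-UNIT INCREASE IN BMR, THE EXPECTED FUTURE STOCK RETURN INCREASES BY ABOUT 0.0166. THE THREE SIZE DUMMIES ARE ALSO POSITIVE AND SIGNIFICANT, SO SMALL, BIG AND VERY BIG FIRMS HAVE HIGHER EXPECTED FUTURE RETURNS THAN VERY SMALL FIRMS (THE BASE CATEGORY) AT THE SAME LEVEL OF BMR.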

We have 5 regression coefficients. I will name the first coefficient (the intercept) alpha0 (instead of beta0). I will name the coefficients of the dummy variables for the categorical (factor) variable alpha1, alpha2 and alpha3. Finally, I will name the coefficient for BMR beta1.

Then, the expected value of the regression will be:

$$E[\text{F4.r}] = \alpha_0 + \alpha_1 \cdot \text{size1Small} + \alpha_2 \cdot \text{size1Big} + \alpha_3 \cdot \text{size1VeryBig} + \beta_1 \cdot \text{bmr\_w}$$

size1Small, size1Big and size1VeryBig are dummy variables with values equal to 0 or 1. As you see, there is no dummy variable for size1VerySmall since this category is the base category. The codification of this categorical variable can be better understood with the following table:

contrasts(data$size1)

##            Small Big Very Big
## Very Small     0   0        0
## Small          1   0        0
## Big            0   1        0
## Very Big       0   0        1

Then, we can plug the values of these dummy variables according to the size of the firms, and come up with one regression equation for each group of firms according to firm size.

For small firms size1Small=1 and the other dummy variables will be equal to 0. For Big firms size1Big=1, size1Small=0, and size1VeryBig=0. For Very Big firms size1VeryBig=1, size1Small=0 and size1Big=0. For Very Small firms all the dummy variables will be = 0.

Then for small firms only the dummy variable size1Small=1 and the other dummy variables =0. We can express the regression equation for SMALL firms as follows:

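$$E[\text{F4.r} \mid \text{firm} = \text{Small}] = \alpha_0 + \alpha_1 \cdot \text{size1Small} + \beta_1 \cdot \text{bmr\_w}$$

$$E[\text{F4.r} \mid \text{firm} = \text{Small}] = (-0.0148149 + 0.0188085) + 0.0165505 \cdot \text{bmr\_w} = 0.0039936 + 0.0165505 \cdot \text{bmr\_w}$$

As you see, the terms for α2 and α3 disappear since their dummy variables are equal to zero.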

For Big firms only the dummy variable size1Big=1 and the other dummy variables =0. We can express the regression equation for BIG firms as follows:

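$$E[\text{F4.r} \mid \text{firm} = \text{Big}] = \alpha_0 + \alpha_2 \cdot \text{size1Big} + \beta_1 \cdot \text{bmr\_w}$$

$$E[\text{F4.r} \mid \text{firm} = \text{Big}] = (-0.0148149 + 0.0240453) + 0.0165505 \cdot \text{bmr\_w} = 0.0092304 + 0.0165505 \cdot \text{bmr\_w}$$

As you see, the terms for α1 and α3 disappear since their dummy variables are equal to zero.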

For Very Big firms only the dummy variable size1VeryBig=1 and the other dummy variables =0. We can express the regression equation for VERY BIG firms as follows:

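$$E[\text{F4.r} \mid \text{firm} = \text{VeryBig}] = \alpha_0 + \alpha_3 \cdot \text{size1VeryBig} + \beta_1 \cdot \text{bmr\_w}$$

$$E[\text{F4.r} \mid \text{firm} = \text{VeryBig}] = (-0.0148149 + 0.0290969) + 0.0165505 \cdot \text{bmr\_w} = 0.0142820 + 0.0165505 \cdot \text{bmr\_w}$$

As you see, the terms for α1 and α2 disappear since their dummy variables are equal to zero.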

Finally, the regression equation for the VERY SMALL firms is the following:

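$$E[\text{F4.r} \mid \text{firm} = \text{VerySmall}] = \alpha_0 + \beta_1 \cdot \text{bmr\_w}$$

$$E[\text{F4.r} \mid \text{firm} = \text{VerySmall}] = -0.0148149 + 0.0165505 \cdot \text{bmr\_w}$$

As you see, the terms for α1, α2 and α3 disappear since their dummy variables are equal to zero.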

Then, we have 4 regression equations, one for each type of firm according to its size.
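The arithmetic behind these 4 equations can be checked with a few lines of R. In this sketch the coefficients are typed in by hand from the reg2 output above (rounded as printed there); the object names `a0`, `alphas` and `b1` are just illustrative.

```r
# Coefficients copied by hand from the reg2 output (rounded as printed)
a0     <- -0.014815                      # intercept (base category: Very Small)
alphas <- c("Very Small" = 0,            # the base category has no dummy
            "Small"      = 0.018808,
            "Big"        = 0.024045,
            "Very Big"   = 0.029097)
b1     <- 0.016550                       # slope for bmr_w, common to all groups

# Intercept of the regression line of each size group
intercepts <- a0 + alphas
intercepts

# Expected return 4 quarters ahead for each group when bmr_w = 1
intercepts + b1 * 1
```

All groups share the slope b1; only the intercept changes from one group to another.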

Then the α1 coefficient is the vertical distance between the regression lines of the VERY SMALL and the SMALL firms. The α2 coefficient is the vertical distance between the regression lines of the VERY SMALL and the BIG firms. And the α3 coefficient is the vertical distance between the regression lines of the VERY SMALL and the VERY BIG firms.

Use predict.lm() to predict future stock returns for different levels of BMR (from 0.6 to 1.6, in steps of 0.1) and for each of the 4 size groups.

# Make prediction using predict.lm()
# expand.grid() builds every combination of BMR value and size category
newx <- expand.grid(bmr_w = seq(0.6, 1.6, by=0.1), size1 = levels(data$size1))

pr_reg1 <- predict.lm(reg2, newx, interval = "confidence")
colnames(pr_reg1) <- c("StockReturn", "lwr", "upr")
pred_reg1 <- cbind(newx, pr_reg1)

# Plot
library(ggplot2)
ggplot(pred_reg1, aes(x = bmr_w, y=StockReturn, color=size1)) +
  geom_point(size = 2) + geom_line() + 
  geom_errorbar(aes(ymax = upr, ymin = lwr))

INTERPRET the graph and explain how the graph is related to the 4 equations explained above.

THIS GRAPH PLOTS THE PREDICTED FUTURE STOCK RETURN AGAINST BMR FOR THE FOUR SIZE GROUPS. ALL FOUR LINES SLOPE UPWARD WITH THE SAME SLOPE (BETA1 = 0.0166), SO THE RELATION BETWEEN BMR AND FUTURE STOCK RETURNS IS POSITIVE IN EVERY GROUP. EACH LINE CORRESPONDS TO ONE OF THE FOUR EQUATIONS ABOVE, AND THE VERTICAL DISTANCES BETWEEN THE LINES ARE THE DUMMY COEFFICIENTS ALPHA1, ALPHA2 AND ALPHA3.
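The link between the dummy coefficients and the vertical distance between the lines can be verified on simulated data. This is only a sketch: the data frame `sim`, the group names and the true coefficients are made up for the illustration and have nothing to do with the workshop dataset.

```r
set.seed(42)
n <- 300
# Simulated data: one continuous predictor and one 3-level factor
sim <- data.frame(
  x   = runif(n, 0.5, 2),
  grp = factor(sample(c("Small", "Medium", "Big"), n, replace = TRUE),
               levels = c("Small", "Medium", "Big"))
)
# True model: common slope 0.02 and group shifts 0, 0.01 and 0.03
shift <- c(Small = 0, Medium = 0.01, Big = 0.03)
sim$y <- -0.01 + unname(shift[as.character(sim$grp)]) + 0.02 * sim$x +
  rnorm(n, sd = 0.05)

fit <- lm(y ~ grp + x, data = sim)

# Predict at two values of x for every group
grid <- expand.grid(x = c(0.6, 1.6), grp = levels(sim$grp))
grid$pred <- unname(predict(fit, newdata = grid))

# The gap between the Big and Small lines is the same at both values of x ...
gap <- grid$pred[grid$grp == "Big"] - grid$pred[grid$grp == "Small"]
gap
# ... and it is equal to the coefficient of the Big dummy
coef(fit)["grpBig"]
```

Since the model has no interaction between the factor and x, the lines are parallel by construction.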

3 Evaluating the effect of book-to-market ratio controlling for Industry

Now we will practice another model using the categorical variable industry as a control variable.

Design a regression model to examine whether the BMR (winsorized) influences the future stock return ONE quarter later, but add the categorical variable industry as a control variable. In this case the categorical variable has more than 2 values since it represents the industry of the firms. In this dataset the industry is the 1-digit NAICS (North American Industry Classification System) classification. To see how many different industry classifications we have in this variable, type:

table(data$naics1)

## 
##                                          Accommodation and Food Services 
##                                                                      320 
## Administrative and Support and Waste Management and Remediation Services 
##                                                                      160 
##                               Agriculture, Forestry, Fishing and Hunting 
##                                                                      160 
##                                      Arts, Entertainment, and Recreation 
##                                                                      160 
##                                                             Construction 
##                                                                     1360 
##                                                    Finance and Insurance 
##                                                                     2640 
##                                        Health Care and Social Assistance 
##                                                                       80 
##                                                              Information 
##                                                                      800 
##                                  Management of Companies and Enterprises 
##                                                                       80 
##                                                            Manufacturing 
##                                                                     3120 
##                            Mining, Quarrying, and Oil and Gas Extraction 
##                                                                      400 
##                         Professional, Scientific, and Technical Services 
##                                                                       80 
##                                                    Public Administration 
##                                                                       80 
##                                       Real Estate and Rental and Leasing 
##                                                                      720 
##                                                             Retail Trade 
##                                                                     1040 
##                                           Transportation and Warehousing 
##                                                                      800 
##                                                                Utilities 
##                                                                      160

You will see 17 different industries and the number of firm-quarters in each industry. The industry with the most firm-quarters is manufacturing.

To run a regression using this classification, we need to treat this variable as Categorical, NOT numeric. In this case, this variable has more than 2 values, so we need to create many dummy variables and include them in the regression.

For example, imagine we ONLY have 3 industry types: manufacturing, service and financial services. In this case we need to create only 2 dummy variables.

Then, if we have only 3 firms, one in each category, we might have something like:

##    firm           Industry
## 1 firm1      Manufacturing
## 2 firm2            Service
## 3 firm3 Financial services

We need to create N-1 dummy variables. N is the # of different industries, in this case, 3. Then, we need to create 2 dummy variables and end up as something like:

##    firm           Industry Dummy1.Is_Service Dummy2.Is_Finance_service
## 1 firm1      Manufacturing                 0                         0
## 2 firm2            Service                 1                         0
## 3 firm3 Financial services                 0                         1

Now we have 2 dummy variables to represent 1 Categorical variable with 3 values. You see that there is NO dummy variable for the Manufacturing firms since it is not necessary. If the firm is NOT service and the firm is NOT Finance service, then the firm should be a manufacturing firm.
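The 3-firm example can be reproduced in R to see the dummy variables that would enter a regression. This sketch uses the base functions contrasts() and model.matrix(); the data frame `firms` is the toy example from the table above.

```r
# The 3-firm toy example; the first factor level is the base category
firms <- data.frame(
  firm     = c("firm1", "firm2", "firm3"),
  Industry = factor(c("Manufacturing", "Service", "Financial services"),
                    levels = c("Manufacturing", "Service", "Financial services"))
)

# Coding scheme: one column per dummy, one row per category
contrasts(firms$Industry)

# Design matrix that lm() would build: an intercept plus N-1 = 2 dummy columns
dummies <- model.matrix(~ Industry, data = firms)
dummies
```

The row for the manufacturing firm has zeros in both dummy columns, which is why no third dummy is needed.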

Fortunately, we do not have to create each dummy variable manually. R performs this automatically and temporarily. Then, we run the regression as:

# I transform the industry column into a factor column to consider 
#   this variable as a categorical variable:
industries<-unique(data$naics1)
# With the unique function I get the unique names of industries from 
#   the panel dataset
# I shorten the name of the 2nd industry to improve the display of the 
#  regression models:
industries2<-industries
industries2[2]<-"Administrative and Waste Management"
industries

##  [1] "Manufacturing"                                                           
##  [2] "Administrative and Support and Waste Management and Remediation Services"
##  [3] "Finance and Insurance"                                                   
##  [4] "Transportation and Warehousing"                                          
##  [5] "Construction"                                                            
##  [6] "Information"                                                             
##  [7] "Mining, Quarrying, and Oil and Gas Extraction"                           
##  [8] "Agriculture, Forestry, Fishing and Hunting"                              
##  [9] "Retail Trade"                                                            
## [10] "Accommodation and Food Services"                                         
## [11] "Arts, Entertainment, and Recreation"                                     
## [12] "Real Estate and Rental and Leasing"                                      
## [13] "Utilities"                                                               
## [14] "Management of Companies and Enterprises"                                 
## [15] "Professional, Scientific, and Technical Services"                        
## [16] "Health Care and Social Assistance"                                       
## [17] "Public Administration"

data$naicsf<-factor(data$naics1,industries,labels=industries2)

reg3<-lm(F1r ~ bmr_w + naicsf, data = data)
summary(reg3)

## 
## Call:
## lm(formula = F1r ~ bmr_w + naicsf, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32226 -0.07538 -0.00606  0.08093  1.46856 
## 
## Coefficients:
##                                                         Estimate Std. Error
## (Intercept)                                             0.011702   0.004624
## bmr_w                                                   0.010409   0.002332
## naicsfAdministrative and Waste Management              -0.014430   0.016834
## naicsfFinance and Insurance                             0.009673   0.006758
## naicsfTransportation and Warehousing                   -0.004392   0.010670
## naicsfConstruction                                     -0.046688   0.007817
## naicsfInformation                                      -0.012114   0.009161
## naicsfMining, Quarrying, and Oil and Gas Extraction    -0.010426   0.011879
## naicsfAgriculture, Forestry, Fishing and Hunting        0.019905   0.015100
## naicsfRetail Trade                                      0.004037   0.007514
## naicsfAccommodation and Food Services                  -0.015823   0.013736
## naicsfArts, Entertainment, and Recreation              -0.027757   0.017535
## naicsfReal Estate and Rental and Leasing               -0.015399   0.013424
## naicsfUtilities                                         0.030428   0.032158
## naicsfManagement of Companies and Enterprises           0.020675   0.021003
## naicsfProfessional, Scientific, and Technical Services  0.003414   0.027358
## naicsfHealth Care and Social Assistance                 0.013675   0.021241
## naicsfPublic Administration                            -0.176974   0.129685
##                                                        t value Pr(>|t|)    
## (Intercept)                                              2.531   0.0114 *  
## bmr_w                                                    4.463 8.22e-06 ***
## naicsfAdministrative and Waste Management               -0.857   0.3914    
## naicsfFinance and Insurance                              1.431   0.1524    
## naicsfTransportation and Warehousing                    -0.412   0.6806    
## naicsfConstruction                                      -5.973 2.45e-09 ***
## naicsfInformation                                       -1.322   0.1861    
## naicsfMining, Quarrying, and Oil and Gas Extraction     -0.878   0.3802    
## naicsfAgriculture, Forestry, Fishing and Hunting         1.318   0.1875    
## naicsfRetail Trade                                       0.537   0.5911    
## naicsfAccommodation and Food Services                   -1.152   0.2494    
## naicsfArts, Entertainment, and Recreation               -1.583   0.1135    
## naicsfReal Estate and Rental and Leasing                -1.147   0.2514    
## naicsfUtilities                                          0.946   0.3441    
## naicsfManagement of Companies and Enterprises            0.984   0.3250    
## naicsfProfessional, Scientific, and Technical Services   0.125   0.9007    
## naicsfHealth Care and Social Assistance                  0.644   0.5197    
## naicsfPublic Administration                             -1.365   0.1724    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1833 on 7081 degrees of freedom
##   (5061 observations deleted due to missingness)
## Multiple R-squared:  0.01063,    Adjusted R-squared:  0.008259 
## F-statistic: 4.477 on 17 and 7081 DF,  p-value: 2.128e-09

If the predictor (x, independent variable) is a categorical variable, R will automatically, temporarily create N-1 dummy variables with values 0 or 1 and put them in the multiple regression model. That is the reason you will see many independent variables, but all of these are just dummy variables related to the 1-digit NAICS industry classification. If you want to figure out the dummy coding scheme of a categorical variable, use the function contrasts() or model.matrix().

You will notice that R did not include the 1st industry type (Manufacturing). We say that the reference or base industry is the 1st industry. In other words, the coefficient of each dummy variable represents how much higher or lower the expected return of the respective industry is compared to the base industry.
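The base category is simply the first level of the factor. As a side illustration, and not something this workshop requires, the base-R function relevel() can move another level to the first position. The factor `industry` below is a hypothetical toy example.

```r
# Toy factor with Manufacturing as the first level (the base category)
industry <- factor(c("Manufacturing", "Construction", "Retail Trade", "Construction"),
                   levels = c("Manufacturing", "Construction", "Retail Trade"))
levels(industry)[1]

# Move Construction to the first position so that it becomes the base category
industry2 <- relevel(industry, ref = "Construction")
levels(industry2)

# The dummy variables now exist for the other two industries
colnames(contrasts(industry2))
```

Changing the base category changes the meaning of each dummy coefficient, since all of them are then measured against the new base; the fitted values of the model stay the same.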

You have to pay attention to this explanation in class. Do the following:

INTERPRET the regression results with your own words. Write down the regression equation for the first industry that appears in the regression output.

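WITH THIS REGRESSION MODEL WE CAN SEE THE RELATION BETWEEN BMR AND THE STOCK RETURN ONE QUARTER LATER. AFTER CONTROLLING FOR INDUSTRY, A ONE-UNIT INCREASE IN BMR IS ASSOCIATED WITH AN INCREASE IN F1R OF ABOUT 0.0104 (ROUGHLY 1 PERCENTAGE POINT), AND THIS EFFECT IS SIGNIFICANT. AMONG THE INDUSTRY DUMMIES, ONLY CONSTRUCTION DIFFERS SIGNIFICANTLY FROM THE BASE INDUSTRY (MANUFACTURING), WITH AN EXPECTED RETURN ABOUT 0.0467 LOWER. FOR THE FIRST INDUSTRY THAT APPEARS IN THE OUTPUT (ADMINISTRATIVE AND WASTE MANAGEMENT), THE REGRESSION EQUATION IS E[F1R] = (0.011702 - 0.014430) + 0.010409 * BMR_W = -0.002728 + 0.010409 * BMR_W.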

4 Effect of BMR, EPS and size type on future stock returns

Do the following:

Generate a categorical variable for size. Consider as large firms those above the 66th percentile of size, as small firms those below the 33rd percentile, and as middle-size firms those between the 33rd and the 66th percentiles. You have to remember what percentiles are! Be careful here: since we have historical data for many quarters, we have to do this classification quarter by quarter.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plm':
## 
##     between, lag, lead

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

sizetype1 <- data %>%
  group_by(quarter) %>%
  summarise(sizetype1 = xtile(size, n=3), firmcode=firmcode)

## `summarise()` has grouped output by 'quarter'. You can override using the `.groups` argument.

# Now we merge the panel data with sizetype1 
data <- merge(data,sizetype1,by=c("firmcode","quarter"))
# We indicate to do the merge by firm-quarter

# The index of the data frame must be firmcode-quarter
data <- pdata.frame(data, index = c("firmcode", "quarter"))

# Code dummy variable
data$size3 <- factor(data$sizetype1, c(1,2,3), labels = c("Small","Medium", "Big")) 
# Look at the coding
contrasts(data$size3)

##        Medium Big
## Small       0   0
## Medium      1   0
## Big         0   1

You can check the content of this new variable using the table() function:

table(data$size3)

## 
##  Small Medium    Big 
##   2519   2515   2442

Now we are ready to use this categorical variable in a regression model.

Run a multiple regression model to examine whether BMR (winsorized), EPSP (winsorized) and the categorical variable size3 influence future stock returns one quarter later.

reg4 <- lm(F1r ~ bmr_w + epsp_w + size3, data = data)
summary(reg4)

## 
## Call:
## lm(formula = F1r ~ bmr_w + epsp_w + size3, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29367 -0.07474  0.00327  0.07853  1.50234 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.015913   0.006434  -2.473 0.013429 *  
## bmr_w        0.006059   0.003335   1.817 0.069314 .  
## epsp_w       0.106705   0.026620   4.008  6.2e-05 ***
## size3Medium  0.021635   0.006526   3.315 0.000922 ***
## size3Big     0.022712   0.006951   3.268 0.001091 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1815 on 5191 degrees of freedom
##   (6964 observations deleted due to missingness)
## Multiple R-squared:  0.006195,   Adjusted R-squared:  0.005429 
## F-statistic: 8.089 on 4 and 5191 DF,  p-value: 1.691e-06

INTERPRET THE MODEL

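AFTER CONTROLLING FOR SIZE TYPE AND BMR, THE EFFECT OF EPSP ON THE STOCK RETURN ONE QUARTER LATER IS POSITIVE AND SIGNIFICANT: A ONE-UNIT INCREASE IN EPSP IS ASSOCIATED WITH AN INCREASE OF 0.106705 IN THE EXPECTED RETURN, THAT IS, ABOUT 0.0011 FOR EACH INCREASE OF ONE PERCENTAGE POINT (0.01) IN EPSP. THE EFFECT OF BMR IS POSITIVE BUT ONLY MARGINALLY SIGNIFICANT (P-VALUE = 0.069). SMALL FIRMS OFFER THE LOWEST EXPECTED RETURNS: MEDIUM AND BIG FIRMS EARN ABOUT 0.022 AND 0.023 MORE, RESPECTIVELY.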

With this model, estimate the following predictions

Predict the firm return for a company with a BMR that moves from 0.40 to 1.6 moving by 0.10, and for the 3 size categories. You can do this as:

# Make prediction using predict.lm()
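# expand.grid() builds every combination of BMR value and size category
newx2 <- expand.grid(bmr_w = seq(0.4, 1.6, by=0.1),
                     size3 = levels(data$size3))
# Hold epsp_w fixed at its sample mean
newx2$epsp_w <- mean(data$epsp_w,na.rm=TRUE)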
pr_reg3 <- predict.lm(reg4, newx2, interval = "confidence")
colnames(pr_reg3) <- c("StockReturn", "lwr", "upr")
pred_reg3 <- cbind(newx2, pr_reg3)

# Plot
ggplot(pred_reg3, aes(x = bmr_w, y=StockReturn, color=size3)) +
  geom_point(size = 2) + geom_line() + 
  geom_errorbar(aes(ymax = upr, ymin = lwr))

INTERPRET the prediction and the 95% Confidence Interval.

IN THIS GRAPH THE SLOPE IS THE SAME FOR ALL THREE GROUPS, BECAUSE THE MODEL HAS A SINGLE BMR COEFFICIENT. PREDICTED STOCK RETURNS ARE LOWEST FOR SMALL FIRMS; THE PREDICTIONS FOR MEDIUM AND BIG FIRMS ARE VERY SIMILAR, WITH BIG FIRMS SLIGHTLY HIGHER. THE ERROR BARS SHOW THE 95% CONFIDENCE INTERVAL OF THE EXPECTED RETURN AT EACH LEVEL OF BMR.