Workshop 4, Business Analytics for Decision Making
Authors
Alberto Dorantes D., Ph.D.
Monterrey Tech, EGADE Business School
Abstract
In this workshop we introduce the Linear Regression Model.
1 Workshop Directions
You have to work on Google Colab for this workshop. In Google Colab, you MUST LOG IN with your @tec.mx account and then create a Google Colab document for each workshop.
You must share each Colab document (workshop) with the following accounts:
cdorante@tec.mx
You must give Edit privileges to this account.
Rename your Notebook as “W2-YourFirstName YourLastname”.
You must submit your workshop by uploading your Google Colab link in Canvas. What do you have to write in your workshop? You have to:
REPLICATE and RUN all the Python code, and
DO ALL THE CHALLENGES stated in the sections. These challenges can be Python code or responses to QUESTIONS in your own words and in CAPITAL LETTERS. You have to WRITE CLEARLY so that I can see your LINE OF THINKING!
I strongly recommend you to write your OWN NOTES about the topics as if it were your study NOTEBOOK.
2 Introduction to Linear Regression
Up to now we have learned about:
Descriptive Statistics
The Histogram
The Central Limit Theorem
Hypothesis Testing
Covariance and Correlation
Without the idea of summarizing data with descriptive statistics, we cannot conceive the histogram. Without the idea of the histogram we cannot conceive the CLT, and without the CLT we cannot make inferences for hypothesis testing. We can apply hypothesis testing to test claims about random variables. These random variables can be one mean, a difference of 2 means, a correlation, and also the coefficients of the linear regression model. But what is the linear regression model?
We learned that covariance and correlation are measures of the linear relationship between 2 random variables, X and Y. The simple regression model also measures the linear relationship between 2 random variables (X and Y), but the difference is that X is supposed to explain the movements of Y, so Y, the dependent variable, depends on the movements of X, the independent variable. In addition, the regression model estimates a linear equation (the regression line) that represents how much Y (on average) moves with movements of X, and what the expected value of Y is when X=0.
The simple linear regression model is used to understand the linear relationship between two variables assuming that one variable, the independent variable (IV), can be used as a predictor of the other variable, the dependent variable (DV).
Besides using linear regression models to better understand how the dependent variable moves or changes according to changes in the independent variable, linear regression models are also used for prediction or forecasting of the dependent variable.
The simple regression model considers only one independent variable, while the multiple regression model can include more than one independent variable. Both models consider only one dependent variable. Then, we can use regression models for:
• Understanding the relationship between a dependent variable and one or more independent variables (also called explanatory variables)
• Predicting or estimating the expected value of the dependent variable according to specific values of the independent variables
In regression models it is assumed that there is a linear relationship between the dependent variable and the independent variables. It might be possible that in reality, the relation is not linear. A linear regression model does not capture non-linear relationships unless we do specific mathematical transformations of the variables.
2.1 The regression line
The regression line for a simple model (only 1 independent variable) is given by the expected value of Y (also called \hat{Y}):
E[y_i] = b_0 + b_1(x_i) = \hat{y}_i
In this case, i goes from 1 to N, the total number of observations of the sample.
2.2 Interpretation of beta coefficients
In a simple regression model we have the independent variable (X or IV) and the dependent variable (Y or DV). We assume that we are interested in learning about the DV and how it changes with changes in the other variable, the IV.
In the simple regression model, we can provide a general interpretation of the beta coefficients as follows:
beta1 is the measure of the linear relationship between the DV and the IV; if beta1>0, then on average the linear relationship is positive; if beta1<0, on average the linear relationship is negative.
beta1 is a measure of the sensitivity of the DV to changes of +1 unit in the IV. Then, beta1 is how much (on average) the DV moves if the IV moves by +1 unit. This is the reason why beta1 represents the slope of the regression line.
beta0 is the expected value of the DV when the IV=0. If beta0=0, then the regression line passes through the origin (X=0, Y=0). beta0 is the intercept since it is the point where the regression line crosses the Y axis. beta0 defines how high or low the regression line will be.
One of the most common methods to estimate linear regression models is called ordinary least squares (OLS), which was first developed by mathematicians to predict planets’ orbits. On January 1st, 1801, the Italian priest and astronomer Giuseppe Piazzi discovered a small planetoid (asteroid) in our solar system, which he named Ceres. Piazzi observed and recorded 22 positions of Ceres during 42 days, but then Ceres was lost in the glare of the Sun. Most European astronomers then tried to find a way to predict Ceres’ orbit. The great German mathematician Carl Friedrich Gauss successfully predicted Ceres’ orbit using a least squares method he had developed in 1796, when he was 18 years old. Gauss applied his least squares method using the 22 Ceres observations and 6 explanatory variables. Gauss did not publish his least squares method until 1809 (Gauss 1809); interestingly, the French mathematician Adrien-Marie Legendre had first published the least squares method in 1805 (Legendre 1805).
About 70 years later, the English anthropologist Francis Galton and the English mathematician Karl Pearson - leading founders of the discipline of Statistics - used the foundations of the least squares method to develop the first linear regression model. Galton developed the conceptual foundation of regression models when he was studying the inherited characteristics of sweet peas. Pearson further developed Galton’s ideas with a rigorous mathematical treatment.
Pearson used to work in Galton’s laboratory. When Galton died, Pearson wrote Galton’s biography. In this biography (Pearson 1930), Pearson described how Galton came up with the idea of regression. In 1875 Galton gave sweet pea seeds to seven friends. All seeds had uniform weights. His friends harvested the sweet peas and returned the plants to Galton. He made a graph comparing the size of each plant with the size of its respective parent, and found that the offspring sizes were, on average, closer to the mean size than their parents’ sizes. When graphing the offspring’s size on the Y axis and the parents’ size on the X axis, he tried to manually draw a line that could represent this relationship, and he found that the slope of the line was less than 1.0. He concluded that the size of these plants in their generation was “regressing” to the supposed mean of the species (considering several generations).
Two research articles by Galton (Galton 1886) and Pearson (Pearson 1930), written in 1886 and 1903 respectively, further developed the foundations of regression models. They examined why sons of very tall fathers are usually shorter than their fathers, while sons of very short fathers are usually taller than their fathers. After collecting and analyzing data from hundreds of families, they concluded that the height of an individual in a community or population tends to “regress” to the average height of the population where they were born. If the father is very tall, then his sons’ heights will “regress” to the average height of that population. If the father is very short, then his sons’ heights will also “regress” to that average height. They named their model the “regression” model. Nowadays the interpretation of regression models is not quite the same as “regressing” to a specific average value; regression models are used to examine linear relationships between a dependent variable and a set of independent variables.
4 Types of data structures
The market model is a time-series regression model. In this model we look at the relationship between 2 variables representing one feature or attribute (returns) of two “subjects” over time: a stock and a market index. The market model is an example of a regression model where the data structure, or type of data used, is time-series data.
There are basically three types of data used in regression models:
Time-series: each observation represents one period, and each column represents one or more variables, which are characteristics of one or more subjects. Then, we have one or more variables measured in several time periods.
Cross-sectional: each observation represents one subject in only one time period, and each column represents variables or characteristics of the subjects.
Panel data: this is a combination of time-series with cross-sectional structure.
Then, we can consider the market model as a time-series regression model.
Another way to classify regression models is based on the number of independent variables. If the regression model considers only one independent variable, then the model is known as a simple regression model. If the model considers more than one independent variable, the model is known as a multiple regression model.
5 The OLS method
Ordinary Least Squares (OLS) is an optimization method to estimate the best values of the beta coefficients and their standard errors in a linear regression model.
In the case of the simple regression model, we need to estimate the best values of b_0 (the intercept) and b_1 (the slope of the regression line). If Y is the dependent variable and X the independent variable, the regression equation is as follows:
y_i = b_0 + b_1(x_i) + \epsilon_i
The regression line is given by the expected value of Y (also called \hat{Y}):
E[y_i] = b_0 + b_1(x_i) = \hat{y}_i
In this case, i goes from 1 to N, the total number of observations of the sample.
If we plot all pairs of (x_i,y_i) we can first visualize whether there is a linear relationship between Y and X. In a scatter plot, each pair of (x_i,y_i) is one point in the plot.
The purpose of OLS is to find the regression line that best represents all the points (x_i,y_i). The b_0 and b_1 coefficients define the regression line. If b_0 changes, then the intercept of the line moves. If b_1 changes, then the slope of the line changes.
To find the best regression line (beta0 and beta1), the OLS method tries to minimize the sum of squared errors of the regression equation. Then OLS is an optimization method with the following objective:
Minimize \sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}
If we replace \hat{y}_i by the regression equation we get:

Minimize \sum_{i=1}^{N}\left[y_{i}-(b_{0}+b_{1}x_{i})\right]^{2}

Then, this sum can be seen as a function of b_0 and b_1. If b_0 changes, this sum changes; the same with b_1. Then we can re-write the optimization objective as:

\min_{b_0,b_1} f(b_0,b_1)=\sum_{i=1}^{N}\left[y_{i}-(b_{0}+b_{1}x_{i})\right]^{2}
This is a quadratic function of 2 variables. If we get its 2 first partial derivatives (with respect to b_0 and b_1) and make them equal to zero we will get 2 equations and 2 unknowns, and the solution will be the optimal values of b_0 and b_1 that minimizes this sum of squared errors. The set of these 2 partial derivatives is also called the gradient of the function.
If you remember basic geometry, if we imagine possible values of b_0 on one axis and values of b_1 on another axis, then this function is actually a curved surface with a single lowest point (a bowl shape). The optimal point (b_0, b_1) will be the lowest point of this curved surface.
Let’s do an example with a dataset of only 4 observations:
X     Y
1     2
2    -1
7    10
8     7
If I do a scatter plot and a line that fits the points:
Code
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# I create a data frame with the X and Y columns and 4 observations to illustrate linear regression
data = {'x': [1,2,7,8], 'y': [2,-1,10,7]}
df = pd.DataFrame(data)

# I calculate the beta coefficients of the line equation that represents the points (the linear regression equation)
b1, b0 = np.polyfit(df.x, df.y, 1)

# I calculate the regression equation with the coefficients:
df['yhat'] = b0 + b1*df['x']

# plt.clf()
plt.scatter(df.x, df.y)
plt.plot(df.x, df.yhat, c="orange")
plt.xticks(np.arange(-4, 14, 1))
plt.yticks(np.arange(-2, 11, 1))

# I draw a red vertical line (the error) between each point and its prediction:
for i in range(4):
    x = df.x.iloc[i]
    ymin = df.y.iloc[i]
    ymax = df.yhat.iloc[i]
    if ymin > ymax:
        ymin, ymax = ymax, ymin
    plt.vlines(x=x, ymin=ymin, ymax=ymax, color='r')
plt.axhline(y=0)
plt.axvline(x=0)
plt.xlabel("X")
plt.ylabel("Y")
plt.grid()
plt.show()
The error of each point is the red vertical line, which is the distance between the point and the prediction of the regression line. This distance is given by the difference between the specific y_i value and its predicted value:
\epsilon_i=(y_i - \hat{y}_i)
In this example, 2 errors are positive and 2 are negative.
The purpose of OLS is to find the values of b_0 and b_1 such that the sum of the squared errors is minimized. Mathematically there is only one solution for a specific set of X and Y values.
Let’s return to the objective function, which is determined by both coefficients, b_0 and b_1.
We can do a 3D plot for this function to have a better idea. It is a 3D plot since the function depends on 2 values: b_0 and b_1:
Code
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np

# I define a function to get the sum of squared errors given specific b0 and b1 coefficients:
def sumsqerrors2(b1, b0, df):
    return sum((df.y - (b0 + b1*df.x))**2)
# Note that df is a dataframe, so this line of code performs a row-wise operation to avoid
# writing a loop to sum each squared error for each observation

# Create the plot:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection='3d')

# I create 20 possible values of beta0 and beta1:
# beta1 will move between -1 and 3
b1s = np.linspace(-1, 3.0, 20)
# beta0 will move between -2 and 2:
b0s = np.linspace(-2, 2, 20)

# I create a grid with all possible combinations of beta0 and beta1 using the meshgrid function:
# M will be all the b1s values, and B the beta0 values:
M, B = np.meshgrid(b1s, b0s)

# I calculate the sum of squared errors with all possible pairs of beta0 and beta1 of the previous grid:
zs = np.array([sumsqerrors2(mp, bp, df) for mp, bp in zip(np.ravel(M), np.ravel(B))])

# I reshape the zs (squared errors) from a vector to a grid of the same size as M (20x20)
Z = zs.reshape(M.shape)

ax.plot_surface(M, B, Z, rstride=1, cstride=1, color='b', alpha=0.5)
ax.set_xlabel('b1')
ax.set_ylabel('b0')
ax.set_zlabel('sum sq.errors')
plt.show()
We see that this surface has a single lowest point. The sum of squared errors changes with different paired values (b_0, b_1). The lowest point of this surface corresponds to the optimal values of (b_0, b_1) where the sum of squared errors is the minimum. Then, how can we calculate this optimal point?
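Before deriving the solution with calculus, we can confirm numerically that the surface has a unique lowest point. Below is a minimal sketch (assuming scipy is installed, which is not used elsewhere in this workshop) that minimizes the sum of squared errors with a general-purpose optimizer and compares the result with np.polyfit:

Code
import numpy as np
from scipy.optimize import minimize

# The same 4-observation example dataset:
x = np.array([1, 2, 7, 8])
y = np.array([2, -1, 10, 7])

def sum_sq_errors(b):
    # b[0] is beta0 (intercept), b[1] is beta1 (slope)
    return np.sum((y - (b[0] + b[1] * x)) ** 2)

# Minimize the sum of squared errors starting from (0, 0):
result = minimize(sum_sq_errors, x0=[0.0, 0.0])
print("Optimizer solution (b0, b1):", result.x)
print("np.polyfit solution (b1, b0):", np.polyfit(x, y, 1))

Both approaches should land on the same coefficients, which is what we expect if the surface has a single minimum.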
Remembering a little bit of Calculus and simultaneous equations, we can do the following:
Take the partial derivative of the function with respect to b_0 and set it equal to zero:

\frac{\partial f}{\partial b_{0}}=-2\sum_{i=1}^{N}\left[y_{i}-(b_{0}+b_{1}x_{i})\right]=0

Since \left[y_{i}-(b_{0}+b_{1}x_{i})\right] is the error for observation i, this equation (Equation 1) states that the sum of all errors must be equal to zero.
Let’s do the same for the other partial derivative:

\frac{\partial f}{\partial b_{1}}=-2\sum_{i=1}^{N}x_{i}\left[y_{i}-(b_{0}+b_{1}x_{i})\right]=0

This is Equation 2. Now we have 2 equations with 2 unknowns. In this case, the unknown variables are b_0 and b_1, since the X and Y values are given from the start of the problem.
Solve the systems of equations
We can further simplify both equations by applying the sum operator to each term. For Equation 1:

\sum_{i=1}^{N}y_{i}-N(b_{0})-b_{1}\sum_{i=1}^{N}x_{i}=0
Since \bar{y}=\frac{1}{N}\sum_{i=1}^Ny_{i}, then \sum_{i=1}^{N}y_{i}=N*\bar{y}, and the same for the x variable, then:
N(\bar{y})-N(b_0)-b_1(N)(\bar{x})=0
Dividing both sides by N:
\bar{y}-b_0-b_1(\bar{x})=0
This is another way to express Equation 1. This form of the equation indicates that the point (\bar{x},\bar{y}) must be on the regression line, since the error at this point is equal to zero!
From previous equation, then we get that:
b_0=\bar{y} - b_1\bar{x}
After applying the sum operator to each term of Equation 2 and substituting b_0=\bar{y}-b_1\bar{x}, b_1 can be expressed as:

b_{1}=\frac{\sum_{i=1}^{N}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}}=\frac{Cov(x,y)}{Var(x)}
Interestingly, from this final formula for b_1, which is the slope of the regression line, we can see that b_1 is how much y covaries with x relative to the variability of x. This is actually the concept of the slope of a line: it is the sensitivity of how much y changes when x changes. Then, b_1 can be seen as the expected rate of change of y with respect to x, which is a derivative!
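As a quick numeric check of this formula (a sketch assuming the 4-observation df defined above), we can compute b_1 as the sample covariance divided by the sample variance of x:

Code
# b1 equals the sample covariance of x and y divided by the sample variance of x,
# and b0 equals ybar - b1*xbar (using the df created in the example above):
cov_xy = df.x.cov(df.y)
var_x = df.x.var()
b1_check = cov_xy / var_x
b0_check = df.y.mean() - b1_check * df.x.mean()
print(b0_check, b1_check)   # should match the coefficients from np.polyfit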
How are the b_1 coefficient and the correlation between x and y related? We learned that correlation measures the linear relationship between 2 variables. Also, b_1 measures the linear relationship between 2 variables; however, there is an important difference. Let’s look at the correlation formula:
Corr(x,y)=\frac{Cov(x,y)}{SD(x)SD(y)}
We can express Covariance in terms of Correlation:
Cov(x,y)=Corr(x,y)SD(x)SD(y)
If we plug this formula into the b_1 formula, we get:
b_1=Corr(x,y)*\frac{SD(y)}{SD(x)}
Then, we can see that b_1 is a type of scaled correlation: it is the correlation scaled by the ratio of the standard deviations, so b_1 provides information not only about the direction of the relationship, but also about the sensitivity of how much y changes in magnitude for each change of 1 unit in x!
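We can verify this identity with the same example data (a minimal sketch using the df defined above):

Code
# b1 as the correlation between x and y scaled by the ratio of standard deviations:
b1_from_corr = df.x.corr(df.y) * df.y.std() / df.x.std()
print(b1_from_corr)   # same value as the slope from np.polyfit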
6 The standard error of the beta coefficients
The OLS method includes the estimation of the beta coefficients and also estimation of their corresponding standard errors. Then, what is the standard error of the beta coefficients?
The standard error of b_0 is actually an estimation of the expected standard deviation of b_0. The same happens for b_1. But how can we estimate the standard deviation of a beta coefficient? It seems that we would need several possible values of b_0 and b_1 to estimate their standard deviations. Then, we would need many samples to estimate many possible pairs of beta coefficients. However, most of the time we only use 1 sample to estimate the beta coefficients and their standard errors. Then, why do we need to estimate the expected standard deviation of the coefficients?
In several disciplines, when we want to understand the relationship between 2 random variables, we only have access to 1 sample (not the population), so we need to find a way to estimate the error levels we might have in the beta estimations. Then, we need a way to estimate the possible variation of the beta coefficients as if we had the possibility to collect many samples and get many possible pairs of beta coefficients.
It sounds strange, but it is possible to estimate how much b_0 and b_1 might vary using only 1 sample. We can use basic probability theory and the result of the Central Limit Theorem to estimate the expected standard deviation of the beta coefficients, their corresponding t-values, p-values and their 95% confidence intervals.
To estimate the standard errors of the coefficients, we need to estimate the variability of Y around the regression line, which is the variance of the errors; this estimate is called the mean squared error (MSE), and its square root is the standard deviation of the errors.
The formula to estimate the MSE is:
MSE=\frac{SSE}{N-2}
Where: SSE = Sum of squared errors:
The SSE is divided by N-2 because 2 estimated parameters (b_0 and b_1) are needed to compute the predicted values, so we lose 2 degrees of freedom.
The SSE is calculated as:
SSE = \sum_{i=1}^{N}(y_i-\hat{y}_i)^2
The formula to estimate the standard error (SE) of b_1 is the following:

SE(b_1)=\sqrt{\frac{MSE}{\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}}}
These estimations are automatically calculated and displayed in the output of any regression software. By learning their formulas we see that the magnitude of the standard error of both coefficients is directly proportional to the variability of the errors (the MSE). Then, the greater the magnitude of the individual errors, the greater the standard error of both beta coefficients, and the harder it is to find significant betas (betas that are significantly different from zero).
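To see where these numbers come from, here is a sketch that computes SSE, MSE and the standard errors by hand for the 4-point example above and cross-checks them with statsmodels. The SE(b_0) formula used here is the standard textbook formula, which is not shown above:

Code
import statsmodels.api as sm

N = len(df)
sse = np.sum((df.y - df.yhat) ** 2)      # sum of squared errors
mse = sse / (N - 2)                      # estimated variance of the errors
sxx = np.sum((df.x - df.x.mean()) ** 2)  # variability of x around its mean
se_b1 = np.sqrt(mse / sxx)
se_b0 = np.sqrt(mse * (1 / N + df.x.mean() ** 2 / sxx))
print(se_b0, se_b1)

# Cross-check against the standard errors reported by statsmodels OLS:
model = sm.OLS(df.y, sm.add_constant(df.x)).fit()
print(model.bse)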
To further understand the standard error of beta coefficients we will do an exercise to estimate several regression models using different time ranges. Each regression will estimate one value for b_0 and one value for b_1. Then, if we run N regressions, we will have N pairs of beta coefficients, so we will see how these beta coefficients change over time. This change is measured by the standard error of the beta coefficients.
7 t-Statistic, p-value and 95% confidence interval of beta coefficients
7.1 Hypothesis tests for the beta coefficients
When we run a linear regression model, besides the estimation of the beta coefficients and their corresponding standard errors, one hypothesis test is performed for each beta coefficient.
We apply the hypothesis test to each of the beta coefficients to test whether the beta coefficient is or is not equal to zero.
For the case of the simple market regression, the following hypothesis tests are performed:
For b_0:
H0: The mean of b_0 = 0
Ha: The mean of b_0 <> 0 (Our hypothesis)
In this case, the variable of study for the hypothesis test is the b_0 coefficient.
Then, we calculate the t-Statistic for b_0 as follows:
t =\frac{(b_0 - 0)}{SE(b_0)}
SE(b_0) is the standard error of b_0, which is its estimated standard deviation.
Remember that the t-Statistic is the standardized distance between b_0 (the value estimated from the regression) and zero. In other words, the t-Statistic tells us how many standard deviations (of b_0) the estimated value of b_0 is away from zero, the hypothetical true value.
Remember that the null hypothesis (H0) is the hypothesis of the skeptical person who believes that b_0 is equal to zero. Then, we start assuming that H0 is true. If we show that there is very little probability (its p-value) that the b_0=0, then we will have statistical evidence to reject the H0 and support our Ha.
Then, if \mid t\mid>2, we have statistical evidence at (at least) the 95% confidence level to reject the null hypothesis. The critical value of 2 for t is an approximation; depending on the number of observations of the regression, this critical value can move between about 1.8 and 2.1.
From a t-Statistic and the number of observations, we can estimate the exact p-value. This value cannot be calculated with a simple formula since there is no closed-form solution for the cumulative distribution function of the t distribution. However, remember that the t-Student probability distribution becomes very similar to the normal probability distribution when the number of observations is equal to or greater than 30. Then, if the t-Statistic is about 2, remembering the probability density function of a normal distribution, the area under the curve (which is the probability) beyond t=2 and below t=-2 will be around 0.05 (5%). This is the 2-sided p-value of the test.
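As an illustration, the exact 2-sided p-value can be obtained numerically from the t distribution; below is a sketch with hypothetical values (the t-Statistic and N are placeholders for illustration only):

Code
from scipy import stats

t_stat = 2.0     # hypothetical t-Statistic (illustration only)
N = 60           # hypothetical number of observations
# 2-sided p-value: probability in both tails beyond |t|, with N-2 degrees of freedom
pvalue = 2 * stats.t.sf(abs(t_stat), df=N - 2)
print(pvalue)    # close to 0.05, as the rule of thumb suggests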
Remember that the p-value is the probability of making a mistake if we reject the null hypothesis. Then, the lower the p-value, the better. The rule of thumb is that if the p-value<0.05, then we have statistical evidence at (at least) the 95% confidence level to reject the null hypothesis.
Fortunately, the standard error, t-Statistic and 2-sided p-value for this hypothesis test are automatically calculated and shown when we run a regression model.
Then, in conclusion, if the p-value estimated for b_0 is <0.05 and b_0>0, then we can say that there is statistical evidence at the 95% confidence level to say that b_0 is greater than zero.
In the context of the market regression model, b_0 has an important meaning. If we find that b_0 is significantly greater than zero, then we can say that the stock systematically offers returns above the market. In Finance the b_0 coefficient is called Jensen’s alpha, and according to the market efficiency hypothesis it is supposed to be zero, or NOT significantly different from zero.
For b_1 the same process is performed:
H0: The mean of b_1 = 0
Ha: The mean of b_1 <> 0 (Our hypothesis)
In this case, the variable of study for the hypothesis test is the b_1 coefficient.
Then, we calculate the t-Statistic as follows:
t =\frac{(b_1 - 0)}{SE(b_1)}
SE(b_1) is the standard error of b_1, which is its estimated standard deviation.
Then, we follow the same logic as explained above to make a conclusion about b_1.
In the context of the market regression model, b_1 not only measures the linear relationship between the stock return and the market return; b_1 is a measure of the systematic market risk of the stock. If the p-value of b_1 is <0.05 and b_1>0, then we can say that the stock return is positively and significantly related to the market return.
Another interesting hypothesis test for b_1 is to examine whether b_1 is significantly greater or less than 1 (not zero). If b_1 is significantly > 1, then we can say that the stock is significantly riskier than the market. Unfortunately, this hypothesis is NOT tested in the traditional output of the regression model. We need to calculate the corresponding t-Statistic for this test manually.
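Here is a sketch of how this manual test could be done. The b_1 and standard error values are the ones that appear later in the market model example; the number of observations N is a hypothetical placeholder, so take all three values from your own regression output:

Code
from scipy import stats

b1, se_b1, N = 1.3823, 0.2502, 60       # b1 and SE from the example below; N is hypothetical
t_stat = (b1 - 1) / se_b1               # distance from 1 (not 0), in standard errors
pvalue = stats.t.sf(t_stat, df=N - 2)   # one-sided p-value for Ha: b1 > 1
print(t_stat, pvalue)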
7.2 The 95% confidence interval for each coefficient
Besides the standard error, t-Statistic and p-value, the 95% confidence interval (C.I.) is also calculated for each beta coefficient. The 95% C.I. has a minimum and maximum possible value. The 95% C.I. illustrates how much the beta coefficient can move 95% of the time.
An approximate way to estimate the minimum and maximum of this 95% C.I. is just by subtracting and adding 2 standard errors to the beta coefficient. For example, an approximate 95% C.I. for b_0 can be estimated as:

b_0 \pm 2SE(b_0)
The exact critical value for the 95% C.I. is not exactly 2; it depends on the number of observations, and it can go from about 1.8 to 2.1. The exact values of the 95% C.I. are automatically calculated when we run the regression model.
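The following sketch contrasts the rough interval (coefficient ± 2 standard errors) with the exact interval based on the critical t-value (the coefficient values are placeholders taken from the example below; N is hypothetical):

Code
from scipy import stats

b0, se_b0, N = -0.0106, 0.0139, 60      # placeholder estimates
rough_ci = (b0 - 2 * se_b0, b0 + 2 * se_b0)
t_crit = stats.t.ppf(0.975, df=N - 2)   # exact critical value for a 95% C.I.
exact_ci = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)
print(rough_ci, exact_ci)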
The 95% C.I. of a beta coefficient tells us the possible movement of the beta according to its standard error.
We can use the 95% C.I. instead of the t-Statistic or p-value to reach the same conclusion for the hypothesis test of the coefficient. If the 95% C.I. does NOT contain zero, then the beta coefficient is significantly different from zero. An advantage of the 95% C.I. is that we can quickly test the hypothesis that b_1>1 to check whether a stock is significantly riskier than the market: if 1 is not included in the 95% C.I. and b_1>1, then we can say that the stock is significantly riskier than the market at the 95% confidence level.
8 Application: The Market Regression Model
The Market Model (also named the Single-Index Model) in Finance states that the expected return of a stock is given by its alpha coefficient (b0) plus its market beta coefficient (b1) multiplied by the market return. In mathematical terms:
E[R_i] = α + β(R_M)
We can express the same equation using β_0 as alpha and β_1 as market beta:
E[R_i] = β_0 + β_1(R_M)
We can estimate the alpha and market beta coefficient by running a simple linear regression model specifying that the market return is the independent variable and the stock return is the dependent variable. It is strongly recommended to use continuously compounded returns instead of simple returns to estimate the market regression model.
The market regression model can be expressed as:
r_{(i,t)} = b_0 + b_1*r_{(M,t)} + ε_t
Where:
ε_t is the error at time t. Thanks to the Central Limit Theorem, this error behaves like a Normal distributed random variable ∼ N(0, σ_ε); the error term ε_t is expected to have mean=0 and a specific standard deviation σ_ε (also called volatility).
r_{(i,t)} is the return of the stock i at time t.
r_{(M,t)} is the market return at time t
b_0 and b_1 are called regression coefficients
Now it’s time to use real data to better understand this model. Download monthly prices for Alfa (ALFAA.MX) and the Mexican market index IPCyC (^MXX) from Yahoo Finance from January 2018 to January 2023.
8.1 Data collection
We first load the yfinance package and download monthly price data for Alfa and the Mexican market index.
Import the Python libraries
# yfinance downloads data from Yahoo Finance
import yfinance as yf
# numpy is used to do numeric calculations
import numpy as np
# pandas is used for data management
import pandas as pd
# matplotlib is used for graphs
import matplotlib
import matplotlib.pyplot as plt
# Download a dataset with prices for Alfa and the Mexican IPyC:
data = yf.download("ALFAA.MX, ^MXX", start="2018-01-01", end="2023-01-31", interval='1mo')
# I create another dataset with the Adjusted Closing price of both instruments:
adjprices = data['Adj Close']
8.2 Return calculation
We calculate continuously compounded returns for both Alfa and the IPCyC:
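The return-calculation code does not appear here; below is a minimal sketch consistent with the adjprices dataframe created above and the column names (ALFA, MXX) used in the code that follows:

# Continuously compounded returns are the difference of log prices:
returns = np.log(adjprices) - np.log(adjprices.shift(1))
# Drop the first month (NA values) and rename the columns as used below
# (assumption: adjprices columns come in the order ALFAA.MX, ^MXX):
returns = returns.dropna()
returns.columns = ['ALFA', 'MXX']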
Do a scatter plot with the IPCyC returns as the independent variable (X) and the stock returns as the dependent variable (Y). We also add a line that best represents the relationship between the stock returns and the market returns. Type:
import seaborn as sb
plt.clf()
x = returns['MXX']
y = returns['ALFA']
# I plot the (x,y) values along with the regression line that fits the data:
sb.regplot(x=x, y=y)
plt.xlabel('Market returns')
plt.ylabel('Alfa returns')
plt.show()
Sometimes graphs can be deceiving. In this case, the ranges of the X axis and the Y axis are different, so it is better to do a graph where one unit on the X axis covers the same distance as one unit on the Y axis. Type:
plt.clf()
sb.regplot(x=x, y=y)
# I adjust the scale of the X axis so that the magnitude of each unit of X is equal to that of the Y axis
plt.xticks(np.arange(-1, 1, 0.2))
# I label the axes:
plt.xlabel('Market returns')
plt.ylabel('Alfa returns')
plt.show()
8.3.1 CHALLENGE: WHAT DOES THE PLOT TELL YOU? BRIEFLY RESPOND
8.4 RUNNING THE MARKET REGRESSION MODEL
The OLS function from the statsmodels package is used to estimate a regression model. We run a simple regression model to see how the monthly returns of the stock are related to the market returns.
The first parameter of the OLS function is the DEPENDENT VARIABLE (in this case, the stock return), and the second parameter must be the INDEPENDENT VARIABLE, also named the EXPLANATORY VARIABLE (in this case, the market return).
Before we run the OLS function, we need to add a column of 1’s to the X vector in order to estimate the beta0 coefficient (the constant).
What you will get is called The Single-Index Model. You are trying to examine how the market returns can explain stock returns.
Run the Single-Index model (Y=stock return, the X=market return). You can use the function OLS from the statsmodels.api library:
import statsmodels.api as sm
# I add a column of 1's to the X dataframe in order to include the beta0 coefficient (intercept) in the model:
X = sm.add_constant(x)
# I estimate the OLS regression model:
mkmodel = sm.OLS(y, X).fit()
# I display the summary of the regression:
print(mkmodel.summary())
import statsmodels.formula.api as smf
# I estimate the OLS regression model:
mkmodel2 = smf.ols('ALFA ~ MXX', data=returns).fit()
# I display the summary of the regression:
print(mkmodel2.summary())
The regression output shows a lot of information about the relationship between the X (independent) and the Y (dependent) variables.
For now we can focus on the table of 2 rows. The first row (Intercept) shows the information of the beta0 coefficient, which is the intercept of the regression equation, also known as the constant.
The second row (MXX) shows the information of the beta1 coefficient, which represents the slope of the regression line. In this example, since the X variable is the market return and the Y variable is the stock return, beta1 can be interpreted as the sensitivity or market risk of the stock.
For each beta coefficient, the following is calculated and shown:
coef : this is the estimated value of the beta coefficient
std err : this is the standard error of the coefficient, which is the estimated standard deviation of the beta coefficient.
t : this is the t-Statistics of the following Hypothesis test:
H0: beta = 0; Ha: beta <> 0;
P>|t| : is the p-value of the above hypothesis test; if it is a value < 0.05, we can say that the beta coefficient is SIGNIFICANTLY different than ZERO with a 95% confidence.
[0.025 0.975] : This is the 95% Confidence Interval of the beta coefficient. This shows the possible values that the beta coefficient can take in the future with 95% probability.
How are the t-Statistic, the p-value and the 95% C.I. related?
INTERPRETATION OF THE REGRESSION OUTPUT
IN A SIMPLE REGRESSION MODEL, BETA0 (THE INTERCEPT WHERE THE LINE CROSSES THE Y AXIS), AND BETA1 (THE INCLINATION OR SLOPE OF THE LINE) ARE ESTIMATED.
THE REGRESSION MODEL FINDS THE LINE THAT BEST REPRESENTS ALL THE POINTS. THE BETA0 AND BETA1 COEFFICIENTS TOGETHER DEFINE THE REGRESSION LINE.
THE REGRESSION EQUATION.
ACCORDING TO THE REGRESSION OUTPUT, THE REGRESSION EQUATION THAT EXPLAINS THE RETURN OF ALFA BASED ON THE IPC’S RETURN IS:
E[ALFAret]= b0 + b1(MXXret)
E[ALFAret]= -0.0106 + 1.3823(MXXret)
THE REGRESSION MODEL AUTOMATICALLY PERFORMS ONE HYPOTHESIS TEST FOR EACH COEFFICIENT. IN THIS CASE WE HAVE 2 BETA COEFFICIENTS, SO 2 HYPOTHESIS TESTS ARE DONE. YOU CAN SEE THAT IN THE COEFFICIENTS TABLE IN THE OUTPUT.
WE START LOOKING AT THE TABLE OF COEFFICIENTS. WHERE IT SAYS (Intercept), YOU CAN SEE THE RESULT OF THE HYPOTHESIS TESTING FOR BETA0. WHERE IT SAYS THE NAME OF THE INDEPENDENT VARIABLE, IN THIS CASE, THE MARKET RETURN (MXX), YOU CAN SEE THE RESULT FOR THE BETA1 OF THE STOCK.
THE HYPOTHESIS TEST FOR BETA0 IS THE FOLLOWING:
H0: BETA0=0; THIS MEANS THAT THE INTERCEPT OF THE LINE (THE POINT WHERE THE LINE CROSSES THE Y AXIS) HAS AN AVERAGE OF ZERO. IN THE CONTEXT OF THE MARKET MODEL THIS MEANS THAT THE ALFA STOCK DOES NOT OFFER SIGNIFICANTLY LOWER NOR HIGHER RETURNS THAN THE MARKET.
HA: BETA0 <>0; THIS MEANS THAT THE INTERCEPT IS SIGNIFICANTLY DIFFERENT THAN ZERO; IN OTHER WORDS, ALFA OFFERS RETURNS ABOVE (OR BELOW) THE MARKET.
ABOUT STANDARD ERROR, T-VALUE AND P-VALUE OF THE HYPOTHESIS TESTS:
ACCORDING TO THE CENTRAL LIMIT THEOREM, SINCE BETA0 CAN BE EXPRESSED AS A LINEAR COMBINATION OF THE STOCK AND THE MARKET RETURNS, BETA0 WILL HAVE A DISTRIBUTION SIMILAR TO A NORMAL DISTRIBUTION, WITH ITS MEAN EQUAL TO THE ESTIMATED COEFFICIENT AND ITS STANDARD DEVIATION EQUAL TO THE STANDARD ERROR.
IN OTHER WORDS, BETA0 WILL MOVE IN THE FUTURE; ITS MEAN VALUE WILL BE ABOUT -0.0106, AND IT WILL VARY ON AVERAGE ABOUT 0.0139, WHICH IS THE STANDARD DEVIATION OR STANDARD ERROR OF BETA0.
WHAT DOES THIS MEAN? THIS MEANS THAT IF WE COULD TRAVEL INTO THE FUTURE AND COLLECT A NEW SAMPLE FOR EACH FUTURE MONTH, WE COULD ESTIMATE ONE BETA0 FOR EACH SAMPLE, SO WE COULD IMAGINE MANY BETA0's THAT WILL CHANGE, BUT ALL THESE VALUES WILL BE AROUND THE CURRENT MEAN.
IF WE COULD TRAVEL TO THE FUTURE, COLLECT THESE SAMPLES, AND FOR EACH SAMPLE CALCULATE A BETA0, THE HISTOGRAM OF THESE BETA0's WOULD LOOK LIKE THIS:
ACCORDING TO THIS HISTOGRAM, THE AVERAGE MIGHT BE BETWEEN -0.01 AND -0.005 SINCE THAT IS THE RANGE OF BETA0 VALUES THAT APPEARS MOST OFTEN (IT HAS THE HIGHEST BAR). IF WE ADD AND SUBTRACT ABOUT 2 TIMES 0.014 (THE STANDARD ERROR OF BETA0) FROM THE MIDPOINT -0.01, WE COVER ABOUT 95% OF THE DIFFERENT VALUES OF BETA0!
THE ESTIMATION FOR BETA0 IS -0.0106. THIS IS THE MEAN FOR BETA0. SINCE REALITY ALWAYS CHANGES, BETA0 MIGHT CHANGE IN THE FUTURE. HOW MUCH CAN IT CHANGE? THAT IS GIVEN BY ITS STANDARD DEVIATION, WHICH IS CALLED THE STANDARD ERROR. AND THANKS TO THE CENTRAL LIMIT THEOREM, BETA0 WILL BEHAVE LIKE A NORMALLY DISTRIBUTED VARIABLE.
IN THIS CASE, THE STANDARD ERROR OF BETA0 IS 0.0139. THIS MEANS THAT IN THE FUTURE BETA0 WILL HAVE A MEAN OF -0.0106, AND ABOUT 68% OF THE TIME IT WILL VARY BETWEEN 1 STANDARD DEVIATION BELOW ITS MEAN AND 1 STANDARD DEVIATION ABOVE ITS MEAN. IN ADDITION, WE CAN SAY THAT 95% OF THE TIME BETA0 WILL MOVE BETWEEN -2 STANDARD DEVIATIONS AND +2 STANDARD DEVIATIONS FROM -0.0106.
FOLLOWING THE HYPOTHESIS TEST METHOD, WE CALCULATE THE CORRESPONDING t-value OF THIS HYPOTHESIS AS FOLLOWS:
t=\frac{(B_{0}-0)}{SD(B_{0})}
THEN, t = (-0.0106 - 0) / 0.0139 = -0.762. THIS VALUE IS AUTOMATICALLY CALCULATED IN THE REGRESSION OUTPUT, IN THE ROW (Intercept) OF THE COEFFICIENTS TABLE!
REMEMBER THAT THE t-value IS THE DISTANCE BETWEEN THE ESTIMATED VALUE OF THE VARIABLE OF ANALYSIS (IN THIS CASE, B_0 = -0.0106) AND ITS HYPOTHETICAL VALUE, WHICH IS ZERO. THIS DISTANCE IS MEASURED IN STANDARD DEVIATIONS OF THE VARIABLE OF ANALYSIS. REMEMBER THAT THE STANDARD DEVIATION OF THE VARIABLE OF ANALYSIS IS CALLED THE STANDARD ERROR (IN THIS CASE, THE STD. ERROR OF B_0 = 0.0139).
SINCE THE ABSOLUTE VALUE OF THE t-value OF B_0 IS LESS THAN 2, WE CANNOT REJECT THE NULL HYPOTHESIS. IN OTHER WORDS, WE CAN SAY THAT B_0 IS NOT SIGNIFICANTLY DIFFERENT FROM ZERO (AT THE 95% CONFIDENCE LEVEL).
THE HYPOTHESIS TEST FOR BETA1 IS THE FOLLOWING:
H0: B_1 = 0 (THERE IS NO RELATIONSHIP BETWEEN THE MARKET AND THE STOCK RETURN)
Ha: B_1 > 0 (THERE IS A POSITIVE RELATIONSHIP BETWEEN THE MARKET AND THE STOCK RETURN)
IN THIS HYPOTHESIS, THE VARIABLE OF ANALYSIS IS BETA1 (B_1).
FOLLOWING THE HYPOTHESIS TEST METHOD, WE CALCULATE THE CORRESPONDING t-value OF THIS HYPOTHESIS AS FOLLOWS:
t=\frac{(B_{1}-0)}{SD(B_{1})}
THEN, t = (1.3823 - 0 ) / 0.2502 = 5.5248. THIS VALUE IS AUTOMATICALLY CALCULATED IN THE REGRESSION OUTPUT IN THE COEFFICIENTS TABLE IN THE SECOND ROW OF THE COEFFICIENT TABLE.
REMEMBER THAT THE t-value IS THE DISTANCE BETWEEN THE ESTIMATED VALUE OF THE VARIABLE OF ANALYSIS (IN THIS CASE, B_1 = 1.3823) AND ITS HYPOTHETICAL VALUE, WHICH IS ZERO. THIS DISTANCE IS MEASURED IN STANDARD DEVIATIONS OF THE VARIABLE OF ANALYSIS. REMEMBER THAT THE STANDARD DEVIATION OF THE VARIABLE OF ANALYSIS IS CALLED THE STANDARD ERROR (IN THIS CASE, THE STD. ERROR OF B_1 = 0.2502).
THE ESTIMATION FOR BETA1 IS 1.3823. THIS IS THE MEAN FOR BETA1. SINCE REALITY ALWAYS CHANGES, BETA1 MIGHT CHANGE IN THE FUTURE. HOW MUCH CAN IT CHANGE? THAT IS GIVEN BY ITS STANDARD DEVIATION, WHICH IS CALLED THE STANDARD ERROR OF BETA1. THANKS TO THE CENTRAL LIMIT THEOREM, WE CAN EXPECT THAT BETA1 WILL MOVE LIKE A NORMALLY DISTRIBUTED VARIABLE IN THE FUTURE WITH THE MEAN AND STANDARD DEVIATION (STANDARD ERROR) CALCULATED IN THE REGRESSION OUTPUT.
WE CAN SAY THAT 95% OF THE TIME BETA1 WILL MOVE BETWEEN -2 STANDARD DEVIATIONS AND + 2 STANDARD DEVIATIONS FROM 1.3823.
SINCE THE ABSOLUTE VALUE OF THE t-value OF B_1 IS MUCH GREATER THAN 2, WE HAVE ENOUGH STATISTICAL EVIDENCE AT THE 95% CONFIDENCE LEVEL TO REJECT THE NULL HYPOTHESIS. IN OTHER WORDS, WE CAN SAY THAT B_1 IS SIGNIFICANTLY GREATER THAN ZERO. WE CAN ALSO SAY THAT WE HAVE ENOUGH STATISTICAL EVIDENCE TO SAY THAT THERE IS A POSITIVE RELATIONSHIP BETWEEN THE STOCK AND THE MARKET RETURN.
8.4.0.1 MORE ABOUT THE INTERPRETATION OF THE BETA COEFFICIENTS AND THEIR t-values AND p-values
THEN, IN THIS OUTPUT WE SEE THAT B_0 = -0.0106, AND B_1 = 1.3823. WE CAN ALSO SEE THE STANDARD ERROR, t-value AND p-value OF BOTH B_0 AND B_1.
B_0 ON AVERAGE IS NEGATIVE, BUT IT IS NOT SIGNIFICANTLY NEGATIVE (AT THE 95% CONFIDENCE LEVEL) SINCE ITS p-value>0.05 AND THE ABSOLUTE VALUE OF ITS t-value<2. THEN I CAN SAY THAT IT SEEMS THAT THE ALFA RETURN ON AVERAGE UNDERPERFORMS THE MARKET RETURN BY ABOUT 1.06% (SINCE B_0 = -0.0106). IN OTHER WORDS, THE EXPECTED RETURN OF ALFA WHEN THE MARKET RETURN IS ZERO IS NEGATIVE. HOWEVER, THIS IS NOT SIGNIFICANTLY LESS THAN ZERO SINCE ITS p-value>0.05! THEN, I DO NOT HAVE STATISTICAL EVIDENCE AT THE 95% CONFIDENCE LEVEL TO SAY THAT ALFA UNDERPERFORMS THE MARKET.
B_1 IS +1.3823 (ON AVERAGE). SINCE ITS p-value<0.05, I CAN SAY THAT B_1 IS SIGNIFICANTLY GREATER THAN ZERO (AT THE 95% CONFIDENCE LEVEL). IN OTHER WORDS, I HAVE STRONG STATISTICAL EVIDENCE TO SAY THAT THE ALFA RETURN IS POSITIVELY RELATED TO THE MARKET RETURN SINCE ITS B_1 IS SIGNIFICANTLY GREATER THAN ZERO.
INTERPRETING THE MAGNITUDE OF B_1, WE CAN SAY THAT IF THE MARKET RETURN INCREASES BY +1%, WE SHOULD EXPECT THAT, ON AVERAGE, THE RETURN OF ALFA WILL INCREASE BY 1.3823%. THE SAME HAPPENS IF THE MARKET RETURN LOSES 1%: IT IS EXPECTED THAT THE ALFA RETURN, ON AVERAGE, LOSES ABOUT 1.3823%. THEN, ON AVERAGE IT SEEMS THAT ALFA IS RISKIER THAN THE MARKET. BUT WE NEED TO CHECK WHETHER IT IS SIGNIFICANTLY RISKIER THAN THE MARKET.
AN IMPORTANT ANALYSIS OF B_1 IS TO CHECK WHETHER THE STOCK IS SIGNIFICANTLY MORE RISKY OR LESS RISKY THAN THE MARKET. IN OTHER WORDS, IT IS IMPORTANT TO CHECK WHETHER B_1 IS LESS THAN 1 OR GREATER THAN 1. TO DO THIS WE CAN RUN ANOTHER HYPOTHESIS TEST TO CHECK WHETHER B_1 IS SIGNIFICANTLY GREATER THAN 1!
WE CAN DO THE FOLLOWING HYPOTHESIS TEST TO CHECK WHETHER ALFA IS RISKIER THAN THE MARKET:
H0: B_1 = 1 (ALFA IS AS RISKY AS THE MARKET)
Ha: B_1 > 1 (ALFA IS RISKIER THAN THE MARKET)
IN THIS HYPOTHESIS, THE VARIABLE OF ANALYSIS IS BETA1 (B_1).
FOLLOWING THE HYPOTHESIS TEST METHOD, WE CALCULATE THE CORRESPONDING t-value OF THIS HYPOTHESIS AS FOLLOWS:
t=\frac{(B_{1}-1)}{SD(B_{1})}
THEN, t = (1.3823 - 1 ) / 0.2502 = 1.528. THIS VALUE IS NOT AUTOMATICALLY CALCULATED IN THE REGRESSION OUTPUT.
SINCE THE t-value = 1.528 IS LESS THAN 2, WE DO NOT HAVE ENOUGH STATISTICAL EVIDENCE TO REJECT THE NULL HYPOTHESIS. IN OTHER WORDS, ALTHOUGH B_1 IS GREATER THAN 1 ON AVERAGE, WE CANNOT SAY THAT ALFA IS SIGNIFICANTLY RISKIER THAN THE MARKET (AT THE 95% CONFIDENCE LEVEL).
8.4.1 95% CONFIDENCE INTERVAL OF THE BETA COEFFICIENTS
WE CAN USE THE 95% CONFIDENCE INTERVAL OF BETA COEFFICIENTS AS AN ALTERNATIVE TO MAKE CONCLUSIONS ABOUT B_0 AND B_1 (INSTEAD OF USING t-values AND p-values).
THE 95% CONFIDENCE INTERVALS FOR BOTH BETAS ARE DISPLAYED IN THE REGRESSION OUTPUT.
THE FIRST ROW SHOWS THE 95% CONFIDENCE INTERVAL FOR B_0, AND THE SECOND ROW SHOWS THE CONFIDENCE INTERVAL OF B_1. WE CAN SEE THAT THESE VALUES ARE VERY SIMILAR TO THE “ROUGH” ESTIMATE USING t-critical-value = 2. THE EXACT CRITICAL t-value DEPENDS ON THE # OF OBSERVATIONS OF THE SAMPLE.
HOW DO WE INTERPRET THE 95% CONFIDENCE INTERVAL FOR B_0?
IN THE NEAR FUTURE, B_0 CAN HAVE A VALUE BETWEEN -0.0385 AND 0.0173 95% OF THE TIME. IN OTHER WORDS B_0 CAN MOVE FROM A NEGATIVE VALUE TO ZERO TO A POSITIVE VALUE. THEN, WE CANNOT SAY THAT 95% OF THE TIME, B_0 WILL BE NEGATIVE. IN OTHER WORDS, WE CONCLUDE THAT B_0 IS NOT SIGNIFICANTLY NEGATIVE AT THE 95% CONFIDENCE LEVEL.
HOW OFTEN WILL B_0 BE NEGATIVE? LOOKING AT THE 95% CONFIDENCE INTERVAL, B_0 WILL BE NEGATIVE MORE THAN 50% OF THE TIME. BEING MORE SPECIFIC, WE CAN CALCULATE THIS AS 1 MINUS HALF OF THE 2-SIDED p-value. IN THIS CASE, THE p-value = 0.4492, SO 1 - (0.4492/2) = 0.7754; THEN ABOUT 77.54% OF THE TIME B_0 WILL BE NEGATIVE!
HOW DO WE INTERPRET THE 95% CONFIDENCE INTERVAL FOR B_1?
IN THE NEAR FUTURE, B_1 CAN MOVE BETWEEN 0.8815 AND 1.8832 95% OF THE TIME. SINCE THIS INTERVAL DOES NOT CONTAIN ZERO, B_1 IS SIGNIFICANTLY POSITIVE. HOWEVER, THE INTERVAL DOES CONTAIN 1, SO WE CANNOT SAY THAT B_1 IS SIGNIFICANTLY GREATER THAN 1. IN OTHER WORDS, ALTHOUGH ALFA LOOKS RISKIER THAN THE MARKET ON AVERAGE (B_1 = 1.3823), WE CANNOT SAY THAT ALFA IS SIGNIFICANTLY RISKIER THAN THE MARKET AT THE 95% CONFIDENCE LEVEL. THIS IS CONSISTENT WITH THE t-value OF 1.528 CALCULATED ABOVE.
8.5 OPTIONAL CHALLENGE: Estimate moving betas for the market regression model
How do the beta coefficients of a stock move over time? Are the b_1 and b_0 of a stock stable? If not, do they change gradually or can they change radically over time? We will run several rolling regressions for Alfa to try to answer these questions.
Before we do the exercise, I will review the meaning of the beta coefficients in the context of the market model.
In the market regression model, b_1 is a measure of sensitivity; it measures how much the stock return might move (on average) when the market return moves by +1%.
Then, according to the market regression model, the stock return will change if the market return changes, and it will also change because of many other external factors. The aggregation of these external factors is what the error term represents.
It is said that b_1 in the market model measures the systematic risk of the stock, which depends on changes in the market return. The unsystematic risk of the stock is given by the error term, also named the random shock, which summarizes the overall reaction of all investors to news that might affect the stock (news about the company, its industry, regulations, national news, global news).
We can make predictions of the stock return by measuring the systematic risk with the market regression model, but we cannot predict the unsystematic risk. The most we can measure with the market model is the variability of this unsystematic risk (the variance of the error).
In this exercise you have to estimate rolling regressions by moving time windows and run 1 regression for each time window.
For the same ALFAA.MX stock, run rolling regressions using a time window of 36 months, starting from Jan 2010.
The first regression has to start in Jan 2010 and end in Dec 2012 (36 months). For the second, you have to move the time window 1 month ahead, so it will start in Feb 2010 and end in Jan 2013. For the third regression you move another month ahead and run the regression. You continue running all possible regressions until you end up with a window covering the last 36 months of the dataset.
This sounds complicated, but fortunately we can use the function RollingOLS, which automatically performs rolling regressions by shifting the 36-month window by 1 month in each iteration.
Then, you have to do the following:
Download monthly stock prices for ALFAA.MX and the market (^MXX) from Jan 2010 to Jul 2022, and calculate cc returns.
Code
# Getting price data and selecting adjusted price columns:
sprices = yf.download("ALFAA.MX ^MXX", start="2010-01-01", interval="1mo")
Code
sprices = sprices['Adj Close']
# Calculating returns:
sr = np.log(sprices) - np.log(sprices.shift(1))
# Deleting the first month with NAs:
sr = sr.dropna()
sr.columns = ['ALFAAret', 'MXXret']
Run rolling regressions and save the moving b_0 and b_1 coefficients for all time windows.
Code
from statsmodels.regression.rolling import RollingOLS
import statsmodels.api as sm

x = sm.add_constant(sr['MXXret'])
y = sr['ALFAAret']
rolreg = RollingOLS(y, x, window=36).fit()
betas = rolreg.params
# I check the last pairs of beta values:
betas.tail()
Do a plot to see how b_1 and b_0 have changed over time.
Code
plt.clf()
plt.plot(betas['MXXret'])
plt.title('Moving beta1 for Alfaa')
plt.xlabel('Date')
plt.ylabel('beta1')
plt.show()
Code
plt.clf()
plt.plot(betas['const'])
plt.title('Moving beta0 for Alfaa')
plt.xlabel('Date')
plt.ylabel('beta0')
plt.show()
We can see that both beta coefficients move over time; they are not constant. There is no apparent pattern in the changes of the beta coefficients, but we can appreciate how much they can move over time; in other words, we can visualize their standard deviation, which is the average movement around their means.
We can actually calculate the mean and standard deviation of all these pairs of moving beta coefficients and compare them with the beta coefficients and standard errors of the original regression, which used only 1 sample (the first regression above):
betas.describe()
const MXXret
count 146.000000 146.000000
mean -0.002838 1.358933
std 0.014118 0.505293
min -0.025587 0.428315
25% -0.012807 1.078348
50% -0.007655 1.375746
75% 0.006367 1.711368
max 0.030651 2.343688
We calculated 146 regressions using 146 different 36-month rolling windows. For each regression we calculated a pair of b_0 and b_1.
Compared with the first market regression of Alfa, which used the most recent months from 2018 (about 54 months or 4.5 years), we see that the mean of the moving betas is very similar to the estimated beta of the first regression. Also, we see that the standard deviation of the moving b_0 is very similar to the standard error of b_0 estimated in the first regression. The standard deviation of the moving b_1, however, is much higher than the standard error of b_1 of the first regression. This difference might be because the moving betas were estimated using data from 2010, while the first regression used data from 2018, so it seems that the systematic risk of Alfa (measured by its b_1) has been decreasing in recent years.
I hope that now you can understand why we need an estimation of the standard error of the beta coefficients (standard deviation of the coefficients).
9 References
Galton, Francis. 1886. “Family Likeness in Stature.” Proceedings of the Royal Society 40: 42–72.
Gauss, Carl Friedrich. 1809. Theory of Motion of the Celestial Bodies Moving in Conic Sections Around the Sun.
Legendre, Adrien-Marie. 1805. Nouvelles Méthodes Pour La Détermination Des Orbites Des Comètes.
Pearson, Karl. 1930. The Life, Letters and Labours of Francis Galton.