Workshop 2, Econometric Models

Author

Alberto Dorantes

Published

February 17, 2025

Abstract

This is an INDIVIDUAL workshop. In this workshop we review the concepts of covariance and correlation, and introduce the Linear Regression Model.

1 Measures of linear relationship

We might be interested in learning whether there is a pattern of movement of a random variable when another random variable moves up or down. An important pattern we can measure is the linear relationship. The main two measures of linear relationship between 2 random variables are:

Covariance and
Correlation

Let’s start with an example. Imagine we want to see whether there is a relationship between the S&P500 and Microsoft stock.

The S&P500 is an index that represents the 500 biggest US companies, which is a good representation of the US financial market. We will use monthly data starting in January 2019.

Let’s download the price data and do the corresponding return calculation. Instead of pandas, we will use yfinance to download online data from Yahoo Finance.

import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib
import matplotlib.pyplot as plt

# We download price data for Microsoft and the S&P500 index:
prices=yf.download(tickers="MSFT ^GSPC", start="2019-01-01",interval="1mo", auto_adjust=True)


[*********************100%***********************]  2 of 2 completed

# We select Adjusted closing prices and drop any row with NA values:
adjprices = prices['Close'].dropna()

GSPC stands for Global Standard & Poors Composite, which is the S&P500 index.

Now we will do some informative plots to start learning about the possible relationship between GSPC and MSFT.

Unfortunately, the range of stock prices and market indexes can vary a lot, so this makes it difficult to compare price movements in one plot. For example, if we plot the MSFT prices and the S&P500:

adjprices.plot(y=['MSFT','^GSPC'])
plt.show()

It looks like the GSPC has had a better performance, but this is misleading since the two investments have very different price ranges.

When comparing the performance of 2 or more stock prices and/or indexes, it is a good idea to generate an index for each series, so that we can emulate how much $1.00 invested in each stock/index would have moved over time. We can divide the stock price of any month by the stock price of the first month to get a growth factor:

# I create a dataset to calculate indexes for each variable, where the index value will be a growth factor = its value divided by its first value
indexprices = adjprices / adjprices.iloc[0]

This growth factor is like an index of the original variable. Now we can plot these 2 new indexes over time and see which investment was better:

indexprices.plot(y=['MSFT','^GSPC'])
plt.show()

Now we have a much better picture of which instrument has performed better over time. The line of each instrument represents how the value of $1.00 invested in that instrument would have changed over time.
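To see the arithmetic of the growth factor without downloading data, here is a minimal sketch with made-up prices. The names toy_prices and toy_index and all the numbers are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Hypothetical prices for two instruments with very different price levels
toy_prices = pd.DataFrame(
    {"STOCK": [100.0, 110.0, 121.0], "INDEX": [2500.0, 2550.0, 2625.0]},
    index=pd.to_datetime(["2019-01-01", "2019-02-01", "2019-03-01"]),
)

# Growth factor: each price divided by the first price of its own column
toy_index = toy_prices / toy_prices.iloc[0]
print(toy_index)

# The last row shows the value of $1.00 invested in the first month
print(toy_index.iloc[-1])
```

Both columns start at 1.0, so the two lines become directly comparable regardless of the original price levels.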

Now we calculate continuously compounded monthly returns. In pandas, most data management functions work along the rows. In other words, operations are applied to all columns, row by row:

# I create a new data frame to calculate the log returns
r = np.log(adjprices).diff(1)
# The diff function calculates the difference between the log price of t and the log price of t-1

# Dropping rows with NA values (the first month has NA's)
r = r.dropna()
# Renaming the columns to avoid special characters like ^GSPC:
r.columns = ['MSFT','GSPC']

Now the r dataframe will have 2 columns for both cc historical returns:

r.head()

                MSFT      GSPC
Date                          
2019-02-01  0.070250  0.029296
2019-03-01  0.055671  0.017766
2019-04-01  0.101963  0.038560
2019-05-01 -0.054441 -0.068041
2019-06-01  0.083539  0.066658
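As a quick check of what the log-difference does, the following sketch uses a small hypothetical price series (the name toy and its values are made up). It compares continuously compounded returns with simple returns and shows that cc returns add up over time.

```python
import numpy as np
import pandas as pd

toy = pd.Series([100.0, 110.0, 99.0], name="price")  # hypothetical prices

# Continuously compounded (log) return: r_t = log(P_t) - log(P_{t-1})
cc_ret = np.log(toy).diff(1).dropna()
# Simple return: R_t = P_t / P_{t-1} - 1
simple_ret = toy.pct_change().dropna()

# Both definitions are linked by r_t = log(1 + R_t)
same_link = np.allclose(cc_ret, np.log(1 + simple_ret))
# cc returns are additive: their sum equals log(P_last / P_first)
additive = np.isclose(cc_ret.sum(), np.log(toy.iloc[-1] / toy.iloc[0]))
print(same_link, additive)
```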

To learn about the possible relationship between the GSPC and MSFT we can look at their prices and also we can look at their returns.

We start with a scatter plot to see whether there is a linear relationship between the MSFT returns and the GSPC returns:

#plt.clf()
r.plot.scatter(x='GSPC', y='MSFT',c='DarkBlue')
plt.show()

What do you see?

We can also do a scatter plot to visualize the relationship between the MSFT prices and GSPC index:

plt.clf()
adjprices.plot.scatter(x='^GSPC', y='MSFT',c='DarkBlue')
plt.show()

Which plot conveys a stronger linear relationship?

The scatter plot using the prices conveys an apparent stronger linear relationship compared to the scatter plot using returns.

Stock returns are variables that usually do NOT grow over time; their plot looks like a heartbeat:

plt.clf()
r.plot(y=['MSFT','GSPC'])
plt.show()

Stock returns behave like a stationary variable since they do not have a growing or declining trend over time. A stationary variable is a variable that has a similar average and standard deviation in any time period.

Stock prices (and indexes) are variables that usually grow over time (sooner or later). These variables are called non-stationary variables. A non-stationary variable usually changes its mean depending on the time period.

In statistics, we have to be very careful when looking at linear relationships when using non-stationary variables, like stock prices. It is very likely that we end up with spurious measures of linear relationships when we use non-stationary variables. To learn more about the risk of estimating spurious relationships, we will cover this issue in the topic of time-series regression models (covered in a more advanced module).

Then, in this case it is better to look at linear relationship between stock returns (not prices).
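The risk of spurious relationships can be explored with a small simulation. This is only a sketch under stated assumptions: the two return series are simulated as independent Normal draws (the seed, sample size and volatility are arbitrary choices), and their cumulative sums play the role of non-stationary log prices.

```python
import numpy as np

rng = np.random.default_rng(123)  # arbitrary seed, only for reproducibility
n = 2000

# Two INDEPENDENT series of simulated returns (stationary by construction)
ret_a = rng.normal(0.0, 0.05, n)
ret_b = rng.normal(0.0, 0.05, n)

# Accumulating returns gives two independent random walks (non-stationary levels)
level_a = ret_a.cumsum()
level_b = ret_b.cumsum()

corr_returns = np.corrcoef(ret_a, ret_b)[0, 1]
corr_levels = np.corrcoef(level_a, level_b)[0, 1]
print(f"Correlation of returns: {corr_returns:.3f}")
print(f"Correlation of levels:  {corr_levels:.3f}")
```

The correlation of the returns should be close to zero, as expected for independent series. The correlation of the levels changes a lot from seed to seed and can be far from zero just by chance, which is why it should not be trusted as a measure of relationship.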

1.1 Covariance

The Covariance between 2 random variables, X and Y, is a measure of linear relationship.

The Covariance is the average of product deviations between X and Y from their corresponding means.

For a sample of N and 2 random variables X and Y, we can calculate the population covariance as:

Cov(X,Y)=\frac{1}{N}\left[(X_{1}-\bar{X})(Y_{1}-\bar{Y})+...+(X_{N}-\bar{X})(Y_{N}-\bar{Y})\right]

We can easily express this average as:

Cov(X,Y)=\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)

The covariance is also defined as the expected value of the product deviations:

Cov(X,Y)=E[(X-\bar{X})(Y-\bar{Y})]

Doing some math:

Cov(X,Y)=E[(XY-X\bar{Y}-\bar{X}Y+\bar{X}\bar{Y})]

Applying the expectation to each term:

Cov(X,Y)=E[XY]-E[X\bar{Y}]-E[\bar{X}Y]+E[\bar{X}\bar{Y}]

Since \bar{X} and \bar{Y} are constant, we can take them out of the expectation.

Cov(X,Y)=E[XY]-\bar{Y}E[X]-\bar{X}E[Y]+\bar{X}\bar{Y}

Since E[X]=\bar{X} and E[Y]=\bar{Y}, then:

Cov(X,Y)=E[XY]-\bar{Y}\bar{X}-\bar{X}\bar{Y}+\bar{X}\bar{Y}

Simplifying:

Cov(X,Y)=E[XY]-\bar{Y}\bar{X}

Then, we can express the covariance as

Cov(X,Y)=\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}Y_{i}\right)-\bar{X}\bar{Y}

Since the Variance is a special case of the Covariance - the variance is the covariance of a variable with itself- then we can also say that:

Var(X)=E[X^2]-\bar{X}^2

Also:

Var(X)=\frac{1}{N}\sum_{i=1}^{N}\left(X_{i}\right)^2-\bar{X}^2
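The shortcut formulas derived above can be verified numerically. The sketch below uses simulated data (x and y are hypothetical series, not the MSFT and GSPC returns) and the population version of the formulas (dividing by N).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
x = rng.normal(0.01, 0.05, 100)
y = 0.5 * x + rng.normal(0.0, 0.03, 100)

# Definition: average of the product of deviations from the means
cov_dev = np.mean((x - x.mean()) * (y - y.mean()))
# Shortcut: E[XY] minus the product of the means
cov_short = np.mean(x * y) - x.mean() * y.mean()
# Same shortcut for the variance: E[X^2] minus the squared mean
var_short = np.mean(x**2) - x.mean() ** 2

print(np.isclose(cov_dev, cov_short))
print(np.isclose(var_short, np.var(x)))  # np.var divides by N by default
```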

The sample covariance formula is very similar, but it divides by N-1 instead of N to get the average of product deviations:

Cov(X,Y)=\frac{1}{N-1}\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)

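Why divide by N-1 instead of N? In Statistics, we assume that we work with samples and never have access to the whole population, so a sample measure is always calculated with incomplete data. In addition, the deviations are measured from the sample means, which are estimated with the same data, so one degree of freedom is lost. Dividing by N-1 gives a more conservative value (slightly larger in magnitude) than the population formula. That is the reason why we use N-1 degrees of freedom instead of N.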

The sample covariance will always be slightly larger in absolute value than the population covariance, but they will be similar. When N is large (N>30), the population and sample covariance values will be almost the same. The sample covariance formula is the default formula in most statistical software.
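The difference between both formulas can be checked with the ddof argument of the numpy cov function. This is a minimal sketch with simulated data (x, y and N=60 are assumptions made only for the illustration).

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
N = 60
x = rng.normal(0.0, 0.05, N)
y = 0.8 * x + rng.normal(0.0, 0.04, N)

cov_pop = np.cov(x, y, ddof=0)[0, 1]  # population formula: divides by N
cov_sample = np.cov(x, y)[0, 1]       # default: divides by N-1

# The ratio between both values is exactly N / (N-1)
print(cov_sample / cov_pop, N / (N - 1))
```

With N=60 the ratio is about 1.017, so both values differ by less than 2%.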

If Cov(X,Y)>0, we can say that, on average, there is a positive linear relationship between X and Y. If Cov(X,Y)<0, we can say that there is a negative linear relationship between X and Y.

A positive linear relationship between X and Y means that if X increases, it is likely that Y will also increase; and if X decreases, it is likely that Y will also decrease.

A negative linear relationship between X and Y means that if X increases, it is likely that Y will decrease; and if X decreases, it is likely that Y will increase.

If we want to test whether Cov(X,Y) is positive and significant, we need to do a hypothesis test. If the pvalue<0.05 and Cov(X,Y) is positive, then we can say with 95% confidence that there is a positive linear relationship.

There is no constraint in the possible values of Cov(X,Y) that we can get:

-\infty<Cov(X,Y)<\infty

We can interpret the sign of the covariance, but we CANNOT interpret its magnitude, since it depends on the units of X and Y. Fortunately, the correlation is a very practical measure of linear relationship: we can interpret both its sign and its magnitude, since its possible values range from -1 to 1 and indicate the strength of the linear relationship.

Actually, the correlation between X and Y is a standardized measure of the covariance.

1.2 Correlation

Correlation is a very practical measure of linear relationship between 2 random variables. It is actually a scaled version of the Covariance:

Corr(X,Y)=\frac{Cov(X,Y)}{SD(X)SD(Y)}

If we divide Cov(X,Y) by the product of the standard deviations of X and Y, we get the correlation, which can have values only between -1 and +1.

-1\leq Corr(X,Y)\leq 1

If Corr(X,Y) = +1, X and Y move exactly in the same direction: Y is an exact linear function of X with a positive slope, that is, Y is equal to a constant plus X multiplied by a positive number.

If Corr(X,Y) = -1, Y is also an exact linear function of X, but with a negative slope, so Y moves in the opposite direction.

If Corr(X,Y) = 0, the movements of Y are not linearly related to the movements of X; in this case, there is no clear linear pattern of how Y moves when X moves. Note that a zero correlation does not necessarily mean that X and Y are independent, since they could still have a non-linear relationship.

If 0<Corr(X,Y)<1, there is a positive linear relationship between X and Y. The strength of this relationship is given by the magnitude of the correlation. For example, if Corr(X,Y) = 0.50 and X increases by one standard deviation, then Y is expected to increase, on average, by 0.50 of its standard deviation. The correlation is not a probability.

If -1<Corr(X,Y)<0, there is a negative linear relationship between X and Y. The strength of this relationship is given by the magnitude of the correlation. For example, if Corr(X,Y) = -0.50 and X increases by one standard deviation, then Y is expected to decrease, on average, by 0.50 of its standard deviation (and vice versa).

If we want to test that Corr(X,Y) is positive and significant, we need to do a hypothesis test. The formula for the standard error (standard deviation of the correlation) is:

SD(corr)=\sqrt{\frac{(1-corr^{2})}{(N-2)}}

Then, the t-Statistic for this hypothesis test will be:

t=\frac{corr}{\sqrt{\frac{(1-corr^{2})}{(N-2)}}}

If Corr(X,Y)>0 and t>2 (its pvalue will be <0.05), then we can say that we have a 95% confidence that there is a positive linear relationship; in other words, that the correlation is positive and statistically significant (significantly greater than zero).
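The standard error and t-Statistic formulas can be programmed directly. The sketch below uses simulated data (x, y and N are hypothetical) and compares the 2-tailed pvalue of the t-Statistic, with N-2 degrees of freedom, against the pvalue reported by the scipy pearsonr function.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed
N = 60
x = rng.normal(0.0, 0.05, N)
y = 0.5 * x + rng.normal(0.0, 0.06, N)

corr_xy = np.corrcoef(x, y)[0, 1]
# Standard error of the correlation and its t-Statistic
se_corr = np.sqrt((1 - corr_xy**2) / (N - 2))
t_stat = corr_xy / se_corr
# 2-tailed pvalue from the t distribution with N-2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=N - 2)

# pearsonr returns the correlation and its 2-tailed pvalue
r_scipy, p_scipy = stats.pearsonr(x, y)
print(f"t = {t_stat:.3f}, pvalue = {p_value:.5f}, scipy pvalue = {p_scipy:.5f}")
```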

1.3 Calculating covariance and correlation

We can program the covariance of 2 variables according to the formula:

msft_mean = r['MSFT'].mean()
gspc_mean = r['GSPC'].mean()
N = r['GSPC'].count()
sum_of_prod = ((r['MSFT'] - msft_mean) * (r['GSPC'] - gspc_mean) ).sum()  
cov = sum_of_prod / (N-1)
print(f"Covariance of MSFT with GSPC returns = {cov}")

Covariance of MSFT with GSPC returns = 0.002184496436896604

Fortunately, we have the numpy function cov to calculate the covariance:

covm = np.cov(r['MSFT'],r['GSPC'])
print("Covariance matrix of MSFT with GSPC returns :")

Covariance matrix of MSFT with GSPC returns :

covm

array([[0.00361832, 0.0021845 ],
       [0.0021845 , 0.00244557]])

The cov function calculates the variance-covariance matrix using both returns. We can find the covariance in the non-diagonal elements, which will be the same values since the covariance matrix is symmetric.

The diagonal values have the variances of each return since the covariance of one variable with itself is actually its variance (Cov(X,X) = Var(X) ) .

Then, to extract the covariance between MSFT and GSPC returns we can extract the element in the row 1 and column 2 of the matrix:

cov = covm[0,1]
print(f"Covariance of MSFT with GSPC returns = {cov}")

Covariance of MSFT with GSPC returns = 0.0021844964368966046

In Python, the first row of an array or a data frame has the position number zero! The same applies to the column number. Python always starts numbering rows and columns with the number zero.

This value is exactly the same as the one we calculated manually.

We can use the corrcoef function of numpy to calculate the correlation matrix:

corr = np.corrcoef(r['MSFT'],r['GSPC'])
print("The correlation matrix is:")

The correlation matrix is:

corr

array([[1.        , 0.73435694],
       [0.73435694, 1.        ]])

The correlation matrix will have +1 in its diagonal since the correlation of one variable with itself is +1. The non-diagonal value will be the actual correlation between the corresponding 2 variables (the one in the row, and the one in the column).

We could also manually calculate correlation using the previous covariance:

corr2 = cov / (r['MSFT'].std() * r['GSPC'].std())
corr2

np.float64(0.7343569426145472)

print(f"The correlation between MSFT and GSPC returns is = {corr2}")

The correlation between MSFT and GSPC returns is = 0.7343569426145472

We can use the scipy pearsonr function to calculate correlation and also the 2-tailed pvalue to see whether the correlation is statistically different than zero:

from scipy.stats import pearsonr
corr2 = pearsonr(r['MSFT'],r['GSPC'])
print(corr2)

PearsonRResult(statistic=np.float64(0.734356942614547), pvalue=np.float64(9.654021159320175e-14))

The pvalue is almost zero (9.65 * 10^{-14}) . MSFT and GSPC returns have a positive and very significant correlation (at the 99.9999…% confidence level).

2 The Linear regression model

2.1 Introduction

Up to now we have learned about:

Descriptive Statistics
The Histogram
The Central Limit Theorem
Hypothesis Testing
Covariance and Correlation

Without the idea of summarizing data with descriptive statistics, we cannot conceive the histogram. Without the idea of the histogram we cannot conceive the CLT, and without the CLT we cannot make inferences for hypothesis testing.

We can apply hypothesis testing to test claims about random variables.

What is a random variable? It is a variable whose future values we cannot predict with certainty.

Important examples of random variables where we can apply hypothesis testing are:

The mean of a variable
Difference between 2 variable means
Correlation coefficient
Coefficients of Linear regression models

But what is the linear regression model?

The linear regression model examines the linear relationship between one or more explanatory variables (usually named x_1, x_2, … ) and a dependent variable (DV) (usually named Y). With this model we try to use the explanatory variables to understand and/or predict the dependent variable. The explanatory variables are also called independent variables (IV).

If there is only one explanatory variable (x_1), the model is called simple regression model. When we have more than one explanatory variable, then the model is called multiple regression model.

We learned that covariance and correlation are measures of linear relationship between 2 random variables, X and Y. The simple regression model also measures the linear relationship between 2 random variables (X and Y), but the difference is that X is supposed to explain the movements of Y, so Y depends on the movement of X, the independent variable.

In addition, the regression model estimates a linear equation (regression line) to represent how much Y moves (on average) for a +1 unit movement of X (the beta1 coefficient). This equation also indicates what is the expected value of Y when X=0 (the beta0 coefficient).
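The link between covariance, correlation and the regression line can be sketched in a few lines. In a simple regression, the least squares slope is b1 = Cov(X,Y)/Var(X) and the intercept is b0 = mean(Y) - b1*mean(X). The data below are simulated (all numbers are hypothetical), and the result is compared with the numpy polyfit function.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
x = rng.normal(0.01, 0.05, 60)
y = 0.002 + 1.3 * x + rng.normal(0.0, 0.08, 60)

# Slope: covariance of X and Y divided by the variance of X
b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
# Intercept: the regression line passes through the point of means
b0 = y.mean() - b1 * x.mean()

# The slope is also the correlation rescaled by the ratio of standard deviations
b1_from_corr = np.corrcoef(x, y)[0, 1] * np.std(y, ddof=1) / np.std(x, ddof=1)

# Least squares line of degree 1 (returns slope first, then intercept)
slope, intercept = np.polyfit(x, y, deg=1)
print(b0, b1)
print(intercept, slope)
```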

Then, we can use regression models for:

• Understanding the relationship between a dependent variable and one or more independent variables - also called explanatory variables

• Predicting or estimating the expected value of the dependent variable according to specific future value(s) of independent variable(s)

2.2 Interesting facts from history

One of the most common methods to estimate linear regression models is called ordinary least squares, which was first developed by mathematicians to predict planets’ orbits. On January 1st, 1801, the Italian priest and astronomer Giuseppe Piazzi discovered a small planetoid (asteroid) in our solar system, which he named Ceres. Piazzi observed and recorded 22 Ceres positions during 42 days, but suddenly Ceres was lost in the glare of the Sun. Then, most European astronomers started looking for a way to predict Ceres’ orbit. The great German mathematician Carl Friedrich Gauss successfully predicted Ceres’ orbit using a least squares method he had developed in 1796, when he was 18 years old. Gauss applied his least squares method using the 22 Ceres observations and 6 explanatory variables. Gauss did not publish his least squares method until 1809 [@Gauss1809]; interestingly, the French mathematician Adrien-Marie Legendre first published the least squares method in 1805 [@Legendre1805].

About 70 years later, the English anthropologist Francis Galton and the English mathematician Karl Pearson - leading founders of the Statistics discipline - used the foundations of the least squares method to first develop the linear regression model. Galton developed the conceptual foundation of regression models when he was studying the inherited characteristics of sweet peas. Pearson further developed Galton’s ideas with rigorous mathematical development.

Pearson used to work in Galton’s laboratory. When Galton died, Pearson wrote Galton’s biography. In this biography [@Pearson1930], Pearson described how Galton came up with the idea of regression. In 1875 Galton gave sweet pea seeds to seven friends. All sweet pea seeds had uniform weights. His friends harvested the sweet peas and returned the plants to Galton. He did a graph to see the size of each plant compared with their respective parents’ sizes. He found that all of them had parents with higher size. When graphing the offspring’s size as the Y axis, and parents’ size as the X axis, he tried to manually draw a line that could represent this relationship, and he found that the line slope was less than 1.0. He concluded that the size of these plants in their generation was “regressing” to the supposed mean of this species (considering several generations).

Two research articles by Galton [@Galton1886] and Pearson [@Pearson1930] written in 1886 and 1903 respectively further developed the foundations of regression models. They examined why sons of very tall fathers are usually shorter than their fathers, while sons of very short fathers are usually taller than their fathers. After collecting and analyzing data from hundreds of families, they concluded that the height of an individual in a community or population tends to “regress” to the average height of the population where they were born. If the father is very tall, then his sons’ height will “regress” to the average height of that population. If the father is very short, then his sons’ height will also “regress” to that average height. They named their model the “regression” model. Nowadays the interpretation of regression models is not quite the same as “regressing” to a specific average value. Nowadays regression models are used to examine linear relationships between a dependent variable and a set of independent variables.

2.3 Types of data structures

The Market Model is a time-series regression model. In this model we looked at the relationship between 2 variables representing one feature or attribute (returns) of two “subjects” over time: a stock and a market index. The market model is estimated using a time-series dataset.

There are basically three types of data used in regression models:

Time-series: each observation represents one period, and each column represents one or more variables, which are characteristics of one or more subjects. Then, we have one or more variables measured in several time periods.
Cross-sectional: each observation represents one subject in only one time period, and each column represents variables or characteristics of the subjects.
Panel data: this is a combination of time-series and cross-sectional structure. This structure is also called a long-format dataset. In a panel dataset we can have more than 1 subject and more than 1 variable. The data of each subject is piled one on top of the other in long format.

Then, we can consider the Market Model as a pooled “time-series” regression model.
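The three data structures can be illustrated with a tiny dataset. This is only a sketch: the return values are made up, and the names ts, panel and cross are hypothetical.

```python
import pandas as pd

# Time-series (wide format): one row per period, one column per subject
ts = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
        "ALFA": [0.02, -0.01],
        "MXX": [0.01, 0.03],
    }
)

# Panel (long format): one row per subject-period pair,
# with the data of each subject piled one on top of the other
panel = ts.melt(id_vars="date", var_name="subject", value_name="ret")

# Cross-sectional: all subjects observed in only one period
cross = panel[panel["date"] == pd.Timestamp("2024-02-01")]

print(panel)
print(cross)
```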

2.4 The simple regression model

The Market Model in Finance can be estimated using a simple regression model. The Market Model - also called the Single-Index Model - states that the expected return of a stock is given by its alpha coefficient (b0) plus its market beta coefficient (b1) times the market return. In mathematical terms:

E[R_i] = α + β(R_M)

We can express the same equation using b_0 as alpha, and b_1 as market beta:

E[R_i] = b_0 + b_1(R_M)

We can estimate b_0 and b_1 by running a linear regression model specifying that the market return is the independent variable (IV) and the stock return is the dependent variable (DV).

It is strongly recommended to use continuously compounded returns (r_i) instead of simple returns (R_i) to estimate the market regression model.

The market regression model can be expressed as:

r_{(i,t)} = b_0 + b_1*r_{(M,t)} + ε_t

Where:

ε_t is the error at time t. Thanks to the Central Limit Theorem, this error behaves like a Normally distributed random variable ∼ N(0, σ_ε), since the error can be expressed as a sum of products; the error term ε_t is expected to have mean=0 and a specific standard deviation σ_ε (also called volatility).

r_{(i,t)} is the cc return of the stock i at time t.

r_{(M,t)} is the market cc return at time t

b_0 and b_1 are called regression coefficients
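One way to get a feeling for this equation is to simulate it with known coefficients and check that least squares recovers them. This is a minimal sketch under stated assumptions: the values of true_b0, true_b1, sigma_eps and the market return parameters are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2025)  # arbitrary seed
T = 600  # number of simulated periods

# Hypothetical "true" parameters of the market model
true_b0, true_b1, sigma_eps = 0.001, 1.3, 0.05

r_M = rng.normal(0.005, 0.04, T)     # simulated market cc returns
eps = rng.normal(0.0, sigma_eps, T)  # error term ~ N(0, sigma_eps)
r_i = true_b0 + true_b1 * r_M + eps  # stock cc returns from the model

# Least squares estimates (polyfit returns the slope first)
b1_hat, b0_hat = np.polyfit(r_M, r_i, deg=1)
residuals = r_i - (b0_hat + b1_hat * r_M)

print(f"b0 estimate = {b0_hat:.4f}, b1 estimate = {b1_hat:.4f}")
print(f"mean of residuals = {residuals.mean():.2e}")
```

The estimates should be close to the true values, but not identical, since each sample contains a different set of errors. The residuals of a regression with intercept always average zero.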

Now it’s time to use real data to better understand this model. Download monthly prices for Alfa (ALFAA.MX) and the Mexican market index IPyC (^MXX) from Yahoo Finance from January 2020 to January 2025.

3 CHALLENGE: Running a market regression model with real data

3.1 Data collection

We first load the yfinance package and download monthly price data for Alfa and the Mexican market index.

# Download a dataset with prices for Alfa and the Mexican IPyC:
data = yf.download("ALFAA.MX, ^MXX", start="2020-01-01", end="2025-01-31", interval='1mo',auto_adjust=True)


[*********************100%***********************]  2 of 2 completed


# I create another dataset with the Adjusted Closing price of both instruments:
adjprices = data['Close']

3.2 Return calculation

We calculate continuously compounded returns for both Alfa and the IPyC:

returns = np.log(adjprices).diff(1).dropna()
returns.columns

Index(['ALFAA.MX', '^MXX'], dtype='object', name='Ticker')

# I change the name of the columns to avoid special characters like ^MXX
returns.columns=['ALFA','MXX']
returns.columns

Index(['ALFA', 'MXX'], dtype='object')

3.3 Visualize the relationship

Do a scatter plot putting the IPyC returns as the independent variable (X) and the stock return as the dependent variable (Y). We also add the line that best represents the relationship between the stock returns and the market returns. Type:

import seaborn as sb
#plt.clf()
x = returns['MXX']
y = returns['ALFA']
# I plot the (x,y) values along with the regression line that fits the data:
sb.regplot(x=x,y=y)
plt.xlabel('Market returns')
plt.ylabel('Alfa returns') 
plt.show()

Sometimes graphs can be deceiving. In this case, the ranges of the X axis and the Y axis are different, so it is better to do a graph where one unit of X has a similar length to one unit of Y. Type:

plt.clf()

sb.regplot(x=x,y=y)
# I adjust the scale of the X axis so that the magnitude of each unit of X is equal to that of the Y axis 
plt.xticks(np.arange(-1,1,0.20))


# I label the axis:
plt.xlabel('Market returns')

plt.ylabel('Alfa returns') 
plt.show()

WHAT DOES THE PLOT TELL YOU? BRIEFLY EXPLAIN

3.4 RUNNING THE MARKET REGRESSION MODEL

The ols function from the statsmodels package is used to estimate a regression model. We run a simple regression model to see how the monthly returns of the stock are related to the market return.

In the formula of the ols function, the variable on the left side of the ~ symbol is the DEPENDENT VARIABLE (in this case, the stock return), and the variable on the right side is the INDEPENDENT VARIABLE, also named the EXPLANATORY VARIABLE (in this case, the market return).

With the formula interface, the column of 1’s needed to estimate the beta0 coefficient (the constant) is added automatically. If we use the OLS function from statsmodels.api instead, we need to add this column of 1’s to the X vector before running the model.

What you will get is called The Single-Index Model. You are trying to examine how the market returns can explain stock returns.

Run the Single-Index model (Y=stock return, X=market return). You can use the function ols from the statsmodels.formula.api library:

import statsmodels.formula.api as smf

# I estimate the OLS regression model:
mkmodel = smf.ols('ALFA ~ MXX',data=returns).fit()
# The Dependent variable Y is the first one in the formula, and the second is the IV: Y ~ X

# I display the summary of the regression: 
print(mkmodel.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   ALFA   R-squared:                       0.281
Model:                            OLS   Adj. R-squared:                  0.268
Method:                 Least Squares   F-statistic:                     22.63
Date:              jue., 06 mar. 2025   Prob (F-statistic):           1.34e-05
Time:                        15:34:04   Log-Likelihood:                 45.724
No. Observations:                  60   AIC:                            -87.45
Df Residuals:                      58   BIC:                            -83.26
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0012      0.015      0.084      0.933      -0.028       0.031
MXX            1.3325      0.280      4.757      0.000       0.772       1.893
==============================================================================
Omnibus:                       12.257   Durbin-Watson:                   2.160
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               41.744
Skew:                          -0.071   Prob(JB):                     8.62e-10
Kurtosis:                       7.084   Cond. No.                         18.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The regression output shows a lot of information about the relationship between the Market return (X) and the stock return (Y).

Looking at the table of beta coefficients, the first row (Intercept) shows the information of the beta0 coefficient, which is the intercept of the regression equation, also known as constant.

The second row (MXX) shows the information of the beta1 coefficient, which represents the slope of the regression line. In this example, since the X variable is the market return and the Y variable is the stock return, beta1 can be interpreted as the sensitivity or market risk of the stock.

For each beta coefficient, the following is calculated and shown:

coef : this is the estimated value of the beta coefficient
std err : this is the standard error of the coefficient, which is the standard deviation of the beta coefficient.
t : this is the t-Statistic of the following Hypothesis test:

H0: beta = 0;

Ha: beta <> 0;
P>|t| : is the p-value of the above hypothesis test; if it is a value < 0.05, we can say that the beta coefficient is SIGNIFICANTLY different than ZERO with a 95% confidence.
[0.025 0.975] : This is the 95% Confidence Interval of the beta coefficient. This shows the range of plausible values of the true beta coefficient with 95% confidence.

How are the t-Statistic, p-value and the 95% C.I. related?
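One way to explore this question is to rebuild the three numbers from the coefficient and its standard error. The sketch below takes the rounded values of the MXX row of the output above (coef=1.3325, std err=0.280, 58 residual degrees of freedom) and uses the t distribution of scipy; since the inputs are rounded, small differences with the output are expected.

```python
from scipy import stats

# Rounded values taken from the MXX row of the regression output
coef, std_err, df_resid = 1.3325, 0.280, 58

# t-Statistic: distance between the coefficient and zero, in standard errors
t_stat = (coef - 0) / std_err
# 2-tailed p-value from the t distribution
p_value = 2 * stats.t.sf(abs(t_stat), df=df_resid)
# 95% C.I.: coefficient plus/minus the critical t value times the standard error
t_crit = stats.t.ppf(0.975, df=df_resid)
ci_low = coef - t_crit * std_err
ci_high = coef + t_crit * std_err

print(f"t = {t_stat:.3f}, p-value = {p_value:.6f}")
print(f"95% C.I. = [{ci_low:.3f}, {ci_high:.3f}]")
```

The three measures use the same information: when the absolute value of t is greater than the critical value (about 2), the p-value is less than 0.05 and the 95% C.I. does not contain zero.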

INTERPRETATION OF THE REGRESSION OUTPUT

IN A SIMPLE REGRESSION MODEL, BETA0 (THE INTERCEPT WHERE THE LINE CROSSES THE Y AXIS), AND BETA1 (THE INCLINATION OR SLOPE OF THE LINE) ARE ESTIMATED.

THE REGRESSION MODEL FINDS THE BEST LINE THAT BETTER REPRESENTS ALL THE POINTS. THE BETA0 AND BETA1 COEFFICIENTS TOGETHER DEFINE THE REGRESSION LINE.

THE REGRESSION EQUATION.

ACCORDING TO THE REGRESSION OUTPUT, THE REGRESSION EQUATION THAT EXPLAINS THE RETURN OF ALFA BASED ON THE IPC’S RETURN IS:

E[ALFAret]= b0 + b1(MXXret)

E[ALFAret]= 0.0012 + 1.3325(MXXret)

THE REGRESSION MODEL AUTOMATICALLY PERFORMS ONE HYPOTHESIS TEST FOR EACH COEFFICIENT. IN THIS CASE WE HAVE 2 BETA COEFFICIENTS, SO 2 HYPOTHESIS TESTS ARE DONE. YOU CAN SEE THAT IN THE COEFFICIENTS TABLE IN THE OUTPUT.

WE START LOOKING AT THE TABLE OF COEFFICIENTS. WHERE IT SAYS (Intercept), YOU CAN SEE THE RESULT OF THE HYPOTHESIS TESTING FOR BETA0. WHERE IT SAYS THE NAME OF THE INDEPENDENT VARIABLE, IN THIS CASE, THE MARKET RETURN (MXX), YOU CAN SEE THE RESULT FOR THE BETA1 OF THE STOCK.

THE HYPOTHESIS TEST FOR BETA0 IS THE FOLLOWING:

H0: BETA0=0; THIS MEANS THAT THE INTERCEPT OF THE LINE (THE POINT WHERE THE LINE CROSSES THE Y AXIS) HAS AN AVERAGE OF ZERO. IN THE CONTEXT OF THE MARKET MODEL THIS MEANS THAT THE ALFA STOCK DOES NOT OFFER SIGNIFICANTLY LOWER NOR HIGHER RETURNS THAN THE MARKET.

HA: BETA0 <>0; THIS MEANS THAT THE INTERCEPT IS SIGNIFICANTLY DIFFERENT THAN ZERO; IN OTHER WORDS, ALFA OFFERS RETURNS ABOVE (OR BELOW) THE MARKET.

ABOUT STANDARD ERROR, T-VALUE AND P-VALUE OF THE HYPOTHESIS TESTS:

ACCORDING TO THE CENTRAL LIMIT THEOREM, SINCE THE BETA0 CAN BE EXPRESSED AS A LINEAR COMBINATION OF THE STOCK AND THE MARKET RETURN, BETA0 WILL HAVE A DISTRIBUTION SIMILAR TO A NORMAL DISTRIBUTION, WITH ITS MEAN EQUAL TO THE ESTIMATED COEFFICIENT AND ITS STANDARD DEVIATION EQUAL TO THE STANDARD ERROR.

IN OTHER WORDS, BETA0 WILL MOVE IN THE FUTURE, AND THE MEAN VALUE WILL BE ABOUT 0.0012, AND IT WILL VARY ON AVERAGE ABOUT 0.0148, WHICH IS THE STANDARD DEVIATION OR STANDARD ERROR OF BETA0.

WHAT DOES THIS MEAN? THIS MEANS THAT IF WE COULD TRAVEL INTO THE FUTURE AND COLLECT A NEW SAMPLE FOR EACH FUTURE MONTH, WE COULD ESTIMATE ONE BETA0 FOR EACH SAMPLE, SO WE COULD IMAGINE MANY BETA0’s THAT WILL CHANGE, BUT ALL THESE VALUES WILL BE AROUND ITS CURRENT MEAN.

IF WE COULD TRAVEL TO THE FUTURE, COLLECT THESE SAMPLES, AND FOR EACH SAMPLE CALCULATE A BETA0, THE HISTOGRAM WITH THESE BETA0’s WILL LOOK LIKE:

ACCORDING TO THIS HISTOGRAM, THE AVERAGE MIGHT BE BETWEEN -0.01 AND -0.005 SINCE IT IS THE RANGE OF BETA0 VALUES THAT APPEARS MORE OFTEN (IT HAS THE HIGHEST BAR). IF WE ADD AND SUBTRACT ABOUT 2 TIMES 0.014 (THE STANDARD ERROR OF BETA0) FROM THE MIDPOINT -0.01, WE COVER ABOUT 95% OF THE DIFFERENT VALUES OF BETA0!

THE ESTIMATION FOR BETA0 IS 0.0012. THIS IS THE MEAN FOR BETA0. SINCE REALITY ALWAYS CHANGES, BETA0 MIGHT CHANGE IN THE FUTURE. HOW MUCH CAN IT CHANGE? THAT IS GIVEN BY ITS STANDARD DEVIATION, WHICH IS CALLED STANDARD ERROR. AND THANKS TO THE CENTRAL LIMIT THEOREM, BETA0 WILL BEHAVE LIKE A NORMALLY DISTRIBUTED VARIABLE.

IN THIS CASE, THE STANDARD ERROR OF BETA0 IS 0.0148. THIS MEANS THAT IN THE FUTURE BETA0 WILL HAVE A MEAN OF 0.0012, AND ABOUT 68% OF THE TIME IT WILL LIE BETWEEN ONE STANDARD DEVIATION BELOW ITS MEAN AND ONE STANDARD DEVIATION ABOVE ITS MEAN. IN ADDITION, WE CAN SAY THAT 95% OF THE TIME BETA0 WILL MOVE BETWEEN -2 STANDARD DEVIATIONS AND +2 STANDARD DEVIATIONS FROM 0.0012.
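THE IDEA OF COLLECTING MANY SAMPLES AND ESTIMATING ONE BETA0 PER SAMPLE CAN BE IMITATED WITH SIMULATED DATA. THE FOLLOWING IS ONLY A SKETCH UNDER STATED ASSUMPTIONS: THE "TRUE" COEFFICIENTS ARE SET TO THE ESTIMATES ABOVE, THE SAMPLE SIZE OF 54 MONTHS IS APPROXIMATE, AND THE VOLATILITIES MARKET_SD AND ERROR_SD, THE SEED AND THE HELPER ols_line ARE HYPOTHETICAL CHOICES MADE FOR ILLUSTRATION, NOT VALUES ESTIMATED FROM THE ALFA DATA.

```python
import random
import statistics

def ols_line(x, y):
    """Return (b0, b1) of the least-squares line y = b0 + b1 * x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

random.seed(123)
TRUE_B0, TRUE_B1 = 0.0012, 1.3325   # estimates from the regression output, used as "truth"
N_MONTHS = 54                        # approximate sample size (assumption)
MARKET_SD, ERROR_SD = 0.05, 0.10     # hypothetical monthly volatilities

# Draw many independent samples and estimate one b0 per sample
b0_estimates = []
for _ in range(2000):
    market = [random.gauss(0.0, MARKET_SD) for _ in range(N_MONTHS)]
    stock = [TRUE_B0 + TRUE_B1 * m + random.gauss(0.0, ERROR_SD) for m in market]
    b0, _b1 = ols_line(market, stock)
    b0_estimates.append(b0)

mean_b0 = statistics.fmean(b0_estimates)
sd_b0 = statistics.stdev(b0_estimates)   # empirical counterpart of the standard error
within_2sd = sum(abs(b - mean_b0) <= 2 * sd_b0 for b in b0_estimates) / len(b0_estimates)

print(f"mean of the b0 estimates:          {mean_b0:.4f}")
print(f"std. deviation of the b0 estimates: {sd_b0:.4f}")
print(f"share within mean +/- 2 std. dev.:  {within_2sd:.1%}")
```

THE STANDARD DEVIATION OF THE SIMULATED BETA0’s PLAYS THE ROLE OF THE STANDARD ERROR, AND THE SHARE OF ESTIMATES WITHIN 2 STANDARD DEVIATIONS OF THE MEAN SHOULD BE CLOSE TO 95%.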

FOLLOWING THE HYPOTHESIS TEST METHOD, WE CALCULATE THE CORRESPONDING t-value OF THIS HYPOTHESIS AS FOLLOWS:

t=\frac{(B_{0}-0)}{SD(B_{0})}

THEN, t = (0.0012 - 0 ) / 0.0148 = 0.084. THIS VALUE IS AUTOMATICALLY CALCULATED IN THE REGRESSION OUTPUT IN THE COEFFICIENTS TABLE IN THE ROW (intercept)!

REMEMBER THAT THE t-value IS THE DISTANCE BETWEEN THE ESTIMATED VALUE OF THE VARIABLE OF ANALYSIS (IN THIS CASE, B_0=0.0012) AND ITS HYPOTHETICAL VALUE, WHICH IS ZERO. BUT THIS DISTANCE IS MEASURED IN STANDARD DEVIATIONS OF THE VARIABLE OF ANALYSIS. REMEMBER THAT THE STANDARD DEVIATION OF THE VARIABLE OF ANALYSIS IS CALLED STANDARD ERROR (IN THIS CASE, THE STD.ERROR OF B_0 = 0.0148).

SINCE THE ABSOLUTE VALUE OF THE t-value OF B_0 IS LESS THAN 2, WE CANNOT REJECT THE NULL HYPOTHESIS. IN OTHER WORDS, WE CAN SAY THAT B_0 IS NOT SIGNIFICANTLY DIFFERENT FROM ZERO (AT THE 95% CONFIDENCE LEVEL).
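THE CALCULATION FOR BETA0 CAN BE REPRODUCED BY HAND. THIS SKETCH USES THE ROUNDED ESTIMATE AND STANDARD ERROR QUOTED ABOVE, SO THE t-value DIFFERS SLIGHTLY FROM THE ONE IN THE OUTPUT, WHICH PRESUMABLY USES UNROUNDED NUMBERS. THE p-value USES A NORMAL APPROXIMATION INSTEAD OF THE EXACT t DISTRIBUTION; THAT IS AN ASSUMPTION MADE TO KEEP THE EXAMPLE SHORT.

```python
from statistics import NormalDist

b0_hat = 0.0012   # rounded estimate of beta0 from the output
se_b0 = 0.0148    # standard error of beta0 from the output

# Distance between the estimate and the hypothetical value (0), in standard errors
t_b0 = (b0_hat - 0.0) / se_b0

# Two-sided p-value under a normal approximation (assumption: not the exact t distribution)
p_two_sided = 2 * (1 - NormalDist().cdf(abs(t_b0)))

# Rough decision rule used in this workshop: reject H0 when |t| > 2
reject_h0 = abs(t_b0) > 2

print(f"t-value: {t_b0:.3f}")
print(f"approximate two-sided p-value: {p_two_sided:.3f}")
print("reject H0: beta0 = 0?", reject_h0)
```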

THE HYPOTHESIS TEST FOR BETA1 IS THE FOLLOWING:

H0: B_1 = 0 (THERE IS NO RELATIONSHIP BETWEEN THE MARKET AND THE STOCK RETURN)

Ha: B_1 > 0 (THERE IS A POSITIVE RELATIONSHIP BETWEEN THE MARKET AND THE STOCK RETURN)

IN THIS HYPOTHESIS, THE VARIABLE OF ANALYSIS IS BETA1 (B_1).

FOLLOWING THE HYPOTHESIS TEST METHOD, WE CALCULATE THE CORRESPONDING t-value OF THIS HYPOTHESIS AS FOLLOWS:

t=\frac{(B_{1}-0)}{SD(B_{1})}

THEN, t = (1.3325 - 0 ) / 0.2801 = 4.7572. THIS VALUE IS AUTOMATICALLY CALCULATED IN THE REGRESSION OUTPUT, IN THE SECOND ROW OF THE COEFFICIENTS TABLE.

REMEMBER THAT THE t-value IS THE DISTANCE BETWEEN THE ESTIMATED VALUE OF THE VARIABLE OF ANALYSIS (IN THIS CASE, B_1=1.3325) AND ITS HYPOTHETICAL VALUE, WHICH IS ZERO. BUT THIS DISTANCE IS MEASURED IN STANDARD DEVIATIONS OF THE VARIABLE OF ANALYSIS. REMEMBER THAT THE STANDARD DEVIATION OF THE VARIABLE OF ANALYSIS IS CALLED STANDARD ERROR (IN THIS CASE, THE STD.ERROR OF B_1 = 0.2801).

THE ESTIMATION FOR BETA1 IS 1.3325. THIS IS THE MEAN FOR BETA1. SINCE REALITY ALWAYS CHANGES, BETA1 MIGHT CHANGE IN THE FUTURE. HOW MUCH CAN IT CHANGE? THAT IS GIVEN BY ITS STANDARD DEVIATION, WHICH IS CALLED THE STANDARD ERROR OF BETA1. THANKS TO THE CENTRAL LIMIT THEOREM, WE CAN EXPECT THAT BETA1 WILL MOVE LIKE A NORMALLY DISTRIBUTED VARIABLE IN THE FUTURE, WITH THE MEAN AND STANDARD DEVIATION (STANDARD ERROR) CALCULATED IN THE REGRESSION OUTPUT.

WE CAN SAY THAT 95% OF THE TIME BETA1 WILL MOVE BETWEEN -2 STANDARD DEVIATIONS AND + 2 STANDARD DEVIATIONS FROM 1.3325.

SINCE THE ABSOLUTE VALUE OF THE t-value OF B_1 IS MUCH GREATER THAN 2, WE HAVE ENOUGH STATISTICAL EVIDENCE AT THE 95% CONFIDENCE LEVEL TO REJECT THE NULL HYPOTHESIS. IN OTHER WORDS, WE CAN SAY THAT B_1 IS SIGNIFICANTLY GREATER THAN ZERO, SO THERE IS A POSITIVE RELATIONSHIP BETWEEN THE STOCK AND THE MARKET RETURN.

3.4.0.1 MORE ABOUT THE INTERPRETATION OF THE BETA COEFFICIENTS AND THEIR t-values AND p-values

THEN, IN THIS OUTPUT WE SEE THAT B_0 = 0.0012, AND B_1 = 1.3325. WE CAN ALSO SEE THE STANDARD ERROR, t-value AND p-value OF BOTH B_0 AND B_1.

B_0 ON AVERAGE IS POSITIVE, BUT IT IS NOT SIGNIFICANTLY POSITIVE (AT THE 95% CONFIDENCE LEVEL) SINCE ITS p-value>0.05 AND THE ABSOLUTE VALUE OF ITS t-value<2. IT SEEMS THAT, ON AVERAGE, ALFA OUTPERFORMS THE MARKET BY 0.1247% (SINCE B_0 = 0.0012); IN OTHER WORDS, THE EXPECTED RETURN OF ALFA WHEN THE MARKET RETURN IS ZERO IS POSITIVE. HOWEVER, THIS VALUE IS NOT SIGNIFICANTLY GREATER THAN ZERO SINCE ITS p-value>0.05! THEN, I DO NOT HAVE STATISTICAL EVIDENCE AT THE 95% CONFIDENCE LEVEL TO SAY THAT ALFA OUTPERFORMS THE MARKET.

B_1 IS +1.3325 (ON AVERAGE). SINCE ITS p-value<0.05 I CAN SAY THAT B_1 IS SIGNIFICANTLY GREATER THAN ZERO (AT THE 95% CONFIDENCE LEVEL). IN OTHER WORDS, I HAVE STRONG STATISTICAL EVIDENCE TO SAY THAT ALFA RETURN IS POSITIVELY RELATED TO THE MARKET RETURN SINCE ITS B_1 IS SIGNIFICANTLY GREATER THAN ZERO.

INTERPRETING THE MAGNITUDE OF B_1, WE CAN SAY THAT IF THE MARKET RETURN INCREASES BY +1%, I SHOULD EXPECT THAT, ON AVERAGE, THE RETURN OF ALFA WILL INCREASE BY 1.3325%. THE SAME HAPPENS IF THE MARKET RETURN LOSES 1%: IT IS EXPECTED THAT ALFA RETURN, ON AVERAGE, LOSES ABOUT 1.3325%. THEN, IT SEEMS THAT ALFA IS RISKIER THAN THE MARKET (ON AVERAGE). BUT WE NEED TO CHECK WHETHER IT IS SIGNIFICANTLY RISKIER THAN THE MARKET.

AN IMPORTANT ANALYSIS OF B_1 IS TO CHECK WHETHER THE STOCK IS SIGNIFICANTLY MORE RISKY OR LESS RISKY THAN THE MARKET. IN OTHER WORDS, IT IS IMPORTANT TO CHECK WHETHER B_1 IS LESS THAN 1 OR GREATER THAN 1. TO DO THIS WE CAN DO ANOTHER HYPOTHESIS TEST TO CHECK WHETHER B_1 IS SIGNIFICANTLY GREATER THAN 1!

WE CAN DO THE FOLLOWING HYPOTHESIS TEST TO CHECK WHETHER ALFA IS RISKIER THAN THE MARKET:

H0: B_1 = 1 (ALFA IS AS RISKY AS THE MARKET)

Ha: B_1 > 1 (ALFA IS RISKIER THAN THE MARKET)

IN THIS HYPOTHESIS, THE VARIABLE OF ANALYSIS IS BETA1 (B_1).

FOLLOWING THE HYPOTHESIS TEST METHOD, WE CALCULATE THE CORRESPONDING t-value OF THIS HYPOTHESIS AS FOLLOWS:

t=\frac{(B_{1}-1)}{SD(B_{1})}

THEN, t = (1.3325 - 1 ) / 0.2801 = 1.1872. THIS VALUE IS NOT AUTOMATICALLY CALCULATED IN THE REGRESSION OUTPUT.
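SINCE THE OUTPUT ONLY REPORTS THE t-value FOR THE NULL HYPOTHESIS B_1 = 0, A t-value FOR ANY OTHER HYPOTHETICAL VALUE HAS TO BE COMPUTED SEPARATELY. A MINIMAL SKETCH, WHERE THE HELPER NAME t_value IS HYPOTHETICAL AND THE INPUTS ARE THE ROUNDED NUMBERS QUOTED ABOVE:

```python
def t_value(estimate, hypothesized, std_error):
    """Distance between the estimate and the hypothesized value, in standard errors."""
    return (estimate - hypothesized) / std_error

b1_hat = 1.3325   # estimate of beta1 from the output
se_b1 = 0.2801    # standard error of beta1 from the output

t_vs_zero = t_value(b1_hat, 0.0, se_b1)   # H0: B_1 = 0 (the test reported in the output)
t_vs_one = t_value(b1_hat, 1.0, se_b1)    # H0: B_1 = 1 (not reported in the output)

print(f"t-value for H0: B_1 = 0 -> {t_vs_zero:.4f}")
print(f"t-value for H0: B_1 = 1 -> {t_vs_one:.4f}")
print("above the rough critical value of 2?", t_vs_zero > 2, t_vs_one > 2)
```

THE SAME ESTIMATE AND STANDARD ERROR GIVE VERY DIFFERENT t-values DEPENDING ON THE HYPOTHETICAL VALUE THEY ARE COMPARED WITH.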

SINCE THE t-value (1.1872) IS LESS THAN 2, WE DO NOT HAVE SIGNIFICANT EVIDENCE TO REJECT THE NULL HYPOTHESIS. IN OTHER WORDS, ALTHOUGH B_1 IS GREATER THAN 1 ON AVERAGE, WE CANNOT SAY THAT ALFA IS SIGNIFICANTLY RISKIER THAN THE MARKET (AT THE 95% CONFIDENCE LEVEL).

3.4.1 95% CONFIDENCE INTERVAL OF THE BETA COEFFICIENTS

WE CAN USE THE 95% CONFIDENCE INTERVAL OF BETA COEFFICIENTS AS AN ALTERNATIVE TO MAKE CONCLUSIONS ABOUT B_0 AND B_1 (INSTEAD OF USING t-values AND p-values).

THE 95% CONFIDENCE INTERVALS FOR BOTH BETAS ARE DISPLAYED IN THE REGRESSION OUTPUT.

THE FIRST ROW SHOWS THE 95% CONFIDENCE INTERVAL FOR B_0, AND THE SECOND ROW SHOWS THE CONFIDENCE INTERVAL OF B_1. WE CAN SEE THAT THESE VALUES ARE VERY SIMILAR TO THE “ROUGH” ESTIMATE USING t-critical-value = 2. THE EXACT CRITICAL t-value DEPENDS ON THE # OF OBSERVATIONS OF THE SAMPLE.
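THE “ROUGH” INTERVALS CAN BE COMPUTED DIRECTLY FROM EACH ESTIMATE AND ITS STANDARD ERROR. THIS SKETCH ASSUMES A CRITICAL VALUE OF EXACTLY 2 AND USES THE ROUNDED NUMBERS QUOTED IN THIS SECTION; THE HELPER NAME rough_ci IS HYPOTHETICAL.

```python
def rough_ci(estimate, std_error, critical=2.0):
    """Approximate 95% interval: estimate +/- critical * standard error."""
    half_width = critical * std_error
    return estimate - half_width, estimate + half_width

ci_b0 = rough_ci(0.0012, 0.0148)   # beta0 and its standard error
ci_b1 = rough_ci(1.3325, 0.2801)   # beta1 and its standard error

print(f"rough 95% interval for B_0: [{ci_b0[0]:.4f}, {ci_b0[1]:.4f}]")
print(f"rough 95% interval for B_1: [{ci_b1[0]:.4f}, {ci_b1[1]:.4f}]")

# A hypothetical value is "not rejected" when it lies inside the interval
print("0 inside the B_0 interval?", ci_b0[0] <= 0.0 <= ci_b0[1])
print("1 inside the B_1 interval?", ci_b1[0] <= 1.0 <= ci_b1[1])
```

THE ROUGH BOUNDS DIFFER FROM THE ONES IN THE REGRESSION OUTPUT ONLY IN THE THIRD OR FOURTH DECIMAL, BECAUSE THE EXACT CRITICAL t-value IS SLIGHTLY DIFFERENT FROM 2.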

HOW DO WE INTERPRET THE 95% CONFIDENCE INTERVAL FOR B_0?

IN THE NEAR FUTURE, B_0 CAN HAVE A VALUE BETWEEN -0.0285 AND 0.031 95% OF THE TIME. IN OTHER WORDS B_0 CAN MOVE FROM A NEGATIVE VALUE TO ZERO TO A POSITIVE VALUE. THEN, WE CANNOT SAY THAT 95% OF THE TIME, B_0 WILL BE POSITIVE. IN OTHER WORDS, WE CONCLUDE THAT B_0 IS NOT SIGNIFICANTLY POSITIVE AT THE 95% CONFIDENCE LEVEL.

HOW OFTEN WILL B_0 BE POSITIVE? LOOKING AT THE 95% CONFIDENCE INTERVAL, WHICH IS CENTERED SLIGHTLY ABOVE ZERO, B_0 WILL BE POSITIVE SLIGHTLY MORE THAN 50% OF THE TIME.
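THIS PROPORTION CAN BE APPROXIMATED UNDER THE NORMAL APPROXIMATION USED IN THIS SECTION, TREATING B_0 AS A NORMAL VARIABLE WITH MEAN 0.0012 AND STANDARD DEVIATION 0.0148. THE FOLLOWING IS ONLY A SKETCH OF THAT INTERPRETATION, BASED ON ROUNDED VALUES:

```python
from statistics import NormalDist

# Normal approximation for B_0: mean = estimate, standard deviation = standard error
b0_dist = NormalDist(mu=0.0012, sigma=0.0148)

# Probability mass above zero
prob_positive = 1 - b0_dist.cdf(0.0)

print(f"approximate share of the time B_0 is positive: {prob_positive:.1%}")
```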

HOW DO WE INTERPRET THE 95% CONFIDENCE INTERVAL FOR B_1?

IN THE NEAR FUTURE, B_1 CAN MOVE BETWEEN 0.7718 AND 1.8932 95% OF THE TIME. SINCE THE WHOLE INTERVAL IS ABOVE ZERO, WE CAN SAY THAT B_1 IS SIGNIFICANTLY POSITIVE. HOWEVER, THE INTERVAL CONTAINS 1 (ITS LOWER BOUND, 0.7718, IS BELOW 1), SO WE CANNOT SAY THAT B_1 IS GREATER THAN 1 AT LEAST 95% OF THE TIME. IN OTHER WORDS, ALTHOUGH B_1>1 ON AVERAGE, ALFA IS NOT SIGNIFICANTLY RISKIER THAN THE MARKET AT THE 95% CONFIDENCE LEVEL.

3.5 OPTIONAL CHALLENGE: Estimate moving betas for the market regression model

How do the beta coefficients of a stock move over time? Are the b_1 and b_0 of a stock stable? If not, do they change gradually or can they change radically over time? We will run several rolling regressions for Alfa to try to answer these questions.

Before we do the exercise, I will review the meaning of the beta coefficients in the context of the market model.

In the market regression model, b_1 is a measure of sensitivity; it measures how much the stock return might move (on average) when the market return moves by +1%.

Then, according to the market regression model, the stock return will change if the market return changes, and also it will change by many other external factors. The aggregation of these external factors is what the error term represents.

It is said that b_1 in the market model measures the systematic risk of the stock, which depends on changes in the market return. The unsystematic risk of the stock is given by the error term, that is also named the random shock, which is the summary of the overall reaction of all investors to news that might affect the stock (news about the company, its industry, regulations, national news, global news).

We can make predictions of the stock return by measuring the systematic risk with the market regression model, but we cannot predict the unsystematic risk. The most we can measure with the market model is the variability of this unsystematic risk (the variance of the error).
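One way to write this split, assuming the error term is uncorrelated with the market return (which holds by construction for the OLS residuals within the sample), is the following variance decomposition. The symbols r_{stock} and r_{market} are introduced here only for illustration:

Var(r_{stock}) = b_{1}^{2} Var(r_{market}) + Var(\varepsilon)

The first term on the right is the part of the stock's variance explained by the market (systematic risk), and the second term is the variance of the error (unsystematic risk).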

In this exercise you have to estimate rolling regressions by moving time windows and run 1 regression for each time window.

For the same ALFAA.MX stock, run rolling regressions using a time window of 36 months, starting from Jan 2010.

The first regression has to start in Jan 2010 and end in Dec 2012 (36 months). For the second you have to move the time window 1 month ahead, so it will start in Feb 2010 and end in Jan 2013. For the third regression you move another month ahead and run the regression. You continue running all possible regressions until you end up with a window with the last 36 months of the dataset.
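To make the window mechanics concrete, here is a minimal sketch of the loop written by hand on simulated returns. Everything in it is hypothetical (the 60 simulated months, the coefficients used to generate the data, the seed and the helper ols_line); it only illustrates how each window of 36 months produces one pair of coefficients.

```python
import random
import statistics

def ols_line(x, y):
    """Return (b0, b1) of the least-squares line y = b0 + b1 * x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

random.seed(7)
N_MONTHS = 60   # hypothetical number of monthly returns
WINDOW = 36     # window length used in this exercise

# Simulated market and stock returns (hypothetical data-generating values)
market = [random.gauss(0.005, 0.04) for _ in range(N_MONTHS)]
stock = [0.001 + 1.3 * m + random.gauss(0.0, 0.08) for m in market]

# One regression per window: months [start, start + WINDOW)
rolling_betas = []
for start in range(N_MONTHS - WINDOW + 1):
    end = start + WINDOW
    b0, b1 = ols_line(market[start:end], stock[start:end])
    rolling_betas.append((b0, b1))

print("number of windows:", len(rolling_betas))
print("first window (b0, b1):", rolling_betas[0])
print("last window  (b0, b1):", rolling_betas[-1])
```

With n monthly returns and a window of 36 months there are n - 36 + 1 possible windows, so 60 months give 25 regressions.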

This sounds complicated, but fortunately we can use the function RollingOLS, which automatically performs rolling regressions by shifting the 36-month window by 1 month in each iteration.

Then, you have to do the following:

Download monthly stock prices for ALFAA.MX and the market (^MXX) from Jan 2010 to Jul 2022, and calculate cc returns.

# Getting price data and selecting adjusted price columns:
sprices = yf.download("ALFAA.MX ^MXX",start="2010-01-01",interval="1mo", auto_adjust=True)


sprices = sprices['Close']

# Calculating returns:
sr = np.log(sprices) - np.log(sprices.shift(1))
# Deleting the first month with NAs:
sr=sr.dropna()
sr.columns=['ALFAAret','MXXret']

Run rolling regressions and save the moving b_0 and b_1 coefficients for all time windows.

from statsmodels.regression.rolling import RollingOLS
import statsmodels.api as sm
x=sm.add_constant(sr['MXXret'])
y = sr['ALFAAret']
rolreg = RollingOLS(y,x,window=36).fit()
betas = rolreg.params
# I check the last pairs of beta values:
betas.tail()

               const    MXXret
Date                          
2024-11-01  0.002707  0.645549
2024-12-01  0.004101  0.701092
2025-01-01  0.005985  0.743714
2025-02-01  0.006775  0.753619
2025-03-01  0.006876  0.789700

Do a plot to see how b_1 and b_0 have changed over time.

plt.clf()
plt.plot(betas['MXXret'])
plt.title('Moving beta1 for Alfaa')
plt.xlabel('Date')
plt.ylabel('beta1')
plt.show()

plt.clf()
plt.plot(betas['const'])
plt.title('Moving beta0 for Alfaa')
plt.xlabel('Date')
plt.ylabel('beta0')
plt.show()

We can see that both beta coefficients move over time; they are not constant. There is no apparent pattern in the changes of the beta coefficients, but we can appreciate how much they can move over time; in other words, we can visualize their standard deviation, which is the average movement from their means.

We can actually calculate the mean and standard deviation of all these pairs of moving beta coefficients and see how they compare with their beta coefficients and their standard errors of the original regression when we use only 1 sample with the last 36 months:

betas.describe()

            const      MXXret
count  147.000000  147.000000
mean    -0.002766    1.354924
std      0.014099    0.505736
min     -0.025589    0.428210
25%     -0.012755    1.076883
50%     -0.007641    1.337805
75%      0.006655    1.709654
max      0.030664    2.343680

We calculated 147 regressions using 147 36-month rolling windows (see the count row in the table above). For each regression we calculated a pair of b_0 and b_1.

Compared with the first market regression of Alfa using the most recent months from 2018 (about 54 months or 4.5 years), we see that the mean of the moving b_1 is very similar to the estimated b_1 of the first regression. Also, we see that the standard deviation of the moving b_0 is very similar to the standard error of b_0 estimated in the first regression. The standard deviation of the moving b_1 was much higher than the standard error of b_1 of the first regression. This difference might be because the moving betas were estimated using data from 2010, while the first regression used data from 2018, so it seems that the systematic risk of Alfa (measured by its b_1) has been decreasing in recent months.

I hope that now you can understand why we need an estimation of the standard error of the beta coefficients (standard deviation of the coefficients).

4 CHALLENGE 1: RUN A MARKET REGRESSION MODEL FOR A US STOCK

You have to run a MARKET REGRESSION MODEL for any US firm using monthly data from Jan 2018 to Dec 2022. In this case, you MUST USE the right Market index, which is the ^GSPC (the S&P500). You have to show the result of the model.

With the result of your model, respond to the following:

INTERPRET THE RESULTS OF THE COEFFICIENTS (b0 and b1), THEIR STANDARD ERRORS, P-VALUES AND 95% CONFIDENCE INTERVALS. Before doing this, re-read the Note: Basics of Linear Regression Models to better interpret your results.
DO A QUICK RESEARCH ABOUT THE EFFICIENT MARKET HYPOTHESIS. BRIEFLY DESCRIBE WHAT THIS HYPOTHESIS SAYS.
ACCORDING TO THE EFFICIENT MARKET HYPOTHESIS, WHAT IS THE EXPECTED VALUE OF b0 in the Market REGRESSION MODEL?
ACCORDING TO YOUR RESULTS, IS THE FIRM SIGNIFICANTLY RISKIER THAN THE MARKET? WHAT IS THE t-test YOU NEED TO DO TO ANSWER THIS QUESTION? Do the test and provide your interpretation. (Hint: Here you have to change the null hypothesis for b1: H0: b1=1; Ha: b1<>1)

5 CHALLENGE 1 SOLUTION

I selected PFIZER.

I COLLECT THE DATA FOR PFIZER AND THE MARKET INDEX AND CALCULATE RETURNS:

# Download a dataset with prices for Pfizer
pfe = yf.download("PFE ^GSPC", start="2020-01-01", end="2025-01-31", interval='1mo', auto_adjust=True)



# I create another dataset with the Adjusted Closing price:
adjprices = pfe['Close']

# I calculate cc returns of Pfizer:
returns = np.log(adjprices) - np.log(adjprices.shift(1))
returns = returns.dropna()

# I rename the columns
returns.columns = ['PFE','SP500']

I ESTIMATE THE MARKET REGRESSION MODEL FOR PFIZER:

import statsmodels.formula.api as smf

# I estimate the OLS regression model:
mkmodel = smf.ols('PFE ~ SP500',data=returns).fit()
# I display the summary of the regression: 
print(mkmodel.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    PFE   R-squared:                       0.163
Model:                            OLS   Adj. R-squared:                  0.149
Method:                 Least Squares   F-statistic:                     11.30
Date:              jue., 06 mar. 2025   Prob (F-statistic):            0.00137
Time:                        15:34:06   Log-Likelihood:                 75.152
No. Observations:                  60   AIC:                            -146.3
Df Residuals:                      58   BIC:                            -142.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0073      0.009     -0.792      0.432      -0.026       0.011
SP500          0.5854      0.174      3.362      0.001       0.237       0.934
==============================================================================
Omnibus:                        3.034   Durbin-Watson:                   2.045
Prob(Omnibus):                  0.219   Jarque-Bera (JB):                2.160
Skew:                           0.324   Prob(JB):                        0.340
Kurtosis:                       3.666   Cond. No.                         19.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

I CAN GET THE ESTIMATIONS FOR BETAS AND THEIR STANDARD ERRORS AND PVALUES:

# Beta coefficients (selected by position with .iloc; row 0 is the Intercept, row 1 is SP500):
b0 = mkmodel.params.iloc[0]
b1 = mkmodel.params.iloc[1]
# Std Error of beta coefficients:
seb0 = mkmodel.bse.iloc[0]
seb1 = mkmodel.bse.iloc[1]
# I store the minimum and maximum values of the 95% confidence interval for both coefficients.
# conf_int() returns one row per coefficient; column 0 is the lower bound and column 1 the upper bound:
ci = mkmodel.conf_int()
minb0 = ci.iloc[0, 0]
maxb0 = ci.iloc[0, 1]
minb1 = ci.iloc[1, 0]
maxb1 = ci.iloc[1, 1]

THE REGRESSION EQUATION OF THIS MARKET REGRESSION MODEL IS:

PFE_returns = B_0 + B_1(GSPC_returns) + \varepsilon

WHERE \varepsilon IS THE REGRESSION ERROR

THE REGRESSION EQUATION OF THE EXPECTED VALUE OF PFE RETURNS IS:

E[PFE_returns] = B_0 + B_1(GSPC_returns)

(THE EXPECTED VALUE OF THE ERROR IS ALWAYS ZERO)

THEN,

E[PFE_returns] = -0.0073 + 0.5854 (GSPC_returns)

a. INTERPRET THE RESULTS OF THE COEFFICIENTS (B_0 and B_1), THEIR STANDARD ERRORS, P-VALUES AND 95% CONFIDENCE INTERVALS.

REGARDING B_0:

B_0 = -0.0073 AND ITS STANDARD DEVIATION (ALSO CALLED STANDARD ERROR) IS 0.0093. B_0 CAN CHANGE OVER TIME, AND IT WILL BEHAVE SIMILARLY TO A NORMALLY DISTRIBUTED VARIABLE.

THEN, THE AVERAGE VALUE OF B_0 WILL BE -0.0073, AND IT WILL HAVE A VARIABILITY ACCORDING TO ITS STANDARD DEVIATION.

ALTHOUGH B_0<0, WE CANNOT SAY THAT B_0 IS SIGNIFICANTLY LESS THAN ZERO SINCE THE ABSOLUTE VALUE OF ITS t-value IS LESS THAN 2 AND ITS p-value>0.05. THEN WE CANNOT REJECT THE HYPOTHESIS THAT B_0=0; WE DO NOT HAVE EVIDENCE THAT PFIZER IS OFFERING RETURNS SIGNIFICANTLY UNDER OR OVER THE MARKET. THEN, B_0 CAN EASILY MOVE FROM NEGATIVE TO ZERO TO A POSITIVE VALUE.

REGARDING B_1:

B_1 = 0.5854 AND ITS STANDARD DEVIATION (ALSO CALLED STANDARD ERROR) IS 0.1741. B_1 CAN CHANGE OVER TIME, AND IT WILL BEHAVE SIMILARLY TO A NORMALLY DISTRIBUTED VARIABLE.

B_1 > 0, AND IT IS SIGNIFICANTLY GREATER THAN ZERO SINCE ITS p-value<0.05 AND THE ABSOLUTE VALUE OF ITS t-value IS >2. THEN I CAN SAY THAT PFIZER RETURN IS POSITIVELY AND SIGNIFICANTLY RELATED TO THE MARKET RETURN, SINCE ITS B_1>0 AT LEAST 95% OF THE TIME.

WE SEE THAT B_1<1. THEN, ON AVERAGE PFIZER IS LESS RISKY THAN THE MARKET. HOWEVER, WE HAVE TO CHECK WHETHER IT IS SIGNIFICANTLY LESS RISKY THAN THE MARKET.

WE CAN CHECK THIS BY CALCULATING A NEW t-value FOR THE FOLLOWING HYPOTHESIS:

H0: B_1=1 (PFIZER IS AS RISKY AS THE MARKET)

HA: B_1<1 (PFIZER IS LESS RISKY THAN THE MARKET)

OR WE CAN CHECK THE 95% CONFIDENCE INTERVAL OF B_1. FROM THE REGRESSION OUTPUT WE CAN SEE THE 95% CONFIDENCE INTERVAL FOR BETA1:

WITH A PROBABILITY OF 95%, THE MINIMUM POSSIBLE VALUE OF BETA1 IS 0.2369, AND THE MAXIMUM POSSIBLE VALUE OF BETA1 IS 0.9339.

SINCE THE WHOLE INTERVAL LIES BELOW 1, B_1 IS SIGNIFICANTLY LESS THAN ONE. IN OTHER WORDS, PFIZER IS SIGNIFICANTLY LESS RISKY THAN THE MARKET.

IF THE MARKET INCREASES BY 1.00%, IT IS EXPECTED THAT PFIZER RETURN WILL ALSO INCREASE, BUT BY 0.5854%. IF THE MARKET LOSES 1.00%, IT IS EXPECTED THAT PFIZER RETURN WILL LOSE ABOUT 0.5854%.

DO A QUICK RESEARCH ABOUT THE EFFICIENT MARKET HYPOTHESIS. BRIEFLY DESCRIBE WHAT THIS HYPOTHESIS SAYS.

THIS HYPOTHESIS STATES THAT STOCK PRICES REFLECT ALL AVAILABLE INFORMATION THAT IS RELEASED TO INVESTORS. THEN, STOCK PRICES ARE ALWAYS TRADED AT THEIR FAIR VALUE ON EXCHANGES, SO THERE IS NO POSSIBILITY THAT INVESTORS CAN PURCHASE UNDERVALUED STOCKS OR SELL STOCKS FOR INFLATED PRICES. THEREFORE, IT SHOULD BE IMPOSSIBLE TO OUTPERFORM THE OVERALL MARKET AND THE ONLY WAY AN INVESTOR CAN OBTAIN HIGHER RETURNS IS BY PURCHASING RISKIER INVESTMENTS.

ACCORDING TO THE EFFICIENT MARKET HYPOTHESIS, WHAT IS THE EXPECTED VALUE OF B_0 in the Market REGRESSION MODEL?

SINCE THERE IS NO POSSIBLE WAY THAT A STOCK OUTPERFORMS THE MARKET SYSTEMATICALLY, THE EXPECTED VALUE OF B_0 SHOULD BE ZERO. THEN, WHEN THE MARKET RETURN IS ZERO, IT IS EXPECTED THAT A STOCK ALSO OFFERS ZERO RETURNS.

ACCORDING TO YOUR RESULTS, IS PFIZER SIGNIFICANTLY RISKIER THAN THE MARKET? WHAT IS THE t-test YOU NEED TO DO TO ANSWER THIS QUESTION? Do the test and provide your interpretation. (Hint: Here you have to change the null hypothesis for B_1: H0: B_1=1; Ha: B_1<>1)

SINCE B_1<1, WE CAN CHECK WHETHER PFIZER IS LESS RISKY THAN THE MARKET. WE CAN ANSWER THIS QUESTION BY LOOKING AT THE 95% CONFIDENCE INTERVAL OF B_1 OR BY CALCULATING THE t-value OF THE FOLLOWING HYPOTHESIS:

H0: B_1=1 (PFE IS AS RISKY AS THE MARKET)

Ha: B_1<1 (PFE IS LESS RISKY THAN THE MARKET)

ABOVE WE ANSWERED THIS QUESTION USING THE 95% CONFIDENCE INTERVAL. WE CONCLUDED THAT PFIZER IS SIGNIFICANTLY LESS RISKY THAN THE MARKET SINCE B_1<1 AT LEAST 95% OF THE TIME. LET’S CALCULATE THE t-value OF THIS TEST AND SEE WHETHER WE ARRIVE AT THE SAME CONCLUSION:

t-value = (0.5854 - 1) / 0.1741

t-value = -2.3815

SINCE THE ABSOLUTE VALUE OF t IS > 2, WE HAVE ENOUGH STATISTICAL EVIDENCE AT THE 95% CONFIDENCE LEVEL TO REJECT THE NULL HYPOTHESIS. IN OTHER WORDS, WE CAN CONCLUDE THAT PFIZER IS SIGNIFICANTLY LESS RISKY THAN THE MARKET. THEN, WE ARRIVED AT THE SAME CONCLUSION AS WHEN WE ANALYZED THE 95% CONFIDENCE INTERVAL OF B_1!
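THE AGREEMENT BETWEEN THE TWO APPROACHES IS NOT A COINCIDENCE: COMPARING THE t-value WITH THE CRITICAL VALUE IS EQUIVALENT TO CHECKING WHETHER THE HYPOTHETICAL VALUE FALLS OUTSIDE THE CONFIDENCE INTERVAL. THE FOLLOWING SKETCH USES THE ROUNDED PFIZER NUMBERS QUOTED ABOVE; THE VARIABLE NAMES ARE HYPOTHETICAL, AND THE CRITICAL VALUE IS BACKED OUT FROM THE REPORTED INTERVAL RATHER THAN TAKEN FROM A t TABLE.

```python
b1_hat = 0.5854                   # estimate of beta1 for Pfizer
se_b1 = 0.1741                    # its standard error
ci_low, ci_high = 0.2369, 0.9339  # 95% confidence interval quoted above
null_value = 1.0                  # H0: B_1 = 1

# t-value of the test against the hypothetical value 1
t_stat = (b1_hat - null_value) / se_b1

# Critical value implied by the reported interval: (upper bound - estimate) / std. error
implied_critical = (ci_high - b1_hat) / se_b1

# Two equivalent decision rules
reject_by_t = abs(t_stat) > implied_critical
reject_by_ci = not (ci_low <= null_value <= ci_high)

print(f"t-value: {t_stat:.4f}")
print(f"critical value implied by the interval: {implied_critical:.4f}")
print("reject H0 using the t-value?", reject_by_t)
print("reject H0 using the confidence interval?", reject_by_ci)
```

THE IMPLIED CRITICAL VALUE IS VERY CLOSE TO 2, WHICH IS WHY THE ROUGH RULE OF COMPARING THE ABSOLUTE t-value WITH 2 WORKS WELL FOR THIS SAMPLE SIZE.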

6 READING

Read the following note:

Hypothesis tests
Basics of Linear Regression models
