Workshop 4, Advanced AI - Statistics Module

Authors

Alberto Dorantes D., Ph.D.

Monterrey Tech, Queretaro Campus

Abstract

In this workshop we continue learning about the Simple Linear Regression Model.

1 Workshop Directions

You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOG IN with your @tec.mx account and then create a Google Colab document for each workshop.

You must share each Colab document (workshop) with the following accounts:

cdorante.tec@gmail.com
cdorante@tec.mx

You must give Edit privileges to these accounts.

You have to follow this workshop in class to learn about the topics. You have to write your own Markdown notes for every workshop we cover.

Rename your Notebook as “W4-Statistics-AI-YourFirstName-YourLastname”.

You must submit your workshop before we start with the next workshop. What do you have to write in your workshop? You have to:

REPLICATE and RUN all the Python code, and
DO ALL CHALLENGES stated in the sections. These challenges can be Python code or just answering QUESTIONS in your own words and in CAPITAL LETTERS. You have to WRITE CLEARLY so that I can see your LINE OF THINKING!

The submission of your workshops is a REQUISITE for grading your final deliverable document of the Statistics Module.

I strongly recommend that you write your OWN NOTES about the topics as if they were your study NOTEBOOK.

2 Applying the Linear Regression Model: Estimate a market regression model

Now it’s time to use real data to better understand Regression models.

We download monthly prices for Alfa (ALFAA.MX) and the Mexican market index IPyC (^MXX) from Yahoo Finance, from January 2019 to July 2024. With these prices we calculate continuously compounded (cc) returns of Alfa and the Market (MXX) and drop NA values:

import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

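# Getting price data and selecting adjusted price columns.
# auto_adjust=False keeps the 'Adj Close' column, which recent yfinance versions drop by default:
sprices = yf.download(tickers="ALFAA.MX,^MXX", start="2019-01-01", end="2024-07-31", interval="1mo", auto_adjust=False)

sprices = sprices['Adj Close']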

# Calculating cc returns:
returns = np.log(sprices) - np.log(sprices.shift(1))
# Deleting the first month with NAs:
returns=returns.dropna()
# I rename the columns; the first column is Alfa returns:
returns.columns=['ALFAAret','MXXret']
# I view the first and last returns:
print(returns.head())
print(returns.tail())

[*********************100%%**********************]  2 of 2 completed
            ALFAAret    MXXret
Date                          
2019-02-01 -0.092199 -0.026821
2019-03-01 -0.062021  0.010626
2019-04-01 -0.072858  0.029954
2019-05-01 -0.068583 -0.042324
2019-06-01  0.052801  0.009592
            ALFAAret    MXXret
Date                          
2024-03-01 -0.023285  0.034672
2024-04-01  0.019336 -0.011237
2024-05-01 -0.060824 -0.027681
2024-06-01 -0.101601 -0.050917
2024-07-01  0.000000  0.013200
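
To make the cc return calculation concrete, here is a minimal sketch with three hypothetical prices (they are NOT real Alfa prices). It only illustrates the formula r_t = ln(P_t) - ln(P_{t-1}) used in the code above, and the fact that cc returns can be added over time:

```python
import math

# Hypothetical month-end prices (not real data), used only to illustrate the formula.
prices = [100.0, 110.0, 99.0]

# cc return for period t: r_t = ln(P_t) - ln(P_{t-1}) = ln(P_t / P_{t-1})
cc_returns = [math.log(p1) - math.log(p0) for p0, p1 in zip(prices, prices[1:])]

# Simple returns for comparison: R_t = P_t / P_{t-1} - 1
simple_returns = [p1 / p0 - 1 for p0, p1 in zip(prices, prices[1:])]

# cc returns are additive: their sum equals the cc return of the whole period.
total_cc = sum(cc_returns)
whole_period_cc = math.log(prices[-1] / prices[0])

print([round(r, 4) for r in cc_returns])
print([round(r, 4) for r in simple_returns])
print(round(total_cc, 4), round(whole_period_cc, 4))
```

Note that the first observation has no previous price, which is why the first row of returns is NA and has to be dropped.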

We do a scatter plot including the regression line:

import seaborn as sb
plt.clf()
x = returns['MXXret']
y = returns['ALFAAret']
# I plot the (x,y) values along with the regression line that best fits the data:
sb.regplot(x=x,y=y)
plt.xlabel('Market returns')
plt.ylabel('Alfa returns')
plt.show()

Scatter plots can be misleading when the ranges of X and Y are very different. In this case, Alfa had a much wider range of returns than the market: Alfa's monthly returns went from about -60% to about +40%, while the market returns went from about -17% to about +12%.

Then, we can re-do the scatter plot using a similar scale for both axes:

plt.clf()
sb.regplot(x=x,y=y)
# I adjust the scale of the X axis so that the magnitude of each unit of x is equal to that of the Y axis:
plt.xticks(np.arange(-1,1,0.2))
# I label the axis:
plt.xlabel("Market returns")
plt.ylabel("Alfa returns")
plt.show()

2.1 CHALLENGE

WHAT DOES THE PLOT TELL YOU? BRIEFLY EXPLAIN

2.2 ESTIMATING THE MARKET REGRESSION MODEL

The OLS function from the statsmodels package is used to estimate a regression model. We run a simple regression model to see how the monthly returns of the stock are related to the market return.

The first parameter of the OLS function is the DEPENDENT VARIABLE (in this case, the stock return), and the second parameter must be the INDEPENDENT VARIABLE, also named the EXPLANATORY VARIABLE (in this case, the market return).

Before we run the OLS function, we need to add a column of 1’s to the X vector in order to estimate the beta0 coefficient (the constant).

What you will get is called the Market Regression Model. You are trying to examine how the market returns can explain stock returns:

import statsmodels.api as sm
# I add a column of 1's to the X dataframe in order to include the beta0 coefficient (intercept) in the model:
X = sm.add_constant(x)
# I estimate the OLS regression model:
mkmodel = sm.OLS(y,X).fit()
# I display the summary of the regression: 
print(mkmodel.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               ALFAAret   R-squared:                       0.316
Model:                            OLS   Adj. R-squared:                  0.305
Method:                 Least Squares   F-statistic:                     29.59
Date:                Tue, 03 Sep 2024   Prob (F-statistic):           8.96e-07
Time:                        07:48:49   Log-Likelihood:                 56.986
No. Observations:                  66   AIC:                            -110.0
Df Residuals:                      64   BIC:                            -105.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0147      0.013     -1.147      0.255      -0.040       0.011
MXXret         1.3502      0.248      5.440      0.000       0.854       1.846
==============================================================================
Omnibus:                       15.349   Durbin-Watson:                   2.206
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               70.326
Skew:                          -0.124   Prob(JB):                     5.36e-16
Kurtosis:                       8.051   Cond. No.                         19.5
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

# I can also run the OLS regression using the ols function 
import statsmodels.formula.api as smf

mkmodel2 = smf.ols('ALFAAret ~ MXXret',data=returns).fit()
# This function does not require to add the column of 1's to include the intercept!
print(mkmodel2.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               ALFAAret   R-squared:                       0.316
Model:                            OLS   Adj. R-squared:                  0.305
Method:                 Least Squares   F-statistic:                     29.59
Date:                Tue, 03 Sep 2024   Prob (F-statistic):           8.96e-07
Time:                        07:48:49   Log-Likelihood:                 56.986
No. Observations:                  66   AIC:                            -110.0
Df Residuals:                      64   BIC:                            -105.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0147      0.013     -1.147      0.255      -0.040       0.011
MXXret         1.3502      0.248      5.440      0.000       0.854       1.846
==============================================================================
Omnibus:                       15.349   Durbin-Watson:                   2.206
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               70.326
Skew:                          -0.124   Prob(JB):                     5.36e-16
Kurtosis:                       8.051   Cond. No.                         19.5
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

# Using matrix algebra to estimate the beta coefficients:
# I copy the returns so that adding a column does not modify the original dataframe:
sr = returns.copy()
# I add the column of 1's to the dataframe:
sr['constant'] = 1
selcols = ['constant','MXXret']
# I set x as a matrix with the column of 1's and the values of X:
x = sr[selcols].values
# I set y as the dependent variable:
y = sr['ALFAAret'].values
# I calculate the matrix multiplication X'X:
xtx = np.matmul(x.transpose(),x)
# I calculate the matrix multiplication X'Y:
xty = np.matmul(x.transpose(),y)
# I get the inverse of the matrix (X'X) to solve for the beta coefficients:
invxtx = np.linalg.inv(xtx)
# I multiply inv(X'X)*X'Y to get the estimation of the beta vector (beta0 and beta1 coefficients):
betas = np.matmul(invxtx,xty)
betas

array([-0.01465928,  1.35019665])
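
For a simple regression, the matrix formula inv(X'X)X'Y can be written out by hand because X'X is only a 2x2 matrix. The following is a minimal sketch using a hypothetical toy dataset (not the Alfa returns) and plain Python; it checks that the matrix solution gives the same betas as the familiar textbook formulas for the slope and the intercept:

```python
# Hypothetical toy data (not the Alfa/MXX returns): x = market return, y = stock return.
x = [-0.04, -0.01, 0.00, 0.02, 0.03]
y = [-0.07, -0.02, -0.01, 0.03, 0.02]
n = len(x)

# Elements of X'X and X'Y when X = [column of 1's, x]:
sum_x = sum(x)
sum_xx = sum(xi * xi for xi in x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# X'X = [[n, sum_x], [sum_x, sum_xx]]
# inv(X'X) = (1/det) * [[sum_xx, -sum_x], [-sum_x, n]]
det = n * sum_xx - sum_x ** 2
b0 = (sum_xx * sum_y - sum_x * sum_xy) / det
b1 = (n * sum_xy - sum_x * sum_y) / det

# Textbook formulas: b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), b0 = ybar - b1*xbar
xbar, ybar = sum_x / n, sum_y / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1_alt = sxy / sxx
b0_alt = ybar - b1_alt * xbar

print(b0, b1)
print(b0_alt, b1_alt)
```

Both routes are the same OLS solution; the matrix version is the one that generalizes to regressions with more than one explanatory variable.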

Write down the regression equation

The regression equation is: E[ALFAAret] = -0.0147 + 1.3502*MXXret.

In the regression outputs we see, for both beta coefficients, not only their estimated values but also their standard errors, t-Statistics, p-values, and 95% confidence intervals. Note that the regression function in Python automatically performs hypothesis testing for both coefficients, b_0 and b_1, where the null hypotheses are that the coefficients are equal to zero.

In the next section we will learn about these estimations.

3 The standard error of the beta coefficients

The OLS method includes the estimation of the beta coefficients and also estimation of their corresponding standard errors. Then, what is the standard error of the beta coefficients?

The standard error of b_0 is actually an estimation of the expected standard deviation of b_0. The same happens for b_1. But how can we estimate the standard deviation of a beta coefficient? It seems that we need several possible values of b_0 and b_1, so that we could estimate their standard deviation as a measure of standard error. Then, we would need many samples to estimate several possible pairs of beta coefficients. However, most of the time we only use 1 sample to estimate the beta coefficients and their standard errors. Then, why do we need to estimate the expected standard deviation of the coefficients?

In several disciplines, when we want to understand the relationship between 2 random variables, we only have access to 1 sample (not the population), so we need to find a way to estimate the error levels we might have in the beta estimations. Then, we need to find a way to estimate the possible variation of the beta coefficients as if we had the possibility to collect many samples and get many possible pairs of beta coefficients.

It sounds weird, but it is possible to estimate how much b_0 and b_1 might vary using only 1 sample. We can use basic probability theory and the result of the Central Limit Theorem to estimate the expected standard deviation of the beta coefficients, their corresponding t-values, p-values and their 95% confidence intervals.

I will not derive here the formulas for the expected standard deviation of the beta coefficients, but you can check this derivation in the Appendix of my note: Basics of Linear Regression Models in the context of Finance.

To estimate the standard errors of the coefficients, we first need to estimate the variance of the regression errors, which is called the mean squared error (MSE). Its square root is the estimated standard deviation of the errors.

The formula to estimate the variance of the regression errors, the mean squared error (MSE), is:

MSE=\frac{SSE}{N-2}

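Where SSE is the sum of squared errors.

The SSE is divided by N-2 since 2 parameters (b_0 and b_1) must be estimated before the errors can be calculated, so 2 degrees of freedom are lost.

The SSE is calculated as:

SSE = \sum_{i=1}^{N}(y_i-\hat{y}_i)^2

Where \hat{y}_i=b_0+b_1x_i is the value of y_i predicted by the regression.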

The formula to estimate the standard error (SE) of b_1 is the following:

SE(b_1)=\sqrt{\frac{MSE}{\sum_{i=1}^N(x_i-\bar{x})^2}}

The formula to estimate the standard error (SE) of b_0 is the following:

SE(b_0)=\sqrt{\frac{MSE\sum_{i=1}^{N}x_i^2}{N\sum_{i=1}^N(x_i-\bar{x})^2}}

These estimations are automatically calculated and displayed in the output of any regression software. By learning their formulas we see that the standard error of both coefficients is directly proportional to the standard deviation of the errors, which is the square root of the mean squared error (MSE). Then, the greater the magnitude of the individual errors, the greater the standard error of both beta coefficients, and the more difficult it is to find significant betas (betas that are significantly different from zero).
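
The formulas above can be followed step by step with a small example. This is a minimal sketch in plain Python using a hypothetical toy dataset (not the Alfa returns); the variable names are only illustrative. It estimates the betas, then SSE, MSE, SE(b_1) and SE(b_0) exactly as written in the formulas:

```python
import math

# Hypothetical toy data (not the Alfa/MXX returns).
x = [-0.04, -0.01, 0.00, 0.02, 0.03]
y = [-0.07, -0.02, -0.01, 0.03, 0.02]
n = len(x)

# OLS estimates of b1 and b0:
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# Regression errors (residuals) and sum of squared errors:
errors = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in errors)

# MSE = SSE / (N - 2): estimated variance of the errors.
mse = sse / (n - 2)

# Standard errors of the coefficients:
se_b1 = math.sqrt(mse / sxx)
se_b0 = math.sqrt(mse * sum(xi ** 2 for xi in x) / (n * sxx))

print("b0:", round(b0, 4), "SE(b0):", round(se_b0, 4))
print("b1:", round(b1, 4), "SE(b1):", round(se_b1, 4))
```

If you double the size of every error, the MSE is multiplied by 4 and both standard errors are multiplied by 2, which is the proportionality described above.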

To further understand the standard error of beta coefficients we will do an exercise to estimate several regression models using different time ranges. Each regression will estimate one value for b_0 and one value for b_1. Then, if we run N regressions, we will have N pairs of beta coefficients, so we will see how these beta coefficients change over time. This variation gives an intuition of what the standard error of the beta coefficients measures.

4 CHALLENGE: Estimate moving betas for the market regression model

How do the beta coefficients of a stock move over time? Are the b_1 and b_0 of a stock stable? If not, do they change gradually or can they change radically over time? We will run several rolling regressions for Alfa to try to answer these questions.

Before we do the exercise, I will review the meaning of the beta coefficients in the context of the market model.

In the market regression model, b_1 is a measure of sensitivity; it measures how much the stock return might move (on average) when the market return moves by +1%.

Then, according to the market regression model, the stock return will change if the market return changes, and it will also change because of many other external factors. The aggregation of these external factors is what the error term represents.

It is said that b_1 in the market model measures the systematic risk of the stock, which depends on changes in the market return. The unsystematic risk of the stock is given by the error term, which is also called the random shock; it is the summary of the overall reaction of all investors to news that might affect the stock (news about the company, its industry, regulations, national news, global news).

We can make predictions of the stock return by measuring the systematic risk with the market regression model, but we cannot predict the unsystematic risk. The most we can measure with the market model is the variability of this unsystematic risk (the variance of the error).
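
As a small illustration of the systematic part of the model, the sketch below uses the coefficients reported in the first regression output (rounded to 4 decimals) to calculate the expected Alfa return for a few hypothetical market returns. The function name expected_stock_return is only illustrative:

```python
# Coefficients reported in the first market regression output (rounded):
b0 = -0.0147
b1 = 1.3502

def expected_stock_return(market_return):
    """Systematic part of the market model: E[stock return] = b0 + b1 * market return."""
    return b0 + b1 * market_return

# Hypothetical market returns for one month:
for mkt in (-0.05, 0.0, 0.01, 0.05):
    print(f"market {mkt:+.2%} -> expected Alfa return {expected_stock_return(mkt):+.2%}")

# Sensitivity: change in the expected stock return when the market moves by +1%.
sensitivity = expected_stock_return(0.01) - expected_stock_return(0.0)
print(f"{sensitivity:.4%}")
```

The actual stock return in any month will differ from this expected value by the error term, which the model cannot predict.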

In this exercise you have to estimate rolling regressions by moving time windows and run 1 regression for each time window.

For the same ALFAA.MX stock, run rolling regressions using a time window of 36 months, starting from Jan 2010.

The first regression has to start in Jan 2010 and end in Dec 2012 (36 months). For the second you have to move the time window 1 month ahead, so it will start in Feb 2010 and end in Jan 2013. For the third regression you move another month ahead and run the regression. You continue running all possible regressions until you end up with a window with the last 36 months of the dataset.
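
To see the logic of the moving windows, here is a minimal sketch in plain Python with hypothetical toy returns and a short window of 4 observations instead of 36. The helper functions ols_simple and rolling_betas are illustrative names, not functions from any library:

```python
def ols_simple(x, y):
    """Return (b0, b1) of y = b0 + b1*x estimated by OLS."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def rolling_betas(x, y, window):
    """Run one regression per window, moving the window 1 observation ahead each time."""
    betas = []
    for start in range(len(x) - window + 1):
        end = start + window
        betas.append(ols_simple(x[start:end], y[start:end]))
    return betas

# Hypothetical toy returns (not real data): 7 observations.
x = [-0.04, -0.01, 0.00, 0.02, 0.03, -0.02, 0.01]
y = [-0.07, -0.02, -0.01, 0.03, 0.02, -0.01, 0.03]

betas = rolling_betas(x, y, window=4)
for i, (b0, b1) in enumerate(betas):
    print(f"window {i + 1}: b0 = {b0:.4f}, b1 = {b1:.4f}")
```

With N observations and a window of size w, there are N - w + 1 possible windows, so this toy example produces 7 - 4 + 1 = 4 pairs of betas.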

This sounds complicated, but fortunately we can use the RollingOLS function, which automatically performs rolling regressions by shifting the 36-month window by 1 month in each iteration.

Then, you have to do the following:

Download monthly stock prices for ALFAA.MX and the market (^MXX) from Jan 2010 to the most recent month available, and calculate cc returns.

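# Getting price data and selecting adjusted price columns.
# auto_adjust=False keeps the 'Adj Close' column, which recent yfinance versions drop by default:
sprices = yf.download("ALFAA.MX ^MXX", start="2010-01-01", interval="1mo", auto_adjust=False)
sprices = sprices['Adj Close']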

# Calculating returns:
sr = np.log(sprices) - np.log(sprices.shift(1))
# Deleting the first month with NAs:
sr=sr.dropna()
sr.columns=['ALFAAret','MXXret']

[*********************100%%**********************]  2 of 2 completed

Run rolling regressions and save the moving b_0 and b_1 coefficients for all time windows.

from statsmodels.regression.rolling import RollingOLS
x=sm.add_constant(sr['MXXret'])
y = sr['ALFAAret']
rolreg = RollingOLS(y,x,window=36).fit()
betas = rolreg.params
# I check the last pairs of beta values:
betas.tail()

               const    MXXret
Date                          
2024-05-01 -0.003278  0.546430
2024-06-01 -0.007444  0.594719
2024-07-01 -0.007841  0.593199
2024-08-01 -0.002920  0.635460
2024-09-01 -0.003992  0.651561

Do a plot to see how b_1 and b_0 have changed over time.

plt.clf()
plt.plot(betas['MXXret'])
plt.title('Moving beta1 for Alfaa')
plt.xlabel('Date')
plt.ylabel('beta1')
plt.show()

plt.clf()
plt.plot(betas['const'])
plt.title('Moving beta0 for Alfaa')
plt.xlabel('Date')
plt.ylabel('beta0')
plt.show()

We can see that both beta coefficients move over time; they are not constant. There is no apparent pattern in the changes of the beta coefficients, but we can appreciate how much they move over time; in other words, we can visualize their standard deviation, which measures their typical deviation from their means.

We can actually calculate the mean and standard deviation of all these pairs of moving beta coefficients and see how they compare with the beta coefficients and standard errors of the original regression, where we used only 1 sample:

betas.describe()

            const      MXXret
count  141.000000  141.000000
mean    -0.003143    1.382452
std      0.014290    0.497860
min     -0.025584    0.428251
25%     -0.013520    1.094895
50%     -0.007841    1.443745
75%      0.006537    1.713583
max      0.030666    2.343706

We ran 141 regressions, one for each of the 141 36-month rolling windows (176 monthly returns - 36 + 1 = 141). For each regression we calculated a pair of b_0 and b_1.

Compared with the first market regression of Alfa, which used monthly data from 2019 to July 2024 (66 months, or 5.5 years), we see that the mean of the moving b_1 (1.38) is very similar to the b_1 estimated in the first regression (1.35). Also, we see that the standard deviation of the moving b_0 (0.014) is very similar to the standard error of b_0 estimated in the first regression (0.013). The standard deviation of the moving b_1 (0.50) is much higher than the standard error of b_1 in the first regression (0.25). This difference might be because the moving betas were estimated using data from 2010, while the first regression used data from 2019, so it seems that the systematic risk of Alfa (measured by its b_1) has been decreasing in recent months.

I hope that now you can understand why we need an estimation of the standard error of the beta coefficients (standard deviation of the coefficients).

Next we will learn how to use the estimated beta coefficients and their corresponding standard errors to calculate their t-Statistics, p-values and 95% confidence intervals.

5 t-Statistic, p-value and 95% confidence interval of beta coefficients

5.1 Hypothesis tests for the beta coefficients

When we run a linear regression model, besides the estimation of the beta coefficients and their corresponding standard errors, one hypothesis test is performed for each beta coefficient.

We apply the hypothesis test to each of the beta coefficients to test whether the beta coefficient is or is not equal to zero.

For the case of the simple market regression, the following hypothesis tests are performed:

For b_0:

H0: The mean of b_0 = 0
Ha: The mean of b_0 <> 0 (Our hypothesis)

In this case, the variable of study for the hypothesis test is the b_0 coefficient.

Then, we calculate the t-Statistic for b_0 as follows:

t =\frac{(b_0 - 0)}{SE(b_0)}

SE(b_0) is the standard error of b_0, which is its estimated standard deviation.

Remember that the t-Statistic is the standardized distance between b_0 (the value estimated from the regression) and zero. In other words, the t-Statistic tells us how many standard deviations of b_0 the estimated value of b_0 is away from zero, which is the hypothetical true value.

Remember that the null hypothesis (H0) is the hypothesis of the skeptical person who believes that b_0 is equal to zero. Then, we start by assuming that H0 is true. If we show that a value as far from zero as our estimated b_0 would be very unlikely if H0 were true (this probability is the p-value), then we will have statistical evidence to reject H0 and support our Ha.
Then, if \mid t\mid>2, we will have statistical evidence at approximately the 95% confidence level to reject the null hypothesis. The critical value of 2 for t is an approximation; the exact value depends on the # of observations of the regression: it is about 1.96 for large samples and about 2.04 for around 30 observations.

From a t-Statistic and the # of observations, we can estimate the exact p-value. This value cannot be calculated using a simple formula since there is no closed-form solution for the cumulative distribution function of the t distribution. However, remember that the Student's t probability distribution becomes very similar to the normal probability distribution when the # of observations is equal to or greater than 30. Then, if the t-Statistic is about 2, the area under the normal curve (which is the probability) above t=2 plus the area below t=-2 will be around 0.05 (5%). This is the 2-sided p-value of the test.
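
The calculation of the t-Statistic and the normal approximation of the 2-sided p-value can be sketched as follows. This example uses the rounded coefficients and standard errors shown in the first regression output, so the results differ slightly from the t and p-values that statsmodels reports (it uses unrounded values and the exact t distribution). The function names are only illustrative:

```python
from statistics import NormalDist

def t_statistic(b, se, hypothesized=0.0):
    """Standardized distance between the estimated coefficient and the hypothesized value."""
    return (b - hypothesized) / se

def approx_two_sided_p(t):
    """2-sided p-value using the normal approximation to the t distribution."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Rounded values taken from the first regression output:
b0, se_b0 = -0.0147, 0.013
b1, se_b1 = 1.3502, 0.248

t_b0 = t_statistic(b0, se_b0)
t_b1 = t_statistic(b1, se_b1)

print("b0: t =", round(t_b0, 3), " approx p-value =", round(approx_two_sided_p(t_b0), 4))
print("b1: t =", round(t_b1, 3), " approx p-value =", round(approx_two_sided_p(t_b1), 4))

# The rule of thumb: a t-Statistic of about 2 gives a 2-sided p-value of about 0.05.
print(round(approx_two_sided_p(2.0), 4))
```

For b_0 the absolute value of t is below 2, so we cannot reject that b_0 is zero; for b_1 the t-Statistic is far above 2, so the p-value is practically zero.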

Remember that the p-value is the probability of obtaining a t-Statistic at least as extreme as the one we observed if the null hypothesis were true. Loosely speaking, a small p-value means that there is little risk of making a mistake when we reject the null hypothesis. Then, the smaller the p-value, the better. The rule of thumb is that if the p-value<0.05, then we have statistical evidence at least at the 95% confidence level to reject the null.

Fortunately, the standard error, t-Statistic and the 2-sided p-value for this hypothesis test are automatically calculated and shown when we run a regression model.

Then, in conclusion, if the p-value estimated for b_0 is less than 0.05 and b_0>0, then we can say that there is statistical evidence at the 95% confidence level that b_0 is greater than zero.

In the context of the market regression model, b_0 has an important meaning. If we find that b_0 is significantly greater than zero, then we can say that the stock is systematically offering positive returns over the market. In Finance the b_0 coefficient is called Jensen's Alpha, and it is supposed to be zero, or NOT significantly different from zero, according to the efficient market hypothesis.

For b_1 the same process is performed:

H0: The mean of b_1 = 0
Ha: The mean of b_1 <> 0 (Our hypothesis)

In this case, the variable of study for the hypothesis test is the b_1 coefficient.

Then, we calculate the t-Statistic as follows:

t =\frac{(b_1 - 0)}{SE(b_1)}

SE(b_1) is the standard error of b_1, which is its estimated standard deviation.

Then, we follow the same logic explained above to make a conclusion about b_1.

In the context of the market regression model, b_1 not only measures the linear relationship between the stock return and the market return; b_1 is a measure of the systematic market risk of the stock. If the p-value(b_1)<0.05 and b_1>0, then we can say that the stock return is positively and significantly related to the market return.

Another interesting hypothesis test for b_1 is to examine whether b_1 is significantly greater or less than 1 (not zero). If b_1 is significantly > 1, then we can say that the stock is significantly riskier than the market. Unfortunately, this hypothesis is NOT tested in the traditional output of the regression model. We need to calculate the corresponding t-Statistic for this test manually.
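
The manual calculation for this test only changes the hypothesized value from 0 to 1. Here is a minimal sketch using the rounded b_1 and SE(b_1) from the first regression output and the approximate critical value of 2 mentioned above:

```python
# Rounded values taken from the first regression output:
b1 = 1.3502
se_b1 = 0.248

# t-Statistic for H0: b1 = 1 (instead of the default H0: b1 = 0):
t_vs_one = (b1 - 1.0) / se_b1

# Approximate decision rule with a critical value of 2:
reject_h0 = abs(t_vs_one) > 2

print("t-Statistic for H0: b1 = 1:", round(t_vs_one, 3))
print("Reject H0 at about the 95% confidence level?", reject_h0)
```

In this case the t-Statistic is about 1.41, which is less than 2. Although the estimated b_1 is greater than 1, we do not have statistical evidence to say that Alfa is significantly riskier than the market.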

5.2 The 95% confidence interval for each coefficient

Besides the standard error, t-Statistic and p-value, the 95% confidence interval (C.I.) is also calculated for each beta coefficient. The 95% C.I. has a minimum and maximum possible value. The 95% C.I. illustrates how much the beta coefficient can move 95% of the time.

An approximate way to estimate the minimum and maximum of this 95% C.I. is just by subtracting and adding 2 standard errors to the beta coefficient. For example, an approximate 95% C.I. for b_0 can be estimated as:

95\%C.I.(b_0)=[b_0 - 2SE(b_0),\ b_0 + 2SE(b_0)]

The exact critical value for the 95% C.I. is not 2; it depends on the # of observations, going from about 1.96 for large samples to about 2.04 for around 30 observations. The exact values of the 95% C.I. are automatically calculated when we run the regression model.

The 95% C.I. of a beta coefficient tells us the possible movement of the beta according to its standard error.

We can use the 95% C.I. instead of the t-Statistic or p-value to make the same conclusion for the hypothesis test of the coefficient. If the 95% C.I. does NOT contain zero, then it means that the beta coefficient is significantly different from zero. An advantage of the 95% C.I. is that we can quickly check whether b_1 is significantly different from 1 to see whether a stock is significantly riskier than the market; if 1 is not included in the 95% C.I. and b_1>1, then we can say that the stock is significantly riskier than the market at the 95% confidence level.
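
The approximate 95% C.I. and its use for hypothesis testing can be sketched as follows. The example uses the rounded coefficients and standard errors from the first regression output and the approximate critical value of 2; the helper names approx_ci95 and contains are only illustrative:

```python
def approx_ci95(b, se, critical=2.0):
    """Approximate 95% C.I.: coefficient minus/plus about 2 standard errors."""
    return (b - critical * se, b + critical * se)

def contains(ci, value):
    """Check whether a value falls inside the confidence interval."""
    low, high = ci
    return low <= value <= high

# Rounded values taken from the first regression output:
ci_b0 = approx_ci95(-0.0147, 0.013)
ci_b1 = approx_ci95(1.3502, 0.248)

print("95% C.I. for b0:", [round(v, 4) for v in ci_b0])
print("95% C.I. for b1:", [round(v, 4) for v in ci_b1])

# Zero inside the C.I. means the coefficient is NOT significantly different from zero:
print("b0 C.I. contains 0:", contains(ci_b0, 0.0))
print("b1 C.I. contains 0:", contains(ci_b1, 0.0))
# One inside the C.I. of b1 means the stock is NOT significantly riskier than the market:
print("b1 C.I. contains 1:", contains(ci_b1, 1.0))
```

These approximate intervals are very close to the exact ones reported in the regression output, [-0.040, 0.011] for b_0 and [0.854, 1.846] for b_1.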
