# Load libraries for the workshop
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.stats.api as sms
# Additionals
from statsmodels.stats.diagnostic import het_breuschpagan # Breusch-Pagan test for homoscedasticity
from statsmodels.tools.tools import add_constant # Add constant for regression loop
Rational Agent and Behavioral Finance
FZ2024 Financial Modeling and Programming
0.1 Before you begin: important instructions for all Workshops
Welcome to our workshop series! Please read these instructions carefully before starting any activity. Following these guidelines will make your work smoother and ensure that your submissions are graded without issues.
0.1.1 Working environment
We will use Google Colab for all workshops. Colab runs Python in the cloud — you don’t need to install anything locally.
- Access Colab at: https://colab.research.google.com/
- Sign in with your institutional Google account for access to all features.
- Always save a copy of the notebook to your Google Drive:
- Go to File → Save a copy in Drive.
0.1.2 Loading data
You may work with datasets provided by the instructor or public datasets online. You will receive instructions each time to load the data with Python code. However, it is a good idea to store files, like data or your own notes, in a dedicated Google Drive folder:
- Create a folder in your Google Drive named fz2024_workshops (or similar).
- Upload your datasets there.
0.1.3 Output and submission format
- After completing the workshop, export your notebook as PDF:
- In Colab: File → Print → Save as PDF.
- Submit the PDF file through Canvas, as well as the .ipynb.
- Include all outputs, tables, and graphs in your PDF; make sure you run all cells before exporting.
- Name your PDF file using the following format:
Lastname_Firstname_WorkshopX.pdf
0.1.4 Deadlines
All assignments must be uploaded to Canvas before the stated deadline. Late submissions are not accepted. Once you have read and understood these instructions, you are ready to begin the workshop!
1 Overview
This workshop is the first of three workshops that together explore how to build a portfolio grounded in Rational Agent Theory, Behavioural Finance, and the Market Anomalies literature. The process unfolds in three analytical stages:
Filter 1 – Market Efficiency: Test whether each stock’s historical information helps forecast its price. This step identifies inefficient markets, where past returns contain predictive signals.
Filter 2 – Market Anomalies: Examine systematic patterns—such as momentum or trend-following behaviour—that contradict the Efficient Market Hypothesis. Here, the strategy buys assets with upward trends and sells those trending downward.
Filter 3 – Portfolio Allocation: Optimize the portfolio composition using only the assets that pass the previous filters.
Unlike the traditional Markowitz framework, which focuses solely on optimizing asset weights, this three-part approach first conducts stock selection through theoretical filters. Each filter reflects assumptions derived from the rational agent and behavioural perspectives, allowing students to connect empirical testing with economic theory before moving into optimization.
This workshop designs a pre-allocation stock filter using (i) Efficient Market Hypothesis (EMH) tests on the conditional mean of returns, and (ii) a Breusch–Pagan test on the conditional variance of returns. The resulting list of stocks passes to the next stage (trend/anomaly filter) before portfolio optimization. This is not a pure Markowitz exercise; we first select assets under theory-driven assumptions, then optimize weights.
2 Introduction to rational agent and behavioural finance
The Efficient Market Hypothesis (EMH) posits that asset prices “fully reflect” available information. In its strict form, past returns should not help predict current returns (Fama, 1970). By contrast, behavioural finance documents departures from full rationality—heuristics, over/under-reaction, limits to arbitrage—that can generate predictable patterns in returns (Burton & Shah, 2013).
From an empirical perspective, a practical first step is to ask: Can yesterday’s return help forecast today’s return? If no, this is consistent with EMH (at least for the conditional mean). If yes, the asset exhibits predictability, suggesting a possible inefficiency exploitable by systematic strategies (Aldridge, 2010).
| Dimension | Rational Agent / Efficient Markets | Behavioural Finance |
|---|---|---|
| Core Logic | Markets are efficient and prices fully reflect all available information. Agents act rationally, updating beliefs and maximizing expected utility. | Markets are not fully efficient because investors are subject to cognitive biases, emotions, and social dynamics that distort decisions and prices. |
| View of Human Behavior | Individuals are homo economicus: rational, consistent, and optimizing. | Individuals are boundedly rational: influenced by heuristics, overconfidence, loss aversion, mental accounting, and social imitation. |
| Price Formation | Prices adjust instantaneously to new information through arbitrage and competition among rational traders. | Prices may deviate from fundamentals due to collective biases, slow information diffusion, or feedback loops (herding, momentum, bubbles). |
| Market Outcomes | Random walk of prices; predictability is negligible in the short run. Any abnormal pattern is quickly arbitraged away. | Persistent anomalies (momentum, overreaction, calendar effects) can exist, as markets do not always self-correct efficiently. |
| Predictability of Returns | Past prices or returns contain no useful information about future prices; returns are unpredictable. | Some predictability may exist because human reactions to information are not purely rational or immediate. |
| Role of Emotions and Psychology | Minimal—emotions are assumed irrelevant to market equilibrium. | Central—fear, greed, regret, and overconfidence drive much of observed market behavior. |
| Policy and Strategy Implications | Focus on diversification, passive investing, and risk–return optimization under rational expectations. | Incorporate investor behavior, sentiment, and cognitive limits into strategy design, forecasting, and regulation. |
| Narrative Essence | “Markets are rational, investors are disciplined, and prices tell the truth.” | “Markets are stories, investors are human, and psychology shapes the truth.” |
The rational agent model provides the benchmark of how markets should work under ideal conditions, while behavioural finance explains why they often do not. Modern finance uses both: EMH as a theoretical foundation, and behavioural insights to interpret deviations from it.
In what follows, we implement two simple tests frequently taught in introductory econometrics/finance courses (Wooldridge, 2020):
- EMH mean test (OLS with lagged returns): If past returns have no predictive power, their coefficients should be statistically insignificant.
- EMH variance test (Breusch–Pagan): If the variance of returns is unrelated to past information/explanatory variables, the model is homoscedastic. Heteroscedasticity can be interpreted as structure in second moments, which, under certain trading schemes, may be used for prediction or risk timing (Breusch & Pagan, 1979).
We apply these tests asset-by-asset to filter a large ticker set down to a shortlist for the following topics in this course (anomaly/trend filter).
2.1 Refresher: logs and log returns
Log returns are widely used because they (i) are time-additive, (ii) approximate arithmetic returns for small changes, and (iii) connect naturally to continuous compounding.
- Price at date t: P_t > 0.
- Log price: \ln P_t.
- Log return:
r_t^{\log} \equiv \ln P_t - \ln P_{t-1} = \ln\!\left(\frac{P_t}{P_{t-1}}\right).
- Relationship to arithmetic return r_t^{\text{arith}} = \frac{P_t}{P_{t-1}} - 1:
r_t^{\log} \approx r_t^{\text{arith}} for small returns; exactly,
1 + r_t^{\text{arith}} = e^{r_t^{\log}}.
If P_{t-1}=100 and P_t=101, then r_t^{\log} = \ln(101/100) \approx 0.00995 \approx 0.995\%, while r_t^{\text{arith}} = 0.01 = 1\%.
In the code below we will compute log returns by differencing the log price series.
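The properties listed in this refresher can be checked numerically. The following is a minimal sketch on a hypothetical four-day price path (the prices are made up for illustration and are not taken from the workshop dataset); it verifies the exact link between the two return definitions and the time-additivity of log returns.

```python
import numpy as np

# Hypothetical price path, used only for illustration
prices = np.array([100.0, 101.0, 99.5, 102.0])

log_ret = np.diff(np.log(prices))            # r_t^log = ln P_t - ln P_{t-1}
arith_ret = prices[1:] / prices[:-1] - 1.0   # r_t^arith = P_t / P_{t-1} - 1

# Exact link between the two definitions: 1 + r^arith = exp(r^log)
assert np.allclose(np.exp(log_ret), 1.0 + arith_ret)

# Time additivity: log returns sum to the log of the total gross return
total_log_ret = log_ret.sum()
assert np.isclose(total_log_ret, np.log(prices[-1] / prices[0]))

print(f"first log return:   {log_ret[0]:.5f}")    # 0.00995
print(f"first arith return: {arith_ret[0]:.5f}")  # 0.01000
```

The first pair of values reproduces the P_{t-1}=100, P_t=101 example above.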
3 Setup and data
We use daily close prices for a broad set of large-cap tickers (firms whose total market value of outstanding shares typically exceeds USD 10 billion). As usual, let's first load all the libraries we will need for the workshop. It is always a good idea to keep them grouped so they are easier to track.
Now, we will load the dataset from an online repository. Column names are in the form TICKER.Close; the index is the trading date. The sample spans roughly 2020-01 to 2022-05.
data_url = "https://raw.githubusercontent.com/abernal30/AFP_py/refs/heads/main/data/1Rational_agent.csv"
data = pd.read_csv(data_url, index_col=0) # index_col= 0 indicates that date is in the first column
# Show data
data
| date | AAPL.Close | MSFT.Close | GOOG.Close | GOOGL.Close | AMZN.Close | TSLA.Close | BRK.A.Close | BRK.B.Close | FB.Close | TSM.Close | ... | TMUS.Close | PM.Close | AMD.Close | LIN.Close | TXN.Close | CRM.Close | BMY.Close | UPS.Close | RLLCF.Close | QCOM.Close |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 01/02/2020 | 75.087502 | 160.619995 | 1367.369995 | 1368.680054 | 1898.010010 | 86.052002 | 342261 | 228.389999 | 209.779999 | 60.040001 | ... | 78.589996 | 85.190002 | 49.099998 | 210.740005 | 129.570007 | 166.990005 | 63.340000 | 116.790001 | 0.0046 | 88.690002 |
| 01/03/2020 | 74.357498 | 158.619995 | 1360.660034 | 1361.520020 | 1874.969971 | 88.601997 | 339155 | 226.179993 | 208.669998 | 58.060001 | ... | 78.169998 | 85.029999 | 48.599998 | 205.259995 | 127.849998 | 166.169998 | 62.779999 | 116.720001 | 0.0100 | 87.019997 |
| 01/06/2020 | 74.949997 | 159.029999 | 1394.209961 | 1397.810059 | 1902.880005 | 90.307999 | 340210 | 226.990005 | 212.600006 | 57.389999 | ... | 78.620003 | 86.019997 | 48.389999 | 204.389999 | 126.959999 | 173.449997 | 62.980000 | 116.199997 | 0.0217 | 86.510002 |
| 01/07/2020 | 74.597504 | 157.580002 | 1393.339966 | 1395.109985 | 1906.859985 | 93.811996 | 338901 | 225.919998 | 213.059998 | 58.320000 | ... | 78.919998 | 86.400002 | 48.250000 | 204.830002 | 129.410004 | 176.000000 | 63.930000 | 116.000000 | 0.0126 | 88.970001 |
| 01/08/2020 | 75.797501 | 160.089996 | 1404.319946 | 1405.040039 | 1891.969971 | 98.428001 | 339188 | 225.990005 | 215.220001 | 58.750000 | ... | 79.419998 | 88.040001 | 47.830002 | 207.389999 | 129.759995 | 177.330002 | 63.860001 | 116.660004 | 0.0099 | 88.709999 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 05/20/2022 | 137.589996 | 252.559998 | 2186.260010 | 2178.159912 | 2151.820068 | 663.900024 | 456500 | 304.049988 | 193.539993 | 90.779999 | ... | 126.040001 | 101.150002 | 93.500000 | 315.179993 | 169.809998 | 159.649994 | 76.190002 | 171.039993 | 0.0072 | 131.600006 |
| 05/23/2022 | 143.110001 | 260.649994 | 2233.330078 | 2229.760010 | 2151.139893 | 674.900024 | 464510 | 310.200012 | 196.229996 | 91.500000 | ... | 129.889999 | 102.910004 | 95.070000 | 320.420013 | 169.929993 | 160.320007 | 76.699997 | 174.389999 | 0.0085 | 132.119995 |
| 05/24/2022 | 140.360001 | 259.619995 | 2118.520020 | 2119.399902 | 2082.000000 | 628.159973 | 463606 | 309.170013 | 181.279999 | 88.720001 | ... | 129.220001 | 106.620003 | 91.160004 | 320.489990 | 167.860001 | 156.929993 | 77.129997 | 174.110001 | 0.0075 | 128.529999 |
| 05/25/2022 | 140.520004 | 262.519989 | 2116.790039 | 2116.100098 | 2135.500000 | 658.799988 | 462890 | 308.640015 | 183.830002 | 90.410004 | ... | 131.440002 | 108.570000 | 92.650002 | 315.850006 | 170.009995 | 159.649994 | 77.239998 | 173.860001 | 0.0080 | 131.229996 |
| 05/26/2022 | 143.779999 | 265.899994 | 2165.919922 | 2155.850098 | 2221.550049 | 707.729980 | 468805 | 312.500000 | 191.630005 | 91.000000 | ... | 132.740005 | 108.070000 | 98.750000 | 320.329987 | 174.130005 | 162.460007 | 77.589996 | 178.380005 | 0.0085 | 134.839996 |
606 rows × 100 columns
4 EMH mean test on historical returns (one asset)
We test whether lagged log returns have predictive power for today’s log return using a simple OLS:
r_t = \alpha + \beta_1 r_{t-1} + \beta_2 r_{t-2} + \varepsilon_t,
where r_t denotes log return. Under EMH (mean efficiency), \beta_1 = \beta_2 = 0.
# Choose a single asset for illustration
stock = "AAPL.Close"
# Compute log returns
stock_price = data[stock].dropna()
r = np.log(stock_price).diff() # First differences of natural logarithm of price.
# Build the dataset for the regression with *lags as past information*
df = pd.DataFrame({
    "r": r,
    "lag_1": r.shift(1),  # First lag
    "lag_2": r.shift(2),  # Second lag
})
df.head()
| date | r | lag_1 | lag_2 |
|---|---|---|---|
| 01/02/2020 | NaN | NaN | NaN |
| 01/03/2020 | -0.009770 | NaN | NaN |
| 01/06/2020 | 0.007937 | -0.009770 | NaN |
| 01/07/2020 | -0.004714 | 0.007937 | -0.009770 |
| 01/08/2020 | 0.015958 | -0.004714 | 0.007937 |
Now that we have the dataset with the returns for the first stock and the first two lags of that return, we can estimate an OLS regression:
df = df.dropna() # Remember to drop all missing values
# OLS regression for EMH test
X = sm.add_constant(df[["lag_1", "lag_2"]])
y = df["r"]
ols = sm.OLS(y, X).fit()
print(ols.summary())
OLS Regression Results
==============================================================================
Dep. Variable: r R-squared: 0.034
Model: OLS Adj. R-squared: 0.031
Method: Least Squares F-statistic: 10.65
Date: Tue, 11 Nov 2025 Prob (F-statistic): 2.85e-05
Time: 15:32:01 Log-Likelihood: 1418.9
No. Observations: 603 AIC: -2832.
Df Residuals: 600 BIC: -2819.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0012 0.001 1.315 0.189 -0.001 0.003
lag_1 -0.1779 0.041 -4.356 0.000 -0.258 -0.098
lag_2 0.0286 0.041 0.700 0.484 -0.052 0.109
==============================================================================
Omnibus: 63.982 Durbin-Watson: 1.997
Prob(Omnibus): 0.000 Jarque-Bera (JB): 385.101
Skew: -0.189 Prob(JB): 2.38e-84
Kurtosis: 6.897 Cond. No. 47.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
If the p-values on lag_1 and lag_2 are large (e.g., > 0.10), we fail to reject the EMH in mean for this asset. If one or both are statistically significant, we find predictability inconsistent with EMH’s strict form.
For the stock selected in this example, the individual t-tests give a mixed picture. The first lag (r_{t-1}) is statistically significant at the 1% level, while the second lag (r_{t-2}) is not significant at any conventional level.
In practical terms, this means that yesterday's return appears to help predict today's return, but the return from two days ago does not. Because the null hypothesis \beta_1 = \beta_2 = 0 is a joint restriction, a single non-zero coefficient is enough to violate it; the F-statistic reported in the summary (10.65, p-value 2.85e-05) rejects the joint null, which is evidence against weak-form efficiency in the mean for this asset.
Conversely, if both lags were insignificant and the F-test did not reject, we would fail to reject the EMH, concluding that the stock's past returns provide no useful information for forecasting future returns, consistent with an efficient market.
The evidence is nevertheless modest in economic terms: the R-squared is only 0.034 and the predictability comes from a single lag.
For this reason, it is advisable to complement the mean test with additional diagnostics, such as variance-based tests (e.g., the Breusch–Pagan test), to evaluate whether patterns also exist in the volatility of returns.
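The t-tests examine one coefficient at a time, whereas \beta_1 = \beta_2 = 0 is a joint restriction; the F-statistic in the regression summary (10.65 for this asset) tests it directly. The sketch below shows how such a joint F-statistic can be computed from restricted and unrestricted sums of squared residuals. It uses simulated returns with an assumed negative autocorrelation (the parameters `phi` and `n` and the helper `ssr` are illustrative choices, not part of the workshop code).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated returns with negative first-order autocorrelation (illustrative only)
n, phi = 1000, -0.3
shocks = rng.normal(scale=0.02, size=n)
r = np.zeros(n)
for t in range(1, n):
    r[t] = phi * r[t - 1] + shocks[t]

# Align r_t with its first two lags
y = r[2:]
X = np.column_stack([np.ones(n - 2), r[1:-1], r[:-2]])  # const, lag_1, lag_2

def ssr(y, X):
    """Sum of squared residuals from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

ssr_unrestricted = ssr(y, X)         # constant + both lags
ssr_restricted = ssr(y, X[:, :1])    # constant only, i.e. beta_1 = beta_2 = 0

q = 2                                # number of restrictions tested
dof = len(y) - X.shape[1]            # residual degrees of freedom, n - k - 1
f_stat = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / dof)
print(f"Joint F-statistic: {f_stat:.2f}")
```

With two restrictions and several hundred observations, the 5% critical value of the F distribution is roughly 3.0, so values well above that reject the joint null.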
5 EMH variance test (one asset)
We use the Breusch–Pagan (BP) test (Breusch & Pagan, 1979) to assess whether the error variance depends on the regressors (here, lagged returns). Under homoskedasticity (EMH-consistent variance), the BP test’s null holds.
5.0.1 Breusch–Pagan (BP) Test
The Breusch–Pagan test (Breusch & Pagan, 1979) examines whether the variance of the residuals from a regression model depends on the explanatory variables.
It starts from the standard regression model:
y_t = X_t \beta + \varepsilon_t
where the residuals \varepsilon_t are assumed to have constant variance under the null hypothesis of homoscedasticity.
The BP test models the squared residuals as a function of the regressors:
\hat{\varepsilon}_t^2 = \delta_0 + \delta_1 X_{1t} + \delta_2 X_{2t} + \dots + \delta_k X_{kt} + u_t
The test statistic is computed as:
LM = n R^2
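where n is the sample size and R^2 is the coefficient of determination from the auxiliary regression above. Under the null hypothesis, LM is asymptotically \chi^2 distributed with k degrees of freedom, where k is the number of explanatory variables.
An equivalent version uses the F-statistic from the same auxiliary regression of the squared residuals on the explanatory variables. The test statistic is: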
F = \frac{(R^2 / k)}{[(1 - R^2) / (n - k - 1)]}
where R^2 is from the auxiliary regression, k is the number of explanatory variables, and n is the sample size. In practice, we are testing the following null hypothesis:
\delta_1 = \delta_2 = \dots = \delta_k = 0
- Null hypothesis (H_0): the variance of the residuals is constant (homoscedasticity).
- Alternative hypothesis (H_1): the variance of the residuals depends on the regressors (heteroscedasticity).
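The two statistics defined above can be computed by hand from the auxiliary regression. The following is a minimal sketch using only NumPy on simulated data; the helper names `bp_statistics` and `ols_resid`, and the data-generating process, are assumptions made for illustration and are not part of the workshop code.

```python
import numpy as np

def ols_resid(y, X):
    """Residuals from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def bp_statistics(resid, X):
    """Return (LM, F) from the auxiliary regression of resid^2 on X.

    X is the design matrix of the main regression; its first column
    is assumed to be the constant.
    """
    n, p = X.shape
    k = p - 1                                   # regressors excluding the constant
    e2 = resid ** 2
    fitted = e2 - ols_resid(e2, X)              # fitted values of the auxiliary regression
    r2 = 1.0 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    lm = n * r2                                 # LM = n * R^2
    f = (r2 / k) / ((1.0 - r2) / (n - k - 1))   # F form of the same test
    return lm, f

rng = np.random.default_rng(42)
n = 2000
x1 = rng.uniform(0.0, 2.0, size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
z = rng.normal(size=n)

y_hom = 1.0 + 0.5 * x1 - 0.3 * x2 + z           # constant error variance
y_het = 1.0 + 0.5 * x1 - 0.3 * x2 + x1 * z      # error std grows with x1

lm_hom, f_hom = bp_statistics(ols_resid(y_hom, X), X)
lm_het, f_het = bp_statistics(ols_resid(y_het, X), X)
print(f"homoscedastic errors:   LM = {lm_hom:.2f}, F = {f_hom:.2f}")
print(f"heteroscedastic errors: LM = {lm_het:.2f}, F = {f_het:.2f}")
```

The same formulas link the two numbers reported for the one-asset example later in this section: with LM = 14.521472, n = 603 and k = 2, the implied auxiliary R^2 is about 0.0241 and the F formula gives about 7.40, matching the reported F-statistic.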
# BP test using OLS residuals and the OLS design matrix
bp_lm, bp_lm_p, bp_f, bp_f_p = het_breuschpagan(ols.resid, ols.model.exog)
bp_results = pd.DataFrame({
    "statistic": ["LM", "F"],
    "value": [bp_lm, bp_f],
    "p_value": [bp_lm_p, bp_f_p],
})
bp_results
| | statistic | value | p_value |
|---|---|---|---|
| 0 | LM | 14.521472 | 0.000703 |
| 1 | F | 7.402890 | 0.000667 |
The Breusch-Pagan test generates two statistics with their p-values: the Lagrange-Multiplier (LM) and the F-test. Usually, both lead to the same conclusion. In this workshop, we will focus on the F-test. In the context of the Efficient Market Hypothesis (EMH), the decision rule is:
- If the F-test p-value is large (e.g., p \ge 0.1), we fail to reject H_0, suggesting that the stock’s return variance is constant and consistent with market efficiency.
- If the F-test p-value is small (e.g., p < 0.1), we reject H_0, implying that the variance changes with past information — evidence of heteroscedasticity and potential market inefficiency.
In our case, the F-test p-value (0.000667) is below 0.01, so we reject the null hypothesis of homoscedasticity at the 1% level. This means the model is heteroscedastic: the variance of the regression errors depends on the explanatory variables, in this case the lagged returns.
Interpreted through the lens of the EMH, this result suggests that the variance of today's return depends on past returns, indicating that the market for this stock is not fully efficient. In other words, historical information contains patterns that could be used to forecast changes in volatility or risk, which in turn may affect expected returns. From a portfolio construction perspective, this makes the stock eligible for inclusion in the next stage of analysis, since it exhibits predictable structure inconsistent with EMH.
In the following section, we extend this same test to all available stocks, identifying those that display similar inefficiencies. These filtered assets will then form the candidate set for the next step—examining market anomalies and refining our portfolio composition.
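One practical consequence of rejecting homoscedasticity is that the "nonrobust" standard errors shown in the OLS summary of Section 4 rest on an assumption the data contradict. A common remedy is to use heteroscedasticity-robust (White) standard errors. The sketch below computes the HC0 version by hand on simulated data; the helper `ols_with_white_se` and the data-generating process are hypothetical and serve only to illustrate the formula.

```python
import numpy as np

def ols_with_white_se(y, X):
    """OLS coefficients with classical and White (HC0) standard errors."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Classical covariance: s^2 (X'X)^-1, valid only under homoscedasticity
    s2 = resid @ resid / (n - p)
    se_classical = np.sqrt(np.diag(s2 * xtx_inv))
    # White HC0 covariance: (X'X)^-1 X' diag(e^2) X (X'X)^-1
    meat = (X * resid[:, None] ** 2).T @ X
    se_white = np.sqrt(np.diag(xtx_inv @ meat @ xtx_inv))
    return beta, se_classical, se_white

rng = np.random.default_rng(7)
n = 5000
x = rng.uniform(0.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + (x ** 2) * rng.normal(size=n)   # error std grows with x

beta, se_classical, se_white = ols_with_white_se(y, X)
print("slope estimate:     ", round(float(beta[1]), 4))
print("classical std error:", round(float(se_classical[1]), 4))
print("White std error:    ", round(float(se_white[1]), 4))
```

In statsmodels, comparable robust standard errors are available through the `cov_type` argument of `fit`, for example `sm.OLS(y, X).fit(cov_type="HC0")`.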
6 EMH variance test across many assets (filter)
We now apply the same pipeline to all tickers, collecting BP F-statistics and p-values. The filter keeps assets with p < 0.10.
tickers = list(data.columns) # Tickers names
records = [] # Create the object to store the results
for col in tickers: # Loop over every ticker
    series = data[col].dropna() # Price series without missing values
    r = np.log(series).diff() # Log returns
    # Data frame with returns and their first two lags, without missing values
    df = pd.DataFrame({
        "r": r,
        "lag_1": r.shift(1),
        "lag_2": r.shift(2)
    }).dropna()
    # Matrix of independent variables with a constant for the regression
    X = add_constant(df[["lag_1", "lag_2"]])
    # Dependent variable
    y = df["r"]
    # Regression to run the Breusch-Pagan test
    fit = sm.OLS(y, X).fit()
    lm, lmp, fstat, fp = sms.het_breuschpagan(fit.resid, fit.model.exog)
    records.append((col, fstat, fp))
# Create data frame to filter
bp_df = pd.DataFrame(records, columns=["ticker", "F_stat", "p_value"]).set_index("ticker")
# Show the top of the table
bp_df.head(10)
| ticker | F_stat | p_value |
|---|---|---|
| AAPL.Close | 7.402890 | 6.667956e-04 |
| MSFT.Close | 5.709427 | 3.497252e-03 |
| GOOG.Close | 5.608142 | 3.862781e-03 |
| GOOGL.Close | 5.482668 | 4.369222e-03 |
| AMZN.Close | 0.486758 | 6.148581e-01 |
| TSLA.Close | 3.699726 | 2.529626e-02 |
| BRK.A.Close | 13.298510 | 2.232942e-06 |
| BRK.B.Close | 17.135412 | 5.793682e-08 |
| FB.Close | 0.152859 | 8.582842e-01 |
| TSM.Close | 6.901224 | 1.088404e-03 |
Filter: keep assets with p < 0.10.
alpha = 0.10
selected = bp_df[bp_df["p_value"] < alpha]
# Organize by p-value
selected = selected.sort_values("p_value")
selected.head() # Observations with lowest p-value
| ticker | F_stat | p_value |
|---|---|---|
| WFC.PQ.Close | 85.605410 | 1.965355e-33 |
| BML.PH.Close | 43.868318 | 1.654730e-18 |
| JNJ.Close | 37.951483 | 3.020436e-16 |
| PG.Close | 36.181957 | 1.459002e-15 |
| BAC.PE.Close | 24.481575 | 6.016778e-11 |
selected.tail() # Observations with highest p-value
| ticker | F_stat | p_value |
|---|---|---|
| LIN.Close | 3.014815 | 0.049799 |
| CICHY.Close | 2.897345 | 0.055942 |
| AZNCF.Close | 2.807902 | 0.061124 |
| RYDAF.Close | 2.754262 | 0.064461 |
| BMY.Close | 2.502647 | 0.082722 |
Check how many stocks remain after the filter.
selected.shape # Check the size of the new dataset
(79, 2)
The tables above show the shortlist for the next topic (trend/anomaly filter). Note that this is an introductory filter; it does not correct for multiple testing and should be interpreted as a pedagogical first pass.
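To give a sense of what a multiple-testing correction would change, the sketch below applies two standard adjustments, Bonferroni and Benjamini-Hochberg, to the ten p-values shown in the `bp_df.head(10)` table. This is only an illustration on a subset: a correction for the actual filter would use the p-values of all tickers in `bp_df`, and the choice of these two procedures is an assumption of this example, not part of the workshop's method.

```python
import numpy as np

# p-values copied from the bp_df.head(10) table (ten tickers only)
tickers = ["AAPL", "MSFT", "GOOG", "GOOGL", "AMZN",
           "TSLA", "BRK.A", "BRK.B", "FB", "TSM"]
p_values = np.array([
    6.667956e-04, 3.497252e-03, 3.862781e-03, 4.369222e-03, 6.148581e-01,
    2.529626e-02, 2.232942e-06, 5.793682e-08, 8.582842e-01, 1.088404e-03,
])
alpha = 0.10
m = len(p_values)

# Unadjusted rule used in the workshop filter
unadjusted = p_values < alpha

# Bonferroni: compare each p-value with alpha / m
bonferroni = p_values < alpha / m

# Benjamini-Hochberg: find the largest rank i with p_(i) <= (i / m) * alpha
order = np.argsort(p_values)
ranked = p_values[order]
below = ranked <= (np.arange(1, m + 1) / m) * alpha
bh = np.zeros(m, dtype=bool)
if below.any():
    cutoff = np.max(np.nonzero(below)[0])
    bh[order[: cutoff + 1]] = True

print("unadjusted:", int(unadjusted.sum()))          # 8
print("Bonferroni:", int(bonferroni.sum()))          # 7
print("Benjamini-Hochberg:", int(bh.sum()))          # 8
print("dropped by Bonferroni:",
      [t for t, u, b in zip(tickers, unadjusted, bonferroni) if u and not b])
```

On this subset, only TSLA passes the unadjusted rule but fails the Bonferroni threshold of 0.01.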
7 References
- Aldridge, I. (2010). High-Frequency Trading: A Practical Guide to Algorithmic Strategies and Trading Systems. Wiley.
- Breusch, T. S., & Pagan, A. R. (1979). A simple test for heteroscedasticity and random coefficient variation. Econometrica, 47(5), 1287–1294.
- Burton, E., & Shah, S. (2013). Behavioral Finance: Understanding the Social, Cognitive, and Economic Debates. Wiley.
- Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. The Journal of Finance, 25(2), 383–417.
- Wooldridge, J. M. (2020). Introductory Econometrics: A Modern Approach (7th ed.). Cengage.