Workshop 2, Investment Funds

Author

Alberto Dorantes, Ph.D.

Published

April 10, 2026

Abstract

In this workshop we learn the foundations of the Single-Index model, the Capital Asset Pricing Model (CAPM) and multi-factor models. In addition, we further explore how these models can help us define a market-based variance-covariance matrix for portfolio optimization.

1 Directions

Continue with the same Google Colab Notebook you used for Workshop 1.

You have to replicate this Workshop in Python. You can use Gemini in Colab for code generation.

2 The single-index model

The single-index model is a factor model that describes how an asset return is linearly related to an index return over time. This index is usually the market index, so the model shows how closely an asset return follows the return of the market.

The simple linear regression model is used to understand the linear relationship between two variables assuming that one variable, the independent variable (IV), can be used as a predictor of the other variable, the dependent variable (DV). In this part we illustrate a single index model as a simple regression model.

The single-index model states that the expected return of a stock is given by its alpha coefficient (b0) plus its market beta coefficient (b1) multiplied by the market return. In mathematical terms:

E[R_i] = α + β(R_M)

We can express the same equation using β_0 as alpha and β_1 as market beta:

E[R_i] = β_0 + β_1(R_M)

We can estimate the alpha and market beta coefficient by running a simple linear regression model specifying that the market return is the independent variable and the stock return is the dependent variable. It is strongly recommended to use continuously compounded returns instead of simple returns to estimate the market regression model. The market regression model can be expressed as:

r_{(i,t)} = b_0 + b_1*r_{(M,t)} + ε_t

Where:

ε_t is the error at time t. Thanks to the Central Limit Theorem, this error behaves like a normally distributed random variable ∼ N(0, σ_ε); the error term ε_t is expected to have mean=0 and a specific standard deviation σ_ε (also called volatility).

r_{(i,t)} is the return of the stock i at time t.

r_{(M,t)} is the market return at time t.

b_0 and b_1 are called regression coefficients.
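
As a minimal sketch of the market regression model, the following example simulates returns that follow the single-index equation and then recovers the coefficients by ordinary least squares. The parameter values (true_b0, true_b1, the means and the volatilities) are assumptions made up for illustration; they are not estimates from real data.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration
true_b0, true_b1 = 0.002, 1.5
rng = np.random.default_rng(42)

# Simulated monthly market returns and error term (not real data)
r_m = rng.normal(loc=0.01, scale=0.05, size=5000)
eps = rng.normal(loc=0.0, scale=0.08, size=5000)

# Stock returns generated by the single-index model
r_i = true_b0 + true_b1 * r_m + eps

# OLS estimates: np.polyfit returns the slope first, then the intercept
b1_hat, b0_hat = np.polyfit(r_m, r_i, deg=1)
print(f"b0 estimate: {b0_hat:.4f}, b1 estimate: {b1_hat:.4f}")
```

With many observations the estimates should be close to the values used to generate the data; with only 60 monthly observations, as in a real 5-year sample, they will be much noisier.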

2.1 Running a single-index model with real data

2.1.1 Data collection

We download monthly price data for Tesla and the S&P500 market index:

import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib
import matplotlib.pyplot as plt


# We download monthly prices for Tesla and the S&P500 index:
adjprices=yf.download(tickers="TSLA ^GSPC", start="2021-01-01", end = "2026-03-31", interval="1mo", auto_adjust=True)['Close']



2.1.2 Return calculation

We calculate continuously compounded returns for both Tesla and the S&P500:

# I create a new data frame to calculate the log returns
r = np.log(adjprices).diff(1).dropna()
# The diff function calculates the difference between the log price of t and the log price of the previous period, t-1

# The dropna() function drops rows with NA values (the first month has NA since returns cannot be calculated for the month 1)
# Renaming the columns to avoid special characters like ^GSPC:
r.columns = ['TSLA','GSPC']
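
To make the return definition concrete, here is a small sketch with made-up prices (the three values in prices are hypothetical). It shows how a continuously compounded return relates to a simple return, and why log returns are convenient: they add up over time.

```python
import math

prices = [100.0, 110.0, 99.0]  # hypothetical month-end prices

# Continuously compounded return: r_t = ln(P_t) - ln(P_{t-1})
log_returns = [math.log(p1) - math.log(p0) for p0, p1 in zip(prices, prices[1:])]

# Simple return: R_t = P_t / P_{t-1} - 1
simple_returns = [p1 / p0 - 1 for p0, p1 in zip(prices, prices[1:])]

# Log returns add up over time: their sum is the holding-period log return,
# which converts back to a simple return with exp(r) - 1
total_log = sum(log_returns)
holding_period_return = math.exp(total_log) - 1
print(log_returns, simple_returns, holding_period_return)
```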

2.1.3 Visualize the relationship

To start understanding the relationship between the stock return and the market return, we do a scatter plot with the S&P500 returns as the independent variable (X) and the stock return as the dependent variable (Y).

We also add a line that better represents the relationship between the stock returns and the market returns:

import seaborn as sb
#plt.clf()
x = r['GSPC']
y = r['TSLA']
# I plot the (x,y) values along with the regression line that fits the data:
sb.regplot(x=x,y=y)
plt.xlabel('Market returns')
plt.ylabel('Tesla returns') 
plt.show()

Sometimes graphs can be deceiving. In this case, the ranges of the X axis and the Y axis are different, so it is better to do a graph where both axes cover a similar range.

We can change the X scale so that both the Y and X axes have similar ranges:

plt.clf()
sb.regplot(x=x,y=y)
# I adjust the scale of the X axis so that the magnitude of each unit of X is equal to that of the Y axis 
plt.xticks(np.arange(-0.8,0.8,0.20))


# I label the axis:
plt.xlabel('Market returns')

plt.ylabel('Tesla returns') 
plt.show()

Now we see that the market and stock returns have a similar scale. With this we can better appreciate their linear relationship.

WHAT DOES THE PLOT TELL YOU? BRIEFLY EXPLAIN

2.2 Running the single-index regression model

We can run the market regression model with the ols() function of the statsmodels formula API. In the formula, the DEPENDENT VARIABLE (in this case, the stock return) goes to the left of the ~ symbol, and the INDEPENDENT VARIABLE, also named the EXPLANATORY VARIABLE (in this case, the market return), goes to the right.

What you will get is called the single-index regression model. You are trying to examine how the market returns can explain stock returns in the historical periods we downloaded.

import statsmodels.formula.api as smf
# I estimate the OLS regression model:
mkmodel = smf.ols('TSLA ~ GSPC',data=r).fit()
# The Dependent variable Y is the first one in the formula, and the second is the IV: Y ~ X

# I display the summary of the regression: 
print(mkmodel.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   TSLA   R-squared:                       0.215
Model:                            OLS   Adj. R-squared:                  0.202
Method:                 Least Squares   F-statistic:                     16.41
Date:                Sun, 26 Apr 2026   Prob (F-statistic):           0.000149
Time:                        09:32:37   Log-Likelihood:                 31.314
No. Observations:                  62   AIC:                            -58.63
Df Residuals:                      60   BIC:                            -54.37
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0106      0.019     -0.549      0.585      -0.049       0.028
GSPC           1.7657      0.436      4.051      0.000       0.894       2.638
==============================================================================
Omnibus:                        1.177   Durbin-Watson:                   1.828
Prob(Omnibus):                  0.555   Jarque-Bera (JB):                1.230
Skew:                          -0.269   Prob(JB):                        0.541
Kurtosis:                       2.569   Cond. No.                         23.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

3 CHALLENGE 1

YOU HAVE TO INTERPRET THIS MODEL. MAKE SURE YOU ANSWER THE FOLLOWING QUESTIONS:

- IS THE STOCK SIGNIFICANTLY OFFERING RETURNS OVER THE MARKET?

- IS THE STOCK SIGNIFICANTLY RISKIER THAN THE MARKET?

The beta0 coefficient is in the first row (the intercept), while the beta1 coefficient is in the second row.

The regression coefficients are in the first column of the coefficient table. The second column has the standard errors of the coefficients, which are the standard deviations of the coefficient estimates. The third column has the t-value of each coefficient, and the fourth column has their p-values.

We can get beta0, beta1, standard errors, t-values and p-values and store in other variables:

# Beta coefficients:
b0=mkmodel.params.iloc[0]
b1=mkmodel.params.iloc[1]
# Standard errors of betas:
seb0 = mkmodel.bse.iloc[0]
seb1 = mkmodel.bse.iloc[1]
# t-Statistics of betas:
tb0 = mkmodel.params.iloc[0] / mkmodel.bse.iloc[0]
tb1 = mkmodel.params.iloc[1] / mkmodel.bse.iloc[1]
# p-values of betas:
pvalueb0 = mkmodel.pvalues.iloc[0]
pvalueb1 = mkmodel.pvalues.iloc[1]
# 95% Confidence intervals for both betas:
# conf_int() has one row per coefficient; column 0 is the lower bound and column 1 is the upper bound
minb0 = mkmodel.conf_int().iloc[0, 0]
maxb0 = mkmodel.conf_int().iloc[0, 1]
minb1 = mkmodel.conf_int().iloc[1, 0]
maxb1 = mkmodel.conf_int().iloc[1, 1]
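
The t-values in the regression output test whether each coefficient is different from zero. Since the market itself has a beta of 1 by construction, one possible way to judge whether a stock is riskier than the market is to compare beta against 1 instead of 0. The sketch below shows that calculation with hypothetical numbers (b1_example and seb1_example are made up, not the Tesla results), and the helper t_stat_against is an illustrative function, not part of statsmodels.

```python
def t_stat_against(coef, std_err, null_value):
    """t-statistic for the null hypothesis: coefficient == null_value."""
    return (coef - null_value) / std_err

# Hypothetical estimates, for illustration only (not the Tesla results)
b1_example, seb1_example = 1.40, 0.25

t_beta_vs_one = t_stat_against(b1_example, seb1_example, null_value=1.0)

# Approximate 95% confidence interval using the rule of thumb of 2 standard errors
ci_low = b1_example - 2 * seb1_example
ci_high = b1_example + 2 * seb1_example
print(t_beta_vs_one, (ci_low, ci_high))
```

In this made-up case the interval contains 1, so the evidence would not be strong enough to say that beta is greater than 1, even though the point estimate is 1.40.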

4 The CAPM model

The Capital Asset Pricing Model states that the expected return of a stock is given by the risk-free rate plus its beta coefficient multiplied by the market premium return. In mathematical terms:

E[R_i] = R_f + β_1(R_M − R_f )

We can express the same equation as:

(E[R_i] − R_f ) = β_1(R_M − R_f )

Then, we are saying that the expected value of the premium return of a stock is equal to the premium market return multiplied by its market beta coefficient.

In this model there is NO b_0 coefficient, since the CAPM assumes that b_0 should be zero: there should not be a stock that systematically offers excess returns compared to the market returns. However, when we estimate the CAPM with historical data, it is strongly recommended to estimate the b_0 coefficient.

Then, the CAPM model used to estimate its beta coefficients is expressed as:

StockPremiumReturn_{i,t} = b_0 + b_1(MarketPremiumReturn_t) + ε_t

Where:

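StockPremiumReturn_{i,t}=(r_{(i,t)} − r_{(f,t)}), the return of stock i minus the risk-free rate at time t

MarketPremiumReturn_t=(r_{(M,t)} − r_{(f,t)}), the market return minus the risk-free rate at time t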

ε_t ∼ N(0, σ_ε); the error is a random shock with an expected mean=0 and a specific standard deviation or volatility. This error represents the result of all factors that influence stock returns, and cannot be explained by the model (by the market).

As in the Single-Index model, you can estimate the b_1 coefficient of the CAPM as:

b_1=\frac{cov(MarketPremiumReturn,StockPremiumReturn_i)}{var(MarketPremiumReturn)}

And the b_0 coefficient can be calculated as:

b_0 = \overline{StockPremiumReturns_i} - b_1*\overline{MarketPremiumReturns}

Remember that the bar over the variable means the arithmetic mean of the variable.
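
The two formulas can be checked numerically. The sketch below uses simulated premium returns (all numbers are hypothetical) to compute b_1 and b_0 from the covariance, variance and means, and then compares them with the coefficients of a least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated monthly premium returns (hypothetical, not real data)
market_premium = rng.normal(0.006, 0.045, size=240)
stock_premium = 0.001 + 1.3 * market_premium + rng.normal(0.0, 0.06, size=240)

# b1 = cov(market premium, stock premium) / var(market premium)
cov_ms = np.cov(market_premium, stock_premium, ddof=1)[0, 1]
var_m = np.var(market_premium, ddof=1)
b1 = cov_ms / var_m

# b0 = mean(stock premium) - b1 * mean(market premium)
b0 = stock_premium.mean() - b1 * market_premium.mean()

# Cross-check against an OLS fit (np.polyfit returns slope first, then intercept)
b1_ols, b0_ols = np.polyfit(market_premium, stock_premium, deg=1)
print(b0, b1, b0_ols, b1_ols)
```

Both approaches give the same numbers, because the covariance-over-variance formula is the closed-form solution of the simple linear regression.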

In the CAPM we use premium returns instead of returns for the calculation of the beta coefficients.

The interpretation of b_1 in the CAPM is the same as in the Single-Index Model, but we have to refer to premium returns instead of returns. Then, b_1 is a measure of market risk of the stock, while b_0 is a measure of excess return of the stock over the market.

If b_1=2.0, this means that the market risk of the stock is double that of the market. In other words, for each +1% of change in the market premium return, it is expected that the stock premium return will move by about +2%.

The CAPM was developed based on Portfolio Theory. More specifically, the CAPM was developed after the concept of the Capital Market Line (CML). The CML was derived in the 1960’s from Markowitz Portfolio Theory (published in 1952). The main contributors to the idea of the CML and the development of the CAPM are Jack Treynor, William F. Sharpe, John Lintner and Jan Mossin.

Remember that the CML is the set of efficient portfolios when the risk-free asset is added to the optimal portfolio of risky assets.

It is assumed that, if all participants of a financial market are risk-averse, and if they estimate the risk and correlations of asset returns in the same way, then the optimal portfolio of risky assets, the one that offers the highest premium return for each unit of risk, is the Market Portfolio. In other words, the Market Portfolio is expected to have the highest possible Sharpe ratio. Remember that the Sharpe ratio is a measure of how much expected premium return a portfolio offers for each unit of risk (1 unit of standard deviation). Then, the Sharpe ratio is estimated as:

SharpeRatio = \frac{(E[PortRet]-RiskFreeRate)}{SD(PortRet)}
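
A minimal sketch of this formula follows. The function name sharpe_ratio and the two sets of annual figures are hypothetical and only illustrate how portfolios can be compared.

```python
def sharpe_ratio(expected_return, risk_free_rate, volatility):
    """Premium return per unit of risk (standard deviation)."""
    return (expected_return - risk_free_rate) / volatility

# Hypothetical annual figures, for illustration only
sr_a = sharpe_ratio(expected_return=0.12, risk_free_rate=0.03, volatility=0.18)
sr_b = sharpe_ratio(expected_return=0.15, risk_free_rate=0.03, volatility=0.30)
print(sr_a, sr_b)
```

Portfolio B has the higher expected return, but portfolio A pays more premium return per unit of risk, so A has the higher Sharpe ratio.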

When we add the risk-free asset to the tangency portfolio, the set of efficient portfolios will NOT lie on the curved efficient frontier; the efficient portfolios will lie on the Capital Market Line.

In other words, the CML becomes the efficient frontier after adding the risk-free instrument to the tangent portfolio of risky assets.

The CML equation is:

E[PortRet]=RiskFreeRate+\left[\frac{(E[MarketReturn]-RiskFreeRate)}{SD(MarketReturn)}\right]PortfolioRisk

The ratio in the brackets is the Sharpe Ratio of the market portfolio, which is the slope of the Capital Market Line when plotting PortfolioRisk on the X axis and Expected Portfolio Return on the Y axis.

Imagine we want to do a portfolio with 5 stocks. Then, we can estimate the expected return of each stock, and then estimate the efficient frontier and the optimal portfolio using Portfolio Theory.

# We download monthly prices for 5 stocks:

adjprices=yf.download(tickers="TSLA AMZN AAPL JPM WMT", start="2020-01-01", end = "2026-03-31", interval="1mo", auto_adjust=True)['Close']



# I create a new data frame to calculate the log returns
returns = np.log(adjprices).diff(1).dropna()
# The diff function calculates the difference between the log price of t and the log price of the previous period, t-1

# The dropna() function drops rows with NA values (the first month has NA since returns cannot be calculated for the month 1)
returns.columns

Index(['AAPL', 'AMZN', 'JPM', 'TSLA', 'WMT'], dtype='object', name='Ticker')

# Be careful. Once you download the prices with yf.download, the columns are sorted alphabetically!

Before I continue with the optimization of the portfolio, I will plot the adjusted stock price over time for the 5 stocks to have a general idea about the growth of each stock price:

Since stock prices are not standardized, I will create an index. Using the stock adjusted prices, I create a dataset with growth indices starting in $1. I divide each monthly price by the first monthly price:

growth_indices = adjprices / adjprices.iloc[0]

I create a function to do a time plot for all variables of a time-series data frame:

import matplotlib.pyplot as plt

def plot_time_series(df, title, y_label, legend_title='Legend'):
    """
    Generates and displays a time series plot from a DataFrame.
    Parameters:
    df (pandas.DataFrame): The DataFrame containing the time series data.
    title (str): The title of the plot.
    y_label (str): The label for the Y-axis.
    legend_title (str): The title for the legend.
    """
    plt.figure(figsize=(12, 6))
    ax = plt.gca()
    df.plot(ax=ax)
    plt.title(title)
    plt.xlabel('Date') # Assuming the DataFrame index is a datetime index for time series.
    plt.ylabel(y_label)
    plt.grid(True)
    plt.legend(title=legend_title)
    plt.show()

I plot the growth indices:

plot_time_series(growth_indices, "$1 invested in each stock over time","Investment value", "Stock")

I can see the monthly returns over time for each stock:

plot_time_series(returns, "Stock returns over time","Monthly return", "Stock")

We see that TSLA has been very volatile. Let’s calculate the monthly volatility for all stocks:

Volatility of each stock:

returns.std()

Ticker
AAPL    0.077944
AMZN    0.090167
JPM     0.075898
TSLA    0.186476
WMT     0.055925
dtype: float64

TSLA has been the most volatile.

Calculating Expected Returns of the stocks:

import numpy as np
# I get the Expected returns of all stocks:
ER = np.exp(returns.mean()) - 1 
ER

Ticker
AAPL    0.016682
AMZN    0.009904
JPM     0.013175
TSLA    0.029458
WMT     0.017244
dtype: float64

Calculating the Variance-Covariance matrix:

COV= returns.cov()
COV

Ticker      AAPL      AMZN       JPM      TSLA       WMT
Ticker                                                  
AAPL    0.006075  0.004175  0.001796  0.009493  0.001599
AMZN    0.004175  0.008130  0.001681  0.010338  0.001328
JPM     0.001796  0.001681  0.005760  0.003774  0.000937
TSLA    0.009493  0.010338  0.003774  0.034773  0.003015
WMT     0.001599  0.001328  0.000937  0.003015  0.003128

I will use the PyPortfolioOpt Python library to find the global minimum variance (GMV) portfolio, the optimal (tangency) portfolio, and the Efficient Frontier.

Check the documentation here:

https://pyportfolioopt.readthedocs.io/en/latest/

In case you do not have the library installed, do the installation as follows:

!pip install PyPortfolioOpt

I import the modules I will need from this library, and estimate the Efficient Frontier and the optimal portfolio:

import matplotlib.pyplot as plt
import numpy as np
from pypfopt import CLA, EfficientFrontier, plotting

# --------------------------------------------------
# Annualized inputs
# --------------------------------------------------
ER_A = 12 * ER
COV_A = 12 * COV

annual_risk_free = 0.03  # 3% annual risk-free rate

# --------------------------------------------------
# 1) Min Volatility portfolio
# --------------------------------------------------
ef_minvol = EfficientFrontier(ER_A, COV_A, weight_bounds=(0, 1))
w_minvar_noshort = ef_minvol.min_volatility()
ret_vol, vol_vol, sharpe_vol = ef_minvol.portfolio_performance(risk_free_rate=annual_risk_free)

print("Min Volatility Portfolio Weights:")

Min Volatility Portfolio Weights:

print(w_minvar_noshort)

OrderedDict({'AAPL': 0.0901604279162843, 'AMZN': 0.0896698051874826, 'JPM': 0.2491129842649416, 'TSLA': 0.0, 'WMT': 0.5710567826312916})

print("Expected annual return of the min-vol portfolio:", ret_vol)

Expected annual return of the min-vol portfolio: 0.18625659591928123

print("Expected volatility of the min-vol portfolio:", vol_vol)

Expected volatility of the min-vol portfolio: 0.16550864805658458

print("Sharpe Ratio of the min-vol portfolio:", sharpe_vol)

Sharpe Ratio of the min-vol portfolio: 0.9440992827508311

# --------------------------------------------------
# 2) Max Sharpe (Tangency) portfolio
# --------------------------------------------------
ef_sharpe = EfficientFrontier(ER_A, COV_A, weight_bounds=(0, 1))
weights_sharpe = ef_sharpe.max_sharpe(risk_free_rate=annual_risk_free)
ret_sharpe, vol_sharpe, sharpe_sharpe = ef_sharpe.portfolio_performance(risk_free_rate=annual_risk_free)

print("\nMax Sharpe / Tangency Portfolio Weights:")


Max Sharpe / Tangency Portfolio Weights:

print(weights_sharpe)

OrderedDict({'AAPL': 0.1554917926983705, 'AMZN': 0.0, 'JPM': 0.1517352630582644, 'TSLA': 0.01685936440298, 'WMT': 0.6759135798403851})

print("Expected annual return of the tangency portfolio:", ret_sharpe)

Expected annual return of the tangency portfolio: 0.20093939925617885

print("Expected volatility of the tangency portfolio:", vol_sharpe)

Expected volatility of the tangency portfolio: 0.17213352063061194

print("Sharpe Ratio of the tangency portfolio:", sharpe_sharpe)

Sharpe Ratio of the tangency portfolio: 0.9930628190833579

# --------------------------------------------------
# 3) Efficient frontier with CLA
# --------------------------------------------------
cla = CLA(ER_A, COV_A, weight_bounds=(0, 1))
#cla.max_sharpe(risk_free_rate=annual_risk_free)  # precompute turning points

fig, ax = plt.subplots(figsize=(10, 6))
plotting.plot_efficient_frontier(cla, ax=ax, show_assets=True)

# --------------------------------------------------
# 4) Plot key portfolios
# --------------------------------------------------
# Min Vol portfolio
ax.scatter(
    vol_vol, ret_vol,
    marker="D", s=180,
    label="Minimum Volatility Portfolio"
)

# Tangency / Max Sharpe portfolio
ax.scatter(
    vol_sharpe, ret_sharpe,
    marker="*", s=260,
    label="Tangency Portfolio (Max Sharpe)"
)

# Risk-free asset
ax.scatter(
    0, annual_risk_free,
    marker="o", s=120,
    label=f"Risk-Free Asset ({annual_risk_free:.0%})"
)

# --------------------------------------------------
# 5) Capital Market Line
# --------------------------------------------------
# Slope of the CML = Sharpe ratio of tangency portfolio
cml_slope = (ret_sharpe - annual_risk_free) / vol_sharpe

# x-range for the CML
xmax = ax.get_xlim()[1]
cml_x = np.linspace(0, xmax, 200)
cml_y = annual_risk_free + cml_slope * cml_x

ax.plot(
    cml_x, cml_y,
    linestyle="--",
    linewidth=2,
    label="Capital Market Line"
)

# --------------------------------------------------
# 6) Optional annotations for clarity
# --------------------------------------------------
ax.annotate(
    "Risk-Free Asset",
    xy=(0, annual_risk_free),
    xytext=(10, 10),
    textcoords="offset points"
)

ax.annotate(
    "Tangency Portfolio",
    xy=(vol_sharpe, ret_sharpe),
    xytext=(10, -15),
    textcoords="offset points"
)

ax.annotate(
    "Min Vol Portfolio",
    xy=(vol_vol, ret_vol),
    xytext=(10, 10),
    textcoords="offset points"
)

# --------------------------------------------------
# 7) Final formatting
# --------------------------------------------------
ax.set_title("Efficient Frontier with Capital Market Line")
ax.set_xlabel("Annualized Volatility")
ax.set_ylabel("Annualized Expected Return")
ax.legend()
ax.grid(True)

plt.show()

The Capital Market Line (CML) is the set of possible efficient portfolios represented in the line that goes from the risk-free rate to the Tangency Portfolio.

Usually the CML is more efficient than the efficient frontier (the blue curved line) in the risk levels between the GMV and the Tangency Portfolio. In this example it is difficult to appreciate visually, but usually between the GMV and the Tangency portfolio, for a specific portfolio expected risk, the CML will have portfolios with higher expected returns compared to the curved efficient frontier.

How can we interpret the CML? The different portfolios that lie in the CML are constructed with 2 assets.

The risk-free asset
The tangency portfolio

If we invest 100% in the risk-free asset and 0% in the tangency portfolio, then we are in the initial point of the CML (at the left); if we invest 0% in the risk-free asset and 100% in the tangency portfolio, then we are in the tangency point of the CML. Then, if we invest 50% in the risk-free asset and 50% in the tangency portfolio, we will be in the middle point of the CML between the risk-free asset and the tangency portfolio.

The CML portfolios that lie to the right of the tangency portfolio are portfolios that short the risk-free asset and then invest more money in the tangency portfolio (long the tangency and short the risk-free asset). This cannot be done in the real world since we cannot get a loan with a rate equal to the risk-free rate!

Then, any point in the CML is a combination of weights between the tangency portfolio and the risk-free asset.
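
The following sketch computes a few points of the CML as combinations of the risk-free asset and the tangency portfolio. It uses the tangency portfolio figures printed earlier in this workshop, rounded to 4 decimals; the function name cml_point is hypothetical.

```python
rf = 0.03            # annual risk-free rate used in this workshop
ret_tan = 0.2009     # tangency portfolio expected return (rounded from the output above)
vol_tan = 0.1721     # tangency portfolio volatility (rounded from the output above)

def cml_point(w_rf):
    """Return (risk, expected return) when w_rf is invested in the risk-free asset."""
    w_tan = 1.0 - w_rf
    expected_return = w_rf * rf + w_tan * ret_tan
    risk = abs(w_tan) * vol_tan  # the risk-free asset adds no variance
    return risk, expected_return

for w_rf in (1.0, 0.5, 0.0, -0.5):  # -0.5 means borrowing at the risk-free rate
    risk, er = cml_point(w_rf)
    print(f"w_rf={w_rf:+.1f}  risk={risk:.4f}  expected return={er:.4f}")
```

Every point has the same premium return per unit of risk, which is the Sharpe ratio of the tangency portfolio.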

If we could create a portfolio considering ALL stocks in the US market and run the same optimization process we did, it is expected that the tangency portfolio will be the Market Portfolio, with weights equal to the proportion of market capitalization of each stock compared to the total capitalization of the market. This will be true if we make the following assumptions:

All investors are risk-averse and they are mean-variance optimizers. In other words, they are 100% rational (they prefer to maximize return for each risk level)
All investors calculate in the same way the expected return of each asset, the expected risk of each asset, and the correlations of all pairs of assets (the covariance matrix)

Then, with these assumptions, if we keep adding stocks to our portfolio of 5 assets until we add ALL of them, then it is expected that the Tangency (optimal) portfolio will be the Market Portfolio.

With this idea of adding stocks to a portfolio, the idea of the CAPM was developed.

Once we add many stocks and estimate the tangency portfolio we will be close to the Market Portfolio.

In the next section, I describe the details of how the CAPM was developed from Portfolio Theory.

4.1 Derivation of the CAPM

Here I explain a mathematical derivation of the CAPM (see Sharpe 1964 for more details).

I start developing the equation for the Capital Market Line. From Portfolio Theory, we know that the market portfolio is the tangency portfolio and is the optimal portfolio (with the assumptions I mentioned earlier).

The first idea to arrive at the CML is that we can construct a portfolio of 2 assets: the risk-free asset and the market portfolio. Then we can describe the return of this portfolio P as follows:

P=wR_{f}+\left(1-w\right)R_{m}

Following Portfolio Theory, we can estimate the variance and risk of this portfolio P as:

VAR[P]=w^{2}VAR(R_{f})+(1-w)^{2}VAR(R_{m})+2w(1-w)COV(R_{f},R_{m})

Since R_f is the return of the risk-free asset, by definition it has no variability (it is constant), so its variance and its covariance with the market return must be zero. Then, the variance of the portfolio P can be written as:

VAR(P)=(1-w)^2VAR(R_m)

Taking the square root of both sides to get the portfolio risk:

PortfolioRisk=SD(P)=(1-w)SD(R_m)

Then the risk of the portfolio will be reduced as the % allocated to the risk-free asset (w) increases; if w=1 then the risk becomes zero since 100% is invested in the risk-free asset. If w=0, then the risk becomes the risk of the market.

To simplify the notation, I will use the Greek letter sigma for standard deviation (\sigma) and variance (\sigma^{2}):

\sigma_{p}=(1-w)\sigma_{m}

Then the return and risk of the portfolio P depends on w. Then I can estimate how much the return and risk changes for a change in w, and then estimate how much the portfolio return changes when the portfolio risk changes. To do this, I can take the partial derivative of the portfolio return and risk equations with respect to w, and then divide both derivatives to get the rate of change in return for a change in risk:

\frac{\delta\sigma_{p}}{\delta w}=-\sigma_{m}

\frac{\delta P}{\delta w}=R_{f}-R_{m}

Now dividing both derivatives I get the marginal change of portfolio return for any change in portfolio risk:

\frac{\delta P}{\delta\sigma_{p}}=\frac{(R_{m}-R_{f})}{\sigma_{m}}

We arrive at the slope of the Capital Market Line, which is also called the Sharpe Ratio. The numerator is the market premium return and the denominator is the risk of the market. Then the Sharpe Ratio measures how much the portfolio expected return changes for each unit of change in the portfolio risk.

Since the CML states a linear relation between the portfolio return (Y axis) and portfolio risk (X axis), we can get the CML equation considering that the point where the line crosses the Y axis (the portfolio return) is the risk-free return:

CML=E[P]=R_f+\frac{(R_{m}-R_{f})}{\sigma_{m}}\sigma_p
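
As a quick numerical check of this derivation, the sketch below approximates both derivatives with finite differences and confirms that their ratio equals the Sharpe Ratio. The values of r_f, r_m and sigma_m are hypothetical.

```python
# Hypothetical inputs, for illustration only
r_f, r_m, sigma_m = 0.03, 0.10, 0.20

def port_return(w):
    """Return of a portfolio with weight w in the risk-free asset."""
    return w * r_f + (1 - w) * r_m

def port_risk(w):
    """Risk of the same portfolio: only the market part carries risk."""
    return (1 - w) * sigma_m

# Central finite-difference approximation of both derivatives at w = 0.4
w, h = 0.4, 1e-6
d_return = (port_return(w + h) - port_return(w - h)) / (2 * h)
d_risk = (port_risk(w + h) - port_risk(w - h)) / (2 * h)

slope = d_return / d_risk
sharpe = (r_m - r_f) / sigma_m
print(slope, sharpe)
```

The same result is obtained for any other value of w, since both the return and the risk are linear in w.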

Now, with a similar logic, we can think of a portfolio composed of 2 assets:

R_i: A risky asset i

R_m: The market portfolio

Then, we assign w% to the asset i and (1-w)% to the market portfolio. Then, the return of this new portfolio can be written as:

P=wR_{i}+\left(1-w\right)R_{m}

Following Portfolio Theory, we can estimate the variance and risk of this portfolio P as:

VAR[P]=w^{2}VAR(R_{i})+(1-w)^{2}VAR(R_{m})+2w(1-w)COV(R_{i},R_{m})

Using the notation of sigma squared for the variances, I can re-write this formula as:

\sigma^{2}_{p}=w^2\sigma^{2}_{i}+(1-w)^{2}\sigma^2_{m}+2w(1-w)\sigma_{im}

Where \sigma_{im} is the covariance between the stock return and the market return.

Then the portfolio risk is the square root of its variance:

\sigma_{p}=\left[w^2\sigma^{2}_{i}+(1-w)^{2}\sigma^2_{m}+2w(1-w)\sigma_{im}\right]^{\frac{1}{2}}

Following the same rationale as in the case of the portfolio of the risk-free asset and the market portfolio, I take the partial derivatives of return and risk with respect to the weight w, and then divide these 2 derivatives to get the marginal change of return for any change in portfolio risk:

\frac{\delta P}{\delta w}=R_i-R_m

Now I do the same for portfolio risk:

\frac{\delta\sigma_{p}}{\delta w}=\frac{1}{2}\left[\frac{2w\sigma_{i}^{2}+2(1-w)(-1)\sigma_{m}^{2}+\sigma_{im}(2)-4w\sigma_{im}}{\sigma_{p}}\right]

Simplifying:

\frac{\delta\sigma_{p}}{\delta w}=\frac{w\sigma_{i}^{2}+(w-1)\sigma_{m}^{2}+\sigma_{im}-2w\sigma_{im}}{\sigma_{p}}

When w=0 I am considering only the Market Portfolio, so the portfolio p becomes the market portfolio and \sigma_{p}=\sigma_{m}. Then, I evaluate this derivative when w=0 to see the rate of change of portfolio risk with respect to the weight:

\frac{\delta\sigma_{p}}{\delta w}\mid_{w=0}=\frac{\sigma_{im}-\sigma_{m}^{2}}{\sigma_{m}}
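
This closed form can be verified numerically. The sketch below uses hypothetical values for sigma_i, sigma_m and the correlation rho, and compares a finite-difference derivative of the portfolio risk at w=0 with the expression above.

```python
import math

# Hypothetical inputs, for illustration only
sigma_i, sigma_m, rho = 0.30, 0.20, 0.5
sigma_im = rho * sigma_i * sigma_m  # covariance between asset i and the market

def port_risk(w):
    """Risk of a portfolio with weight w in asset i and (1 - w) in the market."""
    var_p = (w**2 * sigma_i**2 + (1 - w)**2 * sigma_m**2
             + 2 * w * (1 - w) * sigma_im)
    return math.sqrt(var_p)

# Central finite difference around w = 0
h = 1e-6
numeric = (port_risk(h) - port_risk(-h)) / (2 * h)

# Closed form from the derivation: (sigma_im - sigma_m^2) / sigma_m
analytic = (sigma_im - sigma_m**2) / sigma_m
print(numeric, analytic)
```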

Now dividing both derivatives I get the instantaneous rate of change of return with respect to portfolio risk at the Market portfolio:

\frac{\delta P}{\delta\sigma_{p}}\mid_{w=0}=\frac{(R_i-R_m)\sigma_{m}}{\sigma_{im}-\sigma_{m}^{2}}

The rate of change of return with respect to portfolio risk must be equal to the Sharpe Ratio since this portfolio is the market portfolio (w=0). Then, I make this equation equal to the Sharpe Ratio and do some algebra to get the return of the stock in terms of the return of the market:

SharpeRatio=\frac{(R_{m}-R_{f})}{\sigma_{m}}

Then:

\frac{(R_{m}-R_{f})}{\sigma_{m}}=\frac{(R_i-R_m)\sigma_{m}}{\sigma_{im}-\sigma_{m}^{2}}

Moving terms to leave R_i alone we get:

R_{i}=R_{f}+\frac{\sigma_{im}}{\sigma_{m}^{2}}(R_{m}-R_{f})
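
For reference, the intermediate algebra can be written out step by step with the same symbols: first isolate (R_i-R_m), then expand the ratio, and finally move R_m to the right-hand side:

```latex
\begin{aligned}
\frac{(R_{m}-R_{f})}{\sigma_{m}} &= \frac{(R_i-R_m)\sigma_{m}}{\sigma_{im}-\sigma_{m}^{2}} \\
(R_i-R_m) &= \frac{(R_{m}-R_{f})(\sigma_{im}-\sigma_{m}^{2})}{\sigma_{m}^{2}}
           = \frac{\sigma_{im}}{\sigma_{m}^{2}}(R_{m}-R_{f})-(R_{m}-R_{f}) \\
R_{i} &= R_{m}+\frac{\sigma_{im}}{\sigma_{m}^{2}}(R_{m}-R_{f})-R_{m}+R_{f}
       = R_{f}+\frac{\sigma_{im}}{\sigma_{m}^{2}}(R_{m}-R_{f})
\end{aligned}
```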

The ratio of covariance divided by the variance of the market return is actually the beta coefficient of the stock, which is a measure of market risk of the stock.

We arrive at the Capital Asset Pricing Model equation!

Then, when a market is in equilibrium (supply equals demand), the CAPM equation should hold. The CAPM states that the expected return of a risky asset is equal to the risk-free rate plus the market premium return multiplied (scaled) by the stock beta. The stock beta measures the market risk of a stock, which is how much on average the premium return of the stock changes for each change in the market premium return.

4.2 Estimation of the CAPM beta

We can estimate the beta of each stock using a simple linear regression with historical continuously compounded (cc) returns.

Moving the risk-free rate to the left of the CAPM equation, we can express the CAPM as:

(R_i − R_f ) = β_1(R_M − R_f )

Then, we are saying that the expected value of the premium return of a stock is equal to the premium market return multiplied by its market beta coefficient. You can estimate the beta coefficient of the CAPM using a regression model and using continuously compounded returns instead of simple returns. However, you must include the intercept b0 in the regression equation:

(r_i − r_f ) = β_0 + β_1(r_M − r_f ) + ε

Where ε ∼ N(0, σ_ε); the error is a random shock with an expected mean=0 and a specific standard deviation or volatility. This error represents the result of all factors that influence stock returns, and cannot be explained by the model (by the market).

In the single-index model, the dependent variable was the stock return and the independent variable was the market return. Unlike the single-index model, here the dependent variable is the stock return minus the risk-free rate (the stock premium return), and the independent variable is the market premium return, which is equal to the market return minus the risk-free rate. Let’s run this model in Python with a couple of stocks.

According to the efficient market hypothesis, the β_0 coefficient is expected to be zero, since it is assumed that no asset or financial instrument systematically outperforms the market. The efficient market hypothesis states that all information about a stock that is released to the market is assimilated immediately by all market participants, who react accordingly, so the stock is priced correctly and does not systematically beat the market.

Although the theory states that β_0 must be zero, we have to include it in the model and, if the theory is true, we expect the p-value of β_0 to be non-significant. The estimated β_0 will never be exactly zero, but according to the efficient market hypothesis it should not be statistically significant. In other words, if its p-value is greater than 0.05, we cannot reject that β_0 is zero, since its true value could be negative, zero or positive.
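To make the role of the intercept concrete, here is a minimal sketch that simulates premium returns with a true alpha of zero and computes the coefficients, standard errors and t-values of the regression by hand with numpy. The simulated data, the true beta of 1.3 and the variable names are assumptions for illustration; the workshop itself uses statsmodels for the same calculation.

```python
import numpy as np

# Simulated premium returns where the true alpha (b0) is zero
rng = np.random.default_rng(2026)
n = 5000
mkt_premr = rng.normal(0.006, 0.04, size=n)
stock_premr = 0.0 + 1.3 * mkt_premr + rng.normal(0, 0.05, size=n)

# OLS by hand: the design matrix has a column of ones for the intercept
X = np.column_stack([np.ones(n), mkt_premr])
coefs, *_ = np.linalg.lstsq(X, stock_premr, rcond=None)

# Standard errors from the residual variance and (X'X)^-1
resid = stock_premr - X @ coefs
sigma2 = resid @ resid / (n - 2)
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-values for the null hypotheses b0 = 0 and b1 = 0
t_values = coefs / se
b0, b1 = coefs
print("b0:", b0, "t-value:", t_values[0])
print("b1:", b1, "t-value:", t_values[1])
```

When the true alpha is zero, the t-value of b0 falls between about -2 and 2 in roughly 95% of samples, which is the sense in which a non-significant b0 is consistent with the theory.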

Let’s do an exercise to estimate the CAPM.

4.3 Data collection

4.4 Download stock data

Download monthly stock data for Apple, Tesla and the S&P500 from 2021 to March, 2026 and calculate log returns:

# We download price data for Apple, Tesla and the S&P500 index:
adjprices=yf.download(tickers="AAPL TSLA ^GSPC", start="2020-12-01", end="2026-03-31", interval="1mo", auto_adjust=True)['Close']


# I create a new data frame to calculate the log returns
returns = np.log(adjprices).diff(1).dropna()
# The diff function calculates the difference between the log price of t and the log price of t-1
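As a small illustration of the log-return calculation, the sketch below uses three made-up prices to show that the difference of log prices is the same as the log of the price ratio, and how a log return maps back to a simple return. The toy prices and variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Three hypothetical month-end prices
toy_prices = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.period_range("2021-01", periods=3, freq="M"),
)

# Log return as the difference of log prices
r_diff = np.log(toy_prices).diff(1).dropna()
# Log return as the log of the price ratio P_t / P_(t-1)
r_ratio = np.log(toy_prices / toy_prices.shift(1)).dropna()
# Simple return, for comparison
r_simple = (toy_prices / toy_prices.shift(1) - 1).dropna()

print(pd.DataFrame({"log_diff": r_diff, "log_ratio": r_ratio, "simple": r_simple}))
```

Applying exp(r) - 1 to a log return r recovers the corresponding simple return.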

I have monthly returns from Jan 2021:

returns

Ticker          AAPL      TSLA     ^GSPC
Date                                    
2021-01-01 -0.005517  0.117344 -0.011199
2021-02-01 -0.084562 -0.161038  0.025757
2021-03-01  0.008806 -0.011270  0.041563
2021-04-01  0.073453  0.060293  0.051097
2021-05-01 -0.053514 -0.126372  0.005471
...              ...       ...       ...
2025-11-01  0.030883 -0.059540  0.001299
2025-12-01 -0.024418  0.044445 -0.000524
2026-01-01 -0.046608 -0.043887  0.013570
2026-02-01  0.017951 -0.067018 -0.008706
2026-03-01 -0.039188 -0.079498 -0.052276

[63 rows x 3 columns]

4.5 Download risk-free data from the FED

We download the risk-free monthly rate for the US (3-month treasury bills), which is the TB3MS ticker. We do this with the pandas_datareader library:

# You have to install the pandas-datareader package:
#!pip install pandas-datareader
import pandas_datareader.data as pdr
import datetime
# I define the start month as Jan 2021
start = datetime.datetime(2021,1,1)
# I define the end month as Mar 2026
end = datetime.datetime(2026,3,31)
# TB3MS is the FRED series for the 3-month Treasury bill rate
Tbills = pdr.DataReader('TB3MS','fred',start,end)

We see the content of Tbills:

Tbills

            TB3MS
DATE             
2021-01-01   0.08
2021-02-01   0.04
2021-03-01   0.03
2021-04-01   0.02
2021-05-01   0.02
...           ...
2025-11-01   3.78
2025-12-01   3.59
2026-01-01   3.57
2026-02-01   3.60
2026-03-01   3.61

[63 rows x 1 columns]

The TB3MS series is given as an annual rate in percentage terms. I divide it by 100 and by 12 to get a monthly simple rate, since I am using monthly returns for the stocks:

rfrate = Tbills / 100 / 12

Now I get the continuously compounded return from the simple return:

rfrate = np.log(1+rfrate)

I used the formula to get cc returns from simple returns, which is the natural log of the growth factor (1+rfrate).
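As a worked check of these two steps, the sketch below converts the first and last TB3MS values shown above (0.08 and 3.61) into monthly continuously compounded rates. The variable names are assumptions.

```python
import numpy as np

# Annual rates in percent, taken from the first and last rows of Tbills
annual_pct = np.array([0.08, 3.61])

# Step 1: percent -> decimal, annual -> monthly simple rate
monthly_simple = annual_pct / 100 / 12

# Step 2: simple rate -> continuously compounded rate
monthly_cc = np.log(1 + monthly_simple)

print(np.round(monthly_cc, 6))
```

Rounded to six decimals, the results are 0.000067 and 0.003004, the same values that appear in the first and last rows of rfrate.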

4.6 Estimating the premium returns

Now you have to generate new variables (columns) for the premium returns for the stocks and the S&P 500.

The premium returns will be equal to the returns minus the risk-free rate. However, it is a good idea to check whether the returns dataset and the rfrate dataset have the same time periods of information:

print(returns.shape)

(63, 3)

print(rfrate.shape)

(63, 1)

Both data frames have 63 rows (months) of data. We can check the beginning and end of each dataset to make sure they have the same time periods:

print(returns.head())

Ticker          AAPL      TSLA     ^GSPC
Date                                    
2021-01-01 -0.005517  0.117344 -0.011199
2021-02-01 -0.084562 -0.161038  0.025757
2021-03-01  0.008806 -0.011270  0.041563
2021-04-01  0.073453  0.060293  0.051097
2021-05-01 -0.053514 -0.126372  0.005471

print(returns.tail())

Ticker          AAPL      TSLA     ^GSPC
Date                                    
2025-11-01  0.030883 -0.059540  0.001299
2025-12-01 -0.024418  0.044445 -0.000524
2026-01-01 -0.046608 -0.043887  0.013570
2026-02-01  0.017951 -0.067018 -0.008706
2026-03-01 -0.039188 -0.079498 -0.052276

print(rfrate.head())

               TB3MS
DATE                
2021-01-01  0.000067
2021-02-01  0.000033
2021-03-01  0.000025
2021-04-01  0.000017
2021-05-01  0.000017

print(rfrate.tail())

               TB3MS
DATE                
2025-11-01  0.003145
2025-12-01  0.002987
2026-01-01  0.002971
2026-02-01  0.002996
2026-03-01  0.003004

Both data frames cover the same time periods, so we are ready to calculate the premium returns as the returns minus the risk-free rate:

# I create new columns for the Premium returns in the returns dataset:
returns['TSLA_Premr'] = returns['TSLA'] - rfrate['TB3MS'] 
returns['GSPC_Premr'] = returns['^GSPC'] - rfrate['TB3MS']
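The date check matters because pandas subtracts two columns by matching their index labels, not their row positions. The following toy sketch, with made-up numbers and hypothetical names, shows what happens when one data frame is missing a month.

```python
import pandas as pd

# Toy returns with 3 months and a toy risk-free rate with only 2 months
toy_returns = pd.DataFrame(
    {"TSLA": [0.10, -0.05, 0.02]},
    index=pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01"]),
)
toy_rf = pd.DataFrame(
    {"TB3MS": [0.001, 0.001]},
    index=pd.to_datetime(["2021-01-01", "2021-02-01"]),
)

# Check whether both indexes hold exactly the same dates
same_index = toy_returns.index.equals(toy_rf.index)

# The subtraction aligns by date; the month without a match becomes NaN
toy_premr = toy_returns["TSLA"] - toy_rf["TB3MS"]
n_missing = int(toy_premr.isna().sum())

print(same_index)
print(toy_premr)
```

Here the indexes differ, so the premium return for the unmatched month is missing instead of raising an error, which is why it is worth comparing the time periods before subtracting.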

5 Visualize the relationship

We do a scatter plot with the S&P500 premium returns as the independent variable (X) and the Tesla premium returns as the dependent variable (Y). We also add the line that best represents the relationship between the stock returns and the market returns:

import matplotlib.pyplot as plt
import seaborn as sb
plt.clf()
x = returns['GSPC_Premr']
y = returns['TSLA_Premr']
# I plot the (x,y) values along with the regression line that fits the data:
sb.regplot(x=x,y=y)
plt.xlabel('Market Premium returns')
plt.ylabel('TSLA Premium returns') 
plt.show()

Sometimes graphs can be deceiving. In this case, the ranges of the X axis and the Y axis are different, so it is better to draw the graph using the same scale for both axes.

plt.clf()

sb.regplot(x=x,y=y)
# I adjust the scale of the X axis so that the magnitude of each unit of X is equal to that of the Y axis 
plt.xticks(np.arange(-0.6,0.8,0.2))

# I label the axis:
plt.xlabel('Market Premium returns')

plt.ylabel('TSLA Premium returns') 
plt.show()

WHAT DOES THE PLOT TELL YOU? BRIEFLY EXPLAIN

6 Estimating the CAPM model for a stock

Use the premium returns to run the CAPM regression model for each stock.

We run the CAPM for Tesla:

import statsmodels.formula.api as smf

# I estimate the OLS regression model:
mkmodel = smf.ols('TSLA_Premr ~ GSPC_Premr',data=returns).fit()
# I display the summary of the regression: 
print(mkmodel.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:             TSLA_Premr   R-squared:                       0.208
Model:                            OLS   Adj. R-squared:                  0.195
Method:                 Least Squares   F-statistic:                     16.03
Date:                Sun, 26 Apr 2026   Prob (F-statistic):           0.000171
Time:                        09:32:44   Log-Likelihood:                 31.836
No. Observations:                  63   AIC:                            -59.67
Df Residuals:                      61   BIC:                            -55.39
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0061      0.019     -0.321      0.750      -0.044       0.032
GSPC_Premr     1.7479      0.437      4.004      0.000       0.875       2.621
==============================================================================
Omnibus:                        1.357   Durbin-Watson:                   1.888
Prob(Omnibus):                  0.507   Jarque-Bera (JB):                1.386
Skew:                          -0.289   Prob(JB):                        0.500
Kurtosis:                       2.560   Cond. No.                         23.4
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

I can store important results of the model in the following variables:


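# Coefficients: b0 is the intercept (alpha) and b1 is the market beta
b0=mkmodel.params.iloc[0]
b1=mkmodel.params.iloc[1]
# Standard errors of the coefficients
seb0 = mkmodel.bse.iloc[0]
seb1 = mkmodel.bse.iloc[1]
# t-values for the null hypotheses b0 = 0 and b1 = 0
tb0 = b0 / seb0
tb1 = b1 / seb1

# p-values of the coefficients
pvalueb0 = mkmodel.pvalues.iloc[0]
pvalueb1 = mkmodel.pvalues.iloc[1]
confb0 = 100 * (1-pvalueb0)
# Lower and upper limits of the 95% confidence intervals
minb0 = mkmodel.conf_int().iloc[0].iloc[0]
maxb0 = mkmodel.conf_int().iloc[0].iloc[1]
minb1 = mkmodel.conf_int().iloc[1].iloc[0]
maxb1 = mkmodel.conf_int().iloc[1].iloc[1]

# t-value for the null hypothesis b1 = 1 (the stock is as risky as the market)
tvalueb1 = (b1 - 1) / seb1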

The beta0 coefficient of the model is -0.0061, while beta1 is 1.7479.

The 95% confidence interval for beta0 goes from -0.0438 to 0.0317, while the 95% confidence interval for beta1 goes from 0.875 to 2.6209.

If we subtract from and add to beta0 about 2 times its standard error, we get the 95% confidence interval for beta0. Why? Thanks to the Central Limit Theorem, beta0 (and also beta1) behaves approximately like a normally distributed variable, since the estimated coefficient can be expressed as a linear combination of random variables.

We can construct the 95% confidence interval for beta1 in the same way we calculate the 95% C.I. for beta0.
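The 2-standard-error rule can be checked with the rounded coefficients and standard errors printed in the regression summary above. This is only an approximation sketch: the helper function approx_ci95 is hypothetical, and the results differ slightly from the exact interval because the inputs are rounded and the exact multiplier is not exactly 2.

```python
# Rounded values taken from the regression summary above
b0, seb0 = -0.0061, 0.019
b1, seb1 = 1.7479, 0.437

def approx_ci95(coef, se, k=2.0):
    """Approximate 95% C.I.: coefficient minus/plus about 2 standard errors."""
    return coef - k * se, coef + k * se

ci_b0 = approx_ci95(b0, seb0)
ci_b1 = approx_ci95(b1, seb1)

# t-value for the null hypothesis beta1 = 1 (instead of beta1 = 0)
t_b1_vs_1 = (b1 - 1) / seb1

print("Approx. 95% C.I. for b0:", ci_b0)
print("Approx. 95% C.I. for b1:", ci_b1)
print("t-value for H0: b1 = 1:", t_b1_vs_1)
```

The approximate intervals are close to the ones reported by statsmodels (-0.044 to 0.032 for beta0, and 0.875 to 2.621 for beta1).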

7 CHALLENGE 2

INTERPRET THE RESULTS OF THE COEFFICIENTS (b0 and b1), THEIR STANDARD ERRORS, P-VALUES AND 95% CONFIDENCE INTERVALS.

8 CHALLENGE 3

You have to find the optimal portfolio of 4 US stocks using 2 methods:

Estimate the Variance-Covariance matrix according to the Markowitz Portfolio Theory
Estimate the Variance-Covariance matrix according to the Single-Index Model. In this case you have to use the Single-index model to estimate the individual Expected return of each stock, so you have to make an assumption about the expected return of the market.

The 4 stocks must be from different industries and will be announced in class. You have to use historical monthly data (3 to 5 years of data).

You have to do the following:
Estimate the efficient frontier for both methods
Estimate the optimal portfolio for both methods for a risk-free rate of 4% annual rate
Compare the 2 methods and their results. Explain the differences: why do you think the results are different? Which method would you prefer and why?
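As a starting point for the second method, here is a minimal sketch of how a variance-covariance matrix and expected returns can be built from single-index model estimates. Under that model, the covariance between two stocks is the product of their betas times the market variance, and each stock variance adds its own residual variance. All numbers below (betas, alphas, residual variances, market variance and expected market return) are hypothetical placeholders, not estimates from real data.

```python
import numpy as np

# Hypothetical single-index estimates for 4 stocks (monthly data)
alphas = np.array([0.001, 0.000, 0.002, 0.001])
betas = np.array([1.2, 0.8, 1.5, 0.6])
resid_var = np.array([0.004, 0.002, 0.010, 0.003])   # variance of regression errors
market_var = 0.0025                                   # variance of market returns
exp_market_ret = 0.008                                # assumed expected market return

# Covariance matrix: beta_i * beta_j * market variance, plus residual
# variances on the diagonal only
cov_sim = market_var * np.outer(betas, betas) + np.diag(resid_var)

# Expected return of each stock according to the single-index model
exp_ret = alphas + betas * exp_market_ret

print(cov_sim)
print(exp_ret)
```

The matrix cov_sim and the vector exp_ret can then be used as inputs to the same optimization used with the historical (Markowitz) covariance matrix, which makes the two methods directly comparable.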

9 Multi-Factor Models

After the CAPM was introduced, it received strong critiques by researchers and analysts since there were some stocks that had positive abnormal returns that were not explained by the CAPM. In other words, several stocks had alpha coefficients significantly positive for a long time. These and other cases are considered anomalies since the CAPM does not explain the excess return of all assets all the time. In other words, there should be other sources of systematic risk, not only the market risk.

In 1993, the 3-factor (Fama&French) model was proposed. This factor model starts with the CAPM model and adds 2 other sources of systematic risk. The sources of systematic risk are called factors.

The FF 3-factor model includes the following factors:

Market Factor
Value Factor
Size Factor

The value factor refers to the observation that firms with a high book-to-market value usually get higher future returns than firms with a low book-to-market value. The book-to-market value of a stock is the ratio:

BTM_t =\frac{bookvalue_t}{marketvalue_t}

The book value at any accounting period t is the accounting value of a stock, which can be estimated as the difference between total assets minus total liabilities:

bookvalue_t=totalassets_t-totalliabilities_t

Then, if both values are equal, BTM = 1. Most currently active firms have a BTM much lower than 1, which means that their market value is greater than their book value. This difference can be explained by intangible assets such as competitive advantage, quality of products/services, innovation, reputation, etc.

The market value of a stock at any period t is the product of stock price times the number of shares outstanding:

marketvalue_t=(stockprice_t)(sharesoutstanding_t)
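As a worked example of these three formulas, the sketch below computes the book-to-market ratio of a hypothetical firm. All the accounting and market figures are made up for illustration.

```python
# Hypothetical figures for one firm at period t (amounts in millions)
total_assets = 500.0
total_liabilities = 300.0
stock_price = 40.0           # price per share, in dollars
shares_outstanding = 10.0    # millions of shares

# Book value: total assets minus total liabilities
book_value = total_assets - total_liabilities

# Market value: stock price times shares outstanding
market_value = stock_price * shares_outstanding

# Book-to-market ratio
btm = book_value / market_value

print(book_value, market_value, btm)
```

In this example the book value is 200 and the market value is 400, so BTM = 0.5: the market values the firm at twice its accounting value.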

The value factor is estimated “artificially” by creating 2 portfolios over time: one with high-BTM firms and another with low-BTM firms. Each period, the return of the low-BTM portfolio is subtracted from the return of the high-BTM portfolio to get the difference in returns between the two portfolios.

This factor is supposed to measure how firms with high BTM usually recover and surprise with excess returns in the near future.

The size factor refers to the observation that small firms usually get higher future returns than big firms. This factor is also calculated “virtually” by creating a portfolio with small firms and a portfolio with big firms over time, and then taking the difference in returns between these 2 portfolios. This difference is considered the SIZE factor.

Then, the 3 factors (market, value and size) are all calculated as excess returns. The market factor is the market return minus the risk-free return; the value factor is the return of the high-BTM portfolio minus the return of the low-BTM portfolio over time; and the size factor is the return of the portfolio with SMALL firms minus the return of the portfolio with BIG firms.
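The idea of a factor as a difference between two portfolio returns can be sketched with toy numbers. The portfolio returns below are made up, and the calculation is a simplification: the published Fama-French factors are built from several portfolios sorted on both size and book-to-market, not from a single pair.

```python
import pandas as pd

# Hypothetical monthly returns of four portfolios
toy = pd.DataFrame(
    {
        "high_btm": [0.03, -0.01, 0.02],
        "low_btm": [0.01, 0.00, 0.03],
        "small": [0.04, -0.02, 0.01],
        "big": [0.02, 0.01, 0.01],
    },
    index=pd.period_range("2025-01", periods=3, freq="M"),
)

# Value factor: high book-to-market minus low book-to-market
toy["HML"] = toy["high_btm"] - toy["low_btm"]
# Size factor: small firms minus big firms
toy["SMB"] = toy["small"] - toy["big"]

print(toy[["HML", "SMB"]])
```

A positive HML in a given month means that high-BTM firms outperformed low-BTM firms in that month; a positive SMB means that small firms outperformed big firms.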

9.1 Estimating the 3-factor FF model with real data

The FF factors are constantly estimated for the US and other international markets. These calculations are public and can be downloaded from a web page (https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html)

# prompt: Using the public calculations from the source below, download the fama french 5 factors for the US market. Store the data in an object called ff5.
# https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html

#!pip install pandas-datareader

import pandas as pd
import pandas_datareader.data as web
import datetime

# Define the start and end dates
start = '2021-01-01'  # Adjust as needed
end = '2026-03-31'    # Adjust as needed

# Download the Fama-French 5-Factor data
ff5 = web.DataReader('F-F_Research_Data_5_Factors_2x3', 'famafrench', start=start, end=end)[0]

# I convert the factor premiums from percentages to decimals
ff5 = ff5/100

# Display the data
print(ff5)

         Mkt-RF     SMB     HML     RMW     CMA      RF
Date                                                   
2021-01 -0.0007  0.0681  0.0322 -0.0365  0.0497  0.0000
2021-02  0.0281  0.0450  0.0720  0.0033 -0.0199  0.0000
2021-03  0.0316 -0.0084  0.0735  0.0635  0.0352  0.0000
2021-04  0.0497 -0.0316 -0.0102  0.0243 -0.0272  0.0000
2021-05  0.0029  0.0127  0.0704  0.0234  0.0301  0.0000
...         ...     ...     ...     ...     ...     ...
2025-10  0.0196 -0.0131 -0.0310 -0.0524 -0.0403  0.0037
2025-11 -0.0013  0.0147  0.0376  0.0144  0.0068  0.0030
2025-12 -0.0036 -0.0022  0.0242  0.0040  0.0037  0.0034
2026-01  0.0103  0.0326  0.0372  0.0182  0.0183  0.0030
2026-02 -0.0117  0.0063  0.0283  0.0162  0.0507  0.0028

[62 rows x 6 columns]

After the 3-factor model, Fama and French proposed 2 extra factors, so now we can use up to 5 factors to model stock returns. These extra factors are profitability (RMW) and investment pattern (CMA).

Before downloading the historical stock prices of a company to run the FF 3-factor model, I check the columns and the index of the ff5 data frame:

print(ff5.columns)

Index(['Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA', 'RF'], dtype='object')

ff5.index

PeriodIndex(['2021-01', '2021-02', '2021-03', '2021-04', '2021-05', '2021-06',
             '2021-07', '2021-08', '2021-09', '2021-10', '2021-11', '2021-12',
             '2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06',
             '2022-07', '2022-08', '2022-09', '2022-10', '2022-11', '2022-12',
             '2023-01', '2023-02', '2023-03', '2023-04', '2023-05', '2023-06',
             '2023-07', '2023-08', '2023-09', '2023-10', '2023-11', '2023-12',
             '2024-01', '2024-02', '2024-03', '2024-04', '2024-05', '2024-06',
             '2024-07', '2024-08', '2024-09', '2024-10', '2024-11', '2024-12',
             '2025-01', '2025-02', '2025-03', '2025-04', '2025-05', '2025-06',
             '2025-07', '2025-08', '2025-09', '2025-10', '2025-11', '2025-12',
             '2026-01', '2026-02'],
            dtype='period[M]', name='Date')

ff5.info()

<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 62 entries, 2021-01 to 2026-02
Freq: M
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Mkt-RF  62 non-null     float64
 1   SMB     62 non-null     float64
 2   HML     62 non-null     float64
 3   RMW     62 non-null     float64
 4   CMA     62 non-null     float64
 5   RF      62 non-null     float64
dtypes: float64(6)
memory usage: 3.4 KB

# prompt: Now, can you download the monthly information of WMT, using the period from 2020-12-01 to 2026-03-31, and after that, calculate the monthly log returns storing them in an individual dataset.

adjprices=yf.download(tickers="WMT", start="2020-12-01", end = "2026-03-31", interval="1mo", auto_adjust=True)['Close']


# Calculate monthly log returns
dataret = np.log(adjprices / adjprices.shift(1)).dropna()

We want to merge the ff5 and the dataret data frames into one, so that we can easily run a multi-factor regression model.

A merge can be easily made if both data frames have the same index. Let’s check the data structure information and whether the data frames have indexes:

# Data structure information of data frames:
print(ff5.info())

<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 62 entries, 2021-01 to 2026-02
Freq: M
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Mkt-RF  62 non-null     float64
 1   SMB     62 non-null     float64
 2   HML     62 non-null     float64
 3   RMW     62 non-null     float64
 4   CMA     62 non-null     float64
 5   RF      62 non-null     float64
dtypes: float64(6)
memory usage: 3.4 KB
None

print(dataret.info())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 63 entries, 2021-01-01 to 2026-03-01
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   WMT     63 non-null     float64
dtypes: float64(1)
memory usage: 1008.0 bytes
None

print(ff5.index)

PeriodIndex(['2021-01', '2021-02', '2021-03', '2021-04', '2021-05', '2021-06',
             '2021-07', '2021-08', '2021-09', '2021-10', '2021-11', '2021-12',
             '2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06',
             '2022-07', '2022-08', '2022-09', '2022-10', '2022-11', '2022-12',
             '2023-01', '2023-02', '2023-03', '2023-04', '2023-05', '2023-06',
             '2023-07', '2023-08', '2023-09', '2023-10', '2023-11', '2023-12',
             '2024-01', '2024-02', '2024-03', '2024-04', '2024-05', '2024-06',
             '2024-07', '2024-08', '2024-09', '2024-10', '2024-11', '2024-12',
             '2025-01', '2025-02', '2025-03', '2025-04', '2025-05', '2025-06',
             '2025-07', '2025-08', '2025-09', '2025-10', '2025-11', '2025-12',
             '2026-01', '2026-02'],
            dtype='period[M]', name='Date')

print(dataret.index)

DatetimeIndex(['2021-01-01', '2021-02-01', '2021-03-01', '2021-04-01',
               '2021-05-01', '2021-06-01', '2021-07-01', '2021-08-01',
               '2021-09-01', '2021-10-01', '2021-11-01', '2021-12-01',
               '2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01',
               '2022-05-01', '2022-06-01', '2022-07-01', '2022-08-01',
               '2022-09-01', '2022-10-01', '2022-11-01', '2022-12-01',
               '2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01',
               '2023-05-01', '2023-06-01', '2023-07-01', '2023-08-01',
               '2023-09-01', '2023-10-01', '2023-11-01', '2023-12-01',
               '2024-01-01', '2024-02-01', '2024-03-01', '2024-04-01',
               '2024-05-01', '2024-06-01', '2024-07-01', '2024-08-01',
               '2024-09-01', '2024-10-01', '2024-11-01', '2024-12-01',
               '2025-01-01', '2025-02-01', '2025-03-01', '2025-04-01',
               '2025-05-01', '2025-06-01', '2025-07-01', '2025-08-01',
               '2025-09-01', '2025-10-01', '2025-11-01', '2025-12-01',
               '2026-01-01', '2026-02-01', '2026-03-01'],
              dtype='datetime64[ns]', name='Date', freq=None)

Before merging 2 datasets, we have to look at 1) the data type of the datasets’ indexes, and 2) whether they have rows with equal index values to do the match.

We can see that the data frames’ indexes have different types: the ff5 index is a monthly PeriodIndex and the dataret index is a DatetimeIndex. We can convert the datetime index to a monthly period index:

dataret.index = dataret.index.to_period("M")
print(dataret.index)

PeriodIndex(['2021-01', '2021-02', '2021-03', '2021-04', '2021-05', '2021-06',
             '2021-07', '2021-08', '2021-09', '2021-10', '2021-11', '2021-12',
             '2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06',
             '2022-07', '2022-08', '2022-09', '2022-10', '2022-11', '2022-12',
             '2023-01', '2023-02', '2023-03', '2023-04', '2023-05', '2023-06',
             '2023-07', '2023-08', '2023-09', '2023-10', '2023-11', '2023-12',
             '2024-01', '2024-02', '2024-03', '2024-04', '2024-05', '2024-06',
             '2024-07', '2024-08', '2024-09', '2024-10', '2024-11', '2024-12',
             '2025-01', '2025-02', '2025-03', '2025-04', '2025-05', '2025-06',
             '2025-07', '2025-08', '2025-09', '2025-10', '2025-11', '2025-12',
             '2026-01', '2026-02', '2026-03'],
            dtype='period[M]', name='Date')

Now both data frames have the same index type, which is monthly.
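The conversion can be verified on a small example. The sketch below builds a toy DatetimeIndex and a toy monthly PeriodIndex (both hypothetical) and checks that they match only after applying to_period("M").

```python
import pandas as pd

# A toy DatetimeIndex (first day of each month) and a monthly PeriodIndex
dt_index = pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01"])
per_index = pd.period_range("2021-01", periods=3, freq="M")

# Convert the dates to monthly periods
converted = dt_index.to_period("M")

print(dt_index.dtype, "->", converted.dtype)
print(converted.equals(per_index))
```

Once both indexes are monthly periods, a date such as 2021-01-01 and the period 2021-01 refer to the same label, so the rows can be matched in a merge.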

Before running the merge, we see that the dataret has 63 rows and ff5 has 62 rows (months). It seems that the ff5 has no information for the most recent month:

print(ff5.tail())

         Mkt-RF     SMB     HML     RMW     CMA      RF
Date                                                   
2025-10  0.0196 -0.0131 -0.0310 -0.0524 -0.0403  0.0037
2025-11 -0.0013  0.0147  0.0376  0.0144  0.0068  0.0030
2025-12 -0.0036 -0.0022  0.0242  0.0040  0.0037  0.0034
2026-01  0.0103  0.0326  0.0372  0.0182  0.0183  0.0030
2026-02 -0.0117  0.0063  0.0283  0.0162  0.0507  0.0028

print(dataret.tail())

Ticker        WMT
Date             
2025-11  0.088205
2025-12  0.008111
2026-01  0.069118
2026-02  0.071340
2026-03 -0.029102

Then, if we merge the data frames with an inner join, only the rows with equal index values (months) in both data frames will be kept.

We can merge both data frames by their index:

merged_data = ff5.merge(dataret, left_index = True, right_index=True, how='inner', validate='one_to_one')
print(merged_data.info())

<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 62 entries, 2021-01 to 2026-02
Freq: M
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Mkt-RF  62 non-null     float64
 1   SMB     62 non-null     float64
 2   HML     62 non-null     float64
 3   RMW     62 non-null     float64
 4   CMA     62 non-null     float64
 5   RF      62 non-null     float64
 6   WMT     62 non-null     float64
dtypes: float64(7)
memory usage: 3.9 KB
None

print(merged_data.tail())

         Mkt-RF     SMB     HML     RMW     CMA      RF       WMT
Date                                                             
2025-10  0.0196 -0.0131 -0.0310 -0.0524 -0.0403  0.0037 -0.018410
2025-11 -0.0013  0.0147  0.0376  0.0144  0.0068  0.0030  0.088205
2025-12 -0.0036 -0.0022  0.0242  0.0040  0.0037  0.0034  0.008111
2026-01  0.0103  0.0326  0.0372  0.0182  0.0183  0.0030  0.069118
2026-02 -0.0117  0.0063  0.0283  0.0162  0.0507  0.0028  0.071340
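The effect of the inner merge can be reproduced on a toy example. In the sketch below, the toy factor data ends one month earlier than the toy stock data, as happens with ff5 and dataret; the numbers and names are made up.

```python
import pandas as pd

# Toy factor data with 2 months and toy stock returns with 3 months
toy_factors = pd.DataFrame(
    {"Mkt-RF": [0.01, -0.02]},
    index=pd.period_range("2026-01", periods=2, freq="M"),
)
toy_stock = pd.DataFrame(
    {"WMT": [0.03, 0.01, -0.02]},
    index=pd.period_range("2026-01", periods=3, freq="M"),
)

# Inner merge by index: only months present in both data frames survive
toy_merged = toy_factors.merge(
    toy_stock, left_index=True, right_index=True,
    how="inner", validate="one_to_one",
)

# Months of the stock data that were dropped by the merge
dropped = toy_stock.index.difference(toy_merged.index)

print(toy_merged)
print(dropped)
```

The validate='one_to_one' argument makes pandas raise an error if any month appears more than once in either data frame, which protects against accidentally duplicated rows.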

The RF is the monthly US Risk-Free rate, so we can create a column for Walmart premium returns.

merged_data['WMT_premr'] = merged_data['WMT'] - merged_data['RF']
merged_data.tail()

         Mkt-RF     SMB     HML     RMW     CMA      RF       WMT  WMT_premr
Date                                                                        
2025-10  0.0196 -0.0131 -0.0310 -0.0524 -0.0403  0.0037 -0.018410  -0.022110
2025-11 -0.0013  0.0147  0.0376  0.0144  0.0068  0.0030  0.088205   0.085205
2025-12 -0.0036 -0.0022  0.0242  0.0040  0.0037  0.0034  0.008111   0.004711
2026-01  0.0103  0.0326  0.0372  0.0182  0.0183  0.0030  0.069118   0.066118
2026-02 -0.0117  0.0063  0.0283  0.0162  0.0507  0.0028  0.071340   0.068540

We need stock premium returns as the Dependent Variable of the multi-factor (multiple regression) model.

We will run the CAPM model and the 3-factor model.

The CAPM model (1-factor model) is a simple regression model including only the Market premium as independent variable.

The 3-factor model is a multiple regression model including the factors 1) Size (SMB), 2) Value (HML) and 3) the Market premium return (Mkt-RF) as independent variables.

import statsmodels.formula.api as smf
# It is not a good idea to have column names with special characters such as '-'. I change the name of this column:
merged_data = merged_data.rename(columns={"Mkt-RF": "Mkt_premr"})

# I run the CAPM regression model:
mkmodel = smf.ols('WMT_premr ~ Mkt_premr',data=merged_data).fit()
# I display the summary of the regression: 
print(mkmodel.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              WMT_premr   R-squared:                       0.219
Model:                            OLS   Adj. R-squared:                  0.206
Method:                 Least Squares   F-statistic:                     16.82
Date:                Sun, 26 Apr 2026   Prob (F-statistic):           0.000125
Time:                        09:32:46   Log-Likelihood:                 98.852
No. Observations:                  62   AIC:                            -193.7
Df Residuals:                      60   BIC:                            -189.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0093      0.006      1.441      0.155      -0.004       0.022
Mkt_premr      0.6014      0.147      4.101      0.000       0.308       0.895
==============================================================================
Omnibus:                       13.033   Durbin-Watson:                   2.222
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               17.393
Skew:                          -0.799   Prob(JB):                     0.000167
Kurtosis:                       5.044   Cond. No.                         23.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

# I estimate the 3-factor OLS regression model:
f3model = smf.ols('WMT_premr ~ Mkt_premr + SMB + HML',data=merged_data).fit()
# I display the summary of the regression: 
print(f3model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              WMT_premr   R-squared:                       0.251
Model:                            OLS   Adj. R-squared:                  0.212
Method:                 Least Squares   F-statistic:                     6.468
Date:                Sun, 26 Apr 2026   Prob (F-statistic):           0.000749
Time:                        09:32:46   Log-Likelihood:                 100.14
No. Observations:                  62   AIC:                            -192.3
Df Residuals:                      58   BIC:                            -183.8
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0091      0.007      1.355      0.181      -0.004       0.023
Mkt_premr      0.6174      0.153      4.031      0.000       0.311       0.924
SMB           -0.2529      0.226     -1.119      0.268      -0.705       0.200
HML           -0.1008      0.165     -0.610      0.544      -0.432       0.230
==============================================================================
Omnibus:                       10.182   Durbin-Watson:                   2.255
Prob(Omnibus):                  0.006   Jarque-Bera (JB):               12.313
Skew:                          -0.662   Prob(JB):                      0.00212
Kurtosis:                       4.736   Cond. No.                         38.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
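To see what the multiple regression computes, here is a numpy-only sketch that estimates a 3-factor model on simulated data and calculates the R-squared and adjusted R-squared by hand. The simulated factors, the true betas and the variable names are assumptions for illustration; they are not the Walmart estimates.

```python
import numpy as np

# Simulated factor premiums: columns are market, SMB and HML
rng = np.random.default_rng(7)
n = 600
factors = rng.normal(0, 0.04, size=(n, 3))
true_betas = np.array([0.6, -0.25, -0.10])
stock_premr = 0.001 + factors @ true_betas + rng.normal(0, 0.01, size=n)

# Multiple regression by least squares (column of ones for the intercept)
X = np.column_stack([np.ones(n), factors])
coefs, *_ = np.linalg.lstsq(X, stock_premr, rcond=None)

# R-squared and adjusted R-squared (k = number of factors)
resid = stock_premr - X @ coefs
ss_res = resid @ resid
ss_tot = ((stock_premr - stock_premr.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
k = factors.shape[1]
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print("Intercept and betas:", coefs)
print("R-squared:", r2, "Adj. R-squared:", adj_r2)
```

The adjusted R-squared penalizes each extra factor, so it is the fairer number for comparing the CAPM against the 3-factor model. In the Walmart results above it goes from 0.206 in the CAPM to 0.212 in the 3-factor model.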

10 CHALLENGE 4

You have to interpret the results of the FF model in detail.