Workshop 3, Algorithms and Data Analysis

Author

Alberto Dorantes

Published

September 29, 2025

Abstract
This is an INDIVIDUAL workshop. In this workshop we continue practice programming for data management, and we introduce the logistic regression model in the context of machine learning

1 Data management for panel-data

For financial analysis, data management is a very important process. Most of the time, before designing and running a financial / econometric model, it is needed to do data wrangling: data collection/importing, data cleanning, data merging, data transformation.

We will keep using the same datasets we used in Workshop 2 for all excercises of this workshop.

Download the online datasets from:

Use the same Python code you used before:

import pandas as pd
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Download the first file
url1 = 'https://www.apradie.com/datos/usfirms.csv'
response1 = requests.get(url1, headers=headers)
with open('usfirms.csv', 'wb') as f:
    f.write(response1.content)
633061
# Download the second file
url2 = 'https://www.apradie.com/datos/usdata.csv'
response2 = requests.get(url2, headers=headers)
with open('usdata.csv', 'wb') as f:
    f.write(response2.content)
11317275
# Import as data frames
usfirms = pd.read_csv('usfirms.csv')
usdata = pd.read_csv('usdata.csv')

# Display the first few rows of each DataFrame to confirm
print("usfirms DataFrame:")
usfirms DataFrame:
usfirms.head()
    empresa  ...  SectorEconomatica
0         A  ...  Electric Electron
1        AA  ...  Basic & Fab Metal
2  AABA_old  ...    Software & Data
3   AAC_old  ...              Other
4  AAIC_old  ...              Funds

[5 rows x 7 columns]
print("\nusdata DataFrame:")

usdata DataFrame:
usdata.head()
  firm  year     revenue  ...  originalprice  sharesoutstanding  fixedassets
0    A  2000  10773000.0  ...          54.75         456366.381    1741000.0
1    A  2001   8396000.0  ...          28.51         463695.160    1848000.0
2    A  2002   6010000.0  ...          17.96         467024.421    1579000.0
3    A  2003   6056000.0  ...          29.24         476149.083    1447000.0
4    A  2004   7181000.0  ...          24.10         486841.087    1258000.0

[5 rows x 28 columns]

You can go to Workshop 2 (https://rpubs.com/cdorante/fz2022p_w2) to see the data dictionary for each dataset.

As in the previous workshop, merge the 2 datasets (usdata and usfirms) using a left-join. Remember, the panel-dataset usdata is the left dataset, which has historical annual financial data for all US firms; and the usfirms is the right dataset, which is a cross-sectional dataset with general information of firms from the S&P500 index.

# I merge the datasets using usdata as the left dataset, and usfirms as the right dataset 

merged_data = pd.merge(usdata, usfirms[['empresa', 'Nombre', 'status', 'naics1']], left_on='firm', right_on='empresa', how='left')

# Drop the redundant 'empresa' column from the merged DataFrame
merged_data = merged_data.drop('empresa', axis=1)

# Display the first few rows of the merged DataFrame to confirm
print("Merged DataFrame:")
Merged DataFrame:
merged_data.head()
  firm  year     revenue  ...                     Nombre  status         naics1
0    A  2000  10773000.0  ...  Agilent Technologies, Inc  active  Manufacturing
1    A  2001   8396000.0  ...  Agilent Technologies, Inc  active  Manufacturing
2    A  2002   6010000.0  ...  Agilent Technologies, Inc  active  Manufacturing
3    A  2003   6056000.0  ...  Agilent Technologies, Inc  active  Manufacturing
4    A  2004   7181000.0  ...  Agilent Technologies, Inc  active  Manufacturing

[5 rows x 31 columns]

2 CHALLENGE 1. Calculating financial variables

Using the merged dataset, you have to write the code to calculate the following financial variables and financial ratios for all firms-years:

1. Create financial variables

Create columns for the following variables:

  • Market value (marketvalue) = originalprice * sharesoutstanding

We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares.

import numpy as np

# Market value
merged_data['marketvalue'] = merged_data['originalprice'] * merged_data['sharesoutstanding']
  • Gross profit (grossprofit) = Revenue - Cost of good Sold (cogs)
# Gross profit
merged_data['grossprofit'] = merged_data['revenue'] - merged_data['cogs']
  • Earnings before interest and taxes (ebit) = Gross profit - Sales & general administrative expenses (sgae) - depreciation
# Earnings before interest and taxes (ebit)
merged_data['ebit'] = merged_data['grossprofit'] - merged_data['sgae'] - merged_data['depreciation']
  • Net Income (netincome) = ebit + otherincome + extraordinaryitems - finexp (financial expenses) - incometax
# Net Income
merged_data['netincome'] = merged_data['ebit'] + merged_data['otherincome'] + merged_data['extraordinaryitems'] - merged_data['finexp'] - merged_data['incometax']
  • Annual market return: calculate annual return for all firm-years by calculating the continuously compounded percentage of thea djusted stock price (adjprice). Consider that you have a panel-data, so be careful when calculating returns to avoid using data from another firm in the cases of the first year for all firms. Hint: you can use the shift function and groupby firm to avoid using stock price of another stock to calculate the annual return of each firm for all years.
# Annual market return
# Sort by firm and year to ensure correct calculation within each firm
merged_data = merged_data.sort_values(by=['firm', 'year'])

# Calculate annual log return:
# I create a column for the previous year; I use the groupby('firm') function to avoid pulling the stock price of another firm:
merged_data['price_lag1'] = merged_data.groupby('firm')['adjprice'].shift(1)
# I create the annual log return (continuously compounded return) by getting the log of the current price divided by the previous price
merged_data['annual_return'] = np.log(merged_data['adjprice'] / merged_data['price_lag1'])
# I could also calculate the log return by getting the difference of the log stock prices between t and t-1:
#merged_data['annual_return'] = np.log(merged_data['adjprice']) - np.log(merged_data['price_lag1'])

2. Using the same panel dataset, create columns for the following financial ratios:

  • Operational Return on Assets (roabit): ebit divided by total assets at the beginning of the period. Total assets of the beginning of the year is actually the total assets of the previous year.

roabit=\frac{ebit_{t}}{totalassets_{t-1}}

# Operational Return on Assets (roabit)

# Calculate total assets at the beginning of the period (previous year's total assets)
merged_data['totalassets_lag1'] = merged_data.groupby('firm')['totalassets'].shift(1)
# calculate roabit using the lagged total assets:
merged_data['roabit'] = merged_data['ebit'] / merged_data['totalassets_lag1']

Here you can use the shift function to get value of total assets one year ago. Make sure that you indicate to groupby firm so you do not use the totalssets from another firm to calculate the roabit of a firm.

  • Return on Assets (roa):

roa=\frac{netincome_{t}}{totalassets_{t-1}}


# Return on Assets (roa)
merged_data['totalassets_lag1'] = merged_data.groupby('firm')['totalassets'].shift(1)
merged_data['roa'] = merged_data['netincome'] / merged_data['totalassets_lag1']
  • Operational Earnings per share (oeps): ebit / sharesoutstanding
# Operational Earnings per share (oeps)
merged_data['oeps'] = merged_data['ebit'] / merged_data['sharesoutstanding']
  • Operational eps deflated by stock price (oepsp) : oeps / originalprice
# Operational eps deflated by stock price (oepsp)
merged_data['oepsp'] = merged_data['oeps'] / merged_data['originalprice']
  • Book-to-market ratio (bmr): book value / market value.

Book value can be calculated as totalassets minus totalliabilities

# Book-to-market ratio (bmr)
merged_data['bookvalue'] = merged_data['totalassets'] - merged_data['totalliabilities']
merged_data['bmr'] = merged_data['bookvalue'] / merged_data['marketvalue']

Do your own research and briefly explain what is earnings per share deflated by price, and book-to-market ratio

EARNINGS PER SHARE IS EQUAL TO EARNINGS DIVIDED BY THE # OF SHARES. THE MEASURE FOR EARNINGS IS NET INCOME. HOWEVER, SOME ANALYSTS ALSO USE OTHER OPERATIONAL MEASURES FOR EARNINGS SUCH AS EARNINGS BEFORE INTEREST AND TAXES (EBIT). IF WE WANT TO MEASURE OPERATIONAL EARNINGS AND CALCULATE IT FOR MANY FIRMS, IT IS RECOMMENDED TO USE EBIT AS A MEASURE OF EARNINGS.

THE FORMULA OF EPS USING NET INCOME AS A MEASURE OF EARNINGS IS:

EPS_{t}=\frac{NETINCOME_{t}}{\#OFSHARES_{t}}

THE FORMULA FOR EPS USING EBIT AS A MEASURE OF EARNINGS IS:*

EPS_{t}=\frac{EBIT_{t}}{\#OFSHARES_{t}}

IN A HYPOTHETICAL SCENARIO, IF THE ALL EARNINGS OF A PERIOD t WERE PAYED TO THE INVESTORS, THEN EPS WILL BE HOW MUCH OF ALL EARNINGS OF THE PERIOD IS PAYED TO EACH SHARE OWN BY INVESTORS.

WHAT IS BOOK-TO-MARKET RATIO?

Book-to-maret ratio (bmr) is the ratio of accounting book value of the firm to its market value. In other words, it results by dividing book value by the market value of the firm at a specific time period.

If bmr=1 means that the firm book value is about the same as firm market value. If that is the case, then the market value has not grown beyond book-value, meaning that the firm has not created value beyond its book value.

If bmr>1 means that the market value is less than book value. So, if bmr>1 means that the firm has significantly lost shareholder’s wealth, and it might incur in bankrupt risk.

Then, what would be the bmr level that all firms are looking for? One of the main purposes of the executives is to MAXIMIZE shareholder’s value. The way to increase shareholder’s value is to increase its market value, and the only way to increase market value is to increase stock price.

Then, the bmr level that all executives prefer is a value much less than 1.

If bmr=0.5 means that the firm market value is the double of its book value. In this case, the main difference between market value and book value is the wealth that the firm has created thanks to its valuable intangible assets such as prestige, high quality, and innovation.

Then, what do you think it might be the relationship between bmr and stock return? Intuitively, we might think that a small value of bmr is a good news, then the stock return might be related in a negative way. If bmr goes down (good news), then the stock return might go up. Then, it might be expected that the relationship between bmr and stock return is linear and negative. Some finance research (like Fama & French, 1995), mainly with US firms has found that the relationship between bmr and future stock return is negative.

However, there are mixed findings about the relationship between bmr and stock returns. Some research (like Piotrosky, 2000) has found that firms with high bmr, but strong financials (like earnings per share) usually provides significant positive stock returns. In this special case, bmr has found to be positively related to stock returns.

Finance research has found that bmr influences earnings per share, which in turn influences current and future stock returns. It is some times that firms with low bmr do not experience significant high stock returns due to the high expectations of investors. This is an interesting finding in finance that is still being researched!

3 CHALLENGE 2: Winsorization of variables

You have to do your own research about winsorization. Explain it with your words (pay attention in class)

WINSORIZATION IS THE PROCESS OF FLATTENING EXTREME (OUTLIERS) OBSERVATIONS OF A VARIABLE WITH THE PURPOSE OF AVOIDING UNRELIABLE ESTIMATIONS OF BETA COEFFICIENTS WHEN INCLUDING THAT VARIABLE AS EXPLANATORY VARIABLE IN AN ECONOMETRIC MODEL.

THE WINSORIZATION PROCESS CAN BE APPLIED TO HIGH OR LOW (NEGATIVE) VALUES OF THE VARIABLE. FOR HIGH VALUES WE CAN INDICATE FROM WHICH PERCENTILE WE CONSIDER THE VALUES AS OUTLIERS, THEN THOSE VALUES ABOVE THAT PERCENTILE IS REPLACE WITH THE VALUE OF THAT PERCENTILE. FOR LOW VALUES WE INDICATE FROM WHICH PERCENTILE WE CONSIDER THE VALUES AS OUTLIERS, THEN THOSE VALUES BELOW THAT PERCENTILE IS REPLACE WITH THE VALUE OF THAT PERCENTILE.

Let’s review the histogram of the operating earnings per share to check whether there are extreme values:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.histplot(merged_data['oepsp'].dropna(), bins=50, kde=True)
plt.title('Distribution of Operational EPS deflated by Stock Price (oepsp)')
plt.xlabel('oepsp')
plt.ylabel('Frequency')
plt.show()

We can see that there are very, very few extreme values for oepsp; the maximum value is more than 10 million! You have to see the 1e7 in the X axix indicating the scale of oepsp; 1e7 is scientific notation for 10 million (1 x 10^7)

Then, we need to winsorize oepsp to avoid unreliable estimations of the logistic regression since we plan to use oepsp as the X independent variable.

You have to winsorize the following ratio at the 2 percentile for the left values and 98 percentile for the high values:

  • Earnings per share deflated by price

I use the quantile method of pandas to get the 2 and 95 percentiles, and the clip method to flatten the extreme values with the corresponding percentiles:

# Winsorize oepsp at 2% and 98%
lower_bound = merged_data['oepsp'].quantile(0.02)
upper_bound = merged_data['oepsp'].quantile(0.98)

merged_data['oepspw'] = merged_data['oepsp'].clip(lower=lower_bound, upper=upper_bound)

# Display the first few rows to confirm the new column
print("Merged DataFrame with winsorized oepsp (oepspw):")
Merged DataFrame with winsorized oepsp (oepspw):
merged_data.head()
  firm  year     revenue       cogs  ...     oepsp  bookvalue       bmr    oepspw
0    A  2000  10773000.0  5522000.0  ...  0.042144  5265000.0  0.210718  0.042144
1    A  2001   8396000.0  5166000.0  ... -0.058850  5659000.0  0.428065 -0.058850
2    A  2002   6010000.0  3694000.0  ... -0.191589  4627000.0  0.551637 -0.191589
3    A  2003   6056000.0  3762000.0  ... -0.052074  2824000.0  0.202836 -0.052074
4    A  2004   7181000.0  4058000.0  ...  0.032899  3569000.0  0.304188  0.032899

[5 rows x 45 columns]
merged_data[['oepsp','oepspw','annual_return']].describe()
              oepsp        oepspw  annual_return
count  5.917200e+04  59172.000000   53983.000000
mean            inf     -0.024825      -0.003683
std             NaN      0.283234       0.557446
min   -3.887477e+05     -1.409846     -10.685795
25%   -2.085332e-02     -0.020853      -0.196382
50%    4.857955e-02      0.048580       0.069674
75%    9.258618e-02      0.092586       0.280394
max             inf      0.301041       3.993313

Now I can check the histogram of the winsorized oepsp:

plt.figure(figsize=(10, 6))
sns.histplot(merged_data['oepspw'].dropna(), bins=50, kde=True)
plt.title('Distribution of Winsorized Operational EPS deflated by Stock Price (oepspw)')
plt.xlabel('oepspw')
plt.ylabel('Frequency')
plt.show()

Now the minimum value for oepspw is from about -1.30 to +0.25! we do not have extreme values. We can proceed using oepspw as the X predictor for a logistic regression model.

But before we need to bring the market annual return to our merged dataset.

Then, I need to create the dataset for annual market return, and then merge it with the merged_data.

4 CHALLENGE 3: Create an annual dataset for historical market returns

You have to create an annual dataset with annual market returns of the S&P500 index (^GSPC), from 2000 to 2023. You can download monthly quotes from Yahoo Finance using the yfinance library, and then aggregate by year, and then calculate continuously compounded returns.

Once you have the annual returns of the ^GSPC in a dataset, merge these values into the merged panel-dataset using the year as the common/matching column.

Here is the code to create this annual dataset for the market returns:

import yfinance as yf

# Download monthly data for S&P 500 (^GSPC) from Dec 1999 to Dec 2023
sp500_data = yf.download('^GSPC', start='1999-12-01', end='2023-12-31', interval='1mo')
YF.download() has changed argument auto_adjust default to True

[*********************100%***********************]  1 of 1 completed
# Keep only the last month's adjusted close price for each year
# I do a group by year and get the last value (each december for each year)
annual_sp500_data = sp500_data.groupby(sp500_data.index.year)['Close'].tail(1)

# Create a DataFrame for annual returns
annual_sp500_returns = pd.DataFrame(annual_sp500_data)
annual_sp500_returns.columns = ['adjprice']
annual_sp500_returns['year'] = annual_sp500_returns.index.year

# Calculate annual log returns
annual_sp500_returns['price_lag1'] = annual_sp500_returns['adjprice'].shift(1)
annual_sp500_returns['sp500_annual_return'] = np.log(annual_sp500_returns['adjprice'] / annual_sp500_returns['price_lag1'])

# Drop the intermediate columns
annual_sp500_returns = annual_sp500_returns[['year', 'sp500_annual_return']]

# Display the annual S&P 500 returns
print("Annual S&P 500 Returns:")
Annual S&P 500 Returns:
annual_sp500_returns.head()
            year  sp500_annual_return
Date                                 
1999-12-01  1999                  NaN
2000-12-01  2000            -0.106908
2001-12-01  2001            -0.139753
2002-12-01  2002            -0.266129
2003-12-01  2003             0.234126

Check that I did a group by year and used the tail method to get the december price for each year by firm.

I ended up with a small dataset of 24 rows (24 years of market returns) and 2 columns (year and sp500_annual_return)

Finally I merge this dataset with the panel dataset:

# Merge the S&P 500 annual returns with the merged_data DataFrame
merged_data = pd.merge(merged_data, annual_sp500_returns, on='year', how='left')

# Display the first few rows of the merged DataFrame to confirm
print("Merged DataFrame with S&P 500 annual returns:")
Merged DataFrame with S&P 500 annual returns:
merged_data.head()
  firm  year     revenue  ...       bmr    oepspw  sp500_annual_return
0    A  2000  10773000.0  ...  0.210718  0.042144            -0.106908
1    A  2001   8396000.0  ...  0.428065 -0.058850            -0.139753
2    A  2002   6010000.0  ...  0.551637 -0.191589            -0.266129
3    A  2003   6056000.0  ...  0.202836 -0.052074             0.234126
4    A  2004   7181000.0  ...  0.304188  0.032899             0.086118

[5 rows x 46 columns]

5 CHALLENGE 4: Logistic regression models with lagged values

Design and run a logistic regression to examine whether earnings per share deflated by price winsorized (epspw) is related to the probability that the future annual stock returns (1 year ahead) is higher than the future market annual return (1 year ahead).

Pay attention in class to learn how to run a logistic regression model, and how to indicate to use future or lagged values for variables in the model.

# import statsmodels to run the logistic regression:
import statsmodels.api as sm
# Prepare data - drop rows with NaN in relevant columns and ensure sorting for correct lagging
regression_data = merged_data.dropna(subset=['firm', 'year', 'oepspw', 'annual_return', 'sp500_annual_return']).copy()

# Sort data by firm and year to ensure correct lagging
regression_data = regression_data.sort_values(by=['firm', 'year'])

# Create the binary dependent variable: 1 if firm annual return > S&P 500 annual return, 0 otherwise
regression_data['firm_return_higher_than_market'] = (regression_data['annual_return'] > regression_data['sp500_annual_return']).astype(int)

# Shift the dependent variable one year ahead within each firm group
regression_data['firm_return_higher_than_market_forward1'] = regression_data.groupby('firm')['firm_return_higher_than_market'].shift(-1)

# Drop rows with NaN in the lagged dependent variable
regression_data = regression_data.dropna(subset=['firm_return_higher_than_market_forward1']).copy()

# Define the independent and dependent variables
X = regression_data['oepspw']
Y = regression_data['firm_return_higher_than_market_forward1']

# Add a constant to the independent variable for the intercept
X = sm.add_constant(X)

# Fit the logistic regression model
model = sm.Logit(Y, X)
result = model.fit()
Optimization terminated successfully.
         Current function value: 0.688519
         Iterations 5
# Print the summary of the regression results
print(result.summary())
                                      Logit Regression Results                                     
===================================================================================================
Dep. Variable:     firm_return_higher_than_market_forward1   No. Observations:                48491
Model:                                               Logit   Df Residuals:                    48489
Method:                                                MLE   Df Model:                            1
Date:                                   mié., 22 oct. 2025   Pseudo R-squ.:                0.004578
Time:                                             09:53:26   Log-Likelihood:                -33387.
converged:                                            True   LL-Null:                       -33541.
Covariance Type:                                 nonrobust   LLR p-value:                 9.361e-69
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.1023      0.009    -11.215      0.000      -0.120      -0.084
oepspw         0.6354      0.037     16.959      0.000       0.562       0.709
==============================================================================

You have to interpret the model

INTERPRETATION:

THE INTERPRETATION OF A LOGISTIC MODEL IS SIMILAR TO THAT OF THE LINEAR REGRESSION MODEL, BUT THERE IS AN IMPORTANT DIFFERENCE.

IN TERMS OF SIGN AND SIGNIFICANCE OF THE INDEPENDENT VARIABLE, WE INTERPRET THE LOGISTIC REGRESSION IN A SIMILAR WAY. IN THIS CASE, I CAN INTERPRET THE COEFFICIENT AND SIGNIFICANCE OF EARNINGS PER SHARE AS FOLLOWS:

SINCE THE BETA1 COEFFICIENT IS POSITIVE AND SIGNIFICANCE, THEN EARNINGS PER SHARE (DEFLATED BY PRICE) IS POSITIVELY AND SIGNIFICANTLY RELATED TO THE PROBABILITY THAT THE STOCK WILL HAVE A HIGHER RETURN IN THE FUTURE COMPARED TO THE FUTURE MARKET RETURN. IN OTHER WORDS, THE HIGHER THE EPSPW, THE MORE LIKELY THAT THE STOCK WILL BEAT THE MARKET RETURN ONE QUARTER LATER.

THE INTERPRETATION OF THE MAGNITUDE OF THE COEFFICIENT IS DIFFERENT THAN IN THE CASE OF LINEAR REGRESSION.

FIRST WE NEED TO CALCULATE THE EXPONENCIAL OF THE BETA1 COEFFICIENT:

# Get the beta1 coefficient (coefficient for 'oepspw') from the previous model
beta1_coefficient = result.params['oepspw']

# Calculate the exponential of the beta1 coefficient
exp_beta1 = np.exp(beta1_coefficient)

# Print the result
print(f"The exponential of the beta1 coefficient (oepspw) is: {exp_beta1:.4f}")
The exponential of the beta1 coefficient (oepspw) is: 1.8879

This new beta coefficient of 1.8879 can be interpreted as:

FOR EACH +1 CHANGE IN EPSPW, IT IS EXPECTED THAT THE ODDS RATIO (THE PROBABILITY THAT THE STOCK BEATS THE MARKET WITH RESPECT TO THE PROBABILITY THAT THE STOCK DO NOT BEAT THE MARKET) INCREASES IN 1.8879 UNITS. IN OTHER WORDS, A FIRM A HAS AN EPSPW THAT IS +1.00 HIGHER THAN A FIRM B, THEN THE FIRM A WILL BE 1.8879 TIMES MORE LIKELY TO BEAT THE MARKET COMPARED TO FIRM B.

THIS SOUNDS WEIRD SINCE A CHANGE OF +1.00 UNIT IN EPSPW IS ALMOST IMPOSSIBLE IN THE REAL WORLD; EPSPW IS A VARIABLE THAT USUALLY RANGES BETWEEN -0.3 AND +0.30. THEN, IT IS CONVENIENT TO DO ANOTHER CALCULATION FOR THE BETA COEFFICIENT:

betaepspw_odds = np.exp(beta1_coefficient * 0.1)
betaepspw_odds
np.float64(1.065606999081175)

I MULTIPLIED THE COEFFICIENT TIMES 0.1 TO CALCULATE THE BETA COEFFICIENT AS A PARTIAL CHANGE OF THE ODDS RATIO FOR A CHANGE OF +0.1 IN EPSPW.

IN THIS CASE, IF A FIRM A IMPROVES ITS EPSPW IN +0.1 UNIT FROM ONE QUARTER TO THE NEXT, THEN THIS FIRM WILL BE 1.0656 MORE LIKELY TO BEAT THE MARKET COMPARED TO CASE THAT THE FIRM DOES NOT CHANGE ITS EPSPW.

YOU CAN INTERPRET THE MAGNITUDE OF THE COEFFICIENT USING THE CASE OF 1 FIRM THAT INCREASES (OR DECREASES) ITS EPSPW FROM ONE QUARTER TO THE NEXT, OR COMPARING 2 FIRMS IN THE SAME QUARTER WITH DIFFERENT EPSPW VALUES.

6 CHALLENGE 5: Running your first Machine Learning model

Create a dataset with the following columns:

  • Future annual stock return (1 year later)
  • F1r_above_market (1=beat the market in the corresponding year; 0= otherwise)
  • Earnings per share deflated by price (epsp).

Create a training and testing sample: randomly select 80% of observations for the training sample and 20% for the testing sample.

Using the training sample, run the same logistic model to check whether epsp has explanatory power for the probability that the stock beats the market.

Create and interpret the confusion matrix

7 CHALLENGE 5 SOLUTION

It is strongly recommended to review the Chapter 1 : Classification from the course: Supervised Learning with scikit-learn

I will use the scikit-learn library to build a logistic regression model. This model will be trained on 80% of the data to predict the probability that a firm’s annual return will be higher than the S&P 500 annual return in the following year, using the winsorized operational earnings per share deflated by stock price (oepspw) as the predictor. The remaining 20% of the data will be used to test the model’s performance.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, ConfusionMatrixDisplay

# Prepare data - drop rows with NaN in relevant columns and ensure sorting for correct lagging
# Use the regression_data from the previous step which already has the lagged dependent variable
regression_data_ml = regression_data.copy()

# Define the independent and dependent variables (using the lagged dependent variable)
X_ml = regression_data_ml[['oepspw']] # X needs to be a DataFrame
Y_ml = regression_data_ml['firm_return_higher_than_market_forward1']

# Shuffle and split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X_ml, Y_ml, test_size=0.20, random_state=42, shuffle=True)

# Initialize and train the Logistic Regression model
model_ml = LogisticRegression()
model_ml.fit(X_train, Y_train)
LogisticRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
# Predict probabilities on the test dataset
Y_pred_proba = model_ml.predict_proba(X_test)[:, 1]

# Create the final prediction based on a probability threshold of 0.50
Y_pred = (Y_pred_proba > 0.50).astype(int)

# Create the Confusion Matrix
cm = confusion_matrix(Y_test, Y_pred)

# Display the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1]) # Assuming 0 and 1 are the class labels
disp.plot(cmap=plt.cm.Blues)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x0000026788D9A270>
plt.title('Confusion Matrix')
plt.show()

# Calculate statistics
accuracy = accuracy_score(Y_test, Y_pred)
precision = precision_score(Y_test, Y_pred)
recall = recall_score(Y_test, Y_pred)
f1 = f1_score(Y_test, Y_pred)

# Extracto the values from the confusion matrix:
tn, fp, fn, tp = cm.ravel()
# The ravel method pull all values of a matrix from top-left to bottom-right, and put them in different values: 

specificity = tn / (tn + fp)

all_actual0 = tn + fp
all_actual1 = fn + tp

print("\nModel Evaluation Statistics:")

Model Evaluation Statistics:
print(f"Accuracy: {accuracy:.4f}")
Accuracy: 0.5406
print(f"Precision: {precision:.4f}")
Precision: 0.5521
print(f"Recall (Sensitivity): {recall:.4f}")
Recall (Sensitivity): 0.1026
print(f"Specificity: {specificity:.4f}")
Specificity: 0.9267
print(f"F1-Score: {f1:.4f}")
F1-Score: 0.1730

The Confusion Matrix can also be calculated as a Pivot Table (like the Excel Pivot Table) using Y_test and Y_pred. Here is an alternative:

# Create a DataFrame from Y_test and Y_pred for the pivot table
pivot_data = pd.DataFrame({'Y_test': Y_test, 'Y_pred': Y_pred})

# Create the pivot table
confusion_pivot = pivot_data.pivot_table(index='Y_test', columns='Y_pred', aggfunc='size', fill_value=0)

# Display the pivot table (which is the confusion matrix)
print("Confusion Matrix (Pivot Table):")
Confusion Matrix (Pivot Table):
confusion_pivot
Y_pred     0    1
Y_test           
0.0     4777  378
1.0     4078  466

Here is an illustration of what the Confusion Matrix calculates:

All observations of the testing data-set are classified and counted according to the following table:

Predicted: Positive Predicted: Negative
Actual: Positive True Positive (TP) False Negative (FN)
Actual: Negative False Positive (FP) True Negative (TN)

In each cell the number of cases (observations) are shown.

Positive/Negative refers whether the event happens or not

True/False refers whether the model correctly predicted (true) or not (false)

TP (True Positive): the model correctly predicted positive (that the event happened)

TN (True Negative): the model correctly predicted negative (that the event did not happened)

FP (False Positive): the model incorrectly predicted positive (Type I error)

FN (False Negative): the model incorrectly predicted negative (Type II error)

INTERPRETATION

The diagonal of the confusion matrix has the cases that my model CORRECTLY PREDICTED whether the stock beat or did not beat the market.

Then, looking at the matrix, we see the following:

  • The sum of the FIRST ROW= 5155= # of CASES when the stock actually DID NOT BEAT the market
  • The sum of the SECOND ROW= 4544= # of CASES when the stock actually BEAT the market
  • Out of the 4544 cases when the stocks ACTUALLY BEAT THE MARKET, the model CORRECTLY PREDICTED 466 cases, which are the TRUE POSITIVES cases. The rate of TRUE POSITIVES with respect to ALL POSITIVES is called Sensitivity or Recall, which is 0.1026.
  • Out of the 5155 cases when the stock DID NOT BEAT THE MARKET, the model CORRECTLY PREDICTED 4777 cases. This # is called: TRUE NEGATIVES. The rate (or %) of TRUE NEGATIVES with respect to ALL NEGATIVES is called Specificity, which is 0.9267.
  • Out of the 4544 cases that the stock ACTUALLY BEAT THE MARKET, the model WRONGLY PREDICTED 4078 cases. This # is called: FALSE NEGATIVES.
  • Out of the 5155 cases when the stock DID NOT BEAT THE MARKET, the model WRONGLY PREDICTED 378 cases. This # is called: FALSE POSITIVES.

Accuracy (0.5406): This is the overall proportion of correct predictions (both true positives and true negatives) out of all predictions. An accuracy of 0.5344 means the model was correct about 54.0571% of the time.

Precision (0.5521): This measures the accuracy of the positive predictions. Of all the times the model predicted that the firm’s return would be higher than the market return, it was correct about 55.2133% of the time.

The model has a very high specificity but very low recall. This means it is quite good at predicting when a stock will not outperform the market, but it is very poor at predicting when a stock will outperform the market.

Remember that the Sensitivity and the Specificity rates are:

SensitivityRate = \frac{TRUEPOSITIVE}{(TRUEPOSITIVE+FALSENEGATIVE)}

SpecificityRate = \frac{TRUENEGATIVE}{(TRUENEGATIVE+FALSEPOSITIVE)}

If Sensitivity is greater than Specificity this means that the model is better predicting cases when THE EVENT ACTAULLY HAPPENED (when a stock actually beat the market) vs cases when THE EVENT DID NOT HAPPEN (when a stock did not beat the market)

The PRECISION RATIO is:

Precision = \frac{TRUEPOSITIVE}{(TRUEPOSITIVE+FALSEPOSITIVE)}

The NEGATIVE PREDICTED RATIO is:

NegPredValueRate = \frac{TRUENEGATIVE}{(TRUENEGATIVE+FALSENEGATIVE)}

We can do a ROC (Receiver Operator Characteristic) PloT.

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Calculate the False Positive Rate (FPR) and True Positive Rate (TPR) for different probability thresholds
fpr, tpr, thresholds = roc_curve(Y_test, Y_pred_proba)

# Calculate the Area Under the ROC Curve (AUC)
roc_auc = roc_auc_score(Y_test, Y_pred_proba)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random guess')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Recall)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()

The ROC curve plots the True Positive Rate (sensitivity) against the False Positive Rate (1 - specificity) for different possible classification thresholds.

The Area Under the Curve (AUC) is a measure of the model’s ability to distinguish between the positive and negative classes. An AUC of 1.0 represents a perfect model, while an AUC of 0.5 represents a model that performs no better than random chance.

The model’s AUC is 0.5628. A higher AUC indicates better model performance.

As long as the area is greater than 0.5 this means that the model is better than a naive or a totally random model.

We can analyze how Precision, Sensitivity and Specificity change when the probability threshold changes (moves from the original 0.50 probability). We do this since we can change this probability threshold without modifying the logit model to see if we improve any of these accuracy ratios.

In the next code I do a loop by changing the probability threshold from 0 to 1 and calculate and plot the relevant ratios for each probability threshold:

# Analyze the impact of different probability thresholds on sensitivity, specificity, and precision

thresholds = np.arange(0.1, 1.0, 0.05) # Define a range of thresholds to test
sensitivity_scores = []
specificity_scores = []
precision_scores = []

for threshold in thresholds:
    # Create predictions based on the current threshold
    Y_pred_threshold = (Y_pred_proba > threshold).astype(int)

    # Calculate confusion matrix
    cm_threshold = confusion_matrix(Y_test, Y_pred_threshold)

    # Extract TN, FP, FN, TP
    # The ravel method pull all values of a matrix from top-left to bottom-right, and put them in different values. 
    tn, fp, fn, tp = cm_threshold.ravel()

    # Calculate sensitivity (Recall)
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0
    sensitivity_scores.append(sensitivity)

    # Calculate specificity
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
    specificity_scores.append(specificity)

    # Calculate precision
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    precision_scores.append(precision)


# Display the results
print("Sensitivity, Specificity, and Precision for different probability thresholds:")
Sensitivity, Specificity, and Precision for different probability thresholds:
for i in range(len(thresholds)):
    print(f"Threshold: {thresholds[i]:.2f}, Sensitivity: {sensitivity_scores[i]:.4f}, Specificity: {specificity_scores[i]:.4f}, Precision: {precision_scores[i]:.4f}")
Threshold: 0.10, Sensitivity: 1.0000, Specificity: 0.0000, Precision: 0.4685
Threshold: 0.15, Sensitivity: 1.0000, Specificity: 0.0000, Precision: 0.4685
Threshold: 0.20, Sensitivity: 1.0000, Specificity: 0.0000, Precision: 0.4685
Threshold: 0.25, Sensitivity: 1.0000, Specificity: 0.0000, Precision: 0.4685
Threshold: 0.30, Sensitivity: 0.9857, Specificity: 0.0223, Precision: 0.4705
Threshold: 0.35, Sensitivity: 0.9784, Specificity: 0.0326, Precision: 0.4713
Threshold: 0.40, Sensitivity: 0.9635, Specificity: 0.0533, Precision: 0.4729
Threshold: 0.45, Sensitivity: 0.9124, Specificity: 0.1331, Precision: 0.4813
Threshold: 0.50, Sensitivity: 0.1026, Specificity: 0.9267, Precision: 0.5521
Threshold: 0.55, Sensitivity: 0.0000, Specificity: 1.0000, Precision: 0.0000
Threshold: 0.60, Sensitivity: 0.0000, Specificity: 1.0000, Precision: 0.0000
Threshold: 0.65, Sensitivity: 0.0000, Specificity: 1.0000, Precision: 0.0000
Threshold: 0.70, Sensitivity: 0.0000, Specificity: 1.0000, Precision: 0.0000
Threshold: 0.75, Sensitivity: 0.0000, Specificity: 1.0000, Precision: 0.0000
Threshold: 0.80, Sensitivity: 0.0000, Specificity: 1.0000, Precision: 0.0000
Threshold: 0.85, Sensitivity: 0.0000, Specificity: 1.0000, Precision: 0.0000
Threshold: 0.90, Sensitivity: 0.0000, Specificity: 1.0000, Precision: 0.0000
Threshold: 0.95, Sensitivity: 0.0000, Specificity: 1.0000, Precision: 0.0000
# Plot the metrics vs. threshold
plt.figure(figsize=(10, 6))
plt.plot(thresholds, sensitivity_scores, marker='o', label='Sensitivity (Recall)')
plt.plot(thresholds, specificity_scores, marker='o', label='Specificity')
plt.plot(thresholds, precision_scores, marker='o', label='Precision')
plt.xlabel('Probability Threshold')
plt.ylabel('Score')
plt.title('Model Performance Metrics vs. Probability Threshold')
plt.legend()
plt.grid(True)
plt.show()

We can ses that Sensitivity and Specificity are like mirrors to each other. In other words, when probability threshold increases, Specificity improves, but Sensitivity get worse in a similar proportion.

In this case, we might be interested in maximizing Precision to maximize the probability of correctly guessing when we predict that the stock will beat the market. Then, we can identify the exact probability threshold that maximizes Precision:

precision_scores
[np.float64(0.46850190741313535), np.float64(0.46850190741313535), np.float64(0.46850190741313535), np.float64(0.46850190741313535), np.float64(0.47053261897258114), np.float64(0.47132407505565566), np.float64(0.4728883128105422), np.float64(0.4812536273940801), np.float64(0.5521327014218009), 0, 0, 0, 0, 0, 0, 0, 0, 0]
# Find the threshold that maximizes Precision
max_precision = max(precision_scores)
max_precision_index = precision_scores.index(max_precision)
threshold_maximizing_precision = thresholds[max_precision_index]

print(f"\nThe probability threshold that maximizes Precision is: {threshold_maximizing_precision:.2f}")

The probability threshold that maximizes Precision is: 0.50
print(f"The maximum Precision achieved is: {max_precision:.4f}")
The maximum Precision achieved is: 0.5521

In this case, the probability threshold that maximizes precision is 0.50, so we keep the original threshold of 50%.

It depends on the context, we can use a different probability threshold to maximize any ratio (Sensitivity, Specificity, Precision), and then re-calculate the binary predictions and re-do the Confusion Matrix to evaluate the model, and use it for final predictions.