Workshop 01 Solution, Financial Modeling and Programming

Author

Alberto Dorantes, Ph.D.

Published

November 10, 2025

Abstract
This is an INDIVIDUAL workshop. In this workshop we review the logistic regression model applied to fundamental analysis in Finance. In addition, we practice data management programming skills using a big panel-dataset of historical financial statement variables.

1 Introduction

In this workshop we practice data management with financial historical data, and review the logit regression model and its application to fundamental analysis.

We will work with a real dataset of all historical financial variables of ALL US public firms that belong to the NYSE and the NASDAQ exchanges.

We will use a logit model to examine whether some financial ratios are related to the probability that future stock return (1 year later) is higher than the median return of the corresponding industry (1 year later).

We will learn basic programming skills for the required data management and data modeling.

You have to work with 2 datasets:

  • firmsus2024.csv: List of all US public firms with general information of each firm

  • dataus2024.csv: Panel data with historical quarterly financial data for all US public firms.

You have to download these 2 files from a web page and save them in the directory where you have your workshop.

The first dataset (firmsus2024) is a catalog of all firms along with the corresponding industry type and status (active or cancelled), while the second dataset (dataus2024) contains the historical financial data of the firms.

The dataus2024 dataset has a panel-data (also called long format) structure. Each row has financial information for one US firm and one period (a quarter). All $ amounts are in thousands ('000s). Here is a data dictionary of the columns:

Data dictionary of historical quarterly financial data.
Variable Description
firm Unique code of the company (also called ticker)
q Quarter date
fiscalmonth Month of the year when the firm closes a fiscal year
revenue Total sales of the firm from the first fiscal quarter to the current quarter
cogs Cost of goods sold - variable costs of the products sold - from the first fiscal quarter to the current quarter
sgae Sales and general administrative expenses - from the first fiscal quarter to the current quarter
otherincome Other operational income/expenses that are not directly from the core operations of the firm - from the first fiscal quarter to the current quarter
extraordinaryitems Extra income/expenses not related to regular operations - from the first fiscal quarter to the current quarter
finexp Financial expenses - interest expenses paid (generated from loans) - from the first fiscal quarter to the current quarter
incometax Income tax from the first fiscal quarter to the current quarter
totalassets Total assets of the firm at the end of the quarter
currentassets Current assets of the firm at the end of the quarter
totalliabilities Total liabilities of the firm at the end of the quarter
currentliabilities Current liabilities of the firm at the end of the quarter
longdebt Balance of long-term financial debt (loans to pay longer than 1 year)
adjprice Stock adjusted price at the end of the quarter; adjusted for stock splits and dividend payments; used to calculate stock returns
originalprice Historical stock price (not adjusted); used to calculate historical market value
sharesoutstanding Historical number of shares available in the market
fixedassets Fixed assets value at the end of the quarter
cashinbanks Cash balance at the end of the quarter
yearf Fiscal year - this depends on when the firm ends its fiscal year; if fiscalmonth=12 in quarter 3, then the fiscal year starts in the Q4 of a year and ends in the Q3 of the following year

Each row of this dataset has quarterly financial data of one firm in one quarter. All firms have quarters from Q1 2000 to Q2 2024. Not all firms have existed since 2000, so if the first quarters are empty that means that the firm did not exist in the US financial market in those quarters. Then, it is possible to know when each firm went public to issue shares in the financial market: the first quarter with some non-empty data.

Each firm has defined the month of the year used to close a fiscal year. For example, Apple closes the fiscal year at the end of Quarter 3 (end of September) of any year. Then, for Apple, in the Q3 of 2022, there will be a 12 for the fiscalmonth variable. In this case, Apple starts its fiscal year in the Q4 of each year and ends in the Q3 of the following year. Most of the firms (about 80%) close fiscal year in December, so these firms will have a 12 in the Q4 of each year.

The variables related to sales and expenses are cumulative for each fiscal year. For example, Apple sold about $117 billion in the last calendar quarter (Q4) of 2022, but this is the first fiscal quarter for Apple. For Q1 (calendar) 2023 (which is the 2nd fiscal quarter), Apple has about $212 billion in the revenue variable, meaning that considering fiscal quarters 1 and 2, Apple has sold $212 billion. For Q2 (calendar) 2023 (the 3rd fiscal quarter), Apple has about $293 billion, meaning that the cumulative revenue of fiscal Q1, Q2, and Q3 is about $293 billion. Then, if you select rows with fiscalmonth=12, you will be selecting the quarters with annual financial information for each firm!
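
For instance, a minimal pandas sketch of this selection (assuming the panel is already loaded as the uspanel DataFrame created in Section 3.1):

# Keep only the quarters that close a fiscal year; in these rows the YTD
# income-statement columns (revenue, cogs, sgae, etc.) hold full-fiscal-year amounts
annual_rows = uspanel[uspanel['fiscalmonth'] == 12]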

Earnings before interest and Taxes (ebit) and Net Income (netincome) must be calculated as:

ebit = revenue - cogs - sgae - depreciation

netincome = ebit + otherincome + extraordinaryitems - finexp - incometax

The firmsus2024.csv is a catalog of all active and cancelled US firms:

Variable Description
firm Unique code of the company (also called ticker)
name Name of the firm
status Status of the firm: active or cancelled
partind Percent participation in the S&P500 market index
naics1 North American Industry Classification Code - Level 1
naics2 North American Industry Classification Code - Level 2
SectorEconomatica Economatica Industry classification

2 Review of logit regression and its applications in Finance

The logit model is one type of non-linear regression model where the dependent variable is binary. The logistic or logit model is used to examine the relationship between one or more quantitative variables and the probability of an event happening (1=the event happens; 0 otherwise). For example, a bank that gives loans to businesses might be very interested in knowing which are the factors/variables/characteristics of firms that are more related to loan defaults. If the bank understands these factors, then it can improve its decisions about which firms deserve a loan, and minimize the losses due to loan defaults.

In this workshop, we define the event to be whether a firm in a specific quarter has a higher stock return than the median return of the firms within its industry. If the stock return is higher than the median return, then we codify the binary variable as 1; 0 otherwise.

Then, in this case, the dependent variable of the regression is the binary variable: 1 if the stock outperforms the median of its industry and 0 otherwise. The independent or explanatory variables can be any financial indicator/ratio/variable that we believe is related to the likelihood that the stock outperforms its industry median in the near future.

We can define this logistic model using the following mathematical function. Imagine that Y is the binary dependent variable, then the probability that the event happens (Event=1) can be defined as:

Prob(Event=1)=f(X_{1},X_{2},...,X_{n})

The binary variable Event can be either 1 or 0, but the probability of Event=1 is a continuous value from 0 to 1. The function is a non-linear function defined as follows:

Prob(Event=1)=\frac{1}{1+e^{-\left(b_{0}+b_{1}X_{1}+b_{2}X_{2}+...+b_{n}X_{n}\right)}}

As we can see, the argument of the exponential function is actually a traditional regression equation. We can re-express this equation as follows:

Y=b_{0}+b_{1}X_{1}+b_{2}X_{2}+...+b_{n}X_{n}

Now we use Y in the original non-linear function:

Prob(Event=1)=\frac{1}{1+e^{-Y}}

This is a non-linear function since the value of the function does not move in a linear way with a change in the value of one independent variable X.

Let’s work with an example of a model with only 1 independent variable X_1. Imagine that X_1 is the variable earnings per share (eps) of a firm, and the event is that the firm beats the market. Then, let’s do a simple example with specific values for b_0, b_1, and a range of values for eps from -1 to 1 jumping by 0.1:

import numpy as np
# Create a numeric vector from -1 to 1 by 0.1
eps = np.arange(-1.0, 1.0 + 1e-9, 0.1)  # inclusive end
b0 = -4
b1 = 10.0

# Regression-like linear index and logistic transform
Y = b0 + b1 * eps
prob = 1.0 / (1.0 + np.exp(-Y))

# Looking at the values of eps and the probabilities
print(eps)
[-1.00000000e+00 -9.00000000e-01 -8.00000000e-01 -7.00000000e-01
 -6.00000000e-01 -5.00000000e-01 -4.00000000e-01 -3.00000000e-01
 -2.00000000e-01 -1.00000000e-01 -2.22044605e-16  1.00000000e-01
  2.00000000e-01  3.00000000e-01  4.00000000e-01  5.00000000e-01
  6.00000000e-01  7.00000000e-01  8.00000000e-01  9.00000000e-01
  1.00000000e+00]
print(prob)
[8.31528028e-07 2.26032430e-06 6.14417460e-06 1.67014218e-05
 4.53978687e-05 1.23394576e-04 3.35350130e-04 9.11051194e-04
 2.47262316e-03 6.69285092e-03 1.79862100e-02 4.74258732e-02
 1.19202922e-01 2.68941421e-01 5.00000000e-01 7.31058579e-01
 8.80797078e-01 9.52574127e-01 9.82013790e-01 9.93307149e-01
 9.97527377e-01]

Plotting the probability values as Y and eps as X:

import matplotlib.pyplot as plt

# Plot the logistic curve
plt.plot(eps, prob)
plt.title("A logistic function: expected probability of an event given values of eps")
plt.xlabel("eps")
plt.ylabel("prob")
plt.grid(True)
plt.show()

Here we can see that the function is not linear with changes in eps. With these coefficients, the probability rises steeply around eps = 0.4 (where it equals 0.5): for eps roughly between 0.2 and 0.6 the probability that the firm beats the market increases very fast, while below 0.2 it stays close to zero and beyond about 0.7 it grows very slowly with any further increase in eps.

The interpretation of the magnitude of the coefficients b_0 and b_1 in a logistic regression is not quite the same as in the case of multiple regression. However, the interpretation of the sign of a coefficient (positive or negative) and its level of significance (p-value) is the same as in the multiple regression model. What we can say up to now is that if b_1 is positive and significant (its p-value<0.05), then the variable, in this case eps, is significantly and positively related to the probability that a firm's return outperforms its industry median return.

Before going to the interpretation of the magnitude of a coefficient, here is a quick explanation of how the logistic regression works and how it is estimated in specialized software (such as Python or R).

Let’s continue with the same event, which is that the firm return beats its industry median (in other words, that the firm return is higher than the median return of its industry). Then:

p = probability that the firm beats its industry (Event=1); or that the event happens.

(1-p) = probability that the firm DOES NOT beat its industry (Event=0); or that the event does not happen.

To have a dependent variable that can get any numeric value from a negative value to a positive value, we can do the following mathematical transformation with these probabilities:

Y=log\left(\frac{p}{1-p}\right)

The ratio \left(\frac{p}{1-p}\right) is called the odds ratio:

ODDSRATIO=\left(\frac{p}{1-p}\right)

The odds ratio is the ratio of the probability of the event happening to the probability of the event NOT happening. Since p can have a value from 0 to 1, the possible values of ODDSRATIO go from 0 (when p=0) to infinity (as p approaches 1). Since we want a variable that can take values from any negative to any positive value, we can apply the logarithmic function, so the range of this log goes from any negative value to any positive value.
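
A quick numeric illustration of this range (a sketch with made-up probabilities):

import numpy as np
# Odds ratio and log-odds for a few example probabilities
p = np.array([0.1, 0.5, 0.9])
odds = p / (1 - p)       # [0.111..., 1.0, 9.0]
log_odds = np.log(odds)  # [-2.197, 0.0, 2.197]: symmetric around p = 0.5
print(odds, log_odds)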

Now that we have a transformed variable Y (the log of ODDSRATIO) that uses the probability p, then we can use this variable as the dependent variable for our regression model:

Y=log\left(\frac{p}{1-p}\right)=b_{0}+b_{1}X_{1}

This mathematical trick helps us use a linear model to model a non-linear relationship!

Then, with this transformation we can estimate a linear regression model (don't worry, Python estimates it), so the coefficients b_0 and b_1 define the logarithm of the odds ratio, not the actual probability p of the event happening!

How can we interpret the magnitude of the beta coefficients of this regression? Let’s do a mathematical trick from the previous equation. We can apply the exponential function to both sides of the equation:

e^{Y}=e^{log\left(\frac{p}{1-p}\right)}=e^{\left(b_{0}+b_{1}X_{1}\right)}=\left(\frac{p}{1-p}\right)=ODDSRATIO

Following the rule of exponents, we can express this equation as:

e^{Y}=e^{b_{0}}e^{b_{1}X_{1}}=ODDSRATIO

Let's see what happens with ODDSRATIO if X_1 increases by 1 unit:

e^{b_{0}}e^{b_{1}(X_{1}+1)}=e^{b_{0}}e^{b_{1}X_{1}}e^{b_{1}}=ODDSRATIO*(e^{b_{1}})

Then, if X_1 increases by one unit, the ODDSRATIO will be equal to the previous ODDSRATIO times e^{b_1}. In other words, e^{b_1} is the factor that indicates how many times the ODDSRATIO changes with a +1-unit change in X_1.

Then:

  • If e^{b_1} = 1, then the ODDSRATIO does not change, meaning that there is no relationship between the variable X_1 and the probability of the event.

  • If e^{b_1} > 1, then the ODDSRATIO will grow by this factor, meaning that there is a positive relationship between X_1 and the probability of the event.

  • If e^{b_1} < 1, then the ODDSRATIO will decrease by this factor, meaning that there is a negative relationship between X_1 and the probability of the event.

Then, when we see the output of a logistic regression, we need to apply the exponential function to the coefficients to provide a meaningful interpretation of the magnitude of the coefficients in the model.
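
For example, a small sketch using the hypothetical coefficients from the eps example above (b0 = -4, b1 = 10):

import numpy as np
b1 = 10.0
# Factor applied to the odds ratio when eps increases by 1 full unit
print(np.exp(b1))        # about 22026 - a +1 move is huge on this scale
# A +0.1 move in eps is more realistic for a ratio-like variable
print(np.exp(0.1 * b1))  # about 2.72 - the odds ratio multiplies by ~2.7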

If we want to estimate the probability p for any value of X_1, we just need to do some algebraic manipulations to the previous equation:

Y=log\left(\frac{p}{1-p}\right)=b_{0}+b_{1}X_{1}

To get p from this equation, we can apply the exponential function to both sides:

e^{Y}=\left(\frac{p}{1-p}\right)

To leave p alone, we multiply both sides by (1-p):

e^{Y}\left(1-p\right)=p

We continue playing with the terms to leave p alone:

e^{Y}-pe^{Y}=p

e^{Y}=p+pe^{Y}

e^{Y}=p(1+e^{Y})

p=\frac{e^{Y}}{\left(1+e^{Y}\right)}

Dividing the numerator and the denominator by e^{Y}:

p=\frac{1}{\left(\frac{1}{e^{Y}}+1\right)}=\frac{1}{\left(e^{-Y}+1\right)}

This is the same mathematical equation we had used above to illustrate the non-linear relationship between the explanatory variables and the probability of an event happening!

Fortunately, Python performs all these calculations automatically when you run the predict function! So, you do not have to memorize these steps; I just wanted to show, out of curiosity, how the logistic model estimates predicted probabilities.
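
As an illustration of the predict function, here is a self-contained sketch with statsmodels on simulated (made-up) data; the coefficients are the hypothetical ones from the eps example above:

import numpy as np
import statsmodels.api as sm

# Simulate a binary event whose probability follows a logistic curve in eps
rng = np.random.default_rng(42)
eps = rng.uniform(-1, 1, 500)
p_true = 1 / (1 + np.exp(-(-4 + 10 * eps)))
event = rng.binomial(1, p_true)

# Fit the logit model and let statsmodels compute predicted probabilities
X = sm.add_constant(eps)
result = sm.Logit(event, X).fit(disp=0)
print(result.predict(X)[:5])  # P(Event=1) for the first 5 observations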

3 Data collection

The dataset we will use for this workshop and for the final project is located on a public web site.

This dataset has real historical information of all US public firms listed in NASDAQ and NYSE. It has data from Q1 2000 to Q2 2024. You can download the dataset as follows.

3.1 Importing data

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
# Download the first file
url1 = 'https://www.apradie.com/datos/firmsus2024.csv'
response1 = requests.get(url1, headers=headers)
with open('usfirms.csv', 'wb') as f:
    f.write(response1.content)
747426
# Download the second file
url2 = 'https://www.apradie.com/datos/usdataq.csv'
response2 = requests.get(url2, headers=headers)
with open('uspanel.csv', 'wb') as f:
    f.write(response2.content)
55336905

Importing the csv files:

import pandas as pd
# Import as data frames
usfirms = pd.read_csv('usfirms.csv')
uspanel = pd.read_csv('uspanel.csv')

# Display the first few rows of each DataFrame to confirm
usfirms.head()
    empresa  ...  SectorEconomatica
0         A  ...  Electroelectronic
1        AA  ...  Siderur & Metalur
2  AABA_old  ...   Software y Datos
3   AAC_old  ...              Otros
4  AAIC_old  ...             Fondos

[5 rows x 7 columns]
uspanel.head()
  firm       q  fiscalmonth  ...  sharesoutstanding  fixedassets   yearf
0    A  2000q1          NaN  ...         452000.000          NaN     NaN
1    A  2000q2          6.0  ...         452271.967    1453000.0  2000.0
2    A  2000q3          9.0  ...         453014.579    1581000.0  2000.0
3    A  2000q4         12.0  ...         456366.381    1741000.0  2000.0
4    A  2001q1          3.0  ...         456769.737    1821000.0  2001.0

[5 rows x 27 columns]

4 CHALLENGE 1 - CALCULATE QUARTERLY REVENUE AND EBIT FROM YTD AMOUNTS

All income-statement variables in the uspanel dataset are Year-to-Date (YTD) amounts starting in the first fiscal quarter.

For example, if you look at revenue of a firm in the fiscal Q3 (fiscalmonth==9), that amount will be the total revenue of the firm starting in the first fiscal Q1 until the fiscal Q3.

Remember that in the US, public firms can end their fiscal year in any of the 4 calendar quarters.

Using the uspanel dataset calculate EBIT as: revenue - cogs - sgae - depreciation

Remember that the YTD amount is a cumulative amount from the first fiscal quarter up to any fiscal quarter. Then, what might be the algorithm/code to create the quarter-amount for revenue and EBIT ?

CHALLENGE 1 - WRITE YOUR ALGORITHM IN YOUR OWN WORDS, AND WRITE THE PYTHON CODE TO CREATE THE QUARTERLY REVENUE AND EBIT AMOUNTS IN THE USPANEL DATASET

4.1 CHALLENGE 1 SOLUTION

The input is the uspanel dataset, which has quarterly data of many US firms from 2000 to 2024.

The output will be the same uspanel, but with 3 new columns:

  • ebit_ytd: ebit fiscal Year-to-date (YTD): cumulative ebit from the first fiscal Q of the year up to the quarter q

  • ebit_quarterly: ebit in the quarter q

  • revenue_quarterly: revenue in the quarter q

My algorithm is the following:

  1. Generate the column ebit_ytd as:

    ebit_ytd = revenue - cogs - sgae - depreciation

All income-statement columns in the dataset are fiscal YTD amounts.

  2. For the columns revenue and ebit_ytd do the following:

    • Make sure the dataset is sorted by firm-yearf-quarter, so that I can calculate quarterly amounts from the Year-to-date amounts by taking the difference between the amount of a firm-quarter and its own value in the previous quarter of the same fiscal year.

    • Using the uspanel dataset, and grouping by firm-yearf, generate the columns revenue_quarterly and ebit_quarterly as follows:

      • revenue_quarterly = revenue - revenue of the previous quarter

      Since the first fiscal quarter has no previous quarter in the same fiscal year, for the first quarter of the fiscal year assign the YTD revenue value itself.

      Do the same for ebit_quarterly:

      • ebit_quarterly = ebit_ytd - ebit_ytd of the previous quarter

      Again, since the first fiscal quarter has no previous quarter in the same fiscal year, for the first quarter of the fiscal year assign the YTD ebit value itself.

I now write the Python code according to the previous algorithm:

I create the ebit_ytd column:

uspanel['ebit_ytd'] = uspanel['revenue'] - uspanel['cogs'] - uspanel['sgae'] - uspanel['depreciation']

I create the quarterly revenue:

# Make sure the dataset is sorted by firm-yearf-q
uspanel = uspanel.sort_values(by=['firm','yearf','q'])
# I create revenue_quarterly:
# I group by firm-year, so each group will have 4 quarters. Then, I get the first difference
#  with diff(), and finally, for all fiscal Q1 I will get null value, so I assign the revenue value
uspanel['revenue_quarterly'] = uspanel.groupby(['firm','yearf'])['revenue'].diff().fillna(uspanel['revenue'])

I do the same for ebit:

# I create ebit_quarterly:
# I group by firm-year, so each group will have 4 quarters. Then, I get the first difference
#  with diff(), and finally, for all fiscal Q1 I will get null value, so I assign the ebit_ytd value
uspanel['ebit_quarterly'] = uspanel.groupby(['firm','yearf'])['ebit_ytd'].diff().fillna(uspanel['ebit_ytd'])
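
To double-check that this diff/fillna logic does what I described, here is a toy example with made-up YTD revenue numbers for one firm and one fiscal year:

import pandas as pd
# Hypothetical YTD revenue: 10, 25, 45, 70 across the 4 fiscal quarters
toy = pd.DataFrame({'firm': ['X'] * 4, 'yearf': [2020] * 4,
                    'revenue': [10.0, 25.0, 45.0, 70.0]})
# Same logic as above: first difference within the firm-year group,
# and the fiscal Q1 (NaN after diff) takes its own YTD value
toy['revenue_quarterly'] = (toy.groupby(['firm', 'yearf'])['revenue']
                            .diff().fillna(toy['revenue']))
print(toy['revenue_quarterly'].tolist())  # [10.0, 15.0, 20.0, 25.0]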

5 CHALLENGE 2 - GROWTH OF THE US FINANCIAL MARKET

Using the panel dataset, you have to understand how the US financial market has been growing over the years starting from the oldest year to the last complete fiscal year (2023).

You have to examine how much the US has grown in terms of:

  • Total market value of all firms by calendar year

    • Calculate marketvalue = originalprice * sharesoutstanding
  • Total revenue of all firms by calendar year

  • Total Net Income of all firms by calendar year

    • Calculate netincome = ebit + otherincome + extraordinaryitems - finexp - incometax

You have to create a table of these 3 totals by year (years as rows)

Which graph(s) would you design to better understand the US financial growth in terms of these variables?

5.1 CHALLENGE 2 SOLUTION

I start creating the marketvalue column quarter by quarter.

uspanel['marketvalue'] = uspanel['originalprice'] * uspanel['sharesoutstanding']

Market value is the value of each firm at the end of each quarter; unlike the income-statement variables, it is a level amount (not a cumulative YTD amount), so it does not need the YTD-to-quarterly conversion.

I calculate net income, but as quarterly amounts, as I did with revenue and ebit.

I first create the YTD netincome:

uspanel['netincome_ytd'] = uspanel['ebit_ytd'] - uspanel['finexp'] - uspanel['incometax'] + uspanel['extraordinaryitems'] + uspanel['otherincome']

I calculate quarterly netincome:

# I create netincome_quarterly:
# I group by firm-year, so each group will have 4 quarters. Then, I get the first difference
#  with diff(), and finally, for all fiscal Q1 I will get null value, so I assign the netincome_ytd value
uspanel['netincome_quarterly'] = uspanel.groupby(['firm','yearf'])['netincome_ytd'].diff().fillna(uspanel['netincome_ytd'])

Now I can group by calendar year, which is the year column, and do the following aggregations:

  • For revenue_quarterly and netincome_quarterly, sum the quarterly amounts

  • For marketvalue, take the value of the last quarter of the year

The yearf is the fiscal year. The calendar year is in the quarter column q, so I create a column for the calendar year:

uspanel['year'] = uspanel['q'].str[:4].astype(int)

Since I need to do different aggregations for the variables, I create a panel dataset by firm-year. After this, I can do a summary table by calendar year.

I create a panel annual dataset with marketvalue, revenue and netincome for each firm-year:

uspanely = uspanel.groupby(['firm','year']).agg(
    # I get the last value of the quarter for marketvalue 
    marketvalue=('marketvalue', 'last'),
    # I sum the quarter values for revenue_quarterly and netincome_quarterly:
    revenue=('revenue_quarterly', 'sum'),
    netincome=('netincome_quarterly', 'sum')
).reset_index()

print(uspanely.head())
  firm  year   marketvalue     revenue  netincome
0    A  2000  4.700800e+07  10773000.0   757000.0
1    A  2001  1.321995e+07   8396000.0   174000.0
2    A  2002  8.387759e+06   6010000.0 -1032000.0
3    A  2003  1.392260e+07   6056000.0 -2058000.0
4    A  2004  1.173287e+07   7181000.0   349000.0

With this annual panel dataset, I have the marketvalue, revenue and netincome for each firm-year.

Then, I can get a summary table (like a pivot table) by year and just sum the 3 variables:

annualdata = uspanely.groupby('year').agg(
  # I get the sum for the 3 variables:
    marketvalue=('marketvalue', 'sum'),
    revenue=('revenue', 'sum'),
    netincome=('netincome', 'sum')
).reset_index()

print(annualdata.head())
   year   marketvalue       revenue     netincome
0  2000  1.380230e+10  7.364689e+09  4.898300e+08
1  2001  1.185441e+10  7.736169e+09  1.335694e+08
2  2002  9.258813e+09  7.432325e+09  6.608115e+07
3  2003  1.214979e+10  8.030068e+09  5.434120e+08
4  2004  1.363718e+10  8.976372e+09  6.262368e+08
print(annualdata.tail())
    year   marketvalue       revenue     netincome
20  2020  4.119196e+10  1.574771e+10  7.871825e+08
21  2021  5.149576e+10  1.853319e+10  2.060694e+09
22  2022  3.969276e+10  2.085038e+10  1.721287e+09
23  2023  4.852033e+10  1.954640e+10  1.690138e+09
24  2024  5.424166e+10  1.141259e+10  1.028335e+09

Since 2024 is not complete, I delete the row for year 2024:

annualdata = annualdata[annualdata['year'] < 2024]
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 7))

# Plot Total Market Value
plt.plot(annualdata['year'], annualdata['marketvalue'], label='Total Market Value', marker='o')

# Plot Total Revenue
plt.plot(annualdata['year'], annualdata['revenue'], label='Total Revenue', marker='o')

# Plot Total Net Income
plt.plot(annualdata['year'], annualdata['netincome'], label='Total Net Income', marker='o')

plt.title('US Financial Growth Over Years (2000-2023)')
plt.xlabel('Year')
plt.ylabel('Amount ($ thousands)')
plt.legend()
plt.grid(True)
plt.xticks(annualdata['year'].astype(int), rotation=45)
plt.tight_layout()
plt.show()

We see that from 2012, the market value of all US firms grew much faster than revenue and netincome.

To better visually appreciate the growth of each variable, I can create an index where 1 is the amount of the first year. This index shows the number of times each variable has grown over the years:

growth_indices_market = annualdata.copy() 
growth_indices_market.set_index('year', inplace=True)

for col in ['revenue', 'netincome','marketvalue']:
    growth_indices_market[f'{col}_index'] = growth_indices_market[col] / growth_indices_market[col].iloc[0]

growth_indices_market.head()
       marketvalue       revenue  ...  netincome_index  marketvalue_index
year                              ...                                    
2000  1.380230e+10  7.364689e+09  ...         1.000000           1.000000
2001  1.185441e+10  7.736169e+09  ...         0.272685           0.858872
2002  9.258813e+09  7.432325e+09  ...         0.134906           0.670817
2003  1.214979e+10  8.030068e+09  ...         1.109389           0.880273
2004  1.363718e+10  8.976372e+09  ...         1.278478           0.988037

[5 rows x 6 columns]

I can plot these indices:

plt.figure(figsize=(12, 6))
plt.plot(growth_indices_market.index, growth_indices_market['revenue_index'], label='Revenue Index')
plt.plot(growth_indices_market.index, growth_indices_market['netincome_index'], label='Net Income Index')
plt.plot(growth_indices_market.index, growth_indices_market['marketvalue_index'], label='Market Value Index')
plt.xlabel('Year')
plt.ylabel('Growth Index (Normalized to First Year)')
plt.title('Growth of Overall Market Financial Metrics and Market Value - Indexed to First Year')
plt.legend()
plt.show()

Now we see a different picture. Compared with the 2000 amounts, Net Income and Market Value have grown very similarly, about 4 times from 2000 to 2023. This is consistent with financial theory, which says that the stock price (market value divided by the number of shares) reflects the present value of future free cash flows, so it must be consistent with the growth of earnings, in this case Net Income.

6 CHALLENGE 3 - RUN A LOGIT MODEL TO PREDICT PROBABILITY OF BEATING THE INDUSTRY MEDIAN RETURN

You have to run a logit model to examine whether operational earnings per share deflated by price, (EBITQ/sharesoutstanding)/originalprice, explains the probability that a firm has a higher annual return than its corresponding industry median return.

Use the naics1 column as the industry. naics stands for North American Industry Classification System.

You have to create the following variables for the model:

  • oepsp = Operational Earnings per Share divided by price = (EBITQ/sharesoutstanding) / originalprice

  • oepspw = Winsorized version of oepsp (you can use the quantile and clip methods from pandas to do this)

  • r = Annual stock return (log return); remember that you have quarterly data

Hint: use this r variable to calculate the median return by industry.

Remember that EBITQ is the Quarterly EBIT you calculated in Challenge 1.

CHALLENGE 4 - INTERPRET THE LOGIT MODEL IN YOUR OWN WORDS

6.1 CHALLENGE 3 SOLUTION

Since I need the industry to get the median return for each industry, I need to pull the industry from the usfirms dataset and merge it into the panel dataset. I will create a new merged dataset:

# I merge the datasets using uspanel as the left dataset, and usfirms as the right dataset

merged_data = pd.merge(uspanel, usfirms[['empresa', 'Nombre', 'status', 'naics1']], left_on='firm', right_on='empresa', how='left')

# Drop the redundant 'empresa' column from the merged DataFrame
merged_data = merged_data.drop('empresa', axis=1)

# Display the first few rows of the merged DataFrame to confirm
print("Merged DataFrame:")
Merged DataFrame:
merged_data.head()
  firm       q  ...  status                     naics1
0    A  2000q2  ...  activo  Industrias manufactureras
1    A  2000q3  ...  activo  Industrias manufactureras
2    A  2000q4  ...  activo  Industrias manufactureras
3    A  2001q1  ...  activo  Industrias manufactureras
4    A  2001q2  ...  activo  Industrias manufactureras

[5 rows x 37 columns]

I create the operational earnings per share (oeps), and then deflate it by the stock price:

# Operational Earnings per share (oeps) 
merged_data['oeps'] = merged_data['ebit_quarterly'] / merged_data['sharesoutstanding']

merged_data['oepsp'] = merged_data['oeps'] / merged_data['originalprice']

I view the histogram of oepsp:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.histplot(merged_data['oepsp'].dropna(), bins=50, kde=True)
plt.title('Distribution of Operational EPS deflated by Stock Price (oepsp)')
plt.xlabel('oepsp')
plt.ylabel('Frequency')
plt.show()

I see that there are very extreme values for oepsp. Operational earnings per share divided by price usually has values between -0.50 and +0.50. The problem might be that this dataset includes firms that are no longer active, and many of them went bankrupt, so their stock prices went down to values close to zero.

When we include variables with such extreme values in a model, the estimation of the model is not reliable. In these cases it is strongly recommended to winsorize the variable to flatten the very extreme values (both negative and positive).

WINSORIZATION IS THE PROCESS OF FLATTENING EXTREME (OUTLIERS) OBSERVATIONS OF A VARIABLE WITH THE PURPOSE OF AVOIDING UNRELIABLE ESTIMATIONS OF BETA COEFFICIENTS WHEN INCLUDING THAT VARIABLE AS EXPLANATORY VARIABLE IN AN ECONOMETRIC MODEL.

THE WINSORIZATION PROCESS CAN BE APPLIED TO HIGH OR LOW (NEGATIVE) VALUES OF THE VARIABLE. FOR HIGH VALUES WE INDICATE FROM WHICH PERCENTILE WE CONSIDER THE VALUES AS OUTLIERS; THOSE VALUES ABOVE THAT PERCENTILE ARE REPLACED WITH THE VALUE OF THAT PERCENTILE. FOR LOW VALUES WE INDICATE FROM WHICH PERCENTILE WE CONSIDER THE VALUES AS OUTLIERS; THOSE VALUES BELOW THAT PERCENTILE ARE REPLACED WITH THE VALUE OF THAT PERCENTILE.

I use the quantile method of pandas to get the 1 and 99 percentiles, and the clip method to flatten the extreme values with the corresponding percentiles:

# Winsorize oepsp at the 1 and 99 percentiles
lower_bound = merged_data['oepsp'].quantile(0.01)
upper_bound = merged_data['oepsp'].quantile(0.99)

merged_data['oepspw'] = merged_data['oepsp'].clip(lower=lower_bound, upper=upper_bound)
# I get descriptive statistics of oepsp and oepspw
merged_data[['oepsp','oepspw']].describe()
              oepsp         oepspw
count  2.404640e+05  240464.000000
mean            NaN      -0.007325
std             NaN       0.095533
min            -inf      -0.638777
25%   -6.077557e-03      -0.006078
50%    1.236910e-02       0.012369
75%    2.499723e-02       0.024997
max             inf       0.157534

After winsorization, the minimum and maximum values look within the normal range (from -0.64 to +0.16) of operational earnings per share deflated by price.

I also check the histogram of the winsorized oepsp:

plt.figure(figsize=(10, 6))
sns.histplot(merged_data['oepspw'].dropna(), bins=50, kde=True)
plt.title('Distribution of Winsorized Operational EPS deflated by Stock Price (oepspw)')
plt.xlabel('oepspw')
plt.ylabel('Frequency')
plt.show()

I calculate the annual log return for each firm-quarter:

# Calculate log return: ln(current_price / stockprice 4 quarters ago)
merged_data['r'] = np.log(merged_data['adjprice'] / merged_data.groupby('firm')['adjprice'].shift(4))
plt.figure(figsize=(10, 6))
sns.histplot(merged_data['r'].dropna(), bins=50, kde=True)
plt.title('Distribution of log annual returns')
plt.xlabel('Stock annual return')
plt.ylabel('Frequency')
plt.show()

I get the median annual return by industry-quarter (industry = naics1):

merged_data['median_industry_r'] = merged_data.groupby(['naics1','q'])['r'].transform('median')
merged_data.head()
  firm       q  fiscalmonth  ...    oepspw         r  median_industry_r
0    A  2000q2          6.0  ...  0.011542       NaN          -1.224193
1    A  2000q3          9.0  ...  0.009473       NaN          -1.333842
2    A  2000q4         12.0  ...  0.018330       NaN          -1.226877
3    A  2001q1          3.0  ...  0.019877       NaN          -0.049258
4    A  2001q2          6.0  ... -0.003837 -0.819441           0.070438

[5 rows x 42 columns]

Now I can run the logit model:

import statsmodels.api as sm
# Create the binary dependent variable: 1 if r > median_industry_r, 0 otherwise
merged_data['higher_than_median_r'] = (merged_data['r'] > merged_data['median_industry_r']).astype(int)

# Define the dependent and independent variables, handling NaNs
data_for_logit = merged_data[['higher_than_median_r', 'oepspw']].dropna()
Y = data_for_logit['higher_than_median_r']
X = data_for_logit['oepspw']

# Add a constant to the independent variable for the intercept
X = sm.add_constant(X)

# Run the logit model
logit_model = sm.Logit(Y, X)
result = logit_model.fit()
Optimization terminated successfully.
         Current function value: 0.669386
         Iterations 6
# Display the model summary
print(result.summary())
                            Logit Regression Results                            
================================================================================
Dep. Variable:     higher_than_median_r   No. Observations:               240464
Model:                            Logit   Df Residuals:                   240462
Method:                             MLE   Df Model:                            1
Date:                mié., 12 nov. 2025   Pseudo R-squ.:                 0.02863
Time:                          19:33:30   Log-Likelihood:            -1.6096e+05
converged:                         True   LL-Null:                   -1.6571e+05
Covariance Type:              nonrobust   LLR p-value:                     0.000
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.1753      0.004    -41.872      0.000      -0.183      -0.167
oepspw         5.8254      0.075     77.219      0.000       5.678       5.973
==============================================================================

INTERPRETATION OF THE MODEL:

LOOKING AT THE BETA COEFFICIENT OF THE EXPLANATORY VARIABLE oepspw, WE CAN SEE THAT THE COEFFICIENT IS POSITIVE AND STATISTICALLY SIGNIFICANT (SINCE ITS P-VALUE IS MUCH LESS THAN 0.05). THEN, WE CAN SAY THAT OPERATIONAL EARNINGS PER SHARE DEFLATED BY PRICE IS SIGNIFICANTLY AND POSITIVELY RELATED TO THE PROBABILITY OF A STOCK BEATING ITS INDUSTRY (IN TERMS OF ANNUAL RETURN).

THE WAY WE INTERPRET THE MAGNITUDE OF THE COEFFICIENT IS NOT THE SAME AS IN THE CASE OF LINEAR REGRESSION. WE NEED TO GET THE EXPONENTIAL OF THE COEFFICIENT TO HAVE A BETTER INTERPRETATION.

ANOTHER WAY TO UNDERSTAND WHY WE NEED TO GET THE EXPONENTIAL OF THE COEFFICIENT IS THAT THE ACTUAL DEPENDENT VARIABLE OF THE LOGIT MODEL IS THE LOGARITHM OF THE ODDS RATIO. REMEMBER THAT THE ODDS RATIO IS THE RATIO OF THE PROBABILITY OF THE STOCK BEATING THE INDUSTRY TO THE PROBABILITY OF THE STOCK NOT BEATING THE INDUSTRY. THEN, IF WE APPLY THE EXPONENTIAL FUNCTION TO THE LOGARITHM OF THE ODDS RATIO WE GET THE ACTUAL ODDS RATIO:

Y_{i,t}=log(\frac{p}{1-p})=b_{0}+b_{1}(oepspw_{i,t})+\varepsilon_{i,t}

e^{Y_{i,t}}=\left(\frac{p}{1-p}\right)

SINCE THE EXPECTED VALUE OF Y IS ACTUALLY THE REGRESSION EQUATION, I CAN APPLY THE EXPONENTIAL FUNCTION TO THE REGRESSION EQUATION AND GET TRANSFORMED BETAS THAT ARE RELATED TO THE ODDS RATIO (NOT TO THE LOG OF THE ODDS RATIO). THIS IS CONSISTENT WITH THE PREVIOUS EXPLANATION ABOUT HOW TO INTERPRET THE MAGNITUDE OF A COEFFICIENT OF A LOGIT REGRESSION. AS I EXPLAINED IN A PREVIOUS SECTION:

IF THE EXPLANATORY VARIABLE INCREASES IN 1 UNIT:

e^{b_{0}}e^{b_{1}(oepspw+1)}=e^{b_{0}}e^{b_{1}(oepspw)}e^{b_{1}}=ODDSRATIO*(e^{b_{1}})

THEN, (e^{b_{1}}) IS THE TRANSFORMED COEFFICIENT WE NEED TO ESTIMATE IN ORDER TO HAVE A BETTER INTERPRETATION OF THE REGRESSION OUTPUT. LET’S GET THE EXPONENTIAL OF THE REGRESSION COEFFICIENTS:

IN THE REAL WORLD, oepspw CANNOT MOVE BY +1.0 UNIT SINCE IT IS A RATIO. THEN, IT IS BETTER TO THINK ABOUT WHAT WOULD HAPPEN IF oepspw MOVES BY 0.1 UNITS:

e^{b_{0}}e^{b_{1}(oepspw+0.1)}=e^{b_{0}}e^{b_{1}(oepspw)}e^{0.1*b_{1}}=ODDSRATIO*(e^{0.1*b_{1}})

# Get the beta coefficients, which are located in the params attribute of the model
coefs = result.params
b0 = coefs.iloc[0]
b1 = coefs.iloc[1]

# I calculate the exponential of 0.1 times the slope coefficient to be able to
# interpret it in terms of the odds ratio for a +0.1 move in oepspw
modified_b1 = np.exp(0.1 * b1)

# Print the result
print(f"The modified beta1 coefficient (oepspw) is: {modified_b1:.4f}")
The modified beta1 coefficient (oepspw) is: 1.7906

THE TRANSFORMED b_1 COEFFICIENT IS 1.7906. WE INTERPRET IT AS FOLLOWS:

IF A FIRM IMPROVES ITS oepspw BY 0.1 UNITS, ITS ODDS RATIO (THE PROBABILITY OF BEATING THE INDUSTRY RELATIVE TO THE PROBABILITY OF NOT BEATING THE INDUSTRY) IN THE SAME QUARTER WILL BE 1.7906 TIMES THE ODDS RATIO OF THE CASE WHERE THE FIRM DOES NOT IMPROVE ITS oepspw.

THE INTERPRETATION OF b_0 IS THE FOLLOWING. IF oepspw IS EQUAL TO ZERO, THEN THE TRANSFORMED COEFFICIENT e^{b_0} WILL BE THE EXPECTED ODDS RATIO. IN THIS CASE, e^{b_0} = e^{-0.1753} = 0.8392, WHICH IS LESS THAN 1. THIS MEANS THAT WHEN oepspw=0, THE PROBABILITY THAT THE STOCK BEATS THE INDUSTRY IS LESS THAN THE PROBABILITY THAT IT DOES NOT BEAT THE INDUSTRY.

WE CAN GET A BETTER INTERPRETATION OF b_0 BY CALCULATING THE EXPECTED PROBABILITY OF BEATING THE INDUSTRY FROM THE EXPECTED ODDS RATIO WHEN oepspw=0. WE NEED TO TRANSFORM FROM ODDS RATIO TO PROBABILITY:

ODDSRATIO=\frac{p}{1-p}

DOING SOME ALGEBRA WE CAN EXPRESS P IN TERMS OF ODDSRATIO:

(1-p)ODDSRATIO=p

ODDSRATIO - p(ODDSRATIO) = p

ODDSRATIO = p + p(ODDSRATIO) = p(1+ODDSRATIO)

p=\frac{ODDSRATIO}{1+ODDSRATIO}

WE HAD CALCULATED exp(b_0) = 0.8392, WHICH IS THE EXPECTED ODDS RATIO IF oepspw=0. THEN, THE EXPECTED PROBABILITY IF oepspw=0 WILL BE: p=\frac{0.8392}{1+0.8392}, so p=0.4563.

# Prob of beating the industry when oepspw=0:
baseline_odds = np.exp(b0)
p = baseline_odds / (1 + baseline_odds)
print(p)  # approximately 0.4563

IN TERMS OF PROBABILITY, IF THE CURRENT ODDS RATIO OF A FIRM IS 1 (SO p=0.5), AND THE FIRM IMPROVES ITS oepspw BY 0.1, ITS PROBABILITY OF BEATING THE INDUSTRY IS EXPECTED TO IMPROVE:

# If the current odds ratio of a firm = 1, its current probability to beat the industry must be 0.5
current_odds_ratio = 1
current_p = current_odds_ratio / (1+current_odds_ratio)
current_p
0.5
# If oepspw moves by 0.1 units, then its new odds ratio will be:
new_odds_ratio = current_odds_ratio * np.exp(0.1*b1)
# Then, the probability will increase:
new_p = new_odds_ratio / (1 + new_odds_ratio)
new_p
np.float64(0.6416525334861959)

THEN, IN THIS HYPOTHETICAL CASE, IF oepspw INCREASES BY +0.1 UNITS, THE PROBABILITY OF BEATING THE INDUSTRY WILL INCREASE FROM 0.5 TO 0.6417.