Workshop 3, Algorithms and Data Analysis

Author

Alberto Dorantes

Published

September 29, 2025

Abstract
This is an INDIVIDUAL workshop. In this workshop we continue practicing programming for data management, and we introduce the logistic regression model in the context of machine learning.

1 Data management for panel-data

For financial analysis, data management is a very important process. Most of the time, before designing and running a financial / econometric model, we need to do data wrangling: data collection/importing, data cleaning, data merging, and data transformation.

We will keep using the same datasets we used in Workshop 2 for all exercises of this workshop.

Download the online datasets using the same Python code you used before:

import pandas as pd
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Download the first file
url1 = 'https://www.apradie.com/datos/usfirms.csv'
response1 = requests.get(url1, headers=headers)
with open('usfirms.csv', 'wb') as f:
    f.write(response1.content)
# Download the second file
url2 = 'https://www.apradie.com/datos/usdata.csv'
response2 = requests.get(url2, headers=headers)
with open('usdata.csv', 'wb') as f:
    f.write(response2.content)
# Import as data frames
usfirms = pd.read_csv('usfirms.csv')
usdata = pd.read_csv('usdata.csv')

# Display the first few rows of each DataFrame to confirm
print("usfirms DataFrame:")
usfirms DataFrame:
usfirms.head()
    empresa  ...  SectorEconomatica
0         A  ...  Electric Electron
1        AA  ...  Basic & Fab Metal
2  AABA_old  ...    Software & Data
3   AAC_old  ...              Other
4  AAIC_old  ...              Funds

[5 rows x 7 columns]
print("\nusdata DataFrame:")

usdata DataFrame:
usdata.head()
  firm  year     revenue  ...  originalprice  sharesoutstanding  fixedassets
0    A  2000  10773000.0  ...          54.75         456366.381    1741000.0
1    A  2001   8396000.0  ...          28.51         463695.160    1848000.0
2    A  2002   6010000.0  ...          17.96         467024.421    1579000.0
3    A  2003   6056000.0  ...          29.24         476149.083    1447000.0
4    A  2004   7181000.0  ...          24.10         486841.087    1258000.0

[5 rows x 28 columns]

You can go to Workshop 2 (https://rpubs.com/cdorante/fz2022p_w2) to see the data dictionary for each dataset.

As in the previous workshop, merge the two datasets (usdata and usfirms) using a left join. Remember that the panel dataset usdata is the left dataset, with historical annual financial data for all US firms, and usfirms is the right dataset, a cross-sectional dataset with general information about firms in the S&P500 index.
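A minimal sketch of this left join is shown below. It assumes the data frames keep the names usdata and usfirms from the import code above; the matching columns 'firm' and 'empresa' are taken from the head() previews, but check the data dictionary to confirm they are the right keys.

# Left join: keep all firm-year rows of usdata and attach the firm information
data = pd.merge(usdata, usfirms, left_on='firm', right_on='empresa', how='left')
print(data.shape)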

2 CHALLENGE 1. Calculating financial variables

Using the merged dataset, you have to write the code to calculate the following financial variables and financial ratios for all firm-years:

1. Create financial variables

Create columns for the following variables:

  • Market value (marketvalue) = originalprice * sharesoutstanding

We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares.

  • Gross profit (grossprofit) = Revenue - Cost of Goods Sold (cogs)

  • Earnings before interest and taxes (ebit) = Gross profit - Sales & general administrative expenses (sgae) - depreciation

  • Net Income (netincome) = ebit + otherincome + extraordinaryitems - finexp (financial expenses) - incometax

  • Annual stock return: calculate the annual return for all firm-years as the continuously compounded percentage change of the adjusted stock price (adjprice). Consider that you have panel data, so be careful when calculating returns for the first year of each firm so that you do not use data from another firm. Hint: you can use the shift function together with groupby by firm to avoid using the stock price of another stock when calculating each firm's annual return (see the sketch after this list).
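Below is a possible sketch for creating these variables, assuming the merged data frame is called data (as in the merge sketch above) and that column names such as revenue, cogs, sgae, depreciation, otherincome, extraordinaryitems, finexp, incometax and adjprice match the data dictionary of Workshop 2; adjust them if your columns are named differently.

import numpy as np

# Sort by firm and year so that shift() moves values within each firm's history
data = data.sort_values(['firm', 'year'])

data['marketvalue'] = data['originalprice'] * data['sharesoutstanding']
data['grossprofit'] = data['revenue'] - data['cogs']
data['ebit'] = data['grossprofit'] - data['sgae'] - data['depreciation']
data['netincome'] = (data['ebit'] + data['otherincome'] + data['extraordinaryitems']
                     - data['finexp'] - data['incometax'])

# Continuously compounded annual stock return: log difference of the adjusted
# price, shifted within each firm so the first year of every firm stays missing
data['annual_return'] = np.log(data['adjprice']) - np.log(data.groupby('firm')['adjprice'].shift(1))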

2. Using the same panel dataset, create columns for the following financial ratios:

  • Operational Return on Assets (roabit): ebit divided by total assets at the beginning of the period. Total assets at the beginning of the year is actually the total assets at the end of the previous year.

roabit=\frac{ebit_{t}}{totalassets_{t-1}}

Here you can use the shift function to get the value of total assets one year ago. Make sure that you use groupby by firm so you do not use the totalassets of another firm to calculate the roabit of a firm (see the sketch after this list of ratios).

  • Return on Assets (roa):

roa=\frac{netincome_{t}}{totalassets_{t-1}}

  • Operational Earnings per share (oeps): ebit / sharesoutstanding

  • Operational eps deflated by stock price (oepsp): oeps / originalprice

  • Book-to-market ratio (bmr): book value / market value.

Book value can be calculated as totalassets minus totalliabilities.
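A sketch for these ratios, under the same assumptions as above (data frame called data, columns ebit, netincome, totalassets and totalliabilities named as in the data dictionary):

# Lagged total assets: total assets of the same firm one year before
data['lag_totalassets'] = data.groupby('firm')['totalassets'].shift(1)

data['roabit'] = data['ebit'] / data['lag_totalassets']
data['roa'] = data['netincome'] / data['lag_totalassets']
data['oeps'] = data['ebit'] / data['sharesoutstanding']
data['oepsp'] = data['oeps'] / data['originalprice']
data['bookvalue'] = data['totalassets'] - data['totalliabilities']
data['bmr'] = data['bookvalue'] / data['marketvalue']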

Do your own research and briefly explain what earnings per share deflated by price and the book-to-market ratio are.

3 CHALLENGE 2: Winsorization of variables

You have to do your own research about winsorization and explain it in your own words (pay attention in class).

You have to winsorize the following ratio at the 2nd percentile for the low values and the 98th percentile for the high values (a minimal sketch follows the list):

  • Earnings per share deflated by price
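A minimal winsorization sketch, assuming the ratio was stored as oepsp in the previous challenge and naming the winsorized column epspw (the name used later in Challenge 4); scipy's winsorize function is an alternative way to do the same thing.

# Winsorize at the 2nd and 98th percentiles: values below/above these cutoffs
# are replaced by the percentile values themselves
p02 = data['oepsp'].quantile(0.02)
p98 = data['oepsp'].quantile(0.98)
data['epspw'] = data['oepsp'].clip(lower=p02, upper=p98)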

4 CHALLENGE 3: Create an annual dataset for historical market returns

You have to create an annual dataset with annual market returns of the S&P500 index (^GSPC) from 2000 to 2023. You can download monthly quotes from Yahoo Finance using the yfinance library, aggregate them by year, and then calculate continuously compounded returns.

Once you have the annual returns of the ^GSPC in a dataset, merge these values into the merged panel-dataset using the year as the common/matching column.
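A possible sketch for this challenge, assuming the yfinance library is available in Colab; the names gspc, mkt and mkt_return are illustrative, not required.

import numpy as np
import yfinance as yf

# Monthly quotes of the S&P500 index (one extra month at the end of 1999 so
# that the 2000 return has a previous value)
gspc = yf.download('^GSPC', start='1999-12-01', end='2023-12-31', interval='1mo')
close = gspc['Close'].squeeze()   # monthly closing values as a Series

# Last close of each calendar year, then continuously compounded annual returns
annual_close = close.groupby(close.index.year).last()
mkt = pd.DataFrame({'year': annual_close.index,
                    'mkt_return': np.log(annual_close / annual_close.shift(1)).values})
mkt = mkt[(mkt['year'] >= 2000) & (mkt['year'] <= 2023)]

# Merge the market returns into the panel dataset using year as the matching column
data = pd.merge(data, mkt, on='year', how='left')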

5 CHALLENGE 4: Logistic regression models with lagged values

Design and run a logistic regression to examine whether the winsorized earnings per share deflated by price (epspw) is related to the probability that the future annual stock return is higher than the future annual market return.

Pay attention in class to learn how to run a logistic regression model, and how to indicate to use future or lagged values for variables in the model.

You have to interpret the model.
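A hedged sketch of one way to set this up with statsmodels, assuming the columns annual_return, mkt_return and epspw created in the previous challenges; the column names F1_return, F1_mkt_return and F1_above_market are illustrative.

import statsmodels.formula.api as smf

# Future (one-year-ahead) values, shifted within each firm
data = data.sort_values(['firm', 'year'])
data['F1_return'] = data.groupby('firm')['annual_return'].shift(-1)
data['F1_mkt_return'] = data.groupby('firm')['mkt_return'].shift(-1)

# Binary outcome: 1 if the stock beats the market in the following year, 0 otherwise
data['F1_above_market'] = (data['F1_return'] > data['F1_mkt_return']).astype(int)

# Logistic regression of the future outcome on the current winsorized ratio
est_data = data.dropna(subset=['F1_return', 'F1_mkt_return', 'epspw'])
logit_model = smf.logit('F1_above_market ~ epspw', data=est_data).fit()
print(logit_model.summary())

The sign and significance of the epspw coefficient tell you whether a higher ratio is associated with a higher probability of beating the market next year; exponentiating the coefficient gives the corresponding odds ratio.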

6 CHALLENGE 5: Running your first Machine Learning model

(moved to week 4)

Create a dataset with the following columns:

  • Future annual stock return (1 year later)
  • F1r_above_market (1 = beat the market in the corresponding year; 0 = otherwise)
  • Earnings per share deflated by price (epsp).

Create a training and testing sample: randomly select 80% of observations for the training sample and 20% for the testing sample.

Using the training sample, run the same logistic model to check whether epsp has explanatory power for the probability that the stock beats the market.

Create and interpret the confusion matrix.
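A hedged sketch with scikit-learn, assuming the future-return columns from Challenge 4 and the deflated EPS column created earlier (replace epspw with epsp if you computed a separate non-winsorized column); names such as ml_data and clf are illustrative.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Keep only the columns needed and drop rows with missing values
ml_data = data[['F1_return', 'F1_above_market', 'epspw']].dropna()

X = ml_data[['epspw']]
y = ml_data['F1_above_market']

# Random 80% training / 20% testing split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic regression fitted on the training sample
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Confusion matrix on the testing sample: rows are actual classes, columns are predictions
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))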

7 Datacamp online courses

YOU MUST TAKE Chapter 1: Classification from the course Supervised Learning with scikit-learn.

8 W3 submission

The grade of this Workshop will be the following:

  • Complete (100%): If you submit an ORIGINAL and COMPLETE work with all the activities, with your notes, and with your OWN RESPONSES to the questions.

  • Incomplete (75%): If you submit an ORIGINAL work with ALL the activities but you did NOT RESPOND to the questions, and/or you did not do all the activities and responded to only some of the questions.

  • Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop, or you completed more but parts of your work are copy-pasted from other workshops.

  • Not submitted (0%)

Remember that you have to submit your Google Colab LINK, and you have to SHARE it with me (cdorante@tec.mx).