Evidence_v1, Algorithms and Data Analysis

Author

Alberto Dorantes

Published

October 20, 2025

Abstract
This is the in-class evidence of the course. You have to run a Machine Learning model to predict whether a firm's annual stock return is higher than the average market return of its industry. Follow the directions below.

Download the same dataset we have used in the workshops. Use the code from Workshop 3 (https://rpubs.com/cdorante/fz2022p_w3).

You can go to Workshop 2 (https://rpubs.com/cdorante/fz2022p_w2) to see the data dictionary for each dataset.

As in the previous workshop, merge the 2 datasets (usdata and usfirms) using a left-join. Remember, the panel-dataset usdata is the left dataset, which has historical annual financial data for all US firms; and the usfirms is the right dataset, which is a cross-sectional dataset with general information of firms from the S&P500 index.

You MUST keep ONLY firm-years with status=‘active’.
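As a minimal sketch of the merge and the filter, the steps could look like the code below. The two small data frames only stand in for the real files, and the key column name firm and the year column name year are assumptions; check the data dictionary in Workshop 2 for the actual names.

```python
import pandas as pd

# Toy stand-ins for the two course datasets (the real ones have many more columns).
# ASSUMPTION: the join key is called "firm" in both datasets.
usdata = pd.DataFrame({
    "firm": ["A", "A", "B", "B", "C"],
    "year": [2020, 2021, 2020, 2021, 2021],
    "revenue": [100.0, 120.0, 50.0, 55.0, 80.0],
})
usfirms = pd.DataFrame({
    "firm": ["A", "B"],
    "status": ["active", "cancelled"],
    "naics1": ["Manufacturing", "Retail Trade"],
})

# Left join: every firm-year of usdata is kept and the firm-level
# information is attached whenever the firm exists in usfirms.
data = pd.merge(usdata, usfirms, on="firm", how="left")

# Keep only firm-years of active firms. Firms missing from usfirms
# get a NaN status after the join, so they are dropped here as well.
data = data[data["status"] == "active"].copy()
print(data)
```

If the key has a different name in each dataset, pd.merge accepts left_on and right_on instead of on.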

1 Calculating financial variables

For the variable calculation, you MUST EXPLAIN WITH YOUR WORDS, IN CAPITAL LETTERS, how you created the variables. If you use a Gemini / ChatGPT prompt, indicate in quotes what your prompt was.

Using the merged dataset, you have to write the code to calculate the following financial variables and financial ratios for all firm-years:

1. Create financial variables

  • Gross profit (grossprofit) = Revenue - Cost of Goods Sold (cogs)

  • Earnings before interest and taxes (ebit) = Gross profit - Sales & general administrative expenses (sgae) - depreciation

  • Net Income (netincome) = ebit + otherincome + extraordinaryitems - finexp (financial expenses) - incometax

  • Annual market return: calculate the annual return for all firm-years as the continuously compounded (logarithmic) return of the adjusted stock price (adjprice). Remember that you have panel data, so be careful not to use the price of another firm when calculating the return for the first year of each firm. Hint: you can use the shift function together with groupby firm, so that each firm's annual return is calculated only with its own stock prices.
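The variables above can be sketched as follows. This is only an illustration with made-up numbers: the column names firm and year, and the name r for the annual return, are assumptions, while the accounting column names follow the formulas above.

```python
import numpy as np
import pandas as pd

# Toy panel: two firms, three years each (made-up numbers).
# ASSUMPTION: "firm" and "year" identify each row of the panel.
data = pd.DataFrame({
    "firm": ["A", "A", "A", "B", "B", "B"],
    "year": [2019, 2020, 2021] * 2,
    "revenue": [100.0, 110.0, 120.0, 50.0, 55.0, 60.0],
    "cogs": [60.0, 65.0, 70.0, 30.0, 32.0, 35.0],
    "sgae": [20.0, 21.0, 22.0, 10.0, 11.0, 12.0],
    "depreciation": [5.0, 5.0, 5.0, 2.0, 2.0, 2.0],
    "otherincome": [1.0, 0.0, 2.0, 0.0, 1.0, 0.0],
    "extraordinaryitems": [0.0] * 6,
    "finexp": [3.0, 3.0, 3.0, 1.0, 1.0, 1.0],
    "incometax": [4.0, 5.0, 6.0, 2.0, 3.0, 3.0],
    "adjprice": [10.0, 12.0, 11.0, 20.0, 25.0, 30.0],
})

# Sort so that shift(1) inside each firm really points to the previous year.
data = data.sort_values(["firm", "year"]).reset_index(drop=True)

data["grossprofit"] = data["revenue"] - data["cogs"]
data["ebit"] = data["grossprofit"] - data["sgae"] - data["depreciation"]
data["netincome"] = (data["ebit"] + data["otherincome"] + data["extraordinaryitems"]
                     - data["finexp"] - data["incometax"])

# Continuously compounded annual return: log(price_t) - log(price_{t-1}).
# groupby("firm") makes the first year of every firm NaN instead of
# borrowing the last price of the previous firm in the table.
prev_price = data.groupby("firm")["adjprice"].shift(1)
data["r"] = np.log(data["adjprice"]) - np.log(prev_price)  # "r" is a hypothetical name
print(data[["firm", "year", "grossprofit", "ebit", "netincome", "r"]])
```

The first year of each firm has no return because there is no previous price for that firm.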

2. Using the same panel dataset, create columns for the following financial ratios:

Here you can use the shift function to get the value of total assets one year ago. Make sure that you group by firm so that you do not use the totalassets of another firm to calculate the roa of a firm.

  • Return on Assets (roa):

roa=\frac{netincome_{t}}{totalassets_{t-1}}

  • Operational Earnings per share (oeps): ebit / sharesoutstanding

  • Operational eps deflated by stock price (oepsp) : oeps / originalprice

  • Cash flow to Assets ratio (cfr) as

cfr=\frac{cashflowoper_{t}}{totalassets_{t-1}}

You have to winsorize the eps deflated by price (oepsp, defined above) at the 1st and 99th percentiles, and name the result epspw.
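A possible sketch of the ratios and the winsorization is shown below, using made-up numbers. It assumes that the price-deflated eps to winsorize is the oepsp column defined above, that firm and year identify each row, and that clipping at the 1st and 99th percentiles is an acceptable way to winsorize; other tools, such as a dedicated winsorize function, could be used instead.

```python
import pandas as pd

# Toy panel with made-up numbers. ASSUMPTION: "firm" and "year" identify each row.
data = pd.DataFrame({
    "firm": ["A", "A", "A", "B", "B", "B"],
    "year": [2019, 2020, 2021] * 2,
    "netincome": [9.0, 16.0, 20.0, 5.0, 7.0, -2.0],
    "ebit": [15.0, 19.0, 23.0, 8.0, 10.0, 11.0],
    "cashflowoper": [12.0, 14.0, 18.0, 6.0, 4.0, 9.0],
    "totalassets": [200.0, 210.0, 230.0, 100.0, 105.0, 110.0],
    "sharesoutstanding": [10.0, 10.0, 10.0, 5.0, 5.0, 5.0],
    "originalprice": [10.0, 12.0, 11.0, 20.0, 25.0, 30.0],
})
data = data.sort_values(["firm", "year"]).reset_index(drop=True)

# Total assets of the previous year, taken within each firm only.
lag_assets = data.groupby("firm")["totalassets"].shift(1)

data["roa"] = data["netincome"] / lag_assets
data["cfr"] = data["cashflowoper"] / lag_assets
data["oeps"] = data["ebit"] / data["sharesoutstanding"]
data["oepsp"] = data["oeps"] / data["originalprice"]

# Winsorize: values below the 1st percentile are raised to it,
# values above the 99th percentile are lowered to it.
lower = data["oepsp"].quantile(0.01)
upper = data["oepsp"].quantile(0.99)
data["epspw"] = data["oepsp"].clip(lower=lower, upper=upper)
print(data[["firm", "year", "roa", "cfr", "oepsp", "epspw"]])
```

Winsorizing keeps every observation but limits the influence of extreme values on the model.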

2 DESIGN A MACHINE LEARNING MODEL

Prepare the data to design and run a machine learning model to predict whether a stock's annual return beats the average return of its industry in the same year. Use naics1 as the industry.
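One way to build the variable to predict is sketched below with made-up returns. The column name r for the annual return and the name beat_industry for the binary target are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: three firms in two industries, two years (made-up returns).
# ASSUMPTION: "r" holds the annual return and "year" the fiscal year.
data = pd.DataFrame({
    "firm": ["A", "B", "C", "A", "B", "C"],
    "year": [2020, 2020, 2020, 2021, 2021, 2021],
    "naics1": ["Manufacturing", "Manufacturing", "Retail Trade"] * 2,
    "r": [0.10, -0.05, 0.02, 0.20, 0.30, np.nan],
})

# Average return of each industry in each year, repeated on every row of the group.
industry_mean = data.groupby(["naics1", "year"])["r"].transform("mean")

# 1 if the firm beat its industry average in that year, 0 otherwise.
data["beat_industry"] = (data["r"] > industry_mean).astype(int)  # hypothetical name

# Firm-years without a return cannot be labeled, so they are left out of the model data.
model_data = data.dropna(subset=["r"]).copy()
print(model_data)
```

Note that a firm that is the only one in its industry-year can never beat its own average, so it is always labeled 0 under this definition.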

For the model, use the following explanatory variables (as X predictors)

  • epspw (winsorized)

  • Fscore = F1 + F2 + F3 + F4

You can calculate the F accounting signals as:

F1 = 1 if roa>0; 0 otherwise

F2 = 1 if cfr>0; 0 otherwise

F3 = 1 if the change in roa (roa at t minus roa at t-1) is positive; = 0 otherwise

F4 = 1 if cfr > roa; =0 otherwise
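The four signals can be sketched as below, using made-up values of roa and cfr. The column names firm and year are assumptions. In this sketch a comparison that involves a missing value returns False, so the signal becomes 0; you may prefer to drop those firm-years instead.

```python
import numpy as np
import pandas as pd

# Toy panel with made-up ratios; the first year of each firm is NaN
# because roa and cfr use total assets of the previous year.
data = pd.DataFrame({
    "firm": ["A", "A", "A", "B", "B", "B"],
    "year": [2019, 2020, 2021] * 2,
    "roa": [np.nan, 0.08, 0.10, np.nan, 0.07, -0.02],
    "cfr": [np.nan, 0.07, 0.09, np.nan, 0.04, 0.09],
})
data = data.sort_values(["firm", "year"]).reset_index(drop=True)

# Change in roa with respect to the previous year of the same firm.
roa_change = data["roa"] - data.groupby("firm")["roa"].shift(1)

# Each comparison gives True/False; astype(int) turns it into 1/0.
data["F1"] = (data["roa"] > 0).astype(int)
data["F2"] = (data["cfr"] > 0).astype(int)
data["F3"] = (roa_change > 0).astype(int)
data["F4"] = (data["cfr"] > data["roa"]).astype(int)

data["Fscore"] = data[["F1", "F2", "F3", "F4"]].sum(axis=1)
print(data)
```

Fscore therefore ranges from 0 to 4, where higher values mean more positive accounting signals.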

Design and run a logistic regression to examine whether earnings per share deflated by price, winsorized (epspw), and Fscore are related to the probability that the annual stock return is higher than its industry average in the corresponding year.

In addition, you must run the corresponding MACHINE LEARNING model.
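The two stages can be sketched as follows. This is only an illustration on synthetic data generated inside the example, so the coefficients say nothing about real firms. It assumes scikit-learn is used for both stages; a large C is set so that scikit-learn's default regularization is negligible and the fit is close to a plain logistic regression. The 70/30 split and the random seed are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic predictors and target, created only so the example runs.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "epspw": rng.normal(0.05, 0.10, n),
    "Fscore": rng.integers(0, 5, n),  # integers 0..4
})
logit = -1.0 + 2.0 * X["epspw"] + 0.4 * X["Fscore"]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Stage 1: logistic regression with ALL observations, used to interpret the betas.
full_model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
betas = pd.Series(full_model.coef_[0], index=X.columns)
odds_ratios = np.exp(full_model.coef_[0])  # exp(beta) = change in the odds per unit of X
print(betas)
print(odds_ratios)

# Stage 2: machine learning version. Train on one part of the sample
# and predict on observations the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
ml_model = LogisticRegression(C=1e6, max_iter=1000).fit(X_train, y_train)
y_pred = ml_model.predict(X_test)  # class 1 when the predicted probability exceeds 0.5
```

scikit-learn does not report standard errors or p-values; if you need them to judge significance, statsmodels provides them for logistic regression.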

You have to EXPLAIN the following WITH YOUR WORDS:

  1. HOW YOU CREATED THE ACCOUNTING F SIGNALS (VARIABLES)

  2. EXPLAIN THE CODE YOU USED TO RUN THE LOGISTIC MODEL

  3. RUN THE FIRST VERSION OF THE MODEL WITH ALL OBSERVATIONS (BEFORE THE MACHINE LEARNING MODEL), AND INTERPRET THE beta COEFFICIENTS OF epspw and Fscore WITH YOUR WORDS

  4. EXPLAIN THE STEPS YOU FOLLOWED TO RUN THE MACHINE LEARNING MODEL

  5. Show the Confusion Matrix. Just MENTION how many cases your model correctly predicted.

  6. Calculate AND INTERPRET the following ratios:

    6.1) Precision

    6.2) Sensitivity

    6.3) Specificity
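As a small worked example of these ratios, the sketch below counts the four cells of the confusion matrix from hypothetical actual and predicted labels (they do not come from the course dataset) and computes each ratio from them.

```python
# Hypothetical labels: 1 = the stock beat its industry, 0 = it did not.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

correct = tp + tn                # cases correctly predicted
precision = tp / (tp + fp)       # of the predicted 1s, how many were really 1
sensitivity = tp / (tp + fn)     # of the real 1s, how many the model detected
specificity = tn / (tn + fp)     # of the real 0s, how many the model detected

print("Confusion matrix [[tn, fp], [fn, tp]]:", [[tn, fp], [fn, tp]])
print("Correct predictions:", correct, "out of", len(y_true))
print("Precision:", precision)
print("Sensitivity:", sensitivity)
print("Specificity:", round(specificity, 3))
```

With these labels the model predicts 8 of the 10 cases correctly, with precision 0.75, sensitivity 0.75 and specificity of about 0.833.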

ONLY KEEP THE Python CODE YOU NEED FOR THIS EVIDENCE. Extra CODE CAN BE PENALIZED

Remember that you have to submit your Google Colab LINK, and you have to SHARE it with me (cdorante@tec.mx).