Evidence_v2, Algorithms and Data Analysis

Author

Alberto Dorantes

Published

October 20, 2025

Abstract
This is the in-class evidence of the course. You have to run a Machine Learning model to predict whether a firm return is higher than its industry in market returns. Follow directions

Download the 2 dataset we have used in the workshops. Use the code stated in workshop 3 (https://rpubs.com/cdorante/fz2022p_w3)

You can go to Workshop 2 (https://rpubs.com/cdorante/fz2022p_w2) to see the data dictionary for each dataset.

As in the previous workshop, merge the 2 datasets (usdata and usfirms) using a left-join. Remember, the panel-dataset usdata is the left dataset, which has historical annual financial data for all US firms; and the usfirms is the right dataset, which is a cross-sectional dataset with general information of firms from the S&P500 index.

You MUST keep ONLY firm-years with status=‘active’.

1 Calculating financial variables

For the variable calculation, you MUST WITH YOUR WORDS IN CAPITAL LETTER how you created the variables. If you use a Gemini prompt, indicate in quotes what was your prompt.

Using the merged dataset, you have to write the code to calculate the following financial variables and financial ratios for all firms-years:

1. Create financial variables

  • Total Revenue (totalsales) = revenue + extraordinaryitems + otherincome

  • Gross profit (grossprofit) = Revenue - Cost of good Sold (cogs)

  • Shareholder Equity (equity) = totalassets - totalliabilites

  • Avg Assets (avgassets) = arithmetic average of totalassets at the year t and totalassets at t-1

  • Annual market return (stockretur): calculate annual return for all firm-years by calculating the continuously compounded percentage of the adjusted stock price (adjprice).

Consider that you have a panel-data, so be careful when calculating returns to avoid using data from another firm in the cases of the first year for all firms.

2. Using the same panel dataset, create columns for the following financial ratios:

Here you can use the shift function to get value of total assets one year ago. Make sure that you indicate to groupby firm so you do not use the totalssets from another firm to calculate the roabit of a firm.

  • Operational Earnings per share (oeps): ebit / sharesoutstanding

  • Operational eps deflated by stock price (oepsp) : oeps / originalprice

  • Winsorized oepsp (oepspw) at the 1 and 99 percentiles

  • Financial leverage (leverage): longdebt / avgassets

  • Current Ratio (currentratio): currentassets / currentliabilities

  • GrossProfit-to-Asset ratio (gprofitratio) as:

gprofitratio=\frac{grossprofit_{t}}{totalassets_{t-1}}

  • Asset turnover ratio (ato) as:

ato=\frac{totalsales_{t}}{totalassets_{t-1}}

2 DESIGN A MACHINE LEARNING MODEL

Prepare the data to design and run a machine learning model to predict whether a stock annual return beats its industry average returns in the same year. For industry average return calculate the average of stockreturn by industry-year. Use naics1 as the industry

For the model, use the following explanatory variables (as X predictors)

  • oepspw (winsorized)

  • Fscore = F1 + F2 + F3 + F4 + F5

F1, F2, F3, F4, F5 are accounting binary signals that measure annual firm improvements (=1 means improvement; 0 means no improvement) about important indicators related to firm performance, leverage or efficiency.

You can calculate the F accounting signals as:

F1 = 1 if the annual difference of leverage (leverage at t minus leverage at t-1) is negative (<0); =0 otherwise

F2 = 1 if the annual difference of currentratio is positive (>0); =0 otherwise

F3 = 1 if the annual difference of equity is positive (>0); =o otherwise

F4 = 1 if the annual difference of gprofitratio is positive (>0); =0 otherwise

F5 = 1 if the annual difference of ato is positive (>0); =0 otherwise

1. You have to EXPLAIN WITH YOUR OWN WORDS HOW YOU CREATED THE ACCOUNTING F SIGNALS (VARIABLES)

2. Design and run a logistic regression to examine whether earnings per share deflated by price winsorized (pepspw) and Fscore are related to the probability that the annual stock returns is higher than its industry average return in the corresponding year.

  • EXPLAIN THE CODE YOU USED TO RUN THE LOGISTIC MODEL

  • Show the regression output with all observations (BEFORE THE MACHINE LEARNING MODEL), AND INTERPRET THE beta COEFFICIENTS OF epspw and Fscore WITH YOUR WORDS

3. You must run the corresponding MACHINE LEARNING model.

  • EXPLAIN THE STEPS YOU FOLLOWED TO RUN THE MACHINE LEARNING MODEL

  • Show the Confusion Matrix. Just MENTION how many cases your model correctly predicted.

  • Calculate AND INTERPRET the following ratios:

    • Precision

    • Sensitivity

    • Specificity

ONLY KEEP THE Python CODE YOU NEED FOR THIS EVIDENCE. Extra CODE CAN BE PENALIZED

Remember that you have to submit your Google Colab LINK, and you have to SHARE it with me (cdorante@tec.mx).