This is an INDIVIDUAL workshop. In this workshop we continue practicing programming for data management, and we introduce the logistic regression model in the context of machine learning.
1 Data management for panel-data
For financial analysis, data management is a very important process. Most of the time, before designing and running a financial / econometric model, you need to do data wrangling: data collection/importing, data cleaning, data merging, and data transformation.
We will keep using the same datasets we used in Workshop 2 for all exercises of this workshop.
Download the online datasets from:
List of all US public firms with general information of each firm
import pandas as pd
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

# Download the first file
url1 = 'https://www.apradie.com/datos/usfirms.csv'
response1 = requests.get(url1, headers=headers)
with open('usfirms.csv', 'wb') as f:
    f.write(response1.content)
# Download the second file
url2 = 'https://www.apradie.com/datos/usdata.csv'
response2 = requests.get(url2, headers=headers)
with open('usdata.csv', 'wb') as f:
    f.write(response2.content)
# Import as data frames
usfirms = pd.read_csv('usfirms.csv')
usdata = pd.read_csv('usdata.csv')

# Display the first few rows of each DataFrame to confirm
print("usfirms DataFrame:")
usfirms DataFrame:
usfirms.head()
empresa ... SectorEconomatica
0 A ... Electric Electron
1 AA ... Basic & Fab Metal
2 AABA_old ... Software & Data
3 AAC_old ... Other
4 AAIC_old ... Funds
[5 rows x 7 columns]
print("\nusdata DataFrame:")
usdata DataFrame:
usdata.head()
firm year revenue ... originalprice sharesoutstanding fixedassets
0 A 2000 10773000.0 ... 54.75 456366.381 1741000.0
1 A 2001 8396000.0 ... 28.51 463695.160 1848000.0
2 A 2002 6010000.0 ... 17.96 467024.421 1579000.0
3 A 2003 6056000.0 ... 29.24 476149.083 1447000.0
4 A 2004 7181000.0 ... 24.10 486841.087 1258000.0
[5 rows x 28 columns]
You can go to Workshop 2 (https://rpubs.com/cdorante/fz2022p_w2) to see the data dictionary for each dataset.
As in the previous workshop, merge the 2 datasets (usdata and usfirms) using a left join. Remember, the panel dataset usdata is the left dataset, which has historical annual financial data for all US firms; usfirms is the right dataset, which is a cross-sectional dataset with general information about firms from the S&P500 index.
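As a reference, here is a minimal sketch of that left join. The key column names (firm in usdata, empresa in usfirms) are taken from the head() outputs above; the merged frame is named data, which the sketches in the following challenges also assume.

# Left join: keep every firm-year row of usdata and attach the firm information
data = pd.merge(usdata, usfirms, left_on='firm', right_on='empresa', how='left')
print(data.shape)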
2 CHALLENGE 1. Calculating financial variables
Using the merged dataset, you have to write the code to calculate the following financial variables and financial ratios for all firms-years:
1. Create financial variables
Create columns for the following variables:
Market value (marketvalue) = originalprice * sharesoutstanding
We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares.
Gross profit (grossprofit) = Revenue - Cost of Goods Sold (cogs)
Earnings before interest and taxes (ebit) = Gross profit - Sales & general administrative expenses (sgae) - depreciation
Net Income (netincome) = ebit + otherincome + extraordinaryitems - finexp (financial expenses) - incometax
Annual market return: calculate the annual return for all firm-years as the continuously compounded percentage change of the adjusted stock price (adjprice). Consider that you have panel data, so be careful when calculating returns: the first year of each firm has no previous price, and you must avoid using data from another firm. Hint: you can use the shift function together with groupby on firm so that you never use another stock's price when calculating a firm's annual return. A code sketch covering all of these variables follows this list.
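A minimal sketch of these calculations, assuming the merged frame is named data and the columns are named as in the Workshop 2 data dictionary (revenue, cogs, sgae, depreciation, otherincome, extraordinaryitems, finexp, incometax, adjprice):

import numpy as np

# Sort so that shift() moves values exactly one year within each firm
data = data.sort_values(['firm', 'year'])

data['marketvalue'] = data['originalprice'] * data['sharesoutstanding']
data['grossprofit'] = data['revenue'] - data['cogs']
data['ebit'] = data['grossprofit'] - data['sgae'] - data['depreciation']
data['netincome'] = (data['ebit'] + data['otherincome'] + data['extraordinaryitems']
                     - data['finexp'] - data['incometax'])

# Continuously compounded annual return: log difference of the adjusted price.
# groupby('firm') keeps shift() inside each firm, so the first year of every
# firm gets NaN instead of borrowing another firm's last price.
data['annual_return'] = np.log(data['adjprice']) - np.log(data.groupby('firm')['adjprice'].shift(1))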
2. Using the same panel dataset, create columns for the following financial ratios (a code sketch follows this list):
Operational Return on Assets (roabit): ebit divided by total assets at the beginning of the period. Total assets at the beginning of the year is the same as total assets at the end of the previous year.
$$roabit=\frac{ebit_{t}}{totalassets_{t-1}}$$
Here you can use the shift function to get the value of total assets one year ago. Make sure you group by firm so that you do not use the totalassets of another firm when calculating the roabit of a firm.
Return on Assets (roa):
$$roa=\frac{netincome_{t}}{totalassets_{t-1}}$$
Operational Earnings per share (oeps): ebit / sharesoutstanding
Book-to-market ratio (bmr): book value / market value.
Book value can be calculated as totalassets minus totalliabilities
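A sketch of these ratios, under the same assumptions as above (frame data sorted by firm and year; totalassets and totalliabilities named as in the data dictionary):

# Lagged total assets: shift within each firm so year t uses assets of year t-1
data['totalassets_lag'] = data.groupby('firm')['totalassets'].shift(1)

data['roabit'] = data['ebit'] / data['totalassets_lag']
data['roa'] = data['netincome'] / data['totalassets_lag']
data['oeps'] = data['ebit'] / data['sharesoutstanding']

# Book-to-market ratio: book value (assets minus liabilities) over market value
data['bookvalue'] = data['totalassets'] - data['totalliabilities']
data['bmr'] = data['bookvalue'] / data['marketvalue']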
Do your own research and briefly explain what earnings per share deflated by price and the book-to-market ratio are.
3 CHALLENGE 2: Winsorization of variables
You have to do your own research about winsorization. Explain it in your own words (pay attention in class).
You have to winsorize the following ratio at the 2nd percentile for the low values and the 98th percentile for the high values:
Earnings per share deflated by price
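A possible sketch: here epsp is computed as net income per share divided by the original price (check the exact definition discussed in class), and the winsorization clips it at its 2nd and 98th percentiles.

# Earnings per share deflated by price (assumed definition)
data['epsp'] = (data['netincome'] / data['sharesoutstanding']) / data['originalprice']

# Winsorize: values below the 2nd percentile are set to the 2nd-percentile
# value, and values above the 98th percentile to the 98th-percentile value
low, high = data['epsp'].quantile([0.02, 0.98])
data['epspw'] = data['epsp'].clip(lower=low, upper=high)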
4 CHALLENGE 3: Create an annual dataset for historical market returns
You have to create an annual dataset with the annual market returns of the S&P500 index (^GSPC) from 2000 to 2023. You can download monthly quotes from Yahoo Finance using the yfinance library, aggregate them by year, and then calculate continuously compounded returns.
Once you have the annual returns of the ^GSPC in a dataset, merge these values into the merged panel-dataset using the year as the common/matching column.
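A sketch of one way to do this with yfinance (the download starts in 1999 so that the 2000 return can be computed from the 1999 year-end close; data is again the merged panel):

import numpy as np
import yfinance as yf

# Monthly ^GSPC quotes
hist = yf.Ticker('^GSPC').history(start='1999-01-01', end='2023-12-31', interval='1mo')

# Last closing price of each calendar year
annual_close = hist['Close'].groupby(hist.index.year).last()

# Continuously compounded annual return: log difference of year-end prices
market = np.log(annual_close).diff().dropna().reset_index()
market.columns = ['year', 'market_return']

# Merge into the panel using year as the common/matching column
data = data.merge(market, on='year', how='left')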
5 CHALLENGE 4: Logistic regression models with lagged values
Design and run a logistic regression to examine whether earnings per share deflated by price, winsorized (epspw), is related to the probability that the future annual stock return is higher than the future annual market return.
Pay attention in class to learn how to run a logistic regression model, and how to indicate to use future or lagged values for variables in the model.
You have to interpret the model
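A sketch with statsmodels, assuming data already contains annual_return, market_return, and epspw from the previous challenges; shift(-1) within each firm builds the one-year-ahead (future) values:

import statsmodels.formula.api as smf

# Future (one-year-ahead) stock and market returns, shifted within each firm
data = data.sort_values(['firm', 'year'])
data['F1_return'] = data.groupby('firm')['annual_return'].shift(-1)
data['F1_market'] = data.groupby('firm')['market_return'].shift(-1)

# Dependent variable: 1 if the firm beats the market next year, 0 otherwise
data['F1r_above_market'] = (data['F1_return'] > data['F1_market']).astype(int)

# Logistic regression of the future binary outcome on today's epspw
logit_model = smf.logit('F1r_above_market ~ epspw',
                        data=data.dropna(subset=['F1_return', 'F1_market', 'epspw'])).fit()
print(logit_model.summary())

For the interpretation: a positive and statistically significant coefficient on epspw would mean that firms with a higher winsorized earnings per share deflated by price today have a higher probability of beating the market the following year.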
6 CHALLENGE 5: Running your first Machine Learning model
(moved to week 4)
Create a dataset with the following columns:
Future annual stock return (1 year later)
F1r_above_market (1=beat the market in the corresponding year; 0= otherwise)
Earnings per share deflated by price (epsp).
Create a training and testing sample: randomly select 80% of observations for the training sample and 20% for the testing sample.
Using the training sample, run the same logistic model to check whether epsp has explanatory power for the probability that the stock beats the market.
Create and interpret the confusion matrix
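A sketch with scikit-learn, under the same assumptions about column names as in the previous challenge:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Keep only the columns needed and drop rows with missing values
ml_data = data[['F1_return', 'F1r_above_market', 'epsp']].dropna()

X = ml_data[['epsp']]
y = ml_data['F1r_above_market']

# Random 80% training / 20% testing split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Same logistic model as before, now fit only on the training sample
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Confusion matrix on the testing sample
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))

In the confusion matrix, rows are the actual classes and columns are the predicted classes, so the diagonal counts the correct predictions.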
7 Datacamp online courses
YOU MUST TAKE Chapter 1: Classification from the course Supervised Learning with scikit-learn
8 W3 submission
The grade of this Workshop will be the following:
Complete (100%): If you submit an ORIGINAL and COMPLETE work with all the activities, with your notes, and with your OWN RESPONSES to questions
Incomplete (75%): If you submit an ORIGINAL work with ALL the activities but you did NOT RESPOND to the questions, and/or you did not do all the activities and responded to only some of the questions.
Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop, or you completed more but parts of your work are copy-pasted from other workshops.
Not submitted (0%)
Remember that you have to submit your Google Colab LINK, and you have to SHARE it with me (cdorante@tec.mx).