download.file("https://www.apradie.com/datos/dataus2024.csv","dataus2023.csv")
= read.csv("dataus2023.csv") uspanel
Workshop 2, Algorithms and data analysis
0.1 General Directions for each workshop
You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOGIN with your @tec.mx account and then create a Google Colab document for each workshop.
You must share each Colab document (workshop) with the following accounts:
cdorante.tec@gmail.com
cdorante@tec.mx
You must give Edit privileges to these accounts.
In Google Colab you can work with Python or R notebooks. The default is Python notbooks, so you must go to “Runtime” or “Entorno de Ejecución”, select Change runtime type” or “Cambiar entorno de ejecución”, and then select R (instead of Python).
Your Notebook will have a default name like “Untitled2.ipynb”. Click on this name and change it to “W2-Econometrics-YourFirstName YourLastname”.
Pay attention in class to learn how to write text and R code into your Notebook.
In your Workshop Notebook you have to:
Replicate all the R Code along with its output.
You have to do any QUESTION AND/OR CHALLENGE asked in the workshop. The challenges can be: responses to specific questions and/or do an exercise/challenge.
For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.
- It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook to study for the FINAL EXAM. Your own workshop/notebook will be very helpful for your further study.
Once you finish your workshop, make sure that you RUN ALL CHUNKS. You can run each code chunk by clicking on the “Run” button located in the top-left section of each chunk. You can also run all the chunks in one-shot with Ctrl-F9. You have to submit to Canvas the web link of your Google Colab workshop.
1 Introduction
We will work with a panel data of real US public firms. You have to analyze historical quarterly financial statements of all US public firms listed in the New York Exchange and NASDAQ.
This is the dataset you will use for the Problem Situation (Final Project) of this class.
Download the following csv datasets:
List of all US public firms with general information of each firm
Panel data with historical financial quarterly data for all US public firms:
The first dataset (dataus2024) contains the historical financial data of the firms, while the second dataset (firmsus2024) is a catalog of all firms along with the corresponding industry type and status (active or cancelled).
The dataus2024 dataset has a panel-data (also called long format) structure. Each row has financial information for one US firm and 1 period (a quarter). All $ amounts are in thousands (’1000s). Here is a data dictionary of the columns:
Variable | Description |
---|---|
firm | Unique code of the company (also called ticker) |
q | Quarter date |
fiscalmonth | Month of the year when the firm closes a fiscal year |
revenue | Total sales of the firm from the first fiscal quarter to the current quarter |
cogs | Cost of good sold - variable costs of the products sold - from the first fiscal quarter to the current quarter |
sgae | Sales and general administrative expenses - from the first fiscal quarter to the current quarter |
otherincome | Other operational income/expenses that are not directly from the core operations of the firm - from the first fiscal quarter to the current quarter |
extraordinaryitems | Extra income/expenses not related to regular operations - from the first fiscal quarter to the current quarter |
finexp | Financial expenses - interest expenses paid (generated from loans) - from the first fiscal quarter to the current quarter |
incometax | Income tax from the first fiscal quarter to the current quarter |
totalassets | Total assets of the firm at the end of the quarter |
currentassets | Current assets of the firm at the end of the quarter |
totalliabilities | Total liabilities of the firm at the end of the quarter |
currentliabilities | Current liabilities of the firm at the end of the quarter |
longdebt | Balance of long-term financial debt (loans to pay longer than 1 year) |
adjprice | Stock adjusted price at the end of the quarter; adjusted for stock splits and dividend payments; used to calculate stock returns |
originalprice | Historical stock price (not adjusted); used to calculate historical market value |
sharesoutstanding | Historical number of shares available in the market |
fixedassets | Fixed assets value at the end of the quarter |
year | Calendar year |
yearf | Fiscal year - this depends on when the firm ends its fiscal year; if fiscalmonth=12 in the quarter 3, then the fiscal year will start in Q4 of a year and ends in the Q3 of the following year |
Each row of this dataset has quarterly financial data of one firm in one quarter. All firms have quarters from Q1 2000 to Q2 2024. Not all firms have existed since 2000, so if the first quarters are empty that means that the firm did not exist in the US financial market in those quarters. Then, it is possible to know when each firm went public to issue shares in the financial market: the first quarter with some non-empty data.
Each firm has defined the month of the year used to close a fiscal year. For example, Apple closes the fiscal year at the end of Quarter 3 (end of September) of any year. Then, for Apple, in the Q3 of 2022, there will be a 12 for the fiscalmonth variable. In this case, Apple starts its fiscal year in the Q4 of each year and ends in the Q3 of the following year. Most of the firms (about 80%) close fiscal year in December, so these firms will have a 12 in the Q4 of each year.
The variables related to sales and expenses are cumulative for each fiscal year. For example, Apple sold about $117 billion in the last calendar quarter (Q4) of 2022, but this is the first fiscal quarter for Apple. For Q1 (calendar) 2023 (which is the 2nd fiscal quarter), Apple has about $212 billion in the revenue variable, meaning that considering fiscal quarter 1 and 2, Apple has sold $212 billion. For Q2 2023 Apple has about $293 billion, meaning that the cumulative revenue of fiscal Q1, Q2 and Q3 is about $293 billion. Then, if you select rows with fiscalmonth=12, then you will be selecting those quarters with annual financial information for each firm!
Earnings before interest and Taxes (ebit) and Net Income (netincome) must be calculated as:
ebit = revenue - cogs - sgae
netincome = ebit + otherincome + extraordinaryitems - finexp - incometax
The firmsus2023.csv is a catalog of all active and cancelled US firms:
Variable | Description |
---|---|
firm | Unique code of the company (also called ticker) |
name | Name of the firm |
status | Status of the firm: active or cancelled |
partind | Percent participation in the S&P500 market index |
naics1 | North American Industry Classification Code - Level 1 |
naics2 | North American Industry Classification Code - Level 2 |
SectorEconomatica | Economatica Industry classification |
2 CHALLENGE 1: Data management algorithms
You have to import the datasets. We will download these datasets from a web site. Here is the code:
Download and import the panel quarterly data of all US firms:
Now we download and import that list of all public US firms:
download.file("https://www.apradie.com/datos/firmsus2024.csv","firmsus2024.csv")
= read.csv("firmsus2024.csv") usfirms
Write the code do the following queries:
1. Display APPLE annual historical information
Display the revenue and total assets for Apple for the annual fiscal quarters of 2021, 2022 and 2023. The annual quarters are those with fiscalmonth=12
Do an online check, find the most recent 10-K report (annual financial statement) of Apple, and compare 2023 anual sales with what you got. You must get the same amount for the anual sales reported at the end of fiscal year (for Apple, the fiscal year is Q3).
2. Merge the datasets
In the dataus2024 panel dataset there no information about firm name, industry, status (active or cancelled), etc. That information is in the firmsus2024 datase. So, we start by integrating this information into the panel data.
Merge the firmsus2024 dataset into the dataus2024 dataset, but only pull the following firm characteristics (columns): Company name, status, and industry (naics1).
In the dataus2024 dataset we have quarterly financial data. For income-statement variables, the information is the cumulative amount from Q1 to the corresponding quarter of the fiscal year. For balance-sheet variables, the information is the value of the variable at the end of the quarter (since the beginning when the firm was created).
3 CHALLENGE 2. Calculating financial variables
You have to write the code to calculate the following financial variables and financial ratios for all firms-quarters of US firms.
1. Create financial variables
Using the dataus20204 dataset, create columns for the following variables:**
- Market value (marketvalue) = originalprice * sharesoutstanding
We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares.
Gross profit (grossprofit) = Revenue - Cost of good Sold (cogs)
Earnings before interest and taxes (ebit) = Gross profit - Sales & general administrative expenses (sgae)
Net Income (netincome) = ebit + otherincome + extraordinaryitems - financial expenses - incometax
Annual market return: use adjusted stock price and remember that you have quarterly data.
Here you have to use the lag function and group by firm to avoid using stock price of another stock to calculate the stock return of a stock. Hint: check Workshop 1.
2. Using the same panel dataset, create columns for the following financial ratios:
- Operational Return on Assets (roabit): roabit divided by total assets at the beginning of the period. Total assets of the beginning of the year is actually the total assets of ONE year ago.
roabit=\frac{ebit_{t}}{totalassets_{t-4}} Here you can use the lag function to get value of total assets one year ago, which is 4 quarters ago.
- Return on Assets (roa):
roa=\frac{netincome_{t}}{totalassets_{t-4}}
Operational Earnings per share (oeps): ebit / sharesoutstanding
Operational eps deflated by stock price (oepsp) : oeps / originalprice
Book-to-market ratio (bmr): book value / market value.
Book value can be calculated as totalassets minus totalliabilities
Do your own research and briefly explain what is earnings per share deflated by price, and book-to-market ratio
4 CHALLENGE 3: Create a cross-sectional dataset with the most recent annual data
1. Select active firms in the 2023 fiscal year
Using the dataus2024 dataset, write the code to select active firms in the 2023 fiscal year.
To do this, you have to select firm-quarters with fiscalmonth=12, yearf=2023 and status=“active”
Save the result in a new dataset called active2023
The fiscal month variable indicates the following: if fiscalmonth=12, then that means that the corresponding quarter is the end-of-the-fiscal-year quarter.
5 CHALLENGE 4: Code for descriptive statistics for 2023
Using the active2023 dataset, you have to do the following descriptive statistics and respond the following questions.
1. Show and interpret the median, 25 and 75 percentile of total assets and market capitalization (market value) of the active US firms in 2023
2. By industry show the # of firms, median, 25 and 75 percentile of total assets and market capitalization (market value) of the active US firms in 2023
3. Which are the 10 biggest active firms in terms of total assets according to the end of fiscal year of 2023? Show the name of the firms, the industry and total assets.
4. Which are the 10 biggest active firms in terms of market capitalization? Show the name of the firms, the industry and market value as of 2023
5. Show and interpret a histogram of the 2023 market capitalization for all ACTIVE firms
6. For each industry show and interpret a box plot to better understand the size of active firms in 2023 (use the market value)
6 Challenge 5. Review of Logistic Regression models
1. Using the dataus2024 dataset you have to run a logistic regression model to examine whether the operational earnings per share deflated by price (oepsp) is related to the probability that the firm ROA is greater than the corresponding industry ROA mean.
HINT: you can use the function glm specifying family=“binomial”:
logit_model1 <- glm(ROA_above ~ oepspw ,data = uspanel, family = “binomial”,na.action = na.omit)
2. Learn about the logistic regression and interpret the result of this regression model (interpret the coefficient of this model and its significance.
7 Challenge 6. Datacamp Course
Go to the course: Machine Learning with caret in R and DO CHAPTER 2: Classification models: fitting them and evaluating their performance