# Download the CSV file from the website:
download.file("http://www.apradie.com/datos/dataus2024.csv", "dataus2024.csv")
download.file("http://www.apradie.com/datos/firmsus2024.csv", "firmsus2024.csv")
# Import the panel data
<- read.csv("dataus2024.csv")
uspanel
# Import the firm catalog
<- read.csv("firmsus2024.csv") usfirms
Workshop 2
FZ2022 Algorithms and Data Analytics
0.1 General Directions for each workshop
You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOGIN with your @tec.mx account and then create a Google Colab notebook for each workshop.
You must share each Colab document (workshop) with me via e-mail:
sergio.castellanos@tec.mx
You must give Edit privileges to this account.
In Google Colab, you can work with Python or R notebooks. The default is Python notebooks, so you must go to “Runtime” or “Entorno de Ejecución”, select “Change runtime type” or “Cambiar entorno de ejecución”, and then select R (instead of Python).
Your notebook will have a default name like “Untitled2.ipynb”. Click on this name and change it to “W1_FZ2022_YourFirstName_YourLastname”.
Pay attention in class to learn how to write text and R code into your notebook.
In your Workshop Notebook, you have to:
- Replicate all the R code along with its output.
- You must complete any question and/or challenge asked in the workshop. The challenges can include specific questions or require you to complete an exercise/challenge.
For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.
- It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook to study for the FINAL EXAM. Your own workshop/notebook will be very helpful for further study.
Once you finish your workshop, make sure that you RUN ALL CHUNKS. You can run each code chunk by clicking on the “Run” button located in the top-left section of each chunk. You can also run all the chunks at once with Ctrl-F9. You must submit the web link of your Google Colab workshop to Canvas.
1 Introduction
We will work with a panel dataset of real US public firms. You will analyze historical quarterly financial statements of all US public firms listed on the New York Stock Exchange (NYSE) and NASDAQ.
This is the dataset with which you will work on the Case Study (Final Project) of this class.
1.1 Objective
You are required to analyze historical quarterly financial statements of real US public firms listed in the NYSE and NASDAQ. Download the following CSV datasets:
You can download these files to your Google Drive folder for easy import into Colab.
First, we will start by conducting descriptive statistics and visualizations to better understand the US financial market.
1.2 Understanding the Datasets
The dataus2024.csv
contains historical financial quarterly data for the firms, while the firmsus2024.csv
dataset is a catalog of firms with industry classifications and status (active or canceled).
1.2.1 Structure of the Panel Data
The dataus2024
dataset has a panel-data (long format) structure, where each row provides financial information for a US firm and one period (a quarter). All dollar amounts are in thousands (’1000s). Below is the data dictionary for the columns:
Variable | Description |
---|---|
firm | Unique code (ticker) for the company |
q | Quarter date |
fiscalmonth | Month in which the firm closes the fiscal year |
revenue | Cumulative total sales from the first fiscal quarter to the current quarter |
cogs | Cost of goods sold - cumulative variable costs of products sold |
sgae | Cumulative sales and general administrative expenses |
otherincome | Other operational income or expenses unrelated to core operations |
extraordinaryitems | Income or expenses not related to regular operations (e.g., gains from asset sales, disaster losses) |
finexp | Financial expenses (interest on loans) |
incometax | Income tax paid |
totalassets | Total assets at the end of the quarter |
currentassets | Current assets at the end of the quarter |
totalliabilities | Total liabilities at the end of the quarter |
currentliabilities | Current liabilities at the end of the quarter |
longdebt | Long-term financial debt (due in more than 1 year) |
adjprice | Adjusted stock price (adjusted for splits and dividends) |
originalprice | Historical stock price (unadjusted) |
sharesoutstanding | Number of shares available |
fixedassets | Fixed assets at the end of the quarter |
year | Calendar year |
yearf | Fiscal year - depends on when the firm ends its fiscal year |
This structure allows you to track a firm’s financial status over time and perform time-series analyses, which are essential for evaluating firm performance, trends, and conducting industry comparisons.
Each row in this dataset contains the quarterly financial data of a firm for a specific quarter. The dataset spans from Q1 2000 to Q2 2024. However, not all firms have existed since 2000, so if a firm’s first few quarters are empty, it indicates that the firm was not yet active in the U.S. financial market during those periods. The first quarter with non-empty data typically marks the firm’s public listing date when it started issuing shares.
Each firm defines the month in which its fiscal year ends. For example, Apple closes its fiscal year at the end of Q3 (end of September). Therefore, in Q3 2022, the variable fiscalmonth
will show 12 for Apple, signifying the fiscal year-end. This means that Apple’s fiscal year starts in Q4 of each calendar year and ends in Q3 of the following year. Most firms (around 80%) close their fiscal year in December, so these firms will display a fiscalmonth
of 12 in Q4.
Variables related to sales and expenses are cumulative throughout each fiscal year. For instance, while Apple sold around $117 billion in the calendar Q4 of 2022, this represents its first fiscal quarter. In calendar Q1 of 2023 (Apple’s second fiscal quarter), Apple’s cumulative revenue is approximately $212 billion, meaning that the combined sales for fiscal Q1 and Q2 total this amount. By Q2 of 2023, the revenue reaches $293 billion, which is the cumulative total for fiscal Q1, Q2, and Q3.
Tip: To obtain annual financial data, you can filter the dataset by selecting rows where fiscalmonth = 12
. This will give you the quarters where firms report their full-year financials.
The firmsus2024.csv is a catalog of all active and cancelled US firms:
Variable | Description |
---|---|
firm | Unique code of the company (also called ticker) |
name | Name of the firm |
status | Status of the firm: active or cancelled |
partind | Percent participation in the S&P500 market index |
naics1 | North American Industry Classification Code - Level 1 |
naics2 | North American Industry Classification Code - Level 2 |
SectorEconomatica | Economatica Industry classification |
1.3 Tips for Working with the Data
Fiscal Month: The variable
fiscalmonth
is key in determining when a firm’s fiscal year ends. About 80% of firms close their fiscal year in December. For these firms, rows withfiscalmonth=12
represent the annual summary of their financial statements.Cumulative Variables: Revenue, COGS, and other income variables are cumulative over the fiscal year. Be careful when interpreting these; if you’re interested in annual data, focus on rows with
fiscalmonth=12
.
2 Challenge 1: Data Management Algorithms
The first step is to import the datasets into R and perform a few basic checks.
2.1 Importing Data
You will download and import the data files directly into your R environment:
2.2 Display APPLE annual historical information
Display the revenue and total assets for Apple for the annual fiscal quarters of 2021, 2022, and 2023. The annual quarters are those where fiscalmonth = 12
.
Next, perform an online check to find the most recent 10-K report (Apple’s annual financial statement), and compare the 2023 annual sales with your results. Make sure the amount you get matches the annual sales reported at the end of the fiscal year (for Apple, the fiscal year ends in Q3). You should be able to verify this using the information from the 10-K report.
2.3 Merge the datasets
In the dataus2024
panel dataset, there is no information about the firm name, industry, status (active or cancelled), etc. That information is available in the firmsus2024
dataset. We will start by integrating this information into the panel data.
Merge the firmsus2024
dataset into the dataus2024
dataset, but only include the following firm characteristics (columns): - Company name
- Status
- Industry
(naics1
)
In the dataus2024
dataset, we have quarterly financial data: - For income-statement variables, the information is the cumulative amount from Q1 to the corresponding quarter of the fiscal year. - For balance-sheet variables, the information represents the value of the variable at the end of the quarter, tracking the firm’s financial position since its inception.
3 CHALLENGE 2. Calculating financial variables
You have to write the code to calculate the following financial variables and financial ratios for all firms-quarters of US firms.
1. Create financial variables
Using the dataus20204 dataset, create columns for the following variables:
- Market value (marketvalue) = originalprice * sharesoutstanding
We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares.
Gross profit (grossprofit) = Revenue - Cost of good Sold (cogs)
Earnings before interest and taxes (ebit) = Gross profit - Sales & general administrative expenses (sgae)
Net Income (netincome) = ebit + otherincome + extraordinaryitems - financial expenses - incometax
Annual market return: use adjusted stock price and remember that you have quarterly data.
Here you have to use the lag function and group by firm to avoid using stock price of another stock to calculate the stock return of a stock. Hint: check Workshop 1.
2. Using the same panel dataset, create columns for the following financial ratios:
- Operational Return on Assets (roabit): roabit divided by total assets at the beginning of the period. Total assets of the beginning of the year is actually the total assets of ONE year ago.
roabit=\frac{ebit_{t}}{totalassets_{t-4}} Here you have to use the lag function of the plm package to get value of total assets one year ago, which is 4 quarters ago.
- Return on Assets (roa):
roa=\frac{netincome_{t}}{totalassets_{t-4}}
Operational Earnings per share (oeps): ebit / sharesoutstanding
Operational epsebit deflated by stock price (oepsp) : eps_ebit / original stock price
Book-to-market ratio (bmr): book value / market value. Use total assets as book value of the firm-quarter.
Do your own research and briefly explain what is earnings per share deflated by price, and book-to-market ratio
4 CHALLENGE 3: Create a cross-sectional dataset with the most recent annual data
1. Select active firms in the 2023 fiscal year
Using the dataus2024 dataset, write the code to select active firms in the 2023 fiscal year.
To do this, you have to select firm-quarters with fiscalmonth=12, yearf=2023 and status=“active”
Save the result in a new dataset called active2023
The fiscal month variable indicates the following: if fiscalmonth=12, then that means that the corresponding quarter is the end-of-the-fiscal-year quarter.
5 CHALLENGE 4: Code for descriptive statistics for 2023
Using the active2023 dataset, you have to do the following descriptive statistics and respond the following questions.
1. Show and interpret the median, 25 and 75 percentile of total assets and market capitalization (market value) of the active US firms in 2023.
2. By industry show the # of firms, median, 25 and 75 percentile of total assets and market capitalization (market value) of the active US firms in 2023.
3. Which are the 10 biggest active firms in terms of total assets according to the end of fiscal year of 2023? Show the name of the firms, the industry and total assets.
4. Which are the 10 biggest active firms in terms of market capitalization? Show the name of the firms, the industry and market value as of 2023.
5. Show and interpret a histogram of the 2023 market capitalization for all ACTIVE firms.
6. For each industry show and interpret a box plot to better understand the size of active firms in 2023 (use the market value)
6 Challenge 5. Review of Logistic Regression models
1. Using the dataus2024 dataset you have to run a logistic regression model to examine whether the operational earnings per share deflated by price (oepsp) is related to the probability that the firm ROA is greater than the corresponding industry ROA mean.
HINT: you can use the function glm specifying family=“binomial”:
logit_model1 <- glm(ROA_above ~ oepspw ,data = uspanel, family = “binomial”,na.action = na.omit)
2. Learn about the logistic regression and interpret the result of this regression model (interpret the coefficient of this model and its significance).
7 Challenge 6. Datacamp Course
Go to the course: Machine Learning with caret in R and DO CHAPTER 2: Classification models: fitting them and evaluating their performance.
8 W2 submission
The grade for this workshop will be as follows:
Complete (100%): If you submit an ORIGINAL and COMPLETE notebook with all the activities, your notes, and your OWN RESPONSES to questions. Submitting someone else’s work or having two or more people submit the same answers (exact texts or with minor changes or translations) will be considered plagiarism and a violation of academic integrity (FIA).
Incomplete (75%): If you submit an ORIGINAL notebook with ALL the activities but did NOT RESPOND to the questions and/or did not complete all activities, or if you only responded to some questions. Remember, submitting identical or slightly altered work from another person is considered a breach of academic integrity (FIA).
Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop or if you completed more but parts of your work are copy-pasted from other notebooks. This too will be considered an FIA if found to be plagiarized.
Not submitted (0%): No submission, no credit.
It is important to remember that submitting someone else’s work, or identical work (even with minor changes or translations) between two or more students, is strictly prohibited and will be considered an academic integrity violation (FIA).
Remember that you have to submit the Google Colab LINK to your notebook, and you must SHARE it with me, ensuring that I have edit access.