# To avoid scientific notation:
options(scipen=999)
#Read the csv files:
<- read.csv("https://www.apradie.com/datos/dataus2024.csv")
uspanel <- read.csv("https://www.apradie.com/datos/firmsus2024.csv") usfirms
Workshop 2 Solution, Algorithms and data analysis
1 Introduction
We will work with a panel data of real US public firms. You have to analyze historical quarterly financial statements of all US public firms listed in the New York Exchange and NASDAQ.
This is the dataset you will use for the Problem Situation (Final Project) of this class.
First, we will do descriptive statistics and visualization with the purpose of understanding the US financial market.
You have to work with 2 datasets:
firmsus2024.csv: List of all US public firms with general information of each firm
dataus2024.csv : Panel data with historical financial quarterly data for all US public firms.
You have to download these 2 files from Canvas/Modules and save in the directory where you have your workshop.
The first dataset (dataus2023q2) contains the historical financial data of the firms, while the second dataset (firmsus2023) is a catalog of all firms along with the corresponding industry type and status (active or cancelled).
The dataus2023q2 dataset has a panel-data (also called long format) structure. Each row has financial information for one US firm and 1 period (a quarter). All $ amounts are in thousands (’1000s). Here is a data dictionary of the columns:
Variable | Description |
---|---|
firm | Unique code of the company (also called ticker) |
q | Quarter date |
fiscalmonth | Month of the year when the firm closes a fiscal year |
revenue | Total sales of the firm from the first fiscal quarter to the current quarter |
cogs | Cost of good sold - variable costs of the products sold - from the first fiscal quarter to the current quarter |
sgae | Sales and general administrative expenses - from the first fiscal quarter to the current quarter |
otherincome | Other operational income/expenses that are not directly from the core operations of the firm - from the first fiscal quarter to the current quarter |
extraordinaryitems | Extra income/expenses not related to regular operations - from the first fiscal quarter to the current quarter |
finexp | Financial expenses - interest expenses paid (generated from loans) - from the first fiscal quarter to the current quarter |
incometax | Income tax from the first fiscal quarter to the current quarter |
totalassets | Total assets of the firm at the end of the quarter |
currentassets | Current assets of the firm at the end of the quarter |
totalliabilities | Total liabilities of the firm at the end of the quarter |
currentliabilities | Current liabilities of the firm at the end of the quarter |
longdebt | Balance of long-term financial debt (loans to pay longer than 1 year) |
adjprice | Stock adjusted price at the end of the quarter; adjusted for stock splits and dividend payments; used to calculate stock returns |
originalprice | Historical stock price (not adjusted); used to calculate historical market value |
sharesoutstanding | Historical number of shares available in the market |
fixedassets | Fixed assets value at the end of the quarter |
year | Calendar year |
yearf | Fiscal year - this depends on when the firm ends its fiscal year; if fiscalmonth=12 in the quarter 3, then the fiscal year will start in Q4 of a year and ends in the Q3 of the following year |
Each row of this dataset has quarterly financial data of one firm in one quarter. All firms have quarters from Q1 2000 to Q2 2023. Not all firms have existed since 2000, so if the first quarters are empty that means that the firm did not exist in the US financial market in those quarters. Then, it is possible to know when each firm went public to issue shares in the financial market: the first quarter with some non-empty data.
Each firm has defined the month of the year used to close a fiscal year. For example, Apple closes the fiscal year at the end of Quarter 3 (end of September) of any year. Then, for Apple, in the Q3 of 2022, there will be a 12 for the fiscalmonth variable. In this case, Apple starts its fiscal year in the Q4 of each year and ends in the Q3 of the following year. Most of the firms (about 80%) close fiscal year in December, so these firms will have a 12 in the Q4 of each year.
The variables related to sales and expenses are cumulative for each fiscal year. For example, Apple sold about $117 billion in the last calendar quarter (Q4) of 2022, but this is the first fiscal quarter for Apple. For Q1 (calendar) 2023 (which is the 2nd fiscal quarter), Apple has about $212 billion in the revenue variable, meaning that considering fiscal quarter 1 and 2, Apple has sold $212 billion. For Q2 2023 Apple has about $293 billion, meaning that the cumulative revenue of fiscal Q1, Q2 and Q3 is about $293 billion. Then, if you select rows with fiscalmonth=12, then you will be selecting those quarters with annual financial information for each firm!
The firmsus2023.csv is a catalog of all active and cancelled US firms:
Variable | Description |
---|---|
firm | Unique code of the company (also called ticker) |
name | Name of the firm |
status | Status of the firm: active or cancelled |
partind | Percent participation in the S&P500 market index |
naics1 | North American Industry Classification Code - Level 1 |
naics2 | North American Industry Classification Code - Level 2 |
SectorEconomatica | Economatica Industry classification |
2 Challenge 1: Data management algorithms
We import the csv files into the R Environment:
As a quick check of the imported data, write the code do the following query:
1. Display the revenue and total assets for Apple for the annual fiscal quarters of 2022, 2021 and 2023. The annual quarters are those with fiscalmonth=12
Do an online check, find the most recent 10-K report (annual financial statement) of Apple, and compare 2022 annual sales with what you got. You must get the same amount for the annual sales reported at the end of fiscal year (for Apple, the fiscal year is Q3).
This type of data management is also called query. A query is a search statement in a database.
APPROACH 1: Using sub-setting of data frames:
In the row index of the data frame I indicate the condition to Apple and the year equal or greater than 2020. In the column index I indicate to select only 4 relevant columns:
# I assign the result of this query
= uspanel[uspanel$firm=="AAPL" & uspanel$yearf>=2021 & uspanel$fiscalmonth==12,
APPLE_info c("firm","q","revenue","yearf","fiscalmonth")]
# I saved the result of the query in a new dataset called APPLE_2020
# I display the result:
APPLE_info
firm q revenue yearf fiscalmonth
1165 AAPL 2021q3 365817000 2021 12
1169 AAPL 2022q3 394328000 2022 12
1173 AAPL 2023q3 383285000 2023 12
All the financial amounts in this dataset is in thousands (1000’s). The fiscalmonth column is the fiscal month at the end of the quarter. The q column is the quarter. Since the fiscalmonth=12 is in the Quarters Q3 of each year, this indicates that Apple ends its fiscal year in the Q3, which is in September of each year.
We can display only the annual 2023 sales of Apple as follows:
$yearf==2023,c("revenue")] APPLE_info[APPLE_info
[1] 383285000
To report the amount in currency format, we can use the dollar function from the scales package (you have to install the scales package) as follows:
library(scales)
= as.numeric(1000*APPLE_info$revenue[APPLE_info$yearf==2023])
APPLE_2023_ANNUALSALES # I used the as.numeric function since the dollar function only receives numeric variables:
dollar(APPLE_2023_ANNUALSALES)
[1] "$383,285,000,000"
Then, we can say that in 2023 Apple reported annual total sales amount of $383,285,000,000. This is more than 380 billion dollars! (383 thousand of millions!).
APPROACH 2: USING THE dplyr PACKAGE:
I use the filter function to select rows and select function to select columns:
library(dplyr)
<- uspanel %>%
APPLE_info filter(firm=="AAPL", yearf>=2021, fiscalmonth==12) %>%
select(firm,q,revenue,yearf,fiscalmonth)
APPLE_info
firm | q | revenue | yearf | fiscalmonth |
---|---|---|---|---|
AAPL | 2021q3 | 365817000 | 2021 | 12 |
AAPL | 2022q3 | 394328000 | 2022 | 12 |
AAPL | 2023q3 | 383285000 | 2023 | 12 |
The rest is just formatting as shown in approach 1.
2. Merge the datasets
In the dataus2024 panel dataset there no information about firm name, industry, status (active or cancelled), etc. That information is in the firmsus2024 datase. So, we start by integrating this information into the panel data.
Merge the firmsus2024 dataset into the dataus2024 dataset, but only pull the following firm characteristics (columns): Company name, status, and industry (naics1).
In the dataus2024 dataset we have quarterly financial data. For income-statement variables, the information is the cumulative amount from Q1 to the corresponding quarter of the fiscal year. For balance-sheet variables, the information is the value of the variable at the end of the quarter (since the beginning when the firm was created).
2. SOLUTION
We can use the merge function or the left_join function to pull the important variables for each firm from the firms2024 dataset into the uspanel dataset.
When doing a merge between 2 dataset, we have to identify which is the common column between the datasets. The common column or variable is needed to do the “match” between the uspanel with the usfirms dataset.
We can check the variables for both datasaet and identify the common variable:
For usfirms:
colnames(usfirms)
[1] "empresa" "Nombre" "status"
[4] "partind" "naics1" "naics2"
[7] "SectorEconomatica"
For uspanel:
colnames(uspanel)
[1] "firm" "q" "fiscalmonth"
[4] "revenue" "cogs" "sgae"
[7] "otherincome" "extraordinaryitems" "finexp"
[10] "incometax" "totalassets" "currentassets"
[13] "inventory" "totalliabilities" "currentliabilities"
[16] "longdebt" "adjprice" "originalprice"
[19] "sharesoutstanding" "year" "fixedassets"
[22] "yearf" "cto" "fiscalq"
It seems that the id for the firms is called empresa in the usfirms and firm in the uspanel:
head(usfirms)
empresa Nombre status partind
1 A Agilent Technologies, Inc activo 0.09
2 AA Alcoa Corp activo NA
3 AABA_old Altaba Inc cancelado NA
4 AAC_old Aac Hldg Inc cancelado NA
5 AAIC_old Arlington Asset Investment Corp cancelado NA
6 AAL American Airlines Group Inc activo 0.02
naics1
1 Industrias manufactureras
2 Industrias manufactureras
3 Servicios profesionales, científicos y técnicos
4 Servicios de salud y de asistencia social
5 Servicios financieros y de seguros
6 Transportes, correos y almacenamiento
naics2
1 Fabricación de instrumentos de medición, control, navegación, y equipo médico electrónico
2 Industria básica del aluminio
3 Servicios de diseño de sistemas de cómputo y servicios relacionados
4 Otros consultorios para el cuidado de la salud
5 Otros fondos
6 Transporte aéreo regular
SectorEconomatica
1 Electroelectronic
2 Siderur & Metalur
3 Software y Datos
4 Otros
5 Fondos
6 Transporte Servic
head(uspanel)
firm q fiscalmonth revenue cogs sgae otherincome
1 A 2000q1 NA NA NA NA NA
2 A 2000q2 6 4731000 2421000 1925000 73000
3 A 2000q3 9 7401000 3790000 3016000 101000
4 A 2000q4 12 10773000 5522000 4198000 111000
5 A 2001q1 3 2841000 1449000 1113000 19000
6 A 2001q2 6 4930000 2683000 2025000 277000
extraordinaryitems finexp incometax totalassets currentassets inventory
1 NA NA NA NA NA NA
2 0 0 161000 7321000 5057000 1622000
3 0 0 244000 7827000 5344000 1762000
4 0 0 407000 8425000 5655000 1853000
5 -25000 0 119000 9208000 5461000 2129000
6 -14000 0 235000 9080000 4998000 1791000
totalliabilities currentliabilities longdebt adjprice originalprice
1 NA NA NA 62.87922 104.0000
2 2679000 2186000 0 44.58983 73.7500
3 2925000 2510000 0 29.58800 48.9375
4 3160000 2758000 0 33.10228 54.7500
5 3667000 3070000 0 18.57960 30.7300
6 3464000 3098000 0 19.64976 32.5000
sharesoutstanding year fixedassets yearf cto fiscalq
1 452000.0 2000 NA NA 1 NA
2 452272.0 2000 1453000 2000 2 2
3 453014.6 2000 1581000 2000 3 3
4 456366.4 2000 1741000 2000 4 4
5 456769.7 2001 1821000 2001 1 1
6 457072.7 2001 1848000 2001 2 2
We can re-name the empresa column to “firm” to have the same name in both datasets, and then do the merge:
I can use dplyr to rename the column:
= usfirms %>%
usfirms rename(firm = empresa )
colnames(usfirms)
[1] "firm" "Nombre" "status"
[4] "partind" "naics1" "naics2"
[7] "SectorEconomatica"
Now I use the firm column for both dataset to do the merge. If I only want to consider firm name, industry and status, I can select only these columns from the usfirms datsaet before doing the merge.
In this case, instead of using the merge function, I will use the left_join function from the dplyr package. The left_join is faster and make sure you keep all rows of the uspanel. The merge function might not keep all original rows of uspanel in the case that there is a firm that is in the uspanel and is not in the usfirms.
# I do a subset with only the columns I am interested to pull and paste into the uspanel:
= usfirms %>%
usfirms1 select(firm, Nombre,naics1,status)
# I do the merge and keep the result in the same uspanel:
= left_join(uspanel,usfirms1,by="firm") uspanel
I can check the result by showing only some columns for one firm:
%>%
uspanel filter(firm=="MSFT") %>%
head(5) %>%
select(firm,Nombre,q,naics1,totalassets,revenue)
firm Nombre q naics1 totalassets revenue
1 MSFT Microsoft Corp 2000q1 Información en medios masivos 50895000 17152000
2 MSFT Microsoft Corp 2000q2 Información en medios masivos 52150000 22956000
3 MSFT Microsoft Corp 2000q3 Información en medios masivos 56089000 5800000
4 MSFT Microsoft Corp 2000q4 Información en medios masivos 57691000 12385000
5 MSFT Microsoft Corp 2001q1 Información en medios masivos 59605000 18841000
3 CHALLENGE 2. Calculating financial variables
You have to write the code to calculate the following financial variables and financial ratios for all firms-quarters of US firms.
1. Create financial variables
Using the dataus20204 dataset, create columns for the following variables:**
- Market value (marketvalue) = originalprice * sharesoutstanding
We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares.
I USE THE dplyr package:
= uspanel %>%
uspanel mutate(marketvalue= originalprice * sharesoutstanding)
- Gross profit (grossprofit) = Revenue - Cost of good Sold (cogs)
= uspanel %>%
uspanel mutate(grossprofit= revenue - cogs)
- Earnings before interest and taxes (ebit) = Gross profit - Sales & general administrative expenses (sgae)
= uspanel %>%
uspanel mutate(ebit = grossprofit - sgae)
- Net Income (netincome) = ebit + otherincome + extraordinaryitems - financial expenses - incometax
= uspanel %>%
uspanel mutate(netincome = ebit + otherincome + extraordinaryitems - finexp - incometax)
- Annual market return: use adjusted stock price and remember that you have quarterly data.
Here you have to use the lag function and group by firm to avoid using stock price of another stock to calculate the stock return of a stock. Hint: check Workshop 1.
I will calculate simple and continuously compounded annual return:
= uspanel %>%
uspanel group_by(firm) %>%
arrange(firm,q) %>%
mutate(annual_R = adjprice / lag(adjprice,4) - 1, # Annual simple return
annual_r = log(adjprice) - lag(log(adjprice,4))) %>% # log annual return
ungroup()
THE lag FUNCTION GETS THE PREVIOUS VALUE OF THE VARIABLE. WHEN USING the lag FUNCTION IT IS VERY IMPORTANT TO GROUP_BY THE DATASET BY FIRM TO AVOID USING PRICES FROM OTHER STOCK TO CALCULATE STOCK RETURN!
THE arrange FUNCTION IS RECOMMENDED TO MAKE SURE THAT THE DATA IS SORTED BY QUARTER BEFORE CALCULATING THE QUARTERLY RETURN!
NOTE THAT I USE THE PARAMETER 1 IN THE lag FUNCTION TO INDICATE THAT I AM CALCULATING QUARTERLY RETURN SINCE THE DATASET IS QUARTERLY.
2. Using the same panel dataset, create columns for the following financial ratios:
- Operational Return on Assets (roabit): roabit divided by total assets at the beginning of the period. Total assets of the beginning of the year is actually the total assets of ONE year ago.
roabit=\frac{ebit_{t}}{totalassets_{t-4}}
Here you can use the lag function to get value of total assets one year ago, which is 4 quarters ago.
- Return on Assets (roa):
roa=\frac{netincome_{t}}{totalassets_{t-4}}
= uspanel %>%
uspanel group_by(firm) %>%
arrange(firm,q) %>%
mutate(roabit = ebit / lag(totalassets,4),
roa = netincome / lag(totalassets,4)) %>%
ungroup()
Operational Earnings per share (oeps): ebit / sharesoutstanding
Operational eps deflated by stock price (oepsp) : oeps / originalprice
Book-to-market ratio (bmr): book value / market value.
Book value can be calculated as totalassets minus totalliabilities
= uspanel %>%
uspanel mutate(oeps = ifelse(sharesoutstanding==0,NA,ebit / sharesoutstanding),
oepsp = ifelse(adjprice==0,NA,oeps/adjprice),
bmr = ifelse(marketvalue==0,NA,(totalassets-totalliabilities)/marketvalue)
)
Do your own research and briefly explain what is earnings per share deflated by price, and book-to-market ratio
EARNINGS PER SHARE IS EQUAL TO EARNINGS DIVIDED BY THE # OF SHARES. THE MEASURE FOR EARNINGS IS NET INCOME. HOWEVER, SOME ANALYSTS ALSO USE OTHER OPERATIONAL MEASURES FOR EARNINGS SUCH AS EARNINGS BEFORE INTEREST AND TAXES (EBIT). IF WE WANT TO MEASURE OPERATIONAL EARNINGS AND CALCULATE IT FOR MANY FIRMS, IT IS RECOMMENDED TO USE EBIT AS A MEASURE OF EARNINGS.
THE FORMULA OF EPS USING NET INCOME AS A MEASURE OF EARNINGS IS:
EPS_{t}=\frac{NETINCOME_{t}}{\#OFSHARES_{t}}
THE FORMULA FOR EPS USING EBIT AS A MEASURE OF EARNINGS IS:*
EPS_{t}=\frac{EBIT_{t}}{\#OFSHARES_{t}} IN A HYPOTHETICAL SCENARIO, IF THE ALL EARNINGS OF A PERIOD t WERE PAYED TO THE INVESTORS, THEN EPS WILL BE HOW MUCH OF ALL EARNINGS OF THE PERIOD IS PAYED TO EACH SHARE OWN BY INVESTORS.
WHAT IS BOOK-TO-MARKET RATIO?
Book-to-maret ratio (bmr) is the ratio of accounting book value of the firm to its market value. In other words, it results by dividing book value by the market value of the firm at a specific time period.
If bmr=1 means that the firm book value is about the same as firm market value. If that is the case, then the market value has not grown beyond book-value, meaning that the firm has not created value beyond its book value.
If bmr>1 means that the market value is less than book value. So, if bmr>1 means that the firm has significantly lost shareholder’s wealth, and it might incur in bankrupt risk.
Then, what would be the bmr level that all firms are looking for? One of the main purposes of the executives is to MAXIMIZE shareholder’s value. The way to increase shareholder’s value is to increase its market value, and the only way to increase market value is to increase stock price.
Then, the bmr level that all executives prefer is a value much less than 1.
If bmr=0.5 means that the firm market value is the double of its book value. In this case, the main difference between market value and book value is the wealth that the firm has created thanks to its valuable intangible assets such as prestige, high quality, and innovation.
Then, what do you think it might be the relationship between bmr and stock return? Intuitively, we might think that a small value of bmr is a good news, then the stock return might be related in a negative way. If bmr goes down (good news), then the stock return might go up. Then, it might be expected that the relationship between bmr and stock return is linear and negative. Some finance research (like Fama & French, 1995), mainly with US firms has found that the relationship between bmr and future stock return is negative.
However, there are mixed findings about the relationship between bmr and stock returns. Some research (like Piotrosky, 2000) has found that firms with high bmr, but strong financials (like earnings per share) usually provides significant positive stock returns. In this special case, bmr has found to be positively related to stock returns.
Finance research has found that bmr influences earnings per share, which in turn influences current and future stock returns. It is some times that firms with low bmr do not experience significant high stock returns due to the high expectations of investors. This is an interesting finding in finance that is still being researched!
4 CHALLENGE 3: Create a cross-sectional dataset with the most recent annual data
1. Select active firms in the 2023 fiscal year
Using the dataus2024 dataset, write the code to select active firms in the 2023 fiscal year.
To do this, you have to select firm-quarters with fiscalmonth=12, yearf=2023 and status=“active”
Save the result in a new dataset called active2023
The fiscal month variable indicates the following: if fiscalmonth=12, then that means that the corresponding quarter is the end-of-the-fiscal-year quarter.
= uspanel %>%
active2023 filter(fiscalmonth==12,yearf==2023)
5 CHALLENGE 4: Code for descriptive statistics for 2023
Using the active2023 dataset, you have to do the following descriptive statistics and respond the following questions.
1. Show and interpret the median, 25 and 75 percentile of total assets and market capitalization (market value) of the active US firms in 2023
%>%
active2023 summarize( firms=n(),
median_tot_assets = median(totalassets, na.rm = TRUE),
Q1_total_assets = quantile(totalassets, probs=c(0.25),na.rm = TRUE),
Q3_totassets = quantile(totalassets,probs=c(0.75),na.rm=TRUE),
median_marketvalue = median(marketvalue, na.rm = TRUE),
Q1_marketvalue = quantile(marketvalue, probs=c(0.25),na.rm=TRUE),
Q3_marketvalue = quantile(marketvalue,probs=c(0.75),na.rm=TRUE))
# A tibble: 1 × 7
firms median_tot_assets Q1_total_assets Q3_totassets median_marketvalue
<int> <dbl> <dbl> <dbl> <dbl>
1 3415 1635910 290343 7029217 1272466.
# ℹ 2 more variables: Q1_marketvalue <dbl>, Q3_marketvalue <dbl>
At the end of the 2023 fiscal year, the typical size of a US firm in terms of total assets is USD $1,635.9 million (the median of total assets).
At the end of the 2023 fiscal year, the typical size of a US firm in terms of market value is USD $1,272.4 million (the median of market value).
The top 25% biggest US firms have a market value greater or equal to $5,641,2 million (the quartile 3 of market value)
The bottom 25% of US firms have a market value less or equal to $215.4 million (quartile 1 of market value)
2. By industry show the # of firms, median, 25 and 75 percentile of total assets and market capitalization (market value) of the active US firms in 2023
I do a very similar code than the previous one. I add the group_by industry:
= active2023 %>%
by_industry group_by(naics1) %>%
summarize( firms=n(),
median_tot_assets = median(totalassets, na.rm = TRUE),
Q1_total_assets = quantile(totalassets, probs=c(0.25),na.rm = TRUE),
Q3_totassets = quantile(totalassets,probs=c(0.75),na.rm=TRUE),
median_marketvalue = median(marketvalue, na.rm = TRUE),
Q1_marketvalue = quantile(marketvalue, probs=c(0.25),na.rm=TRUE),
Q3_marketvalue = quantile(marketvalue,probs=c(0.75),na.rm=TRUE))
We display the results:
by_industry
naics1 | firms | median_tot_assets | Q1_total_assets | Q3_totassets | median_marketvalue | Q1_marketvalue | Q3_marketvalue |
---|---|---|---|---|---|---|---|
- | 1 | 13511396.0 | 13511396.0 | 13511396 | 5752457.1 | 5752457.13 | 5752457 |
Agricultura, ganadería, aprovechamiento forestal, pesca y caza | 18 | 1189610.5 | 189294.9 | 3310175 | 718756.4 | 86801.89 | 3690190 |
Comercio al por mayor | 74 | 2211872.0 | 468339.2 | 6535675 | 1474175.1 | 304554.48 | 6976933 |
Comercio al por menor | 139 | 2725250.0 | 640250.5 | 7740470 | 1968102.3 | 299336.15 | 10116451 |
Construcción | 48 | 3356715.0 | 1192458.8 | 6415290 | 3137764.7 | 864587.54 | 6511583 |
Empresas de electricidad, gas y agua | 85 | 14226600.0 | 3323801.0 | 43939700 | 5044677.6 | 1879869.75 | 21398960 |
Industrias manufactureras | 1470 | 501802.5 | 97954.0 | 2734500 | 715509.0 | 89892.61 | 4386725 |
Información en medios masivos | 243 | 1443046.0 | 445938.0 | 4836900 | 2731011.9 | 391739.57 | 9229658 |
Minería, explotación de canteras y extracción de petróleo y gas | 113 | 2239006.0 | 400904.0 | 7420031 | 1247307.5 | 231780.91 | 4550560 |
Otros servicios excepto actividades gubernamentales | 17 | 1014723.0 | 607377.0 | 2854552 | 848578.5 | 276644.17 | 2713564 |
Servicios de alojamiento temporal y de preparación de alimentos y bebidas | 70 | 1752148.5 | 559121.8 | 6262595 | 1488995.1 | 336943.95 | 6127148 |
Servicios de apoyo a los negocios y manejo de residuos y desechos, y servicios de remediación | 116 | 2507046.0 | 700369.5 | 8207400 | 3119600.3 | 715276.88 | 9572014 |
Servicios de esparcimiento culturales y deportivos, y otros servicios recreativos | 20 | 2432789.5 | 1118437.2 | 6974660 | 2095108.4 | 779143.33 | 9265038 |
Servicios de salud y de asistencia social | 59 | 1441699.0 | 447263.0 | 4450510 | 1460913.2 | 371526.70 | 5012408 |
Servicios educativos | 13 | 1007316.0 | 740685.0 | 2125213 | 1497541.4 | 314845.46 | 2255292 |
Servicios financieros y de seguros | 661 | 5894400.0 | 2149535.0 | 18274022 | 1440087.5 | 352355.71 | 5526287 |
Servicios inmobiliarios y de alquiler de bienes muebles e intangibles | 66 | 1631034.5 | 320103.2 | 5497144 | 873429.3 | 245355.47 | 2799109 |
Servicios profesionales, científicos y técnicos | 132 | 679940.0 | 184932.8 | 3597883 | 830948.4 | 219680.71 | 6252087 |
Transportes, correos y almacenamiento | 70 | 4882901.0 | 1810797.0 | 19048809 | 3026746.8 | 1132554.23 | 11244748 |
3. Which are the 10 biggest active firms in terms of total assets according to the end of fiscal year of 2023? Show the name of the firms, the industry and total assets.
<- active2023 %>%
top_10_tassets arrange(desc(totalassets)) %>%
top_n(10, totalassets) %>%
select(firm, Nombre, naics1, totalassets, revenue)
We display the result:
top_10_tassets
firm | Nombre | naics1 | totalassets | revenue |
---|---|---|---|---|
FNM_old | Federal National Mortgage Association Fannie Mae | Servicios financieros y de seguros | 4325437000 | 141240000 |
JPM | Jpmorgan Chase & Co | Servicios financieros y de seguros | 3875393000 | 239425000 |
FRE_old | Federal Home Loan Mortgage Corp | Servicios financieros y de seguros | 3280976000 | 108050000 |
BAC | Bank Of America Corp | Servicios financieros y de seguros | 3180151000 | 171912000 |
C | Citigroup Inc | Servicios financieros y de seguros | 2411834000 | 156820000 |
WFC | Wells Fargo & Company | Servicios financieros y de seguros | 1932468000 | 115340000 |
GS | The Goldman Sachs Group, Inc | Servicios financieros y de seguros | 1641594000 | 108418000 |
MS | Morgan Stanley | Servicios financieros y de seguros | 1193693000 | 96194000 |
BRK_B | Berkshire Hathaway Inc | Servicios financieros y de seguros | 1069978000 | 364482000 |
PRU | Prudential Financial, Inc | Servicios financieros y de seguros | 721123000 | 53979000 |
4. Which are the 10 biggest active firms in terms of market capitalization? Show the name of the firms, the industry and market value as of 2023
We do the same as above, but now for the market capitalization variable:
<- active2023 %>%
top_10_mcap arrange(desc(marketvalue)) %>%
head(10) %>%
select(firm, Nombre, naics1, marketvalue, revenue)
We display the result:
top_10_mcap
firm | Nombre | naics1 | marketvalue | revenue |
---|---|---|---|---|
AAPL | Apple Inc | Industrias manufactureras | 2676736861 | 383285000 |
MSFT | Microsoft Corp | Información en medios masivos | 2532080939 | 211915000 |
NVDA | Nvidia Corp | Industrias manufactureras | 2258900000 | 60922000 |
GOOGL | Alphabet Inc | Servicios profesionales, científicos y técnicos | 1748360040 | 307394000 |
AMZN | Amazon.Com, Inc | Comercio al por menor | 1570152607 | 574785000 |
META | Meta Platforms, Inc | Servicios profesionales, científicos y técnicos | 909628613 | 134902000 |
TSLA | Tesla, Inc | Industrias manufactureras | 789898387 | 96773000 |
LLY | Eli Lilly And Company | Industrias manufactureras | 553370175 | 34124100 |
AVGO | Broadcom Inc | Industrias manufactureras | 522561910 | 35819000 |
JPM | Jpmorgan Chase & Co | Servicios financieros y de seguros | 491760519 | 239425000 |
5. Show and interpret a histogram of the 2023 market capitalization for all ACTIVE firms
hist(active2023$marketvalue)
This is a typical skewed-to-the-right distribution since there are very few very extreme positive values. Due to the very extreme values of market capitalization of firms, the histogram is not well appreciated. We can change the number of bins or breaks of the histogram, so that we can better appreciate the distribution of the variable in the range where most firms are located in terms of market capitalization:
hist(active2023$marketvalue, breaks=40)
We can see that most of the firms have a market capitalization that is less than $US 250 billion.
To better appreciate the distribution of most of the firms excluding those with very extreme values, we can filter the dataset as follows:
<- active2023 %>%
firm_selection1 filter(marketvalue <=250000000) %>%
select(firm, marketvalue)
hist(firm_selection1$marketvalue, breaks=40)
We can still see that most of these selected firms have a market cap less than $USD 20 billion. We could keep selecting firms with less market capitalization to better appreciate the distribution of most of the firms:
<- active2023 %>%
firm_selection2 filter(marketvalue <=50000000) %>%
select(firm, marketvalue)
hist(firm_selection2$marketvalue, breaks=40)
Even with this selection, we still see that the distribution is very skewed to the right with extreme values of market capitalization.
6. For each industry show and interpret a box plot to better understand the size of active firms in 2023 (use the market value)
library(ggplot2)
ggplot(active2023, aes(x=naics1, y=marketvalue)) +
geom_boxplot()
Here we have 2 visualization problems: 1) We cannot easily visualize the market capitalization of 50% of the firms by industry, which is the box for each industry (Quartiles fro Q1 to Q3), and 2) The x labels for the industry names cannot be read.
To solve the first problem, since there are too many very extreme values for market capitalization, we can plot the logarithm base 10 of market capitalization.
ggplot(active2023, aes(x=naics1, y=marketvalue)) +
geom_boxplot() + scale_y_log10()
We can also plot the winsorized variable of market capitalization instead of the logarithm base 10.
In statistics, winsorization is the process to treat extreme values, which are also called outliers. The winsorization process replace the very extreme values at a specific percentile with a value of the variable in that percentile.
Here we apply a winsorization at the 95 percentiles for big values of market capitalization:
library(statar)
$marketvalue_w <- winsorize(active2023$marketvalue, probs = c(0,0.95)) active2023
We created a winsorized market capitalization for the purpose of visualization of the variable. The process of winsorization is also used for regression analysis since outliers can bias the estimation of the regression results. We will learn that later.
For the second problem, we can add an option to the ggplot to indicate that we want to display the x labes in vertical wah (90 degrees of inclination):
ggplot(active2023, aes(x=naics1, y=marketvalue_w)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 25, hjust = 1))
6 Challenge 5. Review of Logistic Regression models
1. Using the dataus2024 dataset you have to run a logistic regression model to examine whether the operational earnings per share deflated by price (oepsp) is related to the probability that the firm ROA is greater than the corresponding industry ROA mean.
HINT: you can use the function glm specifying family=“binomial”:
logit_model1 <- glm(ROA_above ~ oepspw ,data = uspanel, family = “binomial”,na.action = na.omit)
Now I run a logistic model to examine wether operating earnings per share is related to the probability that a firm has a ROA higher than the mean of the ROA of its industry.
The dependent variable is the binary variable ROA_above that will be 1 if the firm have a ROA greater than the ROA mean of its industry. Then, we need to calculate the ROA mean for each industry, and then calculate this binary variable:
<- uspanel %>%
uspanel group_by(naics1, q) %>%
mutate(ROAmean = mean(roa, na.rm=TRUE),
ROA_above = ifelse(roa>ROAmean,1,0)) %>%
ungroup()
<- glm(ROA_above ~ oepsp ,data = uspanel, family = "binomial",na.action = na.omit)
logit_model1 <- summary(logit_model1)
smodel1 smodel1
Call:
glm(formula = ROA_above ~ oepsp, family = "binomial", data = uspanel,
na.action = na.omit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.822428395 0.004955761 165.954 <0.0000000000000002 ***
oepsp 0.000003163 0.000004186 0.756 0.45
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 236261 on 192005 degrees of freedom
Residual deviance: 236257 on 192004 degrees of freedom
(365175 observations deleted due to missingness)
AIC: 236261
Number of Fisher Scoring iterations: 8
We see that the beta coefficient of oepsp is very close to zero and its pvalue is much greater than 0.05. This means that there is no significant evidence to say that operating earnings per share is related to the probability that a firm has a ROA greater that the ROA of its industry.
Before making conclusion that oepsp has no relationship with the probability of a firm to have ROA higher than the industry average, we have to check some data issues that might affect our model:
Extreme values of independent variables. In this case, it is very likely that oepsp has very extreme values. We will learn about a method called winsorization to treat extreme values.
Treatment of income-statement variables in this dataset. Remember that these variables have Year-to-date amounts (cumulative amounts). This might cause a problem for the calculation or ROA and epsp since the numerator of these variables is netincome and operating earnings that increase each quarter within each year. We can try selecting only rows with the last fiscal quarter of each year (fiscalmonth==12)
We will learn more details about the logistic regression model in the next workshop.
2. Learn about the logistic regression and interpret the result of this regression model (interpret the coefficient of this model and its significance.
7 Challenge 6. Datacamp Course
Go to the course: Machine Learning with caret in R and DO CHAPTER 2: Classification models: fitting them and evaluating their performance