# Installing and loading libraries
if (!require("dplyr")) install.packages("dplyr")
library(dplyr)
if (!require("scales")) install.packages("scales")
library(scales)
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
Workshop 2 Solution
FZ2022 Algorithms and Data Analytics
1 Housekeeping section
1.0.1 Setting up the environment
First, we need to set up everything we might need throughout the workshop. Keep in mind that we are working with Colab, so some libraries might already be installed, while others might not. We will need to install these libraries each time we start a new session (basically, every time you turn on your computer).
To make sure everything runs smoothly, follow the steps below to install the necessary packages.
# To avoid scientific notation:
options(scipen=999)
2 Challenge 1: Data Management algorithms
2.1 Importing Data
# Download the CSV file from the website:
download.file("http://www.apradie.com/datos/dataus2024.csv", "dataus2024.csv")
download.file("http://www.apradie.com/datos/firmsus2024.csv", "firmsus2024.csv")
# Import the panel data
<- read.csv("dataus2024.csv")
uspanel
# Import the firm catalog
<- read.csv("firmsus2024.csv") usfirms
Now we can use the function list()
or head()
to get and overview of the objects we created in the working space memory.
head(uspanel)
firm q fiscalmonth revenue cogs sgae otherincome
1 A 2000q1 NA NA NA NA NA
2 A 2000q2 6 4731000 2421000 1925000 73000
3 A 2000q3 9 7401000 3790000 3016000 101000
4 A 2000q4 12 10773000 5522000 4198000 111000
5 A 2001q1 3 2841000 1449000 1113000 19000
6 A 2001q2 6 4930000 2683000 2025000 277000
extraordinaryitems finexp incometax totalassets currentassets inventory
1 NA NA NA NA NA NA
2 0 0 161000 7321000 5057000 1622000
3 0 0 244000 7827000 5344000 1762000
4 0 0 407000 8425000 5655000 1853000
5 -25000 0 119000 9208000 5461000 2129000
6 -14000 0 235000 9080000 4998000 1791000
totalliabilities currentliabilities longdebt adjprice originalprice
1 NA NA NA 62.87922 104.0000
2 2679000 2186000 0 44.58983 73.7500
3 2925000 2510000 0 29.58800 48.9375
4 3160000 2758000 0 33.10228 54.7500
5 3667000 3070000 0 18.57960 30.7300
6 3464000 3098000 0 19.64976 32.5000
sharesoutstanding year fixedassets yearf cto fiscalq
1 452000.0 2000 NA NA 1 NA
2 452272.0 2000 1453000 2000 2 2
3 453014.6 2000 1581000 2000 3 3
4 456366.4 2000 1741000 2000 4 4
5 456769.7 2001 1821000 2001 1 1
6 457072.7 2001 1848000 2001 2 2
head(usfirms)
empresa Nombre status partind
1 A Agilent Technologies, Inc activo 0.09
2 AA Alcoa Corp activo NA
3 AABA_old Altaba Inc cancelado NA
4 AAC_old Aac Hldg Inc cancelado NA
5 AAIC_old Arlington Asset Investment Corp cancelado NA
6 AAL American Airlines Group Inc activo 0.02
naics1
1 Industrias manufactureras
2 Industrias manufactureras
3 Servicios profesionales, científicos y técnicos
4 Servicios de salud y de asistencia social
5 Servicios financieros y de seguros
6 Transportes, correos y almacenamiento
naics2
1 Fabricación de instrumentos de medición, control, navegación, y equipo médico electrónico
2 Industria básica del aluminio
3 Servicios de diseño de sistemas de cómputo y servicios relacionados
4 Otros consultorios para el cuidado de la salud
5 Otros fondos
6 Transporte aéreo regular
SectorEconomatica
1 Electroelectronic
2 Siderur & Metalur
3 Software y Datos
4 Otros
5 Fondos
6 Transporte Servic
2.2 Display Apple annual historical information
<- uspanel %>%
APPLE_21_23 filter(firm=="AAPL", yearf>=2021, fiscalmonth==12) %>%
select(firm,q,revenue,totalassets, yearf,fiscalmonth)
APPLE_21_23
firm q revenue totalassets yearf fiscalmonth
1 AAPL 2021q3 365817000 351002000 2021 12
2 AAPL 2022q3 394328000 352755000 2022 12
3 AAPL 2023q3 383285000 352583000 2023 12
To report the amount in currency format, we can use the dollar function from the scales package (you have to install the scales package) as follows:
= as.numeric(1000*APPLE_21_23[,c("revenue")])
APPLE_2023_ANNUALSALES # I used the as.numeric function since the dollar function only receives numeric variables:
dollar(APPLE_2023_ANNUALSALES)
[1] "$365,817,000,000" "$394,328,000,000" "$383,285,000,000"
Then, we can say that in 2023 Apple reported annual total sales amount of $383,285,000,000. This is more than 380 billion dollars! (380 thousand million dollars in Spanish).
2.3 Merge the datasets
2.3.1 Merging Important Firm Characteristics into the Panel Dataset
2.3.1.1 Step 1: Pull the following firm characteristics:
- Company name
- Status (active or cancelled)
- Industry (naics1)
In the uspanel
dataset, we don’t have general information about each firm, such as its name, industry, or status. This information is stored in the usfirms
dataset. So, we will start by integrating this information into the panel dataset.
2.3.1.2 Step 2: Select the relevant columns from the usfirms
dataset:
We need to create a smaller dataset (usfirms1
) that only contains the relevant columns for merging.
# I create a dataset with only the relevant columns to be merged:
<- usfirms %>%
usfirms1select(empresa,Nombre,status,naics1)
2.3.2 Merging Firm Characteristics into the Panel Dataset
To integrate the usfirms
dataset into the panel dataset uspanel
, we can use the merge()
function, which is similar to the vlookup function in Excel. Our goal is to pull the following columns into the uspanel
dataset: - Industry (naics1) - Company name - Status (active or cancelled)
This will allow us to perform queries and analyze the data using these additional firm characteristics.
2.3.2.1 Identifying the Columns to Match
Before using the merge()
function, it’s important to identify the columns that will be used to link both datasets. In this case, we use the firm code (ticker symbol), which uniquely identifies each firm.
However, there’s an issue: the firm code is named differently in each dataset: - In uspanel
, the firm code is named firm. - In usfirms
, the firm code is named empresa.
2.3.2.2 Solution: Specify the Columns to Match
We can address this issue by explicitly telling R which columns to match from each dataset in the merge()
function. Here’s how to do it:
# Merge usfirms1 into uspanel using the firm code as the key
<- merge(uspanel,usfirms1,by.x="firm", by.y="empresa") uspanel
The by.x
parameter refers to the column in the first dataset (uspanel
), and the by.y
parameter refers to the column in the second dataset (usfirms1
).
If both datasets have the same column name for the firm code (e.g., “firm”), you can simplify the merge()
function by omitting by.x
and by.y
:
uspanel <- merge(uspanel,usfirms1,by="firm")
3 CHALLENGE 2: Calculating financial variables
Using the dataus2024 dataset, you have to do the following descriptive statistics and respond the following questions.
You have to calculate market capitalization (marketcap) in this dataset. You can calculate a new column as follows:
marketcap = original stock price * shareoutstanding.
This is the market value of the firm in each quarter. We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares:
Les have a look at the contents of the dataset:
head(uspanel)
firm q fiscalmonth revenue cogs sgae otherincome
1 A 2000q1 NA NA NA NA NA
2 A 2000q2 6 4731000 2421000 1925000 73000
3 A 2000q3 9 7401000 3790000 3016000 101000
4 A 2000q4 12 10773000 5522000 4198000 111000
5 A 2001q1 3 2841000 1449000 1113000 19000
6 A 2001q2 6 4930000 2683000 2025000 277000
extraordinaryitems finexp incometax totalassets currentassets inventory
1 NA NA NA NA NA NA
2 0 0 161000 7321000 5057000 1622000
3 0 0 244000 7827000 5344000 1762000
4 0 0 407000 8425000 5655000 1853000
5 -25000 0 119000 9208000 5461000 2129000
6 -14000 0 235000 9080000 4998000 1791000
totalliabilities currentliabilities longdebt adjprice originalprice
1 NA NA NA 62.87922 104.0000
2 2679000 2186000 0 44.58983 73.7500
3 2925000 2510000 0 29.58800 48.9375
4 3160000 2758000 0 33.10228 54.7500
5 3667000 3070000 0 18.57960 30.7300
6 3464000 3098000 0 19.64976 32.5000
sharesoutstanding year fixedassets yearf cto fiscalq
1 452000.0 2000 NA NA 1 NA
2 452272.0 2000 1453000 2000 2 2
3 453014.6 2000 1581000 2000 3 3
4 456366.4 2000 1741000 2000 4 4
5 456769.7 2001 1821000 2001 1 1
6 457072.7 2001 1848000 2001 2 2
Nombre status naics1
1 Agilent Technologies, Inc activo Industrias manufactureras
2 Agilent Technologies, Inc activo Industrias manufactureras
3 Agilent Technologies, Inc activo Industrias manufactureras
4 Agilent Technologies, Inc activo Industrias manufactureras
5 Agilent Technologies, Inc activo Industrias manufactureras
6 Agilent Technologies, Inc activo Industrias manufactureras
<- uspanel%>%
uspanel # Calculate market value
mutate(marketvalue = originalprice * sharesoutstanding,
# Calculate Gross Profit
grossprofit = revenue - cogs,
# Calculate EBIT
ebit = grossprofit - sgae,
# Calculate Net income
netincome = ebit + otherincome + extraordinaryitems - finexp - incometax)
head(uspanel)
firm q fiscalmonth revenue cogs sgae otherincome
1 A 2000q1 NA NA NA NA NA
2 A 2000q2 6 4731000 2421000 1925000 73000
3 A 2000q3 9 7401000 3790000 3016000 101000
4 A 2000q4 12 10773000 5522000 4198000 111000
5 A 2001q1 3 2841000 1449000 1113000 19000
6 A 2001q2 6 4930000 2683000 2025000 277000
extraordinaryitems finexp incometax totalassets currentassets inventory
1 NA NA NA NA NA NA
2 0 0 161000 7321000 5057000 1622000
3 0 0 244000 7827000 5344000 1762000
4 0 0 407000 8425000 5655000 1853000
5 -25000 0 119000 9208000 5461000 2129000
6 -14000 0 235000 9080000 4998000 1791000
totalliabilities currentliabilities longdebt adjprice originalprice
1 NA NA NA 62.87922 104.0000
2 2679000 2186000 0 44.58983 73.7500
3 2925000 2510000 0 29.58800 48.9375
4 3160000 2758000 0 33.10228 54.7500
5 3667000 3070000 0 18.57960 30.7300
6 3464000 3098000 0 19.64976 32.5000
sharesoutstanding year fixedassets yearf cto fiscalq
1 452000.0 2000 NA NA 1 NA
2 452272.0 2000 1453000 2000 2 2
3 453014.6 2000 1581000 2000 3 3
4 456366.4 2000 1741000 2000 4 4
5 456769.7 2001 1821000 2001 1 1
6 457072.7 2001 1848000 2001 2 2
Nombre status naics1 marketvalue
1 Agilent Technologies, Inc activo Industrias manufactureras 47008000
2 Agilent Technologies, Inc activo Industrias manufactureras 33355058
3 Agilent Technologies, Inc activo Industrias manufactureras 22169401
4 Agilent Technologies, Inc activo Industrias manufactureras 24986059
5 Agilent Technologies, Inc activo Industrias manufactureras 14036534
6 Agilent Technologies, Inc activo Industrias manufactureras 14854864
grossprofit ebit netincome
1 NA NA NA
2 2310000 385000 297000
3 3611000 595000 452000
4 5251000 1053000 757000
5 1392000 279000 154000
6 2247000 222000 250000
# To create the annual returns, I can group the data by firm and arrange by firm and quarter
# Calculate returns with proper sorting by quarter
<- uspanel %>%
uspanel group_by(firm) %>%
# Sort by quarter (q) within each firm
arrange(firm, q) %>%
# Calculate the annual return (adjusted price return)
mutate(annual_R = adjprice / lag(adjprice, n = 4) - 1,
# Calculate the continuously compounded annual return
annual_r = log(adjprice) - lag(log(adjprice), n = 4))
# Display the first 20 rows of the relevant columns: firm, q, annual_R, and annual_r
%>%
uspanel select(firm, q, annual_R, annual_r) %>%
head(20)
# A tibble: 20 × 4
# Groups: firm [1]
firm q annual_R annual_r
<chr> <chr> <dbl> <dbl>
1 A 2000q1 NA NA
2 A 2000q2 NA NA
3 A 2000q3 NA NA
4 A 2000q4 NA NA
5 A 2001q1 -0.705 -1.22
6 A 2001q2 -0.559 -0.819
7 A 2001q3 -0.601 -0.918
8 A 2001q4 -0.479 -0.653
9 A 2002q1 0.138 0.129
10 A 2002q2 -0.272 -0.318
11 A 2002q3 -0.332 -0.403
12 A 2002q4 -0.370 -0.462
13 A 2003q1 -0.624 -0.978
14 A 2003q2 -0.173 -0.190
15 A 2003q3 0.693 0.526
16 A 2003q4 0.628 0.487
17 A 2004q1 1.41 0.878
18 A 2004q2 0.498 0.404
19 A 2004q3 -0.0244 -0.0247
20 A 2004q4 -0.176 -0.193
2. Using the same panel dataset, create columns for the following financial ratios:
# Calculate financial ratios: roabit, roa, oeps, oepsp, and bmr with conditional handling of zero
<- uspanel %>%
uspanel group_by(firm) %>%
# Sort by quarter (q) within each firm
arrange(firm, q) %>%
# Calculate Operational Return on Assets (roabit)
mutate(roabit = ifelse(lag(totalassets, 4) == 0, NA, ebit / lag(totalassets, 4)),
# Calculate Return on Assets (roa)
roa = ifelse(lag(totalassets, 4) == 0, NA, netincome / lag(totalassets, 4)),
# Calculate Operational Earnings per share (oeps)
oeps = ifelse(sharesoutstanding == 0, NA, ebit / sharesoutstanding),
# Calculate Operational EPS deflated by stock price (oepsp)
oepsp = ifelse(originalprice == 0, NA, oeps / originalprice),
# Calculate Book-to-Market Ratio (bmr)
bmr = ifelse(marketvalue == 0, NA, totalassets / marketvalue))
# Display the first 20 rows of the relevant columns: firm, q, roabit, roa, oeps, oepsp, and bmr
%>%
uspanel select(firm, q, roabit, roa, oeps, oepsp, bmr) %>%
head(20)
# A tibble: 20 × 7
# Groups: firm [1]
firm q roabit roa oeps oepsp bmr
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 2000q1 NA NA NA NA NA
2 A 2000q2 NA NA 0.851 0.0115 0.219
3 A 2000q3 NA NA 1.31 0.0268 0.353
4 A 2000q4 NA NA 2.31 0.0421 0.337
5 A 2001q1 NA NA 0.611 0.0199 0.656
6 A 2001q2 0.0303 0.0341 0.486 0.0149 0.611
7 A 2001q3 -0.0153 0.00396 -0.261 -0.0133 0.957
8 A 2001q4 -0.0923 0.0207 -1.68 -0.0589 0.604
9 A 2002q1 -0.0480 -0.0342 -0.953 -0.0273 0.528
10 A 2002q2 -0.0871 -0.0626 -1.70 -0.0720 0.750
11 A 2002q3 -0.141 -0.0923 -2.61 -0.200 1.38
12 A 2002q4 -0.201 -0.129 -3.44 -0.192 0.978
13 A 2003q1 -0.0299 -0.0431 -0.543 -0.0413 1.25
14 A 2003q2 -0.0717 -0.0625 -1.25 -0.0641 0.827
15 A 2003q3 -0.0926 -0.245 -1.64 -0.0743 0.570
16 A 2003q4 -0.0884 -0.251 -1.52 -0.0521 0.452
17 A 2004q1 0.0102 0.00914 0.164 0.00520 0.420
18 A 2004q2 0.0249 0.0230 0.395 0.0135 0.474
19 A 2004q3 0.0496 0.0459 0.612 0.0284 0.644
20 A 2004q4 0.0613 0.0554 0.793 0.0329 0.601
3.2 Book-to-Market Ratio (B/M)
The Book-to-Market Ratio (B/M) is calculated as:
\text{B/M Ratio} = \frac{\text{Book Value of Equity}}{\text{Market Value of Equity}}
Where:
Book Value of Equity is the net asset value of the company (total assets minus total liabilities).
Market Value of Equity (or Market Capitalization) is the current market value of all outstanding shares.
A high B/M ratio suggests that a stock might be undervalued (or in financial distress), while a low B/M ratio could indicate that a stock is overvalued.
4 Challenge 3: Create a cross-sectional dataset with the most recent annual data
4.0.1 Select Active Firms and Quarters with Annual Financial Data in 2023
Before performing basic descriptive statistics for the U.S. market, you need to implement some data management algorithms.
4.0.1.1 Step 1: Filter firm-quarters with fiscalmonth = 12
, year = 2023
, and only include active firms.
In this dataset, we have quarterly financial data. For income-statement variables, the reported values are cumulative amounts from fiscal Q1 to the corresponding quarter of the fiscal year. For balance-sheet variables, the values reflect the firm’s financial position at the end of each quarter.
The fiscalmonth
variable shows the quarter that ends the fiscal year. If fiscalmonth = 12
, it means this quarter is the end of the fiscal year, and income statement variables reflect the annual values.
Save the result in a new dataset called active2023
.
Now I can complete the challenge, which is to create the active2023 dataset:
<- uspanel %>%
active2023 filter(status=="activo",yearf==2023, fiscalmonth==12)
This dataset, active2023
, is no longer a panel dataset as it contains data for only one period (2023) for each firm. Instead, it is now a cross-sectional dataset.
We can count the # active firms in 2023 in the US using the nrow function:
nrow(active2023)
[1] 3285
There were 3285 active public firms in the US market in 2023.
We can also check how many of these firms end the fiscal quarter in the Q4 of the year:
<- table(active2023$q)
firmsbyQ firmsbyQ
2023q2 2023q3 2023q4 2024q1
184 193 2693 215
The table
function counts the number of cases for the different values of the variable (in this case, the variable q
represents the quarter). This allows us to see how many firms end their fiscal year in each quarter.
We can observe that more than 80% of the firms end their fiscal year in Q4 (215 firms out of 3285).
5 Challenge 4: Descriptive Statistics and Analysis of Active US Firms in 2023
We will perform descriptive statistics on the active2023
dataset to analyze the total assets and market capitalization of US firms active in 2023.
5.1 Median, 25th and 75th Percentile of Total Assets and Market Capitalization
First, we calculate and interpret the median, 25th, and 75th percentiles for total assets and market capitalization of all active US firms.
# Descriptive statistics for total assets and market capitalization
<- active2023 %>%
summary_stats select(firm, totalassets, marketvalue) %>%
summary()
summary_stats
firm totalassets marketvalue
Length:3285 Min. : 4 Min. : 36
Class :character 1st Qu.: 313801 1st Qu.: 225389
Mode :character Median : 1683214 Median : 1295817
Mean : 17604938 Mean : 14770424
3rd Qu.: 7067672 3rd Qu.: 5747438
Max. :3875393000 Max. :2676736861
NA's :27
<- active2023 %>%
industry_stats group_by(naics1) %>%
summarise(num_firms = n(),
median_assets = median(totalassets, na.rm = TRUE),
p25_assets = quantile(totalassets, 0.25, na.rm = TRUE),
p75_assets = quantile(totalassets, 0.75, na.rm = TRUE),
median_market_value = median(marketvalue, na.rm = TRUE),
p25_market_value = quantile(marketvalue, 0.25, na.rm = TRUE),
p75_market_value = quantile(marketvalue, 0.75, na.rm = TRUE))
print(industry_stats)
# A tibble: 19 × 8
naics1 num_firms median_assets p25_assets p75_assets median_market_value
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 - 1 13511396 13511396 13511396 5752457.
2 Agricultur… 16 1676710. 365443 4424650 924523.
3 Comercio a… 72 2449854. 494542. 6827575 1474175.
4 Comercio a… 137 2969586 644683 7774100 1968102.
5 Construcci… 47 3407851 1371598. 6477446. 3222951.
6 Empresas d… 76 14463800 4089750. 44058250 5293863.
7 Industrias… 1411 565299 116712. 2878860 729177.
8 Informació… 234 1401528 444998 4741062. 2731012.
9 Minería, e… 106 2167800. 396234. 7365994. 1203355.
10 Otros serv… 17 1014723 607377 2854552 848578.
11 Servicios … 68 1991052 581241. 6389344. 1555129.
12 Servicios … 113 2581600 721538 8233200 3293707.
13 Servicios … 19 2625046 1358087 6993820. 2528477.
14 Servicios … 58 1527342 456712. 4479580. 1505531.
15 Servicios … 13 1007316 740685 2125213 1497541.
16 Servicios … 640 5967370. 2193738 18281728. 1450774.
17 Servicios … 63 1770814 340774. 5155288. 867256.
18 Servicios … 129 695243 186101 3594405 830948.
19 Transporte… 65 4557200 1801846 17716265 3471636.
# ℹ 2 more variables: p25_market_value <dbl>, p75_market_value <dbl>
The median total assets indicate the central point of asset values, while the 25th and 75th percentiles provide insights into the distribution of firm sizes. The same applies for market capitalization, which reflects the distribution of firm valuations.
5.2 Industry-Wise Descriptive Statistics
We now calculate the number of firms, median, 25th, and 75th percentiles of total assets and market capitalization for each industry.
# Grouping by industry and summarizing stats
<- active2023 %>%
industry_stats group_by(naics1) %>%
summarize(
n_firms = n(),
median_assets = median(totalassets, na.rm = TRUE),
p25_assets = quantile(totalassets, 0.25, na.rm = TRUE),
p75_assets = quantile(totalassets, 0.75, na.rm = TRUE),
median_market_value = median(marketvalue, na.rm = TRUE),
p25_market_value = quantile(marketvalue, 0.25, na.rm = TRUE),
p75_market_value = quantile(marketvalue, 0.75, na.rm = TRUE)
)
industry_stats
# A tibble: 19 × 8
naics1 n_firms median_assets p25_assets p75_assets median_market_value
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 - 1 13511396 13511396 13511396 5752457.
2 Agricultura,… 16 1676710. 365443 4424650 924523.
3 Comercio al … 72 2449854. 494542. 6827575 1474175.
4 Comercio al … 137 2969586 644683 7774100 1968102.
5 Construcción 47 3407851 1371598. 6477446. 3222951.
6 Empresas de … 76 14463800 4089750. 44058250 5293863.
7 Industrias m… 1411 565299 116712. 2878860 729177.
8 Información … 234 1401528 444998 4741062. 2731012.
9 Minería, exp… 106 2167800. 396234. 7365994. 1203355.
10 Otros servic… 17 1014723 607377 2854552 848578.
11 Servicios de… 68 1991052 581241. 6389344. 1555129.
12 Servicios de… 113 2581600 721538 8233200 3293707.
13 Servicios de… 19 2625046 1358087 6993820. 2528477.
14 Servicios de… 58 1527342 456712. 4479580. 1505531.
15 Servicios ed… 13 1007316 740685 2125213 1497541.
16 Servicios fi… 640 5967370. 2193738 18281728. 1450774.
17 Servicios in… 63 1770814 340774. 5155288. 867256.
18 Servicios pr… 129 695243 186101 3594405 830948.
19 Transportes,… 65 4557200 1801846 17716265 3471636.
# ℹ 2 more variables: p25_market_value <dbl>, p75_market_value <dbl>
This table allows us to compare the typical firm sizes and market valuations across different industries, providing insights into the distribution of firms by sector.
5.3 Top 10 Firms by Total Assets in 2023
Next, we identify the 10 largest active firms by total assets at the end of the 2023 fiscal year, showing their names, industries, and total assets.
# Top 10 firms by total assets
<- active2023 %>%
top10_assets arrange(desc(totalassets)) %>%
select(firm, Nombre, naics1, totalassets) %>%
head(10)
top10_assets
# A tibble: 10 × 4
# Groups: firm [10]
firm Nombre naics1 totalassets
<chr> <chr> <chr> <dbl>
1 JPM Jpmorgan Chase & Co Servicios financieros y de se… 3875393000
2 BAC Bank Of America Corp Servicios financieros y de se… 3180151000
3 C Citigroup Inc Servicios financieros y de se… 2411834000
4 WFC Wells Fargo & Company Servicios financieros y de se… 1932468000
5 GS The Goldman Sachs Group, Inc Servicios financieros y de se… 1641594000
6 MS Morgan Stanley Servicios financieros y de se… 1193693000
7 BRK_B Berkshire Hathaway Inc Servicios financieros y de se… 1069978000
8 PRU Prudential Financial, Inc Servicios financieros y de se… 721123000
9 MET Metlife, Inc Servicios financieros y de se… 687584000
10 USB US Bancorp Servicios financieros y de se… 663491000
These firms represent the largest companies by total assets, indicating their financial resources and operational capacity in the market.
5.4 Top 10 Firms by Market Capitalization in 2023
Now, we look at the 10 largest firms by market capitalization in 2023, listing their names, industries, and market values.
# Top 10 firms by market capitalization
<- active2023 %>%
top10_market_value arrange(desc(marketvalue)) %>%
select(firm, Nombre, naics1, marketvalue) %>%
head(10)
top10_market_value
# A tibble: 10 × 4
# Groups: firm [10]
firm Nombre naics1 marketvalue
<chr> <chr> <chr> <dbl>
1 AAPL Apple Inc Industrias manufactureras 2676736861.
2 MSFT Microsoft Corp Información en medios masivos 2532080939.
3 NVDA Nvidia Corp Industrias manufactureras 2258900000
4 GOOGL Alphabet Inc Servicios profesionales, científicos… 1748360040
5 AMZN Amazon.Com, Inc Comercio al por menor 1570152607.
6 META Meta Platforms, Inc Servicios profesionales, científicos… 909628613.
7 TSLA Tesla, Inc Industrias manufactureras 789898387.
8 LLY Eli Lilly And Company Industrias manufactureras 553370175.
9 AVGO Broadcom Inc Industrias manufactureras 522561910.
10 JPM Jpmorgan Chase & Co Servicios financieros y de seguros 491760519.
These firms have the highest market valuations, reflecting their perceived value by investors as of 2023.
5.5 Histogram of Market Capitalization for All Active Firms
We create a histogram to visualize the distribution of market capitalization for all active firms in 2023.
# Histogram of market capitalization
hist(active2023$marketvalue)
The histogram shows how market capitalization is distributed across firms, revealing the range of firm sizes and potential clustering of smaller or larger companies. We can see that there are some very big firms that appear to be outliers (biggest 1%). We can zoom in to see those firms with a market value of USD 20 billion.
<- active2023 %>%
subset_20billion filter(marketvalue <=20000000) %>%
select(firm, marketvalue)
hist(subset_20billion$marketvalue, breaks=40)
5.6 6. Box Plot of Market Capitalization by Industry
To further analyze the distribution of firm sizes, we generate box plots of market capitalization by industry for 2023.
# Box plot by industry
library(ggplot2)
ggplot(active2023, aes(x=naics1, y=marketvalue)) +
geom_boxplot()
The box plots provide a visual summary of the distribution of market capitalization within each industry, highlighting outliers and the central tendencies of firm sizes by sector.
5.7 Visualization Issues and Solutions
We encountered two main visualization problems:
Difficulty in Visualizing Market Capitalization:
It’s challenging to visualize the market capitalization of 50% of the firms by industry, represented by the box in the box plot (Quartiles from Q1 to Q3).Unreadable Industry Labels:
The x-axis labels for industry names are difficult to read due to their length or positioning.
5.7.1 Proposed Solutions
To address these issues:
- For the Market Capitalization Visualization:
Given that there are many extreme values for market capitalization, we can plot the logarithm (base 10) of market capitalization to better visualize the data distribution.
ggplot(active2023, aes(x=naics1, y=marketvalue)) +
geom_boxplot() + scale_y_log10()
- For the Industry Labels:
Adjust the x-axis labels by rotating or abbreviating them to improve readability.
ggplot(active2023, aes(x=naics1, y=marketvalue)) +
geom_boxplot() + scale_y_log10() +
theme(axis.text.x = element_text(angle = 25, hjust = 1))
5.8 Introduction to Winsorization
Winsorization is a statistical technique used to limit extreme values in data to reduce the impact of outliers. Instead of removing these outliers, Winsorization transforms them by capping the extreme values at a specified percentile. This technique is especially useful in financial data where extreme values, such as abnormally high market capitalization, can distort analyses.
By applying Winsorization, we ensure that values at the extreme tails of the distribution are replaced with more reasonable ones, reducing the skewness and impact of these outliers without completely discarding them.
In the following code, we will use the statar
library to Winsorize the marketvalue
variable at the 1st and 99th percentiles.
5.8.1 Code to Winsorize Market Value
# Load the statar library
library(statar)
# Winsorize the marketvalue variable at the 1st and 99th percentiles
$marketvalue_winsorized <- winsorize(active2023$marketvalue, probs = c(0.01, 0.99))
active2023
# View summary of the Winsorized market value
summary(active2023$marketvalue_winsorized)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
2955 225389 1295817 10403048 5747438 210701986 27
ggplot(active2023, aes(x=naics1, y=marketvalue_winsorized)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 25, hjust = 1))
6 Challenge 5: Review of Logistic Regression Models
6.1 Logistic Regression Model
We will run a logistic regression model to examine whether operational earnings per share deflated by price (oepsp) is related to the probability that a firm’s return on assets (ROA) is greater than the corresponding industry ROA mean.
We will use the glm()
function with the binomial
family to specify our logistic regression model. First, let’s start by creating the binary dependent variable ROA_above.
<- uspanel %>%
uspanel group_by(naics1, q) %>%
mutate(ROAmean = mean(roa, na.rm=TRUE),
ROA_above = ifelse(roa>ROAmean,1,0)) %>%
ungroup()
# Run logistic regression model
<- glm(ROA_above ~ oepsp, data = uspanel, family = "binomial", na.action = na.omit)
logit_model1
# View the summary of the model
summary(logit_model1)
Call:
glm(formula = ROA_above ~ oepsp, family = "binomial", data = uspanel,
na.action = na.omit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.813283657 0.004946147 164.428 <0.0000000000000002 ***
oepsp 0.000003206 0.000004235 0.757 0.449
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 236951 on 192070 degrees of freedom
Residual deviance: 236947 on 192069 degrees of freedom
(365110 observations deleted due to missingness)
AIC: 236951
Number of Fisher Scoring iterations: 8
6.2 Interpretation of Logistic Regression Results
We observe that the beta coefficient for oepsp (operating earnings per share deflated by price) is very close to zero, and its p-value is much greater than 0.05. This suggests that there is no significant evidence to indicate that operating earnings per share is related to the probability of a firm having a ROA (Return on Assets) greater than the industry average.
Before concluding that oepsp has no relationship with the probability of a firm achieving a higher ROA than the industry average, it is essential to check for potential data issues that may be affecting our model:
Extreme Values of Independent Variables:
As we have already learned about Winsorization, this technique can be applied to handle extreme values of oepsp. Outliers may be affecting the model’s results, and Winsorization will help by capping extreme values at a certain threshold to reduce their impact.Treatment of Income Statement Variables:
Recall that income-statement variables in this dataset represent year-to-date (cumulative) amounts. This could pose a problem for the calculation of both ROA and oepsp, as the numerator (net income and operating earnings) accumulates throughout the fiscal year, increasing with each quarter. To address this issue, we could filter the data to include only rows corresponding to the last fiscal quarter of each year (fiscalmonth == 12).
6.2.1 Coefficients Interpretation
The coefficient of oepsp in the logistic regression is very close to zero, which means that a change in oepsp has almost no effect on the log-odds of the firm’s ROA being above the industry average. Moreover, the large p-value suggests that this relationship is not statistically significant at common significance levels (e.g., 0.05). Therefore, we cannot conclude that oepsp is a significant predictor of whether a firm’s ROA exceeds the industry average.
We will explore further details of logistic regression and its interpretation in the next workshop.
# Winsorize the 'oepsp' variable at the 1st and 99th percentiles to handle extreme values
<- uspanel %>%
uspanel mutate(oepsp_winsorized = winsorize(oepsp, probs = c(0.01, 0.99)))
# Filter the data to include only rows with fiscalmonth == 12 (last fiscal quarter)
<- uspanel %>%
uspanel_filtered filter(fiscalmonth == 12)
# Run the logistic regression using the Winsorized 'oepsp' variable and the filtered data
<- glm(ROA_above ~ oepsp_winsorized,
logit_model1 data = uspanel_filtered,
family = "binomial",
na.action = na.omit)
# Summary of the logistic regression model
summary(logit_model1)
Call:
glm(formula = ROA_above ~ oepsp_winsorized, family = "binomial",
data = uspanel_filtered, na.action = na.omit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.00187 0.01079 92.88 <0.0000000000000002 ***
oepsp_winsorized 2.62671 0.04700 55.89 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 57291 on 47498 degrees of freedom
Residual deviance: 52086 on 47497 degrees of freedom
(15009 observations deleted due to missingness)
AIC: 52090
Number of Fisher Scoring iterations: 4
6.2.2 Interpretation of the Coefficient:
The coefficient of oepsp_winsorized is 2.62671, which suggests that for every one-unit increase in the Winsorized operating earnings per share (deflated by price), the log-odds of a firm having a ROA greater than the industry average increase by 2.62671. This is a substantial increase in the log-odds, implying that the variable may have a significant positive relationship with the outcome.
To interpret the coefficient in terms of odds, we need to calculate the odds ratio. The odds ratio is calculated by exponentiating the coefficient:
[ = e^{} = e^{2.62671} ]
This means that for every one-unit increase in oepsp_winsorized, the odds of a firm’s ROA being above the industry average increase by approximately 13.82 times, holding all else constant.
6.2.3 Odds Ratio:
The odds ratio helps interpret the logistic regression coefficient more intuitively:
Odds represent the ratio of the probability of an event happening to the probability of it not happening. \text{Odds} = \frac{P(\text{event})}{1 - P(\text{event})} In this case, the event is a firm having a ROA greater than the industry average.
The odds ratio is the factor by which the odds change for a one-unit increase in the independent variable (in this case, oepsp_winsorized). An odds ratio of 13.82 indicates that the odds increase more than thirteenfold for each unit increase in oepsp_winsorized.
This large odds ratio suggests that oepsp_winsorized is a strong predictor of whether a firm’s ROA will exceed the industry average. However, it’s important to assess the statistical significance (via p-value) to determine whether this effect is statistically meaningful.