Workshop 2 Solution

FZ2022 Algorithms and Data Analytics

Author
Affiliation

Sergio Castellanos-Gamboa, PhD

Tecnológico de Monterrey

Published

October 8, 2024

1 Housekeeping section

1.0.1 Setting up the environment

First, we need to set up everything we might need throughout the workshop. Keep in mind that we are working with Colab, so some libraries might already be installed, while others might not. We will need to install these libraries each time we start a new session (basically, every time you turn on your computer).

To make sure everything runs smoothly, follow the steps below to install the necessary packages.

# Installing and loading libraries
if (!require("dplyr")) install.packages("dplyr")
library(dplyr)
if (!require("scales")) install.packages("scales")
library(scales)
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
# To avoid scientific notation:
options(scipen=999)

2 Challenge 1: Data Management algorithms

2.1 Importing Data

# Download the CSV file from the website:
download.file("http://www.apradie.com/datos/dataus2024.csv", "dataus2024.csv")

download.file("http://www.apradie.com/datos/firmsus2024.csv", "firmsus2024.csv")

# Import the panel data
uspanel <- read.csv("dataus2024.csv")

# Import the firm catalog
usfirms <- read.csv("firmsus2024.csv")

Now we can use the function list() or head() to get and overview of the objects we created in the working space memory.

head(uspanel)
  firm      q fiscalmonth  revenue    cogs    sgae otherincome
1    A 2000q1          NA       NA      NA      NA          NA
2    A 2000q2           6  4731000 2421000 1925000       73000
3    A 2000q3           9  7401000 3790000 3016000      101000
4    A 2000q4          12 10773000 5522000 4198000      111000
5    A 2001q1           3  2841000 1449000 1113000       19000
6    A 2001q2           6  4930000 2683000 2025000      277000
  extraordinaryitems finexp incometax totalassets currentassets inventory
1                 NA     NA        NA          NA            NA        NA
2                  0      0    161000     7321000       5057000   1622000
3                  0      0    244000     7827000       5344000   1762000
4                  0      0    407000     8425000       5655000   1853000
5             -25000      0    119000     9208000       5461000   2129000
6             -14000      0    235000     9080000       4998000   1791000
  totalliabilities currentliabilities longdebt adjprice originalprice
1               NA                 NA       NA 62.87922      104.0000
2          2679000            2186000        0 44.58983       73.7500
3          2925000            2510000        0 29.58800       48.9375
4          3160000            2758000        0 33.10228       54.7500
5          3667000            3070000        0 18.57960       30.7300
6          3464000            3098000        0 19.64976       32.5000
  sharesoutstanding year fixedassets yearf cto fiscalq
1          452000.0 2000          NA    NA   1      NA
2          452272.0 2000     1453000  2000   2       2
3          453014.6 2000     1581000  2000   3       3
4          456366.4 2000     1741000  2000   4       4
5          456769.7 2001     1821000  2001   1       1
6          457072.7 2001     1848000  2001   2       2
head(usfirms)
   empresa                          Nombre    status partind
1        A       Agilent Technologies, Inc    activo    0.09
2       AA                      Alcoa Corp    activo      NA
3 AABA_old                      Altaba Inc cancelado      NA
4  AAC_old                    Aac Hldg Inc cancelado      NA
5 AAIC_old Arlington Asset Investment Corp cancelado      NA
6      AAL     American Airlines Group Inc    activo    0.02
                                           naics1
1                       Industrias manufactureras
2                       Industrias manufactureras
3 Servicios profesionales, científicos y técnicos
4       Servicios de salud y de asistencia social
5              Servicios financieros y de seguros
6           Transportes, correos y almacenamiento
                                                                                     naics2
1 Fabricación de instrumentos de medición, control, navegación, y equipo médico electrónico
2                                                             Industria básica del aluminio
3                       Servicios de diseño de sistemas de cómputo y servicios relacionados
4                                            Otros consultorios para el cuidado de la salud
5                                                                              Otros fondos
6                                                                  Transporte aéreo regular
  SectorEconomatica
1 Electroelectronic
2 Siderur & Metalur
3  Software y Datos
4             Otros
5            Fondos
6 Transporte Servic

2.2 Display Apple annual historical information

APPLE_21_23 <- uspanel %>% 
         filter(firm=="AAPL", yearf>=2021, fiscalmonth==12) %>%
         select(firm,q,revenue,totalassets, yearf,fiscalmonth)
APPLE_21_23
  firm      q   revenue totalassets yearf fiscalmonth
1 AAPL 2021q3 365817000   351002000  2021          12
2 AAPL 2022q3 394328000   352755000  2022          12
3 AAPL 2023q3 383285000   352583000  2023          12

To report the amount in currency format, we can use the dollar function from the scales package (you have to install the scales package) as follows:

APPLE_2023_ANNUALSALES = as.numeric(1000*APPLE_21_23[,c("revenue")])
# I used the as.numeric function since the dollar function only receives numeric variables:
dollar(APPLE_2023_ANNUALSALES)
[1] "$365,817,000,000" "$394,328,000,000" "$383,285,000,000"

Then, we can say that in 2023 Apple reported annual total sales amount of $383,285,000,000. This is more than 380 billion dollars! (380 thousand million dollars in Spanish).

2.3 Merge the datasets

2.3.1 Merging Important Firm Characteristics into the Panel Dataset

2.3.1.1 Step 1: Pull the following firm characteristics:

  • Company name
  • Status (active or cancelled)
  • Industry (naics1)

In the uspanel dataset, we don’t have general information about each firm, such as its name, industry, or status. This information is stored in the usfirms dataset. So, we will start by integrating this information into the panel dataset.

2.3.1.2 Step 2: Select the relevant columns from the usfirms dataset:

We need to create a smaller dataset (usfirms1) that only contains the relevant columns for merging.

# I create a dataset with only the relevant columns to be merged: 
usfirms1<- usfirms %>% 
  select(empresa,Nombre,status,naics1)

2.3.2 Merging Firm Characteristics into the Panel Dataset

To integrate the usfirms dataset into the panel dataset uspanel, we can use the merge() function, which is similar to the vlookup function in Excel. Our goal is to pull the following columns into the uspanel dataset: - Industry (naics1) - Company name - Status (active or cancelled)

This will allow us to perform queries and analyze the data using these additional firm characteristics.

2.3.2.1 Identifying the Columns to Match

Before using the merge() function, it’s important to identify the columns that will be used to link both datasets. In this case, we use the firm code (ticker symbol), which uniquely identifies each firm.

However, there’s an issue: the firm code is named differently in each dataset: - In uspanel, the firm code is named firm. - In usfirms, the firm code is named empresa.

2.3.2.2 Solution: Specify the Columns to Match

We can address this issue by explicitly telling R which columns to match from each dataset in the merge() function. Here’s how to do it:

# Merge usfirms1 into uspanel using the firm code as the key

uspanel <- merge(uspanel,usfirms1,by.x="firm", by.y="empresa")

The by.x parameter refers to the column in the first dataset (uspanel), and the by.y parameter refers to the column in the second dataset (usfirms1).

If both datasets have the same column name for the firm code (e.g., “firm”), you can simplify the merge() function by omitting by.x and by.y:

uspanel <- merge(uspanel,usfirms1,by="firm")

3 CHALLENGE 2: Calculating financial variables

Using the dataus2024 dataset, you have to do the following descriptive statistics and respond the following questions.

You have to calculate market capitalization (marketcap) in this dataset. You can calculate a new column as follows:

marketcap = original stock price * shareoutstanding.

This is the market value of the firm in each quarter. We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares:

Les have a look at the contents of the dataset:

head(uspanel)
  firm      q fiscalmonth  revenue    cogs    sgae otherincome
1    A 2000q1          NA       NA      NA      NA          NA
2    A 2000q2           6  4731000 2421000 1925000       73000
3    A 2000q3           9  7401000 3790000 3016000      101000
4    A 2000q4          12 10773000 5522000 4198000      111000
5    A 2001q1           3  2841000 1449000 1113000       19000
6    A 2001q2           6  4930000 2683000 2025000      277000
  extraordinaryitems finexp incometax totalassets currentassets inventory
1                 NA     NA        NA          NA            NA        NA
2                  0      0    161000     7321000       5057000   1622000
3                  0      0    244000     7827000       5344000   1762000
4                  0      0    407000     8425000       5655000   1853000
5             -25000      0    119000     9208000       5461000   2129000
6             -14000      0    235000     9080000       4998000   1791000
  totalliabilities currentliabilities longdebt adjprice originalprice
1               NA                 NA       NA 62.87922      104.0000
2          2679000            2186000        0 44.58983       73.7500
3          2925000            2510000        0 29.58800       48.9375
4          3160000            2758000        0 33.10228       54.7500
5          3667000            3070000        0 18.57960       30.7300
6          3464000            3098000        0 19.64976       32.5000
  sharesoutstanding year fixedassets yearf cto fiscalq
1          452000.0 2000          NA    NA   1      NA
2          452272.0 2000     1453000  2000   2       2
3          453014.6 2000     1581000  2000   3       3
4          456366.4 2000     1741000  2000   4       4
5          456769.7 2001     1821000  2001   1       1
6          457072.7 2001     1848000  2001   2       2
                     Nombre status                    naics1
1 Agilent Technologies, Inc activo Industrias manufactureras
2 Agilent Technologies, Inc activo Industrias manufactureras
3 Agilent Technologies, Inc activo Industrias manufactureras
4 Agilent Technologies, Inc activo Industrias manufactureras
5 Agilent Technologies, Inc activo Industrias manufactureras
6 Agilent Technologies, Inc activo Industrias manufactureras
uspanel <- uspanel%>% 
  # Calculate market value
  mutate(marketvalue = originalprice * sharesoutstanding,
  # Calculate Gross Profit
         grossprofit = revenue - cogs,
  # Calculate EBIT
         ebit = grossprofit - sgae,
  # Calculate Net income
         netincome = ebit + otherincome + extraordinaryitems - finexp - incometax)
head(uspanel)
  firm      q fiscalmonth  revenue    cogs    sgae otherincome
1    A 2000q1          NA       NA      NA      NA          NA
2    A 2000q2           6  4731000 2421000 1925000       73000
3    A 2000q3           9  7401000 3790000 3016000      101000
4    A 2000q4          12 10773000 5522000 4198000      111000
5    A 2001q1           3  2841000 1449000 1113000       19000
6    A 2001q2           6  4930000 2683000 2025000      277000
  extraordinaryitems finexp incometax totalassets currentassets inventory
1                 NA     NA        NA          NA            NA        NA
2                  0      0    161000     7321000       5057000   1622000
3                  0      0    244000     7827000       5344000   1762000
4                  0      0    407000     8425000       5655000   1853000
5             -25000      0    119000     9208000       5461000   2129000
6             -14000      0    235000     9080000       4998000   1791000
  totalliabilities currentliabilities longdebt adjprice originalprice
1               NA                 NA       NA 62.87922      104.0000
2          2679000            2186000        0 44.58983       73.7500
3          2925000            2510000        0 29.58800       48.9375
4          3160000            2758000        0 33.10228       54.7500
5          3667000            3070000        0 18.57960       30.7300
6          3464000            3098000        0 19.64976       32.5000
  sharesoutstanding year fixedassets yearf cto fiscalq
1          452000.0 2000          NA    NA   1      NA
2          452272.0 2000     1453000  2000   2       2
3          453014.6 2000     1581000  2000   3       3
4          456366.4 2000     1741000  2000   4       4
5          456769.7 2001     1821000  2001   1       1
6          457072.7 2001     1848000  2001   2       2
                     Nombre status                    naics1 marketvalue
1 Agilent Technologies, Inc activo Industrias manufactureras    47008000
2 Agilent Technologies, Inc activo Industrias manufactureras    33355058
3 Agilent Technologies, Inc activo Industrias manufactureras    22169401
4 Agilent Technologies, Inc activo Industrias manufactureras    24986059
5 Agilent Technologies, Inc activo Industrias manufactureras    14036534
6 Agilent Technologies, Inc activo Industrias manufactureras    14854864
  grossprofit    ebit netincome
1          NA      NA        NA
2     2310000  385000    297000
3     3611000  595000    452000
4     5251000 1053000    757000
5     1392000  279000    154000
6     2247000  222000    250000
# To create the annual returns, I can group the data by firm and arrange by firm and quarter

# Calculate returns with proper sorting by quarter
uspanel <- uspanel %>%
  group_by(firm) %>%
    # Sort by quarter (q) within each firm
  arrange(firm, q) %>%
  
  # Calculate the annual return (adjusted price return)
  mutate(annual_R = adjprice / lag(adjprice, n = 4) - 1,
  
  # Calculate the continuously compounded annual return
         annual_r = log(adjprice) - lag(log(adjprice), n = 4))
 
# Display the first 20 rows of the relevant columns: firm, q, annual_R, and annual_r
uspanel %>%
  select(firm, q, annual_R, annual_r) %>%
  head(20)
# A tibble: 20 × 4
# Groups:   firm [1]
   firm  q      annual_R annual_r
   <chr> <chr>     <dbl>    <dbl>
 1 A     2000q1  NA       NA     
 2 A     2000q2  NA       NA     
 3 A     2000q3  NA       NA     
 4 A     2000q4  NA       NA     
 5 A     2001q1  -0.705   -1.22  
 6 A     2001q2  -0.559   -0.819 
 7 A     2001q3  -0.601   -0.918 
 8 A     2001q4  -0.479   -0.653 
 9 A     2002q1   0.138    0.129 
10 A     2002q2  -0.272   -0.318 
11 A     2002q3  -0.332   -0.403 
12 A     2002q4  -0.370   -0.462 
13 A     2003q1  -0.624   -0.978 
14 A     2003q2  -0.173   -0.190 
15 A     2003q3   0.693    0.526 
16 A     2003q4   0.628    0.487 
17 A     2004q1   1.41     0.878 
18 A     2004q2   0.498    0.404 
19 A     2004q3  -0.0244  -0.0247
20 A     2004q4  -0.176   -0.193 

2. Using the same panel dataset, create columns for the following financial ratios:

# Calculate financial ratios: roabit, roa, oeps, oepsp, and bmr with conditional handling of zero 
uspanel <- uspanel %>%
  group_by(firm) %>%
  # Sort by quarter (q) within each firm
  arrange(firm, q) %>%
  # Calculate Operational Return on Assets (roabit)
  mutate(roabit = ifelse(lag(totalassets, 4) == 0, NA, ebit / lag(totalassets, 4)),
  # Calculate Return on Assets (roa) 
         roa = ifelse(lag(totalassets, 4) == 0, NA, netincome / lag(totalassets, 4)),
  # Calculate Operational Earnings per share (oeps)
         oeps = ifelse(sharesoutstanding == 0, NA, ebit / sharesoutstanding),
  # Calculate Operational EPS deflated by stock price (oepsp)
         oepsp = ifelse(originalprice == 0, NA, oeps / originalprice),
  # Calculate Book-to-Market Ratio (bmr) 
         bmr = ifelse(marketvalue == 0, NA, totalassets / marketvalue))

# Display the first 20 rows of the relevant columns: firm, q, roabit, roa, oeps, oepsp, and bmr
uspanel %>%
  select(firm, q, roabit, roa, oeps, oepsp, bmr) %>%
  head(20)
# A tibble: 20 × 7
# Groups:   firm [1]
   firm  q       roabit      roa   oeps    oepsp    bmr
   <chr> <chr>    <dbl>    <dbl>  <dbl>    <dbl>  <dbl>
 1 A     2000q1 NA      NA       NA     NA       NA    
 2 A     2000q2 NA      NA        0.851  0.0115   0.219
 3 A     2000q3 NA      NA        1.31   0.0268   0.353
 4 A     2000q4 NA      NA        2.31   0.0421   0.337
 5 A     2001q1 NA      NA        0.611  0.0199   0.656
 6 A     2001q2  0.0303  0.0341   0.486  0.0149   0.611
 7 A     2001q3 -0.0153  0.00396 -0.261 -0.0133   0.957
 8 A     2001q4 -0.0923  0.0207  -1.68  -0.0589   0.604
 9 A     2002q1 -0.0480 -0.0342  -0.953 -0.0273   0.528
10 A     2002q2 -0.0871 -0.0626  -1.70  -0.0720   0.750
11 A     2002q3 -0.141  -0.0923  -2.61  -0.200    1.38 
12 A     2002q4 -0.201  -0.129   -3.44  -0.192    0.978
13 A     2003q1 -0.0299 -0.0431  -0.543 -0.0413   1.25 
14 A     2003q2 -0.0717 -0.0625  -1.25  -0.0641   0.827
15 A     2003q3 -0.0926 -0.245   -1.64  -0.0743   0.570
16 A     2003q4 -0.0884 -0.251   -1.52  -0.0521   0.452
17 A     2004q1  0.0102  0.00914  0.164  0.00520  0.420
18 A     2004q2  0.0249  0.0230   0.395  0.0135   0.474
19 A     2004q3  0.0496  0.0459   0.612  0.0284   0.644
20 A     2004q4  0.0613  0.0554   0.793  0.0329   0.601

3.1 Earnings per Share (EPS) Deflated by Price

The Earnings per Share (EPS) deflated by price, also known as the Earnings Yield, is calculated as:

\text{Earnings Yield} = \frac{\text{EPS}}{\text{Price per Share}}

Where:

  • EPS represents the earnings generated for each outstanding share.
  • Price per Share is the market price of a single stock.

This ratio tells us how much earnings are generated per dollar invested in the stock. A higher earnings yield might indicate that a stock is undervalued, making it potentially more attractive for investors.

3.2 Book-to-Market Ratio (B/M)

The Book-to-Market Ratio (B/M) is calculated as:

\text{B/M Ratio} = \frac{\text{Book Value of Equity}}{\text{Market Value of Equity}}

Where:

  • Book Value of Equity is the net asset value of the company (total assets minus total liabilities).

  • Market Value of Equity (or Market Capitalization) is the current market value of all outstanding shares.

A high B/M ratio suggests that a stock might be undervalued (or in financial distress), while a low B/M ratio could indicate that a stock is overvalued.

4 Challenge 3: Create a cross-sectional dataset with the most recent annual data

4.0.1 Select Active Firms and Quarters with Annual Financial Data in 2023

Before performing basic descriptive statistics for the U.S. market, you need to implement some data management algorithms.

4.0.1.1 Step 1: Filter firm-quarters with fiscalmonth = 12, year = 2023, and only include active firms.

In this dataset, we have quarterly financial data. For income-statement variables, the reported values are cumulative amounts from fiscal Q1 to the corresponding quarter of the fiscal year. For balance-sheet variables, the values reflect the firm’s financial position at the end of each quarter.

The fiscalmonth variable shows the quarter that ends the fiscal year. If fiscalmonth = 12, it means this quarter is the end of the fiscal year, and income statement variables reflect the annual values.

Save the result in a new dataset called active2023.

Now I can complete the challenge, which is to create the active2023 dataset:

active2023 <- uspanel %>%
          filter(status=="activo",yearf==2023, fiscalmonth==12)

This dataset, active2023, is no longer a panel dataset as it contains data for only one period (2023) for each firm. Instead, it is now a cross-sectional dataset.

We can count the # active firms in 2023 in the US using the nrow function:

nrow(active2023)
[1] 3285

There were 3285 active public firms in the US market in 2023.

We can also check how many of these firms end the fiscal quarter in the Q4 of the year:

firmsbyQ<- table(active2023$q)
firmsbyQ

2023q2 2023q3 2023q4 2024q1 
   184    193   2693    215 

The table function counts the number of cases for the different values of the variable (in this case, the variable q represents the quarter). This allows us to see how many firms end their fiscal year in each quarter.

We can observe that more than 80% of the firms end their fiscal year in Q4 (215 firms out of 3285).

5 Challenge 4: Descriptive Statistics and Analysis of Active US Firms in 2023

We will perform descriptive statistics on the active2023 dataset to analyze the total assets and market capitalization of US firms active in 2023.

5.1 Median, 25th and 75th Percentile of Total Assets and Market Capitalization

First, we calculate and interpret the median, 25th, and 75th percentiles for total assets and market capitalization of all active US firms.

# Descriptive statistics for total assets and market capitalization
summary_stats <- active2023 %>%
  select(firm, totalassets, marketvalue) %>%
  summary()
summary_stats
     firm            totalassets          marketvalue        
 Length:3285        Min.   :         4   Min.   :        36  
 Class :character   1st Qu.:    313801   1st Qu.:    225389  
 Mode  :character   Median :   1683214   Median :   1295817  
                    Mean   :  17604938   Mean   :  14770424  
                    3rd Qu.:   7067672   3rd Qu.:   5747438  
                    Max.   :3875393000   Max.   :2676736861  
                                         NA's   :27          
industry_stats <- active2023 %>%
  group_by(naics1) %>%
  summarise(num_firms = n(),
    median_assets = median(totalassets, na.rm = TRUE),
    p25_assets = quantile(totalassets, 0.25, na.rm = TRUE),
    p75_assets = quantile(totalassets, 0.75, na.rm = TRUE),
    median_market_value = median(marketvalue, na.rm = TRUE),
    p25_market_value = quantile(marketvalue, 0.25, na.rm = TRUE),
    p75_market_value = quantile(marketvalue, 0.75, na.rm = TRUE))

print(industry_stats)
# A tibble: 19 × 8
   naics1      num_firms median_assets p25_assets p75_assets median_market_value
   <chr>           <int>         <dbl>      <dbl>      <dbl>               <dbl>
 1 -                   1     13511396   13511396   13511396             5752457.
 2 Agricultur…        16      1676710.    365443    4424650              924523.
 3 Comercio a…        72      2449854.    494542.   6827575             1474175.
 4 Comercio a…       137      2969586     644683    7774100             1968102.
 5 Construcci…        47      3407851    1371598.   6477446.            3222951.
 6 Empresas d…        76     14463800    4089750.  44058250             5293863.
 7 Industrias…      1411       565299     116712.   2878860              729177.
 8 Informació…       234      1401528     444998    4741062.            2731012.
 9 Minería, e…       106      2167800.    396234.   7365994.            1203355.
10 Otros serv…        17      1014723     607377    2854552              848578.
11 Servicios …        68      1991052     581241.   6389344.            1555129.
12 Servicios …       113      2581600     721538    8233200             3293707.
13 Servicios …        19      2625046    1358087    6993820.            2528477.
14 Servicios …        58      1527342     456712.   4479580.            1505531.
15 Servicios …        13      1007316     740685    2125213             1497541.
16 Servicios …       640      5967370.   2193738   18281728.            1450774.
17 Servicios …        63      1770814     340774.   5155288.             867256.
18 Servicios …       129       695243     186101    3594405              830948.
19 Transporte…        65      4557200    1801846   17716265             3471636.
# ℹ 2 more variables: p25_market_value <dbl>, p75_market_value <dbl>

The median total assets indicate the central point of asset values, while the 25th and 75th percentiles provide insights into the distribution of firm sizes. The same applies for market capitalization, which reflects the distribution of firm valuations.

5.2 Industry-Wise Descriptive Statistics

We now calculate the number of firms, median, 25th, and 75th percentiles of total assets and market capitalization for each industry.

# Grouping by industry and summarizing stats
industry_stats <- active2023 %>%
  group_by(naics1) %>%
  summarize(
    n_firms = n(),
    median_assets = median(totalassets, na.rm = TRUE),
    p25_assets = quantile(totalassets, 0.25, na.rm = TRUE),
    p75_assets = quantile(totalassets, 0.75, na.rm = TRUE),
    median_market_value = median(marketvalue, na.rm = TRUE),
    p25_market_value = quantile(marketvalue, 0.25, na.rm = TRUE),
    p75_market_value = quantile(marketvalue, 0.75, na.rm = TRUE)
  )

industry_stats
# A tibble: 19 × 8
   naics1        n_firms median_assets p25_assets p75_assets median_market_value
   <chr>           <int>         <dbl>      <dbl>      <dbl>               <dbl>
 1 -                   1     13511396   13511396   13511396             5752457.
 2 Agricultura,…      16      1676710.    365443    4424650              924523.
 3 Comercio al …      72      2449854.    494542.   6827575             1474175.
 4 Comercio al …     137      2969586     644683    7774100             1968102.
 5 Construcción       47      3407851    1371598.   6477446.            3222951.
 6 Empresas de …      76     14463800    4089750.  44058250             5293863.
 7 Industrias m…    1411       565299     116712.   2878860              729177.
 8 Información …     234      1401528     444998    4741062.            2731012.
 9 Minería, exp…     106      2167800.    396234.   7365994.            1203355.
10 Otros servic…      17      1014723     607377    2854552              848578.
11 Servicios de…      68      1991052     581241.   6389344.            1555129.
12 Servicios de…     113      2581600     721538    8233200             3293707.
13 Servicios de…      19      2625046    1358087    6993820.            2528477.
14 Servicios de…      58      1527342     456712.   4479580.            1505531.
15 Servicios ed…      13      1007316     740685    2125213             1497541.
16 Servicios fi…     640      5967370.   2193738   18281728.            1450774.
17 Servicios in…      63      1770814     340774.   5155288.             867256.
18 Servicios pr…     129       695243     186101    3594405              830948.
19 Transportes,…      65      4557200    1801846   17716265             3471636.
# ℹ 2 more variables: p25_market_value <dbl>, p75_market_value <dbl>

This table allows us to compare the typical firm sizes and market valuations across different industries, providing insights into the distribution of firms by sector.

5.3 Top 10 Firms by Total Assets in 2023

Next, we identify the 10 largest active firms by total assets at the end of the 2023 fiscal year, showing their names, industries, and total assets.

# Top 10 firms by total assets
top10_assets <- active2023 %>%
  arrange(desc(totalassets)) %>%
  select(firm, Nombre, naics1, totalassets) %>%
  head(10)

top10_assets
# A tibble: 10 × 4
# Groups:   firm [10]
   firm  Nombre                       naics1                         totalassets
   <chr> <chr>                        <chr>                                <dbl>
 1 JPM   Jpmorgan Chase & Co          Servicios financieros y de se…  3875393000
 2 BAC   Bank Of America Corp         Servicios financieros y de se…  3180151000
 3 C     Citigroup Inc                Servicios financieros y de se…  2411834000
 4 WFC   Wells Fargo & Company        Servicios financieros y de se…  1932468000
 5 GS    The Goldman Sachs Group, Inc Servicios financieros y de se…  1641594000
 6 MS    Morgan Stanley               Servicios financieros y de se…  1193693000
 7 BRK_B Berkshire Hathaway Inc       Servicios financieros y de se…  1069978000
 8 PRU   Prudential Financial, Inc    Servicios financieros y de se…   721123000
 9 MET   Metlife, Inc                 Servicios financieros y de se…   687584000
10 USB   US Bancorp                   Servicios financieros y de se…   663491000

These firms represent the largest companies by total assets, indicating their financial resources and operational capacity in the market.

5.4 Top 10 Firms by Market Capitalization in 2023

Now, we look at the 10 largest firms by market capitalization in 2023, listing their names, industries, and market values.

# Top 10 firms by market capitalization
top10_market_value <- active2023 %>%
  arrange(desc(marketvalue)) %>%
  select(firm, Nombre, naics1, marketvalue) %>%
  head(10)

top10_market_value
# A tibble: 10 × 4
# Groups:   firm [10]
   firm  Nombre                naics1                                marketvalue
   <chr> <chr>                 <chr>                                       <dbl>
 1 AAPL  Apple Inc             Industrias manufactureras             2676736861.
 2 MSFT  Microsoft Corp        Información en medios masivos         2532080939.
 3 NVDA  Nvidia Corp           Industrias manufactureras             2258900000 
 4 GOOGL Alphabet Inc          Servicios profesionales, científicos… 1748360040 
 5 AMZN  Amazon.Com, Inc       Comercio al por menor                 1570152607.
 6 META  Meta Platforms, Inc   Servicios profesionales, científicos…  909628613.
 7 TSLA  Tesla, Inc            Industrias manufactureras              789898387.
 8 LLY   Eli Lilly And Company Industrias manufactureras              553370175.
 9 AVGO  Broadcom Inc          Industrias manufactureras              522561910.
10 JPM   Jpmorgan Chase & Co   Servicios financieros y de seguros     491760519.

These firms have the highest market valuations, reflecting their perceived value by investors as of 2023.

5.5 Histogram of Market Capitalization for All Active Firms

We create a histogram to visualize the distribution of market capitalization for all active firms in 2023.

# Histogram of market capitalization
hist(active2023$marketvalue)

The histogram shows how market capitalization is distributed across firms, revealing the range of firm sizes and potential clustering of smaller or larger companies. We can see that there are some very big firms that appear to be outliers (biggest 1%). We can zoom in to see those firms with a market value of USD 20 billion.

subset_20billion <- active2023 %>%
            filter(marketvalue <=20000000) %>%
            select(firm, marketvalue)
hist(subset_20billion$marketvalue, breaks=40)

5.6 6. Box Plot of Market Capitalization by Industry

To further analyze the distribution of firm sizes, we generate box plots of market capitalization by industry for 2023.

# Box plot by industry
library(ggplot2)
ggplot(active2023, aes(x=naics1, y=marketvalue)) +
   geom_boxplot()

The box plots provide a visual summary of the distribution of market capitalization within each industry, highlighting outliers and the central tendencies of firm sizes by sector.

5.7 Visualization Issues and Solutions

We encountered two main visualization problems:

  1. Difficulty in Visualizing Market Capitalization:
    It’s challenging to visualize the market capitalization of 50% of the firms by industry, represented by the box in the box plot (Quartiles from Q1 to Q3).

  2. Unreadable Industry Labels:
    The x-axis labels for industry names are difficult to read due to their length or positioning.

5.7.1 Proposed Solutions

To address these issues:

  • For the Market Capitalization Visualization:
    Given that there are many extreme values for market capitalization, we can plot the logarithm (base 10) of market capitalization to better visualize the data distribution.
ggplot(active2023, aes(x=naics1, y=marketvalue)) +
   geom_boxplot() + scale_y_log10() 

  • For the Industry Labels:
    Adjust the x-axis labels by rotating or abbreviating them to improve readability.
ggplot(active2023, aes(x=naics1, y=marketvalue)) +
   geom_boxplot() + scale_y_log10() +
   theme(axis.text.x = element_text(angle = 25, hjust = 1))

5.8 Introduction to Winsorization

Winsorization is a statistical technique used to limit extreme values in data to reduce the impact of outliers. Instead of removing these outliers, Winsorization transforms them by capping the extreme values at a specified percentile. This technique is especially useful in financial data where extreme values, such as abnormally high market capitalization, can distort analyses.

By applying Winsorization, we ensure that values at the extreme tails of the distribution are replaced with more reasonable ones, reducing the skewness and impact of these outliers without completely discarding them.

In the following code, we will use the statar library to Winsorize the marketvalue variable at the 1st and 99th percentiles.

5.8.1 Code to Winsorize Market Value

# Load the statar library
library(statar)

# Winsorize the marketvalue variable at the 1st and 99th percentiles
active2023$marketvalue_winsorized <- winsorize(active2023$marketvalue, probs = c(0.01, 0.99))

# View summary of the Winsorized market value
summary(active2023$marketvalue_winsorized)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
     2955    225389   1295817  10403048   5747438 210701986        27 
ggplot(active2023, aes(x=naics1, y=marketvalue_winsorized)) +
   geom_boxplot() +
   theme(axis.text.x = element_text(angle = 25, hjust = 1))

6 Challenge 5: Review of Logistic Regression Models

6.1 Logistic Regression Model

We will run a logistic regression model to examine whether operational earnings per share deflated by price (oepsp) is related to the probability that a firm’s return on assets (ROA) is greater than the corresponding industry ROA mean.

We will use the glm() function with the binomial family to specify our logistic regression model. First, let’s start by creating the binary dependent variable ROA_above.

uspanel <- uspanel %>%
   group_by(naics1, q) %>% 
   mutate(ROAmean = mean(roa, na.rm=TRUE),
          ROA_above = ifelse(roa>ROAmean,1,0)) %>%
  ungroup()
# Run logistic regression model
logit_model1 <- glm(ROA_above ~ oepsp, data = uspanel, family = "binomial", na.action = na.omit)

# View the summary of the model
summary(logit_model1)

Call:
glm(formula = ROA_above ~ oepsp, family = "binomial", data = uspanel, 
    na.action = na.omit)

Coefficients:
               Estimate  Std. Error z value            Pr(>|z|)    
(Intercept) 0.813283657 0.004946147 164.428 <0.0000000000000002 ***
oepsp       0.000003206 0.000004235   0.757               0.449    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 236951  on 192070  degrees of freedom
Residual deviance: 236947  on 192069  degrees of freedom
  (365110 observations deleted due to missingness)
AIC: 236951

Number of Fisher Scoring iterations: 8

6.2 Interpretation of Logistic Regression Results

We observe that the beta coefficient for oepsp (operating earnings per share deflated by price) is very close to zero, and its p-value is much greater than 0.05. This suggests that there is no significant evidence to indicate that operating earnings per share is related to the probability of a firm having a ROA (Return on Assets) greater than the industry average.

Before concluding that oepsp has no relationship with the probability of a firm achieving a higher ROA than the industry average, it is essential to check for potential data issues that may be affecting our model:

  1. Extreme Values of Independent Variables:
    As we have already learned about Winsorization, this technique can be applied to handle extreme values of oepsp. Outliers may be affecting the model’s results, and Winsorization will help by capping extreme values at a certain threshold to reduce their impact.

  2. Treatment of Income Statement Variables:
    Recall that income-statement variables in this dataset represent year-to-date (cumulative) amounts. This could pose a problem for the calculation of both ROA and oepsp, as the numerator (net income and operating earnings) accumulates throughout the fiscal year, increasing with each quarter. To address this issue, we could filter the data to include only rows corresponding to the last fiscal quarter of each year (fiscalmonth == 12).

6.2.1 Coefficients Interpretation

The coefficient of oepsp in the logistic regression is very close to zero, which means that a change in oepsp has almost no effect on the log-odds of the firm’s ROA being above the industry average. Moreover, the large p-value suggests that this relationship is not statistically significant at common significance levels (e.g., 0.05). Therefore, we cannot conclude that oepsp is a significant predictor of whether a firm’s ROA exceeds the industry average.

We will explore further details of logistic regression and its interpretation in the next workshop.

# Winsorize the 'oepsp' variable at the 1st and 99th percentiles to handle extreme values
uspanel <- uspanel %>%
  mutate(oepsp_winsorized = winsorize(oepsp, probs = c(0.01, 0.99)))

# Filter the data to include only rows with fiscalmonth == 12 (last fiscal quarter)
uspanel_filtered <- uspanel %>%
  filter(fiscalmonth == 12)

# Run the logistic regression using the Winsorized 'oepsp' variable and the filtered data
logit_model1 <- glm(ROA_above ~ oepsp_winsorized, 
                    data = uspanel_filtered, 
                    family = "binomial", 
                    na.action = na.omit)

# Summary of the logistic regression model
summary(logit_model1)

Call:
glm(formula = ROA_above ~ oepsp_winsorized, family = "binomial", 
    data = uspanel_filtered, na.action = na.omit)

Coefficients:
                 Estimate Std. Error z value            Pr(>|z|)    
(Intercept)       1.00187    0.01079   92.88 <0.0000000000000002 ***
oepsp_winsorized  2.62671    0.04700   55.89 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 57291  on 47498  degrees of freedom
Residual deviance: 52086  on 47497  degrees of freedom
  (15009 observations deleted due to missingness)
AIC: 52090

Number of Fisher Scoring iterations: 4

6.2.2 Interpretation of the Coefficient:

The coefficient of oepsp_winsorized is 2.62671, which suggests that for every one-unit increase in the Winsorized operating earnings per share (deflated by price), the log-odds of a firm having a ROA greater than the industry average increase by 2.62671. This is a substantial increase in the log-odds, implying that the variable may have a significant positive relationship with the outcome.

To interpret the coefficient in terms of odds, we need to calculate the odds ratio. The odds ratio is calculated by exponentiating the coefficient:

[ = e^{} = e^{2.62671} ]

This means that for every one-unit increase in oepsp_winsorized, the odds of a firm’s ROA being above the industry average increase by approximately 13.82 times, holding all else constant.

6.2.3 Odds Ratio:

The odds ratio helps interpret the logistic regression coefficient more intuitively:

  • Odds represent the ratio of the probability of an event happening to the probability of it not happening. \text{Odds} = \frac{P(\text{event})}{1 - P(\text{event})} In this case, the event is a firm having a ROA greater than the industry average.

  • The odds ratio is the factor by which the odds change for a one-unit increase in the independent variable (in this case, oepsp_winsorized). An odds ratio of 13.82 indicates that the odds increase more than thirteenfold for each unit increase in oepsp_winsorized.

This large odds ratio suggests that oepsp_winsorized is a strong predictor of whether a firm’s ROA will exceed the industry average. However, it’s important to assess the statistical significance (via p-value) to determine whether this effect is statistically meaningful.