Workshop 2 Solution, Algorithms and data analysis

Author

Alberto Dorantes, Ph.D.

Published

October 3, 2024

Abstract

This is a solution of Workshop 2. In this workshop we continue practicing with descriptive statistics, data understanding and financial data calculations. In addition, we run a logistic regression model.

1 Introduction

We will work with a panel data of real US public firms. You have to analyze historical quarterly financial statements of all US public firms listed in the New York Exchange and NASDAQ.

This is the dataset you will use for the Problem Situation (Final Project) of this class.

First, we will do descriptive statistics and visualization with the purpose of understanding the US financial market.

You have to work with 2 datasets:

firmsus2024.csv: List of all US public firms with general information of each firm
dataus2024.csv : Panel data with historical financial quarterly data for all US public firms.

You have to download these 2 files from Canvas/Modules and save in the directory where you have your workshop.

The first dataset (dataus2023q2) contains the historical financial data of the firms, while the second dataset (firmsus2023) is a catalog of all firms along with the corresponding industry type and status (active or cancelled).

The dataus2023q2 dataset has a panel-data (also called long format) structure. Each row has financial information for one US firm and 1 period (a quarter). All $ amounts are in thousands (’1000s). Here is a data dictionary of the columns:

Data dictionary of historical quarterly financial data.
Variable	Description
firm	Unique code of the company (also called ticker)
q	Quarter date
fiscalmonth	Month of the year when the firm closes a fiscal year
revenue	Total sales of the firm from the first fiscal quarter to the current quarter
cogs	Cost of good sold - variable costs of the products sold - from the first fiscal quarter to the current quarter
sgae	Sales and general administrative expenses - from the first fiscal quarter to the current quarter
otherincome	Other operational income/expenses that are not directly from the core operations of the firm - from the first fiscal quarter to the current quarter
extraordinaryitems	Extra income/expenses not related to regular operations - from the first fiscal quarter to the current quarter
finexp	Financial expenses - interest expenses paid (generated from loans) - from the first fiscal quarter to the current quarter
incometax	Income tax from the first fiscal quarter to the current quarter
totalassets	Total assets of the firm at the end of the quarter
currentassets	Current assets of the firm at the end of the quarter
totalliabilities	Total liabilities of the firm at the end of the quarter
currentliabilities	Current liabilities of the firm at the end of the quarter
longdebt	Balance of long-term financial debt (loans to pay longer than 1 year)
adjprice	Stock adjusted price at the end of the quarter; adjusted for stock splits and dividend payments; used to calculate stock returns
originalprice	Historical stock price (not adjusted); used to calculate historical market value
sharesoutstanding	Historical number of shares available in the market
fixedassets	Fixed assets value at the end of the quarter
year	Calendar year
yearf	Fiscal year - this depends on when the firm ends its fiscal year; if fiscalmonth=12 in the quarter 3, then the fiscal year will start in Q4 of a year and ends in the Q3 of the following year

Each row of this dataset has quarterly financial data of one firm in one quarter. All firms have quarters from Q1 2000 to Q2 2023. Not all firms have existed since 2000, so if the first quarters are empty that means that the firm did not exist in the US financial market in those quarters. Then, it is possible to know when each firm went public to issue shares in the financial market: the first quarter with some non-empty data.

Each firm has defined the month of the year used to close a fiscal year. For example, Apple closes the fiscal year at the end of Quarter 3 (end of September) of any year. Then, for Apple, in the Q3 of 2022, there will be a 12 for the fiscalmonth variable. In this case, Apple starts its fiscal year in the Q4 of each year and ends in the Q3 of the following year. Most of the firms (about 80%) close fiscal year in December, so these firms will have a 12 in the Q4 of each year.

The variables related to sales and expenses are cumulative for each fiscal year. For example, Apple sold about $117 billion in the last calendar quarter (Q4) of 2022, but this is the first fiscal quarter for Apple. For Q1 (calendar) 2023 (which is the 2nd fiscal quarter), Apple has about $212 billion in the revenue variable, meaning that considering fiscal quarter 1 and 2, Apple has sold $212 billion. For Q2 2023 Apple has about $293 billion, meaning that the cumulative revenue of fiscal Q1, Q2 and Q3 is about $293 billion. Then, if you select rows with fiscalmonth=12, then you will be selecting those quarters with annual financial information for each firm!

The firmsus2023.csv is a catalog of all active and cancelled US firms:

Variable	Description
firm	Unique code of the company (also called ticker)
name	Name of the firm
status	Status of the firm: active or cancelled
partind	Percent participation in the S&P500 market index
naics1	North American Industry Classification Code - Level 1
naics2	North American Industry Classification Code - Level 2
SectorEconomatica	Economatica Industry classification

2 Challenge 1: Data management algorithms

We import the csv files into the R Environment:

# To avoid scientific notation:
options(scipen=999)

#Read the csv files:
uspanel <- read.csv("https://www.apradie.com/datos/dataus2024.csv")
usfirms <- read.csv("https://www.apradie.com/datos/firmsus2024.csv")

As a quick check of the imported data, write the code do the following query:

1. Display the revenue and total assets for Apple for the annual fiscal quarters of 2022, 2021 and 2023. The annual quarters are those with fiscalmonth=12

Do an online check, find the most recent 10-K report (annual financial statement) of Apple, and compare 2022 annual sales with what you got. You must get the same amount for the annual sales reported at the end of fiscal year (for Apple, the fiscal year is Q3).

This type of data management is also called query. A query is a search statement in a database.

APPROACH 1: Using sub-setting of data frames:

In the row index of the data frame I indicate the condition to Apple and the year equal or greater than 2020. In the column index I indicate to select only 4 relevant columns:

# I assign the result of this query 
APPLE_info = uspanel[uspanel$firm=="AAPL" & uspanel$yearf>=2021 & uspanel$fiscalmonth==12,
                     c("firm","q","revenue","yearf","fiscalmonth")]
# I saved the result of the query in a new dataset called APPLE_2020
# I display the result:
APPLE_info

     firm      q   revenue yearf fiscalmonth
1165 AAPL 2021q3 365817000  2021          12
1169 AAPL 2022q3 394328000  2022          12
1173 AAPL 2023q3 383285000  2023          12

All the financial amounts in this dataset is in thousands (1000’s). The fiscalmonth column is the fiscal month at the end of the quarter. The q column is the quarter. Since the fiscalmonth=12 is in the Quarters Q3 of each year, this indicates that Apple ends its fiscal year in the Q3, which is in September of each year.

We can display only the annual 2023 sales of Apple as follows:

APPLE_info[APPLE_info$yearf==2023,c("revenue")]

[1] 383285000

To report the amount in currency format, we can use the dollar function from the scales package (you have to install the scales package) as follows:

library(scales)
APPLE_2023_ANNUALSALES = as.numeric(1000*APPLE_info$revenue[APPLE_info$yearf==2023])
# I used the as.numeric function since the dollar function only receives numeric variables:
dollar(APPLE_2023_ANNUALSALES)

[1] "$383,285,000,000"

Then, we can say that in 2023 Apple reported annual total sales amount of $383,285,000,000. This is more than 380 billion dollars! (383 thousand of millions!).

APPROACH 2: USING THE dplyr PACKAGE:

I use the filter function to select rows and select function to select columns:

library(dplyr)
APPLE_info <- uspanel %>% 
              filter(firm=="AAPL", yearf>=2021, fiscalmonth==12) %>%
              select(firm,q,revenue,yearf,fiscalmonth) 
APPLE_info

firm	q	revenue	yearf	fiscalmonth
AAPL	2021q3	365817000	2021	12
AAPL	2022q3	394328000	2022	12
AAPL	2023q3	383285000	2023	12

The rest is just formatting as shown in approach 1.

2. Merge the datasets

In the dataus2024 panel dataset there no information about firm name, industry, status (active or cancelled), etc. That information is in the firmsus2024 datase. So, we start by integrating this information into the panel data.

Merge the firmsus2024 dataset into the dataus2024 dataset, but only pull the following firm characteristics (columns): Company name, status, and industry (naics1).

In the dataus2024 dataset we have quarterly financial data. For income-statement variables, the information is the cumulative amount from Q1 to the corresponding quarter of the fiscal year. For balance-sheet variables, the information is the value of the variable at the end of the quarter (since the beginning when the firm was created).

2. SOLUTION

We can use the merge function or the left_join function to pull the important variables for each firm from the firms2024 dataset into the uspanel dataset.

When doing a merge between 2 dataset, we have to identify which is the common column between the datasets. The common column or variable is needed to do the “match” between the uspanel with the usfirms dataset.

We can check the variables for both datasaet and identify the common variable:

For usfirms:

colnames(usfirms)

[1] "empresa"           "Nombre"            "status"           
[4] "partind"           "naics1"            "naics2"           
[7] "SectorEconomatica"

For uspanel:

colnames(uspanel)

 [1] "firm"               "q"                  "fiscalmonth"       
 [4] "revenue"            "cogs"               "sgae"              
 [7] "otherincome"        "extraordinaryitems" "finexp"            
[10] "incometax"          "totalassets"        "currentassets"     
[13] "inventory"          "totalliabilities"   "currentliabilities"
[16] "longdebt"           "adjprice"           "originalprice"     
[19] "sharesoutstanding"  "year"               "fixedassets"       
[22] "yearf"              "cto"                "fiscalq"

It seems that the id for the firms is called empresa in the usfirms and firm in the uspanel:

head(usfirms)

   empresa                          Nombre    status partind
1        A       Agilent Technologies, Inc    activo    0.09
2       AA                      Alcoa Corp    activo      NA
3 AABA_old                      Altaba Inc cancelado      NA
4  AAC_old                    Aac Hldg Inc cancelado      NA
5 AAIC_old Arlington Asset Investment Corp cancelado      NA
6      AAL     American Airlines Group Inc    activo    0.02
                                           naics1
1                       Industrias manufactureras
2                       Industrias manufactureras
3 Servicios profesionales, científicos y técnicos
4       Servicios de salud y de asistencia social
5              Servicios financieros y de seguros
6           Transportes, correos y almacenamiento
                                                                                     naics2
1 Fabricación de instrumentos de medición, control, navegación, y equipo médico electrónico
2                                                             Industria básica del aluminio
3                       Servicios de diseño de sistemas de cómputo y servicios relacionados
4                                            Otros consultorios para el cuidado de la salud
5                                                                              Otros fondos
6                                                                  Transporte aéreo regular
  SectorEconomatica
1 Electroelectronic
2 Siderur & Metalur
3  Software y Datos
4             Otros
5            Fondos
6 Transporte Servic

head(uspanel)

  firm      q fiscalmonth  revenue    cogs    sgae otherincome
1    A 2000q1          NA       NA      NA      NA          NA
2    A 2000q2           6  4731000 2421000 1925000       73000
3    A 2000q3           9  7401000 3790000 3016000      101000
4    A 2000q4          12 10773000 5522000 4198000      111000
5    A 2001q1           3  2841000 1449000 1113000       19000
6    A 2001q2           6  4930000 2683000 2025000      277000
  extraordinaryitems finexp incometax totalassets currentassets inventory
1                 NA     NA        NA          NA            NA        NA
2                  0      0    161000     7321000       5057000   1622000
3                  0      0    244000     7827000       5344000   1762000
4                  0      0    407000     8425000       5655000   1853000
5             -25000      0    119000     9208000       5461000   2129000
6             -14000      0    235000     9080000       4998000   1791000
  totalliabilities currentliabilities longdebt adjprice originalprice
1               NA                 NA       NA 62.87922      104.0000
2          2679000            2186000        0 44.58983       73.7500
3          2925000            2510000        0 29.58800       48.9375
4          3160000            2758000        0 33.10228       54.7500
5          3667000            3070000        0 18.57960       30.7300
6          3464000            3098000        0 19.64976       32.5000
  sharesoutstanding year fixedassets yearf cto fiscalq
1          452000.0 2000          NA    NA   1      NA
2          452272.0 2000     1453000  2000   2       2
3          453014.6 2000     1581000  2000   3       3
4          456366.4 2000     1741000  2000   4       4
5          456769.7 2001     1821000  2001   1       1
6          457072.7 2001     1848000  2001   2       2

We can re-name the empresa column to “firm” to have the same name in both datasets, and then do the merge:

I can use dplyr to rename the column:

usfirms = usfirms %>% 
  rename(firm = empresa )
colnames(usfirms)

[1] "firm"              "Nombre"            "status"           
[4] "partind"           "naics1"            "naics2"           
[7] "SectorEconomatica"

Now I use the firm column for both dataset to do the merge. If I only want to consider firm name, industry and status, I can select only these columns from the usfirms datsaet before doing the merge.

In this case, instead of using the merge function, I will use the left_join function from the dplyr package. The left_join is faster and make sure you keep all rows of the uspanel. The merge function might not keep all original rows of uspanel in the case that there is a firm that is in the uspanel and is not in the usfirms.

# I do a subset with only the columns I am interested to pull and paste into the uspanel: 
usfirms1 = usfirms %>% 
  select(firm, Nombre,naics1,status)
# I do the merge and keep the result in the same uspanel:
uspanel = left_join(uspanel,usfirms1,by="firm")

I can check the result by showing only some columns for one firm:

uspanel %>% 
  filter(firm=="MSFT") %>%
  head(5) %>%
  select(firm,Nombre,q,naics1,totalassets,revenue)

  firm         Nombre      q                        naics1 totalassets  revenue
1 MSFT Microsoft Corp 2000q1 Información en medios masivos    50895000 17152000
2 MSFT Microsoft Corp 2000q2 Información en medios masivos    52150000 22956000
3 MSFT Microsoft Corp 2000q3 Información en medios masivos    56089000  5800000
4 MSFT Microsoft Corp 2000q4 Información en medios masivos    57691000 12385000
5 MSFT Microsoft Corp 2001q1 Información en medios masivos    59605000 18841000

3 CHALLENGE 2. Calculating financial variables

You have to write the code to calculate the following financial variables and financial ratios for all firms-quarters of US firms.

1. Create financial variables

Using the dataus20204 dataset, create columns for the following variables:**

Market value (marketvalue) = originalprice * sharesoutstanding

We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares.

I USE THE dplyr package:

uspanel = uspanel %>% 
  mutate(marketvalue= originalprice * sharesoutstanding)

Gross profit (grossprofit) = Revenue - Cost of good Sold (cogs)

uspanel = uspanel %>%
  mutate(grossprofit= revenue - cogs)

Earnings before interest and taxes (ebit) = Gross profit - Sales & general administrative expenses (sgae)

uspanel = uspanel %>%
  mutate(ebit = grossprofit - sgae)

Net Income (netincome) = ebit + otherincome + extraordinaryitems - financial expenses - incometax

uspanel = uspanel %>%
  mutate(netincome = ebit + otherincome + extraordinaryitems - finexp - incometax)

Annual market return: use adjusted stock price and remember that you have quarterly data.

Here you have to use the lag function and group by firm to avoid using stock price of another stock to calculate the stock return of a stock. Hint: check Workshop 1.

I will calculate simple and continuously compounded annual return:

uspanel = uspanel %>%
  group_by(firm) %>%
  arrange(firm,q) %>% 
  mutate(annual_R = adjprice / lag(adjprice,4) - 1, # Annual simple return
         annual_r = log(adjprice) - lag(log(adjprice,4))) %>%  # log annual return
  ungroup()

THE lag FUNCTION GETS THE PREVIOUS VALUE OF THE VARIABLE. WHEN USING the lag FUNCTION IT IS VERY IMPORTANT TO GROUP_BY THE DATASET BY FIRM TO AVOID USING PRICES FROM OTHER STOCK TO CALCULATE STOCK RETURN!

THE arrange FUNCTION IS RECOMMENDED TO MAKE SURE THAT THE DATA IS SORTED BY QUARTER BEFORE CALCULATING THE QUARTERLY RETURN!

NOTE THAT I USE THE PARAMETER 1 IN THE lag FUNCTION TO INDICATE THAT I AM CALCULATING QUARTERLY RETURN SINCE THE DATASET IS QUARTERLY.

2. Using the same panel dataset, create columns for the following financial ratios:

Operational Return on Assets (roabit): roabit divided by total assets at the beginning of the period. Total assets of the beginning of the year is actually the total assets of ONE year ago.

roabit=\frac{ebit_{t}}{totalassets_{t-4}}

Here you can use the lag function to get value of total assets one year ago, which is 4 quarters ago.

Return on Assets (roa):

roa=\frac{netincome_{t}}{totalassets_{t-4}}

uspanel = uspanel %>%
  group_by(firm) %>% 
  arrange(firm,q) %>% 
  mutate(roabit = ebit / lag(totalassets,4),
         roa = netincome / lag(totalassets,4)) %>%
  ungroup()

Operational Earnings per share (oeps): ebit / sharesoutstanding
Operational eps deflated by stock price (oepsp) : oeps / originalprice
Book-to-market ratio (bmr): book value / market value.

Book value can be calculated as totalassets minus totalliabilities

uspanel = uspanel %>%
  mutate(oeps = ifelse(sharesoutstanding==0,NA,ebit / sharesoutstanding),
         oepsp = ifelse(adjprice==0,NA,oeps/adjprice),
         bmr = ifelse(marketvalue==0,NA,(totalassets-totalliabilities)/marketvalue)
         )

Do your own research and briefly explain what is earnings per share deflated by price, and book-to-market ratio

EARNINGS PER SHARE IS EQUAL TO EARNINGS DIVIDED BY THE # OF SHARES. THE MEASURE FOR EARNINGS IS NET INCOME. HOWEVER, SOME ANALYSTS ALSO USE OTHER OPERATIONAL MEASURES FOR EARNINGS SUCH AS EARNINGS BEFORE INTEREST AND TAXES (EBIT). IF WE WANT TO MEASURE OPERATIONAL EARNINGS AND CALCULATE IT FOR MANY FIRMS, IT IS RECOMMENDED TO USE EBIT AS A MEASURE OF EARNINGS.

THE FORMULA OF EPS USING NET INCOME AS A MEASURE OF EARNINGS IS:

EPS_{t}=\frac{NETINCOME_{t}}{\#OFSHARES_{t}}

THE FORMULA FOR EPS USING EBIT AS A MEASURE OF EARNINGS IS:*

EPS_{t}=\frac{EBIT_{t}}{\#OFSHARES_{t}} IN A HYPOTHETICAL SCENARIO, IF THE ALL EARNINGS OF A PERIOD t WERE PAYED TO THE INVESTORS, THEN EPS WILL BE HOW MUCH OF ALL EARNINGS OF THE PERIOD IS PAYED TO EACH SHARE OWN BY INVESTORS.

WHAT IS BOOK-TO-MARKET RATIO?

Book-to-maret ratio (bmr) is the ratio of accounting book value of the firm to its market value. In other words, it results by dividing book value by the market value of the firm at a specific time period.

If bmr=1 means that the firm book value is about the same as firm market value. If that is the case, then the market value has not grown beyond book-value, meaning that the firm has not created value beyond its book value.

If bmr>1 means that the market value is less than book value. So, if bmr>1 means that the firm has significantly lost shareholder’s wealth, and it might incur in bankrupt risk.

Then, what would be the bmr level that all firms are looking for? One of the main purposes of the executives is to MAXIMIZE shareholder’s value. The way to increase shareholder’s value is to increase its market value, and the only way to increase market value is to increase stock price.

Then, the bmr level that all executives prefer is a value much less than 1.

If bmr=0.5 means that the firm market value is the double of its book value. In this case, the main difference between market value and book value is the wealth that the firm has created thanks to its valuable intangible assets such as prestige, high quality, and innovation.

Then, what do you think it might be the relationship between bmr and stock return? Intuitively, we might think that a small value of bmr is a good news, then the stock return might be related in a negative way. If bmr goes down (good news), then the stock return might go up. Then, it might be expected that the relationship between bmr and stock return is linear and negative. Some finance research (like Fama & French, 1995), mainly with US firms has found that the relationship between bmr and future stock return is negative.

However, there are mixed findings about the relationship between bmr and stock returns. Some research (like Piotrosky, 2000) has found that firms with high bmr, but strong financials (like earnings per share) usually provides significant positive stock returns. In this special case, bmr has found to be positively related to stock returns.

Finance research has found that bmr influences earnings per share, which in turn influences current and future stock returns. It is some times that firms with low bmr do not experience significant high stock returns due to the high expectations of investors. This is an interesting finding in finance that is still being researched!

4 CHALLENGE 3: Create a cross-sectional dataset with the most recent annual data

1. Select active firms in the 2023 fiscal year

Using the dataus2024 dataset, write the code to select active firms in the 2023 fiscal year.

To do this, you have to select firm-quarters with fiscalmonth=12, yearf=2023 and status=“active”

Save the result in a new dataset called active2023

The fiscal month variable indicates the following: if fiscalmonth=12, then that means that the corresponding quarter is the end-of-the-fiscal-year quarter.

active2023 = uspanel %>% 
  filter(fiscalmonth==12,yearf==2023)

5 CHALLENGE 4: Code for descriptive statistics for 2023

Using the active2023 dataset, you have to do the following descriptive statistics and respond the following questions.

1. Show and interpret the median, 25 and 75 percentile of total assets and market capitalization (market value) of the active US firms in 2023

active2023 %>%
  summarize( firms=n(),
             median_tot_assets = median(totalassets, na.rm = TRUE),
            Q1_total_assets = quantile(totalassets, probs=c(0.25),na.rm = TRUE),
            Q3_totassets = quantile(totalassets,probs=c(0.75),na.rm=TRUE),
            median_marketvalue = median(marketvalue, na.rm = TRUE),
            Q1_marketvalue = quantile(marketvalue, probs=c(0.25),na.rm=TRUE),
            Q3_marketvalue = quantile(marketvalue,probs=c(0.75),na.rm=TRUE))

# A tibble: 1 × 7
  firms median_tot_assets Q1_total_assets Q3_totassets median_marketvalue
  <int>             <dbl>           <dbl>        <dbl>              <dbl>
1  3415           1635910          290343      7029217           1272466.
# ℹ 2 more variables: Q1_marketvalue <dbl>, Q3_marketvalue <dbl>

At the end of the 2023 fiscal year, the typical size of a US firm in terms of total assets is USD $1,635.9 million (the median of total assets).

At the end of the 2023 fiscal year, the typical size of a US firm in terms of market value is USD $1,272.4 million (the median of market value).

The top 25% biggest US firms have a market value greater or equal to $5,641,2 million (the quartile 3 of market value)

The bottom 25% of US firms have a market value less or equal to $215.4 million (quartile 1 of market value)

2. By industry show the # of firms, median, 25 and 75 percentile of total assets and market capitalization (market value) of the active US firms in 2023

I do a very similar code than the previous one. I add the group_by industry:

by_industry = active2023 %>%
  group_by(naics1) %>%
  summarize( firms=n(),
             median_tot_assets = median(totalassets, na.rm = TRUE),
            Q1_total_assets = quantile(totalassets, probs=c(0.25),na.rm = TRUE),
            Q3_totassets = quantile(totalassets,probs=c(0.75),na.rm=TRUE),
            median_marketvalue = median(marketvalue, na.rm = TRUE),
            Q1_marketvalue = quantile(marketvalue, probs=c(0.25),na.rm=TRUE),
            Q3_marketvalue = quantile(marketvalue,probs=c(0.75),na.rm=TRUE))

We display the results:

by_industry

naics1	firms	median_tot_assets	Q1_total_assets	Q3_totassets	median_marketvalue	Q1_marketvalue	Q3_marketvalue
-	1	13511396.0	13511396.0	13511396	5752457.1	5752457.13	5752457
Agricultura, ganadería, aprovechamiento forestal, pesca y caza	18	1189610.5	189294.9	3310175	718756.4	86801.89	3690190
Comercio al por mayor	74	2211872.0	468339.2	6535675	1474175.1	304554.48	6976933
Comercio al por menor	139	2725250.0	640250.5	7740470	1968102.3	299336.15	10116451
Construcción	48	3356715.0	1192458.8	6415290	3137764.7	864587.54	6511583
Empresas de electricidad, gas y agua	85	14226600.0	3323801.0	43939700	5044677.6	1879869.75	21398960
Industrias manufactureras	1470	501802.5	97954.0	2734500	715509.0	89892.61	4386725
Información en medios masivos	243	1443046.0	445938.0	4836900	2731011.9	391739.57	9229658
Minería, explotación de canteras y extracción de petróleo y gas	113	2239006.0	400904.0	7420031	1247307.5	231780.91	4550560
Otros servicios excepto actividades gubernamentales	17	1014723.0	607377.0	2854552	848578.5	276644.17	2713564
Servicios de alojamiento temporal y de preparación de alimentos y bebidas	70	1752148.5	559121.8	6262595	1488995.1	336943.95	6127148
Servicios de apoyo a los negocios y manejo de residuos y desechos, y servicios de remediación	116	2507046.0	700369.5	8207400	3119600.3	715276.88	9572014
Servicios de esparcimiento culturales y deportivos, y otros servicios recreativos	20	2432789.5	1118437.2	6974660	2095108.4	779143.33	9265038
Servicios de salud y de asistencia social	59	1441699.0	447263.0	4450510	1460913.2	371526.70	5012408
Servicios educativos	13	1007316.0	740685.0	2125213	1497541.4	314845.46	2255292
Servicios financieros y de seguros	661	5894400.0	2149535.0	18274022	1440087.5	352355.71	5526287
Servicios inmobiliarios y de alquiler de bienes muebles e intangibles	66	1631034.5	320103.2	5497144	873429.3	245355.47	2799109
Servicios profesionales, científicos y técnicos	132	679940.0	184932.8	3597883	830948.4	219680.71	6252087
Transportes, correos y almacenamiento	70	4882901.0	1810797.0	19048809	3026746.8	1132554.23	11244748

3. Which are the 10 biggest active firms in terms of total assets according to the end of fiscal year of 2023? Show the name of the firms, the industry and total assets.

top_10_tassets <- active2023 %>%
     arrange(desc(totalassets)) %>%
     top_n(10, totalassets) %>%
     select(firm, Nombre, naics1, totalassets, revenue)

We display the result:

top_10_tassets

firm	Nombre	naics1	totalassets	revenue
FNM_old	Federal National Mortgage Association Fannie Mae	Servicios financieros y de seguros	4325437000	141240000
JPM	Jpmorgan Chase & Co	Servicios financieros y de seguros	3875393000	239425000
FRE_old	Federal Home Loan Mortgage Corp	Servicios financieros y de seguros	3280976000	108050000
BAC	Bank Of America Corp	Servicios financieros y de seguros	3180151000	171912000
C	Citigroup Inc	Servicios financieros y de seguros	2411834000	156820000
WFC	Wells Fargo & Company	Servicios financieros y de seguros	1932468000	115340000
GS	The Goldman Sachs Group, Inc	Servicios financieros y de seguros	1641594000	108418000
MS	Morgan Stanley	Servicios financieros y de seguros	1193693000	96194000
BRK_B	Berkshire Hathaway Inc	Servicios financieros y de seguros	1069978000	364482000
PRU	Prudential Financial, Inc	Servicios financieros y de seguros	721123000	53979000

4. Which are the 10 biggest active firms in terms of market capitalization? Show the name of the firms, the industry and market value as of 2023

We do the same as above, but now for the market capitalization variable:

top_10_mcap <- active2023 %>%
     arrange(desc(marketvalue)) %>%
     head(10) %>%
     select(firm, Nombre, naics1, marketvalue, revenue)

We display the result:

top_10_mcap

firm	Nombre	naics1	marketvalue	revenue
AAPL	Apple Inc	Industrias manufactureras	2676736861	383285000
MSFT	Microsoft Corp	Información en medios masivos	2532080939	211915000
NVDA	Nvidia Corp	Industrias manufactureras	2258900000	60922000
GOOGL	Alphabet Inc	Servicios profesionales, científicos y técnicos	1748360040	307394000
AMZN	Amazon.Com, Inc	Comercio al por menor	1570152607	574785000
META	Meta Platforms, Inc	Servicios profesionales, científicos y técnicos	909628613	134902000
TSLA	Tesla, Inc	Industrias manufactureras	789898387	96773000
LLY	Eli Lilly And Company	Industrias manufactureras	553370175	34124100
AVGO	Broadcom Inc	Industrias manufactureras	522561910	35819000
JPM	Jpmorgan Chase & Co	Servicios financieros y de seguros	491760519	239425000

5. Show and interpret a histogram of the 2023 market capitalization for all ACTIVE firms

hist(active2023$marketvalue)

This is a typical skewed-to-the-right distribution since there are very few very extreme positive values. Due to the very extreme values of market capitalization of firms, the histogram is not well appreciated. We can change the number of bins or breaks of the histogram, so that we can better appreciate the distribution of the variable in the range where most firms are located in terms of market capitalization:

hist(active2023$marketvalue, breaks=40)

We can see that most of the firms have a market capitalization that is less than $US 250 billion.

To better appreciate the distribution of most of the firms excluding those with very extreme values, we can filter the dataset as follows:

firm_selection1 <- active2023 %>%
            filter(marketvalue <=250000000) %>%
            select(firm, marketvalue)
hist(firm_selection1$marketvalue, breaks=40)

We can still see that most of these selected firms have a market cap less than $USD 20 billion. We could keep selecting firms with less market capitalization to better appreciate the distribution of most of the firms:

firm_selection2 <- active2023 %>%
            filter(marketvalue <=50000000) %>%
            select(firm, marketvalue)
hist(firm_selection2$marketvalue, breaks=40)

Even with this selection, we still see that the distribution is very skewed to the right with extreme values of market capitalization.

6. For each industry show and interpret a box plot to better understand the size of active firms in 2023 (use the market value)

library(ggplot2)
ggplot(active2023, aes(x=naics1, y=marketvalue)) +
   geom_boxplot()

Here we have 2 visualization problems: 1) We cannot easily visualize the market capitalization of 50% of the firms by industry, which is the box for each industry (Quartiles fro Q1 to Q3), and 2) The x labels for the industry names cannot be read.

To solve the first problem, since there are too many very extreme values for market capitalization, we can plot the logarithm base 10 of market capitalization.

ggplot(active2023, aes(x=naics1, y=marketvalue)) +
   geom_boxplot() + scale_y_log10()

We can also plot the winsorized variable of market capitalization instead of the logarithm base 10.

In statistics, winsorization is the process to treat extreme values, which are also called outliers. The winsorization process replace the very extreme values at a specific percentile with a value of the variable in that percentile.

Here we apply a winsorization at the 95 percentiles for big values of market capitalization:

library(statar)
active2023$marketvalue_w <- winsorize(active2023$marketvalue, probs = c(0,0.95))

We created a winsorized market capitalization for the purpose of visualization of the variable. The process of winsorization is also used for regression analysis since outliers can bias the estimation of the regression results. We will learn that later.

For the second problem, we can add an option to the ggplot to indicate that we want to display the x labes in vertical wah (90 degrees of inclination):

ggplot(active2023, aes(x=naics1, y=marketvalue_w)) +
   geom_boxplot() +
   theme(axis.text.x = element_text(angle = 25, hjust = 1))

6 Challenge 5. Review of Logistic Regression models

1. Using the dataus2024 dataset you have to run a logistic regression model to examine whether the operational earnings per share deflated by price (oepsp) is related to the probability that the firm ROA is greater than the corresponding industry ROA mean.

HINT: you can use the function glm specifying family=“binomial”:

logit_model1 <- glm(ROA_above ~ oepspw ,data = uspanel, family = “binomial”,na.action = na.omit)

Now I run a logistic model to examine wether operating earnings per share is related to the probability that a firm has a ROA higher than the mean of the ROA of its industry.

The dependent variable is the binary variable ROA_above that will be 1 if the firm have a ROA greater than the ROA mean of its industry. Then, we need to calculate the ROA mean for each industry, and then calculate this binary variable:

uspanel <- uspanel %>%
   group_by(naics1, q) %>% 
   mutate(ROAmean = mean(roa, na.rm=TRUE),
          ROA_above = ifelse(roa>ROAmean,1,0)) %>%
  ungroup()

logit_model1 <- glm(ROA_above ~ oepsp ,data = uspanel, family = "binomial",na.action = na.omit)
smodel1 <- summary(logit_model1)
smodel1


Call:
glm(formula = ROA_above ~ oepsp, family = "binomial", data = uspanel, 
    na.action = na.omit)

Coefficients:
               Estimate  Std. Error z value            Pr(>|z|)    
(Intercept) 0.822428395 0.004955761 165.954 <0.0000000000000002 ***
oepsp       0.000003163 0.000004186   0.756                0.45    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 236261  on 192005  degrees of freedom
Residual deviance: 236257  on 192004  degrees of freedom
  (365175 observations deleted due to missingness)
AIC: 236261

Number of Fisher Scoring iterations: 8

We see that the beta coefficient of oepsp is very close to zero and its pvalue is much greater than 0.05. This means that there is no significant evidence to say that operating earnings per share is related to the probability that a firm has a ROA greater that the ROA of its industry.

Before making conclusion that oepsp has no relationship with the probability of a firm to have ROA higher than the industry average, we have to check some data issues that might affect our model:

Extreme values of independent variables. In this case, it is very likely that oepsp has very extreme values. We will learn about a method called winsorization to treat extreme values.
Treatment of income-statement variables in this dataset. Remember that these variables have Year-to-date amounts (cumulative amounts). This might cause a problem for the calculation or ROA and epsp since the numerator of these variables is netincome and operating earnings that increase each quarter within each year. We can try selecting only rows with the last fiscal quarter of each year (fiscalmonth==12)

We will learn more details about the logistic regression model in the next workshop.

2. Learn about the logistic regression and interpret the result of this regression model (interpret the coefficient of this model and its significance.

7 Challenge 6. Datacamp Course

Go to the course: Machine Learning with caret in R and DO CHAPTER 2: Classification models: fitting them and evaluating their performance