Workshop 1 Solution, Financial Modeling and Programming

Author

Alberto Dorantes, Ph.D.

Published

October 25, 2024

Abstract

This is an INDIVIDUAL workshop. In this workshop we learn about logistic regression applied to fundamental analysis in Finance. In addition, we practice data management programming skills using a big panel-dataset of historical financial statement variables.

1 General directions for this Workshop

You will work in RStudio. It is strongly recommended to have the latest version of R and RStudio. Once you are in RStudio, do the following.

Create an R Notebook document (File -> New File -> R Notebook), where you have to write whatever is asked in this workshop. More specifically, you have to:

Replicate all the R Code along with its output.
You have to do whatever is asked in the workshop. It can be: a) Responses to specific questions and/or do an exercise/challenge.

Any QUESTION or any INTERPRETATION you need to do will be written in CAPITAL LETTERS. For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.

It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook. Your own workshop/notebook will be very helpful for your further study.

You have to keep saving your .Rmd file, and ONLY SUBMIT the .html version of your .Rmd file. Pay attention in class to know how to generate an html file from your .Rmd.

2 Set up the name of your R Notebook for this workshop

Setup title and name of your Workshop

Once you have created a new R Notebook, you will see a sample R Notebook document. You must DELETE all the lines of this sample document except the first lines related to title and output. As title, write the workshop # and course, and add a new line with your name. You have to end up with something like:

title: “Workshop 1, Financial Modeling and Programming”

author: YourName

output: html_notebook

Now you are ready to continue writing your first R Notebook.

You can start writing your own notes/explanations we cover in this workshop. When you need to write lines of R Code, you need to click Insert at the top of the RStudio Window and select R. Immediately a chunk of R code will be set up to start writing your R code. You can execute this piece of code by clicking in the play button (green triangle).

Note that you can open and edit several R Notebooks, which will appear as tabs at the top of the window. You can visualize the output (results) of your code in the console, located at the bottom of the window. Also, the created variables are listed in the environment, located in the top-right pane. The bottom-right pane shows the files, plots, installed packages, help, and viewer tabs.

Save your R Notebook file as W1-YourName.Rmd. Go to the File menu and select Save As.

To generate the .html file, you have knit your R Notebook. Pay attention how to do this in class.

3 Introduction

In this workshop we practice data management with financial historical data, and review the logit regression model and its application to fundamental analysis.

We will work with a real dataset of all historical financial variables of ALL US public firms that belong to the NYSE and the NASDAQ exchanges.

We will use a logit model to examine whether some financial ratios are related to the probability that future stock return (1 year later) is higher than the median return of the corresponding industry (1 year later).

We will learn basic programming skills for the required data management and data modeling.

You have to work with 2 datasets:

firmsus2024.csv: List of all US public firms with general information of each firm
dataus2024.csv : Panel data with historical financial quarterly data for all US public firms.

You have to download these 2 files from a web page and save in the directory where you have your workshop.

The first dataset (dataus2024) contains the historical financial data of the firms, while the second dataset (firmsus2024) is a catalog of all firms along with the corresponding industry type and status (active or cancelled).

The dataus2024 dataset has a panel-data (also called long format) structure. Each row has financial information for one US firm and 1 period (a quarter). All $ amounts are in thousands (’1000s). Here is a data dictionary of the columns:

Data dictionary of historical quarterly financial data.
Variable	Description
firm	Unique code of the company (also called ticker)
q	Quarter date
fiscalmonth	Month of the year when the firm closes a fiscal year
revenue	Total sales of the firm from the first fiscal quarter to the current quarter
cogs	Cost of good sold - variable costs of the products sold - from the first fiscal quarter to the current quarter
sgae	Sales and general administrative expenses - from the first fiscal quarter to the current quarter
otherincome	Other operational income/expenses that are not directly from the core operations of the firm - from the first fiscal quarter to the current quarter
extraordinaryitems	Extra income/expenses not related to regular operations - from the first fiscal quarter to the current quarter
finexp	Financial expenses - interest expenses paid (generated from loans) - from the first fiscal quarter to the current quarter
incometax	Income tax from the first fiscal quarter to the current quarter
totalassets	Total assets of the firm at the end of the quarter
currentassets	Current assets of the firm at the end of the quarter
totalliabilities	Total liabilities of the firm at the end of the quarter
currentliabilities	Current liabilities of the firm at the end of the quarter
longdebt	Balance of long-term financial debt (loans to pay longer than 1 year)
adjprice	Stock adjusted price at the end of the quarter; adjusted for stock splits and dividend payments; used to calculate stock returns
originalprice	Historical stock price (not adjusted); used to calculate historical market value
sharesoutstanding	Historical number of shares available in the market
fixedassets	Fixed assets value at the end of the quarter
year	Calendar year
yearf	Fiscal year - this depends on when the firm ends its fiscal year; if fiscalmonth=12 in the quarter 3, then the fiscal year will start in Q4 of a year and ends in the Q3 of the following year

Each row of this dataset has quarterly financial data of one firm in one quarter. All firms have quarters from Q1 2000 to Q2 2023. Not all firms have existed since 2000, so if the first quarters are empty that means that the firm did not exist in the US financial market in those quarters. Then, it is possible to know when each firm went public to issue shares in the financial market: the first quarter with some non-empty data.

Each firm has defined the month of the year used to close a fiscal year. For example, Apple closes the fiscal year at the end of Quarter 3 (end of September) of any year. Then, for Apple, in the Q3 of 2022, there will be a 12 for the fiscalmonth variable. In this case, Apple starts its fiscal year in the Q4 of each year and ends in the Q3 of the following year. Most of the firms (about 80%) close fiscal year in December, so these firms will have a 12 in the Q4 of each year.

The variables related to sales and expenses are cumulative for each fiscal year. For example, Apple sold about $117 billion in the last calendar quarter (Q4) of 2022, but this is the first fiscal quarter for Apple. For Q1 (calendar) 2023 (which is the 2nd fiscal quarter), Apple has about $212 billion in the revenue variable, meaning that considering fiscal quarter 1 and 2, Apple has sold $212 billion. For Q2 2023 Apple has about $293 billion, meaning that the cumulative revenue of fiscal Q1, Q2 and Q3 is about $293 billion. Then, if you select rows with fiscalmonth=12, then you will be selecting those quarters with annual financial information for each firm!

Earnings before interest and Taxes (ebit) and Net Income (netincome) must be calculated as:

ebit = revenue - cogs - sgae

netincome = ebit + otherincome + extraordinaryitems - finexp - incometax

The firmsus2023.csv is a catalog of all active and cancelled US firms:

Variable	Description
firm	Unique code of the company (also called ticker)
name	Name of the firm
status	Status of the firm: active or cancelled
partind	Percent participation in the S&P500 market index
naics1	North American Industry Classification Code - Level 1
naics2	North American Industry Classification Code - Level 2
SectorEconomatica	Economatica Industry classification

4 Review of logit regression and its applications in Finance

The logit model is one type of non-linear regression model where the dependent variable is binary. The logistic or logit model is used to examine the relationship between one or more quantitative variables and the probability of an event happening (1=the event happens; 0 otherwise). For example, a bank that gives loans to businesses might be very interested in knowing which are the factors/variables/characteristics of firms that are more related to loan defaults. If the bank understands these factors, then it can improve its decisions about which firms deserve a loan, and minimize the losses due to loan defaults.

In this workshop, we define the event to be whether a firm in a specific quarter has higher stock return compared to median returns of the firms within its industry. If the stock return is higher than the median return, then we codify the binary variable equal to 1; 0 otherwise.

Then, in this case, the dependent variable of the regression is the binary variable with 1 if the stock outperforms the median of its industry and 0 otherwise. The independent or explanatory variables can be any financial indicator/ratio/variable that we believe is related to the likelihood of a stock to beat the market in the near future.

We can define this logistic model using the following mathematical function. Imagine that Y is the binary dependent variable, then the probability that the event happens (Event=1) can be defined as:

Prob(Event=1)=f(X_{1},X_{2},...,X_{n})

The binary variable Event can be either 1 or 0, but the probability of Event=1 is a continuous value from 0 to 1. The function is a non-linear function defined as follows:

Prob(Event=1)=\frac{1}{1+e^{-\left(b_{0}+b_{1}X_{1}+b_{2}X_{2}+...+b_{n}X_{n}\right)}}

As we can see, the argument of the exponential function is actually a traditional regression equation. We can re-express this equation as follows:

Y=b_{0}+b_{1}X_{1}+b_{2}X_{2}+...+b_{n}X_{n} Now we use Y in the original non-linear function:

Prob(Event=1)=\frac{1}{1+e^{-Y}}

This is a non-linear function since the value of the function does not move in a linear way with a change in the value of one independent variable X.

Let’s work with an example of a model with only 1 independent variable X_1. Imagine that X_1 is the variable earnings per share (eps) of a firm, and the event is that the firm beats the market. Then, let’s do a simple example with specific values for b_0, b_1, and a range of values for eps from -1 to 1 jumping by 0.1:

# The seq function creates a numeric vector. We specify the fist, last and the jumps:
eps=seq(from=-1,to=1,by=0.1)
eps

 [1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1  0.0  0.1  0.2  0.3  0.4
[16]  0.5  0.6  0.7  0.8  0.9  1.0

# This vector has 21 values for x1 including the zero

# I define b0=-0.5 and b1=10:
b0=-0.5
b1=10

# I define a temporal variable Y to be the regression equation:
Y = b0 + b1*eps
# Since eps is a vector, then R performs a row-wise calculation using the equation for all values of eps

# Now I create a vector with the values of the function according to the equation of the probability:
prob = 1 / (1 + exp(-(Y)) )
# I display the probability values of the function for all values of eps:
prob

 [1] 0.00002753569 0.00007484623 0.00020342698 0.00055277864 0.00150118226
 [6] 0.00407013772 0.01098694263 0.02931223075 0.07585818002 0.18242552381
[11] 0.37754066880 0.62245933120 0.81757447619 0.92414181998 0.97068776925
[16] 0.98901305737 0.99592986228 0.99849881774 0.99944722136 0.99979657302
[21] 0.99992515377

# Finally, I plot the function values for all values of x1:
plot(x=eps,y=prob,type="line")

Here we can see that function is not linear with changes in eps. There is a specific range of values for eps close to 0 when the probability that the firm beats the market increases very fast up to a value of about 0.3 where the probability grows very slow with any more increase in eps.

The interpretations of the magnitude of the coefficients b_0 and b_1 in logistic regression is not quite the same as the case of multiple regression. However, the interpretations of the sign of the coefficient (positive or negative) and the level of significance (pvalue) are the same as in the case of multiple regression model. What we can say up to know is that if b_1 is positive and significant (its p-value<0.05), then it means that the variable, in this case, eps is significantly and positively related to the probability that a firm return outperform its industry median return.

Before going to the interpretation of the magnitude of a coefficient, here is a quick explanation of how the logistic regression works and how it is estimated in any specialized software (such as R).

Let’s continue with the same event, which is that the firm return beats its industry median (in other words, that the firm return is higher than the median return of its industry). Then:

p = probability that the firm beats its industry (Event=1); or that the event happens.

(1-p) = probability that the firm DOES NOT beat its industry (Event=0); or that the event does not happen.

To have a dependent variable that can get any numeric value from a negative value to a positive value, we can do the following mathematical transformation with these probabilities:

Y=log\left(\frac{p}{1-p}\right)

The \left(\frac{p}{1-p}\right) is called the odds ratio:

ODDSRATIO=\left(\frac{p}{1-p}\right)

The odds ratio is the ratio of the probability of the event happening to the probability of the event NOT happening. Since p can have a value from 0 to 1, then the possible values of ODDSRATIO can be from 0 (when p=0) to infinity (when p=1). Since we want a variable that can have values from any negative to any positive value, then we can apply the logarithmic function, and then the range of this log will be from any negative value to any positive value.

Now that we have a transformed variable Y (the log of ODDSRATIO) that uses the probability p, then we can use this variable as the dependent variable for our regression model:

Y=log\left(\frac{p}{1-p}\right)=b_{0}+b_{1}X_{1}

This mathematical trick help us to use a linear model to model a non-linear relationship!

Then, with this transformation we can estimate a linear regression model (don’t worry, R estimate it), so the coefficients b_0 and b_1 values define the logarithm of the odd ratio!, not the actual probability p of the event happening!

How can we interpret the magnitude of the beta coefficients of this regression? Let’s do a mathematical trick from the previous equation. We can apply the exponential function to both sides of the equation:

e^{Y}=e^{log\left(\frac{p}{1-p}\right)}=e^{\left(b_{0}+b_{1}X_{1}\right)}=\left(\frac{p}{1-p}\right)=ODDSRATIO

Following the rule of exponents, we can express this equation as:

e^{Y}=e^{b_{0}}e^{b_{1}X_{1}}=ODDSRATIO

Let’s see what happens with ODDSRATIO if X_1 increases in 1 unit:

e^{b_{0}}e^{b_{1}(X_{1}+1)}=e^{b_{0}}e^{b_{1}X_{1}}e^{b_{1}}=ODDSRATIO*(e^{b_{1}})

Then, if X_1 increases in one unit, then the ODDSRATIO will be equal to ODDSRATIO times e^{b_1}. Then, e^{b_1} will be the factor that indicates how many times the ODDSRATIO changes with a +1-unit change in X_1.

Then:

If e^{b_1} = 1, then the ODDSRATIO does not change, meaning that there is no relationship between the variable X_1 and the probability of the event.
If e^{b_1} > 1, then the ODDSRATIO will grow by this factor, meaning that there is a positive relationship between X_1 and the probability of the event.
If e^{b_1} < 1, then the ODDSRATIO will decrease by this factor, meaning that there is a negative relationship between X_1 and the probability of the event.

Then, when we see the output of a logistic regression, we need to apply the exponential function to the coefficients to provide a meaningful interpretation of the magnitude of the coefficients in the model.

If we want to estimate the probability p for any value of X_1, we just need to do some algebraic manipulations to the previous equation:

Y=log\left(\frac{p}{1-p}\right)=b_{0}+b_{1}X_{1}

To get p from this equation, we can apply the exponential function to both sides:

e^{Y}=\left(\frac{p}{1-p}\right)

Leaving p alone, we multiply both sides times (1-p):

e^{Y}\left(1-p\right)=p

We continue playing with the terms to leave p alone:

e^{Y}-pe^{Y}=p

e^{Y}=p+pe^{Y}

e^{Y}=p(1+e^{Y})

p=\frac{e^{Y}}{\left(1+e^{Y}\right)}

Dividing the numerator and the denominator by e^Y ;

p=\frac{1}{\left(\frac{1}{e^{Y}}+1\right)}=\frac{1}{\left(e^{-Y}+1\right)}

This is the same mathematical equation we had used above to illustrate the non-linear relationship between the explanatory variables and the probability of an event happening!

Fortunately, R performs all these calculations automatically when you run the predict function! So, you do not have to memorize these steps; I just tried to be curious to understand how the logistic model estimates predicted probabilities.

5 Data collection

The dataset we will use for this workshop and for the final project is located in a public we site.

This dataset has real historical information of all US public firms listed in NASDAQ and NYSE. It has data from Q1 2000 to Q2 2023. You can download the dataset as follows.

You first need to install the packages ggplot2 and dplyr. You can do so using the right-hand side window of RStudio in the “Package” tab.

Before running the rest of the chunks, you need to save your .Rmd in your computer.

# To avoid scientific notation:
options(scipen=999)

#Download the csv files and import them into your R environment:
download.file("https://www.apradie.com/datos/dataus2024.csv","dataus2024.csv")
uspanel = read.csv("dataus2024.csv")
download.file("https://www.apradie.com/datos/firmsus2024.csv","firmsus2024.csv")
usfirms <- read.csv("firmsus2024.csv")

6 CHALLENGE 1 - CALCULATE QUARTERLY REVENUE AND EBIT FROM YTD AMOUNTS

All income-statement variables in the uspanel dataset are Year-to-Date (YTD) amounts starting in the first fiscal quarter.

For example, if you look at revenue of a firm in the fiscal Q3 (fiscalmonth==9), that amount will be the total revenue of the firm starting in the first fiscal Q1 until the fiscal Q3.

Remember that in the US public firms can end a fiscal year in any of the 4 calendar quarters.

Using the uspanel dataset calculate EBIT as: revenue - cogs - sgae.

Remember that the YTD amount is a cumulative amount from the first fiscal quarter up to any fiscal quarter. Then, what might be the algorithm/code to create the quarter-amount for revenue and EBIT ?

Use dplyr package to write your R code.

CHALLENGE 1 - WRITE YOUR PSEUDO-ALGORITHM WITH YOUR OWN WORDS, AND WRITE THE R CODE TO CREATE THE QUARTERLY REVENUE AND EBIT AMOUNT IN THE USPANEL DATASET

7 CHALLENGE 1 SOLUTION

A pseudo-algorithm is basically a set of steps that the computer has to execute in order to take the input and end up with the required output of the problem.

Then, before writing my pseudo-algorithm, I need to clearly state what is the input I can use for the problem, and what is the output:

INPUT: the uspanel dataset, which has quarterly financial data for all US firms from 2000 Q1 to 2024 Q2. The granularity of the data is quarterly and the dataset has more than 500,000 rows.

OUTPUT: The same uspanel dataset, but with the following new columns:

quarterly_revenue: revenue of the firm for the specific quarter (not cumulative YTD)
quarterly_ebit: ebit of the firm for the specific quarter (not cumulative YTD)

Step 1. Using the uspanel (firm-quarterly data), add the column ebit:

 - ebit = revenue - cogs - sgae

Since revenue, cogs and sgae have cumulative fiscal year-to-date amounts, then ebit will also have cumulative fiscal year-to-date amounts

Step 2. Create the quarterly revenue and quarterly ebit from the cumulative fiscal year-to-date revenue and ebit:

- For all quarters of each firm do the following:

   - If the fiscalq = 1, then:
   
       quarterly_revenue = revenue;
       
       quarterly_ebit = ebit:
       
   - If fiscalq>1, then:
   
      quarterly_revenue =  revenue - its previous revenue value. 
      
      quarterly_ebit = ebit -  its previous ebit value
   
I do this since revenue and ebit values are cumulative fiscal YTD values for each firm-fiscal year.

I will use tidyverse, which actually uses dplyr and have other powerful functions.

The tidyverse package uses a short connector |> instead of the dplyr connector %>%

Step 1: Create the ebit variable

library(tidyverse)
uspanel <- uspanel |>
  mutate(ebit = revenue - cogs - sgae)

The values of ebitp looks ok. For the fist fiscal quarters (when fiscalmonth==3), YTDebit is equal to ebitp. For the rest of the quarters, ebitp is equal to YTDebit minus its previous YTDebit!

Step 2: Create the quarterly amounts for revenue and ebit:

uspanel <- uspanel |>
  # I make sure that the dataset is sorted by firm-q, so the information is in chronological order for each firm:
  arrange(firm, q) |>
  # I group by firm to make sure that when I use the lag value, this value will be a value of the same firm 
  group_by(firm) |>
  # With mutate I create the variables:
  mutate(quarterly_revenue = ifelse(fiscalq==1,revenue, revenue - lag(revenue)),
         quarterly_ebit = ifelse(fiscalq==1,ebit,ebit - lag(ebit))) |>
  # The lag function gets the previous value, which is the value of the previous quarter
  ungroup()

I can see the result of one firm:

uspanel %>% select(firm,q,fiscalmonth,yearf,ebit,revenue, quarterly_ebit, quarterly_revenue) %>% 
  filter(firm=="AAPL", yearf>=2020)

# A tibble: 19 × 8
   firm  q     fiscalmonth yearf   ebit revenue quarterly_ebit quarterly_revenue
   <chr> <chr>       <int> <int>  <dbl>   <dbl>          <dbl>             <dbl>
 1 AAPL  2019…           3  2020 2.56e7  9.18e7       25569000          91819000
 2 AAPL  2020…           6  2020 3.84e7  1.50e8       12853000          58313000
 3 AAPL  2020…           9  2020 5.15e7  2.10e8       13091000          59685000
 4 AAPL  2020…          12  2020 6.63e7  2.75e8       14775000          64698000
 5 AAPL  2020…           3  2021 3.35e7  1.11e8       33534000         111439000
 6 AAPL  2021…           6  2021 6.10e7  2.01e8       27503000          89584000
 7 AAPL  2021…           9  2021 8.52e7  2.82e8       24126000          81434000
 8 AAPL  2021…          12  2021 1.09e8  3.66e8       23786000          83360000
 9 AAPL  2021…           3  2022 4.15e7  1.24e8       41488000         123945000
10 AAPL  2022…           6  2022 7.15e7  2.21e8       29979000          97278000
11 AAPL  2022…           9  2022 9.45e7  3.04e8       23076000          82959000
12 AAPL  2022…          12  2022 1.19e8  3.94e8       24894000          90146000
13 AAPL  2022…           3  2023 3.60e7  1.17e8       36016000         117154000
14 AAPL  2023…           6  2023 6.43e7  2.12e8       28318000          94836000
15 AAPL  2023…           9  2023 8.73e7  2.94e8       22998000          81797000
16 AAPL  2023…          12  2023 1.14e8  3.83e8       26969000          89498000
17 AAPL  2023…           3  2023 4.04e7  1.20e8       40373000         119575000
18 AAPL  2024…           6  2024 6.83e7  2.10e8       27900000          90753000
19 AAPL  2024…           9  2024 9.36e7  2.96e8       25352000          85777000

8 CHALLENGE 2 - GROWTH OF THE US FINANCIAL MARKET

Using the panel dataset, you have to understand how the US financial market has been growing over the years starting from the oldest year to the last complete fiscal year (2023).

You have to examine how much the US has grown in terms of:

Total market value of all firms by calendar year
Total revenue of all firms by calendar year
Total Net Income of all firms by calendar year

You have to create a table of these 3 totals by year (years as rows)

Which graph(s) would you design to better understand the US financial growth in terms of these variables?

9 CHALLENGE 2 SOLUTION

My pseudo-algorithm:

The input and output of the problem are:

INPUT: the uspanel dataset, which has quarterly financial data for all US firms from 2000 Q1 to 2024 Q2. The granularity of the data is quarterly and the dataset has more than 500,000 rows.

OUTPUT: A simple table total market value, total revenue and total net income by year (from 2000 to 2024, which are 15 rows).

I visualize the following steps to go from the input to the output:

Step 1. Using the uspanel (firm-quarterly data), add columns for: - Market value = originalprice * sharesoutstanding - Netincome = ebit + otherincome + extraordinaryitems - finexp - incometax

It is assumed that ebit was already created in the uspanel

Step 2. Create the quarterly net income from the cumulative fiscal year-to-date netincome. - For each firm-quarter do the following: - If the fiscalq = 1, then quarterly_netincome = netincome; - If fiscalq>1, then quarterly_netincome = its previous netincome value.

I do this since netincome variable is the cumulative fiscal YTD value for each firm-fiscal year.

Step 3: Generate a firm-annual dataset using the uspanel dataset for the annual revenue, annual net income and the end value of market value:

For each firm, do the following

For each calendar year: 
    
  a) **sum** the quarterly revenue and quarterly net income to get annual revenue and annual net income for the corresponding firm-year
      
  b) Get the **last** value of the marketvalues within the year (4 quarters) since marketvalue is a cumulative value from the first quarter of the firm.

Step 4. Generate an annual (by calendar year) table/dataset using the previous firm-annual dataset by aggregating marketvalue, annual revenue and annual netincome for all firms:

Using the firm-annual dataset:

For each year:

 **sum** the annual revenue, annual netincome and marketvalue for all firms

Here I sum marketvalue (instead of getting the last value) since the firm-year dataset already has the last market value of the year for each firm. Then, when grouping by year, each group will have different firms, so I can add the market value variable for each group and get the total market value by calendar year.

Step 1 and 2:

uspanel <- uspanel |>
  arrange(firm, q) |>
  group_by(firm) |>
  # With mutate I create the marketvalue, netincome and the quarterly_netincome variables:
  mutate(marketvalue = originalprice * sharesoutstanding,
         netincome = ebit + otherincome + extraordinaryitems- finexp - incometax,
         quarterly_netincome = ifelse(fiscalq==1,netincome, netincome - lag(netincome))) |>
  ungroup()

Step 3: Generate the firm-annual dataset from the firm-quarter dataset:

I create a pivot table dataset with group_by(firm, year) followed by summarize function.

Since I am grouping by firm-year, then each group will have 4 rows (1 for each quarter), so after the summarize, I will end up with a collapsed dataset with about 1/4th of rows compared with the original dataset.

annualdata = uspanel |>
  # I group by firm-year so firm-year will have 4 quarter rows:   
  group_by(firm, year) |>
  # For each group of 4 quarters I will get the market value of the last quarter and the sum of the quarterly amounts for revenue and netincome:
  summarize(marketvalue = last(marketvalue),
         annualrevenue =sum(quarterly_revenue,na.rm=TRUE),
         annualnetincome = sum(quarterly_netincome),na.rm=TRUE) |> 
  ungroup()

Step 4: Generate the final table by year:

Using the firm-annual dataset I now collapse it by year to get the final table by year:

byyear = annualdata |>
  group_by(year) |> 
  summarize(totrevenue = sum(annualrevenue, na.rm = TRUE),
         totnetincome = sum(annualnetincome, na.rm = TRUE),
         totmarketvalue = sum(marketvalue, na.rm = TRUE) )

I display the final table by year:

byyear

year	totrevenue	totnetincome	totmarketvalue
2000	6793219048	271512216	12887438564
2001	7857498394	58696889	11721506593
2002	7537241559	-27488912	9183458050
2003	8249342757	379115228	12026927136
2004	9187225931	442639842	13401623209
2005	10115045544	524871461	13929041684
2006	10960565162	653435612	15651916302
2007	11692289192	612309105	15994787131
2008	11896518274	312788764	9658046578
2009	10652737563	487228562	12252709054
2010	11760956131	714450311	13946915062
2011	13244436751	814665878	16607841503
2012	14062967050	766969754	16362052518
2013	14425973286	912915076	21510692981
2014	15007864521	888074334	23682408803
2015	14475156792	635118172	23109383508
2016	14514856188	776150052	24709433949
2017	15223333659	931857662	29029669613
2018	16325536913	1054185920	26555841939
2019	16555590177	963512380	33416260020
2020	15860891575	577873889	40913846455
2021	18691986594	1580489460	51148512697
2022	20932353092	1425675322	39304594341
2023	21611769761	1489078974	48267954023
2024	9186120741	637252127	54028291093

In order to be able to compare the growth rate for each variable, I create a growth index starting in 1 in 2000 for each total.

I just divide each yearly amount by its first-year value:

growthfactors = byyear |> 
   mutate (irevenue = totrevenue / first(totrevenue),
           inetincome = totnetincome / first(totnetincome),
           imarketvalue = totmarketvalue / first(totmarketvalue))
          # The first function gets the first value of the byyear dataset

I display the growth factors for each variable by year:

growthfactors

year	totrevenue	totnetincome	totmarketvalue	irevenue	inetincome	imarketvalue
2000	6793219048	271512216	12887438564	1.000000	1.0000000	1.0000000
2001	7857498394	58696889	11721506593	1.156668	0.2161851	0.9095296
2002	7537241559	-27488912	9183458050	1.109524	-0.1012437	0.7125899
2003	8249342757	379115228	12026927136	1.214350	1.3963100	0.9332287
2004	9187225931	442639842	13401623209	1.352411	1.6302760	1.0398981
2005	10115045544	524871461	13929041684	1.488992	1.9331412	1.0808231
2006	10960565162	653435612	15651916302	1.613457	2.4066527	1.2145095
2007	11692289192	612309105	15994787131	1.721171	2.2551807	1.2411145
2008	11896518274	312788764	9658046578	1.751234	1.1520246	0.7494155
2009	10652737563	487228562	12252709054	1.568143	1.7944996	0.9507482
2010	11760956131	714450311	13946915062	1.731279	2.6313745	1.0822100
2011	13244436751	814665878	16607841503	1.949655	3.0004760	1.2886844
2012	14062967050	766969754	16362052518	2.070148	2.8248075	1.2696125
2013	14425973286	912915076	21510692981	2.123584	3.3623352	1.6691209
2014	15007864521	888074334	23682408803	2.209242	3.2708449	1.8376350
2015	14475156792	635118172	23109383508	2.130824	2.3391882	1.7931712
2016	14514856188	776150052	24709433949	2.136668	2.8586193	1.9173270
2017	15223333659	931857662	29029669613	2.240960	3.4321022	2.2525554
2018	16325536913	1054185920	26555841939	2.403211	3.8826464	2.0605989
2019	16555590177	963512380	33416260020	2.437076	3.5486889	2.5929326
2020	15860891575	577873889	40913846455	2.334812	2.1283532	3.1747074
2021	18691986594	1580489460	51148512697	2.751565	5.8210621	3.9688657
2022	20932353092	1425675322	39304594341	3.081360	5.2508699	3.0498376
2023	21611769761	1489078974	48267954023	3.181374	5.4843903	3.7453489
2024	9186120741	637252127	54028291093	1.352249	2.3470477	4.1923219

Since 2024 is not a complete year in the dataset, I select years before 2024 to compare only complete years of data:

growthfactors = growthfactors |>
  filter(year<2024)

I plot the 3 growth indexes to see which variable has grown faster over the years:

plot(x=growthfactors$year,y = growthfactors$irevenue,type="l",col="blue",ylim = range(growthfactors$imarketvalue,growthfactors$inetincome,growthfactors$irevenue))
lines(x=growthfactors$year,y = growthfactors$inetincome,col="red")
lines(x=growthfactors$year,y = growthfactors$imarketvalue,col="green")

points(growthfactors$year, growthfactors$irevenue, col = "blue", pch = 16)
points(growthfactors$year, growthfactors$inetincome, col = "red", pch = 16)
points(growthfactors$year, growthfactors$imarketvalue, col = "green", pch = 16)

legend("topleft", legend = c("Revenue", "Netincome", "Market Value"),
       col = c("blue", "red", "green"), lty = 1, lwd = 2, pch = 16)

10 CHALLENGE 3 - RUN A LOGIT MODEL TO PREDICT PROBABILITY OF BEATING THE INDUSTRY MEDIAN RETURN

You have to run a logit model to examine whether Operational Earnings per share divided by price (EBITQ/sharesoutstanding) / originalprice explains the probability that a firm has a higher annual return than its corresponding median industry return.

You have to create the following variables for the model:

oepsp = Operational Earnings per Share divided by price = (EBITQ/sharesoutstanding) / originalprice
oepspw = Winsorized version of oepsp (use the winsorize function from the statar package to do this)
r = Annual stock return (log return); remember that you have quarterly data

You can use dplyr package for these variables. Remember that EBITQ is the Quarterly EBIT you calculated in Challenge 1.

11 CHALLENGE 3 SOLUTION

I create the variables of the model in the same dataset.

I start with calculating the quarterly EBIT. I use the same logic I used to calculate quarterly revenue and quarterly net income:

uspanel <- uspanel |> 
  mutate(ebit = revenue - cogs - sgae, 
         ebitq = ifelse(fiscalq==1,ebit,ebit - lag(ebit)))

Now I create the oepsp and the return variables:

uspanel <- uspanel |>
  # I sort by firm-quarter with arrange since I will use the lag function later:
  arrange(firm,q) |>
  # I group by firm since I will calculate log returns for each firm:
  group_by(firm) |>
  # I create the new columns for the model: 
  mutate(oepsp = (ebitq / sharesoutstanding) / originalprice,
         r = log(adjprice) - dplyr::lag(log(adjprice),4)
         # dplyr::lag means that I want to use the lag function from the dplyr package; what happens is that the lag function is in other packages such as stats package
  ) |>
  ungroup()

I now calculate the winsorized version of earning per share deflated by price.

Remember that winsorization is used to avoid very extreme values of independent variables in any model since extreme values can end up with unreliable estimations of coefficients and their correponsing standard errors in any econometric model.

I use the statar R package to do winsorization:

library(statar)
# histogram of original oepsp:
hist(uspanel$oepsp)

uspanel$oepspw= winsorize(uspanel$oepsp)
# histogram of the winsorized oepsp:
hist(uspanel$oepspw)

The winsorize function automatically decides the percentile levels to do winsorization. We can indicate the specific levels of percentiles to winsorize low values and high values.

Let’s winsorize with a small percentile for the low and high values:

uspanel$oepspw= winsorize(uspanel$oepsp, probs=c(0.005, 0.999))
# histogram of the winsorized oepsp:
hist(uspanel$oepspw, breaks=50)

The dependent variable of the logit model is whether a stock has quarterly returns higher than the median return of its correponding industry.

The uspanel dataset does NOT have the Industry variable. I can get the industry for each firm from the usfirms dataset, which is a catalog of all firms located in the uspnale. I can use the merge or the left_only functions:

# I create a catalog of firms with the firm code and the industry, but I rename the empresa column to have the same column name as in the uspanel:
usfirms1 = usfirms |>
  rename(firm=empresa) |>  
  select(firm, naics1, naics2)

uspanel = left_join(uspanel,usfirms1, by="firm")

I calculate the binary variable for the logit model. This variable will have the following values:

=0 if the stock return is less or equal than the median return of its industry

=1 if the stock return is greater than the median return of its industry

Then, I need to calculate a new column with the median return of the industry and the binary variable.

There are 2 industry classifications in the dataset: NAICS1 and NAICS2. NAICS1 is the first-level NAICS (North American Industry Classification System), and has few industry categories. NAICS2 is a more detailed classification of the industries based on NAICS1.

I will use NAICS2 for this challenge since NAICS1 has may firms by category.

uspanel = uspanel |>
  #group_by quarter-industry: 
  group_by(q,naics2) |>
  mutate(industry_return = median(r, na.rm=TRUE),
         beatindustry= ifelse(r>industry_return,1,0)) |>
  ungroup()

I can create a column for the binary variable 1 year in the future, so I could run a logistic model to examine whether operating earnings per share (oepsw) can be a predictor whether the firm will have annual returns greater than its industry 1 quarter in the future:

uspanel = uspanel |>
  group_by(firm) |>
  mutate(F1beatindustry = lead(beatindustry,1))

I run the logit model to examine whether oepspw is related to the probability of a stock to beat its industry return

logitm1 <- glm(F1beatindustry ~ oepspw, data= uspanel, family="binomial",na.action=na.omit)
summary(logitm1)


Call:
glm(formula = F1beatindustry ~ oepspw, family = "binomial", data = uspanel, 
    na.action = na.omit)

Coefficients:
             Estimate Std. Error z value            Pr(>|z|)    
(Intercept) -0.083582   0.004584  -18.23 <0.0000000000000002 ***
oepspw       3.355209   0.060977   55.02 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 269935  on 195073  degrees of freedom
Residual deviance: 265124  on 195072  degrees of freedom
  (362107 observations deleted due to missingness)
AIC: 265128

Number of Fisher Scoring iterations: 4

coefficients = coef(logitm1)
coefficients

(Intercept)      oepspw 
-0.08358232  3.35520921

# The coefficients for the odd ratio interpretation are:
exp(0.1*coefficients)

(Intercept)      oepspw 
  0.9916766   1.3986688

CHALLENGE 4 - INTERPRET WITH YOUR WORDS THE LOGIT MODEL

YOU HAVE TO INTERPRET THIS MODEL WITH YOUR WORDS.

LOOKING AT THE BETA COEFFICIENT OF THE EXPLANATORY VARIABLE oepspw WE CAN SEE THAT THE COEFFICIENT IS POSITIVE AND STATISTICALLY SIGNIFICANT (SINCE ITS P-VALUE IS MUCH LESS THAN 0.05). THEN, WE CAN SAY THAT OPERATING EARNINGS PER SHARE DEFLATED BY PRICE IS SIGNIFICANTLY AND POSITIVELY RELATED WITH THE PROBABILITY OF A STOCK BEATING THE industry (IN TERMS OF ANNUAL RETURN).

THE WAY WE INTERPRET THE MAGNITUDE OF THE COEFFICIENT IS NOT THE SAME AS IN THE CASE OF LINEAR REGRESSION. WE NEED TO GET THE EXPONENTIAL OF THE COEFFICIENT TO HAVE A BETTER INTERPRETATION. ANOTHER WAY TO UNDERSTAND WHY WE NEED TO GET THE EXPONENTIAL OF THE COEFFICIENT IS THAT THE ACTUAL DEPENDENT VARIABLE OF THE LOGIT MODEL IS THE LOGARITHM OF ODD RATIO. REMEMBER THAT THE ODD RATIO IS THE DIVISION BETWEEN THE PROBABILITY OF THE STOCK BEATING THE INDUSTRY AND THE PROBABILITY OF THE STOCK NOT BEATING THE INDUSTRY. THEN, IF WE APPLY THE EXPONENTIAL FUNCTION TO THE LOGARITHM OF THE ODDS RATIO WE GET THE ACTUAL ODDS RATIO:

Y_{i,t}=log(\frac{p}{1-p})=b_{0}+b_{1}(oepspw_{i,t})+\varepsilon_{i,t}

e^{Y_{i,t}}=\left(\frac{p}{1-p}\right)

SINCE THE EXPECTED VALUE OF Y IS ACTUALLY THE REGRESSION EQUATION, THEN I CAN APPLY THE EXPONENTIAL EQUATION TO THE REGRESSION EQUATION AND GET TRANSFORMED BETAS THAT ARE RELATED TO THE ODDS RATIO (NOT TO THE LOG OF ODDS RATIO). THIS IS CONSISTENT WITH A PREVIOUS EXPLANATION ABOUT HOW TO INTERPRET THE MAGNITUDE OF A COEFFICIENT OF A LOGIT REGRESSION. AS I HAD EXPLAINED IN A PREVIOUS SECTION:

IF THE EXPLANATORY VARIABLE INCREASES IN 1 UNIT:

e^{b_{0}}e^{b_{1}(oepspw+1)}=e^{b_{0}}e^{b_{1}(oepspw)}e^{b_{1}}=ODDSRATIO*(e^{b_{1}})

THEN, (e^{b_{1}}) IS THE TRANSFORMED COEFFICIENT WE NEED TO ESTIMATE IN ORDER TO HAVE A BETTER INTERPRETATION OF THE REGRESSION OUTPUT. LET’S GET THE EXPONENTIAL OF THE REGRESSION COEFFICIENTS:

IN THE REAL WORLD, oepspw CANNOT MOVE +1.0 UNIT SINCE IT IS A RATIO. THEN, WE CAN BETTER THINK IN WHAT WOULD HAPPEN IF THE oepspw MOVES IN 0.1 UNIT:

e^{b_{0}}e^{b_{1}(oepspw+0.1)}=e^{b_{0}}e^{b_{1}(oepspw)}e^{0.1*b_{1}}=ODDSRATIO*(e^{0.1*b_{1}})

# I USE THE coef FUNCTION TO EXTRACT THE COEFFICIENT FROM THE LOGIT MODEL:
coefficients <- coef(logitm1)

coefficients

(Intercept)      oepspw 
-0.08358232  3.35520921

NOW I CAN APPLY THE EXPONENTIAL FUNCTION TO THIS VECTOR THAT CONTAINS THE COEFFICIENTS b_0 AND b_1:

b0 = coefficients[1]
b1 = coefficients[2]

odd_coeffs <- exp(0.1*coefficients)
odd_coeffs

(Intercept)      oepspw 
  0.9916766   1.3986688

# I SAVE THE MODIFIED COEFFICIENTS:
modified_b0 = odd_coeffs[1]
modified_b1 = odd_coeffs[2]

NOW THE TRANSFORMED b_1 IS 1.3986688. WE INTERPRET THIS COEFFICIENT AS FOLLOWS:

IF A FIRM IMPROVES ITS oepspw BY ONE 0.1 UNIT, ITS ODDS RATIO -THE PROBABILITY TO BEAT THE industry WITH RESPECT TO THE PROBABILITY OF NOT BEATING THE industry- IN THE SAME QUARTER WILL BE 1.3986688 TIMES HIGHER THAN THE CASE THAT THE FIRM DOES NOT IMPROVE ITS oepspw.

THE INTERPRETATION OF b_0 IS THE FOLLOWING. IF oepspw IS EQUAL TO ZERO, THEN THE TRANSFORMED COEFFICIENT b_0 WILL BE THE EXPECTED ODDS RATIO. IN THIS CASE, THE TRANSFORMED COEFFICIENT OF b_0 IS 0.9916766, which is less than 1. THIS MEANS THAT THE PROBABILITY THAT THE STOCK BEATS THE industry IS LESS THAN THE PROBABLITY DOES NOT BEAT THE industry.

WE CAN GET A BETTER INTERPRETATION OF b_0 CALCULATING THE EXPECTED PROBABILITY OF BEATING THE industry FROM THE EXPECTED ODDS RATIO WHEN oepsp=0. WE NEED TO TRANSFORM FROM ODDSRATIO TO PROBABILITY:

ODDSRATIO=\frac{p}{1-p}

DOING SOME ALGEBRA WE CAN EXPRESS P IN TERMS OF ODDSRATIO:

(1-p)ODDSRATIO=p

ODDSRATIO - p(ODDSRATIO) = p

ODDSRATIO = p + p(ODDSRATIO) = p(1+ODDSRATIO)

p=\frac{ODDSRATIO}{1+ODDSRATIO}

WE HAD CALCULATED THE exp(b0), the TRANSFORMED COEFFICIENT OF b_0 = 0.9916766, WHICH IS THE EXPECTED ODDSRATIO IF oepsp=0. THEN, THE EXPECTED PROBABILITY IF oepsp=0 WILL BE: p=\frac{0.9916766}{1+0.9916766}, then p=0.4979105.

# Prob of beating the industry when oepsp=0:
p = odd_coeffs[1] / (1+odd_coeffs[1])
p

(Intercept) 
  0.4979105

IN TERMS OF PROBABILITY, IF THE CURRENT ODDS-RATIO=1 (THEN p=0.5) AND IF A FIRM IMPROVES ITS oepspw BY 0.1, ITS PROBABILITY TO BEAT THE industry IS EXPECTED TO IMPROVE:

# If current odds-ratio of a firm = 1, its current probability to beat the industry must be 1
current_odds_ratio = 1
current_p = current_odds_ratio / (1+current_odds_ratio)
current_p

[1] 0.5

# If oepsp moves in 0.1 unit, then its new odds_ratio will be:
new_odds_ratio = current_odds_ratio * exp(0.1*b1) 
# The, the probability will increase:
new_p = new_odds_ratio / (1 + new_odds_ratio)
new_p

   oepspw 
0.5831021

THEN, IN THIS HYPOTHETICAL CASE, IF oepsp INCREASES IN +0.1 UNIT, THEN THE PROBABILITY TO BEAT THE INDUSTRY WILL INCREASE FROM 0.5 TO 0.5831021.

12 Prediction with the logit model

We can use the last model to predict the probability whether the firm will beat the return of the industry 1 year in the future. For this prediction, it might be a good idea to select only the firms in the last quarter of the dataset (2024Q2), so we can predict which firm will beat the industry in 2025:

We create a dataset with only the Q2 of 2024:

data2024 <- uspanel %>%
          select(firm,q,yearf, F1beatindustry,oepspw) %>%
          filter(q=="2024q2") %>%
        as.data.frame()
hist(data2024$oepspw)

Now we run the prediction using the model 2 with this new dataset:

data2024 <- data2024 %>% 
  mutate(pred=predict(logitm1,newdata=data2024,type=c("response")) )

The type=c(“response”) parameter indicates to get the prediction in probability. If I do not specify this option, the prediction is calculated in the logarithm of ODDS RATIO.

Let’s do this prediction:

data2024 <- data2024 %>% 
  mutate(logodds_pred=predict(logitm1,newdata=data2024),
# Following the non-linear equation we explained, we get the predicted probability: 
         predprob = 1 / (1+exp(-logodds_pred)) )

The predprob and the pred columns have the same predicted probability:

head(data2024)

firm	q	yearf	F1beatindustry	oepspw	pred	logodds_pred	predprob
A	2024q2	2024	NA	0.0095979	0.4871579	-0.0513795	0.4871579
AA	2024q2	2024	NA	0.0153999	0.4920226	-0.0319123	0.4920226
AABA_old	2024q2	NA	NA	NA	NA	NA	NA
AAC_old	2024q2	NA	NA	NA	NA	NA	NA
AAIC_old	2024q2	NA	NA	NA	NA	NA	NA
AAL	2024q2	2024	NA	0.1861841	0.6320692	0.5411042	0.6320692

We can do a histogram to see how this predicted probability behaves:

hist(data2024$pred,breaks=20)

We can also see how the predicted probability of beating the industry changes with changes in earnings per share:

library(ggplot2)
# I use ggplot from the ggplot2 library:
ggplot(data2024, aes(x=oepspw, y=pred)) +  
  geom_point() # I specify to do a scatter plot

We see the logistic S-shape non linear relationship between operating earnings per share and the probability of a firm to beat the industry median return.

13 Selection of best stocks based on results of the logit model

We can sort the firms according to the predicted probability of beating the benchmark, and then select the top 5% with the highest probability:

top100 <- data2024 %>%
     merge(usfirms1,by="firm") %>% # merge with firms1 to pull firm name and industry
     arrange(desc(pred)) %>%  # sort by probability, from highest to lowest
     #slice_head(prop=0.05) %>% # select the top 5% firms  
     head(n=100)

# Show the # of firms selected:
nrow(top100)

[1] 100

#Showing the first 10 top firms:
head(top100,10)

       firm      q yearf F1beatindustry    oepspw      pred logodds_pred
1       AJG 2024q2  2024             NA 0.6706057 0.8971951    2.1664402
2       APA 2024q2  2024             NA 0.6706057 0.8971951    2.1664402
3       OCN 2024q2  2024             NA 0.5600790 0.8576125    1.7956000
4       CYH 2024q2  2024             NA 0.5075739 0.8347171    1.6194344
5        WW 2024q2  2024             NA 0.3874121 0.7714058    1.2162663
6      DTIL 2024q2  2024             NA 0.3583233 0.7537414    1.1186674
7      ATUS 2024q2  2024             NA 0.3523854 0.7500248    1.0987443
8      VNCE 2024q2  2024             NA 0.3238441 0.7316446    1.0029825
9       VGZ 2024q2  2024             NA 0.2735464 0.6972471    0.8342230
10 VIA__old 2024q2  2024             NA 0.2593823 0.6871222    0.7866995
    predprob                                                          naics1
1  0.8971951                              Servicios financieros y de seguros
2  0.8971951 Minería, explotación de canteras y extracción de petróleo y gas
3  0.8576125                              Servicios financieros y de seguros
4  0.8347171                       Servicios de salud y de asistencia social
5  0.7714058             Otros servicios excepto actividades gubernamentales
6  0.7537414                                       Industrias manufactureras
7  0.7500248                                   Información en medios masivos
8  0.7316446                                           Comercio al por menor
9  0.6972471 Minería, explotación de canteras y extracción de petróleo y gas
10 0.6871222                            Empresas de electricidad, gas y agua
                                                                                      naics2
1                                       Servicios relacionados con los seguros y las fianzas
2                                                               Extracción de petróleo y gas
3                                             Instituciones financieras de fomento económico
4                                                                       Hospitales generales
5                                                                       Servicios personales
6                                                     Fabricación de productos farmacéuticos
7  Producción de programación de canales para sistemas de televisión por cable o satelitales
8                                                     Tienda de ropas y accesorios de vestir
9                                                             Minería de minerales metálicos
10                                 Generación, transmisión y suministro de energía eléctrica

# Showing a summary by industry (Economatica) of the selected top 
top100 %>% 
   group_by(naics2) %>%
   summarize(firms = n(), mean_probability = mean(pred,na.rm=TRUE) )

# A tibble: 59 × 3
   naics2                                                 firms mean_probability
   <chr>                                                  <int>            <dbl>
 1 Agricultura                                                1            0.534
 2 Alquiler de artículos para el hogar y personales           2            0.525
 3 Alquiler de automóviles, camiones y otros transportes…     1            0.521
 4 Alquiler sin intermediación de bienes raíces               1            0.652
 5 Casas de bolsa, casas de cambio y centros cambiarios       1            0.516
 6 Comercio al por mayor de máquinas, equipos y  accesor…     1            0.621
 7 Comercio al por mayor de productos de tienda de comes…     1            0.532
 8 Comercio al por menor                                      4            0.528
 9 Confección de prendas de vestir                            1            0.542
10 Edición de periódicos, revistas, libros y similares        2            0.522
# ℹ 49 more rows

# Showing a summary by industry (NAICS1) of the selected top 5% 
top100 %>% 
   group_by(naics2) %>%
   summarize(firms = n(), mean_probability = mean(pred,na.rm=TRUE) )

# A tibble: 59 × 3
   naics2                                                 firms mean_probability
   <chr>                                                  <int>            <dbl>
 1 Agricultura                                                1            0.534
 2 Alquiler de artículos para el hogar y personales           2            0.525
 3 Alquiler de automóviles, camiones y otros transportes…     1            0.521
 4 Alquiler sin intermediación de bienes raíces               1            0.652
 5 Casas de bolsa, casas de cambio y centros cambiarios       1            0.516
 6 Comercio al por mayor de máquinas, equipos y  accesor…     1            0.621
 7 Comercio al por mayor de productos de tienda de comes…     1            0.532
 8 Comercio al por menor                                      4            0.528
 9 Confección de prendas de vestir                            1            0.542
10 Edición de periódicos, revistas, libros y similares        2            0.522
# ℹ 49 more rows