Machine Learning to identify the best factor models for portfolio design

Executive Summary

bla bla bla

Problem setup

You just started to work in SmartPort, a mutual fund firm. You were assigned to design and implement better algorithms to design financial portfolios for customers. In SmartPort, the methods to select stocks and design portfolios has been the traditional fundamental analysis using simple financial ratios, and basic technical analysis. You have the challenge to design a better methodology to select US stocks for portfolio design using Machine Learning algorithms.

The main questions that SmartPort wants to respond with your work are the following:

  1. What type of machine learning models/algorithms we can design to identify the financial ratios that better predict future stock returns and/or select stocks with above-average future returns?

  2. Which financial ratios (factors) might have a strong explanation power of future returns? To respond this question, you have to work on the following industry: Manufacturing industries.

  3. How can we use the best ML model to design financial portfolios of stocks?

  4. What might be the holding period return of portfolios designed with the best ML models?

You have to respond to the stated questions. You have to justify your responses with arguments using real data, financial theories, and machine learning models/algorithms.

Phase 1 (Data)

Data management for big financial datasets

Data collection and data preparation

You need to get the right data for the problem. You have to download historical real financial statement information and stock prices of all US public firms. You have to program basic algorithms for data management before you make sense of the data.

Data collection – import data, use of API functions to get online data

options(scipen=999)

library(readxl)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
download.file("http://www.apradie.com/datos/us2020q.xlsx",
              "us2020q.xlsx", mode="wb")

download.file("http://www.apradie.com/datos/usfirms.xlsx",
              "usfirms.xlsx", mode="wb")

Data structures – cross sectional, time-series, panel data

Importamos los datos trimestrales de panel de todas las empresas estadounidenses:

uspanel <- read_excel("us2020q.xlsx",sheet = "data")

Importamos el diccionario de datos para las variables de este conjunto de datos:

dictionary <- read_excel("us2020q.xlsx",sheet = "data dictionary")

Importamos la lista de todas las empresas públicas estadounidenses:

usfirms <- read_excel("usfirms.xlsx")

Creo un conjunto de datos con solo las columnas relevantes para fusionar, y a su vez, hago la combinación indicando que coincida con ambos archivos usando la columna firm que se encuentra en ambos conjuntos de datos.

usfirms1 <- usfirms %>% select(firm, CompanyName, status, naics1)
uspanel <- merge(uspanel, usfirms1, by="firm")    

uspanel

Data selection – algorithms for data selection

uspanel_2020 <- uspanel %>%
          filter(status=="active", year==2020, fiscalmonth==12, naics1 == "Manufacturing")

uspanel_2020
uspanel <- uspanel %>%
          filter(status=="active", naics1 == "Manufacturing")

uspanel
nrow(uspanel_2020)
## [1] 1343

En 2020 hubo 1343 empresas públicas activas pertenecientes a la industria de manufactura en el mercado estadounidense.

Data understanding

uspanel_2020$marketcap = uspanel_2020$originalprice * uspanel_2020$sharesoutstanding

ta_summary <- summary(uspanel_2020$totalassets)
ta_summary
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       266    140030    584954   6507966   2804174 332750000
mcap_summary<- summary(uspanel_2020$marketcap)
mcap_summary
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.       NA's 
##          2     214172    1075669   10967967    4378270 1980644694         18
top_10_tassets <- uspanel_2020 %>%
     arrange(desc(totalassets)) %>%
     top_n(10) %>%
     select(firm, CompanyName, naics1, totalassets, revenue)
## Selecting by marketcap
top_10_tassets

Las empresas más grandes de la industria manufacturera en términos de activos totales fueron Apple, J&J, Pfizer, Intel, P&G, Pepsico, Merck & Co, Coca Cola, Abbott Laboratories y Tesla.

top_10_mcap <- uspanel_2020 %>%
     arrange(desc(marketcap)) %>%
     top_n(10) %>%
     select(firm, CompanyName, naics1, marketcap, revenue)
## Selecting by marketcap
top_10_mcap

Las empresas más grandes de la industria manufacturera en términos de capitalización de mercado fueron Apple, Tesla, J&J, P&G, Coca Cola, Merck & Co, Pepsico, Pfizer, Intel y Abbott Laboratories.

top_10_rev <- uspanel_2020 %>%
     arrange(desc(revenue)) %>%
     top_n(10) %>%
     select(firm, CompanyName, naics1, marketcap, revenue)
## Selecting by marketcap
top_10_rev

Las empresas más grandes de la industria manufacturera en términos de ingresos fueron Apple, Exxon Mobil, GM, Chevron, J&J, GE, Intel, Intl, P&G y Pepsico.

Phase 2 (Data summary and introduction to ML)

Descriptive statistics, data visualization, and introduction to Machine Learning

Descriptive statistics

by_manufacturing_summary <- uspanel_2020 %>%
  group_by(naics1) %>%
  summarize(firms = n(),
            median_total_assets = median(totalassets, na.rm = TRUE),
            Q1_total_assets = quantile(totalassets, probs=c(0.25),na.rm = TRUE),
            Q3_total_assets = quantile(totalassets,probs=c(0.75),na.rm=TRUE),
            median_marketcap = median(marketcap, na.rm = TRUE),
            Q1_marketcap = quantile(marketcap, probs=c(0.25),na.rm=TRUE),
            Q3_marketcap = quantile(marketcap,probs=c(0.75),na.rm=TRUE)
            )
by_manufacturing_summary

Hallazgos significativos de la industria manufacturera al año 2020 en el mercado estadounidense:

· El tamaño de empresa típico en la industria manufacturera de las empresas públicas estadounidenses en 2020 en términos de activos totales es de $ 584,954 millones. · El tamaño de empresa típico en la industria manufacturera de las empresas públicas estadounidenses en 2020 en términos de capitalización de mercado es de $ 1,075,660 millones.

· En 2020, aproximadamente el 25% de las empresas públicas pertenecientes a la industria manufacturera de EE. UU. tenían activos totales iguales o inferiores a $ 140,030 millones. · En 2020, aproximadamente el 25% de las empresas públicas pertenecientes a la industria manufacturera de EE. UU. tenían una capitalización de mercado igual o inferior a $ 214,172 millones.

· En 2020, alrededor del 75% de las empresas públicas pertenecientes a la industria manufacturera de EE. UU. tenían activos totales iguales o inferiores a $ 2,804,174 millones. · En 2020, alrededor del 75% de las empresas públicas pertenecientes a la industria manufacturera de EE. UU. tenían una capitalización de mercado igual o inferior a $ 4,378,270 millones.

Data visualization for financial analysis

firm_selection1 <- uspanel_2020 %>%
            filter(marketcap <=250000000) %>%
            select(firm, marketcap)
hist(firm_selection1$marketcap, breaks=100)

firm_selection2 <- uspanel_2020 %>%
            filter(marketcap <=50000000) %>%
            select(firm, marketcap)
hist(firm_selection2$marketcap, breaks=100)

library(statar)
library(ggplot2)

uspanel_2020$marketcap_w <- winsorize(uspanel_2020$marketcap, probs = c(0,0.95))
## 0.00 % observations replaced at the bottom
## 4.91 % observations replaced at the top
ggplot(uspanel_2020, aes(x=naics1, y=marketcap_w)) +
   geom_boxplot() +
   theme(axis.text.x = element_text(angle = 360, hjust = 1))
## Warning: Removed 18 rows containing non-finite values (stat_boxplot).

What is Machine Learning (ML)? Types of ML algorithms

De acuerdo con BBVA, en su artículo “Machine learning: ¿qué es y cómo funciona?”, el ‘machine learning’ –aprendizaje automático– es una rama de la inteligencia artificial que permite que las máquinas aprendan sin ser expresamente programadas para ello. Una habilidad indispensable para hacer sistemas capaces de identificar patrones entre los datos para hacer predicciones. Esta tecnología está presente en un sinfín de aplicaciones como las recomendaciones de Netflix o Spotify, las respuestas inteligentes de Gmail o el habla de Siri y Alexa. (BBVA, 2021)

En el sentido más básico, Machine Learning (ML) es una forma de implementar inteligencia artificial. Al igual que la IA, el aprendizaje automático es una rama de la informática en la que se diseña o estudia el diseño de algoritmos que pueden aprender. De acuerdo con el sitio oficial de DataCamp, existen varios algoritmos de aprendizaje automático como:

· Decision trees, · Naive Bayes, · Random forest · Support vector machine · K-nearest neighbor, · K-means clustering, · Gaussian mixture model, · Hidden Markov model etc.

What supervised ML methods can be applied to the problem?

Identification of the econometric model to be used in the ML model

Configuration of the test and training samples for the model

library(plm)
## 
## Attaching package: 'plm'
## The following objects are masked from 'package:dplyr':
## 
##     between, lag, lead
uspanel <- pdata.frame(uspanel, index= c("firm","q"))
class(uspanel)
## [1] "pdata.frame" "data.frame"

Con el conjunto de datos del panel original (uspanel), creo columnas para las siguientes variables que serán utilizadas para generar las razones financieras más adelante.

Gross profit (grossprofit): Revenue - Cost of good Sold

uspanel$grossprofit = (uspanel$revenue - uspanel$cogs)

Earnings before interest and taxes (ebit): Gross profit - Sales & general administrative expenses

uspanel$ebit = uspanel$grossprofit - uspanel$sgae

Net Income (netincome): ebit - financial expenses - income taxes

uspanel$netincome = uspanel$ebit - uspanel$finexp - uspanel$incometax

Annual market return: use adjusted stock price and remember that you have quarterly data. Here we have to use the lagged value of adjusted stock price:

uspanel$annual_R = uspanel$adjprice / plm::lag(uspanel$adjprice,k=4) - 1

Market capitalization: (marketcap): original stock price * shareoutstanding.

uspanel$marketcap = uspanel$originalprice * uspanel$sharesoutstanding

Con el mismo conjunto de datos de panel (uspanel), ahora sí, creo columnas para el cálculo de las siguientes razones financieras:

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
uspanel$roabit = ifelse(plm::lag(uspanel$totalassets,k=4)==0,NA, uspanel$ebit / plm::lag(uspanel$totalassets,k=4))

uspanel$roa = ifelse(plm::lag(uspanel$totalassets,k=4)==0,NA,uspanel$netincome / plm::lag(uspanel$totalassets,k=4))

uspanel$oeps = ifelse(uspanel$sharesoutstanding==0,NA,uspanel$ebit / uspanel$sharesoutstanding)

uspanel$oepsp = ifelse(uspanel$originalprice==0,NA,uspanel$oeps / uspanel$originalprice)

uspanel$bookvalue = uspanel$totalassets - uspanel$totalliabilities

uspanel$bmr = ifelse(uspanel$marketcap==0,NA,uspanel$bookvalue / uspanel$marketcap)

uspanel$annual_r = diff(log(uspanel$adjprice),lag=4)
library(statar)
uspanel$bmrw <- winsorize(uspanel$bmr,probs=c(0.02,0.98))
## 1.35 % observations replaced at the bottom
## 1.35 % observations replaced at the top
hist(uspanel$bmrw)

uspanel$oepspw <- winsorize(uspanel$oepsp,probs=c(0.02,0.98))
## 1.34 % observations replaced at the bottom
## 1.34 % observations replaced at the top
hist(uspanel$oepspw)

Phase 3 (Configuration and validation of the ML models)

Model evaluation and validation

uspanelactive <- uspanel[uspanel$status == "active",]

LRM0 <- lm(annual_r ~ oepsp + roabit, data = uspanelactive, na.action = na.omit)
summary(LRM0)
## 
## Call:
## lm(formula = annual_r ~ oepsp + roabit, data = uspanelactive, 
##     na.action = na.omit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3513 -0.2358  0.0366  0.2668 10.8563 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  0.0050164  0.0027186   1.845              0.0650 .  
## oepsp        0.6751529  0.0079798  84.608 <0.0000000000000002 ***
## roabit      -0.0005494  0.0002213  -2.482              0.0131 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5301 on 38563 degrees of freedom
##   (25982 observations deleted due to missingness)
## Multiple R-squared:  0.1566, Adjusted R-squared:  0.1565 
## F-statistic:  3580 on 2 and 38563 DF,  p-value: < 0.00000000000000022

Interpretación:

El siguiente modelo de regresión lineal contempla como variable dependiente el retorno anual de las acciones (trimestre a trimestre) y como variables independientes la utilidad operativa por acción deflactada por precio (oepsp) y la rentabilidad operativa sobre los activos (roabit).

El coeficiente beta de oepsp es de 0.6751529. Este coeficiente es positivo y estadísticamente significativo ya que su valor p (0.0000000000000002) es mucho menor que 0.05. Esto nos indica que al menos el 95% de las veces, en un futuro próximo este coeficiente será positivo. En otras palabras, existe una relación significativa y positiva entre las ganancias operativas por acción deflactadas por el precio y los rendimientos anuales de las acciones después de considerar el efecto del rendimiento operativo sobre los activos sobre los rendimientos anuales de las acciones. Es decir, después de considerar el efecto de roabit, por cada cambio de +1 en oepsp, se espera que el cambio en el rendimiento anual sea de aproximadamente 0.6751529.

Por otra parte, el coeficiente de roa operacional (roabit) es de -0.0005494; este coeficiente es negativo y estadísticamente significativo ya que su valor p (0.0131) es menor que 0.05. Por lo tanto, después de considerar el efecto de oeps en el rendimiento de las acciones, podemos determinar que sí hay un efecto relevante o relación entre el roabit y la rentabilidad anual de las acciones. Cabe resaltar que a diferencia de otras industrias, en la manufactura se logra apreciar esta relación, en algunas otras no hay soporte estadístico para concluir lo mismo.

LRM1 <- plm(plm::lag(annual_r,-1) ~ bmrw + oepsp + log(marketcap), data=uspanelactive, model="pooling", na.action = na.omit)
summary(LRM1)
## Pooling Model
## 
## Call:
## plm(formula = plm::lag(annual_r, -1) ~ bmrw + oepsp + log(marketcap), 
##     data = uspanelactive, na.action = na.omit, model = "pooling")
## 
## Unbalanced Panel: n = 1320, T = 1-40, N = 39354
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -4.260325 -0.238854  0.015514  0.263722  8.201229 
## 
## Coefficients:
##                  Estimate Std. Error t-value              Pr(>|t|)    
## (Intercept)    -0.4681163  0.0190408 -24.585 < 0.00000000000000022 ***
## bmrw           -0.2139067  0.0070841 -30.195 < 0.00000000000000022 ***
## oepsp           0.4806520  0.0082379  58.346 < 0.00000000000000022 ***
## log(marketcap)  0.0412405  0.0012904  31.960 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    13164
## Residual Sum of Squares: 10740
## R-Squared:      0.18411
## Adj. R-Squared: 0.18404
## F-statistic: 2959.76 on 3 and 39350 DF, p-value: < 0.000000000000000222
LRM2 <- plm(plm::lag(annual_r,-1) ~ oepsp + roabit, data=uspanelactive, model="pooling", na.action = na.omit)
summary(LRM2)
## Pooling Model
## 
## Call:
## plm(formula = plm::lag(annual_r, -1) ~ oepsp + roabit, data = uspanelactive, 
##     na.action = na.omit, model = "pooling")
## 
## Unbalanced Panel: n = 1319, T = 1-39, N = 37607
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -4.629421 -0.235164  0.041297  0.273546  9.533288 
## 
## Coefficients:
##                 Estimate   Std. Error t-value            Pr(>|t|)    
## (Intercept)  0.001618658  0.002810782  0.5759              0.5647    
## oepsp        0.596528989  0.008223104 72.5430 <0.0000000000000002 ***
## roabit      -0.000104450  0.000082411 -1.2674              0.2050    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    12562
## Residual Sum of Squares: 11020
## R-Squared:      0.12277
## Adj. R-Squared: 0.12273
## F-statistic: 2631.48 on 2 and 37604 DF, p-value: < 0.000000000000000222

You need to configure and evaluate different models or model configurations in order to select the best model.

Which model indicators or fit measures can be used for model evaluation?

Model selection - which models have the best fit with new data?

Model interpretation

Phase 4 (Simulation and controlling)

Data management to incorporate market return:

library(quantmod)
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
getSymbols("^GSPC", from="2010-01-01", to= "2020-12-31",
            periodicity="monthly", src="yahoo")
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting 
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## [1] "^GSPC"
QSP500 <- to.quarterly(GSPC,indexAt='startof')
head(QSP500)
##            GSPC.Open GSPC.High GSPC.Low GSPC.Close  GSPC.Volume GSPC.Adjusted
## 2010-01-01   1116.56   1180.69  1044.50    1169.43 279192470000       1169.43
## 2010-04-01   1171.23   1219.80  1028.33    1030.71 354511440000       1030.71
## 2010-07-01   1031.10   1157.16  1010.91    1141.20 260105810000       1141.20
## 2010-10-01   1143.49   1262.60  1131.87    1257.64 257671870000       1257.64
## 2011-01-01   1257.62   1344.07  1249.05    1325.83 240896240000       1325.83
## 2011-04-01   1329.48   1370.58  1258.07    1320.64 245196520000       1320.64
QSP500 = QSP500$GSPC.Adjusted
colnames(QSP500) <- "SP500index"

Now we can calculate cc return of the market

QSP500$SPreturns <- diff(log(QSP500$SP500index),lag=4) 
QSP500.df<-data.frame(q=index(QSP500),coredata(QSP500))

uspanel$q<-as.Date(uspanel$q)

uspanel<-merge(uspanel,QSP500.df,by="q")

Model prediction to find the optimal business solution to the problem

Once you have the best model(s), now you can make predictions and scenarios to make effective business decisions

uspanel$firm_vs_sp500 = ifelse(uspanel$annual_r > uspanel$SPreturns, 1, 0)

uspanel$FUTURE_R = plm::lag(uspanel$firm_vs_sp500,-4)
uspanel$MKT_CAP_SIZE = log(uspanel$marketcap)
library(caret)
## Loading required package: lattice
manufacturing_uspanel_ml <- uspanel%>%
              select(firm, q, year, fiscalmonth, annual_r, FUTURE_R, bmrw, oepspw, MKT_CAP_SIZE) %>%
  na.omit() %>%
  as.data.frame()
set.seed(444)
trainingdataindex = createDataPartition(manufacturing_uspanel_ml$FUTURE_R, p=0.80, list= FALSE)
training = manufacturing_uspanel_ml[trainingdataindex,]
testing = manufacturing_uspanel_ml[-trainingdataindex, ]

ML1 <- glm(FUTURE_R ~ bmrw + oepspw + MKT_CAP_SIZE, data = training, family= "binomial", na.action = na.omit)
summary(ML1)
## 
## Call:
## glm(formula = FUTURE_R ~ bmrw + oepspw + MKT_CAP_SIZE, family = "binomial", 
##     data = training, na.action = na.omit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9232  -1.0768  -0.4508   1.0875   3.3482  
## 
## Coefficients:
##               Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)  -0.194461   0.099959  -1.945               0.0517 .  
## bmrw         -1.678228   0.042190 -39.778 < 0.0000000000000002 ***
## oepspw        4.230421   0.140603  30.088 < 0.0000000000000002 ***
## MKT_CAP_SIZE  0.050735   0.006688   7.586   0.0000000000000331 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43163  on 31475  degrees of freedom
## Residual deviance: 38651  on 31472  degrees of freedom
## AIC: 38659
## 
## Number of Fisher Scoring iterations: 5

How you can use the best ML model to design winning financial portfolios of stocks?

Which might be the holding-period return of the designed portfolios?

Do the designed portfolios have the potential to offer returns above the market index?