bla bla bla
You just started to work in SmartPort, a mutual fund firm. You were assigned to design and implement better algorithms to design financial portfolios for customers. In SmartPort, the methods to select stocks and design portfolios has been the traditional fundamental analysis using simple financial ratios, and basic technical analysis. You have the challenge to design a better methodology to select US stocks for portfolio design using Machine Learning algorithms.
The main questions that SmartPort wants to respond with your work are the following:
What type of machine learning models/algorithms we can design to identify the financial ratios that better predict future stock returns and/or select stocks with above-average future returns?
Which financial ratios (factors) might have a strong explanation power of future returns? To respond this question, you have to work on the following industry: Manufacturing industries.
How can we use the best ML model to design financial portfolios of stocks?
What might be the holding period return of portfolios designed with the best ML models?
You have to respond to the stated questions. You have to justify your responses with arguments using real data, financial theories, and machine learning models/algorithms.
You need to get the right data for the problem. You have to download historical real financial statement information and stock prices of all US public firms. You have to program basic algorithms for data management before you make sense of the data.
options(scipen=999)
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
download.file("http://www.apradie.com/datos/us2020q.xlsx",
"us2020q.xlsx", mode="wb")
download.file("http://www.apradie.com/datos/usfirms.xlsx",
"usfirms.xlsx", mode="wb")
Importamos los datos trimestrales de panel de todas las empresas estadounidenses:
uspanel <- read_excel("us2020q.xlsx",sheet = "data")
Importamos el diccionario de datos para las variables de este conjunto de datos:
dictionary <- read_excel("us2020q.xlsx",sheet = "data dictionary")
Importamos la lista de todas las empresas públicas estadounidenses:
usfirms <- read_excel("usfirms.xlsx")
Creo un conjunto de datos con solo las columnas relevantes para fusionar, y a su vez, hago la combinación indicando que coincida con ambos archivos usando la columna firm que se encuentra en ambos conjuntos de datos.
usfirms1 <- usfirms %>% select(firm, CompanyName, status, naics1)
uspanel <- merge(uspanel, usfirms1, by="firm")
uspanel
uspanel_2020 <- uspanel %>%
filter(status=="active", year==2020, fiscalmonth==12, naics1 == "Manufacturing")
uspanel_2020
uspanel <- uspanel %>%
filter(status=="active", naics1 == "Manufacturing")
uspanel
nrow(uspanel_2020)
## [1] 1343
En 2020 hubo 1343 empresas públicas activas pertenecientes a la industria de manufactura en el mercado estadounidense.
uspanel_2020$marketcap = uspanel_2020$originalprice * uspanel_2020$sharesoutstanding
ta_summary <- summary(uspanel_2020$totalassets)
ta_summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 266 140030 584954 6507966 2804174 332750000
mcap_summary<- summary(uspanel_2020$marketcap)
mcap_summary
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2 214172 1075669 10967967 4378270 1980644694 18
top_10_tassets <- uspanel_2020 %>%
arrange(desc(totalassets)) %>%
top_n(10) %>%
select(firm, CompanyName, naics1, totalassets, revenue)
## Selecting by marketcap
top_10_tassets
Las empresas más grandes de la industria manufacturera en términos de activos totales fueron Apple, J&J, Pfizer, Intel, P&G, Pepsico, Merck & Co, Coca Cola, Abbott Laboratories y Tesla.
top_10_mcap <- uspanel_2020 %>%
arrange(desc(marketcap)) %>%
top_n(10) %>%
select(firm, CompanyName, naics1, marketcap, revenue)
## Selecting by marketcap
top_10_mcap
Las empresas más grandes de la industria manufacturera en términos de capitalización de mercado fueron Apple, Tesla, J&J, P&G, Coca Cola, Merck & Co, Pepsico, Pfizer, Intel y Abbott Laboratories.
top_10_rev <- uspanel_2020 %>%
arrange(desc(revenue)) %>%
top_n(10) %>%
select(firm, CompanyName, naics1, marketcap, revenue)
## Selecting by marketcap
top_10_rev
Las empresas más grandes de la industria manufacturera en términos de ingresos fueron Apple, Exxon Mobil, GM, Chevron, J&J, GE, Intel, Intl, P&G y Pepsico.
by_manufacturing_summary <- uspanel_2020 %>%
group_by(naics1) %>%
summarize(firms = n(),
median_total_assets = median(totalassets, na.rm = TRUE),
Q1_total_assets = quantile(totalassets, probs=c(0.25),na.rm = TRUE),
Q3_total_assets = quantile(totalassets,probs=c(0.75),na.rm=TRUE),
median_marketcap = median(marketcap, na.rm = TRUE),
Q1_marketcap = quantile(marketcap, probs=c(0.25),na.rm=TRUE),
Q3_marketcap = quantile(marketcap,probs=c(0.75),na.rm=TRUE)
)
by_manufacturing_summary
Hallazgos significativos de la industria manufacturera al año 2020 en el mercado estadounidense:
· El tamaño de empresa típico en la industria manufacturera de las empresas públicas estadounidenses en 2020 en términos de activos totales es de $ 584,954 millones. · El tamaño de empresa típico en la industria manufacturera de las empresas públicas estadounidenses en 2020 en términos de capitalización de mercado es de $ 1,075,660 millones.
· En 2020, aproximadamente el 25% de las empresas públicas pertenecientes a la industria manufacturera de EE. UU. tenían activos totales iguales o inferiores a $ 140,030 millones. · En 2020, aproximadamente el 25% de las empresas públicas pertenecientes a la industria manufacturera de EE. UU. tenían una capitalización de mercado igual o inferior a $ 214,172 millones.
· En 2020, alrededor del 75% de las empresas públicas pertenecientes a la industria manufacturera de EE. UU. tenían activos totales iguales o inferiores a $ 2,804,174 millones. · En 2020, alrededor del 75% de las empresas públicas pertenecientes a la industria manufacturera de EE. UU. tenían una capitalización de mercado igual o inferior a $ 4,378,270 millones.
firm_selection1 <- uspanel_2020 %>%
filter(marketcap <=250000000) %>%
select(firm, marketcap)
hist(firm_selection1$marketcap, breaks=100)
firm_selection2 <- uspanel_2020 %>%
filter(marketcap <=50000000) %>%
select(firm, marketcap)
hist(firm_selection2$marketcap, breaks=100)
library(statar)
library(ggplot2)
uspanel_2020$marketcap_w <- winsorize(uspanel_2020$marketcap, probs = c(0,0.95))
## 0.00 % observations replaced at the bottom
## 4.91 % observations replaced at the top
ggplot(uspanel_2020, aes(x=naics1, y=marketcap_w)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 360, hjust = 1))
## Warning: Removed 18 rows containing non-finite values (stat_boxplot).
De acuerdo con BBVA, en su artículo “Machine learning: ¿qué es y cómo funciona?”, el ‘machine learning’ –aprendizaje automático– es una rama de la inteligencia artificial que permite que las máquinas aprendan sin ser expresamente programadas para ello. Una habilidad indispensable para hacer sistemas capaces de identificar patrones entre los datos para hacer predicciones. Esta tecnología está presente en un sinfín de aplicaciones como las recomendaciones de Netflix o Spotify, las respuestas inteligentes de Gmail o el habla de Siri y Alexa. (BBVA, 2021)
En el sentido más básico, Machine Learning (ML) es una forma de implementar inteligencia artificial. Al igual que la IA, el aprendizaje automático es una rama de la informática en la que se diseña o estudia el diseño de algoritmos que pueden aprender. De acuerdo con el sitio oficial de DataCamp, existen varios algoritmos de aprendizaje automático como:
· Decision trees, · Naive Bayes, · Random forest · Support vector machine · K-nearest neighbor, · K-means clustering, · Gaussian mixture model, · Hidden Markov model etc.
library(plm)
##
## Attaching package: 'plm'
## The following objects are masked from 'package:dplyr':
##
## between, lag, lead
uspanel <- pdata.frame(uspanel, index= c("firm","q"))
class(uspanel)
## [1] "pdata.frame" "data.frame"
Con el conjunto de datos del panel original (uspanel), creo columnas para las siguientes variables que serán utilizadas para generar las razones financieras más adelante.
Gross profit (grossprofit): Revenue - Cost of good Sold
uspanel$grossprofit = (uspanel$revenue - uspanel$cogs)
Earnings before interest and taxes (ebit): Gross profit - Sales & general administrative expenses
uspanel$ebit = uspanel$grossprofit - uspanel$sgae
Net Income (netincome): ebit - financial expenses - income taxes
uspanel$netincome = uspanel$ebit - uspanel$finexp - uspanel$incometax
Annual market return: use adjusted stock price and remember that you have quarterly data. Here we have to use the lagged value of adjusted stock price:
uspanel$annual_R = uspanel$adjprice / plm::lag(uspanel$adjprice,k=4) - 1
Market capitalization: (marketcap): original stock price * shareoutstanding.
uspanel$marketcap = uspanel$originalprice * uspanel$sharesoutstanding
Con el mismo conjunto de datos de panel (uspanel), ahora sí, creo columnas para el cálculo de las siguientes razones financieras:
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
uspanel$roabit = ifelse(plm::lag(uspanel$totalassets,k=4)==0,NA, uspanel$ebit / plm::lag(uspanel$totalassets,k=4))
uspanel$roa = ifelse(plm::lag(uspanel$totalassets,k=4)==0,NA,uspanel$netincome / plm::lag(uspanel$totalassets,k=4))
uspanel$oeps = ifelse(uspanel$sharesoutstanding==0,NA,uspanel$ebit / uspanel$sharesoutstanding)
uspanel$oepsp = ifelse(uspanel$originalprice==0,NA,uspanel$oeps / uspanel$originalprice)
uspanel$bookvalue = uspanel$totalassets - uspanel$totalliabilities
uspanel$bmr = ifelse(uspanel$marketcap==0,NA,uspanel$bookvalue / uspanel$marketcap)
uspanel$annual_r = diff(log(uspanel$adjprice),lag=4)
library(statar)
uspanel$bmrw <- winsorize(uspanel$bmr,probs=c(0.02,0.98))
## 1.35 % observations replaced at the bottom
## 1.35 % observations replaced at the top
hist(uspanel$bmrw)
uspanel$oepspw <- winsorize(uspanel$oepsp,probs=c(0.02,0.98))
## 1.34 % observations replaced at the bottom
## 1.34 % observations replaced at the top
hist(uspanel$oepspw)
uspanelactive <- uspanel[uspanel$status == "active",]
LRM0 <- lm(annual_r ~ oepsp + roabit, data = uspanelactive, na.action = na.omit)
summary(LRM0)
##
## Call:
## lm(formula = annual_r ~ oepsp + roabit, data = uspanelactive,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3513 -0.2358 0.0366 0.2668 10.8563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0050164 0.0027186 1.845 0.0650 .
## oepsp 0.6751529 0.0079798 84.608 <0.0000000000000002 ***
## roabit -0.0005494 0.0002213 -2.482 0.0131 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5301 on 38563 degrees of freedom
## (25982 observations deleted due to missingness)
## Multiple R-squared: 0.1566, Adjusted R-squared: 0.1565
## F-statistic: 3580 on 2 and 38563 DF, p-value: < 0.00000000000000022
Interpretación:
El siguiente modelo de regresión lineal contempla como variable dependiente el retorno anual de las acciones (trimestre a trimestre) y como variables independientes la utilidad operativa por acción deflactada por precio (oepsp) y la rentabilidad operativa sobre los activos (roabit).
El coeficiente beta de oepsp es de 0.6751529. Este coeficiente es positivo y estadísticamente significativo ya que su valor p (0.0000000000000002) es mucho menor que 0.05. Esto nos indica que al menos el 95% de las veces, en un futuro próximo este coeficiente será positivo. En otras palabras, existe una relación significativa y positiva entre las ganancias operativas por acción deflactadas por el precio y los rendimientos anuales de las acciones después de considerar el efecto del rendimiento operativo sobre los activos sobre los rendimientos anuales de las acciones. Es decir, después de considerar el efecto de roabit, por cada cambio de +1 en oepsp, se espera que el cambio en el rendimiento anual sea de aproximadamente 0.6751529.
Por otra parte, el coeficiente de roa operacional (roabit) es de -0.0005494; este coeficiente es negativo y estadísticamente significativo ya que su valor p (0.0131) es menor que 0.05. Por lo tanto, después de considerar el efecto de oeps en el rendimiento de las acciones, podemos determinar que sí hay un efecto relevante o relación entre el roabit y la rentabilidad anual de las acciones. Cabe resaltar que a diferencia de otras industrias, en la manufactura se logra apreciar esta relación, en algunas otras no hay soporte estadístico para concluir lo mismo.
LRM1 <- plm(plm::lag(annual_r,-1) ~ bmrw + oepsp + log(marketcap), data=uspanelactive, model="pooling", na.action = na.omit)
summary(LRM1)
## Pooling Model
##
## Call:
## plm(formula = plm::lag(annual_r, -1) ~ bmrw + oepsp + log(marketcap),
## data = uspanelactive, na.action = na.omit, model = "pooling")
##
## Unbalanced Panel: n = 1320, T = 1-40, N = 39354
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -4.260325 -0.238854 0.015514 0.263722 8.201229
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) -0.4681163 0.0190408 -24.585 < 0.00000000000000022 ***
## bmrw -0.2139067 0.0070841 -30.195 < 0.00000000000000022 ***
## oepsp 0.4806520 0.0082379 58.346 < 0.00000000000000022 ***
## log(marketcap) 0.0412405 0.0012904 31.960 < 0.00000000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 13164
## Residual Sum of Squares: 10740
## R-Squared: 0.18411
## Adj. R-Squared: 0.18404
## F-statistic: 2959.76 on 3 and 39350 DF, p-value: < 0.000000000000000222
LRM2 <- plm(plm::lag(annual_r,-1) ~ oepsp + roabit, data=uspanelactive, model="pooling", na.action = na.omit)
summary(LRM2)
## Pooling Model
##
## Call:
## plm(formula = plm::lag(annual_r, -1) ~ oepsp + roabit, data = uspanelactive,
## na.action = na.omit, model = "pooling")
##
## Unbalanced Panel: n = 1319, T = 1-39, N = 37607
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -4.629421 -0.235164 0.041297 0.273546 9.533288
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 0.001618658 0.002810782 0.5759 0.5647
## oepsp 0.596528989 0.008223104 72.5430 <0.0000000000000002 ***
## roabit -0.000104450 0.000082411 -1.2674 0.2050
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 12562
## Residual Sum of Squares: 11020
## R-Squared: 0.12277
## Adj. R-Squared: 0.12273
## F-statistic: 2631.48 on 2 and 37604 DF, p-value: < 0.000000000000000222
You need to configure and evaluate different models or model configurations in order to select the best model.
library(quantmod)
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
getSymbols("^GSPC", from="2010-01-01", to= "2020-12-31",
periodicity="monthly", src="yahoo")
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## [1] "^GSPC"
QSP500 <- to.quarterly(GSPC,indexAt='startof')
head(QSP500)
## GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume GSPC.Adjusted
## 2010-01-01 1116.56 1180.69 1044.50 1169.43 279192470000 1169.43
## 2010-04-01 1171.23 1219.80 1028.33 1030.71 354511440000 1030.71
## 2010-07-01 1031.10 1157.16 1010.91 1141.20 260105810000 1141.20
## 2010-10-01 1143.49 1262.60 1131.87 1257.64 257671870000 1257.64
## 2011-01-01 1257.62 1344.07 1249.05 1325.83 240896240000 1325.83
## 2011-04-01 1329.48 1370.58 1258.07 1320.64 245196520000 1320.64
QSP500 = QSP500$GSPC.Adjusted
colnames(QSP500) <- "SP500index"
Now we can calculate cc return of the market
QSP500$SPreturns <- diff(log(QSP500$SP500index),lag=4)
QSP500.df<-data.frame(q=index(QSP500),coredata(QSP500))
uspanel$q<-as.Date(uspanel$q)
uspanel<-merge(uspanel,QSP500.df,by="q")
Once you have the best model(s), now you can make predictions and scenarios to make effective business decisions
uspanel$firm_vs_sp500 = ifelse(uspanel$annual_r > uspanel$SPreturns, 1, 0)
uspanel$FUTURE_R = plm::lag(uspanel$firm_vs_sp500,-4)
uspanel$MKT_CAP_SIZE = log(uspanel$marketcap)
library(caret)
## Loading required package: lattice
manufacturing_uspanel_ml <- uspanel%>%
select(firm, q, year, fiscalmonth, annual_r, FUTURE_R, bmrw, oepspw, MKT_CAP_SIZE) %>%
na.omit() %>%
as.data.frame()
set.seed(444)
trainingdataindex = createDataPartition(manufacturing_uspanel_ml$FUTURE_R, p=0.80, list= FALSE)
training = manufacturing_uspanel_ml[trainingdataindex,]
testing = manufacturing_uspanel_ml[-trainingdataindex, ]
ML1 <- glm(FUTURE_R ~ bmrw + oepspw + MKT_CAP_SIZE, data = training, family= "binomial", na.action = na.omit)
summary(ML1)
##
## Call:
## glm(formula = FUTURE_R ~ bmrw + oepspw + MKT_CAP_SIZE, family = "binomial",
## data = training, na.action = na.omit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9232 -1.0768 -0.4508 1.0875 3.3482
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.194461 0.099959 -1.945 0.0517 .
## bmrw -1.678228 0.042190 -39.778 < 0.0000000000000002 ***
## oepspw 4.230421 0.140603 30.088 < 0.0000000000000002 ***
## MKT_CAP_SIZE 0.050735 0.006688 7.586 0.0000000000000331 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43163 on 31475 degrees of freedom
## Residual deviance: 38651 on 31472 degrees of freedom
## AIC: 38659
##
## Number of Fisher Scoring iterations: 5