Abstract:. In this workshop we continue practicing with descriptive statistics, data understanding and financial data calculations. In addition, we introduce what is Machine Learning, and we do a review of regression models and application in Finance.

CHALLENGE 1: Complete/correct your Workshop 2

Resolution Workshop 2

Challenge 1: Data management algorithms

You have to import the datasets. We will download these datasets from a web site. Here is the code:

options(scipen=999)

library(readxl)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(PerformanceAnalytics)

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

## 
## Attaching package: 'PerformanceAnalytics'

## The following object is masked from 'package:graphics':
## 
##     legend

download.file("http://www.apradie.com/datos/us2020q.xlsx",
              "us2020q.xlsx", mode="wb")

download.file("http://www.apradie.com/datos/usfirms.xlsx",
              "usfirms.xlsx", mode="wb")

Now we import (into our R environment) the panel quarterly data of all US firms:

uspanel <- read_excel("us2020q.xlsx",sheet = "data")

Now we import the data dictionary for the variables of this dataset:

dictionary <- read_excel("us2020q.xlsx",sheet = "data dictionary")

Now we import that list of all public US firms:

usfirms <- read_excel("usfirms.xlsx")

As a quick check of the imported data, write the code do the following query:

1.1 Display the revenue and total assets for Apple for all quarters of 2019 and 2020.

Do an online check, find the most recent 10-K report (annual financial statement) of Apple, and compare 2020 anual sales with what you got. You must get the same amount for the anual sales reported at the end of fiscal year (for Apple, the fiscal year is Q3).

AAPL_Revenue_TotalAssets  = uspanel[uspanel$firm == "AAPL" & 
                                    uspanel$year >= 2019, c(1:3,7)]
AAPL_Revenue_TotalAssets

Another possible way to do this first exercise, could be:

library(dplyr)
APPLE_2019 <- uspanel %>% 
              filter(firm=="AAPL" & year==2019) %>%
              select(firm,q,revenue,fiscalmonth)

APPLE_2020 <- uspanel %>% 
              filter(firm=="AAPL" & year==2020) %>%
              select(firm,q,revenue,fiscalmonth)

APPLE_2019

APPLE_2020

1.2 Using the panel dataset, write the algorithm to select active firms along with the quarters with annual financial data in 2020.

Before doing basic descriptive statistics for the US market, you do some data management algorithms. To do this, you have to select firm-quarters with fiscalmonth=12, year=2020 and only active firms. Save the result in a new dataset called panel2020.

usfirms1<- usfirms %>% select(firm,CompanyName,status,naics1)
uspanel <- merge(uspanel,usfirms1,by="firm")    

panel2020 <- uspanel %>%
          filter(status=="active",
                 year==2020, 
                 fiscalmonth==12)
panel2020

nrow(panel2020)

## [1] 3212

There were 3212 active public firms in the US market in 2020.

1.3 Using the panel2020 dataset do the required algorithm to merge important firm characteristics to this panel dataset. You have to pull the following firm characteristics: Company name, status, and indsutry (naics1).

In the panel dataset we do not have general information of each firm such as firm name, industry, status (active or cancelled), etc. So, we start by integrating this information into the panel data:

In this dataset, we have quarterly financial data. For income-statement variables, the information is the cumulative amount from Q1 to the corresponding quarter of the fiscal year. For balance-sheet variables, the information is the value of the variable at the end of the quarter (since the beginning when the firm was created).

The fiscal month variable indicates the following: if fiscalmonth=12, then that means that the corresponding quarter is the end-of-the-fiscal-year quarter.

panel2020_2 <- merge(usfirms, panel2020, by = "firm")
panel2020_2

firmsbyQ<- table(panel2020$q)
firmsbyQ

## 
## 2020-01-01 2020-04-01 2020-07-01 2020-10-01 
##        209        187        206       2610

Challenge 2: Code for descriptive statistics for 2020

Using the panel2020 dataset, you have to do the following descriptive statistics and respond the following questions.

You have to calculate market capitalization (marketcap) in this dataset. You can calculate a new column as follows:

marketcap = original stock price * shareoutstanding.

panel2020$marketcap = panel2020$originalprice * panel2020$sharesoutstanding
marketcap <- (panel2020_2$originalprice * panel2020_2$sharesoutstanding)
panel2020_2 <- cbind(panel2020_2, marketcap)

This is the market value of the firm in each quarter. We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares.

2.1 Show and interpret the median, 25 and 75 percentile of total assets and market capitalization (market value) of the active US firms in 2020

panel2020_3 <- (panel2020_2[,c("totalassets", "marketcap")])
panel2020_3 <- na.omit(panel2020_3)

summary(panel2020_3)

##   totalassets           marketcap         
##  Min.   :       266   Min.   :         2  
##  1st Qu.:    353334   1st Qu.:    293888  
##  Median :   1718393   Median :   1392349  
##  Mean   :  16123109   Mean   :  11874836  
##  3rd Qu.:   6396786   3rd Qu.:   5357797  
##  Max.   :3386071000   Max.   :1980644694

INTERPRETACIÓN:

CON RELACIÓN A LOS ACTIVOS TOTALES PODEMOS CONCLUIR:

·CERCA DEL 25% DE LAS FIRMAS REGISTRADAS EN ESTADOS UNIDOS, CUENTAN CON UNA COLUMNA DE ACTIVOS POR USD 353 MILES DE MILLONES O MENOS; 50% CUENTAN CON UNA COLUMNA DE ACTIVOS POR USD 1.718 MILES DE MILLONES O MENOS; Y, 75% CUENTAN CON UNA COLUMNA DE ACTIVOS POR USD 6.3 MILES DE MILLONES O MENOS. · LA EMPRESA CON MAYOR NÚMERO DE ACTIVOS BAJO GESTIÓN, MANTIENE 3,386,071,000 USD

RESPECTO A LA CAPITALIZACIÓN DE MERCADO, PODEMOS CONCLUIR:

·LA MEDIANA CAPITALIZACIÓN DE LAS EMPRESAS EN EL MERCADO ESTADOUNIDENSE ES DE USD 1.392 MILES DE MILLONES. ·LA EMPRESA CON MAYOR CAPITALIZACIÓN DE MERCADO ES VALORADA EN USD 1,980,644,694; SEGURAMENTE, APPLE.

2.2 By industry show the # of firms, median, 25 and 75 percentile of total assets and market capitalization (market value) of the active US firms in 2020

industry_panel2020 <- unique(panel2020$naics1)
panel2020 %>% 
  filter(year =='2020') %>%
  group_by(naics1) %>% 
  summarize(firms = n(),
            totalassets_median <- median(totalassets, na.rm = TRUE),
            marketcap_median <- median(marketcap, na.rm = TRUE),
            assets_percentil25 <- quantile(totalassets, probs = (0.25)))

Solution by the professor:

by_industry_summary <- panel2020 %>%
  group_by(naics1) %>%
  summarize(firms = n(),
            median_tot_assets = median(totalassets, na.rm = TRUE),
            Q1_total_assets = quantile(totalassets, probs=c(0.25),na.rm = TRUE),
            Q3_totassets = quantile(totalassets,probs=c(0.75),na.rm=TRUE),
            median_marketcap = median(marketcap, na.rm = TRUE),
            Q1_marketcap = quantile(marketcap, probs=c(0.25),na.rm=TRUE),
            Q3_marketcap = quantile(marketcap,probs=c(0.75),na.rm=TRUE)
            )
by_industry_summary

2.3 Which are the 10 biggest active firms in terms of total assets according to the end of fiscal year of 2020? Show the name of the firms, the industry and total assets.

Top10_TotalAssets <- (panel2020) %>%
         arrange(desc(totalassets)) %>%
         top_n(10) %>% 
         select(firm, CompanyName, naics1, totalassets, revenue)

## Selecting by marketcap

Top10_TotalAssets

2.4 Which are the 10 biggest active firms in terms of market capitalization? Show the name of the firms, the industry and market value as of 2020

Top10_MarketCap <- panel2020 %>%
         arrange(desc(marketcap)) %>%
         top_n(10) %>%
         select(firm, CompanyName, naics1, marketcap, revenue)

## Selecting by marketcap

Top10_MarketCap

2.5 Show and interpret a histogram of the 2020 market capitalization for all ACTIVE firms

hist(panel2020$marketcap)

INTERPRETACIÓN:

MÁS DE LAS 3,000 EMPRESAS REGISTRADAS EN EE.UU. ARROJAN UNA CAPITALIZACIÓN DE MERCADO POR DEBAJO DE 500,000,000 USD. MENOS DE 500 EMPRESAS REPORTAN UN MARKET CAP SUPERIOR A 250,000,000 USD. EL GRÁFICO SE HA VISTO EXTENDIDO POR EL MARKET CAP DE APPLE, CUYO VALOR ASCIENDE HASTA CERCA DE LOS 2 BILLONES DE USD.

This is a typical skewed-to-the-right distribution since there are very few very extreme positive values. Due to the very extreme values of market capitalization of firms, the histogram is not well appreciated. We can change the number of bins or breaks of the histogram, so that we can better appreciate the distribution of the variable in the range where most firms are located in terms of market capitalization:

hist(panel2020$marketcap, breaks=40)

We can see that most of the firms have a market capitalization that is less than $US 250 billion.

To better appreciate the distribution of most of the firms excluding those with very extreme values, we can filter the dataset as follows:

firm_selection1 <- panel2020 %>%
            filter(marketcap <=250000000) %>%
            select(firm, marketcap)
hist(firm_selection1$marketcap, breaks=40)

Even with this selection, we still see that the distribution is very skewed to the right with extreme values of market capitalization.

2.6 For each industry show and interpret a box plot to better understand the size of active firms in 2020 (use the market capitalization)

Administrative_and_Support_and_Waste_Management_and_Remediation_Services <- panel2020[panel2020$naics1 =="Administrative and Support and Waste Management and Remediation Services",]
chart.Boxplot(Administrative_and_Support_and_Waste_Management_and_Remediation_Services$marketcap)

Agriculture_Forestry_Fishing_and_Hunting <- panel2020[panel2020$naics1 =="Agriculture, Forestry, Fishing and Hunting",]
chart.Boxplot(Agriculture_Forestry_Fishing_and_Hunting$marketcap)

Arts_Entertainment_and_Recreation <- panel2020[panel2020$naics1 =="Arts, Entertainment, and Recreation",]
chart.Boxplot(Arts_Entertainment_and_Recreation$marketcap)

Manufacturing <- panel2020[panel2020$naics1 =="Manufacturing",]
chart.Boxplot(Manufacturing$marketcap)

Retail_Trade <- panel2020[panel2020$naics1 =="Retail Trade",]
chart.Boxplot(Retail_Trade$marketcap)

Finance_and_Insurance <- panel2020[panel2020$naics1 =="Finance and Insurance",]
chart.Boxplot(Finance_and_Insurance$marketcap)

Professional_Scientific_and_Technical_Services <- panel2020[panel2020$naics1 =="Professional, Scientific, and Technical Services",]
chart.Boxplot(Professional_Scientific_and_Technical_Services$marketcap)

Construction <- panel2020[panel2020$naics1 =="Construction",]
chart.Boxplot(Construction$marketcap)

Educational_Services <- panel2020[panel2020$naics1 =="Educational Services",]
chart.Boxplot(Educational_Services$marketcap)

Health_Care_and_Social_Assistance <- panel2020[panel2020$naics1 =="Health Care and Social Assistance",]
chart.Boxplot(Health_Care_and_Social_Assistance$marketcap)

Information <- panel2020[panel2020$naics1 =="Information",]
chart.Boxplot(Information$marketcap)

Mining_Quarrying_and_Oil_and_Gas_Extraction <- panel2020[panel2020$naics1 =="Mining, Quarrying, and Oil and Gas Extraction",]
chart.Boxplot(Mining_Quarrying_and_Oil_and_Gas_Extraction$marketcap)

Other_Services <- panel2020[panel2020$naics1 =="Other Services (except Public Administration)",]
chart.Boxplot(Other_Services$marketcap)

Real_Estate_and_Rental_and_Leasing <- panel2020[panel2020$naics1 =="Real Estate and Rental and Leasing",]
chart.Boxplot(Real_Estate_and_Rental_and_Leasing$marketcap)

Transportation_and_Warehousing <- panel2020[panel2020$naics1 =="Transportation and Warehousing",]
chart.Boxplot(Transportation_and_Warehousing$marketcap)

Utilities <- panel2020[panel2020$naics1 =="Utilities",]
chart.Boxplot(Utilities$marketcap)

Wholesale_Trade <- panel2020[panel2020$naics1 =="Wholesale Trade",]
chart.Boxplot(Wholesale_Trade$marketcap)

Solution by the professor:

library(ggplot2)
ggplot(panel2020, aes(x=naics1, y=marketcap)) +
   geom_boxplot()

## Warning: Removed 51 rows containing non-finite values (stat_boxplot).

Here we have 2 visualization problems: 1) We cannot easily visualize the market capitalization of 50% of the firms by industry, which is the box for each industry (Quartiles fro Q1 to Q3), and 2) The x labels for the industry names cannot be read.

To solve the first problem, since there are too many very extreme values for market capitalization, we can winsorize this variable to flatten the big values of market capitalization.

In statistics, winsorization is the process to treat extreme values, which are also called outliers. The winsorization process replace the very extreme values at a specific percentile with a value of the variable in that percentile.

Here we apply a winsorization at the 95 percentiles for big values of market capitalization:

library(statar)
panel2020$marketcap_w <- winsorize(panel2020$marketcap, probs = c(0,0.95))

## 0.00 % observations replaced at the bottom

## 4.92 % observations replaced at the top

We created a winsorized market capitalization for the purpose of visualization of the variable. The process of winsorization is also used for regression analysis since outliers can bias the estimation of the regression results. We will learn that later.

For the second problem, we can add an option to the ggplot to indicate that we want to display the x labes in vertical wah (90 degrees of inclination):

ggplot(panel2020, aes(x=naics1, y=marketcap_w)) +
   geom_boxplot() +
   theme(axis.text.x = element_text(angle = 90, hjust = 1))

## Warning: Removed 51 rows containing non-finite values (stat_boxplot).

Now we have a better appreciation of the market capitalization ranges for each industry.

Challenge 3. Setting the dataset as panel data

Using the plm package, indicate R that uspanel is a panel data, which has firm column for the subjects and quarter (q) for the time variable. Keep the same name of uspanel.

library(plm)

## 
## Attaching package: 'plm'

## The following objects are masked from 'package:dplyr':
## 
##     between, lag, lead

uspanel <- pdata.frame(uspanel, index= c("firm","q"))
class(uspanel)

## [1] "pdata.frame" "data.frame"

Now the uspanel is a pdata.frame R object, so we can start doing calculations with lagged values such as return calculations and financial ratio calculations.

Challenge 4. Calculating financial variables

You have to write the code to calculate the following financial variables and financial ratios for all firms-quarters of US firms.

4.1 Using the original panel dataset (uspanel), create columns for the following variables:

· Gross profit (grossprofit): Revenue - Cost of good Sold
· Earnings before interest and taxes (ebit): Gross profit - Sales & general administrative expenses
· Net Income (netincome): ebit - financial expenses - income taxes
· Annual market return: use adjusted stock price and remember that you have quarterly data.
· Market capitalization: (marketcap): original stock price * shareoutstanding.

Gross profit (grossprofit): Revenue - Cost of good Sold

uspanel$grossprofit = (uspanel$revenue - uspanel$cogs)

Earnings before interest and taxes (ebit): Gross profit - Sales & general administrative expenses

uspanel$ebit = uspanel$grossprofit - uspanel$sgae

Net Income (netincome): ebit - financial expenses - income taxes

uspanel$netincome = uspanel$ebit - uspanel$finexp - uspanel$incometax

Annual market return: use adjusted stock price and remember that you have quarterly data. Here we have to use the lagged value of adjusted stock price:

uspanel$annual_R = uspanel$adjprice / plm::lag(uspanel$adjprice,k=4) - 1

Market capitalization: (marketcap): original stock price * shareoutstanding.

uspanel$marketcap = uspanel$originalprice * uspanel$sharesoutstanding

This is the market value of the firm in each quarter. We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares.

Check that we used original stock price, not adjusted stock price. In financial markets, the adjusted stock prices are calculated after considering dividend payments and stock splits. A stock split is when a firm decides to divide the value of its stock price by 2, 3 or other multiple with the only purpose to avoid the perception that the stock is expensive. For example, late August 2020 Apple and Tesla decided to do stock split. Apple did a split on a 4-for-1 basis. This means that if the stock price was about USD $400.00 on that day, then its price was reduced to USD$100.00, but they multiplied the number of shares (shares outstanding) by 4 to keep the same market value of the firm. In this historical dataset the shares outstanding is the historical, so we need to use the historical/original stock price without considering stock splits nor dividend payments.

4.2 Using the same panel dataset (uspanel), create columns for the following financial ratios:

Operational Return on Assets (roabit): roabit divided by total assets at the beginning of the period. Total assets of the beginning of the year is actually the total assets of ONE year ago.

\[roabit=\frac{ebit_{t}}{totalassets_{t-4}}\] Here we can use the lag function of the plm package to get value of total assets one year ago, which is 4 quarters ago.

However, when calculating financial ratios it is always important to check whether the variable in the denominator of the ratio has one or more zeros. When we divide an amount by zero, it is not possible to determine a value. Actually, the result is infinite. R stores infinite values as Inf.

We can either check whether the variable in the denominator has zero values, and then use a conditional to indicate which result we want in the case that the denominator is equal to zero.

Let’s install the psych package that has a nice function to describe a variable:

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

Use the function describe for the denominator of the variable:

describe(uspanel$totalassets)

We see that the minimum value is zero, so we can have troubles calculating the roabit ratio. Then we can use the ifelse function to indicate which value we want in the case of a zero value in the denominator:

uspanel$roabit = ifelse(plm::lag(uspanel$totalassets,k=4)==0,NA, uspanel$ebit / plm::lag(uspanel$totalassets,k=4))

Now we can check whether there is no Inf values in the new ratio:

describe(uspanel$roabit)

We can keep using the ifelse for all ratios without checking for the zeros.

# Return on Assets (roa):
uspanel$roa = ifelse(plm::lag(uspanel$totalassets,k=4)==0,NA,uspanel$netincome / plm::lag(uspanel$totalassets,k=4))

#Operational Earnings per share (oeps): ebit / sharesoutstanding
uspanel$oeps = ifelse(uspanel$sharesoutstanding==0,NA,uspanel$ebit / uspanel$sharesoutstanding)

#Operational epsebit deflated by stock price (oepsp) : oeps / original stock price
uspanel$oepsp = ifelse(uspanel$originalprice==0,NA,uspanel$oeps / uspanel$originalprice)

#Book-to-market ratio (bmr): book value / market value. Use total assets as book value of the firm-quarter.
uspanel$bmr = ifelse(uspanel$marketcap==0,NA,uspanel$totalassets / uspanel$marketcap)

Actually, the right way to calculate book value is the following: accounting book value is the difference between total assets and total liabilities. In other words, accounting book value of a company is the accounting shareholder equity plus retained earnings.

We can re-calculate book-to-market ratio using this definition:

uspanel$bookvalue = uspanel$totalassets - uspanel$totalliabilities
uspanel$bmr = ifelse(uspanel$marketcap==0,NA,uspanel$bookvalue / uspanel$marketcap)

We calculated book value and market value for ALL rms and ALL quarters! Now we can generate the book-to-market ratio for all firms all quarters.

** Do your own research and briefly explain what is earnings per share deflated by price, and book-to-market ratio **

EPS: Earnings per share (EPS) is calculated as a company’s profit divided by the outstanding shares of its common stock. The resulting number serves as an indicator of a company’s profitability.

BOOK TO MARKET RATIO: According to investopedia, the book-to-market ratio helps investors find a company’s value by comparing the firm’s book value to its market value. A high book-to-market ratio might mean that the market is valuing the company’s equity cheaply compared to its book value. Many investors are familiar with the price-to-book ratio, which is simply the inverse of the book-to-market ratio formula.

If the market value of a company is trading higher than its book value per share, it is considered to be overvalued. If the book value is higher than the market value, analysts consider the company to be undervalued. The book-to-market ratio is used to compare a company’s net asset value or book value to its current or market value.

Notes from the professor:

EARNINGS PER SHARE IS EQUAL TO EARNINGS DIVIDED BY THE # OF SHARES. THE MEASURE FOR EARNINGS IS NET INCOME. HOWEVER, SOME ANALYSTS ALSO USE OTHER OPERATIONAL MEASURES FOR EARNINGS SUCH AS EARNINGS BEFORE INTEREST AND TAXES (EBIT). IF WE WANT TO MEASURE OPERATIONAL EARNINGS AND CALCULATE IT FOR MANY FIRMS, IT IS RECOMMENDED TO USE EBIT AS A MEASURE OF EARNINGS.

IN A HYPOTHETICAL SCENARIO, IF THE ALL EARNINGS OF A PERIOD t WERE PAYED TO THE INVESTORS, THEN EPS WILL BE HOW MUCH OF ALL EARNINGS OF THE PERIOD IS PAYED TO EACH SHARE OWN BY INVESTORS.

WHAT IS BOOK-TO-MARKET RATIO?

Book-to-market ratio (bmr) is the ratio of accounting book value of the firm to its market value. In other words, it results by dividing book value by the market value of the firm at a specific time period.

If bmr=1 means that the firm book value is about the same as firm market value. If that is the case, then the market value has not grown beyond book-value, meaning that the firm has not created value beyond its book value.

If bmr>1 means that the market value is less than book value. So, if bmr>1 means that the firm has significantly lost shareholder’s wealth, and it might incur in bankrupt risk.

Then, what would be the bmr level that all firms are looking for? One of the main purposes of the executives is to MAXIMIZE shareholder’s value. The way to increase shareholder’s value is to increase its market value, and the only way to increase market value is to increase stock price.

Then, the bmr level that all executives prefer is a value much less than 1.

If bmr=0.5 means that the firm market value is the double of its book value. In this case, the main difference between market value and book value is the wealth that the firm has created thanks to its valuable intangible assets such as prestige, high quality, and innovation.

Then, what do you think it might be the relationship between bmr and stock return? Intuitively, we might think that a small value of bmr is a good news, then the stock return might be related in a negative way. If bmr goes down (good news), then the stock return might go up. Then, it might be expected that the relationship between bmr and stock return is linear and negative. Some finance research (like Fama & French, 1995), mainly with US firms has found that the relationship between bmr and future stock return is negative.

However, there are mixed findings about the relationship between bmr and stock returns. Some research (like Piotrosky, 2000) has found that firms with high bmr, but strong financials (like earnings per share) usually provides significant positive stock returns. In this special case, bmr has found to be positively related to stock returns.

Finance research has found that bmr influences earnings per share, which in turn influences current and future stock returns. It is some times that firms with low bmr do not experience significant high stock returns due to the high expectations of investors. This is an interesting finding in finance that is still being researched!

Challenge 5. Review of Regression models

Using the uspanel dataset you have to run a multiple pulled regression model to examine whether the operational earnings per share deflated by price (peps) and operational ROA (roabit) are related to annual market returns 1 quarter later.
Interpret the regression coefficients of this model.

When we run a regression model with financial returns, it is recommended to always use continuously compounded returns instead of simple returns.

Continuously compounded returns (cc returns) are also called logarithmic returns. cc returns have better statistical properties compared to simple returns, and also cc returns are additive (they can be added to get cc holding return).

Then, we first calculate the annual cc returns for all firms all quarters:

uspanel$annual_r = diff(log(uspanel$adjprice),lag=4)

Before running the regression, we will select only active firms:

uspanelactive <- uspanel[uspanel$status=="active",]

We first run a multiple regression model to examine whether oepsp and roabit are related to contemporary annual returns. I mean contemporary return since I will not check whether oepsp and roabit are related to returns one quarter later; I am checking whether oepsp and roabit are related to returns in the same period (quarter) for all firms-quarters.

I start running this regression, and later I will modify it to change the dependent variable to be returns one quarter later.

LRM1 <- lm(annual_r ~ oepsp + roabit, data = uspanelactive, na.action = na.omit)
summary(LRM1)

## 
## Call:
## lm(formula = annual_r ~ oepsp + roabit, data = uspanelactive, 
##     na.action = na.omit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.0366 -0.1813  0.0582  0.2505  4.3593 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 0.0126443  0.0015733   8.037 0.000000000000000931 ***
## oepsp       0.0117743  0.0007091  16.604 < 0.0000000000000002 ***
## roabit      0.0000968  0.0001315   0.736                0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.488 on 96245 degrees of freedom
##   (55112 observations deleted due to missingness)
## Multiple R-squared:  0.002862,   Adjusted R-squared:  0.002842 
## F-statistic: 138.1 on 2 and 96245 DF,  p-value: < 0.00000000000000022

Interpretación:

El siguiente modelo de regresión lineal contempla como variable dependiente el retorno anual de las acciones (trimestre a trimestre) y como variables independientes la utilidad operativa por acción deflactada por precio (oepsp) y la rentabilidad operativa sobre los activos (roabit).

El coeficiente beta de oepsp es de 0.0117743. Este coeficiente es positivo y estadísticamente significativo ya que su valor p (0.0000000000000002) es mucho menor que 0.05. Esto nos indica que al menos el 95% de las veces, en un futuro próximo este coeficiente será positivo. En otras palabras, existe una relación significativa y positiva entre las ganancias operativas por acción deflactadas por el precio y los rendimientos anuales de las acciones después de considerar el efecto del rendimiento operativo sobre los activos sobre los rendimientos anuales de las acciones.

Después de considerar el efecto de roabit, por cada cambio de +1 en oepsp, se espera que el cambio en el rendimiento anual sea de aproximadamente +0.0117743.

Por otra parte, el coeficiente de roa operacional (roabit) es de 0.0000968; sin embargo, no es significativo ya que su valor p es mucho mayor que 0.05 (0.4615554). Por lo tanto, después de considerar el efecto de oeps en el rendimiento de las acciones, podemos determinar que no hay un efecto relevante o relación entre el roabit y la rentabilidad anual de las acciones.

Ejercicio extra del profesor:

Using the dependent variable as FUTURE stock returns ONE QUARTER LATER:

Now I will change the dependent variable of this multiple regression to be stock annual returns ONE QUARTER IN THE FUTURE. I need to use the function plm instead of lm. lm() stands for linear model; plm stands for panel linear model.

The plm function allows us to use the lagged or future values of a variable in the panel data:

LRM2 <- plm(plm::lag(annual_r,-1) ~ oepsp + roabit, data=uspanelactive, model="pooling", na.action = na.omit)

summary(LRM2)

## Pooling Model
## 
## Call:
## plm(formula = plm::lag(annual_r, -1) ~ oepsp + roabit, data = uspanelactive, 
##     na.action = na.omit, model = "pooling")
## 
## Unbalanced Panel: n = 3178, T = 1-39, N = 93820
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -7.039229 -0.181407  0.058213  0.250802  3.972278 
## 
## Coefficients:
##                Estimate  Std. Error t-value              Pr(>|t|)    
## (Intercept) 0.012041184 0.001596409  7.5427   0.00000000000004646 ***
## oepsp       0.010117701 0.000735352 13.7590 < 0.00000000000000022 ***
## roabit      0.000011308 0.000024614  0.4594                0.6459    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    22467
## Residual Sum of Squares: 22422
## R-Squared:      0.0020161
## Adj. R-Squared: 0.0019948
## F-statistic: 94.764 on 2 and 93817 DF, p-value: < 0.000000000000000222

Compared with the previous model, the variable oepsp is still positive and significant. In the case of roabit, we can see that its coefficient is also positive and also not significant. since its p-value is much greater than 0.05. Now with this model, the effect of oepsp on future annual stock return is still significant and positive, but now its coefficient is less in magnitude. The oepsp coefficient is less than the corresponding coefficient in the previous model. This makes sense since it is much more difficult to find financial ratios that are strongly correlated to future returns; the more in the future the stock return, the less the significance of the explanatory variables.

I a financial market, it is extremely difficult to find financial ratios (also called factors) that are strongly related with future market returns. This has to be the case, since market participants, in theory, they cannot predict future returns with current financial information available to all investors. However, we can design econometric models to find which factors might consistenly be related (may be in small magnitude) with future returns.

Challenge 6. Research what is Machine Learning

Do a quick research about what is Machine Learning.

De acuerdo con BBVA, en su artículo “Machine learning: ¿qué es y cómo funciona?”, el ‘machine learning’ –aprendizaje automático– es una rama de la inteligencia artificial que permite que las máquinas aprendan sin ser expresamente programadas para ello. Una habilidad indispensable para hacer sistemas capaces de identificar patrones entre los datos para hacer predicciones. Esta tecnología está presente en un sinfín de aplicaciones como las recomendaciones de Netflix o Spotify, las respuestas inteligentes de Gmail o el habla de Siri y Alexa.

“En definitiva, el ‘machine learning’ es un maestro del reconocimiento de patrones, y es capaz de convertir una muestra de datos en un programa informático capaz de extraer inferencias de nuevos conjuntos de datos para los que no ha sido entrenado previamente”, explica José Luis Espinoza, científico de datos de BBVA México.

Referencia: https://www.bbva.com/es/machine-learning-que-es-y-como-funciona/

CHALLENGE 2: Winsorization of variables

You have to do your own research about winsorization. Explain it with your words (pay attention in class)

WINSORIZING O WINSORIZATION ES LA TRANSFORMACIÓN DE VARIABLES ESTADÍSTICAS AL LIMITAR LOS VALORES EXTREMOS PARA REDUCIR EL EFECTO DE LOS VALORES ATÍPICOS POSIBLEMENTE ESPURIOS.

You have to install the statar package. You have to winsorize the following ratios we created in workshop 2:

Book-to-market ratio

Earnings per share deflated by price

Using the histogram decide which might be a good level of winsorization for each ratio.

Book-to-market ratio

library(statar)
library(ggplot2)

uspanel$bmr_w <- winsorize(uspanel$bmr, probs = c(.01,.97))

## 0.61 % observations replaced at the bottom

## 1.82 % observations replaced at the top

hist(uspanel$bmr_w)

Earnings per share deflated by price

uspanel$oepsp_w <- winsorize(uspanel$oepsp, probs = c(.05,.97))

## 3.02 % observations replaced at the bottom

## 1.81 % observations replaced at the top

hist(uspanel$oepsp_w)

CHALLENGE 3: Multiple regression models with lagged values

You have to use active firms of the uspanel dataset for this exercise.

Using the plm package, design and run a pooled multiple regression to examine whether a) book-to-market ratio (winsorized), b) earnings per share deflated by price (winsorized) and c) firm size (use log of market value) are related to future stock returns 1 quarter later.

Pay attention in class to learn how to run a regression model using the plm package, and how to indicate to use future or lagged values for variables in the model.

You have to interpret the model

uspanel$marketlog <- log(uspanel$marketcap)

uspanel$annual_return <- diff(log(uspanel$adjprice), lag=4)

uspanelactivefirms <- uspanel[uspanel$status == "active",]

LRM3 <- lm(annual_r ~ bmr_w + oepsp_w + marketlog, data=uspanelactivefirms, na.action = na.omit)

summary(LRM3)

## 
## Call:
## lm(formula = annual_r ~ bmr_w + oepsp_w + marketlog, data = uspanelactivefirms, 
##     na.action = na.omit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5023 -0.2023  0.0152  0.2202  4.7462 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -0.3764501  0.0105720  -35.61 <0.0000000000000002 ***
## bmr_w       -0.2598864  0.0031663  -82.08 <0.0000000000000002 ***
## oepsp_w      1.2863279  0.0144343   89.12 <0.0000000000000002 ***
## marketlog    0.0355081  0.0007137   49.75 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4408 on 98061 degrees of freedom
##   (53295 observations deleted due to missingness)
## Multiple R-squared:  0.1898, Adjusted R-squared:  0.1898 
## F-statistic:  7657 on 3 and 98061 DF,  p-value: < 0.00000000000000022

El siguiente modelo de regresión lineal contempla como variable dependiente el retorno anual de las acciones y como variables independientes el book to market ratio winzorizado (bmr_w), la utilidad operativa por acción deflactada por precio winzorizado (oepsp_w) y el tamaño de las empresas (marketlog).

En este modelo, podemos apreciar que todas las variables mantienen una relación significativa soportado estadísticamente, ya que como podemos ver todos los valores p son mucho menos a 0.05. De este modo podemos concluir que:

Por cada unidad de cambio en el ratio BMR, el esperado cambio sobre el retorno anual es de -.32, inversamente proporcional.
Por cada unidad de cambio en la utilidad operativa por acción deflactada por precio, el esperado cambio sobre el retorno anual es de 1.48, directamente proporcional.
Por cada unidad de cambio en valor de mercado de las empresas, el esperado cambio sobre el retorno anual es de .03, directamente proporcional.

Todos esto con soporte y validez estadística.

CHALLENGE 4: Research about Machine Learning

You have to do your own research about Machine Learning, and then write ideas about how you can use Machine Learning for the Problem Situation.

You must write at least 2 ideas or general approaches about how to use Machine Learning for the Problem Situation.

Correr un código que detecte variables financieras fuertemente correlacionadas cada vez que el precio de la acción sube.
Correr un código que detecte patrones históricos en las razones financieras de empresas para predecir posibles bull runs.

CHALLENGE 5: Complete a Datacamp online course

YUO MUST TAKE Chapter 1 : Regression models: fitting them and evaluating their performance from the course: Machine Learning with caret in R

Workshop 3 - Algorithms and data analysis

Rafael Romo - A01701493