Abstract: In this workshop we continue practicing with descriptive statistics, data understanding and financial data calculations. In addition, we introduce what is Machine Learning, and we do a review of regression models and application in Finance.

Introduction

We will work with a panel data of real US public firms. This is the dataset you will use for the Problem Situation (Final Project) of this class. First, we will do descriptive statistics and visualization with the purpose of understanding the US financial market. In addition, we will start learning about machine learning and do a review of regression models.

You will download 3 datasets:

· usfirms.xlsx: List of US public firms with general information of each firm · us2020q.xlsx /data: Panel data with historical annual financial data for all US public firms. This is in an Excel sheet called “data” in the us2020q.xlsx Excel workbook. · us2020q.xlsx/data dictionary: dictionary for each variable in the panel dataset. This is in an Excel sheet called “dictionary data” in the us2020q.xlsx Excel workbook.

You have to manage these datasets to respond the following questions. Hint: it is recommended to merge the main variables of the list of firms with the panel data.

Challenge 1: Data management algorithms

You have to import the datasets. We will download these datasets from a web site. Here is the code:

options(scipen=999)

library(readxl)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(PerformanceAnalytics)

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

## 
## Attaching package: 'PerformanceAnalytics'

## The following object is masked from 'package:graphics':
## 
##     legend

download.file("http://www.apradie.com/datos/us2020q.xlsx",
              "us2020q.xlsx", mode="wb")

download.file("http://www.apradie.com/datos/usfirms.xlsx",
              "usfirms.xlsx", mode="wb")

Now we import (into our R environment) the panel quarterly data of all US firms:

uspanel <- read_excel("us2020q.xlsx",sheet = "data")

Now we import the data dictionary for the variables of this dataset:

dictionary <- read_excel("us2020q.xlsx",sheet = "data dictionary")

Now we import that list of all public US firms:

usfirms <- read_excel("usfirms.xlsx")

As a quick check of the imported data, write the code do the following query:

1. Display the revenue and total assets for Apple for all quarters of 2019 and 2020.

Do an online check, find the most recent 10-K report (annual financial statement) of Apple, and compare 2020 anual sales with what you got. You must get the same amount for the anual sales reported at the end of fiscal year (for Apple, the fiscal year is Q3).

AAPL_Revenue_TotalAssets  = uspanel[uspanel$firm == "AAPL" & 
                                    uspanel$year >= 2019, c(1:3,7)]
AAPL_Revenue_TotalAssets

2. Using the panel dataset, write the algorithm to select active firms along with the quarters with annual financial data in 2020.

Before doing basic descriptive statistics for the US market, you do some data management algorithms. To do this, you have to select firm-quarters with fiscalmonth=12, year=2020 and only active firms. Save the result in a new dataset called panel2020.

USF_USP_m <- merge(usfirms, uspanel, by = "firm")

panel2020 = USF_USP_m[USF_USP_m$status == "active" & 
                            USF_USP_m$fiscalmonth == 12 &
                            USF_USP_m$year == 2020, c(1,12:45) ]
panel2020

3. Using the panel2020 dataset do the required algorithm to merge important firm characteristics to this panel dataset. You have to pull the following firm characteristics: Company name, status, and indsutry (naics1).

In the panel dataset we do not have general information of each firm such as firm name, industry, status (active or cancelled), etc. So, we start by integrating this information into the panel data:

In this dataset, we have quarterly financial data. For income-statement variables, the information is the cumulative amount from Q1 to the corresponding quarter of the fiscal year. For balance-sheet variables, the information is the value of the variable at the end of the quarter (since the beginning when the firm was created).

The fiscal month variable indicates the following: if fiscalmonth=12, then that means that the corresponding quarter is the end-of-the-fiscal-year quarter.

panel2020_2 <- merge(usfirms, panel2020, by = "firm")
panel2020_2

Challenge 2: Code for descriptive statistics for 2020

Using the panel2020 dataset, you have to do the following descriptive statistics and respond the following questions.

You have to calculate market capitalization (marketcap) in this dataset. You can calculate a new column as follows:

marketcap = original stock price * shareoutstanding.

marketcap <- (panel2020_2$originalprice * panel2020_2$sharesoutstanding)
panel2020_2 <- cbind(panel2020_2, marketcap)

This is the market value of the firm in each quarter. We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares.

1. Show and interpret the median, 25 and 75 percentile of total assets and market capitalization (market value) of the active US firms in 2020

panel2020_3 <- (panel2020_2[,c("totalassets", "marketcap")])
panel2020_3 <- na.omit(panel2020_3)

summary(panel2020_3)

##   totalassets           marketcap         
##  Min.   :       266   Min.   :         2  
##  1st Qu.:    353334   1st Qu.:    293888  
##  Median :   1718393   Median :   1392349  
##  Mean   :  16123109   Mean   :  11874836  
##  3rd Qu.:   6396786   3rd Qu.:   5357797  
##  Max.   :3386071000   Max.   :1980644694

INTERPRETACIÓN:

CON RELACIÓN A LOS ACTIVOS TOTALES PODEMOS CONCLUIR:

·CERCA DEL 25% DE LAS FIRMAS REGISTRADAS EN ESTADOS UNIDOS, CUENTAN CON UNA COLUMNA DE ACTIVOS POR USD 353 MILES DE MILLONES O MENOS; 50% CUENTAN CON UNA COLUMNA DE ACTIVOS POR USD 1.718 MILES DE MILLONES O MENOS; Y, 75% CUENTAN CON UNA COLUMNA DE ACTIVOS POR USD 6.3 MILES DE MILLONES O MENOS. · LA EMPRESA CON MAYOR NÚMERO DE ACTIVOS BAJO GESTIÓN, MANTIENE 3,386,071,000 USD

RESPECTO A LA CAPITALIZACIÓN DE MERCADO, PODEMOS CONCLUIR:

·LA MEDIANA CAPITALIZACIÓN DE LAS EMPRESAS EN EL MERCADO ESTADOUNIDENSE ES DE USD 1.392 MILES DE MILLONES. ·LA EMPRESA CON MAYOR CAPITALIZACIÓN DE MERCADO ES VALORADA EN USD 1,980,644,694; SEGURAMENTE, APPLE.

2. By industry show the # of firms, median, 25 and 75 percentile of total assets and market capitalization (market value) of the active US firms in 2020

industry_panel2020 <- unique(panel2020_2$naics1)
panel2020_2 %>% 
  filter(year =='2020') %>%
  group_by(naics1) %>% 
  summarize(firms = n(),
            totalassets_median <- median(totalassets, na.rm = TRUE),
            marketcap_median <- median(marketcap, na.rm = TRUE),
            assets_percentil25 <- quantile(totalassets, probs = (0.25)))

3. Which are the 10 biggest active firms in terms of total assets according to the end of fiscal year of 2020? Show the name of the firms, the industry and total assets.

Top10_TotalAssets <- (panel2020_2) %>%
         select("CompanyName","naics1","totalassets") %>%
         arrange(desc(totalassets) )
Top10_TotalAssets

4. Which are the 10 biggest active firms in terms of market capitalization? Show the name of the firms, the industry and market value as of 2020

Top10_MarketCap <- (panel2020_2) %>%
         select("CompanyName","naics1","marketcap") %>%
         arrange(desc(marketcap) )
Top10_MarketCap

5. Show and interpret a histogram of the 2020 market capitalization for all ACTIVE firms

hist(panel2020_2$marketcap)

INTERPRETACIÓN:

MÁS DE LAS 3,000 EMPRESAS REGISTRADAS EN EE.UU. ARROJAN UNA CAPITALIZACIÓN DE MERCADO POR DEBAJO DE 500,000,000 USD. MENOS DE 500 EMPRESAS REPORTAN UN MARKET CAP SUPERIOR A 250,000,000 USD. EL GRÁFICO SE HA VISTO EXTENDIDO POR EL MARKET CAP DE APPLE, CUYO VALOR ASCIENDE HASTA CERCA DE LOS 2 BILLONES DE USD.

6. For each industry show and interpret a box plot to better understand the size of active firms in 2020 (use the market capitalization)

Administrative_and_Support_and_Waste_Management_and_Remediation_Services <- panel2020_2[panel2020_2$naics1 =="Administrative and Support and Waste Management and Remediation Services",]
chart.Boxplot(Administrative_and_Support_and_Waste_Management_and_Remediation_Services$marketcap)

Agriculture_Forestry_Fishing_and_Hunting <- panel2020_2[panel2020_2$naics1 =="Agriculture, Forestry, Fishing and Hunting",]
chart.Boxplot(Agriculture_Forestry_Fishing_and_Hunting$marketcap)

Arts_Entertainment_and_Recreation <- panel2020_2[panel2020_2$naics1 =="Arts, Entertainment, and Recreation",]
chart.Boxplot(Arts_Entertainment_and_Recreation$marketcap)

Manufacturing <- panel2020_2[panel2020_2$naics1 =="Manufacturing",]
chart.Boxplot(Manufacturing$marketcap)

Retail_Trade <- panel2020_2[panel2020_2$naics1 =="Retail Trade",]
chart.Boxplot(Retail_Trade$marketcap)

Finance_and_Insurance <- panel2020_2[panel2020_2$naics1 =="Finance and Insurance",]
chart.Boxplot(Finance_and_Insurance$marketcap)

Professional_Scientific_and_Technical_Services <- panel2020_2[panel2020_2$naics1 =="Professional, Scientific, and Technical Services",]
chart.Boxplot(Professional_Scientific_and_Technical_Services$marketcap)

Construction <- panel2020_2[panel2020_2$naics1 =="Construction",]
chart.Boxplot(Construction$marketcap)

Educational_Services <- panel2020_2[panel2020_2$naics1 =="Educational Services",]
chart.Boxplot(Educational_Services$marketcap)

Health_Care_and_Social_Assistance <- panel2020_2[panel2020_2$naics1 =="Health Care and Social Assistance",]
chart.Boxplot(Health_Care_and_Social_Assistance$marketcap)

Information <- panel2020_2[panel2020_2$naics1 =="Information",]
chart.Boxplot(Information$marketcap)

Mining_Quarrying_and_Oil_and_Gas_Extraction <- panel2020_2[panel2020_2$naics1 =="Mining, Quarrying, and Oil and Gas Extraction",]
chart.Boxplot(Mining_Quarrying_and_Oil_and_Gas_Extraction$marketcap)

Other_Services <- panel2020_2[panel2020_2$naics1 =="Other Services (except Public Administration)",]
chart.Boxplot(Other_Services$marketcap)

Real_Estate_and_Rental_and_Leasing <- panel2020_2[panel2020_2$naics1 =="Real Estate and Rental and Leasing",]
chart.Boxplot(Real_Estate_and_Rental_and_Leasing$marketcap)

Transportation_and_Warehousing <- panel2020_2[panel2020_2$naics1 =="Transportation and Warehousing",]
chart.Boxplot(Transportation_and_Warehousing$marketcap)

Utilities <- panel2020_2[panel2020_2$naics1 =="Utilities",]
chart.Boxplot(Utilities$marketcap)

Wholesale_Trade <- panel2020_2[panel2020_2$naics1 =="Wholesale Trade",]
chart.Boxplot(Wholesale_Trade$marketcap)

Challenge 3. Setting the dataset as panel data

Using the plm package, indicate R that uspanel is a panel data, which has firm column for the subjects and quarter (q) for the time variable. Keep the same name of uspanel.

library(plm)

## 
## Attaching package: 'plm'

## The following objects are masked from 'package:dplyr':
## 
##     between, lag, lead

uspanel <- pdata.frame(uspanel, index= c("firm","q"))

Challenge 4. Calculating financial variables

You have to write the code to calculate the following financial variables and financial ratios for all firms-quarters of US firms.

Using the original panel dataset (uspanel), create columns for the following variables:

· Gross profit (grossprofit): Revenue - Cost of good Sold · Earnings before interest and taxes (ebit): Gross profit - Sales & general administrative expenses · Net Income (netincome): ebit - financial expenses - income taxes · Annual market return: use adjusted stock price and remember that you have quarterly data. · Market capitalization: (marketcap): original stock price * shareoutstanding.

This is the market value of the firm in each quarter. We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares.

uspanel$grossprofit = (uspanel$revenue - uspanel$cogs)
uspanel$ebit = (uspanel$grossprofit - uspanel$sgae)
uspanel$netincome = (uspanel$ebit - uspanel$finexp - uspanel$incometax)
uspanel$Market_Cap = (uspanel$originalprice * uspanel$sharesoutstanding)

Using the same panel dataset (uspanel), create columns for the following financial ratios:

· Operational Return on Assets (roabit): roabit divided by total assets at the beginning of the period. Total assets of the beginning of the year is actually the total assets of ONE year ago.

\[roabit = (ebit/totalassets)\]

Here you have to use the lag function of the plm package to get value of total assets one year ago, which is 4 quarters ago.

Return on Assets (roa)

Operational Earnings per share (oeps): ebit / sharesoutstanding

Operational epsebit deflated by stock price (oepsp) : eps_ebit / original stock price

Book-to-market ratio (bmr): book value / market value. Use total assets as book value of the firm-quarter.

** Do your own research and briefly explain what is earnings per share deflated by price, and book-to-market ratio **

uspanel$roabit = (uspanel$ebit/uspanel$totalassets)
roabit_1 = ifelse(uspanel$roabit>
                    plm::lag(uspanel$roabit, 4),1,0)

uspanel$roa = (uspanel$netincome/uspanel$totalassets)
roa_1 = ifelse(uspanel$roa>
                    plm::lag(uspanel$roa, 4),1,0)

uspanel$OEPS = uspanel$ebit / uspanel$sharesoutstanding

uspanel$OEPSP = uspanel$OEPS / uspanel$originalprice

uspanel$BMR = uspanel$totalassets / (uspanel$originalprice*uspanel$sharesoutstanding)

EPS: Earnings per share (EPS) is calculated as a company’s profit divided by the outstanding shares of its common stock. The resulting number serves as an indicator of a company’s profitability.

BOOK TO MARKET RATIO: According to investopedia, the book-to-market ratio helps investors find a company’s value by comparing the firm’s book value to its market value. A high book-to-market ratio might mean that the market is valuing the company’s equity cheaply compared to its book value. Many investors are familiar with the price-to-book ratio, which is simply the inverse of the book-to-market ratio formula.

Challenge 5. Review of Regression models

Using the uspanel dataset you have to run a multiple pulled regression model to examine whether the operational earnings per share deflated by price (peps) and operational ROA (roabit) are related to annual market returns 1 quarter later.
Interpret the regression coefficients of this model.

Pay attention in class for the hints to write the code to run this model. Hint: you have to use the lm function.

Challenge 6. Research what is Machine Learning

Do a quick research about what is Machine Learning.

De acuerdo con BBVA, en su artículo “Machine learning: ¿qué es y cómo funciona?”, el ‘machine learning’ –aprendizaje automático– es una rama de la inteligencia artificial que permite que las máquinas aprendan sin ser expresamente programadas para ello. Una habilidad indispensable para hacer sistemas capaces de identificar patrones entre los datos para hacer predicciones. Esta tecnología está presente en un sinfín de aplicaciones como las recomendaciones de Netflix o Spotify, las respuestas inteligentes de Gmail o el habla de Siri y Alexa.

“En definitiva, el ‘machine learning’ es un maestro del reconocimiento de patrones, y es capaz de convertir una muestra de datos en un programa informático capaz de extraer inferencias de nuevos conjuntos de datos para los que no ha sido entrenado previamente”, explica José Luis Espinoza, científico de datos de BBVA México.

Referencia: https://www.bbva.com/es/machine-learning-que-es-y-como-funciona/

Workshop 2 - Algorithms and data analysis

Rafael Romo - A01701493