Workshop 3, Algorithms and data analysis

Author

Alberto Dorantes, Ph.D.

Published

April 9, 2024

Abstract
This is an INDIVIDUAL workshop. In this workshop we will keep practicing data management for financial massive data and also learn about the Logistic regression model and an introduction to Machine Learning.

1 Introduction

We will work with the dataset of the Final Project (dataus2023), which has historical data from Q1 2010 to Q4 2023.

We will learn a) what is winsorization and why it is important, b) estimate and interpret a logistic regression, and c) training vs testing samples, and d) creating a confusion matrix.

2 CHALLENGE 1: Winsorization of variables

You have to do your own research about winsorization. Explain it with your words (pay attention in class)

You have to install the statar package. You have to winsorize the following ratio:

  • Earnings per share deflated by price

This winsorize function automatically finds a good level of winsorization according to the distribution of all values of the variable.

3 CHALLENGE 2: Algorithm to do many-to-one merge

Write a data management algorithm to do the following:

  1. Download the monthly market index of the S&P500 (^GSPC) from 1999 to date
  2. Convert (collapse) from monthly to quarterly index selecting the last index of each quarter
  3. Calculate market quarterly returns
  4. Do a many-to-one algorithm to add a new column to the us panel dataset that has the market quarterly return

Here is a easy-to-follow guide to do this challenge. Read carefully to understand the algorithm.

library(dplyr)
# Downloading the csv files from the web: 
uspanel <- read.csv("https://www.apradie.com/datos/dataus2024.csv")
usfirms <- read.csv("https://www.apradie.com/datos/firmsus2024.csv")

The uspanel dataset has historical data of many financial-statement variables of US public firms that belong to the New York Exchange and the NASDAQ.

The usfirms dataset is a catalog of all US public firms with general information such as firm name and industry classification.

Now I download the US monthly market index from Q4 1999 to the Q2 of 2024.

library(quantmod)
getSymbols("^GSPC", from="1999-10-01", to= "2024-06-30",
            periodicity="monthly", src="yahoo")
[1] "GSPC"

Now I convert / collapse this monthly dataset in to a quarterly dataset to have the same granularity than the uspanel dataset. For each quarter (3-month period) I need to get ONLY the last month index in order to correctly calculate quarterly returns.

The to.quarterly function from quantmod can do this collapse getting the value of the last month for each quarter:

QGSPC <- to.quarterly(GSPC)
# I keep only the adjusted column, which has the market index:
QGSPC = Ad(QGSPC)
names(QGSPC)= c("SP500")

I calculate quarterly and annual cc returns for the market with the difference function of the log index:

QGSPC$mkqret = diff(log(QGSPC$SP500))
QGSPC$mkyret = diff(log(QGSPC$SP500),lag=4)
# I delete the first row that has NA values for both returns, and it is from 1999:
QGSPC = QGSPC[2:nrow(QGSPC),]

In order to merge the uspanel with this QGSPC dataset I need to have a common column for the quarter so that R can do the match by quarter.

The QGSPC has the quarter as index, not as column! It is important to note that all xts and zoo datasets have an index that is not part of the columns. I can create a data frame that has the index as column as follows:

QGSPCdf<-data.frame(qdate=index(QGSPC),coredata(QGSPC[,2:3]))

The index function gets the index content, while the coredata function gets only the column data of the dataset.

Besides having the same column in both datasets, both columns must be of the same data type. Then I check which data type each q column has:

class(uspanel$q)
[1] "character"
class(QGSPCdf$qdate)
[1] "yearqtr"

The qdate column of the QGSPCdf is a “yearqtr” variable, while the q column of the uspanel is character variable. I have to decide which column I change to have both with the same type and also the same format.

I will create a q column in the QGSPCdf dataset with the same format as the q in the uspanel.

The q in the uspanel is a character variable that starts with 4 digits for the year, then a “q” and then the # of the quarter. Example: 2020q1, 2020q2.

Then, I create a new column in the QGSPCdf following this format:

library(lubridate)
# I use the year and quarter functions from the lubridate library
# The year function extracts the year of a date, and the quarter extrats the quarter
QGSPCdf$q <- paste0(year(QGSPCdf$qdate),         # Convert dates to quarterly
                             "q",
                             quarter(QGSPCdf$qdate))
# I check that both columns have the same data type, and have the same format:
class(QGSPCdf$q)
[1] "character"
class(uspanel$q)
[1] "character"
head(QGSPCdf$q)
[1] "2000q1" "2000q2" "2000q3" "2000q4" "2001q1" "2001q2"
head(uspanel$q)
[1] "2000q1" "2000q2" "2000q3" "2000q4" "2001q1" "2001q2"

The paste0 function concatenates strings of characters.

Now I can do the many-to-1 merge of both dataset indicating that q is the common column.

I use the left_join function from the dplyr package instead of the merge function. I do this since the merge function does not keep the original sorting of the uspanel dataset.

# I delete the first  column (the qdate)
QGSPCdf = QGSPCdf[,c(-1)]

uspanel<-left_join(uspanel,QGSPCdf,by="q")

# I display key columns of 1 quarter:
library(dplyr)
head(uspanel %>% select(firm,q, adjprice, mkqret) %>% filter(q=="2023q4"))
      firm      q  adjprice    mkqret
1        A 2023q4 138.54738 0.1064786
2       AA 2023q4  33.79498 0.1064786
3 AABA_old 2023q4        NA 0.1064786
4  AAC_old 2023q4        NA 0.1064786
5 AAIC_old 2023q4   4.84000 0.1064786
6      AAL 2023q4  13.74000 0.1064786

The market return seems to be well merged since its value is repeated for cases with the same quarter.

Then the mktret column was added to the uspanel. The series of the market return was merged for each firm, so R did a many-to-1 merge!

4 CHALLENGE 3: Logistic regression models with lagged values

Design and run a logistic regression to examine whether earnings per share deflated by price winsorized (epspw) is related to the probability that the future quarterly stock returns is higher than the future market quarterly return.

Pay attention in class to learn how to run a logistic regression model, and how to indicate to use future or lagged values for variables in the model.

You have to interpret the model

5 CHALLENGE 4: Running your first Machine Learning model

(moved to week 4)

Create a dataset with the following columns:

  • Future quarterly stock return (1 quarter later)
  • F1r_above_market (1=beat the market in the corresponding quarter; 0= otherwise)
  • Earnings per share deflated by price (epsp).

Create a training and testing sample: randomly select 80% of observations for the training sample and 20% for the testing sample.

Using the training sample, run the same logistic model to check whether epsp has explanatory power for the probability that the stock beats the market.

Create and interpret the confusion matrix

It is strongly recommended to review the Chapter 2 of the Datacamp course: “Machine Learning with Caret”

6 W3 submission

The grade of this Workshop will be the following:

  • Complete (100%): If you submit an ORIGINAL and COMPLETE HTML file with all the activities, with your notes, and with your OWN RESPONSES to questions

  • Incomplete (75%): If you submit an ORIGINAL HTML file with ALL the activities but you did NOT RESPOND to the questions and/or you did not do all activities and respond to some of the questions.

  • Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop or you completed more but parts of your work is a copy-paste from other workshops.

  • Not submitted (0%)

Remember that you have to submit your .html file through Canvas BEFORE NEXT CLASS.