Workshop 3, Algorithms and data analysis
library(dplyr)
# Downloading the csv files from the web:
uspanel <- read.csv("https://www.apradie.com/datos/dataus2024.csv")
usfirms <- read.csv("https://www.apradie.com/datos/firmsus2024.csv")
1 Introduction
We will work with the dataset of the Final Project (dataus2024), which has historical data from Q1 2010 to Q4 2023.
We will learn: a) what winsorization is and why it is important; b) how to estimate and interpret a logistic regression; c) how to split a dataset into training and testing samples; and d) how to create a confusion matrix.
2 CHALLENGE 1: Winsorization of variables
You have to do your own research about winsorization and explain it in your own words (pay attention in class).
You have to install the statar package. You have to winsorize the following ratio:
- Earnings per share deflated by price
The winsorize function from statar automatically finds a good level of winsorization according to the distribution of the variable's values.
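For reference, here is a minimal sketch of how this could look. It assumes the uspanel dataset has a column named epsp with earnings per share deflated by price (the column name is an assumption; check the dataset's data dictionary):
# Minimal sketch: winsorize epsp with statar (epsp is an assumed column name)
library(statar)
library(dplyr)

uspanel <- uspanel %>%
  mutate(epspw = winsorize(epsp))  # default cutpoints chosen from the distribution

# Compare the distribution before and after winsorization:
summary(uspanel$epsp)
summary(uspanel$epspw)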
3 CHALLENGE 2: Algorithm to do many-to-one merge
Write a data management algorithm to do the following:
- Download the monthly market index of the S&P500 (^GSPC) from 1999 to date
- Convert (collapse) from monthly to quarterly index selecting the last index of each quarter
- Calculate market quarterly returns
- Do a many-to-one algorithm to add a new column to the us panel dataset that has the market quarterly return
Here is an easy-to-follow guide for this challenge. Read it carefully to understand the algorithm.
The uspanel dataset has historical data of many financial-statement variables of US public firms that belong to the New York Stock Exchange (NYSE) and the NASDAQ.
The usfirms dataset is a catalog of all US public firms with general information such as firm name and industry classification.
Now I download the US monthly market index from Q4 1999 to Q2 2024.
library(quantmod)
getSymbols("^GSPC", from="1999-10-01", to= "2024-06-30",
periodicity="monthly", src="yahoo")
[1] "GSPC"
Now I convert / collapse this monthly dataset into a quarterly dataset to get the same granularity as the uspanel dataset. For each quarter (3-month period) I need to keep ONLY the last month's index in order to correctly calculate quarterly returns.
The to.quarterly function from quantmod does this collapse, keeping the value of the last month of each quarter:
QGSPC <- to.quarterly(GSPC)
# I keep only the adjusted column, which has the market index:
QGSPC = Ad(QGSPC)
names(QGSPC) = c("SP500")
I calculate quarterly and annual continuously compounded (cc) returns for the market by taking differences of the log index:
QGSPC$mkqret = diff(log(QGSPC$SP500))
QGSPC$mkyret = diff(log(QGSPC$SP500), lag=4)
# I delete the first row, which has NA values for both returns and is from 1999:
QGSPC = QGSPC[2:nrow(QGSPC),]
In order to merge the uspanel with this QGSPC dataset I need to have a common column for the quarter so that R can do the match by quarter.
The QGSPC has the quarter as index, not as a column! It is important to note that all xts and zoo datasets have an index that is not part of the columns. I can create a data frame that has the index as a column as follows:
QGSPCdf <- data.frame(qdate=index(QGSPC), coredata(QGSPC[,2:3]))
The index function gets the index content, while the coredata function gets only the column data of the dataset.
Besides having a common column in both datasets, both columns must be of the same data type. I check which data type each quarter column has:
class(uspanel$q)
[1] "character"
class(QGSPCdf$qdate)
[1] "yearqtr"
The qdate column of QGSPCdf is a “yearqtr” variable, while the q column of uspanel is a character variable. I have to decide which column to change so that both have the same type and the same format.
I will create a q column in the QGSPCdf dataset with the same format as the q in the uspanel.
The q in the uspanel is a character variable that starts with 4 digits for the year, then a “q” and then the # of the quarter. Example: 2020q1, 2020q2.
Then, I create a new column in the QGSPCdf following this format:
library(lubridate)
# I use the year and quarter functions from the lubridate library.
# The year function extracts the year of a date, and quarter extracts the quarter:
QGSPCdf$q <- paste0(year(QGSPCdf$qdate),  # Convert dates to quarterly labels
                    "q",
                    quarter(QGSPCdf$qdate))
# I check that both columns have the same data type, and have the same format:
class(QGSPCdf$q)
[1] "character"
class(uspanel$q)
[1] "character"
head(QGSPCdf$q)
[1] "2000q1" "2000q2" "2000q3" "2000q4" "2001q1" "2001q2"
head(uspanel$q)
[1] "2000q1" "2000q2" "2000q3" "2000q4" "2001q1" "2001q2"
The paste0 function concatenates strings of characters.
Now I can do the many-to-1 merge of both datasets, indicating that q is the common column.
I use the left_join function from the dplyr package instead of the merge function, since merge does not keep the original sorting of the uspanel dataset.
# I delete the first column (the qdate):
QGSPCdf = QGSPCdf[,c(-1)]

uspanel <- left_join(uspanel, QGSPCdf, by="q")
# I display key columns of 1 quarter:
library(dplyr)
head(uspanel %>% select(firm,q, adjprice, mkqret) %>% filter(q=="2023q4"))
firm q adjprice mkqret
1 A 2023q4 138.54738 0.1064786
2 AA 2023q4 33.79498 0.1064786
3 AABA_old 2023q4 NA 0.1064786
4 AAC_old 2023q4 NA 0.1064786
5 AAIC_old 2023q4 4.84000 0.1064786
6 AAL 2023q4 13.74000 0.1064786
The market return seems to be correctly merged since its value is repeated across firms within the same quarter.
The mkqret column was thus added to the uspanel: the market return series was merged into each firm's rows, so R did a many-to-1 merge!
4 CHALLENGE 3: Logistic regression models with lagged values
Design and run a logistic regression to examine whether earnings per share deflated by price, winsorized (epspw), is related to the probability that the future quarterly stock return is higher than the future market quarterly return.
Pay attention in class to learn how to run a logistic regression model, and how to indicate to use future or lagged values for variables in the model.
You have to interpret the model.
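As a reference, here is a minimal sketch (not the official solution) of how the model could be set up. It assumes the merged uspanel from Challenge 2 (columns firm, q, adjprice, mkqret) and the winsorized epspw from Challenge 1; the names r, F1r, F1mkqret and F1r_above_market are illustrative:
# A minimal sketch, NOT the official solution; column names r, F1r,
# F1mkqret and F1r_above_market are illustrative.
library(dplyr)

uspanel <- uspanel %>%
  group_by(firm) %>%
  arrange(q, .by_group = TRUE) %>%
  mutate(r = log(adjprice) - log(lag(adjprice)),  # quarterly cc stock return
         F1r = lead(r, 1),                        # stock return 1 quarter later
         F1mkqret = lead(mkqret, 1),              # market return 1 quarter later
         F1r_above_market = ifelse(F1r > F1mkqret, 1, 0)) %>%
  ungroup()

# Logit model: is epspw related to the probability of beating the market
# in the following quarter?
logit1 <- glm(F1r_above_market ~ epspw, data = uspanel,
              family = binomial(link = "logit"))
summary(logit1)
Since logit coefficients are in log-odds, exp(coef(logit1)) can help with the interpretation: it gives the multiplicative change in the odds of beating the market for a one-unit increase in epspw.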
5 CHALLENGE 4: Running your first Machine Learning model
(moved to week 4)
Create a dataset with the following columns:
- Future quarterly stock return (1 quarter later)
- F1r_above_market (1=beat the market in the corresponding quarter; 0= otherwise)
- Earnings per share deflated by price (epsp).
Create a training and testing sample: randomly select 80% of observations for the training sample and 20% for the testing sample.
Using the training sample, run the same logistic model to check whether epsp has explanatory power for the probability that the stock beats the market.
Create and interpret the confusion matrix
It is strongly recommended to review Chapter 2 of the DataCamp course “Machine Learning with Caret”.
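For orientation, here is a minimal sketch with the caret package. It assumes the dataset built in Challenge 3 (columns F1r and F1r_above_market) plus an epsp column; all object names are illustrative:
# A minimal sketch with caret; object names are illustrative.
library(dplyr)
library(caret)

mldata <- uspanel %>%
  select(F1r, F1r_above_market, epsp) %>%
  na.omit() %>%
  mutate(F1r_above_market = factor(F1r_above_market, levels = c(0, 1)))

set.seed(123)  # for a reproducible random split
train_idx <- createDataPartition(mldata$F1r_above_market, p = 0.8, list = FALSE)
train <- mldata[train_idx, ]
test  <- mldata[-train_idx, ]

# The same logistic model, now estimated only on the training sample:
logit_train <- glm(F1r_above_market ~ epsp, data = train, family = binomial)

# Predicted probabilities on the testing sample, classified with a 0.5 cutoff:
phat <- predict(logit_train, newdata = test, type = "response")
pred <- factor(ifelse(phat > 0.5, 1, 0), levels = c(0, 1))

# Confusion matrix: predicted vs actual classes in the testing sample
confusionMatrix(pred, test$F1r_above_market, positive = "1")
The confusion matrix compares predicted against actual classes in the testing sample; confusionMatrix also reports accuracy, sensitivity and specificity, which you should interpret.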
6 W3 submission
The grade of this Workshop will be the following:
Complete (100%): If you submit an ORIGINAL and COMPLETE HTML file with all the activities, with your notes, and with your OWN RESPONSES to questions
Incomplete (75%): If you submit an ORIGINAL HTML file with ALL the activities but you did NOT RESPOND to the questions, and/or you did not do all the activities and responded to only some of the questions.
Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop, or you completed more but parts of your work are a copy-paste from other workshops.
Not submitted (0%)
Remember that you have to submit your .html file through Canvas BEFORE NEXT CLASS.