Workshop 3

FZ2022 Algorithms and Data Analytics

Author: Sergio Castellanos-Gamboa, PhD
Affiliation: Tecnológico de Monterrey
Published: October 8, 2024

0.1 General Directions for each workshop

You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOG IN with your @tec.mx account and then create a Google Colab notebook for each workshop.

You must share each Colab document (workshop) with me via e-mail:

sergio.castellanos@tec.mx

You must give Edit privileges to this account.

In Google Colab, you can work with Python or R notebooks. The default is Python notebooks, so you must go to “Runtime” or “Entorno de Ejecución”, select “Change runtime type” or “Cambiar entorno de ejecución”, and then select R (instead of Python).

Your notebook will have a default name like “Untitled3.ipynb”. Click on this name and change it to “W3_FZ2022_YourFirstName_YourLastname”.

Pay attention in class to learn how to write text and R code into your notebook.

In your Workshop Notebook, you have to:

  • Replicate all the R code along with its output.
  • Answer every question and complete every challenge asked in the workshop. Challenges may consist of specific questions or an exercise for you to complete.

For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.

  • It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook to study for the FINAL EXAM. Your own workshop/notebook will be very helpful for further study.

Once you finish your workshop, make sure that you RUN ALL CHUNKS. You can run each code chunk by clicking on the “Run” button located in the top-left section of each chunk. You can also run all the chunks at once with Ctrl-F9. You must submit the web link of your Google Colab workshop to Canvas.

1 Introduction

This is an individual workshop. In this session, we will continue practicing data management for large-scale financial data. We will also learn about the logistic regression model and get an introduction to machine learning.

We will work with the dataset from the Final Case Study (Situación Problema) dataus2024, which contains historical data from Q1 2010 to Q4 2023. The key topics covered in this workshop include:

  • Winsorization: What it is and why it’s important
  • Estimating and interpreting a logistic regression
  • Training vs. testing samples
  • Confusion matrix creation and interpretation

1.1 Loading libraries

Load (and install if needed) the libraries we will use for the workshop:

# Check if the dplyr package is installed; if not, install it
if(!require(dplyr)) install.packages("dplyr")
# Load the dplyr package
library(dplyr)

# Check if the quantmod package is installed; if not, install it
if(!require(quantmod)) install.packages("quantmod")
# Load the quantmod package
library(quantmod)

# Check if the lubridate package is installed; if not, install it
if(!require(lubridate)) install.packages("lubridate")
# Load the lubridate package
library(lubridate)

# Check if the statar package is installed; if not, install it
if(!require(statar)) install.packages("statar")
# Load the statar package
library(statar)

2 Challenge 1: Winsorization of Variables

2.1 Winsorization

Winsorization is a technique used to limit extreme values in your data, reducing the influence of outliers and improving the robustness of your analysis. By capping the data at a certain percentile, it helps avoid distortion in statistical models.

For a detailed explanation and practical example, refer to Section 5.8 of the Workshop 2 solution.

To practice winsorization, you need to install the statar package and winsorize the following ratio:

  • Earnings per share deflated by price

The winsorize() function from the statar package automatically determines a suitable level of winsorization based on the variable’s distribution.
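As a minimal sketch, and assuming the uspanel dataset (imported in Challenge 2 below) has columns named eps and adjprice (these column names are assumptions; adjust them to the actual columns in dataus2024), the winsorization could look like this:

# Create the ratio: earnings per share deflated by (divided by) price
# (eps and adjprice are assumed column names; check your dataset)
uspanel$epsp <- uspanel$eps / uspanel$adjprice
# Winsorize it; statar's winsorize picks suitable cut points automatically
uspanel$epspw <- winsorize(uspanel$epsp)
# Compare the distributions before and after winsorization
summary(uspanel$epsp)
summary(uspanel$epspw)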


3 Challenge 2: Algorithm for Many-to-One Merge

Write a data management algorithm to do the following:

  1. Download the monthly S&P500 index (^GSPC) from 1999 to the present.
  2. Convert (collapse) this data from monthly to quarterly by selecting the last index of each quarter.
  3. Calculate quarterly market returns.
  4. Perform a many-to-one merge to add a new column to the uspanel dataset, containing the market’s quarterly returns.

3.1 Step-by-step Guide:

# Download the CSV file from the website:
download.file("http://www.apradie.com/datos/dataus2024.csv", "dataus2024.csv")

download.file("http://www.apradie.com/datos/firmsus2024.csv", "firmsus2024.csv")

# Import the panel data
uspanel <- read.csv("dataus2024.csv")

# Import the firm catalog
usfirms <- read.csv("firmsus2024.csv")

The uspanel dataset contains financial data for US public firms listed on the NYSE and NASDAQ. The usfirms dataset is a catalog with general information about these firms.

Next, download the monthly S&P500 index data from Q4 1999 to Q2 2024:

getSymbols("^GSPC", from="1999-10-01", to= "2024-06-30", 
           periodicity="monthly", src="yahoo")
[1] "GSPC"

The issue now is that the frequency of the new object is monthly, while the rest of the dataset is quarterly. To convert this monthly data to quarterly, use the to.quarterly function from the quantmod package. The following code extracts the observation for the last month of each 3-month period:

QGSPC <- to.quarterly(GSPC)
QGSPC = Ad(QGSPC)  # Keep only the adjusted column
names(QGSPC) = c("SP500")
head(QGSPC)
          SP500
1999 Q4 1469.25
2000 Q1 1498.58
2000 Q2 1454.60
2000 Q3 1436.51
2000 Q4 1320.28
2001 Q1 1160.33

Now, calculate quarterly and annual continuously compounded (cc) returns as log differences:

QGSPC$mkqret = diff(log(QGSPC$SP500))
QGSPC$mkyret = diff(log(QGSPC$SP500), lag=4)
QGSPC = QGSPC[2:nrow(QGSPC),]  # Remove the first row, whose quarterly return is NA
head(QGSPC)
          SP500      mkqret     mkyret
2000 Q1 1498.58  0.01976590         NA
2000 Q2 1454.60 -0.02978703         NA
2000 Q3 1436.51 -0.01251436         NA
2000 Q4 1320.28 -0.08437271 -0.1069082
2001 Q1 1160.33 -0.12913945 -0.2558136
2001 Q2 1224.38  0.05373019 -0.1722963

To merge the quarterly data with uspanel, we need a common column. Notice that QGSPC, being an xts object, has this information as an index, but not as an actual column.

QGSPCdf <- data.frame(qdate=index(QGSPC), coredata(QGSPC[,2:3]))

The index function extracts the date index of the xts object, while the coredata function extracts only the data columns, without the index.

Besides sharing a common column, the key columns in both datasets must have the same data type. I check the data type of each quarter column:

class(uspanel$q)
[1] "character"
class(QGSPCdf$qdate)
[1] "yearqtr"

The qdate column of the QGSPCdf is a “yearqtr” variable, while the q column of the uspanel is a “character” variable. I need to decide which column to modify so that both have the same type and format.

I will create a q column in the QGSPCdf dataset with the same format as the q in the uspanel.

The q in the uspanel is a character variable that starts with 4 digits for the year, followed by a “q” and then the number of the quarter. For example: 2020q1, 2020q2.

Now, I will create a new column q in the QGSPCdf dataset that follows this format:

# I use the year and quarter functions from the lubridate library
# The year function extracts the year of a date, and quarter extracts the quarter
QGSPCdf$q <- paste0(year(QGSPCdf$qdate), 
                    "q", 
                    quarter(QGSPCdf$qdate))

Let’s confirm that both columns now have the same type and format:

class(QGSPCdf$q)
[1] "character"
class(uspanel$q)
[1] "character"
head(QGSPCdf$q)
[1] "2000q1" "2000q2" "2000q3" "2000q4" "2001q1" "2001q2"
head(uspanel$q)
[1] "2000q1" "2000q2" "2000q3" "2000q4" "2001q1" "2001q2"

Now, use a many-to-one merge to combine the datasets:

QGSPCdf = QGSPCdf[,c(-1)]  # Remove the qdate column
uspanel <- left_join(uspanel, QGSPCdf, by="q")

Finally, display key columns for verification:

head(uspanel %>% 
       select(firm, q, adjprice, mkqret) %>% 
       filter(q == "2023q4"))
      firm      q  adjprice    mkqret
1        A 2023q4 138.54738 0.1064786
2       AA 2023q4  33.79498 0.1064786
3 AABA_old 2023q4        NA 0.1064786
4  AAC_old 2023q4        NA 0.1064786
5 AAIC_old 2023q4   4.84000 0.1064786
6      AAL 2023q4  13.74000 0.1064786

3.1.1 Note: Understanding left_join

The left_join function in R (from the dplyr package) is used to merge two datasets based on a common column. It keeps all rows from the left dataset (the first one) and matches the rows from the right dataset based on the common column. If there are no matching values in the right dataset, the result will contain NA values for the columns coming from the right dataset.

This is useful when you want to ensure that none of the rows from your primary dataset are lost during the merging process.
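To see this behavior concretely, here is a toy illustration (these small data frames are hypothetical, not part of the workshop data):

# Hypothetical example: the right dataset has no row for 2020q2
left_df  <- data.frame(q = c("2020q1", "2020q2"), x = c(1, 2))
right_df <- data.frame(q = "2020q1", y = 10)
left_join(left_df, right_df, by = "q")
       q x  y
1 2020q1 1 10
2 2020q2 2 NA

The unmatched row (2020q2) keeps its left-side values and gets NA in the y column coming from the right dataset.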

3.1.2 Note: Understanding paste0

The paste0 function in R is used to concatenate (combine) multiple strings of text without any separator. It joins the elements together in the exact order they are provided.

For example, if you want to combine a year and a quarter to create a new column in the format “2020q1”, you can use paste0 like this:

paste0(2020, "q", 1)
[1] "2020q1"

This will return “2020q1”. Unlike paste(), which adds a space by default, paste0() does not add any spaces between the elements.


4 Challenge 3: Logistic Regression with Lagged Values

Design and run a logistic regression model to explore whether winsorized earnings per share deflated by price (epspw) is related to the probability that future quarterly stock returns will be higher than future market returns.

4.1 Dataset Creation Instructions

Create a dataset that includes the following columns:

  • Future Quarterly Stock Return: Represents the stock return for the subsequent quarter.
  • F1r_above_market:
    • 1 = Indicates that the stock beat the market in the corresponding quarter.
    • 0 = Indicates that the stock did not beat the market.
  • Winsorized Earnings Per Share Deflated by Price (epspw): Represents the earnings per share divided by the stock price, winsorized as in Challenge 1.

Ensure that each column is clearly defined and formatted for ease of understanding.

4.1.1 Hint: Instructions for Creating the Dependent Variable

  1. Understand the Dependent Variable: In a logistic model, the dependent variable represents the probability that an event occurs. We will use a binary variable (1/0 or TRUE/FALSE) to represent this.

  2. Define the Event: Declare that the EVENT occurs when the future stock return is higher than the future market return.

  3. Independent Variables: Note that the independent variable(s) can be numeric or categorical, similar to a multiple regression model.

  4. Create the Dependent Variable:

    • Assign a value of 1 when the future stock return is higher than the future market return.
    • Assign a value of 0 otherwise.

4.1.2 Suggested dplyr Functions to Use:

  • mutate(): To create or modify columns in your dataset.
  • if_else(): To conditionally assign values based on logical criteria.
  • filter(): To subset your data based on specific conditions.
  • select(): To choose specific columns from your dataset for analysis.
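Putting these hints together, here is a minimal sketch of the dataset creation and the logistic regression. It assumes that uspanel already contains epspw (from Challenge 1) and mkqret (from Challenge 2); the intermediate column names r, F1r, and F1mkqret are illustrative choices, not prescribed by the workshop:

# Build the dataset: future stock return, future market return,
# and the binary dependent variable F1r_above_market
model_data <- uspanel %>%
  arrange(firm, q) %>%
  group_by(firm) %>%
  mutate(r = log(adjprice) - lag(log(adjprice)),  # quarterly cc stock return
         F1r = lead(r),                           # future (next-quarter) stock return
         F1mkqret = lead(mkqret),                 # future market return
         F1r_above_market = if_else(F1r > F1mkqret, 1, 0)) %>%
  ungroup() %>%
  select(firm, q, epspw, F1r, F1r_above_market) %>%
  filter(!is.na(epspw), !is.na(F1r_above_market))

# Logistic regression: probability of beating the market as a function of epspw
logit1 <- glm(F1r_above_market ~ epspw, data = model_data, family = binomial)
summary(logit1)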

5 Challenge 4: Running Your First Machine Learning Model

You will create a dataset to predict whether a stock will beat the market. Steps include:

  • Create a training and testing sample (80/20 split).
  • Run the logistic regression on the training sample.
  • Create and interpret the confusion matrix.

Note: Review Chapter 2 of the Datacamp course: “Machine Learning with Caret” and the Workshop Introduction to Machine Learning using Caret for more details.
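As a starting point, here is a minimal sketch using the caret package. It assumes the model_data dataset from the Challenge 3 sketch above; the seed value and the 0.5 probability cutoff are illustrative choices:

# Check if the caret package is installed; if not, install it
if(!require(caret)) install.packages("caret")
library(caret)

set.seed(123)  # make the random split reproducible
# 80/20 split into training and testing samples
train_rows <- createDataPartition(model_data$F1r_above_market, p = 0.8, list = FALSE)
train_set <- model_data[train_rows, ]
test_set  <- model_data[-train_rows, ]

# Run the logistic regression on the training sample only
logit_train <- glm(F1r_above_market ~ epspw, data = train_set, family = binomial)

# Predict probabilities on the testing sample; classify as 1 if above 0.5
pred_prob  <- predict(logit_train, newdata = test_set, type = "response")
pred_class <- factor(if_else(pred_prob > 0.5, 1, 0), levels = c(0, 1))

# Confusion matrix: predicted classes vs actual outcomes
confusionMatrix(pred_class, factor(test_set$F1r_above_market, levels = c(0, 1)))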


6 W3 submission

The grade for this workshop will be as follows:

  • Complete (100%): If you submit an ORIGINAL and COMPLETE notebook with all the activities, your notes, and your OWN RESPONSES to questions. Submitting someone else’s work or having two or more people submit the same answers (exact texts or with minor changes or translations) will be considered plagiarism and a violation of academic integrity (FIA).

  • Incomplete (75%): If you submit an ORIGINAL notebook with ALL the activities but did NOT RESPOND to the questions and/or did not complete all activities, or if you only responded to some questions. Remember, submitting identical or slightly altered work from another person is considered a breach of academic integrity (FIA).

  • Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop or if you completed more but parts of your work are copy-pasted from other notebooks. This too will be considered an FIA if found to be plagiarized.

  • Not submitted (0%): No submission, no credit.

It is important to remember that submitting someone else’s work, or identical work (even with minor changes or translations) between two or more students, is strictly prohibited and will be considered an academic integrity violation (FIA).

Remember that you have to submit the Google Colab LINK to your notebook, and you must SHARE it with me, ensuring that I have edit access.