# Check if the dplyr package is installed; if not, install it
if(!require(dplyr)) install.packages("dplyr")
# Load the dplyr package
library(dplyr)
# Check if the quantmod package is installed; if not, install it
if(!require(quantmod)) install.packages("quantmod")
# Load the quantmod package
library(quantmod)
# Check if the lubridate package is installed; if not, install it
if(!require(lubridate)) install.packages("lubridate")
# Load the lubridate package
library(lubridate)
# Check if the statar package is installed; if not, install it
if(!require(statar)) install.packages("statar")
# Load the statar package
library(statar)
# Install and load caret
if (!require(caret)) {
install.packages("caret")
library(caret)
}
# Install and load ggplot2
if (!require(ggplot2)) {
install.packages("ggplot2")
library(ggplot2)
}
Workshop 3 Solution
FZ2022 Algorithms and Data Analytics
0.1 General Directions for each workshop
You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOGIN with your @tec.mx account and then create a Google Colab notebook for each workshop.
You must share each Colab document (workshop) with me via e-mail:
sergio.castellanos@tec.mx
You must give Edit privileges to this account.
In Google Colab, you can work with Python or R notebooks. The default is Python notebooks, so you must go to “Runtime” or “Entorno de Ejecución”, select “Change runtime type” or “Cambiar entorno de ejecución”, and then select R (instead of Python).
Your notebook will have a default name like “Untitled3.ipynb”. Click on this name and change it to “W3_FZ2022_YourFirstName_YourLastname”.
Pay attention in class to learn how to write text and R code into your notebook.
In your Workshop Notebook, you have to:
- Replicate all the R code along with its output.
- You must complete every question and/or challenge asked in the workshop. A challenge can include specific questions or an exercise to complete.
For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.
- It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook to study for the FINAL EXAM. Your own workshop/notebook will be very helpful for further study.
Once you finish your workshop, make sure that you RUN ALL CHUNKS. You can run each code chunk by clicking on the “Run” button located in the top-left section of each chunk. You can also run all the chunks at once with Ctrl-F9. You must submit the web link of your Google Colab workshop to Canvas.
1 Introduction
This is an individual workshop. In this session, we will continue practicing data management for large-scale financial data. We will also learn about the logistic regression model and get an introduction to machine learning.
We will work with the dataset from the Final Case Study (Situación Problema), dataus2024, which contains historical data from Q1 2010 to Q4 2023. The key topics covered in this workshop include:
- Winsorization: What it is and why it’s important
- Estimating and interpreting a logistic regression
- Training vs. testing samples
- Confusion matrix creation and interpretation
1.1 Loading libraries
Load (and install if needed) the libraries we will use for the workshop. The corresponding setup chunk appears at the top of this document.
2 Challenge 1: Winsorization of Variables
2.1 Winsorization
Winsorization is a technique used to limit extreme values in your data, reducing the influence of outliers and improving the robustness of your analysis. By capping the data at a certain percentile, it helps avoid distortion in statistical models.
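As a rough illustration of the idea (this is not the statar function we use below), capping at fixed percentiles can be done by hand in base R; the cutoff levels, function name, and toy vector here are made up for the example:
# Illustrative only: cap a numeric vector at its 2nd and 98th percentiles
cap_at_percentiles <- function(x, lower = 0.02, upper = 0.98) {
  cutoffs <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[x < cutoffs[1]] <- cutoffs[1]  # floor extreme low values
  x[x > cutoffs[2]] <- cutoffs[2]  # cap extreme high values
  x
}
# Toy vector with one extreme outlier; the outlier is pulled back to the 98th percentile
x_toy <- c(rnorm(100), 50)
summary(cap_at_percentiles(x_toy))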
For a detailed explanation and practical example, refer to Section 5.8 of the Workshop 2 solution, which you can find here.
To practice winsorization, you need to install the statar package and winsorize the following ratio:
- Earnings per share deflated by price
The winsorize function from statar automatically determines a suitable level of winsorization based on the variable’s distribution.
The first step in winsorizing the variable is to create the variable itself. Let’s start by downloading the dataset.
# Download the CSV file from the website:
download.file("http://www.apradie.com/datos/dataus2024.csv", "dataus2024.csv")
download.file("http://www.apradie.com/datos/firmsus2024.csv", "firmsus2024.csv")
# Import the panel data
uspanel <- read.csv("dataus2024.csv")
# Import the firm catalog
usfirms <- read.csv("firmsus2024.csv")
The uspanel dataset contains financial data for US public firms listed on the NYSE and NASDAQ. The usfirms dataset is a catalog with general information about these firms.
Now, create the variables needed for the ratio:
# Adding the variables for gross profit, EBIT and Net Income:
uspanel <- uspanel %>%
  mutate(
    grossprofit = revenue - cogs,
    ebit = grossprofit - sgae,
    netincome = ebit + otherincome + extraordinaryitems - finexp - incometax
  )
Creating the ratio
# I generate EPS and EPSP (deflated by price)
uspanel <- uspanel %>%
  mutate(
    eps = ifelse(sharesoutstanding == 0, NA, netincome / sharesoutstanding),
    epsp = ifelse(originalprice == 0, NA, eps / originalprice)
  )
Winsorizing the variable:
uspanel <- uspanel %>%
  mutate(epspw = winsorize(epsp))
Comparing the unwinsorized and winsorized variables:
hist(uspanel$epsp)
hist(uspanel$epspw)
The winsorized range for earnings per share (EPS) deflated by price, which spans from -0.30 to around 0.30, is useful for understanding a company’s profitability relative to its stock price. A value of +0.30, for example, suggests that an investor might expect approximately 30 cents in earnings for every $1.00 invested in the stock.
Now, if we compare this to a much wider range, such as from -6e+6 to 4e+6, the latter reflects an extreme outlier scenario where the EPS deflated by price is either highly negative or overly positive. Such a broad range would likely indicate significant data distortions, such as massive losses or inflated earnings figures, which are far less common in most companies. A range like -6 million to 4 million would suggest wild fluctuations that are impractical for typical financial analysis and likely reflect rare or extraordinary circumstances, possibly due to accounting anomalies or firms in severe distress.
The narrower range of -0.30 to 0.30, by contrast, is more realistic and interpretable, reflecting moderate fluctuations in profitability without the noise of extreme values, making it more reliable for assessing how well companies convert their earnings into shareholder value in relation to their stock prices.
3 Challenge 2: Algorithm for Many-to-One Merge
Write a data management algorithm to do the following:
- Download the monthly S&P500 index (^GSPC) from 1999 to the present.
- Convert (collapse) this data from monthly to quarterly by selecting the last index of each quarter.
- Calculate quarterly market returns.
- Perform a many-to-one merge to add a new column to the uspanel dataset, containing the market’s quarterly returns.
3.1 Step-by-step Guide:
Download the monthly S&P500 index data from Q4 1999 to Q2 2024:
getSymbols("^GSPC", from="1999-10-01", to= "2024-06-30",
periodicity="monthly", src="yahoo")
[1] "GSPC"
The issue now is that the frequency of the new object is monthly, while the rest of the dataset is quarterly. To convert this monthly data to quarterly, use the to.quarterly function from the quantmod package. With the following code, I extract the observation for the last month of each 3-month period:
QGSPC <- to.quarterly(GSPC)
QGSPC = Ad(QGSPC) # Keep only the adjusted column
names(QGSPC) = c("SP500")
head(QGSPC)
SP500
1999 Q4 1469.25
2000 Q1 1498.58
2000 Q2 1454.60
2000 Q3 1436.51
2000 Q4 1320.28
2001 Q1 1160.33
Now, calculate quarterly continuously compounded returns and annual cc returns using the log-difference function:
QGSPC$mkqret = diff(log(QGSPC$SP500))
QGSPC$mkyret = diff(log(QGSPC$SP500), lag=4)
QGSPC = QGSPC[2:nrow(QGSPC),] # Remove the first row with NA values
head(QGSPC)
SP500 mkqret mkyret
2000 Q1 1498.58 0.01976590 NA
2000 Q2 1454.60 -0.02978703 NA
2000 Q3 1436.51 -0.01251436 NA
2000 Q4 1320.28 -0.08437271 -0.1069082
2001 Q1 1160.33 -0.12913945 -0.2558136
2001 Q2 1224.38 0.05373019 -0.1722963
To merge the quarterly data with uspanel, we need a common column. Notice that QGSPC, being an xts object, has this information as an index, but not as an actual column.
QGSPCdf <- data.frame(qdate=index(QGSPC), coredata(QGSPC[,2:3]))
The index function gets the index content, while the coredata function gets only the column data of the dataset.
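As a quick toy illustration of that difference (the dates and values below are made up), index() returns the time index while coredata() returns only the data:
library(xts)  # xts is loaded as a dependency of quantmod
# Toy xts object with three made-up month-end observations
toy_xts <- xts(c(100, 102, 101),
               order.by = as.Date(c("2020-01-31", "2020-02-29", "2020-03-31")))
index(toy_xts)     # the time index (a Date vector)
coredata(toy_xts)  # only the numeric data, without the index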
In addition to being present in both datasets, the common column must have the same data type in each. I check which data type each quarter column has:
class(uspanel$q)
[1] "character"
class(QGSPCdf$qdate)
[1] "yearqtr"
The qdate column of the QGSPCdf dataset is a “yearqtr” variable, while the q column of the uspanel is a “character” variable. I need to decide which column to modify so that both have the same type and format.
I will create a q column in the QGSPCdf dataset with the same format as the q column in the uspanel.
The q in the uspanel is a character variable that starts with 4 digits for the year, followed by a “q” and then the number of the quarter. For example: 2020q1, 2020q2.
Now, I will create a new column q in the QGSPCdf dataset that follows this format:
# I use the year and quarter functions from the lubridate library
# The year function extracts the year of a date, and quarter extracts the quarter
QGSPCdf$q <- paste0(year(QGSPCdf$qdate),
                    "q",
                    quarter(QGSPCdf$qdate))
Let’s confirm everything is working ok:
class(QGSPCdf$q)
[1] "character"
class(uspanel$q)
[1] "character"
head(QGSPCdf$q)
[1] "2000q1" "2000q2" "2000q3" "2000q4" "2001q1" "2001q2"
head(uspanel$q)
[1] "2000q1" "2000q2" "2000q3" "2000q4" "2001q1" "2001q2"
Now, use a many-to-one merge to combine the datasets:
QGSPCdf = QGSPCdf[,c(-1)] # Remove the qdate column
uspanel <- left_join(uspanel, QGSPCdf, by="q")
Finally, display key columns for verification:
head(uspanel %>%
select(firm, q, adjprice, mkqret) %>%
filter(q == "2023q4"))
firm q adjprice mkqret
1 A 2023q4 138.54738 0.1064786
2 AA 2023q4 33.79498 0.1064786
3 AABA_old 2023q4 NA 0.1064786
4 AAC_old 2023q4 NA 0.1064786
5 AAIC_old 2023q4 4.84000 0.1064786
6 AAL 2023q4 13.74000 0.1064786
3.1.1 Note: Understanding left_join
The left_join function in R (from the dplyr package) is used to merge two datasets based on a common column. It keeps all rows from the left dataset (the first one) and matches the rows from the right dataset based on the common column. If there are no matching values in the right dataset, the result will contain NA values for the columns coming from the right dataset.
This is useful when you want to ensure that none of the rows from your primary dataset are lost during the merging process.
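A small toy example (made-up firm-quarter rows and a made-up market table) shows this behavior; the unmatched quarter gets NA:
# Toy many-to-one left join: every firm-quarter row is kept,
# and quarters without a match in 'market_toy' get NA for mkret
firms_toy  <- data.frame(firm = c("A", "A", "B"), q = c("2020q1", "2020q2", "2020q1"))
market_toy <- data.frame(q = "2020q1", mkret = 0.05)
left_join(firms_toy, market_toy, by = "q")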
3.1.2 Note: Understanding paste0
The paste0 function in R is used to concatenate (combine) multiple strings of text without any separator. It joins the elements together in the exact order they are provided.
For example, if you want to combine a year and a quarter to create a new column in the format “2020q1”, you can use paste0 like this:
paste0(2020, "q", 1)
[1] "2020q1"
This will return “2020q1”. Unlike paste(), which adds a space by default, paste0() does not add any spaces between the elements.
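A quick side-by-side comparison of the two functions:
paste(2020, "q", 1)   # "2020 q 1" (paste inserts a space by default)
paste0(2020, "q", 1)  # "2020q1"   (paste0 uses no separator)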
4 Challenge 3: Logistic Regression with Lagged Values
Design and run a logistic regression model to explore whether winsorized earnings per share deflated by price (epspw) is related to the probability that future quarterly stock returns will be higher than future market returns.
4.1 Dataset Creation Instructions
Create a dataset that includes the following columns:
- Future Quarterly Stock Return: Represents the stock return for the subsequent quarter.
- F1r_above_market:
  - 1 = Indicates that the stock beat the market in the corresponding quarter.
  - 0 = Indicates that the stock did not beat the market.
- Earnings Per Share Deflated by Price (epsp): Represents earnings per share divided by the stock price.
Ensure that each column is clearly defined and formatted for ease of understanding.
4.1.1 Hint: Instructions for Creating the Dependent Variable
Understand the Dependent Variable: In a logistic model, the dependent variable represents the probability that an event occurs. We will use a binary variable (1/0 or TRUE/FALSE) to represent this.
Define the Event: Declare that the EVENT occurs when the future stock return is higher than the future market return.
Independent Variables: Note that the independent variable(s) can be numeric or categorical, similar to a multiple regression model.
Create the Dependent Variable:
- Assign a value of 1 when the future stock return is higher than the future market return.
- Assign a value of 0 otherwise.
4.1.2 Suggested dplyr Functions to Use:
- mutate(): To create or modify columns in your dataset.
- if_else(): To conditionally assign values based on logical criteria.
- filter(): To subset your data based on specific conditions.
- select(): To choose specific columns from your dataset for analysis.
4.2 Solution
4.2.1 Creating the dependent variable
We will first create the quarterly continuously compounded stock return; we will later shift it forward to obtain the subsequent-quarter return.
uspanel <- uspanel %>%
  group_by(firm) %>%
  arrange(firm, q) %>%
  mutate(stockret = log(adjprice) - lag(log(adjprice)))
Confirm that the variable was appropriately created.
uspanel %>% select(firm, q, adjprice, stockret) %>%
  filter(firm=="AAPL", q>="2000q1") %>% head(10)
# A tibble: 10 × 4
# Groups: firm [1]
firm q adjprice stockret
<chr> <chr> <dbl> <dbl>
1 AAPL 2000q1 1.02 NA
2 AAPL 2000q2 0.790 -0.260
3 AAPL 2000q3 0.388 -0.710
4 AAPL 2000q4 0.224 -0.549
5 AAPL 2001q1 0.333 0.395
6 AAPL 2001q2 0.351 0.0521
7 AAPL 2001q3 0.234 -0.405
8 AAPL 2001q4 0.330 0.345
9 AAPL 2002q1 0.357 0.0777
10 AAPL 2002q2 0.267 -0.290
Now we can use the lead function (the opposite of lag) to obtain future values. We will now obtain the future stock return and the future market return.
uspanel <- uspanel %>%
  group_by(firm) %>%
  arrange(firm, q) %>%
  mutate(F1stockret = lead(stockret),
         F1mkqret = lead(mkqret))
Now I can create the binary variable as follows:
uspanel <- uspanel %>%
  mutate(F1r_above_market = ifelse(F1stockret > F1mkqret, 1, 0))

uspanel %>%
  select(firm, q, stockret, mkqret, F1stockret, F1mkqret, F1r_above_market) %>%
  head(8)
# A tibble: 8 × 7
# Groups: firm [1]
firm q stockret mkqret F1stockret F1mkqret F1r_above_market
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 2000q1 NA 0.0198 -0.344 -0.0298 0
2 A 2000q2 -0.344 -0.0298 -0.410 -0.0125 0
3 A 2000q3 -0.410 -0.0125 0.112 -0.0844 1
4 A 2000q4 0.112 -0.0844 -0.578 -0.129 0
5 A 2001q1 -0.578 -0.129 0.0560 0.0537 1
6 A 2001q2 0.0560 0.0537 -0.508 -0.162 0
7 A 2001q3 -0.508 -0.162 0.377 0.0980 1
8 A 2001q4 0.377 0.0980 0.204 -0.000601 1
Now we can check the number of observations where the event occurs:
table(uspanel$F1r_above_market)
0 1
127791 118460
Now, let’s run the logistic regression:
# Running the model with the winsorized epsp:
logit <- glm(F1r_above_market ~ epspw, data = uspanel, family = "binomial", na.action = na.omit)
summary(logit)
Call:
glm(formula = F1r_above_market ~ epspw, family = "binomial",
data = uspanel, na.action = na.omit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.055865 0.004453 -12.55 <2e-16 ***
epspw 2.058406 0.047745 43.11 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 284792 on 205645 degrees of freedom
Residual deviance: 282880 on 205644 degrees of freedom
(351535 observations deleted due to missingness)
AIC: 282884
Number of Fisher Scoring iterations: 4
Let’s compare the model with the case when we use the original epsp variable:
# Running the model with epsp:
logit_2 <- glm(F1r_above_market ~ epsp, data = uspanel, family = "binomial", na.action = na.omit)
summary(logit_2)
Call:
glm(formula = F1r_above_market ~ epsp, family = "binomial", data = uspanel,
na.action = na.omit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.562e-02 4.413e-03 -17.134 <2e-16 ***
epsp 1.275e-06 4.352e-06 0.293 0.77
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 284792 on 205645 degrees of freedom
Residual deviance: 284792 on 205644 degrees of freedom
(351535 observations deleted due to missingness)
AIC: 284796
Number of Fisher Scoring iterations: 3
4.2.2 Logistic Regression: Winsorized vs. Original EPS
In this analysis, we compare the results of a logistic regression model using both the winsorized EPS (epspw) and the original EPS (epsp). The results show key differences in the performance of the two models, particularly in the magnitude and significance of the coefficients.
4.2.2.1 Key Findings:
- Coefficient for Winsorized EPS: In the model with epspw, the coefficient was 2.058406 and was significant at the 1% level. This indicates a strong and statistically significant relationship between winsorized EPS and the likelihood that the stock will outperform the market.
- Coefficient for Original EPS: In contrast, the coefficient for epsp in the second model was 1.275e-06 and was not significant. This suggests that without winsorizing, the model struggles to detect a reliable relationship between EPS and the probability of stock outperformance.
4.2.2.2 Interpretation of Logistic Regression:
- Sign and Magnitude: The positive coefficient for epspw indicates that as earnings per share deflated by price increases, the likelihood of the stock outperforming the market also increases. The magnitude of 2.058406 reflects a much stronger effect compared to the original EPS coefficient, which is near zero.
- Significance: The winsorized model is significant at the 1% level, meaning there is strong evidence that winsorized EPS is an important predictor of stock outperformance. In contrast, the original EPS model does not show a statistically significant relationship, making it less reliable for interpretation.
- Odds Ratios: To interpret the magnitude of the effect, we calculate the odds ratio (see the formula sketch right after this list):
  - For the winsorized model, the odds ratio is exp(2.058406). This indicates that for every unit increase in epspw, the odds of the stock beating the market increase by a substantial amount.
  - For the original EPS model, the odds ratio is exp(1.275e-06), which is essentially 1, suggesting no meaningful change in the odds when using the original EPS variable.
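In equation form (using the coefficient estimated above), the logistic regression models the log-odds as a linear function of epspw, so the odds are multiplied by e^{\beta_1} for a one-unit increase, and by e^{\Delta x \, \beta_1} for an increase of \Delta x:
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \, \text{epspw} \quad\Rightarrow\quad \frac{\text{odds}(\text{epspw} + \Delta x)}{\text{odds}(\text{epspw})} = e^{\Delta x \, \beta_1}, \qquad e^{\beta_1} = e^{2.058406} \approx 7.83.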
# Calculate the odds ratio from the beta coefficient
odds_ratio <- exp(coef(logit)["epspw"])
odds_ratio
epspw
7.83347
4.2.2.3 Odds Ratios:
In logistic regression, the odds ratio helps us understand how changes in the independent variable (in this case, epspw) affect the likelihood of the event occurring, which here is whether a stock will beat the market.
- The odds ratio for epspw is 7.83347. This means that for every 1-unit increase in the winsorized EPS, the odds of the stock outperforming the market in the next quarter increase by 683%.
- However, since epspw typically moves between -0.30 and 0.30, the actual effect on the odds is more moderate within this range. For example, if epspw increases by 0.1 units, the odds of the stock outperforming the market would increase by:
# Calculate odds ratio for a 0.1 unit increase in epspw
odds_ratio_0.1 <- exp(coef(logit)["epspw"] * 0.1)
odds_ratio_0.1
epspw
1.228557
Interpretation: With a 0.1 unit increase in epspw, the odds of the stock beating the market increase by approximately 22.86%. Although the odds ratio for a full unit increase is large, within the typical range of the variable (from -0.3 to 0.3), smaller changes in EPS still have a meaningful but more moderate impact on the odds of outperforming the market.
4.2.2.4 Summary:
The odds ratio for the winsorized EPS variable means that even a small increase in earnings per share deflated by price noticeably raises the odds of a stock outperforming the market.
This strong odds ratio, combined with the model’s statistical significance at the 1% level, underscores the robustness of the winsorized model compared to the original model.
5 Challenge 4: Running Your First Machine Learning Model
You will create a dataset to predict whether a stock will beat the market. Steps include:
- Create a training and testing sample (80/20 split).
- Run the logistic regression on the training sample.
- Create and interpret the confusion matrix.
Note: Review Chapter 2 of the Datacamp course: “Machine Learning with Caret” and the Workshop Introduction to Machine Learning using Caret for more details.
5.1 Data Preprocessing
Before applying machine learning algorithms, it is essential to preprocess the data. We’ll perform steps such as splitting the data into training and test sets, scaling features, and handling missing values (if any).
- Training data: The model learns from this data.
- Test data: We check how well the model performs on this data after it has been trained.
Before applying machine learning algorithms, we will prepare the dataset. First, we will shuffle the data and then create the training and test subsets.
set.seed(123456)
# Shuffle row indices:
rows_shuffled <- sample(nrow(uspanel))

# Randomly order data
shuffled_uspanel <- uspanel[rows_shuffled, ]
This chunk of code randomly shuffles the row indices of the uspanel data frame and stores them in the rows_shuffled variable.
- nrow(uspanel): Gets the total number of rows in the uspanel data frame. nrow() is a function that returns the number of rows in a data frame or matrix.
- sample(nrow(uspanel)): The sample() function is used to randomly shuffle or sample elements. In this case, it generates a random permutation of the numbers from 1 to the number of rows in uspanel. This effectively shuffles the row indices (a tiny illustration follows below).
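A tiny illustration of what sample(n) returns (the seed is arbitrary and only makes this illustration reproducible):
set.seed(1)  # arbitrary seed for this illustration
sample(5)    # the integers 1 to 5 in a random order (a permutation)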
I will keep only the variables used in the exercise:
shuffled_uspanel <- shuffled_uspanel %>%
  select(firm, q, epspw, F1r_above_market, stockret, mkqret)
head(shuffled_uspanel)
# A tibble: 6 × 6
# Groups: firm [6]
firm q epspw F1r_above_market stockret mkqret
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 GEN__old 2019q2 -0.100 0 -0.150 0.0372
2 FIGS 2001q2 NA NA NA 0.0537
3 FBLG 2018q1 NA NA NA -0.0123
4 NC 2008q1 NA NA NA -0.104
5 BBOX__old 2002q1 0.0635 0 -0.0881 -0.000601
6 STML_old 2010q4 NA NA NA 0.0972
To create the training and testing sets, let's use an 80/20 split. First, determine the row at which to split the dataset:
# Split the data into training and test sets (80/20 split)
# Determine row to split on: split
split <- round(nrow(shuffled_uspanel) * .80)
split
[1] 445745
And now we can create the two different subsets.
# Create train
train <- shuffled_uspanel[1:split, ]

# Create test
test <- shuffled_uspanel[(split+1):nrow(shuffled_uspanel), ]
5.2 Model Training
Re-run the logit model for the training dataset:
logit_train <- glm(F1r_above_market ~ epspw, data = train, family = "binomial", na.action = na.omit)
summary(logit_train)
Call:
glm(formula = F1r_above_market ~ epspw, family = "binomial",
data = train, na.action = na.omit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.05148 0.00498 -10.34 <2e-16 ***
epspw 2.07493 0.05339 38.87 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 227682 on 164388 degrees of freedom
Residual deviance: 226128 on 164387 degrees of freedom
(281356 observations deleted due to missingness)
AIC: 226132
Number of Fisher Scoring iterations: 4
Now I use this model to predict the probability that each stock beats the market, this time on the test dataset:
test$F1predprob = predict(logit_train, newdata = test, type = "response")
head(test)
# A tibble: 6 × 7
# Groups: firm [6]
firm q epspw F1r_above_market stockret mkqret F1predprob
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 VHC 2002q1 NA NA NA -0.000601 NA
2 MMI 2020q1 0.0123 0 -0.318 -0.223 0.494
3 NWPX 2002q4 NA NA NA 0.0762 NA
4 JNY_old 2000q3 0.0758 1 0.120 -0.0125 0.526
5 RBLX 2012q1 NA NA NA 0.113 NA
6 MTCH 2011q1 NA NA NA 0.0528 NA
I will set the cut-off probability above which the model predicts that the firm will beat the market:
# Create F1r_above_market_pred
test <- test %>%
  mutate(F1r_above_market_pred = ifelse(F1predprob > 0.5, 1, 0))
head(test)
# A tibble: 6 × 8
# Groups: firm [6]
firm q epspw F1r_above_market stockret mkqret F1predprob
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 VHC 2002q1 NA NA NA -0.000601 NA
2 MMI 2020q1 0.0123 0 -0.318 -0.223 0.494
3 NWPX 2002q4 NA NA NA 0.0762 NA
4 JNY_old 2000q3 0.0758 1 0.120 -0.0125 0.526
5 RBLX 2012q1 NA NA NA 0.113 NA
6 MTCH 2011q1 NA NA NA 0.0528 NA
# ℹ 1 more variable: F1r_above_market_pred <dbl>
Now I have a column for the actual binary variable (whether the stock beat the market), and also a predicted binary variable using the model and the test dataset.
5.3 Model evaluation
I can now create a Confusion Matrix.
I need to convert the binary variables to factor-type variables:
test$F1r_above_market = factor(test$F1r_above_market, levels = c("1","0"))
test$F1r_above_market_pred = factor(test$F1r_above_market_pred, levels = c("1","0"))
# When using the factor function, the first value of levels must be the POSITIVE value; in this case, = 1

# Create confusion matrix
CM1 <- confusionMatrix(test$F1r_above_market, test$F1r_above_market_pred, positive = '1')
CM1
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 7848 11831
0 6726 14852
Accuracy : 0.5502
95% CI : (0.5454, 0.555)
No Information Rate : 0.6468
P-Value [Acc > NIR] : 1
Kappa : 0.0881
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.5385
Specificity : 0.5566
Pos Pred Value : 0.3988
Neg Pred Value : 0.6883
Prevalence : 0.3532
Detection Rate : 0.1902
Detection Prevalence : 0.4770
Balanced Accuracy : 0.5476
'Positive' Class : 1
5.3.1 INTERPRETATION
The diagonal of the confusion matrix contains the cases where my model CORRECTLY PREDICTED whether the stock beat or did not beat the market.
Analyzing the confusion matrix, we observe the following:
- The sum of the FIRST COLUMN is 14,574, representing the number of cases when the stock actually BEAT the market.
- The sum of the SECOND COLUMN is 26,683, representing the number of cases when the stock actually DID NOT BEAT the market.
From the 14,574 cases when the stocks ACTUALLY BEAT THE MARKET, the model CORRECTLY PREDICTED 7,848 cases. These cases are referred to as TRUE POSITIVES. The rate of TRUE POSITIVES relative to ALL POSITIVES is known as Sensitivity, calculated as:
\text{Sensitivity Rate} = \frac{\text{TRUE POSITIVE}}{\text{TRUE POSITIVE} + \text{FALSE NEGATIVE}} \approx \frac{7848}{7848 + 6726} \approx 0.5385.
Out of the 26,683 cases when the stock DID NOT BEAT THE MARKET, the model CORRECTLY PREDICTED 14,852 cases, referred to as TRUE NEGATIVES. The rate of TRUE NEGATIVES relative to ALL NEGATIVES is known as Specificity, calculated as:
\text{Specificity Rate} = \frac{\text{TRUE NEGATIVE}}{\text{TRUE NEGATIVE} + \text{FALSE POSITIVE}} \approx \frac{14852}{14852 + 11831} \approx 0.5566.
Out of the 26,683 cases when the stock DID NOT BEAT THE MARKET, the model WRONGLY PREDICTED 11,831 cases as beating the market; these are known as FALSE POSITIVES.
Conversely, out of the 14,574 cases when the stock ACTUALLY BEAT THE MARKET, the model WRONGLY PREDICTED 6,726 cases as not beating the market; these are referred to as FALSE NEGATIVES.
If Sensitivity is greater than Specificity, it indicates that the model is more effective at predicting POSITIVE CASES (when a stock actually beats the market) than at predicting NEGATIVE CASES (when a stock does not beat the market).
5.3.1.1 Additional Ratios
The POSITIVE PREDICTIVE VALUE is defined as:
\text{Pos Pred Value} = \frac{\text{TRUE POSITIVE}}{\text{TRUE POSITIVE} + \text{FALSE POSITIVE}} \approx \frac{7848}{7848 + 11831} \approx 0.3988.
The Positive Predictive Value (PPV) represents the proportion of predicted positive cases (stocks predicted to beat the market) that are actually positive. In this case, the PPV is approximately 39.88%.
- The PPV of 39.88% indicates that when the model predicts a stock will beat the market, it is correct about 39.88% of the time. This means that roughly 4 out of 10 stocks identified by the model as outperforming the market actually do so.
The NEGATIVE PREDICTIVE VALUE is defined as:
\text{Neg Pred Value} = \frac{\text{TRUE NEGATIVE}}{\text{TRUE NEGATIVE} + \text{FALSE NEGATIVE}} \approx \frac{14852}{14852 + 6726} \approx 0.6883.
The Negative Predictive Value (NPV) measures the proportion of predicted negative cases (stocks predicted to not beat the market) that are actually negative. In this case, the NPV is approximately 68.83%.
- The NPV of 68.83% means that when the model predicts a stock will not beat the market, it is correct about 68.83% of the time. This implies that when the model identifies a stock as underperforming, it is accurate in nearly 7 out of 10 cases.
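As a quick sanity check, these four rates can be recomputed directly from the counts as labeled in the interpretation above (TP = 7,848, FN = 6,726, FP = 11,831, TN = 14,852):
# Recompute the four rates from the confusion-matrix counts used above
TP <- 7848; FN <- 6726; FP <- 11831; TN <- 14852
c(Sensitivity  = TP / (TP + FN),   # ~0.5385
  Specificity  = TN / (TN + FP),   # ~0.5566
  PosPredValue = TP / (TP + FP),   # ~0.3988
  NegPredValue = TN / (TN + FN))   # ~0.6883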
6 W3 submission
The grade for this workshop will be as follows:
Complete (100%): If you submit an ORIGINAL and COMPLETE notebook with all the activities, your notes, and your OWN RESPONSES to questions. Submitting someone else’s work or having two or more people submit the same answers (exact texts or with minor changes or translations) will be considered plagiarism and a violation of academic integrity (FIA).
Incomplete (75%): If you submit an ORIGINAL notebook with ALL the activities but did NOT RESPOND to the questions and/or did not complete all activities, or if you only responded to some questions. Remember, submitting identical or slightly altered work from another person is considered a breach of academic integrity (FIA).
Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop or if you completed more but parts of your work are copy-pasted from other notebooks. This too will be considered an FIA if found to be plagiarized.
Not submitted (0%): No submission, no credit.
It is important to remember that submitting someone else’s work, or identical work (even with minor changes or translations) between two or more students, is strictly prohibited and will be considered an academic integrity violation (FIA).
Remember that you have to submit the Google Colab LINK to your notebook, and you must SHARE it with me, ensuring that I have edit access.