# Check if the dplyr package is installed; if not, install it
if(!require(dplyr)) install.packages("dplyr")
# Load the dplyr package
library(dplyr)
# Check if the quantmod package is installed; if not, install it
if(!require(quantmod)) install.packages("quantmod")
# Load the quantmod package
library(quantmod)
# Check if the lubridate package is installed; if not, install it
if(!require(lubridate)) install.packages("lubridate")
# Load the lubridate package
library(lubridate)
# Check if the statar package is installed; if not, install it
if(!require(statar)) install.packages("statar")
# Load the statar package
library(statar)
# Install and load caret
if (!require(caret)) {
install.packages("caret")
library(caret)
}
# Install and load ggplot2
if (!require(ggplot2)) {
install.packages("ggplot2")
library(ggplot2)
}
Workshop 3 Solution
FZ2022 Algorithms and Data Analytics
0.1 General Directions for each workshop
You have to work on Google Colab for all your workshops. In Google Colab, you MUST LOGIN with your @tec.mx account and then create a Google Colab notebook for each workshop.
You must share each Colab document (workshop) with me via e-mail:
sergio.castellanos@tec.mx
You must give Edit privileges to this account.
In Google Colab, you can work with Python or R notebooks. The default is Python notebooks, so you must go to “Runtime” or “Entorno de Ejecución”, select “Change runtime type” or “Cambiar entorno de ejecución”, and then select R (instead of Python).
Your notebook will have a default name like “Untitled3.ipynb”. Click on this name and change it to “W3_FZ2022_YourFirstName_YourLastname”.
Pay attention in class to learn how to write text and R code into your notebook.
In your Workshop Notebook, you have to:
- Replicate all the R code along with its output.
- You must complete every question and/or challenge asked in the workshop. A challenge can include specific questions or an exercise to complete.
For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.
- It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your personal notebook to study for the FINAL EXAM. Your own workshop/notebook will be very helpful for further study.
Once you finish your workshop, make sure that you RUN ALL CHUNKS. You can run each code chunk by clicking on the “Run” button located in the top-left section of each chunk. You can also run all the chunks at once with Ctrl-F9. You must submit the web link of your Google Colab workshop to Canvas.
1 Introduction
This is an individual workshop. In this session, we will continue practicing data management for large-scale financial data. We will also learn about the logistic regression model and get an introduction to machine learning.
We will work with the dataset from the Final Case Study (Situación Problema), dataus2024, which contains historical data from Q1 2010 to Q4 2023. The key topics covered in this workshop include:
- Winsorization: What it is and why it’s important
- Estimating and interpreting a logistic regression
- Training vs. testing samples
- Confusion matrix creation and interpretation
1.1 Loading libraries
Load (and install if needed) the libraries we will use for the workshop. The corresponding setup chunk appears at the top of this document.
2 Challenge 1: Winsorization of Variables
2.1 Winsorization
Winsorization is a technique used to limit extreme values in your data, reducing the influence of outliers and improving the robustness of your analysis. By capping the data at a certain percentile, it helps avoid distortion in statistical models.
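As a rough illustration of the idea (this is not the statar function we use below), capping at fixed percentiles can be done by hand in base R; the cutoff levels, function name, and toy vector here are made up for the example:
# Illustrative only: cap a numeric vector at its 2nd and 98th percentiles
cap_at_percentiles <- function(x, lower = 0.02, upper = 0.98) {
  cutoffs <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[x < cutoffs[1]] <- cutoffs[1]  # floor extreme low values
  x[x > cutoffs[2]] <- cutoffs[2]  # cap extreme high values
  x
}
# Toy vector with one extreme outlier; the outlier is pulled back to the 98th percentile
x_toy <- c(rnorm(100), 50)
summary(cap_at_percentiles(x_toy))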
For a detailed explanation and practical example, refer to Section 5.8 of the Workshop 2 solution, which you can find here.
To practice winsorization, you need to install the statar package and winsorize the following ratio:
- Earnings per share deflated by price
The winsorize function from statar automatically determines a suitable level of winsorization based on the variable’s distribution.
The first step in winsorizing the variable is to create the variable itself. Let’s start by downloading the dataset.
# Download the CSV file from the website:
download.file("http://www.apradie.com/datos/dataus2024.csv", "dataus2024.csv")
download.file("http://www.apradie.com/datos/firmsus2024.csv", "firmsus2024.csv")
# Import the panel data
uspanel <- read.csv("dataus2024.csv")
# Import the firm catalog
usfirms <- read.csv("firmsus2024.csv")
The uspanel dataset contains financial data for US public firms listed on the NYSE and NASDAQ. The usfirms dataset is a catalog with general information about these firms.
Now, create the variables needed for the ratio:
# Adding the variables for gross profit, EBIT and Net Income:
uspanel <- uspanel %>%
  mutate(
    grossprofit = revenue - cogs,
    ebit = grossprofit - sgae,
    netincome = ebit + otherincome + extraordinaryitems - finexp - incometax
  )
Creating the ratio
# I generate EPS and EPSP (deflated by price)
uspanel <- uspanel %>%
  mutate(
    eps = ifelse(sharesoutstanding == 0, NA, netincome / sharesoutstanding),
    epsp = ifelse(originalprice == 0, NA, eps / originalprice)
  )
Winsorizing the variable:
uspanel <- uspanel %>%
  mutate(epspw = winsorize(epsp))
Comparing the unwinsorized and winsorized variables:
hist(uspanel$epsp)
hist(uspanel$epspw)
The winsorized range for earnings per share (EPS) deflated by price, which spans from -0.30 to around 0.30, is useful for understanding a company’s profitability relative to its stock price. A value of +0.30, for example, suggests that an investor might expect approximately 30 cents in earnings for every $1.00 invested in the stock.
Now, if we compare this to a much wider range, such as from -6e+6 to 4e+6, the latter reflects an extreme outlier scenario where the EPS deflated by price is either highly negative or overly positive. Such a broad range would likely indicate significant data distortions, such as massive losses or inflated earnings figures, which are far less common in most companies. A range like -6 million to 4 million would suggest wild fluctuations that are impractical for typical financial analysis and likely reflect rare or extraordinary circumstances, possibly due to accounting anomalies or firms in severe distress.
The narrower range of -0.30 to 0.30, by contrast, is more realistic and interpretable, reflecting moderate fluctuations in profitability without the noise of extreme values, making it more reliable for assessing how well companies convert their earnings into shareholder value in relation to their stock prices.
3 Challenge 2: Algorithm for Many-to-One Merge
Write a data management algorithm to do the following:
- Download the monthly S&P500 index (^GSPC) from 1999 to the present.
- Convert (collapse) this data from monthly to quarterly by selecting the last index of each quarter.
- Calculate quarterly market returns.
- Perform a many-to-one merge to add a new column to the uspanel dataset, containing the market’s quarterly returns.
3.1 Step-by-step Guide:
Download the monthly S&P500 index data from Q4 1999 to Q2 2024:
getSymbols("^GSPC", from="1999-10-01", to= "2024-06-30",
periodicity="monthly", src="yahoo")
[1] "GSPC"
The issue now is that the frequency of the new object is monthly, while the rest of the dataset is quarterly. To convert this monthly data to quarterly, use the to.quarterly function from the quantmod package. With the following code, I extract the observation for the last month of each 3-month period:
QGSPC <- to.quarterly(GSPC)
QGSPC = Ad(QGSPC) # Keep only the adjusted column
names(QGSPC) = c("SP500")
head(QGSPC)
SP500
1999 Q4 1469.25
2000 Q1 1498.58
2000 Q2 1454.60
2000 Q3 1436.51
2000 Q4 1320.28
2001 Q1 1160.33
Now, calculate quarterly continuously compounded returns and annual cc returns using the log-difference function:
QGSPC$mkqret = diff(log(QGSPC$SP500))
QGSPC$mkyret = diff(log(QGSPC$SP500), lag=4)
QGSPC = QGSPC[2:nrow(QGSPC),] # Remove the first row with NA values
head(QGSPC)
SP500 mkqret mkyret
2000 Q1 1498.58 0.01976590 NA
2000 Q2 1454.60 -0.02978703 NA
2000 Q3 1436.51 -0.01251436 NA
2000 Q4 1320.28 -0.08437271 -0.1069082
2001 Q1 1160.33 -0.12913945 -0.2558136
2001 Q2 1224.38 0.05373019 -0.1722963
To merge the quarterly data with uspanel, we need a common column. Notice that QGSPC, being an xts object, has this information as an index, but not as an actual column.
QGSPCdf <- data.frame(qdate=index(QGSPC), coredata(QGSPC[,2:3]))
The index function gets the index content, while the coredata function gets only the column data of the dataset.
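As a quick toy illustration of that difference (the dates and values below are made up), index() returns the time index while coredata() returns only the data:
library(xts)  # xts is loaded as a dependency of quantmod
# Toy xts object with three made-up month-end observations
toy_xts <- xts(c(100, 102, 101),
               order.by = as.Date(c("2020-01-31", "2020-02-29", "2020-03-31")))
index(toy_xts)     # the time index (a Date vector)
coredata(toy_xts)  # only the numeric data, without the index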
In addition to being present in both datasets, the common column must have the same data type in each. I check which data type each quarter column has:
class(uspanel$q)
[1] "character"
class(QGSPCdf$qdate)
[1] "yearqtr"
The qdate column of the QGSPCdf dataset is a “yearqtr” variable, while the q column of the uspanel is a “character” variable. I need to decide which column to modify so that both have the same type and format.
I will create a q column in the QGSPCdf dataset with the same format as the q column in the uspanel.
The q in the uspanel is a character variable that starts with 4 digits for the year, followed by a “q” and then the number of the quarter. For example: 2020q1, 2020q2.
Now, I will create a new column q in the QGSPCdf dataset that follows this format:
# I use the year and quarter functions from the lubridate library
# The year function extracts the year of a date, and quarter extracts the quarter
QGSPCdf$q <- paste0(year(QGSPCdf$qdate),
                    "q",
                    quarter(QGSPCdf$qdate))
Let’s confirm everything is working ok:
class(QGSPCdf$q)
[1] "character"
class(uspanel$q)
[1] "character"
head(QGSPCdf$q)
[1] "2000q1" "2000q2" "2000q3" "2000q4" "2001q1" "2001q2"
head(uspanel$q)
[1] "2000q1" "2000q2" "2000q3" "2000q4" "2001q1" "2001q2"
Now, use a many-to-one merge to combine the datasets:
QGSPCdf = QGSPCdf[,c(-1)] # Remove the qdate column
uspanel <- left_join(uspanel, QGSPCdf, by="q")
Finally, display key columns for verification:
head(uspanel %>%
select(firm, q, adjprice, mkqret) %>%
filter(q == "2023q4"))
firm q adjprice mkqret
1 A 2023q4 138.54738 0.1064786
2 AA 2023q4 33.79498 0.1064786
3 AABA_old 2023q4 NA 0.1064786
4 AAC_old 2023q4 NA 0.1064786
5 AAIC_old 2023q4 4.84000 0.1064786
6 AAL 2023q4 13.74000 0.1064786
3.1.1 Note: Understanding left_join
The left_join function in R (from the dplyr package) is used to merge two datasets based on a common column. It keeps all rows from the left dataset (the first one) and matches the rows from the right dataset based on the common column. If there are no matching values in the right dataset, the result will contain NA values for the columns coming from the right dataset.
This is useful when you want to ensure that none of the rows from your primary dataset are lost during the merging process.
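A small toy example (made-up firm-quarter rows and a made-up market table) shows this behavior; the unmatched quarter gets NA:
# Toy many-to-one left join: every firm-quarter row is kept,
# and quarters without a match in 'market_toy' get NA for mkret
firms_toy  <- data.frame(firm = c("A", "A", "B"), q = c("2020q1", "2020q2", "2020q1"))
market_toy <- data.frame(q = "2020q1", mkret = 0.05)
left_join(firms_toy, market_toy, by = "q")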
3.1.2 Note: Understanding paste0
The paste0 function in R is used to concatenate (combine) multiple strings of text without any separator. It joins the elements together in the exact order they are provided.
For example, if you want to combine a year and a quarter to create a new column in the format “2020q1”, you can use paste0 like this:
paste0(2020, "q", 1)
[1] "2020q1"
This will return “2020q1”. Unlike paste(), which adds a space by default, paste0() does not add any spaces between the elements.
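A quick side-by-side comparison of the two functions:
paste(2020, "q", 1)   # "2020 q 1" (paste inserts a space by default)
paste0(2020, "q", 1)  # "2020q1"   (paste0 uses no separator)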
4 Challenge 3: Logistic Regression with Lagged Values
Design and run a logistic regression model to explore whether winsorized earnings per share deflated by price (epspw) is related to the probability that future quarterly stock returns will be higher than future market returns.
4.1 Dataset Creation Instructions
Create a dataset that includes the following columns:
- Future Quarterly Stock Return: Represents the stock return for the subsequent quarter.
- F1r_above_market:
  - 1 = Indicates that the stock beat the market in the corresponding quarter.
  - 0 = Indicates that the stock did not beat the market.
- Earnings Per Share Deflated by Price (epsp): Represents earnings per share divided by the stock price.
Ensure that each column is clearly defined and formatted for ease of understanding.
4.1.1 Hint: Instructions for Creating the Dependent Variable
Understand the Dependent Variable: In a logistic model, the dependent variable represents the probability that an event occurs. We will use a binary variable (1/0 or TRUE/FALSE) to represent this.
Define the Event: Declare that the EVENT occurs when the future stock return is higher than the future market return.
Independent Variables: Note that the independent variable(s) can be numeric or categorical, similar to a multiple regression model.
Create the Dependent Variable:
- Assign a value of 1 when the future stock return is higher than the future market return.
- Assign a value of 0 otherwise.
4.1.2 Suggested dplyr Functions to Use:
- mutate(): To create or modify columns in your dataset.
- if_else(): To conditionally assign values based on logical criteria.
- filter(): To subset your data based on specific conditions.
- select(): To choose specific columns from your dataset for analysis.
4.2 Solution
4.2.1 Creating the dependent variable
We will first create the quarterly continuously compounded stock return; we will later shift it forward to obtain the subsequent-quarter return.
uspanel <- uspanel %>%
  group_by(firm) %>%
  arrange(firm, q) %>%
  mutate(stockret = log(adjprice) - lag(log(adjprice)))
Confirm that the variable was appropriately created.
uspanel %>% select(firm, q, adjprice, stockret) %>%
  filter(firm=="AAPL", q>="2000q1") %>% head(10)
# A tibble: 10 × 4
# Groups: firm [1]
firm q adjprice stockret
<chr> <chr> <dbl> <dbl>
1 AAPL 2000q1 1.02 NA
2 AAPL 2000q2 0.790 -0.260
3 AAPL 2000q3 0.388 -0.710
4 AAPL 2000q4 0.224 -0.549
5 AAPL 2001q1 0.333 0.395
6 AAPL 2001q2 0.351 0.0521
7 AAPL 2001q3 0.234 -0.405
8 AAPL 2001q4 0.330 0.345
9 AAPL 2002q1 0.357 0.0777
10 AAPL 2002q2 0.267 -0.290
Now we can use the lead function (the opposite of lag) to obtain future values. We will now obtain the future stock return and the future market return.
uspanel <- uspanel %>%
  group_by(firm) %>%
  arrange(firm, q) %>%
  mutate(F1stockret = lead(stockret),
         F1mkqret = lead(mkqret))
Now I can create the binary variable as follows:
uspanel <- uspanel %>%
  mutate(F1r_above_market = ifelse(F1stockret > F1mkqret, 1, 0))

uspanel %>%
  select(firm, q, stockret, mkqret, F1stockret, F1mkqret, F1r_above_market) %>%
  head(8)
# A tibble: 8 × 7
# Groups: firm [1]
firm q stockret mkqret F1stockret F1mkqret F1r_above_market
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 2000q1 NA 0.0198 -0.344 -0.0298 0
2 A 2000q2 -0.344 -0.0298 -0.410 -0.0125 0
3 A 2000q3 -0.410 -0.0125 0.112 -0.0844 1
4 A 2000q4 0.112 -0.0844 -0.578 -0.129 0
5 A 2001q1 -0.578 -0.129 0.0560 0.0537 1
6 A 2001q2 0.0560 0.0537 -0.508 -0.162 0
7 A 2001q3 -0.508 -0.162 0.377 0.0980 1
8 A 2001q4 0.377 0.0980 0.204 -0.000601 1
Now we can check the number of observations where the event occurs:
table(uspanel$F1r_above_market)
0 1
127791 118460
Now, let’s run the logistic regression:
# Running the model with the winsorized epsp:
logit <- glm(F1r_above_market ~ epspw, data = uspanel, family = "binomial", na.action = na.omit)
summary(logit)
Call:
glm(formula = F1r_above_market ~ epspw, family = "binomial",
data = uspanel, na.action = na.omit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.055865 0.004453 -12.55 <2e-16 ***
epspw 2.058406 0.047745 43.11 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 284792 on 205645 degrees of freedom
Residual deviance: 282880 on 205644 degrees of freedom
(351535 observations deleted due to missingness)
AIC: 282884
Number of Fisher Scoring iterations: 4
Let’s compare the model with the case when we use the original epsp variable:
# Running the model with epsp:
logit_2 <- glm(F1r_above_market ~ epsp, data = uspanel, family = "binomial", na.action = na.omit)
summary(logit_2)
Call:
glm(formula = F1r_above_market ~ epsp, family = "binomial", data = uspanel,
na.action = na.omit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.562e-02 4.413e-03 -17.134 <2e-16 ***
epsp 1.275e-06 4.352e-06 0.293 0.77
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 284792 on 205645 degrees of freedom
Residual deviance: 284792 on 205644 degrees of freedom
(351535 observations deleted due to missingness)
AIC: 284796
Number of Fisher Scoring iterations: 3
4.2.2 Logistic Regression: Winsorized vs. Original EPS
In this analysis, we compare the results of a logistic regression model using both the winsorized EPS (epspw) and the original EPS (epsp). The results show key differences in the performance of the two models, particularly in the magnitude and significance of the coefficients.
4.2.2.1 Key Findings:
- Coefficient for Winsorized EPS: In the model with epspw, the coefficient was 2.058406 and was significant at the 1% level. This indicates a strong and statistically significant relationship between winsorized EPS and the likelihood that the stock will outperform the market.
- Coefficient for Original EPS: In contrast, the coefficient for epsp in the second model was 1.275e-06 and was not significant. This suggests that without winsorizing, the model struggles to detect a reliable relationship between EPS and the probability of stock outperformance.
4.2.2.2 Interpretation of Logistic Regression:
- Sign and Magnitude: The positive coefficient for epspw indicates that as earnings per share deflated by price increases, the likelihood of the stock outperforming the market also increases. The magnitude of 2.058406 reflects a much stronger effect compared to the original EPS coefficient, which is near zero.
- Significance: The winsorized model is significant at the 1% level, meaning there is strong evidence that winsorized EPS is an important predictor of stock outperformance. In contrast, the original EPS model does not show a statistically significant relationship, making it less reliable for interpretation.
- Odds Ratios: To interpret the magnitude of the effect, we calculate the odds ratio (see the formula sketch right after this list):
  - For the winsorized model, the odds ratio is exp(2.058406). This indicates that for every unit increase in epspw, the odds of the stock beating the market increase by a substantial amount.
  - For the original EPS model, the odds ratio is exp(1.275e-06), which is essentially 1, suggesting no meaningful change in the odds when using the original EPS variable.
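In equation form (using the coefficient estimated above), the logistic regression models the log-odds as a linear function of epspw, so the odds are multiplied by e^{\beta_1} for a one-unit increase, and by e^{\Delta x \, \beta_1} for an increase of \Delta x:
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \, \text{epspw} \quad\Rightarrow\quad \frac{\text{odds}(\text{epspw} + \Delta x)}{\text{odds}(\text{epspw})} = e^{\Delta x \, \beta_1}, \qquad e^{\beta_1} = e^{2.058406} \approx 7.83.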
# Calculate the odds ratio from the beta coefficient
odds_ratio <- exp(coef(logit)["epspw"])
odds_ratio
epspw
7.83347
4.2.2.3 Odds Ratios:
In logistic regression, the odds ratio helps us understand how changes in the independent variable (in this case, epspw) affect the likelihood of the event occurring, which here is whether a stock will beat the market.
- The odds ratio for epspw is 7.83347. This means that for every 1-unit increase in the winsorized EPS, the odds of the stock outperforming the market in the next quarter increase by 683%.
- However, since epspw typically moves between -0.30 and 0.30, the actual effect on the odds is more moderate within this range. For example, if epspw increases by 0.1 units, the odds of the stock outperforming the market would increase by:
# Calculate odds ratio for a 0.1 unit increase in epspw
odds_ratio_0.1 <- exp(coef(logit)["epspw"] * 0.1)
odds_ratio_0.1
epspw
1.228557
Interpretation: With a 0.1 unit increase in epspw, the odds of the stock beating the market increase by approximately 22.86%. Although the odds ratio for a full unit increase is large, within the typical range of the variable (from -0.3 to 0.3), smaller changes in EPS still have a meaningful but more moderate impact on the odds of outperforming the market.
4.2.2.4 Summary:
The odds ratio for the winsorized EPS variable means that even a small increase in earnings per share deflated by price noticeably raises the odds of a stock outperforming the market.
This strong odds ratio, combined with the model’s statistical significance at the 1% level, underscores the robustness of the winsorized model compared to the original model.
5 Challenge 4: Running Your First Machine Learning Model
You will create a dataset to predict whether a stock will beat the market. Steps include:
- Create a training and testing sample (80/20 split).
- Run the logistic regression on the training sample.
- Create and interpret the confusion matrix.
Note: Review Chapter 2 of the Datacamp course: “Machine Learning with Caret” and the Workshop Introduction to Machine Learning using Caret for more details.
5.1 Data Preprocessing
Before applying machine learning algorithms, it is essential to preprocess the data. We’ll perform steps such as splitting the data into training and test sets, scaling features, and handling missing values (if any).
- Training data: The model learns from this data.
- Test data: We check how well the model performs on this data after it has been trained.
Before applying machine learning algorithms, we will prepare the dataset. First, we will shuffle the data and then create the training and test subsets.
set.seed(123456)
# Shuffle row indices:
rows_shuffled <- sample(nrow(uspanel))

# Randomly order data
shuffled_uspanel <- uspanel[rows_shuffled, ]
This chunk of code randomly shuffles the row indices of the uspanel data frame and stores them in the rows_shuffled variable.
- nrow(uspanel): Gets the total number of rows in the uspanel data frame. nrow() is a function that returns the number of rows in a data frame or matrix.
- sample(nrow(uspanel)): The sample() function is used to randomly shuffle or sample elements. In this case, it generates a random permutation of the numbers from 1 to the number of rows in uspanel. This effectively shuffles the row indices (a tiny illustration follows below).
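A tiny illustration of what sample(n) returns (the seed is arbitrary and only makes this illustration reproducible):
set.seed(1)  # arbitrary seed for this illustration
sample(5)    # the integers 1 to 5 in a random order (a permutation)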
I will keep only the variables used in the exercise:
shuffled_uspanel <- shuffled_uspanel %>%
  select(firm, q, epspw, F1r_above_market, stockret, mkqret)
head(shuffled_uspanel)
# A tibble: 6 × 6
# Groups: firm [6]
firm q epspw F1r_above_market stockret mkqret
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 GEN__old 2019q2 -0.100 0 -0.150 0.0372
2 FIGS 2001q2 NA NA NA 0.0537
3 FBLG 2018q1 NA NA NA -0.0123
4 NC 2008q1 NA NA NA -0.104
5 BBOX__old 2002q1 0.0635 0 -0.0881 -0.000601
6 STML_old 2010q4 NA NA NA 0.0972
To create the training and testing sets, let's use an 80/20 split. First, determine the row at which to split the dataset:
# Split the data into training and test sets (80/20 split)
# Determine row to split on: split
split <- round(nrow(shuffled_uspanel) * .80)
split
[1] 445745
And now we can create the two different subsets.
# Create train
train <- shuffled_uspanel[1:split, ]

# Create test
test <- shuffled_uspanel[(split+1):nrow(shuffled_uspanel), ]
5.2 Model Training
Re-run the logit model for the training dataset:
logit_train <- glm(F1r_above_market ~ epspw, data = train, family = "binomial", na.action = na.omit)
summary(logit_train)
Call:
glm(formula = F1r_above_market ~ epspw, family = "binomial",
data = train, na.action = na.omit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.05148 0.00498 -10.34 <2e-16 ***
epspw 2.07493 0.05339 38.87 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 227682 on 164388 degrees of freedom
Residual deviance: 226128 on 164387 degrees of freedom
(281356 observations deleted due to missingness)
AIC: 226132
Number of Fisher Scoring iterations: 4
Now I use this model to predict the probability that each stock beats the market, this time on the test dataset:
test$F1predprob = predict(logit_train, newdata = test, type = "response")
head(test)
# A tibble: 6 × 7
# Groups: firm [6]
firm q epspw F1r_above_market stockret mkqret F1predprob
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 VHC 2002q1 NA NA NA -0.000601 NA
2 MMI 2020q1 0.0123 0 -0.318 -0.223 0.494
3 NWPX 2002q4 NA NA NA 0.0762 NA
4 JNY_old 2000q3 0.0758 1 0.120 -0.0125 0.526
5 RBLX 2012q1 NA NA NA 0.113 NA
6 MTCH 2011q1 NA NA NA 0.0528 NA
I will set the cut-off probability above which the model predicts that the firm will beat the market:
# Create F1r_above_market_pred
test <- test %>%
  mutate(F1r_above_market_pred = ifelse(F1predprob > 0.5, 1, 0))
head(test)
# A tibble: 6 × 8
# Groups: firm [6]
firm q epspw F1r_above_market stockret mkqret F1predprob
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 VHC 2002q1 NA NA NA -0.000601 NA
2 MMI 2020q1 0.0123 0 -0.318 -0.223 0.494
3 NWPX 2002q4 NA NA NA 0.0762 NA
4 JNY_old 2000q3 0.0758 1 0.120 -0.0125 0.526
5 RBLX 2012q1 NA NA NA 0.113 NA
6 MTCH 2011q1 NA NA NA 0.0528 NA
# ℹ 1 more variable: F1r_above_market_pred <dbl>
Now I have a column for the actual binary variable (whether the stock beat the market), and also a predicted binary variable using the model and the test dataset.
5.3 Model evaluation
I can now create a Confusion Matrix.
I need to convert the binary variables to factor-type variables:
test$F1r_above_market = factor(test$F1r_above_market, levels = c("1","0"))
test$F1r_above_market_pred = factor(test$F1r_above_market_pred, levels = c("1","0"))
# When using the factor function, the first value of levels must be the POSITIVE value; in this case, = 1

# Create confusion matrix
CM1 <- confusionMatrix(test$F1r_above_market, test$F1r_above_market_pred, positive = '1')
CM1
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 7848 11831
0 6726 14852
Accuracy : 0.5502
95% CI : (0.5454, 0.555)
No Information Rate : 0.6468
P-Value [Acc > NIR] : 1
Kappa : 0.0881
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.5385
Specificity : 0.5566
Pos Pred Value : 0.3988
Neg Pred Value : 0.6883
Prevalence : 0.3532
Detection Rate : 0.1902
Detection Prevalence : 0.4770
Balanced Accuracy : 0.5476
'Positive' Class : 1
5.3.1 INTERPRETATION
The diagonal of the confusion matrix contains the cases where my model CORRECTLY PREDICTED whether the stock beat or did not beat the market.
Analyzing the confusion matrix, we observe the following:
- The sum of the FIRST COLUMN is 14,574, representing the number of cases when the stock actually BEAT the market.
- The sum of the SECOND COLUMN is 26,683, representing the number of cases when the stock actually DID NOT BEAT the market.
From the 14,574 cases when the stocks ACTUALLY BEAT THE MARKET, the model CORRECTLY PREDICTED 7,848 cases. These cases are referred to as TRUE POSITIVES. The rate of TRUE POSITIVES relative to ALL POSITIVES is known as Sensitivity, calculated as:
\text{Sensitivity Rate} = \frac{\text{TRUE POSITIVE}}{\text{TRUE POSITIVE} + \text{FALSE NEGATIVE}} \approx \frac{7848}{7848 + 6726} \approx 0.5385.
Out of the 26,683 cases when the stock DID NOT BEAT THE MARKET, the model CORRECTLY PREDICTED 14,852 cases, referred to as TRUE NEGATIVES. The rate of TRUE NEGATIVES relative to ALL NEGATIVES is known as Specificity, calculated as:
\text{Specificity Rate} = \frac{\text{TRUE NEGATIVE}}{\text{TRUE NEGATIVE} + \text{FALSE POSITIVE}} \approx \frac{14852}{14852 + 11831} \approx 0.5566.
Out of the 26,683 cases when the stock DID NOT BEAT THE MARKET, the model WRONGLY PREDICTED 11,831 cases as beating the market; these are known as FALSE POSITIVES.
Conversely, out of the 14,574 cases when the stock ACTUALLY BEAT THE MARKET, the model WRONGLY PREDICTED 6,726 cases as not beating the market; these are referred to as FALSE NEGATIVES.
If Sensitivity is greater than Specificity, it indicates that the model is more effective at predicting POSITIVE CASES (when a stock actually beats the market) than at predicting NEGATIVE CASES (when a stock does not beat the market).
5.3.1.1 Additional Ratios
The POSITIVE PREDICTIVE VALUE is defined as:
\text{Pos Pred Value} = \frac{\text{TRUE POSITIVE}}{\text{TRUE POSITIVE} + \text{FALSE POSITIVE}} \approx \frac{7848}{7848 + 11831} \approx 0.3988.
The Positive Predictive Value (PPV) represents the proportion of predicted positive cases (stocks predicted to beat the market) that are actually positive. In this case, the PPV is approximately 39.88%.
- The PPV of 39.88% indicates that when the model predicts a stock will beat the market, it is correct about 39.88% of the time. This means that roughly 4 out of 10 stocks identified by the model as outperforming the market actually do so.
The NEGATIVE PREDICTIVE VALUE is defined as:
\text{Neg Pred Value} = \frac{\text{TRUE NEGATIVE}}{\text{TRUE NEGATIVE} + \text{FALSE NEGATIVE}} \approx \frac{14852}{14852 + 6726} \approx 0.6883.
The Negative Predictive Value (NPV) measures the proportion of predicted negative cases (stocks predicted to not beat the market) that are actually negative. In this case, the NPV is approximately 68.83%.
- The NPV of 68.83% means that when the model predicts a stock will not beat the market, it is correct about 68.83% of the time. This implies that when the model identifies a stock as underperforming, it is accurate in nearly 7 out of 10 cases.
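As a quick sanity check, these four rates can be recomputed directly from the counts as labeled in the interpretation above (TP = 7,848, FN = 6,726, FP = 11,831, TN = 14,852):
# Recompute the four rates from the confusion-matrix counts used above
TP <- 7848; FN <- 6726; FP <- 11831; TN <- 14852
c(Sensitivity  = TP / (TP + FN),   # ~0.5385
  Specificity  = TN / (TN + FP),   # ~0.5566
  PosPredValue = TP / (TP + FP),   # ~0.3988
  NegPredValue = TN / (TN + FN))   # ~0.6883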
6 W3 submission
The grade for this workshop will be as follows:
Complete (100%): If you submit an ORIGINAL and COMPLETE notebook with all the activities, your notes, and your OWN RESPONSES to questions. Submitting someone else’s work or having two or more people submit the same answers (exact texts or with minor changes or translations) will be considered plagiarism and a violation of academic integrity (FIA).
Incomplete (75%): If you submit an ORIGINAL notebook with ALL the activities but did NOT RESPOND to the questions and/or did not complete all activities, or if you only responded to some questions. Remember, submitting identical or slightly altered work from another person is considered a breach of academic integrity (FIA).
Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop or if you completed more but parts of your work are copy-pasted from other notebooks. This too will be considered an FIA if found to be plagiarized.
Not submitted (0%): No submission, no credit.
It is important to remember that submitting someone else’s work, or identical work (even with minor changes or translations) between two or more students, is strictly prohibited and will be considered an academic integrity violation (FIA).
Remember that you have to submit the Google Colab LINK to your notebook, and you must SHARE it with me, ensuring that I have edit access.