Workshop 3 Solution, Algorithms and data analysis

Author

Alberto Dorantes, Ph.D.

Published

October 8, 2024

Abstract

This is an INDIVIDUAL workshop. In this workshop we will keep practicing data management for financial massive data and also learn about the Logistic regression model and an introduction to Machine Learning.

1 Introduction

We will work with the dataset of the Final Project (dataus2024), which has historical data from Q1 2000 to Q2 2024.

We will learn a) what is winsorization and why it is important, b) estimate and interpret a logistic regression, and c) training vs testing samples, and d) creating a confusion matrix.

2 CHALLENGE 1: Winsorization of variables

You have to do your own research about winsorization. Explain it with your words (pay attention in class)

You have to install the statar package. You have to winsorize the following ratio:

Earnings per share deflated by price

This winsorize function automatically finds a good level of winsorization according to the distribution of all values of the variable.

You have to install the statar package. You have to winsorize the following ratio:

Earnings per share deflated by price

Using the histogram decide which might be a good level of winsorization for each ratio.

WINSORIZATION IS THE PROCESS OF FLATTENING EXTREME (OUTLIERS) OBSERVATIONS OF A VARIABLE WITH THE PURPOSE OF AVOIDING UNRELIABLE ESTIMATIONS OF BETA COEFFICIENTS WHEN INCLUDING THAT VARIABLE AS EXPLANATORY VARIABLE IN AN ECONOMETRIC MODEL.

THE WINSORIZATION PROCESS CAN BE APPLIED TO HIGH OR LOW (NEGATIVE) VALUES OF THE VARIABLE. FOR HIGH VALUES WE CAN INDICATE FROM WHICH PERCENTILE WE CONSIDER THE VALUES AS OUTLIERS, THEN THOSE VALUES ABOVE THAT PERCENTILE IS REPLACE WITH THE VALUE OF THAT PERCENTILE. FOR LOW VALUES WE INDICATE FROM WHICH PERCENTILE WE CONSIDER THE VALUES AS OUTLIERS, THEN THOSE VALUES BELOW THAT PERCENTILE IS REPLACE WITH THE VALUE OF THAT PERCENTILE.

IN R, WE CAN USE THE FUNCTION winsorize FROM THE statar PACKAGE TO DO THE WINSORIZATION OF A VARIABLE.

I first import the datasets:

## library(plm)
# I will NOT use plm since it takes too much time to install! I will use dplyr 
library(dplyr)
# Downloading the csv files from the web: 

download.file("https://www.apradie.com/datos/dataUS20204.csv","dataus2024.csv")
uspanel <- read.csv("dataus2024.csv")

download.file("https://www.apradie.com/datos/firms20204.csv","firms2024.csv")
usfirms <- read.csv("firmsus2024.csv")

Now I generate the variables needed for the ratio:

# Adding the variables for gross profit, EBIT and Net Income:
uspanel <- uspanel %>%
  mutate(
    grossprofit = revenue - cogs,
    ebit = grossprofit - sgae, 
    netincome = ebit + otherincome + extraordinaryitems - finexp - incometax
  )

I generate the ratio:

# I generate EPS and EPSP (deflated by price)
uspanel <- uspanel %>%
  mutate(
    eps = ifelse(sharesoutstanding==0,NA,netincome / sharesoutstanding),
    epsp= ifelse(originalprice==0,NA,eps / originalprice))

NOW I START THE WINSORIZATION PROCESS FOR EPSP:

I START CHECKING THE HISTOGRAM TO DECIDE WHICH LEVEL OF WINSORIZATION I CAN APPLY TO BOTH SIDES OF THE VALUES (HIGH AND LOW VALUES):

hist(uspanel$epsp)

I SEE EXTREME VALUES TO THE LEF AND TO THE RIGHT WITH SIMILAR MAGNITUDE. THE WINSORIZE FUNCTION AUTOMATICALLY DETECTS THE BEST PERCENTILES TO THE LEFT AND TO THE RIGHT ACCORDING TO THE DISTRIBUTION OF THE VARIABLE:

# I load the library for winsorization:
library(statar)
uspanel$epspw <- winsorize(uspanel$epsp)
# I check the distribution of this new winsorized ratio:
hist(uspanel$epspw)

THE RANGE FOR THE WINSORIZED OPERATING EPS DEFLATED BY PRICE IS FROM -0.30 TO ABOUT 0.30. FOR THIS RATIO, THIS RANGE IS OK SINCE WE CAN HAVE SOME FIRMS THAT IN THE PAST THE HAD VERY NEGATIVE NETINCOME SO THAT THE EPS DEFLATED BY PRICE CAN BE NEGATIVE, BUT A FRACTION LESS THAN ONE (IN MAGNITUDE). IF OPERATING EPS DEFLATED BY PRICES IS +0.30 THIS MEANS THAT IF THE NETINCOME WERE DISTRIBUTED AMONG SHAREHOLDERS, THEN FOR EACH $1.0 INVESTED IN SHARES, EACH SHAREHOLDER MIGHT RECEIVE ABOUT 30 CENTS.

3 CHALLENGE 2: Algorithm to do many-to-one merge

Write a data management algorithm to do the following:

Download the monthly market index of the S&P500 (^GSPC) from 1999 to date
Convert (collapse) from monthly to quarterly index selecting the last index of each quarter
Calculate market quarterly returns
Do a many-to-one algorithm to add a new column to the us panel dataset that has the market quarterly return

Here is a easy-to-follow guide to do this challenge. Read carefully to understand the algorithm.

The uspanel dataset has historical data of many financial-statement variables of US public firms that belong to the New York Exchange and the NASDAQ.

The usfirms dataset is a catalog of all US public firms with general information such as firm name and industry classification.

Now I download the US monthly market index from Q4 1999 to the Q2 of 2024.

library(quantmod)
getSymbols("^GSPC", from="1999-10-01", to= "2024-06-30",
            periodicity="monthly", src="yahoo")

[1] "GSPC"

Now I convert / collapse this monthly dataset in to a quarterly dataset to have the same granularity than the uspanel dataset. For each quarter (3-month period) I need to get ONLY the last month index in order to correctly calculate quarterly returns.

The to.quarterly function from quantmod can do this collapse getting the value of the last month for each quarter:

QGSPC <- to.quarterly(GSPC)
# I keep only the adjusted column, which has the market index:
QGSPC = Ad(QGSPC)
names(QGSPC)= c("SP500")

I calculate quarterly and annual cc returns for the market with the difference function of the log index:

QGSPC$mkqret = diff(log(QGSPC$SP500))
QGSPC$mkyret = diff(log(QGSPC$SP500),lag=4)
# I delete the first row that has NA values for both returns, and it is from 1999:
QGSPC = QGSPC[2:nrow(QGSPC),]

In order to merge the uspanel with this QGSPC dataset I need to have a common column for the quarter so that R can do the match by quarter.

The QGSPC has the quarter as index, not as column! It is important to note that all xts and zoo datasets have an index that is not part of the columns. I can create a data frame that has the index as column as follows:

QGSPCdf<-data.frame(qdate=index(QGSPC),coredata(QGSPC[,2:3]))

The index function gets the index content, while the coredata function gets only the column data of the dataset.

Besides having the same column in both datasets, both columns must be of the same data type. Then I check which data type each q column has:

class(uspanel$q)

[1] "character"

class(QGSPCdf$qdate)

[1] "yearqtr"

The qdate column of the QGSPCdf is a “yearqtr” variable, while the q column of the uspanel is character variable. I have to decide which column I change to have both with the same type and also the same format.

I will create a q column in the QGSPCdf dataset with the same format as the q in the uspanel.

The q in the uspanel is a character variable that starts with 4 digits for the year, then a “q” and then the # of the quarter. Example: 2020q1, 2020q2.

Then, I create a new column in the QGSPCdf following this format:

library(lubridate)
# I use the year and quarter functions from the lubridate library
# The year function extracts the year of a date, and the quarter extrats the quarter
QGSPCdf$q <- paste0(year(QGSPCdf$qdate),         # Convert dates to quarterly
                             "q",
                             quarter(QGSPCdf$qdate))
# I check that both columns have the same data type, and have the same format:
class(QGSPCdf$q)

[1] "character"

class(uspanel$q)

[1] "character"

head(QGSPCdf$q)

[1] "2000q1" "2000q2" "2000q3" "2000q4" "2001q1" "2001q2"

head(uspanel$q)

[1] "2000q1" "2000q2" "2000q3" "2000q4" "2001q1" "2001q2"

The paste0 function concatenates strings of characters.

Now I can do the many-to-1 merge of both dataset indicating that q is the common column.

I use the left_join function from the dplyr package instead of the merge function. I do this since the merge function does not keep the original sorting of the uspanel dataset.

# I delete the first  column (the qdate)
QGSPCdf = QGSPCdf[,c(-1)]

uspanel<-left_join(uspanel,QGSPCdf,by="q")

# I display key columns of 1 quarter:
library(dplyr)
head(uspanel %>% select(firm,q, adjprice, mkqret) %>% filter(q=="2024q2"))

      firm      q adjprice     mkqret
1        A 2024q2 129.3899 0.03848037
2       AA 2024q2  39.7800 0.03848037
3 AABA_old 2024q2       NA 0.03848037
4  AAC_old 2024q2       NA 0.03848037
5 AAIC_old 2024q2       NA 0.03848037
6      AAL 2024q2  11.3300 0.03848037

The market return seems to be well merged since its value is repeated for cases with the same quarter.

Then the mktret column was added to the uspanel. The series of the market return was merged for each firm, so R did a many-to-1 merge!

4 CHALLENGE 3: Logistic regression models with lagged values

Design and run a logistic regression to examine whether earnings per share deflated by price winsorized (epspw) is related to the probability that the future quarterly stock returns is higher than the future market quarterly return.

Pay attention in class to learn how to run a logistic regression model, and how to indicate to use future or lagged values for variables in the model.

You have to interpret the model

4.1 SOLUTION

The dependent variable of a logistic model is the probability that an event happens. However, the way we introduce the values of the dependent variable in the model is using a binary variable (1/0 or TRUE/FALSE). I declare that the EVENT happens if the future stock return is higher than the future market return.

The independent variable(s) can be numeric or categorical, as in the case of a multiple regression model.

I have to create the dependent variable of the model. I will assign 1 when the future stock return is higher than the future market return; and =0 otherwise.

4.1.1 Creating the dependent variable

I need to generate the stock quarterly cc return for each firm-quarter.

I calculate returns in the panel data with dplyr:

uspanel <- uspanel %>%
  group_by(firm) %>% 
  arrange(firm, q) %>% 
  mutate(stockret = log(adjprice) - lag(log(adjprice)))

In dplyr, I indicate to group by firm so that the lag function will work correctly (will not take a lag value if there is a change of firm!)

To make sure that R did the correct calculation, I can look at few cases for 1 firm:

# I convert the q column to character; when I converted the data into a pdata.frame, it changed the type of q to factor. This creates some problems, so I change it to character:
uspanel$q = as.character((uspanel$q))
uspanel %>% select(firm,q,adjprice,stockret) %>% 
       filter(firm=="AAPL",q>="2000q1") %>% head(10)

# A tibble: 10 × 4
# Groups:   firm [1]
   firm  q      adjprice stockret
   <chr> <chr>     <dbl>    <dbl>
 1 AAPL  2000q1    1.02   NA     
 2 AAPL  2000q2    0.790  -0.260 
 3 AAPL  2000q3    0.388  -0.710 
 4 AAPL  2000q4    0.224  -0.549 
 5 AAPL  2001q1    0.333   0.395 
 6 AAPL  2001q2    0.351   0.0521
 7 AAPL  2001q3    0.234  -0.405 
 8 AAPL  2001q4    0.330   0.345 
 9 AAPL  2002q1    0.357   0.0777
10 AAPL  2002q2    0.267  -0.290

It looks ok since the stock return is NA for the first quarter of each firm.

Now I can create variables for a) the FUTURE quarterly stock return and b) the FUTURE market return. I will use the lead function of dplyr:

uspanel <-  uspanel %>% 
   group_by(firm) %>% 
   arrange(firm, q) %>% 
   mutate(F1stockret = lead(stockret),
          F1mkqret = lead(mkqret))

The lead function is used to get the future value of the variable (1 period ahead).

Now I can create the binary variable as follows:

uspanel <- uspanel %>%
  mutate(F1stockwin = ifelse(F1stockret>F1mkqret,1,0))

I can check how many rows ended up with 1’s and 0’s:

table(uspanel$F1stockwin)


     0      1 
127791 118460

I can view few cases to make sure that I did the correct calculation:

uspanel %>% select(firm, q, stockret,mkqret, F1stockret, F1mkqret, F1stockwin) %>%
       filter(q>="2022q1") %>% head(8)

# A tibble: 8 × 7
# Groups:   firm [1]
  firm  q      stockret  mkqret F1stockret F1mkqret F1stockwin
  <chr> <chr>     <dbl>   <dbl>      <dbl>    <dbl>      <dbl>
1 A     2022q1  -0.186  -0.0507    -0.107   -0.180           1
2 A     2022q2  -0.107  -0.180      0.0249  -0.0542          1
3 A     2022q3   0.0249 -0.0542     0.211    0.0684          1
4 A     2022q4   0.211   0.0684    -0.0786   0.0679          0
5 A     2023q1  -0.0786  0.0679    -0.137    0.0797          0
6 A     2023q2  -0.137   0.0797    -0.0727  -0.0372          0
7 A     2023q3  -0.0727 -0.0372     0.222    0.106           1
8 A     2023q4   0.222   0.106      0.0456   0.0967          0

The calculation of the F1stockwin looks ok; when the future stock return is higher than the market future return, F1stockwin=1; =0 otherwise.

Now I can run the logistic regression with the glm function:

# Runing the model with the winsorized oepsp:
model1 <- glm(F1stockwin ~ epspw, data= uspanel, family="binomial",na.action=na.omit)
summary(model1)


Call:
glm(formula = F1stockwin ~ epspw, family = "binomial", data = uspanel, 
    na.action = na.omit)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.055865   0.004453  -12.55   <2e-16 ***
epspw        2.058406   0.047745   43.11   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 284792  on 205645  degrees of freedom
Residual deviance: 282880  on 205644  degrees of freedom
  (351535 observations deleted due to missingness)
AIC: 282884

Number of Fisher Scoring iterations: 4

# Runing the model with the original epsp (before winsorization):
model1a <- glm(F1stockwin ~ epsp, data= uspanel, family="binomial",na.action=na.omit)
summary(model1a)


Call:
glm(formula = F1stockwin ~ epsp, family = "binomial", data = uspanel, 
    na.action = na.omit)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.562e-02  4.413e-03 -17.134   <2e-16 ***
epsp         1.275e-06  4.352e-06   0.293     0.77    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 284792  on 205645  degrees of freedom
Residual deviance: 284792  on 205644  degrees of freedom
  (351535 observations deleted due to missingness)
AIC: 284796

Number of Fisher Scoring iterations: 3

I run the model with the winsorized and the original (before winsorization). Although the sign and significance was the same, the magnitude of the epsp beta coefficient significantly changed and also its pvalue. So, it the results using the winsorized variable is much more reliable.

INTERPRETATION:

THE INTERPRETATION OF A LOGISTIC MODEL IS SIMILAR TO THAT OF THE LINEAR REGRESSION MODEL, BUT THERE IS AN IMPORTANT DIFFERENCE.

IN TERMS OF SIGN AND SIGNIFICANCE OF THE INDEPENDENT VARIABLE, WE INTERPRET THE LOGISTIC REGRESSION IN A SIMILAR WAY. IN THIS CASE, I CAN INTERPRET THE COEFFICIENT AND SIGNIFICANCE OF EARNINGS PER SHARE AS FOLLOWS:

SINCE THE BETA1 COEFFICIENT IS POSITIVE AND SIGNIFICANCE, THEN EARNINGS PER SHARE (DEFLATED BY PRICE) IS POSITIVELY AND SIGNIFICANTLY RELATED TO THE PROBABILITY THAT THE STOCK WILL HAVE A HIGHER RETURN IN THE FUTURE COMPARED TO THE FUTURE MARKET RETURN. IN OTHER WORDS, THE HIGHER THE EPSPW, THE MORE LIKELY THAT THE STOCK WILL BEAT THE MARKET RETURN ONE QUARTER LATER.

THE INTERPRETATION OF THE MAGNITUDE OF THE COEFFICIENT IS DIFFERENT THAN IN THE CASE OF LINEAR REGRESSION.

FIRST WE NEED TO CALCULATE THE EXPONENCIAL OF THE BETA1 COEFFICIENT:

betaepsw_odds = exp(model1$coefficients[2]*1)
betaepsw_odds

  epspw 
7.83347

This new beta coefficient of7.8334699 can be interpreted as:

FOR EACH +1 CHANGE IN EPSPW, IT IS EXPECTED THAT THE ODDS RATIO (THE PROBABILITY THAT THE STOCK BEATS THE MARKET WITH RESPECT TO THE PROBABILITY THAT THE STOCK DO NOT BEAT THE MARKET) INCREASES WILL CHANGE IN 7.8334699 UNITS. IN OTHER WORDS, A FIRM A HAS AN EPSPW THAT IS +1.00 HIGHER THAN A FIRM B, THEN THE FIRM A WILL BE 7.8334699 TIMES MORE LIKELY TO BEAT THE MARKET COMPARED TO FIRM B.

THIS SOUNDS WEIRD SINCE A CHANGE OF +1.00 UNIT IN EPSPW IS ALMOST IMPOSSIBLE IN THE MARKET; EPSPW IS A VARIABLE THAT USUALLY RANGES BETWEEN -0.3 AND +0.30. THEN, IT IS CONVENIENT TO DO ANOTHER CALCULATION FOR THE BETA COEFFICIENT:

betaepsw_odds = exp(model1$coefficients[2]*0.1)
betaepsw_odds

   epspw 
1.228557

I MULTIPLIED THE COEFFICIENT TIMES 0.1 TO CALCULATE THE BETA COEFFICIENT AS A PARTIAL CHANGE OF THE ODDS RATIO FOR A CHANGE OF +0.1 IN EPSPW.

IN THIS CASE, IF A FIRM A IMPROVES ITS EPSPW IN +0.1 UNIT FROM ONE QUARTER TO THE NEXT, THEN THIS FIRM WILL BE 1.2285573 MORE LIKELY TO BEAT THE MARKET COMPARED TO CASE THAT THE FIRM DOES NOT CHANGE ITS EPSPW.

YOU CAN INTERPRET THE MAGNITUDE OF THE COEFFICIENT USING THE CASE OF 1 FIRM THAT INCREASES (OR DECREASES) ITS EPSPW FROM ONE QUARTER TO THE NEXT, OR COMPARING 2 FIRMS IN THE SAME QUARTER WITH DIFFERENT EPSPW VALUES.

5 CHALLENGE 4: Running your first Machine Learning model

Create a dataset with the following columns:

Future quarterly stock return (1 quarter later)
F1r_above_market (1=beat the market in the corresponding quarter; 0= otherwise)
Earnings per share deflated by price (epsp).

Create a training and testing sample: randomly select 80% of observations for the training sample and 20% for the testing sample.

Using the training sample, run the same logistic model to check whether epsp has explanatory power for the probability that the stock beats the market.

Create and interpret the confusion matrix

It is strongly recommended to review the Chapter 2 of the Datacamp course: “Machine Learning with Caret”

5.1 SOLUTION

The variables are already created in the above sections.

I create the training and testing random samples:

set.seed(123456)

# Shuffle row indices:
rows_shuffled<-sample(nrow(uspanel))

# Randomly order data
shuffled_uspanel <- uspanel[rows_shuffled, ]

To better organize the data of the model, I will keep ONLY the variables I need for the machine learning model:

shuffled_uspanel <- shuffled_uspanel %>% 
              select(firm, q, epspw, F1stockwin, stockret,mkyret)

I kept the independent and the dependent variable (F1stockwin), and also the returns of the stocks and the market (just in case).

Try an 80/20 split

Now that the dataset is randomly ordered. I can split the first 80% of it into a training set, and the last 20% into a test set.

I can do this by choosing a split point of approx 80% of the shffle data rows:

# Determine row to split on: split
split <- round(nrow(shuffled_uspanel)*.80)
split

[1] 445745

# Create train
train <- shuffled_uspanel[1:split, ]

# Create test
test <- shuffled_uspanel[(split+1):nrow(shuffled_uspanel), ]

According to the basics of machine learning, I only use the train dataset to calibrate my model, and then use the test dataset to do predictions and check how well the model predicted cases in the test dataset.

Then, I re-rerun the model with ONLY the train dataset:

model2 <- glm(F1stockwin ~ epspw, data= train, family="binomial",na.action=na.omit)
summary(model2)


Call:
glm(formula = F1stockwin ~ epspw, family = "binomial", data = train, 
    na.action = na.omit)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.05148    0.00498  -10.34   <2e-16 ***
epspw        2.07493    0.05339   38.87   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 227682  on 164388  degrees of freedom
Residual deviance: 226128  on 164387  degrees of freedom
  (281356 observations deleted due to missingness)
AIC: 226132

Number of Fisher Scoring iterations: 4

Now I use this model and do predictions of the probability that the stock beats the market, but in the test dataset:

test$F1predprob = predict(model2,newdata=test, type="response")

I see the predicted probabilities for the first cases:

head(test,10)

# A tibble: 10 × 7
# Groups:   firm [10]
   firm     q         epspw F1stockwin stockret  mkyret F1predprob
   <chr>    <chr>     <dbl>      <dbl>    <dbl>   <dbl>      <dbl>
 1 VHC      2002q1 NA               NA NA       -0.0112     NA    
 2 MMI      2020q1  0.0123           0 -0.318   -0.0923      0.494
 3 NWPX     2002q4 NA               NA NA       -0.266      NA    
 4 JNY_old  2000q3  0.0758           1  0.120   NA           0.526
 5 RBLX     2012q1 NA               NA NA        0.0605     NA    
 6 MTCH     2011q1 NA               NA NA        0.126      NA    
 7 EVRG     2009q3  0.0771           1  0.0536  -0.0984      0.527
 8 ANIX     2009q1 NA               NA NA       -0.505      NA    
 9 AVHI_old 2014q4 -0.00600          1 -0.00548  0.108       0.484
10 SYKE_old 2018q2  0.0147           0 -0.00554  0.115       0.495

I can also visualize a histogram to see the range of predicted probabilities:

hist(test$F1predprob)

I now create a column to predict a 1 (the stock beat the market) or 0 according to the predicted probabilities:

test$F1stockwinpred = ifelse(test$F1predprob>0.5,1,0)

Now I have a column for the actual binary variable (whether the stock beat the maret), and also a predicted binary variable using the model and the test dataset.

I can now create a Confusion Matrix.

I need to convert the binary variables to a factor-type variables:

library(caret)

test$F1stockwin1 = factor(test$F1stockwin,levels=c("1","0"))
test$F1stockwinpred1 = factor(test$F1stockwinpred,levels=c("1","0"))
# When using factor function, the first value of levels must be the POSITIVE value; in this case, =1

# Create confusion matrix
CM1<- confusionMatrix(test$F1stockwinpred1,test$F1stockwin1, positive='1')
CM1

Confusion Matrix and Statistics

          Reference
Prediction     1     0
         1  7848  6726
         0 11831 14852
                                         
               Accuracy : 0.5502         
                 95% CI : (0.5454, 0.555)
    No Information Rate : 0.523          
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.0881         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.3988         
            Specificity : 0.6883         
         Pos Pred Value : 0.5385         
         Neg Pred Value : 0.5566         
             Prevalence : 0.4770         
         Detection Rate : 0.1902         
   Detection Prevalence : 0.3532         
      Balanced Accuracy : 0.5435         
                                         
       'Positive' Class : 1

INTERPRETATION

The diagonal of the confusion matrix has the cases that my model CORRECTLY PREDICTED whether the stock beat or did not beat the market.

Then, looking at the matrix, we see the following:

The sum of the FIRST COLUMN= 19679, which is the # of CASES when the stock actually BEAT the market
The sum of the SECOND COLUMN= 21578, will be the # of CASES when the stock actually DID NOT BEAT the market
Out of the 19679 cases when the stocks ACTUALLY BEAT THE MARKET, the model CORRECTLY PREDICTED 7848 cases, which are the TRUE POSITIVES cases. The rate of TRUE POSITIVES with respect to ALL POSITIVES is called Sensitivity, which is 0.3988008.
Out of the 21578 cases when the stock DID NOT BEAT THE MARKET, the model CORRECTLY PREDICTED 14852 cases. This # is called: TRUE NEGATIVES. The rate (or %) of TRUE NEGATIVES with respect to ALL NEGATIVES is called Specificity, which is 0.6882936.
Out of the 19679 cases that the stocks ACTUALLY BEAT THE MARKET, the model WRONGLY PREDICTED 11831 cases. This # is called: FALSE POSITIVES.
Out of the 21578 cases when the stock DID NOT BEAT THE MARKET, the model WRONGLY PREDICTED 6726 cases. This # is called: FALSE NEGATIVES.

Remember that the Sensitivity and the Specificity rates are:

SensitivityRate = \frac{TRUEPOSITIVE}{(TRUEPOSITIVE+FALSEPOSITIVE)}

SpecificityRate = \frac{TRUENEGATIVE}{(TRUENEGATIVE+FALSENEGATIVE)}

If Sensitivity is greater than Specificity this means that the model is better to predict POSITIVE CASES (when a stock actually beats the market) than NEGATIVE CASES (when a stock does not beat the market)

The POSITIVE PREDICTED RATIO is:

PosPredValueRate = \frac{TRUEPOSITIVE}{(TRUEPOSITIVE+FALSENEGATIVE)}

The NEGATIVE PREDICTED RATIO is:

NegPredValueRate = \frac{TRUENEGATIVE}{(TRUENEGATIVE+FALSEPOSITIVE)}

We can do a ROC (Receiver Operator Characteristic) Plot. You first have to install the pROC package.

library(pROC)

roc1<- roc(response=test$F1stockwin, predictor=test$F1predprob,plot=T,col="blue", levels=c("0","1"))

I can plot the ROC using the information of roc1 object and the plot function:

plot(1-roc1$specificities, roc1$sensitivities, col="red", pch=16,
     xlab="False Positive", 
     ylab="Sensitivity")

Note that the X Axis of this plot is from 0 to 1, and in the previous plot is from 1 to 0.

The area under the curve of the ROC (AUC) is:

roc1$auc

Area under the curve: 0.5668

As long as the area is greater than 0.5 this means that the model is better than a naive or a totally random model.

We can also plot a curve that shows the probability thresholds used and the corresponding False Positive rate.

library(ggplot2)
# I create a data frame with 3 columns: 1) the threshold probabilities, 2) Sensitivities, and 
#   3) Specificities
thresholdvector <- cbind(roc1$thresholds,roc1$sensitivities,roc1$specificities)
thresholdvector <- as.data.frame(thresholdvector)
names(thresholdvector)<-c("threshold","sensitivities","specificities")

# The roc function does not calculate the positive-predicted ratio nor the negative-predicted
#  ratio.To calculate these ratios, I need the # of tp, tn, fp and fn


# I add columns for #true positive cases (tp), #tn, #fp, and #fn:
thresholdvector$tp = thresholdvector$sensitivities*length(roc1$cases)
thresholdvector$tn = thresholdvector$specificities*length(roc1$controls)
thresholdvector$fp =length(roc1$cases) - thresholdvector$tp
thresholdvector$fn = length(roc1$controls) - thresholdvector$tn
# Check that the length(roc1$cases) is the total of cases when the event actually happened


# I create the positive-prediction ratio and the negative-prediction ratio:
thresholdvector$pospredratio = thresholdvector$tp / (thresholdvector$tp+thresholdvector$fn)
thresholdvector$negpredratio = thresholdvector$tn / (thresholdvector$tn+thresholdvector$fp)

# I create the accuracy ratio, which is the % of true positive and true negative with respect to ALL 
thresholdvector$accuracy = (thresholdvector$tp + thresholdvector$tn) / ( length(roc1$cases) + length(roc1$controls)) 


# I plot the 4 ratios against the threshold probability to visualize which might be the
#   best threshold probability:

ggplot(thresholdvector,aes(x=threshold)) + 
  geom_line(aes(y=accuracy,color="Accuracy ratio"))  + 
  geom_line(aes(y=specificities,color="Specificity ratio")) + 
  geom_line(aes(y=sensitivities,color="Sensitivity ratio")) + 
  geom_line(aes(y=pospredratio,color="True-Positive Predicted ratio")) +
  geom_line(aes(y=negpredratio,color="True-Negative Predicted ratio"))

It seems that an optimal threshold should be between 0.45 and 0.55. We can identify the threshold that maximizes the accuracy ratio using deployer:

thresholdsorted <- thresholdvector %>% 
    select(threshold,accuracy,sensitivities,specificities) %>%
    arrange(desc(accuracy)) %>%
    head(5)
thresholdsorted

  threshold  accuracy sensitivities specificities
1 0.4978217 0.5526335     0.4511916     0.6451478
2 0.4978144 0.5526093     0.4513441     0.6449625
3 0.4978215 0.5526093     0.4511916     0.6451015
4 0.4978233 0.5526093     0.4511408     0.6451478
5 0.4978297 0.5526093     0.4510392     0.6452405

We can see that the threshold that get the highest accuracy is 0.4978217 . This is very close to the original threshold.

For the cases where the optimal threshold is significantly different than 0.5, it is recommended to re-calculate the predicted binary variable (F1stockwinpred) with this optimal threshold. For example, we can re-calculate the predicted binary variable with this threshold as follows:

test$F1stockwinpred_new = ifelse(test$F1predprob>0.4978217,1,0)
table(test$F1stockwinpred_new)


    0     1 
24960 16659

We can compare this table with the first binary predicted variable:

table(test$F1stockwinpred)


    0     1 
26949 14670

With the new threshold there are more ’1’s predicted (more POSITIVE predicted cases)

I will stop here. However, the next step would be to use this new prediction (F1stockwinpred_new) to create a new Confusion Matrix to see the prediction indicators (sensitivity, specificity, accuracy, etc). Once we check that our model improves the prediction indicators, then we can use this model to predict the probability for the stocks to beat the market in the future NEW quarter.

Finally, we can use these prediction to rank the stocks and select the best ones.