# library(plm)
# I will NOT use plm since it takes too much time to install! I will use dplyr
library(dplyr)
# Downloading the csv files from the web:
download.file("https://www.apradie.com/datos/dataUS20204.csv","dataus2024.csv")
uspanel <- read.csv("dataus2024.csv")
download.file("https://www.apradie.com/datos/firms20204.csv","firmsus2024.csv")
usfirms <- read.csv("firmsus2024.csv")
Workshop 3 Solution, Algorithms and data analysis
1 Introduction
We will work with the dataset of the Final Project (dataus2024), which has historical data from Q1 2000 to Q2 2024.
We will learn: a) what winsorization is and why it is important, b) how to estimate and interpret a logistic regression, c) how to split the data into training and testing samples, and d) how to create a confusion matrix.
2 CHALLENGE 1: Winsorization of variables
You have to do your own research about winsorization. Explain it with your words (pay attention in class)
You have to install the statar package. You have to winsorize the following ratio:
- Earnings per share deflated by price
This winsorize function automatically finds a good level of winsorization according to the distribution of all values of the variable.
Using the histogram decide which might be a good level of winsorization for each ratio.
WINSORIZATION IS THE PROCESS OF FLATTENING EXTREME (OUTLIERS) OBSERVATIONS OF A VARIABLE WITH THE PURPOSE OF AVOIDING UNRELIABLE ESTIMATIONS OF BETA COEFFICIENTS WHEN INCLUDING THAT VARIABLE AS EXPLANATORY VARIABLE IN AN ECONOMETRIC MODEL.
THE WINSORIZATION PROCESS CAN BE APPLIED TO HIGH OR LOW (NEGATIVE) VALUES OF THE VARIABLE. FOR HIGH VALUES WE INDICATE FROM WHICH PERCENTILE WE CONSIDER THE VALUES AS OUTLIERS; THEN THE VALUES ABOVE THAT PERCENTILE ARE REPLACED WITH THE VALUE OF THAT PERCENTILE. FOR LOW VALUES WE INDICATE FROM WHICH PERCENTILE WE CONSIDER THE VALUES AS OUTLIERS; THEN THE VALUES BELOW THAT PERCENTILE ARE REPLACED WITH THE VALUE OF THAT PERCENTILE.
IN R, WE CAN USE THE FUNCTION winsorize FROM THE statar PACKAGE TO DO THE WINSORIZATION OF A VARIABLE.
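As a minimal illustration of the idea, here is a manual winsorization sketch in base R with arbitrary 2%/98% cutoffs (this is an assumed example, not the automatic statar approach used below):
# Manual winsorization sketch (hypothetical helper, not part of the original solution):
# values below the 2nd percentile are replaced by the 2nd percentile, and values
# above the 98th percentile are replaced by the 98th percentile:
winsorize_manual <- function(x, lower = 0.02, upper = 0.98) {
  cuts <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, cuts[1]), cuts[2])
}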
I already imported the datasets at the beginning of this document (see the code above).
Now I generate the variables needed for the ratio:
# Adding the variables for gross profit, EBIT and Net Income:
uspanel <- uspanel %>%
  mutate(
    grossprofit = revenue - cogs,
    ebit = grossprofit - sgae,
    netincome = ebit + otherincome + extraordinaryitems - finexp - incometax
  )
I generate the ratio:
# I generate EPS and EPSP (deflated by price)
uspanel <- uspanel %>%
  mutate(
    eps = ifelse(sharesoutstanding==0, NA, netincome / sharesoutstanding),
    epsp = ifelse(originalprice==0, NA, eps / originalprice))
NOW I START THE WINSORIZATION PROCESS FOR EPSP:
I START CHECKING THE HISTOGRAM TO DECIDE WHICH LEVEL OF WINSORIZATION I CAN APPLY TO BOTH SIDES OF THE VALUES (HIGH AND LOW VALUES):
hist(uspanel$epsp)
I SEE EXTREME VALUES TO THE LEFT AND TO THE RIGHT WITH SIMILAR MAGNITUDE. THE WINSORIZE FUNCTION AUTOMATICALLY DETECTS THE BEST PERCENTILES TO THE LEFT AND TO THE RIGHT ACCORDING TO THE DISTRIBUTION OF THE VARIABLE:
# I load the library for winsorization:
library(statar)
uspanel$epspw <- winsorize(uspanel$epsp)
# I check the distribution of this new winsorized ratio:
hist(uspanel$epspw)
THE RANGE OF THE WINSORIZED EPS DEFLATED BY PRICE IS FROM ABOUT -0.30 TO ABOUT +0.30. FOR THIS RATIO, THIS RANGE IS OK SINCE SOME FIRMS HAD VERY NEGATIVE NET INCOME IN THE PAST, SO THEIR EPS DEFLATED BY PRICE CAN BE NEGATIVE, BUT A FRACTION LESS THAN ONE (IN MAGNITUDE). IF EPS DEFLATED BY PRICE IS +0.30, THIS MEANS THAT IF THE NET INCOME WERE DISTRIBUTED AMONG SHAREHOLDERS, THEN FOR EACH $1.00 INVESTED IN SHARES, EACH SHAREHOLDER WOULD RECEIVE ABOUT 30 CENTS.
3 CHALLENGE 2: Algorithm to do many-to-one merge
Write a data management algorithm to do the following:
- Download the monthly market index of the S&P500 (^GSPC) from 1999 to date
- Convert (collapse) from monthly to quarterly index selecting the last index of each quarter
- Calculate market quarterly returns
- Do a many-to-one algorithm to add a new column to the us panel dataset that has the market quarterly return
Here is an easy-to-follow guide to do this challenge. Read it carefully to understand the algorithm.
The uspanel dataset has historical data of many financial-statement variables of US public firms that belong to the New York Stock Exchange and NASDAQ.
The usfirms dataset is a catalog of all US public firms with general information such as firm name and industry classification.
Now I download the US monthly market index from Q4 1999 to Q2 2024.
library(quantmod)
getSymbols("^GSPC", from="1999-10-01", to= "2024-06-30",
periodicity="monthly", src="yahoo")
[1] "GSPC"
Now I convert / collapse this monthly dataset into a quarterly dataset so it has the same granularity as the uspanel dataset. For each quarter (3-month period) I need to keep ONLY the last monthly index in order to correctly calculate quarterly returns.
The to.quarterly function from quantmod can do this collapse getting the value of the last month for each quarter:
QGSPC <- to.quarterly(GSPC)
# I keep only the adjusted column, which has the market index:
QGSPC = Ad(QGSPC)
names(QGSPC) = c("SP500")
I calculate quarterly and annual cc returns for the market with the difference function of the log index:
QGSPC$mkqret = diff(log(QGSPC$SP500))
QGSPC$mkyret = diff(log(QGSPC$SP500), lag=4)
# I delete the first row that has NA values for both returns, and it is from 1999:
QGSPC = QGSPC[2:nrow(QGSPC),]
In order to merge the uspanel with this QGSPC dataset I need to have a common column for the quarter so that R can do the match by quarter.
The QGSPC has the quarter as index, not as column! It is important to note that all xts and zoo datasets have an index that is not part of the columns. I can create a data frame that has the index as column as follows:
QGSPCdf <- data.frame(qdate=index(QGSPC), coredata(QGSPC[,2:3]))
The index function gets the index content, while the coredata function gets only the column data of the dataset.
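For instance, a quick hypothetical check of both pieces would look like this (the output is not shown in the original document):
head(index(QGSPC))    # the yearqtr index values (e.g. 2000 Q1, 2000 Q2, ...)
head(coredata(QGSPC)) # a plain numeric matrix with only the data columns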
Besides having the same column in both datasets, both columns must be of the same data type. Then I check which data type each q column has:
class(uspanel$q)
[1] "character"
class(QGSPCdf$qdate)
[1] "yearqtr"
The qdate column of QGSPCdf is a “yearqtr” variable, while the q column of the uspanel is a character variable. I have to decide which column to change so that both have the same type and the same format.
I will create a q column in the QGSPCdf dataset with the same format as the q in the uspanel.
The q in the uspanel is a character variable that starts with 4 digits for the year, then a “q” and then the # of the quarter. Example: 2020q1, 2020q2.
Then, I create a new column in the QGSPCdf following this format:
library(lubridate)
# I use the year and quarter functions from the lubridate library
# The year function extracts the year of a date, and the quarter function extracts the quarter
QGSPCdf$q <- paste0(year(QGSPCdf$qdate),   # Convert dates to quarterly
                    "q",
                    quarter(QGSPCdf$qdate))
# I check that both columns have the same data type, and have the same format:
class(QGSPCdf$q)
[1] "character"
class(uspanel$q)
[1] "character"
head(QGSPCdf$q)
[1] "2000q1" "2000q2" "2000q3" "2000q4" "2001q1" "2001q2"
head(uspanel$q)
[1] "2000q1" "2000q2" "2000q3" "2000q4" "2001q1" "2001q2"
The paste0 function concatenates strings of characters.
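For example (a toy call, not part of the original output):
paste0(2024, "q", 2)   # returns "2024q2"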
Now I can do the many-to-1 merge of both datasets, indicating that q is the common column.
I use the left_join function from the dplyr package instead of the merge function, since the merge function does not keep the original sorting of the uspanel dataset.
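For reference only, the base-R alternative would be a merge with all.x=TRUE; the uspanel_check object below is hypothetical and is shown only to illustrate why left_join is preferred here:
# merge() returns the same columns, but it re-sorts the rows by q, losing the
# original firm-quarter ordering of uspanel:
uspanel_check <- merge(uspanel, QGSPCdf, by = "q", all.x = TRUE)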
# I delete the first column (the qdate)
QGSPCdf = QGSPCdf[,c(-1)]

uspanel <- left_join(uspanel, QGSPCdf, by="q")
# I display key columns of 1 quarter:
library(dplyr)
head(uspanel %>% select(firm,q, adjprice, mkqret) %>% filter(q=="2024q2"))
firm q adjprice mkqret
1 A 2024q2 129.3899 0.03848037
2 AA 2024q2 39.7800 0.03848037
3 AABA_old 2024q2 NA 0.03848037
4 AAC_old 2024q2 NA 0.03848037
5 AAIC_old 2024q2 NA 0.03848037
6 AAL 2024q2 11.3300 0.03848037
The market return seems to be well merged since its value is repeated for cases with the same quarter.
Then the mkqret column was added to the uspanel. The series of market returns was merged for each firm, so R did a many-to-1 merge!
4 CHALLENGE 3: Logistic regression models with lagged values
Design and run a logistic regression to examine whether earnings per share deflated by price winsorized (epspw) is related to the probability that the future quarterly stock returns is higher than the future market quarterly return.
Pay attention in class to learn how to run a logistic regression model, and how to indicate to use future or lagged values for variables in the model.
You have to interpret the model
4.1 SOLUTION
The dependent variable of a logistic model is the probability that an event happens. However, the way we introduce the values of the dependent variable in the model is using a binary variable (1/0 or TRUE/FALSE). I declare that the EVENT happens if the future stock return is higher than the future market return.
The independent variable(s) can be numeric or categorical, as in the case of a multiple regression model.
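Formally (standard logistic-regression notation, added here only as a reminder), the model estimates the probability of the event as a function of the explanatory variable:
P(F1stockwin = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot epspw)}}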
I have to create the dependent variable of the model. I will assign 1 when the future stock return is higher than the future market return; and =0 otherwise.
4.1.1 Creating the dependent variable
I need to generate the stock quarterly cc return for each firm-quarter.
I calculate returns in the panel data with dplyr:
uspanel <- uspanel %>%
  group_by(firm) %>%
  arrange(firm, q) %>%
  mutate(stockret = log(adjprice) - lag(log(adjprice)))
In dplyr, I indicate to group by firm so that the lag function will work correctly (will not take a lag value if there is a change of firm!)
To make sure that R did the correct calculation, I can look at a few cases for 1 firm:
# I convert the q column to character; when I converted the data into a pdata.frame,
# it changed the type of q to factor. This creates some problems, so I change it to character:
uspanel$q = as.character(uspanel$q)
uspanel %>% select(firm,q,adjprice,stockret) %>%
  filter(firm=="AAPL", q>="2000q1") %>% head(10)
# A tibble: 10 × 4
# Groups: firm [1]
firm q adjprice stockret
<chr> <chr> <dbl> <dbl>
1 AAPL 2000q1 1.02 NA
2 AAPL 2000q2 0.790 -0.260
3 AAPL 2000q3 0.388 -0.710
4 AAPL 2000q4 0.224 -0.549
5 AAPL 2001q1 0.333 0.395
6 AAPL 2001q2 0.351 0.0521
7 AAPL 2001q3 0.234 -0.405
8 AAPL 2001q4 0.330 0.345
9 AAPL 2002q1 0.357 0.0777
10 AAPL 2002q2 0.267 -0.290
It looks ok since the stock return is NA for the first quarter of each firm.
Now I can create variables for a) the FUTURE quarterly stock return and b) the FUTURE market return. I will use the lead function of dplyr:
uspanel <- uspanel %>%
  group_by(firm) %>%
  arrange(firm, q) %>%
  mutate(F1stockret = lead(stockret),
         F1mkqret = lead(mkqret))
The lead function is used to get the future value of the variable (1 period ahead).
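As a quick toy illustration (a hypothetical vector, not from the dataset):
dplyr::lead(c(10, 20, 30))   # returns 20 30 NA (values shifted one period ahead)
dplyr::lag(c(10, 20, 30))    # returns NA 10 20 (values shifted one period back)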
Now I can create the binary variable as follows:
uspanel <- uspanel %>%
  mutate(F1stockwin = ifelse(F1stockret>F1mkqret, 1, 0))
I can check how many rows ended up with 1’s and 0’s:
table(uspanel$F1stockwin)
0 1
127791 118460
I can view a few cases to make sure that I did the correct calculation:
uspanel %>% select(firm, q, stockret, mkqret, F1stockret, F1mkqret, F1stockwin) %>%
  filter(q>="2022q1") %>% head(8)
# A tibble: 8 × 7
# Groups: firm [1]
firm q stockret mkqret F1stockret F1mkqret F1stockwin
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 2022q1 -0.186 -0.0507 -0.107 -0.180 1
2 A 2022q2 -0.107 -0.180 0.0249 -0.0542 1
3 A 2022q3 0.0249 -0.0542 0.211 0.0684 1
4 A 2022q4 0.211 0.0684 -0.0786 0.0679 0
5 A 2023q1 -0.0786 0.0679 -0.137 0.0797 0
6 A 2023q2 -0.137 0.0797 -0.0727 -0.0372 0
7 A 2023q3 -0.0727 -0.0372 0.222 0.106 1
8 A 2023q4 0.222 0.106 0.0456 0.0967 0
The calculation of the F1stockwin looks ok; when the future stock return is higher than the market future return, F1stockwin=1; =0 otherwise.
Now I can run the logistic regression with the glm function:
# Running the model with the winsorized epspw:
model1 <- glm(F1stockwin ~ epspw, data= uspanel, family="binomial", na.action=na.omit)
summary(model1)
Call:
glm(formula = F1stockwin ~ epspw, family = "binomial", data = uspanel,
na.action = na.omit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.055865 0.004453 -12.55 <2e-16 ***
epspw 2.058406 0.047745 43.11 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 284792 on 205645 degrees of freedom
Residual deviance: 282880 on 205644 degrees of freedom
(351535 observations deleted due to missingness)
AIC: 282884
Number of Fisher Scoring iterations: 4
# Running the model with the original epsp (before winsorization):
model1a <- glm(F1stockwin ~ epsp, data= uspanel, family="binomial", na.action=na.omit)
summary(model1a)
Call:
glm(formula = F1stockwin ~ epsp, family = "binomial", data = uspanel,
na.action = na.omit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.562e-02 4.413e-03 -17.134 <2e-16 ***
epsp 1.275e-06 4.352e-06 0.293 0.77
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 284792 on 205645 degrees of freedom
Residual deviance: 284792 on 205644 degrees of freedom
(351535 observations deleted due to missingness)
AIC: 284796
Number of Fisher Scoring iterations: 3
I ran the model with both the winsorized variable and the original variable (before winsorization). The sign of the coefficient stayed positive, but its magnitude and significance changed dramatically: with the original epsp the coefficient is not statistically significant (p-value of about 0.77), while with the winsorized epspw it is highly significant. So the results using the winsorized variable are much more reliable.
INTERPRETATION:
THE INTERPRETATION OF A LOGISTIC MODEL IS SIMILAR TO THAT OF THE LINEAR REGRESSION MODEL, BUT THERE IS AN IMPORTANT DIFFERENCE.
IN TERMS OF SIGN AND SIGNIFICANCE OF THE INDEPENDENT VARIABLE, WE INTERPRET THE LOGISTIC REGRESSION IN A SIMILAR WAY. IN THIS CASE, I CAN INTERPRET THE COEFFICIENT AND SIGNIFICANCE OF EARNINGS PER SHARE AS FOLLOWS:
SINCE THE BETA1 COEFFICIENT IS POSITIVE AND SIGNIFICANT, EARNINGS PER SHARE (DEFLATED BY PRICE) IS POSITIVELY AND SIGNIFICANTLY RELATED TO THE PROBABILITY THAT THE STOCK WILL HAVE A HIGHER RETURN IN THE FUTURE COMPARED TO THE FUTURE MARKET RETURN. IN OTHER WORDS, THE HIGHER THE EPSPW, THE MORE LIKELY IT IS THAT THE STOCK WILL BEAT THE MARKET RETURN ONE QUARTER LATER.
THE INTERPRETATION OF THE MAGNITUDE OF THE COEFFICIENT IS DIFFERENT THAN IN THE CASE OF LINEAR REGRESSION.
FIRST WE NEED TO CALCULATE THE EXPONENTIAL OF THE BETA1 COEFFICIENT:
betaepsw_odds = exp(model1$coefficients[2]*1)
betaepsw_odds
epspw
7.83347
This new beta coefficient of 7.8334699 can be interpreted as follows:
FOR EACH +1.00 CHANGE IN EPSPW, THE ODDS RATIO (THE PROBABILITY THAT THE STOCK BEATS THE MARKET DIVIDED BY THE PROBABILITY THAT THE STOCK DOES NOT BEAT THE MARKET) IS EXPECTED TO BE MULTIPLIED BY ABOUT 7.8334699. IN OTHER WORDS, IF FIRM A HAS AN EPSPW THAT IS +1.00 HIGHER THAN FIRM B, THEN FIRM A WILL BE ABOUT 7.8334699 TIMES MORE LIKELY (IN TERMS OF ODDS) TO BEAT THE MARKET COMPARED TO FIRM B.
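In general (standard logistic-regression algebra, added here only as a reminder), for a change of \Delta in epspw the odds of beating the market are multiplied by:
\frac{odds(epspw + \Delta)}{odds(epspw)} = e^{\beta_1 \Delta}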
THIS SOUNDS WEIRD SINCE A CHANGE OF +1.00 UNIT IN EPSPW IS ALMOST IMPOSSIBLE IN THE MARKET; EPSPW IS A VARIABLE THAT USUALLY RANGES BETWEEN -0.3 AND +0.30. THEN, IT IS CONVENIENT TO DO ANOTHER CALCULATION FOR THE BETA COEFFICIENT:
betaepsw_odds = exp(model1$coefficients[2]*0.1)
betaepsw_odds
epspw
1.228557
I MULTIPLIED THE COEFFICIENT BY 0.1 BEFORE TAKING THE EXPONENTIAL TO CALCULATE THE CHANGE IN THE ODDS RATIO FOR A CHANGE OF +0.1 IN EPSPW.
IN THIS CASE, IF FIRM A IMPROVES ITS EPSPW BY +0.1 UNITS FROM ONE QUARTER TO THE NEXT, THEN THIS FIRM WILL BE ABOUT 1.2285573 TIMES MORE LIKELY (IN TERMS OF ODDS) TO BEAT THE MARKET COMPARED TO THE CASE WHERE THE FIRM DOES NOT CHANGE ITS EPSPW.
YOU CAN INTERPRET THE MAGNITUDE OF THE COEFFICIENT USING THE CASE OF 1 FIRM THAT INCREASES (OR DECREASES) ITS EPSPW FROM ONE QUARTER TO THE NEXT, OR COMPARING 2 FIRMS IN THE SAME QUARTER WITH DIFFERENT EPSPW VALUES.
5 CHALLENGE 4: Running your first Machine Learning model
Create a dataset with the following columns:
- Future quarterly stock return (1 quarter later)
- F1r_above_market (1=beat the market in the corresponding quarter; 0= otherwise)
- Earnings per share deflated by price (epsp).
Create a training and testing sample: randomly select 80% of observations for the training sample and 20% for the testing sample.
Using the training sample, run the same logistic model to check whether epsp has explanatory power for the probability that the stock beats the market.
Create and interpret the confusion matrix
It is strongly recommended to review Chapter 2 of the DataCamp course “Machine Learning with Caret”.
5.1 SOLUTION
The variables are already created in the above sections.
I create the training and testing random samples:
set.seed(123456)
# Shuffle row indices:
rows_shuffled <- sample(nrow(uspanel))

# Randomly order data
shuffled_uspanel <- uspanel[rows_shuffled, ]
To better organize the data of the model, I will keep ONLY the variables I need for the machine learning model:
shuffled_uspanel <- shuffled_uspanel %>%
  select(firm, q, epspw, F1stockwin, stockret, mkyret)
I kept the independent and the dependent variable (F1stockwin), and also the returns of the stocks and the market (just in case).
Try an 80/20 split
Now that the dataset is randomly ordered, I can split the first 80% of it into a training set and the last 20% into a test set.
I can do this by choosing a split point at approximately 80% of the shuffled data rows:
# Determine row to split on:
split <- round(nrow(shuffled_uspanel)*.80)
split
[1] 445745
# Create train
train <- shuffled_uspanel[1:split, ]

# Create test
test <- shuffled_uspanel[(split+1):nrow(shuffled_uspanel), ]
According to the basics of machine learning, I only use the train dataset to calibrate my model, and then use the test dataset to do predictions and check how well the model predicted cases in the test dataset.
Then, I re-run the model with ONLY the train dataset:
model2 <- glm(F1stockwin ~ epspw, data= train, family="binomial", na.action=na.omit)
summary(model2)
Call:
glm(formula = F1stockwin ~ epspw, family = "binomial", data = train,
na.action = na.omit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.05148 0.00498 -10.34 <2e-16 ***
epspw 2.07493 0.05339 38.87 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 227682 on 164388 degrees of freedom
Residual deviance: 226128 on 164387 degrees of freedom
(281356 observations deleted due to missingness)
AIC: 226132
Number of Fisher Scoring iterations: 4
Now I use this model and do predictions of the probability that the stock beats the market, but in the test dataset:
test$F1predprob = predict(model2, newdata=test, type="response")
I see the predicted probabilities for the first cases:
head(test,10)
# A tibble: 10 × 7
# Groups: firm [10]
firm q epspw F1stockwin stockret mkyret F1predprob
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 VHC 2002q1 NA NA NA -0.0112 NA
2 MMI 2020q1 0.0123 0 -0.318 -0.0923 0.494
3 NWPX 2002q4 NA NA NA -0.266 NA
4 JNY_old 2000q3 0.0758 1 0.120 NA 0.526
5 RBLX 2012q1 NA NA NA 0.0605 NA
6 MTCH 2011q1 NA NA NA 0.126 NA
7 EVRG 2009q3 0.0771 1 0.0536 -0.0984 0.527
8 ANIX 2009q1 NA NA NA -0.505 NA
9 AVHI_old 2014q4 -0.00600 1 -0.00548 0.108 0.484
10 SYKE_old 2018q2 0.0147 0 -0.00554 0.115 0.495
I can also visualize a histogram to see the range of predicted probabilities:
hist(test$F1predprob)
I now create a column to predict a 1 (the stock beat the market) or 0 according to the predicted probabilities:
test$F1stockwinpred = ifelse(test$F1predprob>0.5, 1, 0)
Now I have a column for the actual binary variable (whether the stock beat the market), and also a predicted binary variable using the model and the test dataset.
I can now create a Confusion Matrix.
I need to convert the binary variables to a factor-type variables:
library(caret)
test$F1stockwin1 = factor(test$F1stockwin, levels=c("1","0"))
test$F1stockwinpred1 = factor(test$F1stockwinpred, levels=c("1","0"))
# When using the factor function, the first value of levels must be the POSITIVE value; in this case, =1
# Create confusion matrix
CM1 <- confusionMatrix(test$F1stockwinpred1, test$F1stockwin1, positive='1')
CM1
Confusion Matrix and Statistics
Reference
Prediction 1 0
1 7848 6726
0 11831 14852
Accuracy : 0.5502
95% CI : (0.5454, 0.555)
No Information Rate : 0.523
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.0881
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.3988
Specificity : 0.6883
Pos Pred Value : 0.5385
Neg Pred Value : 0.5566
Prevalence : 0.4770
Detection Rate : 0.1902
Detection Prevalence : 0.3532
Balanced Accuracy : 0.5435
'Positive' Class : 1
INTERPRETATION
The diagonal of the confusion matrix has the cases that my model CORRECTLY PREDICTED whether the stock beat or did not beat the market.
Then, looking at the matrix, we see the following:
- The sum of the FIRST COLUMN = 19679, which is the # of CASES when the stock actually BEAT the market
- The sum of the SECOND COLUMN = 21578, which is the # of CASES when the stock actually DID NOT BEAT the market
- Out of the 19679 cases when the stocks ACTUALLY BEAT THE MARKET, the model CORRECTLY PREDICTED 7848 cases, which are the TRUE POSITIVE cases. The rate of TRUE POSITIVES with respect to ALL ACTUAL POSITIVES is called Sensitivity, which is 0.3988008.
- Out of the 21578 cases when the stock DID NOT BEAT THE MARKET, the model CORRECTLY PREDICTED 14852 cases. These are the TRUE NEGATIVES. The rate (or %) of TRUE NEGATIVES with respect to ALL ACTUAL NEGATIVES is called Specificity, which is 0.6882936.
- Out of the 19679 cases when the stocks ACTUALLY BEAT THE MARKET, the model WRONGLY PREDICTED 11831 cases (it predicted 0 when the actual value was 1). These are the FALSE NEGATIVES.
- Out of the 21578 cases when the stock DID NOT BEAT THE MARKET, the model WRONGLY PREDICTED 6726 cases (it predicted 1 when the actual value was 0). These are the FALSE POSITIVES.
Remember that the Sensitivity and the Specificity rates are:
SensitivityRate = \frac{TRUEPOSITIVE}{(TRUEPOSITIVE+FALSENEGATIVE)}
SpecificityRate = \frac{TRUENEGATIVE}{(TRUENEGATIVE+FALSEPOSITIVE)}
If Sensitivity is greater than Specificity, the model is better at predicting POSITIVE CASES (when a stock actually beats the market) than NEGATIVE CASES (when a stock does not beat the market). In this model the opposite happens: Specificity (0.6883) is higher than Sensitivity (0.3988), so the model is better at predicting the negative cases.
The POSITIVE PREDICTED VALUE rate is:
PosPredValueRate = \frac{TRUEPOSITIVE}{(TRUEPOSITIVE+FALSEPOSITIVE)}
The NEGATIVE PREDICTED VALUE rate is:
NegPredValueRate = \frac{TRUENEGATIVE}{(TRUENEGATIVE+FALSENEGATIVE)}
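We can verify these ratios with the counts from the confusion matrix above (a quick check, not part of the original output):
# Counts taken from the confusion matrix reported above:
TP <- 7848; FN <- 11831; FP <- 6726; TN <- 14852
TP / (TP + FN)   # Sensitivity     = 0.3988
TN / (TN + FP)   # Specificity     = 0.6883
TP / (TP + FP)   # Pos Pred Value  = 0.5385
TN / (TN + FN)   # Neg Pred Value  = 0.5566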
We can do a ROC (Receiver Operating Characteristic) plot. You first have to install the pROC package.
library(pROC)
roc1 <- roc(response=test$F1stockwin, predictor=test$F1predprob, plot=T, col="blue", levels=c("0","1"))
I can plot the ROC using the information of roc1 object and the plot function:
plot(1-roc1$specificities, roc1$sensitivities, col="red", pch=16,
xlab="False Positive",
ylab="Sensitivity")
Note that the X Axis of this plot is from 0 to 1, and in the previous plot is from 1 to 0.
The area under the curve of the ROC (AUC) is:
roc1$auc
Area under the curve: 0.5668
As long as the area is greater than 0.5 this means that the model is better than a naive or a totally random model.
We can also plot curves that show, for each probability threshold used, the corresponding classification ratios (accuracy, sensitivity, specificity, and the predicted-value ratios).
library(ggplot2)
# I create a data frame with 3 columns: 1) the threshold probabilities, 2) Sensitivities, and
# 3) Specificities
thresholdvector <- cbind(roc1$thresholds, roc1$sensitivities, roc1$specificities)
thresholdvector <- as.data.frame(thresholdvector)
names(thresholdvector) <- c("threshold","sensitivities","specificities")
# The roc function does not calculate the positive-predicted ratio nor the negative-predicted
# ratio. To calculate these ratios, I need the # of tp, tn, fp and fn
# I add columns for # true positive cases (tp), #tn, #fp, and #fn:
thresholdvector$tp = thresholdvector$sensitivities*length(roc1$cases)
thresholdvector$tn = thresholdvector$specificities*length(roc1$controls)
# False negatives are actual positives not detected; false positives are actual negatives predicted as positive:
thresholdvector$fn = length(roc1$cases) - thresholdvector$tp
thresholdvector$fp = length(roc1$controls) - thresholdvector$tn
# Check that length(roc1$cases) is the total of cases when the event actually happened
# I create the positive-prediction ratio and the negative-prediction ratio:
thresholdvector$pospredratio = thresholdvector$tp / (thresholdvector$tp+thresholdvector$fp)
thresholdvector$negpredratio = thresholdvector$tn / (thresholdvector$tn+thresholdvector$fn)

# I create the accuracy ratio, which is the % of true positive and true negative with respect to ALL
thresholdvector$accuracy = (thresholdvector$tp + thresholdvector$tn) / (length(roc1$cases) + length(roc1$controls))
# I plot the 4 ratios against the threshold probability to visualize which might be the
# best threshold probability:
ggplot(thresholdvector,aes(x=threshold)) +
geom_line(aes(y=accuracy,color="Accuracy ratio")) +
geom_line(aes(y=specificities,color="Specificity ratio")) +
geom_line(aes(y=sensitivities,color="Sensitivity ratio")) +
geom_line(aes(y=pospredratio,color="True-Positive Predicted ratio")) +
geom_line(aes(y=negpredratio,color="True-Negative Predicted ratio"))
It seems that an optimal threshold should be between 0.45 and 0.55. We can identify the threshold that maximizes the accuracy ratio using dplyr:
thresholdsorted <- thresholdvector %>%
  select(threshold, accuracy, sensitivities, specificities) %>%
  arrange(desc(accuracy)) %>%
  head(5)
thresholdsorted
threshold accuracy sensitivities specificities
1 0.4978217 0.5526335 0.4511916 0.6451478
2 0.4978144 0.5526093 0.4513441 0.6449625
3 0.4978215 0.5526093 0.4511916 0.6451015
4 0.4978233 0.5526093 0.4511408 0.6451478
5 0.4978297 0.5526093 0.4510392 0.6452405
We can see that the threshold that gets the highest accuracy is 0.4978217. This is very close to the original threshold of 0.5.
For the cases where the optimal threshold is significantly different than 0.5, it is recommended to re-calculate the predicted binary variable (F1stockwinpred) with this optimal threshold. For example, we can re-calculate the predicted binary variable with this threshold as follows:
test$F1stockwinpred_new = ifelse(test$F1predprob>0.4978217, 1, 0)
table(test$F1stockwinpred_new)
0 1
24960 16659
We can compare this table with the first binary predicted variable:
table(test$F1stockwinpred)
0 1
26949 14670
With the new threshold there are more ’1’s predicted (more POSITIVE predicted cases)
I will stop here. However, the next step would be to use this new prediction (F1stockwinpred_new) to create a new Confusion Matrix to see the prediction indicators (sensitivity, specificity, accuracy, etc). Once we check that our model improves the prediction indicators, then we can use this model to predict the probability for the stocks to beat the market in the future NEW quarter.
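A minimal sketch of that next step (not run in the original; it reuses the caret objects created above):
# Hypothetical next step: build a confusion matrix with the new-threshold predictions
test$F1stockwinpred_new1 <- factor(test$F1stockwinpred_new, levels=c("1","0"))
CM2 <- confusionMatrix(test$F1stockwinpred_new1, test$F1stockwin1, positive='1')
CM2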
Finally, we can use these predictions to rank the stocks and select the best ones.