This is an INDIVIDUAL workshop. In this workshop we continue practice programming for data management, and we introduce the logistic regression model in the context of machine learning
1 Data management for panel-data
For financial analysis, data management is a very important process. Most of the time, before designing and running a financial / econometric model, it is needed to do data wrangling: data collection/importing, data cleanning, data merging, data transformation.
We will keep using the same datasets we used in Workshop 2 for all excercises of this workshop.
Download the online datasets from:
List of all US public firms with general information of each firm
import pandas as pdimport requestsheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}# Download the first fileurl1 ='https://www.apradie.com/datos/usfirms.csv'response1 = requests.get(url1, headers=headers)withopen('usfirms.csv', 'wb') as f: f.write(response1.content)
633061
# Download the second fileurl2 ='https://www.apradie.com/datos/usdata.csv'response2 = requests.get(url2, headers=headers)withopen('usdata.csv', 'wb') as f: f.write(response2.content)
11317275
# Import as data framesusfirms = pd.read_csv('usfirms.csv')usdata = pd.read_csv('usdata.csv')# Display the first few rows of each DataFrame to confirmprint("usfirms DataFrame:")
usfirms DataFrame:
usfirms.head()
empresa ... SectorEconomatica
0 A ... Electric Electron
1 AA ... Basic & Fab Metal
2 AABA_old ... Software & Data
3 AAC_old ... Other
4 AAIC_old ... Funds
[5 rows x 7 columns]
print("\nusdata DataFrame:")
usdata DataFrame:
usdata.head()
firm year revenue ... originalprice sharesoutstanding fixedassets
0 A 2000 10773000.0 ... 54.75 456366.381 1741000.0
1 A 2001 8396000.0 ... 28.51 463695.160 1848000.0
2 A 2002 6010000.0 ... 17.96 467024.421 1579000.0
3 A 2003 6056000.0 ... 29.24 476149.083 1447000.0
4 A 2004 7181000.0 ... 24.10 486841.087 1258000.0
[5 rows x 28 columns]
You can go to Workshop 2 (https://rpubs.com/cdorante/fz2022p_w2) to see the data dictionary for each dataset.
As in the previous workshop, merge the 2 datasets (usdata and usfirms) using a left-join. Remember, the panel-dataset usdata is the left dataset, which has historical annual financial data for all US firms; and the usfirms is the right dataset, which is a cross-sectional dataset with general information of firms from the S&P500 index.
# I merge the datasets using usdata as the left dataset, and usfirms as the right dataset merged_data = pd.merge(usdata, usfirms[['empresa', 'Nombre', 'status', 'naics1']], left_on='firm', right_on='empresa', how='left')# Drop the redundant 'empresa' column from the merged DataFramemerged_data = merged_data.drop('empresa', axis=1)# Display the first few rows of the merged DataFrame to confirmprint("Merged DataFrame:")
Merged DataFrame:
merged_data.head()
firm year revenue ... Nombre status naics1
0 A 2000 10773000.0 ... Agilent Technologies, Inc active Manufacturing
1 A 2001 8396000.0 ... Agilent Technologies, Inc active Manufacturing
2 A 2002 6010000.0 ... Agilent Technologies, Inc active Manufacturing
3 A 2003 6056000.0 ... Agilent Technologies, Inc active Manufacturing
4 A 2004 7181000.0 ... Agilent Technologies, Inc active Manufacturing
[5 rows x 31 columns]
2 CHALLENGE 1. Calculating financial variables
Using the merged dataset, you have to write the code to calculate the following financial variables and financial ratios for all firms-years:
1. Create financial variables
Create columns for the following variables:
Market value (marketvalue) = originalprice * sharesoutstanding
We use the original stock price (before stock splits and dividend adjustments) since the # of shares outstanding is the historical # of shares.
import numpy as np# Market valuemerged_data['marketvalue'] = merged_data['originalprice'] * merged_data['sharesoutstanding']
Gross profit (grossprofit) = Revenue - Cost of good Sold (cogs)
Annual market return: calculate annual return for all firm-years by calculating the continuously compounded percentage of thea djusted stock price (adjprice). Consider that you have a panel-data, so be careful when calculating returns to avoid using data from another firm in the cases of the first year for all firms. Hint: you can use the shift function and groupby firm to avoid using stock price of another stock to calculate the annual return of each firm for all years.
# Annual market return# Sort by firm and year to ensure correct calculation within each firmmerged_data = merged_data.sort_values(by=['firm', 'year'])# Calculate annual log return:# I create a column for the previous year; I use the groupby('firm') function to avoid pulling the stock price of another firm:merged_data['price_lag1'] = merged_data.groupby('firm')['adjprice'].shift(1)# I create the annual log return (continuously compounded return) by getting the log of the current price divided by the previous pricemerged_data['annual_return'] = np.log(merged_data['adjprice'] / merged_data['price_lag1'])# I could also calculate the log return by getting the difference of the log stock prices between t and t-1:#merged_data['annual_return'] = np.log(merged_data['adjprice']) - np.log(merged_data['price_lag1'])
2. Using the same panel dataset, create columns for the following financial ratios:
Operational Return on Assets (roabit): ebit divided by total assets at the beginning of the period. Total assets of the beginning of the year is actually the total assets of the previous year.
roabit=\frac{ebit_{t}}{totalassets_{t-1}}
# Operational Return on Assets (roabit)# Calculate total assets at the beginning of the period (previous year's total assets)merged_data['totalassets_lag1'] = merged_data.groupby('firm')['totalassets'].shift(1)# calculate roabit using the lagged total assets:merged_data['roabit'] = merged_data['ebit'] / merged_data['totalassets_lag1']
Here you can use the shift function to get value of total assets one year ago. Make sure that you indicate to groupby firm so you do not use the totalssets from another firm to calculate the roabit of a firm.
Do your own research and briefly explain what is earnings per share deflated by price, and book-to-market ratio
EARNINGS PER SHARE IS EQUAL TO EARNINGS DIVIDED BY THE # OF SHARES. THE MEASURE FOR EARNINGS IS NET INCOME. HOWEVER, SOME ANALYSTS ALSO USE OTHER OPERATIONAL MEASURES FOR EARNINGS SUCH AS EARNINGS BEFORE INTEREST AND TAXES (EBIT). IF WE WANT TO MEASURE OPERATIONAL EARNINGS AND CALCULATE IT FOR MANY FIRMS, IT IS RECOMMENDED TO USE EBIT AS A MEASURE OF EARNINGS.
THE FORMULA OF EPS USING NET INCOME AS A MEASURE OF EARNINGS IS:
EPS_{t}=\frac{NETINCOME_{t}}{\#OFSHARES_{t}}
THE FORMULA FOR EPS USING EBIT AS A MEASURE OF EARNINGS IS:*
EPS_{t}=\frac{EBIT_{t}}{\#OFSHARES_{t}}
IN A HYPOTHETICAL SCENARIO, IF THE ALL EARNINGS OF A PERIOD t WERE PAYED TO THE INVESTORS, THEN EPS WILL BE HOW MUCH OF ALL EARNINGS OF THE PERIOD IS PAYED TO EACH SHARE OWN BY INVESTORS.
WHAT IS BOOK-TO-MARKET RATIO?
Book-to-maret ratio (bmr) is the ratio of accounting book value of the firm to its market value. In other words, it results by dividing book value by the market value of the firm at a specific time period.
If bmr=1 means that the firm book value is about the same as firm market value. If that is the case, then the market value has not grown beyond book-value, meaning that the firm has not created value beyond its book value.
If bmr>1 means that the market value is less than book value. So, if bmr>1 means that the firm has significantly lost shareholder’s wealth, and it might incur in bankrupt risk.
Then, what would be the bmr level that all firms are looking for? One of the main purposes of the executives is to MAXIMIZE shareholder’s value. The way to increase shareholder’s value is to increase its market value, and the only way to increase market value is to increase stock price.
Then, the bmr level that all executives prefer is a value much less than 1.
If bmr=0.5 means that the firm market value is the double of its book value. In this case, the main difference between market value and book value is the wealth that the firm has created thanks to its valuable intangible assets such as prestige, high quality, and innovation.
Then, what do you think it might be the relationship between bmr and stock return? Intuitively, we might think that a small value of bmr is a good news, then the stock return might be related in a negative way. If bmr goes down (good news), then the stock return might go up. Then, it might be expected that the relationship between bmr and stock return is linear and negative. Some finance research (like Fama & French, 1995), mainly with US firms has found that the relationship between bmr and future stock return is negative.
However, there are mixed findings about the relationship between bmr and stock returns. Some research (like Piotrosky, 2000) has found that firms with high bmr, but strong financials (like earnings per share) usually provides significant positive stock returns. In this special case, bmr has found to be positively related to stock returns.
Finance research has found that bmr influences earnings per share, which in turn influences current and future stock returns. It is some times that firms with low bmr do not experience significant high stock returns due to the high expectations of investors. This is an interesting finding in finance that is still being researched!
3 CHALLENGE 2: Winsorization of variables
You have to do your own research about winsorization. Explain it with your words (pay attention in class)
WINSORIZATION IS THE PROCESS OF FLATTENING EXTREME (OUTLIERS) OBSERVATIONS OF A VARIABLE WITH THE PURPOSE OF AVOIDING UNRELIABLE ESTIMATIONS OF BETA COEFFICIENTS WHEN INCLUDING THAT VARIABLE AS EXPLANATORY VARIABLE IN AN ECONOMETRIC MODEL.
THE WINSORIZATION PROCESS CAN BE APPLIED TO HIGH OR LOW (NEGATIVE) VALUES OF THE VARIABLE. FOR HIGH VALUES WE CAN INDICATE FROM WHICH PERCENTILE WE CONSIDER THE VALUES AS OUTLIERS, THEN THOSE VALUES ABOVE THAT PERCENTILE IS REPLACE WITH THE VALUE OF THAT PERCENTILE. FOR LOW VALUES WE INDICATE FROM WHICH PERCENTILE WE CONSIDER THE VALUES AS OUTLIERS, THEN THOSE VALUES BELOW THAT PERCENTILE IS REPLACE WITH THE VALUE OF THAT PERCENTILE.
Let’s review the histogram of the operating earnings per share to check whether there are extreme values:
import matplotlib.pyplot as pltimport seaborn as snsplt.figure(figsize=(10, 6))sns.histplot(merged_data['oepsp'].dropna(), bins=50, kde=True)plt.title('Distribution of Operational EPS deflated by Stock Price (oepsp)')plt.xlabel('oepsp')plt.ylabel('Frequency')plt.show()
We can see that there are very, very few extreme values for oepsp; the maximum value is more than 10 million! You have to see the 1e7 in the X axix indicating the scale of oepsp; 1e7 is scientific notation for 10 million (1 x 10^7)
Then, we need to winsorize oepsp to avoid unreliable estimations of the logistic regression since we plan to use oepsp as the X independent variable.
You have to winsorize the following ratio at the 2 percentile for the left values and 98 percentile for the high values:
Earnings per share deflated by price
I use the quantile method of pandas to get the 2 and 95 percentiles, and the clip method to flatten the extreme values with the corresponding percentiles:
# Winsorize oepsp at 2% and 98%lower_bound = merged_data['oepsp'].quantile(0.02)upper_bound = merged_data['oepsp'].quantile(0.98)merged_data['oepspw'] = merged_data['oepsp'].clip(lower=lower_bound, upper=upper_bound)# Display the first few rows to confirm the new columnprint("Merged DataFrame with winsorized oepsp (oepspw):")
Merged DataFrame with winsorized oepsp (oepspw):
merged_data.head()
firm year revenue cogs ... oepsp bookvalue bmr oepspw
0 A 2000 10773000.0 5522000.0 ... 0.042144 5265000.0 0.210718 0.042144
1 A 2001 8396000.0 5166000.0 ... -0.058850 5659000.0 0.428065 -0.058850
2 A 2002 6010000.0 3694000.0 ... -0.191589 4627000.0 0.551637 -0.191589
3 A 2003 6056000.0 3762000.0 ... -0.052074 2824000.0 0.202836 -0.052074
4 A 2004 7181000.0 4058000.0 ... 0.032899 3569000.0 0.304188 0.032899
[5 rows x 45 columns]
oepsp oepspw annual_return
count 5.917200e+04 59172.000000 53983.000000
mean inf -0.024825 -0.003683
std NaN 0.283234 0.557446
min -3.887477e+05 -1.409846 -10.685795
25% -2.085332e-02 -0.020853 -0.196382
50% 4.857955e-02 0.048580 0.069674
75% 9.258618e-02 0.092586 0.280394
max inf 0.301041 3.993313
Now I can check the histogram of the winsorized oepsp:
plt.figure(figsize=(10, 6))sns.histplot(merged_data['oepspw'].dropna(), bins=50, kde=True)plt.title('Distribution of Winsorized Operational EPS deflated by Stock Price (oepspw)')plt.xlabel('oepspw')plt.ylabel('Frequency')plt.show()
Now the minimum value for oepspw is from about -1.30 to +0.25! we do not have extreme values. We can proceed using oepspw as the X predictor for a logistic regression model.
But before we need to bring the market annual return to our merged dataset.
Then, I need to create the dataset for annual market return, and then merge it with the merged_data.
4 CHALLENGE 3: Create an annual dataset for historical market returns
You have to create an annual dataset with annual market returns of the S&P500 index (^GSPC), from 2000 to 2023. You can download monthly quotes from Yahoo Finance using the yfinance library, and then aggregate by year, and then calculate continuously compounded returns.
Once you have the annual returns of the ^GSPC in a dataset, merge these values into the merged panel-dataset using the year as the common/matching column.
Here is the code to create this annual dataset for the market returns:
import yfinance as yf# Download monthly data for S&P 500 (^GSPC) from Dec 1999 to Dec 2023sp500_data = yf.download('^GSPC', start='1999-12-01', end='2023-12-31', interval='1mo')
YF.download() has changed argument auto_adjust default to True
[*********************100%***********************] 1 of 1 completed
# Keep only the last month's adjusted close price for each year# I do a group by year and get the last value (each december for each year)annual_sp500_data = sp500_data.groupby(sp500_data.index.year)['Close'].tail(1)# Create a DataFrame for annual returnsannual_sp500_returns = pd.DataFrame(annual_sp500_data)annual_sp500_returns.columns = ['adjprice']annual_sp500_returns['year'] = annual_sp500_returns.index.year# Calculate annual log returnsannual_sp500_returns['price_lag1'] = annual_sp500_returns['adjprice'].shift(1)annual_sp500_returns['sp500_annual_return'] = np.log(annual_sp500_returns['adjprice'] / annual_sp500_returns['price_lag1'])# Drop the intermediate columnsannual_sp500_returns = annual_sp500_returns[['year', 'sp500_annual_return']]# Display the annual S&P 500 returnsprint("Annual S&P 500 Returns:")
Annual S&P 500 Returns:
annual_sp500_returns.head()
year sp500_annual_return
Date
1999-12-01 1999 NaN
2000-12-01 2000 -0.106908
2001-12-01 2001 -0.139753
2002-12-01 2002 -0.266129
2003-12-01 2003 0.234126
Check that I did a group by year and used the tail method to get the december price for each year by firm.
I ended up with a small dataset of 24 rows (24 years of market returns) and 2 columns (year and sp500_annual_return)
Finally I merge this dataset with the panel dataset:
# Merge the S&P 500 annual returns with the merged_data DataFramemerged_data = pd.merge(merged_data, annual_sp500_returns, on='year', how='left')# Display the first few rows of the merged DataFrame to confirmprint("Merged DataFrame with S&P 500 annual returns:")
Merged DataFrame with S&P 500 annual returns:
merged_data.head()
firm year revenue ... bmr oepspw sp500_annual_return
0 A 2000 10773000.0 ... 0.210718 0.042144 -0.106908
1 A 2001 8396000.0 ... 0.428065 -0.058850 -0.139753
2 A 2002 6010000.0 ... 0.551637 -0.191589 -0.266129
3 A 2003 6056000.0 ... 0.202836 -0.052074 0.234126
4 A 2004 7181000.0 ... 0.304188 0.032899 0.086118
[5 rows x 46 columns]
5 CHALLENGE 4: Logistic regression models with lagged values
Design and run a logistic regression to examine whether earnings per share deflated by price winsorized (epspw) is related to the probability that the future annual stock returns (1 year ahead) is higher than the future market annual return (1 year ahead).
Pay attention in class to learn how to run a logistic regression model, and how to indicate to use future or lagged values for variables in the model.
# import statsmodels to run the logistic regression:import statsmodels.api as sm# Prepare data - drop rows with NaN in relevant columns and ensure sorting for correct laggingregression_data = merged_data.dropna(subset=['firm', 'year', 'oepspw', 'annual_return', 'sp500_annual_return']).copy()# Sort data by firm and year to ensure correct laggingregression_data = regression_data.sort_values(by=['firm', 'year'])# Create the binary dependent variable: 1 if firm annual return > S&P 500 annual return, 0 otherwiseregression_data['firm_return_higher_than_market'] = (regression_data['annual_return'] > regression_data['sp500_annual_return']).astype(int)# Shift the dependent variable one year ahead within each firm groupregression_data['firm_return_higher_than_market_forward1'] = regression_data.groupby('firm')['firm_return_higher_than_market'].shift(-1)# Drop rows with NaN in the lagged dependent variableregression_data = regression_data.dropna(subset=['firm_return_higher_than_market_forward1']).copy()# Define the independent and dependent variablesX = regression_data['oepspw']Y = regression_data['firm_return_higher_than_market_forward1']# Add a constant to the independent variable for the interceptX = sm.add_constant(X)# Fit the logistic regression modelmodel = sm.Logit(Y, X)result = model.fit()
Optimization terminated successfully.
Current function value: 0.688519
Iterations 5
# Print the summary of the regression resultsprint(result.summary())
THE INTERPRETATION OF A LOGISTIC MODEL IS SIMILAR TO THAT OF THE LINEAR REGRESSION MODEL, BUT THERE IS AN IMPORTANT DIFFERENCE.
IN TERMS OF SIGN AND SIGNIFICANCE OF THE INDEPENDENT VARIABLE, WE INTERPRET THE LOGISTIC REGRESSION IN A SIMILAR WAY. IN THIS CASE, I CAN INTERPRET THE COEFFICIENT AND SIGNIFICANCE OF EARNINGS PER SHARE AS FOLLOWS:
SINCE THE BETA1 COEFFICIENT IS POSITIVE AND SIGNIFICANCE, THEN EARNINGS PER SHARE (DEFLATED BY PRICE) IS POSITIVELY AND SIGNIFICANTLY RELATED TO THE PROBABILITY THAT THE STOCK WILL HAVE A HIGHER RETURN IN THE FUTURE COMPARED TO THE FUTURE MARKET RETURN. IN OTHER WORDS, THE HIGHER THE EPSPW, THE MORE LIKELY THAT THE STOCK WILL BEAT THE MARKET RETURN ONE QUARTER LATER.
THE INTERPRETATION OF THE MAGNITUDE OF THE COEFFICIENT IS DIFFERENT THAN IN THE CASE OF LINEAR REGRESSION.
FIRST WE NEED TO CALCULATE THE EXPONENCIAL OF THE BETA1 COEFFICIENT:
# Get the beta1 coefficient (coefficient for 'oepspw') from the previous modelbeta1_coefficient = result.params['oepspw']# Calculate the exponential of the beta1 coefficientexp_beta1 = np.exp(beta1_coefficient)# Print the resultprint(f"The exponential of the beta1 coefficient (oepspw) is: {exp_beta1:.4f}")
The exponential of the beta1 coefficient (oepspw) is: 1.8879
This new beta coefficient of 1.8879 can be interpreted as:
FOR EACH +1 CHANGE IN EPSPW, IT IS EXPECTED THAT THE ODDS RATIO (THE PROBABILITY THAT THE STOCK BEATS THE MARKET WITH RESPECT TO THE PROBABILITY THAT THE STOCK DO NOT BEAT THE MARKET) INCREASES IN 1.8879 UNITS. IN OTHER WORDS, A FIRM A HAS AN EPSPW THAT IS +1.00 HIGHER THAN A FIRM B, THEN THE FIRM A WILL BE 1.8879 TIMES MORE LIKELY TO BEAT THE MARKET COMPARED TO FIRM B.
THIS SOUNDS WEIRD SINCE A CHANGE OF +1.00 UNIT IN EPSPW IS ALMOST IMPOSSIBLE IN THE REAL WORLD; EPSPW IS A VARIABLE THAT USUALLY RANGES BETWEEN -0.3 AND +0.30. THEN, IT IS CONVENIENT TO DO ANOTHER CALCULATION FOR THE BETA COEFFICIENT:
I MULTIPLIED THE COEFFICIENT TIMES 0.1 TO CALCULATE THE BETA COEFFICIENT AS A PARTIAL CHANGE OF THE ODDS RATIO FOR A CHANGE OF +0.1 IN EPSPW.
IN THIS CASE, IF A FIRM A IMPROVES ITS EPSPW IN +0.1 UNIT FROM ONE QUARTER TO THE NEXT, THEN THIS FIRM WILL BE 1.0656 MORE LIKELY TO BEAT THE MARKET COMPARED TO CASE THAT THE FIRM DOES NOT CHANGE ITS EPSPW.
YOU CAN INTERPRET THE MAGNITUDE OF THE COEFFICIENT USING THE CASE OF 1 FIRM THAT INCREASES (OR DECREASES) ITS EPSPW FROM ONE QUARTER TO THE NEXT, OR COMPARING 2 FIRMS IN THE SAME QUARTER WITH DIFFERENT EPSPW VALUES.
6 CHALLENGE 5: Running your first Machine Learning model
Create a dataset with the following columns:
Future annual stock return (1 year later)
F1r_above_market (1=beat the market in the corresponding year; 0= otherwise)
Earnings per share deflated by price (epsp).
Create a training and testing sample: randomly select 80% of observations for the training sample and 20% for the testing sample.
Using the training sample, run the same logistic model to check whether epsp has explanatory power for the probability that the stock beats the market.
Create and interpret the confusion matrix
7 CHALLENGE 5 SOLUTION
It is strongly recommended to review the Chapter 1 : Classification from the course: Supervised Learning with scikit-learn
I will use the scikit-learn library to build a logistic regression model. This model will be trained on 80% of the data to predict the probability that a firm’s annual return will be higher than the S&P 500 annual return in the following year, using the winsorized operational earnings per share deflated by stock price (oepspw) as the predictor. The remaining 20% of the data will be used to test the model’s performance.
from sklearn.model_selection import train_test_splitfrom sklearn.linear_model import LogisticRegressionfrom sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, ConfusionMatrixDisplay# Prepare data - drop rows with NaN in relevant columns and ensure sorting for correct lagging# Use the regression_data from the previous step which already has the lagged dependent variableregression_data_ml = regression_data.copy()# Define the independent and dependent variables (using the lagged dependent variable)X_ml = regression_data_ml[['oepspw']] # X needs to be a DataFrameY_ml = regression_data_ml['firm_return_higher_than_market_forward1']# Shuffle and split the data into training and testing setsX_train, X_test, Y_train, Y_test = train_test_split(X_ml, Y_ml, test_size=0.20, random_state=42, shuffle=True)# Initialize and train the Logistic Regression modelmodel_ml = LogisticRegression()model_ml.fit(X_train, Y_train)
LogisticRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Parameters
penalty
'l2'
dual
False
tol
0.0001
C
1.0
fit_intercept
True
intercept_scaling
1
class_weight
None
random_state
None
solver
'lbfgs'
max_iter
100
multi_class
'deprecated'
verbose
0
warm_start
False
n_jobs
None
l1_ratio
None
# Predict probabilities on the test datasetY_pred_proba = model_ml.predict_proba(X_test)[:, 1]# Create the final prediction based on a probability threshold of 0.50Y_pred = (Y_pred_proba >0.50).astype(int)# Create the Confusion Matrixcm = confusion_matrix(Y_test, Y_pred)# Display the confusion matrixdisp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1]) # Assuming 0 and 1 are the class labelsdisp.plot(cmap=plt.cm.Blues)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay object at 0x0000026788D9A270>
plt.title('Confusion Matrix')plt.show()
# Calculate statisticsaccuracy = accuracy_score(Y_test, Y_pred)precision = precision_score(Y_test, Y_pred)recall = recall_score(Y_test, Y_pred)f1 = f1_score(Y_test, Y_pred)# Extracto the values from the confusion matrix:tn, fp, fn, tp = cm.ravel()# The ravel method pull all values of a matrix from top-left to bottom-right, and put them in different values: specificity = tn / (tn + fp)all_actual0 = tn + fpall_actual1 = fn + tpprint("\nModel Evaluation Statistics:")
Model Evaluation Statistics:
print(f"Accuracy: {accuracy:.4f}")
Accuracy: 0.5406
print(f"Precision: {precision:.4f}")
Precision: 0.5521
print(f"Recall (Sensitivity): {recall:.4f}")
Recall (Sensitivity): 0.1026
print(f"Specificity: {specificity:.4f}")
Specificity: 0.9267
print(f"F1-Score: {f1:.4f}")
F1-Score: 0.1730
The Confusion Matrix can also be calculated as a Pivot Table (like the Excel Pivot Table) using Y_test and Y_pred. Here is an alternative:
# Create a DataFrame from Y_test and Y_pred for the pivot tablepivot_data = pd.DataFrame({'Y_test': Y_test, 'Y_pred': Y_pred})# Create the pivot tableconfusion_pivot = pivot_data.pivot_table(index='Y_test', columns='Y_pred', aggfunc='size', fill_value=0)# Display the pivot table (which is the confusion matrix)print("Confusion Matrix (Pivot Table):")
Confusion Matrix (Pivot Table):
confusion_pivot
Y_pred 0 1
Y_test
0.0 4777 378
1.0 4078 466
Here is an illustration of what the Confusion Matrix calculates:
All observations of the testing data-set are classified and counted according to the following table:
Predicted: Positive
Predicted: Negative
Actual: Positive
True Positive (TP)
False Negative (FN)
Actual: Negative
False Positive (FP)
True Negative (TN)
In each cell the number of cases (observations) are shown.
Positive/Negative refers whether the event happens or not
True/False refers whether the model correctly predicted (true) or not (false)
TP (True Positive): the model correctly predicted positive (that the event happened)
TN (True Negative): the model correctly predicted negative (that the event did not happened)
FP (False Positive): the model incorrectly predicted positive (Type I error)
FN (False Negative): the model incorrectly predicted negative (Type II error)
INTERPRETATION
The diagonal of the confusion matrix has the cases that my model CORRECTLY PREDICTED whether the stock beat or did not beat the market.
Then, looking at the matrix, we see the following:
The sum of the FIRST ROW= 5155= # of CASES when the stock actually DID NOT BEAT the market
The sum of the SECOND ROW= 4544= # of CASES when the stock actually BEAT the market
Out of the 4544 cases when the stocks ACTUALLY BEAT THE MARKET, the model CORRECTLY PREDICTED 466 cases, which are the TRUE POSITIVES cases. The rate of TRUE POSITIVES with respect to ALL POSITIVES is called Sensitivity or Recall, which is 0.1026.
Out of the 5155 cases when the stock DID NOT BEAT THE MARKET, the model CORRECTLY PREDICTED 4777 cases. This # is called: TRUE NEGATIVES. The rate (or %) of TRUE NEGATIVES with respect to ALL NEGATIVES is called Specificity, which is 0.9267.
Out of the 4544 cases that the stock ACTUALLY BEAT THE MARKET, the model WRONGLY PREDICTED 4078 cases. This # is called: FALSE NEGATIVES.
Out of the 5155 cases when the stock DID NOT BEAT THE MARKET, the model WRONGLY PREDICTED 378 cases. This # is called: FALSE POSITIVES.
Accuracy (0.5406): This is the overall proportion of correct predictions (both true positives and true negatives) out of all predictions. An accuracy of 0.5344 means the model was correct about 54.0571% of the time.
Precision (0.5521): This measures the accuracy of the positive predictions. Of all the times the model predicted that the firm’s return would be higher than the market return, it was correct about 55.2133% of the time.
The model has a very high specificity but very low recall. This means it is quite good at predicting when a stock will not outperform the market, but it is very poor at predicting when a stock will outperform the market.
Remember that the Sensitivity and the Specificity rates are:
If Sensitivity is greater than Specificity this means that the model is better predicting cases when THE EVENT ACTAULLY HAPPENED (when a stock actually beat the market) vs cases when THE EVENT DID NOT HAPPEN (when a stock did not beat the market)
We can do a ROC (Receiver Operator Characteristic) PloT.
from sklearn.metrics import roc_curve, roc_auc_scoreimport matplotlib.pyplot as plt# Calculate the False Positive Rate (FPR) and True Positive Rate (TPR) for different probability thresholdsfpr, tpr, thresholds = roc_curve(Y_test, Y_pred_proba)# Calculate the Area Under the ROC Curve (AUC)roc_auc = roc_auc_score(Y_test, Y_pred_proba)# Plot the ROC curveplt.figure(figsize=(8, 6))plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random guess')plt.xlabel('False Positive Rate (1 - Specificity)')plt.ylabel('True Positive Rate (Recall)')plt.title('Receiver Operating Characteristic (ROC) Curve')plt.legend(loc='lower right')plt.show()
The ROC curve plots the True Positive Rate (sensitivity) against the False Positive Rate (1 - specificity) for different possible classification thresholds.
The Area Under the Curve (AUC) is a measure of the model’s ability to distinguish between the positive and negative classes. An AUC of 1.0 represents a perfect model, while an AUC of 0.5 represents a model that performs no better than random chance.
The model’s AUC is 0.5628. A higher AUC indicates better model performance.
As long as the area is greater than 0.5 this means that the model is better than a naive or a totally random model.
We can analyze how Precision, Sensitivity and Specificity change when the probability threshold changes (moves from the original 0.50 probability). We do this since we can change this probability threshold without modifying the logit model to see if we improve any of these accuracy ratios.
In the next code I do a loop by changing the probability threshold from 0 to 1 and calculate and plot the relevant ratios for each probability threshold:
# Analyze the impact of different probability thresholds on sensitivity, specificity, and precisionthresholds = np.arange(0.1, 1.0, 0.05) # Define a range of thresholds to testsensitivity_scores = []specificity_scores = []precision_scores = []for threshold in thresholds:# Create predictions based on the current threshold Y_pred_threshold = (Y_pred_proba > threshold).astype(int)# Calculate confusion matrix cm_threshold = confusion_matrix(Y_test, Y_pred_threshold)# Extract TN, FP, FN, TP# The ravel method pull all values of a matrix from top-left to bottom-right, and put them in different values. tn, fp, fn, tp = cm_threshold.ravel()# Calculate sensitivity (Recall) sensitivity = tp / (tp + fn) if (tp + fn) >0else0 sensitivity_scores.append(sensitivity)# Calculate specificity specificity = tn / (tn + fp) if (tn + fp) >0else0 specificity_scores.append(specificity)# Calculate precision precision = tp / (tp + fp) if (tp + fp) >0else0 precision_scores.append(precision)# Display the resultsprint("Sensitivity, Specificity, and Precision for different probability thresholds:")
Sensitivity, Specificity, and Precision for different probability thresholds:
for i inrange(len(thresholds)):print(f"Threshold: {thresholds[i]:.2f}, Sensitivity: {sensitivity_scores[i]:.4f}, Specificity: {specificity_scores[i]:.4f}, Precision: {precision_scores[i]:.4f}")
# Plot the metrics vs. thresholdplt.figure(figsize=(10, 6))plt.plot(thresholds, sensitivity_scores, marker='o', label='Sensitivity (Recall)')plt.plot(thresholds, specificity_scores, marker='o', label='Specificity')plt.plot(thresholds, precision_scores, marker='o', label='Precision')plt.xlabel('Probability Threshold')plt.ylabel('Score')plt.title('Model Performance Metrics vs. Probability Threshold')plt.legend()plt.grid(True)plt.show()
We can ses that Sensitivity and Specificity are like mirrors to each other. In other words, when probability threshold increases, Specificity improves, but Sensitivity get worse in a similar proportion.
In this case, we might be interested in maximizing Precision to maximize the probability of correctly guessing when we predict that the stock will beat the market. Then, we can identify the exact probability threshold that maximizes Precision:
# Find the threshold that maximizes Precisionmax_precision =max(precision_scores)max_precision_index = precision_scores.index(max_precision)threshold_maximizing_precision = thresholds[max_precision_index]print(f"\nThe probability threshold that maximizes Precision is: {threshold_maximizing_precision:.2f}")
The probability threshold that maximizes Precision is: 0.50
print(f"The maximum Precision achieved is: {max_precision:.4f}")
The maximum Precision achieved is: 0.5521
In this case, the probability threshold that maximizes precision is 0.50, so we keep the original threshold of 50%.
It depends on the context, we can use a different probability threshold to maximize any ratio (Sensitivity, Specificity, Precision), and then re-calculate the binary predictions and re-do the Confusion Matrix to evaluate the model, and use it for final predictions.