Training_Data_70 <- read.csv("C:/Users/dnorth/Desktop/Training_Data_70.csv")
Question 1: State the main question you will engage through your data analysis. What makes it an interesting question?
Answer: In this assignment, I will study the question: “Are logit models effective at predicting whether a company is currently insolvent?” This question is interesting because it applies a nonlinear classification technique to financial forecasting. If logit models are an effective classification tool in the case of bankruptcy, the applications for entities such as hedge funds, private equity funds, and investment banks would be numerous and profitable.
Question 2: What are the main variables of interest (specifically your dependent variable and one or two independent variables)? What “Control” variables will you include?
Answer: In this assignment, I will use a qualitative bankruptcy data set taken from UC Irvine's Center for Machine Learning and Intelligent Systems. This data set includes the dependent variable of interest (whether a company is bankrupt or not), as well as six independent variables which will all be regarded as “variables of interest” in the model.
One distinction that must be made at this point is that because I am seeking to create a model for forecasting rather than explanation, the typical logic of regression specification does not apply. Specifically, if this model were intended to explain variance, I would have selected control variables to minimize omitted-variable bias, in keeping with the BLUE criteria for model specification. Because unbiasedness of the beta coefficients matters only for inference, not forecasting accuracy, all variables will be included so long as they decrease the classification error rate, i.e. the proportion of cases that are incorrectly classified.
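As a concrete definition, here is a minimal sketch of that error rate in R, assuming predicted and actual are equal-length 0/1 vectors (hypothetical names for illustration):

classification_error = function(predicted, actual) {
  mean(predicted != actual)   # proportion of incorrectly classified cases
}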
Question 3: Graph the relationships between your main variables of interest. What patterns emerge, if any?
To examine initial relationships between variables, we will make a quasi-scatterplot matrix of the six predictor variables, with data points colored by the factor level of the dependent variable, “class” (which indicates whether an entity is bankrupt or not).
To produce this graph, we will use the “splom()” function from the “lattice” package, which allows for scatterplot matrices in which the data points are color-coded by the levels of a factor variable (“class”).
library(lattice)
First we create the object “super.sym”, which stores the lattice “superpose.symbol” settings (plotting characters and colors) that we will pass to splom():
super.sym <- trellis.par.get("superpose.symbol")
Now: earlier I said “quasi-scatterplot matrix” because, in reality, it is not a true scatterplot. In the data's original form, the predictors are all string-valued categorical variables taking on the same three levels (P-Positive, A-Average, N-Negative). This arrangement makes scatterplot matrices problematic at first, because the points stack on top of one another, obscuring most of the data. An example is presented below:
splom(~Training_Data_70[c(1:6)], groups = class, data = Training_Data_70,
      panel = panel.superpose,
      key = list(title = "Bankrupt or Not", columns = 2,
                 points = list(pch = super.sym$pch[1:2], col = super.sym$col[1:2]),
                 text = list(c("Bankrupt", "Not Bankrupt"))))
[Figure: scatterplot matrix of the raw categorical predictors; heavy overplotting obscures most of the points]
To solve this problem, we convert the three string factor levels (P, A, N) to corresponding numeric values (1, 0, -1) in Excel, using nested IF statements; an equivalent recoding in R is sketched below.
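A minimal sketch of that recoding in R, assuming the original string columns are named ind_risk, m_risk, f_flex, cred, comp, and op_risk (hypothetical names for illustration):

recode_PAN = function(x) ifelse(x == "P", 1, ifelse(x == "A", 0, -1))
# e.g. Training_Data_70$ind_risk_num = recode_PAN(Training_Data_70$ind_risk)

Importing the recoded data set back into R, we then duplicate it and jitter the numeric values in columns 28 through 33: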
Training_Set_Jitter = Training_Data_70
Training_Set_Jitter[, 28:33] = lapply(Training_Set_Jitter[, 28:33], jitter)
With the new jittered numeric data, we try splom() again, obtaining a much more intelligible result:
splom(~Training_Set_Jitter[c(28:33)], groups = class, data = Training_Set_Jitter,
      panel = panel.superpose,
      key = list(title = "Bankrupt or Not", columns = 2,
                 points = list(pch = super.sym$pch[1:2], col = super.sym$col[1:2]),
                 text = list(c("Bankrupt", "Not Bankrupt"))))
[Figure: jittered scatterplot matrix of the six predictors, color-coded by bankruptcy class]
The nice thing about working with purely categorical data sharing the same three factor levels is that patterns are very easy to identify in a color-coded scatterplot matrix. Examining the plot above, the most telling predictors of bankruptcy can be spotted by identifying “lines” parallel to the vertical or horizontal axes that run through predominantly blue clusters. For example, focusing on the tile labeled “comp” and looking along the axis position labeled “-1.0”, we notice that there are no pink points at all in the 15 clusters lying along that line. This suggests that among companies with a “Negative” competitiveness rating, every single one is in fact bankrupt. Thus, we would expect the competitiveness predictor to enhance the predictive capacity of the model greatly. This line-cluster analysis can be repeated for every variable. Other noteworthy predictors revealed by this method are financial flexibility and credibility, and to a lesser extent, management risk. By contrast, bankruptcy does not appear to correlate strongly with particular levels of operating risk or industrial risk.
Question 4: Estimate an appropriate statistical model to explain variation in your outcome of interest. Estimate multiple models if you want to explore the effects of including and excluding variables. One common approach is to estimate a bivariate relationship first, then estimate a multivariate model. Include a nicely formatted table of results in your narrative.
Answer: First, we estimate a logit model including all six predictor variables using the function “glm()”. To fit the three-level predictors into the glm() formula, each was broken down into three two-level dummy variables. For instance, Financial Flexibility with levels {P, A, N} is transformed into Financial Flexibility P {0,1}, Financial Flexibility A {0,1}, and Financial Flexibility N {0,1}.
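As a sketch of this dummy coding in R (assuming a string column f_flex with levels P/A/N; the actual construction may equally be done in Excel alongside the numeric recoding):

Training_Data_70$f_flex_A = as.numeric(Training_Data_70$f_flex == "A")
Training_Data_70$f_flex_N = as.numeric(Training_Data_70$f_flex == "N")

Note that only the A and N dummies enter the model below; the P dummy is left out as the baseline category, since including all three would be perfectly collinear with the intercept.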
We estimate the logit model containing all six predictors:
glm.fit=glm(class_B~ ind_risk_A + ind_risk_N + m_risk_A + m_risk_N + f_flex_A + f_flex_N + cred_A + cred_N + comp_A + comp_N + op_risk_A + op_risk_N, family=binomial, data = Training_Data_70)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
We can observe the results of the model using the summary command:
summary(glm.fit)
##
## Call:
## glm(formula = class_B ~ ind_risk_A + ind_risk_N + m_risk_A +
## m_risk_N + f_flex_A + f_flex_N + cred_A + cred_N + comp_A +
## comp_N + op_risk_A + op_risk_N, family = binomial, data = Training_Data_70)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.460e-05 -2.110e-08 0.000e+00 2.110e-08 1.926e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.833e+01 1.949e+05 0 1
## ind_risk_A -5.984e+00 1.818e+05 0 1
## ind_risk_N 1.567e+00 1.259e+05 0 1
## m_risk_A 3.741e+00 1.420e+05 0 1
## m_risk_N 7.367e-01 7.947e+04 0 1
## f_flex_A -8.369e-01 6.775e+04 0 1
## f_flex_N 2.081e+01 1.692e+05 0 1
## cred_A 6.022e+00 2.494e+05 0 1
## cred_N 2.824e+01 1.719e+05 0 1
## comp_A 1.861e+01 1.689e+05 0 1
## comp_N 5.941e+01 1.485e+05 0 1
## op_risk_A 2.976e+00 2.422e+05 0 1
## op_risk_N -7.971e+00 1.953e+05 0 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.0517e+02 on 147 degrees of freedom
## Residual deviance: 2.4137e-09 on 135 degrees of freedom
## AIC: 26
##
## Number of Fisher Scoring iterations: 25
As the summary for glm.fit and the warning messages show, the fitting algorithm is in trouble: the enormous coefficients and standard errors, together with the warning that fitted probabilities of 0 or 1 occurred, are the classic symptoms of perfect separation rather than ordinary collinearity. Looking back to the scatterplot matrix from Question 3, we see that credibility and competitiveness are in fact perfect predictors (every company rated N on credibility or on competitiveness is bankrupt). We can confirm this with a quick cross-tabulation:
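A minimal check, assuming the dummy columns from the model call above:

table(Training_Data_70$comp_N, Training_Data_70$class_B)
table(Training_Data_70$cred_N, Training_Data_70$class_B)

If every case with comp_N = 1 (or cred_N = 1) falls in the class_B = 1 column, that predictor perfectly separates the outcome. For the fitting algorithm to converge, these two variables are removed: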
glm.fit=glm(class_B~ ind_risk_A + ind_risk_N + m_risk_A + m_risk_N + f_flex_A + f_flex_N + op_risk_A + op_risk_N, family=binomial, data = Training_Data_70)
Now the model works. Examining the summary:
summary(glm.fit)
##
## Call:
## glm(formula = class_B ~ ind_risk_A + ind_risk_N + m_risk_A +
## m_risk_N + f_flex_A + f_flex_N + op_risk_A + op_risk_N, family = binomial,
## data = Training_Data_70)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.41247 -0.11191 0.02326 0.24215 3.12666
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.1339 2.6088 -3.118 0.00182 **
## ind_risk_A -0.4756 0.9427 -0.504 0.61392
## ind_risk_N 2.2785 1.2499 1.823 0.06831 .
## m_risk_A 0.9750 1.3220 0.737 0.46082
## m_risk_N 2.1213 1.2474 1.701 0.08903 .
## f_flex_A 0.8127 1.2892 0.630 0.52844
## f_flex_N 7.7710 1.6894 4.600 4.23e-06 ***
## op_risk_A -0.5220 1.0926 -0.478 0.63281
## op_risk_N 2.2419 1.1901 1.884 0.05959 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 205.172 on 147 degrees of freedom
## Residual deviance: 57.255 on 139 degrees of freedom
## AIC: 75.255
##
## Number of Fisher Scoring iterations: 7
We find that the retained predictors are informative. Financial flexibility is by far the strongest (a negative financial flexibility rating is significant at the 0.001 level), while the negative-rating dummies of the other predictors are significant at the 0.1 level. It is important to note that while the “average”-level dummy variables are not statistically significant, they should still be included in the model: without them, the dummy coding for each predictor would not be exhaustive, and the estimates on the remaining dummies would be biased.
Using the “stargazer()” function from the “stargazer” package, we can create a summary table for four models: the first containing only the best variable (financial flexibility), with the remaining three variables added incrementally. This table lets us observe the effect of adding less significant variables to a bare-bones version of the model containing only the best predictor. As the table shows, the “full model” is in fact the best, in the sense that it has the lowest AIC and the largest number of significant predictors. Leaving these variables out might also induce omitted-variable bias on the financial flexibility estimate.
First we generate the four models:
glm.fit.1=glm(class_B~ f_flex_A + f_flex_N,family=binomial, data = Training_Data_70)
glm.fit.2=glm(class_B~ f_flex_A + f_flex_N + m_risk_A + m_risk_N,family=binomial, data = Training_Data_70)
glm.fit.3=glm(class_B~ f_flex_A + f_flex_N + m_risk_A + m_risk_N + ind_risk_A + ind_risk_N,family=binomial, data = Training_Data_70)
glm.fit.4=glm(class_B~ f_flex_A + f_flex_N + m_risk_A + m_risk_N + ind_risk_A + ind_risk_N + op_risk_A + op_risk_N,family=binomial, data = Training_Data_70)
Then we generate HTML code for the summary table using stargazer, saving the code to the object “summary_table”:
# Install stargazer once, interactively (calling install.packages() while
# knitting fails unless a CRAN mirror has been set):
# install.packages("stargazer")
library(stargazer)
## Warning: package 'stargazer' was built under R version 3.0.3
##
## Please cite as:
##
## Hlavac, Marek (2014). stargazer: LaTeX code and ASCII text for well-formatted regression and summary statistics tables.
## R package version 5.1. http://CRAN.R-project.org/package=stargazer
summary_table=stargazer(glm.fit.1,glm.fit.2,glm.fit.3,glm.fit.4, type="html", title="Summary Table for Four Logit Models of Bankruptcy Probability")
The HTML that stargazer prints to the console renders as the following table:

Summary Table for Four Logit Models of Bankruptcy Probability
Dependent variable: class_B (standard errors in parentheses)

                         (1)                 (2)                 (3)                 (4)
f_flex_A             1.159 (1.181)       1.062 (1.194)       1.133 (1.229)       0.813 (1.289)
f_flex_N             5.695*** (1.081)    5.665*** (1.109)    6.707*** (1.397)    7.771*** (1.689)
m_risk_A                                -0.236 (0.919)       0.137 (1.009)       0.975 (1.322)
m_risk_N                                 1.336 (0.850)       0.948 (0.909)       2.121* (1.247)
ind_risk_A                                                  -0.619 (0.837)      -0.476 (0.943)
ind_risk_N                                                   2.147* (1.173)      2.278* (1.250)
op_risk_A                                                                       -0.522 (1.093)
op_risk_N                                                                        2.242* (1.190)
Constant            -3.526*** (1.015)   -4.151*** (1.216)   -5.308*** (1.634)   -8.134*** (2.609)
Observations           148                 148                 148                 148
Log Likelihood       -40.572             -37.742             -33.287             -28.627
Akaike Inf. Crit.     87.144              85.483              80.575              75.255

Note: *p<0.1; **p<0.05; ***p<0.01
Saved to an .html file and opened in a browser, this code renders as the table above (reproduced in Appendix A). As the table shows, the full model is superior to any of the reduced-form models, with all four variables found informative (while the “average”-rating dummies are not statistically significant, they must be included to maintain the exhaustiveness of the dummy-variable binning). Thus, for our model we use glm.fit.4, which includes all four non-perfect predictors.
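As an aside, recent versions of stargazer can write the file directly through the out argument, avoiding the copy-and-paste step; a variant of the call above (the file name is illustrative):

stargazer(glm.fit.1, glm.fit.2, glm.fit.3, glm.fit.4, type = "html", title = "Summary Table for Four Logit Models of Bankruptcy Probability", out = "summary_table.html")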
Question 5: Perform some diagnostic tests on your estimated model (see, e.g., section 6.2 of Monogan). Do you have any problems that might need addressing? If so, what are they? If not, how do you know?
Answer: Since this is a forecasting model, the primary concern at this stage is to make sure the model is not unduly influenced by severe outliers.
A standard plot of residuals versus predicted values will not work for a logit model, because by construction the residuals of a logit model are heteroscedastic. To account for this, we first transform the raw residuals from glm.fit.4 into studentized Pearson residuals:
spresid=resid(glm.fit.4, "pearson")/sqrt(1-hatvalues(glm.fit.4))
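This implements the standard studentization

\[ r_{i}^{sp}=\frac{r_{i}^{P}}{\sqrt{1-h_{i}}} \]

where \( r_{i}^{P} \) is the i-th Pearson residual and \( h_{i} \) is the i-th leverage (hat value) returned by hatvalues().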
Then, we plot the studentized Pearson residuals against their index, creating an index plot for the model:
plot(spresid)
abline(3,0, col="red")
abline(-3,0, col="red")
[Figure: index plot of the studentized Pearson residuals, with red cutoff lines at +3 and -3]
This plot shows us whether there are any influential data points that might distort the model. A general rule of thumb is to treat as outliers any points whose studentized Pearson residual lies outside the range {-3, 3}, delineated by the two red lines. As the plot shows, two points meet this criterion.
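To identify them before removal, their row indices can be listed with a quick check (output omitted):

which(abs(spresid) > 3)

We then drop these two cases from the original data set: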
Training_Data_70=subset(Training_Data_70,spresid>-3 & spresid<3)
…and then reconstruct the logit model:
glm.fit.1=glm(class_B~ f_flex_A + f_flex_N,family=binomial, data = Training_Data_70)
glm.fit.2=glm(class_B~ f_flex_A + f_flex_N + m_risk_A + m_risk_N,family=binomial, data = Training_Data_70)
glm.fit.3=glm(class_B~ f_flex_A + f_flex_N + m_risk_A + m_risk_N + ind_risk_A + ind_risk_N,family=binomial, data = Training_Data_70)
glm.fit.4=glm(class_B~ f_flex_A + f_flex_N + m_risk_A + m_risk_N + ind_risk_A + ind_risk_N + op_risk_A + op_risk_N,family=binomial, data = Training_Data_70)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
The new set of models is placed into a summary chart as before:
summary_table_2=stargazer(glm.fit.1,glm.fit.2,glm.fit.3,glm.fit.4, type="html", title="Summary Table for Four Logit Models of Bankruptcy Probability--Clean Data")
The rendered table:

Summary Table for Four Logit Models of Bankruptcy Probability--Clean Data
Dependent variable: class_B (standard errors in parentheses)

                         (1)                   (2)                   (3)                   (4)
f_flex_A            17.199 (1,844.298)    17.047 (1,765.735)    17.018 (1,673.090)    20.342 (9,789.350)
f_flex_N            21.869 (1,844.298)    21.852 (1,765.735)    22.659 (1,673.090)   100.605 (14,782.720)
m_risk_A                                  -0.404 (0.997)        -0.039 (1.066)        38.496 (6,836.564)
m_risk_N                                   1.419 (0.947)         1.048 (0.977)        39.493 (6,836.563)
ind_risk_A                                                      -1.104 (0.979)        -1.532 (1.295)
ind_risk_N                                                       1.533 (1.240)        19.083 (3,929.011)
op_risk_A                                                                              0.578 (1.247)
op_risk_N                                                                             59.096 (8,739.919)
Constant           -19.566 (1,844.298)   -20.176 (1,765.735)   -20.723 (1,673.090)  -138.525 (19,738.760)
Observations           146                   146                   146                   146
Log Likelihood       -33.695               -30.714               -27.212               -14.476
Akaike Inf. Crit.     73.390                71.427                68.424                46.951

Note: *p<0.1; **p<0.05; ***p<0.01
Comparing the two tables (shown in Appendix A and B), the model generated from the “clean” data is clearly the worse of the two: after the removal of the two outliers, the coefficient estimates and standard errors blow up to absurd sizes. The warning issued when refitting (“fitted probabilities numerically 0 or 1 occurred”) reveals why: with those two cases gone, the remaining data are quasi-perfectly separated, so the maximum-likelihood estimates diverge. The numbers therefore favor the original data (outliers included), and the original model stands unchanged.
Question 6: State the conclusions you are willing to draw, given the analysis undertaken above.
Interpretation of the model coefficients here is somewhat superfluous, since meeting the BLUE criteria is unimportant for forecasting and the coefficients are therefore likely biased. The interpretation is nonetheless carried out for the sake of pedagogy, with the BLUE criteria assumed to be met.
Interpreting the results of a logit model is a little tricky, since the beta coefficients must be transformed before they can be expressed in terms of probability. Under the logit link the predicted probability is \( p=\frac{e^{X\beta}}{1+e^{X\beta}} \), so the marginal change in probability induced by any one variable is not constant: it is conditional on the levels of the other variables, and the marginal change calculated for one set of characteristics differs from that calculated for another. An alternative transformation whose value does not depend on the other covariates is the “odds ratio”:
\[ oddsratio=e^{\beta_{i}} \]
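In R this transformation is a single line; assuming glm.fit.4 is the final model from Question 4:

exp(coef(glm.fit.4))   # odds ratios for every coefficient, including the intercept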
Applying this to the constant of the model, -8.134, gives \( e^{-8.134} \approx 0.000293 \): for a company with positive levels of all predictors (the baseline category), the odds of bankruptcy are roughly 0.000293, or about 0.0293%.
For the slope coefficients, the “odds ratio” is the multiplicative change in odds relative to the counterfactual situation:

\[ \text{odds ratio} = \frac{\text{odds of the event given the condition is met}}{\text{odds of the event given the condition is not met}} \]
For the constant term there is no condition to toggle; the exponentiated intercept is simply the baseline odds of bankruptcy when every predictor takes the level “Positive”. Odds of about 0.0293% imply that a company with all-positive ratings is extremely unlikely to go bankrupt (a vast reduction in the odds!).
Applying the same transformation to the negative-rating coefficients of the other variables, we calculate the following approximate odds ratios:
-Negative Industrial Risk Rating: \( e^{2.279} \approx 9.8 \)
-Negative Management Risk Rating: \( e^{2.121} \approx 8.3 \)
-Negative Operational Risk Rating: \( e^{2.242} \approx 9.4 \)
-Negative Financial Flexibility Rating: \( e^{7.771} \approx 2{,}371 \)
This means that, whatever the levels of the other predictors, the odds of bankruptcy are multiplied by roughly 9.8 when the industrial risk rating is negative, by 8.3 when the management risk rating is negative, by 9.4 when the operational risk rating is negative, and by roughly 2,371 when the financial flexibility rating is negative.
It is unfortunate that there is no more intuitive interpretation of logit coefficients. The virtue of the odds ratio is that it is constant: were I to calculate the probability change implied by a given coefficient, it would be conditional on the levels of the other variables rather than constant, making for an unwieldy interpretation. Suffice it to say that odds ratios greater than one increase the odds (and hence the probability) of bankruptcy, while odds ratios less than one reduce them. Thus, a positive level on all ratings drastically reduces the odds of default, while a negative level on any of the four predictors sharply increases the probability of default.
Turning to the most telling indicator of a forecasting model's validity, we use the model to predict outcomes for data it has never seen; ideally, this is performed on an out-of-sample set.
To make this possible, the data on which the model was fit represent only 70% of the total data available; this 70% chunk is called the “training” data. The remaining 30% (on which we will test the logit model) are the “holdout” data. The full data set contains equal numbers of bankrupt and non-bankrupt cases, and this one-to-one correspondence was maintained when I constructed the training and holdout sets; a sketch of such a balanced split appears below.
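A balanced split like this can also be constructed in R rather than by hand; a sketch, assuming the full data set sits in a data frame Full_Data (hypothetical name) with the 0/1 outcome column class_B:

set.seed(1)   # make the random split reproducible
by_class = split(seq_len(nrow(Full_Data)), Full_Data$class_B)
train_idx = unlist(lapply(by_class, function(i) sample(i, round(0.7 * length(i)))))
Training_Data_70 = Full_Data[train_idx, ]    # 70% of each class
Holdout_Data_30 = Full_Data[-train_idx, ]    # remaining 30% of each class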
Using the predict() function, we calculate a probability of default for each company in the holdout set. We then code predictions as “Bankrupt” if the predicted probability exceeds 0.5 and “Not Bankrupt” otherwise. Finally, these predictions are tabulated against the actual value of class_B, which is 0 if the company was not bankrupt and 1 if it was:
Training_Data_70 <- read.csv("C:/Users/dnorth/Desktop/Training_Data_70.csv")
Holdout_Data_30 <- read.csv("C:/Users/dnorth/Desktop/Holdout_Data_30.csv")
glm.fit.4=glm(class_B~ ind_risk_A + ind_risk_N + m_risk_A + m_risk_N + f_flex_A + f_flex_N + op_risk_A + op_risk_N, family=binomial, data = Training_Data_70)
glm.probs=predict(glm.fit.4,Holdout_Data_30,type="response")
glm.pred=rep("Not Bankrupt",64)
glm.pred[glm.probs>.5]="Bankrupt"
table(glm.pred,Holdout_Data_30$class_B)
##
## glm.pred 0 1
## Bankrupt 5 31
## Not Bankrupt 27 1
The resulting table indicates that when predicting bankruptcy for 64 out-of-sample companies, the model correctly classified \( \frac{27}{32} \) non-bankrupt companies and \( \frac{31}{32} \) bankrupt companies. This translates to \( \frac{27+31}{64} = \frac{58}{64} \) correct classifications, a success rate of \( 90.625\% \). By this metric, the model exhibits a high degree of external validity, which is ultimately the key determinant of a forecasting model's success. We therefore conclude that the model is accurate and successful.
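The same success rate can be computed directly from the vectors built above, assuming class_B is coded 0/1 as described:

mean(glm.pred == ifelse(Holdout_Data_30$class_B == 1, "Bankrupt", "Not Bankrupt"))   # 58/64 = 0.90625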