The data set used is called "Loan Default Data" and is taken from the book "Applied Analytics through Case Studies Using SAS and R" by Deepti Gupta (Apress, ISBN 978-1-4842-3525-6). It was described where it was posted as "…a subset of a larger data set. The structure of the data set is simple. It can be used for logistic and binary classification / predictive models and algorithms." The data set contains 1,000 observations and 5 of its original 16 variables; these variables represent the minimum requirements for this research. Saving_amount is explored as the response for this data set. Age has been converted into a categorical variable (DA) with three levels, and Marital_status (MStat) has been converted from string representations of its two levels to numeric representations.
Some practical questions for this research may be: are there any relationships between Saving_amount and the other variables in the data set, and if so, how can those relationships be interpreted?
Next, exploratory data analysis is conducted. A scatter-plot matrix of the numeric variables in the data set will be produced after all non-categorical values are normalized to z-scores.
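Below is a minimal R sketch of this step; it assumes the raw data are loaded into a data frame named data0 (a hypothetical name), while data0.norm matches the data set name used in the model output later in this section.

```r
# Hypothetical raw data frame: data0
num_vars <- c("Saving_amount", "Credit_score", "Emp_duration", "Age")

# z-score normalize the non-categorical (numeric) variables
data0.norm <- data0
data0.norm[num_vars] <- scale(data0[num_vars])

# scatter-plot matrix of the normalized numeric variables
pairs(data0.norm[num_vars])
```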
Based on the output, there seem to be mild, positive, linear relationships between Credit_score and Saving_amount, Emp_duration and Saving_amount, and Age and Saving_amount.
The initial full hypothesized model may be: (Saving_amount) = \(\beta_0\) + \(\beta_1\)(Credit_score) + \(\beta_2\)(Emp_duration) + \(\beta_3\)(Age) + \(\beta_4\)(DA1) + \(\beta_5\)(DA2) + \(\beta_6\)(MStat1) + \(\epsilon\). An MLR will be run using R. Relevant information from this MLR will be used to assess the model assumptions. Based on that assessment, transformations of the model may follow. Below is tabular output of the relevant information from the MLR for this model. An interpretation of this output will be withheld until after residual diagnostics are conducted.
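A minimal sketch of the model fit in R, matching the call shown in the output below; it assumes DA (the categorized Age) and MStat are present in data0.norm and coded as factors.

```r
# DA and MStat are assumed to already exist in data0.norm; ensure they are factors
data0.norm$DA    <- factor(data0.norm$DA)
data0.norm$MStat <- factor(data0.norm$MStat)

# fit the full multiple linear regression model and summarize it
fit <- lm(Saving_amount ~ Credit_score + Emp_duration + Age + DA + MStat,
          data = data0.norm)
summary(fit)
```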
```
Call:
lm(formula = Saving_amount ~ Credit_score + Emp_duration + Age + 
    DA + MStat, data = data0.norm)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.12626 -0.64452  0.00231  0.62696  2.79153 

Coefficients:
              Estimate Std. Error t value          Pr(>|t|)    
(Intercept)    0.15322    0.13635   1.124           0.26142    
Credit_score   0.10034    0.03126   3.210           0.00137 ** 
Emp_duration   0.04102    0.03032   1.353           0.17633    
Age            0.46047    0.06176   7.456 0.000000000000194 ***
DA1           -0.11882    0.13203  -0.900           0.36836    
DA2           -0.56265    0.20844  -2.699           0.00707 ** 
MStat1         0.08937    0.06074   1.471           0.14152    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9247 on 993 degrees of freedom
Multiple R-squared:  0.15,    Adjusted R-squared:  0.1449 
F-statistic: 29.21 on 6 and 993 DF,  p-value: < 0.00000000000000022
```
Below is graphical output for the residual analysis of the model: a histogram of the residuals, a normal Q-Q plot, and a plot of the residuals versus the fitted values. The histogram of the residuals may indicate that the residuals are normally distributed. Likewise, there seem to be no major departures from the reference line of the Q-Q plot, which may also indicate that the model's errors are normally distributed. Lastly, the residuals-versus-fitted plot seems to display no patterns, so the errors may have constant variance. Therefore, the errors may be independently and identically normally distributed with mean = 0 and \(\sigma^2\) = constant.
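A minimal sketch of these diagnostics, assuming the fitted model object is named fit:

```r
res  <- resid(fit)    # model residuals
fits <- fitted(fit)   # fitted (predicted) values

par(mfrow = c(1, 3))
hist(res, main = "Histogram of Residuals", xlab = "Residual")  # normality check
qqnorm(res); qqline(res)                                       # Q-Q normal plot
plot(fits, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")                            # constant-variance check
abline(h = 0, lty = 2)
par(mfrow = c(1, 1))
```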
Next, a variable selection procedure is used to identify alternative models that use fewer or different arrangements of the variables in the full model. The procedure chosen is best subsets regression, which explores all possible combinations of the full model's terms. Tabular output from this procedure follows below.
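A minimal sketch of how this output might be produced, assuming the olsrr package (whose best subsets summary the layout below resembles) and the fitted model object fit:

```r
library(olsrr)

# best subsets regression over all combinations of the full model's terms
ols_step_best_subset(fit)
```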
```
                    Best Subsets Regression
-----------------------------------------------------
Model Index    Predictors
-----------------------------------------------------
     1         Age
     2         Age DA
     3         Credit_score Age DA
     4         Credit_score Age DA MStat
     5         Credit_score Emp_duration Age DA MStat
-----------------------------------------------------

                                            Subsets Regression Summary
---------------------------------------------------------------------------------------------------------------------
                   Adj.      Pred
Model  R-Square  R-Square  R-Square    C(p)       AIC         SBIC        SBC        MSEP      FPE     HSP     APC
---------------------------------------------------------------------------------------------------------------------
  1      0.1177    0.1168    0.1141   34.7621   2717.6466   -120.3516   2732.3699   883.1781   0.8849   0.0009   0.8858
  2      0.1362    0.1336    0.1291   15.1732   2700.4796   -139.4521   2725.0184   865.5485   0.8690   0.0009   0.8690
  3      0.1455    0.1421    0.1366    6.2417   2691.5883   -148.2746   2721.0349   857.0332   0.8613   0.0009   0.8613
  4      0.1485    0.1442    0.1378    4.8309   2690.1657   -149.6595   2724.5200   854.9641   0.8601   0.0009   0.8601
  5      0.1500    0.1449    0.1376    5.0000   2690.3236   -149.4690   2729.5856   854.2500   0.8602   0.0009   0.8602
---------------------------------------------------------------------------------------------------------------------

AIC: Akaike Information Criteria
SBIC: Sawa's Bayesian Information Criteria
SBC: Schwarz Bayesian Criteria
MSEP: Estimated error of prediction, assuming multivariate normality
FPE: Final Prediction Error
HSP: Hocking's Sp
APC: Amemiya Prediction Criteria
```
Based on the above output, model five, which is the full model, has a high \(R^2_a\) and low \(C(p)\) and \(AIC\) compared to the other models. Considering this, model five may be deemed better performing than the others. Therefore model five, the full model, will be the final model.
Once again, relevant information regarding the model's statistics is output below. An interpretation of the information follows.
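A minimal sketch, assuming the broom package, of how coefficient and model-level summary tables like the ones below might be assembled from the fitted model fit:

```r
library(broom)

# coefficient estimates, test statistics, p-values, and 95% confidence intervals
tidy(fit, conf.int = TRUE, conf.level = 0.95)

# model-level statistics (R-squared, adjusted R-squared, sigma, F statistic, etc.)
glance(fit)
```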
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 0.153 | 0.136 | 1.124 | 0.261 | -0.114 | 0.421 |
| Credit_score | 0.100 | 0.031 | 3.210 | 0.001 | 0.039 | 0.162 |
| Emp_duration | 0.041 | 0.030 | 1.353 | 0.176 | -0.018 | 0.101 |
| Age | 0.460 | 0.062 | 7.456 | 0.000 | 0.339 | 0.582 |
| DA: 1 | -0.119 | 0.132 | -0.900 | 0.368 | -0.378 | 0.140 |
| DA: 2 | -0.563 | 0.208 | -2.699 | 0.007 | -0.972 | -0.154 |
| MStat: 1 | 0.089 | 0.061 | 1.471 | 0.142 | -0.030 | 0.209 |
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.15 | 0.145 | 0.849 | 0.921 | 0.925 | 29.213 | 0 | 6 | 1000 |
Based on the residual analysis, there seem to be no potential violations of the model's assumptions. Further, the test of the model's global utility gives F(6, 993) = 29.213, p-value < 0.001, so at least one of the model's coefficients may be non-zero. Moreover, the adjusted coefficient of determination is \(R^2_A\) = 0.145 and the coefficient of variation is CV = 10.68%. Not all of the estimates were found to be significant:
- Credit_score: t = 3.210, p < .05, 95% CI = (0.039, 0.162)
- Emp_duration: t = 1.353, p > .05, 95% CI = (-0.018, 0.101)
- Age: t = 7.456, p < .05, 95% CI = (0.339, 0.582)
- DA1: t = -0.900, p > .05, 95% CI = (-0.378, 0.140)
- DA2: t = -2.699, p < .05, 95% CI = (-0.972, -0.154)
- MStat1: t = 1.471, p > .05, 95% CI = (-0.030, 0.209)
The explicit final model follows: (Saving_amount) = 0.153 + 0.100(Credit_score) + 0.041(Emp_duration) + 0.460(Age) - 0.119(DA1) - 0.563(DA2) + 0.089(MStat1)
Interpretations:
The (Credit_score) of a borrower may contribute information that allows prediction of a borrower's (Saving_amount). Specifically, the (Saving_amount) of a borrower may increase, with 95% confidence, between 0.039 and 0.162 units for every 1 unit increase in a borrower's (Credit_score), while holding the other variables in the model constant.
The (Age) of a borrower may contribute information that allows prediction of a borrower's (Saving_amount). Specifically, the (Saving_amount) of a borrower may increase, with 95% confidence, between 0.339 and 0.582 units for every 1 unit increase in a borrower's (Age), while holding the other variables in the model constant.
The categorized age level (DA2) of a borrower may contribute information that allows prediction of a borrower's (Saving_amount). Specifically, the (Saving_amount) of a borrower in level DA2 may be, with 95% confidence, between 0.154 and 0.972 units lower than that of a borrower in the reference level of (DA), while holding the other variables in the model constant.
What follows are the two bootstrapping procedures. First, sampling with replacement from the observations will be used to construct a bootstrapped MLR. This will be done 1,000 times. At each iteration, the coefficients will be stored in a matrix from which confidence intervals may be derived for the estimates. This information will be stored and output in a table. Afterwards, bootstrapping will be conducted a second time, this time sampling the residuals of the initial model with replacement and adding them to the fitted values to yield bootstrapped response values. These values will be used in the MLR to fit a model. The parameters from this bootstrapped model will then be stored and used to calculate confidence intervals for the coefficients. This information will be output afterwards in a table together with the 95% CIs from the parametric MLR and the two bootstrapped MLRs.
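A minimal sketch of both procedures, assuming the fitted full model is named fit and the normalized data frame is data0.norm; the object names and the seed are illustrative.

```r
set.seed(1)   # illustrative seed for reproducibility
B <- 1000
n <- nrow(data0.norm)
coef_names <- names(coef(fit))

# Procedure 1: resample observations (cases) with replacement, refit, store coefficients
coefs_case <- matrix(NA, nrow = B, ncol = length(coef_names),
                     dimnames = list(NULL, coef_names))
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)
  coefs_case[b, ] <- coef(update(fit, data = data0.norm[idx, ]))
}

# Procedure 2: resample residuals with replacement, add them to the fitted values,
# and refit the model on the bootstrapped response
coefs_resid <- matrix(NA, nrow = B, ncol = length(coef_names),
                      dimnames = list(NULL, coef_names))
for (b in 1:B) {
  dat_star <- data0.norm
  dat_star$Saving_amount <- fitted(fit) + sample(resid(fit), replace = TRUE)
  coefs_resid[b, ] <- coef(update(fit, data = dat_star))
}

# percentile 95% confidence intervals from each set of bootstrapped coefficients
apply(coefs_case,  2, quantile, probs = c(0.025, 0.975))
apply(coefs_resid, 2, quantile, probs = c(0.025, 0.975))
```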
| term | estimate | p_value | param_lower | param_upper | bt1_lower | bt1_upper | bt2_lower | bt2_upper |
|---|---|---|---|---|---|---|---|---|
| intercept | 0.153 | 0.261 | -0.114 | 0.421 | -0.1443370 | 0.4206520 | -0.1041865 | 0.4082264 |
| Credit_score | 0.100 | 0.001 | 0.039 | 0.162 | 0.0384878 | 0.1652792 | 0.0358358 | 0.1596797 |
| Emp_duration | 0.041 | 0.176 | -0.018 | 0.101 | -0.0185252 | 0.1023493 | -0.0158685 | 0.1000677 |
| Age | 0.460 | 0.000 | 0.339 | 0.582 | 0.3358312 | 0.5720594 | 0.3373094 | 0.5790738 |
| DA: 1 | -0.119 | 0.368 | -0.378 | 0.140 | -0.3780105 | 0.1627563 | -0.3651118 | 0.1428375 |
| DA: 2 | -0.563 | 0.007 | -0.972 | -0.154 | -0.9305794 | -0.1493390 | -0.9495112 | -0.1731802 |
| MStat: 1 | 0.089 | 0.142 | -0.030 | 0.209 | -0.0273819 | 0.2068801 | -0.0276838 | 0.2018919 |
Each of these confidence intervals may be interpreted to mean that there is a 95% chance the interval contains the true parameter value for its estimate. Notice that the intervals do not all exclude zero. However, for the estimates that were found to be statistically different from zero, the intervals do exclude zero. For those estimates, both the bootstrap and the parametric regressions indicate that the slopes may be significantly different from zero.
The research may not have presented the need to use various regression techniques, so MLR with linearity in the model's terms was employed. Saving_amount was explored as the response for this data set. Age was converted into a categorical variable with three levels (DA), and both DA and Age were included in the model. All observations were z-score normalized, then an MLR was conducted and residual diagnostics followed. There seemed to be no violations of the model's assumptions. Afterwards, variable selection was conducted; specifically, the best subsets procedure was used to assess the possible effectiveness of alternative models. This procedure indicated that the initial model may be the best choice moving forward, so the initial model, which was the full model, became the final model. Because of the linearity of the model it may have been highly interpretable, and those interpretations were given above. Bootstrapping was then conducted using two procedures. The first sampled the observations with replacement and then fit a model. The second sampled the full model's errors with replacement, summed them with their respective fitted values, then used those values to fit a bootstrapped model. The coefficients from each procedure were stored, 95% CIs were derived from them, and these were output in a table along with statistics from the parametric MLR. Interpretations of these bootstrapped values then followed. Note that rerunning the analysis using the first bootstrapping procedure may present challenges based on the nature of its mechanics, and further research may be recommended. This may be a result of the wide confidence intervals around all of the slopes as well as the \(R^2_A\) value.