The data set chosen for project #1 is “BankLoanDefaultDataset”. The structure of the data set follows.
'data.frame': 1000 obs. of 16 variables:
$ Default : int 0 0 0 1 1 0 0 0 0 1 ...
$ Checking_amount : int 988 458 158 300 63 1071 -192 172 585 189 ...
$ Term : int 15 15 14 25 24 20 13 16 20 19 ...
$ Credit_score : int 796 813 756 737 662 828 856 763 778 649 ...
$ Gender : chr "Female" "Female" "Female" "Female" ...
$ Marital_status : chr "Single" "Single" "Single" "Single" ...
$ Car_loan : int 1 1 0 0 0 1 1 1 1 1 ...
$ Personal_loan : int 0 0 1 0 0 0 0 0 0 0 ...
$ Home_loan : int 0 0 0 0 0 0 0 0 0 0 ...
$ Education_loan : int 0 0 0 1 1 0 0 0 0 0 ...
$ Emp_status : chr "employed" "employed" "employed" "employed" ...
$ Amount : int 1536 947 1678 1804 1184 475 626 1224 1162 786 ...
$ Saving_amount : int 3455 3600 3093 2449 2867 3282 3398 3022 3475 2711 ...
$ Emp_duration : int 12 25 43 0 4 12 11 12 12 0 ...
$ Age : int 38 36 34 29 30 32 38 36 36 29 ...
$ No_of_credit_acc: int 1 1 1 1 1 2 1 1 1 1 ...
Fit a simple linear regression (SLR) by the least-squares approach, using a numerical explanatory variable selected from the data set, and then construct 95% bootstrap confidence intervals for the regression coefficients.
The data set used is called “Loan Default Data” and is taken from the book “Applied Analytics through Case Studies Using SAS and R” by Deepti Gupta (Apress, ISBN 978-1-4842-3525-6). Where it was posted, it was described as “…a subset of a large o data set. The structure of the data set is simple. It can be used for logistic and binary classification / predictive models and algorithms”. The structure output lists the number of observations, the number of variables, and the variable names and types.
Below is a scatterplot matrix of most of the variables in the data set. There seem to be many relationships that could be explored. However, given the scale of the plots, the pairwise relationships may be indiscernible.
We will consider one predictor for this analysis, Saving_amount (\(= x_1\)), with respect to the response Age (\(= y\)). We will hold off on commenting on the results of the summary below until after a residual analysis is conducted. The proposed regression model is: (Age) = \(\beta_0\) + \(\beta_1\) (Saving_amount) + \(\epsilon\).
Call:
lm(formula = Age ~ Saving_amount, data = data)
Residuals:
Min 1Q Median 3Q Max
-11.7993 -2.4581 0.0857 2.6885 11.5300
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.0602972 1.1460023 15.76 <0.0000000000000002 ***
Saving_amount 0.0041358 0.0003584 11.54 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.847 on 998 degrees of freedom
Multiple R-squared: 0.1177, Adjusted R-squared: 0.1168
F-statistic: 133.1 on 1 and 998 DF, p-value: < 0.00000000000000022
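For reference, the coefficient estimates in the summary above are least-squares estimates: they minimise the sum of squared residuals. For a single predictor the standard closed-form solution is shown below, where \(x_i\) and \(y_i\) denote the Saving_amount and Age of borrower \(i\), \(\bar{x}\) and \(\bar{y}\) are the sample means, and \(n = 1000\). This index notation is introduced here for illustration and is not part of the original output.

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]

With the estimates reported above, \(\hat{\beta}_0 = 18.0602972\) and \(\hat{\beta}_1 = 0.0041358\).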
Below are the 95% bootstrap confidence intervals for the regression coefficients.
| | 2.5% | 97.5% |
|---|---|---|
| bt.b0.ci.vec | 15.7401301 | 20.2560220 |
| bt.b1.ci.vec | 0.0034701 | 0.0048398 |
| term | estimate | bstrap_lower | bstrap_upper | p_value | param_lower | param_upper |
|---|---|---|---|---|---|---|
| intercept | 18.060 | 15.7401301 | 20.2560220 | < 0.001 | 15.811 | 20.309 |
| Saving_amount | 0.004 | 0.0034701 | 0.0048398 | < 0.001 | 0.003 | 0.005 |
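The report does not show how the bootstrap intervals were computed. One common way to obtain intervals of this kind is a case-resampling bootstrap with percentile limits, sketched below in Python using only the standard library. Everything in the sketch is an assumption made for illustration: the resampling scheme, the helper names `fit_slr`, `percentile`, and `bootstrap_ci`, the number of replicates, and the synthetic data, which merely stands in for Saving_amount and Age and will not reproduce the numbers in the tables.

```python
import random
import statistics


def fit_slr(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def percentile(sorted_values, p):
    """Linearly interpolated percentile (p in [0, 1]) of a pre-sorted list."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def bootstrap_ci(x, y, n_boot=1000, level=0.95, seed=1):
    """Case-resampling bootstrap percentile CIs for the intercept and slope."""
    rng = random.Random(seed)
    n = len(x)
    b0s, b1s = [], []
    for _ in range(n_boot):
        # Resample whole (x, y) rows with replacement, then refit the line.
        idx = [rng.randrange(n) for _ in range(n)]
        b0, b1 = fit_slr([x[i] for i in idx], [y[i] for i in idx])
        b0s.append(b0)
        b1s.append(b1)
    b0s.sort()
    b1s.sort()
    alpha = (1 - level) / 2
    return ((percentile(b0s, alpha), percentile(b0s, 1 - alpha)),
            (percentile(b1s, alpha), percentile(b1s, 1 - alpha)))


# Synthetic stand-in data (NOT the loan data): a weak positive linear trend.
data_rng = random.Random(42)
saving = [data_rng.uniform(2000, 4500) for _ in range(300)]
age = [18 + 0.004 * s + data_rng.gauss(0, 4) for s in saving]

b0_hat, b1_hat = fit_slr(saving, age)
(b0_lo, b0_hi), (b1_lo, b1_hi) = bootstrap_ci(saving, age)
print(f"intercept: {b0_hat:.3f}  95% CI ({b0_lo:.3f}, {b0_hi:.3f})")
print(f"slope:     {b1_hat:.5f}  95% CI ({b1_lo:.5f}, {b1_hi:.5f})")
```

Resampling whole rows makes no assumption about the error distribution, which is why such intervals serve as a useful check on the parametric intervals from the fitted model.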
The final regression model is: (Age) = 18.0602972 + 0.0041358 (Saving_amount). The 95% bootstrap CI for the slope, (0.0034701, 0.0048398), lies entirely above zero, which indicates a positive relationship between Age and Saving_amount. Based on the tables above, the bootstrap and parametric intervals agree closely, and both indicate that the slope is significantly different from zero at the 5% level; that is, a borrower’s Age and Saving_amount are statistically correlated.
Comment on the residual plots and point out the violations to the model assumptions.
Below is the graphical output for the residual analysis of the model: a histogram of the residuals, a normal Q-Q plot, and a “versus fits” plot of the residuals against the predicted values. The histogram suggests that the residuals are approximately normally distributed. Likewise, there are no major departures from the line in the Q-Q plot, which supports the assumption that the model’s errors are normally distributed. Lastly, the “versus fits” plot displays no apparent pattern, which is consistent with constant variance. Therefore, the errors may be treated as independent and identically distributed normal with mean 0 and constant variance \(\sigma^2\).
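To make the Q-Q reasoning concrete, the sketch below pairs sorted standardized residuals with standard normal quantiles and summarises how close the points lie to a straight line with a correlation coefficient. It is a Python illustration under stated assumptions, not the procedure used for the plots in this report: the helper names `qq_points` and `pearson`, the plotting positions \((i - 0.5)/n\), and both synthetic samples are hypothetical.

```python
import math
import random
import statistics


def qq_points(residuals):
    """Pair each sorted standardized residual with a standard normal quantile."""
    n = len(residuals)
    mean = statistics.fmean(residuals)
    sd = statistics.stdev(residuals)
    observed = sorted((r - mean) / sd for r in residuals)
    std_normal = statistics.NormalDist()
    # Plotting positions (i - 0.5) / n are one common convention (an assumption here).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, observed


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    a_bar, b_bar = statistics.fmean(a), statistics.fmean(b)
    num = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    den = math.sqrt(sum((ai - a_bar) ** 2 for ai in a)
                    * sum((bi - b_bar) ** 2 for bi in b))
    return num / den


# Synthetic residuals (NOT from the fitted model): one normal, one skewed sample.
rng = random.Random(7)
normal_resid = [rng.gauss(0, 3.847) for _ in range(500)]
skewed_resid = [rng.expovariate(1.0) for _ in range(500)]

normal_r = pearson(*qq_points(normal_resid))
skewed_r = pearson(*qq_points(skewed_resid))
print(f"Q-Q correlation, normal sample: {normal_r:.4f}")
print(f"Q-Q correlation, skewed sample: {skewed_r:.4f}")
```

A correlation very close to 1 corresponds to points hugging the reference line in the Q-Q plot; the skewed sample gives a visibly lower value, which is the kind of departure the plot is meant to expose.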
Based on the above residual analysis, there do not seem to be any violations of the model’s assumptions. Further, the model’s coefficient of determination is \(R^2\) = 0.117706, so approximately 11.77% of the variation in the (Age) of a borrower is explained by the borrower’s (Saving_amount). Also, note that the coefficient of variation is CV = 12.33%. Moreover, the slope was found to be significant (t = 11.539, p < .001), 95% CI = (0.003, 0.005). An interpretation might be that the (Saving_amount) of a borrower contributes information that allows prediction of the (Age) of a borrower. Specifically, the estimated mean (Age) of a borrower is 0.0041358 units higher for every 1 unit increase in that borrower’s (Saving_amount). Alternatively, with 95% confidence, the mean (Age) of a borrower increases by between 0.003 and 0.005 units for every 1 unit increase in the borrower’s (Saving_amount).
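As a consistency check on the quantities quoted above, the following standard identities for simple linear regression link the slope's t statistic, the F statistic, and \(R^2\). The figures are taken from the regression summary (t of about 11.539, 998 residual degrees of freedom); the identities are textbook results and are added here for illustration.

\[
F = t^2 \approx 11.539^2 \approx 133.1,
\qquad
R^2 = \frac{t^2}{t^2 + (n - 2)} \approx \frac{133.1}{133.1 + 998} \approx 0.1177
\]

Both values agree with the reported F-statistic of 133.1 and Multiple R-squared of 0.1177.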