A primary practical question may be: are there any relationships within the data set to explore? Another may be: how are those relationships realized? This framing may be unsatisfactory. However, after some exploratory data analysis (EDA), it may be possible to present a consolidated and coherent response. Plainly, this research does not yet know its direction. However, there may be a direction of merit that, by extension, is worth this research’s attention, distinguished from the multitude of directions that are assumed to be present.
Based on the previous output, the data set has many variables and possible relationships to explore. What follows is a scatter-plot matrix of most of the variables in the data set. The variables that were not included were the non-numeric categorical variables. Additionally, a random slice of 50 observations was used for each scatterplot, which may aid readability. As previously expressed, there seem to be many relationships that can be explored. However, given the scale of the plots, the pairwise relationships may be indiscernible, even with these adjustments.
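The random slice described above can be sketched as follows. This is an illustrative Python sketch, not the R code used in this research; the function name `random_slice`, the seed, and the stand-in row indices are assumptions made for the example.

```python
import random


def random_slice(rows, k=50, seed=0):
    """Return k rows drawn without replacement; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, k)


# Hypothetical stand-in for the 1000 observations: row indices 0..999.
rows = list(range(1000))
subset = random_slice(rows, k=50, seed=42)
print(len(subset), "rows selected for plotting")
```

Fixing the seed means the same 50 observations appear in every panel of the scatter-plot matrix each time the analysis is rerun.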
To focus this research’s exploration efforts, the data set was subset to four variables: Saving_amount, Credit_score, Emp_duration, and Age. This may make further exploration manageable and may also afford meaningful interpretations. Additionally, the variables in this new data set were all normalized to z-scores. The pairwise relationships from this new data set, along with the significance of possible correlations and the distributions of the variables, were output.
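For readers unfamiliar with z-score normalization, a minimal Python sketch follows. It assumes the sample standard deviation (n - 1 denominator) was used, which this document does not state; the `ages` values are hypothetical and serve only to show the transformation.

```python
from statistics import mean, stdev


def z_scores(values):
    """Center on the sample mean and scale by the sample SD (n - 1 denominator)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical ages, used only to illustrate the transformation.
ages = [23, 35, 41, 52, 60]
z = z_scores(ages)
print([round(v, 3) for v in z])
```

After this transformation each variable has mean 0 and standard deviation 1, so a one-unit change in any variable is a change of one standard deviation.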
Based on the output, there seem to be positive, moderate, linear relationships between Credit_score and Saving_amount, Emp_duration and Saving_amount, and Age and Saving_amount. Given this new information, a practical question may be: how do these variables affect Saving_amount? The hypothesized model might then be: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\), with \(y = x_{13}\), \(x_1 = x_4\), \(x_2 = x_{14}\), and \(x_3 = x_{15}\).
Here \(y = x_{13}\) is the response variable in this model. To clarify, the variables for this new data set are:

| Symbol | Variable |
|---|---|
| \(y = x_{13}\) | Saving_amount |
| \(x_1 = x_4\) | Credit_score |
| \(x_2 = x_{14}\) | Emp_duration |
| \(x_3 = x_{15}\) | Age |
As previously stated, the initial full hypothesized model may be: (Saving_amount) = \(\beta_0\) + \(\beta_1\)(Credit_score) + \(\beta_2\)(Emp_duration) + \(\beta_3\)(Age) + \(\epsilon\). A multiple linear regression (MLR) will be run using R. Relevant information from this MLR will be used to assess the model assumptions. Based on that assessment, transformations to the model may follow. Below is tabular output of the relevant information from the MLR for the model. An interpretation of this output will be withheld until after residual diagnostics are conducted.
Call:
lm(formula = Saving_amount ~ Credit_score + Emp_duration + Age,
data = data0.norm)
Residuals:
Min 1Q Median 3Q Max
-3.12657 -0.65622 0.04861 0.63466 2.80346
Coefficients:
Estimate Std. Error t value
(Intercept) -0.0000000000000001871 0.0294944135523319134 0.000
Credit_score 0.1107251165606362442 0.0312684481129705955 3.541
Emp_duration 0.0593569321105637399 0.0296323108956561034 2.003
Age 0.3020196267744814644 0.0312967053020656791 9.650
Pr(>|t|)
(Intercept) 1.000000
Credit_score 0.000417 ***
Emp_duration 0.045436 *
Age < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9327 on 996 degrees of freedom
Multiple R-squared: 0.1327, Adjusted R-squared: 0.1301
F-statistic: 50.79 on 3 and 996 DF, p-value: < 0.00000000000000022
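The summary statistics at the bottom of this output are linked by standard identities, which can be checked from the reported \(R^2\) alone. The Python sketch below is illustrative and is not part of the original R analysis; the residual standard error identity additionally assumes that the response was z-scored with the sample standard deviation, so that its total sum of squares equals n - 1.

```python
import math

n, k = 1000, 3   # observations and predictors (996 residual df = n - k - 1)
r2 = 0.1327      # Multiple R-squared as reported (rounded to 4 decimals)


def f_statistic(r2, n, k):
    """Global F statistic expressed through R-squared."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


def adjusted_r2(r2, n, k):
    """Adjusted R-squared penalizes R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def residual_se_standardized(r2, n, k):
    """Residual standard error, valid only if the response has sample variance 1."""
    return math.sqrt((1 - r2) * (n - 1) / (n - k - 1))


print(f_statistic(r2, n, k))
print(adjusted_r2(r2, n, k))
print(residual_se_standardized(r2, n, k))
```

Because the input \(R^2\) is itself rounded, the recomputed F statistic agrees with the reported value only up to rounding.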
Below is graphical output for the residual analysis of the model: a histogram of the residuals, a normal Q-Q plot, and a “versus fits” plot of the residuals against the predicted values. The histogram suggests that the residuals may be normally distributed. Likewise, there seem to be no major departures from the line of the Q-Q plot, which may also indicate that the model’s errors are normally distributed. Lastly, the “versus fits” plot seems to display no patterns, so the errors may have constant variance. Therefore, the errors may be independently and identically normally distributed with mean 0 and constant variance \(\sigma^2\).
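To make the Q-Q plot concrete, the sketch below computes the points such a plot displays: each sorted, standardized residual paired with the matching quantile of a standard normal distribution. It is an illustrative Python sketch, not the plotting code used here; the plotting positions (i - 0.5) / n are one common convention and an assumption, and the residual values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Pair each sorted, standardized residual with its theoretical normal quantile."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    sample = sorted((r - m) / s for r in residuals)
    # Plotting positions (i - 0.5) / n keep the probabilities strictly inside (0, 1).
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))


# Hypothetical residuals, for illustration only.
resid = [-1.2, -0.4, -0.1, 0.0, 0.3, 0.5, 0.9]
points = qq_points(resid)
for theo, samp in points:
    print(round(theo, 3), round(samp, 3))
```

If the residuals are roughly normal, these pairs fall close to a straight line, which is the pattern the Q-Q plot is read for.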
Next, a variable selection procedure is used to identify alternative models that use fewer terms, or different arrangements of the terms, than the full model. The procedure chosen is best subsets, which explores all possible combinations of the full model’s terms. Tabular output from this process follows.
Best Subsets Regression

| Model Index | Predictors |
|---|---|
| 1 | Age |
| 2 | Credit_score Age |
| 3 | Credit_score Emp_duration Age |

Subsets Regression Summary

| Model | R-Square | Adj. R-Square | Pred R-Square | C(p) | AIC | SBIC | SBC | MSEP | FPE | HSP | APC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.1177 | 0.1168 | 0.1141 | 17.2096 | 2717.6466 | -120.2829 | 2732.3699 | 883.1781 | 0.8849 | 0.0009 | 0.8858 |
| 2 | 0.1292 | 0.1275 | 0.1238 | 6.0125 | 2706.5360 | -131.3411 | 2726.1670 | 872.5498 | 0.8752 | 0.0009 | 0.8760 |
| 3 | 0.1327 | 0.1301 | 0.1254 | 4.0000 | 2704.5155 | -133.3295 | 2729.0543 | 869.9222 | 0.8734 | 0.0009 | 0.8743 |

AIC: Akaike Information Criteria
SBIC: Sawa's Bayesian Information Criteria
SBC: Schwarz Bayesian Criteria
MSEP: Estimated error of prediction, assuming multivariate normality
FPE: Final Prediction Error
HSP: Hocking's Sp
APC: Amemiya Prediction Criteria
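With three predictors, "all possible combinations" amounts to seven candidate models, while the table lists three, apparently one per model size. The Python sketch below only enumerates the candidates; it does not fit them, and the helper name `all_subsets` is an assumption made for illustration.

```python
from itertools import combinations

predictors = ["Credit_score", "Emp_duration", "Age"]


def all_subsets(terms):
    """Yield every non-empty combination of terms, smallest subsets first."""
    for size in range(1, len(terms) + 1):
        for combo in combinations(terms, size):
            yield combo


candidates = list(all_subsets(predictors))
for combo in candidates:
    print(len(combo), " + ".join(combo))
```

A best subsets procedure would fit each candidate and keep the best performer of each size, which is consistent with the three rows reported.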
Based on the above output, model three, which is the full model, has the highest \(R^2_a\) (0.1301) and the lowest \(C(p)\) (4.0000) and \(AIC\) (2704.5155) of the three models, although SBC alone favors model two (2726.1670 versus 2729.0543). Considering the majority of these criteria, model three may be deemed better performing than the others. For this research, model three will be the final model.
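The \(AIC\) and \(C(p)\) columns can be approximately recovered from the reported \(R^2\) values, which may help show what the criteria measure. The Python sketch below is illustrative rather than the procedure's own code. It assumes the response was z-scored with the sample standard deviation, so the total sum of squares is n - 1, and it uses the Gaussian log-likelihood form of AIC, which counts the error variance as a parameter.

```python
import math

n = 1000
tss = n - 1  # assumption: response z-scored with the sample SD, so TSS = n - 1

# Model index -> (number of predictors, R-squared), copied from the summary table.
models = {1: (1, 0.1177), 2: (2, 0.1292), 3: (3, 0.1327)}
reported_aic = {1: 2717.6466, 2: 2706.5360, 3: 2704.5155}
reported_cp = {1: 17.2096, 2: 6.0125, 3: 4.0000}


def rss(r2):
    """Residual sum of squares implied by R-squared."""
    return (1 - r2) * tss


def aic(k, r2):
    """Gaussian AIC: k slopes, one intercept, and the error variance are counted."""
    return n * (math.log(2 * math.pi) + 1 + math.log(rss(r2) / n)) + 2 * (k + 2)


def mallows_cp(k, r2, k_full=3, r2_full=0.1327):
    """Mallows' C(p), using the full model's mean squared error as the variance estimate."""
    sigma2_full = rss(r2_full) / (n - k_full - 1)
    return rss(r2) / sigma2_full - n + 2 * (k + 1)


for idx, (k, r2) in models.items():
    print(idx, round(aic(k, r2), 2), round(mallows_cp(k, r2), 2))
```

The recomputed values differ slightly from the table because the \(R^2\) inputs are rounded to four decimals. For the full model, \(C(p)\) equals the number of coefficients by construction, so its value of 4 says nothing by itself about model quality.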
Once again, relevant information regarding the model’s statistics is output. An interpretation of the information follows.
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 0.000 | 0.029 | 0.000 | 1.000 | -0.058 | 0.058 |
| Credit_score | 0.111 | 0.031 | 3.541 | 0.000 | 0.049 | 0.172 |
| Emp_duration | 0.059 | 0.030 | 2.003 | 0.045 | 0.001 | 0.118 |
| Age | 0.302 | 0.031 | 9.650 | 0.000 | 0.241 | 0.363 |
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.133 | 0.13 | 0.866 | 0.931 | 0.933 | 50.794 | 0 | 3 | 1000 |
Based on the residual analysis, there seem to be no violations of the model’s assumptions. Further, the test of the model’s global utility, F(3, 996) = 50.79, p < 0.001, indicates that at least one of the model’s slopes may be non-zero. Moreover, the adjusted coefficient of determination is \(R^2_a\) = 0.130 (with \(R^2\) = 0.133), and the coefficient of variation is CV = 10.68%. Additionally, all slopes were found to be significant at the 0.05 level: Credit_score (t = 3.541, p < 0.001, 95% CI = (0.049, 0.172)), Emp_duration (t = 2.003, p = 0.045, 95% CI = (0.001, 0.118)), and Age (t = 9.650, p < 0.001, 95% CI = (0.241, 0.363)).
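The 95% confidence intervals can be reproduced from the estimates and standard errors in the regression summary as estimate ± critical value × standard error. The Python sketch below is illustrative and not the original R code. Since the Python standard library has no Student t quantile function, it approximates the critical value for 996 degrees of freedom with a first-order series correction to the normal quantile; that approximation and the helper names are assumptions of the example.

```python
from statistics import NormalDist


def t_crit(df, level=0.95):
    """Approximate two-sided t critical value: normal quantile plus a first-order correction."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z + (z ** 3 + z) / (4 * df)


def conf_int(estimate, std_error, df=996, level=0.95):
    """Return (lower, upper) bounds of the confidence interval for one coefficient."""
    half_width = t_crit(df, level) * std_error
    return estimate - half_width, estimate + half_width


# (estimate, standard error), copied from the regression summary and truncated to 7 decimals.
coefs = {
    "Credit_score": (0.1107251, 0.0312684),
    "Emp_duration": (0.0593569, 0.0296323),
    "Age": (0.3020196, 0.0312967),
}

for name, (est, se) in coefs.items():
    lower, upper = conf_int(est, se)
    print(name, round(lower, 3), round(upper, 3))
```

The interval for Emp_duration has a lower bound very close to zero, which matches its p-value sitting just under 0.05.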
The explicit final model follows: (Saving_amount) = 0 + 0.111 (Credit_score) + 0.059 (Emp_duration) + 0.302 (Age) .
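As a usage illustration, the fitted equation can be written as a small function. This is a Python sketch with a hypothetical function name, not code from the original analysis; both its inputs and its output are z-scores, because the model was fit on normalized variables.

```python
# Rounded coefficients of the final model (all variables are z-scores).
INTERCEPT = 0.0
COEF = {"Credit_score": 0.111, "Emp_duration": 0.059, "Age": 0.302}


def predict_saving_z(credit_z, emp_z, age_z):
    """Predicted Saving_amount z-score from predictor z-scores."""
    return (
        INTERCEPT
        + COEF["Credit_score"] * credit_z
        + COEF["Emp_duration"] * emp_z
        + COEF["Age"] * age_z
    )


# A hypothetical borrower one standard deviation above the mean on every predictor.
print(round(predict_saving_z(1.0, 1.0, 1.0), 3))
```

A borrower at the mean of all three predictors (all z-scores equal to 0) is predicted to sit at the mean Saving_amount, which is what the zero intercept expresses.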
Interpretations:
The (Credit_score) of a borrower may contribute information that allows prediction of a borrower’s (Saving_amount). Because all variables were z-score normalized, one unit here is one standard deviation. Specifically, with 95% confidence, the (Saving_amount) of a borrower may increase by between 0.049 and 0.172 standard deviations for every one standard deviation increase in the borrower’s (Credit_score), while holding (Emp_duration) and (Age) constant.
The (Emp_duration) of a borrower may contribute information that allows prediction of a borrower’s (Saving_amount). Specifically, with 95% confidence, the (Saving_amount) of a borrower may increase by between 0.001 and 0.118 standard deviations for every one standard deviation increase in the borrower’s (Emp_duration), while holding (Credit_score) and (Age) constant.
The (Age) of a borrower may contribute information that allows prediction of a borrower’s (Saving_amount). Specifically, with 95% confidence, the (Saving_amount) of a borrower may increase by between 0.241 and 0.363 standard deviations for every one standard deviation increase in the borrower’s (Age), while holding (Emp_duration) and (Credit_score) constant.
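Because the model was fit on z-scores, the slopes above are in standard deviation units. The conversion to raw units can be sketched as follows, where \(s_y\) and \(s_{x_j}\) (sample standard deviations), \(\bar{y}\) and \(\bar{x}_j\) (sample means), and \(b_j\) (raw-unit slope) are symbols introduced here for illustration; their values are not reported in this document.

\[
\frac{\hat{y} - \bar{y}}{s_y} = \sum_{j=1}^{3} \hat{\beta}_j \, \frac{x_j - \bar{x}_j}{s_{x_j}}
\quad\Longrightarrow\quad
\hat{y} = \bar{y} + \sum_{j=1}^{3} b_j \,(x_j - \bar{x}_j),
\qquad b_j = \hat{\beta}_j \, \frac{s_y}{s_{x_j}}
\]

So a raw-unit interpretation of any slope would require multiplying the standardized estimate, and both ends of its confidence interval, by \(s_y / s_{x_j}\).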
This research did not appear to call for more elaborate regression techniques, so MLR that is linear in the model’s terms was employed. Note that the original data set contained at least 16 variables. After slicing out 50 random observations from the original data set of 1000 observations, pairwise plots of the variables were output. Even with the adjustments made to the input for the plots, the relationships in the plots may still have been indiscernible. This may have been because of the size and number of plots present. Therefore, to focus the research’s efforts, the data set was subset to Saving_amount, Credit_score, Emp_duration, and Age. This reduced data set may have yielded manageable exploration and meaningful interpretations. Saving_amount was then chosen as the response variable, which may be owed to the significance of its correlations with the other three variables. From there, MLR was conducted after all variables were z-score normalized. A residual diagnostic followed, and there seemed to be no violations of the model’s assumptions. Afterwards, variable selection was conducted. Specifically, the best subsets procedure was used to determine the possible effectiveness of alternative models. This procedure revealed that the initial model may be the best choice moving forward. It was decided, then, that the initial model, which was the full model, would be the final model. Because the model is linear, it may have been highly interpretable. Those interpretations were given above. Further research may be recommended. This may be a result of the wide confidence intervals around all of the slopes as well as the low \(R^2_a\) value of 0.130.