Data Description

The data set used is called “Loan Default Data” and is taken from the book “Applied Analytics through Case Studies Using SAS and R” by Deepti Gupta, Apress, ISBN 978-1-4842-3525-6. Where it was posted, it was described as “…a subset of a large data set. The structure of the data set is simple. It can be used for logistic and binary classification / predictive models and algorithms”. The data set contains 1000 observations and 5 of its original 16 variables; these variables represent the minimum requirements for this research. Saving_amount is explored as the response. Age has been discretized into a categorical variable (DA) with three levels, and Marital_status has been converted from string representations of its two levels to a numeric representation (MStat).

Practical Question

Some practical questions for this research may be: are there any relationships between Saving_amount and the other variables in the data set, and if so, how can those relationships be interpreted?

Exploratory Data Analysis

Next, exploratory data analysis is conducted. A scatter-plot matrix of the numeric variables in the data set will be output after all non-categorical variables are normalized to z-scores.
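
A minimal sketch of how this step might be carried out in R follows. The object names data0 (raw data) and data0.norm (normalized copy, the name that appears in the model output later) are assumptions for illustration.

# Assumed: the raw data are in a data frame called data0, and the categorical
# variables DA and MStat are already stored as factors.
num_vars <- c("Saving_amount", "Credit_score", "Emp_duration", "Age")

# z-score normalize only the non-categorical (numeric) columns
data0.norm <- data0
data0.norm[num_vars] <- scale(data0[num_vars])

# scatter-plot matrix of the numeric variables
pairs(data0.norm[num_vars],
      main = "Scatter-plot matrix (z-scored numeric variables)")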

Based on the output, there seem to be mild, positive, linear relationships between Saving_amount and each of Credit_score, Emp_duration, and Age.

Fitting MLR to Data

The initial full hypothesized model may be: (Saving_amount) = \(\beta_0\) + \(\beta_1\)(Credit_score) + \(\beta_2\)(Emp_duration) + \(\beta_3\)(Age) + \(\beta_4\)(DA1) + \(\beta_5\)(DA2) + \(\beta_6\)(MStat1) + \(\epsilon\). An MLR will be run using R. Relevant information from this MLR will be used to assess the model assumptions. Based on that assessment, transformations to the model may follow. Below is tabular output of the relevant information from the MLR for the model. An interpretation of this output is withheld until after residual diagnostics are conducted.
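
A minimal sketch of fitting and summarizing this model in R, assuming the normalized data frame data0.norm from the previous step and the illustrative object name full.mod:

# Fit the full hypothesized model; DA and MStat are factors, so R creates the
# dummy terms DA1, DA2, and MStat1 automatically.
full.mod <- lm(Saving_amount ~ Credit_score + Emp_duration + Age + DA + MStat,
               data = data0.norm)

summary(full.mod)   # produces the coefficient table and fit statistics shown below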

Full model and diagnostics


Call:
lm(formula = Saving_amount ~ Credit_score + Emp_duration + Age + 
    DA + MStat, data = data0.norm)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.12626 -0.64452  0.00231  0.62696  2.79153 

Coefficients:
             Estimate Std. Error t value          Pr(>|t|)    
(Intercept)   0.15322    0.13635   1.124           0.26142    
Credit_score  0.10034    0.03126   3.210           0.00137 ** 
Emp_duration  0.04102    0.03032   1.353           0.17633    
Age           0.46047    0.06176   7.456 0.000000000000194 ***
DA1          -0.11882    0.13203  -0.900           0.36836    
DA2          -0.56265    0.20844  -2.699           0.00707 ** 
MStat1        0.08937    0.06074   1.471           0.14152    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9247 on 993 degrees of freedom
Multiple R-squared:   0.15, Adjusted R-squared:  0.1449 
F-statistic: 29.21 on 6 and 993 DF,  p-value: < 0.00000000000000022

Residual Diagnostic Analysis

Below is graphical output for the residual analysis of the model: a histogram of the residuals, a Q-Q Normal plot, and a “versus fits” plot of predicted values vs. errors. The histogram of the residuals may reveal that the residuals are normally distributed. Likewise, there seem to be no major departures from the line in the Q-Q plot, which may also indicate that the model’s errors are normally distributed. Lastly, the “versus fits” plot seems to display no patterns, so the errors may have constant variance. Therefore, the errors may be independently and identically distributed normally with mean = 0 and \(\sigma^2\) = constant.
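
A minimal sketch of how these three diagnostic plots might be produced in base R, assuming the fitted model object full.mod from above:

res  <- resid(full.mod)    # model errors (residuals)
fits <- fitted(full.mod)   # predicted values

par(mfrow = c(1, 3))
hist(res, main = "Histogram of residuals", xlab = "Residual")   # normality check
qqnorm(res)                                                     # Q-Q Normal plot
qqline(res)
plot(fits, res, main = "Residuals vs. fitted",                  # constant-variance check
     xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)
par(mfrow = c(1, 1))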

Goodness-of-fit Measures

Next, a variable selection procedure is used to identify models with alternative performance using fewer or different arrangements of the variables in the full model. The process chosen is the best subsets procedure, which explores all possible combinations of the full model’s terms. Tabular output from this process follows below.
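
The tabular output below matches the format produced by the olsrr package; a minimal sketch of how it might be generated, assuming olsrr is installed and full.mod is the fitted full model:

library(olsrr)

# evaluate every possible combination of the full model's terms
best.sub <- ols_step_best_subset(full.mod)
best.sub   # prints the predictor sets and the criteria summary shown below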

               Best Subsets Regression               
-----------------------------------------------------
Model Index    Predictors
-----------------------------------------------------
     1         Age                                    
     2         Age DA                                 
     3         Credit_score Age DA                    
     4         Credit_score Age DA MStat              
     5         Credit_score Emp_duration Age DA MStat 
-----------------------------------------------------

                                                     Subsets Regression Summary                                                      
-------------------------------------------------------------------------------------------------------------------------------------
                       Adj.        Pred                                                                                               
Model    R-Square    R-Square    R-Square     C(p)         AIC         SBIC          SBC         MSEP       FPE       HSP       APC  
-------------------------------------------------------------------------------------------------------------------------------------
  1        0.1177      0.1168      0.1141    34.7621    2717.6466    -120.3516    2732.3699    883.1781    0.8849    0.0009    0.8858 
  2        0.1362      0.1336      0.1291    15.1732    2700.4796    -139.4521    2725.0184    865.5485    0.8690    0.0009    0.8690 
  3        0.1455      0.1421      0.1366     6.2417    2691.5883    -148.2746    2721.0349    857.0332    0.8613    0.0009    0.8613 
  4        0.1485      0.1442      0.1378     4.8309    2690.1657    -149.6595    2724.5200    854.9641    0.8601    0.0009    0.8601 
  5        0.1500      0.1449      0.1376     5.0000    2690.3236    -149.4690    2729.5856    854.2500    0.8602    0.0009    0.8602 
-------------------------------------------------------------------------------------------------------------------------------------
AIC: Akaike Information Criteria 
 SBIC: Sawa's Bayesian Information Criteria 
 SBC: Schwarz Bayesian Criteria 
 MSEP: Estimated error of prediction, assuming multivariate normality 
 FPE: Final Prediction Error 
 HSP: Hocking's Sp 
 APC: Amemiya Prediction Criteria 

Based on the above output, model five, which is the full model, has the highest \(R^2_a\) and \(C(p)\) and \(AIC\) values close to the lowest among the candidate models. Considering this, model five may be deemed better performing than the others. Therefore model five, the full model, will be the final model.

Final Model

Once again, relevant information regarding the model’s statistics is output. An interpretation of the information follows.
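
The table below resembles output from the broom package; a minimal sketch of how such a summary might be assembled, assuming broom is available and full.mod is the final model:

library(broom)

coef.tab <- tidy(full.mod, conf.int = TRUE)   # estimates, tests, and 95% CIs
fit.tab  <- glance(full.mod)                  # R-squared, F-statistic, df, etc.

coef.tab
fit.tab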

term          estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept        0.153      0.136      1.124    0.261    -0.114     0.421
Credit_score     0.100      0.031      3.210    0.001     0.039     0.162
Emp_duration     0.041      0.030      1.353    0.176    -0.018     0.101
Age              0.460      0.062      7.456    0.000     0.339     0.582
DA: 1           -0.119      0.132     -0.900    0.368    -0.378     0.140
DA: 2           -0.563      0.208     -2.699    0.007    -0.972    -0.154
MStat: 1         0.089      0.061      1.471    0.142    -0.030     0.209

r_squared  adj_r_squared    mse   rmse  sigma  statistic  p_value  df  nobs
     0.15          0.145  0.849  0.921  0.925     29.213        0   6  1000

Based on the residual analysis, there seem to be no potential violations of the model’s assumptions. Further, for the test of the model’s global utility, F(6, 993) = 29.21 with p-value < 0.001, so at least one of the model’s coefficients may be non-zero. Moreover, the adjusted coefficient of determination is \(R^2_A\) = 0.145 and the coefficient of variation is CV = 10.68%. Not all of the estimates were found to be significant:

Credit_score: (t=3.21, p<.05), 95% CI = 0.039, 0.162
Emp_duration: (t=1.353, p>.05), 95% CI = -0.018, 0.101
Age: (t=7.456, p<.05), 95% CI = 0.339, 0.582
DA1: (t=-0.9, p>.05), 95% CI = -0.378, 0.14
DA2: (t=-2.699, p<.05), 95% CI = -0.972, -0.154
MStat1: (t=1.471, p>.05), 95% CI = -0.03, 0.209

Summary of the model

The explicit final model follows: (Saving_amount) = 0.153 + 0.100(Credit_score) + 0.041(Emp_duration) + 0.460(Age) - 0.119(DA1) - 0.563(DA2) + 0.089(MStat1)
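
As an illustration only, the fitted equation could be used to score a hypothetical borrower. Note that all numeric inputs are on the z-score scale, so the values below are standardized units rather than raw ones, and the object name b is illustrative:

# Hypothetical borrower: Credit_score one standard deviation above average,
# every other (standardized) predictor at zero / its reference level.
b <- coef(full.mod)
b["(Intercept)"] + b["Credit_score"] * 1   # approx. 0.153 + 0.100 = 0.253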

Interpretations:

The (Credit_score) of a borrower may contribute information that allows prediction of a borrower’s (Saving_amount). Specifically, the (Saving_amount) of a borrower may increase, with 95% confidence, between 0.039 and 0.162 units for every 1 unit increase in a borrower’s (Credit_score), while holding the other predictors constant.

The (Age) of a borrower may contribute information that allows prediction of a borrower’s (Saving_amount). Specifically, the (Saving_amount) of a borrower may increase, with 95% confidence, between 0.339 and 0.582 units for every 1 unit increase in a borrower’s (Age), while holding the other predictors constant.

The (DA2) level of a borrower may contribute information that allows prediction of a borrower’s (Saving_amount). Specifically, the (Saving_amount) of a borrower in the DA2 category may be, with 95% confidence, between 0.154 and 0.972 units lower than that of a borrower in the reference (DA) category, while holding the other predictors constant.

Residual Bootstrap

What follows are the two Bootstrapping procedures. First, sampling with replacement from the observations will be used to construct a bootstrapped MLR. This will be done 1000 times. At each iteration, the coefficients will be stored in a matrix from which a confidence interval may be derived for the estimates. This information will be stored and output in a table. Afterwards, Bootstrapping will be conducted a second time, this time by sampling the residuals from an initial model; when paired with the fitted values, these yield bootstrapped response values. These values will be used in the MLR to fit a model. Then the parameters from this bootstrapped model will be stored and used to calculate confidence intervals for the coefficients. This information will be output, afterwards, in a table together with the 95% CI’s from the MLR and the two bootstrapped MLR’s.
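
A minimal sketch of both procedures in R, assuming full.mod and data0.norm from earlier; the names case.boot, resid.boot, and boot.data are illustrative:

set.seed(1)                                    # for reproducibility
B <- 1000                                      # number of bootstrap iterations
n <- nrow(data0.norm)
p <- length(coef(full.mod))

case.boot  <- matrix(NA, nrow = B, ncol = p)   # coefficients from case resampling
resid.boot <- matrix(NA, nrow = B, ncol = p)   # coefficients from residual resampling
fits <- fitted(full.mod)
res  <- resid(full.mod)

for (b in 1:B) {
  # Procedure 1: resample the observations (rows) with replacement and refit
  idx <- sample(n, replace = TRUE)
  case.boot[b, ] <- coef(update(full.mod, data = data0.norm[idx, ]))

  # Procedure 2: resample residuals, add them to the fitted values, and refit
  boot.data <- data0.norm
  boot.data$Saving_amount <- fits + sample(res, replace = TRUE)
  resid.boot[b, ] <- coef(update(full.mod, data = boot.data))
}

# percentile 95% confidence intervals for each coefficient
colnames(case.boot) <- colnames(resid.boot) <- names(coef(full.mod))
t(apply(case.boot,  2, quantile, probs = c(0.025, 0.975)))
t(apply(resid.boot, 2, quantile, probs = c(0.025, 0.975)))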

Coefficients and 95% CI’s: Bootstrap and parametric regression
term          estimate  p_value  param_lower  param_upper   bt1_lower  bt1_upper   bt2_lower  bt2_upper
intercept        0.153    0.261       -0.114        0.421  -0.1443370  0.4206520  -0.1041865  0.4082264
Credit_score     0.100    0.001        0.039        0.162   0.0384878  0.1652792   0.0358358  0.1596797
Emp_duration     0.041    0.176       -0.018        0.101  -0.0185252  0.1023493  -0.0158685  0.1000677
Age              0.460    0.000        0.339        0.582   0.3358312  0.5720594   0.3373094  0.5790738
DA: 1           -0.119    0.368       -0.378        0.140  -0.3780105  0.1627563  -0.3651118  0.1428375
DA: 2           -0.563    0.007       -0.972       -0.154  -0.9305794 -0.1493390  -0.9495112 -0.1731802
MStat: 1         0.089    0.142       -0.030        0.209  -0.0273819  0.2068801  -0.0276838  0.2018919
Interpretation:

Each of these confidence intervals may indicate a range that, with 95% confidence, contains the true parameter value for the corresponding estimate. Notice that the intervals do not all exclude zero. However, the estimates that were found to be statistically different from zero do exclude zero. For those estimates, based on the above tables, the bootstrap and parametric regressions both indicate that the slopes may be significantly different from zero.

Discussion

The research may not have presented the need to use various regression techniques, so MLR with linearity in the model’s terms was employed. Saving_amount was explored as the response for this data set. Age was discretized into a categorical variable with three levels (DA), and both DA and Age were included in the model. All observations were z-score normalized. Then an MLR was conducted, and a residual diagnostic followed; there seemed to be no violations of the model’s assumptions. Afterwards, variable selection was conducted. Specifically, the best subsets procedure was used to determine the possible effectiveness of alternative models. This procedure revealed that the initial model may be the best choice moving forward. It was decided, then, that the initial model, which was the full model, would be the final model. Because of the linearity of the model, it may have been highly interpretable; those interpretations were given above.

Bootstrapping was then conducted, using two procedures. The first sampled the observations with replacement and then fit a model. The second sampled the full model’s errors with replacement, summed them with their respective fitted values, then used those values to fit a bootstrapped model. The coefficients from each procedure were stored, 95% CI’s were derived from them, and these were output in a table along with statistics from the MLR. Interpretations of these bootstrapped values then followed. Note that rerunning the analysis using the first bootstrapped procedure may present challenges based on the nature of its mechanics. Further research may be recommended, given the wide confidence intervals around all of the slopes as well as the modest \(R^2_A\) value.

References

Gupta, Deepti. Applied Analytics through Case Studies Using SAS and R. Apress. ISBN 978-1-4842-3525-6.