Part I: Find a data set for project #1

The data set chosen for project #1 is “BankLoanDefaultDataset”. The structure of the data set follows.

'data.frame':   1000 obs. of  16 variables:
 $ Default         : int  0 0 0 1 1 0 0 0 0 1 ...
 $ Checking_amount : int  988 458 158 300 63 1071 -192 172 585 189 ...
 $ Term            : int  15 15 14 25 24 20 13 16 20 19 ...
 $ Credit_score    : int  796 813 756 737 662 828 856 763 778 649 ...
 $ Gender          : chr  "Female" "Female" "Female" "Female" ...
 $ Marital_status  : chr  "Single" "Single" "Single" "Single" ...
 $ Car_loan        : int  1 1 0 0 0 1 1 1 1 1 ...
 $ Personal_loan   : int  0 0 1 0 0 0 0 0 0 0 ...
 $ Home_loan       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Education_loan  : int  0 0 0 1 1 0 0 0 0 0 ...
 $ Emp_status      : chr  "employed" "employed" "employed" "employed" ...
 $ Amount          : int  1536 947 1678 1804 1184 475 626 1224 1162 786 ...
 $ Saving_amount   : int  3455 3600 3093 2449 2867 3282 3398 3022 3475 2711 ...
 $ Emp_duration    : int  12 25 43 0 4 12 11 12 12 0 ...
 $ Age             : int  38 36 34 29 30 32 38 36 36 29 ...
 $ No_of_credit_acc: int  1 1 1 1 1 2 1 1 1 1 ...

The detailed requirements for the data are described below.
- The response variable must be a continuous random variable.
  - Based on the above output, several numeric variables (for example Age, Amount, Saving_amount, and Credit_score) could be used as a response variable.
- At least two categorical explanatory variables.
  - Based on the above output, Gender, Marital_status, and Emp_status are character variables that could be used as categorical explanatory variables.
- At least one of the categorical variables has more than two categories.
  - Based on the above output, there seems to be at least one categorical variable with more than two categories, although the truncated output does not list every category.
- At least two numerical explanatory variables.
  - Based on the above output, there are several numerical variables that could be used as explanatory variables.
- At least 15 observations are required for estimating each regression coefficient. For example, if your final linear model has 11 variables (including dummy variables), you need 12×15=180 observations.
  - The above output shows 16 variables and 1000 observations. Counting one coefficient per variable gives 16×15=240 required observations, and 1000 exceeds that.
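
The checks in the list above could be sketched in R as follows. This is only an illustration: `loan_demo` is a hypothetical six-row stand-in whose categorical values are invented, and `required_obs()` is a helper name introduced here, not something from the assignment.

```r
# Hypothetical stand-in for the loan data; the categorical values are invented.
# Replace `loan_demo` with the real data frame to run the same checks on it.
loan_demo <- data.frame(
  Age            = c(38, 36, 34, 29, 30, 32),
  Saving_amount  = c(3455, 3600, 3093, 2449, 2867, 3282),
  Gender         = c("Female", "Male", "Female", "Male", "Female", "Male"),
  Marital_status = c("Single", "Married", "Other", "Single", "Married", "Other"),
  stringsAsFactors = FALSE
)

# Number of distinct categories in each character column.
is_cat   <- vapply(loan_demo, is.character, logical(1))
n_levels <- vapply(loan_demo[is_cat], function(x) length(unique(x)), integer(1))
print(n_levels)
print(any(n_levels > 2))  # is there a categorical variable with > 2 categories?

# Rule of thumb from the requirements: 15 observations per regression
# coefficient, counting the intercept as one extra coefficient.
required_obs <- function(n_variables, per_coef = 15) {
  (n_variables + 1) * per_coef
}
print(required_obs(11))  # the 12 x 15 example from the requirements
print(required_obs(15))  # 16 x 15, the figure compared against 1000 rows
```

Running the level count on the real data frame would settle the third requirement directly, since the `str()` output only shows the first few values of each column.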

Part II: Fit an SLR

Fit a simple linear regression (SLR) by selecting a numerical explanatory variable from the data set using the least squares approach, and then construct 95% bootstrap confidence intervals for the regression coefficients.

1 Data Set Description

How was the data collected?
- There is no information regarding how the data was collected for this data set.
List of all variables: names and their variable types.
- The data set is called “Loan Default Data” and is taken from the book “Applied Analytics through Case Studies Using SAS and R” by Deepti Gupta (Apress, ISBN 978-1-4842-3525-6). Where it was posted, it was described as “…a subset of a large data set. The structure of the data set is simple. It can be used for logistic and binary classification / predictive models and algorithms”. The number of observations, the number of variables, and the variable names and types follow.
  - 1000 obs
  - 16 variables:
  - Default : int
  - Checking_amount : int
  - Term : int
  - Credit_score : int
  - Gender : chr
  - Marital_status : chr
  - Car_loan : int
  - Personal_loan : int
  - Home_loan : int
  - Education_loan : int
  - Emp_status : chr
  - Amount : int
  - Saving_amount : int
  - Emp_duration : int
  - Age : int
  - No_of_credit_acc: int
What are your practical and analytic questions?
- A primary practical question is whether there are any relationships among the variables in the data set worth exploring. A follow-up analytic question is how those relationships can be quantified, for example whether a borrower’s (Saving_amount) is linearly related to the borrower’s (Age).
Does the data set have enough information to answer the questions?
- Measured against the project requirements, the data set appears to have enough information to answer these questions.

2 Simple Linear Regression

Make a pair-wise scatter plot of all variables in your selected data set.

Below is a scatterplot matrix of most of the variables from the data set. There seem to be many relationships that can be explored. However, given the scale of the plots, the pairwise relationships may be indiscernible.

Choose an explanatory variable that is linearly correlated to the response variable.

Below is a scatterplot of two variables from the data set: the (Age) of a borrower and the (Saving_amount) of that borrower. The scatterplot is arranged to display (Saving_amount) vs. (Age). There appears to be a positive, moderate, linear relationship between the two.
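
A scatterplot matrix and a single scatterplot like the ones described above could be produced along these lines. This is a sketch on simulated data: `loan_sim` and its generating parameters are assumptions chosen only to resemble the reported fit, not the real loan data.

```r
# Simulated stand-in for the loan data (parameters are assumptions);
# swap in the real data frame to reproduce the actual plots.
set.seed(321)
n <- 1000
loan_sim <- data.frame(
  Saving_amount = round(rnorm(n, mean = 3200, sd = 340)),
  Credit_score  = round(rnorm(n, mean = 760, sd = 60))
)
loan_sim$Age <- round(18 + 0.004 * loan_sim$Saving_amount + rnorm(n, sd = 3.8))

# Restricting the matrix to a few numeric columns keeps each panel readable.
num_cols <- names(loan_sim)[vapply(loan_sim, is.numeric, logical(1))]
pairs(loan_sim[num_cols], pch = 20, cex = 0.4,
      main = "Pairwise scatterplots (simulated data)")

# Single scatterplot of the response against the chosen predictor.
plot(Age ~ Saving_amount, data = loan_sim, pch = 20, cex = 0.5,
     main = "Age vs. Saving_amount (simulated data)")

# Sample correlation as a numeric summary of the linear association.
r <- cor(loan_sim$Age, loan_sim$Saving_amount)
print(r)
```

Plotting only a handful of columns at a time is one way around the indiscernible panels mentioned for the full scatterplot matrix.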

Fit an ordinary least squares simple linear regression (SLR) to capture the linear relationship between the two variables.

We consider one predictor for this analysis, (Saving_amount) \(= x_1\), with respect to the response (Age) \(= y\). We will hold off on commenting on the results of this summary until after a residual analysis is conducted. The proposed regression model is: (Age) = \(\beta_0\) + \(\beta_1\) (Saving_amount) + \(\epsilon\).


Call:
lm(formula = Age ~ Saving_amount, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.7993  -2.4581   0.0857   2.6885  11.5300 

Coefficients:
                Estimate Std. Error t value            Pr(>|t|)    
(Intercept)   18.0602972  1.1460023   15.76 <0.0000000000000002 ***
Saving_amount  0.0041358  0.0003584   11.54 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.847 on 998 degrees of freedom
Multiple R-squared:  0.1177,    Adjusted R-squared:  0.1168 
F-statistic: 133.1 on 1 and 998 DF,  p-value: < 0.00000000000000022
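
A fit of this form could be reproduced with a call like the one below. The sketch uses simulated data, so its numbers will not match the output above: `loan_sim` and its generating parameters are assumptions, and only the `lm()` formula is taken from the report.

```r
# Simulated stand-in for the loan data (parameters are assumptions).
set.seed(321)
n <- 1000
loan_sim <- data.frame(Saving_amount = round(rnorm(n, mean = 3200, sd = 340)))
loan_sim$Age <- 18 + 0.004 * loan_sim$Saving_amount + rnorm(n, sd = 3.8)

# Ordinary least squares fit of Age on Saving_amount.
fit <- lm(Age ~ Saving_amount, data = loan_sim)

# Coefficient table, residual standard error, R-squared, and F-statistic.
print(summary(fit))

# Parametric (t-based) 95% confidence intervals for both coefficients.
param_ci <- confint(fit, level = 0.95)
print(param_ci)
```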

Comment on the residual plots and point out the violations to the model assumptions.

Below is graphical output for the residual analysis of the model: a histogram of the residuals, a normal Q-Q plot, and a “versus fits” plot of the residuals against the fitted values. The histogram suggests that the residuals are approximately normally distributed. Likewise, there seem to be no major departures from the line of the Q-Q plot, which may indicate that the model’s errors are normally distributed. Lastly, the “versus fits” plot seems to display no patterns, so the errors may have constant variance. Therefore, the errors may be independently and identically normally distributed with mean 0 and constant variance \(\sigma^2\).
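
The three diagnostic plots described above could be drawn roughly as follows. Again this is a sketch on simulated data (`loan_sim` and its parameters are assumptions), so the plots only show the layout, not the real residuals.

```r
# Simulated stand-in for the loan data (parameters are assumptions).
set.seed(321)
n <- 1000
loan_sim <- data.frame(Saving_amount = round(rnorm(n, mean = 3200, sd = 340)))
loan_sim$Age <- 18 + 0.004 * loan_sim$Saving_amount + rnorm(n, sd = 3.8)
fit <- lm(Age ~ Saving_amount, data = loan_sim)

res         <- resid(fit)   # residuals (observed minus fitted)
fitted_vals <- fitted(fit)  # fitted values

op <- par(mfrow = c(1, 3))  # three panels side by side
hist(res, breaks = 30, main = "Histogram of residuals", xlab = "Residual")
qqnorm(res, main = "Normal Q-Q plot")
qqline(res)                 # reference line for normality
plot(fitted_vals, res, pch = 20, cex = 0.5,
     main = "Versus fits", xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)      # residuals should scatter evenly around zero
par(op)                     # restore the previous plot layout
```

A funnel shape in the third panel would suggest non-constant variance, and curvature would suggest a missing nonlinear term.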

Based on the above residual analysis, there seem to be no clear violations of the model’s assumptions. The model’s coefficient of determination is \(R^2\) = 0.1177, so approximately 11.77% of the variation in the (Age) of a borrower is explained by the linear relationship with the borrower’s (Saving_amount). Also, note that the coefficient of variation is CV = 12.33%. The slope was found to be significant (t = 11.54, p < .001), 95% CI = (0.003, 0.005). One interpretation is that the (Saving_amount) of a borrower contributes information for predicting the (Age) of a borrower. Specifically, the estimated mean (Age) increases by 0.0041358 units for every 1 unit increase in that borrower’s (Saving_amount). Alternatively, with 95% confidence, the mean (Age) of a borrower increases by between 0.003 and 0.005 units for every 1 unit increase in (Saving_amount).
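
The t statistics and parametric 95% intervals quoted above can be recomputed from the printed coefficient table alone. The sketch below uses only the estimates, standard errors, and residual degrees of freedom from the `lm()` output. Small differences in the last digits are expected because those inputs are already rounded.

```r
# Estimates and standard errors copied from the coefficient table above.
est <- c(intercept = 18.0602972, Saving_amount = 0.0041358)
se  <- c(intercept = 1.1460023,  Saving_amount = 0.0003584)
df_resid <- 998  # residual degrees of freedom from the summary

# t statistic for H0: coefficient = 0.
t_stat <- est / se
print(round(t_stat, 2))

# 95% CI: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 3))

# In simple linear regression, R^2 = F / (F + residual df).
r2_from_F <- 133.1 / (133.1 + df_resid)
print(round(r2_from_F, 3))
```

Working from the printed table this way is a quick consistency check on the reported interval for the slope.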

3 Bootstrap Regression

Use the bootstrap algorithm on the previous final linear regression model to estimate the bootstrap confidence intervals of regression coefficients (using 95% confidence level).

Below are the 95% bootstrap confidence intervals for the regression coefficients.

Bootstrap confidence intervals of regression coefficients.
coefficient	2.5%	97.5%
intercept (bt.b0.ci.vec)	15.7401301	20.2560220
Saving_amount (bt.b1.ci.vec)	0.0034701	0.0048398
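
One common way to obtain intervals like these is a case-resampling bootstrap with percentile intervals, sketched below. This is an assumption about the procedure, since the report does not show its code. The data are simulated (`loan_sim` and its parameters are assumptions), `B = 1000` replicates is an arbitrary choice, and the names `bt.b0.ci.vec` and `bt.b1.ci.vec` simply mirror the row labels in the table.

```r
# Simulated stand-in for the loan data (parameters are assumptions).
set.seed(321)
n <- 1000
loan_sim <- data.frame(Saving_amount = round(rnorm(n, mean = 3200, sd = 340)))
loan_sim$Age <- 18 + 0.004 * loan_sim$Saving_amount + rnorm(n, sd = 3.8)

B <- 1000  # number of bootstrap replicates (assumed)
boot_coefs <- matrix(NA_real_, nrow = B, ncol = 2,
                     dimnames = list(NULL, c("b0", "b1")))

for (b in seq_len(B)) {
  # Resample whole rows (cases) with replacement, then refit the model.
  idx <- sample(n, size = n, replace = TRUE)
  boot_coefs[b, ] <- coef(lm(Age ~ Saving_amount, data = loan_sim[idx, ]))
}

# Percentile intervals: the 2.5% and 97.5% quantiles of the replicates.
bt.b0.ci.vec <- quantile(boot_coefs[, "b0"], c(0.025, 0.975))
bt.b1.ci.vec <- quantile(boot_coefs[, "b1"], c(0.025, 0.975))
print(rbind(bt.b0.ci.vec, bt.b1.ci.vec))
```

Resampling cases rather than residuals avoids relying on the constant-variance assumption, at the cost of treating the predictor values as random.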

4 Conclusion

Compare the p-values and bootstrap confidence intervals of corresponding regression coefficients of the final linear regression model, make a recommendation on which inferential result to be reported, and justify.

Coefficients with 95% bootstrap and parametric confidence intervals
term	estimate	bstrap_lower	bstrap_upper	p_value	param_lower	param_upper
intercept	18.0602972	15.7401301	20.2560220	<2e-16	15.811	20.309
Saving_amount	0.0041358	0.0034701	0.0048398	<2e-16	0.003	0.005

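The final regression model is: (Age) = 18.0602972 + 0.0041358 (Saving_amount). The 95% bootstrap CI for the slope, (0.0034701, 0.0048398), lies entirely above zero, which indicates a positive relationship between Age and Saving_amount. Based on the above tables, the bootstrap and parametric results agree that the slope is significantly different from zero, which means the Age of a borrower and the borrower’s Saving_amount are statistically correlated. Because the residual analysis showed no clear violations of the model assumptions and the two sets of intervals nearly coincide, the parametric results seem reasonable to report, with the bootstrap intervals serving as a check that does not rely on the normality assumption.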

STA321: Week #03 Assignment

TMRB

2023-09-17