This is a case study to exercise model building techniques using multiple linear regression.
This data set was found on data.gov (https://catalog.data.gov/dataset/summary-financial-data-for-credit-unions) through the Society of Actuaries data analytics resources. This data set provides annual summary financial data for Iowa state-chartered credit unions starting with year ending 12/31/2005. The data set was uploaded to Github for universal access (https://raw.githubusercontent.com/JackRoss10089/STA-321/main/Summary_Financial_Data_For_Credit_Unions.csv). The variables in the data set are as follows:
The central objective is to identify the relationship between Net Income and relevant explanatory variables in the data set.
To begin analysis, first it is necessary to evaluate the variables in the data set and choose which variables can be used to build the model.
After initial exploratory data analysis, Net.Income was selected as the response variable and Revenue, Expenses, and Reporting.Date have been selected as predictor variables. These variables were selected for their relevancy to the credit union’s net income generation. Net.Income has been re scaled as the maximum net income value minus the given net income value. This variable is also measured in thousands of dollars. Similarly, the variables revenue and expenses have also been transformed to thousands of dollars.
Now it is time to begin the search process for the final model.
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 1.992040e+05 | 110.5926106 | 1801.241717 | 0.0000000 |
| Reporting.Date0(2021) | -8.485917e+01 | 156.5742374 | -0.541974 | 0.5878993 |
| Reporting.Date0[Pre 2020] | 1.961822e+02 | 112.1777629 | 1.748851 | 0.0804765 |
| Expenses0 | 4.941062e-01 | 0.0089309 | 55.325747 | 0.0000000 |
| Revenue0 | -5.105828e-01 | 0.0048017 | -106.333186 | 0.0000000 |
Next, perform residual diagnostics to assess model assumptions.
Based upon the residual plots, it is apparent that there are some violations of the model assumptions: * the variance of the residuals is not constant * the Q-Q plot indicates the distribution of residuals is off the normal distribution * he residual plot seems is clustered around the left side of the graph with a slight nonlinear pattern
#Boxcox Transformation Because the assumption of constant variance is violated, the Box-Cox procedure is adequate to search for a transformation of the response variable. Multiple different transformations will be performed to assess for the best transformation.
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 58.3762716 | 0.0129276 | 4515.6309757 | 0.0000000 |
| Reporting.Date0(2021) | 0.0116094 | 0.0183026 | 0.6343065 | 0.5259564 |
| Reporting.Date0[Pre 2020] | 0.0482209 | 0.0131129 | 3.6773618 | 0.0002421 |
| Expenses0 | 0.0000672 | 0.0000010 | 64.3351387 | 0.0000000 |
| Revenue0 | -0.0000632 | 0.0000006 | -112.5482351 | 0.0000000 |
The cube root transformation has shown small improvements in the residual plots, as the residuals vs. fitted plot appears to have a more constant variance. Also, the Q-Q plot has a slightly more linear pattern. Next, we will attempt a log transformation of the response variable net income.
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 12.1997920 | 0.0007734 | 15774.304921 | 0.0000000 |
| Reporting.Date0(2021) | 0.0013043 | 0.0010950 | 1.191147 | 0.2337433 |
| Reporting.Date0[Pre 2020] | 0.0034929 | 0.0007845 | 4.452530 | 0.0000090 |
| Expenses0 | 0.0000041 | 0.0000001 | 65.038147 | 0.0000000 |
| Revenue0 | -0.0000037 | 0.0000000 | -109.309141 | 0.0000000 |
The log transformation has shown small improvements in the residual plots, as the Q-Q plot appears to fit a slightly more linear pattern. Now we will compare the three candidate models to see which model best demonstrates the association between Net.Income and the relevant predictors in the model.
The above residual diagnostic plots are similar to that of the previous model. The Q-Q plots of all three models are similar to each other, this means that the assumption of normal residuals is not satisfied for all three models.
In order to compare the quality of the candidate models, several goodness-of-fit measures are extracted and summarized in a table.
| SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS | |
|---|---|---|---|---|---|---|---|
| full.model | 1.876738e+09 | 0.9708988 | 0.9708382 | 5 | 26555.980 | 26583.794 | 2.048509e+09 |
| log.model | 9.178160e-02 | 0.9637510 | 0.9636755 | 5 | -19145.722 | -19117.908 | 1.164052e-01 |
| sq.rt.model | 2.564409e+01 | 0.9686373 | 0.9685719 | 5 | -8302.859 | -8275.046 | 3.054878e+01 |
Based upon the output from the goodness-of-fit comparison, it appears that the original model is the best of the three models. This is due to the first model having the best goodness of fit and interpretation for the practical question regarding this data set.
The inferential statistics of the final working model are summarized in the following table.
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 1.992040e+05 | 110.5926106 | 1801.241717 | 0.0000000 |
| Reporting.Date0(2021) | -8.485917e+01 | 156.5742374 | -0.541974 | 0.5878993 |
| Reporting.Date0[Pre 2020] | 1.961822e+02 | 112.1777629 | 1.748851 | 0.0804765 |
| Expenses0 | 4.941062e-01 | 0.0089309 | 55.325747 | 0.0000000 |
| Revenue0 | -5.105828e-01 | 0.0048017 | -106.333186 | 0.0000000 |
Since the sample size (1925) is large, the argument for validating p-values is the Central Limit Theorem (CLT). All p-values are significant at alpha = 0.10 except for the variable Reporting.Date0(2021). Because this variable has an insignificant p-value, we will not consider it in the interpretation of the final model.
We can explicitly write the final model in the following
\[ \NetIncome = 199204 + 196.1822\times Reporting.Date0[Pre 2020] +0.4941062\times Expenses0 -0.5105828\times Revenue \] Note that the variables Net.Income0, Revenue0, and Expenses0 are all scaled in thousands of dollars. Therefore the regression coefficients for those variables are to be interpreted in thousands of dollars. Fro m the model coefficients, we can see that for reporting dates before 2020, Net Income increases by 196.1822 thousands of dollars. The other coefficients can be interpreted similarly. Revenue is negatively related to Net Income in this model, and Expenses and Reporting Date Pre 2020 are positively related with Net Income.
Net Income is considered as the maximum value of Net Income minus given Net Income, it is important to consider this scaling for practical interpretation.
We use the regression technique Box-Cox transformation for response variable to search for the final model in the case study. There are four variables in the data set and three are significant. The non significant variable was omitted from the final model.
All candidate models have the same number of variables. We use commonly-used global goodness-of-fit measures as model selection criteria.
The interpretation of the regression coefficients is not straightforward in a practical sense. It is of note though that the variables are scaled for thousands of dollars and the Net.Income variable is subtracted from the max Net.Income. This was done to ensure the validity of the box cox transformation, as the response variable must be positive for this transformation.
The violation of the normal assumption of the residuals remains uncorrected. The inference on the regression coefficients is based on the central limit theorem.
Overall, the attempted transformations yielded a model but this model still does not satisfy the normality assumption needed for multiple linear regression. Because of this, a bootstrap sampling technique can be deployed in order to generate a potentially better model.