STA 321 HW 3 Attempt 2

Case Study:

This is a case study to exercise model building techniques using multiple linear regression.

Data Description

This data set was found on data.gov (https://catalog.data.gov/dataset/summary-financial-data-for-credit-unions) through the Society of Actuaries data analytics resources. This data set provides annual summary financial data for Iowa state-chartered credit unions starting with year ending 12/31/2005. The data set was uploaded to Github for universal access (https://raw.githubusercontent.com/JackRoss10089/STA-321/main/Summary_Financial_Data_For_Credit_Unions.csv). The variables in the data set are as follows:

Name: Character Variable for Institution Name
Insurance.Number: Integer Variable for Insurance ID
Reporting.Date(x1): Character Variable for Reporting Date
Assets: Numeric Variable for Total Assets
Shares: Numeric Variable for Total Shares
Liabilities: Numeric Variable for Total Liabilities
Net.Worth: Integer Variable for Net Worth
Revenue(x2): Integer Variable for Revenue
Dividends: Integer Variable for Dividends
Expenses(x3): Integer Variable for Expenses
Net.Income(y): Integer Variable for Net Income

Practical Question

The central objective is to identify the relationship between Net Income and relevant explanatory variables in the data set.

Exploratory Data Analysis

To begin analysis, first it is necessary to evaluate the variables in the data set and choose which variables can be used to build the model.

After initial exploratory data analysis, Net.Income was selected as the response variable and Revenue, Expenses, and Reporting.Date have been selected as predictor variables. These variables were selected for their relevancy to the credit union’s net income generation. Net.Income has been re scaled as the maximum net income value minus the given net income value. This variable is also measured in thousands of dollars. Similarly, the variables revenue and expenses have also been transformed to thousands of dollars.

Fitting MLR to Data

Now it is time to begin the search process for the final model.

Statistics of Regression Coefficients
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	1.992040e+05	110.5926106	1801.241717	0.0000000
Reporting.Date0(2021)	-8.485917e+01	156.5742374	-0.541974	0.5878993
Reporting.Date0[Pre 2020]	1.961822e+02	112.1777629	1.748851	0.0804765
Expenses0	4.941062e-01	0.0089309	55.325747	0.0000000
Revenue0	-5.105828e-01	0.0048017	-106.333186	0.0000000

Next, perform residual diagnostics to assess model assumptions.

Based upon the residual plots, it is apparent that there are some violations of the model assumptions: * the variance of the residuals is not constant * the Q-Q plot indicates the distribution of residuals is off the normal distribution * he residual plot seems is clustered around the left side of the graph with a slight nonlinear pattern

#Boxcox Transformation Because the assumption of constant variance is violated, the Box-Cox procedure is adequate to search for a transformation of the response variable. Multiple different transformations will be performed to assess for the best transformation.

Square Root Transformation

square root-transformed model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	58.3762716	0.0129276	4515.6309757	0.0000000
Reporting.Date0(2021)	0.0116094	0.0183026	0.6343065	0.5259564
Reporting.Date0[Pre 2020]	0.0482209	0.0131129	3.6773618	0.0002421
Expenses0	0.0000672	0.0000010	64.3351387	0.0000000
Revenue0	-0.0000632	0.0000006	-112.5482351	0.0000000

The cube root transformation has shown small improvements in the residual plots, as the residuals vs. fitted plot appears to have a more constant variance. Also, the Q-Q plot has a slightly more linear pattern. Next, we will attempt a log transformation of the response variable net income.

Log Transformation

log-transformed model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	12.1997920	0.0007734	15774.304921	0.0000000
Reporting.Date0(2021)	0.0013043	0.0010950	1.191147	0.2337433
Reporting.Date0[Pre 2020]	0.0034929	0.0007845	4.452530	0.0000090
Expenses0	0.0000041	0.0000001	65.038147	0.0000000
Revenue0	-0.0000037	0.0000000	-109.309141	0.0000000

The log transformation has shown small improvements in the residual plots, as the Q-Q plot appears to fit a slightly more linear pattern. Now we will compare the three candidate models to see which model best demonstrates the association between Net.Income and the relevant predictors in the model.

Q-Q Plots of All Models

The above residual diagnostic plots are similar to that of the previous model. The Q-Q plots of all three models are similar to each other, this means that the assumption of normal residuals is not satisfied for all three models.

Goodness-of-fit Measures

In order to compare the quality of the candidate models, several goodness-of-fit measures are extracted and summarized in a table.

Goodness-of-fit Measures of Candidate Models
	SSE	R.sq	R.adj	Cp	AIC	SBC	PRESS
full.model	1.876738e+09	0.9708988	0.9708382	5	26555.980	26583.794	2.048509e+09
log.model	9.178160e-02	0.9637510	0.9636755	5	-19145.722	-19117.908	1.164052e-01
sq.rt.model	2.564409e+01	0.9686373	0.9685719	5	-8302.859	-8275.046	3.054878e+01

Based upon the output from the goodness-of-fit comparison, it appears that the original model is the best of the three models. This is due to the first model having the best goodness of fit and interpretation for the practical question regarding this data set.

Final Model

The inferential statistics of the final working model are summarized in the following table.

Inferential Statistics of Final Model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	1.992040e+05	110.5926106	1801.241717	0.0000000
Reporting.Date0(2021)	-8.485917e+01	156.5742374	-0.541974	0.5878993
Reporting.Date0[Pre 2020]	1.961822e+02	112.1777629	1.748851	0.0804765
Expenses0	4.941062e-01	0.0089309	55.325747	0.0000000
Revenue0	-5.105828e-01	0.0048017	-106.333186	0.0000000

Since the sample size (1925) is large, the argument for validating p-values is the Central Limit Theorem (CLT). All p-values are significant at alpha = 0.10 except for the variable Reporting.Date0(2021). Because this variable has an insignificant p-value, we will not consider it in the interpretation of the final model.

Summary of the model

We can explicitly write the final model in the following

\[ \NetIncome = 199204 + 196.1822\times Reporting.Date0[Pre 2020] +0.4941062\times Expenses0 -0.5105828\times Revenue \] Note that the variables Net.Income0, Revenue0, and Expenses0 are all scaled in thousands of dollars. Therefore the regression coefficients for those variables are to be interpreted in thousands of dollars. Fro m the model coefficients, we can see that for reporting dates before 2020, Net Income increases by 196.1822 thousands of dollars. The other coefficients can be interpreted similarly. Revenue is negatively related to Net Income in this model, and Expenses and Reporting Date Pre 2020 are positively related with Net Income.

Net Income is considered as the maximum value of Net Income minus given Net Income, it is important to consider this scaling for practical interpretation.

Discussions

We use the regression technique Box-Cox transformation for response variable to search for the final model in the case study. There are four variables in the data set and three are significant. The non significant variable was omitted from the final model.

All candidate models have the same number of variables. We use commonly-used global goodness-of-fit measures as model selection criteria.

The interpretation of the regression coefficients is not straightforward in a practical sense. It is of note though that the variables are scaled for thousands of dollars and the Net.Income variable is subtracted from the max Net.Income. This was done to ensure the validity of the box cox transformation, as the response variable must be positive for this transformation.

The violation of the normal assumption of the residuals remains uncorrected. The inference on the regression coefficients is based on the central limit theorem.

Overall, the attempted transformations yielded a model but this model still does not satisfy the normality assumption needed for multiple linear regression. Because of this, a bootstrap sampling technique can be deployed in order to generate a potentially better model.