Multiple Linear Regression: A Complete Example

I. Introduction

Description: Define the study, define \(Y_i\) and \(X_{ij}\), state the regression equation, model assumptions, sample size, and unknown parameters of interest. Articulate the study goals.

A. Study Design.

A study on insurance redlining is considered. To investigate charges by several Chicago community organizations that insurance companies were refusing to issue insurance to racial minorities, the U.S. Commission on Civil Rights gathered information on the number of FAIR plan policies written and renewed in Chicago (per 100 housing units, \(Y\)) by zip code for the months of December 1977 through May 1978. FAIR plans were offered by the city of Chicago as a default policy to homeowners who had been rejected by the voluntary market. Information on other variables that might also affect insurance writing was recorded. The variables are: race, the racial composition in percentage of minority; fire, fires per 100 housing units; theft, thefts per 1000 population; age, percentage of housing units built before 1939; involact, FAIR plan policies written and renewed per 100 housing units (the response \(Y\)); income, median family income in thousands of dollars; and side, North or South Side of Chicago. The first six zip codes are shown below.

##       race fire theft  age involact income side
## 60626 10.0  6.2    29 60.4      0.0 11.744    n
## 60640 22.2  9.5    44 76.5      0.1  9.323    n
## 60613 19.6 10.5    36 73.5      1.2  9.948    n
## 60657 17.3  7.7    37 66.9      0.5 10.656    n
## 60614 24.5  8.6    53 81.4      0.7  9.730    n
## 60610 54.0 34.1    68 52.6      0.3  8.231    n

B. Aims.

The purpose of the study is to investigate the relationship between racial composition and insurance refusal in Chicago between December 1977 and May 1978 while controlling for other potential sources of variation.

II. Methods

A. Preliminary Model.

A multiple linear regression model is considered. Let

\(Y_i =\) the number of FAIR plan policies written and renewed (per 100 housing units) for the \(i^{th}\) zip code,

\(X_{i1} =\) racial composition in percentage of minority for the \(i^{th}\) zip code,

\(X_{i2} =\) fires (per 100 housing units) for the \(i^{th}\) zip code,

\(X_{i3} =\) thefts (per 1000 population) for the \(i^{th}\) zip code,

\(X_{i4} =\) percentage of housing units built before 1939 for the \(i^{th}\) zip code,

\(X_{i5} =\) log median family income (in thousands of dollars) for the \(i^{th}\) zip code.

Based on automatic variable selection methods in combination with criterion-based statistics, income was dropped from the model. Added variable plots, residual-versus-fitted plots, and measures of influence were investigated and no issues with high influence points, linearity, constant variance, independence, or normality were identified. Details are included in the Appendix.

B. Final Model

The final model is given by

\[Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3} + \beta_4X_{i4} + \varepsilon_i\]

where \(\varepsilon_i \overset{iid}{\sim} N(0,\sigma^2)\), \(i = 1, 2, \ldots, 47\), and \(\beta_0, \beta_1, \ldots, \beta_4\) and \(\sigma^2\) are the unknown model parameters.

III. Results.

Description: Should follow the goals listed in Section I. For each goal, write out the hypotheses being tested (if applicable) and the specific approach taken (e.g., \(F\) test or \(t\) test, 95% confidence interval, Bonferroni adjustment, value of \(\alpha\), etc.).

The fitted model is displayed below. The rate of FAIR policies issued and renewed per 100 housing units increases, on average, by 0.008 (95% CI 0.004, 0.012) for every one-percentage-point increase in the minority share of the zip code, holding fire, theft, and age constant. Race explains 30.54% of the variation in the number of FAIR plan policies per 100 housing units that remains after accounting for the other three predictors (its coefficient of partial determination), as compared to 33.78% for fire rates, 23.24% for rates of theft, and 17.59% for age.

## 
## Call:
## lm(formula = involact ~ race + fire + theft + age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.87108 -0.14830 -0.01961  0.19968  0.81638 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.243118   0.145054  -1.676 0.101158    
## race         0.008104   0.001886   4.297 0.000100 ***
## fire         0.036646   0.007916   4.629 3.51e-05 ***
## theft       -0.009592   0.002690  -3.566 0.000921 ***
## age          0.007210   0.002408   2.994 0.004595 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3335 on 42 degrees of freedom
## Multiple R-squared:  0.7472, Adjusted R-squared:  0.7231 
## F-statistic: 31.03 on 4 and 42 DF,  p-value: 4.799e-12

## Analysis of Variance Table
## 
## Response: involact
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## race       1 9.4143  9.4143 84.6358 1.274e-11 ***
## fire       1 2.2326  2.2326 20.0716 5.645e-05 ***
## theft      1 1.1635  1.1635 10.4598  0.002379 ** 
## age        1 0.9973  0.9973  8.9662  0.004595 ** 
## Residuals 42 4.6718  0.1112                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##                    2.5 %       97.5 %
## (Intercept) -0.535848021  0.049612429
## race         0.004297666  0.011909498
## fire         0.020670472  0.052622404
## theft       -0.015021345 -0.004163649
## age          0.002350769  0.012069310
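The interval and the partial \(R^2\) values quoted in the text can be recovered from the coefficient table alone. The following is a minimal Python sketch (not the R code that produced the output above). It assumes the identity \(R^2_{partial} = t^2 / (t^2 + df_{resid})\) for a single coefficient and hard-codes the critical value \(t_{0.975,\,42} \approx 2.0181\) rather than computing it; the function and variable names are illustrative only.

```python
# Values copied from the coefficient table above: (estimate, std. error, t value).
coefs = {
    "race":  (0.008104, 0.001886, 4.297),
    "fire":  (0.036646, 0.007916, 4.629),
    "theft": (-0.009592, 0.002690, -3.566),
    "age":   (0.007210, 0.002408, 2.994),
}
DF_RESID = 42     # residual degrees of freedom reported in the summary
T_CRIT = 2.0181   # ASSUMED 0.975 quantile of t with 42 df (hard-coded, not computed)


def conf_int(estimate, std_error, t_crit=T_CRIT):
    """Two-sided 95% confidence interval: estimate +/- t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


def partial_r2(t_value, df_resid=DF_RESID):
    """Coefficient of partial determination for one predictor given the others."""
    return t_value ** 2 / (t_value ** 2 + df_resid)


intervals = {name: conf_int(est, se) for name, (est, se, _) in coefs.items()}
partials = {name: partial_r2(t) for name, (_, _, t) in coefs.items()}

for name in coefs:
    low, high = intervals[name]
    print(f"{name:5s}  95% CI ({low:.4f}, {high:.4f})  partial R^2 = {partials[name]:.4f}")
```

Because the inputs are the rounded values printed in the table, the results agree with the R output only to about four decimal places.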

IV. Discussion

Description: Restate major findings and provide proper interpretation within the context of the study design. What are the implications of the findings?

After adjusting for fire, theft, and housing age, there appears to be a positive relationship between FAIR plan policies issued and the percentage of minority residents in a zip code. The main limitation of this analysis is that the data are aggregated at the zip code level rather than recorded at the family or person level, so the analysis cannot directly investigate whether individual minority homeowners were denied insurance. This type of ecological analysis requires the assumption that the chance a minority homeowner obtains a FAIR plan, after adjusting for the effect of the other covariates, is constant across all zip codes. This assumption is not verifiable and may be violated, resulting in incorrect conclusions (an error known as the ecological fallacy).

V. Appendix

Description: Include details of the model building process including: transformations, outliers, variable selection, assumption checking.

A. Diagnostics for Predictors.

The purpose of this section is to examine the distribution of predictors, identify any unusually large or small values, and examine bivariate associations to identify multicollinearity. Unusual values should be flagged as they may influence the fit of the model. Bivariate associations between predictors could cause issues if the purpose of the model is estimation.

A scatterplot matrix indicates positive linear associations among all variables except income, which is negatively associated with each of the others.

The Pearson correlation coefficients for all pairwise associations are shown in Table 1. Log(income) is highly associated with the covariate of interest (race).

##                race       fire      theft        age   involact     income
## race      1.0000000  0.5927956  0.2550647  0.2505118  0.7137540 -0.7037328
## fire      0.5927956  1.0000000  0.5562105  0.4122225  0.7030397 -0.6104481
## theft     0.2550647  0.5562105  1.0000000  0.3176308  0.1496309 -0.1729226
## age       0.2505118  0.4122225  0.3176308  1.0000000  0.4757291 -0.5286695
## involact  0.7137540  0.7030397  0.1496309  0.4757291  1.0000000 -0.6648471
## income   -0.7037328 -0.6104481 -0.1729226 -0.5286695 -0.6648471  1.0000000

Strip plots for all predictors and the dependent variable (jittered) are shown next to boxplots of the same data. First, it should be acknowledged that a log transformation of income was taken. Income, as expected, is positively skewed, with most observations clustered together and a few observations at much higher income levels. Income is considered in most analyses to be on a multiplicative scale, so because of that (and its skewness) a natural-log transformation is appropriate.

Other features of note: there is a wide range of values for race, the covariate of interest in this problem. There is also skewness visible in the distributions of theft and fire, with observations clustered close to zero and a few data points with large values. We may need to apply transformations if the model diagnostics and assumption checks indicate it.

B. Screening of Predictors

Added variable plots for each of the covariates are shown. Added variable plots (also known as partial regression plots or adjusted variable plots) provide evidence of the importance of a covariate given the other covariates already in the model. They also display the nature of the relationship between the covariate and the outcome (i.e., linear, curvilinear, transformation necessary, etc.) and any problematic data points with respect to the predictor. The plots all indicate no need for transformations because linear relationships are apparent. They also indicate that each variable provides some added value to a model that already includes all other covariates, because the slopes of the linear relationships all appear to be non-zero.

Since the purpose of the project is to examine the relationship between race and involact (racial minority percentage and rates of FAIR policies), the goal is estimating \(\beta_1\). Multicollinearity can create instability in estimation and so it should be avoided. We have already seen that log(income) is highly associated with race (\(r = -0.77\)) and several other covariates. Variance inflation factors (VIF) measure how much the variances of the estimated regression coefficients are inflated as compared to when the predictor variables are not linearly related. A maximum VIF in excess of 10 is a common rule of thumb for flagging multicollinearity problems. Based on the maximum VIF, 4.05, there do not appear to be any issues that need remediation. However, \(VIF_5\) (log(income)) is much larger than the others, which indicates income may be redundant. The VIFs are shown first for the model with log(income) and then for the model without it.

##        race        fire       theft         age log(income) 
##    2.705490    2.733322    1.621824    1.577322    4.050811

##     race     fire    theft      age 
## 1.561957 2.242628 1.487029 1.221917
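The VIFs for the model without log(income) can be cross-checked against Table 1, because the VIF of predictor \(j\) equals the \(j\)th diagonal element of the inverse of the predictors' correlation matrix. The following is a minimal Python sketch under that identity (not the R code used for the output above); the `invert` helper is a hypothetical hand-written Gauss-Jordan routine included only to keep the example self-contained.

```python
# Pairwise Pearson correlations among the four retained predictors,
# copied from Table 1 (order: race, fire, theft, age).
names = ["race", "fire", "theft", "age"]
corr = [
    [1.0000000, 0.5927956, 0.2550647, 0.2505118],
    [0.5927956, 1.0000000, 0.5562105, 0.4122225],
    [0.2550647, 0.5562105, 1.0000000, 0.3176308],
    [0.2505118, 0.4122225, 0.3176308, 1.0000000],
]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment with the identity matrix: [A | I].
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


inverse = invert(corr)
vif = {name: inverse[i][i] for i, name in enumerate(names)}
for name, value in vif.items():
    print(f"{name:5s} VIF = {value:.3f}")
```

All four values fall well below the rule-of-thumb threshold of 10, in line with the second row of output above.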

Automatic variable selection methods can be a useful starting point in eliminating redundant variables. They should only be used as a guide to the screening and removal (or addition) of predictors. Here, race is forced to stay in the model and all other covariates are allowed to add or drop:

## Subset selection object
## Call: regsubsets.formula(involact ~ race + fire + theft + age + log(income), 
##     force.in = 1, data = chredlin, method = "seqrep")
## 5 Variables  (and intercept)
##             Forced in Forced out
## race             TRUE      FALSE
## fire            FALSE      FALSE
## theft           FALSE      FALSE
## age             FALSE      FALSE
## log(income)     FALSE      FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: 'sequential replacement'
##          race fire theft age log(income)
## 2  ( 1 ) "*"  "*"  " "   " " " "        
## 3  ( 1 ) "*"  "*"  "*"   " " " "        
## 4  ( 1 ) "*"  "*"  "*"   "*" " "        
## 5  ( 1 ) "*"  "*"  "*"   "*" "*"

The summary output includes a matrix indicating which predictors are included in each of the 4 candidate models. In the first model (first row of the matrix, indicated by a ‘2’ for the number of predictors) with two predictors, only race and fire are included. In the second model (row 2) with three (indicated by a ‘3’) predictors, race, fire, and theft are included. The third and fourth models are in the last two rows of the matrix.

Several criteria for selecting the best model are produced, including \(R^2_{adj}\) (larger values are better), the Bayesian Information Criterion \(BIC\) (smaller values are better), and Mallows' \(C_p\) statistic (values of \(C_p\) close to \(p\), the number of beta coefficients, are better). Other criteria not produced by the regsubsets function are \(AIC\) and \(PRESS\) (smaller values are better for both). We will calculate these statistics for the two potential final models based on the results of automatic variable selection. Here, all statistics indicate that the best model is one in which log(income) is removed: \(R^2_{adj} =\) 0.723, \(BIC =\) -45.38, \(C_p =\) 4.747, \(AIC =\) -98.504, and \(PRESS =\) 6.067. The second best is the full model. The output below lists, in order, \(R^2_{adj}\), \(BIC\), and \(C_p\) for the models with two to five predictors, then the number of coefficients and \(AIC\) for the full and reduced models, then \(PRESS\) for the full and reduced models.

## [1] 0.6134546 0.6718179 0.7231142 0.7214344

## [1] -35.21257 -40.13592 -45.37999 -42.37815

## [1] 20.055622 11.658909  4.746733  6.000000

## [1]   6.00000 -97.35267

## [1]   5.00000 -98.50436

## [1] 6.26613

## [1] 6.067029
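Several of the criteria above can be recomputed by hand from the sequential sums of squares in the ANOVA table of Section III, since the selected models are nested in the order race, fire, theft, age. The following is a minimal Python sketch (not the R code behind the output). The \(BIC\) scale used here, \(n\ln(1-R^2) + p\ln n\), is an assumption inferred from the printed numbers rather than taken from the package documentation, and the \(AIC\) form \(n\ln(SSE/n) + 2p\) is likewise chosen because it reproduces the printed value.

```python
import math

n = 47  # number of zip codes
# Sequential sums of squares from the ANOVA table of the reduced model,
# in the order the predictors enter (race, fire, theft, age).
seq_ss = [9.4143, 2.2326, 1.1635, 0.9973]
sse_reduced = 4.6718                  # residual sum of squares with 4 predictors
ssto = sum(seq_ss) + sse_reduced      # total sum of squares


def adjusted_r2(sse, n_coef):
    """1 - [SSE / (n - p)] / [SSTO / (n - 1)], p = coefficients incl. intercept."""
    return 1.0 - (sse / (n - n_coef)) / (ssto / (n - 1))


def bic_relative(sse, n_coef):
    """ASSUMED scale: n * ln(1 - R^2) + p * ln(n)."""
    return n * math.log(sse / ssto) + n_coef * math.log(n)


def aic_least_squares(sse, n_coef):
    """ASSUMED form: n * ln(SSE / n) + 2p (AIC up to an additive constant)."""
    return n * math.log(sse / n) + 2 * n_coef


results = {}
for k in (2, 3, 4):                   # number of predictors in the nested model
    sse_k = ssto - sum(seq_ss[:k])    # SSE after the first k predictors enter
    p = k + 1                         # coefficients including the intercept
    results[k] = (adjusted_r2(sse_k, p), bic_relative(sse_k, p))
    print(k, round(results[k][0], 4), round(results[k][1], 2))

aic_reduced = aic_least_squares(sse_reduced, 5)
print("AIC (reduced model):", round(aic_reduced, 3))
```

The five-predictor model is omitted because its residual sum of squares is not printed in this report.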

C. Model Validation

Model validation can help us select the model that has the best predictive performance in a hold-out sample. There are several approaches to model validation, two of which are shown here.

Leave-one-out cross validation is useful for smaller datasets where setting aside separate training and testing data is not feasible. This method involves:

1. Leave out one data point and build the model using the remaining data.
2. Test the model against the data point removed in Step 1 and record the prediction error.
3. Repeat for all data points.
4. Compute the overall prediction error by averaging the squared prediction errors (the mean squared prediction error, MSPE).
5. If comparing models, the model with the lowest MSPE should be selected.
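The steps above can be sketched in a few lines. The following is a minimal Python illustration on hypothetical toy data with a single predictor (not the Chicago data and not the R code behind the output in this report). It also checks the standard shortcut that the deleted residual equals \(e_i / (1 - h_{ii})\), so that the leave-one-out MSPE equals \(PRESS / n\) from a single fit; all function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def loocv_mspe(xs, ys):
    """Leave each point out in turn, refit, and average the squared prediction errors."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        b0, b1 = fit_line(train_x, train_y)
        squared_errors.append((ys[i] - (b0 + b1 * xs[i])) ** 2)
    return sum(squared_errors) / len(squared_errors)


def press_shortcut_mspe(xs, ys):
    """Same quantity from one fit, using deleted residual = e_i / (1 - h_ii)."""
    n = len(xs)
    b0, b1 = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    press = 0.0
    for x, y in zip(xs, ys):
        leverage = 1.0 / n + (x - x_bar) ** 2 / sxx   # h_ii for simple regression
        press += ((y - (b0 + b1 * x)) / (1.0 - leverage)) ** 2
    return press / n


# Hypothetical toy data, for illustration only.
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
toy_y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8, 7.3, 7.9]

mspe_refit = loocv_mspe(toy_x, toy_y)
mspe_press = press_shortcut_mspe(toy_x, toy_y)
print(mspe_refit, mspe_press)
```

The same relationship is visible in this report's own numbers: \(PRESS = 6.067\) for the reduced model divided by \(n = 47\) gives about 0.129, the square of the leave-one-out RMSE of 0.359.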

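The MSPE is smaller for the model without income: the leave-one-out RMSE is 0.359 (MSPE \(\approx\) 0.129) for the reduced model, compared with 0.365 (MSPE \(\approx\) 0.133) for the model that includes log(income).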

## Linear Regression 
## 
## 47 samples
##  4 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE      
##   0.359285  0.674688  0.2646019
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

## Linear Regression 
## 
## 47 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.3651327  0.6659652  0.2745808
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

K-fold cross validation is useful for larger datasets where setting aside training and testing data is feasible. This method involves:

1. Randomly split the data into \(k\) subsets. Reserve one of the subsets for testing.
2. Build (train) the model on the remaining \(k-1\) subsets.
3. Test the model on the reserved subset and record the mean squared prediction error.
4. Repeat the process, changing the testing subset each time, until all \(k\) subsets have served as the testing set.
5. Calculate the average of the \(k\) mean squared prediction errors.
6. If comparing models, the model with the lowest MSPE should be chosen.
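The k-fold procedure can be sketched as follows. This is a minimal Python illustration on simulated data with one predictor and 47 observations (the same sample size as this study, but not the Chicago data and not the author's code); `kfold_mspe`, the seeds, and the simulated coefficients are all hypothetical choices made for the example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def kfold_mspe(xs, ys, k, seed=0):
    """Return (average of the k fold-level MSPEs, list of fold sizes)."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)          # random assignment to folds
    folds = [indices[i::k] for i in range(k)]     # k nearly equal subsets
    fold_mspes = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_x = [xs[i] for i in indices if i not in held_out]
        train_y = [ys[i] for i in indices if i not in held_out]
        b0, b1 = fit_line(train_x, train_y)       # train on the other k-1 folds
        errors = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in test_idx]
        fold_mspes.append(sum(errors) / len(errors))
    return sum(fold_mspes) / k, [len(fold) for fold in folds]


# Simulated data for illustration only: 47 points with a weak linear trend.
rng = random.Random(1)
toy_x = [rng.uniform(0, 100) for _ in range(47)]
toy_y = [0.1 + 0.01 * x + rng.gauss(0, 0.3) for x in toy_x]

mspe_5fold, fold_sizes = kfold_mspe(toy_x, toy_y, k=5)
print(fold_sizes, mspe_5fold)
```

With 47 observations and \(k = 5\), each test fold holds only 9 or 10 points, which is why the fold-level error estimates are noisy at this sample size.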

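Since \(k = 5\) or \(k = 10\) is usually preferred, each test fold here would contain only about 5 to 10 of the 47 zip codes, so this approach is not well suited to this dataset. It can nevertheless be implemented in the same way as the leave-one-out validation above.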

If a quick check of assumptions and outliers shows no issues, the reduced model is the final model.

D. Residual Diagnostics

1. Model Completeness

It’s a good idea to also check for possible interactions (though we wouldn’t hypothesize any for this analysis). Adding a race-by-theft interaction to the final model does not significantly improve the fit (\(F = 1.76\) on 1 and 41 degrees of freedom, \(p = 0.19\); see the table below). The residual-versus-fitted plot looks like noise, with the exception of the diagonal streak in the plot near \(\hat{Y}=0\). This feature results from the large number of 0 response values in the data. This plot supports linearity and constant variance of the residuals.

## Analysis of Variance Table
## 
## Response: involact
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## race        1 9.4143  9.4143 86.1716 1.244e-11 ***
## fire        1 2.2326  2.2326 20.4358 5.164e-05 ***
## theft       1 1.1635  1.1635 10.6496  0.002224 ** 
## age         1 0.9973  0.9973  9.1289  0.004321 ** 
## race:theft  1 0.1925  0.1925  1.7621  0.191704    
## Residuals  41 4.4793  0.1093                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
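The \(F\) statistic for the interaction term is a partial (extra sum of squares) \(F\) test comparing the final model with the model that adds race:theft. The following is a minimal Python sketch of that arithmetic using the residual sums of squares printed in the two ANOVA tables; the function name is illustrative, and the p-value is not computed because the Python standard library has no \(F\) distribution.

```python
# Residual sums of squares and degrees of freedom copied from the ANOVA tables.
sse_reduced, df_reduced = 4.6718, 42   # final model (no interaction)
sse_full, df_full = 4.4793, 41         # final model plus race:theft


def partial_f(sse_r, df_r, sse_f, df_f):
    """F* = [(SSE_R - SSE_F) / (df_R - df_F)] / [SSE_F / df_F]."""
    extra_ss = sse_r - sse_f           # extra sum of squares due to the added term
    return (extra_ss / (df_r - df_f)) / (sse_f / df_f)


f_stat = partial_f(sse_reduced, df_reduced, sse_full, df_full)
print(round(f_stat, 3))
```

The result agrees with the `race:theft` row of the table above up to rounding of the printed sums of squares.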

2. Outliers

Look for outliers in \(X\) and in \(Y\), and also investigate whether there are any influential points.

## integer(0)

## named integer(0)

## 23 24 
## 23 24

##       race fire theft  age involact income side
## 60612 86.2 36.2    41 63.1      1.8  6.565    n
## 60607 50.2 39.7   147 83.0      0.9  7.459    n

## 35 
## 35

## integer(0)

## 6 
## 6

## named integer(0)

##       race fire theft  age involact income side
## 60610   54 34.1    68 52.6      0.3  8.231    n

Observations 6 and 35 stick out as potentially high influence. A fit of the model without them leads to the same conclusions: every coefficient keeps its sign and remains statistically significant, although the race estimate falls from 0.0081 to 0.0057. The model can therefore be considered reasonably robust to these data points. Even if the model results changed, we would not drop these points as they are real. We would find a method more robust to outliers.

## 
## Call:
## lm(formula = involact ~ race + fire + theft + age, subset = -c(6, 
##     35))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56617 -0.15849 -0.03352  0.14962  0.71529 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.197797   0.118565  -1.668 0.103075    
## race         0.005698   0.001628   3.501 0.001154 ** 
## fire         0.047393   0.006924   6.845 3.09e-08 ***
## theft       -0.008848   0.002198  -4.026 0.000246 ***
## age          0.005330   0.002019   2.639 0.011791 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2717 on 40 degrees of freedom
## Multiple R-squared:  0.8135, Adjusted R-squared:  0.7949 
## F-statistic: 43.62 on 4 and 40 DF,  p-value: 4.475e-14
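One simple way to summarize this sensitivity check is the relative change in each coefficient between the two fits. The following is a minimal Python sketch using the estimates printed in the two summaries; the dictionary and function names are illustrative, and the choice of percent change as the summary is an assumption rather than part of the original analysis.

```python
# Coefficient estimates copied from the two model summaries in this report.
full_fit = {"race": 0.008104, "fire": 0.036646, "theft": -0.009592, "age": 0.007210}
without_6_35 = {"race": 0.005698, "fire": 0.047393, "theft": -0.008848, "age": 0.005330}


def percent_change(before, after):
    """Percent change of each coefficient relative to its original estimate."""
    return {name: 100.0 * (after[name] - before[name]) / before[name]
            for name in before}


pct = percent_change(full_fit, without_6_35)
# True when no coefficient flips sign after removing the two observations.
same_sign = all(full_fit[name] * without_6_35[name] > 0 for name in full_fit)

for name, change in pct.items():
    print(f"{name:5s} {change:+.1f}%")
print("all signs unchanged:", same_sign)
```

The estimates shift noticeably in size, but the direction of every effect is preserved, which is the sense in which the conclusions are stable.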

3. Constant Variance

There are no apparent issues with non-constant variance.

4. Normality

A Q-Q plot supports approximate normality.