Description: Define the study, define \(Y_i\) and \(X_{ij}\), state the regression equation, model assumptions, sample size, and unknown parameters of interest. Articulate the study goals.
A study on insurance redlining is considered. To investigate charges by several Chicago community organizations that insurance companies were refusing to issue insurance to racial minorities, the U.S. Commission on Civil Rights gathered information on the number of FAIR plan policies written and renewed in Chicago (per 100 housing units, \(Y\)) by zip code for the months of December 1977 through May 1978. FAIR plans were offered by the city of Chicago as a default policy to homeowners who had been rejected by the voluntary market. Information on other variables that might also affect insurance writing was recorded. The variables are: race, the percentage of the population that is minority; fire, fires per 100 housing units; theft, thefts per 1000 population; age, percentage of housing units built before 1939; income, median family income in thousands of dollars; and side, North or South Side of Chicago. In the R output, the response \(Y\) appears as involact.
## race fire theft age involact income side
## 60626 10.0 6.2 29 60.4 0.0 11.744 n
## 60640 22.2 9.5 44 76.5 0.1 9.323 n
## 60613 19.6 10.5 36 73.5 1.2 9.948 n
## 60657 17.3 7.7 37 66.9 0.5 10.656 n
## 60614 24.5 8.6 53 81.4 0.7 9.730 n
## 60610 54.0 34.1 68 52.6 0.3 8.231 n
The purpose of the study is to investigate the relationship between racial composition and insurance refusal in Chicago between December 1977 and May 1978 while controlling for other potential sources of variation.
A multiple linear regression model is considered. Let
\(Y_i =\) the number of FAIR plan policies written and renewed (per 100 housing units) for the \(i^{th}\) zip code,
\(X_{i1} =\) racial composition in percentage of minority for the \(i^{th}\) zip code,
\(X_{i2} =\) fires (per 100 housing units) for the \(i^{th}\) zip code,
\(X_{i3} =\) thefts (per 1000 population) for the \(i^{th}\) zip code,
\(X_{i4} =\) percentage of housing units built before 1939 for the \(i^{th}\) zip code,
\(X_{i5} =\) log median family income (in thousands of dollars) for the \(i^{th}\) zip code.
Based on automatic variable selection methods in combination with criterion-based statistics, income was dropped from the model. Partial residual plots, residual-versus-fitted plots, and measures of influence were investigated and no issues with high influence points, linearity, constant variance, independence, or normality were identified. Details are included in the Appendix.
The final model is given by
\[Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3} + \beta_4X_{i4} + \varepsilon_i\]
where \(\varepsilon_i \overset{iid}{\sim} N(0,\sigma^2)\), \(i = 1, 2, \ldots, 47\), and \(\beta_0, \beta_1, \ldots, \beta_4\) and \(\sigma^2\) are the unknown model parameters.
Description: Should follow the goals listed in Section I. For each goal, write out the hypotheses being tested (if applicable) and the specific approach taken (e.g., \(F\) test or \(t\) test, 95% confidence interval, Bonferroni adjustment, value of \(\alpha\), etc.).
The fitted model is displayed below. Holding fire, theft, and age constant, the rate of FAIR policies issued and renewed per 100 housing units increases, on average, by 0.008 (95% CI 0.004, 0.012) for every one percentage point increase in the minority share of the zip code's population. After adjusting for the other predictors in the model, race explains 30.54% of the remaining variation in the number of FAIR plan policies per 100 housing units issued and renewed (its coefficient of partial determination), as compared to 33.78% for fire rates, 23.24% for rates of theft, and 17.59% for age.
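For the primary goal, the hypotheses implied by this analysis can be written out as follows. The test statistic uses the estimate and standard error reported for race in the fitted-model output; the two-sided form of the alternative is an assumption, chosen because it matches the p-value that R reports.

\[H_0: \beta_1 = 0 \quad \text{versus} \quad H_a: \beta_1 \neq 0, \qquad t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{0.008104}{0.001886} \approx 4.30 \text{ on } 42 \text{ degrees of freedom } (p = 0.0001)\]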
##
## Call:
## lm(formula = involact ~ race + fire + theft + age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.87108 -0.14830 -0.01961 0.19968 0.81638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.243118 0.145054 -1.676 0.101158
## race 0.008104 0.001886 4.297 0.000100 ***
## fire 0.036646 0.007916 4.629 3.51e-05 ***
## theft -0.009592 0.002690 -3.566 0.000921 ***
## age 0.007210 0.002408 2.994 0.004595 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3335 on 42 degrees of freedom
## Multiple R-squared: 0.7472, Adjusted R-squared: 0.7231
## F-statistic: 31.03 on 4 and 42 DF, p-value: 4.799e-12
## Analysis of Variance Table
##
## Response: involact
## Df Sum Sq Mean Sq F value Pr(>F)
## race 1 9.4143 9.4143 84.6358 1.274e-11 ***
## fire 1 2.2326 2.2326 20.0716 5.645e-05 ***
## theft 1 1.1635 1.1635 10.4598 0.002379 **
## age 1 0.9973 0.9973 8.9662 0.004595 **
## Residuals 42 4.6718 0.1112
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2.5 % 97.5 %
## (Intercept) -0.535848021 0.049612429
## race 0.004297666 0.011909498
## fire 0.020670472 0.052622404
## theft -0.015021345 -0.004163649
## age 0.002350769 0.012069310
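As a check on the quantities quoted for the fitted model, the following R sketch recomputes the 95% confidence limits and the percentage of variation attributed to each predictor from the rounded numbers printed in the coefficient table. It assumes the quoted percentages are coefficients of partial determination, \(t^2/(t^2 + df)\); the object names are illustrative and are not taken from the original analysis code.

```r
# Estimates and standard errors copied (rounded) from the coefficient table
est      <- c(race = 0.008104, fire = 0.036646, theft = -0.009592, age = 0.007210)
se       <- c(race = 0.001886, fire = 0.007916, theft = 0.002690, age = 0.002408)
df_resid <- 42  # residual degrees of freedom reported by summary()

# 95% confidence limits: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))

# Partial R^2 of each predictor given the other three (assumed definition)
t_stat     <- est / se
partial_r2 <- t_stat^2 / (t_stat^2 + df_resid)
print(round(100 * partial_r2, 1))
```

Because the inputs are rounded, the results agree with the printed confidence limits and percentages only to about three significant digits.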
Description: Restate major findings and provide proper interpretation within the context of the study design. What are the implications of the findings?
There appears to be a positive relationship between the rate of FAIR plan policies issued and the percentage of minority residents in a zip code, after adjusting for fire, theft, and age of housing. The main limitation of this analysis is that the data are aggregated at the zip code level rather than recorded at the family or person level, so the analysis cannot directly investigate whether individual minority homeowners are denied insurance. This type of ecological analysis requires the assumption that the chance a minority homeowner obtains a FAIR plan, after adjusting for the effect of the other covariates, is constant across all zip codes. This assumption is not verifiable and may be violated, in which case drawing individual-level conclusions from the zip-code-level association would be incorrect (an ecological fallacy).
Description: Include details of the model building process including: transformations, outliers, variable selection, assumption checking.
The purpose of this section is to examine the distribution of predictors, identify any unusually large or small values, and examine bivariate associations to identify multicollinearity. Unusual values should be flagged as they may influence the fit of the model. Bivariate associations between predictors could cause issues if the purpose of the model is estimation.
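A scatterplot matrix indicates approximately linear associations among the variables: race, fire, theft, age, and involact are positively associated with one another, while income is negatively associated with each of them.
The Pearson correlation coefficients for all pairwise associations are shown in Table 1. Income, which enters the candidate models as log(income), is strongly negatively associated with the covariate of interest, race (\(r = -0.70\)).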
## race fire theft age involact income
## race 1.0000000 0.5927956 0.2550647 0.2505118 0.7137540 -0.7037328
## fire 0.5927956 1.0000000 0.5562105 0.4122225 0.7030397 -0.6104481
## theft 0.2550647 0.5562105 1.0000000 0.3176308 0.1496309 -0.1729226
## age 0.2505118 0.4122225 0.3176308 1.0000000 0.4757291 -0.5286695
## involact 0.7137540 0.7030397 0.1496309 0.4757291 1.0000000 -0.6648471
## income -0.7037328 -0.6104481 -0.1729226 -0.5286695 -0.6648471 1.0000000
Strip plots (jittered) for all predictors and the dependent variable are shown next to boxplots of the same data. First, note that a log transformation of income was taken. Income, as expected, is positively skewed, with most observations clustered together and a few observations at much higher income levels. Income is usually treated as acting on a multiplicative scale, so for that reason (and because of its skewness) a natural-log transformation is appropriate.
Other features of note: there is a wide range of values for race, the covariate of interest in this problem. There is also skewness visible in the distributions of theft and fire, with observations clustered close to zero and a few data points with large values. We may need to apply transformations if the model diagnostics and assumption checks indicate it.
## race fire theft age log(income)
## 2.705490 2.733322 1.621824 1.577322 4.050811
## race fire theft age
## 1.561957 2.242628 1.487029 1.221917
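The two unlabeled vectors above look like variance inflation factors (VIFs) for the full and reduced models; that reading is an assumption, since the generating code is not shown. A minimal sketch of what a VIF measures, using simulated predictors (not the redlining data) and base R only:

```r
set.seed(1)
n  <- 47
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.5)  # deliberately correlated with x1
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the others
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: diagonal of the inverse of the predictor correlation matrix
vif_from_cor <- diag(solve(cor(X)))
print(round(rbind(vif_by_hand, vif_from_cor), 3))

# Reading a reported VIF backwards: share of a predictor's variance explained by the others
reported_vif <- c(race = 2.705490, "log(income)" = 4.050811)
print(round(1 - 1 / reported_vif, 3))
```

Under this reading, the other predictors account for roughly 63% of the variation in race and 75% of the variation in log(income) in the full model, which is consistent with the drop in the VIF for race once log(income) is removed.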
## Subset selection object
## Call: regsubsets.formula(involact ~ race + fire + theft + age + log(income),
## force.in = 1, data = chredlin, method = "seqrep")
## 5 Variables (and intercept)
## Forced in Forced out
## race TRUE FALSE
## fire FALSE FALSE
## theft FALSE FALSE
## age FALSE FALSE
## log(income) FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: 'sequential replacement'
## race fire theft age log(income)
## 2 ( 1 ) "*" "*" " " " " " "
## 3 ( 1 ) "*" "*" "*" " " " "
## 4 ( 1 ) "*" "*" "*" "*" " "
## 5 ( 1 ) "*" "*" "*" "*" "*"
The summary output includes a matrix indicating which predictors are included in each of the 4 candidate models. In the first model (first row of the matrix, indicated by a ‘2’ for the number of predictors) with two predictors, only race and fire are included. In the second model (row 2) with three (indicated by a ‘3’) predictors, race, fire, and theft are included. The third and fourth models are in the last two rows of the matrix.
Several criteria for selecting the best model are produced, including \(R^2_{adj}\) (larger values are better), the Bayesian Information Criterion \(BIC\) (smaller values are better), and Mallows' \(C_p\) statistic (values of \(C_p\) close to \(p\), the number of beta coefficients, are better). Other criteria not produced by the regsubsets function are \(AIC\) and \(PRESS\). We will calculate these statistics for the two potential final models based on the results of automatic variable selection. Here, all statistics indicate that the best model is one in which log(income) is removed: \(R^2_{adj} =\) 0.723, \(BIC =\) -45.38, \(C_p =\) 4.747, \(AIC =\) -98.504, and \(PRESS =\) 6.067. The second best is the full model.
## [1] 0.6134546 0.6718179 0.7231142 0.7214344
## [1] -35.21257 -40.13592 -45.37999 -42.37815
## [1] 20.055622 11.658909 4.746733 6.000000
## [1] 6.00000 -97.35267
## [1] 5.00000 -98.50436
## [1] 6.26613
## [1] 6.067029
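Several of the criteria reported for the reduced model can be reproduced from the sums of squares in its ANOVA table. The sketch below is a check under stated assumptions rather than the original code: the AIC and BIC formulas are the versions (each defined up to an additive constant) that happen to reproduce the reported values, and all object names are illustrative.

```r
# Sums of squares copied from the ANOVA table for involact ~ race + fire + theft + age
ss_terms <- c(race = 9.4143, fire = 2.2326, theft = 1.1635, age = 0.9973)
rss <- 4.6718   # residual sum of squares
n   <- 47       # number of zip codes
p   <- 5        # beta coefficients, including the intercept
tss <- sum(ss_terms) + rss

r2     <- 1 - rss / tss
adj_r2 <- 1 - (rss / (n - p)) / (tss / (n - 1))
aic    <- n * log(rss / n) + 2 * p       # assumed form: AIC up to an additive constant
bic    <- n * log(1 - r2) + p * log(n)   # assumed form: BIC relative to the intercept-only model

print(round(c(R2 = r2, adjR2 = adj_r2, AIC = aic, BIC = bic), 3))
```

The results agree with the reported \(R^2_{adj}\), \(AIC\), and \(BIC\) for the model without log(income) up to rounding of the inputs.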
Model validation can help us select the model that has the best predictive performance in a hold-out sample. There are several approaches to model validation, two of which are shown here.
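Leave-one-out cross validation is useful for smaller datasets where splitting the data into separate training and testing sets is not feasible. This method involves fitting the model \(n\) times, each time leaving out one observation, predicting the left-out observation from the model fit to the remaining \(n-1\) observations, and averaging the squared prediction errors to obtain the mean squared prediction error (MSPE).
The MSPE is smaller for the model without income (RMSE of 0.359, versus 0.365 for the full model).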
## Linear Regression
##
## 47 samples
## 4 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.359285 0.674688 0.2646019
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.3651327 0.6659652 0.2745808
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
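For a linear model, leave-one-out cross validation does not require refitting: the leave-one-out errors equal the ordinary residuals divided by one minus the leverage, so the leave-one-out RMSE is \(\sqrt{PRESS/n}\). The sketch below demonstrates this on simulated data (not the redlining data) with base R only, then applies the identity to the PRESS values reported earlier; object names are illustrative.

```r
set.seed(42)
n   <- 47
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 0.5 * dat$x1 - 0.3 * dat$x2 + rnorm(n, sd = 0.4)

# Explicit leave-one-out loop: refit without row i, then predict row i
loo_err <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x1 + x2, data = dat[-i, ])
  dat$y[i] - predict(fit_i, newdata = dat[i, ])
})
rmse_loop <- sqrt(mean(loo_err^2))

# Shortcut from a single fit: PRESS residual = residual / (1 - leverage)
fit        <- lm(y ~ x1 + x2, data = dat)
press      <- sum((resid(fit) / (1 - hatvalues(fit)))^2)
rmse_press <- sqrt(press / n)
print(c(loop = rmse_loop, press = rmse_press))

# Same identity applied to the PRESS values reported for the two candidate models
rmse_reported <- sqrt(c(reduced = 6.067029, full = 6.26613) / 47)
print(round(rmse_reported, 4))
```

Applied to the reported PRESS statistics, the identity returns the same RMSE values as the two resampling summaries above, so PRESS and leave-one-out cross validation rank the two models identically.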
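K-fold cross validation is useful for larger datasets where separate training and testing data are feasible. This method involves randomly dividing the data into \(k\) folds of roughly equal size, fitting the model on \(k-1\) folds, predicting the held-out fold, repeating so that each fold is held out exactly once, and averaging the prediction error across folds.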
Since \(k = 5\) or \(k=10\) is usually preferred, this approach is not feasible for this dataset. However, it can be implemented using the code below:
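A minimal sketch of k-fold cross validation in base R follows. It uses simulated data of the same size as the redlining data rather than the data themselves, pools squared errors over folds, and uses illustrative object names; it is not the original implementation.

```r
set.seed(2024)
n <- 47   # same number of rows as the redlining data; values are simulated
k <- 5
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 0.5 * dat$x1 - 0.3 * dat$x2 + rnorm(n, sd = 0.4)

# Randomly assign each row to one of k folds of nearly equal size
fold <- sample(rep(seq_len(k), length.out = n))

sq_err <- numeric(n)
for (j in seq_len(k)) {
  held_out <- fold == j
  fit_j    <- lm(y ~ x1 + x2, data = dat[!held_out, ])      # train on the other k - 1 folds
  pred_j   <- predict(fit_j, newdata = dat[held_out, ])     # predict the held-out fold
  sq_err[held_out] <- (dat$y[held_out] - pred_j)^2
}

rmse_kfold <- sqrt(mean(sq_err))   # pooled over all held-out predictions
print(table(fold))
print(rmse_kfold)
```

With 47 rows and \(k = 5\), each model is trained on only 37 or 38 observations and evaluated on 9 or 10, and the estimate depends on the random fold assignment.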
If a quick check of assumptions and outliers shows no issues, the reduced model is the final model.
It’s a good idea to also check for possible interactions (though we wouldn’t hypothesize any for this analysis); adding a race-by-theft interaction to the reduced model does not improve the fit (\(F = 1.76\), \(p = 0.19\)). The residual-versus-fitted plot looks like noise, with the exception of the diagonal streak in the plot near \(\hat{Y}=0\). This feature results from the large number of 0 response values in the data. This plot supports linearity and constant variance of the residuals.
## Analysis of Variance Table
##
## Response: involact
## Df Sum Sq Mean Sq F value Pr(>F)
## race 1 9.4143 9.4143 86.1716 1.244e-11 ***
## fire 1 2.2326 2.2326 20.4358 5.164e-05 ***
## theft 1 1.1635 1.1635 10.6496 0.002224 **
## age 1 0.9973 0.9973 9.1289 0.004321 **
## race:theft 1 0.1925 0.1925 1.7621 0.191704
## Residuals 41 4.4793 0.1093
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
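The \(F\) statistic for the race:theft row is a partial \(F\) test comparing the reduced model with the same model plus the interaction. A short check using only the residual sums of squares and degrees of freedom printed in the two ANOVA tables (object names are illustrative):

```r
rss_reduced <- 4.6718; df_reduced <- 42   # involact ~ race + fire + theft + age
rss_full    <- 4.4793; df_full    <- 41   # reduced model plus race:theft

# Partial F: drop in RSS per added parameter, relative to the larger model's MSE
f_stat <- ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)
p_val  <- pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)
print(round(c(F = f_stat, p = p_val), 4))
```

The result matches the race:theft row up to rounding, so the interaction is not supported at conventional significance levels.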
Look for outliers in \(X\) and in \(Y\), and also investigate whether there are any influential points.
## integer(0)
## named integer(0)
## 23 24
## 23 24
## race fire theft age involact income side
## 60612 86.2 36.2 41 63.1 1.8 6.565 n
## 60607 50.2 39.7 147 83.0 0.9 7.459 n
## 35
## 35
## integer(0)
## 6
## 6
## named integer(0)
## race fire theft age involact income side
## 60610 54 34.1 68 52.6 0.3 8.231 n
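The exact rules that produced the flags above are not shown. As a hedged illustration, the sketch below applies commonly used rules of thumb (leverage above \(2p/n\), Bonferroni-adjusted studentized residuals, Cook's distance above \(4/n\)) to simulated data with one planted unusual point; the cutoffs and object names are assumptions, not the original settings.

```r
set.seed(7)
n <- 47
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3)
x[n] <- 8; y[n] <- -5   # planted point: extreme in x and far from the trend

fit <- lm(y ~ x)
p   <- length(coef(fit))

# Outliers in X: leverage (hat values) above twice the average leverage p / n
high_leverage <- which(hatvalues(fit) > 2 * p / n)

# Outliers in Y: externally studentized residuals against a Bonferroni cutoff
bonf_cut  <- qt(1 - 0.05 / (2 * n), df = n - p - 1)
outlier_y <- which(abs(rstudent(fit)) > bonf_cut)

# Influence: Cook's distance above the 4 / n rule of thumb
influential <- which(cooks.distance(fit) > 4 / n)

print(list(high_leverage = high_leverage, outlier_y = outlier_y, influential = influential))
```

A result of `integer(0)` from `which()` means no observation exceeded the corresponding cutoff.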
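Observations 6 and 35 stand out as potentially highly influential. Refitting the model without them does not change the conclusions: the race coefficient decreases from 0.0081 to 0.0057 but remains positive and statistically significant (\(p = 0.001\)), and the other coefficients keep their signs and remain significant at the 0.05 level, so the model can be considered robust to these data points. Even if the results had changed, we would not drop these points, as they are real observations; we would instead use a method that is more robust to outliers.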
##
## Call:
## lm(formula = involact ~ race + fire + theft + age, subset = -c(6,
## 35))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56617 -0.15849 -0.03352 0.14962 0.71529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.197797 0.118565 -1.668 0.103075
## race 0.005698 0.001628 3.501 0.001154 **
## fire 0.047393 0.006924 6.845 3.09e-08 ***
## theft -0.008848 0.002198 -4.026 0.000246 ***
## age 0.005330 0.002019 2.639 0.011791 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2717 on 40 degrees of freedom
## Multiple R-squared: 0.8135, Adjusted R-squared: 0.7949
## F-statistic: 43.62 on 4 and 40 DF, p-value: 4.475e-14
There are no apparent issues with non-constant variance.
A Q-Q plot supports approximate normality.