GSMPR 622: Redlining
Your role is as an investigator for the UCCR attempting to use these data to compile evidence of redlining. The core piece of evidence will come down to whether or not race is related to demand for new FAIR plan policies and renewals per 100 housing units by zip code because there is no logical reason that race would be related were it not for redlining. side01 is the same variable as side only stored as numeric.
Insurance Discrimination
Insurance redlining refers to the practice of refusing to issue insurance to certain types of people or within some geographic area. The name comes from the act of drawing a red line around an area on a map. Now few would quibble with an insurance company refusing to sell auto insurance to a frequent drunk driver, but other forms of discrimination would be unacceptable.
In the late 1970s, the U.S. Commission on Civil Rights examined charges by several Chicago community organizations that insurance companies were redlining their neighborhoods. Because comprehensive information about individuals being refused homeowners insurance was not available, the number of FAIR plan policies written and renewed in Chicago as a default policy to homeowners who had been rejected by the voluntary market. Information on other variables that might affect insurance writing such as fire and theft rates were also collected at the zip code level. The variables are:
race: racial composition in percentage of minority
fire: fire per 100 housing units
theft: theft per 1000 population
age: percentage of housing units built before 1939
involact new fair plan policies and renewals per 100 housing units
income median family income in thousands of dollars
side North or South side of Chicago
Some Preliminary Questions:
Basic Data Familiarity
Redlining %>%
arrange(desc(involact)) %>%
select(zip:side01) %>%
dtab(dec = 2, nr = 100) %>% render()
result <- explore(
Redlining,
vars = c("race", "fire", "theft", "age", "involact", "income", "side01"),
fun = c("n_obs", "mean", "min", "max", "sum", "sd", "p25", "p75"),
nr = Inf
)
summary(result, dec = 2)
Explore
Data : Redlining
Functions : n_obs, mean, min, max, sum, sd, p25, p75
Top : Function
variable n_obs mean min max sum sd p25 p75
race 47 34.99 1.00 99.70 1,644.30 32.59 3.75 57.65
fire 47 12.28 2.00 39.70 577.10 9.30 5.65 16.05
theft 47 32.36 3.00 147.00 1,521.00 22.29 22.00 38.00
age 47 60.33 2.00 90.10 2,835.40 22.57 48.60 77.30
involact 47 0.61 0.00 2.20 28.90 0.63 0.00 0.90
income 47 10.70 5.58 21.48 502.70 2.75 8.45 11.99
side01 47 0.47 0.00 1.00 22.00 0.50 0.00 1.00
# dtab(result, dec = 2) %>% render()
visualize(
Redlining,
xvar = c("race", "fire", "theft", "age", "involact", "income"),
yvar = "zip",
type = "line",
facet_row = "side",
color = "side",
theme = "theme_minimal",
base_size = 12,
base_family = "Courier",
custom = FALSE
)
The zip code with most renewals is 60621
The zip code with the highest median family income is 60611, which is North Chicago.
result <- compare_means(
Redlining,
var1 = "side",
var2 = "involact"
)
summary(result, show = TRUE)
Pairwise mean comparisons (t-test)
Data : Redlining
Variables : side, involact
Samples : independent
Confidence: 0.95
Adjustment: None
side mean n n_missing sd se me
n 0.500 25 0 0.633 0.127 0.261
s 0.745 22 0 0.623 0.133 0.276
Null hyp. Alt. hyp. diff p.value se t.value df 2.5% 97.5%
n = s n not equal to s -0.245 0.188 0.183 -1.337 44.413 -0.615 0.124
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(result, plots = "scatter", custom = FALSE)
With 95% confidence interval, the dirrerence range between the North and the South is -0.615 to 0.124 average renewals, being the South the larger one.
result <- correlation(Redlining, vars = c("race", "age"))
summary(result)
Correlation
Data : Redlining
Method : Pearson
Variables : race, age
Null hyp. : variables x and y are not correlated
Alt. hyp. : variables x and y are correlated
Correlation matrix:
race
age 0.25
p.values:
race
age 0.09
plot(result, nrobs = -1)
Yes, there is a positive correlation between those two variables.
With 95% confidence, the average of new FAIR plan policies and renewals per 100 housing units range from 0.429 to 0.801.
Fire is the only variable related to the zip code.
result <- regress(
Redlining,
rvar = "zip",
evar = c("fire", "theft")
)
summary(
result,
sum_check = c("rmse", "sumsquares", "vif", "confint")
)
Linear regression (OLS)
Data : Redlining
Response variable : zip
Explanatory variables: fire, theft
Null hyp.: the effect of x on zip is zero
Alt. hyp.: the effect of x on zip is not zero
coefficient std.error t.value p.value
(Intercept) 60641.728 3.571 16980.029 < .001 ***
fire -0.488 0.248 -1.965 0.056 .
theft -0.159 0.104 -1.536 0.132
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.24, Adjusted R-squared: 0.205
F-statistic: 6.933 df(2,44), p.value 0.002
Nr obs: 47
Prediction error (RMSE): 12.587
Residual st.dev (RSD): 13.009
Sum of squares:
df SS
Regression 2 2,346.624
Error 44 7,446.695
Total 46 9,793.319
Variance Inflation Factors
fire theft
VIF 1.448 1.448
Rsq 0.309 0.309
coefficient 2.5% 97.5% +/-
(Intercept) 60641.728 60634.530 60648.926 7.198
fire -0.488 -0.988 0.013 0.500
theft -0.159 -0.368 0.050 0.209
plot(result, plots = "scatter", lines = c("line", "loess"), nrobs = -1, custom = FALSE)
Regression Analysis
result <- regress(
Redlining,
rvar = "involact",
evar = c("race", "fire", "theft", "age", "income", "side", "side01")
)
summary(
result,
sum_check = c("rmse", "sumsquares", "vif", "confint")
)
Linear regression (OLS)
Data : Redlining
Response variable : involact
Explanatory variables: race, fire, theft, age, income, side, side01
Null hyp.: the effect of x on involact is zero
Alt. hyp.: the effect of x on involact is not zero
coefficient std.error t.value p.value
(Intercept) -0.629 0.512 -1.229 0.226
race 0.009 0.003 3.374 0.002 **
fire 0.039 0.009 4.523 < .001 ***
theft -0.010 0.003 -3.494 0.001 **
age 0.008 0.003 2.884 0.006 **
income 0.025 0.032 0.770 0.446
side|s 0.024 0.125 0.192 0.849
side|01 NA NA NA NA NA
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.751, Adjusted R-squared: 0.714
F-statistic: 20.113 df(6,40), p.value < .001
Nr obs: 47
The set of explanatory variables exhibit perfect multicollinearity.
One or more variables were dropped from the estimation.
Prediction error (RMSE): 0.313
Residual st.dev (RSD): 0.339
Sum of squares:
df SS
Regression 6 13.879
Error 40 4.600
Total 46 18.480
Multicollinearity diagnostics were not calculated.
Confidence intervals were not calculated.
result <- regress(
Redlining,
rvar = "involact",
evar = c("race", "fire", "theft", "age", "income", "side", "side01"),
check = "stepwise-backward"
)
Start: AIC=-95.23
involact ~ race + fire + theft + age + income + side + side01
Step: AIC=-95.23
involact ~ race + fire + theft + age + income + side
Df Sum of Sq RSS AIC
- side 1 0.00425 4.6047 -97.184
- income 1 0.06811 4.6686 -96.537
<none> 4.6005 -95.228
- age 1 0.95668 5.5571 -88.348
- race 1 1.30950 5.9099 -85.455
- theft 1 1.40391 6.0044 -84.710
- fire 1 2.35303 6.9535 -77.813
Step: AIC=-97.18
involact ~ race + fire + theft + age + income
Df Sum of Sq RSS AIC
- income 1 0.06710 4.6718 -98.504
<none> 4.6047 -97.184
- age 1 0.99296 5.5977 -90.007
- theft 1 1.46328 6.0680 -86.215
- race 1 1.74657 6.3513 -84.070
- fire 1 2.37807 6.9828 -79.615
Step: AIC=-98.5
involact ~ race + fire + theft + age
Df Sum of Sq RSS AIC
<none> 4.6718 -98.504
- age 1 0.99734 5.6691 -91.410
- theft 1 1.41436 6.0862 -88.074
- race 1 2.05375 6.7256 -83.379
- fire 1 2.38365 7.0554 -81.128
summary(
result,
sum_check = c("rmse", "sumsquares", "vif", "confint")
)
----------------------------------------------------
Backward stepwise selection of variables
----------------------------------------------------
Linear regression (OLS)
Data : Redlining
Response variable : involact
Explanatory variables: race, fire, theft, age, income, side, side01
Null hyp.: the effect of x on involact is zero
Alt. hyp.: the effect of x on involact is not zero
coefficient std.error t.value p.value
(Intercept) -0.243 0.145 -1.676 0.101
race 0.008 0.002 4.297 < .001 ***
fire 0.037 0.008 4.629 < .001 ***
theft -0.010 0.003 -3.566 < .001 ***
age 0.007 0.002 2.994 0.005 **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.747, Adjusted R-squared: 0.723
F-statistic: 31.033 df(4,42), p.value < .001
Nr obs: 47
Prediction error (RMSE): 0.315
Residual st.dev (RSD): 0.334
Sum of squares:
df SS
Regression 4 13.808
Error 42 4.672
Total 46 18.480
Variance Inflation Factors
fire race theft age
VIF 2.243 1.562 1.487 1.222
Rsq 0.554 0.360 0.328 0.182
coefficient 2.5% 97.5% +/-
(Intercept) -0.243 -0.536 0.050 0.293
race 0.008 0.004 0.012 0.004
fire 0.037 0.021 0.053 0.016
theft -0.010 -0.015 -0.004 0.005
age 0.007 0.002 0.012 0.005
plot(result, plots = "dashboard", lines = c("line", "loess"), nrobs = -1, custom = FALSE)
Redlining <- store(Redlining, result, name = "xresids")
shapiro.test(Redlining$xresids)
Shapiro-Wilk normality test
data: Redlining$xresids
W = 0.98065, p-value = 0.6191
With 95% confidence, the factors related to fair policies with 5% significance are: race, fire, and theft.
The factor unrelated with 95% confidence and 5% significance is the Income.
The variance explained by those predictors is 0.751 or 75.1%. And 0.249 or 24.9% is unexplained.
Yes.
| Predictor | 2.5% | 97.5% | |
|---|---|---|---|
| race | 0.004 | to | 0.012 |
| fire | 0.021 | to | 0.053 |
| theft | -0.015 | to | -0.004 |
| age | 0.002 | to | 0.012 |
For a one-unit change in [whatever the metric is of x], (involact) the fair plan policies and renewals per 100 housing units [sign says + increase or - decrease] by (absolute value of) coefficient plus or minus +/-.
As the long of the race (percentage of minority) increases by one unit, involact increases by 0.004 to 0.012 with 95% confidence.
As the long of the fire- fire per 100 housing units, increases by 100 units, involact increases by 0.021 to 0.053 with 95% confidence.
As the long of theft- theft per 1000 population- increases by 1000 population, involact decreases by -0.004 to -0.015 with 95% confidence.
As age- measured as percentage of housing units built before 1939- changes, involact increase by 0.002 to 0.012 with 95% confidence.
Residual data of the simple linear regression model is the difference between the observed data of the dependent variable and the fitted values. The plot is useful for checking the assumption of linearity and homoscedasticity. To assess the assumption of linearity, residuals should be not too far from 0 (ideally, standardized values should be in the range of -2 and +2). To assess he assumption of homoscedasticity, residuals should be randomly and equally distributed around the horizontal red line (which is just a scatterplot smoother, showing the average value of the residuals at each value of fitted value) representing a residual error of zero.
In the current case, the red trend line is almost at zero except towards the right side, due to outliers presence. Some values, in particular are also outside the range between -2 and +2 (graphic: Normal Q-Q)
The probability is 0.389
result <- prob_norm(mean = 0.615, stdev = 0.634, lb = -0.5, ub = 0.5)
summary(result)
Probability calculator
Distribution: Normal
Mean : 0.615
St. dev : 0.634
Lower bound : -0.5
Upper bound : 0.5
P(X < -0.5) = 0.039
P(X > -0.5) = 0.961
P(X < 0.5) = 0.428
P(X > 0.5) = 0.572
P(-0.5 < X < 0.5) = 0.389
1 - P(-0.5 < X < 0.5) = 0.611
plot(result)
Regression Streamlining and Model Building
We wish to engage a process to “streamline” the regression by eliminating unimportant predictors, if there are any. There are two common ways.
The first is to use the evidence you obtained from the regression involving all predictors and to eliminate any factor that is unrelated to demand for new FAIR policies with 90 or 95% confidence [your answers to 2a above. In this case, let’s suppose 95% to remain consistent.
The second is to engage in an automated routine to eliminate predictors according to a given criterion, most often AIC or BIC [radiant uses the former]. Formally, AIC combines the residual sum of squares with a penalty for each predictor to encourage less complicated models (because we minimize it): this is Radiant’s stepwise tick box or the R command step() applied to a regression. Try this.
1st way
result <- regress(
Redlining,
rvar = "involact",
evar = c("zip", "race", "fire", "theft", "age", "income", "side")
)
summary(
result,
sum_check = c("rmse", "sumsquares", "vif", "confint")
)
Linear regression (OLS)
Data : Redlining
Response variable : involact
Explanatory variables: zip, race, fire, theft, age, income, side
Null hyp.: the effect of x on involact is zero
Alt. hyp.: the effect of x on involact is not zero
coefficient std.error t.value p.value
(Intercept) 265.631 242.891 1.094 0.281
zip -0.004 0.004 -1.096 0.280
race 0.009 0.003 3.298 0.002 **
fire 0.037 0.009 4.270 < .001 ***
theft -0.011 0.003 -3.652 < .001 ***
age 0.008 0.003 2.668 0.011 *
income 0.022 0.032 0.689 0.495
side|s 0.011 0.125 0.088 0.931
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.758, Adjusted R-squared: 0.715
F-statistic: 17.498 df(7,39), p.value < .001
Nr obs: 47
Prediction error (RMSE): 0.308
Residual st.dev (RSD): 0.338
Sum of squares:
df SS
Regression 7 14.017
Error 39 4.463
Total 46 18.480
Variance Inflation Factors
income race fire age theft side zip
VIF 3.142 2.969 2.665 1.787 1.771 1.606 1.372
Rsq 0.682 0.663 0.625 0.440 0.435 0.377 0.271
coefficient 2.5% 97.5% +/-
(Intercept) 265.631 -225.663 756.924 491.294
zip -0.004 -0.012 0.004 0.008
race 0.009 0.003 0.014 0.005
fire 0.037 0.020 0.055 0.018
theft -0.011 -0.017 -0.005 0.006
age 0.008 0.002 0.014 0.006
income 0.022 -0.043 0.087 0.065
side|s 0.011 -0.242 0.264 0.253
plot(result, plots = "dashboard", lines = c("line", "loess"), nrobs = -1, custom = FALSE)
Redlining <- store(Redlining, result, name = "Resids1")
2nd way
result <- regress(
Redlining,
rvar = "involact",
evar = c(
"zip", "race", "fire", "theft", "age", "income", "side",
"side01"
),
check = "stepwise-backward"
)
Start: AIC=-94.65
involact ~ zip + race + fire + theft + age + income + side +
side01
Step: AIC=-94.65
involact ~ zip + race + fire + theft + age + income + side
Df Sum of Sq RSS AIC
- side 1 0.00088 4.4638 -96.645
- income 1 0.05430 4.5172 -96.086
- zip 1 0.13751 4.6005 -95.228
<none> 4.4629 -94.654
- age 1 0.81472 5.2777 -88.773
- race 1 1.24447 5.7074 -85.094
- theft 1 1.52654 5.9895 -82.827
- fire 1 2.08655 6.5495 -78.626
Step: AIC=-96.64
involact ~ zip + race + fire + theft + age + income
Df Sum of Sq RSS AIC
- income 1 0.05385 4.5177 -98.081
- zip 1 0.14088 4.6047 -97.184
<none> 4.4638 -96.645
- age 1 0.86601 5.3298 -90.311
- theft 1 1.58524 6.0491 -84.362
- race 1 1.59613 6.0599 -84.277
- fire 1 2.13137 6.5952 -80.299
Step: AIC=-98.08
involact ~ zip + race + fire + theft + age
Df Sum of Sq RSS AIC
- zip 1 0.15414 4.6718 -98.504
<none> 4.5177 -98.081
- age 1 0.88275 5.4004 -91.693
- theft 1 1.55459 6.0723 -86.182
- race 1 1.91413 6.4318 -83.478
- fire 1 2.15560 6.6733 -81.746
Step: AIC=-98.5
involact ~ race + fire + theft + age
Df Sum of Sq RSS AIC
<none> 4.6718 -98.504
- age 1 0.99734 5.6691 -91.410
- theft 1 1.41436 6.0862 -88.074
- race 1 2.05375 6.7256 -83.379
- fire 1 2.38365 7.0554 -81.128
summary(
result,
sum_check = c("rmse", "sumsquares", "vif", "confint")
)
----------------------------------------------------
Backward stepwise selection of variables
----------------------------------------------------
Linear regression (OLS)
Data : Redlining
Response variable : involact
Explanatory variables: zip, race, fire, theft, age, income, side, side01
Null hyp.: the effect of x on involact is zero
Alt. hyp.: the effect of x on involact is not zero
coefficient std.error t.value p.value
(Intercept) -0.243 0.145 -1.676 0.101
race 0.008 0.002 4.297 < .001 ***
fire 0.037 0.008 4.629 < .001 ***
theft -0.010 0.003 -3.566 < .001 ***
age 0.007 0.002 2.994 0.005 **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.747, Adjusted R-squared: 0.723
F-statistic: 31.033 df(4,42), p.value < .001
Nr obs: 47
Prediction error (RMSE): 0.315
Residual st.dev (RSD): 0.334
Sum of squares:
df SS
Regression 4 13.808
Error 42 4.672
Total 46 18.480
Variance Inflation Factors
fire race theft age
VIF 2.243 1.562 1.487 1.222
Rsq 0.554 0.360 0.328 0.182
coefficient 2.5% 97.5% +/-
(Intercept) -0.243 -0.536 0.050 0.293
race 0.008 0.004 0.012 0.004
fire 0.037 0.021 0.053 0.016
theft -0.010 -0.015 -0.004 0.005
age 0.007 0.002 0.012 0.005
plot(result, plots = "dashboard", lines = c("line", "loess"), nrobs = -1, custom = FALSE)
Redlining <- store(Redlining, result, name = "Resids1")
shapiro.test(Redlining$Resids1)
Shapiro-Wilk normality test
data: Redlining$Resids1
W = 0.98065, p-value = 0.6191
The general idea behind the stepwise regression procedure is that we build our regression model from a set of candidate predictor variables by entering and removing predictors — in a stepwise manner — into our model until there is no justifiable reason to enter or remove any more.
The first predictor eliminated was side
The second one was Income
The third one was zipcode, also side01 has a NA.
The most important predictor is Fire because it has the greater Sums of squares.
For the first way we have R-squared: 0.758, Adjusted R-squared: 0.715 F-statistic: 17.498 df(7,39), p.value < .001
For the second way: R-squared: 0.747, Adjusted R-squared: 0.723 F-statistic: 31.033 df(4,42), p.value < .001
They are not equal but similar and they provided the same most important predictors.
Yes, the residuals are normal, based on the Shapiro Test. W = 0.98065, p-value = 0.6191 for the second way.
Yes, they are related- race, fire, theft and age.
Yes, 0.747 or 74.7%
With the new regression model, the standard deviation error for the model is 0.334. Lower values (closer to zero) indicate better fit.
For a one-unit change in [whatever the metric is of x], (involact) the fair plan policies and renewals per 100 housing units [sign says + increase or - decrease] by (absolute value of) coefficient plus or minus +/-.
As the long of the race (percentage of minority) increases by one unit, involact increases by 0.004 to 0.012 with 95% confidence.
As the long of the fire- fire per 100 housing units, increases by 100 units, involact increases by 0.021 to 0.053 with 95% confidence.
As the long of theft- theft per 1000 population- increases by 1000 population, involact decreases by -0.004 to -0.015 with 95% confidence.
As age- measured as percentage of housing units built before 1939- changes, involact increase by 0.002 to 0.012 with 95% confidence.
## filter and sort the dataset
X60614 <- Redlining %>%
filter(zip== 60614) %>%
arrange(desc(fire)) %>%
select(zip:side)
register("X60614", "Redlining")
# dtab(X60614, dec = 2, nr = 100) %>% render()
result <- regress(
Redlining,
rvar = "involact",
evar = c("zip", "race", "fire", "theft", "age", "income")
)
summary(
result,
sum_check = c("rmse", "sumsquares", "vif", "confint")
)
Linear regression (OLS)
Data : Redlining
Response variable : involact
Explanatory variables: zip, race, fire, theft, age, income
Null hyp.: the effect of x on involact is zero
Alt. hyp.: the effect of x on involact is not zero
coefficient std.error t.value p.value
(Intercept) 267.662 238.764 1.121 0.269
zip -0.004 0.004 -1.124 0.268
race 0.009 0.002 3.782 < .001 ***
fire 0.037 0.009 4.370 < .001 ***
theft -0.011 0.003 -3.769 < .001 ***
age 0.008 0.003 2.786 0.008 **
income 0.022 0.032 0.695 0.491
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.758, Adjusted R-squared: 0.722
F-statistic: 20.932 df(6,40), p.value < .001
Nr obs: 47
Prediction error (RMSE): 0.308
Residual st.dev (RSD): 0.334
Sum of squares:
df SS
Regression 6 14.016
Error 40 4.464
Total 46 18.480
Variance Inflation Factors
income fire race theft age zip
VIF 3.137 2.591 2.371 1.719 1.651 1.360
Rsq 0.681 0.614 0.578 0.418 0.394 0.265
coefficient 2.5% 97.5% +/-
(Intercept) 267.662 -214.898 750.222 482.560
zip -0.004 -0.012 0.004 0.008
race 0.009 0.004 0.014 0.005
fire 0.037 0.020 0.054 0.017
theft -0.011 -0.017 -0.005 0.006
age 0.008 0.002 0.013 0.006
income 0.022 -0.042 0.086 0.064
pred <- predict(result, pred_data = X60614)
print(pred, n = 10)
Linear regression (OLS)
Data : Redlining
Response variable : involact
Explanatory variables: zip, race, fire, theft, age, income
Interval : confidence
Prediction dataset : X60614
zip race fire theft age income Prediction 2.5% 97.5% +/-
60614.000 24.500 8.600 53.000 81.400 9.730 0.377 0.128 0.626 0.249
X60614 <- store(X60614, pred, name = "pred_reg60614")
Predictions will first show the confidence interval for the average: 0.128 0.626, on average, for the zip code 60614 with 95% confidence. Given 60614’s data. Average (involact) the fair plan policies and renewals per 100 housing units should range between 0.12 to 0.62 with 95% confidence.
pred <- predict(result, pred_data = X60614, interval="prediction")
print(pred, n = 10)
Linear regression (OLS)
Data : Redlining
Response variable : involact
Explanatory variables: zip, race, fire, theft, age, income
Interval : prediction
Prediction dataset : X60614
zip race fire theft age income Prediction 2.5% 97.5% +/-
60614.000 24.500 8.600 53.000 81.400 9.730 0.377 -0.343 1.096 0.719
Now the prediction interval for the data with 95% probability; it spans -0.343 to 1.096 per 100 housing units.
result <- regress(
Redlining,
rvar = "race",
evar = c(
"zip", "fire", "theft", "age", "involact", "income", "side",
"side01"
),
check = "stepwise-backward"
)
Start: AIC=279.77
race ~ zip + fire + theft + age + involact + income + side +
side01
Step: AIC=279.77
race ~ zip + fire + theft + age + involact + income + side
Df Sum of Sq RSS AIC
- zip 1 5.0 12869 277.78
- fire 1 84.5 12948 278.07
<none> 12864 279.77
- age 1 1089.3 13953 281.59
- theft 1 1690.2 14554 283.57
- side 1 2314.0 15178 285.54
- involact 1 3587.1 16451 289.33
- income 1 4050.3 16914 290.63
Step: AIC=277.78
race ~ fire + theft + age + involact + income + side
Df Sum of Sq RSS AIC
- fire 1 86.6 12956 276.10
<none> 12869 277.78
- age 1 1112.4 13982 279.68
- theft 1 1769.6 14639 281.84
- side 1 2313.5 15183 283.56
- involact 1 3663.1 16532 287.56
- income 1 4092.0 16961 288.76
Step: AIC=276.1
race ~ theft + age + involact + income + side
Df Sum of Sq RSS AIC
<none> 12956 276.10
- age 1 1028.0 13984 277.69
- theft 1 2230.8 15186 281.57
- side 1 2578.7 15534 282.63
- income 1 4190.0 17146 287.27
- involact 1 4851.7 17808 289.05
summary(
result,
sum_check = c("rmse", "sumsquares", "vif", "confint")
)
----------------------------------------------------
Backward stepwise selection of variables
----------------------------------------------------
Linear regression (OLS)
Data : Redlining
Response variable : race
Explanatory variables: zip, fire, theft, age, involact, income, side, side01
Null hyp.: the effect of x on race is zero
Alt. hyp.: the effect of x on race is not zero
coefficient std.error t.value p.value
(Intercept) 73.457 22.777 3.225 0.002 **
theft 0.333 0.125 2.657 0.011 *
age -0.283 0.157 -1.804 0.079 .
involact 22.531 5.750 3.918 < .001 ***
income -5.045 1.385 -3.641 < .001 ***
side|s 16.936 5.928 2.857 0.007 **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.735, Adjusted R-squared: 0.702
F-statistic: 22.718 df(5,41), p.value < .001
Nr obs: 47
Prediction error (RMSE): 16.603
Residual st.dev (RSD): 17.776
Sum of squares:
df SS
Regression 5 35,894.094
Error 41 12,955.726
Total 46 48,849.820
Variance Inflation Factors
income involact age side theft
VIF 2.120 1.934 1.824 1.302 1.135
Rsq 0.528 0.483 0.452 0.232 0.119
coefficient 2.5% 97.5% +/-
(Intercept) 73.457 27.458 119.456 45.999
theft 0.333 0.080 0.586 0.253
age -0.283 -0.600 0.034 0.317
involact 22.531 10.918 34.143 11.612
income -5.045 -7.843 -2.247 2.798
side|s 16.936 4.963 28.908 11.973
plot(result, plots = "dashboard", lines = c("line", "loess"), nrobs = -1, custom = FALSE)
Redlining <- store(Redlining, result, name = "raceresids")
We see that there is a wide range in the race variable with some zip codes being almost entirely minority or non-minority. We can clearly see that homeowners in zip codes with high % minority are being denied insurance at higher rate than other zip codes. Also, Based on the new model, race is related to involact and income with 5% significance. That is not in doubt. However, the race effect is not entirely conclusive. It would be a very good idea to obtain some individual level data.
library(visreg)
visreg(result$model, data=Redlining)
Happy Holidays, Robert! Thank you for everything
Karol