GSMPR 622: Redlining

Your role is as an investigator for the UCCR attempting to use these data to compile evidence of redlining. The core piece of evidence will come down to whether or not race is related to demand for new FAIR plan policies and renewals per 100 housing units by zip code because there is no logical reason that race would be related were it not for redlining. side01 is the same variable as side only stored as numeric.

Insurance Discrimination

Insurance redlining refers to the practice of refusing to issue insurance to certain types of people or within some geographic area. The name comes from the act of drawing a red line around an area on a map. Now few would quibble with an insurance company refusing to sell auto insurance to a frequent drunk driver, but other forms of discrimination would be unacceptable.

In the late 1970s, the U.S. Commission on Civil Rights examined charges by several Chicago community organizations that insurance companies were redlining their neighborhoods. Because comprehensive information about individuals being refused homeowners insurance was not available, the number of FAIR plan policies written and renewed in Chicago as a default policy to homeowners who had been rejected by the voluntary market. Information on other variables that might affect insurance writing such as fire and theft rates were also collected at the zip code level. The variables are:

race: racial composition in percentage of minority

fire: fire per 100 housing units

theft: theft per 1000 population

age: percentage of housing units built before 1939

involact new fair plan policies and renewals per 100 housing units

income median family income in thousands of dollars

side North or South side of Chicago

Some Preliminary Questions:

Basic Data Familiarity

  1. Provide a basic numerical summary of each of the included variables and a detailed graphical and numerical description and discussion of the demand for new FAIR plan policies and renewals per 100 housing units.
Redlining %>%
  arrange(desc(involact)) %>%
  select(zip:side01) %>%
  dtab(dec = 2, nr = 100) %>% render()
result <- explore(
  Redlining, 
  vars = c("race", "fire", "theft", "age", "involact", "income", "side01"), 
  fun = c("n_obs", "mean", "min", "max", "sum", "sd", "p25", "p75"), 
  nr = Inf
)
summary(result, dec = 2)
Explore
Data        : Redlining 
Functions   : n_obs, mean, min, max, sum, sd, p25, p75 
Top         : Function 

 variable n_obs  mean  min    max      sum    sd   p25   p75
     race    47 34.99 1.00  99.70 1,644.30 32.59  3.75 57.65
     fire    47 12.28 2.00  39.70   577.10  9.30  5.65 16.05
    theft    47 32.36 3.00 147.00 1,521.00 22.29 22.00 38.00
      age    47 60.33 2.00  90.10 2,835.40 22.57 48.60 77.30
 involact    47  0.61 0.00   2.20    28.90  0.63  0.00  0.90
   income    47 10.70 5.58  21.48   502.70  2.75  8.45 11.99
   side01    47  0.47 0.00   1.00    22.00  0.50  0.00  1.00
# dtab(result, dec = 2) %>% render()
visualize(
  Redlining, 
  xvar = c("race", "fire", "theft", "age", "involact", "income"), 
  yvar = "zip", 
  type = "line", 
  facet_row = "side", 
  color = "side", 
  theme = "theme_minimal", 
  base_size = 12, 
  base_family = "Courier", 
  custom = FALSE
)

  1. Which zip code has the most renewals?

The zip code with most renewals is 60621

  1. What zip code has the highest median family income? Is it in north or south Chicago?

The zip code with the highest median family income is 60611, which is North Chicago.

  1. Are renewals per 100 housing units are the same in north and south Chicago? Provide a 95% confidence interval for the difference in average renewals.
result <- compare_means(
  Redlining, 
  var1 = "side", 
  var2 = "involact"
)
summary(result, show = TRUE)
Pairwise mean comparisons (t-test)
Data      : Redlining 
Variables : side, involact 
Samples   : independent 
Confidence: 0.95 
Adjustment: None 

 side  mean  n n_missing    sd    se    me
    n 0.500 25         0 0.633 0.127 0.261
    s 0.745 22         0 0.623 0.133 0.276

 Null hyp. Alt. hyp.          diff   p.value se    t.value df     2.5%   97.5%
 n = s     n not equal to s   -0.245 0.188   0.183 -1.337  44.413 -0.615 0.124
  
  

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(result, plots = "scatter", custom = FALSE)

With 95% confidence interval, the dirrerence range between the North and the South is -0.615 to 0.124 average renewals, being the South the larger one.

  1. Is racial composition related to the age of units? Provide relevant graphical and numerical evidence.
result <- correlation(Redlining, vars = c("race", "age"))
summary(result)
Correlation
Data        : Redlining 
Method      : Pearson 
Variables   : race, age 
Null hyp.   : variables x and y are not correlated
Alt. hyp.   : variables x and y are correlated

Correlation matrix:
    race
age 0.25

p.values:
    race
age 0.09
plot(result, nrobs = -1)

Yes, there is a positive correlation between those two variables.

  1. Provide and interpret a 95% confidence interval for the average new FAIR plan policies and renewals per 100 housing units.

With 95% confidence, the average of new FAIR plan policies and renewals per 100 housing units range from 0.429 to 0.801.

  1. Are fire and theft related within a zip code? Provide a scatterplot and regression evidence linking these two variables.

Fire is the only variable related to the zip code.

result <- regress(
  Redlining, 
  rvar = "zip", 
  evar = c("fire", "theft")
)
summary(
  result, 
  sum_check = c("rmse", "sumsquares", "vif", "confint")
)
Linear regression (OLS)
Data     : Redlining 
Response variable    : zip 
Explanatory variables: fire, theft 
Null hyp.: the effect of x on zip is zero
Alt. hyp.: the effect of x on zip is not zero

             coefficient std.error   t.value p.value    
 (Intercept)   60641.728     3.571 16980.029  < .001 ***
 fire             -0.488     0.248    -1.965   0.056 .  
 theft            -0.159     0.104    -1.536   0.132    

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.24,  Adjusted R-squared: 0.205 
F-statistic: 6.933 df(2,44), p.value 0.002
Nr obs: 47 

Prediction error (RMSE):  12.587 
Residual st.dev   (RSD):  13.009 

Sum of squares:
           df        SS
Regression  2 2,346.624
Error      44 7,446.695
Total      46 9,793.319

Variance Inflation Factors
     fire theft
VIF 1.448 1.448
Rsq 0.309 0.309

            coefficient      2.5%     97.5%   +/-
(Intercept)   60641.728 60634.530 60648.926 7.198
fire             -0.488    -0.988     0.013 0.500
theft            -0.159    -0.368     0.050 0.209
plot(result, plots = "scatter", lines = c("line", "loess"), nrobs = -1, custom = FALSE)

Regression Analysis

  1. Perform an initial regression explaining the outcome: new FAIR policies and renewals per 100 units as a function of all the other predictors [except the Zip Code].
result <- regress(
  Redlining, 
  rvar = "involact", 
  evar = c("race", "fire", "theft", "age", "income", "side", "side01")
)
summary(
  result, 
  sum_check = c("rmse", "sumsquares", "vif", "confint")
)
Linear regression (OLS)
Data     : Redlining 
Response variable    : involact 
Explanatory variables: race, fire, theft, age, income, side, side01 
Null hyp.: the effect of x on involact is zero
Alt. hyp.: the effect of x on involact is not zero

             coefficient std.error t.value p.value    
 (Intercept)      -0.629     0.512  -1.229   0.226    
 race              0.009     0.003   3.374   0.002 ** 
 fire              0.039     0.009   4.523  < .001 ***
 theft            -0.010     0.003  -3.494   0.001 ** 
 age               0.008     0.003   2.884   0.006 ** 
 income            0.025     0.032   0.770   0.446    
 side|s            0.024     0.125   0.192   0.849    
 side|01              NA        NA      NA      NA NA 

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.751,  Adjusted R-squared: 0.714 
F-statistic: 20.113 df(6,40), p.value < .001
Nr obs: 47 

The set of explanatory variables exhibit perfect multicollinearity.
One or more variables were dropped from the estimation.
Prediction error (RMSE):  0.313 
Residual st.dev   (RSD):  0.339 

Sum of squares:
           df     SS
Regression  6 13.879
Error      40  4.600
Total      46 18.480

Multicollinearity diagnostics were not calculated.
Confidence intervals were not calculated.
result <- regress(
  Redlining, 
  rvar = "involact", 
  evar = c("race", "fire", "theft", "age", "income", "side", "side01"), 
  check = "stepwise-backward"
)
Start:  AIC=-95.23
involact ~ race + fire + theft + age + income + side + side01


Step:  AIC=-95.23
involact ~ race + fire + theft + age + income + side

         Df Sum of Sq    RSS     AIC
- side    1   0.00425 4.6047 -97.184
- income  1   0.06811 4.6686 -96.537
<none>                4.6005 -95.228
- age     1   0.95668 5.5571 -88.348
- race    1   1.30950 5.9099 -85.455
- theft   1   1.40391 6.0044 -84.710
- fire    1   2.35303 6.9535 -77.813

Step:  AIC=-97.18
involact ~ race + fire + theft + age + income

         Df Sum of Sq    RSS     AIC
- income  1   0.06710 4.6718 -98.504
<none>                4.6047 -97.184
- age     1   0.99296 5.5977 -90.007
- theft   1   1.46328 6.0680 -86.215
- race    1   1.74657 6.3513 -84.070
- fire    1   2.37807 6.9828 -79.615

Step:  AIC=-98.5
involact ~ race + fire + theft + age

        Df Sum of Sq    RSS     AIC
<none>               4.6718 -98.504
- age    1   0.99734 5.6691 -91.410
- theft  1   1.41436 6.0862 -88.074
- race   1   2.05375 6.7256 -83.379
- fire   1   2.38365 7.0554 -81.128
summary(
  result, 
  sum_check = c("rmse", "sumsquares", "vif", "confint")
)
----------------------------------------------------
Backward stepwise selection of variables
----------------------------------------------------
Linear regression (OLS)
Data     : Redlining 
Response variable    : involact 
Explanatory variables: race, fire, theft, age, income, side, side01 
Null hyp.: the effect of x on involact is zero
Alt. hyp.: the effect of x on involact is not zero

             coefficient std.error t.value p.value    
 (Intercept)      -0.243     0.145  -1.676   0.101    
 race              0.008     0.002   4.297  < .001 ***
 fire              0.037     0.008   4.629  < .001 ***
 theft            -0.010     0.003  -3.566  < .001 ***
 age               0.007     0.002   2.994   0.005 ** 

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.747,  Adjusted R-squared: 0.723 
F-statistic: 31.033 df(4,42), p.value < .001
Nr obs: 47 

Prediction error (RMSE):  0.315 
Residual st.dev   (RSD):  0.334 

Sum of squares:
           df     SS
Regression  4 13.808
Error      42  4.672
Total      46 18.480

Variance Inflation Factors
     fire  race theft   age
VIF 2.243 1.562 1.487 1.222
Rsq 0.554 0.360 0.328 0.182

            coefficient   2.5%  97.5%   +/-
(Intercept)      -0.243 -0.536  0.050 0.293
race              0.008  0.004  0.012 0.004
fire              0.037  0.021  0.053 0.016
theft            -0.010 -0.015 -0.004 0.005
age               0.007  0.002  0.012 0.005
plot(result, plots = "dashboard", lines = c("line", "loess"), nrobs = -1, custom = FALSE)

Redlining <- store(Redlining, result, name = "xresids")
shapiro.test(Redlining$xresids)

    Shapiro-Wilk normality test

data:  Redlining$xresids
W = 0.98065, p-value = 0.6191
  1. Interpret the previous regression results as completely as you can. In short, you should run through the entire R/radiant regression table
  1. the regression output of slopes/intercepts and t and p-values: what factors are related to new FAIR policies with 95% confidence/5% significance? What factors are unrelated to new FAIR policies with 95% confidence/5% significance?

With 95% confidence, the factors related to fair policies with 5% significance are: race, fire, and theft.

The factor unrelated with 95% confidence and 5% significance is the Income.

  1. R-squared? What proportion of the variance/variation in new FAIR policies is explained by this set of predictors? How much variance/variation is unexplained?

The variance explained by those predictors is 0.751 or 75.1%. And 0.249 or 24.9% is unexplained.

  1. Does the model as a whole explain variation in new FAIR policies?

Yes.

  1. Provide appropriate confidence intervals for all of the effects. Express the confidence intervals in the appropriate metric and interpret their meaning with respect to FAIR policies.
Predictor 2.5% 97.5%
race 0.004 to 0.012
fire 0.021 to 0.053
theft -0.015 to -0.004
age 0.002 to 0.012

For a one-unit change in [whatever the metric is of x], (involact) the fair plan policies and renewals per 100 housing units [sign says + increase or - decrease] by (absolute value of) coefficient plus or minus +/-.

As the long of the race (percentage of minority) increases by one unit, involact increases by 0.004 to 0.012 with 95% confidence.

As the long of the fire- fire per 100 housing units, increases by 100 units, involact increases by 0.021 to 0.053 with 95% confidence.

As the long of theft- theft per 1000 population- increases by 1000 population, involact decreases by -0.004 to -0.015 with 95% confidence.

As age- measured as percentage of housing units built before 1939- changes, involact increase by 0.002 to 0.012 with 95% confidence.

  1. Isolate the residuals from the regression and examine their normality providing as much evidence as you can [at least one bit of quantitative and graphical evidence]. Are the residuals normal? What role does the normality of residuals play in the interpretation of regression models?

Residual data of the simple linear regression model is the difference between the observed data of the dependent variable and the fitted values. The plot is useful for checking the assumption of linearity and homoscedasticity. To assess the assumption of linearity, residuals should be not too far from 0 (ideally, standardized values should be in the range of -2 and +2). To assess he assumption of homoscedasticity, residuals should be randomly and equally distributed around the horizontal red line (which is just a scatterplot smoother, showing the average value of the residuals at each value of fitted value) representing a residual error of zero.

In the current case, the red trend line is almost at zero except towards the right side, due to outliers presence. Some values, in particular are also outside the range between -2 and +2 (graphic: Normal Q-Q)

  1. What is the probability that our regression predicts new FAIR plan policies to within plus or minus 0.5 per 100 housing units?

The probability is 0.389

result <- prob_norm(mean = 0.615, stdev = 0.634, lb = -0.5, ub = 0.5)
summary(result)
Probability calculator
Distribution: Normal
Mean        : 0.615 
St. dev     : 0.634 
Lower bound : -0.5 
Upper bound : 0.5 

P(X < -0.5) = 0.039
P(X > -0.5) = 0.961
P(X < 0.5) = 0.428
P(X > 0.5) = 0.572
P(-0.5 < X < 0.5)     = 0.389
1 - P(-0.5 < X < 0.5) = 0.611
plot(result)

Regression Streamlining and Model Building

We wish to engage a process to “streamline” the regression by eliminating unimportant predictors, if there are any. There are two common ways.

The first is to use the evidence you obtained from the regression involving all predictors and to eliminate any factor that is unrelated to demand for new FAIR policies with 90 or 95% confidence [your answers to 2a above. In this case, let’s suppose 95% to remain consistent.

The second is to engage in an automated routine to eliminate predictors according to a given criterion, most often AIC or BIC [radiant uses the former]. Formally, AIC combines the residual sum of squares with a penalty for each predictor to encourage less complicated models (because we minimize it): this is Radiant’s stepwise tick box or the R command step() applied to a regression. Try this.

1st way

result <- regress(
  Redlining, 
  rvar = "involact", 
  evar = c("zip", "race", "fire", "theft", "age", "income", "side")
)
summary(
  result, 
  sum_check = c("rmse", "sumsquares", "vif", "confint")
)
Linear regression (OLS)
Data     : Redlining 
Response variable    : involact 
Explanatory variables: zip, race, fire, theft, age, income, side 
Null hyp.: the effect of x on involact is zero
Alt. hyp.: the effect of x on involact is not zero

             coefficient std.error t.value p.value    
 (Intercept)     265.631   242.891   1.094   0.281    
 zip              -0.004     0.004  -1.096   0.280    
 race              0.009     0.003   3.298   0.002 ** 
 fire              0.037     0.009   4.270  < .001 ***
 theft            -0.011     0.003  -3.652  < .001 ***
 age               0.008     0.003   2.668   0.011 *  
 income            0.022     0.032   0.689   0.495    
 side|s            0.011     0.125   0.088   0.931    

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.758,  Adjusted R-squared: 0.715 
F-statistic: 17.498 df(7,39), p.value < .001
Nr obs: 47 

Prediction error (RMSE):  0.308 
Residual st.dev   (RSD):  0.338 

Sum of squares:
           df     SS
Regression  7 14.017
Error      39  4.463
Total      46 18.480

Variance Inflation Factors
    income  race  fire   age theft  side   zip
VIF  3.142 2.969 2.665 1.787 1.771 1.606 1.372
Rsq  0.682 0.663 0.625 0.440 0.435 0.377 0.271

            coefficient     2.5%   97.5%     +/-
(Intercept)     265.631 -225.663 756.924 491.294
zip              -0.004   -0.012   0.004   0.008
race              0.009    0.003   0.014   0.005
fire              0.037    0.020   0.055   0.018
theft            -0.011   -0.017  -0.005   0.006
age               0.008    0.002   0.014   0.006
income            0.022   -0.043   0.087   0.065
side|s            0.011   -0.242   0.264   0.253
plot(result, plots = "dashboard", lines = c("line", "loess"), nrobs = -1, custom = FALSE)

Redlining <- store(Redlining, result, name = "Resids1")

2nd way

result <- regress(
  Redlining, 
  rvar = "involact", 
  evar = c(
    "zip", "race", "fire", "theft", "age", "income", "side", 
    "side01"
  ), 
  check = "stepwise-backward"
)
Start:  AIC=-94.65
involact ~ zip + race + fire + theft + age + income + side + 
    side01


Step:  AIC=-94.65
involact ~ zip + race + fire + theft + age + income + side

         Df Sum of Sq    RSS     AIC
- side    1   0.00088 4.4638 -96.645
- income  1   0.05430 4.5172 -96.086
- zip     1   0.13751 4.6005 -95.228
<none>                4.4629 -94.654
- age     1   0.81472 5.2777 -88.773
- race    1   1.24447 5.7074 -85.094
- theft   1   1.52654 5.9895 -82.827
- fire    1   2.08655 6.5495 -78.626

Step:  AIC=-96.64
involact ~ zip + race + fire + theft + age + income

         Df Sum of Sq    RSS     AIC
- income  1   0.05385 4.5177 -98.081
- zip     1   0.14088 4.6047 -97.184
<none>                4.4638 -96.645
- age     1   0.86601 5.3298 -90.311
- theft   1   1.58524 6.0491 -84.362
- race    1   1.59613 6.0599 -84.277
- fire    1   2.13137 6.5952 -80.299

Step:  AIC=-98.08
involact ~ zip + race + fire + theft + age

        Df Sum of Sq    RSS     AIC
- zip    1   0.15414 4.6718 -98.504
<none>               4.5177 -98.081
- age    1   0.88275 5.4004 -91.693
- theft  1   1.55459 6.0723 -86.182
- race   1   1.91413 6.4318 -83.478
- fire   1   2.15560 6.6733 -81.746

Step:  AIC=-98.5
involact ~ race + fire + theft + age

        Df Sum of Sq    RSS     AIC
<none>               4.6718 -98.504
- age    1   0.99734 5.6691 -91.410
- theft  1   1.41436 6.0862 -88.074
- race   1   2.05375 6.7256 -83.379
- fire   1   2.38365 7.0554 -81.128
summary(
  result, 
  sum_check = c("rmse", "sumsquares", "vif", "confint")
)
----------------------------------------------------
Backward stepwise selection of variables
----------------------------------------------------
Linear regression (OLS)
Data     : Redlining 
Response variable    : involact 
Explanatory variables: zip, race, fire, theft, age, income, side, side01 
Null hyp.: the effect of x on involact is zero
Alt. hyp.: the effect of x on involact is not zero

             coefficient std.error t.value p.value    
 (Intercept)      -0.243     0.145  -1.676   0.101    
 race              0.008     0.002   4.297  < .001 ***
 fire              0.037     0.008   4.629  < .001 ***
 theft            -0.010     0.003  -3.566  < .001 ***
 age               0.007     0.002   2.994   0.005 ** 

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.747,  Adjusted R-squared: 0.723 
F-statistic: 31.033 df(4,42), p.value < .001
Nr obs: 47 

Prediction error (RMSE):  0.315 
Residual st.dev   (RSD):  0.334 

Sum of squares:
           df     SS
Regression  4 13.808
Error      42  4.672
Total      46 18.480

Variance Inflation Factors
     fire  race theft   age
VIF 2.243 1.562 1.487 1.222
Rsq 0.554 0.360 0.328 0.182

            coefficient   2.5%  97.5%   +/-
(Intercept)      -0.243 -0.536  0.050 0.293
race              0.008  0.004  0.012 0.004
fire              0.037  0.021  0.053 0.016
theft            -0.010 -0.015 -0.004 0.005
age               0.007  0.002  0.012 0.005
plot(result, plots = "dashboard", lines = c("line", "loess"), nrobs = -1, custom = FALSE)

Redlining <- store(Redlining, result, name = "Resids1")
shapiro.test(Redlining$Resids1)

    Shapiro-Wilk normality test

data:  Redlining$Resids1
W = 0.98065, p-value = 0.6191
  1. Describe the stepwise process. What predictor is the least informative and thus eliminated first? Second? Third? What is the most important predictor of new FAIR plan policies and renewals per 100 housing units in the final model? How do you know?

The general idea behind the stepwise regression procedure is that we build our regression model from a set of candidate predictor variables by entering and removing predictors — in a stepwise manner — into our model until there is no justifiable reason to enter or remove any more.

The first predictor eliminated was side

The second one was Income

The third one was zipcode, also side01 has a NA.

The most important predictor is Fire because it has the greater Sums of squares.

  1. Do the two approaches given above yield the same streamlined regression in this case? Now let us focus on this streamlined regression and the interpretation.

For the first way we have R-squared: 0.758, Adjusted R-squared: 0.715 F-statistic: 17.498 df(7,39), p.value < .001

For the second way: R-squared: 0.747, Adjusted R-squared: 0.723 F-statistic: 31.033 df(4,42), p.value < .001

They are not equal but similar and they provided the same most important predictors.

  1. Are the residuals from the streamlined regression normal? Regardless of the result, let’s assume they are for the purposes of interpretation.

Yes, the residuals are normal, based on the Shapiro Test. W = 0.98065, p-value = 0.6191 for the second way.

  1. Are all of the predictors that remain in this streamlined regression related to demand for new FAIR policies with 95% confidence?

Yes, they are related- race, fire, theft and age.

  1. Does the model, as a whole, explain variation in new FAIR policies? How much?

Yes, 0.747 or 74.7%

  1. Discuss the residual standard deviation/error as it relates to the accuracy of the regression model.

With the new regression model, the standard deviation error for the model is 0.334. Lower values (closer to zero) indicate better fit.

  1. Provide and interpret the confidence intervals of this streamlined regression. What do they mean? Do not omit their metrics.

For a one-unit change in [whatever the metric is of x], (involact) the fair plan policies and renewals per 100 housing units [sign says + increase or - decrease] by (absolute value of) coefficient plus or minus +/-.

As the long of the race (percentage of minority) increases by one unit, involact increases by 0.004 to 0.012 with 95% confidence.

As the long of the fire- fire per 100 housing units, increases by 100 units, involact increases by 0.021 to 0.053 with 95% confidence.

As the long of theft- theft per 1000 population- increases by 1000 population, involact decreases by -0.004 to -0.015 with 95% confidence.

As age- measured as percentage of housing units built before 1939- changes, involact increase by 0.002 to 0.012 with 95% confidence.

  1. Predict the distribution of average and all demand for new FAIR plan policies given the data for zip code 60614 and this streamlined regression. Compare this prediction to the one obtained in the last section.
## filter and sort the dataset
X60614 <- Redlining %>%
  filter(zip== 60614) %>%
  arrange(desc(fire)) %>%
  select(zip:side)
register("X60614", "Redlining")
# dtab(X60614, dec = 2, nr = 100) %>% render()

result <- regress(
  Redlining, 
  rvar = "involact", 
  evar = c("zip", "race", "fire", "theft", "age", "income")
)
summary(
  result, 
  sum_check = c("rmse", "sumsquares", "vif", "confint")
)
Linear regression (OLS)
Data     : Redlining 
Response variable    : involact 
Explanatory variables: zip, race, fire, theft, age, income 
Null hyp.: the effect of x on involact is zero
Alt. hyp.: the effect of x on involact is not zero

             coefficient std.error t.value p.value    
 (Intercept)     267.662   238.764   1.121   0.269    
 zip              -0.004     0.004  -1.124   0.268    
 race              0.009     0.002   3.782  < .001 ***
 fire              0.037     0.009   4.370  < .001 ***
 theft            -0.011     0.003  -3.769  < .001 ***
 age               0.008     0.003   2.786   0.008 ** 
 income            0.022     0.032   0.695   0.491    

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.758,  Adjusted R-squared: 0.722 
F-statistic: 20.932 df(6,40), p.value < .001
Nr obs: 47 

Prediction error (RMSE):  0.308 
Residual st.dev   (RSD):  0.334 

Sum of squares:
           df     SS
Regression  6 14.016
Error      40  4.464
Total      46 18.480

Variance Inflation Factors
    income  fire  race theft   age   zip
VIF  3.137 2.591 2.371 1.719 1.651 1.360
Rsq  0.681 0.614 0.578 0.418 0.394 0.265

            coefficient     2.5%   97.5%     +/-
(Intercept)     267.662 -214.898 750.222 482.560
zip              -0.004   -0.012   0.004   0.008
race              0.009    0.004   0.014   0.005
fire              0.037    0.020   0.054   0.017
theft            -0.011   -0.017  -0.005   0.006
age               0.008    0.002   0.013   0.006
income            0.022   -0.042   0.086   0.064
pred <- predict(result, pred_data = X60614)
print(pred, n = 10)
Linear regression (OLS)
Data                 : Redlining 
Response variable    : involact 
Explanatory variables: zip, race, fire, theft, age, income 
Interval             : confidence 
Prediction dataset   : X60614 

       zip   race  fire  theft    age income Prediction  2.5% 97.5%   +/-
 60614.000 24.500 8.600 53.000 81.400  9.730      0.377 0.128 0.626 0.249
X60614 <- store(X60614, pred, name = "pred_reg60614")

Predictions will first show the confidence interval for the average: 0.128 0.626, on average, for the zip code 60614 with 95% confidence. Given 60614’s data. Average (involact) the fair plan policies and renewals per 100 housing units should range between 0.12 to 0.62 with 95% confidence.

pred <- predict(result, pred_data = X60614, interval="prediction")
print(pred, n = 10)
Linear regression (OLS)
Data                 : Redlining 
Response variable    : involact 
Explanatory variables: zip, race, fire, theft, age, income 
Interval             : prediction 
Prediction dataset   : X60614 

       zip   race  fire  theft    age income Prediction   2.5% 97.5%   +/-
 60614.000 24.500 8.600 53.000 81.400  9.730      0.377 -0.343 1.096 0.719

Now the prediction interval for the data with 95% probability; it spans -0.343 to 1.096 per 100 housing units.

  1. Given the evidence that you have compiled, is there evidence of race being a driver of redlining? In other words, how should this evidence be interpreted in light of the civil rights issues raised at the beginning? Justify your response from evidence in the data and use graphical/visual evidence to support your claims where possible carefully explaining what the graphic shows.
result <- regress(
  Redlining, 
  rvar = "race", 
  evar = c(
    "zip", "fire", "theft", "age", "involact", "income", "side", 
    "side01"
  ), 
  check = "stepwise-backward"
)
Start:  AIC=279.77
race ~ zip + fire + theft + age + involact + income + side + 
    side01


Step:  AIC=279.77
race ~ zip + fire + theft + age + involact + income + side

           Df Sum of Sq   RSS    AIC
- zip       1       5.0 12869 277.78
- fire      1      84.5 12948 278.07
<none>                  12864 279.77
- age       1    1089.3 13953 281.59
- theft     1    1690.2 14554 283.57
- side      1    2314.0 15178 285.54
- involact  1    3587.1 16451 289.33
- income    1    4050.3 16914 290.63

Step:  AIC=277.78
race ~ fire + theft + age + involact + income + side

           Df Sum of Sq   RSS    AIC
- fire      1      86.6 12956 276.10
<none>                  12869 277.78
- age       1    1112.4 13982 279.68
- theft     1    1769.6 14639 281.84
- side      1    2313.5 15183 283.56
- involact  1    3663.1 16532 287.56
- income    1    4092.0 16961 288.76

Step:  AIC=276.1
race ~ theft + age + involact + income + side

           Df Sum of Sq   RSS    AIC
<none>                  12956 276.10
- age       1    1028.0 13984 277.69
- theft     1    2230.8 15186 281.57
- side      1    2578.7 15534 282.63
- income    1    4190.0 17146 287.27
- involact  1    4851.7 17808 289.05
summary(
  result, 
  sum_check = c("rmse", "sumsquares", "vif", "confint")
)
----------------------------------------------------
Backward stepwise selection of variables
----------------------------------------------------
Linear regression (OLS)
Data     : Redlining 
Response variable    : race 
Explanatory variables: zip, fire, theft, age, involact, income, side, side01 
Null hyp.: the effect of x on race is zero
Alt. hyp.: the effect of x on race is not zero

             coefficient std.error t.value p.value    
 (Intercept)      73.457    22.777   3.225   0.002 ** 
 theft             0.333     0.125   2.657   0.011 *  
 age              -0.283     0.157  -1.804   0.079 .  
 involact         22.531     5.750   3.918  < .001 ***
 income           -5.045     1.385  -3.641  < .001 ***
 side|s           16.936     5.928   2.857   0.007 ** 

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared: 0.735,  Adjusted R-squared: 0.702 
F-statistic: 22.718 df(5,41), p.value < .001
Nr obs: 47 

Prediction error (RMSE):  16.603 
Residual st.dev   (RSD):  17.776 

Sum of squares:
           df         SS
Regression  5 35,894.094
Error      41 12,955.726
Total      46 48,849.820

Variance Inflation Factors
    income involact   age  side theft
VIF  2.120    1.934 1.824 1.302 1.135
Rsq  0.528    0.483 0.452 0.232 0.119

            coefficient   2.5%   97.5%    +/-
(Intercept)      73.457 27.458 119.456 45.999
theft             0.333  0.080   0.586  0.253
age              -0.283 -0.600   0.034  0.317
involact         22.531 10.918  34.143 11.612
income           -5.045 -7.843  -2.247  2.798
side|s           16.936  4.963  28.908 11.973
plot(result, plots = "dashboard", lines = c("line", "loess"), nrobs = -1, custom = FALSE)

Redlining <- store(Redlining, result, name = "raceresids")

We see that there is a wide range in the race variable with some zip codes being almost entirely minority or non-minority. We can clearly see that homeowners in zip codes with high % minority are being denied insurance at higher rate than other zip codes. Also, Based on the new model, race is related to involact and income with 5% significance. That is not in doubt. However, the race effect is not entirely conclusive. It would be a very good idea to obtain some individual level data.

library(visreg)
visreg(result$model, data=Redlining)

Happy Holidays, Robert! Thank you for everything

Karol