This document serves as my formal report and analysis for the STAT567 final project. I have also provided a Word document with the summarized project abstract (separate and under 300 words, as requested).
The purpose of this report is to investigate and analyze the “Chicago Insurance Redlining Data” (CIR). This dataset was used in a study conducted by the U.S. Commission on Civil Rights. In that study the Commission investigated allegations made by community members in Chicago, IL, against multiple insurance companies: the allegations claimed that some insurance companies were “redlining” their neighborhoods. Redlining is a term for “canceling policies or refusing to insure or renew” an insured. Most concerning, these community members alleged that the insurance companies were discriminating based on race/ethnicity, which is an illegal practice.
The primary objective of this report is to “model the Chicago insurance redlining (CIR) data to predict whether a residential or housing unit in Chicago has insurance availability, based on other data about the homeowners” (per Project Outline). More specifically, I will explore the CIR dataset (details below), determine valid assumptions under which to fit linear models, and carry out model comparison and diagnostics, model selection, model evaluation, and cross-validation. These steps are done in order to make relevant inferences using the selected model.
The CIR data was compiled by the Illinois (IL) Dept. of Insurance (DOI), the Chicago Police and Fire Departments, and the US Census Bureau.
The secondary objective of this report is to use similar statistical analysis to make an ecological inference concerning the “relation between the incomes of the US born and the proportion of the US born within the state. This study is relevant to the analysis of the CIR data since this data is the aggregate data, and the results from the aggregated data may not hold true at the individual level” (per Project Outline). This analysis will similarly be limited to linear modeling.
In order to make this ecological inference I used the DEMOG dataset, which contains four variables for each of the 50 states plus the District of Columbia: cap.income, usborn, home, and pop (names as they appear in the model output later in this report).
Once familiarized with the datasets, I will provide my solutions to the primary and secondary objectives defined above. Note: the ordering of the report follows the order of sub-tasks defined in the project outline; for this reason the primary and secondary objectives will not be analyzed in the order given above.
First we will explore both datasets through exploratory data analysis (EDA). Using various EDA methods, I will note observations regarding variable relationships, identify outlying data points, and make the data transformations needed to satisfy model assumptions (strictly linear models).
Task 1: “Make a numerical and graphical summary of both sets of the data, commenting on any features that you find interesting. Limit the output you present to a quality that a busy reader would find sufficient to get a basic understanding of the data.”
Task 2: “Perform diagnostics check for relationships and strong interactions among the variables of the data. Report what you did and your findings. Are there any remedial measures needed”?
We will start by visually inspecting the DEMOG data. Recall that it contains demographic information at the US state level of granularity.
Note: the first column represents State (i.e. a categorical variable).
Next we will visually inspect the CIR dataset. Recall that the CIR data is at the zip-code level in Chicago, IL (the zip code column in the CIR data is unique).
Note: the first column represents Zip-Code (i.e. a categorical variable).
Task 3: “For data DEMOG.txt, fit three simple linear regression models of the per capita income on each of the three predictor variables. Does a linear regression model appear to provide a good fit for each of the three predictor variables? Use all appropriate tests, descriptive measures, and plots to conclude your findings here. Which predictor variable leads to significant effect on the per capita income?”
Fit 3 Linear Models using Simple Linear Regression to predict Aggregated Per Capita Income by each Demographic Variable:
Model 1: \(Y_{cap.income} \sim \theta_{0} + \theta_{1}X_{usborn} + \epsilon_i\)
Model 2: \(Y_{cap.income} \sim \theta_{0} + \theta_{1}X_{home}+ \epsilon_j\)
Model 3: \(Y_{cap.income} \sim \theta_{0} + \theta_{1}X_{pop}+ \epsilon_k\)
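The three fits listed above could be produced along the following lines. This is only a sketch: `demo_df` is a simulated stand-in (hypothetical values, not the real DEMOG data), the range chosen for `home` is an assumption, and `slr_table` is an illustrative helper that does not appear in the original analysis.

```r
# Simulated stand-in for the DEMOG data (hypothetical values, NOT the real data).
set.seed(567)
n <- 51  # 50 states plus the District of Columbia
demo_df <- data.frame(
  usborn = runif(n, 0.75, 0.99),  # proportion of residents born in the US
  home   = runif(n, 40, 80),      # assumed scale; the real units may differ
  pop    = runif(n, 5e5, 3e7)     # state population
)
demo_df$cap.income <- 68000 - 46000 * demo_df$usborn + rnorm(n, sd = 3500)

# Fit one simple linear regression per predictor.
# Using data = ... keeps coefficient names short (usborn, not demo_df$usborn).
predictors <- c("usborn", "home", "pop")
fits <- lapply(predictors, function(p) {
  lm(as.formula(paste("cap.income ~", p)), data = demo_df)
})
names(fits) <- predictors

# One row per model: slope, its p-value, and R-squared.
slr_table <- t(sapply(fits, function(f) {
  s <- summary(f)
  c(slope     = unname(coef(f)[2]),
    p_value   = s$coefficients[2, "Pr(>|t|)"],
    r_squared = s$r.squared)
}))
print(slr_table)
```

Collecting the slope, p-value, and R-squared in one table makes the three models easy to compare side by side before reading the full summaries.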
We define Model 1: \[Y_{cap.income} \sim \theta_{0} + \theta_{1}X_{usborn} + \epsilon_i\]
The table below the header “Summary for Model 1:” contains diagnostic information on Model 1, including:
- the standard errors
- t-test values and p-values for the model coefficients
- degrees of freedom
- residual summary statistics
- the result of an F-test
## [1] "Summary for Model 1:"
##
## Call:
## lm(formula = DEMOG_df$cap.income ~ DEMOG_df$usborn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6836.5 -2591.5 250.7 1332.1 10262.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68642 8739 7.855 3.19e-10 ***
## DEMOG_df$usborn -46019 9279 -4.959 8.89e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3490 on 49 degrees of freedom
## Multiple R-squared: 0.3342, Adjusted R-squared: 0.3206
## F-statistic: 24.6 on 1 and 49 DF, p-value: 8.891e-06
We define Model 2:
\[Y_{cap.income} \sim \theta_{0} + \theta_{1}X_{home}+ \epsilon_j\]
## [1] "Summary for Model 2:"
##
## Call:
## lm(formula = DEMOG_df$cap.income ~ DEMOG_df$home)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6339.9 -2453.4 -134.7 1614.5 11847.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31398.51 2482.62 12.647 <2e-16 ***
## DEMOG_df$home -99.08 39.74 -2.494 0.0161 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4029 on 49 degrees of freedom
## Multiple R-squared: 0.1126, Adjusted R-squared: 0.09449
## F-statistic: 6.218 on 1 and 49 DF, p-value: 0.01607
We define Model 3 as: \[Y_{cap.income} \sim \theta_{0} + \theta_{1}X_{pop}+ \epsilon_k\]
## [1] "Summary for Model 3:"
##
## Call:
## lm(formula = DEMOG_df$cap.income ~ DEMOG_df$pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5980.4 -2791.6 -988.1 2193.4 12708.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.446e+04 7.841e+02 31.190 <2e-16 ***
## DEMOG_df$pop 1.874e-04 1.079e-04 1.736 0.0888 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4151 on 49 degrees of freedom
## Multiple R-squared: 0.05796, Adjusted R-squared: 0.03873
## F-statistic: 3.015 on 1 and 49 DF, p-value: 0.0888
Question: Does SLR appear to provide a good fit for each of the three predictor variables?
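Answer: Based on visual inspection of the three scatterplots directly above (Models 1-3), SLR appears to provide an “OK” fit for Model 1 and Model 3. The numerical summaries are more modest: the R-squared values are 0.3342 (Model 1), 0.1126 (Model 2), and 0.05796 (Model 3), so even the best of the three models explains only about a third of the variation in per capita income.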
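As a numerical cross-check on the three summaries above, the R-squared of a simple linear regression can be recovered from its F-statistic and residual degrees of freedom. The identity is standard for SLR; the inputs are the rounded values printed in the summaries, so agreement is only up to rounding.

\[R^2 = \frac{F}{F + (n - 2)}, \qquad n - 2 = 49\]
\[\text{Model 1: } \frac{24.6}{24.6 + 49} \approx 0.334, \qquad \text{Model 2: } \frac{6.218}{6.218 + 49} \approx 0.113, \qquad \text{Model 3: } \frac{3.015}{3.015 + 49} \approx 0.058\]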
Question: Which predictor variable leads to significant effect on the per capita income?
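Answer: The usborn predictor in Model 1 has the most significant effect on per capita income (p = 8.89e-06). The home predictor in Model 2 is significant at the 5% level (p = 0.0161), while pop in Model 3 is not (p = 0.0888). The fitted Model 1 is:
$$Y_{cap.income} \sim \theta_{0} + \theta_{1}X_{usborn} + \epsilon_i \ s.t.:\ \hat{\theta}_0 = 68,642;\ \hat{\theta}_{usborn} = -46,019$$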
Task 4: Perform regression diagnostics on the highly significant model in part 3 to check the assumption of the regression model. Report what you did and your findings.
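One way the diagnostics asked for in Task 4 could be run is sketched below. The data are simulated stand-ins for usborn and cap.income (hypothetical values, not the real DEMOG data), and the 4/n and 2p/n cutoffs are common rules of thumb rather than thresholds taken from the project outline.

```r
# Simulated stand-in for Model 1 (hypothetical values, NOT the real data).
set.seed(42)
n <- 51
x <- runif(n, 0.75, 0.99)                     # stand-in for usborn
y <- 68000 - 46000 * x + rnorm(n, sd = 3500)  # stand-in for cap.income
fit <- lm(y ~ x)

# Normality of the residuals (Shapiro-Wilk test).
res <- residuals(fit)
sw <- shapiro.test(res)

# Influence: Cook's distance, flagged with the 4/n rule of thumb.
cd <- cooks.distance(fit)
influential <- which(cd > 4 / n)

# Leverage: hat values above twice the average leverage p/n.
high_leverage <- which(hatvalues(fit) > 2 * length(coef(fit)) / n)

# Residuals vs fitted (linearity, constant variance) and normal Q-Q plot.
op <- par(mfrow = c(1, 2))
plot(fit, which = 1:2)
par(op)

cat("Shapiro-Wilk p-value:", signif(sw$p.value, 3), "\n")
cat("points flagged by Cook's distance:", length(influential), "\n")
cat("points flagged by leverage:", length(high_leverage), "\n")
```

The plots address linearity and constant variance, the Shapiro-Wilk test addresses normality, and the Cook's distance and leverage checks point to individual states that may dominate the fit.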
Interpret the value of the coefficient of the usborn predictor variable. What does this say about the average annual income of people who are US born and those who are naturalized citizens? Explain clearly here your findings. In fact, information US Bureau of the Census indicates that US born citizens have an average income just slightly larger than naturalized citizens. Do your findings support this? Explain why or why not. If not, comments on what went wrong with your analysis?
Recall from the DEMOG analysis that, of the three SLR models, Model 1 (usborn) had the most significant effect on the response, income per capita.
Therefore, if \(X_{usborn}\) increases by 0.01, that is, by one percentage point (0.01 × 100 = 1%), the predicted per capita income ($) changes by:
$$\frac{-46,019}{100} = -460.19$$ Interpreting Model 1, a one-percentage-point increase in the share of the state population born in the US is associated with a $460.19 decrease in the state income per capita.
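The gap between this state-level slope and individual-level incomes can be illustrated with a small simulation. Everything here is hypothetical: the individual-level premium of $1,000 for US-born residents, the state baselines, and the sample sizes are assumptions chosen only to show how an aggregate regression can reverse the sign of an individual-level effect.

```r
set.seed(1)
n_states <- 51
n_people <- 1000  # simulated residents per state (hypothetical)

# State-level share of US-born residents, and a state baseline income that is
# higher where that share is lower (an assumed confounding pattern).
usborn_share <- runif(n_states, 0.75, 0.99)
baseline <- 70000 - 46000 * usborn_share

people <- do.call(rbind, lapply(seq_len(n_states), function(s) {
  usborn <- rbinom(n_people, 1, usborn_share[s])
  # Individual-level truth: US-born residents earn $1,000 MORE on average.
  income <- baseline[s] + 1000 * usborn + rnorm(n_people, sd = 8000)
  data.frame(state = s, usborn = usborn, income = income)
}))

# Aggregate (state-level) regression, analogous to Model 1.
agg <- aggregate(cbind(income, usborn) ~ state, data = people, FUN = mean)
agg_slope <- unname(coef(lm(income ~ usborn, data = agg))["usborn"])

# Individual-level comparison within states (state fixed effects).
ind_fit <- lm(income ~ usborn + factor(state), data = people)
ind_slope <- unname(coef(ind_fit)["usborn"])

cat("aggregate slope:", round(agg_slope), "\n")
cat("individual-level slope:", round(ind_slope), "\n")
```

In this setup the state-level slope is strongly negative even though every simulated US-born resident earns more than a comparable foreign-born resident in the same state, which is the kind of ecological-inference problem the project outline warns about.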
Q: “For data CIR.txt, regress involact on race and interpret the coefficient. Test the hypothesis to determine the claim that homeowners in zip codes with high percent minority are being denied insurance at higher rate than other zip codes. What can regression analysis tell you about the insurance companies claim that the discrepancy is due to greater risks in some zip codes?”
\[Y_{involact} \sim \theta_{0} + \theta_{1}X_{race} + \epsilon\]
- First, we find our “optimal” (least-squares) parameter estimates.
- Second, we print a summary of our model, which includes tests and diagnostics.
##
## Call:
## lm(formula = CIR_df$involact ~ CIR_df$race)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7496 -0.2479 -0.1487 0.3129 1.1724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.129218 0.096611 1.338 0.188
## CIR_df$race 0.013882 0.002031 6.836 1.78e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4488 on 45 degrees of freedom
## Multiple R-squared: 0.5094, Adjusted R-squared: 0.4985
## F-statistic: 46.73 on 1 and 45 DF, p-value: 1.784e-08
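The claim being tested is directional (higher percent minority, higher involact), so a one-sided test is a natural reading. The sketch below uses only the estimate, standard error, and degrees of freedom printed in the summary above; the choice of a 10-point change in race for the interpretation is an illustrative assumption.

```r
# Values copied from the summary of involact ~ race above.
est <- 0.013882  # estimated slope for race
se  <- 0.002031  # its standard error
df  <- 45        # residual degrees of freedom

t_stat <- est / se

# H0: slope <= 0 versus H1: slope > 0 (upper-tail test).
p_one_sided <- pt(t_stat, df = df, lower.tail = FALSE)

# Predicted change in involact for a 10-point increase in race.
delta_10 <- 10 * est

cat("t =", round(t_stat, 3), " one-sided p =", signif(p_one_sided, 3), "\n")
cat("change in involact per 10 points of race:", round(delta_10, 4), "\n")
```

Because the estimated slope is positive, the one-sided p-value is half of the two-sided value reported by `summary()`. This single-predictor test cannot by itself separate race from risk factors such as fire or theft, which is why the multi-predictor fits that follow matter.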
##
## Call:
## lm(formula = CIR_df$involact ~ CIR_df$income + CIR_df$race +
## CIR_df$theft + CIR_df$fire + CIR_df$income + CIR_df$age +
## CIR_df$volact)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.84296 -0.14613 -0.01007 0.18386 0.81235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.862e-01 6.020e-01 -0.808 0.424109
## CIR_df$income 2.568e-05 3.220e-05 0.798 0.429759
## CIR_df$race 8.527e-03 2.863e-03 2.978 0.004911 **
## CIR_df$theft -1.016e-02 2.908e-03 -3.494 0.001178 **
## CIR_df$fire 3.778e-02 8.982e-03 4.206 0.000142 ***
## CIR_df$age 7.615e-03 3.330e-03 2.287 0.027582 *
## CIR_df$volact -1.018e-02 2.773e-02 -0.367 0.715519
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3387 on 40 degrees of freedom
## Multiple R-squared: 0.7517, Adjusted R-squared: 0.7144
## F-statistic: 20.18 on 6 and 40 DF, p-value: 1.072e-10
##
## Call:
## lm(formula = CIR_df$involact ~ CIR_df$income + CIR_df$race +
## CIR_df$theft + CIR_df$fire + CIR_df$income + CIR_df$age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.84428 -0.15804 -0.04093 0.18116 0.80828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6089790 0.4952601 -1.230 0.225851
## CIR_df$income 0.0000245 0.0000317 0.773 0.443982
## CIR_df$race 0.0091325 0.0023158 3.944 0.000307 ***
## CIR_df$theft -0.0102976 0.0028529 -3.610 0.000827 ***
## CIR_df$fire 0.0388166 0.0084355 4.602 4e-05 ***
## CIR_df$age 0.0082707 0.0027815 2.973 0.004914 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3351 on 41 degrees of freedom
## Multiple R-squared: 0.7508, Adjusted R-squared: 0.7204
## F-statistic: 24.71 on 5 and 41 DF, p-value: 2.159e-11
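The two fits above are nested, so dropping volact can also be judged with a partial F-test. A sketch of the arithmetic, using the residual sums of squares that appear in the stepwise trace later in this report (4.5892 with volact, 4.6047 without) and the 40 residual degrees of freedom of the larger model:

\[F = \frac{(RSS_{reduced} - RSS_{full})/1}{RSS_{full}/40} = \frac{4.6047 - 4.5892}{4.5892/40} \approx \frac{0.0155}{0.1147} \approx 0.135\]

This agrees with the square of the volact t value, \((-0.367)^2 \approx 0.135\), so the partial F-test carries the same p-value of about 0.716 and supports removing volact.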
##
## Call:
## lm(formula = CIR_df$involact ~ CIR_df$income + CIR_df$race +
## CIR_df$theft + CIR_df$fire + CIR_df$race * CIR_df$income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09430 -0.15597 -0.05358 0.19461 0.87938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.332e-01 4.208e-01 1.029 0.30931
## CIR_df$income -2.813e-05 3.226e-05 -0.872 0.38828
## CIR_df$race 3.535e-03 7.220e-03 0.490 0.62697
## CIR_df$theft -8.303e-03 3.046e-03 -2.726 0.00938 **
## CIR_df$fire 4.049e-02 9.391e-03 4.312 9.92e-05 ***
## CIR_df$income:CIR_df$race 4.174e-07 7.558e-07 0.552 0.58376
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3681 on 41 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6627
## F-statistic: 19.07 on 5 and 41 DF, p-value: 9.192e-10
## Start: AIC=-95.34
## CIR_df$involact ~ CIR_df$income + CIR_df$race + CIR_df$theft +
## CIR_df$fire + CIR_df$income + CIR_df$age + CIR_df$volact
##
## Df Sum of Sq RSS AIC
## - CIR_df$volact 1 0.01546 4.6047 -97.184
## - CIR_df$income 1 0.07300 4.6622 -96.601
## <none> 4.5892 -95.342
## - CIR_df$age 1 0.59993 5.1892 -91.568
## - CIR_df$race 1 1.01743 5.6067 -87.931
## - CIR_df$theft 1 1.40048 5.9897 -84.825
## - CIR_df$fire 1 2.02990 6.6191 -80.129
##
## Step: AIC=-97.18
## CIR_df$involact ~ CIR_df$income + CIR_df$race + CIR_df$theft +
## CIR_df$fire + CIR_df$age
##
## Df Sum of Sq RSS AIC
## - CIR_df$income 1 0.06710 4.6718 -98.504
## <none> 4.6047 -97.184
## - CIR_df$age 1 0.99296 5.5977 -90.007
## - CIR_df$theft 1 1.46328 6.0680 -86.215
## - CIR_df$race 1 1.74657 6.3513 -84.070
## - CIR_df$fire 1 2.37807 6.9828 -79.615
##
## Step: AIC=-98.5
## CIR_df$involact ~ CIR_df$race + CIR_df$theft + CIR_df$fire +
## CIR_df$age
##
## Df Sum of Sq RSS AIC
## <none> 4.6718 -98.504
## - CIR_df$age 1 0.99734 5.6691 -91.410
## - CIR_df$theft 1 1.41436 6.0862 -88.074
## - CIR_df$race 1 2.05375 6.7256 -83.379
## - CIR_df$fire 1 2.38365 7.0554 -81.128
##
## Call:
## lm(formula = CIR_df$involact ~ CIR_df$race + CIR_df$theft + CIR_df$fire +
## CIR_df$age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.87108 -0.14830 -0.01961 0.19968 0.81638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.243118 0.145054 -1.676 0.101158
## CIR_df$race 0.008104 0.001886 4.297 0.000100 ***
## CIR_df$theft -0.009592 0.002690 -3.566 0.000921 ***
## CIR_df$fire 0.036646 0.007916 4.629 3.51e-05 ***
## CIR_df$age 0.007210 0.002408 2.994 0.004595 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3335 on 42 degrees of freedom
## Multiple R-squared: 0.7472, Adjusted R-squared: 0.7231
## F-statistic: 31.03 on 4 and 42 DF, p-value: 4.799e-12
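The AIC values in the stepwise trace above can be reproduced by hand. For `lm` fits, `step()` relies on `extractAIC()`, which computes n log(RSS/n) + 2p with p the number of fitted coefficients. The helper name `aic_lm` is hypothetical; the RSS values are copied from the trace, and n = 47 follows from 42 residual degrees of freedom plus 5 coefficients in the final model.

```r
# AIC as used by step() for lm fits: n * log(RSS / n) + 2 * p,
# where p counts the fitted coefficients (including the intercept).
aic_lm <- function(rss, n, p) n * log(rss / n) + 2 * p

n <- 47  # 42 residual df + 5 coefficients in the final model

aic_final       <- aic_lm(rss = 4.6718, n = n, p = 5)  # race + theft + fire + age
aic_with_income <- aic_lm(rss = 4.6047, n = n, p = 6)  # adds income
aic_full        <- aic_lm(rss = 4.5892, n = n, p = 7)  # adds income and volact

print(round(c(final = aic_final,
              with_income = aic_with_income,
              full = aic_full), 2))
```

The smaller models win because the drop in RSS from adding income or volact is too small to offset the penalty of 2 per extra coefficient.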
## Start: AIC=-97.18
## CIR_df$involact ~ CIR_df$income + CIR_df$race + CIR_df$theft +
## CIR_df$fire + CIR_df$income + CIR_df$age
##
## Df Sum of Sq RSS AIC
## - CIR_df$income 1 0.06710 4.6718 -98.504
## <none> 4.6047 -97.184
## - CIR_df$age 1 0.99296 5.5977 -90.007
## - CIR_df$theft 1 1.46328 6.0680 -86.215
## - CIR_df$race 1 1.74657 6.3513 -84.070
## - CIR_df$fire 1 2.37807 6.9828 -79.615
##
## Step: AIC=-98.5
## CIR_df$involact ~ CIR_df$race + CIR_df$theft + CIR_df$fire +
## CIR_df$age
##
## Df Sum of Sq RSS AIC
## <none> 4.6718 -98.504
## - CIR_df$age 1 0.99734 5.6691 -91.410
## - CIR_df$theft 1 1.41436 6.0862 -88.074
## - CIR_df$race 1 2.05375 6.7256 -83.379
## - CIR_df$fire 1 2.38365 7.0554 -81.128
##
## Call:
## lm(formula = CIR_df$involact ~ CIR_df$race + CIR_df$theft + CIR_df$fire +
## CIR_df$age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.87108 -0.14830 -0.01961 0.19968 0.81638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.243118 0.145054 -1.676 0.101158
## CIR_df$race 0.008104 0.001886 4.297 0.000100 ***
## CIR_df$theft -0.009592 0.002690 -3.566 0.000921 ***
## CIR_df$fire 0.036646 0.007916 4.629 3.51e-05 ***
## CIR_df$age 0.007210 0.002408 2.994 0.004595 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3335 on 42 degrees of freedom
## Multiple R-squared: 0.7472, Adjusted R-squared: 0.7231
## F-statistic: 31.03 on 4 and 42 DF, p-value: 4.799e-12
## Start: AIC=-88.35
## CIR_df$involact ~ CIR_df$income + CIR_df$race + CIR_df$theft +
## CIR_df$fire + CIR_df$race * CIR_df$income
##
## Df Sum of Sq RSS AIC
## - CIR_df$income:CIR_df$race 1 0.04133 5.5977 -90.007
## <none> 5.5563 -88.355
## - CIR_df$theft 1 1.00717 6.5635 -82.525
## - CIR_df$fire 1 2.51971 8.0760 -72.779
##
## Step: AIC=-90.01
## CIR_df$involact ~ CIR_df$income + CIR_df$race + CIR_df$theft +
## CIR_df$fire
##
## Df Sum of Sq RSS AIC
## - CIR_df$income 1 0.07148 5.6691 -91.410
## <none> 5.5977 -90.007
## - CIR_df$theft 1 0.97615 6.5738 -84.452
## - CIR_df$race 1 1.19784 6.7955 -82.893
## - CIR_df$fire 1 2.48203 8.0797 -74.757
##
## Step: AIC=-91.41
## CIR_df$involact ~ CIR_df$race + CIR_df$theft + CIR_df$fire
##
## Df Sum of Sq RSS AIC
## <none> 5.6691 -91.410
## - CIR_df$theft 1 1.1635 6.8326 -84.637
## - CIR_df$race 1 2.1173 7.7864 -78.495
## - CIR_df$fire 1 3.3753 9.0445 -71.456
##
## Call:
## lm(formula = CIR_df$involact ~ CIR_df$race + CIR_df$theft + CIR_df$fire)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.08722 -0.17134 -0.06811 0.21330 0.83941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.086870 0.102684 0.846 0.40224
## CIR_df$race 0.008226 0.002053 4.007 0.00024 ***
## CIR_df$theft -0.008639 0.002908 -2.971 0.00485 **
## CIR_df$fire 0.042334 0.008367 5.060 8.32e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3631 on 43 degrees of freedom
## Multiple R-squared: 0.6932, Adjusted R-squared: 0.6718
## F-statistic: 32.39 on 3 and 43 DF, p-value: 4.145e-11
## [,1]
## [1,] -28.41076
## [,1]
## [1,] -30.3953
## [,1]
## [1,] -29.44367