1 Project Introduction:

This document serves as my formal report and analysis for the STAT567 Final Project. I have also provided a Word document with the summarized Project Abstract (separate and < 300 words, as requested).

1.1 Background & Problem Set Up:

The purpose of this report is to investigate and analyze the “Chicago Insurance Redlining Data” (CIR). This dataset was used in a study conducted by the U.S. Commission on Civil Rights. In this study the Commission investigated allegations made by community members of Chicago, IL against multiple insurance companies. The allegations claimed that some insurance companies were “redlining” their neighborhoods. Redlining is a term for “canceling policies or refusing to insure or renew” an insured. Most concerning, these community members alleged that the insurance companies were discriminating based on race/ethnicity, which is an illegal practice.

1.2 Objectives:

The primary objective of this report is to “model the Chicago insurance redlining (CIR) data to predict whether a residential or housing unit in Chicago has insurance availability, based on other data about the homeowners” (per Project Outline). More specifically, I will explore the CIR dataset (details below), determine valid assumptions under which to fit linear models, and carry out model comparison and diagnostics, model selection, model evaluation and cross validation. These steps are done in order to make relevant inferences using the selected model.

The CIR data was compiled by the Illinois (IL) Dept. of Insurance (DOI), the Chicago Police & Fire Departments and the US Census Bureau.

1.3 CIR Data Elements:

race: racial composition in percent minority
fire: fires per 100 housing units
theft: theft per 1000 population
age: percent of housing units built before 1939
volact: new homeowner policies plus renewals minus cancellations and non renewals per 100 housing units
involact: new FAIR plan policies and renewals per 100 housing units
income: median family income

The secondary objective of this report is to use similar statistical analysis to make an ecological inference concerning the “relation between the incomes of the US born and the proportion of the US born within the state. This study is relevant to the analysis of the CIR data since this data is the aggregate data, and the results from the aggregated data may not hold true at the individual level.” (per Project Outline). This will similarly be limited to linear modeling.

In order to make this Ecological Inference I used the DEMOG dataset - this dataset contains the following four variables for each of the 50 states plus the District of Columbia.

1.4 DEMOG Data Elements:

State: name of the US state (or the District of Columbia); this is the row identifier, not one of the four variables
usborn: proportion of legal state residents born in the United States in 1990.
cap.income: per capita income dollars from all sources in 1998
home: total home in 1998
pop: total population in 1998

1.5 Report Roadmap:

Once familiarized with the datasets, I will provide my solutions to the primary and secondary objectives defined above. Note: the ordering of the report follows the order of sub-tasks defined in the project outline; for this reason the primary and secondary objectives defined above will not be analyzed in that order.

2 Exploratory Data Analysis (EDA):

First we will explore both datasets through exploratory data analysis (EDA). Using various EDA methods I will note observations regarding variable relationships, identify outlier data points and make the data transformations needed to satisfy model assumptions (strictly linear models).

2.1 Tasks 1 & 2:

Task 1: “Make a numerical and graphical summary of both sets of the data, commenting on any features that you find interesting. Limit the output you present to a quality that a busy reader would find sufficient to get a basic understanding of the data.”
Task 2: “Perform diagnostics check for relationships and strong interactions among the variables of the data. Report what you did and your findings. Are there any remedial measures needed”?

2.1.1 EDA - DEMOG Data:

We will start by visually inspecting the DEMOG data - recall this contains demographic information at the US State level of granularity.

2.1.2 Scatter Plot Matrix:

scatter plot matrix: shows pairwise comparisons of our multivariate dataset using scatterplots/density plots
density plots along plot diagonal represent empirical distribution of respective variable
Upper right triangle shows calculated correlations between variables $(i,j)$ - where $i$ = row number and $j$ = col. number.

Note: the first column represents State (i.e. a categorical variable).
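A scatter plot matrix of this kind can be sketched in base R. The snippet below is only an illustration under assumptions: `toy` is a made-up data frame standing in for the numeric columns, `panel_cor` and `panel_dens` are hypothetical helper names, and the figure in this report may well have been produced with a different plotting function.

```r
# Made-up numeric data standing in for the real columns (assumption).
set.seed(1)
toy <- data.frame(x1 = rnorm(50))
toy$x2 <- 0.6 * toy$x1 + rnorm(50, sd = 0.8)
toy$x3 <- rnorm(50)

# Upper panel: print the pairwise Pearson correlation in the cell.
panel_cor <- function(x, y, ...) {
  old <- par(usr = c(0, 1, 0, 1))
  on.exit(par(old))
  text(0.5, 0.5, sprintf("%.3f", cor(x, y)))
}

# Diagonal panel: draw the empirical density of the variable.
panel_dens <- function(x, ...) {
  d <- density(x)
  usr <- par("usr")
  old <- par(usr = c(usr[1:2], 0, 1.1 * max(d$y)))
  on.exit(par(old))
  lines(d$x, d$y)
}

pairs(toy, upper.panel = panel_cor, diag.panel = panel_dens)

# The same correlations as a matrix, for reading off exact values.
cor_matrix <- cor(toy)
```

A categorical identifier column (such as the State column noted above) would be dropped before calling `pairs()` or `cor()`, since both expect numeric input.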

2.1.3 Observations - DEMOG:

cap.income appears to be the only DEMOG variable w/ an approx. Normal distribution
- confirm this with a normal Quantile-Quantile plot (i.e. QQ-Plot)
- this is consistent with one of the necessary assumptions for fitting a linear model with cap.income as the response variable (strictly, the Normality assumption concerns the model residuals, which are checked in Task 4)
there appears to be an approx. negative linear association between usborn and cap.income
- this would imply that, on average, as the proportion of US-born residents increases, State income per capita decreases - we will revisit this.
- possible outlier data points? Can check with a boxplot.
there appears to be an approx. positive linear association between usborn and home
- this would imply that, on average, as the proportion of US-born residents increases, the State total number of homes increases

2.1.4 Diagnostics - DEMOG:

from Scatter Matrix can see the following correlations:
- Correlation(usborn, cap.income) = -0.578; correlation(usborn, home) = 0.459
Histogram of cap.income to see univariate empirical distribution
Normal Quantile-Quantile - visual confirmation that cap.income is approx. Normally distributed - $cap.income\ \approx N(\mu, \sigma^2)$
- from histogram can tell the general shape is approx. normal and symmetric around the mean
- the Q-Q plot is further confirmation for the assumption: $cap.income\ \approx N(\mu, \sigma^2)$
- from the Boxplot (w/ jittered points) of cap.income the spread appears roughly symmetric around the mean, apart from two points beyond the upper whisker
  - consider removing the two outlier points before fitting any model w/ this feature as response
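The three checks listed above (histogram, Normal Q-Q plot, boxplot outliers) can be sketched as follows. The `income` vector is made-up data with two deliberately large values, not the DEMOG values, and is only meant to show which base R calls are involved.

```r
# Made-up incomes: 49 roughly Normal values plus two large values (assumption).
set.seed(2)
income <- c(rnorm(49, mean = 25000, sd = 3000), 38000, 40000)

# Univariate shape: histogram and Normal Q-Q plot with reference line.
hist(income, main = "Histogram", xlab = "per capita income")
qqnorm(income)
qqline(income)

# boxplot.stats() returns the points lying beyond the whiskers
# (more than 1.5 * IQR from the hinges), i.e. the plotted "outliers".
flagged <- boxplot.stats(income)$out
boxplot(income, horizontal = TRUE)

# Data with the flagged points removed, for a sensitivity refit.
income_trimmed <- income[!income %in% flagged]
```

Refitting a model with and without the flagged points is a cheap way to see how much the conclusions depend on them.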

2.1.5 EDA - CIR Data:

We will start by visually inspecting the CIR dataset. Recall the CIR data is at the Zip-code level in Chicago, IL. (zip code col. in CIR data is unique).

2.1.6 Scatter Plot Matrix:

scatter plot matrix: shows pairwise comparisons of our multivariate dataset using scatterplots/density plots
density plots along plot diagonal represent empirical distribution of respective variable
Upper right triangle shows calculated correlations between variables $(i,j)$ - where $i$ = row number and $j$ = col. number.

Note: the first column represents Zip-Code (i.e. a categorical variable).

2.1.7 Observations - CIR:

involact appears not to be Normally distributed
- this is the response variable for some of the Linear Models we will fit below - try a transform to make it approx. normal (e.g. try log(involact))
- confirm this with a normal Quantile-Quantile plot (i.e. QQ-Plot)
there appears to be an approx. linear relationship between:
- income and volact (positive)
- race and involact (positive)
- race and fire (positive)
- fire and involact (positive)
Note: zip is a categorical variable

2.1.8 Diagnostics - CIR:

see the next four plots: involact (histogram), log(involact) (histogram), log(involact) (density plot) and log(involact) (Normal Q-Q Plot)
- after the log() transform, it appears from the Normal Q-Q plot and from the shape and symmetry of the density plot & histogram that:
- $\log(involact) \sim N(\mu, \sigma^2)$
from Scatter Matrix can see the following correlations (i.e. relationships between variables):
- Correlation(volact, income) = 0.751; Correlation(race, involact) = 0.714; Correlation(race, fire) = 0.593; Correlation(fire, involact) = 0.703
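A before/after comparison for a log transform can be sketched as below. The response `y` here is simulated and strictly positive, which is an assumption: `log()` returns `-Inf` for zeros, so a real rate variable would need to be checked first, and something like `log1p()` would be one possible alternative in that case.

```r
# Simulated right-skewed, strictly positive response (assumption, not the CIR data).
set.seed(3)
y <- rlnorm(47, meanlog = -1, sdlog = 0.8)

# Guard: the log transform is only finite for positive values.
stopifnot(all(y > 0))
log_y <- log(y)

# Side-by-side Normal Q-Q plots of the raw and transformed response.
par(mfrow = c(1, 2))
qqnorm(y, main = "raw")
qqline(y)
qqnorm(log_y, main = "log")
qqline(log_y)
par(mfrow = c(1, 1))
```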

2.1.9 EDA - Conclusion:

Through the EDA process we visualized both univariate and multivariate plots, which led us to variables with a linear association (CIR & DEMOG)
We ran diagnostics to calculate the correlation coefficient between pairs of variables we suspected of having a linear association (CIR & DEMOG)
We also identified that involact is not normally distributed - however log(involact) $\approx N(\mu, \sigma^2)$ (CIR)
Through visual inspection of the boxplot we identified two outliers, which we can try removing from our data before fitting any linear models (DEMOG)

3 Simple Linear Regression Analysis (DEMOG)

3.1 Task 3:

“For data DEMOG.txt, fit three simple linear regression models of the per capita income on each of the three predictor variables. Does a linear regression model appear to provide a good fit for each of the three predictor variables? Use all appropriate tests, descriptive measures, and plots to conclude your findings here. Which predictor variable leads to significant effect on the per capita income?”

Fit 3 Linear Models using Simple Linear Regression to predict Aggregated Per Capita Income by each Demographic Variable:
- Model 1: $Y_{cap.income} \sim \theta_{0} + \theta_{1}X_{usborn} + \epsilon_i$
- Model 2: $Y_{cap.income} \sim \theta_{0} + \theta_{1}X_{home}+ \epsilon_j$
- Model 3: $Y_{cap.income} \sim \theta_{0} + \theta_{1}X_{pop}+ \epsilon_k$
  - where: $\epsilon \sim N(0,\sigma^2)$
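Fitting the three models can be sketched with `lm()` as below. The data frame `demo` is simulated and only borrows the DEMOG column names; its values and the helper table `slr_table` are assumptions for illustration.

```r
# Simulated stand-in with the DEMOG column names (values are made up).
set.seed(4)
demo <- data.frame(usborn = runif(51, 0.75, 0.99),
                   home   = runif(51, 30, 80),
                   pop    = runif(51, 5e5, 3e7))
demo$cap.income <- 68000 - 46000 * demo$usborn + rnorm(51, sd = 3500)

# One simple linear regression per predictor.
predictors <- c("usborn", "home", "pop")
fits <- lapply(predictors, function(p) {
  lm(reformulate(p, response = "cap.income"), data = demo)
})
names(fits) <- predictors

# Collect slope, t value, p-value and R-squared of each fit in one table.
slr_table <- t(sapply(fits, function(f) {
  s <- summary(f)
  c(slope = s$coefficients[2, "Estimate"],
    t     = s$coefficients[2, "t value"],
    p     = s$coefficients[2, "Pr(>|t|)"],
    r2    = s$r.squared)
}))
```

Passing `data = demo` with bare column names (rather than `demo$usborn` inside the formula) keeps the coefficient labels short and lets `predict()` work on new data.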

3.1.1 SLR Model 1 (DEMOG):

We define Model 1:

\[Y_{cap.income} \sim \theta_{0} + \theta_{1}X_{usborn} + \epsilon_i\]

See table below the header, “Summary for Model 1:”
- this table contains diagnostic information on Model 1 including:
  - the standard errors
  - t-test values & p-values for model coeffs
  - degrees of freedom
  - residual summary statistics
  - result of an F-test

## [1] "Summary for Model 1:"

## 
## Call:
## lm(formula = DEMOG_df$cap.income ~ DEMOG_df$usborn)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6836.5 -2591.5   250.7  1332.1 10262.6 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        68642       8739   7.855 3.19e-10 ***
## DEMOG_df$usborn   -46019       9279  -4.959 8.89e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3490 on 49 degrees of freedom
## Multiple R-squared:  0.3342, Adjusted R-squared:  0.3206 
## F-statistic:  24.6 on 1 and 49 DF,  p-value: 8.891e-06

3.1.2 SLR Model 2 (DEMOG):

We define Model 2:

\[Y_{cap.income} \sim \theta_{0} + \theta_{1}X_{home}+ \epsilon_j\]

See table below the header, “Summary for Model 2:”
- this table contains diagnostic information on Model 2 including:
  - the standard errors
  - t-test values & p-values for model coeffs
  - degrees of freedom
  - residual summary statistics
  - result of an F-test

## [1] "Summary for Model 2:"

## 
## Call:
## lm(formula = DEMOG_df$cap.income ~ DEMOG_df$home)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6339.9 -2453.4  -134.7  1614.5 11847.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   31398.51    2482.62  12.647   <2e-16 ***
## DEMOG_df$home   -99.08      39.74  -2.494   0.0161 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4029 on 49 degrees of freedom
## Multiple R-squared:  0.1126, Adjusted R-squared:  0.09449 
## F-statistic: 6.218 on 1 and 49 DF,  p-value: 0.01607

3.1.3 SLR Model 3 (DEMOG):

We define Model 3 as: \[Y_{cap.income} \sim \theta_{0} + \theta_{1}X_{pop}+ \epsilon_k\]

See table below the header, “Summary for Model 3:”
- this table contains diagnostic information on Model 3 including:
  - the standard errors
  - t-test values & p-values for model coeffs
  - degrees of freedom
  - residual summary statistics
  - result of an F-test

## [1] "Summary for Model 3:"

## 
## Call:
## lm(formula = DEMOG_df$cap.income ~ DEMOG_df$pop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5980.4 -2791.6  -988.1  2193.4 12708.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.446e+04  7.841e+02  31.190   <2e-16 ***
## DEMOG_df$pop 1.874e-04  1.079e-04   1.736   0.0888 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4151 on 49 degrees of freedom
## Multiple R-squared:  0.05796,    Adjusted R-squared:  0.03873 
## F-statistic: 3.015 on 1 and 49 DF,  p-value: 0.0888

3.1.4 SLR Compare Models 1-3 (DEMOG):

Question: Does SLR appear to provide a good fit for each of the three predictor variables?
- Answer: Based off the 3 scatterplots directly above (Models 1-3), it appears SLR provides an “OK” fit to Model 1 and Model 3. I came to this conclusion through visual inspection of the scatterplots for Models 1-3.
  - Consider Model 1:
    - clear negative linear association (i.e. as usborn increases, the State level Per Capita Income tends to decrease)
    - visual inspection leads me to believe if we were to remove the outlier data points (identify outliers through box and whisker plots) then our model may have a better fit.
  - Consider Model 2:
    - no clear relationship between predictor and response variables.
    - because we see no relationship between predictor and response I conclude SLR does not provide a good fit to this data
  - Consider Model 3:
    - clear positive linear association (i.e. as pop increases, the State level Per Capita Income tends to increase)
    - as with Model 1, visual inspection leads me to believe that removing the outlier data points (identified through box and whisker plots) may result in SLR being a better fit to this data.
Question: Which predictor variable leads to significant effect on the per capita income?
- Answer: The predictor in Model 1, usborn, has a significant effect on the per capita income. I arrived at this answer by analyzing the results of a two-sided t-test (see the Summary table above for Model 1), with hypotheses $H_0:\ \theta_{usborn} = 0;\ H_a: \theta_{usborn} \neq 0$, where $H_0$ and $H_a$ represent the null and alternative hypotheses respectively. Reading from the Model 1 Summary table, for the single predictor (DEMOG_df.usborn) we have Pr(>|t|) = 8.89e-06, which is below $\alpha = 0.0005$ (i.e. 99.95% confidence) $\implies$ reject $H_0$ in favor of $H_a$. Therefore, in SLR Model 1 the coefficient estimate for usborn is -46,019, and the two-sided t-test shows it is significant at level $\alpha = 0.0005$. From the first question, SLR is not a good fit for Model 2 (its slope has Pr(>|t|) = 0.0161, which is below 0.05 but not below $\alpha = 0.0005$), so I set it aside - that leaves Model 3. Based on the summary table for Model 3, we can carry out the same two-sided t-test ($H_0: \theta_{pop} = 0;\ H_a: \theta_{pop} \neq 0$). The summary table for Model 3 gives, for the parameter DEMOG.pop, t-value = 1.736 and Pr(>|t|) = 0.0888 > 0.05 $\implies$ we fail to reject $H_0$. Therefore the coefficient estimate for DEMOG.pop in Model 3 is not statistically significant at the 5% level, and so not at $\alpha = 0.0005$ either. Using visual inspection of the scatterplots together with the t-tests and F-tests (included in the summary tables), I conclude that the highly significant model is Model 1, given by:

$$Y_{cap.income} \sim \theta_{0} + \theta_{1}X_{usborn} + \epsilon_i \ \ s.t.:\  \hat{\theta}_0 = 68,642;\ \hat{\theta}_{usborn} = -46,019$$
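The t-test decisions can be re-derived by hand from the printed estimates and standard errors. The sketch below uses only the rounded numbers from the Model 1 and Model 3 summary tables, so the results agree with the printed t values and p-values only up to rounding; the variable names are my own.

```r
# Values read from "Summary for Model 1" (rounded as printed).
est1 <- -46019
se1  <- 9279
df   <- 49                         # residual degrees of freedom

t1 <- est1 / se1                   # t statistic for H0: theta_usborn = 0
p1 <- 2 * pt(-abs(t1), df)         # two-sided p-value
reject1 <- p1 < 0.0005             # decision at alpha = 0.0005

# 95% confidence interval for the usborn slope.
ci1 <- est1 + c(-1, 1) * qt(0.975, df) * se1

# Same test for the pop slope, values from "Summary for Model 3".
t3 <- 1.874e-04 / 1.079e-04
p3 <- 2 * pt(-abs(t3), df)
reject3 <- p3 < 0.05
```

Here `reject1` is `TRUE` and `reject3` is `FALSE`, and the interval `ci1` lies entirely below zero, which matches the conclusions drawn from the summary tables.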

3.2 Task 4: Diagnostics on Model 1 (SLR):

Task 4: Perform regression diagnostics on the highly significant model in part 3 to check the assumption of the regression model. Report what you did and your findings.

3.2.1 Approach:

Given that Model 1 is our highly significant model from part 3, we can check the assumptions of the regression model by first creating a boxplot of the residuals.
- this boxplot is seen directly below and we can see the $mean\ \approx 0$ - we also recognize there are two outliers towards the upper bound (whisker).
- similarly, below the boxplot there is a Normal Q-Q plot of the Model 1 Residuals - from visual inspection we can validate the assumption that the Model 1 residuals are:
Model 1 Residuals$\ =\ \epsilon \sim N(\mu_{\epsilon}, \sigma^{2}_{\epsilon})$

3.2.2 Task 4 (Cont.):

To complete Task 4 I will present a plot of the Model1 residuals against the Model1 fitted values (see scatterplot below)
- in this scatterplot we can see there is no clear relationship between Model1 Residuals and Model1 Fitted Values.
- this is consistent with the regression function being linear and the variance of the error terms being constant.
- together with the normality check above, we have now verified the assumptions of the regression Model
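The three residual plots used in this task can be sketched as follows, on simulated data rather than DEMOG (the coefficients in the simulation are only chosen to resemble Model 1).

```r
# Simulated data resembling Model 1 (assumption, not the DEMOG values).
set.seed(6)
x <- runif(51, 0.75, 0.99)
y <- 68000 - 46000 * x + rnorm(51, sd = 3500)
fit <- lm(y ~ x)
res <- resid(fit)

par(mfrow = c(1, 3))
boxplot(res, main = "Residuals")                 # centre and outliers
qqnorm(res)                                      # Normality of residuals
qqline(res)
plot(fitted(fit), res,                           # linearity / constant variance
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
par(mfrow = c(1, 1))
```

One caveat worth keeping in mind: for a least-squares fit with an intercept, the residuals average to zero and are uncorrelated with the fitted values by construction, so the informative part of these plots is the shape (skew, curvature, funnelling, outliers) rather than the centre.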

3.3 Task 5: Interpret Value of Model 1 Coefficient

Interpret the value of the coefficient of the usborn predictor variable. What does this say about the average annual income of people who are US born and those who are naturalized citizens? Explain clearly here your findings. In fact, information US Bureau of the Census indicates that US born citizens have an average income just slightly larger than naturalized citizens. Do your findings support this? Explain why or why not. If not, comments on what went wrong with your analysis?

Recall from SLR Model 1 (DEMOG) - we fit Model 1 and we found Model 1 to have the most significant impact on the response, Income per Capita.

Model 1 is of the form: $Y_{cap.income} \sim \theta_{0} + \theta_{1}X_{usborn} + \epsilon_i$
- Such that: $\hat{\theta}_{0} = 68,642;\ \hat{\theta}_{usborn} = -46,019;\ \epsilon_{i} \sim N(0, \sigma^2)$

3.3.1 Interpret Model 1 Coefficient:

Recall our predictor is $usborn$ (% State population born in the US)
Our response for Model1 is the Per Capita Income ($).
Therefore if $usborn$ increases by 1 percentage point (0.01 X 100 = 1%), or equivalently when $usborn$ increases by 0.01, the Per Capita Income ($) changes on average by:
$$\frac{-46,019}{100} = -460.19$$
i.e. it decreases by about 460.19 dollars.
Note: we divide by 100 here b.c. when we fit Model 1, $usborn$ was entered as a decimal (proportion), not a percentage
- therefore if we were considering a 100 percentage point increase in $usborn$ (from 0 to 1) then $\implies$ on avg. Per Capita Income ($) decreases by 46,019 dollars
- we will stay with the 1 percentage point interpretation as opposed to the 100 point one, as it is more intuitive given Model 1 was fit with $usborn$ as a decimal, not a %.
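The arithmetic behind this interpretation, using the fitted coefficients from the Model 1 summary, can be checked in a few lines. The function name `predict_income` is a hypothetical helper, not part of the original analysis.

```r
# Coefficients read from the Model 1 summary table.
theta0 <- 68642
theta1 <- -46019

# Fitted line: expected per capita income for a given proportion US-born.
predict_income <- function(usborn) theta0 + theta1 * usborn

# Effect of a one percentage point (0.01) increase in usborn.
change_per_point <- theta1 * 0.01

# The same change obtained as a difference of two fitted values.
diff_fitted <- predict_income(0.91) - predict_income(0.90)
```

Both `change_per_point` and `diff_fitted` equal -460.19 (up to floating point), i.e. the fitted line drops by about 460 dollars per percentage point of `usborn`.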

3.3.2 Ecological Inference:

My findings do not support those from the US Census, which indicate that “US born citizens have an average income just slightly larger than naturalized citizens”.
As shown directly above, interpreting Model 1 shows that a 1 percentage point increase in the share of the State population born in the US is associated with a $460.19 DECREASE in the State income per capita.
I don’t believe this implies I did something wrong in my analysis - instead this may be an example of what’s known as the Ecological Paradox (also called the ecological fallacy).
- That is, suppose we analyze data at a granular level, calculate some metric of interest and note any significant trends.
- Next we calculate the same metric, but this time using aggregated data as opposed to the data at a lower level of granularity; the trend found in the aggregate need not match the granular one.
- For our ex. above the US Census has found that on an INDIVIDUAL level, “US born citizens have an average income just slightly larger than naturalized citizens”
- However, when I ran my analysis aggregated to the State level I found the inverse relationship - this is why this is an example of a paradox.
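A small constructed example shows how the individual-level and aggregate-level relationships can point in opposite directions. Everything in it is hypothetical: ten toy "states" of 100 people, a fixed 500 dollar advantage for US-born individuals inside every state, and state baselines that happen to be lower where the US-born share is higher.

```r
# Hypothetical toy population: 10 "states", 100 people each (no real data).
p_us  <- seq(0.80, 0.98, by = 0.02)        # share of US-born per state
base  <- 40000 - 50000 * (p_us - 0.80)     # baseline income falls as the share rises
bonus <- 500                               # US-born earn 500 more within each state

people <- do.call(rbind, lapply(seq_along(p_us), function(g) {
  n_us <- round(100 * p_us[g])
  born <- c(rep(1, n_us), rep(0, 100 - n_us))
  data.frame(state = g, usborn = born, income = base[g] + bonus * born)
}))

# Individual level: income gap (US-born minus foreign-born) inside each state.
within_gap <- sapply(split(people, people$state), function(d) {
  mean(d$income[d$usborn == 1]) - mean(d$income[d$usborn == 0])
})

# Aggregate level: regress state mean income on state share of US-born.
agg <- aggregate(cbind(income, usborn) ~ state, data = people, FUN = mean)
agg_slope <- unname(coef(lm(income ~ usborn, data = agg))["usborn"])
```

In this toy setup every entry of `within_gap` is positive while `agg_slope` is negative, so a state-level regression alone would suggest the opposite of what holds for individuals. It does not show that this mechanism is what drives the DEMOG result, only that the two findings are not contradictory.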

3.3.3 Task 6:

Q: “For data CIR.txt, regress involact on race and interpret the coefficient. Test the hypothesis to determine the claim that homeowners in zip codes with high percent minority are being denied insurance at higher rate than other zip codes. What can regression analysis tell you about the insurance companies claim that the discrepancy is due to greater risks in some zip codes?”

Our Model is defined as:

\[Y_{involact} \sim \theta_{0} + \theta_{1}X_{race} + \epsilon\]
- first we will find our “optimal” parameters
- second we will print a summary of our model - this includes tests and diagnostics

## 
## Call:
## lm(formula = CIR_df$involact ~ CIR_df$race)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7496 -0.2479 -0.1487  0.3129  1.1724 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.129218   0.096611   1.338    0.188    
## CIR_df$race 0.013882   0.002031   6.836 1.78e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4488 on 45 degrees of freedom
## Multiple R-squared:  0.5094, Adjusted R-squared:  0.4985 
## F-statistic: 46.73 on 1 and 45 DF,  p-value: 1.784e-08

3.3.4 Interpret Model Coefficients:

Based on the model fit above (regress involact onto race) and the summary table directly above this comment, we can interpret the model coefficient.
$\hat{\theta}_{race}$ $\approx 0.0139$
Therefore we can interpret this relationship as follows:
- A 1 percentage point increase in $race$ (percent minority) corresponds, on average, to a 0.0139 increase in $involact$.
- Recall: $involact$ is # FAIR policies per 100 housing units - so this is about 0.0139*100 = 1.39 additional FAIR policies per 10,000 housing units
- For the hypothesis test $H_0: \theta_{race} = 0$ the summary table reports t value = 6.836 and Pr(>|t|) = 1.78e-08, so we reject $H_0$: zip codes with a higher percent minority have significantly more FAIR plan activity.
- Using multivariate linear modeling we can include features in our model that measure risk (e.g. fire and theft) and then check whether we see the same effect of race once the factors Ins. Companies cite for “higher risk areas” are accounted for (this is the Ins. Co’s reasoning for redlining)
- It is still hard to draw a firm conclusion that individuals in zip codes with high % minority are denied insurance at a higher rate than other surrounding zip codes.
  - this is another example of the difficulty in making an ecological correlation when we don’t actually have the data for # persons denied insurance broken out by race.
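The interpretation and the test can be reproduced from the printed coefficients alone. The sketch below uses the rounded values from the summary table; the one-sided alternative ($\theta_{race} > 0$) is my reading of the claim being tested, and `predict_involact` is a hypothetical helper name.

```r
# Values read from the involact ~ race summary table (rounded as printed).
b0  <- 0.129218
b1  <- 0.013882
se1 <- 0.002031
df  <- 45

# One-sided test of H0: theta_race = 0 against Ha: theta_race > 0.
t_stat      <- b1 / se1
p_one_sided <- pt(t_stat, df, lower.tail = FALSE)

# Fitted FAIR policies per 100 housing units at a given percent minority.
predict_involact <- function(race) b0 + b1 * race

# Contrast between a 0% and a 100% minority zip code.
gap_0_to_100 <- predict_involact(100) - predict_involact(0)
```

`gap_0_to_100` works out to 1.3882, i.e. roughly 1.4 more FAIR plan policies per 100 housing units between the two extremes of `race`, before any adjustment for fire, theft or other risk measures.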

3.4 Task 7 - Multivariate Linear Modeling:

“Choose three multiple regression models of the response involact based on different sets of predictor variables, interaction terms, and transformation of the predictor variables for the data CIR.txt. Which of the three models seem best here? Which variables seem to be important in each model? Report what you did and your findings.”

3.5 Print Summary of Each Model:

Here I print the summary for each multivariate linear model created above
The info. in these summary tables includes diagnostics/tests which will aid in selecting our “best model”
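For orientation, the three model formulas that appear in the `Call:` lines of the summaries below can be written compactly as follows. The data frame `cir` is simulated and only borrows the CIR column names, so the fitted numbers are meaningless; the point is the formula syntax and a side-by-side comparison table (`comparison` is a hypothetical name).

```r
# Simulated stand-in with the CIR column names (values are made up).
set.seed(10)
n <- 47
cir <- data.frame(race   = runif(n, 0, 100),
                  fire   = rexp(n, 1 / 12),
                  theft  = rpois(n, 30),
                  age    = runif(n, 2, 90),
                  volact = runif(n, 0, 15),
                  income = runif(n, 5000, 20000))
cir$involact <- pmax(0, 0.01 * cir$race + 0.03 * cir$fire + rnorm(n, sd = 0.3))

# Model 1: all predictors; Model 2: drop volact; Model 3: race x income interaction.
m1 <- lm(involact ~ income + race + theft + fire + age + volact, data = cir)
m2 <- lm(involact ~ income + race + theft + fire + age, data = cir)
m3 <- lm(involact ~ income * race + theft + fire, data = cir)

# Compare size, adjusted R-squared and the AIC variant used by step().
models <- list(m1 = m1, m2 = m2, m3 = m3)
comparison <- data.frame(
  n_coef = sapply(models, function(m) length(coef(m))),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  aic    = sapply(models, function(m) extractAIC(m)[2])
)
```

Note that `income * race` expands to `income + race + income:race`, which is why the third summary lists both main effects plus the interaction term.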

## 
## Call:
## lm(formula = CIR_df$involact ~ CIR_df$income + CIR_df$race + 
##     CIR_df$theft + CIR_df$fire + CIR_df$income + CIR_df$age + 
##     CIR_df$volact)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.84296 -0.14613 -0.01007  0.18386  0.81235 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.862e-01  6.020e-01  -0.808 0.424109    
## CIR_df$income  2.568e-05  3.220e-05   0.798 0.429759    
## CIR_df$race    8.527e-03  2.863e-03   2.978 0.004911 ** 
## CIR_df$theft  -1.016e-02  2.908e-03  -3.494 0.001178 ** 
## CIR_df$fire    3.778e-02  8.982e-03   4.206 0.000142 ***
## CIR_df$age     7.615e-03  3.330e-03   2.287 0.027582 *  
## CIR_df$volact -1.018e-02  2.773e-02  -0.367 0.715519    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3387 on 40 degrees of freedom
## Multiple R-squared:  0.7517, Adjusted R-squared:  0.7144 
## F-statistic: 20.18 on 6 and 40 DF,  p-value: 1.072e-10

## 
## Call:
## lm(formula = CIR_df$involact ~ CIR_df$income + CIR_df$race + 
##     CIR_df$theft + CIR_df$fire + CIR_df$income + CIR_df$age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.84428 -0.15804 -0.04093  0.18116  0.80828 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.6089790  0.4952601  -1.230 0.225851    
## CIR_df$income  0.0000245  0.0000317   0.773 0.443982    
## CIR_df$race    0.0091325  0.0023158   3.944 0.000307 ***
## CIR_df$theft  -0.0102976  0.0028529  -3.610 0.000827 ***
## CIR_df$fire    0.0388166  0.0084355   4.602    4e-05 ***
## CIR_df$age     0.0082707  0.0027815   2.973 0.004914 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3351 on 41 degrees of freedom
## Multiple R-squared:  0.7508, Adjusted R-squared:  0.7204 
## F-statistic: 24.71 on 5 and 41 DF,  p-value: 2.159e-11

## 
## Call:
## lm(formula = CIR_df$involact ~ CIR_df$income + CIR_df$race + 
##     CIR_df$theft + CIR_df$fire + CIR_df$race * CIR_df$income)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09430 -0.15597 -0.05358  0.19461  0.87938 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                4.332e-01  4.208e-01   1.029  0.30931    
## CIR_df$income             -2.813e-05  3.226e-05  -0.872  0.38828    
## CIR_df$race                3.535e-03  7.220e-03   0.490  0.62697    
## CIR_df$theft              -8.303e-03  3.046e-03  -2.726  0.00938 ** 
## CIR_df$fire                4.049e-02  9.391e-03   4.312 9.92e-05 ***
## CIR_df$income:CIR_df$race  4.174e-07  7.558e-07   0.552  0.58376    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3681 on 41 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6627 
## F-statistic: 19.07 on 5 and 41 DF,  p-value: 9.192e-10

3.5.1 Interpret Multivar. Models - Model Diagnostics:

Out of the 3 multivar. models created above, it’s clear that either Model1 or Model2 is the “best” of the 3
- I came to this conclusion by comparing each model’s $R^2$ value (Adjusted $R^2$ of 0.7144 and 0.7204 for Models 1 & 2 vs. 0.6627 for Model 3) along with the number of significant predictor variables.
Models 1 & 2 have very similar model diagnostics/test results - further analysis is needed to determine which model is “a better fit”.

3.6 Predictor Subset Selection: Backward Stepwise Regression Method

Perform Subset Selection using Backward Stepwise Regression:

## Start:  AIC=-95.34
## CIR_df$involact ~ CIR_df$income + CIR_df$race + CIR_df$theft + 
##     CIR_df$fire + CIR_df$income + CIR_df$age + CIR_df$volact
## 
##                 Df Sum of Sq    RSS     AIC
## - CIR_df$volact  1   0.01546 4.6047 -97.184
## - CIR_df$income  1   0.07300 4.6622 -96.601
## <none>                       4.5892 -95.342
## - CIR_df$age     1   0.59993 5.1892 -91.568
## - CIR_df$race    1   1.01743 5.6067 -87.931
## - CIR_df$theft   1   1.40048 5.9897 -84.825
## - CIR_df$fire    1   2.02990 6.6191 -80.129
## 
## Step:  AIC=-97.18
## CIR_df$involact ~ CIR_df$income + CIR_df$race + CIR_df$theft + 
##     CIR_df$fire + CIR_df$age
## 
##                 Df Sum of Sq    RSS     AIC
## - CIR_df$income  1   0.06710 4.6718 -98.504
## <none>                       4.6047 -97.184
## - CIR_df$age     1   0.99296 5.5977 -90.007
## - CIR_df$theft   1   1.46328 6.0680 -86.215
## - CIR_df$race    1   1.74657 6.3513 -84.070
## - CIR_df$fire    1   2.37807 6.9828 -79.615
## 
## Step:  AIC=-98.5
## CIR_df$involact ~ CIR_df$race + CIR_df$theft + CIR_df$fire + 
##     CIR_df$age
## 
##                Df Sum of Sq    RSS     AIC
## <none>                      4.6718 -98.504
## - CIR_df$age    1   0.99734 5.6691 -91.410
## - CIR_df$theft  1   1.41436 6.0862 -88.074
## - CIR_df$race   1   2.05375 6.7256 -83.379
## - CIR_df$fire   1   2.38365 7.0554 -81.128

## 
## Call:
## lm(formula = CIR_df$involact ~ CIR_df$race + CIR_df$theft + CIR_df$fire + 
##     CIR_df$age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.87108 -0.14830 -0.01961  0.19968  0.81638 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.243118   0.145054  -1.676 0.101158    
## CIR_df$race   0.008104   0.001886   4.297 0.000100 ***
## CIR_df$theft -0.009592   0.002690  -3.566 0.000921 ***
## CIR_df$fire   0.036646   0.007916   4.629 3.51e-05 ***
## CIR_df$age    0.007210   0.002408   2.994 0.004595 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3335 on 42 degrees of freedom
## Multiple R-squared:  0.7472, Adjusted R-squared:  0.7231 
## F-statistic: 31.03 on 4 and 42 DF,  p-value: 4.799e-12

## Start:  AIC=-97.18
## CIR_df$involact ~ CIR_df$income + CIR_df$race + CIR_df$theft + 
##     CIR_df$fire + CIR_df$income + CIR_df$age
## 
##                 Df Sum of Sq    RSS     AIC
## - CIR_df$income  1   0.06710 4.6718 -98.504
## <none>                       4.6047 -97.184
## - CIR_df$age     1   0.99296 5.5977 -90.007
## - CIR_df$theft   1   1.46328 6.0680 -86.215
## - CIR_df$race    1   1.74657 6.3513 -84.070
## - CIR_df$fire    1   2.37807 6.9828 -79.615
## 
## Step:  AIC=-98.5
## CIR_df$involact ~ CIR_df$race + CIR_df$theft + CIR_df$fire + 
##     CIR_df$age
## 
##                Df Sum of Sq    RSS     AIC
## <none>                      4.6718 -98.504
## - CIR_df$age    1   0.99734 5.6691 -91.410
## - CIR_df$theft  1   1.41436 6.0862 -88.074
## - CIR_df$race   1   2.05375 6.7256 -83.379
## - CIR_df$fire   1   2.38365 7.0554 -81.128

## 
## Call:
## lm(formula = CIR_df$involact ~ CIR_df$race + CIR_df$theft + CIR_df$fire + 
##     CIR_df$age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.87108 -0.14830 -0.01961  0.19968  0.81638 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.243118   0.145054  -1.676 0.101158    
## CIR_df$race   0.008104   0.001886   4.297 0.000100 ***
## CIR_df$theft -0.009592   0.002690  -3.566 0.000921 ***
## CIR_df$fire   0.036646   0.007916   4.629 3.51e-05 ***
## CIR_df$age    0.007210   0.002408   2.994 0.004595 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3335 on 42 degrees of freedom
## Multiple R-squared:  0.7472, Adjusted R-squared:  0.7231 
## F-statistic: 31.03 on 4 and 42 DF,  p-value: 4.799e-12

## Start:  AIC=-88.35
## CIR_df$involact ~ CIR_df$income + CIR_df$race + CIR_df$theft + 
##     CIR_df$fire + CIR_df$race * CIR_df$income
## 
##                             Df Sum of Sq    RSS     AIC
## - CIR_df$income:CIR_df$race  1   0.04133 5.5977 -90.007
## <none>                                   5.5563 -88.355
## - CIR_df$theft               1   1.00717 6.5635 -82.525
## - CIR_df$fire                1   2.51971 8.0760 -72.779
## 
## Step:  AIC=-90.01
## CIR_df$involact ~ CIR_df$income + CIR_df$race + CIR_df$theft + 
##     CIR_df$fire
## 
##                 Df Sum of Sq    RSS     AIC
## - CIR_df$income  1   0.07148 5.6691 -91.410
## <none>                       5.5977 -90.007
## - CIR_df$theft   1   0.97615 6.5738 -84.452
## - CIR_df$race    1   1.19784 6.7955 -82.893
## - CIR_df$fire    1   2.48203 8.0797 -74.757
## 
## Step:  AIC=-91.41
## CIR_df$involact ~ CIR_df$race + CIR_df$theft + CIR_df$fire
## 
##                Df Sum of Sq    RSS     AIC
## <none>                      5.6691 -91.410
## - CIR_df$theft  1    1.1635 6.8326 -84.637
## - CIR_df$race   1    2.1173 7.7864 -78.495
## - CIR_df$fire   1    3.3753 9.0445 -71.456

## 
## Call:
## lm(formula = CIR_df$involact ~ CIR_df$race + CIR_df$theft + CIR_df$fire)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.08722 -0.17134 -0.06811  0.21330  0.83941 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.086870   0.102684   0.846  0.40224    
## CIR_df$race   0.008226   0.002053   4.007  0.00024 ***
## CIR_df$theft -0.008639   0.002908  -2.971  0.00485 ** 
## CIR_df$fire   0.042334   0.008367   5.060 8.32e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3631 on 43 degrees of freedom
## Multiple R-squared:  0.6932, Adjusted R-squared:  0.6718 
## F-statistic: 32.39 on 3 and 43 DF,  p-value: 4.145e-11
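The AIC values in the stepwise traces above can be reproduced from the printed RSS values. For `lm` fits, `extractAIC()` (the criterion `step()` uses) computes $n \log(RSS/n) + 2k$, where $k$ is the number of estimated coefficients; the sketch below plugs in $n = 47$ and the RSS figures from the first trace. The function name `aic_lm` is my own.

```r
# AIC as used by step() for linear models: n * log(RSS / n) + 2 * k.
aic_lm <- function(rss, n, k) n * log(rss / n) + 2 * k

n <- 47   # number of zip codes (residual df + number of coefficients)

aic_full      <- aic_lm(4.5892, n, k = 7)  # all six predictors + intercept
aic_no_volact <- aic_lm(4.6047, n, k = 6)  # after dropping volact
aic_final     <- aic_lm(4.6718, n, k = 5)  # after also dropping income

round(c(aic_full, aic_no_volact, aic_final), 2)
```

These agree with the `AIC=-95.34`, `AIC=-97.18` and `AIC=-98.5` headers in the trace: each removal raises the RSS slightly, but the saving of 2 per dropped coefficient outweighs it, so backward selection keeps going until no removal lowers the AIC.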

3.7 Mallows’ Criterion

Based on the AIC, Adjusted $R^2$ and Mallows’ $C_p$ statistic, I would suggest Multivar Model 2 as the model with the “best fit”.

##           [,1]
## [1,] -28.41076

##          [,1]
## [1,] -30.3953

##           [,1]
## [1,] -29.44367
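For reference, the textbook form of Mallows' statistic is $C_p = RSS_p / \hat{\sigma}^2_{full} - (n - 2p)$, where $p$ counts the coefficients of the candidate model and $\hat{\sigma}^2_{full}$ is the residual mean square of the largest model. The sketch below applies that definition to RSS values printed in the stepwise traces; the code that produced the three values shown above is not included in this report, so this is not claimed to reproduce them, and `mallows_cp` is a hypothetical helper.

```r
# Textbook Mallows' Cp: RSS_p / MSE_full - (n - 2 * p).
mallows_cp <- function(rss_p, p, mse_full, n) rss_p / mse_full - (n - 2 * p)

n        <- 47
mse_full <- 4.5892 / 40            # full-model RSS over its 40 residual df

cp_full     <- mallows_cp(4.5892, p = 7, mse_full, n)  # all six predictors
cp_model2   <- mallows_cp(4.6047, p = 6, mse_full, n)  # without volact
cp_stepwise <- mallows_cp(4.6718, p = 5, mse_full, n)  # race, theft, fire, age
```

Under this definition the full model always has $C_p = p$ (here 7), and candidate models with $C_p$ close to or below their own $p$ are considered adequate; smaller values are preferred.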

STAT567 Final Project

Adam Shetler

5/6/2020