Functions	Tasks
`plot_summs()`	Creates regression coefficient plots
`stargazer()`	Formats regression analysis results into a well-organized table.
`formula()`	Returns a formulae included in other R objects.
`par()`	Can be used to set or query graphical parameters. See ?par for more details.
`points()`	Draws a point or sequence of points at the specified coordinates. see ?points for more details.
`AIC()`	Compare AIC across models
`anova()`	Test for nested models to see if other variables are needed (hierarchical regression approach)
`resettest()`	Resettest to check if higher order terms needed

1. Introduction

We continue to focus on getting hands-on experience of multiple regression modeling in R. We will focus on various ways of variable selection.

Researchers and planners usually choose the dependent and independent variables based on various criteria, such as their research questions, their empirical and theoretical knowledge as experts, and their lived experience as urban residents. For example, theories suggest that there are sets of variables that affect individual’s decision to walk, and some of these variables are empirically found to have statistically significant relationship with walking behavior. Most researchers base their variable selection on theories; for example a researcher focusing on walkability is likely to rely on the theory on walking, such as 3D’s framework and the hierarchical walking needs theory, as well as on their own experience to refine variable selections.

However, we sometimes need additional help in selecting the right set of variables, particularly when we are dealing with issues with weak or no established theories, intuition, and/or experience to guide us in building regression models. The variable selection methods we will examine in this lab can provide supplementary information that we can take into consideration when specifying regression models.

Today, we will examine what variables are related with crime. We will use the number of crime incidences per 1000 persons as our dependent variable (i.e., the variable being explained by a set of other variables; often referred to as simply y variable). As the candidate for independent variables (i.e., a set variables that are used to explain the dependent variable; often referred to as x variables), we have various variables summarized below.

Variable	What it is measuring
crime.percap	The number of crime incidences per 1000 people (dependent variable)
dist.cbd	The distance from the city center (city hall) to the centroid of the Census Tract. Measured in meters.
b2s	The average of the building-to-street ratio calculated using Google Street View images.
s2s	The average of the sidewalk-to-street proportion calculated using Google Street View images.
green	The average proportion of greenery calculated using Google Street View images.
gsv.index	The average of the walkability index calculated using Google Street View images.
walkscore	The average of Walk Score.
p.age65	The share of population above age 65 out of the total population.
p.service	The share of population employed in service sector.
p.crowd.room	The share of population living in a room in which more than two people share the room.
p.hs	The share of population with less than high school education out of those who are age 25 or above.
p.unemp	The share of population who are unemployed out of those who are in labor force.
p.pov	The share of population whose income is under the poverty line out of those whose poverty status is identified
popden	The Population density
mroom	The median number of rooms in the Census Tract.
mval	The median housing value of the Census Tract.

2. Setting the working directory & reading data

Let’s first download required packages in R. In addition to the functions in base R, we will use regsubsets() in “leaps”, ggpairs() in “GGally”, and stargazer in “stargazer” package. The data file can be found on Canvas.

# Install required packages
if (!require("GGally")) install.packages("GGally")
if (!require("leaps")) install.packages("leaps")
if (!require("stargazer")) install.packages("stargazer")
if (!require("lmtest")) install.packages("lmtest")
if (!require("jtools")) install.packages("jtools")
if (!require("ggstance")) install.packages("ggstance")
if (!require("broom.mixed")) install.packages("broom.mixed")

# Call the packages using library()
library(GGally)
library(leaps)
library(stargazer)
library(lmtest)
library(jtools)
library(ggstance)
library(broom.mixed)

# USE YOUR OWN working directory and file name
setwd("C:\\Users\\cod\\Desktop\\PhD Files\\GRA & GTA\\GTA\\CP6025 Fall 2021\\Week 10\\Lab8_2020") # Use your own pathname
gsv.census <- read.csv("Week10_gsv_census.csv") 

str(gsv.census)

## 'data.frame':    118 obs. of  16 variables:
##  $ walkscore   : num  0.338 0.451 0.645 0.76 0.707 ...
##  $ dist.cbd    : num  4495 3587 4382 3528 3595 ...
##  $ b2s         : num  0.0548 0.0469 0.1497 0.1375 0.1421 ...
##  $ green       : num  0.436 0.407 0.297 0.332 0.316 ...
##  $ s2s         : num  0.0564 0.0685 0.1112 0.1504 0.1344 ...
##  $ gsv.index   : num  0.0117 -0.1493 0.4972 1.7117 1.1917 ...
##  $ p.age65     : num  0.0785 0.1311 0.0684 0.0789 0.1407 ...
##  $ p.service   : num  0.0492 0.0971 0.1769 0.0829 0.1324 ...
##  $ p.crowd.room: num  0.0051 0 0.00482 0.00696 0 ...
##  $ p.hs        : num  0.01316 0.01008 0.02415 0.00474 0.06532 ...
##  $ p.unemp     : num  0.0178 0.0194 0.0298 0.0172 0.0982 ...
##  $ p.pov       : num  0.0804 0.0919 0.0876 0.0699 0.213 ...
##  $ popden      : num  1894 1393 2125 2521 1961 ...
##  $ mroom       : num  7.3 5.6 6.4 4.9 5.3 4.5 5.1 6.1 5.7 5.8 ...
##  $ mval        : num  617300 548200 470300 433000 239600 ...
##  $ crime.percap: num  0.275 0.513 0.397 0.835 2.272 ...

3. Selecting independent variables based on your knowledge and experience

Let’s assume that this is a dataset that you created and cleaned for your option paper. Some variables are representing physical characteristics of a Census Tract. Others are representing socioeconomic characteristics that are known to represent resource deprivation (e.g., this study).

As always, let’s start by taking a look at the characteristics of the individual variable’s distribution as well as their scatterplots.

# ggpairs
var.list <- c("crime.percap", "walkscore", "mroom", "p.crowd.room", "popden", "dist.cbd", "mval", "b2s", "s2s", "green", "p.pov", "p.unemp", "p.hs", "p.service", "p.age65")
ggpairs(gsv.census[ , var.list])

The graph shows that, first of all, the number of variables are already too many to be nicely displayed in one graph. Since the graph is unclear, due to the number of variables, creating a correlation matrix is a better option (see below). Although we can see there are some independent variables that appear to have some patterns with the dependent variable, the pattern is not so clear. Also, the dependent variable (i.e., crime.percap) and some of the independent variables show substantial skewness which warrant our attention. But again, let’s pretend for now that these issues are not present. Some of these variables do appear highly correlated to each other.

Given that we have inspected the variables, let’s build a multiple regression model. The issue here is that I don’t know much about crime. Since I don’t know any theory and knowledge to guide me in variable selection, I will use common sense here and select distance from cbd, poverty rate, and unemployment rate. In real regression modeling, however, it is VERY IMPORTANT to have read what other researchers and planners have discovered and theorized so far. You should not kitchen sink – your model should be grounded in the past findings.

#Correlation Matrix
cor_matrix <- cor(gsv.census[var.list])
round(cor_matrix,2)

##              crime.percap walkscore mroom p.crowd.room popden dist.cbd  mval
## crime.percap         1.00      0.40 -0.51         0.13   0.08    -0.36 -0.34
## walkscore            0.40      1.00 -0.68         0.18   0.71    -0.75  0.01
## mroom               -0.51     -0.68  1.00        -0.32  -0.59     0.50  0.35
## p.crowd.room         0.13      0.18 -0.32         1.00   0.21    -0.09 -0.16
## popden               0.08      0.71 -0.59         0.21   1.00    -0.52 -0.03
## dist.cbd            -0.36     -0.75  0.50        -0.09  -0.52     1.00  0.04
## mval                -0.34      0.01  0.35        -0.16  -0.03     0.04  1.00
## b2s                  0.47      0.76 -0.65         0.19   0.74    -0.62 -0.06
## s2s                  0.32      0.80 -0.54         0.14   0.61    -0.72 -0.09
## green               -0.60     -0.72  0.70        -0.32  -0.50     0.63  0.30
## p.pov                0.43      0.01 -0.25         0.29  -0.05    -0.06 -0.67
## p.unemp              0.36     -0.16  0.00         0.05  -0.24     0.06 -0.55
## p.hs                 0.25     -0.21 -0.04         0.16  -0.23     0.12 -0.63
## p.service            0.10     -0.15  0.04         0.11  -0.17     0.04 -0.40
## p.age65             -0.09     -0.50  0.39        -0.30  -0.44     0.39  0.03
##                b2s   s2s green p.pov p.unemp  p.hs p.service p.age65
## crime.percap  0.47  0.32 -0.60  0.43    0.36  0.25      0.10   -0.09
## walkscore     0.76  0.80 -0.72  0.01   -0.16 -0.21     -0.15   -0.50
## mroom        -0.65 -0.54  0.70 -0.25    0.00 -0.04      0.04    0.39
## p.crowd.room  0.19  0.14 -0.32  0.29    0.05  0.16      0.11   -0.30
## popden        0.74  0.61 -0.50 -0.05   -0.24 -0.23     -0.17   -0.44
## dist.cbd     -0.62 -0.72  0.63 -0.06    0.06  0.12      0.04    0.39
## mval         -0.06 -0.09  0.30 -0.67   -0.55 -0.63     -0.40    0.03
## b2s           1.00  0.69 -0.75  0.05   -0.04 -0.21     -0.16   -0.41
## s2s           0.69  1.00 -0.68  0.09   -0.05 -0.09     -0.06   -0.49
## green        -0.75 -0.68  1.00 -0.36   -0.18 -0.11      0.02    0.52
## p.pov         0.05  0.09 -0.36  1.00    0.68  0.69      0.38   -0.12
## p.unemp      -0.04 -0.05 -0.18  0.68    1.00  0.63      0.29    0.09
## p.hs         -0.21 -0.09 -0.11  0.69    0.63  1.00      0.33    0.07
## p.service    -0.16 -0.06  0.02  0.38    0.29  0.33      1.00    0.26
## p.age65      -0.41 -0.49  0.52 -0.12    0.09  0.07      0.26    1.00

The trouble in this approach is that, if I just keep testing different combination of variables at random, there are 16,383 possible combinations I would need to test. It would be very difficult to find a good combination of variables, and I will not be able to know it when I have successfully built a good model – because I don’t have theory backing me up.

Before you move on to the next section, try various variables in the dataset. In doing so, see if you can find a combination of variables that meets the following conditions: (1) adjusted R-squared is substantially increased and (2) you can explain why each of the statistically significant variables have the effects that they have. Read through the variable list above, select a few that you think would impact crime rate, and try running multiple regressions with the selected variables. The information you get from doing so will be useful later when you try to make sense out of the models built by automatic variable selection methods.

4. Model Specification

As we did in class, we are going to create two models. We can’t say this enough, your selection of variables should be based on theory and experience in the subject you are investigating.

model_1 <- lm(crime.percap ~ dist.cbd + p.pov, data = gsv.census)
summary(model_1)

## 
## Call:
## lm(formula = crime.percap ~ dist.cbd + p.pov, data = gsv.census)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1682 -0.3938 -0.1375  0.3221  2.3858 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.122e+00  1.452e-01   7.726 4.54e-12 ***
## dist.cbd    -8.415e-05  1.927e-05  -4.366 2.78e-05 ***
## p.pov        2.062e+00  3.953e-01   5.217 8.15e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6342 on 115 degrees of freedom
## Multiple R-squared:  0.2987, Adjusted R-squared:  0.2865 
## F-statistic: 24.49 on 2 and 115 DF,  p-value: 1.38e-09

model_2 <- lm(crime.percap ~ dist.cbd + p.pov + p.unemp, data = gsv.census)
summary(model_2)

## 
## Call:
## lm(formula = crime.percap ~ dist.cbd + p.pov + p.unemp, data = gsv.census)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3109 -0.3816 -0.1299  0.2875  2.3384 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.111e+00  1.441e-01   7.708 5.17e-12 ***
## dist.cbd    -8.844e-05  1.927e-05  -4.589 1.15e-05 ***
## p.pov        1.425e+00  5.397e-01   2.641  0.00942 ** 
## p.unemp      2.663e+00  1.551e+00   1.717  0.08871 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6289 on 114 degrees of freedom
## Multiple R-squared:  0.3164, Adjusted R-squared:  0.2984 
## F-statistic: 17.59 on 3 and 114 DF,  p-value: 1.889e-09

Compare AICs across models using same data

AIC(model_1, model_2)

##         df      AIC
## model_1  4 232.3606
## model_2  5 231.3482

What conclusion can you draw from the AIC results?

Compare fit of nested vs non-nested models to determine need for additional covariates

anova(model_1, model_2) # where model_1 is nested or submodel of model_2

## Analysis of Variance Table
## 
## Model 1: crime.percap ~ dist.cbd + p.pov
## Model 2: crime.percap ~ dist.cbd + p.pov + p.unemp
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    115 46.255                              
## 2    114 45.089  1    1.1659 2.9478 0.08871 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What conclusion can you draw from the above result?

Let’s see if adding higher order terms improves model fit

resettest(crime.percap ~ dist.cbd + p.pov + p.unemp, power = 2, type = "regressor", data = gsv.census)

## 
##  RESET test
## 
## data:  crime.percap ~ dist.cbd + p.pov + p.unemp
## RESET = 0.63918, df1 = 3, df2 = 111, p-value = 0.5914

resettest(crime.percap ~ dist.cbd + p.pov + p.unemp, power = 2:3, type = "regressor", data = gsv.census)

## 
##  RESET test
## 
## data:  crime.percap ~ dist.cbd + p.pov + p.unemp
## RESET = 0.40756, df1 = 6, df2 = 108, p-value = 0.8726

F test of whether model with higher order terms \(H_a\) fits better than model without them \(H_0\). If \(p-value\) statistically significant, then adding higher order terms makes statistically significant contribution to model (reject \(H_0\) model)

Based on Ramsey’s RESET Test do you think adding a higher order term will make statistically significant contribution to the model and why?

5. Coefficient Plots

“When it comes time to share your findings, especially in talks, tables are often not the best way to capture people’s attention and quickly convey the results. Variants on what are known by some as “forest plots” have been gaining popularity for presenting regression results”.

“For that, jtools provides plot_summs and plot_coefs. plot_summs gives you a plotting interface to summ and allows you to do so with multiple models simultaneously (assuming you want to apply the same arguments to each model)”.

Below is a basic, single-model use case

plot_summs(model_1)

“Note that the intercept is omitted by default because it often distorts the scale and generally isn’t of theoretical interest. You can change this behavior or omit other coefficients with the omit.coefs argument”.

“In the above example, the differing scales of the variables makes it kind of difficult to make a quick assessment. No problem, we can just use the tools built into summ for that”.

plot_summs(model_1, scale = TRUE)

“See? Now we have a better idea of how the uncertainty and magnitude of effect differs for these variables. Note that by default the width of the confidence interval is .95, but this can be changed with the ci_level argument. You can also add a thicker band to convey a narrow interval using the inner_ci_level argument:”

plot_summs(model_1, scale = TRUE, inner_ci_level = .9)

“Comparison of multiple models simultaneously is another benefit of plotting. This is especially true when the models are nested. Let’s fit a second model and compare”.

plot_summs(model_1, model_2, scale = TRUE)

Type ?plot_summs into the console and see how else you might manipulate the plots to show your results in different ways.

source: Long J. (2021, September 2) Tools for summarizing and visualizing regression models. https://cran.r-project.org/web/packages/jtools/vignettes/summ.html

Week10 - Variable Selection

Created by: Bonwoo Koo

Created on: 10/23/2020, Modified on 10/29/2021

1. Introduction

2. Setting the working directory & reading data

3. Selecting independent variables based on your knowledge and experience

4. Model Specification

5. Coefficient Plots