| Functions | Tasks |
|---|---|
| step() | Selects a formula-based model by AIC. |
| stargazer() | Formats regression analysis results into a well-organized table. |
| regsubsets() | Model selection by exhaustive search (forward, backward, or sequential search is also possible). |
| formula() | Returns the formula contained in other R objects. |
| par() | Can be used to set or query graphical parameters. See ?par for more details. |
| points() | Draws a point or sequence of points at the specified coordinates. See ?points for more details. |
We continue to gain hands-on experience with multiple regression modeling in R. This week, we focus on various ways of selecting variables.
Researchers and planners usually choose the dependent and independent variables based on various criteria, such as their research questions, their empirical and theoretical knowledge as experts, and their lived experience as urban residents. For example, theories suggest that there are sets of variables that affect an individual’s decision to walk, and some of these variables have been empirically found to have statistically significant relationships with walking behavior. In my own work, I based my variable selection on theories of walking, such as the 3D’s framework and the hierarchical walking needs theory, and used my own experience to refine the selection.
However, we sometimes need additional help in selecting the right set of variables, particularly when we are dealing with issues for which we have little established theory, intuition, and/or experience to guide us in building regression models. The variable selection methods we will examine in this lab can provide supplementary information that we can take into consideration when specifying regression models.
Today, we will examine what variables are related to crime. We will use the number of crime incidents per 1,000 persons as our dependent variable (i.e., the variable being explained by a set of other variables; often referred to as simply the y variable). As candidates for independent variables (i.e., the set of variables used to explain the dependent variable; often referred to as x variables), we have the variables summarized below.
| Variable | What it is measuring |
|---|---|
| crime.percap | The number of crime incidents per 1,000 people (dependent variable). |
| dist.cbd | The distance from the city center (city hall) to the centroid of the Census Tract. Measured in meters. |
| b2s | The average of the building-to-street ratio calculated using Google Street View images. |
| s2s | The average of the sidewalk-to-street proportion calculated using Google Street View images. |
| green | The average proportion of greenery calculated using Google Street View images. |
| gsv.index | The average of the walkability index calculated using Google Street View images. |
| walkscore | The average of Walk Score. |
| p.age65 | The share of population above age 65 out of the total population. |
| p.service | The share of population employed in service sector. |
| p.crowd.room | The share of the population living in a room shared by more than two people. |
| p.hs | The share of population with less than high school education out of those who are age 25 or above. |
| p.unemp | The share of population who are unemployed out of those who are in labor force. |
| p.pov | The share of population whose income is under the poverty line out of those whose poverty status is identified. |
| popden | The population density of the Census Tract. |
| mroom | The median number of rooms in the Census Tract. |
| mval | The median housing value of the Census Tract. |
Let’s first install and load the required packages in R. In addition to the functions in base R, we will use regsubsets() from the “leaps” package, ggpairs() from “GGally”, and stargazer() from “stargazer”. The data file can be found on Canvas.
# Install required packages
if (!require("GGally")) install.packages("GGally")
if (!require("leaps")) install.packages("leaps")
if (!require("stargazer")) install.packages("stargazer")
# Call the packages using library()
library(GGally)
library(leaps)
library(stargazer)
# USE YOUR OWN working directory and file name
setwd("D:/Dropbox (Personal)/School/Fall2020/CP6025/Labs/Lab8_2020") # Use your own pathname
gsv.census <- read.csv("Week10_gsv_census.csv")
str(gsv.census)
## 'data.frame': 118 obs. of 16 variables:
## $ walkscore : num 0.338 0.451 0.645 0.76 0.707 ...
## $ dist.cbd : num 4495 3587 4382 3528 3595 ...
## $ b2s : num 0.0548 0.0469 0.1497 0.1375 0.1421 ...
## $ green : num 0.436 0.407 0.297 0.332 0.316 ...
## $ s2s : num 0.0564 0.0685 0.1112 0.1504 0.1344 ...
## $ gsv.index : num 0.0117 -0.1493 0.4972 1.7117 1.1917 ...
## $ p.age65 : num 0.0785 0.1311 0.0684 0.0789 0.1407 ...
## $ p.service : num 0.0492 0.0971 0.1769 0.0829 0.1324 ...
## $ p.crowd.room: num 0.0051 0 0.00482 0.00696 0 ...
## $ p.hs : num 0.01316 0.01008 0.02415 0.00474 0.06532 ...
## $ p.unemp : num 0.0178 0.0194 0.0298 0.0172 0.0982 ...
## $ p.pov : num 0.0804 0.0919 0.0876 0.0699 0.213 ...
## $ popden : num 1894 1393 2125 2521 1961 ...
## $ mroom : num 7.3 5.6 6.4 4.9 5.3 4.5 5.1 6.1 5.7 5.8 ...
## $ mval : num 617300 548200 470300 433000 239600 ...
## $ crime.percap: num 0.275 0.513 0.397 0.835 2.272 ...
Let’s assume that this is a dataset that you created and cleaned for your option paper. Some variables represent physical characteristics of a Census Tract. Others represent socioeconomic characteristics that are known to indicate resource deprivation (e.g., this study).
As always, let’s start by taking a look at each variable’s distribution as well as their pairwise scatterplots.
# ggpairs
var.list <- c("crime.percap", "walkscore", "mroom", "p.crowd.room", "popden", "dist.cbd", "mval", "b2s", "s2s", "green", "p.pov", "p.unemp", "p.hs", "p.service", "p.age65")
ggpairs(gsv.census[ , var.list])
The graph shows, first of all, that there are already too many variables to display nicely in one graph. Although some independent variables appear to show some pattern with the dependent variable, the patterns are not very clear. Also, the dependent variable (i.e., crime.percap) and some of the independent variables show substantial skewness, which warrants our attention. But again, let’s pretend for now that these issues are not present.
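If you want to take a quick look at that skewness yourself, a simple histogram of the dependent variable is enough (a minimal sketch using base R; not required for this lab):
# Optional: quick check of the skewness of the dependent variable
hist(gsv.census$crime.percap,
     breaks = 20,
     xlab = "Crime incidents per 1,000 people",
     main = "Distribution of crime.percap")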
Given that we have inspected the variables, let’s build a multiple regression model. The issue here is that I don’t know much about crime. Since I don’t have any theory or expert knowledge to guide me in variable selection, I will use common sense and select distance from the CBD, the poverty rate, and the unemployment rate. In real regression modeling, however, it is VERY IMPORTANT to have read what other researchers and planners have discovered and theorized so far. You should not throw everything into the model (“kitchen sink”) – your model should be grounded in past findings.
# First regression: explaining crime.percap with dist.cbd, p.pov, and p.unemp
model.based.on.guesses <- lm(crime.percap ~ dist.cbd + p.pov + p.unemp, data = gsv.census)
summary(model.based.on.guesses)
##
## Call:
## lm(formula = crime.percap ~ dist.cbd + p.pov + p.unemp, data = gsv.census)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3109 -0.3816 -0.1299 0.2875 2.3384
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.111e+00 1.441e-01 7.708 5.17e-12 ***
## dist.cbd -8.844e-05 1.927e-05 -4.589 1.15e-05 ***
## p.pov 1.425e+00 5.397e-01 2.641 0.00942 **
## p.unemp 2.663e+00 1.551e+00 1.717 0.08871 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6289 on 114 degrees of freedom
## Multiple R-squared: 0.3164, Adjusted R-squared: 0.2984
## F-statistic: 17.59 on 3 and 114 DF, p-value: 1.889e-09
The trouble with this approach is that, if I just keep testing different combinations of variables at random, there are 16,383 possible combinations I would need to test. It would be very difficult to find a good combination of variables, and even if I stumbled on a good model I would not be able to recognize it – because I don’t have theory backing me up.
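In case you are wondering where 16,383 comes from: each of the 14 candidate variables can either be in or out of the model, which gives 2^14 = 16,384 subsets, minus the empty model. A one-line check in R:
# Number of possible non-empty subsets of 14 candidate variables
2^14 - 1 # returns 16383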
Before you move on to the next section, try various variables in the dataset. In doing so, see if you can find a combination of variables that meets the following conditions: (1) the adjusted R-squared is substantially increased, and (2) you can explain why each statistically significant variable has the effect it has. Read through the variable list above, select a few that you think would impact the crime rate, and try running multiple regressions with the selected variables. The information you get from doing so will be useful later when you try to make sense of the models built by automatic variable selection methods.
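Purely as an illustration of what one such trial might look like (the variables here are my arbitrary picks, not a recommendation), you could run something like:
# One arbitrary trial -- swap in your own picks from the variable list above
trial.model <- lm(crime.percap ~ popden + mroom + p.hs, data = gsv.census)
summary(trial.model) # compare the adjusted R-squared with model.based.on.guesses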
One way to do automatic variable selection is to use stepwise regression. As the details of stepwise regression are already covered in the lecture, I will focus on running it in R.
The first step is to create an empty model that has the dependent variable and an intercept only. This is as simple as a regression model can be, and it is assigned to an R object called null.model. The full.model is the opposite; it contains all the independent variables you have on hand. Using null.model and full.model, three stepwise regressions using the forward, backward, and both directions are run and stored in step.model.for, step.model.back, and step.model.both, respectively. Note that in the step() function, the trace argument is set to 0 to suppress the long output. In real regression analysis, you will want to read those outputs carefully.
Then we use stargazer() to neatly format the three models and print them in one table. As shown below, forward stepwise regression produced a different final model than the other two.
# Base models
null.model <- lm(crime.percap ~ 1, data = gsv.census)
full.model <- lm(crime.percap ~ walkscore + mroom + p.crowd.room +
popden + dist.cbd + b2s + s2s + green + # physical variables
mval + p.pov + p.unemp + p.hs +
p.service + p.age65, # socioeconomic/demographic variables
data = gsv.census)
# Forward (the scope argument must be the full model)
step.model.for <- step(null.model, # Should start from null.model
scope = formula(full.model), # This is the fullest model possible.
direction = "forward",
trace = 0)
# Backward (backward does not need a scope argument because it starts with the full.model and only takes out variables from the full.model)
step.model.back <- step(full.model,
direction = "backward",
trace = 0)
# Both (it needs both null.model and full.model specified, just like forward method)
step.model.both <- step(null.model,
scope = formula(full.model),
direction = "both",
trace = 0)
# Compare models
stargazer(step.model.for, step.model.back, step.model.both, # you can write the names of your R objects sequentially
type = "text", # you can also choose "latex" or "html"
column.labels = c("Forward", "Backward", "Both")) # This is the label for each model
##
## ===========================================================================================
## Dependent variable:
## -----------------------------------------------------------------------
## crime.percap
## Forward Backward Both
## (1) (2) (3)
## -------------------------------------------------------------------------------------------
## green -1.649 -1.745 -1.745
## (1.196) (1.194) (1.194)
##
## p.age65 2.036** 2.157** 2.157**
## (0.847) (0.841) (0.841)
##
## p.unemp 1.460
## (1.295)
##
## mroom -0.192*** -0.181*** -0.181***
## (0.062) (0.061) (0.061)
##
## popden -0.0003*** -0.0003*** -0.0003***
## (0.0001) (0.0001) (0.0001)
##
## b2s 3.378*** 3.550*** 3.550***
## (0.946) (0.935) (0.935)
##
## p.pov 0.929* 1.272*** 1.272***
## (0.479) (0.371) (0.371)
##
## walkscore 0.664* 0.635* 0.635*
## (0.360) (0.360) (0.360)
##
## Constant 1.867*** 1.861*** 1.861***
## (0.610) (0.610) (0.610)
##
## -------------------------------------------------------------------------------------------
## Observations 118 118 118
## R2 0.608 0.603 0.603
## Adjusted R2 0.579 0.578 0.578
## Residual Std. Error 0.487 (df = 109) 0.488 (df = 110) 0.488 (df = 110)
## F Statistic 21.112*** (df = 8; 109) 23.887*** (df = 7; 110) 23.887*** (df = 7; 110)
## ===========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Another way to do automatic variable selection is to use regsubsets() from the ‘leaps’ package (or other similar functions out there). Similar to backward stepwise regression, the input is the formula of full.model. The nvmax argument determines the largest model you want to test. For example, if you set nvmax = 8, which is the default, the function will test models with up to 8 variables regardless of how many independent variables are in your full.model. I set nvmax to 14 because there are 14 independent variables in full.model.
# Find the best subset of variables
subset.model <- regsubsets(formula(full.model),
data = gsv.census,
nvmax = 14,
method = "exhaustive")
# Summary
reg.summary <- summary(subset.model) # regsubsets is a bit weird in that you actually assign an output of summary() and use that for further analysis.
reg.summary
## Subset selection object
## Call: regsubsets.formula(formula(full.model), data = gsv.census, nvmax = 14,
## method = "exhaustive")
## 14 Variables (and intercept)
## Forced in Forced out
## walkscore FALSE FALSE
## mroom FALSE FALSE
## p.crowd.room FALSE FALSE
## popden FALSE FALSE
## dist.cbd FALSE FALSE
## b2s FALSE FALSE
## s2s FALSE FALSE
## green FALSE FALSE
## mval FALSE FALSE
## p.pov FALSE FALSE
## p.unemp FALSE FALSE
## p.hs FALSE FALSE
## p.service FALSE FALSE
## p.age65 FALSE FALSE
## 1 subsets of each size up to 14
## Selection Algorithm: exhaustive
## walkscore mroom p.crowd.room popden dist.cbd b2s s2s green mval p.pov
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " "
## 3 ( 1 ) " " "*" " " "*" " " "*" " " " " " " " "
## 4 ( 1 ) " " "*" " " "*" " " "*" " " " " " " "*"
## 5 ( 1 ) "*" "*" " " "*" " " "*" " " " " " " "*"
## 6 ( 1 ) "*" "*" " " "*" " " "*" " " " " " " "*"
## 7 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " "*"
## 8 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " "*"
## 9 ( 1 ) "*" "*" " " "*" " " "*" "*" "*" " " "*"
## 10 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*" " " "*"
## 11 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*" " " "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## p.unemp p.hs p.service p.age65
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " "*"
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " " " " "
## 6 ( 1 ) " " " " " " "*"
## 7 ( 1 ) " " " " " " "*"
## 8 ( 1 ) " " "*" " " "*"
## 9 ( 1 ) " " "*" " " "*"
## 10 ( 1 ) " " "*" " " "*"
## 11 ( 1 ) "*" "*" " " "*"
## 12 ( 1 ) "*" "*" " " "*"
## 13 ( 1 ) "*" "*" " " "*"
## 14 ( 1 ) "*" "*" "*" "*"
Let’s take a look at the results above. You can ignore the ‘Forced in’ and ‘Forced out’ columns – they are meaningful only when we tell regsubsets() that certain variables must always be included or always be excluded. On the bottom half, there are 14 rows filled with " " and “*”. The first row has “*” in the ‘green’ column and " " in all other columns. This indicates that if a model can have only one independent variable, including the variable ‘green’ yields the best model. If a model can have, for example, three independent variables, including ‘mroom’, ‘popden’, and ‘b2s’ yields the best model.
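If you would rather not read the asterisk table by eye, coef() also works on a regsubsets object: given a model size, it returns the coefficients of the best model of that size (a small sketch):
# Coefficients of the best 3-variable model found by regsubsets()
coef(subset.model, 3) # should show mroom, popden, and b2s (plus the intercept)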
Although the output above tells us what combination of variables we should use if we want a model with 1 variable, with 2 variables, with 3 variables, and so on, it does not tell us HOW MANY VARIABLES we should choose. To see that, run the following code. I think it would help you learn R if you try to understand the chunk of code below, but it is not essential.
# Second method of plotting the result.
par(mfrow = c(2,2)) # This code specifies that, instead of showing only one graph in the plots window, you want 4 (2 x 2) graphs.
plot(reg.summary$rsq, # This code plots the number of variables on x-axis and r-squared values on y-axis
xlab = "Number of variables",
ylab = "R-Squared",
type = "l") # type = "l" means that you want to display the graph with a line, not points.
plot(reg.summary$adjr2, # Adjusted r-squared values on y-axis
xlab = "Number of variables",
ylab = "Adjusted R-Squared",
type = "l")
points(which.max(reg.summary$adjr2), # This code inserts a red dot on the line graph. The red dot denotes the point where the adjusted R-squared is at its highest.
reg.summary$adjr2[which.max(reg.summary$adjr2)],
col="red",
cex=2,
pch=20)
plot(reg.summary$rss, # sum of squared residual values on y-axis
xlab = "Number of variables",
ylab = "RSS",
type = "l")
plot(reg.summary$bic, # BIC values on y-axis. BIC is very similar to AIC except that it puts more penalty to the number of variables included.
xlab = "Number of variables",
ylab = "BIC",
type = "l")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col = "red", cex = 2, pch = 20) # This code inserts a red dot on the line graph. The red dot denotes the point where the BIC is at its lowest.
par(mfrow = c(1,1)) # This code reverts the plot setting back to the original where we plot one graph at a time.
The plot above illustrates that the adjusted R-squared is the highest when you include 8 variables (the plot on the right side of the first row, with a red dot). You can now go back to the output of reg.summary to see which 8 variables you should choose. It also shows that the BIC is the lowest when there are 4 variables (the plot on the right side of the second row).
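If you prefer to read these numbers off directly instead of squinting at the plots, the same which.max() and which.min() calls used for the red dots can simply be printed (a sketch):
# Model size with the highest adjusted R-squared and the lowest BIC
which.max(reg.summary$adjr2) # 8 in this example
which.min(reg.summary$bic)   # 4 in this example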
A side note that is not really important for this class but could be interesting to people like me: BIC is a metric for model fit that is similar to AIC, but it tends to prefer smaller models (i.e., fewer independent variables), especially when your data has a large sample size. For further reading, see here.
Shown below is a model built based on the adjusted R-squared criterion (i.e., 8 independent variables). The number of rooms and population density are negatively associated with crime per capita. Conversely, walkscore, the building-to-street ratio, the poverty rate, and the share of the population age 65+ are positively associated with crime per capita.
Okay, so we have identified the set of variables that maximizes the adjusted R-squared. Some of the chosen variables make sense, and we can see why and how they may be associated with the crime rate. The problem is that I have absolutely no idea why the share of age 65+ is positively linked with more crime per capita. This is the problem of not having a theory behind it – even when we see a significant effect of a variable, we just don’t know WHY it has that effect. If anyone has any clue, please let me and your friends know, because I am curious.
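Rather than typing the eight variable names by hand, you can also pull them out of the regsubsets object and build the formula programmatically; this is just a convenience sketch, and it should yield the same formula as the one typed out below.
# Extract the names of the best 8-variable subset (drop the intercept)
best8.vars <- names(coef(subset.model, 8))[-1]
# Build the formula crime.percap ~ ... from those names
best8.formula <- reformulate(best8.vars, response = "crime.percap")
best8.formula # same formula as the one typed out below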
subsetted.model_v1 <- lm(crime.percap ~ walkscore + mroom + popden + b2s + green + p.pov + p.hs + p.age65, data = gsv.census)
summary(subsetted.model_v1)
##
## Call:
## lm(formula = crime.percap ~ walkscore + mroom + popden + b2s +
## green + p.pov + p.hs + p.age65, data = gsv.census)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32193 -0.29051 -0.03493 0.24746 1.83683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.727e+00 6.205e-01 2.784 0.006330 **
## walkscore 6.828e-01 3.616e-01 1.889 0.061612 .
## mroom -1.789e-01 6.120e-02 -2.923 0.004218 **
## popden -2.861e-04 5.076e-05 -5.636 1.38e-07 ***
## b2s 3.817e+00 9.623e-01 3.966 0.000131 ***
## green -1.519e+00 1.209e+00 -1.256 0.211777
## p.pov 9.440e-01 4.679e-01 2.018 0.046082 *
## p.hs 9.307e-01 8.133e-01 1.144 0.254978
## p.age65 2.037e+00 8.464e-01 2.407 0.017760 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4871 on 109 degrees of freedom
## Multiple R-squared: 0.6079, Adjusted R-squared: 0.5791
## F-statistic: 21.12 on 8 and 109 DF, p-value: < 2.2e-16
Just for fun, here is the model based on the BIC criterion (i.e., 4 independent variables).
subsetted.model_v2 <- lm(crime.percap ~ mroom + popden + b2s + p.pov, data = gsv.census)
summary(subsetted.model_v2)
##
## Call:
## lm(formula = crime.percap ~ mroom + popden + b2s + p.pov, data = gsv.census)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.16370 -0.29319 -0.05239 0.25272 1.95421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.055e+00 3.935e-01 5.221 8.17e-07 ***
## mroom -2.425e-01 5.739e-02 -4.225 4.87e-05 ***
## popden -3.015e-04 4.726e-05 -6.379 4.04e-09 ***
## b2s 4.945e+00 7.466e-01 6.624 1.23e-09 ***
## p.pov 1.333e+00 3.358e-01 3.971 0.000126 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5042 on 113 degrees of freedom
## Multiple R-squared: 0.5644, Adjusted R-squared: 0.549
## F-statistic: 36.61 on 4 and 113 DF, p-value: < 2.2e-16
For those who are curious: the reason BIC is lowest with 4 variables while adjusted R-squared is highest with 8 variables is that BIC is stricter about including marginal variables – it places a greater penalty on each added variable.
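To make the penalty difference concrete: for a model with k estimated parameters fit to n observations, AIC adds 2k to the log-likelihood-based fit measure while BIC adds k × log(n); with n = 118, log(118) ≈ 4.77, so every added variable costs BIC more than twice what it costs AIC. If you want to check this on the two models above, base R’s AIC() and BIC() functions work directly on lm objects (a sketch; note that these numbers will not match reg.summary$bic, which leaps reports on its own relative scale, but lower is better in both):
# Compare the information criteria of the two subsetted models
AIC(subsetted.model_v1, subsetted.model_v2)
BIC(subsetted.model_v1, subsetted.model_v2)
log(nrow(gsv.census)) # ~4.77, the per-parameter penalty BIC uses (AIC uses 2)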
Now would be a good time to look at the residual plot for some model diagnostics. But let’s save that for next week, as we will have covered most of the important model diagnostics by then. For those who are curious, I have plotted a residuals vs. predicted plot below for you to see.
# Regression diagnostic: Residual plot
plot(predict(subsetted.model_v1),
subsetted.model_v1$residuals,
xlab = "Predicted values",
ylab = "Residuals",
main = "Residuals vs. Predicted")
abline(h = 0, lty = 2, col = 'red')
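If you cannot wait until next week, base R will also produce its standard set of diagnostic plots for an lm object in one call; we will interpret these properly later, so treat this as a preview:
# Preview of the standard lm() diagnostic plots (interpretation comes next week)
par(mfrow = c(2, 2))
plot(subsetted.model_v1)
par(mfrow = c(1, 1))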