Lab 8 focuses on gaining hands-on experience with multiple regression modeling in R. In addition to running regression models, we typically perform data visualization and inspection before modeling, and regression diagnostics after modeling.
| Functions | Tasks |
|---|---|
| left_join() | Joins two data.frames such that all rows in the table on the left are preserved. |
| ggpairs() | Creates a pair-wise scatterplot + correlation coefficient + density histogram. |
In this lab, we will build a series of regression models that examine the relationship between median housing value and various factors that may affect it. That is, we will use median housing value as the dependent variable, and Walk Score, distance from the CBD, median household income, and whether or not a tract is inside Atlanta (a dummy variable) as the independent variables. In particular, we will focus on how Walk Score affects housing values.
Let’s first install the required packages in R. In addition to the functions in base R, we will use left_join() from “tidyverse” and ggpairs() from “GGally”. Then, we load the packages and read in the data. There are two datasets that you can download from Canvas > Files > Lab Materials, named ‘lab8_census_10counties.csv’ and ‘lab8_walkscore_4counties.csv’. Download those files and store them in a folder of your choice.
# Install required packages
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("GGally")) install.packages("GGally")
# Call the packages using library()
library(tidyverse)
library(GGally)
# USE YOUR OWN working directory and file name
setwd("C:\\Users\\cod\\Desktop\\PhD Files\\GRA & GTA\\GTA\\CP6025 Fall 2021\\Week 9\\Lab7_2020") # Use your own pathname
census.df <- read.csv("lab8_census_10counties.csv")
ws.df <- read.csv("lab8_walkscore_4counties.csv")
Can you describe what each of these variables is and whether each is categorical or continuous?
# Inspecting the data
head(census.df)
## GEOID NAME mval hinc nroom
## 1 13121001800 Census Tract 18, Fulton County, Georgia 136500 42159 3.3
## 2 13121003200 Census Tract 32, Fulton County, Georgia 228700 84350 3.8
## 3 13121003600 Census Tract 36, Fulton County, Georgia 175900 41372 3.4
## 4 13121006801 Census Tract 68.01, Fulton County, Georgia NA NA NA
## 5 13121008800 Census Tract 88, Fulton County, Georgia 285100 95524 6.4
## 6 13121010512 Census Tract 105.12, Fulton County, Georgia 92800 30819 5.0
## unit.single unit.multi unitstr.tot p.single p.multi
## 1 325 2914 3277 0.09917608 0.8892280
## 2 728 744 1472 0.49456522 0.5054348
## 3 35 979 1014 0.03451676 0.9654832
## 4 0 0 0 NA NA
## 5 1974 385 2359 0.83679525 0.1632047
## 6 1164 1956 3158 0.36858771 0.6193794
nrow(census.df)
## [1] 738
head(ws.df)
## TractID walkscore cbd_dist in.atl
## 1 13063040202 18.81333 12908.51 0
## 2 13063040203 26.16667 15682.62 0
## 3 13063040204 37.40000 15358.93 0
## 4 13063040302 31.66667 11214.80 1
## 5 13063040303 33.42222 11418.21 1
## 6 13063040307 22.70833 15944.66 0
nrow(ws.df)
## [1] 501
We learned how to join two data.frames using the full_join() function. It worked fine last time because we knew that the two data.frames we joined had data on exactly the same Census Tracts and that they shared the same KEY variables. However, as we observed in nrow(census.df) and nrow(ws.df), one data.frame contains 738 rows and the other 501. Also, one data.frame stores its KEY variable in ‘GEOID’ while the other stores it in ‘TractID’.
1.1. The census dataset (‘census.df’) contains data for 10 counties around Atlanta, while ‘ws.df’ contains data for only 4 counties. If we use full_join(), it will give us many extra rows filled with NAs. Instead, we can use left_join() or right_join() to drop the rows for Census Tracts not contained in ‘ws.df’.
Also, we need to tell R that the names of the variable we want to use as the KEY differ between the two data.frames. Notice the third argument of left_join() below. That argument specifies which variable in one data.frame should be matched with which variable in the other when joining. Note that the order matters: the variable from the data.frame listed first (here, ws.df) goes on the left-hand side of the equals sign.
# Joining census.df and ws.df
join.df <- left_join(ws.df, census.df,
by = c("TractID" = "GEOID"))
str(join.df) # We can see here that we have 501 rows and 13 variables. The two tables are successfully joined.
## 'data.frame': 501 obs. of 13 variables:
## $ TractID : num 1.31e+10 1.31e+10 1.31e+10 1.31e+10 1.31e+10 ...
## $ walkscore : num 18.8 26.2 37.4 31.7 33.4 ...
## $ cbd_dist : num 12909 15683 15359 11215 11418 ...
## $ in.atl : int 0 0 0 1 1 0 1 0 0 0 ...
## $ NAME : chr "Census Tract 402.02, Clayton County, Georgia" "Census Tract 402.03, Clayton County, Georgia" "Census Tract 402.04, Clayton County, Georgia" "Census Tract 403.02, Clayton County, Georgia" ...
## $ mval : num 98600 98200 105000 55400 59300 69100 60500 89700 99200 88100 ...
## $ hinc : int 31524 36786 39194 33190 37236 37064 25159 45768 37224 42280 ...
## $ nroom : num 5.6 5.2 4.5 4.9 5.4 6.2 4.7 5.5 5.6 6.1 ...
## $ unit.single: int 614 992 843 1447 2047 1659 1253 1315 1442 1880 ...
## $ unit.multi : int 397 704 1210 876 629 86 752 98 1408 221 ...
## $ unitstr.tot: int 1011 1696 2064 2323 2730 1772 2038 1465 2891 2129 ...
## $ p.single : num 0.607 0.585 0.408 0.623 0.75 ...
## $ p.multi : num 0.393 0.415 0.586 0.377 0.23 ...
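If you want to see the difference between the two joins for yourself, here is a quick optional comparison (a minimal sketch, not part of the required lab code):
# full_join() keeps all rows from both data.frames. Because every tract
# in ws.df also appears in census.df, this should return 738 rows, with
# NAs in the Walk Score columns for the 237 unmatched census tracts.
nrow(full_join(ws.df, census.df, by = c("TractID" = "GEOID")))
# left_join() keeps only the 501 rows of ws.df.
nrow(left_join(ws.df, census.df, by = c("TractID" = "GEOID")))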
1.2. Let us use the separate() function to parse the ‘NAME’ variable into pieces. This is to extract county names from the ‘NAME’ variable.
# Parsing NAME variable
join.df.sep <- separate(join.df, NAME, c("tract", "county", "state"), sep = ", ") # Don't forget to include a space after the comma in the 'sep' argument.
summary(join.df.sep) # The summary shows that the data look okay in general. There are no NAs, and the min, max, mean, and median all look reasonable.
## TractID walkscore cbd_dist in.atl
## Min. :1.306e+10 Min. : 1.494 Min. : 0 Min. :0.0000
## 1st Qu.:1.307e+10 1st Qu.:14.197 1st Qu.: 8742 1st Qu.:0.0000
## Median :1.309e+10 Median :23.541 Median :16455 Median :0.0000
## Mean :1.309e+10 Mean :29.426 Mean :17086 Mean :0.3234
## 3rd Qu.:1.312e+10 3rd Qu.:39.852 3rd Qu.:24119 3rd Qu.:1.0000
## Max. :1.312e+10 Max. :91.667 Max. :43070 Max. :1.0000
## tract county state mval
## Length:501 Length:501 Length:501 Min. : 53000
## Class :character Class :character Class :character 1st Qu.: 102700
## Mode :character Mode :character Mode :character Median : 182500
## Mean : 230242
## 3rd Qu.: 322300
## Max. :1117200
## hinc nroom unit.single unit.multi unitstr.tot
## Min. : 9815 Min. :3.00 Min. : 20 Min. : 0.0 Min. : 289
## 1st Qu.: 41481 1st Qu.:4.90 1st Qu.: 794 1st Qu.: 152.0 1st Qu.:1591
## Median : 59981 Median :5.70 Median :1345 Median : 574.0 Median :2129
## Mean : 67735 Mean :5.96 Mean :1462 Mean : 811.8 Mean :2296
## 3rd Qu.: 85333 3rd Qu.:7.00 3rd Qu.:1961 3rd Qu.:1211.0 3rd Qu.:2887
## Max. :200179 Max. :9.00 Max. :5678 Max. :4207.0 Max. :6918
## p.single p.multi
## Min. :0.02699 Min. :0.00000
## 1st Qu.:0.42914 1st Qu.:0.08032
## Median :0.68703 Median :0.30144
## Mean :0.64728 Mean :0.34344
## 3rd Qu.:0.90680 3rd Qu.:0.56258
## Max. :1.00000 Max. :0.97301
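Since the summary indicates there are no missing values, you can also confirm this directly (a quick optional check):
# Count NAs in each column; all zeros means no NAs survived the join
colSums(is.na(join.df.sep))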
Now that we have joined the two data.frames, here is a list of the variable names and what they represent.
| Variable name | What it is measuring |
|---|---|
| TractID | Unique IDs for Census Tract |
| walkscore | Average Walk Score calculated as the mean of Walk Score values measured at points on 0.25-mile grid |
| cbd_dist | Distance from the ATL City Hall to the centroid of each Census Tract |
| in.atl | Binary variable (1 = inside ATL city boundary, 0 = otherwise) |
| county | Contains four counties (Fulton, DeKalb, Clayton, Cobb) |
| mval | Median housing value |
| hinc | Median household income |
| nroom | Median number of rooms |
| unit.single | Count of single family houses |
| unit.multi | Count of multifamily houses |
| unitstr.tot | Total number of houses (denominator for unit.single and unit.multi) |
| p.single | Percent single family houses |
| p.multi | Percent multifamily houses |
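Since the table notes that unitstr.tot is the denominator for the unit counts, here is a quick optional sanity check (a minimal sketch, assuming p.single was computed as unit.single / unitstr.tot):
# If p.single == unit.single / unitstr.tot, these differences should all be ~0
summary(with(join.df.sep, p.single - unit.single / unitstr.tot))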
It is often crucial to gain a good understanding of the data before we jump into regression modeling. That understanding starts with thoroughly investigating the descriptive statistics (e.g., mean, median, range, variance, distribution, etc.) and various plots such as histograms and scatterplots.
There is a convenient function called ggpairs() in the “GGally” package that generates scatterplots, correlation coefficients, and histograms (with density transformation) in one graph. In its simplest form, it takes only one argument: a data.frame. The ggpairs() function draws the graph using all variables it is supplied with, so we supply a data.frame with only the selected variables.
# Generating a paired scatterplot
ggpairs(join.df.sep[ ,c("walkscore", "mval", "cbd_dist", "hinc", "nroom", "p.single", "p.multi", "in.atl")])
The lower half of the graph shows the scatterplots, and the diagonal shows the density-transformed histograms. The variable of our interest, walkscore, appears at first glance to have a negative relationship with mval, but the relationship is a bit more complex. Among low-walkability areas, some have high median housing values and others have low median housing values. However, high-walkability areas tend to have low median housing values (try drawing a separate scatterplot if you want to have a closer look). Overall, however, the two variables have almost no linear correlation (0.053).
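For that closer look, here is a minimal sketch using ggplot2 (loaded with tidyverse); the loess smoother is an optional addition to highlight the non-linear pattern:
# Scatterplot of walkscore vs. mval with a loess smoother
ggplot(join.df.sep, aes(x = walkscore, y = mval)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess") +
  labs(x = "Walk Score", y = "Median housing value ($)")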
Other notable things include: (1) mval and hinc show a clear sign of strong correlation; (2) some independent variables are strongly correlated among themselves (e.g., hinc vs. nroom, or nroom vs. p.single); and (3) p.single and p.multi, by definition, show a nearly perfect (negative) correlation. We may need to be concerned about multicollinearity. Also, we can see that many variables, including mval, walkscore, and hinc, are strongly skewed.
With these patterns in mind, let’s run regression models.
Running regression models is often not a linear process; it is common to revise a regression model repeatedly by, for example, including/excluding variables, transforming existing variables, deleting extreme outliers, etc.
3.1. When running regressions, I personally like to start with simple models, so I will start with a bivariate regression between mval and walkscore.
The regression result shown below indicates that there is no significant relationship between mval and walkscore (see the p-value), just as we observed in the scatterplot. Notice that the Multiple R-squared is extremely low, suggesting that only an extremely small share of the variation in mval is accounted for by walkscore. The F-test also returns an insignificant result.
ols1 <- lm(mval ~ walkscore, data = join.df.sep)
summary(ols1)
##
## Call:
## lm(formula = mval ~ walkscore, data = join.df.sep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -184793 -125569 -50006 83863 895981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 217546.2 12886.5 16.882 <2e-16 ***
## walkscore 431.4 361.5 1.193 0.233
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 162800 on 499 degrees of freedom
## Multiple R-squared: 0.002846, Adjusted R-squared: 0.0008477
## F-statistic: 1.424 on 1 and 499 DF, p-value: 0.2333
3.2. The finding that walkscore and mval are not related differs from what some past studies have reported. Why is this the case? We anecdotally know that some suburban neighborhoods can be expensive (e.g., large single-family houses with many rooms), but some urban neighborhoods can ALSO be expensive (e.g., compact, multifamily houses in neighborhoods around the BeltLine or Midtown). Maybe we could not see the relationship between walkscore and mval because there are many different types and sizes of houses. Let’s include variables that can control for the variation in mval that is caused by housing types. This inclusion will take care of the variation in home value created by housing types and sizes. After we control for this variation, we can examine how walkscore is associated with home value for houses of comparable types and sizes.
As shown in ols2 below, the inclusion of the nroom and p.single variables dramatically changed the result! After accounting for the effects of nroom and p.single, the walkscore variable is highly significant with a positive coefficient. This means that, according to the model, a higher walkscore is associated with a higher mval WHEN HOLDING ALL OTHER VARIABLES CONSTANT, just like what the literature says.
Also notice that nroom is highly significant with a positive coefficient. It says that a one-unit increase in nroom (i.e., having one more room) increases the estimated mval by $140,031 on average WHEN HOLDING ALL OTHER VARIABLES CONSTANT.
The p.single variable is also highly significant but with a negative coefficient. You need to be careful in interpreting the coefficient of p.single: it is NOT saying that a 1% increase in the proportion of single-family houses is associated with a $416,424 decrease in mval. Because p.single is a proportion that ranges between 0 and 1, a one-unit increase in this case means going from 0 to 1, i.e., from 0% to 100%. So the massive coefficient (-416424.5) means that when a neighborhood goes from 0% (= 0) single-family houses to 100% (= 1) single-family houses, the estimated mval decreases by $416,424 on average WHEN HOLDING ALL OTHER VARIABLES CONSTANT.
Also note that the Multiple R-squared is now 0.4216 and the Adjusted R-squared 0.4181, up from around 0.003 in ols1.
ols2 <- lm(mval ~ walkscore + nroom + p.single, data = join.df.sep) # nroom, and p.single included
summary(ols2)
##
## Call:
## lm(formula = mval ~ walkscore + nroom + p.single, data = join.df.sep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -285351 -77593 -20671 55854 668047
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -431406.5 39646.7 -10.881 <2e-16 ***
## walkscore 3281.7 374.1 8.773 <2e-16 ***
## nroom 140031.1 7488.3 18.700 <2e-16 ***
## p.single -416424.5 36620.0 -11.372 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 124200 on 497 degrees of freedom
## Multiple R-squared: 0.4216, Adjusted R-squared: 0.4181
## F-statistic: 120.8 on 3 and 497 DF, p-value: < 2.2e-16
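To make the proportion-scale issue concrete, here is a quick optional calculation using the fitted ols2 (a minimal sketch, not part of the required lab code):
# A more realistic change is 10 percentage points (0.10 on the 0-1 scale).
# Scaling the coefficient accordingly gives roughly -$41,642 in estimated mval.
coef(ols2)["p.single"] * 0.10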
3.3. A neat thing about running regressions in R is that you can modify variables on the fly within the lm() function. The code below multiplies p.single by 100 so that we get a coefficient for a 1-percentage-point increase, not for a 100-percentage-point increase (which is unrealistic and therefore unintuitive). You can do this by wrapping the variable with I().
As shown in ols3, the coefficient for I(p.single * 100) is now more intuitive: a 1-percentage-point increase in the proportion of single-family houses is associated with a $4,164 decrease in mval. Notice that multiplying p.single by 100 did not change anything about the other variables.
ols3 <- lm(mval ~ walkscore + nroom + I(p.single*100), data = join.df.sep) # nroom, p.single included
summary(ols3)
##
## Call:
## lm(formula = mval ~ walkscore + nroom + I(p.single * 100), data = join.df.sep)
##
## Residuals:
## Min 1Q Median 3Q Max
## -285351 -77593 -20671 55854 668047
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -431406.5 39646.7 -10.881 <2e-16 ***
## walkscore 3281.7 374.1 8.773 <2e-16 ***
## nroom 140031.1 7488.3 18.700 <2e-16 ***
## I(p.single * 100) -4164.2 366.2 -11.372 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 124200 on 497 degrees of freedom
## Multiple R-squared: 0.4216, Adjusted R-squared: 0.4181
## F-statistic: 120.8 on 3 and 497 DF, p-value: < 2.2e-16
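If you want to verify that the rescaling only affected the p.single coefficient, here is a quick optional check:
# The rescaled coefficient is exactly the original divided by 100;
# all other coefficients are identical across ols2 and ols3.
coef(ols2)["p.single"] / 100  # matches coef(ols3)["I(p.single * 100)"]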
Now add in.atl (1 = inside ATL city boundary, 0 = otherwise) as an independent variable and assign the new model to the variable name ols4. How did this change the coefficients and model results?
Add p.multi as an independent variable, drop p.single, and assign your new model to the variable name ols5. How did this change the coefficients and model results? Note that we are dropping p.single because it is highly correlated with p.multi.
Now drop p.multi and add p.single back in as an independent variable; also add cbd_dist as an independent variable and assign your new model to the variable name ols6. How did this change the coefficients and model results?