Lab 8 focuses on gaining hands-on experience with multiple regression modeling in R. In addition to running the regression models themselves, we usually visualize and inspect the data before modeling and run regression diagnostics after modeling.

Functions     Tasks
left_join()   Joins two data.frames so that all rows in the left-hand table are preserved.
ggpairs()     Creates a matrix of pairwise scatterplots, correlation coefficients, and density plots.

 

In this lab, we will build a series of regression models that examine the relationship between median housing value and various factors that may affect it. That is, we will use median housing value as the dependent variable, and Walk Score, distance from the CBD, median household income, and whether or not the tract is inside Atlanta (a dummy variable) as the independent variables. In particular, we will focus on how Walk Score affects housing values.

   

0. Setting the working directory & reading data

Let’s first install the required packages in R. In addition to the functions in base R, we will use left_join() from “tidyverse” and ggpairs() from “GGally”. Then, we load the packages and read the data in. There are two datasets that you can download from Canvas > Files > Lab Materials, named ‘lab8_census_10counties.csv’ and ‘lab8_walkscore_4counties.csv’. Download those files and store them in a folder of your choice.

# Install required packages
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("GGally")) install.packages("GGally")
# Call the packages using library()
library(tidyverse)
library(GGally)
# USE YOUR OWN working directory and file name
setwd("C:\\Users\\cod\\Desktop\\PhD Files\\GRA & GTA\\GTA\\CP6025 Fall 2021\\Week 9\\Lab7_2020") # Use your own pathname
census.df <- read.csv("lab8_census_10counties.csv")
ws.df <- read.csv("lab8_walkscore_4counties.csv")

Can you describe what each of these variables is and whether each one is categorical or continuous?

# Inspecting the data
head(census.df)
##         GEOID                                        NAME   mval  hinc nroom
## 1 13121001800     Census Tract 18, Fulton County, Georgia 136500 42159   3.3
## 2 13121003200     Census Tract 32, Fulton County, Georgia 228700 84350   3.8
## 3 13121003600     Census Tract 36, Fulton County, Georgia 175900 41372   3.4
## 4 13121006801  Census Tract 68.01, Fulton County, Georgia     NA    NA    NA
## 5 13121008800     Census Tract 88, Fulton County, Georgia 285100 95524   6.4
## 6 13121010512 Census Tract 105.12, Fulton County, Georgia  92800 30819   5.0
##   unit.single unit.multi unitstr.tot   p.single   p.multi
## 1         325       2914        3277 0.09917608 0.8892280
## 2         728        744        1472 0.49456522 0.5054348
## 3          35        979        1014 0.03451676 0.9654832
## 4           0          0           0         NA        NA
## 5        1974        385        2359 0.83679525 0.1632047
## 6        1164       1956        3158 0.36858771 0.6193794
nrow(census.df)
## [1] 738
head(ws.df)
##       TractID walkscore cbd_dist in.atl
## 1 13063040202  18.81333 12908.51      0
## 2 13063040203  26.16667 15682.62      0
## 3 13063040204  37.40000 15358.93      0
## 4 13063040302  31.66667 11214.80      1
## 5 13063040303  33.42222 11418.21      1
## 6 13063040307  22.70833 15944.66      0
nrow(ws.df)
## [1] 501
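
To help answer the question above about which variables are categorical and which are continuous, one option is to check each column's class with sapply(); a minimal sketch using the two data.frames we just read in:

# Check the class of each column in the two data.frames
sapply(census.df, class)
sapply(ws.df, class)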

 

1. Joining the two data.frames & Basic data cleaning

We learned how to join two data.frames using the full_join() function. It worked fine last time because we knew that the two data.frames we joined covered exactly the same Census Tracts and shared the same KEY variable. However, as we observed from nrow(census.df) and nrow(ws.df), one data.frame contains 738 rows and the other 501. Also, one data.frame stores its KEY variable in ‘GEOID’ while the other stores it in ‘TractID’.

1.1. The census dataset (‘census.df’) contains data for 10 counties around Atlanta, while ‘ws.df’ contains data for only 4 of them. If we do full_join(), it will give us many rows filled with NAs. Alternatively, we can use left_join() or right_join() to drop all rows that represent Census Tracts not contained in ‘ws.df’.

Also, we need to tell R that the variable we want to use as the KEY has a different name in each data.frame. Notice the third argument of left_join() below. That argument specifies which variable in one data.frame should be matched with which variable in the other when joining them. Note that the order matters.

# Joining census.df and ws.df
join.df <- left_join(ws.df, census.df, 
                     by = c("TractID" = "GEOID"))
str(join.df) # We can see here that we have 501 rows and 13 variables. The two tables are successfully joined.
## 'data.frame':    501 obs. of  13 variables:
##  $ TractID    : num  1.31e+10 1.31e+10 1.31e+10 1.31e+10 1.31e+10 ...
##  $ walkscore  : num  18.8 26.2 37.4 31.7 33.4 ...
##  $ cbd_dist   : num  12909 15683 15359 11215 11418 ...
##  $ in.atl     : int  0 0 0 1 1 0 1 0 0 0 ...
##  $ NAME       : chr  "Census Tract 402.02, Clayton County, Georgia" "Census Tract 402.03, Clayton County, Georgia" "Census Tract 402.04, Clayton County, Georgia" "Census Tract 403.02, Clayton County, Georgia" ...
##  $ mval       : num  98600 98200 105000 55400 59300 69100 60500 89700 99200 88100 ...
##  $ hinc       : int  31524 36786 39194 33190 37236 37064 25159 45768 37224 42280 ...
##  $ nroom      : num  5.6 5.2 4.5 4.9 5.4 6.2 4.7 5.5 5.6 6.1 ...
##  $ unit.single: int  614 992 843 1447 2047 1659 1253 1315 1442 1880 ...
##  $ unit.multi : int  397 704 1210 876 629 86 752 98 1408 221 ...
##  $ unitstr.tot: int  1011 1696 2064 2323 2730 1772 2038 1465 2891 2129 ...
##  $ p.single   : num  0.607 0.585 0.408 0.623 0.75 ...
##  $ p.multi    : num  0.393 0.415 0.586 0.377 0.23 ...
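
If you want to double-check the join, anti_join() (also in “tidyverse”) returns the rows of one data.frame that have no match in the other; a quick sketch:

# Tracts in ws.df without a match in census.df (we expect zero, since the summary below shows no NAs)
nrow(anti_join(ws.df, census.df, by = c("TractID" = "GEOID")))
# Tracts in census.df that were dropped by the left join (those outside the 4 counties)
nrow(anti_join(census.df, ws.df, by = c("GEOID" = "TractID")))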

1.2. Let’s use the separate() function to parse the ‘NAME’ variable into pieces. This is to extract county names from the ‘NAME’ variable.

# Parsing NAME variable 
join.df.sep <- separate(join.df, NAME, c("tract", "county", "state"), sep = ", ") # Don't forget to include a space after the comma in the 'sep' argument.
summary(join.df.sep) # The summary shows that the data look okay in general: there are no NAs, and the min, max, mean, and median all look reasonable.
##     TractID            walkscore         cbd_dist         in.atl      
##  Min.   :1.306e+10   Min.   : 1.494   Min.   :    0   Min.   :0.0000  
##  1st Qu.:1.307e+10   1st Qu.:14.197   1st Qu.: 8742   1st Qu.:0.0000  
##  Median :1.309e+10   Median :23.541   Median :16455   Median :0.0000  
##  Mean   :1.309e+10   Mean   :29.426   Mean   :17086   Mean   :0.3234  
##  3rd Qu.:1.312e+10   3rd Qu.:39.852   3rd Qu.:24119   3rd Qu.:1.0000  
##  Max.   :1.312e+10   Max.   :91.667   Max.   :43070   Max.   :1.0000  
##     tract              county             state                mval        
##  Length:501         Length:501         Length:501         Min.   :  53000  
##  Class :character   Class :character   Class :character   1st Qu.: 102700  
##  Mode  :character   Mode  :character   Mode  :character   Median : 182500  
##                                                           Mean   : 230242  
##                                                           3rd Qu.: 322300  
##                                                           Max.   :1117200  
##       hinc            nroom       unit.single     unit.multi      unitstr.tot  
##  Min.   :  9815   Min.   :3.00   Min.   :  20   Min.   :   0.0   Min.   : 289  
##  1st Qu.: 41481   1st Qu.:4.90   1st Qu.: 794   1st Qu.: 152.0   1st Qu.:1591  
##  Median : 59981   Median :5.70   Median :1345   Median : 574.0   Median :2129  
##  Mean   : 67735   Mean   :5.96   Mean   :1462   Mean   : 811.8   Mean   :2296  
##  3rd Qu.: 85333   3rd Qu.:7.00   3rd Qu.:1961   3rd Qu.:1211.0   3rd Qu.:2887  
##  Max.   :200179   Max.   :9.00   Max.   :5678   Max.   :4207.0   Max.   :6918  
##     p.single          p.multi       
##  Min.   :0.02699   Min.   :0.00000  
##  1st Qu.:0.42914   1st Qu.:0.08032  
##  Median :0.68703   Median :0.30144  
##  Mean   :0.64728   Mean   :0.34344  
##  3rd Qu.:0.90680   3rd Qu.:0.56258  
##  Max.   :1.00000   Max.   :0.97301
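
As an additional check on the separate() step, you could tabulate the new county column to confirm that only the four expected counties appear; a quick sketch:

# Number of tracts in each county after parsing the NAME variable
table(join.df.sep$county)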

Now that we have joined the two data.frames, here is a list of the variable names and what they represent.

Variable name   What it is measuring
TractID         Unique ID for each Census Tract
walkscore       Average Walk Score, calculated as the mean of Walk Score values measured at points on a 0.25-mile grid
cbd_dist        Distance from the ATL City Hall to the centroid of each Census Tract
in.atl          Binary variable (1 = inside the ATL city boundary, 0 = otherwise)
county          County name (Fulton, DeKalb, Clayton, or Cobb)
mval            Median housing value
hinc            Median household income
nroom           Median number of rooms
unit.single     Count of single-family houses
unit.multi      Count of multifamily houses
unitstr.tot     Total number of houses (denominator for unit.single and unit.multi)
p.single        Proportion of single-family houses (0 to 1)
p.multi         Proportion of multifamily houses (0 to 1)

   

2. Before running regression models: Understanding your data

It is often crucial to gain a good understanding of the data before we jump into regression modeling. Gaining that understanding starts with thoroughly investigating descriptive statistics (e.g., mean, median, range, variance, distribution, etc.) and various plots such as histograms and scatterplots.

There is a convenient function called ggpairs() in the “GGally” package that generates scatterplots, correlation coefficients, and density plots in one graph. In its simplest form, it takes only one argument: a data.frame. Because ggpairs() draws the graph using every variable it is supplied with, we pass it a data.frame containing only the selected variables.

# Generating a paired scatterplot
ggpairs(join.df.sep[ ,c("walkscore", "mval", "cbd_dist", "hinc", "nroom", "p.single", "p.multi", "in.atl")])

The lower triangle of the matrix shows the pairwise scatterplots, the diagonal shows density plots, and the upper triangle shows correlation coefficients. The variable of our interest, walkscore, appears to have a negative relationship with mval, but the relationship is a bit more complex. Among low-walkability areas, some have high median housing values and others have low median housing values, whereas high-walkability areas tend to have low median housing values (try drawing a separate scatterplot, as sketched below, if you want a closer look). Overall, however, the two variables show almost no correlation (0.053).
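
Here is one way to draw that separate scatterplot with ggplot2 (loaded as part of “tidyverse”), together with the plain correlation coefficient; a minimal sketch:

# Stand-alone scatterplot of walkscore vs. mval with a smoothing line
ggplot(join.df.sep, aes(x = walkscore, y = mval)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE)
# Correlation between walkscore and mval (should match the 0.053 reported by ggpairs)
cor(join.df.sep$walkscore, join.df.sep$mval)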

Other notable things in the ggpairs matrix include: (1) mval and hinc show a clear sign of strong correlation, (2) some independent variables are strongly correlated with one another (e.g., hinc vs. nroom, or nroom vs. p.single), and (3) p.single and p.multi, by definition, show a nearly perfect correlation. We may need to be concerned about multicollinearity. Also, we can see that many variables, including mval, walkscore, and hinc, are strongly skewed.

With these patterns in mind, let’s run regression models.

 

3. Multiple regression

Running regression models is often not a linear process; it is common to revise a regression model repeatedly by, for example, including or excluding variables, transforming existing variables, deleting extreme outliers, and so on.

3.1. When running regressions, I personally like to start with simple models, so I will start with a bivariate regression between mval and walkscore.

The regression result shown below indicates that there is no significant relationship between mval and walkscore (see the p-value), just as we observed in the scatterplot. Notice that the Multiple R-squared is extremely low, suggesting that only an extremely small share of the variation in mval is accounted for by walkscore. The F-test also shows an insignificant result.

ols1 <- lm(mval ~ walkscore, data = join.df.sep)
summary(ols1)
## 
## Call:
## lm(formula = mval ~ walkscore, data = join.df.sep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -184793 -125569  -50006   83863  895981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 217546.2    12886.5  16.882   <2e-16 ***
## walkscore      431.4      361.5   1.193    0.233    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 162800 on 499 degrees of freedom
## Multiple R-squared:  0.002846,   Adjusted R-squared:  0.0008477 
## F-statistic: 1.424 on 1 and 499 DF,  p-value: 0.2333

3.2. The result that walkscore and mval are not related differs from what some past studies have reported. Why is this the case? We anecdotally know that some suburban neighborhoods can be expensive (e.g., large single-family houses with many rooms) but some urban neighborhoods can ALSO be expensive (e.g., compact, multifamily housing in neighborhoods around the BeltLine or Midtown). Maybe we could not see the relationship between walkscore and mval because there are many different types and sizes of houses. Let’s include variables that control for the variation in mval caused by housing types and sizes. After we control for this variation, we can examine how walkscore is associated with home value for housing of comparable type and size.

As shown below in ols2, the inclusion of nroom and p.single variables dramatically changed the result!

After accounting for the effects of nroom and p.single, the walkscore variable is highly significant with a positive coefficient. This means that, according to the model, a higher walkscore is associated with a higher mval WHEN HOLDING ALL OTHER VARIABLES CONSTANT, just as the literature suggests.

Also notice that nroom is highly significant with a positive coefficient. It says that a one-unit increase in nroom (i.e., having one more room) increases the estimated mval by about $140,031 on average, WHEN HOLDING ALL OTHER VARIABLES CONSTANT.

The p.single variable is also highly significant but with a negative coefficient. You need to be careful in interpreting this coefficient: it is NOT saying that a 1% increase in the proportion of single-family houses is associated with a $416,424 decrease in mval. Because p.single is a proportion that ranges between 0 and 1, a one-unit increase in this case means going from 0% to 100%. So the massive coefficient (-416424.5) means that when a neighborhood goes from 0% (= 0) single-family houses to 100% (= 1) single-family houses, the estimated mval decreases by $416,424 on average, WHEN HOLDING ALL OTHER VARIABLES CONSTANT.

Also note that the Multiple R-squared is now 0.4216 and the Adjusted R-squared is 0.4181, up from around 0.003.

ols2 <- lm(mval ~ walkscore + nroom + p.single, data = join.df.sep) # nroom, and p.single included
summary(ols2)
## 
## Call:
## lm(formula = mval ~ walkscore + nroom + p.single, data = join.df.sep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -285351  -77593  -20671   55854  668047 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -431406.5    39646.7 -10.881   <2e-16 ***
## walkscore      3281.7      374.1   8.773   <2e-16 ***
## nroom        140031.1     7488.3  18.700   <2e-16 ***
## p.single    -416424.5    36620.0 -11.372   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 124200 on 497 degrees of freedom
## Multiple R-squared:  0.4216, Adjusted R-squared:  0.4181 
## F-statistic: 120.8 on 3 and 497 DF,  p-value: < 2.2e-16
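
To translate these coefficients into predicted differences in mval, you can pull them out of the fitted model with coef(); a small sketch that simply restates the estimates printed above:

# Extract the estimated coefficients from ols2
coef(ols2)
# Predicted difference in mval for a 10-point difference in walkscore,
# holding nroom and p.single constant (about 10 * 3281.7 = 32,817 dollars)
10 * coef(ols2)["walkscore"]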

3.3. A neat thing about running regressions in R is that you can modify variables on the fly within the lm() function. The code below multiplies p.single by 100 so that we get a coefficient for a 1% increase, not for a 100% increase (which is unrealistic and therefore unintuitive). You can do this by wrapping the variable with I().

As shown in ols3, the coefficient for I(p.single * 100) is now more intuitive: a 1% increase in the proportion of single-family houses is associated with a $4,164 decrease in mval. Notice that multiplying p.single by 100 did not change anything about the other variables.

ols3 <- lm(mval ~ walkscore + nroom + I(p.single*100), data = join.df.sep) # nroom, p.single included
summary(ols3)
## 
## Call:
## lm(formula = mval ~ walkscore + nroom + I(p.single * 100), data = join.df.sep)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -285351  -77593  -20671   55854  668047 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -431406.5    39646.7 -10.881   <2e-16 ***
## walkscore            3281.7      374.1   8.773   <2e-16 ***
## nroom              140031.1     7488.3  18.700   <2e-16 ***
## I(p.single * 100)   -4164.2      366.2 -11.372   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 124200 on 497 degrees of freedom
## Multiple R-squared:  0.4216, Adjusted R-squared:  0.4181 
## F-statistic: 120.8 on 3 and 497 DF,  p-value: < 2.2e-16
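
The same on-the-fly approach works for other transformations as well. For example, because we saw earlier that mval and hinc are strongly skewed, you could regress the natural log of mval on the same predictors; this is only an optional sketch, not one of the models built in this lab:

# Optional: same predictors, but with a log-transformed median housing value
ols3.log <- lm(log(mval) ~ walkscore + nroom + I(p.single * 100), data = join.df.sep)
summary(ols3.log)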

Now experiment with different combinations of variables in our data set.

  1. Now add in.atl (1 = inside ATL city boundary, 0 = otherwise) as an independent variable and assign it to the variable name ols4. How did this change the coefficients and model results?

  2. Add p.multi as an independent variable, drop p.single, and assign your new model to the variable name ols5. How did this change the coefficients and model results? Note that we are dropping p.single because it is highly correlated with p.multi.

  3. Now drop p.multi and add p.single back in as an independent variable; also add cbd_dist as an independent variable and assign your new model to the variable name ols6. How did this change the coefficients and model results?