Abstract
Introduction
Data
- Data Prepartion
  - Variables
Methods
- Techniques
  - OLS
  - GWR
- Process Workflow
Analysis
Reflection
References
Appendix

library(tidyverse)
library(tidycensus)
library(tigris)
library(janitor)
library(here)
library(sf)
library(ggthemes)
library(scales)
library(rgdal)
library(rgeos)
library(spgwr)
library(grid)
library(gridExtra)
library(sp)
library(MASS)
library(stargazer)
library(imager)
library(ggspatial)
options(tigris_use_cache = TRUE)

Abstract

An initial investigation of the relationship between vacancy rates and basic demographic characteristics was undertaken to evaluate whether Geographically Weighted Regression (GWR) is able to improve upon the results obtained when using Ordinary Least Squares (OLS) Regression, as well as to learn more about how the required analysis methods can be carried out using the R Statistical Programming Language. The central theme to be explored is using different regression techniques to generate a model to describe the relationship between residential vacancy rates in King County, WA, and a set of variables using data from the U.S. Census Bureau, obtained through the tidycensus package. The variables used are related to basic demographic characteristics that have been aggregated by census tract. Initial results thus far have yielded varied results. Acquiring the data and performing initial regression analysis has been achieved. Starting with a set of potential variables for consideration, the number was reduced to five through applying different evaluation techniques to select the most suitable ones for further analysis. The same set was then used in GWR, with the difference in results between the two models compared and evaluated. GWR was able to improve the explanatory power over OLS, with a median adjusted R-Squared of 0.27, compared to the global 0.23 value. The same process was repeated using only census tracts within the City of Seattle. Even more improvement in the model was shown when using GWR over the results obtained through OLS. The median value increased to over 0.40 for tracts strictly within Seattle. While there appeared to be improvement in the results when a GWR was utilized, tests of significance were inconclusive, suggesting further work is required in acquiring suitable variable data.

Introduction

The following analysis was conducted to investigate the relationship between residential vacancy rates and basic demographic characteristics of census tracts in King County, Washington. Determining the underlying relationship between residential vacancy rates and demographic and location characteristics is of great interest to a wide range of parties impacted by residential real estate markets, from potential home buyers to lenders. Global models used to examine such data are often not well suited for describing the spatial variation of certain variables. Through applying a local model, such as Geographically Weighted Regression (GWR), local characteristics can be observed in a much clearer manner. Several studies have incorporated the use of GWR to study housing prices. A study by Chwiałkowski et al. (2022), examined house prices in Poland, while Wang et al. (2020) examined two sections of Beijing. Using a local model was shown to vastly increase the usefulness of the results, making such methods suitable candidates for the project. While these studies utilized individual transaction data, the proposed project is set to take a different approach. The focus of those studies was primarily on estimating home prices using a wide range of variables, while the proposed project is focused more on trends that are prevalent at the census tract level, rather than building or block level. While the intent of the current project will be different, many of the techniques discussed in the papers will be applicable, even if the topics are of a slightly different nature. The types of variables used include such categories as median home prices and incomes, household density, transportation methods, poverty rates, unemployment rates, and other related topics. These factors are often considered key drivers of residential housing demand as mentioned by Brett & Schmitz (2015), so were included in the process as potential candidates. Mennis (2006) provides an excellent summary of some useful techniques that can be utilized when working with GWR tools. Due to the possibility of changes in coefficient signs, it is advised to use a diverging, rather than a sequential, color scheme when this phenomenon occurs. Additionally, highlighting areas with statistically significant results is important. Because there is a great deal of uncertainty associated with the results, making sure that those areas are shown in a predominate manner is of paramount importance. The relative performance at the census tract level will be examined, to provide a general overview of the importance of different underlying variables to vacancy rates throughout the County. One of the goals of conducting such analysis is to observe what sections of the study area generate the most accurate results and how the importance of the various factors change over the study area as well. Once the most crucial factors for different areas are observed, more detailed investigations for those regions of interest can be carried out in future steps. A current breakdown of the data and methods used is provided next.

Data

Data used in the analysis came almost exclusively from the US Census Bureau, in the form of 2021 ACS 5-year estimates. The City of Seattle Open data portal was the only additional source used, in this particular case a shapefile of city neighborhood districts (City of Seattle GIS Program, 2023). The eventual goal is to add multiple datasets from a wide range of sources, but for now, deriving a suitable methodology to be evaluated will rely only on data from one source. In terms of scale, the area to be studied will be King County, Washington. This county contains the City of Seattle and surrounding areas. The enumerating unit will be census tracts. Owing to data often being aggregated at this level, being able to add in additional sources in the future will be much easier if the units are the same. A full list of each of the variables that was used is provided in the variable section of the code, with a summary of the final set of variables actually evaluated for inclusion in the regression analysis itself is provided in Table 1.

image_path <- here("Data","VariableTable.jpg")
variable_table <- load.image(image_path)
plot(variable_table, axes = F)

Table 1: List of variables considered for inclusion in the regression analysis. The Variable Name is the name used in teh code, while the Variable Description is the use of the variable to represent a particular category of interest.

Data Prepartion

Variables

Before the analytical functions can be conducted, there were some modifications required to be made to some of the variables. The dependent variable, the vacancy rate, is a derived value found by dividing the total vacant unit value by the total unit count in each census tract. This value is then multiplied by one hundred. The steps taken for the independent variable preparation is discussed in a later section. The tidycensus package was used to obtain the data, with the tigris package being used to obtain the census tract boundary layers. The 2021 5-year ACS estimates are the survey type and year for acquiring the requisite data. One of the original goals was to gather data from two different time periods and use changes between periods, but owing to some time constraints, the following analysis only used one period. The census tracts are also the 2021 boundaries.

Methods

Techniques

OLS

The first regression method employed is Ordinary Least Squares (OLS). This global method calculates the regression coefficients using the entire data set. The values obtained are calculated in a manner so that the total error value, the sum of the differences between the observed and model calculated values squared, is minimized. While the coefficients are those that minimize overall error, there is no accounting for local variation. This linear regression method was used first to select a group of variables from the full set of possibilities. A More thorough overview is provided in Brunsdon et al. (1996).

Packages

The tidyverse package was used in the wrangling process and in creating the charts and graphs throughout the process. The tidycensus and tigris packages were used to acquire the ACS data through calling the API in the initial steps. The sf package was also used to create simple feature objectives, storing the ACS data frame with the related geometry. The base statics package was used to perform OLS regression. The imager, MASS, and Stargazer packages were also used. The imager library was used to import some required JPEGS, the MASS packages was utilized in selecting suitable variables, and the stargazer package for displaying the results.

GWR

The second regression method used is Geographically Weighted Regression (GWR). In contrast to global models such as OLS, GWR is a local method, where the coefficients are able to vary across space. Instead of one set of coefficients used for all data points, separate values are calculated for each location using an appropriate weighting scheme. There are a few main input parameters that are needed to utilize the GWR method. The first is the weighting scheme to be used, and the second main consideration is the bandwidth. These elements determine the weighting given to the surrounding points when calculating the local model parameters, and when determining what points are to be considered neighbors. Once these inputs have been set, then a set of local parameters are determined at each location using the selected weighting methodology. In terms of results, there is a separate set of coefficients and residual and error values for each location, as would be done with OLS. In addition, the statistical significance of the results is also determined for each location. Figure 1 shows some common weighting equations that are commonly used. The weighting (w), of a point j in relation to point i can be determined using several different kernel densities functions shown below. The value d, is the distance between the points, and h is the bandwidth value. The bandwidth can be determined in two main ways. In the fixed bandwidth scenario, the value selected is the one that minimizes the Cross Validation (CV) value. This is the sum of the square differences between the observed and predicted independent variable values at each location. The fixed bandwidth is the same for the entire data set. An adaptive bandwidth is a local value that can vary by location. Instead of a fixed value, the percentage of points to be used to calculate the bandwidth at each location is determined. This then allows the bandwidth to be determined at each location as described for the fixed bandwidth calculation. A more detailed explanation of GWR can be found in Brunsdon et al. (1996).

image_path <- here("Data","Functions.jpg")
functions <- load.image(image_path)
plot(functions, axes = F)

Figure 1: Formulas used to derive weighting scheme to calcualte local model parameters when performing Geographically Weighted Regression.

Packages

In addition to the packages used to wrangle the data and perform and analyze the regression results and variables, the spgwr package was utilized extensively to perform the geographically waited regression. The sp package was required to convert the sf objects to a sp data frame for use in performing the GWR operations.

Process Workflow

The first set of tasks are related to data wrangling. After obtaining the required data using the tidycensus data, OLS regression is performed using all potential variables. Further methods are used to determine which variables are significant and should be used in the remainder of the process steps. Several different methods are used during these steps. Once the set of potential variables is reduced, the OLS regression is run again. Using the same steps as done with the OLS method, GWR is performed with the same set of variables. The resulting local R-Squared, coefficients, and error values are then mapped, and improvement over the OLS method evaluated. The same process is then repeated using only census tracts within the City of Seattle. The workflow steps taken in the analysis are shown in Figure 2.

image_path <- here("Data","Process.jpg")
process <- load.image(image_path)
plot(process, axes = F)

Figure 2: Process workflow steps taken in the analysis.

Analysis

Inputs

Parameters

The parameters passed to the get ACS data method were all the same type and period. The inputs to import all variable data using tidycensus methods used the same parameters of geography = “tract”, state = “WA”, county = “King”, year = 2021, and survey = “acs5”.

geography <- "tract" # Set geography to census tracts
state <- "WA" # Set state to Washington
county <- "King" # Set county to King
year <- 2021 # Set survey year to 2021
survey <- "acs5" # Set survey type to 5-year estimates 
crs <- 3690 # Washington State Plane North Projection

Variables

Dependent Variables

The dependent variables selected were related to the estimated number of vacant residential units. These nominal values are then used to derive a vacancy rate.

Vacant_total <- "B25005_001" # Total vacant units
Vacant_resident_elsewhere <- "B25005_002" # Vacant units were resident resides elsewhere
Vacant_other <- "B25005_003" # Other vacancy situation

Independent Variables

The full list of potential independent variables used are obtained. Several individual tables are required to fully calculate the necessary topic. The main topics are number of units, number of owned and rented units, Gini coefficients, median gross rents, median home values, number of households, median household income, population counts at different levels of area median income, transportation taken to work, median age, population based on period of housing tenure imitated for owned properties, and population counts of those in the labor force not working. These variables are selected as those related to housing demand. A full list of variable names and type of data can be found in the Appendix section at the end.

Total_units <- "B25128_001" # Total number of estimated residential units
Owner_units <- "B25128_002" # Number of owned units
Renter_units <- "B25128_024" # Number of rented units
gini <- "B19083_001"# Gini coefficient in census tract
median_gross_rent <- "B25031_001" # Median gross rent in census tract
median_home_value <- "B25109_001"# Median home value in census tract
Households <- "B25124_001" # number of households
median_income <- "B19326_001" # Median income in census tract
poverty_total <- "B08122_001" # Total population count for which income levels are determined
poverty_under_100 <- "B08122_002" # Population count with income levels under 100% of area median
poverty_under_100_149 <- "B08122_003" # Population count with income levels between 100% and 149% of area median
poverty_over_150 <- "B08122_004" # Population count with income levels at or over 150% of area median
private_transport_work <- "B08122_005" # Population count of workers who take private transportation to work
public_transport_work <- "B08122_013" # Population count of workers who take public transportation to work
walked_work <- "B08122_017" # Population count of workers who take walk to work
other_transport_work <- "B08122_021" # Population count of workers who take other transportation to work
work_from_home <- "B08122_025" # Population count of workers who work from home
median_age <- "B06002_001" # Median age of census tract
Total_owned_pop <- "B25026_002" # Population count who own housing unit
Total_rent_pop <- "B25026_009" # Population count who rent housing unit
owned_pop_2019_later <- "B25026_003" # Population count of housing unit owners with housing tenure commencing in 2019 or later 
owned_pop_2015_2018 <- "B25026_004" # Population count of housing unit owners with housing tenure commencing between 2015 and 2018
owned_pop_2010_2014 <- "B25026_005" # Population count of housing unit owners with housing tenure commencing between 2010 and 2014
owned_pop_2000_2009 <- "B25026_006" # Population count of housing unit owners with housing tenure commencing between 2000 and 2009
owned_pop_1990_1999 <- "B25026_007" # Population count of housing unit owners with housing tenure commencing between 1990 and 1999
owned_pop_1989_earlier <- "B25026_008" # Population count of housing unit owners with housing tenure commencing in 1989 or earlier
rented_pop_2019_later <- "B25026_010" # Population count of housing unit renters with housing tenure commencing in 2019 or later
rented_pop_2015_2018 <- "B25026_011" # Population count of housing unit renters with housing tenure commencing between 2015 and 2018
rented_pop_2010_2014 <- "B25026_012" # Population count of housing unit renters with housing tenure commencing between 2010 and 2014
rented_pop_2000_2009 <- "B25026_013" # Population count of housing unit renters with housing tenure commencing between 2000 and 2009
rented_pop_1990_1999 <- "B25026_014" # Population count of housing unit renters with housing tenure commencing between 1990 and 1999
rented_pop_1989_earlier <- "B25026_015" # Population count of housing unit renters with housing tenure commencing in 1989 or earlier
total_labor_force <- "B12006_001" # Total labor force count
Not_in_labor_force_1 <- "B12006_012" # Not in labor force -  
Not_in_labor_force_2 <- "B12006_018" # Not in labor force - 
Not_in_labor_force_3 <- "B12006_023" # Not in labor force - 
Not_in_labor_force_4 <- "B12006_029" # Not in labor force - 
Not_in_labor_force_5 <- "B12006_034" # Not in labor force - 
Not_in_labor_force_6 <- "B12006_040" # Not in labor force - 
Not_in_labor_force_7 <- "B12006_045" # Not in labor force - 
Not_in_labor_force_8 <- "B12006_051" # Not in labor force - 
Not_in_labor_force_9 <- "B12006_056" # Not in labor force -

Import Data

Seattle Neighborhood Shapefile

The shapefile of neighborhood districts was obtained to provide the extent of the City of Seattle in relation to the rest of King County, and the display the different districts within the city as well. This layer is also used later to select only the census tracts within the city limits.

Seattle <- st_read(here("Data","Neighborhood_Map_Atlas_Districts.shp"), quiet = TRUE) %>% # Read in shapefile
  st_transform(crs) %>% # Transform projection 
  st_as_sf() # Create simple feature object

Seattle_Boundary <- Seattle %>%  # Dissolve polygons for each district to create one polygon for city
  mutate(City = 1) %>% # Assign value of 1 to all records
  group_by(City) %>% # Group all values together
  summarize() # Dissolve polygons

Variables

Variables that are to be used in the analysis were added to one list without any modification, representing the full set of potential variables to be included. This list is what is passed to several different methods. The names were changed from the original census table alphanumerical designation to match the variable name provided in the variable list. The names were kept the same in the following analysis steps.

ind_variables <- c(gini = gini,median_gross_rent = median_gross_rent, # Get independent variable data.
               Median_Home_Value = median_home_value,
               Households = Households, median_income = median_income, owned_units = Owner_units, rented_units = Renter_units,
               poverty_count = poverty_total, poverty_100 = poverty_under_100 , poverty_100_149 = poverty_under_100_149, poverty_150 = poverty_over_150,
               private_transport_work = private_transport_work, public_transport_work = public_transport_work, 
               walked_work = walked_work, other_transport_work = other_transport_work, work_from_home = work_from_home, median_age = median_age,
               Total_owned_pop = Total_owned_pop,owned_pop_2019_later = owned_pop_2019_later,owned_pop_2015_2018 = owned_pop_2015_2018,owned_pop_2010_2014 = owned_pop_2010_2014,
               owned_pop_2000_2009 = owned_pop_2000_2009, owned_pop_1990_1999 = owned_pop_1990_1999, owned_pop_1989_earlier = owned_pop_1989_earlier,
               Total_rent_pop = Total_rent_pop,rented_pop_2019_later = rented_pop_2019_later, rented_pop_2015_2018 = rented_pop_2015_2018, rented_pop_2010_2014 = rented_pop_2010_2014,
               rented_pop_2000_2009 = rented_pop_2000_2009, rented_pop_1990_1999 = rented_pop_1990_1999,rented_pop_1989_earlier = rented_pop_1989_earlier,
               total_labor_force = total_labor_force, Not_in_labor_force_1 =  Not_in_labor_force_1, Not_in_labor_force_2 = Not_in_labor_force_2, Not_in_labor_force_3 = Not_in_labor_force_3, Not_in_labor_force_4 = Not_in_labor_force_4, 
               Not_in_labor_force_5 = Not_in_labor_force_5, Not_in_labor_force_6 = Not_in_labor_force_6, Not_in_labor_force_7 = Not_in_labor_force_7, Not_in_labor_force_8 = Not_in_labor_force_8, Not_in_labor_force_9 = Not_in_labor_force_9)

Dataframe and Vacant Units

A data frame of vacant units with associated geometry is created through calling get acs method.

Vacant_df <- get_acs(variables = c(vacant_total = Vacant_total, vacant_else = Vacant_resident_elsewhere,vacant_other = Vacant_other), geography = geography, state = state,
                    year = year, survey = survey, county = county, geometry = T) %>% # Get vacant unit data 
  rename(vacant_count = estimate) %>%
  filter(variable == "vacant_total")

Dataframe of all Units

A data frame of total residential units created through calling get acs method.

Total_units_df <-  get_acs(variables = c(total_units = Total_units), geography = geography, state = state, # Get total Unit counts
                           year = year, survey = survey, county = county)

Combine Dataframes

The vacant unit data frame and total unit data frame are joined to create a combined data frame. This allowed for the vacancy rate to be calculated by dividing the number of vacant units by the total unit count for each census tract. The resulting values is the derived vacancy rate used in the following analysis as the independent variable.

Vacant_df <- left_join(Vacant_df,Total_units_df, by = "GEOID") %>% # Join total unit data frame to vacant unit data frame
  rename(total_units = estimate) %>% # Rename columns
  mutate(vacancy_rate = (vacant_count/(vacant_count+total_units)*100)) # Calculate vacancy rates

Dataframe of Independent Variables

A data frame of the independent variables is created through calling get acs method. The format was set to long, that is the form needed for creating related charts and graphs.

Independent_var_df <- get_acs(variables = ind_variables,geography = geography, state = state,
                              year = year, survey = survey, county = county) %>% 
  group_by(variable)

Dataframe of Independent Variables in Wide Formate

A data frame of the independent variables is created through calling get acs method. The format was set to wide, that is the format where each variable is in a separate column and is required for performing the regression analysis.

Independent_var_df_2 <-  get_acs(variables = ind_variables, geography = geography, state = state,
                                 year = year, survey = survey, county = county, output = "wide")

Combined Vacant Unit Dataframe and Independent Variable Data Frame

The dependent and independent variable data frames are joined. The required calculation to modify the variables to be in a usable form were steps also taken at this stage. Several of the variables were to be in percentage terms. This often required summing several different values together and then dividing them by a total count. Multiplying by 100 was also done for all variables that were a percentage of total value. The variables in nominal terms were the Gini coefficient, median gross rent, median income, and median age. The household variable is a derived value found through dividing the household count by the area of each census tract on a per square mile basis. The remaining variables were derived and shown as a percentage of the total for each category. For example, the percentage of owned units was found by dividing the owned unit counts by the total unit count. Other variables were calculated in an analogous manner. The names for the variables were kept the same, except there is an E added at the end of each name. This is the designation provided for estimate values when calling the census API using tidycensus. For simplicity, the names were not changed.

Combined_df <- left_join(Vacant_df,Independent_var_df_2, by = "GEOID") %>% # combine vacant unit data frame with independent variable data frame
  dplyr::select(GEOID,vacancy_rate, giniE, median_gross_rentE, Median_Home_ValueE, HouseholdsE, 
         median_incomeE,owned_unitsE,rented_unitsE,poverty_countE,poverty_100E,poverty_100_149E,poverty_150E,private_transport_workE,public_transport_workE,
         walked_workE,other_transport_workE,work_from_homeE,median_ageE, Total_owned_popE,owned_pop_2019_laterE, owned_pop_2015_2018E, 
         owned_pop_2010_2014E,owned_pop_2000_2009E, owned_pop_1990_1999E, owned_pop_1989_earlierE,
         total_labor_forceE, Not_in_labor_force_1E, Not_in_labor_force_2E, Not_in_labor_force_3E, Not_in_labor_force_4E, 
         Not_in_labor_force_5E, Not_in_labor_force_6E, Not_in_labor_force_7E, Not_in_labor_force_8E, Not_in_labor_force_9E,
         geometry) %>%
  mutate(total_commute = private_transport_workE + public_transport_workE + walked_workE + other_transport_workE + work_from_homeE, 
         total_owned_rented = owned_unitsE + rented_unitsE, owned_unitsE = (owned_unitsE/total_owned_rented)*100,rented_unitsE = (rented_unitsE/total_owned_rented)*100, 
         area = st_area(.)/2589988.1103,  # square miles conversion
         HouseholdsE = HouseholdsE/area, # Household density
         poverty_100E = (poverty_100E/poverty_countE)*100, poverty_100_149E = (poverty_100_149E/poverty_countE)*100, poverty_150E = (poverty_150E/poverty_countE)*100, # Income levels % of area median
         private_transport_workE = (private_transport_workE/total_commute)*100, public_transport_workE = (public_transport_workE/total_commute)*100, # Daily work commute types
         walked_workE = (walked_workE/total_commute)*100, other_transport_workE = (other_transport_workE/total_commute)*100,work_from_homeE = (work_from_homeE/total_commute)*100,
         owned_pop_2019_laterE = (owned_pop_2019_laterE/Total_owned_popE)*100, owned_pop_2015_2018E = (owned_pop_2015_2018E/Total_owned_popE)*100, 
         owned_pop_2010_2014E = (owned_pop_2010_2014E/Total_owned_popE)*100,owned_pop_2000_2009E = (owned_pop_2000_2009E/Total_owned_popE)*100, owned_pop_1990_1999E = (owned_pop_1990_1999E/Total_owned_popE)*100, owned_pop_1989_earlierE = (owned_pop_1989_earlierE/Total_owned_popE)*100,
         total_unemployed = (Not_in_labor_force_1E + Not_in_labor_force_2E +  Not_in_labor_force_3E + Not_in_labor_force_4E + Not_in_labor_force_5E + Not_in_labor_force_6E + Not_in_labor_force_7E + Not_in_labor_force_8E + Not_in_labor_force_9E), unemployment_rate = (total_unemployed/total_labor_forceE)*100) %>%
  na.omit()

Remove Water Features

The water features are removed from the combined data frame to better shown the actual extent of the census tracts.

Combined_df_water_erase <- Combined_df %>% # Remove water from shapefiles
  erase_water(area_threshold = .99, year = year)

Display Esimated Vacancy Rates

The first set of results is a map of vacancy rates using the ACS estimates for each census tract. This is a useful introduction to the basic vacancy characteristics of the county. As shown, the eastern section of the county has higher vacancy rates than the western portions.

King_County <- ggplot(data = Combined_df_water_erase, aes(fill = vacancy_rate)) + # Plot calculated vacancy rates by census tract
  geom_sf(color = NA) + 
  geom_sf(data = Combined_df_water_erase, fill = NA,color = "black") + 
  scale_fill_distiller(palette = "YlOrRd", direction = 1) +
  labs(title = "Vacancy Rate Estimates by Census Tract", subtitle = "King County, WA", fill = "Vacancy Rate") + 
  theme(legend.position = "right") + 
  annotation_north_arrow() + 
  theme_map() 
King_County

Figure 3: Map of vacancy rates using ACS estimates. The higher rates are found in the eastern part of the county.

OLS Regression

After preparing the data, the bulk of tasks undertaken were related to performing the regression analysis before visualizing the results.

Regression parameters

The first step taken was to create a formula with all the potential variables that might be used in the regression analysis. For the topics that required several different variables to provide a full summary for that category, one variable was removed.

formula_reg <- "vacancy_rate ~ giniE + median_gross_rentE + Median_Home_ValueE + HouseholdsE + median_incomeE + owned_unitsE + poverty_100E + poverty_100_149E + private_transport_workE + public_transport_workE + walked_workE + work_from_homeE + median_ageE +
owned_pop_2015_2018E + owned_pop_2010_2014E + owned_pop_2000_2009E + owned_pop_1990_1999E + owned_pop_1989_earlierE + unemployment_rate" # formula to be used in regression listing independent and dependent variables

OLS regression and results

OLS was run using the full set of variables. The results indicate a R-Squared value of only around 0.16, with only a few variables being significant. A positive coefficient value indicates that an increase in the value results in an increase in the estimated vacancy rate, while a negative coefficient value has the opposite impact. A more suitable set of variables is required to have a more adequate set of regression inputs, which is the focus of the next steps.

OLS_Example <- lm(formula_reg, data = Combined_df) # Perform regression
summary_stats <- summary(OLS_Example) # Obtain results
summary_stats # Display results

## 
## Call:
## lm(formula = formula_reg, data = Combined_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7773 -2.4691 -0.3501  1.9008 26.8404 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.027e+00  7.763e+00   0.132 0.894783    
## giniE                    3.714e+00  3.831e+00   0.970 0.332764    
## median_gross_rentE      -6.898e-05  5.542e-04  -0.124 0.900998    
## Median_Home_ValueE       8.689e-07  1.127e-06   0.771 0.441130    
## HouseholdsE             -8.651e-05  4.845e-05  -1.786 0.074802 .  
## median_incomeE          -2.536e-05  1.692e-05  -1.499 0.134491    
## owned_unitsE            -4.859e-02  1.366e-02  -3.556 0.000415 ***
## poverty_100E             2.930e-02  6.816e-02   0.430 0.667447    
## poverty_100_149E         2.232e-02  7.200e-02   0.310 0.756763    
## private_transport_workE  1.862e-02  7.421e-02   0.251 0.802042    
## public_transport_workE  -9.825e-03  8.510e-02  -0.115 0.908141    
## walked_workE             1.876e-01  8.162e-02   2.299 0.021966 *  
## work_from_homeE          4.292e-02  8.144e-02   0.527 0.598399    
## median_ageE              1.642e-01  5.389e-02   3.047 0.002448 ** 
## owned_pop_2015_2018E    -4.656e-03  3.579e-02  -0.130 0.896555    
## owned_pop_2010_2014E    -1.547e-02  3.599e-02  -0.430 0.667453    
## owned_pop_2000_2009E    -1.191e-02  3.384e-02  -0.352 0.725003    
## owned_pop_1990_1999E    -2.670e-02  4.094e-02  -0.652 0.514620    
## owned_pop_1989_earlierE -3.479e-02  4.015e-02  -0.867 0.386669    
## unemployment_rate       -3.058e-02  3.612e-02  -0.847 0.397649    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.791 on 456 degrees of freedom
## Multiple R-squared:  0.1948, Adjusted R-squared:  0.1612 
## F-statistic: 5.806 on 19 and 456 DF,  p-value: 4.172e-13

Residuals

A histogram of the the residuals indicate that they are normally distributed except for a few high values.

residuals <- as_data_frame(OLS_Example$residuals) # Obtain model residuals
ggplot(data = residuals, aes(y = value)) + # Create histogram of residuals
  geom_histogram(bins = 25) +
  coord_flip() + 
  labs(title = "Residuals", y = "Residual Value", x = "") +
  theme_economist_white()

Figure 4: The residuals calculated after performing the regression analysis appear to be normally distributed around 0, except for a few extremely high values.

Coefficients

The coefficient values and model predicated vacancy rates values obtained from results data frame.

coefficient_values <- OLS_Example$coefficients # Obtain model coefficients
Predicted <- as_data_frame(OLS_Example$fitted.values) # Obtain model predicted vacancy rates

Vacancy Rate Estimates

A histogram of the vacancy rates using the ACS estimates shows that the vast majority of values are 10% or lower, with a few exceptions on the high end.

ggplot(data = Vacant_df, aes(y = vacancy_rate)) + # Create histogram of estimated vacancy rates
  geom_histogram(bins = 25) +
  coord_flip() + 
  labs(title = "ACS Estimated Vacancy Values", y = "Estimated Vacancy Rate", x = "") +
  theme_economist_white()

Figure 5: The vacancy rates using ACS estimates shows a clear majority of values below 10%, with a few exceptionally large values included, mirroring the results of the residual distribution.

Regression Predicted Vacancy Rates

A histogram of the predicted vacancy rates shows that the vast majority of values are around 5%, with a few exceptions at the high end.

ggplot(data = Predicted, aes(y = value)) + # Create histogram of predicted vacancy rates
  geom_histogram(bins = 25) +
  coord_flip() + 
  labs(title = "Model Predicted Vacancy Values", y = "Predicted Vacancy Rate", x = "") +
  theme_economist_white()

Figure 6: The model predicted vacancy rates are distributed around 5%, with a few values on the high end.

Regression Tests

A more thorough set of sets were performed after the initial examination of the regression results. Several different methods were used to select a more appropriate set of variables.

VIF

The Variance Inflation Factor (VIF) of the full set of variables were found to determine whether multicolliniarity is present or not. A value of over 7.5 suggests that the variable is highly correlated with another values and is redundant. The variables of concern are those related to the type of transportation taken to work. Whether these values improve once a more refined set of variables is selected is a test to still be taken.

car::vif(OLS_Example)

##                   giniE      median_gross_rentE      Median_Home_ValueE 
##                2.031671                2.427831                3.605077 
##             HouseholdsE          median_incomeE            owned_unitsE 
##                2.983100                3.848244                3.457073 
##            poverty_100E        poverty_100_149E private_transport_workE 
##                1.653601                1.626325               61.373860 
##  public_transport_workE            walked_workE         work_from_homeE 
##               15.552125               15.497096               15.641166 
##             median_ageE    owned_pop_2015_2018E    owned_pop_2010_2014E 
##                2.663928                4.451877                3.806761 
##    owned_pop_2000_2009E    owned_pop_1990_1999E owned_pop_1989_earlierE 
##                3.597354                2.659924                2.601993 
##       unemployment_rate 
##                2.491761

Variable Selection

The combination of variables that have the lowest error values is determined through the stepAIC method. Variables are added and removed and the AIC value for each iteration is calculated, with the lowest AIC combination provided. For the purpose of this analysis, the variables shown as being significant are used in the following steps. The variables found through using this process are the Gini coefficient, household density, owned unit percentage, percentage of workers that walk to work, and the median age.

step_car <- stepAIC(OLS_Example, trace = F, direction = "both") # Calculate variable combination with lowest AIC score
step_car # Show results

## 
## Call:
## lm(formula = vacancy_rate ~ giniE + HouseholdsE + owned_unitsE + 
##     walked_workE + median_ageE, data = Combined_df)
## 
## Coefficients:
##  (Intercept)         giniE   HouseholdsE  owned_unitsE  walked_workE  
##    7.879e-01     5.269e+00    -8.203e-05    -4.893e-02     1.614e-01  
##  median_ageE  
##    1.265e-01

stargazer(OLS_Example, step_car, type = "text") # Show results in easier to follow table

## 
## =======================================================================
##                                       Dependent variable:              
##                         -----------------------------------------------
##                                          vacancy_rate                  
##                                   (1)                     (2)          
## -----------------------------------------------------------------------
## giniE                            3.714                  5.269*         
##                                 (3.831)                 (2.926)        
##                                                                        
## median_gross_rentE              -0.0001                                
##                                 (0.001)                                
##                                                                        
## Median_Home_ValueE              0.00000                                
##                                (0.00000)                               
##                                                                        
## HouseholdsE                    -0.0001*                -0.0001*        
##                                (0.00005)               (0.00004)       
##                                                                        
## median_incomeE                 -0.00003                                
##                                (0.00002)                               
##                                                                        
## owned_unitsE                   -0.049***               -0.049***       
##                                 (0.014)                 (0.011)        
##                                                                        
## poverty_100E                     0.029                                 
##                                 (0.068)                                
##                                                                        
## poverty_100_149E                 0.022                                 
##                                 (0.072)                                
##                                                                        
## private_transport_workE          0.019                                 
##                                 (0.074)                                
##                                                                        
## public_transport_workE          -0.010                                 
##                                 (0.085)                                
##                                                                        
## walked_workE                    0.188**                0.161***        
##                                 (0.082)                 (0.034)        
##                                                                        
## work_from_homeE                  0.043                                 
##                                 (0.081)                                
##                                                                        
## median_ageE                    0.164***                0.127***        
##                                 (0.054)                 (0.043)        
##                                                                        
## owned_pop_2015_2018E            -0.005                                 
##                                 (0.036)                                
##                                                                        
## owned_pop_2010_2014E            -0.015                                 
##                                 (0.036)                                
##                                                                        
## owned_pop_2000_2009E            -0.012                                 
##                                 (0.034)                                
##                                                                        
## owned_pop_1990_1999E            -0.027                                 
##                                 (0.041)                                
##                                                                        
## owned_pop_1989_earlierE         -0.035                                 
##                                 (0.040)                                
##                                                                        
## unemployment_rate               -0.031                                 
##                                 (0.036)                                
##                                                                        
## Constant                         1.027                   0.788         
##                                 (7.763)                 (1.702)        
##                                                                        
## -----------------------------------------------------------------------
## Observations                      476                     476          
## R2                               0.195                   0.185         
## Adjusted R2                      0.161                   0.176         
## Residual Std. Error        3.791 (df = 456)        3.757 (df = 470)    
## F Statistic             5.806*** (df = 19; 456) 21.307*** (df = 5; 470)
## =======================================================================
## Note:                                       *p<0.1; **p<0.05; ***p<0.01

Revised OLS Regression

The same regression steps taken before were repeated with the revised set of variables.

Revised regression

The regression formula with the five variables selected.

formula_reg <- "vacancy_rate ~ giniE + HouseholdsE + owned_unitsE  + walked_workE  + median_ageE" # Revised variable list

Run Revised OLS

The regression was run, with the results obtained showing a slight improvement in the Adjusted R-squared value to slightly higher value just under 0.18.

OLS_Example_rev <- lm(formula_reg, data = Combined_df) # Perform regression
summary_stats <- summary(OLS_Example_rev) # Obtain results
summary_stats # Show results

## 
## Call:
## lm(formula = formula_reg, data = Combined_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1323 -2.4116 -0.4587  1.9220 27.0792 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.879e-01  1.702e+00   0.463  0.64362    
## giniE         5.269e+00  2.926e+00   1.800  0.07243 .  
## HouseholdsE  -8.203e-05  4.486e-05  -1.829  0.06809 .  
## owned_unitsE -4.893e-02  1.144e-02  -4.277 2.29e-05 ***
## walked_workE  1.614e-01  3.363e-02   4.800 2.13e-06 ***
## median_ageE   1.265e-01  4.291e-02   2.949  0.00335 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.757 on 470 degrees of freedom
## Multiple R-squared:  0.1848, Adjusted R-squared:  0.1761 
## F-statistic: 21.31 on 5 and 470 DF,  p-value: < 2.2e-16

Revised Tests

Revised VIF

The VIF calculation method was performed on the revised variables that were used. The values are all under 7.5, with none over 3, indicating there is not a concerning amount of multicollinearity.

car::vif(OLS_Example_rev) # Calculate VIF values for each variable

##        giniE  HouseholdsE owned_unitsE walked_workE  median_ageE 
##     1.206955     2.603836     2.467367     2.678647     1.718910

GWR

After performing standard linear regression, geographically weighted regression analysis the next set of analytical tasks that were performed. Whether there is significant local variation and improvement in the results is the major objective to be met.

Conert to sp SpatialDataFrame

GWR requires a sp SpatialDataFrame be used which is the first step taken.

King_County_Tracts <- Combined_df %>% # Convert from sp SpatialDataFrame to sf object
  na.omit() %>% # Remove NA values
  as_Spatial() # Convert

GWR parameters and analaysis

Two major inputs are required for the GWR. The first step is to set the bandwidth to be used. An adjustable bandwidth is selected due to the drastically varying size of census tracts. Instead of a fixed bandwidth value, the output is a percentage of data points to be used to calculate the bandwidth at each location. Approximately 18% of the nearest values are used to estimate the bandwidth at each location in this case, representing 85 out of 476 data points. The regression results show the range of coefficient values for each variable. The signs for all variables are the same in this model. A more thorough analysis of the results follows.

bw <- gwr.sel(formula = formula_reg, data = King_County_Tracts, adapt = T) # Establish search bandwidth parameters

## Adaptive q: 0.381966 CV score: 6745.519 
## Adaptive q: 0.618034 CV score: 6773.173 
## Adaptive q: 0.236068 CV score: 6733.691 
## Adaptive q: 0.145898 CV score: 6730.719 
## Adaptive q: 0.1101097 CV score: 6746.94 
## Adaptive q: 0.1803399 CV score: 6727.934 
## Adaptive q: 0.1829134 CV score: 6728.127 
## Adaptive q: 0.1727274 CV score: 6727.966 
## Adaptive q: 0.1774322 CV score: 6728.278 
## Adaptive q: 0.1792292 CV score: 6727.791 
## Adaptive q: 0.1785428 CV score: 6727.719 
## Adaptive q: 0.1781186 CV score: 6727.93 
## Adaptive q: 0.1787898 CV score: 6727.734 
## Adaptive q: 0.1783873 CV score: 6727.796 
## Adaptive q: 0.1786371 CV score: 6727.714 
## Adaptive q: 0.1785965 CV score: 6727.708 
## Adaptive q: 0.1785965 CV score: 6727.708

gwr_example <- gwr(formula_reg, data = King_County_Tracts, adapt = bw, se.fit = TRUE, hatmatrix = T) # Perform GWR
gwr_example # Display model results

## Call:
## gwr(formula = formula_reg, data = King_County_Tracts, adapt = bw, 
##     hatmatrix = T, se.fit = TRUE)
## Kernel function: gwr.Gauss 
## Adaptive quantile: 0.1785965 (about 85 of 476 data points)
## Summary of GWR coefficient estimates at data points:
##                     Min.     1st Qu.      Median     3rd Qu.        Max.
## X.Intercept. -1.9206e+00  1.6728e-01  1.1800e+00  4.0810e+00  6.4483e+00
## giniE        -7.4928e-01  2.3048e+00  3.3111e+00  4.3103e+00  6.5146e+00
## HouseholdsE  -1.7279e-04 -1.1066e-04 -9.4054e-05 -6.4614e-05 -2.8398e-05
## owned_unitsE -1.0837e-01 -7.4968e-02 -6.2652e-02 -5.6876e-02 -4.2988e-02
## walked_workE  1.2355e-01  1.3302e-01  1.4369e-01  1.5590e-01  2.2533e-01
## median_ageE   9.0910e-03  9.3854e-02  1.5917e-01  2.0196e-01  2.2763e-01
##               Global
## X.Intercept.  0.7879
## giniE         5.2687
## HouseholdsE  -0.0001
## owned_unitsE -0.0489
## walked_workE  0.1614
## median_ageE   0.1265
## Number of data points: 476 
## Effective number of parameters (residual: 2traceS - traceS'S): 22.41479 
## Effective degrees of freedom (residual: 2traceS - traceS'S): 453.5852 
## Sigma (residual: 2traceS - traceS'S): 3.709822 
## Effective number of parameters (model: traceS): 16.6848 
## Effective degrees of freedom (model: traceS): 459.3152 
## Sigma (model: traceS): 3.686609 
## Sigma (ML): 3.621421 
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 2612.741 
## AIC (GWR p. 96, eq. 4.22): 2592.611 
## Residual sum of squares: 6242.594 
## Quasi-global R2: 0.2330278

Correlation

A correlation matrix of the coefficient values was conducted, with numbers close to one indicating the variables are strongly positively correlated, with negative numbers near negative one the opposite.

round(cor(as.data.frame(gwr_example$SDF[,2:7]), use = "complete.obs"),2) # Correlation matrix of variables

##              X.Intercept. giniE HouseholdsE owned_unitsE walked_workE
## X.Intercept.         1.00  0.33       -0.14        -0.61        -0.53
## giniE                0.33  1.00        0.40         0.11        -0.48
## HouseholdsE         -0.14  0.40        1.00         0.40        -0.57
## owned_unitsE        -0.61  0.11        0.40         1.00         0.47
## walked_workE        -0.53 -0.48       -0.57         0.47         1.00
## median_ageE         -0.85 -0.67       -0.15         0.16         0.47
##              median_ageE
## X.Intercept.       -0.85
## giniE              -0.67
## HouseholdsE        -0.15
## owned_unitsE        0.16
## walked_workE        0.47
## median_ageE         1.00

Plot of pairs

Plots of each variable against each other one was also performed. The correlation between two sets of variables can be seen through the graphs. values along a 45-degree line indicate a high degree of correlation. The resulting standard error column at the end shows rather unusual patterns with many of the variables, suggesting some type of data transformation might be advisable.

pairs(as(gwr_example$SDF, "data.frame")[,2:8], pch = ".") # Plot variable values against one another

Figure 7: Plots of values for pairs of variables provides a visual representation of variable correlation.

GWR Results and Conversion to sf object

The GWR results SpatialDataFrame was converted to a sf object to graph the results.

gwr_example_results <- gwr_example$SDF %>% # Create new data frame of GWR results
  st_as_sf() # Convert to sp object

Obtain bandwidth and residuals

Bandwidth and residual values obtained from results data frame.

gwr_example_results$bandwidth <- gwr_example$bandwidth # Obtain calculated bandwidth values
gwr_example_results$residuals <- gwr_example$lm$residuals # Obtain residual values

Display GWR Results

The results of the GWR are provided through several different visualizations to display the local variation of results more effectively.

Local R-Squared

The distribution of local R-Squared values shows the range of values, with the vast majority being between 0.20 and 0.40. There is also a spike around 0.20.

LocalR <- as_data_frame(gwr_example_results$localR2)# Create data frame of Local Rsquared values

ggplot(data = LocalR, aes(y = value)) + # Create histogram of local R-Squared values
  geom_histogram(bins = 25) +
  coord_flip() + 
  labs(title = "Local R Squared Values", y = "R-Squared Value", x = "") +
  theme_economist_white()

Figure 8: The distribution of R-Squared values shows that the values are closer to a uniform rather than gaussian distribution, skewing towards lower values as well.

Distributon of R-Squard values

The range of values is found as well to provide additional overview of the values displayed above. The median of 0.27 is higher than the OLS results, indicating improvement in the explanatory nature of the local model when compared to the global model.

summary(LocalR) # Obtain summary of local Rsquared statistics

##      value       
##  Min.   :0.1911  
##  1st Qu.:0.2187  
##  Median :0.2722  
##  Mean   :0.2823  
##  3rd Qu.:0.3357  
##  Max.   :0.4113

Remove Water Features

Remove water features to better show extent of census tracts.

gwr_example_results_display <- gwr_example_results %>% # Remove water features
  erase_water(area_threshold = .99, year = year)

Display Seattle with Neighborhood Districts

The city neighborhood districts are shown within King County. This provides the location of Seattle in relation to the rest of King County to better understand the results, and to show the extent of individual districts within the City of even more detailed descriptions.

ggplot() + 
  geom_sf(data = Seattle, aes(fill = L_HOOD)) + # Display city neighborhoods within County boundaries
  geom_sf(data = gwr_example_results_display,fill = NA, color = "black") + 
  labs(title = "Seattle Districts", subtitle = "King County, WA", fill = "City District") +   
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 9: Extent of Seattle city limits provides a reference of the City within the county.

Display Local R-Squared

The local R-Squared values increase significantly from east to west, with the highest values being in and near Seattle.

ggplot(data = gwr_example_results_display, aes(fill = localR2)) + # Map local Rsquared values by census tract as points
  geom_sf(color = NA) +
  scale_fill_viridis_c(option = "G", direction = -1) + 
  geom_sf(data = gwr_example_results_display,fill = NA, color = "black") +
  labs(title = "Local R-squared Values by Census Tract", subtitle = "King County, WA", fill = "Local\nR-Squared") +
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 10: The choropleth of local R-Squared values shows higher values found in the western part of King County, mainly within Seattle.

Display Local R-Squared with Seattle City Lines

Census tract boundaries were removed and city limits added to shown values in relation to Seattle.

ggplot(data = gwr_example_results_display, aes(fill = localR2)) + # Map local Rsquared values by census tract as points
  geom_sf(color = NA) +
  geom_sf(data = Seattle_Boundary, fill = NA, color = "darkred") +
  scale_fill_viridis_c(option = "G", direction = -1) + 
  labs(title = "Local R-squared Values by Census Tract", subtitle = "King County, WA", fill = "Local\nR-Squared") +
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 11: Choropleth without census tract boundaries but with City limits highlights results specifically within the City.

Display Residuals

The residuals of the results show the highest values in the eastern part of the county.

ggplot(data = gwr_example_results_display, aes(fill = residuals)) + # Map local residuals values by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Reds", direction = 1) +  
  geom_sf(data = gwr_example_results_display,fill = NA, color = "black") +
  labs(title = "Residuals by Census Tract", subtitle = "King County, WA", fill = "Residuals") +
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 12: The highest residual values are found in the eastern part of the county, allowing for the outliers detected earlier to be located and mapped.

Display Bandwidth Values

The bandwidth values show the highest in the east, and lower and more uniform values in the west.

ggplot(data = gwr_example_results_display, aes(fill = bandwidth)) + # Map local Median Income by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Reds", direction = 1) + 
  geom_sf(data = gwr_example_results_display,fill = NA, color = "black") + 
  labs(title = "Bandwidth by Census Tract", subtitle = "King County, WA", fill = "Bandwidth") +   
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 13: Bandwidth values are greater in the larger census tracts in the western section of the county.

Display Local Gini Coefficent Values

The local Gini coefficients show a trend of increase from west to east, with all values being positive. The higher the inequality, the higher the vacancy rate.

ggplot(data = gwr_example_results_display, aes(fill = giniE)) + # Map local Gini by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Reds", type = "seq", direction = 1) + 
  geom_sf(data = gwr_example_results_display,fill = NA, color = "black") + 
  labs(title = "Local Gini Coefficients", subtitle = "King County, WA", fill = "Local Coefficients\nGini Coefficient") +   # Local Coefficients  
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 14: Impact of increasing inequality is greater in the western and north census tracts than in the more southwestern sections. The impact within the City is also varied.

Display Local Household Density Coefficent Values

The household density coefficients show that the higher the density, the lower the vacancy rate, with the most significant impact being in the western part of the county in Seattle.

ggplot(data = gwr_example_results_display, aes(fill = HouseholdsE)) + # Map local Gini by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Blues", type = "seq") + 
  geom_sf(data = gwr_example_results_display,fill = NA, color = "black") + 
  labs(title = "Local Household Density Coefficients", subtitle = "King County, WA", fill = "Local Coefficients\nHousehold Density") +   # Local Coefficients  
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 15: Higher household density decreases vacancy rates, with Seattle strong reflecting this trend. The more central tracts are less impacted.

Display Local Ownership Percentage Coefficent Values

The same trend as found with the household density values were found with the ownership percentage variable. All else being equal, the higher the percentage of units that are owned, the lower the vacancy rate.

ggplot(data = gwr_example_results_display, aes(fill = owned_unitsE)) + # Map local Gini by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Blues", type = "seq") + 
  geom_sf(data = gwr_example_results_display,fill = NA, color = "black") + 
  labs(title = "Local Owned Unit Percentage Coefficients", subtitle = "King County, WA", fill = "Local Coefficients\n% Units Owned") +   # Local Coefficients
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 16: Higher ownership rates were found to reduce the vacancy rates, with Seattle once again more strongly reflecting this trend than other census tracts in the county.

Display Local Percentage of Workers that Walk to Work Coefficent Values

The walked to work coefficient shows that the higher the percentage of workers that walk to work the higher the vacancy rate. While the values are all positive, the importance of the variable is significantly lower in Seattle than other parts of the county.

ggplot(data = gwr_example_results_display, aes(fill = walked_workE)) + # Map local Gini by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Reds", type = "seq", direction = 1) + 
  geom_sf(data = gwr_example_results_display,fill = NA, color = "black") + 
  labs(title = "Local Walked to Work Coefficients", subtitle = "King County, WA", fill = "Local Coefficients\n% of Workers\nWalk to Work") +   # Local Coefficients
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 17: The percentage of workers that walk to work was found to increase vacancy rates, but Seattle was less impacted by this trend than tracts to the south.

Display Local Median Age Coefficent Values

The same trend was found with the median age coefficient values. Age is more important in Seattle and the southeast part of the county than in other areas.

ggplot(data = gwr_example_results_display, aes(fill = median_ageE)) + # Map local Gini by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Reds", type = "seq", direction = 1) + 
  geom_sf(data = gwr_example_results_display,fill = NA, color = "black") + 
  labs(title = "Local Median Age Coefficients", subtitle = "King County, WA", fill = "Local Coefficients\n Median Age") +   # Local Coefficients
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 18: The median age of a census tract showed increasing vacancy rates with increasing age, with the southwestern tracts the most impacted by this trend.

Display Model Predicted Vacancy Rate Values

The predicated vacancy values were shown as well, with the western sections having the highest vacancy rates.

ggplot(data = gwr_example_results_display, aes(fill = pred)) + # Map local Median Income by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Reds", direction = 1) + 
  geom_sf(data = gwr_example_results_display,fill = NA, color = "black") + 
  labs(title = "Local Predicted Vacancy Rates by Census Tract", subtitle = "King County, WA", fill = "Vacancy Rate\nPrediction") +   
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 19: The model predicted vacancy rates were found to be highest in the central parts of the city.

Display Model Predicted Vacancy Rate Standard Error Values

Finally, the map of standard error values shows that higher error values are found in Seattle area.

ggplot(data = gwr_example_results_display, aes(fill = pred.se)) + # Map local Median Income by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Reds", direction = 1) + 
  geom_sf(data = gwr_example_results_display,fill = NA, color = "black") + 
  labs(title = "Standard Error Values by Census Tract", subtitle = "King County, WA", fill = "Vacancy Rate\nPrediction\nStandard Error") +   
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 20: standard error values were found to be higher in Seattle, indicating higher predicted vacancy rates, the higher the error.

Test GWR Results with OLS Results

The results hint at the GWR method being a better explanatory model than the OLS model, but it is not conclusive. In Addition, the Gini coefficient was not found to be significant in the GWR model, even when found so in OLS.

BFC02.gwr.test(gwr_example)

## 
##  Brunsdon, Fotheringham & Charlton (2002, pp. 91-2) ANOVA
## 
## data:  gwr_example
## F = 1.0629, df1 = 470.00, df2 = 453.59, p-value = 0.2565
## alternative hypothesis: greater
## sample estimates:
## SS OLS residuals SS GWR residuals 
##         6635.256         6242.594

BFC99.gwr.test(gwr_example)

## 
##  Brunsdon, Fotheringham & Charlton (1999) ANOVA
## 
## data:  gwr_example
## F = 1.7381, df1 = 296.61, df2 = 461.96, p-value = 4.873e-08
## alternative hypothesis: greater
## sample estimates:
## SS GWR improvement   SS GWR residuals 
##           392.6622          6242.5939

LMZ.F1GWR.test(gwr_example)

## 
##  Leung et al. (2000) F(1) test
## 
## data:  gwr_example
## F = 0.97487, df1 = 461.96, df2 = 470.00, p-value = 0.3919
## alternative hypothesis: less
## sample estimates:
## SS OLS residuals SS GWR residuals 
##         6635.256         6242.594

LMZ.F2GWR.test(gwr_example)

## 
##  Leung et al. (2000) F(2) test
## 
## data:  gwr_example
## F = 1.6944, df1 = 32.903, df2 = 470.000, p-value = 0.01063
## alternative hypothesis: greater
## sample estimates:
##   SS OLS residuals SS GWR improvement 
##          6635.2561           392.6622

LMZ.F3GWR.test(gwr_example)

## 
## Leung et al. (2000) F(3) test
## 
##              F statistic Numerator d.f. Denominator d.f.     Pr(>)    
## (Intercept)      1.97311      149.60046           461.96 3.425e-08 ***
## giniE            0.38356      179.22429           461.96   1.00000    
## HouseholdsE      1.56032       49.30229           461.96   0.01123 *  
## owned_unitsE     1.86717      122.81632           461.96 1.900e-06 ***
## walked_workE     1.35495       68.03280           461.96   0.03920 *  
## median_ageE      2.43136       79.50655           461.96 4.537e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary

Based on the results obtained when performing the two regression techniques on data for King County, there are a few main takeaways. Using GWR did improve the results to a degree over the OLS model. In addition, the higher the Gini, the higher the vacancy rate, as is the case for walked to work and the median age. On the other hand, the higher the household density and ownership rates, the lower the vacancy rates. The variation in results also suggests that the dynamics in the western and eastern parts of the County are quite different. This conclusion makes focusing on just one part of the County an appropriate next step.

Seattle Census Tracts

After the analysis that was conducted using the entire County, the City of Seattle census tracts were used to perform the same analysis that was conducted over the entire County. Owing to the fact that there was such a variation in values when the entire County was used, and trends appeared much more localized, especially in close proximity to Seattle, conducting similar analysis just for Seattle was a practical step to take.

Select Census Tracts within Seattle

The census tracts in Seattle were selected from the full county set, with a new data frame created through using the intersection method.

Seatle Only OLS Regression Parameters

The full set of variables were added again, with the process repeated as done before.

formula_reg_seattle <- "vacancy_rate ~ giniE + median_gross_rentE + Median_Home_ValueE + HouseholdsE + median_incomeE + owned_unitsE + poverty_100E + poverty_100_149E + private_transport_workE + public_transport_workE + walked_workE + work_from_homeE + median_ageE +
owned_pop_2015_2018E + owned_pop_2010_2014E + owned_pop_2000_2009E + owned_pop_1990_1999E + owned_pop_1989_earlierE + unemployment_rate" # formula to be used in regression listing independent and dependent variables

Perform OLS and Display Results

Linear regression was performed. Using only the city data, the adjusted R-squared value increases to 0.25 from below 0.18 when using the fully county data. Already, using just data for Seattle was improving results.

OLS_Seattle <- lm(formula_reg_seattle, data = Seattle_CT) # Perform regression
summary_stats <- summary(OLS_Seattle) # Obtain results
summary_stats # Display results

## 
## Call:
## lm(formula = formula_reg_seattle, data = Seattle_CT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0515 -2.2825 -0.3977  1.9360 13.0460 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)   
## (Intercept)             -2.597e+00  1.068e+01  -0.243  0.80825   
## giniE                    4.289e+00  6.306e+00   0.680  0.49744   
## median_gross_rentE      -6.076e-04  1.088e-03  -0.559  0.57717   
## Median_Home_ValueE       9.934e-07  1.936e-06   0.513  0.60860   
## HouseholdsE             -1.296e-04  5.314e-05  -2.440  0.01578 * 
## median_incomeE          -4.673e-06  3.296e-05  -0.142  0.88742   
## owned_unitsE            -7.599e-02  2.554e-02  -2.975  0.00338 **
## poverty_100E             2.637e-02  9.343e-02   0.282  0.77815   
## poverty_100_149E        -5.259e-02  1.153e-01  -0.456  0.64892   
## private_transport_workE  1.292e-01  9.952e-02   1.298  0.19609   
## public_transport_workE   1.232e-02  1.119e-01   0.110  0.91248   
## walked_workE             2.557e-01  1.041e-01   2.457  0.01507 * 
## work_from_homeE          6.381e-02  1.145e-01   0.557  0.57827   
## median_ageE              1.220e-01  9.445e-02   1.292  0.19813   
## owned_pop_2015_2018E     2.340e-02  5.082e-02   0.461  0.64576   
## owned_pop_2010_2014E    -3.235e-03  4.895e-02  -0.066  0.94739   
## owned_pop_2000_2009E     2.356e-02  5.158e-02   0.457  0.64846   
## owned_pop_1990_1999E    -1.835e-02  5.933e-02  -0.309  0.75756   
## owned_pop_1989_earlierE -6.244e-02  6.774e-02  -0.922  0.35800   
## unemployment_rate       -1.050e-01  6.462e-02  -1.624  0.10630   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.725 on 162 degrees of freedom
## Multiple R-squared:  0.3296, Adjusted R-squared:  0.251 
## F-statistic: 4.192 on 19 and 162 DF,  p-value: 1.949e-07

VIF

Variance Inflation Factors once again show the type of transportation values are of concern.

car::vif(OLS_Seattle) # Calculate VIF values

##                   giniE      median_gross_rentE      Median_Home_ValueE 
##                1.936046                2.022081                2.685593 
##             HouseholdsE          median_incomeE            owned_unitsE 
##                2.854513                4.712014                4.487958 
##            poverty_100E        poverty_100_149E private_transport_workE 
##                1.720442                1.615514               37.011549 
##  public_transport_workE            walked_workE         work_from_homeE 
##                8.988803               19.868655                9.133287 
##             median_ageE    owned_pop_2015_2018E    owned_pop_2010_2014E 
##                2.913040                4.781289                3.812517 
##    owned_pop_2000_2009E    owned_pop_1990_1999E owned_pop_1989_earlierE 
##                2.964696                2.468949                3.054665 
##       unemployment_rate 
##                2.726612

Variable Selection

Variable selection shows household density, owned units percentage and walked to work remaining, with taking private transportation to work and housing tenor of owned units starting in 1989 or earlier being added.

step_car_seattle <- stepAIC(OLS_Seattle, trace = F, direction = "both") # 
step_car_seattle

## 
## Call:
## lm(formula = vacancy_rate ~ HouseholdsE + owned_unitsE + private_transport_workE + 
##     walked_workE + owned_pop_1989_earlierE, data = Seattle_CT)
## 
## Coefficients:
##             (Intercept)              HouseholdsE             owned_unitsE  
##               5.9909539               -0.0001137               -0.0680989  
## private_transport_workE             walked_workE  owned_pop_1989_earlierE  
##               0.0778791                0.2001375               -0.0796994

stargazer(OLS_Seattle, step_car_seattle, type = "text")

## 
## =======================================================================
##                                       Dependent variable:              
##                         -----------------------------------------------
##                                          vacancy_rate                  
##                                   (1)                     (2)          
## -----------------------------------------------------------------------
## giniE                            4.289                                 
##                                 (6.306)                                
##                                                                        
## median_gross_rentE              -0.001                                 
##                                 (0.001)                                
##                                                                        
## Median_Home_ValueE              0.00000                                
##                                (0.00000)                               
##                                                                        
## HouseholdsE                    -0.0001**               -0.0001**       
##                                (0.0001)                (0.00005)       
##                                                                        
## median_incomeE                 -0.00000                                
##                                (0.00003)                               
##                                                                        
## owned_unitsE                   -0.076***               -0.068***       
##                                 (0.026)                 (0.017)        
##                                                                        
## poverty_100E                     0.026                                 
##                                 (0.093)                                
##                                                                        
## poverty_100_149E                -0.053                                 
##                                 (0.115)                                
##                                                                        
## private_transport_workE          0.129                 0.078***        
##                                 (0.100)                 (0.027)        
##                                                                        
## public_transport_workE           0.012                                 
##                                 (0.112)                                
##                                                                        
## walked_workE                    0.256**                0.200***        
##                                 (0.104)                 (0.042)        
##                                                                        
## work_from_homeE                  0.064                                 
##                                 (0.115)                                
##                                                                        
## median_ageE                      0.122                                 
##                                 (0.094)                                
##                                                                        
## owned_pop_2015_2018E             0.023                                 
##                                 (0.051)                                
##                                                                        
## owned_pop_2010_2014E            -0.003                                 
##                                 (0.049)                                
##                                                                        
## owned_pop_2000_2009E             0.024                                 
##                                 (0.052)                                
##                                                                        
## owned_pop_1990_1999E            -0.018                                 
##                                 (0.059)                                
##                                                                        
## owned_pop_1989_earlierE         -0.062                  -0.080*        
##                                 (0.068)                 (0.047)        
##                                                                        
## unemployment_rate               -0.105                                 
##                                 (0.065)                                
##                                                                        
## Constant                        -2.597                 5.991***        
##                                (10.683)                 (1.622)        
##                                                                        
## -----------------------------------------------------------------------
## Observations                      182                     182          
## R2                               0.330                   0.299         
## Adjusted R2                      0.251                   0.279         
## Residual Std. Error        3.725 (df = 162)        3.654 (df = 176)    
## F Statistic             4.192*** (df = 19; 162) 15.043*** (df = 5; 176)
## =======================================================================
## Note:                                       *p<0.1; **p<0.05; ***p<0.01

Revised Regression Parameters

Revised regression formula with only significant variables included.

formula_reg_seattle <- "vacancy_rate ~ HouseholdsE + owned_unitsE + private_transport_workE + walked_workE + owned_pop_1989_earlierE" # formula to be used in regression listing independent and dependent variables

Run OLS Regression and Dislay Results

Results show slight improvement when insignificant variables are removed.

OLS_Seattle <- lm(formula_reg_seattle, data = Seattle_CT) # Perform regression
summary_stats <- summary(OLS_Seattle) # Obtain results
summary_stats

## 
## Call:
## lm(formula = formula_reg_seattle, data = Seattle_CT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6325 -2.5042 -0.2928  2.1309 13.2241 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              5.991e+00  1.622e+00   3.694 0.000295 ***
## HouseholdsE             -1.137e-04  4.855e-05  -2.341 0.020331 *  
## owned_unitsE            -6.810e-02  1.745e-02  -3.903 0.000135 ***
## private_transport_workE  7.788e-02  2.701e-02   2.883 0.004424 ** 
## walked_workE             2.001e-01  4.180e-02   4.788 3.55e-06 ***
## owned_pop_1989_earlierE -7.970e-02  4.655e-02  -1.712 0.088662 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.654 on 176 degrees of freedom
## Multiple R-squared:  0.2994, Adjusted R-squared:  0.2795 
## F-statistic: 15.04 on 5 and 176 DF,  p-value: 2.723e-12

VIF

VIF values now all below 7.5, reducing concerns regarding multicollinearity.

car::vif(OLS_Seattle)

##             HouseholdsE            owned_unitsE private_transport_workE 
##                2.477246                2.177724                2.833793 
##            walked_workE owned_pop_1989_earlierE 
##                3.330710                1.499994

Seattle GWR

Convert to sp SpatialDataFrame

Converting to sp SpatialDataFrame undertaken to perform GWR.

Seattle_Tracts <- Seattle_CT %>%
  na.omit() %>%
  as_Spatial()

Regression Bandwidth Selection and Regression

Bandwidth is set to adaptive with roughly 21% of values used to calculate each local bandwidth. Coefficients have the same sign and the quasi-global R-Squared has increased as well.

bw_seattle <- gwr.sel(formula = formula_reg_seattle, data = Seattle_Tracts, adapt = T) # Establish search bandwidth parameters

## Adaptive q: 0.381966 CV score: 2463.149 
## Adaptive q: 0.618034 CV score: 2483.339 
## Adaptive q: 0.236068 CV score: 2456.933 
## Adaptive q: 0.1194597 CV score: 2492.466 
## Adaptive q: 0.2917961 CV score: 2458.19 
## Adaptive q: 0.2132592 CV score: 2455.819 
## Adaptive q: 0.177431 CV score: 2461.357 
## Adaptive q: 0.1995741 CV score: 2462.023 
## Adaptive q: 0.2219714 CV score: 2456.916 
## Adaptive q: 0.208032 CV score: 2456.661 
## Adaptive q: 0.2145577 CV score: 2455.851 
## Adaptive q: 0.2134738 CV score: 2455.814 
## Adaptive q: 0.2136195 CV score: 2455.811 
## Adaptive q: 0.2139779 CV score: 2455.805 
## Adaptive q: 0.2141993 CV score: 2455.801 
## Adaptive q: 0.2143362 CV score: 2455.809 
## Adaptive q: 0.2141275 CV score: 2455.802 
## Adaptive q: 0.2142516 CV score: 2455.8 
## Adaptive q: 0.2142923 CV score: 2455.801 
## Adaptive q: 0.2142516 CV score: 2455.8

gwr_example_seattle <- gwr(formula_reg_seattle, data = Seattle_Tracts, adapt = bw_seattle, se.fit = TRUE, hatmatrix = T) # Perform GWR
gwr_example_seattle

## Call:
## gwr(formula = formula_reg_seattle, data = Seattle_Tracts, adapt = bw_seattle, 
##     hatmatrix = T, se.fit = TRUE)
## Kernel function: gwr.Gauss 
## Adaptive quantile: 0.2142516 (about 38 of 182 data points)
## Summary of GWR coefficient estimates at data points:
##                                Min.     1st Qu.      Median     3rd Qu.
## X.Intercept.             4.4773e+00  6.5055e+00  7.2895e+00  8.6522e+00
## HouseholdsE             -1.6146e-04 -1.3906e-04 -1.2410e-04 -1.1320e-04
## owned_unitsE            -1.2359e-01 -1.0505e-01 -8.1616e-02 -6.7203e-02
## private_transport_workE  1.9258e-02  5.0571e-02  7.4903e-02  8.3884e-02
## walked_workE             1.2747e-01  1.5952e-01  1.8070e-01  2.0048e-01
## owned_pop_1989_earlierE -1.8086e-01 -1.2739e-01 -9.7645e-02 -1.6167e-02
##                                Max.  Global
## X.Intercept.             1.0231e+01  5.9910
## HouseholdsE             -6.6864e-05 -0.0001
## owned_unitsE            -1.7077e-02 -0.0681
## private_transport_workE  9.2656e-02  0.0779
## walked_workE             2.4093e-01  0.2001
## owned_pop_1989_earlierE  2.5933e-02 -0.0797
## Number of data points: 182 
## Effective number of parameters (residual: 2traceS - traceS'S): 19.93073 
## Effective degrees of freedom (residual: 2traceS - traceS'S): 162.0693 
## Sigma (residual: 2traceS - traceS'S): 3.563829 
## Effective number of parameters (model: traceS): 15.20053 
## Effective degrees of freedom (model: traceS): 166.7995 
## Sigma (model: traceS): 3.512933 
## Sigma (ML): 3.363036 
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 993.7517 
## AIC (GWR p. 96, eq. 4.22): 973.1694 
## Residual sum of squares: 2058.422 
## Quasi-global R2: 0.3862092

Variable Correlation Table

Correlation of variable pairs with one another.

round(cor(as.data.frame(gwr_example_seattle$SDF[,2:7]), use = "complete.obs"),2)

##                         X.Intercept. HouseholdsE owned_unitsE
## X.Intercept.                    1.00       -0.54        -0.95
## HouseholdsE                    -0.54        1.00         0.43
## owned_unitsE                   -0.95        0.43         1.00
## private_transport_workE         0.35       -0.55        -0.29
## walked_workE                   -0.89        0.23         0.87
## owned_pop_1989_earlierE        -0.46        0.66         0.31
##                         private_transport_workE walked_workE
## X.Intercept.                               0.35        -0.89
## HouseholdsE                               -0.55         0.23
## owned_unitsE                              -0.29         0.87
## private_transport_workE                    1.00        -0.46
## walked_workE                              -0.46         1.00
## owned_pop_1989_earlierE                   -0.93         0.50
##                         owned_pop_1989_earlierE
## X.Intercept.                              -0.46
## HouseholdsE                                0.66
## owned_unitsE                               0.31
## private_transport_workE                   -0.93
## walked_workE                               0.50
## owned_pop_1989_earlierE                    1.00

Graph Variable Pairs

Plots of variable points.

pairs(as(gwr_example_seattle$SDF, "data.frame")[,2:8], pch = ".")

Figure 21: Variable pair graphs to visually look for correlation for Seattle only census tracts show similar patterns as before.

Convert Results to sf object

Converting to sf object to map results.

gwr_example_seattle_results <- gwr_example_seattle$SDF %>% # Create new data frame of GWR results
  st_as_sf() # Convert to sp object

Obtain Bandwidth and Residuals

Obtain bandwidth and residuals.

gwr_example_seattle_results$bandwidth <- gwr_example_seattle$bandwidth
gwr_example_seattle_results$residuals <- gwr_example_seattle$lm$residuals

Local R-Squared Values and Chart of Results

The distribution of R-Squared values shows a large spike near 0.5 on the higher end of the scale, in contrast to the previous version.

LocalR <- as_data_frame(gwr_example_seattle_results$localR2)# Create data frame of Local Rsquared values

ggplot(data = LocalR, aes(y = value)) + # Create histogram of local Rsquared values
  geom_histogram(bins = 25) +
  coord_flip() + 
  labs(title = "Local R Squared Values", y = "R-Squred", x = "") +
  theme_economist_white()

Figure 22: : Distribution of local R-Squared values indicates higher values than whole county results, with a large spike around 0.45.

Local R-Squred Value Distribution Summary

Summary of values shows median is now over 0.4, an improvement from when data for the entire county used.

summary(LocalR) # Obtain summary of local Rsquared statistics

##      value       
##  Min.   :0.2280  
##  1st Qu.:0.3414  
##  Median :0.4156  
##  Mean   :0.4032  
##  3rd Qu.:0.4679  
##  Max.   :0.5167

Remove Water Features

Remove water features to better show extent of census tracts.

gwr_example_results_seattle_display <- gwr_example_seattle_results %>%
  erase_water(area_threshold = .99, year = year)

Display City Neighborhood Districts

Neighborhood boundaries for Seattle provides a more detailed overview of the neighborhoods in the City when the rest of the County is not included.

ggplot() + 
  geom_sf(data = Seattle, aes(fill = L_HOOD)) + 
  geom_sf(data = gwr_example_results_seattle_display,fill = NA, color = "black") + 
  labs(title = "Seattle Districts", subtitle = "Seattle, WA", fill = "Neighborhood District") +   
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 23: Neighborhood boundaries of Seattle in more detail than whole county map.

Display Local R-Squard Values

R-squared values are higher in the central portions of the City, decreasing in outlying areas.

ggplot(data = gwr_example_results_seattle_display, aes(fill = localR2)) + # Map local Rsquared values by census tract as points
  geom_sf(color = NA) +
  scale_fill_viridis_c(option = "G", direction = -1) + 
  geom_sf(data = gwr_example_results_seattle_display,fill = NA, color = "black") +
  labs(title = "Local R-squared Values by Census Tract", subtitle = "Seattle, WA", fill = "Local\nR Squared") +
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 24: 24: Location of higher R-Squared Values is centered around the central and western parts of Seattle. This may be due to edge effects at the outlying areas.

Display Residuals

A map of the residuals highlights most predicted values were lower then the observed values.

ggplot(data = gwr_example_results_seattle_display, aes(fill = residuals)) + # Map local residuals values by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "RdBu", type = "div") + 
  geom_sf(data = gwr_example_results_seattle_display,fill = NA, color = "black") +
  labs(title = "Residuals by Census Tract", subtitle = "Seattle, WA", fill = "Residuals") +
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 25: Residuals indicated that most areas had lower model predicted values than the original estimated values.

Display Local Bandwidth Values

Bandwidth values show a general trend of increasing in value away from the central parts of the city.

ggplot(data = gwr_example_results_seattle_display, aes(fill = bandwidth)) + # Map local Median Income by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Reds", direction = 1) + 
  geom_sf(data = gwr_example_results_seattle_display,fill = NA, color = "black") + 
  labs(title = "Bandwidth by Census Tract", subtitle = "Seattle, WA", fill = "Bandwidth") +   
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 26: Bandwidth value range shows smaller values clustered around smaller tracts, as theory suggests.

Display Local Household Density Coefficient Values

Household density shows negative impact on vacancy rates. The higher the density the lower the vacancy rate. This is the same result as previously found.

ggplot(data = gwr_example_results_seattle_display, aes(fill = HouseholdsE)) + # Map local Gini by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Blues", type = "seq") + 
  geom_sf(data = gwr_example_results_seattle_display,fill = NA, color = "black") + 
  labs(title = "Local Household Density Coefficients", subtitle = "Seattle, WA", fill = "Local Coefficients\nHousehold Density") +   # Local Coefficients  
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 27: Household density has a negative impact on vacancy rates, with the central tracts experiencing the greatest impact.

Display Local Ownded Unit Percentage Coefficient Values

The same trend is found for percentage of housing units that are owned as found for housing density variable.

ggplot(data = gwr_example_results_seattle_display, aes(fill = owned_unitsE)) + # Map local Gini by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Blues", type = "seq") + 
  geom_sf(data = gwr_example_results_seattle_display,fill = NA, color = "black") + 
  labs(title = "Local Owned Unit Percentage Coefficients", subtitle = "Seattle, WA", fill = "Local Coefficients\n% Units Owned") +   # Local Coefficients  
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 28: The percentage of residential units that are owned has a negative impact on vacancy rates, with the central areas being the most impacted.

Display Local Percentage of Workers that Take Private Transportaion to Work Coefficient Values

The private transportation coefficients show that the higher the percentage of the population that choose this mode of transportation for getting to work, the higher the vacancy rate.

ggplot(data = gwr_example_results_seattle_display, aes(fill = private_transport_workE)) + # Map local Gini by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Reds", type = "seq", direction = 1) + 
  geom_sf(data = gwr_example_results_seattle_display,fill = NA, color = "black") + 
  labs(title = "Local Private Transportation to Work", subtitle = "Seattle, WA", fill = "Local Coefficients\n% Private Transportation\nto Work") +   # Local Coefficients
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 29: Private transportation usage has a positive impact on vacancy rates, with the southern part of Seattle being the most impacted.

Display Local Percentage of Workers that Walk to Work Coefficient Values

The Walked to work coefficients follow the same trend as the private transportation variable coefficients, but is more significant in magnitude.

ggplot(data = gwr_example_results_seattle_display, aes(fill = walked_workE)) + # Map local Gini by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Reds", type = "seq", direction = 1) + 
  geom_sf(data = gwr_example_results_seattle_display,fill = NA, color = "black") + 
  labs(title = "Local Walked to Work Coefficients", subtitle = "Seattle, WA", fill = "Local Coefficients\n% of Workers\nWalk to Work") +   # Local Coefficients
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 30: The percentage of workers that walk to work also has a positive impact on vacancy rates, with the central tracts being the least impacted by the phenomenon.

Display Local Percentage of Populatin that Moved into Owned Unit in 1989 or Earlier Coefficient Values

The higher the percentage of residents who have lived in owned units before 1990 the lower the vacancy rate. Long-term residency in owned units suggests lower vacancy rates, holding all else equal.

ggplot(data = gwr_example_results_seattle_display, aes(fill = owned_pop_1989_earlierE)) + # Map local Gini by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Blues", type = "seq") + 
  geom_sf(data = gwr_example_results_seattle_display,fill = NA, color = "black") + 
  labs(title = "Local Owned Unit Occupation starting in 1989 or Earlier Coefficients", subtitle = "Seattle, WA", fill = "Local Coefficients\nOwnership Tenure\nStart 1989 or earlier") +   # Local Coefficients
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 31: Housing tenor commencing in 1989 or earlier in owner occupied units had a negative impact on vacancy rates, with the central tracts being the most impacted.

Display Model Predicated Vacancy Rate Values

Prediction values show highest values in central areas.

ggplot(data = gwr_example_results_seattle_display, aes(fill = pred)) + # Map local Median Income by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Reds", direction = 1) + 
  geom_sf(data = gwr_example_results_seattle_display,fill = NA, color = "black") + 
  labs(title = "Local Predicted Vacancy Rates by Census Tract", subtitle = "Seattle, WA", fill = "Vacancy Rate\nPrediction") +   
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 32: The highest predicted vacancy rates were in the central parts of Seattle.

Display Model Predicated Vacancy Rate Standard Error Values

Errors are higher in the areas with the highest predicated vacancy rates.

ggplot(data = gwr_example_results_seattle_display, aes(fill = pred.se)) + # Map local Median Income by census tract as points
  geom_sf(color = NA) +
  scale_fill_distiller(palette = "Reds", direction = 1) + 
  geom_sf(data = gwr_example_results_seattle_display,fill = NA, color = "black") + 
  labs(title = "Standard Error Values by Census Tract", subtitle = "Seattle, WA", fill = "Vacancy Rate\nPrediction\nStandard Error") +   
  annotation_north_arrow() + 
  theme(legend.position = "right") + 
  theme_map()

Figure 33: The highest standard error values are found in the areas with the highest predicated vacancy rates.

Comparison and Test of OLS and GWR Results

Statistical tests are inconclusive regarding improvement of the GWR model over the OLS. Some tests indicate there is statistically significant improvement while others do not. In addition, only the percentage of units that are owned and the housing tenure variables were found to be significant at any level in this particular model iteration.

BFC02.gwr.test(gwr_example_seattle)

## 
##  Brunsdon, Fotheringham & Charlton (2002, pp. 91-2) ANOVA
## 
## data:  gwr_example_seattle
## F = 1.1414, df1 = 176.00, df2 = 162.07, p-value = 0.1964
## alternative hypothesis: greater
## sample estimates:
## SS OLS residuals SS GWR residuals 
##         2349.549         2058.422

BFC99.gwr.test(gwr_example_seattle)

## 
##  Brunsdon, Fotheringham & Charlton (1999) ANOVA
## 
## data:  gwr_example_seattle
## F = 1.6454, df1 = 127.98, df2 = 169.08, p-value = 0.001234
## alternative hypothesis: greater
## sample estimates:
## SS GWR improvement   SS GWR residuals 
##           291.1275          2058.4215

LMZ.F1GWR.test(gwr_example_seattle)

## 
##  Leung et al. (2000) F(1) test
## 
## data:  gwr_example_seattle
## F = 0.9514, df1 = 169.08, df2 = 176.00, p-value = 0.3724
## alternative hypothesis: less
## sample estimates:
## SS OLS residuals SS GWR residuals 
##         2349.549         2058.422

LMZ.F2GWR.test(gwr_example_seattle)

## 
##  Leung et al. (2000) F(2) test
## 
## data:  gwr_example_seattle
## F = 1.5654, df1 = 26.928, df2 = 176.000, p-value = 0.04624
## alternative hypothesis: greater
## sample estimates:
##   SS OLS residuals SS GWR improvement 
##          2349.5490           291.1275

LMZ.F3GWR.test(gwr_example_seattle)

## 
## Leung et al. (2000) F(3) test
## 
##                         F statistic Numerator d.f. Denominator d.f.     Pr(>)
## (Intercept)                 0.90877       62.43552           169.08  0.662787
## HouseholdsE                 0.78204       17.05720           169.08  0.712408
## owned_unitsE                2.51510       68.18919           169.08 7.926e-07
## private_transport_workE     0.48951       55.95608           169.08  0.998760
## walked_workE                0.82770       47.18652           169.08  0.773857
## owned_pop_1989_earlierE     2.00398       22.71014           169.08  0.006759
##                            
## (Intercept)                
## HouseholdsE                
## owned_unitsE            ***
## private_transport_workE    
## walked_workE               
## owned_pop_1989_earlierE ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary

Utilizing just the census tracts within the City of Seattle improved the results considerably, with the amount of variance explained by the model increasing from roughly thirty percent to over forty percent. The contribution of different variables also provided a different overview. No longer was inequality, as measured by the Gini coefficient included in the City only data. While household density, percentage of residential units that are owned and percentage of workers that walk to work included in both the county and city analysis, the median age for the county was replaced by percentage of workers that take transportation to work. There was also less variation in the parameters for the city than for the county as a whole. These results have made it clear that it may be wise to not analyze an area with too much variation in underlying characteristics, and to fully understand the weighting to be used to consider significant variations.

Reflection

Initial results from the analysis provide a good starting point for further analysis. While the basic techniques required have been figured out, from initial descriptive statistics and mapping to more in-depth regression analysis, the specific data incorporated into the analysis needs further investigation. One of the major objectives of being able to wrangle data and then perform GWR and map the results was successfully completed. Now that the basic process and input parameters required to complete such tasks has been completed, the next step is to further refine and improve the process. The work that has been done so far was done as proof of concept. The ability to easily obtain census data without a lot of manual cleaning made experimenting with and learning about how to conduct GWR in R ideal but is lacking in several aspects. Due to survey data being used, there is a high degree of uncertainty involved with the estimates. The associated uncertainty makes the results less than reliable. In addition, most of the variables used were provided as counts of some element of interest. While easy to manipulate, these aggregations do not provide a lot of the details that are required. Based on these results, there are several changes to be made in future work. One requirement would be gathering data from several different sources, with more emphasis on acquiring transaction-related data. Another future goal is to examine changes between two periods, measuring changes over time rather than specific values at one point in time. More time and effort can now be devoted to acquiring more relevant variables, now that the methodology to carry out the analysis has been learned.

References

Brazil N. (2019). Geographically Weighted Regression: CRD 298 - Spatial Methods in Community Research. Available at https://crd230.github.io/gwr.html#kernel_density_function_and_bandwidth_h

Brett, D. L., & Schmitz, A. (2015). Real estate market analysis: Methods and case studies (Second edition). Urban Land Institute.

Brunsdon, C., Fotheringham, A. S., & Charlton, M. E. (1996). Geographically weighted regression: A method for exploring spatial nonstationarity. Geographical Analysis, 28(4), 281–298. https://doi.org/10.1111/j.1538-4632.1996.tb00936.x

Chwiałkowski, C., Zydroń, A., & Kayzer, D. (2022). Assessing the impact of selected attributes on dwelling prices using ordinary least squares regression and geographically weighted regression: A case study in poznań, poland. Land, 12(1), 125. https://doi.org/10.3390/land12010125

City of Seattle GIS Program. (2023, February 7). Neighborhood Map Atlas Neighborhoods. SeattleGeoData. https://data-seattlecitygis.opendata.arcgis.com/datasets/SeattleCityGIS::neighborhood-map-atlas-neighborhoods/about

Mennis, J. (2006). Mapping the results of geographically weighted regression. The Cartographic Journal, 43(2), 171–179. https://doi.org/10.1179/000870406X114658

Wang, D., Li, V. J., & Yu, H. (2020). Mass appraisal modeling of real estate in urban centers by geographically and temporally weighted regression: A case study of beijing’s core area. Land, 9(5), 143. https://doi.org/10.3390/land9050143

Appendix

List of variable names and uses as designated by the Census Bureau.

# B25109_001  Estimate!!Median value --!!Total: MEDIAN VALUE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B25107_001    # Estimate!!Median value --!!Total: MEDIAN VALUE BY YEAR STRUCTURE BUILT
# B25097_001    # Estimate!!Median value --!!Total: MORTGAGE STATUS BY MEDIAN VALUE (DOLLARS)
# B25097_002    # Estimate!!Median value --!!Total:!!Median value for units with a mortgage MORTGAGE STATUS BY MEDIAN VALUE (DOLLARS)
# B25097_003    # Estimate!!Median value --!!Total:!!Median value for units without a mortgage  MORTGAGE STATUS BY MEDIAN VALUE (DOLLARS)
# B08122_001    # Estimate!!Total:  MEANS OF TRANSPORTATION TO WORK BY POVERTY STATUS IN THE PAST 12 MONTHS
# B08122_002    # Estimate!!Total:!!Below 100 percent of the poverty level  MEANS OF TRANSPORTATION TO WORK BY POVERTY STATUS IN THE PAST 12 MONTHS
# B08122_003    # Estimate!!Total:!!100 to 149 percent of the poverty level MEANS OF TRANSPORTATION TO WORK BY POVERTY STATUS IN THE PAST 12 MONTHS
# B08122_004    # Estimate!!Total:!!At or above 150 percent of the poverty level    MEANS OF TRANSPORTATION TO WORK BY POVERTY STATUS IN THE PAST 12 MONTHS
# B08122_005    # Estimate!!Total:!!Car, truck, or van - drove alone:   MEANS OF TRANSPORTATION TO WORK BY POVERTY STATUS IN THE PAST 12 MONTHS
# B08122_013    # Estimate!!Total:!!Public transportation (excluding taxicab):  MEANS OF TRANSPORTATION TO WORK BY POVERTY STATUS IN THE PAST 12 MONTHS
# B08122_017    # Estimate!!Total:!!Walked: MEANS OF TRANSPORTATION TO WORK BY POVERTY STATUS IN THE PAST 12 MONTHS
# B08122_021    # Estimate!!Total:!!Taxicab, motorcycle, bicycle, or other means:   MEANS OF TRANSPORTATION TO WORK BY POVERTY STATUS IN THE PAST 12 MONTHS
# B08122_025    # Estimate!!Total:!!Worked from home    MEANS OF TRANSPORTATION TO WORK BY POVERTY STATUS IN THE PAST 12 MONTHS
# B25026_001    # Estimate!!Total population in occupied housing units: TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B25026_002    # Estimate!!Total population in occupied housing units:!!Owner occupied:    TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B25026_003    # Estimate!!Total population in occupied housing units:!!Owner occupied:!!Moved in 2019 or later    TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B25026_004    # Estimate!!Total population in occupied housing units:!!Owner occupied:!!Moved in 2015 to 2018 TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B25026_005    # Estimate!!Total population in occupied housing units:!!Owner occupied:!!Moved in 2010 to 2014 TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B25026_006    # Estimate!!Total population in occupied housing units:!!Owner occupied:!!Moved in 2000 to 2009 TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B25026_007    # Estimate!!Total population in occupied housing units:!!Owner occupied:!!Moved in 1990 to 1999 TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B25026_008    # Estimate!!Total population in occupied housing units:!!Owner occupied:!!Moved in 1989 or earlier  TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B25026_009    # Estimate!!Total population in occupied housing units:!!Renter occupied:   TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B25026_010    # Estimate!!Total population in occupied housing units:!!Renter occupied:!!Moved in 2019 or later   TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B25026_011    # Estimate!!Total population in occupied housing units:!!Renter occupied:!!Moved in 2015 to 2018    TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B25026_012    # Estimate!!Total population in occupied housing units:!!Renter occupied:!!Moved in 2010 to 2014    TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B25026_013    # Estimate!!Total population in occupied housing units:!!Renter occupied:!!Moved in 2000 to 2009    TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B25026_014    # Estimate!!Total population in occupied housing units:!!Renter occupied:!!Moved in 1990 to 1999    TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B25026_015    # Estimate!!Total population in occupied housing units:!!Renter occupied:!!Moved in 1989 or earlier TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY TENURE BY YEAR HOUSEHOLDER MOVED INTO UNIT
# B06002_001    # Estimate!!Median age --!!Total:   MEDIAN AGE BY PLACE OF BIRTH IN THE UNITED STATES
# B06002_002    # Estimate!!Median age --!!Total:!!Born in state of residence   MEDIAN AGE BY PLACE OF BIRTH IN THE UNITED STATES
# B06002_003    # Estimate!!Median age --!!Total:!!Born in other state of the United States MEDIAN AGE BY PLACE OF BIRTH IN THE UNITED STATES
# B06002_004    # Estimate!!Median age --!!Total:!!Native; born outside the United States   MEDIAN AGE BY PLACE OF BIRTH IN THE UNITED STATES
# B06002_005    # Estimate!!Median age --!!Total:!!Foreign born MEDIAN AGE BY PLACE OF BIRTH IN THE UNITED STATES
# B23013_001    # Estimate!!Median age--!!Total:    MEDIAN AGE BY SEX FOR WORKERS 16 TO 64 YEARS
# B23013_002    # Estimate!!Median age--!!Total:!!Male  MEDIAN AGE BY SEX FOR WORKERS 16 TO 64 YEARS
# B23013_003    # Estimate!!Median age--!!Total:!!Female    MEDIAN AGE BY SEX FOR WORKERS 16 TO 64 YEARS
# B12006_001    # Estimate!!Total:  MARITAL STATUS BY SEX BY LABOR FORCE PARTICIPATION
# B12006_007  # Estimate!!Total:!!Never married:!!Male:!!Not in labor force MARITAL STATUS BY SEX BY LABOR FORCE PARTICIPATION
# B12006_012    # Estimate!!Total:!!Never married:!!Female:!!Not in labor force MARITAL STATUS BY SEX BY LABOR FORCE PARTICIPATION
# B12006_018    # Estimate!!Total:!!Now married (except separated):!!Male:!!Not in labor force  MARITAL STATUS BY SEX BY LABOR FORCE PARTICIPATION
# B12006_023    # Estimate!!Total:!!Now married (except separated):!!Female:!!Not in labor force    MARITAL STATUS BY SEX BY LABOR FORCE PARTICIPATION
# B12006_029    # Estimate!!Total:!!Separated:!!Male:!!Not in labor force   MARITAL STATUS BY SEX BY LABOR FORCE PARTICIPATION
# B12006_034    # Estimate!!Total:!!Separated:!!Female:!!Not in labor force MARITAL STATUS BY SEX BY LABOR FORCE PARTICIPATION
# B12006_040    # Estimate!!Total:!!Widowed:!!Male:!!Not in labor force MARITAL STATUS BY SEX BY LABOR FORCE PARTICIPATION
# B12006_045    # Estimate!!Total:!!Widowed:!!Female:!!Not in labor force   MARITAL STATUS BY SEX BY LABOR FORCE PARTICIPATION
# B12006_051    # Estimate!!Total:!!Divorced:!!Male:!!Not in labor force    MARITAL STATUS BY SEX BY LABOR FORCE PARTICIPATION
# B12006_056    # Estimate!!Total:!!Divorced:!!Female:!!Not in labor force  MARITAL STATUS BY SEX BY LABOR FORCE PARTICIPATION

GEOG 588 Final Project - Vacancy Rate Analysis in King County, WA Using GWR

Matthew Langsenkamp

2023-07-19

Abstract

Introduction

Data

Data Prepartion

Variables

Methods

Techniques

OLS

Packages

GWR

Packages

Process Workflow

Analysis

Inputs

Parameters

Variables

Dependent Variables

Independent Variables

Import Data

Seattle Neighborhood Shapefile

Variables

Dataframe and Vacant Units

Dataframe of all Units

Combine Dataframes

Dataframe of Independent Variables

Dataframe of Independent Variables in Wide Formate

Combined Vacant Unit Dataframe and Independent Variable Data Frame

Remove Water Features

Display Esimated Vacancy Rates

OLS Regression

Regression parameters

OLS regression and results

Residuals

Coefficients

Vacancy Rate Estimates

Regression Predicted Vacancy Rates

Regression Tests

VIF

Variable Selection

Revised OLS Regression

Revised regression

Run Revised OLS

Revised Tests

Revised VIF

GWR

Conert to sp SpatialDataFrame

GWR parameters and analaysis

Correlation

Plot of pairs

GWR Results and Conversion to sf object

Obtain bandwidth and residuals

Display GWR Results

Local R-Squared

Distributon of R-Squard values

Remove Water Features

Display Seattle with Neighborhood Districts

Display Local R-Squared

Display Local R-Squared with Seattle City Lines

Display Residuals

Display Bandwidth Values

Display Local Gini Coefficent Values

Display Local Household Density Coefficent Values

Display Local Ownership Percentage Coefficent Values

Display Local Percentage of Workers that Walk to Work Coefficent Values

Display Local Median Age Coefficent Values

Display Model Predicted Vacancy Rate Values

Display Model Predicted Vacancy Rate Standard Error Values

Test GWR Results with OLS Results

Summary

Seattle Census Tracts

Select Census Tracts within Seattle

Seatle Only OLS Regression Parameters

Perform OLS and Display Results

VIF

Variable Selection

Revised Regression Parameters

Run OLS Regression and Dislay Results