DATA
The data for this study was collected from several sources. First,
annual Gross Domestic Product (GDP) data for the year 2021 for all
metropolitan statistical areas (MSA) in the United States were collected
from the BEA website (Bureau of Economic Analysis 2023). BEA provides
data tables for GDP values for each metropolitan and micropolitan
statistical areas in the United States. BEA offers a way to retrieve the
data programmatically using their API, but it was simple to download the
data table from the site directly. The data also already came in a
format that was easy to ingest for the study.
The second source of data was the US Census Bureau website (US
Census Bureau 2023). 19 different socioeconomic variables related mainly
to occupation and industry of the employed population were selected and
collected for each MSA. These 19 variables are outlined in figure 1 in
the list of socioeconomic variables. The American Community Survey
5-year data for 2017-2021 was collected. Since this data is based on
representative data collected over the given time period, the average or
“estimate” was deemed to align well with the annual GDP data for 2021
(US Census Bureau, 2023). It was hoped that the socioeconomic
characteristics estimates of the population during that period would
help to explain the culminating size of the economy, in terms of GDP, in
2021. The collection of this data was done solely through RStudio,
utilizing the Census Bureau’s API capabilities. Originally, shapefiles
for MSAs were going to be collected as well, but instead the “geometry”
option in the retrieval process was utilized to obtain data which
already had spatial information attached. This saved time and effort and
made the data preparation phase much smoother.
Figure 1. List of socioeconomic variables
Figure 1. List of socioeconomic variables. Variables from the
categories of occupation, industry, mortgage status and gross rent were
collected using the get_acs() method from the US Census Bureau. The data
is shown with the variable ID according to the US Census Bureau, the
Variable_Name used within the study, their category, description, and
unit.
METHODOLOGY
The analysis methodology followed several phases, as described in
figure 2. The first phase involved data collection and cleanup and had
three steps. In the first phase, data was collected from all sources and
then cleaned up to prepare for ingestion in the analysis methods. GDP
data was downloaded from the BEA website and a quick cleanup of the data
was conducted in Microsoft Excel. Since one of the goals of this project
was to utilize R and RStudio to conduct most operations, the .csv file
was cleaned with minimal operations.
Figure 2. Workflow of the study methodology. The blue parallelograms
at the beginning (left side) and the end (right side) of the workflow
represents the inputs and outputs, respectively. The inputs are the
datasets collected and prepared for the study while the outputs
represents the results generated by the study. The three orange boxes in
the middle represents the methodology steps in which something is
happening to the inputs. The first orange square includes the data
collection and cleanup phase, the second one the descriptive statistics
phase, and the third one the analysis phase which includes both
non-spatial and spatial analyses. The green arrows indicate the
direction of the workflow.
In the second phase of the project, variable analysis was conducted.
This was done in two steps. First, descriptive statistics were created
and analyzed using the describe() function from the psych package. The
dataset were combined into one object to show the following statistics
using describe(): mean, standard deviation (sd), median, trimmed median,
mad (median absolute deviation from the median), minimum value, maximum
value, skewness, kurtosis, and standard error of the mean. The purpose
was to observe any troublesome patterns in the data. Second, exploratory
analysis was conducting by creating histograms using the ggplot2 package
and by conducting normality tests using the stats package. Histograms
group the data values into different bins that shows how many of the
entries fall within certain ranges. A variable is said to have a normal
distribution if the histogram is bell-shaped, contains one peak, with a
mean value through this peak, and is symmetric (NIH, n.d.). The
Kolmogorov-Smirnov or KS test tests whether the data follows a certain
distribution. This can be a one-sample, testing one dataset to see if it
matches a certain distribution such as a normal distribution, or
two-sample, testing two datasets against one another to see if they have
a matching distribution (Kolmogorov-Smirnov Tests, n.d.). In this case,
we utilized a one-sample test to see if the data followed a normal
distribution. If the assumption of normality was not observed, we then
either transformed the variable or omitted them from further
consideration in the study.
In the third phase of the project, OLS regression and GWR analysis
were conducted. In the OLS regression analysis, a full model containing
all of the variables were created using the lm() function from the stats
package. The dependent variable was “GDP” while the independent
variables included all of the socioeconomic variables on the table in
figure 2. To explore the variables to include in the final model, a
stepwise regression using backwards elimination of variables based on
AIC was conducted. This was done using the step() function from the
stats package. Backward modeling starts with all the independent
variables included in the model, after which the variables are
eliminated one by one based on statistical significance (Smith,
2018).
The result from this process is the summary of the regression and an
equation for the final model. The summary of the model was presented and
explained using summary() and the equation presented by a figure created
using an online LatEx editor. Variance inflation factor analysis was
also conducted using the VIF() function from the regclass package. This
was to analyze if the independent variables had affects on one another.
A large VIF value could indicate problems with accurate prediction in
the model. (VIF). Additionally, model assumptions were analyzed and some
of these outputs are presented in the results. In the second step of the
analysis phase, GWR analysis was conducted. This was conducted on the
same dataset to compare how the inclusion of a spatial dimension would
alter the relationships between GDP and the variables. The spatial
extent of this covers all the MSAs within the contiguous United
States.
CHALLENGES & LIMITATIONS
For the data and analysis, one of the challenges involved selecting
the right geography or spatial scale. To limit the scope of the analysis
to a finer study, a study of MSAs in one state, California, was tested
early in the study. This, however, resulted in a data sample size of 26
rows. OLS and GWR regressions were conducted. However, the R2 and
adjusted R2 were suspiciously high, at over 0.9999 each. Additionally,
the VIF values were well over 2000 for each variable, indicating issues
with the independent variables. For this reason, the analysis were
conducted on the whole dataset to avail more data points for the
regressions. Additionally, instead of the raw estimate values, the
percentage of population values were used for the independent variables.
Moreover, to reduce the scope, it was decided that the data would be
limited to analyzing metropolitan statistical areas and not micropolitan
statistical areas.
RESULTS
DATA PREPARATION
There were three main parts to data preparation. First, this
entailed loading the MSA GDP .csv file that was obtained from the Bureau
of Economic Analysis. Second, socioeconomic data was retrieved from the
US Census Bureau using the API. Lastly, the GDP and the socioeconomic
data were joined into one dataset. These steps align with the first
orange box in the workflow in figure 2.
Load packages
First, we loaded all the packages used for the analysis.
library(tidycensus) #For obtaining tidy census data
library(tidyverse) #For manipulating census and other data
library(here) #To load our data using read_csv
library(dplyr) #For data manipulation
library(spgwr) #For the GWR analysis
library(regclass) #For VIF
library(sf) #For simple features and other geometry operations
library(psych) #For descriptive statistics
library(ggplot2) #For plotting graphs
library(cowplot) #For combining plots
Importing the data
GDP data
Next, we converted the ‘GeoFips’ column name to a ‘character’ type
to prepare it to be joined. Fortunately, the ‘GeoFips’ column in the BEA
data corresponds directly to the ‘GEOID’ column in the census data.
GEOID <- as.character(usa_msa_init$GeoFips) # Convert from 'double' to 'character' to allow the join.
Then, we added the newly renamed ‘GEOID’ column to the GDP
dataset.
usa_msa <- cbind(usa_msa_init, GEOID) # Add converted GEOFIPS (now GEOID) column to the data frame.
Socioeconomic data
To retrieve the socioeconomic variables data, we first created a
list of variables that we were interested in for the study. These were
then pulled from the US Census Bureau’s website through the API in the
next step.
varlist = c(
# OCCUPATION category
# Civilian employed population 16 years and over
o_management = "DP03_0027P", # management, business, science, and arts occupations (percent)
o_service = "DP03_0028P", # service occupations (percent)
o_sales = "DP03_0029P", # sales and office occupations (percent)
o_resources = "DP03_0030P", # natural resources, construction, and maintenance occupations (percent)
o_production = "DP03_0031P", # production, transportation, and material moving occupations (percent)
# INDUSTRY category
# Civilian employed population 16 years and over
i_agriculture = "DP03_0033P", # agriculture, forestry, fishing and hunting, and mining (percent)
i_construction = "DP03_0034P", # construction (percent)
i_manufacturing = "DP03_0035P", # manufacturing (percent)
i_wholesale = "DP03_0036P", # wholesale trade (percent)
i_retail = "DP03_0037P", # retail trade (percent)
i_transport = "DP03_0038P", # transportation and warehousing, and utilities (percent)
i_information = "DP03_0039P", # information (percent)
i_finance = "DP03_0040P", # fiance and insurance, and real estate and rental and leasing (percent)
i_professional = "DP03_0041P", # professional, scientific, and management, and administrative and waste management services (percent)
i_education = "DP03_0042P", # educational services, and health care and social assistance (percent)
# HOUSING category
# Mortgage status and gross rent
h_mortgage = "DP04_0091P", # housing unit with a mortgage (percent)
h_no_mortgage = "DP04_0092P", # housing units without a mortgage (percent)
h_paying_rent = "DP04_0126P", # occupied units paying rent (percent)
h_median_rent = "DP04_0134" # occupied units paying rent - median (dollars)
)
The socioeconomic variables were retrieved using get_acs()
us_socio <- get_acs( # Get data with get_acs()
geography = "metropolitan statistical area/micropolitan statistical area", # the geography option is for Metro and Micro statistical areas but we only used the former.
variables = varlist, # We use the variable list created above, in line 83.
year = 2021, # This is for the 2017-2021 ACS
geometry = TRUE, # We also retrieve the geometry
output="wide", # We load the data in 'wide' format.
progress_bar = FALSE # This is to turn off a long progress bar message.
)
Joining the data
For the final step in data preparation, we joined the GDP data and
the socioeconomic data.
usa_msa_joined <- usa_msa %>%
left_join(us_socio, by="GEOID") %>% # We join using "GEOID"
replace(is.na(.), 0) # replaced NA values with zero.
Additionally, we deleted four MSAs that were located in Alaska and
Hawaii. The non-continguous nature of these MSAs in relation to the MSAs
on the mainland US was not helpful for the spatial analysis.
usa <- usa_msa_joined %>%
filter(GEOID != "11260", GEOID != "21820", GEOID != "27980", GEOID != "46520") # We use a filter from dplyr that takes out the omitted MSAs.
# The omitted MSAs were:
# Anchorage, AK (11260) and Fairbanks, AK (21820).
# Kahului-Wailuku-Lahaina, HI (27980) and Urban Honolulu, HI (46520).
Figure 3. Sample of analysis-ready ‘usa’ dataset. This sample
contains the first ten entries in the ‘usa’ dataset, which contains GDP
and socioeconomic variable information for MSAs in the contiguous United
States. The data were obtained from BEA and the US Census Bureau and
were manipulated using R to ready for analysis.
In the dataset, the variable names ending with the letter “E” are
the estimate values (in percent) while the names ending with the letter
“M” are the margins of error. The estimate value columns were selected
and used for the analysis.
VARIABLE ANALYSIS
Descriptive Statistics
The describe() function from the psych package was used to generate
descriptive statistics for the variables in the dataset. The purpose of
this step was to view the distribution of the values of each variable to
see how much they deviate from the assumption of a normal
distribution.
The describe() function returns the following descriptive
statistics:
Figure 4. Descriptive statistics of GDP and socioeconomic variables.
These statistic values were obtained using the describe() function in
the psych package. They look at the distribution of the data values for
the dependent variable, GDP, and the twenty independent variables
initially chosen for the study.
After the descriptive statistics were calculated, we added the
variable names to the matrix for readability.
varnames <- c('GDP', 'o_management', 'o_service', 'o_sales', 'o_resources', 'o_production', 'i_agriculture', 'i_construction',
'i_manufacturing', 'i_wholesale', 'i_retail', 'i_transport', 'i_information', 'i_finance', 'i_professional',
'i_education', 'h_mortgage', 'h_no_mortgage', 'h_paying_rent', 'h_median_rent') # A list of the variable names
desc_stats <- cbind(varnames, desc1) # We bind the variable names to the descriptive statistics (desc1)
Figure 5. Descriptive statistics of GDP and socioeconomic variables
with variable names. These statistic values were obtained using the
describe() function in the psych package and the variable names were
added afterwards.
In the descriptive statistics tables, figures 4 and 5, the first
column is the item name, the second column (vars) is the item number,
both of which correspond to the variables that were used in describe()
in the order they were listed. The third column (n) is the number of
entries or rows. The mean is the average value of the data and the
standard deviation (sd) is a measure of the dispersion of the values.
The median is the value that separates the lower and upper half of the
data equally, the trimmed mean (trimmed) is a particular mean calculated
by discarding some low/high values, and the median absolute deviation
(mad) is a calculatation of how the values deviate from the median. The
min and max columns represent the lowest and highest values for each
variable, respectively. The range is the difference between the highest
and the lowest values.
Skew (skewness) provides a measure by which the distribution departs
from that of a normal distribution. Values that are “+” suggest a
positive skew while values that are “-“ skewness suggest a negative
skew. Values that are roughly ±1.0 suggests non-normal data. Kurtosis
shows the”tailedness” of the data, and similar to skewness, describes
the distribution of the data. Lastly, the standard error (se) is the
standard error of the mean in the data.
Overall, GDP has the highest value ranges, which is understandable,
as it represents the total economic value within an MSA. Most of the
other variables are percentage values, which vary from 0 to 100. The
next highest values overall are the ‘h_paying_rent’ and ‘h_median_rent’
variables, which represent rent and mortgage-related values. The
descriptive statistics were calculated mainly to review the
distributions of the dataset. In the next step, we looked at skewness
specifically to narrow down the variables that were to be used in the
study.
EXPLORATORY ANALYSIS
To explore the viability of the variables to be chosen, the
assumption of normality was tested for several variables. This was done
by creating histograms and by conducting one-sample Kolmogorov-Smirnov
tests for each variable.
The preliminary exploratory phase was done for the dependent
variable, as well as independent variables that had high skew according
to the descriptive statistics. The descriptive statistics figures
indicated that most variables had a normal distribution or only a slight
skew. For variables that had visibly high skews, considered in this
study as skew values over 1.9, histograms and normality tests were
conducted to see how much they deviated from the assumption of
normality.
DEPENDENT VARIABLE ANALYSIS
Figure 6. Histogram of the dependent variable ’GDP.” A histogram of
the variable GDP was created using ggplot(), with default bin values.
The x axis shows the value of GDP, in thousands, while the y axis shows
the number of MSAs that fall within a certain bin. From the figure, GDP
data has a positive skew, with a few outliers.
Figure 7. Histogram of the log-transformed ’GDP” variable. The
histogram was created using ggplot(), with default bin values. The x
axis shows the log-transformed values of GDP, while the y axis shows the
number of MSAs that fall within each bin. From the figure, log GDP still
has a skew, but the distribution is closer to normal compared to the
initial dataset.
The histogram shows that the depenendent variable is still skewed
but shows a much better distribution compared to the initial GDP data.
In the log-transformed version of GDP, the plot is closer to a normal
distribution. We used this log-transformed GDP variable for the OLS
analysis.
INDEPENDENT VARIABLE ANALYSIS
Next, we conducted normality tests for the independent variables.
First, we created histograms for each variable, as shown in figure 8.
Like GDP, these variables showed high skew values in the descriptive
statistics and were chosen for further analysis. However, transformation
was not conducted on these independent variables to maintain the
simplicity of the regression equation interpretation.
Histograms
h1 <- ggplot(usa1, aes(x=i_agricultureE)) + # Plot for i_agriculture
geom_histogram(color="blue", fill="lightblue")
h2 <- ggplot(usa1, aes(x=i_informationE)) + # Plot for i_informationE
geom_histogram(color="blue", fill="lightblue")
h3 <- ggplot(usa1, aes(x=i_financeE)) + # Plot for i_financeE
geom_histogram(color="blue", fill="lightblue")
h4 <- ggplot(usa1, aes(x=h_paying_rentE)) + # Plot for h_paying_rentE
geom_histogram(color="blue", fill="lightblue")
h_group <- plot_grid(h1, h2, h3, h4) # Group the plots into one grid
title <- ggdraw()+ # Use draw_label() to create a title.
draw_label(
"Histograms of independent variables with high skew")
# The combined plot with the title:
h_plots <- plot_grid(title, h_group, ncol=1, rel_heights = c(0.1, 1))
h_plots

Figure 8. Histogram of four independent variables, i_agriculture,
i_information, i_finance, and h_paying_rent. The histograms were created
using ggplot and were combined using cowplot. The x axis shows the value
of each variable, while the y axis shows the number of MSAs that fall
within a certain bin for that variable. From the figure, all four
variables show a positive skew, particularly for h_paying_rent.
Normality test
After visual normality testing was done using histograms, a
statistical normality test, the Kolmogorov-Smirnov test, was done each
of the four independent variables.
# Kolmogorov-Smirnov test
ks1 <- ks.test(usa1$i_agricultureE, "pnorm") # This test is from the stat package.
ks2 <- ks.test(usa1$i_informationE, "pnorm") # We used a one-sample test.
ks3 <- ks.test(usa1$i_financeE, "pnorm") # Which tests the sample against a normal distribution, “pnorm.”
ks4 <- ks.test(usa1$h_paying_rentE, "pnorm")
Here are the results of the KS test:
ks1
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: usa1$i_agricultureE
## D = 0.63883, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks2
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: usa1$i_informationE
## D = 0.77499, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks3
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: usa1$i_financeE
## D = 0.99379, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks4
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: usa1$h_paying_rentE
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
The KS test results in the code above shows the test statistic (D),
the p-value, and the type of test: two-sided hypothesis test. All the
results showed that each p-value were under 0.05. This means that each
of the test statistic was significant. Therefore, we reject the null
hypothesis which states that there is no significant difference between
the sample data and that of a normal distribution. In short, this means
that we have evidence to show that the variable data do not come from a
normal distribution.
From the normality tests conducted using histograms and the KS test,
the four variables explored show non-normality in their distributions.
As a result, to maintain simplicity in interpretation of the analysis,
these four independent variables were selected to be dropped from the
OLS study.
MODELING
OLS Regression
In the first step of the modeling phase, OLS regression was
conducted. This process began using an initial model that contained all
the independent variables selected earlier, ones in the variable list
used to get the ACS data, minus the four variables eliminated earlier in
the exploratory analysis.
# We use the lm() function in stats package to fit a linear model using the variables.
msa_full_model <- lm(usa1$gdp_log ~ usa1$o_managementE + usa1$o_serviceE + usa1$o_salesE +
usa1$o_resourcesE + usa1$o_productionE +
usa1$i_constructionE + usa1$i_manufacturingE + usa1$i_wholesaleE +
usa1$i_retailE + usa1$i_transportE +
usa1$i_professionalE + usa1$i_educationE + usa1$h_mortgageE + usa1$h_no_mortgageE +
usa1$h_median_rentE, data = usa1)
# The first argument is the model formula, which contains the dependent variable as the first item and the rest as the independent variables. The next argument, in this case the last, is the dataset.
In the code, the call shows the equation that were used for the
model. The residuals shows the difference between the actual data values
and the values predicted by the model. The coefficients help us build
our model and the ‘Estimate’ column helps to interpret the relationship
that the data shows between the dependent variable, GDP, and the
independent variables. The ‘Std.Error’ are the deviation of the
coefficient estimate values. The t and p-values indicates whether or not
the variable is considered significant in the model. The * symbols
indicates the level of significance. The last three lines shows us how
well the model does in fitting the data. One particular value, the
adjusted R-squared, is commonly used to gauge model fit. The adjusted
R-squared shows the percentage of the variation in GDP that the
socioeconomic variables are predicting. Thus, a higher adjusted
R-squared is generally desired. (Linear Reg).
As seen in the code, the adjusted R-squared is relatively high at
0.6684. However, there are many independent variables that are not
statistically significant in the model. Thus, there was room for
improvement in the model, for both simplicity and performance.
To obtain a better model, stepwise backward elimination was
conducted. First, this was done manually, with the deletion of one
independent variable at a time, based on statistical significance. At
each step, the variable with the highest p-value was deleted and a new
summary was generated. This process was repeated until all variables
were statistically significant at the 5% level, meaning the p-values
were under 0.05. At the same time, the adjusted R-squared was monitored
to ensure that it did not drop drastically.
STEPWISE BACKWARD ELIMINATION - MANUAL
Manually by eliminating variables with the highest p-value, one at a
time.
msa_test_model <- lm(usa1$gdp_log ~ usa1$o_managementE + usa1$o_salesE +
usa1$i_manufacturingE + usa1$i_wholesaleE +
usa1$i_retailE + usa1$i_transportE +
usa1$i_professionalE + usa1$h_mortgageE +
usa1$h_median_rentE, data = usa1)
Here is the summary of the full OLS model for the USA MSA data.
summary(msa_test_model)
##
## Call:
## lm(formula = usa1$gdp_log ~ usa1$o_managementE + usa1$o_salesE +
## usa1$i_manufacturingE + usa1$i_wholesaleE + usa1$i_retailE +
## usa1$i_transportE + usa1$i_professionalE + usa1$h_mortgageE +
## usa1$h_median_rentE, data = usa1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.05399 -0.46074 0.00569 0.43843 2.63563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.4123181 0.9000048 8.236 3.11e-15 ***
## usa1$o_managementE 0.0637641 0.0105500 6.044 3.67e-09 ***
## usa1$o_salesE 0.1749031 0.0290251 6.026 4.07e-09 ***
## usa1$i_manufacturingE 0.0271802 0.0082976 3.276 0.00115 **
## usa1$i_wholesaleE 0.4328488 0.0661290 6.546 1.98e-10 ***
## usa1$i_retailE -0.1969850 0.0367528 -5.360 1.47e-07 ***
## usa1$i_transportE 0.1954982 0.0314574 6.215 1.39e-09 ***
## usa1$i_professionalE 0.1644058 0.0231218 7.110 6.01e-12 ***
## usa1$h_mortgageE 0.0157055 0.0069630 2.256 0.02468 *
## usa1$h_median_rentE 0.0006707 0.0002234 3.002 0.00286 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7041 on 370 degrees of freedom
## Multiple R-squared: 0.6754, Adjusted R-squared: 0.6675
## F-statistic: 85.53 on 9 and 370 DF, p-value: < 2.2e-16
In the test model, the adjusted R-squared is still relatively high
at 0.6675. As the * symbols on the right side indicate, all the
variables are statistically significant, with the p-values all falling
under 0.05.
STEPWISE BACKWARD SELECTION - AUTOMATED
The automated backward stepwise elimination process resulted in a
different equation from the manual process. Though the adjusted
r-squared is higher for the automated model, at 0.6696, it was decided
that the manual model would be kept. The automated process did not
eliminate variables that were statistically not significant. The smaller
number of variables in the manual model also keeps the model simpler
while still having high performance value. The adjusted R-squared for
the manual model was 0.6675, which is not much different from the
automated model.
Final OLS Model
Here is the summary of the final model:
summary(msa_final_model)
##
## Call:
## lm(formula = usa1$gdp_log ~ usa1$o_managementE + usa1$o_salesE +
## usa1$i_manufacturingE + usa1$i_wholesaleE + usa1$i_retailE +
## usa1$i_transportE + usa1$i_professionalE + usa1$h_mortgageE,
## data = usa1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.15034 -0.43014 -0.00781 0.44090 2.81387
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.092952 0.880345 9.193 < 2e-16 ***
## usa1$o_managementE 0.057481 0.010451 5.500 7.09e-08 ***
## usa1$o_salesE 0.159207 0.028857 5.517 6.48e-08 ***
## usa1$i_manufacturingE 0.019923 0.008023 2.483 0.01346 *
## usa1$i_wholesaleE 0.438172 0.066815 6.558 1.83e-10 ***
## usa1$i_retailE -0.203943 0.037074 -5.501 7.05e-08 ***
## usa1$i_transportE 0.186954 0.031665 5.904 8.01e-09 ***
## usa1$i_professionalE 0.202982 0.019430 10.447 < 2e-16 ***
## usa1$h_mortgageE 0.022127 0.006698 3.304 0.00105 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7117 on 371 degrees of freedom
## Multiple R-squared: 0.6675, Adjusted R-squared: 0.6603
## F-statistic: 93.08 on 8 and 371 DF, p-value: < 2.2e-16
The final model still had a relatively high adjusted R-squared value
at 0.6603, with eight variables kept in the equation. The coefficients
were also large enough to be easily interpretable and all variables were
statistically significant.
VIF
Figure 9. The OLS equation for the final model. The model was
derived using exploratory variable analysis and backward stepwise
elimination in R. The variable y represents GDP and the different x
variable represent the independent socioeconomic variables.
OLS Results Discussion
For the analysis of the OLS regression, generally a one-unit change
in an independent variable equates to a one-unit change in the dependent
variable multiplied by the coefficient. In this case, however, the
dependent variable was log-transformed. Therefore, the results must be
discussed with this consideration.
To interpret the results, it is important to note that by
log-transforming the dependent variable, our independent variables have
been exponentiated. To interpret the results, we need to exponentiate
the independent variables. For example:
The coefficient of o_managementE is 0.057.
exp(0.057)
## [1] 1.058656
This means that for every one-unit increase in o_management, y
increases by about 6%. We can then calculate this value for all
variables:
exp(coef(msa_final_model)['usa1$o_managementE'])
## usa1$o_managementE
## 1.059165
exp(coef(msa_final_model)['usa1$o_salesE'])
## usa1$o_salesE
## 1.172581
exp(coef(msa_final_model)['usa1$i_manufacturingE'])
## usa1$i_manufacturingE
## 1.020122
exp(coef(msa_final_model)['usa1$i_wholesaleE'])
## usa1$i_wholesaleE
## 1.549871
exp(coef(msa_final_model)['usa1$i_retailE'])
## usa1$i_retailE
## 0.8155088
exp(coef(msa_final_model)['usa1$i_transportE'])
## usa1$i_transportE
## 1.205572
exp(coef(msa_final_model)['usa1$i_professionalE'])
## usa1$i_professionalE
## 1.225051
exp(coef(msa_final_model)['usa1$h_mortgageE'])
## usa1$h_mortgageE
## 1.022374
Figure 10 shows the effects of the coefficients of the independent
variables after the calculations. These represent percent changes in GDP
per one-unit change for each independent variable. The variable with the
larget positive effect is i_wholesale, which increases GDP by 55%,
followed by i_professional at 23%, i_transport at 21%, and o_sales at
17%. These variables seem to have a strong positive relationship with
GDP. The variables o_management at 6%, i_manufacturing at 2%, and
h_mortgage at 2%, also have positive relationships with GDP, albeit a
smaller one in the single digit. The only variable to have a negative
relationship is i_retail, which decreases GDP by 18% per one-unit
change.
These interpretation of the coefficient results show the effects and
relationships of the final variables with GDP. This OLS regression model
is a generalized model that is applicable to the MSAs in the entire
continguous US. There are, however, bound to be regional differences
within the US. Thus, the next step of the modeling phase involved
Geographically Weighted Regression to see how the relationships of these
variables vary spatially.
Figure 10. Interpretation of the OLS regression
coefficients
Figure 10. Interpretation of the OLS regression coefficients. This
figure shows the relationship between GDP value and the coefficients of
the variables in the final regression model. The percent change values
were calculated using the exponents of the variables and the numeric
change was calculated by subtracting one from the coefficient and
multiplying it by 100. These calculations were done in R.
Geographically Weighted Regression
For GWR, the data was prepared so that it could be inputted into the
GWR functions.
First, the “usa” dataset was converted to a ‘sf’ class.
usa1_sf <- st_as_sf(usa1) # The function st_as_sf() converts to an 'sf' class.
Then, it was converted again into an ‘sp’ class. Now, the data is
ready for use.
usa1.sp <- as(usa1_sf, "Spatial") # The function as() was used to convert to "Spatial" class
GWR Bandwidth
Here is the resulting bandwidth:
bW1
## [1] 528.9772
The gwr.sel() function determined that a bandwidth value of 528.9772
provided the minimal root mean square prediction error out of the
different bandwidth values that it tested. We used this bandwidth value
for our GWR analysis.
In this next step, we created the GWR model using the gwr()
function.
gwr.fit1 <-gwr(usa1$gdp_log ~ usa1$o_managementE + usa1$o_salesE + usa1$i_manufacturingE + usa1$i_wholesaleE +
usa1$i_retailE + usa1$i_transportE + usa1$i_professionalE + usa1$h_mortgageE,
data=usa1.sp, bandwidth=bW1, se.fit=T, hatmatrix=T)
# The gwr() function takes an OLS equation as an input, along with the specified dataset. We also specified our bandwidth that we obtained in the last step. se.fit = T allows for the function to return local coefficient standard errors while hatmatrix=T return a matrix of the y-hat values, or the predicted dependent variable.
Here is the summary of the GWR modeling.
gwr.fit1
## Call:
## gwr(formula = usa1$gdp_log ~ usa1$o_managementE + usa1$o_salesE +
## usa1$i_manufacturingE + usa1$i_wholesaleE + usa1$i_retailE +
## usa1$i_transportE + usa1$i_professionalE + usa1$h_mortgageE,
## data = usa1.sp, bandwidth = bW1, hatmatrix = T, se.fit = T)
## Kernel function: gwr.Gauss
## Fixed bandwidth: 528.9772
## Summary of GWR coefficient estimates at data points:
## Min. 1st Qu. Median 3rd Qu. Max.
## X.Intercept. 0.7450466 4.2332756 5.8926084 7.5953772 17.1404527
## usa1.o_managementE -0.0276163 0.0538689 0.0786261 0.0971403 0.1185411
## usa1.o_salesE -0.0375649 0.1103914 0.1809676 0.2329294 0.3149851
## usa1.i_manufacturingE -0.0597396 0.0138856 0.0353128 0.0430031 0.0642532
## usa1.i_wholesaleE 0.0931624 0.4090415 0.4288715 0.4655341 0.8368153
## usa1.i_retailE -0.4076938 -0.2190375 -0.1574815 -0.1135772 0.0025098
## usa1.i_transportE -0.0200086 0.1398921 0.1655582 0.2049887 0.4241009
## usa1.i_professionalE 0.1031053 0.1783273 0.2043704 0.2428115 0.3606911
## usa1.h_mortgageE -0.0010410 0.0135575 0.0175507 0.0244307 0.1333892
## Global
## X.Intercept. 8.0930
## usa1.o_managementE 0.0575
## usa1.o_salesE 0.1592
## usa1.i_manufacturingE 0.0199
## usa1.i_wholesaleE 0.4382
## usa1.i_retailE -0.2039
## usa1.i_transportE 0.1870
## usa1.i_professionalE 0.2030
## usa1.h_mortgageE 0.0221
## Number of data points: 380
## Effective number of parameters (residual: 2traceS - traceS'S): 71.50608
## Effective degrees of freedom (residual: 2traceS - traceS'S): 308.4939
## Sigma (residual: 2traceS - traceS'S): 0.6379012
## Effective number of parameters (model: traceS): 53.14229
## Effective degrees of freedom (model: traceS): 326.8577
## Sigma (model: traceS): 0.6197226
## Sigma (ML): 0.5747579
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 784.1655
## AIC (GWR p. 96, eq. 4.22): 710.6427
## Residual sum of squares: 125.5317
## Quasi-global R2: 0.7778485
The gwr() function, as an output, provides a ‘gwr’ object class that
shows a summary similar to that of the summary() function. The Call
section shows the equation, while the kernel function shows the
distribution used, which is a Gaussian or normal distribution. The fixed
bandwidth is the one we selected from the gwr.sel(). The gwr output also
shows the data ranges of the different variable coefficients in the
model. Since GWR generates a local OLS regression model for each
feature, showing all of the equations would take up a lot of room. The
minimum, maximum, and quartile information provides an overview of the
coefficients. In the last column, however, the global coefficients
equate to the global model that we created earlier in the OLS
regression. The lines after the coefficients show the different
performance metrics of the model. Most importantly, the quasi-global R2
provides the overall variation in GDP explained by the socioeconomic
variables. Overall, it seems that the GWR local coefficients are fairly
similar to the ones in the global model. However, there are definite
differences in some local models when viewing individual minimum and
maximum values. The maximum value for i_wholesale, for example, is 0.836
in the GWR model, while it is 0.438 in the global model, indicating that
there are spatial variations that exist for this variable. Additionally,
the GWR model has a higher quasi-global R2 value of 0.7778, as compared
to the adjusted r-squared value of 0.6603 for the global model.
GWR Visualization
To visualize the results of the GWR analysis, we first converted the
summary results, as represented by the SDF or the
SpatialPointsDataFrame, into an ‘sp’ object and an ‘sf’ object.
gwr1.sp <- gwr.fit1$SDF # We first converted the SDF into an 'sp' object.
gwr1.sf <-st_as_sf(gwr1.sp) # Then, we converted it to an 'sf' object.
Now, we can visualize the data using ggplot2.
Local R2
ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$localR2)) + # We used ggplot to plot 'localR2'
scale_fill_gradient(low="lightblue", high="blue")+ # Set the low values to lightblue and the high values to blue.
labs(fill="Local R2") + # We renamed the legend
coord_sf() + # added coordinate info
theme_map() + # added theme
ggtitle(paste("MSA GWR Model - Local R2")) # and included a title.

Figure 11. Local R2 values from the GWR modeling. The plot was
created using ggplot2 and utilizing the R2 values for the local
regression models in the GWR model as the color fill. The darker blue
hue represents higher local R2 values while the lighter blue hue
represents lower local R2 values.
From figure 11, the spatial variation in the local R2 for the local
regression models can be seen. The darker the blue hue, the higher the
local R2, and the lighter the blue hue, the lower the local R2. It seems
that local R2 values are higher in the western MSAs and they get lower
as one moves eastwards. Overall, however, the local R2 values are fairly
high, ranging between 0.70 and 0.85. From this figure, we can say that
the variables selected in the study are explaining a higher percentage
of the variation in GDP in the MSAs in the western part of the US. This
is followed by the central, midwestern, and southeastern MSAs, and then
the northeastern MSAs.
Residuals
The residuals in the model show errors from the estimates of the
model, gained from the difference between the observed value and the
predicted values of GDP.
To map the residuals, we used the numeric object ‘gwr.e’
ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$gwr.e)) + # We plot the residuals, 'gwr.e'
scale_fill_gradient(low="yellow", high="red")+ # We made the fill from yellow to red.
labs(fill="Residuals") + # Changed the legend title
coord_sf() + # Coordinates
theme_map() + # Theme
ggtitle(paste("MSA GWR Model - Residuals")) # Title

Figure 12. Residuals from the GWR modeling. The plot was created
using ggplot2 and utilizing the residual values in the GWR model as the
color fill. The red hue represents higher residual values while the
yellow hue represents lower local residual values.
The residual values seem to be evenly distributed spatially
according to figure 12. Most of the MSAs in the US seem to have similar
residual values. The most obvious high residuals appear arround the New
York MSA in the northeast, showing up as the brightest red hue. This
indicates that for the New York MSA, the model may be underpredicting
the effects of the variables on GDP. This is because the higher the
residual value, the greater the difference between the observed value (a
plot point higher than the regression line) and the predicted value (the
linear regression line that is under the plot point).
Variable Coefficients
We then plotted the coefficients of the socioeconomic variables to
observe how they varied spatially.
‘o_management’
ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.o_managementE)) + # We plotted the variable 'o_management'
scale_fill_gradient(low="lightgreen", high="darkgreen")+ # Fill of lightgreen to darkgreen
labs(fill="Coefficient") + # Rename the legend
coord_sf() + # Coordinates
theme_map() + # Theme
ggtitle(paste("MSA GWR Model coefficient: o_management")) # Title

Figure 13. Coefficient values of ‘o_management’ from the GWR
modeling. The plot was created using ggplot2 and utilized the
coefficient values in the GWR model for ‘o_management’ as the color
fill. The darker green hue represents higher coefficient values while
the lighter green hue represents lower coefficient values.
For the variable ‘o_management,’ it seems that the coefficient is
higher in the eastern, midwestern, and southeastern MSAs in the US. This
means that the population engaged in management-related occupation are
explaining the variation in GDP at a higher rate in these MSAs convert
to the rest of the country. In the western US, for example, the effects
are much lower, hovering around 0.04 instead of the 0.08 range in the
eastern states.
‘o_salesE’
ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.o_salesE)) + # o_sales
scale_fill_gradient(low="lightgreen", high="darkgreen")+ # fill of lightgreen to darkgreen.
labs(fill="Coefficient") + # rename legend
coord_sf() + # coordinates
theme_map() + # theme
ggtitle(paste("MSA GWR Model coefficient: o_sales")) # title

Figure 14. Coefficient values of ‘o_sales’ from the GWR modeling.
The plot was created using ggplot2 and utilized the coefficient values
in the GWR model for ‘o_sales’ as the color fill. The darker green hue
represents higher coefficient values while the lighter green hue
represents lower coefficient values.
For the variable ’o_sales,” the higher coefficients are much more
concetrated in the eastern coastal MSAs, as well as the MSAs in the
southwestern US. The coefficients are much lower in the central and
northwestern US, showing that these areas explain the variation in GDP
at a lower rate than the former MSAs.
‘i_manufacturing’
ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.i_manufacturingE)) + # 'i_manufacturing'
scale_fill_gradient(low="#99CCFF", high="#003366") + # fill of light to dark blue
labs(fill="Coefficient") + # rename legend
coord_sf() + # coordinates
theme_map() + # theme
ggtitle(paste("MSA GWR Model coefficient: i_manufacturing")) # title

Figure 15. Coefficient values of ‘i_manufacturing’ from the GWR
modeling. The plot was created using ggplot2 and utilized the
coefficient values in the GWR model for ‘i_manufacturing’ as the color
fill. The darker blue hue represents higher coefficient values while the
lighter blue hue represents lower coefficient values.
For the variable ‘i_manufacturing,’ it seems that this variable
explains the variation in GDP at a much more spatially uniform pattern
compared to the other variables. Most of the MSAs fall in the higher
range of the coefficient value, with the exception of the MSAs in the
southwestern US, mostly in California.
‘i_wholesale’
ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.i_wholesaleE)) + # 'i_wholesale'
scale_fill_gradient(low="#99CCFF", high="#003366") + # fill of light to dark blue
labs(fill="Coefficient") + # rename legend
coord_sf() + # coordinates
theme_map() + # theme
ggtitle(paste("MSA GWR Model coefficient: i_wholesale")) # title

Figure 16. Coefficient values of ‘i_manufacturing’ from the GWR
modeling. The plot was created using ggplot2 and utilized the
coefficient values in the GWR model for ‘i_manufacturing’ as the color
fill. The darker blue hue represents higher coefficient values while the
lighter blue hue represents lower coefficient values.
The variable ‘i_wholesale’ is also another variable that shows an
interesting pattern in the spatial distribution of the coefficient
values. The higher coefficient value of ‘i_wholesale’ mostly reside in
the south-central part of the US, mainly in the state of Texas. The
coefficient is generally higher in the eastern US, while the lowest
values can be found in some of the west-central MSAs.
‘i_retail’
ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.i_retailE)) + # 'i_retail'
scale_fill_gradient(low="#003366", high="#99CCFF" ) + # fill of dark to light green
labs(fill="Coefficient") + # rename legend
coord_sf() + # coordinates
theme_map() + # theme
ggtitle(paste("MSA GWR Model coefficient: i_retail")) # title

Figure 17. Coefficient values of ‘i_retail’ from the GWR modeling.
The plot was created using ggplot2 and utilized the coefficient values
in the GWR model for ‘i_retail’ as the color fill. The darker blue hue
represents larger negative coefficient values while the lighter blue hue
represents lower negative coefficient values.
The ‘i_retail’ variable is the only variable to have a negative
relationship with GDP. Thus, in the plot the colors were reversed to
show the MSAs with coefficient of higher magnitude, but in a negative
direction. Figure 17 shows that there is clear variation in this
coefficient. The coefficient in the southwestern and southeastern MSAs
show a greater, more negative coefficient value, indicating that the
population engaged in retail-related industries are explaining the
variation in GDP at a higher rate in these MSAs.
‘i_transport’
ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.i_transportE)) + # 'i_transport'
scale_fill_gradient(low="#99CCFF", high="#003366") + # fill of light to dark blue
labs(fill="Coefficient") + # rename legend
coord_sf() + # coordinates
theme_map() + # theme
ggtitle(paste("MSA GWR Model coefficient: i_transport")) # title

Figure 18. Coefficient values of ‘i_transport’ from the GWR
modeling. The plot was created using ggplot2 and utilized the
coefficient values in the GWR model for ‘i_retail’ as the color fill.
The darker blue hue represents higher coefficient values while the
lighter blue hue represents lower coefficient values.
The ‘i_transport’ variable has higher coefficient values in the
western MSAs, particularly in the southwest of the US. This indicates
that the variation in the GDP of the MSAs in the southwestern US is
explained more by the population in the transport-related industry
compared to the other areas of the country.
‘i_professional’
ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.i_professionalE)) + # 'i_professional'
scale_fill_gradient(low="#99CCFF", high="#003366") + # fill of light to dark blue
labs(fill="Coefficient") + # rename legend
coord_sf() + # coordinates
theme_map() + # theme
ggtitle(paste("MSA GWR Model coefficient: i_professional")) # title

Figure 19. Coefficient values of ‘i_professional’ from the GWR
modeling. The plot was created using ggplot2 and utilized the
coefficient values in the GWR model for ‘i_professional’ as the color
fill. The darker blue hue represents higher coefficient values while the
lighter blue hue represents lower coefficient values.
The variable ‘i_professional’ does not vary too significantly in its
coefficient values. Spatially, however, the plot in figure 19 indicates
that there are higher coefficient values in the central/midwestern
section of the US. This shows that the population engaged in
professional-related industries explain the variation in GDP at a higher
rate in these MSAs.
‘h_mortgage’
ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.h_mortgageE)) + # 'h_mortgage'
scale_fill_gradient(low="violet", high="#330066") + # fill of violet to dark purple
labs(fill="Coefficient") + # rename legend
coord_sf() + # coordinates
theme_map() + # theme
ggtitle(paste("MSA GWR Model coefficient: h_mortgage")) # title

Figure 20. Coefficient values of ‘h_mortgage’ from the GWR modeling.
The plot was created using ggplot2 and utilized the coefficient values
in the GWR model for ‘h_mortgage’ as the color fill. The darker purple
hue represents higher coefficient values while the lighter purple/pink
hue represents lower coefficient values.
The variable ‘h_mortgage’ has overall much lower coefficient values
compared to the other variables. The spatial variation can be seen in
figure 20, however, as much of the MSAs in eastern half of the US have
similar values while the western half, particularly the northwest, have
higher coefficient values. This variable explains the variation in GDP
at a higher rate in the northwestern, then the southwestern MSAs in the
US.
CONCLUSION
In this study, statistical and spatial analyses were conducted to
observe the relationship between GDP and socioeconomic data for MSAs
within the contiguous United States. GDP data were collected from the
BEA while socioeconomic data were collected from the US Census Bureau.
The study workflow included data cleanup and preparation, variable
analysis, and modeling. Except for a minor manipulation in the
beginning, the model workflow was conducted entirely using R. Due to
non-normal distributions, four variables were omitted from the analysis
and the dependent variable, GDP, was log-transformed. The OLS regression
showed that the following variables had statistically significant
relationships with GDP: “occuation: management,” “occupation: sales,”
“industry: manufacturing,” “industry: wholesale,” “industry_retail,”
“industry: transport,” “industry: professional,” and “housing:
mortgage.” The GWR analysis results showed the different spatial
distributions of the coefficients for each variable. For example,
“i_transportation” had higher local coefficients in the southwestern
MSAs compared to the MSAs in the eastern US. This indicates that the
higher proportion of the population that are involved in the
transportation industry is related to higher GDP in those MSAs. This
effect, in turn, is much less in the MSAs in the eastern US. Similarly,
the other variables had different spatial variation in which their
coefficients varied across different regions of the US.
REFLECTION
The initial draft of the project included an analysis of spatial
autocorrelation. However, this was removed in order to limit the scope
of the study further, as well as to be able to focus on the regression
models. This may be re-visited and added to a future version of the
project.
At the initial stages of the project, the raw dataset was used to
explore OLS regression models and GWR models. The raw dataset, however,
resulted in unusually high adjusted R2 values, as well as extremely high
VIF values. After conducting a review of the descriptive statistics and
the exploratory phase of the analysis, the skewness of certain
variables, including the dependent variable, became much clearer. We
believe these patterns were influencing the poor performance of the test
models at the beginning of the study. For future studies, conducting a
review of the descriptive statistics and normality tests may save time
and effort and lead to better modeling results.
REFERENCES
Bureau of Economic Analysis (2023, June 2). Gross Domestic Product.
bea.gov.
Glaeser, E. L., & Gottlieb, J. D. (2009). The wealth of cities:
Agglomeration economies and
spatial equilibrium in the United States. Journal of economic
literature, 47(4), 983-1028.
Glen, S. (2023). Akaike’s Information Criterion: Definition,
Formulas. Statisticshowto.com.
gwr.sel: Crossvalidation of bandwidth for geographically weighted
regression (n. d.).
rdocumentation.org. Retrieved July 19, 2023, from
Koc, T. (2022). Bandwidth Selection in Geographically Weighted
Regression Models via
Information Complexity Criteria. Journal of Mathematics,
2022.
Kolmogorov-Smirnov Tests (n. d.). stat.eethz.ch. Retrieved July 19,
2023, from
McMillen, D. P. (2005). Urban Economics. In K. Kempf-Leonard (Ed.),
Encyclopedia of Social
Mendez C. (2020). Geographically weighted regression models: A
tutorial using the spgwr
NIH (n. d.). Learn More about Normal Distribution.
Dietassessmentprimer.cancer.gov.
#:~:text=A%20variable%20that%20is%20normally,to%20describe%
20departures%20from%20normality.
Smith, G. (2018). Step away from stepwise. Journal of Big Data,
5(32).
Thieme, C. (2021). Understanding Linear Regression Output in R.
towardsdatascience.com.
VIF: Variance Inflation Factor (n. d.). rdocumentation.org.
Retrieved July 19, 2023, from
Walker K, Herman M (2023). tidycensus: Load US Census Boundary and
Attribute Data as
‘tidyverse’ and ‘sf’-Ready Data Frames.
Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis.
Springer-Verlag New York.
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
Grolemund G, Hayes A,
Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller
K, Ooms J,
Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C,
Woo K,
Yutani H (2019). “Welcome to the tidyverse.”
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A
Grammar of Data
William Revelle (2023). psych: Procedures for Psychological,
Psychometric, and Personality
Research. Northwestern University, Evanston, Illinois.