ABSTRACT

The goal of this project was to study the relationship between the size of an economy, based on Gross Domestic Product (GDP) and the socioeconomic variables for urban areas in the United States. The main question the project aimed to answer was: how do socioeconomic variables relate to the size of the economy of an urban area? This report begins by introducing the reasoning behind the need to advance the study of urban and spatial economics and presenting past studies. Then, it outlines the data sources used, mainly GDP data from the Bureau of Economics Analysis (BEA) and socioeconomic variables from the US Census Bureau, and their preparation for analysis. Afterwards, the different methodologies utilized, Ordinary Least Squares (OLS) regression and Geographically Weighted Regression (GWR) will be outlined and the results from the analysis will be presented, along with a discussion of the analysis and future considerations. In summary, the OLS regression showed that eight variables were found to be significant in explaining the variation in GDP: “occuation: management,” “occupation: sales,” “industry: manufacturing,” “industry: wholesale,” “industry_retail,” “industry: transport,” “industry: professional,” and “housing: mortgage.” These socioeconomic variables had varying relationships with GDP in the OLS regression model. The GWR modeling showed that the variables also had different spatial variations in terms of coefficient values in different regions of the US.

INTRODUCTION

Cities are often seen as drivers of economic growth, with their dense nature and agglomeration of industries seen as being conducive to economic activity and development. Economies of scale, for example, makes transportation of goods more efficient due to cities acting as trade hubs (McMillen 2005). In the field of economics, studies on urban economic theories and models can be found. However, spatial studies of cities using big data and spatial science methods are still quite limited. A work by Glaeser and Gottlieb (2009), for example, looks mainly at scholarly literature on agglomeration and spatial characteristics of cities. However, the specific spatial characteristics of cities such as their size, scale, and spread, vary greatly from city to city. This is why it is important that studies be done using more contemporary methods that make use of data available today. This project aims to advance the field of spatial and urban economics by combining spatial analytical methods in R with current economic and demographic data available. The main question it aims to answer is: how do socioeconomic variables affect the size of the economy of an urban area? This question will be analyzed by looking at socioeconomic variables from the 2017-2021 American Community Survey (ACS) data and the 2021 GDP data for all Metropolitan Statistical Areas (MSAs) in the United States.

DATA

The data for this study was collected from several sources. First, annual Gross Domestic Product (GDP) data for the year 2021 for all metropolitan statistical areas (MSA) in the United States were collected from the BEA website (Bureau of Economic Analysis 2023). BEA provides data tables for GDP values for each metropolitan and micropolitan statistical areas in the United States. BEA offers a way to retrieve the data programmatically using their API, but it was simple to download the data table from the site directly. The data also already came in a format that was easy to ingest for the study.

Figure 1. List of socioeconomic variables. Variables from the categories of occupation, industry, mortgage status and gross rent were collected using the get_acs() method from the US Census Bureau. The data is shown with the variable ID according to the US Census Bureau, the Variable_Name used within the study, their category, description, and unit.

METHODOLOGY

The analysis methodology followed several phases, as described in figure 2. The first phase involved data collection and cleanup and had three steps. In the first phase, data was collected from all sources and then cleaned up to prepare for ingestion in the analysis methods. GDP data was downloaded from the BEA website and a quick cleanup of the data was conducted in Microsoft Excel. Since one of the goals of this project was to utilize R and RStudio to conduct most operations, the .csv file was cleaned with minimal operations.

After being downloaded, header rows at the top and metadata information at the bottom of the excel sheet were deleted. The “2021” column which contained the GDP data was also renamed “GDP.” After the initial review and cleanup of the GDP data in Microsoft Excel, the tidycensus and tidyverse packages in R were utilized to complete the collection and “tidy” actions for the rest of the data. These are steps b and c in phase 1. The socioeconomic variables were collected using the get_acs() function and the GDP data was also further manipulated to allow for a join between the two datasets. For the join, a left_join() was used using the “GEOID” column and this fulfilled the requirements for the data to be able to be processed in the analysis.

Figure 2. Workflow of the study methodology
Figure 2. Workflow of the study methodology

Figure 2. Workflow of the study methodology. The blue parallelograms at the beginning (left side) and the end (right side) of the workflow represents the inputs and outputs, respectively. The inputs are the datasets collected and prepared for the study while the outputs represents the results generated by the study. The three orange boxes in the middle represents the methodology steps in which something is happening to the inputs. The first orange square includes the data collection and cleanup phase, the second one the descriptive statistics phase, and the third one the analysis phase which includes both non-spatial and spatial analyses. The green arrows indicate the direction of the workflow.

In the second phase of the project, variable analysis was conducted. This was done in two steps. First, descriptive statistics were created and analyzed using the describe() function from the psych package. The dataset were combined into one object to show the following statistics using describe(): mean, standard deviation (sd), median, trimmed median, mad (median absolute deviation from the median), minimum value, maximum value, skewness, kurtosis, and standard error of the mean. The purpose was to observe any troublesome patterns in the data. Second, exploratory analysis was conducting by creating histograms using the ggplot2 package and by conducting normality tests using the stats package. Histograms group the data values into different bins that shows how many of the entries fall within certain ranges. A variable is said to have a normal distribution if the histogram is bell-shaped, contains one peak, with a mean value through this peak, and is symmetric (NIH, n.d.). The Kolmogorov-Smirnov or KS test tests whether the data follows a certain distribution. This can be a one-sample, testing one dataset to see if it matches a certain distribution such as a normal distribution, or two-sample, testing two datasets against one another to see if they have a matching distribution (Kolmogorov-Smirnov Tests, n.d.). In this case, we utilized a one-sample test to see if the data followed a normal distribution. If the assumption of normality was not observed, we then either transformed the variable or omitted them from further consideration in the study.

In the third phase of the project, OLS regression and GWR analysis were conducted. In the OLS regression analysis, a full model containing all of the variables were created using the lm() function from the stats package. The dependent variable was “GDP” while the independent variables included all of the socioeconomic variables on the table in figure 2. To explore the variables to include in the final model, a stepwise regression using backwards elimination of variables based on AIC was conducted. This was done using the step() function from the stats package. Backward modeling starts with all the independent variables included in the model, after which the variables are eliminated one by one based on statistical significance (Smith, 2018).

The result from this process is the summary of the regression and an equation for the final model. The summary of the model was presented and explained using summary() and the equation presented by a figure created using an online LatEx editor. Variance inflation factor analysis was also conducted using the VIF() function from the regclass package. This was to analyze if the independent variables had affects on one another. A large VIF value could indicate problems with accurate prediction in the model. (VIF). Additionally, model assumptions were analyzed and some of these outputs are presented in the results. In the second step of the analysis phase, GWR analysis was conducted. This was conducted on the same dataset to compare how the inclusion of a spatial dimension would alter the relationships between GDP and the variables. The spatial extent of this covers all the MSAs within the contiguous United States.

CHALLENGES & LIMITATIONS

For the data and analysis, one of the challenges involved selecting the right geography or spatial scale. To limit the scope of the analysis to a finer study, a study of MSAs in one state, California, was tested early in the study. This, however, resulted in a data sample size of 26 rows. OLS and GWR regressions were conducted. However, the R2 and adjusted R2 were suspiciously high, at over 0.9999 each. Additionally, the VIF values were well over 2000 for each variable, indicating issues with the independent variables. For this reason, the analysis were conducted on the whole dataset to avail more data points for the regressions. Additionally, instead of the raw estimate values, the percentage of population values were used for the independent variables. Moreover, to reduce the scope, it was decided that the data would be limited to analyzing metropolitan statistical areas and not micropolitan statistical areas.

RESULTS

DATA PREPARATION

There were three main parts to data preparation. First, this entailed loading the MSA GDP .csv file that was obtained from the Bureau of Economic Analysis. Second, socioeconomic data was retrieved from the US Census Bureau using the API. Lastly, the GDP and the socioeconomic data were joined into one dataset. These steps align with the first orange box in the workflow in figure 2.

Load packages

First, we loaded all the packages used for the analysis.

library(tidycensus) #For obtaining tidy census data
library(tidyverse) #For manipulating census and other data
library(here) #To load our data using read_csv
library(dplyr)  #For data manipulation 
library(spgwr)   #For the GWR analysis
library(regclass) #For VIF
library(sf)   #For simple features and other geometry operations
library(psych) #For descriptive statistics
library(ggplot2) #For plotting graphs
library(cowplot) #For combining plots

Importing the data

GDP data

First, we imported the “usa_msa.csv” file into RStudio, which contained the 2021 annual GDP data for all of the MSAs in the United States. This dataset was downloaded directly from the US Bureau of Economic Analysis website and had minimal manual manipulation. After downloading, the top two header rows and the information rows at the bottom were deleted in Excel. The year ‘2021’ column was also renamed as ‘GDP.’

usa_msa_init <- read_csv(here("data", "usa_msa.csv")) # Used the 'here' package to load the GDP dataset.
## Rows: 384 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): GeoName
## dbl (2): GeoFips, GDP
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Next, we converted the ‘GeoFips’ column name to a ‘character’ type to prepare it to be joined. Fortunately, the ‘GeoFips’ column in the BEA data corresponds directly to the ‘GEOID’ column in the census data.

GEOID <- as.character(usa_msa_init$GeoFips) # Convert from 'double' to 'character' to allow the join. 

Then, we added the newly renamed ‘GEOID’ column to the GDP dataset.

usa_msa <- cbind(usa_msa_init, GEOID) # Add converted GEOFIPS (now GEOID) column to the data frame.

Socioeconomic data

To retrieve the socioeconomic variables data, we first created a list of variables that we were interested in for the study. These were then pulled from the US Census Bureau’s website through the API in the next step.

varlist = c(
  # OCCUPATION category
  # Civilian employed population 16 years and over
  o_management = "DP03_0027P", # management, business, science, and arts occupations (percent) 
  o_service = "DP03_0028P", # service occupations (percent)
  o_sales = "DP03_0029P", # sales and office occupations (percent)
  o_resources = "DP03_0030P", # natural resources, construction, and maintenance occupations (percent)
  o_production = "DP03_0031P", # production, transportation, and material moving occupations (percent)
  # INDUSTRY category
  # Civilian employed population 16 years and over
  i_agriculture = "DP03_0033P", # agriculture, forestry, fishing and hunting, and mining (percent)
  i_construction = "DP03_0034P", # construction (percent)
  i_manufacturing = "DP03_0035P", # manufacturing (percent)
  i_wholesale = "DP03_0036P", # wholesale trade (percent)
  i_retail = "DP03_0037P", # retail trade (percent)
  i_transport = "DP03_0038P", # transportation and warehousing, and utilities (percent)
  i_information = "DP03_0039P", # information (percent)
  i_finance = "DP03_0040P", # fiance and insurance, and real estate and rental and leasing (percent)
  i_professional = "DP03_0041P", # professional, scientific, and management, and administrative and waste management services (percent)
  i_education = "DP03_0042P", # educational services, and health care and social assistance (percent)
  # HOUSING category
  # Mortgage status and gross rent
  h_mortgage = "DP04_0091P",  # housing unit with a mortgage (percent)
  h_no_mortgage = "DP04_0092P", # housing units without a mortgage (percent)
  h_paying_rent = "DP04_0126P", # occupied units paying rent (percent)
  h_median_rent = "DP04_0134" # occupied units paying rent - median (dollars)
)

The socioeconomic variables were retrieved using get_acs()

us_socio <- get_acs(      # Get data with get_acs()
  geography = "metropolitan statistical area/micropolitan statistical area", # the geography option is for Metro and Micro statistical areas but we only used the former.
  variables = varlist, # We use the variable list created above, in line 83.
  year = 2021, # This is for the 2017-2021 ACS
  geometry = TRUE, # We also retrieve the geometry
  output="wide", # We load the data in 'wide' format.
  progress_bar = FALSE # This is to turn off a long progress bar message.
  )

Joining the data

For the final step in data preparation, we joined the GDP data and the socioeconomic data.

usa_msa_joined <- usa_msa %>%
  left_join(us_socio, by="GEOID") %>% # We join using "GEOID"
  replace(is.na(.), 0)  # replaced NA values with zero.

Additionally, we deleted four MSAs that were located in Alaska and Hawaii. The non-continguous nature of these MSAs in relation to the MSAs on the mainland US was not helpful for the spatial analysis.

usa <- usa_msa_joined %>% 
  filter(GEOID != "11260", GEOID != "21820", GEOID != "27980", GEOID != "46520") # We use a filter from dplyr that takes out the omitted MSAs.
  # The omitted MSAs were: 
  # Anchorage, AK (11260) and Fairbanks, AK (21820).
  # Kahului-Wailuku-Lahaina, HI (27980) and Urban Honolulu, HI (46520).

A sample of the final joined data is shown here in figure 3:

head(usa)
##   GeoFips                                                     GeoName      GDP
## 1   10180                 Abilene, TX (Metropolitan Statistical Area)  8848793
## 2   10420                   Akron, OH (Metropolitan Statistical Area) 41058875
## 3   10500                  Albany, GA (Metropolitan Statistical Area)  6462473
## 4   10540          Albany-Lebanon, OR (Metropolitan Statistical Area)  5608491
## 5   10580 Albany-Schenectady-Troy, NY (Metropolitan Statistical Area) 73995509
## 6   10740             Albuquerque, NM (Metropolitan Statistical Area) 49480431
##   GEOID                                   NAME o_managementE o_managementM
## 1 10180                 Abilene, TX Metro Area          34.6           1.7
## 2 10420                   Akron, OH Metro Area          38.4           0.7
## 3 10500                  Albany, GA Metro Area          33.7           1.5
## 4 10540          Albany-Lebanon, OR Metro Area          34.0           1.4
## 5 10580 Albany-Schenectady-Troy, NY Metro Area          46.5           0.7
## 6 10740             Albuquerque, NM Metro Area          42.6           0.7
##   o_serviceE o_serviceM o_salesE o_salesM o_resourcesE o_resourcesM
## 1       20.1        1.1     21.7      1.3         11.2          1.0
## 2       17.1        0.5     21.9      0.5          7.3          0.3
## 3       18.5        1.4     21.4      1.4          9.4          1.1
## 4       17.7        1.1     19.8      1.4         11.5          1.1
## 5       15.8        0.4     21.4      0.6          6.8          0.3
## 6       18.4        0.5     21.6      0.6          8.6          0.4
##   o_productionE o_productionM i_agricultureE i_agricultureM i_constructionE
## 1          12.5           0.9            3.0            0.4             7.1
## 2          15.3           0.5            0.5            0.1             6.1
## 3          16.9           1.3            2.7            0.6             5.6
## 4          17.0           1.2            4.9            0.8             7.6
## 5           9.5           0.4            0.7            0.1             5.5
## 6           8.8           0.5            1.3            0.1             7.1
##   i_constructionM i_manufacturingE i_manufacturingM i_wholesaleE i_wholesaleM
## 1             0.9              6.9              0.8          1.9          0.4
## 2             0.3             15.1              0.4          2.9          0.2
## 3             0.9              9.5              1.2          2.8          0.4
## 4             1.0             13.1              1.2          1.7          0.5
## 5             0.3              7.9              0.3          1.9          0.1
## 6             0.4              4.4              0.3          2.2          0.2
##   i_retailE i_retailM i_transportE i_transportM i_informationE i_informationM
## 1      12.3       1.0          5.7          0.8            1.0            0.3
## 2      11.8       0.4          5.1          0.3            1.5            0.1
## 3      11.9       1.1          5.8          0.9            1.2            0.3
## 4      11.3       1.1          5.4          0.7            0.7            0.2
## 5      10.2       0.4          4.3          0.2            1.7            0.1
## 6      10.7       0.5          4.0          0.2            1.6            0.2
##   i_financeE i_financeM i_professionalE i_professionalM i_educationE
## 1        7.1        0.9             7.0             0.7         28.6
## 2        6.3        0.3            10.1             0.3         23.5
## 3        4.4        0.7             8.3             1.2         26.6
## 4        4.1        0.8             7.6             0.9         24.1
## 5        7.2        0.4            11.8             0.4         26.7
## 6        5.8        0.4            14.0             0.5         26.1
##   i_educationM h_mortgageE h_mortgageM h_no_mortgageE h_no_mortgageM
## 1          1.3        51.3         1.5           48.7            1.5
## 2          0.6        64.5         0.7           35.5            0.7
## 3          1.5        56.8         2.0           43.2            2.0
## 4          1.3        65.7         1.9           34.3            1.9
## 5          0.5        63.0         0.8           37.0            0.8
## 6          0.7        62.7         0.7           37.3            0.7
##   h_paying_rentE h_paying_rentM h_median_rentE h_median_rentM
## 1          21768              0            949             27
## 2          89858              0            889             10
## 3          23425              0            826             19
## 4          15716              0           1087             20
## 5         128130              0           1094              9
## 6         112528              0            952             13
##                         geometry
## 1 MULTIPOLYGON (((-100.1518 3...
## 2 MULTIPOLYGON (((-81.68815 4...
## 3 MULTIPOLYGON (((-84.60282 3...
## 4 MULTIPOLYGON (((-123.2608 4...
## 5 MULTIPOLYGON (((-74.71158 4...
## 6 MULTIPOLYGON (((-106.277 35...

Figure 3. Sample of analysis-ready ‘usa’ dataset. This sample contains the first ten entries in the ‘usa’ dataset, which contains GDP and socioeconomic variable information for MSAs in the contiguous United States. The data were obtained from BEA and the US Census Bureau and were manipulated using R to ready for analysis.

In the dataset, the variable names ending with the letter “E” are the estimate values (in percent) while the names ending with the letter “M” are the margins of error. The estimate value columns were selected and used for the analysis.

VARIABLE ANALYSIS

Descriptive Statistics

The describe() function from the psych package was used to generate descriptive statistics for the variables in the dataset. The purpose of this step was to view the distribution of the values of each variable to see how much they deviate from the assumption of a normal distribution.

The describe() function returns the following descriptive statistics:

The variable number, the total number of variables (n), mean, standard deviation (sd), median, trimmed median, median absolute deviation from the median (mad), minimum value (min), maximum value (max), range, skew, kurtosis, and the standard error of the mean (se).

# We put all of the variables into one object, in this case, combining certain rows from the 'usa' data.frame into a matrix. 
descriptives <- cbind(usa$GDP, usa$o_managementE, usa$o_serviceE, usa$o_salesE, usa$o_resourcesE, usa$o_productionE,       
                     usa$i_agricultureE, usa$i_constructionE, usa$i_manufacturingE, usa$i_wholesaleE, usa$i_retailE, 
                     usa$i_transportE, usa$i_informationE, usa$i_financeE, usa$i_professionalE, usa$i_educationE, 
                     usa$h_mortgageE, usa$h_no_mortgageE, usa$h_paying_rentE, usa$h_median_rentE
                     )
# We did the cbind() operation because we only wanted to analyze the estimate values of each variables, but the dataset also contains the columns for margin of error (moe) from the US Census Bureau data. 
desc1 <- describe(descriptives) # We use describe() to get the descriptive statistics. 

Figure 4 shows the descriptive statistics created by the describe() function.

desc1
##     vars   n        mean           sd      median     trimmed         mad
## X1     1 380 54817925.22 151710190.92 13977362.50 22360987.33 11929094.49
## X2     2 380       37.15         5.92       36.30       36.76        5.78
## X3     3 380       17.83         2.25       17.65       17.74        2.15
## X4     4 380       21.07         1.77       21.10       21.07        1.48
## X5     5 380        9.53         2.52        9.20        9.24        1.93
## X6     6 380       14.42         4.14       13.75       14.15        4.37
## X7     7 380        2.23         2.73        1.30        1.63        0.89
## X8     8 380        6.82         1.50        6.60        6.73        1.33
## X9     9 380       11.02         5.46        9.90       10.49        5.04
## X10   10 380        2.36         0.62        2.30        2.34        0.59
## X11   11 380       11.67         1.39       11.50       11.59        1.19
## X12   12 380        5.27         1.38        5.20        5.19        1.33
## X13   13 380        1.48         0.54        1.40        1.42        0.44
## X14   14 380        5.60         1.84        5.20        5.41        1.48
## X15   15 380        9.72         2.84        9.15        9.42        2.45
## X16   16 380       24.63         4.17       24.00       24.21        3.41
## X17   17 380       60.29         7.04       60.95       60.63        7.19
## X18   18 380       39.71         7.04       39.05       39.37        7.19
## X19   19 380    98079.79    254725.43    30508.50    45364.59    25969.96
## X20   20 380     1021.24       258.67      942.50      982.69      180.14
##           min          max        range  skew kurtosis         se
## X1  2894022.0 1992779274.0 1989885252.0  7.56    77.71 7782568.58
## X2       23.8         55.8         32.0  0.62     0.31       0.30
## X3       13.0         26.8         13.8  0.71     1.32       0.12
## X4       16.0         27.9         11.9  0.11     0.68       0.09
## X5        4.1         22.4         18.3  1.73     5.15       0.13
## X6        6.6         33.5         26.9  0.70     0.61       0.21
## X7        0.2         19.9         19.7  3.40    13.30       0.14
## X8        2.9         13.0         10.1  0.72     1.13       0.08
## X9        2.2         38.1         35.9  1.26     2.65       0.28
## X10       0.6          4.7          4.1  0.23     0.24       0.03
## X11       7.7         17.2          9.5  0.59     0.97       0.07
## X12       2.1         14.8         12.7  1.56     7.70       0.07
## X13       0.6          5.4          4.8  2.31    10.25       0.03
## X14       2.5         19.1         16.6  1.96     8.90       0.09
## X15       4.5         21.9         17.4  1.21     2.06       0.15
## X16      15.3         45.2         29.9  1.28     3.06       0.21
## X17      37.6         76.6         39.0 -0.44    -0.10       0.36
## X18      23.4         62.4         39.0  0.44    -0.10       0.36
## X19    6677.0    3395778.0    3389101.0  8.03    85.80   13067.14
## X20     657.0       2511.0       1854.0  1.83     4.75      13.27

Figure 4. Descriptive statistics of GDP and socioeconomic variables. These statistic values were obtained using the describe() function in the psych package. They look at the distribution of the data values for the dependent variable, GDP, and the twenty independent variables initially chosen for the study.

After the descriptive statistics were calculated, we added the variable names to the matrix for readability.

varnames <- c('GDP', 'o_management', 'o_service', 'o_sales', 'o_resources', 'o_production', 'i_agriculture',  'i_construction',
               'i_manufacturing', 'i_wholesale', 'i_retail', 'i_transport', 'i_information', 'i_finance', 'i_professional',
              'i_education', 'h_mortgage', 'h_no_mortgage', 'h_paying_rent', 'h_median_rent') # A list of the variable names

desc_stats <- cbind(varnames, desc1) # We bind the variable names to the descriptive statistics (desc1)

Variable names were added in figure 5. However, the operation converted the numerical values to a scientific notation format. For the analysis, both tables were referenced since figure 4 contained more readable numeric values while figure 5 contained readable variable names that were easier to reference.

desc_stats # The final descriptive statistics table
##            varnames vars   n         mean           sd      median      trimmed
## X1              GDP    1 380 5.481793e+07 1.517102e+08 13977362.50 2.236099e+07
## X2     o_management    2 380 3.714868e+01 5.923855e+00       36.30 3.676447e+01
## X3        o_service    3 380 1.783447e+01 2.251132e+00       17.65 1.773520e+01
## X4          o_sales    4 380 2.107132e+01 1.766589e+00       21.10 2.106809e+01
## X5      o_resources    5 380 9.526316e+00 2.523429e+00        9.20 9.237500e+00
## X6     o_production    6 380 1.441632e+01 4.136937e+00       13.75 1.414737e+01
## X7    i_agriculture    7 380 2.229474e+00 2.734964e+00        1.30 1.625329e+00
## X8   i_construction    8 380 6.822105e+00 1.501182e+00        6.60 6.727632e+00
## X9  i_manufacturing    9 380 1.102263e+01 5.460993e+00        9.90 1.048553e+01
## X10     i_wholesale   10 380 2.356053e+00 6.165614e-01        2.30 2.344408e+00
## X11        i_retail   11 380 1.166579e+01 1.389059e+00       11.50 1.158783e+01
## X12     i_transport   12 380 5.273947e+00 1.381399e+00        5.20 5.190132e+00
## X13   i_information   13 380 1.478158e+00 5.366974e-01        1.40 1.415461e+00
## X14       i_finance   14 380 5.595263e+00 1.840351e+00        5.20 5.406579e+00
## X15  i_professional   15 380 9.721316e+00 2.836391e+00        9.15 9.424013e+00
## X16     i_education   16 380 2.463342e+01 4.165408e+00       24.00 2.421118e+01
## X17      h_mortgage   17 380 6.028763e+01 7.040128e+00       60.95 6.062928e+01
## X18   h_no_mortgage   18 380 3.971237e+01 7.040128e+00       39.05 3.937072e+01
## X19   h_paying_rent   19 380 9.807979e+04 2.547254e+05    30508.50 4.536459e+04
## X20   h_median_rent   20 380 1.021239e+03 2.586651e+02      942.50 9.826908e+02
##              mad       min          max        range       skew   kurtosis
## X1  1.192909e+07 2894022.0 1992779274.0 1989885252.0  7.5622452 77.7078514
## X2  5.782140e+00      23.8         55.8         32.0  0.6158026  0.3080361
## X3  2.149770e+00      13.0         26.8         13.8  0.7093938  1.3227377
## X4  1.482600e+00      16.0         27.9         11.9  0.1088465  0.6833167
## X5  1.927380e+00       4.1         22.4         18.3  1.7333137  5.1459713
## X6  4.373670e+00       6.6         33.5         26.9  0.7045609  0.6140945
## X7  8.895600e-01       0.2         19.9         19.7  3.3972709 13.2959358
## X8  1.334340e+00       2.9         13.0         10.1  0.7166221  1.1309616
## X9  5.040840e+00       2.2         38.1         35.9  1.2584898  2.6544094
## X10 5.930400e-01       0.6          4.7          4.1  0.2306887  0.2434818
## X11 1.186080e+00       7.7         17.2          9.5  0.5891544  0.9691956
## X12 1.334340e+00       2.1         14.8         12.7  1.5633812  7.6978933
## X13 4.447800e-01       0.6          5.4          4.8  2.3094055 10.2509312
## X14 1.482600e+00       2.5         19.1         16.6  1.9603773  8.9032210
## X15 2.446290e+00       4.5         21.9         17.4  1.2082859  2.0559682
## X16 3.409980e+00      15.3         45.2         29.9  1.2808478  3.0596242
## X17 7.190610e+00      37.6         76.6         39.0 -0.4424950 -0.1049643
## X18 7.190610e+00      23.4         62.4         39.0  0.4424950 -0.1049643
## X19 2.596996e+04    6677.0    3395778.0    3389101.0  8.0295696 85.7957782
## X20 1.801359e+02     657.0       2511.0       1854.0  1.8259783  4.7468910
##               se
## X1  7.782569e+06
## X2  3.038873e-01
## X3  1.154806e-01
## X4  9.062411e-02
## X5  1.294492e-01
## X6  2.122204e-01
## X7  1.403007e-01
## X8  7.700901e-02
## X9  2.801430e-01
## X10 3.162893e-02
## X11 7.125725e-02
## X12 7.086429e-02
## X13 2.753200e-02
## X14 9.440800e-02
## X15 1.455038e-01
## X16 2.136809e-01
## X17 3.611510e-01
## X18 3.611510e-01
## X19 1.306714e+04
## X20 1.326924e+01

Figure 5. Descriptive statistics of GDP and socioeconomic variables with variable names. These statistic values were obtained using the describe() function in the psych package and the variable names were added afterwards.

In the descriptive statistics tables, figures 4 and 5, the first column is the item name, the second column (vars) is the item number, both of which correspond to the variables that were used in describe() in the order they were listed. The third column (n) is the number of entries or rows. The mean is the average value of the data and the standard deviation (sd) is a measure of the dispersion of the values. The median is the value that separates the lower and upper half of the data equally, the trimmed mean (trimmed) is a particular mean calculated by discarding some low/high values, and the median absolute deviation (mad) is a calculatation of how the values deviate from the median. The min and max columns represent the lowest and highest values for each variable, respectively. The range is the difference between the highest and the lowest values.

Skew (skewness) provides a measure by which the distribution departs from that of a normal distribution. Values that are “+” suggest a positive skew while values that are “-“ skewness suggest a negative skew. Values that are roughly ±1.0 suggests non-normal data. Kurtosis shows the”tailedness” of the data, and similar to skewness, describes the distribution of the data. Lastly, the standard error (se) is the standard error of the mean in the data.

EXPLORATORY ANALYSIS

To explore the viability of the variables to be chosen, the assumption of normality was tested for several variables. This was done by creating histograms and by conducting one-sample Kolmogorov-Smirnov tests for each variable.

The preliminary exploratory phase was done for the dependent variable, as well as independent variables that had high skew according to the descriptive statistics. The descriptive statistics figures indicated that most variables had a normal distribution or only a slight skew. For variables that had visibly high skews, considered in this study as skew values over 1.9, histograms and normality tests were conducted to see how much they deviated from the assumption of normality.

DEPENDENT VARIABLE ANALYSIS

We used ggplot() to conduct normality tests of the dependent variable, GDP. A histogram was created and is shown in figure 6.

hist_gdp <- ggplot(usa, aes(x=GDP)) +   # Fill is GDP data
  geom_histogram(color="blue", fill="lightblue") + # We use geom_histogram 
  ggtitle("Histogram of the dependent variable 'GDP'") # Add a title 
hist_gdp

Figure 6. Histogram of the dependent variable ’GDP.” A histogram of the variable GDP was created using ggplot(), with default bin values. The x axis shows the value of GDP, in thousands, while the y axis shows the number of MSAs that fall within a certain bin. From the figure, GDP data has a positive skew, with a few outliers.

From figure 6, it looks like our dependent variable has a heavy right or positive skew. This was also shown in our descriptive statistics, where GDP had a skew value of 7.6 in figures 4 and 5, indicating high skewness. This could cause potential issues for the analysis, so a log tranformation was completed for the dependent variable.

gdp_log <- log(usa$GDP) # Conduct log transformation of the dependent variable. 

We then added the log-transformed GDP variable to our dataset.

usa1 <- cbind(usa, gdp_log) # Bind the new gdp_log column to the usa dataset.

We then create a histogram for the log-transformed GDP variable.

hist_log_gdp <- ggplot(usa1, aes(x=gdp_log)) + # Fill is gdp_log
  geom_histogram(color="blue", fill="lightblue") + # We use geom_histogram
  ggtitle("Histogram of the log-transformed variable 'GDP'") # Add a title 
hist_log_gdp

Figure 7. Histogram of the log-transformed ’GDP” variable. The histogram was created using ggplot(), with default bin values. The x axis shows the log-transformed values of GDP, while the y axis shows the number of MSAs that fall within each bin. From the figure, log GDP still has a skew, but the distribution is closer to normal compared to the initial dataset.

The histogram shows that the depenendent variable is still skewed but shows a much better distribution compared to the initial GDP data. In the log-transformed version of GDP, the plot is closer to a normal distribution. We used this log-transformed GDP variable for the OLS analysis.

INDEPENDENT VARIABLE ANALYSIS

Next, we conducted normality tests for the independent variables. First, we created histograms for each variable, as shown in figure 8. Like GDP, these variables showed high skew values in the descriptive statistics and were chosen for further analysis. However, transformation was not conducted on these independent variables to maintain the simplicity of the regression equation interpretation.

Histograms

h1 <- ggplot(usa1, aes(x=i_agricultureE)) + # Plot for i_agriculture
  geom_histogram(color="blue", fill="lightblue")

h2 <- ggplot(usa1, aes(x=i_informationE)) + # Plot for i_informationE
  geom_histogram(color="blue", fill="lightblue")

h3 <- ggplot(usa1, aes(x=i_financeE)) +     # Plot for i_financeE
  geom_histogram(color="blue", fill="lightblue")

h4 <- ggplot(usa1, aes(x=h_paying_rentE)) + # Plot for h_paying_rentE
  geom_histogram(color="blue", fill="lightblue")

h_group <- plot_grid(h1, h2, h3, h4) # Group the plots into one grid
title <- ggdraw()+ # Use draw_label() to create a title. 
  draw_label(
    "Histograms of independent variables with high skew")

# The combined plot with the title:
h_plots <- plot_grid(title, h_group, ncol=1, rel_heights = c(0.1, 1))
h_plots

Figure 8. Histogram of four independent variables, i_agriculture, i_information, i_finance, and h_paying_rent. The histograms were created using ggplot and were combined using cowplot. The x axis shows the value of each variable, while the y axis shows the number of MSAs that fall within a certain bin for that variable. From the figure, all four variables show a positive skew, particularly for h_paying_rent.

Normality test

After visual normality testing was done using histograms, a statistical normality test, the Kolmogorov-Smirnov test, was done each of the four independent variables.

# Kolmogorov-Smirnov test
ks1 <- ks.test(usa1$i_agricultureE, "pnorm") # This test is from the stat package.
ks2 <- ks.test(usa1$i_informationE, "pnorm") # We used a one-sample test.
ks3 <- ks.test(usa1$i_financeE, "pnorm") # Which tests the sample against a normal distribution, “pnorm.”
ks4 <- ks.test(usa1$h_paying_rentE, "pnorm")

Here are the results of the KS test:

ks1
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  usa1$i_agricultureE
## D = 0.63883, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks2
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  usa1$i_informationE
## D = 0.77499, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks3
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  usa1$i_financeE
## D = 0.99379, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks4
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  usa1$h_paying_rentE
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

The KS test results in the code above shows the test statistic (D), the p-value, and the type of test: two-sided hypothesis test. All the results showed that each p-value were under 0.05. This means that each of the test statistic was significant. Therefore, we reject the null hypothesis which states that there is no significant difference between the sample data and that of a normal distribution. In short, this means that we have evidence to show that the variable data do not come from a normal distribution.

From the normality tests conducted using histograms and the KS test, the four variables explored show non-normality in their distributions. As a result, to maintain simplicity in interpretation of the analysis, these four independent variables were selected to be dropped from the OLS study.

MODELING

OLS Regression

In the first step of the modeling phase, OLS regression was conducted. This process began using an initial model that contained all the independent variables selected earlier, ones in the variable list used to get the ACS data, minus the four variables eliminated earlier in the exploratory analysis.

# We use the lm() function in stats package to fit a linear model using the variables. 
msa_full_model <- lm(usa1$gdp_log ~ usa1$o_managementE +  usa1$o_serviceE + usa1$o_salesE + 
                      usa1$o_resourcesE + usa1$o_productionE + 
                      usa1$i_constructionE + usa1$i_manufacturingE + usa1$i_wholesaleE + 
                      usa1$i_retailE + usa1$i_transportE + 
                      usa1$i_professionalE + usa1$i_educationE + usa1$h_mortgageE + usa1$h_no_mortgageE + 
                      usa1$h_median_rentE, data = usa1)
# The first argument is the model formula, which contains the dependent variable as the first item and the rest as the independent variables. The next argument, in this case the last, is the dataset. 

The summary() function gives us a look into the performance of the model by showing the coefficient values as well as other metrics.

summary(msa_full_model) # summary of our full model. 
## 
## Call:
## lm(formula = usa1$gdp_log ~ usa1$o_managementE + usa1$o_serviceE + 
##     usa1$o_salesE + usa1$o_resourcesE + usa1$o_productionE + 
##     usa1$i_constructionE + usa1$i_manufacturingE + usa1$i_wholesaleE + 
##     usa1$i_retailE + usa1$i_transportE + usa1$i_professionalE + 
##     usa1$i_educationE + usa1$h_mortgageE + usa1$h_no_mortgageE + 
##     usa1$h_median_rentE, data = usa1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.16016 -0.49061  0.00956  0.43355  2.70810 
## 
## Coefficients: (1 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -7.448e+01  5.607e+01  -1.328   0.1849    
## usa1$o_managementE     8.872e-01  5.596e-01   1.586   0.1137    
## usa1$o_serviceE        8.473e-01  5.612e-01   1.510   0.1320    
## usa1$o_salesE          9.875e-01  5.605e-01   1.762   0.0789 .  
## usa1$o_resourcesE      8.269e-01  5.603e-01   1.476   0.1408    
## usa1$o_productionE     7.813e-01  5.604e-01   1.394   0.1641    
## usa1$i_constructionE  -2.182e-02  3.390e-02  -0.644   0.5202    
## usa1$i_manufacturingE  4.945e-02  2.091e-02   2.365   0.0186 *  
## usa1$i_wholesaleE      4.710e-01  7.050e-02   6.681 8.87e-11 ***
## usa1$i_retailE        -1.892e-01  3.800e-02  -4.980 9.83e-07 ***
## usa1$i_transportE      2.193e-01  3.872e-02   5.663 3.02e-08 ***
## usa1$i_professionalE   1.533e-01  2.901e-02   5.286 2.16e-07 ***
## usa1$i_educationE     -1.071e-02  1.717e-02  -0.624   0.5332    
## usa1$h_mortgageE       1.525e-02  7.197e-03   2.118   0.0348 *  
## usa1$h_no_mortgageE           NA         NA      NA       NA    
## usa1$h_median_rentE    5.955e-04  2.392e-04   2.490   0.0132 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7032 on 365 degrees of freedom
## Multiple R-squared:  0.6806, Adjusted R-squared:  0.6684 
## F-statistic: 55.56 on 14 and 365 DF,  p-value: < 2.2e-16

In the code, the call shows the equation that were used for the model. The residuals shows the difference between the actual data values and the values predicted by the model. The coefficients help us build our model and the ‘Estimate’ column helps to interpret the relationship that the data shows between the dependent variable, GDP, and the independent variables. The ‘Std.Error’ are the deviation of the coefficient estimate values. The t and p-values indicates whether or not the variable is considered significant in the model. The * symbols indicates the level of significance. The last three lines shows us how well the model does in fitting the data. One particular value, the adjusted R-squared, is commonly used to gauge model fit. The adjusted R-squared shows the percentage of the variation in GDP that the socioeconomic variables are predicting. Thus, a higher adjusted R-squared is generally desired. (Linear Reg).

As seen in the code, the adjusted R-squared is relatively high at 0.6684. However, there are many independent variables that are not statistically significant in the model. Thus, there was room for improvement in the model, for both simplicity and performance.

To obtain a better model, stepwise backward elimination was conducted. First, this was done manually, with the deletion of one independent variable at a time, based on statistical significance. At each step, the variable with the highest p-value was deleted and a new summary was generated. This process was repeated until all variables were statistically significant at the 5% level, meaning the p-values were under 0.05. At the same time, the adjusted R-squared was monitored to ensure that it did not drop drastically.

STEPWISE BACKWARD ELIMINATION - MANUAL

Manually by eliminating variables with the highest p-value, one at a time.

msa_test_model <- lm(usa1$gdp_log ~ usa1$o_managementE +  usa1$o_salesE + 
                      usa1$i_manufacturingE + usa1$i_wholesaleE + 
                      usa1$i_retailE + usa1$i_transportE + 
                      usa1$i_professionalE + usa1$h_mortgageE +
                      usa1$h_median_rentE, data = usa1)

Here is the summary of the full OLS model for the USA MSA data.

summary(msa_test_model)
## 
## Call:
## lm(formula = usa1$gdp_log ~ usa1$o_managementE + usa1$o_salesE + 
##     usa1$i_manufacturingE + usa1$i_wholesaleE + usa1$i_retailE + 
##     usa1$i_transportE + usa1$i_professionalE + usa1$h_mortgageE + 
##     usa1$h_median_rentE, data = usa1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.05399 -0.46074  0.00569  0.43843  2.63563 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            7.4123181  0.9000048   8.236 3.11e-15 ***
## usa1$o_managementE     0.0637641  0.0105500   6.044 3.67e-09 ***
## usa1$o_salesE          0.1749031  0.0290251   6.026 4.07e-09 ***
## usa1$i_manufacturingE  0.0271802  0.0082976   3.276  0.00115 ** 
## usa1$i_wholesaleE      0.4328488  0.0661290   6.546 1.98e-10 ***
## usa1$i_retailE        -0.1969850  0.0367528  -5.360 1.47e-07 ***
## usa1$i_transportE      0.1954982  0.0314574   6.215 1.39e-09 ***
## usa1$i_professionalE   0.1644058  0.0231218   7.110 6.01e-12 ***
## usa1$h_mortgageE       0.0157055  0.0069630   2.256  0.02468 *  
## usa1$h_median_rentE    0.0006707  0.0002234   3.002  0.00286 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7041 on 370 degrees of freedom
## Multiple R-squared:  0.6754, Adjusted R-squared:  0.6675 
## F-statistic: 85.53 on 9 and 370 DF,  p-value: < 2.2e-16

In the test model, the adjusted R-squared is still relatively high at 0.6675. As the * symbols on the right side indicate, all the variables are statistically significant, with the p-values all falling under 0.05.

STEPWISE BACKWARD SELECTION - AUTOMATED

The same process was conducted again, but this time using code from different packages with no manual manipulation. As with the manual process, the variable with the highest p-value was deleted one at a time sequentially, with a model produced after each deletion. This was done in an automated manner through code until all variables left over were significant. We used the step() function to conduct the backward stepwise selection, which uses the AIC metric to gauge the overall model performance during the process. AIC, or the Akaike’s Information Criterion, allows for model performance comparison and is based on a formula that includes the number of model parameters and a log-likelihood function (Glen, 2023).

step_model <- step(msa_full_model, direction="backward", trace=FALSE) # the step() functions allows us to conduct the stepwise analysis.
# The option is "backward" so we could eliminated variables one at a time starting from a full model. 
summary(step_model)
## 
## Call:
## lm(formula = usa1$gdp_log ~ usa1$o_managementE + usa1$o_serviceE + 
##     usa1$o_salesE + usa1$o_resourcesE + usa1$o_productionE + 
##     usa1$i_manufacturingE + usa1$i_wholesaleE + usa1$i_retailE + 
##     usa1$i_transportE + usa1$i_professionalE + usa1$h_mortgageE + 
##     usa1$h_median_rentE, data = usa1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.16435 -0.48879  0.01584  0.43227  2.71564 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -8.029e+01  5.551e+01  -1.446   0.1489    
## usa1$o_managementE     9.367e-01  5.554e-01   1.687   0.0925 .  
## usa1$o_serviceE        8.992e-01  5.567e-01   1.615   0.1071    
## usa1$o_salesE          1.042e+00  5.554e-01   1.875   0.0615 .  
## usa1$o_resourcesE      8.788e-01  5.553e-01   1.583   0.1144    
## usa1$o_productionE     8.346e-01  5.557e-01   1.502   0.1339    
## usa1$i_manufacturingE  5.337e-02  1.954e-02   2.732   0.0066 ** 
## usa1$i_wholesaleE      4.701e-01  7.036e-02   6.682 8.79e-11 ***
## usa1$i_retailE        -1.892e-01  3.710e-02  -5.099 5.47e-07 ***
## usa1$i_transportE      2.247e-01  3.780e-02   5.944 6.46e-09 ***
## usa1$i_professionalE   1.609e-01  2.318e-02   6.945 1.73e-11 ***
## usa1$h_mortgageE       1.650e-02  6.986e-03   2.362   0.0187 *  
## usa1$h_median_rentE    6.073e-04  2.351e-04   2.583   0.0102 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7019 on 367 degrees of freedom
## Multiple R-squared:   0.68,  Adjusted R-squared:  0.6696 
## F-statistic:    65 on 12 and 367 DF,  p-value: < 2.2e-16

The automated backward stepwise elimination process resulted in a different equation from the manual process. Though the adjusted r-squared is higher for the automated model, at 0.6696, it was decided that the manual model would be kept. The automated process did not eliminate variables that were statistically not significant. The smaller number of variables in the manual model also keeps the model simpler while still having high performance value. The adjusted R-squared for the manual model was 0.6675, which is not much different from the automated model.

Final OLS Model

After reviewing the manual model, the coefficient value for h_median_rent was very low. Again, to simplify the model while keeping the performance metrics high, this variable was eliminated. Thus, the final model for the OLS regression was:

msa_final_model <- lm(usa1$gdp_log ~ usa1$o_managementE +  usa1$o_salesE + usa1$i_manufacturingE + usa1$i_wholesaleE + 
                      usa1$i_retailE + usa1$i_transportE + usa1$i_professionalE + usa1$h_mortgageE, 
                      data = usa1)

Here is the summary of the final model:

summary(msa_final_model)
## 
## Call:
## lm(formula = usa1$gdp_log ~ usa1$o_managementE + usa1$o_salesE + 
##     usa1$i_manufacturingE + usa1$i_wholesaleE + usa1$i_retailE + 
##     usa1$i_transportE + usa1$i_professionalE + usa1$h_mortgageE, 
##     data = usa1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.15034 -0.43014 -0.00781  0.44090  2.81387 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            8.092952   0.880345   9.193  < 2e-16 ***
## usa1$o_managementE     0.057481   0.010451   5.500 7.09e-08 ***
## usa1$o_salesE          0.159207   0.028857   5.517 6.48e-08 ***
## usa1$i_manufacturingE  0.019923   0.008023   2.483  0.01346 *  
## usa1$i_wholesaleE      0.438172   0.066815   6.558 1.83e-10 ***
## usa1$i_retailE        -0.203943   0.037074  -5.501 7.05e-08 ***
## usa1$i_transportE      0.186954   0.031665   5.904 8.01e-09 ***
## usa1$i_professionalE   0.202982   0.019430  10.447  < 2e-16 ***
## usa1$h_mortgageE       0.022127   0.006698   3.304  0.00105 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7117 on 371 degrees of freedom
## Multiple R-squared:  0.6675, Adjusted R-squared:  0.6603 
## F-statistic: 93.08 on 8 and 371 DF,  p-value: < 2.2e-16

The final model still had a relatively high adjusted R-squared value at 0.6603, with eight variables kept in the equation. The coefficients were also large enough to be easily interpretable and all variables were statistically significant.

VIF

A VIF was also conducted for this final model, with the results of the VIF analysis are shown below. VIFs values between 5 and 10 indicates issues with the predictability and performance of the model (VIF, n.d.). In this case, however, the VIF values for all variables were under 5.

VIF(msa_final_model)
##    usa1$o_managementE         usa1$o_salesE usa1$i_manufacturingE 
##              2.868332              1.944664              1.436401 
##     usa1$i_wholesaleE        usa1$i_retailE     usa1$i_transportE 
##              1.269905              1.984453              1.431735 
##  usa1$i_professionalE      usa1$h_mortgageE 
##              2.272671              1.663657
Figure 9. The OLS equation for the final model
Figure 9. The OLS equation for the final model

Figure 9. The OLS equation for the final model. The model was derived using exploratory variable analysis and backward stepwise elimination in R. The variable y represents GDP and the different x variable represent the independent socioeconomic variables.

OLS Results Discussion

For the analysis of the OLS regression, generally a one-unit change in an independent variable equates to a one-unit change in the dependent variable multiplied by the coefficient. In this case, however, the dependent variable was log-transformed. Therefore, the results must be discussed with this consideration.

To interpret the results, it is important to note that by log-transforming the dependent variable, our independent variables have been exponentiated. To interpret the results, we need to exponentiate the independent variables. For example:

The coefficient of o_managementE is 0.057.

exp(0.057)
## [1] 1.058656

This means that for every one-unit increase in o_management, y increases by about 6%. We can then calculate this value for all variables:

exp(coef(msa_final_model)['usa1$o_managementE'])
## usa1$o_managementE 
##           1.059165
exp(coef(msa_final_model)['usa1$o_salesE'])
## usa1$o_salesE 
##      1.172581
exp(coef(msa_final_model)['usa1$i_manufacturingE'])
## usa1$i_manufacturingE 
##              1.020122
exp(coef(msa_final_model)['usa1$i_wholesaleE'])
## usa1$i_wholesaleE 
##          1.549871
exp(coef(msa_final_model)['usa1$i_retailE'])
## usa1$i_retailE 
##      0.8155088
exp(coef(msa_final_model)['usa1$i_transportE'])
## usa1$i_transportE 
##          1.205572
exp(coef(msa_final_model)['usa1$i_professionalE'])
## usa1$i_professionalE 
##             1.225051
exp(coef(msa_final_model)['usa1$h_mortgageE'])
## usa1$h_mortgageE 
##         1.022374

Figure 10 shows the effects of the coefficients of the independent variables after the calculations. These represent percent changes in GDP per one-unit change for each independent variable. The variable with the larget positive effect is i_wholesale, which increases GDP by 55%, followed by i_professional at 23%, i_transport at 21%, and o_sales at 17%. These variables seem to have a strong positive relationship with GDP. The variables o_management at 6%, i_manufacturing at 2%, and h_mortgage at 2%, also have positive relationships with GDP, albeit a smaller one in the single digit. The only variable to have a negative relationship is i_retail, which decreases GDP by 18% per one-unit change.

These interpretation of the coefficient results show the effects and relationships of the final variables with GDP. This OLS regression model is a generalized model that is applicable to the MSAs in the entire continguous US. There are, however, bound to be regional differences within the US. Thus, the next step of the modeling phase involved Geographically Weighted Regression to see how the relationships of these variables vary spatially.

Figure 10. Interpretation of the OLS regression coefficients
Figure 10. Interpretation of the OLS regression coefficients

Figure 10. Interpretation of the OLS regression coefficients. This figure shows the relationship between GDP value and the coefficients of the variables in the final regression model. The percent change values were calculated using the exponents of the variables and the numeric change was calculated by subtracting one from the coefficient and multiplying it by 100. These calculations were done in R.

Geographically Weighted Regression

For GWR, the data was prepared so that it could be inputted into the GWR functions.

First, the “usa” dataset was converted to a ‘sf’ class.

usa1_sf <- st_as_sf(usa1) # The function st_as_sf() converts to an 'sf' class.

Then, it was converted again into an ‘sp’ class. Now, the data is ready for use.

usa1.sp <- as(usa1_sf, "Spatial") # The function as() was used to convert to "Spatial" class

GWR Bandwidth

In this step, the GWR bandwidth, which is the distance or neighborhood metric, was calculated with the “usa” dataset using gwr.sel(). Bandwidth is one of the most important parameters in GWR. Changing the bandwidth can change the coefficients of the local regression models because as the bandwidth value increas, the weight decreases, decreasing the variation of the coefficients. (BW) Fortunately, the gwr.sel() function offers one way to identify an optimal bandwidth value. With an OLS equation as an input, gwr.sel() explores different bandwidth values and scores the best one by minimizing the mean square prediction error. (gwr.sel)

bW1 <- gwr.sel(usa1$gdp_log ~ usa1$o_managementE +  usa1$o_salesE + usa1$i_manufacturingE + usa1$i_wholesaleE + 
                      usa1$i_retailE + usa1$i_transportE + usa1$i_professionalE + usa1$h_mortgageE, 
               data=usa1.sp) 
## Please note that rgdal will be retired during October 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## See https://r-spatial.org/r/2023/05/15/evolution4.html and https://github.com/r-spatial/evolution
## rgdal: version: 1.6-7, (SVN revision 1203)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.6.2, released 2023/01/02
## Path to GDAL shared files: C:/Users/lemon/AppData/Local/R/win-library/4.3/rgdal/gdal
##  GDAL does not use iconv for recoding strings.
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 9.2.0, March 1st, 2023, [PJ_VERSION: 920]
## Path to PROJ shared files: C:/Users/lemon/AppData/Local/R/win-library/4.3/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.6-1
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
## Bandwidth: 2038.214 CV score: 194.4745 
## Bandwidth: 3294.607 CV score: 196.7146 
## Bandwidth: 1261.721 CV score: 191.3163 
## Bandwidth: 781.8212 CV score: 183.7579 
## Bandwidth: 485.227 CV score: 178.4308 
## Bandwidth: 301.9217 CV score: 202.1355 
## Bandwidth: 598.5159 CV score: 178.7628 
## Bandwidth: 523.9975 CV score: 177.8591 
## Bandwidth: 535.693 CV score: 177.8638 
## Bandwidth: 529.1772 CV score: 177.8527 
## Bandwidth: 529.0468 CV score: 177.8527 
## Bandwidth: 528.9799 CV score: 177.8527 
## Bandwidth: 528.9772 CV score: 177.8527 
## Bandwidth: 528.9772 CV score: 177.8527 
## Bandwidth: 528.9773 CV score: 177.8527 
## Bandwidth: 528.9772 CV score: 177.8527
# We input our final model formula from the OLS regression modeling step into the gwr.sel() function. 

Here is the resulting bandwidth:

bW1
## [1] 528.9772

The gwr.sel() function determined that a bandwidth value of 528.9772 provided the minimal root mean square prediction error out of the different bandwidth values that it tested. We used this bandwidth value for our GWR analysis.

In this next step, we created the GWR model using the gwr() function.

gwr.fit1 <-gwr(usa1$gdp_log ~ usa1$o_managementE +  usa1$o_salesE + usa1$i_manufacturingE + usa1$i_wholesaleE + 
                usa1$i_retailE + usa1$i_transportE + usa1$i_professionalE + usa1$h_mortgageE, 
                data=usa1.sp, bandwidth=bW1, se.fit=T, hatmatrix=T)
# The gwr() function takes an OLS equation as an input, along with the specified dataset. We also specified our bandwidth that we obtained in the last step. se.fit = T allows for the function to return local coefficient standard errors while hatmatrix=T return a matrix of the y-hat values, or the predicted dependent variable. 

Here is the summary of the GWR modeling.

gwr.fit1
## Call:
## gwr(formula = usa1$gdp_log ~ usa1$o_managementE + usa1$o_salesE + 
##     usa1$i_manufacturingE + usa1$i_wholesaleE + usa1$i_retailE + 
##     usa1$i_transportE + usa1$i_professionalE + usa1$h_mortgageE, 
##     data = usa1.sp, bandwidth = bW1, hatmatrix = T, se.fit = T)
## Kernel function: gwr.Gauss 
## Fixed bandwidth: 528.9772 
## Summary of GWR coefficient estimates at data points:
##                             Min.    1st Qu.     Median    3rd Qu.       Max.
## X.Intercept.           0.7450466  4.2332756  5.8926084  7.5953772 17.1404527
## usa1.o_managementE    -0.0276163  0.0538689  0.0786261  0.0971403  0.1185411
## usa1.o_salesE         -0.0375649  0.1103914  0.1809676  0.2329294  0.3149851
## usa1.i_manufacturingE -0.0597396  0.0138856  0.0353128  0.0430031  0.0642532
## usa1.i_wholesaleE      0.0931624  0.4090415  0.4288715  0.4655341  0.8368153
## usa1.i_retailE        -0.4076938 -0.2190375 -0.1574815 -0.1135772  0.0025098
## usa1.i_transportE     -0.0200086  0.1398921  0.1655582  0.2049887  0.4241009
## usa1.i_professionalE   0.1031053  0.1783273  0.2043704  0.2428115  0.3606911
## usa1.h_mortgageE      -0.0010410  0.0135575  0.0175507  0.0244307  0.1333892
##                        Global
## X.Intercept.           8.0930
## usa1.o_managementE     0.0575
## usa1.o_salesE          0.1592
## usa1.i_manufacturingE  0.0199
## usa1.i_wholesaleE      0.4382
## usa1.i_retailE        -0.2039
## usa1.i_transportE      0.1870
## usa1.i_professionalE   0.2030
## usa1.h_mortgageE       0.0221
## Number of data points: 380 
## Effective number of parameters (residual: 2traceS - traceS'S): 71.50608 
## Effective degrees of freedom (residual: 2traceS - traceS'S): 308.4939 
## Sigma (residual: 2traceS - traceS'S): 0.6379012 
## Effective number of parameters (model: traceS): 53.14229 
## Effective degrees of freedom (model: traceS): 326.8577 
## Sigma (model: traceS): 0.6197226 
## Sigma (ML): 0.5747579 
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 784.1655 
## AIC (GWR p. 96, eq. 4.22): 710.6427 
## Residual sum of squares: 125.5317 
## Quasi-global R2: 0.7778485

The gwr() function, as an output, provides a ‘gwr’ object class that shows a summary similar to that of the summary() function. The Call section shows the equation, while the kernel function shows the distribution used, which is a Gaussian or normal distribution. The fixed bandwidth is the one we selected from the gwr.sel(). The gwr output also shows the data ranges of the different variable coefficients in the model. Since GWR generates a local OLS regression model for each feature, showing all of the equations would take up a lot of room. The minimum, maximum, and quartile information provides an overview of the coefficients. In the last column, however, the global coefficients equate to the global model that we created earlier in the OLS regression. The lines after the coefficients show the different performance metrics of the model. Most importantly, the quasi-global R2 provides the overall variation in GDP explained by the socioeconomic variables. Overall, it seems that the GWR local coefficients are fairly similar to the ones in the global model. However, there are definite differences in some local models when viewing individual minimum and maximum values. The maximum value for i_wholesale, for example, is 0.836 in the GWR model, while it is 0.438 in the global model, indicating that there are spatial variations that exist for this variable. Additionally, the GWR model has a higher quasi-global R2 value of 0.7778, as compared to the adjusted r-squared value of 0.6603 for the global model.

GWR Visualization

To visualize the results of the GWR analysis, we first converted the summary results, as represented by the SDF or the SpatialPointsDataFrame, into an ‘sp’ object and an ‘sf’ object.

gwr1.sp <- gwr.fit1$SDF # We first converted the SDF into an 'sp' object.
gwr1.sf <-st_as_sf(gwr1.sp) # Then, we converted it to an 'sf' object.

Now, we can visualize the data using ggplot2.

Local R2

ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$localR2)) + # We used ggplot to plot 'localR2'
  scale_fill_gradient(low="lightblue", high="blue")+ # Set the low values to lightblue and the high values to blue. 
  labs(fill="Local R2") + # We renamed the legend 
  coord_sf() + # added coordinate info
  theme_map() + # added theme
  ggtitle(paste("MSA GWR Model - Local R2")) # and included a title.

Figure 11. Local R2 values from the GWR modeling. The plot was created using ggplot2 and utilizing the R2 values for the local regression models in the GWR model as the color fill. The darker blue hue represents higher local R2 values while the lighter blue hue represents lower local R2 values.

From figure 11, the spatial variation in the local R2 for the local regression models can be seen. The darker the blue hue, the higher the local R2, and the lighter the blue hue, the lower the local R2. It seems that local R2 values are higher in the western MSAs and they get lower as one moves eastwards. Overall, however, the local R2 values are fairly high, ranging between 0.70 and 0.85. From this figure, we can say that the variables selected in the study are explaining a higher percentage of the variation in GDP in the MSAs in the western part of the US. This is followed by the central, midwestern, and southeastern MSAs, and then the northeastern MSAs.

Residuals

The residuals in the model show errors from the estimates of the model, gained from the difference between the observed value and the predicted values of GDP.

To map the residuals, we used the numeric object ‘gwr.e’

ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$gwr.e)) + # We plot the residuals, 'gwr.e'
  scale_fill_gradient(low="yellow", high="red")+ # We made the fill from yellow to red. 
  labs(fill="Residuals") + # Changed the legend title
  coord_sf() + # Coordinates
  theme_map() + # Theme
  ggtitle(paste("MSA GWR Model - Residuals")) # Title 

Figure 12. Residuals from the GWR modeling. The plot was created using ggplot2 and utilizing the residual values in the GWR model as the color fill. The red hue represents higher residual values while the yellow hue represents lower local residual values.

The residual values seem to be evenly distributed spatially according to figure 12. Most of the MSAs in the US seem to have similar residual values. The most obvious high residuals appear arround the New York MSA in the northeast, showing up as the brightest red hue. This indicates that for the New York MSA, the model may be underpredicting the effects of the variables on GDP. This is because the higher the residual value, the greater the difference between the observed value (a plot point higher than the regression line) and the predicted value (the linear regression line that is under the plot point).

Variable Coefficients

We then plotted the coefficients of the socioeconomic variables to observe how they varied spatially.

‘o_management’

ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.o_managementE)) + # We plotted the variable 'o_management'
  scale_fill_gradient(low="lightgreen", high="darkgreen")+ # Fill of lightgreen to darkgreen
  labs(fill="Coefficient") +  # Rename the legend
  coord_sf() + # Coordinates
  theme_map() + # Theme
  ggtitle(paste("MSA GWR Model coefficient: o_management")) # Title 

Figure 13. Coefficient values of ‘o_management’ from the GWR modeling. The plot was created using ggplot2 and utilized the coefficient values in the GWR model for ‘o_management’ as the color fill. The darker green hue represents higher coefficient values while the lighter green hue represents lower coefficient values.

‘o_salesE’

ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.o_salesE)) + # o_sales
  scale_fill_gradient(low="lightgreen", high="darkgreen")+ # fill of lightgreen to darkgreen.
  labs(fill="Coefficient") + # rename legend
  coord_sf() + # coordinates
  theme_map() + # theme
  ggtitle(paste("MSA GWR Model coefficient: o_sales")) # title 

Figure 14. Coefficient values of ‘o_sales’ from the GWR modeling. The plot was created using ggplot2 and utilized the coefficient values in the GWR model for ‘o_sales’ as the color fill. The darker green hue represents higher coefficient values while the lighter green hue represents lower coefficient values.

For the variable ’o_sales,” the higher coefficients are much more concetrated in the eastern coastal MSAs, as well as the MSAs in the southwestern US. The coefficients are much lower in the central and northwestern US, showing that these areas explain the variation in GDP at a lower rate than the former MSAs.

‘i_manufacturing’

ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.i_manufacturingE)) + # 'i_manufacturing'
  scale_fill_gradient(low="#99CCFF", high="#003366") + # fill of light to dark blue
  labs(fill="Coefficient") + # rename legend
  coord_sf() + # coordinates
  theme_map() + # theme
  ggtitle(paste("MSA GWR Model coefficient: i_manufacturing")) # title

Figure 15. Coefficient values of ‘i_manufacturing’ from the GWR modeling. The plot was created using ggplot2 and utilized the coefficient values in the GWR model for ‘i_manufacturing’ as the color fill. The darker blue hue represents higher coefficient values while the lighter blue hue represents lower coefficient values.

For the variable ‘i_manufacturing,’ it seems that this variable explains the variation in GDP at a much more spatially uniform pattern compared to the other variables. Most of the MSAs fall in the higher range of the coefficient value, with the exception of the MSAs in the southwestern US, mostly in California.

‘i_wholesale’

ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.i_wholesaleE)) + # 'i_wholesale'
  scale_fill_gradient(low="#99CCFF", high="#003366") + # fill of light to dark blue
  labs(fill="Coefficient") + # rename legend
  coord_sf() + # coordinates
  theme_map() + # theme
  ggtitle(paste("MSA GWR Model coefficient: i_wholesale")) # title

Figure 16. Coefficient values of ‘i_manufacturing’ from the GWR modeling. The plot was created using ggplot2 and utilized the coefficient values in the GWR model for ‘i_manufacturing’ as the color fill. The darker blue hue represents higher coefficient values while the lighter blue hue represents lower coefficient values.

The variable ‘i_wholesale’ is also another variable that shows an interesting pattern in the spatial distribution of the coefficient values. The higher coefficient value of ‘i_wholesale’ mostly reside in the south-central part of the US, mainly in the state of Texas. The coefficient is generally higher in the eastern US, while the lowest values can be found in some of the west-central MSAs.

‘i_retail’

ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.i_retailE)) + # 'i_retail'
  scale_fill_gradient(low="#003366", high="#99CCFF" ) + # fill of dark to light green
  labs(fill="Coefficient") + # rename legend
  coord_sf() + # coordinates
  theme_map() + # theme
  ggtitle(paste("MSA GWR Model coefficient: i_retail")) # title

Figure 17. Coefficient values of ‘i_retail’ from the GWR modeling. The plot was created using ggplot2 and utilized the coefficient values in the GWR model for ‘i_retail’ as the color fill. The darker blue hue represents larger negative coefficient values while the lighter blue hue represents lower negative coefficient values.

‘i_transport’

ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.i_transportE)) + # 'i_transport'
  scale_fill_gradient(low="#99CCFF", high="#003366") + # fill of light to dark blue
  labs(fill="Coefficient") + # rename legend
  coord_sf() + # coordinates
  theme_map() + # theme
  ggtitle(paste("MSA GWR Model coefficient: i_transport")) # title

Figure 18. Coefficient values of ‘i_transport’ from the GWR modeling. The plot was created using ggplot2 and utilized the coefficient values in the GWR model for ‘i_retail’ as the color fill. The darker blue hue represents higher coefficient values while the lighter blue hue represents lower coefficient values.

‘i_professional’

ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.i_professionalE)) + # 'i_professional'
  scale_fill_gradient(low="#99CCFF", high="#003366") + # fill of light to dark blue
  labs(fill="Coefficient") + # rename legend
  coord_sf() + # coordinates
  theme_map() + # theme
  ggtitle(paste("MSA GWR Model coefficient: i_professional")) # title

Figure 19. Coefficient values of ‘i_professional’ from the GWR modeling. The plot was created using ggplot2 and utilized the coefficient values in the GWR model for ‘i_professional’ as the color fill. The darker blue hue represents higher coefficient values while the lighter blue hue represents lower coefficient values.

‘h_mortgage’

ggplot() + geom_sf(data = gwr1.sf, aes(fill = gwr1.sf$usa1.h_mortgageE)) + # 'h_mortgage'
  scale_fill_gradient(low="violet", high="#330066") + # fill of violet to dark purple
  labs(fill="Coefficient") + # rename legend
  coord_sf() + # coordinates
  theme_map() + # theme
  ggtitle(paste("MSA GWR Model coefficient: h_mortgage")) # title

Figure 20. Coefficient values of ‘h_mortgage’ from the GWR modeling. The plot was created using ggplot2 and utilized the coefficient values in the GWR model for ‘h_mortgage’ as the color fill. The darker purple hue represents higher coefficient values while the lighter purple/pink hue represents lower coefficient values.

The variable ‘h_mortgage’ has overall much lower coefficient values compared to the other variables. The spatial variation can be seen in figure 20, however, as much of the MSAs in eastern half of the US have similar values while the western half, particularly the northwest, have higher coefficient values. This variable explains the variation in GDP at a higher rate in the northwestern, then the southwestern MSAs in the US.

CONCLUSION

REFLECTION

The initial draft of the project included an analysis of spatial autocorrelation. However, this was removed in order to limit the scope of the study further, as well as to be able to focus on the regression models. This may be re-visited and added to a future version of the project.

At the initial stages of the project, the raw dataset was used to explore OLS regression models and GWR models. The raw dataset, however, resulted in unusually high adjusted R2 values, as well as extremely high VIF values. After conducting a review of the descriptive statistics and the exploratory phase of the analysis, the skewness of certain variables, including the dependent variable, became much clearer. We believe these patterns were influencing the poor performance of the test models at the beginning of the study. For future studies, conducting a review of the descriptive statistics and normality tests may save time and effort and lead to better modeling results.

REFERENCES

Bureau of Economic Analysis (2023, June 2). Gross Domestic Product. bea.gov.

Glaeser, E. L., & Gottlieb, J. D. (2009). The wealth of cities: Agglomeration economies and

  spatial equilibrium in the United States. Journal of economic literature, 47(4), 983-1028.

Glen, S. (2023). Akaike’s Information Criterion: Definition, Formulas. Statisticshowto.com.

gwr.sel: Crossvalidation of bandwidth for geographically weighted regression (n. d.).

  rdocumentation.org. Retrieved July 19, 2023, from

Koc, T. (2022). Bandwidth Selection in Geographically Weighted Regression Models via

  Information Complexity Criteria. Journal of Mathematics, 2022.

Kolmogorov-Smirnov Tests (n. d.). stat.eethz.ch. Retrieved July 19, 2023, from

McMillen, D. P. (2005). Urban Economics. In K. Kempf-Leonard (Ed.), Encyclopedia of Social

  Measurement (pp. 915-920). Elsevier. https://doi.org/10.1016/B0-12-369398-5/00555-7.

Mendez C. (2020). Geographically weighted regression models: A tutorial using the spgwr

  package in R. R Studio/RPubs. https://rpubs.com/quarcs-lab/tutorial-gwr1

NIH (n. d.). Learn More about Normal Distribution. Dietassessmentprimer.cancer.gov.

  #:~:text=A%20variable%20that%20is%20normally,to%20describe%

  20departures%20from%20normality.

Smith, G. (2018). Step away from stepwise. Journal of Big Data, 5(32).

Thieme, C. (2021). Understanding Linear Regression Output in R. towardsdatascience.com.

US Census Bureau (2023, June 6). Data. Census.gov. https://www.census.gov/data.html

VIF: Variance Inflation Factor (n. d.). rdocumentation.org. Retrieved July 19, 2023, from

Walker K, Herman M (2023). tidycensus: Load US Census Boundary and Attribute Data as

  ‘tidyverse’ and ‘sf’-Ready Data Frames.

  R package version 1.4.3, https://walker-data.com/tidycensus/.

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

  ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A,

  Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J,

  Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K,

  Yutani H (2019). “Welcome to the tidyverse.”

  Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686.

Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A Grammar of Data

William Revelle (2023). psych: Procedures for Psychological, Psychometric, and Personality

  Research. Northwestern University, Evanston, Illinois.

  R package version 2.3.6, https://CRAN.R-project.org/package=psych.