Introduction

Population density is the most frequently used demographic parameter for any initial statistical analysis. There are numerous known factors such as the topography of the area, proximity to the river or water, availability of resources, which affects the population density in an area. There are very few works demonstrating effects unconventional variables on the distribution of population density.

Shelby County is the second-largest county in terms of population density, with a population of 936,365 (United States Census Bureau, 2018). The total area of the county is 785 square miles, with a population density of 1,192.8 per square mile.

This study investigated some unconventional socio-economic variables to explore and explain the spatial relationship of population density in census tracts of Shelby County.

Objectives

Data and Variables

The data for this study was collected from the US Census Census Bureau website. The following datasets of 2018 have been used –

The population density was used as the dependent variable, and the predictor variables used are median income, the population density of the White, African American, and Asian people from the year 2018.

Main variables used for this analysis are given below.

Cleaning the Global Environment and the Console

rm(list = ls())      # Clears the global environment for a fresh start
cat('\f')            # Cleans the console

Loading Libraries

This code allows to load multiple library at once.

my_library <- c("sf", "sp", "tibble", "rgdal", "rgeos", "tmap",
                "tmaptools", "spgwr", "grid", "gridExtra", "spdep", "sjPlot", "psych")
lapply(my_library, library, character.only = TRUE)
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
## rgdal: version: 1.5-16, (SVN revision 1050)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.0.4, released 2020/01/28
## Path to GDAL shared files: C:/Users/hasan/OneDrive/Documents/R/win-library/4.0/rgdal/gdal
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 6.3.1, February 10th, 2020, [PJ_VERSION: 631]
## Path to PROJ shared files: C:/Users/hasan/OneDrive/Documents/R/win-library/4.0/rgdal/proj
## Linking to sp version:1.4-2
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading rgdal.
## rgeos version: 0.5-3, (SVN revision 634)
##  GEOS runtime version: 3.8.0-CAPI-1.13.1 
##  Linking to sp version: 1.4-2 
##  Polygon checking: TRUE
## Warning: package 'spgwr' was built under R version 4.0.3
## Loading required package: spData
## Warning: package 'spData' was built under R version 4.0.3
## NOTE: This package does not constitute approval of GWR
## as a method of spatial analysis; see example(gwr)
## Warning: package 'spdep' was built under R version 4.0.3
## Warning: package 'sjPlot' was built under R version 4.0.3
## Warning: package 'psych' was built under R version 4.0.3

Loading the Data

Shelby <- st_read("Data/Shelby.shp")
## Reading layer `Shelby' from data source `C:\Rizwan\Education\UofM PhD Work\Seminar in Earth Sciences\Term_Project\Data\Shelby.shp' using driver `ESRI Shapefile'
## Simple feature collection with 221 features and 9 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -90.3103 ymin: 34.99419 xmax: -89.63278 ymax: 35.40948
## geographic CRS: NAD83
Data <- as_tibble(read.csv("Data/Data.csv"))

Examining the Data

glimpse(Shelby)
## Rows: 221
## Columns: 10
## $ STATEFP  <chr> "47", "47", "47", "47", "47", "47", "47", "47", "47", "47"...
## $ COUNTYFP <chr> "157", "157", "157", "157", "157", "157", "157", "157", "1...
## $ TRACTCE  <chr> "021312", "021530", "021726", "022024", "000200", "000800"...
## $ AFFGEOID <chr> "1400000US47157021312", "1400000US47157021530", "1400000US...
## $ GEOID    <chr> "47157021312", "47157021530", "47157021726", "47157022024"...
## $ NAME     <chr> "213.12", "215.30", "217.26", "220.24", "2", "8", "27", "3...
## $ LSAD     <chr> "CT", "CT", "CT", "CT", "CT", "CT", "CT", "CT", "CT", "CT"...
## $ ALAND    <dbl> 1362074, 10068590, 1433803, 2581685, 1056567, 2037629, 135...
## $ AWATER   <dbl> 0, 0, 2132, 0, 45527, 154343, 0, 0, 0, 0, 19621, 0, 44688,...
## $ geometry <POLYGON [°]> POLYGON ((-89.84226 35.1076..., POLYGON ((-89.7324...
glimpse(Data)
## Rows: 221
## Columns: 17
## $ FID        <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ STATEFP    <int> 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, ...
## $ COUNTYFP   <int> 157, 157, 157, 157, 157, 157, 157, 157, 157, 157, 157, 1...
## $ GEO_ID     <chr> "1400000US47157021312", "1400000US47157021530", "1400000...
## $ OBJECTID   <int> 1161, 1176, 1186, 1201, 1001, 1006, 1021, 1029, 1044, 10...
## $ ALAND      <int> 1362074, 10068590, 1433803, 2581685, 1056567, 2037629, 1...
## $ AWATER     <int> 0, 0, 2132, 0, 45527, 154343, 0, 0, 0, 0, 19621, 0, 4468...
## $ Area_sqmil <dbl> 0.5282983, 3.8862272, 0.5538200, 0.9941988, 0.4208300, 0...
## $ Tot_Pop    <int> 2108, 5396, 3486, 3429, 790, 2286, 2030, 3138, 2253, 140...
## $ Tot_Wht    <int> 1902, 4530, 1012, 167, 0, 42, 882, 2417, 18, 21, 306, 60...
## $ Tot_AA     <int> 99, 308, 2423, 3250, 762, 2244, 857, 595, 2229, 1379, 33...
## $ Tot_Asian  <int> 97, 479, 0, 0, 0, 0, 110, 80, 0, 1, 0, 0, 0, 91, 438, 26...
## $ Pop_Den    <dbl> 3990.1699, 1388.4932, 6294.4641, 3449.0085, 1877.2425, 2...
## $ Med_Inc    <int> 54769, 129375, 26731, 49667, 14345, 14299, 45729, 56111,...
## $ PopDen_Wht <dbl> 3600.238692, 1165.654943, 1827.308561, 167.974461, 0.000...
## $ PopDen_AA  <dbl> 187.39413, 79.25424, 4375.06783, 3268.96407, 1810.70734,...
## $ PopDen_Asi <dbl> 183.608387, 123.255788, 0.000000, 0.000000, 0.000000, 0....

Merging the Shape File and the Dataset

Shelby_merged <- merge(Shelby,Data, by.x = "AFFGEOID", by.y = "GEO_ID")

Run Linear Regression Model and Model Diagnostics

The model table shows higher adjusted \(R^2\) value. The model could be biased if there autocorrelations between the independent variables. The table shows that the median income is not statistically significant to describe the dependent variable.

model <- lm(Pop_Den ~ Med_Inc + PopDen_Wht + PopDen_AA + PopDen_Asi, data = Data)
summary(model)
## 
## Call:
## lm(formula = Pop_Den ~ Med_Inc + PopDen_Wht + PopDen_AA + PopDen_Asi, 
##     data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -294.61 -118.28  -38.41   32.33 1687.95 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.495e+01  4.765e+01  -0.314    0.754    
## Med_Inc     -3.972e-04  6.422e-04  -0.619    0.537    
## PopDen_Wht   1.074e+00  1.635e-02  65.640  < 2e-16 ***
## PopDen_AA    1.064e+00  1.173e-02  90.664  < 2e-16 ***
## PopDen_Asi   9.940e-01  2.017e-01   4.927 1.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 234.5 on 216 degrees of freedom
## Multiple R-squared:  0.983,  Adjusted R-squared:  0.9827 
## F-statistic:  3129 on 4 and 216 DF,  p-value: < 2.2e-16
tab_model(model, dv.labels = "Shelby County Population Density 2018",
    CSS = list(
    css.depvarhead = 'color: red;',
    css.centeralign = 'text-align: left;', 
    css.firsttablecol = 'font-weight: bold;', 
    css.summary = 'color: blue;')
)
  Shelby County Population Density 2018
Predictors Estimates CI p
(Intercept) -14.95 -108.86 – 78.96 0.754
Med_Inc -0.00 -0.00 – 0.00 0.537
PopDen_Wht 1.07 1.04 – 1.11 <0.001
PopDen_AA 1.06 1.04 – 1.09 <0.001
PopDen_Asi 0.99 0.60 – 1.39 <0.001
Observations 221
R2 / R2 adjusted 0.983 / 0.983
par(mfrow=c(2,2))
plot(model)

Scatter Matrix of Correlation

THe correlation matrix shows that the higher correlation is present between population density and African American population density. This correleation can be biased as the autocorrelation between the independent variables not been calculated.

pairs.panels(Data[, 13:17],
             method = "pearson",# correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

Calculating Residuals and Normalization

The residuals have been normalized by scale function. It is necessary to scale the data to standard deviations for better representation. Normalizing the residuals identifies the outliers in the spatial data.

resids <- residuals(model)
map.resids1 <- cbind(Shelby_merged, resids)
sd_breaks <- scale(map.resids1$resids)
summary(sd_breaks)
##        V1         
##  Min.   :-1.2677  
##  1st Qu.:-0.5089  
##  Median :-0.1653  
##  Mean   : 0.0000  
##  3rd Qu.: 0.1391  
##  Max.   : 7.2633
map.resids2 <- cbind(map.resids1, sd_breaks)
glimpse(map.resids2)
## Rows: 221
## Columns: 28
## $ AFFGEOID   <chr> "1400000US47157000100", "1400000US47157000200", "1400000...
## $ STATEFP.x  <chr> "47", "47", "47", "47", "47", "47", "47", "47", "47", "4...
## $ COUNTYFP.x <chr> "157", "157", "157", "157", "157", "157", "157", "157", ...
## $ TRACTCE    <chr> "000100", "000200", "000300", "000400", "000600", "00070...
## $ GEOID      <chr> "47157000100", "47157000200", "47157000300", "4715700040...
## $ NAME       <chr> "1", "2", "3", "4", "6", "7", "8", "9", "11", "12", "13"...
## $ LSAD       <chr> "CT", "CT", "CT", "CT", "CT", "CT", "CT", "CT", "CT", "C...
## $ ALAND.x    <dbl> 4186009, 1056567, 2872435, 2276725, 3883873, 2497535, 20...
## $ AWATER.x   <dbl> 2549731, 45527, 8845, 91258, 1208912, 0, 154343, 0, 0, 0...
## $ FID        <int> 220, 4, 23, 24, 194, 25, 5, 146, 147, 16, 60, 81, 53, 14...
## $ STATEFP.y  <int> 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, ...
## $ COUNTYFP.y <int> 157, 157, 157, 157, 157, 157, 157, 157, 157, 157, 157, 1...
## $ OBJECTID   <int> 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 10...
## $ ALAND.y    <int> 4186009, 1056567, 2872435, 2276725, 3883873, 2497535, 20...
## $ AWATER.y   <int> 2549731, 45527, 8845, 91258, 1208912, 0, 154343, 0, 0, 0...
## $ Area_sqmil <dbl> 2.5916126, 0.4208300, 1.1006220, 0.9194098, 1.9638176, 0...
## $ Tot_Pop    <int> 5806, 790, 806, 1494, 1969, 4308, 2286, 2625, 2973, 4540...
## $ Tot_Wht    <int> 3748, 0, 3, 63, 0, 338, 42, 35, 1394, 1975, 476, 69, 225...
## $ Tot_AA     <int> 1114, 762, 782, 1416, 1937, 3970, 2244, 2527, 1323, 946,...
## $ Tot_Asian  <int> 822, 0, 0, 0, 0, 0, 0, 0, 11, 28, 208, 12, 37, 276, 64, ...
## $ Pop_Den    <dbl> 2240.3039, 1877.2425, 732.3132, 1624.9554, 1002.6389, 44...
## $ Med_Inc    <int> 76627, 14345, 20428, 25789, 16379, 19849, 14299, 18790, ...
## $ PopDen_Wht <dbl> 1446.203779, 0.000000, 2.725732, 68.522216, 0.000000, 34...
## $ PopDen_AA  <dbl> 429.8482, 1810.7073, 710.5073, 1540.1184, 986.3441, 4109...
## $ PopDen_Asi <dbl> 317.177029, 0.000000, 0.000000, 0.000000, 0.000000, 0.00...
## $ resids     <dbl> -219.907220, -3.334919, -294.614837, -173.240475, -27.82...
## $ sd_breaks  <dbl> -0.94626480, -0.01435022, -1.26773304, -0.74545694, -0.1...
## $ geometry   <POLYGON [°]> POLYGON ((-90.06468 35.1702..., POLYGON ((-90.04...

Visualization of the Residuals

The following maps shows presence of outliers in the spatial data.

my_breaks <- c(-8,-3,-2,-1,1,2,3,8)
tm_shape(map.resids2)+ tm_borders()+
  tm_fill("sd_breaks", midpoint = NA, title = "Residuals", style = "fixed", breaks = my_breaks, palette = "-RdBu")+
  tm_layout(main.title = "Residuals", main.title.size = 0.7 ,
            legend.position = c("left", "top"), legend.title.size = 0.8)

Run Geographic Weighted Regression (GWR) Model

map_sp <- as(map.resids2, "Spatial") #Converting the sf object to sp object

Calculation of kernel bandwidth

GWRbandwidth <- gwr.sel(Pop_Den ~ Med_Inc + PopDen_Wht + PopDen_AA + PopDen_Asi, data = map_sp, adapt = T)
## Adaptive q: 0.381966 CV score: 11447704 
## Adaptive q: 0.618034 CV score: 11892736 
## Adaptive q: 0.236068 CV score: 10927543 
## Adaptive q: 0.145898 CV score: 10478910 
## Adaptive q: 0.09016994 CV score: 9921170 
## Adaptive q: 0.05572809 CV score: 9349887 
## Adaptive q: 0.03444185 CV score: 8568541 
## Adaptive q: 0.02128624 CV score: 8098634 
## Adaptive q: 0.01315562 CV score: 8821234 
## Adaptive q: 0.02481281 CV score: 8195559 
## Adaptive q: 0.01818062 CV score: 8176755 
## Adaptive q: 0.02132693 CV score: 8097201 
## Adaptive q: 0.02265842 CV score: 8062991 
## Adaptive q: 0.02384356 CV score: 8127619 
## Adaptive q: 0.02214983 CV score: 8072107 
## Adaptive q: 0.0231111 CV score: 8085220 
## Adaptive q: 0.02253265 CV score: 8063280 
## Adaptive q: 0.02261773 CV score: 8061576 
## Adaptive q: 0.02257704 CV score: 8062379 
## Adaptive q: 0.02261773 CV score: 8061576

Run the GWR Model

gwr.model = gwr(Pop_Den ~ Med_Inc + PopDen_Wht + PopDen_AA + PopDen_Asi,
                data = map_sp,
                adapt=GWRbandwidth,
                hatmatrix=TRUE,
                se.fit=TRUE)
## Warning in proj4string(data): CRS object has comment, which is lost in output
gwr.model
## Call:
## gwr(formula = Pop_Den ~ Med_Inc + PopDen_Wht + PopDen_AA + PopDen_Asi, 
##     data = map_sp, adapt = GWRbandwidth, hatmatrix = TRUE, se.fit = TRUE)
## Kernel function: gwr.Gauss 
## Adaptive quantile: 0.02261773 (about 4 of 221 data points)
## Summary of GWR coefficient estimates at data points:
##                     Min.     1st Qu.      Median     3rd Qu.        Max.
## X.Intercept. -1.7054e+02 -2.2973e+01  2.6542e+01  1.5198e+02  8.9238e+02
## Med_Inc      -1.3718e-02 -2.7939e-03 -8.1104e-04 -3.5837e-05  3.1918e-03
## PopDen_Wht    7.4561e-01  1.0263e+00  1.0525e+00  1.0946e+00  1.4251e+00
## PopDen_AA     8.8938e-01  1.0058e+00  1.0386e+00  1.0803e+00  1.1422e+00
## PopDen_Asi   -9.9825e-01  7.9698e-01  1.0615e+00  1.2507e+00  1.9710e+00
##                Global
## X.Intercept. -14.9543
## Med_Inc       -0.0004
## PopDen_Wht     1.0735
## PopDen_AA      1.0635
## PopDen_Asi     0.9940
## Number of data points: 221 
## Effective number of parameters (residual: 2traceS - traceS'S): 89.29223 
## Effective degrees of freedom (residual: 2traceS - traceS'S): 131.7078 
## Sigma (residual: 2traceS - traceS'S): 182.0397 
## Effective number of parameters (model: traceS): 66.71983 
## Effective degrees of freedom (model: traceS): 154.2802 
## Sigma (model: traceS): 168.1964 
## Sigma (ML): 140.5322 
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 3009.614 
## AIC (GWR p. 96, eq. 4.22): 2879.774 
## Residual sum of squares: 4364592 
## Quasi-global R2: 0.9937685

Creating and Adding Results of GWR

THe local \(R^2\) map shows that the model have low \(R^2\) value at the center of the county.

results <-as.data.frame(gwr.model$SDF)
names(results)
##  [1] "sum.w"               "X.Intercept."        "Med_Inc"            
##  [4] "PopDen_Wht"          "PopDen_AA"           "PopDen_Asi"         
##  [7] "X.Intercept._se"     "Med_Inc_se"          "PopDen_Wht_se"      
## [10] "PopDen_AA_se"        "PopDen_Asi_se"       "gwr.e"              
## [13] "pred"                "pred.se"             "localR2"            
## [16] "X.Intercept._se_EDF" "Med_Inc_se_EDF"      "PopDen_Wht_se_EDF"  
## [19] "PopDen_AA_se_EDF"    "PopDen_Asi_se_EDF"   "pred.se.1"
gwr.map <- cbind(map_sp, as.matrix(results))
gwr.map2 <- st_as_sf(gwr.map)

qtm(gwr.map, fill = "localR2", fill.palette = "-OrRd")

Map Variable Distribution using Use Grid Extra

The grid, and grid_extra functions are another way to plot maps side by side. We can also tmap_arrannge function to get the same output.

The distribution maps of the independent variable shows that the eastern part of the county has higher median income, and gradually decreases to the west. The white population density and the african american population density is scattered within the center of the county. Higher asian population density is observed at the southeastern part of the county.

map1 <- tm_shape(gwr.map2) + 
  tm_fill("Med_Inc",
          n = 5,
          style = "quantile",
          palette = "-RdYlBu",
          title = "Med_Inc Distribution")+
  tm_borders()+
  tm_layout(frame = TRUE,
            legend.text.size = 0.5,
            legend.title.size = 0.6)

map2 <- tm_shape(gwr.map2) + 
  tm_fill("PopDen_Wht",
          n = 5,
          style = "quantile",
          palette = "-RdYlBu",
          title = "PopDen_Wht Distribution")+
  tm_borders()+
  tm_layout(frame = TRUE,
            legend.text.size = 0.5,
            legend.title.size = 0.6)

map3 <- tm_shape(gwr.map2) + 
  tm_fill("PopDen_AA",
          n = 5,
          style = "quantile",
          palette = "-RdYlBu",
          title = "PopDen_AA Distribution")+
  tm_borders()+
  tm_layout(frame = TRUE,
            legend.text.size = 0.5,
            legend.title.size = 0.6)

map4 <- tm_shape(gwr.map2) + 
  tm_fill("PopDen_Asi",
          n = 5,
          style = "quantile",
          palette = "-RdYlBu",
          title = "PopDen_Asi Distribution")+
  tm_borders()+
  tm_layout(frame = TRUE,
            legend.text.size = 0.5,
            legend.title.size = 0.6)
grid.newpage()

# assigns the cell size of the grid, in this case 2 by 2
pushViewport(viewport(layout=grid.layout(2,2)))

# prints a map object into a defined cell   
print(map1, vp=viewport(layout.pos.col = 1, layout.pos.row =1))
print(map2, vp=viewport(layout.pos.col = 2, layout.pos.row =1))
print(map3, vp=viewport(layout.pos.col = 1, layout.pos.row =2))
print(map4, vp=viewport(layout.pos.col = 2, layout.pos.row =2))

Map Coefficient Distribution using Use Grid Extra

The coefficient map shows that the median income negatively impacts the population density of the central and southern part of the Shelby county. The Population density can better expressed by the white population density at the central part of the county.

map5 <- tm_shape(gwr.map2) +
  tm_fill("Med_Inc.1",
          n = 5,
          midpoint = NA,
          style = "quantile",
          palette = "RdYlBu",
          title = "Med_Inc Coefficient")+
  tm_borders()+
  tm_layout(frame = TRUE,
            legend.text.size = 0.5,
            legend.title.size = 0.6)

map6 <- tm_shape(gwr.map2) +
  tm_fill("PopDen_Wht.1",
          n = 5,
          style = "quantile",
          palette = "-RdYlBu",
          title = "PopDen_Wht Coefficient")+
  tm_borders()+
  tm_layout(frame = TRUE,
            legend.text.size = 0.5,
            legend.title.size = 0.6)

map7 <- tm_shape(gwr.map2) +
  tm_fill("PopDen_AA.1",
          n = 5,
          style = "quantile",
          palette = "-RdYlBu",
          title = "PopDen_AA Coefficient")+
  tm_borders()+
  tm_layout(frame = TRUE,
            legend.text.size = 0.5,
            legend.title.size = 0.6)

map8 <- tm_shape(gwr.map2) +
  tm_fill("PopDen_Asi.1",
          n = 5,
          midpoint = NA,
          style = "quantile",
          palette = "-RdYlBu",
          title = "PopDen_Asi Coefficient")+
  tm_borders()+
  tm_layout(frame = TRUE,
            legend.text.size = 0.5,
            legend.title.size = 0.6)
grid.newpage()

# assigns the cell size of the grid, in this case 2 by 2
pushViewport(viewport(layout=grid.layout(2,2)))

# prints a map object into a defined cell   
print(map5, vp=viewport(layout.pos.col = 1, layout.pos.row =1))
print(map6, vp=viewport(layout.pos.col = 2, layout.pos.row =1))
print(map7, vp=viewport(layout.pos.col = 1, layout.pos.row =2))
print(map8, vp=viewport(layout.pos.col = 2, layout.pos.row =2))

Conclusion

This project shows usability of linear and geographic weighted regression model approach to understand the population density of Shelby county using R program. Though, the models shows higher adjusted \(R^2\) values, it could be biased if there are autocorrelations present within the independent variable. The autocorrelation analysis will be a part of the future effort for the project.