Population density is the most frequently used demographic parameter for any initial statistical analysis. There are numerous known factors such as the topography of the area, proximity to the river or water, availability of resources, which affects the population density in an area. There are very few works demonstrating effects unconventional variables on the distribution of population density.
Shelby County is the second-largest county in terms of population density, with a population of 936,365 (United States Census Bureau, 2018). The total area of the county is 785 square miles, with a population density of 1,192.8 per square mile.
This study investigated some unconventional socio-economic variables to explore and explain the spatial relationship of population density in census tracts of Shelby County.
Analyzing the coefficient value for the independent variable and its corresponding p-value.
Analyzing the sign of the coefficient, whether it is positive or negative.
Checking the adjusted R-squared value(s).
Visually understand the correlation between the dependent and exploratory variables in the Shelby county.
Map spatial distribution and coefficients.
Perform geographic weighted regression which can be used to identify how (locally weighted) regression coefficients may vary across the study area.
The data for this study was collected from the US Census Census Bureau website. The following datasets of 2018 have been used –
Population data
Selected Economic data, and
Race data
The population density was used as the dependent variable, and the predictor variables used are median income, the population density of the White, African American, and Asian people from the year 2018.
Main variables used for this analysis are given below.
Pop_Den : Population density per sq. mile
Med_Inc : Median Income
PopDen_Wht : Population density of White population per sq. mile
PopDen_AA : Population density of African American population per sq. mile
PopDen_Asi : Population density of Asian population per sq. mile
rm(list = ls()) # Clears the global environment for a fresh start
cat('\f') # Cleans the console
This code allows to load multiple library at once.
my_library <- c("sf", "sp", "tibble", "rgdal", "rgeos", "tmap",
"tmaptools", "spgwr", "grid", "gridExtra", "spdep", "sjPlot", "psych")
lapply(my_library, library, character.only = TRUE)
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
## rgdal: version: 1.5-16, (SVN revision 1050)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.0.4, released 2020/01/28
## Path to GDAL shared files: C:/Users/hasan/OneDrive/Documents/R/win-library/4.0/rgdal/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ runtime: Rel. 6.3.1, February 10th, 2020, [PJ_VERSION: 631]
## Path to PROJ shared files: C:/Users/hasan/OneDrive/Documents/R/win-library/4.0/rgdal/proj
## Linking to sp version:1.4-2
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading rgdal.
## rgeos version: 0.5-3, (SVN revision 634)
## GEOS runtime version: 3.8.0-CAPI-1.13.1
## Linking to sp version: 1.4-2
## Polygon checking: TRUE
## Warning: package 'spgwr' was built under R version 4.0.3
## Loading required package: spData
## Warning: package 'spData' was built under R version 4.0.3
## NOTE: This package does not constitute approval of GWR
## as a method of spatial analysis; see example(gwr)
## Warning: package 'spdep' was built under R version 4.0.3
## Warning: package 'sjPlot' was built under R version 4.0.3
## Warning: package 'psych' was built under R version 4.0.3
Shelby <- st_read("Data/Shelby.shp")
## Reading layer `Shelby' from data source `C:\Rizwan\Education\UofM PhD Work\Seminar in Earth Sciences\Term_Project\Data\Shelby.shp' using driver `ESRI Shapefile'
## Simple feature collection with 221 features and 9 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: -90.3103 ymin: 34.99419 xmax: -89.63278 ymax: 35.40948
## geographic CRS: NAD83
Data <- as_tibble(read.csv("Data/Data.csv"))
glimpse(Shelby)
## Rows: 221
## Columns: 10
## $ STATEFP <chr> "47", "47", "47", "47", "47", "47", "47", "47", "47", "47"...
## $ COUNTYFP <chr> "157", "157", "157", "157", "157", "157", "157", "157", "1...
## $ TRACTCE <chr> "021312", "021530", "021726", "022024", "000200", "000800"...
## $ AFFGEOID <chr> "1400000US47157021312", "1400000US47157021530", "1400000US...
## $ GEOID <chr> "47157021312", "47157021530", "47157021726", "47157022024"...
## $ NAME <chr> "213.12", "215.30", "217.26", "220.24", "2", "8", "27", "3...
## $ LSAD <chr> "CT", "CT", "CT", "CT", "CT", "CT", "CT", "CT", "CT", "CT"...
## $ ALAND <dbl> 1362074, 10068590, 1433803, 2581685, 1056567, 2037629, 135...
## $ AWATER <dbl> 0, 0, 2132, 0, 45527, 154343, 0, 0, 0, 0, 19621, 0, 44688,...
## $ geometry <POLYGON [°]> POLYGON ((-89.84226 35.1076..., POLYGON ((-89.7324...
glimpse(Data)
## Rows: 221
## Columns: 17
## $ FID <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ STATEFP <int> 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, ...
## $ COUNTYFP <int> 157, 157, 157, 157, 157, 157, 157, 157, 157, 157, 157, 1...
## $ GEO_ID <chr> "1400000US47157021312", "1400000US47157021530", "1400000...
## $ OBJECTID <int> 1161, 1176, 1186, 1201, 1001, 1006, 1021, 1029, 1044, 10...
## $ ALAND <int> 1362074, 10068590, 1433803, 2581685, 1056567, 2037629, 1...
## $ AWATER <int> 0, 0, 2132, 0, 45527, 154343, 0, 0, 0, 0, 19621, 0, 4468...
## $ Area_sqmil <dbl> 0.5282983, 3.8862272, 0.5538200, 0.9941988, 0.4208300, 0...
## $ Tot_Pop <int> 2108, 5396, 3486, 3429, 790, 2286, 2030, 3138, 2253, 140...
## $ Tot_Wht <int> 1902, 4530, 1012, 167, 0, 42, 882, 2417, 18, 21, 306, 60...
## $ Tot_AA <int> 99, 308, 2423, 3250, 762, 2244, 857, 595, 2229, 1379, 33...
## $ Tot_Asian <int> 97, 479, 0, 0, 0, 0, 110, 80, 0, 1, 0, 0, 0, 91, 438, 26...
## $ Pop_Den <dbl> 3990.1699, 1388.4932, 6294.4641, 3449.0085, 1877.2425, 2...
## $ Med_Inc <int> 54769, 129375, 26731, 49667, 14345, 14299, 45729, 56111,...
## $ PopDen_Wht <dbl> 3600.238692, 1165.654943, 1827.308561, 167.974461, 0.000...
## $ PopDen_AA <dbl> 187.39413, 79.25424, 4375.06783, 3268.96407, 1810.70734,...
## $ PopDen_Asi <dbl> 183.608387, 123.255788, 0.000000, 0.000000, 0.000000, 0....
Shelby_merged <- merge(Shelby,Data, by.x = "AFFGEOID", by.y = "GEO_ID")
The model table shows higher adjusted \(R^2\) value. The model could be biased if there autocorrelations between the independent variables. The table shows that the median income is not statistically significant to describe the dependent variable.
model <- lm(Pop_Den ~ Med_Inc + PopDen_Wht + PopDen_AA + PopDen_Asi, data = Data)
summary(model)
##
## Call:
## lm(formula = Pop_Den ~ Med_Inc + PopDen_Wht + PopDen_AA + PopDen_Asi,
## data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -294.61 -118.28 -38.41 32.33 1687.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.495e+01 4.765e+01 -0.314 0.754
## Med_Inc -3.972e-04 6.422e-04 -0.619 0.537
## PopDen_Wht 1.074e+00 1.635e-02 65.640 < 2e-16 ***
## PopDen_AA 1.064e+00 1.173e-02 90.664 < 2e-16 ***
## PopDen_Asi 9.940e-01 2.017e-01 4.927 1.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 234.5 on 216 degrees of freedom
## Multiple R-squared: 0.983, Adjusted R-squared: 0.9827
## F-statistic: 3129 on 4 and 216 DF, p-value: < 2.2e-16
tab_model(model, dv.labels = "Shelby County Population Density 2018",
CSS = list(
css.depvarhead = 'color: red;',
css.centeralign = 'text-align: left;',
css.firsttablecol = 'font-weight: bold;',
css.summary = 'color: blue;')
)
| Â | Shelby County Population Density 2018 | ||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | -14.95 | -108.86 – 78.96 | 0.754 |
| Med_Inc | -0.00 | -0.00 – 0.00 | 0.537 |
| PopDen_Wht | 1.07 | 1.04 – 1.11 | <0.001 |
| PopDen_AA | 1.06 | 1.04 – 1.09 | <0.001 |
| PopDen_Asi | 0.99 | 0.60 – 1.39 | <0.001 |
| Observations | 221 | ||
| R2 / R2 adjusted | 0.983 / 0.983 | ||
par(mfrow=c(2,2))
plot(model)
THe correlation matrix shows that the higher correlation is present between population density and African American population density. This correleation can be biased as the autocorrelation between the independent variables not been calculated.
pairs.panels(Data[, 13:17],
method = "pearson",# correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
The residuals have been normalized by scale function. It is necessary to scale the data to standard deviations for better representation. Normalizing the residuals identifies the outliers in the spatial data.
resids <- residuals(model)
map.resids1 <- cbind(Shelby_merged, resids)
sd_breaks <- scale(map.resids1$resids)
summary(sd_breaks)
## V1
## Min. :-1.2677
## 1st Qu.:-0.5089
## Median :-0.1653
## Mean : 0.0000
## 3rd Qu.: 0.1391
## Max. : 7.2633
map.resids2 <- cbind(map.resids1, sd_breaks)
glimpse(map.resids2)
## Rows: 221
## Columns: 28
## $ AFFGEOID <chr> "1400000US47157000100", "1400000US47157000200", "1400000...
## $ STATEFP.x <chr> "47", "47", "47", "47", "47", "47", "47", "47", "47", "4...
## $ COUNTYFP.x <chr> "157", "157", "157", "157", "157", "157", "157", "157", ...
## $ TRACTCE <chr> "000100", "000200", "000300", "000400", "000600", "00070...
## $ GEOID <chr> "47157000100", "47157000200", "47157000300", "4715700040...
## $ NAME <chr> "1", "2", "3", "4", "6", "7", "8", "9", "11", "12", "13"...
## $ LSAD <chr> "CT", "CT", "CT", "CT", "CT", "CT", "CT", "CT", "CT", "C...
## $ ALAND.x <dbl> 4186009, 1056567, 2872435, 2276725, 3883873, 2497535, 20...
## $ AWATER.x <dbl> 2549731, 45527, 8845, 91258, 1208912, 0, 154343, 0, 0, 0...
## $ FID <int> 220, 4, 23, 24, 194, 25, 5, 146, 147, 16, 60, 81, 53, 14...
## $ STATEFP.y <int> 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, ...
## $ COUNTYFP.y <int> 157, 157, 157, 157, 157, 157, 157, 157, 157, 157, 157, 1...
## $ OBJECTID <int> 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 10...
## $ ALAND.y <int> 4186009, 1056567, 2872435, 2276725, 3883873, 2497535, 20...
## $ AWATER.y <int> 2549731, 45527, 8845, 91258, 1208912, 0, 154343, 0, 0, 0...
## $ Area_sqmil <dbl> 2.5916126, 0.4208300, 1.1006220, 0.9194098, 1.9638176, 0...
## $ Tot_Pop <int> 5806, 790, 806, 1494, 1969, 4308, 2286, 2625, 2973, 4540...
## $ Tot_Wht <int> 3748, 0, 3, 63, 0, 338, 42, 35, 1394, 1975, 476, 69, 225...
## $ Tot_AA <int> 1114, 762, 782, 1416, 1937, 3970, 2244, 2527, 1323, 946,...
## $ Tot_Asian <int> 822, 0, 0, 0, 0, 0, 0, 0, 11, 28, 208, 12, 37, 276, 64, ...
## $ Pop_Den <dbl> 2240.3039, 1877.2425, 732.3132, 1624.9554, 1002.6389, 44...
## $ Med_Inc <int> 76627, 14345, 20428, 25789, 16379, 19849, 14299, 18790, ...
## $ PopDen_Wht <dbl> 1446.203779, 0.000000, 2.725732, 68.522216, 0.000000, 34...
## $ PopDen_AA <dbl> 429.8482, 1810.7073, 710.5073, 1540.1184, 986.3441, 4109...
## $ PopDen_Asi <dbl> 317.177029, 0.000000, 0.000000, 0.000000, 0.000000, 0.00...
## $ resids <dbl> -219.907220, -3.334919, -294.614837, -173.240475, -27.82...
## $ sd_breaks <dbl> -0.94626480, -0.01435022, -1.26773304, -0.74545694, -0.1...
## $ geometry <POLYGON [°]> POLYGON ((-90.06468 35.1702..., POLYGON ((-90.04...
The following maps shows presence of outliers in the spatial data.
my_breaks <- c(-8,-3,-2,-1,1,2,3,8)
tm_shape(map.resids2)+ tm_borders()+
tm_fill("sd_breaks", midpoint = NA, title = "Residuals", style = "fixed", breaks = my_breaks, palette = "-RdBu")+
tm_layout(main.title = "Residuals", main.title.size = 0.7 ,
legend.position = c("left", "top"), legend.title.size = 0.8)
map_sp <- as(map.resids2, "Spatial") #Converting the sf object to sp object
GWRbandwidth <- gwr.sel(Pop_Den ~ Med_Inc + PopDen_Wht + PopDen_AA + PopDen_Asi, data = map_sp, adapt = T)
## Adaptive q: 0.381966 CV score: 11447704
## Adaptive q: 0.618034 CV score: 11892736
## Adaptive q: 0.236068 CV score: 10927543
## Adaptive q: 0.145898 CV score: 10478910
## Adaptive q: 0.09016994 CV score: 9921170
## Adaptive q: 0.05572809 CV score: 9349887
## Adaptive q: 0.03444185 CV score: 8568541
## Adaptive q: 0.02128624 CV score: 8098634
## Adaptive q: 0.01315562 CV score: 8821234
## Adaptive q: 0.02481281 CV score: 8195559
## Adaptive q: 0.01818062 CV score: 8176755
## Adaptive q: 0.02132693 CV score: 8097201
## Adaptive q: 0.02265842 CV score: 8062991
## Adaptive q: 0.02384356 CV score: 8127619
## Adaptive q: 0.02214983 CV score: 8072107
## Adaptive q: 0.0231111 CV score: 8085220
## Adaptive q: 0.02253265 CV score: 8063280
## Adaptive q: 0.02261773 CV score: 8061576
## Adaptive q: 0.02257704 CV score: 8062379
## Adaptive q: 0.02261773 CV score: 8061576
gwr.model = gwr(Pop_Den ~ Med_Inc + PopDen_Wht + PopDen_AA + PopDen_Asi,
data = map_sp,
adapt=GWRbandwidth,
hatmatrix=TRUE,
se.fit=TRUE)
## Warning in proj4string(data): CRS object has comment, which is lost in output
gwr.model
## Call:
## gwr(formula = Pop_Den ~ Med_Inc + PopDen_Wht + PopDen_AA + PopDen_Asi,
## data = map_sp, adapt = GWRbandwidth, hatmatrix = TRUE, se.fit = TRUE)
## Kernel function: gwr.Gauss
## Adaptive quantile: 0.02261773 (about 4 of 221 data points)
## Summary of GWR coefficient estimates at data points:
## Min. 1st Qu. Median 3rd Qu. Max.
## X.Intercept. -1.7054e+02 -2.2973e+01 2.6542e+01 1.5198e+02 8.9238e+02
## Med_Inc -1.3718e-02 -2.7939e-03 -8.1104e-04 -3.5837e-05 3.1918e-03
## PopDen_Wht 7.4561e-01 1.0263e+00 1.0525e+00 1.0946e+00 1.4251e+00
## PopDen_AA 8.8938e-01 1.0058e+00 1.0386e+00 1.0803e+00 1.1422e+00
## PopDen_Asi -9.9825e-01 7.9698e-01 1.0615e+00 1.2507e+00 1.9710e+00
## Global
## X.Intercept. -14.9543
## Med_Inc -0.0004
## PopDen_Wht 1.0735
## PopDen_AA 1.0635
## PopDen_Asi 0.9940
## Number of data points: 221
## Effective number of parameters (residual: 2traceS - traceS'S): 89.29223
## Effective degrees of freedom (residual: 2traceS - traceS'S): 131.7078
## Sigma (residual: 2traceS - traceS'S): 182.0397
## Effective number of parameters (model: traceS): 66.71983
## Effective degrees of freedom (model: traceS): 154.2802
## Sigma (model: traceS): 168.1964
## Sigma (ML): 140.5322
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 3009.614
## AIC (GWR p. 96, eq. 4.22): 2879.774
## Residual sum of squares: 4364592
## Quasi-global R2: 0.9937685
THe local \(R^2\) map shows that the model have low \(R^2\) value at the center of the county.
results <-as.data.frame(gwr.model$SDF)
names(results)
## [1] "sum.w" "X.Intercept." "Med_Inc"
## [4] "PopDen_Wht" "PopDen_AA" "PopDen_Asi"
## [7] "X.Intercept._se" "Med_Inc_se" "PopDen_Wht_se"
## [10] "PopDen_AA_se" "PopDen_Asi_se" "gwr.e"
## [13] "pred" "pred.se" "localR2"
## [16] "X.Intercept._se_EDF" "Med_Inc_se_EDF" "PopDen_Wht_se_EDF"
## [19] "PopDen_AA_se_EDF" "PopDen_Asi_se_EDF" "pred.se.1"
gwr.map <- cbind(map_sp, as.matrix(results))
gwr.map2 <- st_as_sf(gwr.map)
qtm(gwr.map, fill = "localR2", fill.palette = "-OrRd")
The grid, and grid_extra functions are another way to plot maps side by side. We can also tmap_arrannge function to get the same output.
The distribution maps of the independent variable shows that the eastern part of the county has higher median income, and gradually decreases to the west. The white population density and the african american population density is scattered within the center of the county. Higher asian population density is observed at the southeastern part of the county.
map1 <- tm_shape(gwr.map2) +
tm_fill("Med_Inc",
n = 5,
style = "quantile",
palette = "-RdYlBu",
title = "Med_Inc Distribution")+
tm_borders()+
tm_layout(frame = TRUE,
legend.text.size = 0.5,
legend.title.size = 0.6)
map2 <- tm_shape(gwr.map2) +
tm_fill("PopDen_Wht",
n = 5,
style = "quantile",
palette = "-RdYlBu",
title = "PopDen_Wht Distribution")+
tm_borders()+
tm_layout(frame = TRUE,
legend.text.size = 0.5,
legend.title.size = 0.6)
map3 <- tm_shape(gwr.map2) +
tm_fill("PopDen_AA",
n = 5,
style = "quantile",
palette = "-RdYlBu",
title = "PopDen_AA Distribution")+
tm_borders()+
tm_layout(frame = TRUE,
legend.text.size = 0.5,
legend.title.size = 0.6)
map4 <- tm_shape(gwr.map2) +
tm_fill("PopDen_Asi",
n = 5,
style = "quantile",
palette = "-RdYlBu",
title = "PopDen_Asi Distribution")+
tm_borders()+
tm_layout(frame = TRUE,
legend.text.size = 0.5,
legend.title.size = 0.6)
grid.newpage()
# assigns the cell size of the grid, in this case 2 by 2
pushViewport(viewport(layout=grid.layout(2,2)))
# prints a map object into a defined cell
print(map1, vp=viewport(layout.pos.col = 1, layout.pos.row =1))
print(map2, vp=viewport(layout.pos.col = 2, layout.pos.row =1))
print(map3, vp=viewport(layout.pos.col = 1, layout.pos.row =2))
print(map4, vp=viewport(layout.pos.col = 2, layout.pos.row =2))
The coefficient map shows that the median income negatively impacts the population density of the central and southern part of the Shelby county. The Population density can better expressed by the white population density at the central part of the county.
map5 <- tm_shape(gwr.map2) +
tm_fill("Med_Inc.1",
n = 5,
midpoint = NA,
style = "quantile",
palette = "RdYlBu",
title = "Med_Inc Coefficient")+
tm_borders()+
tm_layout(frame = TRUE,
legend.text.size = 0.5,
legend.title.size = 0.6)
map6 <- tm_shape(gwr.map2) +
tm_fill("PopDen_Wht.1",
n = 5,
style = "quantile",
palette = "-RdYlBu",
title = "PopDen_Wht Coefficient")+
tm_borders()+
tm_layout(frame = TRUE,
legend.text.size = 0.5,
legend.title.size = 0.6)
map7 <- tm_shape(gwr.map2) +
tm_fill("PopDen_AA.1",
n = 5,
style = "quantile",
palette = "-RdYlBu",
title = "PopDen_AA Coefficient")+
tm_borders()+
tm_layout(frame = TRUE,
legend.text.size = 0.5,
legend.title.size = 0.6)
map8 <- tm_shape(gwr.map2) +
tm_fill("PopDen_Asi.1",
n = 5,
midpoint = NA,
style = "quantile",
palette = "-RdYlBu",
title = "PopDen_Asi Coefficient")+
tm_borders()+
tm_layout(frame = TRUE,
legend.text.size = 0.5,
legend.title.size = 0.6)
grid.newpage()
# assigns the cell size of the grid, in this case 2 by 2
pushViewport(viewport(layout=grid.layout(2,2)))
# prints a map object into a defined cell
print(map5, vp=viewport(layout.pos.col = 1, layout.pos.row =1))
print(map6, vp=viewport(layout.pos.col = 2, layout.pos.row =1))
print(map7, vp=viewport(layout.pos.col = 1, layout.pos.row =2))
print(map8, vp=viewport(layout.pos.col = 2, layout.pos.row =2))
This project shows usability of linear and geographic weighted regression model approach to understand the population density of Shelby county using R program. Though, the models shows higher adjusted \(R^2\) values, it could be biased if there are autocorrelations present within the independent variable. The autocorrelation analysis will be a part of the future effort for the project.