This notebook is used to illustrate the Geographicaaly Weighted Principle Component Analysis. It is adapted from https://rpubs.com/IIASA/gwpca.
Data is from the community capital framework and it is sued to illustrate and buildup the code. True analysis will be upfdated upon the latest data.
Firstly, load all of the packages that r requires for this exercise.
Now, set up the data set used for this practical.
# read the QoL data, might be updated
QoL <- read_xlsx("../02 Data/Raw/Yang Quality of Life working.xlsx")
# Rename columns
QoL <- QoL %>% rename(
FIPS = FIPS,
FairPoorHealth = `% Fair or Poor Health`,
UnhealthyPhysDays = `Average Number of Physically Unhealthy Days`,
UnhealthyMentDays = `Average Number of Mentally Unhealthy Days`,
PovertyRate2022 = `Poverty Rate ACS (2022)`,
FullTimeWorkPoverty = `Poverty Rate Age 16 and over Worked Full-Time Year Round (ACS 2022)`,
Under5Poverty = `Poverty Rate Age 5 and under (ACS 2022)`,
Over75Poverty = `Poverty Rate Age 75 and over (ACS 2022)`,
HighRentBurden = `Percent Renters 35%+ Income to Rent`,
MobileHomePct = `Percent Mobile Home`,
Crowding = `Crowed more people than rooms estimate`,
NoHSDiploma = `Lack HS Diploma`,
InjuryDeathRate = `Safety Injury Death Rate`,
WatershedImpact = `Average of Percentile rank of percent of tract that intersects an impaired/impacted watershed at the HUC12 level`,
PM25Days = `Annual mean days above PM2.5 regulatory standard - 3-year average`,
OzoneDays = `OZONE Annual mean days above O3 regulatory standard - 3-year average`
)
# Check new column names
colnames(QoL)
## [1] "FIPS" "FairPoorHealth" "UnhealthyPhysDays"
## [4] "UnhealthyMentDays" "PovertyRate2022" "FullTimeWorkPoverty"
## [7] "Under5Poverty" "Over75Poverty" "HighRentBurden"
## [10] "MobileHomePct" "Crowding" "NoHSDiploma"
## [13] "InjuryDeathRate" "WatershedImpact" "PM25Days"
## [16] "OzoneDays"
ct
## Simple feature collection with 3235 features and 12 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -179.1467 ymin: -14.5487 xmax: 179.7785 ymax: 71.38782
## Geodetic CRS: NAD83
## First 10 features:
## STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME
## 1 01 069 00161560 0500000US01069 01069 Houston
## 2 01 023 00161537 0500000US01023 01023 Choctaw
## 3 01 005 00161528 0500000US01005 01005 Barbour
## 4 01 107 00161580 0500000US01107 01107 Pickens
## 5 01 033 00161542 0500000US01033 01033 Colbert
## 6 04 012 00043540 0500000US04012 04012 La Paz
## 7 04 001 00025441 0500000US04001 04001 Apache
## 8 05 081 00066874 0500000US05081 05081 Little River
## 9 05 121 00069178 0500000US05121 05121 Randolph
## 10 06 037 00277283 0500000US06037 06037 Los Angeles
## NAMELSAD STUSPS STATE_NAME LSAD ALAND AWATER
## 1 Houston County AL Alabama 06 1501742235 4795415
## 2 Choctaw County AL Alabama 06 2365900083 19114321
## 3 Barbour County AL Alabama 06 2292160151 50523213
## 4 Pickens County AL Alabama 06 2282835044 22621093
## 5 Colbert County AL Alabama 06 1535742270 79160396
## 6 La Paz County AZ Arizona 06 11646086560 36514347
## 7 Apache County AZ Arizona 06 29003497233 54139714
## 8 Little River County AR Arkansas 06 1375943077 83115227
## 9 Randolph County AR Arkansas 06 1688445989 10370823
## 10 Los Angeles County CA California 06 10515988166 1785003207
## geometry
## 1 MULTIPOLYGON (((-85.71209 3...
## 2 MULTIPOLYGON (((-88.47323 3...
## 3 MULTIPOLYGON (((-85.74803 3...
## 4 MULTIPOLYGON (((-88.34043 3...
## 5 MULTIPOLYGON (((-88.13925 3...
## 6 MULTIPOLYGON (((-114.7312 3...
## 7 MULTIPOLYGON (((-110.0007 3...
## 8 MULTIPOLYGON (((-94.48558 3...
## 9 MULTIPOLYGON (((-91.40687 3...
## 10 MULTIPOLYGON (((-118.6044 3...
names(QoL)
## [1] "FIPS" "FairPoorHealth" "UnhealthyPhysDays"
## [4] "UnhealthyMentDays" "PovertyRate2022" "FullTimeWorkPoverty"
## [7] "Under5Poverty" "Over75Poverty" "HighRentBurden"
## [10] "MobileHomePct" "Crowding" "NoHSDiploma"
## [13] "InjuryDeathRate" "WatershedImpact" "PM25Days"
## [16] "OzoneDays" "STATEFP" "COUNTYFP"
## [19] "COUNTYNS" "AFFGEOID" "GEOID"
## [22] "NAME" "NAMELSAD" "STUSPS"
## [25] "STATE_NAME" "LSAD" "ALAND"
## [28] "AWATER" "geometry"
Next, we will be working with scaled data. \[z = \frac{x - \bar{x}}{\textrm{sd}(x)}\] so that the variable is rescaled to have a mean value of zero, and a standard deviation of one. This means that we are measuring relative fluctuations in each individual variable across counties, rather than measuring all variables on the same scale.
Next, we will be working with scaled data - that is we will replace the chemical concentration measurements by their z-scores. This means that we are measuring relative fluctuations in each individual chemical across geography, rather than measuring all concentrations on the same scale.
# Standardize the selected columns
QoL.scaled <- scale(as.matrix(QoL[, 2:16]))
# Running PCA
pca <- princomp(QoL.scaled, cor = FALSE, scores = TRUE) # using covariance matrix
# Show results of PCA
pca$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## FairPoorHealth 0.398 0.119 0.106
## UnhealthyPhysDays 0.388 0.159 0.178
## UnhealthyMentDays 0.287 0.328 0.304 0.161 0.137
## PovertyRate2022 0.334 -0.182 -0.256
## FullTimeWorkPoverty 0.237 -0.148 -0.196 -0.422 0.220 -0.593
## Under5Poverty 0.309 -0.182 0.226 -0.277
## Over75Poverty 0.228 -0.174 -0.219 -0.154 -0.827
## HighRentBurden 0.353 -0.101 -0.620 -0.288 0.164 -0.144 0.517
## MobileHomePct 0.307 -0.147 0.136 0.177
## Crowding 0.139 -0.531 0.341 -0.358 0.459 0.198
## NoHSDiploma 0.342 -0.218 0.108 0.140 0.127 0.142
## InjuryDeathRate 0.176 -0.173 0.459 0.187 -0.529 -0.272 -0.177
## WatershedImpact -0.109 0.472 0.177 -0.347 -0.288 0.515 -0.304
## PM25Days 0.100 0.547 0.500 -0.196
## OzoneDays 0.352 -0.442 0.308 -0.290 -0.615 -0.183
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## FairPoorHealth 0.109 0.120 0.230 0.549 0.638
## UnhealthyPhysDays 0.240 0.137 0.423 -0.716
## UnhealthyMentDays -0.158 0.388 0.144 -0.467 -0.433 0.248
## PovertyRate2022 -0.136 -0.693 0.480 -0.196
## FullTimeWorkPoverty -0.334 0.264 -0.250 -0.223
## Under5Poverty 0.599 -0.328 -0.188 0.425 -0.164 -0.148
## Over75Poverty 0.143 0.260 -0.249
## HighRentBurden 0.222 -0.130
## MobileHomePct -0.439 -0.714 -0.124 -0.294
## Crowding 0.162 -0.148 -0.339 -0.172
## NoHSDiploma -0.163 0.527 0.138 0.440 -0.483 -0.109
## InjuryDeathRate 0.221 0.213 -0.421 0.172
## WatershedImpact -0.212 -0.266 0.132 0.137 0.117
## PM25Days 0.277 -0.130 -0.538
## OzoneDays -0.194 0.205
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.067 0.067 0.067 0.067 0.067 0.067 0.067 0.067 0.067
## Cumulative Var 0.067 0.133 0.200 0.267 0.333 0.400 0.467 0.533 0.600
## Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.067 0.067 0.067 0.067 0.067 0.067
## Cumulative Var 0.667 0.733 0.800 0.867 0.933 1.000
# Extracting the first principal component scores
QoL$pc1 <- pca$scores[, 1]
QoL$pc2 <- pca$scores[, 2]
# Visualizing the first principal component
# Check if 'QoL$geometry' is a spatial object or convert it to one if needed
ggplot(data = QoL$geometry) +
geom_sf(aes(fill = QoL$pc1)) +
scale_fill_viridis_c(option = "C", begin = 0, end = 1, name = "PC1 Score") +
labs(title = "Spatial Distribution of PC1 Scores",
caption = "Data source: Your Data") +
theme_void()
# Check if 'QoL$geometry' is a spatial object or convert it to one if needed
ggplot(data = QoL$geometry) +
geom_sf(aes(fill = QoL$pc2)) +
scale_fill_viridis_c(option = "C", begin = 0, end = 1, name = "PC2 Score") +
labs(title = "Spatial Distribution of PC2 Scores",
caption = "Data source: Your Data") +
theme_void()
boxplot(QoL$pc1,horizontal = TRUE)
boxplot(QoL$pc2,horizontal = TRUE)
outl_1 <- QoL$NAME[which(QoL$pc1 %in% boxplot(QoL$pc1,plot=FALSE)$out)]
outl_1
## [1] "Clinch" "Hancock" "Wheeler" "Clay"
## [5] "Knott" "Lee" "McCreary" "Magoffin"
## [9] "Wolfe" "East Carroll" "Madison" "Buffalo"
## [13] "Oglala Lakota" "Todd" "Brooks" "Dimmit"
## [17] "Hudspeth" "Kenedy" "Presidio" "Starr"
## [21] "Zapata" "McDowell" "Mingo"
outl_2 <- QoL$NAME[which(QoL$pc2 %in% boxplot(QoL$pc2,plot=FALSE)$out)]
outl_2
## [1] "Apache" "La Paz" "Fresno" "Kern"
## [5] "Kings" "Los Angeles" "Madera" "Merced"
## [9] "Riverside" "San Bernardino" "Stanislaus" "Tulare"
## [13] "Dolores" "Berrien" "Wheatland" "Banner"
## [17] "Hayes" "Logan" "Loup" "McPherson"
## [21] "Sioux" "Valley" "McKinley" "Mora"
## [25] "Rio Arriba" "Billings" "Dunn" "Mountrail"
## [29] "Portage" "Delaware" "Lebanon" "Buffalo"
## [33] "Corson" "Dewey" "Faulk" "Haakon"
## [37] "Harding" "Jackson" "Jerauld" "Jones"
## [41] "McPherson" "Mellette" "Oglala Lakota" "Todd"
## [45] "Hudspeth" "Kenedy" "King" "Sutton"
## [49] "Carbon" "Crook" "Hot Springs" "Platte"
To find such outliers the procedure is relatively simple - instead of doing a PCA on all of the data, for each sample we do a PCA on data falling into a window centred on the location of that sample. In that way we can check whether the sample is like its neighbours or not, from a multivariate viewpoint.
# Calculate centroids of the geometry
ct_centroids <- QoL$geometry %>%
st_geometry() %>%
st_centroid()
# Assuming 'gw.dist' needs the coordinates in a specific format
d <- gw.dist(dp.locat = st_coordinates(ct_centroids), longlat = TRUE)
QoL_sf<- st_as_sf(QoL)
spdf <- as(QoL_sf, "Spatial")
# Choose the bandwidth=2951
# bw.gw.pca <- bw.gwpca(spdf,
# vars = colnames(spdf@data)[2:16],
# k = 5,
# robust = FALSE,
# adaptive = TRUE)
# Now you should be able to use 'ct_centroids' in 'gwpca',16
pca.gw <- gwpca(spdf , vars = colnames(spdf@data)[2:16], bw = 2951 , k = 5, dMat = d, adaptive = TRUE)
how data dimensionality varies spatially and
how the original variables influence the components.
For the former, the spatial distribution of local PTV for say, the frst two components can be mapped.
# function for calculation pproportion of variance
prop.var <- function(gwpca.obj, n.components) {
return((rowSums(gwpca.obj$var[, 1:n.components]) /rowSums(gwpca.obj$var)) * 100)
}
var.gwpca <- prop.var( pca.gw, 5)
QoL$var.gwpca <- var.gwpca
# Convert SpatialPolygonsDataFrame back to an sf object
spdf_sf <- st_as_sf(spdf)
# Iterate over the first 5 principal components
for (i in 1:5) {
# Extract the loadings for the i-th component
local.loadings <- pca.gw$loadings[, , i] # assuming loadings are stored in 3D array format
lead.item <- colnames(local.loadings)[max.col(abs(local.loadings))]
# Add the lead item as a new column in the spdf_sf data
spdf_sf$lead.item <- lead.item
# Plotting the spatial distribution of the lead item for the i-th component
plot_title <- paste("Spatial Distribution of Lead Item Scores: PC", i)
p <- ggplot(data = spdf_sf) +
geom_sf(aes(fill = as.factor(lead.item)), show.legend = TRUE) + # Convert to factor for categorical coloring
scale_fill_viridis_d(option = "C", name = "Lead Item Score") + # Using discrete color scale
labs(title = plot_title,
caption = "Data source: Your Data") +
theme_void()
# Print the plot
print(p)
# # Optionally, save the plot to a file
# ggsave(filename = paste("PC", i, "Lead_Item_Scores.png", sep = "_"), plot = p, width = 10, height = 8)
}