This notebook is used to illustrate the Geographicaaly Weighted Principle Component Analysis. It is adapted from https://rpubs.com/IIASA/gwpca.

Data is from the community capital framework and it is sued to illustrate and buildup the code. True analysis will be upfdated upon the latest data.

Set up library

Firstly, load all of the packages that r requires for this exercise.

Data

Read Data

Now, set up the data set used for this practical.

# read the QoL data, might be updated
QoL <- read_xlsx("../02 Data/Raw/Yang Quality of Life working.xlsx")

# Rename columns
QoL <- QoL %>% rename(
  FIPS = FIPS,
  FairPoorHealth = `% Fair or Poor Health`,
  UnhealthyPhysDays = `Average Number of Physically Unhealthy Days`,
  UnhealthyMentDays = `Average Number of Mentally Unhealthy Days`,
  PovertyRate2022 = `Poverty Rate ACS (2022)`,
  FullTimeWorkPoverty = `Poverty Rate Age 16 and over Worked Full-Time Year Round (ACS 2022)`,
  Under5Poverty = `Poverty Rate Age 5 and under (ACS 2022)`,
  Over75Poverty = `Poverty Rate Age 75 and over (ACS 2022)`,
  HighRentBurden = `Percent Renters 35%+ Income to Rent`,
  MobileHomePct = `Percent Mobile Home`,
  Crowding = `Crowed more people than rooms estimate`,
  NoHSDiploma = `Lack HS Diploma`,
  InjuryDeathRate = `Safety Injury Death Rate`,
  WatershedImpact = `Average of Percentile rank of percent of tract that intersects an impaired/impacted watershed at the HUC12 level`,
  PM25Days = `Annual mean days above PM2.5 regulatory standard - 3-year average`,
  OzoneDays = `OZONE Annual mean days above O3 regulatory standard - 3-year average`
)

# Check new column names
colnames(QoL)

##  [1] "FIPS"                "FairPoorHealth"      "UnhealthyPhysDays"  
##  [4] "UnhealthyMentDays"   "PovertyRate2022"     "FullTimeWorkPoverty"
##  [7] "Under5Poverty"       "Over75Poverty"       "HighRentBurden"     
## [10] "MobileHomePct"       "Crowding"            "NoHSDiploma"        
## [13] "InjuryDeathRate"     "WatershedImpact"     "PM25Days"           
## [16] "OzoneDays"

ct

## Simple feature collection with 3235 features and 12 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.1467 ymin: -14.5487 xmax: 179.7785 ymax: 71.38782
## Geodetic CRS:  NAD83
## First 10 features:
##    STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID         NAME
## 1       01      069 00161560 0500000US01069 01069      Houston
## 2       01      023 00161537 0500000US01023 01023      Choctaw
## 3       01      005 00161528 0500000US01005 01005      Barbour
## 4       01      107 00161580 0500000US01107 01107      Pickens
## 5       01      033 00161542 0500000US01033 01033      Colbert
## 6       04      012 00043540 0500000US04012 04012       La Paz
## 7       04      001 00025441 0500000US04001 04001       Apache
## 8       05      081 00066874 0500000US05081 05081 Little River
## 9       05      121 00069178 0500000US05121 05121     Randolph
## 10      06      037 00277283 0500000US06037 06037  Los Angeles
##               NAMELSAD STUSPS STATE_NAME LSAD       ALAND     AWATER
## 1       Houston County     AL    Alabama   06  1501742235    4795415
## 2       Choctaw County     AL    Alabama   06  2365900083   19114321
## 3       Barbour County     AL    Alabama   06  2292160151   50523213
## 4       Pickens County     AL    Alabama   06  2282835044   22621093
## 5       Colbert County     AL    Alabama   06  1535742270   79160396
## 6        La Paz County     AZ    Arizona   06 11646086560   36514347
## 7        Apache County     AZ    Arizona   06 29003497233   54139714
## 8  Little River County     AR   Arkansas   06  1375943077   83115227
## 9      Randolph County     AR   Arkansas   06  1688445989   10370823
## 10  Los Angeles County     CA California   06 10515988166 1785003207
##                          geometry
## 1  MULTIPOLYGON (((-85.71209 3...
## 2  MULTIPOLYGON (((-88.47323 3...
## 3  MULTIPOLYGON (((-85.74803 3...
## 4  MULTIPOLYGON (((-88.34043 3...
## 5  MULTIPOLYGON (((-88.13925 3...
## 6  MULTIPOLYGON (((-114.7312 3...
## 7  MULTIPOLYGON (((-110.0007 3...
## 8  MULTIPOLYGON (((-94.48558 3...
## 9  MULTIPOLYGON (((-91.40687 3...
## 10 MULTIPOLYGON (((-118.6044 3...

names(QoL)

##  [1] "FIPS"                "FairPoorHealth"      "UnhealthyPhysDays"  
##  [4] "UnhealthyMentDays"   "PovertyRate2022"     "FullTimeWorkPoverty"
##  [7] "Under5Poverty"       "Over75Poverty"       "HighRentBurden"     
## [10] "MobileHomePct"       "Crowding"            "NoHSDiploma"        
## [13] "InjuryDeathRate"     "WatershedImpact"     "PM25Days"           
## [16] "OzoneDays"           "STATEFP"             "COUNTYFP"           
## [19] "COUNTYNS"            "AFFGEOID"            "GEOID"              
## [22] "NAME"                "NAMELSAD"            "STUSPS"             
## [25] "STATE_NAME"          "LSAD"                "ALAND"              
## [28] "AWATER"              "geometry"

Next, we will be working with scaled data. \[z = \frac{x - \bar{x}}{\textrm{sd}(x)}\] so that the variable is rescaled to have a mean value of zero, and a standard deviation of one. This means that we are measuring relative fluctuations in each individual variable across counties, rather than measuring all variables on the same scale.

PCA

Next, we will be working with scaled data - that is we will replace the chemical concentration measurements by their z-scores. This means that we are measuring relative fluctuations in each individual chemical across geography, rather than measuring all concentrations on the same scale.

# Standardize the selected columns
QoL.scaled <- scale(as.matrix(QoL[, 2:16]))

# Running PCA
pca <- princomp(QoL.scaled, cor = FALSE, scores = TRUE)  # using covariance matrix

# Show results of PCA
pca$loadings

## 
## Loadings:
##                     Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## FairPoorHealth       0.398                0.119  0.106                     
## UnhealthyPhysDays    0.388         0.159  0.178                            
## UnhealthyMentDays    0.287  0.328  0.304  0.161                       0.137
## PovertyRate2022      0.334               -0.182        -0.256              
## FullTimeWorkPoverty  0.237 -0.148 -0.196 -0.422         0.220        -0.593
## Under5Poverty        0.309               -0.182         0.226        -0.277
## Over75Poverty        0.228 -0.174 -0.219 -0.154        -0.827              
## HighRentBurden              0.353 -0.101 -0.620 -0.288  0.164 -0.144  0.517
## MobileHomePct        0.307 -0.147  0.136                              0.177
## Crowding             0.139        -0.531  0.341 -0.358         0.459  0.198
## NoHSDiploma          0.342        -0.218  0.108  0.140  0.127  0.142       
## InjuryDeathRate      0.176 -0.173  0.459  0.187 -0.529        -0.272 -0.177
## WatershedImpact     -0.109  0.472  0.177        -0.347 -0.288  0.515 -0.304
## PM25Days             0.100  0.547                0.500               -0.196
## OzoneDays                   0.352 -0.442  0.308 -0.290        -0.615 -0.183
##                     Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## FairPoorHealth                      0.109   0.120   0.230   0.549   0.638 
## UnhealthyPhysDays           0.240           0.137           0.423  -0.716 
## UnhealthyMentDays   -0.158  0.388           0.144  -0.467  -0.433   0.248 
## PovertyRate2022     -0.136         -0.693           0.480  -0.196         
## FullTimeWorkPoverty -0.334  0.264          -0.250  -0.223                 
## Under5Poverty        0.599 -0.328  -0.188   0.425  -0.164  -0.148         
## Over75Poverty        0.143          0.260          -0.249                 
## HighRentBurden                      0.222  -0.130                         
## MobileHomePct       -0.439 -0.714          -0.124  -0.294                 
## Crowding             0.162         -0.148  -0.339  -0.172                 
## NoHSDiploma         -0.163          0.527   0.138   0.440  -0.483  -0.109 
## InjuryDeathRate      0.221          0.213  -0.421   0.172                 
## WatershedImpact     -0.212 -0.266           0.132   0.137   0.117         
## PM25Days             0.277 -0.130          -0.538                         
## OzoneDays           -0.194                  0.205                         
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.067  0.067  0.067  0.067  0.067  0.067  0.067  0.067  0.067
## Cumulative Var  0.067  0.133  0.200  0.267  0.333  0.400  0.467  0.533  0.600
##                Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## SS loadings      1.000   1.000   1.000   1.000   1.000   1.000
## Proportion Var   0.067   0.067   0.067   0.067   0.067   0.067
## Cumulative Var   0.667   0.733   0.800   0.867   0.933   1.000

# Extracting the first principal component scores
QoL$pc1 <- pca$scores[, 1]

QoL$pc2 <- pca$scores[, 2]

# Visualizing the first principal component
# Check if 'QoL$geometry' is a spatial object or convert it to one if needed
ggplot(data = QoL$geometry) +
 geom_sf(aes(fill = QoL$pc1)) +
  scale_fill_viridis_c(option = "C", begin = 0, end = 1, name = "PC1 Score") +
  labs(title = "Spatial Distribution of PC1 Scores",
       caption = "Data source: Your Data") +
  theme_void()

# Check if 'QoL$geometry' is a spatial object or convert it to one if needed
ggplot(data = QoL$geometry) +
 geom_sf(aes(fill = QoL$pc2)) +
  scale_fill_viridis_c(option = "C", begin = 0, end = 1, name = "PC2 Score") +
  labs(title = "Spatial Distribution of PC2 Scores",
       caption = "Data source: Your Data") +
  theme_void()

Outliers

boxplot(QoL$pc1,horizontal = TRUE)

boxplot(QoL$pc2,horizontal = TRUE)

outl_1 <- QoL$NAME[which(QoL$pc1 %in% boxplot(QoL$pc1,plot=FALSE)$out)]
outl_1

##  [1] "Clinch"        "Hancock"       "Wheeler"       "Clay"         
##  [5] "Knott"         "Lee"           "McCreary"      "Magoffin"     
##  [9] "Wolfe"         "East Carroll"  "Madison"       "Buffalo"      
## [13] "Oglala Lakota" "Todd"          "Brooks"        "Dimmit"       
## [17] "Hudspeth"      "Kenedy"        "Presidio"      "Starr"        
## [21] "Zapata"        "McDowell"      "Mingo"

outl_2 <- QoL$NAME[which(QoL$pc2 %in% boxplot(QoL$pc2,plot=FALSE)$out)]
outl_2

##  [1] "Apache"         "La Paz"         "Fresno"         "Kern"          
##  [5] "Kings"          "Los Angeles"    "Madera"         "Merced"        
##  [9] "Riverside"      "San Bernardino" "Stanislaus"     "Tulare"        
## [13] "Dolores"        "Berrien"        "Wheatland"      "Banner"        
## [17] "Hayes"          "Logan"          "Loup"           "McPherson"     
## [21] "Sioux"          "Valley"         "McKinley"       "Mora"          
## [25] "Rio Arriba"     "Billings"       "Dunn"           "Mountrail"     
## [29] "Portage"        "Delaware"       "Lebanon"        "Buffalo"       
## [33] "Corson"         "Dewey"          "Faulk"          "Haakon"        
## [37] "Harding"        "Jackson"        "Jerauld"        "Jones"         
## [41] "McPherson"      "Mellette"       "Oglala Lakota"  "Todd"          
## [45] "Hudspeth"       "Kenedy"         "King"           "Sutton"        
## [49] "Carbon"         "Crook"          "Hot Springs"    "Platte"

GWPCA

To find such outliers the procedure is relatively simple - instead of doing a PCA on all of the data, for each sample we do a PCA on data falling into a window centred on the location of that sample. In that way we can check whether the sample is like its neighbours or not, from a multivariate viewpoint.

# Calculate centroids of the geometry
ct_centroids <- QoL$geometry %>% 
  st_geometry() %>% 
  st_centroid()
 
# Assuming 'gw.dist' needs the coordinates in a specific format
d <- gw.dist(dp.locat = st_coordinates(ct_centroids), longlat = TRUE)

QoL_sf<- st_as_sf(QoL)
spdf <- as(QoL_sf, "Spatial")

# Choose the bandwidth=2951
# bw.gw.pca <- bw.gwpca(spdf, 
#                       vars = colnames(spdf@data)[2:16],
#                       k = 5,
#                       robust = FALSE, 
#                       adaptive = TRUE)

# Now you should be able to use 'ct_centroids' in 'gwpca',16
pca.gw <- gwpca(spdf , vars = colnames(spdf@data)[2:16], bw = 2951 , k = 5, dMat = d, adaptive = TRUE)

Visualized and interpreted GWPCA output

how data dimensionality varies spatially and
how the original variables influence the components.

For the former, the spatial distribution of local PTV for say, the frst two components can be mapped.

# function for calculation pproportion of variance 
prop.var <- function(gwpca.obj, n.components) {
            return((rowSums(gwpca.obj$var[, 1:n.components]) /rowSums(gwpca.obj$var)) * 100)
                                   }

var.gwpca <- prop.var( pca.gw, 5)
QoL$var.gwpca <- var.gwpca

# Convert SpatialPolygonsDataFrame back to an sf object
spdf_sf <- st_as_sf(spdf)


# Iterate over the first 5 principal components
for (i in 1:5) {
  # Extract the loadings for the i-th component
  local.loadings <- pca.gw$loadings[, , i]  # assuming loadings are stored in 3D array format
  lead.item <- colnames(local.loadings)[max.col(abs(local.loadings))]
  
  # Add the lead item as a new column in the spdf_sf data
  spdf_sf$lead.item <- lead.item
  
  # Plotting the spatial distribution of the lead item for the i-th component
  plot_title <- paste("Spatial Distribution of Lead Item Scores: PC", i)
  
  p <- ggplot(data = spdf_sf) +
    geom_sf(aes(fill = as.factor(lead.item)), show.legend = TRUE) +  # Convert to factor for categorical coloring
    scale_fill_viridis_d(option = "C", name = "Lead Item Score") +  # Using discrete color scale
    labs(title = plot_title,
         caption = "Data source: Your Data") +
    theme_void()
  
  # Print the plot
  print(p)
  
  # # Optionally, save the plot to a file
  # ggsave(filename = paste("PC", i, "Lead_Item_Scores.png", sep = "_"), plot = p, width = 10, height = 8)
}

GWPCA

Yang Cheng

2024-07-22

Set up library

Data

Read Data

PCA

Outliers

GWPCA

Visualized and interpreted GWPCA output