Problem Background

Identifying and characterizing urban vulnerability to heat is a key step in designing intervention strategies to combat negative consequences of extreme heat on human health. In this study, we seek to identify the most vulnerable populations to extreme heat by analyzing the impact of social-demographics and minimum daily temperature on the risk of mortality for the 1487 census block groups in the greater Houston area. The dataset we consider for this analysis has the following variables collected on each of the 1487 census block groups in Houston:

Variable Description
NOAC The percent of homes without air conditioning in the block group
MED_AGE the median age of persons living in the block group
HispanicPC Percent hispanic in the block group
BlackPCT Percent black in the block group
under5PCT Percent under 5 years in the block group
over65PCT Percent over 65 years in the block group
povertyPCT Percent living below poverty line in the block group
alonePCT Percent living alone in the block group
MinTemp Average minimum summer temperature in the block group
Count The count of number of heat-related morbidities
Population The population of the census block group

Reading in and Handling Shapefiles

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(tidyr)
library(car)
## Loading required package: carData
library(MASS)
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(readr)
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(spdep)
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
library(RSpectra)

myShp <- st_read("HoustonHeat.shp")
## Reading layer `HoustonHeat' from data source 
##   `/Users/guozhupeng/Documents/Stat 469/HoustonHeat.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1487 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -95.7332 ymin: 29.52989 xmax: -95.02184 ymax: 30.16357
## Geodetic CRS:  WGS 84
myShpDF <- myShp %>%
  st_drop_geometry() 

1. Exploratory plots of the data by looking at the relationship between log(Count+1) (the response variable) and a few of the explanatory variables. Note that we explore log(Count+1) here because Poisson regression is log-linear and we arbitrarily add one to the counts because log(0) = -Inf.

ggplot(myShpDF, aes(x = MinTemp, y = log(Count+1))) +
  geom_point() +
  geom_smooth() + 
  labs(title = "Log of Number of Heat-related Morbidities by Average Minimum Temperature",
       x = "Average Minimum Temperature in the Summer",
       y = "Log of Number of Heat-related Morbidities") 
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

From the graph above, we can see that the blue line is fairly flat. It indicators that the increase in the average minimum temperatures seems to not have much effects on the number of heat-related morbidities.

ggplot(myShpDF, aes(x = Population, y = log(Count+1))) +
  geom_point() +
  geom_smooth() + 
  labs(title = "Log of Number of Heat-related Morbidities by Population",
       x = "Population",
       y = "Log of Number of Heat-related Morbidities") 
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

From the graph above, we can see that the blue line is increasing at the beginning and it turns fairly flat. It indicators that the increase in the population seems to not have much effects on the number of heat-related morbidities.

ggplot(myShpDF, aes(x = alonePCT, y = log(Count+1))) +
  geom_point() +
  geom_smooth() + 
  labs(title = "Log of Number of Heat-related Morbidities by Percentage of Living Alone",
       x = "Percentage of Living Alone",
       y = "Log of Number of Heat-related Morbidities") 
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

From the graph above, we can see that the blue line is fairly flat. It indicators that the increase in the percentage of living alone seems to not have much effects on the number of heat-related morbidities. But it still worth it exploring it later on.

2. Fit an independent MLR model with a linear effect between log(Count+1) and all the explanatory variables.

sf::sf_use_s2(FALSE) 
## Spherical geometry (s2) switched off
model.lm <- lm(log(Count+1) ~ NOAC + MED_AGE + HispanicPC + BlackPCT + under5PCT + over65PCT + povertyPCT + alonePCT + MinTemp + Population, data = myShp)

residuals <- resid(model.lm)

ggplot(data=myShp) +
  geom_sf(mapping=aes(fill=residuals))

moran.test(x=residuals, listw = nb2listw(poly2nb(myShp)))
## although coordinates are longitude/latitude, st_intersects assumes that they
## are planar
## 
##  Moran I test under randomisation
## 
## data:  residuals  
## weights: nb2listw(poly2nb(myShp))    
## 
## Moran I statistic standard deviate = 15.736, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.2338486649     -0.0006729475      0.0002221256

From the graph above, we can see that areas around dark areas tend to be darker. I conducted a Moran’s I test on the residuals. Here is the result from the Moran’s I test. Since the p-value is 2.2*10^(-16), which is less than 0.05. We conclude that there is spatial correlation in the residuals.

3. A spatial GLM model for analyzing the mortality data in terms of parameters that includes spatial basis functions to capture the spatial correlation in the data.

The model we’ll use for this problem is a spatial GLM model, y ~ Pois(\(\lambda\)). y represents count of number of heat-related morbidities, and \(\lambda\) is estimated by \(log(\lambda) = X^{'} \beta + b^{'} \theta\). \(X^{'}\) represents a matrix of variables like HispanicPC, MinTemp, and so on. It also contains a column of 1 as the intercept. \(\beta\) represents the coefficients of each variables in \(X^{'}\). For example, for every unit increase in \(X_{MinTemp}\) , the log of the mean count of number of heat-related morbidities counts increases by \(\beta_{MinTemp}\). \(b^{'}\) represents the spatial bases, which is generated from the spatial data. \(\theta\) is the coefficients of each spatial basis in \(b^{'}\). The spatial basis functions are similar to error terms, and they will adjust the value of log of \(\lambda\).

4. Fit the spatial GLM model using Moran basis function and validate any assumptions I made to fit the model.

A <- nb2mat(poly2nb(myShp), style = "B")
## although coordinates are longitude/latitude, st_intersects assumes that they
## are planar
X <- model.matrix(object = log(Count+1) ~ NOAC + MED_AGE + HispanicPC + BlackPCT + under5PCT + over65PCT + povertyPCT + alonePCT + MinTemp + Population, data = myShp)

source("moranBasis.R")
M <- moranBasis(X, A, tol=0.95)

myShpDF <- bind_cols(myShpDF, M)

model.glm <- glm(formula= Count ~ ., data=myShpDF, family=poisson)

Here I fitted my spatial GLM model using Moran basis function.

(1) The Assumption of Linearity

avPlots(model.lm)

From the graphs above, we can see that there is straight line in each graph. So, the assumption of linearity is met.

(2) The Assumption of Independence

decorrelated_residuals <- resid(model.glm)
moran.test(x=decorrelated_residuals, listw = nb2listw(poly2nb(myShp)))
## although coordinates are longitude/latitude, st_intersects assumes that they
## are planar
## 
##  Moran I test under randomisation
## 
## data:  decorrelated_residuals  
## weights: nb2listw(poly2nb(myShp))    
## 
## Moran I statistic standard deviate = -11.326, p-value = 1
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##     -0.1694150910     -0.0006729475      0.0002219758

Here I decorrelated the residuals and performed a Moran’s I test. The p-value of the test is 1, which is greater than 0.05. So, there is no more spatial correlation. The assumption of independence is met.

(3) Draw a choropleth map of the standardized and decorrelated residuals to visually verify that the residuals are no longer spatially correlated.

ggplot(data=myShp) + 
  geom_sf(mapping=aes(fill=decorrelated_residuals), color="black") + 
  scale_fill_distiller(palette="Spectral")

From the graph above, we can see that an area seem not to have much effects to areas around it. So the residuals are no longer spatially correlated.

(4) The assumption of Equal Variance

log_fitted_values <- log(fitted(model.glm))

plot(log_fitted_values, decorrelated_residuals, 
     xlab = "Fitted Values", ylab = "Decorrelated Residuals",
     main = "Scatterplot of Log of Fitted Values vs Decorrelated Residuals")
abline(h = 0, col = "red", lty = 2)  # Add a horizontal line at y = 0

From the graph, we can see that points of the Log of Fitted Values vs Decorrelated Residuals scatter plot scattered around the red line since the response variable is counts. Overall, the pattern is good. We can conclude that the equal variance assumption for this model is met.