Problem Background

Identifying and characterizing urban vulnerability to heat is a key step in designing intervention strategies to combat negative consequences of extreme heat on human health. In this study, we seek to identify the most vulnerable populations to extreme heat by analyzing the impact of social-demographics and minimum daily temperature on the risk of mortality for the 1487 census block groups in the greater Houston area. The dataset we consider for this analysis has the following variables collected on each of the 1487 census block groups in Houston:

Variable	Description
NOAC	The percent of homes without air conditioning in the block group
MED_AGE	the median age of persons living in the block group
HispanicPC	Percent hispanic in the block group
BlackPCT	Percent black in the block group
under5PCT	Percent under 5 years in the block group
over65PCT	Percent over 65 years in the block group
povertyPCT	Percent living below poverty line in the block group
alonePCT	Percent living alone in the block group
MinTemp	Average minimum summer temperature in the block group
Count	The count of number of heat-related morbidities
Population	The population of the census block group

Reading in and Handling Shapefiles

library(ggplot2)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(tidyr)
library(car)

## Loading required package: carData

library(MASS)
library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(readr)
library(sf)

## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:MASS':
## 
##     select

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(spdep)

## Loading required package: spData

## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`

library(RSpectra)

myShp <- st_read("HoustonHeat.shp")

## Reading layer `HoustonHeat' from data source 
##   `/Users/guozhupeng/Documents/Stat 469/HoustonHeat.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1487 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -95.7332 ymin: 29.52989 xmax: -95.02184 ymax: 30.16357
## Geodetic CRS:  WGS 84

myShpDF <- myShp %>%
  st_drop_geometry()

1. Exploratory plots of the data by looking at the relationship between log(Count+1) (the response variable) and a few of the explanatory variables. Note that we explore log(Count+1) here because Poisson regression is log-linear and we arbitrarily add one to the counts because log(0) = -Inf.

ggplot(myShpDF, aes(x = MinTemp, y = log(Count+1))) +
  geom_point() +
  geom_smooth() + 
  labs(title = "Log of Number of Heat-related Morbidities by Average Minimum Temperature",
       x = "Average Minimum Temperature in the Summer",
       y = "Log of Number of Heat-related Morbidities")

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

From the graph above, we can see that the blue line is fairly flat. It indicators that the increase in the average minimum temperatures seems to not have much effects on the number of heat-related morbidities.

ggplot(myShpDF, aes(x = Population, y = log(Count+1))) +
  geom_point() +
  geom_smooth() + 
  labs(title = "Log of Number of Heat-related Morbidities by Population",
       x = "Population",
       y = "Log of Number of Heat-related Morbidities")

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

From the graph above, we can see that the blue line is increasing at the beginning and it turns fairly flat. It indicators that the increase in the population seems to not have much effects on the number of heat-related morbidities.

ggplot(myShpDF, aes(x = alonePCT, y = log(Count+1))) +
  geom_point() +
  geom_smooth() + 
  labs(title = "Log of Number of Heat-related Morbidities by Percentage of Living Alone",
       x = "Percentage of Living Alone",
       y = "Log of Number of Heat-related Morbidities")

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

From the graph above, we can see that the blue line is fairly flat. It indicators that the increase in the percentage of living alone seems to not have much effects on the number of heat-related morbidities. But it still worth it exploring it later on.

2. Fit an independent MLR model with a linear effect between log(Count+1) and all the explanatory variables.

sf::sf_use_s2(FALSE)

## Spherical geometry (s2) switched off

model.lm <- lm(log(Count+1) ~ NOAC + MED_AGE + HispanicPC + BlackPCT + under5PCT + over65PCT + povertyPCT + alonePCT + MinTemp + Population, data = myShp)

residuals <- resid(model.lm)

ggplot(data=myShp) +
  geom_sf(mapping=aes(fill=residuals))

moran.test(x=residuals, listw = nb2listw(poly2nb(myShp)))

## although coordinates are longitude/latitude, st_intersects assumes that they
## are planar

## 
##  Moran I test under randomisation
## 
## data:  residuals  
## weights: nb2listw(poly2nb(myShp))    
## 
## Moran I statistic standard deviate = 15.736, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.2338486649     -0.0006729475      0.0002221256

From the graph above, we can see that areas around dark areas tend to be darker. I conducted a Moran’s I test on the residuals. Here is the result from the Moran’s I test. Since the p-value is 2.2*10^(-16), which is less than 0.05. We conclude that there is spatial correlation in the residuals.

3. A spatial GLM model for analyzing the mortality data in terms of parameters that includes spatial basis functions to capture the spatial correlation in the data.

The model we’ll use for this problem is a spatial GLM model, y ~ Pois(\(\lambda\)). y represents count of number of heat-related morbidities, and \(\lambda\) is estimated by \(log(\lambda) = X^{'} \beta + b^{'} \theta\). \(X^{'}\) represents a matrix of variables like HispanicPC, MinTemp, and so on. It also contains a column of 1 as the intercept. \(\beta\) represents the coefficients of each variables in \(X^{'}\). For example, for every unit increase in \(X_{MinTemp}\) , the log of the mean count of number of heat-related morbidities counts increases by \(\beta_{MinTemp}\). \(b^{'}\) represents the spatial bases, which is generated from the spatial data. \(\theta\) is the coefficients of each spatial basis in \(b^{'}\). The spatial basis functions are similar to error terms, and they will adjust the value of log of \(\lambda\).

4. Fit the spatial GLM model using Moran basis function and validate any assumptions I made to fit the model.

A <- nb2mat(poly2nb(myShp), style = "B")

## although coordinates are longitude/latitude, st_intersects assumes that they
## are planar

X <- model.matrix(object = log(Count+1) ~ NOAC + MED_AGE + HispanicPC + BlackPCT + under5PCT + over65PCT + povertyPCT + alonePCT + MinTemp + Population, data = myShp)

source("moranBasis.R")
M <- moranBasis(X, A, tol=0.95)

myShpDF <- bind_cols(myShpDF, M)

model.glm <- glm(formula= Count ~ ., data=myShpDF, family=poisson)

Here I fitted my spatial GLM model using Moran basis function.

(1) The Assumption of Linearity

avPlots(model.lm)

From the graphs above, we can see that there is straight line in each graph. So, the assumption of linearity is met.

(2) The Assumption of Independence

decorrelated_residuals <- resid(model.glm)
moran.test(x=decorrelated_residuals, listw = nb2listw(poly2nb(myShp)))

## although coordinates are longitude/latitude, st_intersects assumes that they
## are planar

## 
##  Moran I test under randomisation
## 
## data:  decorrelated_residuals  
## weights: nb2listw(poly2nb(myShp))    
## 
## Moran I statistic standard deviate = -11.326, p-value = 1
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##     -0.1694150910     -0.0006729475      0.0002219758

Here I decorrelated the residuals and performed a Moran’s I test. The p-value of the test is 1, which is greater than 0.05. So, there is no more spatial correlation. The assumption of independence is met.

(3) Draw a choropleth map of the standardized and decorrelated residuals to visually verify that the residuals are no longer spatially correlated.

ggplot(data=myShp) + 
  geom_sf(mapping=aes(fill=decorrelated_residuals), color="black") + 
  scale_fill_distiller(palette="Spectral")

From the graph above, we can see that an area seem not to have much effects to areas around it. So the residuals are no longer spatially correlated.

(4) The assumption of Equal Variance

log_fitted_values <- log(fitted(model.glm))

plot(log_fitted_values, decorrelated_residuals, 
     xlab = "Fitted Values", ylab = "Decorrelated Residuals",
     main = "Scatterplot of Log of Fitted Values vs Decorrelated Residuals")
abline(h = 0, col = "red", lty = 2)  # Add a horizontal line at y = 0

From the graph, we can see that points of the Log of Fitted Values vs Decorrelated Residuals scatter plot scattered around the red line since the response variable is counts. Overall, the pattern is good. We can conclude that the equal variance assumption for this model is met.

5. Calculate confidence intervals for the effect of each explanatory variable included in the model to determine who is at greatest risk for heat-related mortality based on the estimated effects.

summary <- summary(model.glm)

summary$coefficients[1:11,]

##                  Estimate   Std. Error   z value      Pr(>|z|)
## (Intercept) -1.639864e+01 2.187133e+00 -7.497779  6.490828e-14
## Population   4.196629e-04 1.705326e-05 24.608961 1.012875e-133
## NOAC         2.390184e-01 1.029790e-01  2.321041  2.028466e-02
## MED_AGE     -6.133867e-03 6.769988e-03 -0.906038  3.649157e-01
## HispanicPC   6.530482e-01 1.451324e-01  4.499673  6.805812e-06
## BlackPCT    -1.639188e-02 1.081152e-01 -0.151615  8.794906e-01
## under5PCT   -5.066456e+00 1.103116e+00 -4.592860  4.372127e-06
## over65PCT    4.967054e+00 5.541197e-01  8.963865  3.134939e-19
## povertyPCT   4.582490e-01 1.994188e-01  2.297923  2.156618e-02
## alonePCT     7.144809e-01 2.163013e-01  3.303175  9.559681e-04
## MinTemp      6.483888e-01 8.762316e-02  7.399743  1.364488e-13

Here is a summary of the estimates from my GLM model.

CI <- confint(model.glm, parm=c("NOAC", "MED_AGE", "HispanicPC","BlackPCT", "under5PCT", "over65PCT", "povertyPCT", "alonePCT", "MinTemp" ,"Population"))

## Waiting for profiling to be done...

CI

##                    2.5 %        97.5 %
## NOAC        0.0368230144  0.4406023383
## MED_AGE    -0.0192696852  0.0072645190
## HispanicPC  0.3720719733  0.9413581568
## BlackPCT   -0.2289538841  0.1950813843
## under5PCT  -7.2556629382 -2.9279407427
## over65PCT   3.8825348973  6.0553685846
## povertyPCT  0.0659408429  0.8478234198
## alonePCT    0.2873925443  1.1357105584
## MinTemp     0.4778030036  0.8216642800
## Population  0.0003863967  0.0004532833

lower_bound <- CI[6, 1]
upper_bound <- CI[6, 2]

Here is the 95% confidence intervals for each explanatory variable.

Based on the information above, variables like NOAC, HispanicPC, over65PCT, alonePCT, MinTemp, Population have positive effects on the count of number of heat-related morbidities since both bounds of their 95% confidence interval are positive. But the over65PCT has the greatest effects since the 95% confidence interval of over65PCT is between 3.8825349and 6.0553686. So people who are older than 65 years old are at greatest risk for heat-related mortality.

6. A map of the correlated residuals to try and reach conclusions about areas at risk of heat-related mortality not explained by the explanatory variables.

# Since High counts might caused by large population, looking count doesn't mean a lot. This is accounting population and other covariates in affecting the count.

theta <- as.matrix(coefficients(model.glm)[-c(1:11)])

bases_effects <- as.matrix(M) %*% theta

ggplot(data=myShp) + 
  geom_sf(mapping=aes(fill=bases_effects), color="black") + 
  scale_fill_distiller(palette="Spectral")

From the graph above, we can see that areas outside of the center are orange. We can conclude that areas outside of the center of Huston are at higher risk for heat-related mortality.

Heat-related Analysis

Spencer Guo

2024-10-07