Exploratory Analysis of Nighttime Radiance Patterns and Mental Health

Lei Kang, Lulu Cheng

April 24, 2016

Objectives

  • Explore the association between nighttime radiance intensity and mental disorder related deaths
  • Build county-level predictive models for mental disorder related deaths

Data

Centers for Disease Control and Prevention (CDC)

  • Get Data from here
  • Data Description
  • Variable: Cause of death by county and month in 2014 due to Mental and Behavioral Disorders

Nighttime VIIRS Day/Night Band data (special thanks to Jeff, Star, Alex)

SES Data

  • Get Data from here
  • County-level monthly poverty, education, employment, population information for 2014

Exploratory Data Analysis

Interactive Visualization Dashboard

Based on the interactive plot above, we observe a strong positive correlation between nighttime radiance intensity and mental disorder related deaths, even after controlling for population.
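As a rough check on that claim, one can regress population out of both variables and correlate the residuals; a minimal sketch, assuming a hypothetical county-level data frame d with columns deaths, radiance, and pop:

## partial correlation of radiance and deaths, controlling for population
res_deaths   <- resid(lm(deaths ~ pop, data = d))
res_radiance <- resid(lm(radiance ~ pop, data = d))
cor(res_deaths, res_radiance)  ## a positive value indicates the association persists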

Now, let’s explore the satellite images a bit. Due to time constraints, we focus on January 2014 only. We then extract 3220 county nightlight images using the following procedure:

  1. Extract pixel information from the county shapefile using the minimum coverage rectangle
  2. Replace the non-county region with 0 (no radiance)
  3. Resample images of different sizes onto a median-size reference image (103 * 130 pixels) so that all counties are comparable in terms of lighting patterns; a sketch of this pipeline is given after the list, and the figure below shows a before-and-after comparison of the normalization.
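A minimal sketch of this extraction pipeline using the raster package; the input names viirs (the January 2014 VIIRS raster), county_poly (one county's polygon), and the reference grid ref are hypothetical:

library(raster)

## median-size reference grid (103 x 130) shared by all counties; extent is arbitrary
ref <- raster(nrows = 103, ncols = 130, xmn = 0, xmx = 130, ymn = 0, ymx = 103)

## `viirs` is the January 2014 VIIRS radiance raster (hypothetical name);
## `county_poly` is a single county's polygon read from the county shapefile
normalize_county <- function(viirs, county_poly, ref){
  img <- crop(viirs, extent(county_poly))         ## minimum coverage rectangle
  img <- mask(img, county_poly, updatevalue = 0)  ## set non-county pixels to 0 (no radiance)
  extent(img) <- extent(ref)                      ## compare pixel patterns, not geography
  resample(img, ref, method = "bilinear")         ## project onto the 103 x 130 reference
}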

(Figure: a county nightlight image before and after normalization onto the 103 * 130 reference grid.)

  4. Use PCA to construct eigen-county patterns. The first eigen-county map contains a lot of noise: the image background is generally dark and uninformative, yet background pixels account for the majority of each image, so the first eigen-county mostly picks up this noise, as shown below. Eigen-counties 2, 3, and 4, however, begin to represent distinct lighting patterns (see the figure below; a sketch of the computation follows it). Note that we do not rescale individual images, because we care more about county-to-county comparison.

(Figure: eigen-county maps from PCA; the first is dominated by background noise, while eigen-counties 2-4 show lighting patterns.)
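A sketch of how such eigen-county maps can be computed, assuming imgs is a hypothetical 3220 * 13390 matrix holding one vectorized 103 * 130 county image per row:

## PCA on the flattened county images; images are not individually rescaled
pca <- prcomp(imgs, center = TRUE, scale. = FALSE)

## reshape a principal-component loading vector back into image form
eigen_county <- function(pca, k) matrix(pca$rotation[, k], nrow = 103, ncol = 130)

image(eigen_county(pca, 1))  ## mostly background noise
image(eigen_county(pca, 2))  ## begins to show a lighting pattern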

  5. Use PCA to reduce the dimension from 103 * 130 = 13,390 pixels to 8 components and perform K-means clustering on the reduced data. The number of clusters is determined using residual-sum-of-squares reduction, stability-based K-means, and a dendrogram together, and we end up with K = 9 clusters. The idea of stability-based K-means is this: if we perturb the data a little (say, drop 20% or 40% of the observations) and run K-means on two random subsamples of the original data, how much do the cluster assignments of the two subsamples agree with each other? One can use the percentage of matched observation pairs as a metric; the closer this metric is to 1, the more stable the clustering result. We decide on K = 9 because the matching metric changes only insignificantly after K = 9 and the within-cluster sum of squares also starts to decay slowly after K = 9. Both results are consistent with the hierarchical clustering dendrogram.
######## stability-based K-means clustering: decide the number of clusters
get_stable <- function(data, k, m){  ## k = number of clusters, m = subsample proportion

  size <- m * nrow(data)
  data$id <- 1:nrow(data)

  ## draw two random subsamples of the data
  sample1_id <- sample(1:nrow(data), size, replace = FALSE)
  sample2_id <- sample(1:nrow(data), size, replace = FALSE)

  sub_1 <- data[sample1_id, ]
  sub_2 <- data[sample2_id, ]

  ## K-means on each subsample; columns 2:9 hold the 8 PCA scores;
  ## a large iter.max tries to alleviate the impact of random starting values
  result_1 <- kmeans(sub_1[2:9], centers = k, algorithm = "Lloyd", iter.max = 1000)$cluster
  result_2 <- kmeans(sub_2[2:9], centers = k, algorithm = "Lloyd", iter.max = 1000)$cluster

  ## attach cluster labels to observation ids
  sub_1_clust <- data.frame(id = sub_1$id, cluster = result_1)
  sub_2_clust <- data.frame(id = sub_2$id, cluster = result_2)

  ## keep observations appearing in both subsamples (1-1 match on id)
  intersect <- merge(sub_1_clust, sub_2_clust, by = "id")
  clust_1 <- as.integer(intersect$cluster.x)
  clust_2 <- as.integer(intersect$cluster.y)

  ## co-membership matrices: C[i, j] = 1 if obs i and j fall in the same cluster
  inter.dim <- dim(intersect)[1]
  C_1 <- matrix(clust_1, nrow = inter.dim, ncol = inter.dim) ==
    matrix(clust_1, nrow = inter.dim, ncol = inter.dim, byrow = TRUE)
  C_2 <- matrix(clust_2, nrow = inter.dim, ncol = inter.dim) ==
    matrix(clust_2, nrow = inter.dim, ncol = inter.dim, byrow = TRUE)
  diag(C_1) <- 0
  diag(C_2) <- 0

  ## similarity between the two clusterings
  jaccard  <- sum(C_1 * C_2) / (sum(C_1) + sum(C_2) - sum(C_1 * C_2))
  matching <- (sum(C_1 * C_2) + sum((1 - C_1) * (1 - C_2))) /
    (sum(C_1 * C_2) + sum((1 - C_1) * (1 - C_2)) + sum((1 - C_1) * C_2) + sum((1 - C_2) * C_1))
  corr     <- sum(C_1 * C_2) / sqrt(sum(C_1) * sum(C_2))
  return(c(jaccard, matching, corr))
}


## storage for 100 replications at each K = 2, ..., 15 (row k-1 corresponds to K = k)
stable_result_jac <- matrix(0, nrow = 14, ncol = 100)
stable_result_mat <- matrix(0, nrow = 14, ncol = 100)
stable_result_cor <- matrix(0, nrow = 14, ncol = 100)


################## run stability-based K-means; `withname` holds the 8 PCA scores in columns 2:9
for (k in 2:15){
  for (i in 1:100){
    dd <- get_stable(withname, k = k, m = 0.6)  ## 60% subsamples
    stable_result_jac[k - 1, i] <- dd[1]
    stable_result_mat[k - 1, i] <- dd[2]
    stable_result_cor[k - 1, i] <- dd[3]
  }
}

jac  <- data.frame(t(stable_result_jac))
mat2 <- data.frame(t(stable_result_mat))
cor  <- data.frame(t(stable_result_cor))
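With the replications in hand, stability per K can be summarized directly; a minimal sketch using the matching metric (each column of mat2 corresponds to one value of K):

## mean matching metric for K = 2, ..., 15, averaged over the 100 replications
plot(2:15, colMeans(mat2), type = "b",
     xlab = "number of clusters K", ylab = "mean matching metric")
abline(v = 9, lty = 2)  ## the metric flattens out after K = 9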

(Figure: stability metrics versus the number of clusters K.)

The following figure presents selected raw images from two representative clusters. Cluster 1 tends to have its nighttime radiance concentrated in a compact center, while cluster 2 tends to be more sparsely lit. The two clusters capture inherently different nighttime lighting patterns.

(Figure: selected raw nightlight images from two representative clusters.)

Prediction Models for Death Counts

After exploring the satellite image data, we want to provide more useful insight into how it might influence mental disorder related deaths. To this end, we match county-level monthly mental disorder related death data with aggregated county-level radiance data for 2014. This yields 9,754 observations with matched death and nighttime radiance information, of which 75% is used as the training set and the remaining 25% as the test set; a sketch of the matching step is given after the feature list. We predict county-level monthly mental disorder related death counts using the following features:

  1. SES: population, poverty percentage, median income, unemployment rate, domestic migration rate, international migration rate, natural population growth rate, death rate, and the percentages of adults with less than a high school degree, a high school degree, a college degree, and a bachelor's degree or more
  2. Spatial data: county latitude and longitude
  3. Temporal data: month indicator
  4. Lighting data: county-wide sum of monthly average nighttime radiance
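The matched dataset new used below can be assembled roughly as follows; cdc_deaths, radiance_sum, and ses are hypothetical data frames keyed by county FIPS code (and month, where applicable):

## merge monthly CDC death counts with county-level radiance sums for 2014,
## then attach SES covariates (all input names are hypothetical)
new <- merge(cdc_deaths, radiance_sum, by = c("fips", "month"))
new <- merge(new, ses, by = "fips")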

We apply the Super Learner idea (ensemble learning), which combines outputs from the following 11 algorithms: ridge regression, lasso (glmnet), decision tree with pruning, multivariate adaptive regression splines (MARS), random forests, generalized additive model, SVM, neural network, linear regression, gradient boosting, and a simple mean baseline.

5-fold cross-validation is used to estimate the risk of each algorithm, and Super Learner then seeks the best convex combination of their predictions.

library(SuperLearner)
set.seed(1127)

## `new` is the matched county-month dataset; 75/25 train/test split
index <- sample(seq_len(nrow(new)), size = floor(0.75 * nrow(new)))
train <- new[index, ]
test  <- new[-index, ]

SL.library <- c('SL.ridge', 'SL.rpartPrune', 'SL.polymars', 'SL.mean', 'SL.randomForest',
                'SL.gam', 'SL.glm', 'SL.glmnet', 'SL.svm', 'SL.gbm', 'SL.nnet')

## feature columns used for both the training and test sets
features <- c("month", "pop", "per_povall", "median_income", "ump_rate",
              "R_DOMESTIC_MIG_2014", "R_INTERNATIONAL_MIG_2014", "R_NATURAL_INC_2014",
              "R_death_2014", "per_bechelor", "per_college", "per_highsch",
              "per_lesshighsch", "sum", "lat", "lng")
X     <- train[features]
testX <- test[features]

SL.out <- SuperLearner(Y = train$Deaths, X = X, SL.library = SL.library,
                       family = 'gaussian', cvControl = list(V = 5))
pred <- predict(SL.out, newdata = testX, onlySL = TRUE)

Random forests are found to have the best performance, followed by MARS and SVM. Accordingly, in the final Super Learner prediction the weight on the random forests output is very high, about 0.96. The other algorithms do not do a good job on this dataset, so their weights are estimated to be 0. The following output and plot show the prediction performance on the test set; the plot suggests that our predicted values are in line with the true values.

SL.out
## 
## Call:  
## SuperLearner(Y = train$Deaths, X = X, family = "gaussian", SL.library = SL.library,  
##     cvControl = list(V = 5)) 
## 
## 
##                          Risk        Coef
## SL.ridge_All         608.9955 0.000000000
## SL.rpartPrune_All    177.9456 0.000000000
## SL.polymars_All      211.6064 0.033813510
## SL.mean_All         2667.2209 0.000000000
## SL.randomForest_All   79.0766 0.961432697
## SL.gam_All           388.2769 0.000000000
## SL.glm_All           609.3036 0.000000000
## SL.glmnet_All        609.3605 0.000000000
## SL.svm_All           241.0309 0.004753793
## SL.gbm_All           201.5128 0.000000000
## SL.nnet_All         2666.4160 0.000000000
sqrt(mean((pred$pred-test$Deaths)^2)) ##RMSE
## [1] 9.163683
plot(pred$pred,test$Deaths,xlim=c(0,600),ylim=c(0,600))

cor(pred$pred,test$Deaths)
##           [,1]
## [1,] 0.9843591

Conclusions

To gain more insight into the importance of nighttime radiance in predicting mental disorder related deaths, we run a separate random forests model using 1,000 trees; see the importance measures below. After population, aggregated county-level monthly average nighttime radiance is the second most important predictor, even after controlling for other SES, spatial, and month effects. This gives a strong indication that nighttime radiance is associated with mental disorder related deaths. However, association is not causation, and we need to be careful in interpreting this result. Still, the predictive model does suggest that nighttime radiance is a good metric for predicting mental disorder related deaths, and presumably also happiness and well-being.

library(randomForest)
fit <- randomForest(Deaths ~ pop + per_povall + month +
                      median_income + ump_rate + R_DOMESTIC_MIG_2014 +
                      R_INTERNATIONAL_MIG_2014 + R_NATURAL_INC_2014 +
                      R_death_2014 + per_bechelor + per_college +
                      per_highsch + per_lesshighsch + sum + lat + lng,
                    data = train, importance = TRUE, ntree = 1000)
importance(fit)
##                           %IncMSE IncNodePurity
## pop                      76.87876    8578661.18
## per_povall               23.39711     247959.95
## month                    23.44638      91453.08
## median_income            24.92261     287429.68
## ump_rate                 16.67274     213105.54
## R_DOMESTIC_MIG_2014      24.78048     176341.25
## R_INTERNATIONAL_MIG_2014 29.12139    1054962.80
## R_NATURAL_INC_2014       30.18272     369282.90
## R_death_2014             37.77894     511571.33
## per_bechelor             22.13854     651536.97
## per_college              23.97069     209545.26
## per_highsch              25.31010     500739.77
## per_lesshighsch          17.77900     352156.25
## sum                      41.28070    5470166.82
## lat                      40.22816     386858.77
## lng                      34.61009     300946.63