Lei Kang, Lulu Cheng
April 24, 2016
Centers for Disease Control and Prevention (CDC)
Nighttime VIIRS Day/Night Band (special thanks to Jeff, Star, Alex)
SES Data
Interactive Visualization Dashboard
Based on the interactive plot above, we observe a strong positive correlation between nighttime radiance intensity and mental-disorder-related deaths, even after controlling for population.
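Outside the dashboard, the same population-adjusted association can be checked with a partial correlation. Below is a minimal sketch, assuming a merged data frame dat with hypothetical columns deaths, radiance, and pop (these names are illustrative placeholders, not our actual objects):
## regress deaths and radiance on population, then correlate the residuals,
## i.e., a partial correlation controlling for population
res_deaths <- resid(lm(deaths ~ pop, data = dat))
res_radiance <- resid(lm(radiance ~ pop, data = dat))
cor(res_deaths, res_radiance)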
Now, let's explore the satellite images a bit. Due to time constraints, we focus only on January 2014. Then, 3220 county nightlight images are extracted (a sketch of the extraction step is given below) and clustered using the following stability-based K-means procedure:
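A minimal sketch of the extraction step, assuming a monthly VIIRS Day/Night Band composite GeoTIFF and a county boundary shapefile; the file names and the raster/rgdal workflow here are illustrative assumptions, not the exact pipeline we ran:
library(raster)
library(rgdal)
## monthly VIIRS DNB composite and county polygons (hypothetical file names)
viirs <- raster("viirs_dnb_201401.tif")
counties <- readOGR("counties.shp")
## clip the radiance raster to each county polygon in turn
county_imgs <- lapply(seq_along(counties), function(i) {
  mask(crop(viirs, counties[i, ]), counties[i, ])
})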
######## Stability-based K-means clustering: decide the number of clusters
get_stable <- function(data, k, m) {  ## k = number of clusters, m = sample proportion
  size <- floor(m * nrow(data))
  data$id <- 1:nrow(data)
  ## draw two independent subsamples
  sample1_id <- sample(1:nrow(data), size, replace = FALSE)
  sample2_id <- sample(1:nrow(data), size, replace = FALSE)
  sub_1 <- data[sample1_id, ]
  sub_2 <- data[sample2_id, ]
  ## K-means on each subsample; generous iter.max to alleviate the impact
  ## of random starting values
  result_1 <- data.frame(kmeans(sub_1[2:9], centers = k, algorithm = "Lloyd", iter.max = 1000)$cluster)
  result_2 <- data.frame(kmeans(sub_2[2:9], centers = k, algorithm = "Lloyd", iter.max = 1000)$cluster)
  ## attach cluster labels to observation ids
  sub_1_clust <- data.frame(cbind(sub_1$id, result_1))
  names(sub_1_clust) <- c("id", "cluster")
  sub_2_clust <- data.frame(cbind(sub_2$id, result_2))
  names(sub_2_clust) <- c("id", "cluster")
  ## find observations appearing in both subsamples and their cluster memberships (1-1 match)
  common <- merge(sub_1_clust, sub_2_clust, by = "id")
  clust_1 <- as.integer(as.character(common$cluster.x))
  clust_2 <- as.integer(as.character(common$cluster.y))
  ## construct co-membership matrices C_ij: entry (i, j) is 1 if obs i and j share a cluster
  n <- nrow(common)
  C_1 <- matrix(clust_1, nrow = n, ncol = n) == matrix(clust_1, nrow = n, ncol = n, byrow = TRUE)
  C_2 <- matrix(clust_2, nrow = n, ncol = n) == matrix(clust_2, nrow = n, ncol = n, byrow = TRUE)
  diag(C_1) <- 0
  diag(C_2) <- 0
  ## similarity measures between the two co-membership matrices
  jaccard <- sum(C_1 * C_2) / (sum(C_1) + sum(C_2) - sum(C_1 * C_2))
  matching <- (sum(C_1 * C_2) + sum((1 - C_1) * (1 - C_2))) /
    (sum(C_1 * C_2) + sum((1 - C_1) * (1 - C_2)) + sum((1 - C_1) * C_2) + sum((1 - C_2) * C_1))
  corr <- sum(C_1 * C_2) / sqrt(sum(C_1) * sum(C_2))
  return(c(jaccard, matching, corr))
}
################## run stability-based K-means for k = 2, ..., 15 with 100 replicates each
stable_result_jac <- matrix(0, nrow = 14, ncol = 100)
stable_result_mat <- matrix(0, nrow = 14, ncol = 100)
stable_result_cor <- matrix(0, nrow = 14, ncol = 100)
for (k in 2:15) {
  for (i in 1:100) {
    dd <- get_stable(withname, k = k, m = 0.6)
    stable_result_jac[k - 1, i] <- dd[1]
    stable_result_mat[k - 1, i] <- dd[2]
    stable_result_cor[k - 1, i] <- dd[3]
  }
}
## transpose so each column holds the 100 replicates for one candidate k
jac <- data.frame(t(stable_result_jac))
mat2 <- data.frame(t(stable_result_mat))
cor <- data.frame(t(stable_result_cor))
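To turn these stability matrices into a choice of k, one simple illustrative rule (our exact selection rule is not shown here) is to favor the k whose 100 replicates have the highest average Jaccard similarity:
## each column of jac holds the 100 replicates for one k; column j is k = j + 1
mean_jac <- colMeans(jac)
best_k <- which.max(mean_jac) + 1
## boxplots show the spread of stability across replicates for each candidate k
boxplot(jac, names = 2:15, xlab = "number of clusters k", ylab = "Jaccard similarity")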
The following figure presents selected raw images from two representative clusters. Cluster 1 tends to be more centered, while cluster 2 tends to be more sparsely populated in terms of nighttime radiance. The two clusters capture inherently different nighttime lighting patterns.
After exploring the satellite image data, we would like to provide more useful insight into how nighttime radiance might influence mental-disorder-related deaths. To this end, we match county-level monthly mental-disorder-related death data with aggregated county-level radiance data for the year 2014. In the end, we obtain 9754 observations with matched death and nighttime radiance information, of which 75% are treated as the training set and the remaining 25% as the test set. We predict county-level monthly mental-disorder-related death counts using the following features: month, population, SES variables (poverty rate, median income, unemployment rate, 2014 migration, natural-increase, and death rates, and educational attainment shares), aggregated nighttime radiance (sum), and spatial coordinates (lat, lng).
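The matching step itself is a county-month join; a minimal sketch, assuming data frames deaths_2014 and radiance_2014 keyed by hypothetical fips and month columns:
## inner join on county FIPS code and month keeps only matched county-months
new <- merge(deaths_2014, radiance_2014, by = c("fips", "month"))
nrow(new)  ## 9754 matched observations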
We apply the Super Learner idea (ensemble learning), which combines the outputs of the following 11 algorithms: ridge regression, lasso, decision tree with pruning, multivariate adaptive regression splines (MARS), random forests, generalized additive model, support vector machine (SVM), neural network, linear regression, generalized boosting, and a simple mean baseline.
Five-fold cross-validation is used to tune the parameters of the aforementioned algorithms, and the Super Learner then seeks the best convex combination of their predictions.
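Concretely, writing $\hat{f}_k^{(-v(i))}(X_i)$ for the prediction of learner $k$ fit without the fold containing observation $i$, the Super Learner weights solve, under squared-error loss (matching family = 'gaussian'),
$$\hat{\alpha} = \underset{\alpha_k \ge 0,\; \sum_{k=1}^{11} \alpha_k = 1}{\arg\min} \; \sum_{i=1}^{n} \Big( Y_i - \sum_{k=1}^{11} \alpha_k \, \hat{f}_k^{(-v(i))}(X_i) \Big)^2.$$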
library(SuperLearner)
set.seed(1127)
## 75/25 train-test split
index <- sample(seq_len(nrow(new)), size = floor(0.75 * nrow(new)))
train <- new[index, ]
test <- new[-index, ]
## candidate learners
SL.library <- c('SL.ridge', 'SL.rpartPrune', 'SL.polymars', 'SL.mean',
                'SL.randomForest', 'SL.gam', 'SL.glm', 'SL.glmnet',
                'SL.svm', 'SL.gbm', 'SL.nnet')
## shared feature set for the training and test designs
features <- c("month", "pop", "per_povall", "median_income", "ump_rate",
              "R_DOMESTIC_MIG_2014", "R_INTERNATIONAL_MIG_2014",
              "R_NATURAL_INC_2014", "R_death_2014", "per_bechelor",
              "per_college", "per_highsch", "per_lesshighsch",
              "sum", "lat", "lng")
X <- train[features]
testX <- test[features]
SL.out <- SuperLearner(Y = train$Deaths, X = X, SL.library = SL.library,
                       family = 'gaussian', cvControl = list(V = 5))
pred <- predict(SL.out, newdata = testX, onlySL = TRUE)
Random forests is found to have the best performance, followed by MARS and SVM. Thus, in the final Super Learner prediction, the weight on the random forests output is very high, about 0.96, while MARS and SVM receive small positive weights. The other algorithms do not perform well on this dataset, and their weights are estimated to be 0. The following plot shows the prediction performance on the test set; the predicted values are closely in line with the true values.
SL.out
##
## Call:
## SuperLearner(Y = train$Deaths, X = X, family = "gaussian", SL.library = SL.library,
## cvControl = list(V = 5))
##
##
## Risk Coef
## SL.ridge_All 608.9955 0.000000000
## SL.rpartPrune_All 177.9456 0.000000000
## SL.polymars_All 211.6064 0.033813510
## SL.mean_All 2667.2209 0.000000000
## SL.randomForest_All 79.0766 0.961432697
## SL.gam_All 388.2769 0.000000000
## SL.glm_All 609.3036 0.000000000
## SL.glmnet_All 609.3605 0.000000000
## SL.svm_All 241.0309 0.004753793
## SL.gbm_All 201.5128 0.000000000
## SL.nnet_All 2666.4160 0.000000000
sqrt(mean((pred$pred-test$Deaths)^2)) ##RMSE
## [1] 9.163683
plot(pred$pred,test$Deaths,xlim=c(0,600),ylim=c(0,600))
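## (illustrative addition) overlay the 45-degree reference line on the
## scatterplot above; points on y = x correspond to perfect predictions
abline(0, 1, lty = 2)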
cor(pred$pred,test$Deaths)
## [,1]
## [1,] 0.9843591
To gain more insight into the importance of nighttime radiance in predicting mental-disorder-related deaths, we run a separate random forests model using 1000 trees. See the importance measures below. After population, aggregated county-level monthly average nighttime radiance is the second most important predictor, even after controlling for other SES, spatial, and month effects. This strongly indicates that nighttime radiance is related to mental health outcomes. However, association is not causation, and we need to be careful in interpreting this result. Still, the predictive model does suggest that nighttime radiance is a good metric for predicting mental-disorder-related deaths, and presumably also happiness and well-being.
library(randomForest)
## separate random forests fit to assess variable importance
fit <- randomForest(Deaths ~ pop + per_povall + month + median_income +
                      ump_rate + R_DOMESTIC_MIG_2014 + R_INTERNATIONAL_MIG_2014 +
                      R_NATURAL_INC_2014 + R_death_2014 + per_bechelor +
                      per_college + per_highsch + per_lesshighsch +
                      sum + lat + lng,
                    data = train, importance = TRUE, ntree = 1000)
importance(fit)
## %IncMSE IncNodePurity
## pop 76.87876 8578661.18
## per_povall 23.39711 247959.95
## month 23.44638 91453.08
## median_income 24.92261 287429.68
## ump_rate 16.67274 213105.54
## R_DOMESTIC_MIG_2014 24.78048 176341.25
## R_INTERNATIONAL_MIG_2014 29.12139 1054962.80
## R_NATURAL_INC_2014 30.18272 369282.90
## R_death_2014 37.77894 511571.33
## per_bechelor 22.13854 651536.97
## per_college 23.97069 209545.26
## per_highsch 25.31010 500739.77
## per_lesshighsch 17.77900 352156.25
## sum 41.28070 5470166.82
## lat 40.22816 386858.77
## lng 34.61009 300946.63
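The same importance measures can also be visualized with the randomForest package's built-in dotchart:
## dotchart of %IncMSE and IncNodePurity for all 16 predictors
varImpPlot(fit, main = "Variable importance, random forests (1000 trees)")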