Before Clustering Analysis
Before jumping into clustering, we need to understand the data types. RUCC is our only categorical variable. Therefore, we need to decide whether to include it in the analysis or to replace it with %rural. Based on our current information, the best practice is to try both.
- Including the categorical variable RUCC will essentially reproduce the RUCC categories as clusters. That won't provide any new insight for the research.
- Therefore, it is better to perform PCA first, to extract the important information from the dataset and reduce its dimension.
- The advantage is that this reduces noise and the influence of some outliers in the dataset, which makes the clustering more robust.
- However, it also makes the clustering result harder to interpret, because the inputs become principal components instead of the original variables.
- To interpret the result, we look at how each component is built from the original variables. To find the most important variable, we select the most important component and examine its loadings.
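The last point can be sketched in base R. This is only an illustration: it uses the built-in `USArrests` data set as a stand-in for our data, and the names `pca`, `loadings`, `pc1_rank` and `top_variable` are made up for the example.

```r
# Stand-in data: the built-in USArrests data set (NOT the study data).
pca <- prcomp(USArrests, scale. = TRUE)

# Loadings: one column per component, one row per original variable.
loadings <- pca$rotation
print(round(loadings, 2))

# Rank the original variables by the size of their loading on PC1.
pc1_rank <- sort(abs(loadings[, "PC1"]), decreasing = TRUE)
top_variable <- names(pc1_rank)[1]
print(pc1_rank)
```

A variable with a large absolute loading contributes strongly to that component, so the ranking gives a rough idea of which variables matter most for it.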
Clustering with %rural
library(factoextra)
library(FactoMineR)
# all data is numeric for now
# Run PCA on the standardized variables and draw a scree plot
res.pca <- prcomp(df, scale. = TRUE)
fviz_eig(res.pca)
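The scree plot can also be read numerically. The sketch below again uses the built-in `USArrests` data as a stand-in, and the 80% threshold is an arbitrary assumption for illustration, not a rule we applied to our data.

```r
# Stand-in data: the built-in USArrests data set (NOT the study data).
pca <- prcomp(USArrests, scale. = TRUE)

# Share of total variance carried by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)
print(round(cum_explained, 3))

# Smallest number of components reaching the (arbitrary) 80% threshold.
n_keep <- which(cum_explained >= 0.80)[1]
print(n_keep)
```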

# Re-run the PCA with FactoMineR, keeping 4 components for clustering
res.pca <- PCA(df, ncp = 4, graph = FALSE)
# Compute hierarchical clustering on principal components
res.hcpc <- HCPC(res.pca, graph = FALSE)
fviz_dend(res.hcpc, show_labels = FALSE)

fviz_cluster(res.hcpc, geom = "point", main = "Factor map")
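For intuition, the core idea of clustering on principal components can be sketched with base R only. This is a simplified stand-in, not what `HCPC()` does internally: it uses the built-in `USArrests` data, keeps two components, and fixes the number of clusters at 3 by hand, all of which are assumptions made for the example.

```r
# Stand-in data: the built-in USArrests data set (NOT the study data).
pca <- prcomp(USArrests, scale. = TRUE)

# Keep the scores on the first two components only.
scores <- pca$x[, 1:2]

# Ward hierarchical clustering on the component scores.
hc <- hclust(dist(scores), method = "ward.D2")

# Cut the tree into 3 clusters (chosen by hand here).
clusters <- cutree(hc, k = 3)
print(table(clusters))
```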

res.hcpc$desc.var$quanti
$`1`
                 v.test Mean in category Overall mean sd in category Overall sd       p.value
air_EQI        34.28580        0.8934593   0.19325392      0.4986621  0.7852859 1.277690e-257
sociod_EQI     26.12468        0.7050311   0.04876047      0.8777742  0.9659367 1.912255e-150
smoking._rate -23.51719       21.4748677  23.89641192      3.6655253  3.9593480 2.721078e-122
rural_percent -36.91117       27.7556614  56.01658387     17.8763724 29.4404947 3.058444e-298

$`2`
                  v.test Mean in category Overall mean sd in category Overall sd       p.value
rural_percent  16.427964       71.9286969  56.01658387     23.2010645 29.4404947  1.206422e-60
sociod_EQI      4.760943        0.2000611   0.04876047      0.6796654  0.9659367  1.926901e-06
smoking._rate -13.027424       22.1994143  23.89641192      2.2470063  3.9593480  8.544436e-39
air_EQI       -28.884804       -0.5530165   0.19325392      0.6192462  0.7852859 1.853049e-183

$`3`
                  v.test Mean in category Overall mean sd in category Overall sd       p.value
smoking._rate  35.096403      27.38216380  23.89641192      2.3268278  3.9593480 7.648538e-270
rural_percent  21.685287      72.03134479  56.01658387     20.5409475 29.4404947 2.824908e-104
air_EQI        -7.801616       0.03957208   0.19325392      0.5026574  0.7852859  6.111944e-15
sociod_EQI    -30.191995      -0.68280073   0.04876047      0.6716843  0.9659367 3.017017e-200
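To read the v.test column, it helps to see how such a statistic can be computed. The function below is a sketch of the commonly cited v-test formula, which compares a cluster mean with the overall mean. Whether FactoMineR uses exactly this variance estimator is an assumption, and `v_test` and the toy vector `x` are made up for the example. The p.value column does appear consistent with a two-sided normal tail probability of the v.test, which the last lines check against one row of the output.

```r
# Sketch of a v-test for one numeric variable and one cluster.
# ASSUMPTION: commonly cited formula; FactoMineR's exact variance
# estimator may differ slightly.
v_test <- function(x, in_cluster) {
  n   <- length(x)        # all observations
  n_k <- sum(in_cluster)  # observations in the cluster
  s2  <- var(x)           # overall variance
  (mean(x[in_cluster]) - mean(x)) / sqrt((s2 / n_k) * ((n - n_k) / (n - 1)))
}

# Toy values (made up); the second half forms the cluster.
x <- c(1, 2, 3, 4, 5, 11, 12, 13, 14, 15)
in_cluster <- rep(c(FALSE, TRUE), each = 5)
v <- v_test(x, in_cluster)

# Two-sided normal p-value for the v.test of sociod_EQI in cluster 2.
p <- 2 * pnorm(-abs(4.760943))
print(c(v = v, p = p))
```

A large positive v.test means the variable is well above the overall mean in that cluster, and a large negative one means it is well below.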
HCPC gives us 3 clusters to analyze.
# Attach the cluster labels to the original data
d1 <- cbind(mydata, clust = res.hcpc$data.clust$clust)

# Compare each variable across the clusters with boxplots
qplot(x = clust, y = LungRate, data = d1, fill = clust, geom = "boxplot")

qplot(x = clust, y = smoking._rate, data = d1, fill = clust, geom = "boxplot")

qplot(x = clust, y = rural_percent, data = d1, fill = clust, geom = "boxplot")

qplot(x = clust, y = air_EQI, data = d1, fill = clust, geom = "boxplot")

qplot(x = clust, y = sociod_EQI, data = d1, fill = clust, geom = "boxplot")
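The boxplots can be complemented by a small table of per-cluster medians. The sketch below uses a toy data frame `d1_toy` whose values are invented for illustration. Only the column names follow our data.

```r
# Toy data frame with MADE-UP values; only the column names match d1.
d1_toy <- data.frame(
  clust         = factor(c(1, 1, 2, 2, 3, 3)),
  LungRate      = c(55, 60, 62, 66, 80, 84),
  smoking._rate = c(20, 22, 22, 23, 27, 28),
  rural_percent = c(20, 35, 70, 75, 68, 76)
)

# Median of every numeric column within each cluster.
cluster_medians <- aggregate(. ~ clust, data = d1_toy, FUN = median)
print(cluster_medians)
```

With the real data, replacing `d1_toy` by `d1` (restricted to the columns of interest) would give the numbers behind the boxes.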

The result is similar to the one obtained with RUCC. Therefore, we do not need to use RUCC in the clustering analysis.
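One way to make "similar" concrete is to cross-tabulate the cluster labels against the RUCC groups. The sketch below uses invented labels: `clust` and `rucc_group` are hypothetical vectors, and the metro/nonmetro split is only an example grouping.

```r
# MADE-UP labels for eight hypothetical counties.
clust <- factor(c(1, 1, 1, 2, 2, 3, 3, 3))
rucc_group <- factor(c("metro", "metro", "metro", "nonmetro",
                       "nonmetro", "nonmetro", "nonmetro", "metro"))

# Counts of counties in each cluster / RUCC group combination.
tab <- table(cluster = clust, rucc = rucc_group)
print(tab)

# Share of each RUCC group within each cluster (rows sum to 1).
row_share <- prop.table(tab, margin = 1)
print(round(row_share, 2))
```

If most of each row falls into a single RUCC group, the clusters largely mirror RUCC.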
Some insights from Clustering Analysis
- Smoking rate seems to be a driving factor for lung cancer.
- Rural areas seem to have higher smoking rates.
- Highly urbanized areas seem to have a worse social environment (higher social EQI).
- Highly urbanized areas seem to have a worse air environment (higher air EQI).
- Highly urbanized areas and areas in the middle of urbanization tend to have lower lung cancer rates compared to extremely urbanized areas (that makes sense but conflicts with the social EQI and air EQI distributions).
We tried a 4-cluster solution with RUCC before and reached similar conclusions. Therefore, I would say the 3-cluster solution is acceptable.
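When the 3- and 4-cluster solutions come from the same hierarchical tree, a cross-tabulation shows how they relate. The sketch below is a base R illustration on the built-in `USArrests` data, not a rerun of our RUCC analysis, and the objects `hc`, `k3`, `k4` and `overlap` are made up for the example.

```r
# Stand-in data: the built-in USArrests data set (NOT the study data).
scores <- prcomp(USArrests, scale. = TRUE)$x[, 1:2]
hc <- hclust(dist(scores), method = "ward.D2")

# Cut the same tree into 3 and into 4 clusters.
k3 <- cutree(hc, k = 3)
k4 <- cutree(hc, k = 4)

# Rows: 3-cluster labels; columns: 4-cluster labels.
overlap <- table(k3 = k3, k4 = k4)
print(overlap)
```

Because the cuts are nested, each of the 4 clusters sits inside exactly one of the 3 clusters, so going from 3 to 4 only splits one existing group.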