This lab will have you replicate the work you did for the previous lab, but instead of using the Phoenix data you will select a city of your choice to use for this lab, subsequent labs, and the final project.
You want to select a dense city that has a lot of census tracts, so I suggest you select from the 50 largest cities in the US. The Census defines Metropolitan Statistical Areas (MSAs) as “one or more adjacent counties or county equivalents that have at least one urban core area of at least 50,000 population, plus adjacent territory that has a high degree of social and economic integration with the core as measured by commuting ties.” There are currently 384 MSAs in the US.
Wikipedia provides a convenient list of MSAs sorted by size.
library( geojsonio ) # read shapefiles
library( sp ) # work with shapefiles
library( sf ) # work with shapefiles - simple features format
library( mclust ) # cluster analysis
library( tmap ) # theme maps
library( ggplot2 ) # graphing
library( ggthemes ) # nice formats for ggplots
library( dplyr ) # data wrangling
library( pander ) # formatting RMD tables
library( tidycensus )
library( cartogram ) # spatial maps w/ tract size bias reduction
library( maptools ) # spatial object manipulation You can select a city from the list of large MSAs.
To get Census data on the city you will first need to identify all of the counties that comprise the MSA. You can look this information up through MSA to FIPS crosswalks provided by the National Bureau for Economic Research (NBER): https://www.nber.org/data/cbsa-fips-county-crosswalk.html
I have added the file to GitHub for ease of access.
crosswalk <- read.csv( "https://raw.githubusercontent.com/DS4PS/cpp-529-master/master/data/cbsatocountycrosswalk.csv", stringsAsFactors=F, colClasses="character" )
# search for citie names by strings, use the ^ anchor for "begins with"
grep( "^SAN D", crosswalk$msaname, value=TRUE )## [1] "SAN DIEGO, CA"
grep( "^CALI", crosswalk$msaname, value=TRUE )## [1] "CALIFORNIA" "CALIFORNIA" "CALIFORNIA" "CALIFORNIA" "CALIFORNIA"
## [6] "CALIFORNIA" "CALIFORNIA" "CALIFORNIA" "CALIFORNIA" "CALIFORNIA"
## [11] "CALIFORNIA" "CALIFORNIA" "CALIFORNIA" "CALIFORNIA" "CALIFORNIA"
## [16] "CALIFORNIA" "CALIFORNIA" "CALIFORNIA" "CALIFORNIA" "CALIFORNIA"
## [21] "CALIFORNIA" "CALIFORNIA" "CALIFORNIA" "CALIFORNIA" "CALIFORNIA"
Select all of your county fips. To use them in the TidyCenss package you will need to split the state and county:
these.sdp <- crosswalk$msaname == "CALIFORNIA"
these.fips <- crosswalk$fipscounty[ these.sdp ]
these.fips <- na.omit( these.fips )To create a Dorling cartogram we need a shapefile and a population count. We can get both through the Census download that includes simple features.
library( tidycensus )
#census_api_key("404739a4c1d1baf6067b937fe42425b977ef145d", install = TRUE)
state.fips <- substr( these.fips, 1, 2 )
county.fips <- substr( these.fips, 3, 5 )
cbind( these.fips, state.fips, county.fips ) %>% pander()| these.fips | state.fips | county.fips |
|---|---|---|
| 06003 | 06 | 003 |
| 06005 | 06 | 005 |
| 06009 | 06 | 009 |
| 06011 | 06 | 011 |
| 06015 | 06 | 015 |
| 06021 | 06 | 021 |
| 06023 | 06 | 023 |
| 06025 | 06 | 025 |
| 06027 | 06 | 027 |
| 06031 | 06 | 031 |
| 06033 | 06 | 033 |
| 06035 | 06 | 035 |
| 06043 | 06 | 043 |
| 06045 | 06 | 045 |
| 06049 | 06 | 049 |
| 06051 | 06 | 051 |
| 06057 | 06 | 057 |
| 06063 | 06 | 063 |
| 06069 | 06 | 069 |
| 06091 | 06 | 091 |
| 06093 | 06 | 093 |
| 06103 | 06 | 103 |
| 06105 | 06 | 105 |
| 06109 | 06 | 109 |
| 06990 | 06 | 990 |
sd.pop <-
get_acs( geography = "tract", variables = "B01003_001",
state = "06", county = county.fips[state.fips=="06"], geometry = TRUE ) %>%
select( GEOID, estimate ) %>%
rename( POP=estimate )##
|
| | 0%
|
|= | 1%
|
|= | 2%
|
|== | 2%
|
|== | 3%
|
|== | 4%
|
|=== | 4%
|
|=== | 5%
|
|==== | 5%
|
|==== | 6%
|
|===== | 7%
|
|===== | 8%
|
|====== | 8%
|
|====== | 9%
|
|======= | 9%
|
|======= | 10%
|
|======= | 11%
|
|======== | 11%
|
|======== | 12%
|
|========= | 12%
|
|========= | 13%
|
|========== | 14%
|
|========== | 15%
|
|=========== | 15%
|
|=========== | 16%
|
|============ | 17%
|
|============= | 18%
|
|============= | 19%
|
|============== | 20%
|
|=============== | 21%
|
|=============== | 22%
|
|================ | 22%
|
|================ | 23%
|
|================= | 24%
|
|================= | 25%
|
|================== | 25%
|
|================== | 26%
|
|=================== | 27%
|
|=================== | 28%
|
|==================== | 28%
|
|==================== | 29%
|
|===================== | 29%
|
|===================== | 30%
|
|====================== | 31%
|
|====================== | 32%
|
|======================= | 32%
|
|======================= | 33%
|
|======================== | 34%
|
|======================== | 35%
|
|========================= | 36%
|
|========================== | 36%
|
|========================== | 37%
|
|=========================== | 38%
|
|=========================== | 39%
|
|============================ | 40%
|
|============================= | 41%
|
|============================= | 42%
|
|============================== | 43%
|
|=============================== | 44%
|
|================================ | 45%
|
|================================ | 46%
|
|================================= | 47%
|
|================================== | 48%
|
|================================== | 49%
|
|=================================== | 50%
|
|==================================== | 51%
|
|==================================== | 52%
|
|===================================== | 52%
|
|===================================== | 53%
|
|====================================== | 54%
|
|====================================== | 55%
|
|======================================= | 55%
|
|======================================= | 56%
|
|======================================== | 56%
|
|======================================== | 57%
|
|========================================= | 58%
|
|========================================= | 59%
|
|========================================== | 59%
|
|========================================== | 60%
|
|========================================== | 61%
|
|=========================================== | 61%
|
|=========================================== | 62%
|
|============================================ | 63%
|
|============================================= | 64%
|
|============================================= | 65%
|
|============================================== | 65%
|
|============================================== | 66%
|
|=============================================== | 67%
|
|================================================ | 68%
|
|================================================ | 69%
|
|================================================= | 70%
|
|================================================== | 71%
|
|================================================== | 72%
|
|=================================================== | 72%
|
|=================================================== | 73%
|
|==================================================== | 74%
|
|===================================================== | 75%
|
|===================================================== | 76%
|
|====================================================== | 77%
|
|====================================================== | 78%
|
|======================================================= | 78%
|
|======================================================= | 79%
|
|======================================================== | 80%
|
|========================================================= | 81%
|
|========================================================= | 82%
|
|========================================================== | 82%
|
|========================================================== | 83%
|
|========================================================== | 84%
|
|=========================================================== | 84%
|
|============================================================ | 85%
|
|============================================================ | 86%
|
|============================================================= | 87%
|
|============================================================== | 88%
|
|============================================================== | 89%
|
|=============================================================== | 90%
|
|================================================================ | 91%
|
|================================================================ | 92%
|
|================================================================= | 92%
|
|================================================================= | 93%
|
|================================================================== | 94%
|
|=================================================================== | 95%
|
|=================================================================== | 96%
|
|==================================================================== | 97%
|
|==================================================================== | 98%
|
|===================================================================== | 98%
|
|======================================================================| 100%
URL <- "https://github.com/DS4PS/cpp-529-master/raw/master/data/ltdb_std_2010_sample.rds"
census.dat <- readRDS(gzcon(url( URL )))
# can merge an sf object and data.frame
sd.pop$GEOID<-sub('.', '', sd.pop$GEOID)
sdp <- merge( sd.pop, census.dat, by.x="GEOID", by.y="tractid" )
# make sure there are no empty polygons
sdp <- sdp[ ! st_is_empty( sdp ) , ]
head(sdp)## Simple feature collection with 6 features and 177 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -121.0277 ymin: 38.21795 xmax: -119.5425 ymax: 38.93332
## Geodetic CRS: NAD83
## GEOID POP statea countya tracta pnhwht12 pnhblk12 phisp12 pntv12
## 1 6003010000 1344 06 003 010000 69.76 0.00 6.52 19.21
## 2 6005000101 4088 06 005 000101 90.35 0.98 3.44 3.59
## 3 6005000102 2054 06 005 000102 83.05 0.00 7.75 7.21
## 4 6005000301 5152 06 005 000301 52.74 13.74 28.46 1.99
## 5 6005000303 6346 06 005 000303 84.77 0.09 8.13 1.38
## 6 6005000304 5186 06 005 000304 79.95 0.00 15.95 0.23
## pasian12 phaw12 pindia12 pchina12 pfilip12 pjapan12 pkorea12 pviet12 p15wht12
## 1 4.51 0.00 0 1.42 0.00 0.00 0 1.25 10.54
## 2 0.00 0.00 0 0.00 0.20 0.00 0 0.00 6.32
## 3 0.00 0.00 0 0.00 0.00 0.00 0 0.00 18.31
## 4 2.23 0.05 0 0.00 0.26 0.47 0 0.28 10.63
## 5 2.34 0.00 0 1.89 0.09 0.00 0 0.00 16.18
## 6 1.03 1.03 0 0.00 0.00 0.00 0 0.00 14.53
## p65wht12 p15blk12 p65blk12 p15hsp12 p65hsp12 p15ntv12 p65ntv12 p15asn12
## 1 13.65 18.75882 10.08296 29.49 6.41 35.17 10.17 20.370
## 2 36.31 0.00000 0.00000 64.23 10.22 0.00 0.00 0.000
## 3 25.79 18.75882 10.08296 4.49 21.15 0.00 23.45 15.255
## 4 13.35 1.25000 2.95000 2.58 2.00 3.33 5.33 0.000
## 5 18.26 0.00000 50.00000 37.30 3.97 16.47 0.00 40.370
## 6 16.53 18.75882 10.08296 17.48 8.09 0.00 0.00 0.000
## p65asn12 pmex12 pcuban12 ppr12 pruanc12 pitanc12 pgeanc12 piranc12 pscanc12
## 1 7.41000 1.92 0.00 0.50 0.00 8.27 11.19 12.11 6.18
## 2 0.00000 1.81 0.00 0.00 2.74 3.49 9.75 12.67 5.20
## 3 10.63872 3.43 0.00 0.94 6.91 3.98 17.99 5.62 5.72
## 4 2.76000 25.09 0.47 1.66 0.24 3.34 9.63 9.47 2.38
## 5 3.67000 4.80 0.00 0.00 0.00 6.37 14.52 9.78 4.00
## 6 0.00000 11.48 0.00 0.00 0.83 4.62 14.43 6.86 6.40
## pfb12 pnat12 p10imm12 prufb12 pitfb12 pgefb12 pirfb12 pscfb12 polang12 plep12
## 1 4.76 2.51 1.50 0.00 0.00 0.33 0.0 0.00 12.84 4.13
## 2 5.15 3.90 0.33 2.74 0.00 0.30 0.3 0.00 5.10 1.02
## 3 12.48 9.64 0.00 6.91 0.00 2.58 0.6 0.00 8.38 3.69
## 4 8.98 2.62 1.37 0.00 0.00 0.11 0.0 0.75 23.62 2.36
## 5 4.34 2.58 1.27 0.00 0.15 0.00 0.0 0.00 5.61 0.78
## 6 3.28 0.62 1.14 0.00 0.00 0.00 0.0 0.00 8.26 1.73
## phs12 pcol12 punemp12 pflabf12 pprof12 pmanuf12 psemp12 pvet12 p65pov12
## 1 35.21 31.71 10.40 60.40 42.80 5.72 14.58 13.27 1.12
## 2 31.53 23.16 19.87 41.52 43.01 1.99 26.47 16.89 0.68
## 3 35.73 24.53 20.74 47.28 12.88 5.98 14.42 21.47 0.00
## 4 60.92 9.93 14.33 46.86 36.16 5.28 10.68 13.60 0.91
## 5 33.27 23.13 9.41 54.13 32.06 5.59 14.88 10.42 0.60
## 6 36.98 10.58 17.68 57.59 22.40 6.15 8.80 15.43 0.00
## ppov12 pwpov12 pnapov12 pfmpov12 pbpov12 phpov12 papov12 pvac12 pown12
## 1 13.71 10.64 14.01 0.80 25.50688 46.15 11.1100 78.79 80.78
## 2 7.59 5.40 75.52 2.37 0.00000 0.00 0.0000 28.46 89.68
## 3 9.99 7.60 5.52 6.61 25.50688 22.44 14.0471 51.81 81.37
## 4 5.96 6.10 0.00 0.65 0.00000 10.13 0.0000 8.82 74.17
## 5 12.28 11.26 0.00 9.87 50.00000 14.29 65.1400 11.50 68.44
## 6 16.34 10.13 0.00 3.52 25.50688 42.56 100.0000 21.36 73.12
## pmulti12 p30old12 p18und12 p60up12 p75up12 pmar12 pwds12 pfhh12 p10yrs12
## 1 37.47 45.07 22.56 19.97 5.01 51.00 10.82 4.40 41.30
## 2 0.00 43.00 10.26 51.26 12.54 68.30 24.19 3.60 48.12
## 3 4.13 63.44 16.90 33.20 13.97 67.76 21.89 3.22 57.24
## 4 5.80 43.00 10.60 12.52 2.93 29.93 29.11 1.94 51.79
## 5 11.08 44.51 24.09 22.67 5.96 56.46 22.46 17.16 72.63
## 6 2.62 32.01 20.23 20.05 8.00 51.65 19.04 3.43 78.51
## ageblk12 agentv12 agewht12 agehsp12 india12 filip12 japan12 korea12 viet12
## 1 0 236 835 78 0 0 0 0 15
## 2 39 143 3594 137 0 8 0 0 0
## 3 0 145 1671 156 0 0 0 0 0
## 4 881 150 3235 1746 0 16 29 0 17
## 5 4 85 3942 378 0 4 0 0 0
## 6 0 9 3098 618 0 0 0 0 0
## pop12 nhwht12 nhblk12 ntv12 hisp12 asian12 haw12 china12 a15wht12 a65wht12
## 1 1197 835 0 230 78 54 0 17 88 114
## 2 3978 3594 39 143 137 0 0 0 227 1305
## 3 2012 1671 0 145 156 0 0 0 306 431
## 4 6134 3235 843 122 1746 137 3 0 344 432
## 5 4650 3942 4 64 378 109 0 88 638 720
## 6 3875 3098 0 9 618 40 40 0 450 512
## a15blk12 a65blk12 a15hsp12 a65hsp12 a15ntv12 a65ntv12 ageasn12 a15asn12
## 1 0 0 23 5 83 24 54 11
## 2 0 0 88 14 0 0 8 0
## 3 0 0 7 33 0 34 0 0
## 4 11 26 45 35 5 8 145 0
## 5 0 2 141 15 14 0 109 44
## 6 0 0 108 50 0 0 40 0
## a65asn12 mex12 pr12 cuban12 geanc12 iranc12 itanc12 ruanc12 fb12 nat12 itfb12
## 1 4 23 6 0 134 145 99 0 57 30 0
## 2 0 72 0 0 388 504 139 109 205 155 0
## 3 0 69 19 0 362 113 80 139 251 194 0
## 4 4 1539 102 29 591 581 205 15 551 161 0
## 5 4 223 0 0 675 455 296 0 202 120 7
## 6 0 445 0 0 559 266 179 32 127 24 0
## rufb12 ag5up12 irfb12 gefb12 scanc12 n10imm12 olang12 lep12 scfb12 ag25up12
## 1 0 1137 0 4 74 18 146 47 0 801
## 2 109 3922 12 12 207 13 200 40 0 3527
## 3 139 1922 12 52 115 0 161 71 0 1598
## 4 0 5973 0 7 146 84 1411 141 46 4854
## 5 0 4352 0 0 186 59 244 34 0 3234
## 6 0 3704 0 0 248 44 306 64 0 2750
## dfmpov12 hh12 hinc12 hincb12 hincw12 hinch12 incpc12 ag18cv12 vet12
## 1 250 385 59931 49940.72 73162 38750.00 27135 927 123
## 2 1306 1968 56136 49940.72 56917 55629.36 35314 3570 603
## 3 590 746 47875 49940.72 48438 43098.00 23797 1672 359
## 4 619 755 60391 53409.00 59613 88661.00 10165 5484 746
## 5 1317 1885 56120 49940.72 57109 72727.00 29481 3530 368
## 6 1107 1410 54045 49940.72 53500 22286.00 24651 3091 477
## empclf12 dpov12 npov12 dbpov12 nbpov12 dnapov12 nnapov12 dwpov12 nwpov12
## 1 542 1160 159 0 0 207 29 827 88
## 2 1360 3978 302 39 0 143 108 3594 194
## 3 652 2012 201 0 0 145 8 1671 127
## 4 777 1979 118 47 0 25 0 1802 110
## 5 1915 4650 571 4 2 85 0 3942 444
## 6 1625 3868 632 0 0 9 0 3091 313
## dhpov12 nhpov12 hhb12 hhw12 hhh12 hs12 col12 clf12 unemp12 dflabf12 flabf12
## 1 78 36 0 307 6 282 254 596 62 452 273
## 2 137 0 27 1829 27 1112 817 1631 324 2016 837
## 3 156 35 0 640 63 571 392 786 163 789 373
## 4 79 8 14 667 53 2957 482 900 129 828 388
## 5 378 54 0 1737 80 1076 748 2105 198 1888 1022
## 6 618 263 0 1220 136 1017 291 1974 349 1554 895
## prof12 manuf12 semp12 hha12 hinca12 n65pov12 nfmpov12 napov12 dapov12
## 1 232 31 79 11 36079.55 13 2 6 54
## 2 585 27 360 0 79531.26 27 31 0 8
## 3 84 39 94 0 79531.26 0 39 0 0
## 4 281 41 83 0 79531.26 18 4 0 4
## 5 614 107 285 35 53015.00 28 130 71 109
## 6 364 100 143 0 79531.26 0 39 40 40
## family12 hu12 vac12 ohu12 own12 rent12 dmulti12 mrent12 mhmval12 multi12
## 1 250 1815 1430 385 311 74 1815 696 371300 680
## 2 1306 2751 783 1968 1765 203 2751 925 277600 0
## 3 590 1548 802 746 607 139 1548 839 273100 64
## 4 619 828 73 755 560 195 828 724 278200 48
## 5 1317 2130 245 1885 1290 595 2130 1058 281900 236
## 6 1107 1793 383 1410 1031 379 1793 1112 229100 47
## h30old12 h10yrs12 a18und12 a60up12 a75up12 ag15up12 X12.Mar wds12 fhh12
## 1 818 159 270 239 60 998 509 108 11
## 2 1183 947 408 2039 499 3663 2502 886 47
## 3 982 427 340 668 281 1681 1139 368 19
## 4 356 391 650 768 180 5716 1711 1664 12
## 5 948 1369 1120 1054 277 3700 2089 831 226
## 6 574 1107 784 777 310 3282 1695 625 38
## geometry
## 1 MULTIPOLYGON (((-120.0724 3...
## 2 MULTIPOLYGON (((-120.632 38...
## 3 MULTIPOLYGON (((-120.6093 3...
## 4 MULTIPOLYGON (((-121.0274 3...
## 5 MULTIPOLYGON (((-121.0272 3...
## 6 MULTIPOLYGON (((-121.0114 3...
#head(sd) %>% pander()This exercise uses Census data from the 2012 American Communities Survey made available through the Diversity and Disparities Project.
dd.URL <- "https://raw.githubusercontent.com/DS4PS/cpp-529-master/master/data/data-dictionary.csv"
data.dictionary <- read.csv( dd.URL, stringsAsFactors=F )
data.dictionary %>% pander()| LABEL | VARIABLE |
|---|---|
| tractid | GEOID |
| pnhwht12 | Percent white, non-Hispanic |
| pnhblk12 | Percent black, non-Hispanic |
| phisp12 | Percent Hispanic |
| pntv12 | Percent Native American race |
| pfb12 | Percent foreign born |
| polang12 | Percent speaking other language at home, age 5 plus |
| phs12 | Percent with high school degree or less |
| pcol12 | Percent with 4-year college degree or more |
| punemp12 | Percent unemployed |
| pflabf12 | Percent female labor force participation |
| pprof12 | Percent professional employees |
| pmanuf12 | Percent manufacturing employees |
| pvet12 | Percent veteran |
| psemp12 | Percent self-employed |
| hinc12 | Median HH income, total |
| incpc12 | Per capita income |
| ppov12 | Percent in poverty, total |
| pown12 | Percent owner-occupied units |
| pvac12 | Percent vacant units |
| pmulti12 | Percent multi-family units |
| mrent12 | Median rent |
| mhmval12 | Median home value |
| p30old12 | Percent structures more than 30 years old |
| p10yrs12 | Percent HH in neighborhood 10 years or less |
| p18und12 | Percent 17 and under, total |
| p60up12 | Percent 60 and older, total |
| p75up12 | Percent 75 and older, total |
| pmar12 | Percent currently married, not separated |
| pwds12 | Percent widowed, divorced and separated |
| pfhh12 | Percent female-headed families with children |
# convert sf map object to an sp version
sd.sp <- as_Spatial( sdp )
class( sd.sp )## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
plot( sd.sp )
# project map and remove empty tracts
sd.sp <- spTransform( sd.sp, CRS("+init=epsg:3395"))
sd.sp <- sd.sp[ sd.sp$POP != 0 & (! is.na( sd.sp$POP )) , ]
# convert census tract polygons to dorling cartogram
# no idea why k=0.03 works, but it does - default is k=5
sd.sp$pop.w <- sd.sp$POP / 9000 # max(msp.sp$POP) # standardizes it to max of 1.5
sd_dorling <- cartogram_dorling( x=sd.sp, weight="pop.w", k=0.05 )
plot( sd_dorling )
axis(side=2)
tm_shape( sd_dorling ) +
tm_polygons( size="POP", col="hinc12", n=7, style="quantile", palette="Spectral" )
Use a bounding box: improve the aestetics.
# user-defined bounding box to move slocer to subjects
bb <- st_bbox( c( xmin = -14000000, xmax = -13000000,
ymax = 5165123, ymin = 4156123 ),
crs = st_crs("+init=epsg:3395"))
tm_shape( sd_dorling, bbox = bb) +
tm_polygons( col="hinc12", n=10, style="quantile", palette="Spectral" ) +
tm_layout( "Dorling Cartogram", title.position=c("right","top") )
We will use the same set of variables as last week. The data is transformed into z-score so that they are all on similar scales.
keep.these <- c("pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12",
"phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12",
"pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12",
"pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12",
"p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")
d1 <- sd_dorling@data
d2 <- select( d1, keep.these )
d3 <- apply( d2, 2, scale )
head( d3[,1:6] ) %>% pander()| pnhwht12 | pnhblk12 | phisp12 | pntv12 | pfb12 | polang12 |
|---|---|---|---|---|---|
| 0.2577 | -0.5154 | -0.814 | 2.338 | -0.7104 | -0.4819 |
| 1.051 | -0.2854 | -0.9353 | 0.09011 | -0.6705 | -0.8322 |
| 0.7698 | -0.5154 | -0.7655 | 0.6112 | 0.07956 | -0.6837 |
| -0.398 | 2.709 | 0.05019 | -0.1402 | -0.2786 | 0.006068 |
| 0.8361 | -0.4943 | -0.7505 | -0.228 | -0.7533 | -0.8091 |
| 0.6504 | -0.5154 | -0.4425 | -0.3935 | -0.8618 | -0.6892 |
We transform all of the variables to z scorse so they are on the same scale while clustering. This ensures that each census variable has equal weight. Z-scores typically range from about -3 to +3 with a mean of zero.
keep.these <- c("pnhwht12", "pnhblk12", "phisp12", "pntv12", "pfb12", "polang12",
"phs12", "pcol12", "punemp12", "pflabf12", "pprof12", "pmanuf12",
"pvet12", "psemp12", "hinc12", "incpc12", "ppov12", "pown12",
"pvac12", "pmulti12", "mrent12", "mhmval12", "p30old12", "p10yrs12",
"p18und12", "p60up12", "p75up12", "pmar12", "pwds12", "pfhh12")
d2 <- select( d1, keep.these )
d3 <- apply( d2, 2, scale )
head( d3[,1:6] ) %>% pander()| pnhwht12 | pnhblk12 | phisp12 | pntv12 | pfb12 | polang12 |
|---|---|---|---|---|---|
| 0.2577 | -0.5154 | -0.814 | 2.338 | -0.7104 | -0.4819 |
| 1.051 | -0.2854 | -0.9353 | 0.09011 | -0.6705 | -0.8322 |
| 0.7698 | -0.5154 | -0.7655 | 0.6112 | 0.07956 | -0.6837 |
| -0.398 | 2.709 | 0.05019 | -0.1402 | -0.2786 | 0.006068 |
| 0.8361 | -0.4943 | -0.7505 | -0.228 | -0.7533 | -0.8091 |
| 0.6504 | -0.5154 | -0.4425 | -0.3935 | -0.8618 | -0.6892 |
For more details on cluster analysis visit the mclust tutorial.
# library( mclust )
set.seed( 1234 )
fit <- Mclust( d3 )
sd_dorling$cluster <- as.factor( fit$classification )
summary( fit )## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVE (ellipsoidal, equal orientation) model with 2 components:
##
## log-likelihood n df BIC ICL
## -5105.704 209 556 -13181.75 -13182.02
##
## Clustering table:
## 1 2
## 162 47
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVE (ellipsoidal, equal orientation) model with 7 components:
##
## log-likelihood n df BIC ICL
## -15370.41 771 861 -36464.47 -36533.72
##
## Clustering table:
## 1 2 3 4 5 6 7
## 74 151 112 237 89 63 45Some visuals of model fit statistics (doesn’t work well with a lot of variables):
plot( fit, what = "classification" )
Build the charts to compare census characteristics across the groups.
df.pct <- sapply( d2, ntile, 100 )
d4 <- as.data.frame( df.pct )
d4$cluster <- as.factor( paste0("GROUP-",fit$classification) )
num.groups <- length( unique( fit$classification ) )
stats <-
d4 %>%
group_by( cluster ) %>%
summarise_each( funs(mean) )
t <- data.frame( t(stats), stringsAsFactors=F )
names(t) <- paste0( "GROUP.", 1:num.groups )
t <- t[-1,]
for( i in 1:num.groups )
{
z <- t[,i]
plot( rep(1,30), 1:30, bty="n", xlim=c(-75,100),
type="n", xaxt="n", yaxt="n",
xlab="Percentile", ylab="",
main=paste("GROUP",i) )
abline( v=seq(0,100,25), lty=3, lwd=1.5, col="gray90" )
segments( y0=1:30, x0=0, x1=100, col="gray70", lwd=2 )
text( -0.2, 1:30, data.dictionary$VARIABLE[-1], cex=0.85, pos=2 )
points( z, 1:30, pch=19, col="firebrick", cex=1.5 )
axis( side=1, at=c(0,50,100), col.axis="gray", col="gray" )
}
Note that in order to add variables together we need to make sure that they all have the same cardinality, i.e. positive scores should represent high scores on each scale, negative scores should represent low scores.
We can determine this by looking at the correlation structure and checking to make sure we have all positive correlations:
Transitivity Index
library( corrplot )
d3 <- as.data.frame(d3)
df.dim1 <- dplyr::select( d3, pown12, pmulti12, p10yrs12, pwds12, pfhh12 )
corrplot( cor(df.dim1, use="complete.obs"),
order = "hclust", tl.col='black', tl.cex=.75 )
We can see that home ownership goes in the opposite direction as all of the other variables. We can fix it by flipping the sign of home ownership, but note that this means that high on the “community stability” scale now means low home ownership, higher rates of transitivity, higher rates of multi-family homes. So we would have to rename the variable to “community instability”.
Alternatively, we can flip the signs of the other variables so high is good on the scale (which is slightly easier to interpret).
# flip the signs
df.dim1$pmulti12 <- - df.dim1$pmulti12
df.dim1$p10yrs12 <- - df.dim1$p10yrs12
df.dim1$pwds12 <- - df.dim1$pwds12
df.dim1$pfhh12 <- - df.dim1$pfhh12
corrplot( cor(df.dim1, use="complete.obs"),
order = "hclust", tl.col='black', tl.cex=.75 )
Diversity Index
Need to flip percent white - low proportion of white population corresponds with higher levels of diversity.
df.dim2 <- d3[ c("pnhwht12", "pnhblk12", "phisp12", "pfb12", "polang12") ]
# Check direction
# Need to flip percent white
corrplot( cor(df.dim2, use="complete.obs"),
order = "hclust", tl.col='black', tl.cex=.75 )
# flip the signs
df.dim2$pnhblk12 <- - df.dim2$pnhblk12
df.dim2$pfb12 <- - df.dim2$pfb12
df.dim2$phisp12 <- - df.dim2$phisp12
df.dim2$polang12 <- - df.dim2$polang12
corrplot( cor(df.dim2, use="complete.obs"),
order = "hclust", tl.col='black', tl.cex=.75 )Human Capital
df.dim3 <- select( d3, pcol12, phs12, pprof12, hinc12, mhmval12 )
# Check direction
# Need to flip high school graduation rates
corrplot( cor(df.dim3, use="complete.obs"),
order = "hclust", tl.col='black', tl.cex=.75 )
# flip the signs
df.dim3$pcol12 <- - df.dim3$pcol12
df.dim3$pprof12 <- - df.dim3$pprof12
df.dim3$hinc12 <- - df.dim3$hinc12
df.dim3$mhmval12 <- - df.dim3$mhmval12
corrplot( cor(df.dim3, use="complete.obs"),
order = "hclust", tl.col='black', tl.cex=.75 )
dim1 <- d3$pown12 - d3$pmulti12 - d3$p10yrs12 - d3$pwds12 - d3$pfhh12
dim2 <- - d3$pnhwht12 + d3$pnhblk12 + d3$phisp12 + d3$pfb12 + d3$polang12
dim3 <- d3$pcol12 - d3$phs12 + d3$pprof12 + d3$hinc12 + d3$mhmval12
df.nhood.metrics <- data.frame( dim1, dim2, dim3 )
summary( df.nhood.metrics )## dim1 dim2 dim3
## Min. :-11.3690 Min. :-4.642 Min. :-9.4133
## 1st Qu.: -1.8842 1st Qu.:-3.342 1st Qu.:-2.3401
## Median : 0.4666 Median :-1.955 Median :-0.3951
## Mean : 0.0000 Mean : 0.000 Mean : 0.0000
## 3rd Qu.: 2.6991 3rd Qu.: 2.663 3rd Qu.: 2.3319
## Max. : 6.4182 Max. :11.568 Max. :11.1834
corrplot( cor( df.nhood.metrics, use="complete.obs" ),
order = "hclust", tl.col='black', tl.cex=.75 )
I arbitrarily selected three census variables to compare methods. Here we are using % of population 18 and under, % of female labor force participation, and household income.
# cluster with data set of three indices
fit2 <- Mclust( df.nhood.metrics )
summary( fit2 )## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 2
## components:
##
## log-likelihood n df BIC ICL
## -1594.944 209 19 -3291.393 -3317.4
##
## Clustering table:
## 1 2
## 115 94
sd_dorling$cluster2 <- as.factor( fit2$classification )
# cluster with dataset of three census variables
d33 <- data.frame( d3$p18und12, d3$pflabf12, d3$hinc12 )
fit3 <- Mclust( d33 )
summary( fit3 )## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVE (ellipsoidal, equal orientation) model with 2 components:
##
## log-likelihood n df BIC ICL
## -828.9165 209 16 -1743.31 -1771.503
##
## Clustering table:
## 1 2
## 25 184
sd_dorling$cluster3 <- as.factor( fit3$classification )
tmap_mode("plot")
tmap_style("cobalt")
tm1 <-
tm_shape( sd_dorling ) +
tm_polygons( col="cluster", palette="Accent" )
tm2 <-
tm_shape( sd_dorling ) +
tm_polygons( col="cluster2", palette="Accent" )
tm3 <-
tm_shape( sd_dorling ) +
tm_polygons( col="cluster3", palette="Accent" )
tmap_arrange( tm1, tm2, tm3 )
# only 3 neighborhood indices
plot( fit2, what = "classification" )
Part 2 explores how the input data impacts your clusters.
You will use your full model with 30 census variables as the reference point, then do the following:
Compare that set of groups to the groups identified by the model using only the three indices above. Are they identifying the same groups? Which group is missing?
ANSWER: What three indices?
Select three census variables from the full list of 30, and create a new clustering model. How many groups are generated? Are they similar groups to the full model? Are our inferences significantly different if we have a limited amount of data to draw from?
Report your three variables and describe the groups that are identified.
ANSWER:
Reflect on how the data we use in these models defines how we are defining and describing the neighborhoods (groups) within these models?
Are these clusters valid constructs? Do we believe that they are accurately describing a concrete reality about the world? Would knowing which group that a tract is part of help us predict something about the tract, such as outcomes that children would experience living there, or how the tract might change over time?
How do we know which is the right set of groups to use?
ANSWER: