Overview
In this code-through we will create maps that visually describe the populations of census tracts, using Seattle, Washington as our example. We will also generate diagrams that display the characteristics of the six groups automatically identified through cluster analysis.
We will be working with census data from the 2012 American Community Survey, made available through the Diversity and Disparities Project.
Project Set Up
Load Packages
You will need to load the following packages to perform the exercises outlined in this tutorial.
library( geojsonio ) # Used to read shapefiles
library( sp ) # Used to work with with shapefiles
library( sf ) # Simplifies features for working with shapefiles
library( mclust ) # Used for analyzing clusters generated by the program
library( tmap ) # Allows us to create and utilize themed maps
library( ggplot2 ) # Allows us to graph group characteristics from clusters
library( ggthemes ) # Allows us to reformat graphs created with ggplot2
library( dplyr ) # Data wrangling tool
library( pander ) # Used to format table created in .rmd files
library( tidycensus ) # Allows us to pull census data
library( cartogram ) # Allows us to create spatial maps with tract size bias reduction
library( maptools ) # Used for spatial object manipulation
Download Data
We will be using census data made available through the Diversity and Disparities Project for this code-through. The project has created a dataset of 170 variables that have been harmonized from 1970 onward, allowing for straightforward analysis of changes in census tracts over time.
DATA DICTIONARY
| LABEL | VARIABLE |
|---|---|
| tractid | GEOID |
| pnhwht12 | Percent white, non-Hispanic |
| pnhblk12 | Percent black, non-Hispanic |
| phisp12 | Percent Hispanic |
| pntv12 | Percent Native American race |
| pfb12 | Percent foreign born |
| polang12 | Percent speaking other language at home, age 5 plus |
| phs12 | Percent with high school degree or less |
| pcol12 | Percent with 4-year college degree or more |
| punemp12 | Percent unemployed |
| pflabf12 | Percent female labor force participation |
| pprof12 | Percent professional employees |
| pmanuf12 | Percent manufacturing employees |
| pvet12 | Percent veteran |
| psemp12 | Percent self-employed |
| hinc12 | Median HH income, total |
| incpc12 | Per capita income |
| ppov12 | Percent in poverty, total |
| pown12 | Percent owner-occupied units |
| pvac12 | Percent vacant units |
| pmulti12 | Percent multi-family units |
| mrent12 | Median rent |
| mhmval12 | Median home value |
| p30old12 | Percent structures more than 30 years old |
| p10yrs12 | Percent HH in neighborhood 10 years or less |
| p18und12 | Percent 17 and under, total |
| p60up12 | Percent 60 and older, total |
| p75up12 | Percent 75 and older, total |
| pmar12 | Percent currently married, not separated |
| pwds12 | Percent widowed, divorced and separated |
| pfhh12 | Percent female-headed families with children |
Part 1: Building a Neighborhood Dataset
Step 1: Select Your MSA
Once you have decided which metropolitan area you would like to analyze, you can search for its city name using `grep( "CITYNAME", crosswalk$msaname, value=TRUE )`.
crosswalk <- read.csv( "https://raw.githubusercontent.com/DS4PS/cpp-529-master/master/data/cbsatocountycrosswalk.csv", stringsAsFactors=F, colClasses="character" )
grep( "SEATTLE", crosswalk$msaname, value=TRUE ) ## [1] "SEATTLE-BELLEVUE-EVERETT, WA" "SEATTLE-BELLEVUE-EVERETT, WA"
## [3] "SEATTLE-BELLEVUE-EVERETT, WA"
Step 2: Download a Shapefile with Population Data
Having identified your metropolitan area, you can download the shapefiles for it.
Simply copy and paste the MSA name that you found through your search above into the first line below to pull the information for those areas.
Using `census_api_key("your key here")` with your personal key, you can connect to the Census API and pull your data. If you do not yet have a Census API key, you can request one at api.census.gov.
After connecting, you can use the MSA name and the `get_acs()` function to pull the shapefiles and population data needed for creating your maps and analyzing your populations:
these.areas <- crosswalk$msaname == "SEATTLE-BELLEVUE-EVERETT, WA"
these.fips <- crosswalk$fipscounty[ these.areas ]
these.fips <- na.omit( these.fips )
state.fips <- substr( these.fips, 1, 2 )
county.fips <- substr( these.fips, 3, 5 )
census_api_key("42bf5fcc6e6a6f05ebe97a0e647a5216a708613a")
area.pop <-
  get_acs( geography = "tract", variables = "B01003_001",
           state = "53", county = county.fips[ state.fips=="53" ],
           geometry = TRUE ) %>%
  select( GEOID, estimate ) %>%
  rename( POP = estimate )
Step 3: Add Census Data
After we have pulled our geographic data, we can retrieve the demographic data from the census file and merge the two datasets on their shared tract identifier (GEOID).
- Since we are surveying an area with a coastline, some tracts have empty geometries; we also need to include `area <- area[ ! st_is_empty( area ) , ]` to drop those tracts, as sketched below.
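A minimal sketch of this step, assuming the harmonized 2012 variables have been saved locally as ltdb-2012.csv (the file name, the format of its tractid column, and the object name `area` are assumptions; `area.pop` is the sf object returned by `get_acs()` above):
# load the harmonized 2012 census variables (illustrative file name)
census.dat <- read.csv( "ltdb-2012.csv", stringsAsFactors=FALSE )
# merge the demographic variables onto the tract geometries; this assumes
# tractid is stored in the same 11-digit format as GEOID
area <- merge( area.pop, census.dat, by.x="GEOID", by.y="tractid" )
# drop tracts with empty geometries (e.g. water-only coastal tracts)
area <- area[ ! st_is_empty( area ) , ]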
Step 4: Transform the Shapefile into a Dorling Cartogram
Once we have defined the areas we are interested in mapping, we can transform our shapefile into a Dorling cartogram and plot it. This provides a map of the tracts we are examining.
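The class output below suggests the merged sf object was converted to an sp object before the cartogram was built. Here is a minimal sketch of that conversion, assuming `area` is the merged sf object from Step 3:
# convert the sf object to a SpatialPolygonsDataFrame,
# which the sp-based cartogram steps below expect
area.sp <- as_Spatial( area )
class( area.sp )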
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
We can then project the map and remove tracts with zero or missing population before displaying census information:
# project map and remove empty tracts
area.sp <- spTransform( area.sp, CRS("+init=epsg:3395"))
area.sp <- area.sp[ area.sp$POP != 0 & (! is.na( area.sp$POP )) , ]
# convert census tract polygons to dorling cartogram
# the default is k=5; smaller values shrink the bubbles (k=0.05 works well here)
area.sp$pop.w <- area.sp$POP / 9000   # scale population into a weight (roughly 0 to 1.5)
area_dorling <- cartogram_dorling( x=area.sp, weight="pop.w", k=0.05 )
plot( area_dorling )
We can then color code the census data to display gradients in the trends we see within the area's population, such as median household income, while also defining the size of our bubbles, the data displayed, and the color palette used to display that data:
tm_shape( area_dorling ) +
tm_polygons( size="POP", col="hinc12", n=7, style="quantile", palette="Spectral" ) If we play around more with the maps, we will find that we are also able to customize them more with items such as titles and content location.
tm_shape( area_dorling ) +
tm_polygons( col="hinc12", n=10, style="quantile", palette="Spectral" ) +
tm_layout( "Seattle Cartogram", title.position=c("right","top") )## $tm_layout
## $tm_layout$legend.show
## [1] TRUE
##
## $tm_layout$style
## [1] NA
##
##
## attr(,"class")
## [1] "tm"
We can also switch the basemap to display topography and have the map focus on our area of interest when it loads.
tmap_mode("view")
tm_basemap( "HikeBike.HillShading" ) +
tm_shape( area_dorling, bbox = "Seattle" ) +
tm_polygons( col="hinc12", n=7, style="quantile", palette="RdYlBu" ) +
tm_layout( "Seattle Cartogram", title.position=c("right","top") )## $tm_layout
## $tm_layout$legend.show
## [1] TRUE
##
## $tm_layout$style
## [1] NA
##
##
## attr(,"class")
## [1] "tm"
Or we can use other basemaps, such as Stamen Watercolor, to display reference points like bodies of water, roads, or parks.
tm_basemap(leaflet::providers$Stamen.Watercolor) +
tm_shape( area_dorling, bbox = "Seattle" ) +
tm_polygons( col="hinc12", n=7, style="quantile", palette="RdYlBu" ) +
tm_layout( "Seattle Cartogram", title.position=c("right","top") )## $tm_layout
## $tm_layout$legend.show
## [1] TRUE
##
## $tm_layout$style
## [1] NA
##
##
## attr(,"class")
## [1] "tm"
Part 2: Creating representative groups of the populations represented in the maps
After determining the variables that are of interest for your populations, you can automatically generate distinct groups based on those characteristics. The data are first transformed into z-scores so that all variables are on a comparable scale.
Prepare Data for Clustering
Using a `keep.these` vector, you can select which variables to keep. You can then examine the data embedded in the map, focusing on your variables of interest. The table below shows the standardized values of the first 7 variables we are examining, and a sketch of this preparation step follows the table.
| pnhwht12 | pnhblk12 | phisp12 | pntv12 | pfb12 | polang12 | phs12 |
|---|---|---|---|---|---|---|
| 1.24 | -0.7428 | -0.8731 | -0.4588 | -1.181 | -1.323 | 1.056 |
| 0.2575 | 0.6813 | -0.3043 | 0.7652 | -1.199 | -1.241 | -0.1616 |
| 1.171 | -0.5183 | -1.048 | -0.01369 | -1.355 | -1.193 | 0.111 |
| 0.1927 | -0.3768 | -0.4018 | -0.1002 | -0.4999 | -0.2616 | 1.027 |
| 0.7317 | -0.6259 | -0.441 | -0.08169 | -0.636 | -0.7571 | -0.4013 |
| -0.2968 | -0.1246 | 0.5503 | -0.3351 | 0.0542 | -0.5033 | 0.3122 |
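A minimal sketch of the data preparation, assuming `area` is the merged sf object from Part 1 (the object names `d`, `d2`, and `d3` are assumptions chosen to match the clustering code later in this section, and only the first few clustering variables are listed):
# drop the geometry column to get a plain data frame of tract attributes
d <- st_drop_geometry( area )
# variables used for clustering (abbreviated; the full vector lists all of
# the percentage variables from the data dictionary above)
keep.these <- c( "pnhwht12", "pnhblk12", "phisp12", "pntv12",
                 "pfb12", "polang12", "phs12" )
d2 <- select( d, all_of( keep.these ) )
d2 <- mutate( d2, across( everything(), as.numeric ) ) # in case columns were read as character
# convert each variable to a z-score so all variables share a common scale
d3 <- apply( d2, 2, scale )
head( d3 ) %>% pander()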
Perform Cluster Analysis
We will then use mclust to create statistically distinct groupings of these tracts and examine the differences between the groups.
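A minimal sketch of the model fit, assuming `d3` is the z-score matrix from the previous step (the object name `fit` matches the code below; `set.seed()` is added only for reproducibility):
set.seed( 1234 )
# fit a Gaussian mixture model; when no number of groups is supplied,
# Mclust() chooses it using BIC (six groups for this Seattle data)
fit <- Mclust( d3 )
summary( fit )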
Identifying Neighborhood Clusters
Finally, we can create representative charts based on these variables so that we can compare the groups by their census characteristics. In this case we are letting the program determine the number of groups it creates, although you can also specify a set number of groups if you would like.
# convert each variable to percentiles (1-100) for easier comparison
df.pct <- sapply( d2, ntile, 100 )
d4 <- as.data.frame( df.pct )
# attach the cluster assignment from the mclust fit
d4$cluster <- as.factor( paste0( "GROUP-", fit$classification ) )
num.groups <- length( unique( fit$classification ) )
# average percentile of each variable within each group
stats <-
  d4 %>%
  group_by( cluster ) %>%
  summarise( across( everything(), mean ) )
# transpose so that groups are columns and variables are rows
t <- data.frame( t(stats), stringsAsFactors=F )
names(t) <- paste0( "GROUP.", 1:num.groups )
t <- t[-1,]
# plot the mean percentile of each variable for every group;
# data.dictionary is assumed to hold the data dictionary table above,
# with the variable descriptions stored in its VARIABLE column
for( i in 1:num.groups )
{
  z <- t[,i]
  plot( rep(1,30), 1:30, bty="n", xlim=c(-75,100),
        type="n", xaxt="n", yaxt="n",
        xlab="Percentile", ylab="",
        main=paste("GROUP",i) )
  abline( v=seq(0,100,25), lty=3, lwd=1.5, col="gray90" )
  segments( y0=1:30, x0=0, x1=100, col="gray70", lwd=2 )
  text( -0.2, 1:30, data.dictionary$VARIABLE[-1], cex=0.85, pos=2 )
  points( z, 1:30, pch=19, col="firebrick", cex=1.5 )
  axis( side=1, at=c(0,50,100), col.axis="gray", col="gray" )
}
Using this Information to Serve Our Community
Once you have produced your charts, you can use them to create labels for your groups based on the information found in the clusters identified above. You can also identify trends in various areas based on that information (e.g. the demographic makeup of each area).
I found this process incredibly useful in helping me better understand the areas we examined by better understanding the people who live in them. This process could be used to inform decisions for local governments or non-profit organizations, such as where to open a new workforce center or which areas to target when promoting a new GED preparation course. It can help us not only better understand our communities of interest, but also better serve them.
Code-Through Created By: Ricky Duran For: CPP 529 - Data Analysis Practicum Created On: November 27, 2019