Overview
In this code-through we will create maps that visually describe the populations of census tracts, using Seattle, Washington as our example. We will also generate diagrams that display the characteristics of the six groups automatically identified through cluster analysis.
We will be working with census data from the 2012 American Community Survey, made available through the Diversity and Disparities Project.
Project Set Up
Load Packages
You will need to load the following packages to perform the exercises outlined in this tutorial.
library( geojsonio ) # Used to read shapefiles
library( sp ) # Used to work with with shapefiles
library( sf ) # Simplifies features for working with shapefiles
library( mclust ) # Used for analyzing clusters generated by the program
library( tmap ) # Allows us to create and utilize themed maps
library( ggplot2 ) # Allows us to graph group characteristics from clusters
library( ggthemes ) # Allows us to reformat graphs created with ggplot2
library( dplyr ) # Data wrangling tool
library( pander ) # Used to format table created in .rmd files
library( tidycensus ) # Allows us to pull census data
library( cartogram ) # Allows us to create spatial maps with tract size bias reduction
library( maptools ) # Used for spatial object manipulation
Download Data
We will be using census data made available through the Diversity and Disparities Project for this code-through. The project has created a dataset of 170 variables that have been harmonized from 1970 onward, allowing for straightforward analysis of changes in census tracts over time.
DATA DICTIONARY
| LABEL | VARIABLE |
|---|---|
| tractid | GEOID |
| pnhwht12 | Percent white, non-Hispanic |
| pnhblk12 | Percent black, non-Hispanic |
| phisp12 | Percent Hispanic |
| pntv12 | Percent Native American race |
| pfb12 | Percent foreign born |
| polang12 | Percent speaking other language at home, age 5 plus |
| phs12 | Percent with high school degree or less |
| pcol12 | Percent with 4-year college degree or more |
| punemp12 | Percent unemployed |
| pflabf12 | Percent female labor force participation |
| pprof12 | Percent professional employees |
| pmanuf12 | Percent manufacturing employees |
| pvet12 | Percent veteran |
| psemp12 | Percent self-employed |
| hinc12 | Median HH income, total |
| incpc12 | Per capita income |
| ppov12 | Percent in poverty, total |
| pown12 | Percent owner-occupied units |
| pvac12 | Percent vacant units |
| pmulti12 | Percent multi-family units |
| mrent12 | Median rent |
| mhmval12 | Median home value |
| p30old12 | Percent structures more than 30 years old |
| p10yrs12 | Percent HH in neighborhood 10 years or less |
| p18und12 | Percent 17 and under, total |
| p60up12 | Percent 60 and older, total |
| p75up12 | Percent 75 and older, total |
| pmar12 | Percent currently married, not separated |
| pwds12 | Percent widowed, divorced and separated |
| pfhh12 | Percent female-headed families with children |
Part 1: Building a Neighborhood Dataset
Step 1: Select Your MSA
Once you have decided which metropolitan area you would like to analyze, you can search for its city name using `grep( "CITYNAME", crosswalk$msaname, value=TRUE )`.
crosswalk <- read.csv( "https://raw.githubusercontent.com/DS4PS/cpp-529-master/master/data/cbsatocountycrosswalk.csv", stringsAsFactors=F, colClasses="character" )
grep( "SEATTLE", crosswalk$msaname, value=TRUE ) ## [1] "SEATTLE-BELLEVUE-EVERETT, WA" "SEATTLE-BELLEVUE-EVERETT, WA"
## [3] "SEATTLE-BELLEVUE-EVERETT, WA"
Step 2: Download a Shapefile with Population Data
Having identified your metropolitan area, you can download the shapefiles for it.
Simply copy and paste the MSA name that you found through your search above into the first line below to pull the information for those areas.
Using `census_api_key("your key here")` with your personal key, you can connect to the Census API and pull your data. If you do not yet have a Census API key, you can request one at api.census.gov.
After connecting, you can use the MSA name and the `get_acs()` function to pull the shapefiles and population data needed for creating your maps and analyzing your populations:
these.areas <- crosswalk$msaname == "SEATTLE-BELLEVUE-EVERETT, WA"
these.fips <- crosswalk$fipscounty[ these.areas ]
these.fips <- na.omit( these.fips )
state.fips <- substr( these.fips, 1, 2 )
county.fips <- substr( these.fips, 3, 5 )
census_api_key("42bf5fcc6e6a6f05ebe97a0e647a5216a708613a")
area.pop <-
  get_acs( geography = "tract", variables = "B01003_001",
           state = "53", county = county.fips[ state.fips=="53" ],
           geometry = TRUE ) %>%
  select( GEOID, estimate ) %>%
  rename( POP = estimate )
Step 3: Add Census Data
After we have pulled our geographic data, we can retrieve the demographic data from the census file and merge the two datasets on their shared tract identifier (GEOID).
- Since we are surveying an area with a coastline, some tracts have empty geometries; we also need to include `area <- area[ ! st_is_empty( area ) , ]` to drop those tracts, as sketched below.
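A minimal sketch of this step, assuming the harmonized 2012 variables have been saved locally as ltdb-2012.csv (the file name, the format of its tractid column, and the object name `area` are assumptions; `area.pop` is the sf object returned by `get_acs()` above):
# load the harmonized 2012 census variables (illustrative file name)
census.dat <- read.csv( "ltdb-2012.csv", stringsAsFactors=FALSE )
# merge the demographic variables onto the tract geometries; this assumes
# tractid is stored in the same 11-digit format as GEOID
area <- merge( area.pop, census.dat, by.x="GEOID", by.y="tractid" )
# drop tracts with empty geometries (e.g. water-only coastal tracts)
area <- area[ ! st_is_empty( area ) , ]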
Step 4: Transform the Shapefile into a Dorling Cartogram
Once we have defined the areas we are interested in mapping, we can transform our shapefile into a Dorling cartogram and plot it. This provides a map of the tracts we are examining.
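The class output below suggests the merged sf object was converted to an sp object before the cartogram was built. Here is a minimal sketch of that conversion, assuming `area` is the merged sf object from Step 3:
# convert the sf object to a SpatialPolygonsDataFrame,
# which the sp-based cartogram steps below expect
area.sp <- as_Spatial( area )
class( area.sp )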
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
We can then project the map and remove tracts with zero or missing population before displaying census information:
# project map and remove empty tracts
area.sp <- spTransform( area.sp, CRS("+init=epsg:3395"))
area.sp <- area.sp[ area.sp$POP != 0 & (! is.na( area.sp$POP )) , ]
# convert census tract polygons to dorling cartogram
# the default is k=5; smaller values shrink the bubbles (k=0.05 works well here)
area.sp$pop.w <- area.sp$POP / 9000   # scale population into a weight (roughly 0 to 1.5)
area_dorling <- cartogram_dorling( x=area.sp, weight="pop.w", k=0.05 )
plot( area_dorling )
We can then color code the census data to display gradients in the trends we see within the area's population, such as median household income, while also defining the size of our bubbles, the data displayed, and the color palette used to display that data:
tm_shape( area_dorling ) +
tm_polygons( size="POP", col="hinc12", n=7, style="quantile", palette="Spectral" ) If we play around more with the maps, we will find that we are also able to customize them more with items such as titles and content location.
tm_shape( area_dorling ) +
tm_polygons( col="hinc12", n=10, style="quantile", palette="Spectral" ) +
tm_layout( "Seattle Cartogram", title.position=c("right","top") )## $tm_layout
## $tm_layout$legend.show
## [1] TRUE
##
## $tm_layout$style
## [1] NA
##
##
## attr(,"class")
## [1] "tm"
We can also switch the basemap to display topography and have the map focus on our area of interest when it loads.
tmap_mode("view")
tm_basemap( "HikeBike.HillShading" ) +
tm_shape( area_dorling, bbox = "Seattle" ) +
tm_polygons( col="hinc12", n=7, style="quantile", palette="RdYlBu" ) +
tm_layout( "Seattle Cartogram", title.position=c("right","top") )## $tm_layout
## $tm_layout$legend.show
## [1] TRUE
##
## $tm_layout$style
## [1] NA
##
##
## attr(,"class")
## [1] "tm"
Or we can use other basemaps, such as Stamen Watercolor, to display reference points like bodies of water, roads, or parks.
tm_basemap(leaflet::providers$Stamen.Watercolor) +
tm_shape( area_dorling, bbox = "Seattle" ) +
tm_polygons( col="hinc12", n=7, style="quantile", palette="RdYlBu" ) +
tm_layout( "Seattle Cartogram", title.position=c("right","top") )## $tm_layout
## $tm_layout$legend.show
## [1] TRUE
##
## $tm_layout$style
## [1] NA
##
##
## attr(,"class")
## [1] "tm"
Part 2: Creating representative groups of the populations represented in the maps
After determining the variables that are of interest for your populations, you can automatically generate distinct groups based on those characteristics. The data are first transformed into z-scores so that all variables are on a comparable scale.
Prepare Data for Clustering
Using a `keep.these` vector, you can select which variables to keep. You can then examine the data embedded in the map, focusing on your variables of interest. The table below shows the standardized values of the first 7 variables we are examining, and a sketch of this preparation step follows the table.
| pnhwht12 | pnhblk12 | phisp12 | pntv12 | pfb12 | polang12 | phs12 |
|---|---|---|---|---|---|---|
| 1.24 | -0.7428 | -0.8731 | -0.4588 | -1.181 | -1.323 | 1.056 |
| 0.2575 | 0.6813 | -0.3043 | 0.7652 | -1.199 | -1.241 | -0.1616 |
| 1.171 | -0.5183 | -1.048 | -0.01369 | -1.355 | -1.193 | 0.111 |
| 0.1927 | -0.3768 | -0.4018 | -0.1002 | -0.4999 | -0.2616 | 1.027 |
| 0.7317 | -0.6259 | -0.441 | -0.08169 | -0.636 | -0.7571 | -0.4013 |
| -0.2968 | -0.1246 | 0.5503 | -0.3351 | 0.0542 | -0.5033 | 0.3122 |
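A minimal sketch of the data preparation, assuming `area` is the merged sf object from Part 1 (the object names `d`, `d2`, and `d3` are assumptions chosen to match the clustering code later in this section, and only the first few clustering variables are listed):
# drop the geometry column to get a plain data frame of tract attributes
d <- st_drop_geometry( area )
# variables used for clustering (abbreviated; the full vector lists all of
# the percentage variables from the data dictionary above)
keep.these <- c( "pnhwht12", "pnhblk12", "phisp12", "pntv12",
                 "pfb12", "polang12", "phs12" )
d2 <- select( d, all_of( keep.these ) )
d2 <- mutate( d2, across( everything(), as.numeric ) ) # in case columns were read as character
# convert each variable to a z-score so all variables share a common scale
d3 <- apply( d2, 2, scale )
head( d3 ) %>% pander()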
Perform Cluster Analysis
We will then use mclust to create statistically distinct groupings of these tracts and examine the differences between the groups.
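A minimal sketch of the model fit, assuming `d3` is the z-score matrix from the previous step (the object name `fit` matches the code below; `set.seed()` is added only for reproducibility):
set.seed( 1234 )
# fit a Gaussian mixture model; when no number of groups is supplied,
# Mclust() chooses it using BIC (six groups for this Seattle data)
fit <- Mclust( d3 )
summary( fit )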
Identifying Neighborhood Clusters
Finally, we can create representative charts based on these variables so that we can compare the groups by their census characteristics. In this case we are letting the program determine the number of groups it creates, although you can also specify a set number of groups if you would like.
# convert each variable to percentiles (1-100) for easier comparison
df.pct <- sapply( d2, ntile, 100 )
d4 <- as.data.frame( df.pct )
# attach the cluster assignment from the mclust fit
d4$cluster <- as.factor( paste0( "GROUP-", fit$classification ) )
num.groups <- length( unique( fit$classification ) )
# average percentile of each variable within each group
stats <-
  d4 %>%
  group_by( cluster ) %>%
  summarise( across( everything(), mean ) )
# transpose so that groups are columns and variables are rows
t <- data.frame( t(stats), stringsAsFactors=F )
names(t) <- paste0( "GROUP.", 1:num.groups )
t <- t[-1,]
# plot the mean percentile of each variable for every group;
# data.dictionary is assumed to hold the data dictionary table above,
# with the variable descriptions stored in its VARIABLE column
for( i in 1:num.groups )
{
  z <- t[,i]
  plot( rep(1,30), 1:30, bty="n", xlim=c(-75,100),
        type="n", xaxt="n", yaxt="n",
        xlab="Percentile", ylab="",
        main=paste("GROUP",i) )
  abline( v=seq(0,100,25), lty=3, lwd=1.5, col="gray90" )
  segments( y0=1:30, x0=0, x1=100, col="gray70", lwd=2 )
  text( -0.2, 1:30, data.dictionary$VARIABLE[-1], cex=0.85, pos=2 )
  points( z, 1:30, pch=19, col="firebrick", cex=1.5 )
  axis( side=1, at=c(0,50,100), col.axis="gray", col="gray" )
}
Using this Information to Serve Our Community
Once you have produced your charts, you can use them to create labels for your groups based on the information found in the clusters identified above. You can also identify trends in various areas based on that information (e.g. the demographic makeup of each area).
I found this process incredibly useful in helping me better understand the areas we examined by better understanding the people who live in them. This process could be used to inform decisions for local governments or non-profit organizations, such as where to open a new workforce center or which areas to target when promoting a new GED preparation course. It can help us not only better understand our communities of interest, but also better serve them.
Code-Through Created By: Ricky Duran For: CPP 529 - Data Analysis Practicum Created On: November 27, 2019