Mining the U.S. Census Bureau API metadata

The Census Bureau maintains over 330 datasets with over 64,000 variables, covering topics from international trade to population estimates to business formation within the United States. I’ll use the metadata from these datasets to understand the connections between them.

The metadata includes information such as the title of the dataset, a description field, which organization(s) within the Census Bureau are responsible for the dataset, keywords that have been assigned by a human being, and so forth. The metadata for all Census Bureau API datasets is publicly available online in JSON format.

In this report, I treat the Census Bureau metadata as a text dataset and apply text mining techniques using the R package tidytext. I will compute word co-occurrences and correlations, tf-idf, and topic models to explore the connections between the datasets, with the aim of finding out whether datasets are related to one another and whether they form clusters of similar datasets. Since the Census Bureau provides several text fields in the metadata, most importantly the title, description, and keyword fields, I can show connections between the fields to better understand the connections between the Census Bureau API datasets.

How data is organized at the Census Bureau

First, I download the JSON file and look at the names of the fields stored in the metadata.

library(jsonlite)

metadata <- jsonlite::fromJSON("https://api.census.gov/data.json")

base::names(metadata$dataset)

##  [1] "c_vintage"           "c_dataset"           "c_geographyLink"     "c_variablesLink"    
##  [5] "c_tagsLink"          "c_examplesLink"      "c_groupsLink"        "c_valuesLink"       
##  [9] "c_documentationLink" "c_isAggregate"       "c_isAvailable"       "@type"              
## [13] "title"               "accessLevel"         "bureauCode"          "description"        
## [17] "distribution"        "contactPoint"        "identifier"          "keyword"            
## [21] "license"             "modified"            "programCode"         "references"         
## [25] "spatial"             "temporal"            "publisher"           "c_isCube"           
## [29] "c_isTimeseries"

The title, description, and keywords for each dataset will be the features of interest.

base::class(metadata$dataset$title)

## [1] "character"

base::class(metadata$dataset$description)

## [1] "character"

base::class(metadata$dataset$keyword)

## [1] "list"

The title and description fields are stored as character vectors, and the keywords are stored as a list of character vectors.
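To make the list structure concrete, here is a minimal base R sketch of flattening a list of character vectors into one row per dataset-keyword pair. The identifiers and keywords are invented for illustration; they are not taken from the Census Bureau metadata.

```r
# Toy stand-in for the identifier and keyword fields (hypothetical values)
toy_id <- c("ds_a", "ds_b", "ds_c")
toy_keyword <- list(c("Income", "Poverty"), NULL, "Income")

# Number of keywords per dataset; a NULL entry has length 0
n_keywords <- lengths(toy_keyword)

# One row per (id, keyword) pair; datasets without keywords drop out
toy_long <- data.frame(
  id = rep(toy_id, times = n_keywords),
  keyword = unlist(toy_keyword),
  stringsAsFactors = FALSE
)

print(toy_long)
```

Because `ds_b` has no keywords, it contributes no rows to the long table, which is why datasets with empty keyword fields need no separate handling after flattening.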

Data preparation

library(tidyverse)

census_title <- tibble::tibble(
  id = metadata$dataset$identifier,
  title = metadata$dataset$title
)
census_title %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	title
http://api.census.gov/data/id/POPESThousing2016	Vintage 2016 Population Estimates: Housing Unit Estimates for US, States, and Counties
https://api.census.gov/data/id/ACSST5Y2015	ACS 5-Year Subject Tables
http://api.census.gov/data/id/ACSFlows2011	2007-2011 American Community Survey: Migration Flows
https://api.census.gov/data/id/PEPCOMPONENTS2018	Vintage 2018 Population Estimates: Components of Change Estimates
http://api.census.gov/data/id/ASMState	Time Series Annual Survey of Manufactures: Statistics for All Manufacturing by State
http://api.census.gov/data/id/EconCensusEWKS2007	2007 Economic Census - All Sectors: Economy-Wide Key Statistics
http://api.census.gov/data/id/POPESThousing2013	Vintage 2013 Population Estimates: Housing Unit Estimates for US, States, and Counties
http://api.census.gov/data/id/POPESTcomponents2015	Vintage 2015 Population Estimates: Components of Change Estimates
http://api.census.gov/data/id/POPESTcty2013	Vintage 2013 Population Estimates: County Total Population and Components of Change
http://api.census.gov/data/id/POPESThousing2014	Vintage 2014 Population Estimates: Housing Unit Estimates for US, States, and Counties

census_desc <- tibble::tibble(
  id = metadata$dataset$identifier,
  desc = metadata$dataset$description
)

census_desc %>%
#  dplyr::select(desc) %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	desc
http://api.census.gov/data/id/PDBBLOCKGROUP2018	The PDB is a database of U.S. housing, demographic, socioeconomic and operational statistics based on select 2010 Decennial Census and select 5-year American Community Survey (ACS) estimates. Data are provided at the census block group level of geography. These data can be used for many purposes, including survey field operations planning.
http://api.census.gov/data/id/POPESTintercensalnatcivpop1990	Monthly Intercensal Estimates of the Civilian Population by Single Year of Age and Sex: April 1, 1990 to April 1, 2000 // Source: U.S. Census Bureau, Population Division // For detailed information about the methods used to create the intercensal population estimates, see https://www.census.gov/popest/methodology/intercensal_nat_meth.pdf. // The Census Bureau’s Population Estimates Program produces intercensal estimates each decade by adjusting the existing time series of postcensal estimates for a decade to smooth the transition from one decennial census count to the next. They differ from the postcensal estimates that are released annually because they rely on a formula that redistributes the difference between the April 1 postcensal estimate and April 1 census count for the end of the decade across the estimates for that decade. Meanwhile, the postcensal estimates incorporate current data on births, deaths, and migration to produce each new vintage of estimates, and to revise estimates for years back to the last census. The Population Estimates Program provides additional information including historical and postcensal estimates, evaluation estimates, demographic analysis, and research papers on its website: https://www.census.gov/popest/index.html.
http://api.census.gov/data/id/ACSProfile3Y2013	The American Community Survey (ACS) is a nationwide survey designed to provide communities a fresh look at how they are changing. The ACS replaced the decennial census long form in 2010 and thereafter by collecting long form type information throughout the decade rather than only once every 10 years. Questionnaires are mailed to a sample of addresses to obtain information about households – that is, about each person and the housing unit itself. The American Community Survey produces demographic, social, housing and economic estimates in the form of 1-year, 3-year and 5-year estimates based on population thresholds. The strength of the ACS is in estimating population and housing characteristics. The 3-year data provide key estimates for each of the topic areas covered by the ACS for the nation, all 50 states, the District of Columbia, Puerto Rico, every congressional district, every metropolitan area, and all counties and places with populations of 20,000 or more. Although the ACS produces population, demographic and housing unit estimates,it is the Census Bureau’s Population Estimates Program that produces and disseminates the official estimates of the population for the nation, states, counties, cities and towns, and estimates of housing units for states and counties. For 2010 and other decennial census years, the Decennial Census provides the official counts of population and housing units.
http://api.census.gov/data/id/2011acs5	This endpoint is being phased out. Please use corresponding endpoint found at api.census.gov/data/2011/acs/acs5.
http://api.census.gov/data/id/POPPROJBirths2012	Projected Births by Sex, Race, and Hispanic Origin for the United States: 2012 to 2060 File: 2012 National Population Projections Source: U.S. Census Bureau, Population Division Release Date: December 2012 NOTE: Hispanic origin is considered an ethnicity, not a race. Hispanics may be of any race. The projections generally do not precisely agree with population estimates available elsewhere on the Census Bureau website for methodological reasons. Where both estimates and projections are available for a given time reference, we recommend that you use the population estimates as the measure of the current population. For detailed information about the methods used to create the population projections, see http://www.census.gov/population/projections/methodology/. *** The U.S. Census Bureau periodically produces projections of the United States resident population by age, sex, race, and Hispanic origin. Population projections are estimates of the population for future dates. They are typically based on an estimated population consistent with the most recent decennial census and are produced using the cohort-component method. Projections illustrate possible courses of population change based on assumptions about future births, deaths, net international migration, and domestic migration. In some cases, several series of projections are produced based on alternative assumptions for future fertility, life expectancy, net international migration, and (for state-level projections) state-to-state or domestic migration. Additional information is available on the Population Projections website: http://www.census.gov/population/projections/.
http://api.census.gov/data/id/DecennialSF11990	Summary File 1 (SF 1) contains detailed tables focusing on age, sex, households, families, and housing units. These tables provide in-depth figures by race and Hispanic origin; some tables are repeated for each of nine race/Latino groups. Counts also are provided for over forty American Indian and Alaska Native tribes and for groups within race categories. The race categories include eighteen Asian groups and twelve Native Hawaiian and Other Pacific Islander groups. Counts of persons of Hispanic origin by country of origin (twenty-eight groups) are also shown. Summary File 1 presents data for the United States, the 50 states, and the District of Columbia in a hierarchical sequence down to the block level for many tabulations, but only to the census tract level for others. Summaries are included for other geographic areas such as ZIP Code Tabulation Areas (ZCTAs) and Congressional districts. Geographic coverage for Puerto Rico is comparable to the 50 states. Data are presented in a hierarchical sequence down the block level for many tabulations, but only to the census tract level for others. Geographic areas include barrios, barrios-pueblo, subbarrios, places, census tracts, block groups, and blocks. Summaries also are included for other geographic areas such as ZIP Code Tabulation Areas (ZCTAs). Puerto Rico data will be loaded in January 2017.
http://api.census.gov/data/id/ACSSF5Y2010	This endpoint is being phased out. Please use corresponding endpoint found at api.census.gov/data/2010/acs/acs5.
http://api.census.gov/data/id/ftdImpExpHist	This international trade file provides the annual dollar value of U.S. exports and imports of goods for all U.S. trade partners. It also provides the annual dollar value of U.S. exports and imports of manufactured goods for all U.S. trade partners. You can find this data and more by going to usatrade.census.gov. If you have any questions regarding U.S. international trade data, please call us at 1(800)549-0595 option #4 or email us at eid.international.trade.data@census.gov.
http://api.census.gov/data/id/ACSProfile5Y2013	This endpoint is being phased out. Please use corresponding endpoint found at api.census.gov/data/2013/acs/acs5/profile.
http://api.census.gov/data/id/ITMONTHLYIMPORTSSTATEHS	The Census data API provides access to the most comprehensive set of data on current month and cumulative year-to-date imports by state and Harmonized System (HS) code. The State HS endpoint in the Census data API also provides value, shipping weight, and method of transportation totals at the state level for all U.S. trading partners. The Census data API will help users research new markets for their products, establish pricing structures for potential export markets, and conduct economic planning. If you have any questions regarding U.S. international trade data, please call us at 1(800)549-0595 option #4 or email us at eid.international.trade.data@census.gov.

census_keyword <- tibble::tibble(
  id = metadata$dataset$identifier,
  keyword = metadata$dataset$keyword) %>% 
  dplyr::filter(!purrr::map_lgl(keyword, is.null)) %>%
  tidyr::unnest(keyword)

census_keyword %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	keyword
http://api.census.gov/data/id/ACSSF5Y2010	Income
http://api.census.gov/data/id/ACSSF5Y2012	Poverty
http://api.census.gov/data/id/ACSSF5Y2013	Marital
http://api.census.gov/data/id/ACSSF5Y2015	Poverty
http://api.census.gov/data/id/ACSSF5Y2014	Income
http://api.census.gov/data/id/ACSSF5Y2012	Marital
http://api.census.gov/data/id/ACSSF5Y2009	Poverty
http://api.census.gov/data/id/ACSSF5Y2010	Poverty
http://api.census.gov/data/id/ACSSF5Y2009	Income
http://api.census.gov/data/id/ACSSF5Y2012	Income

library(tidytext)

census_title <- census_title %>%
  tidytext::unnest_tokens(word, title) %>%
  dplyr::anti_join(stop_words, by = "word")

census_desc <- census_desc %>%
  tidytext::unnest_tokens(word, desc) %>%
  dplyr::anti_join(stop_words, by = "word")

The title, description, and keyword datasets have been prepared and are now ready for exploration.
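The tokenize-then-filter step can be sketched by hand in base R. This is a simplified illustration, not what tidytext does internally: the real tokenizer follows its own word-boundary rules (for example, it keeps strings like www.census.gov as one token), and the titles and two-word stop list here are made up.

```r
# Hypothetical titles and a tiny stop-word list, for illustration only
toy_titles <- data.frame(
  id = c("ds_a", "ds_b"),
  title = c("Vintage 2016 Population Estimates",
            "Time Series of the Annual Survey"),
  stringsAsFactors = FALSE
)
toy_stop_words <- c("of", "the")

# Lowercase each title, then split on runs of non-alphanumeric characters
tokens <- strsplit(tolower(toy_titles$title), "[^a-z0-9]+")

# One row per (id, word), mirroring the one-token-per-row layout
toy_tokens <- data.frame(
  id = rep(toy_titles$id, times = lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

# Drop stop words (the base R analogue of an anti-join on `word`)
toy_tokens <- toy_tokens[!toy_tokens$word %in% toy_stop_words, ]

print(toy_tokens)
```

Note that the tokens are lowercased before matching, so a stop-word list has to be lowercase as well to have any effect.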

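# create a list of user-defined stop words
# NOTE: unnest_tokens() lowercases every token, so the capitalized entries
# below ("Sector", "NAICS", "MRSF", "US1") never match and those words remain
# in the data; lowercase them here if they should actually be removed.
extra_stopwords <- tibble::tibble(
  word = c(
    as.character(1:100), 
    as.character(1950:2018), 
    "endpoint", "phased", "api.census.gov", "acs", "acs5", "u.s", "puerto", "rico", 
    "census", "bureau", "data", "information", "549", "800", "0595", "Sector", "62", "1,000", "100,000",
    "NAICS", "00", "000", "100", "MRSF", "01", "US1", "pdf", "0", "zero", "64,000", "65,000"
  )
)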

# remove those extra stop words from title and description
census_title <- census_title %>%
  dplyr::anti_join(extra_stopwords, by = "word")

census_desc <- census_desc %>%
  dplyr::anti_join(extra_stopwords, by = "word")

census_title %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	word
http://api.census.gov/data/id/CBP2006	business
http://api.census.gov/data/id/ITMONTHLYIMPORTSHS	series
http://api.census.gov/data/id/EITSMWTS	wholesale
https://api.census.gov/data/id/ACSST5Y2013	tables
http://api.census.gov/data/id/ZBPTotal1994	county
http://api.census.gov/data/id/ZBPTotal2016	business
https://api.census.gov/data/id/ACSDT5Y2013	detailed
http://api.census.gov/data/id/CBP2007	patterns
http://api.census.gov/data/id/CBP2009	business
http://api.census.gov/data/id/EITSRESCONST	construction

census_desc %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	word
http://api.census.gov/data/id/ACSFlows2013	addition
http://api.census.gov/data/id/POPESTcochar62013	extends
http://api.census.gov/data/id/ITMONTHLYEXPORTSSTATENAICS	email
http://api.census.gov/data/id/NONEMP2005	source
http://api.census.gov/data/id/POPESTprm2013	april
http://api.census.gov/data/id/CBP2000	series
http://api.census.gov/data/id/2012acs3	tracts
http://api.census.gov/data/id/POPESTnatstprc2014	rates
https://api.census.gov/data/id/ACSDT1Y2015	person
http://api.census.gov/data/id/POPESTcomponents2016	december

census_keyword %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	keyword
http://api.census.gov/data/id/ACSSF5Y2013	Income
http://api.census.gov/data/id/ACSSF5Y2012	Income
http://api.census.gov/data/id/ACSSF5Y2015	Marital
http://api.census.gov/data/id/ACSSF5Y2009	Marital
http://api.census.gov/data/id/ACSSF5Y2010	Marital
http://api.census.gov/data/id/ACSSF5Y2014	Poverty
http://api.census.gov/data/id/ACSSF5Y2014	Income
http://api.census.gov/data/id/ACSSF5Y2014	Marital
http://api.census.gov/data/id/ACSSF5Y2012	Poverty
http://api.census.gov/data/id/ACSSF5Y2012	Marital

Initial simple exploration

What are the most common words in the Census Bureau dataset keywords?

# What are the most common keywords?
census_keyword %>%
  dplyr::group_by(keyword) %>%
  dplyr::count(sort = TRUE) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

keyword	n
Income	6
Marital	6
Poverty	6

What are the most common words in the Census Bureau dataset descriptions?

# What are the most common description words?
census_desc %>%
  dplyr::group_by(word) %>%
  dplyr::count(sort = TRUE) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

word	n
estimates	1178
population	1019
economic	315
housing	298
series	296
program	288
survey	270
decennial	262
race	262
annual	246
vintage	227
www.census.gov	207
demographic	191
statistics	183
time	177
recent	172
business	170
migration	166
county	164
current	163
popest	154
produces	144
includes	142
counties	138
including	134
payroll	134
http	131
businesses	130
american	129
date	121
district	118
metropolitan	117
april	114
based	114
community	114
united	113
social	110
change	109
bureau’s	105
provide	103
file	102
projections	102
hispanic	100
origin	99
congressional	98
source	97
detailed	95
patterns	95
additional	93
geographic	93
international	92
division	91
index.html	91
profiles	91
july	90
characteristics	87
employment	87
https	87
research	87
unit	85
intercensal	84
cbp	83
nation	82
trade	82
analysis	81
form	81
surveys	81
code	80
historical	80
produced	80
website	79
methodology	76
resident	76
age	75
variables	75
zip	75
counts	74
establishments	74
include	74
programs	74
communities	72
derived	72
net	72
sex	71
note	70
dataset	69
due	69
files	69
reference	69
summary	69
count	68
populations	68
question	68
births	66
create	65
dates	65
deaths	65
methods	65
subject	65
combination	64
evaluation	63
papers	63
e.g	62
level	62
services	62
employees	61
industry	61
paid	61
march	60
nonemployers	60
previously	60
resolution	60
total	60
broad	59
units	59
sector	58
api	57
final	55
geographies	55
official	55
begins	54
calculate	54
extends	54
issue	54
quarter	54
refers	54
reflect	54
revises	54
supersedes	54
suppressed	54
utilizes	54
week	54
considered	53
range	53
revisions	53
error	52
mi	52
michigan	52
processing	52
values	52
measures	51
tables	50
pep	49
statistical	49
covers	48
industries	48
investments	48
quarterly	48
release	48
giving	47
ongoing	47
plan	47
topics	47
totals	46
firms	45
receipts	45
set	45
income	44
ethnicity	43
decade	41
beta	40
moved	40
columbia	39
individuals	39
markets	38
specific	38
comprehensive	37
hispanics	37
method	37
indicators	36
selected	36
tracts	36
categories	35
districts	35
modified	35
responses	35
comparison	34
metro	34
primary	34
topic	34
sample	33
december	32
differences	32
mcd	32
armed	31
census.gov	31
forces	31
health	31
naics	31
races	31
block	30
planning	30
postcensal	30
principal	30
recommend	30
released	30
sampling	30
shown	30
percentages	29
sectors	29
staff	29
calhoun	28
dc	28
households	28
included	28
results	28
federal	27
island	27
monthly	27
original	27
partners	27
type	27
versus	27
3rd	26
affect	26
affects	26
analyses	26
assistance	26
battle	26
care	26
creek	26
identified	26
levels	26
plans	26
questions	26
result	26
revised	26
subtraction	26
addresses	25
changing	25
cities	25
civilian	25
collecting	25
construction	25
designed	25
disseminates	25
estimating	25
fresh	25
mailed	25
obtain	25
person	25
questionnaires	25
replaced	25
residence	25
sales	25
strength	25
thresholds	25
towns	25
documentation	24
future	24
manufactured	24
owners	24
potential	24
technical	24
activity	23
adds	23
call	23
digit	23
eid.international.trade.data	23
email	23
flow	23
found	23
option	23
products	23
table	23
zbp	23
basis	22
born	22
means	22
nationwide	22
overseas	22
report	22
sum	22
foreign	21
key	21
manufacturing	21
public	21
tax	21
v2013	21
consist	20
economy	20
local	20
national	20
nonemployer	20
performance	20
produce	20
sic	20
subnational	20
transportation	20
access	19
component	19
conduct	19
contents	19
covered	19
cumulative	19
establish	19
estimated	19
export	19
flows	19
june	19
month	19
native	19
pricing	19
provided	19
rolling	19
shipping	19
single	19
structures	19
trading	19
users	19
weight	19
components	18
imports	18
operating	18
region	18
residual	18
exports	17
individual	17
mrsf	17
notes	17
status	17
unincorporated	17
us1	17
ago	16
government	16
mcds	16
methodology.html	16
overview	16
percent	16
popest.html	16
sources	16
system	16
tabulations	16
timely	16
variety	16
activities	15
adjustment	15
average	15
bin	15
briefrm	15
briefroom	15
bureau.s	15
cgi	15
complementary	15
covering	15
decisions	15
employed	15
estimation	15
exception	15
excluded	15
homes	15
impact	15
indicator	15
inform	15
investment	15
majority	15
nationally	15
offer	15
owner’s	15
pensions	15
pertinent	15
policy	15
proprietorships	15
published	15
refer	15
reliability	15
reliable	15
resource	15
retail	15
scope	15
seasonal	15
sole	15
starting	15
study	15
taxes	15
variability	15
variance	15
visit	15
webpages	15
wholesale	15
assumptions	14
datasets	14
definitions	14
domestic	14
measure	14
terms	14
terms.html	14
incorporated	13
66,000	12
base	12
geo	12
noninstitutionalized	12
select	12
v2015	12
compared	11
comparisons	11
cross	11
enumerated	11
location	11
municipios	11
operational	11
past	11
persons	11
projected	11
separately	11
significance	11
similar	11
site	11
size	11
testing	11
web	11
2060	10
addition	10
analyzing	10
classification	10
cohort	10
companies	10
congress	10
consistent	10
courses	10
geography	10
hs	10
illustrate	10
micropolitan	10
relationship	10
typically	10
world	10
114th	9
active	9
attributed	9
database	9
divisions	9
duty	9
functions	9
movement	9
natives	9
poverty	9
profile	9
qwi	9
rates	9
represents	9
specifically	9
subgroups	9
v2014	9
2,400	8
abroad	8
acs1	8
civil	8
dollar	8
field	8
geographical	8
list	8
live	8
minor	8
mobility	8
mrsf2010	8
nonmovers	8
quarters	8
strong	8
subcounty	8
supplemental	8
tract	8
universe	8
www2	8
administration	7
boundaries	7
censuses	7
household	7
overlapping	7
owned	7
repeat	7
agency	6
agree	6
asm	6
budget’s	6
collected	6
combined	6
countries	6
country	6
coverage	6
delineations	6
education	6
harmonized	6
issued	6
january	6
latino	6
lrd	6
management	6
market	6
methodological	6
office	6
operations	6
pdb	6
port	6
precisely	6
purposes	6
reasons	6
sbo	6
school	6
socioeconomic	6
statement	6
surnames	6
tabulation	6
5,000	5
account	5
advertising	5
agencies	5
benchmark	5
budgets	5
class	5
conducted	5
databases	5
developing	5
effectiveness	5
firm	5
forms	5
government’s	5
indian	5
insurance	5
law	5
legal	5
locations	5
loss	5
measuring	5
medium	5
million	5
modifying	5
october	5
organization	5
placements	5
prepared	5
quotas	5
reported	5
representing	5
required	5
residential	5
respondents	5
response	5
setting	5
stock	5
studying	5
subjects	5
16,000	4
2050	4
adjusting	4
alaska	4
alternative	4
annually	4
barrios	4
boundary	4
bureaus	4
city	4
codes	4
commonwealth	4
creation	4
difference	4
earnings	4
estimate	4
existing	4
expectancy	4
fertility	4
formula	4
funds	4
hawaiian	4
hierarchical	4
incorporate	4
intercensal_nat_meth.pdf	4
job	4
life	4
north	4
periodically	4
private	4
product	4
redistributes	4
reflects	4
related	4
rely	4
revise	4
sahie	4
sequence	4
sf	4
sitc	4
smooth	4
speakers	4
subdivisions	4
summaries	4
tech	4
transition	4
urban	4
usatrade.census.gov	4
v2017	4
zctas	4
20,000	3
3,000	3
9,000	3
about.html	3
academic	3
accurate	3
agent	3
agricultural	3
ancestry	3
ascertain	3
blocks	3
broken	3
commodities	3
compiled	3
cps	3
demographics	3
english	3
entrepreneurs	3
factfinder	3
families	3
figures	3
gender	3
glossary	3
glossary.html	3
home	3
inputs	3
labor	3
languages	3
lehd.ces.census.gov	3
longitudinal	3
main	3
manufactures	3
micro	3
military	3
minority	3
nonagricultural	3
outputs	3
people	3
prior	3
project	3
read	3
regions	3
regularly	3
researchers	3
rural	3
special	3
sworn	3
updates	3
usda	3
uswide	3
veteran	3
women	3
worker	3
workforce	3
25,000	2
acsse	2
advance	2
ages	2
aggregates	2
aid	2
allocation	2
asian	2
bds	2
birth	2
category	2
changes.html	2
children	2
collection	2
comparable	2
concluded	2
confidentiality	2
core	2
cprofile	2
decades	2
depth	2
determined	2
distributing	2
dvd	2
efforts	2
eighteen	2
elementary	2
eligible	2
enumeration	2
ethnic	2
family	2
fl	2
focus	2
focusing	2
formed	2
forty	2
frequency	2
functional	2
gadsden	2
ia	2
idb	2
identify	2
incomplete	2
informationgateway.php	2
islander	2
items	2
jurisdictions	2
lands	2
limited	2
located	2
managing	2
midyear	2
model	2
modified.html	2
municipio	2
names	2
objective	2
occupancy	2
occupied	2
pacific	2
place.html	2
pueblo	2
quantity	2
rank	2
recommended	2
recorded	2
repeated	2
reports	2
review	2
rockwell	2
saipe	2
secondary	2
september	2
spanish	2
stages	2
standard	2
statements	2
subbarrios	2
summarized	2
systems	2
tenure	2
times	2
title	2
tribal	2
tribes	2
twelve	2
twenty	2
update	2
urbanized	2
v2016	2
varies	2
113th	1
115th	1
31,000	1
65k	1
abbreviation	1
act	1
additionally	1
affamerican	1
aggregate	1
amended	1
article	1
authorization	1
bls	1
breast	1
cancer	1
capital	1
cash	1
cd	1
cd113	1
cdc	1
centers	1
cervical	1
ces	1
charges	1
closings	1
clusters	1
collects	1
combining	1
complement	1
conditions	1
conducting	1
constitution	1
content	1
control	1
cph	1
created	1
csa	1
cscb	1
cscbo	1
d.c	1
dataproducts	1
dec	1
describing	1
destruction	1
detail	1
detection	1
develop	1
direct	1
disease	1
distinct	1
dynamics	1
enhance	1
establishment	1
expenditure	1
extensive	1
external	1
fall	1
force	1
function	1
funded	1
glance	1
house	1
identical	1
implement	1
indebtedness	1
industrial	1
instruction	1
internal	1
issues	1
jointly	1
language	1
link	1
loaded	1
lunch	1
models	1
monies	1
multiple	1
nativity	1
nbccedp	1
numerous	1
object	1
openings	1
outlay	1
owner	1
page	1
partially	1
passed	1
payments	1
phc	1
pk	1
planned	1
prevention	1
producing	1
property	1
provisions	1
publishing	1
purpose	1
rate	1
reapportioning	1
relating	1
rented	1
renter	1
representatives	1
requires	1
revenue	1
rom	1
salaries	1
schedule	1
service	1
sf1	1
speak	1
spoken	1
sponsored	1
startups	1
substate	1
summaray	1
support	1
tabulate	1
ten	1
test	1
titled	1
tuition	1
undergone	1
understanding	1
unemployment	1
vacancy	1
washington	1
wide	1

What are the most common words in the Census Bureau dataset titles?

# What are the most common title words?
census_title %>%
  dplyr::group_by(word) %>%
  dplyr::count(sort = TRUE) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

word	n
estimates	139
population	123
business	114
patterns	108
county	64
code	63
series	62
time	62
vintage	54
survey	52
zip	46
age	45
statistics	44
american	43
beta	40
community	39
tables	38
sex	34
international	32
monthly	31
profiles	30
total	29
trade	29
national	22
single	22
economic	21
hispanic	21
origin	21
nonemployer	20
employer	18
subject	18
detailed	17
indicators	15
profile	15
imports	14
races	14
characteristics	13
comparison	13
exports	13
housing	11
summary	11
annual	10
decennial	10
projected	10
projections	10
system	10
file	9
intercensal	9
migration	9
united	9
change	8
components	8
database	8
flows	8
race	8
classification	6
harmonized	6
hs	6
planning	6
resident	6
supplemental	6
unit	6
counties	5
industry	5
municipios	5
owners	5
demographic	4
dynamics	4
economy	4
historical	4
key	4
naics	4
north	4
pr	4
quarterly	4
sales	4
sectors	4
services	4
universe	4
wide	4
agriculture	3
block	3
commonwealth	3
department	3
entrepreneurs	3
household	3
inventories	3
level	3
longitudinal	3
manufactures	3
poverty	3
qwi	3
tract	3
advanced	2
births	2
businesses	2
company	2
congressional	2
construction	2
deaths	2
districts	2
education	2
food	2
health	2
income	2
insurance	2
manufacturing	2
mcds	2
net	2
populations	2
port	2
public	2
retail	2
selected	2
shipments	2
sitc	2
standard	2
subcounty	2
summarized	2
surnames	2
technology	2
113th	1
115th	1
advance	1
armed	1
basic	1
cd113	1
cd115	1
civilian	1
classes	1
current	1
district	1
elementary	1
ethnicity	1
finance	1
financial	1
firm	1
forces	1
home	1
homeownership	1
homes	1
individual	1
industries	1
language	1
local	1
manufactured	1
manufacturers	1
nativity	1
overseas	1
packages	1
pensions	1
product	1
report	1
residential	1
school	1
secondary	1
sf1	1
spending	1
spoken	1
status	1
table	1
taxes	1
test	1
units	1
vacancies	1
wholesale	1

Word co-occurrences and correlations

Here I examine which words commonly occur together in the titles, descriptions, and keywords of the Census Bureau datasets, and build word networks that help determine which datasets are related to one another.

Networks of Description and Title Words

library(widyr)

title_word_pairs <- census_title %>%
  widyr::pairwise_count(word, id, sort = TRUE, upper = FALSE)

title_word_pairs %>%
  dplyr::arrange(-n) %>%
  dplyr::top_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

item1	item2	n
estimates	population	63
estimates	vintage	54
vintage	population	54
county	business	54
county	patterns	54
business	patterns	54
time	series	46
population	age	42
american	community	39
american	survey	39
community	survey	39

These are the pairs of words that occur together most often in title fields.
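To show what a pairwise count measures, the following base R sketch counts, for each unordered pair of words, the number of datasets in which both words appear. It is a hand-rolled illustration on invented data, not widyr's implementation; in particular, ordering each pair alphabetically is my own convention.

```r
# Hypothetical one-token-per-row data: three datasets and their title words
toy_words <- data.frame(
  id = c("ds_a", "ds_a", "ds_a", "ds_b", "ds_b", "ds_c", "ds_c"),
  word = c("population", "estimates", "vintage",
           "population", "estimates",
           "business", "patterns"),
  stringsAsFactors = FALSE
)

# For each dataset, list every unordered pair of distinct words it contains
pairs_by_id <- lapply(split(toy_words$word, toy_words$id), function(w) {
  w <- sort(unique(w))
  if (length(w) < 2) return(NULL)
  t(combn(w, 2))  # one row per pair, alphabetical within the pair
})
all_pairs <- do.call(rbind, pairs_by_id)

# Count how many datasets each pair occurs in, most frequent first
pair_counts <- as.data.frame(
  table(item1 = all_pairs[, 1], item2 = all_pairs[, 2]),
  stringsAsFactors = FALSE
)
names(pair_counts)[3] <- "n"
pair_counts <- pair_counts[pair_counts$n > 0, ]
pair_counts <- pair_counts[order(-pair_counts$n), ]

print(pair_counts)
```

Each dataset contributes at most one count to a given pair, no matter how often the words repeat inside it, so `n` reads as "number of datasets sharing both words".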

desc_word_pairs <- census_desc %>%
  widyr::pairwise_count(word, id, sort = TRUE, upper = FALSE)

desc_word_pairs %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

item1	item2	n
utilizes	v2015	12
weight	census.gov	18
selected	rural	1
topics	comparison	21
july	release	11
smooth	incorporate	4
income	county	16
date	exports	9
total	produces	28
methods	terms	7

This is a random sample of ten word pairs that co-occur in description fields; unlike the title table above, it is not restricted to the most frequent pairs.

Below is a plot of networks of these co-occurring words to better see relationships.

library(ggplot2)
library(igraph)
library(ggraph)

# plot network of co-occurring words for 'title' field
set.seed(1234)
title_word_pairs %>%
  dplyr::filter(n >= 18) %>%
  igraph::graph_from_data_frame() %>%
  ggraph::ggraph(layout = "fr") +
  ggraph::geom_edge_link(
    ggplot2::aes(edge_alpha = n, edge_width = n),
    edge_colour = "steelblue"
  ) +
  ggraph::geom_node_point(size = 5) +
  ggraph::geom_node_text(
    ggplot2::aes(label = name), 
    repel = TRUE, 
    point.padding = unit(0.2, "lines")
  ) +
  ggplot2::theme_void()

Word network in the Census Bureau dataset titles

We see some clear clustering in this network of title words; words in the Census Bureau dataset titles are largely organized into several families of words that tend to go together.

Now I plot the same for the description fields.

# plot network of co-occurring words for 'description' field
set.seed(1234)
desc_word_pairs %>%
  dplyr::filter(n >= 85) %>%
  igraph::graph_from_data_frame() %>%
  ggraph::ggraph(layout = "fr") +
  ggraph::geom_edge_link(
    ggplot2::aes(edge_alpha = n, edge_width = n),
    edge_colour = "steelblue"
  ) +
  ggraph::geom_node_point(size = 5) +
  ggraph::geom_node_text(
    ggplot2::aes(label = name), 
    repel = TRUE, 
    point.padding = unit(0.2, "lines")
  ) +
  ggplot2::theme_void()

Word network in the Census Bureau dataset descriptions

Networks of Keywords

Next I build a network of the keywords to see which keywords commonly occur together in the same datasets.

# Network of Keywords
## See which keywords commonly occur together in the same dataset
keyword_pairs <- census_keyword %>%
  widyr::pairwise_count(keyword, id, sort = TRUE, upper = FALSE)

keyword_pairs %>%
  dplyr::arrange(-n) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

item1	item2	n
Income	Marital	6
Income	Poverty	6
Marital	Poverty	6

set.seed(1234)
keyword_pairs %>%
  igraph::graph_from_data_frame() %>%
  ggraph::ggraph(layout = "fr") +
  ggraph::geom_edge_link(
    ggplot2::aes(edge_alpha = n, edge_width = n),
    edge_colour = "royalblue"
  ) +
  ggraph::geom_node_point(size = 5) +
  ggraph::geom_node_text(
    ggplot2::aes(label = name), 
    repel = TRUE, 
    point.padding = unit(0.2, "lines")
  ) +
  ggplot2::theme_void()

Co-occurrence network in the Census Bureau dataset keywords

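Of the 330 or so datasets, only about 6 contain keywords, so the network has just three nodes (Income, Marital, and Poverty), each pair of which co-occurs in the same 6 datasets.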

Calculating tf-idf for the description fields

Getting the tf-idf for the description field words

What are the highest tf-idf words in the Census Bureau description fields?

library(tidytext)  # bind_tf_idf() comes from tidytext, not topicmodels

desc_tf_idf <- census_desc %>% 
  dplyr::count(id, word, sort = TRUE) %>%
  dplyr::ungroup() %>%
  tidytext::bind_tf_idf(word, id, n) %>%
  dplyr::arrange(-tf_idf)

desc_tf_idf %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	word	n	tf	idf	tf_idf
https://api.census.gov/data/id/ACSDP5Y2014	dataset	1	0.0227273	1.791759	0.0407218
http://api.census.gov/data/id/POPESTPROJbirths2014	cohort	1	0.0117647	3.496508	0.0411354
http://api.census.gov/data/id/POPESTprcagesex2013	commonwealth	1	0.0085470	4.412798	0.0377162
http://api.census.gov/data/id/POPESTcharage2016	issue	1	0.0058480	1.810109	0.0105854
https://api.census.gov/data/id/ACSST5Y2012	giving	1	0.0222222	1.948945	0.0433099
http://api.census.gov/data/id/POPESThousing2013	bureau’s	1	0.0083333	1.145132	0.0095428
http://api.census.gov/data/id/POPESTcharagegroups2016	reflect	1	0.0058140	1.810109	0.0105239
http://api.census.gov/data/id/POPESTpop2015	based	1	0.0074627	1.193922	0.0089099
http://api.census.gov/data/id/ACSFlows2011	metropolitan	1	0.0128205	1.126264	0.0144393
http://api.census.gov/data/id/CBP1995	subtraction	1	0.0128205	2.540996	0.0325769

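The table shows a random sample of 10 rows from desc_tf_idf, not the top-ranked words. A high tf-idf marks a word that is frequent within one description but appears in few of the others, meaning it is common but not too common.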
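As a check on how the tf, idf, and tf_idf columns relate, the following sketch reproduces the first row of the sample table by hand. The word total (44) and the number of descriptions containing "dataset" (55) do not appear in any output above; they are assumptions inferred from the tf and idf values in that row, and the corpus size of 330 is the dataset count quoted in this report.

```r
# Row being reproduced: ACSDP5Y2014, word "dataset", n = 1
n_term_in_doc    <- 1    # n column of the sample table
n_words_in_doc   <- 44   # ASSUMPTION: inferred from tf = 0.0227273 = 1/44
n_docs           <- 330  # dataset count quoted in this report
n_docs_with_term <- 55   # ASSUMPTION: inferred from idf = 1.791759 = log(6)

# Term frequency: share of the description's words that are this word
tf <- n_term_in_doc / n_words_in_doc

# Inverse document frequency: natural log of (documents / documents with word)
idf <- log(n_docs / n_docs_with_term)

# tf-idf is the product of the two
tf_idf <- tf * idf

round(c(tf = tf, idf = idf, tf_idf = tf_idf), 7)
```

Under those assumptions the three values agree with the table row to the precision shown, which suggests the idf column is a natural logarithm computed over all 330 descriptions.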

Connecting description fields to keywords

First join the results of the tf-idf analysis with the keyword dataset.

library(dplyr)  # the join and the pipe come from dplyr; topicmodels is not needed here

desc_tf_idf_keyword <- dplyr::full_join(
  desc_tf_idf, 
  census_keyword, by = "id") %>%
  dplyr::arrange(word)

desc_tf_idf_keyword %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	word	n	tf	idf	tf_idf	keyword
https://api.census.gov/data/id/ACSDP1Y2011	official	2	0.0222222	2.3978953	0.0532866	NA
http://api.census.gov/data/id/CBP1988	affects	1	0.0128205	2.5409961	0.0325769	NA
https://api.census.gov/data/id/CBP2012	benchmark	1	0.0200000	4.1896547	0.0837931	NA
http://api.census.gov/data/id/NONEMP2007	percent	1	0.0123457	3.0265039	0.0373642	NA
http://api.census.gov/data/id/ZBPTotal2011	zbp	1	0.0357143	2.6635984	0.0951285	NA
https://api.census.gov/data/id/ACSCP1Y2016	covers	1	0.0196078	1.9278916	0.0378018	NA
https://api.census.gov/data/id/ACSCP5Y2017	services	1	0.0196078	1.6719583	0.0327835	NA
http://api.census.gov/data/id/POPESTcomponents2015	foreign	2	0.0113636	3.6018681	0.0409303	NA
http://api.census.gov/data/id/CBP1986	suppressed	2	0.0256410	2.4668881	0.0632535	NA
http://api.census.gov/data/id/SBOCSCB12	survey	3	0.0681818	0.9315582	0.0635153	NA

Plot some of the most important words, as measured by tf-idf, for all of the provided keywords used on the Census Bureau datasets.

desc_tf_idf_keyword %>% 
  dplyr::filter(!dplyr::near(tf, 1)) %>%
  dplyr::filter(keyword %in% c("Income", "Marital", "Poverty")) %>%
  dplyr::arrange(dplyr::desc(tf_idf)) %>%
  dplyr::group_by(keyword) %>%
  dplyr::distinct(word, keyword, .keep_all = TRUE) %>%
  dplyr::top_n(15, tf_idf) %>% 
  dplyr::ungroup() %>%
  dplyr::mutate(word = base::factor(word, levels = base::rev(unique(word)))) %>%
  ggplot2::ggplot(ggplot2::aes(word, tf_idf, fill = keyword)) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::facet_wrap(~keyword, ncol = 3, scales = "free") +
  ggplot2::coord_flip() +
  ggplot2::labs(title = "Highest tf-idf words in Census metadata description fields",
       caption = "Census metadata from https://api.census.gov/data.json",
       x = NULL, y = "tf-idf")

Distribution of tf-idf for words from datasets labeled with selected keywords

Uncovering hidden conversations

Topic modeling attempts to uncover the hidden conversations within each description field. Latent Dirichlet allocation (LDA) is a technique that models each document (description field) as a mixture of topics and describes each topic as a mixture of words.
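The two mixtures can be written compactly. This is the standard LDA decomposition rather than anything specific to this report; the symbols are labeled to match the beta and gamma matrices that tidytext extracts from the fitted model, and \(K\) is the number of topics, \(d\) a description field, and \(w\) a word.

```latex
\[
P(w \mid d) \;=\; \sum_{k=1}^{K} \gamma_{d,k}\,\beta_{k,w},
\qquad \sum_{k=1}^{K} \gamma_{d,k} = 1,
\qquad \sum_{w} \beta_{k,w} = 1
\]
```

Here \(\gamma_{d,k}\) is the share of document \(d\) attributed to topic \(k\), and \(\beta_{k,w}\) is the probability of word \(w\) under topic \(k\).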

desc_stop_words <- dplyr::bind_rows(
  tidytext::stop_words, 
  # tibble::tibble() replaces the deprecated dplyr::data_frame()
  tibble::tibble(
    word = c("nbsp", "amp", "gt", 
             "lt", "timesnewromanpsmt", "font",
             "td", "li", "br", "tr", "quot",
             "st", "img", "src", "strong", 
             "http", "file", "files", "00", "www.census.gov",
             base::as.character(1:12)
             ), 
    lexicon = base::rep("custom", 32)))

word_counts <- census_desc %>%
  dplyr::anti_join(desc_stop_words, by = "word") %>%
  dplyr::count(id, word, sort = TRUE) %>%
  dplyr::ungroup()

word_counts %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	word	n
http://api.census.gov/data/id/POPESTcharagegroups2017	program	4
https://api.census.gov/data/id/ACSST1Y2016	metropolitan	1
http://api.census.gov/data/id/ITMONTHLYEXPORTSSTATEHS	access	1
http://api.census.gov/data/id/POPESTintercensalnatmonthly2000	program	4
http://api.census.gov/data/id/SBOCSCB12	veteran	1
https://api.census.gov/data/id/ACSST5Y2010	housing	1
http://api.census.gov/data/id/POPESTnatmonthly2016	report	1
http://api.census.gov/data/id/NONEMP2003	scope	1
http://api.census.gov/data/id/POPESTstchar52013	revises	1
http://api.census.gov/data/id/PDBBlockGroup2015	estimates	1

Casting to a document-term matrix

Create a sparse document-term matrix, containing the count of terms in each document.

desc_dtm <- word_counts %>%
  tidytext::cast_dtm(id, word, n)

desc_dtm

## <<DocumentTermMatrix (documents: 330, terms: 860)>>
## Non-/sparse entries: 16052/267748
## Sparsity           : 94%
## Maximal term length: 28
## Weighting          : term frequency (tf)

Ready for topic modeling

The following creates an LDA model. As with many clustering algorithms, the number of topics must be set a priori. Here I set the number of topics to 8.

library(topicmodels)
# run an LDA on the description words
desc_lda <- topicmodels::LDA(desc_dtm, k = 8, control = base::list(seed = 1234))
desc_lda

## A LDA_VEM topic model with 8 topics.

Interpreting the topic model

The following takes the LDA model and constructs a tidy data frame that summarizes the results.

# interpret the results
tidy_lda <- tidytext::tidy(desc_lda)

tidy_lda %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

topic	term	beta
4	question	0.0050538
5	city	0.0000000
4	percent	0.0000000
5	passed	0.0000000
4	armed	0.0031119
3	producing	0.0000000
8	property	0.0000000
7	special	0.0000000
2	glossary	0.0006520
8	minor	0.0000000

The column \(\beta\) shows the probability of that term (word) being generated from that topic; it is a property of the topic and does not depend on the document. Values printed as 0.0000000 are probabilities too small to show at seven decimal places.
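Because each topic is a probability distribution over terms, the \(\beta\) values within a topic should sum to 1. The sketch below checks this on an invented three-term, two-topic table shaped like tidy_lda; the terms and probabilities are made up, and the same tapply() call could be pointed at the real tidy_lda columns.

```r
# Toy table with the same columns as tidy_lda (values invented)
toy_beta <- data.frame(
  topic = c(1, 1, 1, 2, 2, 2),
  term  = rep(c("survey", "trade", "housing"), times = 2),
  beta  = c(0.7, 0.2, 0.1, 0.1, 0.3, 0.6),
  stringsAsFactors = FALSE
)

# Sum beta within each topic; each total should be 1 up to rounding error
beta_totals <- tapply(toy_beta$beta, toy_beta$topic, sum)
beta_totals

# Most probable term per topic
by_topic <- toy_beta[order(toy_beta$topic, -toy_beta$beta), ]
top_term <- by_topic[!duplicated(by_topic$topic), ]
top_term
```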

The following examines the top 6 terms for each topic.

top_terms <- tidy_lda %>%
  dplyr::group_by(topic) %>%
  dplyr::top_n(6, beta) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(topic, -beta)

top_terms %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

topic	term	beta
5	employment	0.0256562
1	trade	0.0254148
1	postcensal	0.0243657
8	surveys	0.0257496
8	survey	0.0401349
2	vintage	0.0275104
6	produces	0.0360178
3	business	0.0286840
6	unit	0.0352348
7	housing	0.0215063

Here are the results of the top_terms exercise, depicted visually:

top_terms %>%
  dplyr::mutate(term =  stats::reorder(term, beta)) %>%
  dplyr::group_by(topic, term) %>%    
  dplyr::arrange(dplyr::desc(beta)) %>%  
  dplyr::ungroup() %>%
  dplyr::mutate(term = base::factor(base::paste(term, topic, sep = "__"), 
                       levels = base::rev(base::paste(term, topic, sep = "__")))) %>%
  ggplot2::ggplot(ggplot2::aes(term, beta, fill = base::as.factor(topic))) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::coord_flip() +
  ggplot2::scale_x_discrete(labels = function(x) base::gsub("__.+$", "", x)) +
  ggplot2::labs(
    title = "Top 6 terms in each LDA topic",
    x = NULL, y = base::expression(beta)) +
  ggplot2::facet_wrap(~ topic, ncol = 4, scales = "free")

Top terms in topic modeling of Census metadata description field texts

The most frequently occurring terms in each of the topics are “population”, “annual”, and “estimates”. Given that most of the 330 datasets in the Census API are survey results from the annual American Community Survey, this makes sense.

The following examines which topics are associated with which description fields (i.e., documents). The probability \(\gamma\) is the probability that a given document belongs to a given topic.

# examine which topics are associated with which description fields
lda_gamma <- tidytext::tidy(desc_lda, matrix = "gamma")

lda_gamma %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

document	topic	gamma
http://api.census.gov/data/id/ACSProfile3Y2013	1	0.0002431
http://api.census.gov/data/id/CBP1989	8	0.0002773
https://api.census.gov/data/id/ACSSE2016	4	0.0002671
https://api.census.gov/data/id/ACSDT1Y2010	1	0.0002431
http://api.census.gov/data/id/POPESTintercensalhousing2000	7	0.0003378
http://api.census.gov/data/id/CBP1999	8	0.0002773
http://api.census.gov/data/id/POPESTPROJnat2014	6	0.0002671
http://api.census.gov/data/id/QWISE	4	0.0006535
http://api.census.gov/data/id/PDBTRACT2018	6	0.1656551
https://api.census.gov/data/id/ACSDT5Y2011	4	0.0004695

The variable \(\gamma\) takes values from 0 to 1. A value near 0 means that the document does not belong to that topic; a value close to 1 means that it does.
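One common way to use \(\gamma\) is to assign each document to its most probable topic. The following is a minimal base-R sketch on an invented table shaped like lda_gamma; the document names and probabilities are made up, and the assignment rule (pick the largest \(\gamma\)) is my assumption rather than a step taken in this report.

```r
# Toy table with the same columns as lda_gamma (values invented)
toy_gamma <- data.frame(
  document = rep(c("d1", "d2"), each = 3),
  topic    = rep(1:3, times = 2),
  gamma    = c(0.98, 0.01, 0.01, 0.05, 0.15, 0.80),
  stringsAsFactors = FALSE
)

# Each document's gamma values form a distribution over topics
gamma_totals <- tapply(toy_gamma$gamma, toy_gamma$document, sum)
gamma_totals

# Sort by document, then by descending gamma, and keep the first row of each
ordered   <- toy_gamma[order(toy_gamma$document, -toy_gamma$gamma), ]
top_topic <- ordered[!duplicated(ordered$document), ]
top_topic
```

In the toy data, d1 is assigned to topic 1 almost without ambiguity, while d2 goes to topic 3 with a noticeable share left over for topic 2.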

The distribution below shows that most documents either clearly belong or clearly do not belong to a given topic: the \(\gamma\) values pile up near 0 and near 1.

ggplot2::ggplot(lda_gamma, ggplot2::aes(gamma)) +
  ggplot2::geom_histogram(bins = 12) +
  ggplot2::scale_y_log10() +
  ggplot2::labs(title = "Distribution of probabilities for all topics",
       y = "Number of documents", x = base::expression(gamma))

Probability distribution in topic modeling of Census metadata description field texts

The following plot shows how the probabilities are distributed within each topic:

ggplot2::ggplot(lda_gamma, ggplot2::aes(gamma, fill = base::as.factor(topic))) +
  ggplot2::geom_histogram(bins = 4, show.legend = FALSE) +
  ggplot2::facet_wrap(~ topic, ncol = 4) +
  ggplot2::scale_y_log10() +
  ggplot2::labs(title = "Distribution of probability for each topic",
       y = "Number of documents", x = base::expression(gamma))

Probability distribution for each topic in topic modeling of Census metadata description field texts