There are over 330 datasets with over 64,000 variables maintained by the Census Bureau; these datasets cover topics from international trade to population estimates to business formation within the United States. I’ll use metadata from these datasets to understand the connections between them.

The metadata includes information like the title of the dataset, a description field, which organization(s) within the Census Bureau are responsible for the dataset, keywords for the dataset that have been assigned by a human being, and so forth. The metadata for all of the Bureau's datasets is publicly available online in JSON format.

In this report, I will treat the Census Bureau metadata as a text dataset and apply text mining techniques using the R package tidytext. I will compute word co-occurrences and correlations, tf-idf, and topic models to explore the connections between the datasets, asking whether datasets are related to one another and whether they form clusters of similar datasets. Since the Census Bureau provides several text fields in the metadata, most importantly the title, description, and keyword fields, I can show connections between the fields to better understand the connections between the Census Bureau API datasets.

How data is organized at the Census Bureau

Download the JSON file and take a look at the names of what is stored in the metadata.

library(jsonlite)

metadata <- jsonlite::fromJSON("https://api.census.gov/data.json")

base::names(metadata$dataset)
##  [1] "c_vintage"           "c_dataset"           "c_geographyLink"     "c_variablesLink"    
##  [5] "c_tagsLink"          "c_examplesLink"      "c_groupsLink"        "c_valuesLink"       
##  [9] "c_documentationLink" "c_isAggregate"       "c_isAvailable"       "@type"              
## [13] "title"               "accessLevel"         "bureauCode"          "description"        
## [17] "distribution"        "contactPoint"        "identifier"          "keyword"            
## [21] "license"             "modified"            "programCode"         "references"         
## [25] "spatial"             "temporal"            "publisher"           "c_isCube"           
## [29] "c_isTimeseries"

The title, description, and keywords for each dataset will be the features of interest.

base::class(metadata$dataset$title)
## [1] "character"
base::class(metadata$dataset$description)
## [1] "character"
base::class(metadata$dataset$keyword)
## [1] "list"

The title and description fields are stored as character vectors, and the keywords are stored as a list of character vectors.
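Because the keyword field is a list column, it needs to be flattened before it can be counted or joined. The following is a minimal sketch of that step on made-up data (the ids and keyword values in `toy` are hypothetical, not taken from the metadata): datasets without keywords hold `NULL`, so they are filtered out before the list is unnested into one row per keyword.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy stand-in for the metadata; ids and keywords are invented for illustration
toy <- tibble::tibble(
  id = c("a", "b", "c"),
  keyword = list(c("Income", "Poverty"), NULL, "Marital")
)

toy_long <- toy %>%
  dplyr::filter(!purrr::map_lgl(keyword, is.null)) %>%  # drop datasets with no keywords
  tidyr::unnest(keyword)                                # one row per (id, keyword) pair

toy_long
```

Dataset "b" disappears entirely, which is worth keeping in mind later when only a handful of datasets turn out to have keywords at all.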

Data preparation

library(tidyverse)

# dplyr::data_frame() is deprecated; tibble::tibble() is its replacement
census_title <- tibble::tibble(
  id = metadata$dataset$identifier,
  title = metadata$dataset$title
)
census_title %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
id title
http://api.census.gov/data/id/POPESThousing2016 Vintage 2016 Population Estimates: Housing Unit Estimates for US, States, and Counties
https://api.census.gov/data/id/ACSST5Y2015 ACS 5-Year Subject Tables
http://api.census.gov/data/id/ACSFlows2011 2007-2011 American Community Survey: Migration Flows
https://api.census.gov/data/id/PEPCOMPONENTS2018 Vintage 2018 Population Estimates: Components of Change Estimates
http://api.census.gov/data/id/ASMState Time Series Annual Survey of Manufactures: Statistics for All Manufacturing by State
http://api.census.gov/data/id/EconCensusEWKS2007 2007 Economic Census - All Sectors: Economy-Wide Key Statistics
http://api.census.gov/data/id/POPESThousing2013 Vintage 2013 Population Estimates: Housing Unit Estimates for US, States, and Counties
http://api.census.gov/data/id/POPESTcomponents2015 Vintage 2015 Population Estimates: Components of Change Estimates
http://api.census.gov/data/id/POPESTcty2013 Vintage 2013 Population Estimates: County Total Population and Components of Change
http://api.census.gov/data/id/POPESThousing2014 Vintage 2014 Population Estimates: Housing Unit Estimates for US, States, and Counties
census_desc <- tibble::tibble(
  id = metadata$dataset$identifier,
  desc = metadata$dataset$description
)

census_desc %>%
#  dplyr::select(desc) %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
id desc
http://api.census.gov/data/id/PDBBLOCKGROUP2018 The PDB is a database of U.S. housing, demographic, socioeconomic and operational statistics based on select 2010 Decennial Census and select 5-year American Community Survey (ACS) estimates. Data are provided at the census block group level of geography. These data can be used for many purposes, including survey field operations planning.
http://api.census.gov/data/id/POPESTintercensalnatcivpop1990 Monthly Intercensal Estimates of the Civilian Population by Single Year of Age and Sex: April 1, 1990 to April 1, 2000 // Source: U.S. Census Bureau, Population Division // For detailed information about the methods used to create the intercensal population estimates, see https://www.census.gov/popest/methodology/intercensal_nat_meth.pdf. // The Census Bureau’s Population Estimates Program produces intercensal estimates each decade by adjusting the existing time series of postcensal estimates for a decade to smooth the transition from one decennial census count to the next. They differ from the postcensal estimates that are released annually because they rely on a formula that redistributes the difference between the April 1 postcensal estimate and April 1 census count for the end of the decade across the estimates for that decade. Meanwhile, the postcensal estimates incorporate current data on births, deaths, and migration to produce each new vintage of estimates, and to revise estimates for years back to the last census. The Population Estimates Program provides additional information including historical and postcensal estimates, evaluation estimates, demographic analysis, and research papers on its website: https://www.census.gov/popest/index.html.
http://api.census.gov/data/id/ACSProfile3Y2013 The American Community Survey (ACS) is a nationwide survey designed to provide communities a fresh look at how they are changing. The ACS replaced the decennial census long form in 2010 and thereafter by collecting long form type information throughout the decade rather than only once every 10 years. Questionnaires are mailed to a sample of addresses to obtain information about households – that is, about each person and the housing unit itself. The American Community Survey produces demographic, social, housing and economic estimates in the form of 1-year, 3-year and 5-year estimates based on population thresholds. The strength of the ACS is in estimating population and housing characteristics. The 3-year data provide key estimates for each of the topic areas covered by the ACS for the nation, all 50 states, the District of Columbia, Puerto Rico, every congressional district, every metropolitan area, and all counties and places with populations of 20,000 or more. Although the ACS produces population, demographic and housing unit estimates,it is the Census Bureau’s Population Estimates Program that produces and disseminates the official estimates of the population for the nation, states, counties, cities and towns, and estimates of housing units for states and counties. For 2010 and other decennial census years, the Decennial Census provides the official counts of population and housing units.
http://api.census.gov/data/id/2011acs5 This endpoint is being phased out. Please use corresponding endpoint found at api.census.gov/data/2011/acs/acs5.
http://api.census.gov/data/id/POPPROJBirths2012 Projected Births by Sex, Race, and Hispanic Origin for the United States: 2012 to 2060 File: 2012 National Population Projections Source: U.S. Census Bureau, Population Division Release Date: December 2012 NOTE: Hispanic origin is considered an ethnicity, not a race. Hispanics may be of any race. The projections generally do not precisely agree with population estimates available elsewhere on the Census Bureau website for methodological reasons. Where both estimates and projections are available for a given time reference, we recommend that you use the population estimates as the measure of the current population. For detailed information about the methods used to create the population projections, see http://www.census.gov/population/projections/methodology/. *** The U.S. Census Bureau periodically produces projections of the United States resident population by age, sex, race, and Hispanic origin. Population projections are estimates of the population for future dates. They are typically based on an estimated population consistent with the most recent decennial census and are produced using the cohort-component method. Projections illustrate possible courses of population change based on assumptions about future births, deaths, net international migration, and domestic migration. In some cases, several series of projections are produced based on alternative assumptions for future fertility, life expectancy, net international migration, and (for state-level projections) state-to-state or domestic migration. Additional information is available on the Population Projections website: http://www.census.gov/population/projections/.
http://api.census.gov/data/id/DecennialSF11990 Summary File 1 (SF 1) contains detailed tables focusing on age, sex, households, families, and housing units. These tables provide in-depth figures by race and Hispanic origin; some tables are repeated for each of nine race/Latino groups. Counts also are provided for over forty American Indian and Alaska Native tribes and for groups within race categories. The race categories include eighteen Asian groups and twelve Native Hawaiian and Other Pacific Islander groups. Counts of persons of Hispanic origin by country of origin (twenty-eight groups) are also shown. Summary File 1 presents data for the United States, the 50 states, and the District of Columbia in a hierarchical sequence down to the block level for many tabulations, but only to the census tract level for others. Summaries are included for other geographic areas such as ZIP Code Tabulation Areas (ZCTAs) and Congressional districts. Geographic coverage for Puerto Rico is comparable to the 50 states. Data are presented in a hierarchical sequence down the block level for many tabulations, but only to the census tract level for others. Geographic areas include barrios, barrios-pueblo, subbarrios, places, census tracts, block groups, and blocks. Summaries also are included for other geographic areas such as ZIP Code Tabulation Areas (ZCTAs). Puerto Rico data will be loaded in January 2017.
http://api.census.gov/data/id/ACSSF5Y2010 This endpoint is being phased out. Please use corresponding endpoint found at api.census.gov/data/2010/acs/acs5.
http://api.census.gov/data/id/ftdImpExpHist This international trade file provides the annual dollar value of U.S. exports and imports of goods for all U.S. trade partners. It also provides the annual dollar value of U.S. exports and imports of manufactured goods for all U.S. trade partners. You can find this data and more by going to usatrade.census.gov. If you have any questions regarding U.S. international trade data, please call us at 1(800)549-0595 option #4 or email us at eid.international.trade.data@census.gov.
http://api.census.gov/data/id/ACSProfile5Y2013 This endpoint is being phased out. Please use corresponding endpoint found at api.census.gov/data/2013/acs/acs5/profile.
http://api.census.gov/data/id/ITMONTHLYIMPORTSSTATEHS The Census data API provides access to the most comprehensive set of data on current month and cumulative year-to-date imports by state and Harmonized System (HS) code. The State HS endpoint in the Census data API also provides value, shipping weight, and method of transportation totals at the state level for all U.S. trading partners. The Census data API will help users research new markets for their products, establish pricing structures for potential export markets, and conduct economic planning. If you have any questions regarding U.S. international trade data, please call us at 1(800)549-0595 option #4 or email us at eid.international.trade.data@census.gov.
census_keyword <- tibble::tibble(
  id = metadata$dataset$identifier,
  keyword = metadata$dataset$keyword) %>% 
  dplyr::filter(!purrr::map_lgl(keyword, is.null)) %>%
  tidyr::unnest(keyword)

census_keyword %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
library(tidytext)

census_title <- census_title %>%
  tidytext::unnest_tokens(word, title) %>%
  dplyr::anti_join(stop_words, by = "word")

census_desc <- census_desc %>%
  tidytext::unnest_tokens(word, desc) %>%
  dplyr::anti_join(stop_words, by = "word")

The title, description, and keyword datasets have been prepared and are now ready for exploration.
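To make the tokenization step concrete, here is a small sketch on two made-up titles (the `toy_titles` data and the two stop-word tables are hypothetical). It illustrates that `unnest_tokens()` lowercases tokens by default and that `anti_join()` matches case-sensitively, so any custom stop word has to be written in lowercase to have an effect.

```r
library(dplyr)
library(tidytext)

# Made-up titles, for illustration only
toy_titles <- tibble::tibble(
  id = c("a", "b"),
  title = c("Annual Survey of Manufactures", "NAICS Sector Statistics")
)

# One row per word, lowercased, with standard stop words removed
toy_tokens <- toy_titles %>%
  tidytext::unnest_tokens(word, title) %>%
  dplyr::anti_join(tidytext::stop_words, by = "word")

# The same custom stop words, once capitalized and once lowercase
upper_stop <- tibble::tibble(word = c("NAICS", "Sector"))
lower_stop <- tibble::tibble(word = c("naics", "sector"))

n_after_upper <- nrow(dplyr::anti_join(toy_tokens, upper_stop, by = "word"))  # nothing removed
n_after_lower <- nrow(dplyr::anti_join(toy_tokens, lower_stop, by = "word"))  # two rows removed
```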

# create a list of user-defined stop words
# NOTE: unnest_tokens() lowercases every token, so the capitalized entries
# below ("Sector", "NAICS", "MRSF", "US1") never match; their lowercase forms
# still appear in the word counts that follow.
extra_stopwords <- tibble::tibble(
  word = c(
    as.character(1:100), 
    as.character(1950:2018), 
    "endpoint", "phased", "api.census.gov", "acs", "acs5", "u.s", "puerto", "rico", 
    "census", "bureau", "data", "information", "549", "800", "0595", "Sector", "62", "1,000", "100,000",
    "NAICS", "00", "000", "100", "MRSF", "01", "US1", "pdf", "0", "zero", "64,000", "65,000"
  )
)

# remove those extra stop words from title and description
census_title <- census_title %>%
  dplyr::anti_join(extra_stopwords, by = "word")

census_desc <- census_desc %>%
  dplyr::anti_join(extra_stopwords, by = "word")

census_title %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
census_desc %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
census_keyword %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

Initial simple exploration

What are the most common words in the Census Bureau dataset keywords?

#What are the most common keywords?
census_keyword %>%
  dplyr::group_by(keyword) %>%
  dplyr::count(sort = TRUE) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
keyword n
Income 6
Marital 6
Poverty 6

What are the most common words in the Census Bureau dataset descriptions?

# What are the most common words in the descriptions?
census_desc %>%
  dplyr::group_by(word) %>%
  dplyr::count(sort = TRUE) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
word n
estimates 1178
population 1019
economic 315
housing 298
series 296
program 288
survey 270
decennial 262
race 262
annual 246
vintage 227
www.census.gov 207
demographic 191
statistics 183
time 177
recent 172
business 170
migration 166
county 164
current 163
popest 154
produces 144
includes 142
counties 138
including 134
payroll 134
http 131
businesses 130
american 129
date 121
district 118
metropolitan 117
april 114
based 114
community 114
united 113
social 110
change 109
bureau’s 105
provide 103
file 102
projections 102
hispanic 100
origin 99
congressional 98
source 97
detailed 95
patterns 95
additional 93
geographic 93
international 92
division 91
index.html 91
profiles 91
july 90
characteristics 87
employment 87
https 87
research 87
unit 85
intercensal 84
cbp 83
nation 82
trade 82
analysis 81
form 81
surveys 81
code 80
historical 80
produced 80
website 79
methodology 76
resident 76
age 75
variables 75
zip 75
counts 74
establishments 74
include 74
programs 74
communities 72
derived 72
net 72
sex 71
note 70
dataset 69
due 69
files 69
reference 69
summary 69
count 68
populations 68
question 68
births 66
create 65
dates 65
deaths 65
methods 65
subject 65
combination 64
evaluation 63
papers 63
e.g 62
level 62
services 62
employees 61
industry 61
paid 61
march 60
nonemployers 60
previously 60
resolution 60
total 60
broad 59
units 59
sector 58
api 57
final 55
geographies 55
official 55
begins 54
calculate 54
extends 54
issue 54
quarter 54
refers 54
reflect 54
revises 54
supersedes 54
suppressed 54
utilizes 54
week 54
considered 53
range 53
revisions 53
error 52
mi 52
michigan 52
processing 52
values 52
measures 51
tables 50
pep 49
statistical 49
covers 48
industries 48
investments 48
quarterly 48
release 48
giving 47
ongoing 47
plan 47
topics 47
totals 46
firms 45
receipts 45
set 45
income 44
ethnicity 43
decade 41
beta 40
moved 40
columbia 39
individuals 39
markets 38
specific 38
comprehensive 37
hispanics 37
method 37
indicators 36
selected 36
tracts 36
categories 35
districts 35
modified 35
responses 35
comparison 34
metro 34
primary 34
topic 34
sample 33
december 32
differences 32
mcd 32
armed 31
census.gov 31
forces 31
health 31
naics 31
races 31
block 30
planning 30
postcensal 30
principal 30
recommend 30
released 30
sampling 30
shown 30
percentages 29
sectors 29
staff 29
calhoun 28
dc 28
households 28
included 28
results 28
federal 27
island 27
monthly 27
original 27
partners 27
type 27
versus 27
3rd 26
affect 26
affects 26
analyses 26
assistance 26
battle 26
care 26
creek 26
identified 26
levels 26
plans 26
questions 26
result 26
revised 26
subtraction 26
addresses 25
changing 25
cities 25
civilian 25
collecting 25
construction 25
designed 25
disseminates 25
estimating 25
fresh 25
mailed 25
obtain 25
person 25
questionnaires 25
replaced 25
residence 25
sales 25
strength 25
thresholds 25
towns 25
documentation 24
future 24
manufactured 24
owners 24
potential 24
technical 24
activity 23
adds 23
call 23
digit 23
eid.international.trade.data 23
email 23
flow 23
found 23
option 23
products 23
table 23
zbp 23
basis 22
born 22
means 22
nationwide 22
overseas 22
report 22
sum 22
foreign 21
key 21
manufacturing 21
public 21
tax 21
v2013 21
consist 20
economy 20
local 20
national 20
nonemployer 20
performance 20
produce 20
sic 20
subnational 20
transportation 20
access 19
component 19
conduct 19
contents 19
covered 19
cumulative 19
establish 19
estimated 19
export 19
flows 19
june 19
month 19
native 19
pricing 19
provided 19
rolling 19
shipping 19
single 19
structures 19
trading 19
users 19
weight 19
components 18
imports 18
operating 18
region 18
residual 18
exports 17
individual 17
mrsf 17
notes 17
status 17
unincorporated 17
us1 17
ago 16
government 16
mcds 16
methodology.html 16
overview 16
percent 16
popest.html 16
sources 16
system 16
tabulations 16
timely 16
variety 16
activities 15
adjustment 15
average 15
bin 15
briefrm 15
briefroom 15
bureau.s 15
cgi 15
complementary 15
covering 15
decisions 15
employed 15
estimation 15
exception 15
excluded 15
homes 15
impact 15
indicator 15
inform 15
investment 15
majority 15
nationally 15
offer 15
owner’s 15
pensions 15
pertinent 15
policy 15
proprietorships 15
published 15
refer 15
reliability 15
reliable 15
resource 15
retail 15
scope 15
seasonal 15
sole 15
starting 15
study 15
taxes 15
variability 15
variance 15
visit 15
webpages 15
wholesale 15
assumptions 14
datasets 14
definitions 14
domestic 14
measure 14
terms 14
terms.html 14
incorporated 13
66,000 12
base 12
geo 12
noninstitutionalized 12
select 12
v2015 12
compared 11
comparisons 11
cross 11
enumerated 11
location 11
municipios 11
operational 11
past 11
persons 11
projected 11
separately 11
significance 11
similar 11
site 11
size 11
testing 11
web 11
2060 10
addition 10
analyzing 10
classification 10
cohort 10
companies 10
congress 10
consistent 10
courses 10
geography 10
hs 10
illustrate 10
micropolitan 10
relationship 10
typically 10
world 10
114th 9
active 9
attributed 9
database 9
divisions 9
duty 9
functions 9
movement 9
natives 9
poverty 9
profile 9
qwi 9
rates 9
represents 9
specifically 9
subgroups 9
v2014 9
2,400 8
abroad 8
acs1 8
civil 8
dollar 8
field 8
geographical 8
list 8
live 8
minor 8
mobility 8
mrsf2010 8
nonmovers 8
quarters 8
strong 8
subcounty 8
supplemental 8
tract 8
universe 8
www2 8
administration 7
boundaries 7
censuses 7
household 7
overlapping 7
owned 7
repeat 7
agency 6
agree 6
asm 6
budget’s 6
collected 6
combined 6
countries 6
country 6
coverage 6
delineations 6
education 6
harmonized 6
issued 6
january 6
latino 6
lrd 6
management 6
market 6
methodological 6
office 6
operations 6
pdb 6
port 6
precisely 6
purposes 6
reasons 6
sbo 6
school 6
socioeconomic 6
statement 6
surnames 6
tabulation 6
5,000 5
account 5
advertising 5
agencies 5
benchmark 5
budgets 5
class 5
conducted 5
databases 5
developing 5
effectiveness 5
firm 5
forms 5
government’s 5
indian 5
insurance 5
law 5
legal 5
locations 5
loss 5
measuring 5
medium 5
million 5
modifying 5
october 5
organization 5
placements 5
prepared 5
quotas 5
reported 5
representing 5
required 5
residential 5
respondents 5
response 5
setting 5
stock 5
studying 5
subjects 5
16,000 4
2050 4
adjusting 4
alaska 4
alternative 4
annually 4
barrios 4
boundary 4
bureaus 4
city 4
codes 4
commonwealth 4
creation 4
difference 4
earnings 4
estimate 4
existing 4
expectancy 4
fertility 4
formula 4
funds 4
hawaiian 4
hierarchical 4
incorporate 4
intercensal_nat_meth.pdf 4
job 4
life 4
north 4
periodically 4
private 4
product 4
redistributes 4
reflects 4
related 4
rely 4
revise 4
sahie 4
sequence 4
sf 4
sitc 4
smooth 4
speakers 4
subdivisions 4
summaries 4
tech 4
transition 4
urban 4
usatrade.census.gov 4
v2017 4
zctas 4
20,000 3
3,000 3
9,000 3
about.html 3
academic 3
accurate 3
agent 3
agricultural 3
ancestry 3
ascertain 3
blocks 3
broken 3
commodities 3
compiled 3
cps 3
demographics 3
english 3
entrepreneurs 3
factfinder 3
families 3
figures 3
gender 3
glossary 3
glossary.html 3
home 3
inputs 3
labor 3
languages 3
lehd.ces.census.gov 3
longitudinal 3
main 3
manufactures 3
micro 3
military 3
minority 3
nonagricultural 3
outputs 3
people 3
prior 3
project 3
read 3
regions 3
regularly 3
researchers 3
rural 3
special 3
sworn 3
updates 3
usda 3
uswide 3
veteran 3
women 3
worker 3
workforce 3
25,000 2
acsse 2
advance 2
ages 2
aggregates 2
aid 2
allocation 2
asian 2
bds 2
birth 2
category 2
changes.html 2
children 2
collection 2
comparable 2
concluded 2
confidentiality 2
core 2
cprofile 2
decades 2
depth 2
determined 2
distributing 2
dvd 2
efforts 2
eighteen 2
elementary 2
eligible 2
enumeration 2
ethnic 2
family 2
fl 2
focus 2
focusing 2
formed 2
forty 2
frequency 2
functional 2
gadsden 2
ia 2
idb 2
identify 2
incomplete 2
informationgateway.php 2
islander 2
items 2
jurisdictions 2
lands 2
limited 2
located 2
managing 2
midyear 2
model 2
modified.html 2
municipio 2
names 2
objective 2
occupancy 2
occupied 2
pacific 2
place.html 2
pueblo 2
quantity 2
rank 2
recommended 2
recorded 2
repeated 2
reports 2
review 2
rockwell 2
saipe 2
secondary 2
september 2
spanish 2
stages 2
standard 2
statements 2
subbarrios 2
summarized 2
systems 2
tenure 2
times 2
title 2
tribal 2
tribes 2
twelve 2
twenty 2
update 2
urbanized 2
v2016 2
varies 2
113th 1
115th 1
31,000 1
65k 1
abbreviation 1
act 1
additionally 1
affamerican 1
aggregate 1
amended 1
article 1
authorization 1
bls 1
breast 1
cancer 1
capital 1
cash 1
cd 1
cd113 1
cdc 1
centers 1
cervical 1
ces 1
charges 1
closings 1
clusters 1
collects 1
combining 1
complement 1
conditions 1
conducting 1
constitution 1
content 1
control 1
cph 1
created 1
csa 1
cscb 1
cscbo 1
d.c 1
dataproducts 1
dec 1
describing 1
destruction 1
detail 1
detection 1
develop 1
direct 1
disease 1
distinct 1
dynamics 1
enhance 1
establishment 1
expenditure 1
extensive 1
external 1
fall 1
force 1
function 1
funded 1
glance 1
house 1
identical 1
implement 1
indebtedness 1
industrial 1
instruction 1
internal 1
issues 1
jointly 1
language 1
link 1
loaded 1
lunch 1
models 1
monies 1
multiple 1
nativity 1
nbccedp 1
numerous 1
object 1
openings 1
outlay 1
owner 1
page 1
partially 1
passed 1
payments 1
phc 1
pk 1
planned 1
prevention 1
producing 1
property 1
provisions 1
publishing 1
purpose 1
rate 1
reapportioning 1
relating 1
rented 1
renter 1
representatives 1
requires 1
revenue 1
rom 1
salaries 1
schedule 1
service 1
sf1 1
speak 1
spoken 1
sponsored 1
startups 1
substate 1
summaray 1
support 1
tabulate 1
ten 1
test 1
titled 1
tuition 1
undergone 1
understanding 1
unemployment 1
vacancy 1
washington 1
wide 1

What are the most common words in the Census Bureau dataset titles?

# What are the most common words in the titles?
census_title %>%
  dplyr::group_by(word) %>%
  dplyr::count(sort = TRUE) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
word n
estimates 139
population 123
business 114
patterns 108
county 64
code 63
series 62
time 62
vintage 54
survey 52
zip 46
age 45
statistics 44
american 43
beta 40
community 39
tables 38
sex 34
international 32
monthly 31
profiles 30
total 29
trade 29
national 22
single 22
economic 21
hispanic 21
origin 21
nonemployer 20
employer 18
subject 18
detailed 17
indicators 15
profile 15
imports 14
races 14
characteristics 13
comparison 13
exports 13
housing 11
summary 11
annual 10
decennial 10
projected 10
projections 10
system 10
file 9
intercensal 9
migration 9
united 9
change 8
components 8
database 8
flows 8
race 8
classification 6
harmonized 6
hs 6
planning 6
resident 6
supplemental 6
unit 6
counties 5
industry 5
municipios 5
owners 5
demographic 4
dynamics 4
economy 4
historical 4
key 4
naics 4
north 4
pr 4
quarterly 4
sales 4
sectors 4
services 4
universe 4
wide 4
agriculture 3
block 3
commonwealth 3
department 3
entrepreneurs 3
household 3
inventories 3
level 3
longitudinal 3
manufactures 3
poverty 3
qwi 3
tract 3
advanced 2
births 2
businesses 2
company 2
congressional 2
construction 2
deaths 2
districts 2
education 2
food 2
health 2
income 2
insurance 2
manufacturing 2
mcds 2
net 2
populations 2
port 2
public 2
retail 2
selected 2
shipments 2
sitc 2
standard 2
subcounty 2
summarized 2
surnames 2
technology 2
113th 1
115th 1
advance 1
armed 1
basic 1
cd113 1
cd115 1
civilian 1
classes 1
current 1
district 1
elementary 1
ethnicity 1
finance 1
financial 1
firm 1
forces 1
home 1
homeownership 1
homes 1
individual 1
industries 1
language 1
local 1
manufactured 1
manufacturers 1
nativity 1
overseas 1
packages 1
pensions 1
product 1
report 1
residential 1
school 1
secondary 1
sf1 1
spending 1
spoken 1
status 1
table 1
taxes 1
test 1
units 1
vacancies 1
wholesale 1

Word co-occurrences and correlations

Here I examine which words commonly occur together in the titles, descriptions, and keywords of the Census Bureau datasets to create word networks that help determine which datasets are related to one another.

Networks of Description and Title Words

library(widyr)

title_word_pairs <- census_title %>%
  widyr::pairwise_count(word, id, sort = TRUE, upper = FALSE)

title_word_pairs %>%
  dplyr::arrange(-n) %>%
  dplyr::top_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
item1 item2 n
estimates population 63
estimates vintage 54
vintage population 54
county business 54
county patterns 54
business patterns 54
time series 46
population age 42
american community 39
american survey 39
community survey 39

These are the pairs of words that occur together most often in title fields.
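As a rough illustration of what `pairwise_count()` computes, the sketch below runs it on a tiny made-up table (the ids and words in `toy_words` are hypothetical). For each pair of words it counts the number of distinct ids in which both appear; `upper = FALSE` keeps only one ordering of each pair.

```r
library(dplyr)
library(widyr)

# Three made-up "titles", already tokenized into one row per word
toy_words <- tibble::tibble(
  id   = c("a", "a", "a", "b", "b", "c", "c"),
  word = c("population", "estimates", "vintage",
           "population", "estimates",
           "business", "patterns")
)

# Count, for every pair of words, how many ids contain both
toy_pairs <- toy_words %>%
  widyr::pairwise_count(word, id, sort = TRUE, upper = FALSE)

toy_pairs
```

Here "population" and "estimates" share two ids, so that pair sorts to the top; every other pair co-occurs in a single id.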

desc_word_pairs <- census_desc %>%
  widyr::pairwise_count(word, id, sort = TRUE, upper = FALSE)

desc_word_pairs %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
item1 item2 n
utilizes v2015 12
weight census.gov 18
selected rural 1
topics comparison 21
july release 11
smooth incorporate 4
income county 16
date exports 9
total produces 28
methods terms 7

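The table above is a random sample of ten word pairs from the description fields, drawn with sample_n(10); the pairs that occur together most often sit at the top of desc_word_pairs, which is sorted by n.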

Below is a plot of networks of these co-occurring words to better see relationships.

library(ggplot2)
library(igraph)
library(ggraph)

# plot network of co-occurring words for 'title' field
set.seed(1234)
title_word_pairs %>%
  dplyr::filter(n >= 18) %>%
  igraph::graph_from_data_frame() %>%
  ggraph::ggraph(layout = "fr") +
  ggraph::geom_edge_link(
    ggplot2::aes(edge_alpha = n, edge_width = n),
    edge_colour = "steelblue"
  ) +
  ggraph::geom_node_point(size = 5) +
  ggraph::geom_node_text(
    ggplot2::aes(label = name), 
    repel = TRUE, 
    point.padding = unit(0.2, "lines")
  ) +
  ggplot2::theme_void()

Word network in the Census Bureau dataset titles

We see some clear clustering in this network of title words; words in the Census Bureau dataset titles are largely organized into several families of words that tend to go together.
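The graph object behind such a plot can be inspected directly. The sketch below is an illustration rather than the exact pipeline: it rebuilds a four-row pair table by hand (counts copied from the title pair table earlier in this section), applies a hypothetical threshold of 50, and checks how many words and links survive. `graph_from_data_frame()` treats the first two columns as edge endpoints and keeps the remaining columns as edge attributes.

```r
library(dplyr)
library(igraph)

# Four title word pairs with their co-occurrence counts
toy_pairs <- tibble::tibble(
  item1 = c("estimates", "estimates", "county", "time"),
  item2 = c("population", "vintage", "business", "series"),
  n     = c(63, 54, 54, 46)
)

# Keep only the stronger pairs (threshold chosen for illustration), then build the graph
toy_graph <- toy_pairs %>%
  dplyr::filter(n >= 50) %>%
  igraph::graph_from_data_frame()

igraph::vcount(toy_graph)  # number of distinct words kept as nodes
igraph::ecount(toy_graph)  # number of word pairs kept as edges
igraph::E(toy_graph)$n     # co-occurrence counts carried along as an edge attribute
```

Raising the threshold prunes weak links and the words attached only to them, which is why the plotted networks use cutoffs such as `n >= 18` for titles.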

Now I plot the same for the description fields.

# plot network of co-occurring words for 'description' field
set.seed(1234)
desc_word_pairs %>%
  dplyr::filter(n >= 85) %>%
  igraph::graph_from_data_frame() %>%
  ggraph::ggraph(layout = "fr") +
  ggraph::geom_edge_link(
    ggplot2::aes(edge_alpha = n, edge_width = n),
    edge_colour = "steelblue"
  ) +
  ggraph::geom_node_point(size = 5) +
  ggraph::geom_node_text(
    ggplot2::aes(label = name), 
    repel = TRUE, 
    point.padding = unit(0.2, "lines")
  ) +
  ggplot2::theme_void()

Word network in the Census Bureau dataset descriptions

Networks of Keywords

Next I build a network of the keywords to see which keywords commonly occur together in the same datasets.

# Network of Keywords
## See which keywords commonly occur together in the same dataset
keyword_pairs <- census_keyword %>%
  widyr::pairwise_count(keyword, id, sort = TRUE, upper = FALSE)

keyword_pairs %>%
  dplyr::arrange(-n) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
item1 item2 n
Income Marital 6
Income Poverty 6
Marital Poverty 6
set.seed(1234)
keyword_pairs %>%
  igraph::graph_from_data_frame() %>%
  ggraph::ggraph(layout = "fr") +
  ggraph::geom_edge_link(
    ggplot2::aes(edge_alpha = n, edge_width = n),
    edge_colour = "royalblue"
  ) +
  ggraph::geom_node_point(size = 5) +
  ggraph::geom_node_text(
    ggplot2::aes(label = name), 
    repel = TRUE, 
    point.padding = unit(0.2, "lines")
  ) +
  ggplot2::theme_void()

Co-occurrence network in the Census Bureau dataset keywords

Of the 330 or so datasets, only about 6 contain keywords.

Calculating tf-idf for the description fields

Getting the tf-idf for the description field words

What are the highest tf-idf words in the Census Bureau description fields?

library(topicmodels)

desc_tf_idf <- census_desc %>% 
  dplyr::count(id, word, sort = TRUE) %>%
  dplyr::ungroup() %>%
  tidytext::bind_tf_idf(word, id, n) %>%
  dplyr::arrange(-tf_idf)

desc_tf_idf %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
id word n tf idf tf_idf
https://api.census.gov/data/id/ACSDP5Y2014 dataset 1 0.0227273 1.791759 0.0407218
http://api.census.gov/data/id/POPESTPROJbirths2014 cohort 1 0.0117647 3.496508 0.0411354
http://api.census.gov/data/id/POPESTprcagesex2013 commonwealth 1 0.0085470 4.412798 0.0377162
http://api.census.gov/data/id/POPESTcharage2016 issue 1 0.0058480 1.810109 0.0105854
https://api.census.gov/data/id/ACSST5Y2012 giving 1 0.0222222 1.948945 0.0433099
http://api.census.gov/data/id/POPESThousing2013 bureau’s 1 0.0083333 1.145132 0.0095428
http://api.census.gov/data/id/POPESTcharagegroups2016 reflect 1 0.0058140 1.810109 0.0105239
http://api.census.gov/data/id/POPESTpop2015 based 1 0.0074627 1.193922 0.0089099
http://api.census.gov/data/id/ACSFlows2011 metropolitan 1 0.0128205 1.126264 0.0144393
http://api.census.gov/data/id/CBP1995 subtraction 1 0.0128205 2.540996 0.0325769

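Because the table is drawn with sample_n(), it shows ten random rows rather than the top-ranked words; desc_tf_idf itself is sorted by descending tf-idf. A word scores high on tf-idf when it is frequent within one description but rare across the other descriptions, that is, common but not too common.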
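
As a rough check on how these columns relate, the sketch below recomputes tf-idf by hand on hypothetical toy counts (the `toy_counts` table is invented for illustration). It assumes the usual definitions: tf is a word's count divided by the total word count of its document, and idf is the natural log of the number of documents divided by the number of documents containing the word. The "dataset" row in the table above is consistent with this reading: 0.0227273 × 1.791759 ≈ 0.0407218.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy counts: three "documents" and their word counts
toy_counts <- tibble::tibble(
  id   = c("d1", "d1", "d2", "d2", "d3"),
  word = c("survey", "trade", "survey", "housing", "survey"),
  n    = c(3, 1, 2, 2, 4)
)

# tf, idf, and tf_idf as computed by tidytext
toy_tf_idf <- toy_counts %>%
  tidytext::bind_tf_idf(word, id, n)

# Recompute by hand: tf = n / (words in the document),
# idf = ln(number of documents / documents containing the word)
n_docs <- dplyr::n_distinct(toy_counts$id)
manual <- toy_counts %>%
  dplyr::group_by(id) %>%
  dplyr::mutate(tf = n / sum(n)) %>%
  dplyr::group_by(word) %>%
  dplyr::mutate(idf = log(n_docs / dplyr::n_distinct(id))) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(tf_idf = tf * idf)

toy_tf_idf
```

The word "survey" appears in every toy document, so its idf is ln(3/3) = 0 and its tf-idf is 0 no matter how often it occurs. This is the sense in which tf-idf discounts words that are too common.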

Connecting description fields to keywords

First join the results of the tf-idf analysis with the keyword dataset.

library(topicmodels)

desc_tf_idf_keyword <- dplyr::full_join(
  desc_tf_idf, 
  census_keyword, by = "id") %>%
  dplyr::arrange(word)

desc_tf_idf_keyword %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
id word n tf idf tf_idf keyword
https://api.census.gov/data/id/ACSDP1Y2011 official 2 0.0222222 2.3978953 0.0532866 NA
http://api.census.gov/data/id/CBP1988 affects 1 0.0128205 2.5409961 0.0325769 NA
https://api.census.gov/data/id/CBP2012 benchmark 1 0.0200000 4.1896547 0.0837931 NA
http://api.census.gov/data/id/NONEMP2007 percent 1 0.0123457 3.0265039 0.0373642 NA
http://api.census.gov/data/id/ZBPTotal2011 zbp 1 0.0357143 2.6635984 0.0951285 NA
https://api.census.gov/data/id/ACSCP1Y2016 covers 1 0.0196078 1.9278916 0.0378018 NA
https://api.census.gov/data/id/ACSCP5Y2017 services 1 0.0196078 1.6719583 0.0327835 NA
http://api.census.gov/data/id/POPESTcomponents2015 foreign 2 0.0113636 3.6018681 0.0409303 NA
http://api.census.gov/data/id/CBP1986 suppressed 2 0.0256410 2.4668881 0.0632535 NA
http://api.census.gov/data/id/SBOCSCB12 survey 3 0.0681818 0.9315582 0.0635153 NA
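
The NA values in the keyword column follow from how a full join treats unmatched ids. The sketch below illustrates this on hypothetical toy tables (`toy_words` and `toy_kw` are invented for illustration): ids without any keyword are kept with `keyword` set to NA, and an id with several keywords is repeated once per keyword.

```r
library(dplyr)

# Hypothetical toy tables: words for three ids, keywords for only one of them
toy_words <- tibble::tibble(
  id   = c("d1", "d2", "d3"),
  word = c("survey", "trade", "housing")
)
toy_kw <- tibble::tibble(
  id      = c("d1", "d1"),
  keyword = c("Income", "Poverty")
)

# Keep every row from both sides; unmatched ids get NA in the missing columns
toy_joined <- dplyr::full_join(toy_words, toy_kw, by = "id")

toy_joined
```

Here "d1" appears twice (once per keyword), while "d2" and "d3" each appear once with a missing keyword.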

Plot some of the most important words, as measured by tf-idf, for all of the provided keywords used on the Census Bureau datasets.

desc_tf_idf_keyword %>% 
  dplyr::filter(!dplyr::near(tf, 1)) %>%
  dplyr::filter(keyword %in% c("Income", "Marital", "Poverty")) %>%
  dplyr::arrange(dplyr::desc(tf_idf)) %>%
  dplyr::group_by(keyword) %>%
  dplyr::distinct(word, keyword, .keep_all = TRUE) %>%
  dplyr::top_n(15, tf_idf) %>% 
  dplyr::ungroup() %>%
  dplyr::mutate(word = base::factor(word, levels = base::rev(unique(word)))) %>%
  ggplot2::ggplot(ggplot2::aes(word, tf_idf, fill = keyword)) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::facet_wrap(~keyword, ncol = 3, scales = "free") +
  ggplot2::coord_flip() +
  ggplot2::labs(title = "Highest tf-idf words in Census metadata description fields",
       caption = "Census metadata from https://api.census.gov/data.json",
       x = NULL, y = "tf-idf")
Distribution of tf-idf for words from datasets labeled with selected keywords


Uncovering hidden conversations

Topic modeling attempts to uncover the hidden conversations within each description field. Latent Dirichlet allocation (LDA) is a technique that models each document (description field) as a mixture of topics and each topic as a mixture of words.
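
One way to write down the two mixtures is sketched below. The notation is an assumption chosen to match the \(\beta\) and \(\gamma\) columns that tidytext reports later in this report. With \(K\) topics, let \(\gamma_{d,k}\) be the probability that document \(d\) draws from topic \(k\), and let \(\beta_{k,w}\) be the probability that topic \(k\) generates word \(w\). The probability of seeing word \(w\) in document \(d\) is then

\[
P(w \mid d) = \sum_{k=1}^{K} \gamma_{d,k}\, \beta_{k,w},
\qquad \sum_{k=1}^{K} \gamma_{d,k} = 1,
\qquad \sum_{w} \beta_{k,w} = 1 .
\]

Each document has its own row of \(\gamma\) values (its topic mixture), and each topic has its own row of \(\beta\) values (its word mixture).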

desc_stop_words <- dplyr::bind_rows(
  tidytext::stop_words, 
  dplyr::tibble(
    word = c("nbsp", "amp", "gt", 
             "lt", "timesnewromanpsmt", "font",
             "td", "li", "br", "tr", "quot",
             "st", "img", "src", "strong", 
             "http", "file", "files", "00", "www.census.gov",
             base::as.character(1:12)
             ), 
    lexicon = base::rep("custom", 32)))

word_counts <- census_desc %>%
  dplyr::anti_join(desc_stop_words, by = "word") %>%
  dplyr::count(id, word, sort = TRUE) %>%
  dplyr::ungroup()

word_counts %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
id word n
http://api.census.gov/data/id/POPESTcharagegroups2017 program 4
https://api.census.gov/data/id/ACSST1Y2016 metropolitan 1
http://api.census.gov/data/id/ITMONTHLYEXPORTSSTATEHS access 1
http://api.census.gov/data/id/POPESTintercensalnatmonthly2000 program 4
http://api.census.gov/data/id/SBOCSCB12 veteran 1
https://api.census.gov/data/id/ACSST5Y2010 housing 1
http://api.census.gov/data/id/POPESTnatmonthly2016 report 1
http://api.census.gov/data/id/NONEMP2003 scope 1
http://api.census.gov/data/id/POPESTstchar52013 revises 1
http://api.census.gov/data/id/PDBBlockGroup2015 estimates 1

Casting to a document-term matrix

Create a sparse document term matrix, containing the count of terms in each document.

desc_dtm <- word_counts %>%
  tidytext::cast_dtm(id, word, n)

desc_dtm
## <<DocumentTermMatrix (documents: 330, terms: 860)>>
## Non-/sparse entries: 16052/267748
## Sparsity           : 94%
## Maximal term length: 28
## Weighting          : term frequency (tf)
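
The sparsity figure can be reproduced from the other numbers in the printed summary. The short sketch below uses only base R arithmetic and the values copied from the output above.

```r
# Figures copied from the printed DocumentTermMatrix summary
n_documents <- 330
n_terms     <- 860
n_nonzero   <- 16052

n_cells  <- n_documents * n_terms  # every document-term combination
n_zero   <- n_cells - n_nonzero    # cells where the term never occurs
sparsity <- n_zero / n_cells       # share of empty cells

n_zero
round(100 * sparsity)
```

The count of empty cells matches the 267748 sparse entries in the summary, and the ratio rounds to the reported 94%.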

Ready for topic modeling

The following creates an LDA model. Like many clustering algorithms, the number of topics must be set a priori. Here I set the number of topics to 8.

library(topicmodels)
# run an LDA on the description words
desc_lda <- topicmodels::LDA(desc_dtm, k = 8, control = base::list(seed = 1234))
desc_lda
## A LDA_VEM topic model with 8 topics.
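
Because the number of topics is fixed in advance, it can help to compare a few candidate values. The following is a rough sketch of one common approach, comparing perplexity across values of k; it is not the procedure used in this report. As stand-in data it uses a small slice of the `AssociatedPress` document-term matrix that ships with topicmodels rather than `desc_dtm`, and the candidate values of k are arbitrary assumptions.

```r
library(topicmodels)

# Stand-in data (assumption): the first 50 documents of a DTM bundled with topicmodels
data("AssociatedPress", package = "topicmodels")
ap_small <- AssociatedPress[1:50, ]

# Arbitrary candidate numbers of topics
candidate_k <- c(2, 4, 8)

# Fit one LDA model per k and record its perplexity on the same documents
perplexities <- sapply(candidate_k, function(k) {
  fit <- topicmodels::LDA(ap_small, k = k, control = list(seed = 1234))
  topicmodels::perplexity(fit, newdata = ap_small)
})

data.frame(k = candidate_k, perplexity = perplexities)
```

Lower perplexity indicates a better fit, but perplexity measured on the training documents tends to keep improving as k grows. A more careful comparison would score each model on held-out documents and weigh the result against how interpretable the topics are.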

Interpreting the topic model

The following takes the LDA model and constructs a tidy data frame that summarizes the results, with one row per topic-term pair.

# interpret the results
tidy_lda <- tidytext::tidy(desc_lda)

tidy_lda %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
topic term beta
4 question 0.0050538
5 city 0.0000000
4 percent 0.0000000
5 passed 0.0000000
4 armed 0.0031119
3 producing 0.0000000
8 property 0.0000000
7 special 0.0000000
2 glossary 0.0006520
8 minor 0.0000000

The column \(\beta\) is the per-topic-per-term probability: the probability of that term (word) being generated from that topic. It is defined per topic, not per document, so many terms have a \(\beta\) near zero for topics in which they play no role.
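
A minimal sketch of this reading of \(\beta\), using a hypothetical toy table in the same shape as `tidy_lda` (the `toy_beta` values are invented for illustration): within each topic the \(\beta\) values sum to 1 across terms, and the top terms of a topic are the ones with the largest \(\beta\).

```r
library(dplyr)

# Hypothetical per-topic-per-term probabilities, shaped like tidy_lda
toy_beta <- tibble::tibble(
  topic = c(1, 1, 1, 2, 2, 2),
  term  = c("survey", "trade", "housing", "survey", "trade", "housing"),
  beta  = c(0.7, 0.2, 0.1, 0.1, 0.3, 0.6)
)

# Within each topic, the betas form a probability distribution over terms
beta_totals <- toy_beta %>%
  dplyr::group_by(topic) %>%
  dplyr::summarise(total = sum(beta))

# The single most probable term in each topic
toy_top <- toy_beta %>%
  dplyr::group_by(topic) %>%
  dplyr::slice_max(beta, n = 1) %>%
  dplyr::ungroup()

beta_totals
toy_top
```

The sketch uses `dplyr::slice_max()` to pick the top terms; it plays the same role as `top_n()` in the code in this report.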

The following examines the top 6 terms for each topic.

top_terms <- tidy_lda %>%
  dplyr::group_by(topic) %>%
  dplyr::top_n(6, beta) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(topic, -beta)

top_terms %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
topic term beta
5 employment 0.0256562
1 trade 0.0254148
1 postcensal 0.0243657
8 surveys 0.0257496
8 survey 0.0401349
2 vintage 0.0275104
6 produces 0.0360178
3 business 0.0286840
6 unit 0.0352348
7 housing 0.0215063

Here are the top terms for each topic, depicted visually:

top_terms %>%
  dplyr::mutate(term =  stats::reorder(term, beta)) %>%
  dplyr::group_by(topic, term) %>%    
  dplyr::arrange(dplyr::desc(beta)) %>%  
  dplyr::ungroup() %>%
  dplyr::mutate(term = base::factor(base::paste(term, topic, sep = "__"), 
                       levels = base::rev(base::paste(term, topic, sep = "__")))) %>%
  ggplot2::ggplot(ggplot2::aes(term, beta, fill = base::as.factor(topic))) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::coord_flip() +
  ggplot2::scale_x_discrete(labels = function(x) base::gsub("__.+$", "", x)) +
  ggplot2::labs(
    title = "Top 6 terms in each LDA topic",
    x = NULL, y = base::expression(beta)) +
  ggplot2::facet_wrap(~ topic, ncol = 4, scales = "free")
Top terms in topic modeling of Census metadata description field texts


The most frequently occurring terms across the topics are “population”, “annual”, and “estimates”. Given that most of the 330 datasets in the Census API are survey results from the annual American Community Survey, this makes sense.

The following examines which topics are associated with which description fields (i.e., documents). The probability, \(\gamma\), is the probability that a given document belongs to a given topic.

# examine which topics are associated with which description fields
lda_gamma <- tidytext::tidy(desc_lda, matrix = "gamma")

lda_gamma %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

The variable \(\gamma\) has values that run from 0 to 1. A value near 0 means that the document does not belong to that topic. A value close to 1 indicates that the document does belong to that topic.
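
As an illustration of how \(\gamma\) can be used, the sketch below assigns each document to its most probable topic. It uses a hypothetical toy table in the same shape as `lda_gamma` (the `toy_gamma` values are invented for illustration).

```r
library(dplyr)

# Hypothetical per-document-per-topic probabilities, shaped like lda_gamma
toy_gamma <- tibble::tibble(
  document = c("d1", "d1", "d2", "d2"),
  topic    = c(1, 2, 1, 2),
  gamma    = c(0.98, 0.02, 0.35, 0.65)
)

# For each document, the gammas across topics sum to 1
gamma_totals <- toy_gamma %>%
  dplyr::group_by(document) %>%
  dplyr::summarise(total = sum(gamma))

# Assign each document to the topic with the largest gamma
doc_topic <- toy_gamma %>%
  dplyr::group_by(document) %>%
  dplyr::slice_max(gamma, n = 1) %>%
  dplyr::ungroup()

doc_topic
```

In the toy data, "d1" sits almost entirely in topic 1, while "d2" is a less clear-cut mixture that leans toward topic 2.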

The distribution below (note the log-scaled y-axis) shows that most \(\gamma\) values fall near 0 or near 1: a document either clearly belongs to a given topic or clearly does not.

ggplot2::ggplot(lda_gamma, ggplot2::aes(gamma)) +
  ggplot2::geom_histogram(bins = 12) +
  ggplot2::scale_y_log10() +
  ggplot2::labs(title = "Distribution of probabilities for all topics",
       y = "Number of documents", x = base::expression(gamma))
Probability distribution in topic modeling of Census metadata description field texts


The following plot shows how the probabilities are distributed within each topic:

ggplot2::ggplot(lda_gamma, ggplot2::aes(gamma, fill = base::as.factor(topic))) +
  ggplot2::geom_histogram(bins = 4, show.legend = FALSE) +
  ggplot2::facet_wrap(~ topic, ncol = 4) +
  ggplot2::scale_y_log10() +
  ggplot2::labs(title = "Distribution of probability for each topic",
       y = "Number of documents", x = base::expression(gamma))
Probability distribution for each topic in topic modeling of Census metadata description field texts
