There are over 330 datasets with over 64,000 variables maintained by the Census Bureau; these datasets cover topics from international trade to population estimates to business formation within the United States. I’ll use metadata from these datasets to understand the connections between them.
The metadata includes information like the title of the dataset, a description field, which organization(s) within the Census Bureau are responsible for the dataset, keywords for the dataset that have been assigned by a human being, and so forth. The metadata for all of the Bureau's datasets is publicly available online in JSON format.
In this report, I analyze the Census Bureau metadata as a text dataset and apply text mining techniques using the R package tidytext. I compute word co-occurrences and correlations, tf-idf, and topic models to explore the connections between the datasets. The goal is to find out whether datasets are related to one another and to identify clusters of similar datasets. Since the Census Bureau provides several text fields in the metadata, most importantly the title, description, and keyword fields, I can show connections between the fields to better understand the connections between the Census Bureau API datasets.
Download the JSON file and take a look at the names of what is stored in the metadata.
library(jsonlite)
metadata <- jsonlite::fromJSON("https://api.census.gov/data.json")
base::names(metadata$dataset)
## [1] "c_vintage" "c_dataset" "c_geographyLink" "c_variablesLink"
## [5] "c_tagsLink" "c_examplesLink" "c_groupsLink" "c_valuesLink"
## [9] "c_documentationLink" "c_isAggregate" "c_isAvailable" "@type"
## [13] "title" "accessLevel" "bureauCode" "description"
## [17] "distribution" "contactPoint" "identifier" "keyword"
## [21] "license" "modified" "programCode" "references"
## [25] "spatial" "temporal" "publisher" "c_isCube"
## [29] "c_isTimeseries"
The title, description, and keywords for each dataset will be the features of interest.
base::class(metadata$dataset$title)
## [1] "character"
base::class(metadata$dataset$description)
## [1] "character"
base::class(metadata$dataset$keyword)
## [1] "list"
The title and description fields are stored as character vectors, and the keywords are stored as a list of character vectors.
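Because the keywords arrive as a list with one character vector per dataset, they have to be flattened to one row per (dataset, keyword) pair before they can be counted. The following is a minimal base R sketch of that reshaping on made-up data; `toy_ids` and `toy_keywords` are hypothetical stand-ins, not values from the Census metadata.

```r
# Toy stand-in for the identifier and keyword fields; all values are invented
toy_ids <- c("ds1", "ds2", "ds3")
toy_keywords <- list(c("income", "poverty"), NULL, "trade")

# Number of keywords per dataset; a NULL entry has length 0
n_keywords <- lengths(toy_keywords)

# Repeat each id once per keyword and flatten the list,
# giving one row per (id, keyword) pair
toy_long <- data.frame(
  id = rep(toy_ids, times = n_keywords),
  keyword = unlist(toy_keywords),
  stringsAsFactors = FALSE
)
print(toy_long)
```

A dataset with no keywords (here `ds2`) simply contributes no rows to the long table.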
library(tidyverse)
census_title <- tibble::tibble(
id = metadata$dataset$identifier,
title = metadata$dataset$title
)
census_title %>%
dplyr::sample_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| id | title |
|---|---|
| http://api.census.gov/data/id/POPESThousing2016 | Vintage 2016 Population Estimates: Housing Unit Estimates for US, States, and Counties |
| https://api.census.gov/data/id/ACSST5Y2015 | ACS 5-Year Subject Tables |
| http://api.census.gov/data/id/ACSFlows2011 | 2007-2011 American Community Survey: Migration Flows |
| https://api.census.gov/data/id/PEPCOMPONENTS2018 | Vintage 2018 Population Estimates: Components of Change Estimates |
| http://api.census.gov/data/id/ASMState | Time Series Annual Survey of Manufactures: Statistics for All Manufacturing by State |
| http://api.census.gov/data/id/EconCensusEWKS2007 | 2007 Economic Census - All Sectors: Economy-Wide Key Statistics |
| http://api.census.gov/data/id/POPESThousing2013 | Vintage 2013 Population Estimates: Housing Unit Estimates for US, States, and Counties |
| http://api.census.gov/data/id/POPESTcomponents2015 | Vintage 2015 Population Estimates: Components of Change Estimates |
| http://api.census.gov/data/id/POPESTcty2013 | Vintage 2013 Population Estimates: County Total Population and Components of Change |
| http://api.census.gov/data/id/POPESThousing2014 | Vintage 2014 Population Estimates: Housing Unit Estimates for US, States, and Counties |
census_desc <- tibble::tibble(
id = metadata$dataset$identifier,
desc = metadata$dataset$description
)
census_desc %>%
# dplyr::select(desc) %>%
dplyr::sample_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| id | desc |
|---|---|
| http://api.census.gov/data/id/PDBBLOCKGROUP2018 | The PDB is a database of U.S. housing, demographic, socioeconomic and operational statistics based on select 2010 Decennial Census and select 5-year American Community Survey (ACS) estimates. Data are provided at the census block group level of geography. These data can be used for many purposes, including survey field operations planning. |
| http://api.census.gov/data/id/POPESTintercensalnatcivpop1990 | Monthly Intercensal Estimates of the Civilian Population by Single Year of Age and Sex: April 1, 1990 to April 1, 2000 // Source: U.S. Census Bureau, Population Division // For detailed information about the methods used to create the intercensal population estimates, see https://www.census.gov/popest/methodology/intercensal_nat_meth.pdf. // The Census Bureau’s Population Estimates Program produces intercensal estimates each decade by adjusting the existing time series of postcensal estimates for a decade to smooth the transition from one decennial census count to the next. They differ from the postcensal estimates that are released annually because they rely on a formula that redistributes the difference between the April 1 postcensal estimate and April 1 census count for the end of the decade across the estimates for that decade. Meanwhile, the postcensal estimates incorporate current data on births, deaths, and migration to produce each new vintage of estimates, and to revise estimates for years back to the last census. The Population Estimates Program provides additional information including historical and postcensal estimates, evaluation estimates, demographic analysis, and research papers on its website: https://www.census.gov/popest/index.html. |
| http://api.census.gov/data/id/ACSProfile3Y2013 | The American Community Survey (ACS) is a nationwide survey designed to provide communities a fresh look at how they are changing. The ACS replaced the decennial census long form in 2010 and thereafter by collecting long form type information throughout the decade rather than only once every 10 years. Questionnaires are mailed to a sample of addresses to obtain information about households – that is, about each person and the housing unit itself. The American Community Survey produces demographic, social, housing and economic estimates in the form of 1-year, 3-year and 5-year estimates based on population thresholds. The strength of the ACS is in estimating population and housing characteristics. The 3-year data provide key estimates for each of the topic areas covered by the ACS for the nation, all 50 states, the District of Columbia, Puerto Rico, every congressional district, every metropolitan area, and all counties and places with populations of 20,000 or more. Although the ACS produces population, demographic and housing unit estimates,it is the Census Bureau’s Population Estimates Program that produces and disseminates the official estimates of the population for the nation, states, counties, cities and towns, and estimates of housing units for states and counties. For 2010 and other decennial census years, the Decennial Census provides the official counts of population and housing units. |
| http://api.census.gov/data/id/2011acs5 | This endpoint is being phased out. Please use corresponding endpoint found at api.census.gov/data/2011/acs/acs5. |
| http://api.census.gov/data/id/POPPROJBirths2012 | Projected Births by Sex, Race, and Hispanic Origin for the United States: 2012 to 2060 File: 2012 National Population Projections Source: U.S. Census Bureau, Population Division Release Date: December 2012 NOTE: Hispanic origin is considered an ethnicity, not a race. Hispanics may be of any race. The projections generally do not precisely agree with population estimates available elsewhere on the Census Bureau website for methodological reasons. Where both estimates and projections are available for a given time reference, we recommend that you use the population estimates as the measure of the current population. For detailed information about the methods used to create the population projections, see http://www.census.gov/population/projections/methodology/. *** The U.S. Census Bureau periodically produces projections of the United States resident population by age, sex, race, and Hispanic origin. Population projections are estimates of the population for future dates. They are typically based on an estimated population consistent with the most recent decennial census and are produced using the cohort-component method. Projections illustrate possible courses of population change based on assumptions about future births, deaths, net international migration, and domestic migration. In some cases, several series of projections are produced based on alternative assumptions for future fertility, life expectancy, net international migration, and (for state-level projections) state-to-state or domestic migration. Additional information is available on the Population Projections website: http://www.census.gov/population/projections/. |
| http://api.census.gov/data/id/DecennialSF11990 | Summary File 1 (SF 1) contains detailed tables focusing on age, sex, households, families, and housing units. These tables provide in-depth figures by race and Hispanic origin; some tables are repeated for each of nine race/Latino groups. Counts also are provided for over forty American Indian and Alaska Native tribes and for groups within race categories. The race categories include eighteen Asian groups and twelve Native Hawaiian and Other Pacific Islander groups. Counts of persons of Hispanic origin by country of origin (twenty-eight groups) are also shown. Summary File 1 presents data for the United States, the 50 states, and the District of Columbia in a hierarchical sequence down to the block level for many tabulations, but only to the census tract level for others. Summaries are included for other geographic areas such as ZIP Code Tabulation Areas (ZCTAs) and Congressional districts. Geographic coverage for Puerto Rico is comparable to the 50 states. Data are presented in a hierarchical sequence down the block level for many tabulations, but only to the census tract level for others. Geographic areas include barrios, barrios-pueblo, subbarrios, places, census tracts, block groups, and blocks. Summaries also are included for other geographic areas such as ZIP Code Tabulation Areas (ZCTAs). Puerto Rico data will be loaded in January 2017. |
| http://api.census.gov/data/id/ACSSF5Y2010 | This endpoint is being phased out. Please use corresponding endpoint found at api.census.gov/data/2010/acs/acs5. |
| http://api.census.gov/data/id/ftdImpExpHist | This international trade file provides the annual dollar value of U.S. exports and imports of goods for all U.S. trade partners. It also provides the annual dollar value of U.S. exports and imports of manufactured goods for all U.S. trade partners. You can find this data and more by going to usatrade.census.gov. If you have any questions regarding U.S. international trade data, please call us at 1(800)549-0595 option #4 or email us at eid.international.trade.data@census.gov. |
| http://api.census.gov/data/id/ACSProfile5Y2013 | This endpoint is being phased out. Please use corresponding endpoint found at api.census.gov/data/2013/acs/acs5/profile. |
| http://api.census.gov/data/id/ITMONTHLYIMPORTSSTATEHS | The Census data API provides access to the most comprehensive set of data on current month and cumulative year-to-date imports by state and Harmonized System (HS) code. The State HS endpoint in the Census data API also provides value, shipping weight, and method of transportation totals at the state level for all U.S. trading partners. The Census data API will help users research new markets for their products, establish pricing structures for potential export markets, and conduct economic planning. If you have any questions regarding U.S. international trade data, please call us at 1(800)549-0595 option #4 or email us at eid.international.trade.data@census.gov. |
census_keyword <- tibble::tibble(
id = metadata$dataset$identifier,
keyword = metadata$dataset$keyword) %>%
dplyr::filter(!purrr::map_lgl(keyword, is.null)) %>%
tidyr::unnest(keyword)
census_keyword %>%
dplyr::sample_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
library(tidytext)
census_title <- census_title %>%
tidytext::unnest_tokens(word, title) %>%
dplyr::anti_join(stop_words, by = "word")
census_desc <- census_desc %>%
tidytext::unnest_tokens(word, desc) %>%
dplyr::anti_join(stop_words, by = "word")
The title, description, and keyword datasets have been prepared and are now ready for exploration.
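To make the preparation step concrete, here is a rough base R sketch of tokenizing text into one word per row and then dropping stop words. It is only an approximation of what `unnest_tokens()` and `anti_join()` do: the regular-expression split is a crude stand-in for the real tokenizer, and `toy_desc` and `toy_stop` are invented for illustration.

```r
# Toy descriptions and a tiny stop-word list; both are invented
toy_desc <- data.frame(
  id = c("ds1", "ds2"),
  desc = c("Estimates of the Population", "The annual survey of trade"),
  stringsAsFactors = FALSE
)
toy_stop <- c("of", "the")

# Lowercase, then split on runs of non-letter characters
tokens <- strsplit(tolower(toy_desc$desc), "[^a-z]+")

# One row per (id, word)
toy_tidy <- data.frame(
  id = rep(toy_desc$id, times = lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

# Drop stop words: the base R counterpart of an anti-join on "word"
toy_tidy <- toy_tidy[!toy_tidy$word %in% toy_stop, ]
print(toy_tidy)
```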
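# create a list of user-defined stop words
# note: unnest_tokens() lowercases tokens by default, so the capitalized entries
# ("Sector", "NAICS", "MRSF", "US1") never match and those words stay in the data
extra_stopwords <- tibble::tibble(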
word = c(
as.character(1:100),
as.character(1950:2018),
"endpoint", "phased", "api.census.gov", "acs", "acs5", "u.s", "puerto", "rico",
"census", "bureau", "data", "information", "549", "800", "0595", "Sector", "62", "1,000", "100,000",
"NAICS", "00", "000", "100", "MRSF", "01", "US1", "pdf", "0", "zero", "64,000", "65,000"
)
)
# remove those extra stop words from title and description
census_title <- census_title %>%
dplyr::anti_join(extra_stopwords, by = "word")
census_desc <- census_desc %>%
dplyr::anti_join(extra_stopwords, by = "word")
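Stop-word removal by join relies on exact string matches, so letter case matters. The short base R check below, using toy values only, shows how a capitalized stop word fails to match a lowercased token and how lowercasing the stop-word list first avoids that.

```r
# Tokens as they would come out of a lowercasing tokenizer (toy values)
tokens <- c("naics", "sector", "business", "mrsf")

# Stop words written with capitals, as in part of the list above
stop_mixed <- c("NAICS", "Sector", "MRSF")

# Exact matching is case-sensitive, so nothing is removed here
kept_mixed <- tokens[!tokens %in% stop_mixed]

# Lowercasing the stop words first removes the intended tokens
kept_lower <- tokens[!tokens %in% tolower(stop_mixed)]

print(kept_mixed)
print(kept_lower)
```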
census_title %>%
dplyr::sample_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| id | word |
|---|---|
| http://api.census.gov/data/id/CBP2006 | business |
| http://api.census.gov/data/id/ITMONTHLYIMPORTSHS | series |
| http://api.census.gov/data/id/EITSMWTS | wholesale |
| https://api.census.gov/data/id/ACSST5Y2013 | tables |
| http://api.census.gov/data/id/ZBPTotal1994 | county |
| http://api.census.gov/data/id/ZBPTotal2016 | business |
| https://api.census.gov/data/id/ACSDT5Y2013 | detailed |
| http://api.census.gov/data/id/CBP2007 | patterns |
| http://api.census.gov/data/id/CBP2009 | business |
| http://api.census.gov/data/id/EITSRESCONST | construction |
census_desc %>%
dplyr::sample_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
census_keyword %>%
dplyr::sample_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
What are the most common words in the Census Bureau dataset keywords?
#What are the most common keywords?
census_keyword %>%
dplyr::group_by(keyword) %>%
dplyr::count(sort = TRUE) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| keyword | n |
|---|---|
| Income | 6 |
| Marital | 6 |
| Poverty | 6 |
What are the most common words in the Census Bureau dataset descriptions?
# What are the most common words in the descriptions?
census_desc %>%
dplyr::group_by(word) %>%
dplyr::count(sort = TRUE) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| word | n |
|---|---|
| estimates | 1178 |
| population | 1019 |
| economic | 315 |
| housing | 298 |
| series | 296 |
| program | 288 |
| survey | 270 |
| decennial | 262 |
| race | 262 |
| annual | 246 |
| vintage | 227 |
| www.census.gov | 207 |
| demographic | 191 |
| statistics | 183 |
| time | 177 |
| recent | 172 |
| business | 170 |
| migration | 166 |
| county | 164 |
| current | 163 |
| popest | 154 |
| produces | 144 |
| includes | 142 |
| counties | 138 |
| including | 134 |
| payroll | 134 |
| http | 131 |
| businesses | 130 |
| american | 129 |
| date | 121 |
| district | 118 |
| metropolitan | 117 |
| april | 114 |
| based | 114 |
| community | 114 |
| united | 113 |
| social | 110 |
| change | 109 |
| bureau’s | 105 |
| provide | 103 |
| file | 102 |
| projections | 102 |
| hispanic | 100 |
| origin | 99 |
| congressional | 98 |
| source | 97 |
| detailed | 95 |
| patterns | 95 |
| additional | 93 |
| geographic | 93 |
| international | 92 |
| division | 91 |
| index.html | 91 |
| profiles | 91 |
| july | 90 |
| characteristics | 87 |
| employment | 87 |
| https | 87 |
| research | 87 |
| unit | 85 |
| intercensal | 84 |
| cbp | 83 |
| nation | 82 |
| trade | 82 |
| analysis | 81 |
| form | 81 |
| surveys | 81 |
| code | 80 |
| historical | 80 |
| produced | 80 |
| website | 79 |
| methodology | 76 |
| resident | 76 |
| age | 75 |
| variables | 75 |
| zip | 75 |
| counts | 74 |
| establishments | 74 |
| include | 74 |
| programs | 74 |
| communities | 72 |
| derived | 72 |
| net | 72 |
| sex | 71 |
| note | 70 |
| dataset | 69 |
| due | 69 |
| files | 69 |
| reference | 69 |
| summary | 69 |
| count | 68 |
| populations | 68 |
| question | 68 |
| births | 66 |
| create | 65 |
| dates | 65 |
| deaths | 65 |
| methods | 65 |
| subject | 65 |
| combination | 64 |
| evaluation | 63 |
| papers | 63 |
| e.g | 62 |
| level | 62 |
| services | 62 |
| employees | 61 |
| industry | 61 |
| paid | 61 |
| march | 60 |
| nonemployers | 60 |
| previously | 60 |
| resolution | 60 |
| total | 60 |
| broad | 59 |
| units | 59 |
| sector | 58 |
| api | 57 |
| final | 55 |
| geographies | 55 |
| official | 55 |
| begins | 54 |
| calculate | 54 |
| extends | 54 |
| issue | 54 |
| quarter | 54 |
| refers | 54 |
| reflect | 54 |
| revises | 54 |
| supersedes | 54 |
| suppressed | 54 |
| utilizes | 54 |
| week | 54 |
| considered | 53 |
| range | 53 |
| revisions | 53 |
| error | 52 |
| mi | 52 |
| michigan | 52 |
| processing | 52 |
| values | 52 |
| measures | 51 |
| tables | 50 |
| pep | 49 |
| statistical | 49 |
| covers | 48 |
| industries | 48 |
| investments | 48 |
| quarterly | 48 |
| release | 48 |
| giving | 47 |
| ongoing | 47 |
| plan | 47 |
| topics | 47 |
| totals | 46 |
| firms | 45 |
| receipts | 45 |
| set | 45 |
| income | 44 |
| ethnicity | 43 |
| decade | 41 |
| beta | 40 |
| moved | 40 |
| columbia | 39 |
| individuals | 39 |
| markets | 38 |
| specific | 38 |
| comprehensive | 37 |
| hispanics | 37 |
| method | 37 |
| indicators | 36 |
| selected | 36 |
| tracts | 36 |
| categories | 35 |
| districts | 35 |
| modified | 35 |
| responses | 35 |
| comparison | 34 |
| metro | 34 |
| primary | 34 |
| topic | 34 |
| sample | 33 |
| december | 32 |
| differences | 32 |
| mcd | 32 |
| armed | 31 |
| census.gov | 31 |
| forces | 31 |
| health | 31 |
| naics | 31 |
| races | 31 |
| block | 30 |
| planning | 30 |
| postcensal | 30 |
| principal | 30 |
| recommend | 30 |
| released | 30 |
| sampling | 30 |
| shown | 30 |
| percentages | 29 |
| sectors | 29 |
| staff | 29 |
| calhoun | 28 |
| dc | 28 |
| households | 28 |
| included | 28 |
| results | 28 |
| federal | 27 |
| island | 27 |
| monthly | 27 |
| original | 27 |
| partners | 27 |
| type | 27 |
| versus | 27 |
| 3rd | 26 |
| affect | 26 |
| affects | 26 |
| analyses | 26 |
| assistance | 26 |
| battle | 26 |
| care | 26 |
| creek | 26 |
| identified | 26 |
| levels | 26 |
| plans | 26 |
| questions | 26 |
| result | 26 |
| revised | 26 |
| subtraction | 26 |
| addresses | 25 |
| changing | 25 |
| cities | 25 |
| civilian | 25 |
| collecting | 25 |
| construction | 25 |
| designed | 25 |
| disseminates | 25 |
| estimating | 25 |
| fresh | 25 |
| mailed | 25 |
| obtain | 25 |
| person | 25 |
| questionnaires | 25 |
| replaced | 25 |
| residence | 25 |
| sales | 25 |
| strength | 25 |
| thresholds | 25 |
| towns | 25 |
| documentation | 24 |
| future | 24 |
| manufactured | 24 |
| owners | 24 |
| potential | 24 |
| technical | 24 |
| activity | 23 |
| adds | 23 |
| call | 23 |
| digit | 23 |
| eid.international.trade.data | 23 |
| email | 23 |
| flow | 23 |
| found | 23 |
| option | 23 |
| products | 23 |
| table | 23 |
| zbp | 23 |
| basis | 22 |
| born | 22 |
| means | 22 |
| nationwide | 22 |
| overseas | 22 |
| report | 22 |
| sum | 22 |
| foreign | 21 |
| key | 21 |
| manufacturing | 21 |
| public | 21 |
| tax | 21 |
| v2013 | 21 |
| consist | 20 |
| economy | 20 |
| local | 20 |
| national | 20 |
| nonemployer | 20 |
| performance | 20 |
| produce | 20 |
| sic | 20 |
| subnational | 20 |
| transportation | 20 |
| access | 19 |
| component | 19 |
| conduct | 19 |
| contents | 19 |
| covered | 19 |
| cumulative | 19 |
| establish | 19 |
| estimated | 19 |
| export | 19 |
| flows | 19 |
| june | 19 |
| month | 19 |
| native | 19 |
| pricing | 19 |
| provided | 19 |
| rolling | 19 |
| shipping | 19 |
| single | 19 |
| structures | 19 |
| trading | 19 |
| users | 19 |
| weight | 19 |
| components | 18 |
| imports | 18 |
| operating | 18 |
| region | 18 |
| residual | 18 |
| exports | 17 |
| individual | 17 |
| mrsf | 17 |
| notes | 17 |
| status | 17 |
| unincorporated | 17 |
| us1 | 17 |
| ago | 16 |
| government | 16 |
| mcds | 16 |
| methodology.html | 16 |
| overview | 16 |
| percent | 16 |
| popest.html | 16 |
| sources | 16 |
| system | 16 |
| tabulations | 16 |
| timely | 16 |
| variety | 16 |
| activities | 15 |
| adjustment | 15 |
| average | 15 |
| bin | 15 |
| briefrm | 15 |
| briefroom | 15 |
| bureau.s | 15 |
| cgi | 15 |
| complementary | 15 |
| covering | 15 |
| decisions | 15 |
| employed | 15 |
| estimation | 15 |
| exception | 15 |
| excluded | 15 |
| homes | 15 |
| impact | 15 |
| indicator | 15 |
| inform | 15 |
| investment | 15 |
| majority | 15 |
| nationally | 15 |
| offer | 15 |
| owner’s | 15 |
| pensions | 15 |
| pertinent | 15 |
| policy | 15 |
| proprietorships | 15 |
| published | 15 |
| refer | 15 |
| reliability | 15 |
| reliable | 15 |
| resource | 15 |
| retail | 15 |
| scope | 15 |
| seasonal | 15 |
| sole | 15 |
| starting | 15 |
| study | 15 |
| taxes | 15 |
| variability | 15 |
| variance | 15 |
| visit | 15 |
| webpages | 15 |
| wholesale | 15 |
| assumptions | 14 |
| datasets | 14 |
| definitions | 14 |
| domestic | 14 |
| measure | 14 |
| terms | 14 |
| terms.html | 14 |
| incorporated | 13 |
| 66,000 | 12 |
| base | 12 |
| geo | 12 |
| noninstitutionalized | 12 |
| select | 12 |
| v2015 | 12 |
| compared | 11 |
| comparisons | 11 |
| cross | 11 |
| enumerated | 11 |
| location | 11 |
| municipios | 11 |
| operational | 11 |
| past | 11 |
| persons | 11 |
| projected | 11 |
| separately | 11 |
| significance | 11 |
| similar | 11 |
| site | 11 |
| size | 11 |
| testing | 11 |
| web | 11 |
| 2060 | 10 |
| addition | 10 |
| analyzing | 10 |
| classification | 10 |
| cohort | 10 |
| companies | 10 |
| congress | 10 |
| consistent | 10 |
| courses | 10 |
| geography | 10 |
| hs | 10 |
| illustrate | 10 |
| micropolitan | 10 |
| relationship | 10 |
| typically | 10 |
| world | 10 |
| 114th | 9 |
| active | 9 |
| attributed | 9 |
| database | 9 |
| divisions | 9 |
| duty | 9 |
| functions | 9 |
| movement | 9 |
| natives | 9 |
| poverty | 9 |
| profile | 9 |
| qwi | 9 |
| rates | 9 |
| represents | 9 |
| specifically | 9 |
| subgroups | 9 |
| v2014 | 9 |
| 2,400 | 8 |
| abroad | 8 |
| acs1 | 8 |
| civil | 8 |
| dollar | 8 |
| field | 8 |
| geographical | 8 |
| list | 8 |
| live | 8 |
| minor | 8 |
| mobility | 8 |
| mrsf2010 | 8 |
| nonmovers | 8 |
| quarters | 8 |
| strong | 8 |
| subcounty | 8 |
| supplemental | 8 |
| tract | 8 |
| universe | 8 |
| www2 | 8 |
| administration | 7 |
| boundaries | 7 |
| censuses | 7 |
| household | 7 |
| overlapping | 7 |
| owned | 7 |
| repeat | 7 |
| agency | 6 |
| agree | 6 |
| asm | 6 |
| budget’s | 6 |
| collected | 6 |
| combined | 6 |
| countries | 6 |
| country | 6 |
| coverage | 6 |
| delineations | 6 |
| education | 6 |
| harmonized | 6 |
| issued | 6 |
| january | 6 |
| latino | 6 |
| lrd | 6 |
| management | 6 |
| market | 6 |
| methodological | 6 |
| office | 6 |
| operations | 6 |
| pdb | 6 |
| port | 6 |
| precisely | 6 |
| purposes | 6 |
| reasons | 6 |
| sbo | 6 |
| school | 6 |
| socioeconomic | 6 |
| statement | 6 |
| surnames | 6 |
| tabulation | 6 |
| 5,000 | 5 |
| account | 5 |
| advertising | 5 |
| agencies | 5 |
| benchmark | 5 |
| budgets | 5 |
| class | 5 |
| conducted | 5 |
| databases | 5 |
| developing | 5 |
| effectiveness | 5 |
| firm | 5 |
| forms | 5 |
| government’s | 5 |
| indian | 5 |
| insurance | 5 |
| law | 5 |
| legal | 5 |
| locations | 5 |
| loss | 5 |
| measuring | 5 |
| medium | 5 |
| million | 5 |
| modifying | 5 |
| october | 5 |
| organization | 5 |
| placements | 5 |
| prepared | 5 |
| quotas | 5 |
| reported | 5 |
| representing | 5 |
| required | 5 |
| residential | 5 |
| respondents | 5 |
| response | 5 |
| setting | 5 |
| stock | 5 |
| studying | 5 |
| subjects | 5 |
| 16,000 | 4 |
| 2050 | 4 |
| adjusting | 4 |
| alaska | 4 |
| alternative | 4 |
| annually | 4 |
| barrios | 4 |
| boundary | 4 |
| bureaus | 4 |
| city | 4 |
| codes | 4 |
| commonwealth | 4 |
| creation | 4 |
| difference | 4 |
| earnings | 4 |
| estimate | 4 |
| existing | 4 |
| expectancy | 4 |
| fertility | 4 |
| formula | 4 |
| funds | 4 |
| hawaiian | 4 |
| hierarchical | 4 |
| incorporate | 4 |
| intercensal_nat_meth.pdf | 4 |
| job | 4 |
| life | 4 |
| north | 4 |
| periodically | 4 |
| private | 4 |
| product | 4 |
| redistributes | 4 |
| reflects | 4 |
| related | 4 |
| rely | 4 |
| revise | 4 |
| sahie | 4 |
| sequence | 4 |
| sf | 4 |
| sitc | 4 |
| smooth | 4 |
| speakers | 4 |
| subdivisions | 4 |
| summaries | 4 |
| tech | 4 |
| transition | 4 |
| urban | 4 |
| usatrade.census.gov | 4 |
| v2017 | 4 |
| zctas | 4 |
| 20,000 | 3 |
| 3,000 | 3 |
| 9,000 | 3 |
| about.html | 3 |
| academic | 3 |
| accurate | 3 |
| agent | 3 |
| agricultural | 3 |
| ancestry | 3 |
| ascertain | 3 |
| blocks | 3 |
| broken | 3 |
| commodities | 3 |
| compiled | 3 |
| cps | 3 |
| demographics | 3 |
| english | 3 |
| entrepreneurs | 3 |
| factfinder | 3 |
| families | 3 |
| figures | 3 |
| gender | 3 |
| glossary | 3 |
| glossary.html | 3 |
| home | 3 |
| inputs | 3 |
| labor | 3 |
| languages | 3 |
| lehd.ces.census.gov | 3 |
| longitudinal | 3 |
| main | 3 |
| manufactures | 3 |
| micro | 3 |
| military | 3 |
| minority | 3 |
| nonagricultural | 3 |
| outputs | 3 |
| people | 3 |
| prior | 3 |
| project | 3 |
| read | 3 |
| regions | 3 |
| regularly | 3 |
| researchers | 3 |
| rural | 3 |
| special | 3 |
| sworn | 3 |
| updates | 3 |
| usda | 3 |
| uswide | 3 |
| veteran | 3 |
| women | 3 |
| worker | 3 |
| workforce | 3 |
| 25,000 | 2 |
| acsse | 2 |
| advance | 2 |
| ages | 2 |
| aggregates | 2 |
| aid | 2 |
| allocation | 2 |
| asian | 2 |
| bds | 2 |
| birth | 2 |
| category | 2 |
| changes.html | 2 |
| children | 2 |
| collection | 2 |
| comparable | 2 |
| concluded | 2 |
| confidentiality | 2 |
| core | 2 |
| cprofile | 2 |
| decades | 2 |
| depth | 2 |
| determined | 2 |
| distributing | 2 |
| dvd | 2 |
| efforts | 2 |
| eighteen | 2 |
| elementary | 2 |
| eligible | 2 |
| enumeration | 2 |
| ethnic | 2 |
| family | 2 |
| fl | 2 |
| focus | 2 |
| focusing | 2 |
| formed | 2 |
| forty | 2 |
| frequency | 2 |
| functional | 2 |
| gadsden | 2 |
| ia | 2 |
| idb | 2 |
| identify | 2 |
| incomplete | 2 |
| informationgateway.php | 2 |
| islander | 2 |
| items | 2 |
| jurisdictions | 2 |
| lands | 2 |
| limited | 2 |
| located | 2 |
| managing | 2 |
| midyear | 2 |
| model | 2 |
| modified.html | 2 |
| municipio | 2 |
| names | 2 |
| objective | 2 |
| occupancy | 2 |
| occupied | 2 |
| pacific | 2 |
| place.html | 2 |
| pueblo | 2 |
| quantity | 2 |
| rank | 2 |
| recommended | 2 |
| recorded | 2 |
| repeated | 2 |
| reports | 2 |
| review | 2 |
| rockwell | 2 |
| saipe | 2 |
| secondary | 2 |
| september | 2 |
| spanish | 2 |
| stages | 2 |
| standard | 2 |
| statements | 2 |
| subbarrios | 2 |
| summarized | 2 |
| systems | 2 |
| tenure | 2 |
| times | 2 |
| title | 2 |
| tribal | 2 |
| tribes | 2 |
| twelve | 2 |
| twenty | 2 |
| update | 2 |
| urbanized | 2 |
| v2016 | 2 |
| varies | 2 |
| 113th | 1 |
| 115th | 1 |
| 31,000 | 1 |
| 65k | 1 |
| abbreviation | 1 |
| act | 1 |
| additionally | 1 |
| affamerican | 1 |
| aggregate | 1 |
| amended | 1 |
| article | 1 |
| authorization | 1 |
| bls | 1 |
| breast | 1 |
| cancer | 1 |
| capital | 1 |
| cash | 1 |
| cd | 1 |
| cd113 | 1 |
| cdc | 1 |
| centers | 1 |
| cervical | 1 |
| ces | 1 |
| charges | 1 |
| closings | 1 |
| clusters | 1 |
| collects | 1 |
| combining | 1 |
| complement | 1 |
| conditions | 1 |
| conducting | 1 |
| constitution | 1 |
| content | 1 |
| control | 1 |
| cph | 1 |
| created | 1 |
| csa | 1 |
| cscb | 1 |
| cscbo | 1 |
| d.c | 1 |
| dataproducts | 1 |
| dec | 1 |
| describing | 1 |
| destruction | 1 |
| detail | 1 |
| detection | 1 |
| develop | 1 |
| direct | 1 |
| disease | 1 |
| distinct | 1 |
| dynamics | 1 |
| enhance | 1 |
| establishment | 1 |
| expenditure | 1 |
| extensive | 1 |
| external | 1 |
| fall | 1 |
| force | 1 |
| function | 1 |
| funded | 1 |
| glance | 1 |
| house | 1 |
| identical | 1 |
| implement | 1 |
| indebtedness | 1 |
| industrial | 1 |
| instruction | 1 |
| internal | 1 |
| issues | 1 |
| jointly | 1 |
| language | 1 |
| link | 1 |
| loaded | 1 |
| lunch | 1 |
| models | 1 |
| monies | 1 |
| multiple | 1 |
| nativity | 1 |
| nbccedp | 1 |
| numerous | 1 |
| object | 1 |
| openings | 1 |
| outlay | 1 |
| owner | 1 |
| page | 1 |
| partially | 1 |
| passed | 1 |
| payments | 1 |
| phc | 1 |
| pk | 1 |
| planned | 1 |
| prevention | 1 |
| producing | 1 |
| property | 1 |
| provisions | 1 |
| publishing | 1 |
| purpose | 1 |
| rate | 1 |
| reapportioning | 1 |
| relating | 1 |
| rented | 1 |
| renter | 1 |
| representatives | 1 |
| requires | 1 |
| revenue | 1 |
| rom | 1 |
| salaries | 1 |
| schedule | 1 |
| service | 1 |
| sf1 | 1 |
| speak | 1 |
| spoken | 1 |
| sponsored | 1 |
| startups | 1 |
| substate | 1 |
| summaray | 1 |
| support | 1 |
| tabulate | 1 |
| ten | 1 |
| test | 1 |
| titled | 1 |
| tuition | 1 |
| undergone | 1 |
| understanding | 1 |
| unemployment | 1 |
| vacancy | 1 |
| washington | 1 |
| wide | 1 |
What are the most common words in the Census Bureau dataset titles?
# What are the most common words in the titles?
census_title %>%
dplyr::group_by(word) %>%
dplyr::count(sort = TRUE) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| word | n |
|---|---|
| estimates | 139 |
| population | 123 |
| business | 114 |
| patterns | 108 |
| county | 64 |
| code | 63 |
| series | 62 |
| time | 62 |
| vintage | 54 |
| survey | 52 |
| zip | 46 |
| age | 45 |
| statistics | 44 |
| american | 43 |
| beta | 40 |
| community | 39 |
| tables | 38 |
| sex | 34 |
| international | 32 |
| monthly | 31 |
| profiles | 30 |
| total | 29 |
| trade | 29 |
| national | 22 |
| single | 22 |
| economic | 21 |
| hispanic | 21 |
| origin | 21 |
| nonemployer | 20 |
| employer | 18 |
| subject | 18 |
| detailed | 17 |
| indicators | 15 |
| profile | 15 |
| imports | 14 |
| races | 14 |
| characteristics | 13 |
| comparison | 13 |
| exports | 13 |
| housing | 11 |
| summary | 11 |
| annual | 10 |
| decennial | 10 |
| projected | 10 |
| projections | 10 |
| system | 10 |
| file | 9 |
| intercensal | 9 |
| migration | 9 |
| united | 9 |
| change | 8 |
| components | 8 |
| database | 8 |
| flows | 8 |
| race | 8 |
| classification | 6 |
| harmonized | 6 |
| hs | 6 |
| planning | 6 |
| resident | 6 |
| supplemental | 6 |
| unit | 6 |
| counties | 5 |
| industry | 5 |
| municipios | 5 |
| owners | 5 |
| demographic | 4 |
| dynamics | 4 |
| economy | 4 |
| historical | 4 |
| key | 4 |
| naics | 4 |
| north | 4 |
| pr | 4 |
| quarterly | 4 |
| sales | 4 |
| sectors | 4 |
| services | 4 |
| universe | 4 |
| wide | 4 |
| agriculture | 3 |
| block | 3 |
| commonwealth | 3 |
| department | 3 |
| entrepreneurs | 3 |
| household | 3 |
| inventories | 3 |
| level | 3 |
| longitudinal | 3 |
| manufactures | 3 |
| poverty | 3 |
| qwi | 3 |
| tract | 3 |
| advanced | 2 |
| births | 2 |
| businesses | 2 |
| company | 2 |
| congressional | 2 |
| construction | 2 |
| deaths | 2 |
| districts | 2 |
| education | 2 |
| food | 2 |
| health | 2 |
| income | 2 |
| insurance | 2 |
| manufacturing | 2 |
| mcds | 2 |
| net | 2 |
| populations | 2 |
| port | 2 |
| public | 2 |
| retail | 2 |
| selected | 2 |
| shipments | 2 |
| sitc | 2 |
| standard | 2 |
| subcounty | 2 |
| summarized | 2 |
| surnames | 2 |
| technology | 2 |
| 113th | 1 |
| 115th | 1 |
| advance | 1 |
| armed | 1 |
| basic | 1 |
| cd113 | 1 |
| cd115 | 1 |
| civilian | 1 |
| classes | 1 |
| current | 1 |
| district | 1 |
| elementary | 1 |
| ethnicity | 1 |
| finance | 1 |
| financial | 1 |
| firm | 1 |
| forces | 1 |
| home | 1 |
| homeownership | 1 |
| homes | 1 |
| individual | 1 |
| industries | 1 |
| language | 1 |
| local | 1 |
| manufactured | 1 |
| manufacturers | 1 |
| nativity | 1 |
| overseas | 1 |
| packages | 1 |
| pensions | 1 |
| product | 1 |
| report | 1 |
| residential | 1 |
| school | 1 |
| secondary | 1 |
| sf1 | 1 |
| spending | 1 |
| spoken | 1 |
| status | 1 |
| table | 1 |
| taxes | 1 |
| test | 1 |
| units | 1 |
| vacancies | 1 |
| wholesale | 1 |
Here I examine which words commonly occur together in the titles, descriptions, and keywords of the Census Bureau datasets. The resulting word networks help determine which datasets are related to one another.
library(widyr)
title_word_pairs <- census_title %>%
widyr::pairwise_count(word, id, sort = TRUE, upper = FALSE)
title_word_pairs %>%
dplyr::arrange(-n) %>%
# top_n() keeps ties, so more than 10 rows can be returned
dplyr::top_n(10, n) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| item1 | item2 | n |
|---|---|---|
| estimates | population | 63 |
| estimates | vintage | 54 |
| vintage | population | 54 |
| county | business | 54 |
| county | patterns | 54 |
| business | patterns | 54 |
| time | series | 46 |
| population | age | 42 |
| american | community | 39 |
| american | survey | 39 |
| community | survey | 39 |
These are the pairs of words that occur together most often in title fields.
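To make the counting step concrete, here is a minimal base R sketch of what a pairwise co-occurrence count computes. The `toy` data frame and its ids are hypothetical stand-ins for `census_title`, and the matrix approach is only an illustration of the idea, not how `widyr` works internally.

```r
# Hypothetical toy data: one row per (dataset id, word), shaped like census_title.
toy <- data.frame(
  id   = c("a", "a", "a", "b", "b", "c", "c"),
  word = c("population", "estimates", "vintage",
           "population", "estimates",
           "business", "patterns"),
  stringsAsFactors = FALSE
)

# Dataset-by-word incidence matrix: 1 if the word appears in that title.
incidence <- unclass(table(toy$id, toy$word))
incidence[incidence > 0] <- 1

# crossprod() gives, for every pair of words, the number of datasets
# whose title contains both of them.
cooc <- crossprod(incidence)

cooc["estimates", "population"]  # shared by datasets "a" and "b"
cooc["business", "patterns"]     # shared by dataset "c" only
```

The off-diagonal entries of `cooc` play the role of the `n` column in the pair table above; the diagonal simply counts how many titles contain each word.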
desc_word_pairs <- census_desc %>%
widyr::pairwise_count(word, id, sort = TRUE, upper = FALSE)
desc_word_pairs %>%
dplyr::sample_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| item1 | item2 | n |
|---|---|---|
| utilizes | v2015 | 12 |
| weight | census.gov | 18 |
| selected | rural | 1 |
| topics | comparison | 21 |
| july | release | 11 |
| smooth | incorporate | 4 |
| income | county | 16 |
| date | exports | 9 |
| total | produces | 28 |
| methods | terms | 7 |
This table is a random sample of ten word pairs from the description fields (drawn with `sample_n(10)`), so it does not show the pairs that occur together most often.
Below, I plot networks of these co-occurring words to make the relationships easier to see.
library(ggplot2)
library(igraph)
library(ggraph)
# plot network of co-occurring words for 'title' field
set.seed(1234)
title_word_pairs %>%
dplyr::filter(n >= 18) %>%
igraph::graph_from_data_frame() %>%
ggraph::ggraph(layout = "fr") +
ggraph::geom_edge_link(
ggplot2::aes(edge_alpha = n, edge_width = n),
edge_colour = "steelblue"
) +
ggraph::geom_node_point(size = 5) +
ggraph::geom_node_text(
ggplot2::aes(label = name),
repel = TRUE,
point.padding = unit(0.2, "lines")
) +
ggplot2::theme_void()
Word network in the Census Bureau dataset titles
We see some clear clustering in this network of title words; words in the Census Bureau dataset titles are largely organized into several families of words that tend to go together.
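One rough way to check these families without reading them off the plot is to compute the connected components of the filtered graph. The sketch below uses base R and only the eleven top pairs shown in the title pair table earlier, not the full `title_word_pairs` data, so it is an illustration under that assumption rather than a complete analysis.

```r
# Top title word pairs copied from the pair table earlier in this report.
pairs <- data.frame(
  item1 = c("estimates", "estimates", "vintage", "county", "county",
            "business", "time", "population", "american", "american",
            "community"),
  item2 = c("population", "vintage", "population", "business", "patterns",
            "patterns", "series", "age", "community", "survey", "survey"),
  n = c(63, 54, 54, 54, 54, 54, 46, 42, 39, 39, 39),
  stringsAsFactors = FALSE
)

# Keep only the strong edges, mirroring the filter(n >= 18) step.
edges <- pairs[pairs$n >= 18, ]

# Start with every word labeled by its own name, then repeatedly give both
# ends of each edge the alphabetically smaller label until nothing changes.
# Words that end up sharing a label are in the same connected component.
words <- sort(unique(c(edges$item1, edges$item2)))
label <- stats::setNames(words, words)
repeat {
  changed <- FALSE
  for (i in seq_len(nrow(edges))) {
    a <- edges$item1[i]
    b <- edges$item2[i]
    smaller <- min(label[[a]], label[[b]])
    if (label[[a]] != smaller || label[[b]] != smaller) {
      label[[a]] <- smaller
      label[[b]] <- smaller
      changed <- TRUE
    }
  }
  if (!changed) break
}

# One list element per family of words.
families <- split(names(label), label)
families
```

With the full set of pairs, `igraph::components()` applied to the graph object would be the more natural tool; the loop here just keeps the sketch dependency-free.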
Now I plot the same for the description fields.
# plot network of co-occurring words for 'description' field
set.seed(1234)
desc_word_pairs %>%
dplyr::filter(n >= 85) %>%
igraph::graph_from_data_frame() %>%
ggraph::ggraph(layout = "fr") +
ggraph::geom_edge_link(
ggplot2::aes(edge_alpha = n, edge_width = n),
edge_colour = "steelblue"
) +
ggraph::geom_node_point(size = 5) +
ggraph::geom_node_text(
ggplot2::aes(label = name),
repel = TRUE,
point.padding = unit(0.2, "lines")
) +
ggplot2::theme_void()
Word network in the Census Bureau dataset descriptions
Next, I build a network of the keywords to see which keywords commonly occur together in the same datasets.
# Network of Keywords
## See which keywords commonly occur together in the same dataset
keyword_pairs <- census_keyword %>%
widyr::pairwise_count(keyword, id, sort = TRUE, upper = FALSE)
keyword_pairs %>%
dplyr::arrange(-n) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| item1 | item2 | n |
|---|---|---|
| Income | Marital | 6 |
| Income | Poverty | 6 |
| Marital | Poverty | 6 |
set.seed(1234)
keyword_pairs %>%
igraph::graph_from_data_frame() %>%
ggraph::ggraph(layout = "fr") +
ggraph::geom_edge_link(
ggplot2::aes(edge_alpha = n, edge_width = n),
edge_colour = "royalblue"
) +
ggraph::geom_node_point(size = 5) +
ggraph::geom_node_text(
ggplot2::aes(label = name),
repel = TRUE,
point.padding = unit(0.2, "lines")
) +
ggplot2::theme_void()
Co-occurrence network in the Census Bureau dataset keywords
Of the 330 or so datasets, only about 6 contain keywords.
What are the highest tf-idf words in the Census Bureau description fields?
library(topicmodels)
desc_tf_idf <- census_desc %>%
dplyr::count(id, word, sort = TRUE) %>%
dplyr::ungroup() %>%
tidytext::bind_tf_idf(word, id, n) %>%
dplyr::arrange(-tf_idf)
desc_tf_idf %>%
dplyr::sample_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| id | word | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| https://api.census.gov/data/id/ACSDP5Y2014 | dataset | 1 | 0.0227273 | 1.791759 | 0.0407218 |
| http://api.census.gov/data/id/POPESTPROJbirths2014 | cohort | 1 | 0.0117647 | 3.496508 | 0.0411354 |
| http://api.census.gov/data/id/POPESTprcagesex2013 | commonwealth | 1 | 0.0085470 | 4.412798 | 0.0377162 |
| http://api.census.gov/data/id/POPESTcharage2016 | issue | 1 | 0.0058480 | 1.810109 | 0.0105854 |
| https://api.census.gov/data/id/ACSST5Y2012 | giving | 1 | 0.0222222 | 1.948945 | 0.0433099 |
| http://api.census.gov/data/id/POPESThousing2013 | bureau’s | 1 | 0.0083333 | 1.145132 | 0.0095428 |
| http://api.census.gov/data/id/POPESTcharagegroups2016 | reflect | 1 | 0.0058140 | 1.810109 | 0.0105239 |
| http://api.census.gov/data/id/POPESTpop2015 | based | 1 | 0.0074627 | 1.193922 | 0.0089099 |
| http://api.census.gov/data/id/ACSFlows2011 | metropolitan | 1 | 0.0128205 | 1.126264 | 0.0144393 |
| http://api.census.gov/data/id/CBP1995 | subtraction | 1 | 0.0128205 | 2.540996 | 0.0325769 |
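This table shows a random sample of ten rows from the tf-idf results (drawn with `sample_n(10)`), not the highest-scoring words. A word scores high on tf-idf when it is frequent within one description but appears in few of the others.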
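For readers who want to see the arithmetic behind the `tf`, `idf`, and `tf_idf` columns, here is a small base R sketch on hypothetical toy counts. It assumes the usual definitions, with tf as a word's share of its document's word count and idf as the natural log of the number of documents divided by the number of documents containing the word. The `idf` values in the table above are consistent with a natural log (for example, 1.791759 is ln 6).

```r
# Hypothetical toy counts: one row per (document id, word), like the
# output of dplyr::count(id, word).
counts <- data.frame(
  id   = c("d1", "d1", "d2", "d2", "d3"),
  word = c("population", "cohort", "population", "survey", "population"),
  n    = c(3, 1, 2, 2, 4),
  stringsAsFactors = FALSE
)

# tf: the word's share of all word occurrences in its document.
doc_total <- tapply(counts$n, counts$id, sum)
counts$tf <- counts$n / as.numeric(doc_total[counts$id])

# idf: natural log of (number of documents / documents containing the word).
n_docs <- length(unique(counts$id))
docs_with_word <- table(counts$word)
counts$idf <- log(n_docs / as.numeric(docs_with_word[counts$word]))

# tf-idf is the product of the two.
counts$tf_idf <- counts$tf * counts$idf
counts
```

In this toy example "population" appears in every document, so its idf and tf-idf are zero, while "cohort" appears in only one document and gets a positive score.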
First join the results of the tf-idf analysis with the keyword dataset.
library(topicmodels)
desc_tf_idf_keyword <- dplyr::full_join(
desc_tf_idf,
census_keyword, by = "id") %>%
dplyr::arrange(word)
desc_tf_idf_keyword %>%
dplyr::sample_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| id | word | n | tf | idf | tf_idf | keyword |
|---|---|---|---|---|---|---|
| https://api.census.gov/data/id/ACSDP1Y2011 | official | 2 | 0.0222222 | 2.3978953 | 0.0532866 | NA |
| http://api.census.gov/data/id/CBP1988 | affects | 1 | 0.0128205 | 2.5409961 | 0.0325769 | NA |
| https://api.census.gov/data/id/CBP2012 | benchmark | 1 | 0.0200000 | 4.1896547 | 0.0837931 | NA |
| http://api.census.gov/data/id/NONEMP2007 | percent | 1 | 0.0123457 | 3.0265039 | 0.0373642 | NA |
| http://api.census.gov/data/id/ZBPTotal2011 | zbp | 1 | 0.0357143 | 2.6635984 | 0.0951285 | NA |
| https://api.census.gov/data/id/ACSCP1Y2016 | covers | 1 | 0.0196078 | 1.9278916 | 0.0378018 | NA |
| https://api.census.gov/data/id/ACSCP5Y2017 | services | 1 | 0.0196078 | 1.6719583 | 0.0327835 | NA |
| http://api.census.gov/data/id/POPESTcomponents2015 | foreign | 2 | 0.0113636 | 3.6018681 | 0.0409303 | NA |
| http://api.census.gov/data/id/CBP1986 | suppressed | 2 | 0.0256410 | 2.4668881 | 0.0632535 | NA |
| http://api.census.gov/data/id/SBOCSCB12 | survey | 3 | 0.0681818 | 0.9315582 | 0.0635153 | NA |
Plot some of the most important words, as measured by tf-idf, for all of the provided keywords used on the Census Bureau datasets.
desc_tf_idf_keyword %>%
dplyr::filter(!dplyr::near(tf, 1)) %>%
dplyr::filter(keyword %in% c("Income", "Marital", "Poverty")) %>%
dplyr::arrange(dplyr::desc(tf_idf)) %>%
dplyr::group_by(keyword) %>%
dplyr::distinct(word, keyword, .keep_all = TRUE) %>%
dplyr::top_n(15, tf_idf) %>%
dplyr::ungroup() %>%
dplyr::mutate(word = base::factor(word, levels = base::rev(unique(word)))) %>%
ggplot2::ggplot(ggplot2::aes(word, tf_idf, fill = keyword)) +
ggplot2::geom_col(show.legend = FALSE) +
ggplot2::facet_wrap(~keyword, ncol = 3, scales = "free") +
ggplot2::coord_flip() +
ggplot2::labs(title = "Highest tf-idf words in Census metadata description fields",
caption = "Census metadata from https://api.census.gov/data.json",
x = NULL, y = "tf-idf")
Distribution of tf-idf for words from datasets labeled with selected keywords