Discovery API Text Analysis

The Socrata data platform hosts tens of thousands of government datasets. Governments large and small publish data on crime, permits, finance, healthcare, research, performance, and more for citizens to use. While this large corpus of government data is already accessible via opendatanetwork.com, this API is a powerful way to access and explore all public metadata published on the Socrata platform.

Using the Socrata Discovery API I’ll use metadata from these datasets to understand the connections between them.

The metadata includes information like the name of the dataset, a description field, tags for the dataset that have been assigned by a human being, and so forth. The metadata for all its public datasets is available online in JSON format.

In this report, I will analyze the Socrata Open Data Network metadata as a text dataset and perform text mining techniques using the R library tidytext. I will preform word co-occurrences and correlations, tf-idf, and topic modeling to explore the connections between the datasets. I will seek to find if datasets are related to one other and find clusters of similar datasets. Since the Socrata Open Data Network provides several text fields in the metadata, most importantly the name, description, and tag fields, I can show connections between the fields to better understand the connections between the Socrata Open Data Network API datasets.

How data is organized at the Socrata Open Data Network

Download the JSON file and take a look at the names of what is stored in the metadata.

library(jsonlite)

metadata <- jsonlite::fromJSON("https://api.us.socrata.com/api/catalog/v1?limit=10000")

# look at column names in dataset
base::names(metadata$results)

## [1] "resource"          "classification"    "metadata"          "permalink"        
## [5] "link"              "owner"             "preview_image_url" "published_copy"

The name, description, and for each dataset will be the features of interest.

# text fields of interest are within the 'resource' and 'classification' object
base::class(metadata$results$resource$name)

## [1] "character"

base::class(metadata$results$resource$description)

## [1] "character"

base::class(metadata$results$classification$categories) # domain_tags are an array (or list)

## [1] "list"

The name and description fields are stored as character vectors, and the domain_tags are stored as a list of character vectors.

Data preparation

## Create dataframes of title, description, and tags fields
library(tidyverse)

socrata_title <- dplyr::data_frame(
  id = metadata$results$resource$id,
  title = metadata$results$resource$name
)

# show title fields
socrata_title %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	title
tua8-2qwc	2010 General - Election Results for LD5
dmc2-d98i	Dusty
9g5a-r9zs	E-Government Service Portfolio - OSCIO
2vvn-pdyi	NYS Lottery Retailers
26fe-arqy	Employee Payroll (Searchable)
va6n-ui5f	Lista de EPSAGRO inscritas, acreditadas y habilitadas del Departamento de Boyacá 2017
45um-c62r	Rates of TBI-related Emergency Department Visits, Hospitalizations, and Deaths - United States, 2001 – 2010
nmvn-mx4x	Small Commercial Solar Incentive Levels and Available Capacity
mjrr-dybz	Planning Department Development Application
ujhw-8y22	Improving Care

socrata_desc <- dplyr::data_frame(
  id = metadata$results$resource$id,
  desc = metadata$results$resource$description
  ) %>% 
  dplyr::filter(!purrr::map_lgl(desc, is.null)) %>% # filter out null values
  dplyr::filter(!purrr::map_lgl(desc, is.na)) %>% # filter out NA values
  dplyr::filter(desc != "") # filter out blank values
  
  
# take a sample and show description fields
socrata_desc %>%
  dplyr::select(desc) %>%
  dplyr::sample_n(5) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

desc
The innovative financing and governance arrangements of Financial Intermediary Funds (FIFs) enable funds to be raised from multiple sources, including from sovereign and private sources. FIFs can receive contributions in the form of concessional loans and capital contribution in addition to traditional grant funds. In addition, FIFs can receive funds from innovative sources, such as proceeds from CER sales. This chart shows contributions to all FIFs provided by a country during a certain calendar year.
<div>This dataset shows whether each dataset on data.maryland.gov has been updated recently enough. For example, datasets containing weekly data should be updated at least every 7 days. Datasets containing monthly data should be updated at least every 31 days. This dataset also shows a overview of metadata from all data.maryland.gov datasets.<br></div>
New York City school level College Board SAT results for the graduating seniors of 2010. Records contain 2010 College-bound seniors mean SAT scores. Records with 5 or fewer students are suppressed (marked ‘s’). College-bound seniors are those students that complete the SAT Questionnaire when they register for the SAT and identify that they will graduate from high school in a specific year. For example, the 2010 college-bound seniors are those students that self-reported they would graduate in 2010. Students are not required to complete the SAT Questionnaire in order to register for the SAT. Students who do not indicate which year they will graduate from high school will not be included in any college-bound senior report. Students are linked to schools by identifying which school they attend when registering for a College Board exam. A student is only included in a school’s report if he/she self-reports being enrolled at that school. Data collected and processed by the College Board.
The Adult Care Facility Directory contains a listing of Adult Care Facilities that are currently licensed in New York State. New York State, through the New York State Department of Health, licenses and supervises adult care facilities which provide temporary or long-term, non-medical residential care services to adults who are substantially unable to live independently.
All BPD data on Open Baltimore is preliminary data and subject to change. The information presented through Open Baltimore represents Part I victim based crime data. The data do not represent statistics submitted to the FBI’s Uniform Crime Report (UCR); therefore any comparisons are strictly prohibited. For further clarification of UCR data, please visit http://www.fbi.gov/about-us/cjis/ucr/ucr. Please note that this data is preliminary and subject to change. Prior month data is likely to show changes when it is refreshed on a monthly basis. All data is geocoded to the approximate latitude/longitude location of the incident and excludes those records for which an address could not be geocoded. Any attempt to match the approximate location of the incident to an exact address is strictly prohibited.

# all datasets have a title and description but not all have domain_tags
socrata_tags <- dplyr::data_frame(
  id = metadata$results$resource$id,
  tag = metadata$results$classification$domain_tags
  ) %>% 
  dplyr::filter(!purrr::map_lgl(tag, is.null)) %>% # filter out null values
  tidyr::unnest(tag) # unlist the tag array and make the dataframe long

# show tags fields
socrata_tags %>%
  dplyr::sample_n(100) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	tag
m3s8-jyeu	quality of care
tft4-2tch	grades
eb4y-d4ic	osh
8hpj-294h	mechanical
skzk-tcq9	pmf
uedp-fegm	energy
xnhu-aczu	sign permit
wyja-ysz2	public works
kvbx-erfw	agency performance
9x8y-nux4	renewable
fr79-srce	community health and chronic disease
5zaw-hj3p	country level
3rgd-zjxx	water quality
5u9x-y4iz	relocation permit
abfj-y7uq	driver
pt2v-9a3h	handicap
u2u3-22pf	popayan
n7es-ur7c	ground-level ozone
h7rm-fz6m	benchmarking
5fn4-dr26	city government
cpkv-aajs	construction
8xjf-z55e	city services
8z4h-2ak6	employment
pknd-dutm	square
3cdm-p29e	doitt
a9we-mtpn	connectivity
ya2t-3mrd	payroll
q8xn-hhm3	beneficiaries
xgwu-c37w	noxious weeds
xbn8-g7iv	legislature
kahe-efs3	bpc
m5c8-fvpq	enforcement
6uza-cd7z	rescue
mfzt-js4n	gis
wxae-84gr	ifc
hzzz-eg3r	precincts
yd42-ttr4	mesa sectoriales
f2du-824e	cars
jwz2-cge4	rainfall
7xmz-2ur8	education
wn3p-qsan	volcanic plumes
mqvy-dtax	bronx
smb2-b252	police
tid3-g26p	permisos uso
547a-qp7h	acacías
c32y-ud8a	collaborating for clients
88g8-5mnd	employee
9gcg-vghr	unpaid
8wk5-pp5m	affordable care act
qvuk-dgpr	license
7xpx-5fte	accident
vh2s-8wb2	inpatient
6q2s-9pnn	new building
6vp6-wxuq	prevalence
i8h7-mn6v	business name
r4kn-u4a2	county data
h4hj-g7g5	allocations
qwux-prpy	clear
pknd-dutm	art
hfm8-take	donors
py8w-ca47	postcensal
5g2s-tnb7	sales and use tax
fymg-3wv3	childcare
w6qn-gx72	spin
vfe9-k7vc	bus
wd85-n28p	property tax
b6tj-gt35	banner
dwqk-w36f	postdiarrheal
h9gi-nx95	collisions
rgfe-8y2z	mayor’s office of long term planning and sustainability (oltps)
65xz-asgc	pesticides
djpb-rjn9	program evaluation
mpue-vn67	sparcs
yuq5-65xt	performance-based payment
32cx-q9bj	building new
jj7d-djgr	grantees
rycq-b8jg	odfw
b3ae-eupi	delis
j4zm-9kqu	disclosure
ukww-xkmj	dead animal removal
5rab-bshx	beneficiaries
uk7p-iqpz	snap
n5iq-rskv	bridge
jew6-k2tq	cms
n2dz-pwuk	labeling
g6gf-cj67	aid
t753-n3tv	country level
hfe6-4nr6	self-insured
y3c3-hqwu	elevator
7mgr-iety	red light cameras
d4iy-9uh7	community centers
yqdr-edy8	sheriff
6d4u-qybd	natural area
wj5k-n3n8	ifc
ps8r-e55x	pm2.5
xt4z-bnwh	map_layer
k5bm-9tcj	business name
h4zs-rypf	beneficiaries
5i2c-y2u6	parcel
msxg-ssda	ventilation

# Create dataframes of tokenized title, description, and tags fields
library(tidytext)

title_tokens <- socrata_title %>%
  tidytext::unnest_tokens(word, title) %>%
  dplyr::anti_join(stop_words, by = "word") # remove default stop words

desc_tokens <- socrata_desc %>%
  tidytext::unnest_tokens(word, desc) %>%
  dplyr::anti_join(stop_words, by = "word") # remove default stop words

# don't remove any stop words
# the thinking here is that these tags are selected by a user
# and have meaning to both the dataset and the user
# there is considerablly more thought placed on these tags than say,
# a title or description
# Put differently, these are not an arbitrary listing of tags 
# and thus, they should all remain
tag_tokens <- socrata_tags %>%
  tidytext::unnest_tokens(word, tag) # we're NOT removing any tags here

The name, description, and domain_tag datasets have been prepared and are now ready for exploration.

# remove those extra stop words from title and description
title_tokens_reduced <- title_tokens %>%
  dplyr::anti_join(extra_stopwords, by = "word")

desc_tokens_reduced <- desc_tokens %>%
  dplyr::anti_join(extra_stopwords, by = "word")

tag_tokens_reduced <- tag_tokens %>%
  dplyr::anti_join(dplyr::data_frame(word = c("and","of")), by = "word") # just remove 'and' and 'of'

title_tokens_reduced %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	word
q3z3-udcz	roofs
7h2m-3um5	shelter
a46m-xdxd	de
b5f4-szwq	deployment
mktr-gfzv	provider
ir4y-sesj	individual
h6vj-9z3w	inpatient
wwub-ubqq	staff
jjir-cuty	safety
m22x-wyj3	service

desc_tokens_reduced %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	word
5hq5-im7i	ma
inhb-jgj2	york
69ge-5wp8	result
trt6-xugg	including
ukyn-gdj4	borrower
uyk3-a9sx	refer
zjqd-uvky	chicago
6hp8-hzag	services
nk9j-j4v5	borrower
g9qy-h66j	mechanical

tag_tokens_reduced %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	word
xnhu-aczu	grading
fqjd-8bkh	building
cauq-8yn6	kml
vxub-6swi	comparison
9t3u-k43z	façade
j2sw-exum	king
55ef-pfz7	litter
mz3s-i8jq	cooperativa
w3eb-4mzd	tax
qky3-f2m4	raw

Initial simple exploration

What are the most common words in the Socrata Open Data Network dataset domain_tags?

#What are the most common tags?
tag_tokens_reduced %>%
  dplyr::group_by(word) %>%
  dplyr::count(sort = TRUE) %>%
  dplyr::filter(n > 250) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

word	n
health	784
level	771
permit	668
projects	522
country	487
safety	480
building	450
beneficiaries	371
quality	357
public	346
community	345
county	333
police	327
care	307
elections	271
services	270
disease	264
region	257
crime	254
results	253

What are the most common words in the Socrata Open Data Network dataset descriptions?

#What are the most common descriptive words?
desc_tokens_reduced %>%
  dplyr::group_by(word) %>%
  dplyr::count(sort = TRUE) %>%
  dplyr::filter(n > 750) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

word	n
information	3886
city	2067
de	1797
department	1750
bank	1680
health	1501
york	1430
public	1327
includes	1285
services	1222
county	1184
date	1155
include	1073
current	1064
police	1023
system	1016
based	983
list	979
world	979
service	941
care	930
reported	884
updated	881
file	861
program	848
report	757

What are the most common words in the Socrata Open Data Network dataset titles?

#What are the most common words in titles?
title_tokens_reduced %>%
  dplyr::group_by(word) %>%
  dplyr::count(sort = TRUE) %>%
  dplyr::filter(n > 150) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

word	n
map	705
de	564
beginning	515
county	421
city	265
public	253
summary	238
current	235
service	218
care	217
health	215
york	203
results	193
bronx	189
election	178
locations	176
budget	171
permits	164
financial	160
provider	160
active	158
funds	155
hospital	155
services	153

Word co-ocurrences and correlations

Here I examine which words commonly occur together in the titles, descriptions, and domain_tags of the Socrata Open Data Network datasets to create word networks that help determine which datasets are related to one other.

Networks of Description and Title Words

library(widyr)

title_word_pairs <- title_tokens_reduced %>%
  widyr::pairwise_count(word, id, sort = TRUE, upper = FALSE)

title_word_pairs %>%
  dplyr::arrange(-n) %>%
  dplyr::top_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

item1	item2	n
current	summary	134
financial	intermediary	128
financial	funds	128
intermediary	funds	128
election	results	107
ibrd	loans	90
ida	credits	88
de	del	83
funding	decisions	78
summary	ibrd	75
financial	funding	75
summary	ida	75
funding	intermediary	75
funding	funds	75
financial	decisions	75
intermediary	decisions	75
funds	decisions	75

These are the pairs of words that occur together most often in title fields.

desc_word_pairs <- desc_tokens_reduced %>%
  widyr::pairwise_count(word, id, sort = TRUE, upper = FALSE)

desc_word_pairs %>%
  dplyr::arrange(-n) %>%
  dplyr::top_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

item1	item2	n
information	includes	620
information	city	590
department	information	509
york	information	489
information	services	476
information	public	468
information	system	449
information	additional	434
information	health	416
bank	world	410

These are the pairs of words that occur together most often in description fields.

Below is a plot of networks of these co-occurring words to better see relationships.

library(ggplot2)
library(igraph)
library(ggraph)

# plot network of co-occuring words for 'title' field
set.seed(1234)
title_word_pairs %>%
  dplyr::filter(n >= 50) %>%
  igraph::graph_from_data_frame() %>%
  ggraph::ggraph(layout = "fr") +
  ggraph::geom_edge_link(
    ggplot2::aes(edge_alpha = n, edge_width = n),
    edge_colour = "steelblue"
  ) +
  ggraph::geom_node_point(size = 5) +
  ggraph::geom_node_text(
    ggplot2::aes(label = name), 
    repel = TRUE, 
    point.padding = unit(0.2, "lines")
  ) +
  ggplot2::theme_void()

Word network in the Socrata Open Data Network dataset titles

We see some clear clustering in this network of title words; words in the Socrata Open Data Network dataset titles are largely organized into several families of words that tend to go together.

Now I plot the same for the description fields.

# plot network of co-occuring words for 'description' field
set.seed(1234)
desc_word_pairs %>%
  dplyr::filter(n >= 260) %>%
  igraph::graph_from_data_frame() %>%
  ggraph::ggraph(layout = "fr") +
  ggraph::geom_edge_link(
    ggplot2::aes(edge_alpha = n, edge_width = n),
    edge_colour = "steelblue"
  ) +
  ggraph::geom_node_point(size = 5) +
  ggraph::geom_node_text(
    ggplot2::aes(label = name), 
    repel = TRUE, 
    point.padding = unit(0.2, "lines")
  ) +
  ggplot2::theme_void()

Word network in the Socrata Open Data Network dataset descriptions

There seem to be two large network clusters of words around two phrases: “information” and “world bank”.

Networks of Tags

tag_word_pairs <- tag_tokens_reduced %>%
  widyr::pairwise_count(word, id, sort = TRUE, upper = FALSE)

tag_word_pairs %>%
  dplyr::arrange(-n) %>%
  dplyr::top_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

item1	item2	n
level	country	487
beneficiaries	level	364
beneficiaries	country	364
projects	level	283
projects	country	283
projects	beneficiaries	256
region	projects	243
region	beneficiaries	243
region	level	243
region	country	243

These are the pairs of words that occur together most often in domain_tags fields.

# find the correlations among tags
tag_word_corr <- tag_tokens_reduced %>%
  dplyr::group_by(word) %>%
  dplyr::filter(n() >= 100) %>%
  widyr::pairwise_cor(word, id, sort = TRUE, upper = FALSE)

tag_word_corr %>%
  dplyr::arrange(-correlation) %>%
  dplyr::top_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

item1	item2	correlation
level	country	0.9729475
chronic	disease	0.9137723
ida	credits	0.8843341
ibrd	loans	0.8772401
pv	solar	0.8743425
procurement	contracts	0.8675812
beneficiaries	country	0.8454930
grants	credits	0.8264519
beneficiaries	level	0.8225372
ida	grants	0.7868242

These are the most correlated pairs of words in domain_tags fields.

Now I plot the same for the domain_tags fields.

# plot network of co-occuring words for 'doamin_tags' field
set.seed(1234)
tag_word_pairs %>%
  dplyr::filter(n >= 100) %>%
  igraph::graph_from_data_frame() %>%
  ggraph::ggraph(layout = "fr") +
  ggraph::geom_edge_link(
    ggplot2::aes(edge_alpha = n, edge_width = n),
    edge_colour = "steelblue"
  ) +
  ggraph::geom_node_point(size = 5) +
  ggraph::geom_node_text(
    ggplot2::aes(label = name), 
    repel = TRUE, 
    point.padding = unit(0.2, "lines")
  ) +
  ggplot2::theme_void()

Word network in the Socrata Open Data Network dataset domain tags

Here we have another sparse, tightly-grouped network of terms that seem to logically relate to one another. For instance, “elections” and “results” have a strong connection, which leads us to conclude that many of the datasets contain only election results. Whereas “country”, “beneficiary”, and “level” are three significant nodes within a looser network relationship, suggesting that World Bank and CMS datasets – while never overlapping – otherwise overwhelm the corpus of datasets on the Discovery API.

As an exercise, I plot the network of correlated words as well.

# plot network of correlated words for 'doamin_tags' field
set.seed(1234)
tag_word_corr %>%
  dplyr::filter(correlation > 0.6) %>%
  igraph::graph_from_data_frame() %>%
  ggraph::ggraph(layout = "fr") +
  ggraph::geom_edge_link(
    ggplot2::aes(edge_alpha = correlation, edge_width = correlation),
    edge_colour = "steelblue"
  ) +
  ggraph::geom_node_point(size = 5) +
  ggraph::geom_node_text(
    ggplot2::aes(label = name), 
    repel = TRUE, 
    point.padding = unit(0.2, "lines")
  ) +
  ggplot2::theme_void()

Word network in the Socrata Open Data Network dataset domain tags

Calculating tf-idf for the description fields

Getting the tf-idf for the description field words

What are the highest tf-idf words in the Socrata Open Data Network description fields?

library(topicmodels)

desc_tf_idf <- desc_tokens_reduced %>% 
  dplyr::count(id, word, sort = TRUE) %>%
  dplyr::ungroup() %>%
  tidytext::bind_tf_idf(word, id, n) %>%
  dplyr::arrange(-tf_idf)

desc_tf_idf %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	word	n	tf	idf	tf_idf
i2wz-mrjg	privacy	1	0.0666667	4.097082	0.2731388
t2qc-9pjd	congestion	9	0.0652174	7.041521	0.4592296
85ca-t3if	freeway	1	0.0063694	7.175052	0.0457010
sj6t-9cju	recreation	2	0.0800000	4.844296	0.3875437
tv4x-8thz	program.the	1	0.0476190	9.120963	0.4343316
78az-bt2s	detailed	1	0.0099010	3.413852	0.0338005
ebtp-548e	phases	1	0.0121951	7.041521	0.0858722
fmw5-2cbs	alcaldes	1	0.0833333	7.511525	0.6259604
t8pq-wu86	esri	1	0.0769231	4.777157	0.3674736
wpfr-zwvu	ensure	1	0.0056180	4.714243	0.0264845

These are the most important words in the description fields as measured by tf-idf, meaning they are common but not too common.

Connecting description fields to keywords

First join the results of the tf-idf analysis with the tags dataset.

library(topicmodels)

# join tags on description tf-idf
desc_tf_idf_tags <- dplyr::full_join(
  desc_tf_idf, 
  socrata_tags , by = "id") %>%
  dplyr::arrange(word)

desc_tf_idf_tags %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	word	n	tf	idf	tf_idf	tag
akmt-4qtj	annually	1	0.0144928	3.433987	0.0497679	reproductive health
qxh8-f4bd	a.m	2	0.0222222	6.923738	0.1538608	health
wpfr-zwvu	foods	1	0.0056180	5.594602	0.0314303	violations
hc4z-b2p5	cooling	2	0.0215054	4.844296	0.1041784	energy efficiency
qmci-7h73	consequences	1	0.0151515	3.858272	0.0584587	procurement
6ez4-x9ke	report	1	0.0105263	2.820177	0.0296861	income
9ix3-ryt6	delayed	1	0.0073529	5.229142	0.0384496	wonder
x79g-2vkx	obligation	2	0.0555556	4.110327	0.2283515	ida
dmn7-mpa8	programs	3	0.0206897	3.564135	0.0737407	community health and chronic disease
x4kp-ag8p	range	1	0.0156250	4.358789	0.0681061	structural retrofit

Plot some of the most important words, as measured by tf-idf, for all of the provided keywords used on the Socrata Open Data Network datasets.

desc_tf_idf_tags %>% 
  dplyr::filter(!near(tf, 1)) %>%
  dplyr::filter(tag %in% c("health","safety","food","fire","rescue","closures")) %>%
  dplyr::arrange(dplyr::desc(tf_idf)) %>%
  dplyr::group_by(tag) %>%
  dplyr::distinct(word, tag, .keep_all = TRUE) %>%
  dplyr::top_n(15, tf_idf) %>% 
  dplyr::ungroup() %>%
  dplyr::mutate(word = base::factor(word, levels = base::rev(unique(word)))) %>%
  ggplot2::ggplot(ggplot2::aes(word, tf_idf, fill = tag)) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::facet_wrap(~tag, ncol = 3, scales = "free") +
  ggplot2::coord_flip() +
  ggplot2::labs(title = "Highest tf-idf words in Socrata metadata description fields",
       caption = "Socrata metadata from https://api.us.socrata.com/api/catalog/v1",
       x = NULL, y = "tf-idf")

Distribution of tf-idf for words from datasets labeled with selected keywords

Uncovering hidden conversations

Topic modeling attempts to uncover the hidden conversations within each description field. Latent Dirichlet allocation (LDA) is a technique to model each document (description field) as a mixture of topics and further describe each topic as a mixture of words

desc_word_counts <- desc_tokens_reduced %>%
  dplyr::count(id, word, sort = TRUE) %>%
  dplyr::ungroup()

desc_word_counts %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

id	word	n
ucrd-5yjt	catedrales	1
frty-7vhw	subcontractors	1
xhn2-yaeu	individual	2
83hz-w2t5	approved	2
vwf2-pf38	consequences	1
dbq9-t7nz	targets	1
5diq-ds9y	page	1
shvr-uu6u	determine	1
478k-mpdy	risk	1
p3xw-8bp4	presenta	1

Casting to a document-term matrix

Create a sparse document term matrix, containing the count of terms in each document.

desc_dtm <- desc_word_counts %>%
  tidytext::cast_dtm(id, word, n)

desc_dtm

## <<DocumentTermMatrix (documents: 9145, terms: 16035)>>
## Non-/sparse entries: 226240/146413835
## Sparsity           : 100%
## Maximal term length: 80
## Weighting          : term frequency (tf)

Ready for topic modeling

The following creates an LDA model. Like many clustering algorithms, the number of topics must be set a priori. Here I set the number of topics to 8.

library(topicmodels)
# run an LDA on the description words
desc_lda <- topicmodels::LDA(desc_dtm, k = 8, control = base::list(seed = 1234))
desc_lda

## A LDA_VEM topic model with 8 topics.

Interpreting the topic model

The following takes the lda model and constructs a tidy data frame that summarizes the results.

# interpret the results
tidy_lda <- tidytext::tidy(desc_lda)

tidy_lda %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

topic	term	beta
6	knock	0.00e+00
3	claremont	0.00e+00
4	douge	0.00e+00
6	273.5	0.00e+00
6	lender	0.00e+00
2	minería	0.00e+00
1	institutes	2.18e-05
6	lucas	0.00e+00
3	westbound	1.74e-05
8	rats	0.00e+00

The column \(\beta\) shows the probability of that term being generated from that topic for that document. It is the probability of that term (word) belonging to that topic.

The following examines the top 6 terms for each topic.

top_lda_tags <- tidy_lda %>%
  dplyr::group_by(topic) %>%
  dplyr::top_n(10, beta) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(topic, -beta)

top_lda_tags %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

topic	term	beta
1	medicare	0.0095178
8	chicago	0.0114303
5	world	0.0390233
8	records	0.0080730
1	system	0.0081732
2	file	0.0105068
1	information	0.0182373
6	del	0.0139492
8	location	0.0070459
5	ibrd	0.0264524

Here are the results of the top_lda_tags exercise depicted visually:

top_lda_tags %>%
  dplyr::mutate(term =  stats::reorder(term, beta)) %>%
  dplyr::group_by(topic, term) %>%    
  dplyr::arrange(dplyr::desc(beta)) %>%  
  dplyr::ungroup() %>%
  dplyr::mutate(term = base::factor(base::paste(term, topic, sep = "__"), 
                       levels = base::rev(base::paste(term, topic, sep = "__")))) %>%
  ggplot2::ggplot(ggplot2::aes(term, beta, fill = base::as.factor(topic))) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::coord_flip() +
  ggplot2::scale_x_discrete(labels = function(x) base::gsub("__.+$", "", x)) +
  ggplot2::labs(
    title = "Top 10 terms in each LDA topic",
    x = NULL, y = base::expression(beta)) +
  ggplot2::facet_wrap(~ topic, ncol = 4, scales = "free")

Top terms in topic modeling of Discovery API metadata description field texts

The most frequently occuring terms in each of the topics tend to surround the concepts of food inspections, health quality indicators, World Bank datasets, and building permits.

The following examines topics that are associated with which description fields (i.e., documents). The probability, \(\gamma\), is the probability that each document belongs in each topic.

# examine which topics are associated with which description fields
lda_gamma <- tidytext::tidy(desc_lda, matrix = "gamma")

lda_gamma %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

document	topic	gamma
9pq7-febn	3	0.0133010
f3h8-mnxi	3	0.2468961
ygqr-578p	4	0.0102484
5kmn-6qd9	5	0.0015747
2ph4-cewj	7	0.7310144
fki5-gc4g	1	0.0039318
vw6y-z8j6	8	0.9575135
gmb4-77ea	6	0.0018333
36rn-h6qa	3	0.0060695
iu8a-46gu	1	0.0008067

The variable, \(\gamma\), has values that run from 0 to 1. Documents with values near zero means that those documents do not belong in each topic. Values with \(\gamma\) values close to 1 indicate that these documents do belong in those topics.

This distribution below shows that most documents either do belong or do not belong to a given topic.

ggplot2::ggplot(lda_gamma, ggplot2::aes(gamma)) +
  ggplot2::geom_histogram(bins = 48) +
  ggplot2::scale_y_log10() +
  ggplot2::labs(title = "Distribution of probabilities for all topics",
       y = "Number of documents", x = base::expression(gamma))

Probability distribution in topic modeling of Discovery API metadata description field texts

The following plot shows how the probabilities are distributed within each topic:

ggplot2::ggplot(lda_gamma, ggplot2::aes(gamma, fill = base::as.factor(topic))) +
  ggplot2::geom_histogram(bins = 16, show.legend = FALSE) +
  ggplot2::facet_wrap(~ topic, ncol = 4) +
  ggplot2::scale_y_log10() +
  ggplot2::labs(title = "Distribution of probability for each topic",
       y = "Number of documents", x = base::expression(gamma))

Probability distribution for each topic in topic modeling of Discovery API metadata description field texts

Connecting topic modeling with keywords

The following connects topic models with the domain tags to see what relationships exist between the human-tagged terms and the modeled topics.

# join tags on description LDA
desc_lda_tags <- dplyr::full_join(
  lda_gamma, 
  socrata_tags, by = c("document" = "id"))

desc_lda_tags %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

document	topic	gamma	tag
5rxc-uczg	2	0.0015747	parks
b3b6-bu6t	6	0.0017610	births-deaths-other facts
5fqt-hs7d	7	0.0053432	concesion
gukj-e8fh	8	0.0070242	NA
i26j-ai4z	2	0.0036137	3-1-1
72xs-3mfz	5	0.0009047	opda
5yzi-ei9i	4	0.0010067	medicaid
keg4-3bc2	5	0.0007161	personal vehicles
dayf-zcvp	7	0.0015747	chicago park district
y3c3-hqwu	3	0.0008215	grading

I keep the document-topic entries that have probabilities (\(\gamma\)) greater than 0.9.

top_lda_gamma_tags <- desc_lda_tags %>% 
  dplyr::filter(!purrr::map_lgl(tag, is.null)) %>% # filter out null values
  dplyr::filter(!purrr::map_lgl(tag, is.na)) %>% # filter out NA values
  dplyr::filter(tag != "") %>% # filter out blank values
  dplyr::filter(gamma > 0.9) %>% 
  dplyr::count(topic, tag, sort = TRUE)

top_lda_gamma_tags %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")

topic	tag	n
3	panama	2
7	explorer	2
3	online	1
6	cataño	1
7	members	28
1	smokefree indoor air	3
8	sfd mobile	3
3	accessories	1
6	pitaya	2
6	residuos	1

The following plot shows the top tags for each topic.

top_lda_gamma_tags %>%
  dplyr::group_by(topic) %>%
  dplyr::top_n(5, n) %>%
  dplyr::group_by(topic, tag) %>%
  dplyr::arrange(desc(n)) %>%  
  dplyr::ungroup() %>%
  dplyr::mutate(
    tag = base::factor(
      base::paste(
        tag, topic, sep = "__"
      ),
      levels = base::rev(
        base::paste(
          tag, topic, sep = "__"
        )
      )
    )
  ) %>%
  ggplot2::ggplot(ggplot2::aes(tag, n, fill = base::as.factor(topic))) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::labs(
    title = "Top tags for each LDA topic",
    x = NULL, y = "Number of documents"
  )+
  ggplot2::coord_flip() +
  ggplot2::scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
  ggplot2::facet_wrap(~ topic, ncol = 4, scales = "free")

Top tags in topic modeling of Discovery API metadata description field texts

Discovery API datasets are tagged with keywords by human beings and the LDA topic model used in this analysis answers the question, “For the datasets with description fields that have a high probability of belonging to a given topic, what are the most common human-assigned keywords?”

Summary

The Socraata Discovery API houses a diverse set of tables that shows, through a combination of network analysis, tf-idf, and topic modeling, that many are connected to one other despite the many institutions publishing datasets.