The Socrata data platform hosts tens of thousands of government datasets. Governments large and small publish data on crime, permits, finance, healthcare, research, performance, and more for citizens to use. While this large corpus of government data is already accessible via opendatanetwork.com, this API is a powerful way to access and explore all public metadata published on the Socrata platform.

Using the Socrata Discovery API I’ll use metadata from these datasets to understand the connections between them.

The metadata includes information like the name of the dataset, a description field, tags for the dataset that have been assigned by a human being, and so forth. The metadata for all its public datasets is available online in JSON format.

In this report, I will analyze the Socrata Open Data Network metadata as a text dataset and perform text mining techniques using the R library tidytext. I will preform word co-occurrences and correlations, tf-idf, and topic modeling to explore the connections between the datasets. I will seek to find if datasets are related to one other and find clusters of similar datasets. Since the Socrata Open Data Network provides several text fields in the metadata, most importantly the name, description, and tag fields, I can show connections between the fields to better understand the connections between the Socrata Open Data Network API datasets.

How data is organized at the Socrata Open Data Network

Download the JSON file and take a look at the names of what is stored in the metadata.

library(jsonlite)

metadata <- jsonlite::fromJSON("https://api.us.socrata.com/api/catalog/v1?limit=10000")

# look at column names in dataset
base::names(metadata$results)
## [1] "resource"          "classification"    "metadata"          "permalink"        
## [5] "link"              "owner"             "preview_image_url" "published_copy"

The name, description, and for each dataset will be the features of interest.

# text fields of interest are within the 'resource' and 'classification' object
base::class(metadata$results$resource$name)
## [1] "character"
base::class(metadata$results$resource$description)
## [1] "character"
base::class(metadata$results$classification$categories) # domain_tags are an array (or list)
## [1] "list"

The name and description fields are stored as character vectors, and the domain_tags are stored as a list of character vectors.

Data preparation

## Create dataframes of title, description, and tags fields
library(tidyverse)

socrata_title <- dplyr::data_frame(
  id = metadata$results$resource$id,
  title = metadata$results$resource$name
)

# show title fields
socrata_title %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
id title
tua8-2qwc 2010 General - Election Results for LD5
dmc2-d98i Dusty
9g5a-r9zs E-Government Service Portfolio - OSCIO
2vvn-pdyi NYS Lottery Retailers
26fe-arqy Employee Payroll (Searchable)
va6n-ui5f Lista de EPSAGRO inscritas, acreditadas y habilitadas del Departamento de Boyacá 2017
45um-c62r Rates of TBI-related Emergency Department Visits, Hospitalizations, and Deaths - United States, 2001 – 2010
nmvn-mx4x Small Commercial Solar Incentive Levels and Available Capacity
mjrr-dybz Planning Department Development Application
ujhw-8y22 Improving Care
socrata_desc <- dplyr::data_frame(
  id = metadata$results$resource$id,
  desc = metadata$results$resource$description
  ) %>% 
  dplyr::filter(!purrr::map_lgl(desc, is.null)) %>% # filter out null values
  dplyr::filter(!purrr::map_lgl(desc, is.na)) %>% # filter out NA values
  dplyr::filter(desc != "") # filter out blank values
  
  
# take a sample and show description fields
socrata_desc %>%
  dplyr::select(desc) %>%
  dplyr::sample_n(5) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
desc
The innovative financing and governance arrangements of Financial Intermediary Funds (FIFs) enable funds to be raised from multiple sources, including from sovereign and private sources. FIFs can receive contributions in the form of concessional loans and capital contribution in addition to traditional grant funds. In addition, FIFs can receive funds from innovative sources, such as proceeds from CER sales. This chart shows contributions to all FIFs provided by a country during a certain calendar year.
<div>This dataset shows whether each dataset on data.maryland.gov has been updated recently enough. For example, datasets containing weekly data should be updated at least every 7 days. Datasets containing monthly data should be updated at least every 31 days. This dataset also shows a overview of metadata from all data.maryland.gov datasets.<br></div>

New York City school level College Board SAT results for the graduating seniors of 2010. Records contain 2010 College-bound seniors mean SAT scores.

Records with 5 or fewer students are suppressed (marked ‘s’).

College-bound seniors are those students that complete the SAT Questionnaire when they register for the SAT and identify that they will graduate from high school in a specific year. For example, the 2010 college-bound seniors are those students that self-reported they would graduate in 2010. Students are not required to complete the SAT Questionnaire in order to register for the SAT. Students who do not indicate which year they will graduate from high school will not be included in any college-bound senior report.

Students are linked to schools by identifying which school they attend when registering for a College Board exam. A student is only included in a school’s report if he/she self-reports being enrolled at that school.

Data collected and processed by the College Board.
The Adult Care Facility Directory contains a listing of Adult Care Facilities that are currently licensed in New York State. New York State, through the New York State Department of Health, licenses and supervises adult care facilities which provide temporary or long-term, non-medical residential care services to adults who are substantially unable to live independently.
All BPD data on Open Baltimore is preliminary data and subject to change. The information presented through Open Baltimore represents Part I victim based crime data. The data do not represent statistics submitted to the FBI’s Uniform Crime Report (UCR); therefore any comparisons are strictly prohibited. For further clarification of UCR data, please visit http://www.fbi.gov/about-us/cjis/ucr/ucr. Please note that this data is preliminary and subject to change. Prior month data is likely to show changes when it is refreshed on a monthly basis. All data is geocoded to the approximate latitude/longitude location of the incident and excludes those records for which an address could not be geocoded. Any attempt to match the approximate location of the incident to an exact address is strictly prohibited.
# all datasets have a title and description but not all have domain_tags
socrata_tags <- dplyr::data_frame(
  id = metadata$results$resource$id,
  tag = metadata$results$classification$domain_tags
  ) %>% 
  dplyr::filter(!purrr::map_lgl(tag, is.null)) %>% # filter out null values
  tidyr::unnest(tag) # unlist the tag array and make the dataframe long

# show tags fields
socrata_tags %>%
  dplyr::sample_n(100) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
id tag
m3s8-jyeu quality of care
tft4-2tch grades
eb4y-d4ic osh
8hpj-294h mechanical
skzk-tcq9 pmf
uedp-fegm energy
xnhu-aczu sign permit
wyja-ysz2 public works
kvbx-erfw agency performance
9x8y-nux4 renewable
fr79-srce community health and chronic disease
5zaw-hj3p country level
3rgd-zjxx water quality
5u9x-y4iz relocation permit
abfj-y7uq driver
pt2v-9a3h handicap
u2u3-22pf popayan
n7es-ur7c ground-level ozone
h7rm-fz6m benchmarking
5fn4-dr26 city government
cpkv-aajs construction
8xjf-z55e city services
8z4h-2ak6 employment
pknd-dutm square
3cdm-p29e doitt
a9we-mtpn connectivity
ya2t-3mrd payroll
q8xn-hhm3 beneficiaries
xgwu-c37w noxious weeds
xbn8-g7iv legislature
kahe-efs3 bpc
m5c8-fvpq enforcement
6uza-cd7z rescue
mfzt-js4n gis
wxae-84gr ifc
hzzz-eg3r precincts
yd42-ttr4 mesa sectoriales
f2du-824e cars
jwz2-cge4 rainfall
7xmz-2ur8 education
wn3p-qsan volcanic plumes
mqvy-dtax bronx
smb2-b252 police
tid3-g26p permisos uso
547a-qp7h acacías
c32y-ud8a collaborating for clients
88g8-5mnd employee
9gcg-vghr unpaid
8wk5-pp5m affordable care act
qvuk-dgpr license
7xpx-5fte accident
vh2s-8wb2 inpatient
6q2s-9pnn new building
6vp6-wxuq prevalence
i8h7-mn6v business name
r4kn-u4a2 county data
h4hj-g7g5 allocations
qwux-prpy clear
pknd-dutm art
hfm8-take donors
py8w-ca47 postcensal
5g2s-tnb7 sales and use tax
fymg-3wv3 childcare
w6qn-gx72 spin
vfe9-k7vc bus
wd85-n28p property tax
b6tj-gt35 banner
dwqk-w36f postdiarrheal
h9gi-nx95 collisions
rgfe-8y2z mayor’s office of long term planning and sustainability (oltps)
65xz-asgc pesticides
djpb-rjn9 program evaluation
mpue-vn67 sparcs
yuq5-65xt performance-based payment
32cx-q9bj building new
jj7d-djgr grantees
rycq-b8jg odfw
b3ae-eupi delis
j4zm-9kqu disclosure
ukww-xkmj dead animal removal
5rab-bshx beneficiaries
uk7p-iqpz snap
n5iq-rskv bridge
jew6-k2tq cms
n2dz-pwuk labeling
g6gf-cj67 aid
t753-n3tv country level
hfe6-4nr6 self-insured
y3c3-hqwu elevator
7mgr-iety red light cameras
d4iy-9uh7 community centers
yqdr-edy8 sheriff
6d4u-qybd natural area
wj5k-n3n8 ifc
ps8r-e55x pm2.5
xt4z-bnwh map_layer
k5bm-9tcj business name
h4zs-rypf beneficiaries
5i2c-y2u6 parcel
msxg-ssda ventilation
# Create dataframes of tokenized title, description, and tags fields
library(tidytext)

title_tokens <- socrata_title %>%
  tidytext::unnest_tokens(word, title) %>%
  dplyr::anti_join(stop_words, by = "word") # remove default stop words

desc_tokens <- socrata_desc %>%
  tidytext::unnest_tokens(word, desc) %>%
  dplyr::anti_join(stop_words, by = "word") # remove default stop words

# don't remove any stop words
# the thinking here is that these tags are selected by a user
# and have meaning to both the dataset and the user
# there is considerablly more thought placed on these tags than say,
# a title or description
# Put differently, these are not an arbitrary listing of tags 
# and thus, they should all remain
tag_tokens <- socrata_tags %>%
  tidytext::unnest_tokens(word, tag) # we're NOT removing any tags here

The name, description, and domain_tag datasets have been prepared and are now ready for exploration.

# remove those extra stop words from title and description
title_tokens_reduced <- title_tokens %>%
  dplyr::anti_join(extra_stopwords, by = "word")

desc_tokens_reduced <- desc_tokens %>%
  dplyr::anti_join(extra_stopwords, by = "word")

tag_tokens_reduced <- tag_tokens %>%
  dplyr::anti_join(dplyr::data_frame(word = c("and","of")), by = "word") # just remove 'and' and 'of'

title_tokens_reduced %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
id word
q3z3-udcz roofs
7h2m-3um5 shelter
a46m-xdxd de
b5f4-szwq deployment
mktr-gfzv provider
ir4y-sesj individual
h6vj-9z3w inpatient
wwub-ubqq staff
jjir-cuty safety
m22x-wyj3 service
desc_tokens_reduced %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
id word
5hq5-im7i ma
inhb-jgj2 york
69ge-5wp8 result
trt6-xugg including
ukyn-gdj4 borrower
uyk3-a9sx refer
zjqd-uvky chicago
6hp8-hzag services
nk9j-j4v5 borrower
g9qy-h66j mechanical
tag_tokens_reduced %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
id word
xnhu-aczu grading
fqjd-8bkh building
cauq-8yn6 kml
vxub-6swi comparison
9t3u-k43z façade
j2sw-exum king
55ef-pfz7 litter
mz3s-i8jq cooperativa
w3eb-4mzd tax
qky3-f2m4 raw

Initial simple exploration

What are the most common words in the Socrata Open Data Network dataset domain_tags?

#What are the most common tags?
tag_tokens_reduced %>%
  dplyr::group_by(word) %>%
  dplyr::count(sort = TRUE) %>%
  dplyr::filter(n > 250) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
word n
health 784
level 771
permit 668
projects 522
country 487
safety 480
building 450
beneficiaries 371
quality 357
public 346
community 345
county 333
police 327
care 307
elections 271
services 270
disease 264
region 257
crime 254
results 253

What are the most common words in the Socrata Open Data Network dataset descriptions?

#What are the most common descriptive words?
desc_tokens_reduced %>%
  dplyr::group_by(word) %>%
  dplyr::count(sort = TRUE) %>%
  dplyr::filter(n > 750) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
word n
information 3886
city 2067
de 1797
department 1750
bank 1680
health 1501
york 1430
public 1327
includes 1285
services 1222
county 1184
date 1155
include 1073
current 1064
police 1023
system 1016
based 983
list 979
world 979
service 941
care 930
reported 884
updated 881
file 861
program 848
report 757

What are the most common words in the Socrata Open Data Network dataset titles?

#What are the most common words in titles?
title_tokens_reduced %>%
  dplyr::group_by(word) %>%
  dplyr::count(sort = TRUE) %>%
  dplyr::filter(n > 150) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
word n
map 705
de 564
beginning 515
county 421
city 265
public 253
summary 238
current 235
service 218
care 217
health 215
york 203
results 193
bronx 189
election 178
locations 176
budget 171
permits 164
financial 160
provider 160
active 158
funds 155
hospital 155
services 153

Word co-ocurrences and correlations

Here I examine which words commonly occur together in the titles, descriptions, and domain_tags of the Socrata Open Data Network datasets to create word networks that help determine which datasets are related to one other.

Networks of Description and Title Words

library(widyr)

title_word_pairs <- title_tokens_reduced %>%
  widyr::pairwise_count(word, id, sort = TRUE, upper = FALSE)

title_word_pairs %>%
  dplyr::arrange(-n) %>%
  dplyr::top_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
item1 item2 n
current summary 134
financial intermediary 128
financial funds 128
intermediary funds 128
election results 107
ibrd loans 90
ida credits 88
de del 83
funding decisions 78
summary ibrd 75
financial funding 75
summary ida 75
funding intermediary 75
funding funds 75
financial decisions 75
intermediary decisions 75
funds decisions 75

These are the pairs of words that occur together most often in title fields.

desc_word_pairs <- desc_tokens_reduced %>%
  widyr::pairwise_count(word, id, sort = TRUE, upper = FALSE)

desc_word_pairs %>%
  dplyr::arrange(-n) %>%
  dplyr::top_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
item1 item2 n
information includes 620
information city 590
department information 509
york information 489
information services 476
information public 468
information system 449
information additional 434
information health 416
bank world 410

These are the pairs of words that occur together most often in description fields.

Below is a plot of networks of these co-occurring words to better see relationships.

library(ggplot2)
library(igraph)
library(ggraph)

# plot network of co-occuring words for 'title' field
set.seed(1234)
title_word_pairs %>%
  dplyr::filter(n >= 50) %>%
  igraph::graph_from_data_frame() %>%
  ggraph::ggraph(layout = "fr") +
  ggraph::geom_edge_link(
    ggplot2::aes(edge_alpha = n, edge_width = n),
    edge_colour = "steelblue"
  ) +
  ggraph::geom_node_point(size = 5) +
  ggraph::geom_node_text(
    ggplot2::aes(label = name), 
    repel = TRUE, 
    point.padding = unit(0.2, "lines")
  ) +
  ggplot2::theme_void()
Word network in the Socrata Open Data Network dataset titles

Word network in the Socrata Open Data Network dataset titles

We see some clear clustering in this network of title words; words in the Socrata Open Data Network dataset titles are largely organized into several families of words that tend to go together.

Now I plot the same for the description fields.

# plot network of co-occuring words for 'description' field
set.seed(1234)
desc_word_pairs %>%
  dplyr::filter(n >= 260) %>%
  igraph::graph_from_data_frame() %>%
  ggraph::ggraph(layout = "fr") +
  ggraph::geom_edge_link(
    ggplot2::aes(edge_alpha = n, edge_width = n),
    edge_colour = "steelblue"
  ) +
  ggraph::geom_node_point(size = 5) +
  ggraph::geom_node_text(
    ggplot2::aes(label = name), 
    repel = TRUE, 
    point.padding = unit(0.2, "lines")
  ) +
  ggplot2::theme_void()
Word network in the Socrata Open Data Network dataset descriptions

Word network in the Socrata Open Data Network dataset descriptions

There seem to be two large network clusters of words around two phrases: “information” and “world bank”.

Networks of Tags

tag_word_pairs <- tag_tokens_reduced %>%
  widyr::pairwise_count(word, id, sort = TRUE, upper = FALSE)

tag_word_pairs %>%
  dplyr::arrange(-n) %>%
  dplyr::top_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
item1 item2 n
level country 487
beneficiaries level 364
beneficiaries country 364
projects level 283
projects country 283
projects beneficiaries 256
region projects 243
region beneficiaries 243
region level 243
region country 243

These are the pairs of words that occur together most often in domain_tags fields.

# find the correlations among tags
tag_word_corr <- tag_tokens_reduced %>%
  dplyr::group_by(word) %>%
  dplyr::filter(n() >= 100) %>%
  widyr::pairwise_cor(word, id, sort = TRUE, upper = FALSE)

tag_word_corr %>%
  dplyr::arrange(-correlation) %>%
  dplyr::top_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
item1 item2 correlation
level country 0.9729475
chronic disease 0.9137723
ida credits 0.8843341
ibrd loans 0.8772401
pv solar 0.8743425
procurement contracts 0.8675812
beneficiaries country 0.8454930
grants credits 0.8264519
beneficiaries level 0.8225372
ida grants 0.7868242

These are the most correlated pairs of words in domain_tags fields.

Now I plot the same for the domain_tags fields.

# plot network of co-occuring words for 'doamin_tags' field
set.seed(1234)
tag_word_pairs %>%
  dplyr::filter(n >= 100) %>%
  igraph::graph_from_data_frame() %>%
  ggraph::ggraph(layout = "fr") +
  ggraph::geom_edge_link(
    ggplot2::aes(edge_alpha = n, edge_width = n),
    edge_colour = "steelblue"
  ) +
  ggraph::geom_node_point(size = 5) +
  ggraph::geom_node_text(
    ggplot2::aes(label = name), 
    repel = TRUE, 
    point.padding = unit(0.2, "lines")
  ) +
  ggplot2::theme_void()
Word network in the Socrata Open Data Network dataset domain tags

Word network in the Socrata Open Data Network dataset domain tags

Here we have another sparse, tightly-grouped network of terms that seem to logically relate to one another. For instance, “elections” and “results” have a strong connection, which leads us to conclude that many of the datasets contain only election results. Whereas “country”, “beneficiary”, and “level” are three significant nodes within a looser network relationship, suggesting that World Bank and CMS datasets – while never overlapping – otherwise overwhelm the corpus of datasets on the Discovery API.

As an exercise, I plot the network of correlated words as well.

# plot network of correlated words for 'doamin_tags' field
set.seed(1234)
tag_word_corr %>%
  dplyr::filter(correlation > 0.6) %>%
  igraph::graph_from_data_frame() %>%
  ggraph::ggraph(layout = "fr") +
  ggraph::geom_edge_link(
    ggplot2::aes(edge_alpha = correlation, edge_width = correlation),
    edge_colour = "steelblue"
  ) +
  ggraph::geom_node_point(size = 5) +
  ggraph::geom_node_text(
    ggplot2::aes(label = name), 
    repel = TRUE, 
    point.padding = unit(0.2, "lines")
  ) +
  ggplot2::theme_void()
Word network in the Socrata Open Data Network dataset domain tags

Word network in the Socrata Open Data Network dataset domain tags

Calculating tf-idf for the description fields

Getting the tf-idf for the description field words

What are the highest tf-idf words in the Socrata Open Data Network description fields?

library(topicmodels)

desc_tf_idf <- desc_tokens_reduced %>% 
  dplyr::count(id, word, sort = TRUE) %>%
  dplyr::ungroup() %>%
  tidytext::bind_tf_idf(word, id, n) %>%
  dplyr::arrange(-tf_idf)

desc_tf_idf %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
id word n tf idf tf_idf
i2wz-mrjg privacy 1 0.0666667 4.097082 0.2731388
t2qc-9pjd congestion 9 0.0652174 7.041521 0.4592296
85ca-t3if freeway 1 0.0063694 7.175052 0.0457010
sj6t-9cju recreation 2 0.0800000 4.844296 0.3875437
tv4x-8thz program.the 1 0.0476190 9.120963 0.4343316
78az-bt2s detailed 1 0.0099010 3.413852 0.0338005
ebtp-548e phases 1 0.0121951 7.041521 0.0858722
fmw5-2cbs alcaldes 1 0.0833333 7.511525 0.6259604
t8pq-wu86 esri 1 0.0769231 4.777157 0.3674736
wpfr-zwvu ensure 1 0.0056180 4.714243 0.0264845

These are the most important words in the description fields as measured by tf-idf, meaning they are common but not too common.

Connecting description fields to keywords

First join the results of the tf-idf analysis with the tags dataset.

library(topicmodels)

# join tags on description tf-idf
desc_tf_idf_tags <- dplyr::full_join(
  desc_tf_idf, 
  socrata_tags , by = "id") %>%
  dplyr::arrange(word)

desc_tf_idf_tags %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
id word n tf idf tf_idf tag
akmt-4qtj annually 1 0.0144928 3.433987 0.0497679 reproductive health
qxh8-f4bd a.m 2 0.0222222 6.923738 0.1538608 health
wpfr-zwvu foods 1 0.0056180 5.594602 0.0314303 violations
hc4z-b2p5 cooling 2 0.0215054 4.844296 0.1041784 energy efficiency
qmci-7h73 consequences 1 0.0151515 3.858272 0.0584587 procurement
6ez4-x9ke report 1 0.0105263 2.820177 0.0296861 income
9ix3-ryt6 delayed 1 0.0073529 5.229142 0.0384496 wonder
x79g-2vkx obligation 2 0.0555556 4.110327 0.2283515 ida
dmn7-mpa8 programs 3 0.0206897 3.564135 0.0737407 community health and chronic disease
x4kp-ag8p range 1 0.0156250 4.358789 0.0681061 structural retrofit

Plot some of the most important words, as measured by tf-idf, for all of the provided keywords used on the Socrata Open Data Network datasets.

desc_tf_idf_tags %>% 
  dplyr::filter(!near(tf, 1)) %>%
  dplyr::filter(tag %in% c("health","safety","food","fire","rescue","closures")) %>%
  dplyr::arrange(dplyr::desc(tf_idf)) %>%
  dplyr::group_by(tag) %>%
  dplyr::distinct(word, tag, .keep_all = TRUE) %>%
  dplyr::top_n(15, tf_idf) %>% 
  dplyr::ungroup() %>%
  dplyr::mutate(word = base::factor(word, levels = base::rev(unique(word)))) %>%
  ggplot2::ggplot(ggplot2::aes(word, tf_idf, fill = tag)) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::facet_wrap(~tag, ncol = 3, scales = "free") +
  ggplot2::coord_flip() +
  ggplot2::labs(title = "Highest tf-idf words in Socrata metadata description fields",
       caption = "Socrata metadata from https://api.us.socrata.com/api/catalog/v1",
       x = NULL, y = "tf-idf")
Distribution of tf-idf for words from datasets labeled with selected keywords

Distribution of tf-idf for words from datasets labeled with selected keywords

Uncovering hidden conversations

Topic modeling attempts to uncover the hidden conversations within each description field. Latent Dirichlet allocation (LDA) is a technique to model each document (description field) as a mixture of topics and further describe each topic as a mixture of words

desc_word_counts <- desc_tokens_reduced %>%
  dplyr::count(id, word, sort = TRUE) %>%
  dplyr::ungroup()

desc_word_counts %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
id word n
ucrd-5yjt catedrales 1
frty-7vhw subcontractors 1
xhn2-yaeu individual 2
83hz-w2t5 approved 2
vwf2-pf38 consequences 1
dbq9-t7nz targets 1
5diq-ds9y page 1
shvr-uu6u determine 1
478k-mpdy risk 1
p3xw-8bp4 presenta 1

Casting to a document-term matrix

Create a sparse document term matrix, containing the count of terms in each document.

desc_dtm <- desc_word_counts %>%
  tidytext::cast_dtm(id, word, n)

desc_dtm
## <<DocumentTermMatrix (documents: 9145, terms: 16035)>>
## Non-/sparse entries: 226240/146413835
## Sparsity           : 100%
## Maximal term length: 80
## Weighting          : term frequency (tf)

Ready for topic modeling

The following creates an LDA model. Like many clustering algorithms, the number of topics must be set a priori. Here I set the number of topics to 8.

library(topicmodels)
# run an LDA on the description words
desc_lda <- topicmodels::LDA(desc_dtm, k = 8, control = base::list(seed = 1234))
desc_lda
## A LDA_VEM topic model with 8 topics.

Interpreting the topic model

The following takes the lda model and constructs a tidy data frame that summarizes the results.

# interpret the results
tidy_lda <- tidytext::tidy(desc_lda)

tidy_lda %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
topic term beta
6 knock 0.00e+00
3 claremont 0.00e+00
4 douge 0.00e+00
6 273.5 0.00e+00
6 lender 0.00e+00
2 minería 0.00e+00
1 institutes 2.18e-05
6 lucas 0.00e+00
3 westbound 1.74e-05
8 rats 0.00e+00

The column \(\beta\) shows the probability of that term being generated from that topic for that document. It is the probability of that term (word) belonging to that topic.

The following examines the top 6 terms for each topic.

top_lda_tags <- tidy_lda %>%
  dplyr::group_by(topic) %>%
  dplyr::top_n(10, beta) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(topic, -beta)

top_lda_tags %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
topic term beta
1 medicare 0.0095178
8 chicago 0.0114303
5 world 0.0390233
8 records 0.0080730
1 system 0.0081732
2 file 0.0105068
1 information 0.0182373
6 del 0.0139492
8 location 0.0070459
5 ibrd 0.0264524

Here are the results of the top_lda_tags exercise depicted visually:

top_lda_tags %>%
  dplyr::mutate(term =  stats::reorder(term, beta)) %>%
  dplyr::group_by(topic, term) %>%    
  dplyr::arrange(dplyr::desc(beta)) %>%  
  dplyr::ungroup() %>%
  dplyr::mutate(term = base::factor(base::paste(term, topic, sep = "__"), 
                       levels = base::rev(base::paste(term, topic, sep = "__")))) %>%
  ggplot2::ggplot(ggplot2::aes(term, beta, fill = base::as.factor(topic))) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::coord_flip() +
  ggplot2::scale_x_discrete(labels = function(x) base::gsub("__.+$", "", x)) +
  ggplot2::labs(
    title = "Top 10 terms in each LDA topic",
    x = NULL, y = base::expression(beta)) +
  ggplot2::facet_wrap(~ topic, ncol = 4, scales = "free")
Top terms in topic modeling of Discovery API metadata description field texts

Top terms in topic modeling of Discovery API metadata description field texts

The most frequently occuring terms in each of the topics tend to surround the concepts of food inspections, health quality indicators, World Bank datasets, and building permits.

The following examines topics that are associated with which description fields (i.e., documents). The probability, \(\gamma\), is the probability that each document belongs in each topic.

# examine which topics are associated with which description fields
lda_gamma <- tidytext::tidy(desc_lda, matrix = "gamma")

lda_gamma %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
document topic gamma
9pq7-febn 3 0.0133010
f3h8-mnxi 3 0.2468961
ygqr-578p 4 0.0102484
5kmn-6qd9 5 0.0015747
2ph4-cewj 7 0.7310144
fki5-gc4g 1 0.0039318
vw6y-z8j6 8 0.9575135
gmb4-77ea 6 0.0018333
36rn-h6qa 3 0.0060695
iu8a-46gu 1 0.0008067

The variable, \(\gamma\), has values that run from 0 to 1. Documents with values near zero means that those documents do not belong in each topic. Values with \(\gamma\) values close to 1 indicate that these documents do belong in those topics.

This distribution below shows that most documents either do belong or do not belong to a given topic.

ggplot2::ggplot(lda_gamma, ggplot2::aes(gamma)) +
  ggplot2::geom_histogram(bins = 48) +
  ggplot2::scale_y_log10() +
  ggplot2::labs(title = "Distribution of probabilities for all topics",
       y = "Number of documents", x = base::expression(gamma))
Probability distribution in topic modeling of Discovery API metadata description field texts

Probability distribution in topic modeling of Discovery API metadata description field texts

The following plot shows how the probabilities are distributed within each topic:

ggplot2::ggplot(lda_gamma, ggplot2::aes(gamma, fill = base::as.factor(topic))) +
  ggplot2::geom_histogram(bins = 16, show.legend = FALSE) +
  ggplot2::facet_wrap(~ topic, ncol = 4) +
  ggplot2::scale_y_log10() +
  ggplot2::labs(title = "Distribution of probability for each topic",
       y = "Number of documents", x = base::expression(gamma))
Probability distribution for each topic in topic modeling of Discovery API metadata description field texts

Probability distribution for each topic in topic modeling of Discovery API metadata description field texts

Connecting topic modeling with keywords

The following connects topic models with the domain tags to see what relationships exist between the human-tagged terms and the modeled topics.

# join tags on description LDA
desc_lda_tags <- dplyr::full_join(
  lda_gamma, 
  socrata_tags, by = c("document" = "id"))

desc_lda_tags %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
document topic gamma tag
5rxc-uczg 2 0.0015747 parks
b3b6-bu6t 6 0.0017610 births-deaths-other facts
5fqt-hs7d 7 0.0053432 concesion
gukj-e8fh 8 0.0070242 NA
i26j-ai4z 2 0.0036137 3-1-1
72xs-3mfz 5 0.0009047 opda
5yzi-ei9i 4 0.0010067 medicaid
keg4-3bc2 5 0.0007161 personal vehicles
dayf-zcvp 7 0.0015747 chicago park district
y3c3-hqwu 3 0.0008215 grading

I keep the document-topic entries that have probabilities (\(\gamma\)) greater than 0.9.

top_lda_gamma_tags <- desc_lda_tags %>% 
  dplyr::filter(!purrr::map_lgl(tag, is.null)) %>% # filter out null values
  dplyr::filter(!purrr::map_lgl(tag, is.na)) %>% # filter out NA values
  dplyr::filter(tag != "") %>% # filter out blank values
  dplyr::filter(gamma > 0.9) %>% 
  dplyr::count(topic, tag, sort = TRUE)

top_lda_gamma_tags %>%
  dplyr::sample_n(10) %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  kableExtra::scroll_box(height = "300px")
topic tag n
3 panama 2
7 explorer 2
3 online 1
6 cataño 1
7 members 28
1 smokefree indoor air 3
8 sfd mobile 3
3 accessories 1
6 pitaya 2
6 residuos 1

The following plot shows the top tags for each topic.

top_lda_gamma_tags %>%
  dplyr::group_by(topic) %>%
  dplyr::top_n(5, n) %>%
  dplyr::group_by(topic, tag) %>%
  dplyr::arrange(desc(n)) %>%  
  dplyr::ungroup() %>%
  dplyr::mutate(
    tag = base::factor(
      base::paste(
        tag, topic, sep = "__"
      ),
      levels = base::rev(
        base::paste(
          tag, topic, sep = "__"
        )
      )
    )
  ) %>%
  ggplot2::ggplot(ggplot2::aes(tag, n, fill = base::as.factor(topic))) +
  ggplot2::geom_col(show.legend = FALSE) +
  ggplot2::labs(
    title = "Top tags for each LDA topic",
    x = NULL, y = "Number of documents"
  )+
  ggplot2::coord_flip() +
  ggplot2::scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
  ggplot2::facet_wrap(~ topic, ncol = 4, scales = "free")
Top tags in topic modeling of Discovery API metadata description field texts

Top tags in topic modeling of Discovery API metadata description field texts

Discovery API datasets are tagged with keywords by human beings and the LDA topic model used in this analysis answers the question, “For the datasets with description fields that have a high probability of belonging to a given topic, what are the most common human-assigned keywords?”

Summary

The Socraata Discovery API houses a diverse set of tables that shows, through a combination of network analysis, tf-idf, and topic modeling, that many are connected to one other despite the many institutions publishing datasets.