The Socrata data platform hosts tens of thousands of government datasets. Governments large and small publish data on crime, permits, finance, healthcare, research, performance, and more for citizens to use. While this large corpus of government data is already accessible via opendatanetwork.com, this API is a powerful way to access and explore all public metadata published on the Socrata platform.
Using the Socrata Discovery API I’ll use metadata from these datasets to understand the connections between them.
The metadata includes information like the name of the dataset, a description field, tags for the dataset that have been assigned by a human being, and so forth. The metadata for all its public datasets is available online in JSON format.
In this report, I will analyze the Socrata Open Data Network metadata as a text dataset and perform text mining techniques using the R library tidytext. I will preform word co-occurrences and correlations, tf-idf, and topic modeling to explore the connections between the datasets. I will seek to find if datasets are related to one other and find clusters of similar datasets. Since the Socrata Open Data Network provides several text fields in the metadata, most importantly the name, description, and tag fields, I can show connections between the fields to better understand the connections between the Socrata Open Data Network API datasets.
Download the JSON file and take a look at the names of what is stored in the metadata.
library(jsonlite)
metadata <- jsonlite::fromJSON("https://api.us.socrata.com/api/catalog/v1?limit=10000")
# look at column names in dataset
base::names(metadata$results)
## [1] "resource" "classification" "metadata" "permalink"
## [5] "link" "owner" "preview_image_url" "published_copy"
The name, description, and for each dataset will be the features of interest.
# text fields of interest are within the 'resource' and 'classification' object
base::class(metadata$results$resource$name)
## [1] "character"
base::class(metadata$results$resource$description)
## [1] "character"
base::class(metadata$results$classification$categories) # domain_tags are an array (or list)
## [1] "list"
The name and description fields are stored as character vectors, and the domain_tags are stored as a list of character vectors.
## Create dataframes of title, description, and tags fields
library(tidyverse)
socrata_title <- dplyr::data_frame(
id = metadata$results$resource$id,
title = metadata$results$resource$name
)
# show title fields
socrata_title %>%
dplyr::sample_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| id | title |
|---|---|
| tua8-2qwc | 2010 General - Election Results for LD5 |
| dmc2-d98i | Dusty |
| 9g5a-r9zs | E-Government Service Portfolio - OSCIO |
| 2vvn-pdyi | NYS Lottery Retailers |
| 26fe-arqy | Employee Payroll (Searchable) |
| va6n-ui5f | Lista de EPSAGRO inscritas, acreditadas y habilitadas del Departamento de Boyacá 2017 |
| 45um-c62r | Rates of TBI-related Emergency Department Visits, Hospitalizations, and Deaths - United States, 2001 – 2010 |
| nmvn-mx4x | Small Commercial Solar Incentive Levels and Available Capacity |
| mjrr-dybz | Planning Department Development Application |
| ujhw-8y22 | Improving Care |
socrata_desc <- dplyr::data_frame(
id = metadata$results$resource$id,
desc = metadata$results$resource$description
) %>%
dplyr::filter(!purrr::map_lgl(desc, is.null)) %>% # filter out null values
dplyr::filter(!purrr::map_lgl(desc, is.na)) %>% # filter out NA values
dplyr::filter(desc != "") # filter out blank values
# take a sample and show description fields
socrata_desc %>%
dplyr::select(desc) %>%
dplyr::sample_n(5) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| desc |
|---|
| The innovative financing and governance arrangements of Financial Intermediary Funds (FIFs) enable funds to be raised from multiple sources, including from sovereign and private sources. FIFs can receive contributions in the form of concessional loans and capital contribution in addition to traditional grant funds. In addition, FIFs can receive funds from innovative sources, such as proceeds from CER sales. This chart shows contributions to all FIFs provided by a country during a certain calendar year. |
| <div>This dataset shows whether each dataset on data.maryland.gov has been updated recently enough. For example, datasets containing weekly data should be updated at least every 7 days. Datasets containing monthly data should be updated at least every 31 days. This dataset also shows a overview of metadata from all data.maryland.gov datasets.<br></div> |
|
New York City school level College Board SAT results for the graduating seniors of 2010. Records contain 2010 College-bound seniors mean SAT scores. Records with 5 or fewer students are suppressed (marked ‘s’). College-bound seniors are those students that complete the SAT Questionnaire when they register for the SAT and identify that they will graduate from high school in a specific year. For example, the 2010 college-bound seniors are those students that self-reported they would graduate in 2010. Students are not required to complete the SAT Questionnaire in order to register for the SAT. Students who do not indicate which year they will graduate from high school will not be included in any college-bound senior report. Students are linked to schools by identifying which school they attend when registering for a College Board exam. A student is only included in a school’s report if he/she self-reports being enrolled at that school. Data collected and processed by the College Board. |
| The Adult Care Facility Directory contains a listing of Adult Care Facilities that are currently licensed in New York State. New York State, through the New York State Department of Health, licenses and supervises adult care facilities which provide temporary or long-term, non-medical residential care services to adults who are substantially unable to live independently. |
| All BPD data on Open Baltimore is preliminary data and subject to change. The information presented through Open Baltimore represents Part I victim based crime data. The data do not represent statistics submitted to the FBI’s Uniform Crime Report (UCR); therefore any comparisons are strictly prohibited. For further clarification of UCR data, please visit http://www.fbi.gov/about-us/cjis/ucr/ucr. Please note that this data is preliminary and subject to change. Prior month data is likely to show changes when it is refreshed on a monthly basis. All data is geocoded to the approximate latitude/longitude location of the incident and excludes those records for which an address could not be geocoded. Any attempt to match the approximate location of the incident to an exact address is strictly prohibited. |
# all datasets have a title and description but not all have domain_tags
socrata_tags <- dplyr::data_frame(
id = metadata$results$resource$id,
tag = metadata$results$classification$domain_tags
) %>%
dplyr::filter(!purrr::map_lgl(tag, is.null)) %>% # filter out null values
tidyr::unnest(tag) # unlist the tag array and make the dataframe long
# show tags fields
socrata_tags %>%
dplyr::sample_n(100) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| id | tag |
|---|---|
| m3s8-jyeu | quality of care |
| tft4-2tch | grades |
| eb4y-d4ic | osh |
| 8hpj-294h | mechanical |
| skzk-tcq9 | pmf |
| uedp-fegm | energy |
| xnhu-aczu | sign permit |
| wyja-ysz2 | public works |
| kvbx-erfw | agency performance |
| 9x8y-nux4 | renewable |
| fr79-srce | community health and chronic disease |
| 5zaw-hj3p | country level |
| 3rgd-zjxx | water quality |
| 5u9x-y4iz | relocation permit |
| abfj-y7uq | driver |
| pt2v-9a3h | handicap |
| u2u3-22pf | popayan |
| n7es-ur7c | ground-level ozone |
| h7rm-fz6m | benchmarking |
| 5fn4-dr26 | city government |
| cpkv-aajs | construction |
| 8xjf-z55e | city services |
| 8z4h-2ak6 | employment |
| pknd-dutm | square |
| 3cdm-p29e | doitt |
| a9we-mtpn | connectivity |
| ya2t-3mrd | payroll |
| q8xn-hhm3 | beneficiaries |
| xgwu-c37w | noxious weeds |
| xbn8-g7iv | legislature |
| kahe-efs3 | bpc |
| m5c8-fvpq | enforcement |
| 6uza-cd7z | rescue |
| mfzt-js4n | gis |
| wxae-84gr | ifc |
| hzzz-eg3r | precincts |
| yd42-ttr4 | mesa sectoriales |
| f2du-824e | cars |
| jwz2-cge4 | rainfall |
| 7xmz-2ur8 | education |
| wn3p-qsan | volcanic plumes |
| mqvy-dtax | bronx |
| smb2-b252 | police |
| tid3-g26p | permisos uso |
| 547a-qp7h | acacías |
| c32y-ud8a | collaborating for clients |
| 88g8-5mnd | employee |
| 9gcg-vghr | unpaid |
| 8wk5-pp5m | affordable care act |
| qvuk-dgpr | license |
| 7xpx-5fte | accident |
| vh2s-8wb2 | inpatient |
| 6q2s-9pnn | new building |
| 6vp6-wxuq | prevalence |
| i8h7-mn6v | business name |
| r4kn-u4a2 | county data |
| h4hj-g7g5 | allocations |
| qwux-prpy | clear |
| pknd-dutm | art |
| hfm8-take | donors |
| py8w-ca47 | postcensal |
| 5g2s-tnb7 | sales and use tax |
| fymg-3wv3 | childcare |
| w6qn-gx72 | spin |
| vfe9-k7vc | bus |
| wd85-n28p | property tax |
| b6tj-gt35 | banner |
| dwqk-w36f | postdiarrheal |
| h9gi-nx95 | collisions |
| rgfe-8y2z | mayor’s office of long term planning and sustainability (oltps) |
| 65xz-asgc | pesticides |
| djpb-rjn9 | program evaluation |
| mpue-vn67 | sparcs |
| yuq5-65xt | performance-based payment |
| 32cx-q9bj | building new |
| jj7d-djgr | grantees |
| rycq-b8jg | odfw |
| b3ae-eupi | delis |
| j4zm-9kqu | disclosure |
| ukww-xkmj | dead animal removal |
| 5rab-bshx | beneficiaries |
| uk7p-iqpz | snap |
| n5iq-rskv | bridge |
| jew6-k2tq | cms |
| n2dz-pwuk | labeling |
| g6gf-cj67 | aid |
| t753-n3tv | country level |
| hfe6-4nr6 | self-insured |
| y3c3-hqwu | elevator |
| 7mgr-iety | red light cameras |
| d4iy-9uh7 | community centers |
| yqdr-edy8 | sheriff |
| 6d4u-qybd | natural area |
| wj5k-n3n8 | ifc |
| ps8r-e55x | pm2.5 |
| xt4z-bnwh | map_layer |
| k5bm-9tcj | business name |
| h4zs-rypf | beneficiaries |
| 5i2c-y2u6 | parcel |
| msxg-ssda | ventilation |
# Create dataframes of tokenized title, description, and tags fields
library(tidytext)
title_tokens <- socrata_title %>%
tidytext::unnest_tokens(word, title) %>%
dplyr::anti_join(stop_words, by = "word") # remove default stop words
desc_tokens <- socrata_desc %>%
tidytext::unnest_tokens(word, desc) %>%
dplyr::anti_join(stop_words, by = "word") # remove default stop words
# don't remove any stop words
# the thinking here is that these tags are selected by a user
# and have meaning to both the dataset and the user
# there is considerablly more thought placed on these tags than say,
# a title or description
# Put differently, these are not an arbitrary listing of tags
# and thus, they should all remain
tag_tokens <- socrata_tags %>%
tidytext::unnest_tokens(word, tag) # we're NOT removing any tags here
The name, description, and domain_tag datasets have been prepared and are now ready for exploration.
# remove those extra stop words from title and description
title_tokens_reduced <- title_tokens %>%
dplyr::anti_join(extra_stopwords, by = "word")
desc_tokens_reduced <- desc_tokens %>%
dplyr::anti_join(extra_stopwords, by = "word")
tag_tokens_reduced <- tag_tokens %>%
dplyr::anti_join(dplyr::data_frame(word = c("and","of")), by = "word") # just remove 'and' and 'of'
title_tokens_reduced %>%
dplyr::sample_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| id | word |
|---|---|
| q3z3-udcz | roofs |
| 7h2m-3um5 | shelter |
| a46m-xdxd | de |
| b5f4-szwq | deployment |
| mktr-gfzv | provider |
| ir4y-sesj | individual |
| h6vj-9z3w | inpatient |
| wwub-ubqq | staff |
| jjir-cuty | safety |
| m22x-wyj3 | service |
desc_tokens_reduced %>%
dplyr::sample_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| id | word |
|---|---|
| 5hq5-im7i | ma |
| inhb-jgj2 | york |
| 69ge-5wp8 | result |
| trt6-xugg | including |
| ukyn-gdj4 | borrower |
| uyk3-a9sx | refer |
| zjqd-uvky | chicago |
| 6hp8-hzag | services |
| nk9j-j4v5 | borrower |
| g9qy-h66j | mechanical |
tag_tokens_reduced %>%
dplyr::sample_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| id | word |
|---|---|
| xnhu-aczu | grading |
| fqjd-8bkh | building |
| cauq-8yn6 | kml |
| vxub-6swi | comparison |
| 9t3u-k43z | façade |
| j2sw-exum | king |
| 55ef-pfz7 | litter |
| mz3s-i8jq | cooperativa |
| w3eb-4mzd | tax |
| qky3-f2m4 | raw |
What are the most common words in the Socrata Open Data Network dataset domain_tags?
#What are the most common tags?
tag_tokens_reduced %>%
dplyr::group_by(word) %>%
dplyr::count(sort = TRUE) %>%
dplyr::filter(n > 250) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| word | n |
|---|---|
| health | 784 |
| level | 771 |
| permit | 668 |
| projects | 522 |
| country | 487 |
| safety | 480 |
| building | 450 |
| beneficiaries | 371 |
| quality | 357 |
| public | 346 |
| community | 345 |
| county | 333 |
| police | 327 |
| care | 307 |
| elections | 271 |
| services | 270 |
| disease | 264 |
| region | 257 |
| crime | 254 |
| results | 253 |
What are the most common words in the Socrata Open Data Network dataset descriptions?
#What are the most common descriptive words?
desc_tokens_reduced %>%
dplyr::group_by(word) %>%
dplyr::count(sort = TRUE) %>%
dplyr::filter(n > 750) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| word | n |
|---|---|
| information | 3886 |
| city | 2067 |
| de | 1797 |
| department | 1750 |
| bank | 1680 |
| health | 1501 |
| york | 1430 |
| public | 1327 |
| includes | 1285 |
| services | 1222 |
| county | 1184 |
| date | 1155 |
| include | 1073 |
| current | 1064 |
| police | 1023 |
| system | 1016 |
| based | 983 |
| list | 979 |
| world | 979 |
| service | 941 |
| care | 930 |
| reported | 884 |
| updated | 881 |
| file | 861 |
| program | 848 |
| report | 757 |
What are the most common words in the Socrata Open Data Network dataset titles?
#What are the most common words in titles?
title_tokens_reduced %>%
dplyr::group_by(word) %>%
dplyr::count(sort = TRUE) %>%
dplyr::filter(n > 150) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| word | n |
|---|---|
| map | 705 |
| de | 564 |
| beginning | 515 |
| county | 421 |
| city | 265 |
| public | 253 |
| summary | 238 |
| current | 235 |
| service | 218 |
| care | 217 |
| health | 215 |
| york | 203 |
| results | 193 |
| bronx | 189 |
| election | 178 |
| locations | 176 |
| budget | 171 |
| permits | 164 |
| financial | 160 |
| provider | 160 |
| active | 158 |
| funds | 155 |
| hospital | 155 |
| services | 153 |
Here I examine which words commonly occur together in the titles, descriptions, and domain_tags of the Socrata Open Data Network datasets to create word networks that help determine which datasets are related to one other.
library(widyr)
title_word_pairs <- title_tokens_reduced %>%
widyr::pairwise_count(word, id, sort = TRUE, upper = FALSE)
title_word_pairs %>%
dplyr::arrange(-n) %>%
dplyr::top_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| item1 | item2 | n |
|---|---|---|
| current | summary | 134 |
| financial | intermediary | 128 |
| financial | funds | 128 |
| intermediary | funds | 128 |
| election | results | 107 |
| ibrd | loans | 90 |
| ida | credits | 88 |
| de | del | 83 |
| funding | decisions | 78 |
| summary | ibrd | 75 |
| financial | funding | 75 |
| summary | ida | 75 |
| funding | intermediary | 75 |
| funding | funds | 75 |
| financial | decisions | 75 |
| intermediary | decisions | 75 |
| funds | decisions | 75 |
These are the pairs of words that occur together most often in title fields.
desc_word_pairs <- desc_tokens_reduced %>%
widyr::pairwise_count(word, id, sort = TRUE, upper = FALSE)
desc_word_pairs %>%
dplyr::arrange(-n) %>%
dplyr::top_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| item1 | item2 | n |
|---|---|---|
| information | includes | 620 |
| information | city | 590 |
| department | information | 509 |
| york | information | 489 |
| information | services | 476 |
| information | public | 468 |
| information | system | 449 |
| information | additional | 434 |
| information | health | 416 |
| bank | world | 410 |
These are the pairs of words that occur together most often in description fields.
Below is a plot of networks of these co-occurring words to better see relationships.
library(ggplot2)
library(igraph)
library(ggraph)
# plot network of co-occuring words for 'title' field
set.seed(1234)
title_word_pairs %>%
dplyr::filter(n >= 50) %>%
igraph::graph_from_data_frame() %>%
ggraph::ggraph(layout = "fr") +
ggraph::geom_edge_link(
ggplot2::aes(edge_alpha = n, edge_width = n),
edge_colour = "steelblue"
) +
ggraph::geom_node_point(size = 5) +
ggraph::geom_node_text(
ggplot2::aes(label = name),
repel = TRUE,
point.padding = unit(0.2, "lines")
) +
ggplot2::theme_void()
Word network in the Socrata Open Data Network dataset titles
We see some clear clustering in this network of title words; words in the Socrata Open Data Network dataset titles are largely organized into several families of words that tend to go together.
Now I plot the same for the description fields.
# plot network of co-occuring words for 'description' field
set.seed(1234)
desc_word_pairs %>%
dplyr::filter(n >= 260) %>%
igraph::graph_from_data_frame() %>%
ggraph::ggraph(layout = "fr") +
ggraph::geom_edge_link(
ggplot2::aes(edge_alpha = n, edge_width = n),
edge_colour = "steelblue"
) +
ggraph::geom_node_point(size = 5) +
ggraph::geom_node_text(
ggplot2::aes(label = name),
repel = TRUE,
point.padding = unit(0.2, "lines")
) +
ggplot2::theme_void()
Word network in the Socrata Open Data Network dataset descriptions
There seem to be two large network clusters of words around two phrases: “information” and “world bank”.
What are the highest tf-idf words in the Socrata Open Data Network description fields?
library(topicmodels)
desc_tf_idf <- desc_tokens_reduced %>%
dplyr::count(id, word, sort = TRUE) %>%
dplyr::ungroup() %>%
tidytext::bind_tf_idf(word, id, n) %>%
dplyr::arrange(-tf_idf)
desc_tf_idf %>%
dplyr::sample_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| id | word | n | tf | idf | tf_idf |
|---|---|---|---|---|---|
| i2wz-mrjg | privacy | 1 | 0.0666667 | 4.097082 | 0.2731388 |
| t2qc-9pjd | congestion | 9 | 0.0652174 | 7.041521 | 0.4592296 |
| 85ca-t3if | freeway | 1 | 0.0063694 | 7.175052 | 0.0457010 |
| sj6t-9cju | recreation | 2 | 0.0800000 | 4.844296 | 0.3875437 |
| tv4x-8thz | program.the | 1 | 0.0476190 | 9.120963 | 0.4343316 |
| 78az-bt2s | detailed | 1 | 0.0099010 | 3.413852 | 0.0338005 |
| ebtp-548e | phases | 1 | 0.0121951 | 7.041521 | 0.0858722 |
| fmw5-2cbs | alcaldes | 1 | 0.0833333 | 7.511525 | 0.6259604 |
| t8pq-wu86 | esri | 1 | 0.0769231 | 4.777157 | 0.3674736 |
| wpfr-zwvu | ensure | 1 | 0.0056180 | 4.714243 | 0.0264845 |
These are the most important words in the description fields as measured by tf-idf, meaning they are common but not too common.
First join the results of the tf-idf analysis with the tags dataset.
library(topicmodels)
# join tags on description tf-idf
desc_tf_idf_tags <- dplyr::full_join(
desc_tf_idf,
socrata_tags , by = "id") %>%
dplyr::arrange(word)
desc_tf_idf_tags %>%
dplyr::sample_n(10) %>%
knitr::kable() %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
kableExtra::scroll_box(height = "300px")
| id | word | n | tf | idf | tf_idf | tag |
|---|---|---|---|---|---|---|
| akmt-4qtj | annually | 1 | 0.0144928 | 3.433987 | 0.0497679 | reproductive health |
| qxh8-f4bd | a.m | 2 | 0.0222222 | 6.923738 | 0.1538608 | health |
| wpfr-zwvu | foods | 1 | 0.0056180 | 5.594602 | 0.0314303 | violations |
| hc4z-b2p5 | cooling | 2 | 0.0215054 | 4.844296 | 0.1041784 | energy efficiency |
| qmci-7h73 | consequences | 1 | 0.0151515 | 3.858272 | 0.0584587 | procurement |
| 6ez4-x9ke | report | 1 | 0.0105263 | 2.820177 | 0.0296861 | income |
| 9ix3-ryt6 | delayed | 1 | 0.0073529 | 5.229142 | 0.0384496 | wonder |
| x79g-2vkx | obligation | 2 | 0.0555556 | 4.110327 | 0.2283515 | ida |
| dmn7-mpa8 | programs | 3 | 0.0206897 | 3.564135 | 0.0737407 | community health and chronic disease |
| x4kp-ag8p | range | 1 | 0.0156250 | 4.358789 | 0.0681061 | structural retrofit |
Plot some of the most important words, as measured by tf-idf, for all of the provided keywords used on the Socrata Open Data Network datasets.
desc_tf_idf_tags %>%
dplyr::filter(!near(tf, 1)) %>%
dplyr::filter(tag %in% c("health","safety","food","fire","rescue","closures")) %>%
dplyr::arrange(dplyr::desc(tf_idf)) %>%
dplyr::group_by(tag) %>%
dplyr::distinct(word, tag, .keep_all = TRUE) %>%
dplyr::top_n(15, tf_idf) %>%
dplyr::ungroup() %>%
dplyr::mutate(word = base::factor(word, levels = base::rev(unique(word)))) %>%
ggplot2::ggplot(ggplot2::aes(word, tf_idf, fill = tag)) +
ggplot2::geom_col(show.legend = FALSE) +
ggplot2::facet_wrap(~tag, ncol = 3, scales = "free") +
ggplot2::coord_flip() +
ggplot2::labs(title = "Highest tf-idf words in Socrata metadata description fields",
caption = "Socrata metadata from https://api.us.socrata.com/api/catalog/v1",
x = NULL, y = "tf-idf")
Distribution of tf-idf for words from datasets labeled with selected keywords
The Socraata Discovery API houses a diverse set of tables that shows, through a combination of network analysis, tf-idf, and topic modeling, that many are connected to one other despite the many institutions publishing datasets.