There are 32,000+ datasets at NASA, and NASA is interested in understanding the connections among these datasets, as well as their connections to important datasets at other government organizations. Metadata about the NASA datasets is available online in JSON format. In this report, let’s look at two fields in that metadata, the descriptions and the keywords: we’ll use tf-idf to find important words in the description fields and connect those results to the keywords.
Let’s download the metadata for the 32,000+ NASA datasets and set up data frames for the descriptions and keywords.
library(jsonlite)
library(dplyr)
library(tidyr)
metadata <- fromJSON("https://data.nasa.gov/data.json")
names(metadata$dataset)
## [1] "_id" "@type" "accessLevel" "accrualPeriodicity"
## [5] "bureauCode" "contactPoint" "description" "distribution"
## [9] "identifier" "issued" "keyword" "landingPage"
## [13] "language" "modified" "programCode" "publisher"
## [17] "spatial" "temporal" "theme" "title"
## [21] "license" "isPartOf" "references" "rights"
## [25] "describedBy"
nasadesc <- data_frame(id = metadata$dataset$`_id`$`$oid`, desc = metadata$dataset$description)
nasadesc
## # A tibble: 32,089 x 2
## id
## <chr>
## 1 55942a57c63a7fe59b495a77
## 2 55942a57c63a7fe59b495a78
## 3 55942a58c63a7fe59b495a79
## 4 55942a58c63a7fe59b495a7a
## 5 55942a58c63a7fe59b495a7b
## 6 55942a58c63a7fe59b495a7c
## 7 55942a58c63a7fe59b495a7d
## 8 55942a58c63a7fe59b495a7e
## 9 55942a58c63a7fe59b495a7f
## 10 55942a58c63a7fe59b495a80
## # ... with 32,079 more rows, and 1 more variables: desc <chr>
These long description fields are hard to print in full; let’s look at part of a few of them.
nasadesc %>% select(desc) %>% sample_n(5)
## # A tibble: 5 x 1
## desc
## <chr>
## 1 A Group for High Resolution Sea Surface Temperature (GHRSST) Level 4 sea surface temperature analysis produced as a retrospective dataset at the JPL P
## 2 ML2CO is the EOS Aura Microwave Limb Sounder (MLS) standard product for carbon monoxide derived from radiances measured by the 640 GHz radiometer. The
## 3 Crew lock bag. Polygons: 405 Vertices: 514
## 4 JEM Engineering proved the technical feasibility of the FlexScan array?a very low-cost, highly-efficient, wideband phased array antenna?in Phase I, an
## 5 MODIS (or Moderate Resolution Imaging Spectroradiometer) is a key instrument aboard the\nTerra (EOS AM) and Aqua (EOS PM) satellites. Terra's orbit aro
And here are the keywords.
nasakeyword <- data_frame(id = metadata$dataset$`_id`$`$oid`,
                          keyword = metadata$dataset$keyword) %>%
  unnest(keyword)
nasakeyword
## # A tibble: 126,814 x 2
## id keyword
## <chr> <chr>
## 1 55942a57c63a7fe59b495a77 EARTH SCIENCE
## 2 55942a57c63a7fe59b495a77 HYDROSPHERE
## 3 55942a57c63a7fe59b495a77 SURFACE WATER
## 4 55942a57c63a7fe59b495a78 EARTH SCIENCE
## 5 55942a57c63a7fe59b495a78 HYDROSPHERE
## 6 55942a57c63a7fe59b495a78 SURFACE WATER
## 7 55942a58c63a7fe59b495a79 EARTH SCIENCE
## 8 55942a58c63a7fe59b495a79 HYDROSPHERE
## 9 55942a58c63a7fe59b495a79 SURFACE WATER
## 10 55942a58c63a7fe59b495a7a EARTH SCIENCE
## # ... with 126,804 more rows
What are the most common keywords?
nasakeyword %>% group_by(keyword) %>% count(sort = TRUE)
## # A tibble: 1,774 x 2
## keyword n
## <chr> <int>
## 1 EARTH SCIENCE 14362
## 2 Project 7452
## 3 ATMOSPHERE 7321
## 4 Ocean Color 7268
## 5 Ocean Optics 7268
## 6 Oceans 7268
## 7 completed 6452
## 8 ATMOSPHERIC WATER VAPOR 3142
## 9 OCEANS 2765
## 10 LAND SURFACE 2720
## # ... with 1,764 more rows
Looks like keywords such as “Project” and “completed” may not be useful to keep around for some purposes, and we may want to convert all keywords to a single case to get rid of duplicates like “OCEANS” and “Oceans”. Let’s do that now.
nasakeyword <- nasakeyword %>% mutate(keyword = toupper(keyword))
What is tf-idf? One way to measure how important a word is in a document is its term frequency (tf): how frequently the word occurs in that document. Some frequently occurring words are not important, though; in English, these are probably words like “the”, “is”, “of”, and so forth. Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are rarely used in a collection of documents. These can be combined to calculate a term’s tf-idf, the frequency of a term adjusted for how rarely it is used. Check out my blog post on this to learn more.
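To make this concrete, the standard definitions (the ones the tidytext package implements, which we can check against the numbers below) are

\[ tf(\mathrm{term},\, \mathrm{document}) = \frac{n_{\mathrm{term\ in\ document}}}{n_{\mathrm{words\ in\ document}}}, \qquad idf(\mathrm{term}) = \ln\!\left(\frac{n_{\mathrm{documents}}}{n_{\mathrm{documents\ containing\ term}}}\right) \]

and tf-idf is the product of the two. With 32,089 descriptions in this collection, a word that appears in only one description gets \(idf = \ln(32089) \approx 10.38\), the maximum possible here, while a word that appears in every description gets \(idf = \ln(1) = 0\).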
library(tidytext)
descwords <- nasadesc %>% unnest_tokens(word, desc) %>%
  count(id, word, sort = TRUE) %>%
  ungroup()
descwords
## # A tibble: 2,728,224 x 3
## id word n
## <chr> <chr> <int>
## 1 55942a88c63a7fe59b498280 amp 679
## 2 55942a88c63a7fe59b498280 nbsp 655
## 3 55942a8ec63a7fe59b4986ef gt 330
## 4 55942a8ec63a7fe59b4986ef lt 330
## 5 55942a8ec63a7fe59b4986ef p 327
## 6 55942a8ec63a7fe59b4986ef the 231
## 7 55942a86c63a7fe59b49803b amp 208
## 8 55942a86c63a7fe59b49803b nbsp 204
## 9 56cf5b00a759fdadc44e564a the 201
## 10 55942a86c63a7fe59b4980a2 gt 191
## # ... with 2,728,214 more rows
These are the most common “words” in the NASA description fields, i.e., the words with the highest term frequency. Most of them are not real words at all but fragments left over from converting HTML entities like &amp; and &nbsp; to plain text. Let’s look at that first dataset, for example:
nasadesc %>% filter(id == "55942a88c63a7fe59b498280") %>% select(desc)## # A tibble: 1 x 1
## desc
## <chr>
## 1 <p>The objective of the Variable Oxygen Regulator Element is to develop an oxygen-rated, contaminant-tolerant oxygen regulator to control suit p
There were apparently lots of HTML remnants in that one. The idf part of tf-idf should down-weight tokens like “amp” and “nbsp” because they appear in so many description fields, but we could also remove them with a custom stop word list if necessary.
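If we did want to remove them, a minimal sketch might use dplyr’s anti_join(); note this is not run here, and the word list is only a guess based on the output above.

# hypothetical stop list of HTML entity/tag remnants
html_remnants <- data_frame(word = c("amp", "nbsp", "gt", "lt", "p"))
# drop every row of descwords whose word matches the stop list
descwords %>% anti_join(html_remnants, by = "word")

For now, let’s leave these tokens in and see how tf-idf handles them. So let’s calculate tf-idf for all the words in the description fields.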
descwords <- descwords %>% bind_tf_idf(word, id, n)
descwords
## # A tibble: 2,728,224 x 6
## id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 55942a88c63a7fe59b498280 amp 679 0.35661765 3.1810813 1.134429711
## 2 55942a88c63a7fe59b498280 nbsp 655 0.34401261 4.2066578 1.447143322
## 3 55942a8ec63a7fe59b4986ef gt 330 0.05722213 3.2263517 0.184618705
## 4 55942a8ec63a7fe59b4986ef lt 330 0.05722213 3.2903671 0.188281801
## 5 55942a8ec63a7fe59b4986ef p 327 0.05670192 3.3741126 0.191318680
## 6 55942a8ec63a7fe59b4986ef the 231 0.04005549 0.1485621 0.005950728
## 7 55942a86c63a7fe59b49803b amp 208 0.32911392 3.1810813 1.046938133
## 8 55942a86c63a7fe59b49803b nbsp 204 0.32278481 4.2066578 1.357845252
## 9 56cf5b00a759fdadc44e564a the 201 0.06962245 0.1485621 0.010343258
## 10 55942a86c63a7fe59b4980a2 gt 191 0.12290862 3.2263517 0.396546449
## # ... with 2,728,214 more rows
The columns that have been added are tf, idf, and those two quantities multiplied together, tf-idf, which is the thing we are interested in. What are the highest tf-idf words in the NASA description fields?
descwords %>% arrange(-tf_idf)
## # A tibble: 2,728,224 x 6
##                          id                                          word     n    tf       idf    tf_idf
##                       <chr>                                         <chr> <int> <dbl>     <dbl>     <dbl>
## 1  55942a7cc63a7fe59b49774a                                           rdr     1     1 10.376269 10.376269
## 2  55942ac9c63a7fe59b49b688 palsar_radiometric_terrain_corrected_high_res     1     1 10.376269 10.376269
## 3  55942ac9c63a7fe59b49b689  palsar_radiometric_terrain_corrected_low_res     1     1 10.376269 10.376269
## 4  55942a7bc63a7fe59b4976ca                                          lgrs     1     1  8.766831  8.766831
## 5  55942a7bc63a7fe59b4976d2                                          lgrs     1     1  8.766831  8.766831
## 6  55942a7bc63a7fe59b4976e3                                          lgrs     1     1  8.766831  8.766831
## 7  55942ad8c63a7fe59b49cf6c                      template_proddescription     1     1  8.296827  8.296827
## 8  55942ad8c63a7fe59b49cf6d                      template_proddescription     1     1  8.296827  8.296827
## 9  55942ad8c63a7fe59b49cf6e                      template_proddescription     1     1  8.296827  8.296827
## 10 55942ad8c63a7fe59b49cf6f                      template_proddescription     1     1  8.296827  8.296827
## # ... with 2,728,214 more rows
So these are the most “important” words in the description fields as measured by tf-idf: words that are frequent within a single description but rare across the whole collection. Notice we have run into an issue here; both \(n\) and \(tf\) are equal to 1 for these terms, meaning these were description fields that contained only a single “word”. Let’s look at that top one:
nasadesc %>% filter(id == "55942a7cc63a7fe59b49774a") %>% select(desc)## # A tibble: 1 x 1
## desc
## <chr>
## 1 RDR
The tf-idf algorithm will think that is a really important word. It might be a good idea to throw out all description fields that have fewer than 5 words or so, as sketched below.
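One way to implement that cutoff (a sketch, not run here; the 5-word threshold is arbitrary, and stringr’s str_count() is used to count whitespace-delimited tokens):

library(stringr)
# keep only descriptions with at least 5 whitespace-delimited tokens
nasadesc %>% filter(str_count(desc, "\\S+") >= 5)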
So now we know which words in the descriptions have high tf-idf, and we also have labels for these descriptions in the keywords. Let’s do a full join of the keyword data frame and the data frame of description words with tf-idf, and then find the highest tf-idf words for a given keyword. (This full join takes a bit to run.)
descwords <- full_join(descwords, nasakeyword, by = "id")
descwords
## # A tibble: 11,013,838 x 7
## id word n tf idf tf_idf keyword
## <chr> <chr> <int> <dbl> <dbl> <dbl> <chr>
## 1 55942a88c63a7fe59b498280 amp 679 0.35661765 3.181081 1.1344297 ELEMENT
## 2 55942a88c63a7fe59b498280 amp 679 0.35661765 3.181081 1.1344297 JOHNSON SPACE CENTER
## 3 55942a88c63a7fe59b498280 amp 679 0.35661765 3.181081 1.1344297 VOR
## 4 55942a88c63a7fe59b498280 amp 679 0.35661765 3.181081 1.1344297 ACTIVE
## 5 55942a88c63a7fe59b498280 nbsp 655 0.34401261 4.206658 1.4471433 ELEMENT
## 6 55942a88c63a7fe59b498280 nbsp 655 0.34401261 4.206658 1.4471433 JOHNSON SPACE CENTER
## 7 55942a88c63a7fe59b498280 nbsp 655 0.34401261 4.206658 1.4471433 VOR
## 8 55942a88c63a7fe59b498280 nbsp 655 0.34401261 4.206658 1.4471433 ACTIVE
## 9 55942a8ec63a7fe59b4986ef gt 330 0.05722213 3.226352 0.1846187 JOHNSON SPACE CENTER
## 10 55942a8ec63a7fe59b4986ef gt 330 0.05722213 3.226352 0.1846187 PROJECT
## # ... with 11,013,828 more rows
Let’s look at some of the most important words for a few example keywords.
plot_words <- descwords %>% filter(!near(tf, 1)) %>%
  filter(keyword %in% c("SOLAR ACTIVITY", "CLOUDS",
                        "VEGETATION", "ASTROPHYSICS",
                        "HUMAN HEALTH", "BUDGET")) %>%
  arrange(desc(tf_idf)) %>%
  group_by(keyword) %>%
  distinct(word, keyword, .keep_all = TRUE) %>%
  top_n(20, tf_idf) %>% ungroup() %>%
  mutate(word = factor(word, levels = rev(unique(word))))
plot_words
## # A tibble: 122 x 7
## id word n tf idf tf_idf keyword
## <chr> <fctr> <int> <dbl> <dbl> <dbl> <chr>
## 1 55942a60c63a7fe59b49612f estimates 1 0.5000000 3.172863 1.586432 CLOUDS
## 2 55942a76c63a7fe59b49728d ncdc 1 0.1666667 7.603680 1.267280 CLOUDS
## 3 55942a60c63a7fe59b49612f cloud 1 0.5000000 2.464212 1.232106 CLOUDS
## 4 55942a5ac63a7fe59b495bd8 fife 1 0.2000000 5.910360 1.182072 CLOUDS
## 5 55942a5cc63a7fe59b495deb allometry 1 0.1428571 7.891362 1.127337 VEGETATION
## 6 55942a5dc63a7fe59b495ede tgb 3 0.1875000 5.945452 1.114772 VEGETATION
## 7 55942a5ac63a7fe59b495bd8 tovs 1 0.2000000 5.524238 1.104848 CLOUDS
## 8 55942a5ac63a7fe59b495bd8 received 1 0.2000000 5.332843 1.066569 CLOUDS
## 9 55942a5cc63a7fe59b495dfd sap 1 0.1250000 8.430358 1.053795 VEGETATION
## 10 55942a60c63a7fe59b496131 abstract 1 0.3333333 3.118561 1.039520 CLOUDS
## # ... with 112 more rows
Notice that many of these have \(n = 1\); these are words that appeared only once in their given description fields. A lot of them also have very high term frequency, i.e., they come from very short descriptions.
nasadesc %>% filter(id == "55942a60c63a7fe59b49612f") %>% select(desc)## # A tibble: 1 x 1
## desc
## <chr>
## 1 Cloud estimates
A tf-idf algorithm isn’t going to work very well on descriptions that are only 2 words long, or at least it is going to weight those words very heavily. Then again, maybe that isn’t inappropriate for descriptions this short.
Anyway, let’s plot these high tf-idf words for these example keywords.
library(ggplot2)
library(ggstance)
library(ggthemes)
ggplot(plot_words, aes(tf_idf, word, fill = keyword, alpha = tf_idf)) +
  geom_barh(stat = "identity", show.legend = FALSE) +
  labs(title = "Highest tf-idf words in NASA Metadata Description Fields",
       subtitle = "Distribution of tf-idf for words from datasets labeled with various keywords",
       caption = "NASA metadata from https://data.nasa.gov/data.json",
       y = NULL, x = "tf-idf") +
  facet_wrap(~keyword, ncol = 3, scales = "free") +
  theme_tufte(base_family = "Arial", base_size = 13, ticks = FALSE) +
  scale_alpha_continuous(range = c(0.2, 1)) +
  scale_x_continuous(expand = c(0, 0)) +
  theme(strip.text = element_text(hjust = 0)) +
  theme(plot.caption = element_text(size = 9))
This could use a bit more cleaning still; there are still some short “words” that are remnants of the HTML-to-text conversion (“li” for sure, maybe others). Some of the other letter combinations are certainly acronyms (important?), and the numbers that appear may be meaningful for these topics. I also see an example of what I think is a misspelled word that the algorithm decided was important: “univsity”? Overall, though, tf-idf has identified important words for these topics.
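As one possible next step, we could add those remnants to the hypothetical stop list sketched earlier and re-run the analysis, along the lines of:

# extend the hypothetical stop list with further remnants like "li"
html_remnants <- data_frame(word = c("amp", "nbsp", "gt", "lt", "p", "li"))
descwords <- descwords %>% anti_join(html_remnants, by = "word")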