There are 32,000+ datasets at NASA, and NASA is interested in understanding the connections among these datasets, as well as their connections to important datasets at other government organizations. Metadata about the NASA datasets is available online in JSON format. In this report, let’s look at two fields in that metadata, the descriptions and the keywords: we’ll use tf-idf to find important words in the description fields and connect those results to the keywords.
Let’s download the metadata for the 32,000+ NASA datasets and set up data frames for the descriptions and keywords.
library(jsonlite)
library(dplyr)
library(tidyr)
metadata <- fromJSON("https://data.nasa.gov/data.json")
names(metadata$dataset)
## [1] "_id" "@type" "accessLevel" "accrualPeriodicity"
## [5] "bureauCode" "contactPoint" "description" "distribution"
## [9] "identifier" "issued" "keyword" "landingPage"
## [13] "language" "modified" "programCode" "publisher"
## [17] "spatial" "temporal" "theme" "title"
## [21] "license" "isPartOf" "references" "rights"
## [25] "describedBy"
nasadesc <- data_frame(id = metadata$dataset$`_id`$`$oid`, desc = metadata$dataset$description)
nasadesc
## # A tibble: 32,089 x 2
## id
## <chr>
## 1 55942a57c63a7fe59b495a77
## 2 55942a57c63a7fe59b495a78
## 3 55942a58c63a7fe59b495a79
## 4 55942a58c63a7fe59b495a7a
## 5 55942a58c63a7fe59b495a7b
## 6 55942a58c63a7fe59b495a7c
## 7 55942a58c63a7fe59b495a7d
## 8 55942a58c63a7fe59b495a7e
## 9 55942a58c63a7fe59b495a7f
## 10 55942a58c63a7fe59b495a80
## # ... with 32,079 more rows, and 1 more variables: desc <chr>
These long description fields are hard to print in full; let’s look at part of a few of them.
nasadesc %>% select(desc) %>% sample_n(5)
## # A tibble: 5 x 1
## desc
## <chr>
## 1 A Group for High Resolution Sea Surface Temperature (GHRSST) Level 4 sea surface temperature analysis produced as a retrospective dataset at the JPL P
## 2 ML2CO is the EOS Aura Microwave Limb Sounder (MLS) standard product for carbon monoxide derived from radiances measured by the 640 GHz radiometer. The
## 3 Crew lock bag. Polygons: 405 Vertices: 514
## 4 JEM Engineering proved the technical feasibility of the FlexScan array?a very low-cost, highly-efficient, wideband phased array antenna?in Phase I, an
## 5 MODIS (or Moderate Resolution Imaging Spectroradiometer) is a key instrument aboard the\nTerra (EOS AM) and Aqua (EOS PM) satellites. Terra's orbit aro
And here are the keywords.
nasakeyword <- data_frame(id = metadata$dataset$`_id`$`$oid`,
                          keyword = metadata$dataset$keyword) %>%
  unnest(keyword)
nasakeyword
## # A tibble: 126,814 x 2
## id keyword
## <chr> <chr>
## 1 55942a57c63a7fe59b495a77 EARTH SCIENCE
## 2 55942a57c63a7fe59b495a77 HYDROSPHERE
## 3 55942a57c63a7fe59b495a77 SURFACE WATER
## 4 55942a57c63a7fe59b495a78 EARTH SCIENCE
## 5 55942a57c63a7fe59b495a78 HYDROSPHERE
## 6 55942a57c63a7fe59b495a78 SURFACE WATER
## 7 55942a58c63a7fe59b495a79 EARTH SCIENCE
## 8 55942a58c63a7fe59b495a79 HYDROSPHERE
## 9 55942a58c63a7fe59b495a79 SURFACE WATER
## 10 55942a58c63a7fe59b495a7a EARTH SCIENCE
## # ... with 126,804 more rows
What are the most common keywords?
nasakeyword %>% group_by(keyword) %>% count(sort = TRUE)
## # A tibble: 1,774 x 2
## keyword n
## <chr> <int>
## 1 EARTH SCIENCE 14362
## 2 Project 7452
## 3 ATMOSPHERE 7321
## 4 Ocean Color 7268
## 5 Ocean Optics 7268
## 6 Oceans 7268
## 7 completed 6452
## 8 ATMOSPHERIC WATER VAPOR 3142
## 9 OCEANS 2765
## 10 LAND SURFACE 2720
## # ... with 1,764 more rows
Looks like keywords such as “Project” and “completed” may not be useful to keep around for some purposes, and we may want to convert all keywords to a single case to get rid of duplicates like “OCEANS” and “Oceans”. Let’s do that now.
nasakeyword <- nasakeyword %>% mutate(keyword = toupper(keyword))
What is tf-idf? One way to measure how important a word is in a document is its term frequency (tf): how frequently the word occurs in that document. Some frequently occurring words are not important, though; in English, these are probably words like “the”, “is”, “of”, and so forth. Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are rarely used in a collection of documents. These can be combined to calculate a term’s tf-idf, the frequency of a term adjusted for how rarely it is used. Check out my blog post on this to learn more.
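To make this concrete, the standard definitions (the ones the tidytext package implements, which we can check against the numbers below) are

\[ tf(\mathrm{term},\, \mathrm{document}) = \frac{n_{\mathrm{term\ in\ document}}}{n_{\mathrm{words\ in\ document}}}, \qquad idf(\mathrm{term}) = \ln\!\left(\frac{n_{\mathrm{documents}}}{n_{\mathrm{documents\ containing\ term}}}\right) \]

and tf-idf is the product of the two. With 32,089 descriptions in this collection, a word that appears in only one description gets \(idf = \ln(32089) \approx 10.38\), the maximum possible here, while a word that appears in every description gets \(idf = \ln(1) = 0\).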
library(tidytext)
descwords <- nasadesc %>% unnest_tokens(word, desc) %>%
  count(id, word, sort = TRUE) %>%
  ungroup()
descwords
## # A tibble: 2,728,224 x 3
## id word n
## <chr> <chr> <int>
## 1 55942a88c63a7fe59b498280 amp 679
## 2 55942a88c63a7fe59b498280 nbsp 655
## 3 55942a8ec63a7fe59b4986ef gt 330
## 4 55942a8ec63a7fe59b4986ef lt 330
## 5 55942a8ec63a7fe59b4986ef p 327
## 6 55942a8ec63a7fe59b4986ef the 231
## 7 55942a86c63a7fe59b49803b amp 208
## 8 55942a86c63a7fe59b49803b nbsp 204
## 9 56cf5b00a759fdadc44e564a the 201
## 10 55942a86c63a7fe59b4980a2 gt 191
## # ... with 2,728,214 more rows
These are the most common “words” in the NASA description fields, i.e., the words with the highest term frequency. Most of them are not real words at all but fragments left over from converting HTML entities like &amp; and &nbsp; to plain text. Let’s look at that first dataset, for example:
nasadesc %>% filter(id == "55942a88c63a7fe59b498280") %>% select(desc)## # A tibble: 1 x 1
## desc
## <chr>
## 1 <p>The objective of the Variable Oxygen Regulator Element is to develop an oxygen-rated, contaminant-tolerant oxygen regulator to control suit p
There were apparently lots of HTML remnants in that one. The idf part of tf-idf should down-weight tokens like “amp” and “nbsp” because they appear in so many description fields, but we could also remove them with a custom stop word list if necessary.
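If we did want to remove them, a minimal sketch might use dplyr’s anti_join(); note this is not run here, and the word list is only a guess based on the output above.

# hypothetical stop list of HTML entity/tag remnants
html_remnants <- data_frame(word = c("amp", "nbsp", "gt", "lt", "p"))
# drop every row of descwords whose word matches the stop list
descwords %>% anti_join(html_remnants, by = "word")

For now, let’s leave these tokens in and see how tf-idf handles them. So let’s calculate tf-idf for all the words in the description fields.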
descwords <- descwords %>% bind_tf_idf(word, id, n)
descwords
## # A tibble: 2,728,224 x 6
## id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 55942a88c63a7fe59b498280 amp 679 0.35661765 3.1810813 1.134429711
## 2 55942a88c63a7fe59b498280 nbsp 655 0.34401261 4.2066578 1.447143322
## 3 55942a8ec63a7fe59b4986ef gt 330 0.05722213 3.2263517 0.184618705
## 4 55942a8ec63a7fe59b4986ef lt 330 0.05722213 3.2903671 0.188281801
## 5 55942a8ec63a7fe59b4986ef p 327 0.05670192 3.3741126 0.191318680
## 6 55942a8ec63a7fe59b4986ef the 231 0.04005549 0.1485621 0.005950728
## 7 55942a86c63a7fe59b49803b amp 208 0.32911392 3.1810813 1.046938133
## 8 55942a86c63a7fe59b49803b nbsp 204 0.32278481 4.2066578 1.357845252
## 9 56cf5b00a759fdadc44e564a the 201 0.06962245 0.1485621 0.010343258
## 10 55942a86c63a7fe59b4980a2 gt 191 0.12290862 3.2263517 0.396546449
## # ... with 2,728,214 more rows
The columns that have been added are tf, idf, and those two quantities multiplied together, tf-idf, which is the thing we are interested in. What are the highest tf-idf words in the NASA description fields?
descwords %>% arrange(-tf_idf)
## # A tibble: 2,728,224 x 6
##                          id                                          word     n    tf       idf    tf_idf
##                       <chr>                                         <chr> <int> <dbl>     <dbl>     <dbl>
## 1  55942a7cc63a7fe59b49774a                                           rdr     1     1 10.376269 10.376269
## 2  55942ac9c63a7fe59b49b688 palsar_radiometric_terrain_corrected_high_res     1     1 10.376269 10.376269
## 3  55942ac9c63a7fe59b49b689  palsar_radiometric_terrain_corrected_low_res     1     1 10.376269 10.376269
## 4  55942a7bc63a7fe59b4976ca                                          lgrs     1     1  8.766831  8.766831
## 5  55942a7bc63a7fe59b4976d2                                          lgrs     1     1  8.766831  8.766831
## 6  55942a7bc63a7fe59b4976e3                                          lgrs     1     1  8.766831  8.766831
## 7  55942ad8c63a7fe59b49cf6c                      template_proddescription     1     1  8.296827  8.296827
## 8  55942ad8c63a7fe59b49cf6d                      template_proddescription     1     1  8.296827  8.296827
## 9  55942ad8c63a7fe59b49cf6e                      template_proddescription     1     1  8.296827  8.296827
## 10 55942ad8c63a7fe59b49cf6f                      template_proddescription     1     1  8.296827  8.296827
## # ... with 2,728,214 more rows
So these are the most “important” words in the description fields as measured by tf-idf: words that are frequent within a single description but rare across the whole collection. Notice we have run into an issue here; both \(n\) and \(tf\) are equal to 1 for these terms, meaning these were description fields that contained only a single “word”. Let’s look at that top one:
nasadesc %>% filter(id == "55942a7cc63a7fe59b49774a") %>% select(desc)## # A tibble: 1 x 1
## desc
## <chr>
## 1 RDR
The tf-idf algorithm will think that is a really important word. It might be a good idea to throw out all description fields that have fewer than 5 words or so, as sketched below.
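One way to implement that cutoff (a sketch, not run here; the 5-word threshold is arbitrary, and stringr’s str_count() is used to count whitespace-delimited tokens):

library(stringr)
# keep only descriptions with at least 5 whitespace-delimited tokens
nasadesc %>% filter(str_count(desc, "\\S+") >= 5)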
So now we know which words in the descriptions have high tf-idf, and we also have labels for these descriptions in the keywords. Let’s do a full join of the keyword data frame and the data frame of description words with tf-idf, and then find the highest tf-idf words for a given keyword. (This full join takes a bit to run.)
descwords <- full_join(descwords, nasakeyword, by = "id")
descwords
## # A tibble: 11,013,838 x 7
## id word n tf idf tf_idf keyword
## <chr> <chr> <int> <dbl> <dbl> <dbl> <chr>
## 1 55942a88c63a7fe59b498280 amp 679 0.35661765 3.181081 1.1344297 ELEMENT
## 2 55942a88c63a7fe59b498280 amp 679 0.35661765 3.181081 1.1344297 JOHNSON SPACE CENTER
## 3 55942a88c63a7fe59b498280 amp 679 0.35661765 3.181081 1.1344297 VOR
## 4 55942a88c63a7fe59b498280 amp 679 0.35661765 3.181081 1.1344297 ACTIVE
## 5 55942a88c63a7fe59b498280 nbsp 655 0.34401261 4.206658 1.4471433 ELEMENT
## 6 55942a88c63a7fe59b498280 nbsp 655 0.34401261 4.206658 1.4471433 JOHNSON SPACE CENTER
## 7 55942a88c63a7fe59b498280 nbsp 655 0.34401261 4.206658 1.4471433 VOR
## 8 55942a88c63a7fe59b498280 nbsp 655 0.34401261 4.206658 1.4471433 ACTIVE
## 9 55942a8ec63a7fe59b4986ef gt 330 0.05722213 3.226352 0.1846187 JOHNSON SPACE CENTER
## 10 55942a8ec63a7fe59b4986ef gt 330 0.05722213 3.226352 0.1846187 PROJECT
## # ... with 11,013,828 more rows
Let’s look at some of the most important words for a few example keywords.
plot_words <- descwords %>% filter(!near(tf, 1)) %>%
  filter(keyword %in% c("SOLAR ACTIVITY", "CLOUDS",
                        "VEGETATION", "ASTROPHYSICS",
                        "HUMAN HEALTH", "BUDGET")) %>%
  arrange(desc(tf_idf)) %>%
  group_by(keyword) %>%
  distinct(word, keyword, .keep_all = TRUE) %>%
  top_n(20, tf_idf) %>% ungroup() %>%
  mutate(word = factor(word, levels = rev(unique(word))))
plot_words
## # A tibble: 122 x 7
## id word n tf idf tf_idf keyword
## <chr> <fctr> <int> <dbl> <dbl> <dbl> <chr>
## 1 55942a60c63a7fe59b49612f estimates 1 0.5000000 3.172863 1.586432 CLOUDS
## 2 55942a76c63a7fe59b49728d ncdc 1 0.1666667 7.603680 1.267280 CLOUDS
## 3 55942a60c63a7fe59b49612f cloud 1 0.5000000 2.464212 1.232106 CLOUDS
## 4 55942a5ac63a7fe59b495bd8 fife 1 0.2000000 5.910360 1.182072 CLOUDS
## 5 55942a5cc63a7fe59b495deb allometry 1 0.1428571 7.891362 1.127337 VEGETATION
## 6 55942a5dc63a7fe59b495ede tgb 3 0.1875000 5.945452 1.114772 VEGETATION
## 7 55942a5ac63a7fe59b495bd8 tovs 1 0.2000000 5.524238 1.104848 CLOUDS
## 8 55942a5ac63a7fe59b495bd8 received 1 0.2000000 5.332843 1.066569 CLOUDS
## 9 55942a5cc63a7fe59b495dfd sap 1 0.1250000 8.430358 1.053795 VEGETATION
## 10 55942a60c63a7fe59b496131 abstract 1 0.3333333 3.118561 1.039520 CLOUDS
## # ... with 112 more rows
Notice that many of these have \(n = 1\); these are words that appeared only once in their given description fields. A lot of them also have very high term frequency, i.e., they come from very short descriptions.
nasadesc %>% filter(id == "55942a60c63a7fe59b49612f") %>% select(desc)## # A tibble: 1 x 1
## desc
## <chr>
## 1 Cloud estimates
A tf-idf algorithm isn’t going to work very well on descriptions that are only 2 words long, or at least it is going to weight those words very heavily. Then again, maybe that isn’t inappropriate for descriptions this short.
Anyway, let’s plot these high tf-idf words for these example keywords.
library(ggplot2)
library(ggstance)
library(ggthemes)
ggplot(plot_words, aes(tf_idf, word, fill = keyword, alpha = tf_idf)) +
  geom_barh(stat = "identity", show.legend = FALSE) +
  labs(title = "Highest tf-idf words in NASA Metadata Description Fields",
       subtitle = "Distribution of tf-idf for words from datasets labeled with various keywords",
       caption = "NASA metadata from https://data.nasa.gov/data.json",
       y = NULL, x = "tf-idf") +
  facet_wrap(~keyword, ncol = 3, scales = "free") +
  theme_tufte(base_family = "Arial", base_size = 13, ticks = FALSE) +
  scale_alpha_continuous(range = c(0.2, 1)) +
  scale_x_continuous(expand = c(0, 0)) +
  theme(strip.text = element_text(hjust = 0)) +
  theme(plot.caption = element_text(size = 9))
This could use a bit more cleaning still; there are still some short “words” that are remnants of the HTML-to-text conversion (“li” for sure, maybe others). Some of the other letter combinations are certainly acronyms (important?), and the numbers that appear may be meaningful for these topics. I also see an example of what I think is a misspelled word that the algorithm decided was important: “univsity”? Overall, though, tf-idf has identified important words for these topics.
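As one possible next step, we could add those remnants to the hypothetical stop list sketched earlier and re-run the analysis, along the lines of:

# extend the hypothetical stop list with further remnants like "li"
html_remnants <- data_frame(word = c("amp", "nbsp", "gt", "lt", "p", "li"))
descwords <- descwords %>% anti_join(html_remnants, by = "word")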