There are 32,000+ datasets at NASA, and NASA is interested in understanding the connections between these datasets. Metadata about the NASA datasets is available online in JSON format. Let’s look at this metadata, specifically in this report the description and keyword fields. I started to use topic modeling to classify the description fields and connect that to the keywords; in this report, let’s explore how many topics makes the most sense.

What is Topic Modeling?

To review, topic modeling is a method for unsupervised classification of documents; this method models each document as a mixture of topics and each topic as a mixture of words. The kind of method I’ll be using here for topic modeling is called latent Dirichlet allocation (LDA) but there are other possibilities for fitting a topic model. In the context here, each data set description is a document; we are going to see if we can fit model these description texts as a mixture of topics.

Getting and Wrangling the NASA Metadata

Let’s download the metadata for the 32,000+ NASA datasets and set up data frames for the descriptions and keywords, similarly to how I’ve done this in the past.

library(jsonlite)
library(dplyr)
library(tidyr)
metadata <- fromJSON("https://data.nasa.gov/data.json")
names(metadata$dataset)
##  [1] "_id"                "@type"              "accessLevel"        "accrualPeriodicity"
##  [5] "bureauCode"         "contactPoint"       "description"        "distribution"      
##  [9] "identifier"         "issued"             "keyword"            "landingPage"       
## [13] "language"           "modified"           "programCode"        "publisher"         
## [17] "spatial"            "temporal"           "theme"              "title"             
## [21] "license"            "isPartOf"           "references"         "rights"            
## [25] "describedBy"
nasadesc <- data_frame(id = metadata$dataset$`_id`$`$oid`, desc = metadata$dataset$description)
nasakeyword <- data_frame(id = metadata$dataset$`_id`$`$oid`, 
                          keyword = metadata$dataset$keyword) %>%
    unnest(keyword)
nasakeyword <- nasakeyword %>% mutate(keyword = toupper(keyword))

Just to check on things, what are the most common keywords?

nasakeyword %>% group_by(keyword) %>% count(sort = TRUE)
## # A tibble: 1,616 × 2
##                    keyword     n
##                      <chr> <int>
## 1            EARTH SCIENCE 14386
## 2                   OCEANS 10033
## 3                  PROJECT  7463
## 4             OCEAN OPTICS  7324
## 5               ATMOSPHERE  7323
## 6              OCEAN COLOR  7270
## 7                COMPLETED  6452
## 8  ATMOSPHERIC WATER VAPOR  3142
## 9             LAND SURFACE  2720
## 10               BIOSPHERE  2449
## # ... with 1,606 more rows

Making a DocumentTermMatrix

To do the topic modeling, we need to make a DocumentTermMatrix, a special kind of matrix from the tm package (of course, there is just a general concept of a “document-term matrix”). Rows correspond to documents (description texts in our case) and columns correspond to terms (i.e., words); it is a sparse matrix and the values are word counts (although they also can be tf-idf).

Let’s clean up the text a bit using stop words to remove some of the nonsense “words” leftover from HTML or other character encoding.

library(tidytext)
mystop_words <- bind_rows(stop_words, 
                          data_frame(word = c("nbsp", "amp", "gt", "lt",
                                              "timesnewromanpsmt", "font",
                                              "td", "li", "br", "tr", "quot",
                                              "st", "img", "src", "strong",
                                              as.character(1:10)), 
                                     lexicon = rep("custom", 25)))
word_counts <- nasadesc %>% unnest_tokens(word, desc) %>%
    anti_join(mystop_words) %>%
    count(id, word, sort = TRUE) %>%
    ungroup()
word_counts
## # A tibble: 1,909,247 × 3
##                          id     word     n
##                       <chr>    <chr> <int>
## 1  55942a8ec63a7fe59b4986ef     suit    82
## 2  55942a8ec63a7fe59b4986ef    space    69
## 3  56cf5b00a759fdadc44e564a     data    41
## 4  56cf5b00a759fdadc44e564a     leak    40
## 5  56cf5b00a759fdadc44e564a     tree    39
## 6  55942a8ec63a7fe59b4986ef pressure    34
## 7  55942a8ec63a7fe59b4986ef   system    34
## 8  55942a89c63a7fe59b4982d9       em    32
## 9  55942a8ec63a7fe59b4986ef       al    32
## 10 55942a8ec63a7fe59b4986ef    human    31
## # ... with 1,909,237 more rows

Now let’s make the DocumentTermMatrix.

desc_dtm <- word_counts %>%
  cast_dtm(id, word, n)
desc_dtm
## <<DocumentTermMatrix (documents: 32003, terms: 35915)>>
## Non-/sparse entries: 1909247/1147478498
## Sparsity           : 100%
## Maximal term length: 166
## Weighting          : term frequency (tf)

LDA Topic Modeling

Now let’s use the topicmodels package to create an LDA model. How many topics will we tell the algorithm to make? This is a question much like in \(k\)-means clustering; we don’t really know ahead of time. In my last exploration, I tried 8 topics; this time around, we’re going to try 16.

library(topicmodels)
desc_lda <- LDA(desc_dtm, k = 16, control = list(seed = 1234))
desc_lda
## A LDA_VEM topic model with 16 topics.

We have done it! We have modeled topics! This is a stochastic algorithm that could have different results depending on where the algorithm starts, so I need to put a seed for reproducibility. We’ll need to see how robust the topic modeling is eventually.

Exploring the Modeling

Let’s use the amazing/wonderful broom package to tidy the models, and see what we can find out.

library(broom)
tidy_lda <- tidy(desc_lda)
tidy_lda
## # A tibble: 574,640 × 3
##    topic  term         beta
##    <int> <chr>        <dbl>
## 1      1  suit 1.157604e-59
## 2      2  suit 4.626844e-77
## 3      3  suit 8.858039e-83
## 4      4  suit 4.262036e-73
## 5      5  suit 4.828308e-65
## 6      6  suit 5.942637e-80
## 7      7  suit 1.674197e-98
## 8      8  suit 1.124932e-04
## 9      9  suit 3.019772e-04
## 10    10  suit 1.000439e-24
## # ... with 574,630 more rows

The column \(\beta\) tells us the probability of that term being generated from that topic for that document. Notice that some of very, very low, and some are not so low.

What are the top 5 terms for each topic?

top_terms <- tidy_lda %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms
## # A tibble: 160 × 3
##    topic         term        beta
##    <int>        <chr>       <dbl>
## 1      1         data 0.047500572
## 2      1          set 0.017008824
## 3      1        files 0.009121241
## 4      1         soil 0.008179488
## 5      1       forest 0.008167524
## 6      1         land 0.007875858
## 7      1         site 0.006788564
## 8      1 measurements 0.006735500
## 9      1        cover 0.006137627
## 10     1         file 0.006122067
## # ... with 150 more rows

Let’s look at this visually.

library(ggplot2)
library(ggthemes)
ggplot(top_terms, aes(term, beta, fill = as.factor(topic))) +
    geom_bar(stat = "identity", show.legend = FALSE, alpha = 0.8) +
    coord_flip() +
    labs(title = "Top 10 Terms in Each LDA Topic",
         subtitle = "Topic modeling of NASA metadata description field texts",
         caption = "NASA metadata from https://data.nasa.gov/data.json",
         x = NULL, y = "beta") +
    facet_wrap(~topic, ncol = 4, scales = "free") +
    theme_tufte(base_family = "Arial", ticks = FALSE) +
    scale_y_continuous(expand=c(0,0)) +
    theme(strip.text=element_text(hjust=0)) +
    theme(plot.caption=element_text(size=9))

We can see what a dominant word “data” is in these description texts. There do appear to be meaningful differences between these collections of terms, though, from terms about soil and land to terms about design, systems, and technology.

Which Topic Does Each Document Belong To?

Let’s find out which topics are associated with which description fields (i.e., documents).

lda_gamma <- tidy(desc_lda, matrix = "gamma")
lda_gamma
## # A tibble: 512,048 × 3
##                    document topic        gamma
##                       <chr> <int>        <dbl>
## 1  55942a8ec63a7fe59b4986ef     1 2.371235e-02
## 2  56cf5b00a759fdadc44e564a     1 7.022165e-02
## 3  55942a89c63a7fe59b4982d9     1 4.453017e-05
## 4  56cf5b00a759fdadc44e55cd     1 2.725297e-05
## 5  55942a89c63a7fe59b4982c6     1 8.021201e-05
## 6  55942a86c63a7fe59b498077     1 6.876576e-05
## 7  56cf5b00a759fdadc44e56f8     1 5.718876e-05
## 8  55942a8bc63a7fe59b4984b5     1 5.228273e-05
## 9  55942a6ec63a7fe59b496bf7     1 5.074625e-05
## 10 55942a8ec63a7fe59b4986f6     1 3.492363e-05
## # ... with 512,038 more rows

The column \(\gamma\) here is the probability that each document belongs in each topic. Notice that some are very low and some are higher. How are the probabilities distributed?

ggplot(lda_gamma, aes(gamma, fill = as.factor(topic))) +
    geom_histogram(show.legend = FALSE, alpha = 0.8) +
    facet_wrap(~topic, ncol = 4) +
    labs(title = "Distribution of Probability for Each Topic",
         subtitle = "Topic modeling of NASA metadata description field texts",
         caption = "NASA metadata from https://data.nasa.gov/data.json",
         y = NULL, x = "gamma") +
    scale_y_log10() +
    theme_minimal(base_family = "Arial", base_size = 13) +
    theme(strip.text=element_text(hjust=0)) +
    theme(plot.caption=element_text(size=9))

The y-axis is plotted here on a log scale so we can see something. Documents are getting sorted into topics a little less cleanly now than with 8 topics, with the exception of, say, topics 2 and 4. Still, this doesn’t look intensely worrisome.

Connecting Topic Modeling to Keywords

Let’s connect these topic models with the keywords and see what happens. Let’s join this dataframe to the keywords and see which keywords are associated with which topic.

lda_gamma <- full_join(lda_gamma, nasakeyword, by = c("document" = "id"))
lda_gamma
## # A tibble: 2,025,199 × 4
##                    document topic        gamma                     keyword
##                       <chr> <int>        <dbl>                       <chr>
## 1  55942a8ec63a7fe59b4986ef     1 2.371235e-02        JOHNSON SPACE CENTER
## 2  55942a8ec63a7fe59b4986ef     1 2.371235e-02                     PROJECT
## 3  55942a8ec63a7fe59b4986ef     1 2.371235e-02                   COMPLETED
## 4  56cf5b00a759fdadc44e564a     1 7.022165e-02                    DASHLINK
## 5  56cf5b00a759fdadc44e564a     1 7.022165e-02                        AMES
## 6  56cf5b00a759fdadc44e564a     1 7.022165e-02                        NASA
## 7  55942a89c63a7fe59b4982d9     1 4.453017e-05 GODDARD SPACE FLIGHT CENTER
## 8  55942a89c63a7fe59b4982d9     1 4.453017e-05                     PROJECT
## 9  55942a89c63a7fe59b4982d9     1 4.453017e-05                   COMPLETED
## 10 56cf5b00a759fdadc44e55cd     1 2.725297e-05                    DASHLINK
## # ... with 2,025,189 more rows

Let’s keep each document that was modeled as belonging to a topic with a probability \(> 0.9\), and then find the top keywords for each topic.

top_keywords <- lda_gamma %>% filter(gamma > 0.9) %>% 
    group_by(topic, keyword) %>% 
    count(keyword, sort = TRUE)
top_keywords
## Source: local data frame [1,061 x 3]
## Groups: topic [16]
## 
##    topic       keyword     n
##    <int>         <chr> <int>
## 1      4   OCEAN COLOR  4480
## 2      4  OCEAN OPTICS  4480
## 3      4        OCEANS  4480
## 4      8       PROJECT  2458
## 5      8     COMPLETED  2173
## 6      1 EARTH SCIENCE  1869
## 7     13 EARTH SCIENCE  1315
## 8      1     BIOSPHERE  1223
## 9     16   OCEAN COLOR  1216
## 10    16  OCEAN OPTICS  1216
## # ... with 1,051 more rows

Let’s do a visualization for these as well.

top_keywords <- top_keywords %>%
    top_n(10, n)
ggplot(top_keywords, aes(keyword, n, fill = as.factor(topic))) +
    geom_bar(stat = "identity", show.legend = FALSE, alpha = 0.8) +
    labs(title = "Top 10 Keywords for Each LDA Topic",
         subtitle = "Topic modeling of NASA metadata description field texts",
         caption = "NASA metadata from https://data.nasa.gov/data.json",
         x = NULL, y = "Number of documents") +
    coord_flip() +
    facet_wrap(~topic, ncol = 2, scales = "free") +
    theme_tufte(base_family = "Arial", ticks = FALSE) +
    scale_y_continuous(expand=c(0,0)) +
    theme(strip.text=element_text(hjust=0)) +
    theme(plot.caption=element_text(size=9))

Nice! Topic modeling with 16 topics looks pretty meaningful still, although the issue of categorizing documents into topics is starting to get iffy. It’s interesting that the keywords for topics 2, 3, and 4 are such duplicates, because the top terms in those topics do show meaningful differences.