There are 32,000+ datasets at NASA, and NASA is interested in understanding the connections between these datasets. Metadata about the NASA datasets is available online in JSON format. Let’s look at this metadata, specifically the description and keyword fields. I have started using topic modeling to classify the description fields and connect them to the keywords; in this report, let’s explore how many topics make the most sense.

What is Topic Modeling?

To review, topic modeling is a method for unsupervised classification of documents; this method models each document as a mixture of topics and each topic as a mixture of words. The method I’ll be using here is called latent Dirichlet allocation (LDA), but there are other approaches to fitting a topic model. In our context, each dataset description is a document; we are going to see if we can model these description texts as mixtures of topics.
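Before fitting anything to the real data, here is a tiny base R sketch of that generative story (the two topics, their word probabilities, and the vocabulary below are all made up purely for illustration): each document draws a mixture over topics, and each word in the document is drawn from one of those topics.

set.seed(1234)
vocab <- c("ocean", "color", "optics", "satellite", "soil", "moisture", "land", "surface")
# two made-up topics, each a probability distribution over the vocabulary
topics <- rbind(ocean_topic = c(0.35, 0.25, 0.25, 0.10, 0.02, 0.01, 0.01, 0.01),
                land_topic  = c(0.02, 0.01, 0.01, 0.10, 0.30, 0.26, 0.15, 0.15))
colnames(topics) <- vocab
# a topic mixture for one "document", sampled from a Dirichlet via normalized gammas
doc_mixture <- rgamma(2, shape = 1)
doc_mixture <- doc_mixture / sum(doc_mixture)
# generate a short document: pick a topic for each word, then a word from that topic
words <- replicate(20, {
    k <- sample(1:2, 1, prob = doc_mixture)
    sample(vocab, 1, prob = topics[k, ])
})
table(words)

Fitting LDA works in the opposite direction: given only the observed words, it estimates the per-topic word distributions and the per-document topic mixtures.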

Getting and Wrangling the NASA Metadata

Let’s download the metadata for the 32,000+ NASA datasets and set up data frames for the descriptions and keywords, similarly to how I’ve done this in the past.

library(jsonlite)
library(dplyr)
library(tidyr)
metadata <- fromJSON("https://data.nasa.gov/data.json")
names(metadata$dataset)
##  [1] "_id"                "@type"              "accessLevel"        "accrualPeriodicity"
##  [5] "bureauCode"         "contactPoint"       "description"        "distribution"      
##  [9] "identifier"         "issued"             "keyword"            "landingPage"       
## [13] "language"           "modified"           "programCode"        "publisher"         
## [17] "spatial"            "temporal"           "theme"              "title"             
## [21] "license"            "isPartOf"           "references"         "rights"            
## [25] "describedBy"
nasadesc <- data_frame(id = metadata$dataset$`_id`$`$oid`, desc = metadata$dataset$description)
nasakeyword <- data_frame(id = metadata$dataset$`_id`$`$oid`, 
                          keyword = metadata$dataset$keyword) %>%
    unnest(keyword)
nasakeyword <- nasakeyword %>% mutate(keyword = toupper(keyword))

Just to check on things, what are the most common keywords?

nasakeyword %>% group_by(keyword) %>% count(sort = TRUE)
## # A tibble: 1,616 × 2
##                    keyword     n
##                      <chr> <int>
## 1            EARTH SCIENCE 14386
## 2                   OCEANS 10033
## 3                  PROJECT  7463
## 4             OCEAN OPTICS  7324
## 5               ATMOSPHERE  7323
## 6              OCEAN COLOR  7270
## 7                COMPLETED  6452
## 8  ATMOSPHERIC WATER VAPOR  3142
## 9             LAND SURFACE  2720
## 10               BIOSPHERE  2449
## # ... with 1,606 more rows

Making a DocumentTermMatrix

To do the topic modeling, we need to make a DocumentTermMatrix, a special kind of matrix from the tm package (an implementation of the general concept of a document-term matrix). Rows correspond to documents (description texts in our case) and columns correspond to terms (i.e., words); it is a sparse matrix, and the values here are word counts (although they can also be tf-idf weights).
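As a minimal illustration of that structure (with made-up documents and counts, just to show the shape), tidytext’s cast_dtm() turns a tidy table of per-document word counts into exactly this kind of object:

library(dplyr)
library(tidytext)
toy_counts <- data_frame(document = c("doc1", "doc1", "doc2", "doc2", "doc2"),
                         word     = c("ocean", "color", "ocean", "soil", "land"),
                         n        = c(3, 1, 2, 4, 1))
toy_dtm <- toy_counts %>% cast_dtm(document, word, n)
toy_dtm            # a 2 x 4 DocumentTermMatrix
as.matrix(toy_dtm) # small enough here to inspect densely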

Let’s clean up the text a bit, using stop words to remove some of the nonsense “words” left over from HTML or other character encoding.

library(tidytext)
mystop_words <- bind_rows(stop_words, 
                          data_frame(word = c("nbsp", "amp", "gt", "lt",
                                              "timesnewromanpsmt", "font",
                                              "td", "li", "br", "tr", "quot",
                                              "st", "img", "src", "strong",
                                              as.character(1:10)), 
                                     lexicon = rep("custom", 25)))
word_counts <- nasadesc %>% unnest_tokens(word, desc) %>%
    anti_join(mystop_words) %>%
    count(id, word, sort = TRUE) %>%
    ungroup()
word_counts
## # A tibble: 1,909,247 × 3
##                          id     word     n
##                       <chr>    <chr> <int>
## 1  55942a8ec63a7fe59b4986ef     suit    82
## 2  55942a8ec63a7fe59b4986ef    space    69
## 3  56cf5b00a759fdadc44e564a     data    41
## 4  56cf5b00a759fdadc44e564a     leak    40
## 5  56cf5b00a759fdadc44e564a     tree    39
## 6  55942a8ec63a7fe59b4986ef pressure    34
## 7  55942a8ec63a7fe59b4986ef   system    34
## 8  55942a89c63a7fe59b4982d9       em    32
## 9  55942a8ec63a7fe59b4986ef       al    32
## 10 55942a8ec63a7fe59b4986ef    human    31
## # ... with 1,909,237 more rows

Now let’s make the DocumentTermMatrix.

desc_dtm <- word_counts %>%
  cast_dtm(id, word, n)
desc_dtm
## <<DocumentTermMatrix (documents: 32003, terms: 35915)>>
## Non-/sparse entries: 1909247/1147478498
## Sparsity           : 100%
## Maximal term length: 166
## Weighting          : term frequency (tf)

LDA Topic Modeling

Now let’s use the topicmodels package to create an LDA model. How many topics will we tell the algorithm to make? This is a question much like in \(k\)-means clustering; we don’t really know ahead of time. In previous explorations, I tried 8 topics and 16 topics; this time around, we’re going to try 32.
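One common way to compare candidate values of \(k\) is to fit a model for each and compare perplexity (lower is better). Here is a sketch of that idea; I am not running it on all 32,000+ descriptions here since each fit is slow, and ideally you would compute perplexity on held-out documents rather than the training data, but it gives a rough sense of the tradeoff.

library(topicmodels)
candidate_k <- c(8, 16, 24, 32)
perplexities <- sapply(candidate_k, function(k) {
    fit <- LDA(desc_dtm, k = k, control = list(seed = 1234))
    perplexity(fit, desc_dtm)  # lower means the model describes these documents better
})
data.frame(k = candidate_k, perplexity = perplexities)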

library(topicmodels)
desc_lda <- LDA(desc_dtm, k = 32, control = list(seed = 1234))
desc_lda
## A LDA_VEM topic model with 32 topics.

We have done it! We have modeled topics! This is a stochastic algorithm that could give different results depending on where it starts, so I set a seed for reproducibility. We’ll need to see how robust the topic modeling is eventually.
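One quick, informal check we could do (a sketch only; each refit on this many documents is slow) is to refit with a different seed and compare the top terms the topicmodels package reports per topic. Topic numbers are arbitrary, so the two fits won’t line up by index; what we would look for is whether similar groups of terms show up in both.

desc_lda_alt <- LDA(desc_dtm, k = 32, control = list(seed = 5678))
terms(desc_lda, 5)[, 1:4]      # top 5 terms for the first few topics, original seed
terms(desc_lda_alt, 5)[, 1:4]  # the same view for the alternative seed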

Exploring the Modeling

Let’s use the amazing/wonderful broom package to tidy the model and see what we can find out.

library(broom)
tidy_lda <- tidy(desc_lda)
tidy_lda
## # A tibble: 1,149,280 × 3
##    topic  term          beta
##    <int> <chr>         <dbl>
## 1      1  suit 3.093608e-160
## 2      2  suit  1.424056e-95
## 3      3  suit  6.244541e-75
## 4      4  suit  7.194924e-56
## 5      5  suit 9.820662e-177
## 6      6  suit  1.148286e-95
## 7      7  suit  1.271599e-58
## 8      8  suit  6.272612e-11
## 9      9  suit  2.755461e-03
## 10    10  suit  7.080205e-21
## # ... with 1,149,270 more rows

The column \(\beta\) tells us the probability of that term being generated from that topic. Notice that some are very, very low and some are not so low.

What are the top 10 terms for each topic?

top_terms <- tidy_lda %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms
## # A tibble: 320 × 3
##    topic      term        beta
##    <int>     <chr>       <dbl>
## 1      1      data 0.042421092
## 2      1   biomass 0.016297516
## 3      1    ground 0.015896812
## 4      1     files 0.014294353
## 5      1       npp 0.013474718
## 6      1        m2 0.011044086
## 7      1       set 0.010429942
## 8      1      file 0.008843429
## 9      1 estimates 0.008639112
## 10     1     sites 0.008432176
## # ... with 310 more rows

Let’s look at this visually.

library(ggplot2)
library(ggthemes)
ggplot(top_terms, aes(term, beta, fill = as.factor(topic))) +
    geom_bar(stat = "identity", show.legend = FALSE, alpha = 0.8) +
    coord_flip() +
    labs(title = "Top 10 Terms in Each LDA Topic",
         subtitle = "Topic modeling of NASA metadata description field texts",
         caption = "NASA metadata from https://data.nasa.gov/data.json",
         x = NULL, y = "beta") +
    facet_wrap(~topic, ncol = 4, scales = "free") +
    theme_tufte(base_family = "Arial", ticks = FALSE) +
    scale_y_continuous(expand=c(0,0)) +
    theme(strip.text=element_text(hjust=0)) +
    theme(plot.caption=element_text(size=9))

We see some of the same patterns as before, and continue to see meaningful differences between these collections of terms.

Which Topic Does Each Document Belong To?

Let’s find out which topics are associated with which description fields (i.e., documents).

lda_gamma <- tidy(desc_lda, matrix = "gamma")
lda_gamma
## # A tibble: 1,024,096 × 3
##                    document topic        gamma
##                       <chr> <int>        <dbl>
## 1  55942a8ec63a7fe59b4986ef     1 4.883194e-06
## 2  56cf5b00a759fdadc44e564a     1 8.730182e-06
## 3  55942a89c63a7fe59b4982d9     1 2.788193e-05
## 4  56cf5b00a759fdadc44e55cd     1 1.706525e-05
## 5  55942a89c63a7fe59b4982c6     1 5.021637e-05
## 6  55942a86c63a7fe59b498077     1 4.305248e-05
## 7  56cf5b00a759fdadc44e56f8     1 3.580610e-05
## 8  55942a8bc63a7fe59b4984b5     1 3.273506e-05
## 9  55942a6ec63a7fe59b496bf7     1 3.177324e-05
## 10 55942a8ec63a7fe59b4986f6     1 2.186778e-05
## # ... with 1,024,086 more rows

The column \(\gamma\) here is the probability that each document belongs in each topic. Notice that some are very low and some are higher. How are the probabilities distributed?

ggplot(lda_gamma, aes(gamma, fill = as.factor(topic))) +
    geom_histogram(show.legend = FALSE, alpha = 0.8) +
    facet_wrap(~topic, ncol = 4) +
    labs(title = "Distribution of Probability for Each Topic",
         subtitle = "Topic modeling of NASA metadata description field texts",
         caption = "NASA metadata from https://data.nasa.gov/data.json",
         y = NULL, x = "gamma") +
    scale_y_log10() +
    theme_minimal(base_family = "Arial", base_size = 13) +
    theme(strip.text=element_text(hjust=0)) +
    theme(plot.caption=element_text(size=9))

The y-axis is plotted here on a log scale so that the many near-zero probabilities don’t swamp everything else. Documents are not getting sorted into topics very cleanly now, compared with 8 topics and 16 topics; several of these distributions are starting to flatten out toward the \(\gamma = 1\) end. Still, for any individual document, we could find the topic that it has the highest probability of belonging to.
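For example, here is a quick sketch of assigning each description to its single most probable topic, using the lda_gamma data frame as it stands here (before joining to the keywords):

doc_classifications <- lda_gamma %>%
    group_by(document) %>%
    top_n(1, gamma) %>%   # keep the row with the highest gamma for each document
    ungroup()
doc_classifications %>% count(topic, sort = TRUE)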

Connecting Topic Modeling to Keywords

Let’s connect these topic models with the keywords and see what happens: we can join this data frame to the keyword data frame and find out which keywords are associated with which topic.

lda_gamma <- full_join(lda_gamma, nasakeyword, by = c("document" = "id"))
lda_gamma
## # A tibble: 4,050,143 × 4
##                    document topic        gamma                     keyword
##                       <chr> <int>        <dbl>                       <chr>
## 1  55942a8ec63a7fe59b4986ef     1 4.883194e-06        JOHNSON SPACE CENTER
## 2  55942a8ec63a7fe59b4986ef     1 4.883194e-06                     PROJECT
## 3  55942a8ec63a7fe59b4986ef     1 4.883194e-06                   COMPLETED
## 4  56cf5b00a759fdadc44e564a     1 8.730182e-06                    DASHLINK
## 5  56cf5b00a759fdadc44e564a     1 8.730182e-06                        AMES
## 6  56cf5b00a759fdadc44e564a     1 8.730182e-06                        NASA
## 7  55942a89c63a7fe59b4982d9     1 2.788193e-05 GODDARD SPACE FLIGHT CENTER
## 8  55942a89c63a7fe59b4982d9     1 2.788193e-05                     PROJECT
## 9  55942a89c63a7fe59b4982d9     1 2.788193e-05                   COMPLETED
## 10 56cf5b00a759fdadc44e55cd     1 1.706525e-05                    DASHLINK
## # ... with 4,050,133 more rows

Let’s keep each document that was modeled as belonging to a topic with a probability \(> 0.9\), and then find the top keywords for each topic.

top_keywords <- lda_gamma %>% filter(gamma > 0.9) %>% 
    group_by(topic, keyword) %>% 
    count(keyword, sort = TRUE)
top_keywords
## Source: local data frame [1,159 x 3]
## Groups: topic [32]
## 
##    topic       keyword     n
##    <int>         <chr> <int>
## 1     20   OCEAN COLOR  4480
## 2     20  OCEAN OPTICS  4480
## 3     20        OCEANS  4480
## 4     26   OCEAN COLOR  1216
## 5     26  OCEAN OPTICS  1216
## 6     26        OCEANS  1216
## 7     13 EARTH SCIENCE   853
## 8     32 EARTH SCIENCE   770
## 9     19   OCEAN COLOR   768
## 10    19  OCEAN OPTICS   768
## # ... with 1,149 more rows

Let’s do a visualization for these as well.

top_keywords <- top_keywords %>%
    top_n(10, n)
ggplot(top_keywords, aes(keyword, n, fill = as.factor(topic))) +
    geom_bar(stat = "identity", show.legend = FALSE, alpha = 0.8) +
    labs(title = "Top 10 Keywords for Each LDA Topic",
         subtitle = "Topic modeling of NASA metadata description field texts",
         caption = "NASA metadata from https://data.nasa.gov/data.json",
         x = NULL, y = "Number of documents") +
    coord_flip() +
    facet_wrap(~topic, ncol = 2, scales = "free") +
    theme_tufte(base_family = "Arial", ticks = FALSE) +
    scale_y_continuous(expand=c(0,0)) +
    theme(strip.text=element_text(hjust=0)) +
    theme(plot.caption=element_text(size=9))

We are seeing even MORE topics that have duplicate sets of keywords now than we did with fewer topics.
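To put a rough number on that impression (a sketch I am adding here, not part of the original analysis), we could measure how much each pair of topics’ top keyword sets overlap using Jaccard similarity on the top_keywords data frame built above:

keyword_sets <- top_keywords %>%
    group_by(topic) %>%
    summarise(keywords = list(keyword))   # one row per topic, list-column of its top keywords
topic_pairs <- combn(nrow(keyword_sets), 2)
jaccard <- apply(topic_pairs, 2, function(p) {
    a <- keyword_sets$keywords[[p[1]]]
    b <- keyword_sets$keywords[[p[2]]]
    length(intersect(a, b)) / length(union(a, b))
})
summary(jaccard)  # values near 1 mean a pair of topics shares almost all of its top keywords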