There are 32,000+ datasets at NASA, and NASA is interested in understanding the connections between these datasets. Metadata about the NASA datasets is available online in JSON format; in this report, we’ll focus on the description and keyword fields. I have started using topic modeling to classify the description fields and connect them to the keywords; here, let’s explore how many topics make the most sense.

What is Topic Modeling?

To review, topic modeling is a method for unsupervised classification of documents; this method models each document as a mixture of topics and each topic as a mixture of words. The method I’ll be using here is called latent Dirichlet allocation (LDA), but there are other approaches to fitting a topic model. Here, each dataset description is a document; we are going to see if we can model these description texts as mixtures of topics.
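As a toy illustration of that generative idea, with made-up numbers (two hypothetical topics and a three-word vocabulary, not output from any model below): a document that is 70% topic 1 and 30% topic 2 has an overall word distribution that is the corresponding weighted average of the two topics’ word distributions.

# made-up numbers, purely to illustrate the mixture idea
doc_topics  <- c(topic1 = 0.7, topic2 = 0.3)
# each topic is itself a probability distribution over words
topic_words <- rbind(topic1 = c(ocean = 0.5, ice = 0.4, soil = 0.1),
                     topic2 = c(ocean = 0.1, ice = 0.2, soil = 0.7))
# the document's word distribution is the topic-weighted average
drop(doc_topics %*% topic_words)
## ocean   ice  soil 
##  0.38  0.34  0.28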

Getting and Wrangling the NASA Metadata

Let’s download the metadata for the 32,000+ NASA datasets and set up data frames for the descriptions and keywords, similarly to how I’ve done this in the past.

library(jsonlite)
library(dplyr)
library(tidyr)
metadata <- fromJSON("https://data.nasa.gov/data.json")
names(metadata$dataset)
##  [1] "_id"                "@type"              "accessLevel"        "accrualPeriodicity"
##  [5] "bureauCode"         "contactPoint"       "description"        "distribution"      
##  [9] "identifier"         "issued"             "keyword"            "landingPage"       
## [13] "language"           "modified"           "programCode"        "publisher"         
## [17] "spatial"            "temporal"           "theme"              "title"             
## [21] "license"            "isPartOf"           "references"         "rights"            
## [25] "describedBy"
# one data frame of descriptions and one of keywords, with one row per keyword
nasadesc <- data_frame(id = metadata$dataset$`_id`$`$oid`, desc = metadata$dataset$description)
nasakeyword <- data_frame(id = metadata$dataset$`_id`$`$oid`, 
                          keyword = metadata$dataset$keyword) %>%
    unnest(keyword)
# make the keywords uniformly uppercase so differently cased duplicates collapse
nasakeyword <- nasakeyword %>% mutate(keyword = toupper(keyword))

Just to check on things, what are the most common keywords?

nasakeyword %>% group_by(keyword) %>% count(sort = TRUE)
## # A tibble: 1,616 × 2
##                    keyword     n
##                      <chr> <int>
## 1            EARTH SCIENCE 14386
## 2                   OCEANS 10033
## 3                  PROJECT  7463
## 4             OCEAN OPTICS  7324
## 5               ATMOSPHERE  7323
## 6              OCEAN COLOR  7270
## 7                COMPLETED  6452
## 8  ATMOSPHERIC WATER VAPOR  3142
## 9             LAND SURFACE  2720
## 10               BIOSPHERE  2449
## # ... with 1,606 more rows

Making a DocumentTermMatrix

To do the topic modeling, we need to make a DocumentTermMatrix, a special kind of matrix from the tm package (the general concept of a document-term matrix is not specific to tm, of course). Rows correspond to documents (description texts, in our case) and columns correspond to terms (i.e., words); it is a sparse matrix, and the values are typically word counts (although they can also be tf-idf weights).

Let’s clean up the text a bit, using custom stop words to remove some of the nonsense “words” left over from HTML and other character encoding.

library(tidytext)
# 15 HTML/encoding leftovers plus the digits 1 through 10 = 25 custom stop words
mystop_words <- bind_rows(stop_words, 
                          data_frame(word = c("nbsp", "amp", "gt", "lt",
                                              "timesnewromanpsmt", "font",
                                              "td", "li", "br", "tr", "quot",
                                              "st", "img", "src", "strong",
                                              as.character(1:10)), 
                                     lexicon = rep("custom", 25)))
word_counts <- nasadesc %>% unnest_tokens(word, desc) %>%
    anti_join(mystop_words) %>%
    count(id, word, sort = TRUE) %>%
    ungroup()
word_counts
## # A tibble: 1,909,247 × 3
##                          id     word     n
##                       <chr>    <chr> <int>
## 1  55942a8ec63a7fe59b4986ef     suit    82
## 2  55942a8ec63a7fe59b4986ef    space    69
## 3  56cf5b00a759fdadc44e564a     data    41
## 4  56cf5b00a759fdadc44e564a     leak    40
## 5  56cf5b00a759fdadc44e564a     tree    39
## 6  55942a8ec63a7fe59b4986ef pressure    34
## 7  55942a8ec63a7fe59b4986ef   system    34
## 8  55942a89c63a7fe59b4982d9       em    32
## 9  55942a8ec63a7fe59b4986ef       al    32
## 10 55942a8ec63a7fe59b4986ef    human    31
## # ... with 1,909,237 more rows

Now let’s make the DocumentTermMatrix.

desc_dtm <- word_counts %>%
  cast_dtm(id, word, n)
desc_dtm
## <<DocumentTermMatrix (documents: 32003, terms: 35915)>>
## Non-/sparse entries: 1909247/1147478498
## Sparsity           : 100%
## Maximal term length: 166
## Weighting          : term frequency (tf)
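(As an aside: LDA needs the raw integer counts we just used, but if you wanted the tf-idf flavor of this matrix for something else, cast_dtm() can pass a weighting function through to tm. A minimal sketch:)

# not used for the LDA below; the same counts cast with tf-idf weighting instead
desc_tfidf <- word_counts %>%
    cast_dtm(id, word, n, weighting = tm::weightTfIdf)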

LDA Topic Modeling

Now let’s use the topicmodels package to create an LDA model. How many topics will we tell the algorithm to make? This question is much like the one in \(k\)-means clustering: we don’t really know ahead of time. In previous explorations, I tried 8 topics, 16 topics, and 32 topics; this time around, we’re going to try 64.

library(topicmodels)
desc_lda <- LDA(desc_dtm, k = 64, control = list(seed = 1234))
desc_lda
## A LDA_VEM topic model with 64 topics.

We have done it! We have modeled topics! LDA is a stochastic algorithm that can give different results depending on where it starts, so I set a seed for reproducibility. Eventually, we’ll need to see how robust the topic modeling is.
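One more quantitative way to approach the “how many topics?” question is perplexity, which the topicmodels package computes via perplexity(); lower is better, though it typically keeps improving as \(k\) grows, so you look for where the gains level off. A sketch (this would be slow on a dataset this size, and the candidate values of \(k\) here are just my choices):

# fit models at several values of k and compare perplexity on the training data
ks <- c(8, 16, 32, 64)
models <- lapply(ks, function(k) LDA(desc_dtm, k = k, control = list(seed = 1234)))
data_frame(k = ks, perplexity = sapply(models, perplexity))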

Exploring the Modeling

Let’s use the amazing/wonderful broom package to tidy the model, and see what we can find out.

library(broom)
tidy_lda <- tidy(desc_lda)
tidy_lda
## # A tibble: 2,298,560 × 3
##    topic  term          beta
##    <int> <chr>         <dbl>
## 1      1  suit  3.720076e-44
## 2      2  suit 7.885956e-149
## 3      3  suit 1.171872e-123
## 4      4  suit 6.014392e-116
## 5      5  suit  3.720076e-44
## 6      6  suit 4.679277e-158
## 7      7  suit  1.404045e-86
## 8      8  suit  4.646238e-63
## 9      9  suit 5.659340e-102
## 10    10  suit  8.964041e-39
## # ... with 2,298,550 more rows

The column \(\beta\) tells us the probability of that term being generated from that topic. Notice that some of these values are very, very low, and some are not so low.
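Since, within each topic, the \(\beta\) values form a probability distribution over the whole vocabulary, a quick sanity check is that they sum to 1 for every topic:

# each topic's betas should sum to (essentially) 1
tidy_lda %>%
    group_by(topic) %>%
    summarize(total = sum(beta))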

What are the top 10 terms for each topic?

top_terms <- tidy_lda %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms
## # A tibble: 640 × 3
##    topic        term        beta
##    <int>       <chr>       <dbl>
## 1      1        data 0.049960363
## 2      1         ice 0.032896920
## 3      1          km 0.013051631
## 4      1         sea 0.012222144
## 5      1        snow 0.011629998
## 6      1      nimbus 0.011326717
## 7      1  resolution 0.010902940
## 8      1  radiometer 0.008732614
## 9      1 temperature 0.008714210
## 10     1         set 0.008303329
## # ... with 630 more rows

Let’s look at this visually.

library(ggplot2)
library(ggthemes)
ggplot(top_terms, aes(term, beta, fill = as.factor(topic))) +
    geom_bar(stat = "identity", show.legend = FALSE, alpha = 0.8) +
    coord_flip() +
    labs(title = "Top 10 Terms in Each LDA Topic",
         subtitle = "Topic modeling of NASA metadata description field texts",
         caption = "NASA metadata from https://data.nasa.gov/data.json",
         x = NULL, y = "beta") +
    facet_wrap(~topic, ncol = 4, scales = "free") +
    theme_tufte(base_family = "Arial", ticks = FALSE) +
    scale_y_continuous(expand=c(0,0)) +
    theme(strip.text=element_text(hjust=0)) +
    theme(plot.caption=element_text(size=9))
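(One quibble with my own plot: within each facet, the terms end up in alphabetical order rather than ordered by \(\beta\), because ggplot2 converts the character term column to a factor alphabetically. Reordering the factor first, as sketched below, sorts the bars by overall \(\beta\); it’s not a perfect within-facet ordering, but it reads better.)

# optional tweak before plotting: order the term factor by beta
top_terms <- top_terms %>%
    mutate(term = reorder(term, beta))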

We see some of the same patterns as before, and continue to see meaningful differences between these collections of terms.

Which Topic Does Each Document Belong To?

Let’s find out which topics are associated with which description fields (i.e., documents).

lda_gamma <- tidy(desc_lda, matrix = "gamma")
lda_gamma
## # A tibble: 2,048,192 × 3
##                    document topic        gamma
##                       <chr> <int>        <dbl>
## 1  55942a8ec63a7fe59b4986ef     1 3.518039e-06
## 2  56cf5b00a759fdadc44e564a     1 6.289214e-06
## 3  55942a89c63a7fe59b4982d9     1 2.008069e-05
## 4  56cf5b00a759fdadc44e55cd     1 1.229234e-05
## 5  55942a89c63a7fe59b4982c6     1 3.615466e-05
## 6  55942a86c63a7fe59b498077     1 3.099995e-05
## 7  56cf5b00a759fdadc44e56f8     1 2.578482e-05
## 8  55942a8bc63a7fe59b4984b5     1 2.357432e-05
## 9  55942a6ec63a7fe59b496bf7     1 2.288197e-05
## 10 55942a8ec63a7fe59b4986f6     1 1.575061e-05
## # ... with 2,048,182 more rows
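As a mirror image of the sanity check on \(\beta\): each document is modeled as a mixture of all 64 topics, so its \(\gamma\) values should sum to 1.

# each document's gammas should sum to (essentially) 1 across the 64 topics
lda_gamma %>%
    group_by(document) %>%
    summarize(total = sum(gamma))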

The column \(\gamma\) here is the probability that each document belongs in each topic. Notice that some are very low and some are higher. How are the probabilities distributed?

ggplot(lda_gamma, aes(gamma, fill = as.factor(topic))) +
    geom_histogram(show.legend = FALSE, alpha = 0.8) +
    facet_wrap(~topic, ncol = 4) +
    labs(title = "Distribution of Probability for Each Topic",
         subtitle = "Topic modeling of NASA metadata description field texts",
         caption = "NASA metadata from https://data.nasa.gov/data.json",
         y = NULL, x = "gamma") +
    scale_y_log10() +
    theme_minimal(base_family = "Arial") +
    theme(strip.text=element_text(hjust=0)) +
    theme(plot.caption=element_text(size=9))

The y-axis is plotted here on a log scale so that we can see the shape of these distributions at all. These are looking pretty bad now, in my opinion; many of them are flat, or even slope downward as you move from \(\gamma = 0\) to \(\gamma = 1\). We have definitely gotten into a range here where documents are not being sorted into topics with reliable probability. So, 64 topics is definitely too many!
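One way to put a number on that impression (the 0.9 cutoff here is an arbitrary choice of mine): what fraction of documents have any topic with \(\gamma\) above 0.9, i.e., get assigned to some topic with high probability?

# fraction of documents whose single best topic has gamma > 0.9
lda_gamma %>%
    group_by(document) %>%
    summarize(max_gamma = max(gamma)) %>%
    summarize(frac_confident = mean(max_gamma > 0.9))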