The term “data scientist” was coined sometime in 2008 by DJ Patil and Jeff Hammerbacher.1 But what exactly is a data scientist? A statistician? A computer scientist? A data analyst? It’s an ongoing debate, but there are some great ways to get acquainted with this elusive role and inform your own decisions as you navigate the job market:

My take:

Among the many definitions of “data science” that exist, there remains an overarching theme: the ability to make discoveries from big data. Whether that information lives within the conventional tidy framework, a matrix organized by observations and variables, or outside of this structure, if you can answer questions with a mashup of analytics on this data, you are a data scientist.

1 Intro to topic modeling

What is a topic model…?

The secret to getting a PhD –> Wikipedia!

In summary, it boils down to a latent variable model: we’re trying to figure out the hidden topics in a collection of documents when all we see is the unstructured text. Given the untidy nature of text data, many topic models rely on one simplification to give the data structure: representing each document as a “bag of words.” This assumption allows us to ignore the order in which words are written and to treat the words in one bag, or one document, as independent of the words in another bag. We then only care about how frequently each unique word appears in each document, which lets us construct a document-term matrix where each document has its own row and each unique word has its own column. Once we have a matrix, we can feed our data into familiar tools for data analysis.
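
To make this concrete, here is a minimal sketch that turns three made-up toy documents into a document-term matrix, using the dplyr, tibble, and tidytext packages (the toy text and column names are my own; cast_dtm produces the same DocumentTermMatrix class that topicmodels expects later):

library(dplyr) # For count
library(tibble) # For tibble
library(tidytext) # For unnest_tokens and cast_dtm

# Three toy "documents" (entirely made up for illustration)
toy <- tibble(doc  = c("d1", "d2", "d3"),
              text = c("the cat sat on the mat",
                       "the dog chased the cat",
                       "topic models love word counts"))

toy %>%
  unnest_tokens(word, text) %>% # one row per word: the "bag of words"
  count(doc, word) %>%          # word frequencies per document
  cast_dtm(doc, word, n)        # rows = documents, columns = unique words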

1.1 Latent Dirichlet Allocation

Given the extensive amount of work done in the realm of topic modeling, we’re going to focus on one of the simplest models: Latent Dirichlet Allocation, which we will just call “LDA.”

Let’s summarize Blei’s LDA model, his generative model for how a collection of documents is produced (a small simulation sketch follows the list):

  1. Input the number of topics \(T\), the number of documents \(D\), and the hyperparameters \(\alpha\), \(\beta\), \(\eta\)
  2. For each topic \(t = \{1, \ldots, T\}\):
  • Sample the topic’s distribution over a fixed vocabulary \(\phi^{(t)} \sim \text{Dir}(\beta)\)
  3. For each document \(d = \{1, \ldots, D\}\):
  • Sample the document’s distribution over all topics \(\theta^{(d)} \sim \text{Dir}(\alpha)\)
  • Sample the document’s length from a Poisson, \(N_d \sim \text{Poi}(\eta)\)
  • For each word \(i = \{1, \ldots, N_d\}\) in document \(d\), sample a topic assignment \(z_i \sim \text{Mult}(\theta^{(d)})\) and, from that topic, sample a word \(w_i \sim \text{Mult}(\phi^{(z_i)})\)
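
Here is a minimal simulation sketch of the generative process above, using toy sizes and a hand-rolled Dirichlet sampler in base R (all variable names and values are my own):

set.seed(1)
n_topics <- 3; n_docs <- 5; n_vocab <- 8  # toy values of T, D, and the vocabulary size
alpha <- rep(0.5, n_topics)               # symmetric Dirichlet prior over topics
beta  <- rep(0.1, n_vocab)                # symmetric Dirichlet prior over words
eta   <- 20                               # Poisson mean for document length
rdirichlet <- function(a) { x <- rgamma(length(a), shape = a); x / sum(x) }

# Step 2: one distribution over the vocabulary per topic, phi
phi <- t(replicate(n_topics, rdirichlet(beta)))
# Step 3: for each document, draw theta, a length, topic assignments, and words
docs <- lapply(seq_len(n_docs), function(d) {
  theta <- rdirichlet(alpha)
  N_d   <- rpois(1, eta)
  z     <- sample(seq_len(n_topics), N_d, replace = TRUE, prob = theta)
  vapply(z, function(t) sample(seq_len(n_vocab), 1, prob = phi[t, ]), integer(1))
})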

There are a number of inference techniques we can then leverage to solve for the quantities we’re actually interested in: the document-topic distribution and the topic-word distribution (a story for another day…)

2 Using tidytext

This section focuses on the practical implementation of a topic model.

2.1 Find a dataset

There are plenty of text datasets out there. Identify a text collection that you find interesting, or assemble one yourself by scraping the web with R! Some useful dataset sites:

The dataset we’ll work through today, a collection of New York Times comments, was uploaded by Aashita Kesarwani. We focus specifically on the comments posted between January and April 2018.

2.2 Load R packages

For ease of data cleaning and analysis, we’ll be working with the following libraries:

library(readr) # To read in csvs 
library(purrr) # For map and reduce
library(dplyr) # For wrangling with data frames
library(tidytext) # For word tokenization and extraction
library(ggplot2) # For viz
library(stringr) # For string extraction
library(tidyr) # For spread/gather 
library(topicmodels) # For our topic model

2.3 Get the meta-data

Since our comment and article csv files are split up by month, here’s a quick way to combine them: (1) list all the relevant files; (2) combine them using map and reduce from purrr.

filenames <- list.files("~/workspace/metar/real/data",
                        full.names = TRUE,
                        pattern = "Articles")
try(allArticles <- filenames %>%
  map(read_csv) %>%
  reduce(rbind))
#> Parsed with column specification:
#> cols(
#>   articleID = col_character(),
#>   articleWordCount = col_integer(),
#>   byline = col_character(),
#>   documentType = col_character(),
#>   headline = col_character(),
#>   keywords = col_character(),
#>   multimedia = col_integer(),
#>   newDesk = col_character(),
#>   printPage = col_integer(),
#>   pubDate = col_datetime(format = ""),
#>   sectionName = col_character(),
#>   snippet = col_character(),
#>   source = col_character(),
#>   typeOfMaterial = col_character(),
#>   webURL = col_character()
#> )
#> Parsed with column specification:
#> cols(
#>   articleID = col_character(),
#>   byline = col_character(),
#>   documentType = col_character(),
#>   headline = col_character(),
#>   keywords = col_character(),
#>   multimedia = col_integer(),
#>   newDesk = col_character(),
#>   printPage = col_integer(),
#>   pubDate = col_datetime(format = ""),
#>   sectionName = col_character(),
#>   snippet = col_character(),
#>   source = col_character(),
#>   typeOfMaterial = col_character(),
#>   webURL = col_character(),
#>   articleWordCount = col_integer()
#> )
#> Parsed with column specification:
#> cols(
#>   abstract = col_character(),
#>   articleID = col_character(),
#>   articleWordCount = col_integer(),
#>   byline = col_character(),
#>   documentType = col_character(),
#>   headline = col_character(),
#>   keywords = col_character(),
#>   multimedia = col_integer(),
#>   newDesk = col_character(),
#>   printPage = col_integer(),
#>   pubDate = col_datetime(format = ""),
#>   sectionName = col_character(),
#>   snippet = col_character(),
#>   source = col_character(),
#>   typeOfMaterial = col_character(),
#>   webURL = col_character()
#> )
#> Parsed with column specification:
#> cols(
#>   articleID = col_character(),
#>   byline = col_character(),
#>   documentType = col_character(),
#>   headline = col_character(),
#>   keywords = col_character(),
#>   multimedia = col_integer(),
#>   newDesk = col_character(),
#>   printPage = col_integer(),
#>   pubDate = col_datetime(format = ""),
#>   sectionName = col_character(),
#>   snippet = col_character(),
#>   source = col_character(),
#>   typeOfMaterial = col_character(),
#>   webURL = col_character(),
#>   articleWordCount = col_integer()
#> )

You should see an error message pop up. This is the beauty of working with R! Note that one of the csv files has an additional abstract column. We can thus move forward with combining the csv files that do not have any conflicts, and then merge in our last csv file, the “anomaly.”

# Figure out what went wrong 
readr::read_csv(filenames[1]) %>%
  summary()
#> Parsed with column specification:
#> cols(
#>   articleID = col_character(),
#>   articleWordCount = col_integer(),
#>   byline = col_character(),
#>   documentType = col_character(),
#>   headline = col_character(),
#>   keywords = col_character(),
#>   multimedia = col_integer(),
#>   newDesk = col_character(),
#>   printPage = col_integer(),
#>   pubDate = col_datetime(format = ""),
#>   sectionName = col_character(),
#>   snippet = col_character(),
#>   source = col_character(),
#>   typeOfMaterial = col_character(),
#>   webURL = col_character()
#> )
#>   articleID         articleWordCount    byline          documentType      
#>  Length:1324        Min.   :  53     Length:1324        Length:1324       
#>  Class :character   1st Qu.: 811     Class :character   Class :character  
#>  Mode  :character   Median :1103     Mode  :character   Mode  :character  
#>                     Mean   :1177                                          
#>                     3rd Qu.:1378                                          
#>                     Max.   :9887                                          
#>    headline           keywords           multimedia      newDesk         
#>  Length:1324        Length:1324        Min.   : 0.00   Length:1324       
#>  Class :character   Class :character   1st Qu.:66.00   Class :character  
#>  Mode  :character   Mode  :character   Median :68.00   Mode  :character  
#>                                        Mean   :65.53                     
#>                                        3rd Qu.:68.00                     
#>                                        Max.   :68.00                     
#>    printPage          pubDate                    sectionName       
#>  Min.   :  0.000   Min.   :2018-03-29 14:26:01   Length:1324       
#>  1st Qu.:  0.000   1st Qu.:2018-04-06 06:03:14   Class :character  
#>  Median :  1.000   Median :2018-04-14 12:15:00   Mode  :character  
#>  Mean   :  6.387   Mean   :2018-04-14 19:37:45                     
#>  3rd Qu.: 11.000   3rd Qu.:2018-04-23 10:00:01                     
#>  Max.   :109.000   Max.   :2018-05-01 09:00:09                     
#>    snippet             source          typeOfMaterial    
#>  Length:1324        Length:1324        Length:1324       
#>  Class :character   Class :character   Class :character  
#>  Mode  :character   Mode  :character   Mode  :character  
#>                                                          
#>                                                          
#>                                                          
#>     webURL         
#>  Length:1324       
#>  Class :character  
#>  Mode  :character  
#>                    
#>                    
#> 
readr::read_csv(filenames[2]) %>%
  dim() 
#> Parsed with column specification:
#> cols(
#>   articleID = col_character(),
#>   byline = col_character(),
#>   documentType = col_character(),
#>   headline = col_character(),
#>   keywords = col_character(),
#>   multimedia = col_integer(),
#>   newDesk = col_character(),
#>   printPage = col_integer(),
#>   pubDate = col_datetime(format = ""),
#>   sectionName = col_character(),
#>   snippet = col_character(),
#>   source = col_character(),
#>   typeOfMaterial = col_character(),
#>   webURL = col_character(),
#>   articleWordCount = col_integer()
#> )
#> [1] 1155   15
readr::read_csv(filenames[3]) %>%
  dim() 
#> Parsed with column specification:
#> cols(
#>   abstract = col_character(),
#>   articleID = col_character(),
#>   articleWordCount = col_integer(),
#>   byline = col_character(),
#>   documentType = col_character(),
#>   headline = col_character(),
#>   keywords = col_character(),
#>   multimedia = col_integer(),
#>   newDesk = col_character(),
#>   printPage = col_integer(),
#>   pubDate = col_datetime(format = ""),
#>   sectionName = col_character(),
#>   snippet = col_character(),
#>   source = col_character(),
#>   typeOfMaterial = col_character(),
#>   webURL = col_character()
#> )
#> [1] 905  16
readr::read_csv(filenames[4]) %>%
  dim() 
#> Parsed with column specification:
#> cols(
#>   articleID = col_character(),
#>   byline = col_character(),
#>   documentType = col_character(),
#>   headline = col_character(),
#>   keywords = col_character(),
#>   multimedia = col_integer(),
#>   newDesk = col_character(),
#>   printPage = col_integer(),
#>   pubDate = col_datetime(format = ""),
#>   sectionName = col_character(),
#>   snippet = col_character(),
#>   source = col_character(),
#>   typeOfMaterial = col_character(),
#>   webURL = col_character(),
#>   articleWordCount = col_integer()
#> )
#> [1] 1385   15
# Read in this anomaly separately
anomaly <- readr::read_csv(filenames[3]) %>% 
  dplyr::select(-abstract)
#> Parsed with column specification:
#> cols(
#>   abstract = col_character(),
#>   articleID = col_character(),
#>   articleWordCount = col_integer(),
#>   byline = col_character(),
#>   documentType = col_character(),
#>   headline = col_character(),
#>   keywords = col_character(),
#>   multimedia = col_integer(),
#>   newDesk = col_character(),
#>   printPage = col_integer(),
#>   pubDate = col_datetime(format = ""),
#>   sectionName = col_character(),
#>   snippet = col_character(),
#>   source = col_character(),
#>   typeOfMaterial = col_character(),
#>   webURL = col_character()
#> )
# Let's redeem ourselves
filenames <- list.files("~/workspace/metar/real/data",
                        full.names = TRUE,
                        pattern = "Articles")
allArticles <- filenames[c(1, 2, 4)] %>%
  map(read_csv) %>%
  reduce(rbind) %>%
  bind_rows(anomaly) %>%
  select(articleID,
         pubDate,
         headline,
         keywords,
         newDesk,
         sectionName)
#> Parsed with column specification:
#> cols(
#>   articleID = col_character(),
#>   articleWordCount = col_integer(),
#>   byline = col_character(),
#>   documentType = col_character(),
#>   headline = col_character(),
#>   keywords = col_character(),
#>   multimedia = col_integer(),
#>   newDesk = col_character(),
#>   printPage = col_integer(),
#>   pubDate = col_datetime(format = ""),
#>   sectionName = col_character(),
#>   snippet = col_character(),
#>   source = col_character(),
#>   typeOfMaterial = col_character(),
#>   webURL = col_character()
#> )
#> Parsed with column specification:
#> cols(
#>   articleID = col_character(),
#>   byline = col_character(),
#>   documentType = col_character(),
#>   headline = col_character(),
#>   keywords = col_character(),
#>   multimedia = col_integer(),
#>   newDesk = col_character(),
#>   printPage = col_integer(),
#>   pubDate = col_datetime(format = ""),
#>   sectionName = col_character(),
#>   snippet = col_character(),
#>   source = col_character(),
#>   typeOfMaterial = col_character(),
#>   webURL = col_character(),
#>   articleWordCount = col_integer()
#> )
#> Parsed with column specification:
#> cols(
#>   articleID = col_character(),
#>   byline = col_character(),
#>   documentType = col_character(),
#>   headline = col_character(),
#>   keywords = col_character(),
#>   multimedia = col_integer(),
#>   newDesk = col_character(),
#>   printPage = col_integer(),
#>   pubDate = col_datetime(format = ""),
#>   sectionName = col_character(),
#>   snippet = col_character(),
#>   source = col_character(),
#>   typeOfMaterial = col_character(),
#>   webURL = col_character(),
#>   articleWordCount = col_integer()
#> )
# Check out what we have 
glimpse(allArticles)
#> Observations: 4,769
#> Variables: 6
#> $ articleID   <chr> "5adf6684068401528a2aa69b", "5adf653f068401528a2aa69…
#> $ pubDate     <dttm> 2018-04-24 17:16:49, 2018-04-24 17:11:21, 2018-04-2…
#> $ headline    <chr> "Former N.F.L. Cheerleaders’ Settlement Offer: $1 an…
#> $ keywords    <chr> "['Workplace Hazards and Violations', 'Football', 'C…
#> $ newDesk     <chr> "Sports", "Climate", "Dining", "Washington", "Foreig…
#> $ sectionName <chr> "Pro Football", "Unknown", "Unknown", "Europe", "Can…

2.4 Get the data

We repeat the same process with our actual text data, the comments associated with each article.

filenames <- list.files("~/workspace/metar/real/data",
                        full.names = TRUE,
                        pattern = "Comments")
allComments <- filenames %>%
  map(read_csv) %>%
  reduce(rbind) %>%
  select(articleID,
         commentBody,
         commentID,
         recommendations) 

Given how long it takes to read in and parse together the article and comment files, let’s save the results using the saveRDS function so we don’t have to repeat this computationally intensive process.

saveRDS(allComments, file = "~/workspace/metar/statclub/data/allComments.rds")
saveRDS(allArticles, file = "~/workspace/metar/statclub/data/allArticles.rds")

2.5 Clean the data

First, let’s do some sanity checks on our data. Each comment should have its own unique ID; otherwise, it’s likely a duplicate, which we should remove.

com <- readRDS("~/workspace/metar/statclub/data/allComments.rds")
cat("The dimension of our comment matrix is: ", dim(com), "\n")
#> The dimension of our comment matrix is:  930320 4
cat("... while the number of unique comment ids are:", length(unique(com$commentID)), "\n")
#> ... while the number of unique comment ids are: 917736
# Peek at a few duplicated comment IDs
dup <- com %>%
  filter(duplicated(commentID)) %>%
  slice(1:5)
# Inspect one duplicated comment in full
ex <- com %>%
  filter(commentID == 26559756)
ex
#> # A tibble: 2 x 4
#>   articleID       commentBody                     commentID recommendations
#>   <chr>           <chr>                               <dbl>           <int>
#> 1 5abcf77c068401… "\"... there should be a bette…  26559756               2
#> 2 5abcf77c47de81… "\"... there should be a bette…  26559756               0

An easy way to remove such duplicate copies:

com <- com %>%
  distinct(commentID, .keep_all = TRUE)

Depending on the analysis you wish to carry out, you might have to remove specific strings:

com <- com %>%
  mutate(commentBody = gsub("<br/>", "", commentBody)) %>%
  mutate(commentBody = gsub("a/k/a", "", commentBody)) %>%
  mutate(commentBody = gsub("_blank", "", commentBody)) %>%
  mutate(commentBody = gsub("<.*?>", "", commentBody)) %>%
  mutate(commentBody = gsub("http.*?com/", "", commentBody)) %>%
  mutate(commentBody = gsub("'s", "", commentBody))
saveRDS(com, file = "~/workspace/metar/statclub/data/allCommentsCleaned.rds")

You should also sample your data to make sure you’ve covered any edge cases that weren’t captured by your initial data screen.
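
For example, a quick spot check might look like this (a sketch; the sample size is arbitrary):

# Eyeball a handful of random comments to catch leftover HTML, URLs, etc.
set.seed(123)
com %>%
  sample_n(5) %>%
  pull(commentBody)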

2.6 Structure our text

A lot of what we’ll do to turn our unstructured comments into a document-term matrix is covered in more detail in Drs. Silge and Robinson’s Text Mining with R. If today’s exercise is in any way interesting, this is a great free resource to peruse.

To get our desired document-term matrix to feed as input into a topic model, we carry out the following steps:

  1. Tokenize. Each comment is split from its single cell into multiple rows, with each word of the comment appearing in its own row.
  2. Anti-join. Remove the frequently occurring words listed in the data(stop_words) dictionary, as they have low discriminatory power.
  3. Clean words. Ensure we are extracting pure text, as UTF-8 encoding sometimes includes extra characters around words that have been italicized or bolded.
  4. Filter. Remove words that you (and the domain expert) might consider irrelevant to a decomposition into topics.

comCleaned <- readRDS("~/workspace/metar/statclub/data/allCommentsCleaned.rds")
comFiltered <- comCleaned %>%
  unnest_tokens(word, commentBody) %>%             # 1. tokenize
  anti_join(stop_words) %>%                        # 2. remove stop words
  mutate(word = str_extract(word, "[a-z']+")) %>%  # 3. keep only the text
  mutate(word_l = str_length(word)) %>%
  filter(word_l >= 3)                              # 4. drop very short words
saveRDS(comFiltered, "~/workspace/metar/statclub/data/allCommentsFiltered.rds")

2.7 Some visualizations

As you decide what to prune from your data, it helps to visualize what’s going on. Some simple visualizations that can help in this process:

comFiltered <- readRDS("~/workspace/metar/statclub/data/allCommentsFiltered.rds")
# Visualize the most frequent words
comFiltered %>%
  count(word, sort = TRUE) %>%
  filter(n > 35000) %>% 
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) + 
  coord_flip()

Another popular visualization in the realm of NLP is offered by the wordcloud package:

library(wordcloud)
#> Loading required package: RColorBrewer
comFiltered %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, max.words = 150))

We can also design metrics to capture the tail behavior of the word frequencies by article:

# Sample an article at random, then fix the choice for reproducibility
sample_one <- sample(allArticles$articleID, 1)
sample_one <- "5ae7ca23068401528a2ab8e3"
comFiltered %>%
  filter(articleID == sample_one)  %>%
  count(articleID, word, sort = TRUE) %>%
  mutate(wordFreq = n/sum(n)) %>%
  ggplot(aes(x = wordFreq)) +
  geom_histogram(show.legend = FALSE) 
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.8 Training and Testing Split

Once you are comfortable with the processing of the text, we can split the collection into training and testing sets. There are a number of research articles on figuring out the best “split,” but in this example we simply go with common suggestions from the literature.

split <- round(dim(allArticles)[1]*.75)
cat("The number of articles in training will be: ", split, "\n")
#> The number of articles in training will be:  3577
cat("...The subsequent number of articles in testing will be: ", dim(allArticles)[1] - split, "\n")
#> ...The subsequent number of articles in testing will be:  1192
articlesTraining <- allArticles[1:split, ]
articlesTesting <- allArticles[(split+1):dim(allArticles)[1],]

Since we rely on the bag-of-words assumption, we can arrive at our document-term matrix simply by counting the frequency of each unique word within the set of comments associated with each article (so each article’s comments form one “document”).

trainingDTM <- comFiltered %>%
  select(articleID, word) %>%
  count(articleID, word) %>%
  filter(articleID %in% articlesTraining$articleID) %>%
  cast_dtm(articleID, word, n)
saveRDS(trainingDTM, "~/workspace/metar/statclub/data/trainingDTM.rds")
testingDTM <- comFiltered %>%
  select(articleID, word) %>%
  count(articleID, word) %>%
  filter(articleID %in% articlesTesting$articleID) %>%
  cast_dtm(articleID, word, n)
saveRDS(testingDTM, "~/workspace/metar/statclub/data/testingDTM.rds")

3 Modeling… finally!

One of the typical responsibilities of a “data scientist” or “machine learning engineer” (note that each organization defines these roles differently!) is model tuning. You want to optimize your hyperparameters so that your model achieves your end goal; in the case of topic modeling, that might be minimizing perplexity or maximizing some measure that captures a “good” decomposition into topics. A naive approach to hyperparameter optimization is to simply apply a grid search: set a grid of reasonable values for each parameter and try every one. If there is more than one parameter to optimize, train a topic model on each combination in the Cartesian product of these grids.

In the case of training an LDA model, we know that it is particularly sensitive to:

  • The number of topics, \(T\)
  • The hyperparameter on the symmetric Dirichlet prior for the topic-word distribution \(\beta\)
  • The hyperparameter on the symmetric Dirichlet prior for the document-topic distribution \(\alpha\)

To simplify the number of parameter combinations, we fix our number of topics \(T\) to 62. We then select among a finite set of LDA models in which:

  1. \(\beta \in \{0.01, 0.1, 1, 10, 100\}\)
  2. \(\alpha \in \{50/T, 100/T, T/T, 1/T\}\)

Some code to reproduce these results:

# Read in the training and testing data
train <- readRDS("~/workspace/metar/statclub/data/trainingDTM.rds")
test <- readRDS("~/workspace/metar/statclub/data/testingDTM.rds")
# Set up grid of values (alpha is expressed in terms of the number of topics k)
k <- 62
setup <- expand.grid(alpha = c(50/k, 100/k, k/k, 1/k),
                     delta = c(.01, .1, 1, 10, 100))
setup$perplex <- NA
setup$Phi <- vector("list", nrow(setup))
setup$Theta <- vector("list", nrow(setup))
# Train an LDA on each combination of parameters
for(i in 1:dim(setup)[1]){
  cat("Now working on setup #: ", i, "\n")
  # Define the control 
  controlGibbs <- list(alpha = setup$alpha[i],
                       delta = setup$delta[i],
                       iter = 1000, burnin = 0)
  # Feed into model 
  nytimesLDA <- LDA(train, k = k, method = "Gibbs", control = controlGibbs)
  # You can choose to save each model 
  saveRDS(nytimesLDA, paste0("~/workspace/metar/real/results/lda_", i, ".rds"))
  # Get posterior info for comparison
  post <- posterior(nytimesLDA)
  setup$Phi[[i]] <- post$terms 
  setup$Theta[[i]] <- post$topics
  # Evaluate perplexity
  cat("Now evaluating ... ", "\n")
  perplex_result <- perplexity(nytimesLDA, newdata = test, use_theta = FALSE)
  setup$perplex[i] <- perplex_result
}
# Save the output attached to the setup dataframe
saveRDS(setup, "~/workspace/metar/real/results/setup.rds")

Since the numeric output from our trained LDA models (i.e., the document-topic distribution and the topic-word distribution) can be difficult to decipher, we can carry out some qualitative review:
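
For instance, one way to start is to pick the setup with the lowest held-out perplexity and inspect its most probable words per topic. A sketch, assuming the objects saved above (the choice of ten terms is arbitrary):

# Pick the setup with the lowest held-out perplexity
setup <- readRDS("~/workspace/metar/real/results/setup.rds")
best <- which.min(setup$perplex)
bestLDA <- readRDS(paste0("~/workspace/metar/real/results/lda_", best, ".rds"))
# Top 10 words per topic, straight from topicmodels
terms(bestLDA, 10)
# Or tidy the topic-word distribution for further wrangling and plotting
tidy(bestLDA, matrix = "beta") %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  arrange(topic, desc(beta))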

4 More advanced R references

This module barely scratches the surface of the kinds of problems you are equipped to solve as a statistics major! Some of my favorite books:

“Behind every wildly successful tool there’s probably a very powerful abstraction.” (Jenny Bryan)


  1. This presentation was put together for NC State’s undergraduate stat club. Please reach out to jyu21@ncsu.edu if you find anything that requires clarification or references. Apologies in advance.