
Introduction

The New York Times, a New York-based daily newspaper founded in 1851, is one of the most popular news platforms worldwide. As of 2022, it had 740,000 paid print subscribers and 8.6 million digital subscribers (https://en.wikipedia.org/wiki/The_New_York_Times). The New York Times reports on current news events and also maintains news desks for popular culture, investigative journalism, education, scientific advancements, politics, and travel, among many others.

Topic modeling is a form of text mining that serves to identify patterns and topics within a text. Newspapers in particular are a popular and solid choice for topic modeling because, as daily publications, their topics change regularly (Brett, 2012).

1. PREPARE

1a. Data and Guiding Questions

The data set used for this topic modeling analysis was obtained from the open data website Kaggle. It contains information on over 16,000 articles with 11 features:

  • Newsdesk

  • Section

  • Subsection

  • Material

  • Headline

  • Abstract

  • Keywords

  • Word Count

  • Publication Date

  • Number of Comments

  • Unique ID

These articles were posted over the course of one year, from January 1, 2020 to December 31, 2020.

The goal of this study is to use the data set to model topics. To that end, the guiding questions are:

  1. What are the most popular topics based on article abstracts?
  2. What themes are present among these articles?

Both Latent Dirichlet Allocation (LDA) and the Structural Topic Model (STM) will be used to classify and identify these topics.

1b. Set-Up

Loading Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(SnowballC)
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help. 
##  Papers, resources, and other materials at structuraltopicmodel.com
library(knitr)
library(topicmodels)
library(dplyr)
library(ldatuning)
library(LDAvis)
library(wordcloud2)

2. WRANGLE

2a. Importing

nyt_articles <- read_csv("C:/Users/bethe/Downloads/nyt-articles-2020.csv")
## Rows: 16787 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): newsdesk, section, subsection, material, headline, abstract, keywo...
## dbl  (2): word_count, n_comments
## dttm (1): pub_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nyt_articles
## # A tibble: 16,787 × 11
##    newsdesk  section            subsec…¹ mater…² headl…³ abstr…⁴ keywo…⁵ word_…⁶
##    <chr>     <chr>              <chr>    <chr>   <chr>   <chr>   <chr>     <dbl>
##  1 Editorial Opinion            <NA>     Editor… Protec… Congre… "['Vet…     680
##  2 Games     Crosswords & Games <NA>     News    ‘It’s … Christ… "['Cro…     931
##  3 Science   Science            <NA>     News    Meteor… All ye… "['Met…    1057
##  4 Science   Science            <NA>     Intera… Sync y… Never … "['Spa…       0
##  5 Science   Science            <NA>     News    Rocket… A year… "['Spa…    1156
##  6 Foreign   World              Middle … News    Pro-Ir… Iran’s… "['Ira…    1120
##  7 Magazine  Magazine           <NA>     News    Judge … If you… "['Com…     153
##  8 Magazine  Magazine           <NA>     News    She Fe… After … "['Str…    1395
##  9 Magazine  Magazine           <NA>     News    These … Tangy … "['Coo…     754
## 10 Culture   Books              <NA>     Review  ‘Don’t… David … "[\"Do…     957
## # … with 16,777 more rows, 3 more variables: pub_date <dttm>, n_comments <dbl>,
## #   uniqueID <chr>, and abbreviated variable names ¹​subsection, ²​material,
## #   ³​headline, ⁴​abstract, ⁵​keywords, ⁶​word_count

2b. Creating Document Term Matrix

Tokenizing Text

articles_tidy <- nyt_articles %>%
  unnest_tokens(output = word, input = abstract) %>%
  anti_join(stop_words, by = "word")

articles_tidy
## # A tibble: 172,253 × 11
##    newsdesk  section subse…¹ mater…² headl…³ keywo…⁴ word_…⁵ pub_date           
##    <chr>     <chr>   <chr>   <chr>   <chr>   <chr>     <dbl> <dttm>             
##  1 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  2 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  3 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  4 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  5 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  6 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  7 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  8 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  9 Games     Crossw… <NA>    News    ‘It’s … ['Cros…     931 2020-01-01 03:00:10
## 10 Games     Crossw… <NA>    News    ‘It’s … ['Cros…     931 2020-01-01 03:00:10
## # … with 172,243 more rows, 3 more variables: n_comments <dbl>, uniqueID <chr>,
## #   word <chr>, and abbreviated variable names ¹​subsection, ²​material,
## #   ³​headline, ⁴​keywords, ⁵​word_count

Word Count

articles_tidy %>%
  count(word, sort = TRUE)
## # A tibble: 24,581 × 2
##    word            n
##    <chr>       <int>
##  1 coronavirus  1272
##  2 president    1207
##  3 pandemic     1036
##  4 trump         819
##  5 people        746
##  6 york          594
##  7 city          551
##  8 time          508
##  9 home          504
## 10 it’s          469
## # … with 24,571 more rows
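
The wordcloud2 package is loaded above but never called. As a minimal sketch, these counts could also be drawn as a word cloud; the 100-word cutoff is purely illustrative, and this assumes wordcloud2's usual two-column input of words and frequencies.

# Sketch only: wordcloud2 expects a data frame whose first two columns are
# the word and its frequency, which count() already provides.
articles_tidy %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 100, with_ties = FALSE) %>%  # illustrative cutoff
  wordcloud2()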

Creating Document Term Matrix

articles_dtm <- articles_tidy %>%
  count(uniqueID, word) %>%
  cast_dtm(uniqueID, word, n)

articles_dtm
## <<DocumentTermMatrix (documents: 16766, terms: 24581)>>
## Non-/sparse entries: 170737/411954309
## Sparsity           : 100%
## Maximal term length: NA
## Weighting          : term frequency (tf)

Based on this matrix, there are 16,766 documents and 24,581 terms. The sparsity is very high, rounding up to 100%.
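
As a sanity check, the reported sparsity can be recomputed from the matrix itself; a minimal sketch, not part of the original run:

# Sparsity is the share of zero cells in the documents x terms matrix. The DTM
# is stored as a sparse triplet matrix, so $v holds only the non-zero entries.
n_cells   <- prod(dim(articles_dtm))
n_nonzero <- length(articles_dtm$v)
1 - n_nonzero / n_cells   # roughly 0.9996, which prints as 100%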

Processing and Stemming

temp <- textProcessor(nyt_articles$abstract,
                      metadata = nyt_articles,
                      lowercase = TRUE,
                      removestopwords = TRUE,
                      removenumbers = TRUE,
                      removepunctuation = TRUE,
                      wordLengths = c(3, Inf),
                      stem = TRUE,
                      onlycharacter = FALSE,
                      striphtml = TRUE,
                      customstopwords = NULL)
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...

temp
## A text corpus with 16780 documents, and an 19553 word dictionary.

The resulting corpus contains 16,780 documents with a 19,553 word dictionary.

meta  <- temp$meta
vocab <- temp$vocab
docs  <- temp$documents
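
A common follow-up step, not performed in this analysis, is stm::prepDocuments(), which drops very rare terms while keeping the documents, vocabulary, and metadata aligned. A hedged sketch, where lower.thresh = 5 is only an illustrative threshold:

# Sketch only: drop terms that appear in fewer than 5 documents (illustrative).
prepped <- prepDocuments(docs, vocab, meta, lower.thresh = 5)
# prepped$documents, prepped$vocab, and prepped$meta would then feed into stm().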

Stemming Tidy Text

stemmed_articles <- nyt_articles %>%
  unnest_tokens(output = word, input = abstract) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word))

stemmed_articles
## # A tibble: 172,253 × 12
##    newsdesk  section subse…¹ mater…² headl…³ keywo…⁴ word_…⁵ pub_date           
##    <chr>     <chr>   <chr>   <chr>   <chr>   <chr>     <dbl> <dttm>             
##  1 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  2 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  3 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  4 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  5 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  6 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  7 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  8 Editorial Opinion <NA>    Editor… Protec… ['Vete…     680 2020-01-01 00:18:54
##  9 Games     Crossw… <NA>    News    ‘It’s … ['Cros…     931 2020-01-01 03:00:10
## 10 Games     Crossw… <NA>    News    ‘It’s … ['Cros…     931 2020-01-01 03:00:10
## # … with 172,243 more rows, 4 more variables: n_comments <dbl>, uniqueID <chr>,
## #   word <chr>, stem <chr>, and abbreviated variable names ¹​subsection,
## #   ²​material, ³​headline, ⁴​keywords, ⁵​word_count
stemmed_dtm <- nyt_articles %>%
  unnest_tokens(output = word, input = abstract) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word)) %>%
  count(uniqueID, stem) %>%
  cast_dtm(uniqueID, stem, n)

stemmed_dtm
## <<DocumentTermMatrix (documents: 16766, terms: 16762)>>
## Non-/sparse entries: 170273/280861419
## Sparsity           : 100%
## Maximal term length: NA
## Weighting          : term frequency (tf)

3. MODEL

3a. Fitting the Topic Model with LDA

There are several features that can be used to categorize the different headlines and abstracts: newsdesk, section, and material. I will check the number of distinct values in each to see what each returns.

Finding Unique Names

n_distinct(nyt_articles$material)
## [1] 10
# k = 10

n_distinct(nyt_articles$section)
## [1] 42
# k = 42

n_distinct(nyt_articles$newsdesk)
## [1] 63
# k = 63

I will use the "material" feature, both because it is the most straightforward way to identify the category an abstract falls into and because it has the smallest number of distinct categories; the larger K values implied by the other two features would take considerably more time and memory to run effectively.

Running LDA

Because material yields K = 10, this value will be used to run the LDA.

articles_lda <- LDA(articles_dtm,
                    k = 10, control = list(seed = 300))

articles_lda
## A LDA_VEM topic model with 10 topics.
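
The LDAvis package is loaded above but never used. A hedged sketch of wiring the fitted model into its interactive browser, assuming the slam package (installed alongside tm) is available for the row and column sums:

# Sketch only: posterior() exposes the term (phi) and document (theta)
# distributions in the shape LDAvis expects.
post <- posterior(articles_lda)
json <- LDAvis::createJSON(phi = post$terms,
                           theta = post$topics,
                           doc.length = slam::row_sums(articles_dtm),
                           vocab = colnames(articles_dtm),
                           term.frequency = slam::col_sums(articles_dtm))
# serVis(json)  # would open the interactive topic browser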

3b. Fitting Structural Topic Model

I also want to analyze the data using the Structural Topic Model to get another look at how topics can be assigned outside of LDA.

Extracting temp object elements

docs  <- temp$documents
meta  <- temp$meta
vocab <- temp$vocab

Running Structural Topic Model (STM)

articles_stm <- stm(documents = docs,
                    data = meta,
                    vocab = vocab,
                    K = 10,
                    max.em.its = 25,
                    verbose = FALSE)

articles_stm
## A topic model with 10 topics, 16780 documents and a 19553 word dictionary.

Plotting Top Topics

plot.STM(articles_stm, n = 10)

These are the top topics based on the abstracts.
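
For a closer, non-graphical look at the same model, stm::labelTopics() prints several word lists (highest probability, FREX, lift, and score) per topic; a quick sketch:

# Top 7 words per topic under each of the four labeling schemes.
labelTopics(articles_stm, n = 7)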

k_metrics <- FindTopicsNumber(
  articles_dtm,
  topics = seq(10, 75, by = 5),
  metrics = "Griffiths2004",
  method = "Gibbs",
  control = list(),
  mc.cores = NA,
  return_models = FALSE,
  verbose = FALSE,
  libpath = NULL)

FindTopicsNumber_plot(k_metrics)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the ldatuning package.
##   Please report the issue at <https://github.com/nikita-moor/ldatuning/issues>.

Based on the k-metrics plot, the optimal number of topics is 20. Having selected only 10, this suggests it would have been better to choose a higher K value. However, given that my other candidate values were 42 and 63, neither of which is close to optimal, 10 was the better choice of the three.
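
If the Griffiths2004 curve does peak near K = 20, refitting the LDA at that value would be straightforward. A sketch, not run in this analysis, using the same DTM and seed as above:

# Sketch only: refit with the k suggested by ldatuning, for comparison.
articles_lda_k20 <- LDA(articles_dtm, k = 20, control = list(seed = 300))
terms(articles_lda_k20, 5)   # top 5 terms per topic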

4. EXPLORE

4a. Exploring Beta Values

Identifying the 5 Most Likely Terms

terms(articles_lda, 5)
##      Topic 1       Topic 2     Topic 3       Topic 4       Topic 5      
## [1,] "coronavirus" "president" "president"   "people"      "people"     
## [2,] "york"        "trump"     "pandemic"    "president"   "coronavirus"
## [3,] "world"       "home"      "coronavirus" "pandemic"    "president"  
## [4,] "pandemic"    "city"      "trump"       "coronavirus" "city"       
## [5,] "health"      "time"      "american"    "health"      "virus"      
##      Topic 6       Topic 7       Topic 8     Topic 9       Topic 10     
## [1,] "president"   "coronavirus" "president" "coronavirus" "city"       
## [2,] "people"      "york"        "york"      "pandemic"    "time"       
## [3,] "trump"       "pandemic"    "trump"     "people"      "it’s"       
## [4,] "coronavirus" "home"        "pandemic"  "country"     "coronavirus"
## [5,] "american"    "trump"       "times"     "president"   "pandemic"
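
Since K was chosen to mirror the 10 material categories, one way to check that correspondence, not done in the original run, is to cross-tabulate each document's most likely topic against its material label:

# topics() returns each document's single most likely topic, named by uniqueID.
doc_topics <- topics(articles_lda)
table(topic = doc_topics,
      material = nyt_articles$material[match(names(doc_topics),
                                             nyt_articles$uniqueID)])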

Tidying the LDA Model into a Cleaner Data Frame

tidy_lda <- tidy(articles_lda)

tidy_lda
## # A tibble: 245,810 × 3
##    topic term       beta
##    <int> <chr>     <dbl>
##  1     1 cited 0.000106 
##  2     2 cited 0.0000383
##  3     3 cited 0.0000433
##  4     4 cited 0.000109 
##  5     5 cited 0.0000467
##  6     6 cited 0.000117 
##  7     7 cited 0.000149 
##  8     8 cited 0.0000407
##  9     9 cited 0.0000230
## 10    10 cited 0.0000231
## # … with 245,800 more rows

Visualizing Top Terms

top_terms <- tidy_lda %>%
  group_by(topic) %>%
  slice_max(beta, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%
  arrange(desc(beta)) %>%
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 5 terms in each NYT Article LDA topic",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")

Given that this data set is from 2020, it makes sense that many of the top terms surround Covid-19, Trump's presidency, and the upcoming election. Additionally, since this is the New York Times, and New York was hit very hard by the pandemic, it is also interesting to see the terms "city", "people", and "york".

4b. Exploring Gamma Values

td_beta <- tidy(articles_lda)
td_gamma <- tidy(articles_lda, matrix = "gamma")

td_beta
## # A tibble: 245,810 × 3
##    topic term       beta
##    <int> <chr>     <dbl>
##  1     1 cited 0.000106 
##  2     2 cited 0.0000383
##  3     3 cited 0.0000433
##  4     4 cited 0.000109 
##  5     5 cited 0.0000467
##  6     6 cited 0.000117 
##  7     7 cited 0.000149 
##  8     8 cited 0.0000407
##  9     9 cited 0.0000230
## 10    10 cited 0.0000231
## # … with 245,800 more rows
td_gamma
## # A tibble: 167,660 × 3
##    document                                           topic  gamma
##    <chr>                                              <int>  <dbl>
##  1 nyt://article/00036db7-8494-5141-b0ac-414118caabee     1 0.0986
##  2 nyt://article/00045477-4bf1-59d5-874f-b5b61bbcfb36     1 0.102 
##  3 nyt://article/00074f13-bb5f-57c0-80e2-df0921cc62eb     1 0.101 
##  4 nyt://article/00168469-d42e-5a6f-9c81-90e3306c6c85     1 0.0994
##  5 nyt://article/00170d27-c2ca-5d3a-8c05-f27721c3c45f     1 0.0992
##  6 nyt://article/001c8c67-95ae-5f45-ad75-3df0948d4c34     1 0.0984
##  7 nyt://article/00275cc0-a134-5bfc-8b8c-4a10273ead16     1 0.101 
##  8 nyt://article/003acab0-8d5d-5dd7-905e-4ac90f9d706e     1 0.0999
##  9 nyt://article/003b8444-3e28-5a98-b5e1-6dbdeee6003e     1 0.0997
## 10 nyt://article/00453889-025e-5bba-a279-af36e50f73a7     1 0.0994
## # … with 167,650 more rows

Joining Beta and Gamma for Tidied Top Terms

top_terms <- td_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ",")) %>%
  unnest(cols = c(terms))
gamma_terms <- td_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic", topic),
         topic = reorder(topic, gamma))

gamma_terms %>%
  select(topic, gamma, terms) %>%
  kable(digits = 3,
        col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
| Topic   | Expected topic proportion | Top 7 terms                                                  |
|---------|---------------------------|--------------------------------------------------------------|
| Topic2  | 0.1                       | president,trump,home,city,time,pandemic,health               |
| Topic6  | 0.1                       | president,people,trump,coronavirus,american,country,life     |
| Topic8  | 0.1                       | president,york,trump,pandemic,times,biden,change             |
| Topic7  | 0.1                       | coronavirus,york,pandemic,home,trump,election,world          |
| Topic5  | 0.1                       | people,coronavirus,president,city,virus,election,health      |
| Topic3  | 0.1                       | president,pandemic,coronavirus,trump,american,time,virus     |
| Topic10 | 0.1                       | city,time,it’s,coronavirus,pandemic,president,united         |
| Topic1  | 0.1                       | coronavirus,york,world,pandemic,health,trump,it’s            |
| Topic9  | 0.1                       | coronavirus,pandemic,people,country,president,election,time  |
| Topic4  | 0.1                       | people,president,pandemic,coronavirus,health,white,home      |

Based on this, all of the topics have nearly identical expected proportions (about 0.1), and each shares some variation on the same themes: the president, the people, and the pandemic.
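
Those near-uniform expected proportions also hint that the LDA model separates documents weakly. A quick diagnostic sketch, not part of the original analysis, looks at each document's largest gamma:

# If documents were strongly assigned to topics, the mean maximum gamma
# would sit well above 1/k = 0.1.
td_gamma %>%
  group_by(document) %>%
  summarise(max_gamma = max(gamma)) %>%
  summarise(mean_max_gamma = mean(max_gamma))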

plot(articles_stm, n = 7)

4c. Reviewing Tea Leaves

Based on the findings above, some key themes that seem to arise from the data are:

  • President: nearly all of the top topics mention the president, with a few mentioning Trump specifically. Biden is also mentioned once.

  • Coronavirus: mentioned several times in all but Topic 2 and Topic 8.

  • Health: mentioned in nearly all of the topics.

  • City or Cities: mentioned in nearly all of the topics.

  • New York: mentioned in several topics.

  • Pandemic: mentioned in nearly all of the topics.

Overall, these themes are again unsurprising, given that 2020 marked the height of the Covid-19 pandemic, the run-up to the election, and heavy press coverage of the president.

Here, I am taking a sample from one of the most popular topics to see if the full texts match the expected topic.

nyt_articles_reduced <- nyt_articles$abstract[-temp$docs.removed]

findThoughts(articles_stm,
             texts = nyt_articles_reduced,
             topics = 2,
             n = 10,
             thresh = 0.5)
## 
##  Topic 2: 
##       Three days in the creative wilderness with Francis Farewell Starlite, the reclusive muse to Kanye West, Bon Iver and Drake.
##      For over 20 years, Beatrice Tosti di Valminuta and her husband, Julio Pena, have turned diners into devotees at their trattoria, Il Posto Accanto.
##      Bonding at a butcher shop, a rainy day ode and more reader tales of New York City in this week’s Metropolitan Diary.
##      A ride to Manhattan, looking skyward in Brooklyn and more reader tales of New York City in this week’s Metropolitan Diary.
##      A Manhattan lunch counter where Jackie O. ate with John-John and Holly Golightly met Sally Tomato’s bagman bids a final farewell.
##      A lucky penny in Midtown, navigating slushy streets and more reader tales of New York City in this week’s Metropolitan Diary.
##      In 1978, a mischievous band of writers that included George Plimpton and Nora Ephron teamed up to create a spoof of The New York Times. Turns out, Times journalists were among them.
##      Kathy Lowden and Tess Davison make their New York Times Crossword debut.

Looking at the full texts for Topic 2, they seem fairly in line with what was expected for the topic.

findThoughts(articles_stm,
             texts = nyt_articles_reduced,
             topics = 9,
             n = 10,
             thresh = 0.5)
## 
##  Topic 9: 
##       On a wild stretch of Pacific shoreline, Costa Chica draws artists, architects, surfers, yogis and “naughty friends.”
##      Adding dulce de leche and two kinds of coconut milk to a tres leches cake makes it supremely creamy and rich.

This topic, however, is way off the mark in terms of what was expected. A form of the word “make” is present, but the other terms are off.

5. COMMUNICATE

Purpose

The purpose of this study was to determine which topics were the most popular across several thousand New York Times article abstracts published over the course of a year. Additionally, I wanted to see what themes were present among these topics.

Methods

The data from this study were used for topic modeling, in an effort to categorize the most popular topics based on the abstracts of the New York Times' digital articles.

To categorize these topics, Latent Dirichlet Allocation and the Structural Topic Model were used.

Findings

Overall, the most popular topics and themes extracted from the abstracts centered on variations of the president, the upcoming presidential election, and the Covid-19 pandemic. This makes sense, considering the articles were all from 2020, a year of heavy media coverage of both.

Discussion

Newspapers offer a good source of varying topics. These topics can be used to get a sense of the themes that run throughout a particular article and an entire newspaper.

This study provided a snapshot of some of the most common news topics in the New York Times. In the future, I would like to expand the number of categories used, as the analysis showed that a K value of around 20 was most ideal, as opposed to the 10 used. This might have produced more variation in the common topics and recurring themes. The outcome surrounding the top topics was expected, given the time period, but I would have liked to see more variation and some other interesting topics arise.

With over 16,000 articles, I believe there are many topics to explore and model. In addition to this data set, there was another data set containing the comments under each of the articles used in this analysis. Though this analysis focused simply on identifying the top topics in the NYT, in future topic modeling it would be interesting to take this comment data and explore what themes are present within it, particularly since many of the articles are news (local to NY, national, and world) centered.

Additionally, I want to topic model using the headlines to see whether the assigned topics and themes differ from those based on the abstracts. It would also be interesting to compare topics between varying news desks (news, sports, etc.) to see whether there are common themes, or topics heavily popular in one area that inadvertently appear in another; for example, topics from the News segments appearing within the Real Estate or Travel segments.

Overall, with such a large amount of data, the possibilities for topic modeling are seemingly endless.