The New York Times, a New York-based daily newspaper founded in 1851, is one of the most popular news platforms worldwide. As of 2022, it had 740,000 paid print subscribers and 8.6 million digital subscribers (https://en.wikipedia.org/wiki/The_New_York_Times). In addition to covering current news events, The New York Times maintains news desks for popular culture, investigative journalism, education, scientific advancements, politics, travel, and many other areas.
Topic modeling is a form of text mining that identifies patterns and topics within a body of text. Newspapers in particular, because they cover a regularly changing set of topics on a daily schedule (Brett, 2012), are a popular and solid choice for topic modeling.
The data set used for this topic modeling analysis was obtained from the open data website Kaggle. It contains information on over 16,000 articles, each with 11 features:
Newsdesk
Section
Subsection
Material
Headline
Abstract
Keywords
Word Count
Publication Date
Number of Comments
Unique ID
These articles were published over the course of one year, from January 1, 2020 to December 31, 2020.
The goal of this study is to use the data set to model topics. The guiding questions are: which topics are the most popular across the article abstracts, and what themes are present among those topics?
Both Latent Dirichlet Allocation (LDA) and the Structural Topic Model (STM) will be used to classify and identify these topics.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(SnowballC)
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
library(knitr)
library(topicmodels)
library(dplyr)
library(ldatuning)
library(LDAvis)
library(wordcloud2)
nyt_articles <- read_csv("C:/Users/bethe/Downloads/nyt-articles-2020.csv")
## Rows: 16787 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): newsdesk, section, subsection, material, headline, abstract, keywo...
## dbl (2): word_count, n_comments
## dttm (1): pub_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nyt_articles
## # A tibble: 16,787 × 11
## newsdesk section subsec…¹ mater…² headl…³ abstr…⁴ keywo…⁵ word_…⁶
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Editorial Opinion <NA> Editor… Protec… Congre… "['Vet… 680
## 2 Games Crosswords & Games <NA> News ‘It’s … Christ… "['Cro… 931
## 3 Science Science <NA> News Meteor… All ye… "['Met… 1057
## 4 Science Science <NA> Intera… Sync y… Never … "['Spa… 0
## 5 Science Science <NA> News Rocket… A year… "['Spa… 1156
## 6 Foreign World Middle … News Pro-Ir… Iran’s… "['Ira… 1120
## 7 Magazine Magazine <NA> News Judge … If you… "['Com… 153
## 8 Magazine Magazine <NA> News She Fe… After … "['Str… 1395
## 9 Magazine Magazine <NA> News These … Tangy … "['Coo… 754
## 10 Culture Books <NA> Review ‘Don’t… David … "[\"Do… 957
## # … with 16,777 more rows, 3 more variables: pub_date <dttm>, n_comments <dbl>,
## # uniqueID <chr>, and abbreviated variable names ¹subsection, ²material,
## # ³headline, ⁴abstract, ⁵keywords, ⁶word_count
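As a quick check that the articles do span all of 2020, and to see which news desks contribute the most pieces, the metadata can be summarized directly. This is a minimal sketch using the pub_date and newsdesk columns shown above:
range(nyt_articles$pub_date)
nyt_articles %>%
count(newsdesk, sort = TRUE)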
articles_tidy <- nyt_articles %>%
unnest_tokens(output = word, input = abstract) %>%
anti_join(stop_words, by = "word")
articles_tidy
## # A tibble: 172,253 × 11
## newsdesk section subse…¹ mater…² headl…³ keywo…⁴ word_…⁵ pub_date
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dttm>
## 1 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 2 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 3 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 4 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 5 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 6 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 7 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 8 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 9 Games Crossw… <NA> News ‘It’s … ['Cros… 931 2020-01-01 03:00:10
## 10 Games Crossw… <NA> News ‘It’s … ['Cros… 931 2020-01-01 03:00:10
## # … with 172,243 more rows, 3 more variables: n_comments <dbl>, uniqueID <chr>,
## # word <chr>, and abbreviated variable names ¹subsection, ²material,
## # ³headline, ⁴keywords, ⁵word_count
articles_tidy %>%
count(word, sort = TRUE)
## # A tibble: 24,581 × 2
## word n
## <chr> <int>
## 1 coronavirus 1272
## 2 president 1207
## 3 pandemic 1036
## 4 trump 819
## 5 people 746
## 6 york 594
## 7 city 551
## 8 time 508
## 9 home 504
## 10 it’s 469
## # … with 24,571 more rows
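Since wordcloud2 was loaded above, these frequencies can also be inspected visually. This is a side sketch rather than part of the main analysis, and the cutoff of 100 occurrences is an arbitrary choice:
articles_tidy %>%
count(word, sort = TRUE) %>%
filter(n >= 100) %>%
rename(freq = n) %>%
wordcloud2()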
articles_dtm <- articles_tidy %>%
count(uniqueID, word) %>%
cast_dtm(uniqueID, word, n)
articles_dtm
## <<DocumentTermMatrix (documents: 16766, terms: 24581)>>
## Non-/sparse entries: 170737/411954309
## Sparsity : 100%
## Maximal term length: NA
## Weighting : term frequency (tf)
Based on this matrix, there are 16,766 documents and 24,581 terms. The sparsity is very high, rounding to 100%.
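That 100% is a rounded figure; because the DocumentTermMatrix stores only its non-zero entries, the exact proportion of empty cells can be recomputed from the object itself. A small sketch, assuming articles_dtm from above:
n_nonzero <- length(articles_dtm$v) # non-zero entries of the underlying simple_triplet_matrix
n_cells <- prod(dim(articles_dtm))  # documents x terms
1 - n_nonzero / n_cells             # proportion of empty cells, just under 1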
temp <- textProcessor(nyt_articles$abstract,
metadata = nyt_articles,
lowercase=TRUE,
removestopwords=TRUE,
removenumbers=TRUE,
removepunctuation=TRUE,
wordLengths=c(3,Inf),
stem=TRUE,
onlycharacter= FALSE,
striphtml=TRUE,
customstopwords=NULL)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
temp
## A text corpus with 16780 documents, and an 19553 word dictionary.
The resulting corpus contains 16,780 documents with a 19,553 word dictionary.
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents
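A common next step in the stm workflow is prepDocuments(), which drops very infrequent terms and any documents left empty as a result. It is not applied in this analysis, but a sketch would look like the following; the lower.thresh of 5 is an arbitrary choice:
prepped <- prepDocuments(documents = temp$documents,
vocab = temp$vocab,
meta = temp$meta,
lower.thresh = 5)
# prepped$documents, prepped$vocab, and prepped$meta could then be passed to stm()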
stemmed_articles <- nyt_articles %>%
unnest_tokens(output = word, input = abstract) %>%
anti_join(stop_words, by = "word") %>%
mutate(stem = wordStem(word))
stemmed_articles
## # A tibble: 172,253 × 12
## newsdesk section subse…¹ mater…² headl…³ keywo…⁴ word_…⁵ pub_date
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dttm>
## 1 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 2 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 3 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 4 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 5 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 6 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 7 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 8 Editorial Opinion <NA> Editor… Protec… ['Vete… 680 2020-01-01 00:18:54
## 9 Games Crossw… <NA> News ‘It’s … ['Cros… 931 2020-01-01 03:00:10
## 10 Games Crossw… <NA> News ‘It’s … ['Cros… 931 2020-01-01 03:00:10
## # … with 172,243 more rows, 4 more variables: n_comments <dbl>, uniqueID <chr>,
## # word <chr>, stem <chr>, and abbreviated variable names ¹subsection,
## # ²material, ³headline, ⁴keywords, ⁵word_count
stemmed_dtm <- nyt_articles %>%
unnest_tokens(output = word, input = abstract) %>%
anti_join(stop_words, by = "word") %>%
mutate(stem = wordStem(word)) %>%
count(uniqueID, stem) %>%
cast_dtm(uniqueID, stem, n)
stemmed_dtm
## <<DocumentTermMatrix (documents: 16766, terms: 16762)>>
## Non-/sparse entries: 170273/280861419
## Sparsity : 100%
## Maximal term length: NA
## Weighting : term frequency (tf)
Several features can be used to categorize the headlines/abstracts: newsdesk, section, and material. I will check the number of distinct values of each to see how many categories it contains.
n_distinct(nyt_articles$material)
## [1] 10
#k=10
n_distinct(nyt_articles$section)
## [1] 42
#k=42
n_distinct(nyt_articles$newsdesk)
## [1] 63
#k=63
Based on the number of distinct values returned, I will be using the material feature, as it is the most straightforward way to identify the category an abstract falls into. Additionally, because material has the smallest number of categories, it is the most practical choice for this analysis, given the much greater time and memory it would take to run the models effectively with the larger alternatives.
Because material has 10 categories, K = 10 will be used to run the LDA.
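To see what those 10 material categories actually are, they can be tabulated directly (a quick sketch):
nyt_articles %>%
count(material, sort = TRUE)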
articles_lda <- LDA(articles_dtm,
k = 10, control = list(seed = 300))
articles_lda
## A LDA_VEM topic model with 10 topics.
I also want to analyze the data using the Structural Topic Model (STM) to get another look at how topics can be assigned, outside of LDA.
docs <- temp$documents
meta <- temp$meta
vocab <- temp$vocab
articles_stm <- stm(documents=docs,
data=meta,
vocab=vocab,
K=10,
max.em.its=25,
verbose = FALSE)
articles_stm
## A topic model with 10 topics, 16780 documents and a 19553 word dictionary.
plot.STM(articles_stm, n = 10)
These are the top topics based on the abstracts.
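Because the summary plot only shows a handful of words per topic, the top words for each STM topic can also be printed in full with labelTopics(), which is part of the stm package (a sketch using the fitted model above):
labelTopics(articles_stm, n = 7)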
k_metrics <- FindTopicsNumber(
articles_dtm,
topics = seq(10, 75, by = 5),
metrics = "Griffiths2004",
method = "Gibbs",
control = list(),
mc.cores = NA,
return_models = FALSE,
verbose = FALSE,
libpath = NULL)
FindTopicsNumber_plot(k_metrics)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the ldatuning package.
## Please report the issue at <https://github.com/nikita-moor/ldatuning/issues>.
Based on the k-metrics, the optimal number of topics is around 20. Since I selected only 10, this suggests that a higher K value would have been a better choice. However, given that my other candidate values were 42 and 63, which are far less optimal, 10 was the most reasonable of the three.
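If time and memory allowed, the model could be refit at the suggested value for comparison. A hypothetical sketch, not run for this report; articles_lda_20 is an illustrative name:
articles_lda_20 <- LDA(articles_dtm,
k = 20, control = list(seed = 300))
terms(articles_lda_20, 5)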
terms(articles_lda, 5)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "coronavirus" "president" "president" "people" "people"
## [2,] "york" "trump" "pandemic" "president" "coronavirus"
## [3,] "world" "home" "coronavirus" "pandemic" "president"
## [4,] "pandemic" "city" "trump" "coronavirus" "city"
## [5,] "health" "time" "american" "health" "virus"
## Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
## [1,] "president" "coronavirus" "president" "coronavirus" "city"
## [2,] "people" "york" "york" "pandemic" "time"
## [3,] "trump" "pandemic" "trump" "people" "it’s"
## [4,] "coronavirus" "home" "pandemic" "country" "coronavirus"
## [5,] "american" "trump" "times" "president" "pandemic"
tidy_lda <- tidy(articles_lda)
tidy_lda
## # A tibble: 245,810 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 cited 0.000106
## 2 2 cited 0.0000383
## 3 3 cited 0.0000433
## 4 4 cited 0.000109
## 5 5 cited 0.0000467
## 6 6 cited 0.000117
## 7 7 cited 0.000149
## 8 8 cited 0.0000407
## 9 9 cited 0.0000230
## 10 10 cited 0.0000231
## # … with 245,800 more rows
top_terms <- tidy_lda %>%
group_by(topic) %>%
slice_max(beta, n = 5, with_ties = FALSE) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 5 terms in each NYT Article LDA topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")
Given that this data set is from 2020, it makes sense that many of the top terms concern Covid-19, Trump’s presidency, and the upcoming election. Additionally, since this is The New York Times, and New York was hit very hard by the pandemic, it is also interesting to see the terms “city”, “people”, and “york”.
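One way to see how these pandemic and election terms spread across the topics is to filter the per-topic word probabilities for them directly. A sketch, assuming tidy_lda from above:
tidy_lda %>%
filter(term %in% c("coronavirus", "trump", "york", "city")) %>%
arrange(term, desc(beta))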
td_beta <- tidy(articles_lda)
td_gamma <- tidy(articles_lda, matrix = "gamma")
td_beta
## # A tibble: 245,810 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 cited 0.000106
## 2 2 cited 0.0000383
## 3 3 cited 0.0000433
## 4 4 cited 0.000109
## 5 5 cited 0.0000467
## 6 6 cited 0.000117
## 7 7 cited 0.000149
## 8 8 cited 0.0000407
## 9 9 cited 0.0000230
## 10 10 cited 0.0000231
## # … with 245,800 more rows
td_gamma
## # A tibble: 167,660 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 nyt://article/00036db7-8494-5141-b0ac-414118caabee 1 0.0986
## 2 nyt://article/00045477-4bf1-59d5-874f-b5b61bbcfb36 1 0.102
## 3 nyt://article/00074f13-bb5f-57c0-80e2-df0921cc62eb 1 0.101
## 4 nyt://article/00168469-d42e-5a6f-9c81-90e3306c6c85 1 0.0994
## 5 nyt://article/00170d27-c2ca-5d3a-8c05-f27721c3c45f 1 0.0992
## 6 nyt://article/001c8c67-95ae-5f45-ad75-3df0948d4c34 1 0.0984
## 7 nyt://article/00275cc0-a134-5bfc-8b8c-4a10273ead16 1 0.101
## 8 nyt://article/003acab0-8d5d-5dd7-905e-4ac90f9d706e 1 0.0999
## 9 nyt://article/003b8444-3e28-5a98-b5e1-6dbdeee6003e 1 0.0997
## 10 nyt://article/00453889-025e-5bba-a279-af36e50f73a7 1 0.0994
## # … with 167,650 more rows
top_terms <- td_beta %>%
arrange(beta) %>%
group_by(topic) %>%
top_n(7, beta) %>%
arrange(-beta) %>%
select(topic, term) %>%
summarise(terms = list(term)) %>%
mutate(terms = map(terms, paste, collapse = ",")) %>%
unnest(cols = c(terms))
gamma_terms <- td_gamma %>%
group_by(topic) %>%
summarise(gamma = mean(gamma)) %>%
arrange(desc(gamma)) %>%
left_join(top_terms, by = "topic") %>%
mutate(topic = paste0("Topic", topic),
topic = reorder(topic, gamma))
gamma_terms %>%
select(topic, gamma, terms) %>%
kable(digits = 3,
col.names = c("Topic", "Expected topic proportion", "Top 7 terms"))
| Topic | Expected topic proportion | Top 7 terms |
|---|---|---|
| Topic2 | 0.1 | president,trump,home,city,time,pandemic,health |
| Topic6 | 0.1 | president,people,trump,coronavirus,american,country,life |
| Topic8 | 0.1 | president,york,trump,pandemic,times,biden,change |
| Topic7 | 0.1 | coronavirus,york,pandemic,home,trump,election,world |
| Topic5 | 0.1 | people,coronavirus,president,city,virus,election,health |
| Topic3 | 0.1 | president,pandemic,coronavirus,trump,american,time,virus |
| Topic10 | 0.1 | city,time,it’s,coronavirus,pandemic,president,united |
| Topic1 | 0.1 | coronavirus,york,world,pandemic,health,trump,it’s |
| Topic9 | 0.1 | coronavirus,pandemic,people,country,president,election,time |
| Topic4 | 0.1 | people,president,pandemic,coronavirus,health,white,home |
Based on this, all of the top topics contain some variation of terms representing the president, the people, and the pandemic.
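The gamma values can also be used to assign each abstract its single most probable LDA topic, which gives a rough sense of how evenly the documents spread across the 10 topics. A sketch, assuming td_gamma from above; doc_topics is an illustrative name:
doc_topics <- td_gamma %>%
group_by(document) %>%
slice_max(gamma, n = 1, with_ties = FALSE) %>%
ungroup()
doc_topics %>%
count(topic, sort = TRUE)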
plot(articles_stm, n = 7)
Based on the findings above, some key themes that arise from the data are:
President: nearly all of the top topics mention the president, with a few mentioning Trump specifically; Biden is also mentioned once.
Coronavirus: mentioned several times in all but Topic 2 and Topic 8.
Health: mentioned in nearly all of the topics.
City or Cities: mentioned in nearly all of the topics.
New York: mentioned in several topics.
Pandemic: mentioned in nearly all of the topics.
Overall, these themes are again unsurprising, given that 2020 marked the height of the Covid-19 pandemic, the run-up to the presidential election, and heavy press coverage of the president.
Here, I take a sample of abstracts from one of the most popular topics to see whether the full text matches what is expected for the topic.
nyt_articles_reduced <- nyt_articles$abstract[-temp$docs.removed]
findThoughts(articles_stm,
texts = nyt_articles_reduced,
topics = 2,
n = 10,
thresh = 0.5)
##
## Topic 2:
## Three days in the creative wilderness with Francis Farewell Starlite, the reclusive muse to Kanye West, Bon Iver and Drake.
## For over 20 years, Beatrice Tosti di Valminuta and her husband, Julio Pena, have turned diners into devotees at their trattoria, Il Posto Accanto.
## Bonding at a butcher shop, a rainy day ode and more reader tales of New York City in this week’s Metropolitan Diary.
## A ride to Manhattan, looking skyward in Brooklyn and more reader tales of New York City in this week’s Metropolitan Diary.
## A Manhattan lunch counter where Jackie O. ate with John-John and Holly Golightly met Sally Tomato’s bagman bids a final farewell.
## A lucky penny in Midtown, navigating slushy streets and more reader tales of New York City in this week’s Metropolitan Diary.
## In 1978, a mischievous band of writers that included George Plimpton and Nora Ephron teamed up to create a spoof of The New York Times. Turns out, Times journalists were among them.
## Kathy Lowden and Tess Davison make their New York Times Crossword debut.
Looking at the full text retrieved for Topic 2, it seems well in line with what was expected for the topic.
findThoughts(articles_stm,
texts = nyt_articles_reduced,
topics = 9,
n = 10,
thresh = 0.5)
##
## Topic 9:
## On a wild stretch of Pacific shoreline, Costa Chica draws artists, architects, surfers, yogis and “naughty friends.”
## Adding dulce de leche and two kinds of coconut milk to a tres leches cake makes it supremely creamy and rich.
This topic, however, is well off the mark in terms of what was expected: the word “make” appears, but the other top terms do not.
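To check this mismatch against what the model itself considers characteristic of the topic, the top words for topic 9 alone can be printed with labelTopics() (a sketch):
labelTopics(articles_stm, topics = 9, n = 7)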
The purpose of this study was to determine which topics were the most popular across several thousand New York Times article abstracts over the course of a year. Additionally, I wanted to see what themes were present among these topics.
The data were used for topic modeling, in an effort to categorize the most popular topics based on the abstracts of The New York Times’ digital articles.
To categorize these topics, Latent Dirichlet Allocation and Structural Topic Modeling were used.
Overall, the most popular topics and themes extracted from the abstracts centered on some variation of the president, the upcoming presidential election, and the Covid-19 pandemic. This makes sense, considering the articles were all from 2020, a year of heavy media coverage of both.
Newspapers offer a good source of varied topics. These topics can be used to get a sense of the themes that run through a particular article or an entire newspaper.
This study provided a snapshot of some of the most common news topics in The New York Times. In the future, I would like to expand the number of topics used, as the analysis suggested that a higher K value (on the order of 20 to 25) was more optimal than the 10 used here. This might have produced more variation in the common topics and recurring themes. The outcome surrounding the top topics was expected, given the time period, but I would have liked to see more variation and some other interesting topics arise.
With over 16,000 articles, I believe there are many topics to explore and model. In addition to this data set, there is another data set containing the comments posted under each of the articles used in this analysis. Though this analysis focused on identifying the top topics in the NYT, in future topic modeling it would be interesting to take that comment data and explore what themes are present within it, particularly since many of the articles are news-centered (local to New York, national, and world).
Additionally, I want to topic model the headlines and see whether there are any differences between the topics and themes assigned to headlines as opposed to abstracts. It would also be interesting to compare topics across the different news desks (news, sports, etc.) and see whether there are common themes, or topics that are heavily popular in one area and inadvertently surface in another; for example, topics from the News segments appearing within the Real Estate or Travel segments.
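As a starting point for that comparison, the same tidy pipeline used for the abstracts could simply be pointed at the headline column instead. This is a hedged sketch; headlines_dtm and headlines_lda are hypothetical names, not objects created elsewhere in this report:
headlines_dtm <- nyt_articles %>%
unnest_tokens(output = word, input = headline) %>%
anti_join(stop_words, by = "word") %>%
count(uniqueID, word) %>%
cast_dtm(uniqueID, word, n)
headlines_lda <- LDA(headlines_dtm, k = 10, control = list(seed = 300))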
Overall, with such a large amount of data, the possibilities for topic modeling are seemingly endless.