In my previous posts, I used data pulled from the New York Times “Article Search” API to analyze the 3,442 results of the search query “Afghanistan” for the years 2020 and 2021. My overall goal has been to analyze the articles for differences in framing and use of sources relative to each article’s sentiment. However, I have been limited by the fact that the Article Search API does not return the full text of an article; instead, I have been able to pull the abstract/summary, lead paragraph, and snippet for each article, along with the keywords, authors, sections, and URL. I can also get the article titles for both the print and online versions of each article. That last bit is important to the next phase of my research, and touches on the work our group did on the same topic in last semester’s Research Design course.
In that course, our research group hand-coded PDF copies of articles returned by a simple search for “Afghanistan withdrawal” on the websites of the New York Times and Wall Street Journal, covering February 29, 2020 through September 30, 2021. One thing I noticed when loading the PDF articles into NVivo for coding was that it was hard to match many of the New York Times articles to their citation information in Zotero because the article titles did not match. I realized that when the articles were saved in Zotero, they were saved with the title shown on the web version of the article; once an article had been preserved using the site’s “Print to PDF” function, however, the title used as the default file name differed from the web version.
Since my analysis of the API results is limited by having only the lead paragraph rather than the full article, I want to compare the “main title” and “print title” fields pulled from the API and see how similar they are.
I’ll start by loading the full dataset from my collection phase and looking at the headlines in more detail.
#load packages and data
library(dplyr)               # %>%, select(), group_by(), summarize(), mutate()
library(ggplot2)             # bar chart of material types below
library(readr)               # read_csv() for the headline files
library(quanteda)            # corpus(), tokens(), dfm(), fcm()
library(quanteda.textplots)  # textplot_wordcloud(), textplot_network() (quanteda >= 3)
afghanistan_articles <- read.csv("afghanistan.articles.headlines.csv")
afghanistan_articles <- as.data.frame(afghanistan_articles)
#create subset to analyze
small_df <- afghanistan_articles %>%
select(date, section.name, news.desk, headline.main, headline.print, material)
head(small_df)
## date section.name news.desk
## 1 12/10/2020 World Magazine
## 2 11/5/2020 Magazine Magazine
## 3 12/17/2020 Opinion OpEd
## 4 12/27/2020 World Foreign
## 5 12/10/2020 World Foreign
## 6 12/16/2020 World Foreign
## headline.main
## 1 Afghan War Casualty Report: December 2020
## 2 Afghan War Casualty Report: November 2020
## 3 The Afghan War Is Over. Did Anyone Notice?
## 4 In a Village of Widows, the Opium Trade Has Taken a Deadly Toll
## 5 Afghan Journalist Is Killed in Latest Attack on Media Figures
## 6 ‘Sticky Bombs’ Sow Terror and Chaos in a City on Edge
## headline.print material
## 1 <NA> News
## 2 <NA> News
## 3 Afghanistan: American ‘Iliad’? Op-Ed
## 4 Village of Widows Scrapes By in Shadow of Afghan Opium Trade News
## 5 Afghan Journalist Is Killed in Latest Attack on Media News
## 6 Taliban Use Lethal ‘Sticky Bombs’ Nearly Daily to Terrorize Afghans News
I can see that there is not always a pair of headlines/titles to review; to find out how many are missing one, I’ll use the complete.cases() function. It tells me that of the 3,442 observations (articles), 1,547 have only one of the two headlines. Here I need to make a decision about inclusion/exclusion as part of pre-processing; for a text analysis, I think I’ll leave those articles in the data.
# rows where either the main or the print headline is missing
small_df[!complete.cases(small_df),]
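Since the eventual goal is to compare the main and print titles for similarity, here is a minimal sketch of how that comparison might look for the complete pairs. It assumes the stringdist package is installed (it is not used anywhere else in this post), and the Jaro-Winkler measure and the paired object name are purely illustrative.
# sketch: rough main-vs-print headline similarity, complete pairs only
paired <- small_df[complete.cases(small_df), ]
paired$similarity <- stringdist::stringsim(paired$headline.main,
                                           paired$headline.print,
                                           method = "jw")
summary(paired$similarity)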
I also want to look at which types of materials are represented, and in what proportion.
small_df %>%
group_by(material) %>%
summarize(count=n()) %>%
mutate(percent = (count / sum(count))*100) %>%
ggplot() +
geom_bar(aes(y=percent, x=material, fill=material), stat = "identity") + coord_flip()
Now I need to look at the article headlines independently and create a corpus for each set.
#load individual data
main_headlines <- read_csv("main_headlines.csv")
print_headlines <- read_csv("print_headlines.csv")
head(main_headlines)
## # A tibble: 6 x 6
## doc.id text date section.name news.desk material
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 Quotation of the Day: Ex-SEAL~ 1/1/2~ "Today\x92s P~ Summary Quote
## 2 2 Afghan War Casualty Report: J~ 1/2/2~ "Magazine" Magazine News
## 3 3 The 50 TV Shows You Need to W~ 1/2/2~ "Arts" Weekend News
## 4 4 A History of War in Six Drugs 1/3/2~ "Magazine" Magazine News
## 5 5 Airstrike Pushes National Sec~ 1/3/2~ "U.S." Politics News
## 6 6 The Case for a One-Term Joe 1/3/2~ "Opinion" OpEd Op-Ed
head(print_headlines)
## # A tibble: 6 x 6
## doc.id text date section.name news.desk material
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 Quote of the Day 1/1/2~ "Today\x92s P~ Summary Quote
## 2 2 <NA> 1/2/2~ "Magazine" Magazine News
## 3 3 50 TV Shows To Watch This Wi~ 1/2/2~ "Arts" Weekend News
## 4 4 <NA> 1/3/2~ "Magazine" Magazine News
## 5 5 Airstrike Pushes Foreign Poli~ 1/3/2~ "U.S." Politics News
## 6 6 The Case for a One-Term Presi~ 1/3/2~ "Opinion" OpEd Op-Ed
main_corpus <- corpus(main_headlines)
print_corpus <- corpus(print_headlines)
main_summary <- summary(main_corpus)
print_summary <- summary(print_corpus)
head(main_summary)
## Text Types Tokens Sentences doc.id date section.name news.desk
## 1 text1 11 11 1 1 1/1/2020 Today<U+0092>s Paper Summary
## 2 text2 7 7 1 2 1/2/2020 Magazine Magazine
## 3 text3 10 10 1 3 1/2/2020 Arts Weekend
## 4 text4 7 7 1 4 1/3/2020 Magazine Magazine
## 5 text5 9 9 1 5 1/3/2020 U.S. Politics
## 6 text6 6 6 1 6 1/3/2020 Opinion OpEd
## material
## 1 Quote
## 2 News
## 3 News
## 4 News
## 5 News
## 6 Op-Ed
Now I’ll add an indicator for the headline type for later use, if necessary. However, I cannot attach it to the corpora with the commented-out docvars() calls below, because summary() only returns the first 100 observations by default, so the lengths do not match the 3,442-document corpora. I’ll leave this for a follow-up discussion.
main_summary$type <- "Main Headline"
print_summary$type <- "Print Headline"
#docvars(main_corpus, field = "type") <- main_summary$type
#docvars(print_corpus, field = "type") <- print_summary$type
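If I do come back to this, one possible workaround (a sketch only; main_summary_full is a placeholder name) is to set the docvar on each corpus directly with an explicitly length-matched vector, or to ask summary() for every document instead of its default of 100.
# sketch: attach the headline-type indicator to the corpora themselves
docvars(main_corpus, "type") <- rep("Main Headline", ndoc(main_corpus))
docvars(print_corpus, "type") <- rep("Print Headline", ndoc(print_corpus))
# or build a summary that covers every document rather than the first 100
main_summary_full <- summary(main_corpus, n = ndoc(main_corpus))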
For now, I’ll move on to tokenization.
# the default breaks on white space
main_tokens <- tokens(main_corpus)
print(main_tokens)
## Tokens consisting of 3,442 documents and 5 docvars.
## text1 :
## [1] "Quotation" "of" "the" "Day" ":" "Ex-SEAL"
## [7] "Now" "Pitching" "Products" "and" "President"
##
## text2 :
## [1] "Afghan" "War" "Casualty" "Report" ":" "January" "2020"
##
## text3 :
## [1] "The" "50" "TV" "Shows" "You" "Need" "to" "Watch"
## [9] "This" "Winter"
##
## text4 :
## [1] "A" "History" "of" "War" "in" "Six" "Drugs"
##
## text5 :
## [1] "Airstrike" "Pushes" "National" "Security" "to" "Forefront"
## [7] "of" "2020" "Race"
##
## text6 :
## [1] "The" "Case" "for" "a" "One-Term" "Joe"
##
## [ reached max_ndoc ... 3,436 more documents ]
# the default breaks on white space
print_tokens <- tokens(print_corpus)
print(print_tokens)
## Tokens consisting of 3,442 documents and 5 docvars.
## text1 :
## [1] "Quote" "of" "the" "Day"
##
## text2 :
## character(0)
##
## text3 :
## [1] "50" "TV" "Shows" "To" "Watch" "This" "Winter"
##
## text4 :
## character(0)
##
## text5 :
## [1] "Airstrike" "Pushes" "Foreign" "Policy" "To"
## [6] "Front" "of" "Presidential" "Race"
##
## text6 :
## [1] "The" "Case" "for" "a" "One-Term" "President"
## [7] "Biden"
##
## [ reached max_ndoc ... 3,436 more documents ]
This is a bit difficult to navigate, since it is clear that not all of the returned articles focus primarily on Afghanistan; many simply include the term “Afghanistan” somewhere in the article, even if it is ancillary to the topic. In addition, many of the results are news briefs with quick rundowns of facts and very little in the way of context or framing.
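As a quick gut check on that, a keyword-in-context view (just a sketch, reusing the main_tokens object created above) shows how “afghan”-type terms are actually used in each headline and can flag the incidental mentions.
# sketch: keyword-in-context view of "afghan*" across the headline tokens
head(kwic(main_tokens, pattern = "afghan*", window = 3))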
I will again do some pre-processing, removing punctuation. I do not want to remove numbers, as they may represent data on the situation in Afghanistan, such as deaths, attacks, etc.
main_tokens <- tokens(main_corpus,
remove_punct = TRUE)
print_tokens <- tokens(print_corpus,
remove_punct = TRUE)
main_tokens
## Tokens consisting of 3,442 documents and 5 docvars.
## text1 :
## [1] "Quotation" "of" "the" "Day" "Ex-SEAL" "Now"
## [7] "Pitching" "Products" "and" "President"
##
## text2 :
## [1] "Afghan" "War" "Casualty" "Report" "January" "2020"
##
## text3 :
## [1] "The" "50" "TV" "Shows" "You" "Need" "to" "Watch"
## [9] "This" "Winter"
##
## text4 :
## [1] "A" "History" "of" "War" "in" "Six" "Drugs"
##
## text5 :
## [1] "Airstrike" "Pushes" "National" "Security" "to" "Forefront"
## [7] "of" "2020" "Race"
##
## text6 :
## [1] "The" "Case" "for" "a" "One-Term" "Joe"
##
## [ reached max_ndoc ... 3,436 more documents ]
print_tokens
## Tokens consisting of 3,442 documents and 5 docvars.
## text1 :
## [1] "Quote" "of" "the" "Day"
##
## text2 :
## character(0)
##
## text3 :
## [1] "50" "TV" "Shows" "To" "Watch" "This" "Winter"
##
## text4 :
## character(0)
##
## text5 :
## [1] "Airstrike" "Pushes" "Foreign" "Policy" "To"
## [6] "Front" "of" "Presidential" "Race"
##
## text6 :
## [1] "The" "Case" "for" "a" "One-Term" "President"
## [7] "Biden"
##
## [ reached max_ndoc ... 3,436 more documents ]
With the data frame subset version, I can look at the most frequent terms (features), starting with the top 20. It’s clear that the frequency of the “briefings” will skew my analysis. It’s also clear that the remove_punct argument did not remove the top result, which is a symbol rather than punctuation.
# build a dfm from the full corpus (corpus_subset() with no condition keeps every document)
main_df <- corpus_subset(main_corpus) %>%
tokens(remove_punct = TRUE) %>%
dfm()
topfeatures(main_df, n=20)
## <U+FFFD> the in to a of
## 1307 1137 734 683 653 599
## your briefing s and afghanistan u.s
## 546 531 499 417 391 380
## for afghan biden on taliban is
## 318 291 271 265 242 227
## trump as
## 208 206
print_df <- corpus_subset(print_corpus) %>%
tokens(remove_punct = TRUE) %>%
dfm()
topfeatures(print_df, n=20)
## <U+FFFD> the to in of a
## 626 461 445 443 414 406
## s u.s and on for afghan
## 251 249 224 195 191 151
## afghanistan taliban is biden as at
## 149 149 140 131 130 101
## trump war
## 100 99
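That top feature is the Unicode replacement character (printed as <U+FFFD>), which quanteda treats as a symbol rather than punctuation. A possible fix, sketched here with the placeholder name main_tokens_nosym, is to remove symbols at tokenization and drop the character explicitly as a fallback.
# sketch: remove symbols (including the U+FFFD replacement character) up front
main_tokens_nosym <- tokens(main_corpus, remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = "\uFFFD")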
To look at the distribution of word frequencies, I create a data frame of counts and ranks. Just from the data frame for the main headlines, it is clear I also need to remove stop words from my analysis.
word_counts <- as.data.frame(sort(colSums(main_df),dec=T))
colnames(word_counts) <- c("Frequency")
word_counts$Rank <- c(1:ncol(main_df))
head(word_counts)
## Frequency Rank
## <U+FFFD> 1307 1
## the 1137 2
## in 734 3
## to 683 4
## a 653 5
## of 599 6
Until I do so, analyzing the results will not be very meaningful.
# note: re-tokenizing here without remove_punct = TRUE, which is why punctuation
# such as "." and ":" reappears in the frequency counts further below
main_tokens <- tokens(main_corpus)
main_tokens <- tokens_tolower(main_tokens)
main_tokens <- tokens_select(main_tokens,
pattern = stopwords("en"),
selection = "remove")
main_dfm <- dfm(main_tokens)
length(main_tokens)
## [1] 3442
print(main_tokens)
## Tokens consisting of 3,442 documents and 5 docvars.
## text1 :
## [1] "quotation" "day" ":" "ex-seal" "now" "pitching"
## [7] "products" "president"
##
## text2 :
## [1] "afghan" "war" "casualty" "report" ":" "january" "2020"
##
## text3 :
## [1] "50" "tv" "shows" "need" "watch" "winter"
##
## text4 :
## [1] "history" "war" "six" "drugs"
##
## text5 :
## [1] "airstrike" "pushes" "national" "security" "forefront" "2020"
## [7] "race"
##
## text6 :
## [1] "case" "one-term" "joe"
##
## [ reached max_ndoc ... 3,436 more documents ]
print_tokens <- tokens(print_corpus)
print_tokens <- tokens_tolower(print_tokens)
print_tokens <- tokens_select(print_tokens,
pattern = stopwords("en"),
selection = "remove")
length(print_tokens)
## [1] 3442
print(print_tokens)
## Tokens consisting of 3,442 documents and 5 docvars.
## text1 :
## [1] "quote" "day"
##
## text2 :
## character(0)
##
## text3 :
## [1] "50" "tv" "shows" "watch" "winter"
##
## text4 :
## character(0)
##
## text5 :
## [1] "airstrike" "pushes" "foreign" "policy" "front"
## [6] "presidential" "race"
##
## text6 :
## [1] "case" "one-term" "president" "biden"
##
## [ reached max_ndoc ... 3,436 more documents ]
Now I can use quanteda to generate the document-feature matrices:
main_dfm <- dfm(main_tokens)
main_dfm
## Document-feature matrix of: 3,442 documents, 5,701 features (99.87% sparse) and 5 docvars.
## features
## docs quotation day : ex-seal now pitching products president afghan war
## text1 1 1 1 1 1 1 1 1 0 0
## text2 0 0 1 0 0 0 0 0 1 1
## text3 0 0 0 0 0 0 0 0 0 0
## text4 0 0 0 0 0 0 0 0 0 1
## text5 0 0 0 0 0 0 0 0 0 0
## text6 0 0 0 0 0 0 0 0 0 0
## [ reached max_ndoc ... 3,436 more documents, reached max_nfeat ... 5,691 more features ]
print_dfm <- dfm(print_tokens)
print_dfm
## Document-feature matrix of: 3,442 documents, 4,121 features (99.91% sparse) and 5 docvars.
## features
## docs quote day 50 tv shows watch winter airstrike pushes foreign
## text1 1 1 0 0 0 0 0 0 0 0
## text2 0 0 0 0 0 0 0 0 0 0
## text3 0 0 1 1 1 1 1 0 0 0
## text4 0 0 0 0 0 0 0 0 0 0
## text5 0 0 0 0 0 0 0 1 1 1
## text6 0 0 0 0 0 0 0 0 0 0
## [ reached max_ndoc ... 3,436 more documents, reached max_nfeat ... 4,111 more features ]
# trim based on the overall frequency (i.e., the word counts) with a max at the top "non-gibberish" term.
smaller_main_dfm <- dfm_trim(main_dfm, max_termfreq = 1137)
smaller_print_dfm <- dfm_trim(print_dfm, max_termfreq = 461)
# trim based on the proportion of documents that the feature appears in; the
# commented-out version keeps only features appearing in at least 10% of documents (articles)
#smaller_main_dfm <- dfm_trim(smaller_main_dfm, min_docfreq = 0.10, docfreq_type = "prop")
#smaller_main_dfm
#smaller_print_dfm <- dfm_trim(smaller_print_dfm, min_docfreq = 0.10, docfreq_type = "prop")
#smaller_print_dfm
Now I can take a look at the wordcloud and updated word frequency metrics:
# programs often work with random initializations, yielding different outcomes.
# we can set a standard starting point though to ensure the same output.
set.seed(1234)
# draw the wordcloud
textplot_wordcloud(smaller_main_dfm, min_count = 10, random_order = FALSE)
textplot_wordcloud(smaller_print_dfm, min_count = 10, random_order = FALSE)
# first, we need to create a word frequency variable and the rankings
word_counts <- as.data.frame(sort(colSums(smaller_main_dfm),dec=T))
colnames(word_counts) <- c("Frequency")
word_counts$Rank <- c(1:ncol(smaller_main_dfm))
head(word_counts)
## Frequency Rank
## . 1090 1
## : 597 2
## briefing 531 3
## s 499 4
## afghanistan 391 5
## u.s 380 6
On initial review, I have reduced the sparsity from over 99%, but the matrix is still more than 90% sparse. I could drop terms from the updated word count list that still appear to be punctuation or stray letters, if I’m sure they are not relevant to the context of my research. Also, rather than simply dropping the term “briefing”, I will run this analysis again, selecting only certain types of news desks and excluding the “briefing”-style entries.
I’m also going to try again to exclude the gibberish, and to loosen the document-frequency threshold so that features appearing in more than 1% of the articles are kept (rather than the 10% used in the commented-out trim above).
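As a sketch of what that re-run might look like (the object names are placeholders, and filtering headlines on the word “Briefing” is only one way to identify those entries; filtering on the news desk docvar would be another):
# sketch: drop the daily briefing round-ups, then rebuild the dfm with a looser
# 1% document-frequency floor and explicit removal of leftover garbled features
no_briefing_corpus <- corpus_subset(main_corpus,
                                    !grepl("Briefing", as.character(main_corpus)))
cleaner_main_dfm <- no_briefing_corpus %>%
  tokens(remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en")) %>%
  dfm() %>%
  dfm_remove(pattern = c("\uFFFD", "s")) %>%
  dfm_trim(min_docfreq = 0.01, docfreq_type = "prop")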
Now I can take a look at this network of feature co-occurrences for the main headlines:
# create fcm from dfm
smaller_main_fcm <- fcm(smaller_main_dfm)
# check the dimensions (i.e., the number of rows and the number of columnns)
# of the matrix we created
dim(smaller_main_fcm)
## [1] 5699 5699
# pull the top features
myFeatures <- names(topfeatures(smaller_main_fcm, 20))
# retain only those top features as part of our matrix
even_smaller_main_fcm <- fcm_select(smaller_main_fcm, pattern = myFeatures, selection = "keep")
# check dimensions
dim(even_smaller_main_fcm)
## [1] 20 20
# compute size weight for vertices in network
size <- log(colSums(even_smaller_main_fcm))
# create plot
textplot_network(even_smaller_main_fcm, vertex_size = size / max(size) * 3)
and for the print headlines:
# create fcm from dfm
smaller_print_fcm <- fcm(smaller_print_dfm)
# check the dimensions (i.e., the number of rows and the number of columnns)
# of the matrix we created
dim(smaller_print_fcm)
## [1] 4119 4119
# pull the top features
myFeatures <- names(topfeatures(smaller_print_fcm, 20))
# retain only those top features as part of our matrix
even_smaller_print_fcm <- fcm_select(smaller_print_fcm, pattern = myFeatures, selection = "keep")
# check dimensions
dim(even_smaller_print_fcm)
## [1] 20 20
# compute size weight for vertices in network
size <- log(colSums(even_smaller_print_fcm))
# create plot
textplot_network(even_smaller_print_fcm, vertex_size = size / max(size) * 3)
Finally, before going back to the start and applying what I’ve learned here to the data tomorrow, I’m trying something new I found for pre-processing and word cloud modeling!
library(tm)   # Corpus(), tm_map(), removePunctuation, removeWords
preprocessing = function (doc){
  # strip anything that is not alphanumeric (this also coerces the quanteda
  # corpus to a plain character vector of headline texts)
  doc = gsub("[^[:alnum:]]"," ",doc)
  #create corpus
  corpus = Corpus(VectorSource(doc))
  #Removal of punctuation
  corpus = tm_map(corpus, removePunctuation)
  #customize my stopwords
  mystopword = "briefing"
  #Removal of stopwords
  corpus = tm_map(corpus, removeWords, c(stopwords("english"),mystopword))
  #return result
  return(corpus)
}
main_clean = preprocessing(main_corpus)
print_clean = preprocessing(print_corpus)
set.seed(1234)
# draw the wordcloud
library(wordcloud)
par(mfrow=c(1,2)) # 1x2 panel plot
par(mar=c(1, 3, 1, 3)) # Set the plot margin
par(bg="black") # set background color as black
par(col.main="white") # set title color as white
wordcloud(main_clean, scale=c(4,.5),min.freq=3, max.words=Inf, random.order=F,
colors = brewer.pal(8, "Set3"))
title("Main Website Headlines")
wordcloud(print_clean, scale=c(4,.5),min.freq=3, max.words=Inf, random.order=F,
colors = brewer.pal(8, "Set3"))
title("Print Headlines")