I have begun pulling the data I will be analyzing for my final project in the course. So far, I have collected articles by month, from January 2020 through December 2021, from the New York Times using their “article search” API with the search query “Afghanistan”. I have not limited my search by any filter at this time. One limitation is that the article search API does not return the full text of each article; instead, I have been able to pull the abstract/summary, lead paragraph, and snippet for each article, along with the keywords, authors, sections, and URL. In addition, I can get the article titles for both the print and online versions of the article.
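For reference, one month of the collection step looked roughly like the sketch below. This is illustrative rather than my exact collection code: the endpoint and the q, begin_date, end_date, and api-key parameters are the standard Article Search API query fields, and NYT_KEY is a placeholder for my actual key.
# a sketch of a single month's pull from the NYT Article Search API
# (illustrative; NYT_KEY is a placeholder for a real key, and the API returns
# 10 results per page, so the real pull loops over pages and months)
library(httr)
library(jsonlite)
resp <- GET("https://api.nytimes.com/svc/search/v2/articlesearch.json",
            query = list(q = "Afghanistan",
                         begin_date = "20200101",
                         end_date = "20200131",
                         `api-key` = Sys.getenv("NYT_KEY")))
docs <- fromJSON(content(resp, as = "text", encoding = "UTF-8"),
                 flatten = TRUE)$response$docs
# the fields the API does return: abstract, lead paragraph, snippet, keywords,
# byline, section, headlines, and url
head(docs[, c("abstract", "lead_paragraph", "snippet", "section_name", "web_url")])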
Loading the data from my collection phase:
library(quanteda)            # corpus, tokens, dfm
library(quanteda.textplots)  # wordcloud and network plots
library(ggplot2)             # frequency plots
library(dplyr)               # pipe operator
#load data
load("afghanistan.articles.table.RData")
afghanistan_lead <- read.csv("lead.paragraph.table.csv")
afghanistan_articles <- as.data.frame(afghanistan_lead)
Creating a corpus from the data:
afghanistan_corpus <- corpus(afghanistan_articles)
afghanistan_summary <- summary(afghanistan_corpus)
head(afghanistan_corpus)
## Corpus consisting of 6 documents and 11 docvars.
## text1 :
## "“He’s a Rambo version of the same story Trump has been telli..."
##
## text2 :
## "Sign up for our Watching newsletter to get recommendations o..."
##
## text3 :
## "At least 87 pro-government forces and 12 civilians were kill..."
##
## text4 :
## "TOKYO — Carlos Ghosn was aided in his escape from Japan by a..."
##
## text5 :
## "You’re reading this week’s At War newsletter. Sign up here t..."
##
## text6 :
## "He changed the shape of the Syrian civil war and tightened I..."
This time I’m going to add an indicator of the search term used for this corpus, in case I want to add more search terms in the future.
# add an indicator of the search term used
afghanistan_articles$term <- "Afghanistan"
# add the metadata
docvars(afghanistan_corpus, field = "term") <- afghanistan_articles$term
Next, I create tokens from the corpus. Punctuation stays in for this first pass; I'll remove it below when building the cleaned document-feature matrix.
afghanistan_tokens <- tokens(afghanistan_corpus)
print(afghanistan_tokens)
## Tokens consisting of 3,442 documents and 12 docvars.
## text1 :
## [1] "\"" "He's" "a" "Rambo" "version" "of" "the"
## [8] "same" "story" "Trump" "has" "been"
## [ ... and 28 more ]
##
## text2 :
## [1] "Sign" "up" "for" "our"
## [5] "Watching" "newsletter" "to" "get"
## [9] "recommendations" "on" "the" "best"
## [ ... and 14 more ]
##
## text3 :
## [1] "At" "least" "87" "pro-government"
## [5] "forces" "and" "12" "civilians"
## [9] "were" "killed" "in" "Afghanistan"
## [ ... and 158 more ]
##
## text4 :
## [1] "TOKYO" "-" "Carlos" "Ghosn" "was" "aided" "in" "his"
## [9] "escape" "from" "Japan" "by"
## [ ... and 44 more ]
##
## text5 :
## [1] "You're" "reading" "this" "week's" "At"
## [6] "War" "newsletter" "." "Sign" "up"
## [11] "here" "to"
## [ ... and 13 more ]
##
## text6 :
## [1] "He" "changed" "the" "shape" "of" "the"
## [7] "Syrian" "civil" "war" "and" "tightened" "Iran's"
## [ ... and 48 more ]
##
## [ reached max_ndoc ... 3,436 more documents ]
Now I’ll use quanteda to generate the document-feature matrix from the corpus object:
afghanistan_dfm <- dfm(tokens(afghanistan_corpus))
afghanistan_dfm
## Document-feature matrix of: 3,442 documents, 14,101 features (99.75% sparse) and 12 docvars.
## features
## docs " he's a rambo version of the same story trump
## text1 2 1 1 1 1 1 4 1 1 1
## text2 0 0 0 0 0 0 1 0 0 0
## text3 0 0 3 0 0 1 9 0 0 0
## text4 0 0 2 0 0 1 4 0 0 0
## text5 0 0 0 0 0 0 0 0 0 0
## text6 0 0 0 0 0 4 5 0 0 0
## [ reached max_ndoc ... 3,436 more documents, reached max_nfeat ... 14,091 more features ]
This is a bit more difficult to navigate given that my data is not subdivided by chapter; instead, there is an individual record for each of the 3,442 articles. Some pre-processing as part of the matrix creation should help. As a comparison, I remove punctuation, numbers, capitalization, and stopwords:
# create the dfm
afghanistan_dfm <- tokens(afghanistan_corpus,
                          remove_punct = TRUE,
                          remove_numbers = TRUE) %>%
  dfm(tolower = TRUE) %>%
  dfm_remove(stopwords('english'))
# find out a quick summary of the dfm
afghanistan_dfm
## Document-feature matrix of: 3,442 documents, 13,624 features (99.84% sparse) and 12 docvars.
## features
## docs rambo version story trump telling deep state trying screw media
## text1 1 1 1 1 1 1 1 1 1 1
## text2 0 0 0 0 0 0 0 0 0 0
## text3 0 0 0 0 0 0 0 0 0 0
## text4 0 0 0 0 0 0 0 0 0 0
## text5 0 0 0 0 0 0 0 0 0 0
## text6 0 0 0 0 0 0 0 0 0 0
## [ reached max_ndoc ... 3,436 more documents, reached max_nfeat ... 13,614 more features ]
With this simplified version, I can look at the most frequent terms (features), starting with the top 20.
topfeatures(afghanistan_dfm, 20)
## afghanistan president taliban washington american kabul
## 963 624 507 460 436 404
## new united military u.s said biden
## 388 378 375 364 358 357
## war states get afghan two one
## 352 343 302 290 280 257
## people trump
## 247 236
I would really like to be able to look at the most frequent terms by month/year/section, but have not been able to get that command to run properly yet.
# my failed attempt: comparing feature counts against a docvar value returns NAs
world_words <- as.vector(colSums(afghanistan_dfm) == afghanistan_dfm$section.name["World"])
head(colnames(afghanistan_dfm)[world_words])
## [1] NA NA NA NA NA NA
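One approach that should work, shown as a sketch below (it assumes the section docvar is named section.name, as referenced in my attempt above, and I have not run it yet), is to subset the dfm by docvar and then rank the features within that subset:
# subset the dfm to articles from the "World" section, then rank its features
# (a sketch; assumes the docvar is named section.name)
world_dfm <- dfm_subset(afghanistan_dfm, section.name == "World")
topfeatures(world_dfm, 20)
# the same idea should work by month/year once a month docvar is added,
# e.g. dfm_group(afghanistan_dfm, groups = month) followed by topfeatures()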
Looking ahead to the analysis itself, there are a couple of ways to examine the language in these articles beyond raw frequencies. First, the quanteda.dictionaries package contains the liwcalike() function, which takes a corpus or character vector and carries out a dictionary-based analysis that mimics the pay-to-play LIWC software (Linguistic Inquiry and Word Count). LIWC calculates the percentage of each document that reflects a host of different characteristics. I plan to focus on positive and negative language, but keep in mind that there are lots of other dimensions that could be of interest.
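A sketch of what that liwcalike() call might look like for this corpus is below; I have not run it yet, and the NRC dictionary is just one stand-in for a dictionary with positive and negative categories.
# dictionary-based scoring of each article (a sketch, not yet run here);
# data_dictionary_NRC ships with quanteda.dictionaries and includes
# "positive" and "negative" categories among others
library(quanteda.dictionaries)
sentiment_scores <- liwcalike(as.character(afghanistan_corpus),
                              dictionary = data_dictionary_NRC)
head(sentiment_scores[, c("docname", "positive", "negative")])
For now, though, I'll return to descriptive views of the corpus, starting with a wordcloud: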
# programs often work with random initializations, yielding different outcomes.
# we can set a standard starting point though to ensure the same output.
set.seed(1234)
# draw the wordcloud
textplot_wordcloud(afghanistan_dfm, min_count = 50, random_order = FALSE)
To look at the distribution of word frequencies, I first create a data frame of counts and ranks. Unfortunately, I still need to figure out how to remove the two most frequent occurrences, which are not relevant to the analysis and are likely an artifact of the NYT import.
# first, we need to create a word frequency variable and the rankings
word_counts <- as.data.frame(sort(colSums(afghanistan_dfm),dec=T))
colnames(word_counts) <- c("Frequency")
word_counts$Rank <- c(1:ncol(afghanistan_dfm))
head(word_counts)
## Frequency Rank
## afghanistan 963 1
## president 624 2
## taliban 507 3
## washington 460 4
## american 436 5
## kabul 404 6
Until I do so, plotting the results isn’t very informative.
ggplot(word_counts, mapping = aes(x = Rank, y = Frequency)) +
  geom_point() +
  labs(title = "Zipf's Law", x = "Rank", y = "Frequency") +
  theme_bw()
# trim based on the overall frequency (i.e., the word counts) with a max at the top "non-gibberish" term.
smaller_dfm <- dfm_trim(afghanistan_dfm, max_termfreq = 1043)
# trim based on the proportion of documents that the feature appears in; here,
# the feature needs to appear in more than 5% of documents (articles)
smaller_dfm <- dfm_trim(smaller_dfm, min_docfreq = 0.05, docfreq_type = "prop")
smaller_dfm
## Document-feature matrix of: 3,442 documents, 31 features (90.95% sparse) and 12 docvars.
## features
## docs trump people get afghanistan week taliban one afghan two american
## text1 1 1 0 0 0 0 0 0 0 0
## text2 0 0 1 0 0 0 0 0 0 0
## text3 0 0 0 1 2 2 1 1 1 1
## text4 0 0 0 0 0 0 0 0 0 1
## text5 0 0 1 0 0 0 0 0 0 0
## text6 0 0 0 0 0 0 0 0 1 1
## [ reached max_ndoc ... 3,436 more documents, reached max_nfeat ... 21 more features ]
Now I can take a look again at the wordcloud and word frequency metrics:
# programs often work with random initializations, yielding different outcomes.
# we can set a standard starting point though to ensure the same output.
set.seed(1234)
# draw the wordcloud
textplot_wordcloud(smaller_dfm, min_count = 1, random_order = FALSE)
# first, we need to create a word frequency variable and the rankings
word_counts <- as.data.frame(sort(colSums(smaller_dfm),dec=T))
colnames(word_counts) <- c("Frequency")
word_counts$Rank <- c(1:ncol(smaller_dfm))
word_counts
## Frequency Rank
## afghanistan 963 1
## president 624 2
## taliban 507 3
## washington 460 4
## american 436 5
## kabul 404 6
## new 388 7
## united 378 8
## military 375 9
## u.s 364 10
## said 358 11
## biden 357 12
## war 352 13
## states 343 14
## get 302 15
## afghan 290 16
## two 280 17
## one 257 18
## people 247 19
## trump 236 20
## country 234 21
## officials 234 22
## want 231 23
## government 226 24
## coronavirus 226 25
## years 225 26
## last 220 27
## troops 218 28
## week 216 29
## first 213 30
## sign-up 212 31
ggplot(word_counts, mapping = aes(x = Rank, y = Frequency)) +
  geom_point() +
  labs(title = "Zipf's Law", x = "Rank", y = "Frequency") +
  theme_bw()
On initial review, I have successfully reduced the sparsity from over 99% to ~90%. That is still quite sparse, though. I could also drop terms from the updated word count list, such as “one”, “two”, and “get”, if I am sure they are not relevant to the context of my research.
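If I do decide those terms are noise, dropping them is straightforward; here is a sketch, where the term list is a placeholder I would still need to justify against my research question.
# drop additional context-free terms (a sketch; the term list is a placeholder)
extra_stops <- c("one", "two", "get")
afghanistan_dfm_custom <- dfm_remove(afghanistan_dfm, pattern = extra_stops)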
I'm also going to try again to exclude the gibberish, and to loosen the document-frequency threshold so that features appearing in more than 1% of articles are kept, rather than 5%.
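Here is a sketch of that planned re-trim; it is not used for the results below, which still rely on the 5% version.
# planned re-trim: same frequency ceiling, but keep features that appear in
# more than 1% of articles instead of 5% (a sketch, not used below)
smaller_dfm_1pct <- dfm_trim(afghanistan_dfm, max_termfreq = 1043) %>%
  dfm_trim(min_docfreq = 0.01, docfreq_type = "prop")
smaller_dfm_1pct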
Now I can take a look at this network of feature co-occurrences:
# create fcm from dfm
smaller_fcm <- fcm(smaller_dfm)
# check the dimensions (i.e., the number of rows and the number of columns)
# of the matrix we created
dim(smaller_fcm)
## [1] 31 31
# pull the top features
myFeatures <- names(topfeatures(smaller_fcm, 30))
# retain only those top features as part of our matrix
even_smaller_fcm <- fcm_select(smaller_fcm, pattern = myFeatures, selection = "keep")
# check dimensions
dim(even_smaller_fcm)
## [1] 30 30
# compute size weight for vertices in network
size <- log(colSums(even_smaller_fcm))
# create plot
textplot_network(even_smaller_fcm, vertex_size = size / max(size) * 3)
I am still not confident that the models I am creating will truly assist me with my original project topic, but this tutorial and process have definitely expanded my knowledge of representing texts through these methods.