Getting Started

I have begun pulling the data I will be analyzing for my final project in the course. So far, I have pulled a collection of articles by month, from January 2020 through December 2021, from the New York Times using their “article search” API with the search query “Afghanistan”. I have not applied any filters to the search at this point. One constraint is that the article search API does not return the full text of each article; instead, I have been able to pull the abstract/summary, lead paragraph, and snippet for each article, along with the keywords, authors, sections, and URL. I can also get the article titles for both the print and online versions of each article.
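For reference, a minimal sketch of the kind of request used to collect one month of results is below. This is an illustration rather than my exact collection code; it assumes the httr and jsonlite packages, and NYT_KEY is a placeholder name for an environment variable holding an API key.

library(httr)
library(jsonlite)

# one page (10 results) of the query for January 2020; NYT_KEY is a placeholder
# name for an environment variable holding my personal API key
resp <- GET(
  "https://api.nytimes.com/svc/search/v2/articlesearch.json",
  query = list(
    q = "Afghanistan",
    begin_date = "20200101",
    end_date = "20200131",
    page = 0,
    `api-key` = Sys.getenv("NYT_KEY")
  )
)
docs <- fromJSON(content(resp, as = "text", encoding = "UTF-8"), flatten = TRUE)$response$docs
# the fields described above: abstract, lead paragraph, snippet, section, url
head(docs[, c("abstract", "lead_paragraph", "snippet", "section_name", "web_url")])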

Loading the data from my collection phase:

# packages used below: quanteda (text processing), quanteda.textplots (plots), ggplot2 (charts)
library(quanteda)
library(quanteda.textplots)
library(ggplot2)
#load data
load("afghanistan.articles.table.RData")
afghanistan_lead <- read.csv("lead.paragraph.table.csv")
afghanistan_articles <- as.data.frame(afghanistan_lead)

Creating a corpus from the data:

afghanistan_corpus <- corpus(afghanistan_articles)
afghanistan_summary <- summary(afghanistan_corpus)
head(afghanistan_corpus)
## Corpus consisting of 6 documents and 11 docvars.
## text1 :
## "“He’s a Rambo version of the same story Trump has been telli..."
## 
## text2 :
## "Sign up for our Watching newsletter to get recommendations o..."
## 
## text3 :
## "At least 87 pro-government forces and 12 civilians were kill..."
## 
## text4 :
## "TOKYO — Carlos Ghosn was aided in his escape from Japan by a..."
## 
## text5 :
## "You’re reading this week’s At War newsletter. Sign up here t..."
## 
## text6 :
## "He changed the shape of the Syrian civil war and tightened I..."

This time I’m going to add an indicator of the search term used for this corpus, in case I want to add more search terms in the future.

# add an indicator of the search term used
afghanistan_articles$term <- "Afghanistan"
# add the metadata
docvars(afghanistan_corpus, field = "term") <- afghanistan_articles$term
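A quick check that the new docvar is attached (output omitted here):

head(docvars(afghanistan_corpus, "term"))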

Next, I create tokens from the corpus. Punctuation is left in for now; I will remove it later when building the document-feature matrix.

afghanistan_tokens <- tokens(afghanistan_corpus)
print(afghanistan_tokens)
## Tokens consisting of 3,442 documents and 12 docvars.
## text1 :
##  [1] "\""      "He's"    "a"       "Rambo"   "version" "of"      "the"    
##  [8] "same"    "story"   "Trump"   "has"     "been"   
## [ ... and 28 more ]
## 
## text2 :
##  [1] "Sign"            "up"              "for"             "our"            
##  [5] "Watching"        "newsletter"      "to"              "get"            
##  [9] "recommendations" "on"              "the"             "best"           
## [ ... and 14 more ]
## 
## text3 :
##  [1] "At"             "least"          "87"             "pro-government"
##  [5] "forces"         "and"            "12"             "civilians"     
##  [9] "were"           "killed"         "in"             "Afghanistan"   
## [ ... and 158 more ]
## 
## text4 :
##  [1] "TOKYO"  "-"      "Carlos" "Ghosn"  "was"    "aided"  "in"     "his"   
##  [9] "escape" "from"   "Japan"  "by"    
## [ ... and 44 more ]
## 
## text5 :
##  [1] "You're"     "reading"    "this"       "week's"     "At"        
##  [6] "War"        "newsletter" "."          "Sign"       "up"        
## [11] "here"       "to"        
## [ ... and 13 more ]
## 
## text6 :
##  [1] "He"        "changed"   "the"       "shape"     "of"        "the"      
##  [7] "Syrian"    "civil"     "war"       "and"       "tightened" "Iran's"   
## [ ... and 48 more ]
## 
## [ reached max_ndoc ... 3,436 more documents ]

Now I’ll use quanteda to generate the document-feature matrix from the tokenized corpus:

afghanistan_dfm <- dfm(tokens(afghanistan_corpus))
afghanistan_dfm
## Document-feature matrix of: 3,442 documents, 14,101 features (99.75% sparse) and 12 docvars.
##        features
## docs    " he's a rambo version of the same story trump
##   text1 2    1 1     1       1  1   4    1     1     1
##   text2 0    0 0     0       0  0   1    0     0     0
##   text3 0    0 3     0       0  1   9    0     0     0
##   text4 0    0 2     0       0  1   4    0     0     0
##   text5 0    0 0     0       0  0   0    0     0     0
##   text6 0    0 0     0       0  4   5    0     0     0
## [ reached max_ndoc ... 3,436 more documents, reached max_nfeat ... 14,091 more features ]

This is a bit more difficult to navigate given that my data is not subdivided by chapter; instead, there is an individual record for each of the 3,442 articles. Perhaps I should do some pre-processing as part of the matrix creation. Here I remove punctuation, numbers, capitalization, and stopwords as a comparison:

# create the dfm
afghanistan_dfm <- tokens(afghanistan_corpus,
                          remove_punct = TRUE,
                          remove_numbers = TRUE) %>%
  dfm(tolower = TRUE) %>%
  dfm_remove(stopwords('english'))
# find out a quick summary of the dfm
afghanistan_dfm
## Document-feature matrix of: 3,442 documents, 13,624 features (99.84% sparse) and 12 docvars.
##        features
## docs    rambo version story trump telling deep state trying screw media
##   text1     1       1     1     1       1    1     1      1     1     1
##   text2     0       0     0     0       0    0     0      0     0     0
##   text3     0       0     0     0       0    0     0      0     0     0
##   text4     0       0     0     0       0    0     0      0     0     0
##   text5     0       0     0     0       0    0     0      0     0     0
##   text6     0       0     0     0       0    0     0      0     0     0
## [ reached max_ndoc ... 3,436 more documents, reached max_nfeat ... 13,614 more features ]

With this simplified version, I can look at the most frequent terms (features), starting with the top 20.

topfeatures(afghanistan_dfm, 20)
## afghanistan   president     taliban  washington    american       kabul 
##         963         624         507         460         436         404 
##         new      united    military         u.s        said       biden 
##         388         378         375         364         358         357 
##         war      states         get      afghan         two         one 
##         352         343         302         290         280         257 
##      people       trump 
##         247         236

I would really like to be able to look at the most frequent terms by month/year/section, but have not been able to get the command below to run properly yet.

world_words <- as.vector(colSums(afghanistan_dfm) == afghanistan_dfm$section.name["World"])
head(colnames(afghanistan_dfm)[world_words])
## [1] NA NA NA NA NA NA
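The attempt above compares the total count of each feature against a document variable (indexed by a name that doesn’t exist), which is why it only returns NA values. A sketch of one approach that should work, assuming the NYT section is stored in the section.name docvar as the code above implies, is to subset the dfm by that docvar and then pull the top features:

# keep only articles whose section is "World", then list the top 20 features
world_dfm <- dfm_subset(afghanistan_dfm, section.name == "World")
topfeatures(world_dfm, 20)

The same pattern should work for month or year, given docvars derived from each article’s publication date.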

liwcalike()

There are a couple of ways to bring in dictionary-based measures. First, the quanteda.dictionaries package contains the liwcalike() function, which takes a corpus or character vector and, based on a provided dictionary, carries out an analysis that mimics the pay-to-play LIWC software (Linguistic Inquiry and Word Count). The LIWC software calculates the percentage of the document that reflects a host of different characteristics. I am going to focus on positive and negative language, but keep in mind that there are lots of other dimensions that could be of interest.
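I have not run this yet, but a minimal sketch of what the call might look like is below. It assumes the quanteda.dictionaries package is installed and uses its bundled NRC dictionary (data_dictionary_NRC) for the positive and negative categories.

library(quanteda.dictionaries)
# percentage of each article's words falling in each NRC category, including positive/negative
sentiment_df <- liwcalike(afghanistan_corpus, dictionary = data_dictionary_NRC)
head(sentiment_df[, c("docname", "positive", "negative")])

For now, I’ll stick with descriptive views of the corpus, starting with a wordcloud of the pre-processed dfm.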

# programs often work with random initializations, yielding different outcomes.
# we can set a standard starting point though to ensure the same output.
set.seed(1234)
# draw the wordcloud
textplot_wordcloud(afghanistan_dfm, min_count = 50, random_order = FALSE)

To look at the distribution of word frequencies, I first create a dataframe. Unfortunately, I still need to figure out how to remove the two most frequent features, which are not relevant to the analysis and are likely an artifact of the import from the NYT API.

# first, we need to create a word frequency variable and the rankings
word_counts <- as.data.frame(sort(colSums(afghanistan_dfm),dec=T))
colnames(word_counts) <- c("Frequency")
word_counts$Rank <- c(1:ncol(afghanistan_dfm))
head(word_counts)
##             Frequency Rank
## afghanistan       963    1
## president         624    2
## taliban           507    3
## washington        460    4
## american          436    5
## kabul             404    6

Until I do so, plotting the results isn’t very informative.

ggplot(word_counts, mapping = aes(x = Rank, y = Frequency)) + 
  geom_point() +
  labs(title = "Zipf's Law", x = "Rank", y = "Frequency") + 
  theme_bw()
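If I can pin down exactly which features are the junk terms, one option would be to drop them by name with dfm_remove(); the terms in this sketch are placeholders rather than my actual data.

# hypothetical: replace the placeholder terms with the real junk features once identified
afghanistan_dfm_clean <- dfm_remove(afghanistan_dfm, pattern = c("junkterm1", "junkterm2"))
topfeatures(afghanistan_dfm_clean, 20)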

Alternatively, I can trim the dfm to exclude those top gibberish results.

# trim based on the overall frequency (i.e., the word counts) with a max at the top "non-gibberish" term.
smaller_dfm <- dfm_trim(afghanistan_dfm, max_termfreq = 1043)
# trim based on the proportion of documents that the feature appears in; here, 
# the feature needs to appear in at least 5% of documents (articles)
smaller_dfm <- dfm_trim(smaller_dfm, min_docfreq = 0.05, docfreq_type = "prop")
smaller_dfm
## Document-feature matrix of: 3,442 documents, 31 features (90.95% sparse) and 12 docvars.
##        features
## docs    trump people get afghanistan week taliban one afghan two american
##   text1     1      1   0           0    0       0   0      0   0        0
##   text2     0      0   1           0    0       0   0      0   0        0
##   text3     0      0   0           1    2       2   1      1   1        1
##   text4     0      0   0           0    0       0   0      0   0        1
##   text5     0      0   1           0    0       0   0      0   0        0
##   text6     0      0   0           0    0       0   0      0   1        1
## [ reached max_ndoc ... 3,436 more documents, reached max_nfeat ... 21 more features ]

Now I can take a look again at the wordcloud and word frequency metrics:

# programs often work with random initializations, yielding different outcomes.
# we can set a standard starting point though to ensure the same output.
set.seed(1234)
# draw the wordcloud
textplot_wordcloud(smaller_dfm, min_count = 1, random_order = FALSE)

# first, we need to create a word frequency variable and the rankings
word_counts <- as.data.frame(sort(colSums(smaller_dfm),dec=T))
colnames(word_counts) <- c("Frequency")
word_counts$Rank <- c(1:ncol(smaller_dfm))
word_counts
##             Frequency Rank
## afghanistan       963    1
## president         624    2
## taliban           507    3
## washington        460    4
## american          436    5
## kabul             404    6
## new               388    7
## united            378    8
## military          375    9
## u.s               364   10
## said              358   11
## biden             357   12
## war               352   13
## states            343   14
## get               302   15
## afghan            290   16
## two               280   17
## one               257   18
## people            247   19
## trump             236   20
## country           234   21
## officials         234   22
## want              231   23
## government        226   24
## coronavirus       226   25
## years             225   26
## last              220   27
## troops            218   28
## week              216   29
## first             213   30
## sign-up           212   31
ggplot(word_counts, mapping = aes(x = Rank, y = Frequency)) + 
  geom_point() +
  labs(title = "Zipf's Law", x = "Rank", y = "Frequency") + 
  theme_bw()

On initial review, I have successfully reduced the sparsity from over 99% to ~90%. That is still quite sparse, though. I could also drop terms from the updated word count list, such as "one", "two", and "get", if I’m sure they are not relevant to the context of my research.
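If I decide those terms are noise, dropping them is straightforward with dfm_remove(); the sketch below writes the result to a new object (smaller_dfm_v2 is just an illustrative name) so the matrices used later are unchanged.

# drop generic terms that likely carry little meaning for this analysis
smaller_dfm_v2 <- dfm_remove(smaller_dfm, pattern = c("one", "two", "get"))
topfeatures(smaller_dfm_v2, 20)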

Feature Co-Occurrence Matrix

I’m going to again work from the trimmed dfm that excludes the gibberish. I could also increase the number of features being evaluated by keeping terms that appear in at least 1% of the articles rather than 5%, as sketched below.
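That lower threshold would look like the following (a sketch only, written to the hypothetical name smaller_dfm_1pct; the co-occurrence matrix below continues from the 5% version created earlier):

# same overall frequency cap as before, but keep features appearing in at least 1% of articles
smaller_dfm_1pct <- dfm_trim(afghanistan_dfm, max_termfreq = 1043)
smaller_dfm_1pct <- dfm_trim(smaller_dfm_1pct, min_docfreq = 0.01, docfreq_type = "prop")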

Now I can take a look at this network of feature co-occurrences:

# create fcm from dfm
smaller_fcm <- fcm(smaller_dfm)
# check the dimensions (i.e., the number of rows and the number of columns)
# of the matrix we created
dim(smaller_fcm)
## [1] 31 31
# pull the top features
myFeatures <- names(topfeatures(smaller_fcm, 30))
# retain only those top features as part of our matrix
even_smaller_fcm <- fcm_select(smaller_fcm, pattern = myFeatures, selection = "keep")
# check dimensions
dim(even_smaller_fcm)
## [1] 30 30
# compute size weight for vertices in network
size <- log(colSums(even_smaller_fcm))
# create plot
textplot_network(even_smaller_fcm, vertex_size = size / max(size) * 3)

I am still not confident that the models I am creating will truly help with my original project topic, but this tutorial and process have definitely expanded my knowledge of representing texts with these methods.