In my previous posts, I used data pulled from the New York Times “Article Search” API to analyze the 3,442 results of the search query “Afghanistan” for the years 2020 and 2021. My overall goal has been to analyze the articles for differences in framing and use of sources relative to each article’s sentiment. However, I have been limited by the fact that the Article Search API does not return the full text of an article; instead, I have been able to pull the abstract/summary, lead paragraph, and snippet for each article, along with the keywords, authors, sections, and URL. I can also get the article titles for both the print and online versions of each article. That last bit is important to the next phase of my research, and touches on the work our group did on the same topic in last semester’s Research Design course.
In that course, our research group hand-coded PDF copies of articles returned by a simple search for “Afghanistan withdrawal” on the websites of the New York Times and Wall Street Journal, covering February 29, 2020 through September 30, 2021. One thing I noticed when loading the PDF articles into NVivo for coding was that it was hard to match many of the New York Times articles to their citation information in Zotero because the article titles did not match. I realized that when the articles were saved in Zotero, they were saved with the title shown on the web version of the article; once an article had been preserved using the site’s “Print to PDF” function, however, the title used as the default file name differed from the web version.
Since my analysis of the API results is limited by having only the lead paragraph rather than the full article, I want to compare the “main title” and “print title” fields pulled from the API and see how similar they are.
I’ll start by loading the full dataset from my collection phase and looking at the headlines in more detail.
#load packages and data
library(dplyr)               # %>%, select(), group_by(), summarize(), mutate()
library(ggplot2)             # bar chart of material types below
library(readr)               # read_csv() for the headline files
library(quanteda)            # corpus(), tokens(), dfm(), fcm()
library(quanteda.textplots)  # textplot_wordcloud(), textplot_network() (quanteda >= 3)
afghanistan_articles <- read.csv("afghanistan.articles.headlines.csv")
afghanistan_articles <- as.data.frame(afghanistan_articles)
#create subset to analyze
small_df <- afghanistan_articles %>%
select(date, section.name, news.desk, headline.main, headline.print, material)
head(small_df)
## date section.name news.desk
## 1 12/10/2020 World Magazine
## 2 11/5/2020 Magazine Magazine
## 3 12/17/2020 Opinion OpEd
## 4 12/27/2020 World Foreign
## 5 12/10/2020 World Foreign
## 6 12/16/2020 World Foreign
## headline.main
## 1 Afghan War Casualty Report: December 2020
## 2 Afghan War Casualty Report: November 2020
## 3 The Afghan War Is Over. Did Anyone Notice?
## 4 In a Village of Widows, the Opium Trade Has Taken a Deadly Toll
## 5 Afghan Journalist Is Killed in Latest Attack on Media Figures
## 6 ‘Sticky Bombs’ Sow Terror and Chaos in a City on Edge
## headline.print material
## 1 <NA> News
## 2 <NA> News
## 3 Afghanistan: American ‘Iliad’? Op-Ed
## 4 Village of Widows Scrapes By in Shadow of Afghan Opium Trade News
## 5 Afghan Journalist Is Killed in Latest Attack on Media News
## 6 Taliban Use Lethal ‘Sticky Bombs’ Nearly Daily to Terrorize Afghans News
I can see that there is not always a pair of headlines/titles to review; to find out how many are missing one, I’ll use the complete.cases() function. It tells me that of the 3,442 observations (articles), 1,547 have only one of the two headlines. Here I need to make a decision about inclusion/exclusion as part of pre-processing; for a text analysis, I think I’ll leave those articles in the data.
# rows where either the main or the print headline is missing
small_df[!complete.cases(small_df),]
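Since the eventual goal is to compare the main and print titles for similarity, here is a minimal sketch of how that comparison might look for the complete pairs. It assumes the stringdist package is installed (it is not used anywhere else in this post), and the Jaro-Winkler measure and the paired object name are purely illustrative.
# sketch: rough main-vs-print headline similarity, complete pairs only
paired <- small_df[complete.cases(small_df), ]
paired$similarity <- stringdist::stringsim(paired$headline.main,
                                           paired$headline.print,
                                           method = "jw")
summary(paired$similarity)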
I also want to look at which types of materials are represented, and in what proportion.
small_df %>%
group_by(material) %>%
summarize(count=n()) %>%
mutate(percent = (count / sum(count))*100) %>%
ggplot() +
geom_bar(aes(y=percent, x=material, fill=material), stat = "identity") + coord_flip()
Now I need to look at the article headlines independently and create a corpus for each set.
#load individual data
main_headlines <- read_csv("main_headlines.csv")
print_headlines <- read_csv("print_headlines.csv")
head(main_headlines)
## # A tibble: 6 x 6
## doc.id text date section.name news.desk material
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 Quotation of the Day: Ex-SEAL~ 1/1/2~ "Today\x92s P~ Summary Quote
## 2 2 Afghan War Casualty Report: J~ 1/2/2~ "Magazine" Magazine News
## 3 3 The 50 TV Shows You Need to W~ 1/2/2~ "Arts" Weekend News
## 4 4 A History of War in Six Drugs 1/3/2~ "Magazine" Magazine News
## 5 5 Airstrike Pushes National Sec~ 1/3/2~ "U.S." Politics News
## 6 6 The Case for a One-Term Joe 1/3/2~ "Opinion" OpEd Op-Ed
head(print_headlines)
## # A tibble: 6 x 6
## doc.id text date section.name news.desk material
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 Quote of the Day 1/1/2~ "Today\x92s P~ Summary Quote
## 2 2 <NA> 1/2/2~ "Magazine" Magazine News
## 3 3 50 TV Shows To Watch This Wi~ 1/2/2~ "Arts" Weekend News
## 4 4 <NA> 1/3/2~ "Magazine" Magazine News
## 5 5 Airstrike Pushes Foreign Poli~ 1/3/2~ "U.S." Politics News
## 6 6 The Case for a One-Term Presi~ 1/3/2~ "Opinion" OpEd Op-Ed
main_corpus <- corpus(main_headlines)
print_corpus <- corpus(print_headlines)
main_summary <- summary(main_corpus)
print_summary <- summary(print_corpus)
head(main_summary)
## Text Types Tokens Sentences doc.id date section.name news.desk
## 1 text1 11 11 1 1 1/1/2020 Today<U+0092>s Paper Summary
## 2 text2 7 7 1 2 1/2/2020 Magazine Magazine
## 3 text3 10 10 1 3 1/2/2020 Arts Weekend
## 4 text4 7 7 1 4 1/3/2020 Magazine Magazine
## 5 text5 9 9 1 5 1/3/2020 U.S. Politics
## 6 text6 6 6 1 6 1/3/2020 Opinion OpEd
## material
## 1 Quote
## 2 News
## 3 News
## 4 News
## 5 News
## 6 Op-Ed
Now I’ll add an indicator for the headline type for later use, if necessary. However, I cannot attach it to the corpora with the commented-out docvars() calls below, because summary() only returns the first 100 observations by default, so the lengths do not match the 3,442-document corpora. I’ll leave this for a follow-up discussion.
main_summary$type <- "Main Headline"
print_summary$type <- "Print Headline"
#docvars(main_corpus, field = "type") <- main_summary$type
#docvars(print_corpus, field = "type") <- print_summary$type
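If I do come back to this, one possible workaround (a sketch only; main_summary_full is a placeholder name) is to set the docvar on each corpus directly with an explicitly length-matched vector, or to ask summary() for every document instead of its default of 100.
# sketch: attach the headline-type indicator to the corpora themselves
docvars(main_corpus, "type") <- rep("Main Headline", ndoc(main_corpus))
docvars(print_corpus, "type") <- rep("Print Headline", ndoc(print_corpus))
# or build a summary that covers every document rather than the first 100
main_summary_full <- summary(main_corpus, n = ndoc(main_corpus))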
For now, I’ll move on to tokenization.
# the default breaks on white space
main_tokens <- tokens(main_corpus)
print(main_tokens)
## Tokens consisting of 3,442 documents and 5 docvars.
## text1 :
## [1] "Quotation" "of" "the" "Day" ":" "Ex-SEAL"
## [7] "Now" "Pitching" "Products" "and" "President"
##
## text2 :
## [1] "Afghan" "War" "Casualty" "Report" ":" "January" "2020"
##
## text3 :
## [1] "The" "50" "TV" "Shows" "You" "Need" "to" "Watch"
## [9] "This" "Winter"
##
## text4 :
## [1] "A" "History" "of" "War" "in" "Six" "Drugs"
##
## text5 :
## [1] "Airstrike" "Pushes" "National" "Security" "to" "Forefront"
## [7] "of" "2020" "Race"
##
## text6 :
## [1] "The" "Case" "for" "a" "One-Term" "Joe"
##
## [ reached max_ndoc ... 3,436 more documents ]
# the default breaks on white space
print_tokens <- tokens(print_corpus)
print(print_tokens)
## Tokens consisting of 3,442 documents and 5 docvars.
## text1 :
## [1] "Quote" "of" "the" "Day"
##
## text2 :
## character(0)
##
## text3 :
## [1] "50" "TV" "Shows" "To" "Watch" "This" "Winter"
##
## text4 :
## character(0)
##
## text5 :
## [1] "Airstrike" "Pushes" "Foreign" "Policy" "To"
## [6] "Front" "of" "Presidential" "Race"
##
## text6 :
## [1] "The" "Case" "for" "a" "One-Term" "President"
## [7] "Biden"
##
## [ reached max_ndoc ... 3,436 more documents ]
This is a bit difficult to navigate, since it is clear that not all of the returned articles focus primarily on Afghanistan; many simply include the term “Afghanistan” somewhere in the article, even if it is ancillary to the topic. In addition, many of the results are news briefs with quick rundowns of facts and very little in the way of context or framing.
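As a quick gut check on that, a keyword-in-context view (just a sketch, reusing the main_tokens object created above) shows how “afghan”-type terms are actually used in each headline and can flag the incidental mentions.
# sketch: keyword-in-context view of "afghan*" across the headline tokens
head(kwic(main_tokens, pattern = "afghan*", window = 3))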
I will again do some pre-processing, removing punctuation. I do not want to remove numbers, as they may represent data on the situation in Afghanistan, such as deaths, attacks, etc.
main_tokens <- tokens(main_corpus,
remove_punct = TRUE)
print_tokens <- tokens(print_corpus,
remove_punct = TRUE)
main_tokens
## Tokens consisting of 3,442 documents and 5 docvars.
## text1 :
## [1] "Quotation" "of" "the" "Day" "Ex-SEAL" "Now"
## [7] "Pitching" "Products" "and" "President"
##
## text2 :
## [1] "Afghan" "War" "Casualty" "Report" "January" "2020"
##
## text3 :
## [1] "The" "50" "TV" "Shows" "You" "Need" "to" "Watch"
## [9] "This" "Winter"
##
## text4 :
## [1] "A" "History" "of" "War" "in" "Six" "Drugs"
##
## text5 :
## [1] "Airstrike" "Pushes" "National" "Security" "to" "Forefront"
## [7] "of" "2020" "Race"
##
## text6 :
## [1] "The" "Case" "for" "a" "One-Term" "Joe"
##
## [ reached max_ndoc ... 3,436 more documents ]
print_tokens
## Tokens consisting of 3,442 documents and 5 docvars.
## text1 :
## [1] "Quote" "of" "the" "Day"
##
## text2 :
## character(0)
##
## text3 :
## [1] "50" "TV" "Shows" "To" "Watch" "This" "Winter"
##
## text4 :
## character(0)
##
## text5 :
## [1] "Airstrike" "Pushes" "Foreign" "Policy" "To"
## [6] "Front" "of" "Presidential" "Race"
##
## text6 :
## [1] "The" "Case" "for" "a" "One-Term" "President"
## [7] "Biden"
##
## [ reached max_ndoc ... 3,436 more documents ]
With the data frame subset version, I can look at the most frequent terms (features), starting with the top 20. It’s clear that the frequency of the “briefings” will skew my analysis. It’s also clear that the remove_punct argument did not remove the top result, which is a symbol rather than punctuation.
# build a dfm from the full corpus (corpus_subset() with no condition keeps every document)
main_df <- corpus_subset(main_corpus) %>%
tokens(remove_punct = TRUE) %>%
dfm()
topfeatures(main_df, n=20)
## <U+FFFD> the in to a of
## 1307 1137 734 683 653 599
## your briefing s and afghanistan u.s
## 546 531 499 417 391 380
## for afghan biden on taliban is
## 318 291 271 265 242 227
## trump as
## 208 206
print_df <- corpus_subset(print_corpus) %>%
tokens(remove_punct = TRUE) %>%
dfm()
topfeatures(print_df, n=20)
## <U+FFFD> the to in of a
## 626 461 445 443 414 406
## s u.s and on for afghan
## 251 249 224 195 191 151
## afghanistan taliban is biden as at
## 149 149 140 131 130 101
## trump war
## 100 99
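That top feature is the Unicode replacement character (printed as <U+FFFD>), which quanteda treats as a symbol rather than punctuation. A possible fix, sketched here with the placeholder name main_tokens_nosym, is to remove symbols at tokenization and drop the character explicitly as a fallback.
# sketch: remove symbols (including the U+FFFD replacement character) up front
main_tokens_nosym <- tokens(main_corpus, remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = "\uFFFD")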
To look at the distribution of word frequencies, I create a data frame of counts and ranks. Just from the data frame for the main headlines, it is clear I also need to remove stop words from my analysis.
word_counts <- as.data.frame(sort(colSums(main_df),dec=T))
colnames(word_counts) <- c("Frequency")
word_counts$Rank <- c(1:ncol(main_df))
head(word_counts)
## Frequency Rank
## <U+FFFD> 1307 1
## the 1137 2
## in 734 3
## to 683 4
## a 653 5
## of 599 6
Until I do so, analyzing the results will not be very meaningful.
# note: re-tokenizing here without remove_punct = TRUE, which is why punctuation
# such as "." and ":" reappears in the frequency counts further below
main_tokens <- tokens(main_corpus)
main_tokens <- tokens_tolower(main_tokens)
main_tokens <- tokens_select(main_tokens,
pattern = stopwords("en"),
selection = "remove")
main_dfm <- dfm(main_tokens)
length(main_tokens)
## [1] 3442
print(main_tokens)
## Tokens consisting of 3,442 documents and 5 docvars.
## text1 :
## [1] "quotation" "day" ":" "ex-seal" "now" "pitching"
## [7] "products" "president"
##
## text2 :
## [1] "afghan" "war" "casualty" "report" ":" "january" "2020"
##
## text3 :
## [1] "50" "tv" "shows" "need" "watch" "winter"
##
## text4 :
## [1] "history" "war" "six" "drugs"
##
## text5 :
## [1] "airstrike" "pushes" "national" "security" "forefront" "2020"
## [7] "race"
##
## text6 :
## [1] "case" "one-term" "joe"
##
## [ reached max_ndoc ... 3,436 more documents ]
print_tokens <- tokens(print_corpus)
print_tokens <- tokens_tolower(print_tokens)
print_tokens <- tokens_select(print_tokens,
pattern = stopwords("en"),
selection = "remove")
length(print_tokens)
## [1] 3442
print(print_tokens)
## Tokens consisting of 3,442 documents and 5 docvars.
## text1 :
## [1] "quote" "day"
##
## text2 :
## character(0)
##
## text3 :
## [1] "50" "tv" "shows" "watch" "winter"
##
## text4 :
## character(0)
##
## text5 :
## [1] "airstrike" "pushes" "foreign" "policy" "front"
## [6] "presidential" "race"
##
## text6 :
## [1] "case" "one-term" "president" "biden"
##
## [ reached max_ndoc ... 3,436 more documents ]
Now I can use quanteda to generate the document-feature matrices:
main_dfm <- dfm(main_tokens)
main_dfm
## Document-feature matrix of: 3,442 documents, 5,701 features (99.87% sparse) and 5 docvars.
## features
## docs quotation day : ex-seal now pitching products president afghan war
## text1 1 1 1 1 1 1 1 1 0 0
## text2 0 0 1 0 0 0 0 0 1 1
## text3 0 0 0 0 0 0 0 0 0 0
## text4 0 0 0 0 0 0 0 0 0 1
## text5 0 0 0 0 0 0 0 0 0 0
## text6 0 0 0 0 0 0 0 0 0 0
## [ reached max_ndoc ... 3,436 more documents, reached max_nfeat ... 5,691 more features ]
print_dfm <- dfm(print_tokens)
print_dfm
## Document-feature matrix of: 3,442 documents, 4,121 features (99.91% sparse) and 5 docvars.
## features
## docs quote day 50 tv shows watch winter airstrike pushes foreign
## text1 1 1 0 0 0 0 0 0 0 0
## text2 0 0 0 0 0 0 0 0 0 0
## text3 0 0 1 1 1 1 1 0 0 0
## text4 0 0 0 0 0 0 0 0 0 0
## text5 0 0 0 0 0 0 0 1 1 1
## text6 0 0 0 0 0 0 0 0 0 0
## [ reached max_ndoc ... 3,436 more documents, reached max_nfeat ... 4,111 more features ]
# trim based on the overall frequency (i.e., the word counts) with a max at the top "non-gibberish" term.
smaller_main_dfm <- dfm_trim(main_dfm, max_termfreq = 1137)
smaller_print_dfm <- dfm_trim(print_dfm, max_termfreq = 461)
# trim based on the proportion of documents that the feature appears in; the
# commented-out version keeps only features appearing in at least 10% of documents (articles)
#smaller_main_dfm <- dfm_trim(smaller_main_dfm, min_docfreq = 0.10, docfreq_type = "prop")
#smaller_main_dfm
#smaller_print_dfm <- dfm_trim(smaller_print_dfm, min_docfreq = 0.10, docfreq_type = "prop")
#smaller_print_dfm
Now I can take a look at the wordcloud and updated word frequency metrics:
# programs often work with random initializations, yielding different outcomes.
# we can set a standard starting point though to ensure the same output.
set.seed(1234)
# draw the wordcloud
textplot_wordcloud(smaller_main_dfm, min_count = 10, random_order = FALSE)
textplot_wordcloud(smaller_print_dfm, min_count = 10, random_order = FALSE)
# first, we need to create a word frequency variable and the rankings
word_counts <- as.data.frame(sort(colSums(smaller_main_dfm),dec=T))
colnames(word_counts) <- c("Frequency")
word_counts$Rank <- c(1:ncol(smaller_main_dfm))
head(word_counts)
## Frequency Rank
## . 1090 1
## : 597 2
## briefing 531 3
## s 499 4
## afghanistan 391 5
## u.s 380 6
On initial review, I have reduced the sparsity from over 99%, but the matrix is still more than 90% sparse. I could drop terms from the updated word count list that still appear to be punctuation or stray letters, if I’m sure they are not relevant to the context of my research. Also, rather than simply dropping the term “briefing”, I will run this analysis again, selecting only certain types of news desks and excluding the “briefing”-style entries.
I’m also going to try again to exclude the gibberish, and to loosen the document-frequency threshold so that features appearing in more than 1% of the articles are kept (rather than the 10% used in the commented-out trim above).
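As a sketch of what that re-run might look like (the object names are placeholders, and filtering headlines on the word “Briefing” is only one way to identify those entries; filtering on the news desk docvar would be another):
# sketch: drop the daily briefing round-ups, then rebuild the dfm with a looser
# 1% document-frequency floor and explicit removal of leftover garbled features
no_briefing_corpus <- corpus_subset(main_corpus,
                                    !grepl("Briefing", as.character(main_corpus)))
cleaner_main_dfm <- no_briefing_corpus %>%
  tokens(remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en")) %>%
  dfm() %>%
  dfm_remove(pattern = c("\uFFFD", "s")) %>%
  dfm_trim(min_docfreq = 0.01, docfreq_type = "prop")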
Now I can take a look at this network of feature co-occurrences for the main headlines:
# create fcm from dfm
smaller_main_fcm <- fcm(smaller_main_dfm)
# check the dimensions (i.e., the number of rows and the number of columnns)
# of the matrix we created
dim(smaller_main_fcm)
## [1] 5699 5699
# pull the top features
myFeatures <- names(topfeatures(smaller_main_fcm, 20))
# retain only those top features as part of our matrix
even_smaller_main_fcm <- fcm_select(smaller_main_fcm, pattern = myFeatures, selection = "keep")
# check dimensions
dim(even_smaller_main_fcm)
## [1] 20 20
# compute size weight for vertices in network
size <- log(colSums(even_smaller_main_fcm))
# create plot
textplot_network(even_smaller_main_fcm, vertex_size = size / max(size) * 3)
and for the print headlines:
# create fcm from dfm
smaller_print_fcm <- fcm(smaller_print_dfm)
# check the dimensions (i.e., the number of rows and the number of columnns)
# of the matrix we created
dim(smaller_print_fcm)
## [1] 4119 4119
# pull the top features
myFeatures <- names(topfeatures(smaller_print_fcm, 20))
# retain only those top features as part of our matrix
even_smaller_print_fcm <- fcm_select(smaller_print_fcm, pattern = myFeatures, selection = "keep")
# check dimensions
dim(even_smaller_print_fcm)
## [1] 20 20
# compute size weight for vertices in network
size <- log(colSums(even_smaller_print_fcm))
# create plot
textplot_network(even_smaller_print_fcm, vertex_size = size / max(size) * 3)
Finally, before going back to the start and applying what I’ve learned here to the data tomorrow, I’m trying something new I found for pre-processing and word cloud modeling!
library(tm)   # Corpus(), tm_map(), removePunctuation, removeWords
preprocessing = function (doc){
  # strip anything that is not alphanumeric (this also coerces the quanteda
  # corpus to a plain character vector of headline texts)
  doc = gsub("[^[:alnum:]]"," ",doc)
  #create corpus
  corpus = Corpus(VectorSource(doc))
  #Removal of punctuation
  corpus = tm_map(corpus, removePunctuation)
  #customize my stopwords
  mystopword = "briefing"
  #Removal of stopwords
  corpus = tm_map(corpus, removeWords, c(stopwords("english"),mystopword))
  #return result
  return(corpus)
}
main_clean = preprocessing(main_corpus)
print_clean = preprocessing(print_corpus)
set.seed(1234)
# draw the wordcloud
library(wordcloud)
par(mfrow=c(1,2)) # 1x2 panel plot
par(mar=c(1, 3, 1, 3)) # Set the plot margin
par(bg="black") # set background color as black
par(col.main="white") # set title color as white
wordcloud(main_clean, scale=c(4,.5),min.freq=3, max.words=Inf, random.order=F,
colors = brewer.pal(8, "Set3"))
title("Main Website Headlines")
wordcloud(print_clean, scale=c(4,.5),min.freq=3, max.words=Inf, random.order=F,
colors = brewer.pal(8, "Set3"))
title("Print Headlines")