Blog Post 2 DACSS697D

Overview of Weeks 3 and 4

Week 3 Learning Curve in Webscraping using CSS

In Week 3, working with CSS selectors proved frustrating. After I built some confidence with the CSS Diner tutorial, working through the Colab tutorial was less empowering. The first obstacle was that the SelectorGadget tool cannot be installed in Chrome while logged in to the UMass environment, because the Chrome Web Store is blocked. In the tutorial, I was able to locate “#tablepress-73” by inspecting the page, but I would not have known to look for it, or how to use it to import the table into R, had I not been given that information. I clearly need to understand this process better.
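As a rough illustration of how an id selector such as “#tablepress-73” can be used once it has been found, here is a minimal sketch with the rvest package. The page below is made up for the example; only the selector itself comes from the tutorial.

```r
library(rvest)

# Made-up page: two tables, only one carries the id we care about
page <- read_html('
  <html><body>
    <table id="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>
    <table id="tablepress-73">
      <tr><th>name</th><th>score</th></tr>
      <tr><td>A</td><td>10</td></tr>
      <tr><td>B</td><td>20</td></tr>
    </table>
  </body></html>')

# "#" means "the element whose id is ..."; html_table() converts it to a data frame
scores <- page %>%
  html_element("#tablepress-73") %>%
  html_table()

scores
```

Without SelectorGadget, the same id can usually be found by right-clicking the table in the browser, choosing Inspect, and looking for an `id="..."` attribute on the `<table>` tag.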

Week 3 API Usage

Week 3 included a welcome introduction to collecting data through an API. Using the New York Times API was especially exciting, because I have a research project based on New York Times data that I have so far been conducting manually from PDF versions of the articles. I first tried to apply the material to a different API, the one from opensecrets.org. After getting an API key, I tried to apply some of the skills learned in the lab to search for information on lobby groups, with a focus on veteran issues. However, I ran into issues with this API, so for this week’s blog I returned to the New York Times API and a new research topic. I hope to return to troubleshooting the opensecrets.org API later.

Week 4 Natural Language Processing

The Week 4 materials on NLP are extremely valuable and interesting to me. Unfortunately, due to the learning curve, frustration, and the sheer volume of information I am taking in at once this semester, this blog post only scratches the surface of the NLP skills we are learning. I look forward to developing them further as the semester goes on.

Frustration

I could not get all of the demonstration code from the notebook tutorials to work in my own RStudio. Specifically, I realized that simply installing Python was not enough, and though I read in the tutorial that I needed to install the Anaconda platform for NLP, I had no idea how to actually carry out this task. Given my complete lack of knowledge of Python and anything related to it, this was a long day of trying to catch up with little success. After over 20 hours of working through various help sites and trial and error, I was referred to the ‘spacyr’ package and its documentation by a classmate. This helped immensely, and after successfully installing Miniconda and its dependencies, I was able to move forward.

library(httr); library(jsonlite)
library(dplyr)
library(quanteda)
library(tidyverse)
library(tidytext)
library(cleanNLP)

Work Content for Weeks 3 and 4

Trouble Using The opensecrets.org API

I know from the API documentation that the Committee on Veterans’ Affairs is listed with committee code “HVET” (House) as well as “SVET” (Senate). The House Armed Services committee is coded “HARM”.

I am looking for data from the 116th U.S. Congress, which convened on January 3, 2019 and ended on January 3, 2021.

So I start with “GET” and my API call for the House Committee on Veterans’ Affairs, but this is not working. After literally hours of troubleshooting, I realized my error: I had failed to put a quotation mark (') at the beginning and end of the URL in my API call. It is so numbingly obvious now that I’ve seen it. Coding is nothing if not time-consuming and absolutely aggravating. After the last 2 days and hours of work over a silly apostrophe, I’m beginning to feel like a real coder!
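One way to make the quoting problem harder to repeat is to assemble the URL from named pieces, so that every part is visibly a quoted string. The sketch below uses base R only. `build_url` is a hypothetical helper, and `OPENSECRETS_API_KEY` is an assumed environment variable name, not something the API requires.

```r
# Hypothetical helper: join a base URL and a named list of query parameters
build_url <- function(base, params) {
  # URL-encode each value, then glue name=value pairs together with "&"
  encoded <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste0(base, "?", paste(names(params), encoded, sep = "=", collapse = "&"))
}

# Assumption: the key is stored in an environment variable rather than typed inline
api_key <- Sys.getenv("OPENSECRETS_API_KEY", unset = "YOUR_KEY_HERE")

hvet_url <- build_url(
  "https://www.opensecrets.org/api/",
  list(method = "congCmteIndus", congno = 116, indus = "F10",
       cmte = "HVET", apikey = api_key, output = "json")
)

hvet_url
```

The resulting string can be passed to `GET()` as is. httr's `GET()` also accepts a `query = list(...)` argument that does similar assembly.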

Anyway, Back To The Project…

cnlp_init_udpipe()

HVET <- GET('https://www.opensecrets.org/api/?method=congCmteIndus&congno=116&indus=F10&cmte=HVET&apikey=3f183582e16c4fff025509c65828bfa4&output=json')

names(HVET)

##  [1] "url"         "status_code" "headers"     "all_headers" "cookies"    
##  [6] "content"     "date"        "times"       "request"     "handle"

#HVET_r <- fromJSON(rawToChar(HVET$content))
#names(HVET_r) 
#HVET_r$fault

I continue to receive errors, first a lexical error and then an invalid API key error. I am struggling to understand why my API key is not working here. Since I cannot go through another deep dive into troubleshooting to understand why this API has returned different results, I will instead go back to the New York Times API and pull my own text relevant to another new project.
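A lexical error from `fromJSON()` usually means the body it was given is not valid JSON, for example a plain-text or HTML error message. A small check before parsing can make that visible. This is only a sketch: `diagnose_response` is a hypothetical helper, and the inputs below are made-up stand-ins for a real response.

```r
# Hypothetical helper (not part of httr or jsonlite): describe a response
# before trying to parse it as JSON
diagnose_response <- function(status, content_type, body) {
  if (status != 200L) {
    return(sprintf("HTTP %d: request was rejected, read the body as plain text", status))
  }
  if (!grepl("json", content_type, ignore.case = TRUE)) {
    return(sprintf("Content type is '%s', not JSON", content_type))
  }
  if (!substr(trimws(body), 1, 1) %in% c("{", "[")) {
    return("Body does not start like JSON")
  }
  "Looks like JSON"
}

# Made-up examples standing in for real responses
diagnose_response(400L, "text/html", "Invalid API key")
diagnose_response(200L, "application/json", '{"response": {}}')

# With a real httr response object `resp`, the three inputs would come from:
#   status_code(resp)
#   headers(resp)[["content-type"]]
#   content(resp, as = "text", encoding = "UTF-8")
```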

New York Times Article Search for last 7 days on Ukraine

Finally, I can move a bit further after successfully retrieving data using the NYT API:

ukraine <- GET('https://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20220213&end_date=20220220&q=ukraine&api-key=GTp3efxVZiGO75Iox9uZJ8ZTjIMjDWsM')

names(ukraine)

##  [1] "url"         "status_code" "headers"     "all_headers" "cookies"    
##  [6] "content"     "date"        "times"       "request"     "handle"

Now I can take the step of transforming JSON objects into R objects:

ukraine_r <- fromJSON(rawToChar(ukraine$content))
names(ukraine_r)

## [1] "status"    "copyright" "response"

Then I look at the response and the lead paragraphs:

#ukraine_r$response

ukraine_r$response$docs$lead_paragraph

##  [1] "On the edge of Europe and thousands of miles from the United States, the relevance of Ukraine extends far beyond its borders."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
##  [2] "A day before Ukraine announced its Defense Ministry and banking servers had been hacked, our video team toured the countryâ\200\231s Cybercommand Center, where officials have been preparing for this scenario for years."                                                                                                                                                                                                                                                                                                                                                                                                   
##  [3] "Vladimir Putin may still order an invasion of Ukraine, as President Biden said yesterday. Putin has long been obsessed with Ukraine, viewing it as part of Russiaâ\200\231s immediate orbit. And more than 150,000 Russian troops remain ready to pour over the border if Putin gives the order."                                                                                                                                                                                                                                                                                                                             
##  [4] "As Russian troops are dispatched to Ukraineâ\200\231s borders, the threat of a major assault on the country continues to escalate. Ukrainian soldiers stand guard at possible points of invasion, including sections of irradiated zones near Chernobyl. The United States has sent troops to NATO countries and has pulled most diplomats from Kyiv, Ukraine and world leaders are in contact with President Vladimir V. Putin of Russia to try to negotiate peace in the region. In a one-hour phone call on Saturday, President Biden warned Mr. Putin that an invasion would result in â\200œswift and severeâ\200\235 costs."     
##  [5] "Two founding members of the Soviet Union â\200” Russia and Ukraine â\200” are once again at a flash point. Here are some pivotal moments that have led to Russiaâ\200\231s troop buildup on its western border with Ukraine:"                                                                                                                                                                                                                                                                                                                                                                                                       
##  [6] "MARIUPOL, Ukraine â\200” Paramilitary groups are actively preparing for a Russian invasion near Ukraineâ\200\231s front line with Russia-backed separatists."                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
##  [7] "Big American and European companies operating on the ground in Ukraine said Friday that they had contingency plans at the ready in case of a Russian invasion but so far had not ordered the relocation of employees."                                                                                                                                                                                                                                                                                                                                                                                                  
##  [8] "KYIV, Ukraineâ\200” Pavlo Kaliuk, a freelance property broker in Ukraineâ\200\231s capital, used to sell and rent properties to clients from the United States, France, Germany and Israel. Then in November, when Russia first began posting troops along the countryâ\200\231s border, the deals quickly dried up."                                                                                                                                                                                                                                                                                                                  
##  [9] "The New York Times traveled with a paramilitary group that says it refuses to leave the front lines in Ukraineâ\200\231s war in the east. But the government denies that theyâ\200\231re even there. What role can they have in a larger conflict with Russia?"                                                                                                                                                                                                                                                                                                                                                                     
## [10] "NOVOTOSHKIVSKE, Ukraine â\200” Artillery shells struck a ring of frontline towns in eastern Ukraine Friday, blowing out windows, hitting schools, homes and military positions â\200” and stirring fears that the escalation here is only the prelude to direct Russian military action."

Next, I take some of the aspects of this data that are interesting to me and create a tibble:

ukraine_t <- as_tibble(cbind(
          date=ukraine_r$response$docs$pub_date,
          abstract=ukraine_r$response$docs$abstract,
          lead=ukraine_r$response$docs$lead_paragraph)
)

ukraine_t

## # A tibble: 10 x 3
##    date         abstract                          lead                          
##    <chr>        <chr>                             <chr>                         
##  1 2022-02-19T~ Hereâ€™s how the country ended u~ "On the edge of Europe and th~
##  2 2022-02-16T~ A day before Ukraine announced i~ "A day before Ukraine announc~
##  3 2022-02-16T~ Three explanations for the lates~ "Vladimir Putin may still ord~
##  4 2022-02-15T~ Andrew E. Kramer, a Times Moscow~ "As Russian troops are dispat~
##  5 2022-02-18T~ Here is a look at some key momen~ "Two founding members of the ~
##  6 2022-02-19T~ Times journalists followed the m~ "MARIUPOL, Ukraine â€” Parami~
##  7 2022-02-18T~ Despite warnings by Western lead~ "Big American and European co~
##  8 2022-02-18T~ Flights have been canceled, comm~ "KYIV, Ukraineâ€” Pavlo Kaliu~
##  9 2022-02-19T~ The New York Times traveled with~ "The New York Times traveled ~
## 10 2022-02-18T~ With nowhere to go, many are inc~ "NOVOTOSHKIVSKE, Ukraine â€” ~
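Because `cbind()` produces a character matrix, every column in the tibble above is text, including the date. A possible alternative is to build the columns directly and convert the date on the way in. The sketch below uses base R and a made-up stand-in for `ukraine_r$response$docs`. It assumes only that the first ten characters of `pub_date` are the YYYY-MM-DD date, as the printed output suggests.

```r
# Made-up stand-in for ukraine_r$response$docs (values are illustrative only)
docs <- data.frame(
  pub_date       = c("2022-02-19T10:00:00", "2022-02-16T08:30:00"),
  abstract       = c("First abstract", "Second abstract"),
  lead_paragraph = c("First lead", "Second lead"),
  stringsAsFactors = FALSE
)

# Build columns directly; each one keeps its own type
articles <- data.frame(
  date     = as.Date(substr(docs$pub_date, 1, 10)),  # keep only YYYY-MM-DD
  abstract = docs$abstract,
  lead     = docs$lead_paragraph,
  stringsAsFactors = FALSE
)

# Real dates can be sorted and filtered chronologically
articles[order(articles$date), ]
```

The same named arguments can be given to `tibble()` in place of `data.frame()` to get a tibble.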

Finally, I can utilize the cleanNLP package:

cnlp_init_udpipe()

annotated <- cnlp_annotate(ukraine_t$lead)

## Processed document 10 of 10

head(annotated)

## $token
## # A tibble: 479 x 11
##    doc_id   sid tid   token  token_with_ws lemma  upos  xpos  feats   tid_source
##  *  <int> <int> <chr> <chr>  <chr>         <chr>  <chr> <chr> <chr>   <chr>     
##  1      1     1 1     On     "On "         on     ADP   IN    <NA>    3         
##  2      1     1 2     the    "the "        the    DET   DT    Defini~ 3         
##  3      1     1 3     edge   "edge "       edge   NOUN  NN    Number~ 19        
##  4      1     1 4     of     "of "         of     ADP   IN    <NA>    5         
##  5      1     1 5     Europe "Europe "     Europe PROPN NNP   Number~ 3         
##  6      1     1 6     and    "and "        and    CCONJ CC    <NA>    7         
##  7      1     1 7     thous~ "thousands "  thous~ NOUN  NNS   Number~ 3         
##  8      1     1 8     of     "of "         of     ADP   IN    <NA>    9         
##  9      1     1 9     miles  "miles "      mile   NOUN  NNS   Number~ 7         
## 10      1     1 10    from   "from "       from   ADP   IN    <NA>    13        
## # ... with 469 more rows, and 1 more variable: relation <chr>
## 
## $document
##    doc_id
## 1       1
## 2       2
## 3       3
## 4       4
## 5       5
## 6       6
## 7       7
## 8       8
## 9       9
## 10     10

and the spacyr package:

library(spacyr)

spacy_initialize(model="en_core_web_sm")
ukraine_parsed <- spacy_parse(ukraine_t$lead, tag=TRUE, nounphrase=TRUE, entity=TRUE, lemma=TRUE)

head(ukraine_parsed)

##   doc_id sentence_id token_id  token  lemma   pos tag entity nounphrase
## 1  text1           1        1     On     on   ADP  IN                  
## 2  text1           1        2    the    the   DET  DT               beg
## 3  text1           1        3   edge   edge  NOUN  NN          end_root
## 4  text1           1        4     of     of   ADP  IN                  
## 5  text1           1        5 Europe Europe PROPN NNP  LOC_B   beg_root
## 6  text1           1        6    and    and CCONJ  CC                  
##   whitespace
## 1       TRUE
## 2       TRUE
## 3       TRUE
## 4       TRUE
## 5       TRUE
## 6       TRUE

Annotating Data

ukraine_anno <- left_join(annotated$document, annotated$token, by = "doc_id")
head(ukraine_anno)

##   doc_id sid tid  token token_with_ws  lemma  upos xpos
## 1      1   1   1     On           On      on   ADP   IN
## 2      1   1   2    the          the     the   DET   DT
## 3      1   1   3   edge         edge    edge  NOUN   NN
## 4      1   1   4     of           of      of   ADP   IN
## 5      1   1   5 Europe       Europe  Europe PROPN  NNP
## 6      1   1   6    and          and     and CCONJ   CC
##                       feats tid_source relation
## 1                      <NA>          3     case
## 2 Definite=Def|PronType=Art          3      det
## 3               Number=Sing         19      obl
## 4                      <NA>          5     case
## 5               Number=Sing          3     nmod
## 6                      <NA>          7       cc

And finally, beginning to summarize data from the text.

library(magrittr)

nouns <- ukraine_anno %>% 
  filter(upos == "NOUN") %>%
  group_by(token) %>% 
  summarize(count = n()) %>%
  arrange(desc(count))

adjs <- ukraine_anno %>% 
  filter(upos == "ADJ") %>%
  group_by(token) %>% 
  summarize(count = n()) %>%
  arrange(desc(count))

propns <- ukraine_anno %>% 
  filter(upos == "PROPN") %>%
  group_by(token) %>% 
  summarize(count = n()) %>%
  arrange(desc(count))

verbs <- ukraine_anno %>% 
  filter(upos == "VERB") %>%
  group_by(token) %>% 
  summarize(count = n()) %>%
  arrange(desc(count))

head(nouns)

## # A tibble: 6 x 2
##   token        count
##   <chr>        <int>
## 1 â                5
## 2 invasion         5
## 3 troops           4
## 4 border           3
## 5 borders          2
## 6 countryâ€™s     2

head(adjs)

## # A tibble: 6 x 2
##   token    count
##   <chr>    <int>
## 1 Russian      5
## 2 front        2
## 3 military     2
## 4 ready        2
## 5 American     1
## 6 Big          1

head(propns)

## # A tibble: 6 x 2
##   token        count
##   <chr>        <int>
## 1 Ukraine         11
## 2 Putin            5
## 3 Russia           5
## 4 Ukraineâ€™s     4
## 5 President        3
## 6 States           3

head(verbs)

## # A tibble: 6 x 2
##   token     count
##   <chr>     <int>
## 1 preparing     2
## 2 said          2
## 3 announced     1
## 4 backed        1
## 5 began         1
## 6 blowing       1
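The noun counts above list “border” and “borders” separately because the grouping is done on `token`. Counting on `lemma` instead would merge them, and the four near-identical pipelines could share one helper. The sketch below uses base R and a small made-up token table. `count_by_pos` is a hypothetical helper, not a cleanNLP function.

```r
# Made-up rows shaped like ukraine_anno (token, lemma, upos)
toy_tokens <- data.frame(
  token = c("borders", "border", "troops", "invasion", "is"),
  lemma = c("border",  "border", "troop",  "invasion", "be"),
  upos  = c("NOUN",    "NOUN",   "NOUN",   "NOUN",     "AUX"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: frequency table for one part of speech,
# grouped by the column named in `by` ("lemma" or "token")
count_by_pos <- function(tokens, pos, by = "lemma") {
  hits   <- tokens[[by]][tokens$upos == pos]
  counts <- sort(table(hits), decreasing = TRUE)
  data.frame(
    term  = as.character(names(counts)),
    count = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

noun_lemmas <- count_by_pos(toy_tokens, "NOUN")
noun_lemmas
```

With the real data the call would be `count_by_pos(ukraine_anno, "NOUN")`. In dplyr, `count(lemma, sort = TRUE)` after the `filter()` step gives a similar result.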

Clean-Up Needed

The garbled characters that carried over from the NYT API import are not easily cleaned up, though I have spent some time getting familiar with some of the tools for doing so. I need to spend more time learning the best options for this.
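The sequence `â\200\231` in the output is consistent with the three UTF-8 bytes of a curly apostrophe being displayed one byte at a time. One possible fix, sketched here on a made-up string, is to tell R that the text coming out of `rawToChar()` is UTF-8.

```r
# Made-up example: the UTF-8 bytes for "country’s", as an API might send them
raw_bytes <- charToRaw(enc2utf8("country\u2019s"))

txt <- rawToChar(raw_bytes)   # bytes become a string with no declared encoding
Encoding(txt) <- "UTF-8"      # declare what the bytes actually are

nchar(txt, type = "bytes")    # 11 bytes ...
nchar(txt, type = "chars")    # ... but 9 characters once read as UTF-8

# With an httr response, content(resp, as = "text", encoding = "UTF-8")
# returns the body already decoded, which can then be passed to fromJSON()
```

Whether this removes every stray character in the NYT text is untested here; it only shows the mechanism.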

Looking For a Word

I scratched the surface of the stringr package and searched all of the lead paragraphs from the last week for any variation of the word “protest(s)”. Unfortunately, I did not find any instances.

library(stringr)

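# \\b marks a word boundary and "s?" makes the plural optional;
# ignore_case = TRUE also matches "Protest"
str_match(ukraine_t$lead, regex("\\bprotests?\\b", ignore_case = TRUE))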

##       [,1]
##  [1,] NA  
##  [2,] NA  
##  [3,] NA  
##  [4,] NA  
##  [5,] NA  
##  [6,] NA  
##  [7,] NA  
##  [8,] NA  
##  [9,] NA  
## [10,] NA
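Before trusting an all-NA result, it can help to run the search pattern on a few sentences where the answer is already known. The sketch below uses base R's `grepl()` and made-up sentences. The pattern is one possible choice, not the only way to match variations of “protest”.

```r
# Made-up sentences to check the pattern against
samples <- c(
  "Protests erupted in the capital.",
  "A protest, then talks.",
  "Officials protested the decision.",
  "Nothing relevant here."
)

# \\b = word boundary, \\w* = any word ending (protest, protests, protested, ...)
pattern <- "\\bprotest\\w*"

hits <- grepl(pattern, samples, ignore.case = TRUE, perl = TRUE)
hits
# stringr equivalent: str_detect(samples, regex(pattern, ignore_case = TRUE))
```

If the first three sentences match and the last does not, the pattern is behaving as intended, and an empty result on the real lead paragraphs is more believable.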

Creating a Wordcloud

Since I had so much background work to do this week and wanted to end with something visual, I decided to start with a wordcloud. Although there is a clear need to clean up the characters, I feel like I have covered a lot of ground in the last couple of weeks. I look forward to putting it together in a more purposeful way.

library(wordcloud)
library(RColorBrewer)
library(tm)

ukraine_words <- merge(merge(merge(
  nouns,
  adjs, all = TRUE),
  propns, all = TRUE),
  verbs, all = TRUE)

set.seed(1234)

wordcloud(words = ukraine_words$token, freq = ukraine_words$count, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
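Note that `merge(..., all = TRUE)` joins on every shared column, here both `token` and `count`, so a word that appears under two parts of speech with different counts stays as two rows. A possible alternative is to stack the tables and add up the counts per word. The sketch below uses base R and made-up tables in the same shape as `nouns` and `adjs`.

```r
# Made-up count tables in the same shape as nouns, adjs, propns and verbs
nouns_demo <- data.frame(token = c("invasion", "front"), count = c(5L, 1L),
                         stringsAsFactors = FALSE)
adjs_demo  <- data.frame(token = c("Russian", "front"),  count = c(5L, 2L),
                         stringsAsFactors = FALSE)

# Stack the tables, then sum counts for tokens that appear in more than one
stacked  <- rbind(nouns_demo, adjs_demo)
combined <- aggregate(count ~ token, data = stacked, FUN = sum)

combined[order(-combined$count), ]
```

With the real tables this would be `rbind(nouns, adjs, propns, verbs)`, and `combined$token` and `combined$count` could then be passed to `wordcloud()`.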