In Week 3, working with CSS selectors proved frustrating. After building confidence with the CSS Diner tutorial, working through the Colab tutorial was less empowering. First, the SelectorGadget tool is not available in Chrome when logged in to the UMass environment, because the Chrome Web Store is blocked. In the tutorial, I was able to find where “#tablepress-73” was by inspecting the page, but I would not have known to look for it and import it into R unless I had been given that information, so I clearly need to understand this process better.
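For reference, once an id like `#tablepress-73` has been found with the browser's Inspect tool, it can be passed to rvest as a CSS selector. The sketch below assumes the rvest package (version 1.0 or later) and uses a tiny made-up HTML snippet in place of the real page, so the table contents are hypothetical.

```r
library(rvest)

# Hypothetical stand-in for the tutorial page: one table carrying the id
page <- minimal_html('
  <table id="tablepress-73">
    <tr><th>State</th><th>Value</th></tr>
    <tr><td>MA</td><td>1</td></tr>
    <tr><td>NY</td><td>2</td></tr>
  </table>')

# "#" selects by id, the same rule practiced in CSS Diner
tbl <- page %>%
  html_element("#tablepress-73") %>%
  html_table()

tbl
```

On a live page, `read_html()` with the page URL would take the place of `minimal_html()`.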
Week 3 also included a welcome introduction to collecting data through an API. Using the New York Times API was especially exciting, since I have a research project on New York Times data that I have been conducting manually from PDF versions of the articles. However, I first tried applying the material to a different API, the one from opensecrets.org. After getting an API key, I tried to apply some of the skills from the lab to search for data on lobby groups, with a focus on veteran issues. I ran into problems with this API, so I returned to the New York Times API and a new research topic for this week’s blog. I hope to return to troubleshooting the opensecrets.org API later.
Week 4 materials in NLP are extremely valuable and interesting to me. Unfortunately, due to learning curve issues, frustration, and the simple volume of information I am learning at once this semester, in this blog post I have only scratched the surface of the NLP skills we are learning. I look forward to developing them further as the semester goes on.
I could not get all of the demonstration code from the notebook tutorials to work in my own RStudio. Specifically, I realized that simply installing Python was not enough, and though I read in the tutorial that I needed to install the Anaconda platform for NLP, I had no idea how to actually do it. Given my complete lack of knowledge of Python and anything related, this was a long day of trying to catch up with little success. After over 20 hours of working through various help sites and trial and error, a classmate referred me to the ‘spacyr’ package and its documentation. This helped immensely, and after successfully installing Miniconda and its dependencies, I was able to move forward.
library(httr); library(jsonlite)
library(dplyr)
library(quanteda)
library(tidyverse)
library(tidytext)
library(cleanNLP)
I know from the API documentation that the Committee on Veterans’ Affairs is listed with committee code “HVET” (House) as well as “SVET” (Senate). The House Armed Services committee is coded “HARM”.
I am looking for data from the 116th U.S. Congress, convened on January 3, 2019 and ending on January 3, 2021.
So I start with “GET” and my API call for the House Committee on Veterans’ Affairs, but this is not working. After literally hours of troubleshooting, I realized my error was that I had failed to put a quotation mark (') at the beginning and end of the URL in my API call. It is so numbingly obvious now that I’ve seen it. Coding is nothing if not time consuming and absolutely aggravating. I’m beginning to feel like a real coder after the last 2 days and hours of work over a silly apostrophe!
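One way to make the quoting mistake harder to repeat is to build the URL from named pieces rather than typing one long string. This is only a sketch: it assumes the httr package, and the environment variable name `OPENSECRETS_KEY` and the `"YOUR_KEY"` placeholder are hypothetical. It builds the URL but does not send the request.

```r
library(httr)

# Assumption: the key is stored in an environment variable with the
# hypothetical name OPENSECRETS_KEY; "YOUR_KEY" is a placeholder fallback
api_key <- Sys.getenv("OPENSECRETS_KEY", unset = "YOUR_KEY")

# Build the URL from named parts instead of typing one long quoted string
hvet_url <- modify_url(
  "https://www.opensecrets.org/api/",
  query = list(
    method = "congCmteIndus",
    congno = "116",
    indus  = "F10",
    cmte   = "HVET",
    apikey = api_key,
    output = "json"
  )
)

hvet_url
# The request itself would then be: HVET <- GET(hvet_url)
```

Keeping the key in an environment variable also keeps it out of the published post.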
cnlp_init_udpipe()
HVET <- GET('https://www.opensecrets.org/api/?method=congCmteIndus&congno=116&indus=F10&cmte=HVET&apikey=3f183582e16c4fff025509c65828bfa4&output=json')
names(HVET)
## [1] "url" "status_code" "headers" "all_headers" "cookies"
## [6] "content" "date" "times" "request" "handle"
#HVET_r <- fromJSON(rawToChar(HVET$content))
#names(HVET_r)
#HVET_r$fault
I continue to receive errors indicating a lexical error, then an invalid apikey. I am struggling to understand why my API key is not working here. Since I cannot go through another round of troubleshooting to understand why this API has returned different results, I will instead go back to the New York Times API and pull my own text relevant to another new project.
Finally, I can move a bit further after successfully retrieving data using the NYT API:
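A “lexical error” from `fromJSON()` usually means the text it was handed is not JSON at all, for example an HTML or plain-text error message from the server. A minimal sketch of checking before parsing follows; `parse_body` is a hypothetical helper and both sample bodies are made up, so this does not reproduce the actual opensecrets.org response.

```r
library(jsonlite)

# Hypothetical helper: parse a response body only if it is valid JSON
parse_body <- function(body_text) {
  if (!validate(body_text)) {
    # Keep the start of the raw text so the server's real message is visible
    return(list(ok = FALSE, raw = substr(body_text, 1, 200)))
  }
  list(ok = TRUE, data = fromJSON(body_text))
}

good <- parse_body('{"status": "OK"}')
bad  <- parse_body("<html>Invalid apikey</html>")

good$ok
bad$ok
bad$raw
```

With an httr response such as `HVET`, the body text could be obtained with `content(HVET, as = "text", encoding = "UTF-8")`, and `status_code(HVET)` shows whether the server reported an error.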
ukraine <- GET('https://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20220213&end_date=20220220&q=ukraine&api-key=GTp3efxVZiGO75Iox9uZJ8ZTjIMjDWsM')
names(ukraine)
## [1] "url" "status_code" "headers" "all_headers" "cookies"
## [6] "content" "date" "times" "request" "handle"
ukraine_r <- fromJSON(rawToChar(ukraine$content))
names(ukraine_r)
## [1] "status" "copyright" "response"
#ukraine_r$response
ukraine_r$response$docs$lead_paragraph
## [1] "On the edge of Europe and thousands of miles from the United States, the relevance of Ukraine extends far beyond its borders."
## [2] "A day before Ukraine announced its Defense Ministry and banking servers had been hacked, our video team toured the countryâ\200\231s Cybercommand Center, where officials have been preparing for this scenario for years."
## [3] "Vladimir Putin may still order an invasion of Ukraine, as President Biden said yesterday. Putin has long been obsessed with Ukraine, viewing it as part of Russiaâ\200\231s immediate orbit. And more than 150,000 Russian troops remain ready to pour over the border if Putin gives the order."
## [4] "As Russian troops are dispatched to Ukraineâ\200\231s borders, the threat of a major assault on the country continues to escalate. Ukrainian soldiers stand guard at possible points of invasion, including sections of irradiated zones near Chernobyl. The United States has sent troops to NATO countries and has pulled most diplomats from Kyiv, Ukraine and world leaders are in contact with President Vladimir V. Putin of Russia to try to negotiate peace in the region. In a one-hour phone call on Saturday, President Biden warned Mr. Putin that an invasion would result in â\200œswift and severeâ\200\235 costs."
## [5] "Two founding members of the Soviet Union â\200” Russia and Ukraine â\200” are once again at a flash point. Here are some pivotal moments that have led to Russiaâ\200\231s troop buildup on its western border with Ukraine:"
## [6] "MARIUPOL, Ukraine â\200” Paramilitary groups are actively preparing for a Russian invasion near Ukraineâ\200\231s front line with Russia-backed separatists."
## [7] "Big American and European companies operating on the ground in Ukraine said Friday that they had contingency plans at the ready in case of a Russian invasion but so far had not ordered the relocation of employees."
## [8] "KYIV, Ukraineâ\200” Pavlo Kaliuk, a freelance property broker in Ukraineâ\200\231s capital, used to sell and rent properties to clients from the United States, France, Germany and Israel. Then in November, when Russia first began posting troops along the countryâ\200\231s border, the deals quickly dried up."
## [9] "The New York Times traveled with a paramilitary group that says it refuses to leave the front lines in Ukraineâ\200\231s war in the east. But the government denies that theyâ\200\231re even there. What role can they have in a larger conflict with Russia?"
## [10] "NOVOTOSHKIVSKE, Ukraine â\200” Artillery shells struck a ring of frontline towns in eastern Ukraine Friday, blowing out windows, hitting schools, homes and military positions â\200” and stirring fears that the escalation here is only the prelude to direct Russian military action."
ukraine_t <- as_tibble(cbind(
date=ukraine_r$response$docs$pub_date,
abstract=ukraine_r$response$docs$abstract,
lead=ukraine_r$response$docs$lead_paragraph)
)
ukraine_t
## # A tibble: 10 x 3
## date abstract lead
## <chr> <chr> <chr>
## 1 2022-02-19T~ Here’s how the country ended u~ "On the edge of Europe and th~
## 2 2022-02-16T~ A day before Ukraine announced i~ "A day before Ukraine announc~
## 3 2022-02-16T~ Three explanations for the lates~ "Vladimir Putin may still ord~
## 4 2022-02-15T~ Andrew E. Kramer, a Times Moscow~ "As Russian troops are dispat~
## 5 2022-02-18T~ Here is a look at some key momen~ "Two founding members of the ~
## 6 2022-02-19T~ Times journalists followed the m~ "MARIUPOL, Ukraine — Parami~
## 7 2022-02-18T~ Despite warnings by Western lead~ "Big American and European co~
## 8 2022-02-18T~ Flights have been canceled, comm~ "KYIV, Ukraine— Pavlo Kaliu~
## 9 2022-02-19T~ The New York Times traveled with~ "The New York Times traveled ~
## 10 2022-02-18T~ With nowhere to go, many are inc~ "NOVOTOSHKIVSKE, Ukraine — ~
cnlp_init_udpipe()
annotated <- cnlp_annotate(ukraine_t$lead)
## Processed document 10 of 10
head(annotated)
## $token
## # A tibble: 479 x 11
## doc_id sid tid token token_with_ws lemma upos xpos feats tid_source
## * <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 1 1 On "On " on ADP IN <NA> 3
## 2 1 1 2 the "the " the DET DT Defini~ 3
## 3 1 1 3 edge "edge " edge NOUN NN Number~ 19
## 4 1 1 4 of "of " of ADP IN <NA> 5
## 5 1 1 5 Europe "Europe " Europe PROPN NNP Number~ 3
## 6 1 1 6 and "and " and CCONJ CC <NA> 7
## 7 1 1 7 thous~ "thousands " thous~ NOUN NNS Number~ 3
## 8 1 1 8 of "of " of ADP IN <NA> 9
## 9 1 1 9 miles "miles " mile NOUN NNS Number~ 7
## 10 1 1 10 from "from " from ADP IN <NA> 13
## # ... with 469 more rows, and 1 more variable: relation <chr>
##
## $document
## doc_id
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
library(spacyr)
spacy_initialize(model="en_core_web_sm")
ukraine_parsed <- spacy_parse(ukraine_t$lead, tag=TRUE, nounphrase=TRUE, entity=TRUE, lemma=TRUE)
head(ukraine_parsed)
## doc_id sentence_id token_id token lemma pos tag entity nounphrase
## 1 text1 1 1 On on ADP IN
## 2 text1 1 2 the the DET DT beg
## 3 text1 1 3 edge edge NOUN NN end_root
## 4 text1 1 4 of of ADP IN
## 5 text1 1 5 Europe Europe PROPN NNP LOC_B beg_root
## 6 text1 1 6 and and CCONJ CC
## whitespace
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
ukraine_anno <- left_join(annotated$document, annotated$token, by = "doc_id")
head(ukraine_anno)
## doc_id sid tid token token_with_ws lemma upos xpos
## 1 1 1 1 On On on ADP IN
## 2 1 1 2 the the the DET DT
## 3 1 1 3 edge edge edge NOUN NN
## 4 1 1 4 of of of ADP IN
## 5 1 1 5 Europe Europe Europe PROPN NNP
## 6 1 1 6 and and and CCONJ CC
## feats tid_source relation
## 1 <NA> 3 case
## 2 Definite=Def|PronType=Art 3 det
## 3 Number=Sing 19 obl
## 4 <NA> 5 case
## 5 Number=Sing 3 nmod
## 6 <NA> 7 cc
library(magrittr)
nouns <- ukraine_anno %>%
filter(upos == "NOUN") %>%
group_by(token) %>%
summarize(count = n()) %>%
arrange(desc(count))
adjs <- ukraine_anno %>%
filter(upos == "ADJ") %>%
group_by(token) %>%
summarize(count = n()) %>%
arrange(desc(count))
propns <- ukraine_anno %>%
filter(upos == "PROPN") %>%
group_by(token) %>%
summarize(count = n()) %>%
arrange(desc(count))
verbs <- ukraine_anno %>%
filter(upos == "VERB") %>%
group_by(token) %>%
summarize(count = n()) %>%
arrange(desc(count))
head(nouns)
## # A tibble: 6 x 2
## token count
## <chr> <int>
## 1 â 5
## 2 invasion 5
## 3 troops 4
## 4 border 3
## 5 borders 2
## 6 country’s 2
head(adjs)
## # A tibble: 6 x 2
## token count
## <chr> <int>
## 1 Russian 5
## 2 front 2
## 3 military 2
## 4 ready 2
## 5 American 1
## 6 Big 1
head(propns)
## # A tibble: 6 x 2
## token count
## <chr> <int>
## 1 Ukraine 11
## 2 Putin 5
## 3 Russia 5
## 4 Ukraine’s 4
## 5 President 3
## 6 States 3
head(verbs)
## # A tibble: 6 x 2
## token count
## <chr> <int>
## 1 preparing 2
## 2 said 2
## 3 announced 1
## 4 backed 1
## 5 began 1
## 6 blowing 1
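The four nearly identical pipelines above could be folded into one helper. This is a sketch on a made-up token table, not the annotated NYT data; `count_by_pos` is a hypothetical name, and it counts by `lemma` rather than `token` (a swap from the code above) so that forms like border and borders are tallied together.

```r
library(dplyr)

# Hypothetical helper: frequency of lemmas for one part-of-speech tag
count_by_pos <- function(anno, pos) {
  anno %>%
    filter(upos == pos) %>%
    count(lemma, sort = TRUE, name = "count")
}

# Made-up rows shaped like the cleanNLP token table
toy <- tibble(
  token = c("border", "borders", "troops", "Russian"),
  lemma = c("border", "border", "troop", "Russian"),
  upos  = c("NOUN", "NOUN", "NOUN", "ADJ")
)

toy_nouns <- count_by_pos(toy, "NOUN")
toy_nouns
```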
The stray characters that carried over from the NYT API import (such as the “â” that tops the noun counts) are not easily cleaned up, though I have spent some time getting familiar with some of the tools for doing so. I need to spend more time working out the best options for this.
I scratched the surface of the stringr package and searched all of the lead paragraphs from the last week for any variation of the word “protest(s)”. Unfortunately, I did not find any instances.
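Sequences like `â\200\231` are typically what the UTF-8 bytes of a curly apostrophe look like when they are read as a single-byte encoding, which can happen when the output of `rawToChar()` is not marked as UTF-8. Assuming that is the cause here (this has not been confirmed for this particular response), declaring the encoding may be enough. The sketch below uses a made-up string rather than the NYT data.

```r
# "Ukraine’s" with a curly apostrophe, written with an escape for portability
original <- "Ukraine\u2019s"

# Round trip through raw bytes, as rawToChar(ukraine$content) does
bytes <- charToRaw(enc2utf8(original))
txt <- rawToChar(bytes)

# Declare the bytes to be UTF-8 so R displays them as intended
Encoding(txt) <- "UTF-8"

txt
nchar(txt)  # counts characters, not bytes
```

With httr, `content(ukraine, as = "text", encoding = "UTF-8")` is another way to get text with a declared encoding before passing it to `fromJSON()`.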
library(stringr)
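# Match "protest" or "protests" in either case; \\b marks a word boundary, so
# the word is found even next to punctuation or at the start of a paragraph
str_match(ukraine_t$lead, "\\b[Pp]rotests?\\b")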
## [,1]
## [1,] NA
## [2,] NA
## [3,] NA
## [4,] NA
## [5,] NA
## [6,] NA
## [7,] NA
## [8,] NA
## [9,] NA
## [10,] NA
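To confirm that a search like this can find the word when it is present, a word-boundary pattern with an optional "s" can be tried on a few made-up sentences. This is a sketch assuming stringr; the sentences are invented and are not NYT text.

```r
library(stringr)

# Made-up sentences, not NYT text
examples <- c(
  "Protests erupted in the capital.",
  "A protest, then calm.",
  "Officials expect no unrest."
)

# \\b is a word boundary; s? makes the plural optional
pattern <- regex("\\bprotests?\\b", ignore_case = TRUE)

hits <- str_detect(examples, pattern)
hits
str_extract(examples, pattern)
```

Seeing the first two sentences match makes it more believable that an all-`NA` result on the lead paragraphs reflects the text rather than the pattern.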
Since I had so much background work to do this week and wanted to end up with something visual, I decided to start with a word cloud. Although the characters clearly still need cleaning up, I feel like I have covered a lot of ground in the last couple of weeks. I look forward to putting it all together in a more purposeful way.
library(wordcloud)
library(RColorBrewer)
library(tm)
ukraine_words <- merge(merge(merge(
nouns,
adjs, all = TRUE),
propns, all = TRUE),
verbs, all = TRUE)
set.seed(1234)
wordcloud(words = ukraine_words$token, freq = ukraine_words$count, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))