Getting Started

I am beginning to pull in the text collection that I will analyze for my final project in the course. To do so, I am pulling articles by month, beginning January 2020 and running through December 2021. For my initial text collection, I will gather articles using the New York Times API with the search query “Afghanistan”. I am not limiting the search with any filters at this time. However, the New York Times Article Search API does not return the full text of each article; rather, I will be able to pull the abstract/summary, lead paragraph, and snippet, along with metadata such as the headline, byline, and section name.

This gives me a ‘max pages’ value of 344 for the initial query. Since pages are zero-indexed (page “1” is really page “0”) and each page holds up to 10 results, that means 345 pages of results for the query term “Afghanistan” from January 2020 through December 2021, or between 3,441 and 3,450 results.
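To double-check that arithmetic, here is a minimal sketch; the hit count of 3,445 is a hypothetical value inside the 3,441 to 3,450 range implied above:

# Pages are zero-indexed and hold up to 10 results each,
# so the highest page index is ceiling(hits / 10) - 1.
hits <- 3445  # hypothetical hit count, for illustration only
ceiling(hits / 10) - 1  # 344, i.e. pages 0 through 344 = 345 pages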

If I start with a single query for all of the results for 2020 and 2021, I get a timeout on page 172 of 344.

# January 1, 2020 through December 31, 2021

#urlall <- ('https://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20200101&end_date=20211231&q=afghanistan&api-key=GTp3efxVZiGO75Iox9uZJ8ZTjIMjDWsM')

#queryall <- fromJSON(urlall)

#max.pagesall <- ceiling((queryall$response$meta$hits[1] / 10)-1) 

#pagesall <- list()
#for(i in 0:max.pagesall){
#  searchall <- fromJSON(paste0(urlall, "&page=", i), flatten = TRUE) %>% data.frame()
#  message("Retrieving page ", i)
#  pagesall[[i+1]] <- searchall
#  Sys.sleep(10)
#}

This timed out again on page 172 of 344, so the code above is commented out.
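One way to make a long pull more resilient would be to retry each page before giving up. This is a sketch I did not run for the final collection; fetch_page() is a hypothetical helper of my own, and the retry count and 30-second back-off are assumptions:

# Hypothetical helper: retry a page up to 'tries' times before failing.
fetch_page <- function(url, page, tries = 3) {
  for (attempt in 1:tries) {
    result <- tryCatch(
      fromJSON(paste0(url, "&page=", page), flatten = TRUE),
      error = function(e) NULL)
    if (!is.null(result)) return(data.frame(result))
    message("Attempt ", attempt, " failed for page ", page, "; retrying")
    Sys.sleep(30)  # assumed back-off; the NYT API rate limit is strict
  }
  stop("Page ", page, " failed after ", tries, " attempts")
}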

# Review the total number of result pages in the NYT Article Search API for the term "Afghanistan" between January 2020 and December 2021

# This serves as the data import for import 1

library(jsonlite)   # fromJSON() and rbind_pages()
library(tidyverse)  # %>% pipe and as_tibble()

baseurl <- ('https://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20200101&end_date=20211231&q=afghanistan&api-key=GTp3efxVZiGO75Iox9uZJ8ZTjIMjDWsM')

initial.query <- fromJSON(baseurl)

max.pages <- ceiling((initial.query$response$meta$hits[1] / 10)-1) 

I want to break the query into smaller, more workable groups that will not time out, so I have to experiment a bit to find an efficient article collection plan. I will start by pulling articles from 2020 only.
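To avoid hand-editing the begin and end dates in each URL below, a small helper could assemble them. build_url() is a hypothetical convenience function of my own, with the API key left as a placeholder:

# Hypothetical helper: build an Article Search URL for a date range.
build_url <- function(begin, end, q = "afghanistan", key = "MY_API_KEY") {
  paste0("https://api.nytimes.com/svc/search/v2/articlesearch.json",
         "?begin_date=", begin, "&end_date=", end,
         "&q=", q, "&api-key=", key)
}
# e.g., build_url("20200101", "20201231") reproduces url2020 below.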

# For articles from 2020

url2020 <- ('https://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20200101&end_date=20201231&q=afghanistan&api-key=GTp3efxVZiGO75Iox9uZJ8ZTjIMjDWsM')

query2020 <- fromJSON(url2020)

max.pages2020 <- ceiling((query2020$response$meta$hits[1] / 10)-1) 

pages2020 <- list()
for(i in 0:max.pages2020){
  search2020 <- fromJSON(paste0(url2020, "&page=", i), flatten = TRUE) %>% data.frame() 
  message("Retrieving page ", i)
  pages2020[[i+1]] <- search2020
  Sys.sleep(10) 
}
## Retrieving page 0
## Retrieving page 1
## Retrieving page 2
## ...
## Retrieving page 125

Now I can take the 2020 search results, bind them, and save them into the data shell for the 2020 articles. Later I will merge the results from all three searches into one complete data frame.

afghanistan.articles.2020 <- rbind_pages(pages2020)

save(afghanistan.articles.2020,file="afghanistan_articles_2020.Rdata")
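If I need these results in a later session, load() restores the saved data frame under its original name:

load("afghanistan_articles_2020.Rdata")  # restores afghanistan.articles.2020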

Next, I’ll pull articles from January through August 2021. Since August 31, 2021 was the “official” withdrawal date of U.S. troops from Afghanistan, the bulk of the news articles involving Afghanistan come from August and September 2021.

# For January to August 2021

url2021a <- ('https://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20210101&end_date=20210831&q=afghanistan&api-key=GTp3efxVZiGO75Iox9uZJ8ZTjIMjDWsM')

query2021a <- fromJSON(url2021a)

max.pages2021a <- ceiling((query2021a$response$meta$hits[1] / 10)-1) 

pages2021a <- list()
for(i in 0:max.pages2021a){
  search2021a <- fromJSON(paste0(url2021a, "&page=", i), flatten = TRUE) %>% data.frame() 
  message("Retrieving page ", i)
  pages2021a[[i+1]] <- search2021a
  Sys.sleep(10) 
}
## Retrieving page 0
## Retrieving page 1
## Retrieving page 2
## ...
## Retrieving page 133

Now I can take the first part of the 2021 search results, bind them, and save them into the data shell:

afghanistan.articles.2021a <- rbind_pages(pages2021a)

save(afghanistan.articles.2021a,file="afghanistan_articles_2021a.Rdata")

Next, I’ll pull articles from September through December 2021.

# For September-December 2021

url2021b <- ('https://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20210901&end_date=20211231&q=afghanistan&api-key=GTp3efxVZiGO75Iox9uZJ8ZTjIMjDWsM')

query2021b <- fromJSON(url2021b)

max.pages2021b <- ceiling((query2021b$response$meta$hits[1] / 10)-1) 

pages2021b <- list()
for(i in 0:max.pages2021b){
  search2021b <- fromJSON(paste0(url2021b, "&page=", i), flatten = TRUE) %>% data.frame() 
  message("Retrieving page ", i)
  pages2021b[[i+1]] <- search2021b
  Sys.sleep(10) 
}
## Retrieving page 0
## Retrieving page 1
## Retrieving page 2
## ...
## Retrieving page 85

Now I can take the second part of the 2021 search results, bind them, and save them into the data shell:

afghanistan.articles.2021b <- rbind_pages(pages2021b)

save(afghanistan.articles.2021b,file="afghanistan_articles_2021b.Rdata")

Now that each pull is complete, I will bind them all into the “afghanistan.articles.all” data frame:

# Create shell for data and bind all three sets of pages into it

afghanistan.articles.all <- rbind_pages(c(pages2020, pages2021a, pages2021b))
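As a sanity check, the combined row count should match the total hits from the three queries, since the date ranges do not overlap. This is a sketch assuming every page downloaded successfully:

# Rows retrieved vs. total hits reported by the three queries
total.hits <- query2020$response$meta$hits[1] +
  query2021a$response$meta$hits[1] +
  query2021b$response$meta$hits[1]
nrow(afghanistan.articles.all) == total.hits  # expect TRUE if nothing failed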

Finally, I’m cleaning up the date column and saving the formatted tibble for offline access.

# Keep only the fields needed for analysis from the flattened response
afghanistan.articles.table <- as_tibble(cbind(
  date=afghanistan.articles.all$response.docs.pub_date,
  abstract=afghanistan.articles.all$response.docs.abstract,
  lead.paragraph=afghanistan.articles.all$response.docs.lead_paragraph,
  snippet=afghanistan.articles.all$response.docs.snippet,
  section.name=afghanistan.articles.all$response.docs.section_name,
  subsection.name=afghanistan.articles.all$response.docs.subsection_name,
  news.desk=afghanistan.articles.all$response.docs.news_desk,
  byline=afghanistan.articles.all$response.docs.byline.original,
  headline.main=afghanistan.articles.all$response.docs.headline.main,
  headline.print=afghanistan.articles.all$response.docs.headline.print_headline,
  headline.kicker=afghanistan.articles.all$response.docs.headline.kicker,
  material=afghanistan.articles.all$response.docs.type_of_material,
  url=afghanistan.articles.all$response.docs.web_url
  ))

# Drop the time-of-day and timezone suffix (last 14 characters) from pub_date
afghanistan.articles.table$date <- substr(afghanistan.articles.table$date, 1, nchar(afghanistan.articles.table$date)-14)

afghanistan.articles.table$date <- as.Date(afghanistan.articles.table$date, "%Y-%m-%d")
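A quick range check confirms the dates parsed correctly:

range(afghanistan.articles.table$date)  # should span early 2020 through late 2021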

save(afghanistan.articles.table,file="afghanistan.articles.table.Rdata")

write.table(afghanistan.articles.table, file = "~/GitHub/DACSS.697D/Text as Data Spring22/afghanistan.articles.table.csv", sep=",", row.names=FALSE)

Phase one of data collection, complete!