For the Week 9 assignment, we need to connect to one of the New York Times APIs, build an interface in R to read in the JSON data, and transform it into an R data frame.
I have chosen to work with the Technology section of the Top Stories API. I am interested in finding the latest top 20 tags in technology.
The first step in connecting to the New York Times APIs is to sign up and register an app, which provides the API key needed to access the various APIs. The key must be included with every request. The link below has all the details on getting an API key.
https://developer.nytimes.com/get-started
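The code below assumes the key is stored in a variable named api_key. One way to keep the key out of the script (a sketch, assuming it has been saved in an environment variable named NYT_API_KEY, for example via ~/.Renviron) is:
# Read the API key from an environment variable instead of hard-coding it
api_key <- Sys.getenv("NYT_API_KEY")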
The R packages used here are tidyverse, jsonlite, sqldf, RColorBrewer, and wordcloud2.
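Loading them produces the startup messages shown below; a minimal setup chunk consistent with those messages would be:
library(tidyverse)
library(jsonlite)
library(sqldf)        # also loads gsubfn, proto and RSQLite
library(RColorBrewer)
library(wordcloud2)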
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: RColorBrewer
Step 1: Read the data from the NYT Top Stories API for technology using the fromJSON() function from jsonlite and select the relevant columns. The columns selected here are results.section, results.subsection, results.title, results.abstract, results.url, results.byline, results.published_date, and results.des_facet.
# Build the request URL for the NYT Top Stories API (api_key was set above)
theURL <- "https://api.nytimes.com/svc/topstories/v2/technology.json"
theURL <- paste0(theURL, "?api-key=", api_key)
# select the relevant columns and return the result as a data frame
tech_data <- fromJSON(theURL, flatten = TRUE) %>%
as.data.frame() %>%
select(results.section,
results.subsection,
results.title,
results.abstract,
results.url,
results.byline,
results.published_date,
results.des_facet)
# show head of data
head(tech_data)
## results.section results.subsection
## 1 technology
## 2 business
## 3 business
## 4 technology personaltech
## 5 style
## 6 technology
## results.title
## 1 As Life Moves Online, an Older Generation Faces a Digital Divide
## 2 A New Mission for Nonprofits During the Outbreak: Survival
## 3 Surging Traffic Is Slowing Down Our Internet
## 4 The Dos and Don’ts of Online Video Meetings
## 5 At Two Fashion Resale Warehouses, Workers Fear for Their Safety
## 6 The Week in Tech: We’re Testing How Much the Internet Can Handle
## results.abstract
## 1 Uncomfortable with tech, many are struggling to use modern tools to keep up with friends and family in the pandemic.
## 2 Upended by the coronavirus outbreak, nonprofits are laying off workers and seeking help from stretched donors.
## 3 With people going online more in the pandemic, internet traffic has exploded. That’s taking a toll on our download speeds and video quality.
## 4 From setting a clear agenda to testing your tech setup, here’s how to make video calls more tolerable for you and your colleagues.
## 5 As New Jersey orders nonessential workers to stay home to fight the spread of the new coronavirus, employees of the RealReal, a luxury resale company, wonder just what is “essential.”
## 6 We are more dependent on technology than ever. Can it handle the strain?
## results.url
## 1 https://www.nytimes.com/2020/03/27/technology/virus-older-generation-digital-divide.html
## 2 https://www.nytimes.com/2020/03/27/business/nonprofits-survival-coronavirus.html
## 3 https://www.nytimes.com/2020/03/26/business/coronavirus-internet-traffic-speed.html
## 4 https://www.nytimes.com/2020/03/25/technology/personaltech/online-video-meetings-etiquette-virus.html
## 5 https://www.nytimes.com/2020/03/27/style/realreal-warehouse-ecommerce-new-jersey.html
## 6 https://www.nytimes.com/2020/03/27/technology/internet-strain-coronavirus.html
## results.byline results.published_date
## 1 By Kate Conger and Erin Griffith 2020-03-27T09:30:08-04:00
## 2 By David Streitfeld 2020-03-27T05:00:29-04:00
## 3 By Cecilia Kang, Davey Alba and Adam Satariano 2020-03-26T05:00:24-04:00
## 4 By Brian X. Chen 2020-03-25T05:00:18-04:00
## 5 By Jessica Testa 2020-03-27T17:38:47-04:00
## 6 By Cade Metz 2020-03-27T09:00:05-04:00
## results.des_facet
## 1 Computers and the Internet, Mobile Applications, Elderly, Coronavirus (2019-nCoV), Software, Videophones and Videoconferencing, Nursing Homes, Epidemics
## 2 Layoffs and Job Reductions, Philanthropy, Nonprofit Organizations, Shutdowns (Institutional), Labor and Jobs, Coronavirus (2019-nCoV)
## 3 Coronavirus (2019-nCoV), Quarantines, Computers and the Internet, Wireless Communications, Video Recordings, Downloads and Streaming, Telephones and Telecommunications, Computer and Video Games, Videophones and Videoconferencing, Xbox (Video Game System)
## 4 Computers and the Internet, Telecommuting, Mobile Applications, Cameras, Videophones and Videoconferencing, Coronavirus (2019-nCoV)
## 5 Layoffs and Job Reductions, Luxury Goods and Services, Shopping and Retail, Coronavirus (2019-nCoV), E-Commerce, Fashion and Apparel
## 6 Computers and the Internet, Coronavirus (2019-nCoV), Online Advertising, Social Media, E-Commerce
dim(tech_data)
## [1] 29 8
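One caveat: the NYT APIs rate-limit requests, so a bare fromJSON() call can occasionally fail mid-knit. A small defensive sketch (read_nyt is a hypothetical helper, not part of jsonlite) would be:
# Hypothetical wrapper: surface request failures with a readable message
read_nyt <- function(url) {
  tryCatch(
    fromJSON(url, flatten = TRUE),
    error = function(e) stop("NYT API request failed: ", conditionMessage(e))
  )
}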
Step 2: Use the unnest() function from tidyr (part of the tidyverse) on the results.des_facet column to reshape the data from wide to long: every tag in results.des_facet gets its own row. After this, the columns are renamed SECTION, SUB_SECTION, TITLE, ABSTRACT, URL, AUTHOR, PUBLISHED_DATE, and TAG, respectively.
tech_data <- unnest(tech_data, results.des_facet)
# replace column names
col_names <- c("SECTION","SUB_SECTION","TITLE","ABSTRACT","URL","AUTHOR","PUBLISHED_DATE","TAG")
colnames(tech_data) <- col_names
# Format AUTHOR column
tech_data$AUTHOR <- str_replace(tech_data$AUTHOR, 'By ', '')
head(tech_data)
## # A tibble: 6 x 8
## SECTION SUB_SECTION TITLE ABSTRACT URL AUTHOR PUBLISHED_DATE TAG
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 technol… "" As Life… Uncomforta… https:… Kate … 2020-03-27T09… Compu…
## 2 technol… "" As Life… Uncomforta… https:… Kate … 2020-03-27T09… Mobil…
## 3 technol… "" As Life… Uncomforta… https:… Kate … 2020-03-27T09… Elder…
## 4 technol… "" As Life… Uncomforta… https:… Kate … 2020-03-27T09… Coron…
## 5 technol… "" As Life… Uncomforta… https:… Kate … 2020-03-27T09… Softw…
## 6 technol… "" As Life… Uncomforta… https:… Kate … 2020-03-27T09… Video…
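To see what unnest() did here, a toy example (made-up data, same pattern as results.des_facet) helps: a list-column holding several tags per article expands to one row per article-tag pair.
# Toy illustration: two articles, a list-column of tags
toy <- tibble(title = c("A", "B"), tags = list(c("x", "y"), "z"))
unnest(toy, tags)  # 3 rows: A/x, A/y, B/z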
Step 3: Use the sqldf() function to count the occurrences of each TAG, then sort them in decreasing order. The idea is to see the top 20 tags and their frequencies.
# Get each TAG and its count from tech_data
df_tag <- sqldf("SELECT TAG, Count(1) as FREQUENCY FROM tech_data GROUP BY TAG")
# order results by decreasing frequency
df_tag <- df_tag[order(df_tag$FREQUENCY, decreasing = TRUE),]
head(df_tag)
## TAG FREQUENCY
## 16 Coronavirus (2019-nCoV) 22
## 15 Computers and the Internet 13
## 29 Epidemics 7
## 46 Mobile Applications 7
## 69 Social Media 7
## 57 Quarantines 6
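For comparison, the same table can be built without SQL using dplyr's count(), which groups, tallies and sorts in one call:
# dplyr equivalent of the sqldf() query above
df_tag2 <- count(tech_data, TAG, sort = TRUE, name = "FREQUENCY")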
Step 4: Draw a bar plot to visualize the top 20 tags and their counts, and view the same top 20 tags as a word cloud with wordcloud2.
# ggplot top 20 results, with bars ordered by frequency
top_n(df_tag, n=20, FREQUENCY) %>%
  ggplot(aes(x=reorder(TAG, FREQUENCY), y=FREQUENCY, fill=FREQUENCY)) +
  geom_col() +
  coord_flip() +
  labs(x="TAG")
wordcloud2(data=top_n(df_tag, n=20, FREQUENCY), size=.25, color='random-dark')
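A small subtlety: top_n() keeps ties, so the plot and the word cloud may show slightly more than 20 tags when several share the cutoff frequency. Since df_tag is already sorted by frequency, an exact-20 alternative (ties broken by row order) is:
# Exactly 20 rows, ties broken by current row order
head(df_tag, 20)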
Finally, the top tags for the top stories listed in the NY Times Technology section are “Coronavirus (2019-nCoV)”, “Computers and the Internet”, “Epidemics”, “Mobile Applications” and “Social Media”. Since the data is updated regularly, it would be fun to watch how frequently the tags change.