For the Week 9 assignment, we need to connect to one of the New York Times APIs, build an interface in R to read in the JSON data, and transform it into an R data frame.
I have chosen to work with the Technology section of the Top Stories API. I am interested in finding the latest top 20 tags in technology.
The first step in connecting to the New York Times APIs is to sign up and register an app, which provides the API key needed to access the various APIs. The key must be included with every request. The link below has all the details on getting an API key.
https://developer.nytimes.com/get-started
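The code below assumes the key is stored in a variable named api_key. One way to keep the key out of the script (a sketch, assuming it has been saved in an environment variable named NYT_API_KEY, for example via ~/.Renviron) is:
# Read the API key from an environment variable instead of hard-coding it
api_key <- Sys.getenv("NYT_API_KEY")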
The R packages used here are tidyverse, jsonlite, sqldf, RColorBrewer, and wordcloud2.
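Loading them produces the startup messages shown below; a minimal setup chunk consistent with those messages would be:
library(tidyverse)
library(jsonlite)
library(sqldf)        # also loads gsubfn, proto and RSQLite
library(RColorBrewer)
library(wordcloud2)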
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: RColorBrewer
Step 1: Read the data from the NYT Top Stories API for technology using the fromJSON() function from jsonlite and select the relevant columns. The columns selected here are results.section, results.subsection, results.title, results.abstract, results.url, results.byline, results.published_date, and results.des_facet.
# Build the request URL for the NYT Top Stories API (api_key was set above)
theURL <- "https://api.nytimes.com/svc/topstories/v2/technology.json"
theURL <- paste0(theURL, "?api-key=", api_key)
# select the relevant columns and return the result as a data frame
tech_data <- fromJSON(theURL, flatten = TRUE) %>%
as.data.frame() %>%
select(results.section,
results.subsection,
results.title,
results.abstract,
results.url,
results.byline,
results.published_date,
results.des_facet)
# show head of data
head(tech_data)
## results.section results.subsection
## 1 technology
## 2 business
## 3 business
## 4 technology personaltech
## 5 style
## 6 technology
## results.title
## 1 As Life Moves Online, an Older Generation Faces a Digital Divide
## 2 A New Mission for Nonprofits During the Outbreak: Survival
## 3 Surging Traffic Is Slowing Down Our Internet
## 4 The Dos and Don’ts of Online Video Meetings
## 5 At Two Fashion Resale Warehouses, Workers Fear for Their Safety
## 6 The Week in Tech: We’re Testing How Much the Internet Can Handle
## results.abstract
## 1 Uncomfortable with tech, many are struggling to use modern tools to keep up with friends and family in the pandemic.
## 2 Upended by the coronavirus outbreak, nonprofits are laying off workers and seeking help from stretched donors.
## 3 With people going online more in the pandemic, internet traffic has exploded. That’s taking a toll on our download speeds and video quality.
## 4 From setting a clear agenda to testing your tech setup, here’s how to make video calls more tolerable for you and your colleagues.
## 5 As New Jersey orders nonessential workers to stay home to fight the spread of the new coronavirus, employees of the RealReal, a luxury resale company, wonder just what is “essential.”
## 6 We are more dependent on technology than ever. Can it handle the strain?
## results.url
## 1 https://www.nytimes.com/2020/03/27/technology/virus-older-generation-digital-divide.html
## 2 https://www.nytimes.com/2020/03/27/business/nonprofits-survival-coronavirus.html
## 3 https://www.nytimes.com/2020/03/26/business/coronavirus-internet-traffic-speed.html
## 4 https://www.nytimes.com/2020/03/25/technology/personaltech/online-video-meetings-etiquette-virus.html
## 5 https://www.nytimes.com/2020/03/27/style/realreal-warehouse-ecommerce-new-jersey.html
## 6 https://www.nytimes.com/2020/03/27/technology/internet-strain-coronavirus.html
## results.byline results.published_date
## 1 By Kate Conger and Erin Griffith 2020-03-27T09:30:08-04:00
## 2 By David Streitfeld 2020-03-27T05:00:29-04:00
## 3 By Cecilia Kang, Davey Alba and Adam Satariano 2020-03-26T05:00:24-04:00
## 4 By Brian X. Chen 2020-03-25T05:00:18-04:00
## 5 By Jessica Testa 2020-03-27T17:38:47-04:00
## 6 By Cade Metz 2020-03-27T09:00:05-04:00
## results.des_facet
## 1 Computers and the Internet, Mobile Applications, Elderly, Coronavirus (2019-nCoV), Software, Videophones and Videoconferencing, Nursing Homes, Epidemics
## 2 Layoffs and Job Reductions, Philanthropy, Nonprofit Organizations, Shutdowns (Institutional), Labor and Jobs, Coronavirus (2019-nCoV)
## 3 Coronavirus (2019-nCoV), Quarantines, Computers and the Internet, Wireless Communications, Video Recordings, Downloads and Streaming, Telephones and Telecommunications, Computer and Video Games, Videophones and Videoconferencing, Xbox (Video Game System)
## 4 Computers and the Internet, Telecommuting, Mobile Applications, Cameras, Videophones and Videoconferencing, Coronavirus (2019-nCoV)
## 5 Layoffs and Job Reductions, Luxury Goods and Services, Shopping and Retail, Coronavirus (2019-nCoV), E-Commerce, Fashion and Apparel
## 6 Computers and the Internet, Coronavirus (2019-nCoV), Online Advertising, Social Media, E-Commerce
dim(tech_data)
## [1] 29 8
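One caveat: the NYT APIs rate-limit requests, so a bare fromJSON() call can occasionally fail mid-knit. A small defensive sketch (read_nyt is a hypothetical helper, not part of jsonlite) would be:
# Hypothetical wrapper: surface request failures with a readable message
read_nyt <- function(url) {
  tryCatch(
    fromJSON(url, flatten = TRUE),
    error = function(e) stop("NYT API request failed: ", conditionMessage(e))
  )
}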
Step 2: Use the unnest() function from tidyr (part of the tidyverse) on the results.des_facet column to reshape the data from wide to long: every tag in results.des_facet gets its own row. After this, the columns are renamed SECTION, SUB_SECTION, TITLE, ABSTRACT, URL, AUTHOR, PUBLISHED_DATE, and TAG, respectively.
tech_data <- unnest(tech_data, results.des_facet)
# replace column names
col_names <- c("SECTION","SUB_SECTION","TITLE","ABSTRACT","URL","AUTHOR","PUBLISHED_DATE","TAG")
colnames(tech_data) <- col_names
# Format AUTHOR column
tech_data$AUTHOR <- str_replace(tech_data$AUTHOR, 'By ', '')
head(tech_data)
## # A tibble: 6 x 8
## SECTION SUB_SECTION TITLE ABSTRACT URL AUTHOR PUBLISHED_DATE TAG
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 technol… "" As Life… Uncomforta… https:… Kate … 2020-03-27T09… Compu…
## 2 technol… "" As Life… Uncomforta… https:… Kate … 2020-03-27T09… Mobil…
## 3 technol… "" As Life… Uncomforta… https:… Kate … 2020-03-27T09… Elder…
## 4 technol… "" As Life… Uncomforta… https:… Kate … 2020-03-27T09… Coron…
## 5 technol… "" As Life… Uncomforta… https:… Kate … 2020-03-27T09… Softw…
## 6 technol… "" As Life… Uncomforta… https:… Kate … 2020-03-27T09… Video…
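To see what unnest() did here, a toy example (made-up data, same pattern as results.des_facet) helps: a list-column holding several tags per article expands to one row per article-tag pair.
# Toy illustration: two articles, a list-column of tags
toy <- tibble(title = c("A", "B"), tags = list(c("x", "y"), "z"))
unnest(toy, tags)  # 3 rows: A/x, A/y, B/z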
Step 3: Use the sqldf() function to count the occurrences of each TAG, then sort them in decreasing order. The idea is to see the top 20 tags and their frequencies.
# Get each TAG and its count from tech_data
df_tag <- sqldf("SELECT TAG, Count(1) as FREQUENCY FROM tech_data GROUP BY TAG")
# order results by decreasing frequency
df_tag <- df_tag[order(df_tag$FREQUENCY, decreasing = TRUE),]
head(df_tag)
## TAG FREQUENCY
## 16 Coronavirus (2019-nCoV) 22
## 15 Computers and the Internet 13
## 29 Epidemics 7
## 46 Mobile Applications 7
## 69 Social Media 7
## 57 Quarantines 6
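For comparison, the same table can be built without SQL using dplyr's count(), which groups, tallies and sorts in one call:
# dplyr equivalent of the sqldf() query above
df_tag2 <- count(tech_data, TAG, sort = TRUE, name = "FREQUENCY")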
Step 4: Draw a bar plot to visualize the top 20 tags and their counts, and view the same top 20 tags as a word cloud with wordcloud2.
# ggplot top 20 results, with bars ordered by frequency
top_n(df_tag, n=20, FREQUENCY) %>%
  ggplot(aes(x=reorder(TAG, FREQUENCY), y=FREQUENCY, fill=FREQUENCY)) +
  geom_col() +
  coord_flip() +
  labs(x="TAG")
wordcloud2(data=top_n(df_tag, n=20, FREQUENCY), size=.25, color='random-dark')
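A small subtlety: top_n() keeps ties, so the plot and the word cloud may show slightly more than 20 tags when several share the cutoff frequency. Since df_tag is already sorted by frequency, an exact-20 alternative (ties broken by row order) is:
# Exactly 20 rows, ties broken by current row order
head(df_tag, 20)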
Finally, the top tags for the top stories listed in the NY Times Technology section are “Coronavirus (2019-nCoV)”, “Computers and the Internet”, “Epidemics”, “Mobile Applications” and “Social Media”. Since the data is updated regularly, it would be fun to watch how frequently the tags change.