Assignment

The New York Times web site provides a rich set of APIs, as described here: https://developer.nytimes.com/apis

You’ll need to start by signing up for an API key. Your task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it into an R DataFrame.

Libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(jsonlite)
## 
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
## 
##     flatten
library(stringr)
library(ggplot2)
library(DT)

API Key

apiKey <- "ZHUSRtGwBb1cvhmgHPJ6QIiXQNGy65as"
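A key written directly into a knitted report is visible to anyone who reads it. One possible alternative, sketched here under the assumption that the key is stored in an environment variable called `NYT_API_KEY` (a hypothetical name, set for example in `~/.Renviron`), is to read it at run time. The helper name `get_api_key` is also made up for illustration.

```r
# Hypothetical helper: read the key from an environment variable.
# NYT_API_KEY is an assumed name, not something the NYT API requires.
get_api_key <- function(var = "NYT_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Set ", var, " (for example in ~/.Renviron) before knitting.")
  }
  key
}
```

With the variable set, `apiKey <- get_api_key()` could stand in for the literal assignment.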

Connect to the API

I will use the Top Stories API, filtered to the Technology section.

url <- paste0("https://api.nytimes.com/svc/topstories/v2/technology.json?api-key=", apiKey)
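The section name is the only part of this URL that changes between Top Stories feeds, so the request can be parameterised. The sketch below assumes a helper named `top_stories_url` (hypothetical, not part of any package) and uses a placeholder key.

```r
# Hypothetical helper: build a Top Stories URL for a given section.
top_stories_url <- function(section, key) {
  paste0("https://api.nytimes.com/svc/topstories/v2/",
         section, ".json?api-key=", key)
}

# Placeholder key, for illustration only.
top_stories_url("technology", "YOUR_KEY")
```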

Creating Data Frame

techData <- fromJSON(url) %>% 
  as.data.frame() %>%
  select(-results.multimedia)

datatable(techData,extensions='Scroller',options=list(scrollY=500,scroller=TRUE))
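To see why the article columns end up with a `results.` prefix, it may help to run the same two calls on a small, made-up payload. The JSON below only imitates the shape of a Top Stories response (a few top-level fields plus a `results` array); the values are invented.

```r
library(jsonlite)

# Toy response with the same overall shape as the real payload (made-up values).
toy_json <- '{
  "status": "OK",
  "last_updated": "2021-10-23T15:00:00-04:00",
  "results": [
    {"title": "Story A", "des_facet": ["Tag 1", "Tag 2"]},
    {"title": "Story B", "des_facet": ["Tag 1"]}
  ]
}'

# fromJSON() returns a list; as.data.frame() recycles the top-level fields
# and prefixes the columns of the nested data frame with "results.".
toy_df <- as.data.frame(fromJSON(toy_json))

names(toy_df)
nrow(toy_df)   # one row per article
```

Note that `des_facet` stays a list column, since each article carries a different number of tags.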

Tidy

techData_df <- techData %>%
  select(last_updated, results.published_date, results.section,
         results.subsection, results.title, results.abstract,
         results.url, results.byline, results.des_facet)

datatable(techData_df,extensions='Scroller',options=list(scrollY=500,scroller=TRUE))

More Tidy

# Use a distinct name so the vector does not shadow the colnames() function
newColnames <- c('LAST_UPDATED', 'PUBLISHED_DATE', 'WEBSITE_SECTION', 'WEBSITE_SUBSECTION', 'TITLE', 'ABSTRACT', 'URL', 'AUTHOR', 'TAGS')

colnames(techData_df) <- newColnames

datatable(techData_df,extensions='Scroller',options=list(scrollY=500,scroller=TRUE))
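Assigning the new names by position works only as long as the column order in the `select()` call never changes. A more defensive variant, sketched here in base R on a toy data frame with made-up values, maps each old name to its new name explicitly.

```r
# Toy data frame using two of the original column names (made-up values).
toy <- data.frame(results.title  = "Story A",
                  results.byline = "By Jane Doe",
                  stringsAsFactors = FALSE)

# Old name -> new name lookup; the order of entries no longer matters.
lookup <- c(results.byline = "AUTHOR", results.title = "TITLE")

# Find where each old name sits and overwrite it with the new one.
idx <- match(names(lookup), names(toy))
names(toy)[idx] <- unname(lookup)
names(toy)
```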

Analyze

Now that we have the data tidied up, let's do some visualization.

finalData <- unnest(techData_df, TAGS)
finalData
## # A tibble: 141 x 9
##    LAST_UPDATED  PUBLISHED_DATE  WEBSITE_SECTION WEBSITE_SUBSECT~ TITLE ABSTRACT
##    <chr>         <chr>           <chr>           <chr>            <chr> <chr>   
##  1 2021-10-23T1~ 2021-10-23T15:~ technology      ""               In I~ Interna~
##  2 2021-10-23T1~ 2021-10-23T15:~ technology      ""               In I~ Interna~
##  3 2021-10-23T1~ 2021-10-23T15:~ technology      ""               In I~ Interna~
##  4 2021-10-23T1~ 2021-10-23T15:~ technology      ""               In I~ Interna~
##  5 2021-10-23T1~ 2021-10-23T15:~ technology      ""               In I~ Interna~
##  6 2021-10-23T1~ 2021-10-23T15:~ technology      ""               In I~ Interna~
##  7 2021-10-23T1~ 2021-10-23T15:~ technology      ""               In I~ Interna~
##  8 2021-10-23T1~ 2021-10-23T15:~ technology      ""               In I~ Interna~
##  9 2021-10-23T1~ 2021-10-23T15:~ technology      ""               In I~ Interna~
## 10 2021-10-23T1~ 2021-10-23T15:~ technology      ""               In I~ Interna~
## # ... with 131 more rows, and 3 more variables: URL <chr>, AUTHOR <chr>,
## #   TAGS <chr>
finalCounts <- as.data.frame(sort(table(finalData$TAGS), decreasing = TRUE))
colnames(finalCounts) <- c('Tag', 'Frequency')
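The printed tibble repeats the same title across many rows because `unnest()` emits one row per article-tag pair. The toy example below (invented stories and tags) illustrates that expansion and an alternative way to tally the tags with `dplyr::count()`; it is a sketch, not a replacement for the `table()` call above.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Two made-up stories; TAGS is a list column like the one in techData_df.
toy <- tibble(
  TITLE = c("Story A", "Story B"),
  TAGS  = list(c("Tag 1", "Tag 2"), "Tag 1")
)

# One row per (story, tag) pair: Story A appears twice.
toy_long <- unnest(toy, cols = TAGS)

# Tally the tags, most frequent first.
toy_counts <- count(toy_long, TAGS, sort = TRUE, name = "Frequency")
toy_counts
```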

Plot

top_n(finalCounts, n = 10, Frequency) %>%
  ggplot(aes(x = Tag, y = Frequency)) +
  geom_bar(stat = 'identity') +
  ggtitle("Top Tags") +
  xlab("Tags") + ylab("# of articles") +
  theme(axis.text.x = element_text(angle = 90))
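One detail worth knowing about `top_n()` is that it keeps ties, so a request for 10 tags can return more than 10 bars when several tags share the cutoff frequency. The toy counts below are invented to show the difference from `slice_max()` with `with_ties = FALSE`, which is one way to cap the result at exactly `n` rows.

```r
library(dplyr)

# Made-up counts with a tie at Frequency == 3.
toy_counts <- data.frame(Tag = c("a", "b", "c", "d"),
                         Frequency = c(5, 3, 3, 1))

# top_n() keeps both tied rows, so asking for 2 yields 3.
with_ties <- top_n(toy_counts, n = 2, Frequency)

# slice_max() can drop ties to return exactly n rows.
no_ties <- slice_max(toy_counts, Frequency, n = 2, with_ties = FALSE)

nrow(with_ties)
nrow(no_ties)
```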

ggplot(techData_df, aes(x = AUTHOR)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90))
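The author chart counts raw byline strings. A numeric check of the same tally can be sketched in base R as below, assuming bylines arrive in the form "By <name>" and may be empty for some articles; the bylines here are made up, and shared bylines are not split into individual authors.

```r
# Made-up bylines in the assumed "By <name>" form.
bylines <- c("By Jane Doe", "By John Roe", "By Jane Doe", "")

# Strip the leading "By " and drop articles with no byline.
authors <- sub("^By ", "", bylines)
authors <- authors[nzchar(authors)]

# Articles per author, most prolific first.
author_counts <- sort(table(authors), decreasing = TRUE)
author_counts
```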

Conclusions

The most frequent tag among the top technology stories is "Computers and the Internet", followed by "Social Media". Shira Ovide wrote the most articles.