Data 607 Assignment 9

Overview / Introduction

For this assignment, we’ve been tasked with choosing an API from the New York Times Developer site, constructing an interface in R to read in the JSON data, and transforming it into an R data frame that can be used for analysis. Of the APIs available on the NYT site, I chose the Most Popular Articles API and looked into the articles with the most views over the last 30 days.

Load Packages

As always, let’s start off by loading the necessary packages.

library(tidyverse) # attaches dplyr and ggplot2, among others
library(jsonlite)
library(httr)

Import API Data

To import the API data, we first assign each part of the request URL to its own variable: the domain, the path, the parameter and period that tell the API what we are looking for, and the API key. We then pass the assembled URL to the GET() function from the httr package and store the response in api_data.

domain <- "https://api.nytimes.com"
path <- "/svc/mostpopular/v2/"
parameter <- "viewed" #possible values: emailed, shared, viewed
period <- 30 #possible values: 1, 7, 30
fragment <- ".json?api-key="
key <- "1noMabNVIXRvem1M2c2MGeFL4uwUO07J"

# paste0() already concatenates without a separator, so no sep argument is needed
api_data <- GET(paste0(domain, path, parameter, "/", period, fragment, key))
api_data

## Response [https://api.nytimes.com/svc/mostpopular/v2/viewed/30.json?api-key=1noMabNVIXRvem1M2c2MGeFL4uwUO07J]
##   Date: 2024-12-21 00:51
##   Status: 200
##   Content-Type: application/json; charset=utf-8
##   Size: 40 kB
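
Hard-coding the key works, but it also exposes the key in the knitted output, as the response line above shows. A minimal sketch of one alternative follows, assuming the key is stored in an environment variable. The variable name NYT_API_KEY and the helpers build_nyt_url() and fetch_most_popular() are hypothetical names introduced here for illustration; they are not part of the original assignment.

```r
# Hypothetical helper: assemble the Most Popular request URL from its parts.
# NYT_API_KEY is an assumed environment variable name, e.g. set in ~/.Renviron.
build_nyt_url <- function(parameter = c("viewed", "emailed", "shared"),
                          period = 30,
                          key = Sys.getenv("NYT_API_KEY")) {
  parameter <- match.arg(parameter)              # reject unknown parameters
  stopifnot(period %in% c(1, 7, 30), nzchar(key)) # reject bad periods or a missing key
  paste0("https://api.nytimes.com/svc/mostpopular/v2/",
         parameter, "/", period, ".json?api-key=", key)
}

# Hypothetical helper: send the request and fail loudly on an HTTP error.
fetch_most_popular <- function(url) {
  response <- httr::GET(url)
  httr::stop_for_status(response) # turns a 4xx/5xx status into an R error
  response
}

# Building the URL needs no network access
example_url <- build_nyt_url("viewed", 30, key = "demo-key")
example_url
```

Calling fetch_most_popular(build_nyt_url()) would then take the place of the GET() call above, and validating parameter and period up front catches typos before any request is sent.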

Transform into R Data Frame

Now that we have the data from the API, we parse the JSON body of the response and convert its results element into an R data frame using the code below.

raw_data <- fromJSON(rawToChar(api_data$content))

data_frame <- as.data.frame(raw_data$results)

glimpse(data_frame)

## Rows: 20
## Columns: 22
## $ uri            <chr> "nyt://article/e807c6e7-68eb-527f-88b2-fb163b76eb54", "…
## $ url            <chr> "https://www.nytimes.com/2024/12/06/opinion/united-heal…
## $ id             <dbl> 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14,…
## $ asset_id       <dbl> 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14,…
## $ source         <chr> "New York Times", "New York Times", "New York Times", "…
## $ published_date <chr> "2024-12-06", "2024-11-22", "2024-12-09", "2024-12-06",…
## $ updated        <chr> "2024-12-09 12:55:12", "2024-11-23 17:44:09", "2024-12-…
## $ section        <chr> "Opinion", "U.S.", "Business", "Arts", "New York", "Foo…
## $ subsection     <chr> "", "Politics", "Media", "Music", "", "", "Politics", "…
## $ nytdsection    <chr> "opinion", "u.s.", "business", "arts", "new york", "foo…
## $ adx_keywords   <chr> "Health Insurance and Managed Care;Income Inequality;Mu…
## $ column         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ byline         <chr> "By Zeynep Tufekci", "By Theodore Schleifer", "By Jonat…
## $ type           <chr> "Article", "Article", "Article", "Article", "Article", …
## $ title          <chr> "The Rage and Glee That Followed a C.E.O.’s Killing Sho…
## $ abstract       <chr> "It echoes another era of extreme inequality and extrem…
## $ des_facet      <list> <"Health Insurance and Managed Care", "Income Inequali…
## $ org_facet      <list> "UnitedHealth Group Inc", "Government Efficiency Depar…
## $ per_facet      <list> "Thompson, Brian (1974-2024)", <"Musk, Elon", "Trump, …
## $ geo_facet      <list> "United States", <>, <>, <>, "Georgia", <>, <>, "Unite…
## $ media          <list> [<data.frame[1 x 6]>], [<data.frame[1 x 6]>], [<data.f…
## $ eta_id         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
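
To see why some columns in the output above are plain character vectors while others are lists, it can help to parse a small payload by hand. The JSON below is a made-up miniature that only imitates the outer structure of the response (status, num_results, results); its field values are placeholders, not real API data.

```r
library(jsonlite)

# Made-up miniature payload imitating the outer structure of the response
sample_json <- '{
  "status": "OK",
  "num_results": 2,
  "results": [
    {"title": "First headline",  "section": "U.S.", "des_facet": ["Topic A", "Topic B"]},
    {"title": "Second headline", "section": "Arts", "des_facet": []}
  ]
}'

parsed <- fromJSON(sample_json)

# fromJSON() simplifies an array of similar JSON objects into a data frame:
# scalar fields become ordinary columns, array fields become list columns
sample_df <- parsed$results
str(sample_df, max.level = 1)
```

The same simplification is what produces the list-typed facet and media columns in the real data frame.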

Data Tidying

Let’s go ahead and tidy up the data to perform a simple analysis on article type and section. We will also keep some peripheral fields such as title, byline (author), published_date, etc. To do this we take a subset of the data and select the fields in the order we desire. The resulting data set can be seen below.

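# Select columns by name rather than position so the code survives column reordering
top_viewed_articles <- subset(data_frame, select = c(title, abstract, type, byline, published_date, section, subsection))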
glimpse(top_viewed_articles)

## Rows: 20
## Columns: 7
## $ title          <chr> "The Rage and Glee That Followed a C.E.O.’s Killing Sho…
## $ abstract       <chr> "It echoes another era of extreme inequality and extrem…
## $ type           <chr> "Article", "Article", "Article", "Article", "Article", …
## $ byline         <chr> "By Zeynep Tufekci", "By Theodore Schleifer", "By Jonat…
## $ published_date <chr> "2024-12-06", "2024-11-22", "2024-12-09", "2024-12-06",…
## $ section        <chr> "Opinion", "U.S.", "Business", "Arts", "New York", "Foo…
## $ subsection     <chr> "", "Politics", "Media", "Music", "", "", "Politics", "…
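
The selected columns are still raw strings. As a hedged sketch of some optional further tidying, the base R snippet below works on a two-row stand-in built from values visible in the output above. It converts published_date to a Date, strips the leading "By " from byline, and turns empty subsection strings into NA. None of these steps is part of the original analysis.

```r
# Two-row stand-in using values visible in the glimpse() output above
sample_articles <- data.frame(
  byline = c("By Zeynep Tufekci", "By Theodore Schleifer"),
  published_date = c("2024-12-06", "2024-11-22"),
  section = c("Opinion", "U.S."),
  subsection = c("", "Politics"),
  stringsAsFactors = FALSE
)

tidy_articles <- transform(
  sample_articles,
  byline = sub("^By ", "", byline),          # drop the "By " prefix
  published_date = as.Date(published_date),  # character -> Date
  subsection = ifelse(subsection == "", NA_character_, subsection) # "" -> NA
)

str(tidy_articles)
```

With published_date stored as a Date, the articles could be sorted or filtered chronologically without any string handling.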

Data Analysis

We start the analysis by grouping the data by type and counting the articles in each group. The result shows a clear dominance of the plain “Article” format (18 of 20) over the “Interactive” type (2 of 20), and a bar chart illustrates this further. Next, we analyze section using the same approach. When the code was first run, the most popular sections were U.S., New York, and Arts. By the time this section was written and the code rerun, the ranking had changed, with U.S. and New York dominating at six articles each. Because the API reports on a rolling 30-day window, the results may change again after this R Markdown file is published.

top_article_type <- top_viewed_articles %>% 
  group_by(type) %>% 
  summarise(
    count = n()
  )
top_article_type

## # A tibble: 2 × 2
##   type        count
##   <chr>       <int>
## 1 Article        18
## 2 Interactive     2

ggplot(data = top_viewed_articles, aes(x = type, fill = type)) +
  geom_bar()

top_section <- top_viewed_articles %>% 
  group_by(section) %>% 
  summarise(
    count = n()
  )
top_section

## # A tibble: 10 × 2
##    section     count
##    <chr>       <int>
##  1 Arts            1
##  2 Books           1
##  3 Business        1
##  4 Food            1
##  5 Magazine        1
##  6 New York        6
##  7 Opinion         1
##  8 Real Estate     1
##  9 U.S.            6
## 10 Well            1
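
The tibble above is ordered alphabetically, which makes the leaders harder to spot. As a small sketch, the base R snippet below rebuilds a vector of section labels with the same counts as the table and sorts it by frequency. The sections vector is a reconstruction for illustration, not the live data.

```r
# Section labels reproducing the counts shown in the tibble above
sections <- c(
  rep("New York", 6), rep("U.S.", 6),
  "Arts", "Books", "Business", "Food",
  "Magazine", "Opinion", "Real Estate", "Well"
)

# Frequency table sorted from most to least common
section_counts <- sort(table(sections), decreasing = TRUE)
head(section_counts, 3)

# Share of the 20 articles that fall in the two leading sections
top_share <- sum(section_counts[1:2]) / sum(section_counts)
top_share # 0.6
```

On this snapshot, U.S. and New York together account for 12 of the 20 most-viewed articles.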

ggplot(data = top_viewed_articles, aes(x = section, fill = section)) +
  geom_bar()
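
Since the results shift as the 30-day window moves, one way to keep an analysis reproducible is to save a dated snapshot of the data frame and work from that file. The sketch below uses a small stand-in data frame and a hypothetical file name. It writes to tempdir() only so that it runs anywhere; a real project would presumably pick a permanent folder.

```r
# Stand-in for the data frame built from the API response
snapshot <- data.frame(
  type = c("Article", "Interactive"),
  section = c("U.S.", "New York"),
  stringsAsFactors = FALSE
)

# Hypothetical file name that records the download date
snapshot_path <- file.path(tempdir(), paste0("most_viewed_", Sys.Date(), ".rds"))
saveRDS(snapshot, snapshot_path)

# Later runs can read the saved copy instead of calling the API again
restored <- readRDS(snapshot_path)
all.equal(restored, snapshot)
```

Rerunning the analysis against the saved file gives the same counts and charts no matter when the document is knitted.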