For this assignment, we were tasked with choosing an API from the New York Times Developer site, building an interface in R to read its JSON data, and transforming that data into an R data frame that can be used for analysis. Of the APIs available on the NYT site, I chose the Most Popular Articles API and looked at the articles with the most views within the last 30 days.
As always, let’s start off by loading the necessary packages.
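library(tidyverse) # attaches dplyr and ggplot2, so a separate library(dplyr) call is not needed
library(jsonlite)  # fromJSON() for parsing the response body
library(httr)      # GET() for sending the HTTP request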
To import the API data, we first assign each part of the URL to its
own variable, in particular the API key and the pieces that tell the
API what we are looking for: the parameter (the popularity measure)
and the period (the number of days to cover). We then pass the
assembled URL to the GET() function from the httr package and store
the response in api_data.
domain <- "https://api.nytimes.com"
path <- "/svc/mostpopular/v2/"
parameter <- "viewed" #possible values: emailed, shared, viewed
period <- 30 #possible values: 1, 7, 30
fragment <- ".json?api-key="
key <- "1noMabNVIXRvem1M2c2MGeFL4uwUO07J"
api_data <- GET(paste0(domain, path, parameter, "/", period, fragment, key)) # paste0() takes no sep argument
api_data
## Response [https://api.nytimes.com/svc/mostpopular/v2/viewed/30.json?api-key=1noMabNVIXRvem1M2c2MGeFL4uwUO07J]
## Date: 2024-12-21 00:51
## Status: 200
## Content-Type: application/json; charset=utf-8
## Size: 40 kB
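The URL assembly above can also be wrapped in a small helper that validates its inputs. The following is a minimal sketch, not the approach used in this report: the function names build_nyt_url() and fetch_most_popular() and the environment variable NYT_API_KEY are hypothetical, and the key is passed through the query argument of GET() so that it stays out of the script.

```r
library(httr)
library(jsonlite)

# Hypothetical helper: builds the Most Popular endpoint URL (without the key).
build_nyt_url <- function(parameter = c("viewed", "emailed", "shared"),
                          period = 30) {
  parameter <- match.arg(parameter)  # rejects anything but the three listed values
  stopifnot(length(period) == 1, period %in% c(1, 7, 30))
  paste0("https://api.nytimes.com/svc/mostpopular/v2/",
         parameter, "/", period, ".json")
}

# Hypothetical wrapper: sends the request and parses the JSON body.
# Assumes the key is stored in an environment variable named NYT_API_KEY.
fetch_most_popular <- function(parameter = "viewed", period = 30,
                               key = Sys.getenv("NYT_API_KEY")) {
  resp <- GET(build_nyt_url(parameter, period), query = list(`api-key` = key))
  stop_for_status(resp)  # raise an R error on a 4xx or 5xx response
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}

build_nyt_url("viewed", 30)
```

Only build_nyt_url() runs here; fetch_most_popular() is defined but not called, so the sketch makes no network request until a key is supplied.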
Now that we have the response from the API, we parse its JSON body and convert the results element into a data frame, which is easier to work with.
raw_data <- fromJSON(rawToChar(api_data$content))
data_frame <- as.data.frame(raw_data$results)
glimpse(data_frame)
## Rows: 20
## Columns: 22
## $ uri <chr> "nyt://article/e807c6e7-68eb-527f-88b2-fb163b76eb54", "…
## $ url <chr> "https://www.nytimes.com/2024/12/06/opinion/united-heal…
## $ id <dbl> 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14,…
## $ asset_id <dbl> 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14, 1e+14,…
## $ source <chr> "New York Times", "New York Times", "New York Times", "…
## $ published_date <chr> "2024-12-06", "2024-11-22", "2024-12-09", "2024-12-06",…
## $ updated <chr> "2024-12-09 12:55:12", "2024-11-23 17:44:09", "2024-12-…
## $ section <chr> "Opinion", "U.S.", "Business", "Arts", "New York", "Foo…
## $ subsection <chr> "", "Politics", "Media", "Music", "", "", "Politics", "…
## $ nytdsection <chr> "opinion", "u.s.", "business", "arts", "new york", "foo…
## $ adx_keywords <chr> "Health Insurance and Managed Care;Income Inequality;Mu…
## $ column <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ byline <chr> "By Zeynep Tufekci", "By Theodore Schleifer", "By Jonat…
## $ type <chr> "Article", "Article", "Article", "Article", "Article", …
## $ title <chr> "The Rage and Glee That Followed a C.E.O.’s Killing Sho…
## $ abstract <chr> "It echoes another era of extreme inequality and extrem…
## $ des_facet <list> <"Health Insurance and Managed Care", "Income Inequali…
## $ org_facet <list> "UnitedHealth Group Inc", "Government Efficiency Depar…
## $ per_facet <list> "Thompson, Brian (1974-2024)", <"Musk, Elon", "Trump, …
## $ geo_facet <list> "United States", <>, <>, <>, "Georgia", <>, <>, "Unite…
## $ media <list> [<data.frame[1 x 6]>], [<data.frame[1 x 6]>], [<data.f…
## $ eta_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
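To see why some columns come back as character vectors and others as lists, it can help to run fromJSON() on a small hand-written string. This is only a sketch: toy_body is a made-up stand-in that mimics the shape of the response, not the real payload, and the names toy_body and toy_df are hypothetical.

```r
library(jsonlite)

# Toy stand-in for the response body; the real payload has 20 results and 22 fields.
toy_body <- '{
  "results": [
    {"section": "Opinion", "published_date": "2024-12-06",
     "des_facet": ["Health Insurance and Managed Care", "Income Inequality"]},
    {"section": "U.S.", "published_date": "2024-11-22",
     "des_facet": []}
  ]
}'

parsed <- fromJSON(toy_body)  # an array of objects is simplified to a data frame
toy_df <- parsed$results

# Scalar fields become atomic columns; array fields become list columns.
str(toy_df, max.level = 1)

# Dates arrive as character strings, so convert them before doing date arithmetic.
toy_df$published_date <- as.Date(toy_df$published_date)
class(toy_df$published_date)
```

The same conversion could be applied to published_date in the real data frame if the analysis needed to sort or filter by date.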
Let’s tidy up the data to perform a simple analysis on article type
and section. We will also keep some peripheral fields such as title,
byline (author), and published_date. To do this, we take a subset of
the data, selecting the fields by column position in the order we
want. The resulting data set can be seen below.
top_viewed_articles <- subset(data_frame, select = c(15:16, 14, 13, 6, 8:9))
glimpse(top_viewed_articles)
## Rows: 20
## Columns: 7
## $ title <chr> "The Rage and Glee That Followed a C.E.O.’s Killing Sho…
## $ abstract <chr> "It echoes another era of extreme inequality and extrem…
## $ type <chr> "Article", "Article", "Article", "Article", "Article", …
## $ byline <chr> "By Zeynep Tufekci", "By Theodore Schleifer", "By Jonat…
## $ published_date <chr> "2024-12-06", "2024-11-22", "2024-12-09", "2024-12-06",…
## $ section <chr> "Opinion", "U.S.", "Business", "Arts", "New York", "Foo…
## $ subsection <chr> "", "Politics", "Media", "Music", "", "", "Politics", "…
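Selecting by position works as long as the API keeps returning its fields in the same order. A name-based selection is a more defensive alternative; on the real data it would be select(title, abstract, type, byline, published_date, section, subsection), which corresponds to positions c(15:16, 14, 13, 6, 8:9) in the glimpse() output above. The sketch below shows the idea on a toy data frame; toy_df, wanted, and toy_selected are hypothetical names.

```r
library(dplyr)

# Toy data frame with a few of the response's column names, in a scrambled order.
toy_df <- data.frame(
  section = c("Opinion", "U.S."),
  type = c("Article", "Article"),
  title = c("First title", "Second title"),
  source = c("New York Times", "New York Times")
)

wanted <- c("title", "type", "section")

# all_of() raises an error if a requested column is missing, instead of
# silently selecting whatever happens to sit at a given position.
toy_selected <- toy_df %>% select(all_of(wanted))
names(toy_selected)
```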
We start the analysis by grouping the data by type and counting the
rows in each group. The counts show that plain articles (18 of 20)
clearly outnumber the interactive type (2 of 20), and a bar chart makes
the gap easy to see. Next, we analyze section using the same approach.
When first run, the most popular sections were U.S., New York, and
Arts. However, while I was writing this section and rerunning the code,
the results changed, and U.S. and New York now lead with six articles
each. The results may therefore change again after this RMD file is
published.
top_article_type <- top_viewed_articles %>%
  group_by(type) %>%
  summarise(
    count = n()
  )
top_article_type
## # A tibble: 2 × 2
## type count
## <chr> <int>
## 1 Article 18
## 2 Interactive 2
ggplot(data = top_viewed_articles, aes(x = type, fill = type)) +
  geom_bar()
top_section <- top_viewed_articles %>%
  group_by(section) %>%
  summarise(
    count = n()
  )
top_section
## # A tibble: 10 × 2
## section count
## <chr> <int>
## 1 Arts 1
## 2 Books 1
## 3 Business 1
## 4 Food 1
## 5 Magazine 1
## 6 New York 6
## 7 Opinion 1
## 8 Real Estate 1
## 9 U.S. 6
## 10 Well 1
ggplot(data = top_viewed_articles, aes(x = section, fill = section)) +
  geom_bar()
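The group_by() and summarise() pattern used above can be shortened with count(), and the bars can be ordered by size so the leading sections stand out. This is a sketch of one possible variation, not the code that produced the charts in this report; the section labels are rebuilt from the counts printed above, and the names articles, section_counts, and section_plot are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Section labels rebuilt from the counts printed above (20 articles in total).
sections <- rep(
  c("Arts", "Books", "Business", "Food", "Magazine",
    "New York", "Opinion", "Real Estate", "U.S.", "Well"),
  times = c(1, 1, 1, 1, 1, 6, 1, 1, 6, 1)
)
articles <- data.frame(section = sections)

# count() is shorthand for group_by() + summarise(n()); sort = TRUE puts the
# largest groups first.
section_counts <- articles %>% count(section, name = "count", sort = TRUE)

# reorder() arranges the bars by count rather than alphabetically, and
# coord_flip() keeps the section labels readable.
section_plot <- ggplot(section_counts,
                       aes(x = reorder(section, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "section", y = "count")
```

Printing section_plot draws the chart; geom_col() is used instead of geom_bar() because the counts have already been computed.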
While writing this report, I watched the API data change between runs,
which shifted the analysis results slightly. That illustrates a benefit
of pulling data from an API rather than from a static source: the
analysis reflects current data. The httr and jsonlite packages made it
straightforward to import data like that provided on the New York Times
Developer site, and the small amount of code needed to build the data
frame left more time for tidying and analyzing the data. Importing,
tidying, and analyzing API data is a skill all data professionals
should become comfortable with.
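Because the endpoint's results shift between runs, one way to keep a given analysis reproducible is to save a dated snapshot of the data frame and reload it later. The sketch below is an assumption about workflow rather than part of the original analysis: the file naming scheme and the toy data frame snapshot_df are hypothetical, and the file is written to a temporary directory.

```r
# Toy stand-in for top_viewed_articles.
snapshot_df <- data.frame(
  type = c("Article", "Interactive"),
  section = c("U.S.", "New York")
)

# Stamp the file name with the retrieval date so later runs do not overwrite it.
snapshot_path <- file.path(
  tempdir(),
  paste0("most_viewed_", format(Sys.Date(), "%Y-%m-%d"), ".rds")
)

saveRDS(snapshot_df, snapshot_path)  # write the snapshot to disk
restored <- readRDS(snapshot_path)   # read it back in a later session
```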