Overview

In this assignment, we will be pulling data from the NY Times Most Popular API.

1. Collecting the data

We’ll be getting data for the most emailed, most shared, and most viewed articles. We’ll also attempt to build an attribution model between these three, assuming that shares and emails lead to more views.

Importing libraries:

library(jsonlite)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

Creating a list of endpoints for our data collection. Each endpoint covers the last 30 days.

api_urls <- list(
  email = "https://api.nytimes.com/svc/mostpopular/v2/emailed/30.json?api-key=",
  share = "https://api.nytimes.com/svc/mostpopular/v2/shared/30.json?api-key=", # nolint: line_length_linter.
  views = "https://api.nytimes.com/svc/mostpopular/v2/viewed/30.json?api-key="
)
nyt_data <- list(
  email = fromJSON(paste(api_urls$email, nyt_api_key, sep = "")),
  share = fromJSON(paste(api_urls$share, nyt_api_key, sep = "")),
  views = fromJSON(paste(api_urls$views, nyt_api_key, sep = ""))
)
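The chunk above relies on nyt_api_key, which is not defined anywhere in the text shown. Below is a minimal sketch of one way it could be supplied. It assumes the key lives in an environment variable called NYT_API_KEY (a hypothetical name), and build_url() is an illustrative helper of mine, not part of the original code. The allowed periods (1, 7, 30) reflect my reading of the API docs.

```r
# Assumption: the key is stored in an environment variable named NYT_API_KEY.
# The original write-up does not say where nyt_api_key comes from.
nyt_api_key <- Sys.getenv("NYT_API_KEY")
if (!nzchar(nyt_api_key)) {
  message("NYT_API_KEY is not set; requests to the API will be rejected.")
}

# Hypothetical helper: build a Most Popular endpoint URL.
# list_type is one of "emailed", "shared", "viewed"; period is in days.
build_url <- function(list_type, period = 30, key = nyt_api_key) {
  stopifnot(list_type %in% c("emailed", "shared", "viewed"))
  stopifnot(period %in% c(1, 7, 30))
  paste0(
    "https://api.nytimes.com/svc/mostpopular/v2/",
    list_type, "/", period, ".json?api-key=", key
  )
}
```

Keeping the key out of the document itself also means it won’t end up in the knitted output.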

Let’s take a look at the fields we get back from this API:

names(nyt_data$email)
## [1] "status"      "copyright"   "num_results" "results"
names(nyt_data$share$results)
##  [1] "uri"            "url"            "id"             "asset_id"      
##  [5] "source"         "published_date" "updated"        "section"       
##  [9] "subsection"     "nytdsection"    "adx_keywords"   "column"        
## [13] "byline"         "type"           "title"          "abstract"      
## [17] "des_facet"      "org_facet"      "per_facet"      "geo_facet"     
## [21] "media"          "eta_id"

We can see that in the results part of the response we have what seems to be our data. Let’s just keep that part:

email <- nyt_data$email$results
share <- nyt_data$share$results
views <- nyt_data$views$results

names(share)
##  [1] "uri"            "url"            "id"             "asset_id"      
##  [5] "source"         "published_date" "updated"        "section"       
##  [9] "subsection"     "nytdsection"    "adx_keywords"   "column"        
## [13] "byline"         "type"           "title"          "abstract"      
## [17] "des_facet"      "org_facet"      "per_facet"      "geo_facet"     
## [21] "media"          "eta_id"

At this point I was confused, so I closely inspected the available API docs. It seems that I won’t be able to get any quantifiable information (actual counts) regarding views, emails, and shares. Simply put, I am only able to get a ranked list of the most shared, emailed, and viewed stories.

On the bright side, it was really nice that the API documentation came through as a YAML file.

Since we cannot build an attribution model, let’s index each article and see how the relative ranks compare, as well as see if there are any trends based on geography for the highest-ranked items. We’ll call this scoring of ranks a “geographic relevancy”.

To do so, we’ll create the rank at the article level (taking the row order of each response as the rank) and then use unnest_longer() to expand the geographies into one row per article and geography:

email <- email |>
  mutate(
    email_rank = row_number()
  ) |>
  select(
    email_rank,
    title,
    section,
    subsection,
    id,
    geo_facet
  )

share <- share |>
  mutate(
    share_rank = row_number()
  ) |>
  select(
    share_rank,
    title,
    section,
    subsection,
    id,
    geo_facet
  )

views <- views |>
  mutate(
    view_rank = row_number()
  ) |>
  select(
    view_rank,
    title,
    section,
    subsection,
    id,
    geo_facet
  )

email <- unnest_longer(email, geo_facet)

share <- unnest_longer(share, geo_facet)

views <- unnest_longer(views, geo_facet)

replace_rank <- 1 + max(
  c(
    max(email$email_rank),
    max(share$share_rank),
    max(views$view_rank)
  )
)
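The three rank-and-select pipelines above are identical except for the name of the rank column. As a sketch only, they could be folded into a single helper. rank_and_unnest() and the toy data are hypothetical names of mine, not part of the original analysis, and the toy tibble only mimics the shape of the API results.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: add a rank column (row order = rank), keep the
# columns used in this analysis, and expand geo_facet to one row per geography.
rank_and_unnest <- function(articles, rank_col) {
  articles |>
    mutate("{rank_col}" := row_number()) |>
    select(all_of(c(rank_col, "title", "section", "subsection", "id", "geo_facet"))) |>
    unnest_longer(geo_facet)
}

# Toy input with the same columns as the API results (made-up values).
toy <- tibble(
  title = c("Article A", "Article B"),
  section = c("U.S.", "World"),
  subsection = c("", ""),
  id = c(1, 2),
  geo_facet = list(c("Florida", "Ohio"), "Russia")
)

toy_ranked <- rank_and_unnest(toy, "email_rank")
```

With the real data, the calls would look like rank_and_unnest(nyt_data$email$results, "email_rank"), and similarly for shares and views.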

Now, let’s compare the rankings. First, let’s see where each geography shows up in each of the three lists.

To do so, and since the same geography can show up in multiple articles, I’ve decided to create a score for each geography where the lower the score, the better. This score is calculated by grouping on geo_facet and then taking the minimum (best) rank within each list.

share_scored <- share |>
  group_by(geo_facet) |>
  summarise(share_score = min(share_rank)) |>
  select(geo_facet, share_score)

email_scored <- email |>
  group_by(geo_facet) |>
  summarise(email_score = min(email_rank)) |>
  select(geo_facet, email_score)

views_scored <- views |>
  group_by(geo_facet) |>
  summarise(view_score = min(view_rank)) |>
  select(geo_facet, view_score)

One important consideration is the NAs. It is entirely possible that one of the most shared or emailed articles mentions a location that isn’t in any of the most viewed articles. To account for this, we will fill the NAs in the score columns with replace_rank, which is 21 in this run. 21 was chosen because each dataset was observed to have only 20 articles, so 21 is one rank worse than the last place on any list.

all_scores <- share_scored |>
  full_join(
    email_scored,
    by = "geo_facet"
  ) |>
  full_join(
    views_scored,
    by = "geo_facet"
  ) |>
  replace_na(
    list(
      email_score = replace_rank,
      share_score = replace_rank,
      view_score = replace_rank
    )
  ) |>
  mutate(
    total_score = share_score + email_score + view_score
  ) |>
  arrange(total_score)

print(all_scores, n = nrow(all_scores))
## # A tibble: 33 × 5
##    geo_facet                    share_score email_score view_score total_score
##    <chr>                              <int>       <int>      <int>       <int>
##  1 United States                          9          17          3          29
##  2 Florida                               21           3          5          29
##  3 Bronx (NYC)                            6           4         20          30
##  4 Colorado                               4          21         10          35
##  5 Gaza Strip                            14           1         21          36
##  6 Israel                                14           1         21          36
##  7 Great Britain                         11          17          9          37
##  8 Russia                                 3          17         21          41
##  9 Alabama                               13          21         10          44
## 10 Central Park (Manhattan, NY)          21           2         21          44
## 11 Sarasota (Fla)                        21           3         21          45
## 12 Maine                                 21          15         10          46
## 13 South Carolina                        21          21          6          48
## 14 Crimea (Ukraine)                      11          17         21          49
## 15 Ohio                                   7          21         21          49
## 16 Ukraine                               11          17         21          49
## 17 Europe                                21           9         21          51
## 18 Alta (Utah)                           21          10         21          52
## 19 Alaska                                21          21         10          52
## 20 Arkansas                              21          21         10          52
## 21 California                            21          21         10          52
## 22 Massachusetts                         21          21         10          52
## 23 Minnesota                             21          21         10          52
## 24 North Carolina                        21          21         10          52
## 25 Texas                                 21          21         10          52
## 26 Utah                                  21          21         10          52
## 27 Vermont                               21          21         10          52
## 28 Virginia                              21          21         10          52
## 29 Michigan                              21          21         11          53
## 30 Santa Fe (NM)                         21          21         12          54
## 31 East Amwell (NJ)                      21          21         14          56
## 32 New York State                        19          21         21          61
## 33 Worcester (Mass)                      21          19         21          61

This type of comparison is expected to have different results every run, so in order to discuss the specific results of my run (March 19th, 2024 at around 11PM) I’ve included a screenshot of my results below:

[Screenshot of results: “big screenshot, bigger dreams”]

From here we can see that the United States had the best combined score, tied with Florida at 29, which makes Florida the second most “relevant” geography as listed. What’s interesting about Florida’s placement is that it did not make it into the top 20 shared articles but is in the third most emailed one. I’m interpreting this to mean that those who share articles via email are more likely to share articles about Florida. Given the current news climate, it’s also interesting that the most emailed geographies (the Gaza Strip and Israel) rank only 14th by shares and are not even among the 20 most viewed articles in the last 30 days. This suggests that the overall impact that emails have on views is low, although ranked lists alone can’t establish that.
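One rough way to check how much the lists overlap is to count the geographies that appear on both the emailed (or shared) list and the viewed list. The sketch below uses the scores transcribed by hand from the table printed above, and treats a score of 21 as “not on that list”. It is only a sanity check on this one run, not an attribution model.

```r
# Scores transcribed from the all_scores table above (rows 1 to 33, in order).
share_score <- c(9, 21, 6, 4, 14, 14, 11, 3, 13, 21, 21, 21, 21, 11, 7, 11,
                 rep(21, 15), 19, 21)
email_score <- c(17, 3, 4, 21, 1, 1, 17, 17, 21, 2, 3, 15, 21, 17, 21, 17, 9, 10,
                 rep(21, 14), 19)
view_score <- c(3, 5, 20, 10, 21, 21, 9, 21, 10, 21, 21, 10, 6, 21, 21, 21, 21, 21,
                rep(10, 10), 11, 12, 14, 21, 21)

# Assumption: 21 is the fill value (replace_rank) used for this run.
missing_rank <- 21

on_email <- email_score < missing_rank
on_share <- share_score < missing_rank
on_views <- view_score < missing_rank

# How many geographies are on each list, and how many are also on the viewed list.
overlap <- c(
  emailed = sum(on_email),
  emailed_and_viewed = sum(on_email & on_views),
  shared = sum(on_share),
  shared_and_viewed = sum(on_share & on_views)
)
print(overlap)
```

Because the fill value is arbitrary, counting overlaps like this is less sensitive to it than adding the scores together.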

Another thing to note is that no geography has a share_score or view_score of 1, so the number one article on those lists is not represented. Inspecting further, I found that unnest_longer() completely removes rows whose geo_facet array is empty, so articles without any geography tags drop out of the analysis.

Lastly, there were many geographies sharing a view_score of 10. They all come from a single article, the one with the Super Tuesday results for the states listed.
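The row-dropping behavior of unnest_longer() can be reproduced on a small toy example. This is a sketch with made-up data, and the keep_empty argument assumes a reasonably recent tidyr (1.3.0 or later, as far as I can tell).

```r
library(tidyr)

# Toy data: the rank 1 article has no geography tags at all.
toy <- tibble(
  rank = 1:3,
  geo_facet = list(character(0), c("Florida", "Sarasota (Fla)"), "Ohio")
)

# Default behavior: the row with the empty geo_facet disappears.
dropped <- unnest_longer(toy, geo_facet)

# keep_empty = TRUE keeps that row, with NA as its geography.
kept <- unnest_longer(toy, geo_facet, keep_empty = TRUE)
```

If the untagged articles should stay in the analysis, keep_empty = TRUE would keep them under an NA geography, which could then be labeled something like “No geography”.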

Conclusion

In conclusion, despite the limitations that kept us from building an attribution model, our analysis provided some insight into the relative rankings and geographic relevance of NY Times articles. We observed notable trends, such as the United States having the best combined score (tied with Florida), and that shares and emails don’t seem to have a great impact on whether an article ends up among the most viewed.