In this assignment, we will pull data from the NY Times Most Popular API.
We’ll retrieve the most emailed, shared, and viewed articles. We’ll also attempt to build an attribution model between these three, assuming that shares and emails lead to more views.
Importing libraries:
library(jsonlite)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
Creating a list of endpoints for our data collection. Each endpoint covers the last 30 days. The API key is stored in nyt_api_key, which is defined outside the code shown here.
api_urls <- list(
  email = "https://api.nytimes.com/svc/mostpopular/v2/emailed/30.json?api-key=",
  share = "https://api.nytimes.com/svc/mostpopular/v2/shared/30.json?api-key=", # nolint: line_length_linter.
  views = "https://api.nytimes.com/svc/mostpopular/v2/viewed/30.json?api-key="
)
nyt_data <- list(
  email = fromJSON(paste0(api_urls$email, nyt_api_key)),
  share = fromJSON(paste0(api_urls$share, nyt_api_key)),
  views = fromJSON(paste0(api_urls$views, nyt_api_key))
)
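As a side note, the three endpoint URLs differ only in one path segment, so they could be generated by a small helper. The sketch below is only an illustration: build_nyt_url() is a hypothetical function, and reading the key from an environment variable named NYT_API_KEY is an assumption, not something the original code does. No request is sent.

```r
# Hypothetical helper (not part of the original analysis): builds a
# Most Popular endpoint URL in the same shape as the URLs listed above.
# NYT_API_KEY is an assumed environment variable name.
build_nyt_url <- function(kind, period = 30,
                          api_key = Sys.getenv("NYT_API_KEY")) {
  # Only the three list types used in this report are accepted
  stopifnot(kind %in% c("emailed", "shared", "viewed"))
  paste0(
    "https://api.nytimes.com/svc/mostpopular/v2/",
    kind, "/", period, ".json?api-key=", api_key
  )
}

# Example with a placeholder key; nothing is downloaded here
demo_url <- build_nyt_url("emailed", api_key = "DEMO_KEY")
print(demo_url)
```

Keeping the key in an environment variable rather than in the script is one common way to avoid committing it alongside the report.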
Let’s take a look at the fields we get back from this API:
names(nyt_data$email)
## [1] "status" "copyright" "num_results" "results"
names(nyt_data$share$results)
## [1] "uri" "url" "id" "asset_id"
## [5] "source" "published_date" "updated" "section"
## [9] "subsection" "nytdsection" "adx_keywords" "column"
## [13] "byline" "type" "title" "abstract"
## [17] "des_facet" "org_facet" "per_facet" "geo_facet"
## [21] "media" "eta_id"
We can see that the results part of the response holds what seems to be our data. Let’s keep just that part:
email <- nyt_data$email$results
share <- nyt_data$share$results
views <- nyt_data$views$results
names(share)
## [1] "uri" "url" "id" "asset_id"
## [5] "source" "published_date" "updated" "section"
## [9] "subsection" "nytdsection" "adx_keywords" "column"
## [13] "byline" "type" "title" "abstract"
## [17] "des_facet" "org_facet" "per_facet" "geo_facet"
## [21] "media" "eta_id"
At first I was confused, but after closely inspecting the available API docs it seems that I won’t be able to get any quantifiable information on views, emails, and shares. Simply put, I am only able to get a ranked list of the most shared, emailed, and viewed stories.
That said, it was really nice that the API doc came through as a YAML file.
Since we cannot build an attribution model, let’s rank each article within its list, compare the relative ranks, and look for trends by geography among the highest-ranked items. We’ll call this rank-based score “geographic relevancy”.
To do so, we’ll create the rank at the article level and then use unnest_longer() to get one row per geography:
email <- email |>
  mutate(
    email_rank = row_number()
  ) |>
  select(
    email_rank,
    title,
    section,
    subsection,
    id,
    geo_facet
  )
share <- share |>
  mutate(
    share_rank = row_number()
  ) |>
  select(
    share_rank,
    title,
    section,
    subsection,
    id,
    geo_facet
  )
views <- views |>
  mutate(
    view_rank = row_number()
  ) |>
  select(
    view_rank,
    title,
    section,
    subsection,
    id,
    geo_facet
  )
# One row per (article, geography) pair
email <- unnest_longer(email, geo_facet)
share <- unnest_longer(share, geo_facet)
views <- unnest_longer(views, geo_facet)
# Fill value for geographies missing from a list: one worse than the worst rank
replace_rank <- 1 + max(
  email$email_rank,
  share$share_rank,
  views$view_rank
)
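The three ranking steps above are nearly identical, so they could be wrapped in one function. The following is a minimal sketch, assuming a hypothetical helper add_rank() and toy data standing in for one API result set. It is not the code used for the results in this report.

```r
library(dplyr)

# Hypothetical helper (not in the original): adds a rank column whose name
# is given by rank_name, then keeps only the columns used later on.
add_rank <- function(articles, rank_name) {
  articles |>
    mutate("{rank_name}" := row_number()) |>
    select(all_of(
      c(rank_name, "title", "section", "subsection", "id", "geo_facet")
    ))
}

# Toy data standing in for one result set; values are made up
toy <- tibble(
  title = c("Article A", "Article B"),
  section = c("U.S.", "World"),
  subsection = c("", ""),
  id = c(1, 2),
  geo_facet = list("Florida", c("Israel", "Gaza Strip")),
  byline = c("x", "y")
)

ranked <- add_rank(toy, "email_rank")
print(ranked)
```

With such a helper, each list would need a single call, for example add_rank(email, "email_rank").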
Now, let’s compare the lists. First, let’s see where each geography shows up in the respective rankings.
Since a geography can show up in multiple articles, I’ve decided to create an index where the lower the score, the better. This index is calculated by grouping on geo_facet and then taking the minimum (best) rank within each list.
share_scored <- share |>
  group_by(geo_facet) |>
  summarise(share_score = min(share_rank)) |>
  select(geo_facet, share_score)
email_scored <- email |>
  group_by(geo_facet) |>
  summarise(email_score = min(email_rank)) |>
  select(geo_facet, email_score)
views_scored <- views |>
  group_by(geo_facet) |>
  summarise(view_score = min(view_rank)) |>
  select(geo_facet, view_score)
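To make the scoring concrete, here is a small sketch on made-up data (the toy_share table and its values are assumptions for illustration only). A geography that appears in several articles receives the rank of its best-placed article.

```r
library(dplyr)

# Made-up ranks: Ohio appears in the articles ranked 1 and 2,
# Russia in the articles ranked 2 and 3
toy_share <- tibble(
  share_rank = c(1L, 2L, 2L, 3L),
  geo_facet = c("Ohio", "Russia", "Ohio", "Russia")
)

# Best (lowest) rank per geography
toy_scored <- toy_share |>
  group_by(geo_facet) |>
  summarise(share_score = min(share_rank))

print(toy_scored)
```

Here Ohio ends up with a score of 1 and Russia with a score of 2.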
One important consideration is the NAs. It is entirely possible that one of the most shared or emailed articles mentions a location that does not appear in any of the most viewed articles. To account for this, we fill the NAs in the score columns with replace_rank, which is one more than the worst observed rank. Here that value is 21, since each dataset was observed to have only 20 articles.
all_scores <- share_scored |>
  full_join(
    email_scored,
    by = "geo_facet"
  ) |>
  full_join(
    views_scored,
    by = "geo_facet"
  ) |>
  replace_na(
    list(
      email_score = replace_rank,
      share_score = replace_rank,
      view_score = replace_rank
    )
  ) |>
  mutate(
    total_score = share_score + email_score + view_score
  ) |>
  arrange(total_score)
print(all_scores, n = nrow(all_scores))
## # A tibble: 33 × 5
## geo_facet share_score email_score view_score total_score
## <chr> <int> <int> <int> <int>
## 1 United States 9 17 3 29
## 2 Florida 21 3 5 29
## 3 Bronx (NYC) 6 4 20 30
## 4 Colorado 4 21 10 35
## 5 Gaza Strip 14 1 21 36
## 6 Israel 14 1 21 36
## 7 Great Britain 11 17 9 37
## 8 Russia 3 17 21 41
## 9 Alabama 13 21 10 44
## 10 Central Park (Manhattan, NY) 21 2 21 44
## 11 Sarasota (Fla) 21 3 21 45
## 12 Maine 21 15 10 46
## 13 South Carolina 21 21 6 48
## 14 Crimea (Ukraine) 11 17 21 49
## 15 Ohio 7 21 21 49
## 16 Ukraine 11 17 21 49
## 17 Europe 21 9 21 51
## 18 Alta (Utah) 21 10 21 52
## 19 Alaska 21 21 10 52
## 20 Arkansas 21 21 10 52
## 21 California 21 21 10 52
## 22 Massachusetts 21 21 10 52
## 23 Minnesota 21 21 10 52
## 24 North Carolina 21 21 10 52
## 25 Texas 21 21 10 52
## 26 Utah 21 21 10 52
## 27 Vermont 21 21 10 52
## 28 Virginia 21 21 10 52
## 29 Michigan 21 21 11 53
## 30 Santa Fe (NM) 21 21 12 54
## 31 East Amwell (NJ) 21 21 14 56
## 32 New York State 19 21 21 61
## 33 Worcester (Mass) 21 19 21 61
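The join-and-fill logic behind this table can be checked on a small example. The sketch below uses two made-up score tables (toy_share, toy_email) and an assumed fill value of 21. It only illustrates how a geography missing from one list is penalized, and it is not a rerun of the analysis.

```r
library(dplyr)
library(tidyr)

# Made-up scores: Ohio is missing from the email list,
# Florida is missing from the share list
toy_share <- tibble(geo_facet = c("Ohio", "Russia"), share_score = c(7L, 3L))
toy_email <- tibble(geo_facet = c("Russia", "Florida"), email_score = c(17L, 3L))
fill_rank <- 21L # assumed: one worse than the worst possible rank of 20

toy_scores <- toy_share |>
  full_join(toy_email, by = "geo_facet") |>
  replace_na(list(share_score = fill_rank, email_score = fill_rank)) |>
  mutate(total_score = share_score + email_score) |>
  arrange(total_score)

print(toy_scores)
```

Russia (3 + 17 = 20) comes first, followed by Florida (21 + 3 = 24) and Ohio (7 + 21 = 28), so appearing in only one list costs a geography the full fill value in the other.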
This type of comparison is expected to give different results on every run, so in order to discuss the specific results of my run (March 19th, 2024, at around 11 PM) I’ve included a screenshot of my results below:
From here we can see that the United States had the best (lowest) total score, tied with Florida at 29, which makes these the two most “relevant” geographies. What is interesting about Florida’s placement is that it did not make it into the top 20 shared articles but ranks third among the most emailed. I’m interpreting this to mean that those who share articles via email are more likely to share articles about Florida. Given the current climate, it’s also interesting that the most emailed geographies (Gaza Strip and Israel, both at email rank 1) rank only 14th in shares and are not among the 20 most viewed articles in the last 30 days. This suggests that the impact emails have on views is low, at least in this snapshot.
Another thing to note is that the number one article by share_score and view_score is not present. Inspecting further, I found that unnest_longer() drops rows whose geo_facet array is empty (its default, keep_empty = FALSE), so articles with no tagged geography were removed completely.
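The disappearance of articles during unnesting can be reproduced on toy data. The sketch below assumes a recent tidyr version in which unnest_longer() has a keep_empty argument. The toy table and its values are made up, and an empty character vector stands in for an article without any tagged geography.

```r
library(tidyr)
library(tibble)

# Made-up data: article 2 has no geography attached
toy <- tibble(
  id = 1:3,
  geo_facet = list(c("Israel", "Gaza Strip"), character(0), "Florida")
)

# Default behavior: the row with the empty element disappears
dropped <- unnest_longer(toy, geo_facet)

# keep_empty = TRUE keeps it, with NA as its geography
kept <- unnest_longer(toy, geo_facet, keep_empty = TRUE)

print(dropped)
print(kept)
```

Passing keep_empty = TRUE for email, share, and views would retain those articles with an NA geography, which would then need its own handling in the scoring step.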
Lastly, many geographies share a view_score of 10. They all come from the same article, which covered the Super Tuesday results for the states listed.
In conclusion, despite the limitations that kept us from building an attribution model, our analysis provided insights into the relative rankings and geographic relevance of NY Times articles. We observed notable trends, such as the United States emerging as the top-scoring geography (tied with Florida), and that shares and emails don’t seem to have a great impact on whether an article ranks among the most viewed.