Introduction

For this project, I imported data from the “Most Popular” API provided by the New York Times. The data set covers the most popular articles over the past 30 days, based on the number of times they were emailed, shared, or viewed.

After a number of pre-processing steps, I analyzed the data to answer the following questions:

  1. Which writers generate the most favorited bylines?
  2. Which sections and subsections of the paper generated the most popular articles?
  3. Which topics were the most popular?
  4. Were there articles that were popular across the emailed, viewed, and shared lists?

Setup

#rm(list=ls())

knitr::opts_chunk$set(echo = TRUE)

library(httr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(jsonlite)
## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:purrr':
## 
##     flatten
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(ggwordcloud)
library(dotenv)

Create Common Dataframe

Create a single data frame that combines the data from the three sub-APIs (viewed, emailed, and shared), and add a new column, “fav_category”, that identifies which API each row came from.

viewed_df <- viewed_results %>% 
  select(c(5,6,8,9,11,13,15)) %>%
  mutate(fav_category = "Viewed")

emailed_df <- emailed_results %>% 
  select(c(5,6,8,9,11,13,15)) %>%
  mutate(fav_category = "Emailed")

shared_df <- shared_results %>% 
  select(c(5,6,8,9,11,13,15)) %>%
  mutate(fav_category = "Shared")

# Stack the three data frames in one call (same row order as before)
common_df <- rbind(viewed_df, shared_df, emailed_df)
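
Selecting columns by position breaks silently if the API ever reorders its fields. The sketch below shows the same stacking step on toy data, selecting by name and letting `bind_rows()` create the `fav_category` column from the list names. The toy rows and the column names in `keep_cols` are assumptions for illustration, not the actual API fields used above.

```r
library(dplyr)

# Toy stand-ins for the three API result sets (made-up rows and columns)
viewed_toy  <- tibble(id = 1:2, title = c("A", "B"), byline = c("By Ana Lopez", "By Ben Ito"))
shared_toy  <- tibble(id = 3L, title = "C", byline = "By Cara Wu")
emailed_toy <- tibble(id = 4L, title = "A", byline = "By Ana Lopez")

# Hypothetical column names to keep
keep_cols <- c("title", "byline")

# Select by name in each data frame, then stack them;
# .id turns the list names into the fav_category column
common_toy <- list(Viewed = viewed_toy, Shared = shared_toy, Emailed = emailed_toy) %>%
  lapply(function(df) select(df, all_of(keep_cols))) %>%
  bind_rows(.id = "fav_category")

print(common_toy)
```

With this approach the category labels live in one place (the list names), so adding a fourth source would only require one more list entry.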

Analysis Questions:

  1. Which writers generate the most favorited bylines?
  2. Which sections and subsections of the paper generated the most popular articles?
  3. Which topics were the most popular? (Use the adx_keywords)
  4. Were there articles that were popular across the emailed, viewed, and shared lists?

Question 1 - Which writers generate the most favorited bylines?

## Create data frame of writers with bylines

bylines <- common_df$byline

# Strip the leading "By", then split on commas or the standalone word "and"
# (word boundaries keep names that contain "and" intact)
writers <- bylines %>%
  str_remove("^By\\s+") %>%
  str_split(",|\\band\\b") %>%
  unlist() %>%
  str_squish()

# Keep each non-empty writer name once
writers <- unique(writers[!is.na(writers) & writers != ""])

writers_df <- data.frame(writer = writers)
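
When splitting bylines into individual writers, the separator pattern matters: a bare `and` also matches inside names, while `\\band\\b` only matches the standalone word. The sketch below compares the two on a made-up byline (the names are invented for illustration and do not come from the data set).

```r
library(stringr)

# Made-up byline used only for illustration
toy_byline <- "By Alexandra Rowland and Sandy Brandt"
cleaned <- str_remove(toy_byline, "^By\\s+")

# Bare "and" also matches inside "Alexandra", "Rowland", "Sandy", and "Brandt"
naive <- str_squish(str_split(cleaned, ",|and")[[1]])

# Word boundaries restrict the match to the standalone word "and"
bounded <- str_squish(str_split(cleaned, ",|\\band\\b")[[1]])

print(naive)
print(bounded)
```

The naive split breaks both names into fragments, whereas the word-boundary version returns the two full names.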

## Determine number of times writer appears in a byline

writers_df <- writers_df %>%
  distinct(writer) %>%
  # Count the rows of common_df whose byline contains the writer's name;
  # fixed() matches the name literally rather than as a regular expression
  mutate(
    num_bylines = map_int(
      writer,
      ~ sum(str_detect(common_df$byline, fixed(.x)), na.rm = TRUE)
    )
  )

top10_writers <- writers_df %>%
  arrange(desc(num_bylines)) %>%
  slice_head(n = 10)

top10_writers %>%
  kable(
    row.names = TRUE,
    col.names = c("Writer", "Count"),
    caption = "Top 10 Writers Based on Number of Times They Appear in Bylines"
  ) %>%
  kable_material(c("striped"))

Top 10 Writers Based on Number of Times They Appear in Bylines

      Writer                      Count
  1   Michael Levenson                4
  2   Nicholas Bogel-Burroughs        3
  3   Connie Chang                    3
  4   Noam Chomsky                    3
  5   Ian Roberts                     3
  6   Jeffrey Watumull                3
  7   Eduardo Medina                  3
  8   Maggie Haberman                 2
  9   Jonah E. Bromwich               2
 10   Ben Protess                     2

ggplot(top10_writers, aes(x=reorder(writer, num_bylines), y=num_bylines, fill=writer)) +
  geom_col() +
  labs(
    y="Number of Bylines",
    x="",
    title = "Top 10 Writers Based on Number of Contributions to Popular Articles"
  ) +
  coord_flip()

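Answer: Michael Levenson was the most frequent writer, appearing in 4 bylines across the viewed, shared, and emailed lists. Because common_df stacks all three lists, an article that appears in more than one list is counted once per list, so this figure counts list appearances rather than distinct articles.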
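
Since the combined data frame stacks the viewed, shared, and emailed lists, a writer's byline count can include the same article more than once. The sketch below is one possible way to report both numbers side by side. It swaps the loop for `separate_rows()` plus a grouped summary, and it runs on toy data: the rows, the writer names, and the `title` column are assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy stand-in for common_df; rows and the `title` column are assumptions
toy_df <- tibble(
  title        = c("Story A", "Story A", "Story B", "Story C"),
  byline       = c("By Ana Lopez and Ben Ito", "By Ana Lopez and Ben Ito",
                   "By Ben Ito", "By Cara Wu"),
  fav_category = c("Viewed", "Emailed", "Shared", "Viewed")
)

writer_counts <- toy_df %>%
  # One row per (article listing, writer)
  mutate(writer = str_remove(byline, "^By\\s+")) %>%
  separate_rows(writer, sep = ",|\\band\\b") %>%
  mutate(writer = str_squish(writer)) %>%
  filter(writer != "") %>%
  group_by(writer) %>%
  summarise(
    num_bylines  = n(),               # appearances across all lists
    num_articles = n_distinct(title), # distinct articles
    .groups = "drop"
  ) %>%
  arrange(desc(num_bylines), writer)

print(writer_counts)
```

In the toy data, "Story A" appears in two lists, so its writers get two byline appearances but only one distinct article. Note that this version counts exact name matches after splitting, which can differ slightly from substring matching with `str_detect()`.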