The New York Times website provides a rich set of APIs, as described here: https://developer.nytimes.com/apis. You'll need to start by signing up for an API key. Your task is to choose one of the New York Times APIs, construct an interface in R to read in the JSON data, and transform it into an R data frame.
library(httr)
library(jsonlite)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library("stringr")
library("ggplot2")
Web APIs are a valuable way to collect data from websites automatically. To use the New York Times APIs, I created a developer account, registered an application, and activated the "Most Popular API". This produced the API key used below to interact with the selected API.
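For this write-up the key is pasted directly into the script so the results are reproducible. A safer alternative, sketched below, is to keep the key out of the source entirely; the sketch assumes the key has been saved in an environment variable named NYT_API_KEY (a name chosen here, not something the API requires), for example in ~/.Renviron.
# Alternative sketch: read the key from an environment variable instead of
# hardcoding it. NYT_API_KEY is an assumed name set in ~/.Renviron.
apikey <- Sys.getenv("NYT_API_KEY")
if (apikey == "") stop("Set NYT_API_KEY before running this script.")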
# API Key
apikey <- "EE8AqkYM74AA8jrTpENJXlUPbWzKO1Pd"
# Send the GET request to the Most Popular API (articles shared on Facebook over the last day)
most_popular <- GET(paste0("https://api.nytimes.com/svc/mostpopular/v2/shared/1/facebook.json?api-key=", apikey))
# Get status code
most_popular$status_code
## [1] 200
summary(most_popular)
## Length Class Mode
## url 1 -none- character
## status_code 1 -none- numeric
## headers 24 insensitive list
## all_headers 1 -none- list
## cookies 7 data.frame list
## content 35968 -none- raw
## date 1 POSIXct numeric
## times 6 -none- numeric
## request 7 request list
## handle 1 curl_handle externalptr
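Before extracting and parsing the body, it is worth confirming that the request actually succeeded and returned JSON. httr's stop_for_status() and http_type() helpers cover this; a minimal sketch:
# Fail early with an informative error if the request was not successful,
# and confirm the response body is JSON before parsing it.
stop_for_status(most_popular)
if (http_type(most_popular) != "application/json") {
  stop("The Most Popular API did not return JSON.")
}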
most_popular_json <- content(most_popular, as = "text", encoding = "UTF-8")
The response body is JSON, so jsonlite is used to parse it: fromJSON() converts the JSON text into an R object. Setting flatten = TRUE collapses the nested fields into ordinary columns, which simplifies the structure for analysis in R.
df <- fromJSON(most_popular_json, flatten = TRUE)
df <- data.frame(df$results, stringsAsFactors = FALSE)
#Get column names
colnames(df)
## [1] "uri" "url" "id" "asset_id"
## [5] "source" "published_date" "updated" "section"
## [9] "subsection" "nytdsection" "adx_keywords" "column"
## [13] "byline" "type" "title" "abstract"
## [17] "des_facet" "org_facet" "per_facet" "geo_facet"
## [21] "media" "eta_id"
# Rename the first ten columns; the remaining columns are dropped below
colnames(df)[1:10] <- c("URI", "URL", "ID", "asset_ID", "Source", "Published_Date", "Updated", "Section", "Subsection", "NYTDsection")
#Drop columns not needed
popular_df <- df[, -c(11:22)]
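The same selection and renaming can also be done in a single dplyr step, which refers to the API's column names directly instead of relying on column positions. A sketch assuming the column names printed above:
# Equivalent cleanup with dplyr: keep only the first ten fields and rename them.
popular_df <- df %>%
  select(URI = uri, URL = url, ID = id, asset_ID = asset_id,
         Source = source, Published_Date = published_date, Updated = updated,
         Section = section, Subsection = subsection, NYTDsection = nytdsection)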
Next, the popular articles are analyzed by 'Section' and 'Published Date', using dplyr for summarising and ggplot2 for visualization.
# Count of articles by published date
NYT_Published_Date <- popular_df %>%
  group_by(Published_Date) %>%
  summarise(num = n()) %>%
  arrange(desc(num))
print(NYT_Published_Date)
## # A tibble: 3 × 2
## Published_Date num
## <chr> <int>
## 1 2023-10-27 10
## 2 2023-10-26 7
## 3 2023-10-25 3
ggplot(data = NYT_Published_Date, aes(x = Published_Date, y = num, fill = Published_Date)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = num)) +
  ggtitle("Number of Popular Articles Shared on Facebook, by Published Date") +
  ylab("Frequency")
# Count of articles by section
NYTsection <- popular_df %>%
  group_by(Section) %>%
  summarise(num = n()) %>%
  arrange(desc(num))
print(NYTsection)
## # A tibble: 9 × 2
## Section num
## <chr> <int>
## 1 Opinion 7
## 2 World 3
## 3 Arts 2
## 4 Travel 2
## 5 U.S. 2
## 6 New York 1
## 7 Real Estate 1
## 8 Technology 1
## 9 Well 1
ggplot(data = NYTsection, aes(x = Section, y = num, fill = Section)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = num)) +
  ggtitle("Most Popular Articles Shared on Facebook, by Section") +
  ylab("Frequency")
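By default ggplot2 orders the section bars alphabetically. Reordering them by frequency makes the ranking easier to read; a small variation on the same plot:
# Optional variation: order the bars by count so the most popular section comes first.
ggplot(data = NYTsection, aes(x = reorder(Section, -num), y = num, fill = Section)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = num)) +
  ggtitle("Most Popular Articles Shared on Facebook, by Section") +
  xlab("Section") +
  ylab("Frequency")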
Facebook is one of the largest social media platforms in the world, with roughly 2.4 billion users reported in 2019, and it has changed how information spreads. The New York Times has made effective use of the rapid, widespread adoption of these platforms to share its most popular articles with the general public, who can easily access the latest information.