Using the New York Times Archive API, we analyze the articles published in November 2020, the month of the 2020 U.S. presidential election, to gauge how much weight the election carried in the news industry.
library(jsonlite)
library(tidyverse)
library(wordcloud)
# `key` must already hold your NYT API key (it is not defined in this document)
url <- "https://api.nytimes.com/svc/archive/v1/2020/11.json?api-key="
data <- as.data.frame(fromJSON(paste0(url, key)))
d <- data %>% select(Abstract = response.docs.abstract, Snippet = response.docs.snippet, Lead_Paragraph = response.docs.lead_paragraph, response.docs.type_of_material, Section = response.docs.section_name, Subsection = response.docs.subsection_name, Doc_Type = response.docs.document_type)
d["Headline"] <- data$response.docs.headline$main
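The request URL above hard-codes the year and month and expects a `key` variable to exist. As a minimal sketch, the URL could instead be built by a small helper that reads the key from an environment variable. The helper name `build_archive_url` and the variable name `NYT_API_KEY` are assumptions made for illustration; they are not part of the original analysis or of the NYT API.

```r
# Hypothetical helper: builds the Archive API URL for a given year and month.
# Assumption: the API key is stored in an environment variable named NYT_API_KEY.
build_archive_url <- function(year, month, key = Sys.getenv("NYT_API_KEY")) {
  stopifnot(nzchar(key))  # fail early if no key was supplied
  sprintf("https://api.nytimes.com/svc/archive/v1/%d/%d.json?api-key=%s",
          as.integer(year), as.integer(month), key)
}

# Example with a placeholder key (not a real key)
build_archive_url(2020, 11, key = "demo-key")
```

Keeping the key out of the script avoids publishing it by accident when the report is shared.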
n <- nrow(d)
data.frame(table(d$Section)) %>% mutate(Proportion = Freq/n) %>% arrange(desc(Proportion)) %>% head(5)
## Var1 Freq Proportion
## 1 U.S. 1777 0.34747751
## 2 World 514 0.10050841
## 3 Opinion 353 0.06902620
## 4 Business Day 265 0.05181854
## 5 Arts 252 0.04927650
data.frame(table(d$Subsection)) %>% mutate(Proportion = Freq/n) %>% arrange(desc(Proportion)) %>% head(5)
## Var1 Freq Proportion
## 1 Elections 786 0.15369574
## 2 Politics 575 0.11243645
## 3 Europe 140 0.02737583
## 4 Book Review 113 0.02209621
## 5 Television 97 0.01896754
The top five sections are U.S., World, Opinion, Business Day, and Arts. The top five subsections are Elections, Politics, Europe, Book Review, and Television. Of all the New York Times articles published that month, 34.7% appeared in the U.S. section.
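The printed proportions can be checked by hand. The sketch below uses only the frequencies and the U.S. proportion shown in the output above; the total number of articles is not printed in the document, so it is recovered here from those two figures rather than taken from the data.

```r
# Frequencies copied from the section table above
freq <- c("U.S." = 1777, "World" = 514, "Opinion" = 353)

# Recover the total article count from the printed U.S. proportion
n <- round(1777 / 0.34747751)
n
# 5114

# Percentage share of each section, rounded to one decimal place
round(100 * freq / n, 1)

# Combined share of the three sections kept for the election analysis
round(100 * sum(freq) / n, 1)
```

Under these figures the three largest sections together hold 2,644 articles, a little over half of the month's output.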
Because we are interested in the 2020 election, we filter the data to the rows whose section is one of the top three (U.S., World, Opinion).
election <- d %>% filter(Section == "U.S." | Section == "World" | Section == "Opinion")
head(election,3)
## Abstract
## 1 PHILADELPHIA — Add Debra Messing and Kathy Najimy to the thousands of canvassers spread out across Philadelphia for the Biden campaign on Saturday.
## 2 Here are 20 counties in battleground states that are crucial for a White House victory.
## 3 SUN CITY, Ariz. — Don’t believe the polls. Don’t believe the media. And definitely do not believe the Democrats.
## Snippet
## 1
## 2 Here are 20 counties in battleground states that are crucial for a White House victory.
## 3
## Lead_Paragraph
## 1 PHILADELPHIA — Add Debra Messing and Kathy Najimy to the thousands of canvassers spread out across Philadelphia for the Biden campaign on Saturday.
## 2 Here are 20 counties in battleground states that are crucial for a White House victory.
## 3 SUN CITY, Ariz. — Don’t believe the polls. Don’t believe the media. And definitely do not believe the Democrats.
## response.docs.type_of_material Section Subsection Doc_Type
## 1 News U.S. Elections article
## 2 Interactive Feature U.S. Politics multimedia
## 3 News U.S. Elections article
## Headline
## 1 Celebrities lend Biden a hand in turning out the vote in Philadelphia.
## 2 The Battlegrounds Within Battlegrounds
## 3 ‘They’re coming after our state,’ McSally warns Arizona Republicans.
The “Headlines” function takes a word and returns its count in the “Headline” column of the “election” data frame. This function will be used to obtain the frequency of each word in the word bank.
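# Returns the row (Var1, Freq) for word `x` from a frequency table of all
# lower-cased, space-separated tokens in election$Headline.
# Note: punctuation stays attached to tokens, so "election:" is not counted as "election".
Headlines <- function(x){
  counts <- data.frame(table(unlist(strsplit(tolower(election$Headline), " ")))) %>% arrange(desc(Freq))
  counts[which(counts$Var1 == x), ]
}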
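Because the function splits headlines on single spaces, a token such as "election:" or "election." is tallied separately from "election". The following is a minimal sketch of a variant that strips leading and trailing punctuation before counting. The function name `count_word` and the toy headlines are made up for illustration and are not part of the original analysis.

```r
# Hypothetical variant: count a word after trimming punctuation from each token
count_word <- function(headlines, word) {
  tokens <- unlist(strsplit(tolower(headlines), "\\s+"))
  # Remove punctuation at the start or end of a token; inner hyphens
  # (as in "covid-19") are left untouched
  tokens <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens)
  sum(tokens == tolower(word))
}

# Toy headlines (invented for this example)
toy <- c("Election Results: Live Updates",
         "Who won the election?",
         "Trump, Biden and the election.")

count_word(toy, "election")
# 3
```

A plain space split on the same toy headlines would find only one exact match for "election", so the counts reported below are likely lower bounds.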
We create a word bank of keywords related to the 2020 election. These words are the targets of our analysis: we want to see how often each appears in article headlines.
words <- c("biden", "trump", "harris", "election", "2020", "vote", "president", "vice", "coronavirus", "covid-19", "american", "democrat", "republican", "swing", "state", "votes", "results", "fraud", "battleground", "progressive", "ballot" )
# Look up each word; record 0 if a word never appears so rows stay aligned with `words`
counts <- sapply(words, function(w) {
  hit <- Headlines(w)
  if (nrow(hit) == 0) 0L else hit$Freq[1]
})
results <- data.frame(Word = words, Count = unname(counts))
results %>% arrange(desc(Count))
## Word Count
## 1 election 745
## 2 trump 233
## 3 biden 205
## 4 results 170
## 5 coronavirus 65
## 6 vote 36
## 7 state 31
## 8 republican 28
## 9 president 24
## 10 votes 24
## 11 2020 20
## 12 covid-19 18
## 13 fraud 16
## 14 harris 13
## 15 swing 9
## 16 ballot 9
## 17 american 8
## 18 democrat 7
## 19 vice 5
## 20 progressive 3
## 21 battleground 2
From the analysis we can see that, across the headlines of the 2,644 articles in the U.S., World, and Opinion sections, the word “election” appeared 745 times, followed by “trump” (233) and “biden” (205).
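These figures count word occurrences, not articles: a headline that uses a word twice contributes two to the total. As a rough sketch of the complementary measure, the share of headlines that mention a word at least once could be computed as follows. The function name `headline_share` and the toy headlines are hypothetical, and the word is inserted into a regular expression unescaped, which is adequate for the plain words in the word bank but not for arbitrary input.

```r
# Hypothetical helper: fraction of headlines containing `word` as a whole word
headline_share <- function(headlines, word) {
  pattern <- paste0("\\b", tolower(word), "\\b")
  mean(grepl(pattern, tolower(headlines), perl = TRUE))
}

# Toy headlines (invented for this example)
toy <- c("Election Results: Live Updates",
         "Election Day: an election guide",
         "Markets rally")

# "election" occurs three times but in only two of the three headlines
headline_share(toy, "election")
```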
wordcloud(words = results$Word, freq = results$Count, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
These results suggest that the 2020 election was a major news event in November 2020. The word with the highest count was “election”, followed by the names of the two presidential candidates. As a continuation of this work, we would like to create a more neutral word bank and apply it to the articles published in the election month of the past few presidential elections.