We live in politically tumultuous times. Television is full of rhetoric from both sides, so it would be useful to know which issues actually separate the parties. Press releases from agencies such as the Department of the Treasury should be more substantive than TV sound bites, so that is where we will look for data. We will compare press releases issued under the Obama administration to those issued under the Trump administration, and then see whether a classifier can determine under which administration a document was created.
Each page of the query returns a table of links and dates. We keep the dates as metadata, storing them first. The links are collected into a vector so that we can loop through them and store each article.
base <- "https://www.treasury.gov/press-center/press-releases/Pages/default.aspx?dtend=2017-11-01&page="
bodies <- list()
dates <- list ()
for (j in 1:2){ #this did go to 35, but for demonstration purposes we change it to 2
url <- str_c(base,j)
table <- html_session(url) %>%
html_nodes(xpath = '//td')
date_row <- table %>%
html_nodes('.firstcol') %>%
html_text
dates <- c(dates, date_row)
links <- html_session(url) %>%
html_nodes ('.datarow a') %>%
html_attr(name = 'href')
links <- str_c("https://www.treasury.gov",links)
for (i in 1:length(links)){
body <- html_session(links[i]) %>%
html_nodes(xpath = "//div[@class='content-bg']") %>% #this leaves some undesirable text, but it is removed later
html_text()
body <- paste(body, collapse = ' ')
bodies <- c(bodies, body)
Sys.sleep(.5)
}
}
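Note that the loop assumes every request succeeds, so a single failed link aborts the whole scrape. A hedged variant (a sketch, not part of the original run) wraps each article fetch in tryCatch():

# Sketch: return NA instead of erroring out when a link fails.
safe_read <- function(link) {
  tryCatch(
    html_session(link) %>%
      html_nodes(xpath = "//div[@class='content-bg']") %>%
      html_text() %>%
      paste(collapse = " "),
    error = function(e) NA_character_ # record a missing article on failure
  )
}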
library(lubridate) # mdy()
library(dplyr)

all_docs <- unlist(bodies) %>%
  tolower()
all_dates <- unlist(dates) %>%
  mdy()
data <- data.frame(date = all_dates, article = all_docs, stringsAsFactors = FALSE) %>%
  mutate(
    administration = ifelse(date >= "2017-01-20", "Trump", "Obama"), # inauguration day
    article = str_replace_all(article, "[[:punct:]]", " "),  # punctuation -> spaces
    article = str_replace_all(article, "[^[:alpha:] ]", "")  # drop digits and other non-letters
  ) %>%
  arrange(article) # sorting by article text scrambles the chronological order for the train/test split
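Note that arrange(article) sorts alphabetically, which scrambles the chronological order but is not truly random. An explicit, reproducible shuffle (an alternative sketch, not what the original code does) would be:

set.seed(2017) # hypothetical seed
data <- data[sample(nrow(data)), ] # seeded random shuffle instead of sorting by article text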
Next we remove stop words, lowercase the text, and stem it. Later, we also collapse frequent bigrams into single-token versions.
library(tm)        # stopwords(), removeWords()
library(tidytext)  # unnest_tokens()
library(SnowballC) # wordStem()

all_stops <- c(tm::stopwords("en"), "trump", "obama", "page", "contentwashington", letters)
descs <- data$article
data$wo_stops <- removeWords(tolower(descs), all_stops)
article_words <- data %>%
  select(wo_stops, administration) %>%
  unnest_tokens(word, wo_stops) %>%
  mutate(word = wordStem(word)) %>%
  group_by(administration, word) %>%
  dplyr::summarize(count = n()) %>%
  filter(count >= 10) %>%
  mutate(prob = count / sum(count), tot = sum(count)) %>% # probabilities within each administration
  arrange(administration, desc(prob))
library(ggplot2)

article_words %>%
  filter(min_rank(desc(prob)) <= 15) %>%
  arrange(administration, desc(prob)) %>%
  ggplot() +
  geom_col(aes(x = word, y = prob, fill = administration)) +
  facet_wrap(~administration) +
  coord_flip() +
  labs(title = "Unigram Comparison")
North Korea, economic growth versus growth in securities, and banks appear to be some of the issues that most separate the two administrations.
article_grams <- data %>%
  select(wo_stops, administration) %>%
  unnest_tokens(word, wo_stops, token = "ngrams", n = 2) %>%
  mutate(word = wordStem(word)) %>% # note: wordStem() treats each bigram as a single token
  group_by(administration, word) %>%
  dplyr::summarize(count = n()) %>%
  filter(count >= 10) %>%
  mutate(prob = count / sum(count)) %>%
  arrange(administration, desc(prob))
article_grams %>%
  filter(min_rank(desc(prob)) <= 15) %>%
  arrange(administration, desc(prob)) %>%
  ggplot() +
  geom_col(aes(x = word, y = prob, fill = administration)) +
  facet_wrap(~administration) +
  coord_flip() +
  labs(title = "Bigram Comparison")
The bigram comparison adds more clarity to these differences.
sub_grams <- article_grams %>%
  ungroup() %>%
  filter(prob >= .0005) %>%
  mutate(new_word = str_replace_all(word, ' ', '_')) %>%
  distinct(word, new_word)

data_new <- data
for (i in 1:nrow(sub_grams)) {
  data_new <- data_new %>%
    mutate(wo_stops = str_replace_all(wo_stops, sub_grams$word[i], sub_grams$new_word[i]))
}
data_new$wo_stops <- removeWords(data_new$wo_stops, "united_st") # drop the stemmed "united states" token
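As an aside, the replacement loop above applies one bigram at a time. str_replace_all() also accepts a named vector of pattern-replacement pairs, which does the same work in a single pass (a sketch, not the original approach):

# The bigrams contain no regex metacharacters after the earlier
# punctuation removal, so plain patterns are safe here.
replacements <- setNames(sub_grams$new_word, sub_grams$word)
data_new$wo_stops <- str_replace_all(data$wo_stops, replacements)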
articles_final <- data_new %>%
  select(wo_stops, administration) %>%
  rename(article = wo_stops)
Here, we create a tf-idf matrix and run it through some of the “usual suspects” among classifier models.
corpus <- Corpus(VectorSource(articles_final$article))
meta(corpus, "administration") <- articles_final$administration
corpus <- tm_map(corpus, stemDocument)
table(articles_final$administration)
##
## Obama Trump
## 491 209
td <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
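The document-term matrix is large and sparse. As an optional pruning step (an assumption on my part, not in the original pipeline), tm's removeSparseTerms() drops terms that are absent from nearly all documents, which can speed up training:

# Optional: prune terms that appear in fewer than ~1% of documents.
# (Not in the original pipeline; the 0.99 sparsity threshold is a guess.)
td_pruned <- removeSparseTerms(td, sparse = 0.99)
dim(td_pruned)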
library(RTextTools) # create_container(), train_model(), classify_model()

labels <- data$administration
len <- length(labels)
container <- create_container(td, labels = labels, trainSize = 1:500, testSize = 501:len, virgin = FALSE)
svm_model <- train_model(container, "SVM") # changing the kernel to polynomial hurt performance
tree_model <- train_model(container, "TREE", ntree = 500, nodesize = 5) # ntree/nodesize are random-forest-style arguments
maxent_model <- train_model(container, "MAXENT")
svm_test <- classify_model(container, svm_model)
tree_test <- classify_model(container, tree_model)
maxent_test <- classify_model(container, maxent_model)
labels_out <- data.frame(actual_label = labels[501:len], svm = svm_test[,1], tree = tree_test[,1],
maxent = maxent_test[,1], stringsAsFactors = F)
We create a function for calculating performance metrics, then compare the models.
print_mets <- function(confusion) {
  # confusion is a 2x2 table with Predicted on the rows and Actual on the
  # columns; column-major indexing gives [1] TN, [2] FP, [3] FN, [4] TP,
  # with "Trump" treated as the positive class.
  recall <- confusion[4] / sum(confusion[3], confusion[4])
  precision <- confusion[4] / sum(confusion[2], confusion[4])
  accuracy <- sum(confusion[1], confusion[4]) / sum(confusion)
  f <- 2 * ((precision * recall) / (precision + recall))
  sprintf("Precision: %s Recall: %s Accuracy: %s F Measure: %s",
          round(precision, 5), round(recall, 5), round(accuracy, 5), round(f, 5))
}
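As a quick sanity check on that indexing, a toy confusion matrix with hypothetical counts:

# Hypothetical counts: rows = Predicted, columns = Actual.
toy <- as.table(matrix(c(90, 10, 20, 80), nrow = 2,
                       dimnames = list(Predicted = c("Obama", "Trump"),
                                       Actual = c("Obama", "Trump"))))
print_mets(toy)
# Precision: 80/(10+80) = 0.88889; Recall: 80/(20+80) = 0.8; Accuracy: 170/200 = 0.85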
table_svm <- table(Predicted = labels_out$svm, Actual = labels_out$actual_label)
pander(table_svm)
| Predicted \ Actual | Obama | Trump |
|---|---|---|
| Obama | 133 | 45 |
| Trump | 7 | 15 |
print_mets(table_svm)
## [1] "Precision: 0.68182 Recall: 0.25 Accuracy: 0.74 F Measure: 0.36585"
table_tree <- table(Predicted = labels_out$tree, Actual = labels_out$actual_label)
pander(table_tree)
| Predicted \ Actual | Obama | Trump |
|---|---|---|
| Obama | 127 | 27 |
| Trump | 13 | 33 |
print_mets(table_tree)
## [1] "Precision: 0.71739 Recall: 0.55 Accuracy: 0.8 F Measure: 0.62264"
table_maxent <- table(Predicted = labels_out$maxent, Actual = labels_out$actual_label)
pander(table_maxent)
| Predicted \ Actual | Obama | Trump |
|---|---|---|
| Obama | 131 | 39 |
| Trump | 9 | 21 |
print_mets(table_maxent)
## [1] "Precision: 0.7 Recall: 0.35 Accuracy: 0.76 F Measure: 0.46667"
The best performance is clearly from the tree model.
There were some substantive differences in word distributions between press releases from the two administrations. Accordingly, the classifiers fared reasonably well, doing better than chance when guessing either administration. It is up to the reader to decide what to make of these differences in word choice.