To: Research Team
From: Marisa Smith

Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification. The goal of the algorithm is to find the hyperplane that separates the categories with the largest possible margin (the maximum-margin hyperplane).

SVM uses the features of each observation to predict its category. In text analysis, for example, the words a text uses are its features: if one category tends to use certain words, those words help predict that a text belongs to it. The observations closest to the maximum-margin hyperplane are called “support vectors.” The coefficient vector w holds one weight per feature.
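To make the terms above concrete, here is a minimal sketch on made-up data. The toy data frame, its column names, and the choice of `type = "C-classification"` with `scale = FALSE` are assumptions for illustration only; they are not the settings used in the analysis below.

```r
library(e1071)

# Toy data (assumed for illustration): two well-separated groups in two dimensions
toy <- data.frame(
  x1 = c(1, 2, 2, 3, 6, 7, 7, 8),
  x2 = c(1, 2, 0, 1, 6, 5, 7, 6),
  category = factor(rep(c("A", "B"), each = 4))
)

# Fit a linear SVM; scale = FALSE keeps the weights in the original units
fit <- svm(category ~ ., data = toy, type = "C-classification",
           kernel = "linear", scale = FALSE)

# Rows of the training data that act as support vectors
toy[fit$index, ]

# Weight vector w (one weight per feature) and intercept b of the hyperplane
w <- t(fit$coefs) %*% fit$SV
b <- -fit$rho
w
b
```

Only the observations listed by `fit$index` determine the hyperplane; removing any other row would leave w and b unchanged.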

SVM and Framing Analysis

When all of the classifications are known (e.g., news sources), SVM can be used for more than classifying text. For instance, it can show what each news source emphasizes by identifying the features with the largest weights (w). These weights indicate which features best predict the source, that is, which features are most “telling” of it. Fortuna et al. (2009) suggest that when the topic is held constant, the weights indicate how news sources talk about the same topic or event.
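The idea of reading the weights as "telling" features can be sketched on a small made-up table. Everything here is hypothetical: the phrase columns, the scores, and the two source labels are invented for illustration and say nothing about the real articles.

```r
library(e1071)

# Hypothetical weighted scores for four phrases in six articles from two sources
toy_dfm <- data.frame(
  phrase_a = c(2.1, 1.8, 2.4, 0.0, 0.2, 0.0),
  phrase_b = c(0.0, 0.3, 0.0, 1.9, 2.2, 1.7),
  phrase_c = c(1.0, 1.1, 0.9, 1.0, 1.2, 0.9),
  phrase_d = c(0.8, 1.0, 0.7, 0.9, 0.8, 1.1),
  news.source = factor(rep(c("Source A", "Source B"), each = 3))
)

fit <- svm(news.source ~ ., data = toy_dfm, type = "C-classification",
           kernel = "linear", scale = FALSE)

# One weight per phrase: the sign separates the sources, the size shows importance
w <- drop(t(fit$coefs) %*% fit$SV)

# Phrases ordered from most to least predictive of the source
sort(abs(w), decreasing = TRUE)
```

Phrases that both sources use about equally (here `phrase_c` and `phrase_d`) receive weights near zero, while phrases used mostly by one source receive the largest weights.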

Below, I provide an SVM analysis of articles collected from the NY Times (N = 36) and Breitbart (N = 56) referencing the murder of Mollie Tibbetts. The analysis combines SVM with Natural Language Processing (NLP) to determine how the sources describe the event and the people involved. Specifically, the NLP procedures I employ assign detailed part-of-speech (POS) tags to the words in each sentence (https://www.clips.uantwerpen.be/pages/mbsp-tags). I used the tagged words to create unigrams and bigrams of the text (weighted using TF-IDF).
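As a rough illustration of the n-gram and TF-IDF steps, the sketch below uses base R on three invented token lists. It assumes the weighting is raw term count times log10(N / document frequency), which is my understanding of the default in quanteda's `dfm_tfidf()`; the POS filtering is left out, and the helper `bigrams()` is hypothetical.

```r
# Three short hypothetical "documents", already tokenized
docs <- list(
  d1 = c("young", "woman", "young", "student"),
  d2 = c("college", "student"),
  d3 = c("young", "student", "college", "town")
)

# Bigrams: paste each token to the one that follows it
bigrams <- function(tokens) paste(head(tokens, -1), tail(tokens, -1))
bigrams(docs$d1)

# Term counts per document (rows = documents, columns = terms)
vocab <- sort(unique(unlist(docs)))
tf <- t(sapply(docs, function(d) table(factor(d, levels = vocab))))

# Inverse document frequency: log10(number of documents / documents containing the term)
idf <- log10(length(docs) / colSums(tf > 0))

# TF-IDF: a term that occurs in every document (here "student") gets weight 0
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

The weighting rewards phrases that are frequent within an article but rare across the collection, which is why phrases shared by every article contribute nothing to the classifier.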

Please note: There is still noise in the graph because I need a larger sample.

#### requires ####
library(rvest)
## Warning: package 'xml2' was built under R version 3.4.3
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 3.4.4
## Warning: package 'tibble' was built under R version 3.4.4
## Warning: package 'tidyr' was built under R version 3.4.2
## Warning: package 'purrr' was built under R version 3.4.4
## Warning: package 'dplyr' was built under R version 3.4.4
library(tidyr)
library(textreadr)
## Warning: package 'textreadr' was built under R version 3.4.4
library(stringr)
## Warning: package 'stringr' was built under R version 3.4.4
library(quanteda)
## Warning: package 'quanteda' was built under R version 3.4.4
library(ggplot2)
# LexisNexisTools provides lnt_read(), which is used below to read the NY Times files
library(LexisNexisTools)
library(e1071)
set.seed(123)
#### python backend ####
library(spacyr)
##spacy_install("/anaconda3/bin/python3.7")
brietbart <- read_csv("brietbart_ex.csv", col_names = F)
brietbart <- brietbart$X1

# download each article and extract the body text
raw.brietbart.text <- c()
for (i in seq_along(brietbart)) {
  x <- xml2::read_html(brietbart[i]) %>%
    html_nodes("header+ .entry-content") %>%
    html_text(trim = T) %>%
    trimws()
  raw.brietbart.text <- c(raw.brietbart.text, x)
}

brietbart.text <- raw.brietbart.text %>% 
  #remove t.co links
  stringr::str_replace_all("https://t\\.co/[A-Za-z0-9]*", "") %>% 
  #drop non-ASCII characters
  iconv("latin1", "ASCII", sub = "") %>%
  #remove "Follow ... on Twitter at ..." sign-offs
  stringr::str_replace_all("Follow\\s[A-Za-z]{3,}\\son\\sTwitter\\sat.*$", "") %>%
  #add a space where a lowercase word runs into a capitalized one
  stringr::str_replace_all("([a-z]{2,})([A-Z])", "\\1 \\2") %>%
  #remove reporter bylines
  stringr::str_replace_all("[A-Z][a-z]{3,}\\s[A-Z][a-z]{3,}\\s(is a reporter).*$", "") %>%
  #remove Twitter handles (handles containing punctuation first, otherwise that pattern can never match)
  stringr::str_replace_all("@[[:alnum:]]*[[:punct:]][[:alnum:]]*", "") %>%
  stringr::str_replace_all("@[[:alnum:]]*", "") %>%
  #add spaces between numbers (e.g., age) and other words
  stringr::str_replace_all("([[:digit:]]{2,})([a-z]{3,})", "\\1 \\2") %>%
  stringr::str_replace_all("([a-z])([[:digit:]])", "\\1 \\2" ) %>%
  #remove white space
  stringr::str_squish()
brietbart.text[1]
## [1] "Illegal alien Cristhian Bahena Rivera claims in new court filings that his constitutional rights were violated by Iowa law enforcement officials after being interrogated for allegedly murdering 20-year-old college student Mollie Tibbetts in Brooklyn, Iowa, last year. In August 2018, Bahena Rivera, a 25-year-old illegal alien from Mexico, waschargedwith Tibbetts murderafter police said he admitted to confronting and chasing down the young woman. After a nationwide search, Tibbetts body was found in a cornfield in Poweshiek County, Iowa.The illegal alien lived in a region of Iowa that was surrounded by sanctuary cities, as Breitbart Newsnoted, andworkedon a dairy farm using a stolen ID and Social Security card after allegedlycrossingthe U.S.-Mexico border as a child. A 29-page motion obtained by the Des Moines Register and filed by Bahena Riveras defense attorneys, Chad and Jennifer Frese, claims that the illegal alien had his constitutional rights violated when local Iowa police took him in for questioning and did not make clear that he could have contacted the Mexican consulate before speaking to law enforcement. Providing this information to Bahena a confused, exhausted and vulnerable Mexican national was seeking trustworthy help would have triggered an invocation of consular notification and a decision to await the consulates assistance before making any other statements, the defense attorneys wrote in the motion. Last year, more than 200 American taxpayers sent letters to the court saying that illegal aliens living in the U.S. do not have constitutional rights and therefore should not be awarded taxpayer money to pay for their defense. Among other accusations, the defense attorneys said police did not read Bahena Rivera his Miranda Rights until hours into a 12-hour interview a violation of his constitutional rights that they claim should nullify his alleged murder confession. 
Bahena Rivera has very limited Mexican education and wholly unfamiliar with the U.S. criminal justice system, his attorneys wrote. Bahena Riveras attorneys have sought for months to suppress his alleged murder confession and have it thrown out of the case before the trial begins on November 12. Already, Bahena Rivera successfully got his trial date pushed back and moved his trial to a less white, more Hispanic-populated county in the state. According to prosecutors, Bahena Rivera was the last person who saw Tibbetts jogging on the evening of July 18 in Brooklyn, Iowa, security camera footage reveals. That is the night Tibbetts went missing. The illegal alien told police that Tibbetts was jogging when he saw her, according to prosecutors. That is when he said he approached Tibbetts and started talking to her. After Tibbetts told Bahena Rivera that she would call the police if he did not stop following her, the illegal alien allegedly chased her and says he blacked out after this. Police believe Bahena Rivera stabbed Tibbettsto death, then drove to a cornfield where prosecutors say he placed corn stalks over her to hide her body. The illegal alien has been held on a $5 million bond."
nytimes.dir = "nytimes_ex/"

nyarticles <- 
  lnt_read(nytimes.dir, start_keyword = "Body", extract_paragraphs = F, 
           end_keyword = "Classification", 
           file_pattern = ".rtf$", convert_date = F)


nytimes.text <- nyarticles@articles %>% select(Article) %>% 
  mutate(Article = Article %>% 
           #remove "Body" at the beginning of each line
           stringr::str_replace_all("^(Body)\\s", "") %>%
           #remove location at the beginning of each line
           stringr::str_replace_all("[A-Z]{1,}\\s[[:punct:]]\\s|[A-Z]{1,}[[:punct:]]\\s[A-Z][a-z]{1,}\\s[[:punct:]]\\s", "") %>%
           #remove quotation marks
           stringr::str_replace_all("''", "") %>%
           #remove twitter
           stringr::str_replace_all("[[:punct:]]@[[:alnum:]]*[[:punct:]]", "")
           )

nytimes.text <- nytimes.text$Article

write.csv(as.data.frame(nytimes.text), "nytimes.csv")
nytimes.text <- read_csv("nytimes.csv", col_names = T) 
## Warning: Missing column names filled in: 'X1' [1]
nytimes.text  <- nytimes.text$nytimes.text

#remove "Follow ... on Twitter" sign-offs and photo credits
nytimes.text  <- nytimes.text %>%
  stringr::str_replace_all("Follow\\s[A-Za-z]{3,}\\son\\sTwitter\\s[a-z]{2,}.$|Follow\\s[A-Za-z]{3,}\\s[A-Za-z]{3,}\\son\\sTwitter[[:punct:]]\\s@[[:alnum:]]*", "") %>%
  stringr::str_replace_all("(PHOTO):.*$", "") %>%
  stringr::str_replace_all("[[:punct:]](PHOTOGRAPH)\\s(BY).*$", "")
nytimes.text[1]
## [1] "Where does the conservative commitment to limited government and individual freedom, always more rhetorical than real, finally go to die? One strong candidate is rural America, where Mollie Tibbetts, a 20-year-old student at the University of Iowa, was brutally murdered this summer at the hands, allegedly, of a Mexican immigrant who may be in the country illegally. The killing of Ms. Tibbetts, who went missing on July 18 but whose body was found only this week, is an unspeakable tragedy. Her killer should be prosecuted and punished to the fullest extent of the law. Yet many conservatives who have long assailed the government as incompetent at best are now so blinded by xenophobic rage over her murder that they've turned into the thing they claim to despise: vociferous boosters of big government. Consider a piece posted this week at Conservative HQ, the official site of the direct-mail innovator Richard Viguerie, the legendary funding father of the postwar conservative movement going back to its early days supporting Barry Goldwater in the 1960s. In The Politicians Who Killed Mollie Tibbetts, the site's editor, George Rasley, lays the blame not on the man in police custody but on a cabal of elected officials and profit-hungry plutocrats. The real killers are the politicians who keep our borders open and who continue protecting illegal aliens, Mr. Rasley wrote. Mollie Tibbetts was killed so that somebody in Iowa could have cheap labor, and she was killed so that the Business Roundtable, the U.S. Chamber of Commerce, and the rest of the Washington-Wall Street-Silicon Valley Axis could hit this quarter's earnings target. If you're not surprised by such a sentiment, that's because it's no longer news when conservatives and Republicans announce that immigrants, especially those here illegally, are uniquely prone to crime. 
In 2013, Representative Steve King of Iowa, who has called for electrifying the border with Mexico because we do that with livestock all the time, declared that for every undocumented immigrant who's a valedictorian, there's another 100 out there that weigh 130 pounds and they've got calves the size of cantaloupes because they're hauling 75 pounds of marijuana across the desert. In his successful bid for a Senate seat in Arkansas, Tom Cotton routinely argued that liberals in Washington want to let illegal immigrants get Social Security for work they did with forged identities. And of course, within minutes of announcing his presidential run in 2015, Donald Trump asserted that Mexicans entering the country were bringing drugs and crime. Virtually all research concludes that immigrants, especially undocumented immigrants, commit fewer crimes, including violent crimes, than native-born Americans (among other things, they do not want to draw attention to themselves). In 2016, the Cato Institute's Alex Nowratesh writes, the homicide conviction rate for native-born Americans in Texas was 3.2 per 100,000 natives while it was 1.8 per 100,000 illegal immigrants and 0.9 per 100,000 legal immigrants. In raw numbers, 32 undocumented immigrants and 28 legal immigrants were convicted of homicide in Texas, compared with 746 native-born Americans. But the Republican Party won't let such facts get in the way of calling for precisely the sort of big-government surveillance state against which conservatives used to rail. Conservatives who used to denounce worker databases such as E-Verify and national ID cards as affronts to the rights of states, business owners and individuals to make their own security and hiring decisions now support all such measures to round up undocumented immigrants. Conservatives who used to denounce government snooping and even the census now support internal checkpoint laws that result in thousands of legal citizens being mistakenly deported each year. 
Conservatives who used to talk passionately about family values have shown little empathy when the Trump administration has separated migrant families crossing the southern border or when ICE agents have arrested fathers accompanying their wives to give birth. How the Republican Party became the champion of closed borders and the police state necessary to enforce them is something of a mystery. As recently as 1980, Ronald Reagan and George H.W. Bush, both seeking the Republican presidential nomination, outdid each other in praising not just legal but undocumented immigrants. We're creating a whole society of really honorable, decent, family-loving people that are in violation of the law, Mr. Bush lamented. Rather than talking about putting up a fence, Mr. Reagan suggested, open the borders both ways. One sure explanation is demographics. As the country becomes more multiethnic, the Republican Party is increasingly defining itself as the party of white and rural Americans. At the same time, the Democrats switched positions. As recently as 1996, their national platform denounced immigrants in what can only be dubbed proto-Trumpian terms. More recently, and in conjunction with the decline of historically anti-immigration unions, they have come to realize that the dense, urban places immigrants tend to move are Democratic Party strongholds. The result has been an almost complete shift in party policies toward immigration -- a shift that Donald Trump, and the party he has remade in his image, is shameless in exploiting. Mollie Tibbetts's murder is profoundly disturbing. But it is a very rare event in a country where the violent crime and homicide rates are near 40-year lows. It does her memory and the country no good to greatly increase the power of the state as if there are no costs to such plans. Nick Gillespie  is an editor at large of Reason."
## parse the text using NLP
parsed_text <- spacyr::spacy_parse(brietbart.text, tag = T, dependency = T, nounphrase = T)  

## retrieve tokens for words tagged as nouns (NN*), personal pronouns (PRP), or adjectives (JJ*), then build unigrams and bigrams
nounphrase_tokens <- as.tokens(x = parsed_text, include_pos = "tag", concatenator = "_") %>%
    tokens_select(pattern = c("*_NN", "*_NNS", "*_NNP", "*_PRP", "*_JJ", "*_JJR", "*_JJS")) %>%
    tokens(ngrams = 1:2, concatenator = " ")


brietbart_dfm <- quanteda::dfm(nounphrase_tokens,
                             remove = stopwords("english"), stem = TRUE, tolower = TRUE)


brietbart_dfm <- dfm_trim(brietbart_dfm, min_termfreq = 5, min_docfreq = 5)
brietbart_dfm_weighted <- dfm_tfidf(brietbart_dfm )


brietbart_dfm_df <- convert(brietbart_dfm_weighted, to = "data.frame")
names(brietbart_dfm_df)[1] <- "news.article"
brietbart_dfm_df <- brietbart_dfm_df %>% select(-news.article) %>% mutate(news.source = "Breitbart")
## parse the text using NLP
parsed_text_ny <- spacy_parse(nytimes.text, tag = T, dependency = T, nounphrase = T)  

## retrieve tokens for words tagged as nouns (NN*), personal pronouns (PRP), or adjectives (JJ*)
nounphrase_tokens <- as.tokens(x = parsed_text_ny, include_pos = "tag", concatenator = "_") %>%
    tokens_select(pattern = c("*_NN", "*_NNS", "*_NNP", "*_PRP", "*_JJ", "*_JJR", "*_JJS")) %>% tokens(ngrams = 1:2, concatenator = " ")

nyyork_dfm <- quanteda::dfm(nounphrase_tokens,
                               remove = stopwords("english"), stem = TRUE, tolower = TRUE)

## trim phrases that appear fewer than 5 times or in fewer than 5 documents

nyyork_dfm <- dfm_trim(nyyork_dfm, min_termfreq = 5, min_docfreq = 5)
nyyork_dfm_weighted <- dfm_tfidf(nyyork_dfm)


nyyork_dfm_df <- convert(nyyork_dfm_weighted, to = "data.frame")
names(nyyork_dfm_df)[1] <- "news.article"
nyyork_dfm_df <- nyyork_dfm_df %>% select(-news.article) %>% mutate(news.source = "NY Times")
#combine data frames      
full_dfm_df <- bind_rows(brietbart_dfm_df, nyyork_dfm_df) %>% mutate(news.source = factor(news.source, levels = c(
  "NY Times","Breitbart"))) 
full_dfm_df[is.na(full_dfm_df)] <- 0             
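The combining step works because `bind_rows()` matches columns by name and fills columns missing from one table with `NA`. A minimal sketch with two invented tables (the phrase columns and source labels are hypothetical):

```r
library(dplyr)

# Hypothetical document-feature tables with only partly overlapping phrases
source_a <- data.frame(phrase_x = c(1.2, 0.0), phrase_y = c(0.4, 0.9),
                       news.source = "A")
source_b <- data.frame(phrase_y = c(0.7, 0.1), phrase_z = c(2.0, 1.5),
                       news.source = "B")

# Columns are matched by name; phrases missing from a source become NA
combined <- bind_rows(source_a, source_b)
combined

# A phrase that was not retained for a source is treated as having weight 0
combined[is.na(combined)] <- 0
combined
```

Because each source is trimmed separately before combining, a zero here can mean either that the source never used the phrase or that the phrase fell below the trimming thresholds for that source.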


# Randomize the order of the data frame
#full_dfm_df<- full_dfm_df[sample(1:nrow(full_dfm_df)), ]


#create training and testing sets

news_train <- sample_frac(full_dfm_df, .8) # 80% of the sample
news_test <- full_dfm_df %>% 
  anti_join(news_train) # the remaining 20%
model <- e1071::svm(news.source ~ ., 
             data = news_train, type = "nu-classification", kernel = "linear")



predictions <- predict(model, 
                       news_test)

agreement <- predictions == news_test$news.source
prop.table(table(agreement))
## agreement
## TRUE 
##    1
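A single agreement rate hides which source the errors fall on, so a cross-tabulation of predicted against actual labels can be a useful companion. The sketch below uses invented labels for ten held-out articles; they are not results from this analysis.

```r
# Hypothetical predictions and true labels for ten held-out articles
lv <- c("Source A", "Source B")
actual <- factor(rep(lv, times = c(4, 6)), levels = lv)
predicted <- factor(c("Source A", "Source A", "Source A", "Source B",
                      "Source B", "Source B", "Source B", "Source B",
                      "Source B", "Source A"), levels = lv)

# Rows are predictions, columns are the true sources
confusion <- table(predicted = predicted, actual = actual)
confusion

# Overall accuracy and the share of each source's articles that was recovered
accuracy <- sum(diag(confusion)) / sum(confusion)
recall <- diag(confusion) / colSums(confusion)
accuracy
recall
```

With a test set as small as the one here, a table like this also makes it easy to see how many articles each proportion is based on.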
# feature weights (interpretable as w only for a linear kernel)
W <- t(model$coefs) %*% model$SV

weights.df <- as.data.frame(t(W), row.names = NULL)


weights.df <- weights.df %>% rownames_to_column("word")
names(weights.df)[names(weights.df) == "V1"] <- "w"

# plot the 20 largest weights in each direction; the sign of w shows which source a phrase points toward
weights.df %>% group_by(w < 0) %>%
  top_n(20, abs(w)) %>%
  ungroup() %>%
  mutate(word = reorder(word, w)) %>%
  ggplot(aes(word, w, fill = w < 0)) +
  geom_col(show.legend = F) +
  coord_flip() + 
  ylab("w") +
  scale_fill_manual(values = c("orange", "blue")) +
  labs(title = "News Phrases for the Mollie Tibbetts Murder", 
       subtitle = "NY Times versus Breitbart")