library(ldatuning)
library(tidyverse) # general utility & workflow functions
library(tidytext) # tidy implementation of NLP methods
library(topicmodels) # for LDA topic modelling
library(tm) # general text mining functions
library(SnowballC) # for stemming
library(reticulate)
library(quanteda)
library(seededlda)
library(quanteda.textplots)
library(quanteda.textstats)
library(quanteda.textmodels)
library(spacyr)
library(stringr)
library(caret)
library(caTools)
library(text2vec)
library(rvest)
library(magrittr)
library(dplyr)
library(readxl)
library(stm)
library(corpus)
As a superpower, the United States attracts the attention of the whole world in every presidential election. The U.S. political system is a two-party system, meaning that two parties dominate the political field at all levels of government: the Republican Party and the Democratic Party. Majorities of partisans say the policy positions of the two parties are very different, and neither Republicans nor Democrats say the other party has many good ideas. Indeed, the two parties represent different interest groups and hold different opinions on many matters.
The Republican Party is known to support right-leaning ideologies of conservatism, social conservatism, and economic libertarianism. Republicans therefore broadly advocate for traditional values, a low degree of government interference, and strong support of the private sector.
The Republican Party platform focuses on family and individual freedom. Generally, the Republican Party tends to promote states' and local rights, wishing for federal regulations to play a lesser role in policymaking. Furthermore, the Republican Party has a pro-business-oriented platform; the party advocates for businesses to operate in a free market rather than under tight government regulations [2].
The Democratic Party generally represents left-leaning, liberal, and progressive ideological values, thus advocating for a strong government to regulate business and support the citizens of the United States. One of the key values emphasized by Democrats is social responsibility. Generally, they trust that a prominent and powerful government can ensure welfare and equality for all. Democrats tend to support heavy taxation of high-income households. Democrats generally support government-funded healthcare, and the party thereby aims to ensure that health and social services are accessible to all U.S. citizens. Additionally, the Democratic Party generally supports women's legal rights [2].
Following Reagan's two terms in office, his Vice President, George H. W. Bush, was elected as his successor in the White House. Since then, Republicans and Democrats have taken turns in the White House. In 2008, Democrat Barack Obama was elected as the first African American president. After Obama's two terms, Republican Donald Trump was elected and moved into the White House in 2017. Two of the main items on Trump's agenda were providing tax relief and establishing strong borders in order to reduce the number of undocumented immigrants entering the United States. In 2020, Democrat Joe Biden was elected as Donald Trump's successor; President Biden now serves as the 46th president of the United States [7].
Looking at American politics, it is noteworthy how the two parties exert a positive or negative impact on society. During each U.S. presidential election, how candidates from different parties express their party's voice, positively or negatively, attracts the attention of the whole world. It is therefore meaningful to analyze this textual information using computational techniques.
The main purposes of this project are a) to conduct topic modeling of the speeches from the different parties and to discuss the synergy between the two parties, b) to perform sentiment analysis on the speeches of the two parties, and c) to build a classifier that predicts the party behind each speech from its text.
This project uses the speeches of three American presidents over the past 20 years: George Walker Bush, Barack Obama, and Donald John Trump. To collect the speeches of Bush and Obama, we scraped our data from https://www.americanrhetoric.com/index.htm, a website that collects speeches spanning about 20 years. The corpus of Trump speeches was downloaded from Kaggle and contains speeches from 2016 to 2018; its limitation is that it only covers two years of speeches. We give each speech a party label, 'Democratic' or 'Republican', and keep only the columns 'title', 'date', 'content', and 'party'. We take a subset of Trump's speeches, since there are far more of them than for the other two presidents; if Republican speeches dominated the corpus, the comparison would be meaningless, so a balanced data set is important and necessary. In the end, our data set contains 836 speeches, with 408 from the Republican Party and the remaining 428 from the Democratic Party.
url<-'https://www.americanrhetoric.com/gwbushspeeches.htm'
web<-read_html(url)
### From here the program works!
titles<-web%>%html_nodes('table')%>%.[[2]]%>%html_table()
titles
link<-web%>%html_nodes('a')%>%html_attr("href")
link
clean<-data.frame(titles)
clean<-clean[-c(1:30),]
clean<-clean[,-c(4:34)]
#clean<-clean[-c(1:5),]
clean1<-clean%>%slice(1:5)
clean1<-clean1[,-1]
names(clean1)[1] <-"time"
names(clean1)[2] <-"title"
clean2<-clean%>%slice(6:34)
clean2<-clean2[,-3]
names(clean2)[1] <-"time"
names(clean2)[2] <-"title"
clean3<-clean%>%slice(35:108)
clean3<-clean3[,-1]
names(clean3)[1] <-"time"
names(clean3)[2] <-"title"
clean11<-rbind(clean1,clean2)
clean12<-rbind(clean11,clean3)
link<-web%>%html_nodes('a')%>%html_attr("href")
link
link<-data.frame(link)
names(link)[1] <-"link"
link1<-link[-grep("pdf", link$link),]
link1<-data.frame(link1)
link2<-link1[!grepl("mp3", link1$link1),]
link2<-data.frame(link2)
link2<-link2[-c(1:22),]
link2<-data.frame(link2)
link2<-link2[-c(109:116),]
clean123<-cbind(clean12,link2) # attach the link column to the date/title table
clean123<-data.frame(clean123)
clean4<-clean123[-grep("http", clean123$link2),] # drop external (absolute) links
clean4<-data.frame(clean4)
link3<-clean4$link
link3<-data.frame(link3)
clean4$link<-paste('https://www.americanrhetoric.com/',clean4$link2,sep="")
link4<-clean4$link
link4[90]
# scrape the full text of each speech page (the first <td> holds the transcript)
content <- character(99)
for(i in 1:99){
content[i]<-read_html(link4[i])%>%html_node('td')%>%html_text()
}
content<-data.frame(content)
content[1]
# drop the page header boilerplate at the start of each speech
result <- character(99)
for(i in 1:99){
result[i] <- substring(content$content[i], 700)
}
result<-data.frame(result)
BushData<-cbind(clean4,result)
BushData<-BushData[,-3]
BushData<-BushData[,-3]
names(BushData)[1] <-"date"
names(BushData)[2] <-"title"
names(BushData)[3]<-"content"
BushData$Party<-'R'
##change url
Url <-"/Users/rubing/Desktop/Collecting and Analysing Big Data/MrTrumpSpeeches.csv"
TrumpData <-read.csv(file=Url,header=T,sep='~')
TrumpData2<-TrumpData[-grep("music", TrumpData$subtitles),]
TrumpData3<-TrumpData2[-grep("Applause", TrumpData2$subtitles),]
TrumpData2<-data.frame(TrumpData2)
TrumpData3<-data.frame(TrumpData3)
# Splitting the dataset into the Training set and Test set
set.seed(123)
split = sample.split(TrumpData3$title, SplitRatio = 0.6)
training_set = subset(TrumpData3, split == TRUE)
#test_set = subset(dataset, split == FALSE)
TrumpData1<- data.frame(training_set$title,training_set$upload_date,training_set$subtitles)
names(TrumpData1)[1] <-"title"
names(TrumpData1)[2] <-"date"
names(TrumpData1)[3]<-"content"
TrumpData1$Party<-"R"
Url2 <-"/Users/rubing/Desktop/Collecting and Analysing Big Data/obama_speeches_dataframe.xlsx"
ObamaData <-read_excel(Url2)
ObamaData <-ObamaData [,-1]
ObamaData$Party<-"D"
url3<-"/Users/rubing/Desktop/Collecting and Analysing Big Data/BUSH_speech.csv"
BushData1 <-read.csv(url3)
BushData1<-BushData1[,-1]
BushData1<-BushData1[,-4]
BushData1<-BushData1[,-4]
BushData1$Party<-'R'
nrow(BushData1)
# drop the page header boilerplate at the start of each speech
for(i in seq_len(nrow(BushData1))){
BushData1$content[i] <- substring(BushData1$content[i], 300)
}
FullDataSet1<-rbind(TrumpData1,ObamaData)
FullDataSet<-rbind(FullDataSet1,BushData1)
write.csv(x = FullDataSet,file = "/Users/rubing/Desktop/Collecting and Analysing Big Data/FullData1.csv")
url3<-"/Users/rubing/Desktop/Collecting and Analysing Big Data/FullData.csv"
s1 <-read.csv(url3)
After collecting the data, we ran it through a cleaning and transformation process.
For the topic modeling and word plot processing, we wrote our own function and applied other cleaning methods from the tidyverse packages to make the data tidy enough to feed the models. First, all numbers, line breaks, punctuation characters, non-alphabetic characters, and extra spaces are removed from the content. Then we added some additional stop words to the default stop word list. For instance, 'american', 'president', 'united', and 'states' are removed since they appear in every speech and carry little information. Words indicating time, like 'tonight' and 'today', are also added to the stopword list. After removing the stopwords, we obtain the column called document_stop_word.
Later, we applied stemming to obtain the stemmed text, creating a column that is both stopword-free and stemmed. Here we apply stemming rather than lemmatization, because we found it difficult to categorize the topics clearly when lemmatizing.
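As a small illustration (a toy example, not part of our pipeline), the Porter stemmer in SnowballC truncates related word forms to a common stem that need not be a dictionary word, which is why stemmed topic tables can look cryptic:
library(SnowballC)
# Porter stemming collapses related word forms to one stem
wordStem(c("running", "families", "insurance"))
## [1] "run"    "famili" "insur"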
After stemming, we tokenize the data and build a corpus for each party; this step is the preparation for topic modeling. The tokenization code is embedded in the topic modeling part.
Clean_Text <- function(text){
# Lowercase
temp <- tolower(text)
# Remove punctuation
temp <- gsub("[[:punct:][:blank:]]+", " ", temp)
# Remove numbers
temp <- gsub("[[:digit:]]", "", temp)
# Remove regular urls
temp <- gsub("(s?)(f|ht)tp(s?)://\\S+\\b", "", temp)
# Remove tiny urls
temp <- gsub("[A-Za-z]{1,5}[.][A-Za-z]{2,3}/[A-Za-z0-9]+\\b", "", temp)
return(temp)
}
# read in the libraries we're going to use
url3<-"C:/Users/yinfu/Desktop/研一下/social data science/exam/project2/data/FullData1.csv"
FullData<-read.csv(url3)
names(FullData)[1]<-"id"
corpus_text <- FullData$content
corpus_text <- iconv(corpus_text, "WINDOWS-1252", "UTF-8")
corpus_text <- Clean_Text(corpus_text)
# create a document term matrix to clean
text <- Corpus(VectorSource(corpus_text))
Content <- TermDocumentMatrix(text)
# convert the document term matrix to a tidytext corpus
Content_tidy<-tidy(Content)
add_stop_words = c('like','youre','ive','im','really','id','just','dont','didnt','thi','wa',
'say','know','make','people','today','way','day','time','year','tonight',
'don','it','ve','re','we',
'live','youll','youve','things','thing','right','lot',
'little','maybe','men','americans','america',
'kind','heart','american','president','united','states','doesn',
# tokens left over from scraping and encoding artifacts
'obama:well','Trump','obama:i','youtube','thatâ','â','.'
)
custom_stop_words <- tibble(add_stop_words)
names(custom_stop_words)[1]<-"word"
# remove stopwords
newsDTM_tidy_cleaned <-Content_tidy %>%
anti_join(stop_words, by = c("term" = "word")) %>% # remove English stopword
anti_join(custom_stop_words, by = c("term" = "word")) # remove our own dictionary words
# reconstruct cleaned documents (so that each word shows up the correct number of times)
stops_documents <- newsDTM_tidy_cleaned %>%
group_by(document) %>%
mutate(terms = toString(rep(term, count))) %>%
select(document, terms) %>%
unique()
head(stops_documents)
## # A tibble: 6 × 2
## # Groups: document [6]
## document terms
## <chr> <chr>
## 1 1 ability, ability, abraham, abused, abusers, abuses, abuses, abuses, …
## 2 2 ability, ability, abraham, abuse, abusers, abuses, abuses, abuses, a…
## 3 3 abandoned, ability, ability, absolutely, abu, acknowledge, acknowled…
## 4 4 ability, absolutely, abuse, accepted, access, add, addiction, addict…
## 5 5 absolutely, agency, agency, alex, amazing, amp, andrew, andrew, answ…
## 6 6 abandoned, ability, ability, absolutely, abu, acknowledge, acknowled…
names(stops_documents)[1]<-"stop_document_id"
names(stops_documents)[2]<-"document_stop_word"
# stem the words (e.g. convert each word to its stem, where applicable)
newsDTM_tidy_cleaned <- newsDTM_tidy_cleaned %>%
mutate(stem = wordStem(term))
# reconstruct our documents
cleaned_documents <- newsDTM_tidy_cleaned %>%
group_by(document) %>%
mutate(terms = toString(rep(stem, count))) %>%
select(document, terms) %>%
unique()
names(cleaned_documents)[1]<-"stemmed_document_id"
names(cleaned_documents)[2]<-"document_stemmed"
FullData_cleaned<-cbind(FullData,cleaned_documents)
FullData_cleaned<-cbind(FullData_cleaned,stops_documents)
FullData_cleaned_R<-subset(FullData_cleaned, Party == "R")
FullData_cleaned_D<-subset(FullData_cleaned, Party == "D")
Sentiment analysis systematically identifies, extracts, quantifies, and abstracts subjective information; it is defined as a process that automates the mining of attitudes, opinions, views, and emotions from text, speech, tweets, and database sources through Natural Language Processing (NLP) [1]. Normally, sentiment analysis classifies text into three main categories: "positive" ("pos"), "negative" ("neg"), or "neutral" ("neu") [2]. Sentiment analysis typically requires numerous methodological decisions, such as deciding whether to use a dictionary-based or a supervised machine learning approach, and determining how well particular sentiment measures suit the domain under investigation (e.g., VADER for social media data) [3].
In this part, we use two methods/packages to conduct sentiment analysis on our speeches: Tidytext and Vader (Valence Aware Dictionary and sEntiment Reasoner). For Tidytext, we subtracted the count of negative words from the count of positive words using the Bing dictionary and labeled a speech positive if the balance was above 0, negative if below 0, and neutral if it was 0 or if no sentiment words matched (as with missing data). During data processing, we label positive, negative, and neutral as "pos", "neg", and "neu" respectively. For Vader, we used its continuous compound score as the overall scale and classified a speech as negative if the score was below 0, positive if above 0, and neutral if exactly 0. All 836 speeches are used in this process, with un-cleaned speech content.
library(tidytext)
library(tidyr)
library(ggplot2)
library(dplyr)
library(stringr)
library(anytime)
library(cowplot)
library(lubridate)
library(vader)
library(corpus)
library(SnowballC)
speech = read.csv("C:/Users/yinfu/Desktop/研一下/social data science/exam/project2/data/FullData.csv",header = T)
speech$date<-as.Date(anydate(speech$date))
speech$Party = ifelse(speech$Party=="D","The Democrats", "The Republicans")
### Label ###
speech$label1 = ""
for (i in 1:length(speech$content)) {
df = tibble(text = speech$content[i]) %>% unnest_tokens(word,text) %>%
inner_join(get_sentiments("bing")) %>% count(sentiment)
if (rlang::is_empty(df$n)){
speech$label2[i] = "neu"
}
else {
sentiment = df$sentiment[which.max(df$n)]
if (sentiment == "positive"){
speech$label1[i] = "pos"
}
else {
speech$label1[i] = "neg"
}
}
}
speech$label2 = ""
for (i in 1:length(speech$label1)) {
if (speech$label1[i] == "neu"){
speech$label2[i] = 0
}
else if (speech$label1[i] == "pos"){
speech$label2[i] = 1
}
else {
speech$label2[i] = -1
}
}
summary1 <- group_by(speech,Party) %>%
summarise(TidytextSCore=mean(as.numeric(label2)) ,NumberSpeech=n())
print(summary1)
## # A tibble: 2 × 3
## Party TidytextSCore NumberSpeech
## <chr> <dbl> <int>
## 1 The Democrats 0.860 428
## 2 The Republicans 0.760 408
According to the table above, the mean sentiment score of speeches from the Democrats (0.860, n=428) is more positive than that of speeches from the Republicans (0.760, n=408) using Tidytext sentiment analysis.
### Sentiment Scoring ###
vdf1<-vader_df(speech$content[1:100])
vdf2<-vader_df(speech$content[101:200])
vdf3<-vader_df(speech$content[201:300])
vdf4<-vader_df(speech$content[301:400])
vdf5<-vader_df(speech$content[401:500])
vdf6<-vader_df(speech$content[501:600])
vdf7<-vader_df(speech$content[601:700])
vdf8<-vader_df(speech$content[701:836])
vdf <- rbind(vdf1,vdf2,vdf3,vdf4,vdf5,vdf6,vdf7,vdf8)
### label According to Scoring###
speech$cmpd_score = vdf$compound
speech$cmpd_label = ifelse(speech$cmpd_score>=0,"pos","neg")
speech$cmpd_label[which(speech$cmpd_score==0)] = "neu"
### Import Saved vdf Dataset ###
speech = read.csv("C:/Users/yinfu/Desktop/研一下/social data science/exam/project2/speech_all.csv",header = T)
#####plot#####
options(warn = -1) # suppress warnings
speech$date<-as.Date(anydate(speech$date))
speech$Party = ifelse(speech$Party=="D","The Democrats", "The Republicans")
score_by_date = speech %>%
group_by(date,Party) %>% summarise_at(vars(cmpd_score),list(avg_score = mean))
summary2 <- group_by(speech,Party) %>%
summarise(VaderScore=mean(cmpd_score,na.rm = T),NumberSpeech=n())
print(summary2)
## # A tibble: 2 × 3
## Party VaderScore NumberSpeech
## <chr> <dbl> <int>
## 1 The Democrats 0.857 428
## 2 The Republicans 0.722 408
### Dived into Two Groups ###
score_by_date1 <- filter(score_by_date,Party=="The Democrats")
score_by_date2a<- filter(score_by_date, date < anydate("01 Jan 2010") &
Party=="The Republicans")
score_by_date2b<- filter(score_by_date, date >= anydate("01 Jan 2016") &
!is.na(avg_score) & date <= anydate("01 Jan 2018") & Party=="The Republicans")
score_by_date2 <- rbind(score_by_date2a,score_by_date2b)
score_by_date3 <- rbind(score_by_date1,score_by_date2)
According to the table above, the mean sentiment score of speeches from the Democrats (0.857, n=428) is more positive than that of speeches from the Republicans (0.722, n=408) using Vader sentiment analysis.
Comparing the results of the two methods, we found that the Democrats receive roughly the same rounded score under both (0.860 and 0.857), while the Republicans receive different scores (0.760 and 0.722).
### Density: The Democrats (blue) ###
ggplot(data=score_by_date1, aes(x=avg_score, group=Party, fill=Party)) +
geom_density(adjust=100,alpha=1) + scale_fill_manual(values=c("#3333FF"))+
ggtitle("Figure 1. The Distribution of the Democrats") +
xlab("Sentiment Score") + ylab("Density")
### Density: The Republicans (red) ###
ggplot(data=score_by_date2, aes(x=avg_score, group=Party, fill=Party)) +
geom_density(adjust=1.5,alpha=1) + scale_fill_manual(values=c("#CC0033")) + ggtitle("Figure 2. The Distribution of the Republicans") +
xlab("Sentiment Score") + ylab("Density")
After comparing the results from the two different methods, we decided to conduct further analysis using Vader (see details in the next part). Figures 1 and 2 show the distribution of speech scores for the two parties, the Democrats (blue) and the Republicans (red). For the Democrats, most speech scores lie very close to positive (values near 1). For the Republicans, most speech scores also lie close to 1, but many scores fall between 0 and 0.5.
#### Plot: All Speeches ###
ggplot(score_by_date,aes(x=date,y=avg_score)) +
geom_line(data = score_by_date1, color = "blue") +
geom_point(data = score_by_date1, color = "dark blue" )+
geom_line(data = score_by_date2, color = "red") +
geom_point(data = score_by_date2, color = "dark red" )+
theme_grey() + geom_hline(yintercept=0,color = "black") +
ggtitle("Figure 3. Sentiment Score Between Parties Across Years") +
xlab("Date") + ylab("Sentiment Score")+
theme(plot.title.position = 'plot',
plot.title = element_text(hjust = 0.5))
### Plot: The Democrats (blue) ###
ggplot(score_by_date1,aes(x=date,y=avg_score)) + geom_line(color = "blue") +
geom_point(color = "dark blue")+ theme_grey() + geom_hline(yintercept=0,color = "black") +
ggtitle("Figure 4. Sentiment Score for the Speeches of the Democrats") +
xlab("Date") + ylab("Sentiment Score")+
theme(plot.title.position = 'plot',
plot.title = element_text(hjust = 0.5))
### Plot: The Republicans (red) ###
#score_by_date2 <- filter(score_by_date,Party=="The Republicans")
ggplot(score_by_date2,aes(x=date,y=avg_score)) + geom_line(color = "dark red") + geom_point(color = "red")+ theme_grey() + geom_hline(yintercept=0,color = "black") +
ggtitle("Figure 5. Sentiment Score for the Speeches of the Republicans") +
xlab("Date") + ylab("Sentiment Score")+
theme(plot.title.position = 'plot',
plot.title = element_text(hjust = 0.5))
Figure 3 shows how the scores of all speeches change from 2001 to 2018. The red line stands for the Republicans and the blue line for the Democrats; the horizontal black line marks neutral (value 0). The plot shows that red points before 2010 are closer to 1 than red points after 2015, when more red points appear between 0 and 0.5. In addition, blue points are more concentrated near the top between 2010 and 2012 than afterwards (we can also zoom in; see Figures 4 and 5).
###Plot: The Republicans_Bush (red) ###
ggplot(score_by_date2a,aes(x=date,y=avg_score)) +
geom_line(color = "dark red") + geom_point(color = "red")+
theme_grey() + geom_hline(yintercept=0,color = "black") +
ggtitle("Figure 6. Sentiment Score for the Speeches of the Republicans_Bush") +
xlab("Date") + ylab("Sentiment Score") +
theme(plot.title.position = 'plot',
plot.title = element_text(hjust = 0.5))
### Plot: The Republicans_Trump (blue) ###
ggplot(score_by_date2b,aes(x=date,y=avg_score)) +
geom_line(color = "dark blue") + geom_point(color = "blue")+
theme_grey() + geom_hline(yintercept=0,color = "black")+
ggtitle("Figure 7. Sentiment Score for the Speeches of the Republicans_Trump") +
xlab("Date") + ylab("Sentiment Score") +
theme(plot.title.position = 'plot',
plot.title = element_text(hjust = 0.5))
Further analysis was conducted within the Republican speeches, separately for Bush and Trump. As Figures 6 and 7 show, most score points from Bush lie closer to 1 than those from Trump. Some of Trump's score points fall between 0 and 0.5, especially between 2017 and 2018, indicating weaker positivity than before.
LDA is the most fundamental and popular realization of topic modeling. Its main purpose is to discover latent themes that permeate the corpus. When an LDA model is trained on a corpus, two outputs are generated: the former is the term distribution per topic and the latter the topic distribution per document. The latter can be viewed as an alternative document representation in which word frequencies and topic composition are taken into account [8]. However, the distribution of words within a topic is stationary, which means topic 1 for document 1 uses an identical word distribution as topic 1 for documents 2, 3, etc.
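For reference, the standard LDA generative process can be sketched as follows (notation: Dirichlet priors α and β, topic proportions θ_d for document d, and word distribution φ_k for topic k):
$$\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad \phi_k \sim \mathrm{Dirichlet}(\beta),$$
$$z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}}),$$
that is, each token first draws a topic assignment z_{d,n} from its document's topic proportions and then draws the word w_{d,n} from that topic's word distribution.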
For topical content, the idea of the structural topic model (STM) is to define the distribution over the terms associated with the different topics as an exponential family model, like a multinomial logistic regression, parametrized as a function of the marginal frequency of occurrence deviations for each term, and of deviations from it that are specific to topics, covariates and their interactions [9]. The inclusion of covariates is informative about the structure of the document collection and its design.
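In a sketch following [9] (with y_d denoting the content covariate of document d, in our case the party), the term distribution of topic k in document d is parametrized as
$$\beta_{d,k,v} \propto \exp\!\bigl(m_v + \kappa^{(\mathrm{topic})}_{k,v} + \kappa^{(\mathrm{cov})}_{y_d,v} + \kappa^{(\mathrm{int})}_{y_d,k,v}\bigr),$$
where m_v is the baseline log-frequency of term v and the κ terms capture topic-specific, covariate-specific, and interaction deviations from it.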
The R package stm can be used to plot the influence of a topical content covariate. A topical content variable allows the vocabulary used to talk about a particular topic to vary. The estimateEffect function can be used to estimate a regression where documents are the units, the outcome is the proportion of each document devoted to a topic in an STM model, and the covariates are document metadata. This procedure incorporates measurement uncertainty from the STM model using the method of composition.
For LDA, the algorithm is trained on three versions of the data: the text with stopwords removed, the text with stopwords removed and vectorized, and the text cleaned and stemmed. We compare the results of the three and choose one to analyze, using the other two results as a complement. We have done the tokenization for the first two data sets. For the second data set, vectorization creates an object which defines how to transform a list of tokens into a vector space; it maps words to indices. For the algorithm on the stopword-free text, we set the minimum document frequency of a word to 0.01 and the maximum to 0.85, so words appearing in more than 85% of the speeches, as well as very rare words, are dropped. For the stopword-free, vectorized text, we set the minimum number of occurrences over all documents to 30, meaning a term must occur at least 30 times across the corpus to be taken into account; this threshold is a small fraction (roughly 3.5%) of the total number of speeches. We also set the maximum proportion of documents that may contain a term to 20%, since a word that appears in nearly all speeches is not useful for distinguishing topics. These parameter settings were adjusted and tested over several runs to obtain the result with the best interpretability. The choice of parameters, such as how many words to delete, is not random: although methods like searchK can illustrate the residual and likelihood for choosing the number of topics, we think interpretability should come first.
Concerning STM, after topic categorization we plot the effect of a covariate on a set of topics selected from the earlier shared results. We want to know whether, for the same topic, the two parties' speeches differ, and we refer to some articles to illustrate how they differ. The limitation of our STM design is that validation is lacking, which might affect the reliability of our work.
Comparing the results across the different data sets, the LDA result on the data with only stopwords removed is the most understandable, so we use it to distinguish topics.
Concerning the Democrats, topic 2 is about the health care system, topic 3 is about war and the military, topics 4 and 5 concern childcare, women's rights, and family, and topic 7 focuses on the economic problems America faces.
For the Republican Party, topic 1 is about voting and elections, topic 4 concerns health care problems, topic 5 is about war, topic 6 is about serving the people, and topic 7 focuses on trade problems.
Comparing the two results, both parties commonly discuss health care and war, two problems Americans have faced for a long time.
On the economic topic, the two parties focus on different sides. The Democrats pay more attention to domestic issues, for instance the unemployment rate and taxes, while the Republicans focus more on international trade. This is explainable: the financial crisis came during Obama's first term, and one main government policy was macro-financial control, so the Democrats focus on economic indicators. On the other hand, Trump's economic platform centered on manufacturing repatriation, which is why the Republicans focus on the 'trade war' and refer to 'China'.
Taxation is also part of economics, and both parties mention taxes. Traditionally, the Republican Party pushes for lower taxes for all, while the Democratic Party holds that higher taxes, especially on high-income earners, are important [6]. This result is consistent with our STM result.
Furthermore, the Democrats focus more on children, women, and family welfare, which is characteristic of their values and policy, whereas the Republicans only talk about serving the people in general terms. Discussing welfare issues is a symbolic characteristic of the Democratic Party, so it is reasonable for the Democrats to conclude that they do better than the Republicans in this field; the Democratic Party aims to ensure that health and social services are accessible to all U.S. citizens [6]. In fact, from our analysis, it seems that the Republican Party still lacks fully developed values, policies, and discussion around public welfare.
The LDA runs on the different data sets show a similar pattern. Besides, the Republicans also pay attention to the race issue, since 'black', 'white', and 'people' appear in the table, and the Democrats also talk about some regional issues and how to enhance US influence among the surrounding states.
The following code is for the Democrats.
corp_party <- corpus(FullData_cleaned_D, text_field = "document_stop_word")
token_corp_party_no_stopsword_D <- tokens(corp_party)
dfmat_party_lda <- dfm(token_corp_party_no_stopsword_D) %>%
dfm_trim(min_docfreq = 0.01, max_docfreq = 0.85, docfreq_type = "prop")
party_lda <- textmodel_lda(dfmat_party_lda, k = 7)
terms(party_lda, 20)
## topic1 topic2 topic3 topic4 topic5
## [1,] "life" "health" "and" "world" "weâ"
## [2,] "god" "care" "we" "countries" "military"
## [3,] "lives" "jobs" "i" "weâ" "iraq"
## [4,] "nation" "economy" "but" "nations" "isil"
## [5,] "children" "weâ" "it" "future" "security"
## [6,] "world" "insurance" "the" "democracy" "war"
## [7,] "love" "tax" "so" "human" "forces"
## [8,] "women" "businesses" "that" "global" "afghanistan"
## [9,] "family" "energy" "they" "rights" "troops"
## [10,] "faith" "itâ" "you" "progress" "world"
## [11,] "history" "companies" "in" "change" "veterans"
## [12,] "hard" "plan" "this" "iâ" "intelligence"
## [13,] "families" "education" "itâ" "africa" "qaeda"
## [14,] "justice" "system" "thatâ" "trade" "syria"
## [15,] "bless" "pay" "government" "economic" "terrorists"
## [16,] "hope" "middle" "he" "leaders" "nation"
## [17,] "home" "government" "as" "asia" "terrorist"
## [18,] "rights" "reform" "weâ" "itâ" "attacks"
## [19,] "story" "future" "our" "region" "continue"
## [20,] "change" "class" "there" "china" "including"
## topic6 topic7
## [1,] "nuclear" "weâ"
## [2,] "iran" "iâ"
## [3,] "world" "obama"
## [4,] "security" "question"
## [5,] "international" "itâ"
## [6,] "peace" "theyâ"
## [7,] "russia" "donâ"
## [8,] "nations" "youâ"
## [9,] "israel" "law"
## [10,] "weapons" "thereâ"
## [11,] "deal" "congress"
## [12,] "sanctions" "folks"
## [13,] "war" "issue"
## [14,] "region" "issues"
## [15,] "ukraine" "process"
## [16,] "countries" "gun"
## [17,] "minister" "house"
## [18,] "prime" "job"
## [19,] "government" "forward"
## [20,] "iranian" "government"
party_topics_D <- topics(party_lda)
The following code is for the Republicans.
corp_party <- corpus(FullData_cleaned_R, text_field = "document_stop_word")
token_corp_party_no_stopsword_R <- tokens(corp_party)
dfmat_party_lda <- dfm(token_corp_party_no_stopsword_R) %>%
dfm_trim(min_docfreq = 0.01, max_docfreq = 0.85, docfreq_type = "prop")
party_lda <- textmodel_lda(dfmat_party_lda, k = 7)
terms(party_lda, 20)
## topic1 topic2 topic3 topic4 topic5
## [1,] "love" "trump" "world" "health" "â"
## [2,] "country" "question" "freedom" "care" "iraq"
## [3,] "win" "house" "nation" "line" "war"
## [4,] "trump" "white" "peace" "congress" "terrorists"
## [5,] "folks" "administration" "security" "tax" "iraqi"
## [6,] "guy" "news" "country" "economy" "world"
## [7,] "money" "watching" "free" "government" "weapons"
## [8,] "gonna" "obama" "nations" "country" "military"
## [9,] "wall" "click" "life" "federal" "forces"
## [10,] "won" "videos" "future" "workers" "terrorist"
## [11,] "didn" "donald" "democracy" "reform" "terror"
## [12,] "vote" "russia" "hope" "plan" "nations"
## [13,] "donald" "deal" "god" "border" "enemy"
## [14,] "happen" "campaign" "support" "law" "regime"
## [15,] "world" "watch" "liberty" "children" "threat"
## [16,] "military" "didn" "women" "child" "qaeda"
## [17,] "talk" "security" "history" "bill" "country"
## [18,] "bad" "subscribe" "middle" "system" "afghanistan"
## [19,] "dollars" "hey" "human" "school" "attacks"
## [20,] "nice" "court" "leaders" "job" "iraqis"
## topic6 topic7
## [1,] "country" "trade"
## [2,] "hillary" "jobs"
## [3,] "clinton" "china"
## [4,] "jobs" "country"
## [5,] "trump" "business"
## [6,] "donald" "percent"
## [7,] "obama" "world"
## [8,] "dollars" "companies"
## [9,] "percent" "mexico"
## [10,] "government" "deal"
## [11,] "vote" "countries"
## [12,] "change" "deals"
## [13,] "african" "dollars"
## [14,] "safe" "money"
## [15,] "system" "job"
## [16,] "remember" "tax"
## [17,] "veterans" "clinton"
## [18,] "november" "billion"
## [19,] "cities" "care"
## [20,] "special" "hundred"
party_topics_R <- topics(party_lda)
The following code is for the Democrats.
#D
tokens = word_tokenizer(token_corp_party_no_stopsword_D)
it = itoken(tokens, progressbar = FALSE)
v = create_vocabulary(it)
v = prune_vocabulary(v, term_count_min = 30, doc_proportion_max = 0.2)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
lda_model_D = LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr_D =
lda_model_D$fit_transform(x = dtm, n_iter = 2500,
convergence_tol = 0.001, n_check_convergence = 25,
progressbar = FALSE)
## INFO [17:37:31.793] early stopping at 725 iteration
## INFO [17:37:33.681] early stopping at 75 iteration
lda_model_D$get_top_words(n = 20, lambda = 0.7)
## [,1] [,2] [,3] [,4] [,5]
## [1,] "â" "â" "â" "â" "â"
## [2,] "country" "family" "government" "lives" "theyâ"
## [3,] "leaders" "global" "and" "i" "months"
## [4,] "iâ" "law" "continue" "strong" "school"
## [5,] "weâ" "issue" "weâ" "insurance" "white"
## [6,] "it" "care" "iâ" "nations" "national"
## [7,] "lost" "decades" "percent" "so" "world"
## [8,] "makes" "world" "citizens" "life" "gun"
## [9,] "human" "rights" "program" "nation" "challenges"
## [10,] "set" "the" "talk" "health" "respect"
## [11,] "generation" "means" "action" "politics" "trade"
## [12,] "thereâ" "week" "public" "workers" "syria"
## [13,] "goal" "stronger" "issues" "top" "effort"
## [14,] "business" "democrats" "save" "tough" "debt"
## [15,] "senate" "call" "pay" "hope" "political"
## [16,] "differences" "west" "fear" "africa" "term"
## [17,] "protect" "decision" "veterans" "share" "businesses"
## [18,] "enforcement" "technology" "hold" "federal" "market"
## [19,] "peace" "deal" "ensure" "and" "we"
## [20,] "future" "sense" "decisions" "opportunity" "youâ"
## [,6] [,7] [,8] [,9]
## [1,] "â" "â" "â" "â"
## [2,] "jobs" "itâ" "security" "congress"
## [3,] "house" "economy" "world" "canâ"
## [4,] "responsibility" "war" "crisis" "international"
## [5,] "families" "bring" "military" "itâ"
## [6,] "single" "class" "youâ" "terrorist"
## [7,] "start" "doesnâ" "team" "money"
## [8,] "forces" "weâ" "students" "economic"
## [9,] "war" "happen" "health" "stand"
## [10,] "home" "question" "build" "israel"
## [11,] "didnâ" "spending" "intelligence" "but"
## [12,] "true" "republicans" "efforts" "speak"
## [13,] "friends" "energy" "provide" "democracy"
## [14,] "oil" "iraq" "faith" "country"
## [15,] "resources" "move" "countries" "ago"
## [16,] "recognize" "system" "isil" "power"
## [17,] "forward" "free" "dollars" "community"
## [18,] "threat" "helping" "father" "begin"
## [19,] "future" "conflict" "party" "nation"
## [20,] "peaceful" "proud" "and" "ground"
## [,10]
## [1,] "â"
## [2,] "weâ"
## [3,] "support"
## [4,] "progress"
## [5,] "information"
## [6,] "tax"
## [7,] "weapons"
## [8,] "russia"
## [9,] "chance"
## [10,] "region"
## [11,] "home"
## [12,] "fair"
## [13,] "nations"
## [14,] "commitment"
## [15,] "sanctions"
## [16,] "reforms"
## [17,] "prime"
## [18,] "approach"
## [19,] "children"
## [20,] "entire"
The following code is for the Republicans.
#R
tokens = word_tokenizer(token_corp_party_no_stopsword_R )
it = itoken(tokens, progressbar = FALSE)
v = create_vocabulary(it)
v = prune_vocabulary(v, term_count_min = 30, doc_proportion_max = 0.2)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
lda_model_R = LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr_R =
lda_model_R$fit_transform(x = dtm, n_iter = 2500,
convergence_tol = 0.001, n_check_convergence = 25,
progressbar = FALSE)
## INFO [17:37:46.618] early stopping at 800 iteration
## INFO [17:37:47.656] early stopping at 75 iteration
lda_model_R$get_top_words(n = 20, lambda = 0.7)
## [,1] [,2] [,3] [,4] [,5]
## [1,] "war" "house" "clinton" "trump" "hard"
## [2,] "gonna" "lives" "military" "care" "terror"
## [3,] "policy" "call" "change" "million" "single"
## [4,] "real" "hundred" "women" "win" "talking"
## [5,] "talked" "choice" "days" "support" "veterans"
## [6,] "public" "foreign" "east" "history" "bad"
## [7,] "peace" "country" "security" "thousands" "immigration"
## [8,] "hell" "obama" "disaster" "bill" "law"
## [9,] "proud" "god" "hear" "borders" "movement"
## [10,] "secretary" "leaders" "heard" "massive" "line"
## [11,] "totally" "terrorism" "campaign" "happen" "trade"
## [12,] "job" "children" "true" "threat" "millions"
## [13,] "killed" "create" "federal" "administration" "run"
## [14,] "cities" "rights" "bernie" "called" "free"
## [15,] "remember" "pay" "countries" "china" "winning"
## [16,] "wasn" "worse" "money" "defend" "coming"
## [17,] "folks" "mexico" "tax" "top" "started"
## [18,] "york" "nuclear" "tremendous" "strong" "violence"
## [19,] "home" "hillary" "times" "hampshire" "democracy"
## [20,] "process" "terrible" "reform" "provide" "close"
## [,6] [,7] [,8] [,9] [,10]
## [1,] "jobs" "didn" "country" "nations" "iraq"
## [2,] "love" "nation" "â" "stop" "understand"
## [3,] "government" "won" "world" "deals" "companies"
## [4,] "percent" "build" "happened" "incredible" "anymore"
## [5,] "middle" "money" "business" "taking" "taxes"
## [6,] "talk" "iraqi" "wrong" "energy" "police"
## [7,] "pretty" "nice" "politicians" "deal" "tough"
## [8,] "special" "stand" "matter" "russia" "beautiful"
## [9,] "regime" "world" "bush" "families" "safe"
## [10,] "human" "washington" "bless" "folks" "reason"
## [11,] "weeks" "building" "question" "congress" "terrorists"
## [12,] "paying" "illegal" "power" "trump" "hillary"
## [13,] "cruz" "left" "leadership" "border" "south"
## [14,] "enemy" "liberty" "donald" "department" "system"
## [15,] "north" "ago" "force" "government" "billion"
## [16,] "protect" "send" "hillary" "save" "ted"
## [17,] "obamacare" "speech" "countries" "â" "common"
## [18,] "businesses" "qaeda" "respect" "freedom" "family"
## [19,] "bit" "including" "week" "bring" "guys"
## [20,] "lose" "child" "drugs" "peace" "hey"
top_terms_by_topic_LDA <- function(input_text, # should be a column from a dataframe
plot = T, # return a plot? TRUE by default
number_of_topics = 4) # number of topics (4 by default)
{
# create a corpus (type of object expected by tm) and document term matrix
Corpus <- Corpus(VectorSource(input_text)) # make a corpus object
DTM <- DocumentTermMatrix(Corpus) # get the count of words/document
# remove any empty rows in our document term matrix (if there are any
# we'll get an error when we try to run our LDA)
unique_indexes <- unique(DTM$i) # get the index of each unique value
DTM <- DTM[unique_indexes,] # get a subset of only those indexes
# perform LDA & get the words/topic in a tidy text format
lda <- LDA(DTM, k = number_of_topics, control = list(seed = 1234))
topics <- tidy(lda, matrix = "beta")
# get the top 25 terms for each topic
top_terms <- topics %>% # take the topics data frame and..
group_by(topic) %>% # treat each topic as a different group
top_n(25, beta) %>% # get the top 25 most informative words
ungroup() %>% # ungroup
arrange(topic, -beta) # arrange words in descending informativeness
# if the user asks for a plot (TRUE by default)
if(plot == T){
# plot the top terms for each topic in order
top_terms %>% # take the top terms
mutate(term = reorder(term, beta)) %>% # sort terms by beta value
ggplot(aes(term, beta, fill = factor(topic))) + # plot beta by theme
geom_col(show.legend = FALSE) + # as a bar plot
facet_wrap(~ topic, scales = "free") + # with each topic in a separate plot
labs(x = NULL, y = "Beta") + # no x label, change y label
coord_flip() # turn bars sideways
}else{
# if the user does not request a plot
# return a list of sorted terms instead
return(top_terms)
}
}
The following code is for the Democrats.
top_terms_by_topic_LDA(FullData_cleaned_D$document_stemmed, number_of_topics = 5)
The following code is for the Republicans.
top_terms_by_topic_LDA(FullData_cleaned_R$document_stemmed, number_of_topics = 5)
We also applied another method, structural topic modeling (STM), although it does not work well on this data set due to its limitations. The topics of war, health care, and the economy appear, which means both parties discussed them frequently. We plot the figure with the index of each topic. It can be concluded that the two parties have different attitudes, values, and policies towards the three topics, which is consistent with common sense and our previous results.
The two parties are in conflict over health care issues. One of Obama's most notable political achievements was reforming American health care with the Affordable Care Act, commonly known as Obamacare, which ensured that the large majority of Americans became covered by insurance. Trump was against it and wanted to abolish the health care bill. In fact, the two parties have different platforms on the health care system: the Republican Party values private healthcare services and a low degree of government interference, while the Democratic Party values equal access to some form of government-supported healthcare [6].
However, it should be noticed that there is an enormous difference on the topic of war. Traditionally, we would think the two parties share a consensus that creating regional conflicts is acceptable because of industry interests and overseas benefits. We think our result is influenced by Trump, because his election platform focused on peace and strategic contraction; by contrast, several military operations occurred during the Obama and Bush presidencies.
processed <- textProcessor(FullData_cleaned$document_stop_word, metadata =FullData_cleaned)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
## Removing 8471 of 19029 terms (8471 of 352492 tokens) due to frequency
## Your corpus now has 834 documents, 10558 terms and 344021 tokens.
docs <- out$documents
vocab <- out$vocab
meta <-out$meta
First_STM <- stm(documents = out$documents, vocab = out$vocab,
K = 10, prevalence =~ Party ,
max.em.its = 75, data = out$meta,
init.type = "Spectral", verbose = FALSE)
plot(First_STM)
#findThoughts(First_STM, texts = FullData_cleaned$content,
# n = 2, topics = 3)
#findingk <- searchK(out$documents, out$vocab, K = c(4:15),
# prevalence =~ Party, data = meta, verbose=FALSE)
#plot(findingk)
predict_topics<-estimateEffect(formula = 1:10 ~ Party, stmobj = First_STM, metadata = out$meta, uncertainty = "Global")
plot(predict_topics, covariate = "Party", model = First_STM, topics = c(4,2,6),
method = "difference", cov.value1 = "D", cov.value2 = "R",
main = "Effect of D vs. R", xlab = "More Republican ... More Democratic")
## 7. Word plot
According to the word plot results, we can see that the Democratic Party's speeches mostly focus on people's lives, such as education and health care, and also on war, while the Republican Party focuses more on the political system.
dfmat_toks_party_nostop <- dfm(token_corp_party_no_stopsword_R)
print(dfmat_toks_party_nostop)
## Document-feature matrix of: 406 documents, 17,122 features (97.86% sparse) and 8 docvars.
## features
## docs ability , abraham abused abusers abuses abusing act addition address
## text1 2 2313 1 1 1 3 1 2 1 1
## text2 2 2310 1 0 1 3 1 2 1 1
## text3 2 1069 0 0 0 0 0 1 1 0
## text4 1 1377 0 0 0 0 0 0 0 0
## text5 0 254 0 0 0 0 0 0 0 0
## text6 2 1061 0 0 0 0 0 1 1 0
## [ reached max_ndoc ... 400 more documents, reached max_nfeat ... 17,112 more features ]
dfmat_toks_party_nostop <- dfm(token_corp_party_no_stopsword_D)
print(dfmat_toks_party_nostop)
## Document-feature matrix of: 428 documents, 24,130 features (97.47% sparse) and 8 docvars.
## features
## docs abandoned , ability abundance accept account achieve act action
## text1 1 991 2 1 1 1 1 1 1
## text2 0 347 0 0 0 0 0 4 1
## text3 0 727 0 0 0 0 1 1 0
## text4 0 428 0 0 0 0 0 3 0
## text5 0 1665 0 0 1 0 0 4 1
## text6 0 229 1 0 0 0 0 0 0
## features
## docs additional
## text1 1
## text2 0
## text3 0
## text4 1
## text5 3
## text6 0
## [ reached max_ndoc ... 422 more documents, reached max_nfeat ... 24,120 more features ]
#Democratic Party
unigram_toks_party <- tokens_ngrams(token_corp_party_no_stopsword_D, n=1)
unigram_dfm_party <- dfm(unigram_toks_party)
unigram_freq_party <-textstat_frequency(unigram_dfm_party)
# Plot wordcloud to show most frequent words
textplot_wordcloud(unigram_dfm_party, max_words = 200,
ordered_color = TRUE)
#Republican Party
unigram_toks_party <- tokens_ngrams(token_corp_party_no_stopsword_R, n=1)
unigram_dfm_party <- dfm(unigram_toks_party)
unigram_freq_party <-textstat_frequency(unigram_dfm_party)
# Plot wordcloud to show most frequent words
textplot_wordcloud(unigram_dfm_party, max_words = 200,
ordered_color = TRUE)
In the classification step, we first put 70% of the data set into the training group and the remaining 30% into the test group. We then built three different classifiers that predict which party issued a speech from its text content, and compared how well these three classifiers predict.
set.seed(123)
ind_train <- lapply(split(seq(1:nrow(FullData_cleaned)), FullData_cleaned$Party), function(x) sample(x, floor(.7*length(x))))
ind_test <- mapply(function(x,y) setdiff(x,y), x = split(seq(1:nrow(FullData_cleaned)), FullData_cleaned$Party), y = ind_train)
test <- FullData_cleaned[unlist(ind_test),]
train <- FullData_cleaned[unlist(ind_train),]
corp_party_test <- corpus(test, text_field = "document_stemmed")
corp_party_train <- corpus(train, text_field = "document_stemmed")
toks_party_test <- tokens(corp_party_test)
toks_party_train <- tokens(corp_party_train)
dfmt_party_train <- dfm(toks_party_train)
dfmt_party_test <- dfm(toks_party_test)
The Naive Bayes classifier was the first one we used; it greatly simplifies learning by presuming that features are independent given the class. Although independence is generally a bad assumption, in practice naive Bayes frequently outperforms more advanced classifiers. Numerous real-world applications, such as text classification, medical diagnosis, and system performance management, have shown that Naive Bayes works well [Domingos & Pazzani, 1997; Hellerstein, Jayram & Rish, 2000; Mitchell, 1997].
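Concretely, with c ∈ {D, R} denoting the party and w_1, …, w_n the terms of a speech, the classifier picks the class that maximizes the posterior under the independence assumption:
$$\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c),$$
where the class priors P(c) and the per-term likelihoods P(w_i | c) are estimated from the training document-feature matrix (the uniform priors of 0.5 for D and R appear in the model summary below).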
tmod_nb_train <- textmodel_nb(dfmt_party_train, dfmt_party_train@docvars[["Party"]])
summary(tmod_nb_train)
##
## Call:
## textmodel_nb.dfm(x = dfmt_party_train, y = dfmt_party_train@docvars[["Party"]])
##
## Class Priors:
## (showing first 2 elements)
## D R
## 0.5 0.5
##
## Estimated Feature Scores:
## abandon , abbigal abroad access account achiev acknowledg
## D 3.771e-05 0.4813 7.252e-06 0.0001059 0.0003191 0.0003510 0.0004264 1.813e-04
## R 8.300e-05 0.4771 2.515e-06 0.0001006 0.0001685 0.0003043 0.0003169 2.767e-05
## action activist acut administr adopt advanc advic
## D 0.0006455 3.481e-05 1.160e-05 0.0004424 3.626e-05 0.0003539 1.044e-04
## R 0.0003697 2.012e-05 1.006e-05 0.0007118 5.533e-05 0.0002842 6.288e-05
## advocaci affect afford afraid africa afternoon ag
## D 1.596e-05 0.0001073 0.0004467 6.237e-05 4.438e-04 1.117e-04 0.0002422
## R 2.515e-06 0.0001107 0.0001811 8.803e-05 7.294e-05 7.545e-05 0.0001459
## agenc agreement ahead allegedli allianc amaz america
## D 0.0001654 0.0003597 0.0003322 5.802e-06 2.132e-04 8.268e-05 0.0001305
## R 0.0001232 0.0002188 0.0003798 7.545e-06 7.294e-05 6.665e-04 0.0000327
## americaâ
## D 3.757e-04
## R 7.545e-06
dfmat_matched <- dfm_match(dfmt_party_test, features = featnames(dfmt_party_train))
actual_class <- dfmat_matched@docvars[["Party"]]
predicted_class <- predict(tmod_nb_train, newdata = dfmat_matched)
tab_class <- table(actual_class, predicted_class)
tab_class
## predicted_class
## actual_class D R
## D 127 2
## R 19 103
confusionMatrix(tab_class, mode = "everything")
## Confusion Matrix and Statistics
##
## predicted_class
## actual_class D R
## D 127 2
## R 19 103
##
## Accuracy : 0.9163
## 95% CI : (0.875, 0.9475)
## No Information Rate : 0.5817
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8319
##
## Mcnemar's Test P-Value : 0.0004803
##
## Sensitivity : 0.8699
## Specificity : 0.9810
## Pos Pred Value : 0.9845
## Neg Pred Value : 0.8443
## Precision : 0.9845
## Recall : 0.8699
## F1 : 0.9236
## Prevalence : 0.5817
## Detection Rate : 0.5060
## Detection Prevalence : 0.5139
## Balanced Accuracy : 0.9254
##
## 'Positive' Class : D
##
According to the results, the Naive Bayes classifier has an overall prediction accuracy of about 92%, which is not bad, and correctly assigns 98.5% of the Democratic speeches to the appropriate category, although the corresponding rate for Republican speeches is only about 84%.
The second classifier we employed was the linear Support Vector Machine (SVM), which has emerged as one of the most popular machine learning techniques for the high-dimensional sparse data frequently encountered in applications like text classification, word-sense disambiguation, and drug design (Joachims, 2006).
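For reference, the soft-margin linear SVM solves the standard optimization problem (y_i ∈ {-1, +1} encoding the party, x_i the feature vector of speech i, and C trading off margin width against training errors):
$$\min_{\mathbf{w},\, b} \;\; \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i(\mathbf{w}^{\top} \mathbf{x}_i + b)\bigr).$$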
tmod_svm_train <- textmodel_svm(dfmt_party_train, dfmt_party_train@docvars[["Party"]])
#summary(tmod_svm_train)
actual_class_svm <- dfmat_matched@docvars[["Party"]]
predicted_class_svm <- predict(tmod_svm_train, newdata = dfmat_matched)
tab_class_svm <- table(actual_class_svm, predicted_class_svm)
tab_class_svm
## predicted_class_svm
## actual_class_svm D R
## D 124 5
## R 7 115
confusionMatrix(tab_class_svm, mode = "everything")
## Confusion Matrix and Statistics
##
## predicted_class_svm
## actual_class_svm D R
## D 124 5
## R 7 115
##
## Accuracy : 0.9522
## 95% CI : (0.918, 0.9751)
## No Information Rate : 0.5219
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9043
##
## Mcnemar's Test P-Value : 0.7728
##
## Sensitivity : 0.9466
## Specificity : 0.9583
## Pos Pred Value : 0.9612
## Neg Pred Value : 0.9426
## Precision : 0.9612
## Recall : 0.9466
## F1 : 0.9538
## Prevalence : 0.5219
## Detection Rate : 0.4940
## Detection Prevalence : 0.5139
## Balanced Accuracy : 0.9524
##
## 'Positive' Class : D
##
According to the results, this classifier outperforms the first one in overall prediction accuracy, reaching 95%, and predicts the correct party with high accuracy whether the speech is Democratic or Republican (96% for Democratic vs 94% for Republican speeches).
The final classifier we employed was the logistic regression classifier, a traditional statistical tool that has recently gained popularity in machine learning due to its close resemblance to the support vector machine (Vapnik, 1999); it is frequently employed in text classification.
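The model estimates the class probability through the logistic function (standard formulation, with x a speech's feature vector):
$$P(y = \mathrm{D} \mid \mathbf{x}) = \frac{1}{1 + \exp\bigl(-(\mathbf{w}^{\top}\mathbf{x} + b)\bigr)},$$
and textmodel_lr fits a penalized (lasso-regularized, via glmnet) version of this model, which is why the summary below reports the Lambda Min and Lambda 1se regularization values.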
tmod_lr_train <- textmodel_lr(dfmt_party_train, dfmt_party_train@docvars[["Party"]])
summary(tmod_lr_train)
##
## Call:
## textmodel_lr.dfm(x = dfmt_party_train, y = dfmt_party_train@docvars[["Party"]])
##
## Lambda Min:
## [1] 0.006521
##
## Lambda 1se:
## [1] 0.01653
##
## Estimated Feature Scores:
## (Intercept) abandon , abbigal abroad access account achiev acknowledg action
## R 1.148 0 0 0 0 0 0 0 0 0
## activist acut administr adopt advanc advic advocaci affect afford afraid
## R 0 0 0 0 0 0 0 0 0 0
## africa afternoon ag agenc agreement ahead allegedli allianc amaz america
## R 0 0 0 0 0 0 0 0 0 0
actual_class_lr <- dfmat_matched@docvars[["Party"]]
predicted_class_lr <- predict(tmod_lr_train, newdata = dfmat_matched)
tab_class_lr <- table(actual_class_lr, predicted_class_lr)
tab_class_lr
## predicted_class_lr
## actual_class_lr D R
## D 121 8
## R 1 121
confusionMatrix(tab_class_lr, mode = "everything")
## Confusion Matrix and Statistics
##
## predicted_class_lr
## actual_class_lr D R
## D 121 8
## R 1 121
##
## Accuracy : 0.9641
## 95% CI : (0.933, 0.9835)
## No Information Rate : 0.5139
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9283
##
## Mcnemar's Test P-Value : 0.0455
##
## Sensitivity : 0.9918
## Specificity : 0.9380
## Pos Pred Value : 0.9380
## Neg Pred Value : 0.9918
## Precision : 0.9380
## Recall : 0.9918
## F1 : 0.9641
## Prevalence : 0.4861
## Detection Rate : 0.4821
## Detection Prevalence : 0.5139
## Balanced Accuracy : 0.9649
##
## 'Positive' Class : D
##
According to the findings, this classifier, with a prediction accuracy of 96%, is the best of the three. It also successfully categorizes Republican speeches with an accuracy of 99%.
For the sentiment analysis results, we can see that speeches from the Democrats are more positive than speeches from the Republicans in general. For the Democrats, most speeches expressed very positive information, except around 2013 and 2016. For the Republicans, many speeches expressed relatively positive information: speeches during the Bush presidency are mostly very positive, whereas speeches during the Trump presidency are only somewhat positive rather than strongly positive.
We also find that there is a difference between the two methods, especially for negative scores. It may be that negative sentiment is difficult to reliably detect with dictionary approaches, possibly due to nuanced linguistic markers (e.g., sarcasm) that require advanced algorithms to be detected [5]. Tidytext does not provide its own default dictionary; at its core, it strives to pre-process input text, which is then analyzed through any input dictionary. Tidytext provides functions for converting text into a "one-token-per-document-per-row" format, which eases text analysis.
Vader, by contrast, features multiple sub-dictionaries and considers word order and degree modifiers to give a continuous score, which we can use to classify speeches in a more fine-grained way (e.g., "very positive", "slightly positive", "somewhat positive"). It performs well in sentiment analyses of speech. We also found the R implementation of Vader to take more time to compute compared to Tidytext.
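As a minimal illustration (a toy sentence, not from our corpus), get_vader() from the same package scores a single text and returns the continuous compound value that vader_df() gave us per speech:
library(vader)
# "compound" is the continuous overall score in [-1, 1]; word order
# and degree modifiers such as "very" influence it
get_vader("The economy is doing very well, but many families still struggle")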
In short, we accept the result from Vader as the final result of the speech sentiment analysis in our project.
According to the topic modeling results, the Democrats prioritize the healthcare system while the Republicans prioritize voting and elections. Despite having different points of view on these issues, both parties discuss the war and the health care crisis.
Finally, based on the classification results, the last classifier, the logistic regression classifier, has the highest prediction accuracy in distinguishing which party made a speech.
There are some limitations to our work. The first is that some uninformative text remains after scraping and the stopword list is not exhaustive, which negatively affects our topic modeling: the topic tables contain some uninformative words, like 'state'. The second limitation is that Trump's speeches only cover 2016 to 2018, which is short, so our data set is not balanced in time; the sentiment analysis might be influenced by that. The last and most important limitation is that validation is lacking in STM, which means our STM results might not be so reliable.
This project discovered that a) the Democrats prioritize the healthcare system while the Republicans prioritize voting and elections; b) overall, Democratic speeches are more positive than Republican speeches; c) the logistic regression classifier has the highest prediction accuracy in determining which party delivered a speech.
[1] Chaudhry, H., Javed, Y., Kulsoom, F., Mehmood, Z., Khan, Z., Shoaib, U., & Janjua, S. (2021). Sentiment analysis of before and after elections: Twitter data of U.S. election 2020. Electronics, 10(17), 2082. https://www.mdpi.com/2079-9292/10/17/2082
[2] Liu, B. (2015). Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge: Cambridge University Press. doi:10.1017/CBO9781139084789. https://www.cambridge.org/core/books/sentiment-analysis/3F0F24BE12E66764ACE8F179BCDA42E9
[3] Ribeiro, F. N., Araújo, M., Gonçalves, P., Benevenuto, F., & Gonçalves, M. A. (2016). SentiBench - a benchmark comparison of state-of-the-practice sentiment analysis methods. EPJ Data Science. https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-016-0085-1
[4] Borchers, C., Rosenberg, J. M., Gibbons, B., Burchfield, M. A., & Fischer, C. (2021). To scale or not to scale: Comparing popular sentiment analysis dictionaries on educational Twitter data. Fourteenth International Conference on Educational Data Mining (EDM 2021). https://educationaldatamining.org/EDM2021/virtual/static/pdf/EDM21_paper_122.pdf
[5] Riloff, E., Qadir, A., Surve, P., De Silva, L., Gilbert, N., & Huang, R. (2013). Sarcasm as contrast between a positive sentiment and negative situation. https://aclanthology.org/D13-1066.pdf
[8] Kim, D., Seo, D., Cho, S., & Kang, P. (2019). Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec. Information Sciences, 477, 15-29. https://doi.org/10.1016/j.ins.2018.10.006
[9] Roberts, M. E., Stewart, B. M., & Airoldi, E. M. https://scholar.princeton.edu/sites/default/files/bstewart/files/stm.pdf
[10] Rish, I. (2001, August). An empirical study of the naive Bayes classifier. IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence (Vol. 3, No. 22, pp. 41-46). https://www.cc.gatech.edu/home/isbell/classes/reading/papers/Rish.pdf
[11] Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2), 103-130. https://link.springer.com/article/10.1023/A:1007413511361
[12] Hellerstein, J. L., Jayram, T. S., & Rish, I. (2000). Recognizing end-user transactions in performance management (pp. 596-602). Hawthorne, NY: IBM Thomas J. Watson Research Division. https://www.aaai.org/Papers/AAAI/2000/AAAI00-091.pdf
[13] Mitchell, T. M. (1997). Machine Learning. McGraw Hill.
[14] Joachims, T. (2006, August). Training linear SVMs in linear time. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 217-226). https://dl.acm.org/doi/10.1145/1150402.1150429
[15] Vapnik, V. (1999). The Nature of Statistical Learning Theory. Springer Science & Business Media. https://books.google.se/books?id=sna9BaxVbj8C