Introduction

Professor Suleiman pointed out that this class has been light on politics so far. I’ve decided to go rogue and make up time by delivering everyone a semester’s worth of political headaches in one project.

Specifically, my goal is to create a sentiment lexicon that can be joined with any text to rank the political stance of the message (and, more subtly, of the message’s author) as either right-leaning (conservative/Republican) or left-leaning (liberal/Democrat). I want to state in advance that equating Democrat with liberal and Republican with conservative is a gross oversimplification of the political spectrum; however, within the scope of this project the terms will be used interchangeably.

The lexicon will be a continuous scale rather than a binary label, ranging from -Inf to Inf: one party’s words are assigned negative scores, the other’s positive scores, and 0 is neutral. The goal is to have equal numbers of words and equal total weightings on each side to avoid introducing a bias toward one party over the other.
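
As a preview of how the finished lexicon is meant to be used, here is a rough sketch (the pol_sent object is built later in this report; the tidytext-style tokenize-join-sum workflow shown is an assumption of this sketch, not the only option):

library(tidytext)
library(dplyr)

# Hypothetical usage: positive totals lean left, negative totals lean right
data.frame(text = "some message to score", stringsAsFactors = FALSE) %>%
  unnest_tokens(word, text) %>%
  inner_join(pol_sent, by = "word") %>%
  summarize(pol_score = sum(score))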

Methodology

This analysis attempts to be fully objective and uses presidential speeches and addresses as the word pool (corpus) from which the lexicon is created. The speeches and addresses chosen were largely delivered to domestic audiences, and narrowly focused topics (e.g., a speech about a single education reform bill) were avoided. Instead, State of the Union and election-victory speeches were prioritized, as they appeared more likely to cover a wide variety of party-platform initiatives. Speech transcripts from multiple online sources (see Appendix) were grouped by President and copied into text (.txt) files for processing.
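
For reference, the transcripts are assumed to sit in the following folder layout (inferred from the DirSource() calls used later; one .txt file per speech):

./Cons/Bush     # George W. Bush
./Cons/Trump    # Donald Trump
./Lib/Clinton   # William Clinton
./Lib/Obama     # Barack Obama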

Presidential Dataset

The four Presidents chosen were:

pres_stats <- data.frame(
    row.names = c("Democrat", "Barack Obama", "William Clinton", "Republican", "George W. Bush", "Donald Trump"),
    Words = c(53165, 26502, 26663, 53080, 26063, 27017))

knitr::kable(pres_stats, booktabs = T)
                 Words
Democrat         53165
Barack Obama     26502
William Clinton  26663
Republican       53080
George W. Bush   26063
Donald Trump     27017

Cleaning the Data

By using at least two Presidents from each party, we limit the number of words that are added to the lexicon purely due to an individual’s speaking style and word selection. The word totals above are the final counts used in the analysis; English stop words (e.g., I, she, it), punctuation, and numbers have been removed using the tm package.

# Packages used throughout the analysis (originally loaded in a setup chunk not shown)
library(tm)         # corpus handling and cleaning
library(tidyverse)  # dplyr, ggplot2, stringr, readr, and the %>% pipe
library(tidytext)   # get_sentiments(), unnest_tokens(), stop_words
library(wordcloud)  # wordcloud() and comparison.cloud()
library(knitr)      # kable()

# Lower-case, strip punctuation/numbers/whitespace, and drop stop words plus corpus-specific terms
clean_corp <- function(corp) {
  corp %>%
    tm_map(tolower) %>%
    tm_map(removePunctuation) %>%
    tm_map(removeNumbers) %>%
    tm_map(stripWhitespace) %>%
    tm_map(removeWords, c(stopwords("english"), "president", "applause", "will"))
}

Presidential Profiles

As part of the project, we’ll display the high-level information found for each President to highlight differences and similarities in their speaking styles and word choices.

Individual Level

Below, each President’s tab contains: a word cloud of his 75 most frequent terms, positive/negative word totals from the “bing” lexicon, an overall “afinn” score, an emotion breakdown from the “nrc” lexicon, and a chart of the top words contributing to positive and negative sentiment.

Bush

con_corp <- Corpus(DirSource("./Cons/Bush"), readerControl = list(language="lat"))
con_corp <- clean_corp(con_corp)

wordcloud(con_corp, random.order=FALSE, scale=c(3, .5), use.r.layout=TRUE,colors = brewer.pal(5, "Dark2"), max.words=75)

# Build a term-document matrix and reshape it into a word/count data frame
con_tdm <- TermDocumentMatrix(con_corp)
con_m <- as.matrix(con_tdm)
con_df <- data.frame(word = row.names(con_m), con_m)
con_df$word <- as.character(con_df$word)
colnames(con_df) <- c("word", "count")
con_df %>% inner_join(get_sentiments("bing")) %>% group_by(sentiment) %>% summarize(sent_ttl = sum(count))
## # A tibble: 2 x 2
##   sentiment sent_ttl
##       <chr>    <dbl>
## 1  negative     1576
## 2  positive     2765
con_df %>% inner_join(get_sentiments("afinn")) %>% summarize(sent_ttl = sum(count * score))
##   sent_ttl
## 1     2458
con_df %>% inner_join(get_sentiments("nrc")) %>% group_by(sentiment) %>% summarize(sent_ttl = sum(count))
## # A tibble: 10 x 2
##       sentiment sent_ttl
##           <chr>    <dbl>
##  1        anger     1038
##  2 anticipation     1756
##  3      disgust      517
##  4         fear     1680
##  5          joy     1600
##  6     negative     2039
##  7     positive     4145
##  8      sadness      931
##  9     surprise      678
## 10        trust     2991
con_df %>% inner_join(get_sentiments("bing")) %>%
  group_by(sentiment) %>%
  top_n(20, count) %>%
  ggplot(aes(reorder(word, count), count, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()

con_tdm2 <- con_tdm

Trump

con_corp <- Corpus(DirSource("./Cons/Trump"), readerControl = list(language="lat"))
con_corp <- clean_corp(con_corp)

wordcloud(con_corp, random.order=FALSE, scale=c(3, .5), use.r.layout=TRUE,colors = brewer.pal(5, "Dark2"), max.words=75)

con_tdm <- TermDocumentMatrix(con_corp)
con_m <- as.matrix(con_tdm)
con_df <- data.frame(word = row.names(con_m), con_m)
con_df$word <- as.character(con_df$word)
colnames(con_df) <- c("word", "count")
con_df %>% inner_join(get_sentiments("bing")) %>% group_by(sentiment) %>% summarize(sent_ttl = sum(count))
## # A tibble: 2 x 2
##   sentiment sent_ttl
##       <chr>    <dbl>
## 1  negative     1142
## 2  positive     2519
con_df %>% inner_join(get_sentiments("afinn")) %>% summarize(sent_ttl = sum(count * score))
##   sent_ttl
## 1     2949
con_df %>% inner_join(get_sentiments("nrc")) %>% group_by(sentiment) %>% summarize(sent_ttl = sum(count))
## # A tibble: 10 x 2
##       sentiment sent_ttl
##           <chr>    <dbl>
##  1        anger      783
##  2 anticipation     1510
##  3      disgust      523
##  4         fear      941
##  5          joy     1283
##  6     negative     1620
##  7     positive     3154
##  8      sadness      936
##  9     surprise      656
## 10        trust     2206
con_df %>% inner_join(get_sentiments("bing")) %>%
  group_by(sentiment) %>%
  top_n(20, count) %>%
  ggplot(aes(reorder(word, count), count, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()

con_tdm2 <- c(con_tdm2, con_tdm)

Clinton

lib_corp <- Corpus(DirSource("./Lib/Clinton"), readerControl = list(language="lat"))  
lib_corp <- clean_corp(lib_corp)

wordcloud(lib_corp, random.order=FALSE, scale=c(3, .5), use.r.layout=TRUE,colors = brewer.pal(5, "Dark2"), max.words=75)

lib_tdm <- TermDocumentMatrix(lib_corp)
lib_m <- as.matrix(lib_tdm)
lib_df <- data.frame(word = row.names(lib_m), lib_m)
lib_df$word <- as.character(lib_df$word)
colnames(lib_df) <- c("word", "count")
lib_df %>% inner_join(get_sentiments("bing")) %>% group_by(sentiment) %>% summarize(sent_ttl = sum(count))
## # A tibble: 2 x 2
##   sentiment sent_ttl
##       <chr>    <dbl>
## 1  negative     1042
## 2  positive     2324
lib_df %>% inner_join(get_sentiments("afinn")) %>% summarize(sent_ttl = sum(count * score))
##   sent_ttl
## 1     2472
lib_df %>% inner_join(get_sentiments("nrc")) %>% group_by(sentiment) %>% summarize(sent_ttl = sum(count))
## # A tibble: 10 x 2
##       sentiment sent_ttl
##           <chr>    <dbl>
##  1        anger      742
##  2 anticipation     1614
##  3      disgust      471
##  4         fear     1132
##  5          joy     1250
##  6     negative     1671
##  7     positive     3541
##  8      sadness      694
##  9     surprise      588
## 10        trust     2362
lib_df %>% inner_join(get_sentiments("bing")) %>%
  group_by(sentiment) %>%
  top_n(20, count) %>%
  ggplot(aes(reorder(word, count), count, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()

lib_tdm2 <- lib_tdm

Obama

lib_corp <- Corpus(DirSource("./Lib/Obama"), readerControl = list(language="lat"))  
lib_corp <- clean_corp(lib_corp)

wordcloud(lib_corp, random.order=FALSE, scale=c(3, .5), use.r.layout=TRUE,colors = brewer.pal(5, "Dark2"), max.words=75)

lib_tdm <- TermDocumentMatrix(lib_corp)
lib_m <- as.matrix(lib_tdm)
lib_df <- data.frame(word = row.names(lib_m), lib_m)
lib_df$word <- as.character(lib_df$word)
colnames(lib_df) <- c("word", "count")
lib_df %>% inner_join(get_sentiments("bing")) %>% group_by(sentiment) %>% summarize(sent_ttl = sum(count))
## # A tibble: 2 x 2
##   sentiment sent_ttl
##       <chr>    <dbl>
## 1  negative     1545
## 2  positive     2151
lib_df %>% inner_join(get_sentiments("afinn")) %>% summarize(sent_ttl = sum(count * score))
##   sent_ttl
## 1     1391
lib_df %>% inner_join(get_sentiments("nrc")) %>% group_by(sentiment) %>% summarize(sent_ttl = sum(count))
## # A tibble: 10 x 2
##       sentiment sent_ttl
##           <chr>    <dbl>
##  1        anger      776
##  2 anticipation     1427
##  3      disgust      448
##  4         fear     1171
##  5          joy     1079
##  6     negative     1788
##  7     positive     3247
##  8      sadness      802
##  9     surprise      524
## 10        trust     2207
lib_df %>% inner_join(get_sentiments("bing")) %>%
  group_by(sentiment) %>%
  top_n(20, count) %>%
  ggplot(aes(reorder(word, count), count, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()

lib_tdm2 <- c(lib_tdm2, lib_tdm)

Group Comparison

Now that we’ve looked at each President individually, let’s compare and contrast the group as a whole. Below is a comparison word cloud, which displays the words each President used significantly more than the others did. For example, near the middle of the graph, the word “going” falls within the Trump grouping and is the largest word visible, conveying that President Trump used this word far more frequently than his peers. The closer to the center a word appears, the more it contrasts with the neighboring groups.

all_tdm2 <- c(con_tdm2, lib_tdm2)

all_m2 <- as.matrix(all_tdm2)
colnames(all_m2) <- c("Bush", "Trump", "Clinton", "Obama")

all_df2 <- data.frame(term = row.names(all_m2), all_m2)
all_df2$term <- as.character(all_df2$term)

comparison.cloud(all_m2, random.order=FALSE, 
                 colors = c("#00B2FF", "red", "#FF0099", "#6600CC"),
                 title.size=3, max.words=500)

Party Profiles

Word Frequency By Party

Next we’re going to combine the Presidents into one matrix and sum each word’s counts for the two political parties from the totals of their respective Presidents. The graph below places one party on each axis, with word counts determining each word’s position in the plot. Here we begin to see how certain words appear more often in the rhetoric of one party than the other.

all_df2 <- all_df2 %>% mutate(republican = Bush + Trump, democrat = Clinton + Obama)
all_df2 %>% filter(democrat + republican > 1) %>% ggplot(aes(democrat, republican)) + geom_text(aes(label = term), position = position_jitter())

Word Sentiment By Party

Before proceeding, it’s worth checking that no bias has emerged in the data. Specifically, has either political party been associated with a majority of words that are subjectively “negative” or “positive”? We’ll use the same graph as above, but first we’ll apply the “bing” sentiment lexicon to negate the counts of words labeled “negative”. We can see that both parties are represented across the spectrum and have fairly even numbers of both highly positive and highly negative words.

all_df2 %>% inner_join(get_sentiments("bing"), by = c("term" = "word")) %>% mutate(republican = ifelse(sentiment == "negative", -republican, republican), democrat = ifelse(sentiment == "negative", -democrat, democrat)) %>%
  ggplot(aes(democrat, republican)) + geom_text(aes(label = term), position = position_jitter())

Constructing the Lexicon

Cleaning Pt. 2

Now it’s time to create the pol_sent (Political Sentiment) lexicon. To create a continuous scale, we negate the word counts of the combined Republican Presidents. Next, words containing fewer than five letters are removed; this was done after observing that most of these were common parts of speech, such as basic nouns and prepositions, that were not relevant to the analysis.
Certain words were also removed from the data frame, including month names, President names, party names, political entities, and current-event buzzwords. Buzzwords such as “Korea”, “nuclear”, and “missile” were removed because they would be attributed to whichever President happened to be handling the situation and would then become an element of his party’s lexicon. That would be a false characterization: such words are a symptom of the time and place a President occupied in history, not of his political stance or speaking style.

mid_words <- c("republican", "republicans", "democrat", "democrats", "america","americans", "american", "party", "united", "states", "people", 
                 "country", "trump", "clinton", "bush", "obama", "north", "korea", "koreas", "nuclear", "missile", "donald",
                 "january", "february", "march", "april", "may", "june", "july", "august", "september", "october", "november", "december")

Score Calculation

To determine which words to assign to each party, we summed the Democratic (positive) and Republican (negative) counts to create a net total. For example, if the word “toast” appeared 20 times in the Republican column and 10 times in the Democratic column, the net score would be -10 (-20 + 10 = -10).
Words with a net value below -5 were split into the Republican subset, and words with a net value above 5 into the Democratic subset. Since net scores varied widely, they had to be standardized: within each subset, words were binned into the quartiles of that subset’s net scores. Once the subsets were merged back together, this produced normalized scores ranging from -4 to 4 (0 is unused).
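
As a toy illustration of the quartile binning used below (made-up net counts, not values from the analysis), the cut()/quantile() combination maps a subset’s net scores onto the integers 1 through 4:

net <- c(6, 7, 9, 12, 20, 35, 60, 114)
as.integer(cut(net, unique(quantile(net, probs = 0:4/4)), include.lowest = TRUE))
## [1] 1 1 2 2 3 3 4 4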

# Drop stop words, short terms, and the excluded words; negate Republican counts
all_df_prep <- all_df2 %>% filter(!term %in% stop_words$word, nchar(term) > 4, !term %in% mid_words) %>% mutate(republican = -republican, net = republican + democrat)

dem_df <- all_df_prep %>% filter(net > 5) %>% select(term, net)
rep_df <- all_df_prep %>% filter(net < -5) %>% select(term, net) %>% mutate(net = -net)
kable(dem_df %>% arrange(desc(net)) %>% head(), caption = "Sample Democratic Terms")
Sample Democratic Terms

term        net
children    114
century     112
challenge    83
tonight      75
crisis       71
congress     69
kable(rep_df %>% arrange(desc(net)) %>% head(), caption = "Sample Republican Terms")
Sample Republican Terms

term       net
freedom    119
nations    109
citizens    70
taxes       68
theyre      68
middle      59
# Bin each subset's net scores into quartiles (1 = weakest association, 4 = strongest)
dem_df2 <- within(dem_df, quartile <- as.integer(cut(net, unique(quantile(net, probs=0:4/4)), include.lowest=TRUE)))

rep_df2 <- within(rep_df, quartile <- as.integer(cut(net, unique(quantile(net, probs=0:4/4)), include.lowest=TRUE)))
rep_df2 <- rep_df2 %>% mutate(quartile = -quartile)  # Republican scores are negated

pol_sent <- rbind(select(dem_df2, term, quartile), select(rep_df2, term, quartile))
colnames(pol_sent) <- c("word", "score")

Final Structure

lex_stats <- data.frame(
    row.names = c("Democratic", "Republican", "Full Lexicon"),
    "Sum of Scores" = c(sum(dem_df2$quartile), sum(rep_df2$quartile), sum(pol_sent$score)),
    Words = c(nrow(dem_df2),nrow(rep_df2),nrow(pol_sent)))

kable(lex_stats)
              Sum.of.Scores  Words
Democratic             1124    482
Republican            -1144    484
Full Lexicon            -20    966

While we are slightly off on alignment (two more Republican words and a net score of -20), it’s within tolerable limits. The lexicon is ready for action!

Calibration Test

Political Sentiment Score

To ensure that we’ve calibrated the lexicon correctly, we’re actually going to start by testing it on a piece that should have little to no political sentiment attached. Prince Harry is getting married! We’ll analyze an Associated Press article about Prince Harry’s engagement to judge whether our lexicon correctly reports a near-neutral political score.

# Read a text file, clean it with tm, and return a word/count data frame
get_news <- function(src) {
  news <- read_file(src)
  news_corp <- Corpus(VectorSource(news), readerControl = list(language = "lat")) %>%
    tm_map(tolower) %>%
    tm_map(removePunctuation) %>%
    tm_map(removeNumbers) %>%
    tm_map(stripWhitespace) %>%
    tm_map(removeWords, c(stopwords("english"), "will"))
  news_tdm <- TermDocumentMatrix(news_corp)
  news_m <- as.matrix(news_tdm)
  news_df <- data.frame(word = row.names(news_m), news_m)
  news_df$word <- as.character(news_df$word)
  news_df
}
news_df <- get_news("./ap.txt")
news_df %>% inner_join(pol_sent) %>% summarize(pol_score = sum(score * X1))
##   pol_score
## 1         1
news_df %>% inner_join(pol_sent) %>%
  ggplot(aes(reorder(word, score * X1), score * X1)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()

Not bad! The reported score (1) is almost perfectly neutral, and judging by the words captured in the graph above, we have a good mix of words leaning both right and left. I’m sure you can see that certain words don’t appear to have much to do with politics. Remember that these were chosen based on Presidential speaking patterns and word choices, in an attempt to detect words more associated with one party than the other.

Emotional Sentiment Score

Let’s make sure that the bias we covered earlier hasn’t snuck into our data. Below we use the “bing” sentiment set to evaluate the number of positive and negative words found, and to display the top contributors to each sentiment. The result shows that this article carries a highly positive message, which contrasts with the neutral score our pol_sent lexicon calculated. This is a good sign, as pol_sent is meant to measure political leaning, not emotion.

news_df %>% inner_join(get_sentiments("bing")) %>% group_by(sentiment) %>% summarize(sum(X1))
## # A tibble: 2 x 2
##   sentiment `sum(X1)`
##       <chr>     <dbl>
## 1  negative         6
## 2  positive        17
news_df %>% inner_join(get_sentiments("bing")) %>%
  group_by(sentiment) %>%
  top_n(20, X1) %>%
  ggplot(aes(reorder(word, X1), X1, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()

Parabolic Curve Method

For another perspective, this last plot shows the distribution of lexicon words in the article, with word count on the x-axis and word score (-4 to 4) multiplied by count on the y-axis. Labels are jittered and overlapping ones are hidden, so only a handful of the words show. The usefulness of this view is in the shape of the plotted points: a neutral text should display a parabola that can be drawn from the top quadrant to the bottom. If all words have low scores, the width of the parabola shrinks but it maintains an even shape as it approaches the x-axis. If the parabola is skewed so that its midpoint sits above or below 0, we can quickly detect that the political sentiment of the text leans one way or the other.

news_df %>% inner_join(pol_sent) %>%
  ggplot(aes(X1 , score * X1)) + 
  geom_text(aes(label = word), size = 3, position = position_jitter()) +
  geom_hline(aes(yintercept = mean(score * X1)), color = "purple")

Twitter

OK, enough testing; let’s take a look at some Twitter feeds. Below we’re going to punch a variety of Twitter handles through pol_sent to see how it evaluates the contents of the text. We’ll look at several different news sources and use only tweets posted under each handle (including any retweets made from that handle).
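
The Twitter authentication chunk is hidden in the rendered output; it presumably resembled the sketch below (credentials are placeholders), which is what prints the “Using direct authentication” message that follows.

library(twitteR)
setup_twitter_oauth(consumer_key = "KEY", consumer_secret = "SECRET",
                    access_token = "TOKEN", access_secret = "TOKEN_SECRET")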

## [1] "Using direct authentication"
# Regex tokenizer pattern that keeps #hashtags and @mentions intact
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"

# Tokenize tweet text: drop quoted tweets, strip links, and remove stop words
clean_twt <- function(df){
  df %>% filter(!str_detect(text, '^"')) %>%
    mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
    unnest_tokens(word, text, token = "regex", pattern = reg) %>%
    filter(!word %in% stop_words$word,
           str_detect(word, "[a-z]")) 
}

# Score each tweet (id) individually and label its leaning
get_ind_twt <- function(df){
  df %>% group_by(id) %>% inner_join(pol_sent) %>% summarize(score = sum(score)) %>%
    mutate(Party = ifelse(score > 0, "Liberal", "Conservative"))
}

# Aggregate word counts across all of a handle's tweets and attach lexicon scores
group_twt <- function(df){
  df %>% group_by(word) %>% summarize(X1 = n()) %>% inner_join(pol_sent)
}

# Plot helpers reused by each news-source section below
plot_id <- function(df) ggplot(df, aes(id, score, fill = Party)) +
  geom_bar(stat = "identity") + labs(x = "") +
  ggtitle("Recent Tweets By Political Sentiment") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

plot_top30 <- function(df) ggplot(df, aes(reorder(word, score * X1), score * X1)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  labs(y = "Contribution to sentiment", x = "") +
  coord_flip()

plot_curve <- function(df) ggplot(df, aes(X1, score * X1)) +
  geom_text(aes(label = word), size = 3, position = position_jitter()) +
  geom_hline(aes(yintercept = mean(df$score * df$X1)), color = "purple")

# Per-tweet, per-word scores tagged with the source handle
# (note: twt_src is read from the calling environment)
merge_df <- function(df) {
  df %>% group_by(id, word) %>% inner_join(pol_sent) %>% summarize(score = sum(score)) %>%
    mutate(Party = ifelse(score > 0, "Liberal", "Conservative")) %>% mutate(Source = twt_src)
}

News Sources

FoxNews

twt_src <- "FoxNews"
twt <- userTimeline(twt_src, n = 1000)
twt_df <- twListToDF(twt)
twt_wrds <- clean_twt(twt_df)
twt_wrds2 <- get_ind_twt(twt_wrds)
plot_id(twt_wrds2)

twt_df2 <- group_twt(twt_wrds)
twt_df2 %>% summarize(pol_score = sum(score * X1))
## # A tibble: 1 x 1
##   pol_score
##       <int>
## 1      -790
twt_df3 <- twt_df2 %>% top_n(30,abs(score*X1))
plot_top30(twt_df3)

plot_curve(twt_df2)

twt_merge <- merge_df(twt_wrds)

comb_twts <- select(twt_merge, Source, Party, id, word,score)

NBC

twt_src <- "NBCNews"
twt <- userTimeline(twt_src, n = 1000)
twt_df <- twListToDF(twt)
twt_wrds <- clean_twt(twt_df)
twt_wrds2 <- get_ind_twt(twt_wrds)
plot_id(twt_wrds2)

twt_df2 <- group_twt(twt_wrds)
twt_df2 %>% summarize(pol_score = sum(score * X1))
## # A tibble: 1 x 1
##   pol_score
##       <int>
## 1      -314
twt_df3 <- twt_df2 %>% top_n(30,abs(score*X1))
plot_top30(twt_df3)

plot_curve(twt_df2)

twt_merge <- merge_df(twt_wrds)

comb_twts <- rbind(comb_twts, twt_merge)

MSNBC

twt_src <- "MSNBC"
twt <- userTimeline(twt_src, n = 1000)
twt_df <- twListToDF(twt)
twt_wrds <- clean_twt(twt_df)
twt_wrds2 <- get_ind_twt(twt_wrds)
plot_id(twt_wrds2)

twt_df2 <- group_twt(twt_wrds)
twt_df2 %>% summarize(pol_score = sum(score * X1))
## # A tibble: 1 x 1
##   pol_score
##       <int>
## 1      -251
twt_df3 <- twt_df2 %>% top_n(30,abs(score*X1))
plot_top30(twt_df3)

plot_curve(twt_df2)

twt_merge <- merge_df(twt_wrds)

comb_twts <- rbind(comb_twts, twt_merge)

CNN

twt_src <- "CNN"
twt <- userTimeline(twt_src, n = 1000)
twt_df <- twListToDF(twt)
twt_wrds <- clean_twt(twt_df)
twt_wrds2 <- get_ind_twt(twt_wrds)
plot_id(twt_wrds2)

twt_df2 <- group_twt(twt_wrds)
twt_df2 %>% summarize(pol_score = sum(score * X1))
## # A tibble: 1 x 1
##   pol_score
##       <int>
## 1       -86
twt_df3 <- twt_df2 %>% top_n(30,abs(score*X1))
plot_top30(twt_df3)

plot_curve(twt_df2)

twt_merge <- merge_df(twt_wrds)

comb_twts <- rbind(comb_twts, twt_merge)

AP

twt_src <- "AP"
twt <- userTimeline(twt_src, n = 1000)
twt_df <- twListToDF(twt)
twt_wrds <- clean_twt(twt_df)
twt_wrds2 <- get_ind_twt(twt_wrds)
plot_id(twt_wrds2)

twt_df2 <- group_twt(twt_wrds)
twt_df2 %>% summarize(pol_score = sum(score * X1))
## # A tibble: 1 x 1
##   pol_score
##       <int>
## 1      -102
twt_df3 <- twt_df2 %>% top_n(30,abs(score*X1))
plot_top30(twt_df3)

plot_curve(twt_df2)

twt_merge <- merge_df(twt_wrds)

comb_twts <- rbind(comb_twts, twt_merge)

Breitbart

twt_src <- "altbright"
twt <- userTimeline(twt_src, n = 1000)
twt_df <- twListToDF(twt)
twt_wrds <- clean_twt(twt_df)
twt_wrds2 <- get_ind_twt(twt_wrds)
plot_id(twt_wrds2)

twt_df2 <- group_twt(twt_wrds)
twt_df2 %>% summarize(pol_score = sum(score * X1))
## # A tibble: 1 x 1
##   pol_score
##       <int>
## 1       -15
twt_df3 <- twt_df2 %>% top_n(30,abs(score*X1))
plot_top30(twt_df3)

plot_curve(twt_df2)

twt_merge <- merge_df(twt_wrds)

comb_twts <- rbind(comb_twts, twt_merge)

BBC

twt_src <- "BBCNews"
twt <- userTimeline(twt_src, n = 1000)
twt_df <- twListToDF(twt)
twt_wrds <- clean_twt(twt_df)
twt_wrds2 <- get_ind_twt(twt_wrds)
plot_id(twt_wrds2)

twt_df2 <- group_twt(twt_wrds)
twt_df2 %>% summarize(pol_score = sum(score * X1))
## # A tibble: 1 x 1
##   pol_score
##       <int>
## 1        16
twt_df3 <- twt_df2 %>% top_n(30,abs(score*X1))
plot_top30(twt_df3)

plot_curve(twt_df2)

twt_merge <- merge_df(twt_wrds)

comb_twts <- rbind(comb_twts, twt_merge)

News Source Comparison

Lastly, let’s take a look at the score distribution for each of the news sources. A more positive score represents more liberal-minded rhetoric, while a more negative score represents more conservative rhetoric. This distribution will always depend heavily on the top issues being covered in the news and should not be used as an indicator of a source’s political preference or bias. Rather, this was just an interesting use of R text-mining and data-gathering packages. Hope you enjoyed the ride, thanks!

comb_twts %>% group_by(Source, id) %>% summarize(score = sum(score)) %>% ggplot(aes(Source, score)) + geom_boxplot()

kable(comb_twts %>% group_by(Source) %>% summarize(Mean_Score = mean(score)) %>% arrange(desc(Mean_Score)),digits = 2)
Source      Mean_Score
BBCNews           0.26
CNN              -0.10
altbright        -0.14
NBCNews          -0.46
FoxNews          -0.66
MSNBC            -0.97
AP               -1.01

Appendix

Full Lexicon

kable(pol_sent %>% arrange(desc(score)), caption = "Full Political Sentiment Lexicon")
Full Political Sentiment Lexicon
word score
action 4
agree 4
balance 4
balanced 4
banks 4
bipartisan 4
budget 4
century 4
challenge 4
challenges 4
chamber 4
change 4
child 4
children 4
childrens 4
china 4
college 4
common 4
communities 4
community 4
congress 4
credit 4
crime 4
crisis 4
decisions 4
deficit 4
differences 4
economic 4
economy 4
education 4
effort 4
efforts 4
europe 4
families 4
fellow 4
financial 4
forward 4
future 4
generation 4
global 4
growth 4
health 4
helped 4
homes 4
house 4
increase 4
internet 4
invest 4
investment 4
issue 4
issues 4
leadership 4
loans 4
longterm 4
markets 4
means 4
million 4
moment 4
national 4
opportunity 4
pakistan 4
parents 4
perfect 4
politics 4
private 4
propose 4
prosperity 4
provide 4
qaeda 4
question 4
racial 4
recession 4
recognize 4
recovery 4
reduce 4
responsibility 4
reverend 4
revolution 4
school 4
schools 4
science 4
sense 4
service 4
simply 4
single 4
solve 4
spending 4
standards 4
start 4
steps 4
streets 4
strengthen 4
stronger 4
students 4
support 4
teachers 4
technology 4
tonight 4
trade 4
troops 4
union 4
values 4
welfare 4
white 4
black 4
forge 4
lines 4
ultimately 4
guantanamo 4
package 4
detainees 4
accountable 3
actions 3
address 3
administration 3
afford 3
affordable 3
answer 3
approach 3
begin 3
bridge 3
building 3
chance 3
checks 3
citizen 3
classroom 3
clean 3
combat 3
consensus 3
consumers 3
corps 3
costs 3
critical 3
decade 3
decline 3
depression 3
difficult 3
direction 3
dollar 3
doubt 3
enormous 3
entire 3
environment 3
expand 3
extraordinary 3
family 3
finish 3
fundamental 3
generations 3
housing 3
hundred 3
ideas 3
incentives 3
individual 3
initiative 3
international 3
investments 3
journey 3
kinds 3
largest 3
lasting 3
learning 3
legacy 3
lives 3
market 3
measure 3
millennium 3
months 3
moved 3
neighborhoods 3
opportunities 3
partners 3
partnership 3
poverty 3
preserve 3
prevent 3
programs 3
proposal 3
public 3
quality 3
raise 3
reach 3
restore 3
review 3
rural 3
sector 3
serving 3
shape 3
shortterm 3
strategy 3
street 3
strongest 3
succeed 3
surplus 3
teacher 3
transparency 3
unprecedented 3
worlds 3
finance 3
fiscal 3
harder 3
immigrants 3
lowest 3
planet 3
waste 3
bosnia 3
discrimination 3
doors 3
exist 3
globalization 3
recommend 3
stimulus 3
summit 3
treaty 3
wright 3
commissions 3
compensation 3
interrogation 3
lending 3
anger 2
assure 2
authority 2
blame 2
built 2
campaign 2
cancer 2
capacity 2
choose 2
churches 2
civil 2
complete 2
confidence 2
courts 2
creating 2
determination 2
discipline 2
earlier 2
economies 2
effective 2
enable 2
enduring 2
ensure 2
european 2
expanded 2
exports 2
federal 2
foundation 2
government 2
ground 2
hands 2
havens 2
includes 2
incomes 2
investing 2
knowledge 2
legitimate 2
lifetime 2
maintain 2
marines 2
mutual 2
pleased 2
police 2
policies 2
proposed 2
proud 2
remain 2
renew 2
rewards 2
rolls 2
saving 2
served 2
spirit 2
spring 2
student 2
summer 2
surely 2
thousand 2
times 2
tomorrow 2
tools 2
truth 2
tuition 2
watching 2
closer 2
michael 2
payments 2
space 2
aging 2
americorps 2
asian 2
boldly 2
brady 2
classrooms 2
efficient 2
endeavor 2
kenya 2
organized 2
richard 2
tobacco 2
trouble 2
whove 2
biden 2
brown 2
document 2
economists 2
elkhart 2
extraordinarily 2
french 2
michelle 2
oversight 2
spectrum 2
sustainable 2
ability 1
accountability 1
africanamerican 1
alliances 1
arguments 1
ashley 1
assistance 1
begins 1
bottom 1
break 1
burden 1
camps 1
capital 1
carry 1
carrying 1
celebrate 1
charter 1
church 1
completely 1
comprehensive 1
conflicts 1
constructive 1
creed 1
crises 1
debate 1
decide 1
defeat 1
designed 1
diplomacy 1
disagree 1
districts 1
divide 1
documents 1
doesnt 1
dreams 1
establish 1
failure 1
falling 1
father 1
finest 1
firms 1
fought 1
founders 1
goals 1
grade 1
grant 1
guarantee 1
guarantees 1
improved 1
india 1
language 1
legislation 1
listening 1
local 1
losing 1
match 1
methods 1
miles 1
mistake 1
mother 1
mothers 1
notion 1
operate 1
partner 1
pension 1
pollution 1
possibility 1
powerful 1
precious 1
presidents 1
pursue 1
raised 1
range 1
rates 1
relationship 1
release 1
reminds 1
report 1
republicans 1
requires 1
research 1
reward 1
rooted 1
safer 1
savings 1
scores 1
senate 1
short 1
sitting 1
social 1
soldiers 1
stood 1
stopped 1
stories 1
struggling 1
table 1
taxpayer 1
teach 1
technologies 1
toxic 1
training 1
traveled 1
treasures 1
trust 1
unemployment 1
weakened 1
worth 1
acknowledge 1
aisle 1
announce 1
computer 1
convention 1
cycle 1
effectively 1
flowing 1
intend 1
joint 1
minimum 1
partnerships 1
reality 1
regulatory 1
shores 1
adversaries 1
alive 1
applaud 1
applied 1
captain 1
commonsense 1
connect 1
conversation 1
deepen 1
dialogue 1
economically 1
empowerment 1
endanger 1
expensive 1
extending 1
graduates 1
greenhouse 1
harness 1
kosovo 1
launching 1
library 1
memories 1
mortgage 1
native 1
privacy 1
publicly 1
recognition 1
repeat 1
responded 1
restoring 1
reverse 1
rivers 1
salaries 1
scholarships 1
shareholders 1
warming 1
blacks 1
coordinated 1
detention 1
diplomatic 1
executives 1
homeowners 1
laying 1
reinvestment 1
sermons 1
strasbourg 1
techniques 1
wrights 1
accept -1
access -1
accomplished -1
accounts -1
active -1
additional -1
adopted -1
afghan -1
agencies -1
agents -1
aggression -1
aggressive -1
ahead -1
allowing -1
alternative -1
amount -1
baghdad -1
ballistic -1
beliefs -1
benefits -1
bigotry -1
broader -1
brought -1
charging -1
chief -1
civilization -1
closely -1
condemned -1
confronted -1
congratulations -1
consequences -1
conservative -1
construction -1
dangerous -1
dealing -1
deepest -1
defeating -1
defended -1
defiance -1
delivered -1
dictators -1
director -1
dominate -1
duties -1
eliminate -1
employees -1
encourages -1
entrepreneurs -1
entry -1
environmental -1
equally -1
event -1
eventually -1
faith -1
field -1
fight -1
frivolous -1
fulfill -1
funding -1
glorious -1
governing -1
heart -1
homeland -1
honorable -1
independent -1
institution -1
judge -1
kuwait -1
lawsuits -1
leader -1
lieutenant -1
looked -1
louisiana -1
loyalty -1
mankind -1
matter -1
mercy -1
minute -1
miracle -1
names -1
opponent -1
optimism -1
organization -1
outcome -1
outdated -1
outlaw -1
palestine -1
pennsylvania -1
permanent -1
permitting -1
personnel -1
political -1
prevail -1
progrowth -1
promise -1
promote -1
protection -1
protects -1
proudly -1
providing -1
punish -1
putting -1
quiet -1
rebuild -1
recently -1
region -1
rejects -1
relations -1
religion -1
responders -1
rulers -1
ruling -1
salute -1
secretary -1
senators -1
share -1
shown -1
souls -1
stability -1
standard -1
stated -1
steven -1
stock -1
strive -1
struggle -1
supplies -1
telling -1
terrorism -1
tyranny -1
ultimate -1
undermine -1
unfairly -1
unlike -1
update -1
virginia -1
voted -1
western -1
wishes -1
wounded -1
affected -1
aliens -1
altleft -1
canada -1
circuit -1
cleanest -1
fence -1
flynn -1
green -1
havent -1
indiana -1
jessica -1
julie -1
keystone -1
mentioned -1
miners -1
offshore -1
pipeline -1
product -1
reserves -1
robert -1
scott -1
selfinflicted -1
southern -1
statue -1
terrific -1
thrilled -1
vetting -1
walmart -1
watched -1
advance -2
allies -2
amazing -2
ambitions -2
assembly -2
attack -2
biological -2
broke -2
capable -2
catastrophic -2
charge -2
cheney -2
choices -2
close -2
compliance -2
confident -2
conflict -2
confront -2
contributions -2
court -2
crucial -2
decision -2
delegates -2
destruction -2
directed -2
elections -2
enemies -2
equipment -2
faithbased -2
folks -2
framework -2
friend -2
growing -2
hardworking -2
hearts -2
honest -2
immediately -2
importantly -2
inauguration -2
information -2
intimidate -2
iraqs -2
joining -2
judges -2
liability -2
lowincome -2
medicare -2
message -2
moral -2
oppressed -2
ownership -2
peoples -2
phase -2
positive -2
power -2
promised -2
quickly -2
repeal -2
replace -2
represent -2
represented -2
respected -2
sanctions -2
saudi -2
signed -2
sovereignty -2
surprised -2
texas -2
travel -2
unfair -2
witnessed -2
wonderful -2
wrote -2
badly -2
bannon -2
disadvantage -2
friendly -2
harold -2
jobkilling -2
protest -2
putin -2
reporters -2
tennessee -2
advantage -3
allowed -3
alqaida -3
announced -3
benefit -3
bless -3
borders -3
broken -3
build -3
business -3
businesses -3
chaos -3
choice -3
civilized -3
commitments -3
compassionate -3
compete -3
correct -3
couple -3
danger -3
death -3
decades -3
declared -3
deliver -3
desire -3
determined -3
dictator -3
disarm -3
disaster -3
elected -3
election -3
encourage -3
expected -3
focus -3
forgotten -3
found -3
friends -3
giving -3
governor -3
happened -3
happy -3
heard -3
highway -3
honored -3
hopes -3
illegal -3
income -3
innocent -3
inspectors -3
israel -3
killers -3
leaving -3
manufacturing -3
medicine -3
mexico -3
morning -3
muslim -3
negotiate -3
officials -3
oppression -3
palestinian -3
peaceful -3
period -3
personal -3
plants -3
pouring -3
prayers -3
prescription -3
pride -3
process -3
produce -3
production -3
promises -3
reforms -3
regulation -3
representative -3
resolution -3
retirement -3
ronald -3
security -3
serve -3
signs -3
sources -3
starting -3
supposed -3
takes -3
taliban -3
theyve -3
thousands -3
threat -3
totally -3
treatment -3
understand -3
unite -3
uranium -3
victims -3
watch -3
whats -3
worker -3
world -3
anymore -3
audience -3
charlottesville -3
classified -3
deals -3
factories -3
fantastic -3
guess -3
happening -3
inaudible -3
ivanka -3
kelly -3
leaks -3
mattis -3
neonazis -3
pipelines -3
restrictions -3
steel -3
theyll -3
account -4
agreement -4
americas -4
attacks -4
billions -4
border -4
bring -4
called -4
calling -4
character -4
cities -4
citizens -4
coalition -4
coming -4
commitment -4
companies -4
compassion -4
competitive -4
control -4
council -4
countries -4
courage -4
coverage -4
current -4
defend -4
democracy -4
democratic -4
deserve -4
didnt -4
dignity -4
dollars -4
excuse -4
foreign -4
freedom -4
governments -4
happen -4
hatred -4
healthcare -4
honor -4
horrible -4
human -4
hussein -4
immigration -4
including -4
incredible -4
infrastructure -4
intelligence -4
iraqi -4
justice -4
laughter -4
leaders -4
level -4
liberty -4
lower -4
marriage -4
massive -4
media -4
meeting -4
middle -4
military -4
millions -4
model -4
money -4
murder -4
nation -4
nations -4
paris -4
peace -4
percent -4
person -4
press -4
reagan -4
reform -4
regime -4
regimes -4
regulations -4
relief -4
remember -4
resolutions -4
respect -4
rights -4
russia -4
saddam -4
senator -4
seniors -4
societies -4
society -4
story -4
taxes -4
terrible -4
terror -4
terrorist -4
terrorists -4
theyre -4
tremendous -4
trillion -4
violence -4
vital -4
wages -4
wealth -4
weapons -4
women -4
workers -4
youre -4
accord -4
approved -4
beautiful -4
dakota -4
obamacare -4
statement -4
steve -4
talking -4
trillions -4