Introduction

In this project, we examine whether Americans support the impeachment inquiry into President Trump. We draw on three different sources: a Washington Post timeline, scraped to trace the road to impeachment; polling data from FiveThirtyEight; and tweets collected from Twitter. Using three independent sources was meant to reduce bias in assessing whether Americans support impeachment.

Loading libraries

library(knitr)
library(tm)
library(dplyr)
library(twitteR)
library(wordcloud)
library(wordcloud2)
library(ggwordcloud)
library(tidyverse)
library(tidytext)
library(DT)
library(scales)
library(ggthemes)
library(RColorBrewer)
library(RCurl)
library(XML)
library(stringr)
library(ggplot2)
library(htmlwidgets)
library(rvest)
library(data.table)
library(kableExtra)
library(DBI)
#library(RMySQL)
library(readr)
library(lattice)

Scraping the data from the Washington Post

Road to impeachment

The House of Representatives is engaged in a formal impeachment inquiry of President Trump. It is focused on his efforts to secure specific investigations in Ukraine that carried political benefits for him, including aides allegedly tying those investigations to official U.S. government concessions. To build a timeline of the events leading up to the Trump-Ukraine impeachment inquiry, we scraped the following page:

https://www.washingtonpost.com/graphics/2019/politics/trump-impeachment-timeline

#First we create an empty data frame for our function to fill. We call it ukraine.
ukraine <- data.frame( Title = character(),
                       Description = character(),
                       stringsAsFactors=FALSE
                       ) 
url <- "https://www.washingtonpost.com/graphics/2019/politics/trump-impeachment-timeline/"
var <- read_html(url)
title <- var %>% 
    html_nodes("div.pg-card .pg-card-title") %>%
    html_text()
#title
description <- var %>%
    html_nodes("div.pg-card .pg-card-description") %>%
    html_text()
description <-  gsub(pattern = "\\[\\]", replacement = "", description)
description <- stringr::str_replace(description, '\\*', '')  # remove the first asterisk in each description
description <-  gsub(pattern = "\n", replacement = "", description)
ukraine <- rbind(ukraine, as.data.frame(cbind(title,description)))

## Warning in cbind(title, description): number of rows of result is not a
## multiple of vector length (arg 1)

ukraine <- ukraine %>% mutate(id = row_number())
ukraine <- ukraine[1:215,]
head(ukraine) %>% kable() %>% kable_styling()

title	description	id
February 22, 2014	Ukrainian President Viktor Yanukovych is ousted from power during a popular uprising in the country. He flees to Russia. After his ouster, Ukrainian officials begin a wide-ranging investigation into corruption in the country.	1
March 7, 2014	Lev Parnas, eventually an associate of former New York City mayor Rudolph W. Giuliani, has his first known interaction with Donald Trump at a golf tournament in Florida.	2
March 1, 2014	Russia invades the Ukrainian peninsula of Crimea, annexing it.	3
May 13, 2014	Hunter Biden, a son of then-U.S. Vice President Joe Biden, joins the board of the Ukrainian energy company Burisma Holdings. It is owned by oligarch Mykola Zlochevsky, one of several subjects of the Ukrainian corruption probe.	4
May 25, 2014	Petro Poroshenko is elected president of Ukraine.	5
February 10, 2015	Viktor Shokin becomes Ukraine’s prosecutor general.	6

tail(ukraine) %>% kable() %>% kable_styling()

	title	description	id
210	November 10, 2019	Hill testifies.	210
211	November 13, 2019	Kent testifies.	211
212	November 15, 2019	McKinley testifies and explains his resignation. “I was disturbed by the implication that foreign governments were being approached to procure negative information on political opponents,” McKinley says. “I was convinced that this would also have a serious impact on Foreign Service morale and the integrity of our work overseas.”	212
213	November 19, 2019	Sondland testifies, saying any pressure he applied on Ukraine to investigate Burisma came before he knew the case involved the Bidens. (He claims this despite Giuliani’s efforts and the Bidens’ proximity to them being in the news by early May.) Sondland says he is making that distinction “because I believe I testified that it would be improper” to push for such political investigations. Asked whether it would be illegal, Sondland says: “I’m not a lawyer, but I assume so.”	213
214	November 20, 2019	Trump announces Perry will resign by the end of the year.	214
215	November 21, 2019	Mulvaney in a news conference momentarily confirms a quid pro quo with Ukraine. “[Did Trump] also mention to me, in the past, that the corruption related to the DNC server?” Mulvaney said. “Absolutely, no question about that. But that’s it. And that’s why we held up the money. . . . The look back to what happened in 2016 certainly was part of the thing that he was worried about in corruption with that nation. And that is absolutely appropriate.” Mulvaney later issues a statement trying to reverse course, saying there actually was no connection.	215

In the chunk above, we first create an empty data frame, then select the relevant HTML nodes (the card titles and descriptions) and scrape their text into it.
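
The cbind() warning above arises because the title and description vectors have different lengths, so values are recycled and rows can drift out of alignment. Below is a minimal sketch of one way to avoid this, assuming each timeline entry sits in its own div.pg-card container. The inline HTML is a made-up stand-in for the real page, and the "Undated entry" card is hypothetical. The idea is to select the cards first and then pull exactly one title and one description per card with html_node(), which yields NA when a piece is missing.

```r
library(rvest)

# Hypothetical stand-in for the timeline page: three cards, the second has no description
page <- read_html('<html><body>
  <div class="pg-card"><p class="pg-card-title">May 25, 2014</p>
    <p class="pg-card-description">Petro Poroshenko is elected president of Ukraine.</p></div>
  <div class="pg-card"><p class="pg-card-title">Undated entry</p></div>
  <div class="pg-card"><p class="pg-card-title">February 10, 2015</p>
    <p class="pg-card-description">Viktor Shokin becomes prosecutor general.</p></div>
</body></html>')

# Select the card containers first, then take exactly one node per card
cards <- html_nodes(page, "div.pg-card")
timeline <- data.frame(
  title       = html_text(html_node(cards, ".pg-card-title")),
  description = html_text(html_node(cards, ".pg-card-description")),
  stringsAsFactors = FALSE
)
timeline$id <- seq_len(nrow(timeline))
timeline
```

Because both columns come from the same set of cards, they always have the same length, and a missing description shows up as NA instead of shifting later rows.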

Get all event text from the timeline

# Concatenate every event description into a single string
allDescriptions <- paste0(as.character(ukraine$description), collapse = "")
# Strip double quotes, empty brackets, and runs of underscores or hyphens
allDescriptions <- gsub(pattern = "\"", replacement = "", allDescriptions)
allDescriptions <- gsub(pattern = "\\[\\]", replacement = "", allDescriptions)
allDescriptions <- gsub(pattern = "__", replacement = "", allDescriptions)
allDescriptions <- gsub(pattern = "--", replacement = "", allDescriptions)
#allDescriptions

Create the corpus and clean it up

With the data in hand, we clean it using the tm package: the text is loaded into a corpus with Corpus(), then lowercased and stripped of numbers, punctuation, stop words, and a few unwanted high-frequency words.

# Build a corpus from the combined description text
words <- Corpus(VectorSource(allDescriptions))
# Use tm to lowercase the text and remove numbers, punctuation, stop words, and a few unwanted high-frequency words
words <- tm_map(words, content_transformer(tolower))
words <- tm_map(words, removeNumbers)
words <- tm_map(words, removePunctuation)
words <- tm_map(words, removeWords, stopwords("english"))
words <- tm_map(words, removeWords, c("will","according", "later", "say", "says", "said", "saying", "tells", "also", "—-", "__" ))
#inspect(words)

Create a term-document matrix for the corpus

#Build a term-document matrix and dataframe d to show frequency of words
tdm <- TermDocumentMatrix(words)
m <- as.matrix(tdm)
#desc(m)
# head(m, 20) %>% kable() %>% kable_styling()
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d,8) %>% kable() %>% kable_styling()

	word	freq
trump	trump	110
ukraine	ukraine	74
zelensky	zelensky	70
taylor	taylor	65
sondland	sondland	52
house	house	39
call	call	38
giuliani	giuliani	37
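
To make the cleaning and counting steps concrete, here is a minimal sketch on a two-sentence toy corpus (the sentences are invented for illustration). It uses VCorpus rather than Corpus and wraps tolower in content_transformer(), the wrapper tm provides for plain functions. Both choices are assumptions of this sketch and may differ from what the report ran.

```r
library(tm)

# Toy input, invented for illustration
toy <- c("Trump calls Zelensky in 2019.",
         "Zelensky says Trump asked about the call!")

corpus <- VCorpus(VectorSource(toy))
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, removeWords, c("says"))             # drop a custom unwanted word

# Term-document matrix, then total frequency per term across documents
tdm_toy  <- TermDocumentMatrix(corpus)
freq_toy <- sort(rowSums(as.matrix(tdm_toy)), decreasing = TRUE)
d_toy    <- data.frame(word = names(freq_toy), freq = freq_toy)
d_toy
```

Note that "call" and "calls" are counted as separate terms because no stemming is applied, which is also true of the frequency table above.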

Visualize the Text Events

#wordcloud2(d, size = 0.7)
wordcloud2(d, size = 1, color = "random-light", backgroundColor = "grey")

Display word frequencies

ggplot(head(d, 25), aes(reorder(word, freq),freq)) +
  geom_bar(stat = "identity", fill = "#7300AB") +  #03DAC6   #6200EE
  labs(title = "Road To Impeachment words frequency",
       x = "Words", y = "Frequency") +
  geom_text(aes(label=freq), vjust=0.4, hjust= 1.2, size=3, color="white")+
  coord_flip()

Show me the Tweets!!

ggplot(d, aes(label = word, size = freq)) +  # scale each word by its frequency
  geom_text_wordcloud_area(
    mask = png::readPNG("t.png"),
    rm_outside = TRUE, color="skyblue"
  ) +
  scale_size_area(max_size = 10) +
  theme_minimal()

## Some words could not fit on page. They have been removed.

Poll dataset from fivethirtyeight.com

Do Americans support impeaching Trump? Reference: https://projects.fivethirtyeight.com/impeachment-polls

poll <-  read.csv("https://raw.githubusercontent.com/ekhahm/datascience/master/impeachment-polls.csv")
head(poll)

##       Start       End                 Pollster Sponsor SampleSize Pop
## 1 6/28/2019  7/1/2019 ABC News/Washington Post               1008   a
## 2 4/22/2019 4/25/2019 ABC News/Washington Post               1001   a
## 3 1/21/2019 1/24/2019 ABC News/Washington Post               1001   a
## 4 8/26/2018 8/29/2018 ABC News/Washington Post               1003   a
## 5  6/8/2019 6/12/2019                   Civiqs               1559  rv
## 6 5/28/2019 5/31/2019                 CNN/SSRS               1006   a
##   tracking
## 1       NA
## 2       NA
## 3       NA
## 4       NA
## 5       NA
## 6       NA
##                                                                                                                                                                                              Text
## 1 Based on what you know, do you think Congress should or should not begin impeachment proceedings that could lead to Trump being removed from office? Do you feel that way strongly or somewhat?
## 2 Based on what you know, do you think Congress should or should not begin impeachment proceedings that could lead to Trump being removed from office? Do you feel that way strongly or somewhat?
## 3 Based on what you know, do you think Congress should or should not begin impeachment proceedings that could lead to Trump being removed from office? Do you feel that way strongly or somewhat?
## 4 Based on what you know, do you think Congress should or should not begin impeachment proceedings that could lead to Trump being removed from office? Do you feel that way strongly or somewhat?
## 5                                              Do you think the House of Representatives should open an impeachment inquiry to determine if President Donald Trump should be removed from office?
## 6                                              Based on what you have read or heard, do you believe that President Trump should be impeached and removed from office, or don't you feel that way?
##             Category Include. Yes No Unsure Rep.Sample Rep.Yes Rep.No
## 1  begin_proceedings      yes  37 59      4        232       7     87
## 2  begin_proceedings      yes  37 56      6        260      10     87
## 3  begin_proceedings      yes  40 55      6        240       7     90
## 4  begin_proceedings      yes  49 46      5        251      15     82
## 5      begin_inquiry      yes  43 51      5        483       5     93
## 6 impeach_and_remove      yes  41 54      5        342       6     93
##   Dem.Sample Dem.Yes Dem.No Ind.Sample Ind.Yes Ind.No
## 1        292      61     36        373      37     59
## 2        290      62     29        360      36     59
## 3        320      64     30        370      42     53
## 4        331      75     21        371      49     46
## 5        577      77     15        499      41     53
## 6        272      76     18        392      35     59
##                                                                                                                                                           URL
## 1 https://games-cdn.washingtonpost.com/notes/prod/default/documents/2557e081-f90a-4c44-a04e-9c98f04bb725/note/d1660489-1b82-43c7-afce-812d2861ecf7.pdf#page=1
## 2        https://games-cdn.washingtonpost.com/notes/prod/default/documents/873ceb77-ad0f-439a-891b-d440139189d0/note/fae3467f-5c96-41b7-99ef-cd49f752e038.pdf
## 3                                                                                       langerresearch.com/wp-content/uploads/1204a2TrumpInvestigations-1.pdf
## 4                                                               https://www.langerresearch.com/wp-content/uploads/1200a1TrumpandtheMuellerInvestigation-1.pdf
## 5                                                                                https://civiqs.com/documents/Civiqs_DailyKos_monthly_banner_book_2019_06.pdf
## 6                                                                                 https://cdn.cnn.com/cnn/2019/images/06/01/rel7a.-.trump,.investigations.pdf
##   Notes
## 1      
## 2      
## 3      
## 4      
## 5      
## 6

Reliable pollsters

According to FiveThirtyEight’s Pollster Ratings, the three most reliable of the fifteen pollsters are Marist College, SurveyUSA, and Emerson College, each with a grade of A- or better. Here we analyze and visualize these three pollsters' survey questions asking whether Congress should impeach, or impeach and remove, Trump.

Reference: https://projects.fivethirtyeight.com/pollster-ratings

Tidying data

poll$Start <- format(as.Date(poll$Start, format="%m/%d/%Y"))
poll1 <- poll %>%
          filter(Pollster %in% c("Marist College", "Emerson College", "SurveyUSA")) %>%  # keep the three top-rated pollsters
          filter(str_detect(Category, "impeach")) %>%  # keep only the impeach categories
          gather("Answer", "percent", Yes, No, Unsure) %>%  # wide to long: one row per answer
          select(Start, End, Answer, percent)%>%
          group_by(Start, Answer)%>%
          summarise(percent = mean(percent))%>%
          arrange(Start)
      
head(poll1)

## # A tibble: 6 x 3
## # Groups:   Start [2]
##   Start      Answer percent
##   <chr>      <chr>    <dbl>
## 1 2017-07-17 No        42  
## 2 2017-07-17 Unsure    15  
## 3 2017-07-17 Yes       42  
## 4 2019-10-03 No        47.5
## 5 2019-10-03 Unsure     4  
## 6 2019-10-03 Yes       49
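
The wide-to-long reshaping above can also be written by column name with tidyr's pivot_longer(). The sketch below assumes tidyr 1.0.0 or later and dplyr 1.0.0 or later (for the .groups argument). The two-row data frame is invented, with values chosen so that the averages match the 2019-10-03 rows shown above.

```r
library(dplyr)
library(tidyr)

# Toy polls, invented for illustration
toy_poll <- data.frame(
  Start  = c("2019-10-03", "2019-10-03"),
  Yes    = c(48, 50),
  No     = c(47, 48),
  Unsure = c(5, 3),
  stringsAsFactors = FALSE
)

toy_long <- toy_poll %>%
  pivot_longer(cols = c(Yes, No, Unsure),
               names_to = "Answer", values_to = "percent") %>%  # one row per answer
  group_by(Start, Answer) %>%
  summarise(percent = mean(percent), .groups = "drop") %>%      # average across polls
  arrange(Start, Answer)
toy_long
```

Selecting columns by name rather than by position (11:13) keeps the code correct if the CSV gains or reorders columns.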

Visualization by date with ggplot2

gg <- ggplot(poll1, aes(x= Start, y=percent, fill= Answer))+
       geom_bar(aes(fill=Answer), stat="identity", position="dodge",
                    color="white", width=0.85)
gg <- gg + geom_text(aes(label=percent),hjust=-0.15,
                     position=position_dodge(width=0.8), size=3)
gg <- gg + coord_flip()
gg <- gg + labs(x="Start_date", y= "percent", title="Do you support the impeachment of President Trump?")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))


gg

Impeachment opinion survey from Ipsos by month

Ipsos has the largest total sample size of any pollster in the dataset. We subset the Ipsos polls and keep the categories containing the word “impeach” to look specifically at people’s opinions about impeachment.
The poll categories fall into two groups, begin and impeach. The begin categories are polls asking whether the impeachment process should begin; the impeach categories, such as impeach_and_remove, are polls asking whether Congress should impeach, or impeach and remove, Trump.

Tidying data

poll_Ipsos <- poll %>%
          filter(Pollster == "Ipsos") %>%  # filter down to Ipsos
          filter(str_detect(Category, "impeach")) %>%  # keep only the impeach categories
          gather("Answer", "count", Yes, No, Unsure) %>%  # wide to long: one row per answer
          select(Start, End, Answer, count)%>% 
          arrange(desc(Start))%>%
          separate(Start, c("start_year", "start_month", "start_day"), sep = "-") 

head(poll_Ipsos)

##   start_year start_month start_day       End Answer count
## 1       2019          12        02 12/3/2019    Yes    44
## 2       2019          12        02 12/3/2019    Yes    45
## 3       2019          12        02 12/3/2019     No    42
## 4       2019          12        02 12/3/2019     No    45
## 5       2019          12        02 12/3/2019 Unsure    13
## 6       2019          12        02 12/3/2019 Unsure    10

Visualization

gg1 <- ggplot(poll_Ipsos, aes(x = start_day, y = count, color= Answer)) +
      geom_point()+theme_minimal() +
      facet_wrap( ~ start_month ) +
      labs(title = "Impeachment survey from Ipsos 2019",
           x = "Start_date", y = "percent") + theme_bw(base_size = 15)+
      theme(axis.text.x = element_blank(),axis.ticks = element_blank())
gg1

## Warning: Removed 1 rows containing missing values (geom_point).

Population analysis

The poll dataset covers three different populations: all adults (a), likely voters (lv), and registered voters (rv). In this section, we analyze how each of these populations views the president’s impeachment.

Tidying data

poll %>%
  group_by(Pop)%>%
  summarise(sum =sum(SampleSize)) #calculating sample size for each population

## # A tibble: 3 x 2
##   Pop      sum
##   <fct>  <int>
## 1 a     497574
## 2 lv      9716
## 3 rv    233059

poll2 <- poll %>%
          filter(str_detect(Category, "impeach")) # subset category = impeach

# Calculate the average percentage for each answer within each population
poll3 <- poll2 %>%
          filter(Pop == "a")%>%
          summarise(Yes = mean(Yes), No = mean(No), Unsure = mean(Unsure, na.rm = TRUE))

poll4 <- poll2 %>%
          filter(Pop == "lv")%>%
          summarise(Yes = mean(Yes), No = mean(No), Unsure = mean(Unsure, na.rm = TRUE))

poll5 <- poll2 %>%
          filter(Pop == "rv")%>%
          summarise(Yes = mean(Yes), No = mean(No), Unsure = mean(Unsure, na.rm = TRUE))

# creating a table 
poll_pop <- rbind("all adults"= poll3, "likely voters"=poll4, "registered voters"=poll5)
poll_pop

##                        Yes       No   Unsure
## all adults        43.55243 43.30278 12.86902
## likely voters     47.62000 45.46000  7.90000
## registered voters 43.79573 45.05897 11.16325
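
The table above gives each poll equal weight regardless of its sample size. As a point of comparison, the sketch below computes both the simple mean and a sample-size-weighted mean per population in a single grouped summary. The weighting is an illustrative alternative, not something the report did, and the numbers are invented.

```r
library(dplyr)

# Toy polls, invented for illustration
toy_pop <- data.frame(
  Pop        = c("a", "a", "rv", "rv"),
  SampleSize = c(1000, 3000, 500, 1500),
  Yes        = c(40, 50, 44, 48),
  stringsAsFactors = FALSE
)

by_pop <- toy_pop %>%
  group_by(Pop) %>%
  summarise(
    Yes_mean     = mean(Yes, na.rm = TRUE),                          # every poll counts equally
    Yes_weighted = weighted.mean(Yes, w = SampleSize, na.rm = TRUE)  # larger polls count more
  )
by_pop
```

Grouping by Pop also replaces the three separate filter-and-summarise blocks with one pipeline, at the cost of losing the descriptive row names used in the table above.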

Support for impeachment by party

Lastly, we investigate support for impeaching Trump by party (Republicans, Democrats, and independents). We take the Fox News and CNN/SSRS polls and compare the results. The two pollsters have similar total sample sizes, around 800 each. The visualizations are grouped by party.

Fox News

Tidying data

poll6 <- poll %>%
          select(Start, End, Pollster, Rep.Yes, Rep.No, Dem.Yes, Dem.No, Ind.Yes, Ind.No) %>%
          filter(Pollster == "Fox News") %>%
          gather("Answer", "percent",4:9) %>%
          separate(Answer, c("Party", "YesNo"))%>% # split names such as Rep.Yes at the non-alphanumeric character (".")
          arrange(desc(Start))

head(poll6)

##        Start        End Pollster Party YesNo percent
## 1 2019-10-27 10/30/2019 Fox News   Rep   Yes       8
## 2 2019-10-27 10/30/2019 Fox News   Rep    No      87
## 3 2019-10-27 10/30/2019 Fox News   Dem   Yes      86
## 4 2019-10-27 10/30/2019 Fox News   Dem    No       9
## 5 2019-10-27 10/30/2019 Fox News   Ind   Yes      38
## 6 2019-10-27 10/30/2019 Fox News   Ind    No      47
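
Splitting column names such as Rep.Yes into a party and an answer can also be done during the reshape itself. Below is a minimal sketch using pivot_longer() with names_sep, assuming tidyr 1.0.0 or later. The one-row data frame reuses the six values from the head(poll6) output above and treats them as a single poll, which is an assumption.

```r
library(tidyr)

# One wide row, using the values printed in the head(poll6) output
fox_wide <- data.frame(
  Start   = "2019-10-27",
  Rep.Yes = 8,  Rep.No = 87,
  Dem.Yes = 86, Dem.No = 9,
  Ind.Yes = 38, Ind.No = 47,
  stringsAsFactors = FALSE
)

fox_long <- pivot_longer(
  fox_wide,
  cols      = -Start,
  names_to  = c("Party", "YesNo"),  # two new columns from each old column name
  names_sep = "\\.",                # split on the literal dot
  values_to = "percent"
)
fox_long
```

This does the work of gather() followed by separate() in one call.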

Visualization with boxplot

ggplot(poll6, aes(x = YesNo, y = percent, fill = YesNo)) + geom_boxplot() +
facet_wrap(~ Party, ncol = 5)+
labs(title = "Impeachment opinion by party from Fox News",x = "Answer", y = "percent") + theme_bw(base_size = 15)

CNN/SSRS

Tidying data

poll7 <- poll %>%
          filter(Pollster == "CNN/SSRS") %>%
          filter(str_detect(Category, "impeach"))%>%
          select(Start, End, Pollster, Rep.Yes, Rep.No, Dem.Yes, Dem.No, Ind.Yes, Ind.No) %>%
          gather("Answer", "percent",4:9) %>%
          separate(Answer, c("Party", "YesNo"))%>%
          arrange(desc(Start))

head(poll7)

##        Start        End Pollster Party YesNo percent
## 1 2019-11-21 11/24/2019 CNN/SSRS   Rep   Yes      10
## 2 2019-11-21 11/24/2019 CNN/SSRS   Rep    No      87
## 3 2019-11-21 11/24/2019 CNN/SSRS   Dem   Yes      90
## 4 2019-11-21 11/24/2019 CNN/SSRS   Dem    No       6
## 5 2019-11-21 11/24/2019 CNN/SSRS   Ind   Yes      47
## 6 2019-11-21 11/24/2019 CNN/SSRS   Ind    No      45

Visualization with boxplot

ggplot(poll7, aes(x = YesNo, y = percent, fill = YesNo)) + geom_boxplot() +
facet_wrap(~ Party, ncol = 5)+
labs(title = "Impeachment opinion by party from CNN/SSRS",x = "Answer", y = "percent") + theme_bw(base_size = 15)

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

#polls <- read_csv('https://raw.githubusercontent.com/habibkhan89/607-finalproject/master/impeachment-polls.csv?token=ANBSKRDMOMTIFDN7YZW4J6S554ACO')
polls <- read_csv('impeachment-polls.csv')

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Start = col_character(),
##   End = col_character(),
##   Pollster = col_character(),
##   Sponsor = col_character(),
##   Pop = col_character(),
##   tracking = col_logical(),
##   Text = col_character(),
##   Category = col_character(),
##   `Include?` = col_character(),
##   URL = col_character(),
##   Notes = col_character()
## )

## See spec(...) for full column specifications.

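# Note: despite its name, percent is the Yes/No ratio; a value of 1 means an even split
polls$percent <- polls$Yes/polls$No
polls$EndDate <- as.Date(polls$End, "%m/%d/%Y")
ggplot(data = polls) +
  geom_point(aes(x = EndDate, y = percent,
  size = SampleSize,
  colour = factor(Category))) + 
  geom_hline(yintercept=1)  # reference line where Yes equals No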

ggplot(data = polls) + 
  geom_point(aes(x = EndDate, y = percent)) +
  geom_smooth(data = polls, 
            aes(x = EndDate, y = percent))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

polls$wgtpct <- polls$percent * log(polls$SampleSize)
ggplot(data = polls) + 
  geom_point(aes(x = EndDate, y = wgtpct)) +
  geom_smooth(data = polls, 
              aes(x = EndDate, y = wgtpct, 
                  size = polls$SampleSize))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = polls) + 
  geom_point(aes(x = EndDate, y = wgtpct)) +
  geom_smooth(data = polls, 
              aes(x = EndDate, y = wgtpct, 
                  size = polls$SampleSize), method = "lm")

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year

## The following object is masked from 'package:base':
## 
##     date

# Days elapsed since January 1, 2017, used as the time predictor
ref_date <- dmy("01-01-2017")
polls$alldays <- difftime(polls$EndDate,ref_date,units = "days")
lmpct <- lm(polls$percent ~ polls$alldays)
summary(lmpct)

## 
## Call:
## lm(formula = polls$percent ~ polls$alldays)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55683 -0.10214  0.01528  0.09931  0.86974 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.138e-01  3.647e-02   19.57  < 2e-16 ***
## polls$alldays 3.330e-04  4.178e-05    7.97 1.42e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1831 on 433 degrees of freedom
## Multiple R-squared:  0.1279, Adjusted R-squared:  0.1259 
## F-statistic: 63.52 on 1 and 433 DF,  p-value: 1.42e-14

# Annualized change in the Yes/No ratio: daily slope from the summary above times 365
365*3.330e-04

## [1] 0.121545
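
The multiplication above turns the fitted daily slope into a yearly change in the Yes/No ratio. The sketch below does the same calculation directly from a fitted model, so the coefficient does not have to be retyped by hand. The data are synthetic, generated from a known line (intercept 0.7, slope 0.0003 per day), purely to show the mechanics.

```r
# Synthetic data lying exactly on the line: ratio = 0.7 + 0.0003 * days
ref_date <- as.Date("2017-01-01")
end_date <- ref_date + c(100, 400, 700, 1000)
toy_trend <- data.frame(
  days  = as.numeric(difftime(end_date, ref_date, units = "days")),
  ratio = 0.7 + 0.0003 * c(100, 400, 700, 1000)
)

fit <- lm(ratio ~ days, data = toy_trend)

daily_slope   <- unname(coef(fit)["days"])  # change in the ratio per day
annual_change <- 365 * daily_slope          # change in the ratio per year
annual_change
```

Converting the difftime to a plain number of days before fitting also makes the units of the slope explicit.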

Extracting tweets from twitter using API

Finally, we collect tweets from Twitter using the twitteR package. We originally wanted to restrict the tweets to specific periods around key events in the impeachment case, but our level of API access did not allow that. The chunk below authenticates with Twitter so that tweets can be retrieved.

## [1] "Using direct authentication"

Extracting tweets and converting into dataframe

We search for tweets using the keywords impeachment, whistleblower, and Ukraine. Our first attempts returned a large number of retweets, so we excluded retweets from the query to get more reliable results.

# Extract tweets about the impeachment inquiry using a few trending keywords, excluding retweets
tweets <- searchTwitter('impeachment, whistleblower, Ukraine, -filter:retweets', n=2000, lang = 'en')

## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 2000 tweets were requested but the
## API can only return 395

# Converting tweets into dataframe
tweets_df <- twListToDF(tweets)
tweets_df2 <- tweets_df$text # This vector contains only the tweet text

# Reading the csv file from local directory
tweets2 <- read.csv('tweets_only2.csv', row.names=NULL, stringsAsFactors = FALSE, header=TRUE)

Data cleaning and text mining

Again we used Corpus to clean the data and mine the text. A number of unnecessary words had to be excluded, since they would otherwise have produced meaningless results.

words <- Corpus(VectorSource(tweets2$x)) # Build the corpus from the tweet text; x is the name of the text column in the imported CSV
words <- tm_map(words, tolower)

## Warning in tm_map.SimpleCorpus(words, tolower): transformation drops
## documents

words <- tm_map(words, removeNumbers)

## Warning in tm_map.SimpleCorpus(words, removeNumbers): transformation drops
## documents

words <- tm_map(words, removePunctuation)

## Warning in tm_map.SimpleCorpus(words, removePunctuation): transformation
## drops documents

words <- tm_map(words, stripWhitespace)

## Warning in tm_map.SimpleCorpus(words, stripWhitespace): transformation
## drops documents

words <- tm_map(words, removeWords, stopwords("english"))

## Warning in tm_map.SimpleCorpus(words, removeWords, stopwords("english")):
## transformation drops documents

words <- tm_map(words, removeWords, c("will", "got", "admits<U+0085>", "want", "say")) # Extend this list later to remove any other unnecessary words

## Warning in tm_map.SimpleCorpus(words, removeWords, c("will", "got",
## "admits<U+0085>", : transformation drops documents

# words <- tm_map(words, gsub, pattern = 'Impeached', replacement= 'Impeachment') # This line of code will replace impeached with impeachment
# Now let's build a matrix and dataframe to show the number of words to make wordcloud
tdm <- TermDocumentMatrix((words))
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word= names(v), freq=v)
head(d,20)

##                            word freq
## ukraine                 ukraine  169
## impeachment         impeachment  153
## trump                     trump  134
## whistleblower     whistleblower  133
## aid                         aid   42
## get                         get   34
## gop                         gop   33
## realdonaldtrump realdonaldtrump   32
## another                 another   30
## schiff                   schiff   29
## president             president   28
## trumps                   trumps   28
## senator                 senator   27
## complaint             complaint   26
## breaks                   breaks   25
## attempt                 attempt   25
## defend                   defend   24
## house                     house   24
## news                       news   23
## possibility         possibility   23

Data visualization

We visualize the data as a word cloud to see the word frequencies, and then use sentiment analysis to see what Americans think about President Trump’s impeachment inquiry.

Wordcloud

set.seed(3322)
wordcloud(words=d$word, freq=d$freq, min.freq=10, max.words =200, random.order=FALSE, rot.per=0.05, colors=brewer.pal(8,"Dark2"))  # the Dark2 palette has at most 8 colors

Sentiment analysis

# Using sentiment analysis to see people's reaction
imp_tdm <- tidy(tdm)
imp_senti <- imp_tdm %>% 
  inner_join(get_sentiments("bing"), by=c(term="word"))
imp_senti %>% 
  count(sentiment, term, wt=count) %>% 
  ungroup() %>% 
  filter(n>= 3) %>% 
  mutate(n= ifelse(sentiment=="negative", -n, n)) %>% 
  mutate(term=reorder(term,n)) %>% 
  ggplot(aes(term, n, fill=sentiment))+ geom_bar(stat="identity")+ylab("People's sentiment on Trump's Impeachment")+coord_flip()

The sentiment analysis suggests that people are upset and angry about the Ukraine case. With broader API access, we could have compared sentiment across different time periods and filtered tweets by location, which would have made the project more specific to the question “Do Americans support Trump’s impeachment?”
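
The join-and-count logic of the sentiment chunk can be sketched on a few rows. To stay self-contained, the sketch uses a tiny hand-made lexicon in place of get_sentiments("bing"). The words, counts, and sentiment labels are all invented for illustration.

```r
library(dplyr)

# Term counts in the shape produced by tidy(tdm): term, document, count
toy_terms <- data.frame(
  term     = c("scandal", "support", "corrupt", "support", "ukraine"),
  document = c("1", "1", "2", "2", "2"),
  count    = c(2, 1, 3, 2, 4),
  stringsAsFactors = FALSE
)

# Hypothetical mini-lexicon standing in for the bing lexicon
toy_lexicon <- data.frame(
  word      = c("scandal", "corrupt", "support"),
  sentiment = c("negative", "negative", "positive"),
  stringsAsFactors = FALSE
)

toy_senti <- toy_terms %>%
  inner_join(toy_lexicon, by = c(term = "word")) %>%  # keep only words the lexicon scores
  count(sentiment, term, wt = count) %>%              # total occurrences per word
  mutate(n = ifelse(sentiment == "negative", -n, n))  # negative words get negative bars
toy_senti
```

Words absent from the lexicon, like ukraine here, drop out of the join, so a chart built this way reflects only the scored vocabulary and not the tweets as a whole.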

Conclusion

Taken together, the three data sources show mixed opinion on whether President Trump should be impeached. The Washington Post and Twitter data suggest that people are, overall, angry and upset about the Ukraine case and see it as shameful, but that result is still not clear-cut. The FiveThirtyEight polls show clearly mixed opinion. Overall the polls give nearly the same result, but a closer look shows that Democrats clearly support impeachment while Republicans clearly do not. Because each poll sampled only around 1,000 people, we cannot say exactly whether all Americans want him impeached, but we can clearly say that Democrats want him impeached and Republicans do not. Given the limits on data access, further research on this topic might bring more insights.

Do Americans support President Trump’s impeachment

Randall Thompson, Abdelmalek Hajjam, Eunkyu Hahm, Habib Khan

12/9/2019
