Introduction

In this project, we examine whether Americans support the impeachment inquiry into President Trump. We draw on three data sources: the Washington Post impeachment timeline, which we scraped to trace the road to impeachment; polling data from FiveThirtyEight.com; and tweets collected from Twitter. Using three independent sources is meant to reduce bias in our assessment of public support for impeachment.

Loading libraries

library(knitr)
library(tm)
library(dplyr)
library(twitteR)
library(wordcloud)
library(wordcloud2)
library(ggwordcloud)
library(tidyverse)
library(tidytext)
library(DT)
library(scales)
library(ggthemes)
library(RColorBrewer)
library(RCurl)
library(XML)
library(stringr)
library(ggplot2)
library(htmlwidgets)
library(rvest)
library(data.table)
library(kableExtra)
library(DBI)
library(RMySQL)
library(readr)
library(lattice)

Scraping the data from the Washington Post

Road to impeachment

The House of Representatives is engaged in a formal impeachment inquiry of President Trump. It is focused on his efforts to secure specific investigations in Ukraine that carried political benefits for him, including aides allegedly tying those investigations to official U.S. government concessions. To obtain a timeline of the events leading up to the inquiry, we scraped the following page:

https://www.washingtonpost.com/graphics/2019/politics/trump-impeachment-timeline

#First we create an empty data frame for our function to fill. We call it ukraine.
ukraine <- data.frame( Title = character(),
                       Description = character(),
                       stringsAsFactors=FALSE
                       ) 
url <- "https://www.washingtonpost.com/graphics/2019/politics/trump-impeachment-timeline/"
var <- read_html(url)
title <- var %>% 
    html_nodes("div.pg-card .pg-card-title") %>%
    html_text()
#title
description <- var %>%
    html_nodes("div.pg-card .pg-card-description") %>%
    html_text()
description <-  gsub(pattern = "\\[\\]", replace = "", description)
description <- stringr::str_replace(description, '\\*', '')
description <-  gsub(pattern = "\n", replace = "", description)
ukraine <- rbind(ukraine, as.data.frame(cbind(title,description)))
## Warning in cbind(title, description): number of rows of result is not a
## multiple of vector length (arg 1)
ukraine <- ukraine %>% mutate(id = row_number())
ukraine <- ukraine[1:215,]
head(ukraine) %>% kable() %>% kable_styling()
title | description | id
--- | --- | ---
February 22, 2014 | Ukrainian President Viktor Yanukovych is ousted from power during a popular uprising in the country. He flees to Russia. After his ouster, Ukrainian officials begin a wide-ranging investigation into corruption in the country. | 1
March 7, 2014 | Lev Parnas, eventually an associate of former New York City mayor Rudolph W. Giuliani, has his first known interaction with Donald Trump at a golf tournament in Florida. | 2
March 1, 2014 | Russia invades the Ukrainian peninsula of Crimea, annexing it. | 3
May 13, 2014 | Hunter Biden, a son of then-U.S. Vice President Joe Biden, joins the board of the Ukrainian energy company Burisma Holdings. It is owned by oligarch Mykola Zlochevsky, one of several subjects of the Ukrainian corruption probe. | 4
May 25, 2014 | Petro Poroshenko is elected president of Ukraine. | 5
February 10, 2015 | Viktor Shokin becomes Ukraine’s prosecutor general. | 6
tail(ukraine) %>% kable() %>% kable_styling()
(row) | title | description | id
--- | --- | --- | ---
210 | October 30, 2019 | State Department Inspector General Steve Linick shares with Congress documents that had been sent to the State Department that include conspiracy theories about the Bidens. Giuliani indicates he was responsible for some of the materials, which were apparently sent to State from the White House. | 210
211 | October 30, 2019 | Volker submits to a deposition, sharing text messages (as described above) with Taylor, Sondland, Giuliani and Yermak. He says he never had a quid pro quo communicated to him. | 211
212 | October 31, 2019 | “Mr. President, what exactly did you hope Zelensky would do about the Bidens after your phone call?” Trump is asked by a reporter. | 212
213 | October 31, 2019 | “Well,” he replies, “I would think that, if they were honest about it, they’d start a major investigation into the Bidens. It’s a very simple answer.” | 213
214 | November 4, 2019 | He tells reporters that he also thinks China should launch an investigation involving the Bidens. “And by the way, likewise, China should start an investigation into the Bidens because what happened in China is just about as bad as what happened with Ukraine,” Trump says. | 214
215 | November 4, 2019 | Kent confronts State officials about the claims in Pompeo’s letter, calling them inaccurate, according to his later testimony. He tells one official whose name is redacted: “I said, well, you say that the career foreign services are being intimidated. . . . And I asked him, about whom are you speaking? And he said, you’re asking me to reveal confidential information. And I said, no, I’m not. There are only two career Foreign Service officers who subject to this process. I’m one of them. I’m the only one working at the Department of State, and the other one is Ambassador Yovanovitch, who is teaching at Georgetown.” | 215

In the code above, we first create an empty data frame, then select the relevant HTML nodes and scrape the title and description of each timeline card.
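
The warning printed after cbind(title, description) indicates that the two vectors had different lengths, so R recycled the shorter one and some dates may end up next to the wrong description. A minimal sketch of a length check before binding is shown here. The helper name pair_cards and the example vectors are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: bind titles and descriptions only when they line up.
pair_cards <- function(title, description) {
  if (length(title) != length(description)) {
    stop("title has ", length(title), " entries but description has ",
         length(description), "; check the CSS selectors")
  }
  data.frame(title = title, description = description,
             id = seq_along(title), stringsAsFactors = FALSE)
}

# Made-up example data, not taken from the scraped page.
ok <- pair_cards(c("May 25, 2014", "February 10, 2015"),
                 c("First event.", "Second event."))
print(ok)

# Mismatched lengths now fail loudly instead of being recycled.
bad <- tryCatch(pair_cards(c("a", "b"), c("x", "y", "z")),
                error = function(e) conditionMessage(e))
print(bad)
```

Failing early like this makes a selector mismatch visible at scrape time rather than later in the table.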

Collecting the text of all timeline events

# Concatenate all event descriptions into a single string (no separator, as before)
allDescriptions <- paste0(ukraine$description, collapse = "")
# Strip double quotes, empty brackets, double underscores and double hyphens
allDescriptions <- gsub(pattern = "\"", replacement = "", allDescriptions)
allDescriptions <- gsub(pattern = "\\[\\]", replacement = "", allDescriptions)
allDescriptions <- gsub(pattern = "__", replacement = "", allDescriptions)
allDescriptions <- gsub(pattern = "--", replacement = "", allDescriptions)
#allDescriptions
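
The clean-up step can also be written as a single pattern. The sketch below runs on made-up strings. Joining with a space is an assumption of this sketch rather than what the original code does: the original concatenates without a separator, which can fuse the last word of one event with the first word of the next.

```r
# Made-up event descriptions standing in for ukraine$description.
descriptions <- c("Trump says \"no quid pro quo\".",
                  "Aid is released [] -- finally.")

# Join with a space so neighbouring events do not fuse into one word
# (assumption: differs from the original, which uses no separator).
all_text <- paste(descriptions, collapse = " ")

# One pattern covering double quotes, empty brackets, double underscores,
# and runs of two or more hyphens.
all_text <- gsub("\"|\\[\\]|__|-{2,}", "", all_text)
print(all_text)
```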

Creating and cleaning the corpus

With the text collected, we clean it using the tm package’s Corpus function: we convert it to lowercase and remove numbers, punctuation, and stop words.

# putting the words in vector
words <- Corpus(VectorSource(allDescriptions))
#using tm to remove numbers, punctuation and convert to lowercase. Some high frequency words we do not want are removed.
words <- tm_map(words, tolower)
words <- tm_map(words, removeNumbers)
words <- tm_map(words, removePunctuation)
words <- tm_map(words, removeWords, stopwords("english"))
words <- tm_map(words, removeWords, c("will","according", "later", "say", "says", "said", "saying", "tells", "also", "—-", "__" ))
#inspect(words)

Creating a term-document matrix for the corpus

#Build a term-document matrix and dataframe d to show frequency of words
tdm <- TermDocumentMatrix(words)
m <- as.matrix(tdm)
#desc(m)
# head(m, 20) %>% kable() %>% kable_styling()
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
head(d,50) %>% kable() %>% kable_styling()
(row) | word | freq
--- | --- | ---
trump | trump | 105
ukraine | ukraine | 69
zelensky | zelensky | 66
taylor | taylor | 65
sondland | sondland | 48
giuliani | giuliani | 43
call | call | 37
house | house | 37
lutsenko | lutsenko | 33
ukrainian | ukrainian | 33
president | president | 32
volker | volker | 32
 |  | 32
testimony | testimony | 29
aid | aid | 28
white | white | 28
biden | biden | 27
investigation | investigation | 27
yovanovitch | yovanovitch | 24
meeting | meeting | 23
state | state | 23
bidens | bidens | 22
department | department | 22
officials | officials | 22
parnas | parnas | 21
burisma | burisma | 20
prosecutor | prosecutor | 19
yermak | yermak | 19
whistleblower | whistleblower | 18
corruption | corruption | 17
investigations | investigations | 17
military | military | 17
campaign | campaign | 16
complaint | complaint | 16
time | time | 16
general | general | 15
meets | meets | 14
office | office | 14
want | want | 14
election | election | 13
first | first | 13
foreign | foreign | 13
one | one | 13
poroshenko | poroshenko | 13
trumps | trumps | 13
country | country | 12
intelligence | intelligence | 12
meet | meet | 12
new | new | 12
official | official | 12
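
To make explicit what the term-document matrix and rowSums compute, here is a base R sketch that counts words in a made-up sentence. It does not use tm and skips stop-word removal, so it illustrates the idea rather than replacing the code above.

```r
# Made-up text; the real input is the cleaned timeline text.
text <- "Trump calls Zelensky. Zelensky meets Trump. Trump tweets."

# Lowercase, strip punctuation, split on whitespace.
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+")[[1]]

# Count each token and sort from most to least frequent.
counts <- sort(table(tokens), decreasing = TRUE)
d_demo <- data.frame(word = names(counts), freq = as.vector(counts),
                     stringsAsFactors = FALSE)
print(d_demo)
```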

Visualize the Text Events

#wordcloud2(d, size = 0.7)
wordcloud2(d, size = 1, color = "random-light", backgroundColor = "grey")

Displaying word frequencies

ggplot(head(d, 25), aes(reorder(word, freq),freq)) +
  geom_bar(stat = "identity", fill = "#7300AB") +  #03DAC6   #6200EE
  labs(title = "Road To Impeachment words frequency",
       x = "Words", y = "Frequency") +
  geom_text(aes(label=freq), vjust=0.4, hjust= 1.2, size=3, color="white")+
  coord_flip()

Show me the Tweets!!

ggplot(d, aes(label = word, size=2)) +
  geom_text_wordcloud_area(
    mask = png::readPNG("t.png"),
    rm_outside = TRUE, color="skyblue"
  ) +
  scale_size_area(max_size = 10) +
  theme_minimal()
## Some words could not fit on page. They have been removed.

Poll dataset from fivethirtyeight.com

Do Americans support impeaching Trump? Reference: https://projects.fivethirtyeight.com/impeachment-polls

poll <-  read.csv("https://raw.githubusercontent.com/ekhahm/datascience/master/impeachment-polls.csv")
head(poll)
##       Start       End                 Pollster Sponsor SampleSize Pop
## 1 6/28/2019  7/1/2019 ABC News/Washington Post               1008   a
## 2 4/22/2019 4/25/2019 ABC News/Washington Post               1001   a
## 3 1/21/2019 1/24/2019 ABC News/Washington Post               1001   a
## 4 8/26/2018 8/29/2018 ABC News/Washington Post               1003   a
## 5  6/8/2019 6/12/2019                   Civiqs               1559  rv
## 6 5/28/2019 5/31/2019                 CNN/SSRS               1006   a
##   tracking
## 1       NA
## 2       NA
## 3       NA
## 4       NA
## 5       NA
## 6       NA
##                                                                                                                                                                                              Text
## 1 Based on what you know, do you think Congress should or should not begin impeachment proceedings that could lead to Trump being removed from office? Do you feel that way strongly or somewhat?
## 2 Based on what you know, do you think Congress should or should not begin impeachment proceedings that could lead to Trump being removed from office? Do you feel that way strongly or somewhat?
## 3 Based on what you know, do you think Congress should or should not begin impeachment proceedings that could lead to Trump being removed from office? Do you feel that way strongly or somewhat?
## 4 Based on what you know, do you think Congress should or should not begin impeachment proceedings that could lead to Trump being removed from office? Do you feel that way strongly or somewhat?
## 5                                              Do you think the House of Representatives should open an impeachment inquiry to determine if President Donald Trump should be removed from office?
## 6                                              Based on what you have read or heard, do you believe that President Trump should be impeached and removed from office, or don't you feel that way?
##             Category Include. Yes No Unsure Rep.Sample Rep.Yes Rep.No
## 1  begin_proceedings      yes  37 59      4        232       7     87
## 2  begin_proceedings      yes  37 56      6        260      10     87
## 3  begin_proceedings      yes  40 55      6        240       7     90
## 4  begin_proceedings      yes  49 46      5        251      15     82
## 5      begin_inquiry      yes  43 51      5        483       5     93
## 6 impeach_and_remove      yes  41 54      5        342       6     93
##   Dem.Sample Dem.Yes Dem.No Ind.Sample Ind.Yes Ind.No
## 1        292      61     36        373      37     59
## 2        290      62     29        360      36     59
## 3        320      64     30        370      42     53
## 4        331      75     21        371      49     46
## 5        577      77     15        499      41     53
## 6        272      76     18        392      35     59
##                                                                                                                                                           URL
## 1 https://games-cdn.washingtonpost.com/notes/prod/default/documents/2557e081-f90a-4c44-a04e-9c98f04bb725/note/d1660489-1b82-43c7-afce-812d2861ecf7.pdf#page=1
## 2        https://games-cdn.washingtonpost.com/notes/prod/default/documents/873ceb77-ad0f-439a-891b-d440139189d0/note/fae3467f-5c96-41b7-99ef-cd49f752e038.pdf
## 3                                                                                       langerresearch.com/wp-content/uploads/1204a2TrumpInvestigations-1.pdf
## 4                                                               https://www.langerresearch.com/wp-content/uploads/1200a1TrumpandtheMuellerInvestigation-1.pdf
## 5                                                                                https://civiqs.com/documents/Civiqs_DailyKos_monthly_banner_book_2019_06.pdf
## 6                                                                                 https://cdn.cnn.com/cnn/2019/images/06/01/rel7a.-.trump,.investigations.pdf
##   Notes
## 1      
## 2      
## 3      
## 4      
## 5      
## 6

Reliable pollsters

According to FiveThirtyEight’s Pollster Ratings, the three most reliable of the fifteen pollsters are Marist College, SurveyUSA, and Emerson College, each of which is graded higher than A-. We analyze and visualize the survey questions from these three pollsters that ask whether Congress should impeach, or impeach and remove, Trump.

Reference: https://projects.fivethirtyeight.com/pollster-ratings

Tidying data

poll$Start <- format(as.Date(poll$Start, format="%m/%d/%Y"))
poll1 <- poll %>%
          filter(Pollster == "Marist College"|Pollster == "Emerson College"|  Pollster =="SurveyUSA")%>%  # filter down to the three pollsters
          filter(str_detect(Category, "impeach"))%>% 
          gather("Answer", "percent",11:13)%>%
          select(Start, End, Answer, percent)%>%
          group_by(Start, Answer)%>%
          summarise(percent = mean(percent))%>%
          arrange(Start)
      
head(poll1)
## # A tibble: 6 x 3
## # Groups:   Start [2]
##   Start      Answer percent
##   <chr>      <chr>    <dbl>
## 1 2017-07-17 No        42  
## 2 2017-07-17 Unsure    15  
## 3 2017-07-17 Yes       42  
## 4 2019-10-03 No        47.5
## 5 2019-10-03 Unsure     4  
## 6 2019-10-03 Yes       49
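
The positional selection 11:13 in gather depends on the column order of the CSV. As a hedged alternative, the sketch below does the same wide-to-long reshape and averaging in base R, selecting the answer columns by name. The rows are made up and only reuse the column names of the poll data.

```r
# Made-up rows with the same column names as the poll data.
poll_demo <- data.frame(
  Start  = c("2019-10-03", "2019-10-03", "2017-07-17"),
  Yes    = c(46, 50, 42),
  No     = c(47, 49, 40),
  Unsure = c(7, 1, 18),
  stringsAsFactors = FALSE
)

# Wide to long: one row per (poll, answer), selecting columns by name.
answers <- c("Yes", "No", "Unsure")
long <- data.frame(
  Start   = rep(poll_demo$Start, times = length(answers)),
  Answer  = rep(answers, each = nrow(poll_demo)),
  percent = unlist(poll_demo[answers], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Average the polls that share a start date and answer.
avg <- aggregate(percent ~ Start + Answer, data = long, FUN = mean)
avg <- avg[order(avg$Start, avg$Answer), ]
print(avg)
```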

Visualization by date with ggplot2

gg <- ggplot(poll1, aes(x= Start, y=percent, fill= Answer))+
       geom_bar(aes(fill=Answer), stat="identity", position="dodge",
                    color="white", width=0.85)
gg <- gg + geom_text(aes(label=percent),hjust=-0.15,
                     position=position_dodge(width=0.8), size=3)
gg <- gg + coord_flip()
gg <- gg + labs(x="Start_date", y= "percent", title="Do you support the impeachment of President Trump?")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))


gg

Impeachment opinion survey from Ipsos by month

Ipsos has the largest total sample size of any pollster in the dataset. We subset the Ipsos polls and keep the categories containing the word “impeach” to look specifically at people’s opinions about impeachment.
The polls fall into two groups of categories, begin and impeach. The begin categories cover polls asking whether the impeachment process should begin; the impeach and impeach_and_remove categories cover polls asking whether Congress should impeach, or impeach and remove, Trump.

Tidying data

poll_Ipsos <- poll %>%
          filter(Pollster == "Ipsos") %>%  # filter down to Ipsos
          filter(str_detect(Category, "impeach"))%>% # subset category = impeach
          gather("Answer", "count",11:13)%>% # makes “wide” data longer
          select(Start, End, Answer, count)%>% 
          arrange(desc(Start))%>%
          separate(Start, c("start_year", "start_month", "start_day"), sep = "-") 

head(poll_Ipsos)
##   start_year start_month start_day       End Answer count
## 1       2019          12        02 12/3/2019    Yes    44
## 2       2019          12        02 12/3/2019    Yes    45
## 3       2019          12        02 12/3/2019     No    42
## 4       2019          12        02 12/3/2019     No    45
## 5       2019          12        02 12/3/2019 Unsure    13
## 6       2019          12        02 12/3/2019 Unsure    10

Visualization

gg1 <- ggplot(poll_Ipsos, aes(x = start_day, y = count, color= Answer)) +
      geom_point()+theme_minimal() +
      facet_wrap( ~ start_month ) +
      labs(title = "Impeachment survey from Ipsos 2019",
           x = "Start_date", y = "percent") + theme_bw(base_size = 15)+
      theme(axis.text.x = element_blank(),axis.ticks = element_blank())
gg1
## Warning: Removed 1 rows containing missing values (geom_point).
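
The split of Start into year, month and day can also be derived from a parsed Date, which keeps the day numeric so it sorts correctly on an axis. A small sketch on made-up dates, assuming the m/d/Y form of the raw file:

```r
# Made-up start dates in the m/d/Y form used by the raw file.
start_raw <- c("12/2/2019", "11/18/2019", "9/26/2019")

# Parse once, then derive the pieces from the Date object.
start <- as.Date(start_raw, format = "%m/%d/%Y")
parts <- data.frame(
  start_year  = format(start, "%Y"),
  start_month = format(start, "%m"),
  start_day   = as.integer(format(start, "%d")),
  stringsAsFactors = FALSE
)
print(parts)
```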

Population analysis

The poll dataset covers three populations: all adults (a), likely voters (lv), and registered voters (rv). In this section, we analyze how each population views the president’s impeachment.

Tidying data

poll %>%
  group_by(Pop)%>%
  summarise(sum =sum(SampleSize)) #calculating sample size for each population
## # A tibble: 3 x 2
##   Pop      sum
##   <fct>  <int>
## 1 a     497574
## 2 lv      9716
## 3 rv    233059
poll2 <- poll %>%
          filter(str_detect(Category, "impeach")) # subset category = impeach

# Calculating the average percentage of each answer
poll3 <- poll2 %>%
          filter(Pop == "a")%>%
          summarise(Yes = mean(Yes), No = mean(No), Unsure = mean(Unsure, na.rm = TRUE))

poll4 <- poll2 %>%
          filter(Pop == "lv")%>%
          summarise(Yes = mean(Yes), No = mean(No), Unsure = mean(Unsure, na.rm = TRUE))

poll5 <- poll2 %>%
          filter(Pop == "rv")%>%
          summarise(Yes = mean(Yes), No = mean(No), Unsure = mean(Unsure, na.rm = TRUE))

# creating a table 
poll_pop <- rbind("all adults"= poll3, "likely voters"=poll4, "registered voters"=poll5)
poll_pop
##                        Yes       No   Unsure
## all adults        43.55243 43.30278 12.86902
## likely voters     47.62000 45.46000  7.90000
## registered voters 43.79573 45.05897 11.16325
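
The three filter-and-summarise blocks above repeat the same computation for each population. A possible one-step version in base R is sketched below on made-up rows. Note that it applies na.rm = TRUE to every column, whereas the code above does so only for Unsure.

```r
# Made-up rows with the same column names as the poll data.
poll_demo <- data.frame(
  Pop    = c("a", "a", "lv", "rv", "rv"),
  Yes    = c(40, 44, 48, 43, 45),
  No     = c(50, 46, 45, 46, 44),
  Unsure = c(10, NA, 7, 11, 11),
  stringsAsFactors = FALSE
)

# Mean of each answer column within each population, ignoring missing values.
poll_pop_demo <- aggregate(poll_demo[c("Yes", "No", "Unsure")],
                           by = list(Pop = poll_demo$Pop),
                           FUN = mean, na.rm = TRUE)
print(poll_pop_demo)
```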

Support for impeachment by party

Lastly, we investigate support for impeaching Trump by party (Republicans, Democrats, and independents). We select the Fox News and CNN/SSRS polls and compare their results. The two pollsters have similar total sample sizes, around 800. The visualizations are grouped by party.

Fox News

Tidying data

poll6 <- poll %>%
          select(Start, End, Pollster, Rep.Yes, Rep.No, Dem.Yes, Dem.No, Ind.Yes, Ind.No) %>%
          filter(Pollster == "Fox News") %>%
          gather("Answer", "percent",4:9) %>%
          separate(Answer, c("Party", "YesNo"))%>% # split on the non-alphanumeric separator (".")
          arrange(desc(Start))

head(poll6)
##        Start        End Pollster Party YesNo percent
## 1 2019-10-27 10/30/2019 Fox News   Rep   Yes       8
## 2 2019-10-27 10/30/2019 Fox News   Rep    No      87
## 3 2019-10-27 10/30/2019 Fox News   Dem   Yes      86
## 4 2019-10-27 10/30/2019 Fox News   Dem    No       9
## 5 2019-10-27 10/30/2019 Fox News   Ind   Yes      38
## 6 2019-10-27 10/30/2019 Fox News   Ind    No      47

Visualization with boxplot

ggplot(poll6, aes(x = YesNo, y = percent, fill = YesNo)) + geom_boxplot() +
facet_wrap(~ Party, ncol = 5)+
labs(title = "Impeachment opinion by party from Fox News",x = "Answer", y = "percent") + theme_bw(base_size = 15)

CNN/SSRS

Tidying data

poll7 <- poll %>%
          filter(Pollster == "CNN/SSRS") %>%
          filter(str_detect(Category, "impeach"))%>%
          select(Start, End, Pollster, Rep.Yes, Rep.No, Dem.Yes, Dem.No, Ind.Yes, Ind.No) %>%
          gather("Answer", "percent",4:9) %>%
          separate(Answer, c("Party", "YesNo"))%>%
          arrange(desc(Start))

head(poll7)
##        Start        End Pollster Party YesNo percent
## 1 2019-11-21 11/24/2019 CNN/SSRS   Rep   Yes      10
## 2 2019-11-21 11/24/2019 CNN/SSRS   Rep    No      87
## 3 2019-11-21 11/24/2019 CNN/SSRS   Dem   Yes      90
## 4 2019-11-21 11/24/2019 CNN/SSRS   Dem    No       6
## 5 2019-11-21 11/24/2019 CNN/SSRS   Ind   Yes      47
## 6 2019-11-21 11/24/2019 CNN/SSRS   Ind    No      45

Visualization with boxplot

ggplot(poll7, aes(x = YesNo, y = percent, fill = YesNo)) + geom_boxplot() +
facet_wrap(~ Party, ncol = 5)+
labs(title = "Impeachment opinion by party from CNN/SSRS",x = "Answer", y = "percent") + theme_bw(base_size = 15)
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
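
The separate step turns column names such as Rep.Yes into a party and an answer. The base R sketch below shows the same split. The percentages are the six Fox News values printed earlier and are used here only as illustration.

```r
# Column names as they appear in the poll data.
cols <- c("Rep.Yes", "Rep.No", "Dem.Yes", "Dem.No", "Ind.Yes", "Ind.No")

# Split each name on the literal dot into party and answer.
pieces <- do.call(rbind, strsplit(cols, ".", fixed = TRUE))

key <- data.frame(
  Answer  = cols,
  Party   = pieces[, 1],
  YesNo   = pieces[, 2],
  percent = c(8, 87, 86, 9, 38, 47),  # illustrative values from the Fox News rows
  stringsAsFactors = FALSE
)
print(key)
```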

#polls <- read_csv('https://raw.githubusercontent.com/habibkhan89/607-finalproject/master/impeachment-polls.csv?token=ANBSKRDMOMTIFDN7YZW4J6S554ACO')
polls <- read_csv('impeachment-polls.csv')
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Start = col_character(),
##   End = col_character(),
##   Pollster = col_character(),
##   Sponsor = col_character(),
##   Pop = col_character(),
##   tracking = col_logical(),
##   Text = col_character(),
##   Category = col_character(),
##   `Include?` = col_character(),
##   URL = col_character(),
##   Notes = col_character()
## )
## See spec(...) for full column specifications.
polls$percent <- polls$Yes/polls$No
polls$EndDate <- as.Date(polls$End, "%m/%d/%Y")
ggplot(data = polls) +
  geom_point(aes(x = EndDate, y = percent,
  size = SampleSize,
  colour = factor(Category))) + 
  geom_hline(yintercept=1)

ggplot(data = polls) + 
  geom_point(aes(x = EndDate, y = percent)) +
  geom_smooth(data = polls, 
            aes(x = EndDate, y = percent))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

polls$wgtpct <- polls$percent * log(polls$SampleSize)
ggplot(data = polls) + 
  geom_point(aes(x = EndDate, y = wgtpct)) +
  geom_smooth(data = polls, 
              aes(x = EndDate, y = wgtpct, 
                  size = SampleSize))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = polls) + 
  geom_point(aes(x = EndDate, y = wgtpct)) +
  geom_smooth(data = polls, 
              aes(x = EndDate, y = wgtpct, 
                  size = SampleSize), method = "lm")
## `geom_smooth()` using formula 'y ~ x'

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year
## The following object is masked from 'package:base':
## 
##     date
days <- yday(polls$EndDate) - 1 # so Jan 1 = day 0 
total_days <- cumsum(days)
ref_date <- dmy("01-01-2017")
polls$alldays <- difftime(polls$EndDate,ref_date,units = "days")
lmpct <- lm(polls$percent ~ polls$alldays)
summary(lmpct)
## 
## Call:
## lm(formula = polls$percent ~ polls$alldays)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55777 -0.10240  0.01524  0.09843  0.86791 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.7107707  0.0361705  19.651  < 2e-16 ***
## polls$alldays 0.0003377  0.0000413   8.177 3.16e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1824 on 439 degrees of freedom
## Multiple R-squared:  0.1322, Adjusted R-squared:  0.1302 
## F-statistic: 66.86 on 1 and 439 DF,  p-value: 3.162e-15
365*3.328e-04
## [1] 0.121472
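
The last line above scales a per-day slope to a per-year change by multiplying by 365. Reading the slope from the fitted model avoids retyping the coefficient by hand. The sketch below uses synthetic data, and the variable names days and ratio are chosen for illustration only.

```r
# Synthetic data: days since a reference date and a Yes/No ratio
# that rises by 0.0003 per day (not the real poll data).
days  <- seq(0, 900, by = 100)
ratio <- 0.7 + 0.0003 * days

fit <- lm(ratio ~ days)

# Pull the daily slope out of the model and scale it to one year.
daily_slope   <- unname(coef(fit)["days"])
annual_change <- 365 * daily_slope
print(annual_change)
```

With the real model, the equivalent would be to take the second coefficient of lmpct, assuming the model is fitted as shown above.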

Extracting tweets from twitter using API

Finally, we collect tweets from Twitter using the twitteR package. We originally wanted to filter tweets to the periods before particular events in the impeachment case, but our API access level did not allow that. The authentication step below connects us to Twitter so that we can retrieve tweets.

## [1] "Using direct authentication"

Extracting tweets and converting into dataframe

We search for tweets using the keywords impeachment, whistleblower, and Ukraine. The first results included many retweets, which we excluded to obtain more reliable results.

# Extract tweets about the impeachment inquiry using a few trending keywords
tweets <- searchTwitter('impeachment, whistleblower, Ukraine, -filter:retweets', n=2000, lang = 'en')
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 2000 tweets were requested but the
## API can only return 18
# Converting tweets into dataframe
tweets_df <- twListToDF(tweets)
tweets_df2 <- tweets_df$text # This vector contain only tweets
# Reading the csv file from local directory
tweets2 <- read.csv('tweets_only.csv', row.names=NULL, stringsAsFactors = FALSE, header=TRUE)
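
Because the saved CSV may still contain retweets or repeated text, a simple clean-up before building the corpus could look like the sketch below. The tweets are made up, and the "RT @" prefix test is a common convention assumed here, not something the original code checks.

```r
# Made-up tweets standing in for tweets2$x.
tweets_demo <- c(
  "RT @someone: Impeachment hearing today",
  "Impeachment hearing today",
  "Whistleblower complaint released",
  "Whistleblower complaint released"
)

# Flag texts that start with the conventional retweet prefix.
is_retweet <- grepl("^RT @", tweets_demo)

# Keep non-retweets and drop exact duplicates.
originals <- unique(tweets_demo[!is_retweet])
print(originals)
```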

Data cleaning and text mining

We again use Corpus for text mining and cleaning. A number of uninformative words had to be removed; otherwise the results would have been meaningless.

words <- Corpus(VectorSource(tweets2$x)) # Saving the tweets in vector 'words' while x is column's name which was given randomly while importing
words <- tm_map(words, tolower)
## Warning in tm_map.SimpleCorpus(words, tolower): transformation drops
## documents
words <- tm_map(words, removeNumbers)
## Warning in tm_map.SimpleCorpus(words, removeNumbers): transformation drops
## documents
words <- tm_map(words, removePunctuation)
## Warning in tm_map.SimpleCorpus(words, removePunctuation): transformation
## drops documents
words <- tm_map(words, stripWhitespace)
## Warning in tm_map.SimpleCorpus(words, stripWhitespace): transformation
## drops documents
words <- tm_map(words, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(words, removeWords, stopwords("english")):
## transformation drops documents
words <- tm_map(words, removeWords, c("will", "got", "admits<U+0085>", "want", "say")) # This sentence would be helpful for later to remove any unnecessary words
## Warning in tm_map.SimpleCorpus(words, removeWords, c("will", "got",
## "admits<U+0085>", : transformation drops documents
# words <- tm_map(words, gsub, pattern = 'Impeached', replacement= 'Impeachment') # This line of code will replace impeached with impeachment
# Now let's build a matrix and dataframe to show the number of words to make wordcloud
tdm <- TermDocumentMatrix((words))
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing=TRUE)
d <- data.frame(word= names(v), freq=v)
head(d,20)
##                          word freq
## haspel                 haspel  383
## gina                     gina  363
## officials           officials  363
## whistleblower   whistleblower  363
## trump                   trump  319
## ukraine               ukraine  308
## impeachment       impeachment  251
## whistleblowers whistleblowers  249
## fake                     fake  227
## farce                   farce  226
## information       information  226
## soschiffty         soschiffty  226
## star                     star  226
## ufas                     ufas  226
## witness               witness  226
## hand                     hand  225
## admits<U+0085>   admits<U+0085>  224
## michaelmagausn michaelmagausn  224
## kendilaniannbc kendilaniannbc  223
## intel                   intel  192

Data visualization

We visualize the data with a word cloud to see word frequencies, and then use sentiment analysis to see what Americans think about President Trump’s impeachment inquiry.

Wordcloud

set.seed(3322)
wordcloud(words=d$word, freq=d$freq, min.freq=10, max.words =200, random.order=FALSE, decreasing= TRUE, rot.per=0.05, colors=brewer.pal(10,"Dark2"))

Sentiment analysis

# Using sentiment analysis to see people's reaction
imp_tdm <- tidy(tdm)
imp_senti <- imp_tdm %>% 
  inner_join(get_sentiments("bing"), by=c(term="word"))
imp_senti %>% 
  count(sentiment, term, wt=count) %>% 
  ungroup() %>% 
  filter(n>= 3) %>% 
  mutate(n= ifelse(sentiment=="negative", -n, n)) %>% 
  mutate(term=reorder(term,n)) %>% 
  ggplot(aes(term, n, fill=sentiment))+ geom_bar(stat="identity")+ylab("People's sentiment on Trump's Impeachment")+coord_flip()

The sentiment analysis suggests that people are upset and angry about the Ukraine case. With fuller API access, we could have compared sentiment across different time periods and filtered tweets by location, which would make the analysis more specific to the question “Do Americans support Trump’s impeachment?”
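
The core of the sentiment step is an inner join between term counts and a word-level lexicon. The base R sketch below shows that join on a tiny hypothetical lexicon (not the real Bing lexicon) with illustrative counts. Terms without a lexicon entry drop out of the result.

```r
# Illustrative term counts and a tiny hypothetical lexicon.
term_counts <- data.frame(term  = c("fake", "farce", "support", "witness"),
                          count = c(227, 226, 12, 226),
                          stringsAsFactors = FALSE)
lexicon <- data.frame(word      = c("fake", "farce", "support"),
                      sentiment = c("negative", "negative", "positive"),
                      stringsAsFactors = FALSE)

# Inner join: "witness" has no lexicon entry and is dropped.
scored <- merge(term_counts, lexicon, by.x = "term", by.y = "word")

# Negative terms get a negative sign so they plot to the left.
scored$n <- ifelse(scored$sentiment == "negative", -scored$count, scored$count)

# Total count per sentiment class.
totals <- tapply(scored$count, scored$sentiment, sum)
print(scored)
print(totals)
```

A score of this kind reflects the tone of individual words, so it is at best an indirect signal of support for or opposition to impeachment.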

Conclusion

Taken together, the three data sources show mixed opinion on whether President Trump should be impeached. The Washington Post timeline and the Twitter data suggest that people are, overall, angry and upset about the Ukraine case and regard it as shameful, but neither source gives a clear answer. The FiveThirtyEight polls also show clearly mixed opinion: the overall numbers for and against are almost equal, but broken down by party, Democrats clearly support impeachment while Republicans clearly oppose it. Because each poll sampled only about 1,000 people, we cannot say definitively whether Americans as a whole want him impeached, but we can say that Democrats do and Republicans do not. Given the limits on data access, further research on this topic might bring more insights.