Introduction

The 2016 US presidential election between Donald Trump and Hillary Clinton was one of the most controversial and highly publicized elections in recent history. Social media platforms, such as Twitter, played a significant role in shaping public opinion and spreading information about the candidates and their policies. In this project, we will analyze a dataset of tweets related to the 2016 US presidential election to gain insight into the sentiments and opinions expressed by Twitter users during this time period. By using text mining techniques such as sentiment analysis and topic modeling, we aim to answer questions such as: How did Twitter users perceive the two candidates and their policies? What were the most common topics of discussion among Twitter users during the election? Our results will provide a snapshot of the public discourse surrounding the 2016 presidential election and offer insight into the role of social media in shaping political opinions and discussions.

Data

Here are the columns used in our analysis, with their descriptions:

  • id - ID of each tweet
  • handle - Twitter handle of the candidate who posted the tweet: HillaryClinton or realDonaldTrump
  • text - Text of the tweet
  • time - Date and time the tweet was posted
  • retweet_count - Number of times the tweet was retweeted

Modeling

Let’s first load all of the libraries we will use during our analysis.

library(tm)
library(SnowballC)
library(textmineR)
library(tidyverse)
library(stringr)
library(dplyr)
library(wordcloud)
library(tidytext)
library(topicmodels)
library(textdata)
library(janeaustenr)
library(tidyr)
library(ggplot2)
library(reshape2)
library(visNetwork)
library(glue)
library(cowplot)
library(magrittr)
library(plotly)
library(widyr)
library(hms)
library(lubridate) 
library(igraph)
library(networkD3)
library(rjson)
library(ngram)

Let’s set our working directory and read our dataset.

Here we create a document-term matrix (DTM). A DTM is a matrix representation of a collection of documents in which each cell holds the frequency (count) of a term (word) in a document. The rows represent documents, and the columns represent terms.

From the matrix of term counts we derive the IDF vector and use it to weight the counts, which gives the TF-IDF matrix. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space; we compute it between the TF-IDF document vectors and convert it to cosine distance by subtracting it from 1.
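The calculation can be sketched in base R on a toy term-count matrix. The counts below are made up for illustration, and the IDF formula used here (log of the number of documents divided by the number of documents containing the term) is one common definition; the analysis itself may use a slightly different variant.

```r
# Toy term-count matrix: rows are documents, columns are terms (made-up values).
counts <- matrix(
  c(2, 1, 0,
    0, 1, 3,
    2, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("trump", "vote", "hillary"))
)

# IDF: log(number of documents / number of documents containing the term).
idf <- log(nrow(counts) / colSums(counts > 0))

# TF-IDF: scale each term column by its IDF weight.
tfidf <- sweep(counts, 2, idf, `*`)

# Cosine similarity: dot products of the L2-normalised document vectors.
normalised <- tfidf / sqrt(rowSums(tfidf^2))
cos_sim <- normalised %*% t(normalised)

# Cosine distance is one minus cosine similarity.
cos_dist <- 1 - cos_sim
```

In this toy example doc1 and doc3 have identical counts, so their cosine distance is 0, while doc1 and doc2 share only the term that appears in every document (whose IDF is 0), so their distance is 1.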

We now create a word cloud to see clearly which words are used the most once stop words and similar noise have been removed.

Here we filter and summarize the data to see how many tweets come from each candidate’s account. The two accounts have almost the same number of tweets. However, Trump’s tweets were retweeted almost twice as often in total as Clinton’s.
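A minimal sketch of this kind of summary, assuming a data frame with handle and retweet_count columns. The three rows are a tiny stand-in for the full dataset, and the object names are hypothetical.

```r
library(dplyr)

# Tiny stand-in for the tweets data (hypothetical object name).
tweets <- data.frame(
  handle = c("HillaryClinton", "HillaryClinton", "realDonaldTrump"),
  retweet_count = c(218L, 2445L, 2181L),
  stringsAsFactors = FALSE
)

# Number of tweets per account.
tweets_per_handle <- tweets %>%
  group_by(handle) %>%
  summarise(count = n())

# Total number of retweets per account.
retweets_per_handle <- tweets %>%
  group_by(handle) %>%
  summarise(summing = sum(retweet_count))
```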

## # A tibble: 2 × 2
##   handle          count
##   <chr>           <int>
## 1 HillaryClinton   3226
## 2 realDonaldTrump  3218
## # A tibble: 2 × 2
##   handle           summing
##   <chr>              <int>
## 1 HillaryClinton   9625246
## 2 realDonaldTrump 18703714

We will now tokenize the text. In text mining and natural language processing (NLP), tokenizing refers to the process of breaking a text document into smaller units, called tokens. We will use unnest_tokens().

unnest_tokens() also does some cleaning: it removes punctuation and whitespace and converts the text to lowercase.
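As a small illustration of what unnest_tokens() does, here is a sketch on two made-up tweets. The data frame and object names are hypothetical.

```r
library(dplyr)
library(tidytext)

# Two made-up tweets (hypothetical data).
tweets <- data.frame(
  id = 1:2,
  text = c("Make America great again!", "Vote, vote, VOTE."),
  stringsAsFactors = FALSE
)

# One row per word; punctuation is dropped and words are lowercased.
tidy_tweets <- tweets %>%
  unnest_tokens(word, text)

# Count how often each word occurs, most frequent first.
word_counts <- tidy_tweets %>%
  count(word, sort = TRUE)
```

The two rows of text become seven one-word rows, and the three spellings of "vote" collapse into a single lowercase token.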

Having a single word per row means that our dataset dimensions exploded! Now there are 120227 rows. Let’s count the words.

##    word    n
## 1  t.co 4021
## 2 https 4020
## 3   the 3601
## 4    to 2843
## 5     a 2009
## 6   and 1928

We can combine unnest_tokens() with a stop word list to drop very common words.
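One common way to do this with tidytext is an anti_join() against its built-in stop_words table. The sketch below assumes that approach and uses a made-up sentence.

```r
library(dplyr)
library(tidytext)

# One made-up tweet (hypothetical data).
tweets <- data.frame(
  id = 1L,
  text = "The election is about the people",
  stringsAsFactors = FALSE
)

# Tokenize, then drop every word that appears in the stop_words table.
tidy_tweets <- tweets %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
```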

The number of words is drastically reduced.

## [1] 62527

Let’s count the words again.

##              word    n
## 1            t.co 4021
## 2           https 4020
## 3           trump 1104
## 4         hillary 1035
## 5          donald  493
## 6 realdonaldtrump  422

Let’s now visualize the text, starting with tidy_trump.

Each tweet now has a numeric id thanks to row_number(). After unnest_tokens(), each row is a single word, and each tweet has as many rows as it has non-stop words.

##   id         handle                time retweet_count     word
## 1  1 HillaryClinton 2016-09-28T00:22:34           218 question
## 2  1 HillaryClinton 2016-09-28T00:22:34           218 election
## 3  1 HillaryClinton 2016-09-28T00:22:34           218    plans
## 4  1 HillaryClinton 2016-09-28T00:22:34           218   action
## 5  1 HillaryClinton 2016-09-28T00:22:34           218     life
## 6  1 HillaryClinton 2016-09-28T00:22:34           218    https
## (other columns, including the long entities and extended_entities fields, are identical across these rows and omitted)

Let’s visualize the counts with geom_col(). Here col stands for column, so the result is just like a bar plot.

There are too many words at once and nothing is clear, so we have to filter!

With filtered words the plot is better:

Let’s make further improvements. We can use coord_flip() when labels are hard to read on the x axis.
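A minimal sketch of such a plot, using four of the word counts reported earlier. The object names are hypothetical, and the styling of the original figure may differ.

```r
library(ggplot2)

# A few of the word counts shown earlier in the analysis.
word_counts <- data.frame(
  word = c("trump", "hillary", "donald", "realdonaldtrump"),
  n = c(1104, 1035, 493, 422),
  stringsAsFactors = FALSE
)

# geom_col() draws one bar per word; coord_flip() puts the words on the
# vertical axis so that long labels stay readable.
p <- ggplot(word_counts, aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Count")
```

Printing p (or calling print(p)) renders the chart.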

Even after removing the common stop words we still have some left.

## # A tibble: 6 × 2
##   word      lexicon
##   <chr>     <chr>  
## 1 a         SMART  
## 2 a's       SMART  
## 3 able      SMART  
## 4 about     SMART  
## 5 above     SMART  
## 6 according SMART

We often have additional words that we want to remove. We have to create our own tibble containing the custom stop words that we want to remove.

## # A tibble: 7 × 2
##   word        lexicon  
##   <chr>       <chr>    
## 1 a           a's      
## 2 able        about    
## 3 above       according
## 4 accordingly across   
## 5 actually    after    
## 6 t.co        https    
## 7 amp         CUSTOM

Let’s combine this tibble with the original stop words.

And now we have the original stop words with our custom stop words.
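A sketch of how a custom stop word tibble could be built and combined with tidytext’s stop_words. The name stop_words2 is hypothetical. Note that each custom word goes in the word column, with a label such as "CUSTOM" in the lexicon column.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Custom stop words: one word per row, labelled with its own lexicon name.
custom_stop_words <- tribble(
  ~word,   ~lexicon,
  "t.co",  "CUSTOM",
  "https", "CUSTOM",
  "amp",   "CUSTOM"
)

# Append the custom words to the built-in stop word table.
stop_words2 <- stop_words %>%
  bind_rows(custom_stop_words)
```

Because anti_join() matches on the word column, a word that ends up in the lexicon column instead would not be removed.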

Let’s remove the stop words again: we tokenize once more and clean using the new stop word list.

Let’s check if this cleaning worked.

##   id                time          handle retweet_count  word
## 1  1 2016-09-28T00:22:34  HillaryClinton           218 https
## 2  2 2016-09-27T23:45:00  HillaryClinton          2445 https
## 3  4 2016-09-27T23:08:41  HillaryClinton           916 https
## 4  4 2016-09-27T23:08:41  HillaryClinton           916 https
## 5  5 2016-09-27T22:30:27  HillaryClinton           859 https
## 6  6 2016-09-27T22:13:24 realDonaldTrump          2181 https

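We still have some rows left. In the custom tibble shown above, “https” appears in the lexicon column rather than the word column, which would explain why it was not removed. So, let’s try another way.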

Let’s try again.

## [1] id            time          handle        retweet_count word         
## <0 rows> (or 0-length row.names)

Now we don’t have the word “https” any more.

Dealing with factors

The arrange() function doesn’t affect the plots! How can we fix that? A character column can only be sorted alphabetically, but a factor column can carry information about the order of the words.

We use fct_reorder() before visualizing (load library(forcats) if necessary). Its arguments say what to reorder (word) and by what (n).
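As a small sketch, here is fct_reorder() applied to four of the word counts from the analysis. The column name word2 follows the text; the data frame itself is a cut-down stand-in.

```r
library(dplyr)
library(forcats)

# Four words with their counts from the analysis.
word_counts <- data.frame(
  word = c("people", "clinton", "trump", "hillary"),
  n = c(416, 308, 1104, 1035),
  stringsAsFactors = FALSE
)

# word2 is a factor whose levels are ordered by n rather than alphabetically.
word_counts <- word_counts %>%
  mutate(word2 = fct_reorder(word, n))

levels(word_counts$word2)
```

The factor levels now run from the least to the most frequent word, and ggplot2 uses that order on the axis.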

##              word    n           word2
## 1         america  406         america
## 2         clinton  308         clinton
## 3          donald  493          donald
## 4         hillary 1035         hillary
## 5          people  416          people
## 6       president  393       president
## 7 realdonaldtrump  422 realdonaldtrump
## 8           trump 1104           trump
## 9       trump2016  350       trump2016

With the new ordered column (x = word2), the plot is arranged by word count and is far easier to read:

Faceting word count plots

We have tweets from two candidates, so it would be nice to compare the word counts for each of them. Instead of just counting words, we pass both word and handle to count().

##        word          handle   n
## 1     trump  HillaryClinton 737
## 2   hillary  HillaryClinton 693
## 3    donald  HillaryClinton 427
## 4     trump realDonaldTrump 367
## 5 trump2016 realDonaldTrump 350
## 6   hillary realDonaldTrump 342

Using top_n()

Before we plot, we need to filter the data to keep just the most common words for each candidate. We want the top 10 words per candidate, ranked by column n.
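A minimal sketch of the grouped filter, using the six per-candidate counts shown above and keeping the top 2 words per handle instead of 10 so the effect is visible on a small table.

```r
library(dplyr)

# Per-candidate word counts taken from the output above.
word_counts <- data.frame(
  word = c("trump", "hillary", "donald", "trump", "trump2016", "hillary"),
  handle = c(rep("HillaryClinton", 3), rep("realDonaldTrump", 3)),
  n = c(737, 693, 427, 367, 350, 342),
  stringsAsFactors = FALSE
)

# Keep the 2 most frequent words within each handle.
top_words <- word_counts %>%
  group_by(handle) %>%
  top_n(2, n) %>%
  ungroup()
```

In recent dplyr versions slice_max() is the recommended replacement for top_n(), but top_n() still works.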

## # A tibble: 20 × 3
## # Groups:   handle [2]
##    word                  handle              n
##    <chr>                 <chr>           <int>
##  1 america               HillaryClinton    196
##  2 america               realDonaldTrump   210
##  3 americans             HillaryClinton    129
##  4 clinton               realDonaldTrump   192
##  5 crooked               realDonaldTrump   182
##  6 cruz                  realDonaldTrump   198
##  7 donald                HillaryClinton    427
##  8 families              HillaryClinton    143
##  9 hillary               HillaryClinton    693
## 10 hillary               realDonaldTrump   342
## 11 makeamericagreatagain realDonaldTrump   255
## 12 people                HillaryClinton    191
## 13 people                realDonaldTrump   225
## 14 potus                 HillaryClinton    142
## 15 president             HillaryClinton    279
## 16 realdonaldtrump       realDonaldTrump   326
## 17 trump                 HillaryClinton    737
## 18 trump                 realDonaldTrump   367
## 19 trump's               HillaryClinton    149
## 20 trump2016             realDonaldTrump   350

Now we can prepare word_counts for plotting.

This plot, again using the ordered column x = word2, is arranged by word count, filled by handle, and faceted by handle.

Document Term matrices in tidyverse

To create a document-term matrix we use cast_dtm().
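A sketch of cast_dtm() on a toy tidy word table. The data and object names are hypothetical, and cast_dtm() needs the tm package to be installed.

```r
library(dplyr)
library(tidytext)

# Toy one-word-per-row table for two documents (hypothetical data).
tidy_words <- data.frame(
  id = c(1, 1, 2, 2),
  word = c("vote", "america", "vote", "jobs"),
  stringsAsFactors = FALSE
)

# Count each word within each document, then cast the counts to a DTM
# with one row per document and one column per term.
dtm <- tidy_words %>%
  count(word, id) %>%
  cast_dtm(document = id, term = word, value = n)

dim(dtm)
```

Here the result has 2 documents and 3 distinct terms.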

## <<DocumentTermMatrix (documents: 6443, terms: 12962)>>
## Non-/sparse entries: 53119/83461047
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)

Let’s use as.matrix()

It is a very large matrix. Let’s look at a piece of it; it is very sparse.

##       Terms
## Docs   border borderless borders boring borisep born borntobegop
##   1005      0          0       0      0       0    0           0
##   6316      0          0       0      0       0    0           0
##   635       0          0       0      0       0    0           0
##   6078      0          0       0      0       0    0           0

Running topic models

We fit the topic model with LDA(). LDA stands for Latent Dirichlet Allocation, a generative probabilistic model for text data that is commonly used in topic modeling.

k is the number of topics we want the model to produce. How do we decide on the value of k? There is no single rule, so below we try several values and compare how interpretable the resulting topics are.

method selects the fitting algorithm. The default, "VEM", is a fast variational approximation. If we prefer a slower but more thorough method, we can specify the Gibbs sampler instead.

Specifying the simulation seed helps us recover consistent topics on repeated model runs, given the probabilistic nature of model estimation.
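A minimal sketch of such a call on a toy corpus, using the same arguments that appear in the model output below (k = 2, method = "Gibbs", seed = 42). The four tiny documents are made up, so the topics found here carry no meaning.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy one-word-per-row table for four short documents (hypothetical data).
tidy_words <- data.frame(
  id = rep(1:4, each = 3),
  word = c("trump", "rally", "jobs",
           "hillary", "families", "vote",
           "trump", "jobs", "taxes",
           "hillary", "vote", "families"),
  stringsAsFactors = FALSE
)

dtm <- tidy_words %>%
  count(word, id) %>%
  cast_dtm(id, word, n)

# Fit a 2-topic model with the Gibbs sampler; the seed makes runs repeatable.
lda_out <- LDA(dtm, k = 2, method = "Gibbs", control = list(seed = 42))

# Three highest-probability terms per topic.
terms(lda_out, 3)
```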

## A LDA_Gibbs topic model with 2 topics.

We can use the glimpse() function to see what is included in this model object.

## Formal class 'LDA_Gibbs' [package "topicmodels"] with 16 slots
##   ..@ seedwords      : NULL
##   ..@ z              : int [1:54125] 2 1 1 1 2 1 1 1 2 2 ...
##   ..@ alpha          : num 25
##   ..@ call           : language LDA(x = dtm_trump, k = 2, method = "Gibbs", control = list(seed = 42))
##   ..@ Dim            : int [1:2] 6443 12962
##   ..@ control        :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] with 14 slots
##   ..@ k              : int 2
##   ..@ terms          : chr [1:12962] "________" "_bscarb" "_bxddxss" "_hankrearden" ...
##   ..@ documents      : chr [1:6443] "1005" "6316" "635" "6078" ...
##   ..@ beta           : num [1:2, 1:12962] -12.6 -10.2 -12.6 -10.2 -12.6 ...
##   ..@ gamma          : num [1:6443, 1:2] 0.525 0.433 0.534 0.492 0.476 ...
##   ..@ wordassignments:List of 5
##   .. ..$ i   : int [1:53119] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ j   : int [1:53119] 1 642 1693 2481 3662 5846 5861 7392 10423 11413 ...
##   .. ..$ v   : num [1:53119] 2 2 1 1 2 1 1 1 2 2 ...
##   .. ..$ nrow: int 6443
##   .. ..$ ncol: int 12962
##   .. ..- attr(*, "class")= chr "simple_triplet_matrix"
##   ..@ loglikelihood  : num -434741
##   ..@ iter           : int 2000
##   ..@ logLiks        : num(0) 
##   ..@ n              : int 54125

Let’s evaluate the LDA model output. The most important outputs of an LDA model are the topics themselves, i.e. a dictionary of all words in the corpus sorted according to the probability that each word occurs as part of that topic.

The tidy() function takes the matrix of topic probabilities, “beta”, and puts it into a form that is easy to visualize using ggplot2. It lets us take the model output and apply other tidyverse functions to it, for example arrange() and desc().
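The tidying step can be sketched as follows. The toy corpus and the object names are hypothetical, and the model is refitted here only so that the example runs on its own.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy corpus (hypothetical), cast to a DTM and fitted with a 2-topic model.
tidy_words <- data.frame(
  id = rep(1:4, each = 3),
  word = c("trump", "rally", "jobs",
           "hillary", "families", "vote",
           "trump", "jobs", "taxes",
           "hillary", "vote", "families"),
  stringsAsFactors = FALSE
)
lda_out <- tidy_words %>%
  count(word, id) %>%
  cast_dtm(id, word, n) %>%
  LDA(k = 2, method = "Gibbs", control = list(seed = 42))

# One row per topic-term pair; beta is the term's probability within the topic.
lda_topics <- tidy(lda_out, matrix = "beta") %>%
  arrange(desc(beta))

# Keep the three highest-probability terms of each topic.
top_terms <- lda_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 3, with_ties = FALSE) %>%
  ungroup()
```

Within each topic the beta values sum to 1, since they form a probability distribution over the vocabulary.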

## # A tibble: 6 × 3
##   topic term              beta
##   <int> <chr>            <dbl>
## 1     2 trump           0.0389
## 2     1 hillary         0.0366
## 3     2 donald          0.0174
## 4     2 realdonaldtrump 0.0149
## 5     1 america         0.0143
## 6     1 people          0.0140

Interpreting topics

The output tells us what the topics are composed of, but it gives no direct indication of what each topic means. The key guideline is to look for topics that are clearly distinct from one another, with no topic repeating. Now let’s run the topic modeling with LDA() once again and gather all the code together.

Finally, let’s plot the discovered topics. We treat topic as a factor to add some color.

Three topics

We will repeat the same steps of fitting the LDA model, tidying, grouping, reordering, and finally plotting, but with k=3 topics.

Four topics

Let’s repeat this process one more time with k=4. With 2 topics, we get one topic about Clinton and one about Trump. With 3 topics, they are about Clinton, Trump and the election, and Trump on jobs, taxes, and similar issues. With 4 topics, we get the same 3 topics plus an additional one about general election themes.

Sentiment Analysis

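Let’s get sentiments. “AFINN,” “BING,” and “NRC” are all sentiment analysis lexicons, i.e. lists of words with an associated sentiment. As the samples below show, AFINN assigns each word a numeric score, BING labels words as positive or negative, and NRC tags words with categories such as trust, fear, anger, and sadness as well as positive and negative.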

## # A tibble: 6 × 2
##   word       value
##   <chr>      <dbl>
## 1 abandon       -2
## 2 abandoned     -2
## 3 abandons      -2
## 4 abducted      -2
## 5 abduction     -2
## 6 abductions    -2
## # A tibble: 6 × 2
##   word       sentiment
##   <chr>      <chr>    
## 1 2-faces    negative 
## 2 abnormal   negative 
## 3 abolish    negative 
## 4 abominable negative 
## 5 abominably negative 
## 6 abominate  negative
## # A tibble: 6 × 2
##   word      sentiment
##   <chr>     <chr>    
## 1 abacus    trust    
## 2 abandon   fear     
## 3 abandon   negative 
## 4 abandon   sadness  
## 5 abandoned anger    
## 6 abandoned fear

What are the most common joy words?

## Joining, by = "word"
##        word   n
## 1      vote 121
## 2     enjoy  90
## 3      love  66
## 4 wonderful  40
## 5     money  38
## 6   special  36

Let’s find a sentiment score for each word using the Bing lexicon and inner_join(). We define an index to keep track of where we are in the sequence of tweets, and use spread() so that negative and positive sentiment end up in separate columns.

We plot the sentiment scores across each candidate’s sequence of tweets. As we can see, Clinton has higher sentiment scores than Trump.
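A minimal sketch of this pipeline on four made-up tweets is shown below. The texts, the block size of two tweets per index, and the names `toy_tweets` and `toy_sentiment` are assumptions for illustration; only the sequence inner_join(), count(), spread(), mutate() follows the description above.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical tweets; the real data has one row per tweet with a handle column.
toy_tweets <- tibble::tibble(
  handle = c("HillaryClinton", "HillaryClinton",
             "realDonaldTrump", "realDonaldTrump"),
  text   = c("We love and support you", "That was a bad day",
             "Crooked and bad", "We can win")
)

toy_sentiment <- toy_tweets %>%
  group_by(handle) %>%
  # Index blocks of two consecutive tweets per candidate (assumed block size).
  mutate(index = (row_number() - 1) %/% 2) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  # Keep only words that appear in the Bing lexicon.
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(handle, index, sentiment) %>%
  # One column per sentiment, then the net score.
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

toy_sentiment
```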

Let’s compare the three dictionaries.

##   id                time          handle retweet_count     word
## 1  6 2016-09-27T22:13:24 realDonaldTrump          2181     join
## 2  6 2016-09-27T22:13:24 realDonaldTrump          2181      3pm
## 3  6 2016-09-27T22:13:24 realDonaldTrump          2181    rally
## 4  6 2016-09-27T22:13:24 realDonaldTrump          2181 tomorrow
## 5  6 2016-09-27T22:13:24 realDonaldTrump          2181      mid
## 6  6 2016-09-27T22:13:24 realDonaldTrump          2181  america

We use inner_join() to attach each lexicon, and count(), spread(), and mutate() to find the net sentiment in each of these sections of text.

Now we bind them together and visualize.

Let’s see the most common positive and negative words. Here, the word “crooked” is used mostly as a negative word, while “trump” and “win” are counted as positive words.

##      word sentiment    n
## 1   trump  positive 1104
## 2 crooked  negative  182
## 3     win  positive  136
## 4    love  positive  122
## 5 support  positive  109
## 6     bad  negative  108

Custom changes: since “trump” here is a name rather than a positive word, we could easily add it to a custom stop-words list using bind_rows().

## # A tibble: 6 × 2
##   word  lexicon
##   <chr> <chr>  
## 1 trump trump  
## 2 a     SMART  
## 3 a's   SMART  
## 4 able  SMART  
## 5 about SMART  
## 6 above SMART
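A short sketch of how such a custom list can be built and applied, assuming the `stop_words` table from tidytext. The lexicon label "custom" and the names `custom_stop_words`, `toy_words`, and `kept` are hypothetical.

```r
library(dplyr)
library(tidytext)

# Prepend one extra row to the built-in stop-word table.
custom_stop_words <- bind_rows(
  tibble::tibble(word = "trump", lexicon = "custom"),
  stop_words
)

# Hypothetical tokens to filter.
toy_words <- tibble::tibble(word = c("trump", "the", "win", "crooked"))

# anti_join() drops every token that appears in the custom list.
kept <- anti_join(toy_words, custom_stop_words, by = "word")
kept
```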

Let’s create a wordcloud with sentiment. Here negative words are shown in red and positive words in blue.

## Joining, by = "word"

Text network analysis

Let’s read the data again under another name.

Let’s parse a subset of the data into a tibble.

## # A tibble: 6 × 3
##        ID Created_At          Text                                              
##     <dbl> <chr>               <chr>                                             
## 1 7.81e17 2016-09-28T00:22:34 "The question in this election: Who can put the p…
## 2 7.81e17 2016-09-27T23:45:00 "Last night, Donald Trump said not paying taxes w…
## 3 7.81e17 2016-09-27T23:08:41 "If we stand together, there's nothing we can't d…
## 4 7.81e17 2016-09-27T22:30:27 "Both candidates were asked about how they'd conf…
## 5 7.81e17 2016-09-27T22:13:24 "Join me for a 3pm rally - tomorrow at the Mid-Am…
## 6 7.81e17 2016-09-27T21:35:28 "This election is too important to sit out. Go to…

See the structure of this tibble.

## Rows: 6,444
## Columns: 3
## $ ID         <dbl> 7.809256e+17, 7.809162e+17, 7.809116e+17, 7.809070e+17, 7.8…
## $ Created_At <chr> "2016-09-28T00:22:34", "2016-09-27T23:45:00", "2016-09-27T2…
## $ Text       <chr> "The question in this election: Who can put the plans into …

Now we parse the Created_At column into a date-time format; in the raw file it has type character.

## [1] "2016-09-28T00:22:34" "2016-09-27T23:45:00" "2016-09-27T23:26:40"
## [4] "2016-09-27T23:08:41"
## # A tibble: 6 × 3
##        ID Created_At          Text                                              
##     <dbl> <dttm>              <chr>                                             
## 1 7.81e17 2016-09-28 00:22:34 "The question in this election: Who can put the p…
## 2 7.81e17 2016-09-27 23:45:00 "Last night, Donald Trump said not paying taxes w…
## 3 7.81e17 2016-09-27 23:08:41 "If we stand together, there's nothing we can't d…
## 4 7.81e17 2016-09-27 22:30:27 "Both candidates were asked about how they'd conf…
## 5 7.81e17 2016-09-27 22:13:24 "Join me for a 3pm rally - tomorrow at the Mid-Am…
## 6 7.81e17 2016-09-27 21:35:28 "This election is too important to sit out. Go to…

The Created_At column is in UTC, so we subtract 4 hours from it to get New York time (a fixed 4-hour offset matches Eastern Daylight Time). The subtraction is done in seconds, which is why we need three factors: 4 × 60 × 60.

Now we compute the time range.
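The offset arithmetic can be sketched as follows. The first timestamp is taken from the data shown above; the January timestamp is a hypothetical value added to show where a fixed offset and a real time-zone conversion differ. Using with_tz() is an alternative, not necessarily what the original code does.

```r
library(lubridate)

# First value is from the data; the second is a made-up winter timestamp.
created_utc <- ymd_hms(c("2016-09-28T00:22:34", "2016-01-05T03:36:53"),
                       tz = "UTC")

# Fixed offset: 4 hours * 60 minutes * 60 seconds, subtracted in seconds.
fixed_shift <- created_utc - 4 * 60 * 60

# Alternative: let lubridate apply the New York time zone rules,
# which use a 5-hour offset in winter and 4 hours in summer.
new_york <- with_tz(created_utc, tzone = "America/New_York")

format(fixed_shift, "%Y-%m-%d %H:%M:%S")
format(new_york, "%Y-%m-%d %H:%M:%S")
```

The two approaches agree for the September tweet and differ by one hour for the January one, so the fixed offset is only exact for tweets posted during daylight saving time.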

## [1] "2016-01-04 23:36:53 UTC"
## [1] "2016-09-27 20:22:34 UTC"

Let’s plot the time series of tweet counts per hour.

There is an interesting peak at around 2016-07-28 19:00:00. Let’s have a look at some tweets during this peak.
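One way to get hourly counts, sketched on three timestamps: round each time down to the hour and count. The column name `Created_At_Round` appears later in this report; the use of floor_date() and the names `toy_times` and `per_hour` are assumptions.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample of already-shifted timestamps.
toy_times <- tibble::tibble(
  Created_At = ymd_hms(c("2016-09-27 20:22:34",
                         "2016-09-27 20:45:00",
                         "2016-09-27 19:08:41"))
)

per_hour <- toy_times %>%
  # Round each timestamp down to the start of its hour.
  mutate(Created_At_Round = floor_date(Created_At, unit = "hour")) %>%
  # One row per hour with the number of tweets in it.
  count(Created_At_Round)

per_hour
```

The resulting table can be passed to ggplot2 with `Created_At_Round` on the x axis and `n` on the y axis.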

##  [1] "The question in this election: Who can put the plans into action that will make your life better? https://t.co/XreEY9OicG"                   
##  [2] "Last night, Donald Trump said not paying taxes was \"smart.\" You know what I call it? Unpatriotic. https://t.co/t0xmBfj7zF"                 
##  [3] "If we stand together, there's nothing we can't do. \n\nMake sure you're ready to vote: https://t.co/tTgeqxNqYm https://t.co/Q3Ymbb7UNy"      
##  [4] "Both candidates were asked about how they'd confront racial injustice. Only one had a real answer. https://t.co/sjnEokckis"                  
##  [5] "Join me for a 3pm rally - tomorrow at the Mid-America Center in Council Bluffs, Iowa! Tickets:… https://t.co/dfzsbICiXc"                     
##  [6] "This election is too important to sit out. Go to https://t.co/tTgeqxNqYm and make sure you're registered. #NationalVoterRegistrationDay -H"  
##  [7] "When Donald Trump goes low...register to vote: https://t.co/tTgeqxNqYm https://t.co/DXz9dEwsZS"                                              
##  [8] "Once again, we will have a government of, by and for the people. Join the MOVEMENT today! https://t.co/lWjYDbPHav https://t.co/uYwJrtZkAe"   
##  [9] "The election is just weeks away. Check if you're registered to vote at https://t.co/HcMAh8ljR0, only takes a few cl… https://t.co/H1H7hAA4XM"
## [10] "On National #VoterRegistrationDay, make sure you're registered to vote so we can #MakeAmericaGreatAgain… https://t.co/0wib6UEZON"            
## [11] "Hillary Clinton's Campaign Continues To Make False Claims About Foundation Disclosure: \nhttps://t.co/zhkEfUouHH"                            
## [12] "Donald Trump lied to the American people at least 58 times during the first presidential debate. (We counted.) https://t.co/h43O6Rws4S"      
## [13] "Great afternoon in Little Havana with Hispanic community leaders. Thank you for your support! #ImWithYou https://t.co/vxWZ2tyJTF"            
## [14] "In the last 24 hrs. we have raised over $13M from online donations and National Call Day, and we’re still going! Thank you America! #MAGA"   
## [15] "“She gained about 55 pounds in...9 months. She was like an eating machine.” —Trump, a man who wants to be president: https://t.co/1ht91eZCyw"
## [16] "It's #NationalVoterRegistrationDay. Celebrate by registering to vote → https://t.co/tTgeqxNqYm https://t.co/R6lVvgLECG"                      
## [17] "\"I love this country.\nI’m proud of this country.\nI want to be a leader who brings people together.\"\n—Hillary #LoveTrumpsHate"           
## [18] "We don’t want to turn against each other.\nWe want to work with one another.\nWe want to set big goals in this country.\n#StrongerTogether"  
## [19] "\"What we hear from my opponent is dangerously incoherent. It's unclear what he's saying, but words matter.\" —Hillary"                      
## [20] "One candidate made it clear he wasn’t prepared for last night’s debate. The other made it clear she’s prepared to b… https://t.co/InYZBmnbBM"

Let’s prepare the text.

Let’s replace accents.

We could also use stemming by uncommenting the following line: tm_map(stemDocument, ‘spanish’). We then write the cleaned text back into the original tibble.

## [1] TRUE

We extract only the hashtags of each tweet.

We apply it to our data.

## # A tibble: 6 × 1
##   Hashtags
##   <chr>   
## 1 <NA>    
## 2 <NA>    
## 3 <NA>    
## 4 <NA>    
## 5 <NA>    
## 6 <NA>
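A hashtag extractor along these lines could be written as below. The function name `get_hashtags`, the regular expression, and the choice to lowercase and drop the leading “#” are assumptions; the example tweets are adapted from the sample above.

```r
library(stringr)

# Return the hashtags of one tweet as a single space-separated string,
# lowercased and without the leading "#", or NA if there are none.
get_hashtags <- function(text) {
  tags <- str_extract_all(text, "#[[:alnum:]_]+")[[1]]
  if (length(tags) == 0) {
    return(NA_character_)
  }
  str_c(str_to_lower(str_remove(tags, "#")), collapse = " ")
}

toy_text <- c(
  "It's #NationalVoterRegistrationDay. Celebrate by registering to vote",
  "Thank you America! #MAGA #ImWithYou",
  "If we stand together, there's nothing we can't do."
)

# Apply the function to every tweet.
toy_hashtags <- vapply(toy_text, get_hashtags, character(1), USE.NAMES = FALSE)
toy_hashtags
```

Note that the extractor must run on text that still contains the “#” character; if earlier cleaning strips punctuation, every tweet will yield NA.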

We now merge these data frames together.

We can do an analogous analysis for hashtags.

Let’s see the most frequent hashtags.

## Warning in wordcloud(words = str_c("#", hashtags.unnested.count$hashtag), :
## #makeamericagreatagain could not be fit on page. It will not be plotted.

Let’s split the data into tweets before and after the results are known, i.e. we split the Created_At_Round column with respect to results.time. “m” will represent tweets before results.time.

“p” will represent tweets after results.time.

We count the most popular words in the tweets, after removing the shortcut ‘q’ for ‘que’.

## # A tibble: 10 × 2
##    word          n
##    <chr>     <int>
##  1 trump      1062
##  2 hillary    1023
##  3 will        805
##  4 thank       559
##  5 great       530
##  6 donald      486
##  7 people      414
##  8 america     400
##  9 just        394
## 10 president   382

Now we can visualize these counts in a bar plot.

We can do the same for the split data.

Here is before the results.

## Warning: Outer names are only allowed for unnamed scalar atomic inputs

And here is after the results.

## Warning: Outer names are only allowed for unnamed scalar atomic inputs

Wordcloud before the results:

Wordcloud after the results:

Again, we do an analogous analysis for hashtags.

## Warning in wordcloud(words = str_c("#", hashtags.unnested.count$hashtag), :
## #makeamericagreatagain could not be fit on page. It will not be plotted.

The most popular hashtags for ‘YES’ and ‘NO’ are #trump2016 and #americafirst, respectively. Let us see how the volume of these hashtags develops. As we can see, as time passes people use the #americafirst hashtag more.

Network part

Let’s count pairwise occurrences of words that appear together in the text.

## # A tibble: 10 × 1
##    bigram       
##    <chr>        
##  1 the question 
##  2 question in  
##  3 in this      
##  4 this election
##  5 election who 
##  6 who can      
##  7 can put      
##  8 put the      
##  9 the plans    
## 10 plans into

We can filter for stop words and remove white spaces.

Let’s group and count by bigram.

## # A tibble: 6 × 3
##   word1   word2   weight
##   <chr>   <chr>    <int>
## 1 donald  trump      363
## 2 hillary clinton    228
## 3 crooked hillary    169
## 4 make    america    137
## 5 america great      115
## 6 ted     cruz       102
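The three steps above (tokenize into bigrams, filter stop words, group and count) can be sketched on a few made-up texts. The names `toy_tweets` and `bigram_counts` are hypothetical, and the real pipeline may clean the text differently.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical mini-corpus.
toy_tweets <- tibble::tibble(
  text = c("the donald trump rally", "donald trump iowa", "crooked hillary")
)

bigram_counts <- toy_tweets %>%
  # Split each text into overlapping pairs of consecutive words.
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # Put the two words in separate columns.
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Drop any pair that contains a stop word.
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
  # Count each pair; the count becomes the edge weight.
  count(word1, word2, sort = TRUE, name = "weight")

bigram_counts
```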

We plot the distribution of the weight values.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution is very skewed, so for visualization purposes it might be a good idea to apply a transformation.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We define a weighted network from a bigram count. Each word is going to represent a node. Two words are going to be connected if they appear as a bigram. The weight of an edge is the number of times the bigram appears in the corpus. (Optional) We are free to decide whether we want the graph to be directed or not. For visualization purposes, we can set a threshold which defines the minimal weight allowed in the graph. It is necessary to name the weight column weight.
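A minimal sketch of this construction with igraph, using the six bigram counts printed above as input. The threshold value of 150 and the object names are assumptions chosen for illustration.

```r
library(dplyr)
library(igraph)

# Bigram counts as shown in the output above.
bigram_counts <- tibble::tribble(
  ~word1,    ~word2,    ~weight,
  "donald",  "trump",   363,
  "hillary", "clinton", 228,
  "crooked", "hillary", 169,
  "make",    "america", 137,
  "america", "great",   115,
  "ted",     "cruz",    102
)

threshold <- 150  # hypothetical minimal weight

# The first two columns define the edges; the column named "weight"
# is picked up automatically as the edge weight attribute.
network <- bigram_counts %>%
  filter(weight > threshold) %>%
  graph_from_data_frame(directed = FALSE)

is_weighted(network)
```

With this threshold only three edges remain, connecting five words; lowering the threshold adds more edges and nodes.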

For visualization purposes we scale by a global factor.

## IGRAPH a8ec170 UNW- 2 1 -- 
## + attr: name (v/c), weight (e/n)
## + edge from a8ec170 (vertex names):
## [1] donald--trump
## [1] TRUE

Let’s add some additional information to the visualization: Set the sizes of the nodes and the edges by the degree and weight respectively. For a weighted network we can consider the weighted degree, which can be computed with the strength function.

We store the degree.

We compute the weight shares.
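These two steps could look like the sketch below. Storing the weighted degree as a vertex attribute follows the text; defining the weight share as each weight divided by the total weight is an assumption, and the original code may scale differently.

```r
library(igraph)

# Small weighted graph built from three of the bigram counts above.
edges <- data.frame(
  word1  = c("donald", "hillary", "crooked"),
  word2  = c("trump", "clinton", "hillary"),
  weight = c(363, 228, 169)
)
network <- graph_from_data_frame(edges, directed = FALSE)

# Weighted degree: the sum of the weights of all edges touching a node.
V(network)$degree <- strength(network)

# Weight share of each edge (assumed definition: weight / total weight).
E(network)$width <- E(network)$weight / sum(E(network)$weight)

strength(network)
```

For example, “hillary” touches the edges with weights 228 and 169, so its weighted degree is 397.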

We can extract the biggest connected component of the network as follows. We get all connected components.

## $membership
## donald  trump 
##      1      1 
## 
## $csize
## [1] 2
## 
## $no
## [1] 1

We select the biggest connected component.

## IGRAPH a90a409 UNW- 2 1 -- 
## + attr: name (v/c), degree (v/n), cluster (v/n), weight (e/n), width
## | (e/n)
## + edge from a90a409 (vertex names):
## [1] donald--trump
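The component selection can be sketched as follows on a graph that has two components. The edge list reuses bigram counts shown earlier; the object names are hypothetical.

```r
library(igraph)

# Two components: {donald, trump} and {hillary, clinton, crooked}.
edges <- data.frame(
  word1  = c("donald", "hillary", "crooked"),
  word2  = c("trump", "clinton", "hillary"),
  weight = c(363, 228, 169)
)
network <- graph_from_data_frame(edges, directed = FALSE)

# membership assigns each node a component id; csize holds the sizes.
comps <- components(network)

# Id of the largest component.
biggest <- which.max(comps$csize)

# Keep only the nodes in that component, together with their edges.
biggest_component <- induced_subgraph(
  network,
  vids = which(comps$membership == biggest)
)

V(biggest_component)$name
```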

We store the degree.

We compute the weight shares.

Let’s make our visualization more dynamic.

We store the degree, compute the weight shares, create the networkD3 object, define the node size, define the color group (I will explore this feature later), and define the edge widths, in that order.

Now let’s decrease the threshold to get a more complex network. We can see which words are mostly used together, such as “Donald Trump”, “Make America Great”, and “Hillary Clinton Crooked”.

Let’s consider skipgrams.

Consider the example tweet.

##                                                                     character(0) 
## "if we stand together theres nothing we cant do make sure youre ready to vote  "

The skipgrams are:

## # A tibble: 11 × 1
##    skipgram     
##    <chr>        
##  1 this         
##  2 this election
##  3 this who     
##  4 election     
##  5 election who 
##  6 election can 
##  7 who          
##  8 who can      
##  9 who put      
## 10 can          
## 11 can put
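A small sketch of skipgram tokenization with tidytext, assuming pairs of words with at most one word skipped between them (n = 2, k = 1), which is consistent with the output above. The five-word input and the object names are illustrative.

```r
library(dplyr)
library(tidytext)

toy_text <- tibble::tibble(text = "this election who can put")

# Skipgrams of up to two words, allowing at most one skipped word.
skipgrams <- toy_text %>%
  unnest_tokens(skipgram, text, token = "skip_ngrams", n = 2, k = 1)

# Keep only the skipgrams that contain exactly two words.
two_word <- skipgrams %>%
  filter(grepl(" ", skipgram, fixed = TRUE))

two_word
```

Compared with plain bigrams, this also pairs words such as “this who” that are separated by one other word, which is why some weights in the skipgram counts are higher than in the bigram counts.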

Let’s count the skipgrams containing two words.

## # A tibble: 6 × 3
##   word1   word2   weight
##   <chr>   <chr>    <int>
## 1 donald  trump      371
## 2 hillary clinton    228
## 3 crooked hillary    169
## 4 make    america    137
## 5 america great      124
## 6 make    great      112

Threshold

We see that “Hillary Clinton Crooked” are the words most often used together.

We now compute the centrality measures for the biggest connected component from above.

## # A tibble: 3 × 4
##   word    degree closeness betweenness
##   <chr>    <dbl>     <dbl>       <dbl>
## 1 hillary    397   0.00252           1
## 2 clinton    228   0.0016            0
## 3 crooked    169   0.00177           0
## # A tibble: 3 × 4
##   word    degree closeness betweenness
##   <chr>    <dbl>     <dbl>       <dbl>
## 1 hillary    397   0.00252           1
## 2 crooked    169   0.00177           0
## 3 clinton    228   0.0016            0
## # A tibble: 3 × 4
##   word    degree closeness betweenness
##   <chr>    <dbl>     <dbl>       <dbl>
## 1 hillary    397   0.00252           1
## 2 crooked    169   0.00177           0
## 3 clinton    228   0.0016            0
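The three measures can be computed with igraph as in the sketch below, which rebuilds the three-word component from the weights shown above. One caveat worth keeping in mind: igraph treats edge weights as distances in closeness() and betweenness(), so a heavier edge counts as a longer path. The object names are hypothetical.

```r
library(igraph)

# The biggest connected component, with the weights shown above.
edges <- data.frame(
  word1  = c("hillary", "crooked"),
  word2  = c("clinton", "hillary"),
  weight = c(228, 169)
)
g <- graph_from_data_frame(edges, directed = FALSE)

centrality <- data.frame(
  word        = V(g)$name,
  degree      = strength(g),     # weighted degree
  closeness   = closeness(g),    # 1 / sum of weighted distances to all others
  betweenness = betweenness(g)   # number of shortest paths passing through
)

# Sort by each measure in turn, highest first.
centrality[order(-centrality$degree), ]
centrality[order(-centrality$closeness), ]
centrality[order(-centrality$betweenness), ]
```

For “hillary”, the distances to the other two words are 228 and 169, so its closeness is 1/397 ≈ 0.00252, matching the table above.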

Louvain method

Cluster (community) detection

## IGRAPH clustering multi level, groups: 1, mod: 0
## + groups:
##   $`1`
##   [1] "hillary" "crooked" "clinton"
## 

Modularity is a chance-corrected statistic, defined as the fraction of ties that fall within the given groups minus the expected such fraction if the ties were distributed at random. We encode the membership as a node attribute; zoom and click on each node to explore the clusters.
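A minimal sketch of the clustering step with igraph’s cluster_louvain(), run on the same three-word component. The seed, the attribute name `cluster`, and the object names are assumptions; on a graph this small the method may well return a single group.

```r
library(igraph)

edges <- data.frame(
  word1  = c("hillary", "crooked"),
  word2  = c("clinton", "hillary"),
  weight = c(228, 169)
)
g <- graph_from_data_frame(edges, directed = FALSE)

# The Louvain method has a random component, so fix the seed.
set.seed(1)
communities <- cluster_louvain(g)

# Store the community label of each word as a node attribute.
V(g)$cluster <- as.integer(membership(communities))

# Modularity of the detected partition.
modularity(communities)

# Words per cluster, collapsed into one string per community.
words_per_cluster <- tapply(V(g)$name, V(g)$cluster, paste, collapse = ", ")
words_per_cluster
```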

We use the membership label to color the nodes.

Words per cluster:

## [1] "hillary, crooked, clinton"

THANK YOU!