The 2016 US presidential election between Donald Trump and Hillary Clinton was one of the most controversial and highly publicized elections in recent history. Social media platforms, such as Twitter, played a significant role in shaping public opinion and spreading information about the candidates and their policies. In this project, we will analyze a dataset of tweets related to the 2016 US presidential election to gain insight into the sentiments and opinions expressed by Twitter users during this time period. By using text mining techniques such as sentiment analysis and topic modeling, we aim to answer questions such as: How did Twitter users perceive the two candidates and their policies? What were the most common topics of discussion among Twitter users during the election? Our results will provide a snapshot of the public discourse surrounding the 2016 presidential election and offer insight into the role of social media in shaping political opinions and discussions.
Here are the columns we used in our analysis, with their descriptions:
Let’s first load all of the libraries we will use during our analysis.
library(tm)
library(SnowballC)
library(textmineR)
library(tidyverse)
library(stringr)
library(dplyr)
library(wordcloud)
library(tidytext)
library(topicmodels)
library(textdata)
library(janeaustenr)
library(tidyr)
library(ggplot2)
library(reshape2)
library(visNetwork)
library(glue)
library(cowplot)
library(magrittr)
library(plotly)
library(widyr)
library(hms)
library(lubridate)
library(igraph)
library(networkD3)
library(rjson)
library(ngram)
Let’s set our working directory and read our dataset.
Here we create Document-Term Matrix (DTM). DTM is a matrix representation of a collection of documents where each cell represents the frequency (or count) of a term (word) in a document. The rows represent documents, and the columns represent terms.
## as(<dgTMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead
We use the matrix of term counts to compute the IDF vector and weight the terms by TF-IDF. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space; we convert cosine similarity to cosine distance by subtracting it from 1.
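To make the TF-IDF and cosine distance steps concrete, here is a minimal sketch in base R. The toy matrix and its counts are made up for illustration, and the TF-IDF variant shown (length-normalised term frequency times log inverse document frequency) is one common choice, not necessarily the exact weighting used in the analysis.

```r
# Toy document-term matrix: 3 documents (rows) x 4 terms (columns).
# The counts are invented for illustration only.
dtm <- matrix(
  c(2, 1, 0, 0,
    0, 1, 1, 0,
    0, 0, 1, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("trump", "vote", "america", "hillary"))
)

# Term frequency: counts divided by the length of each document.
tf <- dtm / rowSums(dtm)

# Inverse document frequency: log(number of docs / docs containing the term).
idf <- log(nrow(dtm) / colSums(dtm > 0))

# TF-IDF: multiply every column (term) by its IDF weight.
tfidf <- t(t(tf) * idf)

# Cosine similarity: dot products divided by the product of vector norms.
row_norms <- sqrt(rowSums(tfidf^2))
cos_sim <- (tfidf %*% t(tfidf)) / (row_norms %o% row_norms)

# Cosine distance is 1 minus cosine similarity.
cos_dist <- 1 - cos_sim
print(round(cos_dist, 3))
```

Documents that share no terms (doc1 and doc3 here) end up at distance 1, and every document is at distance 0 from itself.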
We now create a word cloud to see clearly which words are used most often after removing stop words and similar noise.
Here we filter and summarize the data to see how many tweets come from each candidate’s account. The two accounts have almost the same number of tweets. However, Trump’s tweets were retweeted about twice as often in total as Clinton’s.
## # A tibble: 2 × 2
## handle count
## <chr> <int>
## 1 HillaryClinton 3226
## 2 realDonaldTrump 3218
## # A tibble: 2 × 2
## handle summing
## <chr> <int>
## 1 HillaryClinton 9625246
## 2 realDonaldTrump 18703714
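A summary like the one above can be sketched with dplyr as follows. The three-row tibble is toy data, and the column names count and summing are chosen to mirror the printed output; the original code is not shown, so treat this as one possible way to get there.

```r
library(dplyr)

# Toy stand-in for the tweets data: one row per tweet.
tweets <- tibble(
  handle        = c("HillaryClinton", "HillaryClinton", "realDonaldTrump"),
  retweet_count = c(218L, 2445L, 2181L)
)

# Number of tweets per account.
tweet_counts <- tweets %>%
  count(handle, name = "count")

# Total number of retweets per account.
retweet_totals <- tweets %>%
  group_by(handle) %>%
  summarise(summing = sum(retweet_count))

print(tweet_counts)
print(retweet_totals)
```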
We will tokenize the words now. Tokenizing in text mining and natural language processing (NLP) refers to the process of breaking a text document into smaller units, called tokens. We will use unnest_tokens().
unnest_tokens() has done some cleaning: it removed punctuation and whitespace, transformed the text to lowercase, etc.
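As a small illustration of what unnest_tokens() does, here is a sketch on a single tweet from the dataset (with the link removed). The data frame and column names id and text are assumptions for this example.

```r
library(dplyr)
library(tidytext)

# One tweet as a one-row data frame; column names are assumed for this sketch.
tweets <- tibble(
  id   = 1L,
  text = "The question in this election: Who can put the plans into action that will make your life better?"
)

# Split the text column into one lowercase word per row.
tokens <- tweets %>%
  unnest_tokens(word, text)

print(tokens)
```

The single row becomes one row per word, with punctuation dropped and everything in lowercase.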
Having a single word per row means that our dataset dimensions exploded! Now there are 120227 rows. Let’s count the words.
## word n
## 1 t.co 4021
## 2 https 4020
## 3 the 3601
## 4 to 2843
## 5 a 2009
## 6 and 1928
We can follow unnest_tokens() with a step that removes stop words.
The number of words is drastically reduced.
## [1] 62527
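One common way to drop stop words after tokenizing is an anti_join() against the stop_words table that ships with tidytext. The sketch below assumes that approach; the one-tweet data frame is for illustration only.

```r
library(dplyr)
library(tidytext)

tweets <- tibble(
  id   = 1L,
  text = "The question in this election: Who can put the plans into action that will make your life better?"
)

# Tokenize first, then keep only the words that are NOT in stop_words.
tokens <- tweets %>%
  unnest_tokens(word, text)

tidy_tokens <- tokens %>%
  anti_join(stop_words, by = "word")

print(nrow(tokens))       # rows before removing stop words
print(tidy_tokens$word)   # remaining words
```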
Let’s count words again.
## word n
## 1 t.co 4021
## 2 https 4020
## 3 trump 1104
## 4 hillary 1035
## 5 donald 493
## 6 realdonaldtrump 422
Let’s now visualize the text, starting with tidy_trump.
Now each tweet has a numeric id thanks to row_number(). After unnest_tokens() each row (observation) is a single word, and each tweet has as many rows as it has non-stop-words.
## id handle is_retweet original_author time
## 1 1 HillaryClinton False 2016-09-28T00:22:34
## 2 1 HillaryClinton False 2016-09-28T00:22:34
## 3 1 HillaryClinton False 2016-09-28T00:22:34
## 4 1 HillaryClinton False 2016-09-28T00:22:34
## 5 1 HillaryClinton False 2016-09-28T00:22:34
## 6 1 HillaryClinton False 2016-09-28T00:22:34
## in_reply_to_screen_name in_reply_to_status_id in_reply_to_user_id
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## is_quote_status lang retweet_count favorite_count longitude latitude place_id
## 1 False en 218 651 NA NA
## 2 False en 218 651 NA NA
## 3 False en 218 651 NA NA
## 4 False en 218 651 NA NA
## 5 False en 218 651 NA NA
## 6 False en 218 651 NA NA
## place_full_name place_name place_type place_country_code place_country
## 1
## 2
## 3
## 4
## 5
## 6
## place_contained_within place_attributes place_bounding_box
## 1
## 2
## 3
## 4
## 5
## 6
## source_url truncated
## 1 https://studio.twitter.com False
## 2 https://studio.twitter.com False
## 3 https://studio.twitter.com False
## 4 https://studio.twitter.com False
## 5 https://studio.twitter.com False
## 6 https://studio.twitter.com False
## entities
## 1 {'media': [{'display_url': 'pic.twitter.com/XreEY9OicG', 'expanded_url': 'https://twitter.com/HillaryClinton/status/780925634159796224/video/1', 'sizes': {'medium': {'h': 675, 'resize': 'fit', 'w': 1200}, 'small': {'h': 383, 'resize': 'fit', 'w': 680}, 'large': {'h': 720, 'resize': 'fit', 'w': 1280}, 'thumb': {'h': 150, 'resize': 'crop', 'w': 150}}, 'id_str': '780924569674809345', 'indices': [98, 121], 'id': 780924569674809345, 'url': 'https://t.co/XreEY9OicG', 'media_url_https': 'https://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg', 'type': 'photo', 'media_url': 'http://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg'}], 'user_mentions': [], 'symbols': [], 'urls': [], 'hashtags': []}
## 2 {'media': [{'display_url': 'pic.twitter.com/XreEY9OicG', 'expanded_url': 'https://twitter.com/HillaryClinton/status/780925634159796224/video/1', 'sizes': {'medium': {'h': 675, 'resize': 'fit', 'w': 1200}, 'small': {'h': 383, 'resize': 'fit', 'w': 680}, 'large': {'h': 720, 'resize': 'fit', 'w': 1280}, 'thumb': {'h': 150, 'resize': 'crop', 'w': 150}}, 'id_str': '780924569674809345', 'indices': [98, 121], 'id': 780924569674809345, 'url': 'https://t.co/XreEY9OicG', 'media_url_https': 'https://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg', 'type': 'photo', 'media_url': 'http://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg'}], 'user_mentions': [], 'symbols': [], 'urls': [], 'hashtags': []}
## 3 {'media': [{'display_url': 'pic.twitter.com/XreEY9OicG', 'expanded_url': 'https://twitter.com/HillaryClinton/status/780925634159796224/video/1', 'sizes': {'medium': {'h': 675, 'resize': 'fit', 'w': 1200}, 'small': {'h': 383, 'resize': 'fit', 'w': 680}, 'large': {'h': 720, 'resize': 'fit', 'w': 1280}, 'thumb': {'h': 150, 'resize': 'crop', 'w': 150}}, 'id_str': '780924569674809345', 'indices': [98, 121], 'id': 780924569674809345, 'url': 'https://t.co/XreEY9OicG', 'media_url_https': 'https://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg', 'type': 'photo', 'media_url': 'http://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg'}], 'user_mentions': [], 'symbols': [], 'urls': [], 'hashtags': []}
## 4 {'media': [{'display_url': 'pic.twitter.com/XreEY9OicG', 'expanded_url': 'https://twitter.com/HillaryClinton/status/780925634159796224/video/1', 'sizes': {'medium': {'h': 675, 'resize': 'fit', 'w': 1200}, 'small': {'h': 383, 'resize': 'fit', 'w': 680}, 'large': {'h': 720, 'resize': 'fit', 'w': 1280}, 'thumb': {'h': 150, 'resize': 'crop', 'w': 150}}, 'id_str': '780924569674809345', 'indices': [98, 121], 'id': 780924569674809345, 'url': 'https://t.co/XreEY9OicG', 'media_url_https': 'https://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg', 'type': 'photo', 'media_url': 'http://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg'}], 'user_mentions': [], 'symbols': [], 'urls': [], 'hashtags': []}
## 5 {'media': [{'display_url': 'pic.twitter.com/XreEY9OicG', 'expanded_url': 'https://twitter.com/HillaryClinton/status/780925634159796224/video/1', 'sizes': {'medium': {'h': 675, 'resize': 'fit', 'w': 1200}, 'small': {'h': 383, 'resize': 'fit', 'w': 680}, 'large': {'h': 720, 'resize': 'fit', 'w': 1280}, 'thumb': {'h': 150, 'resize': 'crop', 'w': 150}}, 'id_str': '780924569674809345', 'indices': [98, 121], 'id': 780924569674809345, 'url': 'https://t.co/XreEY9OicG', 'media_url_https': 'https://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg', 'type': 'photo', 'media_url': 'http://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg'}], 'user_mentions': [], 'symbols': [], 'urls': [], 'hashtags': []}
## 6 {'media': [{'display_url': 'pic.twitter.com/XreEY9OicG', 'expanded_url': 'https://twitter.com/HillaryClinton/status/780925634159796224/video/1', 'sizes': {'medium': {'h': 675, 'resize': 'fit', 'w': 1200}, 'small': {'h': 383, 'resize': 'fit', 'w': 680}, 'large': {'h': 720, 'resize': 'fit', 'w': 1280}, 'thumb': {'h': 150, 'resize': 'crop', 'w': 150}}, 'id_str': '780924569674809345', 'indices': [98, 121], 'id': 780924569674809345, 'url': 'https://t.co/XreEY9OicG', 'media_url_https': 'https://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg', 'type': 'photo', 'media_url': 'http://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg'}], 'user_mentions': [], 'symbols': [], 'urls': [], 'hashtags': []}
## extended_entities
## 1 {'media': [{'display_url': 'pic.twitter.com/XreEY9OicG', 'sizes': {'medium': {'h': 675, 'resize': 'fit', 'w': 1200}, 'small': {'h': 383, 'resize': 'fit', 'w': 680}, 'large': {'h': 720, 'resize': 'fit', 'w': 1280}, 'thumb': {'h': 150, 'resize': 'crop', 'w': 150}}, 'expanded_url': 'https://twitter.com/HillaryClinton/status/780925634159796224/video/1', 'indices': [98, 121], 'id': 780924569674809345, 'url': 'https://t.co/XreEY9OicG', 'video_info': {'variants': [{'content_type': 'application/dash+xml', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/pl/Hi-7lXt1Y4B7gsOA.mpd'}, {'bitrate': 320000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/320x180/uN-fdRWGnmcmGGVo.mp4'}, {'bitrate': 2176000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/1280x720/jdIrsRbzox7KDUgd.mp4'}, {'content_type': 'application/x-mpegURL', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/pl/Hi-7lXt1Y4B7gsOA.m3u8'}, {'bitrate': 832000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/640x360/pROTBCbKMos05Wqg.mp4'}], 'aspect_ratio': [16, 9], 'duration_millis': 65933}, 'id_str': '780924569674809345', 'additional_media_info': {'embeddable': True, 'title': '', 'call_to_actions': {'visit_site': {'url': 'http://hrc.io/2dxWlpm'}}, 'description': '', 'monetizable': False}, 'media_url_https': 'https://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg', 'type': 'video', 'media_url': 'http://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg'}]}
## 2 {'media': [{'display_url': 'pic.twitter.com/XreEY9OicG', 'sizes': {'medium': {'h': 675, 'resize': 'fit', 'w': 1200}, 'small': {'h': 383, 'resize': 'fit', 'w': 680}, 'large': {'h': 720, 'resize': 'fit', 'w': 1280}, 'thumb': {'h': 150, 'resize': 'crop', 'w': 150}}, 'expanded_url': 'https://twitter.com/HillaryClinton/status/780925634159796224/video/1', 'indices': [98, 121], 'id': 780924569674809345, 'url': 'https://t.co/XreEY9OicG', 'video_info': {'variants': [{'content_type': 'application/dash+xml', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/pl/Hi-7lXt1Y4B7gsOA.mpd'}, {'bitrate': 320000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/320x180/uN-fdRWGnmcmGGVo.mp4'}, {'bitrate': 2176000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/1280x720/jdIrsRbzox7KDUgd.mp4'}, {'content_type': 'application/x-mpegURL', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/pl/Hi-7lXt1Y4B7gsOA.m3u8'}, {'bitrate': 832000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/640x360/pROTBCbKMos05Wqg.mp4'}], 'aspect_ratio': [16, 9], 'duration_millis': 65933}, 'id_str': '780924569674809345', 'additional_media_info': {'embeddable': True, 'title': '', 'call_to_actions': {'visit_site': {'url': 'http://hrc.io/2dxWlpm'}}, 'description': '', 'monetizable': False}, 'media_url_https': 'https://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg', 'type': 'video', 'media_url': 'http://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg'}]}
## 3 {'media': [{'display_url': 'pic.twitter.com/XreEY9OicG', 'sizes': {'medium': {'h': 675, 'resize': 'fit', 'w': 1200}, 'small': {'h': 383, 'resize': 'fit', 'w': 680}, 'large': {'h': 720, 'resize': 'fit', 'w': 1280}, 'thumb': {'h': 150, 'resize': 'crop', 'w': 150}}, 'expanded_url': 'https://twitter.com/HillaryClinton/status/780925634159796224/video/1', 'indices': [98, 121], 'id': 780924569674809345, 'url': 'https://t.co/XreEY9OicG', 'video_info': {'variants': [{'content_type': 'application/dash+xml', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/pl/Hi-7lXt1Y4B7gsOA.mpd'}, {'bitrate': 320000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/320x180/uN-fdRWGnmcmGGVo.mp4'}, {'bitrate': 2176000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/1280x720/jdIrsRbzox7KDUgd.mp4'}, {'content_type': 'application/x-mpegURL', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/pl/Hi-7lXt1Y4B7gsOA.m3u8'}, {'bitrate': 832000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/640x360/pROTBCbKMos05Wqg.mp4'}], 'aspect_ratio': [16, 9], 'duration_millis': 65933}, 'id_str': '780924569674809345', 'additional_media_info': {'embeddable': True, 'title': '', 'call_to_actions': {'visit_site': {'url': 'http://hrc.io/2dxWlpm'}}, 'description': '', 'monetizable': False}, 'media_url_https': 'https://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg', 'type': 'video', 'media_url': 'http://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg'}]}
## 4 {'media': [{'display_url': 'pic.twitter.com/XreEY9OicG', 'sizes': {'medium': {'h': 675, 'resize': 'fit', 'w': 1200}, 'small': {'h': 383, 'resize': 'fit', 'w': 680}, 'large': {'h': 720, 'resize': 'fit', 'w': 1280}, 'thumb': {'h': 150, 'resize': 'crop', 'w': 150}}, 'expanded_url': 'https://twitter.com/HillaryClinton/status/780925634159796224/video/1', 'indices': [98, 121], 'id': 780924569674809345, 'url': 'https://t.co/XreEY9OicG', 'video_info': {'variants': [{'content_type': 'application/dash+xml', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/pl/Hi-7lXt1Y4B7gsOA.mpd'}, {'bitrate': 320000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/320x180/uN-fdRWGnmcmGGVo.mp4'}, {'bitrate': 2176000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/1280x720/jdIrsRbzox7KDUgd.mp4'}, {'content_type': 'application/x-mpegURL', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/pl/Hi-7lXt1Y4B7gsOA.m3u8'}, {'bitrate': 832000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/640x360/pROTBCbKMos05Wqg.mp4'}], 'aspect_ratio': [16, 9], 'duration_millis': 65933}, 'id_str': '780924569674809345', 'additional_media_info': {'embeddable': True, 'title': '', 'call_to_actions': {'visit_site': {'url': 'http://hrc.io/2dxWlpm'}}, 'description': '', 'monetizable': False}, 'media_url_https': 'https://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg', 'type': 'video', 'media_url': 'http://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg'}]}
## 5 {'media': [{'display_url': 'pic.twitter.com/XreEY9OicG', 'sizes': {'medium': {'h': 675, 'resize': 'fit', 'w': 1200}, 'small': {'h': 383, 'resize': 'fit', 'w': 680}, 'large': {'h': 720, 'resize': 'fit', 'w': 1280}, 'thumb': {'h': 150, 'resize': 'crop', 'w': 150}}, 'expanded_url': 'https://twitter.com/HillaryClinton/status/780925634159796224/video/1', 'indices': [98, 121], 'id': 780924569674809345, 'url': 'https://t.co/XreEY9OicG', 'video_info': {'variants': [{'content_type': 'application/dash+xml', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/pl/Hi-7lXt1Y4B7gsOA.mpd'}, {'bitrate': 320000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/320x180/uN-fdRWGnmcmGGVo.mp4'}, {'bitrate': 2176000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/1280x720/jdIrsRbzox7KDUgd.mp4'}, {'content_type': 'application/x-mpegURL', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/pl/Hi-7lXt1Y4B7gsOA.m3u8'}, {'bitrate': 832000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/640x360/pROTBCbKMos05Wqg.mp4'}], 'aspect_ratio': [16, 9], 'duration_millis': 65933}, 'id_str': '780924569674809345', 'additional_media_info': {'embeddable': True, 'title': '', 'call_to_actions': {'visit_site': {'url': 'http://hrc.io/2dxWlpm'}}, 'description': '', 'monetizable': False}, 'media_url_https': 'https://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg', 'type': 'video', 'media_url': 'http://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg'}]}
## 6 {'media': [{'display_url': 'pic.twitter.com/XreEY9OicG', 'sizes': {'medium': {'h': 675, 'resize': 'fit', 'w': 1200}, 'small': {'h': 383, 'resize': 'fit', 'w': 680}, 'large': {'h': 720, 'resize': 'fit', 'w': 1280}, 'thumb': {'h': 150, 'resize': 'crop', 'w': 150}}, 'expanded_url': 'https://twitter.com/HillaryClinton/status/780925634159796224/video/1', 'indices': [98, 121], 'id': 780924569674809345, 'url': 'https://t.co/XreEY9OicG', 'video_info': {'variants': [{'content_type': 'application/dash+xml', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/pl/Hi-7lXt1Y4B7gsOA.mpd'}, {'bitrate': 320000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/320x180/uN-fdRWGnmcmGGVo.mp4'}, {'bitrate': 2176000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/1280x720/jdIrsRbzox7KDUgd.mp4'}, {'content_type': 'application/x-mpegURL', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/pl/Hi-7lXt1Y4B7gsOA.m3u8'}, {'bitrate': 832000, 'content_type': 'video/mp4', 'url': 'https://video.twimg.com/amplify_video/780924569674809345/vid/640x360/pROTBCbKMos05Wqg.mp4'}], 'aspect_ratio': [16, 9], 'duration_millis': 65933}, 'id_str': '780924569674809345', 'additional_media_info': {'embeddable': True, 'title': '', 'call_to_actions': {'visit_site': {'url': 'http://hrc.io/2dxWlpm'}}, 'description': '', 'monetizable': False}, 'media_url_https': 'https://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg', 'type': 'video', 'media_url': 'http://pbs.twimg.com/media/CtZnY0dVUAAkwYB.jpg'}]}
## word
## 1 question
## 2 election
## 3 plans
## 4 action
## 5 life
## 6 https
Let’s visualize counts with geom_col(). Here col stands for column, so it is just like a bar plot.
There are too many words at once. Nothing is clear; we have to filter!
With filtered words the plot is better:
Let’s do more improvements. We can use coord_flip() when data are hard to read on the x axis.
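The filter-then-plot idea can be sketched like this. The counts for the frequent words are taken from the output in this document, the word "rally" with n = 12 is invented to have something to filter out, and the threshold of 300 is an assumption.

```r
library(dplyr)
library(ggplot2)

# Small word-count table; "rally" and its count are made up for illustration.
word_counts <- tibble(
  word = c("trump", "hillary", "donald", "realdonaldtrump", "people", "rally"),
  n    = c(1104L, 1035L, 493L, 422L, 416L, 12L)
)

# Keep only frequent words (threshold is an assumption for this sketch).
common_words <- word_counts %>%
  filter(n > 300)

# Column plot; coord_flip() puts the words on the vertical axis,
# where long labels are easier to read.
p <- ggplot(common_words, aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  ggtitle("Most common words")

print(p)
```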
Even after removing the common stop words we still have some left.
## # A tibble: 6 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
We often have additional words that we want to remove. We have to create our own tibble containing the custom stop words that we want to remove.
## # A tibble: 7 × 2
## word lexicon
## <chr> <chr>
## 1 a a's
## 2 able about
## 3 above according
## 4 accordingly across
## 5 actually after
## 6 t.co https
## 7 amp CUSTOM
Let’s combine this custom tibble with the original stop words.
And now we have the original stop words with our custom stop words.
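For reference, a custom stop-word tibble with one word per row can be sketched as follows. The names custom_stop_words and stop_words2 are hypothetical, and the three custom words are the ones visible in the output above.

```r
library(dplyr)
library(tidytext)

# Each row is one custom stop word; the lexicon label marks it as ours.
custom_stop_words <- tribble(
  ~word,   ~lexicon,
  "t.co",  "CUSTOM",
  "https", "CUSTOM",
  "amp",   "CUSTOM"
)

# Append the custom words to the stop_words table from tidytext.
stop_words2 <- stop_words %>%
  bind_rows(custom_stop_words)

print(tail(stop_words2))
```

With tribble() it is important that every custom word sits in the word column, because the later anti_join() matches on that column only.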
Let’s remove stop words again. Now once again we tokenize and clean using new stop words.
Let’s check if this cleaning worked.
## id time handle retweet_count word
## 1 1 2016-09-28T00:22:34 HillaryClinton 218 https
## 2 2 2016-09-27T23:45:00 HillaryClinton 2445 https
## 3 4 2016-09-27T23:08:41 HillaryClinton 916 https
## 4 4 2016-09-27T23:08:41 HillaryClinton 916 https
## 5 5 2016-09-27T22:30:27 HillaryClinton 859 https
## 6 6 2016-09-27T22:13:24 realDonaldTrump 2181 https
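We still have some rows left: in the custom tibble above, “https” ended up in the lexicon column rather than the word column, so it was never matched. So, let’s try this way.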
Let’s try again.
## [1] id time handle retweet_count word
## <0 rows> (or 0-length row.names)
Now we don’t have the word “https” any more.
The arrange() function doesn’t affect the plots! How do we fix that? Character columns can only be sorted alphabetically, but a factor column can carry information about the order of the words.
We use fct_reorder() before visualizing (load library(forcats) if necessary). Its arguments say what to reorder (word) and by what (n).
## word n word2
## 1 america 406 america
## 2 clinton 308 clinton
## 3 donald 493 donald
## 4 hillary 1035 hillary
## 5 people 416 people
## 6 president 393 president
## 7 realdonaldtrump 422 realdonaldtrump
## 8 trump 1104 trump
## 9 trump2016 350 trump2016
Now this plot with new ordered column x = word2 is arranged by word count and is far better to read:
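Here is a minimal sketch of the reordering step, using a few of the counts shown above. The point is that the factor levels of word2 follow n rather than the alphabet.

```r
library(dplyr)
library(forcats)

word_counts <- tibble(
  word = c("america", "clinton", "hillary", "people", "trump"),
  n    = c(406L, 308L, 1035L, 416L, 1104L)
)

# word2 is a factor whose levels are ordered by n (ascending).
word_counts <- word_counts %>%
  mutate(word2 = fct_reorder(word, n))

print(levels(word_counts$word2))
```

ggplot2 draws discrete axes in level order, so mapping x = word2 gives a plot sorted by count.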
We have tweets from 2 candidates. It would be nice to compare the word counts by candidate. Here, instead of just counting words, we include both the word and the candidate (handle) in the counting function.
## word handle n
## 1 trump HillaryClinton 737
## 2 hillary HillaryClinton 693
## 3 donald HillaryClinton 427
## 4 trump realDonaldTrump 367
## 5 trump2016 realDonaldTrump 350
## 6 hillary realDonaldTrump 342
Before we plot, we need to filter the data to keep just the most common words for each candidate. We want the top 10 words based on column n.
## # A tibble: 20 × 3
## # Groups: handle [2]
## word handle n
## <chr> <chr> <int>
## 1 america HillaryClinton 196
## 2 america realDonaldTrump 210
## 3 americans HillaryClinton 129
## 4 clinton realDonaldTrump 192
## 5 crooked realDonaldTrump 182
## 6 cruz realDonaldTrump 198
## 7 donald HillaryClinton 427
## 8 families HillaryClinton 143
## 9 hillary HillaryClinton 693
## 10 hillary realDonaldTrump 342
## 11 makeamericagreatagain realDonaldTrump 255
## 12 people HillaryClinton 191
## 13 people realDonaldTrump 225
## 14 potus HillaryClinton 142
## 15 president HillaryClinton 279
## 16 realdonaldtrump realDonaldTrump 326
## 17 trump HillaryClinton 737
## 18 trump realDonaldTrump 367
## 19 trump's HillaryClinton 149
## 20 trump2016 realDonaldTrump 350
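Selecting the top words per candidate can be sketched with group_by() and slice_max(). This is an assumption about the approach (the original may have used a different function such as top_n()), and the sketch keeps the top 2 instead of the top 10 to stay small. The counts are the six rows printed earlier.

```r
library(dplyr)

# Word counts per handle, copied from the output shown earlier.
word_counts <- tibble(
  word   = c("trump", "hillary", "donald", "trump", "trump2016", "hillary"),
  handle = c("HillaryClinton", "HillaryClinton", "HillaryClinton",
             "realDonaldTrump", "realDonaldTrump", "realDonaldTrump"),
  n      = c(737L, 693L, 427L, 367L, 350L, 342L)
)

# Within each handle, keep the rows with the largest n.
top_words <- word_counts %>%
  group_by(handle) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%
  ungroup()

print(top_words)
```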
Now we can prepare word_counts for plotting.
This plot uses the new ordered column x = word2, so it is arranged by word count; the fill color is set by handle and the plot is faceted by handle.
To create the Document-Term Matrix we use cast_dtm().
## <<DocumentTermMatrix (documents: 6443, terms: 12962)>>
## Non-/sparse entries: 53119/83461047
## Sparsity : 100%
## Maximal term length: 28
## Weighting : term frequency (tf)
Let’s use as.matrix().
It is a very large matrix. Let’s look at a piece of it; it is very sparse.
## Terms
## Docs border borderless borders boring borisep born borntobegop
## 1005 0 0 0 0 0 0 0
## 6316 0 0 0 0 0 0 0
## 635 0 0 0 0 0 0 0
## 6078 0 0 0 0 0 0 0
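The path from tidy word rows to a dense matrix can be sketched on toy data as follows. The five rows are invented, and the column names id and word are assumed to match the tidy data used above.

```r
library(dplyr)
library(tidytext)
library(tm)

# Toy tidy data: one row per (document, word) occurrence.
tidy_tweets <- tibble(
  id   = c(1, 1, 2, 2, 2),
  word = c("question", "election", "election", "vote", "vote")
)

# Count words per document, then cast to a DocumentTermMatrix.
dtm <- tidy_tweets %>%
  count(id, word) %>%
  cast_dtm(id, word, n)

# Convert to an ordinary dense matrix: documents in rows, terms in columns.
m <- as.matrix(dtm)
print(m)
```

Even in this tiny example some cells are 0; with thousands of tweets and terms almost all cells are 0, which is why the sparse representation matters.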
We now use LDA(). LDA stands for Latent Dirichlet Allocation, a generative probabilistic model for text data that is commonly used in topic modeling.
k is the number of topics we want the model to produce. How do we decide on the value of k?
method is the estimation method. By default it is VEM, a quick variational approximation. If we prefer a slower but more thorough method, we should specify the Gibbs sampler.
Specifying the simulation seed helps us recover consistent topics on repeated model runs, given the probabilistic nature of model estimation.
## A LDA_Gibbs topic model with 2 topics.
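The call recorded in the model object further down is LDA(x = dtm_trump, k = 2, method = "Gibbs", control = list(seed = 42)). The sketch below uses the same arguments on a tiny invented corpus, so the topics it finds are meaningless; it only shows the mechanics.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Invented toy corpus: four "documents" of three words each.
toy <- tibble(
  id   = rep(1:4, each = 3),
  word = c("jobs", "taxes", "economy",
           "jobs", "economy", "trade",
           "vote", "debate", "election",
           "vote", "election", "poll")
)

toy_dtm <- toy %>%
  count(id, word) %>%
  cast_dtm(id, word, n)

# k = number of topics; Gibbs sampling; fixed seed for reproducibility.
lda_out <- LDA(toy_dtm, k = 2, method = "Gibbs", control = list(seed = 42))

# gamma holds, for every document, the estimated share of each topic.
print(round(lda_out@gamma, 3))
```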
We can use the glimpse() function to see what is included in this model object.
## Formal class 'LDA_Gibbs' [package "topicmodels"] with 16 slots
## ..@ seedwords : NULL
## ..@ z : int [1:54125] 2 1 1 1 2 1 1 1 2 2 ...
## ..@ alpha : num 25
## ..@ call : language LDA(x = dtm_trump, k = 2, method = "Gibbs", control = list(seed = 42))
## ..@ Dim : int [1:2] 6443 12962
## ..@ control :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] with 14 slots
## ..@ k : int 2
## ..@ terms : chr [1:12962] "________" "_bscarb" "_bxddxss" "_hankrearden" ...
## ..@ documents : chr [1:6443] "1005" "6316" "635" "6078" ...
## ..@ beta : num [1:2, 1:12962] -12.6 -10.2 -12.6 -10.2 -12.6 ...
## ..@ gamma : num [1:6443, 1:2] 0.525 0.433 0.534 0.492 0.476 ...
## ..@ wordassignments:List of 5
## .. ..$ i : int [1:53119] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ j : int [1:53119] 1 642 1693 2481 3662 5846 5861 7392 10423 11413 ...
## .. ..$ v : num [1:53119] 2 2 1 1 2 1 1 1 2 2 ...
## .. ..$ nrow: int 6443
## .. ..$ ncol: int 12962
## .. ..- attr(*, "class")= chr "simple_triplet_matrix"
## ..@ loglikelihood : num -434741
## ..@ iter : int 2000
## ..@ logLiks : num(0)
## ..@ n : int 54125
Let’s evaluate the LDA model output. The most important outputs of an LDA model are the topics themselves, i.e. a dictionary of all words in the corpus sorted according to the probability that each word occurs as part of that topic.
The tidy() function takes the matrix of topic probabilities “beta” and puts it into a form that is easily visualized using ggplot2. It also lets us apply other tidyverse functions to the model output, for example arrange() and desc().
## # A tibble: 6 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 2 trump 0.0389
## 2 1 hillary 0.0366
## 3 2 donald 0.0174
## 4 2 realdonaldtrump 0.0149
## 5 1 america 0.0143
## 6 1 people 0.0140
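As a rough sketch of the tidying step, the code below fits a small model on invented data and then extracts the per-topic word probabilities with tidy(). The toy corpus and the choice of 3 top terms per topic are assumptions for illustration.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Invented toy corpus, only here so the example runs on its own.
toy_dtm <- tibble(
  id   = rep(1:4, each = 3),
  word = c("jobs", "taxes", "economy",
           "jobs", "economy", "trade",
           "vote", "debate", "election",
           "vote", "election", "poll")
) %>%
  count(id, word) %>%
  cast_dtm(id, word, n)

lda_out <- LDA(toy_dtm, k = 2, method = "Gibbs", control = list(seed = 42))

# One row per (topic, term) with the probability beta of the term in the topic.
lda_topics <- tidy(lda_out, matrix = "beta")

# Highest-probability terms within each topic.
top_terms <- lda_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 3, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, desc(beta))

print(top_terms)
```

Within each topic the beta values add up to 1, so they can be read as a probability distribution over the vocabulary.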
The output tells us what the topics are composed of, but not what the topics mean. The key goal is to find topics that are distinct from each other, with no topic repeated. Now let’s run the topic modeling with LDA() once again and gather all the code together.
Finally, let’s plot the discovered topics. We treat topic as a factor to add some color.
Three topics
We will repeat the same steps of fitting the LDA model, tidying, grouping, reordering, and finally plotting, but with k=3 topics.
Four topics
Let’s repeat this process one more time with k=4. With 2 topics, the topics are about Clinton and Trump. With 3 topics, they are about Clinton, Trump and the election, and Trump on jobs, taxes, etc. With 4 topics, we get the same 3 topics plus an additional one about general election themes.
Let’s get sentiments. “AFINN,” “BING,” and “NRC” are all sentiment analysis lexicons, i.e. lists of words with an associated sentiment. AFINN assigns each word a numeric score (negative to positive), Bing labels each word as positive or negative, and NRC tags words with categories such as positive, negative, anger, fear, sadness, and trust.
## # A tibble: 6 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## # A tibble: 6 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## # A tibble: 6 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
What are the most common joy words?
## Joining, by = "word"
## word n
## 1 vote 121
## 2 enjoy 90
## 3 love 66
## 4 wonderful 40
## 5 money 38
## 6 special 36
Let’s find a sentiment score for each word using the Bing lexicon and inner_join(). We define an index to keep track of where we are in the sequence of tweets, and use spread() so that negative and positive sentiment end up in separate columns.
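The join-count-spread pattern can be sketched on made-up data like this. The two accounts and their words are invented, so the resulting numbers say nothing about the real tweets; the index column is left out to keep the example short.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Invented tidy data: one word per row for two hypothetical accounts.
tidy_words <- tibble(
  handle = rep(c("account_a", "account_b"), each = 4),
  word   = c("love", "support", "bad", "https",
             "crooked", "bad", "win", "https")
)

net_sentiment <- tidy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep scored words only
  count(handle, sentiment) %>%                         # words per sentiment
  spread(sentiment, n, fill = 0) %>%                   # one column per sentiment
  mutate(net = positive - negative)                    # net sentiment

print(net_sentiment)
```

Words that are not in the lexicon (here "https") simply drop out at the inner_join() step.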
Let’s plot the sentiment scores across the sequence of tweets for each candidate. As we can see, Clinton’s tweets have higher sentiment scores than Trump’s.
Let’s compare the three dictionaries.
## id time handle retweet_count word
## 1 6 2016-09-27T22:13:24 realDonaldTrump 2181 join
## 2 6 2016-09-27T22:13:24 realDonaldTrump 2181 3pm
## 3 6 2016-09-27T22:13:24 realDonaldTrump 2181 rally
## 4 6 2016-09-27T22:13:24 realDonaldTrump 2181 tomorrow
## 5 6 2016-09-27T22:13:24 realDonaldTrump 2181 mid
## 6 6 2016-09-27T22:13:24 realDonaldTrump 2181 america
We use inner_join() to calculate the sentiment in different ways, and count(), spread(), and mutate() to find the net sentiment in each of these sections of text.
Now we bind them together and visualize.
Let’s see the most common positive and negative words. Here, the word “crooked” is counted as negative, while “trump” and “win” are counted as positive. Note that “trump” is scored as positive because the lexicon treats it as an ordinary English word, not as a name.
## word sentiment n
## 1 trump positive 1104
## 2 crooked negative 182
## 3 win positive 136
## 4 love positive 122
## 5 support positive 109
## 6 bad negative 108
Custom changes: We could easily add “trump” to a custom stop-words list using bind_rows().
## # A tibble: 6 × 2
## word lexicon
## <chr> <chr>
## 1 trump trump
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
Let’s create a word cloud with sentiment. Here negative words are shown in red and positive words in blue.
## Joining, by = "word"
Let’s read the data again under another name.
Let’s parse a subset of the data into a tibble.
## # A tibble: 6 × 3
## ID Created_At Text
## <dbl> <chr> <chr>
## 1 7.81e17 2016-09-28T00:22:34 "The question in this election: Who can put the p…
## 2 7.81e17 2016-09-27T23:45:00 "Last night, Donald Trump said not paying taxes w…
## 3 7.81e17 2016-09-27T23:08:41 "If we stand together, there's nothing we can't d…
## 4 7.81e17 2016-09-27T22:30:27 "Both candidates were asked about how they'd conf…
## 5 7.81e17 2016-09-27T22:13:24 "Join me for a 3pm rally - tomorrow at the Mid-Am…
## 6 7.81e17 2016-09-27T21:35:28 "This election is too important to sit out. Go to…
See the structure of this tibble.
## Rows: 6,444
## Columns: 3
## $ ID <dbl> 7.809256e+17, 7.809162e+17, 7.809116e+17, 7.809070e+17, 7.8…
## $ Created_At <chr> "2016-09-28T00:22:34", "2016-09-27T23:45:00", "2016-09-27T2…
## $ Text <chr> "The question in this election: Who can put the plans into …
Now we parse the Created_At column into a date-time format; in the raw file it has type character.
## [1] "2016-09-28T00:22:34" "2016-09-27T23:45:00" "2016-09-27T23:26:40"
## [4] "2016-09-27T23:08:41"
## # A tibble: 6 × 3
## ID Created_At Text
## <dbl> <dttm> <chr>
## 1 7.81e17 2016-09-28 00:22:34 "The question in this election: Who can put the p…
## 2 7.81e17 2016-09-27 23:45:00 "Last night, Donald Trump said not paying taxes w…
## 3 7.81e17 2016-09-27 23:08:41 "If we stand together, there's nothing we can't d…
## 4 7.81e17 2016-09-27 22:30:27 "Both candidates were asked about how they'd conf…
## 5 7.81e17 2016-09-27 22:13:24 "Join me for a 3pm rally - tomorrow at the Mid-Am…
## 6 7.81e17 2016-09-27 21:35:28 "This election is too important to sit out. Go to…
The Created_At column is in UTC, so we subtract 4 hours from it to get New York time (UTC-4 applies during daylight saving time; in winter New York is at UTC-5). The subtraction is done in seconds, which is why we need three factors: 4 × 60 × 60.
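The parsing and the shift can be sketched with lubridate on one timestamp from the data. The fixed 4-hour subtraction follows the text; the with_tz() call is shown only as an alternative that handles daylight saving time automatically and is not taken from the original analysis.

```r
library(lubridate)

# Parse the character timestamp; ymd_hms() returns a date-time in UTC.
created_at <- ymd_hms("2016-09-28T00:22:34")

# Fixed offset: 4 hours * 60 minutes * 60 seconds.
shifted <- created_at - 4 * 60 * 60
print(shifted)

# Alternative: convert to the New York time zone, which accounts for
# daylight saving time (UTC-4 in summer, UTC-5 in winter).
ny_time <- with_tz(created_at, "America/New_York")
print(ny_time)
```

For this September timestamp both approaches show the same clock time; for the January tweets they would differ by one hour.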
Now we compute the time range.
## [1] "2016-01-04 23:36:53 UTC"
## [1] "2016-09-27 20:22:34 UTC"
Let’s plot the time series of tweet counts per hour.
There is an interesting peak at around 2016-07-28 19:00:00. Let’s have a look at some tweets during this peak.
## [1] "The question in this election: Who can put the plans into action that will make your life better? https://t.co/XreEY9OicG"
## [2] "Last night, Donald Trump said not paying taxes was \"smart.\" You know what I call it? Unpatriotic. https://t.co/t0xmBfj7zF"
## [3] "If we stand together, there's nothing we can't do. \n\nMake sure you're ready to vote: https://t.co/tTgeqxNqYm https://t.co/Q3Ymbb7UNy"
## [4] "Both candidates were asked about how they'd confront racial injustice. Only one had a real answer. https://t.co/sjnEokckis"
## [5] "Join me for a 3pm rally - tomorrow at the Mid-America Center in Council Bluffs, Iowa! Tickets:… https://t.co/dfzsbICiXc"
## [6] "This election is too important to sit out. Go to https://t.co/tTgeqxNqYm and make sure you're registered. #NationalVoterRegistrationDay -H"
## [7] "When Donald Trump goes low...register to vote: https://t.co/tTgeqxNqYm https://t.co/DXz9dEwsZS"
## [8] "Once again, we will have a government of, by and for the people. Join the MOVEMENT today! https://t.co/lWjYDbPHav https://t.co/uYwJrtZkAe"
## [9] "The election is just weeks away. Check if you're registered to vote at https://t.co/HcMAh8ljR0, only takes a few cl… https://t.co/H1H7hAA4XM"
## [10] "On National #VoterRegistrationDay, make sure you're registered to vote so we can #MakeAmericaGreatAgain… https://t.co/0wib6UEZON"
## [11] "Hillary Clinton's Campaign Continues To Make False Claims About Foundation Disclosure: \nhttps://t.co/zhkEfUouHH"
## [12] "Donald Trump lied to the American people at least 58 times during the first presidential debate. (We counted.) https://t.co/h43O6Rws4S"
## [13] "Great afternoon in Little Havana with Hispanic community leaders. Thank you for your support! #ImWithYou https://t.co/vxWZ2tyJTF"
## [14] "In the last 24 hrs. we have raised over $13M from online donations and National Call Day, and we’re still going! Thank you America! #MAGA"
## [15] "“She gained about 55 pounds in...9 months. She was like an eating machine.” —Trump, a man who wants to be president: https://t.co/1ht91eZCyw"
## [16] "It's #NationalVoterRegistrationDay. Celebrate by registering to vote → https://t.co/tTgeqxNqYm https://t.co/R6lVvgLECG"
## [17] "\"I love this country.\nI’m proud of this country.\nI want to be a leader who brings people together.\"\n—Hillary #LoveTrumpsHate"
## [18] "We don’t want to turn against each other.\nWe want to work with one another.\nWe want to set big goals in this country.\n#StrongerTogether"
## [19] "\"What we hear from my opponent is dangerously incoherent. It's unclear what he's saying, but words matter.\" —Hillary"
## [20] "One candidate made it clear he wasn’t prepared for last night’s debate. The other made it clear she’s prepared to b… https://t.co/InYZBmnbBM"
Let’s prepare the text.
Let’s replace accents.
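We could also apply stemming by uncommenting the following line: tm_map(stemDocument, ‘spanish’). Note that the tweets analyzed here are in English, so the language argument would have to be ‘english’. We then write the cleaned text back into the original tibble.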
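A sketch of what the preparation step might look like with the tm package. The exact sequence of transformations used in this analysis is not shown, so the steps below (removing URLs, lower-casing, removing punctuation, collapsing white space) are assumptions; `remove_urls` is a hypothetical helper. The two tweets are taken from the sample above.

```r
library(tm)

tweets <- c(
  "If we stand together, there's nothing we can't do. \n\nMake sure you're ready to vote: https://t.co/tTgeqxNqYm https://t.co/Q3Ymbb7UNy",
  "Once again, we will have a government of, by and for the people. Join the MOVEMENT today! https://t.co/lWjYDbPHav https://t.co/uYwJrtZkAe"
)

# Hypothetical helper: drop everything that looks like a URL.
remove_urls <- function(x) gsub("http\\S+", "", x)

corpus <- VCorpus(VectorSource(tweets))
corpus <- tm_map(corpus, content_transformer(remove_urls))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
# corpus <- tm_map(corpus, stemDocument, "english")  # optional stemming

# Recover the cleaned text as a plain character vector.
clean_text <- trimws(unname(sapply(corpus, as.character)))
clean_text[1]
# "if we stand together theres nothing we cant do make sure youre ready to vote"
```

The character vector can then be assigned back to the text column of the tibble.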
## [1] TRUE
We extract only the hashtags of each tweet.
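One way to write such an extraction function is sketched below with stringr. The name `get_hashtags`, the regular expression, and the choice to join several hashtags with a space are all assumptions made for illustration; the tweets are taken from the sample above.

```r
library(stringr)

# Hypothetical helper: return all hashtags of one tweet as a single string,
# or NA when the tweet contains no hashtag.
get_hashtags <- function(tweet) {
  tags <- str_extract_all(tweet, "#\\w+")[[1]]
  if (length(tags) == 0) NA_character_ else str_c(tags, collapse = " ")
}

tweets <- c(
  "This election is too important to sit out. Go to https://t.co/tTgeqxNqYm and make sure you're registered. #NationalVoterRegistrationDay -H",
  "Great afternoon in Little Havana with Hispanic community leaders. Thank you for your support! #ImWithYou https://t.co/vxWZ2tyJTF",
  "Donald Trump lied to the American people at least 58 times during the first presidential debate. (We counted.) https://t.co/h43O6Rws4S"
)

hashtags <- vapply(tweets, get_hashtags, character(1), USE.NAMES = FALSE)
hashtags
```

Tweets without any hashtag yield NA, which matches the NA entries in a Hashtags column.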
We apply it to our data.
## # A tibble: 6 × 1
## Hashtags
## <chr>
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
We now merge these data frames together.
We can do an analogous analysis for hashtags.
Let’s see the most frequent hashtags.
## Warning in wordcloud(words = str_c("#", hashtags.unnested.count$hashtag), :
## #makeamericagreatagain could not be fit on page. It will not be plotted.
Let’s split the data before and after the results are known, i.e. we split on the Created_At_Round column with respect to results.time. “m” will represent tweets before results.time.
“p” will represent tweets after results.time.
We count the most popular words in the tweets and remove the shortcut ‘q’ for ‘que’.
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 trump 1062
## 2 hillary 1023
## 3 will 805
## 4 thank 559
## 5 great 530
## 6 donald 486
## 7 people 414
## 8 america 400
## 9 just 394
## 10 president 382
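Counts like these can be produced with tidytext by splitting each tweet into one word per row and counting the rows. The sketch below uses a few made-up tweets, not the real data, and omits any stop word handling.

```r
library(dplyr)
library(tidytext)

# Made-up example tweets for illustration only.
tweets <- tibble(text = c(
  "donald trump will win",
  "thank you hillary",
  "hillary and trump debate",
  "trump rally"
))

word_counts <- tweets %>%
  unnest_tokens(word, text) %>%   # one row per word, lower-cased
  count(word, sort = TRUE)        # frequency of each word, descending

head(word_counts, 10)
```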
Now we can visualize these counts in a bar plot.
We can do the same for the split data.
Here are the counts before the results.
## Warning: Outer names are only allowed for unnamed scalar atomic inputs
And here are the counts after the results.
## Warning: Outer names are only allowed for unnamed scalar atomic inputs
Wordcloud before the results:
Wordcloud after the results:
Again, we do an analogous analysis for hashtags.
## Warning in wordcloud(words = str_c("#", hashtags.unnested.count$hashtag), :
## #makeamericagreatagain could not be fit on page. It will not be plotted.
The most popular hashtags are #trump2016 and, after that, #americafirst. Let us see how the volume of these hashtags develops over time. As time passes, people use the #americafirst hashtag more.
Let’s count pairwise occurrences of words that appear next to each other in the text (bigrams).
## # A tibble: 10 × 1
## bigram
## <chr>
## 1 the question
## 2 question in
## 3 in this
## 4 this election
## 5 election who
## 6 who can
## 7 can put
## 8 put the
## 9 the plans
## 10 plans into
We can filter for stop words and remove white spaces.
Let’s group and count by bigram.
## # A tibble: 6 × 3
## word1 word2 weight
## <chr> <chr> <int>
## 1 donald trump 363
## 2 hillary clinton 228
## 3 crooked hillary 169
## 4 make america 137
## 5 america great 115
## 6 ted cruz 102
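The bigram steps above (tokenize, filter stop words, group and count) can be sketched with tidytext as follows. The tweets are made up, and using the `stop_words` data set shipped with tidytext is an assumption; the analysis may rely on a different stop word list.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Made-up example tweets for illustration only.
tweets <- tibble(text = c(
  "donald trump is here",
  "we like donald trump",
  "hillary clinton spoke"
))

bigram_counts <- tweets %>%
  # One row per pair of consecutive words.
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  # Drop bigrams in which either word is a stop word.
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
  # Count each pair; the count column is named weight for the network step.
  count(word1, word2, sort = TRUE, name = "weight")

bigram_counts
```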
We plot the distribution of the weight values.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution is very skewed, so for visualization purposes it might be a good idea to apply a transformation first.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
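As an illustration, a logarithm is a common choice for compressing a skewed count distribution. The transformation actually used here is not shown, so `log(weight + 1)` is an assumption; the weights are the six values from the table above.

```r
# Bigram weights from the table above.
weight <- c(363, 228, 169, 137, 115, 102)

# log(x + 1) keeps the ordering but compresses large values.
log_weight <- log(weight + 1)

# Ratio between largest and smallest value, before and after.
max(weight) / min(weight)
max(log_weight) / min(log_weight)
```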
We define a weighted network from the bigram counts. Each word is going to represent a node. Two words are going to be connected if they appear as a bigram. The weight of an edge is the number of times the bigram appears in the corpus. (Optional) We are free to decide whether the graph is directed or not. For visualization purposes, we can set a threshold that defines the minimal weight allowed in the graph. The weight column must be named weight.
For visualization purposes we scale by a global factor.
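A sketch of this construction with igraph, using the bigram counts from the table above. The threshold of 300 and the scaling factor are hypothetical values chosen so that only the donald--trump edge survives, as in the printed graph.

```r
library(igraph)

# Bigram counts from the table above.
bigram_counts <- data.frame(
  word1  = c("donald", "hillary", "crooked", "make", "america", "ted"),
  word2  = c("trump", "clinton", "hillary", "america", "great", "cruz"),
  weight = c(363, 228, 169, 137, 115, 102)
)

threshold <- 300  # hypothetical minimal weight

# Keep only edges above the threshold; a column named weight
# becomes the edge weight attribute automatically.
edges <- bigram_counts[bigram_counts$weight > threshold, ]
network <- graph_from_data_frame(edges, directed = FALSE)

# Hypothetical global factor used to scale edge widths for plotting.
scale_factor <- 1 / 100
edge_width <- scale_factor * E(network)$weight

network
is_weighted(network)
```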
## IGRAPH a8ec170 UNW- 2 1 --
## + attr: name (v/c), weight (e/n)
## + edge from a8ec170 (vertex names):
## [1] donald--trump
## [1] TRUE
Let’s add some additional information to the visualization: Set the sizes of the nodes and the edges by the degree and weight respectively. For a weighted network we can consider the weighted degree, which can be computed with the strength function.
We store the degree.
We compute the weight shares.
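These two steps might look as follows in igraph. Defining the weight share as each weight divided by the maximum weight is an assumption; the edges are the hillary/clinton/crooked bigrams from the table above.

```r
library(igraph)

edges <- data.frame(
  word1  = c("hillary", "crooked"),
  word2  = c("clinton", "hillary"),
  weight = c(228, 169)
)
network <- graph_from_data_frame(edges, directed = FALSE)

# Weighted degree: the sum of the weights of all edges at a node.
V(network)$degree <- strength(network)

# Weight share of each edge relative to the heaviest edge (assumed definition).
E(network)$width <- E(network)$weight / max(E(network)$weight)

data.frame(word = V(network)$name, degree = V(network)$degree)
```

For "hillary" the weighted degree is 228 + 169 = 397.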
We can extract the biggest connected component of the network as follows. We get all connected components.
## $membership
## donald trump
## 1 1
##
## $csize
## [1] 2
##
## $no
## [1] 1
We select the biggest connected component.
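A sketch of both steps with igraph on a small network that has more than one component. The edges are a subset of the bigram table above; the attribute name `cluster` mirrors the printed graph, and the other names are hypothetical.

```r
library(igraph)

edges <- data.frame(
  word1  = c("donald", "hillary", "crooked", "ted"),
  word2  = c("trump", "clinton", "hillary", "cruz"),
  weight = c(363, 228, 169, 102)
)
network <- graph_from_data_frame(edges, directed = FALSE)

# All connected components: membership per node, size per component, count.
clusters <- components(network)
V(network)$cluster <- clusters$membership

# Keep only the nodes that belong to the largest component.
biggest <- which.max(clusters$csize)
cc_network <- induced_subgraph(network, which(V(network)$cluster == biggest))

V(cc_network)$name
```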
## IGRAPH a90a409 UNW- 2 1 --
## + attr: name (v/c), degree (v/n), cluster (v/n), weight (e/n), width
## | (e/n)
## + edge from a90a409 (vertex names):
## [1] donald--trump
We store the degree.
We compute the weight shares.
Let’s make our visualization more dynamic
We store the degree, compute the weight shares, create the networkD3 object, define the node size, define the color group (explored later), and define the edge width, in that order.
Now let’s decrease the threshold to get a more complex network. We can see which words are mostly used together, such as “Donald Trump”, “Make America Great”, and “Hillary Clinton Crooked”.
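The steps listed above could be assembled as in the following sketch, which builds the node and link tables by hand and passes them to `forceNetwork()` from networkD3. The scaling by 10 and all data frame and column names are assumptions; note that networkD3 expects zero-based node indices in the link table.

```r
library(igraph)
library(networkD3)

edges <- data.frame(
  word1  = c("hillary", "crooked"),
  word2  = c("clinton", "hillary"),
  weight = c(228, 169)
)
network <- graph_from_data_frame(edges, directed = FALSE)

# Node table: name, size (scaled weighted degree), and a single color group.
node_strength <- strength(network)
nodes_df <- data.frame(
  name  = V(network)$name,
  size  = 10 * node_strength / max(node_strength),
  group = 1,
  row.names = NULL
)

# Link table: zero-based node indices and edge width from the weight share.
edge_ids <- as_edgelist(network, names = FALSE)
links_df <- data.frame(
  source = edge_ids[, 1] - 1,
  target = edge_ids[, 2] - 1,
  value  = 10 * E(network)$weight / max(E(network)$weight)
)

widget <- forceNetwork(
  Links = links_df, Nodes = nodes_df,
  Source = "source", Target = "target", Value = "value",
  NodeID = "name", Nodesize = "size", Group = "group",
  opacity = 0.9, zoom = TRUE
)
```

Printing `widget` in an interactive session or an R Markdown chunk renders the interactive network.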
Let’s consider skipgrams.
Consider the example tweet.
## character(0)
## "if we stand together theres nothing we cant do make sure youre ready to vote "
The skipgrams are:
## # A tibble: 11 × 1
## skipgram
## <chr>
## 1 this
## 2 this election
## 3 this who
## 4 election
## 5 election who
## 6 election can
## 7 who
## 8 who can
## 9 who put
## 10 can
## 11 can put
Let’s count the skipgrams containing two words.
## # A tibble: 6 × 3
## word1 word2 weight
## <chr> <chr> <int>
## 1 donald trump 371
## 2 hillary clinton 228
## 3 crooked hillary 169
## 4 make america 137
## 5 america great 124
## 6 make great 112
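The skipgram steps above can be sketched with tidytext as follows. The example phrase is the one visible in the skipgram output; relying on the tokenizer's default skip distance and filtering on the number of words are assumptions about how the analysis was done.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)

example <- tibble(text = "this election who can put")

# Skipgrams of up to two words; words may be separated by a skipped word.
skipgrams <- example %>%
  unnest_tokens(skipgram, text, token = "skip_ngrams", n = 2)

# Keep only skipgrams made of exactly two words and count each pair.
skipgram_counts <- skipgrams %>%
  filter(str_count(skipgram, "\\S+") == 2) %>%
  separate(skipgram, into = c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE, name = "weight")

skipgram_counts
```

Compared with plain bigrams, pairs such as "this who" now appear as well, which is why "make great" shows up in the counts above.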
Threshold
We see that “hillary”, “clinton”, and “crooked” are the words most often used together.
We now compute the centrality measures for the biggest connected component from above.
## # A tibble: 3 × 4
## word degree closeness betweenness
## <chr> <dbl> <dbl> <dbl>
## 1 hillary 397 0.00252 1
## 2 clinton 228 0.0016 0
## 3 crooked 169 0.00177 0
## # A tibble: 3 × 4
## word degree closeness betweenness
## <chr> <dbl> <dbl> <dbl>
## 1 hillary 397 0.00252 1
## 2 crooked 169 0.00177 0
## 3 clinton 228 0.0016 0
## # A tibble: 3 × 4
## word degree closeness betweenness
## <chr> <dbl> <dbl> <dbl>
## 1 hillary 397 0.00252 1
## 2 crooked 169 0.00177 0
## 3 clinton 228 0.0016 0
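The three measures can be computed with igraph as in the sketch below, using the hillary/clinton/crooked component with the weights from the table above. The name `centrality` is hypothetical. igraph treats the edge weights as distances for closeness and betweenness.

```r
library(igraph)

edges <- data.frame(
  word1  = c("hillary", "crooked"),
  word2  = c("clinton", "hillary"),
  weight = c(228, 169)
)
cc_network <- graph_from_data_frame(edges, directed = FALSE)

centrality <- data.frame(
  word        = V(cc_network)$name,
  degree      = strength(cc_network),     # weighted degree
  closeness   = closeness(cc_network),    # 1 / sum of weighted distances
  betweenness = betweenness(cc_network),  # shortest paths through the node
  row.names   = NULL
)

# Sort by each measure in turn by changing the column used here.
centrality[order(-centrality$degree), ]
```

For "hillary" the closeness is 1 / (228 + 169) = 1 / 397, and the only shortest path between "clinton" and "crooked" passes through "hillary", which gives a betweenness of 1.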
Cluster (community) detection
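A sketch of community detection with igraph's multi-level (Louvain) algorithm. The edges are the two-word skipgram counts from the table above, so the network here is larger than the one printed below; the names `communities` and `words_per_cluster` are hypothetical.

```r
library(igraph)

# Two-word skipgram counts from the table above.
edges <- data.frame(
  word1  = c("donald", "hillary", "crooked", "make", "america", "make"),
  word2  = c("trump", "clinton", "hillary", "america", "great", "great"),
  weight = c(371, 228, 169, 137, 124, 112)
)
network <- graph_from_data_frame(edges, directed = FALSE)

# Louvain processes nodes in random order, so fix the seed.
set.seed(42)
communities <- cluster_louvain(network)

# Store the community label on each node, e.g. to color the nodes.
V(network)$membership <- as.integer(membership(communities))

modularity(communities)

# Words per cluster as one comma-separated string per community.
words_per_cluster <- tapply(
  V(network)$name, V(network)$membership, paste, collapse = ", "
)
words_per_cluster
```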
## IGRAPH clustering multi level, groups: 1, mod: 0
## + groups:
## $`1`
## [1] "hillary" "crooked" "clinton"
##
Modularity is a chance-corrected statistic, defined as the fraction of ties that fall within the given groups minus the expected such fraction if the ties were distributed at random. We encode the membership as a node attribute; zoom and click on each node to explore the clusters.
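For reference, the usual formula behind this definition is given below. The notation is introduced here and is not taken from the analysis: $A_{ij}$ is the weight of the edge between nodes $i$ and $j$, $k_i$ is the weighted degree of node $i$, $m$ is the total edge weight, and $c_i$ is the group of node $i$.

```latex
Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j),
\qquad
\delta(c_i, c_j) =
\begin{cases}
1 & \text{if } c_i = c_j, \\
0 & \text{otherwise.}
\end{cases}
```

The first term counts the weight that actually falls within groups, and the second term is the weight expected there under random placement of ties with the same degrees. When all nodes end up in a single group the two terms cancel, which is consistent with the modularity of 0 reported above.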
We use the membership label to color the nodes.
Words per cluster:
## [1] "hillary, crooked, clinton"
THANK YOU!