First, I imported the data using the dslabs package. I am interested in the trump_tweets data set, and I want to create a visualization that shows Donald Trump’s tweets over time.
library("dslabs")
## Warning: package 'dslabs' was built under R version 3.6.1
data(trump_tweets)
I will be using functions from several libraries. Any that are not already installed can be installed with install.packages().
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 3.6.1
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.6.1
library(igraph)
## Warning: package 'igraph' was built under R version 3.6.1
##
## Attaching package: 'igraph'
## The following object is masked from 'package:plotly':
##
## groups
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
## Warning: package 'ggraph' was built under R version 3.6.1
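As a rough sketch of the installation check mentioned above, the snippet below lists which of the packages used in this walkthrough are not yet installed. The helper missing_packages() is a hypothetical name introduced only for illustration, and the install.packages() call is left commented out so that nothing is installed by accident.

```r
# Packages loaded in this walkthrough.
needed <- c("dslabs", "dplyr", "ggplot2", "plotly",
            "tidytext", "igraph", "ggraph", "widyr", "tidyr")

# Hypothetical helper: returns the names in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```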
After inspecting the structure of the data, I decided to change how the variable created_at is displayed. It is stored as a date-time (POSIXct), so it was relatively simple to extract only the year with the function format(). I saved the result in a new variable, year, so that it is easier to refer to.
str(trump_tweets)
## 'data.frame': 20761 obs. of 8 variables:
## $ source : chr "Twitter Web Client" "Twitter Web Client" "Twitter Web Client" "Twitter Web Client" ...
## $ id_str : chr "6971079756" "6312794445" "6090839867" "5775731054" ...
## $ text : chr "From Donald Trump: Wishing everyone a wonderful holiday & a happy, healthy, prosperous New Year. Let’s think li"| __truncated__ "Trump International Tower in Chicago ranked 6th tallest building in world by Council on Tall Buildings & Urban "| __truncated__ "Wishing you and yours a very Happy and Bountiful Thanksgiving!" "Donald Trump Partners with TV1 on New Reality Series Entitled, Omarosa's Ultimate Merger: http://tinyurl.com/yk5m3lc" ...
## $ created_at : POSIXct, format: "2009-12-23 12:38:18" "2009-12-03 14:39:09" ...
## $ retweet_count : int 28 33 13 5 7 4 2 4 1 22 ...
## $ in_reply_to_user_id_str: chr NA NA NA NA ...
## $ favorite_count : int 12 6 11 3 6 5 2 10 4 30 ...
## $ is_retweet : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
year <- format(trump_tweets$created_at, format = "%Y")
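To make the conversion concrete, here is a minimal sketch on two toy timestamps (made up for illustration, with the time zone assumed to be UTC). It shows that format() returns the year as a character string, which ggplot later treats as a discrete axis.

```r
# Toy timestamps for illustration only; the time zone is an assumption.
created_at <- as.POSIXct(c("2009-12-23 12:38:18", "2016-11-09 06:36:58"),
                         tz = "UTC")

# format() with "%Y" keeps only the four-digit year, as text.
year_chr <- format(created_at, format = "%Y")
year_chr

# Convert to integer if a numeric year is needed instead.
year_int <- as.integer(year_chr)
year_int
```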
I am interested in graphically displaying Donald Trump’s tweets over time. I decided to use ggplot along with plotly to create a jitter plot. Jittering spreads out the data and lets users see his tweets by year without the points being stacked on top of one another. In addition to converting the date each tweet was created, I also decided to display the retweet count in thousands, so that the number is easier to read on the plot. I did this with the mutate() function. Year is on the x-axis, retweet count in thousands is on the y-axis, and point size represents the favorite count of each tweet. With the ggplotly() function, I was also able to show the text of each tweet when users hover over a point. I set the opacity to 40 percent, so that users can still see individual tweets that overlap in the plot.
chart1 <- trump_tweets %>%
  mutate(retweet_count_thousands = retweet_count / 1000) %>%
  ggplot(aes(x = year, y = retweet_count_thousands, size = favorite_count, text = paste("Tweet:", text))) +
  xlab("Year") +
  ylab("Retweet Count (thousands)") +
  ggtitle("Trump's Tweets") +
  theme_minimal(base_size = 12) +
  geom_jitter(alpha = 0.4, color = "red")
chart1 <- ggplotly(chart1)
chart1
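One possible variation, sketched here on a toy data frame rather than on trump_tweets, is to create the year inside the same mutate() call as the rescaled retweet count. That keeps both derived columns in the data frame instead of relying on a separate year vector. The toy values and the name toy_prepared are assumptions for illustration.

```r
library(dplyr)

# Toy data standing in for trump_tweets; values are illustrative only.
toy <- data.frame(
  created_at = as.POSIXct(c("2009-12-23 12:38:18", "2016-11-09 06:36:58"),
                          tz = "UTC"),
  retweet_count = c(28L, 15000L)
)

# Derive both plotting variables in one step.
toy_prepared <- toy %>%
  mutate(year = format(created_at, "%Y"),
         retweet_count_thousands = retweet_count / 1000)

toy_prepared
```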
Next, I wanted to visualize the word frequencies in the tweets, following this article as a guide. The first step is to clean the data. The author of the article suggests removing URLs from the tweets manually and then using the unnest_tokens() function from the tidytext package.
# remove urls; note that "http.*" also matches "https" links and strips
# everything from the start of the first URL to the end of the tweet
trump_tweets$stripped_text <- gsub("http.*","", trump_tweets$text)
trump_tweets$stripped_text <- gsub("https.*","", trump_tweets$stripped_text)
# split into one word per row; unnest_tokens() also removes punctuation
# and converts to lowercase
trump_tweets_clean <- trump_tweets %>%
  dplyr::select(stripped_text) %>%
  unnest_tokens(word, stripped_text)
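The pattern "http.*" removes everything from the start of a URL to the end of the tweet, including any words that follow the link. The sketch below compares it with a narrower pattern, "http\\S+", which is my own suggestion rather than something from the article, on a made-up tweet.

```r
# Made-up example tweet for illustration.
tweet <- "Watch my interview http://example.com/abc tonight at 9pm"

# Greedy pattern: drops the URL and everything after it.
greedy <- gsub("http.*", "", tweet)
greedy

# Narrower pattern: drops only the URL (non-whitespace run after "http").
narrow <- gsub("http\\S+", "", tweet)
narrow
```

With the narrower pattern, the words after the link ("tonight at 9pm") stay in the text, so the word counts later in the analysis would differ from the ones shown here.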
Now let’s use the cleaned data to create the plot! The article points out that the top words may not be useful, because they are “stop words”: very common words such as “the” or “of”.
# plot the top 15 words -- notice any issues?
trump_tweets_clean %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(y = "Count",
       x = "Unique words",
       title = "Count of unique words found in Trump tweets")
## Selecting by n
Let’s solve the issue by removing the “stop words”, which are listed in the stop_words data set from the tidytext package.
# load list of stop words - from the tidytext package
data("stop_words")
# view first 6 words
head(stop_words)
## # A tibble: 6 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
nrow(trump_tweets_clean)
## [1] 337371
# remove stop words from your list of words
cleaned_tweet_words <- trump_tweets_clean %>%
  anti_join(stop_words)
## Joining, by = "word"
# there should be fewer words now
nrow(cleaned_tweet_words)
## [1] 157336
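To see what anti_join() does here, consider this minimal sketch on toy data frames (the words and the names words, stops, and kept are made up for illustration). It keeps only the rows of the first table whose word has no match in the second.

```r
library(dplyr)

# Toy word list and toy stop-word list, for illustration only.
words <- data.frame(word = c("the", "wall", "is", "great"),
                    stringsAsFactors = FALSE)
stops <- data.frame(word = c("the", "is"),
                    stringsAsFactors = FALSE)

# Keep rows of `words` with no matching `word` in `stops`.
kept <- anti_join(words, stops, by = "word")
kept
```

Passing by = "word" explicitly also avoids the "Joining, by" message.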
Now let’s try plotting the data again to see if the solution worked!
# replotting the top 15 words
cleaned_tweet_words %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(y = "Count",
       x = "Unique words",
       title = "Count of unique words found in Trump tweets",
       subtitle = "Stop words removed from the list")
## Selecting by n
We can create a word network to explore words that frequently occur together. The unnest_tokens() function can do this for us: setting token = "ngrams" splits the text into sequences of consecutive words, and n = 2 sets the length of each sequence to two words, giving word pairs (bigrams).
# let's create a table with paired words
library(widyr)
## Warning: package 'widyr' was built under R version 3.6.1
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:igraph':
##
## crossing
# split into pairs of consecutive words; unnest_tokens() also removes
# punctuation and converts to lowercase
trump_tweets_paired_words <- trump_tweets %>%
  dplyr::select(stripped_text) %>%
  unnest_tokens(paired_words, stripped_text, token = "ngrams", n = 2)
trump_tweets_paired_words %>%
  count(paired_words, sort = TRUE)
## # A tibble: 155,388 x 2
## paired_words n
## <chr> <int>
## 1 will be 1329
## 2 thank you 1253
## 3 of the 1205
## 4 in the 887
## 5 is a 752
## 6 a great 675
## 7 i will 631
## 8 for the 584
## 9 to the 582
## 10 on the 489
## # ... with 155,378 more rows
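As a small illustration of the bigram tokenization, the sketch below runs unnest_tokens() on a single toy sentence (the data frame toy and the column name bigram are assumptions for this example). Each pair overlaps the previous one by one word.

```r
library(dplyr)
library(tidytext)

# One toy "tweet" for illustration.
toy <- data.frame(text = "Make America Great Again",
                  stringsAsFactors = FALSE)

# token = "ngrams" with n = 2 yields overlapping, lowercased word pairs.
toy_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_bigrams$bigram
```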
We will now split each pair into two columns, drop the pairs that contain a stop word, and count the remaining bigrams.
trump_tweets_separated_words <- trump_tweets_paired_words %>%
  separate(paired_words, c("word1", "word2"), sep = " ")
trump_tweets_filtered <- trump_tweets_separated_words %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
# new bigram counts:
trump_words_counts <- trump_tweets_filtered %>%
  count(word1, word2, sort = TRUE)
head(trump_words_counts)
## # A tibble: 6 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 donald trump 391
## 2 crooked hillary 221
## 3 hillary clinton 208
## 4 president obama 161
## 5 fake news 158
## 6 makeamericagreatagain trump2016 128
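The separate, filter, and count steps can be traced on a handful of toy bigrams. In this sketch the pairs and the short stop-word vector toy_stops are made up; the real analysis uses stop_words$word from tidytext.

```r
library(dplyr)
library(tidyr)

# Toy bigrams and a toy stop-word list, for illustration only.
toy_pairs <- data.frame(
  paired_words = c("of the", "fake news", "is a", "fake news"),
  stringsAsFactors = FALSE
)
toy_stops <- c("of", "the", "is", "a")

toy_counts <- toy_pairs %>%
  # one column per word of the pair
  separate(paired_words, c("word1", "word2"), sep = " ") %>%
  # drop a pair if either word is a stop word
  filter(!word1 %in% toy_stops, !word2 %in% toy_stops) %>%
  # count how often each remaining pair occurs
  count(word1, word2, sort = TRUE)

toy_counts
```

Only "fake news" survives the filter, because both of the other pairs consist of stop words.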
Now we are ready to create the word network.
library(igraph)
library(ggraph)
# plot the Trump tweets word network
trump_words_counts %>%
  filter(n >= 24) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
  labs(title = "Word Network: Tweets by Donald Trump",
       subtitle = "Text mining Twitter data from 2009-2018",
       x = "", y = "")
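To show what graph_from_data_frame() builds before ggraph draws it, here is a minimal sketch using the top three bigram counts from the table above. The first two columns become the edge endpoints, each distinct word becomes a node, and the remaining column n is stored as an edge attribute. The object names toy_counts and g are assumptions for this example.

```r
library(igraph)

# Top three bigrams from the counts shown earlier.
toy_counts <- data.frame(
  word1 = c("donald", "crooked", "hillary"),
  word2 = c("trump", "hillary", "clinton"),
  n = c(391, 221, 208),
  stringsAsFactors = FALSE
)

# Columns 1 and 2 define the edges; `n` becomes an edge attribute.
g <- graph_from_data_frame(toy_counts)

vcount(g)  # number of distinct words (nodes)
ecount(g)  # number of bigrams (edges)
E(g)$n     # edge weights used for edge_alpha and edge_width
```

Because "hillary" appears in two bigrams, it is a single node with two edges, which is how clusters form in the full network.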