Unit 5 Homework Assignment

Importing the Data Set

First, I imported the data using the dslabs package. I am interested in the trump_tweets data set, and I want to create visualizations that show Donald Trump’s tweets over time.

library("dslabs")

## Warning: package 'dslabs' was built under R version 3.6.1

data(trump_tweets)

I will be using functions from several other libraries. Any that are not already installed can be added with install.packages().
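As a minimal sketch of that check, the snippet below lists which of the packages used in this report are missing before anything is installed. The helper name missing_packages() is hypothetical (it is not part of any package), and the install call is left as a comment so that nothing is installed by accident.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages loaded in this report
needed <- c("dslabs", "dplyr", "ggplot2", "plotly", "tidytext",
            "igraph", "ggraph", "widyr", "tidyr")

to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```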

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.6.1

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(plotly)

## Warning: package 'plotly' was built under R version 3.6.1

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(tidytext)

## Warning: package 'tidytext' was built under R version 3.6.1

library(igraph)

## Warning: package 'igraph' was built under R version 3.6.1

## 
## Attaching package: 'igraph'

## The following object is masked from 'package:plotly':
## 
##     groups

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

library(ggraph)

## Warning: package 'ggraph' was built under R version 3.6.1

After inspecting the structure of the data, I decided to adjust how the created_at variable is displayed. It is stored as a date-time (POSIXct), so it was simple to extract only the year with format(). I saved the result in a new variable, year, so that it is easier to reference. Note that format() returns the year as a character string.

str(trump_tweets)

## 'data.frame':    20761 obs. of  8 variables:
##  $ source                 : chr  "Twitter Web Client" "Twitter Web Client" "Twitter Web Client" "Twitter Web Client" ...
##  $ id_str                 : chr  "6971079756" "6312794445" "6090839867" "5775731054" ...
##  $ text                   : chr  "From Donald Trump: Wishing everyone a wonderful holiday & a happy, healthy, prosperous New Year. Let’s think li"| __truncated__ "Trump International Tower in Chicago ranked 6th tallest building in world by Council on Tall Buildings & Urban "| __truncated__ "Wishing you and yours a very Happy and Bountiful Thanksgiving!" "Donald Trump Partners with TV1 on New Reality Series Entitled, Omarosa's Ultimate Merger: http://tinyurl.com/yk5m3lc" ...
##  $ created_at             : POSIXct, format: "2009-12-23 12:38:18" "2009-12-03 14:39:09" ...
##  $ retweet_count          : int  28 33 13 5 7 4 2 4 1 22 ...
##  $ in_reply_to_user_id_str: chr  NA NA NA NA ...
##  $ favorite_count         : int  12 6 11 3 6 5 2 10 4 30 ...
##  $ is_retweet             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

year <- format(trump_tweets$created_at, format = "%Y")
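To illustrate what format() does here, the sketch below applies it to a single timestamp taken from the str() output above. The names x, year_chr, and year_num are only for illustration, and the UTC time zone is an assumption.

```r
# One timestamp from the str() output, parsed as POSIXct (time zone assumed)
x <- as.POSIXct("2009-12-23 12:38:18", tz = "UTC")

# "%Y" keeps only the four-digit year; the result is a character string
year_chr <- format(x, format = "%Y")
print(year_chr)
print(class(year_chr))

# Convert to an integer if a numeric axis or arithmetic is needed
year_num <- as.integer(year_chr)
print(year_num)
```

Keeping the year as a character string makes ggplot treat it as a discrete axis, which suits a jitter plot grouped by year.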

Creating a visualization for Trump’s Retweets from 2009-2018

I am interested in graphically displaying Donald Trump’s tweets over time. I decided to use ggplot along with plotly to create a jitter plot. Jittering spreads out the points so that users can see his tweets by year without the points being stacked on top of one another. In addition to converting the date each tweet was created, I also decided to display the retweet count in thousands, so that the numbers are easier to read on the plot. I did this with the mutate() function. Year is on the x-axis, retweet count in thousands is on the y-axis, and point size represents the favorite count of each tweet. With ggplotly(), I was also able to show the text of each tweet when users hover over a point. I set the opacity to 40 percent, so that users can still see individual tweets that overlap in the plot.

# `year` is the character vector created above with format()
chart1 <- trump_tweets %>%
  mutate(retweet_count_thousands = retweet_count / 1000) %>%
  ggplot(aes(x = year, y = retweet_count_thousands, size = favorite_count,
             text = paste("Tweet:", text))) +
  xlab("Year") +
  ylab("Retweet Count (thousands)") +
  ggtitle("Trump's Tweets") +
  scale_color_brewer(palette = "Set1") +
  theme_minimal(base_size = 12) +
  geom_jitter(alpha = 0.4, color = "red")
chart1 <- ggplotly(chart1)
chart1
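The chart code takes year from a separate vector in the workspace, which works only while that vector stays aligned with the rows of trump_tweets. One possible alternative, sketched here on a made-up two-row data frame (toy and its values are invented for illustration), is to create the year column inside the same mutate() call so that it travels with the data.

```r
library(dplyr)

# Made-up stand-in for trump_tweets with only the columns needed here
toy <- data.frame(
  created_at = as.POSIXct(c("2009-12-23 12:38:18", "2016-11-09 06:36:58"),
                          tz = "UTC"),
  retweet_count = c(28L, 220000L)
)

# Derive both plotting variables in one step
toy_plot_data <- toy %>%
  mutate(
    year = format(created_at, "%Y"),
    retweet_count_thousands = retweet_count / 1000
  )

print(toy_plot_data)
```

With the column in place, aes(x = year, ...) would refer to the data frame itself, so filtering or reordering rows could not misalign the axis values.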

Creating a visualization for the word frequencies of Trump’s Tweets

I wanted to try visualizing the word frequencies of the tweets. I followed this article as a guide. The first step is to clean the data. The author of the article suggests removing URLs from the tweets manually and then using the unnest_tokens() function from the tidytext package to split the text into words.

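# remove URLs: "http.*" deletes everything from the first "http" to the end
# of the tweet (the second call is redundant, because "http.*" also matches "https")
trump_tweets$stripped_text <- gsub("http.*","",  trump_tweets$text)
trump_tweets$stripped_text <- gsub("https.*","", trump_tweets$stripped_text)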

# split into one word per row; unnest_tokens() also lowercases and strips punctuation
trump_tweets_clean <- trump_tweets %>%
  dplyr::select(stripped_text) %>%
  unnest_tokens(word, stripped_text)
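The pattern "http.*" removes the URL and everything that follows it in the tweet. The sketch below compares it with a narrower pattern that stops at the next whitespace. The example tweet is made up, and the narrower pattern is an alternative, not what the article uses; switching to it would change the word counts reported in this document.

```r
# Made-up tweet with a URL in the middle
tweet <- "New interview http://example.com/abc airing tonight"

# Pattern used above: drops the URL and all text after it
greedy <- gsub("http.*", "", tweet)
print(greedy)

# Narrower alternative: drops only the URL itself (up to the next whitespace)
narrow <- gsub("https?://[^[:space:]]+", "", tweet)
print(narrow)
```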

Now let’s use the cleaned data to create the plot! The article points out that the top words may not be useful, because they are “stop words”.

# plot the top 15 words -- notice any issues?
trump_tweets_clean %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in Trump tweets")

## Selecting by n

Let’s solve the issue by removing the “stop words”, which are defined by a data set in the tidytext package.

# load list of stop words - from the tidytext package
data("stop_words")

# view first 6 words
head(stop_words)

## # A tibble: 6 x 2
##   word      lexicon
##   <chr>     <chr>  
## 1 a         SMART  
## 2 a's       SMART  
## 3 able      SMART  
## 4 about     SMART  
## 5 above     SMART  
## 6 according SMART

nrow(trump_tweets_clean)

## [1] 337371

# remove stop words from your list of words
cleaned_tweet_words <- trump_tweets_clean %>%
  anti_join(stop_words)

## Joining, by = "word"

# there should be fewer words now
nrow(cleaned_tweet_words)

## [1] 157336
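To show what anti_join() does in this step, here is a small sketch on made-up data. The data frames toy_words and toy_stop are invented for illustration; by = "word" is given explicitly, which also avoids the "Joining, by" message.

```r
library(dplyr)

# Made-up tokens and a made-up stop word list
toy_words <- data.frame(word = c("the", "wall", "is", "great", "the"),
                        stringsAsFactors = FALSE)
toy_stop  <- data.frame(word = c("the", "is"),
                        stringsAsFactors = FALSE)

# Keep only the rows of toy_words whose word has no match in toy_stop
kept <- anti_join(toy_words, toy_stop, by = "word")
print(kept)
```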

Now let’s try plotting the data again to see if the solution worked!

# replotting the top 15 words
cleaned_tweet_words %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(y = "Count",
       x = "Unique words",
       title = "Count of unique words found in Trump tweets",
       subtitle = "Stop words removed from the list")

## Selecting by n

Creating a word network for tweets by Donald Trump from 2009-2018

We can create a word network to explore words that frequently occur together. The unnest_tokens() function can do this for us: setting token = "ngrams" splits the text into sequences of consecutive words, and n = 2 sets the length of each sequence to two words, giving word pairs (bigrams).

# let's create a table with paired words
library(widyr)

## Warning: package 'widyr' was built under R version 3.6.1

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:igraph':
## 
##     crossing

# split into bigrams (two consecutive words per row); text is lowercased and punctuation removed
trump_tweets_paired_words <- trump_tweets %>%
  dplyr::select(stripped_text) %>%
  unnest_tokens(paired_words, stripped_text, token = "ngrams", n = 2)

trump_tweets_paired_words %>%
  count(paired_words, sort = TRUE)

## # A tibble: 155,388 x 2
##    paired_words     n
##    <chr>        <int>
##  1 will be       1329
##  2 thank you     1253
##  3 of the        1205
##  4 in the         887
##  5 is a           752
##  6 a great        675
##  7 i will         631
##  8 for the        584
##  9 to the         582
## 10 on the         489
## # ... with 155,378 more rows
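For intuition about what a bigram is, the following base R sketch builds the overlapping word pairs for one short phrase. It only illustrates the idea and is not how tidytext implements tokenization; the names words and bigrams are for illustration.

```r
# Lowercase the phrase and split it into words
words <- strsplit(tolower("Make America Great Again"), " ", fixed = TRUE)[[1]]

# Pair each word with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))
print(bigrams)
```

A phrase of four words yields three bigrams, because each word except the last starts one pair.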

We will now split each bigram into two columns, drop the pairs in which either word is a stop word, and count how often each remaining pair occurs.

trump_tweets_separated_words <- trump_tweets_paired_words %>%
  separate(paired_words, c("word1", "word2"), sep = " ")

trump_tweets_filtered <- trump_tweets_separated_words %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
trump_words_counts <- trump_tweets_filtered %>%
  count(word1, word2, sort = TRUE)

head(trump_words_counts)

## # A tibble: 6 x 3
##   word1                 word2         n
##   <chr>                 <chr>     <int>
## 1 donald                trump       391
## 2 crooked               hillary     221
## 3 hillary               clinton     208
## 4 president             obama       161
## 5 fake                  news        158
## 6 makeamericagreatagain trump2016   128
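The same three steps can be traced on a tiny made-up example. The objects toy_bigrams, toy_stop, and toy_counts are invented for illustration, and a short character vector stands in for stop_words$word.

```r
library(dplyr)
library(tidyr)

# Made-up bigrams and a made-up stop word vector
toy_bigrams <- data.frame(
  paired_words = c("fake news", "of the", "fake news", "the wall"),
  stringsAsFactors = FALSE
)
toy_stop <- c("of", "the")

toy_counts <- toy_bigrams %>%
  separate(paired_words, c("word1", "word2"), sep = " ") %>%   # split the pair
  dplyr::filter(!word1 %in% toy_stop, !word2 %in% toy_stop) %>% # drop stop words
  count(word1, word2, sort = TRUE)                             # count each pair

print(toy_counts)
```

Only "fake news" survives, because the other two pairs each contain a stop word.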

Now we are ready to create the word network.

library(igraph)
library(ggraph)

# plot the word network, keeping only bigrams that occur at least 24 times
trump_words_counts %>%
        filter(n >= 24) %>%
        graph_from_data_frame() %>%
        ggraph(layout = "fr") +
        geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
        geom_node_point(color = "darkslategray4", size = 3) +
        geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
        labs(title = "Word Network: Tweets by Donald Trump",
             subtitle = "Text mining twitter data from 2009-2018",
             x = "", y = "")
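To see what graph_from_data_frame() builds before ggraph draws it, here is a small sketch using two of the bigram counts shown earlier. The object names toy_edges and g are for illustration only.

```r
library(igraph)

# Two rows taken from the bigram counts printed above
toy_edges <- data.frame(
  word1 = c("fake", "crooked"),
  word2 = c("news", "hillary"),
  n     = c(158L, 221L),
  stringsAsFactors = FALSE
)

# The first two columns become the edge endpoints; other columns become edge attributes
g <- graph_from_data_frame(toy_edges)

print(vcount(g))   # number of distinct words (nodes)
print(ecount(g))   # number of bigrams (edges)
print(E(g)$n)      # the count attached to each edge
print(V(g)$name)   # node labels used by geom_node_text()
```

The n attribute on each edge is what edge_alpha and edge_width are mapped to in the plot.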

Lucy Murray

08/06/2019