Setup

library(plyr)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()

library(rtweet)

## 
## Attaching package: 'rtweet'

## The following object is masked from 'package:purrr':
## 
##     flatten

library(httpuv)

Original Tweets

Eliminate retweets, quotes, and replies.

Here is the code from DataCamp. Note that count() has to be prefixed with plyr:: because dplyr, which was attached later, also exports a count() function that masks the plyr version.

# Extract up to 1000 original tweets on "Superbowl"
tweets_org <- search_tweets("Superbowl -filter:retweets -filter:quote -filter:replies", n = 1000)

# Check for presence of replies
plyr::count(tweets_org$reply_to_screen_name)

##    x freq
## 1 NA  985

# Check for presence of quotes
plyr::count(tweets_org$is_quote)

##       x freq
## 1 FALSE  985

# Check for presence of retweets
plyr::count(tweets_org$is_retweet)

##       x freq
## 1 FALSE  985
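
The plyr:: prefix is needed only because of masking. As a minimal sketch on made-up data (the tweets tibble below is a hypothetical stand-in, not real API output), the same kind of check can be written with dplyr::count(), which takes a data frame and a column rather than a bare vector, or with base R's table():

```r
library(dplyr)

# Hypothetical stand-in for a search_tweets() result.
tweets <- tibble(
  reply_to_screen_name = c(NA, NA, "someone"),
  is_retweet = c(FALSE, FALSE, TRUE)
)

# dplyr::count() takes a data frame plus the column(s) to tally.
retweet_counts <- dplyr::count(tweets, is_retweet)
retweet_counts

# Base R alternative; useNA = "ifany" makes missing values visible.
reply_counts <- table(tweets$reply_to_screen_name, useNA = "ifany")
reply_counts
```

Writing the package prefix explicitly, as in dplyr::count(), keeps the call unambiguous regardless of the order in which packages were attached.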

A good question: how long did it take for these tweets (985 after filtering) to accumulate?

How would you answer that?

Solution

tweets_org %>% 
  summarize(max(created_at) - min(created_at))

## # A tibble: 1 × 1
##   `max(created_at) - min(created_at)`
##   <drtn>                             
## 1 1.949144 days
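
Subtracting two date-times yields a difftime, and R chooses the display unit automatically, so the same calculation may print in days, hours, minutes, or seconds. A minimal sketch with made-up timestamps (the tweets tibble is hypothetical), using difftime() with an explicit units argument so that results stay comparable:

```r
library(dplyr)

# Hypothetical timestamps standing in for tweets_org$created_at.
tweets <- tibble(
  created_at = as.POSIXct(
    c("2022-02-10 08:00:00", "2022-02-10 20:00:00", "2022-02-12 08:00:00"),
    tz = "UTC"
  )
)

# Fix the unit to hours and convert to a plain number.
span <- tweets %>%
  summarize(
    hours = as.numeric(difftime(max(created_at), min(created_at), units = "hours"))
  )
span
```

Naming the summary column (here hours) also avoids the long backticked column name that appears when the expression is left unnamed.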

Tidy Up

Put this all together with tidy coding practice. Start with the query for your term.

Drop the filters inside the query string and filter the results instead, using is.na() for reply_to_screen_name and == FALSE for is_retweet and is_quote.

Solution

search_tweets("Superbowl", n = 100) %>% 
  filter(is.na(reply_to_screen_name)  &
         is_retweet == FALSE &
         is_quote == FALSE) %>% 
  summarize(max(created_at) - min(created_at))

## # A tibble: 1 × 1
##   `max(created_at) - min(created_at)`
##   <drtn>                             
## 1 1.081667 hours
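
Because the filter now runs after the search, the span is computed over however many original tweets remain, which is usually fewer than n. A minimal sketch on made-up data (mock_tweets is hypothetical) that shows the same filter, written with ! instead of == FALSE, and how many rows survive it:

```r
library(dplyr)

# Hypothetical stand-in for a search_tweets() result with 5 tweets.
mock_tweets <- tibble(
  created_at = as.POSIXct("2022-02-13 12:00:00", tz = "UTC") +
    c(0, 600, 1200, 1800, 3600),
  reply_to_screen_name = c(NA, "someone", NA, NA, NA),
  is_retweet = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  is_quote = c(FALSE, FALSE, FALSE, TRUE, FALSE)
)

# Keep only tweets that are not replies, retweets, or quotes.
originals <- mock_tweets %>%
  filter(is.na(reply_to_screen_name), !is_retweet, !is_quote)

# Number of original tweets left after filtering.
nrow(originals)

# Time span covered by the remaining tweets.
span_tbl <- originals %>%
  summarize(span = max(created_at) - min(created_at))
span_tbl
```

Separating conditions with commas inside filter() is equivalent to joining them with &.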

Make a Function

Convert this code into a function with two parameters:

Parameter 1: term. This is a string with the search term.

Parameter 2: ntweets. This is the number of tweets.

The function returns the amount of time required to get this number of tweets after eliminating non-original tweets. Test your function with term = “superbowl” and ntweets = 100.

Solution

time_to_get <- function(term, ntweets) {
  search_tweets(term, n = ntweets) %>%
    filter(is.na(reply_to_screen_name) &
           is_retweet == FALSE &
           is_quote == FALSE) %>%
    summarize(max(created_at) - min(created_at))
}

t = time_to_get("superbowl",100)
t

## # A tibble: 1 × 1
##   `max(created_at) - min(created_at)`
##   <drtn>                             
## 1 1.081667 hours
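
One way to try this logic without querying the API is to split the computation from the search. The sketch below is only an illustration: span_of_originals() and mock_tweets are hypothetical names, and the helper takes an already-fetched data frame and returns the span as a number in a chosen unit.

```r
library(dplyr)

# Hypothetical helper: applies the same filter as time_to_get() to a data
# frame that has already been fetched, and returns the span as a number.
span_of_originals <- function(tweets, units = "mins") {
  originals <- tweets %>%
    filter(is.na(reply_to_screen_name), !is_retweet, !is_quote)
  as.numeric(difftime(max(originals$created_at),
                      min(originals$created_at),
                      units = units))
}

# Made-up data standing in for a search_tweets() result.
mock_tweets <- tibble(
  created_at = as.POSIXct("2022-02-13 12:00:00", tz = "UTC") +
    c(0, 300, 900, 1800),
  reply_to_screen_name = c(NA, NA, "someone", NA),
  is_retweet = c(FALSE, TRUE, FALSE, FALSE),
  is_quote = c(FALSE, FALSE, FALSE, FALSE)
)

span_mins <- span_of_originals(mock_tweets)
span_mins
```

With real data, the same helper could be called on the result of search_tweets(term, n = ntweets).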

Use the Function

Use your function to compare the tweet rates of “rstats”, “pandas”, and “ukraine”. Use 500 for ntweets.

Solution

rtime = time_to_get("rstats",500)
rtime

## # A tibble: 1 × 1
##   `max(created_at) - min(created_at)`
##   <drtn>                             
## 1 1.779444 hours

ptime = time_to_get("pandas",500)
ptime

## # A tibble: 1 × 1
##   `max(created_at) - min(created_at)`
##   <drtn>                             
## 1 1.191944 hours

utime = time_to_get("ukraine",500)
utime

## # A tibble: 1 × 1
##   `max(created_at) - min(created_at)`
##   <drtn>                             
## 1 25 secs
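
The three results print in different units (hours and seconds), which makes them awkward to compare by eye. As a sketch, the values from the output above can be converted to a common unit. Keep in mind that each span covers only the original tweets left among the 500 fetched, so the number of tweets behind each span may differ.

```r
# Spans copied from the output above, each in the unit R chose.
spans <- list(
  rstats  = as.difftime(1.779444, units = "hours"),
  pandas  = as.difftime(1.191944, units = "hours"),
  ukraine = as.difftime(25, units = "secs")
)

# Convert everything to minutes so the three topics are directly comparable.
spans_mins <- vapply(spans, function(d) as.numeric(d, units = "mins"), numeric(1))

# Shortest span first.
sort(spans_mins)
```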

Fraction Original

We have a boolean expression to identify original tweets. Use this to create a variable is_orig using mutate. Then use mean(is_orig) in a summarize step to display the fraction of original tweets in the query results. Test your code with the term superbowl and 500 tweets. Try a few different terms.

Solution

search_tweets("superbowl",n = 500) %>% 
  mutate(is_orig = is.na(reply_to_screen_name)  &
         is_retweet == FALSE &
         is_quote == FALSE) %>%
  summarize(mean(is_orig))

## # A tibble: 1 × 1
##   `mean(is_orig)`
##             <dbl>
## 1           0.282
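
Taking the mean of a logical vector works because TRUE counts as 1 and FALSE as 0, so the mean is the proportion of TRUE values. A minimal sketch on made-up data (mock_tweets is hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for a search_tweets() result with 5 tweets.
mock_tweets <- tibble(
  reply_to_screen_name = c(NA, "someone", NA, NA, NA),
  is_retweet = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  is_quote = c(FALSE, FALSE, FALSE, TRUE, FALSE)
)

# sum() counts the TRUE values; mean() gives their share of all rows.
result <- mock_tweets %>%
  mutate(is_orig = is.na(reply_to_screen_name) & !is_retweet & !is_quote) %>%
  summarize(n_orig = sum(is_orig), frac_orig = mean(is_orig))
result
```

Here two of the five mock tweets are original, so frac_orig is 0.4.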

Make Another Function

Convert the code above into a function. The two parameters are the same as before: term and ntweets.

Solution

fract_original <- function(term, ntweets) {
  search_tweets(term, n = ntweets) %>%
    mutate(is_orig = is.na(reply_to_screen_name) &
           is_retweet == FALSE &
           is_quote == FALSE) %>%
    summarize(mean(is_orig))
}

fract_original("superbowl",500)

## # A tibble: 1 × 1
##   `mean(is_orig)`
##             <dbl>
## 1           0.282

Use It

Use your function to determine the fraction of original tweets for several topics that interest you.

Solution

fract_original("ketanji", 500)

## # A tibble: 1 × 1
##   `mean(is_orig)`
##             <dbl>
## 1           0.066

time_to_get("ketanji",500)

## # A tibble: 1 × 1
##   `max(created_at) - min(created_at)`
##   <drtn>                             
## 1 1.766667 mins
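
To compare several topics in one table, the function can be applied over a vector of terms. The sketch below is an assumption-laden illustration: fake_search() is a hypothetical stand-in for search_tweets() that returns fixed mock data so the code runs without API credentials, and fract_original_with() is a hypothetical variant of fract_original() that accepts the search function as an argument.

```r
library(dplyr)

# Hypothetical stand-in for search_tweets(): returns fixed mock data per term
# so the example runs without Twitter API credentials.
fake_search <- function(term, n) {
  n_retweets <- if (term == "topic_a") 3 else 1
  tibble(
    reply_to_screen_name = rep(NA_character_, n),
    is_retweet = rep(c(TRUE, FALSE), times = c(n_retweets, n - n_retweets)),
    is_quote = rep(FALSE, n)
  )
}

# Same logic as fract_original(), with the search function passed in
# and the term kept alongside the result.
fract_original_with <- function(term, ntweets, fetch = fake_search) {
  tweets <- fetch(term, ntweets) %>%
    mutate(is_orig = is.na(reply_to_screen_name) & !is_retweet & !is_quote)
  tibble(term = term, frac_orig = mean(tweets$is_orig))
}

# One row per term.
terms <- c("topic_a", "topic_b")
comparison <- bind_rows(lapply(terms, fract_original_with, ntweets = 4))
comparison
```

With real data, passing fetch = search_tweets would be the natural substitution, assuming rtweet is loaded and authenticated.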
