Harold Nelson
3/25/2022
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange() masks plyr::arrange()
## x purrr::compact() masks plyr::compact()
## x dplyr::count() masks plyr::count()
## x dplyr::failwith() masks plyr::failwith()
## x dplyr::filter() masks stats::filter()
## x dplyr::id() masks plyr::id()
## x dplyr::lag() masks stats::lag()
## x dplyr::mutate() masks plyr::mutate()
## x dplyr::rename() masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
##
## Attaching package: 'rtweet'
## The following object is masked from 'package:purrr':
##
## flatten
Eliminate retweets, quotes, and replies.
Here is the code from DataCamp. Note that I had to prefix count() with plyr:: because dplyr, which was loaded later, also has a count() function that masks the plyr version.
# Extract up to 1000 original tweets on "Superbowl"
tweets_org <- search_tweets("Superbowl -filter:retweets -filter:quote -filter:replies", n = 1000)
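# Check for presence of replies
plyr::count(tweets_org$reply_to_screen_name)
## x freq
## 1 NA 985
# Check for presence of quotes
plyr::count(tweets_org$is_quote)
## x freq
## 1 FALSE 985
# Check for presence of retweets
plyr::count(tweets_org$is_retweet)
## x freq
## 1 FALSE 985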
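The same kind of check can be done without plyr. The sketch below is only an illustration: `toy_tweets` is a made-up stand-in for the `search_tweets()` result, with just the three columns used in the checks above.

```r
library(dplyr)

# Made-up stand-in for a search_tweets() result (assumption, not real data)
toy_tweets <- tibble(
  reply_to_screen_name = c(NA, "some_user", NA),
  is_quote   = c(FALSE, FALSE, TRUE),
  is_retweet = c(FALSE, FALSE, FALSE)
)

# dplyr::count() takes a data frame and keeps NA as its own group
reply_counts <- toy_tweets %>% dplyr::count(reply_to_screen_name)
reply_counts

# Base R alternative for a single vector; useNA keeps the NA count visible
table(toy_tweets$reply_to_screen_name, useNA = "ifany")
```

Writing the package prefix explicitly, as in `dplyr::count()` or `plyr::count()`, avoids depending on the order in which the packages were loaded.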
A good question: how long did it take to get these tweets?
How would you answer that?
## # A tibble: 1 × 1
## `max(created_at) - min(created_at)`
## <drtn>
## 1 1.949144 days
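One possible way to answer this is to subtract the earliest `created_at` from the latest. The sketch below uses a made-up `toy_tweets` tibble rather than real search results, so the timestamps are assumptions chosen for illustration.

```r
library(dplyr)

# Made-up timestamps standing in for the created_at column (assumption)
toy_tweets <- tibble(
  created_at = as.POSIXct(
    c("2022-03-23 08:00:00", "2022-03-24 12:30:00", "2022-03-25 06:00:00"),
    tz = "UTC"
  )
)

# Subtracting two POSIXct values gives a difftime; R picks the units itself
time_span <- toy_tweets %>%
  summarize(span = max(created_at) - min(created_at))
time_span

# Ask for specific units when comparing spans across queries
as.numeric(time_span$span, units = "hours")
```

Because the units are chosen automatically (days in one result, hours or seconds in another), converting to a fixed unit makes results from different queries easier to compare.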
Put this all together with tidy coding practice. Start with the query for your term.
Drop the internal filters and follow up using is.na() for reply_to_screen_name and == FALSE for the other two.
search_tweets("Superbowl", n = 100) %>%
  filter(is.na(reply_to_screen_name) &
           is_retweet == FALSE &
           is_quote == FALSE) %>%
  summarize(max(created_at) - min(created_at))
## # A tibble: 1 × 1
## `max(created_at) - min(created_at)`
## <drtn>
## 1 1.081667 hours
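To see what the filter keeps, here is a small sketch on made-up data. The `toy_tweets` tibble and its `text` column are hypothetical. The sketch writes `!is_retweet` in place of `is_retweet == FALSE`, which selects the same rows, and separates the conditions with commas, which `filter()` combines with AND.

```r
library(dplyr)

# One made-up row per tweet type (assumption, not real data)
toy_tweets <- tibble(
  text = c("original", "a reply", "a retweet", "a quote"),
  reply_to_screen_name = c(NA, "some_user", NA, NA),
  is_retweet = c(FALSE, FALSE, TRUE, FALSE),
  is_quote   = c(FALSE, FALSE, FALSE, TRUE)
)

# Keep only rows that are not replies, not retweets, and not quotes
originals <- toy_tweets %>%
  filter(is.na(reply_to_screen_name), !is_retweet, !is_quote)
originals
```

Note that filtering after the search means fewer than n tweets may remain, so the time span describes the surviving originals only.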
Convert this code into a function with two parameters:
Parameter 1: term. This is a string with the search term.
Parameter 2: ntweets. This is the number of tweets.
The function returns the amount of time required to get this number of tweets after eliminating non-original tweets. Test your function with term = “superbowl” and ntweets = 100.
time_to_get <- function(term, ntweets) {
  search_tweets(term, n = ntweets) %>%
    filter(is.na(reply_to_screen_name) &
             is_retweet == FALSE &
             is_quote == FALSE) %>%
    summarize(max(created_at) - min(created_at))
}
t = time_to_get("superbowl",100)
t
## # A tibble: 1 × 1
## `max(created_at) - min(created_at)`
## <drtn>
## 1 1.081667 hours
Use your function to compare the tweet rates of “rstats”, “pandas”, and “ukraine”. Use 500 for ntweets.
## # A tibble: 1 × 1
## `max(created_at) - min(created_at)`
## <drtn>
## 1 1.779444 hours
## # A tibble: 1 × 1
## `max(created_at) - min(created_at)`
## <drtn>
## 1 1.191944 hours
## # A tibble: 1 × 1
## `max(created_at) - min(created_at)`
## <drtn>
## 1 25 secs
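The time spans alone do not give a rate, because the number of original tweets left after filtering differs from term to term. A minimal sketch of one way to get a comparable number follows. The helper `originals_per_hour()` and the `toy_tweets` data are hypothetical; the helper takes a data frame with the same columns as a `search_tweets()` result, so it can be tried without querying Twitter.

```r
library(dplyr)

# Hypothetical helper: original tweets per hour for one set of results
originals_per_hour <- function(tweets) {
  tweets %>%
    filter(is.na(reply_to_screen_name), !is_retweet, !is_quote) %>%
    summarize(
      n_original = n(),
      # Fixed units so results for different terms are comparable
      hours = as.numeric(max(created_at) - min(created_at), units = "hours"),
      per_hour = n_original / hours
    )
}

# Made-up data: five tweets over two hours, one reply and one retweet
toy_tweets <- tibble(
  created_at = as.POSIXct("2022-03-25 12:00:00", tz = "UTC") +
    c(0, 600, 1800, 3600, 7200),
  reply_to_screen_name = c(NA, NA, "some_user", NA, NA),
  is_retweet = c(FALSE, FALSE, FALSE, TRUE, FALSE),
  is_quote   = FALSE
)

rate <- originals_per_hour(toy_tweets)
rate
```

With real data the call would look something like `originals_per_hour(search_tweets("rstats", n = 500))`, assuming the result has the columns used above.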
We have a boolean expression to identify original tweets. Use this to create a variable is_orig using mutate. Then use mean(is_orig) in a summarize step to display the fraction of original tweets in the query results. Test your code with "superbowl" and 500 tweets. Try a few different terms.
search_tweets("superbowl", n = 500) %>%
  mutate(is_orig = is.na(reply_to_screen_name) &
           is_retweet == FALSE &
           is_quote == FALSE) %>%
  summarize(mean(is_orig))
## # A tibble: 1 × 1
## `mean(is_orig)`
## <dbl>
## 1 0.282
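The reason mean() gives a fraction here is that R treats TRUE as 1 and FALSE as 0 in arithmetic. A tiny illustration with a made-up logical vector:

```r
# Made-up flags: two of four tweets are original
is_orig <- c(TRUE, FALSE, FALSE, TRUE)

n_original <- sum(is_orig)       # number of TRUE values
fraction_original <- mean(is_orig)  # share of TRUE values

n_original
fraction_original
```

If any flag were NA, mean() would return NA unless called with `na.rm = TRUE`.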
Convert this code into a function. The two parameters are obvious: the search term and the number of tweets.
fract_original <- function(term, ntweets) {
  search_tweets(term, n = ntweets) %>%
    mutate(is_orig = is.na(reply_to_screen_name) &
             is_retweet == FALSE &
             is_quote == FALSE) %>%
    summarize(mean(is_orig))
}
fract_original("superbowl",500)
## # A tibble: 1 × 1
## `mean(is_orig)`
## <dbl>
## 1 0.282
Use your function to determine the fraction of original tweets for several topics that interest you.