Practice in scraping tweets and making a word cloud.
Load Libraries:
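The library() calls aren't echoed here; based on the functions used below, the load presumably looks something like this:

library(tidyverse)   # dplyr, ggplot2, readr, stringr
library(tidytext)    # unnest_tokens(), stop_words
library(rtweet)      # write_as_csv()
library(wordcloud2)  # wordcloud2()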
911 on Fox Tweets
Load tweets from where I scraped them earlier:
#copy over tweet df to a working df
tweets <- tweets5b12
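(The scrape and the load itself aren't shown above. If the earlier scrape was saved with rtweet, reloading it might look something like this, with a hypothetical filename:)

# reload the previously scraped tweets (filename is hypothetical)
tweets5b12 <- read_twitter_csv("tweets5b12.csv")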
From a little googling, I can see that the Twitter API returns its created-at timestamp in Greenwich Mean Time (UTC), which is 4 hours later than my local time. That means I'm looking for tweets after 12 midnight 3/29/2022 UTC (8 pm DST in my timezone).
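A quick sanity check on that conversion with lubridate, assuming the local timezone is US Eastern:

# convert midnight UTC on 3/29 back to Eastern time
lubridate::with_tz(as.POSIXct("2022-03-29 00:00:00", tz = "UTC"), "America/New_York")
# expected: "2022-03-28 20:00:00 EDT"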
I used the code provided by earthdatascience.org (probably not their intended application, but oh well!)
#Narrowing down my working df
nrow(tweets)
[1] 9997
#format = "%Y-%m-%d %H:%M:%s"
# show start date march 29 12 midnight Greenwich Mean Time
start_date <- as.POSIXct('2022-03-29 00:00:00', tz="UTC")
tweets <- tweets %>% filter(created_at >= start_date)
nrow(tweets)
[1] 4652
So I’ve filtered down from 9997 to 4652 tweets.
# tokenize the tweet text: one row per word
tweetWords <- tweets %>%
  dplyr::select(text) %>%
  unnest_tokens(word, text)
head(tweetWords)
# A tibble: 6 × 1
word
<chr>
1 ok
2 911onfox
3 just
4 starts
5 off
6 being
Attempt to plot the top 15 words:
# plot the top 15 words
tweetWords %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets")
# how many words total?
nrow(tweetWords)
[1] 49424
# remove stop words and the "rt" (retweet) token
tweetsClean <- tweetWords %>%
  anti_join(stop_words) %>%
  filter(!word == "rt")
# how many words after removing the stop words?
nrow(tweetsClean)
[1] 25245
Replot, this time with the top 50:
# plot the top 50 words -- notice any issues?
tweetsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(50) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets")
I still want to get things like https and t.co and 911onfox out of here:
# word count before stripping the URLs
nrow(tweetsClean)
[1] 25245
# cleanup: strip URLs from the raw text, then re-tokenize -- this gets https/t.co out, I think
tweetsClean <- tweets %>%
  mutate(text = gsub("\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)",
                     "", text)) %>%
  filter(created_at >= start_date) %>%
  dplyr::select(text) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  filter(!word == "rt") # remove the "rt" (retweet) token
nrow(tweetsClean)
[1] 22911
Replot the top 50:
# plot the top 50 words -- notice any issues?
tweetsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(50) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets")
From here I want to remove numbers and words that start with numbers. How do I do that?
#I'm going to try gsub - that worked but left an empty cell!
nrow(tweetsClean)
[1] 22911
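A filter-based version avoids the empty-cell problem, since it drops the rows instead of blanking them. A sketch (not what I ran above), assuming stringr is loaded with the tidyverse:

# drop any token that starts with a digit
tweetsClean %>%
  filter(!str_detect(word, "^[0-9]"))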
#tweetsCloud <- tweetsClean %>% slice(-("911onfox")) this doesn't work
#going to try subset instead -- every time I run the word cloud I see more words to take out
tweetsCloud <- subset(tweetsClean, word!="911onfox" & word!="episode" & word!="911lonestar"
                      & word!="hewitt" & word!="i'm" & word!="it's" & word!="1"
                      & word!="chim" & word!="gonna" & word!="tonight"
                      & word!="shes" & word!="im") #IT LOOKS LIKE THAT WORKED!!
nrow(tweetsCloud)
[1] 16894
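For the record, a more compact way to write the same removals with dplyr (a sketch equivalent to the subset() call above):

# same filter, with a single vector of unwanted words
dropWords <- c("911onfox", "episode", "911lonestar", "hewitt", "i'm", "it's",
               "1", "chim", "gonna", "tonight", "shes", "im")
tweetsCloud <- tweetsClean %>%
  filter(!word %in% dropWords)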
Replot top 60
# plot the top 60 words
tweetsCloud %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(60) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets")
Finally, word cloud???
#Can I put the top 60 words into their own dataframe?
top50 <- tweetsCloud %>%
dplyr::count(word, sort = TRUE) %>%
top_n(54) %>%
mutate(word = reorder(word, n))
head(top50)
# A tibble: 6 × 2
word n
<fct> <int>
1 maddie 1016
2 chimney 347
3 madney 287
4 love 271
5 jee 201
6 i’m 181
nrow(top50)
[1] 55
gsub("'", "", top50)
[1] "c(55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 42, 43, 41, 40, 39, 38, 36, 37, 35, 34, 33, 31, 32, 30, 29, 28, 27, 24, 25, 26, 23, 22, 21, 20, 19, 17, 18, 16, 15, 12, 13, 14, 8, 9, 10, 11, 7, 6, 4, 5, 1, 2, 3)"
[2] "c(1016, 347, 287, 271, 201, 181, 157, 133, 132, 131, 130, 104, 102, 102, 97, 93, 85, 81, 80, 80, 79, 77, 75, 72, 72, 71, 68, 67, 65, 64, 64, 64, 63, 62, 61, 60, 58, 57, 57, 54, 52, 51, 51, 51, 50, 50, 50, 50, 49, 48, 47, 47, 46, 46, 46)"
[1] 55
Okay, at this point, I have no idea why I can’t get the words with apostrophes in them OUT in R and my google-fu is failing me, so I’m just going to pull the dataframe down to a csv, edit it in excel and try again.
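(For future reference, the likely culprits are that the tokens use a curly apostrophe -- i’m, not i'm -- and that gsub() needs to be applied to the word column rather than the whole data frame. A sketch that would probably have worked in R:)

# drop tokens containing either a straight or a curly apostrophe
tweetsCloud %>%
  filter(!str_detect(word, "['’]"))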
setwd("~/DACCS R/Text as Data/911Fox Project")
write_as_csv(tweetsCloud,"tweetsCloud.csv")
Load in the cleaned up csv:
cleanCloud <- read_csv("tweetsCloud.csv")
Plot Top 60
# plot the top 60 words
cleanCloud %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(60) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets")
Load the top 60 into their own df
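(The chunk that actually builds top60 isn't shown; presumably it looks something like this, counting the cleaned-up words and keeping the top 60:)

# assumed: count the cleaned words and keep the top 60 for the cloud
top60 <- cleanCloud %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(60)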
wordcloud2(data=top60, size=2, color = "random-dark")
Brief attempt to change the colors:
# or a vector of colors; the vector must be the same length as the input data
wordcloud2(top60, size=1.6, color=rep_len( c("mediumblue","darkorchid", "seagreen", "firebrick", "deeppink", "goldenrod"), nrow(top60) ) )
For some reason, this 2nd wordcloud isn’t visible when I knit this? Huh. I don’t know why.
I can’t figure out how to export the image? I wonder if I need to run it in a different package? Is this an interactive one?
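From what I can tell, wordcloud2() returns an htmlwidget, so yes, it's interactive, and that's probably why it doesn't export like a regular plot. One possible route to a static image, assuming the htmlwidgets and webshot packages are installed (webshot also needs PhantomJS, via webshot::install_phantomjs()):

# save the interactive widget as HTML, then screenshot it to a PNG
library(htmlwidgets)
library(webshot)
cloud <- wordcloud2(top60, size = 1.6)
saveWidget(cloud, "wordcloud.html", selfcontained = FALSE)
webshot("wordcloud.html", "wordcloud.png", delay = 5, vwidth = 1000, vheight = 800)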