911 5b-12 TAKE 4

Practice in scraping tweets and making a word cloud.

Lissie Bates-Haus, Ph.D. (https://github.com/lbateshaus), U Mass Amherst DACSS MS student (https://www.umass.edu/sbs/data-analytics-and-computational-social-science-program/ms)
2022-04-11

Load Libraries:
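The chunk that loaded the libraries does not appear in this output. Judging only from the functions called further down, it presumably loaded something like the packages below. The list is an inference, not the original chunk, and the names `pkgs`, `installed` and `not_installed` are made up for this sketch.

```r
# Packages the later code appears to rely on (inferred, not the original chunk):
#   readr      - read_csv()
#   dplyr      - %>%, filter(), count(), top_n(), mutate()
#   tidytext   - unnest_tokens(), stop_words
#   ggplot2    - ggplot(), geom_col(), coord_flip(), labs()
#   rtweet     - write_as_csv()
#   wordcloud2 - wordcloud2()
pkgs <- c("readr", "dplyr", "tidytext", "ggplot2", "rtweet", "wordcloud2")

# Report anything that is not installed instead of stopping with an error
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]
if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
}

# Attach the packages that are available
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```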

911 on Fox Tweets

Load tweets from where I scraped them earlier:

setwd("~/DACCS R/Text as Data/911Fox Project")
#Already scraped tweets so just loading in the csv

tweets5b12 <- read_csv("tweets5b12.csv")
View(tweets5b12)

#At first read_csv() could not find the file even though it is in this directory.
#Fixed it - it was a path issue.

Separate the data into only post-5b-12 tweets.

#copy over tweet df to a working df

tweets <- tweets5b12

From a little googling, I can see that the Twitter API returns the created_at timestamp in Greenwich Mean Time (UTC), which is 4 hours ahead of my local time. That means I’m looking for tweets from 12 midnight UTC on 3/29/2022 onward, which is 8 pm on 3/28 in my timezone (daylight saving time).
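The offset can be checked in base R by printing the same instant in a local zone. This is only a sketch: the zone name `America/New_York` is an assumption about "my timezone" (it is the zone that is 4 hours behind UTC in late March), and `start_utc`, `local_zone` and `local_label` are illustrative names.

```r
# Midnight on 29 March 2022, expressed in UTC
start_utc <- as.POSIXct("2022-03-29 00:00:00", tz = "UTC")

# Same instant shown in a US Eastern zone (assumed; swap in your own zone name)
local_zone <- "America/New_York"
local_label <- format(start_utc, format = "%Y-%m-%d %H:%M", tz = local_zone)
print(local_label)
```

If the zone assumption holds, this prints the evening of 3/28, so tweets from 8 pm local time onward pass the filter.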

I used the code provided by earthdatascience.org (probably not their intended application but oh well!)

#Narrowing down my working df

nrow(tweets)
[1] 9997
#format = "%Y-%m-%d %H:%M:%S"
# start date: March 29, 12 midnight Greenwich Mean Time (UTC)
start_date <- as.POSIXct("2022-03-29 00:00:00", tz = "UTC")

tweets <- tweets %>% filter(created_at >= start_date)

nrow(tweets)
[1] 4652

So I’ve filtered down from 9997 to 4652 tweets.

Explore Common Words

tweetWords <- tweets %>%
  dplyr::select(text) %>%
  unnest_tokens(word, text)

head(tweetWords)
# A tibble: 6 × 1
  word    
  <chr>   
1 ok      
2 911onfox
3 just    
4 starts  
5 off     
6 being   
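To see why a hashtag shows up as the plain token 911onfox, here is a rough base-R imitation of word tokenization: lowercase the text, then split on anything that is not a letter, digit, or apostrophe. It is only an approximation of what unnest_tokens() does, and the sample tweet text is made up.

```r
# Made-up tweet text for illustration
sample_text <- "OK #911onFOX just starts off being"

# Lowercase, then split on runs of characters that are not letters, digits, or apostrophes
tokens <- strsplit(tolower(sample_text), "[^a-z0-9']+")[[1]]

# Drop any empty strings left by leading separators
tokens <- tokens[tokens != ""]
print(tokens)
```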

Attempt to plot the top 15 words:

# plot the top 15 words
tweetWords %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # labs() labels aesthetics, so x still refers to `word` after coord_flip()
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets")

Deal with Stop Words

data("stop_words")
# how many words do you have including the stop words?
nrow(tweetWords)
[1] 49424
tweetsClean <- tweetWords %>%
  anti_join(stop_words) %>%
  filter(!word == "rt")

# how many words after removing the stop words?
nrow(tweetsClean)
[1] 25245

Replot top 50

# plot the top 50 words -- notice any issues?
tweetsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(50) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # labs() labels aesthetics, so x still refers to `word` after coord_flip()
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets")

I still want to get things like https and t.co and 911onfox out of here:

#this gets https out I think

nrow(tweetsClean)
[1] 25245
# cleanup
tweetsClean <- tweets %>%
  mutate(text = gsub("\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)", 
                           "", text)) %>% 
  filter(created_at >= start_date ) %>% 
  dplyr::select(text) %>%
  unnest_tokens(word, text) %>% 
  anti_join(stop_words) %>%
  filter(!word == "rt") # drop rows whose token is exactly "rt" (the retweet marker)
nrow(tweetsClean)
[1] 22911
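To see what the URL pattern in the gsub() call above removes, it can be run on a single string. The pattern is copied from the code above; the sample tweet and the names `url_pattern`, `sample_tweet` and `cleaned` are made up for this sketch.

```r
# Pattern copied from the cleanup step: optional space, ftp/http/https, "://", then the rest of the URL
url_pattern <- "\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)"

# Made-up tweet containing a shortened link
sample_tweet <- "so good https://t.co/abc123 wow"

# Replace the matched URL (and the space before it) with nothing
cleaned <- gsub(url_pattern, "", sample_tweet)
print(cleaned)
```

Because the whole link is deleted before tokenizing, neither https nor t.co nor the link's random suffix ends up as a word.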

Replot top 50

# plot the top 50 words -- notice any issues?
tweetsClean %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(50) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # labs() labels aesthetics, so x still refers to `word` after coord_flip()
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets")

From here I want to remove numbers and words that start with numbers. How do I do that?
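One possible answer is a regular expression anchored at the start of the token. The sketch below uses a made-up vector; in the pipeline the same test could presumably go inside filter(), along the lines of `filter(!grepl("^[0-9]", word))`, though that is a suggestion rather than something run here. Note that it would also drop 911onfox and 911lonestar.

```r
# Made-up tokens: some plain words, some numbers, some starting with a digit
words <- c("maddie", "911onfox", "1", "2nd", "chimney", "b12")

# TRUE for tokens whose first character is a digit
starts_with_digit <- grepl("^[0-9]", words)

# Keep everything else
kept <- words[!starts_with_digit]
print(kept)
```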

#I'm going to try gsub - that worked but left an empty cell!

nrow(tweetsClean)
[1] 22911
#tweetsCloud <- tweetsClean %>% slice(-("911onfox"))  this doesn't work 

#going to try subset

tweetsCloud <- subset(tweetsClean, word!="911onfox" & word!="episode" & word!="911lonestar" 
                      & word!="hewitt" & word!="i'm" & word!="it's" & word!="1" 
                      &  word!="chim" & word!="gonna" & word!="tonight" 
                      & word!="shes" & word!="im")       #IT LOOKS LIKE THAT WORKED!!



#tweetsCloud <- subset(tweetsCloud, word!="episode")
#tweetsCloud <- subset(tweetsCloud, word!="911lonestar")   #every time I run the word cloud I see words to take out
#tweetsCloud <- subset(tweetsCloud, word!="hewitt") 
#tweetsCloud <- subset(tweetsCloud, word!="im") 
#tweetsCloud <- subset(tweetsCloud, word!="it's") 
#tweetsCloud <- subset(tweetsCloud, word!="1") 
#tweetsCloud <- subset(tweetsCloud, word!="chim") 
#tweetsCloud <- subset(tweetsCloud, word!="gonna")
#tweetsCloud <- subset(tweetsCloud, word!="tonight")
#tweetsCloud <- subset(tweetsCloud, word!="I'm") 

nrow(tweetsCloud)
[1] 16894
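The long chain of != tests can also be written with a vector of unwanted words and %in%, which makes it easier to add a word each time the cloud turns up another one. This is a sketch on a small made-up data frame; `drop_words`, `tweets_clean` and `tweets_cloud` are hypothetical names, not the objects above.

```r
# Small stand-in for the token data frame
tweets_clean <- data.frame(
  word = c("maddie", "911onfox", "episode", "love", "1", "chimney"),
  stringsAsFactors = FALSE
)

# One vector holding every word to remove; extend it as new ones show up
drop_words <- c("911onfox", "episode", "911lonestar", "hewitt", "1", "gonna", "tonight")

# Keep rows whose word is not in the drop list
tweets_cloud <- subset(tweets_clean, !(word %in% drop_words))
print(tweets_cloud$word)
```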

Replot top 60

# plot the top 60 words
tweetsCloud %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(60) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # labs() labels aesthetics, so x still refers to `word` after coord_flip()
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets")

Finally, word cloud???

#Can I put the top words into their own dataframe? top_n(54) keeps ties, so it returns 55 rows

top50 <- tweetsCloud %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(54) %>%
  mutate(word = reorder(word, n))

head(top50)
# A tibble: 6 × 2
  word        n
  <fct>   <int>
1 maddie   1016
2 chimney   347
3 madney    287
4 love      271
5 jee       201
6 i’m       181
nrow(top50)
[1] 55
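gsub("'", "", top50)  # runs on the whole tibble (each column coerced to one string); result is not assigned, so top50 is unchanged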
[1] "c(55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 42, 43, 41, 40, 39, 38, 36, 37, 35, 34, 33, 31, 32, 30, 29, 28, 27, 24, 25, 26, 23, 22, 21, 20, 19, 17, 18, 16, 15, 12, 13, 14, 8, 9, 10, 11, 7, 6, 4, 5, 1, 2, 3)"                        
[2] "c(1016, 347, 287, 271, 201, 181, 157, 133, 132, 131, 130, 104, 102, 102, 97, 93, 85, 81, 80, 80, 79, 77, 75, 72, 72, 71, 68, 67, 65, 64, 64, 64, 63, 62, 61, 60, 58, 57, 57, 54, 52, 51, 51, 51, 50, 50, 50, 50, 49, 48, 47, 47, 46, 46, 46)"
top50a <- subset(top50, word!="im")

nrow(top50a)
[1] 55

Okay, at this point I have no idea why I can’t get the words with apostrophes in them out in R, and my google-fu is failing me, so I’m just going to write the dataframe out to a csv, edit it in Excel, and try again.
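A possible explanation, judging from the head(top50) output above: the token is printed as i’m with a curly apostrophe, while the subset() call (and the stop word list) spell it i'm with a straight one, so the two strings never compare equal. If that is the cause, converting curly apostrophes to straight ones before filtering should work. A base-R sketch on a made-up vector:

```r
# "\u2019" is the curly apostrophe; "'" is the straight one
words <- c("maddie", "i\u2019m", "it\u2019s", "i'm", "love")

# Comparing against the straight-apostrophe spelling leaves the curly ones in place
still_there <- words[words != "i'm"]

# Convert curly apostrophes to straight ones first, then filter
normalized <- gsub("\u2019", "'", words, fixed = TRUE)
kept <- normalized[!(normalized %in% c("i'm", "it's"))]
print(kept)
```

In the pipeline, the same gsub() could presumably be applied to the text column before unnest_tokens(), so that anti_join(stop_words) also catches these words.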

setwd("~/DACCS R/Text as Data/911Fox Project")
write_as_csv(tweetsCloud,"tweetsCloud.csv")

Load in the cleaned up csv:

cleanCloud <- read_csv("tweetsCloud.csv")

Plot Top 60

# plot the top 60 words
cleanCloud %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(60) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # labs() labels aesthetics, so x still refers to `word` after coord_flip()
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in tweets")

Load the top 60 into their own df

top60 <- cleanCloud %>%
  dplyr::count(word, sort = TRUE) %>%
  top_n(60) %>%
  mutate(word = reorder(word, n))
wordcloud2(data=top60, size=2, color = "random-dark")

Brief attempt to change the colors:

# or a vector of colors; the vector must be the same length as the input data
wordcloud2(top60, size=1.6, color=rep_len( c("mediumblue","darkorchid", "seagreen", "firebrick", "deeppink", "goldenrod"), nrow(top60) ) )

For some reason, this 2nd wordcloud isn’t visible when I knit this? Huh. I don’t know why.

I can’t figure out how to export the image? I wonder if I need to run it in a different package? Is this an interactive one?
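On the last question: wordcloud2() returns an HTML widget, so yes, it is interactive, and there is no image file until the widget is drawn in a browser. One commonly used route to a static image is to save the widget as HTML and take a screenshot of it. The sketch below is guarded so it does nothing if the packages are missing; it assumes htmlwidgets and webshot (plus PhantomJS, which webshot needs) are available, the file names are placeholders, and the five words and counts are taken from the head(top50) output above.

```r
# Five words and counts from the head(top50) output, as a small stand-in
top_words <- data.frame(
  word = c("maddie", "chimney", "madney", "love", "jee"),
  freq = c(1016, 347, 287, 271, 201)
)

# Placeholder output locations
html_file <- file.path(tempdir(), "wordcloud.html")
png_file  <- file.path(tempdir(), "wordcloud.png")

# Helper: is a package installed?
have <- function(pkg) requireNamespace(pkg, quietly = TRUE)

if (have("wordcloud2") && have("htmlwidgets")) {
  cloud <- wordcloud2::wordcloud2(top_words, size = 1.6, color = "random-dark")
  # Write the interactive widget to an HTML file
  htmlwidgets::saveWidget(cloud, html_file, selfcontained = FALSE)
  if (have("webshot")) {
    # Screenshot the page; the delay gives the cloud time to draw
    try(webshot::webshot(html_file, png_file, delay = 5, vwidth = 800, vheight = 600))
  }
}
```

Saving each cloud to its own file may also get around the second cloud not showing up when knitting, since the image can then be included like any other picture.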