Creating the Stranger Things Corpus from 7/9/2022 Tweet Scrape

Step 3 of Creating 604 Dashboard

Lissie Bates-Haus, Ph.D. (https://github.com/lbateshaus), U Mass Amherst DACSS MS Student (https://www.umass.edu/sbs/data-analytics-and-computational-social-science-program/ms)
2022-07-09

Please note: this code is almost entirely from my project 911 5b-12 TAKE 4. New code will be cited as necessary.

Load Libraries:

Load In Data:

Explore Common Words

# A tibble: 6 × 1
  word  
  <chr> 
1 i     
2 been  
3 trying
4 to    
5 place 
6 the   
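
The listing above is what tokenizing produces: one lowercased word per row. The code chunk is not shown in this write-up, so here is a minimal sketch of that step, assuming the tidytext package was used. The `tweets` data frame and its `text` column are hypothetical stand-ins for the scraped data.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-tweet stand-in; the real scrape and its column names
# are not shown in this document.
tweets <- tibble(
  text = c("I been trying to place the song",
           "Stranger Things season 4 was wild")
)

# Split each tweet into one row per word; unnest_tokens() lowercases
# and strips punctuation by default.
tweet_words <- tweets %>%
  unnest_tokens(word, text)

head(tweet_words)
```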

Attempt to plot the top 30 words:
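
A rough sketch of how a top-30 plot could be built, assuming dplyr and ggplot2. The original plotting code is not shown here, so the sample words and the object names (`tweet_words`, `top_words`, `top_plot`) are all assumptions.

```r
library(dplyr)
library(ggplot2)

# Hypothetical tokenized words; in the real workflow this is the full word table.
tweet_words <- tibble(word = c("the", "the", "the", "to", "to", "eddie"))

top_words <- tweet_words %>%
  count(word, sort = TRUE) %>%  # frequency of each word, highest first
  slice_head(n = 30)            # keep at most the 30 most frequent words

# Horizontal bar chart, with words ordered by their count.
top_plot <- ggplot(top_words, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = NULL, title = "Top 30 words")

# Call print(top_plot) to draw the chart.
```

The same two steps can be rerun after each cleaning pass to produce the replots.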

Deal with Stop Words

A majority of these seem to be stop words, so let’s fix that!

[1] 350860
[1] 199931
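
The two numbers above read as row counts before and after stop word removal. One common way to do this, sketched here on the assumption that tidytext's built-in `stop_words` table was used (the exact lexicon is not stated), is an anti-join on the `word` column. The sample words are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokenized words standing in for the full table.
tweet_words <- tibble(
  word = c("i", "been", "trying", "to", "place", "the", "demogorgon")
)

nrow(tweet_words)  # row count before removal

# Drop every row whose word appears in tidytext's stop_words table.
tweet_words_clean <- tweet_words %>%
  anti_join(stop_words, by = "word")

nrow(tweet_words_clean)  # row count after removal
```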

Replot top 30

I still want to get things like https and t.co and 911onfox out of here:

[1] 199931
[1] 163594
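
Tokens like these can be removed the same way as the standard stop words, using a small custom table. This sketch contains only the three tokens named above. The full list that produced the drop to 163594 rows is not shown, and `custom_stops` is a hypothetical name.

```r
library(dplyr)

# Hypothetical words left over after the standard stop word pass.
tweet_words_clean <- tibble(
  word = c("https", "t.co", "eddie", "911onfox", "season", "https")
)

# Custom stop words: only the three examples named in the text.
custom_stops <- tibble(word = c("https", "t.co", "911onfox"))

tweet_words_clean <- tweet_words_clean %>%
  anti_join(custom_stops, by = "word")

nrow(tweet_words_clean)
```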

Replot top 30

What I figured out from my previous projects is that I have an encoding problem here, which is why I’m still getting words that aren’t meaningful. While it’s probably not the correct way to manage this, I’m going to write my top 200 words to a CSV, open it in Excel, and take a look at it there. Like I said, probably not ideal, but this will give me clean words for my wordcloud. What I’ve found is that a fair amount of data cleaning is needed to make the list meaningful. While I’m not a domain expert, I certainly have domain knowledge, so I feel comfortable making these decisions. All editing will be documented.

# A tibble: 6 × 2
  word                n
  <fct>           <int>
1 strangerthings  18057
2 season           2466
3 stranger         2222
4 eddie            1922
5 strangerthings4  1671
6 eddiemunson      1199
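
A minimal sketch of counting words and exporting the top 200 for review in Excel. The output path and object names are assumptions, and the sample words are hypothetical. Only the "top 200 words to a CSV" step comes from the text.

```r
library(dplyr)

# Hypothetical cleaned word table.
word_counts <- tibble(
  word = c("strangerthings", "season", "strangerthings",
           "eddie", "season", "strangerthings")
)

top_200 <- word_counts %>%
  count(word, sort = TRUE) %>%  # one row per word with its frequency n
  slice_head(n = 200)           # keep at most the 200 most frequent

# Write to a temporary location; a real project would use its own path.
out_path <- file.path(tempdir(), "top_200_words.csv")
write.csv(top_200, out_path, row.names = FALSE, fileEncoding = "UTF-8")
```

Writing with an explicit UTF-8 encoding may also make the encoding problem easier to spot when the file is opened.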

Data Cleaning Decisions:

Removed Words/Numbers:

can’t, didn’t, don’t, he’s, i’m, i’ve, it’s, that’s, y’all
doja/dojacat (deleted because these mentions relate to drama with an actor on the show, not her music)
amp, episode/episodes, im, s4/s04, thorloveandthunder, vol/volume

Combined Words:

duffer & brothers & dufferbrothers -> duffer brothers
kate & bush & katebush -> kate bush
eddie & eddiemunson & eddiemuson -> eddie munson
eddiemunsonfanart -> eddie munson fanart
el & eleven -> eleven
noah & noahschnapp & schnapp -> noah schnapp
spoiler & spoilers -> spoiler
steve & harrington & steveharrington -> steve harrington
master & puppets & masterofpuppets -> master of puppets
joseph & quinn & josephquinn -> joseph quinn
stranger & things & stranger_things & strangerthings4 & strangerthings4vol2 & strangerthings5 & strangerthingsseason4 -> stranger things
max & maxmayfield -> max
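
The combinations above were made by hand in Excel. As an alternative sketch only, the same kind of merge can be scripted with a lookup vector followed by re-summing the counts. The `merge_map` name and the choice to sum counts are assumptions. The counts are the ones shown in the earlier top-words listing, and only a few of the mappings are included.

```r
library(dplyr)

# A few rows with counts taken from the top-words listing.
top_words <- tibble(
  word = c("season", "stranger", "eddie", "strangerthings4", "eddiemunson"),
  n    = c(2466L, 2222L, 1922L, 1671L, 1199L)
)

# Hypothetical lookup: variant spelling -> combined label.
merge_map <- c(
  stranger        = "stranger things",
  strangerthings4 = "stranger things",
  eddie           = "eddie munson",
  eddiemunson     = "eddie munson"
)

merged <- top_words %>%
  # Replace a word with its combined label when one exists, else keep it.
  mutate(word = coalesce(unname(merge_map[word]), word)) %>%
  group_by(word) %>%
  summarise(n = sum(n), .groups = "drop") %>%  # add up counts of merged variants
  arrange(desc(n))

merged
```

Keeping the mapping in code gives a record of each decision, which fits the goal of documenting all edits.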

Resulted in 157 words on the list.

One of the things I know is that if one word has a much higher count than the rest, it can throw off the wordcloud. “stranger things” has a count over 7x higher than the next word, so I’ll see how it looks and maybe change the count.
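
If the count does need changing, one option is to cap the dominant word at some multiple of the runner-up. This is only a sketch: the counts below are illustrative, and the cap factor of 2 is an arbitrary assumption rather than a value from this project.

```r
# Illustrative counts, not the project's real numbers.
counts <- c("stranger things" = 21000, "eddie munson" = 3000, "season" = 2466)

# Cap every count at twice the second-highest value (arbitrary factor).
sorted <- sort(counts, decreasing = TRUE)
cap <- 2 * sorted[[2]]
capped <- pmin(counts, cap)

capped
```

Only the top word is affected here, so the relative sizes of the remaining words are unchanged.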

Import clean top 200 dataframe.

I don’t think I need to convert this to an actual corpus object for the code I’m using, so I’ll save this and move on.
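
A minimal sketch of the import-and-save step. The file names are hypothetical, the two rows are made-up placeholders for the hand-cleaned list, and saving as an `.rds` file is an assumption since the original save format is not stated.

```r
# Hypothetical file standing in for the hand-cleaned CSV exported from Excel.
clean_path <- file.path(tempdir(), "top_200_clean.csv")
writeLines(c("word,n", "stranger things,100", "eddie munson,40"), clean_path)

# Read the cleaned list back in as a plain data frame (no corpus object).
clean_top_200 <- read.csv(clean_path, stringsAsFactors = FALSE,
                          fileEncoding = "UTF-8")

# Save for the next step of the dashboard build.
rds_path <- file.path(tempdir(), "clean_top_200.rds")
saveRDS(clean_top_200, rds_path)
```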