Step 3 of Creating 604 Dashboard
Please note: this code is almost entirely from my project 911 5b-12 TAKE 4. New code will be cited as necessary.
Load Libraries:
Load In Data:
# A tibble: 6 × 1
word
<chr>
1 i
2 been
3 trying
4 to
5 place
6 the
Attempt to plot the top 30 words:
A majority of these seem to be stop words, so let’s fix that!
[1] 350860
[1] 199931
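A minimal sketch of the stop-word removal step, assuming the tokens sit in a one-column tibble like the one printed above. The names `words`, `my_stop_words`, and `words_clean` are hypothetical, and the tiny inline stop-word list is only a stand-in; the `stop_words` dataset from tidytext (which has a `word` column) could be used in its place.

```r
library(dplyr)

# Toy stand-in for the tokenized text (one word per row); not the real data.
words <- tibble(word = c("i", "been", "trying", "to", "place", "the", "season"))

# Hypothetical short stop-word list; tidytext::stop_words could be swapped in.
my_stop_words <- tibble(word = c("i", "been", "to", "the"))

nrow(words)  # row count before removing stop words

# anti_join() keeps only the rows of `words` that have no match in `my_stop_words`
words_clean <- anti_join(words, my_stop_words, by = "word")

nrow(words_clean)  # row count after removing stop words
```

Printing `nrow()` before and after is a quick way to confirm how many rows the join dropped.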
Replot top 30
I still want to get things like https and t.co and 911onfox out of here:
[1] 199931
[1] 163594
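One way this second pass could look, as a sketch only: a hand-written vector of unwanted tokens filtered out with `%in%`. The names `words_clean`, `extra_words`, and `words_clean2` are hypothetical, and the toy tibble stands in for the real data.

```r
library(dplyr)

# Toy stand-in for the stop-word-filtered tokens; not the real data.
words_clean <- tibble(
  word = c("strangerthings", "https", "t.co", "eddie", "911onfox")
)

# Tokens named in the text above that carry no meaning for the wordcloud
extra_words <- c("https", "t.co", "911onfox")

# Keep every row whose word is NOT in the removal vector
words_clean2 <- words_clean %>%
  filter(!word %in% extra_words)

nrow(words_clean2)  # row count after the second pass
```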
Replot top 30
What I figured out from my previous projects is that I have an encoding problem here, which is why I’m still getting words that aren’t meaningful. While it’s probably not the correct way to manage this, I’m going to write my top 200 words to a CSV, open it in Excel, and take a look at it there. Like I said, probably not ideal, but this will give me clean words for my wordcloud. What I’ve found is that a fair amount of data cleaning is needed to make the list meaningful. While I’m not a domain expert, I certainly have domain knowledge, so I feel comfortable making these decisions. All editing will be documented.
# A tibble: 6 × 2
word n
<fct> <int>
1 strangerthings 18057
2 season 2466
3 stranger 2222
4 eddie 1922
5 strangerthings4 1671
6 eddiemunson 1199
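A rough sketch of how the top 200 words could be counted and written out for review, assuming a one-column tibble of cleaned tokens. The names `words_clean`, `top_words`, and `out_file` are hypothetical, and a temporary file is used here in place of a real path.

```r
library(dplyr)

# Toy stand-in for the cleaned tokens; not the real data.
words_clean <- tibble(
  word = c("strangerthings", "season", "strangerthings",
           "eddie", "season", "strangerthings")
)

# Count each word, sort by frequency, and keep at most the top 200 rows
top_words <- words_clean %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 200)

# Write the counts to a CSV that can be opened in a spreadsheet program
out_file <- tempfile(fileext = ".csv")  # hypothetical location
write.csv(top_words, out_file, row.names = FALSE)

head(top_words)
```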
Data Cleaning Decisions:
Removed Words/Numbers:
can’t, didn’t, don’t, he’s, i’m, i’ve, it’s, that’s, y’all
doja/dojacat (deleted due to this being related to drama with an actor on the show, not her music)
amp, episode/episodes, im, s4/s04, thorloveandthunder, vol/volume
Combined Words:
duffer & brothers & dufferbrothers -> duffer brothers
kate & bush & katebush -> kate bush
eddie & eddiemunson & eddiemuson -> eddie munson
eddiemunsonfanart -> eddie munson fanart
el & eleven -> eleven
noah & noahschnapp & schnapp -> noah schnapp
spoiler & spoilers -> spoiler
steve & harrington & steveharrington -> steve harrington
master & puppets & masterofpuppets -> master of puppets
joseph & quinn & josephquinn -> joseph quinn
stranger & things & stranger_things & strangerthings4 & strangerthings4vol2 & strangerthings5 & strangerthingsseason4 -> stranger things
max & maxmayfield -> max
Resulted in 157 words on the list.
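The edits above were made by hand in a spreadsheet. For comparison, here is a sketch of how the same kind of removal and combining could be done in code, using a named lookup vector and re-summing the counts. This is an alternative approach, not what was actually run; the names `remove_words`, `combine_map`, and `cleaned` are hypothetical, and the counts are made-up toy values.

```r
library(dplyr)

# Toy word counts (made-up numbers, not the real data)
top_words <- tibble(
  word = c("eddie", "eddiemunson", "el", "eleven", "season", "amp"),
  n    = c(10, 5, 2, 4, 8, 7)
)

# A small subset of the documented decisions, for illustration
remove_words <- c("amp", "im")
combine_map  <- c(eddie = "eddie munson",
                  eddiemunson = "eddie munson",
                  el = "eleven")

cleaned <- top_words %>%
  filter(!word %in% remove_words) %>%
  # Replace a word with its combined label when it appears in the lookup
  mutate(word = ifelse(word %in% names(combine_map),
                       unname(combine_map[word]),
                       word)) %>%
  # Merged labels now repeat, so add their counts together
  group_by(word) %>%
  summarise(n = sum(n)) %>%
  arrange(desc(n))

cleaned
```

Keeping the decisions in a lookup vector means the documented list and the code stay in one place if the cleaning is ever rerun.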
One thing I know is that if one word has a much higher count than the rest, it can throw off the wordcloud. “stranger things” has a count more than 7x that of the next word, so I’ll see how it looks and maybe change the count.
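If the count does need changing, one possible approach is to cap the dominant word at some multiple of the second-highest count and plot the capped value. The sketch below assumes a cap of twice the runner-up, which is an arbitrary choice for illustration; the names `cap` and `n_plot` are hypothetical and the counts are toy values.

```r
library(dplyr)

# Toy counts with one dominant word (made-up numbers, not the real data)
top_words <- tibble(
  word = c("stranger things", "season", "eddie munson"),
  n    = c(100, 12, 10)
)

# Second-highest count in the list
second_highest <- sort(top_words$n, decreasing = TRUE)[2]

# Assumed cap: twice the second-highest count
cap <- 2 * second_highest

# Keep the original count and add a capped column for plotting only
capped <- top_words %>%
  mutate(n_plot = pmin(n, cap))

capped
```

Keeping the original `n` alongside `n_plot` leaves the real counts available for documentation.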
Import clean top 200 dataframe.
I don’t think I need to convert this to an actual corpus object for the code I’m using, so I’ll save this and move on.