20190806_tidytuesday

Libraries

library(tidyverse)

## -- Attaching packages --------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.0     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'dplyr' was built under R version 3.6.1

## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(fivethirtyeight)

## Warning: package 'fivethirtyeight' was built under R version 3.6.1

library("tm")

## Warning: package 'tm' was built under R version 3.6.1

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

library("SnowballC")
library("wordcloud")

## Warning: package 'wordcloud' was built under R version 3.6.1

## Loading required package: RColorBrewer

library("RColorBrewer")

Grab the data

br <- bob_ross

Tidy it up a bit by splitting the episode code into separate season and episode numbers. This code is from the TidyTuesday page.

br_tidy <- br %>% 
  separate(episode, into = c("season", "episode"), sep = "E") %>% 
  mutate(season = str_extract(season, "[:digit:]+")) %>% 
  mutate_at(vars(season, episode), as.integer)
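To see what each step of that pipeline does, here is a small sketch on made-up episode codes. It assumes the codes look like "S01E01"; the `toy` and `toy_tidy` names and the three values are hypothetical, not taken from the data set.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical episode codes, assumed to follow the "S01E01" pattern.
toy <- tibble(episode = c("S01E01", "S02E13", "S31E05"))

toy_tidy <- toy %>% 
  # Split at the "E": "S01E01" becomes season "S01" and episode "01".
  separate(episode, into = c("season", "episode"), sep = "E") %>% 
  # Keep only the digits of the season: "S01" becomes "01".
  mutate(season = str_extract(season, "[:digit:]+")) %>% 
  # Convert both columns from character to integer: "01" becomes 1L.
  mutate_at(vars(season, episode), as.integer)

toy_tidy
```

Both columns end up as integers, which makes it easier to sort or filter by season later.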

Examine the names.

names(br_tidy)

##  [1] "season"             "episode"            "episode_num"       
##  [4] "title"              "apple_frame"        "aurora_borealis"   
##  [7] "barn"               "beach"              "boat"              
## [10] "bridge"             "building"           "bushes"            
## [13] "cabin"              "cactus"             "circle_frame"      
## [16] "cirrus"             "cliff"              "clouds"            
## [19] "conifer"            "cumulus"            "deciduous"         
## [22] "diane_andre"        "dock"               "double_oval_frame" 
## [25] "farm"               "fence"              "fire"              
## [28] "florida_frame"      "flowers"            "fog"               
## [31] "framed"             "grass"              "guest"             
## [34] "half_circle_frame"  "half_oval_frame"    "hills"             
## [37] "lake"               "lakes"              "lighthouse"        
## [40] "mill"               "moon"               "mountain"          
## [43] "mountains"          "night"              "ocean"             
## [46] "oval_frame"         "palm_trees"         "path"              
## [49] "person"             "portrait"           "rectangle_3d_frame"
## [52] "rectangular_frame"  "river"              "rocks"             
## [55] "seashell_frame"     "snow"               "snowy_mountain"    
## [58] "split_frame"        "steve_ross"         "structure"         
## [61] "sun"                "tomb_frame"         "tree"              
## [64] "trees"              "triple_frame"       "waterfall"         
## [67] "waves"              "windmill"           "window_frame"      
## [70] "winter"             "wood_framed"

I am going to drop the columns whose names contain the word frame (this also catches framed and wood_framed) and the two columns named after people, steve_ross and diane_andre.

br_redacted <- br_tidy %>% 
  select(-contains("frame"), -steve_ross, -diane_andre)

names(br_redacted)

##  [1] "season"          "episode"         "episode_num"    
##  [4] "title"           "aurora_borealis" "barn"           
##  [7] "beach"           "boat"            "bridge"         
## [10] "building"        "bushes"          "cabin"          
## [13] "cactus"          "cirrus"          "cliff"          
## [16] "clouds"          "conifer"         "cumulus"        
## [19] "deciduous"       "dock"            "farm"           
## [22] "fence"           "fire"            "flowers"        
## [25] "fog"             "grass"           "guest"          
## [28] "hills"           "lake"            "lakes"          
## [31] "lighthouse"      "mill"            "moon"           
## [34] "mountain"        "mountains"       "night"          
## [37] "ocean"           "palm_trees"      "path"           
## [40] "person"          "portrait"        "river"          
## [43] "rocks"           "snow"            "snowy_mountain" 
## [46] "structure"       "sun"             "tree"           
## [49] "trees"           "waterfall"       "waves"          
## [52] "windmill"        "winter"

I would like to build a word cloud. I found and read part of this article: http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know

I know I need a word frequency table to turn this into a word cloud quickly. I considered using gather, but I couldn’t remember how to do so off the top of my head, so I decided to create a tibble with the variable names and their column sums.

# Columns 5 to 53 are the 0/1 element indicators; columns 1 to 4 identify the episode.
wordcounts <- tibble("word" = names(br_redacted[5:53]), "count" = colSums(br_redacted[5:53]))

arrange(wordcounts, desc(count))

## # A tibble: 49 x 2
##    word      count
##    <chr>     <dbl>
##  1 tree        361
##  2 trees       337
##  3 deciduous   227
##  4 conifer     212
##  5 clouds      179
##  6 mountain    160
##  7 lake        143
##  8 grass       142
##  9 river       126
## 10 bushes      120
## # ... with 39 more rows
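For comparison, the gather route mentioned above might look something like the sketch below. It runs on a tiny made-up table rather than the real data; `toy`, `toy_counts`, and the values in them are hypothetical. gather is used because it matches the tidyr version loaded here; newer tidyr versions offer pivot_longer for the same job.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical stand-in for br_redacted: one row per episode, 0/1 indicator columns.
toy <- tibble(
  title = c("A", "B", "C"),
  tree  = c(1, 1, 0),
  lake  = c(0, 1, 1),
  cabin = c(0, 0, 1)
)

toy_counts <- toy %>% 
  # Stack every indicator column into (word, present) pairs, one row per episode and word.
  gather(key = "word", value = "present", -title) %>% 
  # Total the indicators for each word.
  group_by(word) %>% 
  summarise(count = sum(present)) %>% 
  # Most frequent first; ties broken alphabetically.
  arrange(desc(count), word)

toy_counts
```

On the real data the same idea would exclude season, episode, episode_num, and title instead of just title, which avoids hard-coding the column positions.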

This code is copied and modified from the above website resource.

set.seed(1234)
wordcloud(words = wordcounts$word, freq = wordcounts$count, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

Comments: This was a super fun and quick project. I set aside 1 hour for it and was only interrupted a few times for baby issues!

Future ideas: Figure out how to pick colors for individual words. Combine tree and trees and other similar words.
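A rough sketch of how both ideas might work, again on made-up data (`toy`, `toy_merged`, `toy_counts`, `word_colors`, and the colour choices are all hypothetical). It merges tree and trees per episode with pmax, so an episode flagged for both is counted once, and it relies on the ordered.colors argument of wordcloud, which assigns the colours vector to the words position by position.

```r
library(tibble)
library(dplyr)
library(wordcloud)

# Hypothetical stand-in for br_redacted: 0/1 flags, one row per episode.
toy <- tibble(
  tree     = c(1, 1, 1, 0, 1),
  trees    = c(1, 1, 0, 0, 1),
  lake     = c(0, 1, 1, 1, 0),
  mountain = c(1, 0, 1, 0, 0)
)

# Merge singular and plural: an episode counts once if either flag is set.
toy_merged <- toy %>% 
  mutate(tree = pmax(tree, trees)) %>% 
  select(-trees)

toy_counts <- tibble(word = names(toy_merged), count = colSums(toy_merged))

# Hand-picked colours for some words; everything else falls back to grey.
word_colors <- c(tree = "forestgreen", lake = "steelblue")
toy_counts <- toy_counts %>% 
  mutate(color = coalesce(unname(word_colors[word]), "grey40"))

set.seed(1234)
# With ordered.colors = TRUE the i-th colour is used for the i-th word.
wordcloud(words = toy_counts$word, freq = toy_counts$count, min.freq = 1,
          random.order = FALSE, colors = toy_counts$color, ordered.colors = TRUE)
```

Simply adding the two column sums together would be quicker, but it would count an episode twice whenever both tree and trees are flagged.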

Mara Alexeev

8/6/2019