word clouds

This week I have been prepping a talk for high school students about “how to study”. I wanted to see what they might already know about things like retrieval practice and the testing effect, so I asked some first year students to fill out a google form and tell me what kinds of strategies they used (and thought were effective) when they were studying for the HSC. Then I asked them to rate how frequently they used a range of common strategies that I found in a published paper about students’ awareness of evidence-based study strategies.

My goal this week was to work out how to use the tidytext package and the wordcloud2 packages to count the most frequently used words and turn them into a word cloud. I also wanted to plot the ratings data to see whether HSC students frequently use recall/testing as a study strategy.

load packages

library(tidyverse)
library(googlesheets4)
library(ggeasy)
library(ggannotate)
library(janitor)
library(tidytext)
library(RColorBrewer)
library(wordcloud2)

read the data

The nice thing about google forms is that you can use the googlesheets4 package to read the data into R using just the URL. It will ask you to authenticate that you have access to the sheet and voila!
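One thing worth noting: if the sheet is shared so that anyone with the link can view it (an assumption on my part about this particular sheet), you can skip the authentication prompt altogether with gs4_deauth(); otherwise gs4_auth() will walk you through signing in. A minimal sketch:

# Assumption: the sheet is viewable by anyone with the link, so we can
# read it without signing in. If it isn't, run gs4_auth() instead and
# authenticate with the Google account that has access.
library(googlesheets4)
gs4_deauth()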

Here I am reading in the googlesheet data, making the names consistent with clean_names(), dropping the timestamp variable, adding an id variable and moving it to the front of the dataframe.

study <- read_sheet("https://docs.google.com/spreadsheets/d/1wI7BuEF3PfKgqwSImlaAWLqFEEnyXFB0Cy97ORG85Vo") %>%
  clean_names() %>%
  select(-timestamp) %>%
  mutate(id = row_number()) %>%
  relocate(id, .before = "open_response")

names(study)
##  [1] "id"                       "open_response"           
##  [3] "rereading_notes_text"     "practice_problems"       
##  [5] "flashcards"               "rewrite_notes"           
##  [7] "study_with_friends"       "memorise"                
##  [9] "mnemonics"                "review_sheets"           
## [11] "practise_recall"          "highlight_notes_text"    
## [13] "think_real_life_examples"

Open responses

Here I am selecting just the id column and the open response variable that contains the text.

text <- study %>%
  select(id, open_response)

The unnest_tokens function from the tidytext package takes text responses (in this case the text$open_response column) and unnests them, making a new column (here called word) with one word in each row.

tokens <- text %>%
  unnest_tokens(word, open_response)

head(tokens)
## # A tibble: 6 x 2
##      id word     
##   <int> <chr>    
## 1     1 most     
## 2     1 effective
## 3     1 past     
## 4     1 papers   
## 5     1 and      
## 6     1 revising

Then you can use count() to get the most frequent words.

tokens %>%
  count(word, sort = TRUE)
## # A tibble: 975 x 2
##    word      n
##    <chr> <int>
##  1 i       184
##  2 to      158
##  3 the     152
##  4 and     151
##  5 a       100
##  6 for      78
##  7 of       70
##  8 my       63
##  9 in       62
## 10 would    60
## # … with 965 more rows

Not surprisingly, the most frequent words are little function words like “i”, “to” and “the”. These are not at all interesting, so I want to get rid of them using the “stop words” that come with the package.

There is a list of “stop words” in the tidytext package that you can use to remove all the tiny uninteresting words from your tokens. The anti_join() function from dplyr returns all rows from x (i.e. tokens) where there are not matching values in y (stop_words), keeping just columns from the tokens (i.e. only keep the items in your tokens that are NOT in the stop_word list).
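A tiny toy example of that anti_join() behaviour (these words and ids are made up purely for illustration):

# x has three tokens; y plays the role of the stop word list
x <- tibble(id = 1:3, word = c("recall", "the", "and"))
y <- tibble(word = c("the", "and"))

# returns only the row with word = "recall", because "the" and "and"
# appear in y and are dropped
anti_join(x, y, by = "word")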

data(stop_words) #gets the stop words from tidytext into your env

#remove stop words
clean_tokens <- tokens %>% 
 anti_join(stop_words)
## Joining, by = "word"

I also want to remove numbers from my tokens. Here I am using str_detect() with the regex "^[0-9]" to find tokens that start with a digit, and creating a dataframe of the numbers that exist in the token set, calling it nums.

nums <- clean_tokens %>% 
  filter(str_detect(word, "^[0-9]")) %>% 
  select(word) %>% 
  unique() 


nums
## # A tibble: 7 x 1
##   word 
##   <chr>
## 1 25   
## 2 5    
## 3 3    
## 4 1    
## 5 2    
## 6 4    
## 7 10

Then I use anti_join() again to keep only the tokens that are NOT in the nums dataframe.

clean_tokens <- clean_tokens %>% 
  anti_join(nums, by = "word")

Now that my tokens are clean I can count how many times each token appears and sort them by most frequent.

count_tokens <- clean_tokens %>%
  count(word, sort = TRUE) %>%
  rename(freq = n)

count_tokens %>%
  head(10)
## # A tibble: 10 x 2
##    word       freq
##    <chr>     <int>
##  1 notes        56
##  2 study        33
##  3 time         33
##  4 papers       31
##  5 past         27
##  6 practice     27
##  7 questions    27
##  8 subjects     23
##  9 effective    21
## 10 studying     19

how to make a wordcloud

The wordcloud2 package makes nice wordclouds. Here I am setting a seed so that it will produce exactly the same wordcloud every time (aka REPRODUCIBILITY).

library(wordcloud2)

set.seed(123)

wordcloud2(data = count_tokens, color = "random-light", backgroundColor = "black") 

When asked about strategies that they use to study, they mention NOTES a lot.

Ratings

I also asked students to rate how frequently they used several strategies (rereading, highlighting, flashcards etc). I want to plot how many students say they use each strategy “all the time”.

Here I am selecting just the id column and the ratings variables.

ratings <- study %>%
  select(id, 3:13)

names(ratings)
##  [1] "id"                       "rereading_notes_text"    
##  [3] "practice_problems"        "flashcards"              
##  [5] "rewrite_notes"            "study_with_friends"      
##  [7] "memorise"                 "mnemonics"               
##  [9] "review_sheets"            "practise_recall"         
## [11] "highlight_notes_text"     "think_real_life_examples"

Here I am pulling the data into long format using pivot_longer() and adding a rating_value column that recodes the categorical responses into numeric values.

ratings_long <- ratings %>%
  pivot_longer(names_to = "strategy", values_to = "rating", rereading_notes_text:think_real_life_examples) 


ratings_long <- ratings_long %>%
  mutate(rating_value = case_when(rating == "Never use" ~ 0, 
                                  rating == "Use sometimes" ~ 1, 
                                  rating == "Use frequently" ~ 2, 
                                  rating == "Use all the time" ~ 3)) 

Here the tabyl() function from the janitor package is pretty helpful for counting how many students give each strategy each rating. Then I add new columns for the total and for the percent of students rating it “use all the time” (= 3).

counts <- ratings_long %>%
 tabyl(strategy, rating_value) %>%
  rename(rate0 = "0", rate1 = "1", rate2 = "2", rate3 = "3") %>%
  mutate(total = rate0 + rate1 + rate2 + rate3) %>%
  mutate(percent3 = (rate3/total)*100)

Plot the percent of students reporting that they use each strategy all the time. The reorder() function was new to me. It's pretty cool how you can tell ggplot to reorder strategy by percent3 so that the bars are ordered! I also tried out ggannotate again. It is so EASY. You use the addin menu, click where you want your annotation to go on your plot, type what you want it to say, and then copy the geom_text() back into your chunk. So fun!! (see more detailed instructions here)

counts %>%
  ggplot(aes(x = reorder(strategy, percent3), y = percent3, fill = strategy)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(limits = c(0,100), expand = c(0,0)) +
  labs(y = "Percent of students reporting", x = "Strategy") +
  theme_classic() +
  easy_remove_legend() +
  geom_text(data = data.frame(x = 6, y = 56.5500528595157, label = "Lots of students report \n using practice problems and \n recall all the time"),
            mapping = aes(x = x, y = y, label = label),
            inherit.aes = FALSE)

Interesting… most of the students report that they use practice problems and recall all the time when you give them a list of strategies and ask how frequently they use each one. Practice and questions come up big in the word cloud, but not many students spontaneously mention recall or testing. Maybe they are using practice questions as a way of checking what they know or becoming familiar with the kinds of questions that will be asked, rather than knowing about the benefits of retrieval practice for learning.

count_tokens %>%
  filter(word %in% c("practice", "questions", "recall", "test", "testing"))
## # A tibble: 5 x 2
##   word       freq
##   <chr>     <int>
## 1 practice     27
## 2 questions    27
## 3 test          9
## 4 recall        6
## 5 testing       5

Successes/Challenges

I feel like I am getting the hang of the tidytext package. Now I know how to use it to get total word counts and to count the frequency of individual words. I tried (and failed) to use the wordcloud package to begin with and couldn't work out why it wasn't working, but some googling revealed that the wordcloud2 package makes prettier clouds AND is easier to use. I also tried the ggannotate package again and it is super easy. I love how relearning something is SO much easier than learning it the first time, particularly if you write notes to yourself.

Next steps

Now that I have total word count and word frequency down, maybe sentiment analysis is my next tidytext challenge. More critically though, I need to work out how to get my wordcloud out of R. I want it as a .png that I can add to my ppt slides but there doesn’t seem to be an obvious way to export it. More googling required….
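One candidate approach I want to try (just a sketch, assuming the htmlwidgets and webshot packages are installed; wordcloud2 returns an htmlwidget, so the idea is to save it as html and then screenshot that):

library(htmlwidgets)
library(webshot)
# webshot needs PhantomJS installed once: webshot::install_phantomjs()

wc <- wordcloud2(data = count_tokens, color = "random-light", backgroundColor = "black")

# save the widget as a standalone html file, then screenshot it to png;
# the delay gives the javascript time to draw all the words
saveWidget(wc, "wordcloud.html", selfcontained = FALSE)
webshot("wordcloud.html", "wordcloud.png", vwidth = 800, vheight = 600, delay = 5)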

P.S.

I was looking for new RMarkdown themes to try and discovered this Rmd Gallery- so many to choose from! I settled on the hpstr theme.

More info re prettydoc here
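For reference, a minimal YAML header for trying the hpstr theme (assuming the prettydoc package, which is where this theme lives, is installed):

---
title: "word clouds"
output:
  prettydoc::html_pretty:
    theme: hpstr
---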