Goals

This week I needed to work out a fast way of marking question responses that my 2nd year class are submitting. To get credit for completing a task they need to answer 2 questions about it and write 100 words for each response. Unlike Word, Excel doesn’t really have a word count function, so my goal this week was to work out whether the tidytext package (which is generally used for text mining tasks like sentiment analysis) can help me get a quick word count on these responses.

load packages

library(tidyverse)
library(tidytext)
library(here)

read data

Use read_csv() and here() to read in the example data. Check the names of the variables.

goals <- read_csv(here("1_2061_2021", "data", "example.csv")) 
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   id = col_double(),
##   what_are_the_3_reasons_you_want_to_achieve_this_goal_write_about_100_words = col_character(),
##   what_are_3_obstacles_that_might_get_in_the_way_write_about_100_words = col_character()
## )

Rename variables to make them easier to use. I never remember which way around these arguments go (is it newname = oldname, or the more intuitive oldname = newname?) sigh…. For a while now I’ve been remembering it as “they are alphabetical” (aka l, m, N, O, new then old), which is a pretty dumb way to remember, so I am starting to write it out in my code every time so hopefully it will stick.

remember Jenny, when using rename, you need to list newname = oldname

goals <- goals %>% 
  rename(reasons = what_are_the_3_reasons_you_want_to_achieve_this_goal_write_about_100_words,
         obstacles = what_are_3_obstacles_that_might_get_in_the_way_write_about_100_words) 

#newname=oldname

names(goals)
## [1] "id"        "reasons"   "obstacles"
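A tiny sketch of the rename() argument order on made-up data (the tibble and its column names here are hypothetical, not from the class data). One way to think about it, which may or may not stick better than the alphabet: rename() reads like mutate() or an assignment, where the thing being created goes on the left.

```r
library(dplyr)

# Hypothetical toy data with one unwieldy column name
toy <- tibble(
  id = 1:2,
  a_very_long_question_name = c("first answer", "second answer")
)

# newname = oldname: the name being created sits on the left, as in mutate()
renamed <- toy %>%
  rename(answer = a_very_long_question_name)

names(renamed)
## [1] "id"     "answer"
```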

I ran into problems trying to unnest the tokens for reasons and obstacles at the same time. unnest_tokens() makes each word appear as a separate row, and no student has the same number of words for each answer, so tidytext wouldn’t do it for both columns at once. It’s a bit hacky, but I worked out that it works better if I make a separate df for the reasons responses and the obstacles responses. Then the unnest_tokens() function from tidytext makes each word in the response have its own row.

Surely there is a way to do this within a single dataframe? This could be a problem that lists and the purrr package could help with.

tidy_reasons <- goals %>%
  select(1, 2) %>%
  unnest_tokens(reasons_word, reasons)

tidy_obstacles <- goals %>%
  select(1, 3) %>%
  unnest_tokens(ob_word, obstacles)

#checking that worked by looking at the first few rows

tidy_reasons %>% head(5)
## # A tibble: 5 x 2
##      id reasons_word
##   <dbl> <chr>       
## 1     1 the         
## 2     1 first       
## 3     1 reason      
## 4     1 i           
## 5     1 want
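On the “single dataframe” question: one possible approach (a sketch on made-up data, not something I have run on the real class responses) is to pivot the two response columns long first, so there is only one text column to tokenise, and then count words by student and question. The toy tibble and the response, text and words column names are assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical toy data shaped like the renamed goals data
toy <- tibble(
  id = c(1, 2),
  reasons = c("I want to learn R", "More sleep"),
  obstacles = c("Not enough time", "Netflix is very tempting")
)

counts <- toy %>%
  # one row per student per question, with all text in a single column
  pivot_longer(cols = c(reasons, obstacles),
               names_to = "response", values_to = "text") %>%
  # one row per word
  unnest_tokens(word, text) %>%
  # rows per student per question = word count
  count(id, response, name = "words")

counts
```

One thing to watch: a student who left a response blank would have no rows after unnest_tokens(), so they would drop out of the counts rather than show up with 0 words.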

get word counts

Now that each word has its own row, it is just a matter of grouping by student id and counting rows (aka words) for each student. Rename the n column (to make the reasons and obstacles word count variables different so they will join back together).

remember Jenny, when using rename, you need to list newname = oldname

count_reasons <- tidy_reasons %>%
  group_by(id) %>%
  count() %>%
  rename(n_reasons = n)  # newname = oldname


count_obstacles <- tidy_obstacles %>%
  group_by(id) %>%
  count()  %>%
  rename(n_obstacles = n)  # newname = oldname

#checking that worked by looking at the first few rows

count_reasons %>%
  head(5)
## # A tibble: 5 x 2
## # Groups:   id [5]
##      id n_reasons
##   <dbl>     <int>
## 1     1       145
## 2     2        34
## 3     3       111
## 4     4       203
## 5     5       100
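A possible shortcut for the group_by() + count() + rename() combination: count() accepts the grouping variable directly and has a name argument for the count column, which also sidesteps the newname = oldname question. This is a sketch on made-up data; the toy tibble and the n_words name are hypothetical.

```r
library(dplyr)

# Hypothetical tidy data: one row per word, as unnest_tokens() would produce
tidy_toy <- tibble(
  id = c(1, 1, 1, 2, 2),
  word = c("i", "want", "sleep", "more", "coffee")
)

# Count rows per id and name the count column in one step
count_toy <- tidy_toy %>%
  count(id, name = "n_words")

count_toy
```

Unlike group_by() followed by count(), the result here is not grouped, so there is no “# Groups:” line to carry through later steps.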

join responses back together

OK, now I needed to join word counts for reasons and obstacles into one dataframe. I always pick these join functions at random and just play around with them until they do what I want, without REALLY understanding what the difference between left_join, right_join, inner_join etc actually is.

I was lucky today, right_join works, but I really should spend some time watching Dani’s join videos and making sense of the difference.

r_o <- right_join(count_reasons, count_obstacles) 
## Joining, by = "id"
# check that worked by looking at the first few rows

r_o %>% head(5)
## # A tibble: 5 x 3
## # Groups:   id [5]
##      id n_reasons n_obstacles
##   <dbl>     <int>       <int>
## 1     1       145         158
## 2     2        34          26
## 3     3       111         121
## 4     4       203         211
## 5     5       100         103
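A toy sketch of how the main dplyr joins differ, using two made-up tables where the ids only partly overlap (the tibbles a and b are hypothetical). The difference is only in which ids are kept; if every id appears in both tables, as is presumably the case when every student answered both questions, all four give the same rows, which may be why right_join() worked here.

```r
library(dplyr)

# Hypothetical word counts: student 1 only answered reasons,
# student 4 only answered obstacles
a <- tibble(id = c(1, 2, 3), n_reasons = c(145, 34, 111))
b <- tibble(id = c(2, 3, 4), n_obstacles = c(26, 121, 211))

left  <- left_join(a, b, by = "id")   # keeps every id in a: 1, 2, 3
right <- right_join(a, b, by = "id")  # keeps every id in b: 2, 3, 4
inner <- inner_join(a, b, by = "id")  # keeps ids in both:   2, 3
full  <- full_join(a, b, by = "id")   # keeps ids in either: 1, 2, 3, 4

# Unmatched ids get NA in the columns from the other table
full
```

For marking, full_join() might be the safer default, since a student who skipped one question would still appear, with an NA for the missing count.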

pivot data long

Make the data long so it will work with ggplot and make the “response” a factor so it will order correctly in the plot. Note the default order is alphabetical, so if you don’t specify, it will list obstacles first, then reasons in the facet.

r_o_long <- r_o %>%
  pivot_longer(cols = n_reasons:n_obstacles,
               names_to = "response", 
               values_to = "words")

r_o_long %>% head(5)
## # A tibble: 5 x 3
## # Groups:   id [3]
##      id response    words
##   <dbl> <chr>       <int>
## 1     1 n_reasons     145
## 2     1 n_obstacles   158
## 3     2 n_reasons      34
## 4     2 n_obstacles    26
## 5     3 n_reasons     111
r_o_long$response <- fct_relevel(r_o_long$response, c("n_reasons", "n_obstacles"))
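To see the alphabetical default and what fct_relevel() changes, here is a minimal sketch on a made-up character vector (the response vector is hypothetical, standing in for the response column):

```r
library(forcats)

# Hypothetical stand-in for the response column
response <- c("n_reasons", "n_obstacles", "n_reasons")

# Default factor levels are sorted alphabetically
levels(factor(response))
## [1] "n_obstacles" "n_reasons"

# fct_relevel() moves the named levels to the front, in the order given
relevelled <- fct_relevel(factor(response), "n_reasons", "n_obstacles")
levels(relevelled)
## [1] "n_reasons"   "n_obstacles"
```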

calculate mean word count

Work out how many words students wrote on average for each question. Hmmm less than 100 words 😢.

r_o_long %>% 
  group_by(response) %>%
  summarise(meanwords = mean(words))
## # A tibble: 2 x 2
##   response    meanwords
## * <fct>           <dbl>
## 1 n_reasons        90.3
## 2 n_obstacles      85.6

plot word count distribution

Distribution of word counts separately for reasons and obstacles responses

r_o_long %>%
  ggplot(aes(x = words, fill = response)) +
  geom_histogram() +
  facet_wrap(~response) +
  theme_grey() +
  geom_vline(xintercept = 100, linetype="dashed", 
                color = "blue", size=0.5) +
  ggeasy::easy_remove_legend() +
  labs(title = "Distribution of word count", subtitle = "Mean words in reasons = 90. Mean words in obstacles = 86", x = "Word count", y = "Number of students (N = 100)") +
  annotate("text", x = 180, y = 12, label = "target word \n count = 100") 

Challenges/Successes

Tidytext is pretty fun and a “relatively” easy way to get word count. AGAIN I ran into trouble remembering the order of the arguments when renaming things. Trying to be more intentional about documenting that it is newname = oldname, so maybe one day I won’t have to run through the alphabet to work it out!

It seems a bit clunky to have to work with different dataframes for different questions. This is probably the kind of thing that iteration could help with but I am a bit scared of lists and I still don’t really get the purrr package.

I got lucky with my join today. I picked right_join() at random and it did what I wanted but… I really need to spend some time working out what the difference between the join functions in dplyr is so that it’s not a crap shoot every time.

Next steps

  • tidytext is really designed for fancy things like sentiment analysis, maybe I’ll play with that
  • probably structuring the data as a list and iterating the unnest_tokens() across items in the list would be easier if there were more than a few questions… maybe I’ll work that out
  • watch Dani’s videos about joins and see if I can do better than picking at random!
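For the iteration idea in the second bullet, a rough sketch of what it might look like with purrr, on made-up data. This is an assumption about how it could be structured, not something tested on the real responses; the toy tibble, the questions vector and the count_words() helper are all hypothetical names.

```r
library(dplyr)
library(purrr)
library(tidytext)

# Hypothetical toy data shaped like the renamed goals data
toy <- tibble(
  id = c(1, 2),
  reasons = c("I want to learn R", "More sleep"),
  obstacles = c("Not enough time", "Netflix is very tempting")
)

# Hypothetical helper: word count per student for one question column
count_words <- function(question) {
  toy %>%
    select(id, text = all_of(question)) %>%
    unnest_tokens(word, text) %>%
    count(id, name = "words")
}

questions <- c("reasons", "obstacles")

counts <- questions %>%
  set_names() %>%              # name each element after itself
  map(count_words) %>%         # a named list with one tibble per question
  bind_rows(.id = "response")  # stack, keeping the question name as a column

counts
```

Adding a third question would then only mean adding its column name to questions.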
