This week I needed to work out a fast way of marking question responses that my 2nd year class are submitting. To get credit for completing a task they need to answer 2 questions about it and write about 100 words for each response. Unlike Word, Excel doesn't really have a word count function, so my goal this week was to work out whether the tidytext package (which is generally used for text mining tasks like sentiment analysis) can help me get a quick word count on these responses.
library(tidyverse)
library(tidytext)
library(here)
Use read_csv() and here() to read in the example data. Check the names of the variables.
goals <- read_csv(here("1_2061_2021", "data", "example.csv"))
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## id = col_double(),
## what_are_the_3_reasons_you_want_to_achieve_this_goal_write_about_100_words = col_character(),
## what_are_3_obstacles_that_might_get_in_the_way_write_about_100_words = col_character()
## )
Rename variables to make them easier to use. I never remember which way around these arguments go (is it newname = oldname, or the more intuitive oldname = newname?) sigh…. For a while now I've been remembering it as “they are alphabetical” (aka l, m, N, O, new then old), which is a pretty dumb way to remember, so I am starting to write it out in my code every time so hopefully it will stick.
remember Jenny, when using rename, you need to list newname = oldname
goals <- goals %>%
  rename(reasons = what_are_the_3_reasons_you_want_to_achieve_this_goal_write_about_100_words,
         obstacles = what_are_3_obstacles_that_might_get_in_the_way_write_about_100_words)
# newname = oldname
names(goals)
## [1] "id" "reasons" "obstacles"
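One way to make the newname = oldname direction harder to forget is to keep the mapping in a named vector, where the names are the new column names and the values are the old ones, and hand that to rename() via all_of(). This is only a sketch on a made-up two-row tibble; `toy`, `lookup` and `toy_renamed` are hypothetical names, not part of the real data.

```r
library(tidyverse)

# Toy stand-in for the survey export (made-up rows, same column names as above)
toy <- tibble(
  id = c(1, 2),
  what_are_the_3_reasons_you_want_to_achieve_this_goal_write_about_100_words = c("a", "b"),
  what_are_3_obstacles_that_might_get_in_the_way_write_about_100_words = c("c", "d")
)

# Named vector: names are the NEW column names, values are the OLD ones
lookup <- c(
  reasons = "what_are_the_3_reasons_you_want_to_achieve_this_goal_write_about_100_words",
  obstacles = "what_are_3_obstacles_that_might_get_in_the_way_write_about_100_words"
)

# rename() accepts the named vector through all_of(), still newname = oldname
toy_renamed <- toy %>%
  rename(all_of(lookup))

names(toy_renamed)
```

The lookup vector reads the same way as the rename() call, so it doubles as a reminder of the argument order.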
I ran into problems trying to unnest the tokens for reasons and obstacles at the same time. Unnesting makes each word appear as a separate row, and no student has the same number of words for each answer, so I couldn't get tidytext to do both columns in one go. It's a bit hacky, but I worked out that it works better if I make one df for the reasons responses and a separate one for the obstacles responses. Then the unnest_tokens() function from tidytext makes each word in the response have its own row.
Surely there is a way to do this within a single dataframe. Could this be a problem that lists and the purrr package could help with?
tidy_reasons <- goals %>%
  select(id, reasons) %>% # select by name rather than position
  unnest_tokens(reasons_word, reasons)
tidy_obstacles <- goals %>%
  select(id, obstacles) %>%
  unnest_tokens(ob_word, obstacles)
#checking that worked by looking at the first few rows
tidy_reasons %>% head(5)
## # A tibble: 5 x 2
## id reasons_word
## <dbl> <chr>
## 1 1 the
## 2 1 first
## 3 1 reason
## 4 1 i
## 5 1 want
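On the single-dataframe question: one possible route (a sketch, not necessarily the best one) is to make the data long first, so both answers sit in one text column, and only then unnest. The toy tibble below is made up for illustration, and `toy_goals`, `toy_counts`, `text` and `word` are hypothetical names.

```r
library(tidyverse)
library(tidytext)

# Made-up responses standing in for the real survey data
toy_goals <- tibble(
  id = c(1, 2),
  reasons = c("I want to be fit", "To learn R"),
  obstacles = c("Not enough time", "Too many other assignments this term")
)

toy_counts <- toy_goals %>%
  # stack both answers into one column, labelled by which question they came from
  pivot_longer(cols = c(reasons, obstacles),
               names_to = "response", values_to = "text") %>%
  # one row per word, keeping id and response alongside
  unnest_tokens(word, text) %>%
  # count rows (aka words) per student per question
  count(id, response, name = "words")

toy_counts
```

Because the pivot happens before the unnest, it no longer matters that the two answers have different numbers of words. The result is already in the long format that ggplot likes.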
Now that each word has its own row, it is just a matter of grouping by student id and counting rows (aka words) for each student. Rename the n column so that the reasons and obstacles word counts have different names and can be joined back together.
remember Jenny, when using rename, you need to list newname = oldname
count_reasons <- tidy_reasons %>%
  group_by(id) %>%
  count() %>%
  rename(n_reasons = n) # newname = oldname
count_obstacles <- tidy_obstacles %>%
  group_by(id) %>%
  count() %>%
  rename(n_obstacles = n) # newname = oldname
#checking that worked by looking at the first few rows
count_reasons %>%
head(5)
## # A tibble: 5 x 2
## # Groups: id [5]
## id n_reasons
## <dbl> <int>
## 1 1 145
## 2 2 34
## 3 3 111
## 4 4 203
## 5 5 100
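As a rough cross-check on these counts, a different technique (not tidytext) is to count runs of non-whitespace characters with stringr::str_count(), which skips the unnesting step entirely. This is a sketch on made-up data; `toy_goals` and `toy_quick` are hypothetical names, and the numbers may differ slightly from the tidytext counts because the two approaches can split things like punctuation differently.

```r
library(tidyverse)

# Made-up responses standing in for the real survey data
toy_goals <- tibble(
  id = c(1, 2),
  reasons = c("I want to be fit", "To learn R"),
  obstacles = c("Not enough time", "Too many other assignments this term")
)

toy_quick <- toy_goals %>%
  mutate(
    # "\\S+" matches each run of non-whitespace characters, i.e. roughly one word
    n_reasons = str_count(reasons, "\\S+"),
    n_obstacles = str_count(obstacles, "\\S+")
  )

toy_quick %>% select(id, n_reasons, n_obstacles)
```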
OK, now I needed to join word counts for reasons and obstacles into one dataframe. I always pick these join functions at random and just play around with them until they do what I want, without REALLY understanding what the difference between left_join, right_join, inner_join etc actually is.
I was lucky today, right_join works, but I really should spend some time watching Dani’s join videos and making sense of the difference.
r_o <- right_join(count_reasons, count_obstacles)
## Joining, by = "id"
# check that worked by looking at the first few rows
r_o %>% head(5)
## # A tibble: 5 x 3
## # Groups: id [5]
## id n_reasons n_obstacles
## <dbl> <int> <int>
## 1 1 145 158
## 2 2 34 26
## 3 3 111 121
## 4 4 203 211
## 5 5 100 103
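To see what the different joins actually do, here is a small sketch with two made-up tables that only partly overlap; `a` and `b` are hypothetical and not the real counts. When both tables contain exactly the same ids, all four joins return the same rows, which is presumably why right_join() worked here.

```r
library(tidyverse)

# Two made-up tables: id 1 is only in a, id 4 is only in b
a <- tibble(id = c(1, 2, 3), n_reasons = c(10, 20, 30))
b <- tibble(id = c(2, 3, 4), n_obstacles = c(5, 15, 25))

inner <- inner_join(a, b, by = "id") # only ids found in BOTH tables
left  <- left_join(a, b, by = "id")  # every id in a, NA where b has no match
right <- right_join(a, b, by = "id") # every id in b, NA where a has no match
full  <- full_join(a, b, by = "id")  # every id from either table

# number of rows returned by each join
c(inner = nrow(inner), left = nrow(left), right = nrow(right), full = nrow(full))
```

Spelling out by = "id" also silences the "Joining, by" message.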
Make the data long so it will work with ggplot and make the “response” a factor so it will order correctly in the plot. Note the default order is alphabetical, so if you don’t specify, it will list obstacles first, then reasons in the facet.
r_o_long <- r_o %>%
  pivot_longer(cols = n_reasons:n_obstacles,
               names_to = "response",
               values_to = "words")
r_o_long %>% head(5)
## # A tibble: 5 x 3
## # Groups: id [3]
## id response words
## <dbl> <chr> <int>
## 1 1 n_reasons 145
## 2 1 n_obstacles 158
## 3 2 n_reasons 34
## 4 2 n_obstacles 26
## 5 3 n_reasons 111
r_o_long$response <- fct_relevel(r_o_long$response, c("n_reasons", "n_obstacles"))
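A tiny sketch of the ordering issue, using a made-up character vector rather than the real data (`x`, `default_order` and `releveled` are hypothetical names):

```r
library(tidyverse)

x <- c("n_reasons", "n_obstacles", "n_reasons")

# factor() sorts the levels alphabetically by default
default_order <- factor(x)
levels(default_order)

# fct_relevel() moves the named levels to the front, in the order given
releveled <- fct_relevel(default_order, "n_reasons", "n_obstacles")
levels(releveled)
```

The facets in the plot follow the level order, so releveling is what puts reasons on the left.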
Work out how many words students wrote on average for each question. Hmmm less than 100 words 😢.
r_o_long %>%
group_by(response) %>%
summarise(meanwords = mean(words))
## # A tibble: 2 x 2
## response meanwords
## * <fct> <dbl>
## 1 n_reasons 90.3
## 2 n_obstacles 85.6
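Since the point of all this is marking, the same summarise() call can also count how many responses fall short. The sketch below assumes a hard cut-off of 100 words, which is stricter than the "about 100 words" in the question, and uses only the counts for the first three students shown above; `toy_long` and `toy_summary` are hypothetical names.

```r
library(tidyverse)

# Word counts for students 1 to 3, copied from the output shown earlier
toy_long <- tibble(
  id = c(1, 1, 2, 2, 3, 3),
  response = rep(c("n_reasons", "n_obstacles"), times = 3),
  words = c(145, 158, 34, 26, 111, 121)
)

toy_summary <- toy_long %>%
  group_by(response) %>%
  summarise(
    meanwords = mean(words),
    # assumption: anything under 100 words counts as below target
    n_below_target = sum(words < 100)
  )

toy_summary
```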
Distribution of word counts separately for reasons and obstacles responses
r_o_long %>%
  ggplot(aes(x = words, fill = response)) +
  geom_histogram() +
  facet_wrap(~response) +
  theme_grey() +
  # dashed line marking the 100-word target
  # (ggplot2 3.4.0+ prefers linewidth = 0.5 over size = 0.5 for lines)
  geom_vline(xintercept = 100, linetype = "dashed",
             color = "blue", size = 0.5) +
  ggeasy::easy_remove_legend() +
  labs(title = "Distribution of word count",
       subtitle = "Mean words in reasons = 90. Mean words in obstacles = 86",
       x = "Word count", y = "Number of students (N = 100)") +
  annotate("text", x = 180, y = 12, label = "target word \n count = 100")
Tidytext is pretty fun and a “relatively” easy way to get a word count. AGAIN I ran into trouble remembering the order of the arguments when renaming things. I'm trying to be more intentional about documenting that it is newname = oldname, so maybe one day I won’t have to run through the alphabet to work it out!
It seems a bit clunky to have to work with different dataframes for different questions. This is probably the kind of thing that iteration could help with but I am a bit scared of lists and I still don’t really get the purrr package.
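For what it's worth, the iteration version might look something like this: a sketch, assuming a small helper function and made-up data. `count_words()`, `toy_goals` and `toy_counts` are hypothetical names, not anything from tidytext or purrr.

```r
library(tidyverse)
library(tidytext)

# Made-up responses standing in for the real survey data
toy_goals <- tibble(
  id = c(1, 2),
  reasons = c("I want to be fit", "To learn R"),
  obstacles = c("Not enough time", "Too many other assignments this term")
)

# Hypothetical helper: word count per student for ONE question column
count_words <- function(data, col) {
  data %>%
    select(id, text = all_of(col)) %>%     # newname = oldname again
    unnest_tokens(word, text) %>%          # one row per word
    count(id, name = paste0("n_", col))    # e.g. n_reasons
}

toy_counts <- c("reasons", "obstacles") %>%
  map(~ count_words(toy_goals, .x)) %>%    # a list with one count table per question
  reduce(full_join, by = "id")             # join the list back into one dataframe

toy_counts
```

map() does the "repeat for each question" part and reduce() does the joining, so adding a third question would only mean adding its column name to the vector.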
I got lucky with my join today. I picked right_join() at random and it did what I wanted but… I really need to spend some time working out what the differences between the join functions in dplyr are so that it's not a crap shoot every time.