While writing a grant for BWCB, I wanted to improve some of our data visualization strategies to communicate certain demographics of our membership. Among these metrics is member area of expertise, which is a free-response section on the BWCB member application. I knew I would need to dive into a new tool, so here I show what I’ve learned about text mining in R and what I look forward to using in the future.
References:
https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html
https://books.psychstat.org/textmining/index.html#contents
https://cran.r-project.org/web/packages/ggraph/vignettes/Layouts.html
http://users.dimi.uniud.it/~massimo.franceschet/ns/syllabus/make/ggraph/ggraph.html
library(readr)
library(ggplot2)
library(tidyr)
library(tidytext)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following object is masked from 'package:tidyr':
##
## crossing
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
Here we use read_csv() to create a tibble from the data we read in. We are using a CSV file of self-identified areas of expertise from 233 members in BWCB.
Set your working directory appropriately
setwd("~/Downloads")
members.tib = read_csv("member-expertise.csv")
## New names:
## • `` -> `...1`
## Rows: 233 Columns: 2
## ── Column specification ────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Subject.area
## dbl (1): ...1
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
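That last message is purely informational; as it suggests, you can silence it on future reads (a minimal example):
# Optional: suppress the column specification message on read
members.tib = read_csv("member-expertise.csv", show_col_types = FALSE)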
head(members.tib)
## # A tibble: 6 × 2
## ...1 Subject.area
## <dbl> <chr>
## 1 1 Biology + Media Arts and Sciences
## 2 2 Physics (Computational Biophysics)
## 3 3 Bioinformatics and Computational Biology
## 4 4 Computer Science
## 5 5 BSc in Molecular and Cellular Biology
## 6 6 Biological sciences
We want to divide the responses into individual words with the unnest_tokens() function. This essentially creates a new column in your tibble called “word”, where each word is its own row and mapped to its member. This will increase the dimensions of your tibble, so that’s a good sanity check that the function was used properly.
subjects = unnest_tokens(members.tib, word, Subject.area)
dim(subjects)
## [1] 739 2
Next, we’ll use the count() function to get the frequencies of each word used.
word.freq = subjects %>% count(word, sort = T)
word.freq
## # A tibble: 211 × 2
## word n
## <chr> <int>
## 1 bioinformatics 69
## 2 biology 68
## 3 and 54
## 4 computational 31
## 5 science 28
## 6 molecular 22
## 7 genetics 21
## 8 biomedical 20
## 9 genomics 18
## 10 biochemistry 14
## # … with 201 more rows
Looking at the list of most frequent words, “bioinformatics”, “computational”, and “biology” are expected to be frequent; however, there are words like “and”, “of”, and “in” that are less informative to our analysis. We want to remove those, and there are tools and datasets to help remove these “stop words”.
From Psychstat:
Tidytext includes a dataset called stop_words which consists of words from three systems.
SMART: This stopword list was built by Gerard Salton and Chris Buckley for the SMART information retrieval system at Cornell University. It consists of 571 words.
snowball: The snowball list is from the string processing language snowball. It has 174 words.
onix: This stopword list is probably the most widely used stopword list. It is from the onix system. This wordlist contains 429 words.
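If you want to poke at that dataset yourself, a quick look (this snippet is my own addition; stop_words ships with tidytext):
# Tally the stop words by source lexicon
stop_words %>% count(lexicon)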
The resource I linked shows how you can interface with these lists to remove stop words in large-scale analyses, but I ended up removing the words manually since our dataset is smaller and there are only a few words to remove.
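For reference, a minimal sketch of that large-scale approach with anti_join(), which drops every word on all three lists and is therefore more aggressive than what we do below:
# Remove all stop words by anti-joining tokens against stop_words
word.freq.clean = subjects %>% anti_join(stop_words, by = "word") %>% count(word, sort = TRUE)
For our small dataset, though, filtering the handful of words directly is simpler: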
word.freq = subjects %>% count(word, sort = T) %>% filter(!word %in% c("and", "in", "of", "phd", "bsc"))
word.freq
## # A tibble: 206 × 2
## word n
## <chr> <int>
## 1 bioinformatics 69
## 2 biology 68
## 3 computational 31
## 4 science 28
## 5 molecular 22
## 6 genetics 21
## 7 biomedical 20
## 8 genomics 18
## 9 biochemistry 14
## 10 engineering 13
## # … with 196 more rows
Now we have a cleaner list!
With 1-gram analysis (a sequence of one word), the easiest thing to do is create a barplot of word frequencies. Word clouds are also easy to generate in R, if that’s of interest.
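If a word cloud appeals to you, a minimal sketch using the wordcloud package (an extra dependency not loaded above) might look like:
# Word cloud of the cleaned frequencies; word size scales with count
library(wordcloud)
wordcloud(words = word.freq$word, freq = word.freq$n, min.freq = 2, random.order = FALSE)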
Let’s create a sorted barplot of the top 25 terms with ggplot.
word.freq %>%
top_n(25) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) + geom_col() + xlab(NULL) + coord_flip()
## Selecting by n
1-grams are easy and fun to make, but they’re less informative about the broader fields and phrases we actually use to describe areas of expertise. “Computational Biology” is itself a bigram, and most descriptors of majors or programs are not just one word. Studying the bigrams in our dataset will also allow us to understand word associations.
Let’s start by looking at bigram frequencies, just as we did with the 1-gram analysis. We still use unnest_tokens(), but this time we specify the number of grams as 2.
subjects2 = unnest_tokens(members.tib, word, Subject.area, token = "ngrams", n=2)
head(subjects2)
## # A tibble: 6 × 2
## ...1 word
## <dbl> <chr>
## 1 1 biology media
## 2 1 media arts
## 3 1 arts and
## 4 1 and sciences
## 5 2 physics computational
## 6 2 computational biophysics
Looking at the data, now we see words like “computational biophysics”, but again, we have less informative terms such as “and sciences”. We need to get rid of rows with these unhelpful stop words. I used the grepl() function within a filter() pipe to achieve this with our same word list. I also go ahead and clean out rows that may have sneaky NAs.
subjects2 = subjects2 %>% filter(!is.na(word)) %>% filter(!(grepl("and", word) | grepl("in", word) |grepl("of", word) |grepl("phd", word) | grepl("bsc", word)))
dim(subjects2)
## [1] 248 2
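One caveat, worth flagging as my own observation: grepl() matches substrings, so the pattern "in" also removes bigrams containing words like "bioinformatics" or "engineering". A stricter variant that matches whole words only (not what was run above, so downstream counts would differ) could look like:
# Hypothetical stricter filter: match stop words only at word boundaries,
# so e.g. "bioinformatics" is not caught by the pattern for "in"
stops = c("and", "in", "of", "phd", "bsc")
boundary.pattern = paste0("\\b(", paste(stops, collapse = "|"), ")\\b")
subjects2.strict = subjects2 %>% filter(!grepl(boundary.pattern, word))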
Let’s visualize an unsorted table of what our new word1 and word2 phrases look like. First, we need to separate our bigrams into two words.
subjects2.sep = subjects2 %>% separate(word, c("word1", "word2"), sep = " ")
head(subjects2.sep)
## # A tibble: 6 × 3
## ...1 word1 word2
## <dbl> <chr> <chr>
## 1 1 biology media
## 2 1 media arts
## 3 2 physics computational
## 4 2 computational biophysics
## 5 3 computational biology
## 6 4 computer science
The most common stop word “and” at least appears to be gone.
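If you’d rather verify than eyeball it, a quick check (my own addition; expect 0):
# Count rows where "and" survives on either side of the bigram
subjects2.sep %>% filter(word1 == "and" | word2 == "and") %>% nrow()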
Next, let’s unite the words back into bigrams and tabulate their counts for our bar plot.
subjects2.count = subjects2.sep %>% unite(word, word1, word2, sep = " ") %>% count(word, sort = T)
Create a barplot of the top 10 bigrams
subjects2.count %>% top_n(10) %>% mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) + geom_col() + xlab(NULL) + coord_flip() +
  labs(title = "BWCB Areas of Expertise")
## Selecting by n
We can use quick and easy tools to visualize relationships between the words in our dataset. The code relies mostly on ggraph with igraph objects. We pass our “separated” version of the bigram frequencies so we can see the relationships between paired usage.
set.seed(20181005)
Now we create the igraph object:
word.network = subjects2.sep %>% count(word1, word2, sort = TRUE) %>%
  filter(n > 1) %>% graph_from_data_frame()
Learn more about interpreting this object: https://books.psychstat.org/textmining/word-frequency.html
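As a quick illustration of that structure, you can pull pieces of the object directly (a small sketch; these are standard igraph accessors):
# Vertices are the words; the edge attribute n carries each bigram's count
head(V(word.network)$name)
head(E(word.network)$n)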
Now we use ggraph to generate a network plot, a ggplot cousin. First, since this will be a directed graph, we need to construct what the arrows look like with the arrow() function.
a = arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")
Next we create our network plot.
⚠️ Each graph generated has a unique graph id, and each graph can look slightly different with each run and depending on the layout you choose. You may have to run it a few times to get an acceptable orientation.
I spent some time playing with parameters (e.g. legend position, label orientation), and the layout and hjust below worked the best. Feel free to adjust your titles or remove/edit the caption.
ggraph(word.network, layout = "fr") +
  geom_edge_link(aes(color = n, width = n), arrow = a) +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 0.4) +
  labs(title = "BWCB Areas of Expertise", caption = "Data from 12/24/2022") +
  theme(legend.position = "bottom", plot.margin = unit(c(1,1,1,1), "mm"))
## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
💡 Patterns that stand out are an enrichment in the association of “computational” with “biology”, “molecular” with “biology”, and “computer” with “science”. Our bar plot also shows that these three bigrams are the most frequent. Maybe this means that most BWCB members have expertise or are in programs with these titles!
For more information on higher order N-grams, I recommend visiting this helpful worksheet created by Zhiyong Zhang.
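As a teaser, moving up to trigrams only requires changing n in unnest_tokens(); a minimal sketch:
# Trigrams: same tokenizer as before, with n = 3
subjects3 = unnest_tokens(members.tib, word, Subject.area, token = "ngrams", n = 3)
head(subjects3)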