In this case study, we’ll “walk through” a basic research workflow, or data analysis process, modeled after the Data-Intensive Research Workflow from Learning Analytics Goes to School (Krumm et al., 2018):
Figure 2.2 Steps of Data-Intensive Research Workflow
Each walkthrough will focus on a basic analysis using text mining techniques that you'll be expected to reproduce and apply to a new research question in independent practice, using the provided dataset or a dataset of your own choosing.
We will focus on analysis of open-ended survey items from an evaluation of the North Carolina Department of Public Instruction (NCDPI) online professional development offered as part of the state’s Race to the Top efforts.
Our focus will be on getting our text “tidy” so we can perform some basic word counts, look at words that occur at a higher rate in a group of documents, and examine words that are unique to those document groups. Specifically, the Walkthrough will cover the following workflow topics:
Prior to analysis, it’s critical to understand the context and data sources available so you can formulate useful questions that can be feasibly addressed by your data. For this section, we’ll focus on the following topics:
North Carolina was one of 12 recipients of the 2010 federal Race to the Top (RttT) grants, bringing nearly $400 million to the state's public school system. Over the course of four years, NC's RttT coordinated a set of activities and policy reforms designed to collectively improve the performance of students, teachers, leaders, and schools.
The North Carolina Race to the Top (RttT) proposal (North Carolina Office of the Governor, 2010) specifies that the state’s Professional Development Initiative will focus on the “use of e-learning tools to meet the professional development needs of teachers, schools, and districts” (p. 191). It points to research demonstrating that “well-designed and -implemented online professional development programs are not only valued by teachers but also positively impact classroom practices and student learning.”
Data Source & Analysis
The evaluation used a wide range of data sources including interviews, document review, site analytics, and surveys, which we'll focus on for this walkthrough. Survey protocols were designed in cooperation with NCDPI to systematically collect information about local professional development, state-level supports, use of available RttT professional development resources, and organizational and classroom practices in the schools, which would serve as a baseline for assessing changes over the period of the North Carolina RttT initiatives.
Quantitative analyses focused primarily on descriptive analysis of item-level responses. In addition, quantitative data from these surveys were analyzed to examine patterns in responses by participants’ role, event type (e.g., module, webinar, resource), and region. Responses to open-ended survey items of the Online Resources Survey were manually coded by their relation to each Learning Forward professional development standard.
Note that the dataset we’ll be using for analysis in this walkthrough is exported as is from Qualtrics with personal identifiers, select demographics, metadata, and closed-ended responses removed.
Summary of Findings
Approximately half of the state’s educators completed at least one online module by the end of the 2011-12 school year. Overall, most participants agreed that the webinars and modules were relevant to their professional development needs, though some content was redundant with prior PD activities and not always content- or grade-specific, and some modules did not meet national standards. Most online modules were completed independently and not in Professional Learning Community groups.
A common theme from focus groups and open-ended survey responses was the convenience of online professional development. One teacher in a focus group stated, “I liked the format. And the way that it was given, it was at your own pace, which works well for our schedules…” Educators also frequently cited that the information and resources provided through the modules improved their understanding of the new standards and the teacher evaluation process. Webinar participants appreciated the useful, updated information presented through a combination of PowerPoint slides and video clips.
While the majority of educators indicated their satisfaction with these resources, the findings suggest that the use of these resources at both the state and local level was not wholly consistent with national standards for online professional development. Many LEAs likely needed additional guidance, training, support, technology tools, and/or content resources to ensure that local efforts contribute to the quality of the experiences for educators and that the vision for online professional development outlined in the state's RttT proposal is realized and can be sustained beyond RttT.
The State’s progress on designing and implementing online professional development was originally guided by the following (very) general evaluation questions:
For this walkthrough, we’ll use text mining to complement prior qualitative analyses conducted as part of the RttT Evaluation by examining responses to open-ended questions on the RttT Online PD Survey administered to over 15,000 NC educators.
Our (very) specific questions of interest for this walkthrough are:
Finally, one overarching question we’ll explore throughout this lab, and that Silge and Robinson (2018) identify as a central question to text mining and natural language processing, is:
How do we quantify what a document or collection of documents is about?
As highlighted in Chapter 6 of Data Science in Education Using R (DSIEUR), one of the first steps of every workflow should be to set up a "Project" within RStudio. This will be your "home" for any files and code used or created in Lab 1. Open RStudio and follow these steps from DSIEUR 6.6 to create a Project for Lab 1:
Create New File
Now that you have a Project to store .R scripts that you create as you work through this lab, let’s create our first .R script:
Finally, using your newly created R script, type the following code to load the packages that we’ll be needing for this walkthrough.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidytext)
I highly recommend that you manually type the code shared throughout this walkthrough, though for large blocks of text it may be easier to cut and paste.
In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018).
We'll use the:

- dplyr package to view, rename, select, slice, and filter our data in preparation for analysis.
- tidytext package to both "tidy" and tokenize our text in order to create a data frame to use for analysis.

The Reading Data section introduces the following functions for reading data into R and inspecting its contents:

- readr::read_csv() reads .csv files into R.
- base::print() views your data frame in the Console Pane.
- tibble::view() views your data frame in the Source Pane.
- tibble::glimpse() is like print(), but transposed so you can see all columns.
- utils::head() views the first 6 rows of your data.
- utils::tail() views the last 6 rows of your data.
- readr::write_csv() writes .csv files to your directory.

Remember, the name before the double colon indicates the package the function comes from. For example, read_csv() comes from the readr package.
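For instance, once the data file described below is in your data folder, the two calls below are equivalent; the explicit readr:: prefix simply makes the source package unambiguous (a quick illustration, not part of the original lab code):

# Both calls do the same thing; the prefix just spells out the package
read_csv("data/opd_survey.csv")
readr::read_csv("data/opd_survey.csv")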
To get started, we need to import, or “read”, our data into R. The function used to import your data will depend on the file format of the data you are trying to import.
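For example, data stored in other common formats would call for different reader functions. Here is a hedged sketch with hypothetical file names (neither file is part of this lab), kept as comments since the files don't exist in our project:

# Hypothetical examples of readers for other formats (not used in this lab):
# library(readxl)
# readxl::read_excel("data/example_survey.xlsx")  # Excel workbooks
# library(haven)
# haven::read_sav("data/example_survey.sav")      # SPSS data files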
Download the opd_survey.csv file we'll be using for this lab from our GitHub repository. Now let's read our data into our Environment and assign it to a variable name so we can work with it like any other object in R.
opd_survey <- read_csv("data/opd_survey.csv")
## New names:
## • `Q16` -> `Q16...5`
## • `Resource` -> `Resource...6`
## • `Resource_10_TEXT` -> `Resource_10_TEXT...9`
## • `Resource` -> `Resource...10`
## • `Resource_10_TEXT` -> `Resource_10_TEXT...11`
## • `Q16` -> `Q16...12`
## Rows: 57054 Columns: 19
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ","
## chr (19): RecordedDate, ResponseId, Role, Q14, Q16...5, Resource...6, Resour...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Notice that read_csv() dealt with the issues of
duplicate column names for us!!
If you happen to run into issues with data import, RStudio also has an Import Dataset feature for a point-and-click approach to adding data to your environment.
Once your data is in R, there are many different ways you can view it. Give each of the following a try:
# enter the name of your data frame and view directly in console
opd_survey
# view your data frame transposed so you can see every column and the first few entries
glimpse(opd_survey)
# look at just the first six entries
head(opd_survey)
# or the last six entries
tail(opd_survey)
# view the names of your variables or columns
names(opd_survey)
# or view in source pane
view(opd_survey)
In addition to reading data from your project folder, you can also
write data back to a folder. The readr package has an
intuitively named write_csv() function for doing just
that.
Use the following code to create a copy of the opd_survey.csv file in your data folder from the opd_survey data frame you created:
write_csv(opd_survey, "data/opd_survey_copy.csv")
Note that the first argument is the data frame you created earlier and the second argument is the file name you plan to give it, including (if necessary) the file path for where it should go.
Throughout this walkthrough, you will be asked to respond to questions or short tasks to check your comprehension of the content covered. For section 2a. Read, View, and Write Data, please respond to these questions by commenting out a line or lines in your R script like so:
# 1. What argument would you add to `read_csv()` if your file did not have column names or headers?
# I would need to add the ____ argument and set it to equal ____ to prevent R from setting the first row as column names.
1. What argument would you add to read_csv() if your file did not have column names or headers? You can type ?read_csv to get help on this function or check the handy cheatsheet for the readr package on the readr website at https://readr.tidyverse.org/index.html.
2. What does read_csv() always expect as its first argument, and what happens if you don't include it in quotes?
3. What is unique about view() compared to the other functions for viewing your data?
4. What would happen if you ran write_csv(opd_survey, "opd_survey_copy.csv") and just specified the file name instead of including the folder?

As you've probably already noticed from viewing our dataset, we clearly have more data than we need to answer our rather basic research question. For this part of our workflow we focus on the following functions from the dplyr package for wrangling our data:
dplyr functions:

- select() picks variables based on their names.
- slice() lets you select, remove, and duplicate rows.
- rename() changes the names of individual variables using new_name = old_name syntax.
- filter() picks cases, or rows, based on their values in a specified column.

stats functions:

- na.omit() is a handy little function from the stats package for removing rows with missing values, i.e. NA.

To begin, let's select() the Role, Resource (named Resource...6 in the data frame), and Q21 columns and store them as a new data frame, since those respectively pertain to educator role, the OPD resource they are evaluating, and, as illustrated by the second row, responses to the question, "What was the most beneficial/valuable aspect of this online resource?"
opd_selected <- select(opd_survey, Role, Resource...6, Q21)
Notice that, like the bulk of tidyverse functions, the first input select() expects is a data frame, followed by the columns you'd like to select.
Let’s take a look at our newly created data frame that should have dramatically fewer variables:
head(opd_selected)
## # A tibble: 6 × 3
## Role Resource...6 Q21
## <chr> <chr> <chr>
## 1 "What is your role within your school district or organiza… "Please ind… "Wha…
## 2 "{\"ImportId\":\"QID2\"}" "{\"ImportI… "{\"…
## 3 "Central Office Staff (e.g. Superintendents, Tech Director… "Summer Ins… <NA>
## 4 "Central Office Staff (e.g. Superintendents, Tech Director… "Online Lea… "Glo…
## 5 "School Support Staff (e.g. Counselors, Technology Facilit… "Online Lea… <NA>
## 6 "School Support Staff (e.g. Counselors, Technology Facilit… "Calendar" "com…
Notice that Q21 is not a terribly informative variable name. Let’s
now take our opd_selected data frame and use the
rename() function along with the = assignment
operator to change the name from Q21 to “text” and save it as
opd_renamed.
This naming is somewhat intentional: not only is it the text we are interested in analyzing, but it also mirrors the naming conventions in the Text Mining with R book (https://www.tidytextmining.com/tidytext.html) and will make it easier to follow the examples there.
opd_renamed <- rename(opd_selected, text = Q21)
Now let's deal with the legacy rows that Qualtrics outputs by default: the export effectively contains three sets of headers, the first of which read_csv() already used for column names, leaving two extra header rows at the top of our data. We will use the slice() function, which is basically the same as the select() function but for rows instead of columns, to carve out those two rows.
opd_sliced <- slice(opd_renamed, -1, -2) # the - sign indicates to NOT keep rows 1 and 2
head(opd_sliced)
## # A tibble: 6 × 3
## Role Resource...6 text
## <chr> <chr> <chr>
## 1 Central Office Staff (e.g. Superintendents, Tech Director,… Summer Inst… <NA>
## 2 Central Office Staff (e.g. Superintendents, Tech Director,… Online Lear… Glob…
## 3 School Support Staff (e.g. Counselors, Technology Facilita… Online Lear… <NA>
## 4 School Support Staff (e.g. Counselors, Technology Facilita… Calendar comm…
## 5 Teacher Live Webinar leve…
## 6 Teacher Online Lear… None…
Now let’s take our opd_sliced and remove any rows that
are missing data, as indicated by an NA.
opd_complete <- na.omit(opd_sliced)
Finally, since we are only interested in the feedback from teachers,
let’s also filter our dataset for only participants who indicated their
Role as “Teacher”.
opd_teacher <- filter(opd_complete, Role == "Teacher")
head(opd_teacher)
## # A tibble: 6 × 3
## Role Resource...6 text
## <chr> <chr> <chr>
## 1 Teacher Live Webinar "lev…
## 2 Teacher Online Learning Module (e.g. Call for Change, Understanding the… "Non…
## 3 Teacher Online Learning Module (e.g. Call for Change, Understanding the… "In …
## 4 Teacher Online Learning Module (e.g. Call for Change, Understanding the… "Und…
## 5 Teacher Online Learning Module (e.g. Call for Change, Understanding the… "ove…
## 6 Teacher Online Learning Module (e.g. Call for Change, Understanding the… "onl…
That was a lot of code we just wrote to end up with
opd_teacher. Let’s review:
opd_selected <- select(opd_survey, Role, Resource...6, Q21)
opd_renamed <- rename(opd_selected, text = Q21)
opd_sliced <- slice(opd_renamed, -1, -2)
opd_complete <- na.omit(opd_sliced)
opd_teacher <- filter(opd_complete, Role == "Teacher")
Note that we could have reused opd_teacher and simply
overwritten it each time to prevent creating new objects:
opd_teacher <- select(opd_survey, Role, Resource...6, Q21)
opd_teacher <- rename(opd_teacher, text = Q21)
opd_teacher <- slice(opd_teacher, -1, -2)
opd_teacher <- na.omit(opd_teacher)
opd_teacher <- filter(opd_teacher, Role == "Teacher")
Fortunately, we can use the pipe operator %>% introduced in Chapter 6 of Data Science in Education Using R (DSIEUR) to dramatically simplify these cleaning steps and reduce the amount of code written:
opd_teacher <- opd_survey %>%
select(Role, Resource...6, Q21) %>%
rename(text = Q21) %>%
slice(-1, -2) %>%
na.omit() %>%
filter(Role == "Teacher")
head(opd_teacher)
## # A tibble: 6 × 3
## Role Resource...6 text
## <chr> <chr> <chr>
## 1 Teacher Live Webinar "lev…
## 2 Teacher Online Learning Module (e.g. Call for Change, Understanding the… "Non…
## 3 Teacher Online Learning Module (e.g. Call for Change, Understanding the… "In …
## 4 Teacher Online Learning Module (e.g. Call for Change, Understanding the… "Und…
## 5 Teacher Online Learning Module (e.g. Call for Change, Understanding the… "ove…
## 6 Teacher Online Learning Module (e.g. Call for Change, Understanding the… "onl…
Our dataset is now ready to be tidied!!!
… opd_benefits for later use.

For this part of our workflow we focus on the following functions from the tidytext and dplyr packages respectively:

- unnest_tokens() splits a column into tokens.
- anti_join() returns all rows from x without a match in y.

Not surprisingly, the tidyverse set of packages, including packages like dplyr, adheres to "tidy" data principles (Wickham 2014). Tidy data has a specific structure:

- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
Why would this data be considered “untidy”?
Text data, by its very nature, is ESPECIALLY untidy. In Chapter 1 of Text Mining with R, Silge and Robinson define the tidy text format as
a table with one-token-per-row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. This one-token-per-row structure is in contrast to the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph. In the tidytext package, we provide functionality to tokenize by commonly used units of text like these and convert to a one-term-per-row format.
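As the passage notes, tokens don't have to be single words. As a hedged aside (not part of the original walkthrough), the token argument of unnest_tokens() is how you would tokenize our opd_teacher responses into bigrams instead:

# A sketch of tokenizing into bigrams (pairs of adjacent words) rather than single words
opd_bigrams <- opd_teacher %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

head(opd_bigrams)

For this walkthrough, though, we'll stick with single words, which is all our research questions require.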
In this section, our goal is to transform our opd_teacher data from this:
## # A tibble: 6 × 3
## Role Resource...6 text
## <chr> <chr> <chr>
## 1 Teacher Live Webinar "lev…
## 2 Teacher Online Learning Module (e.g. Call for Change, Understanding the… "Non…
## 3 Teacher Online Learning Module (e.g. Call for Change, Understanding the… "In …
## 4 Teacher Online Learning Module (e.g. Call for Change, Understanding the… "Und…
## 5 Teacher Online Learning Module (e.g. Call for Change, Understanding the… "ove…
## 6 Teacher Online Learning Module (e.g. Call for Change, Understanding the… "onl…
to this:
## # A tibble: 6 × 3
## Role Resource...6 word
## <chr> <chr> <chr>
## 1 Teacher Live Webinar leve…
## 2 Teacher Live Webinar ofqu…
## 3 Teacher Live Webinar and
## 4 Teacher Live Webinar revi…
## 5 Teacher Live Webinar bloo…
## 6 Teacher Online Learning Module (e.g. Call for Change, Understanding the… none
In order to tidy our text, we need to break the text into individual
tokens (a process called tokenization) and transform it to a tidy data
structure. To do this, we use tidytext’s incredibly powerful
unnest_tokens() function.
After all the work we did prepping our data, this is going to feel a little anticlimactic.
Let’s go ahead and tidy our text and save it as
opd_tidy:
opd_tidy <- unnest_tokens(opd_teacher, word, text)
head(opd_tidy)
## # A tibble: 6 × 3
## Role Resource...6 word
## <chr> <chr> <chr>
## 1 Teacher Live Webinar leve…
## 2 Teacher Live Webinar ofqu…
## 3 Teacher Live Webinar and
## 4 Teacher Live Webinar revi…
## 5 Teacher Live Webinar bloo…
## 6 Teacher Online Learning Module (e.g. Call for Change, Understanding the… none
Note that we also could have just added
unnest_tokens(word, text) to our previous piped chain of
functions like so:
opd_tidy <- opd_survey %>%
select(Role, Resource...6, Q21) %>%
rename(text = Q21) %>%
slice(-1, -2) %>%
na.omit() %>%
filter(Role == "Teacher") %>%
unnest_tokens(word, text)
head(opd_tidy)
## # A tibble: 6 × 3
## Role Resource...6 word
## <chr> <chr> <chr>
## 1 Teacher Live Webinar leve…
## 2 Teacher Live Webinar ofqu…
## 3 Teacher Live Webinar and
## 4 Teacher Live Webinar revi…
## 5 Teacher Live Webinar bloo…
## 6 Teacher Online Learning Module (e.g. Call for Change, Understanding the… none
There is A LOT to unpack with this function. First notice that unnest_tokens() expects a data frame as the first argument, followed by two column names. The first is an output column name that doesn't currently exist but will be created as the text is unnested into it (word, in this case). This is followed by the input column that the text comes from, which we uncreatively named text. Also notice:

- Other columns, such as Role and Resource, are retained.
- All punctuation has been removed.
- Words have been converted to lowercase (use the to_lower = FALSE argument to turn off this behavior).

One final step in tidying our text is to remove words that don't add
much value to our analysis (at least when using this approach) such as
“and”, “the”, “of”, “to” etc. The tidytext package contains
a stop_words dataset derived from three different lexicons
that we’ll use to remove rows that match words in this dataset.
Let’s take a look at these common stop words so we know what we’re
getting rid of from our opd_tidy dataset.
head(stop_words)
## # A tibble: 6 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
view(stop_words)
In order to remove these stop words, we will use a function called anti_join() that looks for matching values in a specified column from two datasets and returns rows from the original dataset that have no matches. For a good overview of the different dplyr joins, see https://medium.com/the-codehub/beginners-guide-to-using-joins-in-r-682fc9b1f119.
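If it helps, here is a tiny toy example (made-up words, not our survey data) showing what anti_join() keeps:

# Toy illustration: anti_join() keeps only rows of x with no match in y
x <- tibble(word = c("information", "the", "videos"))
y <- tibble(word = c("the", "of", "and"))
anti_join(x, y, by = "word")  # returns the rows for "information" and "videos"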
Let's remove rows from our opd_tidy data frame that contain matches in the word column with those in the stop_words dataset, and save the result as opd_clean since this is the last step in cleaning our data.
opd_clean <- anti_join(opd_tidy, stop_words)
## Joining, by = "word"
head(opd_clean)
## # A tibble: 6 × 3
## Role Resource...6 word
## <chr> <chr> <chr>
## 1 Teacher Live Webinar leve…
## 2 Teacher Live Webinar ofqu…
## 3 Teacher Live Webinar revi…
## 4 Teacher Live Webinar bloo…
## 5 Teacher Online Learning Module (e.g. Call for Change, Understanding the… modu…
## 6 Teacher Online Learning Module (e.g. Call for Change, Understanding the… teac…
- Could we have added the anti_join() function to our previous chain that uses the pipe operator? Give it a try and see what happens.
- What would have happened with anti_join() if we had named the output column from unnest_tokens() "tokens" instead? Hint: check the ?anti_join documentation.

As highlighted in both DSIEUR and Learning Analytics Goes to School, calculating summary statistics, data visualization, and feature engineering (the process of creating new variables from a dataset) are key parts of exploratory data analysis. One goal in this phase is to explore questions that drove the original analysis and to develop new questions and hypotheses to test in later stages. In Section 3, we will calculate some very basic summary statistics from our tidied text, explore key words of interest to gather additional context, and use data visualization to identify patterns and trends that may not be obvious from our tables and numerical summaries. Topics addressed in Section 3 include:
- … the grepl() function from base R, to search for key words among our data set.

Prior to making any data visualization, we revisit our overarching question guiding most of our efforts in this lab: "How do we quantify what a text is about?"
In this section, we introduce the following functions:

dplyr functions:

- count() lets you quickly count the unique values of one or more variables.
- group_by() takes a data frame and one or more variables to group by.
- mutate() adds new variables and preserves existing ones.
- left_join() adds columns from one dataset to another.

tidytext functions:

- bind_tf_idf() binds the term frequency and inverse document frequency of a tidy text dataset to the dataset.
As highlighted in Word Counts are Amazing, one simple but powerful approach to text analysis is counting the frequency with which words occur in a given collection of documents, or corpus.
Now that we have our original survey data in a tidy text format, we
can use the count() function from the dplyr
package to find the most common words used by teachers when asked, “What
was the most beneficial/valuable aspect of this online resource?”
opd_counts <- count(opd_clean, word, sort = TRUE)
# alternatively, we could have used the %>% operator to yield the same result.
opd_counts <- opd_clean %>%
count(word, sort = TRUE)
opd_counts
## # A tibble: 5,352 × 2
## word n
## <chr> <int>
## 1 information 1885
## 2 learning 1520
## 3 videos 1385
## 4 resources 1286
## 5 online 1139
## 6 examples 1105
## 7 understanding 1092
## 8 time 1082
## 9 students 1013
## 10 data 971
## # … with 5,342 more rows
Going back to findings from the original report, a strategy as simple as basic word counts resulted in key words consistent with findings from the qualitative analysis of focus-group transcripts and open-ended survey responses:
Educators frequently cited that the information and resources provided through the modules improved their understanding of the new standards and the teacher evaluation process.
See also this finding around video clips:
Webinar participants appreciated the useful, updated information presented through a combination of PowerPoint slides and video clips.
One notable distinction between word counts and more traditional qualitative analysis is that broader themes like "convenience" often are not immediately apparent in word counts, but rather emerge from responses containing words like "pace", "format", "online", "ease", and "access".
A common theme from focus groups and open-ended survey responses was the convenience of online professional development. One teacher in a focus group stated, “I liked the format. And the way that it was given, it was at your own pace, which works well for our schedules…”
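As a quick, hedged check that isn't part of the original analysis, we can look up how often those convenience-related words show up in our opd_counts data frame:

# How often do the "convenience"-related words appear in teacher responses?
opd_counts %>%
  filter(word %in% c("pace", "format", "online", "ease", "access"))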
The count() function can also be used with more than one column to count the frequency with which a word occurs for a given Resource in our dataset.
opd_resource_counts <- opd_clean %>%
count(Resource...6, word, sort = TRUE)
view(opd_resource_counts)
In this case, we see that "information" was the most common word for Online Learning Modules but did not even make the top 5 for Recorded Webinars.
One common approach to facilitate comparison across documents or groups of text, in our case responses by Online Resource type, is by looking at the frequency that each word occurs among all words for that document group. This also helps to better gauge how prominent the same word is across different groups.
For example, let's create counts for each Resource and word pairing, and then use the mutate() function to create a new column that calculates the proportion each word makes up of all words for that Resource.

To do this a little more efficiently, I'm going to use the %>% operator:
opd_frequencies <- opd_clean %>%
count(Resource...6, word, sort = TRUE) %>%
group_by(Resource...6) %>%
mutate(proportion = n / sum(n))
opd_frequencies
## # A tibble: 7,210 × 4
## # Groups: Resource...6 [10]
## Resource...6 word n proportion
## <chr> <chr> <int> <dbl>
## 1 Online Learning Module (e.g. Call for Change, Underst… info… 1782 0.0238
## 2 Online Learning Module (e.g. Call for Change, Underst… lear… 1445 0.0193
## 3 Online Learning Module (e.g. Call for Change, Underst… vide… 1336 0.0179
## 4 Online Learning Module (e.g. Call for Change, Underst… reso… 1209 0.0162
## 5 Online Learning Module (e.g. Call for Change, Underst… onli… 1082 0.0145
## 6 Online Learning Module (e.g. Call for Change, Underst… unde… 1053 0.0141
## 7 Online Learning Module (e.g. Call for Change, Underst… time 1036 0.0139
## 8 Online Learning Module (e.g. Call for Change, Underst… exam… 1025 0.0137
## 9 Online Learning Module (e.g. Call for Change, Underst… stud… 951 0.0127
## 10 Online Learning Module (e.g. Call for Change, Underst… data 915 0.0122
## # … with 7,200 more rows
Using the view() function we can see that “information”
makes up about 2.3% of words in responses about the Online Modules, and
about 1.7% for Recorded Webinars.
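Rather than scanning in the viewer, a hedged alternative is to filter the proportions for a single word directly:

# Pull the proportion of "information" for each Resource type
opd_frequencies %>%
  filter(word == "information")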
Term frequency-inverse document frequency (tf-idf) takes this approach one step further.
As noted in Tidy Text Mining with R:
The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.
Silge and Robinson note that, “The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of document… That is, tf-idf attempts to find the words that are important (i.e., common) in a text, but not too common.”
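To make that weighting concrete, here is a small, hedged illustration of the inverse document frequency part of the calculation, treating our 10 online resource types as the "documents":

# idf = natural log of (number of documents / number of documents containing the term)
log(10 / 10)  # a word used for all 10 resource types: idf = 0, so tf-idf = 0
log(10 / 2)   # a word used for only 2 resource types: idf is about 1.61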
The tidytext package has a function called bind_tf_idf() that takes a tidy text dataset as input, with one row per token (term), per document. One column (word here) contains the terms/tokens, one column contains the documents (Resource...6 in our case), and the last necessary column contains the counts, that is, how many times each document contains each term (n in this example).
Because tf-idf can account, through weighting, for "too common" words like "and" or "but", it is not necessary to remove stop words before calculating tf-idf. However, we will need to add a column for total words for each Resource type, which can be accomplished in a couple of steps.
First, let’s recycle our opd_teacher data frame and
calculate counts for each word again, but this time instead of word
counts for the total data set, we’ll calculate word counts for each
‘Resource’.
opd_words <- opd_teacher %>%
unnest_tokens(word, text) %>%
count(Resource...6, word, sort = TRUE)
head(opd_words)
## # A tibble: 6 × 3
## Resource...6 word n
## <chr> <chr> <int>
## 1 Online Learning Module (e.g. Call for Change, Understanding the S… the 13058
## 2 Online Learning Module (e.g. Call for Change, Understanding the S… to 7933
## 3 Online Learning Module (e.g. Call for Change, Understanding the S… of 6132
## 4 Online Learning Module (e.g. Call for Change, Understanding the S… and 5560
## 5 Online Learning Module (e.g. Call for Change, Understanding the S… i 3861
## 6 Online Learning Module (e.g. Call for Change, Understanding the S… it 3087
Next, let’s calculate the total words per Resource
type:
total_words <- opd_words %>%
group_by(Resource...6) %>%
summarise(total = sum(n))
total_words
## # A tibble: 10 × 2
## Resource...6 total
## <chr> <int>
## 1 Calendar 137
## 2 Document, please specify (i.e. Facilitator's Guide, Crosswalks, Sampl… 500
## 3 Live Webinar 316
## 4 Online Learning Module (e.g. Call for Change, Understanding the Stand… 181197
## 5 Other, please specify 3363
## 6 Promotional Video 149
## 7 Recorded Webinar or Presentation (e.g. Strategic Staffing, Standards … 1083
## 8 Summer Institute/RESA PowerPoint Presentations 883
## 9 Website, please specify 1860
## 10 Wiki 1039
Now let’s append the total column from
total_words to our opd_words data frame:
opd_totals <- left_join(opd_words, total_words)
## Joining, by = "Resource...6"
opd_totals
## # A tibble: 8,833 × 4
## Resource...6 word n total
## <chr> <chr> <int> <int>
## 1 Online Learning Module (e.g. Call for Change, Understandi… the 13058 181197
## 2 Online Learning Module (e.g. Call for Change, Understandi… to 7933 181197
## 3 Online Learning Module (e.g. Call for Change, Understandi… of 6132 181197
## 4 Online Learning Module (e.g. Call for Change, Understandi… and 5560 181197
## 5 Online Learning Module (e.g. Call for Change, Understandi… i 3861 181197
## 6 Online Learning Module (e.g. Call for Change, Understandi… it 3087 181197
## 7 Online Learning Module (e.g. Call for Change, Understandi… my 2649 181197
## 8 Online Learning Module (e.g. Call for Change, Understandi… was 2520 181197
## 9 Online Learning Module (e.g. Call for Change, Understandi… a 2473 181197
## 10 Online Learning Module (e.g. Call for Change, Understandi… in 2378 181197
## # … with 8,823 more rows
Finally, we're ready to use the bind_tf_idf() function to calculate a tf-idf statistic for each word and assess its relative importance to a given online resource type:
opd_tf_idf <- opd_totals %>%
bind_tf_idf(word, Resource...6, n)
opd_tf_idf
## # A tibble: 8,833 × 7
## Resource...6 word n total tf idf tf_idf
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Online Learning Module (e.g. Call fo… the 13058 181197 0.0721 0 0
## 2 Online Learning Module (e.g. Call fo… to 7933 181197 0.0438 0 0
## 3 Online Learning Module (e.g. Call fo… of 6132 181197 0.0338 0 0
## 4 Online Learning Module (e.g. Call fo… and 5560 181197 0.0307 0.105 0.00323
## 5 Online Learning Module (e.g. Call fo… i 3861 181197 0.0213 0 0
## 6 Online Learning Module (e.g. Call fo… it 3087 181197 0.0170 0 0
## 7 Online Learning Module (e.g. Call fo… my 2649 181197 0.0146 0 0
## 8 Online Learning Module (e.g. Call fo… was 2520 181197 0.0139 0 0
## 9 Online Learning Module (e.g. Call fo… a 2473 181197 0.0136 0 0
## 10 Online Learning Module (e.g. Call fo… in 2378 181197 0.0131 0.105 0.00138
## # … with 8,823 more rows
view(opd_tf_idf)
Notice that idf and thus tf-idf are zero for these extremely common words (typically stop words). These are all words that appear in teacher responses for all online resource types, so the idf term (which will then be the natural log of 1) is zero. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how this approach decreases the weight for common words. The inverse document frequency will be a higher number for words that occur in fewer of the documents in the collection.
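As a hedged sanity check of the output above: with 10 resource types, the idf of 0.105 shown for "and" is exactly what we would expect if "and" appears in teacher responses for 9 of the 10 resource types:

# ln(10/9) is approximately 0.105, matching the idf shown for "and"
log(10 / 9)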
On one final note, Silge and Robinson caution that while tf-idf has proved useful in text mining, search engines, and so on, its theoretical foundations are considered less than firm by information theory experts.
In the next section, we’ll use some data visualization strategies to help us interpret and find patterns in these rather dense output tables.
- Instead of using view() on opd_resource_counts and searching in the Source Pane, how might you use the filter() function to return the most common words for Recorded Webinars?
- … the opd_tf_idf data frame we created?
- … your opd_benefits data frame. For frequencies and tf-idf, group by Role instead of Resource.

This section is a really quick aside, primarily meant to introduce the grepl() function from base R that we'll be using in future labs.
Unsurprisingly, a quick word count resulted in findings fairly consistent with some of the reported qualitative findings, but it also lacked some nuance and left questions about what some of the more frequent words were in reference to.
Let's take our reduced opd_teacher survey data frame, which contains the complete teacher responses, and use the handy filter(), select(), and grepl() functions to select just our text column and filter for responses that contain key words of interest. For example, what aspects of being "online" made a resource beneficial?
We can view all quotes in the Source Pane, or use sample_n(), yes, also from the dplyr package, to select any number of random quotes; in this case, 20:
opd_quotes <- opd_teacher %>%
select(text) %>%
filter(grepl('online', text))
view(opd_quotes)
sample_n(opd_quotes, 20)
## # A tibble: 20 × 1
## text
## <chr>
## 1 It's online and accessable from any location.
## 2 The examples shown and the video explanation were the most beneficial/valuab…
## 3 Understanding the online resources
## 4 Available online
## 5 Beiang available online and able to complete at my own pace.
## 6 The most beneficial aspect of this online resource was being able to complet…
## 7 The most beneficial/valuable aspect of this online resource was the video th…
## 8 It gave me several chances to explore and use new online resources.
## 9 being able to work online at my own pace
## 10 online videos
## 11 That it was online and I could do it at my own pace on my own time.
## 12 The online access.
## 13 online resources
## 14 online availability
## 15 it was online
## 16 The most beneficial/value of this online course was that I was able to work …
## 17 The sample responses and master scores for the constructed response question…
## 18 This online resource was very user-friendly and easy to follow and comprehen…
## 19 Available online
## 20 The fact that it was presented online.
In some cases, we can see that the use of the word “online” was simply repetition of the question prompt, but in other cases we can see that it’s associated with the broader theme of “convenience” as with the quote, “This online resources gave me the opportunity to study on my own time.”
Note that you can also use regular expression operators with grepl(), like the * operator, to search for word stems. For example, using inform* in our search will return quotes with "inform", "informative", "information", etc.
opd_quotes <- opd_teacher %>%
select(text) %>%
filter(grepl('inform*', text))
view(opd_quotes)
sample_n(opd_quotes, 20)
## # A tibble: 20 × 1
## text
## <chr>
## 1 Easy to access and the information was revelant to all subjects.
## 2 new information not previously known.
## 3 The information about the resources available to enhance the classroom.
## 4 It was informativen
## 5 Has valuable information
## 6 We were able to get information to score.
## 7 Easily accessible, able to revisit if not able to complete in one sitting Vi…
## 8 good information
## 9 Being able to pring informations
## 10 Videos rather than written information gives a hands on look at the material.
## 11 useful information about having a different view of getting students to unde…
## 12 Great information to grow on
## 13 Reinforced what I was already taught in a previous training.
## 14 The information was very beneficial.
## 15 general information
## 16 more specific background information about data literacy
## 17 The links to resources and other information to learn more about the content.
## 18 to have new information needed to be apllied in the classroom
## 19 Understanding the new requirements for students, and excellent information o…
## 20 Lots of information
The go-to package for standard charts and graphs is ggplot2. Hadley Wickham's R for Data Science and ggplot2: Elegant Graphics for Data Analysis are also great introductions to data visualization in R with ggplot2.
The wordcloud2 package is pretty dead simple for generating HTML-based word clouds.

For example, let's load our installed wordcloud2 library and run the wordcloud2() function on our opd_counts data frame:
library(wordcloud2)
wordcloud2(opd_counts)
I use wordclouds pretty sparingly in evaluation reports, but typically include them for open ended items in online Qualtrics survey reports to provide education partners I work with a quick snapshot of the response.
Once installed, I recommend using ?wordcloud2 to view the various arguments for cleaning up the default view.
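As one hedged example of those arguments (the specific values here are just for illustration, not part of the original walkthrough):

# A slightly cleaned-up cloud: smaller scaling, single-color words, white background
wordcloud2(opd_counts, size = 0.5, color = "darkblue", backgroundColor = "white")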
The bar chart is the workhorse of data viz and is pretty effective for comparing two or more values. Given the shape of our tidy text data frame, however, we are looking at upwards of 5,000 values (i.e. words and their counts) to compare in our opd_counts data frame and will need some way to limit the number of words to display.
opd_counts %>%
filter(n > 500) %>% # keep rows with word counts greater than 500
mutate(word = reorder(word, n)) %>% #reorder the word variable by n and replace with new variable called word
ggplot(aes(n, word)) + # create a plot with n on x axis and word on y axis
geom_col() # make it a bar plot
Word clouds and bar charts are pretty effective for highlighting the most common words in an entire corpus, or in our case, all teacher survey responses, regardless of resource type being reviewed.
One limitation we ran into earlier when we started looking at word frequencies and tf-idf stats was that it was difficult to compare the most common or unique words for each resource type. That is where small multiples come in. A small multiple is basically a series of similar graphs or charts using the same scale and axes, which makes it easier to compare across different document collections of interest, in our case, word counts by resource type.
Let's use the example illustrated in Text Mining with R to create a small multiple for our opd_frequencies data set instead of the opd_tf_idf data frame:
library(forcats)
opd_frequencies %>%
filter(Resource...6 != "Calendar") %>% # remove Calendar responses, too few.
group_by(Resource...6) %>%
slice_max(proportion, n = 5) %>%
ungroup() %>%
ggplot(aes(proportion, fct_reorder(word, proportion), fill = Resource...6)) +
geom_col(show.legend = FALSE) +
facet_wrap(~Resource...6, ncol = 3, scales = "free")
… opd_benefits data.

As highlighted in Chapter 3 of Data Science in Education Using R, the Model step of the data science process entails "using statistical models, from simple to complex, to understand trends and patterns in the data." The authors note that while descriptive statistics and data visualization during the Explore step can help us to identify patterns and relationships in our data, statistical models can be used to help us determine if relationships, patterns, and trends are actually meaningful.
In Learning Analytics Goes to School, the authors describe modeling as simply developing a mathematical summary of a dataset and note that there are two general types of modeling: unsupervised and supervised learning. Unsupervised learning algorithms, which will be the focus in this lab, are used to explore the structure of a dataset, while supervised models "help to quantify relationships between features and a known outcome."
We will explore the use of models for text mining in future labs, but
if you are interested in looking ahead to see how they might be applied
to text as data, I recommend taking a look at Chapter 6 Topic
Modeling from Text Mining with R: A Tidy Approach. Chris Bail in his
Text as Data course also provides a nice introduction to Topic Modeling,
including Structural Topic Modeling, which we will explore using the
stm package in future labs.
Finally, if you have not already done so, I ask that at minimum you read Chapter 3 of DSIEUR as well as the section on the Data-Intensive Research Workflow from Chapter 2 of Learning Analytics Goes to School.
The final(ish) step in our workflow/process is sharing the results of our analysis with a wider audience. Krumm et al. (2018) outline the following three-step process for communicating with education stakeholders what you have learned through analysis:
In this particular walkthrough, our target audience is developers of online professional learning opportunities who are looking to receive feedback on what's working well and on potential areas for improvement. This lets us assume a good deal of prior knowledge on their end about the context of the evaluation, a high level of familiarity with the online professional development resources being assessed, and a fair degree of fluency in reading and interpreting data and charts. This also lets us simplify our data products and narrative and reduce the level of detail needed to communicate useful information.
For summative evaluation, typically at the end of a school year or grant period when the emphasis is on assessing program outcomes and impact, our audience would extend to those less familiar with the program but with a vested interest in its success, such as the NC State Board of Education or those directly impacted by the program, including NC educators in general. In that case, our data product would need to include much more narrative to provide context, and greater detail in charts and graphs, in order to help interpret the data presented.
For analyses to present, I’m going to focus primarily on:
I've decided to exclude analyses of just term frequency because I feel like simple counts are easier to quickly interpret while tf-idf provides more nuance. I also want to be careful not to overwhelm my audience.
In terms of "data products" and form, and because this is a simple demonstration of sharing analyses and our first experience in this lab with independent analysis, I'll prepare my data product as a basic slide show that includes the following charts:
To make the word cloud a little less busy and a little more useful, I removed the multitude of colors from the default setting and, using some modified code from the ?wordcloud2 help file, included an argument in the wordcloud2() function to use the color black for words that occur more than 1,000 times, and gray for the rest.
library(wordcloud2)
wordcloud2(opd_counts,
color = ifelse(opd_counts[, 2] > 1000, 'black', 'gray'))
For my bar chart, I did some minor clean up, including editing the
x-axis title, removing the redundant y axis by setting it to
NULL, and adding a title. I also used the built-in
theme_minimal( ) function layer to simplify the look. If
this were something for a more formal report, I’d probably finesse it
even more, but it gets the point across.
opd_counts %>%
filter(n > 500) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word)) +
geom_col() +
labs(x = "Word Counts", y = NULL, title = "20 Most Frequently Used Words to Describe the Value of Online Resources") +
theme_minimal()
Finally, two related issues that I want to clean up a little with respect to tf-idf before sharing with an outside audience are the appearance of stop words and too few responses for the Calendar online learning resources.
First, I’ll reuse my opd_clean data frame which had my
stop words removed to create my new opd_tf_idf data
frame.
opd_resource_counts <- opd_clean %>%
count(Resource...6, word)
total_words <- opd_resource_counts %>%
group_by(Resource...6) %>%
summarize(total = sum(n))
opd_words <- left_join(opd_resource_counts, total_words)
## Joining, by = "Resource...6"
opd_tf_idf <- opd_words %>%
bind_tf_idf(word, Resource...6, n)
Then I’ll use the filter() function to remove any
response pertaining to Calendar and add some labels using the
labs() function. Again, if this were a chart destined for a
more formal report, I’d also clean up the Resource names to make them
more readable and fit properly on each bar plot.
Finally, with the help of Soraya Campbell, I’ve
fixed the pesky issue with the charts not ordering by tf-idf value
properly by changing Resource from a character to a factor
and using the reorder_within function.
opd_tf_idf %>%
filter(Resource...6 != "Calendar") %>%
group_by(Resource...6) %>%
slice_max(tf_idf, n = 5) %>%
ungroup() %>%
mutate(Resource=as.factor(Resource...6),
word=reorder_within(word, tf_idf, Resource...6)) %>%
ggplot(aes(word, tf_idf, fill = Resource...6)) +
geom_col(show.legend = FALSE) +
facet_wrap(~Resource...6, ncol = 3, scales = "free") +
coord_flip() +
scale_x_reordered() +
labs(title = "Words Unique to Each Online Learning Resource", x = NULL, y = "tf-idf value")