Unit 3 Case Study: Public Sentiment and the State Standards
ECI 586 Introduction to Learning Analytics
1. PREPARE
Data sources such as digital learning environments and administrative data systems, as well as data produced by social media websites and the mass digitization of academic and practitioner publications, hold enormous potential to address a range of pressing problems in education, but collecting and analyzing text-based data also presents unique challenges. This week, our case study is guided by Josh Rosenberg’s study, Advancing new methods for understanding public sentiment about educational reforms: The case of Twitter and the Next Generation Science Standards.
We will focus on conducting a very simplistic “replication study” by comparing the sentiment of tweets about the Next Generation Science Standards (NGSS) and Common Core State Standards (CCSS) in order to better understand public reaction to these two curriculum reform efforts. Specifically, our Unit 3 case study will cover the following topics:
- Prepare: We’ll take a quick look at Dr. Rosenberg’s study and load packages we’ll need for analysis.
- Wrangle: We focus on basic text mining processes such as text tokenization and stop word removal. Specifically, we will learn how to “tidy text” so we can perform some basic analyses such as retrieving word counts and term frequencies.
- Explore: In order to see what insight our data provides into answering our research questions, we will calculate some simple summary statistics from our tidied text and use data visualization to highlight some of these insights.
- Model: We learn a little about sentiment lexicons and introduce the {vader} package to model the sentiment of tweets about the NGSS and CCSS state standards in order to better understand public reaction to these two curriculum reform efforts.
- Communicate: To wrap up our case study, we’ll write a brief summary of our findings and a short reflection on what we learned.
1a. Review the Literature
The Unit 3 Case Study: Public Sentiment and the State Standards is guided by a recent publication by Rosenberg et al. (2021), Understanding Public Sentiment About Educational Reforms: The Next Generation Science Standards on Twitter. This study in turn builds upon previous work by Wang & Fikis (2017) examining public opinion on the Common Core State Standards (CCSS) on Twitter. For this case study, we will focus on analyzing tweets about the Next Generation Science Standards (NGSS) and Common Core State Standards (CCSS) in order to better understand key words and phrases that emerge, as well as public sentiment towards these two curriculum reform efforts.
Note on Data Source: This study analyzes data from Twitter, now rebranded as X. Changes to the platform’s API access and policies mean this type of research is now more difficult to replicate.
Abstract
System-wide educational reforms are difficult to implement in the United States, but despite the difficulties, reforms can be successful, particularly when they are associated with broad public support. This study reports on the nature of the public sentiment expressed about a nationwide science education reform effort, the Next Generation Science Standards (NGSS). Through the use of data science techniques to measure the sentiment of posts on Twitter (now X) about the NGSS (N = 565,283), we found that public sentiment about the NGSS is positive, with only 11 negative posts for every 100 positive posts. In contrast to findings from past research and public opinion polling on the Common Core State Standards, sentiment about the NGSS has become more positive over time—and was especially positive for teachers. We discuss what this positive sentiment may indicate about the success of the NGSS in light of opposition to the Common Core State Standards.
Data Sources
Similar to the data we'll be using for this case study, Rosenberg et al. used publicly accessible data from Twitter (now X) collected using the Full-Archive X API and the {rtweet} package in R. Specifically, the authors accessed tweets and user information from the hashtag-based #NGSSchat online community, as well as all tweets that included any of the following phrases, with "/" indicating an additional phrase featuring the respective plural form: "ngss", "next generation science standard/s", "next gen science standard/s".
Data used in this case study was obtained prior to Twitter's transition to X, using the {academictwitteR} package and the sadly no longer accessible Academic Research developer account. The Twitter API v2 endpoints allowed researchers to access the full Twitter archive, unlike a standard developer account. Data includes all tweets from January through May of 2021 that included any of the following terms: #ccss, common core, #ngsschat, ngss.
Below is an example of the code used to retrieve data for this case study. This code is set not to execute and will NOT run, but it does illustrate the search query used, the variables selected, and the time frame.
library(academictwitteR)
library(tidyverse)

ccss_tweets_2021 <-
  get_all_tweets('(#commoncore OR "common core") -is:retweet lang:en',
                 "2021-01-01T00:00:00Z",
                 "2021-05-31T00:00:00Z",
                 bearer_token,
                 data_path = "ccss-data/",
                 bind_tweets = FALSE)

ccss_tweets <- bind_tweet_jsons(data_path = "ccss-data/") |>
  select(text,
         created_at,
         author_id,
         id,
         conversation_id,
         source,
         possibly_sensitive,
         in_reply_to_user_id)

write_csv(ccss_tweets, "data/ccss-tweets.csv")
Analysis
The authors determined tweet sentiment using the Java version of SentiStrength to assign tweets to two 5-point scales of sentiment, one for positivity and one for negativity, because SentiStrength is a validated measure for sentiment in short informal texts (Thelwall et al., 2011). In addition, they used this tool because Wang and Fikis (2019) used it to explore the sentiment of CCSS-related posts.
We'll be using the AFINN sentiment lexicon, which assigns words a score running from -5 (most negative) to 5 (most positive), in addition to exploring some other sentiment lexicons to see if they produce similar results. We will use a similar approach to label tweets as positive, negative, or neutral using the {vader} package, which greatly simplifies this process.
The authors also used the {lme4} package in R to run a mixed effects model to determine if sentiment changes over time and differs between teachers and non-teachers. We won't try to replicate this part of the study, but we will take a look at some of their findings from this model in section 4.
Summary of Key Findings
- In contrast to sentiment about the CCSS, sentiment about the NGSS science education reform effort is overwhelmingly positive, with approximately 9 positive tweets for every negative tweet.
- Teachers were more positive than non-teachers, and sentiment became substantially more positive over the ten years of NGSS-related posts.
- Differences between the contexts of the tweets were small, but tweets that did not include the #NGSSchat hashtag became more positive over time than those that did include the hashtag.
- The more individuals posted during #NGSSchat chats, the more positive the sentiment of their posts, suggesting that while the context of individual tweets has a small effect (with posts not including the hashtag becoming more positive over time), the effect upon individuals of being involved in the #NGSSchat was positive.
1b. Define Questions
One overarching question that Silge and Robinson (2017) identify as central to text mining and natural language processing, and that we'll explore later in this case study, is:
How do we quantify what a document or collection of documents is about?
The questions guiding the Rosenberg et al. study attempt to quantify public sentiment around the NGSS and how that sentiment changes over time. Specifically, they asked:
- What is the public sentiment expressed toward the NGSS?
- How does sentiment for teachers differ from non-teachers?
- How do tweets posted to #NGSSchat differ from those without the hashtag?
- How does participation in #NGSSchat relate to the public sentiment individuals express?
- How does public sentiment vary over time?
For this text mining case study, we'll use approaches similar to those used by the authors cited above to better understand public discourse surrounding these standards, particularly as they relate to STEM education. We will also try to gauge public sentiment around the NGSS by comparing how much more positive or negative NGSS tweets are relative to CCSS tweets. Specifically, in this case study we'll attempt to answer the following questions:
- What are the most frequent words or phrases used in reference to tweets about the CCSS and NGSS?
- How does sentiment for NGSS compare to sentiment for CCSS?
1c. Load Libraries
tidytext 📦
As we'll learn first hand in this module, using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. The {tidytext} package helps to convert text into data frames with each row containing an individual word or sequence of words, making it easy to manipulate, summarize, and visualize text using familiar functions from the {tidyverse} collection of packages.
Let’s go ahead and load the {tidytext} package:
library(tidytext)
For a more comprehensive introduction to the {tidytext} package, I cannot recommend enough the free and excellent online book, Text Mining with R: A Tidy Approach (Silge & Robinson, 2017). If you're interested in pursuing text analysis using R after this course, this will be a go-to reference.
The vader Package 📦
The {vader} package is for the Valence Aware Dictionary for sEntiment Reasoning (VADER), a rule-based model for general sentiment analysis of social media text and specifically attuned to measuring sentiment in microblog-like contexts.
To learn more about the {vader} package and its development, take a look at the article by Hutto and Gilbert (2014), VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text.
Let’s go ahead and load the VADER library:
library(vader)
Note: The {vader} package can take quite some time to run on large datasets like the one we'll be working with, so in our Model section we will examine just a small(ish) subset of tweets.
Other Packages
Finally, there are a couple of other packages we'll need to get started. The first should look familiar, while the second, {wordcloud2}, is a handy little package for creating interactive word clouds.
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(wordcloud2)
library(tidyr)
2. WRANGLE
The importance of data wrangling, particularly when working with text, is difficult to overstate. Just as a refresher, wrangling involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018). This case study will place a heavy emphasis on preparing text for analysis and in particular we’ll learn how to:
- Import Tweets. First we revisit the familiar read_csv() function for reading our CCSS and NGSS tweets into R.
- Restructure Data. We focus on removing extraneous data using the select() and filter() functions from {dplyr}, and revisit functions from the Tidy Your Data Primer for merging data frames.
- Tidy Text. Finally, we introduce the {tidytext} package to "tidy" and tokenize our tweets in order to create our data frame for analysis. We also introduce a new join function to remove "stop words" that don't add much value to our analysis.
2a. Import Tweets from CSV
As noted above, data used in this case study was pulled using an Academic Research developer account and the {academictwitteR} package, which uses the Twitter API v2 endpoints and allows researchers to access the full Twitter archive, unlike the {rtweet} package, which limits the number of tweets and the length of time from which you can pull tweets.
Data for this case study includes all tweets from January through May of 2021 that include the following terms: #ccss, common core, #ngsschat, ngss. Since we'll be working with some computationally intensive functions later in this case study that can take some time to run, I restricted the time frame for my search to only a handful of months. Even so, we'll be working with over 35,000 tweets and nearly 1,000,000 words for our analysis!
Let’s use the by now familiar read_csv() function to import our ccss_tweets.csv file saved in our data folder:
ccss_tweets <- read_csv("data/ccss-tweets.csv",
col_types = cols(author_id = col_character(),
id = col_character(),
conversation_id = col_character(),
in_reply_to_user_id = col_character()
)
)
ccss_tweets
# A tibble: 27,230 × 8
text created_at author_id id conversation_id source
<chr> <dttm> <chr> <chr> <chr> <chr>
1 "@catturd2 Hmmmm … 2021-01-02 00:49:28 16098543… 1345… 13451697062071… Twitt…
2 "@homebrew1500 I … 2021-01-02 00:40:05 12495948… 1345… 13451533915976… Twitt…
3 "@ClayTravis Dump… 2021-01-02 00:32:46 88770705… 1345… 13450258639942… Twitt…
4 "@KarenGunby @chi… 2021-01-02 00:24:01 12495948… 1345… 13451533915976… Twitt…
5 "@keith3048 I kno… 2021-01-02 00:23:42 12527475… 1345… 13451533915976… Twitt…
6 "Probably common … 2021-01-02 00:18:38 12760173… 1345… 13451625486818… Twitt…
7 "@LisaS4680 Stupi… 2021-01-02 00:16:11 92213292… 1345… 13451595466087… Twitt…
8 "@JerryGl29176259… 2021-01-02 00:10:29 12201608… 1345… 13447179758914… Twitt…
9 "@JBatNC304 @Cawt… 2021-01-02 00:09:15 88091448… 1345… 13447403608625… Twitt…
10 "@chiefaugur I th… 2021-01-01 23:54:38 12495948… 1345… 13451533915976… Twitt…
# ℹ 27,220 more rows
# ℹ 2 more variables: possibly_sensitive <lgl>, in_reply_to_user_id <chr>
Note the addition of the col_types = argument for changing some of the column types to character strings because the numbers for those particular columns actually indicate identifiers for authors and tweets:
- author_id = the author of the tweet
- id = the unique id for each tweet
- conversation_id = the unique id for each conversation thread
- in_reply_to_user_id = the author of the tweet being replied to
Your Turn ⤵
Complete the following code chunk to import the NGSS tweets located in the same data folder as our common core tweets and named ngss-tweets.csv. By default, R will treat the numerical IDs in our dataset as numeric values, but for the purpose of analysis we will need to convert these to characters as demonstrated above. Also, feel free to repurpose the code from above.
ngss_tweets <- read_csv("data/ngss-tweets.csv",
col_types = cols(author_id = col_character(),
id = col_character(),
conversation_id = col_character(),
in_reply_to_user_id = col_character()
)
)
ngss_tweets
# A tibble: 8,125 × 8
text created_at author_id id conversation_id source
<chr> <dttm> <chr> <chr> <chr> <chr>
1 "Please help us R… 2021-01-06 00:50:49 32799077… 1346… 13466201998945… Twitt…
2 "What lab materia… 2021-01-06 00:45:32 10103246… 1346… 13466188701325… Hoots…
3 "I recently saw a… 2021-01-06 00:39:37 61829645 1346… 13466173820858… Twitt…
4 "I'm thrilled to … 2021-01-06 00:30:13 461653415 1346… 13466150172071… Twitt…
5 "PLS RT. Excited … 2021-01-06 00:15:05 22293234 1346… 13466112069671… Twitt…
6 "Inspired by Marg… 2021-01-06 00:00:00 33179602… 1346… 13466074140999… Tweet…
7 "PLTW Launch is d… 2021-01-05 23:45:06 17276863 1346… 13466036638386… Hoots…
8 "@NGSS_tweeps How… 2021-01-05 23:24:01 10230543… 1346… 13464677409499… Twitt…
9 "@NGSS_tweeps I d… 2021-01-05 23:21:56 10230543… 1346… 13464677409499… Twitt…
10 "January 31st is … 2021-01-05 23:10:03 23679615 1346… 13465948440435… Hoots…
# ℹ 8,115 more rows
# ℹ 2 more variables: possibly_sensitive <lgl>, in_reply_to_user_id <chr>
Importing data and dealing with data types can be a bit tricky, especially for beginners. Recall from previous case studies that RStudio has an “Import Dataset” feature in the Environment Pane that can help you use the {readr} package and associated functions to greatly facilitate this process. If you get stuck, you can copy the code generated in the lower right hand corner of the Import Dataset window.
Now use the following code chunk to inspect the head() of each data frame and answer the questions that follow:
head(ngss_tweets)# A tibble: 6 × 8
text created_at author_id id conversation_id source
<chr> <dttm> <chr> <chr> <chr> <chr>
1 "Please help us RT… 2021-01-06 00:50:49 32799077… 1346… 13466201998945… Twitt…
2 "What lab material… 2021-01-06 00:45:32 10103246… 1346… 13466188701325… Hoots…
3 "I recently saw a … 2021-01-06 00:39:37 61829645 1346… 13466173820858… Twitt…
4 "I'm thrilled to b… 2021-01-06 00:30:13 461653415 1346… 13466150172071… Twitt…
5 "PLS RT. Excited 2… 2021-01-06 00:15:05 22293234 1346… 13466112069671… Twitt…
6 "Inspired by Marga… 2021-01-06 00:00:00 33179602… 1346… 13466074140999… Tweet…
# ℹ 2 more variables: possibly_sensitive <lgl>, in_reply_to_user_id <chr>
head(ccss_tweets)# A tibble: 6 × 8
text created_at author_id id conversation_id source
<chr> <dttm> <chr> <chr> <chr> <chr>
1 "@catturd2 Hmmmm “… 2021-01-02 00:49:28 16098543… 1345… 13451697062071… Twitt…
2 "@homebrew1500 I a… 2021-01-02 00:40:05 12495948… 1345… 13451533915976… Twitt…
3 "@ClayTravis Dump … 2021-01-02 00:32:46 88770705… 1345… 13450258639942… Twitt…
4 "@KarenGunby @chie… 2021-01-02 00:24:01 12495948… 1345… 13451533915976… Twitt…
5 "@keith3048 I know… 2021-01-02 00:23:42 12527475… 1345… 13451533915976… Twitt…
6 "Probably common c… 2021-01-02 00:18:38 12760173… 1345… 13451625486818… Twitt…
# ℹ 2 more variables: possibly_sensitive <lgl>, in_reply_to_user_id <chr>
Wow, so much for a family friendly case study! Based on this very limited sample, which set of standards do you think Twitter users are more negative about?
- Definitely more negative sentiment towards the CCSS!
Let’s take a slightly larger sample of the CCSS tweets:
set.seed(586)
ccss_tweets |>
sample_n(20) |>
relocate(text)
# A tibble: 20 × 8
text created_at author_id id conversation_id source
<chr> <dttm> <chr> <chr> <chr> <chr>
1 "Common core math… 2021-01-24 04:45:27 578491631 1353… 13532022276287… Twitt…
2 "@mariana057 Nope… 2021-04-15 18:08:22 46682634 1382… 13827177567479… Twitt…
3 "Critical Race Th… 2021-05-27 09:30:19 10077879… 1397… 13978476272119… Twitt…
4 "@Afkar_omumi @di… 2021-01-11 00:32:28 12397153… 1348… 13484220504410… Twitt…
5 "Pastor Brian, sp… 2021-01-15 19:41:47 13146731… 1350… 13501662706673… Twitt…
6 "Common core math… 2021-04-17 20:02:00 13769382… 1383… 13835110793812… Twitt…
7 "@Saorsa1776 Comm… 2021-05-20 19:57:14 175150086 1395… 13953941821230… Twitt…
8 "Common Core Math… 2021-04-30 13:36:56 13138487… 1388… 13881252185568… Twitt…
9 "[Download] EPUB … 2021-01-09 10:11:18 13462767… 1347… 13478484159174… Twitt…
10 "Bill & Mel… 2021-04-17 02:18:53 13523967… 1383… 13832435374839… Twitt…
11 "@ASlavitt Common… 2021-03-15 10:51:28 12062192… 1371… 13712091309674… Twitt…
12 "@LeftAccidental … 2021-03-22 19:39:43 13517576… 1374… 13740686709771… Twitt…
13 "Don’t you think … 2021-01-09 23:17:27 99623303… 1348… 13480462554993… Twitt…
14 "What? Is that c… 2021-02-07 01:35:22 13295075… 1358… 13582278249901… Twitt…
15 "Now, he's totall… 2021-02-01 15:10:47 70214097… 1356… 13562587032641… Trump…
16 "@JackStr13435605… 2021-02-23 16:16:22 16110641… 1364… 13640566504583… Twitt…
17 "COMMON CORE FTW … 2021-05-18 22:15:24 28045276… 1394… 13947786759003… Ninte…
18 "I have a comprom… 2021-03-14 21:22:31 16693646… 1371… 13712101564831… Twitt…
19 "When she was Gov… 2021-02-12 14:00:41 911076488 1360… 13602273285934… Twitt…
20 "PDF Download Pre… 2021-01-25 23:05:12 13495578… 1353… 13538413772058… Twitt…
# ℹ 2 more variables: possibly_sensitive <lgl>, in_reply_to_user_id <chr>
Your Turn ⤵
Use the code chunk below to take a sample of the NGSS tweets. Try to do it without looking at the code above first:
set.seed(586)
ngss_tweets |>
sample_n(20) |>
relocate(text)
# A tibble: 20 × 8
text created_at author_id id conversation_id source
<chr> <dttm> <chr> <chr> <chr> <chr>
1 "@NewhouseBiology… 2021-04-02 01:04:05 40062074 1377… 13777884447982… Tweet…
2 "Learn more about… 2021-02-17 14:31:07 42574210… 1362… 13620469261542… Twitt…
3 "@philiplbell @sb… 2021-02-03 16:31:05 242075092 1357… 13569805821404… Twitt…
4 "@KRenaeP @starrs… 2021-02-01 21:56:54 11618510… 1356… 13562275510694… Twitt…
5 "Unpacking the fo… 2021-02-24 04:43:17 748435729 1364… 13644357071172… Twitt…
6 "A1: How do you s… 2021-02-05 02:14:19 96102787… 1357… 13575128494767… Twitt…
7 "The prettiest la… 2021-01-21 08:35:32 76327682… 1352… 13521729693852… Insta…
8 "Want more inform… 2021-04-19 19:02:03 99688098… 1384… 13842207692578… Buffer
9 "@NGSS_tweeps Tha… 2021-01-29 01:51:48 13295721… 1354… 13549483292522… Twitt…
10 "@NGSS_tweeps Yes… 2021-02-20 14:47:57 12964713… 1363… 13628226555193… Twitt…
11 "#NGSSchat A2. A… 2021-04-16 01:32:23 10973360… 1382… 13828694498211… Twitt…
12 "PHENOMENA! See w… 2021-05-07 17:16:00 30415132… 1390… 13907170611989… Twitt…
13 "@LisaMLove1996 @… 2021-01-30 05:46:05 13295721… 1355… 13549732742796… Twitt…
14 "Check out my cla… 2021-05-22 02:16:12 21797286… 1395… 13959264365784… Twitt…
15 "KG Ss help Block… 2021-05-06 02:07:32 16231168… 1390… 13901260519765… Tweet…
16 "@SUSDscience Wel… 2021-02-06 01:39:59 17294587… 1357… 13577987920330… Twitt…
17 "Love @LenoraMCra… 2021-03-20 16:52:15 22293234 1373… 13733164682895… Twitt…
18 "my school made a… 2021-01-13 15:24:55 10339637… 1349… 13493768902939… Twitt…
19 "A1.2 Defining pr… 2021-04-16 01:12:24 40062074 1382… 13828644201377… Tweet…
20 "#Science is a gr… 2021-01-19 16:21:21 72579397… 1351… 13515654213134… Twitt…
# ℹ 2 more variables: possibly_sensitive <lgl>, in_reply_to_user_id <chr>
Still of the same opinion?
- Yes, same opinion stands. The people in the NGSS conversation are more positive than those engaging with the CCSS threads. There appear to be more practitioners of the standards, as opposed to the CCSS threads, which seem to be mostly people complaining.
What else do you notice about our data sets? Record a few observations that you think are relevant to our analysis or might be useful for future analyses.
- I think the possibly_sensitive logical column will show some interesting differences between the two data sets, with CCSS being flagged much more than NGSS. I also wonder how many continued thread conversations there are in each.
What questions do you have about these data sets? What are you still curious about?
- The sheer difference between the number of recorded tweets is interesting and could speak to the observation that negative sentiment may invoke more interaction than positive sentiment. Same reasons the news shows negative stories instead of positive ones. I also wonder how many unique authors are in each set.
2b. Restructure Data
Subset Tweets
As you may have noticed, we have more data than we need for our analysis and should probably pare it down to just what we’ll use.
We could do this in multiple steps, creating intermediate objects like ccss_tweets_1, ccss_tweets_2, etc., but that creates unnecessary clutter in our environment and makes code harder to follow. Instead, we’ll use the |> pipe operator to chain our operations together in a single, readable flow.
Let’s clean the CCSS tweets by:
- Filtering out potentially sensitive content
- Selecting only the columns we need
- Adding a “standards” label column
- Moving that new column to the first position for easy viewing
# Clean CCSS Tweets
ccss_tweets_clean <- ccss_tweets |>
filter(possibly_sensitive == FALSE) |>
select(text, author_id, created_at, conversation_id, id) |>
mutate(standards = "ccss") |>
relocate(standards)
head(ccss_tweets_clean)
# A tibble: 6 × 6
standards text author_id created_at conversation_id id
<chr> <chr> <chr> <dttm> <chr> <chr>
1 ccss "@catturd2 Hmmm… 16098543… 2021-01-02 00:49:28 13451697062071… 1345…
2 ccss "@homebrew1500 … 12495948… 2021-01-02 00:40:05 13451533915976… 1345…
3 ccss "@ClayTravis Du… 88770705… 2021-01-02 00:32:46 13450258639942… 1345…
4 ccss "@KarenGunby @c… 12495948… 2021-01-02 00:24:01 13451533915976… 1345…
5 ccss "@keith3048 I k… 12527475… 2021-01-02 00:23:42 13451533915976… 1345…
6 ccss "Probably commo… 12760173… 2021-01-02 00:18:38 13451625486818… 1345…
Your Turn ⤵
Recall from section 1b. Define Questions that we are interested in comparing word usage and public sentiment around both the Common Core and Next Gen Science Standards.
Create a new ngss_tweets_clean data frame consisting of the Next Generation Science Standards tweets we imported earlier, using the code directly above as a guide.
# Clean NGSS Tweets
ngss_tweets_clean <- ngss_tweets |>
filter(possibly_sensitive == FALSE) |>
select(text, author_id, created_at, conversation_id, id) |>
mutate(standards = "ngss") |>
relocate(standards)
head(ngss_tweets_clean)
# A tibble: 6 × 6
standards text author_id created_at conversation_id id
<chr> <chr> <chr> <dttm> <chr> <chr>
1 ngss "Please help us… 32799077… 2021-01-06 00:50:49 13466201998945… 1346…
2 ngss "What lab mater… 10103246… 2021-01-06 00:45:32 13466188701325… 1346…
3 ngss "I recently saw… 61829645 2021-01-06 00:39:37 13466173820858… 1346…
4 ngss "I'm thrilled t… 461653415 2021-01-06 00:30:13 13466150172071… 1346…
5 ngss "PLS RT. Excite… 22293234 2021-01-06 00:15:05 13466112069671… 1346…
6 ngss "Inspired by Ma… 33179602… 2021-01-06 00:00:00 13466074140999… 1346…
Merge Data Frames
Finally, let's combine our CCSS and NGSS tweets into a single data frame by using the union() function from {dplyr}, simply supplying the data frames that you want to combine as arguments:
ss_tweets <- union(ccss_tweets_clean,
ngss_tweets_clean)
ss_tweets
# A tibble: 35,233 × 6
standards text author_id created_at conversation_id id
<chr> <chr> <chr> <dttm> <chr> <chr>
1 ccss "@catturd2 Hmm… 16098543… 2021-01-02 00:49:28 13451697062071… 1345…
2 ccss "@homebrew1500… 12495948… 2021-01-02 00:40:05 13451533915976… 1345…
3 ccss "@ClayTravis D… 88770705… 2021-01-02 00:32:46 13450258639942… 1345…
4 ccss "@KarenGunby @… 12495948… 2021-01-02 00:24:01 13451533915976… 1345…
5 ccss "@keith3048 I … 12527475… 2021-01-02 00:23:42 13451533915976… 1345…
6 ccss "Probably comm… 12760173… 2021-01-02 00:18:38 13451625486818… 1345…
7 ccss "@LisaS4680 St… 92213292… 2021-01-02 00:16:11 13451595466087… 1345…
8 ccss "@JerryGl29176… 12201608… 2021-01-02 00:10:29 13447179758914… 1345…
9 ccss "@JBatNC304 @C… 88091448… 2021-01-02 00:09:15 13447403608625… 1345…
10 ccss "@chiefaugur I… 12495948… 2021-01-01 23:54:38 13451533915976… 1345…
# ℹ 35,223 more rows
Note that when creating a “union” like this (i.e. stacking one data frame on top of another), you should have the same number of columns in each data frame and they should be in the exact same order.
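If you're ever unsure whether two data frames line up before stacking them, a quick check like this can help (a minimal sketch using our two cleaned data frames):
# Should return TRUE: same column names, in the same order
identical(names(ccss_tweets_clean), names(ngss_tweets_clean))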
Alternatively, we could have used the bind_rows() function from {dplyr} as well:
ss_tweets <- bind_rows(ccss_tweets_clean,
ngss_tweets_clean)
ss_tweets
# A tibble: 35,233 × 6
standards text author_id created_at conversation_id id
<chr> <chr> <chr> <dttm> <chr> <chr>
1 ccss "@catturd2 Hmm… 16098543… 2021-01-02 00:49:28 13451697062071… 1345…
2 ccss "@homebrew1500… 12495948… 2021-01-02 00:40:05 13451533915976… 1345…
3 ccss "@ClayTravis D… 88770705… 2021-01-02 00:32:46 13450258639942… 1345…
4 ccss "@KarenGunby @… 12495948… 2021-01-02 00:24:01 13451533915976… 1345…
5 ccss "@keith3048 I … 12527475… 2021-01-02 00:23:42 13451533915976… 1345…
6 ccss "Probably comm… 12760173… 2021-01-02 00:18:38 13451625486818… 1345…
7 ccss "@LisaS4680 St… 92213292… 2021-01-02 00:16:11 13451595466087… 1345…
8 ccss "@JerryGl29176… 12201608… 2021-01-02 00:10:29 13447179758914… 1345…
9 ccss "@JBatNC304 @C… 88091448… 2021-01-02 00:09:15 13447403608625… 1345…
10 ccss "@chiefaugur I… 12495948… 2021-01-01 23:54:38 13451533915976… 1345…
# ℹ 35,223 more rows
The distinction between these two functions is that union() by default removes any duplicate rows that might have shown up in our queries.
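In other words, union() produces the same result as binding the rows and then dropping duplicates with distinct() (a quick sketch):
# Equivalent result: stack the data frames, then remove any duplicate rows
ss_tweets <- bind_rows(ccss_tweets_clean, ngss_tweets_clean) |>
  distinct()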
However, since both functions returned the same number of rows, it's clear we do not have any duplicates. If we wanted to verify, {dplyr} also has an intersect() function, which merges the two data frames but keeps only the rows they have in common, i.e., any duplicates.
ss_tweets_duplicate <- intersect(ccss_tweets_clean,
ngss_tweets_clean)
ss_tweets_duplicate
# A tibble: 0 × 6
# ℹ 6 variables: standards <chr>, text <chr>, author_id <chr>,
# created_at <dttm>, conversation_id <chr>, id <chr>
Your Turn ⤵
Finally, let’s take a quick look at both the head() and the tail() of this new ss_tweets data frame to make sure it contains both “ngss” and “ccss” standards and that the values for each are in the correct columns:
# YOUR CODE HERE
head(ss_tweets)
# A tibble: 6 × 6
standards text author_id created_at conversation_id id
<chr> <chr> <chr> <dttm> <chr> <chr>
1 ccss "@catturd2 Hmmm… 16098543… 2021-01-02 00:49:28 13451697062071… 1345…
2 ccss "@homebrew1500 … 12495948… 2021-01-02 00:40:05 13451533915976… 1345…
3 ccss "@ClayTravis Du… 88770705… 2021-01-02 00:32:46 13450258639942… 1345…
4 ccss "@KarenGunby @c… 12495948… 2021-01-02 00:24:01 13451533915976… 1345…
5 ccss "@keith3048 I k… 12527475… 2021-01-02 00:23:42 13451533915976… 1345…
6 ccss "Probably commo… 12760173… 2021-01-02 00:18:38 13451625486818… 1345…
tail(ss_tweets)
# A tibble: 6 × 6
standards text author_id created_at conversation_id id
<chr> <chr> <chr> <dttm> <chr> <chr>
1 ngss @BK3DSci Brian,… 558971700 2021-05-21 01:10:28 13955471161272… 1395…
2 ngss A1 My students… 14493822… 2021-05-21 01:10:20 13955474728990… 1395…
3 ngss A1: It is an im… 136014942 2021-05-21 01:09:58 13955473807585… 1395…
4 ngss @MsB_Reilly Mod… 31647215… 2021-05-21 01:09:54 13955471085775… 1395…
5 ngss A1.5 I also lov… 14449947 2021-05-21 01:09:46 13955473306029… 1395…
6 ngss @MsB_Reilly Whe… 558971700 2021-05-21 01:09:44 13955471085775… 1395…
2c. Tidy Text
Text data by its very nature is ESPECIALLY untidy and is sometimes referred to as "unstructured" data. In this section we learn some very useful functions from the {tidytext} package to convert text to and from tidy formats. Having our text in a tidy format will allow us to switch seamlessly between tidy tools and existing text mining packages, while also making it easier to visualize text summaries in other data analysis tools like Tableau.
Tokenize Text
In Chapter 1 of Text Mining with R, Silge & Robinson (2017) define the tidy text format as a table with one-token-per-row, and explain that:
A token is a meaningful unit of text, such as a word, two-word phrase (bigram), or sentence that we are interested in using for analysis. And tokenization is the process of splitting text into tokens.
This one-token-per-row structure is in contrast to the ways text is often stored for text analysis, perhaps as strings in a corpus object or in a document-term matrix. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph.
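To make the one-token-per-row idea concrete before we apply it to our tweets, here's a minimal sketch using a tiny made-up tibble (the two "tweets" are invented for illustration):
# Two made-up tweets in a tidy data frame
toy_tweets <- tibble(id = c(1, 2),
                     text = c("common core math is confusing",
                              "loving the ngss chat tonight"))
# Split each tweet into one word per row; the id column is retained
toy_tweets |>
  unnest_tokens(output = word, input = text)
# Result: 10 rows, one per word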
For this part of our workflow, our goal is to transform our ss_tweets data from this:
head(relocate(ss_tweets, text))
# A tibble: 6 × 6
text standards author_id created_at conversation_id id
<chr> <chr> <chr> <dttm> <chr> <chr>
1 "@catturd2 Hmmm… ccss 16098543… 2021-01-02 00:49:28 13451697062071… 1345…
2 "@homebrew1500 … ccss 12495948… 2021-01-02 00:40:05 13451533915976… 1345…
3 "@ClayTravis Du… ccss 88770705… 2021-01-02 00:32:46 13450258639942… 1345…
4 "@KarenGunby @c… ccss 12495948… 2021-01-02 00:24:01 13451533915976… 1345…
5 "@keith3048 I k… ccss 12527475… 2021-01-02 00:23:42 13451533915976… 1345…
6 "Probably commo… ccss 12760173… 2021-01-02 00:18:38 13451625486818… 1345…
Into a “tidy text” one-token-per-row format that looks like this:
tidy_tweets <- ss_tweets |>
unnest_tokens(output = word,
input = text) |>
relocate(word)
head(tidy_tweets)
# A tibble: 6 × 6
word standards author_id created_at conversation_id id
<chr> <chr> <chr> <dttm> <chr> <chr>
1 catturd2 ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
2 hmmmm ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
3 common ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
4 core ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
5 math ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
6 now ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
If you take ECI 588: Text Mining in Education, you’ll learn about other data structures for text analysis like the document-term matrix and corpus objects. For now, however, working with the familiar tidy data frame allows us to take advantage of popular packages that use the shared tidyverse syntax and principles for wrangling, exploring, and modeling data.
As demonstrated above, the tidytext package provides the incredibly powerful unnest_tokens() function to tokenize text (including tweets!) and convert them to a one-token-per-row format.
Let's tokenize our tweets by using this function to split each tweet into individual words, one per row, to make it easier to analyze, and take a look:
ss_tokens <- unnest_tokens(ss_tweets,
output = word,
input = text)
head(relocate(ss_tokens, word))
# A tibble: 6 × 6
word standards author_id created_at conversation_id id
<chr> <chr> <chr> <dttm> <chr> <chr>
1 catturd2 ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
2 hmmmm ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
3 common ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
4 core ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
5 math ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
6 now ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
There is A LOT to unpack with this function:
- First, notice that unnest_tokens() expects a data frame as the first argument, followed by two column names.
- The next argument is an output column name that doesn't currently exist but will be created as the text is "unnested" into it (word, in this case).
- This is followed by the input column that the text comes from, which we uncreatively named text.
- By default, a token is an individual word or "unigram", but we could use the token = argument to change our token to bigrams (2 words) or more.
- Other columns, such as author_id and created_at, are retained.
- All punctuation has been removed.
- Tokens have been changed to lowercase, which makes them easier to compare or combine with other datasets (use the to_lower = FALSE argument to turn this off if desired).
Note: Since {tidytext} follows tidy data principles, we also could have used the |> operator to pass our data frame to the unnest_tokens() function like so:
ss_tokens <- ss_tweets |>
unnest_tokens(output = word,
input = text)
ss_tokens
# A tibble: 911,173 × 6
standards author_id created_at conversation_id id word
<chr> <chr> <dttm> <chr> <chr> <chr>
1 ccss 1609854356 2021-01-02 00:49:28 13451697062071… 1345… catt…
2 ccss 1609854356 2021-01-02 00:49:28 13451697062071… 1345… hmmmm
3 ccss 1609854356 2021-01-02 00:49:28 13451697062071… 1345… comm…
4 ccss 1609854356 2021-01-02 00:49:28 13451697062071… 1345… core
5 ccss 1609854356 2021-01-02 00:49:28 13451697062071… 1345… math
6 ccss 1609854356 2021-01-02 00:49:28 13451697062071… 1345… now
7 ccss 1609854356 2021-01-02 00:49:28 13451697062071… 1345… makes
8 ccss 1609854356 2021-01-02 00:49:28 13451697062071… 1345… sense
9 ccss 1249594897113513985 2021-01-02 00:40:05 13451533915976… 1345… home…
10 ccss 1249594897113513985 2021-01-02 00:40:05 13451533915976… 1345… i
# ℹ 911,163 more rows
Your Turn ⤵
Before we move any further, let's take a quick look at the most common words in our two datasets. To do so, use the count() function from the {dplyr} package and include the sort = TRUE argument.
Hint: Like most functions we’ve introduced, the first argument count() expects is a data frame, followed by the column, in our case word, whose values we want to count:
# YOUR CODE HERE
count(ss_tokens, word,
sort = TRUE)
# A tibble: 66,859 × 2
word n
<chr> <int>
1 common 27199
2 core 26992
3 the 25896
4 to 20549
5 and 15686
6 t.co 15389
7 https 15377
8 of 13130
9 a 12543
10 math 12208
# ℹ 66,849 more rows
What are the three most common words, and how many times does each occur?
common 27199
core 26992
the 25896
As you may have noticed, many of these tweets are clearly about the CCSS and math, but beyond that it’s a bit hard to tell what the tweets are about and whether they are positive or negative because there are so many “stop words” like “the”, “to”, “and”, “in” that don’t carry much meaning by themselves.
Remove Stop Words
Often in text analysis, we will want to remove these stop words if they are not useful for an analysis. The stop_words dataset in the {tidytext} package contains stop words from three lexicons. We can use them all together, as we will here, or filter() to only use one set of stop words if that is more appropriate for a certain analysis.
Let's take a closer look at the lexicons and the stop words included in each:
View(stop_words)
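If you decide a single lexicon is more appropriate, here's a quick sketch of how you might inspect and subset stop_words (the snowball_stops name is just an example):
# How many stop words does each lexicon contribute?
stop_words |>
  count(lexicon)
# Keep only the snowball lexicon, for example
snowball_stops <- stop_words |>
  filter(lexicon == "snowball")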
The anti_join Function
In order to remove these stop words, we will use a function called anti_join(), which looks for matching values in a specific column from two datasets and returns only the rows from the original dataset that have no match.
For a good overview of the different dplyr joins see here: https://medium.com/the-codehub/beginners-guide-to-using-joins-in-r-682fc9b1f119.
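To see how this works in practice, here's a minimal sketch on made-up data:
# anti_join() keeps only rows from x with NO match in y
x <- tibble(word = c("common", "core", "the", "math"))
y <- tibble(word = c("the", "a", "an"))
anti_join(x, y, by = "word")
# Result: "common", "core", and "math" ("the" is dropped)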
Now let’s remove stop words that don’t help us learn much about what people are saying about the state standards.
ss_tokens_1 <- anti_join(ss_tokens,
stop_words,
by = "word")
head(ss_tokens_1)
# A tibble: 6 × 6
standards author_id created_at conversation_id id word
<chr> <chr> <dttm> <chr> <chr> <chr>
1 ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170311… catt…
2 ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170311… hmmmm
3 ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170311… comm…
4 ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170311… core
5 ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170311… math
6 ccss 1609854356 2021-01-02 00:49:28 1345169706207109120 1345170311… makes
Notice that we’ve specified the by = argument to look for matching words in the word column for both data sets and remove any rows from the ss_tokens dataset that match the stop_words dataset.
When we first tokenized our dataset, I conveniently chose output = word as the column name because it matches the column name word in the stop_words dataset contained in the {tidytext} package. This makes our call to anti_join() simpler because anti_join() knows to look for the column named word in each dataset.
However, the by = argument wasn't really necessary, since word is the only matching column name in both datasets and anti_join() would have matched those columns by default.
Your Turn ⤵
Use the code chunk below to take a quick count of the most common tokens in our ss_tokens_1 data frame to see if the results are a little more meaningful, then answer the questions that follow.
# YOUR CODE HERE
count(ss_tokens_1, word,
sort = TRUE)
# A tibble: 66,166 × 2
word n
<chr> <int>
1 common 27199
2 core 26992
3 t.co 15389
4 https 15377
5 math 12208
6 ngss 4290
7 ngsschat 3284
8 amp 3084
9 science 2905
10 students 2577
# ℹ 66,156 more rows
Your Turn ⤵
How many unique tokens are in our tidied text?
- 66,166
How many times does the word “math” occur in our set of tweets?
- 12208
Custom Stop Words
Notice that the nonsense word "amp" is among our high frequency words, along with tokens like "t.co" and "https" left over from shortened links. We can create our own custom stop word list to weed out any additional words that don't carry much meaning but skew our data by being so prominent.
Let's create a custom stop word list by using the simple c() function to combine our words. We can then add a filter() to keep only rows where the word column does NOT (!) match our list, i.e., !word %in% my_stopwords:
my_stopwords <- c("amp", "=", "+", "t.co", "https")
ss_tokens_2 <-
ss_tokens_1 |>
filter(!word %in% my_stopwords)
Let's take a look at our top words again and see if that did the trick:
ss_tokens_2 |>
count(word, sort = TRUE)
# A tibble: 66,163 × 2
word n
<chr> <int>
1 common 27199
2 core 26992
3 math 12208
4 ngss 4290
5 ngsschat 3284
6 science 2905
7 students 2577
8 education 2493
9 standards 2332
10 school 2212
# ℹ 66,153 more rows
Much better! Note that we could extend this stop word list indefinitely. Feel free to use the code chunk below to try adding more words to our stop list.
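For example, here's a sketch of one possible extension (the added words "gt" and "rt" are hypothetical candidates, and the ss_tokens_3 name is just for illustration):
# Hypothetical additions to the custom stop word list
my_stopwords_extended <- c(my_stopwords, "gt", "rt")
ss_tokens_3 <- ss_tokens_1 |>
  filter(!word %in% my_stopwords_extended)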
Before we move any further, let’s save our tidied tweets as a new data frame for Section 3 and also save it as a .csv file in our data folder:
ss_tidy_tweets <- ss_tokens_2
write_csv(ss_tokens_2, "data/ss_tidy_tweets.csv")
3. EXPLORE
Calculating summary statistics, data visualization, and feature engineering (the process of creating new variables from a dataset) are a key part of exploratory data analysis. For Unit 3, we’re going to keep things super simple and focus on:
Top Tokens. Since one of our goals is to compare tweets about the NGSS and CCSS standards, we'll take a look at the top 50 words that appear in each.
Word Clouds. To help illustrate the relative frequency of each of these top 50 words, we'll introduce the {wordcloud2} package for creating interactive word clouds that can be knitted with your HTML doc.
3a. Top Tokens
First, let’s take advantage of the |> operator to combine some of the functions we’ve used above with the top_n() function from the {dplyr} package. By default, this function is looking for a data frame as the first argument, and then the number of rows to return.
Let's take a look at the top tokens among the CCSS tweets by filtering our standards column for CCSS, counting the number of times each word occurs, and taking a look at the 50 most common words:
ccss_top_tokens <- ss_tidy_tweets |>
filter(standards == "ccss") |>
count(word, sort = TRUE) |>
top_n(50)
Selecting by n
ccss_top_tokens
# A tibble: 50 × 2
word n
<chr> <int>
1 common 27132
2 core 26924
3 math 12085
4 education 2104
5 standards 1856
6 school 1855
7 kids 1814
8 grade 1484
9 people 1420
10 schools 1299
# ℹ 40 more rows
Not surprisingly, our search terms appear in the top 50 but the word “math” also features prominently among CCSS tweets!
Word Clouds
Word clouds are much maligned and sometimes referred to as the “pie charts of text analysis”, but they can be useful for communicating simple summaries of qualitative data for education practitioners and are intuitive for them to interpret.
The {wordcloud2} package is a dead-simple tool for generating HTML-based interactive word clouds. By default, when you pass a data frame to the wordcloud2() function, it will look for a word column and a column with frequencies or counts, i.e., our column n that we created with the count() function.
Let’s run the wordcloud2() function on our ccss_top_tokens data frame.
wordcloud2(ccss_top_tokens)
As you can see, "math" is a pretty common topic when discussing the common core on Twitter, but words like "core" and "common" – which you can see better if you click the "show in a new window" button or run the code in your console – are not very helpful, since those were in our search terms when pulling data from Twitter.
In fact, we might want to exclude search terms like these from a final data product we share with education partners or in a publication, and instead include them in a title or caption.
ccss_top_tokens |>
filter(word != "common" & word != "core") |>
wordcloud2()
Your Turn ⤵
In the code chunk below, filter, count, and select the top 50 tokens to create a word cloud for the NGSS tweets. A gold star if you can do it without using the assignment operator or looking at the code above!
ss_tidy_tweets |>
filter(standards == "ngss") |>
count(word, sort = TRUE) |>
top_n(50) |>
wordcloud2()
Selecting by n
Also, take a look at the help file for wordcloud2() to see if there might be other ways you could improve the aesthetics of this visualization.
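For instance, one possibility among many (using the size, color, and backgroundColor arguments from the help file) might look like this:
ss_tidy_tweets |>
  filter(standards == "ngss") |>
  count(word, sort = TRUE) |>
  top_n(50) |>
  wordcloud2(size = 0.5, color = "random-dark", backgroundColor = "white")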
3b. Exploring Bigrams (Optional)
If you'd like to work with the data a little more, let's take a quick look at text analysis using bigrams, or tokens consisting of two words.
So far in this lab we have specified tokens as individual words, but many interesting text analyses are based on the relationships between words: which words tend to follow others immediately, or which words tend to co-occur within the same documents.
We can also use the unnest_tokens() function to tokenize our tweets into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we could then build a model of the relationships between them.
To specify our tokens as bigrams, we add token = "ngrams" to the unnest_tokens() function and set n to the number of words in each n-gram. Let's set n to 2 so we can examine pairs of two consecutive words, often called "bigrams":
ngss_bigrams <- ngss_tweets |>
unnest_tokens(bigram,
text,
token = "ngrams",
n = 2)
Before we move any further, let's take a quick look at the most common bigrams in our NGSS tweets:
ngss_bigrams |>
count(bigram, sort = TRUE)
# A tibble: 111,411 × 2
bigram n
<chr> <int>
1 https t.co 6240
2 ngsschat https 721
3 of the 630
4 in the 531
5 ngss https 455
6 the ngss 403
7 to the 318
8 for the 295
9 to be 272
10 on the 239
# ℹ 111,401 more rows
As we saw above, a lot of the most common bigrams are pairs of common (uninteresting) words as well. Dealing with these is a little less straightforward, and we'll need to use the separate() function from the {tidyr} package, which splits a column into multiple columns based on a delimiter. This lets us separate our bigram column into two columns, "word1" and "word2", at which point we can remove cases where either is a stop word.
library(tidyr)
bigrams_separated <- ngss_bigrams |>
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated |>
filter(!word1 %in% stop_words$word) |>
filter(!word2 %in% stop_words$word)
tidy_bigrams <- bigrams_filtered |>
unite(bigram, word1, word2, sep = " ")
Let's take a look at our bigram counts now:
tidy_bigrams |>
count(bigram, sort = TRUE)
# A tibble: 45,507 × 2
bigram n
<chr> <int>
1 https t.co 6240
2 ngsschat https 721
3 ngss https 455
4 ngss ngsschat 236
5 ngss aligned 192
6 ngss standards 168
7 ngss science 154
8 science education 148
9 science standards 112
10 teachers https 106
# ℹ 45,497 more rows
Better, but there are still many tokens not especially useful for analysis.
Let's make a custom stop word dictionary for bigrams, just like we did for our unigrams. A list is started for you below, but you'll likely want to expand this list of stop words (note that the filters below check one word at a time, so the entries should be single words, not bigrams):
my_words <- c("https", "t.co")
Now let's separate, filter, and unite again:
tidy_bigrams <- bigrams_separated |>
filter(!word1 %in% stop_words$word) |>
filter(!word2 %in% stop_words$word) |>
filter(!word1 %in% my_words) |>
filter(!word2 %in% my_words) |>
unite(bigram, word1, word2, sep = " ")
Note that since my_words is just a vector of words and not a data frame like stop_words, we do not need to select the word column using the $ operator.
Let’s take another quick count of our bigrams:
tidy_bigrams |>
count(bigram, sort = TRUE)
# A tibble: 37,539 × 2
bigram n
<chr> <int>
1 ngss ngsschat 236
2 ngss aligned 192
3 ngss standards 168
4 ngss science 154
5 science education 148
6 science standards 112
7 ngss_tweeps ngsschat 96
8 science ngss 94
9 bmsscienceteach ngss_tweeps 92
10 approved approach 89
# ℹ 37,529 more rows
Your Turn ⤵
Use the code chunk below to tidy and count our bigrams for the CCSS tweets:
# YOUR CODE HERE
ccss_bigrams <- ccss_tweets |>
unnest_tokens(bigram,
text,
token = "ngrams",
n = 2)
bigrams_separated_cc <- ccss_bigrams |>
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered_cc <- bigrams_separated_cc |>
filter(!word1 %in% stop_words$word) |>
filter(!word2 %in% stop_words$word)
tidy_bigrams_cc <- bigrams_filtered_cc |>
unite(bigram, word1, word2, sep = " ")
my_words <- c("https", "hmmmm", "catturd2 hmmmm", "in the")
tidy_bigrams_cc <- bigrams_separated_cc |>
filter(!word1 %in% stop_words$word) |>
filter(!word2 %in% stop_words$word) |>
filter(!word1 %in% my_words) |>
filter(!word2 %in% my_words) |>
unite(bigram, word1, word2, sep = " ")
tidy_bigrams_cc |>
count(bigram, sort = TRUE)
# A tibble: 93,159 × 2
bigram n
<chr> <int>
1 common core 26735
2 core math 8249
3 core standards 683
4 core education 420
5 core curriculum 372
6 gt gt 262
7 bill gates 252
8 grade common 252
9 public schools 246
10 grade level 233
# ℹ 93,149 more rows
What additional insight, if any, did looking at bigrams bring to our analysis?
- It seems helpful for really getting to the root of the conversation. In the CCSS set it's clear that "common core" as well as "core math" are couplets used frequently. The NGSS data set is similar, and bigrams can provide a cleaner analysis.
4. MODEL
Now that we have our tweets nice and tidy, we're almost ready to begin exploring public sentiment around the CCSS and NGSS standards. For this part of our workflow, we'll introduce the get_sentiments() function from {tidytext} for accessing sentiment lexicons, along with the {vader} package for scoring whole tweets.
How do you “measure” sentiment?
Sentiment analysis tries to evaluate words for their emotional association. In Text Mining with R: A Tidy Approach, Silge and Robinson point out that,
One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words.
This isn't the only way to approach sentiment analysis, but it is an easier entry point, and you'll find that it is often used in publications that utilize sentiment analysis.
The {tidytext} package provides access to several sentiment lexicons, sometimes referred to as dictionaries, based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth.
The three general-purpose lexicons we’ll focus on are:
- AFINN assigns words a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
- bing categorizes words in a binary fashion into positive and negative categories.
- nrc categorizes words in a binary fashion ("yes"/"no") into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
Note that if this is your first time using the AFINN and NRC lexicons, you may be prompted to download both. Respond yes to the prompt by entering "1" and the NRC and AFINN lexicons will download. You'll only have to do this the first time you use each lexicon.
Let’s take a quick look at each of these lexicons using the get_sentiments() function and assign them to their respective names for later use:
afinn <- get_sentiments("afinn")
afinn
# A tibble: 2,477 × 2
word value
<chr> <dbl>
1 abandon -2
2 abandoned -2
3 abandons -2
4 abducted -2
5 abduction -2
6 abductions -2
7 abhor -3
8 abhorred -3
9 abhorrent -3
10 abhors -3
# ℹ 2,467 more rows
bing <- get_sentiments("bing")
bing
# A tibble: 6,786 × 2
word sentiment
<chr> <chr>
1 2-faces negative
2 abnormal negative
3 abolish negative
4 abominable negative
5 abominably negative
6 abominate negative
7 abomination negative
8 abort negative
9 aborted negative
10 aborts negative
# ℹ 6,776 more rows
nrc <- get_sentiments("nrc")
nrc
# A tibble: 13,872 × 2
word sentiment
<chr> <chr>
1 abacus trust
2 abandon fear
3 abandon negative
4 abandon sadness
5 abandoned anger
6 abandoned fear
7 abandoned negative
8 abandoned sadness
9 abandonment anger
10 abandonment fear
# ℹ 13,862 more rows
And just out of curiosity, let’s take a look at the loughran lexicon as well:
loughran <- get_sentiments("loughran")
loughran
# A tibble: 4,150 × 2
word sentiment
<chr> <chr>
1 abandon negative
2 abandoned negative
3 abandoning negative
4 abandonment negative
5 abandonments negative
6 abandons negative
7 abdicated negative
8 abdicates negative
9 abdicating negative
10 abdication negative
# ℹ 4,140 more rows
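Although we'll rely on VADER in the next section, it's worth sketching the tidy approach Silge and Robinson describe: join a lexicon to our tidied tweets and sum the word scores for each tweet. A minimal sketch with AFINN (column names follow our ss_tidy_tweets data frame; words not in the lexicon are simply dropped by the join):
ss_tidy_tweets |>
  inner_join(afinn, by = "word") |>
  group_by(standards, id) |>     # one summed score per tweet
  summarise(tweet_sentiment = sum(value), .groups = "drop") |>
  group_by(standards) |>         # then average by standards
  summarise(mean_sentiment = mean(tweet_sentiment))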
Your Turn ⤵
How were these sentiment lexicons put together and validated? Hint: take a look at Chapter 2 from Text Mining with R.
- According to the chapter, they were created either via crowdsourcing or by the labor of the authors, and were validated using crowdsourcing or against existing labeled text such as restaurant or movie reviews.
Why should we be cautious when using and interpreting them?
- The sentiment lexicons were trained on data that is quite outdated and may not be accurate to today’s vernacular. It’s also a cautionary tale for understanding the rapid evolution of language, especially on social media!
Come to the Dark Side
As noted in the PREPARE section, the {vader} package is for the Valence Aware Dictionary for sEntiment Reasoning (VADER), a rule-based model for general sentiment analysis of social media text, specifically attuned to measuring sentiment in microblog-like contexts such as Twitter.
VADER assigns a number of different sentiment measures based on the context of the entire social media post, or in our case, a tweet. Ultimately, however, these measures are based on a sentiment lexicon similar to those you just saw above. One benefit of using VADER rather than the approaches described by Silge and Robinson is that we can use it on our tweets in their original format, skipping the text preprocessing steps demonstrated above.
One drawback to VADER is that it can take a little while to run since it's computationally intensive. Instead of analyzing tens of thousands of tweets, let's read in our original ccss-tweets.csv and take just a sample of 500 "untidy" CCSS tweets using the sample_n() function:
ccss_sample <- read_csv("data/ccss-tweets.csv") |>
sample_n(500)
Rows: 27230 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): text, source
dbl (4): author_id, id, conversation_id, in_reply_to_user_id
lgl (1): possibly_sensitive
dttm (1): created_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ccss_sample
# A tibble: 500 × 8
text created_at author_id id conversation_id source
<chr> <dttm> <dbl> <dbl> <dbl> <chr>
1 PDF Download Pr… 2021-01-25 23:05:12 1.35e18 1.35e18 1.35e18 Twitt…
2 She began to le… 2021-05-04 16:49:54 2.69e 7 1.39e18 1.39e18 Twitt…
3 @LivePDDave1 Co… 2021-02-15 01:26:33 1.65e 8 1.36e18 1.36e18 Twitt…
4 @8_inside @Cand… 2021-03-18 20:08:04 1.27e18 1.37e18 1.37e18 Twitt…
5 Common core mat… 2021-05-02 00:23:39 1.32e18 1.39e18 1.39e18 Twitt…
6 @ChumZilla Comm… 2021-01-07 02:41:20 9.77e 8 1.35e18 1.35e18 Twitt…
7 @FrancaRose33 @… 2021-01-19 19:55:24 2.57e 8 1.35e18 1.33e18 Twitt…
8 @AStopcommoncor… 2021-04-28 17:06:00 1.85e 7 1.39e18 1.39e18 Twitt…
9 @BrandonStraka … 2021-01-23 09:07:58 1.28e18 1.35e18 1.35e18 Twitt…
10 Lmaoooo cuz thi… 2021-01-25 00:21:57 2.28e 9 1.35e18 1.35e18 Twitt…
# ℹ 490 more rows
# ℹ 2 more variables: possibly_sensitive <lgl>, in_reply_to_user_id <dbl>
Note above that we passed our read_csv() output directly to the sample_n() function, rather than saving a new data frame object, passing that to sample_n(), and saving the result as yet another data frame object. The power of the |> pipe!
On to the Dark Side. The {vader} package basically has just one function, vader_df(), which does one thing and expects just one column of text from one data frame. He's very single-minded! Let's give VADER our ccss_sample data frame, using the $ operator to select only the text column containing our tweets.
Note, this may take a little while to run.
vader_ccss <- vader_df(ccss_sample$text)
head(vader_ccss)
text
1 PDF Download Prentice Hall Literature: Common Core Edition -> https://t.co/aTcEqpD3Ry
2 She began to learn common core math and my MIND. WAS. BLOWN. Because, you guys. It turns out I don't have to try to do math on a imaginary piece of paper in my head. Now I just add 30 + 50 = 80. Then I subtract 5.
3 @LivePDDave1 Common Core Math
4 @8_inside @Candidus00 @904Pestana @mdnij34 @DemNevada It was... 30 years ago when I was in school. They have decided now that common core is more important than history.
5 Common core math https://t.co/gyJa74JyEv
6 @ChumZilla Common core math
word_scores
1 {0, 0, 0, 0, 0, 0, 0, 0, 1.1, 0}
2 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
3 {0, 0, 0, 0}
4 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.093, 0, 0}
5 {0, 0, 0, 0}
6 {0, 0, 0, 0}
compound pos neu neg but_count
1 0.273 0.189 0.811 0 0
2 0.000 0.000 1.000 0 0
3 0.000 0.000 1.000 0 0
4 0.272 0.075 0.925 0 0
5 0.000 0.000 1.000 0 0
6 0.000 0.000 1.000 0 0
Take a look at the vader_ccss data frame using the View() function in the console and sort by the most positive and negative tweets.
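If you'd rather stay in code, a quick sketch that sorts by the compound score:
# Most positive tweets first; use arrange(compound) for the most negative
vader_ccss |>
  arrange(desc(compound)) |>
  select(text, compound) |>
  head()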
Does it generally seem to accurately identify positive and negative tweets? Could you find any that you think were mislabeled?
- I think the negative were labeled fairly accurately. However, the positive completely miss the nuance of sarcasm as a written tweet. Many of the highest positive are in fact very negative just through the very human jargon of sarcasm. “@Pismo_B @GHOST_LETTERS Yep, common core ! Lol” for instance ;)
Hutto, C. & Gilbert, E. (2014) provide an excellent summary of the VADER package on their GitHub repository, and I've copied an explanation of the scores below:
- The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.
NOTE: The compound score is the one most commonly used for sentiment analysis by most researchers, including the authors.
Let’s take a look at the average compound score for our CCSS sample of tweets:
mean(vader_ccss$compound)
[1] 0.012168
Overall, does your CCSS tweet sample lean slightly negative or positive? Is this what you expected?
What if we wanted to compare these results more easily to our other sentiment lexicons, just to check whether the results are fairly consistent?
The authors note that the compound score is also useful for researchers who would like to set standardized thresholds for classifying sentences as either positive, neutral, or negative. Typical threshold values are:
- positive sentiment: compound score >= 0.05
- neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
- negative sentiment: compound score <= -0.05
Let’s give that a try and see how things shake out:
vader_ccss_summary <- vader_ccss |>
mutate(sentiment = ifelse(compound >= 0.05, "positive",
ifelse(compound <= -0.05, "negative", "neutral"))) |>
count(sentiment, sort = TRUE) |>
spread(sentiment, n) |>
relocate(positive) |>
mutate(ratio = negative/positive)
vader_ccss_summary
  positive negative neutral     ratio
1 174 168 158 0.9655172
Not quite as bleak as we might have expected according to VADER! But then again, VADER brings an entirely different perspective, coming from the dark side.
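If you'd like to sanity-check VADER against one of the tidy lexicons from earlier, a rough sketch with bing might look like this (the counts won't match VADER's exactly, since bing scores individual words rather than whole tweets):
# Count bing-positive vs. bing-negative words among the CCSS tweets
ss_tidy_tweets |>
  filter(standards == "ccss") |>
  inner_join(bing, by = "word") |>
  count(sentiment)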
Your Turn ⤵
In the code chunk below, try using VADER to perform a sentiment analysis of the NGSS tweets and see how they compare:
ngss_sample <- read_csv("data/ngss-tweets.csv") |>
sample_n(500)
Rows: 8125 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): text, source
dbl (4): author_id, id, conversation_id, in_reply_to_user_id
lgl (1): possibly_sensitive
dttm (1): created_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
vader_ngss <- vader_df(ngss_sample$text)
vader_ngss_summary <- vader_ngss |>
mutate(sentiment = ifelse(compound >= 0.05, "positive",
ifelse(compound <= -0.05,
"negative", "neutral"))) |>
count(sentiment, sort = TRUE) |>
spread(sentiment, n) |>
relocate(positive) |>
mutate(ratio = negative/positive)
vader_ngss_summary
  positive negative neutral     ratio
1 334 37 129 0.1107784
How do our results compare to the CCSS sample of tweets?
- A much smaller ratio! Skewed more toward the positive, but again I think that may partly reflect the limitations of the tools at hand.
5. COMMUNICATE
In this case study, we focused on the literature guiding our analysis; wrangling our data into a one-token-per-row tidy text format; and using simple word counts and word clouds to compare common words used in tweets about the NGSS and CCSS curriculum standards. Below, add a few notes in response to the following prompts:
One thing I took away from this case study:
- Even dealing with a smaller subset of the data, this gets quite cumbersome very quickly, and the steps used create a lot of objects in the RStudio environment. Also, the sentiment analysis needs some updated training to be more accurate when quantifying positive sentiment. As is, especially on social media data, it's lacking the nuance of that particular type of grammar, jargon, etc.
One thing I want to learn more about:
- Whether there is a lexicon for social media sentiment, and if so, who in the world would be able to keep up with the rapid change! Additionally, I think it's very helpful to parse apart the tweets to quantify the discourse around this subject and numerous others. On to TikTok ;)
Congratulations - you’ve completed your first text mining case study! To complete your work, click the Render button in the toolbar. This will check all your code and create an HTML file in the Files pane that serves as a record of your work that you can open in a browser or share online.
References
Note: Citations embedded in R Markdown will only show upon knitting.