Unit 3 Case Study: Public Sentiment and the State Standards

ECI 586 Introduction to Learning Analytics

Author

Dr. Joey Huang

Published

October 24, 2025

1. PREPARE

Data sources such as digital learning environments and administrative data systems, as well as data produced by social media websites and the mass digitization of academic and practitioner publications, hold enormous potential to address a range of pressing problems in education, but collecting and analyzing text-based data also presents unique challenges. This week, our case study is guided by Josh Rosenberg’s study, Advancing new methods for understanding public sentiment about educational reforms: The case of Twitter and the Next Generation Science Standards.

We will focus on conducting a very simplistic “replication study” by comparing the sentiment of tweets about the Next Generation Science Standards (NGSS) and Common Core State Standards (CCSS) in order to better understand public reaction to these two curriculum reform efforts. Specifically, our Unit 3 case study will cover the following topics:

  1. Prepare: We’ll take a quick look at Dr. Rosenberg’s study and load packages we’ll need for analysis.
  2. Wrangle: We focus on basic text mining processes such as text tokenization and stop word removal. Specifically, we will learn how to “tidy text” so we can perform some basic analyses such as retrieving word counts and term frequencies.
  3. Explore: In order to see what insight our data provides into answering our research questions, we will calculate some simple summary statistics from our tidied text and use data visualization to highlight some of these insights.
  4. Model: We learn a little about sentiment lexicons and introduce the {vader} package to model the sentiment of tweets about the NGSS and CCSS state standards in order to better understand public reaction to these two curriculum reform efforts.
  5. Communicate: To wrap up our case study, we’ll write a brief summary of our findings and a short reflection on what we learned.

1a. Review the Literature

The Unit 3 Case Study: Public Sentiment and the State Standards is guided by a recent publication by Rosenberg et al. (2021), Understanding Public Sentiment About Educational Reforms: The Next Generation Science Standards on Twitter. This study in turn builds upon previous work by Wang & Fikis (2017) examining public opinion about the Common Core State Standards (CCSS) on Twitter. For this case study, we will focus on analyzing tweets about the Next Generation Science Standards (NGSS) and Common Core State Standards (CCSS) in order to better understand the key words and phrases that emerge, as well as public sentiment toward these two curriculum reform efforts.

Full Paper (AERA Open)

Note on Data Source: This study analyzes data from Twitter, now rebranded as X. Changes to the platform’s API access and policies mean this type of research is now more difficult to replicate.

Abstract

System-wide educational reforms are difficult to implement in the United States, but despite the difficulties, reforms can be successful, particularly when they are associated with broad public support. This study reports on the nature of the public sentiment expressed about a nationwide science education reform effort, the Next Generation Science Standards (NGSS). Through the use of data science techniques to measure the sentiment of posts on Twitter (now X) about the NGSS (N = 565,283), we found that public sentiment about the NGSS is positive, with only 11 negative posts for every 100 positive posts. In contrast to findings from past research and public opinion polling on the Common Core State Standards, sentiment about the NGSS has become more positive over time—and was especially positive for teachers. We discuss what this positive sentiment may indicate about the success of the NGSS in light of opposition to the Common Core State Standards.

Data Sources

Similar to the data we’ll be using for this case study, Rosenberg et al. used publicly accessible data from Twitter (now X) collected using the Full-Archive X API and the {rtweet} package in R. Specifically, the authors accessed tweets and user information from the hashtag-based #NGSSchat online community, as well as all tweets that included any of the following phrases, with “/” indicating an additional phrase featuring the respective plural form: “ngss”, “next generation science standard/s”, “next gen science standard/s”.

Data used in this case study were obtained prior to Twitter’s transition to X, using the {academictwitteR} package and the sadly no longer available Academic Research developer account. The Twitter API v2 endpoints allowed researchers to access the full Twitter archive, unlike a standard developer account. The data include all tweets posted from January through May of 2021 that contained any of the following terms: #ccss, common core, #ngsschat, ngss.

Below is an example of the code used to retrieve data for this case study. The chunk is set not to execute, so it will NOT run, but it illustrates the search query used, the variables selected, and the time frame.

library(academictwitteR)
library(tidyverse)

# Query the full-archive search endpoint for English-language,
# non-retweet posts mentioning the Common Core, writing the raw JSON
# to ccss-data/ rather than binding it all in memory.
ccss_tweets_2021 <-
  get_all_tweets('(#commoncore OR "common core") -is:retweet lang:en',
                 "2021-01-01T00:00:00Z",
                 "2021-05-31T00:00:00Z",
                 bearer_token,
                 data_path = "ccss-data/",
                 bind_tweets = FALSE)

# Bind the stored JSON files into a single data frame and keep only
# the variables needed for analysis.
ccss_tweets <- bind_tweet_jsons(data_path = "ccss-data/") |>
  select(text,
         created_at,
         author_id,
         id,
         conversation_id,
         source,
         possibly_sensitive,
         in_reply_to_user_id)

# Save a local copy so the case study can be run without API access.
write_csv(ccss_tweets, "data/ccss-tweets.csv")

Analysis

The authors determined tweet sentiment using the Java version of SentiStrength, which assigns tweets to two 5-point scales of sentiment, one for positivity and one for negativity, because SentiStrength is a validated measure of sentiment in short informal texts (Thelwall et al., 2011). In addition, they used this tool because Wang and Fikis (2019) used it to explore the sentiment of CCSS-related posts.

We’ll be using the AFINN sentiment lexicon, which assigns each word a score from -5 (most negative) to +5 (most positive), in addition to exploring some other sentiment lexicons to see if they produce similar results. We will use a similar approach to label tweets as positive, negative, or neutral using the {vader} package, which greatly simplifies this process.

The authors also used the {lme4} package in R to run a mixed effects model to determine whether sentiment changed over time and differed between teachers and non-teachers. We won’t try to replicate that model in this part of the study, but we will take a look at some of their findings from it in section 4.
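Purely for illustration, a model along those lines might look something like the sketch below. This is not the authors’ actual model; the data frame and column names (tweet_data, sentiment, time, is_teacher, author_id) are hypothetical stand-ins.

# Hypothetical sketch: predict tweet sentiment from time and teacher
# status, with a random intercept per author to account for repeated
# tweets from the same person.
library(lme4)

m <- lmer(sentiment ~ time + is_teacher + (1 | author_id),
          data = tweet_data)

summary(m)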

Summary of Key Findings

  1. Contrasting with sentiment about the CCSS, sentiment about the NGSS science education reform effort is overwhelmingly positive, with approximately 9 positive tweets for every negative tweet.
  2. Teachers were more positive than non-teachers, and sentiment became substantially more positive over the ten years of NGSS-related posts.
  3. Differences between the contexts of the tweets were small, but posts that did not include the #NGSSchat hashtag became more positive over time relative to those that did.
  4. When individuals posted more tweets during #NGSSchat chats, the sentiment of their posts was more positive, suggesting that while the context of individual tweets has a small effect (with posts not including the hashtag becoming more positive over time), the effect upon individuals of being involved in the #NGSSchat was positive.

1b. Define Questions

One overarching question that Silge and Robinson (2017) identify as central to text mining and natural language processing, and that we’ll explore later in this case study, is:

How do we quantify what a document or collection of documents is about?

The questions guiding the Rosenberg et al. study attempt to quantify public sentiment around the NGSS and how that sentiment changes over time. Specifically, they asked:

  1. What is the public sentiment expressed toward the NGSS?
  2. How does sentiment for teachers differ from non-teachers?
  3. How do tweets posted to #NGSSchat differ from those without the hashtag?
  4. How does participation in #NGSSchat relate to the public sentiment individuals express?
  5. How does public sentiment vary over time?

For this text mining case study, we’ll use approaches similar to those used by the authors cited above to better understand public discourse surrounding these standards, particularly as they relate to STEM education. We will also try to gauge public sentiment around the NGSS by comparing how much more positive or negative NGSS tweets are relative to CCSS tweets. Specifically, in this case study we’ll attempt to answer the following questions:

  1. What are the most frequent words or phrases used in reference to tweets about the CCSS and NGSS?
  2. How does sentiment for NGSS compare to sentiment for CCSS?

1c. Load Libraries

tidytext 📦

As we’ll learn firsthand in this module, using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. The {tidytext} package helps convert text into data frames with each row containing an individual word or sequence of words, making it easy to manipulate, summarize, and visualize text using familiar functions from the {tidyverse} collection of packages.

Let’s go ahead and load the {tidytext} package:

library(tidytext)

For a more comprehensive introduction to the {tidytext} package, I cannot recommend enough the free and excellent online book, Text Mining with R: A Tidy Approach (Silge & Robinson, 2017). If you’re interested in pursuing text analysis using R after this course, this will be a go-to reference.

The vader Package 📦


The {vader} package is for the Valence Aware Dictionary for sEntiment Reasoning (VADER), a rule-based model for general sentiment analysis of social media text and specifically attuned to measuring sentiment in microblog-like contexts.

To learn more about the {vader} package and its development, take a look at the article by Hutto and Gilbert (2014), VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text.

Let’s go ahead and load the VADER library:

library(vader)

Note: The {vader} package can take quite some time to run on a large dataset like the one we’ll be working with, so in our Model section we will examine just a small(ish) subset of tweets.

Other Packages

Finally, there are a couple of other packages we’ll need to get started. The first should look familiar, while the second, the {wordcloud2} package, is a handy little package for creating interactive word clouds.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(wordcloud2)
library(tidyr)

2. WRANGLE

The importance of data wrangling, particularly when working with text, is difficult to overstate. Just as a refresher, wrangling involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018). This case study will place a heavy emphasis on preparing text for analysis and in particular we’ll learn how to:

  1. Import Tweets. First we revisit the familiar read_csv() function for reading our CCSS and NGSS tweets into R.
  2. Restructure Data. We focus on removing extraneous data using the select() and filter() functions from {dplyr}, and revisit functions from the Tidy Your Data Primer for merging data frames.
  3. Tidy Text. Finally, we introduce the {tidytext} package to “tidy” and tokenize our tweets in order to create our data frame for analysis. We also introduce a new join function to remove “stop words” that don’t add much value to our analysis.

2a. Import Tweets from CSV

As noted above, data used in this case study were pulled using an Academic Research developer account and the {academictwitteR} package, which uses the Twitter API v2 endpoints and allows researchers to access the full Twitter archive, unlike the {rtweet} package, which limits the number of tweets and the length of time from which you can pull tweets.

Data for this case study includes all tweets from January through May of 2021 that contain the following terms: #ccss, common core, #ngsschat, ngss. Since we’ll be working with some computationally intensive functions later in this case study that can take some time to run, I restricted the time frame for my search to only a handful of months. Even so, we’ll be working with nearly 30,000 tweets and nearly 1,000,000 words for our analysis!

Let’s use the by now familiar read_csv() function to import our ccss_tweets.csv file saved in our data folder:

ccss_tweets <- read_csv("data/ccss-tweets.csv", 
          col_types = cols(author_id = col_character(), 
                           id = col_character(),
                           conversation_id = col_character(), 
                           in_reply_to_user_id = col_character()
                           )
          )

ccss_tweets
# A tibble: 27,230 × 8
   text               created_at          author_id id    conversation_id source
   <chr>              <dttm>              <chr>     <chr> <chr>           <chr> 
 1 "@catturd2 Hmmmm … 2021-01-02 00:49:28 16098543… 1345… 13451697062071… Twitt…
 2 "@homebrew1500 I … 2021-01-02 00:40:05 12495948… 1345… 13451533915976… Twitt…
 3 "@ClayTravis Dump… 2021-01-02 00:32:46 88770705… 1345… 13450258639942… Twitt…
 4 "@KarenGunby @chi… 2021-01-02 00:24:01 12495948… 1345… 13451533915976… Twitt…
 5 "@keith3048 I kno… 2021-01-02 00:23:42 12527475… 1345… 13451533915976… Twitt…
 6 "Probably common … 2021-01-02 00:18:38 12760173… 1345… 13451625486818… Twitt…
 7 "@LisaS4680 Stupi… 2021-01-02 00:16:11 92213292… 1345… 13451595466087… Twitt…
 8 "@JerryGl29176259… 2021-01-02 00:10:29 12201608… 1345… 13447179758914… Twitt…
 9 "@JBatNC304 @Cawt… 2021-01-02 00:09:15 88091448… 1345… 13447403608625… Twitt…
10 "@chiefaugur I th… 2021-01-01 23:54:38 12495948… 1345… 13451533915976… Twitt…
# ℹ 27,220 more rows
# ℹ 2 more variables: possibly_sensitive <lgl>, in_reply_to_user_id <chr>

Note the addition of the col_types = argument for changing some of the column types to character strings. Although these columns hold numbers, those numbers are identifiers for authors and tweets rather than quantities (a quick sketch of why this matters follows the list below):

  • author_id = the author of the tweet

  • id = the unique id for each tweet

  • conversation_id = the unique id for each conversation thread

  • in_reply_to_user_id = the author of the tweet being replied to
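Why does this matter? R stores numeric columns as doubles, which can only represent integers exactly up to 2^53, and 19-digit tweet IDs are far beyond that. A minimal sketch:

# Above 2^53, doubles can no longer represent every integer exactly:
2^53 == 2^53 + 1
[1] TRUE

# Tweet IDs like 1345169706207109120 are around 2^60, so reading them
# as numbers silently rounds away their last few digits; reading them
# as character strings keeps every digit intact.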

Your Turn ⤵

Complete the following code chunk to import the NGSS tweets located in the same data folder as our common core tweets and named ngss-tweets.csv. By default, R will treat the numerical IDs in our dataset as numeric values, but we will need to convert these to characters as demonstrated above for the purposes of analysis. Also, feel free to repurpose the code from above.

ngss_tweets <- read_csv("data/ngss-tweets.csv", 
          col_types = cols(author_id = col_character(), 
                           id = col_character(),
                           conversation_id = col_character(), 
                           in_reply_to_user_id = col_character()
                           )
          )
ngss_tweets
# A tibble: 8,125 × 8
   text               created_at          author_id id    conversation_id source
   <chr>              <dttm>              <chr>     <chr> <chr>           <chr> 
 1 "Please help us R… 2021-01-06 00:50:49 32799077… 1346… 13466201998945… Twitt…
 2 "What lab materia… 2021-01-06 00:45:32 10103246… 1346… 13466188701325… Hoots…
 3 "I recently saw a… 2021-01-06 00:39:37 61829645  1346… 13466173820858… Twitt…
 4 "I'm thrilled to … 2021-01-06 00:30:13 461653415 1346… 13466150172071… Twitt…
 5 "PLS RT. Excited … 2021-01-06 00:15:05 22293234  1346… 13466112069671… Twitt…
 6 "Inspired by Marg… 2021-01-06 00:00:00 33179602… 1346… 13466074140999… Tweet…
 7 "PLTW Launch is d… 2021-01-05 23:45:06 17276863  1346… 13466036638386… Hoots…
 8 "@NGSS_tweeps How… 2021-01-05 23:24:01 10230543… 1346… 13464677409499… Twitt…
 9 "@NGSS_tweeps I d… 2021-01-05 23:21:56 10230543… 1346… 13464677409499… Twitt…
10 "January 31st is … 2021-01-05 23:10:03 23679615  1346… 13465948440435… Hoots…
# ℹ 8,115 more rows
# ℹ 2 more variables: possibly_sensitive <lgl>, in_reply_to_user_id <chr>

Importing data and dealing with data types can be a bit tricky, especially for beginners. Recall from previous case studies that RStudio has an “Import Dataset” feature in the Environment pane that uses the {readr} package and associated functions to greatly facilitate this process. If you get stuck, you can copy the code generated in the lower right-hand corner of the Import Dataset window.

Now use the following code chunk to inspect the head() of each data frame and answer the questions that follow:

head(ngss_tweets)
# A tibble: 6 × 8
  text                created_at          author_id id    conversation_id source
  <chr>               <dttm>              <chr>     <chr> <chr>           <chr> 
1 "Please help us RT… 2021-01-06 00:50:49 32799077… 1346… 13466201998945… Twitt…
2 "What lab material… 2021-01-06 00:45:32 10103246… 1346… 13466188701325… Hoots…
3 "I recently saw a … 2021-01-06 00:39:37 61829645  1346… 13466173820858… Twitt…
4 "I'm thrilled to b… 2021-01-06 00:30:13 461653415 1346… 13466150172071… Twitt…
5 "PLS RT. Excited 2… 2021-01-06 00:15:05 22293234  1346… 13466112069671… Twitt…
6 "Inspired by Marga… 2021-01-06 00:00:00 33179602… 1346… 13466074140999… Tweet…
# ℹ 2 more variables: possibly_sensitive <lgl>, in_reply_to_user_id <chr>
head(ccss_tweets)
# A tibble: 6 × 8
  text                created_at          author_id id    conversation_id source
  <chr>               <dttm>              <chr>     <chr> <chr>           <chr> 
1 "@catturd2 Hmmmm “… 2021-01-02 00:49:28 16098543… 1345… 13451697062071… Twitt…
2 "@homebrew1500 I a… 2021-01-02 00:40:05 12495948… 1345… 13451533915976… Twitt…
3 "@ClayTravis Dump … 2021-01-02 00:32:46 88770705… 1345… 13450258639942… Twitt…
4 "@KarenGunby @chie… 2021-01-02 00:24:01 12495948… 1345… 13451533915976… Twitt…
5 "@keith3048 I know… 2021-01-02 00:23:42 12527475… 1345… 13451533915976… Twitt…
6 "Probably common c… 2021-01-02 00:18:38 12760173… 1345… 13451625486818… Twitt…
# ℹ 2 more variables: possibly_sensitive <lgl>, in_reply_to_user_id <chr>

Wow, so much for a family friendly case study! Based on this very limited sample, which set of standards do you think Twitter users are more negative about?

  • Definitely more negative sentiment towards the CCSS!

Let’s take a slightly larger sample of the CCSS tweets:

set.seed(586)
ccss_tweets |> 
  sample_n(20) |>
  relocate(text)
# A tibble: 20 × 8
   text               created_at          author_id id    conversation_id source
   <chr>              <dttm>              <chr>     <chr> <chr>           <chr> 
 1 "Common core math… 2021-01-24 04:45:27 578491631 1353… 13532022276287… Twitt…
 2 "@mariana057 Nope… 2021-04-15 18:08:22 46682634  1382… 13827177567479… Twitt…
 3 "Critical Race Th… 2021-05-27 09:30:19 10077879… 1397… 13978476272119… Twitt…
 4 "@Afkar_omumi @di… 2021-01-11 00:32:28 12397153… 1348… 13484220504410… Twitt…
 5 "Pastor Brian, sp… 2021-01-15 19:41:47 13146731… 1350… 13501662706673… Twitt…
 6 "Common core math… 2021-04-17 20:02:00 13769382… 1383… 13835110793812… Twitt…
 7 "@Saorsa1776 Comm… 2021-05-20 19:57:14 175150086 1395… 13953941821230… Twitt…
 8 "Common Core Math… 2021-04-30 13:36:56 13138487… 1388… 13881252185568… Twitt…
 9 "[Download] EPUB … 2021-01-09 10:11:18 13462767… 1347… 13478484159174… Twitt…
10 "Bill  &amp;  Mel… 2021-04-17 02:18:53 13523967… 1383… 13832435374839… Twitt…
11 "@ASlavitt Common… 2021-03-15 10:51:28 12062192… 1371… 13712091309674… Twitt…
12 "@LeftAccidental … 2021-03-22 19:39:43 13517576… 1374… 13740686709771… Twitt…
13 "Don’t you think … 2021-01-09 23:17:27 99623303… 1348… 13480462554993… Twitt…
14 "What?  Is that c… 2021-02-07 01:35:22 13295075… 1358… 13582278249901… Twitt…
15 "Now, he's totall… 2021-02-01 15:10:47 70214097… 1356… 13562587032641… Trump…
16 "@JackStr13435605… 2021-02-23 16:16:22 16110641… 1364… 13640566504583… Twitt…
17 "COMMON CORE FTW … 2021-05-18 22:15:24 28045276… 1394… 13947786759003… Ninte…
18 "I have a comprom… 2021-03-14 21:22:31 16693646… 1371… 13712101564831… Twitt…
19 "When she was Gov… 2021-02-12 14:00:41 911076488 1360… 13602273285934… Twitt…
20 "PDF Download Pre… 2021-01-25 23:05:12 13495578… 1353… 13538413772058… Twitt…
# ℹ 2 more variables: possibly_sensitive <lgl>, in_reply_to_user_id <chr>

Your Turn

Use the code chunk below to take a sample of the NGSS tweets. Try to do it without looking at the code above first:

set.seed(586)
ngss_tweets |> 
  sample_n(20) |>
  relocate(text)
# A tibble: 20 × 8
   text               created_at          author_id id    conversation_id source
   <chr>              <dttm>              <chr>     <chr> <chr>           <chr> 
 1 "@NewhouseBiology… 2021-04-02 01:04:05 40062074  1377… 13777884447982… Tweet…
 2 "Learn more about… 2021-02-17 14:31:07 42574210… 1362… 13620469261542… Twitt…
 3 "@philiplbell @sb… 2021-02-03 16:31:05 242075092 1357… 13569805821404… Twitt…
 4 "@KRenaeP @starrs… 2021-02-01 21:56:54 11618510… 1356… 13562275510694… Twitt…
 5 "Unpacking the fo… 2021-02-24 04:43:17 748435729 1364… 13644357071172… Twitt…
 6 "A1: How do you s… 2021-02-05 02:14:19 96102787… 1357… 13575128494767… Twitt…
 7 "The prettiest la… 2021-01-21 08:35:32 76327682… 1352… 13521729693852… Insta…
 8 "Want more inform… 2021-04-19 19:02:03 99688098… 1384… 13842207692578… Buffer
 9 "@NGSS_tweeps Tha… 2021-01-29 01:51:48 13295721… 1354… 13549483292522… Twitt…
10 "@NGSS_tweeps Yes… 2021-02-20 14:47:57 12964713… 1363… 13628226555193… Twitt…
11 "#NGSSchat A2.  A… 2021-04-16 01:32:23 10973360… 1382… 13828694498211… Twitt…
12 "PHENOMENA! See w… 2021-05-07 17:16:00 30415132… 1390… 13907170611989… Twitt…
13 "@LisaMLove1996 @… 2021-01-30 05:46:05 13295721… 1355… 13549732742796… Twitt…
14 "Check out my cla… 2021-05-22 02:16:12 21797286… 1395… 13959264365784… Twitt…
15 "KG Ss help Block… 2021-05-06 02:07:32 16231168… 1390… 13901260519765… Tweet…
16 "@SUSDscience Wel… 2021-02-06 01:39:59 17294587… 1357… 13577987920330… Twitt…
17 "Love @LenoraMCra… 2021-03-20 16:52:15 22293234  1373… 13733164682895… Twitt…
18 "my school made a… 2021-01-13 15:24:55 10339637… 1349… 13493768902939… Twitt…
19 "A1.2 Defining pr… 2021-04-16 01:12:24 40062074  1382… 13828644201377… Tweet…
20 "#Science is a gr… 2021-01-19 16:21:21 72579397… 1351… 13515654213134… Twitt…
# ℹ 2 more variables: possibly_sensitive <lgl>, in_reply_to_user_id <chr>
  1. Still of the same opinion?

    • Yes, same opinion stands. The people in the NGSS conversation are more positive than those engaging with the CCSS threads. There appear to be more practitioners of the standards, as opposed to the CCSS threads, which seem to be mostly people complaining.
  2. What else do you notice about our data sets? Record a few observations that you think are relevant to our analysis or might be useful for future analyses.

    • I think the possibly_sensitive logical column will show some interesting differences between the two data sets, with CCSS tweets being flagged much more often than NGSS tweets. I also wonder how much continued thread conversation there is in each.
  3. What questions do you have about these data sets? What are you still curious about?

    • The sheer difference in the number of recorded tweets is interesting and could speak to the observation that negative sentiment may invoke more interaction than positive sentiment; it’s the same reason the news shows negative stories instead of positive ones. I also wonder how many unique authors are in each set.

2b. Restructure Data

Subset Tweets

As you may have noticed, we have more data than we need for our analysis and should probably pare it down to just what we’ll use.

We could do this in multiple steps, creating intermediate objects like ccss_tweets_1, ccss_tweets_2, etc., but that creates unnecessary clutter in our environment and makes code harder to follow. Instead, we’ll use the |> pipe operator to chain our operations together in a single, readable flow.

Let’s clean the CCSS tweets by:

  1. Filtering out potentially sensitive content
  2. Selecting only the columns we need
  3. Adding a “standards” label column
  4. Moving that new column to the first position for easy viewing
# Clean CCSS Tweets 
ccss_tweets_clean <- ccss_tweets |>
  filter(possibly_sensitive == "FALSE") |>
  select(text, author_id, created_at, conversation_id, id) |>
  mutate(standards = "ccss") |>
  relocate(standards)

head(ccss_tweets_clean)
# A tibble: 6 × 6
  standards text             author_id created_at          conversation_id id   
  <chr>     <chr>            <chr>     <dttm>              <chr>           <chr>
1 ccss      "@catturd2 Hmmm… 16098543… 2021-01-02 00:49:28 13451697062071… 1345…
2 ccss      "@homebrew1500 … 12495948… 2021-01-02 00:40:05 13451533915976… 1345…
3 ccss      "@ClayTravis Du… 88770705… 2021-01-02 00:32:46 13450258639942… 1345…
4 ccss      "@KarenGunby @c… 12495948… 2021-01-02 00:24:01 13451533915976… 1345…
5 ccss      "@keith3048 I k… 12527475… 2021-01-02 00:23:42 13451533915976… 1345…
6 ccss      "Probably commo… 12760173… 2021-01-02 00:18:38 13451625486818… 1345…

Your Turn

Recall from section 1b. Define Questions that we are interested in comparing word usage and public sentiment around both the Common Core and Next Gen Science Standards.

Create a new ngss_tweets_clean data frame consisting of the Next Generation Science Standards tweets we imported earlier, using the code directly above as a guide.

# Clean NGSS Tweets 
ngss_tweets_clean <- ngss_tweets |>
  filter(possibly_sensitive == "FALSE") |>
  select(text, author_id, created_at, conversation_id, id) |>
  mutate(standards = "ngss") |>
  relocate(standards)

head(ngss_tweets_clean)
# A tibble: 6 × 6
  standards text             author_id created_at          conversation_id id   
  <chr>     <chr>            <chr>     <dttm>              <chr>           <chr>
1 ngss      "Please help us… 32799077… 2021-01-06 00:50:49 13466201998945… 1346…
2 ngss      "What lab mater… 10103246… 2021-01-06 00:45:32 13466188701325… 1346…
3 ngss      "I recently saw… 61829645  2021-01-06 00:39:37 13466173820858… 1346…
4 ngss      "I'm thrilled t… 461653415 2021-01-06 00:30:13 13466150172071… 1346…
5 ngss      "PLS RT. Excite… 22293234  2021-01-06 00:15:05 13466112069671… 1346…
6 ngss      "Inspired by Ma… 33179602… 2021-01-06 00:00:00 13466074140999… 1346…

Merge Data Frames

Finally, let’s combine our CCSS and NGSS tweets into a single data frame by using the union() function from {dplyr} and simply supplying the data frames that you want to combine as arguments:

ss_tweets <- union(ccss_tweets_clean,
                   ngss_tweets_clean)

ss_tweets
# A tibble: 35,233 × 6
   standards text            author_id created_at          conversation_id id   
   <chr>     <chr>           <chr>     <dttm>              <chr>           <chr>
 1 ccss      "@catturd2 Hmm… 16098543… 2021-01-02 00:49:28 13451697062071… 1345…
 2 ccss      "@homebrew1500… 12495948… 2021-01-02 00:40:05 13451533915976… 1345…
 3 ccss      "@ClayTravis D… 88770705… 2021-01-02 00:32:46 13450258639942… 1345…
 4 ccss      "@KarenGunby @… 12495948… 2021-01-02 00:24:01 13451533915976… 1345…
 5 ccss      "@keith3048 I … 12527475… 2021-01-02 00:23:42 13451533915976… 1345…
 6 ccss      "Probably comm… 12760173… 2021-01-02 00:18:38 13451625486818… 1345…
 7 ccss      "@LisaS4680 St… 92213292… 2021-01-02 00:16:11 13451595466087… 1345…
 8 ccss      "@JerryGl29176… 12201608… 2021-01-02 00:10:29 13447179758914… 1345…
 9 ccss      "@JBatNC304 @C… 88091448… 2021-01-02 00:09:15 13447403608625… 1345…
10 ccss      "@chiefaugur I… 12495948… 2021-01-01 23:54:38 13451533915976… 1345…
# ℹ 35,223 more rows

Note that when creating a “union” like this (i.e. stacking one data frame on top of another), you should have the same number of columns in each data frame and they should be in the exact same order.

Alternatively, we could have used the bind_rows() function from {dplyr} as well:

ss_tweets <- bind_rows(ccss_tweets_clean,
                       ngss_tweets_clean)

ss_tweets
# A tibble: 35,233 × 6
   standards text            author_id created_at          conversation_id id   
   <chr>     <chr>           <chr>     <dttm>              <chr>           <chr>
 1 ccss      "@catturd2 Hmm… 16098543… 2021-01-02 00:49:28 13451697062071… 1345…
 2 ccss      "@homebrew1500… 12495948… 2021-01-02 00:40:05 13451533915976… 1345…
 3 ccss      "@ClayTravis D… 88770705… 2021-01-02 00:32:46 13450258639942… 1345…
 4 ccss      "@KarenGunby @… 12495948… 2021-01-02 00:24:01 13451533915976… 1345…
 5 ccss      "@keith3048 I … 12527475… 2021-01-02 00:23:42 13451533915976… 1345…
 6 ccss      "Probably comm… 12760173… 2021-01-02 00:18:38 13451625486818… 1345…
 7 ccss      "@LisaS4680 St… 92213292… 2021-01-02 00:16:11 13451595466087… 1345…
 8 ccss      "@JerryGl29176… 12201608… 2021-01-02 00:10:29 13447179758914… 1345…
 9 ccss      "@JBatNC304 @C… 88091448… 2021-01-02 00:09:15 13447403608625… 1345…
10 ccss      "@chiefaugur I… 12495948… 2021-01-01 23:54:38 13451533915976… 1345…
# ℹ 35,223 more rows

The distinction between these two functions is that union() by default removes any duplicate rows that might have shown up in our queries.

However, since both functions returned the same number of rows, it’s clear we do not have any duplicates. If we wanted to verify, {dplyr} also has an intersect() function that merges two data frames but keeps only the rows where they intersect, i.e., the duplicate rows.

ss_tweets_duplicate <- intersect(ccss_tweets_clean,
                                 ngss_tweets_clean)

ss_tweets_duplicate
# A tibble: 0 × 6
# ℹ 6 variables: standards <chr>, text <chr>, author_id <chr>,
#   created_at <dttm>, conversation_id <chr>, id <chr>

Your Turn

Finally, let’s take a quick look at both the head() and the tail() of this new ss_tweets data frame to make sure it contains both “ngss” and “ccss” standards and that the values for each are in the correct columns:

# YOUR CODE HERE
head(ss_tweets)
# A tibble: 6 × 6
  standards text             author_id created_at          conversation_id id   
  <chr>     <chr>            <chr>     <dttm>              <chr>           <chr>
1 ccss      "@catturd2 Hmmm… 16098543… 2021-01-02 00:49:28 13451697062071… 1345…
2 ccss      "@homebrew1500 … 12495948… 2021-01-02 00:40:05 13451533915976… 1345…
3 ccss      "@ClayTravis Du… 88770705… 2021-01-02 00:32:46 13450258639942… 1345…
4 ccss      "@KarenGunby @c… 12495948… 2021-01-02 00:24:01 13451533915976… 1345…
5 ccss      "@keith3048 I k… 12527475… 2021-01-02 00:23:42 13451533915976… 1345…
6 ccss      "Probably commo… 12760173… 2021-01-02 00:18:38 13451625486818… 1345…
tail(ss_tweets)
# A tibble: 6 × 6
  standards text             author_id created_at          conversation_id id   
  <chr>     <chr>            <chr>     <dttm>              <chr>           <chr>
1 ngss      @BK3DSci Brian,… 558971700 2021-05-21 01:10:28 13955471161272… 1395…
2 ngss      A1  My students… 14493822… 2021-05-21 01:10:20 13955474728990… 1395…
3 ngss      A1: It is an im… 136014942 2021-05-21 01:09:58 13955473807585… 1395…
4 ngss      @MsB_Reilly Mod… 31647215… 2021-05-21 01:09:54 13955471085775… 1395…
5 ngss      A1.5 I also lov… 14449947  2021-05-21 01:09:46 13955473306029… 1395…
6 ngss      @MsB_Reilly Whe… 558971700 2021-05-21 01:09:44 13955471085775… 1395…

2c. Tidy Text

Text data by its very nature is ESPECIALLY untidy and is sometimes referred to as “unstructured” data. In this section we learn some very useful functions from the {tidytext} package to convert text to and from tidy formats. Having our text in a tidy format will allow us to switch seamlessly between tidy tools and existing text mining packages, while also making it easier to visualize text summaries in other data analysis tools like Tableau.

Tokenize Text

In Chapter 1 of Text Mining with R, Silge & Robinson (2017) define the tidy text format as a table with one-token-per-row, and explain that:

A token is a meaningful unit of text, such as a word, two-word phrase (bigram), or sentence that we are interested in using for analysis. And tokenization is the process of splitting text into tokens.

This one-token-per-row structure is in contrast to the ways text is often stored for text analysis, perhaps as strings in a corpus object or in a document-term matrix. For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph.

For this part of our workflow, our goal is to transform our ss_tweets data from this:

head(relocate(ss_tweets, text))
# A tibble: 6 × 6
  text             standards author_id created_at          conversation_id id   
  <chr>            <chr>     <chr>     <dttm>              <chr>           <chr>
1 "@catturd2 Hmmm… ccss      16098543… 2021-01-02 00:49:28 13451697062071… 1345…
2 "@homebrew1500 … ccss      12495948… 2021-01-02 00:40:05 13451533915976… 1345…
3 "@ClayTravis Du… ccss      88770705… 2021-01-02 00:32:46 13450258639942… 1345…
4 "@KarenGunby @c… ccss      12495948… 2021-01-02 00:24:01 13451533915976… 1345…
5 "@keith3048 I k… ccss      12527475… 2021-01-02 00:23:42 13451533915976… 1345…
6 "Probably commo… ccss      12760173… 2021-01-02 00:18:38 13451625486818… 1345…

Into a “tidy text” one-token-per-row format that looks like this:

tidy_tweets <- ss_tweets |> 
  unnest_tokens(output = word, 
                input = text) |>
  relocate(word)

head(tidy_tweets)
# A tibble: 6 × 6
  word     standards author_id  created_at          conversation_id     id      
  <chr>    <chr>     <chr>      <dttm>              <chr>               <chr>   
1 catturd2 ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
2 hmmmm    ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
3 common   ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
4 core     ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
5 math     ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
6 now      ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…

If you take ECI 588: Text Mining in Education, you’ll learn about other data structures for text analysis like the document-term matrix and corpus objects. For now, however, working with the familiar tidy data frame allows us to take advantage of popular packages that use the shared tidyverse syntax and principles for wrangling, exploring, and modeling data.
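That said, for the curious, {tidytext} also provides cast_ functions for moving from a tidy data frame to those other structures. Here is a minimal sketch, assuming per-tweet word counts and that the {tm} package is installed:

# Cast per-tweet word counts into a document-term matrix, with one
# row (document) per tweet id and one column (term) per word.
ss_dtm <- tidy_tweets |>
  count(id, word) |>
  cast_dtm(document = id, term = word, value = n)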

As demonstrated above, the {tidytext} package provides the incredibly powerful unnest_tokens() function to tokenize text (including tweets!) and convert it to a one-token-per-row format.

Let’s tokenize our tweets by using this function to split each tweet into one row per word, making it easier to analyze, and take a look:

ss_tokens <- unnest_tokens(ss_tweets, 
                           output = word, 
                           input = text)

head(relocate(ss_tokens, word))
# A tibble: 6 × 6
  word     standards author_id  created_at          conversation_id     id      
  <chr>    <chr>     <chr>      <dttm>              <chr>               <chr>   
1 catturd2 ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
2 hmmmm    ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
3 common   ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
4 core     ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
5 math     ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…
6 now      ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170…

There is A LOT to unpack with this function:

  • First notice that unnest_tokens() expects a data frame as the first argument, followed by two column names.
  • The next argument is an output column name that doesn’t currently exist but will be created as the text is “unnested” into it (word, in this case).
  • This is followed by the input column that the text comes from, which we uncreatively named text.
  • By default, a token is an individual word or “unigram” but we could use the token = argument to change our token to bigrams (2 words) or more.
  • Other columns, such as author_id and created_at, are retained.
  • All punctuation has been removed.
  • Tokens have been converted to lowercase, which makes them easier to compare or combine with other datasets (use the to_lower = FALSE argument to turn this off if desired; a quick sketch follows this list).
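Here is that quick sketch of the last argument, assuming we wanted to preserve the original casing (handy for acronyms like NGSS):

# Tokenize without lowercasing, so "NGSS" and "ngss" stay distinct.
ss_tweets |>
  unnest_tokens(output = word, input = text, to_lower = FALSE) |>
  relocate(word) |>
  head()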

Note: Since {tidytext} follows tidy data principles, we also could have used the |> operator to pass our data frame to the unnest_tokens() function like so:

ss_tokens <- ss_tweets |>
  unnest_tokens(output = word, 
                input = text)

ss_tokens
# A tibble: 911,173 × 6
   standards author_id           created_at          conversation_id id    word 
   <chr>     <chr>               <dttm>              <chr>           <chr> <chr>
 1 ccss      1609854356          2021-01-02 00:49:28 13451697062071… 1345… catt…
 2 ccss      1609854356          2021-01-02 00:49:28 13451697062071… 1345… hmmmm
 3 ccss      1609854356          2021-01-02 00:49:28 13451697062071… 1345… comm…
 4 ccss      1609854356          2021-01-02 00:49:28 13451697062071… 1345… core 
 5 ccss      1609854356          2021-01-02 00:49:28 13451697062071… 1345… math 
 6 ccss      1609854356          2021-01-02 00:49:28 13451697062071… 1345… now  
 7 ccss      1609854356          2021-01-02 00:49:28 13451697062071… 1345… makes
 8 ccss      1609854356          2021-01-02 00:49:28 13451697062071… 1345… sense
 9 ccss      1249594897113513985 2021-01-02 00:40:05 13451533915976… 1345… home…
10 ccss      1249594897113513985 2021-01-02 00:40:05 13451533915976… 1345… i    
# ℹ 911,163 more rows

Your Turn ⤵

Before we move any further, let’s take a quick look at the most common words in our two datasets. To do so, use the count() function from the {dplyr} package and include the sort = TRUE argument.

Hint: Like most functions we’ve introduced, the first argument count() expects is a data frame, followed by the column, in our case word, whose values we want to count:

# YOUR CODE HERE
count(ss_tokens, word,
      sort = TRUE)
# A tibble: 66,859 × 2
   word       n
   <chr>  <int>
 1 common 27199
 2 core   26992
 3 the    25896
 4 to     20549
 5 and    15686
 6 t.co   15389
 7 https  15377
 8 of     13130
 9 a      12543
10 math   12208
# ℹ 66,849 more rows

What are the three most common words and how many times does each occur?

  1. common 27199

  2. core 26992

  3. the 25896

As you may have noticed, many of these tweets are clearly about the CCSS and math, but beyond that it’s a bit hard to tell what the tweets are about and whether they are positive or negative because there are so many “stop words” like “the”, “to”, “and”, “in” that don’t carry much meaning by themselves.

Remove Stop Words

Often in text analysis, we will want to remove these stop words if they are not useful for an analysis. The stop_words dataset in the {tidytext} package contains stop words from three lexicons. We can use them all together, as we have here, or filter() to only use one set of stop words if that is more appropriate for a certain analysis.

Let’s take a closer look at the lexicons and the stop words included in each:

View(stop_words)
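For instance, we can count how many stop words each lexicon contributes and use filter() to keep just one of them. A quick sketch:

# How many stop words come from each lexicon?
stop_words |>
  count(lexicon)

# Keep only the smaller, more conservative snowball list.
snowball_stops <- stop_words |>
  filter(lexicon == "snowball")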

The anti_join Function

In order to remove these stop words, we will use a function called anti_join() that looks for matching values in a specific column from two datasets and returns only the rows from the original dataset that have no matches.

For a good overview of the different dplyr joins see here: https://medium.com/the-codehub/beginners-guide-to-using-joins-in-r-682fc9b1f119.

Now let’s remove stop words that don’t help us learn much about what people are saying about the state standards.

ss_tokens_1 <- anti_join(ss_tokens,
                         stop_words,
                         by = "word")

head(ss_tokens_1)
# A tibble: 6 × 6
  standards author_id  created_at          conversation_id     id          word 
  <chr>     <chr>      <dttm>              <chr>               <chr>       <chr>
1 ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170311… catt…
2 ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170311… hmmmm
3 ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170311… comm…
4 ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170311… core 
5 ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170311… math 
6 ccss      1609854356 2021-01-02 00:49:28 1345169706207109120 1345170311… makes

Notice that we’ve specified the by = argument to look for matching words in the word column for both data sets and remove any rows from the ss_tokens dataset that match the stop_words dataset.

When we first tokenized our dataset, I conveniently chose output = word as the column name because it matches the word column in the stop_words dataset contained in the {tidytext} package. This makes our call to anti_join() simpler because anti_join() knows to look for the column named word in each dataset.

However, the by = argument wasn’t strictly necessary: since word is the only column name shared by both datasets, anti_join() would have matched on it by default.
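To see this for yourself, try the call without it; dplyr detects the shared column and prints a message naming it:

ss_tokens_1 <- anti_join(ss_tokens, stop_words)
Joining with `by = join_by(word)`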

Your Turn ⤵

Use the code chunk below to take a quick count of the most common tokens in our ss_tokens_1 data frame to see if the results are a little more meaningful, then answer the questions that follow.

# YOUR CODE HERE
count(ss_tokens_1, word,
      sort = TRUE)
# A tibble: 66,166 × 2
   word         n
   <chr>    <int>
 1 common   27199
 2 core     26992
 3 t.co     15389
 4 https    15377
 5 math     12208
 6 ngss      4290
 7 ngsschat  3284
 8 amp       3084
 9 science   2905
10 students  2577
# ℹ 66,156 more rows

Your Turn ⤵

  1. How many unique tokens are in our tidied text?

    • 66,166
  2. How many times does the word “math” occur in our set of tweets?

    • 12208

Custom Stop Words

Notice that the nonsense word “amp” (an artifact of HTML-encoded ampersands, &amp;) is among our high-frequency words, along with URL fragments like “t.co” and “https”. We can create our own custom stop word list to weed out any additional words that don’t carry much meaning but skew our data by being so prominent.

Let’s create a custom stop word list by using the simple c() function to combine our words. We can then add a filter to keep only the rows where the word column does NOT (!) match a word %in% our my_stopwords list:

my_stopwords <- c("amp", "=", "+", "t.co", "https")

ss_tokens_2 <-
  ss_tokens_1 |>
  filter(!word %in% my_stopwords)

Let’s take a look at our top words again and see if that did the trick:

ss_tokens_2 |>
  count(word, sort = TRUE)
# A tibble: 66,163 × 2
   word          n
   <chr>     <int>
 1 common    27199
 2 core      26992
 3 math      12208
 4 ngss       4290
 5 ngsschat   3284
 6 science    2905
 7 students   2577
 8 education  2493
 9 standards  2332
10 school     2212
# ℹ 66,153 more rows

Much better! Note that we could extend this stop word list indefinitely. Feel free to try adding more words to our stop list.
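As a sketch, here is one possible extension. The added words are just examples of frequent but low-information terms; choose your own based on the counts above:

# Hypothetical extension of our custom stop word list.
my_stopwords_extended <- c(my_stopwords, "time", "people")

ss_tokens_1 |>
  filter(!word %in% my_stopwords_extended) |>
  count(word, sort = TRUE)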

Before we move any further, let’s save our tidied tweets as a new data frame for Section 3 and also save it as a .csv file in our data folder:

ss_tidy_tweets <- ss_tokens_2

write_csv(ss_tokens_2, "data/ss_tidy_tweets.csv")

3. EXPLORE

Calculating summary statistics, data visualization, and feature engineering (the process of creating new variables from a dataset) are a key part of exploratory data analysis. For Unit 3, we’re going to keep things super simple and focus on:

  1. Top Tokens. Since one of our goals is to compare tweets about the NGSS and CCSS standards, we’ll take a look at the top 50 words that appear in each.

  2. Word Clouds. To help illustrate the relative frequency of each of these top 50 words, we’ll introduce the {wordcloud2} package for creating interactive word clouds that can be knitted with your HTML doc.

3a. Top Tokens

First, let’s take advantage of the |> operator to combine some of the functions we’ve used above with the top_n() function from the {dplyr} package. This function expects a data frame as the first argument, followed by the number of rows to return.

Let’s take a look at the top tokens among the CCSS tweets by filtering the standards column for CCSS, counting the number of times each word occurs, and taking a look at the 50 most common words:

ccss_top_tokens <- ss_tidy_tweets |>
  filter(standards == "ccss") |>
  count(word, sort = TRUE) |>
  top_n(50)
Selecting by n
ccss_top_tokens
# A tibble: 50 × 2
   word          n
   <chr>     <int>
 1 common    27132
 2 core      26924
 3 math      12085
 4 education  2104
 5 standards  1856
 6 school     1855
 7 kids       1814
 8 grade      1484
 9 people     1420
10 schools    1299
# ℹ 40 more rows

Not surprisingly, our search terms appear in the top 50 but the word “math” also features prominently among CCSS tweets!
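As a side note, top_n() has been superseded in recent versions of {dplyr}. An equivalent using slice_max() names the ranking column explicitly and avoids the “Selecting by n” message:

# Equivalent to top_n(50), with the ranking column stated explicitly.
ccss_top_tokens <- ss_tidy_tweets |>
  filter(standards == "ccss") |>
  count(word, sort = TRUE) |>
  slice_max(n, n = 50)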

Word Clouds

Word clouds are much maligned and sometimes referred to as the “pie charts of text analysis”, but they can be useful for communicating simple summaries of qualitative data to education practitioners and are intuitive to interpret.

The {wordcloud2} package is a dead-simple tool for generating HTML-based interactive word clouds. By default, when you pass a data frame to the wordcloud2() function, it will look for a word column and a column with frequencies or counts, i.e., the column n that we created with the count() function.

Let’s run the wordcloud2() function on our ccss_top_tokens data frame.

wordcloud2(ccss_top_tokens)

As you can see, “math” is a pretty common topic when discussing the Common Core on Twitter, but words like “core” and “common” (which you can see better if you click the “show in a new window” button or run the code in your console) are not very helpful, since those were in our search terms when pulling data from Twitter.

In fact, we might want to exclude search terms like these from a final data product we share with education partners or in a publication, and instead include them in a title or caption.

ccss_top_tokens |>
  filter(word != "common" & word != "core") |>
  wordcloud2()

Your Turn ⤵

In the code chunk below, filter, count, and select the top 50 tokens to create a word cloud for the NGSS tweets. A gold star if you can do it without using the assignment operator or looking at the code above!

ss_tidy_tweets |>
  filter(standards == "ngss") |>
  count(word, sort = TRUE) |>
  top_n(50) |>
  wordcloud2()
Selecting by n

Also, take a look at the help file for wordcloud2() to see if there might be other ways you could improve the aesthetics of this visualization.
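For example, a few of its arguments control the scale, color palette, and background. The values below are just one possibility:

# One possible styling: slightly smaller words, light colors on a
# dark background.
ss_tidy_tweets |>
  filter(standards == "ngss") |>
  count(word, sort = TRUE) |>
  top_n(50) |>
  wordcloud2(size = 0.8, color = "random-light", backgroundColor = "black")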

3b. Exploring Bigrams (Optional)

If you’d like to work with these data a little more, let’s take a quick look at text analysis using bigrams, or tokens consisting of two words.

So far in this lab, we have specified tokens as individual words, but many interesting text analyses are based on the relationships between words: which words tend to follow others immediately, or which words tend to co-occur within the same documents.

We can also use the unnest_tokens() function to tokenize our tweets into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we could then build a model of the relationships between them.

To specify our tokens as bigrams, we add token = "ngrams" to the unnest_tokens() function and set n to the number of words in each n-gram. Let’s set n to 2 so we can examine pairs of two consecutive words, often called “bigrams”:

ngss_bigrams <- ngss_tweets |> 
  unnest_tokens(bigram, 
                text, 
                token = "ngrams", 
                n = 2)

Before we move any further let’s take a quick look at the most common bigrams in our NGSS tweets:

ngss_bigrams |> 
  count(bigram, sort = TRUE)
# A tibble: 111,411 × 2
   bigram             n
   <chr>          <int>
 1 https t.co      6240
 2 ngsschat https   721
 3 of the           630
 4 in the           531
 5 ngss https       455
 6 the ngss         403
 7 to the           318
 8 for the          295
 9 to be            272
10 on the           239
# ℹ 111,401 more rows

As we saw above, a lot of the most common bigrams are pairs of common (uninteresting) words as well. Dealing with these is a little less straightforward, and we’ll need the separate() function from the {tidyr} package, which splits one column into multiple columns based on a delimiter. This lets us separate our bigram column into two columns, “word1” and “word2”, at which point we can remove cases where either is a stop word.

library(tidyr)
bigrams_separated <- ngss_bigrams |>
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated |>
  filter(!word1 %in% stop_words$word) |>
  filter(!word2 %in% stop_words$word)

tidy_bigrams <- bigrams_filtered |>
  unite(bigram, word1, word2, sep = " ")

Let’s take a look at our bigram counts now:

tidy_bigrams |> 
  count(bigram, sort = TRUE)
# A tibble: 45,507 × 2
   bigram                n
   <chr>             <int>
 1 https t.co         6240
 2 ngsschat https      721
 3 ngss https          455
 4 ngss ngsschat       236
 5 ngss aligned        192
 6 ngss standards      168
 7 ngss science        154
 8 science education   148
 9 science standards   112
10 teachers https      106
# ℹ 45,497 more rows

Better, but there are still many tokens that are not especially useful for analysis.

Let’s make a custom stop word dictionary for bigrams, just like we did for our unigrams. A short list is started for you below. Note that because we filter word1 and word2 separately, each entry should be a single word (a two-word entry like “ngss https” would never match); you will likely want to expand this list:

my_words <- c("https", "t.co", "ngss https", "teachers https")

Now let’s separate, filter, and unite again:

tidy_bigrams <- bigrams_separated |>
  filter(!word1 %in% stop_words$word) |>
  filter(!word2 %in% stop_words$word) |>
  filter(!word1 %in% my_words) |>
  filter(!word2 %in% my_words) |>
  unite(bigram, word1, word2, sep = " ")

Note that since my_words is just a vector of words and not a data frame like stop_words, we do not need to select a word column using the $ operator.

Let’s take another quick count of our bigrams:

tidy_bigrams |> 
  count(bigram, sort = TRUE)
# A tibble: 37,539 × 2
   bigram                          n
   <chr>                       <int>
 1 ngss ngsschat                 236
 2 ngss aligned                  192
 3 ngss standards                168
 4 ngss science                  154
 5 science education             148
 6 science standards             112
 7 ngss_tweeps ngsschat           96
 8 science ngss                   94
 9 bmsscienceteach ngss_tweeps    92
10 approved approach              89
# ℹ 37,529 more rows

Your Turn ⤵

Use the code chunk below to tidy and count our bigrams for the CCSS tweets:

# YOUR CODE HERE
ccss_bigrams <- ccss_tweets |> 
  unnest_tokens(bigram, 
                text, 
                token = "ngrams", 
                n = 2)
bigrams_separated_cc <- ccss_bigrams |>
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered_cc <- bigrams_separated_cc |>
  filter(!word1 %in% stop_words$word) |>
  filter(!word2 %in% stop_words$word)

tidy_bigrams_cc <- bigrams_filtered_cc |>
  unite(bigram, word1, word2, sep = " ")
my_words <- c("https", "hmmmm", "catturd2 hmmmm", "in the")
tidy_bigrams_cc <- bigrams_separated_cc |>
  filter(!word1 %in% stop_words$word) |>
  filter(!word2 %in% stop_words$word) |>
  filter(!word1 %in% my_words) |>
  filter(!word2 %in% my_words) |>
  unite(bigram, word1, word2, sep = " ")
tidy_bigrams_cc |> 
  count(bigram, sort = TRUE)
# A tibble: 93,159 × 2
   bigram              n
   <chr>           <int>
 1 common core     26735
 2 core math        8249
 3 core standards    683
 4 core education    420
 5 core curriculum   372
 6 gt gt             262
 7 bill gates        252
 8 grade common      252
 9 public schools    246
10 grade level       233
# ℹ 93,149 more rows

What additional insight, if any, did looking at bigrams bring to our analysis?

  • It seems helpful for really getting to the root of the conversation. In the CCSS set it’s clear that “common core” as well as “core math” are couplets used frequently. The NGSS data set is similar, and bigrams can provide a cleaner analysis.

4. MODEL

Now that we have our tweets nice and tidy, we’re almost ready to begin exploring public sentiment around the CCSS and NGSS standards. For this part of our workflow we introduce the get_sentiments() function from {tidytext} and the vader_df() function from {vader}.

How do you “measure” sentiment?

Sentiment analysis tries to evaluate words for their emotional association. In Text Mining with R: A Tidy Approach, Silge and Robinson point out that,

One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words.

This isn’t the only way to approach sentiment analysis, but it is an easy entry point, and you’ll find it is often used in publications that employ sentiment analysis.
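To make that concrete, here is a minimal sketch of the word-sum idea using the AFINN lexicon (introduced below) and the tidied tweets from Section 2: join each token to its score, then sum the scores within each tweet.

# Score each tweet by summing the AFINN values of its words; tweets
# containing no lexicon words simply drop out of the inner join.
ss_tidy_tweets |>
  inner_join(get_sentiments("afinn"), by = "word") |>
  group_by(id) |>
  summarise(sentiment = sum(value))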

The {tidytext} package provides access to several sentiment lexicons, sometimes referred to as dictionaries, based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth.

The three general-purpose lexicons we’ll focus on are:

  • AFINN assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

  • bing categorizes words in a binary fashion into positive and negative categories.

  • nrc categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

Note that if this is your first time using the AFINN and NRC lexicons, you may be prompted to download them. Respond yes to the prompt by entering “1” and the NRC and AFINN lexicons will download. You’ll only have to do this the first time you use each lexicon.

Let’s take a quick look at each of these lexicons using the get_sentiments() function and assign them to their respective names for later use:

afinn <- get_sentiments("afinn")

afinn
# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# ℹ 2,467 more rows
bing <- get_sentiments("bing")

bing
# A tibble: 6,786 × 2
   word        sentiment
   <chr>       <chr>    
 1 2-faces     negative 
 2 abnormal    negative 
 3 abolish     negative 
 4 abominable  negative 
 5 abominably  negative 
 6 abominate   negative 
 7 abomination negative 
 8 abort       negative 
 9 aborted     negative 
10 aborts      negative 
# ℹ 6,776 more rows
nrc <- get_sentiments("nrc")

nrc
# A tibble: 13,872 × 2
   word        sentiment
   <chr>       <chr>    
 1 abacus      trust    
 2 abandon     fear     
 3 abandon     negative 
 4 abandon     sadness  
 5 abandoned   anger    
 6 abandoned   fear     
 7 abandoned   negative 
 8 abandoned   sadness  
 9 abandonment anger    
10 abandonment fear     
# ℹ 13,862 more rows

And just out of curiosity, let’s take a look at the loughran lexicon as well:

loughran <- get_sentiments("loughran")

loughran
# A tibble: 4,150 × 2
   word         sentiment
   <chr>        <chr>    
 1 abandon      negative 
 2 abandoned    negative 
 3 abandoning   negative 
 4 abandonment  negative 
 5 abandonments negative 
 6 abandons     negative 
 7 abdicated    negative 
 8 abdicates    negative 
 9 abdicating   negative 
10 abdication   negative 
# ℹ 4,140 more rows

Your Turn ⤵

  1. How were these sentiment lexicons put together and validated? Hint: take a look at Chapter 2 from Text Mining with R.

    • According to the chapter, they were constructed either through crowdsourcing or by the labor of their authors, and were validated using some combination of crowdsourcing, restaurant or movie reviews, and Twitter data.
  2. Why should we be cautious when using and interpreting them?

    • The sentiment lexicons are based on text sources that are now somewhat dated and may not reflect today's vernacular. It's also a cautionary tale about the rapid evolution of language, especially on social media! (One quick way to see this concretely is sketched below.)
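To make that caution concrete, we can look for words the lexicons actually disagree on, reusing the afinn and bing objects we created above. The sketch keeps words where AFINN's numeric score and bing's label point in opposite directions:

# Words scored positive by AFINN but labeled negative by bing (or the
# reverse) -- a reminder that lexicons encode human judgment calls.
afinn |>
  inner_join(bing, by = "word") |>
  filter((value > 0 & sentiment == "negative") |
         (value < 0 & sentiment == "positive"))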

Come to the Dark Side

As noted in the PREPARE section, the {vader} package implements the Valence Aware Dictionary and sEntiment Reasoner (VADER), a rule-based model for general sentiment analysis of social media text, specifically attuned to measuring sentiment in microblog-like contexts such as Twitter.

VADER assigns a number of different sentiment measures based on the context of the entire social media post, or in our case a tweet. Ultimately, however, these measures are based on a sentiment lexicon similar to those you just saw above. One benefit of using VADER rather than the approaches described by Silge and Robinson is that we can use it with our tweets in their original format and skip the text preprocessing steps demonstrated above.
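Before scoring a whole sample, it can help to sanity-check VADER on a single string with get_vader(), the package's one-string-at-a-time counterpart to the vader_df() function we'll use below (the sentence here is made up for illustration):

library(vader)

# Returns word-level valence scores plus the compound, pos, neu, and neg
# summary measures for a single text string.
get_vader("The new standards are great, but the rollout was a mess.")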

One drawback to VADER is that it can take a little while to run since it's computationally intensive. So instead of analyzing tens of thousands of tweets, let's read in our original ccss-tweets.csv file and take just a sample of 500 "untidy" CCSS tweets using the sample_n() function:

ccss_sample <- read_csv("data/ccss-tweets.csv") |>
  sample_n(500)
Rows: 27230 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): text, source
dbl  (4): author_id, id, conversation_id, in_reply_to_user_id
lgl  (1): possibly_sensitive
dttm (1): created_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ccss_sample
# A tibble: 500 × 8
   text             created_at          author_id      id conversation_id source
   <chr>            <dttm>                  <dbl>   <dbl>           <dbl> <chr> 
 1 PDF Download Pr… 2021-01-25 23:05:12   1.35e18 1.35e18         1.35e18 Twitt…
 2 She began to le… 2021-05-04 16:49:54   2.69e 7 1.39e18         1.39e18 Twitt…
 3 @LivePDDave1 Co… 2021-02-15 01:26:33   1.65e 8 1.36e18         1.36e18 Twitt…
 4 @8_inside @Cand… 2021-03-18 20:08:04   1.27e18 1.37e18         1.37e18 Twitt…
 5 Common core mat… 2021-05-02 00:23:39   1.32e18 1.39e18         1.39e18 Twitt…
 6 @ChumZilla Comm… 2021-01-07 02:41:20   9.77e 8 1.35e18         1.35e18 Twitt…
 7 @FrancaRose33 @… 2021-01-19 19:55:24   2.57e 8 1.35e18         1.33e18 Twitt…
 8 @AStopcommoncor… 2021-04-28 17:06:00   1.85e 7 1.39e18         1.39e18 Twitt…
 9 @BrandonStraka … 2021-01-23 09:07:58   1.28e18 1.35e18         1.35e18 Twitt…
10 Lmaoooo cuz thi… 2021-01-25 00:21:57   2.28e 9 1.35e18         1.35e18 Twitt…
# ℹ 490 more rows
# ℹ 2 more variables: possibly_sensitive <lgl>, in_reply_to_user_id <dbl>

Note above that we passed our read_csv() output directly to the sample_n() function rather than saving a new data frame object, passing that to sample_n(), and saving the result as yet another data frame object. The power of the |> pipe!
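One practical aside: because sample_n() draws a random sample, your 500 tweets (and all of the VADER scores that follow) will differ each time the chunk runs. If you want reproducible results, set a seed before sampling (the seed value here is arbitrary):

# Fix R's random number generator so the same 500 tweets are drawn each run.
set.seed(2025)

ccss_sample <- read_csv("data/ccss-tweets.csv") |>
  sample_n(500)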

On to the Dark Side. The {vader} package has essentially just one function, vader_df(), which does one thing and expects just one column of text from one data frame. He's very single-minded! Let's give VADER our ccss_sample data frame, using the $ operator to select only the text column containing our tweets.

Note, this may take a little while to run.

vader_ccss <- vader_df(ccss_sample$text)

head(vader_ccss)
                                                                                                                                                                                                                   text
1                                                                                                                              PDF Download Prentice Hall Literature: Common Core Edition -&gt; https://t.co/aTcEqpD3Ry
2 She began to learn common core math and my MIND. WAS. BLOWN. Because, you guys. It turns out I don't have to try to do math on a imaginary piece of paper in my head. Now I just add 30 + 50 = 80. Then I subtract 5.
3                                                                                                                                                                                         @LivePDDave1 Common Core Math
4                                             @8_inside @Candidus00 @904Pestana @mdnij34 @DemNevada It was... 30 years ago when I was in school. They have decided now that common core is more important than history.
5                                                                                                                                                                              Common core math https://t.co/gyJa74JyEv
6                                                                                                                                                                                           @ChumZilla Common core math
                                                                                                                                       word_scores
1                                                                                                                 {0, 0, 0, 0, 0, 0, 0, 0, 1.1, 0}
2 {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
3                                                                                                                                     {0, 0, 0, 0}
4                                                            {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.093, 0, 0}
5                                                                                                                                     {0, 0, 0, 0}
6                                                                                                                                     {0, 0, 0, 0}
  compound   pos   neu neg but_count
1    0.273 0.189 0.811   0         0
2    0.000 0.000 1.000   0         0
3    0.000 0.000 1.000   0         0
4    0.272 0.075 0.925   0         0
5    0.000 0.000 1.000   0         0
6    0.000 0.000 1.000   0         0

Take a look at the vader_ccss data frame using the View() function in the console and sort by the compound column to find the most positive and negative tweets.

Does it generally seem to accurately identify positive and negative tweets? Could you find any that you think were mislabeled?

  • I think the negative tweets were labeled fairly accurately. However, the positive scores completely miss the nuance of sarcasm in written tweets. Many of the highest-scoring "positive" tweets are in fact very negative, couched in that very human jargon of sarcasm: "@Pismo_B @GHOST_LETTERS Yep, common core ! Lol", for instance ;)
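If you'd rather sort in code than in the View() pane, a quick arrange() gives the same ranking (a sketch using the vader_ccss data frame we just created):

# Most positive tweets first; use arrange(compound) for the most negative.
vader_ccss |>
  arrange(desc(compound)) |>
  select(text, compound) |>
  head()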

Hutto & Gilbert (2014) provide an excellent summary of VADER on their GitHub repository, and I've copied an explanation of the scores below:

  • The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a ‘normalized, weighted composite score’ is accurate.

NOTE: The compound score is the one most commonly used for sentiment analysis by most researchers, including the authors.
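For the curious, the normalization itself is simple arithmetic: in the reference implementation, the summed word valences x are squashed into the interval (-1, 1) via x / sqrt(x^2 + alpha), with alpha = 15. A quick sketch checks this against tweet 1 above, whose word scores sum to 1.1:

# VADER's normalization (alpha = 15 in the reference implementation)
# squashes an unbounded sum of word valences into (-1, 1).
normalize_vader <- function(x, alpha = 15) {
  x / sqrt(x^2 + alpha)
}

normalize_vader(1.1) # ~0.273, matching tweet 1's compound score above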

Let’s take a look at the average compound score for our CCSS sample of tweets:

mean(vader_ccss$compound)
[1] 0.012168

Overall, does your sample of CCSS tweets lean slightly negative or positive? Is this what you expected?

What if we wanted to compare these results more easily to our other sentiment lexicons, just to check whether the results are fairly consistent?
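One rough check (a sketch reusing the tidying steps from earlier plus the bing object loaded above) is to tokenize the same 500-tweet sample, join it to a lexicon, and compute a comparable negative-to-positive breakdown:

# Tokenize the sampled tweets, drop stop words, and tally bing sentiment;
# assumes the tidyverse and tidytext packages loaded in PREPARE.
ccss_sample |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word") |>
  inner_join(bing, by = "word") |>
  count(sentiment) |>
  spread(sentiment, n) |>
  mutate(ratio = negative / positive)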

Returning to the compound score, the authors note that it is also useful for researchers who would like to set standardized thresholds for classifying sentences as either positive, neutral, or negative. Typical threshold values are:

  • positive sentiment: compound score >= 0.05

  • neutral sentiment: (compound score > -0.05) and (compound score < 0.05)

  • negative sentiment: compound score <= -0.05

Let’s give that a try and see how things shake out:

vader_ccss_summary <- vader_ccss |> 
  mutate(sentiment = ifelse(compound >= 0.05, "positive",
                            ifelse(compound <= -0.05, "negative", "neutral"))) |>
  count(sentiment, sort = TRUE) |> 
  spread(sentiment, n) |> 
  relocate(positive) |>
  mutate(ratio = negative/positive)

vader_ccss_summary
  positive negative neutral     ratio
1      174      168     158 0.9655172

Not quite as bleak as we might have expected, according to VADER! But then again, VADER brings an entirely different perspective, coming as he does from the dark side.
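As an aside, dplyr's case_when() expresses the same three-way classification a bit more readably than nested ifelse() calls; this sketch reproduces the counts behind vader_ccss_summary:

# Equivalent three-way classification with case_when().
vader_ccss |>
  mutate(sentiment = case_when(
    compound >= 0.05 ~ "positive",
    compound <= -0.05 ~ "negative",
    TRUE ~ "neutral"
  )) |>
  count(sentiment, sort = TRUE)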

Your Turn ⤵

In the code chunk below, try using VADER to perform a sentiment analysis of the NGSS tweets and see how they compare:

ngss_sample <- read_csv("data/ngss-tweets.csv") |>
  sample_n(500)
Rows: 8125 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): text, source
dbl  (4): author_id, id, conversation_id, in_reply_to_user_id
lgl  (1): possibly_sensitive
dttm (1): created_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
vader_ngss <- vader_df(ngss_sample$text)

vader_ngss_summary <- vader_ngss |> 
  mutate(sentiment = ifelse(compound >= 0.05, "positive",
                            ifelse(compound <= -0.05, 
                                   "negative", "neutral"))) |>
  count(sentiment, sort = TRUE) |> 
  spread(sentiment, n) |> 
  relocate(positive) |>
  mutate(ratio = negative/positive)

vader_ngss_summary
  positive negative neutral     ratio
1      334       37     129 0.1107784

How do our results compare to the CCSS sample of tweets?

  • A much smaller negative-to-positive ratio! The NGSS tweets skew far more positive, though again I suspect some of that positivity reflects the tools' blindness to sarcasm. (See the additional check below.)
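For a more direct comparison with the CCSS sample, we can also compute the NGSS sample's average compound score, just as we did above (your value will vary with the random sample):

# Average compound score across the 500 sampled NGSS tweets.
mean(vader_ngss$compound)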

5. COMMUNICATE

In this case study, we reviewed the literature guiding our analysis; wrangled our data into a one-token-per-row tidy text format; used simple word counts and word clouds to compare common words in tweets about the NGSS and CCSS curriculum standards; and modeled the sentiment of those tweets using lexicons and the {vader} package. Below, add a few notes in response to the following prompts:

  1. One thing I took away from this case study:

    • Even when working with a smaller subset of the data, this process gets cumbersome very quickly, and the steps involved create a lot of objects in the RStudio environment. Also, the sentiment lexicons need updating to more accurately quantify positive sentiment; as is, especially on social media data, they miss the nuance of that platform's particular grammar and jargon.
  2. One thing I want to learn more about:

    • Whether there is a lexicon built specifically for social media sentiment, and if so, who in the world would be able to keep up with the rapid change! Additionally, I think it's very helpful to parse apart the tweets to quantify the discourse around this subject and numerous others. On to TikTok ;)

Congratulations - you’ve completed your first text mining case study! To complete your work, click the Render button in the toolbar. This will check all your code and create an HTML file in the Files pane that serves as a record of your work that you can open in a browser or share online.

References

Note: Citations embedded in R Markdown will only show upon knitting.


Hutto, C. J., & Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM-14).
Krumm, A., Means, B., & Bienkowski, M. (2018). Learning analytics goes to school. Routledge. https://doi.org/10.4324/9781315650722
Rosenberg, J. M., Borchers, C., Dyer, E. B., Anderson, D., & Fischer, C. (2021). Understanding public sentiment about educational reforms: The Next Generation Science Standards on Twitter. AERA Open, 7. https://doi.org/10.1177/23328584211024261
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O'Reilly Media. https://www.tidytextmining.com
Wang, Y., & Fikis, D. J. (2017). Common Core State Standards on Twitter: Public sentiment and opinion leaders. Educational Policy, 33(4), 650–683. https://doi.org/10.1177/0895904817723739