Independent Analysis: Assessing Sentiment

0. INTRODUCTION

This project reflects a very simplistic “replication study” by comparing the sentiment of tweets about the Next Generation Science Standards (NGSS) and Common Core State Standards (CCSS) in order to better understand public reaction to these two curriculum reform efforts. The text is based on our course digital textbook, Unit 2 Walkthrough: Twitter Sentiment and School Reform by Dr. Shiyan Jiang. The focus will be on using the Twitter API to import data on topics or tweets of interest and using sentiment lexicons to help gauge public opinion about those topics or tweets. Silge & Robinson nicely illustrate the tools of text mining to approach the emotional content of text programmatically, in the following diagram:

The steps of the process are as follows:

Prepare: Prior to analysis, it’s critical to understand the context and data sources you’re working with so you can formulate useful and answerable questions. We’ll take a quick look at Dr. Rosenberg’s study as well as data available through Twitter’s API.
Wrangle: In section 2 we revisit tidying and tokenizing text, and append sentiment scores to tweets using the AFINN, bing, and nrc sentiment lexicons.
Explore: In section 3, we use simple summary statistics and basic data visualization to compare sentiment between NGSS and CCSS tweets.
Model: We will examine the mixed effects model used by Rosenberg et al. to analyze the sentiment of tweets.
Communicate: Finally, I will share findings and insights from the analysis.

1. PREPARE

1b. Guiding Questions

Our (very) specific questions of interest for this project are:

What is the public sentiment expressed toward the NGSS?
How does sentiment for NGSS compare to sentiment for CCSS?

1c. Set Up Libraries

The first step of the workflow is to set up a “Project” within RStudio. The next step is to open a new R script and load the following packages:

library(dplyr)
library(readr)
library(tidyr)
library(rtweet)
library(writexl)
library(readxl)
library(tidytext)
library(textdata)
library(ggplot2)
library(scales)

Access Twitter Data

The Twitter data for this study were accessed through files in the course GitHub repository.

2. WRANGLE

In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018).

Import Data. In this section, we introduce the rtweet package and some key functions to search for tweets or users of interest.
Tidy Tweets. We revisit the tidytext package to both “tidy” and tokenize our tweets in order to create our data frame for analysis.
Get Sentiments. We conclude our data wrangling by introducing sentiment lexicons and the inner_join() function for appending sentiment values to our data frame.

2a. Import Tweets

The Import Tweets section introduces the following functions from the rtweet package for reading Twitter data into R:

search_tweets() Pulls up to 18,000 tweets from the last 6-9 days matching provided search terms.
search_tweets2() Returns data from multiple search queries.
get_timelines() Returns up to 3,200 tweets of one or more specified Twitter users.

Search Tweets

We will use the search_tweets() function to try reading into R 5,000 tweets containing the NGSS hashtag and store as a new data frame ngss_all_tweets.

Type or copy the following code into your R script or console and run:

Note that the first argument the search_tweets() function expects, q =, is the search term enclosed in quotation marks, and that n = specifies the maximum number of tweets to return.

Remove Retweets

While not explicitly mentioned in the paper, it’s likely the authors removed retweets in their query, since a retweet is simply one user reposting another user’s tweet and would duplicate the exact content of the original.

Let’s use the include_rts = argument to remove any retweets by setting it to FALSE:

Using the OR Operator

If you recall from [Section 1a], the authors accessed tweets and user information from the hashtag-based #NGSSchat online community, as well as all tweets that included any of the following phrases, with “/” indicating an additional phrase featuring the respective plural form: “ngss”, “next generation science standard/s”, “next gen science standard/s”.

Let’s modify our query using the OR operator to also include “ngss” so it will return tweets containing either #NGSSchat or “ngss” and assign to ngss_or_tweets:

Use Multiple Queries

Unfortunately, the OR operator will only get us so far. In order to include the additional search terms, we will need to use the c() function to combine our search terms into a single character vector.

The rtweet package has an additional search_tweets2() function for using multiple queries in a search. To search for an exact phrase, either wrap single quotes around a search query that uses double quotes, e.g., q = '"next gen science standard"', or escape each internal double quote with a single backslash, e.g., q = "\"next gen science standard\"".

Copy and paste the following code to store the results of our query in ngss_tweets:
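
As a minimal sketch of how these queries can be assembled (not necessarily the authors' exact code), the block below builds the OR query and the multi-query vector in base R. The helper name fetch_ngss_tweets() is hypothetical, and the call to search_tweets2() is wrapped in a function so that nothing is requested from Twitter until it is called with the rtweet package installed and valid API credentials.

```r
# Combine two terms with the OR operator into a single query string.
or_query <- paste(c("#NGSSchat", "ngss"), collapse = " OR ")

# Exact phrases need their own double quotes inside the R string.
# Both quoting styles described in the text produce the same string.
phrase_single  <- '"next gen science standard"'
phrase_escaped <- "\"next gen science standard\""
same_phrase <- identical(phrase_single, phrase_escaped)

# One query per element, combined with c() into a character vector.
ngss_queries <- c(
  "#NGSSchat",
  "ngss",
  '"next generation science standard"',
  '"next generation science standards"',
  '"next gen science standard"',
  '"next gen science standards"'
)

# Hypothetical wrapper (an assumption for illustration): it only touches
# the Twitter API when called, and needs rtweet plus API access to work.
fetch_ngss_tweets <- function(queries = ngss_queries, n = 5000) {
  rtweet::search_tweets2(q = queries, n = n, include_rts = FALSE)
}
```

With credentials in place, ngss_tweets <- fetch_ngss_tweets() would store the combined results; include_rts = FALSE drops retweets as discussed earlier.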

Our First Dictionary

To compare public sentiment about both the NGSS and CCSS state standards, we will create four dictionaries. First, we will create our very first “dictionary” for identifying tweets related to either set of standards, and then pass that dictionary to the q = argument to pull tweets related to the state standards.

To do so, we’ll need to add some additional search terms to our list:

Now let’s create a dictionary for the Common Core State Standards and pass that to our search_tweets() function to get the most recent tweets:

Notice that you can use the pipe operator with the search_tweets() function just like you would other functions from the tidyverse.

Write to Excel

Finally, let’s save our tweet files to use in later exercises since tweets have a tendency to change every minute. We’ll save as a Microsoft Excel file since one of our columns cannot be stored in a flat file like .csv.

Let’s use the write_xlsx() function from the writexl package just like we would the write_csv() function from readr in Unit 1:

2b. Tidy Text

Now that we have the data needed to answer our questions, we still have a little bit of work to do to get it ready for analysis. This section will revisit some familiar functions from Unit 1 and introduce a couple new functions:

Functions Used

dplyr functions

select() picks variables based on their names.
slice() lets you select, remove, and duplicate rows.
rename() changes the names of individual variables using new_name = old_name syntax.
filter() picks cases, or rows, based on their values in a specified column.

tidytext functions

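unnest_tokens() splits a column into tokens.
anti_join() returns all rows from x without a match in y (strictly a dplyr function, listed here because we use it with the stop_words dataset from tidytext).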

We’ll use the readxl package highlighted in Unit 1 and the read_xlsx() function to read in the data stored in the data folder of our R project:

ngss_tweets <- read_xlsx("data/ngss_tweets.xlsx")
ccss_tweets <- read_xlsx("data/csss_tweets.xlsx")

Subset Rows & Columns

As you are probably already aware, we have way more data than we’ll need for analysis and will need to pare it down quite a bit.

First, let’s use the filter() function to keep only the rows containing tweets written in English:

ngss_text <- filter(ngss_tweets, lang == "en")

Now let’s select the following columns from our new ngss_text data frame:

screen_name of the user who created the tweet
created_at timestamp for examining changes in sentiment over time
text containing the tweet which is our primary data source of interest

ngss_text <- select(ngss_text, screen_name, created_at, text)

Add & Reorder Columns

Since we are interested in comparing the sentiment of NGSS tweets with CCSS tweets, it would be helpful if we had a column for quickly identifying the set of state standards with which each tweet is associated.

We’ll use the mutate() function to create a new variable called standards to label each tweet as “ngss”:

ngss_text <- mutate(ngss_text, standards = "ngss")

And just because it bothers me, I’m going to use the relocate() function to move the standards column to the first position so I can quickly see which standards the tweet is from:

ngss_text <- relocate(ngss_text, standards)

Note that you could also have used the select() function to reorder columns like so:

ngss_text <- select(ngss_text, standards, screen_name, created_at, text)

Finally, let’s rewrite the code above using the %>% operator so there is less redundancy and it is easier to read:

ngss_text <-
  ngss_tweets %>%
  filter(lang == "en") %>%
  select(screen_name, created_at, text) %>%
  mutate(standards = "ngss") %>%
  relocate(standards)

Create a new ccss_text data frame for our ccss_tweets Common Core tweets by modifying the code above.
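
One way to approach this exercise is sketched below on a small, made-up stand-in for ccss_tweets, so the block runs without the real data. The names ccss_tweets_demo and ccss_text_demo and all of the values in them are invented for illustration; only the pipeline itself mirrors the NGSS code above.

```r
library(dplyr)

# Made-up stand-in for the real ccss_tweets data frame (values are invented).
ccss_tweets_demo <- tibble(
  screen_name    = c("user_a", "user_b", "user_c"),
  created_at     = as.POSIXct(c("2021-02-19 22:41:01",
                                "2021-02-19 23:11:21",
                                "2021-02-20 00:34:53"), tz = "UTC"),
  text           = c("Common core math again",
                     "Otra vez common core",
                     "I like the common core"),
  lang           = c("en", "es", "en"),
  favorite_count = c(0L, 2L, 5L)
)

# Same steps as for ngss_text, but labelled "ccss".
ccss_text_demo <-
  ccss_tweets_demo %>%
  filter(lang == "en") %>%                   # keep English tweets only
  select(screen_name, created_at, text) %>%  # drop columns we don't need
  mutate(standards = "ccss") %>%             # label the set of standards
  relocate(standards)                        # move the label to the front
```

Swapping ccss_tweets_demo for the real ccss_tweets data frame gives the ccss_text object used in the next step.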

Combine Data Frames

Finally, let’s combine our ccss_text and ngss_text into a single data frame by using the bind_rows() function from dplyr and simply supplying the data frames that we want to combine as arguments:

tweets <- bind_rows(ngss_text, ccss_text)

And let’s take a quick look at both the head() and the tail() of this new tweets data frame to make sure it contains both “ngss” and “ccss” standards:

head(tweets)

## # A tibble: 6 × 4
##   standards screen_name  created_at          text                               
##   <chr>     <chr>        <dttm>              <chr>                              
## 1 ngss      loyr2662     2021-02-27 17:33:27 "Switching gears for a bit for the…
## 2 ngss      loyr2662     2021-02-20 20:02:37 "Was just introduced to the Engine…
## 3 ngss      Furlow_teach 2021-02-27 17:03:23 "@IBchemmilam @chemmastercorey I’m…
## 4 ngss      Furlow_teach 2021-02-27 14:41:01 "@IBchemmilam @chemmastercorey How…
## 5 ngss      TdiShelton   2021-02-27 14:17:34 "I am so honored and appreciative …
## 6 ngss      TdiShelton   2021-02-27 15:49:17 "Thank you @brian_womack I loved c…

tail(tweets)

## # A tibble: 6 × 4
##   standards screen_name   created_at          text                              
##   <chr>     <chr>         <dttm>              <chr>                             
## 1 ccss      JosiePaul8807 2021-02-20 00:34:53 "@SenatorHick You realize science…
## 2 ccss      ctwittnc      2021-02-19 23:44:18 "@winningatmylife I’ll bet none o…
## 3 ccss      the_rbeagle   2021-02-19 23:27:06 "@dmarush @electronlove @Montgome…
## 4 ccss      silea         2021-02-19 23:11:21 "@LizerReal I don’t think that’s …
## 5 ccss      JodyCoyote12  2021-02-19 22:58:25 "@CarlaRK3 @NedLamont Fully fund …
## 6 ccss      Ryan_Hawes    2021-02-19 22:41:01 "I just got an \"explainer\" on h…

Tokenize Text

First, let’s tokenize our tweets by using the unnest_tokens() function to split each tweet into individual words, one word per row, to make it easier to analyze:

tweet_tokens <- 
  tweets %>%
  unnest_tokens(output = word, 
                input = text)

Notice that the call above supplies only the output = and input = arguments, so unnest_tokens() uses its default word tokenizer, which lowercases the text and strips punctuation such as the # and @ symbols. The specialized “tweets” tokenizer, selected through the token = argument in versions of tidytext that support it, is very useful for dealing with Twitter text or other text from online forums in that it retains hashtags and mentions of usernames with the @ symbol.
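
To see what the default tokenizer does, here is a minimal sketch on two invented tweets. The data frame demo_tweets and its contents are made up for illustration and are not part of the course data.

```r
library(dplyr)
library(tidytext)

# Two invented tweets standing in for the real tweets data frame.
demo_tweets <- tibble(
  standards = c("ngss", "ccss"),
  text      = c("Loving the new NGSS lessons!",
                "Common Core math is confusing")
)

# Default word tokenizer: one lowercased word per row, punctuation removed.
# The input column (text) is dropped and the other columns are kept.
demo_tokens <- demo_tweets %>%
  unnest_tokens(output = word, input = text)

demo_tokens
```

Each of the two tweets becomes five rows, and the standards label is repeated on every row so that words can still be traced back to their set of standards.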

Remove Stop Words

Now let’s remove stop words like “the” and “a” that don’t help us learn much about what people are tweeting about the state standards.

tidy_tweets <-
  tweet_tokens %>%
  anti_join(stop_words, by = "word")

Notice that we’ve specified the by = argument to look for matching words in the word column of both datasets and remove any rows from the tweet_tokens dataset that match the stop_words dataset. Remember that when we first tokenized our dataset I conveniently chose output = word as the column name because it matches the column name word in the stop_words dataset contained in the tidytext package. This makes our call to anti_join() simpler because anti_join() knows to look for the column named word in each dataset. Strictly speaking, specifying by = wasn’t necessary, since word is the only column name the two datasets share and anti_join() would have matched on it by default.

Custom Stop Words

Before wrapping up, let’s take a quick count of the most common words in the tidy_tweets data frame:

count(tidy_tweets, word, sort = TRUE)

## # A tibble: 7,163 × 2
##    word         n
##    <chr>    <int>
##  1 common    1112
##  2 core      1109
##  3 https      623
##  4 t.co       623
##  5 math       450
##  6 ngss       224
##  7 students   141
##  8 science    140
##  9 school     128
## 10 amp        127
## # … with 7,153 more rows

Notice that the nonsense word “amp” is in our top ten words. If we use the filter() function and the grepl() function from Unit 1 on our tweets data frame, we can see that “amp” is residue from the HTML entity &amp; (an encoded ampersand, as in “M&amp;Ms”) that we might want to get rid of.

filter(tweets, grepl('amp', text))

## # A tibble: 124 × 4
##    standards screen_name    created_at          text                            
##    <chr>     <chr>          <dttm>              <chr>                           
##  1 ngss      TdiShelton     2021-02-27 14:17:34 "I am so honored and appreciati…
##  2 ngss      STEMTeachTools 2021-02-27 16:25:04 "Open, non-hierarchical communi…
##  3 ngss      NGSSphenomena  2021-02-25 13:24:22 "Bacteria have music preference…
##  4 ngss      CTSKeeley      2021-02-21 21:50:04 "Today I was thinking about the…
##  5 ngss      richbacolor    2021-02-24 14:14:49 "Last chance to register for @M…
##  6 ngss      MrsEatonELL    2021-02-27 06:24:09 "Were we doing the hand jive? N…
##  7 ngss      STEMuClaytion  2021-02-24 14:56:19 "#WonderWednesday w/ questions …
##  8 ngss      LearningUNDFTD 2021-02-24 18:13:01 "Are candies like M&amp;Ms and …
##  9 ngss      abeslo         2021-02-26 18:54:31 "#M'Kenna, whose story we share…
## 10 ngss      E3Chemistry    2021-02-25 14:15:20 "Molarity &amp; Parts Per Milli…
## # … with 114 more rows

Let’s rewrite our stop word code to add a custom stop word to filter out rows with “amp” in them:

tidy_tweets <-
  tweet_tokens %>%
  anti_join(stop_words, by = "word") %>%
  filter(word != "amp")

Note that we could extend this filter to weed out any additional words that don’t carry much meaning but skew our data by being so prominent.
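
A minimal sketch of that extension follows, using small invented data frames so it runs on its own. Here demo_stop_words stands in for the stop_words dataset from tidytext, and the choice of “https” and “t.co” as extra custom stop words is an assumption for illustration, not something the original analysis removed.

```r
library(dplyr)

# Invented tokens resembling the most common words in tidy_tweets.
demo_tokens <- tibble(
  word = c("common", "core", "https", "t.co", "amp", "math", "the", "students")
)

# Stand-in for tidytext::stop_words (the real dataset is much larger).
demo_stop_words <- tibble(word = c("the", "a", "of"))

# Hypothetical custom stop words collected in their own data frame.
custom_stop_words <- tibble(word = c("amp", "https", "t.co"))

# Option 1: a second anti_join() against the custom list.
tidy_demo <- demo_tokens %>%
  anti_join(demo_stop_words, by = "word") %>%
  anti_join(custom_stop_words, by = "word")

# Option 2: the same result with filter() and %in%.
tidy_demo_filter <- demo_tokens %>%
  filter(!word %in% c(demo_stop_words$word, custom_stop_words$word))
```

Keeping custom stop words in a data frame makes the list easy to grow and reuse, while the filter() version is convenient for one or two words.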

2c. Add Sentiment Values

Now that we have our tweets nice and tidy, we’re almost ready to begin exploring public sentiment (at least for the past week, since Twitter’s standard search only reaches back 6-9 days) around the CCSS and NGSS standards. For this part of our workflow we introduce two new functions from the tidytext and dplyr packages respectively:

get_sentiments() returns specific sentiment lexicons with the associated measures for each word in the lexicon
inner_join() returns all rows from x where there are matching values in y, and all columns from x and y.

Get Sentiments

The tidytext package provides access to several sentiment lexicons based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth.

The three general-purpose lexicons we’ll focus on are:

AFINN assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
bing categorizes words in a binary fashion into positive and negative categories.
nrc categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

Let’s take a quick look at each of these lexicons using the get_sentiments() function and assign them to their respective names for later use:

afinn <- get_sentiments("afinn")

afinn

## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows

bing <- get_sentiments("bing")

bing

## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows

nrc <- get_sentiments("nrc")

nrc

## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,862 more rows

And just out of curiosity, let’s take a look at the loughran lexicon as well:

loughran <- get_sentiments("loughran")

loughran

## # A tibble: 4,150 × 2
##    word         sentiment
##    <chr>        <chr>    
##  1 abandon      negative 
##  2 abandoned    negative 
##  3 abandoning   negative 
##  4 abandonment  negative 
##  5 abandonments negative 
##  6 abandons     negative 
##  7 abdicated    negative 
##  8 abdicates    negative 
##  9 abdicating   negative 
## 10 abdication   negative 
## # … with 4,140 more rows

Join Sentiments

We’ve reached the final step in our data wrangling process before we can begin exploring our data to address our questions.

In the previous section, we used anti_join() to remove stop words in our dataset. For sentiment analysis, we’re going to use the inner_join() function to do something similar. However, instead of removing rows that contain words matching those in our stop words dictionary, inner_join() allows us to keep only the rows with words that match words in our sentiment lexicons, or dictionaries, along with the sentiment measure for that word from the sentiment lexicon.

Let’s use inner_join() to combine our tidy_tweets and afinn data frames, keeping only rows with matching data in the word column:

sentiment_afinn <- inner_join(tidy_tweets, afinn, by = "word")

sentiment_afinn

## # A tibble: 1,540 × 5
##    standards screen_name  created_at          word         value
##    <chr>     <chr>        <dttm>              <chr>        <dbl>
##  1 ngss      loyr2662     2021-02-27 17:33:27 win              4
##  2 ngss      Furlow_teach 2021-02-27 17:03:23 love             3
##  3 ngss      Furlow_teach 2021-02-27 17:03:23 sweet            2
##  4 ngss      Furlow_teach 2021-02-27 17:03:23 significance     1
##  5 ngss      TdiShelton   2021-02-27 14:17:34 honored          2
##  6 ngss      TdiShelton   2021-02-27 14:17:34 opportunity      2
##  7 ngss      TdiShelton   2021-02-27 14:17:34 wonderful        4
##  8 ngss      TdiShelton   2021-02-27 14:17:34 powerful         2
##  9 ngss      TdiShelton   2021-02-27 15:49:17 loved            3
## 10 ngss      TdiShelton   2021-02-27 16:51:32 share            1
## # … with 1,530 more rows

Notice that each word in your sentiment_afinn data frame now contains a value ranging from -5 (very negative) to 5 (very positive).

sentiment_bing <- inner_join(tidy_tweets, bing, by = "word")

sentiment_bing

## # A tibble: 1,668 × 5
##    standards screen_name  created_at          word         sentiment
##    <chr>     <chr>        <dttm>              <chr>        <chr>    
##  1 ngss      loyr2662     2021-02-27 17:33:27 win          positive 
##  2 ngss      Furlow_teach 2021-02-27 17:03:23 love         positive 
##  3 ngss      Furlow_teach 2021-02-27 17:03:23 helped       positive 
##  4 ngss      Furlow_teach 2021-02-27 17:03:23 sweet        positive 
##  5 ngss      Furlow_teach 2021-02-27 17:03:23 tough        positive 
##  6 ngss      TdiShelton   2021-02-27 14:17:34 honored      positive 
##  7 ngss      TdiShelton   2021-02-27 14:17:34 appreciative positive 
##  8 ngss      TdiShelton   2021-02-27 14:17:34 wonderful    positive 
##  9 ngss      TdiShelton   2021-02-27 14:17:34 powerful     positive 
## 10 ngss      TdiShelton   2021-02-27 15:49:17 loved        positive 
## # … with 1,658 more rows
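
The effect of inner_join() is easiest to see on a toy example. The sketch below uses an invented three-word lexicon (demo_lexicon) in place of the real AFINN data, so the scores are illustrative only and the block runs without downloading anything.

```r
library(dplyr)

# Invented tokens from two hypothetical tweets.
tidy_demo <- tibble(
  tweet = c(1, 1, 1, 2, 2),
  word  = c("love", "science", "wonderful", "hate", "homework")
)

# Toy lexicon standing in for AFINN (values chosen for illustration).
demo_lexicon <- tibble(
  word  = c("love", "wonderful", "hate"),
  value = c(3, 4, -3)
)

# Keep only words found in the lexicon and attach their scores.
sentiment_demo <- inner_join(tidy_demo, demo_lexicon, by = "word")

sentiment_demo
```

The words “science” and “homework” disappear because they have no match in the lexicon, which is why sentiment_afinn and sentiment_bing have far fewer rows than tidy_tweets.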

3. EXPLORE

Now that we have our tweets tidied and sentiments joined, we’re ready for a little data exploration.

Time Series. We take a quick look at the date range of our tweets and compare number of postings by standards.
Sentiment Summaries. We put together some basic summaries of our sentiment values in order to compare public sentiment for the two sets of standards.

3a. Time Series

Before we dig into sentiment, let’s use the handy ts_plot function built into rtweet to take a very quick look at how far back our tidied tweets data set goes:

ts_plot(tweets, by = "days")

Notice that this effectively creates a ggplot time series plot for us. I’ve included the by = argument, which by default is set to “days”. It looks like tweets go back 9 days, which is the limit of Twitter’s standard search.

Try changing it to “hours” and see what happens.

Hint: use the ?ts_plot help function to check the examples to see how this can be done.

Your line graph should look something like this:

3b. Sentiment Summaries

Since our primary goal is to compare public sentiment around the NGSS and CCSS state standards, in this section we put together some basic numerical summaries using our different lexicons to see whether tweets are generally more positive or negative for each standard as well as differences between the two. To do this, we revisit the following dplyr functions:

count() lets you quickly count the unique values of one or more variables
group_by() takes a data frame and one or more variables to group by
summarise() creates a numerical summary of data using arguments like mean() and median()
mutate() adds new variables and preserves existing ones

And introduce one new function:

spread() spreads a key-value pair across multiple columns (from the tidyr package)

Sentiment Counts

Let’s start with bing, our simplest sentiment lexicon, and use the count function to count how many times “positive” and “negative” occur in the sentiment column of our sentiment_bing data frame:

summary_bing <- count(sentiment_bing, sentiment, sort = TRUE)

Collectively, it looks like our combined dataset has more negative words than positive words.

summary_bing

## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative    992
## 2 positive    676

Since our main goal is to compare positive and negative sentiment between CCSS and NGSS, let’s use the group_by function again to get sentiment summaries for NGSS and CCSS separately:

summary_bing <- sentiment_bing %>% 
  group_by(standards) %>% 
  count(sentiment) 

summary_bing

## # A tibble: 4 × 3
## # Groups:   standards [2]
##   standards sentiment     n
##   <chr>     <chr>     <int>
## 1 ccss      negative    926
## 2 ccss      positive    446
## 3 ngss      negative     66
## 4 ngss      positive    230

Looks like CCSS tweets have far more negative words than positive, while NGSS tweets skew much more positive. So far, pretty consistent with the findings of Rosenberg et al.!

Compute Sentiment Value

Our last step will be to calculate a single sentiment “score” for our tweets that we can use for quick comparison and create a new variable indicating which lexicon we used.

First, let’s untidy our data a little by using the spread function from the tidyr package to transform our sentiment column into separate columns for negative and positive that contain the n counts for each:

summary_bing <- sentiment_bing %>% 
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  spread(sentiment, n) 

summary_bing

## # A tibble: 2 × 3
## # Groups:   standards [2]
##   standards negative positive
##   <chr>        <int>    <int>
## 1 ccss           926      446
## 2 ngss            66      230

Finally, we’ll use the mutate function to create two new variables: sentiment and lexicon so we have a single sentiment score and the lexicon from which it was derived:

summary_bing <- sentiment_bing %>% 
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  spread(sentiment, n) %>%
  mutate(sentiment = positive - negative) %>%
  mutate(lexicon = "bing") %>%
  relocate(lexicon)

summary_bing

## # A tibble: 2 × 5
## # Groups:   standards [2]
##   lexicon standards negative positive sentiment
##   <chr>   <chr>        <int>    <int>     <int>
## 1 bing    ccss           926      446      -480
## 2 bing    ngss            66      230       164

There we go, now we can see that CCSS scores negative, while NGSS is overall positive.
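
The arithmetic behind this score can be checked directly from the counts printed above. The sketch below types those four counts in by hand and, as an alternative to spread(), uses pivot_wider(), the newer tidyr function for the same reshaping; the object names are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Counts copied from the grouped bing summary shown earlier.
bing_counts <- tibble(
  standards = c("ccss", "ccss", "ngss", "ngss"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(926L, 446L, 66L, 230L)
)

# Reshape to one row per set of standards, then subtract.
summary_demo <- bing_counts %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(sentiment = positive - negative,  # 446 - 926 and 230 - 66
         lexicon = "bing") %>%
  relocate(lexicon)

summary_demo
```

Because this score is a raw difference in word counts, it also reflects how many tweets each set of standards has, so it is best read as a direction (negative or positive) rather than a magnitude.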

Let’s calculate a quick score using the afinn lexicon now. Remember that AFINN provides a value from -5 to 5 for each word:

head(sentiment_afinn)

## # A tibble: 6 × 5
##   standards screen_name  created_at          word         value
##   <chr>     <chr>        <dttm>              <chr>        <dbl>
## 1 ngss      loyr2662     2021-02-27 17:33:27 win              4
## 2 ngss      Furlow_teach 2021-02-27 17:03:23 love             3
## 3 ngss      Furlow_teach 2021-02-27 17:03:23 sweet            2
## 4 ngss      Furlow_teach 2021-02-27 17:03:23 significance     1
## 5 ngss      TdiShelton   2021-02-27 14:17:34 honored          2
## 6 ngss      TdiShelton   2021-02-27 14:17:34 opportunity      2

To calculate a summary score, we will need to first group our data by standards again and then use the summarise function to create a new sentiment variable by adding all the positive and negative scores in the value column:

summary_afinn <- sentiment_afinn %>% 
  group_by(standards) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(lexicon = "AFINN") %>%
  relocate(lexicon)

summary_afinn

## # A tibble: 2 × 3
##   lexicon standards sentiment
##   <chr>   <chr>         <dbl>
## 1 AFINN   ccss           -808
## 2 AFINN   ngss            503

Again, CCSS is overall negative while NGSS is overall positive!

nrc

## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,862 more rows

Counting only the positive and negative categories of the nrc lexicon gives the following summary by standards:

## # A tibble: 2 × 5
## # Groups:   standards [2]
##   standards method negative positive sentiment
##   <chr>     <chr>     <int>    <int>     <dbl>
## 1 ccss      nrc         766     2294      2.99
## 2 ngss      nrc          79      571      7.23
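
The nrc table above appears without the code that produced it. Its sentiment column matches the ratio of positive to negative words (for example, 2294 / 766 ≈ 2.99 for CCSS), so the sketch below assumes that is how it was calculated and reproduces it from the printed counts. The object names are made up, and this is one plausible reconstruction rather than the original code.

```r
library(dplyr)
library(tidyr)

# Counts copied from the nrc summary table shown above.
nrc_counts <- tibble(
  standards = c("ccss", "ccss", "ngss", "ngss"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(766L, 2294L, 79L, 571L)
)

# Assumption: the score is the ratio of positive to negative words.
summary_nrc_demo <- nrc_counts %>%
  spread(sentiment, n) %>%
  mutate(method = "nrc",
         sentiment = positive / negative) %>%
  relocate(method, .after = standards)

summary_nrc_demo
```

A ratio above 1 means more positive than negative words, and unlike the bing difference score it is not inflated by the larger number of CCSS tweets.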

4. MODEL

As highlighted in Chapter 3 of Data Science in Education Using R, the Model step of the data science process entails “using statistical models, from simple to complex, to understand trends and patterns in the data.” The authors note that while descriptive statistics and data visualization during the Explore step can help us to identify patterns and relationships in our data, statistical models can be used to help us determine if relationships, patterns and trends are actually meaningful.

Recall from the PREPARE section that the Rosenberg et al. study was guided by the following questions:

What is the public sentiment expressed toward the NGSS?
How does sentiment for teachers differ from non-teachers?
How do tweets posted to #NGSSchat differ from those without the hashtag?
How does participation in #NGSSchat relate to the public sentiment individuals express?
How does public sentiment vary over time?

Similar to our sentiment summary using the AFINN lexicon, the Rosenberg et al. study used the -5 to 5 sentiment score from the SentiStrength lexicon to answer RQ #1. To address the remaining questions, the authors used a mixed effects model (also known as a multi-level or hierarchical linear model) via the lme4 package in R.

Collectively, the authors found that:

The SentiStrength scale indicated an overall neutral sentiment for tweets about the Next Generation Science Standards.
Teachers were more positive in their posts than other participants.
Posts including #NGSSchat that were posted outside of chats were slightly more positive relative to those that did not include the #NGSSchat hashtag.
The effect upon individuals of being involved in the #NGSSchat was positive, suggesting that there is an impact on individuals—not tweets—of participating in a community focused on the NGSS.
Posts about the NGSS became substantially more positive over time.

5. COMMUNICATE

The final steps in the workflow are to share the results of the analysis:

Select. Communicating what one has learned involves selecting among those analyses that are most important and most useful to an intended audience, as well as selecting a form for displaying that information, such as a graph or table in static or interactive form, i.e. a “data product.”
Polish. After creating initial versions of data products, research teams often spend time refining or polishing them, by adding or editing titles, labels, and notations and by working with colors and shapes to highlight key points.
Narrate. Writing a narrative to accompany the data products involves, at a minimum, pairing a data product with its related research question, describing how best to interpret the data product, and explaining the ways in which the data product helps answer the research question.

5a. Select

The questions of interest for selection, polishing, and narration are:

What is the public sentiment expressed toward the NGSS?
How does sentiment for NGSS compare to sentiment for CCSS?

To address questions 1 and 2, the analyses, data products and sharing focus on the following:

Analyses. For RQ1, I want to try to replicate as closely as possible the analysis by Rosenberg et al., so I will clean up my analysis and calculate a single sentiment score using the AFINN lexicon for the entire tweet and label it positive or negative based on that score. I also want to highlight how, regardless of the lexicon selected, NGSS tweets contain more positive words than negative, so I’ll also polish my previous analyses and calculate percentages of positive and negative words for each set of standards.
Data Products. I know these are shunned in the world of data viz, but I think a pie chart will actually be an effective way to quickly communicate the proportion of positive and negative tweets among the Next Generation Science Standards. And for my analyses with the bing, nrc, and loughran lexicons, I’ll create some 100% stacked bars showing the percentage of positive and negative words among all tweets for the NGSS and CCSS.
Format. Similar to Unit 1, I’ll be using R Markdown again to create a quick slide deck. Recall that R Markdown files can also be used to create a wide range of outputs and formats, including polished PDF or Word documents, websites, web apps, journal articles, online books, interactive tutorials and more. And to make this process even more user-friendly, RStudio now includes a visual editor!

5b. Polish

NGSS Sentiment

To replicate the approach Rosenberg et al. used in their analysis, some R code from section 2b. Tidy Text will be reused.

To polish the analyses, we will first rebuild the tweets dataset from ngss_tweets and ccss_tweets and select both the status_id that is unique to each tweet and the text column which contains the actual post:

ngss_text <-
  ngss_tweets %>%
  filter(lang == "en") %>%
  select(status_id, text) %>%
  mutate(standards = "ngss") %>%
  relocate(standards)

ccss_text <-
  ccss_tweets %>%
  filter(lang == "en") %>%
  select(status_id, text) %>%
  mutate(standards = "ccss") %>%
  relocate(standards)

tweets <- bind_rows(ngss_text, ccss_text)

tweets

## # A tibble: 1,441 × 3
##    standards status_id           text                                           
##    <chr>     <chr>               <chr>                                          
##  1 ngss      1365716690336645124 "Switching gears for a bit for the \"Crosscutt…
##  2 ngss      1363217513761415171 "Was just introduced to the Engineering Habits…
##  3 ngss      1365709122763653133 "@IBchemmilam @chemmastercorey I’m familiar w/…
##  4 ngss      1365673294360420353 "@IBchemmilam @chemmastercorey How well does t…
##  5 ngss      1365667393188601857 "I am so honored and appreciative to have an o…
##  6 ngss      1365690477266284545 "Thank you @brian_womack I loved connecting wi…
##  7 ngss      1365706140496130050 "Please share #NGSSchat PLN! https://t.co/Qc2c…
##  8 ngss      1363669328147677189 "So excited about this weekend’s learning... p…
##  9 ngss      1365442786544214019 "The Educators Evaluating the Quality of Instr…
## 10 ngss      1364358149164175362 "Foster existing teacher social networks that …
## # … with 1,431 more rows
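As an aside, the labelling step above can also be folded into the row binding itself. The following is only a sketch on made-up data: ngss_toy and ccss_toy are hypothetical stand-ins for ngss_text and ccss_text, and the ids and text are invented for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for the real tweet tables (invented ids and text)
ngss_toy <- tibble(status_id = c("1", "2"),
                   text = c("first ngss post", "second ngss post"))
ccss_toy <- tibble(status_id = "3",
                   text = "a ccss post")

# The argument names become the values of the new .id column,
# which bind_rows() places first in the result
tweets_toy <- bind_rows(ngss = ngss_toy, ccss = ccss_toy, .id = "standards")

tweets_toy
```

This would replace the separate mutate and relocate calls, at the cost of making the labels a little less visible in the code.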

The status_id is important because like Rosenberg et al., we want to calculate an overall sentiment score for each tweet, rather than for each word.

Before I get that far, however, I’ll need to tidy my tweets again and attach my sentiment scores.

Note that the closest lexicon we have available in our tidytext package to the SentiStrength lexicon used by Rosenberg is the AFINN lexicon which also uses a -5 to 5 point scale.

So let’s use unnest_tokens to tidy our tweets, remove stop words, and add afinn scores to each word similar to what we did in section 2c. Add Sentiment Values:

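sentiment_afinn <- tweets %>%
  # one row per word, keeping standards and status_id
  unnest_tokens(output = word, input = text) %>%
  # drop common stop words and the "amp" left over from "&amp;"
  anti_join(stop_words, by = "word") %>%
  filter(word != "amp") %>%
  # keep only words that have an AFINN score
  inner_join(afinn, by = "word")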

sentiment_afinn

## # A tibble: 1,540 × 4
##    standards status_id           word         value
##    <chr>     <chr>               <chr>        <dbl>
##  1 ngss      1365716690336645124 win              4
##  2 ngss      1365709122763653133 love             3
##  3 ngss      1365709122763653133 sweet            2
##  4 ngss      1365709122763653133 significance     1
##  5 ngss      1365667393188601857 honored          2
##  6 ngss      1365667393188601857 opportunity      2
##  7 ngss      1365667393188601857 wonderful        4
##  8 ngss      1365667393188601857 powerful         2
##  9 ngss      1365690477266284545 loved            3
## 10 ngss      1365706140496130050 share            1
## # … with 1,530 more rows
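To see what the tokenize-and-join step does on a small scale, here is a minimal sketch. The two tweets and the three-word toy_lexicon are made up for illustration; the scores are placeholders standing in for the real afinn table, not a copy of it.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two invented tweets
toy_tweets <- tibble(
  status_id = c("1", "2"),
  text = c("I love this win", "A terrible, terrible plan")
)

# Hypothetical mini lexicon standing in for the afinn table
toy_lexicon <- tibble(
  word = c("love", "win", "terrible"),
  value = c(3, 4, -3)
)

toy_scored <- toy_tweets %>%
  # lowercases the text, strips punctuation, one row per word
  unnest_tokens(output = word, input = text) %>%
  # words missing from the lexicon ("i", "this", "a", "plan") are dropped
  inner_join(toy_lexicon, by = "word")

toy_scored
```

Because inner_join keeps only matching words, a tweet with no lexicon words disappears entirely at this step, which is why the scored table can cover fewer tweets than the original dataset.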

Next, I want to calculate a single score for each tweet. To do that, I’ll use the by now familiar group_by and summarise functions:

afinn_score <- sentiment_afinn %>% 
  group_by(standards, status_id) %>% 
  summarise(value = sum(value))

afinn_score

## # A tibble: 857 × 3
## # Groups:   standards [2]
##    standards status_id           value
##    <chr>     <chr>               <dbl>
##  1 ccss      1362894990813188096     2
##  2 ccss      1362899370199445508     4
##  3 ccss      1362906588021989376    -2
##  4 ccss      1362910494487535618    -9
##  5 ccss      1362910913855160320    -1
##  6 ccss      1362928225379250179     2
##  7 ccss      1362933982074073090    -1
##  8 ccss      1362947497258151945    -3
##  9 ccss      1362949805694013446     3
## 10 ccss      1362970614282264583     3
## # … with 847 more rows
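The per-tweet sum can be illustrated with a few invented rows. This is a sketch only: toy_words is hypothetical data, and the .groups = "drop" argument is an optional addition that returns an ungrouped result (the code above leaves the result grouped by standards).

```r
library(dplyr)
library(tibble)

# Invented word-level scores for three tweets
toy_words <- tribble(
  ~standards, ~status_id, ~word,      ~value,
  "ngss",     "1",        "love",      3,
  "ngss",     "1",        "win",       4,
  "ccss",     "2",        "terrible", -3,
  "ccss",     "2",        "love",      3,
  "ccss",     "3",        "bad",      -2
)

toy_score <- toy_words %>%
  group_by(standards, status_id) %>%
  # one row per tweet: the sum of its word scores
  summarise(value = sum(value), .groups = "drop")

toy_score
```

Tweet 2 shows how positive and negative words can cancel out to a score of exactly 0, which matters for the labelling step that follows.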

And like Rosenberg et al., I’ll use the mutate function to create a new sentiment column that flags whether each tweet was “positive” or “negative”.

To do this, I’ll use the if_else function from the dplyr package. It adds “negative” to the sentiment column if the score in the value column of the corresponding row is less than 0; otherwise, it adds “positive”.

afinn_sentiment <- afinn_score %>%
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive"))

afinn_sentiment

## # A tibble: 820 × 4
## # Groups:   standards [2]
##    standards status_id           value sentiment
##    <chr>     <chr>               <dbl> <chr>    
##  1 ccss      1362894990813188096     2 positive 
##  2 ccss      1362899370199445508     4 positive 
##  3 ccss      1362906588021989376    -2 negative 
##  4 ccss      1362910494487535618    -9 negative 
##  5 ccss      1362910913855160320    -1 negative 
##  6 ccss      1362928225379250179     2 positive 
##  7 ccss      1362933982074073090    -1 negative 
##  8 ccss      1362947497258151945    -3 negative 
##  9 ccss      1362949805694013446     3 positive 
## 10 ccss      1362970614282264583     3 positive 
## # … with 810 more rows

Note that since a tweet sentiment score equal to 0 is neutral, I used the filter function to remove it from the dataset.
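If it were useful to keep neutral tweets rather than filter them out, one possible variation is to label them explicitly with case_when. This is a sketch on made-up scores, not the approach used in this analysis.

```r
library(dplyr)
library(tibble)

# Invented tweet-level scores
toy_score <- tibble(status_id = c("1", "2", "3"),
                    value = c(7, 0, -2))

toy_sentiment <- toy_score %>%
  mutate(sentiment = case_when(
    value < 0 ~ "negative",
    value > 0 ~ "positive",
    TRUE      ~ "neutral"   # scores of exactly 0
  ))

# How many tweets fall into each label
count(toy_sentiment, sentiment)
```

Keeping the neutral label makes it possible to report how many tweets were excluded from the positive and negative comparison.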

Now we’re ready to compute our ratio of negative to positive tweets. We’ll use the group_by function and count the number of tweets for each of the standards that are positive or negative in the sentiment column. Then we’ll use the spread function to separate them out into separate columns so we can perform a quick calculation to compute the ratio.

afinn_ratio <- afinn_sentiment %>% 
  group_by(standards) %>% 
  count(sentiment) %>% 
  spread(sentiment, n) %>%
  mutate(ratio = negative/positive)

afinn_ratio

## # A tibble: 2 × 4
## # Groups:   standards [2]
##   standards negative positive ratio
##   <chr>        <int>    <int> <dbl>
## 1 ccss           421      211 2.00 
## 2 ngss            21      167 0.126
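In current versions of tidyr, spread is superseded by pivot_wider, so the same reshaping can be written as follows. The counts are typed in by hand from the output above, and afinn_counts_demo is a name introduced just for this sketch.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Tweet counts copied from the afinn_ratio output above
afinn_counts_demo <- tribble(
  ~standards, ~sentiment, ~n,
  "ccss",     "negative", 421,
  "ccss",     "positive", 211,
  "ngss",     "negative",  21,
  "ngss",     "positive", 167
)

afinn_ratio_demo <- afinn_counts_demo %>%
  # one column per sentiment label, filled with the counts
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(ratio = negative / positive)

afinn_ratio_demo
```

A ratio above 1 means more negative than positive tweets (CCSS here), and a ratio below 1 means the opposite (NGSS).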

Finally, I’ll count the positive and negative NGSS tweets and plot them as a pie chart:

afinn_counts <- afinn_sentiment %>%
  group_by(standards) %>% 
  count(sentiment) %>%
  filter(standards == "ngss")

afinn_counts %>%
  ggplot(aes(x = "", y = n, fill = sentiment)) +
  geom_bar(width = .6, stat = "identity") +
  labs(title = "Next Gen Science Standards",
       subtitle = "Proportion of Positive & Negative Tweets") +
  coord_polar(theta = "y") +
  theme_void()
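Since a pie chart shows proportions only visually, it may help to print the percentages on the slices. The sketch below uses the NGSS counts from the afinn_ratio output; ngss_counts_demo and the label column are names introduced here, and the label placement is one reasonable choice among several.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# NGSS tweet counts copied from the afinn_ratio output above
ngss_counts_demo <- tibble(sentiment = c("negative", "positive"),
                           n = c(21, 167))

ngss_share <- ngss_counts_demo %>%
  mutate(percent = n / sum(n) * 100,
         label = paste0(round(percent, 1), "%"))

pie_demo <- ggplot(ngss_share, aes(x = "", y = n, fill = sentiment)) +
  geom_col(width = .6) +
  # place each label in the middle of its slice
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  theme_void()

ngss_share
```

Printing pie_demo draws the chart; ngss_share holds the underlying percentages in case they are needed for the slide text.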

NGSS vs CCSS

Finally, to address Question 2, I want to compare the percentage of positive and negative words contained in the corpus of tweets for the NGSS and CCSS standards using the four different lexicons to see how sentiment compares based on lexicon used.

I’ll begin by polishing my previous summaries and creating identical summaries for each lexicon that contain the following columns: method, standards, sentiment, and n, or word counts:

summary_afinn2 <- sentiment_afinn %>% 
  group_by(standards) %>% 
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive")) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "AFINN")

summary_bing2 <- sentiment_bing %>% 
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "bing")

summary_nrc2 <- sentiment_nrc %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "nrc") 

summary_loughran2 <- sentiment_loughran %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "loughran")

Next, I’ll combine those four data frames together using the bind_rows function again:

summary_sentiment <- bind_rows(summary_afinn2,
                               summary_bing2,
                               summary_nrc2,
                               summary_loughran2) %>%
  arrange(method, standards) %>%
  relocate(method)

summary_sentiment

## # A tibble: 16 × 4
## # Groups:   standards [2]
##    method   standards sentiment     n
##    <chr>    <chr>     <chr>     <int>
##  1 AFINN    ccss      negative    740
##  2 AFINN    ccss      positive    477
##  3 AFINN    ngss      positive    278
##  4 AFINN    ngss      negative     45
##  5 bing     ccss      negative    926
##  6 bing     ccss      positive    446
##  7 bing     ngss      positive    230
##  8 bing     ngss      negative     66
##  9 loughran ccss      negative    433
## 10 loughran ccss      positive    112
## 11 loughran ngss      negative     73
## 12 loughran ngss      positive     57
## 13 nrc      ccss      positive   2294
## 14 nrc      ccss      negative    766
## 15 nrc      ngss      positive    571
## 16 nrc      ngss      negative     79

Then I’ll create a new data frame that has the total word counts for each set of standards and each method and join that to my summary_sentiment data frame:

total_counts <- summary_sentiment %>%
  group_by(method, standards) %>%
  summarise(total = sum(n))

## `summarise()` has grouped output by 'method'. You can override using the
## `.groups` argument.

sentiment_counts <- left_join(summary_sentiment, total_counts)

## Joining, by = c("method", "standards")

sentiment_counts

## # A tibble: 16 × 5
## # Groups:   standards [2]
##    method   standards sentiment     n total
##    <chr>    <chr>     <chr>     <int> <int>
##  1 AFINN    ccss      negative    740  1217
##  2 AFINN    ccss      positive    477  1217
##  3 AFINN    ngss      positive    278   323
##  4 AFINN    ngss      negative     45   323
##  5 bing     ccss      negative    926  1372
##  6 bing     ccss      positive    446  1372
##  7 bing     ngss      positive    230   296
##  8 bing     ngss      negative     66   296
##  9 loughran ccss      negative    433   545
## 10 loughran ccss      positive    112   545
## 11 loughran ngss      negative     73   130
## 12 loughran ngss      positive     57   130
## 13 nrc      ccss      positive   2294  3060
## 14 nrc      ccss      negative    766  3060
## 15 nrc      ngss      positive    571   650
## 16 nrc      ngss      negative     79   650

Finally, I’ll add a new column that calculates the percentage of positive and negative words for each set of state standards:

sentiment_percents <- sentiment_counts %>%
  mutate(percent = n/total * 100)

sentiment_percents

## # A tibble: 16 × 6
## # Groups:   standards [2]
##    method   standards sentiment     n total percent
##    <chr>    <chr>     <chr>     <int> <int>   <dbl>
##  1 AFINN    ccss      negative    740  1217    60.8
##  2 AFINN    ccss      positive    477  1217    39.2
##  3 AFINN    ngss      positive    278   323    86.1
##  4 AFINN    ngss      negative     45   323    13.9
##  5 bing     ccss      negative    926  1372    67.5
##  6 bing     ccss      positive    446  1372    32.5
##  7 bing     ngss      positive    230   296    77.7
##  8 bing     ngss      negative     66   296    22.3
##  9 loughran ccss      negative    433   545    79.4
## 10 loughran ccss      positive    112   545    20.6
## 11 loughran ngss      negative     73   130    56.2
## 12 loughran ngss      positive     57   130    43.8
## 13 nrc      ccss      positive   2294  3060    75.0
## 14 nrc      ccss      negative    766  3060    25.0
## 15 nrc      ngss      positive    571   650    87.8
## 16 nrc      ngss      negative     79   650    12.2
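The totals and percentages can also be computed in a single grouped mutate, without building and joining a separate totals table. This is a sketch using only the AFINN rows, typed in from the output above; afinn_words_demo is a name introduced for illustration.

```r
library(dplyr)
library(tibble)

# AFINN word counts copied from the summary_sentiment output above
afinn_words_demo <- tribble(
  ~method, ~standards, ~sentiment, ~n,
  "AFINN", "ccss",     "negative", 740,
  "AFINN", "ccss",     "positive", 477,
  "AFINN", "ngss",     "positive", 278,
  "AFINN", "ngss",     "negative",  45
)

afinn_percents_demo <- afinn_words_demo %>%
  group_by(method, standards) %>%
  # sum(n) is evaluated within each method/standards group
  mutate(total = sum(n),
         percent = n / total * 100) %>%
  ungroup()

afinn_percents_demo
```

The rounded percentages should match the AFINN rows in the table above (60.8, 39.2, 86.1, and 13.9).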

Now that I have my sentiment percent summaries for each lexicon, I’m going to create my 100% stacked bar charts for each lexicon:

sentiment_percents %>%
  ggplot(aes(x = standards, y = percent, fill=sentiment)) +
  geom_bar(width = .8, stat = "identity") +
  facet_wrap(~method, ncol = 1) +
  coord_flip() +
  labs(title = "Public Sentiment on Twitter", 
       subtitle = "The Common Core & Next Gen Science Standards",
       x = "State Standards", 
       y = "Percentage of Words")

And finished! The chart above clearly illustrates that, regardless of the sentiment lexicon used, NGSS tweets contain a higher percentage of positive words than CCSS tweets.
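The comparison can also be checked numerically. The sketch below takes the positive-word percentages from the sentiment_percents output and computes the NGSS minus CCSS gap for each lexicon; positive_demo and gap are names introduced here.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Positive-word percentages copied from the sentiment_percents output
positive_demo <- tribble(
  ~method,    ~standards, ~percent,
  "AFINN",    "ccss",     39.2,
  "AFINN",    "ngss",     86.1,
  "bing",     "ccss",     32.5,
  "bing",     "ngss",     77.7,
  "loughran", "ccss",     20.6,
  "loughran", "ngss",     43.8,
  "nrc",      "ccss",     75.0,
  "nrc",      "ngss",     87.8
)

positive_gap <- positive_demo %>%
  pivot_wider(names_from = standards, values_from = percent) %>%
  # percentage-point difference in positive words
  mutate(gap = ngss - ccss)

positive_gap
```

The gap is positive for all four lexicons. Note, though, that under loughran the NGSS share of positive words is 43.8%, so NGSS words are not majority-positive under every lexicon, and under nrc the CCSS words are 75.0% positive.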

5c. Narrate

This project is a reorganization and resharing of a project that our class undertook with Dr. Shiyan Jiang for ECI 588. The purpose of reorganizing this walkthrough was to give a beginning coder practical familiarity with the steps of creating a useful project in R that tells a data story by moving through the following steps: Prepare, Wrangle, Explore, Model, Communicate.

Purpose. There are two guiding questions that drove this analysis:
1. What is the public sentiment expressed toward the NGSS?
2. How does sentiment for NGSS compare to sentiment for CCSS?
The answers to these questions provide valuable insight for practicing researchers looking at what tweets reflect about sentiments toward CCSS and NGSS.
Methods. The data selected for analysis were tweets regarding NGSS and CCSS. To prepare and analyze the data, the following process was used:
1. Prepare: Prior to analysis, it’s critical to understand the context and data sources you’re working with so you can formulate useful and answerable questions. This study examined Dr. Rosenberg’s study as well as data available through Twitter’s API.
2. Wrangle: In section 2 we revisit tidying and tokenizing text, and append sentiment scores to tweets using the AFINN, bing, and nrc sentiment lexicons.
3. Explore: In section 3, we use simple summary statistics and basic data visualization to compare sentiment between NGSS and CCSS tweets.
4. Model: We examined the mixed effects model used by Rosenberg et al. to analyze the sentiment of tweets.
5. Communicate: Findings and insights from the analysis were shared.
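Findings. Based on tweet-level AFINN scores, CCSS tweets were negative overall, with roughly two negative tweets for every positive one (421 negative vs. 211 positive), while NGSS tweets were positive overall (21 negative vs. 167 positive, a ratio of about 0.13).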
Discussion. This analysis can be used to improve NGSS and CCSS policies and practice with the ultimate goal of improving learning outcomes for students. This study could be expanded to include more data over time, and to include more lexicons to further solidify findings.