This project reflects a very simplistic “replication study” by comparing the sentiment of tweets about the Next Generation Science Standards (NGSS) and Common Core State Standards (CCSS) in order to better understand public reaction to these two curriculum reform efforts. The text is based on our course digital textbook, Unit 2 Walkthrough: Twitter Sentiment and School Reform by Dr. Shiyan Jiang. The focus will be on using the Twitter API to import data on topics or tweets of interest and using sentiment lexicons to help gauge public opinion about those topics or tweets. Silge & Robinson nicely illustrate the tools of text mining to approach the emotional content of text programmatically, in the following diagram:
The steps of the process are as follows:
Our (very) specific questions of interest for this project are:
The first step of the workflow is to set up a “Project” within RStudio. The next step is to open a new R script and load the following packages:
library(dplyr)
library(readr)
library(tidyr)
library(rtweet)
library(writexl)
library(readxl)
library(tidytext)
library(textdata)
library(ggplot2)
library(scales)
The Twitter data for the study was accessed through a file in the GitHub course repository.
In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018).
rtweet package and some key functions to search for tweets or users of interest.
tidytext package to both “tidy” and tokenize our tweets in order to create our data frame for analysis.
inner_join() function for appending sentiment values to our data frame.
The Import Tweets section introduces the following functions from the rtweet package for reading Twitter data into R:
search_tweets() pulls up to 18,000 tweets from the last 6-9 days matching provided search terms.
search_tweets2() returns data from multiple search queries.
get_timelines() returns up to 3,200 tweets of one or more specified Twitter users.
We will use the search_tweets() function to try reading into R 5,000 tweets containing the NGSS hashtag and store them as a new data frame, ngss_all_tweets.
Type or copy the following code into your R script or console and run:
Note that the first argument that the search_tweets() function expects, q =, is the search term enclosed in quotation marks, and that n = specifies the maximum number of tweets to return.
While not explicitly mentioned in the paper, it’s likely the authors removed retweets in their query, since a retweet is simply someone reposting someone else’s tweet and would duplicate the exact content of the original.
Let’s use the include_rts = argument to remove any
retweets by setting it to FALSE:
If you recall from [Section 1a], the authors accessed tweets and user information from the hashtag-based #NGSSchat online community, as well as all tweets that included any of the following phrases, with “/” indicating an additional phrase featuring the respective plural form: “ngss”, “next generation science standard/s”, “next gen science standard/s”.
Let’s modify our query using the OR operator to also
include “ngss” so it will return tweets containing either #NGSSchat or
“ngss” and assign to ngss_or_tweets:
Unfortunately, the OR operator will only get us so far.
In order to include the additional search terms, we will need to use the
c() function to combine our search terms into a single
list.
The rtweet package has an additional search_tweets2() function for using multiple queries in a search. To search for an exact phrase, either wrap single quotes around a search query that uses double quotes, e.g., q = '"next gen science standard"', or escape each internal double quote with a single backslash, e.g., q = "\"next gen science standard\"".
Copy and paste the following code to store the results of our query in ngss_tweets:
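A minimal sketch of such a query, assuming the search phrases listed earlier and valid Twitter API credentials. The object names q_single, q_escaped, ngss_queries, and run_query are illustrative, and the API call is guarded by a flag so the rest of the snippet runs without network access:

```r
# Two ways of writing an exact-phrase query; both produce the same string
q_single  <- '"next gen science standard"'
q_escaped <- "\"next gen science standard\""
identical(q_single, q_escaped)

# Combine several search terms into one character vector with c()
# (the exact set of terms here is an assumption for illustration)
ngss_queries <- c("#NGSSchat",
                  "ngss",
                  q_single,
                  '"next generation science standard"')

# The search itself needs Twitter API access, so it only runs when the
# flag is switched on by hand
run_query <- FALSE
if (run_query) {
  ngss_tweets <- rtweet::search_tweets2(q = ngss_queries,
                                        n = 5000,
                                        include_rts = FALSE)
}
```

Keeping the query vector separate from the call makes it easy to inspect the exact strings that will be sent before spending any of the API quota.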
To compare public sentiment about both the NGSS and CCSS state
standards, we will create four dictionaries. First, we will create our
very first “dictionary” for identifying tweets related to either set of
standards, and then use that dictionary for the q =
query argument to pull tweets related to the state standards.
To do so, we’ll need to add some additional search terms to our list:
Now let’s create a dictionary for the Common Core State Standards and
pass that to our search_tweets() function to get the most
recent tweets:
Notice that you can use the pipe operator with the
search_tweets() function just like you would other
functions from the tidyverse.
Finally, let’s save our tweet files to use in later exercises, since the tweets returned by a search change from minute to minute. We’ll save them as a Microsoft Excel file since one of our columns cannot be stored in a flat file like .csv.
Let’s use the write_xlsx() function from the writexl package just like we would the write_csv() function from readr in Unit 1:
Now that we have the data needed to answer our questions, we still have a little bit of work to do to get it ready for analysis. This section will revisit some familiar functions from Unit 1 and introduce a couple new functions:
dplyr functions
select() picks variables based on their names.
slice() lets you select, remove, and duplicate rows.
rename() changes the names of individual variables using new_name = old_name syntax.
filter() picks cases, or rows, based on their values in a specified column.
tidytext functions
unnest_tokens() splits a column into tokens.
anti_join() (from dplyr) returns all rows from x without a match in y.
We’ll use the readxl package highlighted in Unit 1 and the read_xlsx() function to read in the data stored in the data folder of our R project:
ngss_tweets <- read_xlsx("data/ngss_tweets.xlsx")
ccss_tweets <- read_xlsx("data/csss_tweets.xlsx")
As you are probably already aware, we have way more data than we’ll need for analysis and will need to pare it down quite a bit.
First, let’s use the filter() function to subset rows
containing only tweets in English:
ngss_text <- filter(ngss_tweets, lang == "en")
Now let’s select the following columns from our new
ngss_text data frame:
screen_name of the user who created the tweet
created_at timestamp for examining changes in sentiment over time
text containing the tweet which is our primary data source of interest
ngss_text <- select(ngss_text, screen_name, created_at, text)
Since we are interested in comparing the sentiment of NGSS tweets with CCSS tweets, it would be helpful if we had a column for quickly identifying the set of state standards with which each tweet is associated.
We’ll use the mutate() function to create a new variable
called standards to label each tweet as “ngss”:
ngss_text <- mutate(ngss_text, standards = "ngss")
And just because it bothers me, I’m going to use the
relocate() function to move the standards
column to the first position so I can quickly see which standards the
tweet is from:
ngss_text <- relocate(ngss_text, standards)
Note that you could also have used the select() function
to reorder columns like so:
ngss_text <- select(ngss_text, standards, screen_name, created_at, text)
Finally, let’s rewrite the code above using the %>%
operator so there is less redundancy and it is easier to read:
ngss_text <-
ngss_tweets %>%
filter(lang == "en") %>%
select(screen_name, created_at, text) %>%
mutate(standards = "ngss") %>%
relocate(standards)
Create a ccss_text data frame for our ccss_tweets Common Core tweets by modifying the code above.
Finally, let’s combine our ccss_text and ngss_text into a single data frame by using the bind_rows() function from dplyr and simply supplying the data frames that you want to combine as arguments:
tweets <- bind_rows(ngss_text, ccss_text)
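The whole wrangling sequence can be sketched on toy data. This is a minimal illustration, not the course solution: the data frames ngss_toy and ccss_toy, their contents, and the helper label_tweets() are all hypothetical stand-ins for the imported tweet files.

```r
library(dplyr)

# Hypothetical stand-ins for the imported tweet data
ngss_toy <- tibble(
  screen_name = c("user_a", "user_b"),
  created_at  = as.POSIXct(c("2021-02-27 17:33:27", "2021-02-27 14:17:34"),
                           tz = "UTC"),
  text        = c("Loved the NGSS workshop", "Atelier NGSS"),
  lang        = c("en", "fr")
)
ccss_toy <- tibble(
  screen_name = c("user_c", "user_d"),
  created_at  = as.POSIXct(c("2021-02-19 23:44:18", "2021-02-19 22:41:01"),
                           tz = "UTC"),
  text        = c("Common core math is confusing", "Fully fund schools"),
  lang        = c("en", "en")
)

# Hypothetical helper: keep English tweets, keep three columns, add a label
label_tweets <- function(df, label) {
  df %>%
    filter(lang == "en") %>%
    select(screen_name, created_at, text) %>%
    mutate(standards = label) %>%
    relocate(standards)
}

# Stack the two labelled data frames on top of each other
toy_tweets <- bind_rows(label_tweets(ngss_toy, "ngss"),
                        label_tweets(ccss_toy, "ccss"))
toy_tweets
```

Wrapping the repeated steps in one function is a design choice that keeps the NGSS and CCSS data frames structurally identical, which is what bind_rows() needs to line the columns up.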
And let’s take a quick look at both the head() and the
tail() of this new tweets data frame to make
sure it contains both “ngss” and “ccss” standards:
head(tweets)
## # A tibble: 6 × 4
## standards screen_name created_at text
## <chr> <chr> <dttm> <chr>
## 1 ngss loyr2662 2021-02-27 17:33:27 "Switching gears for a bit for the…
## 2 ngss loyr2662 2021-02-20 20:02:37 "Was just introduced to the Engine…
## 3 ngss Furlow_teach 2021-02-27 17:03:23 "@IBchemmilam @chemmastercorey I’m…
## 4 ngss Furlow_teach 2021-02-27 14:41:01 "@IBchemmilam @chemmastercorey How…
## 5 ngss TdiShelton 2021-02-27 14:17:34 "I am so honored and appreciative …
## 6 ngss TdiShelton 2021-02-27 15:49:17 "Thank you @brian_womack I loved c…
tail(tweets)
## # A tibble: 6 × 4
## standards screen_name created_at text
## <chr> <chr> <dttm> <chr>
## 1 ccss JosiePaul8807 2021-02-20 00:34:53 "@SenatorHick You realize science…
## 2 ccss ctwittnc 2021-02-19 23:44:18 "@winningatmylife I’ll bet none o…
## 3 ccss the_rbeagle 2021-02-19 23:27:06 "@dmarush @electronlove @Montgome…
## 4 ccss silea 2021-02-19 23:11:21 "@LizerReal I don’t think that’s …
## 5 ccss JodyCoyote12 2021-02-19 22:58:25 "@CarlaRK3 @NedLamont Fully fund …
## 6 ccss Ryan_Hawes 2021-02-19 22:41:01 "I just got an \"explainer\" on h…
First, let’s tokenize our tweets by using the
unnest_tokens() function to split each tweet into single words, one word per
row, to make it easier to analyze:
tweet_tokens <-
tweets %>%
unnest_tokens(output = word,
input = text)
Notice that unnest_tokens() also accepts a token = argument for choosing a tokenizer. The call above relies on the default, which splits text into lowercase words. Some versions of tidytext have also offered a specialized “tweets” tokenizer that is very useful for dealing with Twitter text or other text from online forums in that it retains hashtags and mentions of usernames with the @ symbol.
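A minimal sketch of tokenization on toy data, using the default word tokenizer. The data frame toy and its two sentences are made up for illustration:

```r
library(dplyr)
library(tidytext)

# Two made-up tweets, one per set of standards
toy <- tibble(
  standards = c("ngss", "ccss"),
  text      = c("Loved the NGSS workshop!", "Common Core math is confusing.")
)

# One row per word; the default tokenizer lowercases words and drops punctuation
toy_tokens <- toy %>%
  unnest_tokens(output = word, input = text)

toy_tokens
```

The other columns (here standards) are repeated on every row, so each word can still be traced back to the tweet it came from.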
Now let’s remove stop words like “the” and “a” that don’t help us learn much about what people are tweeting about the state standards.
tidy_tweets <-
tweet_tokens %>%
anti_join(stop_words, by = "word")
Notice that we’ve specified the by = argument to look
for matching words in the word column for both data sets
and remove any rows from the tweet_tokens dataset that
match the stop_words dataset. Remember when we first
tokenized our dataset I conveniently chose output = word as
the column name because it matches the column name word in
the stop_words dataset contained in the
tidytext package. This makes our call
to anti_join() simpler
because anti_join() knows to look for the column
named word in each dataset. However this wasn’t really
necessary since word is the only matching column name in
both datasets and it would have matched those columns by default.
Before wrapping up, let’s take a quick count of the most common words
in the tidy_tweets data frame:
count(tidy_tweets, word, sort = TRUE)
## # A tibble: 7,163 × 2
## word n
## <chr> <int>
## 1 common 1112
## 2 core 1109
## 3 https 623
## 4 t.co 623
## 5 math 450
## 6 ngss 224
## 7 students 141
## 8 science 140
## 9 school 128
## 10 amp 127
## # … with 7,153 more rows
Notice that the nonsense word “amp” is in our top ten words. If we
use the filter() function and grepl() query from Unit 1 on
our tweets data frame, we can see that “amp” seems to be
some sort of HTML residue that we might want to get rid of.
filter(tweets, grepl('amp', text))
## # A tibble: 124 × 4
## standards screen_name created_at text
## <chr> <chr> <dttm> <chr>
## 1 ngss TdiShelton 2021-02-27 14:17:34 "I am so honored and appreciati…
## 2 ngss STEMTeachTools 2021-02-27 16:25:04 "Open, non-hierarchical communi…
## 3 ngss NGSSphenomena 2021-02-25 13:24:22 "Bacteria have music preference…
## 4 ngss CTSKeeley 2021-02-21 21:50:04 "Today I was thinking about the…
## 5 ngss richbacolor 2021-02-24 14:14:49 "Last chance to register for @M…
## 6 ngss MrsEatonELL 2021-02-27 06:24:09 "Were we doing the hand jive? N…
## 7 ngss STEMuClaytion 2021-02-24 14:56:19 "#WonderWednesday w/ questions …
## 8 ngss LearningUNDFTD 2021-02-24 18:13:01 "Are candies like M&Ms and …
## 9 ngss abeslo 2021-02-26 18:54:31 "#M'Kenna, whose story we share…
## 10 ngss E3Chemistry 2021-02-25 14:15:20 "Molarity & Parts Per Milli…
## # … with 114 more rows
Let’s rewrite our stop word code to add a custom stop word to filter out rows with “amp” in them:
tidy_tweets <-
tweet_tokens %>%
anti_join(stop_words, by = "word") %>%
filter(!word == "amp")
Note that we could extend this filter to weed out any additional words that don’t carry much meaning but skew our data by being so prominent.
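One way to extend the filter, sketched on toy data: collect the unwanted words in their own small data frame and remove them with a second anti_join(). The names toy_tokens, toy_stop, and custom_stop are hypothetical, and toy_stop is only a two-word stand-in for the much larger stop_words dataset.

```r
library(dplyr)

# Made-up tokens, including URL fragments and HTML residue
toy_tokens <- tibble(
  word = c("students", "the", "amp", "science", "a", "https", "t.co", "math")
)

# Tiny stand-in for tidytext's stop_words
toy_stop <- tibble(word = c("the", "a"))

# Custom stop words: residue that is frequent but carries no meaning
custom_stop <- tibble(word = c("amp", "https", "t.co"))

toy_tidy <- toy_tokens %>%
  anti_join(toy_stop, by = "word") %>%
  anti_join(custom_stop, by = "word")

toy_tidy
```

Keeping the custom words in a data frame rather than in a chain of filter() conditions means the list can grow without changing the pipeline.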
Now that we have our tweets nice and tidy, we’re almost ready to
begin exploring public sentiment (at least for the past week due to
Twitter API rate limits) around the CCSS and NGSS standards. For this
part of our workflow we introduce two new functions from the tidytext
and dplyr packages respectively:
get_sentiments() returns specific sentiment lexicons with the associated measures for each word in the lexicon.
inner_join() returns all rows from x where there are matching values in y, and all columns from x and y.
The tidytext package provides access to several sentiment lexicons based on unigrams, i.e., single words. These lexicons contain many English words, and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth.
The three general-purpose lexicons we’ll focus on are:
AFINN assigns words with a score that runs between
-5 and 5, with negative scores indicating negative sentiment and
positive scores indicating positive sentiment.
bing categorizes words in a binary fashion into
positive and negative categories.
nrc categorizes words in a binary fashion
(“yes”/“no”) into categories of positive, negative, anger, anticipation,
disgust, fear, joy, sadness, surprise, and trust.
Let’s take a quick look at each of these lexicons using the
get_sentiments() function and assign them to their
respective names for later use:
afinn <- get_sentiments("afinn")
afinn
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
bing <- get_sentiments("bing")
bing
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
nrc <- get_sentiments("nrc")
nrc
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,862 more rows
And just out of curiosity, let’s take a look at the
loughran lexicon as well:
loughran <- get_sentiments("loughran")
loughran
## # A tibble: 4,150 × 2
## word sentiment
## <chr> <chr>
## 1 abandon negative
## 2 abandoned negative
## 3 abandoning negative
## 4 abandonment negative
## 5 abandonments negative
## 6 abandons negative
## 7 abdicated negative
## 8 abdicates negative
## 9 abdicating negative
## 10 abdication negative
## # … with 4,140 more rows
We’ve reached the final step in our data wrangling process before we can begin exploring our data to address our questions.
In the previous section, we used anti_join() to remove
stop words in our dataset. For sentiment analysis, we’re going to use the
inner_join() function to do something similar. However,
instead of removing rows that contain words matching those in our stop
words dictionary, inner_join() allows us to keep only the
rows with words that match words in our sentiment lexicons, or
dictionaries, along with the sentiment measure for that word from the
sentiment lexicon.
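The contrast between the two joins can be seen on a toy example. This is a sketch under stated assumptions: toy_tokens and toy_lexicon are made up, and the scores in toy_lexicon are illustrative values rather than entries looked up in a real lexicon.

```r
library(dplyr)

# Made-up tokens from two tweets
toy_tokens <- tibble(
  tweet = c(1, 1, 1, 2, 2),
  word  = c("love", "the", "standards", "confusing", "math")
)

# Made-up two-word lexicon with illustrative scores
toy_lexicon <- tibble(
  word  = c("love", "confusing"),
  value = c(3, -2)
)

# inner_join() keeps only tokens found in the lexicon and adds their score
toy_scored <- inner_join(toy_tokens, toy_lexicon, by = "word")

# anti_join() does the opposite: it keeps only tokens NOT found in the lexicon
toy_unscored <- anti_join(toy_tokens, toy_lexicon, by = "word")

toy_scored
toy_unscored
```

Words that are absent from the lexicon simply disappear after inner_join(), so the number of rows that survive depends heavily on which lexicon is used.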
Let’s use inner_join() to combine our two
tidy_tweets and afinn data frames, keeping
only rows with matching data in the word column:
sentiment_afinn <- inner_join(tidy_tweets, afinn, by = "word")
sentiment_afinn
## # A tibble: 1,540 × 5
## standards screen_name created_at word value
## <chr> <chr> <dttm> <chr> <dbl>
## 1 ngss loyr2662 2021-02-27 17:33:27 win 4
## 2 ngss Furlow_teach 2021-02-27 17:03:23 love 3
## 3 ngss Furlow_teach 2021-02-27 17:03:23 sweet 2
## 4 ngss Furlow_teach 2021-02-27 17:03:23 significance 1
## 5 ngss TdiShelton 2021-02-27 14:17:34 honored 2
## 6 ngss TdiShelton 2021-02-27 14:17:34 opportunity 2
## 7 ngss TdiShelton 2021-02-27 14:17:34 wonderful 4
## 8 ngss TdiShelton 2021-02-27 14:17:34 powerful 2
## 9 ngss TdiShelton 2021-02-27 15:49:17 loved 3
## 10 ngss TdiShelton 2021-02-27 16:51:32 share 1
## # … with 1,530 more rows
Notice that each word in your sentiment_afinn data frame
now contains a value ranging from -5 (very negative) to 5 (very
positive).
sentiment_bing <- inner_join(tidy_tweets, bing, by = "word")
sentiment_bing
## # A tibble: 1,668 × 5
## standards screen_name created_at word sentiment
## <chr> <chr> <dttm> <chr> <chr>
## 1 ngss loyr2662 2021-02-27 17:33:27 win positive
## 2 ngss Furlow_teach 2021-02-27 17:03:23 love positive
## 3 ngss Furlow_teach 2021-02-27 17:03:23 helped positive
## 4 ngss Furlow_teach 2021-02-27 17:03:23 sweet positive
## 5 ngss Furlow_teach 2021-02-27 17:03:23 tough positive
## 6 ngss TdiShelton 2021-02-27 14:17:34 honored positive
## 7 ngss TdiShelton 2021-02-27 14:17:34 appreciative positive
## 8 ngss TdiShelton 2021-02-27 14:17:34 wonderful positive
## 9 ngss TdiShelton 2021-02-27 14:17:34 powerful positive
## 10 ngss TdiShelton 2021-02-27 15:49:17 loved positive
## # … with 1,658 more rows
Now that we have our tweets tidied and sentiments joined, we’re ready for a little data exploration.
Before we dig into sentiment, let’s use the handy
ts_plot function built into rtweet to take a
very quick look at how far back our tidied tweets data set
goes:
ts_plot(tweets, by = "days")
Notice that this effectively creates a ggplot time
series plot for us. I’ve included the by = argument, which
by default is set to “days”. It looks like tweets go back 9 days, which
is the limit set by Twitter.
Try changing it to “hours” and see what happens.
Hint: use the ?ts_plot help function to check the
examples to see how this can be done.
Your line graph should look something like this:
Since our primary goal is to compare public sentiment around the
NGSS and CCSS state standards, in this section we put together some
basic numerical summaries using our different lexicons to see whether
tweets are generally more positive or negative for each standard as well
as differences between the two. To do this, we revisit the following
dplyr functions:
count() lets you quickly count the unique values of one or more variables.
group_by() takes a data frame and one or more variables to group by.
summarise() creates a numerical summary of data using functions like mean() and median().
mutate() adds new variables and preserves existing ones.
And introduce one new function:
spread()
Let’s start with bing, our simplest sentiment lexicon, and use the count() function to count how many times “positive” and “negative” occur in the sentiment column of our sentiment_bing data frame:
summary_bing <- count(sentiment_bing, sentiment, sort = TRUE)
Collectively, it looks like our combined dataset has more negative words than positive words.
summary_bing
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 992
## 2 positive 676
Since our main goal is to compare positive and negative sentiment
between CCSS and NGSS, let’s use the group_by function
again to get sentiment summaries for NGSS and CCSS
separately:
summary_bing <- sentiment_bing %>%
group_by(standards) %>%
count(sentiment)
summary_bing
## # A tibble: 4 × 3
## # Groups: standards [2]
## standards sentiment n
## <chr> <chr> <int>
## 1 ccss negative 926
## 2 ccss positive 446
## 3 ngss negative 66
## 4 ngss positive 230
Looks like CCSS tweets have far more negative words than positive, while NGSS skews much more positive. So far, pretty consistent with Rosenberg et al.’s findings!
Our last step will be to calculate a single sentiment “score” for our tweets that we can use for quick comparison and create a new variable indicating which lexicon we used.
First, let’s untidy our data a little by using the
spread function from the tidyr package to
transform our sentiment column into separate columns for
negative and positive that contain the
n counts for each:
summary_bing <- sentiment_bing %>%
group_by(standards) %>%
count(sentiment, sort = TRUE) %>%
spread(sentiment, n)
summary_bing
## # A tibble: 2 × 3
## # Groups: standards [2]
## standards negative positive
## <chr> <int> <int>
## 1 ccss 926 446
## 2 ngss 66 230
Finally, we’ll use the mutate function to create two new
variables: sentiment and lexicon so we have a
single sentiment score and the lexicon from which it was derived:
summary_bing <- sentiment_bing %>%
group_by(standards) %>%
count(sentiment, sort = TRUE) %>%
spread(sentiment, n) %>%
mutate(sentiment = positive - negative) %>%
mutate(lexicon = "bing") %>%
relocate(lexicon)
summary_bing
## # A tibble: 2 × 5
## # Groups: standards [2]
## lexicon standards negative positive sentiment
## <chr> <chr> <int> <int> <int>
## 1 bing ccss 926 446 -480
## 2 bing ngss 66 230 164
There we go, now we can see that CCSS scores negative, while NGSS is overall positive.
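The reshaping and scoring step can be checked by hand with the bing counts shown above. The sketch below re-enters those four counts in a small data frame (bing_counts is a name chosen for illustration) and also shows pivot_wider(), the newer tidyr function that supersedes spread(), producing the same result:

```r
library(dplyr)
library(tidyr)

# Word counts copied from the grouped bing summary above
bing_counts <- tibble(
  standards = c("ccss", "ccss", "ngss", "ngss"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(926L, 446L, 66L, 230L)
)

# spread(): one column per sentiment, then positive minus negative
bing_wide <- bing_counts %>%
  spread(sentiment, n) %>%
  mutate(sentiment = positive - negative)

# The same reshaping written with pivot_wider()
bing_wide2 <- bing_counts %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(sentiment = positive - negative)

bing_wide
```

For CCSS this gives 446 - 926 = -480, and for NGSS 230 - 66 = 164, matching the scores in the output above.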
Let’s calculate a quick score using the afinn
lexicon now. Remember that AFINN provides a value from -5 to 5 for
each word:
head(sentiment_afinn)
## # A tibble: 6 × 5
## standards screen_name created_at word value
## <chr> <chr> <dttm> <chr> <dbl>
## 1 ngss loyr2662 2021-02-27 17:33:27 win 4
## 2 ngss Furlow_teach 2021-02-27 17:03:23 love 3
## 3 ngss Furlow_teach 2021-02-27 17:03:23 sweet 2
## 4 ngss Furlow_teach 2021-02-27 17:03:23 significance 1
## 5 ngss TdiShelton 2021-02-27 14:17:34 honored 2
## 6 ngss TdiShelton 2021-02-27 14:17:34 opportunity 2
To calculate a summary score, we will need to first group our
data by standards again and then use the
summarise function to create a new sentiment
variable by adding all the positive and negative scores in the
value column:
summary_afinn <- sentiment_afinn %>%
group_by(standards) %>%
summarise(sentiment = sum(value)) %>%
mutate(lexicon = "AFINN") %>%
relocate(lexicon)
summary_afinn
## # A tibble: 2 × 3
## lexicon standards sentiment
## <chr> <chr> <dbl>
## 1 AFINN ccss -808
## 2 AFINN ngss 503
Again, CCSS is overall negative while NGSS is overall positive!
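How the sum behaves when positive and negative values are mixed can be seen on a small example. In this sketch the six ngss values are the ones shown by head(sentiment_afinn) above, while the three ccss values are made up for illustration; the name toy_afinn is hypothetical.

```r
library(dplyr)

toy_afinn <- tibble(
  standards = c(rep("ngss", 6), rep("ccss", 3)),
  # ngss values come from the head() output above; ccss values are invented
  value     = c(4, 3, 2, 1, 2, 2, -2, -3, 1)
)

toy_summary <- toy_afinn %>%
  group_by(standards) %>%
  summarise(sentiment = sum(value),   # positives and negatives cancel out
            words     = n()) %>%      # how many scored words went into the sum
  mutate(lexicon = "AFINN") %>%
  relocate(lexicon)

toy_summary
```

Reporting the word count next to the sum is a small design choice: a sum grows with the number of tweets collected, so two sets of standards with very different volumes are easier to compare when the count is visible.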
The same approach with the nrc lexicon gives the summary below, where the sentiment column matches the ratio of positive to negative words:
## # A tibble: 2 × 5
## # Groups: standards [2]
## standards method negative positive sentiment
## <chr> <chr> <int> <int> <dbl>
## 1 ccss nrc 766 2294 2.99
## 2 ngss nrc 79 571 7.23
As highlighted in Chapter 3 of Data Science in Education Using R, the Model step of the data science process entails “using statistical models, from simple to complex, to understand trends and patterns in the data.” The authors note that while descriptive statistics and data visualization during the Explore step can help us to identify patterns and relationships in our data, statistical models can be used to help us determine if relationships, patterns and trends are actually meaningful.
Recall from the PREPARE section that the Rosenberg et al. study was guided by the following questions:
Similar to our sentiment summary using the AFINN lexicon, the Rosenberg et al. study used the -5 to 5 sentiment score from the SentiStrength lexicon to answer RQ #1. To address the remaining questions, the authors used a mixed effects model (also known as a multi-level or hierarchical linear model) via the lme4 package in R.
Collectively, the authors found that:
The final steps in the workflow are to share the results of the analysis:
The questions of interest for selection, polishing, and narration are:
To address questions 1 and 2, the analyses, data products and sharing focus on the following:
bing, nrc, and loughran lexicons, I’ll create some 100% stacked bars showing the percentage of positive and negative words among all tweets for the NGSS and CCSS.
To replicate the approach Rosenberg et al. used in their analysis, some R code from section 2b. Tidy Text will be used.
To polish the analyses, first we will rebuild the
tweets dataset from my ngss_tweets and
ccss_tweets and select both the status_id that
is unique to each tweet, and the text column which contains
the actual post:
ngss_text <-
ngss_tweets %>%
filter(lang == "en") %>%
select(status_id, text) %>%
mutate(standards = "ngss") %>%
relocate(standards)
ccss_text <-
ccss_tweets %>%
filter(lang == "en") %>%
select(status_id, text) %>%
mutate(standards = "ccss") %>%
relocate(standards)
tweets <- bind_rows(ngss_text, ccss_text)
tweets
## # A tibble: 1,441 × 3
## standards status_id text
## <chr> <chr> <chr>
## 1 ngss 1365716690336645124 "Switching gears for a bit for the \"Crosscutt…
## 2 ngss 1363217513761415171 "Was just introduced to the Engineering Habits…
## 3 ngss 1365709122763653133 "@IBchemmilam @chemmastercorey I’m familiar w/…
## 4 ngss 1365673294360420353 "@IBchemmilam @chemmastercorey How well does t…
## 5 ngss 1365667393188601857 "I am so honored and appreciative to have an o…
## 6 ngss 1365690477266284545 "Thank you @brian_womack I loved connecting wi…
## 7 ngss 1365706140496130050 "Please share #NGSSchat PLN! https://t.co/Qc2c…
## 8 ngss 1363669328147677189 "So excited about this weekend’s learning... p…
## 9 ngss 1365442786544214019 "The Educators Evaluating the Quality of Instr…
## 10 ngss 1364358149164175362 "Foster existing teacher social networks that …
## # … with 1,431 more rows
The status_id is important because like Rosenberg et
al., we want to calculate an overall sentiment score for each tweet,
rather than for each word.
Before I get that far however, I’ll need to tidy my
tweets again and attach my sentiment
scores.
Note that the closest lexicon we have available in our
tidytext package to the SentiStrength lexicon used by
Rosenberg is the AFINN lexicon which also uses a -5 to 5 point
scale.
So let’s use unnest_tokens to tidy our tweets, remove
stop words, and add afinn scores to each word similar to
what we did in section 2c. Add
Sentiment Values:
sentiment_afinn <- tweets %>%
unnest_tokens(output = word,
input = text
) %>%
anti_join(stop_words, by = "word") %>%
filter(!word == "amp") %>%
inner_join(afinn, by = "word")
sentiment_afinn
## # A tibble: 1,540 × 4
## standards status_id word value
## <chr> <chr> <chr> <dbl>
## 1 ngss 1365716690336645124 win 4
## 2 ngss 1365709122763653133 love 3
## 3 ngss 1365709122763653133 sweet 2
## 4 ngss 1365709122763653133 significance 1
## 5 ngss 1365667393188601857 honored 2
## 6 ngss 1365667393188601857 opportunity 2
## 7 ngss 1365667393188601857 wonderful 4
## 8 ngss 1365667393188601857 powerful 2
## 9 ngss 1365690477266284545 loved 3
## 10 ngss 1365706140496130050 share 1
## # … with 1,530 more rows
Next, I want to calculate a single score for each tweet. To do that,
I’ll use the by now familiar group_by and
summarise functions:
afinn_score <- sentiment_afinn %>%
group_by(standards, status_id) %>%
summarise(value = sum(value))
afinn_score
## # A tibble: 857 × 3
## # Groups: standards [2]
## standards status_id value
## <chr> <chr> <dbl>
## 1 ccss 1362894990813188096 2
## 2 ccss 1362899370199445508 4
## 3 ccss 1362906588021989376 -2
## 4 ccss 1362910494487535618 -9
## 5 ccss 1362910913855160320 -1
## 6 ccss 1362928225379250179 2
## 7 ccss 1362933982074073090 -1
## 8 ccss 1362947497258151945 -3
## 9 ccss 1362949805694013446 3
## 10 ccss 1362970614282264583 3
## # … with 847 more rows
And like Rosenberg et al., I’ll add a flag for whether the tweet is “positive” or “negative” using the mutate function to create a new sentiment column.
To do this, we introduce the new if_else function from the dplyr package. This if_else function adds “negative” to the sentiment column if the score in the value column of the corresponding row is less than 0. If not, it will add “positive” to the row.
afinn_sentiment <- afinn_score %>%
filter(value != 0) %>%
mutate(sentiment = if_else(value < 0, "negative", "positive"))
afinn_sentiment
## # A tibble: 820 × 4
## # Groups: standards [2]
## standards status_id value sentiment
## <chr> <chr> <dbl> <chr>
## 1 ccss 1362894990813188096 2 positive
## 2 ccss 1362899370199445508 4 positive
## 3 ccss 1362906588021989376 -2 negative
## 4 ccss 1362910494487535618 -9 negative
## 5 ccss 1362910913855160320 -1 negative
## 6 ccss 1362928225379250179 2 positive
## 7 ccss 1362933982074073090 -1 negative
## 8 ccss 1362947497258151945 -3 negative
## 9 ccss 1362949805694013446 3 positive
## 10 ccss 1362970614282264583 3 positive
## # … with 810 more rows
Note that since a tweet sentiment score equal to 0 is neutral, I used
the filter function to remove it from the dataset.
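The per-tweet scoring and flagging steps can be traced on toy data. A minimal sketch, assuming three made-up tweets (t1 to t3) with invented word scores; the names toy_words, toy_scores, and toy_flags are hypothetical:

```r
library(dplyr)

# Made-up word-level scores for three tweets
toy_words <- tibble(
  standards = c("ngss", "ngss", "ccss", "ccss", "ccss", "ccss"),
  status_id = c("t1", "t1", "t2", "t2", "t3", "t3"),
  value     = c(4, 3, -2, -3, 2, -2)
)

# One score per tweet: t1 = 7, t2 = -5, t3 = 0
toy_scores <- toy_words %>%
  group_by(standards, status_id) %>%
  summarise(value = sum(value), .groups = "drop")

# Drop neutral tweets first, then flag the rest
toy_flags <- toy_scores %>%
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive"))

toy_flags
```

The order of the two steps matters: without the filter, a tweet scoring exactly 0 (t3 here) would fall into the "positive" branch of if_else, because 0 is not less than 0.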
Finally, we’re ready to compute our ratio. We’ll use the
group_by function and count the number of
tweets for each of the standards that are positive or
negative in the sentiment column. Then we’ll use the
spread function to separate them out into separate columns
so we can perform a quick calculation to compute the
ratio.
afinn_ratio <- afinn_sentiment %>%
group_by(standards) %>%
count(sentiment) %>%
spread(sentiment, n) %>%
mutate(ratio = negative/positive)
afinn_ratio
## # A tibble: 2 × 4
## # Groups: standards [2]
## standards negative positive ratio
## <chr> <int> <int> <dbl>
## 1 ccss 421 211 2.00
## 2 ngss 21 167 0.126
Finally, let’s visualize the proportion of positive and negative NGSS tweets as a pie chart:
afinn_counts <- afinn_sentiment %>%
group_by(standards) %>%
count(sentiment) %>%
filter(standards == "ngss")
afinn_counts %>%
ggplot(aes(x="", y=n, fill=sentiment)) +
geom_bar(width = .6, stat = "identity") +
labs(title = "Next Gen Science Standards",
subtitle = "Proportion of Positive & Negative Tweets") +
coord_polar(theta = "y") +
theme_void()
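As an alternative to one pie chart per set of standards, both can be placed side by side in a 100% stacked bar chart. This is a sketch rather than the original figure: it re-enters the tweet-level counts from the afinn_ratio table above in a small data frame, and the names tweet_counts and stacked_bars are chosen for illustration.

```r
library(ggplot2)

# Tweet counts copied from the afinn_ratio output above
tweet_counts <- data.frame(
  standards = c("ccss", "ccss", "ngss", "ngss"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(421, 211, 21, 167)
)

# position = "fill" rescales each bar to 100%, so shares are directly comparable
stacked_bars <- ggplot(tweet_counts,
                       aes(x = standards, y = n, fill = sentiment)) +
  geom_col(position = "fill", width = 0.6) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Positive and Negative Tweets by Standards",
       x = NULL,
       y = "Share of tweets")

# Run print(stacked_bars) to draw the chart
```

Because every bar has the same height, the chart hides the difference in tweet volume between the two sets of standards and shows only the split between positive and negative.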
Finally, to address Question 2, I want to compare the percentage of positive and negative words contained in the corpus of tweets for the NGSS and CCSS standards using the four different lexicons to see how sentiment compares based on lexicon used.
I’ll begin by polishing my previous summaries and creating identical
summaries for each lexicon that contain the following columns:
method, standards, sentiment, and
n, or word counts:
summary_afinn2 <- sentiment_afinn %>%
group_by(standards) %>%
filter(value != 0) %>%
mutate(sentiment = if_else(value < 0, "negative", "positive")) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "AFINN")
summary_bing2 <- sentiment_bing %>%
group_by(standards) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "bing")
summary_nrc2 <- sentiment_nrc %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(standards) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "nrc")
summary_loughran2 <- sentiment_loughran %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(standards) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "loughran")
Next, I’ll combine those four data frames together using the
bind_rows function again:
summary_sentiment <- bind_rows(summary_afinn2,
summary_bing2,
summary_nrc2,
summary_loughran2) %>%
arrange(method, standards) %>%
relocate(method)
summary_sentiment
## # A tibble: 16 × 4
## # Groups: standards [2]
## method standards sentiment n
## <chr> <chr> <chr> <int>
## 1 AFINN ccss negative 740
## 2 AFINN ccss positive 477
## 3 AFINN ngss positive 278
## 4 AFINN ngss negative 45
## 5 bing ccss negative 926
## 6 bing ccss positive 446
## 7 bing ngss positive 230
## 8 bing ngss negative 66
## 9 loughran ccss negative 433
## 10 loughran ccss positive 112
## 11 loughran ngss negative 73
## 12 loughran ngss positive 57
## 13 nrc ccss positive 2294
## 14 nrc ccss negative 766
## 15 nrc ngss positive 571
## 16 nrc ngss negative 79
Then I’ll create a new data frame that has the total word counts for
each set of standards and each method and join that to my
summary_sentiment data frame:
total_counts <- summary_sentiment %>%
group_by(method, standards) %>%
summarise(total = sum(n))
## `summarise()` has grouped output by 'method'. You can override using the
## `.groups` argument.
sentiment_counts <- left_join(summary_sentiment, total_counts)
## Joining, by = c("method", "standards")
sentiment_counts
## # A tibble: 16 × 5
## # Groups: standards [2]
## method standards sentiment n total
## <chr> <chr> <chr> <int> <int>
## 1 AFINN ccss negative 740 1217
## 2 AFINN ccss positive 477 1217
## 3 AFINN ngss positive 278 323
## 4 AFINN ngss negative 45 323
## 5 bing ccss negative 926 1372
## 6 bing ccss positive 446 1372
## 7 bing ngss positive 230 296
## 8 bing ngss negative 66 296
## 9 loughran ccss negative 433 545
## 10 loughran ccss positive 112 545
## 11 loughran ngss negative 73 130
## 12 loughran ngss positive 57 130
## 13 nrc ccss positive 2294 3060
## 14 nrc ccss negative 766 3060
## 15 nrc ngss positive 571 650
## 16 nrc ngss negative 79 650
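As an aside, the same totals can be attached without a separate summary and join by using a grouped `mutate()`. This is a minimal sketch of that alternative, not the approach used in the walkthrough. The `toy_counts` data frame is a hand-typed copy of the four bing rows from the table above, and the object names are my own.

```r
library(dplyr)

# Hand-typed copy of the bing rows from summary_sentiment (for illustration)
toy_counts <- tibble(
  method    = c("bing", "bing", "bing", "bing"),
  standards = c("ccss", "ccss", "ngss", "ngss"),
  sentiment = c("negative", "positive", "positive", "negative"),
  n         = c(926L, 446L, 230L, 66L)
)

# sum(n) is computed within each method/standards group and repeated on
# every row of that group, so no join is needed
toy_totals <- toy_counts %>%
  group_by(method, standards) %>%
  mutate(total = sum(n)) %>%
  ungroup()

toy_totals
```

The `ungroup()` at the end avoids carrying the grouping into later steps, which is also why no `summarise()` grouping message appears here.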
Finally, I’ll add a new column that calculates the percentage of positive and negative words for each lexicon and set of state standards:
sentiment_percents <- sentiment_counts %>%
mutate(percent = n/total * 100)
sentiment_percents
## # A tibble: 16 × 6
## # Groups: standards [2]
## method standards sentiment n total percent
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 AFINN ccss negative 740 1217 60.8
## 2 AFINN ccss positive 477 1217 39.2
## 3 AFINN ngss positive 278 323 86.1
## 4 AFINN ngss negative 45 323 13.9
## 5 bing ccss negative 926 1372 67.5
## 6 bing ccss positive 446 1372 32.5
## 7 bing ngss positive 230 296 77.7
## 8 bing ngss negative 66 296 22.3
## 9 loughran ccss negative 433 545 79.4
## 10 loughran ccss positive 112 545 20.6
## 11 loughran ngss negative 73 130 56.2
## 12 loughran ngss positive 57 130 43.8
## 13 nrc ccss positive 2294 3060 75.0
## 14 nrc ccss negative 766 3060 25.0
## 15 nrc ngss positive 571 650 87.8
## 16 nrc ngss negative 79 650 12.2
Now that I have my sentiment percent summaries for each lexicon, I’m going to create my 100% stacked bar charts for each lexicon:
sentiment_percents %>%
ggplot(aes(x = standards, y = percent, fill=sentiment)) +
geom_bar(width = .8, stat = "identity") +
facet_wrap(~method, ncol = 1) +
coord_flip() +
labs(title = "Public Sentiment on Twitter",
subtitle = "The Common Core & Next Gen Science Standards",
x = "State Standards",
y = "Percentage of Words")
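A 100% stacked bar chart can also be drawn straight from the raw counts by letting ggplot2 rescale each bar with `position = "fill"`. The sketch below is one possible variation, not the walkthrough's method. The `toy_counts` data frame is a hand-typed copy of the AFINN and bing counts shown earlier, and `fill_chart` is a name I made up.

```r
library(dplyr)
library(ggplot2)

# Hand-typed copy of the AFINN and bing word counts (for illustration)
toy_counts <- tibble(
  method    = rep(c("AFINN", "bing"), each = 4),
  standards = rep(c("ccss", "ccss", "ngss", "ngss"), times = 2),
  sentiment = c("negative", "positive", "positive", "negative",
                "negative", "positive", "positive", "negative"),
  n         = c(740, 477, 278, 45, 926, 446, 230, 66)
)

# position = "fill" rescales each bar to sum to 1, so the percent
# column does not have to be computed beforehand
fill_chart <- toy_counts %>%
  ggplot(aes(x = standards, y = n, fill = sentiment)) +
  geom_col(width = .8, position = "fill") +
  facet_wrap(~method, ncol = 1) +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "State Standards",
       y = "Percentage of Words")

fill_chart
```

The trade-off is that the percentages exist only in the plot, so the `sentiment_percents` table is still the better choice when the numbers themselves need to be reported.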
And finished! The chart above illustrates that, regardless of the sentiment lexicon used, tweets about the NGSS contain a higher percentage of positive words than tweets about the CCSS.
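The comparison can also be checked numerically by placing the positive percentages for the two sets of standards side by side. This is a small sketch under the assumption that only the positive rows are of interest. The `positive_percents` values are hand-typed from the `sentiment_percents` output above, and `positive_gap` is a name I made up.

```r
library(dplyr)
library(tidyr)

# Positive-word percentages copied by hand from sentiment_percents
positive_percents <- tibble(
  method    = rep(c("AFINN", "bing", "loughran", "nrc"), each = 2),
  standards = rep(c("ccss", "ngss"), times = 4),
  percent   = c(39.2, 86.1, 32.5, 77.7, 20.6, 43.8, 75.0, 87.8)
)

# One row per lexicon, one column per set of standards, plus the difference
positive_gap <- positive_percents %>%
  pivot_wider(names_from = standards, values_from = percent) %>%
  mutate(gap = ngss - ccss)

positive_gap
```

A positive `gap` in every row means the NGSS percentage is higher under each of the four lexicons, although the size of the difference varies considerably from one lexicon to another.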
This project is a reorganization and resharing of a project that our class undertook with Dr. Shiyan Jiang for ECI 588. The purpose of reorganizing this walkthrough was to give a beginning coder practical familiarity with the steps of creating a useful project in R that tells a data story by moving through the following steps: prepare, wrangle, explore, model, and communicate.
Purpose. Two guiding questions drove this analysis:
1. What is the public sentiment expressed toward the NGSS?
2. How does sentiment for NGSS compare to sentiment for CCSS?
The answers to these questions provide valuable insight for practicing
researchers looking at what tweets reflect about sentiments toward CCSS
and NGSS.
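Methods. The data selected for analysis were tweets regarding NGSS and CCSS. To prepare and analyze the data, the tweets were read into R, tidied and tokenized, joined with four sentiment lexicons (AFINN, bing, nrc, and loughran), and summarized as counts and percentages of positive and negative words.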
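Findings. Under all four lexicons, tweets about the NGSS contained a higher percentage of positive words than tweets about the CCSS. With AFINN and bing, CCSS words were mostly negative and NGSS words mostly positive. The other two lexicons did not split the standards this way: loughran rated both sets of tweets as majority negative, and nrc rated both as majority positive.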
Discussion. This analysis can be used to improve NGSS and CCSS policies and practice with the ultimate goal of improving learning outcomes for students. This study could be expanded to include more data over time, and to include more lexicons to further solidify findings.