Unit 2 Walkthrough: Twitter Sentiment and School Reform

0. INTRODUCTION

This week, our walkthrough is guided by my colleague Josh Rosenberg’s recent article, Advancing new methods for understanding public sentiment about educational reforms: The case of Twitter and the Next Generation Science Standards. We will focus on conducting a very simplistic “replication study” by comparing the sentiment of tweets about the Next Generation Science Standards (NGSS) and Common Core State Standards (CCSS) in order to better understand public reaction to these two curriculum reform efforts. I highly recommend you watch the quick 3-minute overview of this work at https://stanford.app.box.com/s/i5ixkj2b8dyy8q5j9o5ww4nafznb497x

Walkthrough Focus

For Unit 2, our focus will be on using the Twitter API to import data on topics or tweets of interest and using sentiment lexicons to help gauge public opinion about those topics or tweets. Silge & Robinson nicely illustrate the tools of text mining to approach the emotional content of text programmatically, in the following diagram:

For Unit 2, our walkthrough will cover the following topics:

Prepare: Prior to analysis, it’s critical to understand the context and data sources you’re working with so you can formulate useful and answerable questions. We’ll take a quick look at Dr. Rosenberg’s study as well as data available through Twitter’s API.
Wrangle: In section 2 we revisit tidying and tokenizing text from Unit 1, and learn some new functions for appending sentiment scores to our tweets using the AFINN, bing, and nrc sentiment lexicons.
Explore: In section 3, we use simple summary statistics and basic data visualization to compare sentiment between NGSS and CCSS tweets.
Model: While we won’t leverage modeling approaches until Unit 3, we will examine the mixed effects model used by Rosenberg et al. to analyze the sentiment of tweets.
Communicate: Finally, in Week 4 we’ll create a basic presentation, report, or other data product for sharing findings and insights from our analysis.

1. PREPARE

To help us better understand the context, questions, and data sources we’ll be using in Unit 2, this section will focus on the following topics:

Context. We take a quick look at the Rosenberg et al. (2021) article, Advancing new methods for understanding public sentiment about educational reforms, including the purpose of the study, questions explored, and findings.
Questions. We’ll formulate some basic questions that we’ll use to guide our analysis, attempting to replicate some of the findings by Rosenberg et al.
Twitter Setup. We walk through the process of setting up R to pull data from our Twitter developer account created during the first week of the course.

1a. Some Context

Twitter and the Next Generation Science Standards

Full Paper (Preprint)

Abstract

While the Next Generation Science Standards (NGSS) are a long-standing and widespread standards-based educational reform effort, they have received less public attention, and no studies have explored the sentiment of the views of multiple stakeholders toward them. To establish how public sentiment about this reform might be similar to or different from past efforts, we applied a suite of data science techniques to posts about the standards on Twitter from 2010-2020 (N = 571,378) from 87,719 users. Applying data science techniques to identify teachers and to estimate tweet sentiment, we found that the public sentiment towards the NGSS is overwhelmingly positive—33 times more so than for the CCSS. Mixed effects models indicated that sentiment became more positive over time and that teachers, in particular, showed a more positive sentiment towards the NGSS. We discuss implications for educational reform efforts and the use of data science methods for understanding their implementation.

Data Source & Analysis

Similar to what we’ll be learning in this walkthrough, Rosenberg et al. used publicly accessible data from Twitter collected using the Full-Archive Twitter API and the rtweet package in R. Specifically, the authors accessed tweets and user information from the hashtag-based #NGSSchat online community, all tweets that included any of the following phrases, with “/” indicating an additional phrase featuring the respective plural form: “ngss”, “next generation science standard/s”, “next gen science standard/s”.

Unlike this walkthrough, however, the authors determined tweet sentiment using the Java version of SentiStrength to assign tweets to two 5-point scales of sentiment, one for positivity and one for negativity, because SentiStrength is a validated measure for sentiment in short informal texts (Thelwall et al., 2011). In addition, the authors used this tool because Wang and Fikis (2019) used it to explore the sentiment of CCSS-related posts. We’ll be using the AFINN sentiment lexicon, which instead assigns each word a single score from -5 (most negative) to 5 (most positive), in addition to exploring some other sentiment lexicons.

Note that the authors also used the lme4 package in R to run a mixed effects model to determine if sentiment changes over time and differs between teachers and non-teachers. We will not attempt to replicate that aspect of the analysis, but if you are interested in a guided walkthrough of how modeling can be used to understand changes in Twitter word use, see Chapter 7 of Text Mining with R.

Summary of Key Findings

Contrasting with sentiment about CCSS, sentiment about the NGSS science education reform effort is overwhelmingly positive, with approximately 9 positive tweets for every negative tweet.
Teachers were more positive than non-teachers, and sentiment became substantially more positive over the ten years of NGSS-related posts.
Differences between the context of the tweets were small, but those that did not include the #NGSSchat hashtag became more positive over time than those posts that did include the hashtag.
When individuals posted more tweets during #NGSSchat chats, the sentiment of their posts was more positive, suggesting that while the context of individual tweets has a small effect (with posts not including the hashtag becoming more positive over time), the effect upon individuals of being involved in #NGSSchat was positive.

1b. Guiding Questions

The Rosenberg et al. study was guided by the following five research questions:

What is the public sentiment expressed toward the NGSS?
How does sentiment for teachers differ from non-teachers?
How do tweets posted to #NGSSchat differ from those without the hashtag?
How does participation in #NGSSchat relate to the public sentiment individuals express?
How does public sentiment vary over time?

For this walkthrough, we’ll use an approach similar to the one used by the authors to gauge public sentiment around the NGSS, by comparing how much more positive or negative NGSS tweets are relative to CCSS tweets.

Our (very) specific questions of interest for this walkthrough are:

What is the public sentiment expressed toward the NGSS?
How does sentiment for NGSS compare to sentiment for CCSS?

And just to reiterate from Unit 1, one overarching question we’ll explore throughout this course, and that Silge and Robinson (2018) identify as a central question to text mining and natural language processing, is:

How do we quantify what a document or collection of documents is about?

1c. Set Up

As highlighted in Chapter 6 of Data Science in Education Using R (DSIEUR), one of the first steps of every workflow should be to set up a “Project” within RStudio. This will be your “home” for any files and code used or created in Unit 2. You are welcome to continue using the same project created for Unit 1, or create an entirely new project for Unit 2. Once you’ve created your project, open a new R script and load the following packages that we’ll need for this walkthrough:

library(dplyr)
library(readr)
library(tidyr)
library(rtweet)
library(writexl)
library(readxl)
library(tidytext)
library(textdata)
library(ggplot2)
library(scales)

At the end of this week, I’ll ask that you share your R script with me as evidence that you have completed the walkthrough. Although I highly recommend that you manually type the code shared throughout this walkthrough, for large blocks of text it may be easier to copy and paste.

Create a Twitter App

Before you can begin pulling tweets into R, you’ll first need to create a Twitter App in your developer account. You are not required to set up a developer account for this course, but if you are still interested in creating one, these instructions succinctly outline the process and you can set one up in about 10 minutes. If you are not interested in setting one up and pulling tweets on your own, I have provided the data we’ll be using for this tutorial on my GitHub course repository and in our ECI 588 course site, and you can skip to section 2b. Tidy Text.

This section and the section that follows are borrowed largely from the rtweet package by Michael Kearney, and are for those of you who have set up a Twitter developer account and are interested in pulling your own data from Twitter.

Navigate to developer.twitter.com/en/apps, click the blue button that says, Create a New App, and then complete the form with the following fields:
- App Name: What your app will be called
- Application Description: How your app will be described to its users
- Website URLs: Website associated with app–I recommend using the URL to your Twitter profile
- Callback URLs: IMPORTANT enter exactly the following: http://127.0.0.1:1410
- Tell us how this app will be used: Be clear and honest
When you’ve completed the required form fields, click the blue Create button at the bottom
Read through and indicate whether you accept the developer terms
And you’re done!

Authorization methods

Users can create their personal Twitter token in two different ways. The method that uses your app’s access token and access token secret is outlined below.

Navigate to developer.twitter.com/en/apps and select your Twitter app
Click the tab labeled Keys and tokens to retrieve your keys.
Locate the Consumer API keys (aka “API Secret”).

Scroll down to Access token & access token secret and click Create

Copy and paste the four keys (along with the name of your app) into an R script file and pass them along to create_token(). Note, these keys are named secret for a reason. I recommend setting up your token in a separate R script than the one that you will eventually share.
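As a rough sketch of what passing the keys to create_token() might look like, the example below reads the four keys from environment variables rather than typing them into a script you might share. The environment variable names and the app name are placeholders made up for this illustration, and the call assumes the version of rtweet used in this walkthrough (newer releases handle authentication differently).

```r
## Hypothetical environment variable names; define them yourself
## (for example in your .Renviron file) before running this script.
keys <- Sys.getenv(c("TWITTER_CONSUMER_KEY",
                     "TWITTER_CONSUMER_SECRET",
                     "TWITTER_ACCESS_TOKEN",
                     "TWITTER_ACCESS_SECRET"))

## Only build the token when all four keys are set and rtweet is installed
keys_ready <- all(nzchar(keys)) &&
  requireNamespace("rtweet", quietly = TRUE)

if (keys_ready) {
  token <- rtweet::create_token(
    app             = "my_app_name",  # placeholder: use your own app name
    consumer_key    = keys[["TWITTER_CONSUMER_KEY"]],
    consumer_secret = keys[["TWITTER_CONSUMER_SECRET"]],
    access_token    = keys[["TWITTER_ACCESS_TOKEN"]],
    access_secret   = keys[["TWITTER_ACCESS_SECRET"]]
  )
} else {
  message("Twitter keys not found; skipping token creation.")
}
```

If you prefer to paste the keys directly, replace the keys[[...]] values with your own strings, but keep that script out of anything you share.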

Authorization in future R sessions

The create_token() function should automatically save your token as an environment variable for you. So next time you start an R session [on the same machine], rtweet should automatically find your token.
To make sure it works, restart your R session, run the following code, and again check to make sure the app name and api_key match.

## check to see if the token is loaded
get_token()

## <Token>
## <oauth_endpoint>
##  request:   https://api.twitter.com/oauth/request_token
##  authorize: https://api.twitter.com/oauth/authenticate
##  access:    https://api.twitter.com/oauth/access_token
## <oauth_app> wk10casestudyECI586
##   key:    AYzjxwiV8iL8gbth2qFL1EGS3
##   secret: <hidden>
## <credentials> oauth_token, oauth_token_secret
## ---

That’s it!

2. WRANGLE

In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018).

Import Data. In this section, we introduce the rtweet package and some key functions to search for tweets or users of interest.
Tidy Tweets. We revisit the tidytext package to both “tidy” and tokenize our tweets in order to create our data frame for analysis.
Get Sentiments. We conclude our data wrangling by introducing sentiment lexicons and the inner_join() function for appending sentiment values to our data frame.
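To preview the idea behind the Get Sentiments step, here is a minimal sketch of how inner_join() can append sentiment values to tokenized text. The words and scores in toy_lexicon are made up for illustration and are not taken from the AFINN, bing, or nrc lexicons.

```r
library(dplyr)

## Toy tokens: one row per word per tweet (made-up data)
toy_tokens <- tibble(
  status_id = c("1", "1", "2", "2"),
  word      = c("love", "standards", "confusing", "lessons")
)

## Toy lexicon with made-up scores, standing in for a real sentiment lexicon
toy_lexicon <- tibble(
  word  = c("love", "confusing"),
  value = c(3, -2)
)

## inner_join() keeps only the words that appear in both tables
toy_sentiment <- inner_join(toy_tokens, toy_lexicon, by = "word")

## Add up the word scores within each tweet
toy_scores <- toy_sentiment %>%
  group_by(status_id) %>%
  summarise(sentiment = sum(value))

toy_scores
```

Words without a match in the lexicon, such as "standards" and "lessons" here, simply drop out of the joined table.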

2a. Import Tweets

The Import Tweets section introduces the following functions from the rtweet package for reading Twitter data into R:

search_tweets() Pulls up to 18,000 tweets from the last 6-9 days matching provided search terms.
search_tweets2() Returns data from multiple search queries.
get_timelines() Returns up to 3,200 tweets of one or more specified Twitter users.

Search Tweets

Since one of our goals for this walkthrough is a very crude replication of the study by Rosenberg et al. (2021), let’s begin by introducing the search_tweets() function to try reading into R 5,000 tweets containing the NGSS hashtag and store as a new data frame ngss_all_tweets.

Type or copy the following code into your R script or console and run:

ngss_all_tweets <- search_tweets(q = "#NGSSchat", n=5000)

glimpse(ngss_all_tweets)

## Rows: 509
## Columns: 90
## $ user_id                 <chr> "3276741348", "4344807252", "2800231624", "280…
## $ status_id               <chr> "1485297596943978505", "1485294464990019586", …
## $ created_at              <dttm> 2022-01-23 17:05:17, 2022-01-23 16:52:50, 202…
## $ screen_name             <chr> "KollmanRebecca", "dawno_connor", "3DScinceguy…
## $ text                    <chr> "While this talks about writing specifically, …
## $ source                  <chr> "Twitter for iPhone", "Twitter for iPhone", "T…
## $ display_text_width      <dbl> 206, 140, 140, 140, 235, 248, 233, 123, 140, 6…
## $ reply_to_status_id      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ reply_to_user_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ reply_to_screen_name    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ is_quote                <lgl> TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,…
## $ is_retweet              <lgl> FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, …
## $ favorite_count          <int> 0, 0, 0, 0, 8, 24, 4, 0, 0, 1, 9, 0, 0, 0, 15,…
## $ retweet_count           <int> 0, 3, 3, 9, 3, 9, 1, 3, 1, 0, 1, 2, 3, 3, 1, 2…
## $ quote_count             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ reply_count             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ hashtags                <list> <"mtedchat", "blinaction", "sbg", "sbl", "ngs…
## $ symbols                 <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ urls_url                <list> "twitter.com/montesyrie/sta…", NA, NA, NA, "t…
## $ urls_t.co               <list> "https://t.co/uaZNGj8Kh3", NA, NA, NA, "https…
## $ urls_expanded_url       <list> "https://twitter.com/montesyrie/status/148489…
## $ media_url               <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ media_t.co              <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ media_expanded_url      <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ media_type              <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ext_media_url           <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ext_media_t.co          <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ext_media_expanded_url  <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ext_media_type          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ mentions_user_id        <list> NA, "794608033582247936", "794608033582247936…
## $ mentions_screen_name    <list> NA, "NGSSphenomena", "NGSSphenomena", "NGSSph…
## $ lang                    <chr> "en", "en", "en", "en", "en", "en", "en", "en"…
## $ quoted_status_id        <chr> "1484890944696565760", NA, NA, NA, "1484992357…
## $ quoted_text             <chr> "\"Into the Grade Unknown\" (Blog Post to Twit…
## $ quoted_created_at       <dttm> 2022-01-22 14:09:23, NA, NA, NA, 2022-01-22 2…
## $ quoted_source           <chr> "Twitter Web App", NA, NA, NA, "Twitter Web Ap…
## $ quoted_favorite_count   <int> 30, NA, NA, NA, 7258, NA, NA, NA, NA, 174, NA,…
## $ quoted_retweet_count    <int> 3, NA, NA, NA, 1316, NA, NA, NA, NA, 46, NA, N…
## $ quoted_user_id          <chr> "4448568809", NA, NA, NA, "47139232", NA, NA, …
## $ quoted_screen_name      <chr> "MonteSyrie", NA, NA, NA, "balail", NA, NA, NA…
## $ quoted_name             <chr> "Monte Syrie", NA, NA, NA, "Brian ☀️🌏🌘", NA, …
## $ quoted_followers_count  <int> 14091, NA, NA, NA, 998, NA, NA, NA, NA, 7845, …
## $ quoted_friends_count    <int> 5996, NA, NA, NA, 345, NA, NA, NA, NA, 2391, N…
## $ quoted_statuses_count   <int> 23898, NA, NA, NA, 3522, NA, NA, NA, NA, 14867…
## $ quoted_location         <chr> "Cheney, WA", NA, NA, NA, "", NA, NA, NA, NA, …
## $ quoted_description      <chr> "Do. Reflect. Do Better. HS ELA Teacher, Proje…
## $ quoted_verified         <lgl> FALSE, NA, NA, NA, FALSE, NA, NA, NA, NA, FALS…
## $ retweet_status_id       <chr> NA, "1485291038524817410", "148529103852481741…
## $ retweet_text            <chr> NA, "So many structure and function questions…
## $ retweet_created_at      <dttm> NA, 2022-01-23 16:39:13, 2022-01-23 16:39:13,…
## $ retweet_source          <chr> NA, "Twitter for iPhone", "Twitter for iPhone"…
## $ retweet_favorite_count  <int> NA, 8, 8, 24, NA, NA, NA, 3, 12, NA, NA, 8, 7,…
## $ retweet_retweet_count   <int> NA, 3, 3, 9, NA, NA, NA, 3, 1, NA, NA, 2, 3, 3…
## $ retweet_user_id         <chr> NA, "794608033582247936", "794608033582247936"…
## $ retweet_screen_name     <chr> NA, "NGSSphenomena", "NGSSphenomena", "NGSSphe…
## $ retweet_name            <chr> NA, "Phenomena", "Phenomena", "Phenomena", NA,…
## $ retweet_followers_count <int> NA, 6899, 6899, 6899, NA, NA, NA, 901, 3605, N…
## $ retweet_friends_count   <int> NA, 305, 305, 305, NA, NA, NA, 853, 3772, NA, …
## $ retweet_statuses_count  <int> NA, 1871, 1871, 1871, NA, NA, NA, 2724, 39137,…
## $ retweet_location        <chr> NA, "", "", "", NA, NA, NA, "Boulder, CO", "",…
## $ retweet_description     <chr> NA, "A companion account created to share all …
## $ retweet_verified        <lgl> NA, FALSE, FALSE, FALSE, NA, NA, NA, FALSE, FA…
## $ place_url               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ place_name              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ place_full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ place_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ country                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ country_code            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ geo_coords              <list> <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, …
## $ coords_coords           <list> <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, …
## $ bbox_coords             <list> <NA, NA, NA, NA, NA, NA, NA, NA>, <NA, NA, NA…
## $ status_url              <chr> "https://twitter.com/KollmanRebecca/status/148…
## $ name                    <chr> "Rebecca Kollman", "Dawn O'Connor", "Dr. Godfr…
## $ location                <chr> "Sidney, MT", "Danville, CA", "South Laurel, M…
## $ description             <chr> "Wife | Mom of 👦🏼👦🏻 | Person for 🐶🐱| 7-12 Sc…
## $ url                     <chr> NA, "https://t.co/bWtwR3ytm7", NA, NA, "https:…
## $ protected               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ followers_count         <int> 474, 779, 555, 555, 6899, 6899, 6899, 89, 12, …
## $ friends_count           <int> 959, 1205, 817, 817, 305, 305, 305, 112, 26, 1…
## $ listed_count            <int> 5, 11, 5, 5, 0, 0, 0, 1, 0, 20, 41, 41, 41, 41…
## $ statuses_count          <int> 1953, 3458, 4426, 4426, 1871, 1871, 1871, 646,…
## $ favourites_count        <int> 6493, 1602, 3949, 3949, 1091, 1091, 1091, 866,…
## $ account_created_at      <dttm> 2015-07-11 23:49:31, 2015-11-24 12:30:58, 201…
## $ verified                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ profile_url             <chr> NA, "https://t.co/bWtwR3ytm7", NA, NA, "https:…
## $ profile_expanded_url    <chr> NA, "http://acoe.org/page/949", NA, NA, "http:…
## $ account_lang            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ profile_banner_url      <chr> "https://pbs.twimg.com/profile_banners/3276741…
## $ profile_background_url  <chr> "http://abs.twimg.com/images/themes/theme1/bg.…
## $ profile_image_url       <chr> "http://pbs.twimg.com/profile_images/142600670…

ngss_all_tweets

## # A tibble: 509 × 90
##    user_id             status_id  created_at          screen_name text    source
##    <chr>               <chr>      <dttm>              <chr>       <chr>   <chr> 
##  1 3276741348          148529759… 2022-01-23 17:05:17 KollmanReb… While … Twitt…
##  2 4344807252          148529446… 2022-01-23 16:52:50 dawno_conn… So man… Twitt…
##  3 2800231624          148529226… 2022-01-23 16:44:05 3DScinceguy So man… Twitt…
##  4 2800231624          148434717… 2022-01-21 02:08:38 3DScinceguy How mi… Twitt…
##  5 794608033582247936  148529103… 2022-01-23 16:39:13 NGSSphenom… So man… Twitt…
##  6 794608033582247936  148381187… 2022-01-19 14:41:34 NGSSphenom… How mi… Twitt…
##  7 794608033582247936  148420617… 2022-01-20 16:48:22 NGSSphenom… Sound … Twitt…
##  8 1106554661673361408 148495718… 2022-01-22 18:32:37 TracyJarre… Come l… Twitt…
##  9 498529076           148491430… 2022-01-22 15:42:12 neikalee    So the… Twitt…
## 10 311128650           148491337… 2022-01-22 15:38:31 SciFiClima… A good… Twitt…
## # … with 499 more rows, and 84 more variables: display_text_width <dbl>,
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, quote_count <int>,
## #   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## #   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## #   media_t.co <list>, media_expanded_url <list>, media_type <list>, …

Note that the first argument the search_tweets() function expects, q =, is the search term wrapped in quotation marks, and that n = specifies the maximum number of tweets to return.

✅ Comprehension Check

View your new ngss_all_tweets data frame using one of the previous view methods from Unit 1 Section 2a to help answer the following questions:

How many tweets did our query using the Twitter API actually return? How many variables? - The query shown above returned 509 rows (tweets) and 90 variables.
Why do you think our query pulled in far fewer than the 5,000 tweets requested? - It only looks for the hashtag “#NGSSchat”, and search_tweets() only reaches back over the last 6-9 days.
Does our query also include retweets? How do you know? - Yes. The is_retweet column is TRUE for some rows, and the variables that start with retweet_ are filled in for those rows.

Remove Retweets

While not explicitly mentioned in the paper, it’s likely the authors removed retweets in their query, since a retweet is simply someone reposting someone else’s tweet and would duplicate the exact same content of the original.

Let’s use the include_rts = argument to remove any retweets by setting it to FALSE:

ngss_non_retweets <- search_tweets("#NGSSchat", 
                                   n=5000, 
                                   include_rts = FALSE)
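If you have already pulled tweets with retweets included, a similar result can be approximated locally using the is_retweet column that appears in the glimpse() output above. The sketch below uses a small made-up data frame rather than real API results, so the text and IDs are purely illustrative.

```r
library(dplyr)

## Made-up stand-in for a data frame returned by search_tweets()
toy_tweets <- tibble(
  status_id  = c("1", "2", "3", "4"),
  text       = c("Original post about #NGSSchat",
                 "A retweet of the original post",
                 "Another retweet of the original post",
                 "A second original post"),
  is_retweet = c(FALSE, TRUE, TRUE, FALSE)
)

## How many rows are retweets?
toy_tweets %>%
  count(is_retweet)

## Keep only the rows that are not retweets
toy_non_retweets <- toy_tweets %>%
  filter(!is_retweet)

toy_non_retweets
```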

Using the OR Operator

If you recall from [Section 1a], the authors accessed tweets and user information from the hashtag-based #NGSSchat online community, all tweets that included any of the following phrases, with “/” indicating an additional phrase featuring the respective plural form: “ngss”, “next generation science standard/s”, “next gen science standard/s”.

Let’s modify our query using the OR operator to also include “ngss” so it will return tweets containing either #NGSSchat or “ngss” and assign to ngss_or_tweets:

ngss_or_tweets <- search_tweets(q = "#NGSSchat OR ngss", 
                                n=5000,
                                include_rts = FALSE)

## the same search terms without the OR operator, stored under a new name
ngss_and_tweets <- search_tweets(q = "#NGSSchat ngss", 
                                 n=5000,
                                 include_rts = FALSE)

✅ Comprehension Check

Try including both search terms but excluding the OR operator to answer the following question:

Does excluding the OR operator return more tweets, the same number of tweets, or fewer tweets? Why?

It returns fewer tweets because, without the OR operator, a tweet must contain both “#NGSSchat” and “ngss” to match.

What other useful arguments does the search_tweets() function contain? Try adding one and see what happens.

Setting include_rts = TRUE returns a larger set of tweets because retweets are kept. The parse = TRUE argument is the default and returns the results as a data frame.

rt <- search_tweets(q = "#NGSSchat OR ngss",
                    n=5000, include_rts = TRUE, parse = TRUE)

Hint: Use the ?search_tweets help function to learn more about the q argument and other arguments for composing search queries.

Use Multiple Queries

Unfortunately, the OR operator will only get us so far. In order to include the additional search terms, we will need to use the c() function to combine our search terms into a single vector.

The rtweet package has an additional search_tweets2() function for using multiple queries in a search. To search for an exact phrase within a query, either wrap single quotes around a search query that uses double quotes, e.g., q = '"next gen science standard"', or escape each internal double quote with a single backslash, e.g., q = "\"next gen science standard\"".
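Both quoting styles described above produce the same string in R, which a quick check in the console can confirm. The variable names below are just for illustration.

```r
## Two ways of writing a query that contains double quotes
single_quoted <- '"next gen science standard"'
escaped       <- "\"next gen science standard\""

## Both objects hold exactly the same characters
identical(single_quoted, escaped)

## cat() shows the string as it will be sent, with the inner double quotes intact
cat(single_quoted, "\n")
```

Here identical() returns TRUE, so the choice between the two styles is purely a matter of readability.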

Copy and paste the following code to store the results of our query in ngss_tweets:

ngss_tweets <- search_tweets2(c("#NGSSchat OR ngss",
                                '"next generation science standard"',
                                '"next generation science standards"',
                                '"next gen science standard"',
                                '"next gen science standards"'),
                              n=5000,
                              include_rts = FALSE)

Our First Dictionary

Recall that for our research question we wanted to compare public sentiment about both the NGSS and CCSS state standards. Let’s go ahead and create our very first “dictionary” for identifying tweets related to either set of standards, and then use that dictionary for the q = query argument to pull tweets related to the state standards.

To do so, we’ll need to add some additional search terms to our list:

ngss_dictionary <- c("#NGSSchat OR ngss",
                     '"next generation science standard"',
                     '"next generation science standards"',
                     '"next gen science standard"',
                     '"next gen science standards"')

ngss_tweets <- search_tweets2(ngss_dictionary,
                              n=5000,
                              include_rts = FALSE)
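As an aside, a dictionary like this one could also be assembled from its base phrases rather than typed out in full. The sketch below is one possible way to do that in base R; the helper object names are made up for this example, and the result is compared against the hand-typed version to show that the two match.

```r
## Singular base phrases (the plural forms are generated below)
base_phrases <- c("next generation science standard",
                  "next gen science standard")

## Interleave each phrase with its plural form
all_phrases <- c(rbind(base_phrases, paste0(base_phrases, "s")))

## Wrap each phrase in double quotes so it is searched as an exact phrase
quoted_phrases <- paste0('"', all_phrases, '"')

## Add the hashtag query at the front
built_dictionary <- c("#NGSSchat OR ngss", quoted_phrases)

## The hand-typed dictionary from the walkthrough, for comparison
typed_dictionary <- c("#NGSSchat OR ngss",
                      '"next generation science standard"',
                      '"next generation science standards"',
                      '"next gen science standard"',
                      '"next gen science standards"')

identical(built_dictionary, typed_dictionary)
```

For a dictionary this short, typing the terms by hand is just as reasonable; building them mainly helps when the list of base phrases grows.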

Now let’s create a dictionary for the Common Core State Standards and pass that to our search_tweets2() function to get the most recent tweets:

ccss_dictionary <- c("#commoncore", '"common core"')

ccss_tweets <- ccss_dictionary %>% 
  search_tweets2(n=5000, include_rts = FALSE)

Notice that you can use the pipe operator with the search_tweets2() function just like you would other functions from the tidyverse.

✅ Comprehension Check

Use the search_tweets() function to create your own custom query for a Twitter hashtag or topic(s) of interest.

LA_Tweets <- search_tweets2(c("\"Learning Analytics\"",
                              "rstats OR python"),
                            n=5000,
                            include_rts = FALSE)

## Warning: Rate limit exceeded - 88

## Warning: Rate limit exceeded

## Warning: Rate limit exceeded - 88

## Warning: Rate limit exceeded

write_xlsx(LA_Tweets, "data/LA_Tweets.xlsx")

Write to Excel

Finally, let’s save our tweet files to use in later exercises since tweets have a tendency to change every minute. We’ll save as a Microsoft Excel file since one of our columns can not be stored in a flat file like .csv.

Let’s use the write_xlsx() function from the writexl package just like we would the write_csv() function from readr in Unit 1:

write_xlsx(ngss_tweets, "data/ngss_tweets.xlsx")
write_xlsx(ccss_tweets, "data/csss_tweets.xlsx")
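To see which columns are the ones that do not fit in a flat file, you can check for list columns, such as the hashtags column shown in the glimpse() output earlier. The sketch below uses a small made-up data frame, and the flattening step is only one possible way to prepare such a column for a .csv file, not necessarily the approach used later in this course.

```r
library(dplyr)

## Made-up data frame with one list column, mimicking the hashtags column
toy_tweets <- tibble(
  text     = c("First tweet", "Second tweet"),
  hashtags = list(c("ngss", "scichat"), NA_character_)
)

## Find the names of any list columns
list_cols <- names(toy_tweets)[vapply(toy_tweets, is.list, logical(1))]
list_cols

## One option: collapse each list element into a single space-separated string
## (note that a missing value becomes the text "NA" with this approach)
toy_flat <- toy_tweets %>%
  mutate(hashtags = vapply(hashtags, paste, character(1), collapse = " "))

toy_flat
```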

Other Useful Queries

For your independent analysis, you may be interested in exploring posts by specific users rather than topics, key words, or hashtags. Yes, there is a function for that too!

For example, let’s use the c() function again to create another vector containing my username and those of some of my colleagues at the Friday Institute, and then use the get_timelines() function to get the most recent tweets from each of those users:

fi <- c("sbkellogg", "mjsamberg", "haspires", "tarheel93", "drcallie_tweets", "AlexDreier")

fi_tweets <- fi %>%
  get_timelines(include_rts=FALSE)

And let’s use the sample_n() function from the dplyr package to pick 10 random tweets and use select() to select and view just the screen_name and text columns that contain the user and the content of their post:

sample_n(fi_tweets, 10) %>%
  select(screen_name, text)

## # A tibble: 10 × 2
##    screen_name     text                                                         
##    <chr>           <chr>                                                        
##  1 mjsamberg       "For reference https://t.co/DJiFUhRd59"                      
##  2 mjsamberg       "In the last three months, I think I've won this game one we…
##  3 mjsamberg       "🧵 https://t.co/QzZiMwPTq2"                                 
##  4 mjsamberg       "Wordle 210 3/6\n\n🟩⬛🟨🟨⬛\n🟩🟩🟨⬛⬛\n🟩🟩🟩🟩🟩"       
##  5 tarheel93       "@jaclynbstevens Due to headaches and eye fatigue, I had to …
##  6 sbkellogg       "Attention @LASER_Institute scholars and @NCStateCED #learni…
##  7 AlexDreier      "@BethRabbitt My favorite thing right now is my son’s overge…
##  8 drcallie_tweets "I was asked what #advice I would offer to students. Here’s …
##  9 haspires        "Honoring and remembering. @FridayInstitute https://t.co/Fz7…
## 10 AlexDreier      "My goodness can @Donnell_Cannon deliver a tribute. #FridayM…

We’ve only scratched the surface of the number of functions available in the rtweet package for searching Twitter. Use the following function to open the package vignette and learn more:

vignette("intro", package="rtweet")

✅ Comprehension Check

To conclude Section 2a, try one of the following search functions from the rtweet vignette:

get_timelines() Get the most recent 3,200 tweets from users.

## get the most recent tweets posted by learning analytics organizations
tmls <- get_timelines(c("LASER_Institute", "NYU_Learn", "LearningLA"), n = 3200)

## plot the frequency of tweets for each user over time
tmls %>%
  dplyr::filter(created_at > "2021-10-29") %>%
  dplyr::group_by(screen_name) %>%
  ts_plot("days", trim = 1L) +
  ggplot2::geom_point() +
  ggplot2::theme_minimal() +
  ggplot2::theme(
    legend.title = ggplot2::element_blank(),
    legend.position = "bottom",
    plot.title = ggplot2::element_text(face = "bold")) +
  ggplot2::labs(
    x = NULL, y = NULL,
    title = "Frequency of Twitter statuses posted by Learning Analytics organizations",
    subtitle = "Twitter status (tweet) counts aggregated by day from October 2021 to January 2022",
    caption = "\nSource: Data collected from Twitter's REST API via rtweet"
  )

stream_tweets() Randomly sample (approximately 1%) from the live stream of all tweets.

## stream tweets from raleigh,nc for 60 seconds
rt <- stream_tweets(lookup_coords("raleigh, nc"), timeout = 60)

## Streaming tweets for 60 seconds...

## Finished streaming tweets!

rt

## NULL

get_friends() Retrieve a list of all the accounts a user follows.

## get user IDs of accounts followed by SolaResearch, a Learning Analytic organization
SoLAR_fds <- get_friends("soLAResearch")

## lookup data on those accounts
SoLAR_fds_data <- lookup_users(SoLAR_fds$user_id)

SoLAR_fds_data

## # A tibble: 68 × 90
##    user_id             status_id  created_at          screen_name text    source
##    <chr>               <chr>      <dttm>              <chr>       <chr>   <chr> 
##  1 13046992            148443444… 2022-01-21 07:55:25 mhawksey    "@pete… Twitt…
##  2 2823168772          148149772… 2022-01-13 05:25:56 DanijelaGa… "I’ve … Twitt…
##  3 2349713293          148429642… 2022-01-20 22:47:00 Discourseo… "#Flas… Twitt…
##  4 936224786350518273  148516893… 2022-01-23 08:34:02 ChiEdMobil… "We ar… Twitt…
##  5 292312814           148524493… 2022-01-23 13:36:01 euatweets   "#Univ… Sprou…
##  6 3029739405          148413730… 2022-01-20 12:14:42 LDECEL      "Tomor… Twitt…
##  7 31362451            148094148… 2022-01-11 16:35:38 studiumdig… "Morge… Twitt…
##  8 2574452406          148450309… 2022-01-21 12:28:14 earli_offi… "The E… Twitt…
##  9 122360833           148534030… 2022-01-23 19:55:00 eAssess     "Who i… Twitt…
## 10 1011445838873153536 148344280… 2022-01-18 14:15:00 AsianJde    "A war… Twitt…
## # … with 58 more rows, and 84 more variables: display_text_width <int>,
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, quote_count <int>,
## #   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## #   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## #   media_t.co <list>, media_expanded_url <list>, media_type <list>, …

get_followers() Retrieve a list of the accounts following a user.
get_favorites() Get the most recently favorited statuses by a user.
get_trends() Discover what’s currently trending in a city.
search_users() Search for 1,000 users with a specific hashtag in their profile bios.

2b. Tidy Text

Now that we have the data needed to answer our questions, we still have a little bit of work to do to get it ready for analysis. This section will revisit some familiar functions from Unit 1 and introduce a couple new functions:

Functions Used

dplyr functions

select() picks variables based on their names.
slice() lets you select, remove, and duplicate rows.
rename() changes the names of individual variables using new_name = old_name syntax.
filter() picks cases, or rows, based on their values in a specified column.
anti_join() returns all rows from x without a match in y.

tidytext functions

unnest_tokens() splits a column into tokens.

ATTENTION: For those of you who do not have Twitter Developer accounts, you will need to read in the Excel files shared on our course site and also located here: https://github.com/sbkellogg/eci-588/tree/main/unit-2/data

We’ll use the readxl package highlighted in Unit 1 and the read_xlsx() function to read in the data stored in the data folder of our R project:

ngss_tweets <- read_xlsx("data/ngss_tweets.xlsx")
ccss_tweets <- read_xlsx("data/csss_tweets.xlsx")

Note: If you have already created these data frames from 2a. Import Tweets, you do not need to read these files into R unless you want to reproduce the exact same outputs shown in the rest of this walkthrough.

Subset Rows & Columns

As you are probably already aware, we have way more data than we’ll need for analysis and will need to pare it down quite a bit.

First, let’s use the filter function to keep only the rows containing tweets in English:

ngss_text <- filter(ngss_tweets, lang == "en")

Now let’s select the following columns from our new ngss_text data frame:

screen_name of the user who created the tweet
created_at timestamp for examining changes in sentiment over time
text containing the tweet, which is our primary data source of interest

ngss_text <- select(ngss_text, screen_name, created_at, text)
ngss_text

## # A tibble: 490 × 3
##    screen_name   created_at          text                                       
##    <chr>         <dttm>              <chr>                                      
##  1 clbmanning    2022-01-19 01:02:37 @NGSS_tweeps @IndigenousSTEAM I am a PhD s…
##  2 clbmanning    2022-01-19 00:56:50 @gosciencego @NGSS_tweeps @nativelandnet W…
##  3 TdiShelton    2022-01-19 00:10:03 Join us for #NGSSchat this Thursday, Janua…
##  4 TdiShelton    2022-01-13 22:33:53 I am so excited about the new  @NSTA Strat…
##  5 TdiShelton    2022-01-13 01:39:53 This is going to be a great @nsta session.…
##  6 NGS_Education 2022-01-18 23:51:00 The 𝗠𝗜𝗗𝗗𝗟𝗘 &amp; 𝗛𝗜𝗚𝗛 𝗦𝗖𝗛𝗢𝗢𝗟 𝗡𝗚𝗦𝗦 𝗣𝗛𝗘𝗡𝗢𝗠𝗘𝗡…
##  7 NGSS_tweeps   2022-01-18 23:29:10 What roles do you have? How do you bring t…
##  8 NGSS_tweeps   2022-01-18 23:28:27 ...and cultivating Indigenous youths' coll…
##  9 NGSS_tweeps   2022-01-18 23:30:09 https://t.co/jaIyiPhcqv is an excellent re…
## 10 NGSS_tweeps   2022-01-13 02:35:44 My favorite course I took through @natgeoe…
## # … with 480 more rows

Add & Reorder Columns

Since we are interested in comparing the sentiment of NGSS tweets with CCSS tweets, it would be helpful if we had a column for quickly identifying the set of state standards with which each tweet is associated.

We’ll use the mutate() function to create a new variable called standards to label each tweet as “ngss”:

ngss_text <- mutate(ngss_text, standards = "ngss")

And just because it bothers me, I’m going to use the relocate() function to move the standards column to the first position so I can quickly see which standards the tweet is from:

ngss_text <- relocate(ngss_text, standards)

Note that you could also have used the select() function to reorder columns like so:

ngss_text <- select(ngss_text, standards, screen_name, created_at, text)

Finally, let’s rewrite the code above using the %>% operator so there is less redundancy and it is easier to read:

ngss_text <-
  ngss_tweets %>%
  filter(lang == "en") %>%
  select(screen_name, created_at, text) %>%
  mutate(standards = "ngss") %>%
  relocate(standards)
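
To see what each step of this pipeline does without needing the real data, here is a minimal sketch on a made-up data frame. The toy_tweets tibble and everything in it are invented for illustration; only the column names are meant to mirror those in ngss_tweets.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for ngss_tweets: three made-up tweets, one not in English
toy_tweets <- tibble(
  screen_name = c("user_a", "user_b", "user_c"),
  created_at  = as.POSIXct(c("2022-01-18 10:00:00",
                             "2022-01-18 11:00:00",
                             "2022-01-19 09:30:00"), tz = "UTC"),
  text        = c("Excited for #NGSSchat", "Me encanta la ciencia", "New NGSS resources"),
  lang        = c("en", "es", "en"),
  source      = c("Twitter Web App", "Twitter for iPhone", "Twitter Web App")
)

toy_text <- toy_tweets %>%
  filter(lang == "en") %>%                    # keep English tweets only
  select(screen_name, created_at, text) %>%   # drop columns we don't need
  mutate(standards = "ngss") %>%              # label every remaining row
  relocate(standards)                         # move the label to the first position

toy_text
```

The result has two rows (the Spanish tweet is dropped) and four columns, with standards first.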

✅ Comprehension Check

WARNING: You will not be able to progress to the next section until you have completed the following task:

Create a new ccss_text data frame for our ccss_tweets Common Core tweets by modifying the code above.

Combine Data Frames

Finally, let’s combine our ccss_text and ngss_text into a single data frame by using the bind_rows() function from dplyr and simply supplying the data frames that you want to combine as arguments:

tweets <- bind_rows(ngss_text, ccss_text)

And let’s take a quick look at both the head() and the tail() of this new tweets data frame to make sure it contains both “ngss” and “ccss” standards:

head(tweets)

## # A tibble: 6 × 4
##   standards screen_name   created_at          text                              
##   <chr>     <chr>         <dttm>              <chr>                             
## 1 ngss      clbmanning    2022-01-19 01:02:37 @NGSS_tweeps @IndigenousSTEAM I a…
## 2 ngss      clbmanning    2022-01-19 00:56:50 @gosciencego @NGSS_tweeps @native…
## 3 ngss      TdiShelton    2022-01-19 00:10:03 Join us for #NGSSchat this Thursd…
## 4 ngss      TdiShelton    2022-01-13 22:33:53 I am so excited about the new  @N…
## 5 ngss      TdiShelton    2022-01-13 01:39:53 This is going to be a great @nsta…
## 6 ngss      NGS_Education 2022-01-18 23:51:00 The 𝗠𝗜𝗗𝗗𝗟𝗘 &amp; 𝗛𝗜𝗚𝗛 𝗦𝗖𝗛𝗢𝗢𝗟 𝗡𝗚𝗦𝗦…

tail(tweets)

## # A tibble: 6 × 4
##   standards screen_name     created_at          text                            
##   <chr>     <chr>           <dttm>              <chr>                           
## 1 ccss      RHansen3rdGrade 2022-01-10 10:43:47 "@k8roulette2 @annismezelsm I d…
## 2 ccss      Ou81257584433   2022-01-10 10:18:28 "@Angie_laughing I'm using Disc…
## 3 ccss      DicksonSidah    2022-01-10 09:47:13 "@voiceofgray @priyankchn Fuck …
## 4 ccss      apoliti63780208 2022-01-10 09:07:44 "Meanwhile AIPACISTAN IS STRUGG…
## 5 ccss      t_hewittt       2022-01-10 09:03:13 "@RonFilipkowski They CAN read …
## 6 ccss      kathysuf        2022-01-10 07:13:59 "@RepLeeZeldin Basically you're…
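
Looking at head() and tail() works, but counting rows per label is a more direct check that both sets of standards made it into the combined data frame. The sketch below uses two tiny made-up data frames (ngss_demo and ccss_demo are hypothetical names) in place of ngss_text and ccss_text.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature versions of ngss_text and ccss_text
ngss_demo <- tibble(standards = "ngss", text = c("tweet one", "tweet two"))
ccss_demo <- tibble(standards = "ccss", text = c("tweet three", "tweet four", "tweet five"))

# Stack the two data frames on top of each other
tweets_demo <- bind_rows(ngss_demo, ccss_demo)

# One row per label, with the number of tweets carrying that label
standards_counts <- count(tweets_demo, standards)
standards_counts
```

On the real data, the same check would be count(tweets, standards).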

Tokenize Text

We have a couple remaining steps to tidy our text that hopefully should feel familiar by this point. If you recall from Chapter 1 of Text Mining With R, Silge & Robinson describe tokens as:

A meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. This one-token-per-row structure is in contrast to the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix.

First, let’s tokenize our tweets by using the unnest_tokens() function to split each tweet into one word per row to make it easier to analyze:

tweet_tokens <- 
  tweets %>%
  unnest_tokens(output = word, 
                input = text, 
                token = "tweets")

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.

Notice that we’ve included an additional argument in the call to unnest_tokens(). Specifically, we used the specialized “tweets” tokenizer in the token = argument, which is very useful for dealing with Twitter text or other text from online forums because it retains hashtags and mentions of usernames with the @ symbol.
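
To make the one-token-per-row idea concrete, here is a small sketch on two made-up tweets. It uses the default word tokenizer rather than token = "tweets". My understanding is that newer tidytext releases no longer include the “tweets” tokenizer, but treat that as an assumption and check the documentation for your installed version.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two made-up tweets for illustration
tweets_demo <- tibble(
  standards = c("ngss", "ccss"),
  text      = c("I am so excited about the new standards",
                "Common Core math is confusing")
)

# Default tokenizer: splits on words and lowercases them
tokens_demo <- tweets_demo %>%
  unnest_tokens(output = word, input = text)

tokens_demo
```

Each row of tokens_demo now holds a single lowercased word along with the standards label of the tweet it came from; the original text column is dropped.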

Remove Stop Words

Now let’s remove stop words like “the” and “a” that don’t help us learn much about what people are tweeting about the state standards.

tidy_tweets <-
  tweet_tokens %>%
  anti_join(stop_words, by = "word")

Notice that we’ve specified the by = argument to look for matching words in the word column for both data sets and remove any rows from the tweet_tokens dataset that match the stop_words dataset. Remember when we first tokenized our dataset I conveniently chose output = word as the column name because it matches the column name word in the stop_words dataset contained in the tidytext package. This makes our call to anti_join() simpler because anti_join() knows to look for the column named word in each dataset. Strictly speaking, specifying by = wasn’t necessary here: word is the only column name the two datasets share, so anti_join() would have matched on it by default.

Custom Stop Words

Before wrapping up, let’s take a quick count of the most common words in the tidy_tweets data frame:

count(tidy_tweets, word, sort = TRUE)

## # A tibble: 8,229 × 2
##    word            n
##    <chr>       <int>
##  1 common       1126
##  2 core         1111
##  3 math          406
##  4 @ngsstweeps   191
##  5 school        154
##  6 science       151
##  7 standards     146
##  8 students      141
##  9 amp           136
## 10 im            114
## # … with 8,219 more rows

Notice that the nonsense word “amp” is in our top ten words. If we use the filter() function and the grepl() query from Unit 1 on our tweets data frame, we can see that “amp” seems to be some sort of HTML residue that we might want to get rid of.

filter(tweets, grepl('amp', text))

## # A tibble: 156 × 4
##    standards screen_name   created_at          text                             
##    <chr>     <chr>         <dttm>              <chr>                            
##  1 ngss      NGS_Education 2022-01-18 23:51:00 The 𝗠𝗜𝗗𝗗𝗟𝗘 &amp; 𝗛𝗜𝗚𝗛 𝗦𝗖𝗛𝗢𝗢𝗟 𝗡𝗚𝗦…
##  2 ngss      NGSS_tweeps   2022-01-18 23:28:03 At @IndigenousSTEAM, our roles i…
##  3 ngss      NGSS_tweeps   2022-01-18 21:29:39 The roles we take on in places a…
##  4 ngss      NGSS_tweeps   2022-01-13 02:35:43 Educators at Space Camp which a …
##  5 ngss      NGSS_tweeps   2022-01-13 22:27:43 I'm teaching one CCC per day by …
##  6 ngss      NGSS_tweeps   2022-01-13 14:05:48 What are classroom strategies yo…
##  7 ngss      NGSS_tweeps   2022-01-18 16:27:42 @IndigenousSTEAM Theme Day 1! RO…
##  8 ngss      NGSS_tweeps   2022-01-17 20:41:07 Some key principles of @Indigeno…
##  9 ngss      NGSS_tweeps   2022-01-17 19:12:16 @IndigenousSTEAM resources are c…
## 10 ngss      NGSS_tweeps   2022-01-12 20:23:12 @frizzlerichard @honeywell @NBPT…
## # … with 146 more rows
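
Note that grepl('amp', text) matches any tweet containing those three letters, which is why the “Space Camp” tweet shows up in the output above. A more targeted check, sketched here on made-up text, looks for the literal HTML entity and shows one way to decode it before tokenizing. The example tweets are invented, and decoding with gsub() is just one option rather than the approach taken in this walkthrough.

```r
library(dplyr)
library(tibble)

# Made-up tweets: only the first contains the HTML entity
tweets_demo <- tibble(
  text = c("Middle &amp; high school phenomena",
           "Educators at Space Camp",
           "An example lesson plan")
)

# Substring match: all three tweets contain the letters "amp"
loose_match <- filter(tweets_demo, grepl("amp", text))

# Literal match on the HTML entity: only the first tweet qualifies
strict_match <- filter(tweets_demo, grepl("&amp;", text, fixed = TRUE))

# Decode the entity so "amp" never becomes a token in the first place
tweets_clean <- mutate(tweets_demo, text = gsub("&amp;", "&", text, fixed = TRUE))

nrow(loose_match)
nrow(strict_match)
tweets_clean$text[1]
```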

Let’s rewrite our stop word code to add a custom stop word to filter out rows with “amp” in them:

tidy_tweets <-
  tweet_tokens %>%
  anti_join(stop_words, by = "word") %>%
  filter(word != "amp")

Note that we could extend this filter to weed out any additional words that don’t carry much meaning but skew our data by being so prominent.
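
If the list of unwanted words grows, one option is to keep them in their own small data frame and remove them with a second anti_join(). The sketch below assumes a hypothetical custom_stop_words tibble and a handful of made-up tokens; the choice of “im” as a second custom stop word is only for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Made-up tokens standing in for tweet_tokens
tokens_demo <- tibble(word = c("the", "common", "core", "amp", "math", "im", "students"))

# Hypothetical list of extra words to drop
custom_stop_words <- tibble(word = c("amp", "im"))

tidy_demo <- tokens_demo %>%
  anti_join(stop_words, by = "word") %>%        # standard stop words from tidytext
  anti_join(custom_stop_words, by = "word")     # our own additions

tidy_demo
```

Adding another word then only requires editing custom_stop_words rather than the pipeline itself.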

✅ Comprehension Check

We’ve created some unnecessarily lengthy code to demonstrate some of the steps in the tidying process. Rewrite the tokenization and removal of stop words processes into a more compact series of commands and save your data frame as tidy_tweets.

2c. Add Sentiment Values

Now that we have our tweets nice and tidy, we’re almost ready to begin exploring public sentiment (at least for the past week due to Twitter API rate limits) around the CCSS and NGSS standards. For this part of our workflow we introduce two new functions from the tidytext and dplyr packages respectively:

get_sentiments() returns specific sentiment lexicons with the associated measures for each word in the lexicon
inner_join() returns all rows from x where there are matching values in y, and all columns from x and y.

For a quick overview of the different join functions with helpful visuals, visit: https://statisticsglobe.com/r-dplyr-join-inner-left-right-full-semi-anti

Get Sentiments

Recall from our readings that sentiment analysis tries to evaluate words for their emotional association. Silge & Robinson point out that, “one way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words.” As our readings from last week illustrated, this isn’t the only way to approach sentiment analysis, but it is an easier entry point and one that is often used.

The tidytext package provides access to several sentiment lexicons based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth.

The three general-purpose lexicons we’ll focus on are:

AFINN assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
bing categorizes words in a binary fashion into positive and negative categories.
nrc categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

Note that if this is your first time using the AFINN and NRC lexicons, you’ll be prompted to download both. Respond yes to the prompt by entering “1” and the NRC and AFINN lexicons will download. You’ll only have to do this the first time you use each lexicon.

Let’s take a quick look at each of these lexicons using the get_sentiments() function and assign them to their respective names for later use:

afinn <- get_sentiments("afinn")

afinn

## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows

bing <- get_sentiments("bing")

bing

## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows

nrc <- get_sentiments("nrc")

nrc

## # A tibble: 13,875 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,865 more rows

And just out of curiosity, let’s take a look at the loughran lexicon as well:

loughran <- get_sentiments("loughran")

loughran

## # A tibble: 4,150 × 2
##    word         sentiment
##    <chr>        <chr>    
##  1 abandon      negative 
##  2 abandoned    negative 
##  3 abandoning   negative 
##  4 abandonment  negative 
##  5 abandonments negative 
##  6 abandons     negative 
##  7 abdicated    negative 
##  8 abdicates    negative 
##  9 abdicating   negative 
## 10 abdication   negative 
## # … with 4,140 more rows
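
Before leaning on any of these lexicons, it can help to check how their entries are distributed across categories, since that affects how many words can be matched on each side. As a small sketch, here is that check for bing, which ships with tidytext and so needs no download prompt:

```r
library(dplyr)
library(tidytext)

# Count how many lexicon entries fall into each sentiment category
bing_counts <- get_sentiments("bing") %>%
  count(sentiment)

bing_counts
```

The same count(sentiment) call can be applied to the nrc and loughran data frames created above.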

✅ Comprehension Check

How were these sentiment lexicons put together and validated? Hint: take a look at Chapter 2 from Text Mining with R.

Why should we be cautious when using and interpreting them?

We should be cautious about using these dictionaries on text other than the kind they were validated for. They are also prone to error when words are used in jokes, with sarcasm, or improperly.

Join Sentiments

We’ve reached the final step in our data wrangling process before we can begin exploring our data to address our questions.

In the previous section, we used anti_join() to remove stop words in our dataset. For sentiment analysis, we’re going to use the inner_join() function to do something similar. However, instead of removing rows that contain words matching those in our stop words dictionary, inner_join() allows us to keep only the rows with words that match words in our sentiment lexicons, or dictionaries, along with the sentiment measure for that word from the sentiment lexicon.

Let’s use inner_join() to combine our two tidy_tweets and afinn data frames, keeping only rows with matching data in the word column:

sentiment_afinn <- inner_join(tidy_tweets, afinn, by = "word")

sentiment_afinn

## # A tibble: 1,701 × 5
##    standards screen_name   created_at          word      value
##    <chr>     <chr>         <dttm>              <chr>     <dbl>
##  1 ngss      clbmanning    2022-01-19 01:02:37 care          2
##  2 ngss      clbmanning    2022-01-19 00:56:50 cool          1
##  3 ngss      TdiShelton    2022-01-19 00:10:03 join          1
##  4 ngss      TdiShelton    2022-01-13 22:33:53 excited       3
##  5 ngss      TdiShelton    2022-01-13 01:39:53 excited       3
##  6 ngss      TdiShelton    2022-01-13 01:39:53 join          1
##  7 ngss      NGS_Education 2022-01-18 23:51:00 easy          1
##  8 ngss      NGSS_tweeps   2022-01-18 23:28:27 healthy       2
##  9 ngss      NGSS_tweeps   2022-01-18 23:30:09 excellent     3
## 10 ngss      NGSS_tweeps   2022-01-13 02:35:44 favorite      2
## # … with 1,691 more rows

Notice that each word in your sentiment_afinn data frame now contains a value ranging from -5 (very negative) to 5 (very positive).

sentiment_bing <- inner_join(tidy_tweets, bing, by = "word")

sentiment_bing

## # A tibble: 1,894 × 5
##    standards screen_name   created_at          word      sentiment
##    <chr>     <chr>         <dttm>              <chr>     <chr>    
##  1 ngss      clbmanning    2022-01-19 00:56:50 cool      positive 
##  2 ngss      TdiShelton    2022-01-13 22:33:53 excited   positive 
##  3 ngss      TdiShelton    2022-01-13 01:39:53 excited   positive 
##  4 ngss      NGS_Education 2022-01-18 23:51:00 easy      positive 
##  5 ngss      NGSS_tweeps   2022-01-18 23:28:27 healthy   positive 
##  6 ngss      NGSS_tweeps   2022-01-18 23:30:09 excellent positive 
##  7 ngss      NGSS_tweeps   2022-01-13 02:35:44 favorite  positive 
##  8 ngss      NGSS_tweeps   2022-01-11 17:45:06 excited   positive 
##  9 ngss      NGSS_tweeps   2022-01-18 21:29:39 dynamic   positive 
## 10 ngss      NGSS_tweeps   2022-01-13 02:35:43 free      positive 
## # … with 1,884 more rows
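
The contrast between inner_join() and anti_join() is easiest to see on a tiny example. In the sketch below, both the tokens and the three-word lexicon_demo are made up for illustration; lexicon_demo is not a real sentiment lexicon.

```r
library(dplyr)
library(tibble)

# Made-up tokens and a hypothetical three-word lexicon
tokens_demo  <- tibble(word = c("excited", "science", "standards", "confusing", "excited"))
lexicon_demo <- tibble(word      = c("excited", "confusing", "terrible"),
                       sentiment = c("positive", "negative", "negative"))

matched   <- inner_join(tokens_demo, lexicon_demo, by = "word")  # keeps matches, adds sentiment
unmatched <- anti_join(tokens_demo, lexicon_demo, by = "word")   # keeps everything else

nrow(tokens_demo)   # 5 tokens in total
nrow(matched)       # 3 of them carry a sentiment label
nrow(unmatched)     # 2 do not appear in the lexicon
```

Words without a match, like “science” here, simply drop out of the joined data frame, which is why the sentiment data frames have far fewer rows than the tidy tokens they were built from.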

✅ Comprehension Check

Create a sentiment_nrc data frame using the code above.

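## join the nrc lexicon to our tidy tweets
sentiment_nrc <- inner_join(tidy_tweets, nrc, by = "word")

## keep only the nrc words associated with joy
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

## count the joy words that appear in our tweets
joysoar_nrc <- sentiment_nrc %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)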

## Joining, by = c("word", "sentiment")

joysoar_nrc

## # A tibble: 154 × 2
##    word          n
##    <chr>     <int>
##  1 teach        62
##  2 share        54
##  3 love         42
##  4 excited      21
##  5 child        20
##  6 money        20
##  7 fun          19
##  8 resources    19
##  9 content      17
## 10 create       17
## # … with 144 more rows

What do you notice about the change in the number of observations (i.e. words) between tidy_tweets and the data frames with sentiment values attached? Why did this happen?

I pulled only the words associated with the sentiment joy, which reduced the observations to those words that evoke that emotion per the NRC dictionary. I wonder though if “teach” was always used in a joyful manner.

Note: To complete the following section, you’ll need the sentiment_nrc data frame.

3. EXPLORE

Now that we have our tweets tidied and sentiments joined, we’re ready for a little data exploration. As highlighted in Unit 1, calculating summary statistics, data visualization, and feature engineering (the process of creating new variables from a dataset) are a key part of exploratory data analysis. One goal in this phase is to explore questions that drove the original analysis and develop new questions and hypotheses to test in later stages. Topics addressed in Section 3 include:

Time Series. We take a quick look at the date range of our tweets and compare number of postings by standards.
Sentiment Summaries. We put together some basic summaries of our sentiment values in order to compare public sentiment.

3a. Time Series

Before we dig into sentiment, let’s use the handy ts_plot function built into rtweet to take a very quick look at how far back our tidied tweets data set goes:

ts_plot(tweets, by = "days")

Notice that this effectively creates a ggplot time series plot for us. I’ve included the by = argument, which by default is set to “days”. It looks like tweets go back 9 days, which is the limit set by Twitter.

Try changing it to “hours” and see what happens.
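
For a numeric view of the kind of aggregation behind such a plot, the daily counts can be computed directly with dplyr. This is a sketch on made-up timestamps, not a description of how ts_plot is implemented, and its exact binning may differ.

```r
library(dplyr)
library(tibble)

# Made-up tweets with a standards label and a timestamp
tweets_demo <- tibble(
  standards  = c("ngss", "ngss", "ccss", "ccss", "ccss"),
  created_at = as.POSIXct(c("2022-01-18 23:51:00",
                            "2022-01-19 01:02:37",
                            "2022-01-18 09:03:13",
                            "2022-01-18 10:43:47",
                            "2022-01-19 07:13:59"), tz = "UTC")
)

# Collapse each timestamp to its calendar day, then count tweets per day and label
daily_counts <- tweets_demo %>%
  mutate(day = as.Date(created_at)) %>%
  count(standards, day)

daily_counts
```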

✅ Comprehension Check

Use ts_plot with the group_by function to compare the number of tweets over time by Next Gen and Common Core standards.

ts_plot(dplyr::group_by(tweets, standards),"days")

Which set of standards are Twitter users talking about the most?

The Common Core standards were the most talked about.

Hint: use the ?ts_plot help function to check the examples to see how this can be done.

Your line graph should look something like this:

3b. Sentiment Summaries

Since our primary goal is to compare public sentiment around the NGSS and CCSS state standards, in this section we put together some basic numerical summaries using our different lexicons to see whether tweets are generally more positive or negative for each standard as well as differences between the two. To do this, we revisit the following dplyr functions:

count() lets you quickly count the unique values of one or more variables
group_by() takes a data frame and one or more variables to group by
summarise() creates a numerical summary of data using arguments like mean() and median()
mutate() adds new variables and preserves existing ones

And introduce one new function:

spread() spreads a key-value pair across multiple columns

Sentiment Counts

Let’s start with bing, our simplest sentiment lexicon, and use the count function to count how many times “positive” and “negative” occur in the sentiment column of our sentiment_bing data frame:

summary_bing <- count(sentiment_bing, sentiment, sort = TRUE)

Collectively, it looks like our combined dataset has more negative words than positive words.

summary_bing

## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   1002
## 2 positive    892

Since our main goal is to compare positive and negative sentiment between CCSS and NGSS, let’s use the group_by function again to get sentiment summaries for NGSS and CCSS separately:

summary_bing <- sentiment_bing %>% 
  group_by(standards) %>% 
  count(sentiment) 

summary_bing

## # A tibble: 4 × 3
## # Groups:   standards [2]
##   standards sentiment     n
##   <chr>     <chr>     <int>
## 1 ccss      negative    905
## 2 ccss      positive    468
## 3 ngss      negative     97
## 4 ngss      positive    424

Looks like CCSS tweets have far more negative words than positive, while NGSS tweets skew much more positive. So far, pretty consistent with Rosenberg et al.’s findings!

Compute Sentiment Value

Our last step will be to calculate a single sentiment “score” for our tweets that we can use for quick comparison and create a new variable indicating which lexicon we used.

First, let’s untidy our data a little by using the spread function from the tidyr package to transform our sentiment column into separate columns for negative and positive that contain the n counts for each:

summary_bing <- sentiment_bing %>% 
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  spread(sentiment, n) 

summary_bing

## # A tibble: 2 × 3
## # Groups:   standards [2]
##   standards negative positive
##   <chr>        <int>    <int>
## 1 ccss           905      468
## 2 ngss            97      424

Finally, we’ll use the mutate function to create two new variables: sentiment and lexicon so we have a single sentiment score and the lexicon from which it was derived:

summary_bing <- sentiment_bing %>% 
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  spread(sentiment, n) %>%
  mutate(sentiment = positive - negative) %>%
  mutate(lexicon = "bing") %>%
  relocate(lexicon)

summary_bing

## # A tibble: 2 × 5
## # Groups:   standards [2]
##   lexicon standards negative positive sentiment
##   <chr>   <chr>        <int>    <int>     <int>
## 1 bing    ccss           905      468      -437
## 2 bing    ngss            97      424       327

There we go, now we can see that CCSS scores negative, while NGSS is overall positive.
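
spread() still works, but the tidyr documentation describes it as superseded by pivot_wider(). As a sketch, the same net score can be computed with pivot_wider(), starting here from the four counts printed earlier rather than from sentiment_bing itself (bing_counts is a name introduced just for this example):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Counts copied from the grouped summary_bing output shown earlier
bing_counts <- tibble(
  standards = c("ccss", "ccss", "ngss", "ngss"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(905L, 468L, 97L, 424L)
)

summary_bing_wide <- bing_counts %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%  # one column per sentiment
  mutate(sentiment = positive - negative,                   # net score per standard
         lexicon   = "bing") %>%
  relocate(lexicon)

summary_bing_wide
```

This reproduces the net scores of -437 for CCSS and 327 for NGSS.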

Let’s calculate a quick score using the afinn lexicon now. Remember that AFINN provides a value from -5 to 5 for each word:

head(sentiment_afinn)

## # A tibble: 6 × 5
##   standards screen_name created_at          word    value
##   <chr>     <chr>       <dttm>              <chr>   <dbl>
## 1 ngss      clbmanning  2022-01-19 01:02:37 care        2
## 2 ngss      clbmanning  2022-01-19 00:56:50 cool        1
## 3 ngss      TdiShelton  2022-01-19 00:10:03 join        1
## 4 ngss      TdiShelton  2022-01-13 22:33:53 excited     3
## 5 ngss      TdiShelton  2022-01-13 01:39:53 excited     3
## 6 ngss      TdiShelton  2022-01-13 01:39:53 join        1

To calculate a summary score, we will need to first group our data by standards again and then use the summarise function to create a new sentiment variable by adding all the positive and negative scores in the value column:

summary_afinn <- sentiment_afinn %>% 
  group_by(standards) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(lexicon = "AFINN") %>%
  relocate(lexicon)

summary_afinn

## # A tibble: 2 × 3
##   lexicon standards sentiment
##   <chr>   <chr>         <dbl>
## 1 AFINN   ccss           -483
## 2 AFINN   ngss            876

Again, CCSS is overall negative while NGSS is overall positive!
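
One caveat with a summed score is that it grows with the number of scored words, and our two sets of tweets differ a lot in size. As an illustration that goes slightly beyond the walkthrough, the sketch below reports the number of scored words and the mean value alongside the sum. The word values match AFINN entries printed earlier, but the assignment of words to standards is made up.

```r
library(dplyr)
library(tibble)

# Word-level scores; values match AFINN entries shown earlier,
# but the grouping into standards is invented for illustration
scores_demo <- tibble(
  standards = c("ngss", "ngss", "ngss", "ccss", "ccss"),
  word      = c("excited", "care", "join", "abandon", "abhor"),
  value     = c(3, 2, 1, -2, -3)
)

summary_demo <- scores_demo %>%
  group_by(standards) %>%
  summarise(sentiment  = sum(value),    # total score, as in the walkthrough
            n_words    = n(),           # how many scored words contributed
            mean_value = mean(value))   # average score per scored word

summary_demo
```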

✅ Comprehension Check

For your final task for this walkthrough, calculate a single sentiment score for NGSS and CCSS using the remaining nrc and loughran lexicons and answer the following question: Are the findings above still consistent?

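With the nrc lexicon, CCSS tweets contain more positive and more negative words than NGSS tweets. Both sets of standards have more positive than negative words, but the ratio of positive to negative is higher for NGSS (7.15) than for CCSS (3.04).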

Hint: The nrc lexicon contains “positive” and “negative” values just like bing and loughran, but also includes values like “trust” and “sadness” as shown below. You will need to use the filter() function to select rows that only contain “positive” and “negative.”

nrc

## # A tibble: 13,875 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,865 more rows

## # A tibble: 2 × 5
## # Groups:   standards [2]
##   standards method negative positive sentiment
##   <chr>     <chr>     <int>    <int>     <dbl>
## 1 ccss      nrc         753     2287      3.04
## 2 ngss      nrc         118      844      7.15

## # A tibble: 2 × 3
##   lexicon standards sentiment
##   <chr>   <chr>         <dbl>
## 1 AFINN   ccss           -483
## 2 AFINN   ngss            876
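
The nrc table above reports sentiment values of 3.04 and 7.15, which match the ratio of positive to negative counts rather than their difference. The code that produced it is not shown, so the following is only a sketch of how such a table might be built. The positive and negative counts are copied from the table; the “trust” counts are invented to show the filter step.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Counts per standard and nrc category; the trust counts are hypothetical
nrc_counts <- tibble(
  standards = rep(c("ccss", "ngss"), each = 3),
  sentiment = rep(c("negative", "positive", "trust"), times = 2),
  n         = c(753L, 2287L, 100L, 118L, 844L, 50L)
)

summary_nrc <- nrc_counts %>%
  filter(sentiment %in% c("positive", "negative")) %>%      # drop the emotion categories
  pivot_wider(names_from = sentiment, values_from = n) %>%  # one column per sentiment
  mutate(sentiment = round(positive / negative, 2),         # ratio, not difference
         method    = "nrc") %>%
  relocate(method, .after = standards)

summary_nrc
```

Swapping the ratio for positive - negative would give a net score comparable to the bing summary instead.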

4. MODEL

As highlighted in Chapter 3 of Data Science in Education Using R, the Model step of the data science process entails “using statistical models, from simple to complex, to understand trends and patterns in the data.” The authors note that while descriptive statistics and data visualization during the Explore step can help us to identify patterns and relationships in our data, statistical models can be used to help us determine if relationships, patterns and trends are actually meaningful.

Recall from the PREPARE section that the Rosenberg et al. study was guided by the following questions:

What is the public sentiment expressed toward the NGSS?
How does sentiment for teachers differ from non-teachers?
How do tweets posted to #NGSSchat differ from those without the hashtag?
How does participation in #NGSSchat relate to the public sentiment individuals express?
How does public sentiment vary over time?

Similar to our sentiment summary using the AFINN lexicon, the Rosenberg et al. study used the -5 to 5 sentiment score from the SentiStrength lexicon to answer RQ #1. To address the remaining questions, the authors used a mixed effects model (also known as a multi-level or hierarchical linear model) via the lme4 package in R.

Collectively, the authors found that:

The SentiStrength scale indicated an overall neutral sentiment for tweets about the Next Generation Science Standards.
Teachers were more positive in their posts than other participants.
Posts including #NGSSchat that were posted outside of chats were slightly more positive relative to those that did not include the #NGSSchat hashtag.
The effect upon individuals of being involved in the #NGSSchat was positive, suggesting that there is an impact on individuals—not tweets—of participating in a community focused on the NGSS.
Posts about the NGSS became substantially more positive over time.
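
To give a feel for what a mixed effects model looks like in lme4, here is a minimal sketch fit to simulated data. Everything in it is an assumption made for illustration: the data are randomly generated, the variable names (user, teacher, sentiment) are hypothetical, and the formula is not the specification used by Rosenberg et al. The idea it shows is a fixed effect for a tweet-level or user-level predictor plus a random intercept for each user, since tweets from the same person are not independent.

```r
library(lme4)

set.seed(588)

n_users         <- 30
tweets_per_user <- 10

# Simulated data: each user posts several tweets and is either a teacher (1) or not (0)
sim <- data.frame(
  user    = factor(rep(seq_len(n_users), each = tweets_per_user)),
  teacher = rep(rbinom(n_users, 1, 0.5), each = tweets_per_user)
)

# Each user gets their own baseline sentiment (the random intercept)
user_effect   <- rnorm(n_users, sd = 0.5)
sim$sentiment <- 0.2 + 0.3 * sim$teacher +
  user_effect[as.integer(sim$user)] +
  rnorm(nrow(sim), sd = 1)

# Fixed effect for teacher status, random intercept for user
fit <- lmer(sentiment ~ teacher + (1 | user), data = sim)

fixef(fit)
```

The teacher coefficient estimates how much more positive teachers’ tweets are on average in the simulated data, after accounting for differences between individual users.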

5. COMMUNICATE

The final(ish) step in our workflow/process is sharing the results of analysis with a wider audience. Krumm et al. (2018) outlined the following 3-step process for communicating with education stakeholders what you have learned through analysis:

Select. Communicating what one has learned involves selecting among those analyses that are most important and most useful to an intended audience, as well as selecting a form for displaying that information, such as a graph or table in static or interactive form, i.e. a “data product.”
Polish. After creating initial versions of data products, research teams often spend time refining or polishing them, by adding or editing titles, labels, and notations and by working with colors and shapes to highlight key points.
Narrate. Writing a narrative to accompany the data products involves, at a minimum, pairing a data product with its related research question, describing how best to interpret the data product, and explaining the ways in which the data product helps answer the research question.

5a. Select

Remember that the questions of interest that we want to focus on for our selection, polishing, and narration include:

What is the public sentiment expressed toward the NGSS?
How does sentiment for NGSS compare to sentiment for CCSS?

To address questions 1 and 2, I’m going to focus my analyses, data products and sharing format on the following:

Analyses. For RQ1, I want to try and replicate as closely as possible the analysis by Rosenberg et al., so I will clean up my analysis, calculate a single sentiment score using the AFINN lexicon for the entire tweet, and label it positive or negative based on that score. I also want to highlight how regardless of the lexicon selected, NGSS tweets contain more positive words than negative, so I’ll also polish my previous analyses and calculate percentages of positive and negative words for the
Data Products. I know these are shunned in the world of data viz, but I think a pie chart will actually be an effective way to quickly communicate the proportion of positive and negative tweets among the Next Generation Science Standards. And for my analyses with the bing, nrc, and loughran lexicons, I’ll create some 100% stacked bars showing the percentage of positive and negative words among all tweets for the NGSS and CCSS.
Format. Similar to Unit 1, I’ll be using R Markdown again to create a quick slide deck. Recall that R Markdown files can also be used to create a wide range of outputs and formats, including polished PDF or Word documents, websites, web apps, journal articles, online books, interactive tutorials and more. And to make this process even more user-friendly, RStudio now includes a visual editor!

5b. Polish

NGSS Sentiment

I want to replicate as closely as possible the approach Rosenberg et al. used in their analysis. To do that, I can recycle some R code I used in section 2b. Tidy Text.

To polish my analyses, I first need to rebuild the tweets dataset from my ngss_tweets and ccss_tweets and select both the status_id that is unique to each tweet and the text column that contains the actual post:

ngss_text <-
  ngss_tweets %>%
  filter(lang == "en") %>%
  select(status_id, text) %>%
  mutate(standards = "ngss") %>%
  relocate(standards)

ccss_text <-
  ccss_tweets %>%
  filter(lang == "en") %>%
  select(status_id, text) %>%
  mutate(standards = "ccss") %>%
  relocate(standards)

tweets <- bind_rows(ngss_text, ccss_text)

tweets

## # A tibble: 1,606 × 3
##    standards status_id           text                                           
##    <chr>     <chr>               <chr>                                          
##  1 ngss      1483605784600752131 @NGSS_tweeps @IndigenousSTEAM I am a PhD stude…
##  2 ngss      1483604326069243906 @gosciencego @NGSS_tweeps @nativelandnet What …
##  3 ngss      1483592553962389505 Join us for #NGSSchat this Thursday, January 2…
##  4 ngss      1481756415740215297 I am so excited about the new  @NSTA Strategic…
##  5 ngss      1481440835736588290 This is going to be a great @nsta session. We …
##  6 ngss      1483587759340064772 The 𝗠𝗜𝗗𝗗𝗟𝗘 &amp; 𝗛𝗜𝗚𝗛 𝗦𝗖𝗛𝗢𝗢𝗟 𝗡𝗚𝗦𝗦 𝗣𝗛𝗘𝗡𝗢𝗠𝗘𝗡𝗔 𝗦𝗘…
##  7 ngss      1483582265502453760 What roles do you have? How do you bring them …
##  8 ngss      1483582086149812235 ...and cultivating Indigenous youths' collecti…
##  9 ngss      1483582512089767938 https://t.co/jaIyiPhcqv is an excellent resour…
## 10 ngss      1481454891377774597 My favorite course I took through @natgeoeduca…
## # … with 1,596 more rows

The status_id is important because like Rosenberg et al., I want to calculate an overall sentiment score for each tweet, rather than for each word.

Before I get that far, however, I’ll need to tidy my tweets again and attach my sentiment scores.

Note that the closest lexicon we have available in our tidytext package to the SentiStrength lexicon used by Rosenberg et al. is the AFINN lexicon, which also uses a scale from -5 to 5.

So let’s use unnest_tokens to tidy our tweets, remove stop words, and add AFINN scores to each word, similar to what we did in section 2c. Add Sentiment Values:

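# Tokenize each tweet into one word per row, keeping status_id so that
# word scores can later be summed per tweet
sentiment_afinn <- tweets %>%
  unnest_tokens(output = word, 
                input = text, 
                token = "tweets")  %>% 
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  filter(word != "amp") %>%                # drop the "amp" left over from "&amp;"
  inner_join(afinn, by = "word")           # keep only words scored in AFINN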

sentiment_afinn

## # A tibble: 1,701 × 4
##    standards status_id           word      value
##    <chr>     <chr>               <chr>     <dbl>
##  1 ngss      1483605784600752131 care          2
##  2 ngss      1483604326069243906 cool          1
##  3 ngss      1483592553962389505 join          1
##  4 ngss      1481756415740215297 excited       3
##  5 ngss      1481440835736588290 excited       3
##  6 ngss      1481440835736588290 join          1
##  7 ngss      1483587759340064772 easy          1
##  8 ngss      1483582086149812235 healthy       2
##  9 ngss      1483582512089767938 excellent     3
## 10 ngss      1481454891377774597 favorite      2
## # … with 1,691 more rows
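
To make the tokenize-then-join pattern above easier to see, here is a minimal sketch on two made-up tweets. Everything in it is an assumption for illustration: the toy_tweets, toy_stop_words, and toy_lexicon objects are hypothetical stand-ins typed by hand, not the real stop_words or AFINN data. It also uses the default word tokenizer rather than token = "tweets", since, as far as I know, more recent tidytext releases no longer include the tweets tokenizer.

```r
library(dplyr)
library(tidytext)

# Two hypothetical tweets (not from the real dataset)
toy_tweets <- tibble(
  status_id = c("1", "2"),
  text = c("I am so excited to join this excellent session",
           "This rollout is a terrible mess")
)

# Hand-typed stand-ins for the stop word list and the sentiment lexicon
toy_stop_words <- tibble(word = c("i", "am", "so", "to", "this", "is", "a"))
toy_lexicon <- tibble(
  word  = c("excited", "join", "excellent", "terrible", "mess"),
  value = c(3, 1, 3, -3, -2)
)

toy_scored <- toy_tweets %>%
  unnest_tokens(output = word, input = text) %>%  # one lowercase word per row
  anti_join(toy_stop_words, by = "word") %>%      # drop the stop words
  inner_join(toy_lexicon, by = "word")            # keep only scored words

toy_scored
```

Words that are not in the lexicon, such as "session" and "rollout", simply drop out at the inner_join step, which is why the scored data frame has fewer rows than the tokenized one.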

Next, I want to calculate a single score for each tweet. To do that, I’ll use the by now familiar group_by and summarise functions:

afinn_score <- sentiment_afinn %>% 
  group_by(standards, status_id) %>% 
  summarise(value = sum(value))

afinn_score

## # A tibble: 938 × 3
## # Groups:   standards [2]
##    standards status_id           value
##    <chr>     <chr>               <dbl>
##  1 ccss      1480437748397977604     0
##  2 ccss      1480438639779741704    -2
##  3 ccss      1480464216444264449     6
##  4 ccss      1480465238159937538    -5
##  5 ccss      1480465246682857473     2
##  6 ccss      1480466375042686980    -4
##  7 ccss      1480476311814553601    -3
##  8 ccss      1480484174943469575     1
##  9 ccss      1480507651318431744     1
## 10 ccss      1480509018925789189     0
## # … with 928 more rows
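
Notice the "Groups: standards [2]" line in the output above: summarise only peels off the last grouping variable, so the result is still grouped by standards. That works fine for this walkthrough, but if you would rather get back an ungrouped data frame, one option is the .groups argument. Here is a small sketch, assuming a hypothetical toy_words data frame I made up for illustration:

```r
library(dplyr)

# Hypothetical word-level scores for two tweets, one per set of standards
toy_words <- tibble(
  standards = c("ngss", "ngss", "ccss", "ccss", "ccss"),
  status_id = c("a1", "a1", "b1", "b1", "b1"),
  value     = c(3, 1, -2, -3, 1)
)

# Sum the word scores within each tweet and drop all grouping afterwards
toy_scores <- toy_words %>%
  group_by(standards, status_id) %>%
  summarise(value = sum(value), .groups = "drop")

toy_scores
```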

And like Rosenberg et al., I’ll flag each tweet as “positive” or “negative” by using the mutate function to create a new sentiment column.

To do this, we use the if_else function from the dplyr package. This if_else function adds “negative” to the sentiment column if the score in the value column of the corresponding row is less than 0. If not, it adds “positive” to the row.

afinn_sentiment <- afinn_score %>%
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive"))

afinn_sentiment

## # A tibble: 901 × 4
## # Groups:   standards [2]
##    standards status_id           value sentiment
##    <chr>     <chr>               <dbl> <chr>    
##  1 ccss      1480438639779741704    -2 negative 
##  2 ccss      1480464216444264449     6 positive 
##  3 ccss      1480465238159937538    -5 negative 
##  4 ccss      1480465246682857473     2 positive 
##  5 ccss      1480466375042686980    -4 negative 
##  6 ccss      1480476311814553601    -3 negative 
##  7 ccss      1480484174943469575     1 positive 
##  8 ccss      1480507651318431744     1 positive 
##  9 ccss      1480519896630976514    -4 negative 
## 10 ccss      1480524134648012809    -9 negative 
## # … with 891 more rows

Note that since a tweet sentiment score equal to 0 is neutral, I used the filter function to remove it from the dataset.
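
To see why that filter matters, here is a quick sketch with three made-up scores (the toy_scores data frame is hypothetical). Without the filter, if_else sends a score of 0 to the “positive” branch, because 0 is not less than 0. If you wanted to keep neutral tweets rather than drop them, one alternative, which is my assumption and not what Rosenberg et al. did, would be case_when:

```r
library(dplyr)

# Hypothetical tweet-level scores: one negative, one neutral, one positive
toy_scores <- tibble(
  status_id = c("a", "b", "c"),
  value     = c(-2, 0, 4)
)

# Without filtering, the neutral tweet is labelled "positive"
unfiltered <- toy_scores %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive"))

# case_when can give neutral tweets their own label instead
labelled <- toy_scores %>%
  mutate(sentiment = case_when(
    value < 0 ~ "negative",
    value > 0 ~ "positive",
    TRUE      ~ "neutral"
  ))

unfiltered
labelled
```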

Next, we’re ready to compute the ratio of negative to positive tweets. We’ll use the group_by function and count the number of tweets for each of the standards that are positive or negative in the sentiment column. Then we’ll use the spread function to move those counts into separate columns so we can perform a quick calculation to compute the ratio.

afinn_ratio <- afinn_sentiment %>% 
  group_by(standards) %>% 
  count(sentiment) %>% 
  spread(sentiment, n) %>%
  mutate(ratio = negative/positive)

afinn_ratio

## # A tibble: 2 × 4
## # Groups:   standards [2]
##   standards negative positive ratio
##   <chr>        <int>    <int> <dbl>
## 1 ccss           347      255 1.36 
## 2 ngss            36      263 0.137
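
A ratio above 1 means more negative than positive tweets, so CCSS tweets lean negative (347 / 255 ≈ 1.36) while NGSS tweets lean strongly positive (36 / 263 ≈ 0.137). The spread function still works, but the tidyr package now recommends pivot_wider for this kind of reshaping. Here is a sketch of the same calculation with pivot_wider, where I’ve typed the counts from the table above into a hypothetical tweet_counts data frame:

```r
library(dplyr)
library(tidyr)

# Tweet counts copied by hand from the afinn_ratio output above
tweet_counts <- tibble(
  standards = c("ccss", "ccss", "ngss", "ngss"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(347L, 255L, 36L, 263L)
)

# One column per sentiment label, then negative tweets per positive tweet
ratio_wide <- tweet_counts %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(ratio = negative / positive)

ratio_wide
```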

Finally, I’ll count the positive and negative NGSS tweets and plot them as a pie chart:

afinn_counts <- afinn_sentiment %>%
  group_by(standards) %>% 
  count(sentiment) %>%
  filter(standards == "ngss")

afinn_counts %>%
  ggplot(aes(x = "", y = n, fill = sentiment)) +
  geom_bar(width = .6, stat = "identity") +
  labs(title = "Next Gen Science Standards",
       subtitle = "Proportion of Positive & Negative Tweets") +
  coord_polar(theta = "y") +
  theme_void()

NGSS vs CCSS

Finally, to address Question 2, I want to compare the percentage of positive and negative words contained in the corpus of tweets for the NGSS and CCSS standards using the four different lexicons to see how sentiment compares based on lexicon used.

I’ll begin by polishing my previous summaries and creating identical summaries for each lexicon that contain the following columns: method, standards, sentiment, and n (the word count):

summary_afinn2 <- sentiment_afinn %>% 
  group_by(standards) %>% 
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive")) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "AFINN")

summary_bing2 <- sentiment_bing %>% 
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "bing")

summary_nrc2 <- sentiment_nrc %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "nrc") 

summary_loughran2 <- sentiment_loughran %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "loughran")

Next, I’ll combine those four data frames together using the bind_rows function again:

summary_sentiment <- bind_rows(summary_afinn2,
                               summary_bing2,
                               summary_nrc2,
                               summary_loughran2) %>%
  arrange(method, standards) %>%
  relocate(method)

summary_sentiment

## # A tibble: 16 × 4
## # Groups:   standards [2]
##    method   standards sentiment     n
##    <chr>    <chr>     <chr>     <int>
##  1 AFINN    ccss      negative    647
##  2 AFINN    ccss      positive    515
##  3 AFINN    ngss      positive    460
##  4 AFINN    ngss      negative     79
##  5 bing     ccss      negative    905
##  6 bing     ccss      positive    468
##  7 bing     ngss      positive    424
##  8 bing     ngss      negative     97
##  9 loughran ccss      negative    525
## 10 loughran ccss      positive    143
## 11 loughran ngss      positive    156
## 12 loughran ngss      negative     97
## 13 nrc      ccss      positive   2287
## 14 nrc      ccss      negative    753
## 15 nrc      ngss      positive    844
## 16 nrc      ngss      negative    118

Then I’ll create a new data frame that has the total word counts for each set of standards and each method and join that to my summary_sentiment data frame:

total_counts <- summary_sentiment %>%
  group_by(method, standards) %>%
  summarise(total = sum(n))

## `summarise()` has grouped output by 'method'. You can override using the `.groups` argument.

sentiment_counts <- left_join(summary_sentiment, total_counts)

## Joining, by = c("method", "standards")

sentiment_counts

## # A tibble: 16 × 5
## # Groups:   standards [2]
##    method   standards sentiment     n total
##    <chr>    <chr>     <chr>     <int> <int>
##  1 AFINN    ccss      negative    647  1162
##  2 AFINN    ccss      positive    515  1162
##  3 AFINN    ngss      positive    460   539
##  4 AFINN    ngss      negative     79   539
##  5 bing     ccss      negative    905  1373
##  6 bing     ccss      positive    468  1373
##  7 bing     ngss      positive    424   521
##  8 bing     ngss      negative     97   521
##  9 loughran ccss      negative    525   668
## 10 loughran ccss      positive    143   668
## 11 loughran ngss      positive    156   253
## 12 loughran ngss      negative     97   253
## 13 nrc      ccss      positive   2287  3040
## 14 nrc      ccss      negative    753  3040
## 15 nrc      ngss      positive    844   962
## 16 nrc      ngss      negative    118   962

Finally, I’ll add a new column that calculates the percentage of positive and negative words for each set of state standards:

sentiment_percents <- sentiment_counts %>%
  mutate(percent = n/total * 100)

sentiment_percents

## # A tibble: 16 × 6
## # Groups:   standards [2]
##    method   standards sentiment     n total percent
##    <chr>    <chr>     <chr>     <int> <int>   <dbl>
##  1 AFINN    ccss      negative    647  1162    55.7
##  2 AFINN    ccss      positive    515  1162    44.3
##  3 AFINN    ngss      positive    460   539    85.3
##  4 AFINN    ngss      negative     79   539    14.7
##  5 bing     ccss      negative    905  1373    65.9
##  6 bing     ccss      positive    468  1373    34.1
##  7 bing     ngss      positive    424   521    81.4
##  8 bing     ngss      negative     97   521    18.6
##  9 loughran ccss      negative    525   668    78.6
## 10 loughran ccss      positive    143   668    21.4
## 11 loughran ngss      positive    156   253    61.7
## 12 loughran ngss      negative     97   253    38.3
## 13 nrc      ccss      positive   2287  3040    75.2
## 14 nrc      ccss      negative    753  3040    24.8
## 15 nrc      ngss      positive    844   962    87.7
## 16 nrc      ngss      negative    118   962    12.3
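
As a side note, the total and percent columns can also be computed in a single step with a grouped mutate, which skips the separate total_counts data frame and the join. This is just an alternative sketch, not the approach used above; the toy_counts data frame below is hypothetical and holds only the AFINN counts copied by hand from the table:

```r
library(dplyr)

# AFINN word counts copied by hand from summary_sentiment above
toy_counts <- tibble(
  method    = "AFINN",
  standards = c("ccss", "ccss", "ngss", "ngss"),
  sentiment = c("negative", "positive", "positive", "negative"),
  n         = c(647L, 515L, 460L, 79L)
)

# Within each method and set of standards, divide each count by the group total
toy_percents <- toy_counts %>%
  group_by(method, standards) %>%
  mutate(total   = sum(n),
         percent = n / total * 100) %>%
  ungroup()

toy_percents
```

Rounded to one decimal place, the percent column should match the AFINN rows in the output above.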

Now that I have my sentiment percent summaries for each lexicon, I’m going to create my 100% stacked bar charts for each lexicon:

sentiment_percents %>%
  ggplot(aes(x = standards, y = percent, fill=sentiment)) +
  geom_bar(width = .8, stat = "identity") +
  facet_wrap(~method, ncol = 1) +
  coord_flip() +
  labs(title = "Public Sentiment on Twitter", 
       subtitle = "The Common Core & Next Gen Science Standards",
       x = "State Standards", 
       y = "Percentage of Words")

And finished! The chart above clearly illustrates that, regardless of the sentiment lexicon used, NGSS tweets contain a higher percentage of positive words than CCSS tweets.
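
If you only need the chart and not the percent table, ggplot2 can also build 100% stacked bars directly from raw counts with position = "fill". Here is a minimal sketch, assuming a hypothetical toy_counts data frame that holds just the bing word counts copied by hand from the summary table:

```r
library(ggplot2)

# bing word counts copied by hand from the summary table
toy_counts <- data.frame(
  standards = c("ccss", "ccss", "ngss", "ngss"),
  sentiment = c("negative", "positive", "positive", "negative"),
  n         = c(905, 468, 424, 97)
)

# position = "fill" rescales each bar to sum to 1, so no percent column is needed
p <- ggplot(toy_counts, aes(x = standards, y = n, fill = sentiment)) +
  geom_col(position = "fill", width = .8) +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  labs(x = "State Standards",
       y = "Percentage of Words")

p
```

The trade-off is that the percentages only exist inside the plot, so the explicit percent column is still the better choice when you also want to report the numbers in a table or in your narrative.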

5c. Narrate

With our “data products” cleanup complete, we can start pulling together a quick presentation to share with the class. We’ve already seen what a more formal journal article looks like in the PREPARE section of this walkthrough. For your Independent Analysis assignment for Unit 2, you’ll be creating either a simple report or slide deck to share out some key findings from our analysis.

Regardless of whether you plan to talk us through your analysis and findings with a presentation or walk us through with a brief written report, your assignment should address the following questions:

Purpose. What question or questions are guiding your analysis? What did you hope to learn by answering these questions and why should your audience care about your findings?
Methods. What data did you select for analysis? What steps did you take to prepare your data for analysis and what techniques did you use to analyze your data? These should be fairly explicit with your embedded code.
Findings. What did you ultimately find? How do your “data products” help to illustrate these findings? What conclusions can you draw from your analysis?
Discussion. What were some of the strengths and weaknesses of your analysis? How might your audience use this information? How might you revisit or improve upon this analysis in the future?

Dr. Shiyan Jiang, student - Jeanne McClure
