Week 4 Sentiment Analysis

1. Prepare the Environment

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr) 
library(tidyr) 
library(rtweet) 
library(writexl) 
library(readxl) 
library(tidytext) 
library(textdata) 
library(ggplot2) 
library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:readr':
## 
##     col_factor

2. Wrangle

Read in the data from the Excel files:

# Load in data  
ngss_tweets <- read_xlsx("lab-2/data/ngss_tweets.xlsx")  
ccss_tweets <- read_xlsx("lab-2/data/csss_tweets.xlsx")

To keep only relevant data for our analysis, filter out tweets that aren’t in English and then keep only columns specifying the screen name of the user who made the tweet, when it was created, and what the tweet said. Finally, because we will be joining the two standards’ data frames, add a column in each specifying which standard the tweet pertains to:

# Keep English tweets & add a column specifying the standard
ngss_text <-
  ngss_tweets %>%
  filter(lang == "en") %>%  
  select(screen_name, created_at, text) %>%
  mutate(standards = "ngss") %>%
  relocate(standards)  
ccss_text <-     
  ccss_tweets %>%
  filter(lang == "en") %>%
  select(screen_name, created_at, text) %>% 
  mutate(standards = "ccss") %>% 
  relocate(standards)
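
The two pipelines above differ only in the input data frame and the label, so they could be wrapped in a small helper. The following is a minimal sketch, not part of the original lab: the function name `prep_tweets()` is hypothetical, and the toy tibble stands in for the real tweet data.

```r
library(dplyr)

# Hypothetical helper: keep English tweets, select the columns of interest,
# and label which standard the tweets pertain to.
prep_tweets <- function(df, label) {
  df %>%
    filter(lang == "en") %>%
    select(screen_name, created_at, text) %>%
    mutate(standards = label) %>%
    relocate(standards)
}

# Made-up data standing in for ngss_tweets
toy <- tibble(
  screen_name = c("a", "b"),
  created_at  = as.POSIXct(c("2021-01-01 10:00:00", "2021-01-02 11:00:00"), tz = "UTC"),
  text        = c("NGSS is great", "Les normes"),
  lang        = c("en", "fr")
)

toy_text <- prep_tweets(toy, "ngss")
```

With a helper like this, the two assignments would reduce to `prep_tweets(ngss_tweets, "ngss")` and `prep_tweets(ccss_tweets, "ccss")`.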

Now combine the two data frames into one using bind_rows:

# Combine ngss and ccss data frames  
tweets <- bind_rows(ngss_text, ccss_text)

Tokenize the tweets for our analysis by splitting each tweet so that every word gets its own row:

# Tokenize the data
tweet_tokens <-    
  tweets %>%   
  unnest_tokens(output = word,
                input = text)

# Count the most common words to spot irrelevant ones to remove
count(tweet_tokens, word, sort = TRUE)

## # A tibble: 7,708 × 2
##    word       n
##    <chr>  <int>
##  1 the     1180
##  2 common  1112
##  3 core    1110
##  4 to       982
##  5 and      744
##  6 t        629
##  7 co       623
##  8 https    623
##  9 of       589
## 10 a        584
## # ℹ 7,698 more rows

# Remove stop words and the leftover "amp" token (from the HTML entity &amp;)
tidy_tweets <-
  tweet_tokens %>%
  anti_join(stop_words, by = "word") %>%
  filter(word != "amp")
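
To see what tokenizing and stop-word removal do, it can help to run the same steps on a couple of made-up tweets. This is only an illustrative sketch; `toy_tweets` is invented data, not part of the lab's dataset.

```r
library(dplyr)
library(tidytext)

# Made-up tweets for illustration
toy_tweets <- tibble(
  standards = c("ngss", "ccss"),
  text = c("The NGSS standards are helpful &amp; clear",
           "Common Core is a disaster")
)

toy_tokens <- toy_tweets %>%
  unnest_tokens(output = word, input = text) %>%  # one lowercased word per row
  anti_join(stop_words, by = "word") %>%          # drop common English stop words
  filter(word != "amp")                           # drop the "amp" left over from "&amp;"
```

Each remaining row of `toy_tokens` keeps its `standards` label, which is what lets the later steps summarise sentiment per standard.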

To create a sentiment score, load in lexicons (AFINN, bing, NRC, and loughran) that assign sentiment values to words. These lexicons were put together through crowdsourcing or by their authors, and validated using some combination of crowdsourcing, reviews, or Twitter data.

Then, combine the lexicons with the tidy_tweets data frame by shared words using inner_join:

# Get lexicons   
afinn <- get_sentiments("afinn")   
bing <- get_sentiments("bing")   
nrc <- get_sentiments("nrc")   
loughran <- get_sentiments("loughran") 

# Combine the two data frames, keeping only rows with matching data in the `word` column   
sentiment_afinn <- inner_join(tidy_tweets, afinn, by = "word")   
sentiment_bing <- inner_join(tidy_tweets, bing, by = "word")   
sentiment_nrc <- inner_join(tidy_tweets, nrc, by = "word")

## Warning in inner_join(tidy_tweets, nrc, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 26 of `x` matches multiple rows in `y`.
## ℹ Row 7509 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

sentiment_loughran <- inner_join(tidy_tweets, loughran, by = "word")

## Warning in inner_join(tidy_tweets, loughran, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2299 of `x` matches multiple rows in `y`.
## ℹ Row 2589 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
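
The many-to-many warnings above appear because a word can occur in many tweets and can also carry more than one label in a lexicon. The sketch below reproduces the situation on made-up data and silences the warning the way the message suggests; the toy lexicon is invented for illustration, and the `relationship` argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Made-up tokens: "fail" appears twice
toy_tokens <- tibble(word = c("fail", "support", "fail"))

# Made-up lexicon: "fail" carries two labels
toy_lexicon <- tibble(
  word      = c("fail", "fail", "support"),
  sentiment = c("negative", "sadness", "positive")
)

# Declare that many-to-many matches are expected (dplyr >= 1.1.0)
toy_joined <- inner_join(
  toy_tokens, toy_lexicon,
  by = "word",
  relationship = "many-to-many"
)
```

Each occurrence of "fail" is matched to both of its labels, so the joined result has more rows than the token table. That is the expected behaviour here, which is why the warning can be silenced rather than treated as an error.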

3. Explore

Compare the date range and the times the tweets were posted between the standards:

#  Compare the number of tweets over time by Next Gen and Common Core standards
tweets %>%   
  group_by(standards) %>%
  ts_plot(by = "days")

tweets %>%   
  group_by(standards) %>%
  ts_plot(by = "hours")

Compare the number of positive versus negative words per standard based on the bing lexicon. Then, using spread, separate the sentiment column into negative and positive columns holding the n value for each. Finally, create new variables for the sentiment score (positive minus negative) and for the lexicon used, bing:

# Count positive vs negative sentiments per standard (group_by) & separate it into columns by standard with n count (spread) then create new variables (mutate) for the lexicon used and sentiment score   
summary_bing <- sentiment_bing %>%
  group_by(standards) %>%      
  count(sentiment, sort = TRUE) %>%
  spread(sentiment, n) %>%
  mutate(sentiment = positive - negative) %>%
  mutate(lexicon = "bing") %>%
  relocate(lexicon)
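
The count-then-spread pattern can be checked by hand on a tiny example. The sketch below uses made-up word labels and `pivot_wider()`, which tidyr documents as the newer replacement for `spread()`; the column name `score` is chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up word-level sentiment labels
toy_bing <- tibble(
  standards = c("ngss", "ngss", "ngss", "ccss", "ccss", "ccss"),
  sentiment = c("positive", "positive", "negative",
                "negative", "negative", "positive")
)

toy_summary <- toy_bing %>%
  count(standards, sentiment) %>%                    # words per standard and label
  pivot_wider(names_from = sentiment,
              values_from = n, values_fill = 0) %>%  # one column per label
  mutate(score = positive - negative)                # net sentiment
```

In this toy data NGSS has two positive words and one negative, so its net score is 1, while CCSS comes out at -1.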

Create a sentiment score for the AFINN lexicon by summing the positive and negative word values:

# Create a sentiment score for the AFINN lexicon by summing positive and negative values
summary_afinn <- sentiment_afinn %>% 
  group_by(standards) %>% 
  summarise(sentiment = sum(value)) %>% 
  mutate(lexicon = "AFINN") %>%
  relocate(lexicon)

Now do the same for the NRC lexicon, using filter to keep only the ‘positive’ and ‘negative’ sentiments:

# Calculate a single sentiment score for NGSS and CCSS using the `nrc` lexicon (here, the ratio of positive to negative words)
summary_nrc <- 
  sentiment_nrc %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(standards) %>%
  count(sentiment, sort = TRUE) %>%
  spread(sentiment, n) %>%
  mutate(sentiment = positive/negative) %>%
  mutate(lexicon = "nrc")

To calculate an overall sentiment score for each tweet, rather than for each word, we need to add back the status_id and text fields.

# Rebuild the `tweets` dataset from the ngss_tweets and ccss_tweets and select both the `status_id` that is unique to each tweet, and the `text` column which contains the actual post:    
ngss_text <-     
  ngss_tweets %>%
  filter(lang == "en") %>%
  select(status_id, text) %>%
  mutate(standards = "ngss") %>%
  relocate(standards)      
ccss_text <-     
  ccss_tweets %>%
  filter(lang == "en") %>%
  select(status_id, text) %>%
  mutate(standards = "ccss") %>%
  relocate(standards)      

tweets <- bind_rows(ngss_text, ccss_text)

Tokenize the text again, remove stop words, and re-join the result with the AFINN lexicon:

# unnest and remove stop words   
sentiment_afinn <- tweets %>%
  unnest_tokens(output = word,
                input = text)  %>%
  anti_join(stop_words, by = "word") %>%
  filter(word != "amp") %>%
  inner_join(afinn, by = "word")

Similar to before, sum the sentiment scores to calculate the sentiment score for each tweet.

# calculate a single score for each tweet 
afinn_score <- 
  sentiment_afinn %>%
  group_by(standards, status_id) %>%
  summarise(value = sum(value))

## `summarise()` has grouped output by 'standards'. You can override using the
## `.groups` argument.
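
The message above is informational: after summarising, the result is still grouped by `standards`. If an ungrouped result is preferred, the `.groups` argument named in the message can be set explicitly. A minimal sketch on made-up scores:

```r
library(dplyr)

# Made-up word-level AFINN values for three tweets
toy_afinn <- tibble(
  standards = c("ngss", "ngss", "ccss", "ccss"),
  status_id = c("1", "1", "2", "3"),
  value     = c(2, -1, -3, 1)
)

toy_score <- toy_afinn %>%
  group_by(standards, status_id) %>%
  summarise(value = sum(value), .groups = "drop")  # no message, ungrouped result
```

Tweet "1" has word values 2 and -1, so its summed score is 1.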

Add another column classifying the summed score as positive or negative.

# flag tweets as pos or neg based on score 
afinn_sentiment <-
  afinn_score %>%
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive"))

Calculate the ratio of negative to positive tweets per standard.

# calculate the ratio of pos & neg sentiments per standard 
afinn_ratio <- afinn_sentiment %>%
  group_by(standards) %>%
  count(sentiment) %>%
  spread(sentiment, n) %>%
  mutate(ratio = negative/positive)
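
As a worked check of the flag-and-ratio arithmetic, the sketch below runs the same logic on made-up per-tweet scores. It uses `pivot_wider()` in place of `spread()`, and all values are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up per-tweet scores; the zero-score tweet is dropped as neutral
toy_scores <- tibble(
  standards = c("ngss", "ngss", "ngss", "ngss", "ccss", "ccss", "ccss"),
  value     = c(3, 1, -2, 0, -4, -1, 2)
)

toy_ratio <- toy_scores %>%
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive")) %>%
  count(standards, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(ratio = negative / positive)  # negative tweets per positive tweet
```

Here NGSS has one negative and two positive tweets (ratio 0.5), while CCSS has two negative and one positive (ratio 2). A ratio above 1 means negative tweets outnumber positive ones.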

Graph the results to visualize the ratio.

# create a graph 
afinn_countsNGSS <- 
  afinn_sentiment %>%
  group_by(standards) %>%
  count(sentiment) %>%
  filter(standards == "ngss")  

afinn_countsNGSS %>% 
  ggplot(aes(x="", y=n, fill=sentiment)) +
  geom_bar(width = .6, stat = "identity") +
  labs(title = "Next Gen Science Standards",
       subtitle = "Proportion of Positive & Negative Tweets") +
  coord_polar(theta = "y") +
  theme_void()

afinn_countsCCSS <- 
  afinn_sentiment %>%
  group_by(standards) %>%
  count(sentiment) %>%
  filter(standards == "ccss")  

afinn_countsCCSS %>% 
  ggplot(aes(x="", y=n, fill=sentiment)) +
  geom_bar(width = .6, stat = "identity") +
  labs(title = "Common Core State Standards",
       subtitle = "Proportion of Positive & Negative Tweets") +
  coord_polar(theta = "y") +
  theme_void()
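
The two pie charts repeat the same code with a different filter. One possible alternative, sketched here on made-up counts rather than the lab's data, is to draw both standards in a single call with `facet_wrap()`. Using `position = "fill"` normalises each panel so that the slices show proportions.

```r
library(dplyr)
library(ggplot2)

# Made-up counts of positive and negative tweets per standard
toy_counts <- tibble(
  standards = c("ngss", "ngss", "ccss", "ccss"),
  sentiment = c("positive", "negative", "positive", "negative"),
  n         = c(70, 10, 30, 60)
)

p <- toy_counts %>%
  ggplot(aes(x = "", y = n, fill = sentiment)) +
  geom_col(width = .6, position = "fill") +  # stack to proportions within each panel
  coord_polar(theta = "y") +
  facet_wrap(~standards) +                   # one pie per standard
  labs(title = "Proportion of Positive & Negative Tweets") +
  theme_void()

# print(p) draws the chart
```

This keeps the two charts on a shared legend and guarantees they are styled identically.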

Repeat these steps for each of the four lexicons to compare the percentage of positive and negative words in the NGSS and CCSS tweets and to see how sentiment varies with the lexicon used:

# polishing previous summaries &  creating identical summaries for each lexicon  
summary_afinn2 <- sentiment_afinn %>% 
  group_by(standards) %>% 
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive")) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "AFINN")

summary_bing2 <- sentiment_bing %>% 
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "bing")

summary_nrc2 <- sentiment_nrc %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "nrc") 

summary_loughran2 <- sentiment_loughran %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "loughran")

Then use bind_rows again to put all the data frames together:

# put data frames together   
summary_sentiment <- 
  bind_rows(summary_afinn2,
            summary_bing2,
            summary_nrc2,
            summary_loughran2) %>%
  arrange(method, standards) %>%
  relocate(method)

Sum the word counts to get the total for each standard and lexicon, then join the totals to the summary_sentiment data frame:

# create a data frame with total word counts per standard   
total_counts <- summary_sentiment %>%
  group_by(method, standards) %>%
  summarise(total = sum(n))

## `summarise()` has grouped output by 'method'. You can override using the
## `.groups` argument.

#  join it to sentiment dataframe   
sentiment_counts <- left_join(summary_sentiment, total_counts)

## Joining with `by = join_by(method, standards)`

Calculate the percentage of positive and negative words per standard, then graph it:

# calc % of pos and neg words per set standard   
sentiment_percents <- sentiment_counts %>%     
  mutate(percent = n/total * 100)    
# graph it   
sentiment_percents %>%     
  ggplot(aes(x = standards, y = percent, fill=sentiment)) +
  geom_bar(width = .8, stat = "identity") +
  facet_wrap(~method, ncol = 1) +
  coord_flip() +
  labs(title = "Public Sentiment on Twitter",
       subtitle = "The Common Core & Next Gen Science Standards",
       x = "State Standards",
       y = "Percentage of Words")
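
The total-then-join step and the percentage calculation can also be done in one pass with a grouped `mutate()`. The following is a sketch on made-up counts, offered as an alternative rather than as the lab's method:

```r
library(dplyr)

# Made-up word counts for one lexicon
toy_summary <- tibble(
  method    = c("bing", "bing", "bing", "bing"),
  standards = c("ngss", "ngss", "ccss", "ccss"),
  sentiment = c("positive", "negative", "positive", "negative"),
  n         = c(30, 10, 20, 60)
)

toy_percents <- toy_summary %>%
  group_by(method, standards) %>%
  mutate(total   = sum(n),            # total words per lexicon and standard
         percent = n / total * 100) %>%
  ungroup()
```

For NGSS, 30 of 40 matched words are positive, giving 75%. For CCSS, 20 of 80 are positive, giving 25%.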

4. Model

N/A

5. Communicate

Purpose:

This case study compares tweets about two sets of educational standards, the Next Generation Science Standards (NGSS) and the Common Core State Standards (CCSS), in order to examine public sentiment toward educational reform. It addresses two questions: What is the public sentiment expressed toward the NGSS? How does sentiment for the NGSS compare to sentiment for the CCSS?

Methods:

The data consisted of tweets collected about both the NGSS and the CCSS. From each dataset we kept who authored the tweet, when it was written, and the text of the message. We tokenized the text and removed stop words. We then used the AFINN, bing, NRC, and loughran sentiment lexicons to assign sentiment values to the words in the tweets, and compared the overall sentiment between the NGSS and CCSS tweets.

Findings:

Public sentiment about the NGSS was more positive than sentiment about the CCSS. Approximately one third of CCSS tweets had positive sentiment, while approximately one eighth of NGSS tweets were negative. The number of tweets for both standards rose over time and dropped at similar times, with the CCSS receiving more tweets.

Discussion:

This analysis offers insight into how the public reacts to educational reform, and specifically into public attitudes toward the NGSS. It could be improved by examining who is tweeting and what their overall sentiment is, how tweet volume over time relates to news articles posted or shared, and what might cause a sharp upturn or downturn in tweets.