
Title: Public Sentiment toward The Next Generation Science Standards on Twitter

Author: Ming Cai

1. PURPOSE

System-wide reform efforts in education in the United States are intrinsically challenging, and broad public support is likely to be a core component of their success.

While the Next Generation Science Standards (NGSS) are a long-standing and widespread standards-based educational reform effort, they have received less public attention, and no studies have explored the sentiment of multiple stakeholders toward them.

One way to understand how public sentiment about this reform might be similar to or different from past efforts is to pull public opinion from the Twitter API and evaluate it with a suite of data science techniques such as text (opinion) mining and sentiment analysis. We assessed public sentiment toward the NGSS by asking the following questions:

What is the public sentiment expressed toward the NGSS?
How does sentiment for NGSS compare to sentiment for CCSS?

2. METHOD

2a. Load Libraries

Let’s first load the packages that we will use to answer our questions.

library(dplyr)
library(readr)
library(tidyr)
library(rtweet)
library(writexl)
library(readxl)
library(tidytext)
library(textdata)
library(ggplot2)
library(scales)

2b. Read and Restructure Data

First, read the data into the R project.

ngss_tweets <- read_xlsx("data/ngss_tweets.xlsx")
ccss_tweets <- read_xlsx("data/csss_tweets.xlsx")

Now filter the rows to keep only English-language tweets, and save the result as ngss_text.

Next, select the following columns and assign the result back to ngss_text: 1. screen_name, the user who created the tweet; 2. created_at, the timestamp for examining changes in sentiment over time; 3. text, the content of the tweet, which is our primary data source of interest.

Create a new variable called standards to label each tweet as “ngss”, and move the standards column to the first position.

Then, following the same procedure, create a new ccss_text data frame from our ccss_tweets Common Core tweets.

Finally, combine ccss_text and ngss_text into a single data frame called tweets using bind_rows.

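ngss_text <-
  ngss_tweets %>%
  filter(lang == "en") %>%
  select(screen_name, created_at, text) %>%
  mutate(standards = "ngss") %>%
  relocate(standards)

# Same steps for the Common Core tweets, so that ccss_text exists before binding
ccss_text <-
  ccss_tweets %>%
  filter(lang == "en") %>%
  select(screen_name, created_at, text) %>%
  mutate(standards = "ccss") %>%
  relocate(standards)

tweets <- bind_rows(ngss_text, ccss_text)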

2c. Tidy Text format

We will tidy our text using the tidytext and dplyr packages, splitting the text into tokens to create a table with one token per row. The tokens are stored in a column called word. Another tidying step is to remove the most common stop words, such as a, the, is, and are.

Take a quick count of the most common words in the tidy_tweets data frame and remove uninformative words such as “amp”, “t3ic”, and “core”.

tweet_tokens <-
  tweets %>%
  unnest_tokens(output = word,
                input = text)

# Words that carry no useful meaning for this analysis
custom_stop_words <- c("amp", "t3ic", "https", "core", "common", "ngsschat",
                       "ngss_tweeps", "commoncore", "kindergarten",
                       "standards", "lol", "it’s", "i’m", "don’t",
                       "1", "2", "3", "4", "5", "20")

tidy_tweets <-
  tweet_tokens %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% custom_stop_words)
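
To make the tidying steps concrete, here is a minimal sketch on two made-up tweets. The toy_tweets data and the custom_words list are assumptions for illustration only and are not part of the study data.

```r
library(dplyr)
library(tidytext)

# Toy data: two made-up tweets (not from the study's data set)
toy_tweets <- tibble(
  standards = c("ngss", "ccss"),
  text = c("The NGSS are great for science", "Common Core math is confusing")
)

# Split each tweet into one lower-cased word per row
toy_tokens <- toy_tweets %>%
  unnest_tokens(output = word, input = text)

# Drop standard stop words, then drop a custom list of uninformative words
custom_words <- c("ngss", "common", "core")
toy_tidy <- toy_tokens %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% custom_words)

toy_tidy
```

Each row of toy_tidy is a single remaining word, still labeled with the standards it came from, which is the shape the later joins with the sentiment lexicons rely on.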

3. EXPLORE

In this section we explore word counts, starting with a word cloud of the tokenized words.

3a. Word Clouds

We can get a sense of the most common words in the combined set of tweets by looking at a word cloud of the word counts. Looking at the top 50 words, we can see that #common and #core are the most popular, followed by #math, #ngss, #student, #school and #science.

top_tokens <- tidy_tweets %>%
  ungroup() %>%  # ungroup the tokenized data to create a word cloud
  count(word, sort = TRUE) %>%
  top_n(50)

## Selecting by n

library(wordcloud2)
wordcloud2(top_tokens, size = 1,shape = 'star')

3b. Basic Bar Chart

The bar chart is the workhorse of data visualization and is effective for comparing two or more values. Given the nature of our tidy text data, however, our top_tokens data frame gives us 50 or more values (i.e., words and their counts) to compare.

top_tokens %>%
  filter(n > 50) %>% # keep rows with word counts greater than 50
  mutate(word = reorder(word, n)) %>% # reorder the word variable by n
  ggplot(aes(n, word)) + # create a plot with n on the x axis and word on the y axis
  geom_col(fill = "skyblue", just = 0.5) # make it a bar plot

3c. Time Series

Compare the number of tweets over time for the NGSS and CCSS using a time series.
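
One way this comparison could be sketched is shown below, using dplyr and ggplot2 on made-up timestamps. The toy_tweets values are assumptions for illustration; with the real data, the tweets data frame from section 2b (which still contains created_at at this point) would take their place.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the combined tweets data frame (values are made up)
toy_tweets <- tibble(
  standards = c("ngss", "ngss", "ccss", "ccss", "ccss"),
  created_at = as.POSIXct(
    c("2021-03-01 09:00:00", "2021-03-02 10:30:00",
      "2021-03-01 08:15:00", "2021-03-01 17:45:00", "2021-03-02 12:00:00"),
    tz = "UTC"
  )
)

# Count tweets per day for each set of standards
daily_counts <- toy_tweets %>%
  mutate(day = as.Date(created_at)) %>%
  count(standards, day)

# Draw one line per set of standards
time_plot <- ggplot(daily_counts, aes(x = day, y = n, colour = standards)) +
  geom_line() +
  labs(x = "Date", y = "Number of tweets", colour = "Standards")

time_plot
```

Counting by day is a design choice; a coarser unit such as week may give a smoother picture when tweets are sparse.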

4. SENTIMENT ANALYSIS

Since our primary goal is to compare public sentiment around the NGSS and CCSS, in this section we put together some basic numerical summaries using our different lexicons to see whether tweets are generally more positive or negative for each standard, as well as differences between the two.

Next, count and compare positive and negative sentiment between CCSS and NGSS in different lexicons.

Then, calculate a single sentiment “score” for our tweets that we can use for quick comparison and create a new variable indicating which lexicon we used.

Transform our sentiment column into separate columns for negative and positive that contain the n counts for each.

Create two new variables, sentiment and lexicon, so we have a single sentiment score and the lexicon from which it was derived.

We can see that CCSS scores negative, while NGSS is overall positive.

afinn <- get_sentiments("afinn")
nrc <- get_sentiments("nrc")
bing <- get_sentiments("bing")
loughran <- get_sentiments("loughran")

sentiment_afinn <- inner_join(tidy_tweets, afinn, by = "word")
sentiment_bing <- inner_join(tidy_tweets, bing, by = "word")
sentiment_nrc <- inner_join(tidy_tweets, nrc, by = "word")

## Warning in inner_join(tidy_tweets, nrc, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 23 of `x` matches multiple rows in `y`.
## ℹ Row 7509 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

sentiment_loughran <- inner_join(tidy_tweets, loughran, by = "word")

## Warning in inner_join(tidy_tweets, loughran, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2113 of `x` matches multiple rows in `y`.
## ℹ Row 2589 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
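
The many-to-many warnings above arise because a word can appear in many tweets and can also carry more than one label in a lexicon. A minimal sketch with made-up words and labels (toy_words and toy_lexicon are assumptions, not the real lexicons) shows the effect and how the relationship argument, available in dplyr 1.1.0 and later, declares it as expected:

```r
library(dplyr)

# Made-up tokens: "happy" occurs twice, as it might across two tweets
toy_words <- tibble(word = c("happy", "happy", "sad"))

# Made-up lexicon in which "happy" carries two labels
toy_lexicon <- tibble(
  word = c("happy", "happy", "sad"),
  sentiment = c("positive", "joy", "negative")
)

# Every occurrence of a word is paired with every label for that word
toy_joined <- inner_join(toy_words, toy_lexicon,
                         by = "word",
                         relationship = "many-to-many")

toy_joined
```

The joined table has more rows than toy_words, which is why counts based on such a join are counts of word-label pairs rather than of words.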

summary_afinn2 <- sentiment_afinn %>% 
  group_by(standards) %>% 
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive")) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "AFINN")

summary_bing2 <- sentiment_bing %>% 
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "bing")

summary_nrc2 <- sentiment_nrc %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "nrc") 

summary_loughran2 <- sentiment_loughran %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "loughran") 

summary_sentiment <- bind_rows(summary_afinn2,
                               summary_bing2,
                               summary_nrc2,
                               summary_loughran2) %>%
  arrange(method, standards) %>%
  relocate(method)
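
The single sentiment score described above can be sketched as follows. The counts in toy_summary are made up for illustration, and the choice of positive minus negative as the score is an assumption about how the score is defined.

```r
library(dplyr)
library(tidyr)

# Made-up positive and negative word counts for one lexicon
toy_summary <- tibble(
  standards = c("ccss", "ccss", "ngss", "ngss"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n = c(30, 20, 10, 40)
)

# Spread the counts into negative and positive columns, then compute
# a single score and record which lexicon it came from
toy_scores <- toy_summary %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(sentiment = positive - negative,
         lexicon = "bing")

toy_scores
```

A score below zero means negative words outnumber positive ones for that set of standards, and a score above zero means the reverse.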

4a. Compute the ratio

Now we’re ready to compute the ratio of negative to positive tweets. We first score each tweet, label it as positive or negative, and then spread the counts into separate positive and negative columns so we can perform a quick calculation.

We can see that positive tweets far outnumber negative tweets about the NGSS.

ngss_text <-
  ngss_tweets %>%
  filter(lang == "en") %>%
  select(status_id, text) %>%
  mutate(standards = "ngss") %>%
  relocate(standards)

ccss_text <-
  ccss_tweets %>%
  filter(lang == "en") %>%
  select(status_id, text) %>%
  mutate(standards = "ccss") %>%
  relocate(standards)

tweets <- bind_rows(ngss_text, ccss_text)

sentiment_afinn <- tweets %>%
  unnest_tokens(output = word, 
                input = text)  %>% 
  anti_join(stop_words, by = "word") %>%
  filter(!word == "amp") %>%
  inner_join(afinn, by = "word")

afinn_score <- sentiment_afinn %>% 
  group_by(standards, status_id) %>% 
  summarise(value = sum(value))

## `summarise()` has grouped output by 'standards'. You can override using the
## `.groups` argument.

afinn_sentiment <- afinn_score %>%
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive"))

afinn_ratio <- afinn_sentiment %>% 
  group_by(standards) %>% 
  count(sentiment) %>% 
  spread(sentiment, n) %>%
  mutate(ratio = negative/positive)

afinn_counts <- afinn_sentiment %>%
  group_by(standards) %>% 
  count(sentiment) %>%
  filter(standards == "ngss")

afinn_counts %>%
ggplot(aes(x="", y=n, fill=sentiment)) +
  geom_bar(width = .6, stat = "identity") +
  labs(title = "Next Gen Science Standards",
       subtitle = "Proportion of Positive & Negative Tweets") +
  coord_polar(theta = "y") +
  theme_void()
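
The tweet-level scoring behind this ratio can be traced by hand on a small example. The word values in toy_lexicon are hypothetical stand-ins for AFINN values, and toy_words is a made-up set of already tokenized tweets.

```r
library(dplyr)
library(tidyr)

# Hypothetical word scores standing in for the AFINN lexicon
toy_lexicon <- tibble(
  word = c("great", "love", "confusing", "hate"),
  value = c(3, 3, -2, -3)
)

# Made-up, already tokenized tweets: one row per word
toy_words <- tibble(
  standards = "ngss",
  status_id = c("a", "a", "b", "c", "c"),
  word = c("great", "confusing", "love", "hate", "confusing")
)

toy_ratio <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  group_by(standards, status_id) %>%
  summarise(value = sum(value), .groups = "drop") %>% # one score per tweet
  filter(value != 0) %>%                              # drop neutral tweets
  mutate(sentiment = if_else(value < 0, "negative", "positive")) %>%
  count(standards, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(ratio = negative / positive)

toy_ratio
```

Here tweet a scores 3 - 2 = 1 and tweet b scores 3, so both are positive, while tweet c scores -3 - 2 = -5 and is negative, giving a ratio of 1 negative to 2 positive, or 0.5. The sketch uses pivot_wider where the code above uses spread; both reshape the counts in the same way.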

4b. NGSS vs CCSS

We now compare the percentage of positive and negative words contained in the corpus of tweets for the NGSS and CCSS standards using the four different lexicons.

I’ll begin by polishing my previous summaries and creating identical summaries for each lexicon that contain the following columns: method, standards, sentiment, and n (the word counts).

Next, I’ll combine those four data frames using the bind_rows function again.

Then I’ll create a new data frame that has the total word counts for each set of standards and each method, and join that to my summary_sentiment data frame.

Furthermore, I’ll add a new column that calculates the percentage of positive and negative words for each set of state standards.

Finally, now that I have my sentiment percent summaries for each lexicon, I’m going to create 100% stacked bar charts for each lexicon.

The chart below illustrates that, regardless of the sentiment lexicon used, the NGSS tweets contain a higher percentage of positive words than the CCSS tweets.

summary_afinn2 <- sentiment_afinn %>% 
  group_by(standards) %>% 
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive")) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "AFINN")

summary_bing2 <- sentiment_bing %>% 
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "bing")

summary_nrc2 <- sentiment_nrc %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "nrc") 

summary_loughran2 <- sentiment_loughran %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(standards) %>% 
  count(sentiment, sort = TRUE) %>% 
  mutate(method = "loughran") 

summary_sentiment <- bind_rows(summary_afinn2,
                               summary_bing2,
                               summary_nrc2,
                               summary_loughran2) %>%
  arrange(method, standards) %>%
  relocate(method)

total_counts <- summary_sentiment %>%
  group_by(method, standards) %>%
  summarise(total = sum(n))

## `summarise()` has grouped output by 'method'. You can override using the
## `.groups` argument.

sentiment_counts <- left_join(summary_sentiment, total_counts)

## Joining with `by = join_by(method, standards)`

sentiment_percents <- sentiment_counts %>%
  mutate(percent = n/total * 100)

sentiment_percents %>%
  ggplot(aes(x = standards, y = percent, fill=sentiment)) +
  geom_bar(width = .8, stat = "identity") +
  facet_wrap(~method, ncol = 1) +
  coord_flip() +
  labs(title = "Public Sentiment on Twitter", 
       subtitle = "The Common Core & Next Gen Science Standards",
       x = "State Standards", 
       y = "Percentage of Words")

5. COMMUNICATE

Purpose: The purpose of this case study is to produce a sentiment analysis examining public sentiment on Twitter toward the NGSS compared with the CCSS.

Methods: For this independent analysis I explored tweet counts and time series, ran a sentiment analysis, computed the ratio of negative to positive tweets, and compared the percentage of positive and negative words for the NGSS and CCSS standards using four different lexicons.

Findings: (1) Twitter users talk more about the CCSS than the NGSS. (2) Positive tweets far outnumber negative tweets about the NGSS. (3) The NGSS tweets contain more positive words than the CCSS tweets across the different lexicons. As such, public sentiment toward the NGSS is more positive than toward the CCSS.

Discussion: We only studied posts to a single social media platform, Twitter. Examining sentiment through other media, notably Facebook, may lend insight, especially as different social media platforms can be characterized by different populations of users. Also, carrying out public sentiment analysis in real time may be more reliable. Moreover, future research is recommended to cover differences in perspectives about the NGSS across states.