Sentiment Analysis Badge

The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.

To earn a badge for each lab, you are required to respond to a set of prompts in two parts:

In Part I, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.
In Part II, you will create a simple data product in R that demonstrates your ability to apply a data analysis technique introduced in this learning lab.

Part I: Reflect and Plan

Use the institutional library (e.g. NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies text mining to an educational context or topic of interest. More specifically, locate a text mining study that visualizes text data.

Provide an APA citation for your selected study.
How does the sentiment analysis address research questions?

Draft a research question for a population you may be interested in studying, or that would be of interest to educational researchers, and that would require the collection of text data. Then answer the following questions. Research question: What is the level of adoption, by valence, of the CCSS standards by geographic area, based on tweet data?

What text data would need to be collected? The data to collect would be tweets that include hashtags pertaining to the CCSS standards.
For what reason would text data need to be collected in order to address this question?
Explain the analytical level at which these text data would need to be collected and analyzed.

Part II: Data Product

Use your case study file to create small multiples like the following figure:

I highly recommend creating a new R script in your lab-2 folder to complete this task. When your code is ready to share, use the code chunk below to share the final code for your model and answer the questions that follow.

library(forcats)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(tidyr)
library(rtweet)
library(writexl)
library(readxl)
library(tidytext)
library(textdata)
library(ggplot2)
library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:readr':
## 
##     col_factor

#read in data files
ngss_tweets <- read_xlsx("data/ngss_tweets.xlsx")
ccss_tweets <- read_xlsx("data/csss_tweets.xlsx")

#filtering for English only Tweets
ngss_text <- filter(ngss_tweets, lang == "en")
ccss_text <- filter(ccss_tweets, lang == "en")

#add column to identify standard category
ngss_text <- mutate(ngss_text, standards = "NGSS")
ccss_text <- mutate(ccss_text, standards = "CCSS")

#reorder the columns
ngss_text <- relocate(ngss_text, standards)
ccss_text <- relocate(ccss_text, standards)

#select desired variables
ngss_text <- select(ngss_text, standards, screen_name, created_at, text)
ccss_text <- select(ccss_text, standards, screen_name, created_at, text)

#combine datasets
tweets <- bind_rows(ngss_text, ccss_text)

#tokenize data
tweet_tokens <-
  tweets %>%
  unnest_tokens(output = word,
                input = text,
                token = "tweets")

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.

#remove stop words and remove amp
tidy_tweets <-
  tweet_tokens %>%
  anti_join(stop_words, by = "word") %>%
  filter(word != "amp")

#count the most common words
count_tweets <- count(tidy_tweets, word, sort = TRUE)

#get sentiments from AFINN, nrc, bing, loughran
afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
nrc <- get_sentiments("nrc")
loughran <- get_sentiments("loughran")

#join sentiments by word
sentiment_afinn <- inner_join(tidy_tweets, afinn, by = "word")
sentiment_bing <- inner_join(tidy_tweets, bing, by = "word")
sentiment_nrc <- inner_join(tidy_tweets, nrc, by = "word")
sentiment_loughran <- inner_join(tidy_tweets, loughran, by = "word")

#plot the number of tweets per day for each standard
tweets %>%
  group_by(standards) %>%
  ts_plot(by = "days")

#produce summaries
summary_bing <- count(sentiment_bing, sentiment, sort = TRUE)
summary_nrc <- count(sentiment_nrc, sentiment, sort = TRUE)
summary_afinn <- count(sentiment_afinn, value, sort = TRUE)
summary_loughran <- count(sentiment_loughran, sentiment, sort = TRUE)


#revise dataframes
summary_afinn2 <- sentiment_afinn %>%
  group_by(standards) %>%
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive")) %>%
  count(sentiment, sort = TRUE) %>%
  mutate(method = "afinn")

summary_bing2 <- sentiment_bing %>%
  group_by(standards) %>%
  count(sentiment, sort = TRUE) %>%
  mutate(method = "bing")

summary_nrc2 <- sentiment_nrc %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(standards) %>%
  count(sentiment, sort = TRUE) %>%
  mutate(method = "nrc")

summary_loughran2 <- sentiment_loughran %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(standards) %>%
  count(sentiment, sort = TRUE) %>%
  mutate(method = "loughran")


#view summaries
summary_bing2

## # A tibble: 4 × 4
## # Groups:   standards [2]
##   standards sentiment     n method
##   <chr>     <chr>     <int> <chr> 
## 1 CCSS      negative    914 bing  
## 2 CCSS      positive    437 bing  
## 3 NGSS      positive    226 bing  
## 4 NGSS      negative     60 bing

summary_nrc2

## # A tibble: 4 × 4
## # Groups:   standards [2]
##   standards sentiment     n method
##   <chr>     <chr>     <int> <chr> 
## 1 CCSS      positive   2198 nrc   
## 2 CCSS      negative    764 nrc   
## 3 NGSS      positive    542 nrc   
## 4 NGSS      negative     73 nrc

summary_afinn2

## # A tibble: 4 × 4
## # Groups:   standards [2]
##   standards sentiment     n method
##   <chr>     <chr>     <int> <chr> 
## 1 CCSS      negative    740 afinn 
## 2 CCSS      positive    468 afinn 
## 3 NGSS      positive    273 afinn 
## 4 NGSS      negative     39 afinn

summary_loughran2

## # A tibble: 4 × 4
## # Groups:   standards [2]
##   standards sentiment     n method  
##   <chr>     <chr>     <int> <chr>   
## 1 CCSS      negative    440 loughran
## 2 CCSS      positive    112 loughran
## 3 NGSS      negative     68 loughran
## 4 NGSS      positive     54 loughran

#join summaries

summary_sentiment <- bind_rows(summary_afinn2,
                              summary_bing2,
                              summary_nrc2,
                              summary_loughran2) %>%
  arrange(method, standards) %>%
  relocate(method)
summary_sentiment

## # A tibble: 16 × 4
## # Groups:   standards [2]
##    method   standards sentiment     n
##    <chr>    <chr>     <chr>     <int>
##  1 afinn    CCSS      negative    740
##  2 afinn    CCSS      positive    468
##  3 afinn    NGSS      positive    273
##  4 afinn    NGSS      negative     39
##  5 bing     CCSS      negative    914
##  6 bing     CCSS      positive    437
##  7 bing     NGSS      positive    226
##  8 bing     NGSS      negative     60
##  9 loughran CCSS      negative    440
## 10 loughran CCSS      positive    112
## 11 loughran NGSS      negative     68
## 12 loughran NGSS      positive     54
## 13 nrc      CCSS      positive   2198
## 14 nrc      CCSS      negative    764
## 15 nrc      NGSS      positive    542
## 16 nrc      NGSS      negative     73

#create totals and then join summary and total counts to prepare percentages
total_counts <- summary_sentiment %>%
  group_by(method, standards) %>%
  summarise(total = sum(n))

## `summarise()` has grouped output by 'method'. You can override using the
## `.groups` argument.

sentiment_counts <- left_join(summary_sentiment, total_counts)

## Joining, by = c("method", "standards")

sentiment_percents <- sentiment_counts %>%
  mutate(percent = n/total * 100)

sentiment_percents

## # A tibble: 16 × 6
## # Groups:   standards [2]
##    method   standards sentiment     n total percent
##    <chr>    <chr>     <chr>     <int> <int>   <dbl>
##  1 afinn    CCSS      negative    740  1208    61.3
##  2 afinn    CCSS      positive    468  1208    38.7
##  3 afinn    NGSS      positive    273   312    87.5
##  4 afinn    NGSS      negative     39   312    12.5
##  5 bing     CCSS      negative    914  1351    67.7
##  6 bing     CCSS      positive    437  1351    32.3
##  7 bing     NGSS      positive    226   286    79.0
##  8 bing     NGSS      negative     60   286    21.0
##  9 loughran CCSS      negative    440   552    79.7
## 10 loughran CCSS      positive    112   552    20.3
## 11 loughran NGSS      negative     68   122    55.7
## 12 loughran NGSS      positive     54   122    44.3
## 13 nrc      CCSS      positive   2198  2962    74.2
## 14 nrc      CCSS      negative    764  2962    25.8
## 15 nrc      NGSS      positive    542   615    88.1
## 16 nrc      NGSS      negative     73   615    11.9

#visualization bar chart
sentiment_percents %>%
  ggplot(aes(x = standards, y = percent, fill=sentiment)) +
  geom_bar(width = .8, stat = "identity") +
  facet_wrap(~method, ncol = 1) +
  coord_flip() +
  labs(title = "Public Sentiment on Twitter",
       subtitle = "The Common Core & Next Gen Science Standards",
       x = "State Standards",
       y = "Percentage of Words")

#Small multiples visualization with stacked bars in a green palette
sentiment_percents %>%
  ggplot(aes(standards, percent, fill = percent)) +
  theme_minimal() +
  scale_fill_distiller(palette = "Greens") +
  labs(title = "Comparison of positive sentiment in NGSS and CCSS Tweets") +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 2, scales = "free")

#Small multiples visualization with dodged (not stacked) bars
ggplot(sentiment_percents, aes(standards, percent, fill=sentiment)) +
  geom_bar(stat='identity', position = "dodge") +
  theme(legend.position="none", plot.title=element_text(hjust=0.5)) +
  labs(title="Comparison of positive sentiment in NGSS and CCSS Tweets",
       x="Standards",
       y="Percent") +
  facet_wrap(~method, ncol = 2, scales = "free")
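
The join-then-percentage steps above can be checked by hand on a tiny example. The sketch below is only an illustration: `toy_tokens` and `toy_lexicon` are hypothetical stand-ins invented here, not the tweet data or any of the real lexicons, and the words in them are made up.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tokens: one row per word, tagged with its standards group
toy_tokens <- tibble(
  standards = c("CCSS", "CCSS", "CCSS", "CCSS", "NGSS", "NGSS", "NGSS"),
  word      = c("hate", "love", "bad", "math", "love", "great", "science")
)

# Hypothetical mini lexicon standing in for bing, afinn, nrc, or loughran
toy_lexicon <- tibble(
  word      = c("love", "great", "hate", "bad"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# inner_join() keeps only the words that appear in the lexicon,
# so "math" and "science" drop out at this step
toy_sentiment <- inner_join(toy_tokens, toy_lexicon, by = "word")

# Count matched words per standards x sentiment, then express each count
# as a percentage of the matched words within the same standards group
toy_percents <- toy_sentiment %>%
  count(standards, sentiment) %>%
  group_by(standards) %>%
  mutate(total = sum(n),
         percent = n / total * 100) %>%
  ungroup()

toy_percents
```

Note that each percentage is taken over the words a lexicon matched, not over all words in the tweets. This is why the totals in sentiment_percents differ by method (for example, 1208 matched CCSS words for afinn versus 2962 for nrc).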

Knit & Submit

Congratulations, you’ve completed your Sentiment Analysis Badge! Complete the following steps to submit your work for review:

Change the name in the author: field of the YAML header at the very top of this document to your name. As noted in Reproducible Research in R, the YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.
Click the yarn icon above to “knit” your data product to an HTML file that will be saved in your R Project folder.
Commit your changes in GitHub Desktop and push them to your online GitHub repository.
Publish your HTML page to the web using one of the following publishing methods:
- Publish on RPubs by clicking the “Publish” button located in the Viewer Pane when you knit your document. Note that you will first need to create an RPubs account.
- Publish on GitHub using either GitHub Pages or the HTML previewer.
Post a new discussion on GitHub to our Text mining Badges forum. In your post, include a link to your published web page and a short reflection highlighting one thing you learned from this lab and one thing you’d like to explore further.

LASER Institute TM Learning Lab 2

Dr. Tracy Arner

July 14, 2022
