###1. SET UP
Set up a new project, open up a new R script, and load the following packages that we'll need for this walkthrough:
library(dplyr)
library(readr)
library(tidyr)
library(rtweet)
library(writexl)
library(readxl)
library(tidytext)
library(textdata)
library(ggplot2)
library(scales)
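If any of these packages are not yet installed on your machine, you can install them from CRAN first; for example (a one-time step, assuming the CRAN versions work for your setup):

install.packages(c("dplyr", "readr", "tidyr", "rtweet", "writexl",
                   "readxl", "tidytext", "textdata", "ggplot2", "scales"))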
###2. WRANGLE
In this section, we'll use the:

- rtweet package and some key functions to search for tweets or users of interest.
- tidytext package to both “tidy” and tokenize our tweets in order to create our data frame for analysis.
- inner_join() function for appending sentiment values to our data frame.

If you have a Twitter developer account, the commented-out code below shows how the tweets were originally collected and saved:

# ngss_or_tweets <- search_tweets(q = "#NGSSchat OR ngss",
# n=5000,
# include_rts = FALSE)
# ngss_tweets <- search_tweets2(c("#NGSSchat OR ngss",
# '"next generation science standard"',
# '"next generation science standards"',
# '"next gen science standard"',
# '"next gen science standards"'
# ),
# n=5000,
# include_rts = FALSE)
# ngss_dictionary <- c("#NGSSchat OR ngss",
# '"next generation science standard"',
# '"next generation science standards"',
# '"next gen science standard"',
# '"next gen science standards"')
#
# ngss_tweets <- search_tweets2(ngss_dictionary,
# n=5000,
# include_rts = FALSE)
# ccss_dictionary <- c("#commoncore", '"common core"')
#
# ccss_tweets <- ccss_dictionary %>%
# search_tweets2(n=5000, include_rts = FALSE)
# write_xlsx(ngss_tweets, "data/ngss_tweets.xlsx")
# write_xlsx(ccss_tweets, "data/csss_tweets.xlsx")
dplyr functions:

- select() picks variables based on their names.
- slice() lets you select, remove, and duplicate rows.
- rename() changes the names of individual variables using new_name = old_name syntax.
- filter() picks cases, or rows, based on their values in a specified column.

tidytext functions:

- unnest_tokens() splits a column into tokens.
- anti_join() returns all rows from x without a match in y.

ATTENTION: For those of you who do not have Twitter developer accounts, you will need to read in the Excel files shared in our GitHub repository.
We’ll use the readxl package highlighted in Lab 1 and
the read_xlsx() function to read in the data stored in the
data folder of our R project:
ngss_tweets <- read_xlsx("data/ngss_tweets.xlsx")
ccss_tweets <- read_xlsx("data/csss_tweets.xlsx")
First, use the filter() function to subset rows containing only tweets in the English language. Second, select() the following columns from our new ngss_text data frame:

1. screen_name of the user who created the tweet
2. created_at timestamp for examining changes in sentiment over time
3. text containing the tweet, which is our primary data source of interest

Third, use the mutate() function to create a new variable called standards to label each tweet as "ngss". Fourth, use the relocate() function to move the standards column to the first position:
ngss_text <-
ngss_tweets %>%
filter(lang == "en") %>%
select(screen_name, created_at, text) %>%
mutate(standards = "ngss") %>%
relocate(standards)
Now create a ccss_text data frame for our ccss_tweets Common Core tweets by modifying the code above; a minimal sketch of that modification is included below in case you get stuck.
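Here is one way that modification might look, assuming ccss_tweets has the same columns as ngss_tweets:

ccss_text <-
  ccss_tweets %>%
  filter(lang == "en") %>%
  select(screen_name, created_at, text) %>%
  mutate(standards = "ccss") %>%
  relocate(standards)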
Then combine the two data frames into a single tweets data frame with bind_rows():

tweets <- bind_rows(ngss_text, ccss_text)
Take a quick look at both the head() and the
tail() of this new tweets data frame to make
sure it contains both “ngss” and “ccss” standards:
head(tweets)
## # A tibble: 6 × 4
## standards screen_name created_at text
## <chr> <chr> <dttm> <chr>
## 1 ngss loyr2662 2021-02-27 17:33:27 "Switching gears for a bit for the…
## 2 ngss loyr2662 2021-02-20 20:02:37 "Was just introduced to the Engine…
## 3 ngss Furlow_teach 2021-02-27 17:03:23 "@IBchemmilam @chemmastercorey I’m…
## 4 ngss Furlow_teach 2021-02-27 14:41:01 "@IBchemmilam @chemmastercorey How…
## 5 ngss TdiShelton 2021-02-27 14:17:34 "I am so honored and appreciative …
## 6 ngss TdiShelton 2021-02-27 15:49:17 "Thank you @brian_womack I loved c…
tail(tweets)
## # A tibble: 6 × 4
## standards screen_name created_at text
## <chr> <chr> <dttm> <chr>
## 1 ccss JosiePaul8807 2021-02-20 00:34:53 "@SenatorHick You realize science…
## 2 ccss ctwittnc 2021-02-19 23:44:18 "@winningatmylife I’ll bet none o…
## 3 ccss the_rbeagle 2021-02-19 23:27:06 "@dmarush @electronlove @Montgome…
## 4 ccss silea 2021-02-19 23:11:21 "@LizerReal I don’t think that’s …
## 5 ccss JodyCoyote12 2021-02-19 22:58:25 "@CarlaRK3 @NedLamont Fully fund …
## 6 ccss Ryan_Hawes 2021-02-19 22:41:01 "I just got an \"explainer\" on h…
Next, use unnest_tokens() to split each tweet into individual words, then use anti_join() with the stop_words dictionary to remove common stop words, along with a couple of leftover tokens (such as "amp", the remnant of HTML-encoded ampersands) that carry no meaning:

tweet_tokens <-
  tweets %>%
  unnest_tokens(output = word,
                input = text)

tidy_tweets <-
  tweet_tokens %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word == "amp") %>%
  filter(!word == "t3ic")
The three general-purpose lexicons we'll focus on are:

- AFINN assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
- bing categorizes words in a binary fashion into positive and negative categories.
- nrc categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
afinn <- get_sentiments("afinn")
afinn
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ℹ 2,467 more rows
bing <- get_sentiments("bing")
bing
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
nrc <- get_sentiments("nrc")
nrc
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ℹ 13,862 more rows
And just out of curiosity, let’s take a look at the
loughran lexicon as well:
loughran <- get_sentiments("loughran")
loughran
## # A tibble: 4,150 × 2
## word sentiment
## <chr> <chr>
## 1 abandon negative
## 2 abandoned negative
## 3 abandoning negative
## 4 abandonment negative
## 5 abandonments negative
## 6 abandons negative
## 7 abdicated negative
## 8 abdicates negative
## 9 abdicating negative
## 10 abdication negative
## # ℹ 4,140 more rows
Let’s use inner_join() to combine our two
tidy_tweets and afinn data frames, keeping
only rows with matching data in the word column:
sentiment_afinn <- inner_join(tidy_tweets, afinn, by = "word")
sentiment_afinn
## # A tibble: 1,540 × 5
## standards screen_name created_at word value
## <chr> <chr> <dttm> <chr> <dbl>
## 1 ngss loyr2662 2021-02-27 17:33:27 win 4
## 2 ngss Furlow_teach 2021-02-27 17:03:23 love 3
## 3 ngss Furlow_teach 2021-02-27 17:03:23 sweet 2
## 4 ngss Furlow_teach 2021-02-27 17:03:23 significance 1
## 5 ngss TdiShelton 2021-02-27 14:17:34 honored 2
## 6 ngss TdiShelton 2021-02-27 14:17:34 opportunity 2
## 7 ngss TdiShelton 2021-02-27 14:17:34 wonderful 4
## 8 ngss TdiShelton 2021-02-27 14:17:34 powerful 2
## 9 ngss TdiShelton 2021-02-27 15:49:17 loved 3
## 10 ngss TdiShelton 2021-02-27 16:51:32 share 1
## # ℹ 1,530 more rows
Notice that each word in your sentiment_afinn data frame
now contains a value ranging from -5 (very negative) to 5 (very
positive).
sentiment_bing <- inner_join(tidy_tweets, bing, by = "word")
sentiment_bing
## # A tibble: 1,668 × 5
## standards screen_name created_at word sentiment
## <chr> <chr> <dttm> <chr> <chr>
## 1 ngss loyr2662 2021-02-27 17:33:27 win positive
## 2 ngss Furlow_teach 2021-02-27 17:03:23 love positive
## 3 ngss Furlow_teach 2021-02-27 17:03:23 helped positive
## 4 ngss Furlow_teach 2021-02-27 17:03:23 sweet positive
## 5 ngss Furlow_teach 2021-02-27 17:03:23 tough positive
## 6 ngss TdiShelton 2021-02-27 14:17:34 honored positive
## 7 ngss TdiShelton 2021-02-27 14:17:34 appreciative positive
## 8 ngss TdiShelton 2021-02-27 14:17:34 wonderful positive
## 9 ngss TdiShelton 2021-02-27 14:17:34 powerful positive
## 10 ngss TdiShelton 2021-02-27 15:49:17 loved positive
## # ℹ 1,658 more rows
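The corresponding data frames for the nrc and loughran lexicons can be built the same way. The code for those joins isn't shown above, but it follows the same pattern and is what produces the many-to-many warnings below:

sentiment_nrc <- inner_join(tidy_tweets, nrc, by = "word")

sentiment_loughran <- inner_join(tidy_tweets, loughran, by = "word")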
## Warning in inner_join(tidy_tweets, nrc, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 24 of `x` matches multiple rows in `y`.
## ℹ Row 7509 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## Warning in inner_join(tidy_tweets, loughran, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2297 of `x` matches multiple rows in `y`.
## ℹ Row 2589 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
Before we dig into sentiment, let’s use the handy
ts_plot function built into rtweet to take a
very quick look at how far back our tidied tweets data set
goes:
ts_plot(tweets, by = "days")
Now try combining ts_plot() with the group_by() function to compare the number of tweets over time by the Next Gen and Common Core standards. Hint: use the ?ts_plot help page and check the examples to see how this can be done.
Your line graph should look something like this:
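If you'd like to check your approach, here is one minimal sketch based on the grouped-data-frame pattern shown in the ?ts_plot examples:

tweets %>%
  group_by(standards) %>%
  ts_plot(by = "days")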
Revisit the following dplyr functions:

- count() lets you quickly count the unique values of one or more variables.
- group_by() takes a data frame and one or more variables to group by.
- summarise() creates a numerical summary of data using arguments like mean() and median().
- mutate() adds new variables and preserves existing ones.

And introduce one new function:

- spread() from the tidyr package, which spreads a key-value pair across multiple columns.

First, group the bing sentiment data frame by standards and count the positive and negative words:

summary_bing <- sentiment_bing %>%
  group_by(standards) %>%
  count(sentiment)
summary_bing
## # A tibble: 4 × 3
## # Groups: standards [2]
## standards sentiment n
## <chr> <chr> <int>
## 1 ccss negative 926
## 2 ccss positive 446
## 3 ngss negative 66
## 4 ngss positive 230
summary_bing <- sentiment_bing %>%
group_by(standards) %>%
count(sentiment, sort = TRUE) %>%
spread(sentiment, n)
summary_bing
## # A tibble: 2 × 3
## # Groups: standards [2]
## standards negative positive
## <chr> <int> <int>
## 1 ccss 926 446
## 2 ngss 66 230
Finally, we’ll use the mutate function to create two new
variables: sentiment and lexicon so we have a
single sentiment score and the lexicon from which it was derived:
summary_bing <- sentiment_bing %>%
group_by(standards) %>%
count(sentiment, sort = TRUE) %>%
spread(sentiment, n) %>%
mutate(sentiment = positive - negative) %>%
mutate(lexicon = "bing") %>%
relocate(lexicon)
summary_bing
## # A tibble: 2 × 5
## # Groups: standards [2]
## lexicon standards negative positive sentiment
## <chr> <chr> <int> <int> <int>
## 1 bing ccss 926 446 -480
## 2 bing ngss 66 230 164
There we go: now we can see that CCSS scores negative overall, while NGSS is overall positive.
Let's calculate a quick score using the afinn lexicon now. Remember that AFINN provides a value from -5 to 5 for each word:
head(sentiment_afinn)
## # A tibble: 6 × 5
## standards screen_name created_at word value
## <chr> <chr> <dttm> <chr> <dbl>
## 1 ngss loyr2662 2021-02-27 17:33:27 win 4
## 2 ngss Furlow_teach 2021-02-27 17:03:23 love 3
## 3 ngss Furlow_teach 2021-02-27 17:03:23 sweet 2
## 4 ngss Furlow_teach 2021-02-27 17:03:23 significance 1
## 5 ngss TdiShelton 2021-02-27 14:17:34 honored 2
## 6 ngss TdiShelton 2021-02-27 14:17:34 opportunity 2
To calculate a summary score, we will need to first group our data by standards again and then use the summarise() function to create a new sentiment variable by adding all the positive and negative scores in the value column:
summary_afinn <- sentiment_afinn %>%
group_by(standards) %>%
summarise(sentiment = sum(value)) %>%
mutate(lexicon = "AFINN") %>%
relocate(lexicon)
summary_afinn
## # A tibble: 2 × 3
## lexicon standards sentiment
## <chr> <chr> <dbl>
## 1 AFINN ccss -808
## 2 AFINN ngss 503
Again, CCSS is overall negative while NGSS is overall positive!
For your final task in this walkthrough, calculate a single sentiment score for NGSS and CCSS using the remaining nrc and loughran lexicons and answer the following questions. Are the findings above still consistent?
Hint: The nrc lexicon contains "positive" and "negative" values just like bing and loughran, but also includes values like "trust" and "sadness" as shown below. You will need to use the filter() function to select only the rows that contain "positive" and "negative."
nrc
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ℹ 13,862 more rows
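For example, the hinted filter step might look something like this (a sketch; the rest of the summary follows the same pattern as the bing and AFINN summaries above):

nrc_pos_neg <- nrc %>%
  filter(sentiment %in% c("positive", "negative"))

The output below shows the kind of summaries you should end up with: an nrc summary (where the sentiment column appears to be the ratio of positive to negative words), followed by the earlier AFINN summary for comparison.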
## # A tibble: 2 × 5
## # Groups: standards [2]
## standards method negative positive sentiment
## <chr> <chr> <int> <int> <dbl>
## 1 ccss nrc 766 2296 3.00
## 2 ngss nrc 79 571 7.23
## # A tibble: 2 × 3
## lexicon standards sentiment
## <chr> <chr> <dbl>
## 1 AFINN ccss -808
## 2 AFINN ngss 503
As highlighted in Chapter 3 of Data Science in Education Using R, the Model step of the data science process entails “using statistical models, from simple to complex, to understand trends and patterns in the data.” The authors note that while descriptive statistics and data visualization during the Explore step can help us to identify patterns and relationships in our data, statistical models can be used to help us determine if relationships, patterns and trends are actually meaningful.
Recall from the PREPARE section that the Rosenberg et al. study was guided by the following questions:
Similar to our sentiment summary using the AFINN lexicon, the Rosenberg et al. study used the -5 to 5 sentiment score from the SentiStrength lexicon to answer RQ #1. To address the remaining questions, the authors used a mixed effects model (also known as a multi-level or hierarchical linear model) via the lme4 package in R.
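For the curious, a mixed effects model of this kind can be fit in R with lme4's lmer() function. The sketch below is purely illustrative and commented out; the variable names are hypothetical and are not the ones used by Rosenberg et al.:

# library(lme4)
#
# # Hypothetical illustration: tweet-level sentiment predicted by the standards a
# # tweet references, with random intercepts for each user.
# m <- lmer(sentiment_score ~ standards + (1 | screen_name), data = tweet_scores)
# summary(m)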
Collectively, the authors found that:
The final(ish) step in our workflow/process is sharing the results of our analysis with a wider audience. Krumm et al. (2018) outlined the following 3-step process for communicating with education stakeholders what you have learned through analysis:
Remember that the questions of interest we want to focus on for our selection, polishing, and narration include:
To address questions 1 and 2, I'm going to focus my analyses, data products, and sharing format on the following:

- Using the bing, nrc, and loughran lexicons, I'll create some 100% stacked bars showing the percentage of positive and negative words among all tweets for the NGSS and CCSS.
- I want to replicate as closely as possible the approach Rosenberg et al. used in their analysis. To do that, I can recycle some R code I used in section 2b. Tidy Text.
To polish my analyses, I first need to rebuild the tweets dataset from my ngss_tweets and ccss_tweets, this time selecting both the status_id, which is unique to each tweet, and the text column, which contains the actual post:
ngss_text <-
ngss_tweets %>%
filter(lang == "en") %>%
select(status_id, text) %>%
mutate(standards = "ngss") %>%
relocate(standards)
ccss_text <-
ccss_tweets %>%
filter(lang == "en") %>%
select(status_id, text) %>%
mutate(standards = "ccss") %>%
relocate(standards)
tweets <- bind_rows(ngss_text, ccss_text)
tweets
## # A tibble: 1,441 × 3
## standards status_id text
## <chr> <chr> <chr>
## 1 ngss 1365716690336645124 "Switching gears for a bit for the \"Crosscutt…
## 2 ngss 1363217513761415171 "Was just introduced to the Engineering Habits…
## 3 ngss 1365709122763653133 "@IBchemmilam @chemmastercorey I’m familiar w/…
## 4 ngss 1365673294360420353 "@IBchemmilam @chemmastercorey How well does t…
## 5 ngss 1365667393188601857 "I am so honored and appreciative to have an o…
## 6 ngss 1365690477266284545 "Thank you @brian_womack I loved connecting wi…
## 7 ngss 1365706140496130050 "Please share #NGSSchat PLN! https://t.co/Qc2c…
## 8 ngss 1363669328147677189 "So excited about this weekend’s learning... p…
## 9 ngss 1365442786544214019 "The Educators Evaluating the Quality of Instr…
## 10 ngss 1364358149164175362 "Foster existing teacher social networks that …
## # ℹ 1,431 more rows
The status_id is important because like Rosenberg et
al., I want to calculate an overall sentiment score for each tweet,
rather than for each word.
Before I get that far however, I’ll need to tidy my
tweets again and attach my sentiment
scores.
Note that the closest lexicon we have available in our
tidytext package to the SentiStrength lexicon used by
Rosenberg is the AFINN lexicon which also uses a -5 to 5 point
scale.
So let’s use unnest_tokens to tidy our tweets, remove
stop words, and add afinn scores to each word similar to
what we did in section 2c. Add
Sentiment Values:
sentiment_afinn <- tweets %>%
unnest_tokens(output = word,
input = text) %>%
anti_join(stop_words, by = "word") %>%
filter(!word == "amp") %>%
inner_join(afinn, by = "word")
sentiment_afinn
## # A tibble: 1,540 × 4
## standards status_id word value
## <chr> <chr> <chr> <dbl>
## 1 ngss 1365716690336645124 win 4
## 2 ngss 1365709122763653133 love 3
## 3 ngss 1365709122763653133 sweet 2
## 4 ngss 1365709122763653133 significance 1
## 5 ngss 1365667393188601857 honored 2
## 6 ngss 1365667393188601857 opportunity 2
## 7 ngss 1365667393188601857 wonderful 4
## 8 ngss 1365667393188601857 powerful 2
## 9 ngss 1365690477266284545 loved 3
## 10 ngss 1365706140496130050 share 1
## # ℹ 1,530 more rows
Next, I want to calculate a single score for each tweet. To do that, I'll use the now-familiar group_by() and summarise() functions:
afinn_score <- sentiment_afinn %>%
group_by(standards, status_id) %>%
summarise(value = sum(value))
afinn_score
## # A tibble: 857 × 3
## # Groups: standards [2]
## standards status_id value
## <chr> <chr> <dbl>
## 1 ccss 1362894990813188096 2
## 2 ccss 1362899370199445508 4
## 3 ccss 1362906588021989376 -2
## 4 ccss 1362910494487535618 -9
## 5 ccss 1362910913855160320 -1
## 6 ccss 1362928225379250179 2
## 7 ccss 1362933982074073090 -1
## 8 ccss 1362947497258151945 -3
## 9 ccss 1362949805694013446 3
## 10 ccss 1362970614282264583 3
## # ℹ 847 more rows
And like Rosenberg et al., I'll flag whether each tweet is "positive" or "negative" by using the mutate() function to create a new sentiment column.
To do this, we introduce the if_else() function from the dplyr package. This if_else() call adds "negative" to the sentiment column if the score in the value column of the corresponding row is less than 0; if not, it adds "positive".
afinn_sentiment <- afinn_score %>%
filter(value != 0) %>%
mutate(sentiment = if_else(value < 0, "negative", "positive"))
afinn_sentiment
## # A tibble: 820 × 4
## # Groups: standards [2]
## standards status_id value sentiment
## <chr> <chr> <dbl> <chr>
## 1 ccss 1362894990813188096 2 positive
## 2 ccss 1362899370199445508 4 positive
## 3 ccss 1362906588021989376 -2 negative
## 4 ccss 1362910494487535618 -9 negative
## 5 ccss 1362910913855160320 -1 negative
## 6 ccss 1362928225379250179 2 positive
## 7 ccss 1362933982074073090 -1 negative
## 8 ccss 1362947497258151945 -3 negative
## 9 ccss 1362949805694013446 3 positive
## 10 ccss 1362970614282264583 3 positive
## # ℹ 810 more rows
Note that since a tweet sentiment score equal to 0 is neutral, I used the filter() function to remove those tweets from the dataset.
Finally, we’re ready to compute our ratio. We’ll use the
group_by function and count the number of
tweets for each of the standards that are positive or
negative in the sentiment column. Then we’ll use the
spread function to separate them out into separate columns
so we can perform a quick calculation to compute the
ratio.
afinn_ratio <- afinn_sentiment %>%
group_by(standards) %>%
count(sentiment) %>%
spread(sentiment, n) %>%
mutate(ratio = negative/positive)
afinn_ratio
## # A tibble: 2 × 4
## # Groups: standards [2]
## standards negative positive ratio
## <chr> <int> <int> <dbl>
## 1 ccss 421 211 2.00
## 2 ngss 21 167 0.126
Finally, I'll filter for just the NGSS tweets and use ggplot2 to create a quick chart showing the proportion of positive and negative tweets:
afinn_counts <- afinn_sentiment %>%
group_by(standards) %>%
count(sentiment) %>%
filter(standards == "ngss")
afinn_counts %>%
ggplot(aes(x="", y=n, fill=sentiment)) +
geom_bar(width = .6, stat = "identity") +
labs(title = "Next Gen Science Standards",
subtitle = "Proportion of Positive & Negative Tweets") +
coord_polar(theta = "y") +
theme_void()
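The same approach can be reused for the Common Core tweets by switching the filter; a quick sketch (the chart title here is just illustrative):

afinn_counts_ccss <- afinn_sentiment %>%
  group_by(standards) %>%
  count(sentiment) %>%
  filter(standards == "ccss")

afinn_counts_ccss %>%
  ggplot(aes(x = "", y = n, fill = sentiment)) +
  geom_bar(width = .6, stat = "identity") +
  labs(title = "Common Core State Standards",
       subtitle = "Proportion of Positive & Negative Tweets") +
  coord_polar(theta = "y") +
  theme_void()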
Finally, to address Question 2, I want to compare the percentage of positive and negative words contained in the corpus of tweets for the NGSS and CCSS standards using the four different lexicons to see how sentiment compares based on lexicon used.
I’ll begin by polishing my previous summaries and creating identical
summaries for each lexicon that contains the following columns:
method, standards, sentiment, and
n, or word counts:
summary_afinn2 <- sentiment_afinn %>%
group_by(standards) %>%
filter(value != 0) %>%
mutate(sentiment = if_else(value < 0, "negative", "positive")) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "AFINN")
summary_bing2 <- sentiment_bing %>%
group_by(standards) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "bing")
summary_nrc2 <- sentiment_nrc %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(standards) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "nrc")
summary_loughran2 <- sentiment_loughran %>%
filter(sentiment %in% c("positive", "negative")) %>%
group_by(standards) %>%
count(sentiment, sort = TRUE) %>%
mutate(method = "loughran")
Next, I’ll combine those four data frames together using the
bind_rows function again:
summary_sentiment <- bind_rows(summary_afinn2,
summary_bing2,
summary_nrc2,
summary_loughran2) %>%
arrange(method, standards) %>%
relocate(method)
summary_sentiment
## # A tibble: 16 × 4
## # Groups: standards [2]
## method standards sentiment n
## <chr> <chr> <chr> <int>
## 1 AFINN ccss negative 740
## 2 AFINN ccss positive 477
## 3 AFINN ngss positive 278
## 4 AFINN ngss negative 45
## 5 bing ccss negative 926
## 6 bing ccss positive 446
## 7 bing ngss positive 230
## 8 bing ngss negative 66
## 9 loughran ccss negative 433
## 10 loughran ccss positive 112
## 11 loughran ngss negative 73
## 12 loughran ngss positive 57
## 13 nrc ccss positive 2296
## 14 nrc ccss negative 766
## 15 nrc ngss positive 571
## 16 nrc ngss negative 79
Then I’ll create a new data frame that has the total word counts for
each set of standards and each method and join that to my
summary_sentiment data frame:
total_counts <- summary_sentiment %>%
group_by(method, standards) %>%
summarise(total = sum(n))
## `summarise()` has grouped output by 'method'. You can override using the
## `.groups` argument.
sentiment_counts <- left_join(summary_sentiment, total_counts)
## Joining with `by = join_by(method, standards)`
sentiment_counts
## # A tibble: 16 × 5
## # Groups: standards [2]
## method standards sentiment n total
## <chr> <chr> <chr> <int> <int>
## 1 AFINN ccss negative 740 1217
## 2 AFINN ccss positive 477 1217
## 3 AFINN ngss positive 278 323
## 4 AFINN ngss negative 45 323
## 5 bing ccss negative 926 1372
## 6 bing ccss positive 446 1372
## 7 bing ngss positive 230 296
## 8 bing ngss negative 66 296
## 9 loughran ccss negative 433 545
## 10 loughran ccss positive 112 545
## 11 loughran ngss negative 73 130
## 12 loughran ngss positive 57 130
## 13 nrc ccss positive 2296 3062
## 14 nrc ccss negative 766 3062
## 15 nrc ngss positive 571 650
## 16 nrc ngss negative 79 650
Finally, I'll add a new column that calculates the percentage of positive and negative words for each set of state standards:
sentiment_percents <- sentiment_counts %>%
mutate(percent = n/total * 100)
sentiment_percents
## # A tibble: 16 × 6
## # Groups: standards [2]
## method standards sentiment n total percent
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 AFINN ccss negative 740 1217 60.8
## 2 AFINN ccss positive 477 1217 39.2
## 3 AFINN ngss positive 278 323 86.1
## 4 AFINN ngss negative 45 323 13.9
## 5 bing ccss negative 926 1372 67.5
## 6 bing ccss positive 446 1372 32.5
## 7 bing ngss positive 230 296 77.7
## 8 bing ngss negative 66 296 22.3
## 9 loughran ccss negative 433 545 79.4
## 10 loughran ccss positive 112 545 20.6
## 11 loughran ngss negative 73 130 56.2
## 12 loughran ngss positive 57 130 43.8
## 13 nrc ccss positive 2296 3062 75.0
## 14 nrc ccss negative 766 3062 25.0
## 15 nrc ngss positive 571 650 87.8
## 16 nrc ngss negative 79 650 12.2
Now that I have my sentiment percent summaries for each lexicon, I'm going to create my 100% stacked bar charts:
sentiment_percents %>%
ggplot(aes(x = standards, y = percent, fill=sentiment)) +
geom_bar(width = .8, stat = "identity") +
facet_wrap(~method, ncol = 1) +
coord_flip() +
labs(title = "Public Sentiment on Twitter",
subtitle = "The Common Core & Next Gen Science Standards",
x = "State Standards",
y = "Percentage of Words")
And finished! The chart above clearly illustrates that, regardless of the sentiment lexicon used, the NGSS tweets contain a greater proportion of positive words than the CCSS tweets.
With our “data products” cleanup complete, we can start pulling together a quick presentation to share with the class. We’ve already seen what a more formal journal article looks like in the PREPARE section of this walkthrough. For your Independent Analysis for Lab 2, you’ll be creating either a simple report or slide deck to share out some key findings from our analysis.
Regardless of whether you plan to talk us through your analysis and findings with a presentation or walk us through with a brief written report, your assignment should address the following questions: