The final activity for each learning lab provides space to work with data and to reflect on how the concepts and techniques introduced in each lab might apply to your own research.
To earn a badge for each lab, you are required to respond to a set of prompts in two parts:
In Part I, you will reflect on your understanding of key concepts and begin to think about potential next steps for your own study.
In Part II, you will create a simple data product in R that demonstrates your ability to apply a data analysis technique introduced in this learning lab.
Use the institutional library (e.g. NCSU Library), Google Scholar, or a search engine to locate a research article, presentation, or resource that applies text mining to an educational context or topic of interest. More specifically, locate a text mining study that visualizes text data.
Provide an APA citation for your selected study.
How does the sentiment analysis address research questions?
Draft a research question for a population you may be interested in studying, or that would be of interest to educational researchers, and that would require the collection of text data, then answer the following questions: What is the level of adoption, by valence, of the CCSS standards by geographic area based on tweet data?
What text data would need to be collected? The data that would need to be collected would be tweets that include hashtags pertaining to the CCSS standards.
For what reason would text data need to be collected in order to address this question?
Explain the analytical level at which these text data would need to be collected and analyzed.
Use your case study file to create small multiples like the following figure:
I highly recommend creating a new R script in your lab-2 folder to complete this task. When your code is ready to share, use the code chunk below to share the final code for your model and answer the questions that follow.
library(forcats)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(tidyr)
library(rtweet)
library(writexl)
library(readxl)
library(tidytext)
library(textdata)
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
#read in data files
ngss_tweets <- read_xlsx("data/ngss_tweets.xlsx")
ccss_tweets <- read_xlsx("data/csss_tweets.xlsx")
#filtering for English only Tweets
ngss_text <- filter(ngss_tweets, lang == "en")
ccss_text <- filter(ccss_tweets, lang == "en")
#add column to identify standard category
ngss_text <- mutate(ngss_text, standards = "NGSS")
ccss_text <- mutate(ccss_text, standards = "CCSS")
#reorder the columns
ngss_text <- relocate(ngss_text, standards)
ccss_text <- relocate(ccss_text, standards)
#select desired variables
ngss_text <- select(ngss_text, standards, screen_name, created_at, text)
ccss_text <- select(ccss_text, standards, screen_name, created_at, text)
#combine datasets
tweets <- bind_rows(ngss_text, ccss_text)
#tokenize data
tweet_tokens <-
  tweets %>%
  unnest_tokens(output = word,
                input = text,
                token = "tweets")
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
#remove stop words and remove amp
tidy_tweets <-
  tweet_tokens %>%
  anti_join(stop_words, by = "word") %>%
  filter(word != "amp")
#count the most common words
count_tweets <- count(tidy_tweets, word, sort = TRUE)
#get sentiments from AFINN, nrc, bing, loughran
afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
nrc <- get_sentiments("nrc")
loughran <- get_sentiments("loughran")
#join sentiments by word
sentiment_afinn <- inner_join(tidy_tweets, afinn, by = "word")
sentiment_bing <- inner_join(tidy_tweets, bing, by = "word")
sentiment_nrc <- inner_join(tidy_tweets, nrc, by = "word")
sentiment_loughran <- inner_join(tidy_tweets, loughran, by = "word")
#view tweets by standard over time
tweets %>%
  group_by(standards) %>%
  ts_plot(by = "days")
#produce summaries
summary_bing <- count(sentiment_bing, sentiment, sort = TRUE)
summary_nrc <- count(sentiment_nrc, sentiment, sort = TRUE)
summary_afinn <- count(sentiment_afinn, value, sort = TRUE)
summary_loughran <- count(sentiment_loughran, sentiment, sort = TRUE)
#revise dataframes
summary_afinn2 <- sentiment_afinn %>%
  group_by(standards) %>%
  filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive")) %>%
  count(sentiment, sort = TRUE) %>%
  mutate(method = "afinn")
summary_bing2 <- sentiment_bing %>%
  group_by(standards) %>%
  count(sentiment, sort = TRUE) %>%
  mutate(method = "bing")
summary_nrc2 <- sentiment_nrc %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(standards) %>%
  count(sentiment, sort = TRUE) %>%
  mutate(method = "nrc")
summary_loughran2 <- sentiment_loughran %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  group_by(standards) %>%
  count(sentiment, sort = TRUE) %>%
  mutate(method = "loughran")
#view summaries
summary_bing2
## # A tibble: 4 × 4
## # Groups: standards [2]
## standards sentiment n method
## <chr> <chr> <int> <chr>
## 1 CCSS negative 914 bing
## 2 CCSS positive 437 bing
## 3 NGSS positive 226 bing
## 4 NGSS negative 60 bing
summary_nrc2
## # A tibble: 4 × 4
## # Groups: standards [2]
## standards sentiment n method
## <chr> <chr> <int> <chr>
## 1 CCSS positive 2198 nrc
## 2 CCSS negative 764 nrc
## 3 NGSS positive 542 nrc
## 4 NGSS negative 73 nrc
summary_afinn2
## # A tibble: 4 × 4
## # Groups: standards [2]
## standards sentiment n method
## <chr> <chr> <int> <chr>
## 1 CCSS negative 740 afinn
## 2 CCSS positive 468 afinn
## 3 NGSS positive 273 afinn
## 4 NGSS negative 39 afinn
summary_loughran2
## # A tibble: 4 × 4
## # Groups: standards [2]
## standards sentiment n method
## <chr> <chr> <int> <chr>
## 1 CCSS negative 440 loughran
## 2 CCSS positive 112 loughran
## 3 NGSS negative 68 loughran
## 4 NGSS positive 54 loughran
#join summaries
summary_sentiment <- bind_rows(summary_afinn2,
                               summary_bing2,
                               summary_nrc2,
                               summary_loughran2) %>%
  arrange(method, standards) %>%
  relocate(method)
summary_sentiment
## # A tibble: 16 × 4
## # Groups: standards [2]
## method standards sentiment n
## <chr> <chr> <chr> <int>
## 1 afinn CCSS negative 740
## 2 afinn CCSS positive 468
## 3 afinn NGSS positive 273
## 4 afinn NGSS negative 39
## 5 bing CCSS negative 914
## 6 bing CCSS positive 437
## 7 bing NGSS positive 226
## 8 bing NGSS negative 60
## 9 loughran CCSS negative 440
## 10 loughran CCSS positive 112
## 11 loughran NGSS negative 68
## 12 loughran NGSS positive 54
## 13 nrc CCSS positive 2198
## 14 nrc CCSS negative 764
## 15 nrc NGSS positive 542
## 16 nrc NGSS negative 73
#create totals and then join summary and total counts to prepare percentages
total_counts <- summary_sentiment %>%
  group_by(method, standards) %>%
  summarise(total = sum(n))
## `summarise()` has grouped output by 'method'. You can override using the
## `.groups` argument.
sentiment_counts <- left_join(summary_sentiment, total_counts)
## Joining, by = c("method", "standards")
sentiment_percents <- sentiment_counts %>%
  mutate(percent = n / total * 100)
sentiment_percents
## # A tibble: 16 × 6
## # Groups: standards [2]
## method standards sentiment n total percent
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 afinn CCSS negative 740 1208 61.3
## 2 afinn CCSS positive 468 1208 38.7
## 3 afinn NGSS positive 273 312 87.5
## 4 afinn NGSS negative 39 312 12.5
## 5 bing CCSS negative 914 1351 67.7
## 6 bing CCSS positive 437 1351 32.3
## 7 bing NGSS positive 226 286 79.0
## 8 bing NGSS negative 60 286 21.0
## 9 loughran CCSS negative 440 552 79.7
## 10 loughran CCSS positive 112 552 20.3
## 11 loughran NGSS negative 68 122 55.7
## 12 loughran NGSS positive 54 122 44.3
## 13 nrc CCSS positive 2198 2962 74.2
## 14 nrc CCSS negative 764 2962 25.8
## 15 nrc NGSS positive 542 615 88.1
## 16 nrc NGSS negative 73 615 11.9
#visualization bar chart
sentiment_percents %>%
  ggplot(aes(x = standards, y = percent, fill = sentiment)) +
  geom_bar(width = .8, stat = "identity") +
  facet_wrap(~method, ncol = 1) +
  coord_flip() +
  labs(title = "Public Sentiment on Twitter",
       subtitle = "The Common Core & Next Gen Science Standards",
       x = "State Standards",
       y = "Percentage of Words")
#Small multiples visualization with green stacked
sentiment_percents %>%
  ggplot(aes(standards, percent, fill = percent)) +
  theme_minimal() +
  scale_fill_distiller(palette = "Greens") +
  labs(title = "Comparison of positive sentiment in NGSS and CCSS Tweets") +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 2, scales = "free")
#Small multiples visualization with dodged (not stacked) bars
ggplot(sentiment_percents, aes(standards, percent, fill = sentiment)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  labs(title = "Comparison of positive sentiment in NGSS and CCSS Tweets",
       x = "Standards",
       y = "Percent") +
  facet_wrap(~method, ncol = 2, scales = "free")
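The percentage step in the code above builds a separate totals table and joins it back to the counts. As a minimal sketch of an alternative, the same percentages can be computed in one grouped `mutate()`. The example below re-enters the bing counts printed earlier by hand so that it runs on its own; the object names `bing_counts` and `bing_percents` are introduced here for illustration and are not part of the lab's code.

```r
library(dplyr)
library(tibble)

# Bing counts copied by hand from the summary_bing2 output shown above.
# bing_counts is a hypothetical name used only for this sketch.
bing_counts <- tribble(
  ~method, ~standards, ~sentiment, ~n,
  "bing",  "CCSS",     "negative", 914,
  "bing",  "CCSS",     "positive", 437,
  "bing",  "NGSS",     "positive", 226,
  "bing",  "NGSS",     "negative", 60
)

# Within each method/standards group, sum(n) is the group total,
# so no separate summarise() and left_join() are needed.
bing_percents <- bing_counts %>%
  group_by(method, standards) %>%
  mutate(total = sum(n),
         percent = n / total * 100) %>%
  ungroup()

bing_percents
```

Applied to `summary_sentiment`, the same grouped `mutate()` should reproduce the `total` and `percent` columns of `sentiment_percents`, assuming the grouping variables are `method` and `standards` as in the original join.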
Congratulations, you’ve completed your Intro to text mining Badge! Complete the following steps to submit your work for review:
Change the name of the author: in the YAML header at the very top of this document to your name. As noted in Reproducible Research in R, the YAML header controls the style and feel of the knitted document but doesn’t actually display in the final output.
Click the yarn icon above to “knit” your data product to an HTML file that will be saved in your R Project folder.
Commit your changes in GitHub Desktop and push them to your online GitHub repository.
Publish your HTML page to the web using one of the following publishing methods:
Publish on RPubs by clicking the “Publish” button located in the Viewer Pane when you knit your document. Note, you will need to quickly create an RPubs account.
Publish on GitHub using either GitHub Pages or the HTML previewer.
Post a new discussion on GitHub to our Text mining Badges forum. In your post, include a link to your published web page and a short reflection highlighting one thing you learned from this lab and one thing you’d like to explore further.