In this project, I explore sentiment analysis of Twitter data about the NGSS and the CCSS standards. While exploring these data sets, I found it interesting that Twitter users tweeted their thoughts from different devices. The guiding questions for this project are: Q1: What words represent the “positive” and “negative” sides of the standards overall? Q2: How differently did the sampled Twitter users view the NGSS and the CCSS standards, grouping by the devices they used to tweet their thoughts?
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Loading required package: RColorBrewer
After loading the required packages, I cleaned the NGSS and the CCSS data sets to extract the information needed for this analysis. Since I am only interested in the relationship among the standards, the tweet text, and the user devices, I grouped the user devices into three categories (“Mobile”, “Web”, “Others”).
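A minimal sketch of the device grouping, under stated assumptions: the raw column name `source`, the toy rows, and the matching patterns are all hypothetical stand-ins, since the actual cleaning code is not shown here. Only the three category labels and the `dat_source` column name come from this analysis.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the cleaned tweets; the `source` values are assumed examples
tweets <- tibble::tibble(
  user_id  = c("1", "2", "3", "4"),
  standard = c("NGSS", "NGSS", "CCSS", "CCSS"),
  source   = c("Twitter for iPhone", "Twitter Web App",
               "Twitter for Android", "TweetDeck")
)

# Collapse the raw client names into three device categories.
# case_when() checks the conditions in order and keeps the first match.
tweets_grouped <- tweets %>%
  mutate(
    dat_source = case_when(
      str_detect(source, "iPhone|iPad|Android") ~ "Mobile",
      str_detect(source, "Web")                 ~ "Web",
      TRUE                                      ~ "Others"
    )
  )
```

Any client that matches neither pattern falls into “Others”, so the patterns would need to be checked against the client names that actually occur in the data.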
With a clean data set, I began exploring the text data by breaking the text down into individual words, which serve as the unit of analysis.
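One way to do this step with tidytext is sketched below. The toy tweets and the `text` column name are assumptions for illustration. Removing stop words is also an assumption, though it is consistent with the output, where short function words do not appear.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the cleaned tweets (hypothetical text)
tweets <- tibble::tibble(
  user_id    = c("1", "2"),
  dat_source = c("Web", "Mobile"),
  standard   = c("NGSS", "CCSS"),
  text       = c("Switching gears a bit to crosscutting concepts",
                 "Common core math is hard for kids")
)

# One row per word: unnest_tokens() lowercases the text and strips punctuation;
# anti_join() then drops common stop words such as "a" and "to"
tidy_tweets <- tweets %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
```

The result keeps the identifying columns and adds a `word` column, so each row is one word from one tweet.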
## # A tibble: 20,526 × 4
## user_id dat_source standard word
## <chr> <chr> <chr> <chr>
## 1 316633205 Web NGSS switching
## 2 316633205 Web NGSS gears
## 3 316633205 Web NGSS bit
## 4 316633205 Web NGSS crosscutting
## 5 316633205 Web NGSS concepts
## 6 316633205 Web NGSS session
## 7 316633205 Web NGSS science
## 8 316633205 Web NGSS ngss
## 9 316633205 Web NGSS win
## 10 316633205 Web NGSS t3ic
## # … with 20,516 more rows
## # A tibble: 7,156 × 2
## word n
## <chr> <int>
## 1 common 1111
## 2 core 1108
## 3 math 450
## 4 ngss 224
## 5 students 141
## 6 science 140
## 7 school 128
## 8 teachers 122
## 9 standards 112
## 10 kids 110
## # … with 7,146 more rows
To gain a holistic understanding of word frequency and distribution, I
used a word cloud to show the 200 most frequent words in this sample. As
shown in the graph below, words such as “common”, “core”, “math”, and
“students” are among the most frequently mentioned words.
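The word counts and the word cloud could be produced along these lines. This is a sketch with toy data, not the code used for the figure. The plotting call is guarded so the counting still runs when the wordcloud package is not installed, and `min.freq = 1` is set only because the toy sample is tiny.

```r
library(dplyr)

# Toy stand-in for the tokenized tweets (hypothetical words)
tidy_tweets <- tibble::tibble(
  word = c("common", "core", "common", "math", "core", "common")
)

# Count each word and sort from most to least frequent
word_counts <- tidy_tweets %>%
  count(word, sort = TRUE)

# Draw at most the 200 most frequent words
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(
    words     = word_counts$word,
    freq      = word_counts$n,
    max.words = 200,
    min.freq  = 1
  )
}
```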
To answer Question 1, I looked at the most common positive and negative words in the sample using the Bing lexicon. I matched each “word” in this sample with its “sentiment” in the Bing lexicon, then selected the top 10 words from the “positive” and the “negative” groups. As shown in the graph below, the most common positive and negative words in the sample are very distinct.
## Joining with `by = join_by(word)`
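The matching and top-10 selection described above can be sketched as follows. The toy words are hypothetical. The join key is written out explicitly here, which also silences the "Joining with" message.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the tokenized tweets (hypothetical words)
tidy_tweets <- tibble::tibble(
  word = c("win", "win", "love", "fail", "wrong", "wrong", "wrong", "math")
)

# Keep only words found in the Bing lexicon, count them,
# then take the 10 most frequent words within each sentiment
bing_top <- tidy_tweets %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ungroup()
```

Words that are not in the lexicon, such as “math” in the toy data, drop out at the `inner_join()` step, so they carry no sentiment weight in this approach.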
### Q2:
How differently did the sampled Twitter users view the NGSS and the CCSS
standards, grouping by the devices they used to tweet their thoughts? To
answer Question 2, I first calculated each word's share of the total
number of words and used this as an index of the weight of positive and
negative words. I then grouped the data by the “standard”, “dat_source”
(device), and “sentiment” (positive/negative) variables to show the
proportion of words in each category. As shown in the graph below, on
average, NGSS tweets contained a higher proportion of positive words
than CCSS tweets in this sample.
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'standard', 'sentiment'. You can override
## using the `.groups` argument.
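To make the proportion calculation concrete, here is a small sketch with invented rows. The denominator is an assumption: it uses the number of sentiment-matched words within each standard and device group, which may differ from the total used for the graph.

```r
library(dplyr)

# Toy stand-in for words already matched to the Bing lexicon (invented rows)
matched <- tibble::tibble(
  standard   = c("NGSS", "NGSS", "NGSS", "CCSS", "CCSS", "CCSS", "CCSS"),
  dat_source = c("Mobile", "Mobile", "Web", "Mobile", "Mobile", "Web", "Web"),
  sentiment  = c("positive", "positive", "negative",
                 "negative", "negative", "positive", "negative")
)

# Count words per standard, device, and sentiment, then convert each count
# into a share of its standard-by-device group (assumed denominator)
sentiment_share <- matched %>%
  count(standard, dat_source, sentiment) %>%
  group_by(standard, dat_source) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
```

With this denominator, the positive and negative shares sum to 1 within each standard and device combination, which makes groups of different sizes directly comparable.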
To summarize, I found that the most common positive and negative words in the sample are very distinct. Moreover, NGSS tweets seemed to contain more positive words than CCSS tweets. In particular, mobile users in the NGSS group tended to use more positive words than those in the CCSS group, whereas mobile users in the CCSS group tended to use more negative words than those in the NGSS group. This makes me wonder what specific content the two user groups tweeted from the three device categories to produce such differences.