Sentiment Analysis for NGSS and CCSS Standards

Introduction

In this project, I use Twitter data to explore sentiment toward two sets of education standards: the Next Generation Science Standards (NGSS) and the Common Core State Standards (CCSS). While exploring these data sets, I found it interesting that users tweeted their thoughts from different devices. Two questions guide this project:

Q1. What are the words that represent the “positive” and “negative” sides of the standards overall?

Q2. How differently did the sampled Twitter users view the NGSS and the CCSS standards, grouped by the devices they used to tweet their thoughts?

Data Analysis

Data Wrangling

After loading the required packages, I cleaned the NGSS and the CCSS data sets and kept only the information needed for this analysis. Because I am interested only in the relationship among the standards, the tweet text, and the user devices, I grouped the devices into three categories: “Mobile”, “Web”, and “Others”.
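The device grouping can be sketched as follows. This is a minimal illustration, not the exact cleaning code used for this report: the `source` column, the example client names, and the matching patterns are assumptions, and only the `dat_source` column name and its three categories come from the output shown later.

```r
library(dplyr)
library(stringr)

# Hypothetical tweets; `source` is an assumed column holding the client name.
tweets <- tibble(
  user_id = c("1", "2", "3", "4"),
  source  = c("Twitter for iPhone", "Twitter Web App",
              "Twitter for Android", "TweetDeck")
)

# Collapse the many client names into three device categories.
# The patterns are illustrative assumptions, checked in order.
tweets_grouped <- tweets %>%
  mutate(
    dat_source = case_when(
      str_detect(source, "iPhone|iPad|Android") ~ "Mobile",
      str_detect(source, "Web")                 ~ "Web",
      TRUE                                      ~ "Others"
    )
  )

print(count(tweets_grouped, dat_source))
```

Because `case_when()` uses the first condition that matches, any client that fits neither the mobile nor the web pattern falls through to “Others”.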

Text Analysis

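With a clean data set, I began exploring the text by splitting each tweet into individual words, so that the word is the unit of analysis. The first table below shows the resulting one-word-per-row data, and the second shows the most frequent words in the sample.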

## # A tibble: 20,526 × 4
##    user_id   dat_source standard word        
##    <chr>     <chr>      <chr>    <chr>       
##  1 316633205 Web        NGSS     switching   
##  2 316633205 Web        NGSS     gears       
##  3 316633205 Web        NGSS     bit         
##  4 316633205 Web        NGSS     crosscutting
##  5 316633205 Web        NGSS     concepts    
##  6 316633205 Web        NGSS     session     
##  7 316633205 Web        NGSS     science     
##  8 316633205 Web        NGSS     ngss        
##  9 316633205 Web        NGSS     win         
## 10 316633205 Web        NGSS     t3ic        
## # … with 20,516 more rows

## # A tibble: 7,156 × 2
##    word          n
##    <chr>     <int>
##  1 common     1111
##  2 core       1108
##  3 math        450
##  4 ngss        224
##  5 students    141
##  6 science     140
##  7 school      128
##  8 teachers    122
##  9 standards   112
## 10 kids        110
## # … with 7,146 more rows
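The two tables above can be produced along these lines. This is a sketch under assumptions: it uses the tidytext package, the example tweets are invented, and the stop-word removal step is inferred from the output (no words such as “a” or “to” appear) rather than confirmed by the report.

```r
library(dplyr)
library(tidytext)

# Hypothetical cleaned tweets with the columns shown in the first table.
tweets <- tibble(
  user_id    = c("1", "2"),
  dat_source = c("Web", "Mobile"),
  standard   = c("NGSS", "CCSS"),
  text       = c("Switching gears to crosscutting concepts in science",
                 "Common Core math homework is confusing for kids")
)

# One lowercase word per row, then drop common function words
# (the stop-word step is an assumption about the original analysis).
tidy_tweets <- tweets %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Word frequencies across the whole sample, most frequent first.
word_counts <- tidy_tweets %>%
  count(word, sort = TRUE)

print(word_counts)
```

`unnest_tokens()` keeps the other columns (`user_id`, `dat_source`, `standard`) on every row, which is what later allows grouping words by standard and device.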

Data Visualization

To get an overall picture of word frequency and distribution, I used a word cloud to show the 200 most frequent words in this sample. Words such as “common”, “core”, “math”, and “students” are the most frequently mentioned.
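A word cloud of this kind can be drawn roughly as follows. The sketch assumes the wordcloud package (which loads RColorBrewer) and reuses the six largest counts from the frequency table above as a stand-in for the full data; the palette, seed, and layout options are illustrative choices.

```r
library(dplyr)
library(wordcloud)  # also loads RColorBrewer

# Stand-in for the full frequency table: counts copied from the table above.
word_counts <- tibble(
  word = c("common", "core", "math", "ngss", "students", "science"),
  n    = c(1111, 1108, 450, 224, 141, 140)
)

# Keep at most the 200 most frequent words.
top_words <- word_counts %>%
  slice_max(n, n = 200, with_ties = FALSE)

set.seed(1234)  # fixes the random word placement
wordcloud(
  words = top_words$word,
  freq = top_words$n,
  min.freq = 1,
  max.words = 200,
  random.order = FALSE,             # most frequent words in the center
  colors = brewer.pal(8, "Dark2")   # assumed palette
)
```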

Findings and Discussion

Q1. What are the words that represent the “positive” and “negative” sides of the standards overall?

To answer Question 1, I looked at the most common positive and negative words in the sample, using the Bing lexicon as the reference. I matched each “word” in this sample with its “sentiment” in the Bing dictionary, then selected the 10 most frequent words in each of the “positive” and “negative” groups. The resulting graph shows that the most common positive words and the most common negative words in the sample are very distinct.
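The matching and top-10 selection can be sketched as follows, assuming the tidytext package and its bundled Bing lexicon. The word list is invented for illustration, and breaking ties by row order is my choice, not necessarily what the report did.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-word-per-row data.
tidy_tweets <- tibble(
  word = c("love", "love", "great", "bad", "bad", "bad", "worst", "ngss")
)

bing_top <- tidy_tweets %>%
  # Keep only words that appear in the Bing lexicon and attach their sentiment.
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  # Top 10 words within each sentiment.
  group_by(sentiment) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ungroup()

print(bing_top)
```

Words that are not in the lexicon, such as “ngss” here, are dropped by the inner join, so the comparison covers only sentiment-bearing words.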

Q2. How differently did the sampled Twitter users view the NGSS and the CCSS standards, grouped by the devices they used to tweet their thoughts?

To answer Question 2, I first calculated each word's percentage of the total number of words and used this as an index of the weight of the positive or negative side. I then grouped the data by the “standard”, “data_source/device”, and “positive/negative” variables to show the proportion of words in each category. The resulting graph shows that, on average, NGSS tweets contained a higher share of positive words than CCSS tweets in this sample.
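One way to compute such proportions is sketched below. The choice of denominator is an assumption: here each percentage is taken over the sentiment-bearing words within one standard and device group, which may differ from the denominator used in the original analysis. The data are invented for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-word-per-row data with standard and device labels.
tidy_tweets <- tribble(
  ~standard, ~dat_source, ~word,
  "NGSS",    "Mobile",    "love",
  "NGSS",    "Mobile",    "great",
  "NGSS",    "Mobile",    "bad",
  "NGSS",    "Web",       "good",
  "CCSS",    "Mobile",    "bad",
  "CCSS",    "Mobile",    "worst",
  "CCSS",    "Mobile",    "good",
  "CCSS",    "Web",       "hate"
)

sentiment_share <- tidy_tweets %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(standard, dat_source, sentiment) %>%
  # Assumed denominator: all matched words in the same standard/device group.
  group_by(standard, dat_source) %>%
  mutate(percent = 100 * n / sum(n)) %>%
  ungroup()

print(sentiment_share)
```

With this denominator, the positive and negative percentages sum to 100 within each standard and device group, so groups of different sizes can be compared directly.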

To summarize, I found that the most common positive and negative words in the sample are very distinct. Moreover, NGSS tweets seemed to contain a higher share of positive words than CCSS tweets. In particular, mobile users in the NGSS group tended to use more positive words than mobile users in the CCSS group, while mobile users in the CCSS group tended to use more negative words than those in the NGSS group. This makes me wonder what specific content each user group tweeted from the three device types that would produce such differences.