The purpose of this case study is to analyze the social network of tweets about the Next Generation Science Standards (NGSS) and to understand how NGSS was publicly received by teachers, students, and parents. Research questions for this case study include
The data source for this case study is the
ngss-tweets.csv file, which has tweets and user information
related to online conversations about NGSS.
library(tidytext)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(vader)
library(tidygraph)
##
## Attaching package: 'tidygraph'
##
## The following object is masked from 'package:stats':
##
## filter
library(ggraph)
library(igraph)
##
## Attaching package: 'igraph'
##
## The following object is masked from 'package:tidygraph':
##
## groups
##
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
##
## The following objects are masked from 'package:purrr':
##
## compose, simplify
##
## The following object is masked from 'package:tidyr':
##
## crossing
##
## The following object is masked from 'package:tibble':
##
## as_data_frame
##
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
##
## The following object is masked from 'package:base':
##
## union
I imported the data from the ngss-tweets.csv file into a
new data frame.
ngss_tweets <- read_csv("data/ngss-tweets.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 8126 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): text, source
## dbl (4): author_id, id, conversation_id, in_reply_to_user_id
## lgl (1): possibly_sensitive
## dttm (1): created_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ngss_tweets
## # A tibble: 8,126 × 8
## text created_at autho…¹ id conve…² source possi…³ in_rep…⁴
## <chr> <dttm> <dbl> <dbl> <dbl> <chr> <lgl> <dbl>
## 1 "Please … 2021-01-06 00:50:49 3.28e 9 1.35e18 1.35e18 Twitt… FALSE NA
## 2 "What la… 2021-01-06 00:45:32 1.01e18 1.35e18 1.35e18 Hoots… FALSE NA
## 3 "I recen… 2021-01-06 00:39:37 6.18e 7 1.35e18 1.35e18 Twitt… FALSE NA
## 4 "I'm thr… 2021-01-06 00:30:13 4.62e 8 1.35e18 1.35e18 Twitt… FALSE NA
## 5 "PLS RT.… 2021-01-06 00:15:05 2.23e 7 1.35e18 1.35e18 Twitt… FALSE NA
## 6 "Inspire… 2021-01-06 00:00:00 3.32e 9 1.35e18 1.35e18 Tweet… FALSE NA
## 7 "PLTW La… 2021-01-05 23:45:06 1.73e 7 1.35e18 1.35e18 Hoots… FALSE NA
## 8 "@NGSS_t… 2021-01-05 23:24:01 1.02e18 1.35e18 1.35e18 Twitt… FALSE 1.02e18
## 9 "@NGSS_t… 2021-01-05 23:21:56 1.02e18 1.35e18 1.35e18 Twitt… FALSE 4.92e 9
## 10 "January… 2021-01-05 23:10:03 2.37e 7 1.35e18 1.35e18 Hoots… FALSE NA
## # … with 8,116 more rows, and abbreviated variable names ¹author_id,
## # ²conversation_id, ³possibly_sensitive, ⁴in_reply_to_user_id
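The printed tibble shows the ID columns parsed as doubles. Twitter IDs of this size are larger than 2^53, beyond which a double can no longer represent every integer exactly, so two different IDs can collapse to the same value. The following is a minimal sketch of one way to avoid that, assuming the IDs can be read as text at import time. The inline CSV and the names `csv_text`, `ids_as_chr`, and `collapsed` are hypothetical stand-ins, not part of the original analysis.

```r
library(readr)

# Hypothetical two-row CSV standing in for data/ngss-tweets.csv
csv_text <- I(paste0(
  "author_id,in_reply_to_user_id,text\n",
  "1023054370529857537,4918507542,a reply\n",
  "1555906686,,not a reply\n"
))

# Read the ID columns as character so no digits are lost
ids_as_chr <- read_csv(
  csv_text,
  col_types = cols(
    author_id = col_character(),
    in_reply_to_user_id = col_character(),
    text = col_character()
  )
)

# Two different 19-digit IDs compared after conversion to double
collapsed <- as.numeric("1023054370529857537") ==
  as.numeric("1023054370529857536")
```

With the real file, the same `col_types` argument could be passed to the `read_csv("data/ngss-tweets.csv")` call, which would also make a later `as.character()` conversion unnecessary.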
To clean and prepare the data for analysis, I converted the numeric user IDs to character strings, created an edgelist, formatted the network data, and created a network object.
head(ngss_tweets)
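# Put the sender and reply-target IDs first and keep only the edgelist columns
edge_1 <- ngss_tweets |>
  relocate(sender = author_id,
           target = in_reply_to_user_id) |>
  select(sender,
         target,
         created_at,
         text, source)
# Convert both ID columns to character; piping into transform(edge_1, ...) would
# pass the data frame twice and append duplicate columns (sender.1, target.1, ...)
edge_1 <- edge_1 |>
  mutate(sender = as.character(sender),
         target = as.character(target))
# Tokenize the target column so each reply target becomes a receiver
edge_2 <- edge_1 |>
  unnest_tokens(input = target,
                output = receiver,
                to_lower = FALSE) |>
  relocate(sender, receiver)
# Drop tweets that are not replies (no receiver)
edges <- edge_2 |>
  drop_na(receiver)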
edges
network_actors <- edges |>
  select(sender, receiver) |>
  pivot_longer(cols = c(sender, receiver))
network_actors
## # A tibble: 5,602 × 2
## name value
## <chr> <chr>
## 1 sender 1023054370529857536
## 2 receiver 1023054370529857536
## 3 sender 1023054370529857536
## 4 receiver 4918507542
## 5 sender 1555906686
## 6 receiver 4918507542
## 7 sender 1211837761444954112
## 8 receiver 4918507542
## 9 sender 1342104169
## 10 receiver 1342104169
## # … with 5,592 more rows
actors <- network_actors |>
  select(value) |>
  rename(actors = value) |>
  distinct()
actors
## # A tibble: 1,093 × 1
## actors
## <chr>
## 1 1023054370529857536
## 2 4918507542
## 3 1555906686
## 4 1211837761444954112
## 5 1342104169
## 6 260773268
## 7 2198414407
## 8 3296431010
## 9 883384220835606528
## 10 255769626
## # … with 1,083 more rows
ngss_network_1 <- tbl_graph(edges = edges,
                            nodes = actors)
ngss_network_1
## # A tbl_graph: 1093 nodes and 2801 edges
## #
## # A directed multigraph with 241 components
## #
## # Node Data: 1,093 × 1 (active)
## actors
## <chr>
## 1 1023054370529857536
## 2 4918507542
## 3 1555906686
## 4 1211837761444954112
## 5 1342104169
## 6 260773268
## # … with 1,087 more rows
## #
## # Edge Data: 2,801 × 10
## from to created_at text source sender… target…
## <int> <int> <dttm> <chr> <chr> <dbl> <dbl>
## 1 1 1 2021-01-05 23:24:01 "@NG… Twitt… 1.02e18 1.02e18
## 2 1 2 2021-01-05 23:21:56 "@NG… Twitt… 1.02e18 4.92e 9
## 3 3 2 2021-01-05 23:09:06 "@NG… Twitt… 1.56e 9 4.92e 9
## # … with 2,798 more rows, and 3 more variables: created_at.1 <dttm>,
## # text.1 <chr>, source.1 <chr>
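The edgelist-to-network steps above can be sketched on a small made-up example. This is an illustrative sketch only, not the original pipeline: the `toy_*` names and values are hypothetical, the IDs are stored as character from the start, and the node column is named `name`, which is the default key tidygraph uses to match character `from`/`to` values.

```r
library(dplyr)
library(tidyr)
library(tidygraph)

# Hypothetical tweets: three replies (one a self-reply) and one original post
toy_tweets <- tibble::tibble(
  author_id           = c("11", "22", "33", "11"),
  in_reply_to_user_id = c("22", NA, "22", "11"),
  text                = c("reply", "original post", "reply", "self-reply")
)

# Edgelist: keep only replies, one row per sender -> receiver tie
toy_edges <- toy_tweets |>
  filter(!is.na(in_reply_to_user_id)) |>
  select(from = author_id, to = in_reply_to_user_id, text)

# Node list: every distinct ID that appears as a sender or a receiver
toy_nodes <- toy_edges |>
  select(from, to) |>
  pivot_longer(cols = everything(),
               names_to = "role",
               values_to = "name") |>
  distinct(name)

# Directed network object built from the two tables
toy_network <- tbl_graph(nodes = toy_nodes,
                         edges = toy_edges,
                         directed = TRUE)
```

Because the original post has no reply target, it contributes no edge, which mirrors how non-reply tweets drop out of the edgelist in the case study.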
For this analysis, I used techniques such as calculating measures of
centrality and node degree, and adding nodes, edges, and layouts to
determine who the transmitters, transceivers, and transcenders are. In
addition, I created sociograms with the autograph()
and ggraph() functions to visualize the relationships
between users and the social network. I also identified groups of
connected users with group_components().
ngss_network <- ngss_network_1 |>
  activate(nodes) |>
  mutate(degree = centrality_degree(mode = "all")) |>
  mutate(in_degree = centrality_degree(mode = "in"))
ngss_network
## # A tbl_graph: 1093 nodes and 2801 edges
## #
## # A directed multigraph with 241 components
## #
## # Node Data: 1,093 × 3 (active)
## actors degree in_degree
## <chr> <dbl> <dbl>
## 1 1023054370529857536 16 6
## 2 4918507542 589 550
## 3 1555906686 4 1
## 4 1211837761444954112 10 1
## 5 1342104169 14 8
## 6 260773268 18 7
## # … with 1,087 more rows
## #
## # Edge Data: 2,801 × 10
## from to created_at text source sender… target…
## <int> <int> <dttm> <chr> <chr> <dbl> <dbl>
## 1 1 1 2021-01-05 23:24:01 "@NG… Twitt… 1.02e18 1.02e18
## 2 1 2 2021-01-05 23:21:56 "@NG… Twitt… 1.02e18 4.92e 9
## 3 3 2 2021-01-05 23:09:06 "@NG… Twitt… 1.56e 9 4.92e 9
## # … with 2,798 more rows, and 3 more variables: created_at.1 <dttm>,
## # text.1 <chr>, source.1 <chr>
ngss_network <- ngss_network_1 |>
  activate(nodes) |>
  mutate(degree = centrality_degree(mode = "all")) |>
  mutate(in_degree = centrality_degree(mode = "in")) |>
  mutate(out_degree = centrality_degree(mode = "out"))
ngss_network
## # A tbl_graph: 1093 nodes and 2801 edges
## #
## # A directed multigraph with 241 components
## #
## # Node Data: 1,093 × 4 (active)
## actors degree in_degree out_degree
## <chr> <dbl> <dbl> <dbl>
## 1 1023054370529857536 16 6 10
## 2 4918507542 589 550 39
## 3 1555906686 4 1 3
## 4 1211837761444954112 10 1 9
## 5 1342104169 14 8 6
## 6 260773268 18 7 11
## # … with 1,087 more rows
## #
## # Edge Data: 2,801 × 10
## from to created_at text source sender… target…
## <int> <int> <dttm> <chr> <chr> <dbl> <dbl>
## 1 1 1 2021-01-05 23:24:01 "@NG… Twitt… 1.02e18 1.02e18
## 2 1 2 2021-01-05 23:21:56 "@NG… Twitt… 1.02e18 4.92e 9
## 3 3 2 2021-01-05 23:09:06 "@NG… Twitt… 1.56e 9 4.92e 9
## # … with 2,798 more rows, and 3 more variables: created_at.1 <dttm>,
## # text.1 <chr>, source.1 <chr>
node_measures <- ngss_network |>
  activate(nodes) |>
  data.frame()
summary(node_measures)
## actors degree in_degree out_degree
## Length:1093 Min. : 1.000 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.: 0.000
## Mode :character Median : 1.000 Median : 1.000 Median : 1.000
## Mean : 5.125 Mean : 2.563 Mean : 2.563
## 3rd Qu.: 3.000 3rd Qu.: 1.000 3rd Qu.: 2.000
## Max. :589.000 Max. :550.000 Max. :121.000
view(node_measures)
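To make the three degree measures concrete, here is a minimal sketch on a four-edge toy network. The `toy_*` names and values are hypothetical and chosen only so the counts can be checked by hand. Note that a self-reply adds one to both in-degree and out-degree, so it adds two to total degree.

```r
library(dplyr)
library(tidygraph)

# Hypothetical replies: a -> b, a -> c, b -> c, and a self-reply c -> c
toy_edges <- tibble::tibble(
  from = c("a", "a", "b", "c"),
  to   = c("b", "c", "c", "c")
)
toy_nodes <- tibble::tibble(name = c("a", "b", "c"))

toy_graph <- tbl_graph(nodes = toy_nodes,
                       edges = toy_edges,
                       directed = TRUE) |>
  activate(nodes) |>
  mutate(degree     = centrality_degree(mode = "all"),
         in_degree  = centrality_degree(mode = "in"),
         out_degree = centrality_degree(mode = "out"))

# Pull the node table out as a regular tibble for inspection
toy_measures <- toy_graph |>
  activate(nodes) |>
  as_tibble()
```

In this toy network, node c receives three replies and sends one, so its in-degree is 3, its out-degree is 1, and its total degree is 4.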
autograph(ngss_network)
ggraph(ngss_network) +
  geom_node_point()
## Using "stress" as default layout
ggraph(ngss_network, layout = "fr") +
  geom_node_point()
ggraph(ngss_network, layout = "fr") +
  geom_node_point(aes(size = out_degree,
                      color = out_degree))
ggraph(ngss_network, layout = "fr") +
  geom_node_point(aes(size = out_degree,
                      color = out_degree)) +
  geom_node_text(aes(label = actors,
                     size = out_degree/2,
                     color = out_degree),
                 repel = TRUE)
## Warning: ggrepel: 1006 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
ggraph(ngss_network, layout = "fr") +
  geom_node_point(aes(size = out_degree,
                      color = out_degree)) +
  geom_node_text(aes(label = actors,
                     size = out_degree/2,
                     color = out_degree),
                 repel = TRUE) +
  geom_edge_link(arrow = arrow(length = unit(1, 'mm')),
                 end_cap = circle(3, 'mm'),
                 alpha = .2)
## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
## Warning: ggrepel: 1019 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
ggraph(ngss_network, layout = "fr") +
  geom_node_point(aes(size = out_degree,
                      color = out_degree),
                  show.legend = FALSE) +
  geom_node_text(aes(label = actors,
                     size = out_degree/2,
                     color = out_degree),
                 repel = TRUE,
                 show.legend = FALSE) +
  geom_edge_link(arrow = arrow(length = unit(1, 'mm')),
                 end_cap = circle(3, 'mm'),
                 alpha = .2) +
  theme_graph()
## Warning: ggrepel: 996 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
ggraph(ngss_network, layout = "fr") +
  geom_node_point(aes(size = in_degree,
                      color = in_degree),
                  show.legend = FALSE) +
  geom_node_text(aes(label = actors,
                     size = in_degree/2,
                     color = in_degree),
                 repel = TRUE,
                 show.legend = FALSE) +
  geom_edge_link(arrow = arrow(length = unit(1, 'mm')),
                 end_cap = square(3, 'mm'),
                 alpha = .5) +
  theme_graph()
## Warning: ggrepel: 949 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
ngss_network_groups <- ngss_network |>
  activate(nodes) |>
  mutate(group = group_components())
ngss_network_groups
## # A tbl_graph: 1093 nodes and 2801 edges
## #
## # A directed multigraph with 241 components
## #
## # Node Data: 1,093 × 5 (active)
## actors degree in_degree out_degree group
## <chr> <dbl> <dbl> <dbl> <int>
## 1 1023054370529857536 16 6 10 1
## 2 4918507542 589 550 39 1
## 3 1555906686 4 1 3 1
## 4 1211837761444954112 10 1 9 1
## 5 1342104169 14 8 6 1
## 6 260773268 18 7 11 1
## # … with 1,087 more rows
## #
## # Edge Data: 2,801 × 10
## from to created_at text source sender… target…
## <int> <int> <dttm> <chr> <chr> <dbl> <dbl>
## 1 1 1 2021-01-05 23:24:01 "@NG… Twitt… 1.02e18 1.02e18
## 2 1 2 2021-01-05 23:21:56 "@NG… Twitt… 1.02e18 4.92e 9
## 3 3 2 2021-01-05 23:09:06 "@NG… Twitt… 1.56e 9 4.92e 9
## # … with 2,798 more rows, and 3 more variables: created_at.1 <dttm>,
## # text.1 <chr>, source.1 <chr>
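The group column comes from group_components(), which labels each node with the connected component it belongs to and numbers the components from largest to smallest. A minimal sketch on a toy network, where the `toy_*` names and values are hypothetical:

```r
library(dplyr)
library(tidygraph)

# Hypothetical network: a chain a -> b -> c, a pair d -> e, and an isolate f
toy_nodes <- tibble::tibble(name = c("a", "b", "c", "d", "e", "f"))
toy_edges <- tibble::tibble(
  from = c("a", "b", "d"),
  to   = c("b", "c", "e")
)

toy_groups <- tbl_graph(nodes = toy_nodes,
                        edges = toy_edges,
                        directed = TRUE) |>
  activate(nodes) |>
  mutate(group = group_components()) |>
  as_tibble()

# Number of nodes in each component
toy_group_sizes <- toy_groups |>
  count(group)
```

Applied to ngss_network_groups, the same count(group) step would show how many of the 1,093 users fall into the largest component versus the many small ones.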
ngss_network_groups |>
  ggraph(layout = "fr") +
  geom_node_point(aes(size = out_degree,
                      color = out_degree),
                  show.legend = FALSE) +
  geom_node_text(aes(label = actors,
                     color = out_degree,
                     size = out_degree),
                 repel = TRUE,
                 show.legend = FALSE) +
  geom_edge_link(arrow = arrow(length = unit(1, 'mm')),
                 end_cap = circle(3, 'mm'),
                 alpha = .2) +
  theme_graph() +
  # alpha is a fixed transparency, so it is set outside aes()
  geom_node_voronoi(aes(fill = factor(group)),
                    alpha = .05,
                    max.radius = .5,
                    show.legend = FALSE)
## Warning: ggrepel: 991 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Some key findings are who the transmitters, transceivers, and transcenders are:
Transmitters: User 558971700, because they have the highest out-degree.
Transceivers: User 4918507542, because they have the highest in-degree.
Transcenders: User 184649645, because their in-degree and out-degree are equal and they have the highest combined total of the two.
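The three roles can be read off a node-measures table with a few dplyr calls. The sketch below applies the criteria stated above to a hypothetical table. `toy_measures` and its values are made up, and interpreting "transcender" as the balanced user with the largest combined degree is an assumption based on the wording of this case study.

```r
library(dplyr)

# Hypothetical stand-in for node_measures
toy_measures <- tibble::tibble(
  actors     = c("u1", "u2", "u3", "u4"),
  in_degree  = c(0, 9, 4, 1),
  out_degree = c(7, 1, 4, 1)
)

# Transmitter: highest out-degree
top_transmitter <- toy_measures |>
  slice_max(out_degree, n = 1, with_ties = FALSE)

# Transceiver: highest in-degree
top_transceiver <- toy_measures |>
  slice_max(in_degree, n = 1, with_ties = FALSE)

# Transcender: equal in- and out-degree, largest combined total
top_transcender <- toy_measures |>
  filter(in_degree == out_degree) |>
  mutate(total = in_degree + out_degree) |>
  slice_max(total, n = 1, with_ties = FALSE)
```

Running the same steps on node_measures would make the selection of each user reproducible rather than dependent on scanning the table in the viewer.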
In addition, the sociograms show a large cluster of nodes, indicating high activity among those users.
One potential course of action is to run a sentiment analysis on this data to understand the public reaction to the NGSS. In addition, it would be interesting to see whether the reactions are influenced by other factors, such as political affiliation, the age of the students affected, and the social media platform the user is posting on.
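As a rough illustration of what that sentiment step might look like, the sketch below scores two made-up sentences with the vader package loaded at the top of this case study. The example text is hypothetical and not taken from ngss-tweets.csv, and the sketch assumes vader_df() returns one row per input string with a compound column, which is converted with as.numeric() as a precaution.

```r
library(vader)

# Hypothetical example tweets; not taken from ngss-tweets.csv
toy_text <- c("I love the new science standards!",
              "This rollout is terrible and confusing.")

# Score each string; compound is the overall score between -1 and 1
toy_sentiment <- vader_df(toy_text)
toy_compound <- as.numeric(toy_sentiment$compound)
```

For the real data, the text column of ngss_tweets could be passed to vader_df() in the same way, and the compound scores could then be joined back to the edgelist to compare sentiment across senders.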
One limitation is that the users are identified through a numeric author ID instead of a user name or screen name. This makes it difficult to identify and keep track of users.
In addition, it’s important to make sure that social media users, like those in this case study, are aware of how their information is being used. Even though they are not identifiable by screen name, they may not be comfortable with what they posted circulating in studies and on platforms they may not be familiar with.