This tutorial will guide you through the analysis and reporting of students' online discussions.
For this tutorial we will use a dataset from a MOOC on Climate Change. The raw data file can be downloaded here.
Let’s load this dataset and see what is in it.
forum_data<-read.csv('ClimateForum15.csv',stringsAsFactors=FALSE)
colnames(forum_data)
## [1] "author_id" "author_username"
## [3] "created_at" "anonymous_to_peers"
## [5] "votes_count" "votes_point"
## [7] "votes_down_count" "votes_up"
## [9] "votes_down" "votes_up_count"
## [11] "parent_ids" "historical_abuse_flaggers"
## [13] "comment_thread_id" "X_type"
## [15] "updated_at" "abuse_flaggers"
## [17] "child_count" "visible"
## [19] "sk" "anonymous"
## [21] "course_id" "at_position_list"
## [23] "mongoid" "endorsed"
## [25] "parent_id" "last_activity_at"
## [27] "closed" "title"
## [29] "thread_type" "commentable_id"
## [31] "group_id" "tags_array"
## [33] "endorsement_user_id" "endorsement_time"
## [35] "comment_count" "pinned"
## [37] "body"
Each post in the dataset contains the following important fields:
* author_id: Numeric identifier of the author of the post
* author_username: Username of the author of the post
* votes_count: How many votes the post has received
* votes_point: How many points (positive minus negative votes) the post has
* votes_down_count: How many negative votes the post has
* votes_up: List of users that up-voted the post
* votes_down: List of users that down-voted the post (no posts have down-votes)
* votes_up_count: How many positive votes the post has
* parent_ids: The id of the post to which this post is a response. If the post is the first of a thread or a direct response to the thread, it does not have a parent_ids
* comment_thread_id: Id of the root post of the thread
* X_type: "CommentThread" if it is the first post of the thread, "Comment" if it is a response in the thread
* child_count: How many direct responses this post has
* mongoid: Id of the post
* parent_id: The same as parent_ids
* title: Title of the thread
* comment_count: How many comments a CommentThread has
* body: Text of the post
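As a quick, optional check, we can peek at a few of these fields for the first rows of the dataset (any subset of the columns listed above will do):
head(forum_data[, c("author_username", "X_type", "comment_thread_id", "votes_up_count", "comment_count")], 3)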
It is important not only to know who is talking with whom, but also to analyze the content of that communication. The following steps will introduce basic text mining techniques.
First, we will install and load the tidytext library, which provides the text mining functions, and load the text data into a format that is easier to work with.
NOTE: The installation could take several minutes. Be patient.
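If the package is not yet available, it can be installed once with install.packages(); this is the step that can take several minutes:
install.packages("tidytext")   # run once; only needed if tidytext is not already installed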
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.0.3
text_df <- tibble(messageId = forum_data$mongoid, text = forum_data$body)
Explanation: We create a tibble (a tidy data frame) with one row per post, keeping only the post id (mongoid) and its text (body).
Then we tokenize the text. This means that we split it into individual tokens (words). As we are not interested in common words such as "a", "the", or "of" (so-called stop words), we remove them from the list.
tokenized= text_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining, by = "word"
head(tokenized)
The next step after tokenizing our text is to count the frequency of the different words. We will use ggplot2, a more sophisticated alternative to plot, to create the graph.
library(ggplot2)
tokenized %>%
count(word, sort = TRUE) %>%
filter(n > 100) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
We can do this kind of analysis for a given thread. We just need to obtain the ids of the messages that belong to that thread. For example, for the thread "Temperature and [CO2] correlation" with id '561e9800d2aca5e7dd000618':
thread_messages<-forum_data[forum_data$comment_thread_id %in% c('561e9800d2aca5e7dd000618'),]$mongoid
tokenized %>%
filter(messageId %in% thread_messages) %>%
count(word, sort = TRUE) %>%
filter(n > 10) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
One interesting thing that we can do with the text is to obtain the “sentiment value” based on the words used in the text. There are databases that rank each word based on its positive or negative emotion value. We can use these databases to estimate the sentiment score of a text.
library(tidyr)
sentiments <-
tokenized %>%
inner_join(get_sentiments("bing")) %>%
count(messageId, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
Explanation: We join each word with the bing sentiment lexicon, count how many positive and negative words each message contains, spread those counts into two columns, and compute the net sentiment as positive minus negative.
We can visualize these scores using ggplot2, a visualization library that is more powerful than plot, but also a little more complex.
library(ggplot2)
ggplot(sentiments, aes(messageId, sentiment,fill=sentiment)) +
geom_col(show.legend = FALSE) +
scale_fill_gradient(low="red", high="blue")+
coord_flip()
We can also find out which are the most frequent positive and negative words. For that, we only need to count their appearances in the different posts.
bing_word_counts <- tokenized %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
## Selecting by n
Another interesting way to summarize the information about the frequency of words in the posts is the word cloud.
To create word clouds in R, we use the wordcloud library. We feed this library the words and their frequencies.
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.0.3
## Loading required package: RColorBrewer
tokenized %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
## Warning in wordcloud(word, n, max.words = 100): climate could not be fit on
## page. It will not be plotted.
We can also create a representation of the positive and negative words using this library's visualizations. In this case we divide the most frequent words into positive and negative and use the comparison.cloud function.
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
tokenized %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("red", "blue"),
max.words = 100)
## Joining, by = "word"
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## successfully could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## appreciated could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## properly could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## consistent could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## contribution could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## healthy could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## leading could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## strong could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## wonderful could not be fit on page. It will not be plotted.
While the frequency of words provides some indication of their importance, some words appear very frequently (for example "the", "of", "a") but do not provide much meaning. On the other hand, words that appear infrequently sometimes carry a lot of meaning (for example, scientific terms).
A better way to determine the importance of words is the TF-IDF metric (Term Frequency - Inverse Document Frequency). This metric multiplies the frequency of a word within a post or document (term frequency) by its inverse document frequency, which is high for words that appear in few posts and low for words that appear in most of them. The rarer a word is in the forum as a whole, the higher its TF-IDF in the posts that use it frequently.
message_words <- text_df %>%
unnest_tokens(word, text) %>%
count(messageId, word, sort = TRUE)
total_words <- message_words %>%
group_by(messageId) %>%
summarize(total = sum(n))
## `summarise()` ungrouping output (override with `.groups` argument)
message_words <- left_join(message_words, total_words)
## Joining, by = "messageId"
message_words <- message_words %>%
bind_tf_idf(word, messageId, n)
message_words %>%
select(-total) %>%
arrange(desc(tf_idf))
Until now, we have analyzed words individually. A more interesting analysis is to take several words together. We call these combinations of words n-grams, where n is the number of words that we analyze together.
For example, let's extract the 2-grams, or bigrams, that occur most frequently in the posts.
bigrams <- text_df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
bigrams %>%
count(bigram, sort = TRUE)
To get a more meaningful list of n-grams we need to remove the stop words. For that we separate the bigram into its two words and eliminate the n-gram if either of the words is in the stop-word list.
bigrams_separated <- bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# new bigram counts:
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_counts
We can visualize the relation between these words via a network diagram. In this graph each node will be a word, and there will be a link between two words if they appear together in a frequent bigram.
library(igraph)       # provides graph_from_data_frame()
library(visNetwork)   # provides toVisNetworkData() and visNetwork()
bigram_graph <- bigram_counts %>%
filter(n > 10) %>%
graph_from_data_frame()
graph_data <- toVisNetworkData(bigram_graph)
visNetwork(nodes = graph_data$nodes, edges = graph_data$edges, height = "500px") %>%
visIgraphLayout()
Explanation: graph_from_data_frame() turns the table of frequent bigram counts into a graph object whose nodes are words and whose links are bigrams, toVisNetworkData() converts that graph into the node and edge tables that visNetwork expects, and visIgraphLayout() lays out the network.
We can work with n-grams of any length. For example, here is the code to obtain 3-grams from the posts.
text_df %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word) %>%
count(word1, word2, word3, sort = TRUE)
We can also obtain TF-IDF metrics for n-grams in a similar way that we did it for single words.
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
bigram_tf_idf <- bigrams_united %>%
count(messageId, bigram) %>%
bind_tf_idf(bigram, messageId, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf
Another way to find relationships between words is to compute correlations between them, that is, how often they are used together in the same posts.
To find the correlation between different words we use the widyr library. We need to install it and load it.
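As before, the package can be installed once if it is not already available:
install.packages("widyr")   # run once if widyr is not installed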
library(widyr)
## Warning: package 'widyr' was built under R version 4.0.3
word_cors <- tokenized %>%
group_by(word) %>%
filter(n() >= 20) %>%
pairwise_cor(word, messageId, sort = TRUE)
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
word_cors
We can see the words most correlated with words of interest such as "climate", "denial", "sea", or "co2".
word_cors %>%
filter(item1 %in% c("climate", "denial", "sea", "co2")) %>%
group_by(item1) %>%
top_n(6) %>%
ungroup() %>%
mutate(item2 = reorder(item2, correlation)) %>%
ggplot(aes(item2, correlation)) +
geom_bar(stat = "identity") +
facet_wrap(~ item1, scales = "free") +
coord_flip()
## Selecting by correlation
We can also use the correlation information between words to create a network. In this network, each word is a node and the links between them depend on the correlation coefficient between those words.
correlation_graph<-word_cors %>%
filter(correlation > .30) %>%
graph_from_data_frame()
graph_data <- toVisNetworkData(correlation_graph)
graph_data$edges$width<-graph_data$edges$correlation*20
visNetwork(nodes = graph_data$nodes, edges = graph_data$edges, height = "500px") %>%
visIgraphLayout()
Now we will create a dashboard to explore the forum activity. The dashboard will contain two tabs: a social network visualization and a word cloud by author.
Here is the code to create such a dashboard:
#
# This is a Shiny web application. You can run the application by clicking
# the 'Run App' button above.
#
# Find out more about building applications with Shiny here:
#
# http://shiny.rstudio.com/
#
library(shiny)
library(shinydashboard)
##
## Attaching package: 'shinydashboard'
## The following object is masked from 'package:graphics':
##
## box
library(tidyverse)
library(visNetwork)
library(igraph)
library(tidytext)
library(wordcloud2)
## Warning: package 'wordcloud2' was built under R version 4.0.3
forum_data<-read.csv('ClimateForum15.csv',stringsAsFactors=FALSE)
nodes_authors =
forum_data %>%
group_by(author_id) %>%
summarize(
username=last(author_username),
posts=n(),
thread_started=length(X_type[X_type %in% c('CommentThread')]),
votes_up=sum(votes_up_count),
comments=sum(comment_count[!is.na(comment_count)])
)
## `summarise()` ungrouping output (override with `.groups` argument)
nodes_authors$thread_initiatior<-ifelse(nodes_authors$thread_started>0,"Initiator","Commenter")
links_posts =
forum_data %>%
filter(X_type %in% c('Comment')) %>%
select(author_id,mongoid,parent_ids,comment_thread_id) %>%
mutate(parent=ifelse(parent_ids=="",as.character(comment_thread_id),as.character(parent_ids))) %>%
select(author_id,mongoid,parent)
get_user_from_post = function(post_id){
forum_data[forum_data$mongoid==post_id,]$author_id
}
links_posts$author_parent<-sapply(links_posts$parent,get_user_from_post)
weighted_links = links_posts %>%
group_by(author_id,author_parent) %>%
summarize(
weight=n()
)
## `summarise()` regrouping output by 'author_id' (override with `.groups` argument)
text_df = tibble(messageId = forum_data$mongoid, text = forum_data$body)
tokenized= text_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining, by = "word"
wordFreq = tokenized %>%
anti_join(stop_words) %>%
count(word) %>%
filter(n>30)
## Joining, by = "word"
sentiments =
tokenized %>%
inner_join(get_sentiments("bing")) %>%
count(messageId, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
get_sentiment=function(id){
author_messages<- forum_data %>%
filter(author_id %in% c(id))
author_sentiments<- sentiments %>%
filter(messageId %in% author_messages$mongoid) %>%   # match against the ids of that author's posts
summarize(
count=sum(sentiment)
)
author_sentiments$count
}
nodes_authors$sentiment=sapply(nodes_authors$author_id,get_sentiment)
nodes_authors$sentiment = cut(nodes_authors$sentiment, breaks=c(-Inf, -5, 0, 5, +Inf), labels=c('Very Negative','Negative','Positive','Very Positive'))
g<-graph_from_data_frame(weighted_links, directed = TRUE, vertices = nodes_authors)
size_options = list("Number of Posts"="posts",
"Threads Started"="thread_started",
"Up-votes"="votes_up",
"Number of Comments"="comments")
color_options = list("None"="none",
"Thread Initiators"="thread_initiatior",
"Community"="community",
"Sentiment"="sentiment")
# Define UI for application that draws a histogram
ui <- dashboardPage(
dashboardHeader(title = "Discourse Analytics"),
dashboardSidebar(disable=TRUE),
dashboardBody(
tabBox(height = "1100px", width = "1000px",
tabPanel(title = tagList(icon("project-diagram",
class = "fas fa-project-diagram"),
"NETWORK"),
fluidRow(
box(title = "Controls", width=4, status = "primary", solidHeader=TRUE,
selectInput("size", label = "Size represents:", choices = size_options, selected = "posts"),
selectInput("color", label = "Color represents:", choices = color_options, selected = "none"),
sliderInput("posts", label = "Number of Posts", min = min(nodes_authors$posts), max = max(nodes_authors$posts), value = c(min(nodes_authors$posts), max(nodes_authors$posts)))
),
box(title = "Network", width=8, status = "primary", visNetworkOutput("network"))
)
),
tabPanel(title = tagList(icon("cloud",
class="fas fa-cloud"),
"WORD CLOUDS"),
fluidRow(
box(title = "Controls", width=4, status = "primary", solidHeader=TRUE,
selectInput("ngram", label = "Use N-grams of Size:", choices = c(1,2,3), selected = 1),
selectInput("author", label = "By author:", choices = c("All",nodes_authors[nodes_authors$posts>10,]$username), selected = "All")
),
box(title = "Word Cloud", width=8, status = "primary", wordcloud2Output("wordcloud"))
)
)
)
# Second tab content
)
)
# Define server logic required to draw a histogram
server <- function(input, output) {
output$network<- renderVisNetwork({
selected_authors=nodes_authors[nodes_authors$posts %in% seq(input$posts[1],input$posts[2]),]$author_id
g2 = induced_subgraph(g, as.character(selected_authors))
cfg = cluster_fast_greedy(as.undirected(g2))
V(g2)$community = cfg$membership
graph_data <- toVisNetworkData(g2)
graph_data$nodes$label=graph_data$nodes$username
graph_data$nodes$value=graph_data$nodes[[input$size]]
graph_data$nodes$group=graph_data$nodes[[input$color]]
graph_data$edges$width=graph_data$edges$weight
visNetwork(nodes = graph_data$nodes, edges = graph_data$edges, height = "500px") %>%
visIgraphLayout(randomSeed = 123) %>%
visNodes(color = list(background = "lightblue",
border = "darkblue",
highlight = "yellow")) %>%
visOptions(highlightNearest = list(enabled = T, degree=0, hover = F),
nodesIdSelection = T) %>%
visLegend()
})
output$wordcloud<-renderWordcloud2({
if(input$author=="All"){
text_df_selected=text_df
limitOne=30
limitTwo=10
limitThree=5
}
else{
author_messages= forum_data %>%
filter(author_username %in% c(input$author))
text_df_selected=text_df%>%filter(messageId %in% author_messages$mongoid)
limitOne=1
limitTwo=1
limitThree=1
}
if(input$ngram==1){
tok= text_df_selected %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
wordFreq = tok %>%
count(word) %>%
filter(n>limitOne)
}
if(input$ngram==2){
bigrams = text_df_selected %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
bigrams %>%
count(bigram, sort = TRUE)
bigrams_separated = bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered = bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
bigram_counts = bigrams_filtered %>%
count(word1, word2, sort = TRUE)
tok = bigrams_filtered %>%
unite(word, word1, word2, sep = " ")
wordFreq = tok %>%
count(word) %>%
filter(n>limitTwo)
}
if(input$ngram==3){
tok=text_df_selected %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word) %>%
count(word1, word2, word3, sort = TRUE)
tok = tok %>%
unite(word, word1, word2, word3, sep = " ")
wordFreq = tok %>%
filter(n>limitThree)
}
wordcloud2(data = wordFreq, size = 1)
})
}
# Run the application
shinyApp(ui = ui, server = server)
Social Network Analysis
The first step in analyzing this forum will be to examine the structure of the interactions between the authors. For example, we could create a graph in which each node is an author and a link is created between two nodes (authors) if a post of the first author is a response to a post of the second author. This creates a "social network" of the forum participants.
To be able to create such a graph, we need to create two datasets: one that contains summarized information about each author (the nodes) and one that contains information about the responses between posts (the links). We will start with the nodes dataframe.
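A sketch of the nodes dataframe, mirroring the construction used in the dashboard code above: group the posts by author and summarize their activity.
library(dplyr)   # for group_by(), summarize(), etc. (if not already loaded)
nodes_authors <- forum_data %>%
  group_by(author_id) %>%
  summarize(
    username = last(author_username),                                 # user name of the author
    posts = n(),                                                      # number of posts written
    thread_started = length(X_type[X_type %in% c('CommentThread')]),  # threads started
    votes_up = sum(votes_up_count),                                   # up-votes received
    comments = sum(comment_count[!is.na(comment_count)])              # comments in the author's threads
  )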
Now we create the links dataframe:
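A sketch, again mirroring the dashboard code: keep only the posts of type 'Comment' and, for each one, record its author, its id, and the id of the post (or thread) it responds to.
links_posts <- forum_data %>%
  filter(X_type %in% c('Comment')) %>%
  select(author_id, mongoid, parent_ids, comment_thread_id) %>%
  # if the comment has no explicit parent, it is a direct response to the thread
  mutate(parent = ifelse(parent_ids == "", as.character(comment_thread_id), as.character(parent_ids))) %>%
  select(author_id, mongoid, parent)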
But that link dataframe is not what we want. We do not want a link between the posts, but a link between authors. We need to get the author_id of the parent post.
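One way to do this, as in the dashboard code, is a small helper that looks up the author of a post id and applies it to every parent post:
get_user_from_post <- function(post_id){
  forum_data[forum_data$mongoid == post_id, ]$author_id
}
links_posts$author_parent <- sapply(links_posts$parent, get_user_from_post)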
Because there could be several links between the same pair of authors, we count the number of times that an author has responded to another author and add that number to the link. This is usually represented as the weight or strength of the link (the more responses between two authors, the stronger their relationship).
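A sketch of the aggregation: group the links by the pair of authors and count how many responses connect them.
weighted_links <- links_posts %>%
  group_by(author_id, author_parent) %>%
  summarize(
    weight = n()   # number of responses from author_id to author_parent
  )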
With these two dataframes (nodes_authors and weighted_links) we are able to create our network. For this we will use the igraph library in R, which contains useful functions to manipulate graphs and networks, and the visNetwork library to visualize the graphs in an interactive way.
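A minimal sketch of building and drawing the network, assuming the nodes_authors and weighted_links dataframes from the previous steps:
library(igraph)
library(visNetwork)
g <- graph_from_data_frame(weighted_links, directed = TRUE, vertices = nodes_authors)
graph_data <- toVisNetworkData(g)
visNetwork(nodes = graph_data$nodes, edges = graph_data$edges, height = "500px") %>%
  visIgraphLayout()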
To create a more interesting graph we want the following (a sketch of these options follows the list):
* The size of each node should be related to the number of posts of its author
* The label of each node should be the user name
* The width of each link should be related to its weight
* When a node is selected, we want to highlight the nodes to which it is connected
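A sketch of these options, assuming the graph g built above; node sizes, labels, and edge widths are plain columns in the visNetwork node and edge tables:
graph_data <- toVisNetworkData(g)
graph_data$nodes$value <- graph_data$nodes$posts      # node size ~ number of posts
graph_data$nodes$label <- graph_data$nodes$username   # label = user name
graph_data$edges$width <- graph_data$edges$weight     # link width ~ weight
visNetwork(nodes = graph_data$nodes, edges = graph_data$edges, height = "500px") %>%
  visIgraphLayout() %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)   # highlight neighbours of the selected node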
Now we want to show the same graph, but with the size of the nodes representing the number of up-votes, and it should only include authors that have more than 2 posts.
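A sketch under those two conditions, using induced_subgraph() to keep only the authors with more than 2 posts and mapping the node size to votes_up:
g_active <- induced_subgraph(g, V(g)[posts > 2])       # keep authors with more than 2 posts
graph_data <- toVisNetworkData(g_active)
graph_data$nodes$value <- graph_data$nodes$votes_up    # node size ~ up-votes received
graph_data$nodes$label <- graph_data$nodes$username
graph_data$edges$width <- graph_data$edges$weight
visNetwork(nodes = graph_data$nodes, edges = graph_data$edges, height = "500px") %>%
  visIgraphLayout()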
### Graph Metrics
We can also obtain several metrics from the networks (a sketch computing them follows the list):
* Edge Density: The ratio of the number of edges to the number of possible edges
* Reciprocity: The proportion of mutual connections in a directed graph, most commonly defined as the probability that the opposite counterpart of a directed edge is also included in the graph
* Diameter: The longest distance (shortest path) between connected nodes
* Average Degree: The average number of neighbours of a node
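These metrics can be obtained with igraph functions; a minimal sketch on the author graph g:
edge_density(g)                 # ratio of existing edges to possible edges
reciprocity(g)                  # proportion of mutual (reciprocated) connections
diameter(g)                     # longest shortest path between connected nodes
mean(degree(g, mode = "all"))   # average degree (average number of neighbours)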
We can also obtain certain metrics for each node and graph their distribution. For example, the degree of a node is the number of other nodes that are connected to it.
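For example, we can compute the degree of every node and plot its distribution with a simple histogram:
degrees <- degree(g, mode = "all")   # number of connections of each author
hist(degrees, breaks = 20, main = "Degree distribution", xlab = "Degree")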

### Centrality Measures
An important set of measures are centrality measures. They measure the "importance" of a node based on different aspects (a sketch computing them with igraph follows the list):
* Closeness: How easy it is to reach that node from the other nodes
* Betweenness: How many paths between other nodes pass through this node
* Hub: How many other nodes are pointed to by this node
* Authority: How many other nodes point to this node
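A sketch of how these measures can be obtained with igraph (hub_score() and authority_score() return a list whose $vector element holds the per-node scores):
closeness(g)                # closeness centrality of each author
betweenness(g)              # betweenness centrality of each author
hub_score(g)$vector         # hub score of each author
authority_score(g)$vector   # authority score of each author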
### Community Detection
We can also cluster nodes together to detect communities. To do this, we can use several metrics of the nodes or links. For example, the Fast Greedy algorithm tries to find dense subgraphs, also called communities, by directly optimizing a modularity score.
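A sketch using cluster_fast_greedy(), which works on undirected graphs, so we first convert g; the resulting membership can be stored as a node attribute and used, for example, to color the network:
cfg <- cluster_fast_greedy(as.undirected(g))   # detect communities by optimizing modularity
V(g)$community <- cfg$membership               # community id of each author
table(cfg$membership)                          # size of each community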