This tutorial will guide you through the analysis and reporting of students' online discussions.
For this tutorial we will use a dataset from a MOOC on Climate Change. The raw data file can be downloaded here.
Let’s load this dataset and see what is in it.
forum_data<-read.csv('ClimateForum15.csv',stringsAsFactors=FALSE)
colnames(forum_data)
## [1] "author_id" "author_username"
## [3] "created_at" "anonymous_to_peers"
## [5] "votes_count" "votes_point"
## [7] "votes_down_count" "votes_up"
## [9] "votes_down" "votes_up_count"
## [11] "parent_ids" "historical_abuse_flaggers"
## [13] "comment_thread_id" "X_type"
## [15] "updated_at" "abuse_flaggers"
## [17] "child_count" "visible"
## [19] "sk" "anonymous"
## [21] "course_id" "at_position_list"
## [23] "mongoid" "endorsed"
## [25] "parent_id" "last_activity_at"
## [27] "closed" "title"
## [29] "thread_type" "commentable_id"
## [31] "group_id" "tags_array"
## [33] "endorsement_user_id" "endorsement_time"
## [35] "comment_count" "pinned"
## [37] "body"
Each post in the dataset contains the following important fields:
* author_id: Numeric identifier of the author of the post
* author_username: Username of the author of the post
* votes_count: How many votes the post has received
* votes_point: How many points (positive minus negative votes) the post has
* votes_down_count: How many negative votes the post has
* votes_up: List of users that up-voted the post
* votes_down: List of users that down-voted the post (no posts have down-votes)
* votes_up_count: How many positive votes the post has
* parent_ids: The id of the post to which this post is a response. If the post is the first of a thread or a direct response to the thread, it does not have a parent_ids
* comment_thread_id: Id of the root post of the thread
* X_type: "CommentThread" if it is the first post of the thread, "Comment" if it is a response in the thread
* child_count: How many direct responses this post has
* mongoid: Id of the post
* parent_id: The same as parent_ids
* title: Title of the thread
* comment_count: How many comments a CommentThread has
* body: Text of the post
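As a quick, optional check, we can peek at a few of these fields for the first rows of the dataset (any subset of the columns listed above will do):
head(forum_data[, c("author_username", "X_type", "comment_thread_id", "votes_up_count", "comment_count")], 3)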
It is important not only to know who is talking with whom, but also to analyze the content of that communication. The following steps will introduce basic text mining techniques.
First, we will install and load the tidytext library, which provides the text mining functions, and load the text data into a format that is easier to work with.
NOTE: The installation could take several minutes. Be patient.
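If the package is not yet available, it can be installed once with install.packages(); this is the step that can take several minutes:
install.packages("tidytext")   # run once; only needed if tidytext is not already installed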
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.0.3
text_df <- tibble(messageId = forum_data$mongoid, text = forum_data$body)
Explanation: We create a tibble (a tidy data frame) with one row per post, keeping only the post id (mongoid) and its text (body).
Then we tokenize the text. This means that we split it into individual tokens (words). As we are not interested in common words such as "a", "the", or "of" (so-called stop words), we remove them from the list.
tokenized= text_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining, by = "word"
head(tokenized)
The next step after tokenizing our text is to count the frequency of the different words. We will use ggplot2, a more sophisticated alternative to plot, to create the graph.
library(ggplot2)
tokenized %>%
count(word, sort = TRUE) %>%
filter(n > 100) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
We can do this kind of analysis for a given thread. We just need to obtain the ids of the messages that belong to that thread. For example, for the thread "Temperature and [CO2] correlation" with id '561e9800d2aca5e7dd000618':
thread_messages<-forum_data[forum_data$comment_thread_id %in% c('561e9800d2aca5e7dd000618'),]$mongoid
tokenized %>%
filter(messageId %in% thread_messages) %>%
count(word, sort = TRUE) %>%
filter(n > 10) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
One interesting thing that we can do with the text is to obtain the “sentiment value” based on the words used in the text. There are databases that rank each word based on its positive or negative emotion value. We can use these databases to estimate the sentiment score of a text.
library(tidyr)
sentiments <-
tokenized %>%
inner_join(get_sentiments("bing")) %>%
count(messageId, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
Explanation: We join each word with the bing sentiment lexicon, count how many positive and negative words each message contains, spread those counts into two columns, and compute the net sentiment as positive minus negative.
We can visualize these scores using ggplot2, a visualization library that is more powerful than plot, but also a little more complex.
library(ggplot2)
ggplot(sentiments, aes(messageId, sentiment,fill=sentiment)) +
geom_col(show.legend = FALSE) +
scale_fill_gradient(low="red", high="blue")+
coord_flip()
We can also find out which are the most frequent positive and negative words. For that, we only need to count their appearances in the different posts.
bing_word_counts <- tokenized %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
## Selecting by n
Another interesting way to summarize the information about the frequency of words in the posts is the word cloud.
To create word clouds in R, we use the wordcloud library. We feed this library the words and their frequencies.
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.0.3
## Loading required package: RColorBrewer
tokenized %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining, by = "word"
## Warning in wordcloud(word, n, max.words = 100): climate could not be fit on
## page. It will not be plotted.
We can also create a representation of the positive and negative words using this library's visualizations. In this case we divide the most frequent words into positive and negative and use the comparison.cloud function.
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
tokenized %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("red", "blue"),
max.words = 100)
## Joining, by = "word"
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## successfully could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## appreciated could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## properly could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## consistent could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## contribution could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## healthy could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## leading could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## strong could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("red", "blue"), max.words = 100):
## wonderful could not be fit on page. It will not be plotted.
While the frequency of words provides some indication of their importance, some words appear very frequently (for example "the", "of", "a") but do not provide much meaning. On the other hand, words that appear infrequently sometimes carry a lot of meaning (for example, scientific terms).
A better way to determine the importance of words is the TF-IDF metric (Term Frequency - Inverse Document Frequency). This metric multiplies the frequency of a word within a post or document (term frequency) by its inverse document frequency, which is high for words that appear in few posts and low for words that appear in most of them. The rarer a word is in the forum as a whole, the higher its TF-IDF in the posts that use it frequently.
message_words <- text_df %>%
unnest_tokens(word, text) %>%
count(messageId, word, sort = TRUE)
total_words <- message_words %>%
group_by(messageId) %>%
summarize(total = sum(n))
## `summarise()` ungrouping output (override with `.groups` argument)
message_words <- left_join(message_words, total_words)
## Joining, by = "messageId"
message_words <- message_words %>%
bind_tf_idf(word, messageId, n)
message_words %>%
select(-total) %>%
arrange(desc(tf_idf))
Until now, we have analyzed words individually. A more interesting analysis is to take several words together. We call these combinations of words n-grams, where n is the number of words that we analyze together.
For example, let's extract the 2-grams, or bigrams, that occur most frequently in the posts.
bigrams <- text_df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
bigrams %>%
count(bigram, sort = TRUE)
To get a more meaningful list of n-grams we need to remove the stop words. For that we separate the bigram into its two words and eliminate the n-gram if either of the words is in the stop-word list.
bigrams_separated <- bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# new bigram counts:
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_counts
We can visualize the relation between these words via a network diagram. In this graph each node will be a word, and there will be a link between two words if they appear together in a frequent bigram.
library(igraph)       # provides graph_from_data_frame()
library(visNetwork)   # provides toVisNetworkData() and visNetwork()
bigram_graph <- bigram_counts %>%
filter(n > 10) %>%
graph_from_data_frame()
graph_data <- toVisNetworkData(bigram_graph)
visNetwork(nodes = graph_data$nodes, edges = graph_data$edges, height = "500px") %>%
visIgraphLayout()
Explanation: graph_from_data_frame() turns the table of frequent bigram counts into a graph object whose nodes are words and whose links are bigrams, toVisNetworkData() converts that graph into the node and edge tables that visNetwork expects, and visIgraphLayout() lays out the network.
We can work with n-grams of any length. For example, here is the code to obtain 3-grams from the posts.
text_df %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word) %>%
count(word1, word2, word3, sort = TRUE)
We can also obtain TF-IDF metrics for n-grams in a similar way that we did it for single words.
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
bigram_tf_idf <- bigrams_united %>%
count(messageId, bigram) %>%
bind_tf_idf(bigram, messageId, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf
Another way to find relationships between words is to compute correlations between them, that is, how often they are used together in the same posts.
To find the correlation between different words we use the widyr library. We need to install it and load it.
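As before, the package can be installed once if it is not already available:
install.packages("widyr")   # run once if widyr is not installed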
library(widyr)
## Warning: package 'widyr' was built under R version 4.0.3
word_cors <- tokenized %>%
group_by(word) %>%
filter(n() >= 20) %>%
pairwise_cor(word, messageId, sort = TRUE)
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
word_cors
We can see the words most correlated with words of interest such as "climate", "denial", "sea", or "co2".
word_cors %>%
filter(item1 %in% c("climate", "denial", "sea", "co2")) %>%
group_by(item1) %>%
top_n(6) %>%
ungroup() %>%
mutate(item2 = reorder(item2, correlation)) %>%
ggplot(aes(item2, correlation)) +
geom_bar(stat = "identity") +
facet_wrap(~ item1, scales = "free") +
coord_flip()
## Selecting by correlation
We can also use the correlation information between words to create a network. In this network, each word is a node and the links between them depend on the correlation coefficient between those words.
correlation_graph<-word_cors %>%
filter(correlation > .30) %>%
graph_from_data_frame()
graph_data <- toVisNetworkData(correlation_graph)
graph_data$edges$width<-graph_data$edges$correlation*20
visNetwork(nodes = graph_data$nodes, edges = graph_data$edges, height = "500px") %>%
visIgraphLayout()
Now we will create a dashboard to explore the forum activity. The dashboard will contain two tabs: a social network visualization and a word cloud by author.
Here is the code to create such a dashboard:
#
# This is a Shiny web application. You can run the application by clicking
# the 'Run App' button above.
#
# Find out more about building applications with Shiny here:
#
# http://shiny.rstudio.com/
#
library(shiny)
library(shinydashboard)
##
## Attaching package: 'shinydashboard'
## The following object is masked from 'package:graphics':
##
## box
library(tidyverse)
library(visNetwork)
library(igraph)
library(tidytext)
library(wordcloud2)
## Warning: package 'wordcloud2' was built under R version 4.0.3
forum_data<-read.csv('ClimateForum15.csv',stringsAsFactors=FALSE)
nodes_authors =
forum_data %>%
group_by(author_id) %>%
summarize(
username=last(author_username),
posts=n(),
thread_started=length(X_type[X_type %in% c('CommentThread')]),
votes_up=sum(votes_up_count),
comments=sum(comment_count[!is.na(comment_count)])
)
## `summarise()` ungrouping output (override with `.groups` argument)
nodes_authors$thread_initiatior<-ifelse(nodes_authors$thread_started>0,"Initiator","Commenter")
links_posts =
forum_data %>%
filter(X_type %in% c('Comment')) %>%
select(author_id,mongoid,parent_ids,comment_thread_id) %>%
mutate(parent=ifelse(parent_ids=="",as.character(comment_thread_id),as.character(parent_ids))) %>%
select(author_id,mongoid,parent)
get_user_from_post = function(post_id){
forum_data[forum_data$mongoid==post_id,]$author_id
}
links_posts$author_parent<-sapply(links_posts$parent,get_user_from_post)
weighted_links = links_posts %>%
group_by(author_id,author_parent) %>%
summarize(
weight=n()
)
## `summarise()` regrouping output by 'author_id' (override with `.groups` argument)
text_df = tibble(messageId = forum_data$mongoid, text = forum_data$body)
tokenized= text_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining, by = "word"
wordFreq = tokenized %>%
anti_join(stop_words) %>%
count(word) %>%
filter(n>30)
## Joining, by = "word"
sentiments =
tokenized %>%
inner_join(get_sentiments("bing")) %>%
count(messageId, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining, by = "word"
get_sentiment=function(id){
author_messages<- forum_data %>%
filter(author_id %in% c(id))
author_sentiments<- sentiments %>%
filter(messageId %in% author_messages$mongoid) %>%   # match against the ids of that author's posts
summarize(
count=sum(sentiment)
)
author_sentiments$count
}
nodes_authors$sentiment=sapply(nodes_authors$author_id,get_sentiment)
nodes_authors$sentiment = cut(nodes_authors$sentiment, breaks=c(-Inf, -5, 0, 5, +Inf), labels=c('Very Negative','Negative','Positive','Very Positive'))
g<-graph_from_data_frame(weighted_links, directed = TRUE, vertices = nodes_authors)
size_options = list("Number of Posts"="posts",
"Threads Started"="thread_started",
"Up-votes"="votes_up",
"Number of Comments"="comments")
color_options = list("None"="none",
"Thread Initiators"="thread_initiatior",
"Community"="community",
"Sentiment"="sentiment")
# Define UI for application that draws a histogram
ui <- dashboardPage(
dashboardHeader(title = "Discourse Analytics"),
dashboardSidebar(disable=TRUE),
dashboardBody(
tabBox(height = "1100px", width = "1000px",
tabPanel(title = tagList(icon("project-diagram",
class = "fas fa-project-diagram"),
"NETWORK"),
fluidRow(
box(title = "Controls", width=4, status = "primary", solidHeader=TRUE,
selectInput("size", label = "Size represents:", choices = size_options, selected = "posts"),
selectInput("color", label = "Color represents:", choices = color_options, selected = "none"),
sliderInput("posts", label = "Number of Posts", min = min(nodes_authors$posts), max = max(nodes_authors$posts), value = c(min(nodes_authors$posts), max(nodes_authors$posts)))
),
box(title = "Network", width=8, status = "primary", visNetworkOutput("network"))
)
),
tabPanel(title = tagList(icon("cloud",
class="fas fa-cloud"),
"WORD CLOUDS"),
fluidRow(
box(title = "Controls", width=4, status = "primary", solidHeader=TRUE,
selectInput("ngram", label = "Use N-grams of Size:", choices = c(1,2,3), selected = 1),
selectInput("author", label = "By author:", choices = c("All",nodes_authors[nodes_authors$posts>10,]$username), selected = "All")
),
box(title = "Word Cloud", width=8, status = "primary", wordcloud2Output("wordcloud"))
)
)
)
# Second tab content
)
)
# Define server logic required to draw a histogram
server <- function(input, output) {
output$network<- renderVisNetwork({
selected_authors=nodes_authors[nodes_authors$posts %in% seq(input$posts[1],input$posts[2]),]$author_id
g2 = induced_subgraph(g, as.character(selected_authors))
cfg = cluster_fast_greedy(as.undirected(g2))
V(g2)$community = cfg$membership
graph_data <- toVisNetworkData(g2)
graph_data$nodes$label=graph_data$nodes$username
graph_data$nodes$value=graph_data$nodes[[input$size]]
graph_data$nodes$group=graph_data$nodes[[input$color]]
graph_data$edges$width=graph_data$edges$weight
visNetwork(nodes = graph_data$nodes, edges = graph_data$edges, height = "500px") %>%
visIgraphLayout(randomSeed = 123) %>%
visNodes(color = list(background = "lightblue",
border = "darkblue",
highlight = "yellow")) %>%
visOptions(highlightNearest = list(enabled = T, degree=0, hover = F),
nodesIdSelection = T) %>%
visLegend()
})
output$wordcloud<-renderWordcloud2({
if(input$author=="All"){
text_df_selected=text_df
limitOne=30
limitTwo=10
limitThree=5
}
else{
author_messages= forum_data %>%
filter(author_username %in% c(input$author))
text_df_selected=text_df%>%filter(messageId %in% author_messages$mongoid)
limitOne=1
limitTwo=1
limitThree=1
}
if(input$ngram==1){
tok= text_df_selected %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
wordFreq = tok %>%
count(word) %>%
filter(n>limitOne)
}
if(input$ngram==2){
bigrams = text_df_selected %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
bigrams %>%
count(bigram, sort = TRUE)
bigrams_separated = bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered = bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
bigram_counts = bigrams_filtered %>%
count(word1, word2, sort = TRUE)
tok = bigrams_filtered %>%
unite(word, word1, word2, sep = " ")
wordFreq = tok %>%
count(word) %>%
filter(n>limitTwo)
}
if(input$ngram==3){
tok=text_df_selected %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word) %>%
count(word1, word2, word3, sort = TRUE)
tok = tok %>%
unite(word, word1, word2, word3, sep = " ")
wordFreq = tok %>%
filter(n>limitThree)
}
wordcloud2(data = wordFreq, size = 1)
})
}
# Run the application
shinyApp(ui = ui, server = server)
Social Network Analysis
The first step in analyzing this forum will be to examine the structure of the interactions between the authors. For example, we could create a graph in which each node is an author and a link is created between two nodes (authors) if a post of the first author is a response to a post of the second author. This creates a "social network" of the forum participants.
To be able to create such a graph, we need to create two datasets: one that contains summarized information about each author (the nodes) and one that contains information about the responses between posts (the links). We will start with the nodes dataframe.
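A sketch of the nodes dataframe, mirroring the construction used in the dashboard code above: group the posts by author and summarize their activity.
library(dplyr)   # for group_by(), summarize(), etc. (if not already loaded)
nodes_authors <- forum_data %>%
  group_by(author_id) %>%
  summarize(
    username = last(author_username),                                 # user name of the author
    posts = n(),                                                      # number of posts written
    thread_started = length(X_type[X_type %in% c('CommentThread')]),  # threads started
    votes_up = sum(votes_up_count),                                   # up-votes received
    comments = sum(comment_count[!is.na(comment_count)])              # comments in the author's threads
  )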
Now we create the links dataframe:
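A sketch, again mirroring the dashboard code: keep only the posts of type 'Comment' and, for each one, record its author, its id, and the id of the post (or thread) it responds to.
links_posts <- forum_data %>%
  filter(X_type %in% c('Comment')) %>%
  select(author_id, mongoid, parent_ids, comment_thread_id) %>%
  # if the comment has no explicit parent, it is a direct response to the thread
  mutate(parent = ifelse(parent_ids == "", as.character(comment_thread_id), as.character(parent_ids))) %>%
  select(author_id, mongoid, parent)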
But that link dataframe is not what we want. We do not want a link between the posts, but a link between authors. We need to get the author_id of the parent post.
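One way to do this, as in the dashboard code, is a small helper that looks up the author of a post id and applies it to every parent post:
get_user_from_post <- function(post_id){
  forum_data[forum_data$mongoid == post_id, ]$author_id
}
links_posts$author_parent <- sapply(links_posts$parent, get_user_from_post)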
Because there could be several links between the same pair of authors, we count the number of times that an author has responded to another author and add that number to the link. This is usually represented as the weight or strength of the link (the more responses between two authors, the stronger their relationship).
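A sketch of the aggregation: group the links by the pair of authors and count how many responses connect them.
weighted_links <- links_posts %>%
  group_by(author_id, author_parent) %>%
  summarize(
    weight = n()   # number of responses from author_id to author_parent
  )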
With these two dataframes (nodes_authors and weighted_links) we are able to create our network. For this we will use the igraph library in R, which contains useful functions to manipulate graphs and networks, and the visNetwork library to visualize the graphs in an interactive way.
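A minimal sketch of building and drawing the network, assuming the nodes_authors and weighted_links dataframes from the previous steps:
library(igraph)
library(visNetwork)
g <- graph_from_data_frame(weighted_links, directed = TRUE, vertices = nodes_authors)
graph_data <- toVisNetworkData(g)
visNetwork(nodes = graph_data$nodes, edges = graph_data$edges, height = "500px") %>%
  visIgraphLayout()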
To create a more interesting graph we want the following (a sketch of these options follows the list):
* The size of each node should be related to the number of posts of its author
* The label of each node should be the user name
* The width of each link should be related to its weight
* When a node is selected, we want to highlight the nodes to which it is connected
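A sketch of these options, assuming the graph g built above; node sizes, labels, and edge widths are plain columns in the visNetwork node and edge tables:
graph_data <- toVisNetworkData(g)
graph_data$nodes$value <- graph_data$nodes$posts      # node size ~ number of posts
graph_data$nodes$label <- graph_data$nodes$username   # label = user name
graph_data$edges$width <- graph_data$edges$weight     # link width ~ weight
visNetwork(nodes = graph_data$nodes, edges = graph_data$edges, height = "500px") %>%
  visIgraphLayout() %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)   # highlight neighbours of the selected node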
Now we want to show the same graph, but with the size of the nodes representing the number of up-votes, and it should only include authors that have more than 2 posts.
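A sketch under those two conditions, using induced_subgraph() to keep only the authors with more than 2 posts and mapping the node size to votes_up:
g_active <- induced_subgraph(g, V(g)[posts > 2])       # keep authors with more than 2 posts
graph_data <- toVisNetworkData(g_active)
graph_data$nodes$value <- graph_data$nodes$votes_up    # node size ~ up-votes received
graph_data$nodes$label <- graph_data$nodes$username
graph_data$edges$width <- graph_data$edges$weight
visNetwork(nodes = graph_data$nodes, edges = graph_data$edges, height = "500px") %>%
  visIgraphLayout()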
### Graph Metrics
We can also obtain several metrics from the networks (a sketch computing them follows the list):
* Edge Density: The ratio of the number of edges to the number of possible edges
* Reciprocity: The proportion of mutual connections in a directed graph, most commonly defined as the probability that the opposite counterpart of a directed edge is also included in the graph
* Diameter: The longest distance (shortest path) between connected nodes
* Average Degree: The average number of neighbours of a node
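These metrics can be obtained with igraph functions; a minimal sketch on the author graph g:
edge_density(g)                 # ratio of existing edges to possible edges
reciprocity(g)                  # proportion of mutual (reciprocated) connections
diameter(g)                     # longest shortest path between connected nodes
mean(degree(g, mode = "all"))   # average degree (average number of neighbours)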
We can also obtain certain metrics for each node and graph their distribution. For example, the degree of a node is the number of other nodes that are connected to it.
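For example, we can compute the degree of every node and plot its distribution with a simple histogram:
degrees <- degree(g, mode = "all")   # number of connections of each author
hist(degrees, breaks = 20, main = "Degree distribution", xlab = "Degree")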

### Centrality Measures
An important set of measures are centrality measures. They measure the "importance" of a node based on different aspects (a sketch computing them with igraph follows the list):
* Closeness: How easy it is to reach that node from the other nodes
* Betweenness: How many paths between other nodes pass through this node
* Hub: How many other nodes are pointed to by this node
* Authority: How many other nodes point to this node
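A sketch of how these measures can be obtained with igraph (hub_score() and authority_score() return a list whose $vector element holds the per-node scores):
closeness(g)                # closeness centrality of each author
betweenness(g)              # betweenness centrality of each author
hub_score(g)$vector         # hub score of each author
authority_score(g)$vector   # authority score of each author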
### Community Detection
We can also cluster nodes together to detect communities. To do this, we can use several metrics of the nodes or links. For example, the Fast Greedy algorithm tries to find dense subgraphs, also called communities, by directly optimizing a modularity score.
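A sketch using cluster_fast_greedy(), which works on undirected graphs, so we first convert g; the resulting membership can be stored as a node attribute and used, for example, to color the network:
cfg <- cluster_fast_greedy(as.undirected(g))   # detect communities by optimizing modularity
V(g)$community <- cfg$membership               # community id of each author
table(cfg$membership)                          # size of each community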