This case study presents a sentiment analysis of educators’ opinions on the Census at School program, mined from discussion forum posts of a MOOC-Ed titled Teaching Statistics through a Data investigation. The MOOC is a professional development program targeted toward educators in K-12 and post-secondary education systems in the United States. Census at School, in turn, is a program that supports student learning through statistical problem-solving. It is designed for students in grades 4-12 and fosters project-based learning through international collaboration.
The target audience for this study includes educators and other education stakeholders who are deciding whether to integrate Census at School in their classrooms. Additionally, this information can be useful to Census at School as an institution, and it can contribute to the process of improving the program.
This study follows the Data-Intensive Research Workflow presented by Krumm et al. (2018) to perform the analysis and communicate the findings.
Recent advancements in text mining approaches have led to a rise in studies that seek to understand learners’ emotions, opinions, and attitudes toward specific topics in learning communities. Prior studies have indicated that sentiment analysis can be applied to glean insightful information from educational environments, including Massive Open Online Courses (Wen et al., 2014; Yan et al., 2021). For instance, studies by Moreno-Marcos et al. (2018) and Onan (2021) used sentiment analysis algorithms to assess feedback on course evaluations and used the findings to aid decision-making processes and plan future course improvements.
From the literature review, I noticed that empirical studies have focused more on applying these techniques to public social media networks such as Twitter and Facebook. Part of my motivation for this study therefore comes from the need to experiment with how these tools can be applied in MOOC contexts. In this study, a mix of sentiment analysis and social network analysis algorithms was used to mine the perspectives and opinions of participants in a MOOC-Ed. As a researcher, I have taken advantage of the background and profiles of the MOOC participants, who are educators in K-12 and post-secondary education.
The main research questions that will lead to understanding educators’ sentiments toward Census at School include:
What are the words most frequently used by educators regarding the Census at School program?
What are educators’ sentiments regarding the Census at School program?
Who are the active participants who engaged in Census at School discussions in the forum, and what are their interaction patterns?
The source dataset that I am using in this study is a collection of 5788 discussion forum posts from a MOOC: Teaching Statistics through a Data investigation. The dataset contains discussion forum posts from course instances offered from Fall 2015 to Fall 2017. Due to the scope of this project, my sentiment analysis focused on a pared data frame containing observations from the Fall 2017 course, with discussion posts targeting Census at School as the topic.
The wrangling process involved a set of steps, including paring the dataset, which originally had 5788 observations from eight course offerings from Fall 2015 to Fall 2017. I have included comments in the code chunk to describe the manipulations performed.
#loading libraries
library(tidytext)
library(vader)
library(tidyverse)
library(here)
library(wordcloud2)
library(tidygraph)
library(ggraph)
library(igraph)
The observations of interest for this analysis are discussion forum posts from the course offered in Fall 2017, specifically those in topics that included “Census at School”.
| S/N | Variable | Description |
|---|---|---|
| 1 | post_content | Primary variable of interest containing discussion forum posts |
| 2 | discussion_id | Unique reference of new discussion post |
| 3 | forum_id | Unique reference of the forum |
| 4 | discussion_creator | User who initially created the forum post |
| 5 | discussion_poster | User who posted in the discussion forum |
| 6 | course_id | Unique identification of the course |
| 7 | post_title | For validating selected posts based on the title |
# Importing the dataset and reading the ID columns as character instead of double
mooc_forum <- read_csv(
  here("Data", "mooc_forum.csv"),
  col_types = cols(
    course_id = col_character(),
    forum_id = col_character(),
    discussion_id = col_character(),
    discussion_creator = col_character(),
    discussion_poster = col_character(),
    discussion_reference = col_character(),
    parent_id = col_character(),
    post_id = col_character()
  )
)
# Selecting variables of interest for analysis
mooc_forum_1 <- mooc_forum %>%
  select(post_content, discussion_id, forum_id, discussion_creator,
         discussion_poster, discussion_reference, post_title, course_id) %>%
  # Dropping rows where both course_id and discussion_id are missing
  filter(!is.na(course_id) | !is.na(discussion_id))
Skimming through the dataset showed that discussion forum posts focused on Census at School were available in the course offered in Fall 2017, which has a course_id of 73.
# Selecting the course of interest with ID 73, offered in Fall 2017
mooc_forum_2 <- mooc_forum_1 %>%
  filter(course_id == "73")
# Keeping only the discussions on Census at School, identified by discussion_id
census_discussions <- c("18582", "19132", "23801", "18555", "22624")
mooc_forum_3 <- mooc_forum_2 %>%
  filter(discussion_id %in% census_discussions)
mooc_forum_3
## # A tibble: 53 × 8
## post_content discussion_id forum_id discussion_crea… discussion_post…
## <chr> <chr> <chr> <chr> <chr>
## 1 If my students were… 18582 866 14612 14612
## 2 I agree with you. C… 18582 866 14612 14730
## 3 I agree that using … 18582 866 14612 14611
## 4 I agree this is mor… 18582 866 14612 14132
## 5 I agree with you on… 18582 866 14612 14232
## 6 I agree with starti… 18582 866 14612 17316
## 7 I think it would be… 18582 866 14612 14572
## 8 Thanks so much for … 18582 866 14612 14612
## 9 Thank you for shari… 18582 866 14612 14715
## 10 I'm also in agreeme… 18582 866 14612 13639
## # … with 43 more rows, and 3 more variables: discussion_reference <chr>,
## # post_title <chr>, course_id <chr>
The tibble above represents the trimmed data frame, which contains 53 discussion forum posts on Census at School topics. These were filtered using the discussion identification numbers 18582, 19132, 23801, 18555 and 22624.
The explore step involved tokenization and computation of descriptive statistics such as word counts and top tokens in the forum posts. The initial data frame after tokenization contained 372 words. After removing stop words and specified words such as “census” and “school”, the data frame had 340 words. To create an appealing and informative visualization and frequency graph of the words, I selected the top 50 words for further analysis.
# Tokenizing the forum posts into one word per row
tidy_mooc <- mooc_forum_3 %>%
  unnest_tokens(output = word, input = post_content) %>%
  relocate(word)
# Removing stop words and counting the remaining words
tidy_mooc3 <- tidy_mooc %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
# Removing custom words and saving the result in a new data frame
my_stopwords <- c("census", "school")
tidy_mooc4 <- tidy_mooc3 %>%
  filter(!word %in% my_stopwords)
# Saving the new data frame as a CSV file
final_mooc <- tidy_mooc4
write_csv(final_mooc, here("Data", "final_mooc.csv"))
# Selecting the 50 most frequent tokens for the frequency table and word cloud
# (ties at the cut-off are kept, as with top_n())
mooc_top_tokens <- final_mooc %>%
  slice_max(order_by = n, n = 50)
wordcloud2(mooc_top_tokens)
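To make the tokenize, filter, and count steps concrete, here is a minimal base R sketch of the same idea on two made-up posts. The post text and the short `toy_stopwords` list are assumptions for illustration only; the analysis itself relies on `unnest_tokens()` and the `stop_words` lexicon from tidytext, which handle many more cases.

```r
# Toy posts (hypothetical text, not taken from the dataset)
posts <- c(
  "I agree that students should clean the data.",
  "The data can be overwhelming for students."
)

# Small illustrative stop-word list; the real analysis uses tidytext::stop_words
toy_stopwords <- c("i", "that", "should", "the", "can", "be", "for")

# Lowercase, strip punctuation, and split on whitespace
# (roughly what unnest_tokens() does for word tokens)
tokens <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", posts)), "\\s+"))

# Drop stop words, then count the remaining words and sort by frequency
kept <- tokens[!tokens %in% toy_stopwords]
word_counts <- sort(table(kept), decreasing = TRUE)
word_counts
```

In this toy input, “data” and “students” each appear twice, which mirrors how a few dominant words end up driving the word cloud.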
From the word cloud, it appears that educators were mainly discussing “students”, “data”, and “questions”, and that they tended to “agree” with the points their peers were making about Census at School. To get a more detailed representation of the data, I used a bar graph that shows the actual frequency of each word.
# Frequent words from posts on Census at School
mooc_top_tokens %>%
filter(n > 4) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
coord_flip() +
labs(x = "Word \n", y = "\n Count ", title = "Frequent Words on Census at School \n") +
geom_text(aes(label = n), hjust = 1.2, colour = "white", fontface = "bold") +
theme(plot.title = element_text(hjust = 0.5),
axis.title.x = element_text(face="bold", colour="darkblue", size = 12),
axis.title.y = element_text(face="bold", colour="darkblue", size = 12))
The frequency graph provides further insight, especially by highlighting words that reveal more about how educators view Census at School. A word such as “overwhelming” can indicate that the nature and workload of the projects that students work on in Census at School may be too much for their learning experience. The words “clean” and “cleaning” point to the importance of data cleaning when students work with the data. To validate this, I checked instances where these words were used, and the interpretations were within my expectations.
Randomly selected validated instances
| Word | discussion_poster | post_content |
|---|---|---|
| Agree | 14135 | I agree with all of the above. I love that the kids can discover the need to clean data without it just being me telling them. |
| Cleaning | 13313 | I agree that allowing students to practice cleaning up data is valuable. Census at School would definitely allow students to "stumble onto" situations where data would need to be cleaned up a bit. |
| Overwhelming | 14141 | Sounds like Census at Schools would be beneficial in the classroom. 40 questions seem a little much but there is nothing students like better than to be heard or to have ownership in the learning and it sounds like this does just that! The only thing is the data would be overwhelming however this would give students a true picture of what data in the real world looks like. It would open up for discussions on How to clean up data. Several statistical questions can be posed.. It would allow through its simulation to teach students to become statistically literate. |
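One way to carry out this kind of validation is to pull every post that contains a given word and read it in context. The sketch below is an assumption about how that check could be done, not the procedure used for the table above; `toy_posts` and `find_posts()` are hypothetical names, and the data frame only mimics the column names of `mooc_forum_3`.

```r
# Hypothetical mini data frame with the same column names as mooc_forum_3
toy_posts <- data.frame(
  discussion_poster = c("A", "B", "C"),
  post_content = c(
    "The data would be overwhelming at first.",
    "I agree that cleaning data is valuable.",
    "Students enjoy posing their own questions."
  ),
  stringsAsFactors = FALSE
)

# Return the posts that contain a given whole word, ignoring case
find_posts <- function(df, word) {
  pattern <- paste0("\\b", word, "\\b")
  df[grepl(pattern, df$post_content, ignore.case = TRUE, perl = TRUE), ]
}

find_posts(toy_posts, "overwhelming")
find_posts(toy_posts, "clean")  # no match: "cleaning" is a different token
```

Matching on word boundaries keeps “clean” and “cleaning” separate, which is consistent with how they appear as distinct tokens in the frequency counts.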
To conduct sentiment analysis on the discussion forum posts, I used the VADER algorithm and computed the mean compound score of the post content. I selected this approach to obtain a single collective sentiment value that helps in understanding the nature of educators’ perceptions of Census at School.
vader_mooc <- vader_df(mooc_forum_3$post_content)
# head() is left commented out because its output floods the page when run
# head(vader_mooc)
#Computing the mean compound score of the forum posts
mean(vader_mooc$compound)
## [1] 0.7236226
The computed mean compound score is 0.724, which leans substantially toward positive sentiment. The VADER compound score ranges from -1 (most negative) to +1 (most positive). On this basis, I can answer the second research question: educators’ overall sentiments toward the Census at School program were supportive and positive.
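For readers wondering why the compound score is bounded between -1 and +1, the reference VADER implementation normalizes the summed word valences x as x / sqrt(x^2 + alpha), with alpha = 15 by default. The sketch below shows only that normalization step; `vader_normalize()` is a hypothetical helper written for illustration, and it is not a function exported by the vader package.

```r
# Normalization used by the reference VADER implementation: maps any summed
# valence score x into the open interval (-1, 1); alpha = 15 is its default
vader_normalize <- function(x, alpha = 15) {
  x / sqrt(x^2 + alpha)
}

vader_normalize(0)    # no net valence
vader_normalize(4)    # moderately positive sum
vader_normalize(-4)   # same magnitude, negative
```

Because of the square root in the denominator, the score approaches but never reaches the extremes, so a mean above 0.7 already reflects strongly positive wording.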
The analysis provided baseline information that helped reveal useful insights into how educators perceive the Census at School program and the level of interaction behind those conversations. To structure the discussion, I use the research questions to guide the presentation of the findings.
From the word counts and word cloud, it can be observed that educators were discussing issues pertaining to their students and how they learn about data in the Census at School program. The five most frequent words in the discussions were “students”, “data”, “agree”, “questions” and “clean”. During verification of the instances, the word “agree” indicated that the educators’ perceptions and contributions were in unison and that they were validating each other’s thoughts and experiences. Additionally, although this was a small dataset, words such as “clean” and “cleaning” collectively suggest that working with Census at School questions and activities is linked to cleaning data. Furthermore, words such as “overwhelming” suggest that some educators thought the activities could be overwhelming for students.
From the modelling activity, which involved computing the VADER compound score, the sentiments of educators in the forum leaned positive. The compound score of 0.724 sits near the positive end of the scale, given that +1 is the most positive value. This was within my expectation: when I cross-checked the instances, the majority of the posts were positive and supportive of using Census at School for teaching and learning about data.
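A mean score can hide how individual posts are distributed, so a complementary check is to label each post and tabulate the shares. The sketch below uses the cut-offs conventionally recommended for VADER (compound >= 0.05 positive, <= -0.05 negative, otherwise neutral). The `compound` vector is made up for illustration and stands in for `vader_mooc$compound`; `label_sentiment()` is a hypothetical helper.

```r
# Hypothetical compound scores standing in for vader_mooc$compound
compound <- c(0.91, 0.64, 0.02, -0.30, 0.85, 0.77)

# Conventional VADER cut-offs: >= 0.05 positive, <= -0.05 negative, else neutral
label_sentiment <- function(score) {
  ifelse(score >= 0.05, "positive",
         ifelse(score <= -0.05, "negative", "neutral"))
}

labels <- label_sentiment(compound)

# Counts and proportions of posts in each sentiment class
table(labels)
prop.table(table(labels))
```

Applied to the real scores, a table like this would show directly whether “the majority of the posts were positive” holds post by post and not only on average.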
The centrality measure identified the actors with the highest degree values as 14612, 13639, 14611, 14141, 16600 and 14730. Of these, the actors whose posts received the most replies were 14612, 13639 and 14141. Based on their in-degree values, I presume that some of these actors could be instructors, although I have to admit that more context is required on this point. The network is directed, and the graph shows that the forum posts were spread across separate groups. One subgroup is made up mostly of replies (about 11 actors replied to that discussion forum thread). Another subgroup, with actors 14612 and 14730, demonstrates mutual engagement.
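As a rough illustration of what in-degree and out-degree mean in a directed reply network, the base R sketch below counts replies received and sent from an edge list. It assumes that each reply forms an edge from the poster to the thread creator and that self-posts are dropped; the actual network in this study may have been built differently (for example with igraph or tidygraph from other ID columns), and the actor labels `u1` to `u4` are hypothetical.

```r
# Hypothetical reply edges: each row is one reply, from poster to thread creator
edges <- data.frame(
  from = c("u2", "u3", "u4", "u1", "u2", "u1"),
  to   = c("u1", "u1", "u1", "u2", "u1", "u1"),
  stringsAsFactors = FALSE
)

# Drop self-loops (a creator posting in their own thread)
edges <- edges[edges$from != edges$to, ]

actors <- sort(unique(c(edges$from, edges$to)))

# In-degree: replies received; out-degree: replies sent
in_degree  <- table(factor(edges$to,   levels = actors))
out_degree <- table(factor(edges$from, levels = actors))

degree_summary <- data.frame(
  actor      = actors,
  in_degree  = as.integer(in_degree),
  out_degree = as.integer(out_degree)
)

# Actors ordered by replies received
degree_summary[order(-degree_summary$in_degree), ]
```

In this toy network `u1` receives the most replies, while `u1` and `u2` reply to each other, which is the kind of mutual engagement described above.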
The main limitation is the scope of the study, which was constrained by the time frame and available resources. As noted, the sample was based on one unit of the course, and the observations were filtered to include only the precise topic of interest. This implies that the findings cannot be generalized to broader contexts.
Furthermore, since this course is my first experience with sentiment analysis (SA) and social network analysis (SNA) algorithms, I have to acknowledge that the approaches and algorithms used are the ones learned in this class. Other approaches might yield better results and more informative findings.
The project stands as a pilot study that can be developed into a full study in the future. Furthermore, while searching for relevant literature, I realized that there is still a gap in the number of studies that focus on text mining in MOOC forums.
Although the findings are limited, they can still provide foundational information for policy, research and practice. These opinions reveal educators’ positivity toward Census at School, which can be valuable information for peer educators and policy makers who are deciding whether to integrate the program into their schools and classrooms. Census at School as an organization can also benefit from these findings, as they indicate which elements of the program may need redesign and improvement. For example, words such as “clean data”, “messy” and “overwhelming” can send a message about the type of activities designed for students.
The dataset was intentionally provided by the instructor, Dr. Kellogg, for use in this class, which I took as consent to use it for my analysis. However, demonstrating ethical conduct as a researcher is fundamental to integrity and trustworthiness. I have therefore treated the data as confidential, especially in instances where the identities and names of users were exposed. This report will be used within the scope of this class, and I do not intend to share these findings publicly (at least for now).
There are ample ways of expanding on this study. The topic need not be Census at School per se; other topics discussed in the MOOC forums could also be worth studying. Discussions happening in professional development programs can be mined to inform stakeholders on various issues pertaining to practice. These sentiments could also be tracked over time, within the context of the course and beyond.
Another potential expansion is to conduct the same study but, instead of gleaning data from the MOOC-Ed, retrieve Twitter or Facebook data with hashtags on the topic. I am planning to experiment with this and will use these findings as the pilot study.
Sentiment analysis and social network analysis can be used to study opinions and learning patterns in MOOCs. This study provided insights into how word clouds and lexicon-based algorithms such as VADER can be used to assess educators’ opinions and interactions in Census at School discussions. Overall, educators were positive toward the program, and the active actors engaged in the discussion were identified. As MOOCs become more prevalent, more research is needed to investigate how text mining techniques can be applied to assess participants’ opinions, behaviors and patterns and to design appropriate interventions.
Krumm, A., Means, B., & Bienkowski, M. (2018). Learning analytics goes to school: A collaborative approach to improving education. Routledge.
Moreno-Marcos, P. M., Alario-Hoyos, C., Muñoz-Merino, P. J., Estévez-Ayres, I., & Kloos, C. D. (2018, April). Sentiment analysis in MOOCs: A case study. In 2018 IEEE Global Engineering Education Conference (EDUCON) (pp. 1489-1496). IEEE.
Onan, A. (2021). Sentiment analysis on massive open online course evaluations: A text mining and deep learning approach. Computer Applications in Engineering Education, 29(3), 572-589.
Wen, M., Yang, D., & Rose, C. (2014, July). Sentiment analysis in MOOC discussion forums: What does it tell us? In Educational Data Mining 2014.
Yan, X., Li, G., Li, Q., Chen, J., Chen, W., & Xia, F. (2021, October). Sentiment analysis on massive open online course evaluation. In 2021 International Conference on Neuromorphic Computing (ICNC) (pp. 245-249). IEEE.