Using Text Mining to Explore Sentiments and Topics of Students’ Essay Responses

Prepare

Introduction

Teachers in many subjects assign writing tasks, which are a useful way to evaluate students’ understanding and to build their writing skills. However, grading essays usually takes a lot of time, so researchers are collaborating with teachers to develop models that automate essay scoring. This project builds on these resources, but seeks to understand the sentiments and topics in students’ essays.

Purpose

The purpose of this study is to use text mining techniques to conduct sentiment analysis and topic modeling on students’ essay responses to assigned prompts. These responses were scored by human experts on a scale from 1 (lowest) to 6 (highest). This study focuses on differences in the sentiments and topics of essay responses across score levels. Two research questions are explored.

Research questions

1. What are the differences in sentiments in students’ essay responses across score levels?

2. What are the differences in topics in students’ essay responses across score levels?

Data sources

The data come from a publicly available dataset for developing automated essay scoring models. It can be accessed through platforms like Kaggle (https://www.kaggle.com/competitions/asap-aes/data). This dataset includes eight sets of student essays in response to different prompts. The essays were written by students in grades 7-10 and scored by human raters. A range of variables was collected, including students’ economic status, gender, ethnicity, and English language learner status. For this study, I selected only the prompt, the students’ essays, and the scores for analysis. I used English language learner status as a covariate in the topic modeling analysis.

Wrangle

Data preprocessing

Analyzing sentiments and topics across eight different prompts is challenging, so this study focused on one prompt: In “The Challenge of Exploring Venus,” the author suggests studying Venus is a worthy pursuit despite the dangers it presents. Using details from the article, write an essay evaluating how well the author supports this idea. Be sure to include: a claim that evaluates how well the author supports the idea that studying Venus is a worthy pursuit despite the dangers; an explanation of the evidence from the article that supports your claim; an introduction, a body, and a conclusion to your essay.

New features

Given the imbalanced distribution of scores, I recoded the score variable into three levels: high (scores 5 and 6), middle (scores 3 and 4), and low (scores 1 and 2). Students’ essay responses were tokenized and punctuation was removed for the later topic modeling. Missing values were also removed.

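# Load the packages used in this report
# (assumed: the original setup chunk is not shown)
library(dplyr)
library(tidyr)
library(ggplot2)
library(tidytext)
library(wordcloud)
library(RColorBrewer)
library(stm)

df <- read.csv("C:/phd/Text mining/ASAP2_train_sourcetexts.csv")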
df <- df[df$prompt_name == "Exploring Venus", ]
df <- df %>% dplyr::select(score, full_text, ell_status)
df <- df %>%
  mutate(score_level = case_when(
    score %in% c(1, 2) ~ "Low",
    score %in% c(3, 4) ~ "Middle",
    score %in% c(5, 6) ~ "High",
    TRUE ~ NA_character_
  ))
df$row_id <- 1:nrow(df)
df <- df %>% relocate(row_id)
head(df, 1)

##   row_id score
## 1      1     4
##   full_text
## 1 The author suggests that studying Venus is worthy enough even though it is very dangerous. The author mentioned that on the planet's surface, temperatures average over 800 degrees Fahrenheit, and the atmospheric pressure is 90 times greater than what we experience on our own planet . His solution to survive this weather that is dangerous to us humans is to allow them to float above the fray. A "blimp-like" vehicle hovering 30 or so miles would help avoid the unfriendly ground conditions . At thirty-plus miles above the surface, temperatures would still be toasty at around 170 degrees Fahrenheit, but the air pressure would be close to that of sea level on Earth. So not easy conditions, but survivable enough for humans. So this would help make the mission capeable of completing.\n\nHe also mentions how peering at venus from a ship orbiting or hovering safely far above the planet can provide only limited insight on ground conditions because most forms of light cannot penertrate the dense atmosphere making it hard to take photographs . They also cannot take samples of rock, gas, or anything else, from a distance. So many reaserchers are working on innovations that would allow their machines to last long enough to help gain some imformation of Venus.\n\nThey are working on other ways to study Venus such as simplified electrnics made of silicon carbide that have been tested in a chamber simulating the chaos of Venus's surface . So far they have lasted for 3 weeks in these conditions which is more than enough time hopefully for them to be able to grab enough information. Their other project that they are working on is using an old technology called mechanical computers. They are powerful, flexible, and quick. 
Systems that use mechanical parts can be made more resistant to pressure, heat, and other forces.\n\nHe feels that studying Venus even though its dangerous is valuable because of the insight they could gain about the planet itself but also becuase "human curiosity will likely lead us into many equally intimidating endeavors."\n\nI think the author supported his claim very well he explained why he thought it as nessary to go even though it is dangerous and he gave solutions to some of the dangers on Venus such as sollution to the heat and ways to actually help gain evicence and imformation on Venus.                 
##   ell_status score_level
## 1         No      Middle

nrow(df)

## [1] 4480

table(df$score_level)

## 
##   High    Low Middle 
##    217   1986   2277

The table shows 4480 observations across three score levels (high: 217; middle: 2277; low: 1986). It makes sense that fewer students earn a high score and that most students fall in the middle level. However, the score levels are heavily imbalanced, so the results should be interpreted with caution.
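To put the imbalance in numbers, the counts can be converted to percentages. This is a small sketch that re-enters the three counts by hand (the names `level_counts` and `level_pct` are only illustrative); with the full data frame, `prop.table(table(df$score_level))` should give the same shares.

```r
# Counts copied from the table(df$score_level) output above
level_counts <- c(High = 217, Low = 1986, Middle = 2277)

# Share of each score level, as a percentage of all essays
level_pct <- round(100 * level_counts / sum(level_counts), 1)
print(level_pct)
```

High score level essays make up fewer than 5% of the sample, which is why any comparison involving that group deserves extra caution.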

## An example of highest score
set.seed(123)
high_score_sample <- df %>%
  dplyr::filter(score == 6) %>%
  slice_sample(n = 1)
high_score_sample$full_text

## [1] "In the article \"The Challenge of Exploring Venus,\" the author gives the reader basic insight on what Venus is, its history, why scientists are trying to explore it, and the challenges scientists have faced.\n\nOverall it is a very informative article and the point of it was to convince the reader that despite all of the struggles there is that comes with exploring Venus, that it is worth exploring anyway.\n\nThough it is an admirable arguement, it is simply not supported throughout this type of article.\n\nIn the article, the author does not support his or her argument well because of his or her informative over-kill, lack of analysis, and lack of data in his or her favor.\n\nTo begin, the author gives us too many facts that are simply just an introduction. In paragraph three, the whole paragraph is stating facts about Venus and its atmosphere.\n\nFor example the author states that, \"...atmospheric pressure is 90 times great than what we experience on our own planet.\"\n\nThis does resonate with the arguement considering that it is about how Venus should be explored despite the challenges.\n\nThe real problem with this example is that this is placed in the third paragraph and everything before it was also facts.\n\nThe author spent three paragraphs talking about the weather of Venus instead of analyzing how it should be disregared and we should explore it for better reasons. 
Additionally, the author gives facts that have nothing to do with the arguement provided.\n\nIn the second paragraph, the author states that, \"...sometimes we are closer to Mars and other times to Venus.\"\n\nThis simply has nothing to do with the supposed topic.\n\nNot only is there too many facts, but these facts have nothing to do with the topic and the author seems to -in fact- trail off a bit.\n\nMultiple times.\n\nAdditionally, the author seems to provide no analysis towards the plethera of information.\n\nFor example, the first true analysis is seen is in paragraph five when he or she states, \"Not easy conditions, but survivable for humans.\"\n\nBy the time someone is writing to paragraph five, there should already be an extreme of amount of noticable analysis before that.\n\nThe basic structure of any arguement would be to state a fact, and give a well thought-out analysis about that fact and how it applies to one's topic.\n\nThe noticable structure here, would be an overwhelming amount of facts, and then slight analysis by assuming that the evidence speaks for itself.\n\nWhen you have that much information thrown at the reader at once, the evidence really doesn't speak for itself.\n\nThat's where the supposed analysis comes in.\n\nThe analysis is supposed to connect the dots for the reader and explain how it applies.\n\nIt seems that is lacking here.\n\nFinally, the author seems to be arguing against themselves with the type of evidence that is being presented.\n\nThroughout the whole article the author seems to be only stating the things wrong with this idea.\n\nFor example in paragraph two, the author states that, \"...no spacecraft survived the landing for than a few hours.\"\n\nThen backed that up with no analysis on how that should be worth the risk so scientists can can continue exploring.\n\nThis is a problem because that is the point of the article: that despite all the risks, scientists should perservere. 
That's is not what followed that unsettling fact, though.\n\nAll that followed was more facts.\n\nAdditionally in paragraph six, the author finally states a good fact on how exploration can continue, but then follow it with analysis that once again goes against the arguement.\n\nFor example, after talking about the solution NASA has made to this problem, the author follow it with, \"However, peering at Venus from a ship orbiting or hovering safely far above the planet can only provide limited insight...\"\n\nThis was a poor choice of support because the author is basically going against his or her idea, without reconciling it.\n\nIn conclusion, the author makes it unclear which side he or she is trying to prove.\n\nIt is a very well-written article, but only for informational purposes.\n\nIf one it looking for a well-supported idea on how despite all the risks, science should continue, this article is not the place to look.\n\nIf anything, it seems that the author is arguing against continuing to explore Venus-considering all the facts and analysis that went against the idea.\n\nOnce again, this is a well written informative article, not an arguable one."

## An example of lowest score
set.seed(123)
low_score_sample <- df %>%
  dplyr::filter(score == 1) %>%
  slice_sample(n = 1)
low_score_sample$full_text

## [1] "venus is something called the evening star and its one of the brightest stars in the night sky but venus is acutally a planet that is in terms or density. its occasionally spinning and it is also reffered as one of earths twin. venus is also one of the closest planets to earth and it spins at a different speed all the planets spin at different speeds.Mars is right by venus so venus sometimes spins around the corner by mars and the spac craft peole land on venus or mars and they like to see whats new on the planet and humans sent numerous space crafts around. So they can see if there is life on venus and each mission to venus was unmanned and for a good reason no spacecrafts survived. The landing for more than a few hours\n\nand there is a thick atomsphere. That is almost 97 percent carbon dioxide blankets that are on venus,venus is more challenging that could of highly corrosive sulfuric acids and it is also over 800 degrees farenheit and the pressure that is atmospheric is more extreme than other stuff. Humans have encountered on earth such as and enviroment that venus had and venus is the hottest and mercury is more closer to our sun. So we thought mercury would be hotter instead of venus becuase venus is a little bit further than murcery "

I selected two examples, one with the highest score and one with the lowest score. The highest-scoring essay evaluates the author’s statements effectively and is well developed. For example, it includes sentences such as “the author gives the reader basic insight …”, “Though it is an admirable arguement, it is simply not supported …”, and “This does resonate with the arguement considering that …”. The lowest-scoring essay merely restates descriptive information about the topic without much effort to evaluate the author’s statements.

Analyze

Word cloud analysis

First, I used word clouds to present the word usage in students’ essay responses at the three score levels.

# Tokenize the essays and remove stop words
# (reused for the word clouds and the sentiment analysis)
df_sa <- df %>% 
  unnest_tokens(output = word, input = full_text) %>%
  anti_join(stop_words, by = "word")
##head(df_sa)

Word cloud for low score level essays. Some topic words stand out, such as “Venus”, “planet”, “earth”, and “author”.

wc_data_low <- df_sa %>%
  filter(score_level == "Low") %>%
  count(word, name = "freq")

wordcloud(words = wc_data_low$word,
          freq = wc_data_low$freq,
          min.freq = 2,               
          max.words = 100,            
          random.order = FALSE,       
          colors = brewer.pal(8, "Dark2"),
          scale = c(4, 0.5))

Word cloud for middle score level essays. Topic words such as “planet”, “author”, “earth”, and “surface” again stand out. One change is that the word “author” becomes more dominant at this score level.

wc_data_middle <- df_sa %>%
  filter(score_level == "Middle") %>%
  count(word, name = "freq")

wordcloud(words = wc_data_middle$word,
          freq = wc_data_middle$freq,
          min.freq = 2,               
          max.words = 100,            
          random.order = FALSE,       
          colors = brewer.pal(8, "Dark2"),
          scale = c(4, 0.5))

Word cloud for high score level essays. Many words repeat from the other levels, but the word “author” becomes more dominant and words like “idea” and “claim” appear, suggesting that these essays do more to evaluate the author’s statements.

wc_data_high <- df_sa %>%
  filter(score_level == "High") %>%
  count(word, name = "freq")

wordcloud(words = wc_data_high$word,
          freq = wc_data_high$freq,
          min.freq = 2,               
          max.words = 100,            
          random.order = FALSE,       
          colors = brewer.pal(8, "Dark2"),
          scale = c(4, 0.5))

Sentiment analysis

For sentiment analysis, I divided the data into three parts by score level and used two lexicons, AFINN and Bing, for comparison. AFINN is a dictionary-based method that assigns a numerical score (-5 to 5) to each word; each student’s sentiment score is the sum of the scores of the words in the essay. Bing is another dictionary-based method that classifies words as positive or negative; each student’s sentiment score is the number of positive words minus the number of negative words. I first compared the sentiment differences among the three score levels within each method and then compared these differences between the two methods. Bar graphs were created to show the differences.
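The two scoring rules can be illustrated on a toy example. This is only a sketch: `toy_afinn`, `toy_bing`, and `toy_tokens` are hypothetical stand-ins with made-up values, not the real lexicons returned by `get_sentiments()` or the real tokenized essays.

```r
library(dplyr)
library(tibble)

# Hypothetical mini-lexicons (word values are made up for illustration)
toy_afinn <- tibble(word  = c("worthy", "dangerous", "survive", "hostile"),
                    value = c(2, -2, 2, -2))
toy_bing <- tibble(word      = c("worthy", "dangerous", "survive", "hostile"),
                   sentiment = c("positive", "negative", "positive", "negative"))

# Toy tokens in the same one-word-per-row shape as a tokenized essay table
toy_tokens <- tibble(
  row_id = c(1, 1, 1, 2, 2),
  word   = c("worthy", "dangerous", "survive", "hostile", "dangerous")
)

# AFINN-style score: sum of word values per essay
afinn_scores <- toy_tokens %>%
  inner_join(toy_afinn, by = "word") %>%
  group_by(row_id) %>%
  summarise(afinn_score = sum(value), .groups = "drop")

# Bing-style score: positive word count minus negative word count per essay
bing_scores <- toy_tokens %>%
  inner_join(toy_bing, by = "word") %>%
  group_by(row_id) %>%
  summarise(bing_score = sum(sentiment == "positive") - sum(sentiment == "negative"),
            .groups = "drop")

print(left_join(afinn_scores, bing_scores, by = "row_id"))
```

In this toy data, essay 1 comes out positive under both rules and essay 2 comes out negative under both, although the magnitudes differ because AFINN weights words while Bing only counts them.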

afinn <- get_sentiments("afinn")
##head(afinn)

sentiment_afinn <- inner_join(df_sa, afinn, by = "word")
##head(sentiment_afinn)

summary_afinn <- sentiment_afinn %>%
  group_by(score_level, row_id) %>%
  summarise(value = sum(value)) %>%
  mutate(lexicon = "afinn") %>%
  relocate(lexicon)
##summary_afinn

afinn_sentiment <- summary_afinn %>%
  dplyr::filter(value != 0) %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive"))
##afinn_sentiment

score_level_afinn <- sentiment_afinn %>%
  group_by(score_level) %>%
  summarise(total_value = sum(value)) %>%
  mutate(lexicon = "afinn") %>%
  relocate(lexicon)
##score_level_afinn

ggplot(score_level_afinn, aes(x = score_level, y = total_value, fill = score_level)) +
  geom_bar(stat = "identity") +
  labs(x = "Score Level", y = "Sentiment Value", title = "Sentiment by Score Level (afinn)") +
  theme_minimal()

The sentiment analysis using AFINN showed that middle score level essays had a higher total sentiment score than high score level and low score level essays.

# use BING to compare with afinn
bing <- get_sentiments("bing")
##head(bing)

sentiment_bing <- inner_join(df_sa, bing, by = "word") 
##head(sentiment_bing)

summary_bing <- sentiment_bing %>%
  group_by(score_level) %>%
  count(sentiment, sort = TRUE) %>%
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(sentiment = positive - negative)
##head(summary_bing)

sentiment_long <- summary_bing %>%
  pivot_longer(cols = c(positive, negative, sentiment),
               names_to = "type", values_to = "count")

ggplot(sentiment_long, aes(x = score_level, y = count, fill = type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Sentiment Counts by Score Level (Bing)",
       x = "Score Level", y = "Count") +
  theme_minimal()

The sentiment analysis using Bing showed that middle score level essays had a higher net sentiment score (positive minus negative word counts) than high score level and low score level essays.

Interpretation of the sentiment analysis results: It is not surprising that middle score level essays had a higher score than high score level essays, because the scores are summed over essays and there were far more middle score level essays than high score level essays. The number of middle score level essays was closer to that of low score level essays, so that comparison is more informative: middle score level essays used more positive words than low score level essays.
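One possible way to reduce the effect of group size is to compare the mean sentiment per essay rather than the group total. The sketch below uses a made-up table, `toy_summary`, with the same columns as `summary_afinn`; it is not a re-analysis of the real data.

```r
library(dplyr)
library(tibble)

# Toy per-essay sentiment scores (values are made up for illustration)
toy_summary <- tibble(
  score_level = c("High", "Middle", "Middle", "Middle", "Middle"),
  row_id      = 1:5,
  value       = c(6, 2, 2, 2, 2)
)

toy_by_level <- toy_summary %>%
  group_by(score_level) %>%
  summarise(
    n_essays    = n(),
    total_value = sum(value),   # grows with the number of essays in the group
    mean_value  = mean(value),  # comparable across groups of different size
    .groups = "drop"
  )

print(toy_by_level)
```

In this toy case the middle group has the larger total only because it contains more essays, while the single high-level essay has the larger mean. Note that a table built with an inner join to a lexicon only contains essays with at least one matched word, so a per-essay mean over such a table would ignore essays with no lexicon words.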

Topic modeling

For topic modeling, I used structural topic modeling (STM), a method for identifying latent topics in a corpus of text. I fit a separate topic model for each of the three score level groups. English language learner status was included only as a covariate and was not explored further. For each model, the ten highest-probability terms in each of the ten topics were visualized.

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...

## Removing 7502 of 11398 terms (7502 of 169581 tokens) due to frequency 
## Your corpus now has 1986 documents, 3896 terms and 162079 tokens.

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...

## Removing 8171 of 13132 terms (8171 of 271998 tokens) due to frequency 
## Your corpus now has 2277 documents, 4961 terms and 263827 tokens.

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...

## Removing 1950 of 3651 terms (1950 of 36415 tokens) due to frequency 
## Your corpus now has 217 documents, 1701 terms and 34465 tokens.
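The code that produced the fitted models is not shown in this report. The log messages above look like those printed by `textProcessor()` and `prepDocuments()` from the stm package, so a fit for one score level might look roughly like the sketch below. The helper name `fit_stm_for_level`, the seed, and the `init.type` choice are assumptions, not the original settings; only the ten topics and the `ell_status` covariate come from the text.

```r
library(stm)

# Sketch only: fit an STM for one score level with ell_status as a
# prevalence covariate. Argument values are assumptions for illustration.
fit_stm_for_level <- function(data, level, k = 10, seed = 123) {
  # which() drops rows where score_level is NA
  subset_df <- data[which(data$score_level == level), ]

  # Lower-case, remove punctuation, stop words, and numbers, then stem
  processed <- textProcessor(subset_df$full_text, metadata = subset_df)

  # Drop terms that appear in only one document (default lower.thresh = 1)
  prepped <- prepDocuments(processed$documents, processed$vocab, processed$meta)

  stm(documents  = prepped$documents,
      vocab      = prepped$vocab,
      K          = k,
      prevalence = ~ ell_status,
      data       = prepped$meta,
      init.type  = "Spectral",
      seed       = seed,
      verbose    = FALSE)
}
```

Under these assumptions, a call such as `fit_stm_for_level(df, "Low")` would return a fitted model that can be passed to `tidy()`.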

tidy_stm_low <- tidy(stm_low)
top_terms_stm_low <- tidy_stm_low %>%
  group_by(topic) %>%
  slice_max(beta, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms_stm_low %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%
  arrange(desc(beta)) %>%
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 10 terms in each STM topic for Low Score",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")

tidy_stm_middle <- tidy(stm_middle)
top_terms_stm_middle <- tidy_stm_middle %>%
  group_by(topic) %>%
  slice_max(beta, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms_stm_middle %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%
  arrange(desc(beta)) %>%
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 10 terms in each STM topic for Middle Score",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")

tidy_stm_high <- tidy(stm_high)
top_terms_stm_high <- tidy_stm_high %>%
  group_by(topic) %>%
  slice_max(beta, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms_stm_high %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%
  arrange(desc(beta)) %>%
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 10 terms in each STM topic for High Score",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")

Communicate

Key findings and insights

Research Question 1:

The sentiment analysis using AFINN showed that middle score level essays had the highest sentiment score and low score level essays had the lowest. The sentiment analysis using Bing showed that middle score level essays had the highest score, while high score level and low score level essays had almost the same low sentiment score.

Given the distribution (217 high score level essays, 2277 middle score level essays, and 1986 low score level essays), it is not surprising that middle score level essays had a higher score (because there were the most of them) and that high score level essays had a comparatively lower score (because there were the fewest).

It is more informative to compare middle score level essays with low score level essays because their numbers do not differ much. There was a difference in sentiment scores between these two groups, with middle score level essays obtaining a higher sentiment score.

Research Question 2:

I must acknowledge that this dataset may not be well suited to topic modeling because the students all read the same article and wrote about the same content, so a large overlap in topics is expected. At the level of general patterns, there was no difference in the extracted topics. However, a closer examination revealed that, compared with low and middle score level essays, high score level essays had more topics that included terms like “author”, “support”, “explain”, “idea”, and “state”, as shown in Topics 1, 3, 5, 6, and 9. In low score level essays, only one topic (Topic 6) included similar terms. In middle score level essays, only one topic (Topic 9) included similar terms. The prompt for this task asks students to “Be sure to include: a claim that evaluates how well the author supports the idea that studying Venus is a worthy pursuit despite the dangers; an explanation of the evidence from the article that supports your claim”, so high score level essays aligned better with the requirements of the prompt.
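The comparison above was made by reading the topic plots. A rough way to make the same count reproducible is to flag topics whose top terms include any of the evaluative words. The sketch below runs on a made-up table, `toy_top_terms`, that mimics the topic and term columns of the top-terms tables; the topic and term pairs are illustrative, not the real model output.

```r
library(dplyr)
library(tibble)

# Evaluative terms named in the discussion above
eval_terms <- c("author", "support", "explain", "idea", "state")

# Toy stand-in for a top-terms table (topic/term pairs are made up)
toy_top_terms <- tibble(
  topic = c(1, 1, 2, 2, 3, 3),
  term  = c("author", "support", "venus", "planet", "idea", "surfac")
)

# For each topic, count how many of its top terms are evaluative
eval_by_topic <- toy_top_terms %>%
  group_by(topic) %>%
  summarise(n_eval = sum(term %in% eval_terms), .groups = "drop")

# Topics with at least one evaluative term among their top terms
evaluative_topics <- eval_by_topic %>%
  filter(n_eval > 0) %>%
  pull(topic)

print(evaluative_topics)
```

Because the terms were stemmed during preprocessing, the matching list would need to use stems as they appear in the model vocabulary.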

Implications and future directions

The difference in sentiment scores between middle score level essays and low score level essays suggests that instructors should pay attention to students’ emotions while they complete writing tasks. Students with low scores may experience more difficulties in their writing and show more negative sentiments, so instructors may need to give them more help. Further analysis can be conducted to identify the sources of these differences.

Instructors should also help students understand the task better or remind them to pay attention to its requirements, so that students are more likely to write essays that meet those requirements.

Limitations and ethical issues

Dictionary-based methods for sentiment analysis have some limitations. First, they lack contextual awareness, meaning that they treat each word independently. Second, words with multiple meanings may be misclassified.
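The lack of contextual awareness can be seen with a negated phrase. The sketch below uses a hypothetical two-word lexicon, `toy_lexicon`, and a hypothetical helper, `score_text`; it is not one of the lexicons used in this study.

```r
# Hypothetical two-word lexicon; real lexicons such as Bing are far larger
toy_lexicon <- c(good = 1, bad = -1)

# Word-by-word scoring: words not in the lexicon (including "not") add 0
score_text <- function(text, lexicon) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  sum(lexicon[words], na.rm = TRUE)
}

score_plain   <- score_text("The evidence is good", toy_lexicon)
score_negated <- score_text("The evidence is not good", toy_lexicon)
print(c(plain = score_plain, negated = score_negated))
```

Both sentences receive the same score because the word “not” is ignored, even though the second sentence expresses the opposite evaluation.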

Topic modeling can reveal the topics in students’ writing. However, it requires interpretation by human experts, which can sometimes be subjective.

When using students’ written responses for analysis, we need permission from the students and should keep their responses confidential.