Introduction

Given a set of student comments and ratings, I wanted to find out whether any comments are inconsistent with their ratings. I started with TextBlob, an out-of-the-box Python package, which produced some slightly strange results. That led me to try to validate those results using another such package, VADER.

This article shows several ways in which these packages fail to score the sentiment of short texts correctly, and suggests potential methods to mitigate such errors.

First let’s see the data set, which has been subset to show only the two relevant columns.

|  | student_comment | student_rating |
|---|---|---|
| 17 | NOUN was amazing and very patient. | 5 |
| 26 | my teacher went through the question so well wand was extremely patient! | 5 |
| 27 | My internet dropped out just as NOUN was checking something as the answer we figured out wasn’t correct. But then I got it back almost straight away and said that I had lost all of the messages because the Internet dropped out, and then I got them all of a sudden, his but not mine, and then he ended the session half way through working it out. Please fix this issue, as the bottom of by screen recommend I refresh the page and I did, but then I lost everything. | 3 |
| 30 | The tutor ensured that i understood everything | 5 |
| 40 | THANK YOU SO MUCH FOR YOUR ASSISTANCE! | 5 |
| 43 | she was really nice and answer the question and will help me do this presentation | 5 |
| 62 | NEED MORE IT TUTORS!! | 2 |
| 81 | everything went well and was very nice | 5 |
| 97 | She was excellent I got the right feedback | 5 |
| 113 | i think that they should be a little bit more helpful. | 1 |

TextBlob Sentiment Scores

Let’s calculate the sentiment scores using TextBlob for the same set of rows.
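
A minimal sketch of how such a column could be computed. It assumes the comments are held as a list of dicts keyed by student_comment and student_rating (the original data frame code is not shown), and that the score reported in the table below is TextBlob's polarity, which ranges from -1 to 1. The helper names textblob_polarity and add_score are illustrative.

```python
from typing import Callable, Dict, List


def textblob_polarity(text: str) -> float:
    """Polarity in [-1, 1] from TextBlob's default analyser."""
    # Third-party package; imported lazily so the rest of the sketch
    # can be used without TextBlob installed.
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity


def add_score(rows: List[Dict], scorer: Callable[[str], float],
              column: str = "sentiment_textblob") -> List[Dict]:
    """Return copies of the rows with one extra sentiment column."""
    return [{**row, column: scorer(row["student_comment"])} for row in rows]
```

With TextBlob installed, add_score(rows, textblob_polarity) would add the sentiment_textblob column. Passing the scorer as an argument makes it easy to swap in a different package later.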

|  | student_comment | student_rating | sentiment_textblob |
|---|---|---|---|
| 17 | NOUN was amazing and very patient. | 5 | 0.4000000 |
| 26 | my teacher went through the question so well wand was extremely patient! | 5 | -0.1562500 |
| 27 | My internet dropped out just as NOUN was checking something as the answer we figured out wasn’t correct. But then I got it back almost straight away and said that I had lost all of the messages because the Internet dropped out, and then I got them all of a sudden, his but not mine, and then he ended the session half way through working it out. Please fix this issue, as the bottom of by screen recommend I refresh the page and I did, but then I lost everything. | 3 | 0.0083333 |
| 30 | The tutor ensured that i understood everything | 5 | 0.0000000 |
| 40 | THANK YOU SO MUCH FOR YOUR ASSISTANCE! | 5 | 0.2500000 |
| 43 | she was really nice and answer the question and will help me do this presentation | 5 | 0.6000000 |
| 62 | NEED MORE IT TUTORS!! | 2 | 0.7812500 |
| 81 | everything went well and was very nice | 5 | 0.7800000 |
| 97 | She was excellent I got the right feedback | 5 | 0.6428571 |
| 113 | i think that they should be a little bit more helpful. | 1 | 0.1562500 |

It’s already evident that something has gone awry. The second row has a student_rating of 5 and a comment that is very positive (at least to me), yet its sentiment score is -0.156. That is the same magnitude as the score of the last row (0.156), only with the sign flipped, even though I think the last row’s comment is the truly negative one.

Highly Rated - Low Sentiment Scores

Now let’s look at just the highly rated rows with low sentiment scores.
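
The filter behind a table like the one below could be sketched as follows. The cut-offs (a rating of 4 or more counts as high, 2 or less as low, and a score of 0 separates positive from negative) are illustrative assumptions, not the thresholds actually used for the tables in this article. The same function also covers the mirror case of low rated rows with high scores.

```python
from typing import Dict, List


def rating_sentiment_mismatches(rows: List[Dict], score_column: str,
                                high_rating: int = 4, low_rating: int = 2,
                                score_cutoff: float = 0.0) -> Dict[str, List[Dict]]:
    """Split out rows whose rating and sentiment score point in opposite directions."""
    high_rated_low_score = [
        r for r in rows
        if r["student_rating"] >= high_rating and r[score_column] < score_cutoff
    ]
    low_rated_high_score = [
        r for r in rows
        if r["student_rating"] <= low_rating and r[score_column] > score_cutoff
    ]
    # Most extreme scores first, so the worst disagreements surface at the top.
    high_rated_low_score.sort(key=lambda r: r[score_column])
    low_rated_high_score.sort(key=lambda r: r[score_column], reverse=True)
    return {"high_rated_low_score": high_rated_low_score,
            "low_rated_high_score": low_rated_high_score}
```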

|  | student_comment | student_rating | sentiment_textblob |
|---|---|---|---|
| 31715 | yo simon is an mk, real shit maddest dog ive ever met | 5 | -0.3 |
| 47396 | yeah not the tutors fault but the website was really slow i had to refresh a couple of times. | 5 | -0.3 |
| 71394 | I thought the interaction between both of us was a bit slow, it took a while to load and send through messages. | 5 | -0.3 |
| 32063 | The connection was slow and the photo I submitted went distorted after I submitted it. | 5 | -0.3 |
| 19418 | Helped me slowly work through it | 5 | -0.3 |

I think the first one can be attributed to the word “shit”, which is intrinsically negative. Evidently TextBlob doesn’t understand colloquialisms such as “the shit”.

The next three seem to be good examples of the value of doing sentiment analysis. Despite the highest rating of 5, there still seems to have been room for improvement on the technical side.

The last one is another misunderstanding, this time of the positive phrase “slowly work through it”, which expresses patience rather than an inability to provide help promptly, in stark contrast to the previous three.

Low Rated - High Sentiment Scores

Now let’s look at just the low rated rows with high sentiment scores.

|  | student_comment | student_rating | sentiment_textblob |
|---|---|---|---|
| 78189 | best tutor in studiosity | 1 | 1 |
| 29136 | Very happy with the service, and got all the answers I needed | 1 | 1 |
| 79281 | Perfect | 1 | 1 |
| 47031 | NOUN was great! | 1 | 1 |
| 36957 | Not the greatest service this time | 1 | 1 |

This is a case of students misunderstanding the scale: they seem to have thought that 1 is the highest rating. The exception is the last one, “Not the greatest service this time”, which is genuinely negative, yet its sentiment score is 1!

This got me thinking about how negated sentences such as “It’s anything but good.” are actually scored. Let’s try some of these.
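
A small harness for this kind of probing could look like the sketch below. The names TEST_SENTENCES and probe are illustrative, and each scorer is assumed to be any function that maps a string to a float (for TextBlob, something like lambda s: TextBlob(s).sentiment.polarity).

```python
from typing import Callable, Dict, List

# The probe sentences tried in this article.
TEST_SENTENCES = [
    "It's not the greatest.",
    "It's anything but good.",
    "It's good.",
    "Extremely helpful.",
    "Very helpful.",
]


def probe(sentences: List[str],
          scorers: Dict[str, Callable[[str], float]]) -> List[Dict]:
    """Score every sentence with every scorer; one dict (table row) per sentence."""
    return [
        {"sentence": s, **{name: round(fn(s), 3) for name, fn in scorers.items()}}
        for s in sentences
    ]
```

Keeping the sentences in one list makes it cheap to rerun the same probes whenever another package is added to the comparison.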

| sentence | text_blob |
|---|---|
| It’s not the greatest. | 1.000 |
| It’s anything but good. | 0.700 |
| It’s good. | 0.700 |
| Extremely helpful. | -0.125 |
| Very helpful. | 0.200 |

There are some obvious errors. In addition to the negation issue, TextBlob surprisingly scores “extremely” negatively.

VADER

This section will replicate the above analysis with another well known package called VADER.
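
A minimal sketch of scoring with VADER, assuming the vaderSentiment package and assuming that the sentiment_vader column below holds VADER's compound score (the article does not say which of VADER's outputs was used). The label helper applies the ±0.05 cut-offs suggested in VADER's README; the function names are illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def _analyzer():
    # Third-party package; imported lazily and built once, since
    # constructing the analyser loads its lexicon from disk.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer()


def vader_compound(text: str) -> float:
    """VADER's normalised compound score in [-1, 1]."""
    return _analyzer().polarity_scores(text)["compound"]


def label(compound: float, threshold: float = 0.05) -> str:
    """Bucket a compound score as positive, negative or neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

Under these cut-offs, a score of exactly 0.0000 is labelled neutral, which matches how such rows are described in this article.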

|  | student_comment | student_rating | sentiment_textblob | sentiment_vader |
|---|---|---|---|---|
| 17 | NOUN was amazing and very patient. | 5 | 0.4000000 | 0.5859 |
| 26 | my teacher went through the question so well wand was extremely patient! | 5 | -0.1562500 | 0.4648 |
| 27 | My internet dropped out just as NOUN was checking something as the answer we figured out wasn’t correct. But then I got it back almost straight away and said that I had lost all of the messages because the Internet dropped out, and then I got them all of a sudden, his but not mine, and then he ended the session half way through working it out. Please fix this issue, as the bottom of by screen recommend I refresh the page and I did, but then I lost everything. | 3 | 0.0083333 | 0.4411 |
| 30 | The tutor ensured that i understood everything | 5 | 0.0000000 | 0.0000 |
| 40 | THANK YOU SO MUCH FOR YOUR ASSISTANCE! | 5 | 0.2500000 | 0.4199 |
| 43 | she was really nice and answer the question and will help me do this presentation | 5 | 0.6000000 | 0.6997 |
| 62 | NEED MORE IT TUTORS!! | 2 | 0.7812500 | 0.0000 |
| 81 | everything went well and was very nice | 5 | 0.7800000 | 0.6361 |
| 97 | She was excellent I got the right feedback | 5 | 0.6428571 | 0.5719 |
| 113 | i think that they should be a little bit more helpful. | 1 | 0.1562500 | 0.4271 |

With VADER, we can see that the second row is now scored positively. The fourth one from the bottom (“NEED MORE IT TUTORS!!”) went from very positive (TextBlob) to neutral. The other rows are largely similar.

Highly Rated - Low Sentiment Scores (VADER)

Now let’s see how VADER scores the highly rated rows with low TextBlob sentiment scores.

|  | student_comment | student_rating | sentiment_textblob | sentiment_vader |
|---|---|---|---|---|
| 31715 | yo simon is an mk, real shit maddest dog ive ever met | 5 | -0.3 | -0.8126 |
| 47396 | yeah not the tutors fault but the website was really slow i had to refresh a couple of times. | 5 | -0.3 | 0.3025 |
| 71394 | I thought the interaction between both of us was a bit slow, it took a while to load and send through messages. | 5 | -0.3 | 0.0000 |
| 32063 | The connection was slow and the photo I submitted went distorted after I submitted it. | 5 | -0.3 | -0.4019 |
| 19418 | Helped me slowly work through it | 5 | -0.3 | 0.0000 |

VADER scores some of these better than TextBlob, though it rates the first one even more negatively, which is incorrect. The one with “slowly work through it” is now neutral, when it should actually be positive.

Low Rated - High Sentiment Scores (VADER)

Now let’s see how VADER scores the low rated rows with high TextBlob sentiment scores.

|  | student_comment | student_rating | sentiment_textblob | sentiment_vader |
|---|---|---|---|---|
| 78189 | best tutor in studiosity | 1 | 1 | 0.6369 |
| 29136 | Very happy with the service, and got all the answers I needed | 1 | 1 | 0.6115 |
| 79281 | Perfect | 1 | 1 | 0.5719 |
| 47031 | NOUN was great! | 1 | 1 | 0.6588 |
| 36957 | Not the greatest service this time | 1 | 1 | -0.5216 |

In this round, VADER has correctly rated the last one negatively, while the others are largely consistent with TextBlob; the differences in magnitude are difficult to evaluate.

Now for the test sentences:

| sentence | text_blob | vader |
|---|---|---|
| It’s not the greatest. | 1.000 | -0.5216 |
| It’s anything but good. | 0.700 | 0.5927 |
| It’s good. | 0.700 | 0.4404 |
| Extremely helpful. | -0.125 | 0.4754 |
| Very helpful. | 0.200 | 0.4754 |

VADER seems to have corrected the “not” negation and doesn’t score “extremely” negatively. However, the “but” negation still presents some trouble.

Next Steps

Read the Documentation

This is only one way of validating sentiment scores. Another would be to dig deep into the documentation of each of the packages used. Sentiment scores are sensitive to the data (or lexicon) on which the algorithm was built. Case in point: the fact that “extremely” was scored negatively by TextBlob probably reflects the data behind it, in which “extremely” may have co-occurred with truly negative sentences. In that case, TextBlob would probably not be a valid package to use for a data set containing student comments.
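
Alongside the documentation, the packages themselves can be asked which words drove a score. The sketch below assumes the installed TextBlob version exposes sentiment_assessments, which lists the word chunks behind the polarity as (words, polarity, subjectivity, label) tuples. The helper names are illustrative.

```python
from typing import List, Tuple

# (words, polarity, subjectivity, label), as returned in TextBlob's assessments.
Assessment = Tuple[List[str], float, float, object]


def textblob_assessments(text: str) -> List[Assessment]:
    """Per-chunk contributions behind TextBlob's polarity for a text."""
    from textblob import TextBlob  # third-party; imported lazily
    return list(TextBlob(text).sentiment_assessments.assessments)


def negative_chunks(assessments: List[Assessment]) -> List[Tuple[str, float]]:
    """Words or phrases that were given a negative polarity."""
    return [(" ".join(words), polarity)
            for words, polarity, _subjectivity, _label in assessments
            if polarity < 0]
```

Running negative_chunks(textblob_assessments("Extremely helpful.")) would show whether “extremely” itself carries the negative weight. VADER offers a comparable check: its SentimentIntensityAnalyzer object has a lexicon dictionary mapping words to valence scores.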

Calculate More Sentiments

Having just one other independently calculated sentiment score is not very useful for automatically deciding which score is the correct one. For example, if TextBlob scored a comment negatively and VADER scored it positively, I wouldn’t know which is correct without manually evaluating the comment.
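
What two scores can do is narrow down which comments need a manual look. A minimal sketch, assuming rows are dicts carrying both score columns and treating scores within ±0.05 of zero as neutral (that band is an illustrative assumption):

```python
from typing import Dict, List


def sign(score: float, neutral_band: float = 0.05) -> int:
    """Return -1, 0 or 1, treating scores within the band around zero as neutral."""
    if score > neutral_band:
        return 1
    if score < -neutral_band:
        return -1
    return 0


def needs_manual_review(rows: List[Dict], col_a: str, col_b: str) -> List[Dict]:
    """Rows where the two packages land on opposite sides of neutral."""
    return [r for r in rows if sign(r[col_a]) * sign(r[col_b]) == -1]
```

This only flags the disagreement. It does not say which package is right.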

Having more sentiment scores could potentially solve this problem. If there were one more score calculated by another package (candidates include Stanford CoreNLP or gensim), it would be possible to automatically vote, e.g. by ignoring the odd one out and taking the average of the other two. Although this method still would not produce 100% accurate results, the probability of incorrectly scoring the sentiment would be greatly reduced.
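
The voting idea can be sketched as follows: of the three scores, find the two that are closest to each other, treat the remaining one as the odd one out, and average the close pair. The function name vote is illustrative, and this is only one possible reading of "ignoring the odd one out".

```python
from itertools import combinations
from typing import Sequence


def vote(scores: Sequence[float]) -> float:
    """Average the two closest of three scores, dropping the odd one out."""
    if len(scores) != 3:
        raise ValueError("expected exactly three scores")
    # The pair with the smallest gap is taken as the agreeing majority.
    # On a tie, the first such pair in input order wins.
    a, b = min(combinations(scores, 2), key=lambda pair: abs(pair[0] - pair[1]))
    return (a + b) / 2
```

For “Not the greatest service this time”, TextBlob gave 1 and VADER gave -0.5216. If a hypothetical third package returned -0.6, the vote would drop TextBlob's 1 and average the other two, giving roughly -0.56.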

Conclusion

Validation is difficult! Even a simple sentiment analyser presents a difficult validation problem, let alone complex machine learning models. While some models are difficult to validate, it’s useful to at least keep in mind that the model may have misbehaved somewhere. The default should be that the model is wrong unless proven otherwise. If the results are accepted at face value, there’s a high chance that they are biased and hence lead to suboptimal decision making.