Given student comments and ratings, I wanted to find out whether any comments are inconsistent with their ratings. I started with TextBlob, a pre-trained, out-of-the-box Python package, which produced some slightly strange results. That led me to validate the results against another such package, VADER.
This article shows several ways in which these packages fail to score the sentiment of short texts correctly, and suggests potential methods to mitigate such errors.
First let’s see the data set, which has been subset to show only the two relevant columns.
| | student_comment | student_rating |
|---|---|---|
| 17 | NOUN was amazing and very patient. | 5 |
| 26 | my teacher went through the question so well wand was extremely patient! | 5 |
| 27 | My internet dropped out just as NOUN was checking something as the answer we figured out wasn’t correct. But then I got it back almost straight away and said that I had lost all of the messages because the Internet dropped out, and then I got them all of a sudden, his but not mine, and then he ended the session half way through working it out. Please fix this issue, as the bottom of by screen recommend I refresh the page and I did, but then I lost everything. | 3 |
| 30 | The tutor ensured that i understood everything | 5 |
| 40 | THANK YOU SO MUCH FOR YOUR ASSISTANCE! | 5 |
| 43 | she was really nice and answer the question and will help me do this presentation | 5 |
| 62 | NEED MORE IT TUTORS!! | 2 |
| 81 | everything went well and was very nice | 5 |
| 97 | She was excellent I got the right feedback | 5 |
| 113 | i think that they should be a little bit more helpful. | 1 |
TextBlob Sentiment Scores

Let’s calculate the sentiment scores using TextBlob for the same set of rows.
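A minimal sketch of how a column like `sentiment_textblob` could be computed, assuming it holds TextBlob's default polarity (a float between -1 and 1). The helper names `textblob_polarity` and `score_comments` are hypothetical and not taken from the original analysis, and exact scores may vary with the TextBlob version.

```python
from typing import Callable, Iterable, List


def textblob_polarity(text: str) -> float:
    """Return TextBlob's polarity for `text`, a float in [-1.0, 1.0]."""
    # Imported lazily so that score_comments() works without TextBlob installed.
    from textblob import TextBlob  # requires `pip install textblob`

    return TextBlob(text).sentiment.polarity


def score_comments(
    comments: Iterable[str], scorer: Callable[[str], float]
) -> List[float]:
    """Apply a sentiment scorer to every comment, keeping the input order."""
    return [scorer(comment) for comment in comments]


# Usage (needs TextBlob installed):
#   comments = ["NOUN was amazing and very patient.", "NEED MORE IT TUTORS!!"]
#   for comment, score in zip(comments, score_comments(comments, textblob_polarity)):
#       print(f"{score:+.4f}  {comment}")
```

Passing the scorer in as an argument keeps the same helper usable with any other package that maps a string to a single score.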
| | student_comment | student_rating | sentiment_textblob |
|---|---|---|---|
| 17 | NOUN was amazing and very patient. | 5 | 0.4000000 |
| 26 | my teacher went through the question so well wand was extremely patient! | 5 | -0.1562500 |
| 27 | My internet dropped out just as NOUN was checking something as the answer we figured out wasn’t correct. But then I got it back almost straight away and said that I had lost all of the messages because the Internet dropped out, and then I got them all of a sudden, his but not mine, and then he ended the session half way through working it out. Please fix this issue, as the bottom of by screen recommend I refresh the page and I did, but then I lost everything. | 3 | 0.0083333 |
| 30 | The tutor ensured that i understood everything | 5 | 0.0000000 |
| 40 | THANK YOU SO MUCH FOR YOUR ASSISTANCE! | 5 | 0.2500000 |
| 43 | she was really nice and answer the question and will help me do this presentation | 5 | 0.6000000 |
| 62 | NEED MORE IT TUTORS!! | 2 | 0.7812500 |
| 81 | everything went well and was very nice | 5 | 0.7800000 |
| 97 | She was excellent I got the right feedback | 5 | 0.6428571 |
| 113 | i think that they should be a little bit more helpful. | 1 | 0.1562500 |
It’s already evident that something has gone awry. The second row has a student_rating of 5 and a comment that is actually very positive (at least to me), yet its sentiment score is -0.156. In fact, that is lower than the score of the last row (0.156), whose comment I think is truly negative.
Now let’s look at just the highly rated rows with low sentiment scores.
| | student_comment | student_rating | sentiment_textblob |
|---|---|---|---|
| 31715 | yo simon is an mk, real shit maddest dog ive ever met | 5 | -0.3 |
| 47396 | yeah not the tutors fault but the website was really slow i had to refresh a couple of times. | 5 | -0.3 |
| 71394 | I thought the interaction between both of us was a bit slow, it took a while to load and send through messages. | 5 | -0.3 |
| 32063 | The connection was slow and the photo I submitted went distorted after I submitted it. | 5 | -0.3 |
| 19418 | Helped me slowly work through it | 5 | -0.3 |
I think the first one can be attributed to the word “shit”, which is intrinsically negative. Evidently TextBlob doesn’t understand colloquialisms such as “the shit”.
The next three seem to be good examples of the value of doing a sentiment analysis. Despite the highest rating of 5, there seems to have still been room for improvement on the technical side.
The last one is another misunderstanding, this time of the positive phrase “slowly work through it”, which expresses patience rather than an inability to provide help promptly, in stark contrast to the previous three.
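A filter of this kind, together with its mirror image (low rating, high score), could be sketched as below. The function name and the cut-offs (`high_rating`, `low_rating`, `neg_below`, `pos_above`) are assumptions chosen for illustration, not the thresholds used to build the tables in this article.

```python
from typing import Optional


def flag_mismatch(
    rating: int,
    score: float,
    high_rating: int = 5,
    low_rating: int = 1,
    neg_below: float = 0.0,
    pos_above: float = 0.5,
) -> Optional[str]:
    """Label a row whose rating and sentiment score point in opposite directions."""
    if rating >= high_rating and score < neg_below:
        return "high_rating_low_sentiment"
    if rating <= low_rating and score > pos_above:
        return "low_rating_high_sentiment"
    return None  # rating and score are not in obvious conflict


# (row id, student_rating, sentiment_textblob) copied from the tables above
rows = [(17, 5, 0.4), (19418, 5, -0.3), (113, 1, 0.15625)]
for row_id, rating, score in rows:
    print(row_id, flag_mismatch(rating, score))
```

The flagged rows still need to be read by a person: as the examples above show, a flag can mean either a real issue hidden behind a high rating or a scoring error.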
Now let’s look at just the low rated rows with high sentiment scores.
| | student_comment | student_rating | sentiment_textblob |
|---|---|---|---|
| 78189 | best tutor in studiosity | 1 | 1 |
| 29136 | Very happy with the service, and got all the answers I needed | 1 | 1 |
| 79281 | Perfect | 1 | 1 |
| 47031 | NOUN was great! | 1 | 1 |
| 36957 | Not the greatest service this time | 1 | 1 |
This looks like a case of students misunderstanding the scale: they seem to have thought that 1 is the highest rating. The exception is the last one, “Not the greatest service this time”, which is genuinely negative, yet its sentiment score is 1!
This got me thinking about how negated sentences such as “It’s anything but good.” are actually scored. Let’s try some of these.
| sentence | text_blob |
|---|---|
| It’s not the greatest. | 1.000 |
| It’s anything but good. | 0.700 |
| It’s good. | 0.700 |
| Extremely helpful. | -0.125 |
| Very helpful. | 0.200 |
There are some obvious errors. In addition to the negation issue, surprisingly, TextBlob scores “extremely” negatively.
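Checks like this can be automated by pairing each negated sentence with its plain counterpart and flagging pairs where the negated form does not score clearly lower. The sketch below is one way to do that; `negation_failures` and the `min_drop` margin are hypothetical, and the stub scorer simply replays the TextBlob scores from the table above rather than calling the package.

```python
from typing import Callable, List, Tuple


def negation_failures(
    pairs: List[Tuple[str, str]],
    scorer: Callable[[str], float],
    min_drop: float = 0.1,
) -> List[Tuple[str, str]]:
    """Return (negated, plain) pairs whose negated form does not score
    at least `min_drop` lower than the plain form."""
    return [
        (negated, plain)
        for negated, plain in pairs
        if scorer(plain) - scorer(negated) < min_drop
    ]


# TextBlob scores replayed from the table above (not recomputed here).
table_scores = {"It's anything but good.": 0.700, "It's good.": 0.700}

failures = negation_failures(
    [("It's anything but good.", "It's good.")],
    lambda sentence: table_scores[sentence],
)
print(failures)  # the pair is flagged because both sentences score 0.700
```

Swapping the stub for a real scorer turns this into a small regression test that can be rerun whenever a package is upgraded or replaced.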
VADER

This section replicates the above analysis with another well-known package called VADER.
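A sketch of how the `sentiment_vader` column could be added, assuming it holds VADER's `compound` score from the `vaderSentiment` package. `make_vader_scorer` and `add_score_column` are hypothetical helper names, and the rows are represented as plain dictionaries purely for illustration.

```python
from typing import Callable, Dict, List


def make_vader_scorer() -> Callable[[str], float]:
    """Build a scorer returning VADER's compound score, a float in [-1.0, 1.0]."""
    # Imported lazily; requires `pip install vaderSentiment`.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()  # create once, reuse for every comment
    return lambda text: analyzer.polarity_scores(text)["compound"]


def add_score_column(
    rows: List[Dict], column: str, scorer: Callable[[str], float]
) -> List[Dict]:
    """Return copies of `rows` with `column` set to the score of each comment."""
    return [{**row, column: scorer(row["student_comment"])} for row in rows]


# Usage (needs vaderSentiment installed):
#   rows = [{"student_comment": "NEED MORE IT TUTORS!!", "student_rating": 2}]
#   scored = add_score_column(rows, "sentiment_vader", make_vader_scorer())
```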
| | student_comment | student_rating | sentiment_textblob | sentiment_vader |
|---|---|---|---|---|
| 17 | NOUN was amazing and very patient. | 5 | 0.4000000 | 0.5859 |
| 26 | my teacher went through the question so well wand was extremely patient! | 5 | -0.1562500 | 0.4648 |
| 27 | My internet dropped out just as NOUN was checking something as the answer we figured out wasn’t correct. But then I got it back almost straight away and said that I had lost all of the messages because the Internet dropped out, and then I got them all of a sudden, his but not mine, and then he ended the session half way through working it out. Please fix this issue, as the bottom of by screen recommend I refresh the page and I did, but then I lost everything. | 3 | 0.0083333 | 0.4411 |
| 30 | The tutor ensured that i understood everything | 5 | 0.0000000 | 0.0000 |
| 40 | THANK YOU SO MUCH FOR YOUR ASSISTANCE! | 5 | 0.2500000 | 0.4199 |
| 43 | she was really nice and answer the question and will help me do this presentation | 5 | 0.6000000 | 0.6997 |
| 62 | NEED MORE IT TUTORS!! | 2 | 0.7812500 | 0.0000 |
| 81 | everything went well and was very nice | 5 | 0.7800000 | 0.6361 |
| 97 | She was excellent I got the right feedback | 5 | 0.6428571 | 0.5719 |
| 113 | i think that they should be a little bit more helpful. | 1 | 0.1562500 | 0.4271 |
With VADER, we can see that the second row is now scored positively. The fourth one from the bottom (“NEED MORE IT TUTORS!!”) went from very positive (TextBlob) to neutral. The other rows largely agree in direction, though not always in magnitude.
Now let’s see how VADER scores the highly rated rows with low TextBlob sentiment scores.
| | student_comment | student_rating | sentiment_textblob | sentiment_vader |
|---|---|---|---|---|
| 31715 | yo simon is an mk, real shit maddest dog ive ever met | 5 | -0.3 | -0.8126 |
| 47396 | yeah not the tutors fault but the website was really slow i had to refresh a couple of times. | 5 | -0.3 | 0.3025 |
| 71394 | I thought the interaction between both of us was a bit slow, it took a while to load and send through messages. | 5 | -0.3 | 0.0000 |
| 32063 | The connection was slow and the photo I submitted went distorted after I submitted it. | 5 | -0.3 | -0.4019 |
| 19418 | Helped me slowly work through it | 5 | -0.3 | 0.0000 |
VADER scores some of these better than TextBlob, though the first one is rated even more negatively, which is incorrect. The one with “slowly work through it” is now neutral, although it should actually be positive.
Now let’s see how VADER scores the low rated rows with high TextBlob sentiment scores.
| | student_comment | student_rating | sentiment_textblob | sentiment_vader |
|---|---|---|---|---|
| 78189 | best tutor in studiosity | 1 | 1 | 0.6369 |
| 29136 | Very happy with the service, and got all the answers I needed | 1 | 1 | 0.6115 |
| 79281 | Perfect | 1 | 1 | 0.5719 |
| 47031 | NOUN was great! | 1 | 1 | 0.6588 |
| 36957 | Not the greatest service this time | 1 | 1 | -0.5216 |
In this round, VADER has correctly rated the last one negatively, while the others are largely consistent; differences in magnitude are difficult to evaluate.
Now for the test sentences.
| sentence | text_blob | vader |
|---|---|---|
| It’s not the greatest. | 1.000 | -0.5216 |
| It’s anything but good. | 0.700 | 0.5927 |
| It’s good. | 0.700 | 0.4404 |
| Extremely helpful. | -0.125 | 0.4754 |
| Very helpful. | 0.200 | 0.4754 |
VADER seems to have corrected the “not” negation and doesn’t score “extremely” negatively. However, the “but” negation still presents some trouble.
This is only one way of validating sentiment scores; another would be to dig deep into the documentation of each package used. Sentiment scores are sensitive to the data on which the algorithm has been trained. Case in point: the fact that “extremely” was scored negatively by TextBlob probably reflects that TextBlob was trained on data in which “extremely” co-occurred with truly negative sentences. If so, TextBlob would probably not be a valid package to use for a data set containing student comments.
Having a second, independently calculated sentiment score is not very useful for automatically deciding which score is correct: if TextBlob scored a comment negatively and VADER scored it positively, I wouldn’t know which one is right without manually evaluating the comment.
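One thing a second score can still do is narrow down which comments need that manual look. The sketch below flags rows where the two packages land on opposite sides of neutral. The ±0.05 neutral band follows the cut-off commonly suggested for VADER's compound score; applying the same band to TextBlob is an assumption, as are the function names.

```python
def polarity_label(score: float, neutral_band: float = 0.05) -> str:
    """Map a score in [-1, 1] to 'positive', 'negative' or 'neutral'."""
    if score >= neutral_band:
        return "positive"
    if score <= -neutral_band:
        return "negative"
    return "neutral"


def needs_manual_review(
    score_a: float, score_b: float, neutral_band: float = 0.05
) -> bool:
    """True when one score is positive and the other is negative."""
    labels = {
        polarity_label(score_a, neutral_band),
        polarity_label(score_b, neutral_band),
    }
    return labels == {"positive", "negative"}


# (sentiment_textblob, sentiment_vader) pairs copied from the tables above
print(needs_manual_review(-0.3, 0.3025))  # row 47396: the packages disagree
print(needs_manual_review(1.0, 0.6369))   # row 78189: both are positive
```

Rows where both packages agree are not guaranteed to be right, so this only prioritises the manual work; it does not replace it.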
Having more sentiment scores could potentially solve this problem. If there were one more score calculated by another package (candidates include Stanford CoreNLP or gensim), it would be possible to automatically vote, e.g. by ignoring the odd one out and taking the average of the other two. Although this method still would not produce 100% accurate results, the probability of incorrectly scoring the sentiment would be greatly reduced.
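The voting rule described above (ignore the odd one out, average the other two) could be sketched as follows. Here the odd one out is taken to be the score left over once the two closest scores are paired, which is one interpretation among several; `vote` is a hypothetical name and the third score in the example is made up.

```python
from itertools import combinations
from typing import Sequence


def vote(scores: Sequence[float]) -> float:
    """Average the two closest of three scores, ignoring the odd one out."""
    if len(scores) != 3:
        raise ValueError("vote() expects exactly three scores")
    # Pick the pair with the smallest absolute difference between its members.
    a, b = min(combinations(scores, 2), key=lambda pair: abs(pair[0] - pair[1]))
    return (a + b) / 2


# "Not the greatest service this time": TextBlob 1.0, VADER -0.5216,
# plus a made-up third score of -0.6 standing in for another package.
print(vote([1.0, -0.5216, -0.6]))
```

Voting only helps when at most one package is wrong. For “It’s anything but good.”, both TextBlob (0.700) and VADER (0.5927) score positively, so a correct third score would itself be treated as the odd one out.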
Validation is difficult! Even a simple sentiment analyser presents a difficult validation problem, let alone complex machine learning models. While some models are difficult to validate, it’s useful to at least keep the mindset that the model has misbehaved somewhere. The default should be that the model is wrong unless proven otherwise. If the results are accepted at face value, there’s a high chance that they are biased and would hence lead to suboptimal decision making.