Adena Lin
December 18, 2015
Here we will determine the sentiment of professor reviews on RateMyProf.com.
Text reviews will be analyzed at the word level: each review's predicted sentiment is whichever class, positive or negative, has more matching words. Reviews with an equal number of positive and negative words will be rated "neutral".
The predicted sentiment will then be compared to the numeric ratings on RateMyProf (specifically the "Helpfulness" and "Clarity" ratings) and to ratings coded by human raters.
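As a quick illustration of the decision rule, here is a toy sketch; the review words and mini-lexicons below are made up for illustration and are not part of the analysis.
# hypothetical cleaned review and tiny lexicons
toyReview <- c("great", "funny", "boring")
toyPos <- c("great", "funny")
toyNeg <- c("boring", "tough")
nPos <- sum(toyReview %in% toyPos)  # 2 positive matches
nNeg <- sum(toyReview %in% toyNeg)  # 1 negative match
if (nPos > nNeg) "positive" else if (nPos < nNeg) "negative" else "neutral"
# returns "positive"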
One common word found in the reviews was added to each lexicon: "funny" to the positive list and "tough" to the negative list.
# import positive & negative word lists
positiveWords <- scan("lexicon/positive-words.txt", what = 'character')
positiveWords <- c(positiveWords, 'funny')  # add 'funny' to the positive lexicon
negativeWords <- scan("lexicon/negative-words.txt", what = 'character')
negativeWords <- c(negativeWords, 'tough')  # add 'tough' to the negative lexicon
# load the rvest package for web scraping
library(rvest)
# enter website url (find professor's rating page) to scrape
uwaterloo <- read_html("http://www.ratemyprofessors.com/ShowRatings.jsp?tid=1793811")
webData <- uwaterloo %>%
html_nodes(".commentsParagraph") %>%
html_text() %>%
as.character()
print(webData)
# this should give a list of reviews from the first page of the website
# remove punctuation & line breaks (gsub is vectorized over the reviews)
reviews <- gsub('[[:punct:]\n\r]', '', webData)
# change all words to lowercase
reviews <- tolower(reviews)
Next, run the reviews through a word-matching function that counts how many positive and negative words appear in each review.
sentimentFunction <- function(sentences, positiveWords, negativeWords, .progress='none')
...
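The body of sentimentFunction is omitted above. Based on how results is used below (columns review, positive, and negative), a minimal sketch of what the function and its call might look like is given here; the implementation details and the call producing results are assumptions, not the author's original code.
sentimentFunction <- function(sentences, positiveWords, negativeWords, .progress = 'none') {
  # .progress kept for signature compatibility; unused in this sketch
  scores <- lapply(sentences, function(sentence) {
    words <- unlist(strsplit(sentence, "\\s+"))           # split review into words
    data.frame(review   = sentence,
               positive = sum(words %in% positiveWords),  # positive-word matches
               negative = sum(words %in% negativeWords),  # negative-word matches
               stringsAsFactors = FALSE)
  })
  do.call(rbind, scores)                                  # one row per review
}
# score every review (assumed call producing the 'results' data frame used below)
results <- sentimentFunction(reviews, positiveWords, negativeWords)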
# remove blank reviews
results <- results[!(gsub("[[:space:]]", "", results$review) == ""), ]
# formula: if #positive words > #negative words, then 'positive'
# if #positive words < #negative words, then 'negative'
# equal counts = 'neutral'
results$predicted <- NA  # column to hold the predicted sentiment
for (i in seq_len(nrow(results))) {
  if (results$positive[i] > results$negative[i]) {
    results$predicted[i] <- "positive"
  } else if (results$positive[i] < results$negative[i]) {
    results$predicted[i] <- "negative"
  } else {
    results$predicted[i] <- "neutral"
  }
}
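An equivalent vectorized version (a sketch, not the author's code) avoids the explicit loop:
# equivalent vectorized assignment (no loop)
diffCount <- results$positive - results$negative
results$predicted <- ifelse(diffCount > 0, "positive",
                     ifelse(diffCount < 0, "negative", "neutral"))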
The predicted column holds the predicted sentiment for each review.
# get prof 'helpfulness' rating
helpfulness <- uwaterloo %>%
html_nodes(".break:nth-child(1) .score")%>%
html_text() %>%
as.numeric()
# get prof 'clarity' rating
clarity <- uwaterloo %>%
html_nodes(".break:nth-child(2) .score")%>%
html_text() %>%
as.numeric()
# put helpfulness & clarity rating into dataframe
ratings <- data.frame(helpfulness, clarity)
# find overall rating by averaging the 2 ratings
ratings$overall <- (helpfulness + clarity)/2
# set sentiment ranges based on overall ratings (same manner as RateMyProf)
ratings$sentiment <- NA  # column to hold the rating-based sentiment
ratings$sentiment[ratings$overall < 3] <- 'negative'
ratings$sentiment[ratings$overall >= 3 & ratings$overall <= 3.5] <- 'neutral'
ratings$sentiment[ratings$overall > 3.5] <- 'positive'
The sentiment column of ratings holds the sentiment implied by the numeric ratings.
# import csv file of human-coded ratings based on the review text alone
objRatings <- read.csv("sentimentset_dbm.csv", header = FALSE)
# file has no header row; assumes the coded sentiment labels are in the first column
names(objRatings)[1] <- "sentiment"
The sentiment column of objRatings holds the human-coded ratings.
# load RTextTools for its recall_accuracy() helper
library(RTextTools)
# compare predicted vs. numeric ratings
recall_accuracy(ratings$sentiment, results$predicted)
# compare predicted vs. coded ratings
recall_accuracy(objRatings$sentiment, results$predicted)
# compare numeric vs. coded ratings
recall_accuracy(ratings$sentiment, objRatings$sentiment)
Predicted vs. numeric: 0.7647059
Predicted vs. coded: 0.8823529
Numeric vs. coded: 0.7058824
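The accuracy scores above summarize agreement as a single number; for a finer-grained look (not part of the original analysis), a base R confusion table would show which labels disagree:
# cross-tabulate numeric-rating sentiment against predicted sentiment
table(numeric = ratings$sentiment, predicted = results$predicted)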
[Note: this sentiment analysis is for Prof #1]