Natural Language Processing (NLP) is a set of methods for extracting meaning and insight from text data. NLP has been widely used for identifying different groups of users/customers, for sentiment analysis of reviews, and for text classification. Here, we are presented with data containing reviews of many shows and movies available on the Amazon Instant Video streaming service. As Amazon employees, we seek a way to improve the quality of the content hosted on our site. The goal is to use NLP to pinpoint what reviewers dislike about that content and to build a model that predicts the sentiment of a review from its text.
The data used in this analysis is hosted on the University of California, San Diego web server by Professor Julian McAuley and can be found at http://jmcauley.ucsd.edu/data/amazon/. Professor McAuley amassed a collection of product reviews from the Amazon website and has made all of his datasets available for public use. The particular dataset used in this analysis is the 'Amazon Instant Video' dataset, a collection of reviews of different shows and movies distributed, and in some cases produced, by Amazon.com.
# Loading in the necessary libraries
library("ggplot2")        # plotting
library("stats")          # base statistics functions
library("jsonlite")       # reading the JSON review file
library("NLP")            # NLP infrastructure required by tm
library("tm")             # text mining: corpus handling, preprocessing, document-term matrices
library("SnowballC")      # word stemming
library("corpus")         # text corpus utilities
library("quanteda")       # quantitative text analysis
library("wordcloud")      # word cloud plots
library("RColorBrewer")   # color palettes for the word clouds
library("rJava")          # R/Java interface used by some NLP packages
library("data.table")     # fast data manipulation
library("RDRPOSTagger")   # parts-of-speech tagging
library("tokenizers")     # sentence and word tokenization
library("caTools")        # train/test splitting (sample.split) and AUC (colAUC)
library("caret")          # model training and evaluation (trainControl, confusionMatrix, varImp)
library("randomForest")   # random forest models
library("ggbiplot")       # PCA biplots
library("boot")           # bootstrap utilities
library("class")          # k-nearest-neighbour classification
# Loading in the dataset
setwd('/Users/mareksalamon/Desktop/School/Hunter/Fall Semester 2018/Multivariate Analysis/Final Project')
file = 'amazon_instant_videos_review.json'
# Each line of the file is a separate JSON record, so the lines are joined into a single JSON array before parsing
amazon.data <- fromJSON(paste0('[', paste(collapse = ',', readLines(file)), ']'))
Let’s take a quick look at the data.
head(amazon.data)
## reviewerID asin reviewerName helpful
## 1 A11N155CW1UV02 B000H00VBQ AdrianaM 0, 0
## 2 A3BC8O2KCL29V2 B000H00VBQ Carol T 0, 0
## 3 A60D5HQFOTSOM B000H00VBQ Daniel Cooper "dancoopermedia" 0, 1
## 4 A1RJPIGRSNX4PW B000H00VBQ J. Kaplan "JJ" 0, 0
## 5 A16XRPF40679KG B000H00VBQ Michael Dobey 1, 1
## 6 A1POFVVXUZR3IQ B000H00VBQ Z Hayes 12, 12
## reviewText
## 1 I had big expectations because I love English TV, in particular Investigative and detective stuff but this guy is really boring. It didn't appeal to me at all.
## 2 I highly recommend this series. It is a must for anyone who is yearning to watch "grown up" television. Complex characters and plots to keep one totally involved. Thank you Amazin Prime.
## 3 This one is a real snoozer. Don't believe anything you read or hear, it's awful. I had no idea what the title means. Neither will you.
## 4 Mysteries are interesting. The tension between Robson and the tall blond is good but not always believable. She often seemed uncomfortable.
## 5 This show always is excellent, as far as british crime or mystery showsgoes this is one of the best ever made. The stories are well done and the acting is top notch with interesting twists in the realistic and brutal storylines. This show pulls no punches as it enters into the twisted minds of criminals and the profiler psychiatrist who helps out in a northern english city police force. The show looks like it is shot in Manchester but it is called by another name in the show. One episode is not on this disc the excellent 'prayer of the bone" which is on a seperate disc. Still crime shows don't get much better than this one on either side of the ocean. It's just a great show that never has had a less than well made episode. Unfortunately like all British shows you only get about five shows a year , but these are an hour and a half shows , still one could hope for at least 8 of these a year. The realism and depth of the main character Tony Hill as protrayed by the excellent Robson Green is well worth viewing because he just makes this role truly part of himself in everyway. I bet he went to crime scenes even in real life to research his role. But the writers too must be applauded for their way above average stories. Lets hope this show continues on for many years to come.
## 6 I discovered this series quite by accident. Having watched and appreciated Masterpiece Contemporary: Place of Execution, I was keen to read the novel (which inspired the TV adaptation) by Val McDermid. The novel was very well-written, and a nail-biting suspense thriller. Then I discovered that Val McDermid wrote other novels as well, and a couple of them inspired the TV crime drama Wire in the Blood.I finished watching all of Season 1 and have become a fan of this gritty crime drama that follows the investigations led by DI Carol Jordan (Hermione Norris). She is assisted by clinical psychologist Dr. Tony Hill (Robson Green), a rather eccentric figure who delves deeply into the minds of serial killers, studies patterns of criminal behavior and profiles criminals. His methods may seem strange at times, but he always manages to get results. Both Jordan and Hill make a strange if compelling pair, with Jordan analyzing a case based on evidence, and Hill working based on his knowledge of deviant behavior and what makes people commit disturbing crimes.Unlike some of the "cozy" mysteries such as the long-running Midsomer Murders - Set One and The Complete Inspector Lynley Mysteries, Wire in the Blood is not for the faint of heart. The crimes are horrific, sometimes involving children, almost always patterned on deviant behavior and the suspects are almost always very disturbed individuals. The crime scenes are difficult to watch as are the way victims are found and even the forensic examinations are graphic and unsettling. Though compelling, this is not really a show to watch in one sitting, and may very well give viewers nightmares.The first season contains three main stories, each divided into two episodes:The Mermaids Singing - Hill is on the trail of a seriously disturbed serial killer who targets homosexuals.Shadows Rising - The skeletal remains of a young woman is found, and when further evidence turns up, DI Jordan and Hill realize they have a serial killer on their hands, one with a penchant for dark-haired young women.Justice Painted Blind - Another unsettling case, this time revolving around an old child abduction and murder case. A couple of apparently random murders turn out not to be so random after all when it is discovered that the victims do share a connection involving an old court case where the accused was found not guilty. This throws up a whole bunch of suspects, including the parents of the murdered girl.The writing on this crime drama is excellent, and Robson Green is credible as the clinical psychologist who has a rare knack for profiling and getting under the skin of some of the most dangerous criminals. The drama also explores the chemistry and tension between Hill and DI Jordan, all of which result in a riveting show that keeps viewers coming back for more.The streaming was good overall with only a little delayed streaming during Episode One. There were no glitches on the other episodes. The picture quality could be improved though as it did seem a bit grainy to me.
## overall summary
## 1 2 A little bit boring for me
## 2 5 Excellent Grown Up TV
## 3 1 Way too boring for me
## 4 4 Robson Green is mesmerizing
## 5 5 Robson green and great writing
## 6 5 I purchased the series via streaming and loved it!
## unixReviewTime reviewTime
## 1 1399075200 05 3, 2014
## 2 1346630400 09 3, 2012
## 3 1381881600 10 16, 2013
## 4 1383091200 10 30, 2013
## 5 1234310400 02 11, 2009
## 6 1318291200 10 11, 2011
# Converting the list-valued 'helpful' column to character so str() and summary() display cleanly
amazon.data$helpful <- as.character(amazon.data$helpful)
str(amazon.data)
## 'data.frame': 37126 obs. of 9 variables:
## $ reviewerID : chr "A11N155CW1UV02" "A3BC8O2KCL29V2" "A60D5HQFOTSOM" "A1RJPIGRSNX4PW" ...
## $ asin : chr "B000H00VBQ" "B000H00VBQ" "B000H00VBQ" "B000H00VBQ" ...
## $ reviewerName : chr "AdrianaM" "Carol T" "Daniel Cooper \"dancoopermedia\"" "J. Kaplan \"JJ\"" ...
## $ helpful : chr "c(0, 0)" "c(0, 0)" "0:1" "c(0, 0)" ...
## $ reviewText : chr "I had big expectations because I love English TV, in particular Investigative and detective stuff but this guy "| __truncated__ "I highly recommend this series. It is a must for anyone who is yearning to watch \"grown up\" television. Compl"| __truncated__ "This one is a real snoozer. Don't believe anything you read or hear, it's awful. I had no idea what the title m"| __truncated__ "Mysteries are interesting. The tension between Robson and the tall blond is good but not always believable. S"| __truncated__ ...
## $ overall : num 2 5 1 4 5 5 3 3 5 3 ...
## $ summary : chr "A little bit boring for me" "Excellent Grown Up TV" "Way too boring for me" "Robson Green is mesmerizing" ...
## $ unixReviewTime: int 1399075200 1346630400 1381881600 1383091200 1234310400 1318291200 1381795200 1388275200 1393372800 1396396800 ...
## $ reviewTime : chr "05 3, 2014" "09 3, 2012" "10 16, 2013" "10 30, 2013" ...
# Summary statistics of the data
summary(amazon.data[,c("reviewerID", "asin", "reviewerName", "reviewText", "overall", "summary", "unixReviewTime", "reviewTime")])
## reviewerID asin reviewerName
## Length:37126 Length:37126 Length:37126
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## reviewText overall summary unixReviewTime
## Length:37126 Min. :1.00 Length:37126 Min. :9.755e+08
## Class :character 1st Qu.:4.00 Class :character 1st Qu.:1.368e+09
## Mode :character Median :5.00 Mode :character Median :1.385e+09
## Mean :4.21 Mean :1.377e+09
## 3rd Qu.:5.00 3rd Qu.:1.394e+09
## Max. :5.00 Max. :1.406e+09
## reviewTime
## Length:37126
## Class :character
## Mode :character
##
##
##
# 'helpful' was not included due to its lack of useful information and unnecessary elongation of the resulting output
# Determining the dimensions of the data
dim(amazon.data)
## [1] 37126 9
The dataset contains a total of 37,126 reviews and 9 features.
colnames(amazon.data)
## [1] "reviewerID" "asin" "reviewerName" "helpful"
## [5] "reviewText" "overall" "summary" "unixReviewTime"
## [9] "reviewTime"
# Computing the percentage of missing (NA) values across the entire dataset
(sum(is.na(amazon.data))/(nrow(amazon.data)*ncol(amazon.data)))*100
## [1] 0.09846349
# Computing the percentage of missing (NA) values in the columns that we will use
# (length() is used because these columns are atomic vectors, not data frames)
sum(is.na(amazon.data$reviewText)) / length(amazon.data$reviewText) * 100 # text reviews
sum(is.na(amazon.data$overall)) / length(amazon.data$overall) * 100 # overall product rating
A description of each feature in the dataset, and its corresponding meaning, is listed below:
Feature Glossary
- reviewerID: unique ID of the reviewer
- asin: unique ID (Amazon Standard Identification Number) of the product being reviewed
- reviewerName: display name of the reviewer
- helpful: helpfulness votes for the review, as [helpful votes, total votes]
- reviewText: full text of the review
- overall: overall star rating (1 to 5) given to the product
- summary: short summary/title of the review
- unixReviewTime: time of the review in Unix time
- reviewTime: time of the review in raw (month day, year) format
Let's explore some characteristics of the text reviews. For example, we'll analyze the number of words used in each review.
# Extracting the number of words in each review
review.lengths <- sapply(strsplit(amazon.data$reviewText, " "), length)
summary(review.lengths)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 27.00 44.00 92.95 89.00 2956.00
The reviews have a mean length of about 93 words and a median of 44 words. The median tells us that half of the reviews contain 44 words or fewer, while the large gap between the median and the mean indicates that a relatively small number of very long reviews skew the distribution to the right. This is confirmed by the maximum word count of 2,956.
h = hist(review.lengths, breaks=100, plot = FALSE)
h$density = h$counts/sum(h$counts)*100
plot(h, freq=FALSE, xlim = range(0,500), ylim = range(0,60), col = "#00aedb", main="Histogram of Words", xlab="Word Count", ylab="Percent of Reviews")
grid (NULL,NULL, lty = 1, col = "light grey")
Here, we discover that over half of all reviews are shorter than 50 words and around 75% of all reviews are shorter than 100 words. The plot's range has been truncated to home in on the word counts with the greatest density of reviews; as previously seen, review lengths extend well beyond 2,000 words.
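These proportions can be confirmed numerically (a small check added here, reusing the review.lengths vector computed above):
# Share of reviews shorter than 50 and 100 words, as percentages
mean(review.lengths < 50) * 100
mean(review.lengths < 100) * 100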
Let's now explore the overall rating that accompanies each review. This is the rating, on a 5-star scale (1 being "horrible" and 5 being "fantastic"), that the reviewer gave the product.
i = hist(amazon.data$overall, plot = FALSE)
i$density = i$counts/sum(i$counts)*100
plot(i, freq=FALSE, col = "#00aedb", main="Histogram of Overall Ratings", xlab="Overall Product Rating (Stars)", ylim = range(0,60), ylab="Percent of Reviews")
grid (NULL,NULL, lty = 1, col = "light grey")
It looks like over 50% of all reviews carry an overall rating of 5 stars, which implies that more than half of the reviews are very positive. Roughly 75% of all reviews carry a 4- or 5-star rating, meaning the vast majority of the reviews are positive. This imbalance may become troublesome for our classification model later in the analysis.
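The exact proportions behind this reading of the histogram can be confirmed with a quick table (a small addition, not part of the original output):
# Percentage of reviews at each star rating
round(prop.table(table(amazon.data$overall)) * 100, 1)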
With the information we have gathered thus far, we will perform some parts-of-speech tagging, which labels each word of a review as a noun, verb, adjective, etc. With this information, we will be able to analyze the adjectives, nouns, and verbs that appear most often in the negative reviews in an attempt to pinpoint the gripes that many reviewers had with the products they reviewed. Hopefully, these insights will translate into improvements in the quality of the content that Amazon Instant Video supplies. In addition, because the dataset is very large, we will only perform the tagging on the "bad" reviews, defined here as reviews accompanied by an overall rating of 3 or lower.
# Isolating the bad reviews
bad.reviews <- amazon.data[amazon.data$overall < 4,]
bad.reviews$reviewText <- removePunctuation(bad.reviews$reviewText)
# Specifying the parts-of-speech tags and their abbreviations
unipostag_types <- c("ADJ" = "adjective", "ADP" = "adposition", "ADV" = "adverb", "AUX" = "auxiliary", "CONJ" = "coordinating conjunction", "DET" = "determiner", "INTJ" = "interjection", "NOUN" = "noun", "NUM" = "numeral", "PART" = "particle", "PRON" = "pronoun", "PROPN" = "proper noun", "PUNCT" = "punctuation", "SCONJ" = "subordinating conjunction", "SYM" = "symbol", "VERB" = "verb", "X" = "other")
# Splitting reviews into sentences
sentences <- tokenize_sentences(bad.reviews$reviewText, simplify = TRUE)
# Defining the language and type of tagging
unipostagger <- rdr_model(language = "English", annotation = "UniversalPOS")
# Performing the tagging
unipostags <- rdr_pos(unipostagger, sentences[[1]])
# Tagging the remaining bad reviews (the first was tagged above to initialize the dataframe)
for(i in 2:length(sentences)){
  unipostags <- rbind(unipostags, rdr_pos(unipostagger, sentences[[i]], add_space_around_punctuations = FALSE))
}
# Extracting the nouns, adjectives, and verbs into separate dataframes
unipostags.noun <- unipostags[unipostags$pos == "NOUN", c("token")]
unipostags.noun <- as.data.frame(table(unipostags.noun, dnn = list("word")), responseName = "freq")
unipostags.adj <- unipostags[unipostags$pos == "ADJ", c("token")]
unipostags.adj <- as.data.frame(table(unipostags.adj, dnn = list("word")), responseName = "freq")
unipostags.verb <- unipostags[unipostags$pos == "VERB", c("token")]
unipostags.verb <- as.data.frame(table(unipostags.verb, dnn = list("word")), responseName = "freq")
set.seed(2018)
wordcloud(words = unipostags.noun$word, freq = unipostags.noun$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
The noun word cloud includes words that are already widely used in everyday language, such as "movie", "people", and "story". By themselves, these words don't convey anything in particular, but they take on a different meaning when we remember the context in which they are used: they were pulled exclusively from negative reviews. With this in mind, we can make inferences such as the following: the prevalence of words like "don't", "didn't", and "doesn't" indicates that reviewers often felt the movie or show they watched was lacking something. The word "plot" is a recurring topic, hinting that the plots of the movies may have been unenjoyable to the reviewers. Similarly, the "end" of the movie may have been unsatisfactory to many reviewers. Other topics that come up repeatedly include "characters", "dialogue", and "time"; these point to properties individuals may not have liked, such as the characters, the dialogue, or the run time. That said, all of these words are rather general and likely reflect differences in personal taste as opposed to objective flaws in the movies. Still, such insight may be used to improve the types of recommendations provided to these individuals in the future.
set.seed(2018)
wordcloud(words = unipostags.adj$word, freq = unipostags.adj$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
In this word cloud, we've isolated adjectives rather than nouns. Again, we see words that are common in many contexts and not specific to negative sentiment, even though they were pulled exclusively from negative reviews. "more" suggests that many reviewers indicated what they would have liked to have seen more of in the film; words like "enough", "better", and "little" seem to convey the same thing. "first" is an interesting word because it seems out of place in negative reviews, or in reviews in general; it may indicate that this was the first time a reviewer had seen a movie, or a movie of a particular series or genre. In general, the words convey some dissatisfaction and indicate that something was lacking.
set.seed(2018)
wordcloud(words = unipostags.verb$word, freq = unipostags.verb$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Finally, we have the verbs. The trend of general words being the most prevalent continues. It may be more beneficial to focus on words that appear on the periphery of the word clouds, seeing as the most general words are also the most widely used; the outskirts of the cloud provide a good middle ground between commonly appearing words and overly specific ones. Words connected to the "writing" and dialogue of the show/movie appear multiple times, so the quality of the dialogue may have a strong correlation with reviewers' perception of a movie's overall quality.
The word clouds above show us the most prevalent verbs, adjectives, and nouns present in the negative reviews of the dataset. Although the most prevalent words are also words that are commonly used in everyday communication, some insights could still be gleaned. Many reviews focus on the "characters", "story", and "time" of the movie/show in question, and the "writing" is also mentioned. Much of the emotion conveyed by the reviews is boredom and dissatisfaction, and it seems that many reviewers highlight what they believed was missing from the movie/show rather than what was included that they did not like.
We've gotten a feel for our data. Now, let's explore the characteristics of the reviews themselves. To do this, we will first need to perform some preprocessing, which will include converting all text to lowercase, removing numbers and punctuation, removing common stop words, stemming (reducing related word forms to a common root), and stripping extra whitespace.
# Cleaning the reviews
# Creating a corpus object
corpus <- VCorpus(VectorSource(amazon.data$reviewText))
# Turning all letters to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# Removing all numbers from the reviews
corpus <- tm_map(corpus, removeNumbers)
# Removing all punctuation from the reviews
corpus <- tm_map(corpus, removePunctuation)
# Removing common English stop words
corpus <- tm_map(corpus, removeWords, stopwords())
# Stemming: reducing related word forms to a common root
corpus <- tm_map(corpus, stemDocument)
# Collapsing extra whitespace left over from the previous steps
corpus <- tm_map(corpus, stripWhitespace)
Now that our text data has been cleaned and stripped of unnecessary symbols, we can create a Bag-of-Words model. This 'model' is not a statistical or predictive model but a way of representing our text data: each word in the corpus gets its own column and each row represents a review. The cells of this table hold the number of times each word appears in the corresponding review (0 if it does not appear). This is how we convert our text data into a numerical form that can be understood and manipulated by our statistical models.
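To make the representation concrete, here is a minimal sketch on a two-review toy corpus (the example reviews are invented for illustration):
# A tiny bag-of-words example: rows are documents, columns are terms, cells hold term counts
mini <- VCorpus(VectorSource(c("great show great acting", "boring plot")))
inspect(DocumentTermMatrix(mini))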
# Turning the corpus into a bag-of-words dataframe
dtm <- DocumentTermMatrix(corpus)
print("Before Filtering")
## [1] "Before Filtering"
dtm
## <<DocumentTermMatrix (documents: 37126, terms: 67292)>>
## Non-/sparse entries: 1432447/2496850345
## Sparsity : 100%
## Maximal term length: 178
## Weighting : term frequency (tf)
# Further reducing the number of words analyzed by filtering down to most commonly used words
dtm <- removeSparseTerms(dtm, 0.99)
print("After Filtering")
## [1] "After Filtering"
dtm
## <<DocumentTermMatrix (documents: 37126, terms: 702)>>
## Non-/sparse entries: 901725/25160727
## Sparsity : 97%
## Maximal term length: 12
## Weighting : term frequency (tf)
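As a quick aside (a check added here, not part of the original analysis), the 97% sparsity reported above can be reproduced from the entry counts:
# Sparsity = zero entries / total entries, using the counts printed above
sparse <- 25160727; nonsparse <- 901725
sparse / (sparse + nonsparse)   # ~0.965, which the summary rounds to 97%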
Here, sparsity is the proportion of 0's in the document-term matrix. By removing terms that rarely appear in the dataset, we decrease the sparsity and keep only the words that appear most often. Now, we can visualize the most common words in our text data using a word cloud.
tdm <- TermDocumentMatrix(corpus)
tdm <- removeSparseTerms(tdm, 0.99)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
bow.df <- data.frame(word = names(v),freq=v)
set.seed(2018)
wordcloud(words = bow.df$word, freq = bow.df$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Finally, we can begin building our classification model. We will build a Logistic Regression classifier. The belief is that logistic regression will provide a better balance of interpretability and predictive power than other models would. In addition, logistic regression has a long track record of success on NLP problems, as reported in the research literature.
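As a brief illustration of how logistic regression turns word counts into a sentiment prediction, here is a minimal sketch with made-up coefficients (not taken from the fitted model below):
# The model computes a linear score from the word counts and maps it to a
# probability with the logistic (sigmoid) function
sigmoid <- function(z) 1 / (1 + exp(-z))
score <- 1.5 * 2 + (-2.0) * 1   # hypothetical: weight 1.5 for two uses of "love", -2.0 for one "bore"
sigmoid(score)                  # predicted probability that the review is positive (~0.73)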
# Turning the bag of words into a regular dataframe
data <- as.data.frame(as.matrix(dtm))
# Adding sentiment label to each review
data$y <- as.factor(ifelse(amazon.data$overall >= 4, "positive", "negative"))
# Renaming four columns whose names are reserved words in R ('break', 'function', 'next', 'repeat') to prevent issues during training later on
colnames(data)[names(data) == "break"] <- "break.nlp"
colnames(data)[names(data) == "function"] <- "function.nlp"
colnames(data)[names(data) == "next"] <- "next.nlp"
colnames(data)[names(data) == "repeat"] <- "repeat.nlp"
# Creating a train-test split of the training data
set.seed(2018)
split <- sample.split(data$y, SplitRatio = 0.8)
train <- data[split == TRUE,]
test <- data[split == FALSE,]
# Control settings for caret model training (class probabilities, two-class ROC summary, saved predictions)
train_control <- trainControl(classProbs = TRUE, summaryFunction = twoClassSummary,savePredictions = TRUE)
x.test <- test[,!(colnames(test) %in% c("y"))]
y.test <- test$y
# Plotting distribution of dependent variable
barplot(prop.table(table(train$y)), ylim = range(0,1), col = "#00aedb", main="Histogram of Dependent Variable", xlab="Dependent Response")
grid (NULL,NULL, lty = 1, col = "light grey")
In this analysis, we have attached a sentiment indicator to each review: each was labelled "negative" or "positive" based on the overall rating the reviewer gave the product. The graph above shows the distribution of this dependent variable, and we see that our data is heavily imbalanced, with roughly 80% of the reviews being positive and only 20% negative. For this reason, the training set will be down-sampled. The belief is that, given the size and imbalance of the dataset, down-sampling will improve the model's predictions by removing the potential bias towards the dominant (positive) class. In addition, the hope is that this will speed up training, as previous attempts showed that the models require excessive amounts of time to train on all of the data.
# Downsampling
negative.reviews <- train[train$y == "negative",] # 6,232 negative observations
positive.reviews <- train[train$y == "positive",] # 23,469 positive observations
positive.sample <- positive.reviews[sample(nrow(positive.reviews), 6232),] # sample pos. reviews
train <- rbind(negative.reviews, positive.sample) # combine neg. and sampled pos. reviews
train <- train[sample(nrow(train)),] # randomly shuffle the rows
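As a sanity check (added here; the original does not print it), we can confirm that the training set is now balanced. The same balancing could likely also be done with caret's downSample() helper.
# Both classes should now have 6,232 observations each
table(train$y)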
Let's perform a Principal Component Analysis (PCA) to determine, visually, whether our data points can be geometrically grouped into negative and positive reviews. PCA is a useful technique for exploratory data analysis, allowing one to better visualize the variation present in a dataset with many variables. Given that our dataset is composed of over 700 word features, PCA may provide us with information we have not obtained thus far.
train.pca <- train[,-703]   # drop the dependent variable 'y' (column 703)
train.pca <- train.pca[,apply(train.pca, 2, var, na.rm=TRUE) != 0]   # drop zero-variance terms, which cannot be scaled
pca1 <- prcomp(train.pca, center = TRUE, scale. = TRUE, tol = 0.40)  # keep only components with sdev > 40% of the first PC's sdev
pca2 <- prcomp(train.pca, center = TRUE, scale. = TRUE)              # full PCA, used for the biplot
summary(pca1)
## Importance of first k=1 (out of 702) components:
## PC1
## Standard deviation 6.56141
## Proportion of Variance 0.06133
## Cumulative Proportion 0.06133
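For reference (a small check added here, not in the original output), the "Proportion of Variance" figure above comes directly from the component standard deviations of the full PCA:
# Each component's share of total variance is its variance over the sum of all variances
var.explained <- pca2$sdev^2 / sum(pca2$sdev^2)
round(var.explained[1], 5)   # first PC: ~0.06133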
Based on the output above, the first principal component explains only ~6% of the total variance in our data. Since the first PC captures the largest share of variance of all 702 possible components, no component captures more than 6%. This is not promising: it indicates that the variance is spread thinly across many directions, so no single component summarizes a significant portion of it, and it suggests that our data does not have much inherent structure or an apparent pattern. Let's see if a biplot can help us determine anything further.
# Plotting the principal components onto a biplot
ggbiplot(pca2, choices = 1:2, ellipse=TRUE, groups=train$y, obs.scale = 1, var.scale = 1) +
geom_hline(yintercept=0, linetype="dashed", color = "black")
Above is a biplot. Biplots allow you to visualize how the observations in a dataset relate to one another in terms of their PCs and reveal how each variable contributes to each PC. Using this type of plot, you can also judge which data points are most similar based on their relative positions on the graph. The loadings of the variables are represented as vectors radiating from the origin; the longer a vector, the more influence that variable has on the principal components shown. In this biplot it is very difficult to tell the variables apart, as their names are densely packed near the origin, although it is still possible to see that some vectors are clearly longer than others, signifying large differences in the variable loadings. Based on the dispersion of the data points, there seems to be a vague grouping of positive and negative reviews above and below the dashed line. This is promising, as it signals that there are, in fact, features in our dataset that can differentiate positive reviews from negative ones.
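Since the variable labels are unreadable near the origin, one workaround (a sketch added here, not part of the original analysis) is to read the largest loadings for the first two components directly from the rotation matrix:
# Terms with the largest absolute loadings on PC1 and PC2
head(sort(abs(pca2$rotation[, 1]), decreasing = TRUE), 10)
head(sort(abs(pca2$rotation[, 2]), decreasing = TRUE), 10)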
Based on the PCA conducted above, the principal components do not seem to capture much besides noise and randomness. For this reason, the principal components will not be included in the model as additional features, as is sometimes done, since there is no evidence that doing so would significantly improve predictive performance.
Now, let's create the logistic regression model.
start_time <- Sys.time()
set.seed(2018)
# Fitting the model; with a binomial family, glm models the probability of the
# second level of y, which here is "positive"
lr.model <- glm(y ~ ., data = train, family = "binomial")
pred1 <- predict(lr.model, newdata = x.test, type = "response")   # predicted probability of "positive"
pred1 <- ifelse(pred1 > 0.5, "positive", "negative")              # 0.5 classification threshold
pred1 <- as.factor(pred1)
end_time <- Sys.time()
time.elapse <- (end_time - start_time)
print(time.elapse)
## Time difference of 38.25323 secs
# Printing specificity and sensitivity
confusionMatrix(data=pred1, y.test, positive = "positive")
## Confusion Matrix and Statistics
##
## Reference
## Prediction negative positive
## negative 1196 1228
## positive 362 4639
##
## Accuracy : 0.7859
## 95% CI : (0.7763, 0.7951)
## No Information Rate : 0.7902
## P-Value [Acc > NIR] : 0.823
##
## Kappa : 0.4637
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7907
## Specificity : 0.7677
## Pos Pred Value : 0.9276
## Neg Pred Value : 0.4934
## Prevalence : 0.7902
## Detection Rate : 0.6248
## Detection Prevalence : 0.6735
## Balanced Accuracy : 0.7792
##
## 'Positive' Class : positive
##
Based on the results of the confusion matrix, the model possesses 79% accuracy, which unfortunately does not tell us much on its own because of the imbalanced test set: the test set consists of 79% positive cases, so a model that predicted "positive" for everything would also achieve an accuracy of nearly 80%. However, once we consult the sensitivity and specificity, we see that this is not the case here; the model captures 79% of the positive cases and 77% of the negative cases. This is promising, as we are able to predict the sentiment of a review with nearly 80% accuracy while simultaneously capturing a relatively high proportion of both the negative and positive cases of our test dataset.
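The figures cited above can be recomputed directly from the confusion-matrix counts (a small check added here):
# Cells taken from the confusion matrix above
TN <- 1196; FN <- 1228; FP <- 362; TP <- 4639
(TP + FN) / (TN + FN + FP + TP)   # prevalence of positives / no-information rate, ~0.79
TP / (TP + FN)                    # sensitivity, ~0.79
TN / (TN + FP)                    # specificity, ~0.77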
Now that we have created our final model, let's determine which features it deems most important in separating negative from positive cases.
log.imp <- data.frame(varImp(lr.model))
log.imp$Vars <- row.names(log.imp)
row.names(log.imp) <- NULL
log.imp[order(-log.imp$Overall),][1:10,]
## Overall Vars
## 376 15.078419 love
## 276 12.270708 great
## 72 9.989996 bore
## 193 8.869887 enjoy
## 425 8.148763 noth
## 668 7.933548 wast
## 209 7.257427 excel
## 304 7.026167 hook
## 559 6.973709 slow
## 309 6.570293 howev
Here, we have the features that the logistic model has determined to be most important in distinguishing positive and negative reviews. Note that the terms are stemmed, so "noth", "wast", "excel", and "howev" correspond to forms of "nothing", "waste", "excellent", and "however". Words such as "love", "great", and "bore" are the strongest predictors of a review's sentiment. This makes sense: positive reviews often include words such as "love" and "great", which are largely absent from negative reviews, while "bore" is often used by reviewers to describe something they did not enjoy. Similar words that strongly imply satisfaction or dissatisfaction appear further down the feature-importance list.
Based on the results of the parts-of-speech tagging, we can say which nouns, adjectives, and verbs appear most often within negative reviews. Unfortunately, the analysis of these words provided little information about the specific gripes reviewers had with the content they disliked; many of the words are simply common, everyday words that are not specific to negative sentiment. Still, some insights could be made. Many of the reviews focused on the characters, writing, and dialogue of the movie/show reviewed, and many reviewers seemed to feel that a movie or show lacked something, making it boring, simplistic, or unsatisfying. One improvement to the analysis done here would be to filter out words that appear in both the negative and positive reviews, which would allow us to pinpoint words that appear strictly in negative reviews, as sketched below.
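A minimal sketch of that improvement, assuming the bag-of-words data frame 'data' and its label column 'y' built earlier:
# Keep only stemmed terms that occur in negative reviews but never in positive ones
terms <- setdiff(colnames(data), "y")
neg.terms <- terms[colSums(data[data$y == "negative", terms]) > 0]
pos.terms <- terms[colSums(data[data$y == "positive", terms]) > 0]
negative.only <- setdiff(neg.terms, pos.terms)
head(negative.only)
# (after the sparse-term filtering above this set may be small or empty; the same
# idea applies to the unfiltered document-term matrix)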
The Principal Component Analysis revealed that no principal component was able to capture a significant proportion of the variance in the data. This is likely because the data points are scattered across over 700 different features. Yet the resulting biplot revealed that there do seem to be collections of features that can separate the negative reviews from the positive ones. However, neither the individual features nor their loadings could be distinguished on the plot, which would have given us an idea of which features are most similar and how they influence the classification of the data points. Further analysis should be done to isolate those influential features from the biplot, and reviews that are most similar to each other, as determined by the PCA and biplot, should be examined to determine what makes them similar.
The final model produced was a logistic regression with an accuracy of 79% that correctly predicts 79% of the positive cases and 77% of the negative cases. This model could be further improved by using additional data and tuning the classification threshold. Analysis of the model's feature importances revealed that words such as "love", "great", "bore", and "enjoy" have the greatest influence on the classification of a review's sentiment. For future research, more emphasis should be put on words that convey strong emotions, as these are often the words that set reviews apart from each other. In addition, other models, both (non)linear and (non)parametric, should be experimented with, as they may prove superior at sentiment prediction. Finally, an AUC value should be used to score and compare the overall performance of the different classification models.
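As a starting point for that comparison, here is a minimal sketch using colAUC() from the already-loaded caTools package, scoring the logistic model's predicted probabilities rather than its thresholded labels:
# Area under the ROC curve for the logistic model on the test set
prob1 <- predict(lr.model, newdata = x.test, type = "response")
colAUC(prob1, y.test)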