Overview

The data being analyzed are reviews of the TV show ‘How I Met Your Mother’. This analysis performs binary classification: each review is labeled positive or negative, recorded on a 0-1 scale where 1 is positive.

Read In Data

We read the data from a CSV file. The first document in the corpus is printed here along with its label. As we can see, it is a positive review labeled as positive (Scale = 1).

library(tm) # package used for the corpus, text cleanup, and document-term matrix
reviews = read.csv("TV_Show_Review_Data.csv", header = TRUE)
head(reviews)
##   ID
## 1  1
## 2  2
## 3  3
## 4  4
## 5  5
## 6  6
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Text
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        This show has beautifully captured all the emotions: joy, laughter, amazement, disappointment, heartbreak you name it, they have it. It has got a lot of life lessons and impacted me in so many ways. I found it very uplifting and it will always be close to my heart. A hearty laugh is guaranteed in most episodes (some of them are really emotional).  Also, there are things that you can learn from every character that will help you grow. I will definitely miss how I met your mother, but I advise to watch the Alternate ending as it is a much happier ending.
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        this show is INCREDIBLE!! im a huge sucker for sitcoms so I've watched many of them, good and bad. this is by far one of the best sitcoms I've ever watched though. there's 9 seasons, all incredibly well written and entertaining. i became so hooked on this show that reserved the last 9 days watching this show and doing nothing else. i was addicted. 
## 3 I just finished the series for the second time. I love this show and personally, I believe this is the greatest sitcoms to air to this date. The personal relationships, the emotionally packed stories, and the beloved characters make this show not only entertaining but also makes the audience feel connected to these characters. I am a big fan of season 9, and I can now safely say that even though Ted and Robin had chemistry throughout al the show (and I loved it of course) they were just not meant for each other. I wanted Ted and Tracy to end together like it was supposed to be, it would have been the perfect finale. How I Met Your Mother is a show about friendships, and this show deserved a happy ending for Ted. The last episode was supposed to make fans see how every relationship and every breakup Ted had was worth it, just to live happily ever after with the mother. Nevertheless, I loved this show and I loved its build-up for the finale, I hope to see these actors/actresses in the future and I hope sitcoms can learn from How I Met Your Mother.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         This is possibly one of the best recent sitcoms to exist (along with Big Bang Theory). Everything about this show is amazing, especially the way the characters matured (Barney finally getting married, Ted finding 'The One', Marshall and Lily having kids) to name a few. The episodes were always filled with good stories, good acting and amazing comedy. It also had it's emotional moments (Marshall's dad dying, Robin finding out she's infertile, Ted's wife dying) to name a few. Would definitely recommend this show to others
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  This show is the best show you can ever had actually......that emotional moments gonna strike in your heart for forever just this show is totally LEGEN-WAIT FOR IT, IT'S GONNA LEAVE A MARK FOR GENERATIONS TO COME-DARY: LEGENDARY!! People expect this to be a sitcom but this is not 100% sitcom instead it's a amazingly Scripted Real life story. Please don't watch it through the eyes of only sitcom but in the manner of journey through lives of people connected to eachother. This show has a most beautiful love story in generations of the TV series
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        The best show ever!!! Swarley Ted Robin Marshal Lilly the world's best t.v. show also there is many fan websites and also I bet no one here has left a single bad review because no one would dare to because this show is incredible and amazing I love this show although after it would of been great if they would of done a second show about how Ted grew old with the mother even after she died or they could if went of the other outcome where the mother lives witch would of been amazing I can never stop talking about this show it's so amazing
##   Descriptor Scale
## 1   Positive     1
## 2   Positive     1
## 3   Positive     1
## 4   Positive     1
## 5   Positive     1
## 6   Positive     1
review_corpus <- Corpus(VectorSource(reviews$Text))
# Print the first document in the corpus and the label that corresponds to it
review_corpus[[1]][1]
## $content
## [1] "This show has beautifully captured all the emotions: joy, laughter, amazement, disappointment, heartbreak you name it, they have it. It has got a lot of life lessons and impacted me in so many ways. I found it very uplifting and it will always be close to my heart. A hearty laugh is guaranteed in most episodes (some of them are really emotional).  Also, there are things that you can learn from every character that will help you grow. I will definitely miss how I met your mother, but I advise to watch the Alternate ending as it is a much happier ending."
reviews$Scale[1]
## [1] 1

Cleanup Corpus

Convert to Lowercase

The review corpus is converted to lowercase for easier processing.

review_corpus <- tm_map(review_corpus, PlainTextDocument)
## Warning in tm_map.SimpleCorpus(review_corpus, PlainTextDocument): transformation
## drops documents
review_corpus <- tm_map(review_corpus, tolower)
## Warning in tm_map.SimpleCorpus(review_corpus, tolower): transformation drops
## documents
review_corpus[[1]][1]
## $content
## [1] "this show has beautifully captured all the emotions: joy, laughter, amazement, disappointment, heartbreak you name it, they have it. it has got a lot of life lessons and impacted me in so many ways. i found it very uplifting and it will always be close to my heart. a hearty laugh is guaranteed in most episodes (some of them are really emotional).  also, there are things that you can learn from every character that will help you grow. i will definitely miss how i met your mother, but i advise to watch the alternate ending as it is a much happier ending."

Remove Punctuation

Remove the punctuation from the review corpus.

review_corpus <- tm_map(review_corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(review_corpus, removePunctuation): transformation
## drops documents
review_corpus[[1]][1]
## $content
## [1] "this show has beautifully captured all the emotions joy laughter amazement disappointment heartbreak you name it they have it it has got a lot of life lessons and impacted me in so many ways i found it very uplifting and it will always be close to my heart a hearty laugh is guaranteed in most episodes some of them are really emotional  also there are things that you can learn from every character that will help you grow i will definitely miss how i met your mother but i advise to watch the alternate ending as it is a much happier ending"

Remove Stop Words

Stop words are common words like ‘i’, ‘we’, and ‘at’. They are unhelpful because they occur very frequently and do not help differentiate the target classes. We remove the stop words from the review corpus here. We use the stop word list from the tm package and add a few other words that will not contribute to sentiment.

review_corpus <- tm_map(review_corpus, removeWords, c("show", "tv", "character", stopwords(kind = "en")))
## Warning in tm_map.SimpleCorpus(review_corpus, removeWords, c("show", "tv", :
## transformation drops documents
review_corpus[[1]][1]
## $content
## [1] "   beautifully captured   emotions joy laughter amazement disappointment heartbreak  name       got  lot  life lessons  impacted    many ways  found   uplifting   will always  close   heart  hearty laugh  guaranteed   episodes     really emotional  also   things   can learn  every   will help  grow  will definitely miss   met  mother   advise  watch  alternate ending     much happier ending"
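
The three cleanup steps so far (lowercasing, punctuation removal, stop word removal) can be approximated in base R. This is only a rough sketch for a single string, not what tm does internally: `clean_text` and `toy_stopwords` are hypothetical names, and the stop list is shortened for illustration.

```r
# Minimal base-R sketch of the cleanup steps (no tm dependency).
# `toy_stopwords` is a hypothetical, shortened stop list for illustration only.
toy_stopwords <- c("this", "is", "the", "a", "i", "show")

clean_text <- function(x, stop_words) {
  x <- tolower(x)                             # 1. convert to lowercase
  x <- gsub("[[:punct:]]", "", x)             # 2. strip punctuation
  words <- strsplit(x, "[[:space:]]+")[[1]]   # split the string on whitespace
  keep <- nzchar(words) & !(words %in% stop_words)
  paste(words[keep], collapse = " ")          # 3. drop stop words and rejoin
}

clean_text("This show is INCREDIBLE!! I loved it.", toy_stopwords)
## [1] "incredible loved it"
```

Unlike the tm output above, this sketch also collapses the leftover whitespace, which tm leaves in place until a later step.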

Stemming

Stemming reduces the number of inflected word forms appearing in the text by mapping each word to its stem. For example, “argue”, “arguing”, and “argues” are reduced to their common stem “argu”.

review_corpus <- tm_map(review_corpus, stemDocument)
## Warning in tm_map.SimpleCorpus(review_corpus, stemDocument): transformation
## drops documents
review_corpus[[1]][1]
## $content
## [1] "beauti captur emot joy laughter amaz disappoint heartbreak name got lot life lesson impact mani way found uplift will alway close heart hearti laugh guarante episod realli emot also thing can learn everi will help grow will definit miss met mother advis watch altern end much happier end"

Creating A Document Term Matrix

We now extract the word frequencies from the review corpus. These will be used as features in the prediction problem. We generate a document-term matrix (DTM) where the rows correspond to reviews from the data set and the columns correspond to words in the reviews. We also remove sparse terms from the DTM. Sparse terms are words that appear in very few of the reviews.

# Get frequencies
frequencies <- DocumentTermMatrix(review_corpus)

# Remove sparse terms: drop any term that is absent from at least 99.5% of the reviews
sparsities <- removeSparseTerms(frequencies, 0.995)

# Convert the matrix into a dataframe
review_sparse_df <- as.data.frame(as.matrix(sparsities))
# make column names R friendly
colnames(review_sparse_df) <- make.names(colnames(review_sparse_df))
# add dependent variable to dataset (pos/neg rating descriptor is added)
review_sparse_df$recommend_id = reviews$Descriptor
head(review_sparse_df$recommend_id)
## [1] "Positive" "Positive" "Positive" "Positive" "Positive" "Positive"
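
To make the shape of a document-term matrix concrete, here is a small sketch that builds one by hand in base R. The three documents are made up, and the code is an illustration of the idea, not the tm implementation. It also computes each term's sparsity, meaning the share of documents in which the term does not occur.

```r
# Toy document-term matrix built with base R only (illustrative, not tm's code).
docs <- c("great funny great", "boring end", "funny end")   # hypothetical cleaned reviews

tokens <- strsplit(docs, " ", fixed = TRUE)   # one character vector of words per document
vocab  <- sort(unique(unlist(tokens)))        # every distinct term becomes a column

# Count how often each vocabulary term occurs in each document
counts <- vapply(tokens,
                 function(words) as.integer(table(factor(words, levels = vocab))),
                 integer(length(vocab)))
toy_dtm <- t(counts)                          # rows = documents, columns = terms
dimnames(toy_dtm) <- list(paste0("doc", seq_along(docs)), vocab)

# Sparsity of each term: share of documents in which it is absent
term_sparsity <- colMeans(toy_dtm == 0)

toy_dtm
term_sparsity
```

In this toy matrix, "great" occurs twice in doc1 and nowhere else, so its sparsity is 2/3. A sparsity threshold decides which such columns are kept as features.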

Build Predictive Model

Now we are going to construct a predictive model. First, we establish the baseline accuracy, which is the proportion of the majority label in the target variable.

prop.table(table(review_sparse_df$recommend_id))
## 
## Negative Positive 
##     0.51     0.49

51% of the data is negative and 49% is positive. Always predicting the majority class (negative) would be correct 51% of the time, so 51% is the baseline accuracy for the predictive model.
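
The baseline can also be computed directly instead of being read off the table. The sketch below assumes a hypothetical label vector that only mirrors the 51/49 proportions reported above. It is not the actual review data.

```r
# Hypothetical label vector mirroring the 51/49 split reported above
labels <- c(rep("Negative", 51), rep("Positive", 49))

class_share <- prop.table(table(labels))          # proportion of each class
baseline_accuracy <- max(class_share)             # accuracy of always guessing the majority class
majority_class <- names(which.max(class_share))   # which class that is

cat("Baseline:", baseline_accuracy, "by always predicting", majority_class, "\n")
```

Any model worth keeping should beat this number on held-out data.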

Creating Training and Test Data

We will divide the data set into a training set and a testing set. The training set will contain 80% of the data while the testing set will contain 20% of the data.

library(caTools) # package used for splitting the data
## Warning: package 'caTools' was built under R version 4.0.3
set.seed(100)
split = sample.split(review_sparse_df$recommend_id, SplitRatio = 0.80)

reviews.train <- subset(reviews, split==TRUE)
reviews.test <- subset(reviews, split==FALSE)

train <- subset(review_sparse_df, split==TRUE)
test <- subset(review_sparse_df, split==FALSE)
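
sample.split keeps the class proportions of the label roughly the same in both subsets. As a rough illustration of that idea, the sketch below does a stratified 80/20 split in base R. The label vector is hypothetical (it only mirrors the 51/49 proportions reported earlier), and the loop is an approximation of the behavior, not the caTools implementation.

```r
# Base-R sketch of a stratified 80/20 split (an approximation of what
# caTools::sample.split does, not its actual implementation).
set.seed(100)
labels <- c(rep("Negative", 51), rep("Positive", 49))   # hypothetical labels

in_train <- logical(length(labels))          # FALSE everywhere to start
for (cls in unique(labels)) {
  idx <- which(labels == cls)                # positions of this class
  n_train <- round(0.80 * length(idx))       # 80% of this class goes to training
  in_train[sample(idx, n_train)] <- TRUE     # sample within the class only
}

table(labels, in_train)                      # class counts in each subset
```

Sampling within each class is what guarantees that the 20-review test set contains 10 negative and 10 positive reviews here, regardless of the seed.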

Using Random Forest Algorithm

Random Forest is an ensemble classification algorithm: it builds many decision trees, usually trained with the bagging method, and combines their votes. We use the randomForest package to build the model.

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.0.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
set.seed(100)

# Convert the target variable into the factor type
train$recommend_id = as.factor(train$recommend_id)
test$recommend_id = as.factor(test$recommend_id)

# train the random forest algorithm on the dataset
rf_model <- randomForest(recommend_id ~ ., data=train)

# test the model and print a table of results
predict_rf <- predict(rf_model, newdata = test)
table(test$recommend_id, predict_rf)
##           predict_rf
##            Negative Positive
##   Negative        9        1
##   Positive        3        7
plot(predict_rf, main = "Reviews For How I Met Your Mother TV Show", ylab = "Num Of Reviews", xlab = "Sentiment Of Review")

accuracy = 16/20
cat("Accuracy:  ", accuracy)
## Accuracy:   0.8
precision = 7/(7+1)
cat("\nPrecision: ", precision)
## 
## Precision:  0.875
recall = 7/(7 + 3)
cat("\nRecall:    ", recall)
## 
## Recall:     0.7

Random Forest Results

We can see from the results above that the model classified 16 of the 20 test reviews correctly and 4 incorrectly. This gives the model an accuracy of 80%, with a precision of 87.5% and a recall of 70% for the positive class.
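
The same three metrics can be derived from the confusion matrix rather than typed in by hand. The sketch below copies the Random Forest counts from the table above. `classification_metrics` is a hypothetical helper name, and it assumes rows hold the actual class and columns hold the predicted class.

```r
# Confusion matrix copied from the Random Forest output above
# (rows = actual class, columns = predicted class)
conf_mat <- matrix(c(9, 3, 1, 7), nrow = 2,
                   dimnames = list(actual = c("Negative", "Positive"),
                                   predicted = c("Negative", "Positive")))

# Hypothetical helper: metrics with "Positive" treated as the class of interest
classification_metrics <- function(cm, positive = "Positive") {
  tp <- cm[positive, positive]       # actual positive, predicted positive
  fp <- sum(cm[, positive]) - tp     # predicted positive but actually negative
  fn <- sum(cm[positive, ]) - tp     # actually positive but predicted negative
  c(accuracy  = sum(diag(cm)) / sum(cm),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

classification_metrics(conf_mat)
##  accuracy precision    recall 
##     0.800     0.875     0.700
```

A table laid out the other way round (predictions in rows, actual classes in columns) would need to be transposed with t() before being passed to this helper.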

Sentiment Analysis With Naive Bayes Classification

To put the Random Forest results in context, we run a second classification algorithm on the same review data: Naive Bayes. This classifier uses Bayes' theorem to predict the class of unseen data. It was chosen for comparison because, like Random Forest, it can learn its parameters from a small amount of training data. It can also be trained quickly compared to other classifiers.
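
Naive Bayes multiplies per-word conditional probabilities together, so a word that never occurs in one class during training would force that class's score to zero. Laplace (add-one) smoothing avoids this, and it is the idea behind the `laplace = 1` argument passed when the classifier is trained. The sketch below shows a common form of the smoothed estimate for one binary feature. The function name and the numbers are hypothetical, and this is not e1071's internal code.

```r
# Sketch of a Laplace-smoothed estimate of P(word present | class) for one
# binary feature (present / absent). Toy numbers, not e1071's internal code.
laplace_prob <- function(n_with_word, n_in_class, laplace = 1, n_levels = 2) {
  (n_with_word + laplace) / (n_in_class + laplace * n_levels)
}

# A word never seen in 40 positive training reviews still gets a small
# non-zero probability instead of zeroing out the whole product:
laplace_prob(0, 40)                 # 1 / 42
laplace_prob(10, 40)                # 11 / 42

# Without smoothing the unseen word gets probability exactly 0
laplace_prob(0, 40, laplace = 0)    # 0
```

With only 80 training reviews and several hundred terms, many terms are unseen in one of the two classes, so some smoothing is a reasonable default.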

Train and Test Naive Bayes Model

We are going to use the training and test sets from above to train and test the model.

library(e1071) # library used for NB algorithm
## Warning: package 'e1071' was built under R version 4.0.3
dim(train)
## [1]  80 654
# Function to convert the word frequencies to yes or no labels
convert_count <- function(x){
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1), labels = c("No", "Yes"))
  y
}

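# Apply convert_count to the term columns only (the label column is not a feature)
# to get the final training and testing matrices for NB
term_cols <- setdiff(names(train), "recommend_id")
trainNB <- apply(train[, term_cols], 2, convert_count)
testNB <- apply(test[, term_cols], 2, convert_count)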

classifier <- naiveBayes(trainNB, train$recommend_id, laplace = 1)
prediction_model <- predict(classifier, newdata = testNB)

length(prediction_model)
## [1] 20
nrow(reviews.test) # number of test reviews (length() would count the columns instead)
## [1] 20
table("Predictions" = prediction_model, "Actual" = reviews.test$Descriptor)
##            Actual
## Predictions Negative Positive
##    Negative       10        4
##    Positive        0        6
plot(prediction_model, main = "Reviews For How I Met Your Mother TV Show", ylab = "Num Of Reviews", xlab = "Sentiment Of Review")
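
To see what the count conversion does, the sketch below applies the same helper to a tiny, made-up table of term counts. The helper is repeated so the example is self-contained, and `toy_counts` is hypothetical. Note that apply() returns a character matrix of "No"/"Yes" values, not a data frame of factors.

```r
# Same helper as above, repeated so the example is self-contained
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  factor(y, levels = c(0, 1), labels = c("No", "Yes"))
}

# Hypothetical term counts for three reviews
toy_counts <- data.frame(funni = c(0, 2, 1), bore = c(1, 0, 0))

# Convert each column of counts into presence/absence labels
toy_nb <- apply(toy_counts, 2, convert_count)
toy_nb   # character matrix, one column per term
```

A count of 2 and a count of 1 both become "Yes", so the classifier only sees whether a term occurs in a review, not how often.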

Results of NB

Below, the accuracy, precision, and recall are calculated for the NB predictive model used above.

precisionNB = 6 / (6 + 0)
cat("\nPrecision: ", precisionNB)
## 
## Precision:  1
recallNB = 6 / (6 + 4)
cat("\nRecall: ", recallNB)
## 
## Recall:  0.6
accuracyNB = (10 + 6)/ (10 + 4 + 6 + 0)
cat("\nAccuracy: ", accuracyNB)
## 
## Accuracy:  0.8

The output above shows that the NB classifier had higher precision than the Random Forest classifier (100% versus 87.5%). Recall was higher for Random Forest by 0.1 (70% versus 60%). Accuracy was identical for both methods at 0.80, or 80%.

Overall, the outputs of the two predictive models are similar. Precision and recall differ, but not by much.

Which Performed Better and Why?

As mentioned above, the two predictive models performed similarly, and their accuracy was identical. The data set is small compared to the data sets these algorithms are usually applied to. There were only 100 entries in the Reviews data set, so on the 20-review test set a single misclassified review changes accuracy by 5 percentage points, and neither model can be called clearly better on this evidence. Random Forest usually performs well on large data sets that stay in the same format, meaning the structure of the data frame stays the same. Naive Bayes generally works better on smaller data sets and can be retrained quickly to adapt to changing data structures.

Search Engine Manipulation Effect

The Search Engine Manipulation Effect (SEME) is the change in consumer preferences caused by search engines manipulating their search results. It has been described as one of the largest behavioral effects ever discovered. If I were working at Google and knew that SEME could be used to help Clinton win the election, what I would do would depend on my moral and political standing. If Google backed Clinton and wanted her to win the election, it could have manipulated its algorithms to re-rank search results in her favor. However, knowing that this effect exists, there would have to be checks on the algorithms used by the search engine to make sure that the text mining algorithms are not purposely skewing search results toward one political party or the other.
The same reasoning holds from a Christian worldview. It is morally wrong to tamper with the election process in that fashion. Purposely tampering with the Google search engine to help a favored candidate win is a form of lying; it is an untruthful act. The Christian worldview holds honesty as one of its core values, as in the commandments. Being dishonest for personal satisfaction or gain is therefore not living by the Christian worldview. Therefore, my reaction would be the same.