# R setup
library(reticulate)
library(tidyverse)
library(gt)
library(cvms)
set.seed(1)

# Python setup
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Introduction

This project combines several data science languages, principles, and tools. I used both R and Python in this document: R (specifically, the tidyverse) to summarize and visualize the data and model performance, and Python (specifically, scikit-learn) to build the models and extract their performance metrics. Finally, I wrote this document in RMarkdown because it seamlessly weaves code from multiple programming languages, plus LaTeX, into a single user-friendly document.

Methods

I created four machine learning models to predict the sentiment (positive or negative) of movie reviews from IMDB. First, I trained the models on a subset of reviews with known sentiment. Next, I tested model accuracy by predicting the sentiment of held-out reviews and determining whether each prediction was correct. Then, I selected the most accurate model and tuned its parameters to produce the best possible prediction accuracy.

Research question

Which machine learning model is most accurate at predicting the reviewer’s sentiment given a text review of a movie?

Model inputs

Text movie reviews.

Model outputs

Binary predictions: either ‘positive’ or ‘negative’.

Data

The dataset for this project, a collection of IMDB movie reviews, is available on Kaggle.

The dataset contains 50,000 rows and 2 columns.
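The chunk that reads the reviews into df_review is not shown above. A minimal loading sketch, assuming the Kaggle CSV was saved locally under the hypothetical filename 'IMDB Dataset.csv':

import pandas as pd

# Hypothetical local path: point this at wherever the Kaggle CSV was saved
df_review = pd.read_csv('IMDB Dataset.csv')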

df_review
##                                                   review sentiment
## 0      One of the other reviewers has mentioned that ...  positive
## 1      A wonderful little production. <br /><br />The...  positive
## 2      I thought this was a wonderful way to spend ti...  positive
## 3      Basically there's a family where a little boy ...  negative
## 4      Petter Mattei's "Love in the Time of Money" is...  positive
## ...                                                  ...       ...
## 49995  I thought this movie did a down right good job...  positive
## 49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
## 49997  I am a Catholic taught in parochial elementary...  negative
## 49998  I'm going to have to disagree with the previou...  negative
## 49999  No one expects the Star Trek movies to be high...  negative
## 
## [50000 rows x 2 columns]

The dataset started with a balanced number of positive and negative reviews. Because real-world data are rarely so perfectly balanced, I first imbalanced the samples deliberately and then re-balanced them through random under-sampling.

First, I imbalanced the reviews by selecting a different number of positive and negative reviews:

df_positive = df_review[df_review['sentiment']=='positive'][:9000]
df_negative = df_review[df_review['sentiment']=='negative'][:1000]
df_review_imb = pd.concat([df_positive, df_negative])

Then, I re-balanced the dataset by randomly sampling my imbalanced samples:

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=0)
# fit_resample() returns (X_resampled, y_resampled): the tuple unpacking below
# first binds the under-sampled reviews to df_review_bal, then writes the
# matching labels into its 'sentiment' column
df_review_bal, df_review_bal['sentiment'] = rus.fit_resample(df_review_imb[['review']],
                                                             df_review_imb['sentiment'])
df_review_bal
##                                                  review sentiment
## 0     Basically there's a family where a little boy ...  negative
## 1     This show was an amazing, fresh & innovative i...  negative
## 2     Encouraged by the positive comments about this...  negative
## 3     Phil the Alien is one of those quirky films wh...  negative
## 4     I saw this movie when I was about 12 when it c...  negative
## ...                                                 ...       ...
## 1995  Knute Rockne led an extraordinary life and his...  positive
## 1996  At the height of the 'Celebrity Big Brother' r...  positive
## 1997  This is another of Robert Altman's underrated ...  positive
## 1998  This movie won a special award at Cannes for i...  positive
## 1999  You'd be forgiven to think a Finnish director ...  positive
## 
## [2000 rows x 2 columns]

I split my newly balanced dataset into training and testing subsets. My training subset was \(\frac{2}{3}\) of the balanced dataset and the test subset was the remaining \(\frac{1}{3}\).

from sklearn.model_selection import train_test_split
train, test = train_test_split(df_review_bal, test_size=0.33, random_state=42)

train_x, train_y = train['review'], train['sentiment']
test_x, test_y = test['review'], test['sentiment']

Next, I prepared the data for modelling. To model natural language text, I needed to transform the words in each review into a numeric representation of their frequencies. I used scikit-learn's term frequency-inverse document frequency (TF-IDF) vectorizer for these frequency calculations. In this approach, a word's TF-IDF score increases with its frequency within a document but decreases if the word is common across documents.
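To make that trade-off concrete, here is a minimal sketch (an illustration, not part of the analysis) that reproduces scikit-learn's smoothed IDF formula, \(\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1\), by hand on a hypothetical three-document corpus: 'movie' appears in every document and receives the minimum IDF, while the words that appear only once score higher.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus for illustration only
docs = ['great movie', 'terrible movie', 'boring movie soundtrack']

vec = TfidfVectorizer()                     # defaults: smooth_idf=True, norm='l2'
X = vec.fit_transform(docs)

# Recompute scikit-learn's smoothed IDF by hand
n = len(docs)
df = (X.toarray() > 0).sum(axis=0)          # document frequency of each term
idf_manual = np.log((1 + n) / (1 + df)) + 1
print(np.allclose(idf_manual, vec.idf_))
## True
print(vec.get_feature_names_out())
## ['boring' 'great' 'movie' 'soundtrack' 'terrible']
print(np.round(vec.idf_, 2))                # 'movie' is everywhere, so lowest IDF
## [1.69 1.69 1.   1.69 1.69]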

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
train_x_vector = tfidf.fit_transform(train_x)

pd.DataFrame.sparse.from_spmatrix(train_x_vector,
                                  index=train_x.index,
                                  columns=tfidf.get_feature_names_out())
##        00  000  007  01pm   02   04  ...  æon  élan  émigré  ísnt   ïn  ünfaithful
## 81    0.0  0.0  0.0   0.0  0.0  0.0  ...  0.0   0.0     0.0   0.0  0.0         0.0
## 915   0.0  0.0  0.0   0.0  0.0  0.0  ...  0.0   0.0     0.0   0.0  0.0         0.0
## 1018  0.0  0.0  0.0   0.0  0.0  0.0  ...  0.0   0.0     0.0   0.0  0.0         0.0
## 380   0.0  0.0  0.0   0.0  0.0  0.0  ...  0.0   0.0     0.0   0.0  0.0         0.0
## 1029  0.0  0.0  0.0   0.0  0.0  0.0  ...  0.0   0.0     0.0   0.0  0.0         0.0
## ...   ...  ...  ...   ...  ...  ...  ...  ...   ...     ...   ...  ...         ...
## 1130  0.0  0.0  0.0   0.0  0.0  0.0  ...  0.0   0.0     0.0   0.0  0.0         0.0
## 1294  0.0  0.0  0.0   0.0  0.0  0.0  ...  0.0   0.0     0.0   0.0  0.0         0.0
## 860   0.0  0.0  0.0   0.0  0.0  0.0  ...  0.0   0.0     0.0   0.0  0.0         0.0
## 1459  0.0  0.0  0.0   0.0  0.0  0.0  ...  0.0   0.0     0.0   0.0  0.0         0.0
## 1126  0.0  0.0  0.0   0.0  0.0  0.0  ...  0.0   0.0     0.0   0.0  0.0         0.0
## 
## [1340 rows x 20625 columns]
test_x_vector = tfidf.transform(test_x)

I trained four natural language text analysis models:

  1. Support-vector machine learning (SVML) model
  2. Classification tree
  3. Naive Bayes
  4. Logistic regression
# 1) support-vector machine learning (SVML) model
svc = SVC(kernel='linear')
svc.fit(train_x_vector, train_y)
## SVC(kernel='linear')

# 2) classification tree
dec_tree = DecisionTreeClassifier()
dec_tree.fit(train_x_vector, train_y)
## DecisionTreeClassifier()

# 3) naive Bayes model (GaussianNB requires a dense array)
gnb = GaussianNB()
gnb.fit(train_x_vector.toarray(), train_y)
## GaussianNB()

# 4) logistic regression model
log_reg = LogisticRegression()
log_reg.fit(train_x_vector, train_y)
## LogisticRegression()

Finally, I checked that each model generated a prediction when given made-up reviews. Each model predicts either ‘positive’ or ‘negative’ from a string of text:

# 1) support-vector machine learning (SVML) model
print(svc.predict(tfidf.transform(['A good movie'])))
## ['positive']
print(svc.predict(tfidf.transform(['I did not like this movie at all'])))
## ['negative']
# 2) classification tree
print(dec_tree.predict(tfidf.transform(['A good movie'])))
## ['positive']
print(dec_tree.predict(tfidf.transform(['I did not like this movie at all'])))
## ['positive']
# 3) naive Bayes model (sparse input must be densified via .toarray())
print(gnb.predict(tfidf.transform(['A good movie']).toarray()))
## ['negative']
print(gnb.predict(tfidf.transform(['I did not like this movie at all']).toarray()))
## ['negative']
# 4) logistic regression model
print(log_reg.predict(tfidf.transform(['A good movie'])))
## ['negative']
print(log_reg.predict(tfidf.transform(['I did not like this movie at all'])))
## ['negative']

Results

svc_score = svc.score(test_x_vector, test_y)
dec_score = dec_tree.score(test_x_vector, test_y)
gnb_score = gnb.score(test_x_vector.toarray(), test_y)
log_reg_score = log_reg.score(test_x_vector, test_y)

Support vector machine learning (SVML) and logistic regression were similarly accurate (0.84 and 0.83, respectively; Table 1). Decision tree and Naive Bayes modelling were considerably less accurate than SVML or logistic regression. Since the SVML model had the highest mean prediction accuracy, I extracted more of its accuracy metrics.

Table 1. Mean prediction accuracy of four movie review sentiment machine learning models

  Model type            Mean accuracy
  Support vector ML              0.84
  Logistic regression            0.83
  Decision tree                  0.66
  Naive Bayes                    0.63

The SVML model had high (i.e., near 1) F1 scores for both positive and negative sentiment reviews, which reinforces this model’s prediction accuracy. The SVML model had a positive sentiment F1 score of 0.85 and a negative sentiment F1 score of 0.83.

from sklearn.metrics import classification_report
print(classification_report(test_y,
                            svc.predict(test_x_vector),
                            labels=['positive', 'negative']))
##               precision    recall  f1-score   support
## 
##     positive       0.83      0.87      0.85       335
##     negative       0.85      0.82      0.83       325
## 
##     accuracy                           0.84       660
##    macro avg       0.84      0.84      0.84       660
## weighted avg       0.84      0.84      0.84       660
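As a quick arithmetic check (not part of the original analysis): the F1 score is the harmonic mean of precision and recall, \(F_1 = \frac{2PR}{P+R}\), and plugging in the positive-class values from the report above reproduces the 0.85 score:

p, r = 0.83, 0.87                       # positive-class precision and recall
print(round(2 * p * r / (p + r), 2))    # harmonic mean of p and r
## 0.85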

Confusion matrix

# Confusion matrix
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(test_y,
                            svc.predict(test_x_vector),
                            labels=['positive', 'negative'])

A confusion matrix provides context for the SVML model’s accuracy metrics by tallying true positives, true negatives, false positives, and false negatives. (The cvms plot of this matrix shows the overall percentage and count in the center of each tile and the column percentages at the tile edges.) Of the 350 reviews the model predicted to be positive, 290 were truly positive (82.9%); the remaining 60 were false positives (17.1%). Of the 310 reviews the model predicted to be negative, 265 were truly negative (85.5%); the remaining 45 were false negatives (14.5%). These percentages match the precision values in the classification report above.
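A minimal sketch of how those percentages fall out of conf_mat (with labels=['positive', 'negative'] as above, scikit-learn places true labels on the rows and predicted labels on the columns):

tp, fn = conf_mat[0]    # row 0: reviews whose true sentiment is 'positive'
fp, tn = conf_mat[1]    # row 1: reviews whose true sentiment is 'negative'

print(tp / (tp + fp))   # share of predicted positives that were truly positive, ~0.829
print(tn / (tn + fn))   # share of predicted negatives that were truly negative, ~0.855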

# Model tuning: maximize model performance with GridSearchCV
from sklearn.model_selection import GridSearchCV

# set the candidate parameters
parameters = {'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']}
svc = SVC()
svc_grid = GridSearchCV(svc, parameters, cv=5)
svc_grid.fit(train_x_vector, train_y)
## GridSearchCV(cv=5, estimator=SVC(),
##              param_grid={'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']})

print(svc_grid.best_params_)
## {'C': 1, 'kernel': 'linear'}
print(svc_grid.best_estimator_)
## SVC(C=1, kernel='linear')

The grid search showed that the best C parameter was 1 and the best kernel was ‘linear’. Next, I fit a new SVML model with these parameters:

svc2 = SVC(C = 1, kernel='linear')
svc2.fit(train_x_vector, train_y)
## SVC(C=1, kernel='linear')

Then, I checked that the new SVML model generated predictions from text movie reviews and extracted its mean accuracy.

# tuned support-vector machine learning (SVML) model
print(svc2.predict(tfidf.transform(['A good movie'])))
## ['positive']
print(svc2.predict(tfidf.transform(['I did not like this movie at all'])))
## ['negative']

Mean prediction accuracy was identical for both SVML models.

Table 2. Mean prediction accuracy of un-tuned and tuned SVML models

  Model type       Mean accuracy
  Un-tuned SVML             0.84
  Tuned SVML                0.84

# Confusion matrix for SVML2
from sklearn.metrics import confusion_matrix
conf_mat2 = confusion_matrix(test_y,
                             svc2.predict(test_x_vector),
                             labels=['positive', 'negative'])

The confusion matrix of the tuned model was identical to that of the un-tuned model, confirming that tuning neither improved nor degraded performance.
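A one-line sketch of that comparison (assuming numpy is imported as np):

import numpy as np
print(np.array_equal(conf_mat, conf_mat2))   # True when every cell matches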

Discussion

I set out to determine which machine learning model was the most accurate at predicting the reviewer’s sentiment given a text review of a movie. My analyses showed that two models were similarly accurate at predicting reviewer sentiment: support vector machine learning (SVML) and logistic regression.

I investigated the SVML model for optimization because it was slightly more accurate than logistic regression on my dataset. The un-tuned SVML model was correct for 82.9% of its positive predictions and 85.5% of its negative predictions, which indicated that my TF-IDF vectorization protocol worked well for movie review sentiment prediction. Surprisingly, tuning the SVML model produced no increase in prediction accuracy and no improvement in those rates. Since tuning provided no appreciable benefit, next steps could include combining multiple models into an ensemble of review-sentiment predictions, or revisiting the logistic regression model in search of better prediction accuracy.