library(reticulate)  # run Python chunks from within RMarkdown
library(tidyverse)   # data wrangling and plotting
library(gt)          # presentation-quality tables
library(cvms)        # confusion-matrix tile plots
set.seed(1)
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
This project combines several data science programming languages, principles, and products. I used two programming languages in this document: R (specifically the tidyverse) to summarize and visualize the data and model performance, and Python (specifically scikit-learn) to build the models and extract their performance metrics. Finally, I wrote this document in RMarkdown because it seamlessly weaves code from multiple programming languages, plus LaTeX, into a single user-friendly document.
I created four machine learning models to predict the sentiment (positive or negative) of movie reviews from IMDB. First, I trained the models with a subset of reviews with known review sentiment. Next, I tested model accuracy by predicting review sentiment and determining whether each prediction was correct or incorrect. Then, I selected the most accurate model and tuned its parameters to produce the best possible prediction accuracy.
Research question: Which machine learning model is most accurate at predicting the reviewer’s sentiment given a text review of a movie?
Input: text movie reviews.
Output: binary predictions, either ‘positive’ or ‘negative’.
The dataset for this project is from Kaggle.
The dataset contains 50,000 rows and 2 columns.
# read the Kaggle CSV into a DataFrame (file name assumed)
df_review = pd.read_csv('IMDB Dataset.csv')
df_review
## review sentiment
## 0 One of the other reviewers has mentioned that ... positive
## 1 A wonderful little production. <br /><br />The... positive
## 2 I thought this was a wonderful way to spend ti... positive
## 3 Basically there's a family where a little boy ... negative
## 4 Petter Mattei's "Love in the Time of Money" is... positive
## ... ... ...
## 49995 I thought this movie did a down right good job... positive
## 49996 Bad plot, bad dialogue, bad acting, idiotic di... negative
## 49997 I am a Catholic taught in parochial elementary... negative
## 49998 I'm going to have to disagree with the previou... negative
## 49999 No one expects the Star Trek movies to be high... negative
##
## [50000 rows x 2 columns]
The dataset started with a balanced number of positive and negative reviews. Since real data may not be perfectly balanced, I imbalanced the samples and then re-balanced them through random under-sampling.
First, I imbalanced the reviews by selecting a different number of positive and negative reviews:
df_positive = df_review[df_review['sentiment']=='positive'][:9000]
df_negative = df_review[df_review['sentiment']=='negative'][:1000]
df_review_imb = pd.concat([df_positive, df_negative])
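As a quick sanity check (this snippet is my addition, not in the original), the class counts of the imbalanced subset can be inspected directly:

# should show 9000 positive and 1000 negative reviews
print(df_review_imb['sentiment'].value_counts())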
Then, I re-balanced the dataset by randomly sampling my imbalanced samples:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
# fit_resample returns the under-sampled features and labels separately
df_review_bal, y_bal = rus.fit_resample(df_review_imb[['review']],
                                        df_review_imb['sentiment'])
df_review_bal['sentiment'] = y_bal
df_review_bal
## review sentiment
## 0 Basically there's a family where a little boy ... negative
## 1 This show was an amazing, fresh & innovative i... negative
## 2 Encouraged by the positive comments about this... negative
## 3 Phil the Alien is one of those quirky films wh... negative
## 4 I saw this movie when I was about 12 when it c... negative
## ... ... ...
## 1995 Knute Rockne led an extraordinary life and his... positive
## 1996 At the height of the 'Celebrity Big Brother' r... positive
## 1997 This is another of Robert Altman's underrated ... positive
## 1998 This movie won a special award at Cannes for i... positive
## 1999 You'd be forgiven to think a Finnish director ... positive
##
## [2000 rows x 2 columns]
I split my newly balanced dataset into training and testing subsets. My training subset was \(\frac{2}{3}\) of the balanced dataset and the test subset was the remaining \(\frac{1}{3}\).
from sklearn.model_selection import train_test_split
train, test = train_test_split(df_review_bal, test_size=0.33, random_state=42)
train_x, train_y = train['review'], train['sentiment']
test_x, test_y = test['review'], test['sentiment']
Next, I prepared the data for modelling. To model sentiment from natural language text, I needed to transform the words of each review into a numeric representation of their frequency. I used scikit-learn’s term frequency-inverse document frequency (TF-IDF) vectorizer for these frequency calculations. In this approach, a word’s TF-IDF score increases with its frequency within a document but decreases as the word becomes common across documents.
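Concretely, with scikit-learn’s default settings (smooth_idf=True, raw counts for term frequency, followed by L2 normalization of each document vector), the score is

\[
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \left( \ln \frac{1 + n}{1 + \text{df}(t)} + 1 \right),
\]

where \(n\) is the number of documents, \(\text{tf}(t, d)\) is the count of term \(t\) in document \(d\), and \(\text{df}(t)\) is the number of documents containing \(t\).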
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
train_x_vector = tfidf.fit_transform(train_x)
pd.DataFrame.sparse.from_spmatrix(train_x_vector,
                                  index=train_x.index,
                                  columns=tfidf.get_feature_names_out())
## 00 000 007 01pm 02 04 ... æon élan émigré ísnt ïn ünfaithful
## 81 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0
## 915 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0
## 1018 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0
## 380 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0
## 1029 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0
## ... ... ... ... ... ... ... ... ... ... ... ... ... ...
## 1130 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0
## 1294 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0
## 860 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0
## 1459 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0
## 1126 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0
##
## [1340 rows x 20625 columns]
test_x_vector = tfidf.transform(test_x)  # transform only: reuse the vocabulary fit on the training data
I trained four natural language text analysis models:
# fit each of the four classifiers on the TF-IDF training vectors
svc = SVC(kernel='linear')
svc.fit(train_x_vector, train_y)

dec_tree = DecisionTreeClassifier()
dec_tree.fit(train_x_vector, train_y)

# GaussianNB does not accept sparse matrices, so densify the input
gnb = GaussianNB()
gnb.fit(train_x_vector.toarray(), train_y)

log_reg = LogisticRegression()
log_reg.fit(train_x_vector, train_y)
Finally, I tested that each model generated a prediction when given made-up reviews. Each model predicts either ‘positive’ or ‘negative’ given a string of text:
# 1) support-vector machine learning (SVML) model
print(svc.predict(tfidf.transform(['A good movie'])))
## ['positive']
print(svc.predict(tfidf.transform(['I did not like this movie at all'])))
## ['negative']
# 2) classification tree
print(dec_tree.predict(tfidf.transform(['A good movie'])))
## ['positive']
print(dec_tree.predict(tfidf.transform(['I did not like this movie at all'])))
## ['positive']
# 3) naive Bayes model
print(gnb.predict(tfidf.transform(['A good movie']).toarray())) # GaussianNB needs a dense array, hence .toarray()
## ['negative']
print(gnb.predict(tfidf.transform(['I did not like this movie at all']).toarray()))
## ['negative']
# 4) logistic regression model
print(log_reg.predict(tfidf.transform(['A good movie'])))
## ['negative']
print(log_reg.predict(tfidf.transform(['I did not like this movie at all'])))
## ['negative']
# mean prediction accuracy of each model on the held-out test set
svc_score = svc.score(test_x_vector, test_y)
dec_score = dec_tree.score(test_x_vector, test_y)
gnb_score = gnb.score(test_x_vector.toarray(), test_y)
log_reg_score = log_reg.score(test_x_vector, test_y)
Support vector machine learning (SVML) and logistic regression were similarly accurate (0.84 and 0.83, respectively; Table 1). Decision tree and Naive Bayes modelling were considerably less accurate than SVML or logistic regression. Since the SVML model had the highest mean prediction accuracy, I extracted more of its accuracy metrics.
Table 1. Mean prediction accuracy of four movie review sentiment machine learning models

| Model type | Mean accuracy |
|---|---|
| Support vector ML | 0.84 |
| Logistic regression | 0.83 |
| Decision tree | 0.66 |
| Naive Bayes | 0.63 |
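The R code that rendered Table 1 is not shown above; a minimal reticulate/gt sketch (the column labels and rounding are my assumptions) would pull the Python scores into R with the py$ accessor:

# gather the Python accuracy scores in R and render a gt table
tibble(
  `Model type`    = c("Support vector ML", "Logistic regression",
                      "Decision tree", "Naive Bayes"),
  `Mean accuracy` = round(c(py$svc_score, py$log_reg_score,
                            py$dec_score, py$gnb_score), 2)
) %>%
  gt() %>%
  tab_header(title = "Table 1. Mean prediction accuracy of four movie review sentiment machine learning models")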
The SVML model had high (i.e., near 1) F1 scores for both positive and negative sentiment reviews, which reinforces this model’s prediction accuracy. The SVML model had a positive sentiment F1 score of 0.85 and a negative sentiment F1 score of 0.83.
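For reference, the F1 score reported below is the harmonic mean of precision and recall:

\[
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]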
from sklearn.metrics import classification_report
print(classification_report(test_y,
                            svc.predict(test_x_vector),
                            labels=['positive', 'negative']))
## precision recall f1-score support
##
## positive 0.83 0.87 0.85 335
## negative 0.85 0.82 0.83 325
##
## accuracy 0.84 660
## macro avg 0.84 0.84 0.84 660
## weighted avg 0.84 0.84 0.84 660
Next, I computed the confusion matrix for the SVML model’s test-set predictions.
# Confusion matrix
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(test_y,
                            svc.predict(test_x_vector),
                            labels=['positive', 'negative'])
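The R code that draws the tile plot described below is not shown in this section; a minimal sketch with cvms (assuming the Python matrix is pulled into R as py$conf_mat via reticulate) could be:

# reshape sklearn's 2x2 matrix (rows = target, columns = prediction)
# into the long format that cvms::plot_confusion_matrix() expects
conf_tbl <- tibble(
  Target     = rep(c("positive", "negative"), each = 2),
  Prediction = rep(c("positive", "negative"), times = 2),
  N          = as.vector(t(py$conf_mat))
)
plot_confusion_matrix(conf_tbl,
                      target_col = "Target",
                      prediction_col = "Prediction",
                      counts_col = "N")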
A confusion matrix provides context for the SVML model’s accuracy metrics by tallying true positives, true negatives, false positives, and false negatives. The tile plot summarizes this context as an overall percentage and count in the center of each tile, with row and column percentages on the tile edges. Of the 350 reviews the SVML model predicted to be positive, 290 were truly positive, a positive predictive value of 82.9%; the remaining 60 were false positives. Of the 310 reviews the model predicted to be negative, 265 were truly negative, a negative predictive value of 85.5%; the remaining 45 were false negatives. These counts are consistent with the classification report above: the model correctly identified 290 of the 335 truly positive reviews (recall 0.87) and 265 of the 325 truly negative reviews (recall 0.82).
# Model tuning
# maximize the model performance with GridSearchCV
from sklearn.model_selection import GridSearchCV
# set the parameter grid to search over
parameters = {'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']}
svc = SVC()
svc_grid = GridSearchCV(svc, parameters, cv=5)
svc_grid
GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']})
svc_grid.fit(train_x_vector, train_y)
GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']})
print(svc_grid.best_params_)
## {'C': 1, 'kernel': 'linear'}
print(svc_grid.best_estimator_)
## SVC(C=1, kernel='linear')
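GridSearchCV also stores the mean cross-validated accuracy achieved by the best parameter combination, which can be printed for reference (this line is my addition, not in the original):

# mean cross-validated accuracy of the best parameter combination
print(svc_grid.best_score_)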
The grid search showed that the best C parameter was 1 and the best kernel parameter was ‘linear’. Next, I fit a new SVML model with these parameters:
svc2 = SVC(C = 1, kernel='linear')
svc2.fit(train_x_vector, train_y)
SVC(C=1, kernel='linear')
Then, I checked that the tuned SVML model still produced predictions for text movie reviews and extracted its mean prediction accuracy.
# tuned support-vector machine learning (SVML) model
print(svc2.predict(tfidf.transform(['A good movie'])))
## ['positive']
print(svc2.predict(tfidf.transform(['I did not like this movie at all'])))
## ['negative']
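For completeness, the mean accuracy of the tuned model is extracted the same way as before (the variable name svc2_score is my assumption):

# mean prediction accuracy of the tuned model on the held-out test set
svc2_score = svc2.score(test_x_vector, test_y)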
Mean prediction accuracy was identical for both SVML models.
Table 2. Mean prediction accuracy of un-tuned and tuned SVML models

| Model type | Mean accuracy |
|---|---|
| Un-tuned SVML | 0.84 |
| Tuned SVML | 0.84 |
# confusion matrix for the tuned SVML model
from sklearn.metrics import confusion_matrix
conf_mat2 = confusion_matrix(test_y,
                             svc2.predict(test_x_vector),
                             labels=['positive', 'negative'])
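Rather than plotting a second tile chart, a quick element-wise comparison (my addition, not in the original) checks whether the two matrices differ at all:

# True only if the tuned and un-tuned models make identical errors
print((conf_mat == conf_mat2).all())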
The confusion matrix of the tuned model was identical to that of the un-tuned model, showing that tuning neither improved nor degraded performance.
I set out to determine which machine learning model was the most accurate at predicting the reviewer’s sentiment given a text review of a movie. My analyses showed that two models were similarly accurate at predicting reviewer sentiment: support vector machine learning (SVML) and logistic regression.
I investigated my SVML model for optimization because it was slightly more accurate than logistic regression for my dataset. My un-tuned SVML model had a positive predictive value of 82.9% and a negative predictive value of 85.5%, which indicated that my TF-IDF vectorization protocol worked well for movie review sentiment predictions. Surprisingly, tuning my SVML model produced no increase in prediction accuracy and no improvement in the confusion matrix. Since tuning provided no appreciable benefit, I could next investigate combining multiple models into an ensemble of review-sentiment predictions, or I could re-visit my logistic regression model to search for better prediction accuracy.