Final Project

Author

Riley Swint

Introduction

For my final project, I explored the IMDB Movie Reviews dataset, which contains 50,000 movie reviews along with labels indicating whether each review is positive or negative. My goal was to investigate whether it’s possible to accurately predict the sentiment of a review (positive or negative) based solely on the language used. To do this, I applied a Support Vector Machine (SVM), a supervised learning technique, to classify the reviews based on word usage patterns.

Loading the Data

library(readr)
library(tidyverse)
library(tidyr)
library(dplyr)
library(dendextend)
library(ranger)
library(tidytext)
library(SnowballC)
library(textclean)
library(e1071)
movie_data <- read_csv("~/Documents/STAT 0218/IMDB Dataset.csv", show_col_types = FALSE)

Cleaning the Data

To start off, I needed to clean the dataset. First, I renamed the columns to avoid conflicts where a word in a review might match a column name, which could cause errors once each word becomes its own column. Since the reviews sometimes contain HTML tags, I removed those using the replace_html() function; these tags are not meaningful words and could interfere with the model. I also added a unique “id_num” to each review to keep track of them individually.

colnames(movie_data) <- c("Review_Text", "Review_Rank")
movie_data <- movie_data |>
  mutate(Review_Text = replace_html(Review_Text))

movie_data <- movie_data |>
  mutate(id_num = row_number())

Reshaping the Data

To explore the language used in the reviews, I transformed the dataset into a wide format where each row is a review, each column represents a unique word, and each value indicates how many times that word appears in the review. Since working with all 50,000 reviews is time-consuming, I sampled 10,000 reviews. I also removed stop words (such as “the”, “is”, and “and”) since they contribute little to distinguishing between review sentiments.

set.seed(123)
movies_long <- movie_data |>
  sample_n(10000) |>
  unnest_tokens(input = "Review_Text",
                output = "Word") |>
  filter(!Word %in% stop_words$word) |>
  pivot_wider(id_cols = c("id_num","Review_Rank"),
              names_from = "Word",
              values_from = "Word",
              values_fn = length,
              values_fill = 0)
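
To make the reshaping step concrete, here is a minimal base R sketch of the same idea on three invented reviews. It is only an illustration: the toy reviews and the short `stop_list` are assumptions standing in for the IMDB data and `stop_words$word`, and the tokenization is a plain whitespace split rather than `unnest_tokens()`.

```r
# Toy reviews (invented for illustration, not taken from the IMDB data)
reviews <- c("a great great movie",
             "the plot is boring",
             "great acting and a boring plot")
stop_list <- c("a", "the", "is", "and")  # hypothetical stand-in for stop_words$word

# Tokenize: lowercase each review and split on whitespace
tokens <- strsplit(tolower(reviews), "\\s+")

# Drop stop words from every review
tokens <- lapply(tokens, function(w) w[!w %in% stop_list])

# Vocabulary: every distinct remaining word, sorted alphabetically
vocab <- sort(unique(unlist(tokens)))

# One row per review, one column per word, cells hold counts (0 if absent)
word_counts <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
rownames(word_counts) <- paste0("review_", seq_along(reviews))
word_counts
```

Each row of `word_counts` plays the role of one row of movies_long, with zeros filled in for words a review never uses.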

Unsupervised Learning: Hierarchical Clustering

Now that the data is in a wide format, I wanted to visualize how similar the reviews are to each other based on the words used. To do this, I created a dendrogram, which is a tree-like diagram that groups similar observations together. The dendrogram is built by calculating the distance between reviews using scaled word frequencies, then applying hierarchical clustering. I also colored the labels by review sentiment (green for positive, red for negative) to make it easier to interpret. Since the full movies_long dataset contains 10,000 movie reviews, plotting all of them would be too crowded to read, so I took a smaller random sample of 200 reviews for this visualization.
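
The distance-then-cluster idea can be sketched on a tiny invented matrix before applying it to real reviews. The four rows and their counts below are assumptions made up for illustration, and the sketch skips the scaling step to keep the numbers easy to follow.

```r
# Toy word-count matrix (invented): rows are reviews, columns are words
counts <- rbind(
  pos_1 = c(great = 3, loved = 2, boring = 0, terrible = 0),
  pos_2 = c(great = 2, loved = 3, boring = 0, terrible = 0),
  neg_1 = c(great = 0, loved = 0, boring = 3, terrible = 2),
  neg_2 = c(great = 0, loved = 0, boring = 2, terrible = 3)
)

# Euclidean distance between every pair of reviews
d <- dist(counts)

# Agglomerative clustering; complete linkage is the hclust() default
hc <- hclust(d)

# Cut the tree into two groups and compare them with the known labels
groups <- cutree(hc, k = 2)
table(group = groups, label = sub("_.*", "", names(groups)))
```

In this idealized case the two groups line up exactly with the labels; real reviews share far more vocabulary, so the separation is much weaker.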

set.seed(123)
movies_long_sample <- movie_data |>
  sample_n(200) |>
  unnest_tokens(input = "Review_Text",
                output = "Word") |>
  filter(!Word %in% stop_words$word) |>
  pivot_wider(id_cols = c("id_num","Review_Rank"),
              names_from = "Word",
              values_from = "Word",
              values_fn = length,
              values_fill = 0)
dend <- movies_long_sample |>
  select(-id_num, -Review_Rank) |>
  scale() |>
  dist() |>
  hclust() |>
  as.dendrogram()

# Colors must come from the same 200-review sample that built the dendrogram
my_colors <- movies_long_sample |>
  mutate(color = case_when(Review_Rank == "negative" ~ "red",
                           Review_Rank == "positive" ~ "green")) |>
  pull(color)


colors_to_use <- my_colors[order.dendrogram(dend)]

labels_colors(dend) <- colors_to_use

plot(dend)

From this dendrogram, there are some clustering patterns based on the sentiment of the reviews. The positive reviews tend to be more clustered toward the left side of the plot, while the negative reviews are more spread out, with many appearing toward the middle and right side. Although there is no perfect separation between the two classes, some grouping does suggest that reviews with similar sentiment use similar language. This makes intuitive sense because positive reviews often share approving vocabulary (“great,” “amazing,” “loved”), while negative reviews might focus on criticism (“boring,” “terrible,” “disappointed”). However, the overlap between clusters also highlights that sentiment is not always clearly separable based on word counts alone. The dendrogram shows that while reviews are not cleanly divided into two groups, some natural grouping does exist, and this structure could help a machine learning model classify sentiment.

Applying Support Vector Machines (SVM) for Classification

For the supervised learning portion of my project, I chose to use a Support Vector Machine (SVM) to classify movie reviews as either positive or negative. The way SVM works is that it draws a line (in higher dimensions a hyperplane) between the positive and negative reviews. The goal is to make sure that the reviews on one side of the line are mostly positive and the reviews on the other side are mostly negative. In the context of movie reviews, this line is drawn based on the frequency of different words that appear in each review. For example, if a review frequently uses words like “great” and “amazing,” it would end up on the positive side, while a review with words like “boring” and “terrible” falls on the negative side.

One unique feature of SVM is that it doesn’t just try to draw any line; it looks for the line that has the largest possible margin between the two groups. The idea is that a larger margin makes the model more confident in its predictions, and it helps the model generalize better to new, unseen data. In cases where the data isn’t perfectly separable by a single line, SVM allows for some flexibility. It lets a few reviews “break the rules” and end up on the wrong side of the line, which can help improve the model’s ability to handle tricky situations where the language isn’t always straightforward.
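
The effect of this flexibility can be illustrated with a small sketch using e1071's svm() on simulated two-dimensional data. The data, the seed, and the two cost values are assumptions chosen for illustration and are not part of the movie review analysis.

```r
library(e1071)

set.seed(1)
n <- 50
# Two simulated clouds of points, one per class (invented data)
toy <- data.frame(
  x1 = c(rnorm(n, mean = -2), rnorm(n, mean = 2)),
  x2 = c(rnorm(n, mean = -2), rnorm(n, mean = 2)),
  label = factor(rep(c("negative", "positive"), each = n))
)

# A small cost tolerates margin violations; a large cost penalizes them heavily
soft <- svm(label ~ ., data = toy, kernel = "linear", cost = 0.01)
hard <- svm(label ~ ., data = toy, kernel = "linear", cost = 100)

# Support vectors are the points on the margin or violating it
c(soft = soft$tot.nSV, hard = hard$tot.nSV)

# Training accuracy of each fit
acc_soft <- mean(predict(soft, toy) == toy$label)
acc_hard <- mean(predict(hard, toy) == toy$label)
c(soft = acc_soft, hard = acc_hard)
```

A lower cost generally yields a wider margin supported by more points, while a higher cost yields a narrower margin that follows the training data more closely.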

In my project, I used word counts as the features for the model. Each review was represented by how many times specific words appeared in the text. By looking at these word frequencies, the SVM model tries to find patterns that differentiate positive reviews from negative ones. However, not every word carries the same weight in determining sentiment. For example, words like “happy” or “disappointing” might be much more influential in predicting sentiment than more common words like “movie” or “people”.

I also used a linear kernel, which means that the model assumes a straight-line boundary between the two types of reviews. This is a good starting point because many reviews can be separated by clear language patterns. However, I also experimented with models using radial and polynomial kernels to see if they could capture more subtle relationships in the data. In the end, the linear kernel provided the best results.

Principal Component Analysis (PCA)

To start off, I used Principal Component Analysis (PCA). PCA is a technique that reduces the complexity of data by compressing it into fewer dimensions while keeping as much important information as possible. This is especially helpful when working with large datasets, like word counts, because it helps identify the key factors that explain most of the variation in the data.

In this case, PCA helped simplify the word usage data and highlight the main patterns. The first principal component (PC1) captures the largest variation in vocabulary, and by examining the distribution of PC1 across the movie reviews, we can see how the review sentiments differ in terms of their word choices.
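
As a rough sketch of what prcomp() returns, the example below runs PCA on a small simulated count matrix and reports the share of variance captured by each component. The column names and Poisson-generated counts are invented for illustration only.

```r
set.seed(1)
# Invented data: 30 "reviews" and 4 "word count" columns;
# the first two columns are built to move together
base <- rpois(30, lambda = 3)
counts <- cbind(
  great = base,
  loved = base + rpois(30, lambda = 1),
  plot  = rpois(30, lambda = 2),
  actor = rpois(30, lambda = 2)
)

# Center and scale each column, then rotate onto the principal components
pca_toy <- prcomp(counts, scale. = TRUE)

# Share of total variance captured by each component (PC1 is always largest)
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
round(var_explained, 3)

# Scores: each review's coordinates on the first two components
scores <- pca_toy$x[, 1:2]
head(scores)
```

Because `great` and `loved` are correlated by construction, PC1 absorbs most of their shared variation, which is the kind of compression PCA performs on the word counts.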

As with the dendrogram, I could not use the full movies_long dataset of 10,000 reviews: running PCA on it exhausted R’s vector memory, so I used a sample of 200 reviews instead. Using the top 5 principal components, I split the data into training (75%) and testing (25%) sets and then trained the SVM model on the principal components to predict each review’s sentiment. Finally, I created a confusion matrix to see how the predictions matched up with the actual labels.

set.seed(123)
movies_long_sample2 <- movie_data |>
  sample_n(200) |>
  unnest_tokens(input = "Review_Text",
                output = "Word") |>
  filter(!Word %in% stop_words$word) |>
  pivot_wider(id_cols = c("id_num","Review_Rank"),
              names_from = "Word",
              values_from = "Word",
              values_fn = length,
              values_fill = 0)
pca <- prcomp(movies_long_sample2 %>% select(-id_num, -Review_Rank), scale. = TRUE)
pca_df <- as_tibble(pca$x[, 1:5]) %>%
  mutate(Review_Rank = as.factor(movies_long_sample2$Review_Rank))

set.seed(123)
train_index <- sample(seq_len(nrow(pca_df)), size = 0.75 * nrow(pca_df))
train_data <- pca_df[train_index, ]
test_data <- pca_df[-train_index, ]
train_data <- train_data |> mutate(Review_Rank = as.factor(Review_Rank))
test_data <- test_data |> mutate(Review_Rank = as.factor(Review_Rank))

svm_model <- svm(Review_Rank ~ ., data = train_data, kernel = "linear", cost = 1, scale = TRUE)

svm_pred <- predict(svm_model, test_data)
conf_mat <- table(Predicted = svm_pred, Actual = test_data$Review_Rank)
conf_mat

          Actual
Predicted  negative positive
  negative        3        2
  positive       24       21

The confusion matrix shows that the model is not accurate: it classifies only 24 of the 50 test reviews correctly (48%) and labels 45 of them as positive, so just 3 of the 27 negative reviews are identified. This suggests that the top principal components, while helpful for reducing dimensionality, don’t fully capture the patterns needed to distinguish between positive and negative sentiments.
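
The summary figures for this model can be read off the confusion matrix with a few lines of base R. The matrix below is copied from the output above; the variable names are introduced here only for illustration.

```r
# Confusion matrix copied from the output above
# (rows = predicted class, columns = actual class)
conf <- matrix(c(3, 24, 2, 21), nrow = 2,
               dimnames = list(Predicted = c("negative", "positive"),
                               Actual    = c("negative", "positive")))

# Overall accuracy: correct predictions sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)

# Per-class recall: share of each actual class that was predicted correctly
recall <- diag(conf) / colSums(conf)

accuracy          # 0.48
round(recall, 3)  # negative 0.111, positive 0.913
```

The gap between the two recall values shows that the model leans heavily toward predicting positive.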

Visualizing PCA Results

To further explore this, I plotted the reviews using the first two principal components:

ggplot(pca_df, aes(x = PC1, y = PC2, color = Review_Rank)) +
  geom_point() +
  labs(title = "PCA of Movie Reviews", x = "PC1", y = "PC2")

In the plot, there are two outlier reviews with extremely high PC1 and PC2 values, both labeled as positive. Most other points cluster near the origin, with only slight separation between the positive and negative reviews. Negative reviews seem to have slightly higher PC2 values, while positive reviews are slightly higher on PC1. However, the overlap is significant, confirming that PCA alone is not sufficient for strong predictive power.

Sentiment-Labeled Words and Support Vector Machines

Since PCA dimensionality reduction did not lead to strong classification performance, I decided to build a new SVM model using the raw word count data directly from movies_long. However, using all available words (columns) as predictors requires a lot of computing power. So instead, I filtered the words using the “bing” sentiment lexicon, which labels individual words as either positive or negative. This reduced my predictors to only those words likely to have a meaningful impact on sentiment classification. I then split my data into training and testing sets (75/25) and trained SVM models using only the words that appear in both the bing lexicon and my dataset.

The warning printed when fitting the model below arises because some bing words that appear in the full movies_long sample are constant (all zeros) in the training split: they occur only in reviews that landed in the test set. svm() tries to scale each predictor, and a predictor with no variation cannot be scaled. This is not a critical error, but it indicates that those predictors contribute nothing to the model.
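
One way to see which predictors trigger this kind of warning is to check for constant columns in the training rows before fitting. The sketch below uses an invented six-review data frame and a hypothetical four-word `lexicon` in place of the bing words; dropping the constant columns is shown as an option and is not what the model in this report does.

```r
# Invented word-count data: 6 reviews, columns are words
counts <- data.frame(
  great    = c(2, 0, 1, 0, 0, 3),
  boring   = c(0, 1, 0, 2, 0, 0),
  film     = c(1, 1, 2, 1, 1, 0),
  dreadful = c(0, 0, 0, 0, 1, 0)
)
lexicon <- c("great", "boring", "dreadful", "superb")  # hypothetical stand-in for bing

# Keep only the lexicon words that actually occur as columns
keep <- intersect(lexicon, colnames(counts))

# Suppose rows 1 to 4 form the training set
train <- counts[1:4, keep]

# A column is constant when it takes a single value across the training rows
is_constant <- vapply(train, function(col) length(unique(col)) == 1, logical(1))
names(train)[is_constant]  # "dreadful": it only occurs in a held-out review

# Optionally drop constant columns before fitting
train_clean <- train[, !is_constant, drop = FALSE]
colnames(train_clean)
```

Here `dreadful` is in the lexicon and in the data, but its only occurrence falls outside the training rows, which mirrors the situation behind the warning.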

To identify the best model, I trained three SVM models, each using a different kernel function:

- Linear: Assumes a straight-line boundary between positive and negative reviews.
- Radial (RBF): Allows for more flexible, curved boundaries.
- Polynomial: Expands the feature space using polynomial transformations.

All models were trained on the same subset of the data with cost = 0.1 (the penalty for misclassification) to ensure a fair comparison. Among them, the linear kernel performed the best, having the lowest misclassification rate. I then tuned the cost parameter for the linear kernel, which controls the trade-off between a wider margin and fewer training errors. After testing multiple values (0.01, 0.1, 1, 10), I found that cost = 0.1 gave the best results.
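
The comparison described above can be sketched as a pair of loops over kernels and cost values. This is an illustration on simulated data, not the code used for the reported results: the toy data, the seed, and the `error_rate()` helper are assumptions introduced here.

```r
library(e1071)

set.seed(1)
n <- 100
# Simulated two-class data (invented for illustration)
toy <- data.frame(
  x1 = c(rnorm(n, mean = -1), rnorm(n, mean = 1)),
  x2 = c(rnorm(n, mean = -1), rnorm(n, mean = 1)),
  label = factor(rep(c("negative", "positive"), each = n))
)

# 75/25 train/test split
train_rows <- sample(seq_len(nrow(toy)), size = 0.75 * nrow(toy))
train <- toy[train_rows, ]
test  <- toy[-train_rows, ]

# Hypothetical helper: fit one SVM and return its test misclassification rate
error_rate <- function(kernel, cost) {
  fit <- svm(label ~ ., data = train, kernel = kernel, cost = cost)
  mean(predict(fit, test) != test$label)
}

# Step 1: compare kernels at a fixed cost
kernels <- c("linear", "radial", "polynomial")
kernel_errors <- sapply(kernels, error_rate, cost = 0.1)
kernel_errors

# Step 2: tune cost for the linear kernel
costs <- c(0.01, 0.1, 1, 10)
cost_errors <- sapply(costs, function(cst) error_rate("linear", cst))
names(cost_errors) <- costs
best_cost <- costs[which.min(cost_errors)]
best_cost
```

Choosing the kernel and cost on the same test set used for the final accuracy can make that accuracy look slightly optimistic; cross-validation on the training set is a common alternative.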

set.seed(123)
train_index <- sample(seq_len(nrow(movies_long)), size = 0.75 * nrow(movies_long))
train_data <- movies_long[train_index, ]
test_data <- movies_long[-train_index, ]
train_data <- train_data |> mutate(Review_Rank = as.factor(Review_Rank)) |> select(-id_num)
test_data <- test_data |> mutate(Review_Rank = as.factor(Review_Rank)) |> select(-id_num)


### Find pos/neg words
all_words <- movies_long |>
  colnames()

words_with_sentiments <- get_sentiments("bing") |>
  filter(word %in% all_words)

# all_of() makes explicit that the column names come from an external character vector
model1 <- svm(Review_Rank ~ ., data = train_data |> select(Review_Rank, all_of(words_with_sentiments$word)), kernel = "linear", cost = 0.1, scale = TRUE)

Warning in svm.default(x, y, scale = scale, ..., na.action = na.action):
Variable(s) 'aborts' and 'abscond' and 'absurdness' and 'acclamation' and
'aches' and 'acumen' and 'adaptable' and 'adoringly' and 'adulation' and
'aggravate' and 'aggressiveness' and 'alarmingly' and 'ambitiously' and
'anomalous' and 'antagonize' and 'astutely' and 'asunder' and 'attentive' and
'auspicious' and 'authoritative' and 'avaricious' and 'avidly' and 'awsome' and
'backwood' and 'battering' and 'bellicose' and 'bemoan' and 'bemoaning' and
'besmirch' and 'bestial' and 'biases' and 'blameless' and 'blindingly' and
'bombardment' and 'breach' and 'brutalizing' and 'buckle' and 'buoyant' and
'cleanly' and 'coercive' and 'commiserate' and 'condemns' and 'condescension'
and 'congenial' and 'contagious' and 'cramping' and 'creak' and 'creaks' and
'cruelest' and 'cushy' and 'daringly' and 'dauntingly' and 'deadlock' and
'deceitful' and 'deceitfulness' and 'degeneration' and 'dejection' and
'demonized' and 'derogatory' and 'despairingly' and 'destitute' and 'deviously'
and 'disarray' and 'disasterous' and 'disbeliever' and 'disconcerted' and
'discontented' and 'discriminate' and 'dismissive' and 'disobedience' and
'disorganized' and 'disoriented' and 'dispensable' and 'disputed' and
'dissatisfying' and 'dissent' and 'divisive' and 'doubtfully' and 'dwindling'
and 'elation' and 'encouragement' and 'enhancement' and 'enrich' and
'equitable' and 'euphoric' and 'eventful' and 'exorbitant' and 'exterminate'
and 'fabrication' and 'facilitate' and 'ferociously' and 'fervent' and 'fib'
and 'figurehead' and 'flagrantly' and 'flickers' and 'foolhardy' and
'franticly' and 'froze' and 'fudge' and 'gaily' and 'glistening' and 'glut' and
'gnawing' and 'graciously' and 'grievously' and 'grimace' and 'grouch' and
'guile' and 'gutless' and 'hallowed' and 'handily' and 'harmonious' and 'hedge'
and 'hothead' and 'idiocies' and 'illogic' and 'illogically' and 'illuminati'
and 'impartially' and 'imperious' and 'impressiveness' and 'improbability' and
'impunity' and 'inconceivably' and 'incorrigible' and 'indecently' and
'indigent' and 'indoctrination' and 'inexorably' and 'inextricably' and
'infiltrator' and 'inhospitable' and 'injure' and 'insinuate' and 'insinuation'
and 'instigator' and 'insubordinate' and 'invigorate' and 'irks' and
'irreplaceable' and 'isolate' and 'jealously' and 'kindliness' and 'lagged' and
'laud' and 'lawlessness' and 'lecher' and 'lewdness' and 'lovably' and 'madden'
and 'magnanimous' and 'maladjusted' and 'malaise' and 'maturely' and
'melodramatically' and 'mire' and 'mists' and 'mockeries' and 'monstrously' and
'morbidly' and 'neatest' and 'niggle' and 'nimble' and 'nitpick' and
'nourishing' and 'obliterated' and 'obsess' and 'obstinate' and 'obstructed'
and 'oppose' and 'oppressiveness' and 'outperforms' and 'oversight' and
'oversights' and 'overstate' and 'overstates' and 'overtaken' and 'overtakes'
and 'pampered' and 'panders' and 'paradoxically' and 'payback' and 'peeved' and
'personages' and 'perturbed' and 'pittance' and 'poise' and 'prattle' and
'precariously' and 'pretentiously' and 'prideful' and 'proactive' and
'punishable' and 'punitive' and 'purposeful' and 'rattle' and 'recession' and
'recoil' and 'redundancy' and 'refreshed' and 'remorselessly' and 'replaceable'
and 'repress' and 'reproach' and 'repulsively' and 'reputable' and 'resilient'
and 'restriction' and 'retract' and 'reverent' and 'reviled' and 'revives' and
'ridicules' and 'roomy' and 'satisfactorily' and 'scandalized' and 'scathingly'
and 'scold' and 'scourge' and 'securely' and 'sensationally' and 'simplify' and
'skeptic' and 'slanders' and 'sluts' and 'smoother' and 'sorrowfully' and
'spewed' and 'squealing' and 'stalemate' and 'stately' and 'steadfast' and
'steadfastly' and 'succes' and 'suffices' and 'sulk' and 'sully' and 'topnotch'
and 'touts' and 'traction' and 'treacherously' and 'tumbled' and 'unaccessible'
and 'uncivil' and 'unclean' and 'undefined' and 'unjustifiable' and
'unjustifiably' and 'unobserved' and 'unrelentingly' and 'unsupported' and
'vainly' and 'vehemently' and 'venerate' and 'vexation' and 'vexing' and
'victimize' and 'vigilance' and 'vindictiveness' and 'wretchedness' and
'wrinkle' and 'zealously' constant. Cannot scale data.

svm_pred1 <- predict(model1, test_data)
conf_mat1 <- table(Predicted = svm_pred1, Actual = test_data$Review_Rank)
conf_mat1

          Actual
Predicted  negative positive
  negative     1012      182
  positive      250     1056

The final model achieved an accuracy of 82.72%, misclassifying 432 of the 2,500 test reviews. This is far better than the PCA-based model, which suggests that filtering word features with a known sentiment lexicon improves model performance.

Visualizing Confusion Matrix

confusion_values <- c(
  True_Negative = conf_mat1["negative", "negative"],
  False_Positive = conf_mat1["positive", "negative"],
  False_Negative = conf_mat1["negative", "positive"],
  True_Positive = conf_mat1["positive", "positive"]
)

confusion_df <- enframe(confusion_values, name = "Outcome", value = "Count")

ggplot(confusion_df, aes(x = Outcome, y = Count, fill = Outcome)) +
  geom_col(width = 0.7) +
  scale_fill_manual(values = c("darkred", "darkred", "darkgreen", "darkgreen")) +
  labs(title = "SVM Predictions", x = "Prediction Type", y = "Number of Reviews") +
  theme_minimal()

The bar plot shows that true positives and true negatives far outnumber the two error types, confirming that the model performs well at predicting the sentiment of a movie review from its word usage.

Conclusion

Through this project, I explored how language patterns in movie reviews can be used to classify sentiment using machine learning. Initially, I used PCA to reduce dimensionality and visualize sentiment variation, but it didn’t lead to high predictive power. Switching to raw word count data combined with sentiment-filtered predictors and using an SVM model with a linear kernel significantly improved the model’s performance, reaching over 82% accuracy. One key takeaway is that feature selection plays a critical role in text classification. By limiting the words to those with clear positive or negative associations (through the bing lexicon), the model had more focused, relevant input. Also, testing different kernel functions and cost parameters helped optimize the SVM’s performance.

Works Cited

N Lakshmipathi. “IMDB Dataset of 50K Movie Reviews.” Kaggle, https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews.