A. INTRODUCTION
Executive summary
This final project explores how review length shapes the way moviegoers express their opinions and what prospective viewers can take from those reviews. My expectation is that short reviews are more practical, offering clearer and more direct feedback, so readers tend to seek them out and skip reviews that are too long, while long reviews let writers develop their thoughts with a more diverse vocabulary. My goal is to identify the patterns that distinguish the two: quick, direct impressions in short reviews and more detailed insights in long reviews. In the end, the visualizations together show that long reviews tend to include more descriptive, emotion-rich, and nuanced language, while short reviews favor direct, strongly emotional keywords.
In this project I use the “IMDB Dataset of 50K Movie Reviews” published on Kaggle about six years ago. IMDb is one of the most popular online databases for movie-related content and user reviews. The dataset was curated for binary sentiment classification and contains 25,000 highly polar movie reviews for training and 25,000 for testing, for a total of 50,000 reviews written by IMDb users. I believe this dataset is large enough to give a good picture of how people talk about movies differently depending on how much they write.
To prepare the dataset for analysis, I followed several key steps in R to clean, reshape, and structure the data. First, I installed and loaded the necessary packages, such as tidyverse and tidytext. Because my analysis focuses on review length, I classified the reviews into two groups by word count: short reviews contain 100 words or fewer, and long reviews contain more than 100 words. Next, I tokenized the reviews and removed stop words so that the dataset keeps only words that are valuable for the analysis. These steps produce a tidy, normalized format with little noise, and the IMDb dataset is now ready for text mining, sentiment analysis, and so on.
B. BODY
Text data analysis
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(dplyr)
library(stringr)
library(wordcloud)
## Loading required package: RColorBrewer
library(ggplot2)
library(widyr)
bing <- get_sentiments("bing")
bing
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
nrc <- get_sentiments("nrc")
nrc
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ℹ 13,862 more rows
# Load the CSV file; read_csv() comes from readr, which is already attached via the tidyverse
imdb <- read_csv("IMDB Dataset 2.csv")
## Rows: 50000 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): review, sentiment
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
imdb
## # A tibble: 50,000 × 2
## review sentiment
## <chr> <chr>
## 1 "One of the other reviewers has mentioned that after watching just… positive
## 2 "A wonderful little production. <br /><br />The filming technique … positive
## 3 "I thought this was a wonderful way to spend time on a too hot sum… positive
## 4 "Basically there's a family where a little boy (Jake) thinks there… negative
## 5 "Petter Mattei's \"Love in the Time of Money\" is a visually stunn… positive
## 6 "Probably my all-time favorite movie, a story of selflessness, sac… positive
## 7 "I sure would like to see a resurrection of a up dated Seahunt ser… positive
## 8 "This show was an amazing, fresh & innovative idea in the 70's whe… negative
## 9 "Encouraged by the positive comments about this film on here I was… negative
## 10 "If you like original gut wrenching laughter you will like this mo… positive
## # ℹ 49,990 more rows
# Classify each review by word count: 100 words or fewer = short, more than 100 = long
imdb <- imdb %>%
  mutate(word_count = str_count(review, "\\w+"),
         length_group = case_when(
           word_count <= 100 ~ "short",
           word_count > 100 ~ "long"
         ))
imdb
## # A tibble: 50,000 × 4
## review sentiment word_count length_group
## <chr> <chr> <int> <chr>
## 1 "One of the other reviewers has mentioned … positive 320 long
## 2 "A wonderful little production. <br /><br … positive 166 long
## 3 "I thought this was a wonderful way to spe… positive 172 long
## 4 "Basically there's a family where a little… negative 141 long
## 5 "Petter Mattei's \"Love in the Time of Mon… positive 236 long
## 6 "Probably my all-time favorite movie, a st… positive 125 long
## 7 "I sure would like to see a resurrection o… positive 161 long
## 8 "This show was an amazing, fresh & innovat… negative 181 long
## 9 "Encouraged by the positive comments about… negative 130 long
## 10 "If you like original gut wrenching laught… positive 34 short
## # ℹ 49,990 more rows
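As a quick sanity check (a small sketch I am adding here; its output is not reproduced in this report), it is worth counting how many reviews fall into each length group, since the raw word counts later in the analysis are heavily affected by group size.
# count how many reviews fall into each length group (supplementary check)
imdb %>%
  count(length_group)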
Tokenize the reviews and remove stop words
# tokenize into one word per row and remove stop words; note that "br" tokens
# left over from the HTML <br /> tags survive this step and are filtered out later
tidy_imdb <- imdb %>%
  unnest_tokens(word, review) %>%
  anti_join(stop_words, by = "word")
tidy_imdb
## # A tibble: 4,601,528 × 4
## sentiment word_count length_group word
## <chr> <int> <chr> <chr>
## 1 positive 320 long reviewers
## 2 positive 320 long mentioned
## 3 positive 320 long watching
## 4 positive 320 long 1
## 5 positive 320 long oz
## 6 positive 320 long episode
## 7 positive 320 long hooked
## 8 positive 320 long happened
## 9 positive 320 long br
## 10 positive 320 long br
## # ℹ 4,601,518 more rows
COMPARE SENTIMENTS
1. Bing Lexicon Sentiment Analysis
I’d like to explore how sentiment is expressed in each type of review using the Bing sentiment lexicon.
# both tables have a sentiment column, so the join yields sentiment.x (the review
# label) and sentiment.y (the Bing word label)
joined <- inner_join(tidy_imdb, bing, by = "word", relationship = "many-to-many")
head(joined)
## # A tibble: 6 × 5
## sentiment.x word_count length_group word sentiment.y
## <chr> <int> <chr> <chr> <chr>
## 1 positive 320 long struck negative
## 2 positive 320 long brutality negative
## 3 positive 320 long trust positive
## 4 positive 320 long faint negative
## 5 positive 320 long timid negative
## 6 positive 320 long classic positive
# Count Bing sentiment words by review length and compute a net sentiment score
imdb_bing <- joined %>%
count(length_group, sentiment = sentiment.y) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment_score = positive - negative)
imdb_bing
## # A tibble: 2 × 4
## length_group negative positive sentiment_score
## <chr> <int> <int> <int>
## 1 long 450317 308797 -141520
## 2 short 16625 13371 -3254
Visualization
# First reshape the data for ggplot
imdb_bing <- imdb_bing %>%
pivot_longer(cols = c(positive, negative), names_to = "sentiment", values_to = "count")
# Sentiment word count by review length
ggplot(imdb_bing, aes(x = length_group, y = count, fill = sentiment)) +
geom_col(position = "dodge") +
labs(
title = "Positive vs Negative Words in Long vs Short Reviews",
x = "Review Length", y = "Word Count",
fill = "Sentiment"
) +
theme_minimal()
CHART 1
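Because long reviews vastly outnumber short ones, the raw counts in the chart above mostly reflect volume. As a supplementary sketch (my addition, not one of the numbered charts; output not reproduced here), the positive/negative share within each group can be compared directly using the joined object created earlier:
# share of positive vs negative Bing words within each length group
joined %>%
  count(length_group, sentiment = sentiment.y) %>%
  group_by(length_group) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()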
2. NRC Lexicon Emotion Analysis
In this section, I apply the NRC lexicon to classify the words in the IMDb reviews by emotion, rather than looking only at positive and negative words, in order to see which emotions dominate in each length group. I hope to gain a deeper understanding of the emotional content of user reviews.
Join with the NRC Lexicon
nrc <- get_sentiments("nrc") %>%
  select(word, emotion = sentiment)
imdb_nrc <- tidy_imdb %>%
  select(-sentiment) %>%
  inner_join(nrc, by = "word", relationship = "many-to-many")
imdb_nrc
## # A tibble: 2,755,860 × 4
## word_count length_group word emotion
## <int> <chr> <chr> <chr>
## 1 320 long hooked negative
## 2 320 long brutality anger
## 3 320 long brutality fear
## 4 320 long brutality negative
## 5 320 long violence anger
## 6 320 long violence fear
## 7 320 long violence negative
## 8 320 long violence sadness
## 9 320 long word positive
## 10 320 long word trust
## # ℹ 2,755,850 more rows
imdb_nrc %>%
count(length_group, emotion) %>%
pivot_wider(names_from = emotion, values_from = n, values_fill = 0)
## # A tibble: 2 × 11
## length_group anger anticipation disgust fear joy negative positive
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 long 201167 244922 161993 253957 224947 410311 527944
## 2 short 6988 8874 6315 8703 9170 14175 19026
## # ℹ 3 more variables: sadness <int>, surprise <int>, trust <int>
Visualization
imdb_nrc <- tidy_imdb %>%
select(-sentiment) %>%
inner_join(nrc, by = "word", relationship = "many-to-many") %>%
count(length_group, emotion)
ggplot(imdb_nrc, aes(x = emotion, y = n, fill = length_group)) +
geom_col(position = "dodge") +
labs(
title = "NRC Emotion Comparison: Long vs Short Reviews",
x = "Emotion", y = "Word Count"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
CHART 2
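To see which specific words drive a given emotion, a small follow-up sketch (my addition, reusing the NRC join from above; output not reproduced here) can list the top contributors to one emotion, for example “fear” in long reviews:
# top words tagged with the NRC "fear" emotion in long reviews; the join is
# re-run because imdb_nrc above now holds only the per-group counts
tidy_imdb %>%
  select(-sentiment) %>%
  inner_join(nrc, by = "word", relationship = "many-to-many") %>%
  filter(length_group == "long", emotion == "fear") %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 10)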
3. Top Positive Words in Short vs Long Reviews
To understand how users express positive emotions in their movie reviews, I chart the most frequently used positive words in both short and long reviews. I aim to visually explore the language patterns and, where possible, identify differences in word choice between the two groups.
# Filter to positive Bing words and count their frequency by review length
positive <- tidy_imdb %>%
  inner_join(bing, by = "word", relationship = "many-to-many") %>%
  filter(sentiment.y == "positive") %>%
  count(length_group, word, sort = TRUE)
positive_top10 <- positive %>%
group_by(length_group) %>%
slice_max(n, n = 10)
positive_top10
## # A tibble: 20 × 3
## # Groups: length_group [2]
## length_group word n
## <chr> <chr> <int>
## 1 long love 12385
## 2 long pretty 7007
## 3 long fun 5066
## 4 long worth 4310
## 5 long beautiful 4001
## 6 long excellent 3763
## 7 long nice 3676
## 8 long top 3578
## 9 long classic 3363
## 10 long enjoy 3355
## 11 short love 577
## 12 short worth 359
## 13 short excellent 332
## 14 short fun 317
## 15 short recommend 261
## 16 short wonderful 246
## 17 short pretty 244
## 18 short enjoy 224
## 19 short beautiful 214
## 20 short classic 206
Visualization
library(ggplot2)
ggplot(positive_top10, aes(x = reorder(word, n), y = n, fill = length_group)) +
geom_col(show.legend = FALSE) +
facet_wrap(~length_group, scales = "free") +
coord_flip() +
labs(
title = "Top 10 Positive Words in Short vs Long Reviews",
x = "Positive Word",
y = "Frequency"
) +
theme_minimal()
CHART 3
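Since the wordcloud package was loaded at the start but the charts above are bar charts, here is an optional sketch (my addition, not one of the numbered charts) showing how the same counts could also be rendered as a word cloud, for example for positive words in long reviews:
# optional: draw the positive-word frequencies for long reviews as a word cloud
positive_long <- positive %>%
  filter(length_group == "long")
set.seed(1234)
wordcloud(words = positive_long$word,
          freq = positive_long$n,
          max.words = 50,
          colors = brewer.pal(8, "Dark2"))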
4. Top Negative Words in Short vs Long Reviews
To complement the analysis above, I also chart the most frequently used negative words in both short and long reviews. My goal is to explore how users express dissatisfaction with a movie and how the choice of negative words relates to review length.
# Filter to negative Bing words and count their frequency by review length
negative <- tidy_imdb %>%
  inner_join(bing, by = "word", relationship = "many-to-many") %>%
  filter(sentiment.y == "negative") %>%
  count(length_group, word, sort = TRUE)
negative_top10 <- negative %>%
group_by(length_group) %>%
slice_max(n, n = 10)
negative_top10
## # A tibble: 20 × 3
## # Groups: length_group [2]
## length_group word n
## <chr> <chr> <int>
## 1 long bad 17292
## 2 long plot 12210
## 3 long funny 8147
## 4 long hard 5046
## 5 long worst 4898
## 6 long death 3832
## 7 long poor 3666
## 8 long dead 3610
## 9 long wrong 3480
## 10 long boring 3379
## 11 short bad 1156
## 12 short plot 730
## 13 short funny 590
## 14 short worst 432
## 15 short boring 247
## 16 short waste 247
## 17 short hard 223
## 18 short terrible 204
## 19 short stupid 185
## 20 short poor 171
Visualization
library(ggplot2)
ggplot(negative_top10, aes(x = reorder(word, n), y = n, fill = length_group)) +
geom_col(show.legend = FALSE) +
facet_wrap(~length_group, scales = "free") +
coord_flip() +
labs(
title = "Top 10 Negative Words in Short vs Long Reviews",
x = "Negative Word",
y = "Frequency"
) +
theme_minimal()
CHART 4
5. Explore the Most Common Word Pairs Using Phi Coefficients
The phi coefficient measures how strongly two words tend to appear in the same review: it is high when two words usually occur together and rarely apart, and near zero when their appearances are unrelated.
imdb <- imdb %>%
mutate(id = row_number())
imdb_tidy <- imdb %>%
unnest_tokens(input = review,
output = word,
drop = F) %>%
filter(!word %in% c("br")) %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
pair <- imdb_tidy %>%
pairwise_count(item = word,
feature = id,
sort = T)
pair
## # A tibble: 148,243,686 × 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 film movie 15369
## 2 movie film 15369
## 3 movie time 10997
## 4 time movie 10997
## 5 film time 10207
## 6 time film 10207
## 7 story movie 9321
## 8 movie story 9321
## 9 story film 9180
## 10 film story 9180
## # ℹ 148,243,676 more rows
# For LONG reviews only
long_cors <- imdb_tidy %>%
filter(length_group == "long") %>%
add_count(word) %>%
filter(n >= 150) %>%
pairwise_cor(item = word, feature = id, sort = TRUE)
long_cors
## # A tibble: 20,625,222 × 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 fi sci 0.984
## 2 sci fi 0.984
## 3 fu kung 0.915
## 4 kung fu 0.915
## 5 streep meryl 0.897
## 6 meryl streep 0.897
## 7 uwe boll 0.892
## 8 boll uwe 0.892
## 9 angeles los 0.891
## 10 los angeles 0.891
## # ℹ 20,625,212 more rows
library(dplyr)
long_word_cors <- long_cors %>%
filter(correlation > 0.3) %>% # Adjust threshold if needed
slice_max(correlation, n = 50) # Top 50 strongest pairs
library(tidygraph)
##
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
##
## filter
word_graph_long <- long_word_cors %>%
as_tbl_graph(directed = FALSE)
library(ggraph)
set.seed(1234) # for consistent layout
ggraph(word_graph_long, layout = "fr") +
geom_edge_link(aes(alpha = correlation), color = "gray50") +
geom_node_point(color = "steelblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE, size = 4) +
theme_graph() +
labs(title = "Word Correlation Network for Long Reviews (phi > 0.3)")
CHART 5
# For SHORT reviews only
short_cors <- imdb_tidy %>%
filter(length_group == "short") %>%
add_count(word) %>%
filter(n >= 150) %>%
pairwise_cor(item = word, feature = id, sort = TRUE)
short_cors
## # A tibble: 9,506 × 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 effects special 0.663
## 2 special effects 0.663
## 3 2 1 0.358
## 4 1 2 0.358
## 5 waste time 0.290
## 6 time waste 0.290
## 7 highly recommend 0.252
## 8 recommend highly 0.252
## 9 money waste 0.200
## 10 waste money 0.200
## # ℹ 9,496 more rows
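To make the phi coefficient concrete, here is a hand computation (my addition; output not reproduced here) for the pair “special”/“effects” from the table above. It rebuilds the 2×2 presence/absence table over short reviews, applying the same n >= 150 frequency filter used before pairwise_cor(), so the result should be close to the 0.663 reported above.
# hand-computed phi coefficient for "special" vs "effects" in short reviews
short_words <- imdb_tidy %>%
  filter(length_group == "short") %>%
  add_count(word) %>%
  filter(n >= 150) %>%
  distinct(id, word)
ids_a <- short_words %>% filter(word == "special") %>% pull(id)
ids_b <- short_words %>% filter(word == "effects") %>% pull(id)
all_ids <- unique(short_words$id)
n11 <- as.numeric(length(intersect(ids_a, ids_b)))   # reviews containing both words
n10 <- as.numeric(length(setdiff(ids_a, ids_b)))     # "special" only
n01 <- as.numeric(length(setdiff(ids_b, ids_a)))     # "effects" only
n00 <- as.numeric(length(all_ids)) - n11 - n10 - n01 # neither word
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
phi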
library(dplyr)
short_word_cors <- short_cors %>%
filter(correlation > 0.1) %>% # Adjust threshold if needed
slice_max(correlation, n = 50) # Top 50 strongest pairs
library(tidygraph)
word_graph_short <- short_word_cors %>%
as_tbl_graph(directed = FALSE)
library(ggraph)
set.seed(1234) # for consistent layout
ggraph(word_graph_short, layout = "fr") +
geom_edge_link(aes(alpha = correlation), color = "gray50") +
geom_node_point(color = "steelblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE, size = 4) +
theme_graph() +
labs(title = "Word Correlation Network for Short Reviews (phi > 0.1)")
CHART 6
C. CONCLUSION
From my observations, long reviews naturally contain more sentiment-bearing words, both positive and negative, simply because they are longer. In long reviews, negative words clearly outnumber positive ones, whereas in short reviews the gap between negative and positive words is much smaller. Although short reviews are easier to skim, they may not let users fully express how they felt after watching. Long reviews carry much richer emotional content, which helps readers form an overall picture of the movie. (chart 1)
The chart shows that long reviews not only contain a greater number of emotional words because of their length, but also reflect a broader emotional range. Beyond positive and negative, emotions such as anticipation, trust, fear, and joy are more prominent in long reviews, which suggests that users who write long reviews engage more thoughtfully and describe their reactions in more detail. In contrast, short reviews tend to concentrate on the overall positive or negative judgment. This suggests that while short reviews are more time-efficient to write and read, long reviews provide richer emotional depth and may offer more value both to filmmakers and to users deciding whether to watch. (chart 2)
The top 10 positive words in each group hint at how users express positive emotion at different lengths. The word “love” ranks first in both long and short reviews, revealing a strong positive emotional reaction. Short reviews lean toward direct, recommendation-style words such as “excellent”, “worth”, and “recommend”, signaling a willingness to point other viewers toward the film; this is a particularly useful cue for readers deciding what to watch. In contrast, the more descriptive and emotionally laden vocabulary of long reviews suggests deeper engagement with the movie after watching. (chart 3)
The visualization shows that the word “bad” dominates both length groups; it appears to be the default word for expressing general dissatisfaction. Long reviews contain more descriptive negative words such as “death” and “dead”, used to criticize more concretely, while short reviews favor blunter words like “stupid” and “terrible”. Interestingly, “funny”, which sounds positive, appears among the top three most frequent negative words (the Bing lexicon tags it as negative), suggesting that users may use it sarcastically to express disappointment. Overall, long reviews tend to offer deeper, more constructive criticism, while short reviews focus on venting an immediate negative reaction. (chart 4)
The phi coefficient word correlation networks for short and long reviews reveal different patterns of expression. In short reviews (chart 6), the graph is dominated by small, tightly connected clusters of opinion words such as “bad-worst-horrible”, suggesting that short reviewers concentrate on their core reaction, often dissatisfaction, expressed in direct, concrete words. Positive experiences also surface in pairs like “funny-laugh-comedy”, “worth-watching”, and “highly-recommend”. The network for long reviews (chart 5), in contrast, has a much more detailed structure: it contains actor and director names such as “Brad Pitt” and “Boris Karloff” and genre terms such as “sci-fi” and “kung fu”, reflecting a tendency to narrate and describe many specific aspects of a movie (including things like facial expressions and low budgets). Overall, these results show that short reviews prioritize a quick, straightforward judgment, while long reviews give readers a fuller picture built from small, specific details.
If a reader is simply looking for a thumbs-up or thumbs-down signal, short reviews are a good choice: they are easy to skim for the general sentiment toward a movie. Readers who want to understand why people liked or disliked a film, or who want to avoid spending time on films that are not their cup of tea, should turn to long reviews. Long reviews contain richer emotional vocabulary, specific examples (even spoilers!), and descriptive language, which gives readers more insight into elements like plot, acting, tone, and pacing. Understanding this contrast between review lengths allows readers to engage with feedback strategically: use short reviews for efficiency and broad consensus, but do not overlook long reviews, especially when deciding on more complex or divisive films. Long reviews can reveal the “why” behind the sentiment, which is often more informative than the sentiment alone.