A. INTRODUCTION

Executive summary

This final project explores how review length shapes the views that prospective movie viewers take away and the way reviewers express themselves. My expectation is that short reviews are more practical because they give clear, direct feedback, so readers tend to seek them out and skip reviews that are too long. Long reviews, however, seem to let reviewers express more thoughts with a more diverse vocabulary. My goal is to identify the patterns that distinguish the two types: quick, strong impressions in short reviews and more meaningful insights in long reviews. Taken together, the visualizations suggest that long reviews tend to include more descriptive, emotion-rich, and nuanced language, while short reviews favor direct, emotional keywords.

Data background

In this project, I use the “IMDb Dataset of 50K Movie Reviews”, published on Kaggle 6 years ago. IMDb is one of the most popular online databases for movie-related content and user reviews. The dataset was curated for binary sentiment classification and contains 25,000 highly polar movie reviews for training and 25,000 for testing, for a total of 50,000 reviews from IMDb users. I therefore believe this dataset will give me a deep understanding of how people talk about movies differently depending on how much they write.

Data loading, cleaning and preprocessing

To prepare the dataset for analysis, I followed several key steps in R to clean, reshape, and structure the data. First, I installed and loaded the necessary packages, such as tidyverse and tidytext. Because my analysis focuses on review length, I classified the reviews into two groups by word count: short reviews contain 100 words or fewer, and long reviews contain more than 100 words. Next, I tokenized the reviews and removed stop words so that the dataset consists only of words that are valuable to my analysis. These steps give me a tidy, normalized format with less noise. The IMDb dataset is now ready for text mining and sentiment analysis.
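
As a minimal sketch of the length grouping, the example below applies the same word-count rule used later in this report (`str_count()` with `"\\w+"` and a cutoff of 100) to a few toy reviews. The `toy` tibble and its contents are hypothetical and only illustrate how the boundary case of exactly 100 words is handled.

```r
library(dplyr)
library(stringr)

# Hypothetical toy reviews; the real dataset has 50,000 rows.
toy <- tibble(
  review = c(
    "Great movie, loved it!",
    str_c(rep("word", 100), collapse = " "),  # exactly 100 words
    str_c(rep("word", 101), collapse = " ")   # 101 words
  )
)

# Count word-like tokens and label each review by length.
toy <- toy %>%
  mutate(
    word_count   = str_count(review, "\\w+"),
    length_group = if_else(word_count <= 100, "short", "long")
  )

toy %>% select(word_count, length_group)
```

Under this rule a review of exactly 100 words falls in the short group, and only reviews with 101 words or more are labeled long.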

B. BODY

Text data analysis

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(dplyr)
library(stringr)
library(wordcloud) 
## Loading required package: RColorBrewer
library(ggplot2)
library(widyr)
bing <- get_sentiments("bing")
bing
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ℹ 6,776 more rows
nrc <- get_sentiments("nrc") 
nrc
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ℹ 13,862 more rows
# Load the CSV file (read_csv() comes from readr, which tidyverse attaches)

imdb <- read_csv("IMDB Dataset 2.csv")
## Rows: 50000 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): review, sentiment
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
imdb 
## # A tibble: 50,000 × 2
##    review                                                              sentiment
##    <chr>                                                               <chr>    
##  1 "One of the other reviewers has mentioned that after watching just… positive 
##  2 "A wonderful little production. <br /><br />The filming technique … positive 
##  3 "I thought this was a wonderful way to spend time on a too hot sum… positive 
##  4 "Basically there's a family where a little boy (Jake) thinks there… negative 
##  5 "Petter Mattei's \"Love in the Time of Money\" is a visually stunn… positive 
##  6 "Probably my all-time favorite movie, a story of selflessness, sac… positive 
##  7 "I sure would like to see a resurrection of a up dated Seahunt ser… positive 
##  8 "This show was an amazing, fresh & innovative idea in the 70's whe… negative 
##  9 "Encouraged by the positive comments about this film on here I was… negative 
## 10 "If you like original gut wrenching laughter you will like this mo… positive 
## # ℹ 49,990 more rows
# Count the words in each review and label it as short or long
imdb <- imdb %>%
  mutate(
    word_count = str_count(review, "\\w+"),
    length_group = case_when(
      word_count <= 100 ~ "short",
      word_count > 100 ~ "long"
    )
  )
imdb
## # A tibble: 50,000 × 4
##    review                                      sentiment word_count length_group
##    <chr>                                       <chr>          <int> <chr>       
##  1 "One of the other reviewers has mentioned … positive         320 long        
##  2 "A wonderful little production. <br /><br … positive         166 long        
##  3 "I thought this was a wonderful way to spe… positive         172 long        
##  4 "Basically there's a family where a little… negative         141 long        
##  5 "Petter Mattei's \"Love in the Time of Mon… positive         236 long        
##  6 "Probably my all-time favorite movie, a st… positive         125 long        
##  7 "I sure would like to see a resurrection o… positive         161 long        
##  8 "This show was an amazing, fresh & innovat… negative         181 long        
##  9 "Encouraged by the positive comments about… negative         130 long        
## 10 "If you like original gut wrenching laught… positive          34 short       
## # ℹ 49,990 more rows

Tokenize, remove stop words

tidy_imdb <- imdb %>%
  unnest_tokens(word, review) %>% 
  anti_join(stop_words, by = "word")
tidy_imdb
## # A tibble: 4,601,528 × 4
##    sentiment word_count length_group word     
##    <chr>          <int> <chr>        <chr>    
##  1 positive         320 long         reviewers
##  2 positive         320 long         mentioned
##  3 positive         320 long         watching 
##  4 positive         320 long         1        
##  5 positive         320 long         oz       
##  6 positive         320 long         episode  
##  7 positive         320 long         hooked   
##  8 positive         320 long         happened 
##  9 positive         320 long         br       
## 10 positive         320 long         br       
## # ℹ 4,601,518 more rows
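
The tokenized output above still contains `br` tokens, which come from `<br />` HTML line-break tags in the raw reviews. One possible way to avoid them, shown here only as a sketch on a hypothetical one-row tibble, is to replace the tags with a space before tokenizing. The regular expression and the `toy` and `clean_tokens` names are assumptions for illustration, not part of the original pipeline.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical one-row example in the same shape as the imdb tibble.
toy <- tibble(
  review = "A wonderful little production. <br /><br />The filming technique is fresh."
)

clean_tokens <- toy %>%
  # Replace HTML line-break tags with a space before tokenizing.
  mutate(review = str_replace_all(review, "<br\\s*/?>", " ")) %>%
  unnest_tokens(word, review) %>%
  anti_join(stop_words, by = "word")

clean_tokens
```

Cleaning the text first keeps the markup out of every later count, whereas filtering the `br` token afterwards has to be repeated in each analysis step.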

COMPARE SENTIMENTS

1. Bing Lexicon Sentiment Analysis

I would like to explore how sentiment is expressed in each type of review by using the Bing sentiment lexicon.

tidy_imdb <- imdb %>%
  unnest_tokens(word, review) %>%
  anti_join(stop_words, by = "word")

joined <- inner_join(tidy_imdb, bing, by = "word")
## Warning in inner_join(tidy_imdb, bing, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1268656 of `x` matches multiple rows in `y`.
## ℹ Row 5781 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
head(joined)
## # A tibble: 6 × 5
##   sentiment.x word_count length_group word      sentiment.y
##   <chr>            <int> <chr>        <chr>     <chr>      
## 1 positive           320 long         struck    negative   
## 2 positive           320 long         brutality negative   
## 3 positive           320 long         trust     positive   
## 4 positive           320 long         faint     negative   
## 5 positive           320 long         timid     negative   
## 6 positive           320 long         classic   positive
# Count Bing sentiment words per length group and compute a net score
imdb_bing <- joined %>%
  count(length_group, sentiment = sentiment.y) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment_score = positive - negative)
imdb_bing
## # A tibble: 2 × 4
##   length_group negative positive sentiment_score
##   <chr>           <int>    <int>           <int>
## 1 long           450317   308797         -141520
## 2 short           16625    13371           -3254
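
Because the long group holds far more words than the short group, raw counts are hard to compare directly. The sketch below recomputes the table above as within-group shares; the counts are copied from the `imdb_bing` output, while the `bing_counts` and `bing_share` names are introduced here for illustration.

```r
library(dplyr)

# Counts copied from the imdb_bing table above.
bing_counts <- tibble(
  length_group = c("long", "short"),
  negative     = c(450317, 16625),
  positive     = c(308797, 13371)
)

# Express each sentiment as a share of all sentiment words in its group.
bing_share <- bing_counts %>%
  mutate(
    total          = negative + positive,
    negative_share = negative / total,
    positive_share = positive / total
  )

bing_share
```

On these counts, negative words make up roughly 59% of sentiment words in long reviews and roughly 55% in short reviews, so the gap between the groups is smaller in relative terms than the raw counts suggest.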

Visualization

# First reshape the data for ggplot
imdb_bing <- imdb_bing %>%
  pivot_longer(cols = c(positive, negative), names_to = "sentiment", values_to = "count")

# Sentiment word count by review length
ggplot(imdb_bing, aes(x = length_group, y = count, fill = sentiment)) +
  geom_col(position = "dodge") +
  labs(
    title = "Positive vs Negative Words in Long vs Short Reviews",
    x = "Review Length", y = "Word Count",
    fill = "Sentiment"
  ) +
  theme_minimal()

CHART 1

2. NRC Lexicon Sentiment Analysis

In this section, I apply the NRC lexicon to both short and long IMDb reviews to see which emotions they mainly contain, rather than looking only at positive and negative word use. I hope to gain a deeper understanding of the emotional content of user reviews.

Join with the NRC Lexicon

nrc <- get_sentiments("nrc") %>%
  select(word, emotion = sentiment)
imdb_nrc <- tidy_imdb %>%
  select(-sentiment) %>%
  inner_join(nrc, by = "word", relationship = "many-to-many")
imdb_nrc
## # A tibble: 2,755,860 × 4
##    word_count length_group word      emotion 
##         <int> <chr>        <chr>     <chr>   
##  1        320 long         hooked    negative
##  2        320 long         brutality anger   
##  3        320 long         brutality fear    
##  4        320 long         brutality negative
##  5        320 long         violence  anger   
##  6        320 long         violence  fear    
##  7        320 long         violence  negative
##  8        320 long         violence  sadness 
##  9        320 long         word      positive
## 10        320 long         word      trust   
## # ℹ 2,755,850 more rows
imdb_nrc %>%
  count(length_group, emotion) %>%
  pivot_wider(names_from = emotion, values_from = n, values_fill = 0)
## # A tibble: 2 × 11
##   length_group  anger anticipation disgust   fear    joy negative positive
##   <chr>         <int>        <int>   <int>  <int>  <int>    <int>    <int>
## 1 long         201167       244922  161993 253957 224947   410311   527944
## 2 short          6988         8874    6315   8703   9170    14175    19026
## # ℹ 3 more variables: sadness <int>, surprise <int>, trust <int>

Visualization

imdb_nrc <- tidy_imdb %>%
  select(-sentiment) %>%
  inner_join(nrc, by = "word", relationship = "many-to-many") %>%
  count(length_group, emotion)
ggplot(imdb_nrc, aes(x = emotion, y = n, fill = length_group)) +
  geom_col(position = "dodge") +
  labs(
    title = "NRC Emotion Comparison: Long vs Short Reviews",
    x = "Emotion", y = "Word Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

CHART 2

3. Explore the Most Frequent Positive Words

To understand how users express positive emotions in movie reviews, I chart the most frequently used positive words in short and long reviews. I aim to explore the language patterns visually and identify differences in word choice where possible.

#Filter only most frequently used positive words
positive <- tidy_imdb %>%
  inner_join(bing, by = "word",relationship = "many-to-many") %>%
  filter(sentiment.y == "positive") %>%
  count(length_group, word, sort = TRUE)
positive_top10 <- positive %>%
  group_by(length_group) %>%
  slice_max(n, n = 10)
positive_top10
## # A tibble: 20 × 3
## # Groups:   length_group [2]
##    length_group word          n
##    <chr>        <chr>     <int>
##  1 long         love      12385
##  2 long         pretty     7007
##  3 long         fun        5066
##  4 long         worth      4310
##  5 long         beautiful  4001
##  6 long         excellent  3763
##  7 long         nice       3676
##  8 long         top        3578
##  9 long         classic    3363
## 10 long         enjoy      3355
## 11 short        love        577
## 12 short        worth       359
## 13 short        excellent   332
## 14 short        fun         317
## 15 short        recommend   261
## 16 short        wonderful   246
## 17 short        pretty      244
## 18 short        enjoy       224
## 19 short        beautiful   214
## 20 short        classic     206

Visualization

library(ggplot2)

ggplot(positive_top10, aes(x = reorder(word, n), y = n, fill = length_group)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~length_group, scales = "free") +
  coord_flip() +
  labs(
    title = "Top 10 Positive Words in Short vs Long Reviews",
    x = "Positive Word",
    y = "Frequency"
  ) +
  theme_minimal()

CHART 3

4. Explore the Most Frequent Negative Words

To complement this analysis, I also chart the most frequently used negative words in short and long reviews. My goal is to explore how users express dissatisfaction with a movie and how their choice of negative words relates to review length.

#Filter only most frequently used negative words
negative <- tidy_imdb %>%
  inner_join(bing, by = "word",relationship = "many-to-many") %>%
  filter(sentiment.y == "negative") %>%
  count(length_group, word, sort = TRUE)
negative_top10 <- negative %>%
  group_by(length_group) %>%
  slice_max(n, n = 10)
negative_top10
## # A tibble: 20 × 3
## # Groups:   length_group [2]
##    length_group word         n
##    <chr>        <chr>    <int>
##  1 long         bad      17292
##  2 long         plot     12210
##  3 long         funny     8147
##  4 long         hard      5046
##  5 long         worst     4898
##  6 long         death     3832
##  7 long         poor      3666
##  8 long         dead      3610
##  9 long         wrong     3480
## 10 long         boring    3379
## 11 short        bad       1156
## 12 short        plot       730
## 13 short        funny      590
## 14 short        worst      432
## 15 short        boring     247
## 16 short        waste      247
## 17 short        hard       223
## 18 short        terrible   204
## 19 short        stupid     185
## 20 short        poor       171
#Visualization 
library(ggplot2)

ggplot(negative_top10, aes(x = reorder(word, n), y = n, fill = length_group)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~length_group, scales = "free") +
  coord_flip() +
  labs(
    title = "Top 10 Negative Words in Short vs Long Reviews",
    x = "Negative Word",
    y = "Frequency"
  ) +
  theme_minimal()

CHART 4

5. Explore Most Common Phrases By Using Phi Coefficients
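
As background for this section, the phi coefficient measures how often two words appear in the same review relative to how often they appear separately. The sketch below works through the formula on hypothetical presence/absence vectors (`word_x` and `word_y` are made up for illustration) and checks that it matches a Pearson correlation on 0/1 data, which is my understanding of what `pairwise_cor()` computes by default.

```r
# Hypothetical presence/absence of two words across eight reviews
# (1 = the word appears in the review, 0 = it does not).
word_x <- c(1, 1, 1, 0, 0, 0, 1, 0)
word_y <- c(1, 1, 0, 0, 0, 0, 1, 1)

n11 <- sum(word_x == 1 & word_y == 1)  # both words present
n10 <- sum(word_x == 1 & word_y == 0)  # only word_x present
n01 <- sum(word_x == 0 & word_y == 1)  # only word_y present
n00 <- sum(word_x == 0 & word_y == 0)  # neither word present

# Phi coefficient from the 2x2 contingency counts.
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

phi
cor(word_x, word_y)  # Pearson correlation on 0/1 data gives the same value
```

In this toy case the two words co-occur in three reviews and are both absent in three, with one review each where only one word appears, which yields a phi of 0.5.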

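# Give each review an id so words can be grouped by review
imdb <- imdb %>%
  mutate(id = row_number())
# Tokenize, drop the "br" tokens left by HTML line breaks, and remove stop words
imdb_tidy <- imdb %>%
  unnest_tokens(input = review,
                output = word,
                drop = FALSE) %>%
  filter(word != "br") %>%
  anti_join(stop_words)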
## Joining with `by = join_by(word)`
# Count how often each pair of words appears in the same review
pair <- imdb_tidy %>%
  pairwise_count(item = word,
                 feature = id,
                 sort = TRUE)
pair
## # A tibble: 148,243,686 × 3
##    item1 item2     n
##    <chr> <chr> <dbl>
##  1 film  movie 15369
##  2 movie film  15369
##  3 movie time  10997
##  4 time  movie 10997
##  5 film  time  10207
##  6 time  film  10207
##  7 story movie  9321
##  8 movie story  9321
##  9 story film   9180
## 10 film  story  9180
## # ℹ 148,243,676 more rows
# For LONG reviews only
long_cors <- imdb_tidy %>%
  filter(length_group == "long") %>%
  add_count(word) %>%
  filter(n >= 150) %>%
  pairwise_cor(item = word, feature = id, sort = TRUE)
long_cors
## # A tibble: 20,625,222 × 3
##    item1   item2   correlation
##    <chr>   <chr>         <dbl>
##  1 fi      sci           0.984
##  2 sci     fi            0.984
##  3 fu      kung          0.915
##  4 kung    fu            0.915
##  5 streep  meryl         0.897
##  6 meryl   streep        0.897
##  7 uwe     boll          0.892
##  8 boll    uwe           0.892
##  9 angeles los           0.891
## 10 los     angeles       0.891
## # ℹ 20,625,212 more rows
library(dplyr)

long_word_cors <- long_cors %>%
  filter(correlation > 0.3) %>%   # Adjust threshold if needed
  slice_max(correlation, n = 50)  # Top 50 rows; each pair is listed in both directions, so about 25 unique pairs

library(tidygraph)
## 
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
## 
##     filter
word_graph_long <- long_word_cors %>%
  as_tbl_graph(directed = FALSE)

library(ggraph)

set.seed(1234)  # for consistent layout

ggraph(word_graph_long, layout = "fr") +
  geom_edge_link(aes(alpha = correlation), color = "gray50") +
  geom_node_point(color = "steelblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, size = 4) +
  theme_graph() +
  labs(title = "Word Correlation Network for Long Reviews (phi > 0.3)")

CHART 5

# For SHORT reviews only
short_cors <- imdb_tidy %>%
  filter(length_group == "short") %>%
  add_count(word) %>%
  filter(n >= 150) %>%
  pairwise_cor(item = word, feature = id, sort = TRUE)
short_cors
## # A tibble: 9,506 × 3
##    item1     item2     correlation
##    <chr>     <chr>           <dbl>
##  1 effects   special         0.663
##  2 special   effects         0.663
##  3 2         1               0.358
##  4 1         2               0.358
##  5 waste     time            0.290
##  6 time      waste           0.290
##  7 highly    recommend       0.252
##  8 recommend highly          0.252
##  9 money     waste           0.200
## 10 waste     money           0.200
## # ℹ 9,496 more rows
library(dplyr)

short_word_cors <- short_cors %>%
  filter(correlation > 0.1) %>%   # Adjust threshold if needed
  slice_max(correlation, n = 50)  # Top 50 rows; each pair is listed in both directions, so about 25 unique pairs

library(tidygraph)

word_graph_short <- short_word_cors %>%
  as_tbl_graph(directed = FALSE)

library(ggraph)

set.seed(1234)  # for consistent layout

ggraph(word_graph_short, layout = "fr") +
  geom_edge_link(aes(alpha = correlation), color = "gray50") +
  geom_node_point(color = "steelblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, size = 4) +
  theme_graph() +
  labs(title = "Word Correlation Network for Short Reviews (phi > 0.1)")

CHART 6

C. CONCLUSION

From my observation, long reviews give users more room to write, so they contain more sentiment-bearing words, both positive and negative. In long reviews, negative words clearly outnumber positive ones (450,317 vs. 308,797), whereas in short reviews the gap is much smaller in absolute terms (16,625 vs. 13,371). Although short reviews are easier to skim, they may not let users fully express their emotions after watching. Long reviews contain much richer emotional content, which helps readers get an overall picture of the movie. (chart 1)

The chart indicates that long reviews not only contain a great number of emotional words because of their length, but also reflect a broader emotional tone. Besides positive and negative sentiment, emotions such as anticipation, trust, fear, and joy are more noticeable in long reviews, which suggests that users who write long reviews tend to engage more thoughtfully and describe their thoughts in more detail. By contrast, short reviews tend to focus on overall emotions, such as positive and negative. This suggests that although short reviews are more time-efficient to write and read, long reviews provide richer emotional depth and may offer more value to both filmmakers and users who are about to watch. (chart 2)

The analysis of the top 10 positive words in both groups shows how users tend to express positive emotions in short and long reviews. The word “love” tops both lists, revealing users’ strong positive emotional reactions. Short reviews tend to include direct, evaluative words such as “excellent”, “worth”, and “recommend”, which reveal a willingness to share a verdict with others. This can be a valuable hint for readers who are about to watch the movie. In contrast, the more descriptive and emotionally laden words in long reviews may indicate users’ deeper engagement with the movie after watching. (chart 3)

The visualization shows that the word “bad” dominates in both length groups, so it seems to be the most common way for users to express general dissatisfaction. In long reviews, users tend to use more descriptive words such as “death” and “dead” to criticize directly, while in short reviews they tend to use blunter, ruder words such as “stupid” and “terrible”. Interestingly, the word “funny”, which sounds positive, appears among the top 3 negative words. Users might use “funny” sarcastically to express disappointment, for example at an inconsistency between the opening and the ending, although the ranking also reflects the fact that the Bing lexicon labels “funny” (and “plot”) as negative regardless of context. In conclusion, long reviews tend to provide deeper and more constructive criticism, while short ones focus on expressing users’ negative emotions in the moment. (chart 4)
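
Since “funny” and “plot” are counted as negative only because of how the lexicon labels them, one possible check is to drop such ambiguous words from the lexicon before counting. The sketch below uses a hypothetical miniature lexicon and token table (`mini_lexicon`, `tokens`) and an illustrative `domain_words` list; which words to exclude is a judgment call, not something the lexicon provides.

```r
library(dplyr)

# Hypothetical miniature lexicon and token table, for illustration only.
mini_lexicon <- tibble(
  word      = c("bad", "plot", "funny", "boring", "love"),
  sentiment = c("negative", "negative", "negative", "negative", "positive")
)
tokens <- tibble(word = c("bad", "plot", "funny", "plot", "boring", "love"))

# Words assumed to be ambiguous in movie reviews (an illustrative choice).
domain_words <- tibble(word = c("plot", "funny"))

# Remove the ambiguous words from the lexicon before joining.
filtered_lexicon <- mini_lexicon %>%
  anti_join(domain_words, by = "word")

negative_counts <- tokens %>%
  inner_join(filtered_lexicon, by = "word") %>%
  filter(sentiment == "negative") %>%
  count(word, sort = TRUE)

negative_counts
```

Comparing the top negative words with and without such a filter would show how much of the ranking depends on lexicon labels rather than on reviewers’ actual tone.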

The phi coefficient word correlation networks for short and long reviews reveal patterns in how viewers express themselves depending on review length. In short reviews (chart 6), the graph highlights small, tightly connected clusters of opinion words such as “bad-worst-horrible”. This may show that users writing short reviews focus on their core emotions, especially dissatisfaction, and use direct, concrete words. The chart also reveals users’ good experiences through word groups such as “funny-laugh-comedy”, “worth-watching”, and “highly recommend”. On the other hand, the network for long reviews (chart 5) suggests a more detailed structure. It consists of many names, such as “Brad Pitt” or “Boris Karloff”, and genres, such as “sci-fi” and “kung-fu”. This reflects users’ tendency to describe every aspect of a movie in more narrative detail (including facial expressions, low budget, etc.). In conclusion, these results show that while short reviews prioritize quick and straightforward judgment, long reviews let readers explore the content more objectively by laying out small, specific details.

If a reader is simply looking for a thumbs-up or thumbs-down signal, short reviews are a great choice, since they are useful for skimming the general sentiment and content of a movie. However, readers who want a deeper understanding of why people liked or disliked a film, or who do not want to waste time on films that are not their cup of tea, should turn to long reviews. Long reviews contain richer emotional words, specific examples (even spoilers!), and descriptive language, which can give readers more insight into elements like plot, acting, tone, and pacing. Understanding the contrast between review lengths helps readers engage with feedback more deliberately: use short reviews for efficiency and broad consensus, but don’t overlook long reviews, especially when deciding on more complex or divisive films. Longer reviews can reveal the “why” behind the sentiment, which is often more informative than sentiment alone.