The purpose of this experiment is to determine the viability of using code to extract the sentiment of movie reviews without having to read the reviews themselves. We are trying to answer the question: can sentiment analysis of written reviews be used to build a more objective evaluation tool for movies?
In the modern age, movies are incredibly prevalent; in 2022, from January 1st to 4th alone, there were approximately 20 films released. With so many films available, movie-goers will often place importance on watching the best movies. However, many popular movie-review sites, such as IMDb and Rotten Tomatoes, can have wildly different views on the same movies. Plus, the average rating statistic that these sites provide may be difficult to interpret.
For example, on Rotten Tomatoes the movie My Policeman has two different scores, one (the “Tomatometer”) at 44% and the other (the “Audience Score”) at 96%. On IMDb, the same movie has an average rating of 6.5/10. Metacritic rates this movie 50% for one statistic and 7.2/10 for another. With such vastly different scores across and within sites, it can be difficult to trust online movie reviews. User ratings can also be unpredictable: a user who really loved a movie might only rate it a 7/10, whereas someone who thought a movie was terrible might give it a 6/10. Since users are free to add reviews to various sites, many reviews might not be reliable.
Therefore, a more objective approach to movie ratings is needed: one that takes the text of a user’s review (rather than their numeric rating) and derives a rating from it. This experiment hopes to deliver that solution. To simplify the rating process, we will define a movie’s rating as the number of positive reviews divided by the total number of reviews. This algorithm for movie rating will be elaborated on in the discussion section.
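As a minimal sketch, that rating reduces to a single fraction; the vector `predicted` below is a hypothetical set of per-review labels, not output from the model built later.

```r
# Hypothetical per-review labels produced by a sentiment classifier
predicted <- c("positive", "positive", "negative", "positive")

# Proposed movie rating: positive reviews divided by total reviews
rating <- sum(predicted == "positive") / length(predicted)
rating  # 0.75, i.e. a 75% rating
```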
Exploratory Data Analysis for this experiment will begin with trying sentiment analysis on other data sets. This will allow us to determine if the idea actually works before applying it to movie reviews. This will also include various data visualizations of text data.
This section serves to show the viability of sentiment analysis on sample text passages, including presidential inauguration speeches and a passage from the movie National Lampoon’s Animal House (1978). Much of the code was provided by Dr. Brian Wright from the UVA DS 3001 GitHub repository.
NOTE: This code may not run properly if run with “Run All” in RStudio. However, running code chunks consecutively should work. In the worst case, some lines may need to be rerun for the code to work. This is likely due to some functions taking longer than others, resulting in inconsistent runtimes.
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.4.0 v purrr 0.3.5
## v tibble 3.1.8 v dplyr 1.0.10
## v tidyr 1.2.1 v stringr 1.4.1
## v readr 2.1.3 v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Exploration with National Lampoon’s Animal House
## tibble [100 x 1] (S3: tbl_df/tbl/data.frame)
## $ word: chr [1:100] "over" "did" "you" "say" ...
## Joining, by = "word"
Let’s look at a slightly larger dataset. As referenced in Text Mining with R, the gutenbergr package provides access to public-domain works from Project Gutenberg. Let’s take a look. To learn more, check out this link: https://ropensci.org/tutorials/gutenbergr_tutorial/
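As a rough sketch of how this works (assuming the standard gutenbergr interface), the catalog can be searched with gutenberg_works() and a work downloaded by its ID with gutenberg_download():

```r
library(gutenbergr)
library(tidyverse)

# Search the Project Gutenberg catalog for titles containing "Kennedy"
kennedy_works <- gutenberg_works(str_detect(title, "Kennedy"))
head(kennedy_works)

# Download the full text of one work by its gutenberg_id
# (id 3 is John F. Kennedy's Inaugural Address, per the table below)
jfk_speech <- gutenberg_download(3)
```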
## # A tibble: 6 x 8
## gutenberg_id title author guten~1 langu~2 guten~3 rights has_t~4
## <int> <chr> <chr> <int> <chr> <chr> <chr> <lgl>
## 1 3 "John F. Kennedy's~ Kenne~ 1666 en <NA> Publi~ TRUE
## 2 4746 "Kennedy Square" Smith~ 444 en <NA> Publi~ TRUE
## 3 12433 "Narrative of the ~ MacGi~ 4367 en <NA> Publi~ TRUE
## 4 12525 "Narrative of the ~ MacGi~ 4367 en <NA> Publi~ TRUE
## 5 56902 "The Soul Scar: A ~ Reeve~ 752 en <NA> Publi~ TRUE
## 6 58031 "Report of the Pre~ Unite~ 42517 en <NA> Publi~ TRUE
## # ... with abbreviated variable names 1: gutenberg_author_id, 2: language,
## # 3: gutenberg_bookshelf, 4: has_text
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
OK, now that we have our word frequencies, let’s do some analysis. We will compare the three speeches using sentiment analysis to see whether they generally align.
get_sentiments() mimics training the model, in that it gathers existing sentiment lexicons for us. By pulling these in, we can build an initial model using joins and other functions; in that sense, these lexicons are the closest thing we have to “training data”.
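For reference, this is roughly how the three lexicons printed below are loaded with tidytext (the AFINN and NRC lexicons are hosted by the textdata package and may prompt for a one-time download):

```r
library(tidytext)
library(textdata)  # hosts the AFINN and NRC lexicons

afinn <- get_sentiments("afinn")  # numeric scores from -5 to +5
nrc   <- get_sentiments("nrc")    # positive/negative plus eight emotions
bing  <- get_sentiments("bing")   # binary positive/negative labels
```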
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,467 more rows
## # A tibble: 13,872 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,862 more rows
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ... with 6,776 more rows
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
Now that we have our sentiments, let’s do some quick comparisons.
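A hedged sketch of the comparison step follows; speech_words stands in for one speech’s tokenized, stop-word-filtered words (the name is illustrative, not the variable actually used above).

```r
library(dplyr)
library(tidytext)

# Count positive vs. negative words in one speech using the Bing lexicon
bing_counts <- speech_words %>%
  inner_join(get_sentiments("bing"), by = "word")
table(bing_counts$sentiment)

# The NRC lexicon gives the fuller emotion breakdown shown below
nrc_counts <- speech_words %>%
  inner_join(get_sentiments("nrc"), by = "word")
table(nrc_counts$sentiment)
```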
##
## negative positive
## 25 52
##
## negative positive
## 43 46
##
## negative positive
## 56 52
##
## anger anticipation disgust fear joy negative
## 16 29 7 19 34 24
## positive sadness surprise trust
## 83 9 11 50
##
## anger anticipation disgust fear joy negative
## 21 33 10 33 29 44
## positive sadness surprise trust
## 77 15 14 55
##
## anger anticipation disgust fear joy negative
## 40 43 13 42 37 65
## positive sadness surprise trust
## 86 34 18 54
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Next we turn to term frequency-inverse document frequency (tf-idf). Here we are going to treat each of our speeches as a document in a corpus and explore the relative importance of words to these speeches as compared to the overall corpus.
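A rough sketch of the computation with tidytext’s bind_tf_idf(), assuming a placeholder tibble speech_words with one row per (president, word) token:

```r
library(tidytext)
library(dplyr)

speech_tf_idf <- speech_words %>%
  count(president, word) %>%            # word counts per speech (column n)
  bind_tf_idf(word, president, n) %>%   # adds tf, idf, and tf_idf columns
  arrange(desc(tf_idf))

head(speech_tf_idf)  # words most distinctive to each individual speech
```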
## Warning: `as.tibble()` was deprecated in tibble 2.0.0.
## i Please use `as_tibble()` instead.
## i The signature and semantics have changed, see `?as_tibble`.
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## i Using compatibility `.name_repair`.
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## i Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(y)
##
## # Now:
## data %>% select(all_of(y))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## i Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(z)
##
## # Now:
## data %>% select(all_of(z))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## New names:
## Joining, by = "president"
## * `text` -> `text...1`
## * `text` -> `text...2`
## * `text` -> `text...3`
The steps above clarify the process needed to format text data so that it can then be processed by different methods for analysis. The different methods include sentiment scores, word cloud generation, and term frequency. Sentiment scores appear to be the analysis method that most aligns with our goal of processing written movie reviews and creating objective ratings for movies. The get_sentiment() function comes from the [syuzhet package](https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html). Its different methods (afinn, bing, and nrc) evaluate the sentiment of words on different scales, so as to get a more comprehensive measure of the sentiment of each word. The benefit of using sentiment scores is that they reduce written reviews to negative, neutral, or positive.
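As a small illustration of those methods (the sentence below is a made-up example), get_sentiment() returns one numeric score per input string, with sign and magnitude depending on the lexicon chosen:

```r
library(syuzhet)

sentence <- "This movie was a wonderful surprise"  # made-up example text

get_sentiment(sentence, method = "afinn")  # sum of AFINN word scores
get_sentiment(sentence, method = "bing")   # positive words minus negative words
get_sentiment(sentence, method = "nrc")    # positive minus negative per the NRC lexicon
```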
We will now perform a similar analysis on IMDb movie reviews. The dataset comes from Kaggle and contains 50,000 movie reviews, each labelled with an overall sentiment (positive or negative).
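A minimal sketch of how the data are read in; the file name is an assumption about how the Kaggle CSV was saved locally:

```r
library(readr)

# Two-column CSV from Kaggle: review (text) and sentiment ("positive"/"negative")
imdb <- read_csv("IMDB Dataset.csv")  # file name assumed
str(imdb)
```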
## Rows: 50000 Columns: 2
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): review, sentiment
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## spc_tbl_ [50,000 x 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ review : chr [1:50000] "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right"| __truncated__ "A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion "| __truncated__ "I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned th"| __truncated__ "Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fi"| __truncated__ ...
## $ sentiment: chr [1:50000] "positive" "positive" "positive" "negative" ...
## - attr(*, "spec")=
## .. cols(
## .. review = col_character(),
## .. sentiment = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
## [1] 0
Trying the analysis on the first datum, a positive review
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
##
## negative positive
## 15 8
##
## anger anticipation disgust fear joy negative
## 6 7 3 9 3 14
## positive sadness surprise trust
## 9 7 2 7
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Trying the analysis on the fourth review, a negative review
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
##
## negative
## 6
##
## anger anticipation fear joy negative positive
## 2 3 3 1 4 2
## sadness surprise trust
## 4 2 2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Applying the analysis to every row of the dataframe
Apply this function to the reviews, evaluate it, and tune the algorithm. The 500 reviews used in the loop constitute a tuning dataset: they are used to make minute adjustments to the decision algorithm. The decision algorithm refers to how we decide whether a review is positive or negative based on the table returned by the summarize() function declared above.
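One way such a decision rule might look is sketched below; this is an illustration of the idea, not the exact summarize()-based function used here. It calls a review positive when its NRC-positive word count is at least its NRC-negative word count, then measures accuracy on the first 500 labelled reviews (imdb is the tibble name assumed in the earlier sketch).

```r
library(dplyr)
library(tidytext)

# Illustrative decision rule: "positive" when NRC-positive words >= NRC-negative words
classify_review <- function(text) {
  counts <- tibble(text = text) %>%
    unnest_tokens(word, text) %>%
    inner_join(get_sentiments("nrc"), by = "word") %>%
    count(sentiment)
  pos <- sum(counts$n[counts$sentiment == "positive"])
  neg <- sum(counts$n[counts$sentiment == "negative"])
  if (pos >= neg) "positive" else "negative"
}

# Tune/evaluate on the first 500 reviews against their labelled sentiment
predicted <- sapply(imdb$review[1:500], classify_review)
mean(predicted == imdb$sentiment[1:500])  # accuracy on the tuning subset
```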
Accuracy:
## [1] 0.62
To address the question, the main technique employed was sentiment analysis of written movie reviews. This involved breaking down text passages into words, removing stopwords, and analyzing the general sentiment of those words with the get_sentiment() function.
Tidytext utilizes three different unigram lexicons: afinn, bing, and nrc. Afinn was designed for Twitter sentiment analysis and assigns scores between -5 and +5, where negative numbers represent a negative sentiment. Bing was designed for customer reviews and simply categorizes words into positive and negative. Nrc categorizes words as positive or negative and also assigns emotions like joy, anger, and sadness.
The results from the afinn, nrc, and bing methods were used to give the general sentiment of the words in each review. However, the sentiments of the individual words had to be combined to evaluate the sentiment of the review as a whole. Data visualizations using histograms and word clouds were used to show the spread of the sentiments for the first (positive) and fourth (negative) reviews from the IMDb dataset. The visualizations and tables helped us determine that the nrc method was most suitable for the sentiment analysis of the whole dataset, as it clearly provided the positive or negative sentiment of each sentence of a review. The breakdown of the sentiment was then fed into a decision algorithm, so that we could determine whether the sentiment analysis had correctly evaluated each review.
Accuracy:
## [1] 0.64
This model correctly predicts the sentiment of the reviews approximately 64% of the time. This is decent, but it is clearly lacking in certain areas. Without further research, it is difficult to utilize the other sentiments that could be output (anger, disgust, anticipation, fear, joy, sadness, surprise, and trust). This is because many words, when used in conjunction with others or without context, can completely change meaning. For example, if a movie is disgusting, it could be an incredibly gory but moving and accurate depiction of something, or it could simply be a terrible movie. Similarly, the phrase “disgustingly good” should result in a strongly positive sentiment score, but in our case might end up with a neutral result. Without further research into natural language processing (NLP) and sentiment analysis, it is difficult to ascertain the usage of the words in the reviews. Therefore, the model only used positive and negative, which can lead to many inaccuracies and pitfalls. More will be discussed in the conclusion section.
Some reviews may be written by people with a less extensive vocabulary, who may not be able to select the best words to convey their opinion. Such reviews might be misinterpreted during sentiment analysis, leading to an inconclusive result or the opposite of the intended evaluation.
Comment reviews are also more likely to be left by people who feel passionate about the movie in one way or another. This may lead to skewed datasets, or possibly more flamboyant reviews.
Our sentiment analysis also focuses on single words. This means that qualifiers aren’t taken into account: for example, the phrase “not bad” would end up with a cumulative net negative sentiment score when in reality it should lean more towards a neutral score. More research into the libraries used would be needed to fully conclude this, however.
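One way this could be checked, sketched below following the bigram approach from Text Mining with R (not something the current model does), is to tokenize into bigrams and look for sentiment words preceded by a negator, whose scores would then need to be flipped. The imdb name follows the assumed earlier sketch.

```r
library(tidytext)
library(dplyr)
library(tidyr)

negators <- c("not", "no", "never", "without")

# Bigrams where a negation word precedes an AFINN-scored word
negated_words <- tibble(text = imdb$review[1:500]) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% negators) %>%
  inner_join(get_sentiments("afinn"), by = c("word2" = "word")) %>%
  count(word1, word2, value, sort = TRUE)

head(negated_words)  # e.g. "not good": pairs whose AFINN scores should be inverted
```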
From the model building, we can see that it is possible to determine the sentiment of a movie review, meaning that the answer to the question at hand is yes: it is possible to build a more objective evaluation tool for movie reviews using sentiment analysis.
However, the model we created is not accurate enough to be deployed in a real-world environment. Light research reveals many examples of people producing models with 70% or more accuracy when classifying IMDb movie reviews. For example, the Kaggle site where the data were retrieved contains a Python notebook with approximately 75% accuracy. There are many areas to improve before one should use such a model. Because there already exist many models with higher accuracy than the model created here (64%), we may conclude that while it is possible to develop a sentiment analysis tool for movie reviews, the one we have created is not sufficiently accurate.
Difficulties arose when dealing with the size of the data, the quality of the data, and the results of sentiment analysis. The IMDb dataset contains 50,000 reviews, so our model has been tuned on about 1% of the data available. We were not able to utilize the other 99% of the data because the functions of the model are computationally expensive. Another difficulty is the contextualization of the words in the reviews, which for the most part we did not address beyond positive or negative. The model would have to be trained to understand sentiment from context, but that requires linking words to their parts of speech and other features. The nrc method provides a guess about the emotion of a word, but it is limited to eight emotions, which is not enough to encompass the variety of expressions found in the reviews.
A potential source of error is that we relied upon the positive and negative evaluation of sentences, thus losing a degree of context from discarding the accompanying emotion. This is erroneous because positive does not necessarily mean a review overall is good, but we assumed it generally meant it was a good review (same with negative). Another potential source of error is the length of the reviews, and that they might not contain enough data to perform accurate sentiment analysis. Lastly, our evaluation of the decision algorithm might be off because it was tested on a relatively small sample size.
Our process would benefit from tokenizing text by sentence rather than by word (the token argument of unnest_tokens can take many different values, which might change the sentiments; it did not seem to change anything for us), and from incorporating the emotions of each sentence given by the nrc method to provide more context when deciding whether a sentence is positive or negative. Further future research could include general improvements in sentiment analysis algorithms and NLP.
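As a brief sketch of what that sentence-level tokenization might look like (again using the imdb name assumed earlier):

```r
library(tidytext)
library(syuzhet)
library(dplyr)

# Split the first review into sentences instead of words
review_sentences <- tibble(text = imdb$review[1]) %>%
  unnest_tokens(sentence, text, token = "sentences")

# One sentiment score per sentence, preserving some within-sentence context
get_sentiment(review_sentences$sentence, method = "nrc")
```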