Text Mining

Brooke Porter, Dennis Espejo, Oindriza Reza Nodi
MUSA 5000: Statistical and Data Mining Methods for Urban Data Analysis
Eugene Brusilovskiy
December 12, 2025

Introduction

Text mining is a method for analyzing large amounts of textual data. It is particularly useful for revealing patterns or trends in a large dataset that are not immediately obvious (George Washington University Libraries, 2025). The speed and efficiency of text mining are increasing rapidly as AI tools such as ChatGPT are applied to this kind of analysis (Brusilovskiy, 2025). Sentiment analysis is a form of text analysis concerned with the opinion, the emotion, or both expressed by a text (Yadollahi et al., 2017). In this report, we use R to conduct a text mining analysis of a dataset of IMDb movie reviews. Specifically, we use the NRC and AFINN lexicons in the syuzhet package to perform sentiment analysis on the dataset. These lexicons take different approaches to categorizing the negative and positive sentiments expressed by the meaningful words in each review. Performing sentiment analysis on IMDb movie reviews reveals the frequency of different tones in the reviewers’ writing by pulling out the negative and positive terms they use. In this report, we describe our process, decisions, reasoning, and findings.

Methods


For the current project, we used RStudio, an integrated development environment for R, and R Markdown, a file format for creating dynamic documents, to conduct our sentiment analysis (Grolemund, 2014). Sentiment analysis, although becoming an increasingly archaic method, is “the process of analyzing digital text to determine if the emotional tone of the message is positive, negative, or neutral” (Amazon Web Services, n.d.). With the increasing availability of large amounts of online text data, a process for capturing the overall sentiment across this data became essential. For example, given a large number of reviews for a specific product, a sentiment analysis can determine how reviewers are generally reacting to it, which can influence marketing strategies or inform future product modifications.

For our sentiment analysis, we analyze IMDb movie reviews. The IMDb movie review dataset comprises over 50,000 reviews of films spanning a wide range of genres (although specific films aren’t identified by title in the dataset). Because the CSV file is so large, we received the following error in R when attempting to turn the full dataset into a matrix (a necessary component of our sentiment analysis).

**Figure 1**: Error that appears when attempting to turn a large corpus into a matrix

Additionally, steps such as lemmatization became computationally expensive and time-consuming when run on the entire dataset of more than 50,000 reviews. To make the sentiment analysis feasible, we selected a more manageable sample of 2,000 reviews for the current project. To begin, we used the first 2,000 reviews in the dataset to confirm that our methods produced consistent results across our three project members. After this, we randomly selected 2,000 reviews to create a more representative sample, as relying solely on the first 2,000 reviews could introduce bias from the way the dataset is ordered. For example, if the dataset began with five-star reviews and each subsequent block contained reviews with fewer and fewer stars, the first 2,000 reviews would be disproportionately positive, and we would want to prevent that bias in our analysis.
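The two sampling strategies can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the exact code used for the project: the `reviews` data frame, its `text` column, and the seed value are all hypothetical placeholders.

```r
# Hypothetical stand-in for the full review data frame; the column name
# `text` and the placeholder contents are assumptions for illustration.
reviews <- data.frame(
  id   = seq_len(50000),
  text = paste("placeholder review", seq_len(50000)),
  stringsAsFactors = FALSE
)

# Ordered subset: useful for checking that every team member gets the same results.
first_2000 <- head(reviews, 2000)

# Random subset: a fixed (arbitrary) seed keeps the draw reproducible.
set.seed(5000)
sample_rows   <- sample(nrow(reviews), 2000)  # 2,000 row indices, without replacement
review_sample <- reviews[sample_rows, ]
```

Fixing the seed means the random sample is still identical across machines, so the consistency benefit of the ordered subset is kept while the ordering bias is avoided.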

After loading our dataset, we turned our MovieReviews data frame into a corpus using the VCorpus() function from the text mining (tm) package. A corpus is a text object commonly used in text mining, in which each entry (in our case, each movie review) is stored as an individual document. After transforming our movie review data into a corpus, we cleaned the text. Using the tm_map() function, we removed any punctuation, numbers, or extra whitespace that might interfere with our analysis. To ensure that we capture words that are significant to our analysis rather than words that mainly serve sentence structure, such as “and” or “this,” we loaded the English stopwords from the tm package and removed them from our corpus. Finally, we applied lemmatization using the textstem package in R to convert each word into its dictionary form, ensuring that different grammatical variations of the same word are treated as a single, unified term. For example, words like “running,” “ran,” and “runs” are all reduced to the lemma “run,” which consolidates words with similar meanings and improves the accuracy of our sentiment analysis. We could also have used stemming, a text normalization technique that reduces words to a crude root form, known as a stem, by stripping common suffixes. For example, “story” may be reduced to “stori” and “couple” to “coupl.” Stemming does not convert irregular forms like “ran” to “run,” and it often produces non-dictionary fragments, so we chose lemmatization as the more accurate tool for this project.
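The cleaning steps described above could look roughly like the following. This is a sketch that assumes the tm and textstem packages are installed; the two example reviews are invented, and the exact order and set of transformations in the project may differ.

```r
library(tm)        # VCorpus(), tm_map(), stopwords()
library(textstem)  # lemmatize_strings()

# Two made-up reviews standing in for the MovieReviews data frame.
docs <- c("The actors were running around; I LOVED it!!",
          "This movie ran for 3 hours and it was stupid.")

corpus <- VCorpus(VectorSource(docs))                        # one document per review
corpus <- tm_map(corpus, content_transformer(tolower))       # normalize case
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop "and", "this", ...
corpus <- tm_map(corpus, content_transformer(lemmatize_strings))
corpus <- tm_map(corpus, stripWhitespace)                    # collapse leftover gaps

content(corpus[[1]])  # inspect the cleaned text of the first review
```

Lowercasing comes before stopword removal because the tm stopword list is lowercase; wrapping `tolower` and `lemmatize_strings` in `content_transformer()` keeps each entry a proper tm document.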

We then turned our cleaned data into a term-document matrix. This matrix stores the raw count of each word in each review; summing those counts across reviews gives the word frequencies that support sentiment scoring, plotting, and integration with sentiment dictionaries. We then used the wordcloud() function from the wordcloud package to create a word cloud based on these counts. Within a word cloud, the larger a word is, the more frequently it appears in the dataset. Word clouds are a helpful visual aid for familiarizing oneself with the vocabulary of a dataset and for getting a raw view of its most frequent words. For our word cloud, we only used words that appeared at least 250 times across all 2,000 reviews, ensuring we capture the most relevant and recurring words. We chose 250 through trial and error, judging that this threshold included enough words to represent our data well. From the matrix, we also applied the “nrc” sentiment lexicon to get the overall sentiment of words across its ten categories: eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) and two polarities (negative and positive).

The NRC lexicon assigns emotion-based labels to individual words. When the lexicon is applied, it identifies which words in the dataset belong to each sentiment category and generates sentiment scores based on the frequencies of those words in the text. For example, in the NRC lexicon, the word “stupid” has the following scores:

**Figure 2**: NRC lexicon sentiment results for the word stupid

Each occurrence of the word “stupid” therefore adds 1 to the count of negative words.

Because we are analyzing sentiment across the full set of movie reviews, our NRC results represent the overall count of words associated with each of the ten NRC sentiment categories. For instance, if the word “stupid” is tagged with a binary value of 1 for the negative category, and it appears 50 times in the dataset, the NRC contribution to the negative count is 50. If no other negative words appear, the total negative count would be 50.
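The counting logic in this example can be written out as a small worked calculation. The two-word lexicon slice and the count for “wonderful” below are hypothetical values chosen for illustration; only the figure of 50 occurrences of “stupid” comes from the example above.

```r
# Hypothetical two-word slice of an NRC-style lexicon (1 = word carries the label).
flags <- matrix(c(1, 0,
                  0, 1),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("stupid", "wonderful"),
                                c("negative", "positive")))

# Made-up corpus-wide counts for each word.
word_counts <- c(stupid = 50, wonderful = 120)

# Each occurrence adds its 0/1 flag to the category total.
category_totals <- colSums(flags * word_counts[rownames(flags)])
category_totals  # negative = 50, positive = 120
```

Because the flags are binary, a category total is simply the number of occurrences of all words tagged with that category.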

Other lexicons, such as the AFINN lexicon created by Finn Årup Nielsen, rate words on a scale from -5 to 5. AFINN focuses on the overall positive (up to +5) or negative (down to -5) tone of words rather than assigning them to specific emotion categories as the NRC lexicon does. The term “stupid” receives the following score in the AFINN lexicon:

**Figure 3**: AFINN lexicon sentiment results for the word stupid

Since AFINN assigns numeric values, it captures the intensity of sentiment by summing those values across the corpus. For example, if “stupid” has a score of -2, “dumb” has a score of -3, and “miserable” has a score of -5, and they each appeared once and are the only negative words present throughout all reviews, the cumulative negative score across reviews would be -10. Positive words are summed in the same way to produce an overall positive score.
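The same arithmetic can be checked in base R. The three negative scores are the hypothetical values from the example above, and the positive word and its score are added here purely for illustration; none of them should be read as a lookup of the real AFINN lexicon.

```r
# Illustrative scores: the negative values follow the example in the text,
# and "great" = 3 is an assumed value added to show the positive side.
word_scores <- c(stupid = -2, dumb = -3, miserable = -5, great = 3)

# Tokens as they might appear across all reviews.
tokens <- c("stupid", "dumb", "miserable", "great", "great")
scores <- word_scores[tokens]

negative_total <- sum(scores[scores < 0])  # -2 + -3 + -5 = -10
positive_total <- sum(scores[scores > 0])  #  3 +  3      =   6
```

Unlike the NRC counts, a single strongly negative word here moves the total as much as several mildly negative ones.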

To display the sentiment values produced by each lexicon, we will create bar charts using ggplot2, a package that supports clear and flexible data visualizations (Wickham, 2016).
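A bar chart of this kind might be built as follows. This is a sketch with placeholder totals rather than our results; the data frame and column names are assumptions.

```r
library(ggplot2)

# Placeholder totals; in practice these would come from the lexicon output.
sentiment_totals <- data.frame(
  sentiment = c("negative", "positive"),
  total     = c(50, 120)
)

p <- ggplot(sentiment_totals, aes(x = sentiment, y = total, fill = sentiment)) +
  geom_col(show.legend = FALSE) +  # one bar per sentiment, height = total
  labs(title = "Sentiment totals (placeholder data)",
       x = "Sentiment", y = "Total") +
  theme_minimal()

print(p)
```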

Results

Data Cleaning

Cleaning the data was a critical first step in our analysis. This process transformed our raw dataset into a consistent format from which we could draw meaningful findings. Because the dataset is so large, we reduced it to a subset of 2,000 randomly selected reviews. Before cleaning, we loaded the review text into a corpus and converted all text to lowercase, since string matching in R is case-sensitive. We inspected the same arbitrarily chosen review, the 80th, after each cleaning step to confirm that the step had been applied correctly.

**Figure 4**: Review 80 Lowercase

The next step in our data cleaning process was to convert special characters, like @, /, ], and $, to spaces and to remove all apostrophes. Then, we removed numbers and punctuation.
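The effect of these steps on a single string can be sketched in base R. The example sentence is invented, and this is only an approximation of what the corresponding tm transformations do inside the corpus.

```r
review <- "it's a 10/10 film @home [spoilers] worth $15!"

# Special characters become spaces so that neighboring words stay separate.
for (ch in c("@", "/", "]", "$")) {
  review <- gsub(ch, " ", review, fixed = TRUE)
}

review <- gsub("'", "", review, fixed = TRUE)      # drop apostrophes: "it's" -> "its"
review <- gsub("[0-9]+", "", review)               # remove numbers
review <- gsub("[[:punct:]]", "", review)          # remove remaining punctuation
review <- trimws(gsub("[[:space:]]+", " ", review))  # collapse leftover whitespace

review  # "its a film home spoilers worth"
```

Replacing characters such as “/” with a space rather than deleting them matters for text like “and/or,” which would otherwise be fused into the single token “andor.”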

**Figure 5**: Review 80 Special Characters, Apostrophes, Numbers, and Punctuation

Next, we used stopwords("english") to remove frequently used words that don’t express much meaning, such as “I,” “me,” and “am.”

**Figure 6**: Review 80 Stopwords

Then, we tried stemming to reduce words to their root form by removing common suffixes such as “es,” “ed,” and “ing.” However, after examining the results (as seen below), we decided not to include this step because it stripped several words of meaning, producing nonsensical words such as “stori” and “coupl.”

**Figure 7**: Review 80 Stemming Exploration

Rather than stemming, we used lemmatization from the textstem package to convert each word into its dictionary form. This step took different variations of the same word and transformed them into one term that retains its meaning.
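The difference between the two approaches can be seen by running both on a handful of words. The sketch below assumes the SnowballC package (a common stemming backend, and not necessarily the one used in our exploration) and the textstem package are installed; the word list is chosen for illustration.

```r
library(SnowballC)  # wordStem(): Porter-style suffix stripping
library(textstem)   # lemmatize_words(): dictionary-based lemmas

words <- c("story", "couple", "running", "ran")

stems  <- wordStem(words)         # rule-based; can yield non-words such as "stori"
lemmas <- lemmatize_words(words)  # looks each word up in a lemma dictionary

data.frame(word = words, stemmed = stems, lemmatized = lemmas)
```

Comparing the two output columns side by side is a quick way to judge which normalization keeps words recognizable for a sentiment lexicon, since lexicons are keyed on dictionary forms.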

**Figure 8**: Review 80 Lemmatization

Word Cloud

As the next step in our exploratory analysis, we used a word cloud to identify popular terms that appear across reviews. This visual representation of word frequency lets us check the effectiveness of our data cleaning process and provides an overview of the text data. From the sizes of the words, we see that reviewers frequently use terms like “film,” “movie,” “see,” “good,” and “make” in their reviews.
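One way to go from a cleaned corpus to a word cloud is sketched below on a three-document toy corpus. It assumes the tm and wordcloud packages are installed; the documents and the threshold of 2 are placeholders (the report uses a minimum frequency of 250 on 2,000 reviews).

```r
library(tm)
library(wordcloud)

# Toy corpus standing in for the cleaned reviews.
docs   <- c("good film good story", "bad movie good film", "see film make movie")
corpus <- VCorpus(VectorSource(docs))

# Rows are terms, columns are documents; row sums give corpus-wide counts.
tdm       <- TermDocumentMatrix(corpus)
word_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

set.seed(1)  # word placement is random, so fix the layout
wordcloud(words = names(word_freq), freq = word_freq,
          min.freq = 2,          # toy threshold; 250 was used for the real data
          random.order = FALSE)  # most frequent words in the center
```

Note that `as.matrix()` creates a dense matrix, which is the step that becomes memory-hungry on a very large corpus.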

**Figure 9**: IMDB Word Cloud

Sentiment Analysis

To conduct the sentiment analysis, we first used AFINN from among the lexicons available in syuzhet, which include afinn, bing, nrc, and the package’s default syuzhet lexicon. The get_sentiment() function handles tokenization and cleaning itself, so we passed the text variable directly as input and saved the scores as new columns in the data. In addition to the aggregated numerical scores, we assigned categorical scores of -1, 0, and 1 to represent negative, neutral, and positive sentiment, respectively.
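The scoring step might look like the following sketch, which assumes the syuzhet package is installed. The three reviews and the column names are invented for illustration, and using `sign()` for the categorical score is one simple way to obtain -1, 0, and 1, not necessarily the approach taken in the project.

```r
library(syuzhet)

# Invented reviews; `text` is an assumed column name.
reviews <- data.frame(
  text = c("A wonderful, brilliant film.",
           "Stupid plot and terrible acting.",
           "The film is two hours long."),
  stringsAsFactors = FALSE
)

# One AFINN score per review: the sum of the scores of its matched words.
reviews$afinn_score <- get_sentiment(reviews$text, method = "afinn")

# Collapse each score to -1 (negative), 0 (neutral), or 1 (positive).
reviews$afinn_label <- sign(reviews$afinn_score)

# Corpus-level totals of the kind summarized in a weighted score chart.
positive_total <- sum(reviews$afinn_score[reviews$afinn_score > 0])
negative_total <- sum(reviews$afinn_score[reviews$afinn_score < 0])
```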

**Figure 10**: Overall AFINN Weighted Sentiment Score Chart

Figure 10 summarizes the overall AFINN sentiment for the IMDb movie reviews as weighted scores, in which each matched word contributes its AFINN value. The positive score is substantially higher than the negative score, indicating that while negative emotions are present across reviews, positive feelings dominate overall.

The overall NRC sentiment for the IMDb reviews is shown as a frequency bar chart, where each bar shows the number of times words associated with a given NRC sentiment appear in the review corpus. Words associated with positive sentiment appear most often in the review corpus, followed by words associated with negative sentiment. Furthermore, complex positive emotions like trust, anticipation, and joy appear more often than complex negative emotions like fear, sadness, anger, and disgust. Surprise is another frequently occurring emotion, but it can be associated with negative, positive, or neutral sentiments.
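Category frequencies of this kind can be obtained with syuzhet’s get_nrc_sentiment() function. The following is a minimal sketch on two invented reviews, assuming the syuzhet package is installed; it is not the project’s actual code.

```r
library(syuzhet)

# Invented reviews standing in for the sampled review text.
texts <- c("A wonderful film full of joy and hope.",
           "A stupid, terrible story that made me angry.")

# One row per review, one column per NRC category (eight emotions plus
# negative and positive); each cell counts the matched words.
nrc <- get_nrc_sentiment(texts)

# Summing down the columns gives corpus-wide frequencies for a bar chart.
nrc_totals <- colSums(nrc)
sort(nrc_totals, decreasing = TRUE)
```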

**Figure 11**:  Overall NRC sentiment frequency chart

To conclude, the NRC results complement the AFINN findings by showing that IMDb movie reviews are more positive than negative and are emotionally rich, with a stronger presence of trust- and anticipation-related feelings than of anger or disgust.

Discussion

These findings are interesting because they open up new questions that can be explored with further analysis. While the dominance of positive emotions suggests an overall positive attitude, and could be read as a sign that the film industry succeeds in engaging its audience, it might not be the whole picture, and further analysis is required before drawing that conclusion. This snapshot instead raises further questions: are people who like a movie more inclined to post a review than people who dislike it? People might refrain from posting a comment when they are unimpressed, which would leave negative feelings underrepresented. It could also be that much of the audience is unaware of, or has no access to, IMDb’s review forms and is therefore not represented in the emotional ranges seen in this dataset. Additionally, this simple unclustered sentiment analysis does not uncover the relationship between a movie’s IMDb rating, the number of reviews it receives, and the dominant emotion within those reviews. It would be interesting to understand whether movies with lower ratings generally have fewer reviews posted than movies with higher ratings.

Further analysis could examine whether the success of a movie depends on the number of people reviewing and discussing it, or vice versa, with less successful movies tending to attract fewer reviews. To explore this hypothesis, another analysis could regress box office revenue on the emotions expressed in reviews and the number of reviews, while controlling for when each review was posted. To conclude, this overarching sentiment analysis across the sampled reviews in the IMDb dataset demonstrates an overall positive emotional expression; however, further analysis is required to support its accuracy.

References

Amazon Web Services. (n.d.). What is Sentiment Analysis? https://aws.amazon.com/what-is/sentiment-analysis/

George Washington University Libraries. (2025, August 15). An introduction to text analysis and text mining. GW LibGuides. https://libguides.gwu.edu/textanalysis

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org

Yadollahi, A., Shahraki, A. G., & Zaiane, O. R. (2017). Current state of text sentiment analysis from opinion to emotion mining. ACM Computing Surveys (CSUR), 50(2), 1-33.