Sentiment Movie Review Dataset

About this R notebook

This R Notebook contains my data visualization of one of the first widely available sentiment analysis datasets: the movie review dataset (movie-pang02.csv), which I obtained from http://boston.lti.cs.cmu.edu/classes/95-865-K/HW/HW3/. More information about the dataset can be found at http://www.cs.cornell.edu/people/pabo/movie-review-data/. The paper associated with the dataset can be found at https://www.cs.cornell.edu/home/llee/papers/cutsent.pdf. This is the first part of my analysis of the dataset.

To visualize the dataset, I have made the following:

A wordcloud and word frequency plot of the cleaned text
A sentiment wordcloud and word frequency plot of the cleaned text

Note: The graphs are interactive. Place the mouse cursor on the graphs to see more information.

Note: If you cannot see the wordclouds, refresh the browser.

A first look at the dataset

Wordcloud

The movie review text was cleaned by removing punctuation, numbers, and stopwords. The 1500 most frequent words were then extracted and used to make the wordcloud below, and the top 200 of those words were used for the word frequency plot. The wordcloud was shaped using an image file of a film projector as a mask.
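The cleaning steps described above can be sketched roughly as follows with library(tm). This is only an illustration under stated assumptions, not the notebook's actual code: the three toy reviews stand in for the text column of movie-pang02.csv, the variable names are made up, and the lowercasing step is an assumption (the text above only mentions punctuation, numbers, and stopwords).

```r
library(tm)

# Toy reviews standing in for the review text in movie-pang02.csv (hypothetical data).
reviews <- c(
  "The film was great, a truly great film of 1999!",
  "A boring plot; the film dragged for 120 minutes.",
  "Great acting and a great story."
)

# Build a corpus and apply the cleaning steps one at a time.
corpus <- VCorpus(VectorSource(reviews))
corpus <- tm_map(corpus, content_transformer(tolower))      # assumption: lowercase first
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop English stopwords
corpus <- tm_map(corpus, stripWhitespace)                    # tidy leftover spaces

# Count how often each remaining word occurs across all reviews.
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
word_freq <- data.frame(word = names(freq), freq = as.numeric(freq),
                        stringsAsFactors = FALSE)

# Keep the most frequent words: up to 1500 for the wordcloud, 200 for the plot.
top_words <- head(word_freq, 1500)
top_plot  <- head(word_freq, 200)
```

A data frame with a word column followed by a frequency column is the shape that wordcloud2 expects as input, so top_words could be passed to it directly.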

Word Frequency

Sentiment Analysis

The sentiment lexicon used for the wordcloud and barplot below is bing from library(tidytext). The lexicon was created by Bing Liu and collaborators (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html). Only words with a frequency greater than 100 were used to make the sentiment analysis visualizations.
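As a minimal sketch of this step, word counts can be joined to the bing lexicon and then filtered by frequency. The word_freq table and its counts below are invented for illustration and the variable names are hypothetical; only get_sentiments("bing") and the frequency threshold of 100 come from the description above.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts standing in for the counts from the cleaned reviews.
word_freq <- tibble(
  word = c("great", "boring", "film", "bad", "wonderful"),
  n    = c(250, 120, 900, 80, 101)
)

# Keep only words that appear in the bing lexicon, which labels each word
# as "positive" or "negative", then keep words that occur more than 100 times.
sentiment_words <- word_freq %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  filter(n > 100)

print(sentiment_words)
```

Words that are not in the lexicon (here "film") drop out at the join, and words at or below the threshold (here "bad") drop out at the filter.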

Sentiment Analysis Wordcloud

wordcloud2 cannot yet make a comparison cloud. To make the wordcloud above, I used this answer given by Cristián Rodríguez: https://stackoverflow.com/questions/49908939/comparison-cloud-in-wordcloud2-package-in-r. The blue words have negative sentiment and the black words have positive sentiment.
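One way to get a two-colour cloud out of wordcloud2 is to pass a vector of colours, one per word, through its color argument. The sketch below illustrates that idea with made-up words and counts; it is not a copy of the linked answer or of the notebook's code, and the variable names are hypothetical.

```r
library(wordcloud2)

# Hypothetical sentiment-labelled word counts (invented for illustration).
sentiment_words <- data.frame(
  word      = c("great", "wonderful", "boring", "bad"),
  freq      = c(250, 140, 120, 110),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# One colour per word: blue for negative sentiment, black for positive.
word_colors <- ifelse(sentiment_words$sentiment == "negative", "blue", "black")

# wordcloud2 reads the word and frequency from the first two columns.
cloud <- wordcloud2(sentiment_words[, c("word", "freq")], color = word_colors)

# Evaluating `cloud` in a notebook chunk renders the interactive widget.
```

Because the colours are matched to words by position, the colour vector has to be built from the same data frame, in the same row order, that is passed to wordcloud2.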

Sentiment Analysis Barplot

References

Links for the dataset that was used in this data visualization.

Link to where I found a .csv file of the dataset: http://boston.lti.cs.cmu.edu/classes/95-865-K/HW/HW3/.
Link to the original authors of the dataset: http://www.cs.cornell.edu/people/pabo/movie-review-data/.
Link to the paper associated with the dataset: https://www.cs.cornell.edu/home/llee/papers/cutsent.pdf. (If this link is broken, the paper is Bo Pang and Lillian Lee, A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, Proceedings of the ACL, 2004.)

I used the following libraries and read the following book while making the graphs above.

library(tm): http://tm.r-forge.r-project.org/
library(tidytext): https://www.tidytextmining.com/
library(wordcloud2): https://github.com/Lchiffon/wordcloud2
library(highcharter): http://jkunst.com/highcharter/index.html
library(SnowballC): https://CRAN.R-project.org/package=SnowballC
library(dplyr): https://dplyr.tidyverse.org/
require(reshape2): https://github.com/hadley/reshape
library(htmltools): https://CRAN.R-project.org/package=htmltools
library(tidyr): https://tidyr.tidyverse.org/
R Markdown from RStudio: https://rmarkdown.rstudio.com/
Text Mining with R: A Tidy Approach by Julia Silge and David Robinson: https://www.tidytextmining.com/

I also read the following websites, blogs, and book.

ggplot2: https://ggplot2.tidyverse.org/index.html
ggpubr: http://www.sthda.com/english/rpkgs/ggpubr/
R-bloggers: https://www.r-bloggers.com/
ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham
Stack Overflow: https://stackoverflow.com/