This R Notebook contains my data visualization of one of the first widely available sentiment analysis datasets: the movie review dataset (movie-pang02.csv), obtained from http://boston.lti.cs.cmu.edu/classes/95-865-K/HW/HW3/. More information about the dataset can be found at http://www.cs.cornell.edu/people/pabo/movie-review-data/. The paper associated with the dataset can be found at https://www.cs.cornell.edu/home/llee/papers/cutsent.pdf. This is the first part of my data analysis of the dataset.
To visualize the dataset, I have made the following:
Note: The graphs are interactive. Place the mouse cursor on the graphs to see more information.
Note: If you cannot see the wordclouds, refresh the browser.
The movie review text was cleaned by removing punctuation, numbers, and stopwords. The 1500 most frequent words were then extracted and used to make the wordcloud below, and the 200 most frequent of those were used for the word frequency plot. The wordcloud below was created using an image file of a film projector as a mask.
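The cleaning and counting steps described above can be sketched roughly as follows. This is a minimal base R sketch, not the notebook's actual code: the two sample reviews and the short stopword list are made up for illustration, and the real analysis would read the review text from movie-pang02.csv and likely use a fuller stopword list.

```r
# Hypothetical sample reviews standing in for the text column of movie-pang02.csv
reviews <- c(
  "The film was great, a truly great story with 2 fine leads!",
  "A dull plot; the story dragged for 120 minutes."
)

# Small illustrative stopword list (an assumption; the notebook's list is not shown)
stopwords <- c("the", "a", "was", "with", "for", "of", "and")

clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)   # remove punctuation
  x <- gsub("[[:digit:]]+", " ", x)  # remove numbers
  x
}

# Split into words, then drop empty strings and stopwords
words <- unlist(strsplit(clean_text(reviews), "[[:space:]]+"))
words <- words[nzchar(words) & !(words %in% stopwords)]

# Count words and keep the most frequent ones (the notebook keeps 1500)
top_n <- 1500
counts <- sort(table(words), decreasing = TRUE)
word_freq <- data.frame(
  word = names(counts),
  freq = as.integer(counts),
  stringsAsFactors = FALSE
)
word_freq <- head(word_freq, top_n)
```

The resulting word_freq data frame (one row per word, with its count) has the shape that wordcloud and bar plot functions typically expect.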
The sentiment lexicon used for the wordcloud and barplot below is the bing lexicon from the tidytext package. The lexicon was created by Bing Liu and collaborators (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html). Words with a frequency greater than 100 were used to make the sentiment analysis visualizations.
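As a rough illustration of this step, the sketch below joins word counts to a sentiment lexicon and keeps words with a frequency above 100. To stay self-contained it uses a tiny hand-made lexicon with the same two columns (word, sentiment) as tidytext::get_sentiments("bing"). The word counts are hypothetical, and the notebook's own code may differ.

```r
# Hypothetical word counts standing in for the cleaned review vocabulary
word_freq <- data.frame(
  word = c("good", "bad", "film", "great", "boring", "worst"),
  freq = c(950, 420, 3100, 610, 85, 140),
  stringsAsFactors = FALSE
)

# Tiny stand-in lexicon (an assumption for illustration).
# With tidytext installed this could be: bing <- tidytext::get_sentiments("bing")
bing <- data.frame(
  word = c("good", "great", "bad", "boring", "worst"),
  sentiment = c("positive", "positive", "negative", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Inner join: keeps only words that appear in the lexicon
scored <- merge(word_freq, bing, by = "word")

# Keep words with frequency greater than 100, most frequent first
scored <- scored[scored$freq > 100, ]
scored <- scored[order(-scored$freq), ]

# Total frequency per sentiment class
totals <- aggregate(freq ~ sentiment, data = scored, FUN = sum)
```

Here "film" drops out because it is not in the lexicon, and "boring" drops out because its count is not above 100.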
wordcloud2 does not yet have the ability to make a comparison cloud. To make the wordcloud above, I used this Stack Overflow answer by Cristián Rodríguez (https://stackoverflow.com/questions/49908939/comparison-cloud-in-wordcloud2-package-in-r). The blue words have negative sentiment and the black words have positive sentiment.
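The general idea behind this kind of workaround can be sketched as below: build one colour per word from its sentiment and pass that vector to the color argument of wordcloud2(). The data are hypothetical, and this is only an approximation of the linked answer, not a copy of it. It assumes the colour vector is in the same row order as the data.

```r
# Hypothetical sentiment-scored word counts
scored <- data.frame(
  word = c("good", "great", "bad", "worst"),
  freq = c(950, 610, 420, 140),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# One colour per word: blue for negative, black for positive
scored$color <- ifelse(scored$sentiment == "negative", "blue", "black")

# Build the widget only if the wordcloud2 package is installed
if (requireNamespace("wordcloud2", quietly = TRUE)) {
  cloud <- wordcloud2::wordcloud2(
    data = scored[, c("word", "freq")],
    color = scored$color
  )
  # Evaluate `cloud` in a notebook or interactive session to render it
}
```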
Links for the dataset that was used in this data visualization.
I used the following libraries and read the following book while making the graphs above.
I also read the following websites, blogs, and book.