Sentiment Movie Review Dataset

About this R notebook

This R Notebook contains Part 2 of my analysis of one of the first widely available sentiment analysis datasets: the movie review dataset (movie-pang02.csv), which I obtained from http://boston.lti.cs.cmu.edu/classes/95-865-K/HW/HW3/. More information about the dataset can be found at http://www.cs.cornell.edu/people/pabo/movie-review-data/. The paper associated with the dataset can be found at https://www.cs.cornell.edu/home/llee/papers/cutsent.pdf.

Looking at the dataset classes

Classification Models

I built classification models from the movie review dataset with two methods. One used the textmodel_nb() function from library(quanteda), and the other used the knn() function from library(class). I followed "Text Message Classification", a tutorial by Anish Singh Walial on R-bloggers, located at https://www.r-bloggers.com/text-message-classification/. Please see this tutorial if you are interested in the code that I used. After working through the tutorial with the movie review dataset, I used the same techniques to build models with knn() from library(class), using k = 3 and k = 5.
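
To make the knn() step concrete, here is a minimal sketch on made-up toy reviews. It is not the code behind the results in this notebook: the texts, the labels, and the helper functions tokenize() and to_dtm() are hypothetical, and a small base R word-count matrix stands in for the document-term matrix that the tutorial builds with its own preprocessing.

```r
library(class)  # provides knn()

# Toy training reviews (invented for illustration, not from movie-pang02.csv)
train_text <- c("great fun great cast",
                "wonderful moving story",
                "fun story great ending",
                "dull plot bad acting",
                "boring bad script",
                "bad plot boring ending waste")
train_label <- factor(c("Pos", "Pos", "Pos", "Neg", "Neg", "Neg"))
test_text <- c("great fun story", "bad boring plot")

# Hypothetical helper: lower-case the text and split on non-letters
tokenize <- function(x) strsplit(tolower(x), "[^a-z]+")

# The vocabulary comes from the training text only
vocab <- sort(unique(unlist(tokenize(train_text))))

# Hypothetical helper: one row per document, one column per vocabulary word
to_dtm <- function(texts) {
  t(vapply(tokenize(texts),
           function(tok) as.numeric(table(factor(tok, levels = vocab))),
           numeric(length(vocab))))
}

train_dtm <- to_dtm(train_text)
test_dtm <- to_dtm(test_text)

# knn() labels each test row by a majority vote of its k nearest training rows
pred_k3 <- knn(train = train_dtm, test = test_dtm, cl = train_label, k = 3)
pred_k5 <- knn(train = train_dtm, test = test_dtm, cl = train_label, k = 5)

# Confusion table in the same predicted-by-actual layout used in this notebook
test_label <- factor(c("Pos", "Neg"), levels = levels(train_label))
table(predicted = pred_k3, actual = test_label)
```

Changing k only changes how many neighbours vote. On real reviews the document-term matrix has thousands of sparse columns, so Euclidean distance between rows becomes much less informative than it is in this toy example.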

Note: I am still learning about text analysis and sentiment analysis using R and Python.

Results of Naive Bayes classifier

print(paste0("Confusion Matrix for the Naive Bayes classifier" ))

## [1] "Confusion Matrix for the Naive Bayes classifier"

ConTable

##          actual
## predicted Neg Pos
##       Neg 227  68
##       Pos  64 242

print(paste0("Accuracy of Naive Bayes classifier: ",AccuracyPercentNB ))

## [1] "Accuracy of Naive Bayes classifier: 78.0366056572379"
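
The accuracy figure can be reproduced from the confusion matrix above. The sketch below assumes accuracy was computed as the share of correct predictions (the diagonal) among all test reviews; the variable name accuracy_percent is my own and is not the notebook's AccuracyPercentNB.

```r
# Confusion matrix copied from the Naive Bayes output above
# (matrix() fills column by column: actual Neg first, then actual Pos)
ConTable <- matrix(c(227, 64, 68, 242), nrow = 2,
                   dimnames = list(predicted = c("Neg", "Pos"),
                                   actual = c("Neg", "Pos")))

# Correct predictions sit on the diagonal: 227 + 242 = 469 of 601 reviews
accuracy_percent <- 100 * sum(diag(ConTable)) / sum(ConTable)
print(paste0("Accuracy of Naive Bayes classifier: ", accuracy_percent))
```

The same two lines applied to the KNN tables give their accuracy values.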

Results of KNN model with k=3

print(paste0("Confusion Matrix for KNN with k = 3" ))

## [1] "Confusion Matrix for KNN with k = 3"

conKNN3

##          actual
## predicted Neg Pos
##       Neg 197 148
##       Pos  94 162

print(paste0("Accuracy of KNN (k=3): ",AccuracyPercentKNN3 ))

## [1] "Accuracy of KNN (k=3): 59.7337770382696"

Results of KNN model with k=5

print(paste0("Confusion Matrix for KNN with k = 5" ))

## [1] "Confusion Matrix for KNN with k = 5"

conKNN5

##          actual
## predicted Neg Pos
##       Neg 231 185
##       Pos  60 125

print(paste0("Accuracy of KNN (k=5): ",AccuracyPercentKNN5 ))

## [1] "Accuracy of KNN (k=5): 59.234608985025"
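
Accuracy alone hides how each model treats the two classes. As a rough sketch, the three confusion matrices above can be re-entered by hand and compared on per-class recall, that is, the share of actual Neg or actual Pos reviews that were predicted correctly. The helper make_cm() and the list names are hypothetical and not part of the original notebook code.

```r
# Hypothetical helper: build a 2 x 2 table from counts given row by row
make_cm <- function(counts) {
  matrix(counts, nrow = 2, byrow = TRUE,
         dimnames = list(predicted = c("Neg", "Pos"),
                         actual = c("Neg", "Pos")))
}

# Counts copied from the three confusion matrices printed above
cms <- list(
  naive_bayes = make_cm(c(227, 68, 64, 242)),
  knn_k3      = make_cm(c(197, 148, 94, 162)),
  knn_k5      = make_cm(c(231, 185, 60, 125))
)

# Recall per actual class: diagonal count divided by its column total
recall <- t(sapply(cms, function(cm) diag(cm) / colSums(cm)))
round(100 * recall, 1)
```

On these counts, Naive Bayes recovers about 78% of both classes, while KNN with k = 5 recovers roughly 79% of the negative reviews but only about 40% of the positive ones. Both KNN models lean toward predicting Neg, which is why their overall accuracy stays near 59%.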

References

Links for the dataset that was used.

Link to where I found a .csv file of the dataset: http://boston.lti.cs.cmu.edu/classes/95-865-K/HW/HW3/.
Link to the original authors of the dataset: http://www.cs.cornell.edu/people/pabo/movie-review-data/.
Link to the paper associated with the dataset: https://www.cs.cornell.edu/home/llee/papers/cutsent.pdf. (If this link is broken, the paper is Bo Pang and Lillian Lee, "A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts", Proceedings of the ACL, 2004.)

The following libraries and blog post were used to make the graph and models above.

library(highcharter) : http://jkunst.com/highcharter/index.html
library(dplyr) : https://dplyr.tidyverse.org/
require(reshape2) : https://github.com/hadley/reshape
library(htmltools) : https://CRAN.R-project.org/package=htmltools
library(quanteda) : Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. (2018) “quanteda: An R package for the quantitative analysis of textual data”. Journal of Open Source Software. 3(30), 774. https://doi.org/10.21105/joss.00774.
"Text Message Classification", a tutorial (September 7, 2017) by Anish Singh Walial : https://www.r-bloggers.com/text-message-classification/
library(class) : https://CRAN.R-project.org/package=class

I also read the following websites, blogs, and book.

R-bloggers : https://www.r-bloggers.com/
ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham