For this assignment, we chose the Movie Review Data (http://www.cs.cornell.edu/people/pabo/movie-review-data/) from Cornell University, specifically the sentiment polarity dataset “polarity dataset v2.0,” introduced by Pang and Lee at ACL 2004. This dataset contains 1000 positive and 1000 negative processed reviews. Classification of the movie reviews is based on each review's explicit numerical or star rating: three and a half stars or more is considered positive in a five-star system, and B or above is considered positive in a letter-grade system. Based on this method, 1000 positive and 1000 negative text reviews were collected.
We downloaded the “polarity dataset v2.0” archive from the link above. The dataset comes as a gzipped tar file, which we extracted with 7-Zip and placed in our local working directory.
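The same step can be scripted in R instead (a sketch; the archive filename is assumed from the download page, and base R's untar() handles the gzipped tar directly):

url <- "http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz"
download.file(url, destfile = "review_polarity.tar.gz", mode = "wb")
# extract into the working directory; creates review_polarity/txt_sentoken/pos and /neg
untar("review_polarity.tar.gz", exdir = getwd())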
Loading all the packages needed for this analysis.
library(tm)
library(SnowballC)
library(ggplot2)
library(RColorBrewer)
library(wordcloud)
library(bigmemory)
library(randomForest)
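If any of these packages are missing, they can be installed once from CRAN before loading:

install.packages(c("tm", "SnowballC", "ggplot2", "RColorBrewer", "wordcloud", "bigmemory", "randomForest"))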
First we loaded all the text files from the positive movie review folder into R by creating a path from our local working directory. Then we created a corpus for the positive reviews using the tm package, and transformed and tidied the corpus with the tm and SnowballC packages.
pos <- file.path(getwd(), "review_polarity/txt_sentoken/pos")
head(dir(pos))
## [1] "cv000_29590.txt" "cv001_18431.txt" "cv002_15918.txt" "cv003_11664.txt"
## [5] "cv004_11636.txt" "cv005_29443.txt"
pos_corpus <- Corpus(DirSource(pos))
simple_words <- function(x) removeWords(x, stopwords("SMART"))
funs <- list(stripWhitespace, content_transformer(tolower), simple_words, removePunctuation, removeNumbers, stemDocument, PlainTextDocument)
pos_corpus <- tm_map(pos_corpus, FUN = tm_reduce, tmFuns = funs)
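For readers unfamiliar with tm_reduce, a step-by-step sketch applies one transformation per tm_map call (the order shown here is one reasonable choice, not necessarily the order in which tm_reduce folds the list):

# step-by-step alternative to tm_reduce (sketch)
pos_corpus2 <- Corpus(DirSource(pos))
pos_corpus2 <- tm_map(pos_corpus2, content_transformer(tolower))
pos_corpus2 <- tm_map(pos_corpus2, simple_words)       # drop SMART stopwords
pos_corpus2 <- tm_map(pos_corpus2, removePunctuation)
pos_corpus2 <- tm_map(pos_corpus2, removeNumbers)
pos_corpus2 <- tm_map(pos_corpus2, stripWhitespace)
pos_corpus2 <- tm_map(pos_corpus2, stemDocument)       # SnowballC stemmer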
# inspect positive corpus (document 408); note that double-bracket indexing
# (pos_corpus[[408]]) would print just the document text, while single brackets
# return a one-document sub-corpus, hence the list structure in the output below
writeLines(as.character(pos_corpus[408]))
## list(list(content = c(" admit dislik film initi ", " certian isnt everi tast sheer tortur sit restless mood ", " mood absolut incred ", " favorit movi shooin ani year ", "perhap big turnoff mani film unconvent ", "id hardpress compar ani film iv veri veri artsi incred slow amaz work beauti ", " view realiz film follow act structur didnt ani sort structur ", " act serv set charact sort ", " exist moreso set mood tension restles perhap feel boredom ", " shatter intens violenc encompass movi ", " major film extend battl scene intercut flashback voiceov ",
## " artsier element detract action add succeed briefli viewer peek mind soldier onli sudden yank back realiti battl resum ", " battl scene amaz onli save privat ryan opinion ", "theyr brutal horrifi time beauti due amaz cinematographi ", " act immers brilliant haunt ani film ", " onli problem dure act ", "malick littl long film start ", " initi scene consist soldier experienc nearedenlik paradis awol preper battl effect necessari hint pretenti sink ", " film tad artsi begin lot peopl dislik movi probabl gave becaus ",
## " final act effect wind film problem persist bit long pretenti time ", " sequenc soldier devast note wife anoth main charact kill noth short incred ", " perform phenomin ", " standout nick nolt newcom jim caviezel nomin oscar ", "nolt rivet intens colonel charg oper ", " charact hard reckless live men nolt manag evok sympathi ", "caviezel forev question natur war place ani deeper hell ", " absolut perfect genuin sympathet sincer strong ", " restrict relat small role sean penn veri good compani pessimist seargent ",
## " stack save privat ryan favorit movi ryan ", " realli hard compar differ wont spielberg film impact ", "howev compar standpoint qualiti easili whi someon thin red line abov ryan ", " high recommend consid war movi made ", " thin red line filmmak incred high order ", " slight fault easili offset sheer brillianc ", " real shame tank box offic film unconvent power thoughtprovok dont veri "), meta = list(author = character(0), datetimestamp = list(sec = 12.8337249755859, min = 59, hour = 22, mday = 22,
## mon = 10, year = 115, wday = 0, yday = 325, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0))))
## list()
## list()
Creating a data frame with the most frequently occurring words in positive reviews.
common_terms <- DocumentTermMatrix(pos_corpus)
common_terms
## <<DocumentTermMatrix (documents: 1000, terms: 22273)>>
## Non-/sparse entries: 232587/22040413
## Sparsity : 99%
## Maximal term length: 61
## Weighting : term frequency (tf)
word_freq_pos <- sort(colSums(as.table(common_terms)), decreasing=FALSE)
pos_words_freq <- data.frame(words=names(word_freq_pos), freq=word_freq_pos)
row.names(pos_words_freq) <- NULL
tail(pos_words_freq)
## words freq
## 22268 scene 1350
## 22269 time 1525
## 22270 make 1650
## 22271 charact 2052
## 22272 movi 3120
## 22273 film 6145
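Converting the sparse document-term matrix to a dense table can be memory-hungry; the slam package (a dependency of tm) computes the same column sums with sparse arithmetic. A sketch:

library(slam)
word_freq_pos_sparse <- sort(col_sums(common_terms), decreasing = FALSE)  # same totals, no densifying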
Creating visuals for the positive movie reviews. Using the ggplot2, wordcloud and RColorBrewer packages, we created a bar chart of the most frequent words (frequency >= 900) and a word cloud of the top 100 words. We also examined term associations for “flim” (a frequent misspelling of “film” in the reviews) and “story” using findAssocs().
findAssocs(common_terms, c("flim", "story"), 0.5)
## $flim
## awoken blanco cokeaddict curlyhair exdrug foghorn
## 1.00 1.00 1.00 1.00 1.00 1.00
## goreg insultthrow jailterm leghorn leguiziamo populus
## 1.00 1.00 1.00 1.00 1.00 1.00
## pulpi ratso tangodanc trashcan scent guzman
## 1.00 1.00 1.00 1.00 0.89 0.82
## rican pacino carlito backtrack brigant dunno
## 0.82 0.80 0.75 0.71 0.71 0.71
## gail lid lyndon neckdeep remors residenti
## 0.71 0.71 0.71 0.71 0.71 0.71
## rizzo penelop puerto palma soprano underr
## 0.71 0.67 0.67 0.66 0.58 0.51
##
## $story
## anachron carbon clement cliffhanger cyberspac eggar
## 0.71 0.71 0.71 0.71 0.71 0.71
## feature frewer herc hydra iiit ixii
## 0.71 0.71 0.71 0.71 0.71 0.71
## latura longoverdu meadow menkendavid ment musker
## 0.71 0.71 0.71 0.71 0.71 0.71
## olympian pocohonta retool satyr slystyl trundl
## 0.71 0.71 0.71 0.71 0.71 0.71
## permiss
## 0.53
pos_freq100 <- subset(pos_words_freq, freq >= 100)
pos_freq500 <- subset(pos_words_freq, freq >= 500)
head(pos_freq500, 10)
## words freq
## 22218 befor 504
## 22219 direct 509
## 22220 day 514
## 22221 action 523
## 22222 someth 532
## 22223 act 533
## 22224 dont 538
## 22225 audienc 541
## 22226 live 543
## 22227 set 543
pos_freq900 <- subset(pos_words_freq, freq >= 900)
ggplot(data = pos_freq900, aes(x = words, y = freq, fill = words)) + geom_bar(stat = "identity") + theme(legend.position = "none") + theme(axis.text.x = element_text(angle = 10, vjust = .9, hjust = .6)) + ggtitle("Graph 1: Words that appear the most in positive movie reviews.") + ylab("Frequency")
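One refinement worth noting (a sketch, not the original figure): reorder() sorts the bars by frequency so the ranking is visible at a glance.

ggplot(data = pos_freq900, aes(x = reorder(words, freq), y = freq, fill = words)) + geom_bar(stat = "identity") + theme(legend.position = "none") + theme(axis.text.x = element_text(angle = 10, vjust = .9, hjust = .6)) + xlab("words") + ylab("Frequency")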
dtm2 <- as.matrix(common_terms)
freq <- colSums(dtm2)
freq <- sort(freq, decreasing = TRUE)
words <- names(freq)
wordcloud(words[1:100], freq[1:100], colors = brewer.pal(5, "Dark2"))
We loaded all the text files from the negative movie review folder into R by creating a path from our local working directory. Then we created a corpus for the negative reviews using the tm package, and transformed and tidied the corpus with the tm and SnowballC packages.
neg <- file.path(getwd(), "review_polarity/txt_sentoken/neg")
head(neg)
## [1] "C:/Users/Nabila/Documents/CLASS IS 607/Assignment 11-12/review_polarity/txt_sentoken/neg"
head(dir(neg))
## [1] "cv000_29416.txt" "cv001_19502.txt" "cv002_17424.txt" "cv003_12683.txt"
## [5] "cv004_12641.txt" "cv005_29357.txt"
neg_corpus <- Corpus(DirSource(neg))
funs <- list(stripWhitespace, removePunctuation, removeNumbers, content_transformer(tolower), simple_words, stemDocument, PlainTextDocument)
neg_corpus <- tm_map(neg_corpus, FUN = tm_reduce, tmFuns = funs)
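Since the positive and negative pipelines are otherwise identical, a small helper function (hypothetical; not part of the original script) would avoid the duplication:

# hypothetical helper: build and clean a corpus from a folder of reviews
build_corpus <- function(dir_path, tm_funs) {
  corpus <- Corpus(DirSource(dir_path))
  tm_map(corpus, FUN = tm_reduce, tmFuns = tm_funs)
}
# neg_corpus <- build_corpus(neg, funs)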
# inspect negative corpus (document 2)
writeLines(as.character(neg_corpus[2]))
## list(list(content = c(" happi bastard quick movi review", "damn yk bug ", " head start movi star jami lee curti anoth baldwin brother william time stori regard crew tugboat desert russian tech ship strang kick power back ", "littl power ", " gore bring action sequenc virus feel veri empti movi flash substanc ", " whi crew realli middl nowher origin ship big pink flashi thing hit mir cours whi donald sutherland stumbl drunken ", " hey chase peopl robot ", " act averag curti ", " kick work halloween h ",
## "sutherland wast baldwin act baldwin cours ", " real star stan winston robot design schnazzi cgi occasion good gore shot pick someon brain ", " robot bodi part realli turn movi ", "otherwis pretti sunken ship movi "), meta = list(author = character(0), datetimestamp = list(sec = 31.0727589130402, min = 59, hour = 22, mday = 22, mon = 10, year = 115, wday = 0, yday = 325, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0))))
## list()
## list()
Creating a data frame with the most frequently occurring words in negative reviews.
common_terms2 <- DocumentTermMatrix(neg_corpus)
common_terms2
## <<DocumentTermMatrix (documents: 1000, terms: 20111)>>
## Non-/sparse entries: 208766/19902234
## Sparsity : 99%
## Maximal term length: 53
## Weighting : term frequency (tf)
word_freq_neg <- sort(colSums(as.table(common_terms2)), decreasing=FALSE)
neg_word_freq <- data.frame(word=names(word_freq_neg), freq=word_freq_neg)
row.names(neg_word_freq) <- NULL
tail(neg_word_freq)
## word freq
## 20106 onli 1349
## 20107 time 1377
## 20108 make 1526
## 20109 charact 1798
## 20110 movi 3760
## 20111 film 4971
Creating visuals for the negative movie reviews. Using the ggplot2, wordcloud and RColorBrewer packages, we created a bar chart of the most frequent words (frequency >= 900) and a word cloud of the top 100 words. As before, we examined term associations for “flim” and “story”; in the negative corpus, “flim” has no associations above the 0.5 threshold.
findAssocs(common_terms2, c("flim", "story"), 0.5)
## $flim
## numeric(0)
##
## $story
## exotic gradeschool halfstat heroine inveigl
## 0.71 0.71 0.71 0.71 0.71
## rangoon screenarkin westernerperil mors arkin
## 0.71 0.71 0.71 0.69 0.67
## doo
## 0.62
neg_freq900 <- subset(neg_word_freq, freq >= 900)
ggplot(data = neg_freq900, aes(x = word, y = freq, fill = word)) + geom_bar(stat = "identity") + scale_fill_brewer(palette = "Set3") + theme(legend.position = "none") + theme(axis.text.x = element_text(angle = 10, vjust = .9, hjust = .6)) + ggtitle("Graph 2: Words that appear the most in negative movie reviews.") + ylab("Frequency")
neg_freq100 <- subset(neg_word_freq, freq >= 100)
neg_freq500 <- subset(neg_word_freq, freq >= 500)
head(neg_freq500, 10)
## word freq
## 20063 script 504
## 20064 minut 506
## 20065 actual 508
## 20066 role 508
## 20067 becom 514
## 20068 guy 516
## 20069 someth 526
## 20070 find 528
## 20071 audienc 533
## 20072 interest 536
wordcloud(neg_freq100$word, neg_freq100$freq, max.words=100, colors=brewer.pal(8, "Set1"))
Creating a word matrix associated with positive or negative reviews:
Step 1. Transform the most frequent positive words (appearing 900 or more times) into a data frame.
Step 2. Transform the most frequent negative words (appearing 900 or more times) into a data frame.
Step 3. Adjust the dimensions of the data frames for both the negative and positive reviews.
Step 4. Combine both data frames to create the “movie review matrix.”
dtm <- t(common_terms)
class(dtm)
## [1] "TermDocumentMatrix" "simple_triplet_matrix"
dtm3 <- as.big.matrix(x=as.matrix(dtm))
str(dtm3)
## Formal class 'big.matrix' [package "bigmemory"] with 1 slot
## ..@ address:<externalptr>
M <- as.matrix(dtm3)
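Aside: at this scale the round trip through bigmemory is not strictly necessary; converting the term-document matrix directly yields the same dense matrix (a sketch):

M_direct <- as.matrix(dtm)        # dense terms-by-documents matrix, no bigmemory
identical(dim(M_direct), dim(M))  # sanity check on the round trip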
write.csv(pos_freq900, file= "csvpos.csv")
pos_key <- read.csv("csvpos.csv", stringsAsFactors = FALSE)
pos_key1 <- pos_key$words
M2 <- t(M)
M3<- M2[,colnames(M2) %in% pos_key1]
head(M3)
## charact end film good life make movi onli perform play scene
## character(0) 0 3 6 2 0 2 0 0 1 0 1
## character(0) 1 0 8 1 2 1 2 0 1 3 0
## character(0) 0 1 2 1 0 2 2 2 0 2 1
## character(0) 0 0 7 0 0 3 5 3 5 2 1
## character(0) 2 1 3 0 0 0 6 1 0 2 2
## character(0) 1 1 1 3 4 0 0 0 2 0 0
## stori time veri work year
## character(0) 0 2 1 0 0
## character(0) 1 0 0 0 1
## character(0) 1 2 1 2 2
## character(0) 2 4 0 2 1
## character(0) 0 0 0 0 0
## character(0) 7 3 0 1 1
M4 <- as.data.frame(M3)
head(M4)
## charact end film good life make movi onli perform play
## character(0) 0 3 6 2 0 2 0 0 1 0
## character(0).1 1 0 8 1 2 1 2 0 1 3
## character(0).2 0 1 2 1 0 2 2 2 0 2
## character(0).3 0 0 7 0 0 3 5 3 5 2
## character(0).4 2 1 3 0 0 0 6 1 0 2
## character(0).5 1 1 1 3 4 0 0 0 2 0
## scene stori time veri work year
## character(0) 1 0 2 1 0 0
## character(0).1 0 1 0 0 0 1
## character(0).2 1 1 2 1 2 2
## character(0).3 1 2 4 0 2 1
## character(0).4 2 0 0 0 0 0
## character(0).5 0 7 3 0 1 1
M4$class <- "Positive"
write.csv(neg_freq900, file = "csvtest.csv")
negkey <- read.csv("csvtest.csv", stringsAsFactors = FALSE)
negkey
## X word freq
## 1 20100 plot 962
## 2 20101 stori 977
## 3 20102 bad 1087
## 4 20103 play 1137
## 5 20104 good 1169
## 6 20105 scene 1289
## 7 20106 onli 1349
## 8 20107 time 1377
## 9 20108 make 1526
## 10 20109 charact 1798
## 11 20110 movi 3760
## 12 20111 film 4971
keyword3<- negkey$word
str(keyword3)
## chr [1:12] "plot" "stori" "bad" "play" "good" "scene" ...
neg_dtm <- t(common_terms2)
dtm4 <- as.big.matrix(x=as.matrix(neg_dtm))
str(dtm4)
## Formal class 'big.matrix' [package "bigmemory"] with 1 slot
## ..@ address:<externalptr>
M5 <- as.matrix(dtm4)
M6 <- t(M5)
M7 <- M6[,colnames(M6) %in% keyword3]
M8 <- as.data.frame(M7)
M8$class <- "Negative"
colnames(M8)
## [1] "bad" "charact" "film" "good" "make" "movi" "onli"
## [8] "play" "plot" "scene" "stori" "time" "class"
colnames(M4)
## [1] "charact" "end" "film" "good" "life" "make" "movi"
## [8] "onli" "perform" "play" "scene" "stori" "time" "veri"
## [15] "work" "year" "class"
M4$bad <- 0
M4$film <- 0
M4$plot <- 0
M8$end<-0
M8$life<-0
M8$perform<-0
M8$veri<- 0
M8$work<-0
M8$year <- 0
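Padding the missing columns by hand works here, but a more general sketch (hypothetical, not the original code) zero-fills whichever columns differ between the two frames:

# add zero-filled columns for any term present in one frame but not the other
for (col in setdiff(colnames(M8), colnames(M4))) M4[[col]] <- 0
for (col in setdiff(colnames(M4), colnames(M8))) M8[[col]] <- 0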
totalM <- rbind.data.frame(M4, M8)
totalM[995:1020, 4:17]
## good life make movi onli perform play scene stori time
## character(0)994 5 1 0 4 0 3 0 0 0 0
## character(0)995 4 0 2 22 1 1 5 0 0 4
## character(0)996 0 0 0 0 0 0 0 2 2 1
## character(0)997 0 0 4 5 2 1 3 4 2 2
## character(0)998 1 0 1 3 0 2 1 0 0 2
## character(0)999 0 7 0 8 0 1 0 4 4 0
## character(0)1000 2 0 7 7 1 0 1 2 0 0
## character(0)1001 1 0 0 5 0 0 0 0 1 1
## character(0)1002 0 0 2 3 2 0 1 0 0 2
## character(0)1003 1 0 0 0 2 0 0 1 1 1
## character(0)1004 1 0 0 3 2 0 1 1 3 2
## character(0)1005 2 0 3 2 0 0 0 4 2 3
## character(0)1006 1 0 2 1 2 0 2 0 0 2
## character(0)1007 0 0 0 7 1 0 4 2 0 1
## character(0)1008 0 0 2 3 3 0 2 0 1 4
## character(0)1009 0 0 0 8 1 0 1 2 1 3
## character(0)1010 0 0 2 4 2 0 2 1 0 1
## character(0)1011 1 0 0 5 2 0 4 1 0 2
## character(0)1012 0 0 0 0 2 0 1 2 1 1
## character(0)1013 0 0 3 27 3 0 6 7 2 0
## character(0)1014 0 0 2 1 0 0 1 2 0 3
## character(0)1015 0 0 1 5 1 0 0 0 1 1
## character(0)1016 2 0 0 16 4 0 1 0 1 1
## character(0)1017 0 0 0 10 1 0 0 1 2 2
## character(0)1018 1 0 3 3 1 0 1 0 0 1
## character(0)1019 0 0 0 4 3 0 0 2 2 0
## veri work year class
## character(0)994 1 1 0 Positive
## character(0)995 6 0 3 Positive
## character(0)996 1 1 0 Positive
## character(0)997 0 0 0 Positive
## character(0)998 0 1 1 Positive
## character(0)999 0 1 0 Positive
## character(0)1000 0 0 0 Negative
## character(0)1001 0 0 0 Negative
## character(0)1002 0 0 0 Negative
## character(0)1003 0 0 0 Negative
## character(0)1004 0 0 0 Negative
## character(0)1005 0 0 0 Negative
## character(0)1006 0 0 0 Negative
## character(0)1007 0 0 0 Negative
## character(0)1008 0 0 0 Negative
## character(0)1009 0 0 0 Negative
## character(0)1010 0 0 0 Negative
## character(0)1011 0 0 0 Negative
## character(0)1012 0 0 0 Negative
## character(0)1013 0 0 0 Negative
## character(0)1014 0 0 0 Negative
## character(0)1015 0 0 0 Negative
## character(0)1016 0 0 0 Negative
## character(0)1017 0 0 0 Negative
## character(0)1018 0 0 0 Negative
## character(0)1019 0 0 0 Negative
Writing a CSV file with the movie review matrix.
write.csv(totalM, file = "C:/Users/Nabila/Documents/GitHub/Class-IS607/Week 11-12 Assignment/movies_review_matrix.csv")
Testing with a random forest model
Step 1. Clean up the row names of the data frame.
Step 2. Convert the data frame column “class” from character to factor.
Step 3. Fit a random forest and produce the confusion matrix and variable importance scores.
Surprisingly, the word “film” has by far the highest importance score, followed by “end” and “veri.”
row.names(totalM) <-NULL
WM <- totalM
str(WM)
## 'data.frame': 2000 obs. of 19 variables:
## $ charact: num 0 1 0 0 2 1 3 4 1 1 ...
## $ end : num 3 0 1 0 1 1 1 0 0 0 ...
## $ film : num 0 0 0 0 0 0 0 0 0 0 ...
## $ good : num 2 1 1 0 0 3 2 1 0 0 ...
## $ life : num 0 2 0 0 0 4 0 1 2 0 ...
## $ make : num 2 1 2 3 0 0 7 3 2 0 ...
## $ movi : num 0 2 2 5 6 0 4 2 0 0 ...
## $ onli : num 0 0 2 3 1 0 2 0 0 0 ...
## $ perform: num 1 1 0 5 0 2 4 4 1 0 ...
## $ play : num 0 3 2 2 2 0 2 1 1 0 ...
## $ scene : num 1 0 1 1 2 0 2 0 1 1 ...
## $ stori : num 0 1 1 2 0 7 2 1 1 1 ...
## $ time : num 2 0 2 4 0 3 1 2 0 1 ...
## $ veri : num 1 0 1 0 0 0 4 0 0 0 ...
## $ work : num 0 0 2 2 0 1 2 0 0 1 ...
## $ year : num 0 1 2 1 0 1 3 1 0 0 ...
## $ class : chr "Positive" "Positive" "Positive" "Positive" ...
## $ bad : num 0 0 0 0 0 0 0 0 0 0 ...
## $ plot : num 0 0 0 0 0 0 0 0 0 0 ...
# change the class column from character to factor
WM$class <- factor(WM$class)
rf <- randomForest(class ~ ., data = WM, mtry = 4, ntree = 400)
rf
##
## Call:
## randomForest(formula = class ~ ., data = WM, mtry = 4, ntree = 400)
## Type of random forest: classification
## Number of trees: 400
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 1.45%
## Confusion matrix:
## Negative Positive class.error
## Negative 986 14 0.014
## Positive 15 985 0.015
rf$confusion
## Negative Positive class.error
## Negative 986 14 0.014
## Positive 15 985 0.015
rf$importance
## MeanDecreaseGini
## charact 4.175632
## end 94.193930
## film 354.549988
## good 4.194192
## life 38.501848
## make 4.062108
## movi 8.773636
## onli 4.244540
## perform 69.468478
## play 3.943697
## scene 4.706530
## stori 3.565164
## time 4.610355
## veri 91.417814
## work 59.106525
## year 76.457462
## bad 79.199003
## plot 85.699544
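The 1.45% error rate above is the out-of-bag estimate on the full dataset. As an additional check (a sketch; the 70/30 split and the seed are arbitrary choices, not part of the original analysis), the model could be evaluated on a held-out test set:

set.seed(607)                                   # arbitrary seed for reproducibility
train_idx <- sample(nrow(WM), 0.7 * nrow(WM))   # 70% of rows for training
rf_holdout <- randomForest(class ~ ., data = WM[train_idx, ], mtry = 4, ntree = 400)
pred <- predict(rf_holdout, newdata = WM[-train_idx, ])
table(predicted = pred, actual = WM$class[-train_idx])  # test-set confusion matrix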