For this assignment, we chose the Movie Review Data (http://www.cs.cornell.edu/people/pabo/movie-review-data/) from Cornell University, specifically the sentiment polarity dataset “polarity dataset v2.0,” which was introduced by Pang and Lee at ACL 2004. The dataset contains 1000 positive and 1000 negative processed reviews. Reviews are classified by their explicit numerical or star rating: in a five-star system, three and a half stars or more is considered positive; in a letter-grade system, B or above is considered positive. Using this method, 1000 positive and 1000 negative text reviews were collected.
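The labelling rule described above can be written as a small sketch. The helper names and the ordering of the letter-grade scale are our own assumptions; only the two positive thresholds (3.5 stars out of 5, grade B) come from the dataset description.

```r
# Hypothetical helpers: only the positive thresholds come from the text above.
is_positive_stars <- function(stars) {
  # Five-star scale: 3.5 stars or more counts as positive.
  stars >= 3.5
}

# Assumed ordering of a letter-grade scale, from worst to best.
letter_grades <- c("F", "D-", "D", "D+", "C-", "C", "C+",
                   "B-", "B", "B+", "A-", "A", "A+")

is_positive_grade <- function(grade) {
  # Compare positions on the ordered scale: "B" or above counts as positive.
  match(grade, letter_grades) >= match("B", letter_grades)
}

is_positive_stars(c(3.5, 3, 5))
is_positive_grade(c("B", "B-", "A-"))
```

A review that is not positive under this rule is not automatically negative; the text above only states the positive side of the rule.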

Pre-coding steps

Studying the positive movie reviews.

Studying the negative movie reviews.

Studying the positive and negative movie reviews.

We downloaded the archive “polarity dataset v2.0” from the link above. The dataset is distributed as a gzip-compressed tar file, which we extracted with 7-Zip and placed in our local working directory.
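The extraction step can also be done from within R. This is a minimal sketch, assuming the archive was saved as review_polarity.tar.gz in the working directory (the file name and the helper `unpack_reviews` are hypothetical) and that the files should land in a review_polarity folder, matching the paths used later.

```r
# Hypothetical helper: extract a .tar.gz archive into a target folder.
unpack_reviews <- function(archive,
                           exdir = file.path(getwd(), "review_polarity")) {
  dir.create(exdir, recursive = TRUE, showWarnings = FALSE)
  untar(archive, exdir = exdir)  # utils::untar reads gzip-compressed tarballs
  exdir
}

archive <- "review_polarity.tar.gz"  # assumed local file name
if (file.exists(archive)) {
  root <- unpack_reviews(archive)
  print(head(list.files(root, recursive = TRUE)))
}
```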

Loading the packages needed for this analysis.
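The `library()` calls only load packages that are already installed. A minimal sketch for installing any that are missing is shown here; the helper `missing_packages` is our own, not part of any package.

```r
needed <- c("tm", "SnowballC", "ggplot2", "RColorBrewer",
            "wordcloud", "bigmemory", "randomForest")

# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- nzchar(vapply(pkgs, function(p) system.file(package = p),
                             character(1)))
  pkgs[!installed]
}

to_install <- missing_packages(needed)
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)  # one-time step; needs access to a CRAN mirror
}
```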

library(tm) 
library(SnowballC)
library(ggplot2)
library(RColorBrewer)
library(wordcloud)
library(bigmemory)
library(randomForest)


First, we loaded all the text files in the positive movie review folder into R by building a path from our local working directory. We then created a corpus for the positive reviews with the tm package and used tm and SnowballC to transform and tidy it.

pos <- file.path(getwd(), "review_polarity/txt_sentoken/pos")
head(dir(pos))
## [1] "cv000_29590.txt" "cv001_18431.txt" "cv002_15918.txt" "cv003_11664.txt"
## [5] "cv004_11636.txt" "cv005_29443.txt"
pos_corpus <- Corpus(DirSource(pos))

simple_words <- function(x) removeWords(x, stopwords("SMART"))

funs <- list(stripWhitespace, content_transformer(tolower), simple_words, removePunctuation, removeNumbers, stemDocument, PlainTextDocument)

pos_corpus <- tm_map(pos_corpus, FUN = tm_reduce, tmFuns = funs)

#inspect positive corpus (document 408)
writeLines(as.character(pos_corpus[408]))
## list(list(content = c(" admit dislik film initi ", " certian isnt everi tast sheer tortur sit restless mood ", " mood absolut incred ", " favorit movi shooin ani year ", "perhap big turnoff mani film unconvent ", "id hardpress compar ani film iv veri veri artsi incred slow amaz work beauti ", " view realiz film follow act structur didnt ani sort structur ", " act serv set charact sort ", " exist moreso set mood tension restles perhap feel boredom ", " shatter intens violenc encompass movi ", " major film extend battl scene intercut flashback voiceov ", 
## " artsier element detract action add succeed briefli viewer peek mind soldier onli sudden yank back realiti battl resum ", " battl scene amaz onli save privat ryan opinion ", "theyr brutal horrifi time beauti due amaz cinematographi ", " act immers brilliant haunt ani film ", " onli problem dure act ", "malick littl long film start ", " initi scene consist soldier experienc nearedenlik paradis awol preper battl effect necessari hint pretenti sink ", " film tad artsi begin lot peopl dislik movi probabl gave becaus ", 
## " final act effect wind film problem persist bit long pretenti time ", " sequenc soldier devast note wife anoth main charact kill noth short incred ", " perform phenomin ", " standout nick nolt newcom jim caviezel nomin oscar ", "nolt rivet intens colonel charg oper ", " charact hard reckless live men nolt manag evok sympathi ", "caviezel forev question natur war place ani deeper hell ", " absolut perfect genuin sympathet sincer strong ", " restrict relat small role sean penn veri good compani pessimist seargent ", 
## " stack save privat ryan favorit movi ryan ", " realli hard compar differ wont spielberg film impact ", "howev compar standpoint qualiti easili whi someon thin red line abov ryan ", " high recommend consid war movi made ", " thin red line filmmak incred high order ", " slight fault easili offset sheer brillianc ", " real shame tank box offic film unconvent power thoughtprovok dont veri "), meta = list(author = character(0), datetimestamp = list(sec = 12.8337249755859, min = 59, hour = 22, mday = 22, 
##     mon = 10, year = 115, wday = 0, yday = 325, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0))))
## list()
## list()
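The order of the entries in `funs` matters. As far as tm's documentation describes it, `tm_reduce` folds its list of transformations from right to left, so the last entry (`PlainTextDocument` here) would run first and `stripWhitespace` last. The base R sketch below only illustrates a right-to-left fold with two toy functions of our own; it does not call tm.

```r
# Two toy steps that record the order in which they run (not tm functions).
tag_first <- function(x) paste(x, "| ran list item 1")
tag_last  <- function(x) paste(x, "| ran list item 2")

# Fold a list of functions from right to left, as Reduce(..., right = TRUE) does.
fold_right <- function(x, funs) {
  Reduce(function(f, acc) f(acc), funs, x, right = TRUE)
}

fold_right("doc", list(tag_first, tag_last))
```

The positive and negative corpora in this report use `funs` lists with the same steps in a different order, so their cleaned text may differ slightly, for example in which stop words are removed.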


Creating a data frame with the most frequent words in positive reviews.

common_terms <- DocumentTermMatrix(pos_corpus)
common_terms
## <<DocumentTermMatrix (documents: 1000, terms: 22273)>>
## Non-/sparse entries: 232587/22040413
## Sparsity           : 99%
## Maximal term length: 61
## Weighting          : term frequency (tf)
word_freq_pos <- sort(colSums(as.table(common_terms)), decreasing=FALSE)
pos_words_freq <- data.frame(words=names(word_freq_pos), freq=word_freq_pos) 
row.names(pos_words_freq) <- NULL

tail(pos_words_freq)
##         words freq
## 22268   scene 1350
## 22269    time 1525
## 22270    make 1650
## 22271 charact 2052
## 22272    movi 3120
## 22273    film 6145
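Summing the columns of the document-term matrix gives one total count per word across all reviews. The same idea can be sketched in base R on three toy sentences (invented for illustration, not taken from the dataset):

```r
# Toy reviews, invented for illustration.
docs <- c("the film is a good film", "a good movie", "film and movie")

# Split into words and count how often each word occurs overall.
tokens <- unlist(strsplit(docs, "\\s+"))
counts <- sort(table(tokens), decreasing = FALSE)

# Same layout as pos_words_freq: one row per word, ascending by frequency.
toy_freq <- data.frame(words = names(counts),
                       freq = as.integer(counts),
                       stringsAsFactors = FALSE)
tail(toy_freq, 3)
```

Because the sort is ascending, `tail()` shows the most frequent words, which is why the report inspects the end of the data frame.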


Creating visuals for the positive movie reviews. Using the ggplot2, wordcloud and RColorBrewer packages, we created a bar chart of the most frequent words (frequency >= 900) and a word cloud of the 100 most frequent words.

findAssocs(common_terms, c("flim", "story"), 0.5)
## $flim
##      awoken      blanco  cokeaddict   curlyhair      exdrug     foghorn 
##        1.00        1.00        1.00        1.00        1.00        1.00 
##       goreg insultthrow    jailterm     leghorn  leguiziamo     populus 
##        1.00        1.00        1.00        1.00        1.00        1.00 
##       pulpi       ratso   tangodanc    trashcan       scent      guzman 
##        1.00        1.00        1.00        1.00        0.89        0.82 
##       rican      pacino     carlito   backtrack     brigant       dunno 
##        0.82        0.80        0.75        0.71        0.71        0.71 
##        gail         lid      lyndon    neckdeep      remors   residenti 
##        0.71        0.71        0.71        0.71        0.71        0.71 
##       rizzo     penelop      puerto       palma     soprano      underr 
##        0.71        0.67        0.67        0.66        0.58        0.51 
## 
## $story
##    anachron      carbon     clement cliffhanger   cyberspac       eggar 
##        0.71        0.71        0.71        0.71        0.71        0.71 
##     feature      frewer        herc       hydra        iiit        ixii 
##        0.71        0.71        0.71        0.71        0.71        0.71 
##      latura  longoverdu      meadow menkendavid        ment      musker 
##        0.71        0.71        0.71        0.71        0.71        0.71 
##    olympian   pocohonta      retool       satyr     slystyl      trundl 
##        0.71        0.71        0.71        0.71        0.71        0.71 
##     permiss 
##        0.53
pos_freq100 <- subset(pos_words_freq, freq >= 100)
pos_freq500 <- subset(pos_words_freq, freq >= 500)
head(pos_freq500, 10)
##         words freq
## 22218   befor  504
## 22219  direct  509
## 22220     day  514
## 22221  action  523
## 22222  someth  532
## 22223     act  533
## 22224    dont  538
## 22225 audienc  541
## 22226    live  543
## 22227     set  543
pos_freq900 <- subset(pos_words_freq, freq >= 900)

ggplot(data = pos_freq900, aes(x= words, y =freq, fill=words)) + geom_bar(stat="identity")  + theme(legend.position="none") + theme(axis.text.x  = element_text(angle=10, vjust=.9, hjust=.6)) + ggtitle("Graph 1: Words that appear the most in positive movie reviews.") + ylab("Frequency")   

dtm2 <- as.matrix (common_terms)
freq <- colSums(dtm2)

freq <- sort(freq, decreasing = TRUE)
words <-names(freq)

wordcloud(words[1:100], freq [1:100], colors=brewer.pal(5, "Dark2"))
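By default ggplot2 orders a character or factor axis alphabetically, which makes the bar chart harder to read as a ranking. A possible variation, sketched on three rows copied from the frequency table above, reorders the words by frequency before plotting; the plotting part runs only if ggplot2 is available.

```r
# Three rows copied from the frequency output above.
toy <- data.frame(words = c("charact", "film", "movi"),
                  freq  = c(2052, 6145, 3120),
                  stringsAsFactors = FALSE)

# Turn words into a factor whose levels follow ascending frequency.
toy$words <- reorder(toy$words, toy$freq)
levels(toy$words)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = words, y = freq)) +
    geom_bar(stat = "identity") +
    coord_flip() +          # horizontal bars keep the word labels readable
    ylab("Frequency")
  print(p)
}
```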


We loaded all the text files in the negative movie review folder into R by building a path from our local working directory. We then created a corpus for the negative reviews with the tm package and used tm and SnowballC to transform and tidy it.

neg <- file.path(getwd(), "review_polarity/txt_sentoken/neg")
head(neg)
## [1] "C:/Users/Nabila/Documents/CLASS  IS 607/Assignment 11-12/review_polarity/txt_sentoken/neg"
head(dir(neg))
## [1] "cv000_29416.txt" "cv001_19502.txt" "cv002_17424.txt" "cv003_12683.txt"
## [5] "cv004_12641.txt" "cv005_29357.txt"
neg_corpus <- Corpus(DirSource(neg))

funs <- list(stripWhitespace, removePunctuation, removeNumbers, content_transformer(tolower), simple_words, stemDocument, PlainTextDocument)
neg_corpus <- tm_map(neg_corpus, FUN = tm_reduce, tmFuns = funs)

#inspect (neg_corpus [2])
writeLines(as.character(neg_corpus[2]))
## list(list(content = c(" happi bastard quick movi review", "damn yk bug ", " head start movi star jami lee curti anoth baldwin brother william time stori regard crew tugboat desert russian tech ship strang kick power back ", "littl power ", " gore bring action sequenc virus feel veri empti movi flash substanc ", " whi crew realli middl nowher origin ship big pink flashi thing hit mir cours whi donald sutherland stumbl drunken ", " hey chase peopl robot ", " act averag curti ", " kick work halloween h ", 
## "sutherland wast baldwin act baldwin cours ", " real star stan winston robot design schnazzi cgi occasion good gore shot pick someon brain ", " robot bodi part realli turn movi ", "otherwis pretti sunken ship movi "), meta = list(author = character(0), datetimestamp = list(sec = 31.0727589130402, min = 59, hour = 22, mday = 22, mon = 10, year = 115, wday = 0, yday = 325, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0))))
## list()
## list()


Creating a data frame with the most frequent words in negative reviews.

common_terms2 <- DocumentTermMatrix(neg_corpus)
common_terms2
## <<DocumentTermMatrix (documents: 1000, terms: 20111)>>
## Non-/sparse entries: 208766/19902234
## Sparsity           : 99%
## Maximal term length: 53
## Weighting          : term frequency (tf)
word_freq_neg <- sort(colSums(as.table(common_terms2)), decreasing=FALSE)


neg_word_freq <- data.frame(word=names(word_freq_neg), freq=word_freq_neg) 
row.names(neg_word_freq) <- NULL

tail(neg_word_freq)
##          word freq
## 20106    onli 1349
## 20107    time 1377
## 20108    make 1526
## 20109 charact 1798
## 20110    movi 3760
## 20111    film 4971


Creating visuals for the negative movie reviews. Using the ggplot2, wordcloud and RColorBrewer packages, we created a bar chart of the most frequent words (frequency >= 900) and a word cloud of the 100 most frequent words.

findAssocs(common_terms2, c("flim", "story"), 0.5)
## $flim
## numeric(0)
## 
## $story
##         exotic    gradeschool       halfstat        heroine        inveigl 
##           0.71           0.71           0.71           0.71           0.71 
##        rangoon    screenarkin westernerperil           mors          arkin 
##           0.71           0.71           0.71           0.69           0.67 
##            doo 
##           0.62
neg_freq900 <- subset(neg_word_freq, freq >= 900)

ggplot(data =  neg_freq900, aes(x= word, y =freq, fill=word)) + geom_bar(stat="identity") + scale_fill_brewer(palette = "Set3") + theme(legend.position="none") + theme(axis.text.x  = element_text(angle=10, vjust=.9, hjust=.6)) + ggtitle("Graph 2: Words that appear the most in negative movie reviews.") +ylab("Frequency") 

neg_freq100 <- subset(neg_word_freq, freq >= 100)

neg_freq500 <- subset(neg_word_freq, freq >= 500)
head(neg_freq500, 10)
##           word freq
## 20063   script  504
## 20064    minut  506
## 20065   actual  508
## 20066     role  508
## 20067    becom  514
## 20068      guy  516
## 20069   someth  526
## 20070     find  528
## 20071  audienc  533
## 20072 interest  536
wordcloud(neg_freq100$word, neg_freq100$freq, max.words=100, colors=brewer.pal(8, "Set1"))
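As we understand it, `findAssocs` reports the correlation between the counts of the queried term and the counts of every other term, keeping those at or above the limit. The base R sketch below mimics that on a toy matrix (invented counts; `assoc_with` is our own helper, not tm's implementation). It shows why a term that occurs in a single document, such as the apparent misspelling "flim", correlates perfectly with any other word found only in that document.

```r
# Toy document-term counts: rows are documents, columns are terms.
m <- matrix(c(1, 1, 3,
              0, 0, 2,
              0, 0, 4,
              0, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4),
                            c("flim", "rareword", "film")))

# Hypothetical helper: correlations of `term` with all other terms, above a limit.
assoc_with <- function(m, term, corlimit) {
  others <- m[, colnames(m) != term, drop = FALSE]
  r <- round(cor(m[, term], others)[1, ], 2)
  sort(r[r >= corlimit], decreasing = TRUE)
}

assoc_with(m, "flim", 0.5)
```

Since the corpora are stemmed, the frequent terms appear as "film" and "stori" in the matrices, so querying those stems would probably be closer to the intended comparison than "flim" and "story".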


Creating a word matrix associated with positive or negative reviews

Step 1. Transform the most frequent positive words (frequency >= 900) into a data frame.

Step 2. Transform the most frequent negative words (frequency >= 900) into a data frame.

Step 3. Adjust the dimensions of the data frames for the positive and negative reviews so that they share the same columns.

Step 4. Combine both data frames to create the “movies review matrix”.
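Steps 3 and 4 can be sketched on two toy data frames (invented counts). The helper `add_missing_cols` is our own; it adds only the columns a data frame lacks, so counts in columns both frames already share, such as "film", are left untouched.

```r
# Toy stand-ins for the positive and negative keyword data frames.
pos_df <- data.frame(film = c(6, 8), end = c(3, 0), class = "Positive",
                     stringsAsFactors = FALSE)
neg_df <- data.frame(film = c(4, 1), bad = c(2, 5), class = "Negative",
                     stringsAsFactors = FALSE)

# Hypothetical helper: add only the missing columns, filled with `fill`,
# then return the columns in a common order.
add_missing_cols <- function(df, all_cols, fill = 0) {
  for (col in setdiff(all_cols, names(df))) df[[col]] <- fill
  df[, all_cols, drop = FALSE]
}

all_cols <- union(names(pos_df), names(neg_df))
combined <- rbind(add_missing_cols(pos_df, all_cols),
                  add_missing_cols(neg_df, all_cols))
combined
```

Filling a column with 0 records that the word was not tracked for that class, which is not the same as the word never occurring there; taking the real counts from the other class's document-term matrix would avoid that gap.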

dtm <- t(common_terms)
class(dtm)
## [1] "TermDocumentMatrix"    "simple_triplet_matrix"
dtm3 <- as.big.matrix(x=as.matrix(dtm))
str(dtm3)
## Formal class 'big.matrix' [package "bigmemory"] with 1 slot
##   ..@ address:<externalptr>
M <- as.matrix(dtm3)


write.csv(pos_freq900, file= "csvpos.csv")
pos_key <- read.csv("csvpos.csv", stringsAsFactors = FALSE)
pos_key1 <- pos_key$words

M2 <- t(M)

M3<- M2[,colnames(M2) %in% pos_key1]
head(M3)
##              charact end film good life make movi onli perform play scene
## character(0)       0   3    6    2    0    2    0    0       1    0     1
## character(0)       1   0    8    1    2    1    2    0       1    3     0
## character(0)       0   1    2    1    0    2    2    2       0    2     1
## character(0)       0   0    7    0    0    3    5    3       5    2     1
## character(0)       2   1    3    0    0    0    6    1       0    2     2
## character(0)       1   1    1    3    4    0    0    0       2    0     0
##              stori time veri work year
## character(0)     0    2    1    0    0
## character(0)     1    0    0    0    1
## character(0)     1    2    1    2    2
## character(0)     2    4    0    2    1
## character(0)     0    0    0    0    0
## character(0)     7    3    0    1    1
M4 <- as.data.frame(M3)
head(M4)
##                charact end film good life make movi onli perform play
## character(0)         0   3    6    2    0    2    0    0       1    0
## character(0).1       1   0    8    1    2    1    2    0       1    3
## character(0).2       0   1    2    1    0    2    2    2       0    2
## character(0).3       0   0    7    0    0    3    5    3       5    2
## character(0).4       2   1    3    0    0    0    6    1       0    2
## character(0).5       1   1    1    3    4    0    0    0       2    0
##                scene stori time veri work year
## character(0)       1     0    2    1    0    0
## character(0).1     0     1    0    0    0    1
## character(0).2     1     1    2    1    2    2
## character(0).3     1     2    4    0    2    1
## character(0).4     2     0    0    0    0    0
## character(0).5     0     7    3    0    1    1
M4$class <- "Positive"

write.csv(neg_freq900, file = "csvtest.csv")
negkey <- read.csv("csvtest.csv", stringsAsFactors = FALSE)
negkey
##        X    word freq
## 1  20100    plot  962
## 2  20101   stori  977
## 3  20102     bad 1087
## 4  20103    play 1137
## 5  20104    good 1169
## 6  20105   scene 1289
## 7  20106    onli 1349
## 8  20107    time 1377
## 9  20108    make 1526
## 10 20109 charact 1798
## 11 20110    movi 3760
## 12 20111    film 4971
keyword3<- negkey$word
str(keyword3)
##  chr [1:12] "plot" "stori" "bad" "play" "good" "scene" ...
neg_dtm <- t(common_terms2)
dtm4 <- as.big.matrix(x=as.matrix(neg_dtm))
str(dtm4)
## Formal class 'big.matrix' [package "bigmemory"] with 1 slot
##   ..@ address:<externalptr>
M5 <- as.matrix(dtm4)

M6 <- t(M5)
M7 <- M6[,colnames(M6) %in% keyword3]
M8 <- as.data.frame(M7)
M8$class <- "Negative"


colnames(M8)
##  [1] "bad"     "charact" "film"    "good"    "make"    "movi"    "onli"   
##  [8] "play"    "plot"    "scene"   "stori"   "time"    "class"
colnames(M4)
##  [1] "charact" "end"     "film"    "good"    "life"    "make"    "movi"   
##  [8] "onli"    "perform" "play"    "scene"   "stori"   "time"    "veri"   
## [15] "work"    "year"    "class"
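# Add the columns that exist only in the negative data frame.
# NOTE: M4 already has a "film" column, so the second line overwrites its counts with 0.
M4$bad <- 0
M4$film <- 0
M4$plot <- 0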

M8$end<-0
M8$life<-0
M8$perform<-0
M8$veri<- 0
M8$work<-0
M8$year <- 0

totalM <- rbind.data.frame(M4, M8)

totalM [995:1020,4:17]
##                  good life make movi onli perform play scene stori time
## character(0)994     5    1    0    4    0       3    0     0     0    0
## character(0)995     4    0    2   22    1       1    5     0     0    4
## character(0)996     0    0    0    0    0       0    0     2     2    1
## character(0)997     0    0    4    5    2       1    3     4     2    2
## character(0)998     1    0    1    3    0       2    1     0     0    2
## character(0)999     0    7    0    8    0       1    0     4     4    0
## character(0)1000    2    0    7    7    1       0    1     2     0    0
## character(0)1001    1    0    0    5    0       0    0     0     1    1
## character(0)1002    0    0    2    3    2       0    1     0     0    2
## character(0)1003    1    0    0    0    2       0    0     1     1    1
## character(0)1004    1    0    0    3    2       0    1     1     3    2
## character(0)1005    2    0    3    2    0       0    0     4     2    3
## character(0)1006    1    0    2    1    2       0    2     0     0    2
## character(0)1007    0    0    0    7    1       0    4     2     0    1
## character(0)1008    0    0    2    3    3       0    2     0     1    4
## character(0)1009    0    0    0    8    1       0    1     2     1    3
## character(0)1010    0    0    2    4    2       0    2     1     0    1
## character(0)1011    1    0    0    5    2       0    4     1     0    2
## character(0)1012    0    0    0    0    2       0    1     2     1    1
## character(0)1013    0    0    3   27    3       0    6     7     2    0
## character(0)1014    0    0    2    1    0       0    1     2     0    3
## character(0)1015    0    0    1    5    1       0    0     0     1    1
## character(0)1016    2    0    0   16    4       0    1     0     1    1
## character(0)1017    0    0    0   10    1       0    0     1     2    2
## character(0)1018    1    0    3    3    1       0    1     0     0    1
## character(0)1019    0    0    0    4    3       0    0     2     2    0
##                  veri work year    class
## character(0)994     1    1    0 Positive
## character(0)995     6    0    3 Positive
## character(0)996     1    1    0 Positive
## character(0)997     0    0    0 Positive
## character(0)998     0    1    1 Positive
## character(0)999     0    1    0 Positive
## character(0)1000    0    0    0 Negative
## character(0)1001    0    0    0 Negative
## character(0)1002    0    0    0 Negative
## character(0)1003    0    0    0 Negative
## character(0)1004    0    0    0 Negative
## character(0)1005    0    0    0 Negative
## character(0)1006    0    0    0 Negative
## character(0)1007    0    0    0 Negative
## character(0)1008    0    0    0 Negative
## character(0)1009    0    0    0 Negative
## character(0)1010    0    0    0 Negative
## character(0)1011    0    0    0 Negative
## character(0)1012    0    0    0 Negative
## character(0)1013    0    0    0 Negative
## character(0)1014    0    0    0 Negative
## character(0)1015    0    0    0 Negative
## character(0)1016    0    0    0 Negative
## character(0)1017    0    0    0 Negative
## character(0)1018    0    0    0 Negative
## character(0)1019    0    0    0 Negative

Writing a CSV file with the movie review matrix.

write.csv(totalM, file = "C:/Users/Nabila/Documents/GitHub/Class-IS607/Week 11-12 Assignment/movies_review_matrix.csv")
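By default `write.csv` also writes the row names, which come back as an extra "X" column when the file is read again, as in the `negkey` output above. A small sketch with a temporary file (the path and toy data are for illustration only) shows the `row.names = FALSE` variant:

```r
# Two rows copied from the negkey output above.
toy <- data.frame(word = c("plot", "stori"), freq = c(962, 977),
                  stringsAsFactors = FALSE)

path <- file.path(tempdir(), "toy_matrix.csv")  # illustrative location
write.csv(toy, file = path, row.names = FALSE)  # no row-name column is written

back <- read.csv(path, stringsAsFactors = FALSE)
names(back)
```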


Testing with a random forest model

Step 1. Clean up the row names of the data frame.

Step 2. Change the data frame column “class” from character to factor.

Step 3. Run randomForest and produce the confusion matrix and the variable importance scores.

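The word “film” has by far the highest importance score (MeanDecreaseGini), followed by “end” and “veri”. This is less surprising than it first appears: the “film” column was set to 0 for every positive review when the data frames were aligned, and “end”, “veri” and the other zero-filled columns are 0 for every negative review, so these columns effectively encode the class label. The very low OOB error rate should be read with that in mind.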

row.names(totalM) <-NULL
WM <- totalM

str(WM)
## 'data.frame':    2000 obs. of  19 variables:
##  $ charact: num  0 1 0 0 2 1 3 4 1 1 ...
##  $ end    : num  3 0 1 0 1 1 1 0 0 0 ...
##  $ film   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ good   : num  2 1 1 0 0 3 2 1 0 0 ...
##  $ life   : num  0 2 0 0 0 4 0 1 2 0 ...
##  $ make   : num  2 1 2 3 0 0 7 3 2 0 ...
##  $ movi   : num  0 2 2 5 6 0 4 2 0 0 ...
##  $ onli   : num  0 0 2 3 1 0 2 0 0 0 ...
##  $ perform: num  1 1 0 5 0 2 4 4 1 0 ...
##  $ play   : num  0 3 2 2 2 0 2 1 1 0 ...
##  $ scene  : num  1 0 1 1 2 0 2 0 1 1 ...
##  $ stori  : num  0 1 1 2 0 7 2 1 1 1 ...
##  $ time   : num  2 0 2 4 0 3 1 2 0 1 ...
##  $ veri   : num  1 0 1 0 0 0 4 0 0 0 ...
##  $ work   : num  0 0 2 2 0 1 2 0 0 1 ...
##  $ year   : num  0 1 2 1 0 1 3 1 0 0 ...
##  $ class  : chr  "Positive" "Positive" "Positive" "Positive" ...
##  $ bad    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ plot   : num  0 0 0 0 0 0 0 0 0 0 ...
#change the categorization to factor type
WM$class <- factor(WM$class)


rf = randomForest(class ~ ., data = WM, mtry =4 , ntree = 400)
rf
## 
## Call:
##  randomForest(formula = class ~ ., data = WM, mtry = 4, ntree = 400) 
##                Type of random forest: classification
##                      Number of trees: 400
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 1.45%
## Confusion matrix:
##          Negative Positive class.error
## Negative      986       14       0.014
## Positive       15      985       0.015
rf$confusion
##          Negative Positive class.error
## Negative      986       14       0.014
## Positive       15      985       0.015
rf$importance
##         MeanDecreaseGini
## charact         4.175632
## end            94.193930
## film          354.549988
## good            4.194192
## life           38.501848
## make            4.062108
## movi            8.773636
## onli            4.244540
## perform        69.468478
## play            3.943697
## scene           4.706530
## stori           3.565164
## time            4.610355
## veri           91.417814
## work           59.106525
## year           76.457462
## bad            79.199003
## plot           85.699544
