Text analysis of a simple set of documents

This R Markdown document picks a movie from IMDb, scrapes its reviews, and recommends which attributes to focus on when planning a sequel to the movie.

Loading required packages for the task.

rm(list=ls())
library(rvest)
library(text2vec)
library(data.table)
library(stringr)
library(tm)
library(RWeka)
library(tokenizers)
library(slam)
library(wordcloud)
library(ggplot2)
library(igraph)
library(textir)
library(qdap)


The reviews are scraped from the IMDb pages of the Bollywood movie “3 Idiots”. The text targeted as the corpus consists of the positive (“loved it”) and negative (“hated it”) reviews.

counts = c(0,10,20,30,40,50)
reviews = NULL
for (j in counts){
  url1 = paste0("http://www.imdb.com/title/tt1187043/reviews?filter=love;filter=love;start=",j)
  url2 = paste0("http://www.imdb.com/title/tt1187043/reviews?filter=hate;filter=hate;start=",j)
  
  page1 = read_html(url1)
  page2 = read_html(url2)
  reviews1 = html_text(html_nodes(page1,'#tn15content p'))
  reviews2 = html_text(html_nodes(page2,'#tn15content p'))
  
  reviews.positive = setdiff(reviews1, c("*** This review may contain spoilers ***","Add another review"))
  reviews.negative = setdiff(reviews2, c("*** This review may contain spoilers ***","Add another review"))
  
  reviews = c(reviews,reviews.positive,reviews.negative)
  
}

reviews = gsub("\n",' ',reviews)
reviews[1:2]
## [1] " I'm an IITian myself and hence, needless to say, was looking forward to this movie as it is based on the life in IITs. I went with my family and watched the first-day-first-show and was I pleased ? In one sentence - The best Bollywood movie ever ! If you are into movies, then this one is not to be missed unless you are the 4th idiot. I cried, I laughed and I enjoyed every moment of the 3 hours that I spent watching this gem. The songs that seemed mediocre before watching the movie, feel like perfect for the movie. They are so ideal for the situations that I just loved them.As you can understand, they could not bring the name of any IIT into the movie to avoid disputes. But trust me, whatever they have shown about the life in IITs, it's all true. Right from suicides, cold-hearted Professors to lack of encouragement and support for talents who can really bring serious change in this world. The movie successfully depicts how once inside, your life is only about meaningless grades, go get into the race before it's too late.Take my free advice and avoid any reviews or words from people who have already seen the movie. Don't let anyone spoil any scene from the movie. Rush to the nearest theater and experience the phenomenon for yourself. You'll bless me for this advice ! "                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
## [2] " It's hard for me to review this film, as I have not seen a huge number of Indian films--probably no more than a couple dozen. Most of the ones I've seen were wonderfully entertaining but I am far from an expert on Bollywood. Because of this, I have a hard time knowing how good this film is relative to other films from this country. So, consider this when you read this review. This may be among the very best India has to offer or it just seems that way to me.Like so many Indian films, this is a very, very long film--with a run-time of almost three hours. When a movie is bad or just okay, this can seem like forever, but since \"3 Idiots\" is a very, very good film I loved its length. And, like most films of the genre, it has its share of the usual singing and dancing so foreign to films from other countries. One thing you should know, however, is that defining the type of film it is isn't really easy. Much of it is a comedy, but it also has many poignant moments (keep the Kleenex nearby), some existential moments where they explore the meaning of life and work and it's also a tender film about friendship. And, as my daughter pointed out when she saw the film, she loved that the men in the movie are not afraid to cry--something you rarely see in western films.As for the plot, it's very long and involved and I could recount what occurs. But I don't want to spoil a single wonderful moment, so my advice is just sit back and watch--and if you give it a chance, I can almost guarantee you'll have a great time with this poignant and funny film. Wonderful and well worth your time--with a delightful script, wonderful characters and lots of moments that made me smile...and a few that brought me to tears. See this film. "
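
One detail of the filtering step above is worth a quick check: `setdiff` treats its inputs as sets, so besides dropping the two boilerplate strings it also collapses identical paragraphs into one. The sketch below uses a few made-up strings (not real IMDb data) to compare it with a plain `%in%` filter, which keeps duplicates. The variable names are illustrative only.

```r
# Hypothetical scraped paragraphs; the two boilerplate strings are the ones used above
raw = c("Great film.",
        "*** This review may contain spoilers ***",
        "Great film.",
        "Too long.",
        "Add another review")
boilerplate = c("*** This review may contain spoilers ***", "Add another review")

via_setdiff = setdiff(raw, boilerplate)    # set difference: duplicates collapse
via_filter  = raw[!raw %in% boilerplate]   # logical filter: duplicates survive

print(via_setdiff)
print(via_filter)
```

If repeated paragraphs should count more than once in the term frequencies, the `%in%` form may be the safer choice.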

Method to cleanse the raw text

The function below strips HTML tags and non-ASCII characters, keeps only alphanumeric characters, converts the text to lower case, and removes numbers and extra white space.

text.clean = function(x)                    # text data
{ require("tm")
  x  =  gsub("<.*?>", " ", x)               # regex for removing HTML tags
  x  =  iconv(x, "latin1", "ASCII", sub="") # Keep only ASCII characters
  x  =  gsub("[^[:alnum:]]", " ", x)        # keep only alpha numeric 
  x  =  tolower(x)                          # convert to lower case characters
  x  =  removeNumbers(x)                    # removing numbers
  x  =  stripWhitespace(x)                  # removing white space
  x  =  gsub("^\\s+|\\s+$", "", x)          # remove leading and trailing white space
  return(x)
}
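
To see what each step contributes, the sketch below replays the same sequence on a made-up review string. It assumes base-R stand-ins for the two tm helpers (a digit-removing `gsub` for `removeNumbers` and a whitespace-collapsing `gsub` for `stripWhitespace`) so that it runs without tm. The name `text.clean.base` is hypothetical.

```r
# Base-R sketch of the cleaning sequence; not the function used in the analysis
text.clean.base = function(x) {
  x = gsub("<.*?>", " ", x)                  # drop HTML tags
  x = iconv(x, "latin1", "ASCII", sub = "")  # drop non-ASCII characters
  x = gsub("[^[:alnum:]]", " ", x)           # punctuation becomes a space
  x = tolower(x)                             # lower case
  x = gsub("[[:digit:]]+", "", x)            # assumed stand-in for removeNumbers
  x = gsub("[[:space:]]+", " ", x)           # assumed stand-in for stripWhitespace
  x = gsub("^\\s+|\\s+$", "", x)             # trim both ends
  x
}

sample_review = "<p>I LOVED it!! 10/10, would watch again...</p>"
cleaned = text.clean.base(sample_review)
print(cleaned)
```

Note that punctuation is replaced by a space rather than deleted, so a contraction such as "don't" ends up as the two tokens "don" and "t".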

Preprocessing of the text.

The code below uses the text.clean function to clean the text. A data frame is created with one row per document. The stop words are read from a GitHub repository, combined with the tm stop word list, and removed from the text. After multiple iterations, extra stop words that convey no meaning in this context, such as movie, movies, watch and story, were added to the list.

data = data.frame(id = 1:length(reviews),  # creating doc IDs if name is not given
                  text = reviews, 
                  stringsAsFactors = F)

stpw1 = readLines('https://raw.githubusercontent.com/imtiazBDSgit/TextAnalytics/master/stopwords.txt')      # read in stopwords.txt
stpw2 = tm::stopwords('english')      # tm package stop word list; the tokenizers package has a function of the same name, hence 'tm::'
comn  = unique(c(stpw1, stpw2,c("movie","movies","watch","story")))         # union of the two lists plus the context-specific stop words
stopwords = unique(gsub("'"," ",comn))  # final stop word list; apostrophes become spaces to match the cleaned text

x  = text.clean(data$text)                # applying func defined above to pre-process text corpus
x  =  removeWords(x,stopwords)            # removing stopwords created above
x  =  stripWhitespace(x)
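
The apostrophe replacement in the stop word list matters because text.clean has already turned every apostrophe in the reviews into a space. The sketch below illustrates this with a made-up sentence and a short, hypothetical stop word list. It assumes a whole-word regular expression as a stand-in for `removeWords`, so the exact matching rules of tm may differ.

```r
# Hypothetical stop words; "don't" becomes "don t", like the cleaned text
stop_raw   = c("i", "don't", "this", "is", "the", "movie", "story")
stop_clean = unique(gsub("'", " ", stop_raw))

# A cleaned review: text.clean has already turned "don't" into "don t"
x = "i don t think this movie is the best story"

# Assumed stand-in for removeWords: whole-word regex alternation
pattern   = paste0("\\b(", paste(stop_clean, collapse = "|"), ")\\b")
x_no_stop = gsub(pattern, "", x, perl = TRUE)

# Collapse the gaps left behind, as stripWhitespace does above
x_no_stop = gsub("^\\s+|\\s+$", "", gsub("[[:space:]]+", " ", x_no_stop))
print(x_no_stop)
```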

DTM creation

The DTM is built from the corpus in several steps, using the text2vec functions itoken, create_vocabulary, prune_vocabulary, vocab_vectorizer and create_dtm. The weighting used here is term frequency.

# word_tokenizer splits the text into words rather than splitting on spaces only
tok_fun = word_tokenizer

# itoken creates an iterator over the input documents. The iterator is then
# passed to functions such as create_vocabulary, create_dtm and create_tcm.
it_0 = itoken( x,
               #preprocessor = text.clean,
               tokenizer = tok_fun,
               ids = data$id,
               progressbar = F)
# create_vocabulary collects the unique terms and their corresponding statistics
vocab = create_vocabulary(it_0,   
                          ngram = c(1L, 1L) #,
                          #stopwords = stopwords
)
# length(vocab); str(vocab)     # view what vocab obj is like
# prune_vocabulary filters the input vocabulary; here it drops terms that occur fewer than 10 times
pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 10)


# length(pruned_vocab);  str(pruned_vocab)

vectorizer = vocab_vectorizer(pruned_vocab) #  creates a text vectorizer func used in constructing a dtm/tcm/corpus

dtm_m  = create_dtm(it_0, vectorizer) # high-level function for creating a document-term matrix



dtm = as.DocumentTermMatrix(dtm_m, weighting = weightTf)
a0 = (apply(dtm, 1, sum) > 0)   # build vector to identify non-empty docs
dtm = dtm[a0,]                  # drop empty docs

#print(difftime(Sys.time(), t1, units = 'sec'))

# view a sample of the DTM, sorted from most to least frequent tokens 
dtm = dtm[,order(apply(dtm, 2, sum), decreasing = T)]     # sorting dtm's columns in decreasing order of column sums
inspect(dtm[1:5, 1:5])     # inspect() func used to view parts of a DTM object           
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 15/10
## Sparsity           : 40%
## Maximal term length: 6
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs film idiots rancho good aamir
##    1    0      0      0    0     0
##    2    9      1      0    2     0
##    3    1      2      3    5     4
##    4    1      0      1    0     4
##    5    0      2      4    2     5

Building a word cloud (term frequency)

The word cloud is built to understand the frequency distribution of top words.

#   1- Using Term frequency(tf)             

tst = round(ncol(dtm)/100)  # divide DTM's cols into 100 manageable parts
a = rep(tst,99)
b = cumsum(a);rm(a)
b = c(0,b,ncol(dtm))

ss.col = c(NULL)
for (i in 1:(length(b)-1)) {
  tempdtm = dtm[,(b[i]+1):(b[i+1])]
  s = colSums(as.matrix(tempdtm))
  ss.col = c(ss.col,s)
  
}

tsum = ss.col
tsum = tsum[order(tsum, decreasing = T)]       #terms in decreasing order of freq
head(tsum)
##    film  idiots  rancho    good   aamir college 
##     167     114     110     105     105      72
tail(tsum)
##    decided    started  knowledge    idiotic     superb screenplay 
##         10         10         10         10         10         10
wordcloud(names(tsum), tsum,     # words, their freqs 
          scale = c(2, 0.5),     # range of word sizes
          min.freq = 0.05,       # min. freq of words to consider
          max.words = 100,       # max #words
          colors = brewer.pal(8, "Dark2"))    # plot results in a word cloud 
title(sub = "Term Frequency - Wordcloud")     # title for the wordcloud display
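
The chunked loop used above to compute the column sums converts the DTM to a dense matrix a slice at a time, which keeps memory use down. It relies on `round(ncol(dtm)/100)` being at least 1, so it needs care when the DTM has fewer than about 50 columns. The sketch below shows the same idea on a small random matrix with a hypothetical helper, `chunked_col_sums`, that sizes the chunks with `ceiling` instead. It is an illustration, not the code that produced the output above.

```r
# Hypothetical helper: column sums computed chunk by chunk
chunked_col_sums = function(m, n_chunks = 100) {
  chunk_size = ceiling(ncol(m) / n_chunks)            # at least 1 column per chunk
  groups = split(seq_len(ncol(m)),
                 ceiling(seq_len(ncol(m)) / chunk_size))
  sums = lapply(unname(groups),
                function(j) colSums(m[, j, drop = FALSE]))
  unlist(sums)                                         # keeps the term names
}

set.seed(1)
m = matrix(rpois(3 * 7, lambda = 2), nrow = 3,
           dimnames = list(NULL, paste0("term", 1:7)))

tsum_demo = sort(chunked_col_sums(m, n_chunks = 3), decreasing = TRUE)
print(tsum_demo)
```

Since slam is already loaded, `slam::col_sums(dtm)` may also be worth trying, as it works on the sparse representation directly.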

Analysis of word cloud:

The word cloud above shows that the words associated with the movie look positive. The most frequent terms refer to the cast: Aamir Khan, Madhavan, Kareena, Boman Irani and Sharman. Other frequent terms relate to the students in the movie. There is an overall positive set of words, such as success, brilliant moments, performance and wonderful. The initial impression is that the movie could be good.

Better visualization through bar plot.

test = data.frame(term = names(tsum)[1:15], freq = round(tsum[1:15], 0))  # one column for terms, one for counts

ggplot(test, aes(x = term, y = freq)) + 
  geom_bar(stat = "identity", fill = "Blue") +
  geom_text(aes(label = freq), vjust= -0.20) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Creating the same visualizations through tfidf weighting scheme.
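
As a reminder of what the weighting does, the sketch below applies a textbook TF-IDF (raw count times `log(N / document frequency)`) to a tiny made-up count matrix. The exact idf variant used by `textir::tfidf` may differ, for example through smoothing, so this is an assumption-laden illustration rather than a reimplementation. The function name `tfidf_raw` is hypothetical.

```r
# Textbook TF-IDF on raw counts (no length normalisation)
tfidf_raw = function(counts) {
  n_docs   = nrow(counts)
  doc_freq = colSums(counts > 0)      # number of documents containing each term
  idf      = log(n_docs / doc_freq)   # rarer terms get a larger weight
  sweep(counts, 2, idf, `*`)          # scale every column by its idf
}

# Hypothetical counts: "film" occurs in every document, "rancho" in two of three
counts = matrix(c(2, 0, 1,
                  1, 1, 1),
                nrow = 3,
                dimnames = list(1:3, c("rancho", "film")))

weights = tfidf_raw(counts)
print(round(weights, 3))
```

A term that appears in every document gets an idf of zero here. This is why very common words such as film lose ground to more specific ones such as rancho once the weighting is applied.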

# step 2b - Using Term frequency inverse document frequency (tfidf)             
# -------------------------------------------------------------- #


dtm.tfidf = tfidf(dtm, normalize=F)

tst = round(ncol(dtm.tfidf)/100)
a = rep(tst, 99)
b = cumsum(a);rm(a)
b = c(0,b,ncol(dtm.tfidf))

ss.col = c(NULL)
for (i in 1:(length(b)-1)) {
  tempdtm = dtm.tfidf[,(b[i]+1):(b[i+1])]
  s = colSums(as.matrix(tempdtm))
  ss.col = c(ss.col,s)
  
}

tsum = ss.col

tsum = tsum[order(tsum, decreasing = T)]       #terms in decreasing order of freq
head(tsum)
##    rancho      film      good    idiots      raju     aamir 
## 152.49238 121.41714  83.84331  82.88356  80.94788  80.02471
tail(tsum)
##    minutes  recommend       kind     superb screenplay     boring 
##   24.84907   24.84907   24.84907   24.84907   24.84907   23.89596
wordcloud(names(tsum), tsum, scale=c(2,0.5),0.05, max.words=100,colors=brewer.pal(8, "Dark2")) # Plot results in a word cloud 
title(sub = "Term Frequency Inverse Document Frequency - Wordcloud")

as.matrix(tsum[1:20])     #  to see the top few tokens & their summed TF-IDF scores
##                [,1]
## rancho    152.49238
## film      121.41714
## good       83.84331
## idiots     82.88356
## raju       80.94788
## aamir      80.02471
## college    77.32222
## indian     72.44405
## films      68.08686
## bollywood  65.54462
## comedy     63.90882
## hirani     63.81056
## life       63.63008
## people     62.83933
## farhan     60.70784
## funny      59.38527
## love       58.22803
## scene      58.03879
## khan       57.32963
## time       56.91774
(dtm.tfidf)[1:10, 1:10]   # view first 10x10 cells in the DTM under TF IDF.
## 10 x 10 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 10 column names 'film', 'idiots', 'rancho' ... ]]
##                                                                       
## 1  .         .         .        .         .         .        .        
## 2  6.5434386 0.7270487 .        1.5970154 .         .        .        
## 3  0.7270487 1.4540975 4.158883 3.9925385 3.0485602 2.147839 0.8556661
## 4  0.7270487 .         1.386294 .         3.0485602 .        .        
## 5  .         1.4540975 5.545177 1.5970154 3.8107003 3.221759 0.8556661
## 6  3.6352437 9.4516335 .        .         1.5242801 .        0.8556661
## 7  1.4540975 0.7270487 .        .         .         1.073920 .        
## 8  .         .         1.386294 0.7985077 0.7621401 .        .        
## 9  4.3622924 4.3622924 5.545177 .         0.7621401 3.221759 0.8556661
## 10 0.7270487 .         .        .         .         .        .        
##                              
## 1  .        3.078875 1.149906
## 2  2.299811 1.026292 1.149906
## 3  1.149906 1.026292 1.149906
## 4  .        .        .       
## 5  .        3.078875 .       
## 6  .        .        1.149906
## 7  .        1.026292 .       
## 8  .        .        .       
## 9  4.599622 5.131458 3.449717
## 10 1.149906 .        .
# plot barchart for top tokens
test = data.frame(term = names(tsum)[1:15], freq = round(tsum[1:15], 0))  # one column for terms, one for scores

ggplot(test, aes(x = term, y = freq)) + 
  geom_bar(stat = "identity", fill = "red") +
  geom_text(aes(label = freq), vjust= -0.20) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The TF-IDF word cloud and frequency plot bring out the distinguishing features of the movie more fully. There is also a stress on music, romance and excellence of mind. From a sequel point of view, the cast, cinematography, music, comedy and script are the key takeaways.

Term co-occurrence matrix and graphs.

# Term co-occurrence matrix and co-occurrence graphs

vectorizer = vocab_vectorizer(pruned_vocab, 
                              grow_dtm = FALSE, 
                              skip_grams_window = 5L)

tcm = create_tcm(it_0, vectorizer) # func to build a TCM

tcm.mat = as.matrix(tcm)         # use tcm.mat[1:5, 1:5] to view
adj.mat = tcm.mat + t(tcm.mat)   # since adjacency matrices are symmetric

z = order(colSums(adj.mat), decreasing = T)
adj.mat = adj.mat[z,z]

# Plot simple term co-occurrence graph
adj = adj.mat[1:30,1:30]

library("igraph")
cog = graph.adjacency(adj, mode = 'undirected')
cog =  simplify(cog)  

cog = delete.vertices(cog, V(cog)[ degree(cog) == 0 ])


plot(cog)
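
For intuition about what the term co-occurrence matrix holds, the sketch below counts, in base R, how often two terms appear within a fixed window of each other. It is a simplified stand-in for `create_tcm`: it uses plain counts and fills both triangles of the matrix, whereas text2vec may down-weight more distant pairs and store only one triangle. The function name `cooccurrence` and the token lists are hypothetical.

```r
# Symmetric co-occurrence counts within a sliding window
cooccurrence = function(token_list, window = 5L) {
  vocab = sort(unique(unlist(token_list)))
  tcm = matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
  for (tokens in token_list) {
    n = length(tokens)
    for (i in seq_len(n)) {
      j_max = min(n, i + window)
      if (i < j_max) {
        for (j in (i + 1):j_max) {
          a = tokens[i]; b = tokens[j]
          tcm[a, b] = tcm[a, b] + 1
          if (a != b) tcm[b, a] = tcm[b, a] + 1   # mirror to keep the matrix symmetric
        }
      }
    }
  }
  tcm
}

toy_tokens = list(c("aamir", "khan", "acting"), c("aamir", "khan"))
toy_tcm = cooccurrence(toy_tokens, window = 1L)
print(toy_tcm)
```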

#Distilled Cog

distill.cog = function(mat1, # input TCM ADJ MAT
                       title, # title for the graph
                       s,    # no. of central nodes
                       k1){  # max no. of connections  
  library(igraph)
  a = colSums(mat1) # collect colsums into a vector obj a
  b = order(-a)     # nice syntax for ordering vector in decr order  
  
  mat2 = mat1[b, b]     # order both rows and columns along vector b
  
  diag(mat2) =  0
  
  ## +++ go row by row and find top k adjacencies +++ ##
  
  wc = NULL
  
  for (i1 in 1:s){ 
    thresh1 = mat2[i1,][order(-mat2[i1, ])[k1]]   # k1-th largest value in row i1
    mat2[i1, mat2[i1,] < thresh1] = 0   # drop connections weaker than the threshold
    mat2[i1, mat2[i1,] > 0 ] = 1        # binarize the surviving connections
    word = names(mat2[i1, mat2[i1,] > 0])
    mat2[(i1+1):nrow(mat2), match(word,colnames(mat2))] = 0
    wc = c(wc,word)
  } # i1 loop ends
  
  
  mat3 = mat2[match(wc, colnames(mat2)), match(wc, colnames(mat2))]
  ord = colnames(mat2)[which(!is.na(match(colnames(mat2), colnames(mat3))))]  # removed any NAs from the list
  mat4 = mat3[match(ord, colnames(mat3)), match(ord, colnames(mat3))]
  graph <- graph.adjacency(mat4, mode = "undirected", weighted=T)    # Create Network object
  graph = simplify(graph) 
  V(graph)$color[1:s] = "green"
  V(graph)$color[(s+1):length(V(graph))] = "pink"
  
  graph = delete.vertices(graph, V(graph)[ degree(graph) == 0 ]) # delete singletons?
  
  plot(graph, 
       layout = layout.kamada.kawai, 
       main = title)
  
} # func ends


distill.cog(tcm.mat, 'Distilled COG with tf scheme',  10,  5)

## adj.mat and distilled cog for tfidf DTMs ##

adj.mat = t(dtm.tfidf) %*% dtm.tfidf
diag(adj.mat) = 0
a0 = order(apply(adj.mat, 2, sum), decreasing = T)
adj.mat = as.matrix(adj.mat[a0[1:50], a0[1:50]])


distill.cog(adj.mat, 'Distilled COG with tfidf scheme',  10,  10)
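
The core of distill.cog is the per-row thresholding: for each central node, only the strongest k1 connections are kept and then set to 1. The sketch below isolates that step on a made-up vector of connection strengths. The helper name `keep_top_k` is hypothetical.

```r
# Keep the k strongest entries of a row and binarize them
keep_top_k = function(row, k) {
  thresh = row[order(-row)[k]]   # k-th largest value
  row[row < thresh] = 0          # drop weaker connections
  row[row > 0] = 1               # binarize the rest
  row
}

# Hypothetical co-occurrence strengths for one central node
strengths = c(aamir = 5, khan = 1, comedy = 3, music = 3, boring = 0)
kept = keep_top_k(strengths, k = 2)
print(kept)
```

Because the comparison is against the k-th largest value, ties at the threshold are all kept, so a node can end up with more than k connections.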

Analysis of co-occurrence graphs from a sequel point of view.

The simple co-occurrence graph stresses the cast, comedy, acting and love.

The distilled COGs give further insight into the direction, screenplay and comedy.

Polarity and sentiments

This section measures the polarity of each review and builds the positive and negative word clouds and bar plots.
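
As a rough guide to how a polarity score of this kind is formed, the sketch below scores a text as positive words minus negative words, divided by the square root of the word count. It deliberately ignores the negators and amplifiers that qdap's `polarity` also takes into account, so its numbers will not match qdap's. The lexicons and the function name `simple_polarity` are hypothetical.

```r
# Simplified polarity: (positives - negatives) / sqrt(word count)
simple_polarity = function(text, pos_words, neg_words) {
  tokens = strsplit(text, " ", fixed = TRUE)[[1]]
  tokens = tokens[tokens != ""]                        # ignore empty tokens
  score  = sum(tokens %in% pos_words) - sum(tokens %in% neg_words)
  score / sqrt(length(tokens))
}

# Tiny hypothetical lexicons, not the qdap dictionary
pos_lex = c("good", "wonderful", "loved")
neg_lex = c("bad", "boring", "worst")

p1 = simple_polarity("good good bad film", pos_lex, neg_lex)
p2 = simple_polarity("worst boring film ever", pos_lex, neg_lex)
print(c(p1, p2))
```

Dividing by the square root of the word count, rather than the count itself, keeps long reviews from being diluted towards zero while still damping their raw totals.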

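# a0 was reassigned to term-ordering indices in the tfidf COG step above,
# so rebuild the non-empty-document mask before subsetting the corpus
keep = Matrix::rowSums(dtm_m) > 0
x1 = x[keep]    # remove empty docs from corpus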

#t1 = Sys.time()   # set timer

pol = polarity(x1)         # Calculate the polarity from qdap dictionary
wc = pol$all[,2]                  # Word Count in each doc
val = pol$all[,3]                 # average polarity score
p  = pol$all[,4]                  # Positive words info
n  = pol$all[,5]                  # Negative Words info  

#Sys.time() - t1  # how much time did the above take?

head(pol$all)
##   all  wc   polarity
## 1 all 195  1.0598500
## 2 all  59 -1.7705692
## 3 all  75  0.4849742
## 4 all  89 -1.1023978
## 5 all 161  0.2364331
## 6 all 111  1.2528886
##                                                                                                                                                                                                                                                               pos.words
## 1 free, love, inspiration, blockbuster, humour, good, leading, brilliantly, astounding, pleasant, good, cool, good, convincing, stellar, genuine, consummate, commendable, supporting, good, liberty, wholesome, excellence, success, excellence, success, favour, good
## 2                                                                                                                                                                                                                                                 pretty, favorite, top
## 3                                                                                                                                                       pleased, enjoyed, gem, perfect, ideal, loved, trust, encouragement, support, talents, successfully, free, bless
## 4                                                                                                                                                                                                                   good, love, good, works, fun, love, perfect, famous
## 5                                                                             winner, pride, sparkling, entertains, promises, heartwarming, worth, unparalleled, blockbuster, strong, enhance, elevate, solid, supremely, talented, fine, humor, strong, humor, winning
## 6                                                                                              wonderfully, entertaining, good, good, loved, easy, poignant, work, tender, loved, wonderful, guarantee, great, poignant, wonderful, worth, delightful, wonderful, smile
##                                                                                                                               neg.words
## 1              plot, idiots, lost, idiot, twists, idiots, satirical, flaws, twist, steals, eccentric, downside, unbelievable, negatives
## 2                fool, wasting, stooges, weird, poor, poor, horrible, worst, idiots, problem, fall, childish, idiots, worst, discourage
## 3                                                                     needless, missed, idiot, mediocre, cold, lack, meaningless, spoil
## 4          overrated, dark, overrated, lethal, suicide, sad, plot, hard, lack, exhausted, complains, lazy, struggled, sucks, cry, wrong
## 5 idiots, idiots, idiots, idiots, idiots, idiots, idiots, gross, idiots, idiots, twists, restless, idiots, plot, idiots, idiots, idiots
## 6                                                                              hard, hard, bad, idiots, afraid, cry, plot, spoil, funny
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    text.var
## 1 plot idiots farhan madhavan raju sharman sets outs journey find lost friend idiot rancho aamir journey starts college life memories friendship unique free thinker rancho inspired changed lives unfolded flashbacks twists turns views idiots educational system pressurizes students chose destination satirical brings flaws parallel love angle rancho pia kareena situations characters resemblance inspiration point blockbuster penned chetan bhagat humour emotional quotients amalgamated proportion full credits raju film offers lots quality rib tickling moments intermission point offers twist totally blue hooked feel good comic caper leading culmination surprising element climax brilliantly weaved hirani technically rocks astounding visually arresting cinematography aerial shots shimla ladakh pleasant treat eyes editing good song visualizations cool music good flow narration coming performances convincing portrayals aamir khan real showman bollywood steals shows stellar portrayal performances mannerisms college student genuine consummate madhavan sharman adds commendable performances kareena part supporting boman irani eccentric head institution good downside unbelievable filmy situations excused writer liberty episode mona delivery bit stretched minor negatives affect wholesome entertainer short basic theme chase excellence success follow aamir raju chased excellence success sum aamir hirani delivered favour highly emotional feel good comic entertainer verdict rating reviews latest indian visit www snehasallapam 
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  feel fool wasting hours practically indian equivalent stooges context bollywood film make rush back view jokes pretty unfunny inline unintelligent slapstick comedy weird foreign comedy film made continue viewing poor acting poor script horrible cinematography worst idiots favorite homeland foreign viewer india huge problem viewing jokes fall flat flat feels juvenile childish idiots worst imdb top discourage viewing costs
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  iitian needless forward based life iits family watched day show pleased sentence bollywood missed idiot cried laughed enjoyed moment hours spent watching gem songs mediocre watching feel perfect ideal situations loved understand bring iit avoid disputes trust shown life iits true suicides cold hearted professors lack encouragement support talents bring change world successfully depicts inside life meaningless grades race late free advice avoid reviews words people spoil scene rush nearest theater experience phenomenon bless advice
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           asian mind goddamn good asians high school thing overrated dark knight avengers overrated superheroes thing lot people love stop taking personal feelings lethal truth make understand high school thing studied chinese indian high school understand suicide sad plot life pis sh wake bed choose class teacher relationship boys girls party prom study study make life good grades hard works lack fun totally exhausted teachers parents complains lazy funnier struggled learn future life simply sucks love film showed mind perfect make laugh cry douban famous website china average wrong understand
## 5                                                                                                                                                                                                                                                                                                                                        paid premiere idiots today idiots winner everythingwise idiots suits term landmark cinema takes bollywood giant step world cinema pride idiots sparkling qualitative cinema idiots entertains enlightens idiots forward thinking makes recall roots promises lots laughs heartwarming message aplenty remain etched memory possess recall idiots films indisputably undeniably aamir madhavan sharman outing worth price ticket film set records merits emerge biggest hits times weekend business historic week business unparalleled lifetime gross biggest times short idiots blockbuster written idiots told differently importantly offers twists turns guess unfold happening scene screenplay gripping feel auditorium ceiling intervals restless idiots demonstrates strong film making enhance elevate solid concept aamir khan film short event supremely talented actor acts film year films identical terms plot line sum idiots commercial hindi cinema film hit written put cancel today idiots director rajkumar hirani strikes fine balance humor emotions comic portions executed panache drama attention grabbing emotional quotient strong turn moist eyed marriage humor emotions technique content drives idiots winning post
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         hard review film huge number indian films couple dozen wonderfully entertaining expert bollywood hard time knowing good film relative films country read review india offer indian films long film run time hours bad forever idiots good film loved length films genre share usual singing dancing foreign films countries thing defining type film easy comedy poignant moments kleenex nearby existential moments explore meaning life work tender film friendship daughter pointed film loved men afraid cry rarely western films plot long involved recount occurs spoil single wonderful moment advice sit back give chance guarantee great time poignant funny film wonderful worth time delightful script wonderful characters lots moments made smile brought tears film
head(pol$group)
##   all total.sentences total.words ave.polarity sd.polarity
## 1 all             201       11447    0.2708706   0.9716321
##   stan.mean.polarity
## 1           0.278779
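
The columns in this summary are related: qdap documents stan.mean.polarity as the average polarity divided by its standard deviation. A minimal check, using only the figures printed above (the variable names are illustrative, not part of the original analysis):

```r
# Values copied from the head(pol$group) output above
ave_polarity <- 0.2708706
sd_polarity  <- 0.9716321

# Standardized mean polarity: mean divided by standard deviation
stan_mean_polarity <- ave_polarity / sd_polarity
print(round(stan_mean_polarity, 6))
```

Because the inputs are rounded, the result should match the reported 0.278779 only up to rounding in the last digits.
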
positive_words = unique(setdiff(unlist(p),"-"))  # Positive words list
negative_words = unique(setdiff(unlist(n),"-"))  # Negative words list

print(positive_words)       # Print all the positive words found in the corpus
##   [1] "free"           "love"           "inspiration"    "blockbuster"   
##   [5] "humour"         "good"           "leading"        "brilliantly"   
##   [9] "astounding"     "pleasant"       "cool"           "convincing"    
##  [13] "stellar"        "genuine"        "consummate"     "commendable"   
##  [17] "supporting"     "liberty"        "wholesome"      "excellence"    
##  [21] "success"        "favour"         "pretty"         "favorite"      
##  [25] "top"            "pleased"        "enjoyed"        "gem"           
##  [29] "perfect"        "ideal"          "loved"          "trust"         
##  [33] "encouragement"  "support"        "talents"        "successfully"  
##  [37] "bless"          "works"          "fun"            "famous"        
##  [41] "winner"         "pride"          "sparkling"      "entertains"    
##  [45] "promises"       "heartwarming"   "worth"          "unparalleled"  
##  [49] "strong"         "enhance"        "elevate"        "solid"         
##  [53] "supremely"      "talented"       "fine"           "humor"         
##  [57] "winning"        "wonderfully"    "entertaining"   "easy"          
##  [61] "poignant"       "work"           "tender"         "wonderful"     
##  [65] "guarantee"      "great"          "delightful"     "smile"         
##  [69] "astonishing"    "wins"           "amazingly"      "aspirations"   
##  [73] "cleverly"       "respect"        "joy"            "happy"         
##  [77] "intelligent"    "unforgettable"  "charismatic"    "inspiring"     
##  [81] "enjoying"       "fond"           "gleeful"        "incredibly"    
##  [85] "ideally"        "gratifying"     "precisely"      "lead"          
##  [89] "enjoy"          "gaining"        "brilliant"      "articulate"    
##  [93] "beautiful"      "warmth"         "humorous"       "pleasure"      
##  [97] "competitive"    "finest"         "phenomenal"     "positively"    
## [101] "brilliance"     "positive"       "convincingly"   "super"         
## [105] "patient"        "prowess"        "master"         "supreme"       
## [109] "fantastic"      "hilarious"      "worked"         "masterpiece"   
## [113] "charming"       "loving"         "awesome"        "classic"       
## [117] "amazing"        "wow"            "accolades"      "clean"         
## [121] "courage"        "effective"      "relief"         "likable"       
## [125] "ambitious"      "prodigy"        "prestigious"    "greatest"      
## [129] "passion"        "skilled"        "passionate"     "advantage"     
## [133] "nicely"         "freedom"        "leads"          "lucky"         
## [137] "accessible"     "lovely"         "heartfelt"      "invaluable"    
## [141] "beautifully"    "recommended"    "magnificent"    "led"           
## [145] "sweet"          "likes"          "excellent"      "perfectly"     
## [149] "top notch"      "witty"          "effectively"    "awesomeness"   
## [153] "bright"         "thoughtful"     "recommend"      "succeeds"      
## [157] "fresh"          "decent"         "fancy"          "proving"       
## [161] "wisdom"         "reputation"     "successful"     "improves"      
## [165] "maturity"       "interesting"    "genius"         "adore"         
## [169] "important"      "wise"           "refreshing"     "originality"   
## [173] "honest"         "helped"         "rich"           "credible"      
## [177] "marvelous"      "superb"         "exciting"       "enjoyable"     
## [181] "envy"           "magic"          "attractive"     "rational"      
## [185] "logical"        "incredible"     "significant"    "reasonable"    
## [189] "progress"       "liking"         "tough"          "remarkably"    
## [193] "awed"           "celebration"    "satisfying"     "cute"          
## [197] "modest"         "wealthy"        "talent"         "loves"         
## [201] "richer"         "world famous"   "cleaner"        "gold"          
## [205] "modern"         "awarded"        "gems"           "memorable"     
## [209] "gentle"         "warm"           "vivacious"      "thrilling"     
## [213] "joyously"       "smart"          "ready"          "enjoys"        
## [217] "simplify"       "virtue"         "capable"        "congratulate"  
## [221] "sensational"    "idolized"       "idol"           "enlightenment" 
## [225] "flawless"       "hero"           "won"            "awards"        
## [229] "spectacular"    "appealing"      "immense"        "unreal"        
## [233] "humble"         "praise"         "treasure"       "praising"      
## [237] "praiseworthy"   "correct"        "redeeming"      "pleasing"      
## [241] "inspire"        "realistic"      "clever"         "thrilled"      
## [245] "nice"           "hug"            "accomplishment" "promise"       
## [249] "perfection"     "glad"           "believable"     "quiet"         
## [253] "improve"        "outstanding"    "smooth"         "enlighten"     
## [257] "precise"        "heroine"        "grace"          "exceptional"   
## [261] "impressive"     "ovation"        "stupendous"     "engaging"      
## [265] "mesmerize"      "appeal"         "fluent"         "optimistic"    
## [269] "promising"      "joyful"         "enthusiast"     "amazed"        
## [273] "clear"          "stunning"       "win"            "easier"        
## [277] "keen"           "exceptionally"  "mature"         "succeed"       
## [281] "admirable"      "golden"         "instructive"    "assure"        
## [285] "happiness"      "empathy"        "sensitive"      "lover"         
## [289] "creative"       "complement"     "meticulously"   "profound"      
## [293] "valuable"       "charm"          "achievements"   "prefers"       
## [297] "excel"          "flexible"       "rightly"        "energetic"     
## [301] "romantic"       "breathtaking"   "delight"        "powerfully"    
## [305] "properly"       "balanced"       "positives"      "entertain"     
## [309] "fair"           "impressed"      "integrated"     "fascinating"   
## [313] "pardon"         "fans"           "dazzled"        "precious"      
## [317] "proud"          "proves"         "adequate"       "proper"        
## [321] "superbly"       "eager"          "luck"           "powerful"      
## [325] "exemplary"      "ample"          "righteous"      "understandable"
## [329] "splendid"       "impress"        "enthrall"       "catchy"        
## [333] "fairness"       "festive"        "compliment"     "faith"         
## [337] "long lasting"   "famed"          "glee"           "comfortable"   
## [341] "playfully"      "wonders"        "calm"           "uplifting"     
## [345] "hopeful"        "jolly"          "renowned"       "appreciated"   
## [349] "impressively"   "extraordinary"  "dawn"           "redemption"    
## [353] "formidable"     "lovable"        "colorful"       "picturesque"   
## [357] "fast"           "considerate"    "fresher"        "honor"         
## [361] "galore"         "generous"       "nifty"          "infallible"    
## [365] "aspire"         "endearing"      "everlasting"    "jaw dropping"  
## [369] "shiny"          "resplendent"    "frolic"         "accomplish"    
## [373] "smartest"       "inspirational"  "innocuous"      "affable"       
## [377] "faithful"
print(negative_words)       # Print all the negative words found in the corpus
##   [1] "plot"            "idiots"          "lost"           
##   [4] "idiot"           "twists"          "satirical"      
##   [7] "flaws"           "twist"           "steals"         
##  [10] "eccentric"       "downside"        "unbelievable"   
##  [13] "negatives"       "fool"            "wasting"        
##  [16] "stooges"         "weird"           "poor"           
##  [19] "horrible"        "worst"           "problem"        
##  [22] "fall"            "childish"        "discourage"     
##  [25] "needless"        "missed"          "mediocre"       
##  [28] "cold"            "lack"            "meaningless"    
##  [31] "spoil"           "overrated"       "dark"           
##  [34] "lethal"          "suicide"         "sad"            
##  [37] "hard"            "exhausted"       "complains"      
##  [40] "lazy"            "struggled"       "sucks"          
##  [43] "cry"             "wrong"           "gross"          
##  [46] "restless"        "bad"             "afraid"         
##  [49] "funny"           "aching"          "issues"         
##  [52] "stereotypical"   "lengthy"         "shock"          
##  [55] "stress"          "pessimistic"     "seriousness"    
##  [58] "death"           "difficulty"      "problems"       
##  [61] "sneak"           "ironies"         "crazy"          
##  [64] "dictator"        "hate"            "restricted"     
##  [67] "revenge"         "pains"           "goofy"          
##  [70] "crude"           "senseless"       "misery"         
##  [73] "risk"            "frazzled"        "nervous"        
##  [76] "frantically"     "desperately"     "despair"        
##  [79] "terribly"        "frantic"         "desperate"      
##  [82] "farce"           "stunted"         "jittery"        
##  [85] "fearful"         "strict"          "poverty"        
##  [88] "expensive"       "fragile"         "inability"      
##  [91] "virus"           "issue"           "unable"         
##  [94] "falling"         "gravely"         "grossly"        
##  [97] "denying"         "unrealistic"     "collapse"       
## [100] "fear"            "anxiety"         "unhappy"        
## [103] "wasted"          "slog"            "loathe"         
## [106] "uncomfortable"   "skeptical"       "blind"          
## [109] "biased"          "struggle"        "strictly"       
## [112] "limited"         "worse"           "reluctant"      
## [115] "parody"          "ignore"          "silly"          
## [118] "bully"           "heartbreaking"   "rejection"      
## [121] "flawed"          "challenging"     "unhealthy"      
## [124] "drunk"           "doubt"           "stupid"         
## [127] "repetitive"      "laughable"       "miserable"      
## [130] "idiotic"         "ripoff"          "pathetic"       
## [133] "annoying"        "hated"           "sucked"         
## [136] "regret"          "irritating"      "loud"           
## [139] "blatant"         "unstable"        "appalled"       
## [142] "mindless"        "suffered"        "stunt"          
## [145] "killed"          "fails"           "lacks"          
## [148] "roadblocks"      "cheap"           "bored"          
## [151] "sorrow"          "melodramatic"    "ridiculous"     
## [154] "slow"            "disappointed"    "headache"       
## [157] "unnecessary"     "heck"            "controversy"    
## [160] "disappoints"     "overdone"        "hype"           
## [163] "disaster"        "immature"        "boring"         
## [166] "dense"           "outrageous"      "disappointing"  
## [169] "debauch"         "aborted"         "dumb"           
## [172] "impractical"     "corruption"      "struggling"     
## [175] "tired"           "hating"          "awful"          
## [178] "useless"         "divisive"        "superficiality" 
## [181] "vomit"           "suffering"       "complex"        
## [184] "fake"            "pretentious"     "tension"        
## [187] "sadness"         "freaking"        "ploy"           
## [190] "nitpicking"      "failed"          "shame"          
## [193] "destroy"         "falls"           "wild"           
## [196] "attack"          "abort"           "died"           
## [199] "fail"            "interrupt"       "irritated"      
## [202] "overacted"       "badly"           "disappointment" 
## [205] "starkly"         "critics"         "superficially"  
## [208] "mournful"        "simplistic"      "falter"         
## [211] "stuck"           "bs"              "joke"           
## [214] "haste"           "shake"           "difficulties"   
## [217] "unbelievably"    "mystery"         "irrelevant"     
## [220] "lame"            "insanely"        "sadly"          
## [223] "illogical"       "confusing"       "stupidity"      
## [226] "hell"            "moronic"         "outrageously"   
## [229] "obnoxious"       "morbid"          "immoral"        
## [232] "rape"            "delusional"      "crap"           
## [235] "fooled"          "hang"            "waste"          
## [238] "mistakes"        "sloppily"        "weakness"       
## [241] "sensationalize"  "difficult"       "irresponsible"  
## [244] "mistaken"        "bothered"        "misbehavior"    
## [247] "insult"          "misbehave"       "steal"          
## [250] "miserably"       "unbearable"      "annoyingly"     
## [253] "lose"            "miss"            "shocked"        
## [256] "villains"        "imperfect"       "rotten"         
## [259] "appalling"       "hum"             "absurd"         
## [262] "cheesy"          "painful"         "trick"          
## [265] "losing"          "ripped"          "comical"        
## [268] "confrontation"   "misfit"          "manipulative"   
## [271] "dull"            "ruined"          "criticizing"    
## [274] "evil"            "snappish"        "naughty"        
## [277] "cramp"           "impose"          "shallow"        
## [280] "atrocious"       "crass"           "unexpected"     
## [283] "stressful"       "dilemma"         "kill"           
## [286] "alienation"      "ambiguity"       "criticize"      
## [289] "confess"         "ruins"           "bother"         
## [292] "messed"          "murder"          "angry"          
## [295] "impulsive"       "myth"            "jam"            
## [298] "anger"           "overlook"        "dislike"        
## [301] "despise"         "mockery"         "stealing"       
## [304] "detracts"        "corrupted"       "gullible"       
## [307] "deceive"         "hindrance"       "dragged"        
## [310] "randomly"        "unusually"       "painfully"      
## [313] "explosive"       "concerned"       "fears"          
## [316] "bull"            "damn"            "harmed"         
## [319] "criticism"       "misleading"      "nonsense"       
## [322] "offensive"       "tragedy"         "ignorance"      
## [325] "rubbish"         "impatiently"     "confused"       
## [328] "annoyed"         "unpredictable"   "prisoner"       
## [331] "bugs"            "disappointments" "blasphemous"    
## [334] "bizarre"         "vicious"         "clash"          
## [337] "cracked"         "fidgety"         "vile"           
## [340] "virulent"        "weak"            "obscured"       
## [343] "superficial"     "messy"           "tiresome"       
## [346] "faulty"          "sugarcoated"     "authoritarian"  
## [349] "thankless"       "mediocrity"      "irks"           
## [352] "troublemaker"    "unsuccessfully"  "breaking"       
## [355] "testy"           "sucker"          "punch"          
## [358] "caricature"      "whining"         "lackluster"     
## [361] "clunky"          "undermined"      "excessively"    
## [364] "touted"          "strike"          "false"          
## [367] "confession"      "naively"         "hinder"         
## [370] "ruthless"        "insulting"       "retarded"       
## [373] "soapy"           "annoys"          "confuse"        
## [376] "indecency"       "scared"          "disgusting"     
## [379] "betraying"       "unnatural"       "scream"         
## [382] "spoils"          "loses"           "irritate"       
## [385] "damaging"        "leaking"         "criminal"       
## [388] "offenses"        "disdain"         "terrible"       
## [391] "suck"            "rival"           "nagging"        
## [394] "strange"         "hardships"       "harsh"          
## [397] "bore"            "hurt"            "weaker"         
## [400] "rhetorical"      "fictitious"      "disregard"      
## [403] "stricken"        "rivalry"         "nasty"          
## [406] "ghastly"         "inappropriate"   "drag"           
## [409] "contrived"       "mawkish"         "implausible"    
## [412] "superstitious"   "inevitable"
#--------------------------------------------------------#
#   Create Positive Words wordcloud                      #
#--------------------------------------------------------#

pos.tdm = dtm[,which(colnames(dtm) %in% positive_words)]
m = as.matrix(pos.tdm)
v = sort(colSums(m), decreasing = TRUE)
# draw the word cloud; min.freq = 0.5 keeps every word that occurs at least once
wordcloud(names(v), v, scale=c(3,1), min.freq=0.5, max.words=100, colors=brewer.pal(8, "Dark2"))
title(sub = "Positive Words - Wordcloud")

# plot barchart for top tokens
top = head(v, 15)     # at most 15 terms; head() never pads with NA
test = data.frame(word = names(top), freq = as.numeric(top))
# map named columns so ggplot can pick the axis scales itself
ggplot(test, aes(x = word, y = freq)) + 
  geom_bar(stat = "identity", fill = "blue") +
  geom_text(aes(label = freq), vjust= -0.20) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

#--------------------------------------------------------#
#  Create Negative Words wordcloud                       #
#--------------------------------------------------------#

neg.tdm = dtm[,which(colnames(dtm) %in% negative_words) ]
m = as.matrix(neg.tdm)
v = sort(colSums(m), decreasing = TRUE)

wordcloud(names(v), v, scale=c(3,1), min.freq=0.5, max.words=100, colors=brewer.pal(8, "Dark2"))
title(sub = "Negative Words - Wordcloud")

# plot barchart for top tokens
top = head(v, 15)     # fewer than 15 negative terms exist; head() avoids NA rows
test = data.frame(word = names(top), freq = as.numeric(top))

# map named columns so ggplot can pick the axis scales itself
ggplot(test, aes(x = word, y = freq)) + 
  geom_bar(stat = "identity", fill = "red") +
  geom_text(aes(label = freq), vjust= -0.10) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
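
Both word-cloud sections follow the same pattern: keep the document-term matrix columns that appear in a word list, sum them, and rank the totals. The sketch below shows that pattern on a small made-up matrix; toy_dtm and toy_lexicon are hypothetical stand-ins for dtm and positive_words, and head() is used so that a short list never produces NA rows.

```r
# Hypothetical document-term matrix: 2 documents x 4 terms
toy_dtm <- matrix(c(2, 0, 1, 3,
                    1, 1, 0, 2),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("doc1", "doc2"),
                                  c("love", "film", "great", "movie")))

# Hypothetical sentiment lexicon; "superb" does not occur in the matrix
toy_lexicon <- c("love", "great", "superb")

# Keep only the lexicon columns (drop = FALSE keeps the matrix shape)
lex_cols <- toy_dtm[, colnames(toy_dtm) %in% toy_lexicon, drop = FALSE]

# Total frequency per word, most frequent first
freq <- sort(colSums(lex_cols), decreasing = TRUE)

# head() returns at most 15 entries and never pads with NA
top <- head(freq, 15)
top_df <- data.frame(word = names(top), freq = as.numeric(top))
print(top_df)
```

A data frame with explicit word and freq columns can be passed to ggplot with aes(x = word, y = freq).
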

## Comparing positive and negative sentiment

#--------------------------------------------------------#
#  Positive words vs Negative Words plot                 #
#--------------------------------------------------------#

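# Count the words in one document; "-" is the placeholder for "no words found"
len = function(x){
  # test the length first so the comparison only ever sees a single string
  if ( length(x) == 1 && x == "-")  {return (0)} 
  else {return(length(unlist(x)))}
}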

pcount = unlist(lapply(p, len))
ncount = unlist(lapply(n, len))
doc_id = seq_along(wc)


plot(doc_id,pcount,type="l",col="green",xlab = "Document ID", ylab= "Word Count")
lines(doc_id,ncount,type= "l", col="red")
title(main = "Positive words vs Negative Words" )
legend("topright", inset=.05, c("Positive Words","Negative Words"), fill=c("green","red"), horiz=TRUE)
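
The per-document counts plotted above depend on treating the "-" placeholder as zero words. A small sketch of that counting step on made-up lists (p_toy, n_toy and count_words are hypothetical names, not objects from the analysis):

```r
# Hypothetical per-document word lists; "-" marks a document with no matches
p_toy <- list(c("love", "gem"), "-", "fun")
n_toy <- list("-", c("bad", "slow", "dull"), "-")

count_words <- function(x) {
  # Check the length first so the comparison only runs on a single string
  if (length(x) == 1 && x == "-") 0L else length(x)
}

# vapply guarantees one integer per document
pcount_toy <- vapply(p_toy, count_words, integer(1))
ncount_toy <- vapply(n_toy, count_words, integer(1))
print(pcount_toy)
print(ncount_toy)
```
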

# Document sentiment running plot

plot(pol$all$polarity, type = "l", ylab = "Polarity Score",xlab = "Document Number")
abline(h=0)
title(main = "Polarity Plot" )

### Co-occurrence graph (COG) for sentiment-laden words ###

senti.dtm = cbind(pos.tdm, neg.tdm); dim(senti.dtm)
## [1] 120  39
senti.adj.mat = as.matrix(t(senti.dtm)) %*% as.matrix(senti.dtm)
diag(senti.adj.mat) = 0


distill.cog(senti.adj.mat,   # adjacency matrix object
            'Distilled COG of senti words',       # plot title
            5,       # max number of central nodes
            5)        # max number of connections
## Warning in vattrs[[name]][index] <- value: number of items to replace is
## not a multiple of replacement length
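
The adjacency matrix used for the graph comes from multiplying the transposed document-term matrix by itself, so each off-diagonal entry adds up, over all documents, the product of two words' counts. A minimal sketch on a made-up matrix (toy_dtm is hypothetical; crossprod(X) is base R shorthand for t(X) %*% X):

```r
# Hypothetical document-term matrix: 3 documents x 3 sentiment words
toy_dtm <- matrix(c(1, 1, 0,
                    0, 2, 1,
                    1, 0, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("doc", 1:3),
                                  c("love", "funny", "bad")))

# Word-by-word co-occurrence weights: same as t(toy_dtm) %*% toy_dtm
adj <- crossprod(toy_dtm)

# Zero the diagonal so a word is not linked to itself
diag(adj) <- 0
print(adj)
```

Larger entries mean two words tend to appear in the same reviews, which is what the edges of the graph are built from.
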

Summary

Overall, the audience is positive about the movie (the average polarity is about 0.27), which makes it a reasonable candidate for a sequel in the first place. Based on the analysis, the main attributes I would recommend focusing on are:

1) The cast (Aamir, Kareena, Madhavan, Sharman, Boman Irani)
2) Comedy (suggested by words such as funny, humour, hilarious)
3) Direction (including screenplay and story line)
