Extract 12 Angry Men movie reviews and corresponding ratings from the IMDb website
#####################################
# Extract 12 Angry Men Movie
counts = c(0,10,20,30,40,50,60,70,80,90)
ratings = NULL
reviews = NULL
for (j in counts){
url1 = paste0("http://www.imdb.com/title/tt0050083/reviews?filter=love;filter=love;start=",j) # positive reviews
url2 = paste0("http://www.imdb.com/title/tt0050083/reviews?filter=hate;filter=hate;start=",j) # negative reviews
page1 = read_html(url1)
page2 = read_html(url2)
reviews1 = html_text( html_nodes( page1,'#tn15content p') )
reviews2 = html_text( html_nodes( page2,'#tn15content p') )
movie.nodes = html_nodes(page1,'h2 + img')
rating1 = gsub("/.*", "", html_attr(movie.nodes, name = 'alt')) # alt text looks like "9/10" or "10/10"; keep the part before the slash
movie.nodes = html_nodes(page2,'h2 + img')
rating2 = gsub("/.*", "", html_attr(movie.nodes, name = 'alt'))
ratings <- c(ratings,rating1,rating2)
reviews.positive = setdiff(reviews1, c("*** This review may contain spoilers ***","Add another review"))
reviews.negative = setdiff(reviews2, c("*** This review may contain spoilers ***","Add another review"))
reviews = c(reviews,reviews.positive,reviews.negative)
}
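Before writing the reviews out, a quick sanity check on what the loop collected is worthwhile (a small sketch, assuming all pages parsed cleanly):
length(reviews) # expect 200: 100 positive + 100 negative
length(ratings) # should match the number of reviews
table(ratings) # distribution of scraped ratings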
Create a text file that includes all 200 reviews: 100 positive and 100 negative
reviews = gsub("\n",' ',reviews)
writeLines(reviews,' 12 Angry Men (1957) .txt')
Create a function to clean the text
#--------------------------------------------------------#
# Step 0 - Create a function to clean data #
#--------------------------------------------------------#
text.clean = function(x) # text data
{ require("tm")
x = gsub("<.*?>", " ", x) # regex for removing HTML tags
x = iconv(x, "latin1", "ASCII", sub="") # Keep only ASCII characters
x = gsub("[^[:alnum:]]", " ", x) # keep only alpha numeric
x = tolower(x) # convert to lower case characters
x = removeNumbers(x) # removing numbers
x = stripWhitespace(x) # removing white space
x = gsub("^\\s+|\\s+$", "", x) # remove leading and trailing white space
return(x)
}
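A quick illustration of what text.clean() does to a raw snippet (output shown approximately):
text.clean("<b>12 Angry Men</b> is a GREAT film!!! 10/10")
## roughly: "angry men is a great film"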
#--------------------------------------------------------#
# Step 1 - Reading text data #
#--------------------------------------------------------#
temp.text = readLines('C:\\Users\\sreek\\Desktop\\Term1\\TA\\Individual Assignment\\ 12 Angry Men (1957) .txt') #12 Angry Men reviews text file
head(temp.text,1)
## [1] " An excellent courtroom drama with a unique twist. Instead of following the trial itself, the viewer has a unique chance to observe the events behind the closed doors of a jury room. The film begins with the end of the trial. The jurors retire to deliberate the case. A preliminary vote is taken and the result is 11:1 in favour of the guilty verdict. Eleven jurors have raised their hands to convict a young man of killing his father. Only Juror #8 has doubts. At first even he does not truly believe the young man to be innocent but notes (rightfully) that the case for the defence might have been presented in a more convincing manner and that the boy might be given the benefit of a doubt. Since the boy is to be executed if found guilty his life is now in the hands of the jury and juror #8 reasons that the least they could do is talk about the case a bit. As time goes on some of the jurors change their minds and find that there is perhaps enough reasonable doubt not to convict the young man after all. But not everyone is easy to convince.Although the plot of the film is excellent and it is fascinating to see what little things can influence which way a verdict goes, where this film really succeeds is in presenting the characters of the 12 jurors. The character of each of the jurors emerges through a wonderful mix of perfect casting, excellent dialogue and near-flawless acting.Juror #1 - a simple man who clearly does not understand the full complexity of the task that lies before him but is trying to do everything not to let anyone else find this out. He appears at ease only once during the film - when he talks about football. He has the misfortune to be selected foreman of the jury - a task he clearly does not relish.Juror #2 - a small, quite man, clearly unaccustomed to giving his own opinion much less to expecting his views to be of any importance. Apparently he finds solace in his job - he is an accountant.Juror #3 - probably the most complex personality in the film. Starts off like a pleasant self-made successful businessman, he analyses the case impartially, explains his arguments well and is reasonably self assured. As time goes on he becomes more and more passionate and seems to be somehow personally involved with the case. He also starts to show some signs of slight mental instability. Wonderfully played by Lee J. Cobb - this is the character you remember after the film is over.Juror #4 - self assured, slightly arrogant stockbroker. Obviously considers himself more intelligent than anyone else in the room, he approaches the case with cool heartless logic but (as one of the jurors says - \"this is not an exact science\") he does not take into account the feelings, the passions, the characters of the people involved in the case. He is conspicuous by the fact that he is the only juror that does not take his jacket off (it is a very hot day).Juror #5 - here is a man under great emotional stress. He comes from the same social background as the accused boy - with who he almost unwillingly seems to identify with. Paradoxically this appears one of the main reasons for him voting guilty - he does not want compassion to influence him - so ironically it does.Juror #6 - a simple man, quite readily admitting that everyone in the room is better qualified than he is to make decisions and offer explanations. But he really wants to see justice done and it worries him that he might make a mistake.Juror #7 - the only one that really has no opinion on this case. 
Literally throughout the film his thoughts are never on the case - he talks of baseball, of the heat, of fixing the fan but the only reason he has for voting this way or that is to speed things up a bit so he might be out of the jury room as soon as possible. Not an evil man he just has no sense of morality whatsoever - he can tell right from wrong but does not seem to think it's worth the bother.Juror #8- a caring man, has put more thought into the case than any of the other jurors. He tries to do his best even in the face of seemingly impossible odds.Juror #9 - a wise old man with his great life experience has quite a unique way of looking at the case.Juror #10 - the most horrifying character in the film. Votes guilty and does not even try to hide the fact that he does so only because of the boy's social background. The tragedy comes from the fact that his own social position is only a cut above the boy's - which makes him all the more eager to accentuate the difference.Juror #11 - an immigrant watchmaker, careful methodical man, well mannered and soft spoken. respects the right of people to have different opinion to his - and is willing to look at both sides of the problem. Loses his temper only once - horrified by the complete indifference of juror #7.Juror #12 - a young business type - perhaps he has his own opinions - but is careful to hide them. What he has learnt out of life seems to be that intelligence is equal with agreeing with what the majority of people think.The film succeeds in doing something very rare today - developing an intelligent plot while also developing 12 believable, memorable and distinct characters. "
data = data.frame(id = 1:length(temp.text), text = temp.text, stringsAsFactors = F)
Check the dimensions of the data
dim(data)
## [1] 200 2
Adding all the unnecessary words to the stop list. Since the movie is about a 12-member jury deciding the verdict for a young man, we add words like "jury" to the stop list, along with a few other obvious words:
1. angry
2. man
3. men
4. film
5. movie
6. twelve
7. movies
8. played
9. actors
10. cast
11. films
12. spoilers
13. juror, jurors, jury
14. watch
# Read Stopwords list
stpw1 = readLines('C:\\Users\\sreek\\Desktop\\Term1\\TA\\Individual Assignment\\stopwords.txt') # read in stopwords file
stpw2 = tm::stopwords('english') # tm package stop word list; tokenizer package has the same name function
stpw3 =c("angry","man","men","film","movie","twelve","movies", "played","actors", "cast", "films","spoilers","juror","jurors","jury", "watch")
comn = unique(c(stpw1, stpw2, stpw3)) # union of the three lists
stopwords = unique(gsub("'", " ", comn)) # final stop word list, with apostrophes replaced by spaces
x = text.clean(data$text ) # pre-process text corpus
x = removeWords(x,stopwords ) # removing stopwords created above
x = stripWhitespace(x ) # removing white space
# x = stemDocument(x)
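As a spot check that the custom stop list took effect, none of the removed domain words should survive in x (a quick check, not part of the original run):
sum(grepl("\\bjury\\b", x)) # expect 0 after removeWords()
sum(grepl("\\bmovie\\b", x)) # expect 0 as well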
Create DTM using text2vec package
#--------------------------------------------------------#
## Step 2: Create DTM using text2vec package #
#--------------------------------------------------------#
t1 = Sys.time()
tok_fun = word_tokenizer # using word & not space tokenizers
it_0 = itoken( x,
#preprocessor = text.clean,
tokenizer = tok_fun,
ids = data$id,
progressbar = T)
vocab = create_vocabulary(it_0, # func collects unique terms & corresponding statistics
ngram = c(2L, 2L) #,
#stopwords = stopwords
)
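Note that ngram = c(2L, 2L) collects bigrams only; to gather unigrams and bigrams in one vocabulary, c(1L, 2L) would be passed instead (an aside, not used here):
# vocab_mixed = create_vocabulary(it_0, ngram = c(1L, 2L)) # unigrams + bigrams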
pruned_vocab = prune_vocabulary(vocab, # filters input vocab & throws out v frequent & v infrequent terms
term_count_min = 10)
length(pruned_vocab); str(pruned_vocab)
## [1] 5
## List of 5
## $ vocab :Classes 'data.table' and 'data.frame': 35 obs. of 3 variables:
## ..$ terms : chr [1:35] "jack_warden" "young_boy" "shut_case" "guilty_vote" ...
## ..$ terms_counts: int [1:35] 26 12 10 10 26 10 17 10 59 50 ...
## ..$ doc_counts : int [1:35] 25 11 8 7 25 8 16 10 54 43 ...
## ..- attr(*, ".internal.selfref")=<externalptr>
## $ ngram : Named int [1:2] 2 2
## ..- attr(*, "names")= chr [1:2] "ngram_min" "ngram_max"
## $ document_count: int 200
## $ stopwords : chr(0)
## $ sep_ngram : chr "_"
## - attr(*, "class")= chr "text2vec_vocabulary"
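prune_vocabulary() also supports document-frequency bounds; for instance, terms occurring in more than half the documents could be dropped as well (an optional variant, not used here):
# pruned_vocab = prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.5)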
vectorizer = vocab_vectorizer(pruned_vocab) # creates a text vectorizer func used in constructing a dtm/tcm/corpus
dtm_0 = create_dtm(it_0, vectorizer) # high-level function for creating a document-term matrix
# Sort bi-grams in decreasing order of frequency
library(slam) # rollup() comes from the slam package
tsum = as.matrix(t(rollup(dtm_0, 1, na.rm=TRUE, FUN = sum))) # find sum of freq for each term
tsum = tsum[order(tsum, decreasing = T),] # terms in decreasing order of freq
head(tsum)
## henry_fonda lee_cobb sidney_lumet reasonable_doubt
## 106 59 50 43
## ed_begley black_white
## 30 30
tail(tsum)
## accused_guilty defendant_guilty joseph_sweeney john_fiedler
## 10 10 10 10
## robert_webber boy_accused
## 10 10
The code nicely creates bigrams for people's names like henry_fonda, lee_cobb, and sidney_lumet; it also does a good job of creating bigrams like reasonable_doubt, which reviewers commonly discuss.
#-------------------------------------------------------
# Code bi-grams as unigram in clean text corpus
text2 = x
text2 = paste("",text2,"")
pb <- txtProgressBar(min = 1, max = (length(tsum)), style = 3) ; i = 0
for (term in names(tsum)){
i = i + 1
focal.term = gsub("_", " ",term) # turn the bigram token back into its two-word form
replacement.term = term
text2 = gsub(paste("",focal.term,""),paste("",replacement.term,""), text2)
setTxtProgressBar(pb, i)
}
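A quick check that the fusion worked: the fused tokens should now appear verbatim in text2 (illustrative):
sum(grepl("henry_fonda", text2, fixed = TRUE)) # no. of docs mentioning the fused bigram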
it_m = itoken(text2, # function creates iterators over input objects to vocabularies, corpora, DTM & TCM matrices
# preprocessor = text.clean,
tokenizer = tok_fun,
ids = data$id,
progressbar = T)
vocab = create_vocabulary(it_m # vocab func collects unique terms and corresponding statistics
# ngram = c(2L, 2L),
#stopwords = stopwords
)
pruned_vocab = prune_vocabulary(vocab,
term_count_min = 1)
vectorizer = vocab_vectorizer(pruned_vocab)
dtm_m = create_dtm(it_m, vectorizer)
dim(dtm_m)
## [1] 200 5228
dtm = as.DocumentTermMatrix(dtm_m, weighting = weightTf)
a0 = (apply(dtm, 1, sum) > 0) # build vector to identify non-empty docs
dtm = dtm[a0,] # drop empty docs
print(difftime(Sys.time(), t1, units = 'sec'))
## Time difference of 2.603582 secs
# view a sample of the DTM, sorted from most to least frequent tokens
dtm = dtm[,order(apply(dtm, 2, sum), decreasing = T)] # sorting dtm's columns in decreasing order of column sums
inspect(dtm[1:5, 1:5]) # inspect() func used to view parts of a DTM object
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 13/12
## Sparsity : 48%
## Maximal term length: 5
## Weighting : term frequency (tf)
##
## Terms
## Docs room time case good great
## 1 4 2 11 0 2
## 2 0 0 2 0 0
## 3 2 3 0 0 0
## 4 0 2 1 0 1
## 5 0 0 2 1 2
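Since tm stores the DTM as a simple triplet matrix, its density (the share of non-zero cells) can be read straight off the v slot (a small sketch):
length(dtm$v) / prod(dim(dtm)) # density; 1 minus this is the sparsity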
#--------------------------------------------------------#
## Step 2a: # Build word cloud #
#--------------------------------------------------------#
# 1- Using Term frequency(tf)
tst = round(ncol(dtm)/100) # divide DTM's cols into 100 manageable parts
a = rep(tst,99)
b = cumsum(a);rm(a)
b = c(0,b,ncol(dtm))
ss.col = c(NULL)
for (i in 1:(length(b)-1)) {
tempdtm = dtm[,(b[i]+1):(b[i+1])]
s = colSums(as.matrix(tempdtm))
ss.col = c(ss.col,s)
}
tsum = ss.col
tsum = tsum[order(tsum, decreasing = T)] #terms in decreasing order of freq
head(tsum)
## room time case good great henry_fonda
## 190 124 119 117 113 106
tail(tsum)
## steven reflect balancing warming oftentimes spouts
## 1 1 1 1 1 1
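The 100-part loop above exists only to avoid densifying the whole DTM at once; slam, which tm builds on, computes the same column sums directly on the sparse representation. A sketch, assuming the two agree:
library(slam)
tsum.alt = sort(col_sums(dtm), decreasing = TRUE) # sparse column sums, no dense conversion
all.equal(as.numeric(tsum.alt), as.numeric(tsum)) # expected TRUE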
windows() # New plot window
wordcloud(names(tsum), tsum, # words, their freqs
scale = c(4, 0.5), # range of word sizes
1, # min.freq of words to consider
max.words = 200, # max #words
colors = brewer.pal(8, "Dark2")) # Plot results in a word cloud
title(sub = "Term Frequency - Wordcloud") # title for the wordcloud display
# plot barchart for top tokens
test = as.data.frame(round(tsum[1:15],0))
colnames(test) = "freq" # name the column so ggplot can map it
windows() # New plot window
ggplot(test, aes(x = rownames(test), y = freq)) +
geom_bar(stat = "identity", fill = "blue") +
geom_text(aes(label = freq), vjust= -0.20) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
In the word cloud above, created using TF, we can see words like room, henry_fonda, good, great, and case showing up.
# -------------------------------------------------------------- #
# step 2b - Using Term frequency inverse document frequency (tfidf)
# -------------------------------------------------------------- #
library(textir)
dtm.tfidf = tfidf(dtm, normalize= FALSE)
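For intuition, standard TF-IDF multiplies each term count by a log inverse-document-frequency factor; here is a hand-rolled sketch of the generic scheme (textir's exact scaling may differ in detail):
tf = as.matrix(dtm) # raw term frequencies
idf = log(nrow(tf) / colSums(tf > 0)) # log(N / document frequency)
tfidf.manual = sweep(tf, 2, idf, `*`) # weight each term column by its idf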
tst = round(ncol(dtm.tfidf)/100)
a = rep(tst, 99)
b = cumsum(a);rm(a)
b = c(0,b,ncol(dtm.tfidf))
ss.col = c(NULL)
for (i in 1:(length(b)-1)) {
tempdtm = dtm.tfidf[,(b[i]+1):(b[i+1])]
s = colSums(as.matrix(tempdtm))
ss.col = c(ss.col,s)
}
tsum = ss.col
tsum = tsum[order(tsum, decreasing = T)] #terms in decreasing order of freq
head(tsum)
## case great fonda good boy people
## 139.3708 123.5796 120.1981 117.9194 114.8574 113.6047
tail(tsum)
## steven reflect balancing warming oftentimes spouts
## 4.60517 4.60517 4.60517 4.60517 4.60517 4.60517
windows()
wordcloud(names(tsum), tsum, scale=c(4,0.5),1, max.words=200,colors=brewer.pal(8, "Dark2")) # Plot results in a word cloud
title(sub = "Term Frequency Inverse Document Frequency - Wordcloud")
as.matrix(tsum[1:20]) # to see the top few tokens & their TF-IDF scores
## [,1]
## case 139.37077
## great 123.57960
## fonda 120.19809
## good 117.91938
## boy 114.85743
## people 113.60475
## guilty 112.31175
## time 112.07966
## room 110.16551
## evidence 109.34743
## number 108.79122
## characters 106.43161
## made 104.56972
## story 103.72467
## make 97.52180
## murder 96.83839
## character 95.61680
## end 94.29516
## acting 91.35227
## henry_fonda 89.46083
(dtm.tfidf)[1:10, 1:10] # view first 10x10 cells in the DTM under TF IDF.
## 10 x 10 sparse Matrix of class "dgCMatrix"
##
## 1 2.3192740 1.8077364 12.883013 . 2.187249 . 3.513549
## 2 . . 2.342366 . . 0.8439701 1.171183
## 3 1.1596370 2.7116046 . . . 3.3758803 .
## 4 . 1.8077364 1.171183 . 1.093625 0.8439701 1.171183
## 5 . . 2.342366 1.007858 2.187249 0.8439701 .
## 6 . 1.8077364 . . . . 1.171183
## 7 0.5798185 . . 1.007858 1.093625 0.8439701 .
## 8 . 0.9038682 1.171183 . . 1.6879401 .
## 9 1.7394555 0.9038682 2.342366 2.015716 2.187249 1.6879401 1.171183
## 10 2.8990925 . 5.855915 . 2.187249 0.8439701 1.171183
##
## 1 3.325988 3.66234 .
## 2 5.543313 3.66234 1.44817
## 3 . 1.22078 .
## 4 4.434650 . 2.89634
## 5 . 1.22078 .
## 6 . . .
## 7 . . 2.89634
## 8 . . .
## 9 1.108663 . .
## 10 . 1.22078 2.89634
# plot barchart for top tokens
test = as.data.frame(round(tsum[1:15],0))
colnames(test) = "freq"
windows() # New plot window
ggplot(test, aes(x = rownames(test), y = freq)) +
geom_bar(stat = "identity", fill = "red") +
geom_text(aes(label = freq), vjust= -0.20) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
In the word cloud above, created using TF-IDF, words like accused, defendant, character, and classic show up. We can also observe that words that dominated under TF, such as henry_fonda and good, are not prominent here.
#------------------------------------------------------#
# step 2c - Term Co-occurance Matrix (TCM) #
#------------------------------------------------------#
vectorizer = vocab_vectorizer(pruned_vocab,
grow_dtm = FALSE,
skip_grams_window = 5L)
tcm = create_tcm(it_m, vectorizer) # func to build a TCM
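Note: in later text2vec releases the skip-gram arguments moved from vocab_vectorizer() to create_tcm(); under a newer version the equivalent calls would look roughly like this (an assumption based on the revised text2vec API):
# vectorizer = vocab_vectorizer(pruned_vocab)
# tcm = create_tcm(it_m, vectorizer, skip_grams_window = 5L)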
tcm.mat = as.matrix(tcm) # use tcm.mat[1:5, 1:5] to view
adj.mat = tcm.mat + t(tcm.mat) # since adjacency matrices are symmetric
z = order(colSums(adj.mat), decreasing = T)
adj.mat = adj.mat[z,z]
# Plot Simple Term Co-occurance graph
adj = adj.mat[1:30,1:30]
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:qdap':
##
## %>%, diversity
## The following object is masked from 'package:stringr':
##
## %>%
## The following objects are masked from 'package:text2vec':
##
## %>%, normalize
## The following object is masked from 'package:rvest':
##
## %>%
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
cog = graph.adjacency(adj, mode = 'undirected')
cog = simplify(cog)
cog = delete.vertices(cog, V(cog)[ degree(cog) == 0 ])
windows()
plot(cog)
The COG created here shows how the movie revolves around fonda and room; words like lee_cobb, verdict, and murder also show up.
#-----------------------------------------------------------#
# Step 2d - a cleaned up or 'distilled' COG plot #
#-----------------------------------------------------------#
distill.cog = function(mat1, # input TCM ADJ MAT
title, # title for the graph
s, # no. of central nodes
k1){ # max no. of connections
library(igraph)
a = colSums(mat1) # collect colsums into a vector obj a
b = order(-a) # nice syntax for ordering vector in decr order
mat2 = mat1[b, b] # order both rows and columns along vector b
diag(mat2) = 0
## +++ go row by row and find top k adjacencies +++ ##
wc = NULL
for (i1 in 1:s){
thresh1 = mat2[i1,][order(-mat2[i1, ])[k1]]
mat2[i1, mat2[i1,] < thresh1] = 0
mat2[i1, mat2[i1,] > 0 ] = 1
word = names(mat2[i1, mat2[i1,] > 0])
mat2[(i1+1):nrow(mat2), match(word,colnames(mat2))] = 0
wc = c(wc,word)
} # i1 loop ends
mat3 = mat2[match(wc, colnames(mat2)), match(wc, colnames(mat2))]
ord = colnames(mat2)[which(!is.na(match(colnames(mat2), colnames(mat3))))] # removed any NAs from the list
mat4 = mat3[match(ord, colnames(mat3)), match(ord, colnames(mat3))]
graph <- graph.adjacency(mat4, mode = "undirected", weighted=T) # Create Network object
graph = simplify(graph)
V(graph)$color = "pink" # default colour for peripheral nodes
V(graph)$color[1:s] = "green" # central nodes in green
graph = delete.vertices(graph, V(graph)[ degree(graph) == 0 ])
plot(graph,
layout = layout.kamada.kawai,
main = title)
} # func ends
windows()
distill.cog(tcm.mat, 'Distilled COG', 10, 5)
## adj.mat and distilled cog for tfidf DTMs ##
adj.mat = t(dtm.tfidf) %*% dtm.tfidf
diag(adj.mat) = 0
a1 = order(apply(adj.mat, 2, sum), decreasing = T) # use a fresh name; a0 still flags non-empty docs from Step 2
adj.mat = as.matrix(adj.mat[a1[1:50], a1[1:50]])
windows()
distill.cog(adj.mat, 'Distilled COG', 10, 10)
#--------------------------------------------------------#
# Step 3 correlation between polarity and rating #
#--------------------------------------------------------#
reviews_df <- data.frame(reviews,ratings)
polarity <- counts(polarity(reviews_df$reviews))[, "polarity"]
reviews_df$polarity <-polarity
# reviews_df --> first column contains the review, second the rating, third the polarity score
cor(as.numeric(reviews_df$ratings), reviews_df$polarity)
## [1] -0.1525462
We see that the correlation between ratings and polarity is a small negative number. This suggests a slight negative relationship between polarity and rating, but since the magnitude is so small we can treat the two as effectively independent.
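Before declaring the two independent, a formal test is cheap; cor.test() reports a p-value and confidence interval for the correlation (a follow-up not in the original run):
cor.test(as.numeric(reviews_df$ratings), reviews_df$polarity)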
#--------------------------------------------------------#
# Sentiment Analysis #
#--------------------------------------------------------#
library(qdap)
x1 = x[a0] # remove empty docs from corpus (a0 is the non-empty flag built in Step 2)
t1 = Sys.time() # set timer
pol = polarity(x1) # Calculate the polarity from qdap dictionary
wc = pol$all[,2] # Word Count in each doc
val = pol$all[,3] # average polarity score
p = pol$all[,4] # Positive words info
n = pol$all[,5] # Negative Words info
dim(pol) # NULL, since pol is a list rather than a matrix
## NULL
Sys.time() - t1 # how much time did the above take?
## Time difference of 1.042037 mins
head(pol$all)
## all wc polarity
## 1 all 50 -0.2828427
## 2 all 167 0.9750173
## 3 all 212 -0.4120817
## 4 all 50 1.5273506
## 5 all 58 -1.0504515
## 6 all 49 2.4571429
## pos.words
## 1 thrilling, convincing, tough
## 2 uplifting, brilliant, flawless, grace, masterfully, greatest, contribution, important, legendary, incredibly, realistic, genius, astoundingly, brilliant, impressed, powerful, inexpensive, awards, wonderful, loves, love, superb, greatest, achievements, honor
## 3 works, brilliant, great, decent, gratitude, kindness, reasonable, contribution, helped, great, wonderful, shine, marvel, effective
## 4 loved, good, effectively, proves, good, beautiful, good, marvelous, achievement, sufficient, clever, intelligent, great
## 5 fairness, clean
## 6 masterpiece, bright, great, great, great, great, excellent, good, good, great, good, good
## neg.words
## 1 guilty, guilty, guilty, guilty, boring
## 2 plot, simplistic, drain, mystery, killing, guilty, guilty, doubts, tension, complex
## 3 murder, guilty, lost, stifling, uncomfortable, worst, guilty, guilty, prejudices, biases, murder, debt, doubt, killed, estranged, conflict, intense, turmoil, false, poor
## 4 bores, twists, blow
## 5 struck, bad, plot, bothered, knife, critical, omission, plot, critic, fooled
## 6 confusing, guilty
## text.var
## 1 classifying genre drama rate thriller action action thrilling guns knives words considered member convinced points provided member plots scenes vote guilty guilty base ball match concludes convincing points ideas stand voting guilty guilty equal part felt boring single minute screenplay diamond crown tough make viewers watchers sit made room people
## 2 rarely uplifting brilliant time intrigued flawless plot dialogue acting simplistic story set room surprising sidney lumet drain emotions leave edge seat suspense mystery acting bound grace silver screen boy day trial killing father heat domestic arguments forced present verdict guilty ticket electric chair boy decide quickly end discussion raise hands find thinks boy guilty henry fonda put hand trial character revelations doubts possibilities follow masterfully crafted time includes character development sidney lumet expert field greatest contribution hollywood history important contributions world cinema henry fonda lee cobb made legendary incredibly realistic performances casting genius dialogue astoundingly riveting brilliant finale impressed personally camera angles movements made suspenseful black white made powerful music minimal gave atmospheric experience room feel tension built proceeds inexpensive simple setting world talking academy awards nominations rolling henry fonda complete form rarely hypnotized lawrence arabia wonderful life mind definitive viewing loves sums love technical point view superb acting simple complex character driven story platinum greatest cinematic achievements time bar statue erected sydney lumet honor henry fonda
## 3 decided recently showed cable channel sidney lumet packs lot power times changed justice system works screen play reginald rose shows brilliant insight human beings called sit murder case read case court accused staring panel hands fate rests male public defendant appointed defend accused realize great job room convinced boy guilty proved innocent settling chairs deliberating table standing window streets lost thought stifling room days air conditioned sweating uncomfortable room worst days summer proceed preliminary vote number casts guilty vote shocks room dare majority stand open shut case guy guilty begins deliberations bring prejudices biases determine boy electric chair appearances case decent mind facing rest accused murder authorities committed process understand meaning justice drama owes henry fonda debt gratitude actor playing mr fonda exudes kindness person room convinced reasonable doubt boy killed father make contribution actor helped produced lee cobb great american century sees accused young estranged son couple years left home conflict mr cobb opposite mr fonda intense performance show turmoil rest wonderful moment shine single moment rings false reading couple comments people marvel women accused case presented defense attorney effective remember times action takes place fact poor accused attorney interested defense young minority migrant group elicit sympathy prosecution goal selection serving court justice mr lumet mr rose created timeless standard judged
## 4 bores loved proceeds good pace roles effectively proves make good large budgets beautiful locations good script characters courtroom make marvelous achievement grabs attention start makes person gong change mind people short time journey manipulating minds sufficient amount twists clever inferences find clues small details blow mind intelligent script great performance
## 5 characters thick nuanced recently time struck dimensional minutes easily decipher bad guy bastion choirboy fairness innocence contained nature plot room character development caricatures thing bothered pays attention method reason fact early stated knife wiped clean police witnesses boy allegedly fled bit incongruent brought critical review brought major factor case overlooked omission discussion makes plot thin characters critic fooled
## 6 masterpiece bright simple confusing great script great directing great great show thousand huge settings make excellent picture good story boy sidney lumet good fonda great believes accused innocent convinced lee cobb equally good convinced boy guilty fonda convince facts believing word remake good buffs strongly advised accept substitutes costs
head(pol$group)
## all total.sentences total.words ave.polarity sd.polarity
## 1 all 5228 19958 0.03687127 0.7818314
## stan.mean.polarity
## 1 0.04716013
positive_words = unique(setdiff(unlist(p),"-")) # Positive words list
negative_words = unique(setdiff(unlist(n),"-")) # Negative words list
print(positive_words) # Print all the positive words found in the corpus
## [1] "thrilling" "convincing" "tough"
## [4] "uplifting" "brilliant" "flawless"
## [7] "grace" "masterfully" "greatest"
## [10] "contribution" "important" "legendary"
## [13] "incredibly" "realistic" "genius"
## [16] "astoundingly" "impressed" "powerful"
## [19] "inexpensive" "awards" "wonderful"
## [22] "loves" "love" "superb"
## [25] "achievements" "honor" "works"
## [28] "great" "decent" "gratitude"
## [31] "kindness" "reasonable" "helped"
## [34] "shine" "marvel" "effective"
## [37] "loved" "good" "effectively"
## [40] "proves" "beautiful" "marvelous"
## [43] "achievement" "sufficient" "clever"
## [46] "intelligent" "fairness" "clean"
## [49] "masterpiece" "bright" "excellent"
## [52] "honest" "success" "delicate"
## [55] "diligence" "intelligence" "work"
## [58] "wisely" "nifty" "deservedly"
## [61] "tidy" "convenient" "formidable"
## [64] "cool" "logical" "favorite"
## [67] "top" "believable" "excited"
## [70] "interesting" "treasure" "incredible"
## [73] "heaven" "terrific" "enjoy"
## [76] "advantage" "expertly" "accomplished"
## [79] "talented" "leads" "genuine"
## [82] "admire" "fantastic" "appreciated"
## [85] "adored" "lead" "revelation"
## [88] "defeat" "wealthy" "easier"
## [91] "righteous" "boundless" "protect"
## [94] "win" "trust" "strong"
## [97] "rich" "fun" "talent"
## [100] "enthusiasm" "correct" "wonderfully"
## [103] "amazing" "relaxed" "modern"
## [106] "profound" "solid" "wise"
## [109] "educated" "recommend" "enjoyed"
## [112] "favour" "rightfully" "benefit"
## [115] "easy" "fascinating" "succeeds"
## [118] "perfect" "ease" "relish"
## [121] "solace" "pleasant" "successful"
## [124] "impartially" "passionate" "hot"
## [127] "compassion" "readily" "qualified"
## [130] "morality" "worth" "eager"
## [133] "soft" "memorable" "privilege"
## [136] "conveniently" "poetic" "rational"
## [139] "piety" "privileged" "transparent"
## [142] "pretty" "popular" "solidarity"
## [145] "cheer" "cheapest" "spectacular"
## [148] "fast" "fine" "thoughtful"
## [151] "quiet" "astonishing" "brilliance"
## [154] "nicely" "beautifully" "satisfying"
## [157] "meaningful" "recommended" "favor"
## [160] "blockbuster" "noble" "straighten"
## [163] "positive" "superbly" "suffice"
## [166] "exceptional" "supporting" "originality"
## [169] "sharp" "likes" "impress"
## [172] "gems" "progressive" "brilliantly"
## [175] "master" "respect" "won"
## [178] "exciting" "classic" "prefer"
## [181] "clear" "skillful" "fair"
## [184] "free" "fortunately" "inspiration"
## [187] "miraculously" "fantastically" "superior"
## [190] "inspirational" "courageous" "stunned"
## [193] "courage" "entertaining" "succeeded"
## [196] "handy" "hopeful" "calm"
## [199] "supremely" "hero" "sane"
## [202] "reasoned" "outstandingly" "appeal"
## [205] "masterpieces" "outstanding" "awarded"
## [208] "mesmerized" "helpful" "pure"
## [211] "masterful" "amazed" "influential"
## [214] "finest" "respectful" "principled"
## [217] "amusing" "decency" "support"
## [220] "endearing" "passion" "supports"
## [223] "darling" "famous" "cleverly"
## [226] "beloved" "qualify" "steadfast"
## [229] "advocate" "significant" "thrilled"
## [232] "steadfastly" "sharper" "smart"
## [235] "captivating" "instantly" "stunning"
## [238] "ready" "proving" "pride"
## [241] "insightful" "stimulating" "variety"
## [244] "regard" "decisive" "intricate"
## [247] "commendable" "appealing" "worked"
## [250] "intriguing" "reputation" "distinctive"
## [253] "vivid" "thankful" "impartial"
## [256] "openly" "guarantee" "nice"
## [259] "coolest" "awesome" "revolutionary"
## [262] "top notch" "faster" "enthrall"
## [265] "greatness" "led" "worthwhile"
## [268] "excellence" "glad" "glowing"
## [271] "genial" "superiority" "enjoyable"
## [274] "impeccable" "fabulous" "improvements"
## [277] "charm" "happy" "hottest"
## [280] "relief" "accurate" "elevate"
## [283] "talents" "dynamic" "merciful"
## [286] "righteousness" "ideal" "earnest"
## [289] "honesty" "polished" "sensitive"
## [292] "wholesome" "remarkable" "perfection"
## [295] "exceedingly" "delightfully" "versatile"
## [298] "confidence" "noteworthy" "magic"
## [301] "sublime" "strongest" "fascination"
## [304] "fortunate" "reliable" "praise"
## [307] "fans" "simplest" "fairly"
## [310] "accomplishment" "leading" "successfully"
## [313] "appreciable" "comprehensive" "enhance"
## [316] "winning" "accolades" "simpler"
## [319] "clear cut" "lavish" "clearer"
## [322] "faultless" "exemplary" "engaging"
## [325] "entrancing" "amazingly" "judicious"
## [328] "astounding" "conscientious" "applaud"
## [331] "finer" "freed" "supported"
## [334] "golden" "promising" "endorsed"
## [337] "virtue" "prefers" "wow"
## [340] "perfectly" "super" "impressive"
## [343] "authentic" "sincere" "congratulations"
## [346] "wisdom" "polite" "sufficiently"
## [349] "understandable" "brave" "comfort"
## [352] "liking" "abundance" "stellar"
## [355] "efficient" "gains" "heroic"
## [358] "engrossing" "swift" "accomplish"
## [361] "properly" "powerfully" "victory"
## [364] "phenomenal" "magnificent" "enthralled"
## [367] "paradise" "grateful" "peace"
## [370] "supreme" "faith" "reforming"
## [373] "meticulous" "straightforward" "lover"
## [376] "splendid" "acclaimed" "creative"
## [379] "dominate" "freedom" "faithful"
## [382] "dazzling" "sweeping" "veritable"
## [385] "impassioned" "lean" "infallibility"
## [388] "richly" "assuredly" "safe"
## [391] "exceptionally" "thrills" "enrich"
## [394] "inspiring" "flashy" "sexy"
## [397] "silent" "constructive" "adore"
## [400] "beauty" "valuable" "magnificently"
## [403] "precise" "rapt" "patience"
## [406] "preferring" "happily" "champion"
## [409] "glee" "modest" "reward"
## [412] "mastery" "snappy" "destiny"
## [415] "entertain" "striking" "instructive"
## [418] "correctly" "adequate" "awe"
## [421] "smooth" "pinnacle" "refreshing"
## [424] "vibrant" "fascinate" "notably"
## [427] "tremendously" "joy" "nobly"
## [430] "rightly" "economical" "kindly"
## [433] "encouraging" "fancy" "seamless"
## [436] "intrigue" "excitement" "innovative"
## [439] "admirable" "affirmation" "foolproof"
## [442] "goodness" "smile" "permissible"
## [445] "famed" "fond" "celebrated"
## [448] "neatly" "finely" "capable"
## [451] "admiration" "pluses" "improve"
## [454] "improvement" "humble" "honorable"
## [457] "loving" "hard working" "quieter"
## [460] "suitable" "adventurous" "trusting"
## [463] "empathy" "gained" "unforgettable"
## [466] "advanced" "attractive" "positives"
## [469] "proper" "triumph" "ingenious"
## [472] "spellbound" "warm" "improved"
## [475] "distinguished" "effortlessly" "staunchly"
## [478] "gratifying" "satisfactory" "foremost"
## [481] "rewarding" "consistently" "immaculate"
## [484] "renaissance" "unparalleled" "splendidly"
## [487] "prudent" "illuminating" "extraordinary"
## [490] "wins"
print(negative_words) # Print all neg words
## [1] "guilty" "boring" "plot"
## [4] "simplistic" "drain" "mystery"
## [7] "killing" "doubts" "tension"
## [10] "complex" "murder" "lost"
## [13] "stifling" "uncomfortable" "worst"
## [16] "prejudices" "biases" "debt"
## [19] "doubt" "killed" "estranged"
## [22] "conflict" "intense" "turmoil"
## [25] "false" "poor" "bores"
## [28] "twists" "blow" "struck"
## [31] "bad" "bothered" "knife"
## [34] "critical" "omission" "critic"
## [37] "fooled" "confusing" "death"
## [40] "broken" "confined" "scary"
## [43] "hardened" "breaking" "repetitive"
## [46] "prejudice" "ruin" "cramped"
## [49] "hurt" "disputed" "ignorance"
## [52] "hate" "crime" "defy"
## [55] "quibble" "wrath" "opponent"
## [58] "poorly" "revolting" "sweaty"
## [61] "oppressive" "static" "loud"
## [64] "abused" "trouble" "refuses"
## [67] "stubborn" "badly" "inability"
## [70] "broke" "pretentious" "inconsistencies"
## [73] "strain" "nemesis" "revenge"
## [76] "fallacy" "slander" "guilt"
## [79] "dangerous" "kill" "suffering"
## [82] "punish" "overwhelming" "fall"
## [85] "irrational" "antithetical" "erroneous"
## [88] "helpless" "bias" "heartbreaking"
## [91] "sad" "greed" "danger"
## [94] "childish" "crap" "die"
## [97] "awful" "warning" "hype"
## [100] "bored" "wasted" "discomfort"
## [103] "belittle" "mock" "refute"
## [106] "shame" "misconceptions" "tense"
## [109] "troubling" "ruined" "difficulty"
## [112] "stab" "damaged" "ridiculous"
## [115] "refuted" "crash" "fleeing"
## [118] "shabby" "slowly" "stresses"
## [121] "stiff" "choleric" "stupid"
## [124] "damn" "violent" "lose"
## [127] "difficult" "disliked" "twist"
## [130] "lies" "misfortune" "unaccustomed"
## [133] "instability" "arrogant" "heartless"
## [136] "conspicuous" "stress" "unwillingly"
## [139] "paradoxically" "ironically" "worries"
## [142] "mistake" "evil" "wrong"
## [145] "bother" "impossible" "horrifying"
## [148] "tragedy" "problem" "loses"
## [151] "temper" "horrified" "indifference"
## [154] "denying" "blah" "unfounded"
## [157] "sin" "tragic" "conservative"
## [160] "bigotry" "racist" "racism"
## [163] "issues" "skepticism" "falling"
## [166] "pretend" "intimidation" "breaks"
## [169] "excuse" "brute" "coercion"
## [172] "sanctimonious" "falls" "discredit"
## [175] "unpopular" "protest" "opposition"
## [178] "discouraging" "pandering" "murky"
## [181] "dislike" "tired" "waste"
## [184] "slow" "dull" "cheap"
## [187] "frustrating" "mediocre" "murderer"
## [190] "lack" "lacking" "long time"
## [193] "hated" "undermined" "accusation"
## [196] "utterly" "mocked" "excruciatingly"
## [199] "inadequate" "wrongly" "sketchy"
## [202] "bleeding" "crack" "scream"
## [205] "louder" "laughable" "ridicule"
## [208] "oddly" "odd" "annoying"
## [211] "overrated" "hell" "reluctance"
## [214] "ignorant" "concerns" "despair"
## [217] "scars" "flaws" "tragically"
## [220] "worse" "ironic" "grumpy"
## [223] "dark" "tiresome" "conflicts"
## [226] "unexpectedly" "pity" "unpredictable"
## [229] "ashamed" "die hard" "hang"
## [232] "shortcomings" "criticism" "error"
## [235] "horrible" "pathetic" "weak"
## [238] "inevitable" "cheesy" "penalty"
## [241] "volatile" "ordeal" "annoyed"
## [244] "smell" "stupidity" "incompetent"
## [247] "disappointed" "absurd" "failing"
## [250] "flaw" "impatient" "rude"
## [253] "spoil" "hard" "propaganda"
## [256] "sadly" "excuses" "deluded"
## [259] "fell" "anger" "criminal"
## [262] "abusive" "undesirable" "fallen"
## [265] "moronic" "illegally" "silly"
## [268] "fears" "suspicious" "implausible"
## [271] "missed" "deterrent" "futile"
## [274] "refuse" "muddy" "insane"
## [277] "thug" "dumb" "solemn"
## [280] "adamant" "taut" "bore"
## [283] "shake" "bloody" "apocalypse"
## [286] "creeping" "disgusted" "reactionary"
## [289] "diatribe" "stuck" "frustrated"
## [292] "intolerance" "insecure" "disgraceful"
## [295] "disgrace" "critics" "stereotype"
## [298] "preoccupy" "pointless" "ambivalence"
## [301] "grate" "petty" "lone"
## [304] "dissenter" "failed" "intrusive"
## [307] "hesitant" "unsure" "concerned"
## [310] "isolated" "loneliness" "unable"
## [313] "deceptively" "devil" "reluctantly"
## [316] "perversely" "lacks" "useless"
## [319] "ulterior" "cave" "cold"
## [322] "harassed" "sarcastic" "desperate"
## [325] "vile" "derision" "terribly"
## [328] "object" "tedious" "unravel"
## [331] "insurmountable" "prejudicial" "complication"
## [334] "succumb" "desperately" "arrogance"
## [337] "expensive" "funny" "startling"
## [340] "stuffy" "disadvantaged" "stern"
## [343] "resistance" "spiteful" "unusual"
## [346] "temptation" "upset" "shocked"
## [349] "uncertain" "chaotic" "messed"
## [352] "imperfections" "mistakes" "ranting"
## [355] "raving" "plea" "fake"
## [358] "lazy" "unknown" "suffer"
## [361] "disappointment" "din" "irritating"
## [364] "blame" "distracting" "melancholy"
## [367] "stereotypical" "angrily" "break"
## [370] "fail" "criticized" "misses"
## [373] "obnoxious" "warned" "freaking"
## [376] "afraid" "darkness" "crept"
## [379] "hostile" "trick" "hung"
## [382] "blinding" "terrible" "hopeless"
## [385] "sham" "anxiously" "fragile"
## [388] "issue" "miss" "mysterious"
## [391] "catastrophic" "hasty" "adversity"
## [394] "precarious" "ambiguity" "insatiable"
## [397] "desperation" "scarcely" "twisted"
## [400] "warped" "uncomfortably" "discrimination"
## [403] "suspect" "lurking" "excessively"
## [406] "heavy handed" "unrealistic" "unexpected"
## [409] "dispute" "dubious" "ignore"
## [412] "hefty" "disappoint" "unsatisfactory"
## [415] "unbelievable" "abrasive" "fleeting"
## [418] "tortured" "appalling" "troubled"
## [421] "pervasive" "controversial" "controversy"
## [424] "ploy" "absence" "disaster"
## [427] "criticisms" "unbelievably" "complicated"
## [430] "overblown" "overbearing" "gruesome"
## [433] "sloppy" "limited" "excruciating"
## [436] "punch" "problematic" "confused"
## [439] "toxic" "disregard" "bizarre"
## [442] "dead" "unreasonable" "crazy"
## [445] "hollow" "flimsy" "shocking"
## [448] "nefarious" "undermine" "creepy"
## [451] "vague" "vulgar" "meaningless"
## [454] "obscures" "creep" "problems"
## [457] "despised" "disbelief" "bullying"
## [460] "bland" "scared" "smugly"
## [463] "destruction" "loose" "injustice"
## [466] "haste" "extraneous" "excessive"
## [469] "mad" "overacted" "indiscernible"
## [472] "inevitably" "unreliable" "inconsistent"
## [475] "unwilling" "misgivings" "ambiguous"
## [478] "denies" "implication" "spite"
## [481] "passive" "aggressive" "insult"
## [484] "clash" "died" "drawback"
## [487] "disagree" "ominous" "explosive"
## [490] "offensive" "unnecessary" "fictional"
## [493] "miserable" "hatred" "disturbing"
## [496] "dust" "swelling" "lying"
## [499] "rough" "refused" "raped"
## [502] "erosion" "poorest" "dirty"
## [505] "wildly" "fear" "strange"
## [508] "abuse" "unfortunate" "brash"
## [511] "frustration" "drag" "bitter"
## [514] "cocky" "cloud" "interferes"
## [517] "upsets" "arbitrary" "burden"
## [520] "irritated" "spoiled" "idiots"
## [523] "nightmare" "steals" "hateful"
## [526] "alienated" "spews" "poverty"
## [529] "accuse" "exhausts" "criticize"
## [532] "subjugate" "detriment" "vengeful"
## [535] "enraged" "cynical" "extravagant"
## [538] "exploitation" "premeditated" "unsuccessful"
## [541] "cumbersome" "unwillingness" "bully"
## [544] "partiality" "manic" "brutally"
## [547] "deplorable" "outburst" "illogical"
## [550] "losing" "antagonist" "disrespectful"
## [553] "picket" "regret" "risky"
## [556] "vindictive" "poison" "anxiety"
## [559] "worthless" "negative" "destroy"
## [562] "manipulation" "vengeance" "dense"
## [565] "faults" "stark" "attack"
## [568] "struggle" "strained" "fatally"
## [571] "imprisonment" "bickering" "flaunt"
## [574] "gross" "questionable" "improper"
## [577] "grievous" "manipulative" "despicable"
## [580] "weaker" "worried" "mindless"
## [583] "heck" "draining" "killer"
## [586] "convoluted" "egregious" "nonsense"
## [589] "horrific" "sceptical" "strangely"
## [592] "unresolved" "threaten" "overwhelm"
## [595] "outbursts" "mundane" "fails"
## [598] "mistaken" "naive" "annoyance"
## [601] "insecurity" "uncaring" "smoke"
## [604] "trapped" "villains" "hobble"
## [607] "aggravation" "outsider" "tumble"
## [610] "seriousness" "puzzled" "rage"
## [613] "chatter" "aspersions" "awkward"
## [616] "ugly" "unconvincing" "absent minded"
## [619] "timid" "subversive" "deprived"
## [622] "isolation" "deceive" "absurdity"
## [625] "frail" "undetermined" "friction"
## [628] "drab" "impropriety" "ruins"
## [631] "fiction" "distortion" "rebuke"
## [634] "interfere" "antagonistic" "harbors"
## [637] "riled" "rife" "anti"
## [640] "extremists" "lengthy" "drawbacks"
## [643] "conflicting" "shallow" "repulsive"
## [646] "weaknesses" "crowded" "idiot"
## [649] "jaded" "bashed" "unconfirmed"
## [652] "distorted" "glare" "ludicrous"
## [655] "commonplace" "rift" "pig"
## [658] "unnoticed" "errors" "knock"
## [661] "vain" "flawed" "objection"
## [664] "pains" "insufferably" "provoke"
## [667] "weary" "mistrust" "anxious"
## [670] "drags" "provocative" "limitations"
## [673] "shaky" "overdone" "boil"
## [676] "collapse" "noise" "vibrate"
## [679] "insignificant" "reluctant" "motley"
## [682] "violently" "stormy" "darker"
## [685] "sly" "exhaustion" "incapable"
## [688] "strangest" "biased" "confrontation"
## [691] "defensive" "prison" "limits"
## [694] "adversary" "neglect" "struggling"
## [697] "condemn" "dissent" "failure"
#--------------------------------------------------------#
# Create Postive Words wordcloud #
#--------------------------------------------------------#
pos.tdm = dtm[,which(colnames(dtm) %in% positive_words)]
m = as.matrix(pos.tdm)
v = sort(colSums(m), decreasing = TRUE)
windows() # opens new image window
wordcloud(names(v), v, scale=c(4,1),1, max.words=100,colors=brewer.pal(8, "Dark2"))
title(sub = "Positive Words - Wordcloud")
# plot barchart for top tokens
test = as.data.frame(v[1:15])
colnames(test) = "freq"
windows() # opens new image window
ggplot(test, aes(x = rownames(test), y = freq)) +
geom_bar(stat = "identity", fill = "blue") +
geom_text(aes(label = freq), vjust= -0.20) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#--------------------------------------------------------#
# Create Negative Words wordcloud #
#--------------------------------------------------------#
neg.tdm = dtm[,which(colnames(dtm) %in% negative_words) ]
m = as.matrix(neg.tdm)
v = sort(colSums(m), decreasing = TRUE)
windows()
wordcloud(names(v), v, scale=c(4,1),1, max.words=100,colors=brewer.pal(8, "Dark2"))
title(sub = "Negative Words - Wordcloud")
# plot barchart for top tokens
test = as.data.frame(v[1:15])
colnames(test) = "freq"
windows()
ggplot(test, aes(x = rownames(test), y = freq)) +
geom_bar(stat = "identity", fill = "red") +
geom_text(aes(label = freq), vjust= -0.20) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#--------------------------------------------------------#
# Positive words vs Negative Words plot #
#--------------------------------------------------------#
len = function(x){
if (length(x) == 1 && x == "-") {return(0)} # "-" marks docs with no matched words
else {return(length(unlist(x)))}
}
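A couple of examples of len() on the two cases it handles (qdap marks docs with no matched words with "-"):
len("-") # 0: no sentiment words found in the doc
len(c("great", "superb")) # 2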
pcount = unlist(lapply(p, len))
ncount = unlist(lapply(n, len))
doc_id = seq_along(wc)
windows()
plot(doc_id,pcount,type="l",col="green",xlab = "Document ID", ylab= "Word Count")
lines(doc_id,ncount,type= "l", col="red")
title(main = "Positive words vs Negative Words" )
legend("topright", inset=.05, c("Positive Words","Negative Words"), fill=c("green","red"), horiz=TRUE)
# Document sentiment running plot
windows()
plot(pol$all$polarity, type = "l", ylab = "Polarity Score",xlab = "Document Number")
abline(h=0)
title(main = "Polarity Plot" )
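Optionally, a lowess smoother over the running polarity makes any drift across the 200 documents easier to see (a one-line addition to the plot above):
lines(lowess(pol$all$polarity), col = "blue") # smoothed sentiment trend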
### COG for sentiment-laden words ? ###
senti.dtm = cbind(pos.tdm, neg.tdm); dim(senti.dtm)
## [1] 200 1182
senti.adj.mat = as.matrix(t(senti.dtm)) %*% as.matrix(senti.dtm)
diag(senti.adj.mat) = 0
windows()
distill.cog(senti.adj.mat, # ad mat obj
'Distilled COG of senti words', # plot title
5, # max #central nodes
5) # max no. of connections
After observing the word clouds and the different COGs throughout this document, the following are critical points to consider if we were to make a sequel:
1. The characters played by Henry Fonda and Lee J. Cobb are very well drawn and are an important part of the movie's success, so we might want to give comparable characters similar weight in any sequel.
2. The word room shows up prominently in the cloud, reflecting how unusual it is for a film to take place almost entirely in a single room; the single-room setting is clearly a catchy point among viewers.
3. Words like guilt, murder, and dangerous appear in the negative word list, but this does not mean viewers received the film negatively; tackling such topics while telling the story without violence is also a point to ponder.
4. The distilled COGs have central nodes like guilty, great, and good, and peripheral nodes like death, tension, and prejudice, which suggests these are critical elements when considering a sequel.
5. The COGs also include sensitive words like death, love, and crime; these are compelling topics that should be considered while making a sequel.