Extraction of 100 reviews for Braveheart Movie

library("rvest")
## Loading required package: xml2
######################################

counts = c(0,10,20,30,40)
reviews = NULL
ratings = NULL
for (j in counts){
  url1 = paste0("http://www.imdb.com/title/tt0112573/reviews?filter=love;filter=love;start=",j)
  url2 = paste0("http://www.imdb.com/title/tt0112573/reviews?filter=hate;filter=hate;start=",j)
  
  page1 = read_html(url1)
  page2 = read_html(url2)
  reviews1 = html_text(html_nodes(page1,'#tn15content div+ p'))
  reviews2 = html_text(html_nodes(page2,'#tn15content div+ p'))
  
  reviews.positive = setdiff(reviews1, c("*** This review may contain spoilers ***","Add another review"))
  reviews.negative = setdiff(reviews2, c("*** This review may contain spoilers ***","Add another review"))
  
  # ratings1 = gsub('/','',substr(html_attr(html_nodes(page1,'h2+ img'),name='alt'),0,2))
  # ratings2 = gsub('/','',substr(html_attr(html_nodes(page2,'h2+ img'),name='alt'),0,2))
  # ratings = c(ratings,ratings1,ratings2)
  reviews = c(reviews,reviews.positive,reviews.negative)
  #new = data.frame(reviews,ratings) 
  
}

reviews = gsub("\n",' ',reviews)

head(reviews)
## [1] " Most on this site pick the Godfather, or the Shawshank Redemption, but this is it, this is the best film ever made. People will complain, will argue that I am wrong, but I will say it again...Braveheart is as close to perfection as a movie can be. The acting is superb, the man who played Lonshanks, the actor who portrayed Robert the Bruce, both should have been nominated for Oscars due to their powerful rendering of evil and a man who is saved from losing his humanity (from becoming evil) by meeting William Wallace. And let us not forget the direction, the cinematography. Braveheart is glorious, beautiful to look at. The slow motion pictures of horses preparing to charge armed combatants, the entire landscape of Scotland that Mel Gibson captures with the camera. Braveheart is artwork, it is as good as any picture. That the film is number 93 on the list of the top 250 movies ever is a shame. Yes there is violence in this film but that violence does serve a point...that freedom isn't free and sometimes it takes death, gruesome and horrible, to let ones people taste what it is like to be free. Braveheart is a great movie and it deserves to at least be in the top ten of IMDb's list of greatest films. "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [2] " I remember seeing this movie for the first time in late 2003, and I was impressed. I saw it again last night, and I was even more impressed. The acting is amazing, and the ending was brilliant. For me, all my guesses were incorrect. Everything that happens in this movie in unpredicted. The last half hour itself was highly unpredictable, and it had a powerful message. When a scene was meant to be dramatic, they did a great job at it. I don't know about everybody else, but the ending did make me cry. The message the movie sent kept me thinking for a while. The amount of courage and bravery was inconceivable, there was barely any faults or anything wrong with the movie. For a movie of 1995, they did a great job.I absolutely guarantee this movie to anybody who enjoys action and war with a bit of drama mixed in. One of the best, or maybe even the best movie of the 20th century. "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [3] " Unfortunately, I wasn't able to watch Braveheart till 2003 when it was on TV. However, the lack of theatrical effects never stopped me from being mesmerized by this epic for one moment. So mesmerized, I literally sat motionlessly on the couch for two minutes after the movie. Any normal audience would likely to cast his/her sense of reality away and be captivated by this distant Celtic saga. Beside proving himself as a brilliant director, Mel Gibson more importantly gave life to a historical hero whose superb gallantry, vivid character and magnificent spirit shall never be history. Along with the unforgettable 'Alba gu bragh!' and the unprecedentedly heart-stopping 'Freeeeedom', Braveheart unquestionably is one of the greatest movies ever made. "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
## [4] " This is simply the best movie ever made, containing all the elements a perfect movie should, even considering that every person has a right to his/her opinion. The soundtrack is amazing, the scenes are ingenious and the story is simply excellent! This is a story about a Scotsman named William Wallace (Mel Gibson) and his fight for the freedom of the Scottish people, from the oppression of the English ruler-ship. After seeing the death of his wife at the hands of an English nobleman, William Wallace (Mel Gibson) sets out on a quest for vengeance that quickly turns into a crusade for freedom for the entire \"country\". The extreme violence as well as the human compassion in this movie are overwhelming in its brilliancy. "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## [5] " I saw this film for the first time on cable, and, fortunately, it was an \"uncut\" version. I was greatly impacted, but, as bad luck would have it, I would not see it again for two years.Mel Gibson is an accomplished actor, with films like \"Mad Max\" and \"Lethal Weapon\" under his belt. \"Ransom\" showed he was more than just a quirky role actor, but it was \"Braveheart\" that proved to everyone that he was a great actor... and director.What he has envisioned and ensnared on camera is one of the great cinematic achievements of all time, and at an awkward time, too. Looking back at previous years at the Oscars, and you have \"Schindler's List,\" \"Dances with Wolves,\" and \"Unforgiven.\" Looking ahead, you have \"Titanic,\" \"Shakespeare in Love,\" and \"Gladiator.\" These are all period pieces. Right smack dab in the middle is \"Braveheart.\" It is the most simple of the films above, yet it is arguably the best. None will argue its impact is greater than \"Schindler's List\" nor its power greater than \"Unforgiven,\" but what it has, more than any of those other films, is heart. Much like his \"Passion of the Christ,\" Mel Gibson brings a passion to this film, and that is what sustains it.Mel Gibson plays William Wallace, a well-educated Scottish peasant who is determined to lead a peaceful life. Well, if you've seen the poster for this film, you probably already know that he doesn't succeed. When a law is put into place that says English noblemen have first right to lay with Scottish brides, Wallace marries in secret. But, when it is found out, a local noble attempts to take Murron, Wallace's wife, she resists, leading to a gruesome execution. With little choice, Wallace opts for vengeance, and thus begins the journey of Scotland's greatest warrior.This is a wonderfully acted, directed, photographed, and designed film with great performances, particularly from a breathtakingly beautiful Sophie Marceau, and I recommend it wholeheartedly. "
## [6] " This has to be one of the best movies I have ever seen. I recently purchased it and have watched it at least five times since then, and each time i pick up on things I did not see the other times. The fight scenes are great, the plot is both interesting and thought provoking, there is romance and comedy. This is a movie that any person can appreciate at some level. True, the historical content may have been distorted, but even though, this movie is meant for entertainment. It is not a history lesson caught on video.The acting is absolutely superb, this movie is guaranteed to have you on the edge of your seat for the entire three hours. "
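
As a small offline illustration of the filtering step in the loop above, the sketch below uses a made-up vector in place of the scraped page text (the sample strings are assumptions, not IMDb data). It also shows a side effect worth knowing: setdiff() returns unique values, so a review that appears twice is kept only once.

```r
# Hypothetical stand-in for html_text(html_nodes(...)) on one page
page_text = c("Great film.\nLoved it.",
              "*** This review may contain spoilers ***",
              "Terrible.",
              "Add another review",
              "Terrible.")                       # duplicate on purpose

boiler = c("*** This review may contain spoilers ***", "Add another review")

kept = setdiff(page_text, boiler)   # drops boilerplate AND the duplicate "Terrible."
kept = gsub("\n", " ", kept)        # same newline clean-up as in the script
print(kept)

# The five page offsets used above give five URLs per filter
urls = paste0("http://www.imdb.com/title/tt0112573/reviews?filter=love;filter=love;start=",
              c(0, 10, 20, 30, 40))
print(length(urls))
```

With 10 reviews per page, 5 offsets and 2 filters, the loop targets 100 reviews, provided no two reviews are textually identical.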

Creation of Document-Term Matrix

library(text2vec)
## Warning: package 'text2vec' was built under R version 3.3.2
library(data.table)
library(stringr)
library(tm)
## Warning: package 'tm' was built under R version 3.3.2
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.3.2
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.3.2
library(tokenizers)
## Warning: package 'tokenizers' was built under R version 3.3.2
## 
## Attaching package: 'tokenizers'
## The following object is masked from 'package:tm':
## 
##     stopwords
library(slam)
## Warning: package 'slam' was built under R version 3.3.2
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.3.2
## Loading required package: RColorBrewer
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
text.clean = function(x)                    # text data
{ require("tm")
  x  =  gsub("<.*?>", " ", x)               # regex for removing HTML tags
  x  =  iconv(x, "latin1", "ASCII", sub="") # Keep only ASCII characters
  x  =  gsub("[^[:alnum:]]", " ", x)        # keep only alpha numeric 
  x  =  tolower(x)                          # convert to lower case characters
  x  =  removeNumbers(x)                    # removing numbers
  x  =  stripWhitespace(x)                  # removing white space
  x  =  gsub("^\\s+|\\s+$", "", x)          # remove leading and trailing white space
  return(x)
}
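
To see what text.clean does to a single string, here is a base-R sketch of the same sequence of steps. It assumes that removeNumbers and stripWhitespace behave like the two gsub calls marked below; the name clean_base and the sample string are illustrative only.

```r
clean_base = function(x) {
  x = gsub("<.*?>", " ", x)                  # strip HTML tags
  x = iconv(x, "latin1", "ASCII", sub = "")  # keep only ASCII characters
  x = gsub("[^[:alnum:]]", " ", x)           # punctuation -> space
  x = tolower(x)                             # lower case
  x = gsub("[[:digit:]]+", "", x)            # assumed equivalent of removeNumbers()
  x = gsub("[[:space:]]+", " ", x)           # assumed equivalent of stripWhitespace()
  x = gsub("^\\s+|\\s+$", "", x)             # trim both ends
  x
}

cleaned = clean_base("<b>Braveheart</b> (1995) is GREAT!!")
print(cleaned)   # "braveheart is great"
```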

temp.text = readLines(file.choose())  # choose the text file of reviews, one document per line
data = data.frame(id = 1:length(temp.text),  # creating doc IDs if name is not given
                  text = temp.text, 
                  stringsAsFactors = F)
dim(data)
## [1] 100   2
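
The step from the scraped reviews vector to the file read here is not shown. One plausible way to bridge it, assuming one review per line, is sketched below; the temporary file and the sample reviews are hypothetical stand-ins, not the data used in this analysis.

```r
# Hypothetical reviews vector (newlines already replaced by spaces)
reviews_demo = c("first review text", "second review text", "third review text")

path = tempfile(fileext = ".txt")   # illustrative file location
writeLines(reviews_demo, path)      # one review per line

temp_demo = readLines(path)         # same call as above, minus file.choose()
data_demo = data.frame(id = seq_along(temp_demo),
                       text = temp_demo,
                       stringsAsFactors = FALSE)
print(dim(data_demo))
```

Writing one review per line only works because embedded newlines were removed beforehand; otherwise a single review would be split into several documents.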
stpw1 = readLines('https://raw.githubusercontent.com/sudhir-voleti/basic-text-analysis-shinyapp/master/data/stopwords.txt') # stop word list from GitHub
stpw2 = tm::stopwords('english')      # tm package stop word list; the tokenizers package has a function of the same name, hence 'tm::'
stpw3 = c('film','english','movie','start','scotland','scottish','makes','making') # extra stop words (defined but not merged into 'comn' below)

comn  = unique(c(stpw1, stpw2))         # union of the two lists
stopwords = unique(gsub("'"," ",comn))  # final stop word list, with apostrophes replaced by spaces

x  = text.clean(data$text)                # applying func defined above to pre-process text corpus
x  =  removeWords(x,stopwords)            # removing stopwords created above
x  =  stripWhitespace(x)                  # removing white space
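
The apostrophe replacement in the stop word list matters because text.clean turns every non-alphanumeric character into a space, so "don't" appears as "don t" in the cleaned text. The sketch below illustrates this with a word-boundary regex in base R; it is a rough stand-in for removeWords, not its actual implementation, and the sample words are made up.

```r
stop_raw   = c("don't", "the", "is")
stop_clean = unique(gsub("'", " ", stop_raw))      # "don t" "the" "is"

txt = "the plot is good but i don t like the ending"   # already cleaned text

# Whole-word match on any stop word (assumed approximation of removeWords)
pattern = paste0("\\b(", paste(stop_clean, collapse = "|"), ")\\b")
out = gsub(pattern, "", txt, perl = TRUE)
out = gsub("[[:space:]]+", " ", out)               # collapse the gaps left behind
out = gsub("^\\s+|\\s+$", "", out)                 # trim both ends
print(out)   # "plot good but i like ending"
```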

###  Create DTM using text2vec package   
t1 = Sys.time()

tok_fun = word_tokenizer  # using word & not space tokenizers

it_0 = itoken( x,
                  #preprocessor = text.clean,
                  tokenizer = tok_fun,
                  ids = data$id,
                  progressbar = T)

vocab = create_vocabulary(it_0,    #  func collects unique terms & corresponding statistics
                          ngram = c(2L, 2L) #,
                          #stopwords = stopwords
                          )
# length(vocab); str(vocab)     # view what vocab obj is like

pruned_vocab = prune_vocabulary(vocab,  # filters the vocab; here only bigrams seen at least 10 times are kept
                                term_count_min = 10)
                                # doc_proportion_max = 0.5,   (document-share filters left disabled)
                                # doc_proportion_min = 0.001)

# length(pruned_vocab);  str(pruned_vocab)

vectorizer = vocab_vectorizer(pruned_vocab) #  creates a text vectorizer func used in constructing a dtm/tcm/corpus

dtm_0  = create_dtm(it_0, vectorizer) # high-level function for creating a document-term matrix
# Sort bigrams in decreasing order of frequency
tsum = as.matrix(t(rollup(dtm_0, 1, na.rm=TRUE, FUN = sum))) # find sum of freq for each term
tsum = tsum[order(tsum, decreasing = T),]       # terms in decreasing order of freq
head(tsum)
##       mel_gibson  william_wallace patrick_mcgoohan   sophie_marceau 
##               87               68               16               14 
##     robert_bruce      king_edward 
##               13               13
tail(tsum)
##    patrick_mcgoohan      sophie_marceau        robert_bruce 
##                  16                  14                  13 
##         king_edward catherine_mccormack     battle_stirling 
##                  13                  12                  10
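
For intuition about what the bigram vocabulary counts, here is a base-R sketch that forms adjacent word pairs joined by "_" and tallies them. It does not use text2vec; the helper name bigrams and the three toy documents are assumptions for illustration.

```r
docs = c("mel gibson plays william wallace",
         "william wallace fights",
         "mel gibson directs")

# Adjacent word pairs of one document, joined with "_"
bigrams = function(d) {
  w = strsplit(d, " ", fixed = TRUE)[[1]]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1), sep = "_")
}

counts = sort(table(unlist(lapply(docs, bigrams))), decreasing = TRUE)
print(counts)

# Keep only pairs seen at least twice (the analysis above uses a minimum of 10)
frequent = names(counts)[counts >= 2]
print(frequent)
```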
# # select Top 1000 bigrams to unigram
# if (length(tsum) > 1000) {n = 1000} else {n = length(tsum)}
# tsum = tsum[1:n]

#-------------------------------------------------------
# Code bi-grams as unigram in clean text corpus

text2 = x
text2 = paste("",text2,"")

pb <- txtProgressBar(min = 1, max = (length(tsum)), style = 3) ; i = 0

for (term in names(tsum)){
  i = i + 1
  focal.term = gsub("_", " ",term)        # bigram terms are joined by '_'; turn them back into a space-separated phrase
  replacement.term = term
  text2 = gsub(paste("",focal.term,""),paste("",replacement.term,""), text2)
  setTxtProgressBar(pb, i)
}
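
The recoding loop can be tried on a single sentence. The sketch below repeats the same pad-and-replace idea on a made-up string; the padding with spaces is what keeps the match on whole words.

```r
txt = paste("", "mel gibson plays william wallace", "")   # " mel gibson plays william wallace "

for (term in c("mel_gibson", "william_wallace")) {
  phrase = gsub("_", " ", term)                            # "mel gibson"
  txt = gsub(paste("", phrase, ""), paste("", term, ""), txt)
}
recoded = trimws(txt)
print(recoded)   # "mel_gibson plays william_wallace"

# Edge case: two back-to-back occurrences share one separating space,
# so the second occurrence is not matched in a single gsub pass.
twice = gsub(" mel gibson ", " mel_gibson ", " mel gibson mel gibson ")
print(twice)     # " mel_gibson mel gibson "
```

The edge case is probably rare in review text, but it means the unigram counts for a phrase can slightly undercount immediate repetitions.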
it_m = itoken(text2,     # function creates iterators over input objects to vocabularies, corpora, DTM & TCM matrices
              # preprocessor = text.clean,
              tokenizer = tok_fun,
              ids = data$id,
              progressbar = T)

vocab = create_vocabulary(it_m     # vocab func collects unique terms and corresponding statistics
                          # ngram = c(2L, 2L),
                          #stopwords = stopwords
                          )
# length(vocab); str(vocab)     # view what vocab obj is like

pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 1)
# doc_proportion_max = 0.5,
# doc_proportion_min = 0.001)

vectorizer = vocab_vectorizer(pruned_vocab)


dtm_m  = create_dtm(it_m, vectorizer)
dim(dtm_m)
## [1]  100 3014
dtm = as.DocumentTermMatrix(dtm_m, weighting = weightTf)
a0 = (apply(dtm, 1, sum) > 0)   # build vector to identify non-empty docs
dtm = dtm[a0,]                  # drop empty docs
 
print(difftime(Sys.time(), t1, units = 'sec'))
## Time difference of 0.5210302 secs
# view a sample of the DTM, sorted from most to least frequent tokens 
dtm = dtm[,order(apply(dtm, 2, sum), decreasing = T)]     # sorting dtm's columns in decreasing order of column sums
inspect(dtm[1:5, 1:5])
## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 13/12
## Sparsity           : 48%
## Maximal term length: 10
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs movie film english wallace mel_gibson
##    1     2    3       0       0          1
##    2     7    0       0       0          0
##    3     1    0       0       0          1
##    4     3    0       2       0          2
##    5     0    4       1       3          3
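
The two housekeeping steps above (dropping empty documents, then ordering columns by total frequency) are shown here on a small dense matrix. The matrix values are invented for illustration; the real DTM is sparse, so this is only a sketch of the logic.

```r
m = matrix(c(2, 0, 1,
             0, 0, 0,     # an empty document
             1, 4, 0),
           nrow = 3, byrow = TRUE,
           dimnames = list(c("1", "2", "3"), c("movie", "film", "wallace")))

keep = rowSums(m) > 0                 # TRUE for non-empty docs
m2 = m[keep, , drop = FALSE]          # drop empty docs

m2 = m2[, order(colSums(m2), decreasing = TRUE), drop = FALSE]  # most frequent terms first
print(m2)
```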
# Using term frequency (tf)

tst = round(ncol(dtm)/100)  # divide DTM's cols into 100 manageable parts
a = rep(tst,99)
b = cumsum(a);rm(a)
b = c(0,b,ncol(dtm))

ss.col = c(NULL)
for (i in 1:(length(b)-1)) {
  tempdtm = dtm[,(b[i]+1):(b[i+1])]
  s = colSums(as.matrix(tempdtm))
  ss.col = c(ss.col,s)
  print(i)
}
## [1] 1
## [1] 2
## ...
## [1] 100
tsum = ss.col
tsum = tsum[order(tsum, decreasing = T)]       #terms in decreasing order of freq

head(tsum)
##      movie       film    english    wallace mel_gibson braveheart 
##        195        187        107         93         87         69
tail(tsum)
##        morons exaggerations       outlaws       sickens     motioning 
##             1             1             1             1             1 
##          step 
##             1
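
The chunked column-sum loop used to build tsum can be checked against a direct colSums on a toy matrix. The sketch below uses random counts (an assumption, not the review data) and requires enough columns that the chunk width is at least 1.

```r
set.seed(1)
m = matrix(rpois(5 * 230, 1), nrow = 5)     # 5 docs x 230 terms of made-up counts

n_chunks = 100
step = round(ncol(m) / n_chunks)            # chunk width, here 2
stopifnot(step >= 1)                        # the breakpoints below break down otherwise
b = c(0, cumsum(rep(step, n_chunks - 1)), ncol(m))   # 0, 2, ..., 198, 230

ss = NULL
for (i in 1:(length(b) - 1)) {
  cols = (b[i] + 1):b[i + 1]                # the last chunk absorbs the remainder
  ss = c(ss, colSums(m[, cols, drop = FALSE]))
}

print(all(ss == colSums(m)))                # chunked and direct sums agree
```

Chunking keeps each dense conversion small. The slam package, which is already loaded, also provides col_sums for summing sparse columns without densifying, which may be a simpler alternative.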
#windows()  # New plot window
wordcloud(names(tsum), tsum,     # words, their freqs 
          scale = c(4, 0.5),     # range of word sizes
          1,                     # min.freq of words to consider
          max.words = 200,       # max #words
          colors = brewer.pal(8, "Dark2"))    # Plot results in a word cloud 
title(sub = "Term Frequency - Wordcloud")     # title for the wordcloud display

# plot barchart for top tokens
top = data.frame(term = names(tsum)[1:15],        # top 15 tokens
                 freq = round(tsum[1:15], 0))     # and their frequencies

#windows()  # New plot window
ggplot(top, aes(x = reorder(term, -freq), y = freq)) +   # map columns, not the data frame itself
       geom_bar(stat = "identity", fill = "blue") +
       geom_text(aes(label = freq), vjust = -0.20) +
       labs(x = "term", y = "frequency") +
       theme(axis.text.x = element_text(angle = 90, hjust = 1))

Using term frequency-inverse document frequency (tf-idf)

library(textir)
## Warning: package 'textir' was built under R version 3.3.2
## Loading required package: distrom
## Warning: package 'distrom' was built under R version 3.3.2
## Loading required package: Matrix
## Loading required package: gamlr
## Warning: package 'gamlr' was built under R version 3.3.2
## Loading required package: parallel
dtm.tfidf = tfidf(dtm, normalize=F)

tst = round(ncol(dtm.tfidf)/100)
a = rep(tst, 99)
b = cumsum(a);rm(a)
b = c(0,b,ncol(dtm.tfidf))

ss.col = c(NULL)
for (i in 1:(length(b)-1)) {
  tempdtm = dtm.tfidf[,(b[i]+1):(b[i+1])]
  s = colSums(as.matrix(tempdtm))
  ss.col = c(ss.col,s)
  print(i)
}
## [1] 1
## [1] 2
## ...
## [1] 100
tsum = ss.col

tsum = tsum[order(tsum, decreasing = T)]       # terms in decreasing order of summed tf-idf weight
head(tsum)
##    wallace       film      movie    english   scottish braveheart 
##   95.01357   86.40063   84.00267   80.78742   69.04382   68.60341
tail(tsum)
##        morons exaggerations       outlaws       sickens     motioning 
##      3.912023      3.912023      3.912023      3.912023      3.912023 
##          step 
##      3.912023
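
The tail values give a hint about the weighting. A term that occurs once in one of the 100 documents scores 3.912023, which matches idf = log(N) - log(df + 1) with N = 100 and df = 1. The sketch below applies that formula to a toy count matrix; the formula is an assumption inferred from the output above, not taken from the textir documentation, and the counts are invented.

```r
tf = matrix(c(2, 0, 1,
              1, 0, 0,
              3, 1, 0,
              0, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("movie", "film", "wallace")))

n_docs = nrow(tf)
df  = colSums(tf > 0)                 # number of docs containing each term
idf = log(n_docs) - log(df + 1)       # assumed idf, inferred from the tail values
w   = t(t(tf) * idf)                  # unnormalized tf-idf, one idf per column

score = colSums(w)                    # per-term total, as in tsum
print(round(score, 4))

# Consistency check against the output above: once in 1 of 100 docs
print(round(log(100) - log(1 + 1), 6))   # 3.912023
```

Under this formula a term present in most documents gets an idf near zero, which would explain why "movie" drops from first place in the tf ranking to third here.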
#windows()  # New plot window
wordcloud(names(tsum), tsum, scale=c(4,0.5),1, max.words=200,colors=brewer.pal(8, "Dark2")) # Plot results in a word cloud 
## Warning in wordcloud(names(tsum), tsum, scale = c(4, 0.5), 1, max.words =
## 200, : wallace could not be fit on page. It will not be plotted.
## (99 more warnings of the same form, one for each other term that could not be fit on the page, among them braveheart, england, scottish, freedom, film and king)
title(sub = "Term Frequency Inverse Document Frequency - Wordcloud")

as.matrix(tsum[1:20])     #  to see the top few tokens & their summed TF-IDF scores
##                     [,1]
## wallace         95.01357
## film            86.40063
## movie           84.00267
## english         80.78742
## scottish        69.04382
## braveheart      68.60341
## william_wallace 62.30777
## story           58.11115
## scotland        57.65046
## time            55.55490
## great           54.59075
## good            54.46647
## mel             53.14486
## freedom         52.97480
## epic            51.31228
## watch           49.84172
## made            49.51497
## mel_gibson      48.90435
## history         48.56341
## people          48.54643
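
How a per-term score such as tsum arises can be sketched on a toy matrix. This is a minimal illustration in base R, assuming ss.col (defined outside this section) holds the column sums of the TF-IDF weighted DTM. The toy counts and the names dtm_toy, tfidf_toy and tsum_toy are made up, and the plain log(N/df) weighting used here is a simplification: the pipeline above may apply smoothing or normalisation, so its absolute values will differ.

```r
# Toy document-term matrix: 3 documents x 3 terms (raw counts, made up)
dtm_toy <- matrix(c(2, 0, 1,
                    1, 2, 0,
                    3, 0, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("doc", 1:3),
                                  c("film", "wallace", "epic")))

n_docs   <- nrow(dtm_toy)
doc_freq <- colSums(dtm_toy > 0)            # documents containing each term
idf      <- log(n_docs / doc_freq)          # plain, unsmoothed IDF
tfidf_toy <- sweep(dtm_toy, 2, idf, `*`)    # scale each column by its term's IDF

# Corpus-level score per term, sorted in decreasing order like tsum
tsum_toy <- sort(colSums(tfidf_toy), decreasing = TRUE)
print(tsum_toy)
```

A term that occurs in every document ("film" in the toy data) gets an IDF of zero under this plain weighting, however often it is used.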
(dtm.tfidf)[1:10, 1:10]   # view first 10x10 cells in the DTM under TF IDF.
## 10 x 10 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 10 column names 'movie', 'film', 'english' ... ]]
##                                                                        
## 1  0.8615658 1.3861064 .         .        0.5621189 3.9770091 0.9162907
## 2  3.0154804 .         .         .        .         .         .        
## 3  0.4307829 .         .         .        0.5621189 1.9885045 .        
## 4  1.2923487 .         1.5100452 .        1.1242378 .         1.8325815
## 5  .         1.8481418 0.7550226 3.064954 1.6863568 1.9885045 0.9162907
## 6  1.2923487 .         .         .        .         .         .        
## 7  .         2.3101773 .         .        .         1.9885045 .        
## 8  1.7231317 0.4620355 .         6.129907 0.5621189 3.9770091 0.9162907
## 9  1.2923487 0.9240709 .         .        .         0.9942523 .        
## 10 3.8770462 1.3861064 6.7952033 8.173210 1.1242378 .         1.8325815
##                                
## 1  .        .         .        
## 2  .        0.9416085 .        
## 3  .        .         0.9162907
## 4  1.078810 .         .        
## 5  2.157619 2.8248256 .        
## 6  .        0.9416085 0.9162907
## 7  .        .         1.8325815
## 8  .        1.8832171 1.8325815
## 9  .        .         0.9162907
## 10 2.157619 .         .
# plot barchart for top tokens
# build an explicit two-column data frame so aes() maps real columns
test = data.frame(term = names(tsum)[1:15], score = round(tsum[1:15], 0))
#windows()  # New plot window
ggplot(test, aes(x = reorder(term, -score), y = score)) + 
  geom_bar(stat = "identity", fill = "red") +
  geom_text(aes(label = score), vjust= -0.20) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

step 2c - Term Co-occurrence Matrix (TCM)

vectorizer = vocab_vectorizer(pruned_vocab, 
                              grow_dtm = FALSE, 
                              skip_grams_window = 5L)

tcm = create_tcm(it_m, vectorizer) # func to build a TCM
tcm.mat = as.matrix(tcm)         # use tcm.mat[1:5, 1:5] to view
adj.mat = tcm.mat + t(tcm.mat)   # add the transpose so the adjacency matrix is symmetric

z = order(colSums(adj.mat), decreasing = T)
adj.mat = adj.mat[z,z]
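
The symmetrise-then-sort step can be checked on a small example. This is a sketch in base R, assuming (as the transpose step above suggests) that the TCM stores each pair's count in only one triangle. The terms and counts are made up and tcm_toy, adj_toy are hypothetical names.

```r
# Toy co-occurrence counts stored in the upper triangle only (made up)
terms <- c("freedom", "film", "wallace")
tcm_toy <- matrix(c(0, 1, 2,
                    0, 0, 4,
                    0, 0, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(terms, terms))

adj_toy <- tcm_toy + t(tcm_toy)     # mirror the upper triangle into the lower one
stopifnot(isSymmetric(adj_toy))

# Reorder rows and columns so the most connected terms come first
ord <- order(colSums(adj_toy), decreasing = TRUE)
adj_toy <- adj_toy[ord, ord]
print(adj_toy)
```

Taking the top-left block of the reordered matrix, as adj.mat[1:30,1:30] does, then keeps the terms with the highest total co-occurrence.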

# Plot simple term co-occurrence graph (COG) for the 30 most connected terms
adj = adj.mat[1:30,1:30]

library(igraph)
## 
## Attaching package: 'igraph'
## The following object is masked from 'package:stringr':
## 
##     %>%
## The following objects are masked from 'package:text2vec':
## 
##     %>%, normalize
## The following object is masked from 'package:rvest':
## 
##     %>%
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
cog = graph.adjacency(adj, mode = 'undirected')
cog =  simplify(cog)  

cog = delete.vertices(cog, V(cog)[ degree(cog) == 0 ])

#windows()
plot(cog)

A cleaned-up or ‘distilled’ COG plot

distill.cog = function(mat1, # input TCM or adjacency matrix
                       title, # title for the graph
                       s,    # no. of central nodes
                       k1){  # max no. of connections per central node
  library(igraph)
  a = colSums(mat1) # column sums: total co-occurrence per term
  b = order(-a)     # term indices in decreasing order of column sum
  
  mat2 = mat1[b, b]     # order both rows and columns along vector b
  
  diag(mat2) =  0       # ignore self co-occurrence
  
  ## +++ go row by row and find top k adjacencies +++ ##

  wc = NULL             # collects the terms kept as neighbours
  
  for (i1 in 1:s){ 
    thresh1 = mat2[i1,][order(-mat2[i1, ])[k1]]   # k1-th largest value in row i1
    mat2[i1, mat2[i1,] < thresh1] = 0   # drop everything below that threshold
    mat2[i1, mat2[i1,] > 0 ] = 1        # binarise the surviving links
    word = names(mat2[i1, mat2[i1,] > 0])
    mat2[(i1+1):nrow(mat2), match(word,colnames(mat2))] = 0   # later rows cannot claim the same neighbours
    wc = c(wc,word)
  } # i1 loop ends
  
  
  mat3 = mat2[match(wc, colnames(mat2)), match(wc, colnames(mat2))]
  ord = colnames(mat2)[which(!is.na(match(colnames(mat2), colnames(mat3))))]  # terms of mat3, in mat2's column order
  mat4 = mat3[match(ord, colnames(mat3)), match(ord, colnames(mat3))]
  graph <- graph.adjacency(mat4, mode = "undirected", weighted=T)    # Create Network object
  graph = simplify(graph) 
  V(graph)$color = "pink"                                    # default colour for every node
  V(graph)$color[seq_len(min(s, vcount(graph)))] = "green"   # first s nodes in green

  graph = delete.vertices(graph, V(graph)[ degree(graph) == 0 ]) # drop isolated nodes
  
  plot(graph, 
       layout = layout.kamada.kawai, 
       main = title)

  } # func ends

#windows()
distill.cog(tcm.mat, 'Distilled COG',  10,  5)

## adj.mat and distilled cog for tfidf DTMs ##

adj.mat = t(dtm.tfidf) %*% dtm.tfidf
diag(adj.mat) = 0
a0 = order(apply(adj.mat, 2, sum), decreasing = T)
adj.mat = as.matrix(adj.mat[a0[1:50], a0[1:50]])

#windows()
distill.cog(adj.mat, 'Distilled COG',  10,  10)
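
The row-wise pruning inside distill.cog is easier to follow on a single row. The sketch below, in base R, mirrors the thresholding done in the i1 loop: find the k-th largest value, zero out everything below it, and binarise what is left. The helper top_k_row and the counts in row_toy are hypothetical and only illustrate that one step, not the whole function.

```r
# Keep only the k largest entries of one row and binarise them
top_k_row <- function(row, k) {
  k <- min(k, length(row))
  thresh <- sort(row, decreasing = TRUE)[k]   # k-th largest value in the row
  as.integer(row >= thresh & row > 0)         # 1 = link kept, 0 = link dropped
}

row_toy <- c(film = 9, movie = 7, epic = 0, story = 3, time = 5)  # made-up counts
kept <- top_k_row(row_toy, k = 3)
names(kept) <- names(row_toy)
print(kept)
```

As in distill.cog, ties at the threshold are all kept, so a row can end up with more than k links.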

Sentiment Analysis

library(qdap)
## Warning: package 'qdap' was built under R version 3.3.2
## Loading required package: qdapDictionaries
## Warning: package 'qdapDictionaries' was built under R version 3.3.2
## Loading required package: qdapRegex
## Warning: package 'qdapRegex' was built under R version 3.3.2
## 
## Attaching package: 'qdapRegex'
## The following object is masked from 'package:ggplot2':
## 
##     %+%
## Loading required package: qdapTools
## Warning: package 'qdapTools' was built under R version 3.3.2
## 
## Attaching package: 'qdapTools'
## The following object is masked from 'package:data.table':
## 
##     shift
## 
## Attaching package: 'qdap'
## The following objects are masked from 'package:igraph':
## 
##     %>%, diversity
## The following object is masked from 'package:Matrix':
## 
##     %&%
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix
## The following object is masked from 'package:NLP':
## 
##     ngrams
## The following object is masked from 'package:stringr':
## 
##     %>%
## The following object is masked from 'package:text2vec':
## 
##     %>%
## The following object is masked from 'package:rvest':
## 
##     %>%
## The following object is masked from 'package:base':
## 
##     Filter
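# NOTE: a0 was reassigned above to a term ordering (order of the adj.mat column sums),
# so here it indexes terms, not documents; use a separate name for one of the two
x1 = x[a0]    # remove empty docs from corpus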

t1 = Sys.time()   # set timer

pol = polarity(x1)         # Calculate the polarity from qdap dictionary
wc = pol$all[,2]                  # Word Count in each doc
val = pol$all[,3]                 # average polarity score
p  = pol$all[,4]                  # Positive words info
n  = pol$all[,5]                  # Negative Words info  

Sys.time() - t1  # how much time did the above take?
## Time difference of 12.44071 secs
head(pol$all)
##   all  wc   polarity
## 1 all  55 0.51239190
## 2 all  54 0.89814624
## 3 all  57 1.32453236
## 4 all 162 1.41421356
## 5 all  41 0.03123475
## 6 all 115 0.54085279
##                                                                                                                                                                                                                                          pos.words
## 1                                                                                                                                                                             perfect, amazing, ingenious, excellent, freedom, freedom, compassion
## 2                                                                                                                                            impressed, impressed, amazing, brilliant, powerful, great, courage, bravery, great, guarantee, enjoys
## 3                                                                                                                            mesmerized, mesmerized, proving, brilliant, hero, superb, vivid, magnificent, unforgettable, unquestionably, greatest
## 4 great, memorable, favorite, terrific, remarkable, accurate, enjoy, peace, harmony, promising, love, safe, love, freedom, freedom, memorable, freedom, bravery, love, freedom, inspirational, powerful, terrific, heartfelt, won, terrific, enjoy
## 5                                                                                                                                                                                                                       great, interesting, superb
## 6                                                                                                                   spectacular, perfectly, love, glowing, witty, flawless, classic, top notch, patriotic, gorgeous, bonus, comfy, enjoy, greatest
##                                                              neg.words
## 1                           oppression, death, vengeance, overwhelming
## 2          incorrect, unpredictable, cry, inconceivable, faults, wrong
## 3                                                                 lack
## 4  doubt, sick, rape, kill, murder, bloodshed, bleed, struggling, dies
## 5                                                      plot, distorted
## 6 mystery, cry, hate, bad, negative, plot, flaw, inaccuracies, dislike
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 text.var
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          simply movie made elements perfect movie person opinion soundtrack amazing scenes ingenious story simply excellent story scotsman named william wallace mel gibson fight freedom scottish people oppression english ruler ship death wife hands english nobleman william wallace mel gibson sets quest vengeance quickly turns crusade freedom entire country extreme violence human compassion movie overwhelming brilliancy
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 remember movie time late impressed night impressed acting amazing ending brilliant guesses incorrect movie unpredicted half hour highly unpredictable powerful message scene meant dramatic great job ending make cry message movie thinking amount courage bravery inconceivable barely faults wrong movie movie great job absolutely guarantee movie enjoys action war bit drama mixed movie century
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               watch braveheart till tv lack theatrical effects stopped mesmerized epic moment mesmerized literally sat motionlessly couch minutes movie normal audience cast sense reality captivated distant celtic saga proving brilliant director mel gibson importantly gave life historical hero superb gallantry vivid character magnificent spirit history unforgettable alba gu bragh unprecedentedly heart stopping freeeeedom braveheart unquestionably greatest movies made
## 4  talking great time movies memorable doubt braveheart braveheart favorite movies long running time watch movie day mel gibson movie business stars directed braveheart terrific job deserves lot credit remarkable film totally accurate history enjoy braveheart based life story william wallace town peasant live peace harmony wife children british king england longshanks sick scotland full scots bringing british land breeding promising british soldier scotish bride wedding night husband wallace hear marries love secret british soldier notices wife attempts rape wallace saves causing riot village thinks safe sends soldiers catch kill front family wallace war murder wife love freedom scotland shed blood freedom memorable speeches cinematic history major battle british countless battles friends traitors bloodshed wallace wishes freedom robert bruce father fight side describes wallace bravery feels wanting love robert line bled wallace bleed army stands continues struggling battles freedom inspirational movie powerful message man dies man lives tearful ending make shed tear terrific heartfelt sound track won picture oscars movie terrific enjoy watch
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        movies recently purchased watched times time pick things times fight scenes great plot interesting thought provoking romance comedy movie person level true historical content distorted movie meant entertainment history lesson caught video acting absolutely superb movie guaranteed edge seat entire hours
## 6                                                                                                                                                                                                                                                                                                          find film caliber braveheart elements romance heart wrenching warming instances epic action spectacular scenes mystery viewings put history albeit romanticised drama perfectly unravelled level uncompromised multi tasking film move laugh cry love hate taught avoid clichs glowing reviews bad negative movies feel deserves witty banter friends foes lovers relatives opinion flawless aids film claim true classic soundtrack similarly top notch encapsulates refracts patriotic theme key moments political plot gorgeous scenery serve refreshers heavy areas story braveheart flaw share sentiments bonus length prepare comfy seat pots tea complete cosies suppose relevant touch historic inaccuracies expect people dislike film history written hanged heroes sketchiness periods coupled artistic license personally dismiss thoughts note hope enjoy greatest film created
head(pol$group)
##   all total.sentences total.words ave.polarity sd.polarity
## 1 all            3014        8635   0.08906548   0.9045222
##   stan.mean.polarity
## 1         0.09846688
positive_words = unique(setdiff(unlist(p),"-"))  # Positive words list
negative_words = unique(setdiff(unlist(n),"-"))  # Negative words list

print(positive_words)       # Print all the positive words found in the corpus
##   [1] "perfect"         "amazing"         "ingenious"      
##   [4] "excellent"       "freedom"         "compassion"     
##   [7] "impressed"       "brilliant"       "powerful"       
##  [10] "great"           "courage"         "bravery"        
##  [13] "guarantee"       "enjoys"          "mesmerized"     
##  [16] "proving"         "hero"            "superb"         
##  [19] "vivid"           "magnificent"     "unforgettable"  
##  [22] "unquestionably"  "greatest"        "memorable"      
##  [25] "favorite"        "terrific"        "remarkable"     
##  [28] "accurate"        "enjoy"           "peace"          
##  [31] "harmony"         "promising"       "love"           
##  [34] "safe"            "inspirational"   "heartfelt"      
##  [37] "won"             "interesting"     "spectacular"    
##  [40] "perfectly"       "glowing"         "witty"          
##  [43] "flawless"        "classic"         "top notch"      
##  [46] "patriotic"       "gorgeous"        "bonus"          
##  [49] "comfy"           "redemption"      "perfection"     
##  [52] "glorious"        "beautiful"       "good"           
##  [55] "top"             "free"            "deservedly"     
##  [58] "playful"         "impeccable"      "incredible"     
##  [61] "exciting"        "victory"         "work"           
##  [64] "thrilled"        "enjoying"        "catchy"         
##  [67] "award"           "winning"         "reputation"     
##  [70] "lavish"          "success"         "halcyon"        
##  [73] "grand"           "works"           "sweeping"       
##  [76] "loyalty"         "pure"            "supporting"     
##  [79] "wonderful"       "delightful"      "rightly"        
##  [82] "awarded"         "lush"            "accolade"       
##  [85] "modern"          "relief"          "noble"          
##  [88] "defeated"        "superior"        "ease"           
##  [91] "pretty"          "heroic"          "revolutionary"  
##  [94] "joy"             "famous"          "passion"        
##  [97] "beautifully"     "romantic"        "redeeming"      
## [100] "genuine"         "holy"            "passionate"     
## [103] "successful"      "blockbuster"     "dynamic"        
## [106] "expansive"       "realistic"       "winner"         
## [109] "maturely"        "indulgent"       "lovely"         
## [112] "convincingly"    "entertaining"    "gutsy"          
## [115] "prefer"          "led"             "extraordinarily"
## [118] "charisma"        "benefit"         "fun"            
## [121] "educated"        "pride"           "credible"       
## [124] "intelligence"    "interests"       "confidence"     
## [127] "inspiring"       "proud"           "winners"        
## [130] "compliant"       "support"         "leading"        
## [133] "succeeds"        "fame"            "inspiration"    
## [136] "regard"          "pleasure"        "fantastic"      
## [139] "masterpiece"     "amazingly"       "awards"         
## [142] "praise"          "worthy"          "honest"         
## [145] "awe"             "fairly"          "humorous"       
## [148] "phenomenal"      "likable"         "loved"          
## [151] "engrossing"      "fancy"           "incredibly"     
## [154] "convincing"      "honor"           "stunningly"     
## [157] "charming"        "fine"            "patriot"        
## [160] "marvellous"      "stunning"        "supported"      
## [163] "respect"         "proper"          "handy"          
## [166] "enhance"         "fortunately"     "luck"           
## [169] "accomplished"    "achievements"    "lead"           
## [172] "peaceful"        "succeed"         "wonderfully"    
## [175] "breathtakingly"  "recommend"       "wholeheartedly" 
## [178] "masterpieces"    "salute"          "impressive"     
## [181] "defeat"          "excellence"      "quiet"          
## [184] "enjoyed"         "accurately"      "fortunate"      
## [187] "favor"           "heaven"          "elite"          
## [190] "privileged"      "happy"           "master"         
## [193] "supreme"         "divine"          "bless"          
## [196] "correct"         "fair"            "lover"          
## [199] "extraordinary"   "strong"          "sensational"    
## [202] "splendid"        "remarkably"      "foremost"       
## [205] "proves"          "worth"           "exemplary"      
## [208] "smarter"         "leads"           "succeeded"      
## [211] "expertly"        "breathtaking"    "flawlessly"     
## [214] "favour"          "amusing"         "hilarious"      
## [217] "chivalry"        "easier"          "appeal"         
## [220] "helpful"         "reasonable"      "prominent"      
## [223] "idol"            "conveniently"    "popular"        
## [226] "praising"        "likes"           "irresistible"   
## [229] "variety"         "loyal"           "coherent"       
## [232] "logical"         "humour"          "sweet"          
## [235] "helped"          "pleasant"        "excited"        
## [238] "brave"           "doubtless"       "goodness"       
## [241] "beloved"         "accomplish"      "win"            
## [244] "compliment"      "magic"           "freedoms"       
## [247] "wise"            "innovative"      "effective"      
## [250] "cohesive"        "decent"          "righteous"      
## [253] "easy"            "gaining"         "nice"           
## [256] "wins"            "trust"           "rational"       
## [259] "important"       "glad"            "awesome"        
## [262] "satisfying"      "kindly"          "meaningful"     
## [265] "adorable"        "gorgeously"      "magnificently"  
## [268] "motivated"       "amazed"          "recommendations"
## [271] "gladly"          "loving"          "accomplishments"
## [274] "exceptionally"   "easiest"         "legendary"      
## [277] "believable"      "attractive"      "uplifting"      
## [280] "saint"           "liberty"         "homage"         
## [283] "intelligent"     "capable"         "astonishing"    
## [286] "ovation"         "humor"           "champion"       
## [289] "dedicated"       "generous"        "dawn"           
## [292] "hottest"         "outstanding"     "grace"
print(negative_words)       # Print all neg words
##   [1] "oppression"      "death"           "vengeance"      
##   [4] "overwhelming"    "incorrect"       "unpredictable"  
##   [7] "cry"             "inconceivable"   "faults"         
##  [10] "wrong"           "lack"            "doubt"          
##  [13] "sick"            "rape"            "kill"           
##  [16] "murder"          "bloodshed"       "bleed"          
##  [19] "struggling"      "dies"            "plot"           
##  [22] "distorted"       "mystery"         "hate"           
##  [25] "bad"             "negative"        "flaw"           
##  [28] "inaccuracies"    "dislike"         "complain"       
##  [31] "evil"            "losing"          "slow"           
##  [34] "shame"           "gruesome"        "horrible"       
##  [37] "villains"        "errors"          "scream"         
##  [40] "killed"          "bent"            "heavy handed"   
##  [43] "choppy"          "flaws"           "strike"         
##  [46] "treachery"       "struggle"        "egotistical"    
##  [49] "toll"            "damn"            "steals"         
##  [52] "fake"            "insulted"        "hatred"         
##  [55] "poor"            "long time"       "crisis"         
##  [58] "upset"           "insulting"       "revenge"        
##  [61] "brutal"          "weak"            "fictional"      
##  [64] "unrealistic"     "suspect"         "hack"           
##  [67] "bash"            "inappropriate"   "stumble"        
##  [70] "joke"            "rash"            "crap"           
##  [73] "woefully"        "coward"          "perverted"      
##  [76] "twisted"         "pander"          "hates"          
##  [79] "cranky"          "prejudices"      "rage"           
##  [82] "loss"            "betrays"         "hell"           
##  [85] "outcry"          "denying"         "hung"           
##  [88] "dying"           "controversy"     "embarrassing"   
##  [91] "detract"         "doubts"          "nonsense"       
##  [94] "confused"        "mad"             "irresponsible"  
##  [97] "shockingly"      "tragic"          "dead"           
## [100] "repression"      "disappointment"  "brutally"       
## [103] "cruel"           "pompous"         "haughty"        
## [106] "tyrannical"      "cowardly"        "ignorant"       
## [109] "enemies"         "burned"          "poorly"         
## [112] "collapse"        "wreak"           "butchery"       
## [115] "despised"        "dark"            "abused"         
## [118] "harpy"           "savage"          "stern"          
## [121] "unyielding"      "missed"          "downer"         
## [124] "insult"          "break"           "ruthless"       
## [127] "impossible"      "bloody"          "unnecessary"    
## [130] "torture"         "turmoil"         "unwanted"       
## [133] "recklessly"      "danger"          "anger"          
## [136] "imposing"        "menace"          "hateful"        
## [139] "unimportant"     "failures"        "heartbreaking"  
## [142] "bloated"         "boring"          "breaks"         
## [145] "goofy"           "funny"           "ridiculous"     
## [148] "weird"           "killing"         "enemy"          
## [151] "laughable"       "worse"           "lost"           
## [154] "silly"           "agony"           "worst"          
## [157] "violent"         "sad"             "disturbing"     
## [160] "hard"            "intense"         "ignore"         
## [163] "stupid"          "deception"       "overrated"      
## [166] "inaccurate"      "distortion"      "sadly"          
## [169] "smugly"          "downfall"        "hopeless"       
## [172] "shaky"           "lose"            "blurred"        
## [175] "shake"           "bland"           "angry"          
## [178] "terrible"        "lacked"          "irrational"     
## [181] "disliked"        "lethal"          "awkward"        
## [184] "smack"           "dumb"            "awful"          
## [187] "waste"           "overthrow"       "betrayals"      
## [190] "feeble"          "opposition"      "falls"          
## [193] "issue"           "fiction"         "blatant"        
## [196] "inaccurately"    "lying"           "concerned"      
## [199] "vague"           "protest"         "anti"           
## [202] "garbage"         "hang"            "jerk"           
## [205] "racism"          "critics"         "bogus"          
## [208] "morons"          "unbelievably"    "overwhelmed"    
## [211] "unable"          "cold"            "sorrow"         
## [214] "uncomfortable"   "slowly"          "falling"        
## [217] "devastating"     "immature"        "dislikes"       
## [220] "unbelievable"    "simplistic"      "mediocre"       
## [223] "racist"          "scum"            "prosecute"      
## [226] "disgusting"      "refusing"        "racists"        
## [229] "broken"          "fierce"          "died"           
## [232] "spoiled"         "uneasiness"      "asunder"        
## [235] "obscene"         "complex"         "unscrupulous"   
## [238] "fall"            "tragedy"         "conflict"       
## [241] "uprising"        "suffer"          "misery"         
## [244] "flair"           "unacceptable"    "injustices"     
## [247] "raped"           "nasty"           "yawn"           
## [250] "spoil"           "blame"           "drunk"          
## [253] "heck"            "drowning"        "badly"          
## [256] "complaining"     "misfortune"      "problem"        
## [259] "blatantly"       "pandering"       "ruins"          
## [262] "inaccuracy"      "incoherent"      "rant"           
## [265] "infuriating"     "ashamed"         "mindless"       
## [268] "bother"          "botch"           "thug"           
## [271] "temper"          "touted"          "utterly"        
## [274] "irredeemably"    "chronic"         "absence"        
## [277] "solemn"          "crazy"           "parody"         
## [280] "indifference"    "rival"           "drags"          
## [283] "futile"          "perish"          "avenge"         
## [286] "unwilling"       "oppose"          "selfish"        
## [289] "madman"          "whore"           "fussy"          
## [292] "inept"           "bloodthirsty"    "murderer"       
## [295] "weakness"        "madness"         "foul"           
## [298] "baffled"         "biased"          "terribly"       
## [301] "burning"         "idiotic"         "oppressive"     
## [304] "kills"           "pathetic"        "pretentious"    
## [307] "conspicuously"   "sinking"         "grind"          
## [310] "bump"            "lies"            "wallow"         
## [313] "tortured"        "rubbish"         "travesty"       
## [316] "resent"          "problems"        "betray"         
## [319] "breaking"        "haunting"        "despicable"     
## [322] "debacle"         "die"             "sugar coat"     
## [325] "screwed"         "ailing"          "famine"         
## [328] "monstrosity"     "hating"          "hindrance"      
## [331] "idiots"          "fat"             "dangerous"      
## [334] "twists"          "offensive"       "propaganda"     
## [337] "barbaric"        "trouble"         "weary"          
## [340] "pain"            "underdog"        "betrayal"       
## [343] "confess"         "hated"           "clumsy"         
## [346] "lazy"            "rough"           "failed"         
## [349] "worn"            "blind"           "oppressors"     
## [352] "overdue"         "excuse"          "disregard"      
## [355] "pretend"         "exile"           "worthless"      
## [358] "dissuade"        "fright"          "strange"        
## [361] "offensiveness"   "worried"         "risk"           
## [364] "guilt"           "inconsistencies" "incompetent"    
## [367] "cruelty"         "sack"            "suffering"      
## [370] "depressingly"    "bored"           "tyranny"        
## [373] "traitor"         "disrespectful"   "hostilities"    
## [376] "warp"            "limit"           "confusion"      
## [379] "annoying"        "wild"            "opponent"       
## [382] "fell"            "foolishly"       "disappoint"     
## [385] "invader"         "criticisms"      "viciously"      
## [388] "naive"           "incredulous"     "mess"           
## [391] "grief"           "loot"            "annihilate"     
## [394] "sacrificed"      "unemployed"      "desperate"      
## [397] "fear"            "darkness"        "addicted"       
## [400] "punishable"      "bothered"        "trash"          
## [403] "cheap"           "bashing"         "warped"         
## [406] "wicked"          "objections"      "lie"            
## [409] "overblown"       "ridiculously"    "macabre"        
## [412] "adamant"         "suspicion"       "grim"           
## [415] "attack"          "pity"

Create Positive Words wordcloud

pos.tdm = dtm[,which(colnames(dtm) %in% positive_words)]
m = as.matrix(pos.tdm)
v = sort(colSums(m), decreasing = TRUE)
dev.new() # opens new image window (portable; windows() only exists on Windows)
wordcloud(names(v), v, scale=c(4,1), min.freq=1, max.words=100, colors=brewer.pal(8, "Dark2"))
title(sub = "Positive Words - Wordcloud")
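
The column selection and frequency ranking used above can be tried on a small made-up matrix. This is only a sketch: toy_dtm, toy_positive and their counts are invented for illustration and are not taken from the review corpus.

```r
# Toy document-term matrix: rows are documents, columns are terms (made-up counts)
toy_dtm <- matrix(
  c(2, 0, 1, 0,
    1, 3, 0, 1,
    0, 1, 4, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("great", "plot", "superb", "bad"))
)

# Hypothetical positive word list; "love" does not occur in the toy matrix
toy_positive <- c("great", "superb", "love")

# Keep only the columns whose term is in the positive list;
# drop = FALSE keeps a matrix even if a single column matches
toy_pos <- toy_dtm[, colnames(toy_dtm) %in% toy_positive, drop = FALSE]

# Total frequency of each retained term across all documents, highest first
toy_freq <- sort(colSums(toy_pos), decreasing = TRUE)
print(toy_freq)
```

The resulting named vector has the same shape as v, so it could be passed to wordcloud() in the same way.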

# plot barchart for top tokens
# Use explicit word/freq columns; mapping y to the data frame itself triggers
# ggplot2's "Don't know how to automatically pick scale" warning
test = data.frame(word = names(head(v, 15)), freq = as.numeric(head(v, 15)))
#windows() # opens new image window
ggplot(test, aes(x = word, y = freq)) + 
  geom_bar(stat = "identity", fill = "blue") +
  geom_text(aes(label = freq), vjust= -0.20) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Create Negative Words wordcloud

neg.tdm = dtm[,which(colnames(dtm) %in% negative_words) ]
m = as.matrix(neg.tdm)
v = sort(colSums(m), decreasing = TRUE)
dev.new() # opens new image window (portable; windows() only exists on Windows)
wordcloud(names(v), v, scale=c(4,1), min.freq=1, max.words=100, colors=brewer.pal(8, "Dark2"))
title(sub = "Negative Words - Wordcloud")

# plot barchart for top tokens
# Use explicit word/freq columns; mapping y to the data frame itself triggers
# ggplot2's "Don't know how to automatically pick scale" warning
test = data.frame(word = names(head(v, 15)), freq = as.numeric(head(v, 15)))
#windows()
ggplot(test, aes(x = word, y = freq)) + 
  geom_bar(stat = "identity", fill = "red") +
  geom_text(aes(label = freq), vjust= -0.20) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

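# Positive words vs negative words plot
# Count the words found in one document; "-" marks a document with no matches
len = function(x){
  x = unlist(x)
  # Check the length first so that && never receives a vector longer than one
  if (length(x) == 1 && x == "-") {return (0)} 
  else {return(length(x))}
}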

pcount = unlist(lapply(p, len))
ncount = unlist(lapply(n, len))
doc_id = seq_along(wc)

# windows()
plot(doc_id,pcount,type="l",col="green",xlab = "Document ID", ylab= "Word Count")
lines(doc_id,ncount,type= "l", col="red")
title(main = "Positive words vs Negative Words" )
legend("topright", inset=.05, c("Positive Words","Negative Words"), fill=c("green","red"), horiz=TRUE)
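
The per-document counting behind pcount and ncount can be checked on toy input. The sketch below assumes, as the code above does, that a document without matches is stored as the single string "-"; toy_p and count_words are hypothetical names used only here.

```r
# Hypothetical per-document word lists in the style of p and n;
# "-" stands for "no sentiment words found in this document"
toy_p <- list(c("great", "superb"), "-", "love", c("perfect", "classic", "hero"))

count_words <- function(x) {
  x <- unlist(x)
  # identical() compares the whole object, so it is safe for vectors of any length
  if (identical(x, "-")) 0L else length(x)
}

# vapply() enforces that every document yields exactly one integer
toy_counts <- vapply(toy_p, count_words, integer(1))
print(toy_counts)
```

The second document contributes zero words rather than one, which is what keeps the plotted lines from overstating documents with no matches.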

# Document sentiment running plot
#windows()
plot(pol$all$polarity, type = "l", ylab = "Polarity Score",xlab = "Document Number")
abline(h=0)
title(main = "Polarity Plot" )
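
As a possible companion to the polarity plot, the scores can be summarised numerically. This is a sketch on made-up numbers, not output from the review corpus: toy_polarity stands in for pol$all$polarity, and the running mean is an addition for illustration, not part of the original analysis.

```r
# Made-up polarity scores, one per document (stand-in for pol$all$polarity)
toy_polarity <- c(0.45, -0.20, 0.00, 0.80, -0.65, 0.10)

# Label each document by the sign of its score
toy_label <- ifelse(toy_polarity > 0, "positive",
                    ifelse(toy_polarity < 0, "negative", "neutral"))

# Fixed factor levels so that every class is reported, even with zero documents
toy_table <- table(factor(toy_label, levels = c("negative", "neutral", "positive")))
print(toy_table)

# Running mean: average polarity over documents 1..i
toy_running <- cumsum(toy_polarity) / seq_along(toy_polarity)
print(round(toy_running, 3))
```

The running mean could be drawn over the existing plot with lines(), which would make any drift between the positive and negative batches of reviews easier to see than in the raw per-document scores.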