TABA_Assignment1_IMDB_Reviews_How_to_train_your

R Markdown for extracting and analysing the IMDB user reviews of the movie How to Train Your Dragon

This is an R Markdown document that contains the analysis of the user reviews for the movie How to Train Your Dragon, which was released in 2010.

The user reviews are taken from the following webpage: http://www.imdb.com/title/tt0892769/reviews?

Let’s start by loading the libraries and packages needed for the extraction and analysis of the user reviews.

#install.packages('RSelenium')
rm(list=ls())

#--------------------------------------------------------#
# Step 0 - Assign Library & define functions             #
#--------------------------------------------------------#

library(qdap)
library(textir)
library(text2vec)
library(data.table)
library(stringr)
library(tm)
library(RWeka)
library(tokenizers)
library(slam)
library(wordcloud)
library(ggplot2)
library(igraph)
library("rvest")
library(rJava)

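The user reviews are extracted from the webpage page by page, using the start offsets in counts. The url1 variable holds the URL of the positive reviews (filter=love) and url2 holds the URL of the negative reviews (filter=hate). The review text scraped from both pages is appended to reviews.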

counts = c(0,10,20,30,40,50)
reviews = NULL
for (j in counts){
  url1 = paste0("http://www.imdb.com/title/tt0892769/reviews?filter=love;filter=love;start=",j)
  url2 = paste0("http://www.imdb.com/title/tt0892769/reviews?filter=hate;filter=hate;start=",j)
  
  page1 = read_html(url1)
  page2 = read_html(url2)
  
  # CSS selector for the review text: #tn15content div+ p
  reviews1 = html_text(html_nodes(page1,'#tn15content div+ p'))
  reviews2 = html_text(html_nodes(page2,'#tn15content div+ p'))
  
  reviews.positive = setdiff(reviews1, c("*** This review may contain spoilers ***","Add another review"))
  reviews.negative = setdiff(reviews2, c("*** This review may contain spoilers ***","Add another review"))
  
  reviews = c(reviews,reviews.positive,reviews.negative)
  
}

reviews = gsub("\n",' ',reviews)
reviews = gsub("\r",' ',reviews)
writeLines(reviews,'How_to_Train_your_Dragon_IMDB_Reviews.txt')
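
The extraction steps above can be tried offline with a small base R sketch. Everything here is illustrative: build_urls is a hypothetical helper, the simplified query string (one filter= instead of the repeated one used above) is an assumption, and scraped is made-up text standing in for the html_text(html_nodes(...)) result of a single page. No request is sent to IMDB.

```r
# Hypothetical helper (not in the original): one URL per page offset.
base_url <- "http://www.imdb.com/title/tt0892769/reviews"
build_urls <- function(filter, starts) {
  paste0(base_url, "?filter=", filter, ";start=", starts)
}
love_urls <- build_urls("love", seq(0, 50, by = 10))
hate_urls <- build_urls("hate", seq(0, 50, by = 10))

# Made-up stand-in for the text scraped from one page.
scraped <- c("Great film.\nLoved it.",
             "*** This review may contain spoilers ***",
             "Too loud.\r\nNot for me.",
             "Add another review")
boilerplate <- c("*** This review may contain spoilers ***",
                 "Add another review")

# setdiff() drops the boilerplate; it also drops duplicated reviews.
page_reviews <- setdiff(scraped, boilerplate)

# Flatten line breaks so that each review occupies exactly one line on disk.
page_reviews <- gsub("[\r\n]+", " ", page_reviews)

# Round trip through a temporary file: one review per line.
out_file <- tempfile(fileext = ".txt")
writeLines(page_reviews, out_file)
round_trip <- readLines(out_file)
```

Because line breaks are flattened before writeLines, readLines returns one element per review, which is what the later steps rely on when they assign one document ID per line.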

The snippet below defines the function text.clean. It removes HTML tags, non-ASCII characters, punctuation, numbers and extra white space, converts the text to lower case, and returns the cleaned input.

text.clean = function(x)                    # text data
{ require("tm")
  x  =  gsub("<.*?>", " ", x)               # regex for removing HTML tags
  x  =  iconv(x, "latin1", "ASCII", sub="") # Keep only ASCII characters
  x  =  gsub("[^[:alnum:]]", " ", x)        # keep only alpha numeric 
  x  =  tolower(x)                          # convert to lower case characters
  x  =  removeNumbers(x)                    # removing numbers
  x  =  stripWhitespace(x)                  # removing white space
  x  =  gsub("^\\s+|\\s+$", "", x)          # remove leading and trailing white space
  return(x)
}
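
To see what each cleaning step does, here is a minimal base R sketch of the same pipeline. It is an approximation, not the function above: text_clean_base is a hypothetical name, and the two tm calls (removeNumbers, stripWhitespace) are replaced by gsub patterns that are assumed to behave the same way on plain character vectors. The sample review is made up.

```r
# Base R approximation of text.clean (hypothetical name, no tm dependency).
text_clean_base <- function(x) {
  x <- gsub("<.*?>", " ", x)                  # drop HTML tags
  x <- iconv(x, "latin1", "ASCII", sub = "")  # keep only ASCII characters
  x <- gsub("[^[:alnum:]]", " ", x)           # punctuation becomes a space
  x <- tolower(x)                             # lower case
  x <- gsub("[[:digit:]]+", "", x)            # stand-in for removeNumbers()
  x <- gsub("[[:space:]]+", " ", x)           # stand-in for stripWhitespace()
  x <- gsub("^\\s+|\\s+$", "", x)             # trim both ends
  x
}

sample_review <- "<p>I watched it 5 times!</p>  It's GREAT."
cleaned_review <- text_clean_base(sample_review)
print(cleaned_review)
```

On this input the result is "i watched it times it s great". Note that the apostrophe in "It's" turns into a space, so contractions are split into two tokens; the stopword list built later is adjusted in the same way for that reason.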

Let’s read in the text file to which the IMDB reviews for the movie were written. The stopword list is also read in and applied in the snippet below.

#--------------------------------------------------------#
# Step 1 - Reading text data                             #
#--------------------------------------------------------#
temp.text = readLines(file.choose())  # select the reviews file written above (How_to_Train_your_Dragon_IMDB_Reviews.txt)
head(temp.text, 5)

## [1] " I watched How to Train Your Dragon about 5 times now, and it never gets boring. It actually keeps on getting better and better with with more and more views. This is a huge accomplishment for DreamWorks Animation, it might actually be its Best Animated Feauture it yet. It is an amazing experience to watch this film in Cinema. The 3D is amazing and at times Breathtaking. I may of had the most fun that I've ever had in Cinema watching How to Train Your Dragon.The script is really good and is has a lot of dramatic depth. This movie is for everyone. Adults and Kids will enjoy it equally and will love it at the end. This movie will probably become a series like Shrek. But I'm hoping this film doesn't get bad sequels like Shrek 3 and Shrek Forever After. Anyways this film will be most recognized for its beautiful animation.10/10 Highly Recommended "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## [2] " If this is done following the same old beat up formula that Hollywood sticks to with regards to animation, then the dragons will be yakking non-stop. Thank goodness that this film, directed by Dean DeBlois and Chris Sanders, avoids this like the plague, and Jay Baruchel voices Hiccup, a viking kid who happens to be more brains than brawn, more scrawny than buffed, and this of course sets him apart from the rest of his warrior clan folks, who are battle scarred from the constant defense of the village pests - dragons who come from afar to plunder their livestock and setting their houses on fire, so much so that every house on the block is relatively brand new. Wanting to help out in any way he can, he's deemed more of a liability than an asset, especially when even his dad Stoick (Gerard Butler) cannot appreciate his unique, technical talent.In a stroke of uncanny luck, Hiccup downs a flying dragon in the heat of battle, and his compassion meant to set the dragon free, rather than trying to prove himself to be a worthy viking man by killing it. And it's a rare specimen of a dragon too, which would have brought him instant glory. So a bond between man and mythical beast gets struck, and christened as Toothless, this is one pest who slowly grows into a pet, with Hiccup's secret rendezvous resulting in growing appreciation for the species, despite what the knowledge that his kinsman had compiled into a Dragon compendium which details facts all ending with an advisory on compulsory annihilation.The story here is the strength of the film, being witty, smart but never condescending nor insulting the intelligence of the audience. While most characters are caricatures, especially Hiccup's peers, a lot of effort have been put into creating the leads as multi-dimensional and full of heart, and I enjoyed how the characters are so open to their emotions, that it becomes a lot more real than the photo realistic 3D animation and effects. 
Sure there's the usual father-son misunderstanding and expectations, and how a zero turns to hero, or even the theme of fearing something that we don't fully comprehend, but it's the manner in which the usual got delivered, that made all the difference. Especially so for its anti-war stance, that all it takes is a little step back from the common battle-cry, and instead seek to be understood, by holding out an olive branch, and to understand first.For those who enjoy the mythology of the dragon creature, there are a number of ideas thrown up in the film that would make you nod in appreciation how these got conjured up for the film, and they worked wonders, even though they may be a tad predictable plot wise. And I'm betting that a lot of folks out there will take to Toothless, thanks to its \"stitch\"-ish design similar to Lilo and Stitch (since it's co-director Chris Sander's previous work) and huge saucer like eyes, plus a lovable demeanour built into the character that's always apprehensive, and mischievous. Being the creature that has no track record also helped, since it ropes you into a journey of friendship, bonding and discovery with Hiccup as to how powerful his new found friend can be, not to mention how symbiotic their relationship will evolve into as well.Action junkies will find the action sequences in the film faultless, and the 3D got specifically crafted for certain set action pieces that really had me ducking for cover, for once. Fights are incredible, and always accompanied either by humour that worked without the feeling that it was deliberate nor just tried too hard, coupled with the comedic voice talents such as Jonah Hill and Christopher Mintz-Plasse.How to Train Your Dragon is similar to last year's Cloudy With a Chance of Meatballs - Long titles, great story, beautiful animation and a total delight. Highly recommended, and it goes into my list as contenders for best films of this year! "
## [3] " I am not at all interested in dragons and all such fantasy creatures. I don't like children movies with all their stupid messages. I saw this movie rather just to pass the time than to watch it for its sake. And Whoa! I was drawn in this river in first 5 minutes. And what a experience it has been! Right from the start as the narrator describes his world, you are immediately there. You feel yourself in the characters place. The movie does that for you. This is very uncommon movie and it has set a milestone for 3D, not because of its technical aspects, but because of the Depth this movie has. This movie is as much for a 7 year old as it is for an old man who has seen a lot of life. This movie will entertain each viewer in his own way. This is a masterpiece! This movie isn't what it sounds on the surface. It has layers of meanings attached to it. Look at just the title: How to train your Dragon!. If you see it carefully you will notice that there is more to it than meets the eye. Watch the movie and you will know what i mean. This movie cleverly comments on Human Fear, War, Friendship, prejudices, courage, Love. ........................... Don't miss this movie or you will miss one of the few periods when you really LIVE. Note: Just remember to carry your heart with you when you see this movie. It will fill your heart with nothing but what should truly belong there. 10/10. 
"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## [4] " I saw the trailer and I enjoyed it but I was afraid that all the good parts from the movie will be there and that will be all, like it was with many films lately. That was certainly not the case. There are way better parts that were left to be discovered and I definitely congratulate the choice.I didn't read the book, so I don't know the story, witch might have suffered, as stories usually do from books to picture, but I think a writer couldn't hope for a better image, better portraits of characters, especially the black dragon who one definitely falls in love with - the mimic and the gestures and the face expressions, so complex and real. I agree it's not the kind of movie that makes you keep thinking too much once it's finished bot it's not meant to be. It's just lovely, from the beginning to the end, I really laughed and I was anxious for the characters when they suffered (and I'm 22). The film wasn't too long, it didn't have stupid lines whatsoever and it put to silence the annoying child behind me from the first five minutes or so, which I believe says it all.I don't know if I will actually go to the cinema but I definitely want to see it again. Great special effects and, again, a very lovely dragon. 
"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
## [5] " I saw this film in early March, of 2010 in Indianapolis. I am one of the judges for the Heartland Truly Moving Picture Award. A Truly Moving Picture \"<U+0085>explores the human journey by artistically expressing hope and respect for the positive values of life.\" Heartland gave that award to this film.It's in 3-D and it's gorgeous animation. But what really matters is the story. And it's a good one. At first it seems the main story is about a Viking colony equally distant from nowhere, which is being constantly attacked by a wide variety of marauding dragons. It's a full time job trying to keep the dragons at bay and the Viking warriors are often out on their boats hunting their wily and ferocious opponents.But really the story is about a father and chief of the Vikings who has a young son, Hiccup, who is small and who is a slick, sarcastic talker and who doesn't take orders well, but still seeks respect from his impressive father. At first, his Father will not let his son be a warrior Viking, but later relents to have Hiccup train with the other youngsters. But the young boy gets sidetracked and instead of wanting to kill dragons, the boy befriends them and seeks to understand them.A young and inexperienced son seeking approval of a strong father is an often-told tale. Sons often act foolishly trying to impress their fathers. And fathers often ignore the strivings of their sons. In this case, there is honor and courage on all sides and it is inspiring to watch the father and son wrestle with their relationship.And yes, about the dragons <U+0096> they ARE ferocious and talented and aggressive warriors. But their motivations are a mystery that unfolds slowly. And that's the fun of this film.FYI <U+0096> There is a Truly Moving Pictures web site where there is a listing of past Truly Moving Picture Award winners that are now either at the theater or available on video. "

data = data.frame(id = 1:length(temp.text),  # creating doc IDs if name is not given
                  text = temp.text, 
                  stringsAsFactors = F)
dim(data)

## [1] 120   2

# Read Stopwords list
stpw1 = readLines(file.choose())      # read-in stopwords.txt
stpw2 = tm::stopwords('english')      # tm package stop word list; the tokenizers package has a function of the same name, hence 'tm::'
comn  = base::unique(c(stpw1, stpw2))         # union of the two lists
stopwords = unique(gsub("'"," ",comn))  # final stop word list; apostrophes become spaces to match the cleaned text

x  = text.clean(data$text)                # applying func defined above to pre-process text corpus
x  =  removeWords(x,stopwords)            # removing stopwords created above
x  =  stripWhitespace(x)                  # removing white space
# x  =  stemDocument(x)                   # can stem doc if needed.
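
The stopword step can be illustrated without tm. The sketch below is an assumption-laden stand-in for removeWords: remove_words_base is a hypothetical helper that removes whole-word matches with a single regular expression, and the four-word stop list and the sample sentence are made up.

```r
# Hypothetical base R stand-in for tm::removeWords().
remove_words_base <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)   # delete whole-word matches only
  trimws(gsub("[[:space:]]+", " ", x))     # tidy the gaps left behind
}

# Apostrophes are replaced by spaces, as in the stop word list above,
# so "don't" becomes "don t" and matches the cleaned text.
stop_list <- unique(gsub("'", " ", c("i", "don't", "the", "it")))

cleaned  <- "i don t like the dragon but it is fun"
filtered <- remove_words_base(cleaned, stop_list)
print(filtered)
```

Here the result is "like dragon but is fun". The word boundaries keep the "i" inside "like" and "is" intact, and the entry "don t" only matches because the text was cleaned with the same apostrophe-to-space rule.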

Let’s perform the basic analysis by creating a Document Term Matrix (DTM) of the cleaned review text using the text2vec package.

#--------------------------------------------------------#
## Step 2: Create DTM using text2vec package             #
#--------------------------------------------------------#

t1 = Sys.time()

tok_fun = word_tokenizer  # using word & not space tokenizers

it_0 = itoken( x,
               #preprocessor = text.clean,
               tokenizer = tok_fun,
               ids = data$id,
               progressbar = T)

vocab = create_vocabulary(it_0,    #  func collects unique terms & corresponding statistics
                          ngram = c(2L, 2L) #,
                          #stopwords = stopwords
)

## 
  |=================================================================| 100%

# length(vocab); str(vocab)     # view what vocab obj is like

pruned_vocab = prune_vocabulary(vocab,  # filters input vocab & throws out v frequent & v infrequent terms
                                term_count_min = 3)
# doc_proportion_max = 0.5,
# doc_proportion_min = 0.001)

# length(pruned_vocab);  str(pruned_vocab)

vectorizer = vocab_vectorizer(pruned_vocab) #  creates a text vectorizer func used in constructing a dtm/tcm/corpus

dtm_0  = create_dtm(it_0, vectorizer) # high-level function for creating a document-term matrix

## 
  |=================================================================| 100%

# Sort bi-gram with decreasing order of freq
tsum = as.matrix(t(rollup(dtm_0, 1, na.rm=TRUE, FUN = sum))) # find sum of freq for each term
tsum = tsum[order(tsum, decreasing = T),]       # terms in decreasing order of freq
head(tsum)

##   train_dragon   jay_baruchel main_character  gerard_butler     night_fury 
##             97             23             18             16             14 
##   dean_deblois 
##             12

tail(tsum)

##    highly_recommend     main_antagonist          voiced_jay 
##                   3                   3                   3 
##        shrek_movies     animated_action absolutely_stunning 
##                   3                   3                   3

tsum = tsum[1:50]
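
The logic of Step 2 (build bi-grams, prune rare ones, count per document, sort by total frequency) can be sketched in base R on a toy corpus. This is not how text2vec implements it; make_bigrams, the three toy documents and the threshold of 2 are all assumptions chosen to keep the example small.

```r
# Toy corpus, already cleaned and stop-word filtered (made-up data).
docs <- c("train dragon night fury",
          "night fury train dragon",
          "train dragon")

# Hypothetical helper: join each pair of adjacent tokens with "_".
make_bigrams <- function(doc) {
  tokens <- strsplit(doc, " ", fixed = TRUE)[[1]]
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1), sep = "_")
}
bigrams_by_doc <- lapply(docs, make_bigrams)

# Vocabulary with corpus-wide counts; keep terms seen at least twice
# (the analogue of term_count_min, with a smaller threshold for the toy data).
vocab_counts <- table(unlist(bigrams_by_doc))
pruned_terms <- names(vocab_counts)[vocab_counts >= 2]

# Document-term matrix: one row per document, one column per kept bi-gram.
toy_dtm <- t(vapply(bigrams_by_doc,
                    function(b) as.integer(table(factor(b, levels = pruned_terms))),
                    integer(length(pruned_terms))))
colnames(toy_dtm) <- pruned_terms

# Total frequency per term, most frequent first.
term_totals <- sort(colSums(toy_dtm), decreasing = TRUE)
print(term_totals)
```

In this toy corpus train_dragon occurs 3 times and night_fury twice, while dragon_night and fury_train occur once each and are pruned. Because stopwords are removed before the bi-grams are formed, words that were not adjacent in the original review (as in "train your dragon") can end up in the same bi-gram.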

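In Step 2 above we created the DTM from bi-grams. The argument ngram = c(2L, 2L) passed to create_vocabulary builds a vocabulary that contains only bi-grams. Next, the 50 most frequent bi-grams are recoded as single tokens (uni-grams) in the cleaned text corpus: wherever the two words of a bi-gram occur next to each other, the space between them is replaced by an underscore. Progress is displayed with txtProgressBar and setTxtProgressBar.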
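
Before the full loop, the recoding idea can be checked on a toy example. The sketch below uses made-up text and a two-term list in place of names(tsum); it follows the same padding-with-spaces approach as the loop that follows, with fixed = TRUE added as an assumption so that the terms are matched literally.

```r
# Made-up cleaned text and a short list of frequent bi-grams.
toy_text    <- c("hiccup meets night fury",
                 "train dragon train dragon sequel")
top_bigrams <- c("train_dragon", "night_fury")

# Pad with spaces so that only whole words can match.
recoded <- paste("", toy_text, "")
for (term in top_bigrams) {
  focal   <- paste("", gsub("_", " ", term, fixed = TRUE), "")
  recoded <- gsub(focal, paste("", term, ""), recoded, fixed = TRUE)
}
recoded <- trimws(recoded)
print(recoded)
```

The first document becomes "hiccup meets night_fury". The second becomes "train_dragon train dragon sequel": when the same bi-gram occurs twice in a row, the two occurrences share a single space, so one pass of gsub recodes only the first. If that matters for the analysis, running the replacement a second time or matching on word boundaries instead of padded spaces would be possible variations.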

#-------------------------------------------------------
# Code bi-grams as unigram in clean text corpus

text2 = x
text2 = paste("",text2,"")

pb <- txtProgressBar(min = 1, max = (length(tsum)), style = 3) ; i = 0

for (term in names(tsum)){
  i = i + 1
  focal.term = gsub("_", " ",term)        # bi-gram as it appears in the text, with a space instead of "_"
  replacement.term = term
  text2 = gsub(paste("",focal.term,""),paste("",replacement.term,""), text2)
  setTxtProgressBar(pb, i)
}

## 
  |=========================================================        |  88%
  |                                                                       
  |==========================================================       |  90%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |=============================================================    |  94%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%

it_m = itoken(text2,     # function creates iterators over input objects to vocabularies, corpora, DTM & TCM matrices
              # preprocessor = text.clean,
              tokenizer = tok_fun,
              ids = data$id,
              progressbar = T)

vocab = create_vocabulary(it_m     # vocab func collects unique terms and corresponding statistics
                          # ngram = c(2L, 2L),
                          #stopwords = stopwords
)


# length(vocab); str(vocab)     # view what vocab obj is like

pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 1)
# doc_proportion_max = 0.5,
# doc_proportion_min = 0.001)

vectorizer = vocab_vectorizer(pruned_vocab)

dtm_m  = create_dtm(it_m, vectorizer)


dim(dtm_m)

## [1]  120 3506

dtm = as.DocumentTermMatrix(dtm_m, weighting = weightTf)
a0 = (apply(dtm, 1, sum) > 0)   # build vector to identify non-empty docs
dtm = dtm[a0,]                  # drop empty docs

print(difftime(Sys.time(), t1, units = 'sec'))

## Time difference of 1.551088 secs

Now we sort the DTM from the most frequent tokens to the least frequent tokens.

# view a sample of the DTM, sorted from most to least frequent tokens 
dtm = dtm[,order(apply(dtm, 2, sum), decreasing = T)]     # sorting dtm's columns in decreasing order of column sums
inspect(dtm[1:5, 1:5])     # inspect() func used to view parts of a DTM object

## <<DocumentTermMatrix (documents: 5, terms: 5)>>
## Non-/sparse entries: 14/11
## Sparsity           : 44%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs dragon story characters train_dragon animation
##    1      0     0          0            2         1
##    2      5     1          2            1         3
##    3      0     0          1            1         0
##    4      2     1          2            0         0
##    5      0     3          0            0         1
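The two housekeeping steps used so far (dropping empty documents and sorting columns by total count) can be tried on a small matrix. This is a minimal sketch with hypothetical counts, not data from the reviews.

```r
# Toy document-term counts (hypothetical data, not taken from the reviews)
m = matrix(c(0, 2, 1,
             3, 2, 1,
             0, 0, 0),
           nrow = 3, byrow = TRUE,
           dimnames = list(Docs = 1:3, Terms = c("story", "dragon", "viking")))

m = m[rowSums(m) > 0, , drop = FALSE]            # drop empty documents (row 3 here)
m = m[, order(colSums(m), decreasing = TRUE)]    # most frequent terms first
colnames(m)
```

After sorting, the first columns of the matrix always hold the most frequent tokens, which is why inspecting the top-left corner is informative.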

Let's find the term frequencies in the user reviews of the movie How to train your dragon from the IMDB website. The columns of the DTM are divided into 100 manageable chunks; the column sums of each chunk are computed and collected, which gives the total count of every term across all reviews.

#   1- Using Term frequency(tf)             

tst = round(ncol(dtm)/100)  # divide DTM's cols into 100 manageable parts
a = rep(tst,99)
b = cumsum(a);rm(a)
b = c(0,b,ncol(dtm))

ss.col = c(NULL)
for (i in 1:(length(b)-1)) {
  tempdtm = dtm[,(b[i]+1):(b[i+1])]
  s = colSums(as.matrix(tempdtm))
  ss.col = c(ss.col,s)
}

tsum = ss.col
tsum = tsum[order(tsum, decreasing = T)]       #terms in decreasing order of freq
head(tsum)

##       dragon        story   characters train_dragon    animation 
##          139          125          112           97           84 
##    toothless 
##           75

tail(tsum)

## sympathise      armor      bless      grasp   fragment    clothes 
##          1          1          1          1          1          1
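The chunked column sums used above give the same totals as a single column-sum call; chunking only limits how much of the sparse matrix is converted to a dense matrix at one time. A minimal sketch on hypothetical data (the matrix size and the number of chunks are assumptions chosen for illustration):

```r
set.seed(1)                                            # reproducible toy data
m = matrix(rpois(23 * 4, lambda = 1), nrow = 4,
           dimnames = list(NULL, paste0("t", 1:23)))   # 4 docs x 23 hypothetical terms

n_chunks = 5
size = round(ncol(m) / n_chunks)                       # chunk width, like tst above
b = c(0, cumsum(rep(size, n_chunks - 1)), ncol(m))     # chunk boundaries

chunked = unlist(lapply(seq_len(length(b) - 1), function(i) {
  colSums(m[, (b[i] + 1):b[i + 1], drop = FALSE])      # sums for one block of columns
}))

all.equal(chunked, colSums(m))   # compare with a single colSums() call
```

The last chunk absorbs the columns left over after rounding, exactly as the final boundary ncol(dtm) does in the loop above.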

Let's build a word cloud to see the most frequent words in the user reviews for the movie How to train your dragon.

#--------------------------------------------------------#
## Step 2a:     # Build word cloud                       #
#--------------------------------------------------------#


windows()  # New plot window
wordcloud(names(tsum), tsum,     # words, their freqs 
          scale = c(4, 0.5),     # range of word sizes
          1,                     # min.freq of words to consider
          max.words = 200,       # max #words
          colors = brewer.pal(8, "Dark2"))    # Plot results in a word cloud 
title(sub = "Term Frequency - Wordcloud")     # title for the wordcloud display

From the word cloud we can observe that the most frequently used terms include dragons, characters, hiccup, story, animation and vikings.

Let's plot the most frequent words in a bar plot and read off how often each one is used across all the user reviews extracted from IMDB.

# plot barchart for top tokens
# explicit columns so that aes() refers to real variables of the data frame
test = data.frame(term = names(tsum)[1:15], freq = round(tsum[1:15], 0))

windows()  # New plot window
ggplot(test, aes(x = term, y = freq)) + 
  geom_bar(stat = "identity", fill = "Blue") +
  geom_text(aes(label = freq), vjust= -0.20) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

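From the bar plot we can read off the frequency of each of the top 15 tokens. In the output printed above, dragon is the most frequent token with 139 occurrences, followed by story (125), characters (112), train_dragon (97), animation (84) and toothless (75).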
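The bar plot orders the tokens alphabetically on the x-axis, which makes the ranking harder to see. A minimal sketch of ordering the bars by frequency with reorder(), using the six counts printed by head(tsum) above (the data frame top and its column names are introduced here for illustration only):

```r
library(ggplot2)

# Counts copied from the head(tsum) output above
top = data.frame(term = c("dragon", "story", "characters",
                          "train_dragon", "animation", "toothless"),
                 freq = c(139, 125, 112, 97, 84, 75))

p = ggplot(top, aes(x = reorder(term, -freq), y = freq)) +   # bars sorted by descending count
  geom_bar(stat = "identity", fill = "blue") +
  geom_text(aes(label = freq), vjust = -0.20) +
  labs(x = "term", y = "frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p)
```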

Now let's repeat the analysis with the term frequencies weighted by inverse document frequency (TF-IDF).

# -------------------------------------------------------------- #
# step 2b - Using Term frequency inverse document frequency (tfidf)             
# -------------------------------------------------------------- #

dtm.tfidf = tfidf(dtm, normalize=F)

tst = round(ncol(dtm.tfidf)/100)
a = rep(tst, 99)
b = cumsum(a);rm(a)
b = c(0,b,ncol(dtm.tfidf))

ss.col = c(NULL)
for (i in 1:(length(b)-1)) {
  tempdtm = dtm.tfidf[,(b[i]+1):(b[i+1])]
  s = colSums(as.matrix(tempdtm))
  ss.col = c(ss.col,s)
}

tsum = ss.col

tsum = tsum[order(tsum, decreasing = T)]       #terms in decreasing order of freq
head(tsum)

##    toothless       dragon   dreamworks         book        great 
##     96.82381     85.22152     80.95117     77.78192     77.55118 
## train_dragon 
##     77.45525

tail(tsum)

## sympathise      armor      bless      grasp   fragment    clothes 
##   4.094345   4.094345   4.094345   4.094345   4.094345   4.094345
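The TF-IDF weights above can be reproduced by hand on a small matrix. The sketch below assumes the weight is the raw count multiplied by log(N / (df + 1)), where N is the number of documents and df is the number of documents containing the term. This form is inferred from the printed output (a term occurring once in one of the 120 documents scores log(120/2) = 4.094345) and is not confirmed by the source as the exact definition used by tfidf(); the toy counts are hypothetical.

```r
# Toy term-frequency matrix: 4 documents x 3 terms (hypothetical counts)
tf = matrix(c(2, 1, 0,
              1, 0, 0,
              3, 0, 1,
              0, 1, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(Docs = 1:4, Terms = c("dragon", "story", "armor")))

df  = colSums(tf > 0)                   # number of documents containing each term
idf = log(nrow(tf) / (df + 1))          # assumed IDF form (see the note above)
tfidf_manual = sweep(tf, 2, idf, `*`)   # multiply every column by its IDF weight

round(colSums(tfidf_manual), 4)         # summed TF-IDF score per term

# Check against the printed output: a term seen once in 1 of 120 documents
log(120 / (1 + 1))
```

Under this weighting a term that appears in almost every document gets a weight near zero, which is why generic words drop down the ranking while names such as toothless and dreamworks rise.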

Let's plot the word cloud for the TF-IDF scores as well.

windows()  # New plot window
wordcloud(names(tsum), tsum, scale=c(4,0.5),1, max.words=200,colors=brewer.pal(8, "Dark2")) # Plot results in a word cloud 
title(sub = "Term Frequency Inverse Document Frequency - Wordcloud")

Let's view the top tokens and their TF-IDF scores.

as.matrix(tsum[1:20])     #  to see the top few tokens & their TF-IDF scores

##                  [,1]
## toothless    96.82381
## dragon       85.22152
## dreamworks   80.95117
## book         77.78192
## great        77.55118
## train_dragon 77.45525
## characters   75.78120
## viking       71.37482
## time         68.23844
## story        67.37456
## village      66.29510
## pixar        64.47690
## character    64.03264
## animation    64.01976
## good         63.03375
## watch        61.86471
## plot         59.64823
## amazing      59.60740
## movies       59.27317
## kids         58.99467

Let's view the first 5 x 5 cells of the Document Term Matrix under TF-IDF weighting.

(dtm.tfidf)[1:5, 1:5]   # view first 5x5 cells in the DTM under TF IDF.

## 5 x 5 sparse Matrix of class "dgCMatrix"
##     Terms
## Docs   dragon     story characters train_dragon animation
##    1 .        .          .            1.5970154 0.7621401
##    2 3.065522 0.5389965  1.3532358    0.7985077 2.2864202
##    3 .        .          0.6766179    0.7985077 .        
##    4 1.226209 0.5389965  1.3532358    .         .        
##    5 .        1.6169895  .            .         0.7621401

Let's plot the bar chart for the top tokens obtained through TF-IDF.

# plot barchart for top tokens
# explicit columns so that aes() refers to real variables of the data frame
test = data.frame(term = names(tsum)[1:15], freq = round(tsum[1:15], 0))
windows()  # New plot window
ggplot(test, aes(x = term, y = freq)) + 
  geom_bar(stat = "identity", fill = "red") +
  geom_text(aes(label = freq), vjust= -0.20) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Let's build the Term Co-occurrence Matrix (TCM) for the words in the text extracted from the IMDB user reviews and plot the co-occurrence graph (COG). The initial plot is undirected and shows the links among the 30 terms with the highest total co-occurrence.

#------------------------------------------------------#
# step 2c - Term Co-occurrence Matrix (TCM)            #
#------------------------------------------------------#

vectorizer = vocab_vectorizer(pruned_vocab, 
                              grow_dtm = FALSE, 
                              skip_grams_window = 5L)

tcm = create_tcm(it_m, vectorizer) # func to build a TCM


tcm.mat = as.matrix(tcm)         # use tcm.mat[1:5, 1:5] to view
adj.mat = tcm.mat + t(tcm.mat)   # since adjacency matrices are symmetric

z = order(colSums(adj.mat), decreasing = T)
adj.mat = adj.mat[z,z]

# Plot Simple Term Co-occurrence graph
adj = adj.mat[1:30,1:30]

library(igraph)
cog = graph.adjacency(adj, mode = 'undirected')
cog =  simplify(cog)  

cog = delete.vertices(cog, V(cog)[ degree(cog) == 0 ])

windows()
plot(cog)
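To make the idea behind the TCM concrete, the sketch below counts co-occurrences within a small window for one hypothetical token sequence and then symmetrises the result, as done for adj.mat above. It uses plain counts in base R as a simplifying assumption; the counting and weighting performed by create_tcm() may differ.

```r
tokens = c("hiccup", "trains", "dragon", "toothless", "dragon", "flies")  # hypothetical review
terms  = unique(tokens)
window = 2L                                   # look at most 2 tokens ahead

tcm_toy = matrix(0, length(terms), length(terms), dimnames = list(terms, terms))
for (i in seq_along(tokens)) {
  ahead = seq_len(min(window, length(tokens) - i))   # offsets that stay inside the text
  for (d in ahead) {
    a = tokens[i]; b = tokens[i + d]
    if (a != b) tcm_toy[a, b] = tcm_toy[a, b] + 1    # count the ordered pair (a before b)
  }
}

adj_toy = tcm_toy + t(tcm_toy)                # symmetrise: order of the pair no longer matters
isSymmetric(adj_toy)
adj_toy["dragon", "toothless"]
```

Each non-zero cell of the symmetric matrix becomes an edge of the undirected graph, which is what graph.adjacency() consumes.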

Let's distill the COG above to make the relationships easier to read. The function below orders the terms by their column sums, treats the top s terms as central nodes, and keeps only the k1 strongest connections of each central node. The resulting graph is plotted with the central nodes in green and the remaining nodes in pink.

#-----------------------------------------------------------#
# Step 2d - a cleaned up or 'distilled' COG Plot            #
#-----------------------------------------------------------#

distill.cog = function(mat1, # input TCM ADJ MAT
                       title, # title for the graph
                       s,    # no. of central nodes
                       k1){  # max no. of connections  
  library(igraph)
  a = colSums(mat1) # collect colsums into a vector obj a
  b = order(-a)     # nice syntax for ordering vector in decr order  
  
  mat2 = mat1[b, b]     # order both rows and columns along vector b
  
  diag(mat2) =  0
  
  ## +++ go row by row and find top k adjacencies +++ ##
  
  wc = NULL
  
  for (i1 in 1:s){ 
    thresh1 = mat2[i1,][order(-mat2[i1, ])[k1]]
    mat2[i1, mat2[i1,] < thresh1] = 0   # neat. didn't need 2 use () in the subset here.
    mat2[i1, mat2[i1,] > 0 ] = 1
    word = names(mat2[i1, mat2[i1,] > 0])
    mat2[(i1+1):nrow(mat2), match(word,colnames(mat2))] = 0
    wc = c(wc,word)
  } # i1 loop ends
  
  
  mat3 = mat2[match(wc, colnames(mat2)), match(wc, colnames(mat2))]
  ord = colnames(mat2)[which(!is.na(match(colnames(mat2), colnames(mat3))))]  # removed any NAs from the list
  mat4 = mat3[match(ord, colnames(mat3)), match(ord, colnames(mat3))]
  graph <- graph.adjacency(mat4, mode = "undirected", weighted=T)    # Create Network object
  graph = simplify(graph) 
  V(graph)$color[1:s] = "green"
  V(graph)$color[(s+1):length(V(graph))] = "pink"
  
  graph = delete.vertices(graph, V(graph)[ degree(graph) == 0 ]) # delete singletons?
  
  plot(graph, 
       layout = layout.kamada.kawai, 
       main = title)
  
} # func ends

windows()
distill.cog(tcm.mat, 'Distilled COG for TF DF',  10,  5)

From the COG above we can infer the main themes that users raised in their reviews of the movie.

The reviews describe the story as simple and original. The movie offered a good visual experience, included action scenes and was good for kids; reviewers also called it awesome. Likewise, the graph shows which traits are most often attached to each character. For example, Hiccup's father is described as big and stoic, with lots of friends.
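The core step of distill.cog, keeping only the strongest connections of each term, can be sketched on a small adjacency matrix. This is a simplified illustration under stated assumptions: the weights are hypothetical, and the extra pruning of columns and singleton vertices done inside distill.cog is left out.

```r
nm  = c("story", "simple", "kids", "action")
adj = matrix(c(0, 5, 3, 1,
               5, 0, 2, 4,
               3, 2, 0, 6,
               1, 4, 6, 0),
             nrow = 4, byrow = TRUE,
             dimnames = list(nm, nm))          # hypothetical co-occurrence weights

k = 2                                          # connections to keep per term
top_k = t(apply(adj, 1, function(row) {
  keep = order(row, decreasing = TRUE)[seq_len(k)]   # positions of the k largest weights
  out = numeric(length(row))
  out[keep] = 1                                      # binarise the retained edges
  out
}))
dimnames(top_k) = dimnames(adj)
top_k
```

Raising k keeps more edges per term and gives a denser, harder to read graph; lowering it leaves only the strongest associations.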

## adj.mat and distilled cog for tfidf DTMs ##
adj.mat = t(dtm.tfidf) %*% dtm.tfidf
diag(adj.mat) = 0
ord0 = order(apply(adj.mat, 2, sum), decreasing = T)   # separate name: a0 (non-empty docs flag) is reused later
adj.mat = as.matrix(adj.mat[ord0[1:50], ord0[1:50]])

windows()
distill.cog(adj.mat, 'Distilled COG for TF IDF',  10,  10)

As with the term-frequency COG, we have plotted the COG for the TF-IDF weights above. It also lets us infer relationships between the characters in the movie: the boy Hiccup and the dragon, Hiccup and his father, the passion for training dragons, and the village of the Vikings.
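The adjacency matrix for the TF-IDF COG comes from multiplying the transposed document-term matrix by itself, so each cell sums the products of two terms' weights over all documents. A minimal sketch with hypothetical weights:

```r
w = matrix(c(1, 2, 0,
             0, 1, 3,
             2, 0, 1),
           nrow = 3, byrow = TRUE,
           dimnames = list(Docs = 1:3,
                           Terms = c("hiccup", "dragon", "father")))  # hypothetical TF-IDF weights

adj = t(w) %*% w        # terms x terms: sum over documents of weight products
diag(adj) = 0           # ignore a term's association with itself
adj
```

Two terms get a large value only when both carry high weights in the same documents, so the graph links terms that are jointly distinctive rather than merely frequent.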

Now that we have completed the basic analysis using term frequency (TF) and TF-IDF, let's analyse the sentiment of the users from the reviews they gave for the movie How to train your dragon.

To perform the sentiment analysis we use the polarity(x1) function from the qdap package, which scores each review against qdap's polarity dictionary. The all component of the result is a data frame whose columns 2 to 5 hold, in order, the word count of each review, its polarity score, the positive words found in it and the negative words found in it.

#--------------------------------------------------------#
#             Sentiment Analysis                         #
#--------------------------------------------------------#

x1 = x[a0]    # remove empty docs from corpus

t1 = Sys.time()   # set timer

pol = polarity(x1)         # Calculate the polarity from qdap dictionary
wc = pol$all[,2]                  # Word Count in each doc
val = pol$all[,3]                 # average polarity score
p  = pol$all[,4]                  # Positive words info
n  = pol$all[,5]                  # Negative Words info  

Sys.time() - t1  # how much time did the above take?

## Time difference of 11.54266 secs
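As a rough check on the scores, the polarity of reviews 3 to 6 in the head(pol$all) output shown next equals the number of positive words minus the number of negative words, divided by the square root of the word count. This simplified form is an assumption for illustration: it ignores how qdap treats negators and amplifiers, so it does not reproduce every row (reviews 1 and 2, for instance).

```r
# Word counts and numbers of positive / negative words for reviews 3 to 6,
# counted from the head(pol$all) output
wc  = c(70, 64, 130, 45)
pos = c(6, 7, 20, 12)
neg = c(5, 8, 11, 0)

simple_polarity = (pos - neg) / sqrt(wc)   # simplified score without valence shifters
round(simple_polarity, 7)
# reported polarity for these reviews: 0.1195229 -0.1250000 0.7893522 1.7888544
```

Dividing by the square root of the word count keeps long reviews from dominating simply because they contain more sentiment words.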

head(pol$all)

##   all  wc   polarity
## 1 all  48  1.3856406
## 2 all 275  1.0733804
## 3 all  70  0.1195229
## 4 all  64 -0.1250000
## 5 all 130  0.7893522
## 6 all  45  1.7888544
##   pos.words
## 1 accomplishment, amazing, amazing, breathtaking, fun, good, enjoy, love, beautiful, recommended
## 2 goodness, talent, luck, compassion, free, worthy, glory, witty, smart, intelligence, leads, enjoyed, realistic, hero, enjoy, worked, wonders, wise, work, lovable, helped, powerful, faultless, incredible, humour, worked, talents, great, beautiful, delight, recommended
## 3 whoa, entertain, masterpiece, cleverly, courage, love
## 4 enjoyed, good, congratulate, love, lovely, great, lovely
## 5 award, respect, positive, award, gorgeous, good, variety, slick, respect, impressive, approval, strong, impress, honor, courage, inspiring, talented, fun, award, winners
## 6 wonderful, thrilling, great, amazing, positive, constructive, perfect, good, good, congratulations, amazing, popular
##   neg.words
## 1 boring, bad
## 2 plague, scarred, plunder, liability, killing, struck, pest, slowly, annihilation, condescending, insulting, misunderstanding, anti, cry, apprehensive, mischievous, hard, cloudy
## 3 stupid, fear, prejudices, miss, miss
## 4 afraid, suffered, falls, complex, anxious, suffered, stupid, annoying
## 5 wily, sarcastic, sidetracked, kill, inexperienced, foolishly, ignore, wrestle, aggressive, mystery, slowly
## 6 -
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                      text.var
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    watched train dragon times boring views huge accomplishment dreamworks animation animated feauture amazing experience watch cinema amazing times breathtaking fun cinema watching train dragon script good lot dramatic depth adults kids enjoy equally love end series shrek hoping bad sequels shrek 
shrek forever recognized beautiful animation highly recommended
## 2  beat formula hollywood sticks animation yakking stop goodness directed dean deblois chris sanders avoids plague jay baruchel voices viking kid brains brawn scrawny buffed sets rest warrior clan folks battle scarred constant defense village pests afar plunder livestock setting houses fire house block brand wanting deemed liability asset dad stoick gerard butler unique technical talent stroke uncanny luck downs flying dragon heat battle compassion meant set dragon free prove worthy viking man killing rare specimen dragon brought instant glory bond man mythical beast struck christened toothless pest slowly grows pet secret rendezvous resulting growing appreciation species knowledge kinsman compiled dragon compendium details facts ending advisory compulsory annihilation story strength witty smart condescending insulting intelligence audience characters caricatures peers lot effort put creating leads multi dimensional full heart enjoyed characters open emotions lot real photo realistic animation effects usual father son misunderstanding expectations turns hero theme fearing fully comprehend manner usual delivered made difference anti war stance takes step back common battle cry seek understood holding olive branch understand enjoy mythology dragon creature number ideas thrown make nod appreciation conjured worked wonders tad predictable wise betting lot folks toothless stitch ish design similar lilo stitch director chris sander previous work huge saucer eyes lovable demeanour built character apprehensive mischievous creature track record helped ropes journey friendship bonding discovery powerful found friend mention symbiotic relationship evolve action junkies find action sequences faultless specifically crafted set action pieces ducking cover fights incredible accompanied humour worked feeling deliberate hard coupled comedic voice talents jonah hill christopher mintz plasse train dragon similar year cloudy chance meatballs long titles great story beautiful 
animation total delight highly recommended list contenders films year
## 3  interested fantasy creatures children movies stupid messages pass time watch sake whoa drawn river minutes experience start narrator describes world immediately feel characters place uncommon set milestone technical aspects depth year man lot life entertain viewer masterpiece sounds surface layers meanings attached title train dragon carefully notice meets eye watch cleverly comments human fear war friendship prejudices courage love miss 
miss periods live note remember carry heart fill heart belong 
## 4  trailer enjoyed afraid good parts films case parts left discovered congratulate choice read book story witch suffered stories books picture writer hope image portraits characters black dragon falls love mimic gestures face expressions complex real agree kind makes thinking finished bot meant lovely beginning end laughed anxious characters suffered long stupid lines whatsoever put silence 
annoying child minutes cinema great special effects lovely dragon
## 5  early march indianapolis judges heartland moving picture award moving picture explores human journey artistically expressing hope respect positive values life heartland gave award gorgeous animation matters story good main story viking colony equally distant constantly attacked wide variety marauding full time job bay viking warriors boats hunting wily ferocious opponents story father chief young son small slick sarcastic talker orders seeks respect impressive father father son warrior viking relents train youngsters young boy sidetracked wanting kill boy befriends seeks understand young inexperienced son seeking approval strong father told tale sons act foolishly impress fathers fathers ignore strivings sons case honor courage sides inspiring watch father son wrestle relationship ferocious talented aggressive warriors motivations mystery unfolds slowly fun fyi moving pictures 
web site listing past moving picture award winners theater video
## 6  young viking befriends toothless young dragon lord rings trilogy virtually wonderful rarely drawn animated aspects thrilling great story amazing animation stop action positive constructive message made pet dragon perfect people ages feel good make feel good 
congratulations contributed amazing make toy popular gift item hope

head(pol$group)

##   all total.sentences total.words ave.polarity sd.polarity
## 1 all             120       11450    0.5784319   0.7612858
##   stan.mean.polarity
## 1          0.7598091
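
As a quick sanity check on the summary above: in qdap's polarity output, the standardized mean polarity is the average polarity divided by its standard deviation. The short sketch below simply redoes that arithmetic with the two values printed above; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the head(pol$group) output above
ave_polarity <- 0.5784319
sd_polarity  <- 0.7612858

# Standardized mean polarity = average polarity / standard deviation
stan_mean_polarity <- ave_polarity / sd_polarity
print(round(stan_mean_polarity, 4))
```

The result agrees with the stan.mean.polarity column reported above, up to rounding of the printed inputs.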

positive_words = unique(setdiff(unlist(p),"-"))  # Positive words list
negative_words = unique(setdiff(unlist(n),"-"))  # Negative words list

print(positive_words)       # Print all the positive words found in the corpus

##   [1] "accomplishment"  "amazing"         "breathtaking"   
##   [4] "fun"             "good"            "enjoy"          
##   [7] "love"            "beautiful"       "recommended"    
##  [10] "goodness"        "talent"          "luck"           
##  [13] "compassion"      "free"            "worthy"         
##  [16] "glory"           "witty"           "smart"          
##  [19] "intelligence"    "leads"           "enjoyed"        
##  [22] "realistic"       "hero"            "worked"         
##  [25] "wonders"         "wise"            "work"           
##  [28] "lovable"         "helped"          "powerful"       
##  [31] "faultless"       "incredible"      "humour"         
##  [34] "talents"         "great"           "delight"        
##  [37] "whoa"            "entertain"       "masterpiece"    
##  [40] "cleverly"        "courage"         "congratulate"   
##  [43] "lovely"          "award"           "respect"        
##  [46] "positive"        "gorgeous"        "variety"        
##  [49] "slick"           "impressive"      "approval"       
##  [52] "strong"          "impress"         "honor"          
##  [55] "inspiring"       "talented"        "winners"        
##  [58] "wonderful"       "thrilling"       "constructive"   
##  [61] "perfect"         "congratulations" "popular"        
##  [64] "instantly"       "works"           "winner"         
##  [67] "precious"        "loved"           "interesting"    
##  [70] "mighty"          "fresh"           "solid"          
##  [73] "important"       "strongest"       "maturity"       
##  [76] "meaningful"      "memorable"       "recommend"      
##  [79] "mature"          "excellent"       "tough"          
##  [82] "awards"          "nice"            "benefits"       
##  [85] "honest"          "modern"          "super"          
##  [88] "proper"          "pure"            "polite"         
##  [91] "elegant"         "excellently"     "splendid"       
##  [94] "master"          "wisdom"          "fantastic"      
##  [97] "achievement"     "classic"         "top"            
## [100] "impressed"       "exciting"        "innovative"     
## [103] "pretty"          "grace"           "satisfies"      
## [106] "charming"        "preferable"      "fast"           
## [109] "inspiration"     "friendly"        "exhilarating"   
## [112] "precise"         "fine"            "silent"         
## [115] "success"         "intimate"        "worth"          
## [118] "defeat"          "victorious"      "loves"          
## [121] "gains"           "loyal"           "enjoyable"      
## [124] "wins"            "triumphant"      "humorous"       
## [127] "likable"         "superb"          "brilliantly"    
## [130] "brilliant"       "standout"        "marvel"         
## [133] "refreshing"      "genuine"         "understandable" 
## [136] "believable"      "interests"       "spirited"       
## [139] "stronger"        "romantic"        "clear"          
## [142] "adorable"        "smile"           "lead"           
## [145] "cleared"         "awesome"         "fortunately"    
## [148] "counter attacks" "confidence"      "victory"        
## [151] "succeed"         "winning"         "stunning"       
## [154] "improve"         "gain"            "retractable"    
## [157] "peace"           "crisp"           "poise"          
## [160] "poignant"        "passion"         "dedicated"      
## [163] "nurturing"       "masterful"       "endearing"      
## [166] "astounding"      "amaze"           "beauty"         
## [169] "clever"          "clean"           "humor"          
## [172] "fascinating"     "entertaining"    "accurate"       
## [175] "charm"           "beautifully"     "awe"            
## [178] "perfectly"       "fans"            "supporting"     
## [181] "rich"            "playful"         "enchanting"     
## [184] "bravo"           "heartwarming"    "fairly"         
## [187] "abundant"        "tender"          "convenient"     
## [190] "win"             "hilarious"       "happy"          
## [193] "refresh"         "gentle"          "gusto"          
## [196] "proves"          "leading"         "heroic"         
## [199] "pleasantly"      "regard"          "originality"    
## [202] "trust"           "obsession"       "vibrant"        
## [205] "gaining"         "heavenly"        "terrific"       
## [208] "wonderfully"     "kindness"        "cooperative"    
## [211] "awesomeness"     "exceeded"        "wow"            
## [214] "magnificent"     "ambitious"       "improving"      
## [217] "cool"            "charismatic"     "proud"          
## [220] "favorite"        "glorious"        "spectacular"    
## [223] "fearless"        "astonishing"     "guarantee"      
## [226] "mind blowing"    "prefer"          "improvement"    
## [229] "brave"           "vivid"           "helping"        
## [232] "attraction"      "enhances"        "light hearted"  
## [235] "enjoys"          "concise"         "incredibly"     
## [238] "mesmerizing"     "recovery"        "affectionate"   
## [241] "likes"           "protect"         "glad"           
## [244] "correct"         "ideal"           "satisfying"     
## [247] "sophisticated"   "intelligent"     "appeal"         
## [250] "entranced"       "protection"      "cheer"          
## [253] "simplifies"      "pleasure"        "deserving"      
## [256] "fast paced"      "cute"            "rewarding"      
## [259] "ingenious"       "creative"        "amazingly"      
## [262] "suitable"        "successful"      "effectively"    
## [265] "effective"       "won"             "perfection"     
## [268] "stellar"         "inventive"       "loyalty"        
## [271] "bravery"         "lover"           "magical"        
## [274] "empathy"         "effectiveness"   "calm"           
## [277] "ingenuity"       "blossom"         "happiness"      
## [280] "finest"          "aspire"          "smiling"        
## [283] "easy"            "capability"      "revelation"     
## [286] "stylized"        "happily"         "improves"       
## [289] "heroine"         "redemption"      "accomplished"   
## [292] "redeeming"       "simplest"        "sweet"          
## [295] "sensitive"       "excitement"      "pleasing"       
## [298] "triumph"         "fair"            "successfully"   
## [301] "secure"          "survival"        "peacefully"     
## [304] "fond"            "defeating"       "pleasant"       
## [307] "decent"          "benefit"         "kid friendly"   
## [310] "kudos"           "woo"             "lean"           
## [313] "enjoyment"       "genius"          "superbly"       
## [316] "brilliance"      "engrossing"      "magic"          
## [319] "happier"         "greatest"        "phenomenal"     
## [322] "grand"           "colorful"        "patient"        
## [325] "loving"          "spiritual"       "valuable"       
## [328] "catchy"          "relish"          "bless"          
## [331] "excelled"        "credible"        "acclaimed"      
## [334] "reputation"      "amenable"        "satisfied"      
## [337] "peaceful"        "sumptuous"       "achievements"   
## [340] "famous"          "prowess"         "favour"         
## [343] "logical"         "dazzling"        "splendor"       
## [346] "exceptionally"   "savior"          "extraordinary"  
## [349] "outdone"         "promise"         "sharp"          
## [352] "intrigue"        "quicker"         "safe"           
## [355] "excited"         "blockbuster"     "easier"         
## [358] "feat"            "cuteness"        "yay"            
## [361] "bright"          "meticulously"    "dynamic"        
## [364] "faithful"        "luxury"          "youthful"       
## [367] "wowing"          "softer"          "advanced"       
## [370] "fame"            "praise"          "favor"          
## [373] "complement"      "geeky"           "cheerful"       
## [376] "continuity"      "relaxed"         "appealing"      
## [379] "destiny"

print(negative_words)       # Print all the negative words found in the corpus

##   [1] "boring"            "bad"               "plague"           
##   [4] "scarred"           "plunder"           "liability"        
##   [7] "killing"           "struck"            "pest"             
##  [10] "slowly"            "annihilation"      "condescending"    
##  [13] "insulting"         "misunderstanding"  "anti"             
##  [16] "cry"               "apprehensive"      "mischievous"      
##  [19] "hard"              "cloudy"            "stupid"           
##  [22] "fear"              "prejudices"        "miss"             
##  [25] "afraid"            "suffered"          "falls"            
##  [28] "complex"           "anxious"           "annoying"         
##  [31] "wily"              "sarcastic"         "sidetracked"      
##  [34] "kill"              "inexperienced"     "foolishly"        
##  [37] "ignore"            "wrestle"           "aggressive"       
##  [40] "mystery"           "slow"              "overwhelmed"      
##  [43] "noisy"             "impossible"        "burned"           
##  [46] "enemies"           "challenging"       "sadness"          
##  [49] "seriousness"       "rubbish"           "hate"             
##  [52] "disbelief"         "hype"              "false"            
##  [55] "terribly"          "disappointed"      "mediocre"         
##  [58] "misunderstand"     "waste"             "lame"             
##  [61] "crappy"            "pointless"         "crap"             
##  [64] "prejudge"          "poorly"            "underdog"         
##  [67] "moron"             "abusive"           "bullies"          
##  [70] "dumb"              "annoyingly"        "abuse"            
##  [73] "unrealistic"       "sucks"             "rampant"          
##  [76] "sad"               "silly"             "childish"         
##  [79] "skinny"            "awful"             "worst"            
##  [82] "lost"              "awkward"           "dreadful"         
##  [85] "blasphemous"       "dull"              "rape"             
##  [88] "murder"            "destroy"           "dead"             
##  [91] "loses"             "shocking"          "horrible"         
##  [94] "bugs"              "rip"               "knock"            
##  [97] "shark"             "monster"           "desperately"      
## [100] "killed"            "fury"              "unable"           
## [103] "commonplace"       "monstrous"         "nightmare"        
## [106] "spite"             "whiny"             "refuses"          
## [109] "pathetic"          "messes"            "loud"             
## [112] "darker"            "conflict"          "unbelievably"     
## [115] "complained"        "disagree"          "complaint"        
## [118] "clash"             "trap"              "trapped"          
## [121] "spoil"             "prejudice"         "disliked"         
## [124] "worse"             "bland"             "bully"            
## [127] "flaws"             "distracting"       "lacks"            
## [130] "mistakes"          "worries"           "vicious"          
## [133] "fierce"            "lethal"            "fears"            
## [136] "hostility"         "threat"            "pale"             
## [139] "long time"         "issue"             "steal"            
## [142] "burn"              "wrong"             "lies"             
## [145] "danger"            "worn"              "irritation"       
## [148] "plight"            "strained"          "longing"          
## [151] "forged"            "fall"              "rumbling"         
## [154] "dust"              "splitting"         "problem"          
## [157] "ruin"              "scar"              "stuck"            
## [160] "missed"            "scary"             "villains"         
## [163] "disaster"          "failure"           "broken"           
## [166] "cheesy"            "lack"              "hectic"           
## [169] "doubt"             "bug"               "pretentious"      
## [172] "rotten"            "death"             "disappointment"   
## [175] "ruins"             "annoyed"           "irritating"       
## [178] "disgusted"         "ugh"               "fat"              
## [181] "confused"          "distorted"         "slack"            
## [184] "ferociously"       "destructive"       "passive"          
## [187] "plaything"         "misfit"            "stupidity"        
## [190] "irritated"         "overrated"         "hapless"          
## [193] "flair"             "breaking"          "downhill"         
## [196] "failed"            "enemy"             "attack"           
## [199] "contrived"         "burning"           "complain"         
## [202] "bloody"            "blatantly"         "damn"             
## [205] "terrible"          "worry"             "appalled"         
## [208] "naive"             "fictional"         "hollow"           
## [211] "critics"           "dismay"            "wasted"           
## [214] "fails"             "stubborn"          "scare"            
## [217] "regret"            "warning"           "fool"             
## [220] "crazy"             "hopeless"          "rejected"         
## [223] "disappoints"       "wild"              "bored"            
## [226] "heartless"         "sully"             "inconsistent"     
## [229] "weird"             "risks"             "rough"            
## [232] "hell"              "killer"            "complicated"      
## [235] "wimpy"             "tired"             "outcast"          
## [238] "brutish"           "comical"           "sceptical"        
## [241] "stale"             "weak"              "stall"            
## [244] "shallow"           "overdone"          "blah"             
## [247] "heck"              "excuse"            "trouble"          
## [250] "suspicious"        "lie"               "twist"            
## [253] "inevitable"        "suck"              "bother"           
## [256] "suffer"            "dragged"           "condemn"          
## [259] "die"               "boredom"           "lukewarm"         
## [262] "spoonfed"          "cave"              "cynical"          
## [265] "crush"             "angry"             "critical"         
## [268] "noises"            "loner"             "defiantly"        
## [271] "blame"             "begging"           "nuisance"         
## [274] "laughingstock"     "break"             "struggles"        
## [277] "unpredictable"     "twists"            "stumble"          
## [280] "wary"              "sadly"             "snarky"           
## [283] "loser"             "unknown"           "cruelty"          
## [286] "incapable"         "unexpected"        "anguish"          
## [289] "knife"             "misunderstandings" "delayed"          
## [292] "recklessness"      "cheap"             "heartbreaking"    
## [295] "smug"              "noise"             "misunderstood"    
## [298] "bullying"          "stupidest"         "lose"             
## [301] "ugly"              "hedge"             "dangerous"        
## [304] "problems"          "poor"              "disrespectfulness"
## [307] "shame"             "unnecessary"       "rant"             
## [310] "fell"              "bitter"            "cynicism"         
## [313] "adversity"         "tedious"           "isolated"         
## [316] "yawn"              "antagonist"        "fateful"          
## [319] "unsure"            "twisted"           "sick"             
## [322] "vice"              "rigidity"          "miserably"        
## [325] "limits"            "insane"            "disappoint"       
## [328] "crack"             "fallen"            "grouse"           
## [331] "ruthless"          "insecure"          "relentless"       
## [334] "harsh"             "critic"            "downside"         
## [337] "fatefully"         "frozen"            "touchy"           
## [340] "murderous"         "aggression"        "frightening"      
## [343] "falling"           "chagrin"           "unpleasant"       
## [346] "posturing"         "fanciful"          "conceit"          
## [349] "demonized"         "belligerent"       "disastrous"       
## [352] "overlook"          "lackluster"        "idiotic"          
## [355] "notorious"         "scream"            "destruction"      
## [358] "damaged"           "shoddy"            "ineptitude"       
## [361] "pander"            "lazy"              "wrath"            
## [364] "useless"           "refusal"           "cartoonish"       
## [367] "somber"            "reluctant"         "clumsy"           
## [370] "wildly"            "uneven"            "criticism"        
## [373] "ashamed"           "skeptical"         "avenge"           
## [376] "harm"              "randomly"          "criticize"        
## [379] "ripped"            "struggling"        "lacking"          
## [382] "forbidden"         "misleading"        "evil"             
## [385] "limited"           "astray"            "fearsome"         
## [388] "disadvantage"      "bothered"          "sullen"           
## [391] "dark"              "drastically"       "disingenuous"     
## [394] "massacre"          "conservative"      "disappointing"    
## [397] "satirical"         "foe"               "difficult"        
## [400] "confuses"          "pales"             "split"            
## [403] "shamelessly"       "bickering"         "greedy"           
## [406] "overwhelming"      "inane"             "drag"             
## [409] "hurt"              "loneliness"        "complains"        
## [412] "fault"             "ridiculous"        "scared"           
## [415] "concerned"         "indifference"      "ridiculously"     
## [418] "misguided"
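
The two word lists above are built by flattening the per-review word lists and dropping the "-" placeholder. A minimal sketch of that step on toy data follows, assuming (as the code suggests) that `p` holds one character vector per review and that "-" marks a review with no matching words. The `p_toy` data is hypothetical.

```r
# Hypothetical stand-in for the per-review positive word lists (p):
# one character vector per review, "-" meaning no positive words were found
p_toy <- list(c("good", "great"), "-", c("great", "fun"))

# Flatten, drop the "-" placeholder, and keep each word once
positive_toy <- unique(setdiff(unlist(p_toy), "-"))
print(positive_toy)
```

Note that `setdiff()` already returns unique values, so the outer `unique()` is harmless but not strictly needed.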

Let's create the word cloud for the positive terms found in the IMDB user reviews of the movie How to train your dragon.

#--------------------------------------------------------#
#   Create Positive Words wordcloud                      #
#--------------------------------------------------------#

pos.tdm = dtm[,which(colnames(dtm) %in% positive_words)]
m = as.matrix(pos.tdm)
v = sort(colSums(m), decreasing = TRUE)
dev.new() # opens a new image window (portable, unlike windows())
wordcloud(names(v), v, scale=c(4,1), min.freq=1, max.words=100, colors=brewer.pal(8, "Dark2"))
title(sub = "Positive Words - Wordcloud")

# plot barchart for top tokens
# build an explicit data frame so the aesthetics refer to real columns
test = data.frame(word = names(head(v, 15)), freq = as.numeric(head(v, 15)))
dev.new() # opens a new image window
ggplot(test, aes(x = word, y = freq)) + 
  geom_bar(stat = "identity", fill = "blue") +
  geom_text(aes(label = freq), vjust= -0.20) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

From the word cloud and the bar graph we can infer that reviewers most often describe the movie as good, great, fantastic and amazing. The hero of the movie seems to have captured their hearts.
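
The frequencies behind the word cloud come from keeping only the document-term columns whose term is in the sentiment list and then summing each column. A minimal sketch on a toy matrix follows; `dtm_toy` and `positive_toy` are hypothetical stand-ins for `dtm` and `positive_words`.

```r
# Hypothetical document-term counts: rows are reviews, columns are terms
dtm_toy <- matrix(c(2, 0, 1,
                    1, 3, 0,
                    0, 1, 4),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("doc1", "doc2", "doc3"),
                                  c("good", "dragon", "great")))
positive_toy <- c("good", "great", "amazing")

# Keep only the columns whose term is in the positive list
pos_toy <- dtm_toy[, colnames(dtm_toy) %in% positive_toy, drop = FALSE]

# Total frequency of each positive term across all reviews, largest first
v_toy <- sort(colSums(pos_toy), decreasing = TRUE)
print(v_toy)
```

Here "dragon" is dropped because it is not in the list, and "amazing" is ignored because it never occurs in the matrix.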

Let's create the word cloud for the negative terms found in the IMDB user reviews of the movie How to train your dragon.

#--------------------------------------------------------#
#  Create Negative Words wordcloud                       #
#--------------------------------------------------------#

neg.tdm = dtm[,which(colnames(dtm) %in% negative_words) ]
m = as.matrix(neg.tdm)
v = sort(colSums(m), decreasing = TRUE)
dev.new() # opens a new image window (portable, unlike windows())
wordcloud(names(v), v, scale=c(4,1), min.freq=1, max.words=100, colors=brewer.pal(8, "Dark2"))
title(sub = "Negative Words - Wordcloud")

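# plot barchart for top tokens
# build an explicit data frame so the aesthetics refer to real columns
test = data.frame(word = names(head(v, 15)), freq = as.numeric(head(v, 15)))
dev.new() # opens a new image window
ggplot(test, aes(x = word, y = freq)) + 
  geom_bar(stat = "identity", fill = "red") +
  geom_text(aes(label = freq), vjust= -0.20) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))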

From the above word cloud and bar graph for the negative sentiments of the user reviews, we can infer that a few users found the movie annoying and disturbing, although such reviews are few compared with the positive ones.

Now let's plot the number of positive words against the number of negative words in each review.

#--------------------------------------------------------#
#  Positive words vs Negative Words plot                 #
#--------------------------------------------------------#

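len = function(x){
  # a lone "-" marks a review with no matching words; check the length first
  # so that && never sees a vector longer than one
  if (length(x) == 1 && x == "-")  {return (0)} 
  else {return(length(unlist(x)))}
}

pcount = unlist(lapply(p, len))
ncount = unlist(lapply(n, len))
doc_id = seq_along(wc)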

dev.new() # opens a new image window
plot(doc_id,pcount,type="l",col="green",xlab = "Document ID", ylab= "Word Count")
lines(doc_id,ncount,type= "l", col="red")
title(main = "Positive words vs Negative Words" )
legend("topright", inset=.05, c("Positive Words","Negative Words"), fill=c("green","red"), horiz=TRUE)

From the above plot we can infer that positive sentiment outweighs negative sentiment: the green line, which counts positive words, has higher peaks than the red line, which counts negative words.
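
The per-review counts behind this plot can be sketched on toy data as follows. The helper name `count_words` and the `p_toy` and `n_toy` lists are hypothetical; the only assumption carried over from the analysis is that a lone "-" stands for a review with no matching words.

```r
# Count sentiment words in one review; "-" alone means none were found
count_words <- function(x) {
  x <- unlist(x)
  if (length(x) == 1 && x == "-") 0L else length(x)
}

# Hypothetical per-review word lists standing in for p and n
p_toy <- list(c("good", "great"), "-", c("fun", "love", "amazing"))
n_toy <- list("boring", c("bad", "dull"), "-")

pcount_toy <- vapply(p_toy, count_words, integer(1))
ncount_toy <- vapply(n_toy, count_words, integer(1))
print(rbind(positive = pcount_toy, negative = ncount_toy))
```

Using `vapply()` with `integer(1)` makes the function fail loudly if a review ever yields something other than a single count.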

Let’s draw the polarity plot, which shows both the positive and the negative sentiments of the users.

From the above plot we can infer that the data is mostly concentrated in the upper part of the graph, which indicates that the users' positive sentiments outweigh their negative sentiments.
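
The code for the polarity plot is not shown in this report. One way such a plot could be drawn is sketched below. It assumes the per-review polarity scores sit in a numeric vector (with qdap these would presumably be `pol$all$polarity`). The `polarity_toy` values are made up for illustration and are not results from the reviews.

```r
# Hypothetical per-review polarity scores (illustrative values only)
polarity_toy <- c(0.8, 0.3, -0.2, 1.1, 0.5, -0.6, 0.9)

# Share of reviews with a positive score
share_positive <- mean(polarity_toy > 0)

pdf(NULL)  # null device so the sketch runs without a display; drop it to see the plot
plot(seq_along(polarity_toy), polarity_toy, type = "h",
     xlab = "Document ID", ylab = "Polarity")
abline(h = 0, lty = 2)  # bars above this line are net-positive reviews
invisible(dev.off())

print(share_positive)
```

A dashed zero line makes the "upper side of the graph" reading explicit: bars above it are net-positive reviews, bars below it are net-negative.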

TABA_Assignment1_IMDB_Reviews_How_to_train_your_Dragon

Cris _Alfo_71620020

November 27, 2016