Step 1

First, let us get the top 50 positive and top 50 negative reviews from IMDB for the movie “Django Unchained”

rm(list=ls())
library("rvest")
library("dplyr")
library("tm")
library("text2vec")
library(data.table)
library(stringr)
library(RWeka)
library(tokenizers)
library(slam)
library(wordcloud)
library(qdap)
library(ggplot2)
library(igraph)
counts = c(0,10,20,30,40)

ratings.df = data.frame(Review=character(), Ratings=character(), stringsAsFactors = FALSE)
reviews = NULL

t1 = Sys.time()   # set timer
for (j in counts){
  #tt1049413 - Up
  #tt1853728 - Django
  url1 = paste0("http://www.imdb.com/title/tt1853728/reviews?filter=love;filter=love;start=",j)
  url2 = paste0("http://www.imdb.com/title/tt1853728/reviews?filter=hate;filter=hate;start=",j)

  page1 = read_html(url1)
  page2 = read_html(url2)
  reviews1 = html_text(html_nodes(page1,'#tn15content p'))
  reviews2 = html_text(html_nodes(page2,'#tn15content p'))
  
  # Positive page: pull the text and the rating of each of the 10 reviews
  for (review_num in 1:10){
    xpath1 = paste0('//*[@id="tn15content"]/p[',as.character(review_num),']')
    xpath2 = paste0('//*[@id="tn15content"]/div[',as.character(2*review_num-1),']/img/@alt')
    review1 = gsub("[\r\n]", "",trimws(html_text(html_nodes(page1,xpath = xpath1))))

    rating1 = html_text(html_nodes(page1,xpath = xpath2))

    # The very first row is assigned directly; later rows are appended
    if (nrow(ratings.df)==0){
      ratings.df[1,]<-c(review1,rating1)
    }
    else{
      ratings.df<-rbind(ratings.df,c(review1,rating1))
    }

  }
  # Negative page: same extraction, rows are always appended
  for (review_num in 1:10){
    xpath1 = paste0('//*[@id="tn15content"]/p[',as.character(review_num),']')
    xpath2 = paste0('//*[@id="tn15content"]/div[',as.character(2*review_num-1),']/img/@alt')
    review2 = gsub("[\r\n]", "",trimws(html_text(html_nodes(page2,xpath = xpath1))))

    rating2 = html_text(html_nodes(page2,xpath = xpath2))

    ratings.df<-rbind(ratings.df,c(review2,rating2))
  }
  
  
  reviews.positive = setdiff(reviews1, c("*** This review may contain spoilers ***","Add another review"))
  
  reviews.negative = setdiff(reviews2, c("*** This review may contain spoilers ***","Add another review"))
  
  reviews = c(reviews,reviews.positive,reviews.negative)
  
}

reviews = gsub("\n",' ',reviews)
writeLines(reviews,'Django Unchained IMDB reviews.txt')
Sys.time() - t1  # how much time did the above take?
## Time difference of 15.65114 secs
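
The loop above appends to ratings.df one row at a time and needs a special case for the empty data frame. As a minimal sketch of an alternative (the `pages` list is made-up illustration data, not scraped output, and `ratings_sketch` is a hypothetical name), one data frame per page can be collected in a list and bound once at the end:

```r
# Hypothetical stand-in for scraped pages: each element holds the
# review texts and rating strings extracted from one page.
pages <- list(
  list(review = c("Loved it", "Great cast"), rating = c("10/10", "10/10")),
  list(review = c("Too violent", "Weak plot"), rating = c("1/10", "1/10"))
)

# Build one small data frame per page, then bind them all in one call.
rows <- lapply(pages, function(p) {
  data.frame(Review = p$review, Ratings = p$rating, stringsAsFactors = FALSE)
})
ratings_sketch <- do.call(rbind, rows)

print(ratings_sketch)
```

This keeps the column types stable and avoids the if/else on nrow().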

Step 1- Top 50 positive and negative reviews are collected.

Showing the first 3 reviews (positive) and the last 3 reviews (negative); ratings.df alternates between blocks of 10 positive and 10 negative reviews

knitr::kable(ratings.df %>% slice(1:3))
Review Ratings
Absolutely loved every minute of this movie. Usually I’m not too crazyabout Tarantino’s movies, but this one is definitely the best one I’veseen in a long time. The actors were picked perfectly. The overallexperience of a movie is amazing. When we first went to watch it, I wasa bit skeptical and thought I’d end up leaving an hour into the movie(it’s a 3 hr movie), but it grabbed my attention from the verybeginning and I didn’t even wanna get up to go to the bathroom, afraidto miss something. I’m usually very particular about the movies,nothing can hardly satisfy me, but this one is definitely in the top 5.Soundtrack was perfect. When I got home, I’ve done some more researchon it and loved it even more! Overall, I would highly recommend thisfilm! 10/10
Merry Christmas to all you Tarantino fans out there. I hope you made aTarantino checklist so here we go.Witty dialogue, check. Excessive profanity especially use the word’nigger’, check. Excessive violence including testicles getting blownoff, check. Soulful musical score, check. Sometimes non-linearnarrative, check. Shots of women’s feet, check. Very great characterdriven plot, check. An actual spaghetti western, even though it takesplace in the American South, check.There are four standout characters played by the top billed actors.Jamie Foxx plays Django, a freed slave who becomes a bounty hunter.Even though he is the titular character, he gets downplayed when in thepresence of the other actors. Still he delivers a solid performance, infact hes very convincing. We all know Jamie Foxx as this golden voiceRnB singer and comedian with a very clean cut image. He was able topull off the whole transitioning from a timid slave to a menacingbounty hunter. Not only that he had the whole look down too, with allthe facial scarring and the messy hair.Christoph Waltz plays Dr. King Schulz, a German dentist turned bountyhunter who frees Django so he could help pursue his previous owners whoare targets. Waltz is a very charismatic actor, and thats how he doesthis role. Presents every line with finesse.Leonardo Dicaprio is in his best yet. He plays a plantation owner,Calvin Candie, and is the owner of Django’s wife. This is a verydifferent role. We’ve seen Leonardo in gritty roles before but neverdid he play this lecherous antagonist. We were all used to Leo beingthis teen idol, who looked like a member of Hanson. Here he’s thisSoutherner with discoloured teeth and a scruffy beard.Finally Samuel L. Jackson who plays Steve, a house slave who you couldsay is the secret antagonist here. For all the screen time that he hashe dominates. 
Sam usually plays boisterous roles as a tough guy, but itwas very interesting seeing him play a devious and manipulative oldman.The only gripe here was that this film was a little too long exceedingthe three act structure, but its an epic western film so I’ll excuseTarantino for that. Yet again he made another great film with a lot offlair and carried well by the four big hitter actors. Well done Mr.Tarantino. 10/10
Quentin Tarantino, one of the most iconic directors of the 21st (andlate 20th) century, why? Simple. Because of masterpieces like this.Tarantino defies the laws of film, he shoots them in his own way,however he wants. Tarantino has always focused upon the action thrillergenre from Reservoir Dogs up until Inglourious Basterds. However,Django Unchained is Tarantino’s first look at the Western genre, hisfirst attempt at it and he executed it beautifully. The scenes wereshot perfectly alongside an amazing soundtrack as well as his own smallcameo.Django Unchained tells the story of Django (Jamie Foxx), a slave who issoon picked up by bounty hunter Dr King Shultz (Christoph Waltz). Thestory follows on as Shultz takes on Django as his “deputy” during theirtasks of bounty hunting, in return Shultz says that after winter hewill help find Django’s lost wife, Broomhilda. This takes them to ahuge plantation in Mississippi owned by Calvin Candie (LeonardoDiCaprio), from here they plan up a scheme on how to get away withBroombilda.The cast boast out amazing performances, particularly Christoph Waltz(also famous for his previous collaboration with Tarantino onInglourious Bastards as Colonel Landa). Both Foxx and DiCaprio’sperformance are both equally amazing. All three are able to add somelight-hearted humour in the mix to make sure it doesn’t stay tooserious, as well as having comic actor Jonah Hill play a member of theKKK.There’s a reason the film has been nominated for 5 Oscars. 10/10
knitr::kable(ratings.df %>% slice(98:n()))
Review Ratings
Actually I’m not even going to list the crimes against that word (trustme, they are legion). Apparently in a Tarantino film you’re notsupposed to question incongruities and historical absurdities; itseems, for some inexplicable reason, such restrictions apply only toother filmmakers. So rap songs in a pre-civil war western are okay aslong as your name starts with T and ends with O. Negroes riding aroundwith cool sunglasses talking back (or is that ‘black’?) to plantationowning southerners without instantly getting blown out of the saddle isall part of Tarantino-World; as indeed are weapons with an accuracy farbeyond anything known in the 19th Century. So no, I’m not going toquestion the lack of verisimilitude in this movie; I couldn’t be sounkind to such a genius movie maker. I’m just going to question thelogic of one particular scene.The “Doc’ has this Derringer, right? Which springs out of his rightsleeve. We’ve seen already that it fires two shots (and is surprisinglydeadly for such a small calibre weapon). He wants to kill DiCaprio.However, DiCaprio’s henchman has a gun on his friends. So he shootsDiCaprio, says”I couldn’t resist" or some such, and stands there as ifwaiting to be shot? Why? I mean if his gun had two bullets, he couldhave shot the henchman first (in the head, as he was directly to theright of him) and then shot DiCaprio, who was weaponless. It made nosense (pretty much like the rest of the movie really). He was supposedto be a master strategist, but a 3 year-old-child could have seen whatneeded to be done there. It wouldn’t even have effected the plot much.You could have still had that repellent splatterfest afterwards andjust had the Doc die in that. Really, folks, that’s just bad, badwriting.I won’t list any more silliness. Not even Tarantino himself affecting aweird Australian accent (what was that all about? Is there evidencethat Australians used to escort black prisoners to the mines in the1850s?). 
No, to me this was clearly just another Tarantino revengefantasy project made for no other reason (money excluded, of course)than to convince the rest of us that humanity really is asunremittingly black-hearted as Tarantino sees us. Well, I for one justdon’t buy that, and I certainly won’t be buying this movie. 1/10
Once again QT bring nothing to the table other than death andketchup… the plot is transparent… the acting is stiff… the deathand guts are the main filler… I beg to understand who likes thiscrap… goodbye QT I for one will never watch another of your deathblood mindless films… this is the kind of crap polluting society…you have a lot to answer for… society these days doesn’t need thissort of crap being distributed to the masses !I don’t think that showing the blood and guts add any value to thefilm… we all know whats happening… we all know the outcome… sowhy does he show the ketchup flying everywhere… its like he thinks itadds something… all it adds is glorifying death and diluting the painand suffering of the victims…I know people will say its only a film… but the youth and influentialpeople in this world are starting to think its normal to kill andsplatter… I just don’t see the point of this film… if the killingwas remove the whole film would be 15 minutes long… so therefore thefilm is about killing humans… hmmm… nice…great entertainment !!! I stopped watching after 1.5 hours… I was sick of watching peoplebeing killed… the plot was weak and build around death… not evenassumed death… its blatant death…no atmospheric emotional implicitdeath… blatant explicit death… sorry this is crap… id give itzero… but cant !… 1/10
Going into the film i initially wasn’t fond on the idea of a Western socalled being made in todays world (2012). Movies these days failmiserably to deliver a story towards the audience, as they are onlyfavorable for special effects and aspects of filming. The beginning ofthe film perhaps had some interesting story, its portrayed at thestart, whites vs blacks, blacks are slaves etc. The idea of the film was generally poor. It lost its meaning, as themovie dragged on, with Django and Dr. Schultz whatever his name iscontinuously killing people. I felt like there was no emotion or storyor background behind the scenes which kept occurring. Another point isthe humor, i don’t understand the necessity of adding a comedic senseto a film which is somewhat serious, we are talking about blacks beingtorched. If you compare this extremely boring average film to a 60’sspaghetti wow, don’t even. those movies back then had story, they adcharacter they had a purpose they were once easily to follow. Thesedays there are too much talking and they spend too much time focusingon special effects rather than adding some emotion into the actualstory and trying to forward to message to the viewer. Perhaps the end of the film was slightly entertaining with theshooting, but overall oh boy this movie absolutely sucked! 1/10

Step 2- Create a corpus from the reviews, then create the DTM after the necessary cleaning

docs<-Corpus(VectorSource(reviews))
text.clean = function(x)                    # text data
{ require("tm")
  x  =  gsub("<.*?>", " ", x)               # regex for removing HTML tags
  x  =  iconv(x, "latin1", "ASCII", sub="") # Keep only ASCII characters
  x  =  gsub("[^[:alnum:]]", " ", x)        # keep only alpha numeric 
  x  =  tolower(x)                          # convert to lower case characters
  x  =  removeNumbers(x)                    # removing numbers
  x  =  stripWhitespace(x)                  # removing white space
  x  =  gsub("^\\s+|\\s+$", "", x)          # remove leading and trailing white space
  return(x)
}
distill.cog = function(mat1, # input TCM ADJ MAT
                       title, # title for the graph
                       s,    # no. of central nodes
                       k1){  # max no. of connections  
  
  a = colSums(mat1) # collect colsums into a vector obj a
  b = order(-a)     # nice syntax for ordering vector in decr order  
  
  mat2 = mat1[b,b]  #
  
  diag(mat2) =  0
  
  ## +++ go row by row and find top k adjacencies +++ ##
  
  wc = NULL
  
  for (i1 in 1:s){ 
    thresh1 = mat2[i1,][order(-mat2[i1, ])[k1]]
    mat2[i1, mat2[i1,] < thresh1] = 0   # wow. didn't need 2 use () in the subset here.
    mat2[i1, mat2[i1,] > 0 ] = 1
    word = names(mat2[i1, mat2[i1,] > 0])
    mat2[(i1+1):nrow(mat2), match(word,colnames(mat2))] = 0
    wc = c(wc,word)
  } # i1 loop ends
  
  
  mat3 = mat2[match(wc, colnames(mat2)), match(wc, colnames(mat2))]
  ord = colnames(mat2)[which(!is.na(match(colnames(mat2), colnames(mat3))))]  # removed any NAs from the list
  mat4 = mat3[match(ord, colnames(mat3)), match(ord, colnames(mat3))]
  graph <- graph_from_adjacency_matrix(mat4, mode = "undirected", weighted = TRUE)    # Create network object
  graph = simplify(graph) 
  V(graph)$color[1:s] = "green"                      # central nodes
  V(graph)$color[(s+1):length(V(graph))] = "pink"    # their neighbours
  
  graph = delete_vertices(graph, V(graph)[ degree(graph) == 0 ])   # drop isolated nodes
  
  plot(graph, 
       layout = layout_with_kk, 
       main = title)

  } # func ends
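
The core step inside distill.cog is the per-row thresholding: for each central node, only the strongest links are kept and then binarised. A minimal sketch of that single step on one made-up row (the names `row_vals`, `k` and `kept` are illustrative assumptions, not part of the function above):

```r
# One row of a hypothetical adjacency matrix: link strengths from one word to five others.
row_vals <- c(cast = 5, plot = 1, violence = 3, time = 0, scene = 2)
k <- 2                                           # keep the k strongest links

thresh <- sort(row_vals, decreasing = TRUE)[k]   # k-th largest value in the row
kept <- row_vals
kept[kept < thresh] <- 0                         # drop links weaker than the threshold
kept[kept > 0] <- 1                              # binarise what is left

kept
```

Note that ties at the threshold value would all be kept, so a row can end up with more than k links.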

Step 2- Preprocessing and creation of DTM

Now read the text file and load it into a data frame, which will have 100 records

temp.text = readLines(file.choose())  # reading Django Unchained file
head(temp.text, 3)
## [1] " Absolutely loved every minute of this movie. Usually I'm not too crazy about Tarantino's movies, but this one is definitely the best one I've seen in a long time. The actors were picked perfectly. The overall experience of a movie is amazing. When we first went to watch it, I was a bit skeptical and thought I'd end up leaving an hour into the movie (it's a 3 hr movie), but it grabbed my attention from the very beginning and I didn't even wanna get up to go to the bathroom, afraid to miss something. I'm usually very particular about the movies, nothing can hardly satisfy me, but this one is definitely in the top 5. Soundtrack was perfect. When I got home, I've done some more research on it and loved it even more! Overall, I would highly recommend this film
## [2] " Merry Christmas to all you Tarantino fans out there. I hope you made a Tarantino checklist so here we go.Witty dialogue, check. Excessive profanity especially use the word 'nigger', check. Excessive violence including testicles getting blown off, check. Soulful musical score, check. Sometimes non-linear narrative, check. Shots of women's feet, check. Very great character driven plot, check. An actual spaghetti western, even though it takes place in the American South, check.There are four standout characters played by the top billed actors.Jamie Foxx plays Django, a freed slave who becomes a bounty hunter. Even though he is the titular character, he gets downplayed when in the presence of the other actors. Still he delivers a solid performance, in fact hes very convincing. We all know Jamie Foxx as this golden voice RnB singer and comedian with a very clean cut image. He was able to pull off the whole transitioning from a timid slave to a menacing bounty hunter. Not only that he had the whole look down too, with all the facial scarring and the messy hair.Christoph Waltz plays Dr. King Schulz, a German dentist turned bounty hunter who frees Django so he could help pursue his previous owners who are targets. Waltz is a very charismatic actor, and thats how he does this role. Presents every line with finesse.Leonardo Dicaprio is in his best yet. He plays a plantation owner, Calvin Candie, and is the owner of Django's wife. This is a very different role. We've seen Leonardo in gritty roles before but never did he play this lecherous antagonist. We were all used to Leo being this teen idol, who looked like a member of Hanson. Here he's this Southerner with discoloured teeth and a scruffy beard.Finally Samuel L. Jackson who plays Steve, a house slave who you could say is the secret antagonist here. For all the screen time that he has he dominates. 
Sam usually plays boisterous roles as a tough guy, but it was very interesting seeing him play a devious and manipulative old man.The only gripe here was that this film was a little too long exceeding the three act structure, but its an epic western film so I'll excuse Tarantino for that. Yet again he made another great film with a lot of flair and carried well by the four big hitter actors. Well done Mr. Tarantino. "
## [3] " Quentin Tarantino, one of the most iconic directors of the 21st (and late 20th) century, why? Simple. Because of masterpieces like this. Tarantino defies the laws of film, he shoots them in his own way, however he wants. Tarantino has always focused upon the action thriller genre from Reservoir Dogs up until Inglourious Basterds. However, Django Unchained is Tarantino's first look at the Western genre, his first attempt at it and he executed it beautifully. The scenes were shot perfectly alongside an amazing soundtrack as well as his own small cameo.Django Unchained tells the story of Django (Jamie Foxx), a slave who is soon picked up by bounty hunter Dr King Shultz (Christoph Waltz). The story follows on as Shultz takes on Django as his \"deputy\" during their tasks of bounty hunting, in return Shultz says that after winter he will help find Django's lost wife, Broomhilda. This takes them to a huge plantation in Mississippi owned by Calvin Candie (Leonardo DiCaprio), from here they plan up a scheme on how to get away with Broombilda.The cast boast out amazing performances, particularly Christoph Waltz (also famous for his previous collaboration with Tarantino on Inglourious Bastards as Colonel Landa). Both Foxx and DiCaprio's performance are both equally amazing. All three are able to add some light-hearted humour in the mix to make sure it doesn't stay too serious, as well as having comic actor Jonah Hill play a member of the KKK.There's a reason the film has been nominated for 5 Oscars. "
data = data.frame(id = 1:length(temp.text), text = temp.text, stringsAsFactors = F)
dim(data)
## [1] 100   2

Clean the data by removing stopwords, stemming, and applying the previously defined text.clean function

stpw1 = readLines(file.choose()) # stopwords.txt
stpw2 = tm::stopwords('english') # tm package stop word list; tokenizer package has the same name function
context_stopwords = c("slave","slaves","revenge", "vengeance","death","dark","racism","inglorious","fiction","antagonist","villain","villains","django unchained", "movie","slavery", "film", "hollywood","pulp")
comn  = unique(c(stpw1, stpw2,context_stopwords))    # Union of the three lists
stopwords = unique(gsub("'"," ",comn))  # final stop word list, with apostrophes replaced by spaces

x  = text.clean(data$text)             # pre-process text corpus
x  =  removeWords(x,stopwords)            # removing stopwords created above
x  =  stripWhitespace(x)                  # removing white space
x  =  stemDocument(x)
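
To see what the cleaning and stop word removal do to a single review, here is a minimal base R sketch. It is an approximation under stated assumptions: `clean_sketch` is a hypothetical helper, the stop word list is made up, and stemming is left out; it is not the tm pipeline used above.

```r
# Simplified cleaning: strip tags, keep letters, lower-case, drop stop words.
clean_sketch <- function(x, stop_words) {
  x <- gsub("<.*?>", " ", x)                     # drop HTML tags
  x <- tolower(gsub("[^[:alpha:]]", " ", x))     # keep letters only, lower-case
  # remove whole-word matches of any stop word
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)
  x <- gsub("\\s+", " ", x)                      # collapse runs of whitespace
  trimws(x)
}

cleaned <- clean_sketch("<b>The</b> film was 10 times better than the book!",
                        stop_words = c("the", "was", "than", "film"))
cleaned
```

The word boundaries (\\b) matter: without them, removing "the" would also eat the start of a word like "then".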

Once the preprocessing and cleaning of the data is done, we create the DTM using the text2vec package

tok_fun = word_tokenizer
it_0 = itoken( x,
               #preprocessor = text.clean,
               tokenizer = tok_fun,
               ids = data$id,
               progressbar = T)

Creating bigrams

vocab = create_vocabulary(it_0,
                          ngram = c(2L, 2L)
                          #stopwords = stopwords
)
pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 10)
pruned_vocab
vectorizer = vocab_vectorizer(pruned_vocab)
dtm_0 = create_dtm(it_0, vectorizer)   # single-call form, consistent with create_dtm() used further down
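
For intuition about what the bigram vocabulary contains, the counting can be sketched in base R on a toy corpus. The documents and the helper `bigrams_of` are made-up assumptions for illustration; text2vec does the equivalent work internally and likewise joins the two words with "_".

```r
docs_sketch <- c("great cast great movie", "great cast weak plot")

# Pair each word with its successor and join the pair with "_".
bigrams_of <- function(doc) {
  w <- strsplit(doc, " ", fixed = TRUE)[[1]]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1), sep = "_")
}

bigram_counts <- table(unlist(lapply(docs_sketch, bigrams_of)))
sort(bigram_counts, decreasing = TRUE)
```

Pruning with term_count_min then simply drops the entries of this table that fall below the chosen count.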

Replace the frequent bigrams in the text with single underscore-joined tokens, then rebuild the DTM (used later for the word clouds)

# Sort bi-gram with decreasing order of freq
tsum = as.matrix(t(rollup(dtm_0, 1, na.rm=TRUE, FUN = sum))) # find sum of freq for each term
tsum = tsum[order(tsum, decreasing = T),]       #terms in decreasing order of freq
text2 = x
text2 = paste("",text2,"")

pb <- txtProgressBar(min = 1, max = (length(tsum)), style = 3) ; i = 0

for (term in names(tsum)){
  i = i + 1
  focal.term = gsub("_", " ",term)        # the bigram as it appears in the text (space instead of "_")
  replacement.term = term
  text2 = gsub(paste("",focal.term,""),paste("",replacement.term,""), text2)
  setTxtProgressBar(pb, i)
}


it_m = itoken(text2,
              # preprocessor = text.clean,
              tokenizer = tok_fun,
              ids = data$id,
              progressbar = T)

vocab = create_vocabulary(it_m
                          # ngram = c(2L, 2L),
                          #stopwords = stopwords
)

pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 1)
# doc_proportion_max = 0.5,
# doc_proportion_min = 0.001)

vectorizer = vocab_vectorizer(pruned_vocab)

dtm_m  = create_dtm(it_m, vectorizer)
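
The replacement loop above can be illustrated on two short strings. This is a minimal sketch with made-up text and a made-up list of frequent bigrams (`text_sketch`, `frequent_bigrams` and `joined` are hypothetical names); padding with spaces ensures that bigrams at the start or end of a document are matched as whole words.

```r
text_sketch <- c("christoph waltz plays a bounty hunter",
                 "the bounty hunter frees django")
frequent_bigrams <- c("bounty_hunter", "christoph_waltz")  # assumed result of a bigram count

padded <- paste0(" ", text_sketch, " ")   # pad so terms at the edges also match
for (term in frequent_bigrams) {
  focal <- gsub("_", " ", term, fixed = TRUE)             # "bounty hunter"
  padded <- gsub(paste0(" ", focal, " "),
                 paste0(" ", term, " "),
                 padded, fixed = TRUE)                    # -> "bounty_hunter"
}
joined <- trimws(padded)
joined
```

After this step a word tokenizer treats each joined bigram as a single term.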

Create the COG of the most co-occurring words

text2 = x
text2 = paste("",text2,"")

pb <- txtProgressBar(min = 1, max = (length(tsum)), style = 3) ; i = 0

for (term in names(tsum)){
  i = i + 1
  focal.term = gsub("_", " ",term)        # the bigram as it appears in the text (space instead of "_")
  replacement.term = term
  text2 = gsub(paste("",focal.term,""),paste("",replacement.term,""), text2)   # update text2 itself so the replacements accumulate
  setTxtProgressBar(pb, i)
}


it_m = itoken(text2,
              # preprocessor = text.clean,
              tokenizer = tok_fun,
              ids = data$id,
              progressbar = T)

vocab = create_vocabulary(it_m,
                          # ngram = c(2L, 2L),
                          stopwords = stopwords
)
pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 1)
# doc_proportion_max = 0.5,
# doc_proportion_min = 0.001)

COG graphic

#------------------------------------------------------#
# Term Co-occurrence Matrix                            #
#------------------------------------------------------#

# window size is passed to create_tcm(), as in current text2vec releases
vectorizer = vocab_vectorizer(pruned_vocab)
tcm = create_tcm(it_m, vectorizer, skip_grams_window = 3L)

tcm.mat = as.matrix(tcm)
adj.mat = tcm.mat + t(tcm.mat)

diag(adj.mat) = 0     # set diagonals of the adj matrix to zero --> node isn't its own neighbor
a0 = order(apply(adj.mat, 2, sum), decreasing = T)
adj.mat = as.matrix(adj.mat[a0[1:50], a0[1:50]])

distill.cog(adj.mat, 'Distilled COG for full corpus',  5,  5)
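
To make the term co-occurrence matrix concrete, here is a minimal base R sketch on one made-up token sequence. It uses plain unweighted counts and an assumed window of 2; text2vec's create_tcm may weight pairs by their distance, so the numbers are illustrative only. All names (`tokens`, `cooc`, `adj_sketch`) are hypothetical.

```r
tokens <- c("django", "frees", "his", "wife", "django", "wins")
window <- 2L                               # assumed window size for the sketch
vocab_sketch <- unique(tokens)
cooc <- matrix(0, length(vocab_sketch), length(vocab_sketch),
               dimnames = list(vocab_sketch, vocab_sketch))

for (i in seq_along(tokens)) {
  # look ahead at most `window` tokens, staying inside the sequence
  ahead <- seq_len(min(window, length(tokens) - i))
  for (j in i + ahead) {
    cooc[tokens[i], tokens[j]] <- cooc[tokens[i], tokens[j]] + 1
  }
}

adj_sketch <- cooc + t(cooc)   # symmetrise, as done for adj.mat above
diag(adj_sketch) <- 0          # a word is not its own neighbour
adj_sketch
```

Here "django" and "his" co-occur twice within the window (once in each direction), so that cell of adj_sketch is 2.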

Now creating the word cloud using TF weighting
dtm = as.DocumentTermMatrix(dtm_m, weighting = weightTf)

Using TF weighting, create the word cloud of the most frequent terms across all reviews (both positive and negative)

a0 = apply(dtm, 1, sum)   # row sums: total token count per document
dtm = dtm[(a0 > 5),]      # retain only documents with more than 5 tokens, i.e. drop (near-)empty rows
dim(dtm); rm(a0)          # delete a0 object
## [1]  100 3711
a0 = apply(dtm, 2, sum)   # column sums: total count per term
dtm = dtm[, (a0 > 4)]     # retain only those terms that occurred > 4 times in the corpus
dim(dtm); rm(a0)
## [1] 100 468
# view summary wordcloud
a0 = apply(dtm, 2, sum)     # colSum vector of dtm
a0[1:5]                     # view what a0 obj is like
##     rick taratino     days  started absolute 
##        7        6        6        5        5
a1 = order(as.vector(a0), decreasing = TRUE)     # ordering of terms by decreasing frequency
a0 = a0[a1]     # a0 sorted in decreasing order of frequency
a0[1:5]         # view a0 now
## tarantino    django  violence    movies      time 
##       185       135        83        72        71
wordcloud(names(a0), a0,     # invoke wordcloud() func. Use ?wordcloud for more info
          scale=c(4,1), 
          3, # min.freq 
          max.words = 100,
          colors = brewer.pal(8, "Dark2"))
title(sub = "Quick Summary Wordcloud using TF")

Now creating the word cloud using TF-IDF weighting
dtm = as.DocumentTermMatrix(dtm_m, weighting = weightTfIdf)
a0 = apply(dtm, 1, sum)
dtm = dtm[(a0 > 0),]    # retain only those rows with a positive TF-IDF row sum, i.e. delete empty rows
dim(dtm); rm(a0)        # delete a0 object
## [1]  100 3711
a0 = apply(dtm, 2, sum)   # column sums this time
dtm = dtm[, (a0 > 0.05)]  # retain only those terms whose total TF-IDF weight exceeds 0.05
dim(dtm); rm(a0)
## [1]  100 2604
# view summary wordcloud
a0 = apply(dtm, 2, sum)     # colSum vector of dtm
a0[1:5]                     # view what a0 obj is like
##        spend    occurring continuously      filming    favorable 
##   0.07382062   0.07382062   0.07382062   0.07382062   0.07382062
a1 = order(as.vector(a0), decreasing = TRUE)     # ordering of terms by decreasing weight
a0 = a0[a1]     # a0 sorted in decreasing order of weight
a0[1:5]         # view a0 now
##    watch   django violence    black    great 
## 1.130564 1.120305 1.072440 1.051331 1.048732
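
Note that "tarantino", the top term under TF, is missing from the top of the TF-IDF list. A minimal sketch of why, computed by hand on a made-up count matrix: it assumes length-normalised term frequency and a log2 inverse document frequency, which is my understanding of tm's weightTfIdf defaults, so treat the exact formula as an assumption.

```r
# Toy document-term counts (rows = documents, columns = terms).
counts <- matrix(c(2, 1, 0,
                   1, 0, 1,
                   1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3),
                                 c("tarantino", "violence", "great")))

tf  <- counts / rowSums(counts)                   # length-normalised term frequency
idf <- log2(nrow(counts) / colSums(counts > 0))   # rarer terms get larger weights
tfidf <- sweep(tf, 2, idf, `*`)                   # multiply each column by its idf

tfidf_totals <- colSums(tfidf)
round(tfidf_totals, 3)
```

A term that appears in every document has idf = log2(1) = 0, so its TF-IDF weight vanishes no matter how often it occurs.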

Word cloud using TFIDF

wordcloud(names(a0), a0,     
          scale=c(4,1), 
          3, # min.freq 
          max.words = 100,
          colors = brewer.pal(8, "Dark2"))
title(sub = "Quick Summary Wordcloud using TFIDF")

Step 3- Comparing each review’s polarity score with its star rating

t1 = Sys.time()   # set timer
pol_all = polarity(x)
Sys.time() - t1  # how much time did the above take?
## Time difference of 5.850596 secs
wc_all = pol_all$all[,2]                  # Word Count in each doc
val_all = pol_all$all[,3]                 # average polarity score
p_all  = pol_all$all[,4]                  # Positive words info
n_all  = pol_all$all[,5]                  # Negative Words info 
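
As a rough intuition for the polarity score, here is a much simplified sketch: count positive minus negative words and scale by the square root of the word count. The tiny lexicon and the helper `polarity_sketch` are made-up assumptions; as far as I understand, qdap's polarity() additionally accounts for negators and amplifiers around each polarized word, which this sketch ignores.

```r
pos_words <- c("great", "amazing", "perfect")   # tiny made-up lexicon
neg_words <- c("weak", "boring", "stiff")

polarity_sketch <- function(doc) {
  w <- strsplit(tolower(doc), "[^a-z]+")[[1]]
  w <- w[nzchar(w)]                              # drop empty tokens
  score <- sum(w %in% pos_words) - sum(w %in% neg_words)
  score / sqrt(length(w))                        # scale by sqrt of word count
}

sapply(c("great cast amazing soundtrack",
         "weak plot stiff boring acting"), polarity_sketch)
```

This also shows why context stop words matter: a word like "antagonist" in a negative lexicon would pull the score down even in a glowing review.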

Adding the polarity as a column to the original data frame

ratings1.df<-cbind(ratings.df,Polarity=val_all)

Finding out the correlation between Polarity and IMDB User rating using cor() function

cor(as.numeric(gsub("/[0-9]+", "", ratings1.df$Ratings)), ratings1.df$Polarity)
## [1] 0.72828
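
The rating strings have the form "10/10", so the numeric part has to be extracted before correlating. A minimal sketch on made-up values (the polarity numbers below are invented for illustration and are not results from this analysis):

```r
ratings_chr  <- c("10/10", "10/10", "1/10", "1/10")   # format seen in ratings.df$Ratings
polarity_toy <- c(0.8, 0.3, -0.5, 0.1)                # made-up scores for illustration

# Take the part before the slash and convert it to a number.
stars <- as.numeric(sub("/.*$", "", ratings_chr))
stars
toy_cor <- cor(stars, polarity_toy)
toy_cor
```

Because the ratings take only two distinct values here (10 and 1), the correlation effectively measures how well polarity separates the two groups.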

The correlation is about 0.73, obtained after adding many context-specific stop words for a dark-themed movie like Django Unchained

Next, the scatter plot. There are only two distinct values on the X-axis, 10 and 1. Most reviews rated 1 have negative polarity, with a few exceptions, and likewise most reviews rated 10 have positive polarity

 ggplot(ratings1.df,aes(x=as.numeric(gsub("/[0-9]+", "", Ratings)), y= Polarity, color = Ratings))+geom_point(aes(fill = Ratings))+ggtitle("IMDB Rating vs. Polarity")+xlab("IMDB Rating")

The charts below show the polarity on a per-document basis

len = function(x){
  # a single "-" entry means no polarized words were found; count that as zero
  if (length(x) == 1 && x == "-")  {return (0)} 
  else {return(length(unlist(x)))}
}

pcount = unlist(lapply(p_all, len))
ncount = unlist(lapply(n_all, len))
doc_id = seq_along(wc_all)

plot(doc_id,pcount,type="l",col="green",xlab = "Document ID", ylab= "Word Count")
lines(doc_id,ncount,type= "l", col="red")
title(main = "Positive words vs Negative Words" )
legend("topright", inset=.05, c("Positive Words","Negative Words"), fill=c("green","red"), horiz=TRUE)

# Document sentiment running plot
plot(pol_all$all$polarity, type = "l", ylab = "Polarity Score",xlab = "Document Number")
abline(h=0)
title(main = "Polarity Plot" )

Step 4 - Recommendations & Trial and Errors

Selected Django Unchained because the theme of the movie itself is pretty dark, with slavery, racism and violence being its prime attributes. This required a lot of data cleaning.

Used stopwords from the tm package, the stopwords file used in class, and additional context-based stopwords, listed below

context_stopwords
##  [1] "slave"            "slaves"           "revenge"         
##  [4] "vengeance"        "death"            "dark"            
##  [7] "racism"           "inglorious"       "fiction"         
## [10] "antagonist"       "villain"          "villains"        
## [13] "django unchained" "movie"            "slavery"         
## [16] "film"             "hollywood"        "pulp"
Trial and Errors
  • Had to use a lot of stop words, as even positive reviews contain these words. So looked at the positive reviews (i.e. those with a 10/10 rating) and kept removing words like “slavery” and “racism”
  • Tarantino’s previous movies include negative-sounding titles like Inglourious Basterds and Pulp Fiction, with “fiction” showing up among the negative words
  • Noticed that a few documents contain spelling mistakes, so the data probably needs cleaning for spelling mistakes as well
  • The antagonist’s performance is well appreciated in positive reviews, but the polarity scoring counts “antagonist” as a negative word; since it has been added to the stop words, that praise may go unnoticed
  • On the whole, the results of this exercise reiterate that basic text mining alone does not give the right picture and that a lot of contextual data cleaning is required
Recommendations

Going through the word clouds, the observations are as follows

  • Director Quentin Tarantino is greatly appreciated by the positive reviewers
  • Performance of the star cast, especially Christoph Waltz as a bounty hunter, is great as per the reviews
  • Jamie Foxx is much lauded for his performance as Django
  • Most negative reviewers seem to suggest that the film promotes violence, blood and racism
  • “Scene” and “time” also appear in the list

So, as a Data Scientist working for the studio, my recommendation would be to assemble the same cast, as their performances have been widely appreciated. Reduce the violence, bloodshed and killings if possible. The script needs to be as strong as Django Unchained’s, and Tarantino is a must as director, but he may follow his own style of violence :)