1 Overview

The dataset is sourced from Kaggle, and the main focus of this project is to learn various NLP techniques. By classifying each review, a user can accurately gauge how a movie is trending among the public, which helps in making strategic decisions in personal or business life.

  • Q1. Why build an NLP model to classify movie reviews, and how can it be useful in business or personal life?
    • It has become very common for people to look at reviews, either on social websites or on movie-critique websites such as IMDb, before they go to see a movie. Some websites provide rankings, ratings, and reviews for upcoming movies as well as old ones. Rotten Tomatoes is one of the most famous; it aggregates reviews from other movie critics' websites.
    • Also, in today's world, media-services providers have replaced cable TV and become significant providers of movies and TV shows. In today's competitive market, many of these providers are coming up with new ideas and innovations to retain as many customers as possible. So it has become important not just to keep a good collection of movies and TV shows, but also to have a good content-recommendation algorithm that serves good choices to users. A user's profile history may tell the algorithm the user's favorite movie/TV-show genres, but it is equally important to recommend good choices from the vast pool within those favorite genres.

The above two points suggest that by classifying movie reviews, we can give a machine-learning model the extra information it needs to recommend movies to users.

  • Real-World Examples of Movie Recommendation Systems
    • Everyone knows about Netflix, an American media-services provider and production company. It is one of the most popular media-service providers in the world, and it is well known that Netflix has one of the best media recommendation systems. It monitors information such as genre, watch duration, viewing history, etc. Its algorithm combines topic-modeling techniques with this monitored information to provide the best media recommendations to its users.
  • How we can boost the Netflix recommendation system
    • Netflix's recommendation system is reported to have an accuracy of around 82%. The accuracy can be improved in the following ways:
      • We can use the NLP modeling techniques below to classify reviews and recommend trending movies to the user.
      • Moreover, we can separate genuine reviews from fake ones and feed only the genuine reviews to the review-classification model for better predictions.

2 Facts about the Dataset

In the dataset, each sentence is divided into phrases of varying length. Some phrases contain only one word.

2.1 Glimpse of Training Data

From the statistics below, we can infer that:

  • Sentences have been split into 18 phrases on average in the train dataset and 20 phrases in the test dataset.
  • There are a total of 156060 phrases in the train dataset, and on average each phrase contains around seven words.
  • There are a total of 66292 phrases in the test dataset, and on average each phrase contains around seven words.
  • The dataset is big enough to provide insightful results, with the train dataset having 8529 sentences and the test dataset having 3310 sentences.
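These figures can be reproduced with a few lines of dplyr; a minimal sketch, assuming the Kaggle column names SentenceId and Phrase:

```r
library(dplyr)

train %>%
  count(SentenceId) %>%                 # phrases per sentence
  summarise(avg_phrases = mean(n))

nrow(train)                             # total number of phrases
n_distinct(train$SentenceId)            # total number of sentences

train %>%
  mutate(words = lengths(strsplit(Phrase, "\\s+"))) %>%
  summarise(avg_words_per_phrase = mean(words))
```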
## [1] "Average count of phrases per sentence in the train dataset is: 18"
## [1] "Average count of phrases per sentence in the test dataset is: 20"
## [1] "Number of phrases in the train dataset: 156060"
## [1] "Number of phrases in the test dataset: 66292"
## [1] "Number of sentences in the train dataset: 8529"
## [1] "Number of sentences in the test dataset: 3310"
## [1] "Average word length of phrases in the train dataset is: 7"
## [1] "Average word length of phrases in the test dataset is: 7"

2.2 Target Variable: Sentiment

The description of Target Variable:

  • 0 - Very Bad Rating
  • 1 - Bad Rating
  • 2 - Average Rating
  • 3 - Good Rating
  • 4 - Very Good Rating

Below we can see the distribution of sentiment (the target variable) within the training dataset. It shows that most of the phrases received a sentiment value of 2 (average), while very few received a 0 or 4 rating, with 0 being the lowest in frequency.
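For reference, the distribution chart can be drawn as follows (a sketch; the `train` object and column names are assumptions):

```r
library(ggplot2)

ggplot(train, aes(x = factor(Sentiment))) +
  geom_bar(fill = "steelblue") +
  labs(x = "Sentiment (0 = very bad, 4 = very good)",
       y = "Number of phrases")
```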

3 Data Preprocessing to create Word Cloud

3.1 Converting Data into Corpus

We create a corpus (a collection of documents, one per phrase) from all the phrases, then remove unnecessary text such as stopwords, numbers, punctuation, and extra whitespace, which do not add any meaning. We also perform lemmatization to derive the meaningful root of each word.
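A minimal sketch of this cleaning pipeline with the tm and textstem packages (the exact packages used here are an assumption):

```r
library(tm)
library(textstem)

corpus <- VCorpus(VectorSource(train$Phrase))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
# lemmatize each phrase down to dictionary roots
corpus <- tm_map(corpus, content_transformer(lemmatize_strings))
```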

3.2 Bag of Words

Now we have clean text from the phrases, which needs to be converted into numeric vectors using word counts per phrase (a bag-of-words representation).
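A sketch of the bag-of-words step, building on the tm corpus above (the sparsity threshold is an assumption):

```r
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.999)   # drop very rare terms to save memory
bow <- as.data.frame(as.matrix(dtm))   # one row per phrase, one column per word
```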

3.3 Word Cloud preprocessing

Creating a dataframe of words and their frequencies for the word cloud.
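One way to build that dataframe and draw the cloud, sketched on top of the document-term matrix above:

```r
library(wordcloud)
library(RColorBrewer)

freq    <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
word_df <- data.frame(word = names(freq), freq = freq)

wordcloud(words = word_df$word, freq = word_df$freq,
          max.words = 100, colors = brewer.pal(8, "Dark2"))
```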

4 Exploratory Data Analysis

4.1 Word Cloud

The word cloud shows that film, movie, like, one, character, make, story, good, time, not, see, comedy, plot, work, funny are used most often in the movie reviews. The word 'not' suggests that we need to use bigrams in order to perform any meaningful analysis.

4.2 Most Frequent Words

Let's look at the most frequent words to understand which words people use in movie reviews and which are influential in the sentiment analysis.

4.3 Least Frequent Words

Let's look at the least frequent words to understand which words people use in movie reviews that might not be influential in the sentiment analysis. To be sure, we need to compute TF-IDF, which assigns a weight to each word and may make less frequent words important for the analysis.
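A sketch of the TF-IDF computation with tidytext, treating each sentence as a document (column names are assumptions):

```r
library(dplyr)
library(tidytext)

train %>%
  unnest_tokens(word, Phrase) %>%
  count(SentenceId, word, sort = TRUE) %>%
  bind_tf_idf(word, SentenceId, n) %>%
  arrange(desc(tf_idf))
```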

4.5 Frequency of Words

The graph below suggests that the token 's' appears many times. 's' is not a word and needs to be eliminated from the dataframe.
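Dropping the stray token is a one-liner (a sketch; `word_df` is the assumed name of the frequency dataframe):

```r
library(dplyr)

word_df <- word_df %>% filter(word != "s")  # remove the non-word token "s"
```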

## Selecting by n

Sentiment Lexicons

After exploring sentiment lexicons using the Bing lexicon, we find that we have 47044 positive and 47753 negative words.
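A sketch of the lexicon join that yields these counts (tidytext's get_sentiments("bing") labels each word as positive or negative):

```r
library(dplyr)
library(tidytext)

train %>%
  unnest_tokens(word, Phrase) %>%
  inner_join(get_sentiments("bing")) %>%  # prints: Joining, by = "word"
  count(sentiment)                        # totals of positive vs negative words
```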

## Joining, by = "word"

Below are some of the most positive and negative words within the training dataset that contribute the most to the sentiment scores.

## Joining, by = "word"
## Selecting by n

4.6 N-Grams

Implementing bigrams.
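A sketch of the bigram extraction with tidytext, filtering out stopwords on either side (column names are assumptions):

```r
library(dplyr)
library(tidyr)
library(tidytext)

bigram_counts <- train %>%
  unnest_tokens(bigram, Phrase, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)
```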

Below are some of the most common bigrams.

5 Network Analysis

Creating a word network from bigrams
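A minimal sketch of building that network with igraph, assuming the `bigram_counts` dataframe sketched in Section 4.6 and the >40 occurrence threshold discussed below:

```r
library(dplyr)
library(igraph)

bigram_graph <- bigram_counts %>%
  filter(n > 40) %>%          # keep only frequent bigrams
  graph_from_data_frame()     # directed graph: word1 -> word2

bigram_graph
```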

## IGRAPH de3e901 DN-- 64 43 -- 
## + attr: name (v/c), n (e/n)
## + edges from de3e901 (vertex names):
##  [1] romantic ->comedy     lrb      ->rrb        love     ->story     
##  [4] spin     ->dry        subject  ->matter     special  ->effect    
##  [7] soap     ->opus       run      ->time       bad      ->movie     
## [10] action   ->film       rrb      ->lrb        de       ->niro      
## [13] sense    ->humor      action   ->sequence   horror   ->movie     
## [16] world    ->war        action   ->movie      horror   ->film      
## [19] target   ->audience   war      ->ii         war      ->movie     
## [22] character->study      motion   ->picture    adam     ->sandler   
## + ... omitted several edges

After visualizing the bigram network, we can see which pairs of words occur close together. The words shown occur more than 40 times across all phrases in the dataset. The network suggests that bigrams will be well suited for sentiment analysis.

7 Run Models

  • Data Preprocessing
    • Data is already divided into train and test datasets
  • Cross Validation
    • We will use k-fold (k = 5) cross-validation to avoid overfitting our model
    • We will precompute these k folds before modeling and feed the same folds for training and cross-validation, so that we can compare all models on identical folds and ensure reproducibility (see the sketch after this list)
  • We will run 8 ML algorithms to train our model
    • Linear SVM
    • Radial SVM
    • Decision Tree
    • Adaboost Decision Tree
    • XGboost Decision Tree
    • Decision Tree with Bagging
    • Random Forest
    • Neural Network (using caret)
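A sketch of the fold precomputation with caret; the seed and object names are assumptions:

```r
library(caret)

set.seed(123)
# precompute the 5 folds once; 'index' holds the training rows of each fold
folds <- createFolds(train$Sentiment, k = 5, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds)

# the same ctrl object is then reused for every algorithm, e.g.
# train(Sentiment ~ ., data = train, method = "svmLinear", trControl = ctrl)
```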

7.1 Linear SVM

The accuracy for Linear SVM is 50.3%.
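The confusion matrix below can be produced as follows (a sketch; `svm_fit` and `validation` are assumed object names):

```r
Linear_SVM_pred <- predict(svm_fit, newdata = validation)
confusionMatrix(table(Linear_SVM_pred, validation$Sentiment))
```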

## Aggregating results
## Selecting tuning parameters
## Fitting C = 1.2 on full training set
## Confusion Matrix and Statistics
## 
##                
## Linear_SVM_pred    0    1    2    3    4
##               0    0    0    0    0    0
##               1   30  146   88   23    4
##               2   90  499 1389  572  194
##               3    2   12   19   43   22
##               4    0    8   15    6   27
## 
## Overall Statistics
##                                         
##                Accuracy : 0.503         
##                  95% CI : (0.486, 0.521)
##     No Information Rate : 0.474         
##     P-Value [Acc > NIR] : 0.00046       
##                                         
##                   Kappa : 0.122         
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.0000   0.2195    0.919   0.0668  0.10931
## Specificity            1.0000   0.9426    0.192   0.9784  0.99014
## Pos Pred Value            NaN   0.5017    0.506   0.4388  0.48214
## Neg Pred Value         0.9617   0.8209    0.726   0.8056  0.92978
## Prevalence             0.0383   0.2085    0.474   0.2019  0.07745
## Detection Rate         0.0000   0.0458    0.436   0.0135  0.00847
## Detection Prevalence   0.0000   0.0913    0.860   0.0307  0.01756
## Balanced Accuracy      0.5000   0.5811    0.556   0.5226  0.54973

7.2 Radial SVM

The accuracy for Radial SVM is 51.6%.

## Aggregating results
## Fitting final model on full training set
## Confusion Matrix and Statistics
## 
##                
## Radial_SVM_pred    0    1    2    3    4
##               0    0    0    0    0    0
##               1   39  178   95   30    4
##               2   82  479 1407  574  207
##               3    1    8    7   40   16
##               4    0    0    2    0   20
## 
## Overall Statistics
##                                         
##                Accuracy : 0.516         
##                  95% CI : (0.498, 0.533)
##     No Information Rate : 0.474         
##     P-Value [Acc > NIR] : 1.12e-06      
##                                         
##                   Kappa : 0.141         
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.0000   0.2677    0.931   0.0621  0.08097
## Specificity            1.0000   0.9334    0.200   0.9874  0.99932
## Pos Pred Value            NaN   0.5145    0.512   0.5556  0.90909
## Neg Pred Value         0.9617   0.8287    0.764   0.8062  0.92832
## Prevalence             0.0383   0.2085    0.474   0.2019  0.07745
## Detection Rate         0.0000   0.0558    0.441   0.0125  0.00627
## Detection Prevalence   0.0000   0.1085    0.862   0.0226  0.00690
## Balanced Accuracy      0.5000   0.6006    0.566   0.5248  0.54015

7.3 Decision Tree

The accuracy for the Decision Tree is 49.4%.

## Aggregating results
## Selecting tuning parameters
## Fitting cp = 0.00511 on full training set
## Confusion Matrix and Statistics
## 
##          
## tree_pred    0    1    2    3    4
##         0    0    0    0    0    0
##         1   53  105   86   23    0
##         2   69  559 1423  589  216
##         3    0    0    2   30   14
##         4    0    1    0    2   17
## 
## Overall Statistics
##                                         
##                Accuracy : 0.494         
##                  95% CI : (0.476, 0.511)
##     No Information Rate : 0.474         
##     P-Value [Acc > NIR] : 0.0122        
##                                         
##                   Kappa : 0.088         
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.0000   0.1579    0.942  0.04658  0.06883
## Specificity            1.0000   0.9358    0.146  0.99371  0.99898
## Pos Pred Value            NaN   0.3933    0.498  0.65217  0.85000
## Neg Pred Value         0.9617   0.8084    0.736  0.80465  0.92742
## Prevalence             0.0383   0.2085    0.474  0.20194  0.07745
## Detection Rate         0.0000   0.0329    0.446  0.00941  0.00533
## Detection Prevalence   0.0000   0.0837    0.896  0.01442  0.00627
## Balanced Accuracy      0.5000   0.5469    0.544  0.52015  0.53390

7.3.1 Decision Tree Plot

Since we have LSA-transformed variables, we cannot deduce which words are most important in building the decision tree.

7.3.2 Variable Importance according to the Decision Tree

  • Let's consider only the variables that have more than 50% importance for our analysis
    • V32 is the most important LSA variable
    • V23 is the second most important LSA variable
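A sketch of how these importances can be inspected with caret (`tree_fit` is the assumed name of the fitted decision-tree model):

```r
library(caret)

imp <- varImp(tree_fit)   # scaled 0-100 importance per LSA variable
plot(imp, top = 10)       # V32 and V23 should appear at the top
```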

7.4 XGboost Decision Tree

The accuracy for the XGboost Decision Tree is 58.4%. With the XGboost boosting algorithm, the decision tree's accuracy has improved a little, but it is still not good at classifying reviews.

## Aggregating results
## Selecting tuning parameters
## Fitting nrounds = 50, max_depth = 2, eta = 0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1, subsample = 0.75 on full training set
## Confusion Matrix and Statistics
## 
##                
## XGboost_DT_pred    0    1    2    3    4
##               0   15    6    7    1    0
##               1   54  291  136   38    8
##               2   52  349 1277  392   88
##               3    1   13   61  180   53
##               4    0    6   30   33   98
## 
## Overall Statistics
##                                         
##                Accuracy : 0.584         
##                  95% CI : (0.566, 0.601)
##     No Information Rate : 0.474         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.329         
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity           0.12295   0.4376    0.845   0.2795   0.3968
## Specificity           0.99544   0.9065    0.475   0.9497   0.9765
## Pos Pred Value        0.51724   0.5522    0.592   0.5844   0.5868
## Neg Pred Value        0.96614   0.8595    0.773   0.8389   0.9507
## Prevalence            0.03826   0.2085    0.474   0.2019   0.0775
## Detection Rate        0.00470   0.0913    0.400   0.0564   0.0307
## Detection Prevalence  0.00909   0.1653    0.677   0.0966   0.0524
## Balanced Accuracy     0.55919   0.6720    0.660   0.6146   0.6867

7.5 Random Forest

The accuracy for Random Forest is 76.5%. Random Forest is much better at classifying reviews.

## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 26 on full training set
## Confusion Matrix and Statistics
## 
##             
## rf_tree_pred    0    1    2    3    4
##            0  122    0    0    0    0
##            1    0  557  120   11    1
##            2    0   97 1322  302   58
##            3    0    9   56  307   58
##            4    0    2   13   24  130
## 
## Overall Statistics
##                                         
##                Accuracy : 0.765         
##                  95% CI : (0.749, 0.779)
##     No Information Rate : 0.474         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.642         
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            1.0000    0.838    0.875   0.4767   0.5263
## Specificity            1.0000    0.948    0.728   0.9517   0.9867
## Pos Pred Value         1.0000    0.808    0.743   0.7140   0.7692
## Neg Pred Value         1.0000    0.957    0.866   0.8779   0.9613
## Prevalence             0.0383    0.209    0.474   0.2019   0.0775
## Detection Rate         0.0383    0.175    0.415   0.0963   0.0408
## Detection Prevalence   0.0383    0.216    0.558   0.1348   0.0530
## Balanced Accuracy      1.0000    0.893    0.801   0.7142   0.7565

7.6 Adaboost Decision Tree

The accuracy for the Adaboost Decision Tree is 49.6%. With the Adaboost boosting algorithm, the decision tree's accuracy is almost the same.

## Aggregating results
## Selecting tuning parameters
## Fitting mfinal = 50, maxdepth = 3 on full training set
## Confusion Matrix and Statistics
## 
##                 
## Adaboost_DT_pred    0    1    2    3    4
##                0    0    0    0    0    0
##                1   45  179  107   28    6
##                2   77  486 1404  616  241
##                3    0    0    0    0    0
##                4    0    0    0    0    0
## 
## Overall Statistics
##                                         
##                Accuracy : 0.496         
##                  95% CI : (0.479, 0.514)
##     No Information Rate : 0.474         
##     P-Value [Acc > NIR] : 0.00563       
##                                         
##                   Kappa : 0.095         
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.0000   0.2692    0.929    0.000   0.0000
## Specificity            1.0000   0.9263    0.154    1.000   1.0000
## Pos Pred Value            NaN   0.4904    0.497      NaN      NaN
## Neg Pred Value         0.9617   0.8279    0.707    0.798   0.9225
## Prevalence             0.0383   0.2085    0.474    0.202   0.0775
## Detection Rate         0.0000   0.0561    0.440    0.000   0.0000
## Detection Prevalence   0.0000   0.1145    0.886    0.000   0.0000
## Balanced Accuracy      0.5000   0.5977    0.541    0.500   0.5000

7.7 Bootstrap Decision Tree

The accuracy for the Bootstrap (bagged) Decision Tree model is 75.8%. After bootstrapping, the decision tree's accuracy has improved a lot at classifying reviews.

## Aggregating results
## Fitting final model on full training set
## Confusion Matrix and Statistics
## 
##               
## dtree_reg_pred    0    1    2    3    4
##              0  115    0    0    0    0
##              1    3  533  105   11    0
##              2    4  117 1307  276   55
##              3    0   13   83  336   66
##              4    0    2   16   21  126
## 
## Overall Statistics
##                                         
##                Accuracy : 0.758         
##                  95% CI : (0.743, 0.773)
##     No Information Rate : 0.474         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.633         
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.9426    0.802    0.865    0.522   0.5101
## Specificity            1.0000    0.953    0.731    0.936   0.9867
## Pos Pred Value         1.0000    0.817    0.743    0.675   0.7636
## Neg Pred Value         0.9977    0.948    0.857    0.886   0.9600
## Prevalence             0.0383    0.209    0.474    0.202   0.0775
## Detection Rate         0.0361    0.167    0.410    0.105   0.0395
## Detection Prevalence   0.0361    0.204    0.552    0.156   0.0517
## Balanced Accuracy      0.9713    0.877    0.798    0.729   0.7484

7.8 Neural Net

The accuracy for the Neural Net is 50.6%.

## Aggregating results
## Selecting tuning parameters
## Fitting size = 3, decay = 0 on full training set
## # weights:  173
## initial  value 5867.332857 
## iter  10 value 4240.305889
## iter  20 value 4194.626796
## iter  30 value 3973.247208
## iter  40 value 3771.903294
## iter  50 value 3629.095470
## iter  60 value 3584.016068
## iter  70 value 3560.778177
## iter  80 value 3549.414945
## iter  90 value 3543.091799
## iter 100 value 3535.772916
## final  value 3535.772916 
## stopped after 100 iterations
## Confusion Matrix and Statistics
## 
##              
## Nnet_reg_pred    0    1    2    3    4
##             0    9    8   10    3    0
##             1   30  127   90   10    0
##             2   78  515 1358  544  173
##             3    5    9   37   68   23
##             4    0    6   16   19   51
## 
## Overall Statistics
##                                         
##                Accuracy : 0.506         
##                  95% CI : (0.488, 0.523)
##     No Information Rate : 0.474         
##     P-Value [Acc > NIR] : 0.000161      
##                                         
##                   Kappa : 0.141         
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity           0.07377   0.1910    0.899   0.1056   0.2065
## Specificity           0.99315   0.9485    0.219   0.9709   0.9861
## Pos Pred Value        0.30000   0.4942    0.509   0.4789   0.5543
## Neg Pred Value        0.96423   0.8165    0.706   0.8110   0.9367
## Prevalence            0.03826   0.2085    0.474   0.2019   0.0775
## Detection Rate        0.00282   0.0398    0.426   0.0213   0.0160
## Detection Prevalence  0.00941   0.0806    0.837   0.0445   0.0288
## Balanced Accuracy     0.53346   0.5697    0.559   0.5383   0.5963

8 Comparing Various Models

8.2 Boxplot

Each colour represents a fold. The lines tell us what accuracy each model achieves when the same fold is fed into it.

Extreme Boosting (xgbTree) gives us significantly better accuracy for every fold.

8.3 Parallel Plot

Time Comparison

Each colour represents a model. The best model will be a balance between accuracy and time taken, i.e., the one that takes less time and also gives us good accuracy.

Extreme Boosting (xgbTree) gives us significantly better accuracy and takes only a few seconds to produce the result.

8.5 Learning Curve for our best model

This is the learning curve of training and cross-validation accuracy versus the number of training examples. Looking at the plot, we can say that our model is too simple and needs more features to perform better, because the training accuracy is being pulled down while the cross-validation accuracy is not increasing. We need to perform some more analysis before running the models.
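A sketch of the loop implied by the log below: train on growing fractions of the data and record both training and cross-validation accuracy (object names, and reuse of the `ctrl` object from Section 7, are assumptions):

```r
for (p in seq(0.1, 1.0, by = 0.1)) {
  n <- floor(p * nrow(train_lsa))        # train_lsa: assumed LSA feature set
  cat(sprintf("Training for %d%% (n = %d)\n", round(p * 100), n))
  fit <- train(Sentiment ~ ., data = train_lsa[seq_len(n), ],
               method = "xgbTree", trControl = ctrl)
  # store max(fit$results$Accuracy) and the training accuracy for the plot
}
```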

## Training for 10% (n = 318)
## Training for 20% (n = 637)
## Training for 30% (n = 956)
## Training for 40% (n = 1275)
## Training for 50% (n = 1594)
## Training for 60% (n = 1913)
## Training for 70% (n = 2232)
## Training for 80% (n = 2551)
## Training for 90% (n = 2870)
## Training for 100% (n = 3189)

9 Data Preprocessing for Test Dataset

Our model is ready; now we prepare the test data so that we can predict sentiments for the phrases in the test dataset.

9.1 Predicting Sentiment

Below is the test dataset for which predictions are made.
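A sketch of the final prediction step (`rf_fit` and `test_lsa` are assumed names for the chosen model and the prepared test features):

```r
test_pred  <- predict(rf_fit, newdata = test_lsa)
submission <- data.frame(PhraseId = test$PhraseId, Sentiment = test_pred)
write.csv(submission, "submission.csv", row.names = FALSE)
```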

10 Final Summary

10.1 Best Model

Random Forest is the best model, even though the time it consumes is a little higher than the others.

  • How to improve the model's accuracy
    • Word2vec can be used to create word embeddings which might provide better results.
    • Deep learning models like LSTM and GRU could be applied to provide better results.
  • Time Spent
    • Learning NLP topics, exploring the data, and choosing the n value for n-grams took most of the time.
    • Modelling and plotting took around 20% of the time.
    • Understanding Rmd, the Rhub platform, and designing the report took 15% of the time.
  • Challenges Faced
    • Running TensorFlow was problematic; resolving architecture errors took a lot of time.
    • Memory Issue:
      • Converting the large text corpus into a matrix required more memory than was available, so I had to take a subset of the train and test datasets.
    • Since most decisions in life are influenced by emotions, understanding emotions is a challenging problem in NLP.