Restaurant Review Classification

This is a natural language processing project that uses the restaurant reviews dataset from the “SuperDataScience” website. The dataset has 1000 restaurant reviews and is balanced, with 500 positive and 500 negative reviews. I used the NLTK package to make the reviews machine readable, then applied several classification models and compared their performance. Tasks: 1. Classify each review as negative or positive. 2. Find the best classification model for review classification.

Future work: Machine learning techniques such as dimensionality reduction could be used to improve the performance of the models; this project uses only basic models. Other models such as CART, C5.0, and Maximum Entropy could also be tried to get more accurate results.

Dataset

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset (tab-separated; quoting = 3 is csv.QUOTE_NONE, so quote characters in reviews are kept as text)
dataset = pd.read_csv('/Users/rishika/Desktop/Restaurant_Reviews.tsv', sep = '\t', quoting = 3, engine= "python")


print(dataset.head(5))

                                              Review  Liked
0                           Wow... Loved this place.      1
1                                 Crust is not good.      0
2          Not tasty and the texture was just nasty.      0
3  Stopped by during the late May bank holiday of...      1
4  The selection on the menu was great and so wer...      1

# Count the positive reviews (Liked == 1) and compare with the negative count
positive_review_count = int((dataset['Liked'] == 1).sum())
negative_review_count = len(dataset) - positive_review_count
print("The number of positive reviews = " + str(positive_review_count))
if positive_review_count == negative_review_count:
    print("This is a balanced dataset with equal number of positive and negative reviews")

The number of positive reviews = 500
This is a balanced dataset with equal number of positive and negative reviews

Feature Engineering

import re
import nltk

nltk.download("stopwords")
from nltk.corpus import stopwords

from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rishika/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

corpus = []
# Build the stop-word set and the stemmer once, outside the loop
stop_words = set(stopwords.words("english"))
ps = PorterStemmer()

# Iterate over all the reviews in the dataset
for i in range(len(dataset)):

    # 1. Replace every non-letter character (digits, punctuation, emoticons) with a space
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])

    # 2. Convert the text to lower case
    review = review.lower()

    # 3. Split the review into words and drop the stop words
    # 4. Stem the remaining words with PorterStemmer, e.g. "loved" and "loving" become "love"
    review = [ps.stem(word) for word in review.split() if word not in stop_words]

    # 5. Join the stems back into one string so the vectorizer can process it
    review = ' '.join(review)
    corpus.append(review)


print(corpus[:5])

['wow love place', 'crust good', 'tasti textur nasti', 'stop late may bank holiday rick steve recommend love', 'select menu great price']
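
Note that the second cleaned review above, 'crust good', came from "Crust is not good.": the English stop-word list contains "not", so the negation is dropped along with the other stop words. The following is a minimal sketch of one way to keep negations. It assumes a tiny hand-written stop-word list (hypothetical, standing in for stopwords.words("english")) and skips stemming so that it runs with the standard library only. The names clean_review, TOY_STOP_WORDS and NEGATIONS are illustrative and not part of the original notebook.

```python
import re

# Hypothetical miniature stop-word list; the notebook uses NLTK's full English list.
TOY_STOP_WORDS = {"is", "the", "was", "and", "just", "not", "this"}
# Words we may want to keep because they flip the sentiment of a review.
NEGATIONS = {"not", "no", "never"}


def clean_review(text, stop_words):
    """Replace non-letters with spaces, lowercase, and drop stop words."""
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(word for word in words if word not in stop_words)


review = "Crust is not good."
print(clean_review(review, TOY_STOP_WORDS))              # crust good
print(clean_review(review, TOY_STOP_WORDS - NEGATIONS))  # crust not good
```

Whether keeping negations actually helps the classifiers here is untested; it would have to be checked by rerunning the models on the modified corpus.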

# Creating the bag-of-words model, keeping at most the 1500 most frequent words

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:,1].values
print(X[:5])

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
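
To make the matrix above less opaque, here is a minimal standard-library sketch of what a bag-of-words encoding does, using the first three cleaned reviews printed earlier. It is a simplified illustration and not CountVectorizer's actual implementation, which also handles tokenization, the max_features cut-off and sparse storage. The helper bag_of_words is hypothetical.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    tokenized = [doc.split() for doc in documents]
    # One column per distinct word, in alphabetical order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


toy_corpus = ['wow love place', 'crust good', 'tasti textur nasti']
vocabulary, matrix = bag_of_words(toy_corpus)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one column per vocabulary word and is mostly zeros, because a short review uses only a few of the words. That is probably why the truncated print of X above shows only zeros.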

Model Creation

#splitting the dataset into the training set and test set

from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test= train_test_split(X,y,test_size=.20, random_state = 0)

#fitting Naive Bayes to training set

from sklearn.naive_bayes import GaussianNB

classifier_NB = GaussianNB()
classifier_NB.fit(X_train,y_train)

#predicting the test set results

y_pred_NB = classifier_NB.predict(X_test)

#Making Confusion Matrix

from sklearn.metrics import confusion_matrix
cm_NB= confusion_matrix(y_test,y_pred_NB)

print(cm_NB)

[[55 42]
 [12 91]]
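
For reading the matrix above: scikit-learn's confusion_matrix puts the actual classes in rows and the predicted classes in columns, so with labels 0 and 1 the layout is [[TN, FP], [FN, TP]]. The small standard-library sketch below shows how the four cells are counted. The labels are made up for illustration and are not taken from the dataset, and toy_confusion_matrix is a hypothetical helper.

```python
def toy_confusion_matrix(y_true, y_pred):
    """Count a 2x2 matrix with actual labels as rows and predicted labels as columns."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Made-up labels for illustration only: 1 = liked, 0 = not liked.
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 0, 1, 0, 1, 0]

(tn, fp), (fn, tp) = toy_confusion_matrix(y_true_toy, y_pred_toy)
print(tn, fp, fn, tp)  # 3 1 1 3
```

Read this way, the Naive Bayes matrix above has 91 true positives, 55 true negatives, 42 false positives and 12 false negatives.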

# Fitting Decision Tree Classification to the Training set
from sklearn.tree import DecisionTreeClassifier
classifier_DT = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier_DT.fit(X_train, y_train)

# Predicting the Test set results
y_pred_DT = classifier_DT.predict(X_test)

# Making the confusion matrix

cm_DT= confusion_matrix(y_test,y_pred_DT)

print(cm_DT)

[[74 23]
 [35 68]]

# Fitting a linear SVM to the Training set
# (an RBF-kernel alternative is left commented out below)
# classifier_SVM = SVC(kernel = 'rbf', random_state = 0)
# classifier_SVM.fit(X_train, y_train)

from sklearn.svm import SVC
classifier_SVM = SVC(kernel = 'linear', random_state = 0)
classifier_SVM.fit(X_train, y_train)

# Predicting the Test set results
y_pred_SVM = classifier_SVM.predict(X_test)

cm_svm= confusion_matrix(y_test,y_pred_SVM)
print(cm_svm)

[[74 23]
 [33 70]]

# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier_RF = RandomForestClassifier(n_estimators = 500, criterion = 'entropy', random_state = 0)
classifier_RF.fit(X_train, y_train)

# Predicting the Test set results
y_pred_RF = classifier_RF.predict(X_test)

# Making the confusion matrix
cm_RF = confusion_matrix(y_test, y_pred_RF)
print(cm_RF)

[[88  9]
 [47 56]]

Model Evaluation

# confusion_matrix puts actual classes in rows and predicted classes in columns,
# so for labels 0/1: cm[0,0] = TN, cm[0,1] = FP, cm[1,0] = FN, cm[1,1] = TP
Accuracy_NB = (cm_NB[0,0] + cm_NB[1,1]) / len(y_test)
Precision_NB = cm_NB[1,1] / (cm_NB[1,1] + cm_NB[0,1])   # TP / (TP + FP)
Recall_NB = cm_NB[1,1] / (cm_NB[1,1] + cm_NB[1,0])      # TP / (TP + FN)
F1_Score_NB = 2 * Precision_NB * Recall_NB / (Precision_NB + Recall_NB)

print("Accuracy using Naive Bayes is " + str(round(Accuracy_NB, 4)))
print("Precision using Naive Bayes is " + str(round(Precision_NB, 4)))
print("Recall using Naive Bayes is " + str(round(Recall_NB, 4)))
print("F1_Score using Naive Bayes is " + str(round(F1_Score_NB, 4)))

Accuracy using Naive Bayes is 0.73
Precision using Naive Bayes is 0.6842
Recall using Naive Bayes is 0.8835
F1_Score using Naive Bayes is 0.7712

# Positive class is label 1: cm[1,1] = TP, cm[0,1] = FP, cm[1,0] = FN
Accuracy_DT = (cm_DT[0,0] + cm_DT[1,1]) / len(y_test)
Precision_DT = cm_DT[1,1] / (cm_DT[1,1] + cm_DT[0,1])   # TP / (TP + FP)
Recall_DT = cm_DT[1,1] / (cm_DT[1,1] + cm_DT[1,0])      # TP / (TP + FN)
F1_Score_DT = 2 * Precision_DT * Recall_DT / (Precision_DT + Recall_DT)

print("Accuracy using Decision Tree Classifier is " + str(round(Accuracy_DT, 4)))
print("Precision using Decision Tree Classifier is " + str(round(Precision_DT, 4)))
print("Recall using Decision Tree Classifier is " + str(round(Recall_DT, 4)))
print("F1_Score using Decision Tree Classifier is " + str(round(F1_Score_DT, 4)))

Accuracy using Decision Tree Classifier is 0.71
Precision using Decision Tree Classifier is 0.7473
Recall using Decision Tree Classifier is 0.6602
F1_Score using Decision Tree Classifier is 0.701

# Positive class is label 1: cm[1,1] = TP, cm[0,1] = FP, cm[1,0] = FN
Accuracy_SVM = (cm_svm[0,0] + cm_svm[1,1]) / len(y_test)
Precision_SVM = cm_svm[1,1] / (cm_svm[1,1] + cm_svm[0,1])   # TP / (TP + FP)
Recall_SVM = cm_svm[1,1] / (cm_svm[1,1] + cm_svm[1,0])      # TP / (TP + FN)
F1_Score_SVM = 2 * Precision_SVM * Recall_SVM / (Precision_SVM + Recall_SVM)

print("Accuracy using Support Vector Machine is " + str(round(Accuracy_SVM, 4)))
print("Precision using Support Vector Machine is " + str(round(Precision_SVM, 4)))
print("Recall using Support Vector Machine is " + str(round(Recall_SVM, 4)))
print("F1_Score using Support Vector Machine is " + str(round(F1_Score_SVM, 4)))

Accuracy using Support Vector Machine is 0.72
Precision using Support Vector Machine is 0.7527
Recall using Support Vector Machine is 0.6796
F1_Score using Support Vector Machine is 0.7143

# Positive class is label 1: cm[1,1] = TP, cm[0,1] = FP, cm[1,0] = FN
Accuracy_RF = (cm_RF[0,0] + cm_RF[1,1]) / len(y_test)
Precision_RF = cm_RF[1,1] / (cm_RF[1,1] + cm_RF[0,1])   # TP / (TP + FP)
Recall_RF = cm_RF[1,1] / (cm_RF[1,1] + cm_RF[1,0])      # TP / (TP + FN)
F1_Score_RF = 2 * Precision_RF * Recall_RF / (Precision_RF + Recall_RF)

print("Accuracy using Random Forest is " + str(round(Accuracy_RF, 4)))
print("Precision using Random Forest is " + str(round(Precision_RF, 4)))
print("Recall using Random Forest is " + str(round(Recall_RF, 4)))
print("F1_Score using Random Forest is " + str(round(F1_Score_RF, 4)))

Accuracy using Random Forest is 0.72
Precision using Random Forest is 0.8615
Recall using Random Forest is 0.5437
F1_Score using Random Forest is 0.6667
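
Since the second task is to find the best model, the four confusion matrices printed in the Model Creation section can be put side by side. The sketch below recomputes the metrics with the positive class (label 1) as the class of interest, that is, precision = TP / (TP + FP) and recall = TP / (TP + FN), and sorts the models by F1 score. The helper metrics_from_cm and the choice of F1 as the ranking criterion are assumptions made for illustration, not something the original notebook prescribes.

```python
def metrics_from_cm(cm):
    """Return accuracy, precision, recall and F1 for a [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Confusion matrices copied from the outputs printed earlier.
confusion_matrices = {
    "Naive Bayes": [[55, 42], [12, 91]],
    "Decision Tree": [[74, 23], [35, 68]],
    "SVM (linear)": [[74, 23], [33, 70]],
    "Random Forest": [[88, 9], [47, 56]],
}

results = {name: metrics_from_cm(cm) for name, cm in confusion_matrices.items()}
# Rank the models from highest to lowest F1 score (index 3 of each tuple).
ranking = sorted(results, key=lambda name: results[name][3], reverse=True)

for name in ranking:
    accuracy, precision, recall, f1 = results[name]
    print(f"{name:14s} acc={accuracy:.2f} prec={precision:.4f} rec={recall:.4f} f1={f1:.4f}")
```

On this single 80/20 split, Naive Bayes has the highest accuracy (0.73) and the highest F1, driven by its high recall, while Random Forest has the highest precision but the lowest recall. With only 200 test reviews, differences this small should be read cautiously.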