We will use the Yelp Review Data Set from Kaggle.
Each observation in this dataset is a review of a particular business by a particular user.
The "stars" column is the number of stars (1 through 5) that the reviewer assigned to the business; more stars means a better rating. In other words, it is the rating of the business by the person who wrote the review.
The "cool" column is the number of "cool" votes this review received from other Yelp users.
The "useful" and "funny" columns are similar to the "cool" column.
The goal of this project is to predict whether the customer will rate the business as GOOD, BAD or NEUTRAL.
We know how many stars each user gave to a business. Using this, we will create a new attribute, CUSTOMER EXP, which categorizes stars 1 and 2 as a BAD experience, star 3 as NEUTRAL, and stars 4 and 5 as a GOOD experience.
We will use word clouds to visualize the most frequent words in the reviews.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
import nltk
%matplotlib inline
from wordcloud.wordcloud import WordCloud, STOPWORDS
from PIL import Image
The file was downloaded from the Yelp dataset available on the website and converted into a refined form to make the analysis easier.
yelp = pd.read_csv('yelp.csv')
yelp.head()
yelp.info()
yelp.describe()
Here we create the Customer Experience column, which categorizes the stars given by customers to different businesses as GOOD, BAD or NEUTRAL.
We also create a new feature, Text Length, which gives the number of words in each review. This feature helps us understand customer behavior and experience.
# Map each star rating to a customer-experience label:
# 1-2 stars -> BAD, 3 stars -> NEUTRAL, 4-5 stars -> GOOD
Cust = []
for i in yelp['stars']:
    if i <= 2:
        Cust.append('BAD')
    elif i == 3:
        Cust.append('NEUTRAL')
    else:
        Cust.append('GOOD')
yelp['Customer EXP'] = Cust
yelp['Customer EXP'].value_counts()
yelp['Text length'] = yelp['text'].apply(lambda x:len(x.split()))
yelp.head()
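The labelling rule can also be written as a lookup table, which keeps the star-to-label mapping in one place. The sketch below is a dependency-free illustration; `STAR_TO_EXPERIENCE`, `label_reviews`, and the sample reviews are hypothetical names and made-up data, not part of the dataset.

```python
# Hypothetical lookup table: 1-2 stars -> BAD, 3 -> NEUTRAL, 4-5 -> GOOD
STAR_TO_EXPERIENCE = {1: "BAD", 2: "BAD", 3: "NEUTRAL", 4: "GOOD", 5: "GOOD"}


def label_reviews(reviews):
    """Return (experience label, word count) for each (stars, text) pair."""
    labelled = []
    for stars, text in reviews:
        labelled.append((STAR_TO_EXPERIENCE[stars], len(text.split())))
    return labelled


# Made-up reviews for illustration only
sample_reviews = [
    (5, "Great food and friendly staff"),
    (3, "It was okay"),
    (1, "Never going back to this place again"),
]
labelled_reviews = label_reviews(sample_reviews)
print(labelled_reviews)
```

With pandas, the same table could be applied to the whole column as `yelp['stars'].map(STAR_TO_EXPERIENCE)`.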
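# `height` replaces the older `size` argument; histplot replaces the deprecated distplot
a = sns.FacetGrid(data=yelp, col='Customer EXP', hue='Customer EXP', palette='plasma', height=5)
a.map(sns.histplot, "Text length", kde=True, stat="density")
# Select the column before averaging so non-numeric columns are not involved
yelp.groupby('Customer EXP')['Text length'].mean()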
The plots above show the histograms and density estimates of text length for reviews labelled GOOD, BAD and NEUTRAL. Reviews labelled BAD or NEUTRAL average approximately 150 words, while reviews labelled GOOD average about 100 words.
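A mean can be pulled upward by a handful of very long reviews, so it may be worth comparing it with the median for each group. The sketch below uses made-up word counts and a hypothetical `length_summary` helper to show how the two can diverge; it does not use the Yelp data.

```python
from collections import defaultdict
from statistics import mean, median


def length_summary(rows):
    """Group word counts by label and return {label: (mean, median)}."""
    groups = defaultdict(list)
    for label, n_words in rows:
        groups[label].append(n_words)
    return {label: (mean(v), median(v)) for label, v in groups.items()}


# Made-up (label, word count) pairs; note the single very long BAD review
toy_rows = [("GOOD", 80), ("GOOD", 100), ("GOOD", 120),
            ("BAD", 90), ("BAD", 110), ("BAD", 400)]
summary = length_summary(toy_rows)
print(summary)
```

In this toy data the BAD group has a mean of 200 words but a median of 110, because one long review dominates the average.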
plt.figure(figsize = (10,7))
sns.boxplot(x = 'stars', y = 'Text length', data = yelp)
plt.figure(figsize = (7,5))
sns.countplot(x='stars', data=yelp, palette="husl")
Next, we plot the number of reviews in each Customer EXP category:
plt.figure(figsize = (7,5))
sns.countplot(x='Customer EXP', data=yelp, palette="Oranges")
Let's find the correlation between the COOL, USEFUL, FUNNY and TEXT LENGTH features after grouping the data by Customer EXP and averaging within each group.
yelp.groupby('Customer EXP').mean(numeric_only=True).corr()
We then use seaborn to draw a heatmap of that .corr() dataframe:
plt.figure(figsize = (8,6))
sns.heatmap(yelp.groupby('Customer EXP').mean(numeric_only=True).corr(), cmap="coolwarm", annot=True)
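Because the correlation is computed on the table of group means, which has one row per Customer EXP category, each coefficient rests on only three points. The sketch below, with made-up group means and a hypothetical `pearson` helper, illustrates how strong such coefficients can look with so few rows.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up group means for three categories (BAD, GOOD, NEUTRAL)
mean_cool = [0.6, 0.9, 0.8]
mean_text_length = [150.0, 100.0, 140.0]
r = pearson(mean_cool, mean_text_length)
print(round(r, 3))
```

Correlating the per-review columns directly, for example `yelp[['cool', 'useful', 'funny', 'Text length']].corr()`, would use every review instead of three averaged rows.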
Splitting our data set into Train and Test
x = yelp['text']
y = yelp['Customer EXP']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.3, random_state = 101)
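Conceptually, the split shuffles the row indices with a fixed seed and holds out 30% of them for testing. The sketch below illustrates the idea in plain Python; `simple_split` is a hypothetical helper, not scikit-learn's implementation, and it will not select the same rows as `train_test_split`.

```python
import random


def simple_split(items, labels, test_size=0.3, seed=101):
    """Shuffle indices with a fixed seed and hold out a test fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Made-up data: ten reviews with alternating labels
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = ["GOOD" if i % 2 == 0 else "BAD" for i in range(10)]
x_tr, x_te, y_tr, y_te = simple_split(toy_texts, toy_labels)
print(len(x_tr), len(x_te))
```

Fixing the seed makes the split repeatable, so later accuracy figures can be compared across runs.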
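Building a text-cleaning function to remove punctuation and stopwords from the data
from nltk.corpus import stopwords
# Build the stop-word set once; a set gives fast membership tests
english_stopwords = set(stopwords.words('english'))
def text_clean(message):
    # Drop punctuation characters, then lowercase and split into words
    nopunc = [ch for ch in message if ch not in string.punctuation]
    words = "".join(nopunc).lower().split()
    # Keep only the words that are not English stopwords
    return [word for word in words if word not in english_stopwords]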
good = yelp[yelp['Customer EXP'] == 'GOOD']
bad = yelp[yelp['Customer EXP'] == 'BAD']
neu = yelp[yelp['Customer EXP'] == 'NEUTRAL']
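# Join each group's reviews into one string so text_clean receives text, not a Series
good_bow = text_clean(' '.join(good['text']))
bad_bow = text_clean(' '.join(bad['text']))
neu_bow = text_clean(' '.join(neu['text']))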
good_para = ' '.join(good_bow)
bad_para = ' '.join(bad_bow)
new_para = ' '.join(neu_bow)
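A word cloud sizes each word by how often it occurs, so the same information can be inspected numerically by counting tokens. The following is a minimal sketch using `collections.Counter`; the small stop-word set and the sample sentences are made up so that the example runs without NLTK data.

```python
import string
from collections import Counter

# Tiny illustrative stop-word set (an assumption; NLTK's list is much longer)
TOY_STOPWORDS = {"the", "was", "and", "a", "is", "it", "very"}


def tokenize(text):
    """Strip punctuation, lowercase, split, and drop stop words."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return [w for w in no_punct.lower().split() if w not in TOY_STOPWORDS]


# Made-up reviews standing in for one experience group
toy_reviews = [
    "The pizza was great, and the service was great!",
    "Great pizza. The crust is very crispy.",
]
tokens = tokenize(" ".join(toy_reviews))
top_words = Counter(tokens).most_common(2)
print(top_words)
```

Listing the top counts for each group is a quick way to check which words will dominate the corresponding cloud.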
Word cloud to display the most common words in the Reviews where customer experience was GOOD
# Use a separate name so the NLTK `stopwords` module needed by text_clean is not overwritten
wc_stopwords = set(STOPWORDS)
wc_stopwords.add('one')
wc_stopwords.add('also')
mask_image = np.array(Image.open("thumb_up.png"))
wordcloud_good = WordCloud(
    colormap="Paired", mask=mask_image,
    font_path=r"C:\Windows\Fonts\chint__.ttf",  # raw string keeps the backslashes literal
    width=300, height=200, scale=2, max_words=1000,
    stopwords=wc_stopwords,
).generate(good_para)
plt.figure(figsize=(7, 10))
plt.imshow(wordcloud_good, interpolation="bilinear", cmap=plt.cm.autumn)
plt.axis('off')
plt.show()
wordcloud_good.to_file("good.png")
Word cloud to display the most common words in the Reviews where customer experience was BAD
# Use a separate name so the NLTK `stopwords` module needed by text_clean is not overwritten
wc_stopwords = set(STOPWORDS)
wc_stopwords.add('one')
wc_stopwords.add('also')
wc_stopwords.add('good')
mask_image1 = np.array(Image.open("thumb_down.png"))
wordcloud_bad = WordCloud(
    colormap='tab10', mask=mask_image1,
    font_path=r"C:\Windows\Fonts\chint__.ttf",  # raw string keeps the backslashes literal
    width=1100, height=700, scale=2, max_words=1000,
    stopwords=wc_stopwords,
).generate(bad_para)
plt.figure(figsize=(7, 10))
plt.imshow(wordcloud_bad, cmap=plt.cm.autumn)
plt.axis('off')
plt.show()
wordcloud_bad.to_file('bad.png')
Word cloud to display the most common words in the Reviews where customer experience was NEUTRAL
# Use a separate name so the NLTK `stopwords` module needed by text_clean is not overwritten
wc_stopwords = set(STOPWORDS)
wordcloud_neu = WordCloud(
    colormap="plasma",
    font_path=r"C:\Windows\Fonts\Verdana.ttf",  # raw string keeps the backslashes literal
    width=1100, height=700, scale=2, max_words=1000,
    stopwords=wc_stopwords,
).generate(new_para)
plt.figure(figsize=(7, 10))
plt.imshow(wordcloud_neu, cmap=plt.cm.autumn)
plt.axis('off')
plt.show()
wordcloud_neu.to_file('neu.png')
Observations from the Word Cloud
From the word clouds, we find many distinctive words in the reviews that could serve as good predictors of a customer's experience. These words can be used as independent variables to classify each review as GOOD, BAD or NEUTRAL.
We shall use a Naive Bayes classifier and a Random Forest classifier to classify the customer experience.
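Both classifiers need numeric input, so each review is first turned into a vector of word counts (a bag of words). The sketch below shows the idea in plain Python; `build_vocabulary` and `vectorize` are hypothetical helpers, the documents are made up, and a library vectorizer would return a sparse matrix rather than lists.

```python
def build_vocabulary(documents):
    """Map every distinct word to a column index, in sorted order."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: idx for idx, word in enumerate(words)}


def vectorize(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = [0] * len(vocabulary)
    for word in document.split():
        if word in vocabulary:  # words unseen during fitting are ignored
            counts[vocabulary[word]] += 1
    return counts


# Made-up, already-cleaned reviews
docs = ["great food great service", "bad food"]
vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each column corresponds to one word, so the number of features equals the vocabulary size.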
from sklearn.feature_extraction.text import CountVectorizer
cv_transformer = CountVectorizer(analyzer = text_clean)
x = cv_transformer.fit_transform(x)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.3, random_state = 101)
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(x_train, y_train)
predictions = nb.predict(x_test)
predictions
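Multinomial Naive Bayes scores each class by adding the log prior of the class to the log likelihoods of the words in the review, and predicts the highest-scoring class. The sketch below is a simplified illustration with add-one (Laplace) smoothing, not scikit-learn's implementation; `train_nb` and `predict_nb` are hypothetical helpers and the training sentences are made up.

```python
import math
from collections import Counter, defaultdict


def train_nb(documents, labels):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary


def predict_nb(document, model):
    """Return the class with the highest log posterior score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in document.split():
            if word not in vocabulary:
                continue  # skip words never seen in training
            # Add-one smoothing avoids log(0) for unseen class/word pairs
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data
train_docs = ["great food", "great service", "terrible food", "rude staff"]
train_labels = ["GOOD", "GOOD", "BAD", "BAD"]
model = train_nb(train_docs, train_labels)
first_prediction = predict_nb("great staff great food", model)
second_prediction = predict_nb("terrible rude service", model)
print(first_prediction, second_prediction)
```

Working in log space keeps the product of many small probabilities from underflowing on long reviews.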
Creating a confusion matrix and classification report using these predictions and the original values
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, predictions))
print("\n")
print(classification_report(y_test, predictions))
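The figures in the classification report can be derived from the confusion matrix: accuracy is the diagonal sum divided by the total, precision divides each diagonal entry by its column sum, and recall divides it by its row sum (rows are true classes and columns are predicted classes, as in scikit-learn's `confusion_matrix`). The matrix below is made up for illustration and `summarize` is a hypothetical helper.

```python
def summarize(matrix):
    """Return accuracy plus per-class precision and recall lists."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    precision, recall = [], []
    for i in range(n):
        predicted_i = sum(matrix[r][i] for r in range(n))  # column sum
        actual_i = sum(matrix[i])                          # row sum
        precision.append(matrix[i][i] / predicted_i if predicted_i else 0.0)
        recall.append(matrix[i][i] / actual_i if actual_i else 0.0)
    return accuracy, precision, recall


# Made-up 3x3 matrix; rows and columns ordered BAD, GOOD, NEUTRAL
toy_matrix = [
    [30, 10, 10],
    [5, 80, 15],
    [5, 10, 35],
]
accuracy, precision, recall = summarize(toy_matrix)
print(round(accuracy, 3))
print([round(p, 3) for p in precision])
print([round(r, 3) for r in recall])
```

Reading precision and recall per class shows which experience category a model confuses most, which a single accuracy figure hides.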
The Naive Bayes classifier performs reasonably well: it classifies 73% of the test reviews correctly.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(criterion='gini')
rf.fit(x_train, y_train)
pred_rf = rf.predict(x_test)
print("Confusion Matrix\n",confusion_matrix(y_test, pred_rf))
print("\n")
print("Classification report\n",classification_report(y_test, pred_rf))
From the classification report for Random forest classifier we find that the model accuracy is 65%.
Words used in the reviews can serve as features to categorize the customer experience as GOOD, BAD or NEUTRAL.
The Naive Bayes classifier gives the higher accuracy, classifying 73% of the test instances correctly.
The Random Forest classifier is less accurate, at between 65% and 68%; the forest is trained without a fixed random_state, so its accuracy can vary from run to run.