I want to bring in the Python script and process that used the scikit-learn, Keras, and TensorFlow packages to show the results of machine learning with multinomial Naive Bayes. There is a package in R, reticulate, that allows the same work done in Python packages to be communicated into R. Let's try it out; it's new to me and sounds really cool. But everything sounds really cool before it is tested and demonstrated. So let's see how well it works, based on the reticulate cheatsheet found in the RStudio Help menu under Cheatsheets.
The Python packages were sklearn, matplotlib, pandas, numpy, nltk, TextBlob, and regex. Later versions of some modules also work; for instance, Python's built-in `re` module can stand in for the third-party `regex` package under my Python version, 3.6 (a quick version check is sketched after the imports below).
# knitr::knit_engines$set(python = reticulate::eng_python)
library(reticulate)
## Warning: package 'reticulate' was built under R version 3.6.3
conda_list(conda = "auto")
## name python
## 1 Anaconda2 C:\\Users\\m\\Anaconda2\\python.exe
## 2 djangoenv C:\\Users\\m\\Anaconda2\\envs\\djangoenv\\python.exe
## 3 python36 C:\\Users\\m\\Anaconda2\\envs\\python36\\python.exe
## 4 python37 C:\\Users\\m\\Anaconda2\\envs\\python37\\python.exe
## 5 r-reticulate C:\\Users\\m\\Anaconda2\\envs\\r-reticulate\\python.exe
I have my Python IDE, Anaconda, open in the console and mostly use the python36 environment; more importantly, it is the environment used for the testing that was done on NLP with multinomial Naive Bayes to classify the five rating categories per review. The listing above shows those environments in conda.
use_condaenv(condaenv = "python36")
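As a quick sanity check (a minimal sketch using reticulate's own py_config(), not something run in the original chunk), it is worth confirming which interpreter reticulate actually bound to:
# Report the Python interpreter, conda environment, and numpy that reticulate found
py_config()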
import pandas as pd
import matplotlib.pyplot as plt
from textblob import TextBlob
import sklearn
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
np.random.seed(47)
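As mentioned above, package versions matter; a quick way to confirm which Python and package versions are actually on the path is sketched below (the exact version strings will vary by install):
import sys
print(sys.version)                  # 3.6.x in this setup
print('pandas', pd.__version__)
print('numpy', np.__version__)
print('sklearn', sklearn.__version__)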
reviews = pd.read_csv('cleanedRegexReviews13.csv', encoding = 'unicode_escape')
print(reviews.head())
## userReviewSeries ... userCheckIns
## 0 mostRecentVisit_review ... NaN
## 1 mostRecentVisit_review ... NaN
## 2 mostRecentVisit_review ... NaN
## 3 mostRecentVisit_review ... NaN
## 4 mostRecentVisit_review ... NaN
##
## [5 rows x 18 columns]
print(reviews.tail())
## userReviewSeries ... userCheckIns
## 609 mostRecentVisit_review ... 1.0
## 610 mostRecentVisit_review ... 1.0
## 611 mostRecentVisit_review ... 1.0
## 612 mostRecentVisit_review ... 1.0
## 613 mostRecentVisit_review ... NaN
##
## [5 rows x 18 columns]
print(reviews.shape)
## (614, 18)
import regex
def preprocessor(text):
    text = regex.sub('<[^>]*>', '', text)
    emoticons = regex.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = regex.sub('[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text
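Since the introduction notes that the built-in `re` module can stand in for `regex`, here is a minimal sketch of the same preprocessor written with `re` (an equivalent, not what was actually run; both modules support these patterns):
import re

def preprocessor_re(text):
    # same steps: drop HTML tags, pull out emoticons, lowercase, collapse non-word characters
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub('[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text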
reviews.tail()
## userReviewSeries ... userCheckIns
## 609 mostRecentVisit_review ... 1.0
## 610 mostRecentVisit_review ... 1.0
## 611 mostRecentVisit_review ... 1.0
## 612 mostRecentVisit_review ... 1.0
## 613 mostRecentVisit_review ... NaN
##
## [5 rows x 18 columns]
import numpy as np
reviews = reviews.reindex(np.random.permutation(reviews.index))
print(reviews.head())
## userReviewSeries ... userCheckIns
## 551 mostRecentVisit_review ... NaN
## 340 mostRecentVisit_review ... NaN
## 474 lastVisit_review ... NaN
## 7 mostRecentVisit_review ... 1.0
## 239 mostRecentVisit_review ... NaN
##
## [5 rows x 18 columns]
print(reviews.tail())
## userReviewSeries ... userCheckIns
## 23 mostRecentVisit_review ... NaN
## 584 mostRecentVisit_review ... 1.0
## 264 mostRecentVisit_review ... 6.0
## 327 mostRecentVisit_review ... NaN
## 135 mostRecentVisit_review ... NaN
##
## [5 rows x 18 columns]
reviews.groupby('userRatingValue').describe()
## friends ... userCheckIns
## count mean std min ... 25% 50% 75% max
## userRatingValue ...
## 1 81.0 85.370370 133.524103 0.0 ... 1.0 1.5 2.00 3.0
## 2 31.0 149.967742 152.750010 0.0 ... 1.0 1.0 2.00 3.0
## 3 52.0 275.461538 700.341862 0.0 ... 1.0 2.0 2.75 22.0
## 4 101.0 288.841584 493.898000 0.0 ... 1.0 1.0 2.25 45.0
## 5 308.0 122.746753 329.574151 0.0 ... 1.0 1.0 3.00 41.0
##
## [5 rows x 40 columns]
reviews.groupby('businessType').describe()
## userRatingValue ... userCheckIns
## count mean std ... 50% 75% max
## businessType ...
## chiropractic 233.0 4.686695 0.956216 ... 1.0 3.0 43.0
## grocery store 136.0 3.779412 1.484194 ... 1.0 5.5 45.0
## high end massage retreat 245.0 3.261224 1.511271 ... 1.0 2.0 4.0
##
## [3 rows x 48 columns]
reviews['length'] = reviews['userReviewOnlyContent'].map(lambda text: len(text))
print(reviews.head())
## userReviewSeries ... length
## 551 mostRecentVisit_review ... 112
## 340 mostRecentVisit_review ... 750
## 474 lastVisit_review ... 2972
## 7 mostRecentVisit_review ... 210
## 239 mostRecentVisit_review ... 213
##
## [5 rows x 19 columns]
# %matplotlib inline
reviews.length.plot(bins=20, kind='hist')
plt.show()
reviews.length.describe()
## count 614.000000
## mean 626.206840
## std 588.507777
## min 36.000000
## 25% 249.000000
## 50% 433.500000
## 75% 785.750000
## max 3489.000000
## Name: length, dtype: float64
print(list(reviews.userReviewOnlyContent[reviews.length > 630].index))
## [340, 474, 107, 319, 460, 75, 157, 417, 331, 214, 182, 581, 119, 110, 100, 390, 440, 360, 483, 556, 528, 427, 12, 410, 559, 587, 68, 248, 1, 414, 463, 220, 385, 371, 426, 547, 146, 336, 301, 407, 304, 415, 431, 386, 17, 328, 121, 513, 314, 24, 502, 222, 291, 462, 158, 217, 531, 313, 352, 320, 375, 393, 469, 347, 424, 508, 439, 312, 381, 270, 302, 236, 120, 583, 112, 269, 242, 452, 34, 329, 298, 20, 41, 409, 349, 325, 364, 365, 296, 613, 495, 344, 438, 464, 315, 316, 299, 401, 191, 434, 419, 392, 317, 272, 282, 592, 138, 377, 330, 335, 358, 404, 149, 459, 466, 601, 318, 45, 49, 376, 444, 505, 309, 0, 78, 86, 83, 527, 480, 193, 22, 526, 521, 455, 26, 485, 348, 279, 307, 337, 332, 604, 451, 94, 412, 246, 98, 189, 356, 97, 67, 229, 333, 267, 156, 475, 341, 373, 537, 372, 277, 310, 210, 355, 430, 402, 262, 465, 476, 391, 535, 382, 238, 201, 380, 369, 366, 418, 44, 305, 406, 442, 354, 489, 374, 573, 30, 397, 416, 306, 225, 195, 324, 205, 170, 458, 21, 223, 578, 379, 23, 327]
print(list(reviews.userRatingValue[reviews.length > 630]))
## [1, 1, 5, 1, 2, 5, 1, 5, 5, 1, 5, 5, 4, 4, 1, 1, 3, 4, 5, 5, 2, 1, 4, 4, 1, 5, 5, 4, 4, 5, 1, 5, 4, 2, 2, 3, 5, 1, 5, 4, 4, 5, 5, 2, 3, 5, 5, 5, 3, 5, 1, 1, 3, 1, 5, 4, 4, 3, 4, 5, 1, 2, 3, 1, 5, 4, 4, 3, 5, 3, 3, 1, 5, 5, 1, 5, 5, 1, 3, 1, 3, 5, 5, 1, 2, 5, 3, 2, 5, 5, 4, 4, 4, 2, 3, 2, 5, 3, 5, 2, 4, 3, 2, 4, 4, 4, 5, 5, 3, 4, 4, 5, 2, 1, 4, 5, 5, 4, 5, 2, 5, 2, 1, 5, 4, 5, 5, 1, 4, 1, 1, 1, 5, 3, 1, 5, 2, 1, 2, 5, 5, 5, 2, 5, 3, 3, 5, 2, 3, 5, 1, 4, 4, 5, 1, 1, 3, 1, 5, 5, 5, 4, 5, 2, 4, 1, 5, 2, 2, 2, 4, 4, 5, 1, 5, 1, 5, 4, 5, 4, 3, 1, 5, 1, 3, 5, 5, 5, 5, 5, 1, 1, 4, 1, 5, 4, 5, 4, 3, 3, 4, 4]
reviews.hist(column='length', by='userRatingValue', bins=10)
plt.show()
def split_into_tokens(review):
    # review = unicode(review, 'iso-8859-1')  # not needed in Python 3: str is already Unicode, unlike Python 2's str/unicode split
    return TextBlob(review).words
reviews.userReviewOnlyContent.head().apply(split_into_tokens)
## 551 [Still, no, update, by, this, facility, do, n'...
## 340 [It, 's, a, pretty, cool, nice, place, from, w...
## 474 [Imagine, planning, a, family, event, for, the...
## 7 [has, been, treating, myself, family, and, fri...
## 239 [Love, the, deli, department, cheap, fast, foo...
## Name: userReviewOnlyContent, dtype: object
TextBlob("hello world, how is it going?").tags # list of (word, POS) pairs
## [('hello', 'JJ'), ('world', 'NN'), ('how', 'WRB'), ('is', 'VBZ'), ('it', 'PRP'), ('going', 'VBG')]
import nltk
nltk.download('stopwords')
## True
##
## [nltk_data] Downloading package stopwords to
## [nltk_data] C:\Users\m\AppData\Roaming\nltk_data...
## [nltk_data] Package stopwords is already up-to-date!
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop = stop + [u'a',u'b',u'c',u'd',u'e',u'f',u'g',u'h',u'i',u'j',u'k',u'l',u'm',u'n',u'o',u'p',u'q',u'r',u's',u't',u'v',u'w',u'x',u'y',u'z']
def split_into_lemmas(review):
    # review = unicode(review, 'iso-8859-1')
    review = review.lower()
    # review = unicode(review, 'utf8').lower()
    # review = str(review).lower()
    words = TextBlob(review).words
    # for each word, take its "base form" = lemma
    return [word.lemma for word in words if word not in stop]
reviews.userReviewOnlyContent.head().apply(split_into_lemmas)
## 551 [still, update, facility, n't, think, 'll, eve...
## 340 ['s, pretty, cool, nice, place, tell, next, mo...
## 474 [imagine, planning, family, event, last, three...
## 7 [treating, family, friend, many, year, drive, ...
## 239 [love, deli, department, cheap, fast, food, st...
## Name: userReviewOnlyContent, dtype: object
bow_transformer = CountVectorizer(analyzer=split_into_lemmas).fit(reviews['userReviewOnlyContent'])
print(len(bow_transformer.vocabulary_))
## 4547
review4 = reviews['userReviewOnlyContent'][42]
print(review4)
## Love this place! I had never been to a chiropractor before and was definitely scared but I tried this place out because I had heard great things and it was even better than I anticipated. The whole staff is super efficient and organized. Dr. Brian Heller was super friendly and helped ease the neck pain I was having before.
##
## On top of that, the first appointment which includes X-rays, a consultation and the first adjustment was only $40! Great price and an overall awesome experience. I plan to come here regularly now.
bow4 = bow_transformer.transform([review4])
print(bow4)
## (0, 106) 1
## (0, 212) 1
## (0, 335) 1
## (0, 363) 1
## (0, 459) 1
## (0, 571) 1
## (0, 663) 1
## (0, 854) 1
## (0, 945) 1
## (0, 1013) 1
## (0, 1185) 1
## (0, 1330) 1
## (0, 1374) 1
## (0, 1389) 1
## (0, 1465) 1
## (0, 1515) 1
## (0, 1620) 2
## (0, 1709) 1
## (0, 1813) 2
## (0, 1908) 1
## (0, 1925) 1
## (0, 1929) 1
## (0, 2076) 1
## (0, 2398) 1
## (0, 2650) 1
## (0, 2665) 1
## (0, 2784) 1
## (0, 2799) 1
## (0, 2833) 1
## (0, 2944) 2
## (0, 2947) 1
## (0, 3048) 1
## (0, 3243) 1
## (0, 3453) 1
## (0, 3802) 1
## (0, 3922) 2
## (0, 4052) 1
## (0, 4121) 1
## (0, 4167) 1
## (0, 4441) 1
## (0, 4502) 1
reviews_bow = bow_transformer.transform(reviews['userReviewOnlyContent'])
print('sparse matrix shape:', reviews_bow.shape)
## sparse matrix shape: (614, 4547)
print('number of non-zeros:', reviews_bow.nnz)
## number of non-zeros: 29971
print('sparsity: %.2f%%' % (100.0 * reviews_bow.nnz / (reviews_bow.shape[0] * reviews_bow.shape[1])))
## sparsity: 1.07%
Indexing is different in Python compared to R. Python starts at zero, and when slicing, the stop value is excluded, so a slice runs only up to (not including) that value. That makes splitting easy: `[:491]` takes rows 0 through 490, and `[491:]` starts at row 491 and, with the stop left empty, runs through the last row.
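A toy illustration of those slice semantics on a plain Python list (nothing from the review data):
x = [10, 20, 30, 40, 50]
print(x[0])    # zero-based: the first element is index 0 -> 10
print(x[:3])   # stop value excluded: indices 0, 1, 2 -> [10, 20, 30]
print(x[3:])   # from index 3 through the end -> [40, 50]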
# Split/splice into training ~ 80% and testing ~ 20%
reviews_bow_train = reviews_bow[:491]
reviews_bow_test = reviews_bow[491:]
reviews_sentiment_train = reviews['userRatingValue'][:491]
reviews_sentiment_test = reviews['userRatingValue'][491:]
print(reviews_bow_train.shape)
## (491, 4547)
print(reviews_bow_test.shape)
## (123, 4547)
review_sentiment = MultinomialNB().fit(reviews_bow_train, reviews_sentiment_train)
print('predicted:', review_sentiment.predict(bow4)[0])
## predicted: 5
print('expected:', reviews.userRatingValue[42])
## expected: 5
predictions = review_sentiment.predict(reviews_bow_test)
print(predictions)
## [5 4 2 4 5 1 5 5 4 1 4 5 5 4 5 5 1 3 5 4 5 4 5 5 1 4 4 5 5 5 5 5 5 4 5 5 4
## 5 5 4 5 1 5 5 3 4 1 4 5 5 5 4 5 2 5 5 5 5 5 5 5 5 5 5 5 4 4 4 4 5 4 1 1 1
## 5 5 5 5 3 5 5 5 1 5 5 1 4 5 4 5 5 4 5 5 4 5 5 5 5 5 1 4 5 5 4 4 1 3 5 4 5
## 5 5 4 3 5 5 5 4 5 4 4 5]
print('accuracy', accuracy_score(reviews_sentiment_test, predictions))
## accuracy 0.7235772357723578
print('confusion matrix\n', confusion_matrix(reviews_sentiment_test, predictions))
## confusion matrix
## [[10 0 0 2 2]
## [ 1 1 0 3 0]
## [ 1 0 1 4 2]
## [ 0 1 3 15 5]
## [ 1 0 1 8 62]]
print('(row=expected, col=predicted)')
## (row=expected, col=predicted)
This model reached 72% accuracy using multinomial Naive Bayes. In the confusion matrix above (rows = expected ratings 1 through 5, columns = predicted), 10 true 1s were correctly predicted, while two 1s were misclassified as 4s and two as 5s. Likewise, 62 true 5s were correctly predicted, but 8 were misclassified as 4s, one as a 3, and one as a 1.
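The reported accuracy can be checked by hand from the confusion matrix, since correct predictions sit on the diagonal (this recomputation just reuses the numbers printed above):
cm = np.array([[10, 0, 0,  2,  2],
               [ 1, 1, 0,  3,  0],
               [ 1, 0, 1,  4,  2],
               [ 0, 1, 3, 15,  5],
               [ 1, 0, 1,  8, 62]])
print(cm.trace() / cm.sum())  # 89 correct out of 123 ~ 0.7236, matching accuracy_score above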
print(classification_report(reviews_sentiment_test, predictions))
## precision recall f1-score support
##
## 1 0.77 0.71 0.74 14
## 2 0.50 0.20 0.29 5
## 3 0.20 0.12 0.15 8
## 4 0.47 0.62 0.54 24
## 5 0.87 0.86 0.87 72
##
## accuracy 0.72 123
## macro avg 0.56 0.51 0.52 123
## weighted avg 0.72 0.72 0.72 123
From the above, precision accounts for type I errors (false positives: real negatives classified as positives), with precision = TP/(TP+FP), while recall accounts for type II errors (false negatives: real positives classified as negatives), with recall = TP/(TP+FN). The 5 and 1 ratings had higher recall and precision than the 2 through 4 ratings.
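As a worked check of those formulas against the classification report, here are precision and recall for the 5 rating, read straight from the confusion matrix (column 5 minus the diagonal gives the false positives, row 5 minus the diagonal gives the false negatives):
tp = 62                  # true 5s predicted as 5
fp = 2 + 0 + 2 + 5       # other ratings predicted as 5
fn = 1 + 0 + 1 + 8       # true 5s predicted as something else
print('precision:', round(tp / (tp + fp), 2))  # 0.87, as in the report
print('recall:   ', round(tp / (tp + fn), 2))  # 0.86, as in the report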
def predict_review(new_review):
    new_sample = bow_transformer.transform([new_review])
    pr = np.around(review_sentiment.predict_proba(new_sample), 2)
    print(new_review, '\n\n', pr)
    if (pr[0][0] == max(pr[0])):
        print('The max probability is 1 for this review with ', pr[0][0]*100, '%')
    elif (pr[0][1] == max(pr[0])):
        print('The max probability is 2 for this review with ', pr[0][1]*100, '%')
    elif (pr[0][2] == max(pr[0])):
        print('The max probability is 3 for this review with ', pr[0][2]*100, '%')
    elif (pr[0][3] == max(pr[0])):
        print('The max probability is 4 for this review with ', pr[0][3]*100, '%')
    else:
        print('The max probability is 5 for this review with ', pr[0][4]*100, '%')
    print('-----------------------------------------\n\n')
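The if/elif chain above assumes the probability columns line up with ratings 1 through 5. That ordering comes from the fitted classifier's classes_ attribute (scikit-learn stores the labels sorted), so the same decision can be written more compactly; predict_review_argmax below is just an illustrative alternative, not what was run:
print(review_sentiment.classes_)  # the sorted labels: 1, 2, 3, 4, 5

def predict_review_argmax(new_review):
    # same idea as predict_review, but pick the winning rating with argmax
    probs = review_sentiment.predict_proba(bow_transformer.transform([new_review]))[0]
    best = review_sentiment.classes_[np.argmax(probs)]
    print('The max probability is', best, 'for this review with', round(probs.max() * 100, 2), '%')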
reviews.userRatingValue.unique()
## array([1, 5, 4, 2, 3], dtype=int64)
predict_review('great place. loved it. returning soon.')
## great place. loved it. returning soon.
##
## [[0.01 0. 0.01 0.05 0.92]]
## The max probability is 5 for this review with 92.0 %
## -----------------------------------------
predict_review('i\'ve been going here for years, and never again, worst place ever.')
## i've been going here for years, and never again, worst place ever.
##
## [[0.1 0. 0. 0. 0.9]]
## The max probability is 5 for this review with 90.0 %
## -----------------------------------------
predict_review('way too over priced. had better')
## way too over priced. had better
##
## [[0.02 0.01 0. 0.08 0.88]]
## The max probability is 5 for this review with 88.0 %
## -----------------------------------------
predict_review('wonderful. perfect. loved anaconda.')
## wonderful. perfect. loved anaconda.
##
## [[0.01 0.01 0. 0.16 0.81]]
## The max probability is 5 for this review with 81.0 %
## -----------------------------------------
In the above, the second review reads like a low review, yet the algorithm predicted it as a 5 rather than a 1 through 3. It assigned only a 10% probability to a 1 rating.
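One way to dig into why a clearly negative review still scores a 5 is to inspect the per-class (log) likelihoods the model learned for individual words; feature_log_prob_ is the fitted MultinomialNB attribute, and the words chosen here are just illustrative:
vocab = bow_transformer.vocabulary_
for word in ['worst', 'never']:
    if word in vocab:
        # one log probability per rating class (rows follow review_sentiment.classes_)
        print(word, review_sentiment.feature_log_prob_[:, vocab[word]])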
predict_review('can never get an appointment. Still waiting. ')
## can never get an appointment. Still waiting.
##
## [[0.25 0.03 0.01 0.08 0.63]]
## The max probability is 5 for this review with 63.0 %
## -----------------------------------------
predict_review("don't waste your time or money here.")
## don't waste your time or money here.
##
## [[0.57 0.09 0.09 0.15 0.09]]
## The max probability is 1 for this review with 56.99999999999999 %
## -----------------------------------------
The above shows that this review, passed into the function, was predicted to be a 1 rating with 57% probability; the next best was a 4 rating with 15%.
predict_review('love this place better than others')
## love this place better than others
##
## [[0. 0. 0. 0.01 0.98]]
## The max probability is 5 for this review with 98.0 %
## -----------------------------------------
predict_review('''OMG! the best! a hidden gem.
The prices are affordable. ''')
## OMG! the best! a hidden gem.
## The prices are affordable.
##
## [[0. 0. 0. 0.05 0.95]]
## The max probability is 5 for this review with 95.0 %
## -----------------------------------------
predict_review('''OMG! I am in so much pain. Sale on the massages. I want to go here regularly. ''')
## OMG! I am in so much pain. Sale on the massages. I want to go here regularly.
##
## [[0. 0. 0. 0. 1.]]
## The max probability is 5 for this review with 100.0 %
## -----------------------------------------
When knitting with python36 open in the Anaconda prompt window, the matplotlib graphs above threw an error and halted knitr with the message '...could not find or load the Qt platform plugin...' for Windows. Checking online, on Stack Overflow, I found one suggestion to run the following:
$ conda env remove -n r-reticulate
$ conda create -n r-reticulate python=3
$ source activate r-reticulate
$ python -m pip install matplotlib
$ Rscript -e "library(knitr); knit('eng-reticulate-example.Rmd')"
in the Anaconda prompt. I started at the second command and adjusted it to python=3.6. Anaconda updated some packages. This actually created a new environment called 'r-reticulate'.
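After the new environment exists, the natural follow-up (a sketch of how I would wire it back into the Rmd, assuming the default environment name) is to point reticulate at it before any Python chunk runs:
library(reticulate)
use_condaenv("r-reticulate", required = TRUE)  # fail early if the environment is missing
py_config()                                    # confirm the matplotlib-capable interpreter is the one in use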