I want to bring in the Python script and process that used the scikit-learn, Keras, and TensorFlow packages to show the results of machine learning using multinomial naive Bayes. There is an R package, reticulate, that lets work done with Python packages run inside R. Let's try it out; it's new to me and sounds really cool. But everything sounds really cool before it is tested and demonstrated. So, let's see how well it works based on the reticulate cheatsheet found in the RStudio Help menu under Cheatsheets.

The Python packages were sklearn, matplotlib, pandas, numpy, nltk, TextBlob, and regex. Some of these have alternatives that also work; for instance, the re module that is built into Python was used in place of regex for my version of Python, 3.6.

# knitr::knit_engines$set(python = reticulate::eng_python)

library(reticulate)
## Warning: package 'reticulate' was built under R version 3.6.3
conda_list(conda = "auto") 
##           name                                                  python
## 1    Anaconda2                     C:\\Users\\m\\Anaconda2\\python.exe
## 2    djangoenv    C:\\Users\\m\\Anaconda2\\envs\\djangoenv\\python.exe
## 3     python36     C:\\Users\\m\\Anaconda2\\envs\\python36\\python.exe
## 4     python37     C:\\Users\\m\\Anaconda2\\envs\\python37\\python.exe
## 5 r-reticulate C:\\Users\\m\\Anaconda2\\envs\\r-reticulate\\python.exe

I have Anaconda, my Python distribution, open in the console and mostly use the python36 environment. More importantly, that is the environment used for the NLP testing that classified each review into one of 5 rating categories with multinomial naive Bayes. The output above lists the environments in conda.
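Because several environments are listed, it can help to confirm from a Python chunk which interpreter is actually running. This is a minimal sketch using only the standard library; the variable names are my own and nothing in it is specific to reticulate:

```python
import sys

# Path of the interpreter executing this chunk; with the python36
# environment selected it should point inside that environment's folder
interpreter = sys.executable

# Major.minor version string, for example "3.6" under python36
version = "{}.{}".format(sys.version_info[0], sys.version_info[1])

print(interpreter)
print(version)
```

If the printed path belongs to a different environment than expected, the packages imported later may not be the ones that were installed for this analysis.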

use_condaenv(condaenv = "python36")
import pandas as pd 
import matplotlib.pyplot as plt 
from textblob import TextBlob 
import sklearn 
import numpy as np 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.naive_bayes import MultinomialNB 
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix 
 
np.random.seed(47) 
reviews = pd.read_csv('cleanedRegexReviews13.csv', encoding = 'unicode_escape') 
print(reviews.head())
##          userReviewSeries  ... userCheckIns
## 0  mostRecentVisit_review  ...          NaN
## 1  mostRecentVisit_review  ...          NaN
## 2  mostRecentVisit_review  ...          NaN
## 3  mostRecentVisit_review  ...          NaN
## 4  mostRecentVisit_review  ...          NaN
## 
## [5 rows x 18 columns]
print(reviews.tail())
##            userReviewSeries  ... userCheckIns
## 609  mostRecentVisit_review  ...          1.0
## 610  mostRecentVisit_review  ...          1.0
## 611  mostRecentVisit_review  ...          1.0
## 612  mostRecentVisit_review  ...          1.0
## 613  mostRecentVisit_review  ...          NaN
## 
## [5 rows x 18 columns]
print(reviews.shape)
## (614, 18)
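import regex
def preprocessor(text):
    # Strip HTML tags
    text = regex.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = regex.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word characters to spaces, then append the emoticons without the '-' nose
    text = regex.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text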
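To see what the preprocessor does, here is a sketch of the same three steps written with the built-in re module and run on a made-up string. The sample text and the name preprocessor_re are my own for illustration and are not from the reviews data:

```python
import re

def preprocessor_re(text):
    # 1. Remove anything that looks like an HTML tag
    text = re.sub(r'<[^>]*>', '', text)
    # 2. Remember emoticons before the punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # 3. Lowercase, turn runs of non-word characters into one space,
    #    then append the emoticons with any '-' removed
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text

sample = "<p>Great place :) would return ;-(</p>"
cleaned = preprocessor_re(sample)
print(cleaned)  # great place would return :) ;(
```

The emoticons survive at the end of the string, so they can still count as tokens after the punctuation is gone.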
reviews.tail()
##            userReviewSeries  ... userCheckIns
## 609  mostRecentVisit_review  ...          1.0
## 610  mostRecentVisit_review  ...          1.0
## 611  mostRecentVisit_review  ...          1.0
## 612  mostRecentVisit_review  ...          1.0
## 613  mostRecentVisit_review  ...          NaN
## 
## [5 rows x 18 columns]
import numpy as np

reviews = reviews.reindex(np.random.permutation(reviews.index))

print(reviews.head())
##            userReviewSeries  ... userCheckIns
## 551  mostRecentVisit_review  ...          NaN
## 340  mostRecentVisit_review  ...          NaN
## 474        lastVisit_review  ...          NaN
## 7    mostRecentVisit_review  ...          1.0
## 239  mostRecentVisit_review  ...          NaN
## 
## [5 rows x 18 columns]
print(reviews.tail())
##            userReviewSeries  ... userCheckIns
## 23   mostRecentVisit_review  ...          NaN
## 584  mostRecentVisit_review  ...          1.0
## 264  mostRecentVisit_review  ...          6.0
## 327  mostRecentVisit_review  ...          NaN
## 135  mostRecentVisit_review  ...          NaN
## 
## [5 rows x 18 columns]
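The shuffle above reorders the rows but keeps each row's original index label, which is why the head and tail now show labels like 551 and 135. A small sketch of the same reindex pattern on a toy frame (the toy data and names are hypothetical, and a local RandomState is used here instead of the global seed):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the reviews data (hypothetical values)
toy = pd.DataFrame({"rating": [5, 1, 4, 3, 2]})

# Shuffle the index labels, then reorder the rows to match
rng = np.random.RandomState(47)
shuffled = toy.reindex(rng.permutation(toy.index))

# Same rows as before, only the order differs
same_rows = sorted(shuffled.index) == list(toy.index)

# Label-based lookup still finds the original row, wherever it landed
row_two = shuffled.loc[2, "rating"]
print(same_rows, row_two)
```

This is also why a later lookup by label, such as row 42, returns the same review before and after shuffling.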
reviews.groupby('userRatingValue').describe()
##                 friends                               ... userCheckIns                 
##                   count        mean         std  min  ...          25%  50%   75%   max
## userRatingValue                                       ...                              
## 1                  81.0   85.370370  133.524103  0.0  ...          1.0  1.5  2.00   3.0
## 2                  31.0  149.967742  152.750010  0.0  ...          1.0  1.0  2.00   3.0
## 3                  52.0  275.461538  700.341862  0.0  ...          1.0  2.0  2.75  22.0
## 4                 101.0  288.841584  493.898000  0.0  ...          1.0  1.0  2.25  45.0
## 5                 308.0  122.746753  329.574151  0.0  ...          1.0  1.0  3.00  41.0
## 
## [5 rows x 40 columns]
reviews.groupby('businessType').describe()
##                          userRatingValue                      ... userCheckIns           
##                                    count      mean       std  ...          50%  75%   max
## businessType                                                  ...                        
## chiropractic                       233.0  4.686695  0.956216  ...          1.0  3.0  43.0
## grocery store                      136.0  3.779412  1.484194  ...          1.0  5.5  45.0
## high end massage retreat           245.0  3.261224  1.511271  ...          1.0  2.0   4.0
## 
## [3 rows x 48 columns]
reviews['length'] = reviews['userReviewOnlyContent'].map(lambda text: len(text))
print(reviews.head())
##            userReviewSeries  ... length
## 551  mostRecentVisit_review  ...    112
## 340  mostRecentVisit_review  ...    750
## 474        lastVisit_review  ...   2972
## 7    mostRecentVisit_review  ...    210
## 239  mostRecentVisit_review  ...    213
## 
## [5 rows x 19 columns]
# %matplotlib inline 
reviews.length.plot(bins=20, kind='hist')
plt.show()

reviews.length.describe()
## count     614.000000
## mean      626.206840
## std       588.507777
## min        36.000000
## 25%       249.000000
## 50%       433.500000
## 75%       785.750000
## max      3489.000000
## Name: length, dtype: float64
print(list(reviews.userReviewOnlyContent[reviews.length > 630].index))
## [340, 474, 107, 319, 460, 75, 157, 417, 331, 214, 182, 581, 119, 110, 100, 390, 440, 360, 483, 556, 528, 427, 12, 410, 559, 587, 68, 248, 1, 414, 463, 220, 385, 371, 426, 547, 146, 336, 301, 407, 304, 415, 431, 386, 17, 328, 121, 513, 314, 24, 502, 222, 291, 462, 158, 217, 531, 313, 352, 320, 375, 393, 469, 347, 424, 508, 439, 312, 381, 270, 302, 236, 120, 583, 112, 269, 242, 452, 34, 329, 298, 20, 41, 409, 349, 325, 364, 365, 296, 613, 495, 344, 438, 464, 315, 316, 299, 401, 191, 434, 419, 392, 317, 272, 282, 592, 138, 377, 330, 335, 358, 404, 149, 459, 466, 601, 318, 45, 49, 376, 444, 505, 309, 0, 78, 86, 83, 527, 480, 193, 22, 526, 521, 455, 26, 485, 348, 279, 307, 337, 332, 604, 451, 94, 412, 246, 98, 189, 356, 97, 67, 229, 333, 267, 156, 475, 341, 373, 537, 372, 277, 310, 210, 355, 430, 402, 262, 465, 476, 391, 535, 382, 238, 201, 380, 369, 366, 418, 44, 305, 406, 442, 354, 489, 374, 573, 30, 397, 416, 306, 225, 195, 324, 205, 170, 458, 21, 223, 578, 379, 23, 327]
print(list(reviews.userRatingValue[reviews.length > 630]))
## [1, 1, 5, 1, 2, 5, 1, 5, 5, 1, 5, 5, 4, 4, 1, 1, 3, 4, 5, 5, 2, 1, 4, 4, 1, 5, 5, 4, 4, 5, 1, 5, 4, 2, 2, 3, 5, 1, 5, 4, 4, 5, 5, 2, 3, 5, 5, 5, 3, 5, 1, 1, 3, 1, 5, 4, 4, 3, 4, 5, 1, 2, 3, 1, 5, 4, 4, 3, 5, 3, 3, 1, 5, 5, 1, 5, 5, 1, 3, 1, 3, 5, 5, 1, 2, 5, 3, 2, 5, 5, 4, 4, 4, 2, 3, 2, 5, 3, 5, 2, 4, 3, 2, 4, 4, 4, 5, 5, 3, 4, 4, 5, 2, 1, 4, 5, 5, 4, 5, 2, 5, 2, 1, 5, 4, 5, 5, 1, 4, 1, 1, 1, 5, 3, 1, 5, 2, 1, 2, 5, 5, 5, 2, 5, 3, 3, 5, 2, 3, 5, 1, 4, 4, 5, 1, 1, 3, 1, 5, 5, 5, 4, 5, 2, 4, 1, 5, 2, 2, 2, 4, 4, 5, 1, 5, 1, 5, 4, 5, 4, 3, 1, 5, 1, 3, 5, 5, 5, 5, 5, 1, 1, 4, 1, 5, 4, 5, 4, 3, 3, 4, 4]
reviews.hist(column='length', by='userRatingValue', bins=10)


plt.show()

def split_into_tokens(review):
    # Python 2 needed unicode(review, 'iso-8859-1') here; Python 3 strings are already Unicode
    return TextBlob(review).words
reviews.userReviewOnlyContent.head().apply(split_into_tokens)
## 551    [Still, no, update, by, this, facility, do, n'...
## 340    [It, 's, a, pretty, cool, nice, place, from, w...
## 474    [Imagine, planning, a, family, event, for, the...
## 7      [has, been, treating, myself, family, and, fri...
## 239    [Love, the, deli, department, cheap, fast, foo...
## Name: userReviewOnlyContent, dtype: object
TextBlob("hello world, how is it going?").tags  # list of (word, POS) pairs
## [('hello', 'JJ'), ('world', 'NN'), ('how', 'WRB'), ('is', 'VBZ'), ('it', 'PRP'), ('going', 'VBG')]
import nltk
nltk.download('stopwords')
## True
## 
## [nltk_data] Downloading package stopwords to
## [nltk_data]     C:\Users\m\AppData\Roaming\nltk_data...
## [nltk_data]   Package stopwords is already up-to-date!
from nltk.corpus import stopwords

stop = stopwords.words('english')
stop = stop + [u'a',u'b',u'c',u'd',u'e',u'f',u'g',u'h',u'i',u'j',u'k',u'l',u'm',u'n',u'o',u'p',u'q',u'r',u's',u't',u'v',u'w',u'x',u'y',u'z']
def split_into_lemmas(review):
    #review = unicode(review, 'iso-8859-1')
    review = review.lower()
    #review = unicode(review, 'utf8').lower()
    #review = str(review).lower()
    words = TextBlob(review).words
    # for each word, take its "base form" = lemma 
    return [word.lemma for word in words if word not in stop]

reviews.userReviewOnlyContent.head().apply(split_into_lemmas)
## 551    [still, update, facility, n't, think, 'll, eve...
## 340    ['s, pretty, cool, nice, place, tell, next, mo...
## 474    [imagine, planning, family, event, last, three...
## 7      [treating, family, friend, many, year, drive, ...
## 239    [love, deli, department, cheap, fast, food, st...
## Name: userReviewOnlyContent, dtype: object
bow_transformer = CountVectorizer(analyzer=split_into_lemmas).fit(reviews['userReviewOnlyContent'])
print(len(bow_transformer.vocabulary_))
## 4547
review4 = reviews['userReviewOnlyContent'][42]
print(review4)
##  Love this place! I had never been to a chiropractor before and was definitely scared but I tried this place out because I had heard great things and it was even better than I anticipated. The whole staff is super efficient and organized. Dr. Brian Heller was super friendly and helped ease the neck pain I was having before.
## 
## On top of that, the first appointment which includes X-rays, a consultation and the first adjustment was only $40! Great price and an overall awesome experience. I plan to come here regularly now.
bow4 = bow_transformer.transform([review4])
print(bow4)
##   (0, 106)   1
##   (0, 212)   1
##   (0, 335)   1
##   (0, 363)   1
##   (0, 459)   1
##   (0, 571)   1
##   (0, 663)   1
##   (0, 854)   1
##   (0, 945)   1
##   (0, 1013)  1
##   (0, 1185)  1
##   (0, 1330)  1
##   (0, 1374)  1
##   (0, 1389)  1
##   (0, 1465)  1
##   (0, 1515)  1
##   (0, 1620)  2
##   (0, 1709)  1
##   (0, 1813)  2
##   (0, 1908)  1
##   (0, 1925)  1
##   (0, 1929)  1
##   (0, 2076)  1
##   (0, 2398)  1
##   (0, 2650)  1
##   (0, 2665)  1
##   (0, 2784)  1
##   (0, 2799)  1
##   (0, 2833)  1
##   (0, 2944)  2
##   (0, 2947)  1
##   (0, 3048)  1
##   (0, 3243)  1
##   (0, 3453)  1
##   (0, 3802)  1
##   (0, 3922)  2
##   (0, 4052)  1
##   (0, 4121)  1
##   (0, 4167)  1
##   (0, 4441)  1
##   (0, 4502)  1
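Each printed pair above is (row, vocabulary index) followed by how often that token occurs in the review. The idea can be sketched in plain Python with a Counter; the two toy reviews and the helper to_bow are hypothetical, and the alphabetical index order is a choice made for this sketch:

```python
from collections import Counter

# Two made-up reviews standing in for the fitted corpus
docs = ["great place great staff", "worst place ever"]

# Vocabulary: every distinct token gets a column index
tokens = sorted({tok for doc in docs for tok in doc.split()})
vocab = {tok: idx for idx, tok in enumerate(tokens)}

def to_bow(doc):
    # Count the tokens, keep only those in the vocabulary,
    # and report them as (column index, count) pairs
    counts = Counter(doc.split())
    return sorted((vocab[tok], n) for tok, n in counts.items() if tok in vocab)

bow = to_bow(docs[0])
print(bow)  # [(1, 2), (2, 1), (3, 1)]
```

A count of 2, as in the pairs for indices 1620 or 1813 above, simply means the lemma appears twice in that review.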
reviews_bow = bow_transformer.transform(reviews['userReviewOnlyContent'])
print('sparse matrix shape:', reviews_bow.shape)
## sparse matrix shape: (614, 4547)
print('number of non-zeros:', reviews_bow.nnz)
## number of non-zeros: 29971
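# nnz / (rows * cols) is the share of non-zero cells (the density); the remaining cells, about 98.93%, are zeros
print('sparsity: %.2f%%' % (100.0 * reviews_bow.nnz / (reviews_bow.shape[0] * reviews_bow.shape[1])))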
## sparsity: 1.07%
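The 1.07% figure can be checked by hand from the shape and the non-zero count printed above. A short arithmetic sketch (the variable names are mine):

```python
rows, cols = 614, 4547   # sparse matrix shape reported above
nnz = 29971              # number of non-zero entries reported above

cells = rows * cols
density = 100.0 * nnz / cells   # percent of cells that hold a count
sparsity = 100.0 - density      # percent of cells that are zero

print(cells)                    # 2791858
print(round(density, 2))        # 1.07
print(round(sparsity, 2))       # 98.93
```

So the printed value is the share of filled cells; nearly 99% of the matrix is empty, which is why a sparse format is used.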

Indexing is different in Python compared to R. Python starts counting at zero, and a slice excludes its end value, so [:491] takes positions 0 through 490. The next slice, [491:], then starts at position 491, and the empty end means it runs through the last value, so the two slices do not overlap.
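A quick sketch of that slicing rule on a plain list of 614 positions, the same length as the reviews data (the list itself is only a stand-in):

```python
positions = list(range(614))   # stand-in for 614 shuffled rows

train = positions[:491]        # positions 0 .. 490, the end value is excluded
test = positions[491:]         # positions 491 .. 613, through the last one

print(len(train), len(test))   # 491 123
print(train[-1], test[0])      # 490 491
print(round(len(train) / len(positions), 2))  # 0.8
```

No position lands in both pieces, and none is dropped.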

# Split/slice into training ~ 80% and testing ~ 20%
reviews_bow_train = reviews_bow[:491]
reviews_bow_test = reviews_bow[491:]
reviews_sentiment_train = reviews['userRatingValue'][:491]
reviews_sentiment_test = reviews['userRatingValue'][491:]

print(reviews_bow_train.shape)
## (491, 4547)
print(reviews_bow_test.shape)
## (123, 4547)
review_sentiment = MultinomialNB().fit(reviews_bow_train, reviews_sentiment_train)
print('predicted:', review_sentiment.predict(bow4)[0])
## predicted: 5
print('expected:', reviews.userRatingValue[42])
## expected: 5
predictions = review_sentiment.predict(reviews_bow_test)
print(predictions)
## [5 4 2 4 5 1 5 5 4 1 4 5 5 4 5 5 1 3 5 4 5 4 5 5 1 4 4 5 5 5 5 5 5 4 5 5 4
##  5 5 4 5 1 5 5 3 4 1 4 5 5 5 4 5 2 5 5 5 5 5 5 5 5 5 5 5 4 4 4 4 5 4 1 1 1
##  5 5 5 5 3 5 5 5 1 5 5 1 4 5 4 5 5 4 5 5 4 5 5 5 5 5 1 4 5 5 4 4 1 3 5 4 5
##  5 5 4 3 5 5 5 4 5 4 4 5]
print('accuracy', accuracy_score(reviews_sentiment_test, predictions))
## accuracy 0.7235772357723578
print('confusion matrix\n', confusion_matrix(reviews_sentiment_test, predictions))
## confusion matrix
##  [[10  0  0  2  2]
##  [ 1  1  0  3  0]
##  [ 1  0  1  4  2]
##  [ 0  1  3 15  5]
##  [ 1  0  1  8 62]]
print('(row=expected, col=predicted)')
## (row=expected, col=predicted)

This model reached about 72% accuracy using multinomial naive Bayes. In the confusion matrix above, rows are the expected ratings 1 through 5 and columns are the predicted ratings. Ten reviews were correctly predicted as 1s, but one actual 2, one actual 3, and one actual 5 were falsely predicted as 1s, which are type 1 errors (false positives) for the 1 rating. Also, 62 5s were correctly predicted, but 8 5s were misclassified as 4s, one 5 as a 3, and another 5 as a 1.
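The accuracy can be recovered from the confusion matrix itself, since the correct predictions sit on the diagonal. A plain-Python sketch with the matrix copied from the output above:

```python
# Rows are expected ratings 1..5, columns are predicted ratings 1..5
cm = [
    [10, 0, 0,  2,  2],
    [ 1, 1, 0,  3,  0],
    [ 1, 0, 1,  4,  2],
    [ 0, 1, 3, 15,  5],
    [ 1, 0, 1,  8, 62],
]

correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal entries
total = sum(sum(row) for row in cm)              # all test reviews
accuracy = correct / total

print(correct, total)        # 89 123
print(round(accuracy, 4))    # 0.7236
```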

print(classification_report(reviews_sentiment_test, predictions))
##               precision    recall  f1-score   support
## 
##            1       0.77      0.71      0.74        14
##            2       0.50      0.20      0.29         5
##            3       0.20      0.12      0.15         8
##            4       0.47      0.62      0.54        24
##            5       0.87      0.86      0.87        72
## 
##     accuracy                           0.72       123
##    macro avg       0.56      0.51      0.52       123
## weighted avg       0.72      0.72      0.72       123

From the above, precision accounts for type 1 errors (real negatives classified as positives, or false positives: precision = TP/(TP+FP)), and recall accounts for type 2 errors (real positives classified as negatives, or false negatives: recall = TP/(TP+FN)). The 5 and 1 ratings had higher recall and precision than the 2-4 ratings.
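Those two formulas can be applied directly to the confusion matrix: for each rating, TP is the diagonal entry, TP+FP is the column total, and TP+FN is the row total. A sketch in plain Python, with the matrix copied from the earlier output and names of my own choosing:

```python
# Rows are expected ratings 1..5, columns are predicted ratings 1..5
cm = [
    [10, 0, 0,  2,  2],
    [ 1, 1, 0,  3,  0],
    [ 1, 0, 1,  4,  2],
    [ 0, 1, 3, 15,  5],
    [ 1, 0, 1,  8, 62],
]
n = len(cm)

precision, recall = [], []
for k in range(n):
    tp = cm[k][k]
    predicted_k = sum(cm[i][k] for i in range(n))  # column total: TP + FP
    actual_k = sum(cm[k])                          # row total: TP + FN
    precision.append(tp / predicted_k)
    recall.append(tp / actual_k)

for k in range(n):
    print(k + 1, "precision %.3f" % precision[k], "recall %.3f" % recall[k])
```

For the 1 rating this gives 10/13 for precision and 10/14 for recall, in line with the 0.77 and 0.71 in the report.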


def predict_review(new_review): 
    new_sample = bow_transformer.transform([new_review])
    pr = np.around(review_sentiment.predict_proba(new_sample),2)
    print(new_review,'\n\n', pr)
    
    # Pick the rating with the highest rounded probability; on a tie the lowest rating wins
    best = int(np.argmax(pr[0]))
    rating = review_sentiment.classes_[best]
    print('The max probability is', rating, 'for this review with ', pr[0][best]*100,'%')
    print('-----------------------------------------\n\n')
reviews.userRatingValue.unique()
## array([1, 5, 4, 2, 3], dtype=int64)
predict_review('great place. loved it. returning soon.')
## great place. loved it. returning soon. 
## 
##  [[0.01 0.   0.01 0.05 0.92]]
## The max probability is 5 for this review with  92.0 %
## -----------------------------------------
predict_review('i\'ve been going here for years, and never again, worst place ever.')
## i've been going here for years, and never again, worst place ever. 
## 
##  [[0.1 0.  0.  0.  0.9]]
## The max probability is 5 for this review with  90.0 %
## -----------------------------------------
predict_review('way too over priced. had better')
## way too over priced. had better 
## 
##  [[0.02 0.01 0.   0.08 0.88]]
## The max probability is 5 for this review with  88.0 %
## -----------------------------------------
predict_review('wonderful. perfect. loved anaconda.')
## wonderful. perfect. loved anaconda. 
## 
##  [[0.01 0.01 0.   0.16 0.81]]
## The max probability is 5 for this review with  81.0 %
## -----------------------------------------

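In the above, the second review reads as a low rating, yet the algorithm predicted a 5 instead of a 1-3. It gave the 1 rating only a 10% probability. One likely reason is that 5s dominate the data (72 of the 123 test reviews are 5s), so short reviews with few distinctive words tend to be pulled toward a 5.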
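To put the pull toward 5 in perspective, the model can be compared with a baseline that always predicts the most common rating. This sketch uses the support column of the classification report above; the baseline comparison is my addition and not part of the original script:

```python
# Test-set support per rating, from the classification report
support = {1: 14, 2: 5, 3: 8, 4: 24, 5: 72}

total = sum(support.values())
majority = max(support, key=support.get)   # most frequent rating
baseline = support[majority] / total       # accuracy of always guessing it

print(majority, total)       # 5 123
print(round(baseline, 3))    # 0.585
```

Always answering 5 would already be right about 58.5% of the time on this test set, so the model's 72% is an improvement, but a modest one.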

predict_review('can never get an appointment. Still waiting. ')
## can never get an appointment. Still waiting.  
## 
##  [[0.25 0.03 0.01 0.08 0.63]]
## The max probability is 5 for this review with  63.0 %
## -----------------------------------------
predict_review("don't waste your time or money here.")
## don't waste your time or money here. 
## 
##  [[0.57 0.09 0.09 0.15 0.09]]
## The max probability is 1 for this review with  56.99999999999999 %
## -----------------------------------------

The above shows that the function predicted this review to be a 1 rating with 57% probability, and the next best was a 4 rating with 15%.
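Reading the probability row by eye works for five ratings, but the ranking can also be made explicit. A small sketch using the probabilities printed above, assuming the columns are in rating order 1 through 5 as the function's output suggests:

```python
ratings = [1, 2, 3, 4, 5]
probs = [0.57, 0.09, 0.09, 0.15, 0.09]   # row printed for this review

# Pair each rating with its probability and sort, highest first
ranked = sorted(zip(ratings, probs), key=lambda pair: pair[1], reverse=True)
top, runner_up = ranked[0], ranked[1]

print(top)         # (1, 0.57)
print(runner_up)   # (4, 0.15)
```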

predict_review('love this place better than others')
## love this place better than others 
## 
##  [[0.   0.   0.   0.01 0.98]]
## The max probability is 5 for this review with  98.0 %
## -----------------------------------------
predict_review('''OMG! the best! a hidden gem. 
The prices are affordable. ''')
## OMG! the best! a hidden gem. 
## The prices are affordable.  
## 
##  [[0.   0.   0.   0.05 0.95]]
## The max probability is 5 for this review with  95.0 %
## -----------------------------------------
predict_review('''OMG! I am in so much pain. Sale on the massages. I want to go here regularly. ''')
## OMG! I am in so much pain. Sale on the massages. I want to go here regularly.  
## 
##  [[0. 0. 0. 0. 1.]]
## The max probability is 5 for this review with  100.0 %
## -----------------------------------------

When knitting with python36 open in the Anaconda prompt window, the matplotlib graphs above threw an error and halted knitr with the message ‘…could not find or load the Qt platform plugin …’ for Windows. Checking online, I found a Stack Overflow answer that suggested running:

$ conda env remove -n r-reticulate
$ conda create -n r-reticulate python=3
$ source activate r-reticulate
$ python -m pip install matplotlib
$ Rscript -e "library(knitr); knit('eng-reticulate-example.Rmd')"

in the Anaconda prompt.

I started at the second command and changed it to python=3.6. Anaconda updated some packages, and this created a new environment called ‘r-reticulate’.