Naive Bayes Vs Logistic Regression Vs Support Vector Machine: Python + Rstudio

1 Import python packages

import re
import nltk
import numpy as np
import pandas as pd
import scipy as sc
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.corpus import movie_reviews
from nltk.tokenize import word_tokenize,sent_tokenize
from sklearn.svm import SVC , LinearSVC , NuSVC
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.linear_model import LogisticRegression,SGDClassifier

2 Naive Bayes Theorem: Basic Idea in natural language processing (NLP) problems.

\[\begin{equation} \textbf{P(Tag}|\textbf{Sentence)} = \textbf{P(Tag)} \frac{\textbf{P(Sentence} |\textbf{Tag})}{\textbf{P(Sentence)}} \end{equation}\]

\[\mbox{Posterior}= \mbox{Likelihood}\frac{\mbox{Proposition prior probability}}{\mbox{Evidence prior probability}}\]

2.1 Bayes’ Theorem for Naive Bayes Algorithm

The basic idea behind using the Naive Bayes algorithm for a machine learning classification problem is as follows. Suppose that a business problem has multiple classes, say \(C_1, C_2, \ldots, C_h\). The Naive Bayes algorithm computes the conditional probability that an object with feature vector \((x_1, x_2,\ldots, x_m)\) belongs to a particular class \(C_i\):

\[\displaystyle P(C_i|x_1, x_2,\ldots, x_m)=\frac{P(x_1, x_2,\ldots, x_m|C_i).P(C_i)}{P(x_1, x_2,\ldots, x_m)}\]

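The main assumption of the Naive Bayes algorithm is that the features are conditionally independent given the class. By the chain rule, \(P(x_1, x_2,\ldots, x_m|C_i)=\prod_{j=1}^{m}P(x_j|x_{j+1},\ldots, x_m, C_i)\), and under this assumption each factor \(P(x_j|x_{j+1},\ldots, x_m, C_i)\) becomes \(P(x_j|C_i)\). Then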

\[\displaystyle P(C_i|x_1, x_2,\ldots, x_m)=\left(\prod_{j=1}^{j=m}P(x_j|C_i)\right).\frac{P(C_i)}{P(x_1, x_2,\ldots, x_m)}\]

Because the evidence \(P(x_1, x_2,\ldots, x_m)\) is the same for every class, it acts only as a scaling constant, and the above expression simplifies to \[\displaystyle P(C_i|x_1, x_2,\ldots, x_m)\propto\left(\prod_{j=1}^{j=m}P(x_j|C_i)\right).P(C_i)\] for \(1\leq i\leq h\). The predicted class is the \(C_i\) that maximizes the right-hand side.
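The proportionality above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation: the toy sentences, the helper names `fit` and `log_scores`, and the add-one smoothing parameter `alpha` are all assumptions made for the example. Logarithms replace the product so that many small probabilities do not underflow.

```python
import math
from collections import Counter

# Hypothetical toy training set: (tokens, tag) pairs made up for illustration.
train = [
    (["null", "pointer", "exception"], "java"),
    (["list", "comprehension", "syntax"], "python"),
    (["pointer", "exception", "class"], "java"),
    (["lambda", "list", "syntax"], "python"),
]

def fit(samples):
    """Estimate log P(C_i) and per-class word counts from tokenised samples."""
    class_counts = Counter(tag for _, tag in samples)
    word_counts = {tag: Counter() for tag in class_counts}
    for tokens, tag in samples:
        word_counts[tag].update(tokens)
    vocab = {w for tokens, _ in samples for w in tokens}
    log_prior = {t: math.log(n / len(samples)) for t, n in class_counts.items()}
    return log_prior, word_counts, vocab

def log_scores(tokens, log_prior, word_counts, vocab, alpha=1.0):
    """Return log P(C_i) + sum_j log P(x_j | C_i) for every class C_i."""
    scores = {}
    for tag, prior in log_prior.items():
        total = sum(word_counts[tag].values())
        score = prior
        for w in tokens:
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-alpha (Laplace) smoothing keeps unseen word/class pairs non-zero.
            score += math.log((word_counts[tag][w] + alpha) / (total + alpha * len(vocab)))
        scores[tag] = score
    return scores

log_prior, word_counts, vocab = fit(train)
scores = log_scores(["pointer", "exception"], log_prior, word_counts, vocab)
predicted = max(scores, key=scores.get)  # class with the largest unnormalised posterior
print(predicted, scores)
```

The evidence term never appears in the code, since it is identical for every class and does not change which class attains the maximum.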

3 A Practical Example

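# infile: path to the CSV of Stack Overflow posts (columns 'post' and 'tags'); it is not defined in this excerpt
df = pd.read_csv(infile)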
df = df[pd.notnull(df['tags'])]
print(df.head(20))
##                                                  post           tags
## 0   what is causing this behavior  in our c# datet...             c#
## 1   have dynamic html load as if it was in an ifra...        asp.net
## 2   how to convert a float value in to min:sec  i ...    objective-c
## 3   .net framework 4 redistributable  just wonderi...           .net
## 4   trying to calculate and print the mean and its...         python
## 5   how to give alias name for my website  i have ...        asp.net
## 6   window.open() returns null in angularjs  it wo...      angularjs
## 7   identifying server timeout quickly in iphone  ...         iphone
## 8   unknown method key  error in rails 2.3.8 unit ...  ruby-on-rails
## 9   from the include  how to show and hide the con...      angularjs
## 10  when we need interface c# <blockquote>    <str...             c#
## 11  how to install .ipa on jailbroken iphone over ...            ios
## 12  dynamic textbox text - asp.net  i m trying to ...        asp.net
## 13  rather than bubblesorting these names...the pr...              c
## 14  site deployed in d: drive and uploaded files a...        asp.net
## 15  connection in .net  i got     <blockquote>    ...           .net
## 16  how to subtract 1 from an int  how do i subtra...    objective-c
## 17  ror console show syntax error  i want to add d...  ruby-on-rails
## 18  distance between 2 or more drop pins  i was do...         iphone
## 19  sql query - how to exclude a record from anoth...            sql
print(df['post'].apply(lambda x: len(x.split(' '))).sum())
## 10278243
my_tags = ['android','angularjs','asp.net','c#','c','c++','css','ios','html','jquery','mysql','java','javascript','.net','python','php','iphone','ruby-on-rails','sql','objective-c']
plt.figure(figsize=(15,6))
#Plot the data:
# my_colors = 'rgbkymc'  #red, green, blue, black, etc.
ax = plt.gca()
ax.tick_params(axis='x', colors='blue')
ax.tick_params(axis='y', colors='red')
df.tags.value_counts().plot( kind='bar', color= 'green');
plt.show()
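A note on the word count printed earlier in this section: `split(' ')` splits on every single space, so runs of spaces produce empty strings that are counted as words, whereas `split()` with no argument splits on any run of whitespace. The short sketch below shows the difference on a made-up string (the sample text is an assumption, not a row from the dataset).

```python
# Build a string with a double and a triple space so the spacing is unambiguous.
post = "how to" + " " * 2 + "convert a float" + " " * 3 + "value"

tokens_space = post.split(' ')  # keeps empty strings between consecutive spaces
tokens_any = post.split()       # collapses any run of whitespace

print(len(tokens_space), tokens_space)
print(len(tokens_any), tokens_any)
```

Since the raw posts contain runs of spaces (visible in the `head` output), the total reported with `split(' ')` is likely somewhat higher than the number of actual words.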

4 Text Before Cleansing

def print_plot(index):
    example = df[df.index == index][['post', 'tags']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Tag:', example[1])


print_plot(10)
## when we need interface c# <blockquote>    <strong>possible duplicate:</strong><br>   <a href= https://stackoverflow.com/questions/240152/why-would-i-want-to-use-interfaces >why would i want to use interfaces </a>   <a href= https://stackoverflow.com/questions/9451868/why-i-need-interface >why i need interface </a>    </blockquote>     i want to know where and when to use it     for example    <pre><code>interface idemo {  // function prototype  public void show(); }  // first class using the interface class myclass1 : idemo {  public void show()  {   // function body comes here   response.write( i m in myclass );  }  }  // second class using the interface class myclass2 : idemo {  public void show()   {   // function body comes here   response.write( i m in myclass2 );   response.write( so  what  );  } </code></pre>   these two classes has the same function name with different body. this can be even achieved without interface. then why we need an interface where and when to use it
## Tag: c#

5 Text After Cleansing

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))


def clean_text(text):
    # text = BeautifulSoup(text, "lxml").text
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = BAD_SYMBOLS_RE.sub(' ', text)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text
df['post'] = df['post'].apply(clean_text)
print_plot(10)
## need interface c# blockquote strong possible duplicate strong br href https stackoverflow com questions 240152 would want use interfaces would want use interfaces href https stackoverflow com questions 9451868 need interface need interface blockquote want know use example pre code interface idemo function prototype public void show first class using interface class myclass1 idemo public void show function body comes response write myclass second class using interface class myclass2 idemo public void show function body comes response write myclass2 response write code pre two classes function name different body even achieved without interface need interface use
## Tag: c#
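To see what each cleaning step does, here is a self-contained sketch of the same function applied to one short string. The two regular expressions are the ones used above; the sample sentence and the tiny hand-picked stopword set are assumptions that stand in for the dataset and for NLTK's English list, so the snippet runs without downloading any corpus.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Assumption: a small stand-in for stopwords.words('english').
STOPWORDS = {"i", "to", "a", "the", "in", "is", "how"}

def clean_text(text):
    text = text.lower()                        # 1. lowercase
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # 2. brackets, slashes, etc. become spaces
    text = BAD_SYMBOLS_RE.sub(' ', text)       # 3. anything outside 0-9 a-z space # + _ becomes a space
    return ' '.join(w for w in text.split() if w not in STOPWORDS)  # 4. drop stopwords

sample = "How to call foo(bar) in C# <code>x;</code>?"
cleaned = clean_text(sample)
print(cleaned)
```

The characters `#` and `+` are kept on purpose, so tags such as `c#` and `c++` survive cleaning. HTML tag names such as `code` also survive, just as `blockquote` and `strong` do in the cleaned post above; stripping the markup first (for example with the commented-out BeautifulSoup line) would remove them.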

6 Split Dataset into Train-set and Test-set


from sklearn.model_selection import train_test_split

X = df.post
y = df.tags
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1506)

7 Naive Bayes Classifier for Multinomial Models

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

nb = Pipeline([('vect', CountVectorizer()),
               ('tfidf', TfidfTransformer()),
               ('clf', MultinomialNB()),
              ])
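from sklearn.metrics import accuracy_score, classification_report

# Fit the pipeline on the training split before predicting on the test split
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
print("Accuracy: %.4f" % accuracy_score(y_test, y_pred))
## Accuracy: 0.7379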
# classification_report lists classes in sorted label order; passing the
# unsorted my_tags as target_names would attach the wrong name to each row
print(classification_report(y_test, y_pred))
##                precision    recall  f1-score   support
## 
##          .net       0.62      0.60      0.61       606
##       android       0.92      0.86      0.89       597
##     angularjs       0.84      0.93      0.88       589
##       asp.net       0.73      0.76      0.75       601
##             c       0.73      0.90      0.80       602
##            c#       0.70      0.56      0.62       599
##           c++       0.81      0.77      0.79       606
##           css       0.64      0.89      0.75       606
##          html       0.54      0.60      0.57       591
##           ios       0.65      0.62      0.63       605
##        iphone       0.61      0.56      0.58       569
##          java       0.82      0.76      0.79       618
##    javascript       0.83      0.59      0.69       605
##        jquery       0.72      0.77      0.75       609
##         mysql       0.65      0.78      0.71       581
##   objective-c       0.69      0.64      0.67       608
##           php       0.81      0.74      0.77       614
##        python       0.90      0.83      0.86       622
## ruby-on-rails       0.92      0.90      0.91       610
##           sql       0.71      0.68      0.70       562
## 
##      accuracy                           0.74     12000
##     macro avg       0.74      0.74      0.74     12000
##  weighted avg       0.74      0.74      0.74     12000
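The `vect` and `tfidf` steps of the pipeline in this section turn each post into a weighted word-count vector before the classifier sees it. The sketch below reproduces that weighting in plain Python on three made-up token lists. It assumes the `TfidfTransformer` defaults (smoothed inverse document frequency, \(\ln\frac{1+n}{1+\mathrm{df}}+1\), followed by L2 normalisation of each row); the function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenised documents standing in for cleaned posts.
docs = [["python", "list", "list"], ["java", "list"], ["python", "lambda"]]

def tfidf(docs):
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf, matching the assumed TfidfTransformer defaults.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)                                  # what CountVectorizer would produce
        weights = {t: c * idf[t] for t, c in counts.items()}   # term frequency times idf
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()}) # L2-normalise each row
    return idf, rows

idf, rows = tfidf(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

Terms that appear in fewer documents (here `java` and `lambda`) receive a larger idf than common ones (`list`, `python`), which is why rare, tag-specific words carry more weight in the classifiers.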

8 Logistic Regression

from sklearn.linear_model import LogisticRegression

logreg = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', LogisticRegression(n_jobs=1, C=1e5)),
               ])
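# Fit the pipeline on the training split before predicting on the test split
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print("Accuracy: %.4f" % accuracy_score(y_test, y_pred))
## Accuracy: 0.7774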
# classification_report lists classes in sorted label order; passing the
# unsorted my_tags as target_names would attach the wrong name to each row
print(classification_report(y_test, y_pred))
##                precision    recall  f1-score   support
## 
##          .net       0.64      0.60      0.62       606
##       android       0.93      0.89      0.91       597
##     angularjs       0.98      0.96      0.97       589
##       asp.net       0.81      0.72      0.76       601
##             c       0.78      0.83      0.81       602
##            c#       0.59      0.63      0.61       599
##           c++       0.79      0.74      0.76       606
##           css       0.79      0.81      0.80       606
##          html       0.65      0.69      0.67       591
##           ios       0.66      0.64      0.65       605
##        iphone       0.61      0.66      0.63       569
##          java       0.85      0.83      0.84       618
##    javascript       0.79      0.77      0.78       605
##        jquery       0.84      0.85      0.84       609
##         mysql       0.80      0.77      0.78       581
##   objective-c       0.67      0.69      0.68       608
##           php       0.80      0.82      0.81       614
##        python       0.91      0.91      0.91       622
## ruby-on-rails       0.96      0.93      0.94       610
##           sql       0.73      0.82      0.78       562
## 
##      accuracy                           0.78     12000
##     macro avg       0.78      0.78      0.78     12000
##  weighted avg       0.78      0.78      0.78     12000
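For intuition about how a fitted logistic regression turns a feature vector into a tag, the sketch below scores one vector against three classes and converts the scores to probabilities with a softmax. It assumes a multinomial (softmax) formulation; the weights, biases and the two-dimensional feature vector are invented for illustration and are not taken from the fitted `logreg` pipeline.

```python
import math

def softmax(scores):
    """Convert raw linear scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-class weights and biases over two features.
weights = {"python": [2.0, -1.0], "java": [-1.0, 2.0], "sql": [0.0, 0.0]}
bias = {"python": 0.0, "java": 0.0, "sql": 0.0}
features = [0.8, 0.1]                     # e.g. two TF-IDF values

tags = list(weights)
scores = [sum(w * x for w, x in zip(weights[t], features)) + bias[t] for t in tags]
probs = dict(zip(tags, softmax(scores)))
predicted = max(probs, key=probs.get)
print(predicted, {t: round(p, 3) for t, p in probs.items()})
```

In scikit-learn, `C` is the inverse of the regularization strength, so the `C=1e5` used in this section means the weights are barely penalized and the model fits the training data closely.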

9 Linear Support Vector Machine

from sklearn.linear_model import SGDClassifier

sgd = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-5, random_state=1506, max_iter=5, tol=None)),
               ])
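# Fit the pipeline on the training split before predicting on the test split
sgd.fit(X_train, y_train)
y_pred = sgd.predict(X_test)
print("Accuracy: %.4f" % accuracy_score(y_test, y_pred))
## Accuracy: 0.7939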
# classification_report lists classes in sorted label order; passing the
# unsorted my_tags as target_names would attach the wrong name to each row
print(classification_report(y_test, y_pred))
##                precision    recall  f1-score   support
## 
##          .net       0.73      0.62      0.67       606
##       android       0.89      0.91      0.90       597
##     angularjs       0.98      0.97      0.97       589
##       asp.net       0.79      0.76      0.77       601
##             c       0.81      0.85      0.83       602
##            c#       0.63      0.64      0.63       599
##           c++       0.81      0.75      0.78       606
##           css       0.85      0.79      0.82       606
##          html       0.69      0.72      0.70       591
##           ios       0.68      0.65      0.66       605
##        iphone       0.64      0.66      0.65       569
##          java       0.88      0.84      0.86       618
##    javascript       0.81      0.77      0.79       605
##        jquery       0.82      0.89      0.85       609
##         mysql       0.80      0.78      0.79       581
##   objective-c       0.66      0.72      0.69       608
##           php       0.80      0.86      0.83       614
##        python       0.91      0.91      0.91       622
## ruby-on-rails       0.95      0.94      0.95       610
##           sql       0.76      0.85      0.80       562
## 
##      accuracy                           0.79     12000
##     macro avg       0.79      0.79      0.79     12000
##  weighted avg       0.79      0.79      0.79     12000
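`SGDClassifier` with `loss='hinge'` and `penalty='l2'` trains a linear support vector machine by stochastic gradient descent. The sketch below shows the core update for a binary problem in plain Python. It is a simplified illustration under stated assumptions: a fixed learning rate `lr`, no shuffling, labels in {-1, +1}, and a made-up two-dimensional dataset, whereas scikit-learn uses its own learning-rate schedule and handles the 20 tags with one-versus-rest classifiers.

```python
def train_linear_svm(X, y, alpha=0.01, lr=0.1, epochs=5):
    """Minimise alpha/2 * ||w||^2 + hinge loss with plain SGD (labels are -1 or +1)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The L2 penalty shrinks the weights on every step.
            w = [wj - lr * alpha * wj for wj in w]
            # The hinge loss only contributes when the sample is inside the margin.
            if margin < 1:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable toy data.
X = [(2.0, 1.0), (3.0, 2.0), (-2.0, -1.0), (-3.0, -2.0)]
y = [1, 1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print(w, b, predictions)
```

Samples that already sit outside the margin leave the weights untouched apart from the shrinkage, which is what distinguishes the hinge loss from the logistic loss used in the previous section.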

Written by DK WC

2019-06-09