Load the libraries + functions
- Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report so that others know what they need for the report to compile correctly.
- Import the separate Python file that includes the functions you will need for the classification reports.
##r chunk
library(reticulate)
#py_install("feature_selector")
#source_python("model_evaluation_utils.py")
- Load the Python libraries or functions that you will use for this section.
##python chunk
import pandas as pd
import numpy as np
import nltk
import textblob
from bs4 import BeautifulSoup
import unicodedata
import contractions
from nltk import PorterStemmer
ps = PorterStemmer()
The Data
- This dataset includes tweets that have been coded as either negative or positive.
- Import the data using either R or Python. I put a Python chunk here because you will need one to import the data, but if you want to first import into R, that’s fine.
##python chunk
df = pd.read_csv('twitter_small.csv')
Clean up the data (text normalization)
- Use our clean text function from this lecture to clean the text for this analysis.
##python chunk
stop_words = set(nltk.corpus.stopwords.words('english')) #stopwords
stop_words.remove('no') #keep negation words, since they carry sentiment
stop_words.remove('but')
stop_words.remove('not')

def clean_text(text):
    text = BeautifulSoup(text, 'html.parser').get_text() #strip html
    text = text.lower() #lower case
    text = contractions.fix(text) #expand contractions
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore') #strip accents/symbols
    #text = ' '.join([ps.stem(word) for word in text.split()]) #optional stemming
    text = ' '.join(word for word in text.split() if word not in stop_words) #remove stopwords
    return text

df['tweet_parse1'] = df['tweet'].apply(clean_text)
df.head(2)
## sentiment ... tweet_parse1
## 0 negative ... worried adara.
## 1 negative ... german television program boring.
##
## [2 rows x 3 columns]
TextBlob
- Calculate the expected polarity for all the tweets with TextBlob.
- Using a cutoff score of 0, convert the polarity scores into positive and negative categories.
- Display the performance metrics of using TextBlob on this dataset. A sketch is included at the end of the chunk below.
##python chunk
tweets = np.array(df['tweet_parse1'])
sentiments = np.array(df['sentiment'])
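# A minimal sketch of the TextBlob step (not lecture code); tb_polarity and
# tb_predictions are illustrative names. TextBlob returns a polarity score
# between -1 and 1, cut here at 0 (scores of 0 or above coded positive).
tb_polarity = [textblob.TextBlob(tweet).sentiment.polarity for tweet in tweets]
tb_predictions = ['positive' if score >= 0 else 'negative' for score in tb_polarity]

from sklearn.metrics import accuracy_score, classification_report
print('accuracy %s' % accuracy_score(sentiments, tb_predictions))
print(classification_report(sentiments, tb_predictions))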
AFINN
- Calculate the expected polarity for all the tweets using AFINN.
- Using a cutoff score of 0, convert the polarity scores into positive and negative categories.
- Display the performance metrics of using AFINN on this dataset. A sketch is included in the chunk below.
##python chunk
#py_install("afinn", pip = T)
#get_sentiments("afinn")
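# Load AFINN (moved here from the TextBlob chunk above); afn_scores and
# afn_predictions are illustrative names. AFINN returns a signed integer
# score per tweet, cut here at 0 (scores of 0 or above coded positive).
from afinn import Afinn
afn = Afinn(emoticons=True)

afn_scores = [afn.score(tweet) for tweet in tweets]
afn_predictions = ['positive' if score >= 0 else 'negative' for score in afn_scores]

from sklearn.metrics import accuracy_score, classification_report
print('accuracy %s' % accuracy_score(sentiments, afn_predictions))
print(classification_report(sentiments, afn_predictions))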
Split the dataset
- Split the dataset into training and testing datasets.
##python chunk
from sklearn.model_selection import train_test_split
train_tweet, test_tweet, train_sentiment, test_sentiment = train_test_split(df['tweet_parse1'], df['sentiment'], test_size=0.20, random_state = 42)
train_tweet.shape
## (3200,)
test_tweet.shape
## (800,)
TF-IDF
- Calculate features for testing and training using the TF-IDF vectorizer.
##python chunk
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0)
# apply to train and test
tv_train_features = tv.fit_transform(train_tweet)
tv_test_features = tv.transform(test_tweet)
Logistic Regression Classifier
- Create a blank logistic regression model.
- Fit the model to the training data.
- Create the predicted value for the testing data.
##python chunk
from sklearn import linear_model
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
lr = linear_model.LogisticRegression(penalty='l2', solver='lbfgs', multi_class='ovr',
max_iter=1000, C=1, random_state=42)
lr.fit(tv_train_features, train_sentiment)
## LogisticRegression(C=1, max_iter=1000, multi_class='ovr', random_state=42)
y_pred = lr.predict(tv_test_features)
Accuracy and Classification Report
- Display the performance metrics of the logistic regression model on the testing data.
##python chunk
print('accuracy %s' % accuracy_score(test_sentiment, y_pred))
## accuracy 0.72125
print(classification_report(test_sentiment, y_pred))
## precision recall f1-score support
##
## negative 0.74 0.72 0.73 422
## positive 0.70 0.72 0.71 378
##
## accuracy 0.72 800
## macro avg 0.72 0.72 0.72 800
## weighted avg 0.72 0.72 0.72 800
Topic Model Positive Tweets
- Create a dataset of just the positive tweets.
- Create a dictionary and document term matrix to start the topic model. A sketch is included in the chunk below.
##python chunk
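# A sketch assuming the gensim library is available; positive_tweets,
# tokenized, dictionary, and corpus are illustrative names.
import gensim

#subset to only the tweets coded as positive
positive_tweets = df[df['sentiment'] == 'positive']['tweet_parse1']

#tokenize, then build the id-to-word dictionary and the document term matrix
tokenized = [tweet.split() for tweet in positive_tweets]
dictionary = gensim.corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]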
Topic Model
- Create the LDA topic model for the positive tweets with three topics. A sketch is included in the chunk below.
##python chunk
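# A sketch: fit gensim's LdaModel on the dictionary and corpus built above;
# lda and the settings shown (passes, random_state) are illustrative choices.
lda = gensim.models.LdaModel(corpus = corpus,
                             id2word = dictionary,
                             num_topics = 3,
                             random_state = 42,
                             passes = 10)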
Terms for the Topics
- Print out the top terms for each of the topics. A sketch is included in the chunk below.
##python chunk
#sentence = "I feel so miserable, it makes me amazing"
#tokens = [word for word in word_tokenize(sentence) if word not in stop_words]
#tokens
#feats = word_feats(word for word in tokens)
#print(feats)
#test = lr.predict(feats)
#print(test)
#from feature_selector import FeatureSelector
#fs = FeatureSelector(data = tv_train_features, labels = train_sentiment)
#fs.feature_importances.head(10)
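# A sketch: print the ten highest-weighted terms for each of the three topics
# from the lda model fit above.
for topic_num, topic_terms in lda.print_topics(num_topics = 3, num_words = 10):
    print('Topic', topic_num, ':', topic_terms)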
Interpretation
- Which model best represented the polarity in the dataset? # The logistic regression classifier, which reached over 70% accuracy on the testing data.
- Looking at the topics analysis, what are the main positive components to the data? # The main positive components are terms such as "good" and "amazing".