During the COVID-19 pandemic, people in most places around the world were stuck at home under lockdown. Social media platforms such as Twitter, FB, TikTok, and YouTube were filled with anxiety, anger, and rage. People spread thoughts, opinions, and real or fake news, and all of these messages, mixed with the mess of an election year, produced an enormous flow of information on social networks. Whether you wanted it or not, it became part of everyone's daily life. The virus outbreak, variants, vaccines, origins, face masks, and more were discussed over and over. This put huge pressure on the moderation of social media, and there have been quite a few cases in which fake news led to unnecessary loss of human life or property.
Thus, I conducted a study on sentiment analysis and classification of tweets about COVID-19. I am trying to build a classifier that detects the attitude each user is trying to convey in their text. If the classification method works, it will help not only the social media platforms but also social workers trying to reach those in need. It will also give regulators a tool for refining current laws to clean up the social network.
My goal is to build a classifier that detects the sentiment of tweets about the COVID-19 pandemic. So my research question is: which of logistic regression, naive Bayes, and SVM performs best at this classification task?
The data consists of tweets retrieved from Twitter about the COVID-19 pandemic, stored in two .csv files and already split into a training set and a test set. Both sets have six variables: username, screen name, location, tweet time, original tweet, and sentiment. The training set has 40,000+ observations, while the test set has 3,000+. We use only the original tweet as the independent variable for analyzing sentiment.
We first applied data-cleaning strategies to strip all unnecessary characters from the tweet text, defining a series of functions for this purpose. We then used feature extraction to identify the main features driving sentiment. Finally, we built the models on the cleaned and processed data and tested their accuracy.
library(reticulate)
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize,sent_tokenize
from bs4 import BeautifulSoup
import re,string,unicodedata
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
train_data = pd.read_csv('Corona_NLP_train.csv', encoding='latin_1')
test_data = pd.read_csv('Corona_NLP_test.csv', encoding='latin_1')
train_data.head()
## UserName ... Sentiment
## 0 3799 ... Neutral
## 1 3800 ... Positive
## 2 3801 ... Positive
## 3 3802 ... Positive
## 4 3803 ... Extremely Negative
##
## [5 rows x 6 columns]
test_data.head()
## UserName ... Sentiment
## 0 1 ... Extremely Negative
## 1 2 ... Positive
## 2 3 ... Extremely Positive
## 3 4 ... Negative
## 4 5 ... Neutral
##
## [5 rows x 6 columns]
#check for na values
train_data.isnull().values.any()
## True
test_data.isnull().values.any()
## True
#remove na values
train_data.dropna(inplace=True)
test_data.dropna(inplace=True)
#collapse the dependent variable into 3 categories instead of 5
def re_attribute(sentiment):
    if sentiment in ("Extremely Positive", "Positive"):
        return 'positive'
    elif sentiment in ("Extremely Negative", "Negative"):
        return 'negative'
    else:
        return 'neutral'
train_data['Sentiment'] = train_data['Sentiment'].apply(re_attribute)
test_data['Sentiment'] = test_data['Sentiment'].apply(re_attribute)
#class re-attribute visualization
class_df = train_data.groupby('Sentiment').count()['OriginalTweet'].reset_index().sort_values(by='OriginalTweet',ascending=False)
percent_class=class_df.OriginalTweet
labels= class_df.Sentiment
colors = ['lightcoral','lightgreen','aqua']
my_pie,_,_ = plt.pie(percent_class,radius = 1.2,labels=labels,colors=colors,autopct="%.1f%%")
plt.setp(my_pie, width=0.6, edgecolor='white')
## [None, None, None, None, None, None]
plt.show()
#it seems most people hold either a positive or a negative attitude rather than a neutral one
#this matches common sense: after more than a year of the pandemic, people have had enough of the current conditions
train_data['tweet'] = train_data.OriginalTweet
train_data["tweet"] = train_data["tweet"].astype(str)
test_data['tweet'] = test_data.OriginalTweet
test_data["tweet"] = test_data["tweet"].astype(str)
##remove non-english (accented) characters
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
##remove special characters
def rmv_spe_char(text):
    char_rmv = re.compile(r"[-*()!+:'$@?#]")
    return char_rmv.sub(r"", text)
##remove urls
def rmv_url(text):
    url_remove = re.compile(r'https?://\S+|www\.\S+')
    return url_remove.sub(r'', text)
##remove html tags
def rmv_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)
##lower case
def lower_case(text):
    low_text = text.lower()
    return low_text
##remove stop words
", ".join(stopwords.words('english'))
## "i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mustn't, needn, needn't, shan, shan't, shouldn, shouldn't, wasn, wasn't, weren, weren't, won, won't, wouldn, wouldn't"
STOPWORDS = set(stopwords.words('english'))
def rmv_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
##apply the cleaning functions to the data step by step
train_data['tweet_1'] = train_data['tweet'].apply(lambda x:rmv_url(x))
train_data['tweet_2'] = train_data['tweet_1'].apply(lambda x:rmv_html(x))
train_data['tweet_3'] = train_data['tweet_2'].apply(lambda x:rmv_spe_char(x))
train_data['tweet_4'] = train_data['tweet_3'].apply(lambda x:lower_case(x))
train_data['tweet_5'] = train_data['tweet_4'].apply(lambda x:rmv_stopwords(x))
test_data['tweet_1'] = test_data['tweet'].apply(lambda x:rmv_url(x))
test_data['tweet_2'] = test_data['tweet_1'].apply(lambda x:rmv_html(x))
test_data['tweet_3'] = test_data['tweet_2'].apply(lambda x:rmv_spe_char(x))
test_data['tweet_4'] = test_data['tweet_3'].apply(lambda x:lower_case(x))
test_data['tweet_5'] = test_data['tweet_4'].apply(lambda x:rmv_stopwords(x))
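The model-building code is not reproduced in this section, so the sketch below shows how the comparison behind the conclusion can be run on the cleaned tweet_5 column. It reuses the sklearn imports from the top of the document; the vectorizer settings and the max_iter value are illustrative assumptions, not the exact configuration used in the study.
##minimal sketch of the model comparison (settings here are assumptions)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data['tweet_5'])
X_test = vectorizer.transform(test_data['tweet_5'])
y_train, y_test = train_data['Sentiment'], test_data['Sentiment']
for name, model in [('logistic regression', LogisticRegression(max_iter=1000)),
                    ('naive Bayes', MultinomialNB()),
                    ('SVM', LinearSVC())]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
A linear model over sparse TF-IDF features is a common strong baseline for short texts, which is consistent with the SVM result reported below.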
Based on this research, the SVM method provided the best prediction result. Considering the complexity of each algorithm, we are confident in the results we obtained. During the literature review, we also encountered some deep-learning techniques with which we are not yet familiar, so if we want more accurate classification results, we will need to dive deeper into the ocean of ML/AI algorithms.
We also found that positive and negative tweets were almost equal in number, which was far from our expectations. We had originally assumed that most people would hold a negative attitude toward the pandemic, since most industries were negatively impacted during the lockdown period. Surprisingly, despite the chaos of the past 17+ months, about half of the users still held a positive attitude toward the future, which we found touching.
As we have stated before, this study can benefit not only the platforms but also social workers and regulators. Platforms can use the algorithm to stop misinformation from spreading and to monitor abnormal information flows. Social workers can use the data to analyze people's mental-health conditions and intervene in a timely manner. Regulators can use the results to impose new requirements on the platforms to make sure they play the correct role in the whole system, e.g. in setting the boundaries of First Amendment protections.