Sentiment Analysis
Introduction
In this study, we will perform sentiment analysis based on tweets from the consumer of major US airlines in February 2015. The consumer’s voice is important because it can describe the level of satisfaction with services and products offered by US airlines. The airlines included in this dataset are Virgin American, Delta, Southwest, United, US Airways and Virgin America. The Naive Bayes model will be used in this analysis.
Data Information
The data is collected from the Kaggle website [^1]. The data contains 14640 tweets related to services and products of US airlines. The target variable is airline_sentiment.
Data Preparation
Load libraries
Load all libraries used in this study
library(dbplyr)
library(e1071)
library(inspectdf)
library(tm)
library(wordcloud)
library(caret)
library(ggplot2)
library(wesanderson)
library(ROCR)
library(pROC)Attach Dataset
Load dataset by using function read.csv()
# Read the dataset
us <- read.csv("Tweets.csv")
# Check dataset
rmarkdown::paged_table(tail(us))Data Wrangling
Observe Dataset
There are 14640 observations and 15 variables. For feature selection, we require only 2 columns, which are airline_sentiment and text. Thus, we will change text column into factor.
str(us)## 'data.frame': 14640 obs. of 15 variables:
## $ tweet_id : num 5.7e+17 5.7e+17 5.7e+17 5.7e+17 5.7e+17 ...
## $ airline_sentiment : chr "neutral" "positive" "neutral" "negative" ...
## $ airline_sentiment_confidence: num 1 0.349 0.684 1 1 ...
## $ negativereason : chr "" "" "" "Bad Flight" ...
## $ negativereason_confidence : num NA 0 NA 0.703 1 ...
## $ airline : chr "Virgin America" "Virgin America" "Virgin America" "Virgin America" ...
## $ airline_sentiment_gold : chr "" "" "" "" ...
## $ name : chr "cairdin" "jnardino" "yvonnalynn" "jnardino" ...
## $ negativereason_gold : chr "" "" "" "" ...
## $ retweet_count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ text : chr "@VirginAmerica What @dhepburn said." "@VirginAmerica plus you've added commercials to the experience... tacky." "@VirginAmerica I didn't today... Must mean I need to take another trip!" "@VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces & they hav"| __truncated__ ...
## $ tweet_coord : chr "" "" "" "" ...
## $ tweet_created : chr "2015-02-24 11:35:52 -0800" "2015-02-24 11:15:59 -0800" "2015-02-24 11:15:48 -0800" "2015-02-24 11:15:36 -0800" ...
## $ tweet_location : chr "" "" "Lets Play" "" ...
## $ user_timezone : chr "Eastern Time (US & Canada)" "Pacific Time (US & Canada)" "Central Time (US & Canada)" "Pacific Time (US & Canada)" ...
Our target variable is airline_sentiment. We can check the class of this target by using the function unique(). The text has 3 levels of category, which are neutral, positive and negative.
unique(us$airline_sentiment)## [1] "neutral" "positive" "negative"
The proportion for each class in the target variable is not balanced. We will keep this proportion and observe the model performance.
rmarkdown::paged_table(as.data.frame(prop.table(table(us$airline_sentiment))))Now, create the plot of sentiment classification for each airline by using ggplot
# Create dataframe containing the frequency of each sentiment based on each airline.
airline <- as.data.frame(table(us$airline, us$airline_sentiment))
colnames(airline) <- c("Airline", "Sentiment", "Frequency")
# Create plot
ggplot(airline, aes(x=Airline, y =Frequency, fill= Sentiment)) +
geom_col()+
theme_minimal()+
labs(title = "Sentiment Classification Each Airline")+
scale_fill_manual(values= wes_palette(10, name = "Darjeeling1", type="continuous"))+
theme(plot.title = element_text(size= 14, color = 'black', face ='bold'),
axis.title.x = element_text(size=12, color = 'black'),
axis.title.y = element_text(size = 12, color = 'black'),
axis.text.x = element_text(size = 12, color = 'black', vjust = 0.5),
axis.text.y = element_text(size = 12, color = 'black'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"),
legend.position = "bottom")Insight: As shown in the plot, the proportion of negative sentiment overpowers the other sentiment class. It implies that most customer of major U.S airline is not satisfied with the services or products of the airline. The highest proportion of negative sentiment belongs to US Airways and United. While Virgin America has a balance proportion for the sentiment of each tweet.
Feature Selection
# Select only 2 columns and change column text airline_sentiment into factor
us <- us %>%
dplyr::select(airline_sentiment, text) %>%
dplyr::mutate(airline_sentiment=as.factor(airline_sentiment))
# Check the dataset
rmarkdown::paged_table(head(us))Missing Value
We can observe missing value using function colSums() and is.na(). We don’t have missing value in dataset
colSums(is.na(us))## airline_sentiment text
## 0 0
Text Cleaning
Text to Corpus
Corpus is the component from the document. In this case, one document is equal to one observation of the text. In one text, it can be one or more words. One package that can be utilized for text mining is tm package. We use the Vcorpus to change the vector text into the original corpus.
word_corpus <- VCorpus(VectorSource(us$text))Let’s check observation no 10.
word_corpus[[10]]$content## [1] "@VirginAmerica it was amazing, and arrived an hour early. You're too good to me."
Text cleaning
In the next code, we will replace the uppercase with lowercase by using the function tolower.
word_corpus <- word_corpus %>%
tm_map(content_transformer(tolower))
# Check observation no 10
word_corpus[[10]]$content## [1] "@virginamerica it was amazing, and arrived an hour early. you're too good to me."
In the next code, we will remove numbers, punctuation, white spaces and stopwords. Also, we will transform the word into its original form. We can use function tm_map().
# Cleaning the text
word_corpus <- word_corpus %>%
tm_map(removeWords, stopwords("english")) %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stemDocument) %>%
tm_map(stripWhitespace)
# Check the observation number 10
word_corpus[[10]]$content## [1] "virginamerica amaz arriv hour earli good"
Document-Term Matrix (DTM)
We need to transform the text data into Document-Term Matrix (DTM) through the tokenization process. This process split one text into several terms. In DTM, one word will be one predictor with the value containing the frequency of those words appearing in the document. We use function DocumentTermMatrix(). Then, we inspect the result by using the function inspect().
word_dtm <- DocumentTermMatrix(word_corpus)
# Inspect
inspect(word_dtm)## <<DocumentTermMatrix (documents: 14640, terms: 11280)>>
## Non-/sparse entries: 136258/165002942
## Sparsity : 100%
## Maximal term length: 46
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs americanair can flight get hour jetblu southwestair thank unit usairway
## 1076 0 0 5 0 0 0 0 0 1 0
## 2363 0 0 0 0 0 0 0 0 1 0
## 2945 0 0 0 1 0 0 0 0 2 0
## 3698 0 0 2 0 0 0 0 0 1 0
## 376 0 0 2 0 0 0 0 0 0 0
## 3871 0 0 0 1 0 0 0 1 1 0
## 3995 0 0 0 0 0 0 0 0 1 0
## 4797 0 0 2 0 1 0 1 0 0 0
## 827 0 0 1 0 0 0 0 0 1 0
## 974 0 0 2 0 0 0 0 0 1 0
Cross-Validation
We split the train and test dataset with 75% and 25% proportions. We use function sample()
RNGkind(sample.kind = "Rounding")
set.seed(100)
# Split train and test dataset
intrain <- sample(x = nrow(word_dtm), nrow(word_dtm)*0.75)
# train-test splitting
word_train <- word_dtm[intrain,]
word_test <- word_dtm[-intrain,]Then, we prepare for the label for train and test dataset
# Prepare label for train and test
label_train <- us[intrain, "airline_sentiment"]
label_test <- us[-intrain, "airline_sentiment"]
# Check proportion
prop.table(table(label_train))## label_train
## negative neutral positive
## 0.6286885 0.2131148 0.1581967
The proportion of the class target is not balanced. Let’s keep it and observe the model performance. Now, Let’s check the dimension of the train dataset
dim(word_train)## [1] 10980 11280
We use the function findFreqTerms() to reduce the noise in our dataset by collecting the word that frequently appears. Thus, we set lowfreq = 20 to ensure just words with at least 20 times appearance are used in this dataset.
word_freq <- findFreqTerms(x = word_train, lowfreq=20)Subset the train dataset only containing the word_freq. The output shows that only 644 predictors (words) have appeared more than 20 times.
# Subset the train dataset
word_train <- word_train[ ,word_freq]
# Check for dimension
dim(word_train)## [1] 10980 753
Bernoulli Converter
Bernoulli Converter is used to transform the frequency value into condition ; 0 as not appear and 1 as appear
- If frequency > 0, then the value is 1
- If frequency = 0, then the value is 0
# Create Bernoulli converter function
bernoulli_conv <- function(x){
x <- as.factor(ifelse(x > 0, 1, 0))
return(x)
}
# Apply the function to train dataset
word_train_bn <- apply(word_train,
MARGIN =2, # apply function to column, thus the matrix keeps as DocumentTermMatrix
FUN = bernoulli_conv)
# Apply the function to test dataset
word_test_bn <- apply(word_test,
MARGIN = 2,
FUN= bernoulli_conv)
# Check the result
word_train_bn[4:10, 1:5]## Terms
## Docs “jetblu abl absolut accept access
## 826 "0" "0" "0" "0" "0"
## 6858 "0" "0" "0" "0" "0"
## 7080 "0" "0" "0" "0" "0"
## 11889 "0" "0" "0" "0" "0"
## 5419 "0" "0" "0" "0" "0"
## 7998 "0" "0" "0" "0" "0"
## 2492 "0" "0" "0" "0" "0"
Naive Bayes
Naive Bayes is an algorithm of simple “probabilistic classifiers” based on Bayes’ theorem. The model considers independence assumptions between predictor variables [^2]. The model is easy to be used and fast to build a real-time prediction. It can handle both discrete and continuous target variables. However, the model suffers from skewness due to data scarcity.
Model Fitting
Create Naive Bayes model and assign it as model_naive
model_naive <- naiveBayes(x=word_train_bn,
y = label_train,
laplace=1) # to prevent from data scarcityModel Prediction
Prediction can be performed by using function predict()
model_pred <- predict(object = model_naive,
newdata= word_test_bn,
type="class")Model Evaluation
Confusion Matrix
The confusion matrix is used to describe the performance of a classification algorithm. Four metrics to evaluate classifiers are Accuracy, Sensitivity, Specificity and Precision. In this case, we will use accuracy for our performance metrics. Accuracy is defined as the ratio of correctly predicted cases by the total cases.
confusionMatrix(data = as.factor(model_pred),
reference= as.factor(label_test),
positive="negative")## Confusion Matrix and Statistics
##
## Reference
## Prediction negative neutral positive
## negative 1776 163 68
## neutral 345 504 91
## positive 154 92 467
##
## Overall Statistics
##
## Accuracy : 0.7505
## 95% CI : (0.7362, 0.7645)
## No Information Rate : 0.6216
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5643
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: negative Class: neutral Class: positive
## Sensitivity 0.7807 0.6640 0.7460
## Specificity 0.8332 0.8497 0.9189
## Pos Pred Value 0.8849 0.5362 0.6550
## Neg Pred Value 0.6981 0.9062 0.9460
## Prevalence 0.6216 0.2074 0.1710
## Detection Rate 0.4852 0.1377 0.1276
## Detection Prevalence 0.5484 0.2568 0.1948
## Balanced Accuracy 0.8069 0.7569 0.8325
Insight: In this case, we are only interested in accuracy as a performance metric. The output shows that the accuracy of the model is 0.7432, which is justified as an apppropriate model.
ROC and AUC
The empirical ROC curve is a probability curve showing the true positive rate (sensitivity) versus the false positive rate (1 - specificity) for all possible cut-off values [^3]. The AUC (Area Under Curve) measures how much the model is capable to distinguish between classes. The higher the AUC value, the better the model is at distinguishing positive and negative classes.
# Create a probability prediction from `model_naive`
prob_naive <- predict(object = model_naive,
newdata= word_test_bn,
type="raw")
# Check the prob_naive
head(prob_naive)## negative neutral positive
## [1,] 0.086772852 0.63364012 0.2795870
## [2,] 0.038372158 0.02292477 0.9387031
## [3,] 0.196390932 0.27562676 0.5279823
## [4,] 0.028351975 0.58168238 0.3899656
## [5,] 0.008846767 0.31145431 0.6796989
## [6,] 0.088155208 0.66783561 0.2440092
# Create a prob and label from prob_naive.
roc_naive <- data.frame(prob=prob_naive[,1],
label=as.numeric(label_test=="negative"))
# Create an object prediction
prediction_roc_naive <- prediction(predictions = roc_naive$prob,
labels = roc_naive$label)
# Create an ROC plot
plot(ROCR::performance(prediction.obj = prediction_roc_naive,
measure = "tpr",
x.measure = "fpr"), main = "ROC Naive Bayes", col="#519259")
abline(a = 0, b = 1)# Obtain the AUC
auc_naive <- ROCR::performance(prediction.obj=prediction_roc_naive, measure = "auc")
auc_naive@y.values[[1]]## [1] 0.890021
Insight: The AUC of the model is 0.8839, implying that the model can distinguish between positive and negative classes.
Conclusion
Even though the target variable is not balanced, the performance metrics of accuracy show that the model is quite appropriate for predicting the airline sentiment of US major airline customers. The ROC value also shows that the model can distinguish the positive and negative classes very well.
References
- [^1] Data Collection
- [^2] Naive Bayes Introduction
- [^3] ROC