The dataset used in this project is available on Kaggle and was originally collected by Crowdflower’s Data for Everyone library.
This Twitter data was scraped in February 2015. It contains tweets about six major United States (US) airlines.
The dataset contains 14640 instances, each a tweet submitted by an individual traveler, and 15 features. Each instance is labeled as positive, negative or neutral.
Features description:
In this work we want to determine which airlines are tweeted about the most and the reasons behind the negative tweets. Then we will look at the most frequent words used in the tweets using the WordCloud technique. Finally, we will predict the sentiment of a tweet from the tweet text alone, using various machine learning algorithms.
Loading libraries
library(tidyverse) # for piping and select()
library(textcat) # to determine the language of the tweets
library(ggplot2) # visualizing interesting and interactive graphs
library(tm) # text mining package to create corpus
library(SnowballC) # for cleaning the tweets
library(wordcloud) # to create WordClouds which give the most frequent words
library(dplyr) # for data manipulation
library(rpart) # build decision tree model
library(rpart.plot) # plot the decision tree
library(randomForest) # RandomForest model
library(caret) # for confusion matrix
library(nnet) # for multinomial logistic regression
library(Metrics) # for accuracy function
library(treemapify) # for treemap geoms (geom_treemap)
library(lubridate) # for ymd_hms function
library(gganimate) # for graph animation
library(gifski) # GIF renderer used by gganimate
library(av) # video renderer used by gganimate
library(gapminder)
Loading the data.
setwd("C:/Users/jihen/Desktop/Nihel")
ustweets <- read.csv('tweets.csv', header = TRUE)
Let’s have a look at the structure of the data.
str(ustweets)
## 'data.frame': 14640 obs. of 15 variables:
## $ tweet_id : num 5.7e+17 5.7e+17 5.7e+17 5.7e+17 5.7e+17 ...
## $ airline_sentiment : chr "neutral" "positive" "neutral" "negative" ...
## $ airline_sentiment_confidence: num 1 0.349 0.684 1 1 ...
## $ negativereason : chr "" "" "" "Bad Flight" ...
## $ negativereason_confidence : num NA 0 NA 0.703 1 ...
## $ airline : chr "Virgin America" "Virgin America" "Virgin America" "Virgin America" ...
## $ airline_sentiment_gold : chr "" "" "" "" ...
## $ name : chr "cairdin" "jnardino" "yvonnalynn" "jnardino" ...
## $ negativereason_gold : chr "" "" "" "" ...
## $ retweet_count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ text : chr "@VirginAmerica What @dhepburn said." "@VirginAmerica plus you've added commercials to the experience... tacky." "@VirginAmerica I didn't today... Must mean I need to take another trip!" "@VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces & they hav"| __truncated__ ...
## $ tweet_coord : chr "" "" "" "" ...
## $ tweet_created : chr "2015-02-24 11:35:52 -0800" "2015-02-24 11:15:59 -0800" "2015-02-24 11:15:48 -0800" "2015-02-24 11:15:36 -0800" ...
## $ tweet_location : chr "" "" "Lets Play" "" ...
## $ user_timezone : chr "Eastern Time (US & Canada)" "Pacific Time (US & Canada)" "Central Time (US & Canada)" "Pacific Time (US & Canada)" ...
# View(ustweets)
ustweets[3:4,]
## tweet_id airline_sentiment airline_sentiment_confidence negativereason
## 3 5.703011e+17 neutral 0.6837
## 4 5.703010e+17 negative 1.0000 Bad Flight
## negativereason_confidence airline airline_sentiment_gold name
## 3 NA Virgin America yvonnalynn
## 4 0.7033 Virgin America jnardino
## negativereason_gold retweet_count
## 3 0
## 4 0
## text
## 3 @VirginAmerica I didn't today... Must mean I need to take another trip!
## 4 @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces & they have little recourse
## tweet_coord tweet_created tweet_location
## 3 2015-02-24 11:15:48 -0800 Lets Play
## 4 2015-02-24 11:15:36 -0800
## user_timezone
## 3 Central Time (US & Canada)
## 4 Pacific Time (US & Canada)
The reviews need to be cleaned for further analysis.
Let’s randomize our data using the sample() command in case it’s not randomized.
set.seed(1912)
tweets <- ustweets[sample(nrow(ustweets)),]
Extract relevant features.
tweets <- tweets %>%
select(airline_sentiment, negativereason, airline, text, tweet_location)
Checking for missing values.
sum(is.na(tweets))
## [1] 0
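There are no NA values. Note, however, that is.na() does not count empty strings, so blank fields (such as the empty tweet_location values that appear further down) are not detected by this check.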
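To make the difference between NA values and empty strings concrete, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are made up for illustration and are not part of the dataset.

```r
# Toy data frame standing in for `tweets`; the values are illustrative only
toy <- data.frame(
  text = c("late again", "thanks!", "no bags"),
  tweet_location = c("", "Boston, MA", ""),
  stringsAsFactors = FALSE
)

# is.na() does not flag empty strings, so this count is 0
na_count <- sum(is.na(toy))

# Count the empty strings in each column instead
empty_per_col <- colSums(toy == "")
print(empty_per_col)

# Optionally recode empty strings as NA so that is.na() detects them
toy[toy == ""] <- NA
na_after <- sum(is.na(toy))
```

Whether blank locations should be recoded as NA depends on the analysis; here they are simply left as an extra category.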
ggplot(tweets) + aes(airline, fill = airline) + geom_bar() +
geom_text(stat = 'count', aes(label = ..count..), vjust = 1.6, color = "black") +
labs(title = 'Plot of Airlines') + theme_minimal() + guides(fill = "none")
As we see, we have six different airlines: American, Delta, Southwest, United, US Airways and Virgin America. The most tweeted-about airline is United, and the least is Virgin America.
Let’s assume that passengers will go through the trouble of tweeting about an airline only if they were highly impressed or highly disappointed.
tweets %>%
group_by(tweet_location) %>%
summarise(Count = n()) %>%
top_n(5)
## # A tibble: 5 x 2
## tweet_location Count
## <chr> <int>
## 1 "" 4733
## 2 "Boston, MA" 157
## 3 "New York" 127
## 4 "New York, NY" 156
## 5 "Washington, DC" 150
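4733 tweets have no location. Among the tweets that do, "Boston, MA" is the most common single value (157), although "New York" and "New York, NY" together account for 283 tweets. Keep in mind that tweet_location is a free-text field (for example, "Lets Play" appears as a value), so it is not necessarily where the passenger landed.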
We also notice that most of the tweets are from important cities in the US. So I would assume that many of the plane passengers are professionals traveling for work and not for pleasure. Thus, they expect their flight to be on time and to get good service.
Let’s have a look at what language the tweets are written in.
tweets$language = textcat(tweets$text)
ggplot(tweets, aes(x = language)) + geom_bar(stat = "count", fill = "lightblue") + ggtitle("Language Count") + xlab("Language") + ylab("Count") + coord_flip() + theme_minimal()
As expected, English is the most common language since we are dealing with US airlines. However, Scots also appears very often. Scots is closely related to English, so these are most likely short English tweets that textcat misclassified.
I will keep only the tweets detected as English or Scots for the text analysis.
tweets = subset(tweets, language =='english' | language =='scots')
dim(tweets)
## [1] 13691 6
Building and cleaning corpus with the help of tm package.
A corpus is a collection of documents, so each tweet will be treated as a document. We create a VectorSource, which is the input type for the Corpus function defined in the tm package.
corpus0 <- iconv(tweets$text, to = "UTF-8")
corpus <- Corpus(VectorSource(corpus0))
inspect(corpus[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] @USAirways thx for delayed checkin on noon flt to mia then forcing me to take Late Flightr flts which were delayed. Cincy to Miami in 13 hrs! #fast
## [2] @VirginAmerica just got on the 1pm in Newark home to LA. Your folks at EWR are incredible #letsgohome
## [3] @SouthwestAir Am I flying on Spirit air?
## [4] @AmericanAir I need to go to YYZ tmr morning 8am. I switched to United already, but my bag is still off in AA la-la land.
## [5] @USAirways depart from where? It departed PHX and was due to land at DCA around 8pm.
First we put everything in lowercase. Then we remove punctuation, numbers and English stop words (common words), strip extra white space, and stem each word.
Convert text to lower case.
corpus <- tm_map(corpus, tolower)
inspect(corpus[1])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] @usairways thx for delayed checkin on noon flt to mia then forcing me to take late flightr flts which were delayed. cincy to miami in 13 hrs! #fast
Remove all punctuation.
corpus <- tm_map(corpus, removePunctuation)
inspect(corpus[1])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] usairways thx for delayed checkin on noon flt to mia then forcing me to take late flightr flts which were delayed cincy to miami in 13 hrs fast
Remove all numbers.
corpus <- tm_map(corpus, removeNumbers)
inspect(corpus[1])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] usairways thx for delayed checkin on noon flt to mia then forcing me to take late flightr flts which were delayed cincy to miami in hrs fast
Remove common English words.
corpus <- tm_map(corpus, removeWords, stopwords('english'))
inspect(corpus[1])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] usairways thx delayed checkin noon flt mia forcing take late flightr flts delayed cincy miami hrs fast
# Remove additional frequent words that carry little sentiment
corpus <- tm_map(corpus, removeWords, c('flight','get','plane','flights','flightl', 'i', 'day', 'im', 'cant', 'can', 'now', 'just', 'will', 'dont', 'ive', 'got', 'much'))
Get rid of white space.
corpus <- tm_map(corpus, stripWhitespace)
inspect(corpus[1])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] usairways thx delayed checkin noon flt mia forcing take late flightr flts delayed cincy miami hrs fast
Stem the documents, i.e. reduce each word to its base form.
corpus <- tm_map(corpus, stemDocument)
inspect(corpus[1])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] usairway thx delay checkin noon flt mia forc take late flightr flts delay cinci miami hrs fast
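The cleaning steps above can also be sketched as a single base R function. This is only an illustration of the idea, not the tm implementation: `clean_tweet` and its short `stop_words` list are hypothetical names introduced here, and stemming is left out because it needs the SnowballC stemmer.

```r
# Hypothetical helper that mimics the tm cleaning steps with base R only.
# `stop_words` is a tiny illustrative list, not tm's stopwords('english').
clean_tweet <- function(x, stop_words = c("the", "to", "on", "for")) {
  x <- tolower(x)                             # lower case
  x <- gsub("[[:punct:]]", "", x)             # remove punctuation (incl. @ and #)
  x <- gsub("[[:digit:]]+", "", x)            # remove numbers
  words <- strsplit(x, "[[:space:]]+")[[1]]   # split on white space
  words <- words[nzchar(words) & !(words %in% stop_words)]  # drop stop words
  paste(words, collapse = " ")                # re-join with single spaces
}

cleaned <- clean_tweet("@USAirways thx for delayed checkin on noon flt to MIA in 13 hrs! #fast")
print(cleaned)
```

Because the order of the steps matters (for example, removing punctuation turns "can't" into "cant" before stop words are matched), the sketch keeps the same order as the tm pipeline above.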
A TermDocumentMatrix (TDM) converts unstructured text data into structured rows and columns: the rows hold the terms (words) and the columns hold the documents (tweets).
The TDM records every term that occurs in the documents and counts how many times it appears in each one.
TDM <- TermDocumentMatrix(corpus)
TDM
## <<TermDocumentMatrix (terms: 11286, documents: 13691)>>
## Non-/sparse entries: 123385/154393241
## Sparsity : 100%
## Maximal term length: 62
## Weighting : term frequency (tf)
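The TDM contains 11286 terms and 13691 documents. Its sparsity is reported as 100% because only 123385 of its roughly 154.5 million entries are non-zero (about 99.92% sparse before rounding).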
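To see what such a matrix looks like on a small scale, here is a minimal base R sketch that builds a term-document count table from three made-up documents. The documents and the names `docs`, `toy_tdm` and `sparsity` are illustrative assumptions; the tm package builds the real matrix in a sparse format.

```r
# Three toy "tweets", already cleaned (illustrative only)
docs <- c("delay delay flight", "thank crew", "flight delay")

# Split each document into words
tokens <- strsplit(docs, " ", fixed = TRUE)

# One row per (term, document) occurrence
long <- data.frame(
  term = unlist(tokens),
  doc  = rep(seq_along(tokens), lengths(tokens))
)

# Rows are terms, columns are documents, cells are counts
toy_tdm <- table(Terms = long$term, Docs = long$doc)
print(toy_tdm)

# Sparsity is the share of cells that are zero
sparsity <- mean(toy_tdm == 0)
print(sparsity)
```

Even in this tiny example half of the cells are zero; with thousands of tweets and terms the share of zeros approaches 100%.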
Some terms (words) are more important than others, and we want to remove the rare ones. We use the function removeSparseTerms from the tm package with sparse = 0.999, which drops every term that is absent from more than 99.9% of the tweets.
TDMS <- removeSparseTerms(TDM, sparse = 0.999)
TDMS
## <<TermDocumentMatrix (terms: 1128, documents: 13691)>>
## Non-/sparse entries: 102801/15340647
## Sparsity : 99%
## Maximal term length: 17
## Weighting : term frequency (tf)
TDM_matrix <- as.matrix(TDMS)
TDM_matrix[1:7, 1:7]
## Docs
## Terms 1 2 3 4 5 6 7
## checkin 1 0 0 0 0 0 0
## delay 2 0 0 0 0 0 0
## fast 1 0 0 0 0 0 0
## flightr 1 0 0 0 0 0 0
## flt 1 0 0 0 0 0 0
## forc 1 0 0 0 0 0 0
## hrs 1 0 0 0 0 0 0
Now our TDM is composed of 1128 terms and 13691 documents.
Most frequent words: how often each word appears.
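The effect of the sparse threshold can be sketched with a little arithmetic. As I understand removeSparseTerms, a term is kept only if the number of documents containing it is greater than the number of documents times (1 - sparse); treat that rule as an assumption of this sketch. The toy matrix `m` is made up for illustration.

```r
# With 13691 tweets and sparse = 0.999, a term must appear in more than
# 13691 * 0.001 = 13.691 tweets, i.e. in at least 14 tweets
n_docs <- 13691
sparse <- 0.999
min_docs <- floor(n_docs * (1 - sparse)) + 1
print(min_docs)

# Toy term-document matrix: 3 terms x 4 documents (illustrative only)
m <- rbind(
  delay = c(1, 2, 0, 1),
  thank = c(0, 1, 0, 0),
  crew  = c(0, 0, 0, 1)
)

# Document frequency: in how many documents does each term occur?
doc_freq <- rowSums(m > 0)

# Keep terms present in more than half of the documents (sparse = 0.5)
keep <- doc_freq > ncol(m) * (1 - 0.5)
kept <- m[keep, , drop = FALSE]
print(rownames(kept))
```

Under this reading, the 1128 remaining terms are those that occur in at least 14 of the 13691 tweets.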
w <- rowSums(TDM_matrix)
s <- subset(w,w>=500)
v <- sort(s, decreasing = T)
barplot(v, col = rainbow(38), las = 2, main = "Plot of words with frequency of at least 500", xlab = "Words", ylab = "Frequency", border = "black")
Most of the discussion is about United, American, US Airways and Southwest, since each of them was mentioned more than 2000 times.
We also see that help, cancel, hour, thank, time, delay and jetblue were each mentioned more than 1000 times.
The WordCloud is a technique for visualizing important words right away.
wc <- sort(w, decreasing = TRUE)
set.seed(222)
wordcloud(words= names(wc),
freq= wc,
max.words= 200,
random.order=F,
colors = brewer.pal(12, 'Paired'),
rot.per= 0.3)
Besides the four airlines mentioned above, most tweets are about delay, hour, cancel, time, help, custom and service.
These words imply that the reviews about the airlines are not good.
Now I will be comparing the corpus of all three sentiments.
Subsetting the sentiments.
pos_tweets <- subset(tweets$text , tweets$airline_sentiment=="positive")
neg_tweets <- subset(tweets$text , tweets$airline_sentiment=="negative")
neu_tweets <- subset(tweets$text , tweets$airline_sentiment=="neutral")
Paste and collapse positive, negative and neutral tweets.
pos_terms <- paste(pos_tweets , collapse =" ")
neg_terms <- paste(neg_tweets , collapse =" ")
neu_terms <- paste(neu_tweets , collapse =" ")
Combine the positive, negative and neutral terms.
all_terms <- c(pos_terms, neg_terms, neu_terms)
Building the corpus and creating the TDM.
all_corpus <- VCorpus(VectorSource(all_terms))
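# A control list can hold only one usable `stopwords` entry, so the custom words are merged with the English list.
# Stemming is left off here (tm's control option is `stemming`, so `stemDocument = TRUE` had no effect), which keeps whole words in the cloud.
custom_stopwords <- c('flight','get','plane','flights','flightl', 'i', 'day', 'im', 'cant', 'can', 'now', 'just', 'will', 'dont', 'ive', 'got', 'much')
all_tdm <- TermDocumentMatrix(all_corpus, control = list(removePunctuation = TRUE, removeNumbers = TRUE, tolower = TRUE, stopwords = c(stopwords('english'), custom_stopwords)))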
all_tdm_m <- as.matrix(all_tdm)
colnames(all_tdm_m) <- c("positive","negative", "neutral")
all_term_freq <- rowSums(all_tdm_m)
all_term_freq <- sort(all_term_freq,TRUE)
comparison.cloud(
all_tdm_m,
max.words = 100,
colors = c("#00BA5D", "#E83B20","#00468b")
)
The result shows that words expressing positive emotions include thanks, good, great, better and awesome. We can also see that Southwest and Virgin America are mentioned here. This suggests that passengers who flew with Southwest and Virgin America were delighted with their flight: the service was good, their luggage was unharmed, the plane was on time, etc.
The negative sentiment is associated with words such as delayed, hold, canceled, lost, late, hours, waiting and service. Furthermore, US Airways and American Airlines appear here. The passengers who wrote these negative reviews must have faced problems while flying with these airlines, such as delayed or canceled flights, waiting times that were longer than expected, bad service, or lost or damaged luggage.
# % of all sentiments
round(prop.table(table(tweets$airline_sentiment)),3)
##
## negative neutral positive
## 0.644 0.199 0.157
ggplot(tweets) + aes(airline_sentiment, fill = airline_sentiment) + geom_bar() +
geom_text(stat = 'count', aes(label = ..count..), vjust = 1.6, color = "white") +
labs(title = 'Plot of Sentiments') +
scale_fill_manual(values = c("#ff576a","#3db5ff", "#66CC99")) + theme_minimal() + guides(fill = "none")
Overall sentiments: 64.4% of the tweets were negative, 19.9% were neutral, and 15.7% were positive. The data is therefore imbalanced towards the negative class, which suggests that people are more likely to tweet about their flight when something went wrong.
Plot of Tweet Sentiment by Airline
ggplot(tweets, aes(x = airline, fill = airline_sentiment)) + geom_bar(colour = 'black') +
scale_fill_manual(values = c("#ff576a","#3db5ff", "#66CC99")) +
labs(x = 'Airline', y = 'Count', title = 'Tweet Sentiment by Airline') +
theme_minimal() + theme(axis.text.x = element_text(angle = 25, size = 9)) + guides(fill = "none")
We notice here that every airline received more negative feedback than positive and neutral feedback. And since United is the most tweeted-about airline, it unsurprisingly gets the most negative feedback.
Virgin America has the fewest negative tweets, and its three sentiment classes are also close to each other in proportion.
plotdata <- tweets %>%
count(negativereason)
ggplot(plotdata,
aes(fill = negativereason,
area = n,
label = negativereason)) +
geom_treemap() +
geom_treemap_text(colour = "white",
place = "centre", size=20) +
labs(title = "Negative reason ") +
theme(legend.position = "none")
The largest block, with an empty label, corresponds to the positive and neutral tweets (roughly 36% of the data), which have no negative reason. The remaining tweets are negative, and there are multiple reasons behind these bad reviews.
Overall, the most common negative reasons are customer service issues, which have the highest share, and late flights. The least common reason is damaged luggage.
ggplot(tweets) +
aes(x = negativereason, fill = negativereason) + facet_wrap(~airline) + geom_bar() +
labs(x = 'Negative Reason', y = 'Count', title = 'Negative reason per Airline') + theme(axis.text.x = element_text(angle = 25, size = 6)) + guides(fill = "none")
Most airlines have problems with customer service, which could be due to removing passengers from the plane when the flight is overbooked, or to slow service. Delta, in contrast, mainly has a problem with late flights.
Next up, let’s see the distribution of text length of the tweets by adding a new feature for the length of each tweet.
tweets$text_length <- nchar(tweets$text) # nchar() is vectorized, so sapply() is not needed
ggplot(tweets, aes(x = text_length,
fill = airline_sentiment)) +
geom_density(alpha = 0.5) +
labs(x = 'Length of Text', title= 'Distribution of Text Lengths with Sentiment') +
theme(text = element_text(size=12)) +
scale_fill_manual(values = c("#ff576a","#3db5ff", "#66CC99")) + theme_minimal()
We can clearly see here that the majority of long tweets are the negative ones. While most of the short tweets are neutral.
To sum up, people who are experiencing negative situations tend to write longer tweets.
d <- ustweets %>%
mutate(tweet_created = ymd_hms(tweet_created))
head(ustweets$tweet_created)
## [1] "2015-02-24 11:35:52 -0800" "2015-02-24 11:15:59 -0800"
## [3] "2015-02-24 11:15:48 -0800" "2015-02-24 11:15:36 -0800"
## [5] "2015-02-24 11:14:45 -0800" "2015-02-24 11:14:33 -0800"
d %>% mutate(date = as_date(tweet_created)) %>% # keep only the calendar date
group_by(date) %>% # group by date
summarise(count = n()) %>% # number of tweets for each date
mutate(cuml = cumsum(count)) %>% # cumulative tweets up to each date
ggplot(aes(x = date, y = cuml)) +
geom_line(color = 'red') +
geom_point(size = 1.5) +
theme_bw() +
ggtitle('Daily Cumulative Tweets') +
transition_reveal(date)
# extract day
d$day = day(d$tweet_created)
new = d %>% group_by(day, airline) %>% summarise(count = n())
## `summarise()` has grouped output by 'day'. You can override using the `.groups` argument.
# Animated daily line plot
new %>%
ggplot(aes(x = day, y = count,
group = airline,
color = airline)) +
theme_bw()+
geom_line() +
geom_point() +
ggtitle("Animated Daily Tweets per Airline") +
transition_reveal(day)
To obtain input features for our classifier, we want to put this corpus in the shape of a document-term matrix.
A document-term matrix is a numeric matrix containing a column for each different term (word) in our whole corpus, and a row for each document (tweet).
datadtm = DocumentTermMatrix(corpus)
datadtm
## <<DocumentTermMatrix (documents: 13691, terms: 11286)>>
## Non-/sparse entries: 123385/154393241
## Sparsity : 100%
## Maximal term length: 62
## Weighting : term frequency (tf)
The DTM presently has 11286 words extracted from 13691 tweets. These words are what we will use to decide if a tweet is positive, neutral or negative. The sparsity of the DTM is reported as 100% (rounded), which means that almost all entries of the matrix are zero: each tweet uses only a handful of the 11286 terms.
If we consider each column as a feature for our model, we will end up with a very complex model with 11286 different features, and it would take hours to run. We need to reduce the number of terms and work only with the most frequent ones.
Remove terms that are absent from more than 99.9% of the tweets.
datadtm = removeSparseTerms(datadtm, 0.999)
dim(datadtm)
## [1] 13691 1128
Now we can work with our model without difficulties and effectively.
Preparing the DTM.
dataset <- as.data.frame(as.matrix(datadtm))
colnames(dataset) <- make.names(colnames(dataset))
dataset$airline_sentiment <- tweets$airline_sentiment
str(dataset$airline_sentiment)
## chr [1:13691] "negative" "positive" "negative" "negative" "negative" ...
Convert airline_sentiment to factor.
dataset$airline_sentiment <- as.factor(dataset$airline_sentiment)
set.seed(222)
split = sample(2,nrow(dataset),prob = c(0.8,0.2),replace = TRUE)
train_set = dataset[split == 1,]
test_set = dataset[split == 2,]
train_set[4:6,57:59]
## market passeng servic
## 7 0 0 0
## 8 0 0 0
## 9 1 1 1
test_set[4:6,57:59]
## market passeng servic
## 15 0 0 0
## 21 0 0 0
## 26 0 0 0
Let’s compare the class proportions of the training and test sets to those of the full dataset, to confirm that they are similar.
prop.table(table(train_set$airline_sentiment))
##
## negative neutral positive
## 0.6468445 0.1976141 0.1555414
prop.table(table(test_set$airline_sentiment))
##
## negative neutral positive
## 0.6335793 0.2022140 0.1642066
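The split used here samples rows at random, so the class proportions of the two sets are close but not identical. If identical proportions were wanted, a stratified split could be used instead. The following is a minimal base R sketch on made-up labels; `labels`, `train_idx` and the 64/20/16 class counts are assumptions for illustration, not the project's data.

```r
set.seed(222)

# Toy labels with roughly the same imbalance as the tweets (illustrative only)
labels <- factor(rep(c("negative", "neutral", "positive"), times = c(64, 20, 16)))

# Within each class, sample 80% of the row indices for training
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

# Class proportions are now (almost) the same in both sets
print(prop.table(table(train_labels)))
print(prop.table(table(test_labels)))
```

With 13691 tweets the random split is already close enough, so this is an optional refinement rather than a correction.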
The data is imbalanced towards negative tweets. Thus, the machine learning algorithms will likely predict negative tweets more accurately than positive and neutral ones.
A model that always predicts negative would already be right about 63% to 65% of the time, so any useful model should reach an accuracy above 65%.
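The baseline can be read directly from the class proportions. A minimal sketch, using the test-set proportions printed above (the variable names are introduced here for illustration):

```r
# Class proportions of the test set, copied from the output above
test_props <- c(negative = 0.6335793, neutral = 0.2022140, positive = 0.1642066)

# A classifier that always predicts the majority class ...
baseline_class <- names(which.max(test_props))

# ... is correct exactly as often as that class occurs
baseline_acc <- max(test_props)

cat("Always predicting", baseline_class, "gives accuracy", round(baseline_acc, 3), "\n")
```

This is the same quantity that caret reports later as the "No Information Rate".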
CART stands for classification and regression trees. In our case it will be a classification tree because the target is categorical. One benefit of a decision tree is that it is easy to interpret and visualize.
To train the model, we will be using rpart function from rpart package. Once the model is trained we will test using the predict function.
dt_classifier <- rpart(airline_sentiment ~ ., data= train_set, method="class", minbucket= 25)
rpart.plot(dt_classifier)
summary(dt_classifier)
## Call:
## rpart(formula = airline_sentiment ~ ., data = train_set, method = "class",
## minbucket = 25)
## n= 10981
##
## CP nsplit rel error xerror xstd
## 1 0.124033 0 1.000000 1.000000 0.01291505
## 2 0.010000 1 0.875967 0.875967 0.01249017
##
## Variable importance
## thank
## 100
##
## Node number 1: 10981 observations, complexity param=0.124033
## predicted class=negative expected loss=0.3531555 P(node) =1
## class counts: 7103 2170 1708
## probabilities: 0.647 0.198 0.156
## left son=2 (9693 obs) right son=3 (1288 obs)
## Primary splits:
## thank < 0.5 to the left, improve=553.91500, (0 missing)
## hour < 0.5 to the right, improve=130.94910, (0 missing)
## great < 0.5 to the left, improve=115.59640, (0 missing)
## usairway < 0.5 to the right, improve= 87.53977, (0 missing)
## hold < 0.5 to the right, improve= 77.01616, (0 missing)
## Surrogate splits:
## safe < 1.5 to the left, agree=0.883, adj=0.002, (0 split)
## prompt < 0.5 to the left, agree=0.883, adj=0.002, (0 split)
## philli < 1.5 to the left, agree=0.883, adj=0.001, (0 split)
##
## Node number 2: 9693 observations
## predicted class=negative expected loss=0.299804 P(node) =0.8827065
## class counts: 6787 1995 911
## probabilities: 0.700 0.206 0.094
##
## Node number 3: 1288 observations
## predicted class=positive expected loss=0.3812112 P(node) =0.1172935
## class counts: 316 175 797
## probabilities: 0.245 0.136 0.619
‘thank’ is the most important term for separating negative from positive tweets.
There is only one split in the tree, based on the condition thank < 0.5, i.e. whether the stem ‘thank’ is absent from the tweet.
- If ‘thank’ is mentioned, we move to the right and predict positive.
- If it is not mentioned, we predict negative.
Of the 88% of tweets without ‘thank’, 70% are negative, 21% neutral and only 9% positive.
Of the 12% of tweets that contain ‘thank’, 25% are negative, 14% neutral and 62% positive.
Because the tree has only two leaves, it can never predict the neutral class, which severely limits its usefulness.
Before moving on to the Random Forest, let’s check how the tree performs on the training and test sets.
dt_predict1 <- predict(dt_classifier, newdata=train_set, type="class")
accuracy(dt_predict1,train_set$airline_sentiment)
## [1] 0.6906475
It looks promising considering our baseline is 0.65.
To understand how good the classifier is, we will predict sentiments in test data set.
dt_pred = predict(dt_classifier, newdata=test_set, type="class")
A confusion matrix will give us metrics such as accuracy, sensitivity and specificity.
confusionMatrix(table(dt_pred,test_set$airline_sentiment))
## Confusion Matrix and Statistics
##
##
## dt_pred negative neutral positive
## negative 1649 516 261
## neutral 0 0 0
## positive 68 32 184
##
## Overall Statistics
##
## Accuracy : 0.6764
## 95% CI : (0.6584, 0.694)
## No Information Rate : 0.6336
## P-Value [Acc > NIR] : 1.705e-06
##
## Kappa : 0.2213
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: negative Class: neutral Class: positive
## Sensitivity 0.9604 0.0000 0.4135
## Specificity 0.2175 1.0000 0.9558
## Pos Pred Value 0.6797 NaN 0.6479
## Neg Pred Value 0.7606 0.7978 0.8924
## Prevalence 0.6336 0.2022 0.1642
## Detection Rate 0.6085 0.0000 0.0679
## Detection Prevalence 0.8952 0.0000 0.1048
## Balanced Accuracy 0.5890 0.5000 0.6847
The result shows that our decision tree model has an accuracy of 0.6764 on the test dataset, meaning that 67.64% of the test tweets are correctly classified.
The sensitivity of 0.9604 for the negative class means that 96.04% of negative tweets were correctly classified. This is high mainly because the model predicts negative for almost every tweet: the specificity of 0.2175 means that only 21.75% of non-negative tweets were correctly recognized as not negative.
The sensitivity of 0 for the neutral class means that none of the neutral tweets were classified correctly. The specificity of 1 means that no non-neutral tweet was labeled neutral, which is trivially true because the model never predicts neutral.
The sensitivity of 0.4135 for the positive class means that only 41.35% of positive tweets were correctly classified, which is not good. The specificity of 0.9558 means that 95.58% of non-positive tweets were correctly recognized as not positive. This is expected because we already know that the data is imbalanced towards the negative class.
This model is of little use because it never predicts the neutral class.
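To show where these numbers come from, the statistics can be recomputed by hand from the decision tree's confusion matrix. This is a minimal sketch in base R; the counts are copied from the output above, and the variable names are introduced here for illustration.

```r
# Confusion matrix of the decision tree: rows = predicted, columns = actual
cm <- matrix(
  c(1649, 0, 68,    # actual negative
    516,  0, 32,    # actual neutral
    261,  0, 184),  # actual positive
  nrow = 3,
  dimnames = list(
    predicted = c("negative", "neutral", "positive"),
    actual    = c("negative", "neutral", "positive")
  )
)

# Accuracy: correctly classified tweets over all tweets
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: true positives over all actual members of the class
sensitivity <- diag(cm) / colSums(cm)

# Specificity per class: true negatives over all actual non-members
true_neg    <- sum(cm) - rowSums(cm) - colSums(cm) + diag(cm)
specificity <- true_neg / (sum(cm) - colSums(cm))

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(specificity, 4))
```

The same three formulas apply to the confusion matrices of the other models below.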
The random forest algorithm reduces overfitting and can deal with a large number of features. It works by building a large number of CART trees. Each tree is trained on a random sample of the observations and can split only on a random subset of the variables. Each tree votes on the outcome, and the class that receives the majority of the votes is the prediction.
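The voting step can be illustrated with a small base R sketch. The matrix `votes` below is made up: each row stands for one tree and each column for one tweet. It only demonstrates the majority vote, not how the trees themselves are grown.

```r
# Made-up predictions of three trees (rows) for three tweets (columns)
votes <- rbind(
  tree1 = c("negative", "positive", "neutral"),
  tree2 = c("negative", "positive", "negative"),
  tree3 = c("neutral",  "positive", "neutral")
)

# For each tweet, pick the class that received the most votes
majority <- apply(votes, 2, function(v) names(which.max(table(v))))
print(majority)
```

With an even number of trees ties are possible; the sketch simply takes the first class in alphabetical order in that case, which is an arbitrary choice of this example.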
I will train this model with 20 trees so it won’t take hours to run.
To train the model, we will be using randomForest function from randomForest package.
rf_classifier = randomForest(airline_sentiment ~., data=train_set, ntree = 20)
rf_classifier
##
## Call:
## randomForest(formula = airline_sentiment ~ ., data = train_set, ntree = 20)
## Type of random forest: classification
## Number of trees: 20
## No. of variables tried at each split: 33
##
## OOB estimate of error rate: 25.61%
## Confusion matrix:
## negative neutral positive class.error
## negative 6346 513 244 0.1065747
## neutral 1092 863 214 0.6021208
## positive 512 237 958 0.4387815
As expected, the output notes that the random forest included 20 trees and tried 33 variables at each split.
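Both numbers in this output can be checked by hand. For classification, randomForest's default number of variables tried at each split is the floor of the square root of the number of predictors, and the OOB error is the share of off-diagonal counts in the OOB confusion matrix. A minimal sketch, with the counts copied from the output above:

```r
# Default mtry for classification: floor(sqrt(number of predictors))
p <- 1128
mtry_default <- floor(sqrt(p))
print(mtry_default)

# OOB confusion matrix: rows = actual class, columns = predicted class
oob <- matrix(
  c(6346, 1092, 512,   # predicted negative
    513,  863,  237,   # predicted neutral
    244,  214,  958),  # predicted positive
  nrow = 3,
  dimnames = list(
    actual    = c("negative", "neutral", "positive"),
    predicted = c("negative", "neutral", "positive")
  )
)

# Overall OOB error and per-class error
oob_error   <- 1 - sum(diag(oob)) / sum(oob)
class_error <- 1 - diag(oob) / rowSums(oob)

print(round(100 * oob_error, 2))
print(round(class_error, 4))
```

The per-class errors make the imbalance visible: about 11% of negative tweets are misclassified, against roughly 60% of neutral ones.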
rf_predict1 <- predict(rf_classifier, newdata=train_set, type="class")
accuracy(rf_predict1,train_set$airline_sentiment)
## [1] 0.9271469
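The training accuracy of 0.93 is far above our baseline of 0.65, but it is optimistic because the model is scored on the data it was trained on. The OOB error of 25.61% reported above is a more realistic estimate.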
Predicting the Test set results.
rf_pred = predict(rf_classifier, test_set,"class")
confusionMatrix(table(rf_pred, test_set$airline_sentiment))
## Confusion Matrix and Statistics
##
##
## rf_pred negative neutral positive
## negative 1565 277 138
## neutral 113 218 63
## positive 39 53 244
##
## Overall Statistics
##
## Accuracy : 0.748
## 95% CI : (0.7312, 0.7642)
## No Information Rate : 0.6336
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4828
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: negative Class: neutral Class: positive
## Sensitivity 0.9115 0.39781 0.54831
## Specificity 0.5821 0.91859 0.95938
## Pos Pred Value 0.7904 0.55330 0.72619
## Neg Pred Value 0.7918 0.85751 0.91533
## Prevalence 0.6336 0.20221 0.16421
## Detection Rate 0.5775 0.08044 0.09004
## Detection Prevalence 0.7306 0.14539 0.12399
## Balanced Accuracy 0.7468 0.65820 0.75385
The result shows that our random forest model has an accuracy of 0.748 on the test dataset, meaning that 74.8% of the test tweets are correctly classified.
The sensitivity of 0.9115 for the negative class means that 91.15% of negative tweets were correctly classified. Similarly, the specificity of 0.5821 means that 58.21% of non-negative tweets were correctly recognized as not negative. The model did well at classifying the negative class.
The sensitivity of 0.39781 for the neutral class means that 39.781% of neutral tweets were correctly classified. Similarly, the specificity of 0.91859 means that 91.859% of non-neutral tweets were correctly recognized as not neutral. The model did not do well at classifying the neutral class.
The sensitivity of 0.54831 for the positive class means that 54.831% of positive tweets were correctly classified. The specificity of 0.95938 means that 95.938% of non-positive tweets were correctly recognized as not positive.
Multinomial Logistic Regression is used to predict outcomes with more than two classes. In our case, that is whether the sentiment of the tweet is positive, neutral or negative.
To train the model, we will be using multinom function from nnet package.
# MaxNWts in the nnet package controls the maximum number of weights.
lg_classifier <- multinom(airline_sentiment ~., data=train_set, MaxNWts =4000)
## # weights: 3390 (2258 variable)
## initial value 12063.861542
## iter 10 value 7088.270172
## iter 20 value 5622.751391
## iter 30 value 4401.387985
## iter 40 value 4040.951820
## iter 50 value 3882.801950
## iter 60 value 3812.961155
## iter 70 value 3790.325497
## iter 80 value 3775.888125
## iter 90 value 3766.529837
## iter 100 value 3761.480024
## final value 3761.480024
## stopped after 100 iterations
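Two details of this trace are worth a note. First, the weight counts follow from the number of predictors and classes; the formulas below are my reading of how nnet counts weights for a multinom fit, so treat them as an assumption that happens to reproduce the reported numbers. Second, "stopped after 100 iterations" means the optimizer hit nnet's default `maxit = 100` rather than reporting convergence, so a larger `maxit` may be worth trying.

```r
p <- 1128  # number of predictor terms in train_set
K <- 3     # number of sentiment classes

# Assumed bookkeeping: (predictors + intercept + bias) weights per class
total_weights <- (p + 2) * K

# Only (predictors + intercept) weights for K - 1 classes are free,
# because one class serves as the reference level
free_weights <- (p + 1) * (K - 1)

cat("# weights:", total_weights, "(", free_weights, "variable )\n")

# Hypothetical refit with more iterations (the value 500 is an arbitrary choice):
# lg_classifier <- multinom(airline_sentiment ~ ., data = train_set,
#                           MaxNWts = 4000, maxit = 500)
```

This also explains the `MaxNWts = 4000` setting: it has to be at least as large as the 3390 weights the model needs.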
lg_predict1 <- predict(lg_classifier, newdata=train_set, type="class")
accuracy(lg_predict1,train_set$airline_sentiment)
## [1] 0.8654039
It looks promising considering our baseline is 0.65.
Predicting the test set results.
lg_pred <- predict(lg_classifier, newdata = test_set, type = "class")
confusionMatrix(table(lg_pred,test_set$airline_sentiment))
## Confusion Matrix and Statistics
##
##
## lg_pred negative neutral positive
## negative 1468 190 79
## neutral 160 289 82
## positive 89 69 284
##
## Overall Statistics
##
## Accuracy : 0.7531
## 95% CI : (0.7364, 0.7693)
## No Information Rate : 0.6336
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.532
##
## Mcnemar's Test P-Value : 0.2322
##
## Statistics by Class:
##
## Class: negative Class: neutral Class: positive
## Sensitivity 0.8550 0.5274 0.6382
## Specificity 0.7291 0.8881 0.9302
## Pos Pred Value 0.8451 0.5443 0.6425
## Neg Pred Value 0.7441 0.8811 0.9290
## Prevalence 0.6336 0.2022 0.1642
## Detection Rate 0.5417 0.1066 0.1048
## Detection Prevalence 0.6410 0.1959 0.1631
## Balanced Accuracy 0.7920 0.7077 0.7842
The result shows that our multinomial logistic regression model has an accuracy of 0.7531 on the test dataset, meaning that 75.31% of the test tweets are correctly classified.
The sensitivity and specificity of the negative class are 0.8550 and 0.7291. This indicates that 85.50% of negative tweets are correctly classified and that 72.91% of non-negative tweets are correctly recognized as not negative. The model did well at classifying the negative class.
The sensitivity and specificity of the neutral class are 0.5274 and 0.8881. This indicates that 52.74% of neutral tweets are correctly classified and that 88.81% of non-neutral tweets are correctly recognized as not neutral. The prediction of the neutral class is not as strong as that of the negative class.
The sensitivity and specificity of the positive class are 0.6382 and 0.9302. This indicates that 63.82% of positive tweets are correctly classified and that 93.02% of non-positive tweets are correctly recognized as not positive.
The Random Forest predicted the outcomes noticeably better than the Decision Tree classifier (0.748 versus 0.6764 test accuracy). The Multinomial Logistic Regression model gave the best accuracy (0.7531), although its margin over the Random Forest is small and their 95% confidence intervals overlap.
As for sensitivity, the Decision Tree classified negative tweets correctly more often than the Random Forest and the Multinomial Logistic Regression. As a result, its sensitivity for the negative class is the highest and its specificity the lowest. However, it performed worse at classifying neutral and positive tweets, whereas the Multinomial Logistic Regression did best on these two classes.
We can conclude that the Multinomial Logistic Regression model is the best of the three models for predicting the sentiment of tweets, with an accuracy of 75.31%.
In this work we extracted several pieces of information from the given dataset:
26% of tweets were about United Airlines, which was also the most complained-about airline, mainly due to bad service and late flights.
More than 60% of the tweets expressed negative emotions.
The longer the tweet the more it expresses negative feelings.
As for the sentiment classification we found that the Multinomial Logistic Regression model did best at predicting the sentiments.