1. Data Description

The dataset used in this project is available on Kaggle and was originally collected by Crowdflower’s Data for Everyone library.

The Twitter data was scraped in February 2015. It contains tweets about six major United States (US) airlines.

The dataset contains 14640 instances, which are tweets submitted by individual travelers, and 15 features. Each instance is labeled as positive, negative or neutral.

Features description:

tweet_id: A numeric feature that gives the ID of the tweet.
airline_sentiment: A categorical feature containing the label of the tweet: positive, negative or neutral.
airline_sentiment_confidence: A numeric feature representing the confidence level of classifying the tweet into one of the 3 classes.
negativereason: A categorical feature giving the reason why the tweet is considered negative.
negativereason_confidence: The level of confidence in the negative reason assigned to a negative tweet.
airline: Name of the airline company.
airline_sentiment_gold
name: The username of the tweet’s author.
negativereason_gold
retweet_count: Number of retweets of a tweet.
text: Original tweet posted by the user.
tweet_coord: The coordinates of the tweet.
tweet_created: The date and the time of tweet.
tweet_location: From where the tweet was posted.
user_timezone: The timezone of the user.

In this work we want to determine which airlines are tweeted about the most and the reasons behind the negative tweets. Then we will look at the most frequent words used in the tweets using the WordCloud technique. Finally, we will predict the sentiment of a tweet from the tweet text alone, using various machine learning algorithms.

2. Exploratory Data Analysis

Loading libraries

library(tidyverse) # for piping and select()
library(textcat) # to determine the language of the tweets
library(ggplot2) # visualizing interesting and interactive graphs
library(tm) # text mining package to create corpus
library(SnowballC) # for cleaning the tweets
library(wordcloud) # to create WordClouds which give the most frequent words
library(dplyr) # for data manipulation
library(rpart) # build decision tree model  
library(rpart.plot) # plot the decision tree
library(randomForest) # RandomForest model
library(caret) # for confusion matrix
library(nnet) # for multinomial logistic regression
library(Metrics) # for accuracy function
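library(treemapify) # treemap geoms for ggplot2
library(lubridate) # for ymd_hms function
library(gganimate) # for graph animation
library(gifski) # GIF renderer used by gganimate
library(av) # video renderer used by gganimate
library(gapminder) # example dataset, not used in the analysis below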

Loading the data.

setwd("C:/Users/jihen/Desktop/Nihel")
ustweets <- read.csv('tweets.csv', header = TRUE)

Let’s have a look at the structure of the data.

str(ustweets)

## 'data.frame':    14640 obs. of  15 variables:
##  $ tweet_id                    : num  5.7e+17 5.7e+17 5.7e+17 5.7e+17 5.7e+17 ...
##  $ airline_sentiment           : chr  "neutral" "positive" "neutral" "negative" ...
##  $ airline_sentiment_confidence: num  1 0.349 0.684 1 1 ...
##  $ negativereason              : chr  "" "" "" "Bad Flight" ...
##  $ negativereason_confidence   : num  NA 0 NA 0.703 1 ...
##  $ airline                     : chr  "Virgin America" "Virgin America" "Virgin America" "Virgin America" ...
##  $ airline_sentiment_gold      : chr  "" "" "" "" ...
##  $ name                        : chr  "cairdin" "jnardino" "yvonnalynn" "jnardino" ...
##  $ negativereason_gold         : chr  "" "" "" "" ...
##  $ retweet_count               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ text                        : chr  "@VirginAmerica What @dhepburn said." "@VirginAmerica plus you've added commercials to the experience... tacky." "@VirginAmerica I didn't today... Must mean I need to take another trip!" "@VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces &amp; they hav"| __truncated__ ...
##  $ tweet_coord                 : chr  "" "" "" "" ...
##  $ tweet_created               : chr  "2015-02-24 11:35:52 -0800" "2015-02-24 11:15:59 -0800" "2015-02-24 11:15:48 -0800" "2015-02-24 11:15:36 -0800" ...
##  $ tweet_location              : chr  "" "" "Lets Play" "" ...
##  $ user_timezone               : chr  "Eastern Time (US & Canada)" "Pacific Time (US & Canada)" "Central Time (US & Canada)" "Pacific Time (US & Canada)" ...

# View(ustweets)
ustweets[3:4,]

##       tweet_id airline_sentiment airline_sentiment_confidence negativereason
## 3 5.703011e+17           neutral                       0.6837               
## 4 5.703010e+17          negative                       1.0000     Bad Flight
##   negativereason_confidence        airline airline_sentiment_gold       name
## 3                        NA Virgin America                        yvonnalynn
## 4                    0.7033 Virgin America                          jnardino
##   negativereason_gold retweet_count
## 3                                 0
## 4                                 0
##                                                                                                                             text
## 3                                                        @VirginAmerica I didn't today... Must mean I need to take another trip!
## 4 @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse
##   tweet_coord             tweet_created tweet_location
## 3             2015-02-24 11:15:48 -0800      Lets Play
## 4             2015-02-24 11:15:36 -0800               
##                user_timezone
## 3 Central Time (US & Canada)
## 4 Pacific Time (US & Canada)

The reviews need to be cleaned for further analysis.

Let’s randomize our data using the sample() command in case it’s not randomized.

set.seed(1912)  
tweets <- ustweets[sample(nrow(ustweets)),]

Extract relevant features.

tweets <- tweets %>% 
  select(airline_sentiment, negativereason, airline, text, tweet_location)

Checking for missing values.

sum(is.na(tweets))

## [1] 0

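The selected columns contain no NA values. Note, however, that is.na() does not count empty strings, and both negativereason and tweet_location contain many of them (see the str() output above).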
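As a minimal sketch of the difference between NA values and empty strings, the toy data frame below (made-up values, only standing in for the selected tweet columns) counts both separately using base R:

```r
# Toy data frame standing in for the selected tweet columns (values are made up)
toy <- data.frame(
  negativereason = c("", "Bad Flight", "", "Late Flight"),
  tweet_location = c("", "Boston, MA", NA, "New York, NY"),
  stringsAsFactors = FALSE
)

# is.na() only sees real NA values
na_counts <- colSums(is.na(toy))

# Empty strings need an explicit check; %in% treats NA as "not empty"
empty_counts <- sapply(toy, function(col) sum(col %in% ""))

na_counts
empty_counts
```

The same two calls can be applied to the tweets data frame to see how many fields are blank rather than NA.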

Plot of Airlines

ggplot(tweets) + aes(airline, fill = airline) + geom_bar() + geom_text(stat = 'count', aes(label = after_stat(count)), vjust = 1.6, color = "black") + labs(title = 'Plot of Airlines') + theme_minimal() + guides(fill = "none")

As we can see, there are six different airlines: American, Delta, Southwest, United, US Airways and Virgin America. The most tweeted-about airline is United, and the least is Virgin America.

Let’s assume that passengers will go through the trouble of tweeting about an airline only if they were highly impressed or highly disappointed.

Most frequent tweets by location.

tweets %>%
  group_by(tweet_location) %>%
  summarise(Count = n()) %>% 
  top_n(5, Count)

## # A tibble: 5 x 2
##   tweet_location   Count
##   <chr>            <int>
## 1 ""                4733
## 2 "Boston, MA"       157
## 3 "New York"         127
## 4 "New York, NY"     156
## 5 "Washington, DC"   150

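4733 tweets have no location. Among the tweets with a location, "Boston, MA" is the most frequent single value (157). However, tweet_location is a free-text field, so "New York" and "New York, NY" are counted separately; together they account for 283 tweets.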

We also notice that most of the tweets are from important cities in the US. So I would assume that many of the plane passengers are professionals traveling for work and not for pleasure. Thus, they expect their flight to be on time and to get good service.

3. Text Pre-processing and Sentiment Analysis

Checking the language

Let’s have a look at what language the tweets are written in.

tweets$language = textcat(tweets$text)

ggplot(tweets, aes(x = language)) + geom_bar(stat = "count", fill = "lightblue") + ggtitle("Language Count") + xlab("Language") + ylab("Count") + coord_flip() + theme_minimal()

As expected, English is the most common language since we are dealing with US airlines. Scots also appears very frequently. Since Scots is the language closest to English and textcat classifies text by comparing character n-gram profiles, these are most likely English tweets that were misclassified.

I will keep only the tweets detected as English or Scots for the text analysis.

tweets = subset(tweets, language =='english' | language =='scots')
dim(tweets)

## [1] 13691     6

Text Pre-processing

Building and cleaning corpus with the help of tm package.

A corpus is a collection of documents, so each tweet will be treated as a document. We create a VectorSource, which is the input type for the Corpus function defined in the tm package.

corpus0 <- iconv(tweets$text, to = "UTF-8")
corpus <- Corpus(VectorSource(corpus0))
inspect(corpus[1:5])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] @USAirways thx for delayed checkin on noon flt to mia then forcing me to take Late Flightr flts which were delayed. Cincy to Miami in 13 hrs! #fast
## [2] @VirginAmerica just got on the 1pm in Newark home to LA. Your folks at EWR are incredible #letsgohome                                              
## [3] @SouthwestAir Am I flying on Spirit air?                                                                                                           
## [4] @AmericanAir I need to go to YYZ tmr morning 8am. I switched to United already, but my bag is still off in AA la-la land.                          
## [5] @USAirways depart from where? It departed PHX and was due to land at DCA around 8pm.

First we put everything in lowercase. Then we remove punctuation, numbers and English stop words (common words), strip white space, and stem each word.
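Before going through the tm steps one by one, here is a rough base R sketch of the same pipeline applied to a single string. It is only an illustration: clean_tweet and toy_stop_words are hypothetical names, the stop-word list is a tiny made-up subset, and stemming is left out because it needs the SnowballC package.

```r
# Hypothetical helper mirroring the cleaning steps with base R regular expressions
clean_tweet <- function(x, stop_words) {
  x <- tolower(x)                      # lower case
  x <- gsub("[[:punct:]]", "", x)      # remove punctuation
  x <- gsub("[[:digit:]]", "", x)      # remove numbers
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)  # remove whole stop words only
  x <- gsub("\\s+", " ", x)            # collapse repeated white space
  trimws(x)
}

# Tiny made-up stop-word list (the real analysis uses stopwords('english'))
toy_stop_words <- c("for", "on", "to", "then", "me", "which", "were", "in")

cleaned <- clean_tweet(
  "@USAirways thx for delayed checkin on noon flt to mia. Cincy to Miami in 13 hrs! #fast",
  toy_stop_words
)
cleaned
```

The word-boundary pattern matters: without it, removing "on" or "in" would also damage words such as "noon" and "checkin".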

Convert text to lower case.

corpus <- tm_map(corpus, tolower)
inspect(corpus[1])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] @usairways thx for delayed checkin on noon flt to mia then forcing me to take late flightr flts which were delayed. cincy to miami in 13 hrs! #fast

Remove all punctuation.

corpus <- tm_map(corpus, removePunctuation)
inspect(corpus[1])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] usairways thx for delayed checkin on noon flt to mia then forcing me to take late flightr flts which were delayed cincy to miami in 13 hrs fast

Remove all numbers.

corpus <- tm_map(corpus, removeNumbers)
inspect(corpus[1])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] usairways thx for delayed checkin on noon flt to mia then forcing me to take late flightr flts which were delayed cincy to miami in  hrs fast

Remove common English words (stop words).

corpus <- tm_map(corpus, removeWords, stopwords('english'))
inspect(corpus[1])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] usairways thx  delayed checkin  noon flt  mia  forcing   take late flightr flts   delayed cincy  miami   hrs fast

# also remove frequent words that carry no sentiment in this context
corpus <- tm_map(corpus, removeWords, c('flight','get','plane','flights','flightl', 'i', 'day', 'im', 'cant', 'can', 'now', 'just', 'will', 'dont', 'ive', 'got', 'much'))

Get rid of white space.

corpus <- tm_map(corpus, stripWhitespace)
inspect(corpus[1])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] usairways thx delayed checkin noon flt mia forcing take late flightr flts delayed cincy miami hrs fast

Stem the documents, i.e. reduce each word to its base form.

corpus <- tm_map(corpus, stemDocument)
inspect(corpus[1])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] usairway thx delay checkin noon flt mia forc take late flightr flts delay cinci miami hrs fast

Creating the Bag of Words model

A TermDocumentMatrix (TDM) converts unstructured text data into a structured matrix of rows and columns, where the rows are the terms (words) and the columns are the documents (tweets).

The TDM identifies every term in the documents and counts how many times each term occurs in each document.

TDM <- TermDocumentMatrix(corpus)
TDM

## <<TermDocumentMatrix (terms: 11286, documents: 13691)>>
## Non-/sparse entries: 123385/154393241
## Sparsity           : 100%
## Maximal term length: 62
## Weighting          : term frequency (tf)
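To make the rows-as-terms, columns-as-documents layout concrete, the following base R sketch builds a tiny term-document matrix by hand. The three documents are made-up, already cleaned strings; this is not how tm implements it internally, only the same idea on a small scale.

```r
# Three made-up, already cleaned "tweets"
docs <- c("delay delay checkin", "thank great servic", "delay servic")

# Split each document into its terms
tokens <- strsplit(docs, " ", fixed = TRUE)

# The vocabulary: every distinct term, sorted
terms <- sort(unique(unlist(tokens)))

# Count each vocabulary term in each document (one column per document)
tdm <- sapply(tokens, function(tok) table(factor(tok, levels = terms)))
dimnames(tdm) <- list(Terms = terms, Docs = as.character(seq_along(docs)))

tdm
rowSums(tdm)  # total frequency of each term across all documents
```

Most cells are zero even in this toy example, which is why the real matrix is reported as almost entirely sparse.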

The TDM contains 11286 terms and 13691 documents, with a sparsity that rounds to 100%.

Terms that appear in only a few tweets carry little information, and we want to remove them. So we will use the function removeSparseTerms from the tm package with sparse = 0.999, which drops every term that is absent from more than 99.9% of the tweets.

TDMS <- removeSparseTerms(TDM, sparse = 0.999)
TDMS

## <<TermDocumentMatrix (terms: 1128, documents: 13691)>>
## Non-/sparse entries: 102801/15340647
## Sparsity           : 99%
## Maximal term length: 17
## Weighting          : term frequency (tf)

TDM_matrix <- as.matrix(TDMS)
TDM_matrix[1:7, 1:7]

##          Docs
## Terms     1 2 3 4 5 6 7
##   checkin 1 0 0 0 0 0 0
##   delay   2 0 0 0 0 0 0
##   fast    1 0 0 0 0 0 0
##   flightr 1 0 0 0 0 0 0
##   flt     1 0 0 0 0 0 0
##   forc    1 0 0 0 0 0 0
##   hrs     1 0 0 0 0 0 0

Now our TDM is composed of 1128 terms and 13691 documents.
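The effect of a sparsity threshold can be sketched in base R on a toy matrix. This is an illustration of the idea under the assumption that a term's sparsity is the share of documents in which it does not occur; toy_tdm and max_sparsity are hypothetical names and values, and the threshold is set to 0.7 only so that the tiny example filters something.

```r
# Toy term-document matrix: 3 terms (rows) x 4 documents (columns), made-up counts
toy_tdm <- matrix(
  c(1, 0, 0, 0,
    1, 2, 0, 1,
    0, 0, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(Terms = c("cinci", "delay", "philli"), Docs = as.character(1:4))
)

max_sparsity <- 0.7  # hypothetical threshold for this toy example

# Number of documents containing each term
doc_freq <- rowSums(toy_tdm > 0)

# Share of documents in which the term is absent
term_sparsity <- 1 - doc_freq / ncol(toy_tdm)

# Keep only terms that are less sparse than the threshold
reduced <- toy_tdm[term_sparsity < max_sparsity, , drop = FALSE]
reduced
```

With 13691 documents and a threshold of 0.999, the same rule keeps terms that occur in at least 14 tweets.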

Extract most frequent terms

Count how often each word appears across all tweets.

w <- rowSums(TDM_matrix)

Plot of words with a frequency of at least 500

s <- subset(w,w>=500)
v <- sort(s, decreasing = T)
barplot(v, col = rainbow(38), las = 2, main = "Plot of words with frequency of at least 500", xlab = "Words", ylab = "Frequency", border = "black")

Most of the discussion is about United, American, US Airways and Southwest, since each of them is mentioned more than 2000 times.

Also, we see that help, cancel, hour, thank, time, delay and jetblue are each mentioned more than 1000 times.

WordCloud

The WordCloud is a technique for visualizing important words right away.

wc <-  sort(w, decreasing = TRUE)
set.seed(222)
wordcloud(words= names(wc),
          freq= wc,
          max.words= 200,
          random.order=F,
          colors = brewer.pal(12, 'Paired'),
          rot.per= 0.3)

Besides the four airlines mentioned above, most tweets are about delay, hour, cancel, time, help, custom (the stem of customer) and service.

These words suggest that many of the reviews about the airlines are not good.

Comparison of Corpus

Now I will be comparing the corpus of all three sentiments.

Subsetting the sentiments.

pos_tweets <- subset(tweets$text , tweets$airline_sentiment=="positive")
neg_tweets <- subset(tweets$text , tweets$airline_sentiment=="negative")
neu_tweets <- subset(tweets$text , tweets$airline_sentiment=="neutral")

Paste and collapse positive, negative and neutral tweets.

pos_terms <- paste(pos_tweets , collapse =" ")
neg_terms <- paste(neg_tweets , collapse =" ")
neu_terms <- paste(neu_tweets , collapse =" ")

Combine the positive, negative and neutral terms.

all_terms <- c(pos_terms, neg_terms, neu_terms)

Building the corpus and creating the TDM.

all_corpus <- VCorpus(VectorSource(all_terms))

all_tdm <- TermDocumentMatrix(all_corpus, control = list(
  removePunctuation = TRUE, removeNumbers = TRUE, tolower = TRUE,
  # a control list can use only one `stopwords` entry, so both word lists are combined;
  # stemming (control option `stemming`) is left off so the cloud shows whole words
  stopwords = c(stopwords('english'), 'flight','get','plane','flights','flightl', 'i', 'day', 'im', 'cant', 'can', 'now', 'just', 'will', 'dont', 'ive', 'got', 'much')))

all_tdm_m <- as.matrix(all_tdm)

Comparison WordCloud.

colnames(all_tdm_m) <- c("positive","negative", "neutral")

all_term_freq <- rowSums(all_tdm_m)
all_term_freq <- sort(all_term_freq,TRUE)

comparison.cloud(
  all_tdm_m, 
  max.words = 100,
  colors = c("#00BA5D", "#E83B20","#00468b")
)

The result shows that the words expressing positive emotions include thanks, good, great, better and awesome. We can also see that Southwest and Virgin America are mentioned here. This suggests that the passengers who flew with Southwest and Virgin America were pleased with their flight: the service was good, their luggage was unharmed, the plane was on time, etc.

The negative tweets feature words such as delayed, hold, canceled, lost, late, hours, waiting and service. Furthermore, US Airways and American Airlines appear here. I would say that the passengers who wrote these negative reviews must have faced problems while flying with these airlines, such as delayed or canceled flights, waiting times that were longer than expected, bad service, or lost or damaged luggage.

Sentiment Analysis

# % of all sentiments
round(prop.table(table(tweets$airline_sentiment)),3)

## 
## negative  neutral positive 
##    0.644    0.199    0.157

ggplot(tweets) + aes(airline_sentiment, fill = airline_sentiment) + geom_bar() + geom_text(stat = 'count', aes(label = after_stat(count)), vjust = 1.6, color = "white") + labs(title = 'Plot of Sentiments') +
  scale_fill_manual(values = c("#ff576a","#3db5ff", "#66CC99")) + theme_minimal() + guides(fill = "none")

Overall sentiments: 64.4% of the tweets were negative, 19.9% were neutral and 15.7% were positive. The data is therefore imbalanced towards the negative class, which suggests that people are more likely to tweet about their flight when something bad happened.

Plot of Tweet Sentiment by Airline

ggplot(tweets, aes(x = airline, fill = airline_sentiment)) + geom_bar(colour = 'black') + scale_fill_manual(values = c("#ff576a","#3db5ff", "#66CC99")) + labs(x = 'Airline', y = 'Count', title = 'Tweet Sentiment by Airline') + theme_minimal() + theme(axis.text.x = element_text(angle = 25, size = 9)) + guides(fill = "none")

We notice here that every airline received more negative feedback than positive or neutral feedback. Since United is the most tweeted-about airline, it also receives the most negative feedback.

Virgin America has the fewest negative tweets, and its three sentiments are proportionally close to each other.

Plot of Negative Reasons

plotdata <- tweets %>%
  count(negativereason)

ggplot(plotdata, 
       aes(fill = negativereason, 
           area = n, 
           label = negativereason)) +
  geom_treemap() + 
  geom_treemap_text(colour = "white", 
                    place = "centre", size=20) +
  labs(title = "Negative reason ") +
  theme(legend.position = "none")

Close to 5000 tweets have no negative reason because they are positive or neutral (the unlabeled block in the treemap). The remaining tweets are negative, and there are multiple reasons behind these bad reviews.

Overall, the most common negative reasons are customer service issues, with the highest share, followed by late flights. The least common reason is damaged luggage.

Plot of Negative Reason per Airline

ggplot(tweets) +
  aes(x = negativereason, fill = negativereason) + facet_wrap(~airline) + geom_bar() +
  labs(x = 'Negative Reason', y = 'Count', title = 'Negative reason per Airline') + theme(axis.text.x = element_text(angle = 25, size = 6)) + guides(fill = "none")

Most airlines have a problem with customer service, which could be due, for example, to removing passengers from overbooked flights or to slow service. Delta, in contrast, mainly has a problem with late flights.

Next up, let’s see the distribution of text length of the tweets by adding a new feature for the length of each tweet.

Distribution of Text Lengths with Sentiment

tweets$text_length <- nchar(tweets$text)  # nchar() is vectorized

ggplot(tweets, aes(x = text_length, 
    fill = airline_sentiment))  +
  geom_density(alpha = 0.5)  +
  labs(x = 'Length of Text', title= 'Distribution of Text Lengths with Sentiment')  +
  scale_fill_manual(values = c("#ff576a","#3db5ff", "#66CC99")) +
  theme_minimal() +
  theme(text = element_text(size=12))  # applied after theme_minimal() so it is not overridden

We can clearly see here that most of the long tweets are negative, while most of the short tweets are neutral.

To sum up, people who are experiencing negative situations tend to write longer tweets.

Daily Cumulative Tweets

d <- ustweets %>% 
  mutate(tweet_created = ymd_hms(tweet_created))
head(ustweets$tweet_created)

## [1] "2015-02-24 11:35:52 -0800" "2015-02-24 11:15:59 -0800"
## [3] "2015-02-24 11:15:48 -0800" "2015-02-24 11:15:36 -0800"
## [5] "2015-02-24 11:14:45 -0800" "2015-02-24 11:14:33 -0800"

d %>% group_by(date = as_date(tweet_created)) %>% # group by calendar date (UTC)
  summarise(count = n()) %>% # number of tweets for each date
  mutate(cuml = cumsum(count)) %>% # cumulative tweets up to each date
  ggplot(aes(x = date, y = cuml)) +
  geom_line(color = 'red') +
  geom_point(size = 1.5) +
  theme_bw() + 
  ggtitle('Daily Cumulative Tweets') +
  transition_reveal(date)
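The daily cumulative count itself can be checked without any plotting. The sketch below uses base R on five made-up timestamps: it truncates each timestamp to its calendar date, counts tweets per date and accumulates the counts.

```r
# Five made-up timestamps in the same layout as tweet_created (offset omitted)
stamps <- c("2015-02-24 11:35:52", "2015-02-24 11:15:59",
            "2015-02-23 09:00:00", "2015-02-22 18:30:10",
            "2015-02-23 21:45:00")

# Keep only the calendar date of each tweet
dates <- as.Date(substr(stamps, 1, 10))

# Tweets per day (table() orders the dates chronologically here)
daily <- table(dates)

# Running total across days
cumulative <- cumsum(daily)
cumulative
```

Grouping by the full timestamp instead of the date would give one group per second, which is why the date has to be extracted first.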

Animated Daily Tweets per Airline

# extract day
d$day = day(d$tweet_created)

new = d %>% group_by(day, airline) %>% summarise(count = n())

## `summarise()` has grouped output by 'day'. You can override using the `.groups` argument.

# Animated daily line plot
new %>%
  ggplot(aes(x = day, y = count, 
             group = airline,
             color = airline)) +
  theme_bw()+
  geom_line() +
  geom_point() +
  ggtitle("Animated Daily Tweets per Airline") +
  transition_reveal(day)

4. Sentiments Classification

Preparing the data

To build input features for our classifier, we put the corpus in the shape of a document-term matrix (DTM).

A document-term matrix is a numeric matrix containing a column for each distinct term (word) in the whole corpus, and a row for each document (tweet).

datadtm = DocumentTermMatrix(corpus)
datadtm

## <<DocumentTermMatrix (documents: 13691, terms: 11286)>>
## Non-/sparse entries: 123385/154393241
## Sparsity           : 100%
## Maximal term length: 62
## Weighting          : term frequency (tf)

The DTM currently has 11286 terms extracted from 13691 tweets. These terms are what we will use to decide whether a tweet is positive, neutral or negative. The sparsity is reported as 100% (rounded): almost all entries of the matrix are zero, because each tweet contains only a handful of the 11286 terms.

If we consider each column as a feature for our model, we will end up with a very complex model with 11286 different features, and it will take hours to run. We need to reduce the number of terms and work with only the most frequent ones.

Remove terms with a sparsity above 99.9%.

datadtm = removeSparseTerms(datadtm, 0.999)
dim(datadtm)

## [1] 13691  1128

Now we can fit our models efficiently.

Preparing the DTM.

dataset <- as.data.frame(as.matrix(datadtm))
colnames(dataset) <- make.names(colnames(dataset))
dataset$airline_sentiment <- tweets$airline_sentiment
str(dataset$airline_sentiment)

##  chr [1:13691] "negative" "positive" "negative" "negative" "negative" ...

Convert airline_sentiment to factor.

dataset$airline_sentiment <- as.factor(dataset$airline_sentiment)

Splitting the data into Train & Test datasets

set.seed(222)
split = sample(2,nrow(dataset),prob = c(0.8,0.2),replace = TRUE)
train_set = dataset[split == 1,]
test_set = dataset[split == 2,]
train_set[4:6,57:59]

##   market passeng servic
## 7      0       0      0
## 8      0       0      0
## 9      1       1      1

test_set[4:6,57:59]

##    market passeng servic
## 15      0       0      0
## 21      0       0      0
## 26      0       0      0

Baseline accuracy

Let’s compare the class proportions of the training and test sets to those of the full dataset, to confirm that they are similar.

prop.table(table(train_set$airline_sentiment))

## 
##  negative   neutral  positive 
## 0.6468445 0.1976141 0.1555414

prop.table(table(test_set$airline_sentiment))

## 
##  negative   neutral  positive 
## 0.6335793 0.2022140 0.1642066

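The data is imbalanced towards negative tweets. Thus, the machine learning algorithms are likely to predict negative tweets more accurately than positive and neutral ones.

A model that always predicts the majority class (negative) would already reach about 65% accuracy on the training set, so any useful model should do better than that.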
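The baseline can be reproduced from the class counts alone. The sketch below uses the training counts reported in the decision tree summary further down and the test counts given by the column totals of the confusion matrices; it computes the accuracy of always predicting the most frequent class.

```r
# Class counts taken from the outputs in this report
train_counts <- c(negative = 7103, neutral = 2170, positive = 1708)
test_counts  <- c(negative = 1717, neutral = 548,  positive = 445)

# The majority class is what a trivial classifier would always predict
majority_class <- names(which.max(train_counts))

# Accuracy of that trivial classifier on each set
baseline_train <- train_counts[[majority_class]] / sum(train_counts)
baseline_test  <- test_counts[[majority_class]] / sum(test_counts)

round(c(train = baseline_train, test = baseline_test), 4)
```

The test value is the same number that confusionMatrix reports as the No Information Rate.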

a) Decision Tree

CART stands for classification and regression trees. In our case it is a classification tree because the target is categorical. One benefit of a decision tree is that it is easy to interpret and visualize.

Model Training

To train the model, we will be using rpart function from rpart package. Once the model is trained we will test using the predict function.

dt_classifier <- rpart(airline_sentiment ~ ., data= train_set, method="class", minbucket= 25)
rpart.plot(dt_classifier)

summary(dt_classifier)

## Call:
## rpart(formula = airline_sentiment ~ ., data = train_set, method = "class", 
##     minbucket = 25)
##   n= 10981 
## 
##         CP nsplit rel error   xerror       xstd
## 1 0.124033      0  1.000000 1.000000 0.01291505
## 2 0.010000      1  0.875967 0.875967 0.01249017
## 
## Variable importance
## thank 
##   100 
## 
## Node number 1: 10981 observations,    complexity param=0.124033
##   predicted class=negative  expected loss=0.3531555  P(node) =1
##     class counts:  7103  2170  1708
##    probabilities: 0.647 0.198 0.156 
##   left son=2 (9693 obs) right son=3 (1288 obs)
##   Primary splits:
##       thank    < 0.5 to the left,  improve=553.91500, (0 missing)
##       hour     < 0.5 to the right, improve=130.94910, (0 missing)
##       great    < 0.5 to the left,  improve=115.59640, (0 missing)
##       usairway < 0.5 to the right, improve= 87.53977, (0 missing)
##       hold     < 0.5 to the right, improve= 77.01616, (0 missing)
##   Surrogate splits:
##       safe   < 1.5 to the left,  agree=0.883, adj=0.002, (0 split)
##       prompt < 0.5 to the left,  agree=0.883, adj=0.002, (0 split)
##       philli < 1.5 to the left,  agree=0.883, adj=0.001, (0 split)
## 
## Node number 2: 9693 observations
##   predicted class=negative  expected loss=0.299804  P(node) =0.8827065
##     class counts:  6787  1995   911
##    probabilities: 0.700 0.206 0.094 
## 
## Node number 3: 1288 observations
##   predicted class=positive  expected loss=0.3812112  P(node) =0.1172935
##     class counts:   316   175   797
##    probabilities: 0.245 0.136 0.619

Thank is the most important term in classifying tweets into negative or positive .

There is only one split in the tree, based on the condition thank < 0.5, i.e. whether the stem ‘thank’ appears in the tweet.
- If it appears, we move to the right and predict positive.
- If it does not appear, we predict negative.

For the 88% of tweets without ‘thank’, 70% are negative, 21% are neutral and only 9% are positive.

For the 12% of tweets that contain ‘thank’, 25% are negative, 14% are neutral and 62% are positive.

This model is of limited use because it never predicts the neutral class.

Before moving on to the Random Forest, let’s measure how the tree performs.

Predict train

dt_predict1 <- predict(dt_classifier, newdata=train_set, type="class")
accuracy(dt_predict1,train_set$airline_sentiment)

## [1] 0.6906475

It looks promising considering our baseline is 0.65.

Model Testing

To understand how good the classifier is, we will predict sentiments in test data set.

dt_pred = predict(dt_classifier, newdata=test_set, type="class")

Model Evaluation

A confusion matrix will give us metrics like accuracy, sensitivity and specificity.

confusionMatrix(table(dt_pred,test_set$airline_sentiment))

## Confusion Matrix and Statistics
## 
##           
## dt_pred    negative neutral positive
##   negative     1649     516      261
##   neutral         0       0        0
##   positive       68      32      184
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6764         
##                  95% CI : (0.6584, 0.694)
##     No Information Rate : 0.6336         
##     P-Value [Acc > NIR] : 1.705e-06      
##                                          
##                   Kappa : 0.2213         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: negative Class: neutral Class: positive
## Sensitivity                   0.9604         0.0000          0.4135
## Specificity                   0.2175         1.0000          0.9558
## Pos Pred Value                0.6797            NaN          0.6479
## Neg Pred Value                0.7606         0.7978          0.8924
## Prevalence                    0.6336         0.2022          0.1642
## Detection Rate                0.6085         0.0000          0.0679
## Detection Prevalence          0.8952         0.0000          0.1048
## Balanced Accuracy             0.5890         0.5000          0.6847

The result shows that our decision tree model has an accuracy of 0.6764 on the test dataset, meaning that 67.64% of the test tweets are correctly classified.

The sensitivity of 0.9604 for the negative class implies that 96.04% of negative tweets were correctly classified. However, the specificity of 0.2175 implies that only 21.75% of non-negative tweets were recognized as such: the model labels almost 90% of all tweets as negative (detection prevalence 0.8952).

The sensitivity of 0 for the neutral class implies that none of the neutral tweets were classified correctly. The specificity of 1 only reflects that the model never predicts neutral.

The sensitivity of 0.4135 for the positive class implies that 41.35% of positive tweets were correctly classified, which is not good. The specificity of 0.9558 implies that 95.58% of non-positive tweets were correctly classified. This is expected because we already know that the data is imbalanced towards the negative class.

As noted above, this model is of limited use because it never predicts the neutral class.
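To see where these numbers come from, the following base R sketch recomputes accuracy, sensitivity and specificity directly from the decision tree confusion matrix printed above (rows are predictions, columns are the true classes). It is a hand calculation for illustration, not a replacement for caret's confusionMatrix.

```r
classes <- c("negative", "neutral", "positive")

# Decision tree confusion matrix from the output above (filled column by column)
cm <- matrix(
  c(1649, 0, 68,    # true negative tweets
    516,  0, 32,    # true neutral tweets
    261,  0, 184),  # true positive tweets
  nrow = 3,
  dimnames = list(predicted = classes, actual = classes)
)

# Share of all tweets on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Per class: correctly predicted tweets of the class / all tweets of the class
sensitivity <- diag(cm) / colSums(cm)

# Per class: tweets of other classes not predicted as this class / all tweets of other classes
specificity <- sapply(seq_along(classes), function(k) sum(cm[-k, -k]) / sum(cm[, -k]))
names(specificity) <- classes

round(accuracy, 4)
round(rbind(sensitivity, specificity), 4)
```

The all-zero neutral row explains both the neutral sensitivity of 0 and the neutral specificity of 1.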

b) Random Forest Model

The random forest algorithm is less prone to overfitting than a single tree and can deal with a large number of features. It works by building a large number of CART trees. Each tree is grown on a random sample of the observations and can split on only a random subset of the variables. For classification, each tree votes on the outcome and we pick the outcome that receives the majority of votes.

I will train this model with 20 trees so it won’t take hours to run.
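The voting step can be illustrated with a few lines of base R. In this sketch each row of tree_votes holds the made-up predictions of five hypothetical trees for one tweet, and majority_vote is a hypothetical helper that returns the most frequent label.

```r
# Made-up votes: one row per tweet, one column per tree
tree_votes <- rbind(
  c("negative", "negative", "positive", "negative", "neutral"),
  c("positive", "neutral",  "positive", "positive", "negative"),
  c("neutral",  "neutral",  "negative", "neutral",  "neutral")
)

# Most frequent label among the votes; ties go to the alphabetically first label
majority_vote <- function(votes) names(which.max(table(votes)))

forest_pred <- apply(tree_votes, 1, majority_vote)
forest_pred
```

The randomForest package handles sampling, tree growing and voting internally; the sketch only shows the final aggregation idea.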

Model Training

To train the model, we will be using randomForest function from randomForest package.

rf_classifier = randomForest(airline_sentiment ~., data=train_set, ntree = 20)
rf_classifier

## 
## Call:
##  randomForest(formula = airline_sentiment ~ ., data = train_set,      ntree = 20) 
##                Type of random forest: classification
##                      Number of trees: 20
## No. of variables tried at each split: 33
## 
##         OOB estimate of  error rate: 25.61%
## Confusion matrix:
##          negative neutral positive class.error
## negative     6346     513      244   0.1065747
## neutral      1092     863      214   0.6021208
## positive      512     237      958   0.4387815

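As expected, the output notes that the random forest included 20 trees. It tried 33 variables at each split, which is the default for classification: the square root of the number of predictors (1128), rounded down. The out-of-bag (OOB) error estimate is 25.61%.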

Predict Train

rf_predict1 <- predict(rf_classifier, newdata=train_set, type="class")
accuracy(rf_predict1,train_set$airline_sentiment)

## [1] 0.9271469

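The training accuracy of 0.93 is well above our 0.65 baseline, but it is optimistic because the model has already seen these tweets. The OOB error of 25.61% suggests an accuracy of about 74% on unseen data.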

Model Testing

Predicting the Test set results.

rf_pred = predict(rf_classifier, newdata = test_set, type = "class")

Model Evaluation

confusionMatrix(table(rf_pred, test_set$airline_sentiment))

## Confusion Matrix and Statistics
## 
##           
## rf_pred    negative neutral positive
##   negative     1565     277      138
##   neutral       113     218       63
##   positive       39      53      244
## 
## Overall Statistics
##                                           
##                Accuracy : 0.748           
##                  95% CI : (0.7312, 0.7642)
##     No Information Rate : 0.6336          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4828          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: negative Class: neutral Class: positive
## Sensitivity                   0.9115        0.39781         0.54831
## Specificity                   0.5821        0.91859         0.95938
## Pos Pred Value                0.7904        0.55330         0.72619
## Neg Pred Value                0.7918        0.85751         0.91533
## Prevalence                    0.6336        0.20221         0.16421
## Detection Rate                0.5775        0.08044         0.09004
## Detection Prevalence          0.7306        0.14539         0.12399
## Balanced Accuracy             0.7468        0.65820         0.75385

The result shows that our random forest model has an accuracy of 0.748 on the test dataset, meaning that about 74.8% of the test tweets are correctly classified.

The sensitivity for the negative class is 0.9115, which implies that 91.15% of negative tweets were correctly classified. The specificity of 0.5821 implies that only 58.21% of non-negative tweets were correctly identified as non-negative. The model is very good at recognizing negative tweets, but it also labels many neutral and positive tweets as negative.

The sensitivity for the neutral class is 0.39781, which implies that only 39.78% of neutral tweets were correctly classified. The specificity of 0.91859 implies that 91.86% of non-neutral tweets were correctly identified as non-neutral. The model did not do well at classifying the neutral class.

The sensitivity for the positive class is 0.54831, which implies that 54.83% of positive tweets were correctly classified. The specificity of 0.95938 implies that 95.94% of non-positive tweets were correctly identified as non-positive.
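The per-class sensitivity and specificity can be recomputed directly from the test confusion matrix using a one-vs-rest view of each class. This is a minimal sketch in base R; the counts are copied from the random forest output above and the object names are ours.

```r
# Test-set confusion matrix copied from the confusionMatrix() output
# (rows = predicted class, columns = actual class).
classes <- c("negative", "neutral", "positive")
cm <- matrix(
  c(1565, 277, 138,
     113, 218,  63,
      39,  53, 244),
  nrow = 3, byrow = TRUE,
  dimnames = list(predicted = classes, actual = classes)
)

tp <- diag(cm)               # correctly predicted tweets of each class
fn <- colSums(cm) - tp       # tweets of the class predicted as another class
fp <- rowSums(cm) - tp       # tweets of other classes predicted as this class
tn <- sum(cm) - tp - fn - fp # everything else

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
overall_accuracy <- sum(tp) / sum(cm)

print(round(rbind(sensitivity, specificity), 4))
print(round(overall_accuracy, 4))
```

Each class is treated in turn as the "positive" outcome and the two remaining classes are pooled, which is why every class gets its own sensitivity and specificity.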

c) Multinomial Logistic Regression

Multinomial Logistic Regression is used to predict outcomes with more than two classes. In our case, the outcome is whether a tweet expresses a positive, neutral or negative sentiment.

Model Training

To train the model, we use the multinom function from the nnet package.

# MaxNWts sets the maximum number of weights nnet allows; the default of 1000 is too small for this model.
lg_classifier <- multinom(airline_sentiment ~ ., data = train_set, MaxNWts = 4000)

## # weights:  3390 (2258 variable)
## initial  value 12063.861542 
## iter  10 value 7088.270172
## iter  20 value 5622.751391
## iter  30 value 4401.387985
## iter  40 value 4040.951820
## iter  50 value 3882.801950
## iter  60 value 3812.961155
## iter  70 value 3790.325497
## iter  80 value 3775.888125
## iter  90 value 3766.529837
## iter 100 value 3761.480024
## final  value 3761.480024 
## stopped after 100 iterations
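The "# weights" line shows why MaxNWts had to be raised. As a rough check, we assume that multinom counts (columns + 1) x classes weights in total, of which columns x (classes - 1) are free to vary, where columns is the number of design-matrix columns including the intercept. Under that assumption the printed figures can be reproduced as follows; the variable names are ours.

```r
# Figures copied from the multinom output above.
n_classes        <- 3
total_weights    <- 3390
variable_weights <- 2258

# Assumed relation: total = (columns + 1) * classes,
# so the number of design-matrix columns (intercept included) is:
design_columns <- total_weights / n_classes - 1

# Assumed relation: variable = columns * (classes - 1).
implied_variable <- design_columns * (n_classes - 1)

# nnet refuses to fit more than MaxNWts weights; its default is 1000.
default_max_weights <- 1000
exceeds_default <- total_weights > default_max_weights

print(c(design_columns = design_columns, implied_variable = implied_variable))
print(exceeds_default)
```

This suggests roughly 1128 predictor terms plus an intercept, which is also consistent with the 33 variables tried at each split by the random forest if it used its default of floor(sqrt(number of predictors)). The "stopped after 100 iterations" line corresponds to nnet's default maxit of 100, so the optimizer hit its iteration limit; passing a larger maxit to multinom would let it run longer.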

Predict train

lg_predict1 <- predict(lg_classifier, newdata=train_set, type="class")
accuracy(lg_predict1,train_set$airline_sentiment)

## [1] 0.8654039

The training accuracy of about 0.87 looks promising considering our baseline is 0.65.

Model Testing

lg_pred <- predict(lg_classifier, newdata = test_set, type = "class")

Model Evaluation

confusionMatrix(table(lg_pred,test_set$airline_sentiment))

## Confusion Matrix and Statistics
## 
##           
## lg_pred    negative neutral positive
##   negative     1468     190       79
##   neutral       160     289       82
##   positive       89      69      284
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7531          
##                  95% CI : (0.7364, 0.7693)
##     No Information Rate : 0.6336          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.532           
##                                           
##  Mcnemar's Test P-Value : 0.2322          
## 
## Statistics by Class:
## 
##                      Class: negative Class: neutral Class: positive
## Sensitivity                   0.8550         0.5274          0.6382
## Specificity                   0.7291         0.8881          0.9302
## Pos Pred Value                0.8451         0.5443          0.6425
## Neg Pred Value                0.7441         0.8811          0.9290
## Prevalence                    0.6336         0.2022          0.1642
## Detection Rate                0.5417         0.1066          0.1048
## Detection Prevalence          0.6410         0.1959          0.1631
## Balanced Accuracy             0.7920         0.7077          0.7842

The result shows that our multinomial logistic regression model has an accuracy of 0.7531 on the test dataset, meaning that 75.31% of the test tweets are correctly classified.

The sensitivity and specificity of the negative class are 0.8550 and 0.7291. This indicates that 85.50% of negative tweets are correctly classified and that 72.91% of non-negative tweets are correctly identified as non-negative. The model did well at classifying the negative class.

The sensitivity and specificity of the neutral class are 0.5274 and 0.8881. This indicates that 52.74% of neutral tweets are correctly classified and that 88.81% of non-neutral tweets are correctly identified as non-neutral. The prediction of the neutral class is not as strong as that of the negative class.

The sensitivity and specificity of the positive class are 0.6382 and 0.9302. This indicates that 63.82% of positive tweets are correctly classified and that 93.02% of non-positive tweets are correctly identified as non-positive.
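The Kappa value in the output corrects the accuracy for the agreement that would be expected by chance given the class frequencies. The sketch below recomputes Cohen's kappa in base R from the multinomial confusion matrix above; the object names are ours.

```r
# Test-set confusion matrix of the multinomial model
# (rows = predicted class, columns = actual class).
classes <- c("negative", "neutral", "positive")
cm_lg <- matrix(
  c(1468, 190,  79,
     160, 289,  82,
      89,  69, 284),
  nrow = 3, byrow = TRUE,
  dimnames = list(predicted = classes, actual = classes)
)

n <- sum(cm_lg)

# Observed agreement: the plain accuracy.
observed <- sum(diag(cm_lg)) / n

# Agreement expected by chance from the row and column totals.
expected <- sum(rowSums(cm_lg) * colSums(cm_lg)) / n^2

cohen_kappa <- (observed - expected) / (1 - expected)

print(round(c(observed = observed, expected = expected, kappa = cohen_kappa), 4))
```

Because about 63% of the test tweets are negative, chance agreement is already high, which is why a Kappa of about 0.53 is much lower than the accuracy of 0.7531.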

5. Conclusion

The Random Forest predicted the outcomes better than the Decision Tree classifier, but the Multinomial Logistic Regression model gave the best test accuracy (0.7531, against 0.748 for the Random Forest).

As for sensitivity, the Decision Tree model classified negative tweets correctly more often than the Random Forest and the Multinomial Logistic Regression, so its sensitivity for the negative class is higher and its specificity is lower. However, it performed worse at correctly classifying neutral and positive tweets. The Multinomial Logistic Regression did well on these two classes, with sensitivities of 0.5274 and 0.6382 against 0.3978 and 0.5483 for the Random Forest.

We can conclude that the Multinomial Logistic Regression model is the best of the three models for predicting the sentiment of tweets, with a test accuracy of 75.31%.
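To put the accuracy gap between the two best models in perspective, the 95% confidence intervals from the outputs above can be recomputed. The sketch below assumes that the reported intervals are exact binomial intervals, as returned by binom.test in base R; the counts of correct predictions are the diagonals of the two test confusion matrices, and the object names are ours.

```r
# Number of test tweets and correctly classified tweets per model
# (diagonals of the two test-set confusion matrices).
n_test  <- 2710
correct <- c(random_forest = 1565 + 218 + 244,
             multinomial   = 1468 + 289 + 284)

# Exact binomial 95% confidence interval for each accuracy.
ci <- vapply(
  correct,
  function(k) as.numeric(binom.test(k, n_test)$conf.int),
  numeric(2)
)
rownames(ci) <- c("lower", "upper")

print(round(correct / n_test, 4))
print(round(ci, 4))

# Do the two intervals overlap?
intervals_overlap <- ci["upper", "random_forest"] > ci["lower", "multinomial"]
print(intervals_overlap)
```

The two intervals overlap, so the advantage of the Multinomial Logistic Regression in overall accuracy is small; its clearer gain is in the sensitivity for the neutral and positive classes.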

In this work we extracted several findings from the given dataset:

26% of the tweets were about United Airlines, which was also the most complained-about airline, due to bad service and late flights.
More than 60% of the tweets expressed negative emotions.
The longer the tweet, the more likely it is to express negative feelings.
As for the sentiment classification, we found that the Multinomial Logistic Regression model did best at predicting the sentiments.

Sentiment Prediction of US Airline Tweets

Nihel Samet

23/06/2021
