Text Classification Using Airlines Tweets

Luis Ernesto Herrera / Carlos Alejandro Hernandez / Miguel Angel Ibarra / Edgar Pereida

May 23, 2017

Index

Abstract

This work evaluates an algorithm for sentiment classification of tweets written by people who have traveled with some of the most important airlines in the United States and who give their opinion of the service received from the company's staff. The aim is to minimize the percentage of error at the time of classifying. The work shows that classification can be improved by using a large number of samples as a base.

The methodology used for this analysis was naive Bayes: the classifier is trained with a sample of the collected data, the likelihood of each class given the words of a tweet is assessed probabilistically, and the test observations are then classified. The results make it possible to compare the rate of correct classification when small and large samples are used.

Introduction

Consider the problem of classifying, quickly, automatically and with minimal error, the comments that the different airlines receive on their Twitter accounts. This is a major problem for companies that want to know how users perceive the service they offer. Learning algorithms can be trained to carry out the classification, so the task can be automated and companies no longer have to classify the comments manually.

Description of the data domain

Description of DataSet

For this project we use the AirlinesTweets dataset, which has 14640 observations: tweets commenting on problems with each of the major U.S. airlines. Contributors were asked to first classify each tweet as positive, negative or neutral, and then to give the negative or positive reasons (such as “Late Flight” or “rude service”).

Variables Description

Collection of the data

The data were collected through the Twitter API, using the official accounts of several American airlines.

Exploring and preparing the data.

In this second step we prepare the data for the analysis. We will use the bag-of-words method, which ignores the order of the words and only provides, for each word, a variable that indicates whether the word appears.
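To make the bag-of-words idea concrete, the following is a minimal sketch in base R. The two toy tweets and the names tokens, vocab and bow are assumptions for illustration only; the analysis itself relies on the tm package.

```r
# Toy tweets (hypothetical, not taken from the dataset).
tweets <- c("late flight again", "great flight great crew")

# Split each tweet into words on whitespace.
tokens <- strsplit(tweets, "\\s+")

# Vocabulary: every distinct word, sorted alphabetically.
vocab <- sort(unique(unlist(tokens)))

# One row per tweet, one column per word:
# 1 if the word appears in the tweet, 0 otherwise (word order is ignored).
bow <- t(sapply(tokens, function(words) as.integer(vocab %in% words)))
colnames(bow) <- vocab

print(bow)
```

Note that "great" appears twice in the second tweet but is still recorded as 1, because only presence is kept.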

Verify the structure of the dataset.

setwd("C:/Users/luisoftgb28/Desktop/Proyecto2/Proyecto2")
airlines <- read.csv("AirlinesTweets.csv",stringsAsFactors = FALSE)
str(airlines)
## 'data.frame':    14640 obs. of  4 variables:
##  $ tweet_id         : num  5.7e+17 5.7e+17 5.7e+17 5.7e+17 5.7e+17 ...
##  $ airline_sentiment: chr  "neutral" "positive" "neutral" "negative" ...
##  $ airline          : chr  "Virgin America" "Virgin America" "Virgin America" "Virgin America" ...
##  $ text             : chr  "@VirginAmerica What @dhepburn said." "@VirginAmerica plus you've added commercials to the experience... tacky." "@VirginAmerica I didn't today... Must mean I need to take another trip!" "@VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces &amp; they have little recours"| __truncated__ ...

As you can see, the field airline_sentiment is of type character. Some functions require a categorical variable, so we will convert this field to a factor.

We transform the airline_sentiment field to a factor.

airlines$airline_sentiment <- factor(airlines$airline_sentiment)
str(airlines$airline_sentiment)
##  Factor w/ 3 levels "negative","neutral",..: 2 3 2 1 1 1 3 2 3 3 ...

Table and Boxplots.

#Show table
table(airlines$airline_sentiment)
## 
## negative  neutral positive 
##     9178     3099     2363

This table shows how many tweets in the dataset are negative, neutral and positive. In this case we have 9178 negative, 3099 neutral and 2363 positive tweets. In conclusion, negative tweets predominate, which suggests that there is little satisfaction with the services of the American airlines.
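The same counts can be expressed as percentages, which makes the imbalance easier to read. This is a small sketch that re-enters the three counts from the table by hand; the names counts and shares are assumptions.

```r
# Class counts copied from the table above.
counts <- c(negative = 9178, neutral = 3099, positive = 2363)

# Share of each class in percent, rounded to one decimal.
shares <- round(100 * counts / sum(counts), 1)

print(shares)
```

Roughly six out of ten tweets are negative, so a classifier that always answered "negative" would already be right for most of the data.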

Percent by airline.

In this graph we can clearly observe the positive, negative and neutral tweets for each of the airlines. Virgin America has the most balanced distribution, whereas for the other airlines negative tweets clearly predominate.

library("ggplot2")
ggplot(airlines, aes(x=factor(airline),fill=factor(airline_sentiment)))+
  geom_bar(position="dodge")+xlab("")+ylab("Messages")+
  labs(fill = "Classification")

Methodology

Naive Bayes

Naive Bayes is an algorithm used for text classification that can be trained to improve its classification ability. Its classification is based on probabilities: it takes into account the probability of each attribute occurring in each group and combines them into a total probability that an observation belongs to one group or another. In text classification it calculates the probability that a given word is or is not present in the text to be classified.
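The calculation described above can be sketched by hand for a single tweet. In this sketch the priors come from the class counts of the dataset, but the word probabilities p_thanks and p_delayed are invented values for illustration, not estimates from the data.

```r
# Prior probability of each class, from the class counts of the dataset.
prior <- c(negative = 9178, neutral = 3099, positive = 2363) / 14640

# Hypothetical P(word present | class); these values are assumptions.
p_thanks  <- c(negative = 0.02, neutral = 0.05, positive = 0.40)
p_delayed <- c(negative = 0.20, neutral = 0.03, positive = 0.01)

# Tweet that contains "thanks" and does not contain "delayed".
# Naive assumption: words are independent given the class,
# so the per-word probabilities are simply multiplied.
unnormalized <- prior * p_thanks * (1 - p_delayed)

# Normalize so the three class probabilities sum to 1.
posterior <- unnormalized / sum(unnormalized)

print(round(posterior, 3))
print(names(which.max(posterior)))
```

Even though negative tweets have the largest prior, the presence of "thanks" and the absence of "delayed" shift the total probability toward the positive class in this toy setting.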

Naive Bayes Applications

Naive Bayes has many applications, as long as the goal can be framed as a classification problem. Some examples are text classification and spam filtering.

Corpus creation

A corpus is a collection of text documents. We create one with the VCorpus() command, using the text vector of the airlines data as the source.

library("tm")
airlines_corpus <- VCorpus(VectorSource(airlines$text))
print(airlines_corpus)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 14640

Corpus check

The inspect() function gives us a summary of a message or group of messages.

inspect(airlines_corpus[1:2])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 35
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 72

To see the actual message, we can use as.character().

as.character(airlines_corpus[[1]])
## [1] "@VirginAmerica What @dhepburn said."

You can use the lapply() function to view multiple documents.

lapply(airlines_corpus[1:3], as.character)
## $`1`
## [1] "@VirginAmerica What @dhepburn said."
## 
## $`2`
## [1] "@VirginAmerica plus you've added commercials to the experience... tacky."
## 
## $`3`
## [1] "@VirginAmerica I didn't today... Must mean I need to take another trip!"

Cleaning and standardization of text data

Because the contributors' tweets may contain numbers, punctuation or extra blanks, we have to clean the tweets by removing them. For this part we use the tm library, which helps us clean the data.

library("tm")

When information is obtained from different sources, data about the same object should be unified and similar patterns identified while the sources are merged.

The following code cleans the corpus and creates a matrix with a column for each word, excluding punctuation, numbers, extra blanks and irrelevant words.

airlines_corpus_clean <- tm_map(airlines_corpus,content_transformer(tolower))
airlines_corpus_clean <- tm_map(airlines_corpus_clean,removeNumbers)
airlines_corpus_clean <- tm_map(airlines_corpus_clean,removeWords,stopwords())
airlines_corpus_clean <- tm_map(airlines_corpus_clean,removePunctuation)
airlines_corpus_clean <- tm_map(airlines_corpus_clean,stemDocument)
airlines_corpus_clean <- tm_map(airlines_corpus_clean, stripWhitespace)
airlines_dtm <- DocumentTermMatrix(airlines_corpus_clean)

  • tolower: Converts all text to lowercase.
  • removeNumbers: Removes all numbers in the text.
  • removePunctuation: Removes any punctuation marks.
  • stopwords: Supplies the words to remove because they are irrelevant, such as articles.
  • stemDocument (stemming): Reduces each word to its root.
  • stripWhitespace: Collapses extra blanks into a single space.
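To see what these steps do to a single tweet, here is a rough base R imitation using regular expressions. It is not the tm implementation: the function name clean_tweet and the short stop word list are assumptions, and stemming is left out.

```r
# Base R imitation of the cleaning steps (illustrative only, not tm itself).
clean_tweet <- function(x, stop_words = c("the", "to", "a", "i", "you")) {
  x <- tolower(x)                                   # lowercase
  x <- gsub("[0-9]+", "", x, perl = TRUE)           # remove numbers
  # Remove stop words, matched as whole words only.
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("[[:punct:]]+", "", x, perl = TRUE)     # remove punctuation
  x <- gsub("\\s+", " ", x, perl = TRUE)            # collapse extra blanks
  trimws(x)                                         # drop leading/trailing blanks
}

cleaned <- clean_tweet("@VirginAmerica I need to take 2 trips!")
print(cleaned)
```

The order matters in the same way as in the tm pipeline: stop words are removed before punctuation, so a contraction such as "didn't" is still intact when it is compared against the stop word list.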

Data Splitting

Creating training and test datasets

We divide the data into two parts by row position: approximately 70% for the training set (rows 1 to 10231) and 30% for the test set (rows 10232 to 14617).

airlines_dtm_train <- airlines_dtm[1:10231, ]
airlines_dtm_test <- airlines_dtm[10232:14617, ]
airlines_train_labels <- airlines[1:10231, ]$airline_sentiment
airlines_test_labels <- airlines[10232:14617, ]$airline_sentiment

Checking training and test data

prop.table(table(airlines_train_labels))
## airlines_train_labels
##  negative   neutral  positive 
## 0.5826410 0.2372202 0.1801388
prop.table(table(airlines_test_labels))
## airlines_test_labels
##  negative   neutral  positive 
## 0.7305062 0.1525308 0.1169631
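The class proportions differ between the two sets (about 58% negative in training against 73% in test), which can happen when the rows are split by position. As an alternative that is not part of the original analysis, a random split of the same sizes could be sketched as follows; the seed and the names train_idx and test_idx are assumptions.

```r
# Random 70/30 split of row indices (a sketch, not the split used above).
set.seed(123)                      # arbitrary seed, only for reproducibility
n <- 14617                         # number of rows used in the original split
train_idx <- sample(n, size = 10231)
test_idx  <- setdiff(seq_len(n), train_idx)

# The indices would then replace the fixed ranges, for example:
# airlines_dtm_train <- airlines_dtm[train_idx, ]
# airlines_dtm_test  <- airlines_dtm[test_idx, ]
length(train_idx)
length(test_idx)
```

With a random split, the training and test proportions of each sentiment would be expected to be much closer to each other.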

Word Cloud

A word cloud is a way to visually represent the frequency with which words appear in text data.

library(wordcloud)
col=brewer.pal(6,"Dark2")
wordcloud(airlines_corpus_clean,min.freq = 100,random.color = TRUE, random.order = FALSE,colors = col)

Creation of the subset

First we split the comments by sentiment.

positive <- subset(airlines, airline_sentiment=="positive")
negative <- subset(airlines, airline_sentiment=="negative")
neutral <- subset(airlines, airline_sentiment=="neutral")

Word Cloud positive messages.

col=brewer.pal(6,"Dark2")
wordcloud(positive$text,max.words = 100,scale = c(3, 0.5),colors = col)

Word Cloud negative messages

col=brewer.pal(6,"Dark2")
wordcloud(negative$text,max.words = 100,scale = c(3, 0.5),colors = col)

Word Cloud neutral messages

col=brewer.pal(6,"Dark2")
wordcloud(neutral$text,max.words = 100,scale = c(3, 0.5),colors = col)

Frequent Words

We use findFreqTerms() to find the most frequent words, here those that appear at least 100 times in the training data.

airlines_freq_words <- findFreqTerms(airlines_dtm_train,100)
str(airlines_freq_words)
##  chr [1:166] "<U+0089>ûïjetblu""| __truncated__ "agent" "airlin" "airport" ...

Now we filter the DTM.

airlines_dtm_freq_train <- airlines_dtm_train[ , airlines_freq_words]

airlines_dtm_freq_test <- airlines_dtm_test[ , airlines_freq_words]
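The filtering step can be illustrated on a small dense matrix. This is only a sketch of the idea, assuming that a term's frequency is its total count over all documents; the toy matrix, the threshold of 2 and the names term_totals and dtm_freq are hypothetical.

```r
# Toy document-term counts (hypothetical): rows are tweets, columns are terms.
dtm <- matrix(c(1, 0, 2,
                0, 0, 1,
                1, 1, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("flight", "gate", "thank")))

min_freq <- 2                                  # plays the role of 100 above

# Total number of occurrences of each term over all tweets.
term_totals <- colSums(dtm)

# Keep only the terms that reach the minimum frequency.
freq_terms <- names(term_totals)[term_totals >= min_freq]
dtm_freq <- dtm[, freq_terms, drop = FALSE]

print(freq_terms)
```

The same list of terms is applied to both the training and the test matrix, so that both have identical columns when the model is built and evaluated.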

Model Construction

We load the e1071 library, which provides the naiveBayes() function, and convert the filtered DTMs to ordinary matrices.

library(e1071)
airlines_train<-as.matrix(airlines_dtm_freq_train)
airlines_test <- as.matrix(airlines_dtm_freq_test)
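The bag-of-words method described earlier uses a variable that indicates whether a word appears, whereas the matrices above hold word counts, which naiveBayes() would, as far as we know, treat as numeric predictors. One common way to obtain presence indicators is sketched below on a toy matrix. The helper convert_counts and the toy data are assumptions and are not part of the original code, so the results reported in this document do not come from this variant.

```r
# Hypothetical helper: turn a count into a presence indicator.
convert_counts <- function(x) ifelse(x > 0, "Yes", "No")

# Toy count matrix (hypothetical): 3 tweets, 2 terms, filled column by column.
counts <- matrix(c(0, 2, 1,
                   0, 0, 3),
                 nrow = 3,
                 dimnames = list(NULL, c("delay", "thank")))

# Apply the helper to every column; the result is a character matrix.
presence <- apply(counts, MARGIN = 2, convert_counts)

print(presence)
```

Applied to airlines_train and airlines_test in the same way, this would give the classifier categorical "Yes"/"No" features instead of counts.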

We build the model using the training data.

airlines_classifier <- naiveBayes(airlines_train,airlines_train_labels)

Model Evaluation

We evaluate the model by predicting the sentiment of the tweets in the test set.

airlines_test_pred <- predict(airlines_classifier, airlines_test)

Confusion Matrix

library(gmodels)  

CrossTable(airlines_test_pred, airlines_test_labels, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4386 
## 
##  
##              | actual 
##    predicted |  negative |   neutral |  positive | Row Total | 
## -------------|-----------|-----------|-----------|-----------|
##     negative |      2223 |       186 |        90 |      2499 | 
##              |     0.890 |     0.074 |     0.036 |     0.570 | 
##              |     0.694 |     0.278 |     0.175 |           | 
## -------------|-----------|-----------|-----------|-----------|
##      neutral |       453 |       272 |        68 |       793 | 
##              |     0.571 |     0.343 |     0.086 |     0.181 | 
##              |     0.141 |     0.407 |     0.133 |           | 
## -------------|-----------|-----------|-----------|-----------|
##     positive |       528 |       211 |       355 |      1094 | 
##              |     0.483 |     0.193 |     0.324 |     0.249 | 
##              |     0.165 |     0.315 |     0.692 |           | 
## -------------|-----------|-----------|-----------|-----------|
## Column Total |      3204 |       669 |       513 |      4386 | 
##              |     0.731 |     0.153 |     0.117 |           | 
## -------------|-----------|-----------|-----------|-----------|
## 
## 
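The overall accuracy and the per-class rates can be computed from the counts in the table above. This sketch re-enters the nine cell counts by hand; the names cm, accuracy and recall are assumptions.

```r
# Confusion matrix copied from the CrossTable output
# (rows = predicted, columns = actual), filled column by column.
classes <- c("negative", "neutral", "positive")
cm <- matrix(c(2223, 453, 528,    # actual negative
                186, 272, 211,    # actual neutral
                 90,  68, 355),   # actual positive
             nrow = 3,
             dimnames = list(predicted = classes, actual = classes))

# Overall accuracy: correctly classified tweets over all test tweets.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class rate of correct classification (the "N / Col Total" diagonal).
recall <- diag(cm) / colSums(cm)

print(round(accuracy, 3))
print(round(recall, 3))
```

The diagonal holds the 2850 correctly classified tweets out of 4386, about 65% overall; the negative and positive classes are each recovered about 69% of the time, while neutral tweets are recovered about 41% of the time.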

Results

The following images show the word frequencies of the tweets, separated into positive, negative and neutral.

As you can see, the more frequent words appear larger, and they are consistent with the sentiment of each image. For example, in the positive comments one of the words that stands out is “thanks”.

Conclusions

The Naive Bayes model is attractive because of its simplicity, elegance and robustness.

The minimum word frequency used when creating the model is very important, and the best value differs greatly according to the objective; getting good results requires many tests with different values. In this case the best minimum frequency was 100, because it gave the best prediction: close to 70% of the negative and of the positive test tweets were classified correctly, and about 65% of all test tweets (2850 of 4386 in the confusion matrix).

It is one of the oldest formal classification algorithms. It is widely used in areas such as text classification and spam filtering.
