Airlines Tweets

Description of the Dataset

For this project we use the AirlinesTweets dataset, which contains 14640 tweets about problems with each of the major U.S. airlines. Contributors were asked to first classify each tweet as positive, negative, or neutral, and then to indicate the negative or positive reason (such as “Late Flight” or “rude service”).

Variables Description

tweet_id: The ID of the tweet
airline_sentiment: The sentiment of the tweet (positive, negative, or neutral)
airline: The name of the airline
text: The text of the tweet

Collection of the data

The data were collected with the Twitter API, using the official accounts of several U.S. airlines.

Step 2

Exploring and preparing the data

In this second step we prepare the data for the analysis. We use the bag-of-words method, which ignores the order of the words and represents each tweet only by variables that indicate whether each word appears.
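The word-presence representation described above can be sketched in a few lines of base R. The three toy documents are hypothetical and are not taken from AirlinesTweets.csv; the sketch only illustrates the idea and is not the code used in the rest of this report.

```r
# Toy documents (hypothetical, for illustration only)
docs <- c("late flight again", "great flight", "rude service")

# Split each document into lowercase words
tokens <- strsplit(tolower(docs), "\\s+")

# Vocabulary: every distinct word, sorted alphabetically
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per word:
# 1 if the word appears in the document, 0 otherwise (word order is ignored)
bag <- t(sapply(tokens, function(words) as.integer(vocab %in% words)))
colnames(bag) <- vocab

print(bag)
```

Each row describes one document, and two documents with the same words in a different order would get identical rows.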

Verify the structure of the dataset.

setwd("C:/Users/luisoftgb28/Desktop/Proyecto2/Proyecto2")
airlines <- read.csv("AirlinesTweets.csv",stringsAsFactors = FALSE)
str(airlines)

## 'data.frame':    14640 obs. of  4 variables:
##  $ tweet_id         : num  5.7e+17 5.7e+17 5.7e+17 5.7e+17 5.7e+17 ...
##  $ airline_sentiment: chr  "neutral" "positive" "neutral" "negative" ...
##  $ airline          : chr  "Virgin America" "Virgin America" "Virgin America" "Virgin America" ...
##  $ text             : chr  "@VirginAmerica What @dhepburn said." "@VirginAmerica plus you've added commercials to the experience... tacky." "@VirginAmerica I didn't today... Must mean I need to take another trip!" "@VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces &amp; they have little recours"| __truncated__ ...

As you can see, the field airline_sentiment is of type character. In order to use it as a class label in some functions, we convert it to a factor.

We convert the field airline_sentiment to a factor.

airlines$airline_sentiment <- factor(airlines$airline_sentiment)
str(airlines$airline_sentiment)

##  Factor w/ 3 levels "negative","neutral",..: 2 3 2 1 1 1 3 2 3 3 ...

#Show table
table(airlines$airline_sentiment)

## 
## negative  neutral positive 
##     9178     3099     2363

This table shows how many tweets in the dataset are negative, neutral, and positive: 9178 negative (about 63%), 3099 neutral (about 21%), and 2363 positive (about 16%). In conclusion, negative tweets predominate, which suggests that users are not very satisfied with the services of U.S. airlines.

Percent by airline

The graph produced by the code below shows the number of positive, negative, and neutral tweets for each airline. Virgin America has the most balanced distribution, whereas negative tweets clearly predominate for the other airlines.

library("ggplot2")
ggplot(airlines, aes(x = factor(airline), fill = factor(airline_sentiment))) +
  geom_bar(position = "dodge") + xlab("") + ylab("Messages") +
  labs(fill = "Classification")
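The bar chart above plots counts, so airlines with many tweets dominate the picture. To compare airlines as percentages, one option is to compute the share of each sentiment within each airline. The following is a minimal sketch on a hypothetical toy data frame that reuses the column names airline and airline_sentiment; the values are invented for illustration.

```r
# Toy data (hypothetical), same column names as the airlines data frame
toy <- data.frame(
  airline = c("A", "A", "A", "A", "B", "B"),
  airline_sentiment = c("negative", "negative", "neutral", "positive",
                        "negative", "positive"),
  stringsAsFactors = FALSE
)

# Counts per airline (rows) and sentiment (columns)
counts <- table(toy$airline, toy$airline_sentiment)

# margin = 1 divides each row by its total, so every airline sums to 1
shares <- prop.table(counts, margin = 1)

print(round(shares, 2))
```

In ggplot2, using geom_bar(position = "fill") instead of position = "dodge" draws the same kind of within-airline proportions.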

Cleaning and standardization of text data

Tweets may contain numbers, punctuation, or extra blanks, so we have to clean them by removing these elements. For this part we use the tm package, which helps us clean the data.

library("tm")

Corpus creation

A corpus is a collection of text documents. We create a corpus with the VCorpus() command, using VectorSource() to read the text vector of the airlines data.

airlines_corpus <- VCorpus(VectorSource(airlines$text))
print(airlines_corpus)

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 14640

Corpus check

inspect() gives us a summary of a message or a group of messages.

inspect(airlines_corpus[1:2])

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 35
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 72

To see the actual message, we can use as.character().

as.character(airlines_corpus[[1]])

## [1] "@VirginAmerica What @dhepburn said."

You can use the lapply() function to view multiple documents.

lapply(airlines_corpus[1:3], as.character)

## $`1`
## [1] "@VirginAmerica What @dhepburn said."
## 
## $`2`
## [1] "@VirginAmerica plus you've added commercials to the experience... tacky."
## 
## $`3`
## [1] "@VirginAmerica I didn't today... Must mean I need to take another trip!"

Data cleaning

- When obtaining information from different sources, make sure that data about the same object are unified.
- Identify similar patterns during the merging process.

We create a document-term matrix (DTM) with one column per word. The text is converted to lowercase and stemmed, and numbers, punctuation, blanks, and irrelevant words (stop words) are excluded.

airlines_dtm <- DocumentTermMatrix(airlines_corpus, control = list(
  tolower=TRUE, removeNumbers=TRUE, removePunctuation=TRUE, 
  stopwords=TRUE,stemming=TRUE))
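To make the cleaning options more concrete, here is a minimal base-R sketch of roughly what lowercasing and removing numbers and punctuation do to a single tweet. The helper clean_text() is hypothetical and is not part of tm; stop-word removal and stemming are left out, so the result only approximates the terms that DocumentTermMatrix() produces.

```r
# Hypothetical helper (not part of tm): approximate text cleaning in base R
clean_text <- function(x) {
  x <- tolower(x)                     # lowercase everything
  x <- gsub("[[:digit:]]+", "", x)    # drop numbers
  x <- gsub("[[:punct:]]+", "", x)    # drop punctuation
  x <- gsub("\\s+", " ", x)           # collapse repeated whitespace
  trimws(x)                           # drop leading and trailing blanks
}

tweet <- "@VirginAmerica plus you've added commercials to the experience... tacky."
cleaned <- clean_text(tweet)
print(cleaned)
```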

Creating training and test datasets

We divide the data into two parts: approximately 70% for the training set (rows 1 to 10231) and 30% for the test set (rows 10232 to 14617). Note that the last 23 rows of the dataset (14618 to 14640) are not used.

airlines_dtm_train <- airlines_dtm[1:10231, ]
airlines_dtm_test <- airlines_dtm[10232:14617, ]
airlines_train_labels <- airlines[1:10231, ]$airline_sentiment
airlines_test_labels <- airlines[10232:14617, ]$airline_sentiment

Checking training and test data

prop.table(table(airlines_train_labels))

## airlines_train_labels
##  negative   neutral  positive 
## 0.5826410 0.2372202 0.1801388

prop.table(table(airlines_test_labels))

## airlines_test_labels
##  negative   neutral  positive 
## 0.7305062 0.1525308 0.1169631
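The two sets differ noticeably: about 58% of the training tweets are negative, against about 73% of the test tweets. This can happen when rows are split sequentially and the file is ordered in some way (an assumption here, suggested by the first rows all belonging to Virgin America). A minimal sketch of a random split on hypothetical toy labels follows; the seed and the label counts are arbitrary, and this is not the split used for the results in this report.

```r
set.seed(123)  # arbitrary seed, only to make the sketch reproducible

# Toy labels (hypothetical), deliberately ordered by class
labels <- factor(rep(c("negative", "neutral", "positive"),
                     times = c(60, 25, 15)))

n <- length(labels)

# Draw 70% of the row indices at random for training; the rest is the test set
train_idx <- sample(n, size = round(0.7 * n))
test_idx <- setdiff(seq_len(n), train_idx)

train_labels <- labels[train_idx]
test_labels <- labels[test_idx]

# With a random split the class proportions should be similar in both sets
print(prop.table(table(train_labels)))
print(prop.table(table(test_labels)))
```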

Word Cloud

A word cloud is a way to visually represent how frequently words appear in text data.

library(wordcloud)
wordcloud(airlines_corpus,min.freq = 250,random.order = FALSE)

Creation of the subsets

First we split the tweets by sentiment.

positive <- subset(airlines, airline_sentiment=="positive")
negative <- subset(airlines, airline_sentiment=="negative")
neutral <- subset(airlines, airline_sentiment=="neutral")

Word Cloud positive messages

wordcloud(positive$text, max.words = 100, scale = c(3, 0.5))

Word Cloud negative messages

wordcloud(negative$text, max.words = 100, scale = c(3, 0.5))

Word Cloud neutral messages

wordcloud(neutral$text, max.words = 100, scale = c(3, 0.5))

Frequent Words

We use findFreqTerms() to find the words that appear at least 50 times in the training data.

airlines_freq_words <- findFreqTerms(airlines_dtm_train,50)
str(airlines_freq_words)

##  chr [1:352] "<U+0089>ûïjetblu""| __truncated__ "abl" "account" "actual" "add" ...
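As a rough illustration of what findFreqTerms() selects, the sketch below keeps the terms whose total count across all documents reaches a threshold. The toy matrix and the threshold of 2 are assumptions for illustration; the analysis above uses the real DTM and a threshold of 50.

```r
# Toy document-term matrix (hypothetical): rows are documents, columns are terms
toy_dtm <- matrix(c(1, 0, 2,
                    1, 1, 3,
                    0, 0, 4),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(NULL, c("bag", "gate", "flight")))

# Total number of occurrences of each term over all documents
term_totals <- colSums(toy_dtm)

# Keep the terms that appear at least 2 times in total
frequent <- names(term_totals[term_totals >= 2])

print(frequent)
```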

Now we filter the DTM so that the training and test sets keep only the frequent words.

airlines_dtm_freq_train <- airlines_dtm_train[ , airlines_freq_words]

airlines_dtm_freq_test <- airlines_dtm_test[ , airlines_freq_words]

We load the e1071 library to be able to use naive Bayes, and convert the filtered DTMs to matrices.

library(e1071)
airlines_train <- as.matrix(airlines_dtm_freq_train)
airlines_test <- as.matrix(airlines_dtm_freq_test)
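Step 2 describes the word-bag method as a variable that only indicates whether a word appears, while the matrices above still hold word counts. A minimal sketch of how counts could be recoded as presence indicators follows. The helper convert_counts() and the toy matrix are hypothetical, and this recoding is not applied in the results reported in this document.

```r
# Hypothetical helper: recode a word count as a presence indicator
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

# Toy count matrix (hypothetical) standing in for airlines_train
toy_counts <- matrix(c(0, 2, 1, 0, 0, 3), nrow = 2,
                     dimnames = list(NULL, c("delay", "flight", "thank")))

# MARGIN = 2 applies the helper column by column and keeps the matrix shape
toy_presence <- apply(toy_counts, MARGIN = 2, convert_counts)

print(toy_presence)
```

As far as we know, naiveBayes() from e1071 models numeric columns as Gaussian predictors, so applying a recoding like this before training would probably change the predictions.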

We build the model using the training data.

airlines_classifier <- naiveBayes(airlines_train,airlines_train_labels)

We evaluate the model by predicting the sentiment of the test tweets.

airlines_test_pred <- predict(airlines_classifier, airlines_test)

Confusion Matrix

library(gmodels)

CrossTable(airlines_test_pred, airlines_test_labels, prop.chisq = FALSE, prop.t = FALSE, dnn = c(' predicted', 'actual'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  4386 
## 
##  
##              | actual 
##    predicted |  negative |   neutral |  positive | Row Total | 
## -------------|-----------|-----------|-----------|-----------|
##     negative |      1939 |       121 |        72 |      2132 | 
##              |     0.909 |     0.057 |     0.034 |     0.486 | 
##              |     0.605 |     0.181 |     0.140 |           | 
## -------------|-----------|-----------|-----------|-----------|
##      neutral |       410 |       215 |        44 |       669 | 
##              |     0.613 |     0.321 |     0.066 |     0.153 | 
##              |     0.128 |     0.321 |     0.086 |           | 
## -------------|-----------|-----------|-----------|-----------|
##     positive |       855 |       333 |       397 |      1585 | 
##              |     0.539 |     0.210 |     0.250 |     0.361 | 
##              |     0.267 |     0.498 |     0.774 |           | 
## -------------|-----------|-----------|-----------|-----------|
## Column Total |      3204 |       669 |       513 |      4386 | 
##              |     0.731 |     0.153 |     0.117 |           | 
## -------------|-----------|-----------|-----------|-----------|
## 
##
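The confusion matrix can be summarized with a few standard measures. The sketch below copies the counts from the CrossTable output above into a base-R matrix and computes the overall accuracy and the per-class recall and precision; the variable names are assumptions for illustration.

```r
# Counts copied from the CrossTable output: rows = predicted, columns = actual
confusion <- matrix(
  c(1939, 121,  72,
     410, 215,  44,
     855, 333, 397),
  nrow = 3, byrow = TRUE,
  dimnames = list(predicted = c("negative", "neutral", "positive"),
                  actual    = c("negative", "neutral", "positive"))
)

# Overall accuracy: correctly classified tweets divided by all test tweets
accuracy <- sum(diag(confusion)) / sum(confusion)

# Recall per class: correct predictions divided by the actual total (column total)
recall <- diag(confusion) / colSums(confusion)

# Precision per class: correct predictions divided by the predicted total (row total)
precision <- diag(confusion) / rowSums(confusion)

print(round(accuracy, 3))
print(round(recall, 3))
print(round(precision, 3))
```

The accuracy is about 0.58 (2551 of 4386 tweets), which is lower than the 0.73 that would be obtained by always predicting "negative" on this test set. Most errors are negative tweets that are predicted as positive (855) or neutral (410).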