Luis Ernesto Herrera/Carlos Alejandro Hernandez/ Miguel Angel Ibarra/ Edgar Pereida
May 2, 2017
For this project we use the AirlinesTweets dataset, which contains 14640 tweets about problems with each of the major U.S. airlines. Contributors were asked to first classify each tweet as positive, negative, or neutral, and then to give the reason behind the sentiment (such as “Late Flight” or “rude service”).
The data were collected through the Twitter API, using the official accounts of several U.S. airlines.
In this second step we prepare the data for the analysis. We will use the bag-of-words method, which ignores the order of the words and keeps only a variable indicating whether each word appears.
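As a rough illustration of the bag-of-words idea, the sketch below builds a tiny word-count table and its presence indicator in base R. The two example tweets and all variable names are made up for illustration; they are not taken from AirlinesTweets.csv, and the analysis itself relies on the tm package instead.

```r
# Two made-up tweets (hypothetical, not from the dataset)
tweets <- c("late flight again", "great flight great crew")

# Split each tweet into lowercase words; word order is discarded afterwards
tokens <- strsplit(tolower(tweets), "\\s+")

# Vocabulary: every distinct word across all tweets
vocab <- sort(unique(unlist(tokens)))

# Count matrix: one row per tweet, one column per word
counts <- t(sapply(tokens, function(words) {
  tabulate(match(words, vocab), nbins = length(vocab))
}))
colnames(counts) <- vocab

# Presence indicator: 1 if the word appears in the tweet, 0 otherwise
presence <- (counts > 0) * 1L
presence
```

Note that "great" occurs twice in the second tweet, so its count is 2 while its presence indicator is 1.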
Verify the structure of the dataset.
setwd("C:/Users/luisoftgb28/Desktop/Proyecto2/Proyecto2")
airlines <- read.csv("AirlinesTweets.csv", stringsAsFactors = FALSE)
str(airlines)
## 'data.frame': 14640 obs. of 4 variables:
## $ tweet_id : num 5.7e+17 5.7e+17 5.7e+17 5.7e+17 5.7e+17 ...
## $ airline_sentiment: chr "neutral" "positive" "neutral" "negative" ...
## $ airline : chr "Virgin America" "Virgin America" "Virgin America" "Virgin America" ...
## $ text : chr "@VirginAmerica What @dhepburn said." "@VirginAmerica plus you've added commercials to the experience... tacky." "@VirginAmerica I didn't today... Must mean I need to take another trip!" "@VirginAmerica it's really aggressive to blast obnoxious \"entertainment\" in your guests' faces & they have little recours"| __truncated__ ...
As you can see, the airline_sentiment field is of type character. To be able to use certain functions on it, we will convert it to a factor.
We convert the airline_sentiment field to a factor and check the result.
airlines$airline_sentiment <- factor(airlines$airline_sentiment)
str(airlines$airline_sentiment)
## Factor w/ 3 levels "negative","neutral",..: 2 3 2 1 1 1 3 2 3 3 ...
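To make the factor output easier to read, here is a small sketch that converts only the first four sentiment values shown in the str() output of the dataset. The variable names are illustrative.

```r
# First four sentiment values, as listed in the str() output of the dataset
sentiment <- c("neutral", "positive", "neutral", "negative")
sentiment_factor <- factor(sentiment)

# Levels are the distinct values, sorted alphabetically
levels(sentiment_factor)

# Integer codes that index into the levels
as.integer(sentiment_factor)
```

The codes come out as 2 3 2 1, which matches the start of the factor printed above: "neutral" is the second level, "positive" the third and "negative" the first.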
# Show table
table(airlines$airline_sentiment)
##
## negative neutral positive
## 9178 3099 2363
This table shows how many tweets in the dataset are negative, neutral, and positive: 9178 negative, 3099 neutral, and 2363 positive. Negative tweets clearly predominate, which suggests widespread dissatisfaction with the services of the U.S. airlines.
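The same counts can be expressed as percentages. The sketch below re-enters the three counts from the table by hand, so it runs without the dataset; the variable names are illustrative.

```r
# Counts copied from the table above
counts <- c(negative = 9178, neutral = 3099, positive = 2363)

# Share of each class, in percent, rounded to one decimal
shares <- round(100 * counts / sum(counts), 1)
shares
```

This gives roughly 62.7% negative, 21.2% neutral and 16.1% positive tweets.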
The following graph shows the positive, negative, and neutral tweets for each airline. Virgin America has the most even distribution of sentiment, while negative tweets strongly predominate for the other airlines.
library("ggplot2")
ggplot(airlines, aes(x = factor(airline), fill = factor(airline_sentiment))) +
  geom_bar(position = "dodge") + xlab("") + ylab("Messages") +
  labs(fill = "Classification")
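A comparison between airlines can also be checked numerically with row-wise proportions. The sketch below uses a small made-up data frame with two hypothetical airlines "A" and "B" (not real data); on the real data the same call could be applied to airlines$airline and airlines$airline_sentiment.

```r
# Made-up example data (hypothetical airlines and labels)
toy <- data.frame(
  airline = c("A", "A", "A", "A", "B", "B", "B", "B"),
  airline_sentiment = c("negative", "neutral", "positive", "positive",
                        "negative", "negative", "negative", "neutral")
)

# Cross-tabulate, then divide each row by its total (margin = 1)
shares <- prop.table(table(toy$airline, toy$airline_sentiment), margin = 1)
round(shares, 2)
```

Each row sums to 1, so airlines with very different tweet volumes can be compared on the same scale.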
Because the tweets may contain numbers, punctuation, or extra blanks, we have to clean them by removing these. For this part we will use the tm package, which helps us clean the text.
library("tm")
A corpus is a collection of text documents. We create one with the VCorpus() command, using VectorSource() to read the text of each tweet in airlines.
airlines_corpus <- VCorpus(VectorSource(airlines$text))
print(airlines_corpus)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 14640
The inspect() function gives a summary of a document or group of documents.
inspect(airlines_corpus[1:2])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 35
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 72
To see the actual text of a message we can use as.character().
as.character(airlines_corpus[[1]])
## [1] "@VirginAmerica What @dhepburn said."
You can use the lapply() function to view multiple documents.
lapply(airlines_corpus[1:3], as.character)
## $`1`
## [1] "@VirginAmerica What @dhepburn said."
##
## $`2`
## [1] "@VirginAmerica plus you've added commercials to the experience... tacky."
##
## $`3`
## [1] "@VirginAmerica I didn't today... Must mean I need to take another trip!"
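Before the text is turned into a matrix, it may help to see what removing numbers, punctuation and blanks does to a single message. The sketch below uses base R gsub() on a made-up tweet, so it only approximates what the tm transformations do; the tweet and the variable names are hypothetical.

```r
# Made-up tweet (hypothetical, not from the dataset)
tweet <- "@airline Flight 123 was LATE... again!!  Not happy."

clean <- tolower(tweet)                   # lowercase everything
clean <- gsub("[[:digit:]]+", "", clean)  # remove numbers
clean <- gsub("[[:punct:]]+", "", clean)  # remove punctuation (including @)
clean <- gsub("\\s+", " ", clean)         # collapse runs of blanks
clean <- trimws(clean)                    # drop leading/trailing blanks
clean
```

The result is "airline flight was late again not happy". Stop-word removal and stemming are not covered by this sketch.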
- When information is obtained from different sources, data about the same object should be unified.
- Identify similar patterns during the merging process.
DocumentTermMatrix() creates a matrix with one row per tweet and one column per word, leaving out punctuation, numbers, blanks, and irrelevant words (stop words).
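airlines_dtm <- DocumentTermMatrix(airlines_corpus, control = list(
  tolower = TRUE, removeNumbers = TRUE, removePunctuation = TRUE,
  stopwords = TRUE, stemming = TRUE))
We divide the data into two parts: approximately 70% for the training set and 30% for the test set. The split is by row position and stops at row 14617, so the last 23 rows of the dataset are not used.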
airlines_dtm_train <- airlines_dtm[1:10231, ]
airlines_dtm_test <- airlines_dtm[10232:14617, ]
airlines_train_labels <- airlines[1:10231, ]$airline_sentiment
airlines_test_labels <- airlines[10232:14617, ]$airline_sentiment
prop.table(table(airlines_train_labels))
## airlines_train_labels
## negative neutral positive
## 0.5826410 0.2372202 0.1801388
prop.table(table(airlines_test_labels))
## airlines_test_labels
## negative neutral positive
## 0.7305062 0.1525308 0.1169631
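The class proportions differ noticeably between the two sets (about 58% negative in training versus 73% in testing), which is what one might expect when rows are split by position rather than at random. A possible alternative, sketched here under the assumption that a random split is acceptable for this analysis, draws the training rows with sample(). The seed value and the names train_idx and test_idx are illustrative choices, not part of the original code.

```r
set.seed(123)   # arbitrary seed, used only to make the draw reproducible
n <- 14617      # number of rows covered by the split above

# Randomly pick about 70% of the row indices for training
train_idx <- sample(n, size = round(0.7 * n))

# The remaining indices form the test set
test_idx <- setdiff(seq_len(n), train_idx)

length(train_idx)
length(test_idx)

# Assumed usage with the objects defined earlier:
# airlines_dtm_train    <- airlines_dtm[train_idx, ]
# airlines_train_labels <- airlines$airline_sentiment[train_idx]
```

With a random draw, both sets should have class proportions close to those of the full dataset, though the results would no longer match the output shown in this report.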
A word cloud is a way to visually represent how frequently words appear in text data.
library(wordcloud)
wordcloud(airlines_corpus, min.freq = 250, random.order = FALSE)
Next, we split the tweets by sentiment and draw a word cloud for each group.
positive <- subset(airlines, airline_sentiment == "positive")
negative <- subset(airlines, airline_sentiment == "negative")
neutral <- subset(airlines, airline_sentiment == "neutral")
wordcloud(positive$text, max.words = 100, scale = c(3, 0.5))
wordcloud(negative$text, max.words = 100, scale = c(3, 0.5))
wordcloud(neutral$text, max.words = 100, scale = c(3, 0.5))
We use findFreqTerms() to find the most frequent words, here the terms that appear at least 50 times in the training set.
airlines_freq_words <- findFreqTerms(airlines_dtm_train, 50)
str(airlines_freq_words)
## chr [1:352] "<U+0089>ûïjetblu""| __truncated__ "abl" "account" "actual" "add" ...
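The idea behind selecting frequent terms can be sketched in base R on a tiny made-up count matrix. This is not how tm implements findFreqTerms(); it only mimics the effect of a minimum-frequency threshold, and dtm_toy, min_count and the term names are hypothetical.

```r
# Toy document-term counts (hypothetical): 3 documents, 4 terms
dtm_toy <- matrix(c(1, 0, 2, 0,
                    1, 1, 0, 0,
                    1, 0, 1, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(NULL, c("flight", "gate", "late", "seat")))

# Total occurrences of each term across all documents
term_totals <- colSums(dtm_toy)

# Keep the terms that appear at least min_count times
min_count <- 2
freq_terms <- names(term_totals[term_totals >= min_count])
freq_terms

# Restrict the matrix to those columns
dtm_toy[, freq_terms, drop = FALSE]
```

Here only "flight" and "late" reach the threshold, so the reduced matrix has two columns.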
Now we filter the DTM so that it keeps only the columns for the frequent words.
airlines_dtm_freq_train <- airlines_dtm_train[ , airlines_freq_words]
airlines_dtm_freq_test <- airlines_dtm_test[ , airlines_freq_words]
library(e1071)
airlines_train <- as.matrix(airlines_dtm_freq_train)
airlines_test <- as.matrix(airlines_dtm_freq_test)
We build the model using the training data.
airlines_classifier <- naiveBayes(airlines_train, airlines_train_labels)
We evaluate the model by predicting the sentiment of the test tweets and comparing the predictions with the actual labels.
airlines_test_pred <- predict(airlines_classifier, airlines_test)
library(gmodels)
CrossTable(airlines_test_pred, airlines_test_labels, prop.chisq = FALSE, prop.t = FALSE, dnn = c(' predicted', 'actual'))
##
##
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table:  4386
##
##
##              | actual
##    predicted |  negative |   neutral |  positive | Row Total |
## -------------|-----------|-----------|-----------|-----------|
##     negative |      1939 |       121 |        72 |      2132 |
##              |     0.909 |     0.057 |     0.034 |     0.486 |
##              |     0.605 |     0.181 |     0.140 |           |
## -------------|-----------|-----------|-----------|-----------|
##      neutral |       410 |       215 |        44 |       669 |
##              |     0.613 |     0.321 |     0.066 |     0.153 |
##              |     0.128 |     0.321 |     0.086 |           |
## -------------|-----------|-----------|-----------|-----------|
##     positive |       855 |       333 |       397 |      1585 |
##              |     0.539 |     0.210 |     0.250 |     0.361 |
##              |     0.267 |     0.498 |     0.774 |           |
## -------------|-----------|-----------|-----------|-----------|
## Column Total |      3204 |       669 |       513 |      4386 |
##              |     0.731 |     0.153 |     0.117 |           |
## -------------|-----------|-----------|-----------|-----------|
##
##
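The table can be summarized with a couple of numbers. The sketch below re-enters the counts from the table by hand, so it does not depend on the fitted model, and computes the overall accuracy and the per-class recall. The variable names are illustrative.

```r
# Counts copied from the CrossTable output (rows = predicted, columns = actual)
classes <- c("negative", "neutral", "positive")
confusion <- matrix(c(1939, 121,  72,
                       410, 215,  44,
                       855, 333, 397),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(predicted = classes, actual = classes))

# Overall accuracy: correctly classified tweets over all test tweets
accuracy <- sum(diag(confusion)) / sum(confusion)

# Recall per class: correct predictions over the actual number in that class
recall <- diag(confusion) / colSums(confusion)

round(accuracy, 3)
round(recall, 3)
```

With these counts, 2551 of the 4386 test tweets are classified correctly (about 58%). The recall values, 0.605 for negative, 0.321 for neutral and 0.774 for positive, match the N / Col Total entries on the diagonal of the table.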