Airline tweets data

The twitter_data data frame has over 7,000 tweets about airlines. The tweets have already been classified as either complaints or non-complaints in the complaint_label column. Let’s get a sense of how many of these tweets are complaints.

## Parsed with column specification:
## cols(
##   index = col_double(),
##   tweet_id = col_double(),
##   date = col_datetime(format = ""),
##   complaint_label = col_character(),
##   tweet_text = col_character(),
##   usr_followers_count = col_double(),
##   usr_verified = col_logical()
## )
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
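
The `summarise()` messages above come from grouped summaries of the complaint labels. A minimal sketch of such a summary follows, assuming dplyr is installed. The rows are a hypothetical toy stand-in for twitter_data: only the column names come from the column specification above, and the label values "Complaint" and "Non-Complaint" are an assumption.

```r
library(dplyr)

# Hypothetical toy stand-in for twitter_data (the real data has over 7,000 rows).
twitter_data <- tibble(
  tweet_id = 1:4,
  complaint_label = c("Complaint", "Non-Complaint", "Non-Complaint", "Complaint"),
  tweet_text = c(
    "My flight was delayed for three hours",
    "Thanks for the smooth flight today",
    "Boarding now, see you soon",
    "Terrible service and another delayed flight"
  ),
  usr_followers_count = c(10, 200, 35, 5000),
  usr_verified = c(FALSE, FALSE, TRUE, TRUE)
)

# Count tweets per label and summarise the follower counts per label.
complaint_counts <- twitter_data %>%
  group_by(complaint_label) %>%
  summarize(
    number = n(),
    avg_followers = mean(usr_followers_count)
  )

complaint_counts
```
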

Tokenizing and cleaning

## Joining, by = "word"

It looks like complaints include frequent references to time, delays, and service. However, many of the most frequent words are simply the names of specific airlines. These can be treated as stop words specific to this data, and we’ll see how to remove them in the next chapter.
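
The tokenizing and cleaning step can be sketched as follows, assuming the tidytext package and its built-in stop_words table. The tweets are hypothetical toy rows, and the label value "Complaint" is an assumption about how complaint_label is coded.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy stand-in for twitter_data.
twitter_data <- tibble(
  tweet_id = 1:3,
  complaint_label = c("Complaint", "Complaint", "Non-Complaint"),
  tweet_text = c(
    "My flight was delayed for three hours",
    "Terrible service and another delayed flight",
    "Thanks for the smooth flight today"
  )
)

# One row per word, then drop common English stop words.
tidy_twitter <- twitter_data %>%
  unnest_tokens(word, tweet_text) %>%
  anti_join(stop_words, by = "word")

# Most frequent remaining words among complaints.
word_counts <- tidy_twitter %>%
  filter(complaint_label == "Complaint") %>%
  count(word, sort = TRUE)

word_counts
```

Passing `by = "word"` explicitly suppresses the `Joining, by = "word"` message seen in the output above.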

Plotting word counts

Improving word count plots

## Joining, by = "word"

Sentiment dictionaries

Appending dictionaries

## Joining, by = "word"
## Joining, by = "word"

Improving sentiment analysis

## Joining, by = "word"
## Joining, by = "word"
## `summarise()` regrouping output by 'complaint_label' (override with `.groups` argument)

Once the grouped summary has been reshaped with spread(), we can use mutate() to create a new overall_sentiment column. On aggregate, unverified users appear to complain more often.
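
A minimal sketch of the count, spread(), and mutate() pattern described here. The notes do not show which sentiment dictionary was used, so this example swaps in the "bing" lexicon, which ships with tidytext. The toy words and the usr_verified grouping are assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical tokenized tweets: one row per word, with the user's verified flag.
tidy_twitter <- tibble(
  usr_verified = c(FALSE, FALSE, FALSE, TRUE, TRUE),
  word = c("terrible", "awful", "great", "love", "good")
)

sentiment_counts <- tidy_twitter %>%
  # Keep only words that appear in the dictionary.
  inner_join(get_sentiments("bing"), by = "word") %>%
  # Count positive and negative words per group.
  count(usr_verified, sentiment) %>%
  # One column per sentiment; groups with no words of a kind get 0.
  spread(sentiment, n, fill = 0) %>%
  # Net sentiment: positive words minus negative words.
  mutate(overall_sentiment = positive - negative)

sentiment_counts
```

A negative overall_sentiment means a group used more negative than positive words.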

## Joining, by = "word"

Complaints are very negative while non-complaints are neutral at best.

Latent Dirichlet allocation

LDA is a standard topic model. Topic models find patterns of words appearing together.

Document term matrices

Creating a DTM

Create a DTM using our tidy_twitter data. In this case, each tweet is considered a document. Print tidy_twitter in the console to confirm the column names.

## <<DocumentTermMatrix (documents: 7044, terms: 17994)>>
## Non-/sparse entries: 59122/126690614
## Sparsity           : 100%
## Maximal term length: 44
## Weighting          : term frequency (tf)
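
A DTM like the one printed above can be built with tidytext's cast_dtm(), which needs the tm package installed. This is a sketch on hypothetical toy data. It assumes tidy_twitter has a tweet_id column identifying the document and a word column holding the tokens.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokenized tweets: one row per (tweet, word) occurrence.
tidy_twitter <- tibble(
  tweet_id = c(1, 1, 2, 2, 3),
  word = c("flight", "delayed", "flight", "service", "thanks")
)

dtm_twitter <- tidy_twitter %>%
  # Term frequency of each word within each tweet.
  count(word, tweet_id) %>%
  # Arguments: document column, term column, count column.
  cast_dtm(tweet_id, word, n)

dtm_twitter
```

Each row of the matrix is a tweet and each column a distinct word. Most tweets contain only a handful of the vocabulary's words, which is why the sparsity reported above rounds to 100%.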

Running topic models

Fitting an LDA

It’s time to run your first topic model! As discussed, the three additional arguments of the LDA() function (k, method, and control) are critical for properly running a topic model. Note that running the LDA() function could take about 10 seconds. This assumes the tidyverse, tidytext, and topicmodels packages are loaded and the tidy_twitter dataset is available.

## Formal class 'LDA_Gibbs' [package "topicmodels"] with 16 slots
##   ..@ seedwords      : NULL
##   ..@ z              : int [1:60688] 2 1 1 1 1 1 2 1 2 1 ...
##   ..@ alpha          : num 25
##   ..@ call           : language LDA(x = dtm_twitter, k = 2, method = "Gibbs", control = list(seed = 42))
##   ..@ Dim            : int [1:2] 7044 17994
##   ..@ control        :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] with 14 slots
##   ..@ k              : int 2
##   ..@ terms          : chr [1:17994] "_adowaa_" "_arzar" "_austrian" "_bbbb_" ...
##   ..@ documents      : chr [1:7044] "486973619952971776" "478816318784036864" "477008545637224448" "477077022695768064" ...
##   ..@ beta           : num [1:2, 1:17994] -12.65 -10.31 -10.26 -12.71 -7.94 ...
##   ..@ gamma          : num [1:7044, 1:2] 0.509 0.516 0.517 0.491 0.441 ...
##   ..@ wordassignments:List of 5
##   .. ..$ i   : int [1:59122] 1 1 1 2 2 2 2 2 2 2 ...
##   .. ..$ j   : int [1:59122] 1 5306 17631 2 2155 9755 10134 10337 10974 12281 ...
##   .. ..$ v   : num [1:59122] 2 1 1 1 1 1 2 1 2 1 ...
##   .. ..$ nrow: int 7044
##   .. ..$ ncol: int 17994
##   .. ..- attr(*, "class")= chr "simple_triplet_matrix"
##   ..@ loglikelihood  : num -506519
##   ..@ iter           : int 2000
##   ..@ logLiks        : num(0) 
##   ..@ n              : int 60688
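
The model summarized above can be reproduced in miniature. The LDA() call below copies the arguments shown in the call slot (k = 2, method = "Gibbs", control = list(seed = 42)). The toy tidy_twitter data is hypothetical, and the tidy() step is one possible way to inspect the result, not necessarily what the original notes did.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical tokenized tweets; the real DTM has 7044 documents.
tidy_twitter <- tibble(
  tweet_id = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
  word = c("flight", "delayed", "hours", "flight", "service",
           "thanks", "crew", "delayed", "service")
)

dtm_twitter <- tidy_twitter %>%
  count(word, tweet_id) %>%
  cast_dtm(tweet_id, word, n)

# k: number of topics; method: sampler; seed: makes the run reproducible.
lda_out <- LDA(
  dtm_twitter,
  k = 2,
  method = "Gibbs",
  control = list(seed = 42)
)

# One row per topic-term pair with the term's probability in that topic.
lda_topics <- lda_out %>%
  tidy(matrix = "beta")

lda_topics %>% arrange(desc(beta))
```

The beta slot in the str() output holds log probabilities, which is why its values are negative. By default, tidy() returns them on the probability scale.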