R Markdown

## # A tibble: 2 x 2
##   Product                                  starts_mean
##   <chr>                                          <dbl>
## 1 iRobot Roomba 650 for Pets                      4.49
## 2 iRobot Roomba 880 for Pets and Allergies        4.42

Text data is categorical…

## # A tibble: 2 x 2
##   Product                                  number_rows
##   <chr>                                          <int>
## 1 iRobot Roomba 880 for Pets and Allergies        1200
## 2 iRobot Roomba 650 for Pets                       633

Some natural language processing (NLP) vocabulary:

unnest_tokens() works somewhat like gather(), but for text: it splits the text column into one word per row, removes punctuation, converts each word to lowercase, and strips white space.
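As a minimal sketch of the tokenizing step described above, the example below runs unnest_tokens() on a tiny made-up data set. The `reviews` tibble and its column names `id` and `text` are assumptions for illustration, not the data used in these notes.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-review data set; column names are assumptions
reviews <- tibble(
  id = c(1, 2),
  text = c("Great vacuum, picks up PET hair!", "Stopped working after a week.")
)

# One row per word: punctuation is dropped and words are lowercased
tidy_reviews <- reviews %>%
  unnest_tokens(output = word, input = text)

tidy_reviews
```

A common follow-up is `anti_join(stop_words)` to drop very frequent words such as "a" and "after"; a join on the shared `word` column is the kind of step that prints a "Joining, by" message.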

Word Clouds

Sentiment analysis

## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005
## # A tibble: 1 x 2
##     min   max
##   <dbl> <dbl>
## 1    -5     5
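The two tables above look like a summary of sentiment lexicons: a count of entries per label and the range of a numeric score. Assuming the first comes from a positive/negative lexicon such as "bing", which ships with tidytext, a sketch of counting its entries and joining it to tokenized text could look like this. The `tidy_reviews` tibble is hypothetical.

```r
library(dplyr)
library(tidytext)

# The "bing" lexicon labels each word as positive or negative
bing <- get_sentiments("bing")
bing %>% count(sentiment)

# Hypothetical tokenized reviews, one word per row
tidy_reviews <- tibble(
  id = c(1, 1, 1, 2, 2),
  word = c("great", "vacuum", "love", "broken", "terrible")
)

# Keep only words found in the lexicon, then count sentiment per review
review_sentiment <- tidy_reviews %>%
  inner_join(bing, by = "word") %>%
  count(id, sentiment)

review_sentiment
```

Words that are missing from the lexicon are dropped by the inner join, so the counts only reflect words the lexicon knows about.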

Improving sentiment analysis

## Joining, by = "word"

Latent Dirichlet allocation

Latent Dirichlet allocation (LDA) searches for patterns of words occurring together within and across a collection of documents, also known as a corpus.

Topic modeling

  • Topics are inferred from word frequencies, which are discrete counts.
  • Every document is a mixture (i.e. partial member) of every topic.

Document term matrix and LDA modeling

## Joining, by = "word"
## Formal class 'LDA_Gibbs' [package "topicmodels"] with 16 slots
##   ..@ seedwords      : NULL
##   ..@ z              : int [1:500] 1 1 2 2 1 1 1 1 1 1 ...
##   ..@ alpha          : num 25
##   ..@ call           : language LDA(x = dtm_matrix, k = 2, method = "Gibbs", control = list(seed = 42))
##   ..@ Dim            : int [1:2] 500 211
##   ..@ control        :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] with 14 slots
##   ..@ k              : int 2
##   ..@ terms          : chr [1:211] "thank" "enthusiasm" "hung" "well" ...
##   ..@ documents      : chr [1:500] "222819" "93079" "26657" "51889" ...
##   ..@ beta           : num [1:2, 1:211] -4.89 -7.87 -5.54 -7.87 -7.94 ...
##   ..@ gamma          : num [1:500, 1:2] 0.51 0.51 0.49 0.49 0.51 ...
##   ..@ wordassignments:List of 5
##   .. ..$ i   : int [1:500] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ j   : int [1:500] 1 2 3 4 5 6 7 6 8 9 ...
##   .. ..$ v   : num [1:500] 1 1 2 2 1 1 1 1 1 2 ...
##   .. ..$ nrow: int 500
##   .. ..$ ncol: int 211
##   .. ..- attr(*, "class")= chr "simple_triplet_matrix"
##   ..@ loglikelihood  : num -2492
##   ..@ iter           : int 2000
##   ..@ logLiks        : num(0) 
##   ..@ n              : int 500
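The structure above can be reproduced in miniature. The sketch below casts tidy word counts into a document-term matrix and fits a two-topic model with the same LDA() call recorded in the `call` slot. The `word_counts` tibble and its column names are assumptions for illustration; the real input in these notes is not shown.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical word counts per document (id, word, n are assumed names)
word_counts <- tibble(
  id = c("d1", "d1", "d2", "d2", "d3", "d3"),
  word = c("battery", "charge", "hair", "pets", "battery", "hair"),
  n = c(2, 1, 3, 1, 1, 2)
)

# Cast the tidy counts into a document-term matrix (documents x terms)
dtm_matrix <- word_counts %>%
  cast_dtm(document = id, term = word, value = n) %>%
  as.matrix()

# Same call as the one stored in the fitted model's call slot
lda_out <- LDA(x = dtm_matrix, k = 2, method = "Gibbs", control = list(seed = 42))

# beta: per-topic word probabilities; gamma: per-document topic proportions
word_probs <- tidy(lda_out, matrix = "beta")
doc_topics <- tidy(lda_out, matrix = "gamma")

doc_topics
```

The `beta` slot of the fitted object holds log probabilities, which is why the printed values are negative, whereas `tidy(..., matrix = "beta")` returns them on the probability scale by default. Each document's `gamma` values sum to 1 across topics, matching the idea that every document is a mixture of every topic.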