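# Point R at the folder that holds the survey export.
setwd("~/Documents/")

# Load the survey responses, keeping text columns as character vectors.
nps <- read.csv("nps_survey_full.csv", header = TRUE, stringsAsFactors = FALSE)

# Alternative input file, kept for reference.
# nps <- read.csv("NPS_end_of_treatment_2017_July2019.csv", header = TRUE, stringsAsFactors = FALSE)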

Prep the Data

Overall NPS Score

## [1] 0.3902363
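
For readers unfamiliar with the metric, a Net Promoter Score is conventionally the share of promoters (scores 9-10) minus the share of detractors (scores 0-6). The value printed above is on a 0 to 1 scale, which is consistent with that formula left as a proportion. The sketch below is only an illustration of the conventional calculation: the `score` vector is made-up data, and the variable names are assumptions, since the report's own code is not shown.

```r
# Hypothetical 0-10 survey scores; the real data come from the survey CSV.
score <- c(10, 9, 9, 8, 7, 10, 3, 6, 9, 10)

# Conventional NPS buckets: 0-6 detractor, 7-8 passive, 9-10 promoter.
nps_category <- cut(
  score,
  breaks = c(-Inf, 6, 8, Inf),
  labels = c("Detractor", "Passive", "Promoter")
)

# NPS as a proportion: share of promoters minus share of detractors.
nps_score <- mean(nps_category == "Promoter") - mean(nps_category == "Detractor")
print(nps_score)
```

Multiplying `nps_score` by 100 gives the score on the more familiar -100 to 100 scale.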

Yearly Changes

Monthly Change

Day of Week Analysis

Day of Week Scores

This plot shows the kernel density of the scores (0-10) received on each day of the week.

Here is a ridgeline plot of the score density by day. Because each day is drawn on its own baseline, the relative heights of the densities are harder to compare here than in the density plot above.
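
The two views described above could be drawn roughly as follows. This is a sketch under assumptions, not the report's actual code: the `responses` data frame, its `score` and `day` columns, and the simulated values are all hypothetical. It relies on `geom_density()` from ggplot2 and `geom_density_ridges()` from ggridges; the latter is what emits a "Picking joint bandwidth" message when the plot is printed.

```r
library(ggplot2)
library(ggridges)

# Hypothetical data: one row per response with a 0-10 score and a weekday.
set.seed(1)
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
responses <- data.frame(
  score = sample(0:10, 700, replace = TRUE, prob = c(rep(1, 7), 2, 3, 6, 10)),
  day   = factor(sample(days, 700, replace = TRUE), levels = days)
)

# Kernel densities drawn on a shared baseline, one coloured line per day.
density_plot <- ggplot(responses, aes(x = score, colour = day)) +
  geom_density()

# Ridgeline version: one density per day, stacked along the y axis.
ridge_plot <- ggplot(responses, aes(x = score, y = day)) +
  geom_density_ridges()
```

Printing `density_plot` or `ridge_plot` renders the figure; nothing is drawn until then.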

## Picking joint bandwidth of 0.428

Score over the Years

## Warning: Ignoring unknown aesthetics: binwidth

Tokenization

## # A tibble: 351,388 x 3
##    cat      `row_number()` word           
##    <fct>             <int> <chr>          
##  1 Promoter              1 smiledirectclub
##  2 Promoter              1 is             
##  3 Promoter              1 this           
##  4 Promoter              1 bizzzzomb      
##  5 Promoter              1 got            
##  6 Promoter              1 my             
##  7 Promoter              1 pearly         
##  8 Promoter              1 whites         
##  9 Promoter              1 poppin         
## 10 Promoter              3 treatment      
## # … with 351,378 more rows
## # A tibble: 7,273 x 2
##    word         n
##    <chr>    <int>
##  1 teeth     5162
##  2 smile     3378
##  3 results   2393
##  4 aligners  2346
##  5 easy      2070
##  6 service   1650
##  7 direct    1446
##  8 process   1404
##  9 customer  1299
## 10 happy     1175
## # … with 7,263 more rows
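
Output shaped like the two tables above (one word per row, then word counts) can be produced with `unnest_tokens()` from tidytext followed by a stop-word filter and `count()`. The sketch below assumes that workflow; the `comments` tibble, its column names, and the two example comments are hypothetical, and the report may have built its tables differently.

```r
library(dplyr)
library(tidytext)

# Hypothetical free-text comments tagged with an NPS category.
comments <- tibble(
  cat     = factor(c("Promoter", "Detractor")),
  comment = c("Easy process and great results", "Not happy with my results")
)

# One lowercase word per row; `id` records which comment each word came from.
tokens <- comments %>%
  mutate(id = row_number()) %>%
  unnest_tokens(word, comment)

# Drop common stop words, then count the remaining words.
word_counts <- tokens %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```

`unnest_tokens()` lowercases the text and strips punctuation by default, which is why the tokens in the first table are all lowercase.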

Sentiment Analysis

## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows
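
The lexicon printed above has the shape of the Bing lexicon that tidytext returns from `get_sentiments("bing")` (a `word` column and a `sentiment` column with positive and negative labels); that identification is an inference from the output, not something the report states. Assuming it, a minimal sketch of tagging tokens with polarity looks like this, with a hypothetical `tokens` tibble standing in for the real data.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens, one word per row, tagged with the NPS category.
tokens <- tibble(
  cat  = c("Promoter", "Promoter", "Detractor", "Detractor"),
  word = c("happy", "easy", "waste", "unhappy")
)

# Keep only words found in the lexicon and tally polarity per category.
sentiment_counts <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(cat, sentiment)
```

Words that are absent from the lexicon are silently dropped by the inner join, so the counts describe only the matched vocabulary.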

## Selecting by n

## Selecting by n

## Selecting by n

## Selecting by n

## # A tibble: 3,370 x 2
##    ngram                                n
##    <chr>                            <int>
##  1 not as straight as i                36
##  2 not happy with my results           26
##  3 not having to go to                 18
##  4 not happy with the results          17
##  5 not happy with my smile             11
##  6 not 100 satisfied with my            9
##  7 not satisfied with my results        9
##  8 not satisfied with the results       9
##  9 not completely satisfied with my     8
## 10 not 100 happy with my                7
## # … with 3,360 more rows
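
The table above lists five-word phrases that begin with "not". One way to get such a table, sketched here under assumptions, is to tokenize into 5-grams with `unnest_tokens(..., token = "ngrams", n = 5)` and keep the rows that start with the negation. The `comments` tibble and its contents are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical comments; only the first contains a negation.
comments <- tibble(
  cat     = c("Detractor", "Promoter"),
  comment = c("I am not happy with my results at all",
              "Super easy and I love my new smile")
)

# Split each comment into overlapping 5-word phrases, keep those starting
# with "not", and count how often each phrase occurs.
negated <- comments %>%
  unnest_tokens(ngram, comment, token = "ngrams", n = 5) %>%
  filter(startsWith(ngram, "not ")) %>%
  count(ngram, sort = TRUE)
```

Looking at negated phrases matters because a single-word sentiment lookup would score "not happy" as positive.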

## Warning in comparison.cloud(., random.order = FALSE, title.size = 1.5,
## max.words = 250, : absolutely could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(., random.order = FALSE, title.size = 1.5,
## max.words = 250, : changed could not be fit on page. It will not be
## plotted.

## # A tibble: 13,440 x 6
##    cat       word             n       tf   idf   tf_idf
##    <fct>     <chr>        <int>    <dbl> <dbl>    <dbl>
##  1 Detractor waste          130 0.000815 0.405 0.000331
##  2 Promoter  amazed          27 0.000188 1.10  0.000207
##  3 Promoter  compliments     24 0.000167 1.10  0.000184
##  4 Promoter  painless        59 0.000411 0.405 0.000167
##  5 Detractor unhappy         62 0.000389 0.405 0.000158
##  6 Promoter  economical      20 0.000139 1.10  0.000153
##  7 Promoter  transformed     19 0.000132 1.10  0.000145
##  8 Detractor lied            19 0.000119 1.10  0.000131
##  9 Detractor unsatisfied     18 0.000113 1.10  0.000124
## 10 Detractor dissatisfied    17 0.000107 1.10  0.000117
## # … with 13,430 more rows

A central question in text analysis is what the text is about. To explore that, we will try two things:

- Rank the terms using tf-idf (term frequency-inverse document frequency).
- Extract the topic or subject of the text using LDA (latent Dirichlet allocation).

The tf-idf statistic measures how important a term is to a document within a collection of documents. In our context, we are trying to see whether a word has special meaning in one respondent's comments relative to all other comments.

We will use the bind_tf_idf function to compute the statistic and then sort the terms by their relative importance.
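
As a small worked illustration, the sketch below runs `bind_tf_idf()` from tidytext on a toy table of word counts. The data and the choice of three documents (one per NPS category) are assumptions made for the example. That choice is at least consistent with the idf values in the printed tables: 0.405 is approximately ln(3/2) and 1.10 is approximately ln(3), which is what the natural-log idf gives for a term found in two of three documents and in one of three, respectively.

```r
library(dplyr)
library(tidytext)

# Hypothetical per-category word counts (three documents, one per category).
word_counts <- tibble(
  cat  = c("Promoter", "Promoter", "Passive", "Passive", "Detractor", "Detractor"),
  word = c("easy", "results", "results", "waste", "waste", "results"),
  n    = c(4, 6, 5, 1, 3, 7)
)

# tf  = n / total words in the document
# idf = ln(number of documents / number of documents containing the word)
tf_idf <- word_counts %>%
  bind_tf_idf(word, cat, n) %>%
  arrange(desc(tf_idf))
```

In this toy table "results" appears in every category, so its idf and tf-idf are zero, while "easy" appears only under Promoter and ranks first.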

## Selecting by tf_idf

## # A tibble: 221,791 x 6
##    cat       ngram                 n       tf   idf   tf_idf
##    <fct>     <chr>             <int>    <dbl> <dbl>    <dbl>
##  1 Promoter  easy to use         212 0.00154  0.405 0.000626
##  2 Promoter  was so easy          77 0.000561 1.10  0.000616
##  3 Promoter  so easy and          74 0.000539 1.10  0.000592
##  4 Promoter  and easy to          64 0.000466 1.10  0.000512
##  5 Detractor waste of money       55 0.000359 1.10  0.000394
##  6 Promoter  thank you smile      46 0.000335 1.10  0.000368
##  7 Promoter  very pleased with   124 0.000903 0.405 0.000366
##  8 Promoter  super easy and       41 0.000299 1.10  0.000328
##  9 Promoter  easy process and     39 0.000284 1.10  0.000312
## 10 Promoter  am very happy       105 0.000765 0.405 0.000310
## # … with 221,781 more rows
## Selecting by tf_idf