This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
setwd("~/Documents/")
nps = read.csv('nps_survey_full.csv', header=TRUE, stringsAsFactors=FALSE)
#nps = read.csv('NPS_end_of_treatment_2017_July2019.csv', header=TRUE, stringsAsFactors=FALSE)
Prep the Data
Overall NPS Score
## [1] 0.3902363
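The chunk that produced this number is not shown, so the following is only a minimal sketch of the standard NPS calculation: the share of promoters (scores 9-10) minus the share of detractors (scores 0-6). The `scores` vector and the helper names `nps_category` and `nps_score` are hypothetical. The value printed above suggests the score is reported as a proportion rather than on the -100 to 100 scale, and the sketch follows that convention.

```r
# Hypothetical scores on the 0-10 scale; the real data come from the survey CSV.
scores <- c(10, 9, 9, 8, 7, 6, 3, 10, 9, 5)

# Standard NPS buckets: 0-6 detractor, 7-8 passive, 9-10 promoter.
nps_category <- function(score) {
  cut(score,
      breaks = c(-Inf, 6, 8, Inf),
      labels = c("Detractor", "Passive", "Promoter"))
}

# NPS as a proportion: share of promoters minus share of detractors.
nps_score <- function(score) {
  score <- score[!is.na(score)]
  mean(score >= 9) - mean(score <= 6)
}

cats <- nps_category(scores)
table(cats)
nps_score(scores)
```

On these ten made-up scores there are five promoters and three detractors, so the score is 0.5 - 0.3 = 0.2. Multiplying by 100 gives the conventional -100 to 100 scale.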
Yearly Changes
Monthly Change
Day of Week Analysis
Day of Week Scores
This plot shows the kernel density of the scores (0-10) received by day.
Here is a ridgeline plot of the score density by day. Because each day is drawn on its own baseline, the relative heights of the densities are harder to compare here than in the density plot above.
## Picking joint bandwidth of 0.428
Score over the Years
## Warning: Ignoring unknown aesthetics: binwidth
Tokenization
## # A tibble: 351,388 x 3
##    cat      `row_number()` word
##    <fct>             <int> <chr>
##  1 Promoter              1 smiledirectclub
##  2 Promoter              1 is
##  3 Promoter              1 this
##  4 Promoter              1 bizzzzomb
##  5 Promoter              1 got
##  6 Promoter              1 my
##  7 Promoter              1 pearly
##  8 Promoter              1 whites
##  9 Promoter              1 poppin
## 10 Promoter              3 treatment
## # … with 351,378 more rows
## # A tibble: 7,273 x 2
##    word         n
##    <chr>    <int>
##  1 teeth     5162
##  2 smile     3378
##  3 results   2393
##  4 aligners  2346
##  5 easy      2070
##  6 service   1650
##  7 direct    1446
##  8 process   1404
##  9 customer  1299
## 10 happy     1175
## # … with 7,263 more rows
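The tokenization code is not shown in the report. As a rough base-R sketch of the two steps the tables above reflect (split comments into lowercase words, then count them after dropping common words), something like the following could be used. The `comments` vector, the tiny `stop_words` list, and the `tokenize` helper are all hypothetical, and the real analysis may well have used a tokenizer package instead.

```r
# Hypothetical comments; the real ones come from the survey CSV.
comments <- c("My teeth look great and the process was easy",
              "Easy process, happy with my smile!",
              "Not happy with my teeth")

# A tiny illustrative stop-word list (an assumption, not the list used in the report).
stop_words <- c("my", "and", "the", "was", "with", "not")

# Lowercase the text and split on anything that is not a letter, digit, or apostrophe.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9']+"))
  words[nzchar(words)]
}

tokens <- unlist(lapply(comments, tokenize))
tokens <- tokens[!tokens %in% stop_words]

# Count and sort, most frequent first.
word_counts <- sort(table(tokens), decreasing = TRUE)
head(word_counts, 5)
```

Removing stop words before counting is what keeps words such as "teeth" and "smile" at the top of the list instead of "my" and "the".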
Sentiment Analysis
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 2-faces     negative
##  2 abnormal    negative
##  3 abolish     negative
##  4 abominable  negative
##  5 abominably  negative
##  6 abominate   negative
##  7 abomination negative
##  8 abort       negative
##  9 aborted     negative
## 10 aborts      negative
## # … with 6,776 more rows
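The table above is a sentiment lexicon: one row per word with a positive or negative label. The report does not show how it is applied, so the following is only a sketch of the usual approach, which is to join the tokens to the lexicon and count labels per NPS category. The miniature `lexicon` and `tokens` data frames are made up for illustration.

```r
# A made-up four-word lexicon standing in for the full table above.
lexicon <- data.frame(
  word      = c("happy", "easy", "waste", "unhappy"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Hypothetical tokens, one row per word, tagged with the respondent's NPS category.
tokens <- data.frame(
  cat  = c("Promoter", "Promoter", "Detractor", "Detractor", "Detractor"),
  word = c("easy", "teeth", "waste", "unhappy", "easy"),
  stringsAsFactors = FALSE
)

# Inner join: tokens without a lexicon entry (here "teeth") are dropped.
scored <- merge(tokens, lexicon, by = "word")

# Cross-tabulate category against sentiment label.
sentiment_counts <- table(scored$cat, scored$sentiment)
sentiment_counts
```

A word-level lexicon ignores context, so a phrase like "not happy" still counts as positive; the n-gram tables that start with "not" help expose that limitation.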
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
## # A tibble: 3,370 x 2
##    ngram                                n
##    <chr>                            <int>
##  1 not as straight as i                36
##  2 not happy with my results           26
##  3 not having to go to                 18
##  4 not happy with the results          17
##  5 not happy with my smile             11
##  6 not 100 satisfied with my            9
##  7 not satisfied with my results        9
##  8 not satisfied with the results       9
##  9 not completely satisfied with my     8
## 10 not completely happy with my results
## # … with 3,360 more rows
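The table above lists five-word sequences that begin with "not". The code behind it is not shown; a minimal base-R sketch of the idea, assuming whitespace and punctuation splitting and a hypothetical `ngrams` helper, could look like this.

```r
# Build all n-word sequences from one comment.
ngrams <- function(text, n = 5) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9']+"))
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical comments for illustration.
comments <- c("I am not happy with my results at all",
              "Honestly not happy with my results",
              "Not as straight as I hoped")

all_ngrams <- unlist(lapply(comments, ngrams, n = 5))

# Keep only the sequences whose first word is "not", then count them.
not_ngrams <- all_ngrams[startsWith(all_ngrams, "not ")]
not_counts <- sort(table(not_ngrams), decreasing = TRUE)
not_counts
```

Filtering on a leading "not" surfaces negated phrases that single-word counts and word-level sentiment scores would miss.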
## Warning in comparison.cloud(., random.order = FALSE, title.size = 1.5,
## max.words = 250, : absolutely could not be fit on page. It will not be
## plotted.
## Warning in comparison.cloud(., random.order = FALSE, title.size = 1.5,
## max.words = 250, : changed could not be fit on page. It will not be
## plotted.
## # A tibble: 13,440 x 6
##    cat       word             n       tf   idf   tf_idf
##    <fct>     <chr>        <int>    <dbl> <dbl>    <dbl>
##  1 Detractor waste          130 0.000815 0.405 0.000331
##  2 Promoter  amazed          27 0.000188  1.10 0.000207
##  3 Promoter  compliments     24 0.000167  1.10 0.000184
##  4 Promoter  painless        59 0.000411 0.405 0.000167
##  5 Detractor unhappy         62 0.000389 0.405 0.000158
##  6 Promoter  economical      20 0.000139  1.10 0.000153
##  7 Promoter  transformed     19 0.000132  1.10 0.000145
##  8 Detractor lied            19 0.000119  1.10 0.000131
##  9 Detractor unsatisfied     18 0.000113  1.10 0.000124
## 10 Detractor dissatisfied    17 0.000107  1.10 0.000117
## # … with 13,430 more rows
A central question in text analysis is what the text is about. To explore that, we will try two things:
1. Rank the terms using tf-idf (term frequency-inverse document frequency).
2. Extract the topic or subject of the text using LDA (latent Dirichlet allocation).
The tf-idf statistic measures how important a term is to a document in a collection of documents. In our context, we are trying to see whether a term has special meaning in one respondent's comments relative to all other comments.
We will use the bind_tf_idf function to extract the statistic and sort based on the relative importance.
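To make the statistic concrete, here is a base-R sketch of the arithmetic: tf is a word's count divided by the total count in its document, idf is the natural log of the number of documents divided by the number of documents containing the word, and tf-idf is their product. The `counts` data frame is made up. The sketch assumes each NPS category (`cat`) is one document, which is how the tables in this report are keyed; their idf values of 0.405 and 1.10 match ln(3/2) and ln(3), which is what this formula gives with three documents.

```r
# Hypothetical word counts per NPS category (the "documents" in this sketch).
counts <- data.frame(
  cat  = c("Promoter", "Promoter", "Passive", "Detractor", "Detractor"),
  word = c("easy", "teeth", "teeth", "teeth", "waste"),
  n    = c(3, 5, 4, 2, 6),
  stringsAsFactors = FALSE
)

# Term frequency: a word's count divided by the total count in its document.
counts$tf <- counts$n / ave(counts$n, counts$cat, FUN = sum)

# Inverse document frequency: ln(number of documents / documents containing the word).
# Each (cat, word) pair is one row, so rows per word equals documents containing it.
n_docs <- length(unique(counts$cat))
docs_with_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_docs / docs_with_word)

counts$tf_idf <- counts$tf * counts$idf

# Sort so the most distinctive terms come first.
counts[order(-counts$tf_idf), ]
```

In this toy data "teeth" appears in all three categories, so its idf and tf-idf are zero, while "waste" appears only among detractors and ranks first. That is the same pattern the real tables show: words shared by every group drop out, and group-specific words rise.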
## Selecting by tf_idf
## # A tibble: 221,791 x 6
##    cat       ngram                 n       tf   idf   tf_idf
##    <fct>     <chr>             <int>    <dbl> <dbl>    <dbl>
##  1 Promoter  easy to use         212  0.00154 0.405 0.000626
##  2 Promoter  was so easy          77 0.000561  1.10 0.000616
##  3 Promoter  so easy and          74 0.000539  1.10 0.000592
##  4 Promoter  and easy to          64 0.000466  1.10 0.000512
##  5 Detractor waste of money       55 0.000359  1.10 0.000394
##  6 Promoter  thank you smile      46 0.000335  1.10 0.000368
##  7 Promoter  very pleased with   124 0.000903 0.405 0.000366
##  8 Promoter  super easy and       41 0.000299  1.10 0.000328
##  9 Promoter  easy process and     39 0.000284  1.10 0.000312
## 10 Promoter  am very happy       105 0.000765 0.405 0.000310
## # … with 221,781 more rows
## Selecting by tf_idf