Libraries

library(readr)
library(dplyr)
library(lubridate)
library(stringr)
library(purrr)
library(tibble)
library(tidytext)
library(textclean)
library(tm)
library(knitr)

Initial parsing of the data

There are two data sources for this project.

Chatboxes

The older contacts (2014 to mid-2017) are sourced directly from the chatbox tool. This system exports one file per day; further conversations on the same day are appended to the same file, separated by a line of equals signs.

# Read every daily export in Chatboxes/, then split each file into its individual
# conversations on the line of equals signs that separates them
chatboxes_raw <- map_chr(list.files(path = 'Chatboxes/', full.names = TRUE), read_file) %>%
  str_split("\n================================================================================\n")

chatboxes_parsed <- chatboxes_raw %>%
  unlist() %>%                 # One element per conversation
  tibble() %>%
  rename(Chat = 1) %>%
  filter(Chat != "") %>%       # Drop the empty fragments left over from the split
  mutate(Date = as.POSIXct(str_sub(Chat, 12, 32), format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")) %>% # ISO timestamp sits at a fixed position near the start of each chat
  mutate(Visitor = str_extract(Chat, "Visitor Name: (.*?)\nVisitor Email:")) %>%
  mutate(Visitor = str_replace(Visitor, "Visitor Name: ", "")) %>%
  mutate(Visitor = word(Visitor, sep = "\n"))  # Keep only the part before the newline, i.e. the name itself
glimpse(chatboxes_parsed)
## [1] "<Potentially sensitive content removed>"

Client Queries

The later contacts (mid-2017 to the present) are sourced from the company’s Active Collab database. This tool receives feeds from more than just the chatbox tool; sources such as the “Contact Us” page also add tasks to the system. The data is sourced as two tables: one containing information about the task, such as title and created date, and a second containing the task text.

# Task details, with the created timestamp converted to local (Sydney) time
cq <- read_csv('TaskDetails.csv') %>%
  mutate(`Created On` = with_tz(`Created On`, tz = "Australia/Sydney"))
# Task text, exported as a semicolon-delimited file without headers
cq_t <- read_delim('TaskText.txt', delim = ';', col_names = c("Assignment ID", "Text"))

cq_with_text <- cq %>%
  left_join(cq_t, by = "Assignment ID")
glimpse(cq_with_text)
## [1] "<Potentially sensitive content removed>"

Text format

To begin with, here’s an example of the format of the text in the chatbox data:

chatboxes_parsed %>%
  select(Chat) %>%
  head(1) %>%
  pull()
## [1] "<Potentially sensitive content removed>"

So, there’s a lot of information about the user’s computer and browser. Helpful in a tech support context, but less useful for this particular analysis. It will likely need to be removed or minimised to make things viable. Here’s the form the Active Collab extract takes:

cq_with_text %>%
  select(Text) %>%
  head(1) %>%
  pull()
## [1] "<Potentially sensitive content removed>"

Ah, that’s a lot of HTML. And at the end are the person’s details, just as they appeared at the start of the chatboxes. Again, this stuff will need removing to get to the ‘meat’ of the conversation.

Exploration

Let’s start with the dumbest possible approach: no cleaning at all, just raw counts of words. I fully expect this to be terrible/unusable, but hey, that means there’s nowhere to go but up, right?

First, the chatboxes:

chatboxes_parsed %>%
  rownames_to_column("ID") %>%    # Row number doubles as a conversation ID
  unnest_tokens(word, Chat) %>%   # One row per word per conversation
  group_by(word) %>%
  summarise(count = n(),
            conversations = n_distinct(ID)) %>%
  arrange(desc(count)) %>%
  top_n(10, count) %>%
  kable()
## [1] "<Potentially sensitive content removed>"

Next, the Active Collab items:

cq_with_text %>%
  rownames_to_column("ID") %>%  
  unnest_tokens(word, Text) %>%
  group_by(word) %>%
  summarise(count = n(),
            conversations = n_distinct(ID)) %>%
  arrange(desc(count)) %>%
  top_n(10, count) %>%
  kable()
## [1] "<Potentially sensitive content removed>"

So, as anticipated, the first versions of the lists don’t contain much we can use: lots of stop words, contextless numbers, and HTML strings. On the other hand, unnest_tokens() lower-cases everything by default, so at least the counts are already case insensitive. Time to do some cleaning!

First, the chatboxes:

chatbox_clean <- chatboxes_parsed %>%
  rownames_to_column("ID") %>%
  mutate(Chat = word(Chat, start = 2, end = -1, sep = "\n\n")) %>% # Remove the browser info etc. section
  mutate(Chat = str_replace_all(Chat, "\n", " ")) %>%
  mutate(Chat = removeNumbers(Chat)) %>%
  mutate(Chat = replace_contraction(Chat)) %>%
  mutate(Chat = str_to_title(Chat)) %>%                            # replace_names() seems to be case sensitive
  mutate(Chat = replace_names(Chat)) %>%                           # Removes common first and last names
  mutate(Chat = str_to_lower(Chat)) %>%
  mutate(Chat = str_replace_all(Chat, '[[:punct:] ]+', ' '))       # Collapse runs of punctuation and spaces into a single space
chatbox_clean %>%
  unnest_tokens(word, Chat) %>%
  anti_join(stop_words) %>%
  group_by(word) %>%
  summarise(count = n(),
            conversations = n_distinct(ID)) %>%
  arrange(desc(count)) %>%
  top_n(10, count) %>%
  kable()
## [1] "<Potentially sensitive content removed>"

Yes, this looks much better already. There are still some bits and bobs that may need tweaking, but it’s certainly more meaningful than the first round. It might be worth doing something like TF/IDF in the next steps to float the ‘unusual’ words to the top, but it’s a start. Now let’s do the same sort of cleaning for the Active Collab items.

cq_with_text_clean <- cq_with_text %>%
  rownames_to_column("ID") %>%
  mutate(Text = word(Text, 1, sep = "<br /><br /> \tNAME<br /><br />")) %>% # Remove the person's details, such as their email address, from the end of the string
  mutate(Text = str_replace_all(Text, "<.*?>", " ")) %>%                    # Remove HTML tags
  mutate(Text = removeNumbers(Text)) %>%
  mutate(Text = str_to_title(Text)) %>%                                     # replace_names() seems to be case sensitive
  mutate(Text = replace_names(Text)) %>%
  mutate(Text = str_to_lower(Text)) %>%
  mutate(Text = replace_contraction(Text)) %>%
  mutate(Text = str_replace_all(Text, '[[:punct:] ]+', ' '))                # Collapse runs of punctuation and spaces into a single space
cq_with_text_clean %>%
  unnest_tokens(word, Text) %>%
  anti_join(stop_words) %>%
  group_by(word) %>%
  summarise(count = n(),
            conversations = n_distinct(ID)) %>%
  arrange(desc(count)) %>%
  top_n(10, count) %>%
  kable()
## [1] "<Potentially sensitive content removed>"

Also a definite improvement. The fragments of website domains might be a problem, but it’s definitely more relevant than the first attempt.
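One option might be to strip anything that looks like a URL from the raw text before the punctuation step splits it into fragments. A minimal sketch, assuming the URLs are still intact in cq_with_text at that point (the regex here is an illustration, not tuned to the real data):

cq_no_urls <- cq_with_text %>%
  rownames_to_column("ID") %>%
  mutate(Text = str_replace_all(Text, "<.*?>", " ")) %>%                        # Remove HTML tags first
  mutate(Text = str_replace_all(Text, "(https?://|www\\.)[^[:space:]]+", " "))  # Then drop anything that looks like a URL

The same replacement could be slotted into the cleaning pipeline above, just before the punctuation removal.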

Lemmatisation

This should happen before any further analysis, so that different forms of the same word (e.g. “check”, “checked”, “checking”) are counted together.
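A minimal sketch of what that could look like, using textstem’s lemmatize_strings() on the already-cleaned chat text (textstem is an assumption here; it isn’t one of the libraries loaded above):

library(textstem)

chatbox_lemmatised <- chatbox_clean %>%
  mutate(Chat = lemmatize_strings(Chat))  # e.g. "checked" and "checking" both become "check"

chatbox_lemmatised %>%
  unnest_tokens(word, Chat) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  head(10) %>%
  kable()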

TF/IDF

I think this will be the next step, to push the ‘unusual’, conversation-specific words up the rankings.
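A minimal sketch of what that could look like with tidytext’s bind_tf_idf(), treating each conversation as a document (the column names follow the cleaned chatbox data above):

chatbox_clean %>%
  unnest_tokens(word, Chat) %>%
  count(ID, word) %>%              # Term counts per conversation
  bind_tf_idf(word, ID, n) %>%     # Adds tf, idf and tf_idf columns
  arrange(desc(tf_idf)) %>%
  head(10) %>%
  kable()

Words that appear in almost every conversation get a low idf, so the generic support vocabulary should drop down the list.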

Topic Modelling

Another approach to attempt once the text is cleaned and lemmatised.
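A minimal sketch of one way to try it: cast the cleaned chatbox counts to a document-term matrix and fit an LDA model with the topicmodels package (the package choice and the six topics are assumptions, not something settled above):

library(topicmodels)

chatbox_dtm <- chatbox_clean %>%
  unnest_tokens(word, Chat) %>%
  anti_join(stop_words) %>%
  count(ID, word) %>%
  cast_dtm(ID, word, n)            # tidytext helper: one row per conversation, one column per word

chatbox_lda <- LDA(chatbox_dtm, k = 6, control = list(seed = 1234))

# Top words per topic
tidy(chatbox_lda, matrix = "beta") %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  arrange(topic, desc(beta))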