This project ultimately seeks to develop a predictive text Shiny tool that provides text suggestions following user input. Such tools are routinely employed in cell phone messaging applications, website search fields, e-mail clients, word processors, and other such environments.
This document provides a brief summary of the initial steps completed toward that goal; refer to the comments in the Code Appendix for additional details.
An investigation of the texts revealed that each line represents an independent sample from its source and that the number of characters varied considerably, for blogs in particular. Other noteworthy observations include differences in sample size, with Twitter samples (2.36 million) far outnumbering Blog (0.90 million) and News (0.08 million) samples. The mean and median number of characters for Blog and News samples were fairly similar to one another and considerably higher than the corresponding values for Twitter samples, which is expected given Twitter's 140-character limit. Table 1 and Figure 1 provide additional summary details.
| Source | Samples | min | mean | median | max |
|---|---|---|---|---|---|
| Blog | 899288 | 1 | 230.0 | 156 | 40833 |
| News | 77259 | 2 | 202.4 | 186 | 5760 |
| Twitter | 2360148 | 2 | 68.7 | 64 | 140 |
A corpus is the entire body of text used in a natural language processing (NLP) project such as this one. The corpus is constructed by compiling all documents of interest and attaching useful variables such as the language, encoding, and source of each document.
Tokenization of the corpus is required for data analysis and involves breaking the text into its constituent components to isolate the elements of interest (for this project, words) and remove ancillary elements such as punctuation, numbers, and symbols. Table 2 provides a summary of the tokenized corpus, where “Types” means the number of distinct words, “Tokens” means the total number of words, and “Sentences” indicates the number of sentences. A minimal tokenization sketch follows the table.
| Source | Types | Tokens | Sentences |
|---|---|---|---|
| Blog | 296171 | 37127667 | 2072941 |
| News | 79048 | 2604138 | 143558 |
| Twitter | 352560 | 29642004 | 2588551 |
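To make the corpus and tokenization steps concrete, the following minimal sketch builds a toy corpus from two short invented strings and tokenizes it with quanteda; the full-scale versions used to produce Table 2 appear in STEPS 5 and 6 of the Code Appendix. The example texts and object names (toyTexts, toyCorpus, toyTokens) are illustrative placeholders only.

library(quanteda)
# toy corpus: each element plays the role of one line of source text (invented examples)
toyTexts <- c(Blog = "I baked 12 loaves of bread today!",
              Twitter = "Thanks for the follow :)")
toyCorpus <- corpus(toyTexts)
docvars(toyCorpus, "Source") <- names(toyTexts) # attach a document variable, analogous to Source above
# break the text into word tokens, dropping numbers, punctuation, and symbols
toyTokens <- tokens(toyCorpus, what = "word",
                    remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE)
ntype(toyTokens)  # "Types": distinct words per document
ntoken(toyTokens) # "Tokens": total words per document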
This project focuses on words and word associations; as such, the tokenization process removed all non-word elements. Table 3 presents the five most common single words, two-word strings, and three-word strings for each source. The most common single words and two-word strings are very similar across sources, and several of the same three-word strings are shared by the Blog and News sources; however, the most common three-word strings in Twitter do not overlap with the other sources. This suggests that a predictive text tool might need to take the context in which text is composed into account to enhance accuracy. A condensed sketch of the n-gram extraction appears after Table 3.
| Blog-1 | Blog-2 | Blog-3 | News-1 | News-2 | News-3 | Twitter-1 | Twitter-2 | Twitter-3 |
|---|---|---|---|---|---|---|---|---|
| the | of the | one of the | the | of the | one of the | the | in the | thanks for the |
| and | in the | a lot of | to | in the | a lot of | to | for the | looking forward to |
| to | to the | as well as | and | to the | as well as | i | of the | thank you for |
| a | on the | to be a | a | on the | according to the | a | on the | i love you |
| of | to be | it was a | of | for the | in the first | you | to be | for the follow |
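The two- and three-word strings in Table 3 were obtained by forming n-grams during tokenization and ranking them by frequency (STEPS 6 and 7 of the Code Appendix). The condensed sketch below illustrates the same idea on a single invented sentence using quanteda's tokens_ngrams() and topfeatures(); it is a simplified stand-in for the full dfm() calls in the appendix rather than a replacement for them.

library(quanteda)
toyTokens <- tokens("thanks for the follow and thanks for the support", remove_punct = TRUE) # invented example text
# form two-word and three-word strings (bigrams and trigrams)
toyBigrams  <- tokens_ngrams(toyTokens, n = 2, concatenator = " ")
toyTrigrams <- tokens_ngrams(toyTokens, n = 3, concatenator = " ")
# count and rank them via a document-feature matrix, as in STEP 6
topfeatures(dfm(toyBigrams), 3)  # most frequent two-word strings
topfeatures(dfm(toyTrigrams), 3) # most frequent three-word strings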
The next step in the project will involve using the common words and word strings to generate a predictive model of word associations. The end-goal Shiny tool will use this model to present the user with reasonable text options after they enter a word or phrase. One possible approach is sketched below.
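The model itself has not yet been built; one simple option, consistent with the n-gram counts already generated, is a highest-frequency-continuation lookup: given the last word or two entered, return the words that most often follow them in the corpus. The sketch below assumes a trigram frequency table like the counts summarized in STEP 6 (here filled with a few made-up rows for illustration); the object and function names (triGramCounts, predictNext) are hypothetical placeholders, not existing project code.

library(tidyverse) # dplyr and stringr, as loaded in STEP 1
# hypothetical trigram frequency table; in practice it could be derived from the
# STEP 6 counts (e.g., colSums() of the three-word document-feature matrix)
triGramCounts <- tibble(
    ngram = c("one of the", "a lot of", "thanks for the", "looking forward to"),
    count = c(300, 250, 200, 150)
)
# hypothetical helper: suggest up to n likely next words following a two-word phrase
predictNext <- function(phrase, counts, n = 3) {
    counts %>%
        filter(str_starts(ngram, fixed(paste0(str_to_lower(phrase), " ")))) %>%
        arrange(desc(count)) %>%
        mutate(suggestion = word(ngram, -1)) %>% # keep only the final word of each match
        pull(suggestion) %>%
        head(n)
}
predictNext("Thanks for", triGramCounts) # returns "the"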
## STEP 1: Load important libraries.
library(tidyverse) # dplyr, stringr, ggplot2, etc.
library(quanteda) # "Quantitative Analysis of Textual Data"
library(DT) # DataTables
library(readtext) # readtext
library(knitr) # for kable
library(kableExtra) # for fancier kables
## STEP 2: Read in the text files.
# The data were provided by the course instructors in the following .zip file:
# https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
# Download and unzip the files. The unzipped archive contains 4 folders (one per language) of 3 files each. Put all 12
# files into a single /data subdirectory of the working directory and rename them so that the variables evident in each file
# name are separated by '_' (i.e., replace the first '.' with '_'). Also change 'blogs' to 'Blog', 'news' to 'News', and 'twitter' to 'Twitter'.
# This project uses the English-language files. The code below reads each file line by line and converts it into a data
# frame, skipping null lines and identifying the encoding as UTF-8 (otherwise some characters render incorrectly). Two columns
# are added to the data frame to specify language (English) and source (Blog, News, or Twitter). The first column is
# renamed to "Text" because otherwise it is automatically named after the file name. The data frames are then combined into one
# object ("English"), the individual data frames are removed, a new column is added that indicates the number of characters
# per line, the columns are reordered, and the rows are grouped by source.
if(!exists("English")) { # this takes a minute so only run if necessary
enBlog <- as.data.frame(readLines("./data/en_US_Blog.txt", skipNul = TRUE, encoding = "UTF-8"),
stringsAsFactors = FALSE) %>% mutate(Language = "English", Source = "Blog")
enNews <- as.data.frame(readLines("./data/en_US_News.txt", skipNul = TRUE, encoding = "UTF-8"),
stringsAsFactors = FALSE) %>% mutate(Language = "English", Source = "News")
enTwit <- as.data.frame(readLines("./data/en_US_Twitter.txt", skipNul = TRUE, encoding = "UTF-8"),
stringsAsFactors = FALSE) %>% mutate(Language = "English", Source = "Twitter")
names(enBlog)[1] <- "Text"
names(enNews)[1] <- "Text"
names(enTwit)[1] <- "Text"
English <- bind_rows(enBlog, enNews, enTwit) %>%
mutate(Characters = nchar(Text)) %>%
select(Language, Source, Characters, Text)
English$Language <- as.factor(English$Language)
English$Source <- as.factor(English$Source)
English <- English %>% group_by(Source)
rm(list = c("enBlog", "enNews", "enTwit"))
}
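# optional sanity check: the per-source line counts should match the "Samples" column of Table 1
English %>% count(Source) # Blog 899288, News 77259, Twitter 2360148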
## STEP 3: Tabulate descriptives of data by source and output summary table.
EnglishSum <- English %>% summarize(
Samples = n(), # number of text samples (lines)
min = min(Characters), # minimum number of characters
mean = mean(Characters), # mean number of characters
median = median(Characters), # median number of characters
max = max(Characters) # maximum number of characters
)
kable(EnglishSum,
format = "html",
table.attr = "style='width:90%;'",
caption = "Table 1. Characters by Text Source: Descriptive Statistics",
align = 'c',
digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
## STEP 4: Plot descriptives of data by source.
if(!exists("plot1")) { # this takes a minute so only run if necessary
plot1 <- ggplot(English, aes(x = Source, y = Characters)) +
coord_cartesian(ylim = c(0, 1000)) +
labs(title = "Characters by Text Source",
x = "Text Source",
y = "Number of Characters",
caption = "Figure 1. Combined violin plots and boxplots reveal character number \ndistribution and quantiles by text source. The y-axis is limited to 1000 but \nsome blog and news samples extend higher; see Table 1.") +
geom_violin(trim = TRUE, fill = "gray", color = "darkred", size = 1.0) +
geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
theme_classic() +
theme(
plot.title = element_text(color="black", face="bold", hjust = 0.5, size = 20),
axis.title.x = element_text(color="black", face="bold", size = 14),
axis.title.y = element_text(color="black", face="bold", size = 14),
axis.text = element_text(color = "darkred", face = "bold", size = 12),
plot.caption = element_text(color = "black", face = "bold", hjust = 0.0, size = 12)
)
plot1
}
## STEP 5: Create a corpus and output summary table.
# This project uses the quanteda package; the quickstart page is very helpful: https://quanteda.io/articles/quickstart.html
# The code below creates a "corpus" (and a corpus summary) from the texts after reading them in using readtext:
# https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html.
if(!exists("enTexts")) { # this takes a minute so only run if necessary
enTexts <- readtext(paste0("./data/", "en*.txt"), # use readtext with paste0 to concatenate strings of English texts
docvarsfrom = "filenames", # identify source of docvars
docvarnames = c("Language", "Country", "Source"), # label docvars
dvsep = "_", # identify docvar separator
encoding = "UTF-8") # specify text encoding
}
# create the corpus from the read-in texts
if(!exists("enCorpus")) { # this takes a minute so only run if necessary
enCorpus <- corpus(enTexts)
}
# create a summary of the corpus; note that the summary is passed the same tokenization arguments so that it
# accurately reflects the corpus as used for the token analysis in the next step
if(!exists("enCorpusSum")) {# this takes several minutes so only run if necessary
enCorpusSum <- summary(enCorpus,
tolower = TRUE,
groups = "Source",
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE,
remove_twitter = TRUE,
remove_hyphens = TRUE,
remove_url = TRUE)
}
# convert summary to data frame to eliminate unnecessary clutter; also select desired variables in desired order
enCorpusSum <- as.data.frame(enCorpusSum) %>% select(Source, Types, Tokens, Sentences)
# output summary table
kable(enCorpusSum,
format = "html",
caption = "Table 2. Corpus Summary",
align = 'c') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "float_left")
## STEP 6: Tokenize the corpus.
# This will evaluate the corpus to identify "tokens" based on the set parameters. Since this project is primarily interested
# in words, the parameters are set to remove numbers, symbols, punctuation, etc.
# The following code also generates a "document-feature matrix" that will allow for exploratory analyses.
# One-word tokens.
if(!exists("enTokens")) { # this takes several minutes so only run if necessary
enTokens <- dfm(enCorpus,
tolower = TRUE,
groups = "Source",
what = "word",
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE,
remove_twitter = TRUE,
remove_hyphens = TRUE,
remove_url = TRUE)
}
# Two-word tokens.
if(!exists("enTokens2")) { # this takes a while so only run if necessary
enTokens2 <- dfm(enCorpus,
tolower = TRUE,
ngrams = 2,
concatenator = " ",
groups = "Source",
what = "word",
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE,
remove_twitter = TRUE,
remove_hyphens = TRUE,
remove_url = TRUE)
}
# Three-word tokens.
if(!exists("enTokens3")) { # this takes about 20 minutes so only run if necessary
enTokens3 <- dfm(enCorpus,
tolower = TRUE,
ngrams = 3,
concatenator = " ",
groups = "Source",
what = "word",
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE,
remove_twitter = TRUE,
remove_hyphens = TRUE,
remove_url = TRUE)
}
## STEP 7: Extract the top 5 tokens and output summary table.
if(!exists("topTokens")) { # this takes a while so only run if necessary
topBlog1 <- as.data.frame(names(topfeatures(enTokens["Blog"],5))) # top 5 one-word tokens, Blog
topBlog2 <- as.data.frame(names(topfeatures(enTokens2["Blog"],5))) # top 5 two-word tokens, Blog
topBlog3 <- as.data.frame(names(topfeatures(enTokens3["Blog"],5))) # top 5 three-word tokens, Blog
topNews1 <- as.data.frame(names(topfeatures(enTokens["News"],5))) # top 5 one-word tokens, News
topNews2 <- as.data.frame(names(topfeatures(enTokens2["News"],5))) # top 5 two-word tokens, News
topNews3 <- as.data.frame(names(topfeatures(enTokens3["News"],5))) # top 5 three-word tokens, News
topTwitter1 <- as.data.frame(names(topfeatures(enTokens["Twitter"],5))) # top 5 one-word tokens, Twitter
topTwitter2 <- as.data.frame(names(topfeatures(enTokens2["Twitter"],5))) # top 5 two-word tokens, Twitter
topTwitter3 <- as.data.frame(names(topfeatures(enTokens3["Twitter"],5))) # top 5 three-word tokens, Twitter
# bind the columns into a single data frame
topTokens <- bind_cols(topBlog1, topBlog2, topBlog3, topNews1, topNews2, topNews3, topTwitter1, topTwitter2, topTwitter3)
}
# rename the columns
names(topTokens) <- c("Blog-1", "Blog-2", "Blog-3", "News-1", "News-2", "News-3", "Twitter-1", "Twitter-2", "Twitter-3")
# output table
kable(topTokens,
format = "html",
caption = "Table 3. Top 5 Tokens by Source and Length",
align = 'c') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "float_left")