Overview

This project ultimately seeks to develop a predictive-text Shiny tool that suggests text as the user types. Such tools are routinely employed in cell-phone messaging applications, website search fields, e-mail clients, word processors, and similar environments.

This document provides a brief summary of the initial steps completed toward that goal; refer to the comments in the Code Appendix for additional details.

Initial Evaluation

An investigation of the texts revealed that each line represents an independent sample from its source and that the number of characters varies considerably, for Blog samples in particular. Other noteworthy observations include differences in sample size, with Twitter samples (2.36 million) far outnumbering Blog (0.90 million) and News (0.08 million) samples. The mean and median numbers of characters for Blog and News samples are fairly similar to one another and substantially higher than the corresponding values for Twitter samples, as expected given Twitter's 140-character limit. Table 1 and Figure 1 provide additional summary details.

Table 1. Characters by Text Source: Descriptive Statistics
Source   Samples  min   mean  median    max
Blog      899288    1  230.0     156  40833
News       77259    2  202.4     186   5760
Twitter  2360148    2   68.7      64    140

Corpus Construction and Tokenization

A corpus is the entire body of text used in a natural language processing (NLP) project such as this one. The corpus is constructed by compiling all documents of interest and attaching useful document-level variables such as language, source, and encoding.
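A minimal sketch of this idea is shown below. It is not the approach used in the Code Appendix (which reads the raw files with readtext before calling corpus()); it assumes quanteda's corpus() constructor for data frames and the "English" data frame assembled in the appendix.

# sketch only: build a corpus directly from the "English" data frame
library(quanteda)

enCorpusFromDf <- corpus(English, text_field = "Text") # one document per line of text
head(docvars(enCorpusFromDf)) # Language, Source, and Characters carried along as docvars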

Tokenization of the corpus is required for data analysis and involves breaking the text into its constituent components to isolate the elements of interest (for this project, words) and remove ancillary elements such as punctuation, numbers, and symbols. Table 2 provides a summary of the tokenized corpus, where “Types” is the number of distinct words, “Tokens” is the total number of words, and “Sentences” is the number of sentences.

Table 2. Corpus Summary
Source    Types    Tokens  Sentences
Blog     296171  37127667    2072941
News      79048   2604138     143558
Twitter  352560  29642004    2588551
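For reference, a minimal sketch of how the Types, Tokens, and Sentences counts could be reproduced with the newer quanteda interface (an assumption: quanteda version 2 or later, where tokens() replaces the tokenization arguments passed to summary() and dfm() in the Code Appendix). "enCorpus" is the corpus built in Step 5 of the appendix.

# sketch only: tokenize and count under the newer quanteda interface
enToks <- tokens(enCorpus,
                 what = "word",
                 remove_punct = TRUE,    # drop punctuation
                 remove_numbers = TRUE,  # drop numbers
                 remove_symbols = TRUE)  # drop symbols
enToks <- tokens_tolower(enToks)

ntype(enToks)       # "Types": distinct words per document (one document per source file)
ntoken(enToks)      # "Tokens": total words per document
nsentence(enCorpus) # "Sentences": sentence count per document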

Token Analysis

This project focuses on words and word associations; as such, the tokenization process removed all non-word elements. Table 3 presents the five most common single words, two-word strings, and three-word strings for each source. The most common single words and two-word strings are very similar across sources, and several of the same three-word strings are common to the Blog and News sources; however, the most common three-word strings in Twitter do not overlap with those of the other sources. This suggests that a predictive text tool might need to take the context of use into account to improve accuracy.

Table 3. Top 5 Tokens by Source and Length
Blog-1  Blog-2  Blog-3      News-1  News-2   News-3            Twitter-1  Twitter-2  Twitter-3
the     of the  one of the  the     of the   one of the        the        in the     thanks for the
and     in the  a lot of    to      in the   a lot of          to         for the    looking forward to
to      to the  as well as  and     to the   as well as        i          of the     thank you for
a       on the  to be a     a       on the   according to the  a          on the     i love you
of      to be   it was a    of      for the  in the first      you        to be      for the follow
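The sketch below illustrates, under the same quanteda >= 2 assumption and using the "enToks" object sketched above, how the most frequent word strings can be inspected; the Code Appendix itself builds these counts with the older dfm(ngrams = ...) interface.

# sketch only: three-word strings and their most frequent values
enTrigrams <- tokens_ngrams(enToks, n = 3, concatenator = " ") # build three-word strings
enTrigramDfm <- dfm(enTrigrams)                                # document-feature matrix of the strings

topfeatures(enTrigramDfm, 5)                                  # five most frequent overall
topfeatures(dfm_subset(enTrigramDfm, Source == "Twitter"), 5) # five most frequent in the Twitter document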

What’s Next?

The next step in the project will involve using the common words and word strings to build a predictive model of word associations. The end-goal Shiny tool will use this model to present the user with reasonable text options after a word or phrase is entered.
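To make the idea concrete, the sketch below shows one very simple way such a lookup could work once counts of three-word strings are available. Everything here is hypothetical: the "trigram_freq" table, its "ngram" and "count" columns, and the predict_next_word() helper are invented for illustration and are not part of the project code.

# sketch only: suggest next words by matching the user's last two words against
# a hypothetical table of observed three-word strings and their counts
predict_next_word <- function(phrase, trigram_freq, n = 3) {
        key <- str_to_lower(str_squish(phrase))                            # normalize the user input
        prefix <- paste(tail(str_split(key, " ")[[1]], 2), collapse = " ") # keep the last two words
        trigram_freq %>%
                filter(str_starts(ngram, fixed(paste0(prefix, " ")))) %>%  # match the two-word prefix
                arrange(desc(count)) %>%                                   # most frequent first
                head(n) %>%
                mutate(suggestion = word(ngram, 3)) %>%                    # third word is the suggestion
                pull(suggestion)
}

# hypothetical usage:
# trigram_freq <- tibble(ngram = c("thanks for the", "thanks for all"), count = c(500, 120))
# predict_next_word("Thanks for", trigram_freq) # "the", then "all"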

------------------------------------------------------------------------

Code Appendix

------------------------------------------------------------------------

Load libraries:

## STEP 1: Load important libraries.

library(tidyverse) # dplyr, stringr, ggplot2, etc.
library(quanteda) # "Quantitative Analysis of Textual Data"
library(DT) # DataTables
library(readtext) # readtext
library(knitr) # for kable
library(kableExtra) # for fancier kables

Read in text:

## STEP 2: Read in the text files.

# The data were provided by the course instructors in the following .zip file:
# https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

# Download and unzip the files. The archive contains 4 folders (one per language) of 3 files each. Put all 12
# files into a single /data subdirectory of the working directory and rename them so that the variables embedded in each
# file name are underscore-separated, i.e., replace the first '.' with '_'. Also change 'blogs' to 'Blog', 'news' to
# 'News', and 'twitter' to 'Twitter' (e.g., 'en_US.blogs.txt' becomes 'en_US_Blog.txt').

# This project will use the English-language files. The code below reads each file line by line and converts it into a data 
# frame, skipping null lines and specifying UTF-8 encoding (otherwise some characters appear garbled). Two columns are 
# added to the data frame to specify language (English) and source (Blog, News, or Twitter). The first column is 
# renamed to "Text"; otherwise it is automatically named after the file name. The data frames are then combined into one 
# object ("English"), the individual data frames are removed, a new column is added that indicates the number of characters
# per line, the columns are reordered, and the rows are grouped by source.

if(!exists("English")) { # this takes a minute so only run if necessary

enBlog <- as.data.frame(readLines("./data/en_US_Blog.txt", skipNul = TRUE, encoding = "UTF-8"), 
                        stringsAsFactors = FALSE) %>% mutate(Language = "English", Source = "Blog")

enNews <- as.data.frame(readLines("./data/en_US_News.txt", skipNul = TRUE, encoding = "UTF-8"), 
                        stringsAsFactors = FALSE) %>% mutate(Language = "English", Source = "News")

enTwit <- as.data.frame(readLines("./data/en_US_Twitter.txt", skipNul = TRUE, encoding = "UTF-8"), 
                        stringsAsFactors = FALSE) %>% mutate(Language = "English", Source = "Twitter")

names(enBlog)[1] <- "Text"
names(enNews)[1] <- "Text"
names(enTwit)[1] <- "Text"

English <- bind_rows(enBlog, enNews, enTwit) %>% 
        mutate(Characters = nchar(Text)) %>% 
        select(Language, Source, Characters, Text)

English$Language <- as.factor(English$Language)
English$Source <- as.factor(English$Source)
English <- English %>% group_by(Source)
        
rm(list = c("enBlog", "enNews", "enTwit"))
}

Table 1:

## STEP 3: Tabulate descriptives of data by source and output summary table.

EnglishSum <- English %>% summarize(
        Samples = n(), # number of text samples (lines)
        min = min(Characters), # minimum number of characters
        mean = mean(Characters), # mean number of characters
        median = median(Characters), # median number of characters
        max = max(Characters) # maximum number of characters
)

kable(EnglishSum, 
      format = "html", 
      table.attr = "style='width:90%;'",
      caption = "Table 1. Characters by Text Source: Descriptive Statistics",
      align = 'c',
      digits = 1) %>%
        kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Figure 1:

## STEP 4: Plot descriptives of data by source.

if(!exists("plot1")) { # this takes a minute so only run if necessary
  
plot1 <- ggplot(English, aes(x = Source, y = Characters)) + 
        coord_cartesian(ylim = c(0, 1000)) +
        labs(title = "Characters by Text Source",
             x = "Text Source",
             y = "Number of Characters",
             caption = "Figure 1. Combined violin plots and boxplots reveal character number \ndistribution and quantiles by text source. The y-axis is limited to 1000 but \nsome blog and news samples extend higher; see Table 1.") +
        geom_violin(trim = TRUE, fill = "gray", color = "darkred", size = 1.0) +
        geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
        theme_classic() + 
        theme(
                plot.title = element_text(color="black", face="bold", hjust = 0.5, size = 20),
                axis.title.x = element_text(color="black", face="bold", size = 14),
                axis.title.y = element_text(color="black", face="bold", size = 14),
                axis.text = element_text(color = "darkred", face = "bold", size = 12),
                plot.caption = element_text(color = "black", face = "bold", hjust = 0.0, size = 12)
                )
}

plot1 # display the figure even when it was created in an earlier run

Create corpus and Table 2:

## STEP 5: Create a corpus and output summary table.

# This project uses the quanteda package; the quickstart page is very helpful: https://quanteda.io/articles/quickstart.html
# The code below creates a "corpus" (and a corpus summary) from the texts after reading them in using readtext:
# https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html.

if(!exists("enTexts")) { # this takes a minute so only run if necessary

enTexts <- readtext(paste0("./data/", "en*.txt"), # read in all English text files matching this pattern
                    docvarsfrom = "filenames", # identify source of docvars
                    docvarnames = c("Language", "Country", "Source"), # label docvars
                    dvsep = "_", # identify docvar separator
                    encoding = "UTF-8") # specify text encoding
}

# create the corpus from the read-in texts
if(!exists("enCorpus")) { # this takes a minute so only run if necessary
enCorpus <- corpus(enTexts)
}

# create a summary of the corpus; note that the summary accepts tokenization arguments as well so that it
# accurately reflects the corpus as used for the token analysis in the next step
if(!exists("enCorpusSum")) {# this takes several minutes so only run if necessary
enCorpusSum <- summary(enCorpus,
                       tolower = TRUE,
                       groups = "Source",
                       remove_numbers = TRUE, 
                       remove_punct = TRUE,
                       remove_symbols = TRUE,
                       remove_separators = TRUE,
                       remove_twitter = TRUE,
                       remove_hyphens = TRUE,
                       remove_url = TRUE)        
}

# convert summary to data frame to eliminate unnecessary clutter; also select desired variables in desired order
enCorpusSum <- as.data.frame(enCorpusSum) %>% select(Source, Types, Tokens, Sentences)

# output summary table
kable(enCorpusSum,
      format = "html", 
      caption = "Table 2. Corpus Summary",
      align = 'c') %>% 
        kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                      full_width = FALSE, 
                      position = "float_left")

Create tokens:

## STEP 6: Tokenize the corpus.

# This will evaluate the corpus to identify "tokens" based on the set parameters. Since this project is primarily interested 
# in words, the parameters are set to remove numbers, symbols, punctuation, etc.

# The following code also generates a "document-feature matrix" that will allow for exploratory analyses.

# One-word tokens.
if(!exists("enTokens")) { # this takes several minutes so only run if necessary
enTokens <- dfm(enCorpus,
                tolower = TRUE,
                groups = "Source",
                what = "word",
                remove_numbers = TRUE, 
                remove_punct = TRUE, 
                remove_symbols = TRUE, 
                remove_separators = TRUE,
                remove_twitter = TRUE,
                remove_hyphens = TRUE,
                remove_url = TRUE)
}

# Two-word tokens.
if(!exists("enTokens2")) { # this takes a while so only run if necessary
enTokens2 <- dfm(enCorpus,
                tolower = TRUE,
                ngrams = 2,
                concatenator = " ",
                groups = "Source",
                what = "word",
                remove_numbers = TRUE, 
                remove_punct = TRUE, 
                remove_symbols = TRUE, 
                remove_separators = TRUE,
                remove_twitter = TRUE,
                remove_hyphens = TRUE,
                remove_url = TRUE)
}

# Three-word tokens.
if(!exists("enTokens3")) { # this takes about 20 minutes so only run if necessary
enTokens3 <- dfm(enCorpus,
                tolower = TRUE,
                ngrams = 3,
                concatenator = " ",
                groups = "Source",
                what = "word",
                remove_numbers = TRUE, 
                remove_punct = TRUE, 
                remove_symbols = TRUE, 
                remove_separators = TRUE,
                remove_twitter = TRUE,
                remove_hyphens = TRUE,
                remove_url = TRUE)
}
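# NOTE (assumption about package versions): the dfm() arguments above (ngrams, concatenator, remove_twitter,
# remove_hyphens, etc.) follow the quanteda 1.x interface. In later quanteda releases the equivalent workflow
# builds tokens() first, applies tokens_ngrams(), and then calls dfm(); the code is kept as written here because
# it reflects the interface available when this report was prepared.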

Extract top-5 tokens and Table 3:

## STEP 7: Extract the top 5 tokens and output summary table.

if(!exists("topTokens")) { # this takes a while so only run if necessary
topBlog1 <- as.data.frame(names(topfeatures(enTokens["Blog", ], 5))) # top 5 one-word tokens, Blog
topBlog2 <- as.data.frame(names(topfeatures(enTokens2["Blog", ], 5))) # top 5 two-word tokens, Blog
topBlog3 <- as.data.frame(names(topfeatures(enTokens3["Blog", ], 5))) # top 5 three-word tokens, Blog

topNews1 <- as.data.frame(names(topfeatures(enTokens["News", ], 5))) # top 5 one-word tokens, News
topNews2 <- as.data.frame(names(topfeatures(enTokens2["News", ], 5))) # top 5 two-word tokens, News
topNews3 <- as.data.frame(names(topfeatures(enTokens3["News", ], 5))) # top 5 three-word tokens, News

topTwitter1 <- as.data.frame(names(topfeatures(enTokens["Twitter", ], 5))) # top 5 one-word tokens, Twitter
topTwitter2 <- as.data.frame(names(topfeatures(enTokens2["Twitter", ], 5))) # top 5 two-word tokens, Twitter
topTwitter3 <- as.data.frame(names(topfeatures(enTokens3["Twitter", ], 5))) # top 5 three-word tokens, Twitter

# bind the columns into a single data frame
topTokens <- bind_cols(topBlog1, topBlog2, topBlog3, topNews1, topNews2, topNews3, topTwitter1, topTwitter2, topTwitter3)
}

# rename the columns
names(topTokens) <- c("Blog-1", "Blog-2", "Blog-3", "News-1", "News-2", "News-3", "Twitter-1", "Twitter-2", "Twitter-3")

# output table
kable(topTokens,
      format = "html", 
      caption = "Table 3. Top 5 Tokens by Source and Length",
      align = 'c') %>% 
        kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                      full_width = FALSE, 
                      position = "float_left")