This project ultimately seeks to develop a predictive text Shiny tool that provides text suggestions following user input. Such tools are routinely employed in cell phone messaging applications, website search fields, e-mail clients, word processors, and other such environments.
This document provides a brief summary of the initial steps completed toward that goal; refer to the comments in the Code Appendix for additional details.
An investigation of the texts revealed that each line represents an independent sample from its source and that the number of characters varied considerably, for blogs in particular. Other noteworthy observations include differences in sample size, with Twitter samples (2.36 million) far outnumbering Blog (0.90 million) and News (0.08 million) samples. The mean and median number of characters for Blog and News samples were fairly similar to one another and considerably higher than the corresponding values for Twitter samples, which is expected given Twitter's 140-character limit. Table 1 and Figure 1 provide additional summary details.
| Source | Samples | min | mean | median | max |
|---|---|---|---|---|---|
| Blog | 899288 | 1 | 230.0 | 156 | 40833 |
| News | 77259 | 2 | 202.4 | 186 | 5760 |
| Twitter | 2360148 | 2 | 68.7 | 64 | 140 |
A corpus is the entire body of text used in a natural language processing (NLP) project such as this one. The corpus is constructed by compiling all documents of interest and attaching useful variables such as the language, encoding, and source of each document.
Tokenization of the corpus is required for data analysis and involves breaking the text into its constituent components to isolate the elements of interest (for this project, words) and remove ancillary elements such as punctuation, numbers, and symbols. Table 2 provides a summary of the tokenized corpus, where “Types” means the number of distinct words, “Tokens” means the total number of words, and “Sentences” indicates the number of sentences. A minimal tokenization sketch follows the table.
| Source | Types | Tokens | Sentences |
|---|---|---|---|
| Blog | 296171 | 37127667 | 2072941 |
| News | 79048 | 2604138 | 143558 |
| Twitter | 352560 | 29642004 | 2588551 |
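To make the corpus and tokenization steps concrete, the following minimal sketch builds a toy corpus from two short invented strings and tokenizes it with quanteda; the full-scale versions used to produce Table 2 appear in STEPS 5 and 6 of the Code Appendix. The example texts and object names (toyTexts, toyCorpus, toyTokens) are illustrative placeholders only.

library(quanteda)
# toy corpus: each element plays the role of one line of source text (invented examples)
toyTexts <- c(Blog = "I baked 12 loaves of bread today!",
              Twitter = "Thanks for the follow :)")
toyCorpus <- corpus(toyTexts)
docvars(toyCorpus, "Source") <- names(toyTexts) # attach a document variable, analogous to Source above
# break the text into word tokens, dropping numbers, punctuation, and symbols
toyTokens <- tokens(toyCorpus, what = "word",
                    remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE)
ntype(toyTokens)  # "Types": distinct words per document
ntoken(toyTokens) # "Tokens": total words per document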
This project focuses on words and word associations; as such, the tokenization process removed all non-word elements. Table 3 presents the five most common single words, two-word strings, and three-word strings for each source. The most common single words and two-word strings are very similar across sources, and several of the same three-word strings are shared by the Blog and News sources; however, the most common three-word strings in Twitter do not overlap with the other sources. This suggests that a predictive text tool might need to take the context in which text is composed into account to enhance accuracy. A condensed sketch of the n-gram extraction appears after Table 3.
| Blog-1 | Blog-2 | Blog-3 | News-1 | News-2 | News-3 | Twitter-1 | Twitter-2 | Twitter-3 |
|---|---|---|---|---|---|---|---|---|
| the | of the | one of the | the | of the | one of the | the | in the | thanks for the |
| and | in the | a lot of | to | in the | a lot of | to | for the | looking forward to |
| to | to the | as well as | and | to the | as well as | i | of the | thank you for |
| a | on the | to be a | a | on the | according to the | a | on the | i love you |
| of | to be | it was a | of | for the | in the first | you | to be | for the follow |
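The two- and three-word strings in Table 3 were obtained by forming n-grams during tokenization and ranking them by frequency (STEPS 6 and 7 of the Code Appendix). The condensed sketch below illustrates the same idea on a single invented sentence using quanteda's tokens_ngrams() and topfeatures(); it is a simplified stand-in for the full dfm() calls in the appendix rather than a replacement for them.

library(quanteda)
toyTokens <- tokens("thanks for the follow and thanks for the support", remove_punct = TRUE) # invented example text
# form two-word and three-word strings (bigrams and trigrams)
toyBigrams  <- tokens_ngrams(toyTokens, n = 2, concatenator = " ")
toyTrigrams <- tokens_ngrams(toyTokens, n = 3, concatenator = " ")
# count and rank them via a document-feature matrix, as in STEP 6
topfeatures(dfm(toyBigrams), 3)  # most frequent two-word strings
topfeatures(dfm(toyTrigrams), 3) # most frequent three-word strings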
The next step in the project will involve using the common words and word strings to generate a predictive model of word associations. The end-goal Shiny tool will use this model to present the user with reasonable text options after they enter a word or phrase. One possible approach is sketched below.
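The model itself has not yet been built; one simple option, consistent with the n-gram counts already generated, is a highest-frequency-continuation lookup: given the last word or two entered, return the words that most often follow them in the corpus. The sketch below assumes a trigram frequency table like the counts summarized in STEP 6 (here filled with a few made-up rows for illustration); the object and function names (triGramCounts, predictNext) are hypothetical placeholders, not existing project code.

library(tidyverse) # dplyr and stringr, as loaded in STEP 1
# hypothetical trigram frequency table; in practice it could be derived from the
# STEP 6 counts (e.g., colSums() of the three-word document-feature matrix)
triGramCounts <- tibble(
    ngram = c("one of the", "a lot of", "thanks for the", "looking forward to"),
    count = c(300, 250, 200, 150)
)
# hypothetical helper: suggest up to n likely next words following a two-word phrase
predictNext <- function(phrase, counts, n = 3) {
    counts %>%
        filter(str_starts(ngram, fixed(paste0(str_to_lower(phrase), " ")))) %>%
        arrange(desc(count)) %>%
        mutate(suggestion = word(ngram, -1)) %>% # keep only the final word of each match
        pull(suggestion) %>%
        head(n)
}
predictNext("Thanks for", triGramCounts) # returns "the"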
## STEP 1: Load important libraries.
library(tidyverse) # dplyr, stringr, ggplot2, etc.
library(quanteda) # "Quantitative Analysis of Textual Data"
library(DT) # DataTables
library(readtext) # readtext
library(knitr) # for kable
library(kableExtra) # for fancier kables
## STEP 2: Read in the text files.
# The data were provided by the course instructors in the following .zip file:
# https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
# Download and unzip the files. The unzipped archive contains 4 folders (one per language) of 3 files each. Put all 12
# files into a single /data subdirectory of the working directory and rename them so that the variables evident in each file
# name are separated by '_' (i.e., replace the first '.' with '_'). Also change 'blogs' to 'Blog', 'news' to 'News', and 'twitter' to 'Twitter'.
# This project uses the English-language files. The code below reads each file line by line and converts it into a data
# frame, skipping null lines and identifying the encoding as UTF-8 (otherwise some characters render incorrectly). Two columns
# are added to the data frame to specify language (English) and source (Blog, News, or Twitter). The first column is
# renamed to "Text" because otherwise it is automatically named after the file name. The data frames are then combined into one
# object ("English"), the individual data frames are removed, a new column is added that indicates the number of characters
# per line, the columns are reordered, and the rows are grouped by source.
if(!exists("English")) { # this takes a minute so only run if necessary
enBlog <- as.data.frame(readLines("./data/en_US_Blog.txt", skipNul = TRUE, encoding = "UTF-8"),
stringsAsFactors = FALSE) %>% mutate(Language = "English", Source = "Blog")
enNews <- as.data.frame(readLines("./data/en_US_News.txt", skipNul = TRUE, encoding = "UTF-8"),
stringsAsFactors = FALSE) %>% mutate(Language = "English", Source = "News")
enTwit <- as.data.frame(readLines("./data/en_US_Twitter.txt", skipNul = TRUE, encoding = "UTF-8"),
stringsAsFactors = FALSE) %>% mutate(Language = "English", Source = "Twitter")
names(enBlog)[1] <- "Text"
names(enNews)[1] <- "Text"
names(enTwit)[1] <- "Text"
English <- bind_rows(enBlog, enNews, enTwit) %>%
mutate(Characters = nchar(Text)) %>%
select(Language, Source, Characters, Text)
English$Language <- as.factor(English$Language)
English$Source <- as.factor(English$Source)
English <- English %>% group_by(Source)
rm(list = c("enBlog", "enNews", "enTwit"))
}
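# optional sanity check: the per-source line counts should match the "Samples" column of Table 1
English %>% count(Source) # Blog 899288, News 77259, Twitter 2360148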
## STEP 3: Tabulate descriptives of data by source and output summary table.
EnglishSum <- English %>% summarize(
Samples = n(), # number of text samples (lines)
min = min(Characters), # minimum number of characters
mean = mean(Characters), # mean number of characters
median = median(Characters), # median number of characters
max = max(Characters) # maximum number of characters
)
kable(EnglishSum,
format = "html",
table.attr = "style='width:90%;'",
caption = "Table 1. Characters by Text Source: Descriptive Statistics",
align = 'c',
digits = 1) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
## STEP 4: Plot descriptives of data by source.
if(!exists("plot1")) { # this takes a minute so only run if necessary
plot1 <- ggplot(English, aes(x = Source, y = Characters)) +
coord_cartesian(ylim = c(0, 1000)) +
labs(title = "Characters by Text Source",
x = "Text Source",
y = "Number of Characters",
caption = "Figure 1. Combined violin plots and boxplots reveal character number \ndistribution and quantiles by text source. The y-axis is limited to 1000 but \nsome blog and news samples extend higher; see Table 1.") +
geom_violin(trim = TRUE, fill = "gray", color = "darkred", size = 1.0) +
geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
theme_classic() +
theme(
plot.title = element_text(color="black", face="bold", hjust = 0.5, size = 20),
axis.title.x = element_text(color="black", face="bold", size = 14),
axis.title.y = element_text(color="black", face="bold", size = 14),
axis.text = element_text(color = "darkred", face = "bold", size = 12),
plot.caption = element_text(color = "black", face = "bold", hjust = 0.0, size = 12)
)
plot1
}
## STEP 5: Create a corpus and output summary table.
# This project uses the quanteda package; the quickstart page is very helpful: https://quanteda.io/articles/quickstart.html
# The code below creates a "corpus" (and a corpus summary) from the texts after reading them in using readtext:
# https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html.
if(!exists("enTexts")) { # this takes a minute so only run if necessary
enTexts <- readtext(paste0("./data/", "en*.txt"), # use readtext with paste0 to concatenate strings of English texts
docvarsfrom = "filenames", # identify source of docvars
docvarnames = c("Language", "Country", "Source"), # label docvars
dvsep = "_", # identify docvar separator
encoding = "UTF-8") # specify text encoding
}
# create the corpus from the read-in texts
if(!exists("enCorpus")) { # this takes a minute so only run if necessary
enCorpus <- corpus(enTexts)
}
# create a summary of the corpus; note that the summary is passed the same tokenization arguments so that it
# accurately reflects the corpus as used for the token analysis in the next step
if(!exists("enCorpusSum")) {# this takes several minutes so only run if necessary
enCorpusSum <- summary(enCorpus,
tolower = TRUE,
groups = "Source",
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE,
remove_twitter = TRUE,
remove_hyphens = TRUE,
remove_url = TRUE)
}
# convert summary to data frame to eliminate unnecessary clutter; also select desired variables in desired order
enCorpusSum <- as.data.frame(enCorpusSum) %>% select(Source, Types, Tokens, Sentences)
# output summary table
kable(enCorpusSum,
format = "html",
caption = "Table 2. Corpus Summary",
align = 'c') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "float_left")
## STEP 6: Tokenize the corpus.
# This will evaluate the corpus to identify "tokens" based on the set parameters. Since this project is primarily interested
# in words, the parameters are set to remove numbers, symbols, punctuation, etc.
# The following code also generates a "document-feature matrix" that will allow for exploratory analyses.
# One-word tokens.
if(!exists("enTokens")) { # this takes several minutes so only run if necessary
enTokens <- dfm(enCorpus,
tolower = TRUE,
groups = "Source",
what = "word",
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE,
remove_twitter = TRUE,
remove_hyphens = TRUE,
remove_url = TRUE)
}
# Two-word tokens.
if(!exists("enTokens2")) { # this takes a while so only run if necessary
enTokens2 <- dfm(enCorpus,
tolower = TRUE,
ngrams = 2,
concatenator = " ",
groups = "Source",
what = "word",
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE,
remove_twitter = TRUE,
remove_hyphens = TRUE,
remove_url = TRUE)
}
# Three-word tokens.
if(!exists("enTokens3")) { # this takes about 20 minutes so only run if necessary
enTokens3 <- dfm(enCorpus,
tolower = TRUE,
ngrams = 3,
concatenator = " ",
groups = "Source",
what = "word",
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE,
remove_twitter = TRUE,
remove_hyphens = TRUE,
remove_url = TRUE)
}
## STEP 7: Extract the top 5 tokens and output summary table.
if(!exists("topTokens")) { # this takes a while so only run if necessary
topBlog1 <- as.data.frame(names(topfeatures(enTokens["Blog"],5))) # top 5 one-word tokens, Blog
topBlog2 <- as.data.frame(names(topfeatures(enTokens2["Blog"],5))) # top 5 two-word tokens, Blog
topBlog3 <- as.data.frame(names(topfeatures(enTokens3["Blog"],5))) # top 5 three-word tokens, Blog
topNews1 <- as.data.frame(names(topfeatures(enTokens["News"],5))) # top 5 one-word tokens, News
topNews2 <- as.data.frame(names(topfeatures(enTokens2["News"],5))) # top 5 two-word tokens, News
topNews3 <- as.data.frame(names(topfeatures(enTokens3["News"],5))) # top 5 three-word tokens, News
topTwitter1 <- as.data.frame(names(topfeatures(enTokens["Twitter"],5))) # top 5 one-word tokens, Twitter
topTwitter2 <- as.data.frame(names(topfeatures(enTokens2["Twitter"],5))) # top 5 two-word tokens, Twitter
topTwitter3 <- as.data.frame(names(topfeatures(enTokens3["Twitter"],5))) # top 5 three-word tokens, Twitter
# bind the columns into a single data frame
topTokens <- bind_cols(topBlog1, topBlog2, topBlog3, topNews1, topNews2, topNews3, topTwitter1, topTwitter2, topTwitter3)
}
# rename the columns
names(topTokens) <- c("Blog-1", "Blog-2", "Blog-3", "News-1", "News-2", "News-3", "Twitter-1", "Twitter-2", "Twitter-3")
# output table
kable(topTokens,
format = "html",
caption = "Table 3. Top 5 Tokens by Source and Length",
align = 'c') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "float_left")