Introduction

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey built a smart keyboard that makes it easier for people to type on their mobile devices.

One cornerstone of their smart keyboard is predictive text models. When a user types some text, the keyboard presents three options for what the next word might be.

The aim of this project is to build a predictive model of text and illustrate its functionality in a Shiny App.

 

All project calculations were performed on the following PC:

print("Operating System:")
## [1] "Operating System:"
version
##                _                           
## platform       x86_64-w64-mingw32          
## arch           x86_64                      
## os             mingw32                     
## system         x86_64, mingw32             
## status                                     
## major          3                           
## minor          6.3                         
## year           2020                        
## month          02                          
## day            29                          
## svn rev        77875                       
## language       R                           
## version.string R version 3.6.3 (2020-02-29)
## nickname       Holding the Windsock

 

Methods and Analysis

Importing data

 

Let"s start by downloading data in our project inside data folder.

 

# Set seed for reproducible results
set.seed(12345)
# Create variable that saves url : data_url
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Check if directory data exists in working directory
if (!file.exists("./data")) {
    dir.create("./data")  # if not, create it
}
# Check if file exists inside directory data
if (!file.exists("./data/Coursera-SwiftKey.zip")) {
    download.file(data_url, destfile = "./data/Coursera-SwiftKey.zip", 
        mode = "wb")  # if not download
    Download_Date <- Sys.time()  # save download date
}
# Check if extracted folder exists
if (!file.exists("./data/final")) {
    unzip(zipfile = "./data/Coursera-SwiftKey.zip", exdir = "./data")  # if not unzip 
}

pathFiles <- file.path("./data/final", "en_US")
files <- list.files(pathFiles, recursive = TRUE)
files

 

Let"s check the result of unzip

list.files("./data/", recursive = TRUE)
##  [1] "Coursera-SwiftKey.zip"         "final/de_DE/de_DE.blogs.txt"  
##  [3] "final/de_DE/de_DE.news.txt"    "final/de_DE/de_DE.twitter.txt"
##  [5] "final/en_US/en_US.blogs.txt"   "final/en_US/en_US.news.txt"   
##  [7] "final/en_US/en_US.twitter.txt" "final/fi_FI/fi_FI.blogs.txt"  
##  [9] "final/fi_FI/fi_FI.news.txt"    "final/fi_FI/fi_FI.twitter.txt"
## [11] "final/ru_RU/ru_RU.blogs.txt"   "final/ru_RU/ru_RU.news.txt"   
## [13] "final/ru_RU/ru_RU.twitter.txt" "swearWords.csv"

We see that as a result of unzipping Coursera-SwiftKey.zip we now have a folder final with four other folders inside (de_DE, en_US, fi_FI, ru_RU).

Each folder contains three text files named according to this rule:

  • File 1: language_pattern.twitter.txt

  • File 2: language_pattern.blogs.txt

  • File 3: language_pattern.news.txt

Loading the data in

# Identify existing files with english pattern
list.files("./data/", pattern = "^en_US", recursive = TRUE)
## [1] "final/en_US/en_US.blogs.txt"   "final/en_US/en_US.news.txt"   
## [3] "final/en_US/en_US.twitter.txt"

Now let"s read the data using vroom library to get faster results:

# Read data into separate objects
# Read blogs
blogs <- vroom_lines("./data/final/en_US/en_US.blogs.txt")
# Read news
news <- vroom_lines("./data/final/en_US/en_US.news.txt")
# Read twitter
twitter <- vroom_lines("./data/final/en_US/en_US.twitter.txt")

Sampling Data

For the model we will build, we will not need all of the data. We will use the rbinom function to take a random sample of roughly 5% of each file.

# Sample data: set seed for reproducible results
set.seed(12345)
# Function to take a random sample of a given file: binom_sample
binom_sample <- function(orig_f) {
    # Get length of original file
    file_length <- length(orig_f)
    # Use rbinom to keep each line with probability 0.05
    orig_f[rbinom(file_length, 1, 0.05) == 1]
}
# Use binom_sample to sample blogs: sample_blog
sample_blog <- binom_sample(blogs)
# Use binom_sample to sample news: sample_news
sample_news <- binom_sample(news)
# Use binom_sample to sample twitter: sample_twitter
sample_twitter <- binom_sample(twitter)
# Remove unneeded objects
rm(blogs, news, twitter, binom_sample)

Data Tidying

We will get our files into tidy text format to make the cleaning process more straightforward. To get the tidy format we will make use of the tidytext package.

We will use the unnest_tokens function to tokenize the text, meaning to split the text into single words (unigrams) and later on into groups of two and three words (bigrams and trigrams).

With these words we will create a so-called bag of words, meaning that for now we do not care about structure or grammar.

# Convert sample blog: tidy_blog
tidy_blog <- sample_blog %>% tibble(text = .) %>% unnest_tokens(word, 
    text, format = "text")
# Convert sample news: tidy_news
tidy_news <- sample_news %>% tibble(text = .) %>% unnest_tokens(word, 
    text, format = "text")
# Convert sample twitter: tidy_twitter
tidy_twitter <- sample_twitter %>% tibble(text = .) %>% unnest_tokens(word, 
    text, format = "text")

Data Cleaning

Now we will proceed with data cleaning using tidyverse grammar. We will create a function that will do all the cleaning for us.

# Read a list of swearwords
swear <- read_csv("./data/swearWords.csv")
# Create unary cleaning function
stop_num <- . %>%
    # Remove stopwords
    anti_join(get_stopwords()) %>%
    # Remove numbers
    filter(is.na(suppressWarnings(as.numeric(word)))) %>%
    # Remove tokens containing non-alphanumeric characters
    filter(grepl("[^A-Za-z0-9 ]", word) == FALSE) %>%
    # Remove tokens containing whitespace
    filter(grepl("[\\s]+", word) == FALSE) %>%
    # Remove Twitter hashtags
    filter(grepl("#\\S+", word) == FALSE) %>%
    # Remove mentions
    filter(grepl("@\\S+", word) == FALSE) %>%
    # Remove control characters
    filter(grepl("[[:cntrl:]]", word) == FALSE) %>%
    # Remove HTML/XML tags
    filter(grepl("<[^>]*>", word) == FALSE) %>%
    # Remove URLs
    filter(grepl("http\\S+", word) == FALSE) %>%
    # Remove everything non-alphanumeric (non-English)
    filter(grepl("[^[:alnum:][:space:]]", word) == FALSE) %>%
    # Remove words of one or two letters
    filter(nchar(word) > 2) %>%
    # Remove swearwords
    anti_join(swear, by = "word") %>%
    # Remove punctuation and spaces
    mutate(word = str_replace_all(word, "[[:punct:]]|[[:space:]]", "")) %>%
    # Collapse letter sequences repeated more than twice
    mutate(word = str_replace_all(word, "(.{2,})\\1", "\\1"))

Let"s clean:

# Clean tidy_blog
tidy_blog <- tidy_blog %>% stop_num()
# Clean tidy_news
tidy_news <- tidy_news %>% stop_num()
# Clean tidy_twitter
tidy_twitter <- tidy_twitter %>% stop_num()
# Remove unneeded objects
rm(swear, stop_num)

Finally, we have our reasonably cleaned data as tibbles, which will make the following step of exploratory data analysis more straightforward.

Exploratory Data Analysis

After cleaning our data a little bit, a natural step is to get some insight into the data. Since a central task in text mining is to quantify what a text is about, one way to start is to measure word frequencies.

Let"s check 20 most common words inside each text.

# Top 20 words
blog_20counts <- tidy_blog %>% # Count by word
count(word) %>% arrange(desc(n)) %>% mutate(word = factor(word, 
    levels = rev(unique(word)))) %>% head(20)
# View
head(blog_20counts, 20)

Now let's check the above result graphically:

# Visualise
ggplot(head(blog_20counts, 20), aes(x = word, y = n)) + geom_bar(stat = "identity", 
    colour = "white", fill = customColor) + geom_text(aes(x = word, 
    y = 1, label = paste0("(", n, ")", sep = "")), hjust = 0, 
    vjust = 0.5, size = 4, colour = "black", fontface = "bold") + 
    labs(x = "Word", y = "Word Count", title = "Top 20 most Common Words in blog data") + 
    coord_flip() + theme_bw()

Let"s repeat our steps for tidy_news and tidy_twitter

For tidy_news

# Top 20 words in tidy_news
news_20counts <- tidy_news %>% # Count by word
count(word) %>% arrange(desc(n)) %>% mutate(word = factor(word, 
    levels = rev(unique(word)))) %>% head(20)
# View
head(news_20counts, 20)

Now let's check the above result graphically:

# Visualise
ggplot(head(news_20counts, 20), aes(x = word, y = n)) + geom_bar(stat = "identity", 
    colour = "white", fill = customColor) + geom_text(aes(x = word, 
    y = 1, label = paste0("(", n, ")", sep = "")), hjust = 0, 
    vjust = 0.5, size = 4, colour = "black", fontface = "bold") + 
    labs(x = "Word", y = "Word Count", title = "Top 20 most Common Words in news data") + 
    coord_flip() + theme_bw()

For tidy_twitter

# Top 20 words in tidy_twitter
twitter_20counts <- tidy_twitter %>% # Count by word
count(word) %>% arrange(desc(n)) %>% mutate(word = factor(word, 
    levels = rev(unique(word)))) %>% head(20)
# View
head(twitter_20counts, 20)

Now let's check the above result graphically:

# Visualise
ggplot(head(twitter_20counts, 20), aes(x = word, y = n)) + geom_bar(stat = "identity", 
    colour = "white", fill = customColor) + geom_text(aes(x = word, 
    y = 1, label = paste0("(", n, ")", sep = "")), hjust = 0, 
    vjust = 0.5, size = 4, colour = "black", fontface = "bold") + 
    labs(x = "Word", y = "Word Count", title = "Top 20 most Common Words in twitter data") + 
    coord_flip() + theme_bw()

Let"s attempt to join 3 files together and calculate overall frequencies. We will use library tm. to build a corpus from our sample files, so that we can parse those further:

# Create a list of sample files
sample_list <- list(paste(tidy_blog$word, collapse = " "), paste(tidy_news$word, 
    collapse = " "), paste(tidy_twitter$word, collapse = " "))
# Create corpus
text_corpus <- VCorpus(VectorSource(sample_list))
# Corpus in tidy form
tidy_corpus <- tidy(text_corpus)
# Remove unneeded data
rm(sample_list)

Now that we know which words are most commonly used in each text, let's also calculate the total number of words in each text:

words_by_text <- tidy_corpus %>% # Tokenize
unnest_tokens(word, text) %>% # Count by text and word
count(id, word, sort = TRUE)
# Calculate totals
sum_words <- words_by_text %>% # Group by text
group_by(id) %>% summarize(total = sum(n))
# Get results
words_from_text <- left_join(words_by_text, sum_words)
# View
head(words_from_text, 20)

Now let"s plot the distribution for each text:

# Plot distributions
ggplot(words_from_text, aes(x = (n/total), fill = id)) + geom_histogram(show.legend = FALSE, 
    fill = customColor) + xlim(NA, 9e-04) + facet_wrap(~id, ncol = 1, 
    scales = "free") + ylab("Frequency") + xlab("Proportion") + 
    ggtitle("Frequency Distribution of words")

We see that in all three texts there are common words that occur very often, but there are also many words that occur rarely, which explains the long tails of the distributions.

Let's calculate how many unique words we need in a frequency-sorted dictionary to cover 50% of all word instances in the language.

For this we will rely on the previously calculated words_from_text.

# Cumulative frequency
accum <- words_from_text %>% # Group by id
group_by(id) %>% # Add row number, cumulative and relative frequency
mutate(row_n = row_number(), cumfreq = cumsum(n/total), 
    relative = prop.table(n/total))
# View
head(accum, 5)

Now that we have cumulative and relative frequencies, we can plot them for each text:

# Density plots
accum %>% ggplot(aes(row_n, relative, color = id)) + geom_line(size = 1.1, 
    alpha = 0.8, show.legend = FALSE) + # Log transformation for scale
scale_x_log10() + xlab("Count") + ylab("Relative Frequency") + 
    ggtitle("Density plot for each text")

Now we can answer the question of how many unique words we need in a frequency-sorted dictionary to cover a given percentage of all word instances in the language.

# For 50% coverage
coverage_05 <- accum %>% filter(cumfreq < 0.5) %>% nrow()
coverage_05
## [1] 1182
# For 90% coverage
coverage_09 <- accum %>% filter(cumfreq < 0.9) %>% nrow()
coverage_09
## [1] 8917

Let"s check frequencies of 20 most common words in the joined text. Also we will visualise the results using barplot and wordcloud.

# Top 20 words in text_corpus
corpus_20counts <- tidy_corpus %>% unnest_tokens(word, text) %>% 
    # Count by word
count(word) %>% arrange(desc(n))
# View
head(corpus_20counts, 20)

Now let's check the above result graphically:

# Visualise
ggplot(head(corpus_20counts, 20), aes(x = reorder(word, -n), 
    n)) + geom_bar(stat = "identity", colour = "white", fill = customColor) + 
    geom_text(aes(x = word, y = 1, label = paste0("(", n, ")", 
        sep = "")), hjust = 0, vjust = 0.5, size = 4, colour = "black", 
        fontface = "bold") + labs(x = "Word", y = "Word Count", 
    title = "Top 20 most Common Words for all data") + coord_flip() + 
    theme_bw()

And now, as a wordcloud, let's visualise the top 100 most common words:

# Top 100 words in text_corpus
corpus_100counts <- tidy_corpus %>% unnest_tokens(word, text) %>% 
    # Count by word
count(word) %>% arrange(desc(n)) %>% head(100)
# Wordcloud
wordcloud2(corpus_100counts, size = 0.6, color = customColor)

We have a clean corpus text_corpus of three documents, so we will call TermDocumentMatrix to construct our TDM, which is a matrix where the rows represent terms (words), the columns represent documents and the values are frequencies.

# Create TDM
text_tdm <- TermDocumentMatrix(text_corpus)
class(text_tdm)
## [1] "TermDocumentMatrix"    "simple_triplet_matrix"
# Convert tdm to matrix
tdm_mat <- as.matrix(text_tdm)
# Rename columns
colnames(tdm_mat) = c("blogs", "news", "twitter")
# Tidy up
tidy_tm <- tidy(tdm_mat)

Let"s plot a commonality cloud for our data:

# Commonality cloud
commonality.cloud(tdm_mat, max.words = 100, colors = customColor)

We can also visualise the words that are not shared between our files:

# Comparison cloud
comparison.cloud(tdm_mat, colors = c("blue", "green", "darkorange"), 
    max.words = 50)

Let"s continue our analysis by splitting text in groups containing two or more words (bigrams and trigrams).

# Create bigrams
bigrams <- tidy_corpus %>% unnest_tokens(bigram, text, token = "ngrams", 
    n = 2, n_min = 2)

And now groups of 3 words:

# Create trigrams
trigrams <- tidy_corpus %>% unnest_tokens(trigram, text, token = "ngrams", 
    n = 3, n_min = 3)

Let"s plot bigram frequencies:

# Bigram frequencies
bigrams_f <- bigrams %>% # Count by bigram
count(bigram) %>% # Sort
arrange(desc(n))
# top 20 bigrams
head(bigrams_f, 20)
# Visualise
ggplot(head(bigrams_f, 20), aes(x = reorder(bigram, -n), n)) + 
    geom_bar(stat = "identity", colour = "white", fill = customColor) + 
    geom_text(aes(x = bigram, y = 1, label = paste0("(", n, ")", 
        sep = "")), hjust = 0, vjust = 0.5, size = 4, colour = "black", 
        fontface = "bold") + labs(x = "Bigram", y = "Bigram Count", 
    title = "Top 20 Most Common Bigrams") + coord_flip() + theme_bw()

Let"s repeat the steps for trigrams:

# trigram frequencies
trigrams_f <- trigrams %>% # Count by trigram
count(trigram) %>% # Sort
arrange(desc(n))
# top 20 trigrams
head(trigrams_f, 20)
# Visualise
ggplot(head(trigrams_f, 20), aes(x = reorder(trigram, -n), n)) + 
    geom_bar(stat = "identity", colour = "white", fill = customColor) + 
    geom_text(aes(x = trigram, y = 1, label = paste0("(", n, 
        ")", sep = "")), hjust = 0, vjust = 0.5, size = 4, colour = "black", 
        fontface = "bold") + labs(x = "Trigram", y = "Trigram Count", 
    title = "Top 20 Most Common Trigrams") + coord_flip() + theme_bw()

Considerations

There are several ways to evaluate how many of the words come from foreign languages. One of them is to compare the number of rows in the dataset before and after applying the regex that removes all non-English words. Here we should be careful, because the script has to read only words and ignore other non-UTF-8 symbols.
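As an illustration, here is a minimal sketch of that idea, assuming a hypothetical tibble raw_tokens of uncleaned tokens with a word column (tidyverse functions loaded as above); the filter pattern mirrors the one used in stop_num:

# Toy example: a small tibble of uncleaned tokens (hypothetical data)
raw_tokens <- tibble(word = c("hello", "schön", "word", "привет"))
n_before <- nrow(raw_tokens)
# Keep only tokens made entirely of ASCII alphanumeric characters
english_only <- raw_tokens %>% 
    filter(grepl("[^A-Za-z0-9]", word) == FALSE)
# The difference approximates the number of foreign-language tokens
n_before - nrow(english_only)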

Another way is to compare the words against an English dictionary, similar to the approach we used for stopwords and offensive words: everything that does not match comes from a foreign language, but this will be time-consuming.
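A rough sketch of this dictionary comparison, assuming a hypothetical plain-text English word list english_words.txt (one word per line, not part of this project's data):

# Hypothetical dictionary file: one English word per line
english_dict <- tibble(word = vroom_lines("./data/english_words.txt"))
# Tokens that do not appear in the dictionary are likely foreign words
foreign_words <- tidy_blog %>% anti_join(english_dict, by = "word")
nrow(foreign_words)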

Approaches to increase coverage (identifying words that may not be in the corpora, or using a smaller number of words in the dictionary to cover the same number of phrases) can, in my opinion, go in several directions. One is to normalize the text with techniques such as stemming or lemmatization, reducing words to their roots or common derivations. Another approach is to study the author's style by comparing several texts by the same author and predicting with machine learning techniques.
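For example, a small stemming sketch using the wordStem function from the SnowballC package (not otherwise used in this report) shows how several surface forms collapse into a single dictionary entry:

library(SnowballC)
# Several surface forms map to the same stem, shrinking the dictionary
wordStem(c("run", "runs", "running"), language = "english")
# expected result: "run" "run" "run"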

Next steps

The future steps to undertake follow this order:

  1. Based on the combinations of words we previously observed in the text, I will try to create an n-gram statistical model to predict the following word based on one, two or three words. Following the Markov assumption that the probability of a word depends only on the previous word, I will try to estimate bigram probabilities (only one word in the past) and then expand the idea to trigrams.

  2. The estimation of probabilities will be done by Maximum Likelihood Estimation, normalizing the counts we already computed in the previous section. For the trigram we will need the parameter \(P(w_3 \mid w_1, w_2)\) (see the sketch after this list).

  3. To "smooth" the probabilities (giving all n-grams a non-zero probability even if they aren't observed in the data) we can apply a backoff approach, meaning we turn to a lower-order n-gram whenever we know nothing about the higher-order n-gram. We can back off like this until we reach a level where we find some counts (a minimal sketch follows after this list).

  4. The evaluation of the model will be done by splitting our previously built corpus into training and test sets. We will train the parameters of the model on the training set and see how accurately the test set is predicted.

  5. Finally, we will build a Shiny app to show in practice how our model performs.
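As a rough illustration of steps 2 and 3, the following sketch estimates trigram probabilities by Maximum Likelihood, \(P(w_3 \mid w_1, w_2) \approx \frac{count(w_1 w_2 w_3)}{count(w_1 w_2)}\), from the trigrams_f and bigrams_f counts computed above, and backs off to the bigram table when a two-word context is unseen. The table layout and the predict_backoff helper are hypothetical choices for this sketch, not the final model; separate and unite come from the tidyr package.

# tidyr provides separate() and unite()
library(tidyr)

# Trigram MLE: split each trigram into its context (first two words) and its
# last word, then normalize counts within each context
trigram_prob <- trigrams_f %>% 
    separate(trigram, into = c("w1", "w2", "w3"), sep = " ") %>% 
    unite(context, w1, w2, sep = " ") %>% 
    group_by(context) %>% 
    mutate(prob = n / sum(n)) %>% 
    ungroup()

# Bigram MLE: normalize counts within each one-word context
bigram_prob <- bigrams_f %>% 
    separate(bigram, into = c("w1", "w2"), sep = " ") %>% 
    group_by(w1) %>% 
    mutate(prob = n / sum(n)) %>% 
    ungroup()

# Hypothetical backoff lookup: try the trigram table first,
# otherwise back off to the bigram table
predict_backoff <- function(word1, word2) {
    ctx <- paste(word1, word2)
    hit <- trigram_prob %>% filter(context == ctx) %>% arrange(desc(prob))
    if (nrow(hit) > 0) return(head(hit$w3, 3))
    hit <- bigram_prob %>% filter(w1 == word2) %>% arrange(desc(prob))
    head(hit$w2, 3)
}

# Example call (words chosen arbitrarily): top 3 candidate next words
predict_backoff("happy", "new")

In the final model these lookup tables would be precomputed once and saved, so the Shiny app would only need fast table lookups at prediction time.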