Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking, and a whole range of other activities. But typing on a mobile device can be a serious pain, and predictive text models help ease it.
The objective of this project is to build a predictive text product; the objective of this report is to explain the exploratory analysis and the goals for the eventual app and algorithm.
The data are provided by SwiftKey and include files in four languages: English, German, Finnish, and Russian. The focus here is on the English files, which consist of .txt files from three sources: blogs, news, and Twitter.
Before building the model, the data must be downloaded, analysed, and cleaned.
library(quanteda)
library(readtext)
library(stringi)
library(kableExtra)
library(reader)
library(ggplot2)
library(RColorBrewer)
library(data.table)
if(!file.exists("./final")){
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl,destfile="./dataSwift.zip",method="curl")
unzip(zipfile="./dataSwift.zip")
}
# Destination file paths (files[4:6] are the en_US blogs, news, and twitter files)
files <- list.files(path = "./final", recursive = TRUE)
blogs.dest <- paste("./final/", files[4], sep = "")
news.dest <- paste("./final/", files[5], sep = "")
twits.dest <- paste("./final/", files[6], sep = "")
# File sizes
blogs.size <- round(file.info(blogs.dest)[[1]]/1024/1024,0)
news.size <- round(file.info(news.dest)[[1]]/1024/1024,0)
twits.size <- round(file.info(twits.dest)[[1]]/1024/1024,0)
# Reading in
data_blogs <- texts(readtext(blogs.dest, encoding = "UTF-8"))
data_news <- texts(readtext(news.dest, encoding = "UTF-8"))
data_twits <- texts(readtext(twits.dest, encoding = "UTF-8"))
# Number of lines
## +1 because there is no \n in the last line
blogs_length <- stri_count_fixed(data_blogs, "\n") + 1
news_length <- stri_count_fixed(data_news, "\n") + 1
twits_length <- stri_count_fixed(data_twits, "\n") + 1
# Number of words
blogs_words <- tokens(data_blogs, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_hyphens = TRUE) %>% as.character()
blogs_words_l <- length(blogs_words)
news_words <- tokens(data_news, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_hyphens = TRUE) %>% as.character()
news_words_l <- length(news_words)
twits_words <- tokens(data_twits, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE) %>% as.character()
twits_words_l <- length(twits_words)
# Summary stats
df <- data.frame(c(blogs.dest,news.dest,twits.dest),
c(blogs.size,news.size,twits.size),
c(blogs_length,news_length,twits_length),
c(blogs_words_l,news_words_l,twits_words_l))
colnames(df) <- c("File", "Size (MB)", "Num of Lines", "Num of Words")
kable(df) %>%
kable_styling(full_width = FALSE)
| File | Size (MB) | Num of Lines | Num of Words |
|---|---|---|---|
| ./final/en_US/en_US.blogs.txt | 200 | 899288 | 37130897 |
| ./final/en_US/en_US.news.txt | 196 | 77259 | 2604160 |
| ./final/en_US/en_US.twitter.txt | 159 | 2360148 | 29642313 |
As the table above shows, the original .txt files are huge: over 150 MB each. Smaller samples will be used to speed up the analysis while still giving a good approximation of the results that would be obtained from the full data. The sampling is done line by line using the rbinom function.
samples <- function(originalFile, prob, seed) {
  conR <- file(originalFile, open = "rb")
  data <- readLines(conR, skipNul = TRUE)
  # Sample lines with rbinom()
  set.seed(seed)
  sampleData <- data[rbinom(n = length(data), size = 1, prob = prob) == 1]
  close(conR)
  # Write the sample out to a local file so it can be reused
  conW <- file(paste(dirname(originalFile), "/", "sample_", basename(originalFile), sep = ""), open = "w")
  writeLines(sampleData, con = conW)
  close(conW)
}
### 3% sample size
if(!file.exists("./final/en_US/sample_en_US.blogs.txt")){
samples(blogs.dest, .03, 111)
}
if(!file.exists("./final/en_US/sample_en_US.news.txt")){
samples(news.dest, .03, 111)
}
if(!file.exists("./final/en_US/sample_en_US.twitter.txt")){
samples(twits.dest, .03, 111)
}
sample_blogs <- texts(readtext("./final/en_US/sample_en_US.blogs.txt", encoding = "UTF-8"))
sample_news <- texts(readtext("./final/en_US/sample_en_US.news.txt", encoding = "UTF-8"))
sample_twits <- texts(readtext("./final/en_US/sample_en_US.twitter.txt", encoding = "UTF-8"))
A profanity filter also has to be implemented, to remove swear words and any other terms that should not be suggested by the predictor. The filter is built from a list of ‘bad words’ taken from the internet.
# Profanity Filter
if(!file.exists("base-list-of-bad-words_text-file_2018_07_30.txt")){
fileUrl <- "https://www.freewebheaders.com/download/files/base-list-of-bad-words_text-file_2018_07_30.zip"
download.file(fileUrl,destfile="profList.zip",method="curl")
unzip(zipfile="./profList.zip")
}
profWords <- n.readLines("base-list-of-bad-words_text-file_2018_07_30.txt", n =76, skip = 12)
The three sample files are then tokenised, both to apply the profanity filter and to remove the punctuation, numbers, and symbols that could interfere with the prediction algorithm. A document-feature matrix is then created with the dfm command.
sample_blogs_words <- tokens(sample_blogs, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_hyphens = TRUE)
sample_blogs_words <- tokens_tolower(sample_blogs_words)
sample_blogs_words <- tokens_remove(sample_blogs_words, profWords)
blogs_dfm <- dfm(sample_blogs_words)
sample_news_words <- tokens(sample_news, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_hyphens = TRUE)
sample_news_words <- tokens_tolower(sample_news_words)
sample_news_words <- tokens_remove(sample_news_words, profWords)
news_dfm <- dfm(sample_news_words)
sample_twits_words <- tokens(sample_twits, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_hyphens = TRUE, remove_twitter = TRUE)
sample_twits_words <- tokens_tolower(sample_twits_words)
sample_twits_words <- tokens_remove(sample_twits_words, profWords)
twits_dfm <- dfm(sample_twits_words)
From the document-feature matrix for each sample, the coverage of each word (token) can be calculated by sorting the words from most to least frequent and computing the cumulative frequency relative to the total word count. The Position function then gives the number of unique words needed to cover a given percentage (e.g. 50%) of the data set.
blogs_freqs <- topfeatures(blogs_dfm, n = nfeat(blogs_dfm))
df1 <- data.table(word = names(blogs_freqs), freq = blogs_freqs)
sumFreq <- df1[,sum(freq)]
df1[, coverage := cumsum(freq)/sumFreq]
df1[, rank := 1:.N]
# Number of unique words to cover 50% of all word instances
blog50 <- Position(function(x) x >= .5, df1$coverage)
# Number of unique words to cover 90% of all word instances
blog90 <- Position(function(x) x >= .9, df1$coverage)
news_freqs <- topfeatures(news_dfm, n = nfeat(news_dfm))
df2 <- data.table(word = names(news_freqs), freq = news_freqs)
sumFreq <- df2[,sum(freq)]
df2[, coverage := cumsum(freq)/sumFreq]
df2[, rank := 1:.N]
# Number of unique words to cover 50% of all word instances
news50 <- Position(function(x) x >= .5, df2$coverage)
# Number of unique words to cover 90% of all word instances
news90 <- Position(function(x) x >= .9, df2$coverage)
twits_freqs <- topfeatures(twits_dfm, n = nfeat(twits_dfm))
df3 <- data.table(word = names(twits_freqs), freq = twits_freqs)
sumFreq <- df3[,sum(freq)]
df3[, coverage := cumsum(freq)/sumFreq]
df3[, rank := 1:.N]
# Number of unique words to cover 50% of all word instances
twit50 <- Position(function(x) x >= .5, df3$coverage)
# Number of unique words to cover 90% of all word instances
twit90 <- Position(function(x) x >= .9, df3$coverage)
## Defining coordinates of line intersections
dfInt <- data.frame(x=c(0,0,blog50,news50,twit50,blog90,news90,twit90), y=c(.5,.9,0,0,0,0,0,0), xe=c(news50,news90,blog50,news50,twit50,blog90,news90,twit90), ye=c(.5,.9,.5,.5,.5,.9,.9,.9))
## plot
ggplot() +
geom_line(data=df1, aes(x=rank, y=coverage, color = "Blogs")) +
geom_line(data=df2, aes(x=rank, y=coverage, color = "News")) +
geom_line(data=df3, aes(x=rank, y=coverage, color = "Twitter")) +
scale_color_manual("",breaks=c("Blogs","News","Twitter"),
values = c("Blogs"="blue", "News"="red", "Twitter"="green")) +
ggtitle("Coverage of unigrams across all samples") +
labs(x="Number of Unique Words", y="Coverage") +
scale_y_continuous(breaks = seq(0,1,.1)) +
xlim(0, 8000) +
geom_segment(data=dfInt,
mapping=aes(x = x, xend = xe, y = y, yend = ye),
linetype = 2) +
annotate("text", x=c(850, blog90+350, 1250, news90+350, 450, twit90+350), y=c(.5,.8,.5,.8,.5,.8),
label=c(blog50,blog90,news50, news90,twit50, twit90),
color = c("blue","blue","red","red","green","green"))
Figure 1: Coverage of unigrams across all samples: just 107 unique words cover 50% of the Blogs sample, 193 of the News sample, and 123 of the Twitter sample, while 6232, 7570, and 5186 unique words respectively are needed for 90% coverage
blog50pct <- round(blog50/length(df1$word)*100,1)
news50pct <- round(news50/length(df2$word)*100,1)
twit50pct <- round(twit50/length(df3$word)*100,1)
blog90pct <- round(blog90/length(df1$word)*100,1)
news90pct <- round(news90/length(df2$word)*100,1)
twit90pct <- round(twit90/length(df3$word)*100,1)
dfT <- data.frame(c(length(df1$word),blog50,blog50pct,blog90,blog90pct),
c(length(df2$word),news50,news50pct,news90,news90pct),
c(length(df3$word),twit50,twit50pct,twit90,twit90pct))
colnames(dfT) <- c("Blogs", "News", "Twitter")
row.names(dfT) <- c("Total Unigrams","50% coverage","50% coverage % of Total Words","90% coverage","90% coverage % of Total Words")
kable(dfT) %>%
kable_styling(full_width = FALSE) %>%
column_spec(1,border_right = TRUE)
| | Blogs | News | Twitter |
|---|---|---|---|
| Total Unigrams | 50482.0 | 50307.0 | 46644.0 |
| 50% coverage | 107.0 | 193.0 | 123.0 |
| 50% coverage % of Total Words | 0.2 | 0.4 | 0.3 |
| 90% coverage | 6232.0 | 7570.0 | 5186.0 |
| 90% coverage % of Total Words | 12.3 | 15.0 | 11.1 |
While the x-axis of Figure 1 has been limited to 8000, the table above shows that in all three samples 50% of the word instances are covered by less than 0.5% of the unique words, and even 90% coverage requires no more than 15% of them. In other words, the remaining 85%+ of unique words occur only very rarely, many just once; likely causes include misspellings and foreign words. Regardless, these rare tokens, which together cover only 10% of the sample texts, will be filtered out, leaving a much smaller vocabulary that is easier to handle.
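As a rough check on how much that filtering would shrink the vocabulary, a dfm can be trimmed so that only words seen at least twice survive. The snippet below is only a sketch of the planned step, applied to the Blogs sample; the object name blogs_dfm_trimmed is illustrative, and recent quanteda versions use the argument min_termfreq where older releases used min_count.
# Sketch: keep only unigrams that occur at least twice in the Blogs sample
blogs_dfm_trimmed <- dfm_trim(blogs_dfm, min_termfreq = 2)
# Share of unique words retained...
nfeat(blogs_dfm_trimmed) / nfeat(blogs_dfm)
# ...and the share of total word instances they still cover
sum(blogs_dfm_trimmed) / sum(blogs_dfm)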
# plot frequency of 20 most frequent terms
theme_set(theme_minimal())
textstat_frequency(blogs_dfm, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 words in Blogs sample") +
labs(x = "", y = "Term Frequency")
Figure 2: Frequency of top 20 unigrams in Blogs sample
textstat_frequency(news_dfm, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 words in News sample") +
labs(x = "", y = "Term Frequency")
Figure 3: Frequency of top 20 unigrams in News sample
textstat_frequency(twits_dfm, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 words in Twitter sample") +
labs(x = "", y = "Term Frequency")
Figure 4: Frequency of top 20 unigrams in Twitter sample
This same pattern can be observed in the bar charts (Figures 2-4) showing the frequency of the top 20 words in each sample. There is a huge drop from the first rank to the second, and the counts keep falling, tailing off from around the 7th or 8th term.
### blogs
textplot_wordcloud(dfm_keep(blogs_dfm, pattern = textstat_frequency(blogs_dfm, n = 100)[,1]), color = brewer.pal(8, "Dark2"), min_size=1.5)
Figure 5: Wordcloud of top 100 unigrams in Blogs sample
### news
textplot_wordcloud(dfm_keep(news_dfm, pattern = textstat_frequency(news_dfm, n = 100)[,1]), color = brewer.pal(8, "Dark2"), min_size=1.5)
Figure 6: Wordcloud of top 100 unigrams in News sample
### twits
textplot_wordcloud(dfm_keep(twits_dfm, pattern = textstat_frequency(twits_dfm, n = 100)[,1]), color = brewer.pal(8, "Dark2"), min_size=1.5)
Figure 7: Wordcloud of top 100 unigrams in Twitter sample
The wordclouds above give a clear, visual indication of how few words stand out, in terms of their frequency, in their respective corpora. Just like the bar charts above, the word clouds show that the word “the” is the most frequent term.
Bigrams and trigrams can be built from the sample tokens with tokens_ngrams, passed to the dfm function, and their frequencies graphed in the same way:
#2-gram
blogs_dfm2 <- dfm(tokens_ngrams(sample_blogs_words, n = 2, concatenator = " "))
textstat_frequency(blogs_dfm2, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 bigrams in Blogs sample") +
labs(x = "", y = "Term Frequency")
Figure 8: Frequency of top 20 bigrams in Blogs sample
news_dfm2 <- dfm(tokens_ngrams(sample_news_words, n = 2, concatenator = " "))
textstat_frequency(news_dfm2, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 bigrams in News sample") +
labs(x = "", y = "Term Frequency")
Figure 9: Frequency of top 20 bigrams in News sample
twits_dfm2 <- dfm(tokens_ngrams(sample_twits_words, n = 2, concatenator = " "))
textstat_frequency(twits_dfm2, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 bigrams in News sample") +
labs(x = "", y = "Term Frequency")
Figure 10: Frequency of top 20 bigrams in Twitter sample
# 3-gram
blogs_dfm3 <- dfm(tokens_ngrams(sample_blogs_words, n = 3, concatenator = " "))
textstat_frequency(blogs_dfm3, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 trigrams in Blogs sample") +
labs(x = "", y = "Term Frequency")
Figure 11: Frequency of top 20 trigrams in Blogs sample
news_dfm3 <- dfm(tokens_ngrams(sample_news_words, n = 3, concatenator = " "))
textstat_frequency(news_dfm3, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 trigrams in News sample") +
labs(x = "", y = "Term Frequency")
Figure 12: Frequency of top 20 trigrams in News sample
twits_dfm3 <- dfm(tokens_ngrams(sample_twits_words, n = 3, concatenator = " "))
textstat_frequency(twits_dfm3, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 trigrams in Twitter sample") +
labs(x = "", y = "Term Frequency")
Figure 13: Frequency of top 20 trigrams in Twitter sample
The bigram and trigram bar charts repeat the trend seen in the unigram charts, with the top one or two n-grams in each chart sometimes appearing more than twice as frequently as the third.
The main conclusion to take from the above analysis is that a select few ‘stop words’ (e.g. ‘the’, ‘of’) appear so frequently that they cover most of the corpus. In the samples, which are just 3% of the original data sets, less than 0.5% of the unique words cover 50% of the corpus; in the Blogs sample this figure drops to 0.2%. Less than 15% of the unique words cover 90% of the corpus, meaning that over 85% of the unique words (more than 42000 in the case of the News sample) appear only very rarely, many just once.
The next step is to combine the samples into a single corpus and filter out the tokens that appear only once, which will greatly reduce the size of the data set without prejudicing the analysis. The resulting corpus will then be split into a training set and a testing set (60% and 40% respectively). A predictive model will be built that uses the preceding words (the n-gram history) to estimate the probability of the next word, and a Shiny app will be created so that the model can be used interactively. A rough sketch of these planned steps is given below.
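The snippet below is a minimal sketch of those next steps under a few assumptions: the 60/40 split is done line by line on the combined samples, the next-word probability is a simple maximum-likelihood count ratio on bigrams, and the names combined_lines, bigram_dt and predict_next are illustrative rather than part of the final app. The profanity filter and the removal of singleton tokens described earlier are omitted here for brevity.
library(quanteda)
library(data.table)
set.seed(111)
# Combine the three 3% samples into one character vector of lines
combined_lines <- c(readLines("./final/en_US/sample_en_US.blogs.txt", skipNul = TRUE),
                    readLines("./final/en_US/sample_en_US.news.txt", skipNul = TRUE),
                    readLines("./final/en_US/sample_en_US.twitter.txt", skipNul = TRUE))
# 60/40 train/test split by line
in_train <- rbinom(length(combined_lines), size = 1, prob = 0.6) == 1
train_lines <- combined_lines[in_train]
test_lines <- combined_lines[!in_train]
# Bigram counts from the training set
train_tokens <- tokens_tolower(tokens(train_lines, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE))
bigrams <- as.character(tokens_ngrams(train_tokens, n = 2, concatenator = " "))
bigram_dt <- data.table(ngram = bigrams)[, .N, by = ngram]
bigram_dt[, c("w1", "w2") := tstrsplit(ngram, " ", fixed = TRUE)]
# Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(bigrams starting with w1)
bigram_dt[, prob := N / sum(N), by = w1]
# Illustrative lookup of the k most probable words following a given word
predict_next <- function(word, k = 3) bigram_dt[w1 == word][order(-prob)][1:k, .(w2, prob)]
predict_next("of")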