Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking, and a whole range of other activities. But typing on a mobile device can be a serious pain, and predictive text models help ease it.
The objective of this project is to build a predictive text product; the objective of this report is to explain the exploratory analysis and the goals for the eventual app and algorithm.
The data are provided by SwiftKey and include files in four languages: English, German, Finnish, and Russian. The focus here is on the English files, which consist of .txt files from three sources: blogs, news, and Twitter.
Before building the model, the data must be downloaded, analysed, and cleaned.
library(quanteda)
library(readtext)
library(stringi)
library(kableExtra)
library(reader)
library(ggplot2)
library(RColorBrewer)
library(data.table)
if(!file.exists("./final")){
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl,destfile="./dataSwift.zip",method="curl")
unzip(zipfile="./dataSwift.zip")
}
# Destination file paths (files[4:6] are the en_US blogs, news, and twitter files)
files <- list.files(path = "./final", recursive = TRUE)
blogs.dest <- paste("./final/", files[4], sep = "")
news.dest <- paste("./final/", files[5], sep = "")
twits.dest <- paste("./final/", files[6], sep = "")
# File sizes
blogs.size <- round(file.info(blogs.dest)[[1]]/1024/1024,0)
news.size <- round(file.info(news.dest)[[1]]/1024/1024,0)
twits.size <- round(file.info(twits.dest)[[1]]/1024/1024,0)
# Reading in
data_blogs <- texts(readtext(blogs.dest, encoding = "UTF-8"))
data_news <- texts(readtext(news.dest, encoding = "UTF-8"))
data_twits <- texts(readtext(twits.dest, encoding = "UTF-8"))
# Number of lines
## +1 because there is no \n in the last line
blogs_length <- stri_count_fixed(data_blogs, "\n") + 1
news_length <- stri_count_fixed(data_news, "\n") + 1
twits_length <- stri_count_fixed(data_twits, "\n") + 1
# Number of words
blogs_words <- tokens(data_blogs, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_hyphens = TRUE) %>% as.character()
blogs_words_l <- length(blogs_words)
news_words <- tokens(data_news, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_hyphens = TRUE) %>% as.character()
news_words_l <- length(news_words)
twits_words <- tokens(data_twits, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE) %>% as.character()
twits_words_l <- length(twits_words)
# Summary stats
df <- data.frame(c(blogs.dest,news.dest,twits.dest),
c(blogs.size,news.size,twits.size),
c(blogs_length,news_length,twits_length),
c(blogs_words_l,news_words_l,twits_words_l))
colnames(df) <- c("File", "Size (MB)", "Num of Lines", "Num of Words")
kable(df) %>%
kable_styling(full_width = FALSE)
| File | Size (MB) | Num of Lines | Num of Words |
|---|---|---|---|
| ./final/en_US/en_US.blogs.txt | 200 | 899288 | 37130897 |
| ./final/en_US/en_US.news.txt | 196 | 77259 | 2604160 |
| ./final/en_US/en_US.twitter.txt | 159 | 2360148 | 29642313 |
As the table above shows, the original .txt files are huge: over 150 MB each. Smaller samples will be used to speed up the analysis while still giving a good approximation of the results that would be obtained from the full data. The sampling is done line by line using the rbinom function.
samples <- function(originalFile, prob, seed) {
  conR <- file(originalFile, open = "rb")
  data <- readLines(conR, skipNul = TRUE)
  # Sample lines with rbinom()
  set.seed(seed)
  sampleData <- data[rbinom(n = length(data), size = 1, prob = prob) == 1]
  close(conR)
  # Write the sample out to a local file so it can be reused
  conW <- file(paste(dirname(originalFile), "/", "sample_", basename(originalFile), sep = ""), open = "w")
  writeLines(sampleData, con = conW)
  close(conW)
}
### 3% sample size
if(!file.exists("./final/en_US/sample_en_US.blogs.txt")){
samples(blogs.dest, .03, 111)
}
if(!file.exists("./final/en_US/sample_en_US.news.txt")){
samples(news.dest, .03, 111)
}
if(!file.exists("./final/en_US/sample_en_US.twitter.txt")){
samples(twits.dest, .03, 111)
}
sample_blogs <- texts(readtext("./final/en_US/sample_en_US.blogs.txt", encoding = "UTF-8"))
sample_news <- texts(readtext("./final/en_US/sample_en_US.news.txt", encoding = "UTF-8"))
sample_twits <- texts(readtext("./final/en_US/sample_en_US.twitter.txt", encoding = "UTF-8"))
A profanity filter also has to be implemented, to remove swear words and any other terms that should not be suggested by the predictor. The filter is built from a list of ‘bad words’ taken from the internet.
# Profanity Filter
if(!file.exists("base-list-of-bad-words_text-file_2018_07_30.txt")){
fileUrl <- "https://www.freewebheaders.com/download/files/base-list-of-bad-words_text-file_2018_07_30.zip"
download.file(fileUrl,destfile="profList.zip",method="curl")
unzip(zipfile="./profList.zip")
}
profWords <- n.readLines("base-list-of-bad-words_text-file_2018_07_30.txt", n =76, skip = 12)
The three sample files are then tokenised, both to apply the profanity filter and to remove the punctuation, numbers, and symbols that could interfere with the prediction algorithm. A document-feature matrix is then created with the dfm command.
sample_blogs_words <- tokens(sample_blogs, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_hyphens = TRUE)
sample_blogs_words <- tokens_tolower(sample_blogs_words)
sample_blogs_words <- tokens_remove(sample_blogs_words, profWords)
blogs_dfm <- dfm(sample_blogs_words)
sample_news_words <- tokens(sample_news, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_hyphens = TRUE)
sample_news_words <- tokens_tolower(sample_news_words)
sample_news_words <- tokens_remove(sample_news_words, profWords)
news_dfm <- dfm(sample_news_words)
sample_twits_words <- tokens(sample_twits, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_hyphens = TRUE, remove_twitter = TRUE)
sample_twits_words <- tokens_tolower(sample_twits_words)
sample_twits_words <- tokens_remove(sample_twits_words, profWords)
twits_dfm <- dfm(sample_twits_words)
From the document-feature matrix for each sample, the coverage of each word (token) can be calculated by sorting the words from most to least frequent and computing the cumulative frequency relative to the total word count. The Position function then gives the number of unique words needed to cover a given percentage (e.g. 50%) of the data set.
blogs_freqs <- topfeatures(blogs_dfm, n = nfeat(blogs_dfm))
df1 <- data.table(word = names(blogs_freqs), freq = blogs_freqs)
sumFreq <- df1[,sum(freq)]
df1[, coverage := cumsum(freq)/sumFreq]
df1[, rank := 1:.N]
# Number of unique words to cover 50% of all word instances
blog50 <- Position(function(x) x >= .5, df1$coverage)
# Number of unique words to cover 90% of all word instances
blog90 <- Position(function(x) x >= .9, df1$coverage)
news_freqs <- topfeatures(news_dfm, n = nfeat(news_dfm))
df2 <- data.table(word = names(news_freqs), freq = news_freqs)
sumFreq <- df2[,sum(freq)]
df2[, coverage := cumsum(freq)/sumFreq]
df2[, rank := 1:.N]
# Number of unique words to cover 50% of all word instances
news50 <- Position(function(x) x >= .5, df2$coverage)
# Number of unique words to cover 90% of all word instances
news90 <- Position(function(x) x >= .9, df2$coverage)
twits_freqs <- topfeatures(twits_dfm, n = nfeat(twits_dfm))
df3 <- data.table(word = names(twits_freqs), freq = twits_freqs)
sumFreq <- df3[,sum(freq)]
df3[, coverage := cumsum(freq)/sumFreq]
df3[, rank := 1:.N]
# Number of unique words to cover 50% of all word instances
twit50 <- Position(function(x) x >= .5, df3$coverage)
# Number of unique words to cover 90% of all word instances
twit90 <- Position(function(x) x >= .9, df3$coverage)
## Defining coordinates of line intersections
dfInt <- data.frame(x=c(0,0,blog50,news50,twit50,blog90,news90,twit90), y=c(.5,.9,0,0,0,0,0,0), xe=c(news50,news90,blog50,news50,twit50,blog90,news90,twit90), ye=c(.5,.9,.5,.5,.5,.9,.9,.9))
## plot
ggplot() +
geom_line(data=df1, aes(x=rank, y=coverage, color = "Blogs")) +
geom_line(data=df2, aes(x=rank, y=coverage, color = "News")) +
geom_line(data=df3, aes(x=rank, y=coverage, color = "Twitter")) +
scale_color_manual("",breaks=c("Blogs","News","Twitter"),
values = c("Blogs"="blue", "News"="red", "Twitter"="green")) +
ggtitle("Coverage of unigrams across all samples") +
labs(x="Number of Unique Words", y="Coverage") +
scale_y_continuous(breaks = seq(0,1,.1)) +
xlim(0, 8000) +
geom_segment(data=dfInt,
mapping=aes(x = x, xend = xe, y = y, yend = ye),
linetype = 2) +
annotate("text", x=c(850, blog90+350, 1250, news90+350, 450, twit90+350), y=c(.5,.8,.5,.8,.5,.8),
label=c(blog50,blog90,news50, news90,twit50, twit90),
color = c("blue","blue","red","red","green","green"))
Figure 1: Coverage of unigrams across all samples: just 107 unique words cover 50% of the Blogs sample, 193 of the News sample, and 123 of the Twitter sample, while 6232, 7570, and 5186 unique words respectively are needed for 90% coverage
blog50pct <- round(blog50/length(df1$word)*100,1)
news50pct <- round(news50/length(df2$word)*100,1)
twit50pct <- round(twit50/length(df3$word)*100,1)
blog90pct <- round(blog90/length(df1$word)*100,1)
news90pct <- round(news90/length(df2$word)*100,1)
twit90pct <- round(twit90/length(df3$word)*100,1)
dfT <- data.frame(c(length(df1$word),blog50,blog50pct,blog90,blog90pct),
c(length(df2$word),news50,news50pct,news90,news90pct),
c(length(df3$word),twit50,twit50pct,twit90,twit90pct))
colnames(dfT) <- c("Blogs", "News", "Twitter")
row.names(dfT) <- c("Total Unigrams","50% coverage","50% coverage % of Total Words","90% coverage","90% coverage % of Total Words")
kable(dfT) %>%
kable_styling(full_width = FALSE) %>%
column_spec(1,border_right = TRUE)
| | Blogs | News | Twitter |
|---|---|---|---|
| Total Unigrams | 50482.0 | 50307.0 | 46644.0 |
| 50% coverage | 107.0 | 193.0 | 123.0 |
| 50% coverage % of Total Words | 0.2 | 0.4 | 0.3 |
| 90% coverage | 6232.0 | 7570.0 | 5186.0 |
| 90% coverage % of Total Words | 12.3 | 15.0 | 11.1 |
While the x-axis of Figure 1 has been limited to 8000, the table above shows that in all three samples 50% of the word instances are covered by less than 0.5% of the unique words, and even 90% coverage requires no more than 15% of them. In other words, the remaining 85%+ of unique words occur only very rarely, many just once; likely causes include misspellings and foreign words. Regardless, these rare tokens, which together cover only 10% of the sample texts, will be filtered out, leaving a much smaller vocabulary that is easier to handle.
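As a rough check on how much that filtering would shrink the vocabulary, a dfm can be trimmed so that only words seen at least twice survive. The snippet below is only a sketch of the planned step, applied to the Blogs sample; the object name blogs_dfm_trimmed is illustrative, and recent quanteda versions use the argument min_termfreq where older releases used min_count.
# Sketch: keep only unigrams that occur at least twice in the Blogs sample
blogs_dfm_trimmed <- dfm_trim(blogs_dfm, min_termfreq = 2)
# Share of unique words retained...
nfeat(blogs_dfm_trimmed) / nfeat(blogs_dfm)
# ...and the share of total word instances they still cover
sum(blogs_dfm_trimmed) / sum(blogs_dfm)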
# plot frequency of 20 most frequent terms
theme_set(theme_minimal())
textstat_frequency(blogs_dfm, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 words in Blogs sample") +
labs(x = "", y = "Term Frequency")
Figure 2: Frequency of top 20 unigrams in Blogs sample
textstat_frequency(news_dfm, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 words in News sample") +
labs(x = "", y = "Term Frequency")
Figure 3: Frequency of top 20 unigrams in News sample
textstat_frequency(twits_dfm, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 words in Twitter sample") +
labs(x = "", y = "Term Frequency")
Figure 4: Frequency of top 20 unigrams in Twitter sample
This same pattern can be observed in the bar charts (Figures 2-4) showing the frequency of the top 20 words in each sample. There is a huge drop from the first rank to the second, and the counts keep falling, tailing off from around the 7th or 8th term.
### blogs
textplot_wordcloud(dfm_keep(blogs_dfm, pattern = textstat_frequency(blogs_dfm, n = 100)[,1]), color = brewer.pal(8, "Dark2"), min_size=1.5)
Figure 5: Wordcloud of top 100 unigrams in Blogs sample
### news
textplot_wordcloud(dfm_keep(news_dfm, pattern = textstat_frequency(news_dfm, n = 100)[,1]), color = brewer.pal(8, "Dark2"), min_size=1.5)
Figure 6: Wordcloud of top 100 unigrams in News sample
### twits
textplot_wordcloud(dfm_keep(twits_dfm, pattern = textstat_frequency(twits_dfm, n = 100)[,1]), color = brewer.pal(8, "Dark2"), min_size=1.5)
Figure 7: Wordcloud of top 100 unigrams in Twitter sample
The wordclouds above give a clear, visual indication of how few words stand out, in terms of their frequency, in their respective corpora. Just like the bar charts above, the word clouds show that the word “the” is the most frequent term.
Bigrams and trigrams can be built from the sample tokens with tokens_ngrams, passed to the dfm function, and their frequencies graphed in the same way:
#2-gram
blogs_dfm2 <- dfm(tokens_ngrams(sample_blogs_words, n = 2, concatenator = " "))
textstat_frequency(blogs_dfm2, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 bigrams in Blogs sample") +
labs(x = "", y = "Term Frequency")
Figure 8: Frequency of top 20 bigrams in Blogs sample
news_dfm2 <- dfm(tokens_ngrams(sample_news_words, n = 2, concatenator = " "))
textstat_frequency(news_dfm2, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 bigrams in News sample") +
labs(x = "", y = "Term Frequency")
Figure 9: Frequency of top 20 bigrams in News sample
twits_dfm2 <- dfm(tokens_ngrams(sample_twits_words, n = 2, concatenator = " "))
textstat_frequency(twits_dfm2, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 bigrams in News sample") +
labs(x = "", y = "Term Frequency")
Figure 10: Frequency of top 20 bigrams in Twitter sample
# 3-gram
blogs_dfm3 <- dfm(tokens_ngrams(sample_blogs_words, n = 3, concatenator = " "))
textstat_frequency(blogs_dfm3, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 trigrams in Blogs sample") +
labs(x = "", y = "Term Frequency")
Figure 11: Frequency of top 20 trigrams in Blogs sample
news_dfm3 <- dfm(tokens_ngrams(sample_news_words, n = 3, concatenator = " "))
textstat_frequency(news_dfm3, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 trigrams in News sample") +
labs(x = "", y = "Term Frequency")
Figure 12: Frequency of top 20 trigrams in News sample
twits_dfm3 <- dfm(tokens_ngrams(sample_twits_words, n = 3, concatenator = " "))
textstat_frequency(twits_dfm3, n = 20) %>%
ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
ggtitle("Frequency of top 20 trigrams in Twitter sample") +
labs(x = "", y = "Term Frequency")
Figure 13: Frequency of top 20 trigrams in Twitter sample
The bigram and trigram bar charts repeat the trend seen in the unigram charts, with the top one or two n-grams in each chart sometimes appearing more than twice as frequently as the third.
The main conclusion to take from the above analysis is that a select few ‘stop words’ (e.g. ‘the’, ‘of’) appear so frequently that they cover most of the corpus. In the samples, which are just 3% of the original data sets, less than 0.5% of the unique words cover 50% of the corpus; in the Blogs sample this figure drops to 0.2%. Less than 15% of the unique words cover 90% of the corpus, meaning that over 85% of the unique words (more than 42000 in the case of the News sample) appear only very rarely, many just once.
The next step is to combine the samples into a single corpus and filter out the tokens that appear only once, which will greatly reduce the size of the data set without prejudicing the analysis. The resulting corpus will then be split into a training set and a testing set (60% and 40% respectively). A predictive model will be built that uses the preceding words (the n-gram history) to estimate the probability of the next word, and a Shiny app will be created so that the model can be used interactively. A rough sketch of these planned steps is given below.
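The snippet below is a minimal sketch of those next steps under a few assumptions: the 60/40 split is done line by line on the combined samples, the next-word probability is a simple maximum-likelihood count ratio on bigrams, and the names combined_lines, bigram_dt and predict_next are illustrative rather than part of the final app. The profanity filter and the removal of singleton tokens described earlier are omitted here for brevity.
library(quanteda)
library(data.table)
set.seed(111)
# Combine the three 3% samples into one character vector of lines
combined_lines <- c(readLines("./final/en_US/sample_en_US.blogs.txt", skipNul = TRUE),
                    readLines("./final/en_US/sample_en_US.news.txt", skipNul = TRUE),
                    readLines("./final/en_US/sample_en_US.twitter.txt", skipNul = TRUE))
# 60/40 train/test split by line
in_train <- rbinom(length(combined_lines), size = 1, prob = 0.6) == 1
train_lines <- combined_lines[in_train]
test_lines <- combined_lines[!in_train]
# Bigram counts from the training set
train_tokens <- tokens_tolower(tokens(train_lines, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE))
bigrams <- as.character(tokens_ngrams(train_tokens, n = 2, concatenator = " "))
bigram_dt <- data.table(ngram = bigrams)[, .N, by = ngram]
bigram_dt[, c("w1", "w2") := tstrsplit(ngram, " ", fixed = TRUE)]
# Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(bigrams starting with w1)
bigram_dt[, prob := N / sum(N), by = w1]
# Illustrative lookup of the k most probable words following a given word
predict_next <- function(word, k = 3) bigram_dt[w1 == word][order(-prob)][1:k, .(w2, prob)]
predict_next("of")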