This is the milestone report for the Data Science Capstone of the Johns Hopkins University Data Science Specialization on Coursera.
The goal of this report is to demonstrate a basic understanding of the data provided for the capstone. The data consists of three files collected from twitter, blogs and news sites, which we will use to build an algorithm that predicts, based on a user's input, the next word the user most likely expects. This concept is well known from mobile phones and the SwiftKey application.
This milestone report is divided into several parts: loading and exploring the data, sampling it, cleaning it into a corpus, tokenizing it into ngrams, visualizing the results and giving a short outlook on the prediction algorithm.
The following R script uses several packages. dplyr, data.table and tidyr are mainly used for data manipulation. The text mining part is mainly done with tm and ngram, especially the creation of the data corpus, since the tm package provides a convenient way to adapt the corpus to our purposes. The ngrams are created with the tidytext package, because it proved more convenient and stable for this task than the tm package. treemap, ggplot2, ggraph and igraph are used for data visualisation.
library(dplyr)
library(data.table)
library(tidyr)
library(tidytext)
library(tm)
library(ngram)
library(treemap)
library(ggraph)
library(igraph)
library(ggplot2)   # used for the ngram bar charts below
The following steps show how the data is read in via readLines. The data files were downloaded from this source: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
mainpath <- "C:/Projekte/Projects/R/Coursera/Week 10/data/"
blogstxt <- "en_US.blogs.txt"
newstxt <- "en_US.news.txt"
twittertxt <- "en_US.twitter.txt"
twitter <- readLines(paste(mainpath, twittertxt, sep=""), warn=FALSE, encoding="UTF-8")
blogs <- readLines(paste(mainpath, blogstxt, sep=""), warn=FALSE, encoding="UTF-8")
news <- readLines(paste(mainpath, newstxt, sep=""), warn=FALSE, encoding="UTF-8")
To understand the dataset we create a data frame with absolute values such as file size, line, word and character counts. Since the sample data set will obviously be smaller, we also create relative metrics, such as words and characters per line, which will later help us judge how well a sample file represents the whole data set.
twitter_size <- file.info(paste(mainpath, twittertxt, sep=""))$size/1024/1024
twitter_lines <- as.numeric(summary(twitter)[1])
twitter_words <- wordcount(twitter)
twitter_characters <- sum(nchar(twitter))
blogs_size <- file.info(paste(mainpath, blogstxt, sep=""))$size/1024/1024
blogs_lines <- as.numeric(summary(blogs)[1])
blogs_words <- wordcount(blogs)
blogs_characters <- sum(nchar(blogs))
news_size <- file.info(paste(mainpath, newstxt, sep=""))$size/1024/1024
news_lines <- as.numeric(summary(news)[1])
news_words <- wordcount(news)
news_characters <- sum(nchar(news))
file_exploration <- data.frame(File = c("blogs", "news", "twitter"),
Size_MB = c(blogs_size, news_size, twitter_size),
Lines = c(blogs_lines, news_lines, twitter_lines),
Words = c(blogs_words, news_words, twitter_words),
Characters = c(blogs_characters, news_characters, twitter_characters),
Words_per_Line = c(blogs_words/blogs_lines, news_words/news_lines, twitter_words/twitter_lines),
Ch._per_Line = c(blogs_characters/blogs_lines, news_characters/news_lines, twitter_characters/twitter_lines)
)
file_exploration
## File Size_MB Lines Words Characters Words_per_Line Ch._per_Line
## 1 blogs 200.4242 899288 37334131 206824505 41.51521 229.98695
## 2 news 196.2775 77259 2643969 15639408 34.22215 202.42830
## 3 twitter 159.3641 2360148 30373543 162096031 12.86934 68.68045
As we can see, each input file is between roughly 160 and 200 MB in size. The twitter file contains by far the most lines.
Computation on huge text documents brings problems, especially concerning memory. Therefore, randomly reducing the data to a sample set is necessary to keep the calculations of the future app fast for the user while preserving the necessary accuracy.
Since the provided data is not a representative source of the English language anyway, we do not have to preserve the relative sizes or other characteristics of the data sets. It is more important to create a small data set with as many different, but still representative, word combinations as possible, i.e. a sample that is as heterogeneous as possible.
Our first approach is to count the number of unique words per dataset.
The sampling is done with a simple biased coin flip per line of each file via the rbinom command. It is important to understand that we drop whole lines per document, so, with regard to the previous table, keep in mind that the twitter file contains far more lines than the other two. We explore each data set with 100%, 90%, 70%, 50%, 30% and 10% of the lines.
set.seed(202003)
twitter100 <- tibble(line = 1:length(twitter), text = twitter) %>%
unnest_tokens(word, text)
twitter_sample10 <- twitter[rbinom(twitter_lines, 1, 0.1) == 1]
twitter_sample10 <- tibble(line = 1:length(twitter_sample10), text = twitter_sample10) %>%
unnest_tokens(word, text)
twitter_sample30 <- twitter[rbinom(twitter_lines, 1, 0.3) == 1]
twitter_sample30 <- tibble(line = 1:length(twitter_sample30), text = twitter_sample30) %>%
unnest_tokens(word, text)
twitter_sample50 <- twitter[rbinom(twitter_lines, 1, 0.5) == 1]
twitter_sample50 <- tibble(line = 1:length(twitter_sample50), text = twitter_sample50) %>%
unnest_tokens(word, text)
twitter_sample70 <- twitter[rbinom(twitter_lines, 1, 0.7) == 1]
twitter_sample70 <- tibble(line = 1:length(twitter_sample70), text = twitter_sample70) %>%
unnest_tokens(word, text)
twitter_sample90 <- twitter[rbinom(twitter_lines, 1, 0.9) == 1]
twitter_sample90 <- tibble(line = 1:length(twitter_sample90), text = twitter_sample90) %>%
unnest_tokens(word, text)
blogs100 <- tibble(line = 1:length(blogs), text = blogs) %>%
unnest_tokens(word, text)
blogs_sample10 <- blogs[rbinom(blogs_lines, 1, 0.1) == 1]
blogs_sample10 <- tibble(line = 1:length(blogs_sample10), text = blogs_sample10) %>%
unnest_tokens(word, text)
blogs_sample30 <- blogs[rbinom(blogs_lines, 1, 0.3) == 1]
blogs_sample30 <- tibble(line = 1:length(blogs_sample30), text = blogs_sample30) %>%
unnest_tokens(word, text)
blogs_sample50 <- blogs[rbinom(blogs_lines, 1, 0.5) == 1]
blogs_sample50 <- tibble(line = 1:length(blogs_sample50), text = blogs_sample50) %>%
unnest_tokens(word, text)
blogs_sample70 <- blogs[rbinom(blogs_lines, 1, 0.7) == 1]
blogs_sample70 <- tibble(line = 1:length(blogs_sample70), text = blogs_sample70) %>%
unnest_tokens(word, text)
blogs_sample90 <- blogs[rbinom(blogs_lines, 1, 0.9) == 1]
blogs_sample90 <- tibble(line = 1:length(blogs_sample90), text = blogs_sample90) %>%
unnest_tokens(word, text)
news100 <- tibble(line = 1:length(news), text = news) %>%
unnest_tokens(word, text)
news_sample10 <- news[rbinom(news_lines, 1, 0.1) == 1]
news_sample10 <- tibble(line = 1:length(news_sample10), text = news_sample10) %>%
unnest_tokens(word, text)
news_sample30 <- news[rbinom(news_lines, 1, 0.3) == 1]
news_sample30 <- tibble(line = 1:length(news_sample30), text = news_sample30) %>%
unnest_tokens(word, text)
news_sample50 <- news[rbinom(news_lines, 1, 0.5) == 1]
news_sample50 <- tibble(line = 1:length(news_sample50), text = news_sample50) %>%
unnest_tokens(word, text)
news_sample70 <- news[rbinom(news_lines, 1, 0.7) == 1]
news_sample70 <- tibble(line = 1:length(news_sample70), text = news_sample70) %>%
unnest_tokens(word, text)
news_sample90 <- news[rbinom(news_lines, 1, 0.9) == 1]
news_sample90 <- tibble(line = 1:length(news_sample90), text = news_sample90) %>%
unnest_tokens(word, text)
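As a side note, the repeated sampling and tokenization above could be wrapped in a small helper function. The following is only a sketch, assuming the packages loaded above; sample_tokens is a hypothetical helper name and is not used in the rest of this report.
# Hypothetical helper (illustration only): keep each line with probability prob and tokenize the result
sample_tokens <- function(lines, prob) {
  kept <- lines[rbinom(length(lines), 1, prob) == 1]
  tibble(line = seq_along(kept), text = kept) %>%
    unnest_tokens(word, text)
}
# e.g. twitter_sample10 <- sample_tokens(twitter, 0.1)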
In the next step we build a data frame in which we calculate what percentage of the unique words of each original file remains in its samples.
t100 <- n_distinct(twitter100$word) / n_distinct(twitter100$word) * 100
t90 <- n_distinct(twitter_sample90$word) / n_distinct(twitter100$word) * 100
t70 <- n_distinct(twitter_sample70$word) / n_distinct(twitter100$word) * 100
t50 <- n_distinct(twitter_sample50$word) / n_distinct(twitter100$word) * 100
t30 <- n_distinct(twitter_sample30$word) / n_distinct(twitter100$word) * 100
t10 <- n_distinct(twitter_sample10$word) / n_distinct(twitter100$word) * 100
b100 <- n_distinct(blogs100$word) / n_distinct(blogs100$word) * 100
b90 <- n_distinct(blogs_sample90$word) / n_distinct(blogs100$word) * 100
b70 <- n_distinct(blogs_sample70$word) / n_distinct(blogs100$word) * 100
b50 <- n_distinct(blogs_sample50$word) / n_distinct(blogs100$word) * 100
b30 <- n_distinct(blogs_sample30$word) / n_distinct(blogs100$word) * 100
b10 <- n_distinct(blogs_sample10$word) / n_distinct(blogs100$word) * 100
n100 <- n_distinct(news100$word) / n_distinct(news100$word) * 100
n90 <- n_distinct(news_sample90$word) / n_distinct(news100$word) * 100
n70 <- n_distinct(news_sample70$word) / n_distinct(news100$word) * 100
n50 <- n_distinct(news_sample50$word) / n_distinct(news100$word) * 100
n30 <- n_distinct(news_sample30$word) / n_distinct(news100$word) * 100
n10 <- n_distinct(news_sample10$word) / n_distinct(news100$word) * 100
unique_words <- data.frame(File = c("blogs", "news", "twitter"),
p100 = c(b100, n100, t100),
p90 = c(b90, n90, t90),
p70 = c(b70, n70, t70),
p50 = c(b50, n50, t50),
p30 = c(b30, n30, t30),
p10 = c(b10, n10, t10))
unique_words
## File p100 p90 p70 p50 p30 p10
## 1 blogs 100 94.53911 83.04828 69.56935 53.29325 30.17347
## 2 news 100 94.86493 83.76703 71.30917 55.41215 31.14061
## 3 twitter 100 93.84942 80.80013 66.34529 49.17411 26.04458
The combination of the 50% news, 50% blogs and 10% twitter sample leads us to the following share of unique words compared to the combined original files.
unique_words_combination <- combine(news_sample50$word, blogs_sample50$word, twitter_sample10$word)
unique_words_100 <- combine(news100$word, blogs100$word, twitter100$word)
n_distinct(unique_words_combination) / n_distinct(unique_words_100)
## [1] 0.4742059
This was just a first approach. While building the final app we will still have to iterate on the sample size to find the best trade-off between accuracy and speed.
So far we have created a lot of variables which we will not need for the next steps. In order to free memory and speed up the calculations we remove them from the workspace. This will happen a couple more times during this process and will not be commented on again.
rm(twitter100,
twitter_sample10,
twitter_sample30,
twitter_sample50,
twitter_sample70,
twitter_sample90,
news100,
news_sample10,
news_sample30,
news_sample50,
news_sample70,
news_sample90,
blogs100,
blogs_sample10,
blogs_sample30,
blogs_sample50,
blogs_sample70,
blogs_sample90,
t100,
t90,
t70,
t50,
t30,
t10,
n100,
n90,
n70,
n50,
n30,
n10,
b100,
b90,
b70,
b50,
b30,
b10,
unique_words,
unique_words_combination,
unique_words_100)
With the information from the previous calculations we create our data samples.
set.seed(202003)
twitter_sample <- twitter[rbinom(twitter_lines, 1, 0.1) == 1]
blogs_sample <- blogs[rbinom(blogs_lines, 1, 0.3) == 1]
news_sample <- news[rbinom(news_lines, 1, 0.3) == 1]
rm(twitter)
rm(blogs)
rm(news)
rm(blogs_characters)
rm(blogs_lines)
rm(blogs_size)
rm(blogs_words)
rm(news_characters)
rm(news_lines)
rm(news_size)
rm(news_words)
rm(twitter_characters)
rm(twitter_lines)
rm(twitter_size)
rm(twitter_words)
In order to understand the difference between the input files and our samples, we again create an overview of the sample files and compare it with the overview of the input files.
twitter_size <- as.numeric(object.size(twitter_sample)/1024/1024)
twitter_lines <- as.numeric(summary(twitter_sample)[1])
twitter_words <- wordcount(twitter_sample)
twitter_characters <- sum(nchar(twitter_sample))
blogs_size <- as.numeric(object.size(blogs_sample)/1024/1024)
blogs_lines <- as.numeric(summary(blogs_sample)[1])
blogs_words <- wordcount(blogs_sample)
blogs_characters <- sum(nchar(blogs_sample))
news_size <- as.numeric(object.size(news_sample)/1024/1024)
news_lines <- as.numeric(summary(news_sample)[1])
news_words <- wordcount(news_sample)
news_characters <- sum(nchar(news_sample))
file_exploration_sample <- data.frame(File = c("blogs", "news", "twitter"),
Size_MB = c(blogs_size, news_size, twitter_size),
Lines = c(blogs_lines, news_lines, twitter_lines),
Words = c(blogs_words, news_words, twitter_words),
Characters = c(blogs_characters, news_characters, twitter_characters),
Words_per_Line = c(blogs_words/blogs_lines, news_words/news_lines, twitter_words/twitter_lines),
Ch._per_Line = c(blogs_characters/blogs_lines, news_characters/news_lines, twitter_characters/twitter_lines)
)
file_exploration
## File Size_MB Lines Words Characters Words_per_Line Ch._per_Line
## 1 blogs 200.4242 899288 37334131 206824505 41.51521 229.98695
## 2 news 196.2775 77259 2643969 15639408 34.22215 202.42830
## 3 twitter 159.3641 2360148 30373543 162096031 12.86934 68.68045
file_exploration_sample
## File Size_MB Lines Words Characters Words_per_Line Ch._per_Line
## 1 blogs 76.634285 269411 11213813 62108173 41.62344 230.53317
## 2 news 5.900749 23034 789768 4672357 34.28705 202.84610
## 3 twitter 32.161278 236111 3032631 16175536 12.84409 68.50818
It is interesting to see that the sizes now differ a lot: the blogs sample is still around 77 MB in memory, whereas the news sample shrinks to only 5.9 MB (note that the sample sizes are in-memory object sizes rather than file sizes on disk). However, when we compare the relative metrics words per line and characters per line, we can see that the sampling barely changed them. There are not noticeably more or fewer words per line in each document, which is a good sign, because the later prediction depends on word order.
Since it is hard to get a feeling for the actual word counts of the input and sample files, we visualize the number of words per source via treemaps.
file_source <- c("Blogs","News","Twitter")
number_words <- c(file_exploration$Words)
number_words_sample <- c(file_exploration_sample$Words)
data_file <- data.frame(file_source, number_words)
data_sample <- data.frame(file_source, number_words_sample)
treemap(data_file,
index="file_source",
palette = "Set3",
vSize="number_words",
type="index"
)
treemap(data_sample,
index="file_source",
palette = "Set3",
vSize="number_words_sample",
type="index"
)
What we can see here is that we have biased the data sample towards the blogs file and lost a share of words from the twitter file. Again, the input files are not a representative source of the English language, so we do not have to worry much about this finding. We can also assume at this point that the news input file uses rather formal language, whereas the twitter input file contains much more slang and was constrained by the tweet character limit. The blogs input file could therefore be a good source of words for predicting future user input.
To sum up, our data sample:
- is much smaller and biased towards the blogs file
- still contains around 50% of the unique words of the original files
- did not change the number of words per line in each document much
rm(blogs_characters)
rm(blogs_lines)
rm(blogs_size)
rm(blogs_words)
rm(news_characters)
rm(news_lines)
rm(news_size)
rm(news_words)
rm(twitter_characters)
rm(twitter_lines)
rm(twitter_size)
rm(twitter_words)
rm(mainpath)
rm(data_file)
rm(blogstxt)
rm(newstxt)
rm(twittertxt)
For the next steps we use some techniques from NLP (natural language processing) to remove unnecessary elements from our data set and to shape the data in such a way that a machine can read it and compute with it.
Our first step is to combine the sample files into one big data_sample file.
data_sample <- combine(twitter_sample, blogs_sample, news_sample)
rm(twitter_sample)
rm(blogs_sample)
rm(news_sample)
At this point of the analysis I have decided to use the text mining package tm for further data manipulation, since it provides a very convenient way to shape the data for our purposes. The required data class is a corpus, which can be created via the VCorpus command.
data_corpus <- VCorpus(VectorSource(data_sample))
rm(data_sample)
The next step is crucial for the further analysis. Here we decide which elements of the data to remove. This speeds up the calculations on the one hand and improves accuracy on the other. But we have to be careful not to remove words that will still be used by the user in our final app. One common NLP technique, removing stopwords, should not be used for our purpose, because words such as "the", "to" or "I" are used very often in spoken and written language. So these words remain in the dataset and no stopwords are excluded. In order to generalize our data sample we convert everything to lower case, remove punctuation (which is not crucial for our prediction task), remove numbers, strip unnecessary whitespace and wrap the result into plain text documents (containing meta information, which we do not have to care about). The commands below are described in the Journal of Statistical Software article Text Mining Infrastructure in R by Ingo Feinerer, Kurt Hornik and David Meyer, especially on page 19, see: https://www.jstatsoft.org/article/view/v025i05
data_corpus <- tm_map(data_corpus, stripWhitespace)
data_corpus <- tm_map(data_corpus, tolower)
data_corpus <- tm_map(data_corpus, removePunctuation)
data_corpus <- tm_map(data_corpus, removeNumbers)
data_corpus <- tm_map(data_corpus, PlainTextDocument)
Now we have a huge tidied-up data corpus, but it is still not in a form a model can compute with. The next step is to divide the text into short sections of one, two, three or four words. This process is called tokenization; the outputs are called ngrams, in our case uni-, bi-, tri- and quadgrams.
The tm package provides the necessary infrastructure, but I have decided to use the approach from the book Text Mining with R, especially the chapter on ngrams, which is available here: https://www.tidytextmining.com/ngrams.html
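To illustrate what this tokenization produces, here is a tiny toy example (the sentence is made up and not part of the capstone data):
# Toy illustration of unnest_tokens with bigrams
tibble(text = "thanks for the follow") %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2)
# returns the bigrams "thanks for", "for the" and "the follow"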
First, we have to transform our data corpus into a tidy data corpus.
corpus_td <- tidy(data_corpus)
rm(data_corpus)
In this step we use the unnest_tokens function to create our ngrams. In the same step we sort each result in descending order of frequency and calculate the share of each ngram relative to the full corpus.
corpus_unigrams <- corpus_td %>%
unnest_tokens(word, text) %>%
count(word, sort = TRUE) %>%
mutate(share = n / sum(n))
corpus_bigrams <- corpus_td %>%
unnest_tokens(word, text, token = "ngrams", n=2) %>%
count(word, sort = TRUE) %>%
mutate(share = n / sum(n))
corpus_trigrams <- corpus_td %>%
unnest_tokens(word, text, token = "ngrams", n=3) %>%
count(word, sort = TRUE) %>%
mutate(share = n / sum(n))
corpus_quadgrams <- corpus_td %>%
unnest_tokens(word, text, token = "ngrams", n=4) %>%
count(word, sort = TRUE) %>%
mutate(share = n / sum(n))
At this point it’s interesting to check the 20 most common ngrams from each corpus.
unigram_plot <- ggplot(data = slice(corpus_unigrams, 1:20), aes(x = reorder(word, n), y = n)) +
geom_bar(stat="identity", fill="darkblue") +
theme_minimal() +
coord_flip() +
labs(x = "unigram", y = "count", title = "Number of Top 20 unigrams")
bigram_plot <- ggplot(data = slice(corpus_bigrams, 1:20), aes(x = reorder(word, n), y = n)) +
geom_bar(stat="identity", fill="darkblue") +
theme_minimal() +
coord_flip() +
labs(x = "bigram", y = "count", title = "Number of Top 20 bigrams")
trigram_plot <- ggplot(data = slice(corpus_trigrams, 1:20), aes(x = reorder(word, n), y = n)) +
geom_bar(stat="identity", fill="darkblue") +
theme_minimal() +
coord_flip() +
labs(x = "trigram", y = "count", title = "Number of Top 20 trigrams")
quadgram_plot <- ggplot(data = slice(corpus_quadgrams, 1:20), aes(x = reorder(word, n), y = n)) +
geom_bar(stat="identity", fill="darkblue") +
theme_minimal() +
coord_flip() +
labs(x = "quadgram", y = "count", title = "Number of Top 20 quadgrams")
unigram_plot
bigram_plot
trigram_plot
quadgram_plot
Unsurprisingly, the most common ngrams are word sequences that are used very often in the English language, which is good, since we also want to predict common words.
So far we know the most common word sequences of each ngram size. However, this information alone is not enough to predict words. The final step is to separate the words of each ngram into individual columns so that our future prediction algorithm can work with them.
uni_words <- corpus_unigrams
bi_words <- corpus_bigrams %>%
separate(word, c("word1", "word2"), sep = " ")
tri_words <- corpus_trigrams %>%
separate(word, c("word1", "word2", "word3"), sep = " ")
quad_words <- corpus_quadgrams %>%
separate(word, c("word1", "word2", "word3", "word4"), sep = " ")
head(quad_words)
## # A tibble: 6 x 6
## word1 word2 word3 word4 n share
## <chr> <chr> <chr> <chr> <int> <dbl>
## 1 the end of the 1246 0.0000847
## 2 the rest of the 1095 0.0000744
## 3 at the end of 1037 0.0000705
## 4 at the same time 824 0.0000560
## 5 for the first time 771 0.0000524
## 6 when it comes to 703 0.0000478
Based on this data structure we get a sense of how our prediction model could work. It is most likely that after the phrase at the same the word time will follow. However, the shorter input at the could lead to a different prediction than same followed by time. Therefore we have created bi-, tri- and quadgram tables to be able to handle inputs of up to three words. To get a sense of the word network it is useful to visualize the most common ngrams, which is the final step of this analysis.
Again, this step was inspired by the Text Mining with R book, which you can find here: https://www.tidytextmining.com/ngrams.html
The following network analysis shows bigrams which appear more than 3000 times in the dataset, trigrams which appear more than 500 times and quadgrams which appear more than 200 times.
The direction of the arrow shows the direction of the word order, and the darkness of the colour shows the intensity of the connection. It is interesting to see that there are a couple of anchor words, such as a or I, from which a lot of connections to other words start. The variety and intensity of these connections differ from ngram to ngram.
Of course, these thresholds are chosen arbitrarily and are only meant to keep the visualizations from becoming overwhelming.
bigram_graph <- subset(bi_words, bi_words[, 3] > 3000) %>%
graph_from_data_frame()
set.seed(202003)
a <- grid::arrow(type = "closed", length = unit(.05, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "darkblue", size = 1) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
trigram_graph <- subset(tri_words, tri_words[, 4] > 500) %>%
graph_from_data_frame()
set.seed(202003)
a <- grid::arrow(type = "closed", length = unit(.05, "inches"))
ggraph(trigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "darkblue", size = 1) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
quadgram_graph <- subset(quad_words, quad_words[, 5] > 200) %>%
graph_from_data_frame()
set.seed(202003)
a <- grid::arrow(type = "closed", length = unit(.05, "inches"))
ggraph(quadgram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "darkblue", size = 1) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
rm(a)
rm(unigram_plot)
rm(bigram_graph)
rm(bigram_plot)
rm(trigram_graph)
rm(trigram_plot)
rm(quadgram_graph)
rm(quadgram_plot)
rm(file_exploration)
rm(file_exploration_sample)
rm(corpus_td)
Our next step is to build a prediction algorithm based on these data files. To get an idea of how this could work, we run a sample grepl command for all quadgrams beginning with at the same.
subset(corpus_quadgrams, grepl("^at the same", word))
## # A tibble: 78 x 3
## word n share
## <chr> <int> <dbl>
## 1 at the same time 824 0.0000560
## 2 at the same damn 11 0.000000748
## 3 at the same rate 7 0.000000476
## 4 at the same pace 6 0.000000408
## 5 at the same speed 6 0.000000408
## 6 at the same level 5 0.000000340
## 7 at the same location 4 0.000000272
## 8 at the same place 4 0.000000272
## 9 at the same event 3 0.000000204
## 10 at the same point 3 0.000000204
## # ... with 68 more rows
It is obvious that the best prediction for the next word should be time.
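As a rough sketch of how such a lookup could be turned into a simple backoff rule, consider the following illustration. This is not the final prediction algorithm; predict_next is a hypothetical helper built on the quad_words and tri_words tables created above.
# Illustration only: look up the last three words in the quadgram table and fall back
# to the trigram table if nothing is found; return the most frequent continuation.
predict_next <- function(w1, w2, w3) {
  hit <- quad_words %>% filter(word1 == w1, word2 == w2, word3 == w3) %>% arrange(desc(n))
  if (nrow(hit) > 0) return(hit$word4[1])
  hit <- tri_words %>% filter(word1 == w2, word2 == w3) %>% arrange(desc(n))
  if (nrow(hit) > 0) return(hit$word3[1])
  NA_character_
}
predict_next("at", "the", "same")
# should return "time" for this sample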
One problem which still has to be dealt with are words which are not included in the data corpus. For made-up words we have no chance of making a good prediction, and that is not our goal anyway, so feedback such as Please try something else should be fine. Common typos and misspellings, however, should be covered by our prediction engine. Here we will use a Levenshtein distance calculation which searches for similar words within certain boundaries.
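A minimal sketch of such a fuzzy lookup is shown below, assuming base R's adist function (which computes Levenshtein edit distances) and the corpus_unigrams table from above; the limit of one edit and the helper name closest_known_word are arbitrary choices for illustration.
# Illustration only: map an unseen or misspelled word to the closest known unigram.
# adist computes the Levenshtein distance; max_dist = 1 is an arbitrary bound.
closest_known_word <- function(input, vocabulary = corpus_unigrams$word, max_dist = 1) {
  d <- adist(input, vocabulary)
  best <- which.min(d)
  if (d[best] <= max_dist) vocabulary[best] else NA_character_
}
closest_known_word("thnks")
# e.g. maps the typo "thnks" to a close known word such as "thanks"; because corpus_unigrams
# is sorted by frequency, ties are resolved in favour of the more frequent word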
Of course, a lot of effort will still be put into speeding up the calculations without losing accuracy and finally into building our prediction app.