We will build a Shiny app that predicts the next word a user is going to type as they type. We will use previously collected text data to build a prediction model that suggests the next word to the user.

Existing Data Analysis

First we will gain a better understanding of the data we will be using to create our model by reporting file size, the number of lines in the file, and the length of the longest line.

library(stringr); library(readr); library(knitr); library(ggplot2)
fileDir <- "./rawdata/en_US"
files <- list.files(fileDir)
fullFiles <- list.files(fileDir, full.names = TRUE)
# Disk usage in KB from `du -sk` (Unix-like systems only), converted to MB
fileSize <- sapply(fullFiles, function(x) system(paste("du -sk", x), intern = TRUE))
fileSize <- round(as.numeric(str_extract(fileSize, "^\\d+")) / 1024, 2)
fileSize <- paste(fileSize, "mb")
# Read a newline-delimited text file into a one-column data frame (column X1)
readText <- function(file) {
        read_delim(file, delim = "\n", escape_backslash = FALSE, col_names = FALSE)
}
blogs <- readText(fullFiles[1])
lines <- nrow(blogs)
longLine <- max(nchar(blogs$X1))
news <- readText(fullFiles[2])
lines[2] <- nrow(news)
longLine[2] <- max(nchar(news$X1))
twitter <- readText(fullFiles[3])
lines[3] <- nrow(twitter)
longLine[3] <- max(nchar(twitter$X1))
fileInfo <- data.frame(file=files, 
                       file.size=fileSize,
                       line.count=lines, 
                       longest.line=longLine)

Table 1: File Information

file                file.size   line.count   longest.line
en_US.blogs.txt     200.43 mb       878689          40833
en_US.news.txt      196.28 mb      1000107          11384
en_US.twitter.txt   159.37 mb      2342938           9734

Data Exploration

library(stringr); library(dplyr); library(readr)
subpath <- "./rawdata/en_US"
set.seed(94259)
files <- list.files(subpath, full.names = TRUE)
readText <- function(file) {
        read_delim(file, delim = "\n", col_names = FALSE, escape_backslash = FALSE)
}
# Pick 20% of the rows at random to hold out for testing
sampleRows <- function(df) sort(sample(seq_len(nrow(df)), size = floor(nrow(df) * 0.2)))
# Draw the test rows once per file so the training and testing sets never overlap
splitFile <- function(file, name) {
        df <- readText(file)
        testRows <- sampleRows(df)
        cat(df$X1[-testRows], file = paste0("./training/", name), sep = "\n")
        cat(df$X1[testRows], file = paste0("./testing/", name), sep = "\n")
}
dir.create("./training", showWarnings = FALSE)
dir.create("./testing", showWarnings = FALSE)
splitFile(files[1], "blogs.txt")
splitFile(files[2], "news.txt")
splitFile(files[3], "twitter.txt")
library(dplyr); library(knitr); library(readr)
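# Extract lowercase word tokens, allowing one internal apostrophe (e.g. "don't")
wordr <- function(x) {
        reg <- "[A-Za-z]+'?[A-Za-z]*"
        # Normalize typographic apostrophes so contractions match the pattern
        words <- str_replace_all(x, "’", "'")
        str_extract_all(tolower(words), reg)
}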
# Tokenize one training file and label every word with its source dataset
wordsFrom <- function(path, label) {
        text <- readText(path)
        words <- unlist(wordr(text$X1), use.names = FALSE)
        data.frame(word = words, dataset = label, stringsAsFactors = FALSE)
}
wordData <- rbind(wordsFrom("./training/blogs.txt", "blogs"),
                  wordsFrom("./training/news.txt", "news"),
                  wordsFrom("./training/twitter.txt", "twitter"))
wordData$dataset <- as.factor(wordData$dataset)
# Overall frequency of each word, most frequent first
wordCount <- wordData %>%
        group_by(word) %>%
        summarize(count = n()) %>%
        ungroup() %>%
        arrange(desc(count))
# Frequency of each word within each dataset
wordData <- wordData %>%
        group_by(word, dataset) %>%
        summarize(count = n()) %>%
        ungroup() %>%
        arrange(desc(count), dataset)

Figure 1: Top Words

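# Keep only the 30 most frequent words overall, then stack their counts by dataset
topWords <- head(wordCount$word, 30)
ggplot(dplyr::filter(wordData, word %in% topWords),
       aes(reorder(word, count, sum), count, fill = dataset)) +
        geom_col() +
        labs(x = "word") +
        coord_flip()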

ShinyApp Plans

To predict the next word a user will type from the previous words in the sentence, I plan to use n-grams, which capture how words are grouped together in the current data set. My prediction model will find the word that most frequently follows the last three words typed. If the group of three words has no match, I will use a “backoff” technique and find the most frequent word following the last two words typed. I will continue backing off until the model selects a word.
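
The backoff idea can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the final model: the helper names buildNgrams and predictNext are hypothetical, the toy token vector stands in for the training data, and candidates are ranked by raw counts with no smoothing or discounting. Ties are broken alphabetically so the result is reproducible.

```r
# Hypothetical helper: count every n-gram of order 1..maxOrder in a token vector.
# Element k of the returned list is a named integer vector of k-gram counts.
buildNgrams <- function(tokens, maxOrder = 4) {
  lapply(seq_len(maxOrder), function(k) {
    if (length(tokens) < k) return(setNames(integer(0), character(0)))
    starts <- seq_len(length(tokens) - k + 1)
    grams <- vapply(starts,
                    function(i) paste(tokens[i:(i + k - 1)], collapse = " "),
                    character(1))
    tab <- table(grams)
    setNames(as.integer(tab), names(tab))
  })
}

# Hypothetical helper: return up to n candidate next words, backing off from
# the longest available context to shorter ones until something matches.
predictNext <- function(history, ngrams, n = 3) {
  context <- tail(history, length(ngrams) - 1)
  repeat {
    counts <- ngrams[[length(context) + 1]]
    if (length(context) > 0) {
      # Keep only the n-grams that begin with the current context
      prefix <- paste0(paste(context, collapse = " "), " ")
      counts <- counts[startsWith(names(counts), prefix)]
    }
    if (length(counts) > 0) {
      # Rank by count (descending), then alphabetically to break ties
      ranked <- names(counts)[order(-counts, names(counts))]
      return(sub(".* ", "", head(ranked, n)))  # keep only the final word
    }
    if (length(context) == 0) return(character(0))
    context <- context[-1]  # back off: drop the oldest word
  }
}

# Toy data standing in for the tokenized training set
tokens <- c("i", "love", "the", "sea", "i", "love", "the", "sun",
            "you", "love", "the", "sea")
ngrams <- buildNgrams(tokens)
predictNext(c("we", "love", "the"), ngrams)
```

In the last call, "we love the" never occurs in the toy data, so the lookup backs off to "love the" before any candidates are found. Building the counts with vapply is fine for a demonstration but would be too slow for the full corpus.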

As you type in the Shiny app, the three most frequent words will be displayed. The user will be able to press Enter to accept the top suggestion. To use one of the other two suggestions, they can press Tab until the desired word is selected and then press Enter to place it into their text.
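
A minimal sketch of the display half of this plan, assuming the shiny package is installed. The suggestWords function is a hypothetical placeholder that always returns the same three words, and the input and output IDs ("typed", "suggestions") are made up for illustration. Selecting a suggestion with Tab and Enter is not covered here.

```r
library(shiny)

# Hypothetical placeholder for the prediction model: always returns the same
# three words. A real version would look up the typed text in the n-gram counts.
suggestWords <- function(text) c("the", "to", "and")

ui <- fluidPage(
  textInput("typed", "Start typing:"),
  textOutput("suggestions")
)

server <- function(input, output, session) {
  # Recompute the suggestions whenever the contents of the text box change
  output$suggestions <- renderText({
    paste(suggestWords(input$typed), collapse = "   ")
  })
}

# Launch the app only when running in an interactive R session
if (interactive()) shinyApp(ui, server)
```

Because renderText re-runs every time input$typed changes, the suggestions update on each keystroke without any extra event handling.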