We will build a Shiny app that predicts the next word a user is going to type as they type it. We will use previously collected text data to create a prediction model that suggests the next word to the user.

Existing Data Analysis

First we will gain a better understanding of the data we will be using to create our model by reporting file size, the number of lines in the file, and the length of the longest line.

library(stringr); library(readr); library(knitr); library(ggplot2)
fileDir <- "./rawdata/en_US"
files <- list.files(fileDir)
fullFiles <- list.files(fileDir, full.names = TRUE)
# Disk usage in KB as reported by `du` (Unix-like systems only), converted to MB
fileSize <- vapply(fullFiles, function(x) system(paste("du -sk", x), intern = TRUE), character(1))
fileSize <- round(as.numeric(str_extract(fileSize, "^\\d+")) / 1024, 2)
fileSize <- paste(fileSize, "mb")
# Read a text file into a one-column data frame (X1), one row per line
readText <- function(file) {
        read_delim(file, delim = "\n", escape_backslash = FALSE, col_names = FALSE)
}
blogs <- readText(fullFiles[1])
lines <- nrow(blogs)
longLine <- max(nchar(blogs$X1))
news <- readText(fullFiles[2])
lines[2] <- nrow(news)
longLine[2] <- max(nchar(news$X1))
twitter <- readText(fullFiles[3])

lines[3] <- nrow(twitter)
longLine[3] <- max(nchar(twitter$X1))
fileInfo <- data.frame(file=files, 
                       file.size=fileSize,
                       line.count=lines, 
                       longest.line=longLine)

Table 1: File Information

file	file.size	line.count	longest.line
en_US.blogs.txt	200.43 mb	878689	40833
en_US.news.txt	196.28 mb	1000107	11384
en_US.twitter.txt	159.37 mb	2342938	9734
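
The same three statistics can also be gathered without calling du, which is only available on Unix-like systems. The following is a minimal sketch in base R; summarizeFile is a hypothetical helper name, and it is demonstrated on a small temporary file rather than on the corpus. Note that file.size() reports the byte length of the file, so its result can differ slightly from the disk usage that du reports.

```r
# Hypothetical helper (not part of the original analysis):
# summarize one text file using base R only.
summarizeFile <- function(path) {
  txt <- readLines(path, warn = FALSE, encoding = "UTF-8")
  data.frame(
    file = basename(path),
    # file.size() returns bytes; convert to megabytes
    size.mb = round(file.size(path) / 1024^2, 2),
    line.count = length(txt),
    # allowNA = TRUE avoids an error on invalid multibyte strings
    longest.line = if (length(txt) > 0) {
      max(nchar(txt, allowNA = TRUE), na.rm = TRUE)
    } else {
      0L
    },
    stringsAsFactors = FALSE
  )
}

# Usage on a small temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("a short line", "a somewhat longer line of text"), tmp)
info <- summarizeFile(tmp)
print(info)
```

To summarize several files at once, the per-file rows could be combined with do.call(rbind, lapply(fullFiles, summarizeFile)).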

Data Exploration

library(stringr); library(dplyr); library(readr)
subpath <- "./rawdata/en_US"
set.seed(94259)
files <- list.files(subpath)
files <- paste0(subpath, "/", files)
readText <- function(file) {
        df <- read_delim(file, delim="\n", col_names=F, escape_backslash=F)
        return(df)
}
# Draw a sorted 20% sample of row indices to hold out for testing
sampleRows <- function(df) sort(sample(seq_len(nrow(df)), size = floor(nrow(df) * 0.2)))
dir.create("./training", showWarnings = FALSE)
dir.create("./testing", showWarnings = FALSE)
# Sample once per file and reuse the indices, so that the training and
# testing sets are complementary (two separate draws would overlap)
df <- readText(files[1])
testRows <- sampleRows(df)
cat(df$X1[-testRows], file="./training/blogs.txt", sep="\n")
cat(df$X1[testRows], file="./testing/blogs.txt", sep="\n")
df <- readText(files[2])
testRows <- sampleRows(df)
cat(df$X1[-testRows], file="./training/news.txt", sep="\n")
cat(df$X1[testRows], file="./testing/news.txt", sep="\n")
df <- readText(files[3])
testRows <- sampleRows(df)
cat(df$X1[-testRows], file="./training/twitter.txt", sep="\n")
cat(df$X1[testRows], file="./testing/twitter.txt", sep="\n")

library(dplyr); library(knitr); library(readr)
wordr <- function(x){
        reg <- "[A-Za-z]+'?[A-Za-z]*"
        words <- str_replace_all(x, "’", "'")
        words <- str_extract_all(tolower(words), reg)
        return(words)
}
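
To make the effect of the tokenizing pattern concrete, here is a small sketch that applies the same regular expression with base R functions. The name tokenizeLine is hypothetical, and the function is assumed to behave like wordr() on plain text. It shows that apostrophes inside words are kept, while digits, hyphens, and punctuation are dropped or split on.

```r
# Hypothetical base R counterpart of wordr(): normalize curly apostrophes,
# lowercase the text, and extract runs of letters with an optional apostrophe.
tokenizeLine <- function(x) {
  x <- tolower(gsub("\u2019", "'", x, fixed = TRUE))
  regmatches(x, gregexpr("[a-z]+'?[a-z]*", x))
}

tokens <- tokenizeLine("Don\u2019t stop: it's 9am, e-mail me!")
print(tokens[[1]])
# "don't" "stop" "it's" "am" "e" "mail" "me"
```

One consequence worth noting is that hyphenated words such as "e-mail" are split into two tokens, and numbers never appear in the vocabulary.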

library(readr); library(stringr)
blogs <- readText("./training/blogs.txt")
words <- sapply(blogs$X1, wordr)
words <- unlist(words, use.names = F)
wordData <- data.frame(word = words, stringsAsFactors = F)
wordData$dataset <- "blogs"

library(readr); library(stringr)
news <- readText("./training/news.txt")
words <- sapply(news$X1, wordr)
words <- unlist(words, use.names = F)
words <- data.frame(word = words, stringsAsFactors = F)
words$dataset <- "news"

library(readr); library(stringr)
twitter <- readText("./training/twitter.txt")
wordData <- rbind(wordData,words)
words <- sapply(twitter$X1, wordr)
words <- unlist(words, use.names = F)
words <- data.frame(word = words, stringsAsFactors = F)
words$dataset <- "twitter"

wordData <- rbind(wordData,words)
wordData$dataset <- as.factor(wordData$dataset)

library(dplyr)
wordCount <- summarize(group_by(wordData, word), count=n()) %>%
                ungroup() %>% arrange(desc(count))
wordData <- summarize(group_by(wordData, word, dataset), 
                count=n())
wordData <- wordData %>% ungroup() %>% arrange(desc(count), dataset)

Figure 1: Top Words

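# Plot the 30 most frequent words overall, with bars colored by dataset
x <- head(wordCount$word, 30)
ggplot(filter(wordData, word %in% x), aes(word, count, fill=dataset)) +
        # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
        stat_summary(fun = sum, geom = "bar") +
        coord_flip()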

ShinyApp Plans

To predict the next word a user is going to type from the previous words in the sentence, I plan to use n-grams, which capture how words are grouped together in my current data set. My prediction model will find the word that most frequently follows the last three words typed. If there is no match for that group of three words, I will use a “backoff” technique and find the most frequent word following the last two words typed. I will continue backing off until the model selects a word.
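
The backoff idea can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the final model: the helper names tokenize, buildNgrams, and predictNext are hypothetical, the toy corpus is invented for the example, and suggestions are ranked by raw counts with alphabetical tie-breaking. The lookup starts with the last three words (a four-gram table) and falls back to shorter contexts until something matches, ending with the most frequent single words.

```r
# Tokenize one line: same pattern as the exploration code, base R only.
tokenize <- function(x) {
  x <- tolower(gsub("\u2019", "'", x, fixed = TRUE))
  regmatches(x, gregexpr("[a-z]+'?[a-z]*", x))[[1]]
}

# Count all n-grams of order n. Each row holds the n - 1 preceding words
# (context), the word that follows them, and how often that pair occurred.
buildNgrams <- function(lines, n) {
  rows <- lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < n) return(NULL)
    starts <- seq_len(length(w) - n + 1)
    data.frame(
      context = vapply(starts, function(i) {
        paste(w[seq(i, length.out = n - 1)], collapse = " ")
      }, character(1)),
      word = w[starts + n - 1],
      stringsAsFactors = FALSE
    )
  })
  grams <- do.call(rbind, rows)
  if (is.null(grams)) {
    return(data.frame(context = character(0), word = character(0),
                      count = integer(0), stringsAsFactors = FALSE))
  }
  counts <- aggregate(count ~ context + word,
                      data = cbind(grams, count = 1L), FUN = sum)
  # Most frequent first; ties broken alphabetically for reproducibility
  counts[order(-counts$count, counts$word), ]
}

# Back off from the longest available context to the unigram table.
# models[[n]] must hold the n-gram counts of order n.
predictNext <- function(text, models, k = 3) {
  w <- tokenize(text)
  for (n in rev(seq_along(models))) {
    if (length(w) < n - 1) next
    context <- paste(tail(w, n - 1), collapse = " ")
    hits <- models[[n]][models[[n]]$context == context, ]
    if (nrow(hits) > 0) return(head(as.character(hits$word), k))
  }
  character(0)
}

# Toy corpus, invented for this example
corpus <- c("thank you very much for the help",
            "thank you very much for coming",
            "thank you very much indeed",
            "i like it very much")
models <- lapply(1:4, function(n) buildNgrams(corpus, n))

p1 <- predictNext("Thank you very", models)    # matched on three words
p2 <- predictNext("it is very", models)        # backs off to one word of context
p3 <- predictNext("completely unseen", models) # backs off to single-word counts
print(p1); print(p2); print(p3)
```

In this sketch a match at a higher order returns only the words seen after that context, so fewer than three suggestions may come back. Filling the remaining slots from lower orders, or weighting the orders against each other, are design choices left open here.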

As you type in the Shiny app, the three most frequent next words will be displayed. The user will be able to press Enter to accept the top suggestion. To use one of the other two suggestions, they can press Tab until the desired word is highlighted and then press Enter to insert it into their text.
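
The selection behavior can be expressed as two small pure functions, independent of the Shiny interface itself. This is a rough sketch: cycleSelection and acceptSuggestion are hypothetical names and not part of Shiny, and wiring them to the Tab and Enter keys in the browser is not shown.

```r
# Move the highlight to the next suggestion, wrapping back to the first.
cycleSelection <- function(selected, nSuggestions = 3) {
  selected %% nSuggestions + 1
}

# Append the highlighted suggestion to the text typed so far.
acceptSuggestion <- function(text, suggestions, selected = 1) {
  paste(trimws(text, which = "right"), suggestions[selected])
}

suggestions <- c("much", "well", "good")   # example suggestions
selected <- 1                              # top suggestion is highlighted first
selected <- cycleSelection(selected)       # first Tab  -> second suggestion
selected <- cycleSelection(selected)       # second Tab -> third suggestion
result <- acceptSuggestion("thank you very ", suggestions, selected)
print(result)
```

Keeping this logic separate from the interface code would make it easy to test without starting the app.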

Data Science Capstone Milestone Report

Michael J Pfammatter

July 26, 2015
