Spend less time in entry!

By predicting the next word, this application helps users save time on text entry when it is integrated into text-editing software.
The application was developed in R and runs as a Shiny app on a free server accessible to everyone.

To start, we load the required packages and connect to the data. The R code is shown for reference but is not run.

Loading packages
Connecting to data

# Load the package that provides read_lines()
library(readr)

# Download and unzip the data set (done ahead to save time)
fileurl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileurl, destfile = "Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")

# Set the working directory (adjust to where the archive was extracted) and read the three English files
setwd("Week 2/final/en_US/")
blogs <- read_lines("en_US.blogs.txt", skip_empty_rows = TRUE)
news <- read_lines("en_US.news.txt", skip_empty_rows = TRUE)
twitter <- read_lines("en_US.twitter.txt", skip_empty_rows = TRUE)

# Combine the sources into one character vector and free the originals
all_text <- c(blogs, news, twitter)
rm(blogs, news, twitter)

Tidying data

Basic transformations, such as removing internet characters and symbols and converting all words to lower case, are applied to a randomly selected sample of about 1% of the lines.

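library(tm) # provides VCorpus, tm_map and the transformations used below

set.seed(12345) # numeric seed so the sample is reproducible
# Keep each line with probability 0.01; the lines are assumed to be the all_text vector loaded above
trainingData <- all_text[rbinom(length(all_text), 1, 0.01) == 1]
corpusFeeds <- VCorpus(VectorSource(trainingData))
corpusFeeds <- tm_map(corpusFeeds, removePunctuation) # remove punctuation
corpusFeeds <- tm_map(corpusFeeds, content_transformer(tolower)) # convert to lower case
# remove_chars and remove_symbols are user-defined helpers whose definitions are not shown here
corpusFeeds <- tm_map(corpusFeeds, content_transformer(remove_chars)) # remove internet chars
corpusFeeds <- tm_map(corpusFeeds, removeWords, stopwords("english")) # remove English stop words
corpusFeeds <- tm_map(corpusFeeds, content_transformer(remove_symbols)) # remove symbols
corpusFeeds <- tm_map(corpusFeeds, stripWhitespace) # remove extra spaces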
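
The helpers `remove_chars` and `remove_symbols` are called in the code above but never defined in this document. The sketch below is one possible shape for them, using base R only. The regular expressions and the sample sentence are assumptions for illustration, not the original definitions.

```r
# Hypothetical stand-ins for the undefined helpers; the patterns are assumptions.

# Drop URLs, e-mail addresses, hashtags and @mentions.
remove_chars <- function(x) {
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)
  x <- gsub("[@#]\\S+", " ", x, perl = TRUE)
  x
}

# Replace anything that is not a letter, apostrophe or space with a space.
remove_symbols <- function(x) {
  gsub("[^A-Za-z' ]+", " ", x, perl = TRUE)
}

# Apply both helpers to a made-up sentence, then collapse the leftover spaces.
cleaned <- remove_symbols(remove_chars("Visit http://example.com now!! #fun @bob 100% :)"))
cleaned <- trimws(gsub("\\s+", " ", cleaned))
print(cleaned)
```

Patterns like these only match while the punctuation they rely on (`://`, `@`, `#`) is still present, so helpers written this way would need to run before `removePunctuation` to have any effect.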

Remove profanities

The next step is to remove the profanities listed at http://www.bannedwordlist.com.

# Download the list of banned words
fileurl2 <- "http://www.bannedwordlist.com/lists/swearWords.txt"
download.file(fileurl2, destfile = "badwords.txt")
badwords <- readLines("badwords.txt")
# removeWords expects a character vector of words, so pass badwords directly
corpusFeeds <- tm_map(corpusFeeds, removeWords, badwords)

Descriptive analysis

I chose two ways to present the findings. The first is a word cloud, which shows the most frequently used words.

(Figure: word cloud)

The second is a set of n-gram histograms, which show which word combinations are used most often.

(Figure: n-gram histograms)
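
The counts behind an n-gram histogram can be sketched in base R as follows. This is a minimal illustration, not the code used for the figures: the function name `count_ngrams` and the three sample lines are made up, and the input is assumed to be already cleaned, whitespace-separated text.

```r
# Count n-grams in a character vector of already-cleaned lines (base R only).
count_ngrams <- function(lines, n = 2) {
  grams <- unlist(lapply(strsplit(lines, "\\s+"), function(words) {
    words <- words[nzchar(words)]                  # drop empty tokens
    if (length(words) < n) return(character(0))    # line too short for an n-gram
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)            # most frequent first
}

# Made-up sample lines standing in for the cleaned corpus.
sample_lines <- c("thank you so much", "thank you for coming", "so much fun")
bigram_counts <- count_ngrams(sample_lines, n = 2)
print(head(bigram_counts, 3))
```

The top rows of such a table are what a bar chart would plot, for example with `barplot(head(bigram_counts, 10))`. Setting `n = 1` gives the single-word frequencies that a word cloud is built from.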

Prediction of next word
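
This document does not show how the application predicts the next word. As a rough illustration only, the sketch below assumes a simple bigram lookup: it suggests the word that most often followed the last typed word in the training lines. The function names `build_bigram_table` and `predict_next` and the sample lines are hypothetical, and the actual application may use a different model.

```r
# Hypothetical bigram model; the real predictor is not described in the source.

# Build a table of (first word, second word, count) from training lines.
build_bigram_table <- function(lines) {
  pairs <- do.call(rbind, lapply(strsplit(tolower(lines), "\\s+"), function(words) {
    words <- words[nzchar(words)]
    if (length(words) < 2) return(NULL)
    data.frame(first = head(words, -1), second = tail(words, -1),
               stringsAsFactors = FALSE)
  }))
  counts <- aggregate(n ~ first + second, data = cbind(pairs, n = 1), FUN = sum)
  counts[order(counts$first, -counts$n), ]
}

# Suggest the most frequent follower of the last word in the phrase,
# or NA when that word never appeared as a first word.
predict_next <- function(bigrams, phrase) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  last_word <- words[length(words)]
  candidates <- bigrams[bigrams$first == last_word, ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$second[which.max(candidates$n)]
}

# Made-up training lines for the example.
bigrams <- build_bigram_table(c("thank you so much", "thank you for coming", "thank goodness"))
print(predict_next(bigrams, "I want to thank"))
```

A lookup like this returns nothing for unseen words, which is why practical predictors usually combine several n-gram orders and fall back to shorter contexts.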
