Executive Summary

This report is a brief introduction to the Shiny text predictor. In this report, I will:

  1. Download the Data
  2. Import the data to R
  3. Take a sample of the data
  4. Clean and standardize the data
  5. Create and visualize nGrams
  6. Explain Project Plan

Download the Data

The dataset was downloaded directly from the course website, as directed by the tutor. The zip file was downloaded and extracted; the resulting directory contains files in German, English, Finnish, and Russian.

library(stringi)
library(knitr)
library(tm)
library(RWeka)
library(ggplot2)

# Get the data: download and unzip only if the archive is not already present
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip <- "Coursera-SwiftKey.zip"
if (!file.exists(zip)) {
  download.file(url, zip)
  unzip(zip)
}

Import the data to R

I will be using the US English (en_US) files for this project, together with a list of profanity (swearWords.csv) that is filtered out later.

# Read the three English corpora and the profanity list
blogs <- readLines("en_US.blogs.txt", encoding = 'UTF-8', skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = 'UTF-8', skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = 'UTF-8', skipNul = TRUE)
swearWords <- readLines("swearWords.csv", encoding = 'UTF-8', skipNul = TRUE)

# Summary statistics: line count, length of the longest line (in characters) and total words
kable(data.frame(row.names    = c("blogs", "news", "twitter"),
                 LineCount    = sapply(list(blogs, news, twitter), length),
                 LongestLines = sapply(list(stri_length(blogs), stri_length(news), stri_length(twitter)), max),
                 TotalWords   = sapply(list(blogs, news, twitter), stri_stats_latex)[4, ]))
          LineCount  LongestLines  TotalWords
blogs        899288         40833    37570839
news          77259          5760     2651432
twitter     2360148           140    30451170

Take a sample of the data

# Take a reproducible random sample of 10,000 lines from each source
set.seed(2016)
sampleTwitter <- twitter[sample(1:length(twitter), 10000)]
sampleNews    <- news[sample(1:length(news), 10000)]
sampleBlogs   <- blogs[sample(1:length(blogs), 10000)]
usSample <- c(sampleTwitter, sampleNews, sampleBlogs)
rm(blogs, news, twitter, sampleBlogs, sampleNews, sampleTwitter) # free memory

Clean and standardize the data

usSample <- iconv(usSample, 'UTF-8', 'ASCII', "byte") # substitute non-ASCII characters with byte codes
usCorpus <- VCorpus(VectorSource(usSample))

# transformation that replaces matched patterns with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
usCorpus <- tm_map(usCorpus, toSpace, "/|@|\\|") # replace /, @ and | with spaces
usCorpus <- tm_map(usCorpus, removeWords, stopwords("english")) # remove English stop words (applied before lowercasing, so capitalized stop words such as "The" survive)
usCorpus <- tm_map(usCorpus, content_transformer(tolower)) # convert to lowercase
usCorpus <- tm_map(usCorpus, removePunctuation) # remove punctuation
usCorpus <- tm_map(usCorpus, removeNumbers) # remove numbers
usCorpus <- tm_map(usCorpus, stripWhitespace) # strip extra whitespace
usCorpus <- tm_map(usCorpus, removeWords, swearWords) # remove profanity
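
To sanity-check the transformations, a couple of the cleaned documents can be printed (a quick check of my own; the indices are arbitrary):

# Peek at two cleaned documents to confirm the transformations behaved as expected
inspect(usCorpus[1:2])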

Create nGrams

#Tokenizer functions
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
quadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
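
The frequency tables plotted below (unigram_df and bigram_df) are not built by the code shown in this report. Here is a minimal sketch of how they could be constructed from these tokenizers, assuming a helper ngram_freq_df (a name of my own, not from the original analysis) that counts terms with TermDocumentMatrix and keeps the 20 most frequent:

# Sketch only: build a top-20 frequency table for a given tokenizer
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))

ngram_freq_df <- function(corpus, tokenizer, top = 20) {
  tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer))
  freq <- sort(slam::row_sums(tdm), decreasing = TRUE) # total count of each n-gram (slam is installed as a dependency of tm)
  head(data.frame(word = names(freq), freq = freq), top)
}

unigram_df <- ngram_freq_df(usCorpus, unigramTokenizer)
bigram_df  <- ngram_freq_df(usCorpus, bigramTokenizer)
trigram_df <- ngram_freq_df(usCorpus, trigramTokenizer)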

Visualize nGrams

# Plotting function: bar chart of the 20 most frequent n-grams
bar_plot <- function(df, title) {
  df$word <- factor(df$word, levels = df$word) # keep the words in frequency order
  ggplot(df, aes(x = word, y = freq)) +
    geom_bar(stat = "identity", fill = "red", colour = "black", width = 0.80) +
    labs(title = title) +
    xlab("Words") +
    ylab("Count") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
# Unigram Chart
bar_plot(unigram_df,"Unigrams")

unigram_df
##          word freq
## the       the 4938
## said     said 3074
## will     will 2837
## one       one 2584
## just     just 2251
## like     like 2063
## can       can 2049
## time     time 1816
## get       get 1722
## new       new 1555
## people people 1379
## now       now 1351
## also     also 1303
## first   first 1260
## good     good 1232
## know     know 1211
## day       day 1189
## but       but 1172
## and       and 1119
## back     back 1099
bar_plot(bigram_df,"Bigrams")

bigram_df
##                word freq
## i think     i think  481
## i know       i know  343
## i love       i love  303
## i just       i just  277
## i can         i can  276
## i will       i will  264
## i want       i want  215
## i like       i like  186
## last year last year  186
## right now right now  169
## new york   new york  163
## i really   i really  150
## i get         i get  147
## i hope       i hope  143
## i thought i thought  139
## years ago years ago  139
## i feel       i feel  136
## i donet     i donet  134
## i need       i need  131
## i got         i got  130

Project Plan

Now that the data has been cleaned, it can be explored; the most prevalent nGrams in the sample are shown above. The next step is to build a predictive model that uses these nGrams. The Shiny app will present a text input box and, based on what the user types, predict the next word or phrase.
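
As an illustration of the kind of lookup the model could perform, here is a minimal sketch that suggests the next word from the bigram counts in bigram_df above. The function name predict_next_word and the "most frequent matching bigram" rule are placeholders of my own; the final model will also need back-off to lower-order nGrams and handling of unseen words.

# Sketch only: suggest the next word from bigram counts
# bigram_df is assumed to have a `word` column ("w1 w2") and a `freq` column, as shown above
predict_next_word <- function(input, bigram_df) {
  last_word  <- tail(strsplit(tolower(trimws(input)), "\\s+")[[1]], 1)
  candidates <- bigram_df[startsWith(as.character(bigram_df$word), paste0(last_word, " ")), ]
  if (nrow(candidates) == 0) return(NA_character_)
  strsplit(as.character(candidates$word[which.max(candidates$freq)]), " ")[[1]][2]
}

predict_next_word("new", bigram_df) # "york", from the "new york" bigram above

In the Shiny app, a call like this would sit behind the text input and update the suggested word as the user types.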