This report is a brief introduction to the Shiny text predictor. In this report, I will load the data, summarize it, clean it, and explore the most common n-grams.
The dataset was downloaded directly from the course website, as directed by the tutor. The zip file was downloaded and extracted; the resulting directory contains files in German, English, Finnish and Russian.
library(stringi)  # string statistics (line lengths, word counts)
library(knitr)    # kable tables
library(tm)       # text mining / corpus cleaning
library(RWeka)    # n-gram tokenizers
library(ggplot2)  # plotting
# get the data (download and unzip only on the first run)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip <- "Coursera-SwiftKey.zip"
if (!file.exists(zip)) {
  download.file(url, zip)
  unzip(zip)
}
I will be using the English (en_US) dataset for this project.
# the en_US files are assumed to have been copied from final/en_US/ into the working directory
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# profanity list (assumed one word per line), used later to filter the corpus
swearWords <- readLines("swearWords.csv", encoding = "UTF-8", skipNul = TRUE)
# summary table: line counts, longest line (in characters) and total words per source
kable(data.frame(row.names = c("blogs", "news", "twitter"),
                 LineCount   = sapply(list(blogs, news, twitter), length),
                 LongestLine = sapply(list(stri_length(blogs), stri_length(news), stri_length(twitter)), max),
                 TotalWords  = sapply(list(blogs, news, twitter), stri_stats_latex)[4, ]))
|         | LineCount | LongestLine | TotalWords |
|---------|-----------|-------------|------------|
| blogs   | 899288    | 40833       | 37570839   |
| news    | 77259     | 5760        | 2651432    |
| twitter | 2360148   | 140         | 30451170   |
set.seed(2016)
# sample 10,000 lines from each source to keep processing manageable
sampleTwitter <- twitter[sample(1:length(twitter), 10000)]
sampleNews    <- news[sample(1:length(news), 10000)]
sampleBlogs   <- blogs[sample(1:length(blogs), 10000)]
usSample <- c(sampleTwitter, sampleNews, sampleBlogs)
# free the memory held by the full datasets
rm(blogs, news, twitter, sampleBlogs, sampleNews, sampleTwitter)
# convert non-ASCII characters to byte codes; this leaves artifacts, e.g. the
# curly apostrophe in "don't" ends up as "donet" in the n-gram tables below
usSample <- iconv(usSample, "UTF-8", "ASCII", "byte")
usCorpus <- VCorpus(VectorSource(usSample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
usCorpus <- tm_map(usCorpus, toSpace, "/|@|\\|")                 # replace /, @ and | with spaces
usCorpus <- tm_map(usCorpus, removeWords, stopwords("english"))  # remove English stop words
# note: stop-word removal runs before lowercasing, so capitalized stop words
# ("The", "I", ...) survive and appear in the n-gram counts below
usCorpus <- tm_map(usCorpus, content_transformer(tolower))       # convert to lowercase
usCorpus <- tm_map(usCorpus, removePunctuation)                  # remove punctuation
usCorpus <- tm_map(usCorpus, removeNumbers)                      # remove numbers
usCorpus <- tm_map(usCorpus, stripWhitespace)                    # collapse extra whitespace
usCorpus <- tm_map(usCorpus, removeWords, swearWords)            # remove profanity
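Before tokenizing, a quick sanity check on a single cleaned document confirms the transformations behaved as expected (a minimal check using tm's content accessor):

# spot-check the cleaning pipeline on one sampled document
writeLines(content(usCorpus[[1]]))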
#Tokenizer functions
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
quadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
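The frequency tables unigram_df and bigram_df used below are built from the corpus; the code that builds them is not shown above, so here is a minimal sketch (top_ngrams is a hypothetical helper, and the word/freq column names are assumed to match the output below):

# top_ngrams (hypothetical helper): build a term-document matrix with the
# given tokenizer and return the top-n terms as a data frame (word, freq)
top_ngrams <- function(corpus, tokenizer = NULL, n = 20) {
  ctrl <- if (is.null(tokenizer)) list() else list(tokenize = tokenizer)
  tdm  <- TermDocumentMatrix(corpus, control = ctrl)
  freq <- sort(slam::row_sums(tdm), decreasing = TRUE)[1:n]
  data.frame(word = names(freq), freq = freq)
}
unigram_df <- top_ngrams(usCorpus)                   # default word tokenizer
bigram_df  <- top_ngrams(usCorpus, bigramTokenizer)
trigram_df <- top_ngrams(usCorpus, trigramTokenizer)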
# Plotting function: bar chart of the top-20 terms, ordered by frequency
bar_plot <- function(df, title) {
  ggplot(df, aes(x = reorder(word, -freq), y = freq)) +
    geom_bar(stat = "identity", fill = "red", colour = "black", width = 0.8) +
    labs(title = title, x = "Words", y = "Count") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
# Unigram Chart
bar_plot(unigram_df,"Unigrams")
unigram_df
## word freq
## the the 4938
## said said 3074
## will will 2837
## one one 2584
## just just 2251
## like like 2063
## can can 2049
## time time 1816
## get get 1722
## new new 1555
## people people 1379
## now now 1351
## also also 1303
## first first 1260
## good good 1232
## know know 1211
## day day 1189
## but but 1172
## and and 1119
## back back 1099
# Bigram Chart
bar_plot(bigram_df,"Bigrams")
bigram_df
## word freq
## i think i think 481
## i know i know 343
## i love i love 303
## i just i just 277
## i can i can 276
## i will i will 264
## i want i want 215
## i like i like 186
## last year last year 186
## right now right now 169
## new york new york 163
## i really i really 150
## i get i get 147
## i hope i hope 143
## i thought i thought 139
## years ago years ago 139
## i feel i feel 136
## i donet i donet 134
## i need i need 131
## i got i got 130
Now that the data has been cleaned, we can explore it; above I have shown the n-grams that are most prevalent in the sample. The next step is to build a predictive model on top of these n-gram frequencies. The Shiny app will present a text input box and, based on what the user types, predict the next word or phrase.
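As a preview of the kind of lookup the model might perform, here is a minimal back-off sketch over frequency tables of the form shown above (predict_next is a hypothetical helper, not the final model; it assumes the tables are sorted by freq in decreasing order and does not regex-escape its input):

# predict_next (hypothetical sketch): try trigrams whose first two words
# match the end of the phrase, then back off to bigrams matching the last word
predict_next <- function(phrase, bigram_df, trigram_df) {
  tokens <- tolower(strsplit(phrase, "\\s+")[[1]])
  last2  <- paste(tail(tokens, 2), collapse = " ")
  last1  <- tail(tokens, 1)
  hits <- grep(paste0("^", last2, " "), trigram_df$word, value = TRUE)
  if (length(hits) == 0)
    hits <- grep(paste0("^", last1, " "), bigram_df$word, value = TRUE)
  if (length(hits) == 0) return(NA_character_)  # unseen context
  tail(strsplit(hits[1], " ")[[1]], 1)          # last word of the most frequent match
}
predict_next("I really want to", bigram_df, trigram_df)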