The motivation for this project is to demonstrate that the SwiftKey data has been downloaded and loaded, to report basic summary statistics about the data sets, and to outline the plan for building a prediction algorithm and Shiny app.
This Milestone Report walks through those steps to document my goals and objectives.
We will set up the session, load the packages needed, and configure the overall environment.
# Clear the workspace and load the required packages
rm(list = ls(all.names = TRUE))
library(ggplot2)
library(downloader)
library(plyr)
library(dplyr)
library(knitr)
library(tm)
library(wordcloud)
library(slam)
library(ngram)
library(kableExtra)
library(RColorBrewer)
library(gridExtra)
library(RWeka)
set.seed(123456)
SwiftKey provides three data sets: blogs, news, and Twitter. These data sets can normally be retrieved from public sources. Because they are extremely large, we will sample the data to keep the analysis manageable.
This project will focus only on the English corpora.
## Check: does the data directory already exist?
if(!file.exists("./data")){
  dir.create("./data")
}
Url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
## Check: is the zip file already in your working directory?
if(!file.exists("./data/Coursera-SwiftKey.zip")){
  download.file(Url, destfile = "./data/Coursera-SwiftKey.zip", mode = "wb")
}
## Check: unzip the downloaded zip file if it has not been extracted yet
if(!file.exists("./data/final")){
  unzip(zipfile = "./data/Coursera-SwiftKey.zip", exdir = "./data")
}
First, we download the data set and extract it into the working directory.
The ./data/final/en_US/ folder contains text from three different sources: blogs, news, and Twitter.
In this project, we will focus only on the English (US) data sets.
path <- file.path("./data/final" , "en_US")
files <- list.files(path, recursive = TRUE)
# File connection: Twitter data set
con <- file("./data/final/en_US/en_US.twitter.txt", "r")
Twitter <- readLines(con, skipNul = TRUE, warn = FALSE, encoding = "UTF-8")
# Close the connection
close(con)
# File connection: blogs data set
con <- file("./data/final/en_US/en_US.blogs.txt", "r")
Blogs <- readLines(con, skipNul = TRUE, warn = FALSE, encoding = "UTF-8")
# Close the connection
close(con)
# File connection: news data set (opened in binary mode so an embedded
# EOF/SUB character does not truncate the read on some platforms)
con <- file("./data/final/en_US/en_US.news.txt", "rb")
News <- readLines(con, skipNul = TRUE, warn = FALSE, encoding = "UTF-8")
# Close the connection
close(con)
Before we start building the corpus, we need to clean the data and create a basic summary of the three data sets provided.
For each source we will review the file size, the number of lines, the total word count, and the mean number of words per line.
library(stringi)
# Get file sizes in MB
Blogs.size <- file.info("./data/final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
News.size <- file.info("./data/final/en_US/en_US.news.txt")$size / 1024 ^ 2
Twitter.size <- file.info("./data/final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
# Count the words in each line of each file
Blogs.words <- stri_count_words(Blogs)
News.words <- stri_count_words(News)
Twitter.words <- stri_count_words(Twitter)
# Summary of the data sets
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = round(c(Blogs.size, News.size, Twitter.size), digits = 2),
           line.count = c(length(Blogs), length(News), length(Twitter)),
           word.count = c(sum(Blogs.words), sum(News.words), sum(Twitter.words)),
           mean.words.per.line = round(c(mean(Blogs.words), mean(News.words), mean(Twitter.words)), digits = 2))
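Since knitr and kableExtra are already loaded, the same summary could be rendered as a formatted table. A minimal sketch is shown below; the name summary_df is illustrative and simply holds the data frame built above.
# Illustrative: store the summary in `summary_df`, then format it with kable
summary_df <- data.frame(source = c("blogs", "news", "twitter"),
                         file.size.MB = round(c(Blogs.size, News.size, Twitter.size), 2),
                         line.count = c(length(Blogs), length(News), length(Twitter)),
                         word.count = c(sum(Blogs.words), sum(News.words), sum(Twitter.words)),
                         mean.words.per.line = round(c(mean(Blogs.words), mean(News.words), mean(Twitter.words)), 2))
kable(summary_df, caption = "Summary of the en_US data sets") %>%
  kable_styling(full_width = FALSE)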
The table above shows that, on average, blog entries contain the most words per line while tweets contain the fewest, which is expected given Twitter's character limit.
Because the text data sets are quite large, a 5% random sample will be taken from each of the three sources to improve processing time. The samples are then combined into a unified document corpus for the subsequent analyses, with the text converted to UTF-8.
Before performing exploratory analysis, we must clean the data: removing URLs, Twitter handles, email addresses, profanity, stop words, punctuation, numbers, and extra whitespace, and converting everything to lower case.
# Remove profanity from the sample: using the profanity list originally published by Google
profanityFile <- "full-list-of-bad-words-banned-by-google.csv"
pathToprofanityList <- file.path("./data", profanityFile)
profanity <- read.csv(pathToprofanityList, sep = "\t", strip.white = TRUE, encoding = "UTF-8")
# Keep the profanity list as a plain character vector so it can be passed to removeWords
profanity <- as.character(profanity[, 1])
# Sample 5% of each data set
set.seed(123456)
sampleData <- c(sample(Blogs, round(length(Blogs) * 0.05)),
                sample(News, round(length(News) * 0.05)),
                sample(Twitter, round(length(Twitter) * 0.05)))
# Create the corpus and clean the data
corpus <- VCorpus(VectorSource(sampleData))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Remove URL, Twitter handle, and email patterns
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@\\S+")
corpus <- tm_map(corpus, toSpace, "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b")
# Remove profanity from the sample data set
profanity <- iconv(profanity, "latin1", "ASCII", sub = "")
corpus <- tm_map(corpus, removeWords, profanity)
# Remove the rest of the unwanted characters
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
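To confirm the transformations behaved as expected, a few cleaned documents can be inspected before moving on; a quick illustrative check:
# Peek at the first three cleaned documents to verify the transformations
lapply(corpus[1:3], as.character)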
We will use several techniques to develop an understanding of the data.
This section looks at the frequencies of the most common unigrams, bigrams, and trigrams in the sample corpus.
options(mc.cores = 1)
# Gather the frequencies of words from a term-document matrix
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
# Tokenizers for 2-grams and 3-grams
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Bar chart of the 30 most frequent terms in a frequency table
makePlot <- function(data, label) {
  ggplot(data[1:30, ], aes(reorder(word, -freq), freq)) +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
    geom_bar(stat = "identity", fill = I("blue"))
}
# Get frequencies of the most common n-grams in the data sample
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
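With the frequency tables in hand, the makePlot helper can produce histograms of the 30 most frequent n-grams, and the wordcloud package loaded earlier gives an alternative view of the unigram frequencies. A sketch of those calls (the plot titles are illustrative):
# Histograms of the 30 most frequent unigrams, bigrams, and trigrams
makePlot(freq1, "30 Most Common Unigrams")
makePlot(freq2, "30 Most Common Bigrams")
makePlot(freq3, "30 Most Common Trigrams")
# Word cloud of the most frequent unigrams
wordcloud(as.character(freq1$word), freq1$freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))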
In conclusion, the final deliverable of the capstone project is a predictive algorithm deployed as a Shiny app that serves as the user interface.
Possible models include an n-gram language model built from the unigram, bigram, and trigram frequencies above, combined with a back-off strategy for handling unseen word combinations.
The user interface of the Shiny app will consist of a text input box that allows a user to enter a phrase; the app will then use the algorithm to suggest the most likely next word after a short delay.
The final strategy will be the one that offers the best balance of efficiency and accuracy.
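To make the idea concrete, below is a minimal sketch of a back-off style lookup that uses the freq1, freq2, and freq3 tables computed earlier: it tries to match the last two words of the input against the trigram table, backs off to the bigram table, and finally falls back to the most frequent unigram. The function name predictNextWord and the exact matching logic are illustrative only, not the final algorithm.
# Illustrative next-word lookup using the freq1/freq2/freq3 tables from above.
# The n-gram strings in those tables are space-separated, so we match on prefixes.
predictNextWord <- function(phrase) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  # Try the trigram table: match the last two words as a prefix
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- freq3[grepl(paste0("^", prefix, " "), freq3$word), ]
    if (nrow(hits) > 0) {
      return(sub(".* ", "", hits$word[1]))   # last token of the best trigram
    }
  }
  # Back off to the bigram table: match the last word as a prefix
  if (n >= 1) {
    hits <- freq2[grepl(paste0("^", words[n], " "), freq2$word), ]
    if (nrow(hits) > 0) {
      return(sub(".* ", "", hits$word[1]))
    }
  }
  # Final fallback: the single most frequent unigram
  as.character(freq1$word[1])
}
# Example usage
predictNextWord("thanks for the")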