This is the initial Milestone Report for the Coursera Data Science Capstone Project. The task is to build a predictive text model using natural language processing techniques. This report describes the key features of the training data through exploratory data analysis and outlines the plan for the predictive model.
**Download the zip file containing the text files from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.**
# Download and unzip into the working directory so the file.exists() check matches the downloaded file
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}
The data sets consist of text from three sources: blogs, news, and Twitter feeds. The text is provided in four languages: German, English (United States), Finnish, and Russian. I will focus on the English (United States) data sets only.
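As a quick sanity check (a minimal sketch, assuming the zip was extracted into a final/ folder in the working directory, as in the download step above), the extracted files can be listed:
# List the extracted files; each language has its own folder (e.g. en_US)
list.files("final", recursive = TRUE)
# The three English (United States) files used below
list.files("final/en_US")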
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines("final/en_US/en_US.news.txt", encoding =
## "UTF-8", skipNul = TRUE): incomplete final line found on 'final/en_US/
## en_US.news.txt'
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
Next, create a summary of the three data sets: file size, number of lines, number of words, and mean number of words per line.
library(stringi)
blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
##    source file.size.MB num.lines num.words mean.num.words
## 1   blogs     200.4242    899288  37546246       41.75108
## 2    news     196.2775     77259   2674536       34.61779
## 3 twitter     159.3641   2360148  30093410       12.75065
Before the analysis, the data is cleaned so that processing is more efficient. In this step I remove URLs, Twitter handles, special characters, punctuation, numbers, stop words, and extra whitespace. Because of the file sizes, a 1% sample is used to keep the runtime manageable while still demonstrating the approach.
library(tm)
## Loading required package: NLP
set.seed(679)
data.sample <- c(sample(blogs, length(blogs) * 0.01),
                 sample(news, length(news) * 0.01),
                 sample(twitter, length(twitter) * 0.01))
library(stringr)
usableText <- str_replace_all(data.sample, "[^[:alnum:]]", " ")  # replace non-alphanumeric characters with spaces
usableText <- iconv(usableText, from = "UTF-8", to = "ASCII", sub = "")  # drop remaining non-ASCII (accented) characters
corpus <- VCorpus(VectorSource(usableText))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # remove URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                      # remove Twitter handles
corpus <- tm_map(corpus, content_transformer(tolower))             # convert to lower case
corpus <- tm_map(corpus, removeWords, stopwords("en"))             # remove English stop words
corpus <- tm_map(corpus, removePunctuation)                        # remove punctuation
corpus <- tm_map(corpus, removeNumbers)                            # remove numbers
corpus <- tm_map(corpus, stripWhitespace)                          # collapse extra whitespace
corpus <- tm_map(corpus, PlainTextDocument)                        # ensure plain text documents
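To check that the cleaning steps behaved as intended, a few documents from the corpus can be inspected (a sketch only; the indices are arbitrary examples):
# Look at the first few cleaned documents in the sampled corpus
lapply(corpus[1:3], as.character)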
Next, perform exploratory analysis on the cleaned sample and list the most common unigrams as a first indication of common themes.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
options(mc.cores=1)
getFreq <- function(tdm) {
  # Sum term frequencies across documents and return them sorted in decreasing order
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
# Tokenizer functions for bigrams and trigrams (NGramTokenizer and Weka_control come from the RWeka package)
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
makePlot <- function(data, label) {
  # Bar chart of the 30 most frequent terms, ordered by decreasing frequency
  ggplot(data[1:30, ], aes(reorder(word, -freq), freq)) +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
    geom_bar(stat = "identity", fill = I("grey50"))
}
library(ngram)
library(tokenizers)
library(rJava)
library(RWeka)       # provides NGramTokenizer and Weka_control used in the tokenizers above
library(RTextTools)
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
Histogram of the 30 most common unigrams in the data sample.
makePlot(freq1, "30 Most Common Unigrams")
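The same frequency analysis can be extended to bigrams and trigrams with the tokenizer functions defined above (a sketch; the sparsity threshold simply mirrors the unigram call and the plot labels are illustrative):
# Bigram and trigram frequencies using the RWeka tokenizers
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))
makePlot(freq2, "30 Most Common Bigrams")
makePlot(freq3, "30 Most Common Trigrams")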
# Next Steps For Prediction Algorithm And Shiny App

The next steps of this capstone project are to finalize a predictive algorithm and deploy it as a Shiny app. The predictive algorithm will use an n-gram model with frequency lookup, building on the exploratory analysis above. A potential strategy is to use the trigram model to predict the next word; if no matching trigram is found, the algorithm backs off to the bigram model and then to the unigram model as required. In the app, the user will enter a phrase into an input box, and the app will suggest the most likely 'next word' based on that input.
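To illustrate the planned backoff strategy (a minimal sketch only: the frequency tables freq1, freq2, and freq3 are assumed to be the n-gram frequency data frames computed above, and predictNextWord is a hypothetical helper, not the final implementation):
# Hypothetical backoff lookup: try trigrams, then bigrams, then the top unigram
predictNextWord <- function(phrase, freq1, freq2, freq3) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  if (length(words) == 2) {
    # Trigrams whose first two words match the last two words of the phrase
    hits <- freq3[grepl(paste0("^", words[1], " ", words[2], " "), freq3$word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$word[1]))
  }
  # Back off to bigrams whose first word matches the last word of the phrase
  hits <- freq2[grepl(paste0("^", tail(words, 1), " "), freq2$word), ]
  if (nrow(hits) > 0) return(sub(".* ", "", hits$word[1]))
  # Fall back to the single most frequent unigram
  as.character(freq1$word[1])
}
A production version would also need smoothing and tie-breaking, but this captures the trigram-to-bigram-to-unigram fallback described above.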