This document is the Milestone Report for Coursera’s Data Science Specialization Capstone. It explains the major features of the provided data for the Capstone Project and briefly summarizes my plans for creating the prediction algorithm and Shiny app.
First we download the data from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip (if it is not already present) and unzip it:
DownloadData <- function()
{
  cw_dir <- getwd()
  setwd("../data")
  ## Download (if necessary) and unzip the data for the assignment
  if(!file.exists("final"))
  {
    print("Data will be downloaded and unzipped...")
    url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
    download.file(url, "Coursera-SwiftKey.zip")
    unzip("Coursera-SwiftKey.zip")
  }
  else
  {
    print("Data was already downloaded ...")
  }
  setwd(cw_dir)
  invisible(NULL)
}
DownloadData()
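Before loading anything into memory, we can get a feel for how big the raw files are. This is a small sketch that assumes the same ../data/final/en_US layout used in the function above:
files <- paste0("../data/final/en_US/en_US.", c("blogs", "news", "twitter"), ".txt")
round(file.info(files)$size / 1024^2, 1)  # raw file sizes in megabytes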
These are very large text files in several languages. Because we are only interested in English text, we load only the English files into R, replacing every non-alphanumeric character with a space on the fly:
read_file <- function(name){
  path <- paste("../data/final/en_US/en_US.", name, ".txt", sep = "")
  con <- file(path, encoding = "UTF-8")
  content <- gsub("[^[:alnum:]]", " ", scan(con, what = character(), sep = "\n", skipNul = TRUE))
  close(con)
  return(content)
}
blogs <- read_file("blogs")
news <- read_file("news")
twitter <- read_file("twitter")
Then we split the lines into words, saving each result to disk and removing it from memory to keep the memory footprint manageable.
blogs_words <- unlist(strsplit(blogs, " +"))
save(blogs_words, file="blogs_words.saved")
rm(blogs_words)
news_words <- unlist(strsplit(news, " +"))
save(news_words, file="news_words.saved")
rm(news_words)
twitter_words <- unlist(strsplit(twitter, " +"))
save(twitter_words, file="twitter_words.saved")
rm(twitter_words)
After that we are ready for some basic statistics. First we count the lines and words in each file.
load("blogs_words.saved")
num_blines <- length(blogs)
num_blines
## [1] 899288
num_bwords <- length(blogs_words)
num_bwords
## [1] 38372148
load("news_words.saved")
num_nlines <- length(news)
num_nlines
## [1] 77259
num_nwords <-length(news_words)
num_nwords
## [1] 2753580
load("twitter_words.saved")
num_tlines <- length(twitter)
num_tlines
## [1] 2360148
num_twords <- length(twitter_words)
num_twords
## [1] 31143477
There are 899288 lines and 38372148 words in the blogs dataset, 77259 lines and 2753580 words in the news dataset, and 2360148 lines and 31143477 words in the twitter dataset.
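To keep these numbers in one place, we can gather them into a small summary table (a sketch that only reuses the counts computed above; no new data is introduced):
summary_df <- data.frame(dataset = c("blogs", "news", "twitter"),
                         lines   = c(num_blines, num_nlines, num_tlines),
                         words   = c(num_bwords, num_nwords, num_twords))
summary_df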
Next we count how many distinct words appear across all three datasets.
words <- c(blogs_words,news_words,twitter_words)
rm(blogs_words)
rm(news_words)
rm(twitter_words)
words <- gsub("[^[:alpha:]]","",words)
w_freq <- table(words)
num_words <- length(w_freq)
num_words
## [1] 612443
save(words, file="words.saved")
There are 612443 different words in these datasets, and the twenty most common words are:
head(sort(w_freq,TRUE),20)
## words
## the to I a and of in you
## 2644537 1895413 1709175 1508553 1507890 1279497 1188756 965140 817800
## is for that it s on my with t
## 787545 742552 719653 707336 656037 550472 495734 463884 413358
## was be
## 407181 395965
We can see that these are all very common words of fewer than five characters, so we eliminate words that short to see how many “real” words we have.
load("words.saved")
r_words <- words[nchar(words)>4]
num_r_words <- length(table(r_words))
So, there are 544687 different words longer than four characters. The most common of these words are:
r_w_freq <- head(sort(table(r_words),TRUE),20)
r_w_freq
## r_words
## about there would their people think going really great
## 207831 141631 134280 120984 107718 99346 91255 88871 86317
## today which first other right because could still should
## 85815 83068 76657 76656 76547 75995 72486 69782 63105
## little being
## 61960 60199
barplot(r_w_freq, main="Frequencies of most common words longer than four characters",las=2)
The number of distinct words in these datasets is roughly twice the number of distinct English words, which is around a quarter of a million. This tells us that there must be a lot of misspellings and slang words in our datasets. So, if we want a good prediction model, we should correct the misspellings, map synonyms together and clean the datasets. A convenient way to do that is the R text mining package “tm”. For the purposes of this milestone report we will use only 10% of the text data.
library("tm")
## Loading required package: NLP
library("SnowballC")
blogs <- blogs[sample(length(blogs), size=0.1*length(blogs))]
news <- news[sample(length(news), size=0.1*length(news))]
twitter <- twitter[sample(length(twitter), size=0.1*length(twitter))]
text <- c(blogs,news,twitter)
rm(blogs)
rm(news)
rm(twitter)
corpus <- Corpus(VectorSource(text))
rm(text)
stopwords <- c(stopwords('english'), "t", "a","s","the","don","m")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)
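As a quick sanity check that the transformations worked as intended, we can look at a few documents of the cleaned corpus (a minimal sketch using tm’s inspect()):
inspect(corpus[1:3])  # show the first three cleaned documents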
Once we have a clean corpus, we can build bigrams to show the most common word pairs and their frequencies:
library("RWeka")
library(slam)
n_gram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = n_gram_tokenizer))
tdm <- rollup(tdm, 2, na.rm=TRUE, FUN = sum)
pairs_freq <- head(sort(rowSums(as.matrix(tdm)),TRUE), 20)
pairs_freq
## right now look like can wait last night feel like
## 2170 1764 1644 1559 1557
## look forward thank follow can get year old last year
## 1494 1250 1107 1057 926
## make sure new york first time year ago happi birthday
## 922 902 889 885 874
## let know one day good morn let go just got
## 837 816 803 803 797
barplot(pairs_freq, main="Frequencies of most common word pairs",las=2)
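The same tokenizer pattern extends directly to trigrams, which the prediction model described below will need. This is a sketch that mirrors the bigram code above; the object names trigram_tokenizer and tdm3 are mine:
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm3 <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))
tdm3 <- rollup(tdm3, 2, na.rm = TRUE, FUN = sum)  # collapse counts across documents
head(sort(rowSums(as.matrix(tdm3)), TRUE), 10)    # most frequent word triples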
My plan for the final project is to build a Shiny app that predicts the next word in a phrase based on the previous one, two or three words. For cases where a combination of these words does not appear in the base texts, I’ll predict the next word with a back-off model, which estimates the conditional probability of a word given its history by falling back to shorter n-grams. If time allows, I’ll also try to improve predictive accuracy and reduce the runtime and memory requirements of the model.
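To make the back-off idea concrete, here is a minimal sketch of the lookup I have in mind. It assumes the n-gram counts are stored as named frequency vectors like pairs_freq above (but unabridged); the function predict_next and its arguments are illustrative, not the final implementation:
# Illustrative back-off lookup: try the longest available history first,
# then fall back to shorter n-grams (sketch, not the final model)
predict_next <- function(history, ngram_tables) {
  # ngram_tables[[n]] is assumed to hold (n+1)-gram counts, named "w1 ... wn wnext"
  tokens <- unlist(strsplit(tolower(history), " +"))
  for (n in rev(seq_along(ngram_tables))) {
    if (length(tokens) < n) next
    prefix <- paste(tail(tokens, n), collapse = " ")
    counts <- ngram_tables[[n]]
    hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
    if (length(hits) > 0) {
      best <- names(which.max(hits))                 # most frequent continuation
      return(tail(unlist(strsplit(best, " ")), 1))   # last word of that n-gram
    }
  }
  NA_character_                                      # no history matched
}
With full bigram and trigram count vectors this could be called as, for example, predict_next("look", list(bigram_counts, trigram_counts)). A proper back-off model (e.g. Katz or “stupid” back-off) would additionally weight the shorter histories instead of using raw counts.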