The code here gets, cleans, explores and analyzes data for the Capstone Project of the Coursera Data Science Specialization. In this Capstone Project we apply data science to natural language processing, in collaboration with SwiftKey.
The original data are available online (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). The dataset was collected from publicly available sources by a web crawler. We will use it to build a Shiny app that predicts the next word when a user enters a few words.
This file presents the first steps: getting, cleaning and exploring the data.
if (!require("knitr")) {
install.packages("knitr")}
if (!require("R.utils")) {
install.packages("R.utils")}
if (!require("stringr")) {
install.packages("stringr")}
if (!require("data.table")) {
install.packages("data.table")}
if (!require("doParallel")) {
install.packages("doParallel")}
if (!require("stringr")) {
install.packages("stringr")}
if (!require("tm")) {
install.packages("tm")}
if (!require("wordcloud")) {
install.packages("wordcloud")}
if (!require("SnowballC")) {
install.packages("SnowballC")}
if (!require("RWeka")) {
install.packages("RWeka")}
if (!require("ggplot2")) {
install.packages("ggplot2")}
if (!require("quanteda")) {
install.packages("quanteda")}
library(knitr)
library(R.utils)
library(stringr)
library(data.table)
library(doParallel) # to use more than one processor core
library(tm) # For Text Mining
library(wordcloud) # To prepare a word cloud
library(SnowballC)
library(RWeka)
library(ggplot2)
library(quanteda)
set.seed(12345)
if(!file.exists("./OriginalData")) {
dir.create("./OriginalData")
message("Folder OriginalData is missing, creating it in ", getwd())
}else {
message("Folder already exist in: ", getwd())}
## Folder already exist in: C:/Users/jbass/Documents/CapstoneProject
if(!file.exists("./OriginalData/Coursera-SwiftKey.zip")) {
message("File is missing, downloading it in ", getwd(),"/OriginalData/")
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl, destfile="./OriginalData/Coursera-SwiftKey.zip")
}else {
message("File already downloaded in: ", getwd(), "/OriginalData/")}
## File already downloaded in: C:/Users/jbass/Documents/CapstoneProject/OriginalData/
if(!file.exists("./OriginalData/final")){
message("Files are missing, unzipping original zip in ", getwd(),"/OriginalData/")
unzip(zipfile="./OriginalData/Coursera-SwiftKey.zip",exdir="./OriginalData")
}else {
message("Files already unzipped in: ", getwd(), "/OriginalData/")}
## Files already unzipped in: C:/Users/jbass/Documents/CapstoneProject/OriginalData/
list.files("./OriginalData/final", recursive = TRUE)
## [1] "de_DE/de_DE.blogs.txt" "de_DE/de_DE.news.txt"
## [3] "de_DE/de_DE.twitter.txt" "en_US/en_US.blogs.txt"
## [5] "en_US/en_US.news.txt" "en_US/en_US.twitter.txt"
## [7] "fi_FI/fi_FI.blogs.txt" "fi_FI/fi_FI.news.txt"
## [9] "fi_FI/fi_FI.twitter.txt" "ru_RU/ru_RU.blogs.txt"
## [11] "ru_RU/ru_RU.news.txt" "ru_RU/ru_RU.twitter.txt"
For the next steps, I will focus only on the documents in English.
Each English document is loaded into R to collect some basic information about the files; for example, the number of lines and the number of words are determined here.
message ("Loading English Documents")
blogEN<- file("./OriginalData/final/en_US/en_US.blogs.txt", open="rb")
blogData<- readLines(blogEN, encoding="latin1")
close(blogEN)
newsEN<- file("./OriginalData/final/en_US/en_US.news.txt", open="rb")
newsData<- readLines(newsEN, encoding="latin1")
close(newsEN)
twittsEN<- file("./OriginalData/final/en_US/en_US.twitter.txt", open="rb")
twittsData<- readLines(twittsEN, encoding="latin1")
close(twittsEN)
message (" Retrieving file sizes")
blogSize <- file.info("./OriginalData/final/en_US/en_US.blogs.txt")$size / 1024^2
newsSize <- file.info("./OriginalData/final/en_US/en_US.news.txt")$size / 1024^2
twitterSize <- file.info("./OriginalData/final/en_US/en_US.twitter.txt")$size / 1024^2
message ("Counting number of words and lines in each file")
blogWordCnt<- sum((nchar(blogData) - nchar(gsub(' ','',blogData))) + 1)
blogLinesCnt<-NROW(blogData)
newsWordCnt<- sum((nchar(newsData) - nchar(gsub(' ','',newsData))) + 1)
newsLinesCnt<-NROW(newsData)
twittWordCnt<- sum((nchar(twittsData) - nchar(gsub(' ','',twittsData))) + 1)
twittLinesCnt<-NROW(twittsData)
message ("Measuring the length of the longest line in each file")
blogMaxLine <- max(nchar(blogData))
newsMaxLine <- max(nchar(newsData))
twittMaxLine <- max(nchar(twittsData))
## Making a summary table of basic information
size <- c(blogSize, newsSize, twitterSize)
WordCnt <- c(blogWordCnt, newsWordCnt, twittWordCnt)
LinesCnt <- c(blogLinesCnt, newsLinesCnt, twittLinesCnt)
MaxLine <- c(blogMaxLine, newsMaxLine, twittMaxLine)
dset_names <- c("blogs", "news", "twitter")
stat <- data.frame(dset_names, size, WordCnt, LinesCnt, MaxLine)
colnames(stat) <- c("Dataset names", "File Size (Mb)", "# of words", "# of lines", "Max line length")
kable(stat, format="markdown", caption = "Basic statistics of the datasets")
| Dataset names | File Size (Mb) | # of words | # of lines | Max line length |
|---|---|---|---|---|
| blogs | 200.4242 | 37334131 | 899288 | 40835 |
| news | 196.2775 | 34372530 | 1010242 | 11384 |
| twitter | 159.3641 | 30373545 | 2360148 | 213 |
The files are rather large, so sampling will be applied to reduce RAM usage and computing time.
It is necessary to pre-process the data so that only the most pertinent and accurate data are used to build the predictive model. First, a subset of 0.5% of the original data will be used. Second, regular expressions will be used to remove non-ASCII text, numbers and extra whitespace, and to convert the text to lower case. Third, profanity words will be removed.
I will train the model on a small part of the original dataset to speed up computing.
## Use the line counts to randomly sample 0.5% of each file, and combine the resulting samples.
## (I tried larger subsets, 1% and 10%, but ran into memory-related error messages.)
message ("Sampling 0.5% of each file, and combining the resulting samples")
## Sampling 0.5% of each file, and combining the resulting samples
blogSamp <- sample(blogData,blogLinesCnt*0.005)
newsSamp <- sample(newsData,newsLinesCnt * 0.005)
twittsSamp <- sample(twittsData,twittLinesCnt * 0.005)
combSamp <- c(blogSamp, newsSamp, twittsSamp)
lengthCombSamp <- length(combSamp) # number of lines in the combined sample
writeLines(combSamp, "./combSamp.txt") # to save this new dataset
message ("Saving sampled dataset in ", getwd(), "/combSamp.txt")
## Saving sampled dataset in C:/Users/jbass/Documents/CapstoneProject/combSamp.txt
The sampled dataset has 21347 lines, combining 0.5% of each of the original datasets.
The final text data need to be cleaned before they can be used in the word prediction algorithm. Here, I create a cleaned Corpus. This Corpus is cleaned by removing whitespace, numbers, URLs, punctuation and so on, using the tm package. The profanity word list used for filtering comes from Luis von Ahn’s research group (http://www.cs.cmu.edu/~biglou/resources/) and aggregates 1,300+ words that can be considered profanity.
Load the profanity word list (for later use)
if(!file.exists("./bad-words.txt")) {
message("Downloading profanity words list in ", getwd())
fileUrl2 <- "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
download.file(fileUrl2, destfile="./bad-words.txt")
}else {
message("File already downloaded in: ", getwd())}
## File already downloaded in: C:/Users/jbass/Documents/CapstoneProject
# convert to "ASCII" encoding to get rid of few weird characters
combSamp <- iconv(combSamp, to="ASCII", sub="")
## Use the tm package to convert the text dataset into a Corpus, a structured set of texts used for statistical analysis
textCorpus <- Corpus(VectorSource(combSamp))
message ("Corpus prepared")
## Corpus prepared
## Using the TM Package to clean the text
textCorpus <- tm_map(textCorpus, content_transformer(function(x) iconv(x, to="UTF-8", sub="byte"))) # getting rid of other weird characters
textCorpus <- tm_map(textCorpus, content_transformer(tolower)) # converting to lowercase
# Removing 1500+ acronyms, Text Messaging and Chat Abbreviations (compiled from different internet sources) (code not shown to keep the report concise)
message ("Cleaning the Corpus")
## Cleaning the Corpus
## Removing URLs
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
removeURL2 <- function(x) gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
removeURL3 <- function(x) gsub("www\\.", "", x)
removeURL4 <- function(x) gsub("(\\.com|\\.org|\\.edu|\\.net)", "", x)
textCorpus <- tm_map(textCorpus, content_transformer(removeURL))
textCorpus <- tm_map(textCorpus, content_transformer(removeURL2))
textCorpus <- tm_map(textCorpus, content_transformer(removeURL3))
textCorpus <- tm_map(textCorpus, content_transformer(removeURL4))
# Removing email addresses
email <- function(x) gsub("\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b", "", x)
textCorpus <- tm_map(textCorpus, content_transformer(email))
# Removing Twitter tags (retweet "RT" and "via" markers)
tweettag <- function(x) gsub("\\bRT\\b|\\bvia\\b", "", x, ignore.case = TRUE)
textCorpus <- tm_map(textCorpus, content_transformer(tweettag))
# Removing Twitter usernames
Tweetname <- function(x) gsub("@[A-Za-z0-9_]{1,15}", "", x)
textCorpus <- tm_map(textCorpus, content_transformer(Tweetname))
# Removing hahaha variants
#removeAhah <- function(x) gsub("(ha)+", "", x)
#textCorpus <- tm_map(textCorpus, content_transformer(removeAhah))
# Removing Profanity Words with TM Package
profanityWords <- readLines('./bad-words.txt')
textCorpus <- tm_map(textCorpus,removeWords, profanityWords)
# textCorpus <- tm_map(textCorpus, removeWords, stopwords("english")) # removing stop words in English (a, as, at, so, and, etc.), helps to reduce memory usage, and related problems
textCorpus <- tm_map(textCorpus, content_transformer(removePunctuation), preserve_intra_word_dashes=TRUE) # removing punctuation
textCorpus <- tm_map(textCorpus, content_transformer(removeNumbers)) # removing numbers with TM Package
#textCorpus <- tm_map(textCorpus, stemDocument) # stemming the document (removing suffixes and prefixes) with the TM Package; whether this helps the prediction algorithm remains to be tested
textCorpus <- tm_map(textCorpus, stripWhitespace) # Stripping unnecessary whitespace from document with TM Package
## Showing a few lines of the cleaned Corpus
for (i in 1:5){
print(textCorpus[[i]]$content)
}
## [1] "and now home older wiser a little slimmer and hopefully secure in the knowledge of what good and what i want to do with my time"
## [1] "i turned the today show on to catch up on the morning news and immediately i knew something terrible had happened in new york city"
## [1] "ill take this opportunity to diverge from the usual take three path and instead of focusing on one last role offer up an arkin remix a concisely-potted overview arkin has long been seen as one of the exemplary supporting actors so many of his roles before his resurgence in popularity during the s and s to present day were memorable its hard to single one last role out he added charm and a studious commitment to characterising a range of films from his debut thats me in onwards"
## [1] "my eyes started burning i had to close them or risk crying my brain out onto my open palms i was shaking i could still hear tommy out there arguing with the other staff members he was telling them i was fine that this was all an act my hands closed into painfully tight fists i beat them against the floor"
## [1] "every man is said to have his peculiar ambition abraham lincoln"
##Convert Corpus to plain text document with TM Package
##textCorpus <- tm_map(textCorpus, PlainTextDocument)
## Saving the final corpus
saveRDS(textCorpus, file = "./Corpus.RData")
message ("The Corpus has been cleaned and saved")
## The Corpus has been cleaned and saved
Now that I have a cleaned Corpus, I will quickly explore its content: number of words, most frequent words and file size. NB: the file size will be critical for the next steps and has to be kept small enough to ensure quick computing and low RAM usage.
# calculating number of words and file size
corpusSize <- file.info("./Corpus.RData")$size / 1024^2
CorpusWordCnt<- sum((nchar(textCorpus) - nchar(gsub(' ','',textCorpus))) + 1)
The Corpus dataset is about 1.08 MB and contains roughly 4.93 × 10^5 (about 493,000) words.
## Create a word cloud of the data
TDM <- TermDocumentMatrix(textCorpus)
wcloud <- as.matrix(TDM)
v <- sort(rowSums(wcloud),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
wordcloud(d$word, d$freq,
scale=c(5,.3), min.freq=50,
random.order=FALSE,
colors=brewer.pal(8, "Dark2"))
# Create a bar chart of the 30 most frequent words.
mtextCorpusOrdered <- sort(rowSums(wcloud), decreasing = TRUE)
barplot(mtextCorpusOrdered[1:30],
ylab='frequency',
main='top 30 most frequent words',
col="red", las=2, cex.names=.7)
The next steps of this capstone project are to finalize the predictive algorithm and to deploy it as a Shiny app. The user interface of the Shiny app will consist of a text input box that allows a user to enter a phrase. The app will then use the predictive algorithm to suggest the most likely next word after a short delay.
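As a rough illustration of the planned interface only (not the final app), a minimal Shiny skeleton could look like the sketch below; predictNextWord() is a hypothetical placeholder for the prediction function still to be built, and the shiny package would have to be installed.
library(shiny)
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:", value = ""),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    predictNextWord(input$phrase)  # hypothetical prediction function, to be implemented
  })
}
# shinyApp(ui = ui, server = server)  # would launch the app once predictNextWord() exists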
The predictive algorithm will be based on an n-gram model with frequency lookup, similar to the exploratory analysis above. The next step of this Capstone Project will be to create n-grams and to build figures and tables to understand the variation in word and n-gram frequencies in the data.
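As a rough sketch of how these n-gram frequencies could be computed (one possible approach using the quanteda package loaded above; the exact tokenization settings here are only illustrative):
# Convert the cleaned Corpus back to a plain character vector
sampTxt <- unlist(lapply(textCorpus, as.character))
toks <- tokens(sampTxt, what = "word", remove_punct = TRUE, remove_numbers = TRUE)
bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
trigrams <- tokens_ngrams(toks, n = 3, concatenator = " ")
topfeatures(dfm(bigrams), 10)  # 10 most frequent bigrams
topfeatures(dfm(trigrams), 10) # 10 most frequent trigrams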
According to this quick data exploration, the most frequent words are “stop words” (the, as, and, but, ...); I might have to remove them to improve the prediction model (this has to be tested).
For an efficient predictive model, higher-order n-grams (trigrams or higher) are the first priority when looking up the predicted word, followed by smaller n-grams such as bigrams and unigrams. This means that if no matching trigram can be found, the algorithm backs off to the bigram model, and then to the unigram model if needed. To have a good prediction model and a working Shiny app, I will have to build a model that handles unseen words and n-grams: in some cases people will type a combination of words that does not appear in the Corpus, and the model has to be able to handle that.
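A very rough sketch of what this back-off lookup could look like, assuming hypothetical data.table frequency tables unigramDT, bigramDT and trigramDT (with word columns w1, w2, w3 and a count column freq) that will only be built in a later step:
predictNextWord <- function(phrase) {
  # keep the last two words of the (lower-cased) input phrase
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  # 1) try the trigram table with the last two words
  if (length(words) == 2) {
    hit <- trigramDT[w1 == words[1] & w2 == words[2]][order(-freq)]
    if (nrow(hit) > 0) return(hit$w3[1])
  }
  # 2) back off to the bigram table with the last word
  hit <- bigramDT[w1 == tail(words, 1)][order(-freq)]
  if (nrow(hit) > 0) return(hit$w2[1])
  # 3) fall back to the most frequent unigram
  unigramDT[order(-freq)]$w1[1]
}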
To obtain a good and working app, I will try to make the model small and efficient. It will be evaluated on two different aspects: the amount of RAM required and the runtime.
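As a rough illustration, and assuming the hypothetical predictNextWord() function and n-gram tables sketched above, these two aspects could be measured as follows:
format(object.size(list(unigramDT, bigramDT, trigramDT)), units = "Mb") # RAM footprint of the model tables
system.time(predictNextWord("thank you for the"))                       # runtime of a single prediction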