The code here gets, cleans, explores and analyzes data for the Capstone Project of the Coursera Data Science Specialization. In this Capstone Project we apply data science to natural language processing, in collaboration with SwiftKey.
The original data are available online (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). The dataset was collected from publicly available sources by a web crawler. We will use it to build a Shiny app that predicts the next word when a user enters a few words.
This file presents the first steps: getting, cleaning and exploring the data.
if (!require("knitr")) {
install.packages("knitr")}
if (!require("R.utils")) {
install.packages("R.utils")}
if (!require("stringr")) {
install.packages("stringr")}
if (!require("data.table")) {
install.packages("data.table")}
if (!require("doParallel")) {
install.packages("doParallel")}
if (!require("stringr")) {
install.packages("stringr")}
if (!require("tm")) {
install.packages("tm")}
if (!require("wordcloud")) {
install.packages("wordcloud")}
if (!require("SnowballC")) {
install.packages("SnowballC")}
if (!require("RWeka")) {
install.packages("RWeka")}
if (!require("ggplot2")) {
install.packages("ggplot2")}
if (!require("quanteda")) {
install.packages("quanteda")}
library(knitr)
library(R.utils)
library(stringr)
library(data.table)
library(doParallel) # to use more than one processor core
library(tm) # For Text Mining
library(wordcloud) # To prepare a word cloud
library(SnowballC)
library(RWeka)
library(ggplot2)
library(quanteda)
set.seed(12345)
if(!file.exists("./OriginalData")) {
dir.create("./OriginalData")
message("Folder OriginalData is missing, creating it in ", getwd())
}else {
message("Folder already exist in: ", getwd())}
## Folder already exist in: C:/Users/jbass/Documents/CapstoneProject
if(!file.exists("./OriginalData/Coursera-SwiftKey.zip")) {
message("File is missing, downloading it in ", getwd(),"/OriginalData/")
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl, destfile="./OriginalData/Coursera-SwiftKey.zip")
}else {
message("File already downloaded in: ", getwd(), "/OriginalData/")}
## File already downloaded in: C:/Users/jbass/Documents/CapstoneProject/OriginalData/
if(!file.exists("./OriginalData/final")){
message("Files are missing, unzipping original zip in ", getwd(),"/OriginalData/")
unzip(zipfile="./OriginalData/Coursera-SwiftKey.zip",exdir="./OriginalData")
}else {
message("Files already unzipped in: ", getwd(), "/OriginalData/")}
## Files already unzipped in: C:/Users/jbass/Documents/CapstoneProject/OriginalData/
list.files("./OriginalData/final", recursive = TRUE)
## [1] "de_DE/de_DE.blogs.txt" "de_DE/de_DE.news.txt"
## [3] "de_DE/de_DE.twitter.txt" "en_US/en_US.blogs.txt"
## [5] "en_US/en_US.news.txt" "en_US/en_US.twitter.txt"
## [7] "fi_FI/fi_FI.blogs.txt" "fi_FI/fi_FI.news.txt"
## [9] "fi_FI/fi_FI.twitter.txt" "ru_RU/ru_RU.blogs.txt"
## [11] "ru_RU/ru_RU.news.txt" "ru_RU/ru_RU.twitter.txt"
For the next steps, I will focus only on the documents in English.
Each English document is loaded into R to collect some basic information about the files; for example, the number of lines and the number of words are determined here.
message ("Loading English Documents")
blogEN<- file("./OriginalData/final/en_US/en_US.blogs.txt", open="rb")
blogData<- readLines(blogEN, encoding="latin1")
close(blogEN)
newsEN<- file("./OriginalData/final/en_US/en_US.news.txt", open="rb")
newsData<- readLines(newsEN, encoding="latin1")
close(newsEN)
twittsEN<- file("./OriginalData/final/en_US/en_US.twitter.txt", open="rb")
twittsData<- readLines(twittsEN, encoding="latin1")
close(twittsEN)
message (" Retrieving file sizes")
blogSize <- file.info("./OriginalData/final/en_US/en_US.blogs.txt")$size / 1024^2
newsSize <- file.info("./OriginalData/final/en_US/en_US.news.txt")$size / 1024^2
twitterSize <- file.info("./OriginalData/final/en_US/en_US.twitter.txt")$size / 1024^2
message ("Counting number of words and lines in each file")
blogWordCnt<- sum((nchar(blogData) - nchar(gsub(' ','',blogData))) + 1)
blogLinesCnt<-NROW(blogData)
newsWordCnt<- sum((nchar(newsData) - nchar(gsub(' ','',newsData))) + 1)
newsLinesCnt<-NROW(newsData)
twittWordCnt<- sum((nchar(twittsData) - nchar(gsub(' ','',twittsData))) + 1)
twittLinesCnt<-NROW(twittsData)
message ("Measuring the length of the longest line in each file")
blogMaxLine <- max(nchar(blogData))
newsMaxLine <- max(nchar(newsData))
twittMaxLine <- max(nchar(twittsData))
## Making a summary table of basic information
size <- c(blogSize, newsSize, twitterSize)
WordCnt <- c(blogWordCnt, newsWordCnt, twittWordCnt)
LinesCnt <- c(blogLinesCnt, newsLinesCnt, twittLinesCnt)
MaxLine <- c(blogMaxLine, newsMaxLine, twittMaxLine)
dset_names <- c("blogs", "news", "twitter")
stat <- data.frame(dset_names, size, WordCnt, LinesCnt, MaxLine)
colnames(stat) <- c("Dataset names", "File Size (Mb)", "# of words", "# of lines", "Max line length")
kable(stat, format="markdown", caption = "Basic statistics of the datasets")
| Dataset names | File Size (Mb) | # of words | # of lines | Max line length |
|---|---|---|---|---|
| blogs | 200.4242 | 37334131 | 899288 | 40835 |
| news | 196.2775 | 34372530 | 1010242 | 11384 |
| twitter | 159.3641 | 30373545 | 2360148 | 213 |
The files are rather large, so sampling will be applied to reduce RAM usage and computing time.
It is necessary to pre-process the data so that only the most pertinent and accurate data are used to build the predictive model. First, a subset of 0.5% of the original data will be used. Second, regular expressions will be used to remove non-ASCII text, numbers and extra whitespace, and to convert the text to lower case. Third, profanity words will be removed.
I will train the model on a small part of the original dataset to speed up computing.
## Use the line counts to randomly sample 0.5% of each file, and combine the resulting samples.
## (I tried larger subsets, 1% and 10%, but ran into memory-related error messages.)
message ("Sampling 0.5% of each file, and combining the resulting samples")
## Sampling 0.5% of each file, and combining the resulting samples
blogSamp <- sample(blogData,blogLinesCnt*0.005)
newsSamp <- sample(newsData,newsLinesCnt * 0.005)
twittsSamp <- sample(twittsData,twittLinesCnt * 0.005)
combSamp <- c(blogSamp, newsSamp, twittsSamp)
lengthCombSamp <- length(combSamp) # number of lines in the combined sample
writeLines(combSamp, "./combSamp.txt") # to save this new dataset
message ("Saving sampled dataset in ", getwd(), "/combSamp.txt")
## Saving sampled dataset in C:/Users/jbass/Documents/CapstoneProject/combSamp.txt
The sampled dataset has 21347 lines, combining 0.5% of each of the original datasets.
The final text data need to be cleaned before they can be used in the word prediction algorithm. Here, I create a cleaned Corpus. This Corpus is cleaned by removing whitespace, numbers, URLs, punctuation and so on, using the tm package. The profanity word list used for filtering comes from Luis von Ahn’s research group (http://www.cs.cmu.edu/~biglou/resources/) and aggregates 1,300+ words that can be considered profanity.
Load the profanity word list (for later use)
if(!file.exists("./bad-words.txt")) {
message("Downloading profanity words list in ", getwd())
fileUrl2 <- "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
download.file(fileUrl2, destfile="./bad-words.txt")
}else {
message("File already downloaded in: ", getwd())}
## File already downloaded in: C:/Users/jbass/Documents/CapstoneProject
# convert to "ASCII" encoding to get rid of few weird characters
combSamp <- iconv(combSamp, to="ASCII", sub="")
## Use the tm package to convert the text dataset into a Corpus, a structured set of texts used for statistical analysis
textCorpus <- Corpus(VectorSource(combSamp))
message ("Corpus prepared")
## Corpus prepared
## Using the TM Package to clean the text
textCorpus <- tm_map(textCorpus, content_transformer(function(x) iconv(x, to="UTF-8", sub="byte"))) # getting rid of other weird characters
textCorpus <- tm_map(textCorpus, content_transformer(tolower)) # converting to lowercase
# Removing 1500+ acronyms, Text Messaging and Chat Abbreviations (compiled from different internet sources) (code not shown to keep the report concise)
message ("Cleaning the Corpus")
## Cleaning the Corpus
## Removing URLs
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
removeURL2 <- function(x) gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
removeURL3 <- function(x) gsub("www\\.", "", x)
removeURL4 <- function(x) gsub("(\\.com|\\.org|\\.edu|\\.net)", "", x)
textCorpus <- tm_map(textCorpus, content_transformer(removeURL))
textCorpus <- tm_map(textCorpus, content_transformer(removeURL2))
textCorpus <- tm_map(textCorpus, content_transformer(removeURL3))
textCorpus <- tm_map(textCorpus, content_transformer(removeURL4))
# Removing email addresses
email <- function(x) gsub("\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b", "", x)
textCorpus <- tm_map(textCorpus, content_transformer(email))
# Removing Twitter tags (retweet "RT" and "via" markers)
tweettag <- function(x) gsub("\\bRT\\b|\\bvia\\b", "", x, ignore.case = TRUE)
textCorpus <- tm_map(textCorpus, content_transformer(tweettag))
# Removing Twitter usernames
Tweetname <- function(x) gsub("@[A-Za-z0-9_]{1,15}", "", x)
textCorpus <- tm_map(textCorpus, content_transformer(Tweetname))
# Removing hahaha variants
#removeAhah <- function(x) gsub("(ha)+", "", x)
#textCorpus <- tm_map(textCorpus, content_transformer(removeAhah))
# Removing Profanity Words with TM Package
profanityWords <- readLines('./bad-words.txt')
textCorpus <- tm_map(textCorpus,removeWords, profanityWords)
# textCorpus <- tm_map(textCorpus, removeWords, stopwords("english")) # removing stop words in English (a, as, at, so, and, etc.), helps to reduce memory usage, and related problems
textCorpus <- tm_map(textCorpus, content_transformer(removePunctuation), preserve_intra_word_dashes=TRUE) # removing punctuation
textCorpus <- tm_map(textCorpus, content_transformer(removeNumbers)) # removing numbers with TM Package
#textCorpus <- tm_map(textCorpus, stemDocument) # stemming the document (removing suffixes and prefixes) with the TM Package; whether this helps the prediction algorithm remains to be tested
textCorpus <- tm_map(textCorpus, stripWhitespace) # Stripping unnecessary whitespace from document with TM Package
## Showing a few lines of the cleaned Corpus
for (i in 1:5){
print(textCorpus[[i]]$content)
}
## [1] "and now home older wiser a little slimmer and hopefully secure in the knowledge of what good and what i want to do with my time"
## [1] "i turned the today show on to catch up on the morning news and immediately i knew something terrible had happened in new york city"
## [1] "ill take this opportunity to diverge from the usual take three path and instead of focusing on one last role offer up an arkin remix a concisely-potted overview arkin has long been seen as one of the exemplary supporting actors so many of his roles before his resurgence in popularity during the s and s to present day were memorable its hard to single one last role out he added charm and a studious commitment to characterising a range of films from his debut thats me in onwards"
## [1] "my eyes started burning i had to close them or risk crying my brain out onto my open palms i was shaking i could still hear tommy out there arguing with the other staff members he was telling them i was fine that this was all an act my hands closed into painfully tight fists i beat them against the floor"
## [1] "every man is said to have his peculiar ambition abraham lincoln"
##Convert Corpus to plain text document with TM Package
##textCorpus <- tm_map(textCorpus, PlainTextDocument)
## Saving the final corpus
saveRDS(textCorpus, file = "./Corpus.RData")
message ("The Corpus has been cleaned and saved")
## The Corpus has been cleaned and saved
Now that I have a cleaned Corpus, I will quickly explore its content: number of words, most frequent words and file size. NB: the file size will be critical for the next steps and has to be kept small enough to ensure quick computing and low RAM usage.
# calculating number of words and file size
corpusSize <- file.info("./Corpus.RData")$size / 1024^2
CorpusWordCnt<- sum((nchar(textCorpus) - nchar(gsub(' ','',textCorpus))) + 1)
The Corpus dataset is about 1.08 MB and contains roughly 4.93 × 10^5 (about 493,000) words.
## Create a word cloud of the data
TDM <- TermDocumentMatrix(textCorpus)
wcloud <- as.matrix(TDM)
v <- sort(rowSums(wcloud),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
wordcloud(d$word, d$freq,
scale=c(5,.3), min.freq=50,
random.order=FALSE,
colors=brewer.pal(8, "Dark2"))
# Create a bar chart of the 30 most frequent words.
mtextCorpusOrdered <- sort(rowSums(wcloud), decreasing = TRUE)
barplot(mtextCorpusOrdered[1:30],
ylab='frequency',
main='top 30 most frequent words',
col="red", las=2, cex.names=.7)
The next steps of this capstone project are to finalize the predictive algorithm and to deploy it as a Shiny app. The user interface of the Shiny app will consist of a text input box that allows a user to enter a phrase. The app will then use the predictive algorithm to suggest the most likely next word after a short delay.
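As a rough illustration of the planned interface only (not the final app), a minimal Shiny skeleton could look like the sketch below; predictNextWord() is a hypothetical placeholder for the prediction function still to be built, and the shiny package would have to be installed.
library(shiny)
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:", value = ""),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    predictNextWord(input$phrase)  # hypothetical prediction function, to be implemented
  })
}
# shinyApp(ui = ui, server = server)  # would launch the app once predictNextWord() exists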
The predictive algorithm will be based on an n-gram model with frequency lookup, similar to the exploratory analysis above. The next step of this Capstone Project will be to create n-grams and to build figures and tables to understand the variation in word and n-gram frequencies in the data.
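As a rough sketch of how these n-gram frequencies could be computed (one possible approach using the quanteda package loaded above; the exact tokenization settings here are only illustrative):
# Convert the cleaned Corpus back to a plain character vector
sampTxt <- unlist(lapply(textCorpus, as.character))
toks <- tokens(sampTxt, what = "word", remove_punct = TRUE, remove_numbers = TRUE)
bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
trigrams <- tokens_ngrams(toks, n = 3, concatenator = " ")
topfeatures(dfm(bigrams), 10)  # 10 most frequent bigrams
topfeatures(dfm(trigrams), 10) # 10 most frequent trigrams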
According to this quick data exploration, the most frequent words are “stop words” (the, as, and, but, ...); I might have to remove them to improve the prediction model (this has to be tested).
For an efficient predictive model, higher-order n-grams (trigrams or higher) are the first priority when looking up the predicted word, followed by smaller n-grams such as bigrams and unigrams. This means that if no matching trigram can be found, the algorithm backs off to the bigram model, and then to the unigram model if needed. To have a good prediction model and a working Shiny app, I will have to build a model that handles unseen words and n-grams: in some cases people will type a combination of words that does not appear in the Corpus, and the model has to be able to handle that.
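A very rough sketch of what this back-off lookup could look like, assuming hypothetical data.table frequency tables unigramDT, bigramDT and trigramDT (with word columns w1, w2, w3 and a count column freq) that will only be built in a later step:
predictNextWord <- function(phrase) {
  # keep the last two words of the (lower-cased) input phrase
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  # 1) try the trigram table with the last two words
  if (length(words) == 2) {
    hit <- trigramDT[w1 == words[1] & w2 == words[2]][order(-freq)]
    if (nrow(hit) > 0) return(hit$w3[1])
  }
  # 2) back off to the bigram table with the last word
  hit <- bigramDT[w1 == tail(words, 1)][order(-freq)]
  if (nrow(hit) > 0) return(hit$w2[1])
  # 3) fall back to the most frequent unigram
  unigramDT[order(-freq)]$w1[1]
}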
To obtain a good and working app, I will try to make the model small and efficient. It will be evaluated on two different aspects: the amount of RAM required and the runtime.
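As a rough illustration, and assuming the hypothetical predictNextWord() function and n-gram tables sketched above, these two aspects could be measured as follows:
format(object.size(list(unigramDT, bigramDT, trigramDT)), units = "Mb") # RAM footprint of the model tables
system.time(predictNextWord("thank you for the"))                       # runtime of a single prediction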