This is the Milestone Report for the Coursera Data Science Specialization offered by Johns Hopkins University. The Capstone project is done in collaboration with SwiftKey, and the goal of this project is to design a Shiny application, written in R, with text prediction capabilities. This report outlines the exploratory analysis of the dataset and the plans for implementing the text prediction algorithm.
The dataset comprises the output of crawls of news sites, blogs and Twitter, downloaded from Datasets. It contains three files for each of four languages (Russian, Finnish, German and English); this project will focus on the English language files. The names of the data files are as follows:

- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

The datasets will be referred to as “Blogs”, “Twitter” and “News” for the remainder of this report.
if(!file.exists("Coursera-SwiftKey.zip")){
    # Download the dataset
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  "Coursera-SwiftKey.zip")
    Download_Date <- Sys.time()
    Download_Date
    # "2016-03-15 13:30:48 GMT"

    # Unzip the dataset
    unzip("Coursera-SwiftKey.zip")
} else {
    print("Dataset is already downloaded!")
}
There is a minor problem with the News dataset: it contains an unusual character on line 77,259. To address this issue, a small piece of code was written to remove the character before processing the dataset.
if(!file.exists("./en_US/en_US.news_edit.txt")){
    con <- file("./en_US/en_US.news.txt", "rb")
    News_Data <- readLines(con)
    close(con)

    # Remove the odd symbol (SUB, \032) on line 77259
    News_Data <- gsub("\032", "", News_Data, ignore.case=FALSE, perl=TRUE)

    # Write the cleaned text to a new file, then close the connection
    out_con <- file("./en_US/en_US.news_edit.txt", "w")
    writeLines(News_Data, out_con)
    close(out_con)

    # Move the original out of ./en_US so only the edited file is picked up later
    file.rename("./en_US/en_US.news.txt", "./en_US.news.txt")
} else {
    con <- file("./en_US/en_US.news_edit.txt", "rb")
    News_Data <- readLines(con)
    close(con)
}
#Load libraries
library(NLP)
## Warning: package 'NLP' was built under R version 3.2.3
library(tm)
## Warning: package 'tm' was built under R version 3.2.3
library(stringi)
## Warning: package 'stringi' was built under R version 3.2.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.2.4
library(data.table)
## Warning: package 'data.table' was built under R version 3.2.4
#Generate Corpus for text analysis
cname <- file.path(".","en_US")
docs <- Corpus(DirSource(cname))
The first part of this exploratory analysis is to determine the basic characteristics of each dataset. These characteristics are shown in the table below.
| Dataset | File Size (bytes) | Number of Lines | Smallest entry | Largest entry |
|---|---|---|---|---|
| Blogs | 210160014 | 899288 | 1 | 40835 |
| Twitter | 167105338 | 2360148 | 2 | 213 |
| News | NA | 77259 | 2 | 5760 |
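The code used to produce these figures is not reproduced in this report. Below is a minimal sketch of how such a summary might be computed with the stringi package loaded above; the summarize_file helper and the use of stri_length to measure entry lengths are illustrative assumptions, not the exact code behind the table.

# Sketch: summarize one data file (file size, line count, shortest/longest entry)
summarize_file <- function(path){
    lines <- readLines(path, skipNul=TRUE, warn=FALSE)
    lens <- stri_length(lines)
    data.frame(File_Size=file.info(path)$size,
               Lines=length(lines),
               Smallest=min(lens),
               Largest=max(lens))
}
summarize_file("./en_US/en_US.blogs.txt")
summarize_file("./en_US/en_US.twitter.txt")
summarize_file("./en_US/en_US.news_edit.txt")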
Each of the datasets (Blogs, Twitter and News) is large enough that processing time becomes a concern. To address this, a representative random sample of each dataset was taken for the remainder of this analysis. The characteristics of each subset are outlined in the table below.
#Limit Dataset to a random subset of 20% of the data
set.seed(1337)
Subset <- docs
Subset[[1]]$content <- Subset[[1]]$content[as.logical(rbinom(length(Subset[[1]]$content),
1, prob=0.2))]
Subset[[2]]$content <- Subset[[2]]$content[as.logical(rbinom(length(Subset[[2]]$content),
1, prob=0.2))]
Subset[[3]]$content <- Subset[[3]]$content[as.logical(rbinom(length(Subset[[3]]$content),
1, prob=0.2))]
| Dataset | File Size (bytes) | Number of Lines | Smallest entry | Largest entry |
|---|---|---|---|---|
| Subset Blogs | 52658576 | 179180 | 1 | 12421 |
| Subset Twitter | 63762584 | 471986 | 2 | 213 |
| Subset News | 4035936 | 15495 | 3 | 1346 |
Before the subsetted data can be fully analyzed, it needs to be pre-processed to standardize the words and characters in each dataset. An example entry from the Blogs dataset is shown below:
## [1] "Point 2: If itâ<U+0080><U+0099>s a show that your kid wants, a show about a book is always better than CRACCC."
#Plot Word Frequencies
ggplot(wf[wf$freq>60000, ], aes(x=word, y=freq)) +
geom_bar(stat="identity") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
xlab("") +
ylab("Frequency") +
ggtitle("Words that appear over 60,000\ntimes in the three Datasets")
The high frequency of “connecting” words such as “the”, “and” and “that” suggests that word frequency alone will not be sufficient for text prediction. The next analysis looks at common word combinations.
For brevity, the N-gram analysis in this report was limited to 2-grams.
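The tokenization code that produces WF_Ngram for the plot below is not shown. A sketch of how it might be built with RWeka's NGramTokenizer (loaded above) follows; the control settings and object names are assumptions for illustration.

# Sketch: tokenize the sampled corpus into 2-grams and tally their frequencies
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
tdm_bigram <- TermDocumentMatrix(Subset, control=list(tokenize=BigramTokenizer))
freq_bigram <- sort(slam::row_sums(tdm_bigram), decreasing=TRUE)
WF_Ngram <- data.frame(word=names(freq_bigram), freq=freq_bigram)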
#Plot Word Frequencies
ggplot(WF_Ngram[WF_Ngram$freq>16000, ], aes(x=word, y=freq)) +
geom_bar(stat="identity") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
xlab("") +
ylab("Frequency") +
ggtitle("Two Word Combinations (2-grams)that appear\n over 16,000 times in the three Datasets")
The distribution of 2-grams gives an idea of the prevalence of prepositions in natural language. The text prediction model will have to take this into account.
The plan is to develop a text prediction application (a Shiny app written in R) that uses the frequencies of 4-grams, 3-grams and 2-grams to estimate the most likely word to follow the entered text. The main challenge will be offering valid predictions when an entered N-gram has not been observed in the dataset. In these cases the algorithm will likely fall back to a list of “non-common” words (i.e. factoring out words like “the”, “and” and “that”) and estimate the best possible choice.
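As a rough illustration of the back-off idea described above, the sketch below looks up the longest available context in pre-computed N-gram frequency tables and falls back to shorter contexts when no match is found. The tables (quadgrams, trigrams, bigrams, each assumed to have columns prefix, word and freq) and the predict_next helper are hypothetical placeholders, not the final application code.

# Sketch: frequency-based back-off lookup over pre-computed N-gram tables
predict_next <- function(input, quadgrams, trigrams, bigrams, n=3){
    words <- tolower(unlist(strsplit(input, "\\s+")))

    # Return the n most frequent completions for a given context string
    lookup <- function(dt, ctx){
        hits <- dt[dt$prefix == ctx, ]
        head(hits[order(-hits$freq), ]$word, n)
    }

    # Try the longest available context first, then back off to shorter N-grams
    if(length(words) >= 3){
        res <- lookup(quadgrams, paste(tail(words, 3), collapse=" "))
        if(length(res) > 0) return(res)
    }
    if(length(words) >= 2){
        res <- lookup(trigrams, paste(tail(words, 2), collapse=" "))
        if(length(res) > 0) return(res)
    }
    lookup(bigrams, tail(words, 1))
}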
This analysis was performed with the following R session and platform configuration:
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_Malaysia.1252 LC_CTYPE=English_Malaysia.1252
## [3] LC_MONETARY=English_Malaysia.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Malaysia.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.9.6 RWeka_0.4-24 ggplot2_2.1.0 stringi_1.0-1
## [5] tm_0.6-2 NLP_0.1-9
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.0 knitr_1.11 magrittr_1.5
## [4] RWekajars_3.7.12-1 munsell_0.4.2 colorspace_1.2-6
## [7] stringr_1.0.0 plyr_1.8.3 tools_3.2.0
## [10] parallel_3.2.0 grid_3.2.0 gtable_0.1.2
## [13] htmltools_0.3 yaml_2.1.13 digest_0.6.9
## [16] rJava_0.9-8 formatR_1.2 evaluate_0.8
## [19] slam_0.1-32 rmarkdown_0.6.1 labeling_0.3
## [22] scales_0.3.0 chron_2.3-47
1: SwiftKey - http://swiftkey.com/en/
2: Datasets - https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip